Digitizing Paper Books to Ebooks

From MobileRead

This page focuses on digitizing whole books because that is the hardest task. Magazines that open flat, or single sheets, are easier to work with, so some of the tips here won't be needed for them.

This page uses the "image and OCR" approach. You can also "image and view" - keep the page images as images and view them that way. Most ebook readers deal poorly with images, even if you resize them to suit the display of your reader, and the total file size will be enormous (300 pages at 100kB per page is 30MB, when the actual text will probably be 100kB). Alternatively you can "read and type" - type the book off the page into a computer. Companies in low-wage countries specialize in this latter service, but it is both slow and error-prone.
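To make the size comparison concrete, here is a small sketch of the arithmetic using the figures quoted above. The function name and the choice of 1 MB = 1000 kB are assumptions for illustration, and the per-page and text sizes are the rough estimates from the paragraph, not measurements.

```python
# Rough size comparison using the estimates quoted in the text above.
PAGES = 300
IMAGE_KB_PER_PAGE = 100   # per-page image size (estimate from the text)
TEXT_KB_TOTAL = 100       # whole book as plain text (estimate from the text)

def image_book_size_mb(pages: int, kb_per_page: float) -> float:
    """Total size of an image-only book in MB, taking 1 MB as 1000 kB."""
    return pages * kb_per_page / 1000

size_mb = image_book_size_mb(PAGES, IMAGE_KB_PER_PAGE)
ratio = (PAGES * IMAGE_KB_PER_PAGE) / TEXT_KB_TOTAL
print(f"Image book: {size_mb:.0f} MB, about {ratio:.0f}x the text size")
```

Plug in your own page count and the real size of one of your page images to see what "image and view" would cost for a particular book.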

Source threads for this page

on scanning http://www.mobileread.com/forums/showthread.php?t=7724

do-it yourself repro v-cradle for paper books http://www.mobileread.com/forums/showthread.php?t=13848

Tool to split up double pages http://www.mobileread.com/forums/showthread.php?t=14340

Scanning two pages http://www.mobileread.com/forums/showthread.php?t=14394

how to digitize books http://www.mobileread.com/forums/showthread.php?t=14475

Scanning paper (out of copyright) books. http://www.mobileread.com/forums/showthread.php?t=6792

Conversion techniques. http://www.mobileread.com/forums/showthread.php?t=9666

Capturing the Images

If you already have digital images (from a screenshot, PDF or similar) skip this section.

Using a scanner

Scanners have many advantages: they produce clean, high-resolution images with no extra work on your part, and they're cheap and easy to set up.

You will get best results scanning at the native (optical) resolution of the scanner. Save to JPEG if you can set the compression so that you can't see softening or compression artifacts; otherwise use TIFF or bitmap. Try both and see which one works best for you.

[Image: Scanningdifficulty.jpg]

If you want to scan books, there are models that scan very close to one edge of the scanner, like the Plustek OpticBook 3600 and Mustec Scanmaker s280. These will save you much of the post-processing work during the OCR stage that is caused by the kind of distorted image shown in Scanningdifficulty.jpg.

The main problem is that scanning is very slow unless you spend a lot of money. Scanners may also damage books when you force them wide open to scan. A cheap USB A4/letter scanner might take 30s or more to scan a single side of one page. A scanner with an Automatic Document Feeder (ADF) is likely to be more than twice the price of one without, and one with a 50-sheet tray and 20-page-per-minute throughput will probably cost ten times what a simple flatbed one does.
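A quick way to see what those speeds mean for a whole book is to work out the total scan time. The sketch below uses the 30 seconds per side and 20 pages per minute quoted above; the function names are made up for illustration, one "page" is assumed to be one scanned side, and the time spent turning pages or loading the feeder is ignored, so real times will be longer.

```python
def flatbed_minutes(pages: int, seconds_per_page: float = 30.0) -> float:
    """Pure scan time on a flatbed, ignoring page turning and repositioning."""
    return pages * seconds_per_page / 60

def adf_minutes(pages: int, pages_per_minute: float = 20.0) -> float:
    """Pure scan time through a document feeder, ignoring tray reloads."""
    return pages / pages_per_minute

book_pages = 300  # same example size as used earlier on this page
print(f"Flatbed: {flatbed_minutes(book_pages):.0f} minutes")
print(f"ADF:     {adf_minutes(book_pages):.0f} minutes")
```

Keep in mind that a feeder takes loose sheets, so the faster figure only applies if you are prepared to take the book apart.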

Another solution could be a pen scanner like the Planon DocuPen. This pen is full page width and can be slid down the page in 4 to 8 seconds. It scans in color at up to 400 dpi. It can be used by itself while on the road or in the library and then docked to your computer later. It supports micro SD memory cards for additional capacity.

Using a camera

Digital cameras are a very fast, flexible way to capture page images. With a little work it's possible to make a stand that will give consistently good results, using nothing more than a tripod, a cardboard box and a lamp. More work gives better and faster results, and at the high end dedicated book scanners largely automate the entire process (they also cost a lot). The downside is that setup is quite fiddly, image quality can be variable and the resolution will be lower than a scanner's. But this approach is the one that works for high-volume image capture without damaging your books.

Tips:

  • use +1 exposure compensation. Cameras expose to make the image overall a medium grey, but most pages are white so the camera will underexpose.
  • use a tripod. The tripod makes it easy to keep the camera correctly positioned and allows you to use a slower shutter speed without shake/blur.
  • buy a small sheet of glass. Glass is normally quite cheap even when cut to size, less than the cost of a cheap book. The sheet of glass will hold your page flat for the photo, making everything else much easier.
  • photograph all the pages on one side of the book, then do the other side. Rotating the book every time is slow unless you have a spinning book stand.
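If you photograph one side of the book and then the other, the two batches of images have to be merged back into reading order before OCR. The sketch below shows one way to do that. It assumes both passes were shot front to back, that the first pass starts with the first page of the book, and that the file names sort in shooting order; the file names and the function name are hypothetical.

```python
from itertools import zip_longest

def interleave_passes(first_pass, second_pass):
    """Merge two shooting passes into reading order: a1, b1, a2, b2, ..."""
    merged = []
    for a, b in zip_longest(first_pass, second_pass):
        # zip_longest pads the shorter pass with None; skip the padding.
        if a is not None:
            merged.append(a)
        if b is not None:
            merged.append(b)
    return merged

# Hypothetical file names: right-hand pages shot first, then left-hand pages.
right_pages = ["right_001.jpg", "right_002.jpg", "right_003.jpg"]
left_pages = ["left_001.jpg", "left_002.jpg"]

for number, name in enumerate(interleave_passes(right_pages, left_pages), start=1):
    print(f"page_{number:03d} <- {name}")
```

If you shot the second pass back to front instead, reverse that list before merging. Check the first few and last few pages by eye either way, since one missed or doubled shot shifts everything after it.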

A simple setup could be a V shaped stand made from a few pieces of wood. A cookbook stand may even work so long as the book can remain open.

A more complex setup will block out much of the light from the room, hold the sheet of glass in place better and remotely trigger the camera so you don't wobble it when taking photos. This speeds up both imaging each page and OCR'ing the resulting image.

You can add more to these setups to further improve the process, and if you're planning on scanning a lot of books the time spent will be amply repaid. But start simple - just put a tripod on a table and see what happens.

For more information

OCR software

OCR stands for Optical Character Recognition. (Please check the OCR article for all things OCR)

Proofreading Tips

Yes, you will need to proofread the resultant document. Running a spell checker first will help find some common OCR errors.
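Alongside a spell checker, a simple pattern search can flag one common kind of OCR error: digits read in place of letters (such as "0" for "o" or "1" for "l"). The sketch below is only an illustration of that idea, not a replacement for proofreading. The function name is made up, and it will also flag legitimate tokens such as "A4" or "2nd", while missing errors that still look like words.

```python
import re

# A token containing both a letter and a digit (e.g. "c0uld", "te11")
# is rarely a real word, so it is worth a second look.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

def suspicious_tokens(text):
    """Return (line_number, token) pairs for tokens mixing letters and digits."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for token in MIXED.findall(line):
            hits.append((line_number, token))
    return hits

sample = "He c0uld not\nte11 the time in 1984."
for line_number, token in suspicious_tokens(sample):
    print(f"line {line_number}: {token}")
```

Pure numbers such as "1984" are left alone because they contain no letters.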

Keep a copy of the original images! Most OCR software processes the image to make it easier for the software to recognize, but sometimes this does not work and the OCR image is hard for you to read too. Proofreading in the OCR software is necessary but not enough. The first time you read through the text you will find many more errors and correcting those will often require looking at the original image (or the original book).

Output

What Format and Software to Use

HowTo: Create an eBook is a tutorial on the task.

See eBook conversion for lots of conversion software

How to Lay it out

For books I favor plain text, then formatting it with the bare minimum of HTML for headings, a table of contents and so on. When I convert it for my e-reader I select a font, size and so on.
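As an illustration of the "plain text plus minimal HTML" approach, here is a small sketch that turns proofread text into bare HTML with headings and a linked table of contents. The conventions are assumptions made for this example, not anything this page prescribes: chapter headings are lines starting with "# ", paragraphs are separated by blank lines, and the function name and anchor ids are made up.

```python
import html

def text_to_html(text, title="Untitled"):
    """Convert plain text to minimal HTML with a linked table of contents."""
    blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
    toc, body = [], []
    for block in blocks:
        if block.startswith("# "):
            # Assumed convention: a block starting with "# " is a chapter heading.
            heading = html.escape(block[2:].strip())
            anchor = f"ch{len(toc) + 1}"
            toc.append(f'<li><a href="#{anchor}">{heading}</a></li>')
            body.append(f'<h2 id="{anchor}">{heading}</h2>')
        else:
            # Join hard-wrapped lines so the reader can reflow the paragraph.
            paragraph = html.escape(" ".join(block.split()))
            body.append(f"<p>{paragraph}</p>")
    parts = [
        f"<html><head><title>{html.escape(title)}</title></head><body>",
        f"<h1>{html.escape(title)}</h1>",
        "<ul>", *toc, "</ul>",
        *body,
        "</body></html>",
    ]
    return "\n".join(parts)

sample = "# One\n\nHello &\nworld.\n\n# Two\n\nBye."
print(text_to_html(sample, title="Sample Book"))
```

No fonts or sizes are set in the HTML, which leaves those choices to the conversion step or the reader itself.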

Arranging Pages and Sections When Producing the Ebook

Much discussion to be had about this.

Getting someone else to do it

  • KirtasBooks will obtain a book from multiple libraries and produce a PDF for a fee.
  • Some commercial copy centers can now produce a PDF.
  • Some high end copy machines have the ability to produce PDF files as an output choice.
  • http://bookscan.us/ will scan your book (including color) and produce files in PDF or Word. The original will be destroyed but the price is very reasonable (as low as $1 per book). Options include e-reader formats and audiobook creation. Samples of all format types are available on the site for testing. See the site for complete details.
  • Precision eBooks does conversions of paper books to Kindle or ePub eBooks with reflowable text, linked table of contents, linked footnotes, etc. Proofreading is available as an option.
  • See eBook Conversion Services for companies in this business.