OCR

From MobileRead

OCR stands for Optical Character Recognition. OCR software interprets a scanned page, or any other graphics image of a page, and converts the text it contains into editable characters.

Overview

OCR software converts an image of text into a regular text file. Some formatting, such as bold and italicized words, can usually be recognized as well, and some programs can even handle font sizes. The technique for using a scanner to create an eBook is covered in the article Digitizing Paper Books to Ebooks.

Some software works with images stored in PDF files, but all of it really works on the graphics images themselves, typically TIFF. This means it can also be used with pictures you take with a camera, which is the second way to digitize a book. The quality of the input image is often the key to how good the OCR results will be. For this reason a careful scan is important, although the image can often be improved or cleaned up afterwards with Scan Tailor.

OCR software can often talk directly to the scanner or digital camera using a protocol called TWAIN, so that scanning and OCR are integrated into a single application.

OCR software

  • A free choice is tesseract-ocr. It is considered the best free OCR engine and is documented in a wiki on its web site. It is supported by a high-level analyzer called OCRopus, an outcome of AI research that enhances character recognition with line analysis, language analysis, layout analysis, etc. OCRopus uses a plug-in approach in which the individual analysis packages can be customized or swapped for different tools.
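
As a rough illustration of driving tesseract-ocr from a script, the sketch below builds the basic command line (image, output base name, language) and runs it if the program is installed. The file names, the default language code "eng", and the helper names build_tesseract_command and ocr_image are assumptions made for this example; check the tesseract documentation for the options your version supports.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognized text.

    Raises RuntimeError if the tesseract executable is not on PATH.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang)
    subprocess.run(command, check=True)
    # For plain-text output tesseract appends ".txt" to the output base name.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical file name; replace it with one of your own page scans.
    print(build_tesseract_command("page_001.tif", "page_001"))
```

Calling ocr_image("page_001.tif", "page_001") would then write page_001.txt next to the script and return its contents, provided the image exists and tesseract is installed.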

Tips on software

  • Abbyy FineReader Pro has a good recognition rate, speed and usability, but is a bit expensive. Older versions are much cheaper, however, and can even be faster: v.8, for example, is faster than v.9.
  • The home versions of Abbyy and Nuance are reasonably priced if they have the features you need.

Correcting the output

  • Yes, you will need to proofread the resulting document. Running a spell checker first will help find some common OCR errors.
  • Keep a copy of the original images! Most OCR software preprocesses the image to make recognition easier, but sometimes this does not work and the processed image is hard for you to read too.
  • Proofreading in the OCR software is necessary but not sufficient. The first time you read through the text you will find many more errors, and correcting those will often require looking at the original image or, even better, the original book.
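
The spell-check pass mentioned above can be approximated with a few lines of script. The following is only a minimal sketch: the tiny KNOWN_WORDS set and the function name suspicious_tokens are hypothetical, and a real check would load a full dictionary for the book's language.

```python
import re

# Hypothetical, tiny word list for illustration only.
KNOWN_WORDS = {"the", "book", "was", "printed", "in", "london"}

# Runs of letters and digits are treated as tokens.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def suspicious_tokens(text, known_words=KNOWN_WORDS):
    """Return (token, reason) pairs for tokens that deserve a second look."""
    flagged = []
    for token in WORD_RE.findall(text):
        has_alpha = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if has_alpha and has_digit:
            # Digits inside a word often come from confusions such as 0/O or 1/l.
            flagged.append((token, "mixed letters and digits"))
        elif has_alpha and token.lower() not in known_words:
            flagged.append((token, "not in word list"))
    return flagged


if __name__ == "__main__":
    for token, reason in suspicious_tokens("Tbe book was printed in L0ndon"):
        print(token, "-", reason)
```

A check like this only finds tokens that are not words at all. A misreading that produces another valid word, such as "modem" for "modern", passes unnoticed, which is why reading the text against the original images is still needed.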

Quality Levels

The OSIS standard has quality levels defined for OCR documents. These are useful guidelines for any OCR effort.

  • Level 1: Sub-OCR Quality
The text may have many typographical errors; essentially, it is unproofed text from automated OCR, probably of a less-than-ideal original.
  • Level 2: OCR Quality
The text may have up to 5 typographical errors per source page. It may be unproofed output from ideal OCR of an ideal source, or may have been run at least through rudimentary spell-checking or vocabulary counting and repair, or entered by a double-keying or similar service that maintains accuracy to the required level.
  • Level 3: Proof Quality
There may not be more than an average of 1 error per source page (or per 2000 characters of content) as compared with the stated copy text. This requirement does not preclude producing new editions, which for example may fix typos in the original, normalize spelling of older texts, and so on. However, in such cases it is recommended that the best available copy of the source text as it existed prior to such modernizations, also be made available.
  • Level 4: Trusted Quality
A Trusted Quality document must fulfill all the requirements of a Proof Quality document, and must also have been in public use for at least one year, and read by at least 5 independent proofreaders, with all noted errors fixed. The text should have available a complete log of changes made since it reached Proof Quality. Random spot-checks of at least 3% of the text must come up with no instances of more than 1 error per 5 pages (or 10,000 characters of content).
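
The numeric thresholds in these levels can be turned into a quick check. The sketch below is an illustration, not part of the OSIS standard: it treats the Level 2 limit as an average per page, uses the 2000 characters per page equivalence given for Level 3, and covers only the numeric part of Level 4 (the spot-check rule). The function names are invented for this example.

```python
def errors_per_page(error_count, pages=None, characters=None, chars_per_page=2000):
    """Average errors per source page; pages may be derived from a character count."""
    if pages is None:
        if characters is None:
            raise ValueError("give either pages or characters")
        pages = characters / chars_per_page
    if pages <= 0:
        raise ValueError("page count must be positive")
    return error_count / pages


def numeric_quality_level(error_count, pages=None, characters=None):
    """Return 1, 2 or 3 from the error rate alone.

    Level 4 cannot be decided numerically: it also requires a year of public
    use, 5 independent proofreaders and a complete change log.
    """
    rate = errors_per_page(error_count, pages, characters)
    if rate <= 1:
        return 3  # Proof Quality: at most 1 error per page on average
    if rate <= 5:
        return 2  # OCR Quality: up to 5 errors per page (averaged here)
    return 1      # Sub-OCR Quality


def passes_spot_check(errors_found, pages_checked):
    """Level 4 spot check: no more than 1 error per 5 pages checked."""
    return errors_found * 5 <= pages_checked


if __name__ == "__main__":
    print(numeric_quality_level(40, pages=10))
    print(passes_spot_check(1, 5))
```

For a 10-page sample with 40 known errors the average is 4 per page, which falls in Level 2; the same sample with 8 errors would fall in Level 3.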