OCR

From MobileRead

OCR stands for Optical Character Recognition. OCR software interprets the text on a page that has been scanned or captured as a graphics image.

Overview

OCR software converts an image of text into regular text files, which may include some formatting. It can usually recognize font styles such as bold and italic, and some programs can even handle font sizes. The technique for using a scanner to create an eBook is covered in the article Digitizing Paper Books to Ebooks.

Some software works with images stored in PDF files, but all of it really works on the graphics images themselves, typically TIFF. This means it can also be used with pictures you take with a camera, which is the second way to digitize a book. The quality of the input image is often the key to how good the OCR results will be. For this reason a careful scan is important, but the image can often be improved or cleaned up with Scan Tailor.
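One common kind of image cleanup is binarization: turning a grey scan into pure black ink on white paper. The sketch below is only an illustration of that idea using Otsu's method on a flat list of grey levels. It is not how Scan Tailor or any particular OCR program works internally, and the function names are made up for this example.

```python
def otsu_threshold(pixels):
    """Pick the grey level (0-255) that best separates dark ink from light paper.

    `pixels` is a flat list of integer grey levels; this is Otsu's method,
    which maximizes the variance between the two resulting classes.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue
        w_light = total - w_dark
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels):
    """Map every pixel to 0 (ink) or 255 (paper) using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [0 if p <= t else 255 for p in pixels]
```

A real page would be loaded with an imaging library and processed row by row; the flat list here just keeps the example self-contained.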

OCR software can often talk directly to the scanner or digital camera using a protocol called TWAIN. This means the scanning and the OCR task are integrated into a single application.

Two other acronyms, OMR (Optical Mark Recognition) and ICR (Intelligent Character Recognition), are used for specialized products that scan forms and checks, where check marks in boxes or full handwriting must be recognized. Voting machines and scanners use OMR. The term OCR is used interchangeably for optical character recognition and optical word recognition, where a space normally separates the words. IWR (Intelligent Word Recognition), however, is a separate designation and usually a separate application; it recognizes handwritten documents and separates words even in languages that do not put spaces between words.

OCR software

  • Abbyy FineReader Pro - somewhat expensive but works very well. FineReader is multilingual. It supports Latin alphabet languages and Arabic, Chinese, Cyrillic, Greek, Hebrew, Japanese, and Thai, among other languages.
  • FineReader - USA site with Standard, Corporate, and Enterprise versions.
  • Nuance Omnipage - Standard, Ultimate and server products (Previous product TextBridge Pro has been discontinued.)
  • Simple OCR - Free. The site also shows comparisons and help information, and sells OCR software from several sources.
  • Gnu OCR - GOCR is free and fast but not the most accurate. It is under active development. Note that it is also called jOCR, as SourceForge already had a program called GOCR. The latest version dates from 2013.
  • Tesseract-ocr is another free choice. It is considered the best free OCR and is documented on the web site. Google is now behind this tool, although they were not its inventors. It is supported by a high-level analyzer called OCRopus, currently the product of AI research to enhance character recognition with line analysis, language analysis, layout analysis, etc. OCRopus features a plug-in approach in which the individual analysis packages can be customized or replaced with different tools.
  • OCRmyPDF adds a text layer to image-only PDF files using Tesseract-ocr.
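As a rough illustration of the last entry, the sketch below calls the OCRmyPDF command-line tool from Python. It assumes `ocrmypdf` is installed and on the PATH, and that the matching Tesseract language pack is present. The helper names and file names are made up for this example; `-l` is the tool's language option.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf, output_pdf, language="eng"):
    """Return the argument list for: ocrmypdf -l LANGUAGE INPUT OUTPUT."""
    return ["ocrmypdf", "-l", language, input_pdf, output_pdf]


def add_text_layer(input_pdf, output_pdf, language="eng"):
    """Run OCRmyPDF if it is installed; return False when it is not found."""
    if shutil.which("ocrmypdf") is None:
        return False
    # check=True raises CalledProcessError if the tool reports a failure.
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf, language), check=True)
    return True
```

For example, `add_text_layer("scan.pdf", "scan_ocr.pdf", language="deu")` would try to add a German text layer to a hypothetical `scan.pdf`.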

Tips on software

  • Abbyy FineReader Pro has a good recognition rate, speed and usability, but is a bit expensive. However, older versions are much cheaper, and v.8, for example, is even faster than v.9.
  • The home versions of Abbyy and Nuance are reasonably priced if they have the features you need.
  • K2pdfopt has OCR built in as an option. If you are starting with a PDF, it is a good choice.

Multilingual

Since the beginning of the 21st century, OCR has taken on languages that are not based on the Latin alphabet, but support is not uniform across packages. To get good results you need both recognition of the characters and a lexicon in the language. If you are decoding non-English languages, and non-Latin alphabets in particular, you need to ensure that there is a lexicon for that particular language. Abbyy FineReader Pro provides all of the lexicons. Another leading OCR package has 36 lexicons, and a third ships 20. When no linguistic database is included, the linguistic phase has no role to play in the recognition process, so accuracy will likely drop substantially in those languages.
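The role of the lexicon can be shown with a toy example. This is a simplified sketch, not the algorithm of any specific OCR product. It assumes the character recognizer produces several candidate readings of a word, each with a confidence score, and the linguistic phase prefers a candidate found in the lexicon. The function name and the scores are invented for illustration.

```python
def pick_reading(candidates, lexicon):
    """Choose a word from (text, score) candidates, preferring lexicon words.

    Falls back to the highest-scoring candidate when none is in the lexicon,
    which is what happens when no linguistic database is available.
    """
    if not candidates:
        raise ValueError("at least one candidate reading is required")
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    for word, _score in ranked:
        if word.lower() in lexicon:
            return word
    return ranked[0][0]


candidates = [("rnodern", 0.55), ("modern", 0.45)]
with_lexicon = pick_reading(candidates, {"modern", "corner"})
without_lexicon = pick_reading(candidates, set())
```

With the lexicon the slightly lower-scoring "modern" wins; with an empty lexicon the raw recognizer output "rnodern" is kept.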

With the major OCR packages, you can now read the American, Western European, Eastern European and Baltic languages, the Cyrillic-alphabet languages (such as Russian), Greek and Turkish.

OCR services

Typically an OCR service takes your book and cuts out the pages. They then scan the sheets and produce an eBook. Generally this is a PDF but some will do other formats. Some also provide additional services.

Other services take images you supply (generally in a PDF) and convert them to text.

Correcting the output

  • Yes, you will need to proofread the resulting document. Running a spell checker first will help find some common OCR errors.
  • Keep a copy of the original images! Most OCR software processes the image to make it easier for the software to recognize, but sometimes this does not work and the processed image is hard for you to read too.
  • Proofreading in the OCR software is necessary but not enough. The first time you read through the text you will find many more errors, and correcting those will often require looking at the original image or, even better, the original book.
  • Look for the OCR villains, which are typical errors made by OCR software.
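A simple script can help with the first pass. The sketch below flags words that mix letters and digits, on the assumption that confusions such as "1" for "l" or "0" for "o" are among the typical OCR errors. The function name is made up for this example, and the check will also flag legitimate tokens such as "1st", so treat the result as a list of places to look, not a list of errors.

```python
import re


def flag_suspects(text):
    """Return words that contain both letters and digits, in reading order."""
    suspects = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        has_letter = re.search(r"[A-Za-z]", word) is not None
        has_digit = re.search(r"[0-9]", word) is not None
        if has_letter and has_digit:
            suspects.append(word)
    return suspects


suspects = flag_suspects("The c1ock struck 12 at m0rning")
```

Here `suspects` contains "c1ock" and "m0rning", while the plain number "12" is left alone.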

Quality Levels

The OSIS standard has quality levels defined for OCR documents. These are useful guidelines for any OCR effort.

  • Level 1: Sub-OCR Quality
The text may have many typographical errors; essentially, it is unproofed text from automated OCR, probably of a less-than-ideal original.
  • Level 2: OCR Quality
The text may have up to 5 typographical errors per source page. It may be unproofed output from ideal OCR of an ideal source, or may have been run at least through rudimentary spell-checking or vocabulary counting and repair, or entered by a double-keying or similar service that maintains accuracy to the required level.
  • Level 3: Proof Quality
There may not be more than an average of 1 error per source page (or per 2000 characters of content) as compared with the stated copy text. This requirement does not preclude producing new editions, which for example may fix typos in the original, normalize spelling of older texts, and so on. However, in such cases it is recommended that the best available copy of the source text as it existed prior to such modernizations, also be made available.
  • Level 4: Trusted Quality
A Trusted Quality document must fulfill all the requirements of a Proof Quality document, and must also have been in public use for at least one year, and read by at least 5 independent proofreaders, with all noted errors fixed. The text should have available a complete log of changes made since it reached Proof Quality. Random spot-checks of at least 3% of the text must come up with no instances of more than 1 error per 5 pages (or 10,000 characters of content).
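The numeric thresholds above can be turned into a quick check. The sketch below is a simplification: it uses the average error rate per page for both Level 2 and Level 3, and it never returns Level 4, because Trusted Quality also depends on time in public use, the number of proofreaders, and spot-checks that an error count alone cannot show. The function name is invented for this example.

```python
def osis_quality_level(error_count, page_count):
    """Estimate the OSIS quality level (1 to 3) from errors counted over pages."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    if error_count < 0:
        raise ValueError("error_count must not be negative")

    errors_per_page = error_count / page_count
    if errors_per_page <= 1:
        return 3  # Proof Quality: no more than an average of 1 error per page
    if errors_per_page <= 5:
        return 2  # OCR Quality: up to 5 errors per page
    return 1      # Sub-OCR Quality: many errors


level = osis_quality_level(error_count=60, page_count=20)
```

With 60 errors over 20 pages the average is 3 errors per page, so `level` is 2.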

Now what?

For more information
