OCRmyPDF


OCRmyPDF adds an OCR text layer to scanned PDF files, making them searchable.

Features

  • Generates a searchable PDF/A file from a regular PDF
  • Places OCR text accurately below the image to ease copy / paste
  • Keeps the exact resolution of the original embedded images
  • When possible, inserts OCR information as a "lossless" operation without rendering vector information
  • Keeps file size about the same
  • If requested, deskews and/or cleans the image before performing OCR
  • Validates input and output files
  • Provides debug mode to enable easy verification of the OCR results
  • Processes pages in parallel when more than one CPU core is available
  • Uses Tesseract OCR engine
  • Supports more than 100 languages recognized by Tesseract
  • Battle-tested on thousands of PDFs, with a test suite and continuous integration

Command line

ocrmypdf                     # it's a scriptable command line program
  -l eng+fra                 # it supports multiple languages
  --rotate-pages             # it can fix pages that are misrotated
  --deskew                   # it can deskew crooked PDFs!
  --title "My PDF"           # it can change output metadata
  --jobs 4                   # it uses multiple cores by default
  --output-type pdfa         # it produces PDF/A by default
  input_scanned.pdf          # takes PDF input (or images)
  output_searchable.pdf      # produces validated PDF output
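
The annotated listing above cannot be pasted into a shell as written, because each argument sits on its own line followed by a comment. As a minimal sketch, the same invocation can be assembled and run from a script. The helper names `build_ocrmypdf_command` and `run_ocr` are hypothetical and not part of OCRmyPDF; only the flags shown in the listing are used, and the sketch assumes the `ocrmypdf` executable is on the PATH.

```python
import shlex
import shutil
import subprocess
from typing import List, Optional


def build_ocrmypdf_command(
    input_pdf: str,
    output_pdf: str,
    languages: Optional[List[str]] = None,
    title: Optional[str] = None,
    jobs: Optional[int] = None,
    rotate_pages: bool = False,
    deskew: bool = False,
    output_type: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list using only the flags from the listing above.

    Hypothetical helper for illustration; it is not an OCRmyPDF API.
    """
    cmd = ["ocrmypdf"]
    if languages:
        # Multiple languages are joined with "+", e.g. "eng+fra".
        cmd += ["-l", "+".join(languages)]
    if rotate_pages:
        cmd.append("--rotate-pages")
    if deskew:
        cmd.append("--deskew")
    if title is not None:
        cmd += ["--title", title]
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    if output_type is not None:
        cmd += ["--output-type", output_type]
    # Input and output paths come last, as in the listing.
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocr(cmd: List[str]) -> int:
    """Run the command if the executable is installed; return its exit code."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} executable not found on PATH")
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    command = build_ocrmypdf_command(
        "input_scanned.pdf",
        "output_searchable.pdf",
        languages=["eng", "fra"],
        title="My PDF",
        jobs=4,
        rotate_pages=True,
        deskew=True,
        output_type="pdfa",
    )
    # Print a shell-safe, single-line form of the command without running it.
    print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems with values such as a title containing spaces. To actually process a file, pass the assembled list to `run_ocr(command)`.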

Platforms

Available for Linux, UNIX, and macOS. Windows is not directly supported, but a Docker image is available that runs on Windows. Debian users can install the official Debian package.

Download

https://github.com/jbarlow83/OCRmyPDF

For discussion

Works for postprocessing both Spanish-language and English-language PDF files.
