OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files.

Features

Generates a searchable PDF/A file from a regular PDF
Places OCR text accurately below the image to ease copy and paste
Keeps the exact resolution of the original embedded images
When possible, inserts OCR information as a "lossless" operation without rendering vector information
Keeps file size about the same
If requested, deskews and/or cleans the image before performing OCR
Validates input and output files
Provides debug mode to enable easy verification of the OCR results
Processes pages in parallel when more than one CPU core is available
Uses the Tesseract OCR engine
Supports more than 100 languages recognized by Tesseract
Battle-tested on thousands of PDFs, with a test suite and continuous integration

Command line

ocrmypdf                     # it's a scriptable command line program
  -l eng+fra                 # it supports multiple languages
  --rotate-pages             # it can fix pages that are misrotated
  --deskew                   # it can deskew crooked PDFs!
  --title "My PDF"           # it can change output metadata
  --jobs 4                   # it uses multiple cores by default
  --output-type pdfa         # it produces PDF/A by default
  input_scanned.pdf          # takes PDF input (or images)
  output_searchable.pdf      # produces validated PDF output

Platforms

Available for Linux, UNIX, and macOS. Windows is not directly supported, but a Docker image is available that runs on Windows. Debian Linux provides an official package.
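
For the Docker route, the following is a rough sketch only. It assumes the image is published as jbarlow83/ocrmypdf, that its entry point is the ocrmypdf program, and that passing "-" for both file arguments makes it read the PDF from standard input and write the result to standard output. None of these details are stated on this page, so check them against the project's own documentation. The helper names are hypothetical.

```python
import subprocess

# Assumption: image name on Docker Hub; verify before relying on it.
DOCKER_IMAGE = "jbarlow83/ocrmypdf"


def build_docker_command(extra_args=()):
    """Build a docker argv that pipes a PDF through the container.

    "-i" keeps stdin open; the two trailing "-" arguments are assumed to
    mean "read input from stdin" and "write output to stdout".
    """
    return ["docker", "run", "--rm", "-i", DOCKER_IMAGE, *extra_args, "-", "-"]


def ocr_via_docker(input_pdf, output_pdf, extra_args=()):
    """Stream input_pdf into the container and save its stdout to output_pdf."""
    with open(input_pdf, "rb") as src, open(output_pdf, "wb") as dst:
        return subprocess.run(build_docker_command(extra_args),
                              stdin=src, stdout=dst, check=True)
```

Streaming through standard input and output avoids mounting a host directory into the container, which tends to be the awkward part on Windows.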

Download

https://github.com/jbarlow83/OCRmyPDF

For discussion

Works for post-processing both Spanish-language and English-language PDF files.

https://www.mobileread.com/forums/showthread.php?t=294101
