OCRmyPDF
From MobileRead
OCRmyPDF adds an OCR text layer to scanned PDF files, making them searchable.
Features
- Generates a searchable PDF/A file from a regular PDF
- Places OCR text accurately below the image to ease copy / paste
- Keeps the exact resolution of the original embedded images
- When possible, inserts OCR information as a "lossless" operation without rendering vector information
- Keeps file size about the same
- If requested, de-skews and/or cleans the image before performing OCR
- Validates input and output files
- Provides debug mode to enable easy verification of the OCR results
- Processes pages in parallel when more than one CPU core is available
- Uses Tesseract OCR engine
- Supports more than 100 languages recognized by Tesseract
- Battle-tested on thousands of PDFs, with a test suite and continuous integration
Command line
ocrmypdf                      # it's a scriptable command line program
    -l eng+fra                # it supports multiple languages
    --rotate-pages            # it can fix pages that are misrotated
    --deskew                  # it can deskew crooked PDFs!
    --title "My PDF"          # it can change output metadata
    --jobs 4                  # it uses multiple cores by default
    --output-type pdfa        # it produces PDF/A by default
    input_scanned.pdf         # takes PDF input (or images)
    output_searchable.pdf     # produces validated PDF output
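The annotated command above cannot be pasted into a shell as written, because everything after the first # is treated as a comment. A minimal sketch of the same invocation in pasteable form follows. The wrapper name make_searchable is hypothetical and only there to keep the option set in one place; the options themselves are the ones listed above.

```shell
# Hypothetical wrapper around the command shown above.
# $1 = scanned input PDF, $2 = searchable output PDF
make_searchable() {
    ocrmypdf \
        -l eng+fra \
        --rotate-pages \
        --deskew \
        --title "My PDF" \
        --jobs 4 \
        --output-type pdfa \
        "$1" "$2"
}

# Usage (file names are placeholders):
#   make_searchable input_scanned.pdf output_searchable.pdf
```

Since multiple cores and PDF/A output are already the defaults according to the annotations, --jobs and --output-type can probably be dropped unless a different value is wanted.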
Platforms
Available for Linux, UNIX, and macOS. Windows is not directly supported, but a Docker image is available that runs on Windows. Debian users can install the official Debian package.
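The two installation routes mentioned here might look roughly like the sketch below. The function names are hypothetical, and the Docker image name jbarlow83/ocrmypdf is an assumption based on the project's GitHub account, so check the project documentation for the current image. The Docker variant passes "- -" so that OCRmyPDF reads the PDF from standard input and writes the result to standard output, which avoids mounting a host directory into the container.

```shell
# Debian: install the distribution package (assumes sudo is available).
install_ocrmypdf_debian() {
    sudo apt-get install ocrmypdf
}

# Docker: run OCRmyPDF in a container, streaming the PDF through
# stdin/stdout. The image name is an assumption, not confirmed here.
# $1 = scanned input PDF, $2 = searchable output PDF
ocrmypdf_docker() {
    docker run --rm -i jbarlow83/ocrmypdf - - < "$1" > "$2"
}

# Usage (file names are placeholders):
#   ocrmypdf_docker input_scanned.pdf output_searchable.pdf
```

On Windows this would need to run from a POSIX-style shell with Docker installed. Additional OCRmyPDF options would go before the "- -" arguments.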
Download
https://github.com/jbarlow83/OCRmyPDF
For discussion
Works for postprocessing both Spanish- and English-language PDF files.