pdfa

Command-line tool that converts regular PDF documents into PDF/A files using OCRmyPDF with built-in OCR.

Features

Wraps OCRmyPDF to generate PDF/A-compliant files (such as PDF/A-2) with OCR enforced.
Accepts input and output paths, plus a configurable OCR language and PDF/A level.
Ships with a test suite and with Black and Ruff configurations for streamlined development.

Requirements

Python 3.11 or newer.
OCRmyPDF runtime dependencies (Tesseract, Ghostscript, etc.) installed on your system. Refer to the OCRmyPDF installation guide.

Installation

On Debian or Ubuntu, install the system dependencies with APT before setting up the virtual environment:

sudo apt update
sudo apt install python3-venv python3-pip tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu ghostscript qpdf

Add extra tesseract-ocr-<lang> packages if you need OCR support for additional languages.
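Before running a conversion, it can help to confirm that every language you plan to pass is actually installed. The following is a minimal sketch and not part of this project: the helper names `parse_installed`, `missing_languages`, and `installed_languages` are hypothetical. It relies only on Tesseract's `--list-langs` flag.

```python
import subprocess


def parse_installed(listing: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of"):
            continue
        langs.add(line)
    return langs


def missing_languages(spec: str, installed: set[str]) -> list[str]:
    """Return the codes in a 'deu+eng' style spec that are not installed."""
    return [lang for lang in spec.split("+") if lang not in installed]


def installed_languages() -> set[str]:
    """Ask the local Tesseract binary which languages it can use."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Read both streams in case the listing is not written to stdout.
    return parse_installed(result.stdout + result.stderr)
```

Calling `missing_languages("deu+eng+fra", installed_languages())` returns the codes that still need a `tesseract-ocr-<lang>` package.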

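Create and activate the virtual environment, then install the package in editable mode with the development extras:

python3 -m venv .venv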
source .venv/bin/activate
pip install -e ".[dev]"
pdfa-cli --help

Tip: Activating the virtual environment adds .venv/bin to your PATH, so pdfa-cli is available directly.

Usage

pdfa-cli input.pdf output.pdf --language deu+eng --pdfa-level 3

This command converts input.pdf into a PDF/A-3 file written to output.pdf, enforcing OCR with the specified Tesseract languages (German and English).
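To illustrate how options like these might map onto OCRmyPDF's Python API, here is a minimal sketch. It is not the actual contents of cli.py. The names `build_parser`, `ocr_options`, and `PDFA_OUTPUT_TYPES`, and the defaults (`eng`, level `2`), are assumptions made for illustration. The `ocrmypdf.ocr` call and its `language`, `output_type`, and `force_ocr` parameters come from OCRmyPDF itself.

```python
import argparse

# Assumption: each --pdfa-level value maps to one of OCRmyPDF's output_type names.
PDFA_OUTPUT_TYPES = {"1": "pdfa-1", "2": "pdfa-2", "3": "pdfa-3"}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the options shown in this README."""
    parser = argparse.ArgumentParser(prog="pdfa-cli")
    parser.add_argument("input_pdf")
    parser.add_argument("output_pdf")
    # Hypothetical defaults; the real tool may choose differently.
    parser.add_argument("--language", default="eng")
    parser.add_argument(
        "--pdfa-level", choices=sorted(PDFA_OUTPUT_TYPES), default="2"
    )
    return parser


def ocr_options(args: argparse.Namespace) -> dict:
    """Translate parsed CLI arguments into keyword arguments for ocrmypdf.ocr."""
    return {
        # "deu+eng" becomes ["deu", "eng"]
        "language": args.language.split("+"),
        "output_type": PDFA_OUTPUT_TYPES[args.pdfa_level],
        # Rasterize and OCR every page, even if it already has a text layer.
        "force_ocr": True,
    }


def main(argv=None) -> None:
    # Imported lazily so the parsing helpers work without OCRmyPDF installed.
    import ocrmypdf

    args = build_parser().parse_args(argv)
    ocrmypdf.ocr(args.input_pdf, args.output_pdf, **ocr_options(args))
```

Keeping the argument translation in a separate function makes it possible to unit-test the option handling without invoking Tesseract or Ghostscript.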

Testing

Run the test suite from the project root with the virtual environment active:

pytest

Project Layout

.
├── pyproject.toml
├── README.md
├── src
│   └── pdfa
│       ├── __init__.py
│       └── cli.py
└── tests
    ├── __init__.py
    └── test_cli.py