Command-line tool and REST API that converts PDF, Office, OpenDocument, and image files into PDF/A files using OCRmyPDF with built-in OCR.
| Link | Description |
|---|---|
| 🇩🇪 Deutsch (German) | Complete German documentation |
| ⚙️ Compression Configuration | PDF compression settings - Configure quality vs file size trade-offs |
| ⚙️ Komprimierungs-Konfiguration (Deutsch) | PDF-Komprimierungseinstellungen - Qualität vs. Dateigröße konfigurieren |
| 🥧 OCR-SCANNER Setup Guide | Raspberry Pi & Network Setup - Deploy pdfa-service as a network-wide OCR scanner |
| 🥧 OCR-SCANNER (Deutsch) | Raspberry Pi & Netzwerk-Anleitung - Einsatz als Dokumentenscanner im lokalen Netzwerk |
| 📋 OCR-SCANNER Practical Guide | Real-world scenarios with Docker Compose - Home office, law firms, medical practices |
| 📋 OCR-SCANNER Praktische Anleitung | Praktische Szenarien mit Docker Compose - Heimatelier, Kanzlei, Arztpraxis |
- Converts PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), and image files (JPG, PNG, TIFF, BMP, GIF) to PDF/A-compliant documents
- Office, OpenDocument, and image files are automatically converted to PDF before PDF/A processing
- Wraps OCRmyPDF to generate PDF/A-compliant files (PDF/A-2 by default) with configurable OCR
- Configurable OCR language and PDF/A level (1, 2, or 3)
- Offers a FastAPI REST endpoint for document conversions
- Ships with comprehensive tests and with black and ruff configurations
- Python 3.11+
- LibreOffice (for Office document conversion)
- OCRmyPDF runtime dependencies: Tesseract OCR, Ghostscript, and qpdf for PDF processing
For detailed installation instructions, refer to the OCRmyPDF installation guide.
Install the system dependencies before setting up the virtual environment:
sudo apt update
sudo apt install python3-venv python3-pip \
libreoffice-calc libreoffice-impress \
tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu \
ghostscript qpdf
Install the system dependencies using DNF:
sudo dnf install python3.11 python3-pip \
libreoffice-calc libreoffice-impress \
tesseract tesseract-langpack-deu tesseract-langpack-eng \
ghostscript qpdf
On RHEL-based distributions, some packages may require an additional repository (PowerTools on 8.x releases, CRB on 9.x):
sudo dnf config-manager --set-enabled powertools # 8.x releases; on 9.x the repository is named crb
# For other RHEL-based distributions, check your repository configuration
Install the system dependencies using Pacman:
sudo pacman -Syu
sudo pacman -S python python-pip \
libreoffice-still \
tesseract tesseract-data-deu tesseract-data-eng \
ghostscript qpdf
Note: Arch provides libreoffice-still (the stable branch) as a single package instead of splitting Calc and Impress into separate packages.
Adding Additional OCR Languages:
The default installation includes English (eng) and German (deu) OCR support. To add more languages:
| Distribution | Command |
|---|---|
| Debian/Ubuntu | sudo apt install tesseract-ocr-<lang> (e.g., tesseract-ocr-fra for French) |
| Red Hat/Fedora | sudo dnf install tesseract-langpack-<lang> (e.g., tesseract-langpack-fra) |
| Arch Linux | sudo pacman -S tesseract-data-<lang> (e.g., tesseract-data-fra) |
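The naming patterns in the table can be turned into a small helper that expands an OCR language spec such as deu+eng into package names. This is a minimal sketch: the function name and the distribution keys are illustrative and not part of pdfa-service, and language codes that contain underscores may be packaged under a slightly different name on some distributions.

```python
# Package-name patterns copied from the table above.
PACKAGE_PATTERNS = {
    "debian": "tesseract-ocr-{lang}",
    "fedora": "tesseract-langpack-{lang}",
    "arch": "tesseract-data-{lang}",
}


def tesseract_packages(distro: str, languages: str) -> list[str]:
    """Return the language packages for a spec such as 'deu+eng'.

    Hypothetical helper: it only applies the naming pattern and does not
    check that the package actually exists in the distribution.
    """
    pattern = PACKAGE_PATTERNS[distro]
    codes = [code for code in languages.split("+") if code]
    return [pattern.format(lang=code) for code in codes]


if __name__ == "__main__":
    print(" ".join(tesseract_packages("debian", "deu+eng+fra")))
```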
Verifying Installation:
After installation, verify that all dependencies are available:
# Check Python version (3.11+)
python3 --version
# Verify Tesseract OCR
tesseract --version
# Verify Ghostscript
gs --version
# Verify qpdf
qpdf --version
# Verify LibreOffice
libreoffice --version
All commands should return version information without errors.
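The same check can be scripted. The sketch below only looks for the executables named above on PATH and compares the interpreter version with the documented minimum; the function names are illustrative, and it does not verify that the Tesseract language packs are installed.

```python
import shutil
import sys

# Executables used by the verification commands above.
REQUIRED_TOOLS = ("tesseract", "gs", "qpdf", "libreoffice")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 11)):
    """Check the interpreter against the documented minimum (3.11)."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    problems = missing_tools()
    if not python_ok():
        problems.append("python>=3.11")
    if problems:
        sys.exit("Missing: " + ", ".join(problems))
    print("All dependencies found.")
```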
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pdfa-cli --help
Tip: Activating the virtual environment adds .venv/bin to your PATH, so pdfa-cli is available directly.
The CLI accepts PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), and image files (JPG, PNG, TIFF, BMP, GIF):
# Convert PDF to PDF/A
pdfa-cli input.pdf output.pdf --language deu+eng --pdfa-level 3
# Convert Office documents to PDF/A (automatic)
pdfa-cli document.docx output.pdf --language eng
pdfa-cli presentation.pptx output.pdf
pdfa-cli spreadsheet.xlsx output.pdf
# Convert OpenDocument files to PDF/A (automatic)
pdfa-cli document.odt output.pdf --language eng
pdfa-cli presentation.odp output.pdf
pdfa-cli spreadsheet.ods output.pdf
# Convert images to PDF/A (automatic)
pdfa-cli photo.jpg output.pdf --language eng
pdfa-cli scan.png output.pdf
pdfa-cli document.tiff output.pdf
Options:
- -l, --language: Tesseract language codes for OCR (default: deu+eng)
- --pdfa-level: PDF/A compliance level (1, 2, or 3; default: 2)
- --no-ocr: Disable OCR and convert without text recognition
- --force-ocr-on-tagged-pdfs: Force OCR on PDFs with structure tags. By default, OCR is skipped for tagged PDFs to preserve accessibility information
- -v, --verbose: Enable verbose (debug) logging
- --log-file: Write logs to a file in addition to stderr
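To make the defaults concrete, here is how the documented options could be expressed with argparse. It is an illustration only: the real parser lives in src/pdfa/cli.py and may be built differently, and the positional names input_file and output_file are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented pdfa-cli options.

    Illustrative sketch, not the project's actual parser.
    """
    parser = argparse.ArgumentParser(prog="pdfa-cli")
    parser.add_argument("input_file")   # assumed positional name
    parser.add_argument("output_file")  # assumed positional name
    parser.add_argument("-l", "--language", default="deu+eng")
    parser.add_argument("--pdfa-level", choices=["1", "2", "3"], default="2")
    parser.add_argument("--no-ocr", action="store_true")
    parser.add_argument("--force-ocr-on-tagged-pdfs", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("--log-file", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With no flags given, parsing `in.pdf out.pdf` yields language deu+eng, PDF/A level 2, and OCR enabled, matching the defaults listed above.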
Start the REST service with uvicorn:
uvicorn pdfa.api:app --host 0.0.0.0 --port 8000
Once the API is running, visit http://localhost:8000 to access the interactive web interface where you can:
- Upload documents (PDF, Office, OpenDocument, and image formats)
- Select OCR language and PDF/A compliance level
- Toggle OCR on/off
- Skip OCR for tagged PDFs (enabled by default to preserve accessibility)
- Download converted files directly from your browser
This is the easiest way to test the service without using the command line.
Upload a document via POST /convert with a multipart/form-data request:
# Convert PDF to PDF/A
curl -X POST "http://localhost:8000/convert" \
-F "file=@input.pdf;type=application/pdf" \
-F "language=deu+eng" \
-F "pdfa_level=2" \
--output output.pdf
# Convert MS Office documents to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
-F "file=@document.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
--output output.pdf
curl -X POST "http://localhost:8000/convert" \
-F "file=@presentation.pptx;type=application/vnd.openxmlformats-officedocument.presentationml.presentation" \
--output output.pdf
curl -X POST "http://localhost:8000/convert" \
-F "file=@spreadsheet.xlsx;type=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" \
--output output.pdf
# Convert OpenDocument files to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
-F "file=@document.odt;type=application/vnd.oasis.opendocument.text" \
--output output.pdf
curl -X POST "http://localhost:8000/convert" \
-F "file=@presentation.odp;type=application/vnd.oasis.opendocument.presentation" \
--output output.pdf
curl -X POST "http://localhost:8000/convert" \
-F "file=@spreadsheet.ods;type=application/vnd.oasis.opendocument.spreadsheet" \
--output output.pdf
# Convert image files to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
-F "file=@photo.jpg;type=image/jpeg" \
--output output.pdf
curl -X POST "http://localhost:8000/convert" \
-F "file=@scan.png;type=image/png" \
--output output.pdf
curl -X POST "http://localhost:8000/convert" \
-F "file=@document.tiff;type=image/tiff" \
--output output.pdf
The service validates the upload, converts Office, OpenDocument, and image files to PDF if needed, runs OCRmyPDF to produce the PDF/A file, and returns the converted document as the HTTP response body.
- file (required): PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), or image (JPG, PNG, TIFF, BMP, GIF) file to convert
- language (optional): Tesseract language codes for OCR (default: deu+eng)
- pdfa_level (optional): PDF/A compliance level: 1, 2, or 3 (default: 2)
- ocr_enabled (optional): Whether to perform OCR (default: true). Set to false to skip OCR.
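For callers that prefer Python over curl, the request can be assembled with the standard library alone. This is a hedged sketch: the helper names (form_fields, encode_multipart, convert) are illustrative, the field names and defaults are the ones documented above, and the endpoint URL assumes the service is running locally on port 8000.

```python
import os
import urllib.request
import uuid


def form_fields(language="deu+eng", pdfa_level="2", ocr_enabled=True):
    """Build the documented optional form fields as strings."""
    if pdfa_level not in {"1", "2", "3"}:
        raise ValueError("pdfa_level must be '1', '2', or '3'")
    return {
        "language": language,
        "pdfa_level": pdfa_level,
        "ocr_enabled": "true" if ocr_enabled else "false",
    }


def encode_multipart(fields, filename, content, mime):
    """Encode text fields plus one 'file' part; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        f"Content-Type: {mime}".encode(),
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ]
    return b"\r\n".join(lines), f"multipart/form-data; boundary={boundary}"


def convert(path, mime="application/pdf",
            url="http://localhost:8000/convert", **options):
    """POST one file to the service and return the response body."""
    with open(path, "rb") as handle:
        content = handle.read()
    body, content_type = encode_multipart(
        form_fields(**options), os.path.basename(path), content, mime
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A call such as `convert("input.pdf", language="eng", ocr_enabled=False)` returns the converted bytes, which the caller can write to disk. `urlopen` raises `urllib.error.HTTPError` for non-2xx responses, so failures surface as exceptions.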
Example without OCR:
curl -X POST "http://localhost:8000/convert" \
-F "file=@input.pdf;type=application/pdf" \
-F "ocr_enabled=false" \
--output output.pdf
Convert multiple files in a directory recursively:
# Convert all PDFs in directory and subdirectories, save with -pdfa.pdf suffix
find /path/to/documents -name "*.pdf" -type f | while IFS= read -r file; do
output="${file%.*}-pdfa.pdf"
echo "Converting: $file -> $output"
curl -s -X POST "http://localhost:8000/convert" \
-F "file=@${file};type=application/pdf" \
-F "language=deu+eng" \
-F "pdfa_level=2" \
--output "$output"
done
Convert multiple file types (PDF, DOCX, PPTX, XLSX, ODT, ODS, ODP, JPG, PNG, TIFF, BMP, GIF) in a single directory:
# Convert all supported formats
for file in /path/to/documents/*.*; do
[ ! -f "$file" ] && continue
ext="${file##*.}"
output="${file%.*}-pdfa.pdf"
# Determine MIME type
case "$ext" in
pdf)
mime="application/pdf"
;;
docx)
mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
;;
pptx)
mime="application/vnd.openxmlformats-officedocument.presentationml.presentation"
;;
xlsx)
mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
;;
odt)
mime="application/vnd.oasis.opendocument.text"
;;
odp)
mime="application/vnd.oasis.opendocument.presentation"
;;
ods)
mime="application/vnd.oasis.opendocument.spreadsheet"
;;
jpg|jpeg)
mime="image/jpeg"
;;
png)
mime="image/png"
;;
tiff|tif)
mime="image/tiff"
;;
bmp)
mime="image/bmp"
;;
gif)
mime="image/gif"
;;
*)
echo "Skipping unsupported format: $file"
continue
;;
esac
echo "Converting: $file -> $output"
curl -s -X POST "http://localhost:8000/convert" \
-F "file=@${file};type=${mime}" \
-F "language=deu+eng" \
-F "pdfa_level=2" \
--output "$output"
done
For faster batch processing with multiple concurrent requests:
# Convert up to 4 files in parallel (all supported formats)
# -print0 with xargs -0, and passing the path as $1, keeps filenames with spaces or quotes intact
find /path/to/documents -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.pptx" -o -name "*.xlsx" -o -name "*.odt" -o -name "*.odp" -o -name "*.ods" -o -name "*.jpg" -o -name "*.jpeg" -o -name "*.png" -o -name "*.tiff" -o -name "*.tif" -o -name "*.bmp" -o -name "*.gif" \) -print0 | \
xargs -0 -P 4 -I {} bash -c '
file="$1"
output="${file%.*}-pdfa.pdf"
mime="application/pdf"
[[ "$file" == *.docx ]] && mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
[[ "$file" == *.pptx ]] && mime="application/vnd.openxmlformats-officedocument.presentationml.presentation"
[[ "$file" == *.xlsx ]] && mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
[[ "$file" == *.odt ]] && mime="application/vnd.oasis.opendocument.text"
[[ "$file" == *.odp ]] && mime="application/vnd.oasis.opendocument.presentation"
[[ "$file" == *.ods ]] && mime="application/vnd.oasis.opendocument.spreadsheet"
[[ "$file" == *.jpg || "$file" == *.jpeg ]] && mime="image/jpeg"
[[ "$file" == *.png ]] && mime="image/png"
[[ "$file" == *.tiff || "$file" == *.tif ]] && mime="image/tiff"
[[ "$file" == *.bmp ]] && mime="image/bmp"
[[ "$file" == *.gif ]] && mime="image/gif"
echo "Converting: $file"
curl -s -X POST "http://localhost:8000/convert" \
-F "file=@${file};type=${mime}" \
-F "language=deu+eng" \
--output "$output"
' _ {}
For a more robust solution with error handling, logging, and progress tracking, use the provided batch conversion script:
# Make the script executable
chmod +x scripts/batch-convert.sh
# Convert all documents in a directory (recursive)
./scripts/batch-convert.sh /path/to/documents
# With custom API endpoint and language
./scripts/batch-convert.sh /path/to/documents \
--api-url "http://api-server:8000" \
--language "eng" \
--pdfa-level "3"
# Dry-run mode (preview without actually converting)
./scripts/batch-convert.sh /path/to/documents --dry-run
See scripts/README.md for detailed documentation on the batch conversion script.
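The extension-to-MIME mapping and the parallel fan-out from the shell loops above can also be sketched in Python. This is not the provided batch script: all names here are illustrative, and convert_one is a hypothetical callable (for example a function that uploads one file to POST /convert and writes the response to the target path).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Same extension-to-MIME mapping as the shell examples above.
MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".odp": "application/vnd.oasis.opendocument.presentation",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tiff": "image/tiff",
    ".tif": "image/tiff",
    ".bmp": "image/bmp",
    ".gif": "image/gif",
}


def output_path(source: Path) -> Path:
    """Mirror the shell naming scheme: report.docx -> report-pdfa.pdf."""
    return source.with_name(source.stem + "-pdfa.pdf")


def find_inputs(root: Path) -> list[Path]:
    """Collect supported files, skipping outputs from earlier runs."""
    return sorted(
        path for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() in MIME_TYPES
        and not path.name.endswith("-pdfa.pdf")
    )


def convert_all(inputs, convert_one, workers=4):
    """Run convert_one(source, mime, target) for each input in parallel.

    Returns (source, error) pairs in input order; error is None on success.
    """
    def job(source):
        try:
            convert_one(source, MIME_TYPES[source.suffix.lower()], output_path(source))
            return source, None
        except Exception as exc:  # keep going; report the failure to the caller
            return source, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, inputs))
```

Unlike the shell loops, find_inputs skips files that already end in -pdfa.pdf, so a second run does not convert earlier outputs again. Collecting errors instead of aborting lets one bad document fail without stopping the batch.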
Run the test suite:
pytest
Run with verbose output:
pytest -v
The project uses act to run GitHub Actions workflows locally before pushing to GitHub.
Prerequisites: Install act following the installation guide
Run specific jobs:
# Run tests
act -j test
# Run security scan
act -j security
# Run both (all stage 0 jobs)
act
Configuration: The .actrc file configures act to use the correct Docker image for local testing.
Note: The build-and-push job requires Docker Hub credentials and is typically not run locally.
Two Docker image variants are available:
| Variant | Tags | Features | Size | Use Case |
|---|---|---|---|---|
| Full | :latest, :1.2.3 | PDF, Office docs (.docx, .xlsx, .pptx), Images (.jpg, .png) | ~1.2 GB | Complete functionality with LibreOffice support |
| Minimal | :latest-minimal, :1.2.3-minimal | PDF to PDF/A only | ~400-500 MB | Smaller footprint, PDF/A conversion only |
Choosing an Image:
- Use the full image (:latest) if you need to convert Office documents or images
- Use the minimal image (:latest-minimal) if you only convert PDFs to PDF/A and want a smaller image
Build the full image (default):
docker build -t pdfa-service:latest .
Build the minimal image:
docker build --target minimal -t pdfa-service:minimal .
Pull and run the full image:
docker pull <username>/pdfa-service:latest
docker run -p 8000:8000 <username>/pdfa-service:latest
Pull and run the minimal image:
docker pull <username>/pdfa-service:latest-minimal
docker run -p 8000:8000 <username>/pdfa-service:latest-minimal
Run the API service in a container:
docker run -p 8000:8000 pdfa-service:latest
Convert a PDF using the containerized CLI:
docker run --rm -v "$(pwd)":/data pdfa-service:latest \
pdfa-cli /data/input.pdf /data/output.pdf --language eng
The simplest way to run the service locally is Docker Compose:
docker compose up
This starts the REST API on http://localhost:8000. Upload a PDF with:
curl -X POST "http://localhost:8000/convert" \
-F "file=@input.pdf;type=application/pdf" \
-F "language=eng" \
-F "pdfa_level=2" \
--output output.pdf
.
├── pyproject.toml
├── README.md
├── src
│   └── pdfa
│       ├── __init__.py
│       ├── api.py
│       ├── cli.py
│       └── converter.py
└── tests
    ├── __init__.py
    ├── conftest.py
    ├── test_api.py
    └── test_cli.py
This project uses automated vulnerability scanning to ensure dependency security:
- pip-audit: Scans Python dependencies for known CVEs using the PyPI Advisory Database
- Trivy: Scans Docker images for vulnerabilities in OS packages and Python dependencies
- Dependabot: Automatically creates pull requests for dependency updates and security patches
Security scans run on every push and pull request:
- Python Dependency Scan: Runs in parallel with tests using pip-audit
- Docker Image Scan: Scans both full and minimal image variants with Trivy before pushing
- Build Failure: CI pipeline fails if HIGH or CRITICAL vulnerabilities are detected
Vulnerability reports are automatically uploaded to the GitHub Security tab for review.
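The severity gate can be illustrated with a small filter over a Trivy JSON report. This sketch assumes a report produced with `trivy image --format json`, whose top-level Results list carries a Vulnerabilities list per target; it is not the project's CI configuration, and Trivy can enforce the same rule by itself through its `--severity` and `--exit-code` options.

```python
import json
import sys

# Severities that fail the build, as described above.
BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict) -> list[tuple[str, str, str]]:
    """Return (target, vulnerability id, severity) for blocking findings."""
    findings = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" can be missing or null for clean targets.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                findings.append(
                    (result.get("Target", ""), vuln.get("VulnerabilityID", ""), vuln["Severity"])
                )
    return findings


if __name__ == "__main__":
    # Usage: python gate.py report.json
    with open(sys.argv[1], encoding="utf-8") as handle:
        found = blocking_findings(json.load(handle))
    for target, vuln_id, severity in found:
        print(f"{severity}: {vuln_id} in {target}")
    sys.exit(1 if found else 0)
```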
Scan Python dependencies for vulnerabilities:
pip install pip-audit
pip-audit
Scan Docker images:
# Install Trivy
# See: https://aquasecurity.github.io/trivy/latest/getting-started/installation/
# Build and scan the image
docker build -t pdfa-service .
trivy image --severity HIGH,CRITICAL pdfa-service
Dependabot is configured to:
- Check for dependency updates weekly (Mondays at 06:00 UTC)
- Create pull requests for Python dependencies and GitHub Actions
- Automatically label security-related updates
- Limit open PRs to prevent overwhelming the repository
All Dependabot PRs trigger the full test suite and security scans before merge.
Symptoms:
- Error messages like "Error: /undefined in --runpdf--" or "Ghostscript rasterizing failed"
- Conversion fails during OCR processing
Cause: Some PDFs contain features that Ghostscript cannot handle during OCR rasterization (e.g., complex graphics, certain compression types, problematic font embeddings).
Automatic Three-Tier Fallback: The service automatically tries multiple strategies to handle problematic PDFs:
- Tier 1 - Normal conversion with your requested settings
- Tier 2 - Safe-mode OCR with Ghostscript-friendly parameters:
  - Lower DPI (100) for easier rendering
  - Preserved vector graphics
  - Minimal compression/optimization
  - Simpler PDF/A level (e.g., PDF/A-2 instead of PDF/A-3)
- Tier 3 - Conversion without OCR as final fallback
- Result: Best possible PDF/A file, potentially with reduced quality or no searchable text
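The tiered behaviour described above follows a common "try strategies in order" pattern. The sketch below shows only that control flow: ConversionError and the strategy callables are hypothetical stand-ins, and the service's actual implementation in src/pdfa/converter.py may be organised differently.

```python
from collections.abc import Callable, Sequence


class ConversionError(Exception):
    """Hypothetical error raised when one conversion strategy fails."""


def convert_with_fallback(
    strategies: Sequence[tuple[str, Callable[[], bytes]]],
) -> tuple[str, bytes]:
    """Try each (name, attempt) pair in order; return the first success.

    Raises ConversionError listing every failure if all strategies fail.
    """
    errors = []
    for name, attempt in strategies:
        try:
            return name, attempt()
        except ConversionError as exc:
            errors.append(f"{name}: {exc}")
    raise ConversionError("all strategies failed: " + "; ".join(errors))


if __name__ == "__main__":
    def normal() -> bytes:
        raise ConversionError("Ghostscript rasterizing failed")

    def safe_mode() -> bytes:
        return b"%PDF-1.7 (safe-mode result)"

    def without_ocr() -> bytes:
        return b"%PDF-1.7 (no OCR)"

    tier, _ = convert_with_fallback(
        [("normal", normal), ("safe-mode", safe_mode), ("no-ocr", without_ocr)]
    )
    print(f"Succeeded with tier: {tier}")
```

Returning the name of the tier that succeeded lets the caller log or report that the result may have reduced quality or no searchable text.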
Manual Workaround: If you encounter these errors, you can explicitly disable OCR:
# CLI
pdfa-cli input.pdf output.pdf --no-ocr
# API
curl -X POST "http://localhost:8000/convert" \
-F "file=@input.pdf" \
-F "ocr_enabled=false" \
--output output.pdf
Symptoms:
- Error: Cannot process encrypted PDF. Please remove encryption first.
Solution: Remove PDF encryption before conversion:
# Using qpdf
qpdf --decrypt --password=yourpassword encrypted.pdf decrypted.pdf
# Then convert
pdfa-cli decrypted.pdf output.pdf
Symptoms:
- Error: Invalid or corrupted PDF file
- Conversion fails immediately
Solutions:
- Try repairing the PDF with Ghostscript, then convert the repaired file:
  gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf
  pdfa-cli repaired.pdf output.pdf
- Re-export the PDF from its original source (Word, LibreOffice, etc.)
Symptoms:
- Log message: PDF already has OCR layer
- Conversion completes successfully
Action: No action needed. The service detects existing OCR and continues conversion. This is not an error.