pdfa service

Command-line tool and REST API that converts PDF, Office, OpenDocument, and image files into PDF/A files using OCRmyPDF with built-in OCR.

📚 Documentation & Language

Link	Description
🇩🇪 Deutsch (German)	Complete German documentation
⚙️ Compression Configuration	PDF compression settings - Configure quality vs file size trade-offs
⚙️ Komprimierungs-Konfiguration (Deutsch)	PDF compression settings (German) - Configure quality vs. file size trade-offs
🥧 OCR-SCANNER Setup Guide	Raspberry Pi & Network Setup - Deploy pdfa-service as a network-wide OCR scanner
🥧 OCR-SCANNER (Deutsch)	Raspberry Pi & network guide (German) - Use as a document scanner on the local network
📋 OCR-SCANNER Practical Guide	Real-world scenarios with Docker Compose - Home office, law firms, medical practices
📋 OCR-SCANNER Praktische Anleitung	Practical scenarios with Docker Compose (German) - Home office, law firm, medical practice

Features

Converts PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), and image files (JPG, PNG, TIFF, BMP, GIF) to PDF/A-compliant documents
Office, OpenDocument, and image files are automatically converted to PDF before PDF/A processing
Wraps OCRmyPDF to generate PDF/A-compliant files (PDF/A-2 by default) with configurable OCR
Configurable OCR language and PDF/A level (1, 2, or 3)
Offers a FastAPI REST endpoint for document conversions
Ships with a comprehensive test suite plus Black and Ruff configurations

Requirements

Python 3.11+
LibreOffice (for Office document conversion)
OCRmyPDF runtime dependencies: Tesseract OCR, Ghostscript, and qpdf for PDF processing

For detailed installation instructions, refer to the OCRmyPDF installation guide.

System Dependencies by Distribution

Debian 12+ / Ubuntu 22.04+ / Linux Mint

Install the system dependencies before setting up the virtual environment:

sudo apt update
sudo apt install python3-venv python3-pip \
  libreoffice-calc libreoffice-impress \
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu \
  ghostscript qpdf

Red Hat / Fedora / AlmaLinux / Rocky Linux

Install the system dependencies using DNF:

sudo dnf install python3.11 python3-pip \
  libreoffice-calc libreoffice-impress \
  tesseract tesseract-langpack-deu tesseract-langpack-eng \
  ghostscript qpdf

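On RHEL-based distributions, some packages may live in an additional repository that has to be enabled first. It is called PowerTools on version 8 and CRB on version 9:

sudo dnf config-manager --set-enabled powertools  # AlmaLinux / Rocky Linux 8
sudo dnf config-manager --set-enabled crb         # AlmaLinux / Rocky Linux 9
# For other RHEL-based distros, check your repository configuration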

Arch Linux / Manjaro

Install the system dependencies using Pacman:

sudo pacman -Syu
sudo pacman -S python python-pip \
  libreoffice-still \
  tesseract tesseract-data-deu tesseract-data-eng \
  ghostscript qpdf

Note: Arch provides libreoffice-still (stable) instead of splitting Calc and Impress into separate packages.

Language Support and Verification

Adding Additional OCR Languages:

The default installation includes English (eng) and German (deu) OCR support. To add more languages:

Distribution	Command
Debian/Ubuntu	`sudo apt install tesseract-ocr-<lang>` (e.g., `tesseract-ocr-fra` for French)
Red Hat/Fedora	`sudo dnf install tesseract-langpack-<lang>` (e.g., `tesseract-langpack-fra`)
Arch Linux	`sudo pacman -S tesseract-data-<lang>` (e.g., `tesseract-data-fra`)
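To confirm that every code in a language string such as deu+eng is actually installed, the output of `tesseract --list-langs` can be compared against the requested codes. The following is a minimal sketch; the helper names `parse_list_langs`, `missing_languages`, and `installed_languages` are hypothetical and not part of pdfa-service.

```python
import subprocess


def parse_list_langs(output: str) -> set[str]:
    """Extract language codes from `tesseract --list-langs` output.

    The header line ("List of available languages ...") is dropped;
    every other non-empty line is treated as one language code.
    """
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        if line and not line.startswith("List of available languages"):
            codes.add(line)
    return codes


def missing_languages(requested: str, available: set[str]) -> list[str]:
    """Return the codes from a 'deu+eng'-style string that are not installed."""
    return [code for code in requested.split("+") if code and code not in available]


def installed_languages() -> set[str]:
    """Ask the local Tesseract binary which language packs it can see."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Fall back to stderr in case the list was not printed to stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

Calling `missing_languages("deu+eng+fra", installed_languages())` on a default installation would then be expected to report `fra` until the French language pack is installed.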

Verifying Installation:

After installation, verify that all dependencies are available:

# Check Python version (3.11+)
python3 --version

# Verify Tesseract OCR
tesseract --version

# Verify Ghostscript
gs --version

# Verify qpdf
qpdf --version

# Verify LibreOffice
libreoffice --version

All commands should return version information without errors.
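The same check can be scripted, for example as a pre-flight step before starting the service. This sketch only tests whether the executables used in the commands above are on PATH; it does not check their versions. The names `REQUIRED_TOOLS` and `missing_tools` are illustrative.

```python
import shutil
from typing import Callable, Iterable, Optional

# Executable names taken from the verification commands above.
REQUIRED_TOOLS = ("tesseract", "gs", "qpdf", "libreoffice")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the function can be exercised without the
    real binaries being installed.
    """
    return [tool for tool in tools if which(tool) is None]
```

An empty list from `missing_tools()` means all four executables were found.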

Getting Started

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pdfa-cli --help

Tip: Activating the virtual environment adds .venv/bin to your PATH, so pdfa-cli is available directly.

Usage

CLI: Converting Documents

The CLI accepts PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), and image files (JPG, PNG, TIFF, BMP, GIF):

# Convert PDF to PDF/A
pdfa-cli input.pdf output.pdf --language deu+eng --pdfa-level 3

# Convert Office documents to PDF/A (automatic)
pdfa-cli document.docx output.pdf --language eng
pdfa-cli presentation.pptx output.pdf
pdfa-cli spreadsheet.xlsx output.pdf

# Convert OpenDocument files to PDF/A (automatic)
pdfa-cli document.odt output.pdf --language eng
pdfa-cli presentation.odp output.pdf
pdfa-cli spreadsheet.ods output.pdf

# Convert images to PDF/A (automatic)
pdfa-cli photo.jpg output.pdf --language eng
pdfa-cli scan.png output.pdf
pdfa-cli document.tiff output.pdf

Options:

-l, --language: Tesseract language codes for OCR (default: deu+eng)
--pdfa-level: PDF/A compliance level (1, 2, or 3; default: 2)
--no-ocr: Disable OCR and convert without text recognition
--force-ocr-on-tagged-pdfs: Force OCR on PDFs with structure tags. By default, OCR is skipped for tagged PDFs to preserve accessibility information
-v, --verbose: Enable verbose (debug) logging
--log-file: Write logs to a file in addition to stderr

Running the REST API

Start the REST service with uvicorn:

uvicorn pdfa.api:app --host 0.0.0.0 --port 8000

Web-Based Test Interface

Once the API is running, visit http://localhost:8000 to access the interactive web interface where you can:

Upload documents (PDF, Office, OpenDocument, and image formats)
Select OCR language and PDF/A compliance level
Toggle OCR on/off
Skip OCR for tagged PDFs (enabled by default to preserve accessibility)
Download converted files directly from your browser

This is the easiest way to test the service without using the command line.

Programmatic Usage

Upload a document via POST /convert with a multipart/form-data request:

# Convert PDF to PDF/A
curl -X POST "http://localhost:8000/convert" \
  -F "file=@input.pdf;type=application/pdf" \
  -F "language=deu+eng" \
  -F "pdfa_level=2" \
  --output output.pdf

# Convert MS Office documents to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
  -F "file=@document.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@presentation.pptx;type=application/vnd.openxmlformats-officedocument.presentationml.presentation" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@spreadsheet.xlsx;type=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" \
  --output output.pdf

# Convert OpenDocument files to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
  -F "file=@document.odt;type=application/vnd.oasis.opendocument.text" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@presentation.odp;type=application/vnd.oasis.opendocument.presentation" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@spreadsheet.ods;type=application/vnd.oasis.opendocument.spreadsheet" \
  --output output.pdf

# Convert image files to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
  -F "file=@photo.jpg;type=image/jpeg" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@scan.png;type=image/png" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@document.tiff;type=image/tiff" \
  --output output.pdf

The service validates the upload, converts Office, OpenDocument, and image files to PDF (if needed), converts to PDF/A using OCRmyPDF, and returns the converted document as the HTTP response body.

Available Parameters

file (required): PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), or image (JPG, PNG, TIFF, BMP, GIF) file to convert
language (optional): Tesseract language codes for OCR (default: deu+eng)
pdfa_level (optional): PDF/A compliance level: 1, 2, or 3 (default: 2)
ocr_enabled (optional): Whether to perform OCR (default: true). Set to false to skip OCR.

Example without OCR:

curl -X POST "http://localhost:8000/convert" \
  -F "file=@input.pdf;type=application/pdf" \
  -F "ocr_enabled=false" \
  --output output.pdf
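The same request can be sent from Python without third-party packages. This is a minimal sketch using only the standard library; the field names (`file`, `language`, `pdfa_level`, `ocr_enabled`) come from the parameter list above, while `encode_multipart` and `convert` are hypothetical helper names. File names containing double quotes are not escaped here.

```python
import os
import urllib.request
import uuid


def encode_multipart(
    fields: dict[str, str], filename: str, content: bytes, mime: str
) -> tuple[bytes, str]:
    """Encode text fields plus one file part as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        chunks.append(part.encode())
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    )
    chunks.append(file_header.encode() + content + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def convert(
    path: str,
    mime: str = "application/pdf",
    language: str = "deu+eng",
    pdfa_level: int = 2,
    ocr_enabled: bool = True,
    base_url: str = "http://localhost:8000",
) -> bytes:
    """POST a file to /convert and return the response body (the PDF/A file)."""
    fields = {
        "language": language,
        "pdfa_level": str(pdfa_level),
        "ocr_enabled": "true" if ocr_enabled else "false",
    }
    with open(path, "rb") as handle:
        body, content_type = encode_multipart(
            fields, os.path.basename(path), handle.read(), mime
        )
    request = urllib.request.Request(
        f"{base_url}/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the service running locally, `open("output.pdf", "wb").write(convert("input.pdf", ocr_enabled=False))` would correspond to the curl example without OCR.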

Advanced Usage

Batch Processing with curl

Convert multiple files in a directory recursively:

# Convert all PDFs in directory and subdirectories, save with -pdfa.pdf suffix
find /path/to/documents -name "*.pdf" -type f | while IFS= read -r file; do
  output="${file%.*}-pdfa.pdf"
  echo "Converting: $file -> $output"
  curl -s -X POST "http://localhost:8000/convert" \
    -F "file=@${file};type=application/pdf" \
    -F "language=deu+eng" \
    -F "pdfa_level=2" \
    --output "$output"
done
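For scripts written in Python, the file discovery and the `-pdfa.pdf` naming rule can be sketched with pathlib. Unlike the shell loop, this version also skips files that already carry the `-pdfa.pdf` suffix or whose output already exists, so that a second run does not convert results again; that behavior is a design choice of this sketch, and the function names are hypothetical.

```python
from pathlib import Path


def output_path_for(source: Path) -> Path:
    """Mirror the shell expansion "${file%.*}-pdfa.pdf"."""
    return source.with_name(f"{source.stem}-pdfa.pdf")


def find_pending_pdfs(root: Path) -> list[tuple[Path, Path]]:
    """Return (source, target) pairs for PDFs that still need converting."""
    pairs = []
    for source in sorted(root.rglob("*.pdf")):
        # Skip directories and results of an earlier run.
        if not source.is_file() or source.name.endswith("-pdfa.pdf"):
            continue
        target = output_path_for(source)
        if not target.exists():
            pairs.append((source, target))
    return pairs
```

Each returned pair can then be passed to whatever upload function is in use.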

Mixed Format Batch Processing

Convert multiple file types (PDF, DOCX, PPTX, XLSX, ODT, ODS, ODP, JPG, PNG, TIFF, BMP, GIF) in a single directory:

# Convert all supported formats
for file in /path/to/documents/*.*; do
  [ ! -f "$file" ] && continue

  ext="${file##*.}"
  output="${file%.*}-pdfa.pdf"

  # Determine MIME type
  case "$ext" in
    pdf)
      mime="application/pdf"
      ;;
    docx)
      mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      ;;
    pptx)
      mime="application/vnd.openxmlformats-officedocument.presentationml.presentation"
      ;;
    xlsx)
      mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
      ;;
    odt)
      mime="application/vnd.oasis.opendocument.text"
      ;;
    odp)
      mime="application/vnd.oasis.opendocument.presentation"
      ;;
    ods)
      mime="application/vnd.oasis.opendocument.spreadsheet"
      ;;
    jpg|jpeg)
      mime="image/jpeg"
      ;;
    png)
      mime="image/png"
      ;;
    tiff|tif)
      mime="image/tiff"
      ;;
    bmp)
      mime="image/bmp"
      ;;
    gif)
      mime="image/gif"
      ;;
    *)
      echo "Skipping unsupported format: $file"
      continue
      ;;
  esac

  echo "Converting: $file -> $output"
  curl -s -X POST "http://localhost:8000/convert" \
    -F "file=@${file};type=${mime}" \
    -F "language=deu+eng" \
    -F "pdfa_level=2" \
    --output "$output"
done
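The extension-to-MIME mapping from the case statement above can be kept in a single lookup table when scripting in Python. This sketch uses the same MIME types; `MIME_TYPES` and `mime_for` are illustrative names. One deliberate difference from the shell version is that the lookup lowercases the extension, so `SCAN.TIF` is recognized as well.

```python
from pathlib import Path
from typing import Optional, Union

MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".odp": "application/vnd.oasis.opendocument.presentation",
    ".ods": "application/vnd.oasis.opendocument.spreadsheet",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".tiff": "image/tiff",
    ".tif": "image/tiff",
    ".bmp": "image/bmp",
    ".gif": "image/gif",
}


def mime_for(path: Union[str, Path]) -> Optional[str]:
    """Return the MIME type for a supported file, or None to skip it."""
    return MIME_TYPES.get(Path(path).suffix.lower())
```

A caller would skip any file for which `mime_for` returns `None`, matching the "Skipping unsupported format" branch of the shell loop.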

Parallel Processing

For faster batch processing with multiple concurrent requests:

# Convert up to 4 files in parallel (all supported formats)
find /path/to/documents -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.pptx" -o -name "*.xlsx" -o -name "*.odt" -o -name "*.odp" -o -name "*.ods" -o -name "*.jpg" -o -name "*.jpeg" -o -name "*.png" -o -name "*.tiff" -o -name "*.tif" -o -name "*.bmp" -o -name "*.gif" \) | \
  xargs -P 4 -I {} bash -c '
    # The file name arrives as $1, so quotes or spaces in it cannot break the script
    file="$1"
    output="${file%.*}-pdfa.pdf"
    mime="application/pdf"
    [[ "$file" == *.docx ]] && mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    [[ "$file" == *.pptx ]] && mime="application/vnd.openxmlformats-officedocument.presentationml.presentation"
    [[ "$file" == *.xlsx ]] && mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    [[ "$file" == *.odt ]] && mime="application/vnd.oasis.opendocument.text"
    [[ "$file" == *.odp ]] && mime="application/vnd.oasis.opendocument.presentation"
    [[ "$file" == *.ods ]] && mime="application/vnd.oasis.opendocument.spreadsheet"
    [[ "$file" == *.jpg || "$file" == *.jpeg ]] && mime="image/jpeg"
    [[ "$file" == *.png ]] && mime="image/png"
    [[ "$file" == *.tiff || "$file" == *.tif ]] && mime="image/tiff"
    [[ "$file" == *.bmp ]] && mime="image/bmp"
    [[ "$file" == *.gif ]] && mime="image/gif"

    echo "Converting: $file"
    curl -s -X POST "http://localhost:8000/convert" \
      -F "file=@${file};type=${mime}" \
      -F "language=deu+eng" \
      --output "$output"
  ' _ {}
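A Python equivalent of the four-way parallelism can be sketched with a thread pool. Threads are sufficient here because each worker mostly waits on the HTTP response. The function `convert_in_parallel` is a hypothetical helper; `convert_one` stands for any callable that uploads one file and raises on failure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterable, Optional


def convert_in_parallel(
    files: Iterable[Path],
    convert_one: Callable[[Path], None],
    max_workers: int = 4,
) -> dict[Path, Optional[BaseException]]:
    """Run convert_one on every file with at most max_workers in flight.

    Returns a mapping from each file to None on success, or to the
    exception that convert_one raised for it.
    """
    results: dict[Path, Optional[BaseException]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_one, path): path for path in files}
        for future in as_completed(futures):
            # exception() returns None when the call completed normally.
            results[futures[future]] = future.exception()
    return results
```

Collecting the exceptions instead of stopping at the first one lets a batch finish and report all failed files at the end.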

Batch Script

For a more robust solution with error handling, logging, and progress tracking, use the provided batch conversion script:

# Make the script executable
chmod +x scripts/batch-convert.sh

# Convert all documents in a directory (recursive)
./scripts/batch-convert.sh /path/to/documents

# With custom API endpoint and language
./scripts/batch-convert.sh /path/to/documents \
  --api-url "http://api-server:8000" \
  --language "eng" \
  --pdfa-level "3"

# Dry-run mode (preview without actually converting)
./scripts/batch-convert.sh /path/to/documents --dry-run

See scripts/README.md for detailed documentation on the batch conversion script.

Testing

Unit and Integration Tests

Run the test suite:

pytest

Run with verbose output:

pytest -v

Testing GitHub Actions Locally

The project uses act to run GitHub Actions workflows locally before pushing to GitHub.

Prerequisites: Install act following the installation guide

Run specific jobs:

# Run tests
act -j test

# Run security scan
act -j security

# Run both (all stage 0 jobs)
act

Configuration: The .actrc file configures act to use the correct Docker image for local testing.

Note: The build-and-push job requires Docker Hub credentials and is typically not run locally.

Deployment

Docker

Docker Image Variants

Two Docker image variants are available:

Variant	Tags	Features	Size	Use Case
Full	`:latest`, `:1.2.3`	PDF, Office docs (.docx, .xlsx, .pptx), Images (.jpg, .png)	~1.2 GB	Complete functionality with LibreOffice support
Minimal	`:latest-minimal`, `:1.2.3-minimal`	PDF to PDF/A only	~400-500 MB	Smaller footprint, PDF/A conversion only

Choosing an Image:

Use the full image (:latest) if you need to convert Office documents or images
Use the minimal image (:latest-minimal) if you only convert PDFs to PDF/A and want a smaller image

Building Locally

Build the full image (default):

docker build -t pdfa-service:latest .

Build the minimal image:

docker build --target minimal -t pdfa-service:minimal .

Using Pre-built Images from Docker Hub

Pull and run the full image:

docker pull <username>/pdfa-service:latest
docker run -p 8000:8000 <username>/pdfa-service:latest

Pull and run the minimal image:

docker pull <username>/pdfa-service:latest-minimal
docker run -p 8000:8000 <username>/pdfa-service:latest-minimal

Running the API Service

Run the API service in a container:

docker run -p 8000:8000 pdfa-service:latest

Using the CLI

Convert a PDF using the containerized CLI:

docker run --rm -v "$(pwd)":/data pdfa-service:latest \
  pdfa-cli /data/input.pdf /data/output.pdf --language eng

Docker Compose

The simplest way to run the service locally:

docker compose up

This starts the REST API on http://localhost:8000. Upload PDFs via:

curl -X POST "http://localhost:8000/convert" \
  -F "file=@input.pdf;type=application/pdf" \
  -F "language=eng" \
  -F "pdfa_level=2" \
  --output output.pdf

Project Layout

.
├── pyproject.toml
├── README.md
├── src
│   └── pdfa
│       ├── __init__.py
│       ├── api.py
│       ├── cli.py
│       └── converter.py
└── tests
    ├── __init__.py
    ├── conftest.py
    ├── test_api.py
    └── test_cli.py

Security

This project uses automated vulnerability scanning to ensure dependency security:

pip-audit: Scans Python dependencies for known CVEs using the PyPI Advisory Database
Trivy: Scans Docker images for vulnerabilities in OS packages and Python dependencies
Dependabot: Automatically creates pull requests for dependency updates and security patches

CI/CD Security Pipeline

Security scans run on every push and pull request:

Python Dependency Scan: Runs in parallel with tests using pip-audit
Docker Image Scan: Scans both full and minimal image variants with Trivy before pushing
Build Failure: CI pipeline fails if HIGH or CRITICAL vulnerabilities are detected

Vulnerability reports are automatically uploaded to the GitHub Security tab for review.

Running Security Scans Locally

Scan Python dependencies for vulnerabilities:

pip install pip-audit
pip-audit

Scan Docker images:

# Install Trivy
# See: https://aquasecurity.github.io/trivy/latest/getting-started/installation/

# Build and scan the image
docker build -t pdfa-service .
trivy image --severity HIGH,CRITICAL pdfa-service

Automated Dependency Updates

Dependabot is configured to:

Check for dependency updates weekly (Mondays at 06:00 UTC)
Create pull requests for Python dependencies and GitHub Actions
Automatically label security-related updates
Limit open PRs to prevent overwhelming the repository

All Dependabot PRs trigger the full test suite and security scans before merge.

Troubleshooting

Ghostscript Rendering Errors

Symptoms:

Error messages like: Error: /undefined in --runpdf--
Ghostscript rasterizing failed
Conversion fails during OCR processing

Cause: Some PDFs contain features that Ghostscript cannot handle during OCR rasterization (e.g., complex graphics, certain compression types, problematic font embeddings).

Automatic Three-Tier Fallback: The service automatically tries multiple strategies to handle problematic PDFs:

Tier 1 - Normal conversion with your requested settings
Tier 2 - Safe-mode OCR with Ghostscript-friendly parameters:
- Lower DPI (100) for easier rendering
- Preserved vector graphics
- Minimal compression/optimization
- Simpler PDF/A level (e.g., PDF/A-2 instead of PDF/A-3)
Tier 3 - Conversion without OCR as final fallback
Result: Best possible PDF/A file, potentially with reduced quality or no searchable text
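The tiered approach follows a common pattern: try an ordered list of strategies and return the first result that succeeds. The following sketch illustrates that pattern only. It is not the service's actual implementation, and `convert_with_fallback` and the strategy names are hypothetical. A real implementation would likely fall back only on Ghostscript-related errors rather than on every exception.

```python
from typing import Callable, Sequence

# A strategy is a label plus a zero-argument callable returning PDF bytes.
Strategy = tuple[str, Callable[[], bytes]]


def convert_with_fallback(strategies: Sequence[Strategy]) -> tuple[str, bytes]:
    """Try each strategy in order and return the first successful result.

    Returns the label of the strategy that worked together with its output.
    Raises RuntimeError listing every failure if none succeeded.
    """
    errors = []
    for name, attempt in strategies:
        try:
            return name, attempt()
        except Exception as exc:  # sketch: catches everything, see note above
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all strategies failed: " + "; ".join(errors))
```

Returning the label alongside the bytes lets the caller log which tier produced the file, which matters because Tier 3 output has no searchable text.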

Manual Workaround: If you encounter these errors, you can explicitly disable OCR:

# CLI
pdfa-cli input.pdf output.pdf --no-ocr

# API
curl -X POST "http://localhost:8000/convert" \
  -F "file=@input.pdf" \
  -F "ocr_enabled=false" \
  --output output.pdf

Encrypted PDFs

Symptoms:

Error: Cannot process encrypted PDF. Please remove encryption first.

Solution: Remove PDF encryption before conversion:

# Using qpdf
qpdf --decrypt --password=yourpassword encrypted.pdf decrypted.pdf

# Then convert
pdfa-cli decrypted.pdf output.pdf
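The two steps can be chained in a small Python wrapper that keeps the decrypted intermediate in a temporary directory, so no unprotected copy is left behind. This is a sketch that assumes `qpdf` and `pdfa-cli` are on PATH; the function names are hypothetical. Note that a password passed on the command line may be visible to other users in the process list.

```python
import subprocess
import tempfile
from pathlib import Path


def qpdf_decrypt_command(source: Path, target: Path, password: str) -> list[str]:
    """Build the qpdf call shown above as an argument list."""
    return ["qpdf", "--decrypt", f"--password={password}", str(source), str(target)]


def decrypt_then_convert(source: Path, output: Path, password: str) -> None:
    """Decrypt into a temporary file, then convert that file to PDF/A."""
    with tempfile.TemporaryDirectory() as workdir:
        decrypted = Path(workdir) / "decrypted.pdf"
        subprocess.run(qpdf_decrypt_command(source, decrypted, password), check=True)
        subprocess.run(["pdfa-cli", str(decrypted), str(output)], check=True)
    # The temporary directory, including decrypted.pdf, is removed here.
```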

Corrupted or Invalid PDFs

Symptoms:

Error: Invalid or corrupted PDF file
Conversion fails immediately

Solutions:

Try repairing the PDF with Ghostscript:

gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf
pdfa-cli repaired.pdf output.pdf

Re-export the PDF from its original source (Word, LibreOffice, etc.)

PDFs Already Have OCR

Symptoms:

Log message: PDF already has OCR layer
Conversion completes successfully

Action: No action needed. The service detects existing OCR and continues conversion. This is not an error.
