Skip to content

kutsenko/pdfa-service

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

105 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

pdfa service

Command-line tool and REST API that converts PDF, Office, OpenDocument, and image files into PDF/A files using OCRmyPDF with built-in OCR.

📚 Documentation & Language

Link Description
🇩🇪 Deutsch (German) Complete German documentation
⚙️ Compression Configuration PDF compression settings - Configure quality vs file size trade-offs
⚙️ Komprimierungs-Konfiguration (Deutsch) PDF-Komprimierungseinstellungen - Qualität vs. Dateigröße konfigurieren
🥧 OCR-SCANNER Setup Guide Raspberry Pi & Network Setup - Deploy pdfa-service as a network-wide OCR scanner
🥧 OCR-SCANNER (Deutsch) Raspberry Pi & Netzwerk-Anleitung - Einsatz als Dokumentenscanner im lokalen Netzwerk
📋 OCR-SCANNER Practical Guide Real-world scenarios with Docker Compose - Home office, law firms, medical practices
📋 OCR-SCANNER Praktische Anleitung Praktische Szenarien mit Docker Compose - Heimatelier, Kanzlei, Arztpraxis

Features

  • Converts PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), and image files (JPG, PNG, TIFF, BMP, GIF) to PDF/A-compliant documents
  • Office, OpenDocument, and image files are automatically converted to PDF before PDF/A processing
  • Wraps OCRmyPDF to generate PDF/A-2 compliant files with configurable OCR
  • Configurable OCR language and PDF/A level (1, 2, or 3)
  • Offers a FastAPI REST endpoint for document conversions
  • Ships with comprehensive tests, black, and ruff configurations

Requirements

  • Python 3.11+
  • LibreOffice (for Office document conversion)
  • OCRmyPDF runtime dependencies: Tesseract OCR, Ghostscript, and qpdf for PDF processing

For detailed installation instructions, refer to the OCRmyPDF installation guide.

System Dependencies by Distribution

Debian 12+ / Ubuntu 22.04+ / Linux Mint

Install the system dependencies before setting up the virtual environment:

sudo apt update
sudo apt install python3-venv python3-pip \
  libreoffice-calc libreoffice-impress \
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-deu \
  ghostscript qpdf

Red Hat / Fedora / AlmaLinux / Rocky Linux

Install the system dependencies using DNF:

sudo dnf install python3.11+ python3-pip \
  libreoffice-calc libreoffice-impress \
  tesseract tesseract-langpack-deu tesseract-langpack-eng \
  ghostscript qpdf

For RHEL 9 and older versions, you may need to enable the PowerTools repository for some packages:

sudo dnf config-manager --set-enabled powertools  # RHEL
# or for other RHEL-based distros, check your repository configuration

Arch Linux / Manjaro

Install the system dependencies using Pacman:

sudo pacman -Syu
sudo pacman -S python python-pip \
  libreoffice-still \
  tesseract tesseract-data-deu tesseract-data-eng \
  ghostscript qpdf

Note: Arch provides libreoffice-still (stable) instead of splitting Calc and Impress into separate packages.

Language Support and Verification

Adding Additional OCR Languages:

The default installation includes English (eng) and German (deu) OCR support. To add more languages:

Distribution Command
Debian/Ubuntu sudo apt install tesseract-ocr-<lang> (e.g., tesseract-ocr-fra for French)
Red Hat/Fedora sudo dnf install tesseract-langpack-<lang> (e.g., tesseract-langpack-fra)
Arch Linux sudo pacman -S tesseract-data-<lang> (e.g., tesseract-data-fra)

Verifying Installation:

After installation, verify that all dependencies are available:

# Check Python version (3.11+)
python3 --version

# Verify Tesseract OCR
tesseract --version

# Verify Ghostscript
gs --version

# Verify qpdf
qpdf --version

# Verify LibreOffice
libreoffice --version

All commands should return version information without errors.

Getting Started

python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pdfa-cli --help

Tip: Activating the virtual environment adds .venv/bin to your PATH, so pdfa-cli is available directly.

Usage

CLI: Converting Documents

The CLI accepts PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), and image files (JPG, PNG, TIFF, BMP, GIF):

# Convert PDF to PDF/A
pdfa-cli input.pdf output.pdf --language deu+eng --pdfa-level 3

# Convert Office documents to PDF/A (automatic)
pdfa-cli document.docx output.pdf --language eng
pdfa-cli presentation.pptx output.pdf
pdfa-cli spreadsheet.xlsx output.pdf

# Convert OpenDocument files to PDF/A (automatic)
pdfa-cli document.odt output.pdf --language eng
pdfa-cli presentation.odp output.pdf
pdfa-cli spreadsheet.ods output.pdf

# Convert images to PDF/A (automatic)
pdfa-cli photo.jpg output.pdf --language eng
pdfa-cli scan.png output.pdf
pdfa-cli document.tiff output.pdf

Options:

  • -l, --language: Tesseract language codes for OCR (default: deu+eng)
  • --pdfa-level: PDF/A compliance level (1, 2, or 3; default: 2)
  • --no-ocr: Disable OCR and convert without text recognition
  • --force-ocr-on-tagged-pdfs: Force OCR on PDFs with structure tags. By default, OCR is skipped for tagged PDFs to preserve accessibility information
  • -v, --verbose: Enable verbose (debug) logging
  • --log-file: Write logs to a file in addition to stderr

Running the REST API

Start the REST service with uvicorn:

uvicorn pdfa.api:app --host 0.0.0.0 --port 8000

Web-Based Test Interface

Once the API is running, visit http://localhost:8000 to access the interactive web interface where you can:

  • Upload documents (PDF, Office, OpenDocument, and image formats)
  • Select OCR language and PDF/A compliance level
  • Toggle OCR on/off
  • Skip OCR for tagged PDFs (enabled by default to preserve accessibility)
  • Download converted files directly from your browser

This is the easiest way to test the service without using the command line.

Programmatic Usage

Upload a document via POST /convert with a multipart/form-data request:

# Convert PDF to PDF/A
curl -X POST "http://localhost:8000/convert" \
  -F "file=@input.pdf;type=application/pdf" \
  -F "language=deu+eng" \
  -F "pdfa_level=2" \
  --output output.pdf

# Convert MS Office documents to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
  -F "file=@document.docx;type=application/vnd.openxmlformats-officedocument.wordprocessingml.document" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@presentation.pptx;type=application/vnd.openxmlformats-officedocument.presentationml.presentation" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@spreadsheet.xlsx;type=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" \
  --output output.pdf

# Convert OpenDocument files to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
  -F "file=@document.odt;type=application/vnd.oasis.opendocument.text" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@presentation.odp;type=application/vnd.oasis.opendocument.presentation" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@spreadsheet.ods;type=application/vnd.oasis.opendocument.spreadsheet" \
  --output output.pdf

# Convert image files to PDF/A (automatic)
curl -X POST "http://localhost:8000/convert" \
  -F "file=@photo.jpg;type=image/jpeg" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@scan.png;type=image/png" \
  --output output.pdf

curl -X POST "http://localhost:8000/convert" \
  -F "file=@document.tiff;type=image/tiff" \
  --output output.pdf

The service validates the upload, converts Office, OpenDocument, and image files to PDF (if needed), converts to PDF/A using OCRmyPDF, and returns the converted document as the HTTP response body.

Available Parameters

  • file (required): PDF, MS Office (DOCX, PPTX, XLSX), OpenDocument (ODT, ODS, ODP), or image (JPG, PNG, TIFF, BMP, GIF) file to convert
  • language (optional): Tesseract language codes for OCR (default: deu+eng)
  • pdfa_level (optional): PDF/A compliance level: 1, 2, or 3 (default: 2)
  • ocr_enabled (optional): Whether to perform OCR (default: true). Set to false to skip OCR.

Example without OCR:

curl -X POST "http://localhost:8000/convert" \
  -F "file=@input.pdf;type=application/pdf" \
  -F "ocr_enabled=false" \
  --output output.pdf

Advanced Usage

Batch Processing with curl

Convert multiple files in a directory recursively:

# Convert all PDFs in directory and subdirectories, save with -pdfa.pdf suffix
find /path/to/documents -name "*.pdf" -type f | while read file; do
  output="${file%.*}-pdfa.pdf"
  echo "Converting: $file -> $output"
  curl -s -X POST "http://localhost:8000/convert" \
    -F "file=@${file};type=application/pdf" \
    -F "language=deu+eng" \
    -F "pdfa_level=2" \
    --output "$output"
done

Mixed Format Batch Processing

Convert multiple file types (PDF, DOCX, PPTX, XLSX, ODT, ODS, ODP, JPG, PNG, TIFF, BMP, GIF) in a single directory:

# Convert all supported formats
for file in /path/to/documents/*.*; do
  [ ! -f "$file" ] && continue

  ext="${file##*.}"
  output="${file%.*}-pdfa.pdf"

  # Determine MIME type
  case "$ext" in
    pdf)
      mime="application/pdf"
      ;;
    docx)
      mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      ;;
    pptx)
      mime="application/vnd.openxmlformats-officedocument.presentationml.presentation"
      ;;
    xlsx)
      mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
      ;;
    odt)
      mime="application/vnd.oasis.opendocument.text"
      ;;
    odp)
      mime="application/vnd.oasis.opendocument.presentation"
      ;;
    ods)
      mime="application/vnd.oasis.opendocument.spreadsheet"
      ;;
    jpg|jpeg)
      mime="image/jpeg"
      ;;
    png)
      mime="image/png"
      ;;
    tiff|tif)
      mime="image/tiff"
      ;;
    bmp)
      mime="image/bmp"
      ;;
    gif)
      mime="image/gif"
      ;;
    *)
      echo "Skipping unsupported format: $file"
      continue
      ;;
  esac

  echo "Converting: $file -> $output"
  curl -s -X POST "http://localhost:8000/convert" \
    -F "file=@${file};type=${mime}" \
    -F "language=deu+eng" \
    -F "pdfa_level=2" \
    --output "$output"
done

Parallel Processing

For faster batch processing with multiple concurrent requests:

# Convert up to 4 files in parallel (all supported formats)
find /path/to/documents -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.pptx" -o -name "*.xlsx" -o -name "*.odt" -o -name "*.odp" -o -name "*.ods" -o -name "*.jpg" -o -name "*.jpeg" -o -name "*.png" -o -name "*.tiff" -o -name "*.tif" -o -name "*.bmp" -o -name "*.gif" \) | \
  xargs -P 4 -I {} bash -c '
    file="{}"
    output="${file%.*}-pdfa.pdf"
    mime="application/pdf"
    [[ "$file" == *.docx ]] && mime="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    [[ "$file" == *.pptx ]] && mime="application/vnd.openxmlformats-officedocument.presentationml.presentation"
    [[ "$file" == *.xlsx ]] && mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    [[ "$file" == *.odt ]] && mime="application/vnd.oasis.opendocument.text"
    [[ "$file" == *.odp ]] && mime="application/vnd.oasis.opendocument.presentation"
    [[ "$file" == *.ods ]] && mime="application/vnd.oasis.opendocument.spreadsheet"
    [[ "$file" == *.jpg || "$file" == *.jpeg ]] && mime="image/jpeg"
    [[ "$file" == *.png ]] && mime="image/png"
    [[ "$file" == *.tiff || "$file" == *.tif ]] && mime="image/tiff"
    [[ "$file" == *.bmp ]] && mime="image/bmp"
    [[ "$file" == *.gif ]] && mime="image/gif"

    echo "Converting: $file"
    curl -s -X POST "http://localhost:8000/convert" \
      -F "file=@${file};type=${mime}" \
      -F "language=deu+eng" \
      --output "$output"
  '

Batch Script

For a more robust solution with error handling, logging, and progress tracking, use the provided batch conversion script:

# Make the script executable
chmod +x scripts/batch-convert.sh

# Convert all documents in a directory (recursive)
./scripts/batch-convert.sh /path/to/documents

# With custom API endpoint and language
./scripts/batch-convert.sh /path/to/documents \
  --api-url "http://api-server:8000" \
  --language "eng" \
  --pdfa-level "3"

# Dry-run mode (preview without actually converting)
./scripts/batch-convert.sh /path/to/documents --dry-run

See scripts/README.md for detailed documentation on the batch conversion script.

Testing

Unit and Integration Tests

Run the test suite:

pytest

Run with verbose output:

pytest -v

Testing GitHub Actions Locally

The project uses act to run GitHub Actions workflows locally before pushing to GitHub.

Prerequisites: Install act following the installation guide

Run specific jobs:

# Run tests
act -j test

# Run security scan
act -j security

# Run both (all stage 0 jobs)
act

Configuration: The .actrc file configures act to use the correct Docker image for local testing.

Note: The build-and-push job requires Docker Hub credentials and is typically not run locally.

Deployment

Docker

Docker Image Variants

Two Docker image variants are available:

Variant Tags Features Size Use Case
Full :latest, :1.2.3 PDF, Office docs (.docx, .xlsx, .pptx), Images (.jpg, .png) ~1.2 GB Complete functionality with LibreOffice support
Minimal :latest-minimal, :1.2.3-minimal PDF to PDF/A only ~400-500 MB Smaller footprint, PDF/A conversion only

Choosing an Image:

  • Use the full image (:latest) if you need to convert Office documents or images
  • Use the minimal image (:latest-minimal) if you only convert PDFs to PDF/A and want a smaller image

Building Locally

Build the full image (default):

docker build -t pdfa-service:latest .

Build the minimal image:

docker build --target minimal -t pdfa-service:minimal .

Using Pre-built Images from Docker Hub

Pull and run the full image:

docker pull <username>/pdfa-service:latest
docker run -p 8000:8000 <username>/pdfa-service:latest

Pull and run the minimal image:

docker pull <username>/pdfa-service:latest-minimal
docker run -p 8000:8000 <username>/pdfa-service:latest-minimal

Running the API Service

Run the API service in a container:

docker run -p 8000:8000 pdfa-service:latest

Using the CLI

Convert a PDF using the containerized CLI:

docker run --rm -v $(pwd):/data pdfa-service:latest \
  pdfa-cli /data/input.pdf /data/output.pdf --language eng

Docker Compose

The simplest way to run the service locally:

docker compose up

This starts the REST API on http://localhost:8000. Upload PDFs via:

curl -X POST "http://localhost:8000/convert" \
  -F "file=@input.pdf;type=application/pdf" \
  -F "language=eng" \
  -F "pdfa_level=2" \
  --output output.pdf

Project Layout

.
├── pyproject.toml
├── README.md
├── src
│   └── pdfa
│       ├── __init__.py
│       ├── api.py
│       ├── cli.py
│       └── converter.py
└── tests
    ├── __init__.py
    ├── conftest.py
    ├── test_api.py
    └── test_cli.py

Security

This project uses automated vulnerability scanning to ensure dependency security:

  • pip-audit: Scans Python dependencies for known CVEs using the PyPI Advisory Database
  • Trivy: Scans Docker images for vulnerabilities in OS packages and Python dependencies
  • Dependabot: Automatically creates pull requests for dependency updates and security patches

CI/CD Security Pipeline

Security scans run on every push and pull request:

  1. Python Dependency Scan: Runs in parallel with tests using pip-audit
  2. Docker Image Scan: Scans both full and minimal image variants with Trivy before pushing
  3. Build Failure: CI pipeline fails if HIGH or CRITICAL vulnerabilities are detected

Vulnerability reports are automatically uploaded to the GitHub Security tab for review.

Running Security Scans Locally

Scan Python dependencies for vulnerabilities:

pip install pip-audit
pip-audit

Scan Docker images:

# Install Trivy
# See: https://aquasecurity.github.io/trivy/latest/getting-started/installation/

# Build and scan the image
docker build -t pdfa-service .
trivy image --severity HIGH,CRITICAL pdfa-service

Automated Dependency Updates

Dependabot is configured to:

  • Check for dependency updates weekly (Mondays at 06:00 UTC)
  • Create pull requests for Python dependencies and GitHub Actions
  • Automatically label security-related updates
  • Limit open PRs to prevent overwhelming the repository

All Dependabot PRs trigger the full test suite and security scans before merge.

Troubleshooting

Ghostscript Rendering Errors

Symptoms:

  • Error messages like: Error: /undefined in --runpdf--
  • Ghostscript rasterizing failed
  • Conversion fails during OCR processing

Cause: Some PDFs contain features that Ghostscript cannot handle during OCR rasterization (e.g., complex graphics, certain compression types, problematic font embeddings).

Automatic Three-Tier Fallback: The service automatically tries multiple strategies to handle problematic PDFs:

  1. Tier 1 - Normal conversion with your requested settings
  2. Tier 2 - Safe-mode OCR with Ghostscript-friendly parameters:
    • Lower DPI (100) for easier rendering
    • Preserved vector graphics
    • Minimal compression/optimization
    • Simpler PDF/A level (e.g., PDF/A-2 instead of PDF/A-3)
  3. Tier 3 - Conversion without OCR as final fallback
  4. Result: Best possible PDF/A file, potentially with reduced quality or no searchable text

Manual Workaround: If you encounter these errors, you can explicitly disable OCR:

# CLI
pdfa-cli input.pdf output.pdf --no-ocr

# API
curl -X POST -F "file=@input.pdf" \
  -F "ocr_enabled=false" \
  http://localhost:8000/api/v1/convert

Encrypted PDFs

Symptoms:

  • Error: Cannot process encrypted PDF. Please remove encryption first.

Solution: Remove PDF encryption before conversion:

# Using qpdf
qpdf --decrypt --password=yourpassword encrypted.pdf decrypted.pdf

# Then convert
pdfa-cli decrypted.pdf output.pdf

Corrupted or Invalid PDFs

Symptoms:

  • Error: Invalid or corrupted PDF file
  • Conversion fails immediately

Solutions:

  1. Try repairing the PDF with Ghostscript:

    gs -o repaired.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf
    pdfa-cli repaired.pdf output.pdf
  2. Re-export the PDF from its original source (Word, LibreOffice, etc.)

PDFs Already Have OCR

Symptoms:

  • Log message: PDF already has OCR layer
  • Conversion completes successfully

Action: No action needed. The service detects existing OCR and continues conversion. This is not an error.

About

Service for converting PDFs to PDF/A and performing automatic OCR text recognition for archiving.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors