A Comprehensive Benchmarking Tool for Vision Language Models on Historical Document OCR
Palladia is a dedicated benchmarking project designed to evaluate Vision Language Models (VLMs) on historical document OCR tasks using the GT4HistOCR dataset. It provides standardized evaluation metrics that allow researchers and practitioners to compare model performance across various types of historical documents, languages, and preservation conditions.
A live demo with sample data can be found at https://palladia.vercel.app/.
Historical documents present unique challenges that modern OCR benchmarks often overlook. These include varied typography arising from diverse fonts, handwriting styles, and printing techniques, as well as document degradation caused by aging, stains, or physical damage.
As I explain further on the website, the goal of this project is to provide a fair and transparent comparison between models, and to track how quickly both flagship and secondary models are closing the gap in their ability to read historical text.
- Standardized Metrics: WER, CER, exact match accuracy, and execution time benchmarking
- Batch Processing: Efficient evaluation across large document collections
- Export Capabilities: Results saved as JSON, ready for visualization
- OpenRouter Friendly: Supports any model available on OpenRouter
- Python 3.13+
- API keys for model providers you want to evaluate
- uv dependency manager
git clone https://github.com/dassoo/Palladia.git
cd Palladia
uv sync
- Copy the environment template:
cp .env.example .env
- Add your OpenRouter API key to .env:
OPENROUTER_API_KEY=your_key_here
- Download the evaluation dataset:
python src/utils/download_dataset.py
- Edit the .yaml files in src/config to choose the input data and the models to use:
source: GT4HistOCR/corpus/EarlyModernLatin/1564-Thucydides-Valla
images_to_process: 2
avoid_rescan: True
models:
  - model_id: openai/gpt-5-mini
    enabled: True
    link: https://openrouter.ai/openai/gpt-5-mini
  - model_id: openai/gpt-5
    enabled: False
    link: https://openrouter.ai/openai/gpt-5
- Run your benchmark:
python src/benchmark/execution.py
Results are automatically saved in JSON format in /benchmarks, following the same path of the chosen input folder.
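As a rough illustration of how these settings might be consumed, the sketch below mirrors the configuration as a plain Python dictionary, selects the enabled models, and derives a results folder under `benchmarks/` that follows the input path. The helper names `enabled_models` and `output_dir_for` are hypothetical, and the exact folder layout is an assumption; Palladia's own code may differ.

```python
from pathlib import PurePosixPath

# Plain-dict mirror of the YAML settings (structure assumed from the example).
config = {
    "source": "GT4HistOCR/corpus/EarlyModernLatin/1564-Thucydides-Valla",
    "images_to_process": 2,
    "avoid_rescan": True,
    "models": [
        {"model_id": "openai/gpt-5-mini", "enabled": True},
        {"model_id": "openai/gpt-5", "enabled": False},
    ],
}


def enabled_models(cfg: dict) -> list[str]:
    """Return the IDs of the models flagged as enabled (hypothetical helper)."""
    return [m["model_id"] for m in cfg["models"] if m.get("enabled")]


def output_dir_for(source: str, root: str = "benchmarks") -> PurePosixPath:
    """Mirror the input folder under the results root (assumed layout)."""
    return PurePosixPath(root) / source


print(enabled_models(config))
print(output_dir_for(config["source"]))
```

With the example configuration, only `openai/gpt-5-mini` would be benchmarked, since `openai/gpt-5` has `enabled: False`.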
Palladia relies on the GT4HistOCR dataset, a large-scale collection of historical documents with human-verified transcriptions. It spans multiple centuries, covering the 15th to the 20th, and includes texts in a variety of European languages with historical spelling variations. The dataset encompasses documents in different preservation states and image qualities, providing a realistic benchmark for model evaluation. With over 300,000 lines of transcribed text, GT4HistOCR organizes documents by type, period, and language, delivering high-resolution images alongside their corresponding text files.
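To make the image/text pairing concrete, here is a minimal sketch that walks one dataset folder and matches each line image with its transcription. It assumes OCRopus-style file names, where an image such as `00001.bin.png` sits next to `00001.gt.txt`; check the downloaded folders before relying on that. The function name `pair_lines` and its `limit` parameter are hypothetical and not part of Palladia.

```python
from pathlib import Path
from typing import Optional


def pair_lines(folder: Path, limit: Optional[int] = None) -> list[tuple[Path, str]]:
    """Pair each line image with its ground-truth text.

    Assumes `<id>.<variant>.png` images stored next to `<id>.gt.txt` files.
    """
    pairs: list[tuple[Path, str]] = []
    for image in sorted(folder.glob("*.png")):
        if limit is not None and len(pairs) >= limit:
            break  # stop once enough pairs have been collected
        line_id = image.name.split(".", 1)[0]
        truth = folder / f"{line_id}.gt.txt"
        if not truth.exists():
            continue  # skip images that have no transcription
        pairs.append((image, truth.read_text(encoding="utf-8").strip()))
    return pairs
```

A `limit` of 2 would play the same role as `images_to_process: 2` in the configuration, assuming that setting caps the number of lines sent to each model.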
Palladia provides comprehensive evaluation using industry-standard metrics:
| Metric | Description | Range | Best |
|---|---|---|---|
| Word Error Rate (WER) | Percentage of incorrectly transcribed words | 0-100% | 0% |
| Character Error Rate (CER) | Percentage of incorrectly transcribed characters | 0-100% | 0% |
| Exact Match Accuracy | Percentage of perfectly transcribed documents | 0-100% | 100% |
| Execution Time | Average processing time per document | Seconds | Lower |
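For readers who want to see what the error-rate metrics measure, the following dependency-free sketch computes them from the Levenshtein edit distance. It illustrates the definitions only and is not Palladia's implementation; the function names are hypothetical.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word errors divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character errors divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def exact_match_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Share of (reference, hypothesis) pairs that are identical."""
    return sum(ref == hyp for ref, hyp in pairs) / len(pairs)
```

These functions return ratios, so multiply by 100 to get the percentages shown in the table. Note that the raw ratio can exceed 1.0 when a transcription contains many extra words or characters; whether Palladia caps the value is not stated here.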
Palladia computes these metrics with established Python libraries. Word Error Rate (WER) and Character Error Rate (CER) are computed using the jiwer library, while character-level differences and accuracy scoring are handled by diff_match_patch. Together, these tools show where models succeed or fail at both the word and character level.
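The kind of character-level comparison described here can be sketched with Python's standard `difflib` module. This is a stand-in chosen to keep the example dependency-free, not the diff_match_patch calls Palladia uses, and `char_diff` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def char_diff(reference: str, hypothesis: str) -> list[tuple[str, str, str]]:
    """List differing spans as (operation, reference_text, hypothesis_text)."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    return [
        (tag, reference[i1:i2], hypothesis[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replacements, insertions, and deletions
    ]


# A long s (ſ) misread as "f" is a typical error on early prints.
print(char_diff("Weiſheit", "Weifheit"))
```

Listing only the differing spans makes recurring confusions, such as the long s above, easy to count across a whole benchmark run.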
If you use Palladia in your research, please cite:
@software{palladia2025,
  title={Palladia: A Benchmarking Tool for Vision Language Models on Historical Document OCR},
  author={Federico Dassiè},
  year={2025},
  url={https://github.com/dassoo/Palladia}
}
This project is licensed under the MIT License - see the LICENSE.txt file for details.

