LayoutTranslateBench

The first public benchmark for document translation that scores layout fidelity and reading order alongside translation quality.

Today's translation tools either translate plain text well (DeepL, Google Translate) or translate documents while destroying their layout (DeepL Documents, Google Translate documents, ChatGPT vision). LayoutTranslateBench (LTB) is the first benchmark that measures both at once — so the next generation of layout-preserving translation tools has a single, objective number to optimize.

TL;DR

59 documents across 9 categories (v0.1.7; v0.2 scales to 200+)
16 language pairs — en-es, en-de, en-zh, en-ar, en-ja, en-fr, en-th, en-ms, en-ru, en-ko, en-vi, en-id, en-ur, en-uz, en-kk, en-zh-tw
Single composite score — LTB-100 — combining chrF (text quality), layout IoU, and reading-order Kendall τ
Dual metric leaderboard — chrF (fast, deterministic) + COMET-Kiwi-22 (neural, reference-free)
Open code (Apache-2.0) and open dataset (CC-BY-4.0 / CC-BY-SA-4.0 per entry)
Reproducibility-first — every submission records runtime, cost, hardware, and config

Current results (v0.1.7.1 · chrF · oracle-layout)

System	LTB-100 [95% CI]	chrF	Coverage
DeepL Text API	78.20 [75.2, 81.7]	56.40	6/16
NLLB-200-distilled-600M	73.71 [72.5, 74.9]	47.61	16/16
Helsinki-NLP/opus-mt	68.60 [67.1, 70.0]	37.26	16/16
identity-baseline	50.38 [50.2, 50.6]	0.62	16/16

Full leaderboard (including COMET-Kiwi-22): lawrenzho-bit.github.io/LayoutTranslateBench/leaderboard/

Key finding: NLLB is the best open-weight system at full 16/16 pair coverage. opus-mt is competitive on European pairs and commercial-safe (Apache-2.0), but effectively fails en-ko (chrF 2.6) and en-uz (chrF 0.04). The COMET-Kiwi–chrF gap is largest for non-Latin scripts (NLLB en-kk: COMET 83.5 vs chrF 52.2), validating the dual-metric design.

Why this benchmark exists

Document translation is a $50B/year problem hiding inside the larger localization industry. Immigration paperwork, certified legal translations, scientific paper localization, corporate compliance docs, and e-commerce listings all need translation that preserves the original document's appearance. Existing tools do not do this well — and there is no public scoreboard, so no incentive to improve.

LTB fixes that. If your translation system can produce a target-language document that looks like the source and reads the same way, you score high. If you flatten everything to plain text, you don't.

Install

# Run from the root of a local clone of the repository
pip install -e .

Requires Python 3.10+. Core scoring runs CPU-only with no heavyweight ML dependencies.

Quick start

# Verify the dataset
ltbench verify

# Score the identity baseline (returns source text unchanged)
ltbench score --submission submissions/identity-baseline --output results/identity-baseline.json

# Rebuild the leaderboard HTML
ltbench leaderboard

Repository layout

document-parser/
├── BENCHMARK.md              # Benchmark spec — the citeable artifact
├── LEADERBOARD.md            # Current results (mirror of leaderboard/index.html)
├── llms.txt                  # GEO root file for LLM crawlers
├── ltbench/                  # Python package
│   ├── schemas.py            # Pydantic models for manifest, annotation, submission
│   ├── metrics/              # chrF, layout IoU, reading-order τ, COMET-Kiwi, composite
│   ├── runners/              # System adapters (identity, deepl, nllb, opus-mt, ...)
│   ├── dataset/              # Manifest loader and validator
│   ├── leaderboard/          # Static HTML generator
│   └── cli.py                # ltbench command
├── data/
│   ├── manifest.json         # 59-doc index (v0.1.7)
│   ├── annotations/          # Author-curated ground-truth JSONs (doc_001–025)
│   ├── flores_derived/       # FLORES-200 derived docs + CC-BY-SA-4.0 LICENSE
│   └── rileykim_derived/     # ML-curated expansion docs (Apache-2.0)
├── submissions/              # System submissions (one subdir per system)
├── results/                  # Scored result JSONs (input to leaderboard)
├── leaderboard/              # Generated static site
└── docs/                     # Methodology, submission guide, FAQ, citation

How the metrics compose

LTB-100 (v0.1) = 100 × ( 0.50 × chrF/100  +  0.30 × IoU  +  0.20 × Kendall-τ )
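
The weighting can be checked with a few lines of Python. This is a minimal sketch of the formula only: the function name `ltb100` and its argument names are illustrative assumptions, not the `ltbench` package API. It assumes chrF is on a 0-100 scale and that IoU and τ are already in [0, 1].

```python
def ltb100(chrf: float, iou: float, tau: float) -> float:
    """Composite per the v0.1 formula: chrF in [0, 100], IoU and tau in [0, 1].

    Hypothetical helper for illustration; not part of the ltbench package.
    """
    if not 0.0 <= chrf <= 100.0:
        raise ValueError("chrf must be in [0, 100]")
    if not (0.0 <= iou <= 1.0 and 0.0 <= tau <= 1.0):
        raise ValueError("iou and tau must be in [0, 1]")
    return 100.0 * (0.50 * chrf / 100.0 + 0.30 * iou + 0.20 * tau)


# With perfect layout and reading order (iou = tau = 1) the formula
# reduces to 50 + chrF / 2.
print(round(ltb100(56.40, 1.0, 1.0), 2))  # 78.2
```

If the oracle-layout setting gives every system IoU = 1 and τ = 1 (an assumption, not something stated here), a system with near-zero chrF would land near 50, which appears consistent with the identity-baseline row in the results table. Reported scores are aggregated over documents, so they need not match this back-of-envelope figure exactly.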

chrF — character-level F-score (F₂, n-grams 1..6) of predicted vs reference translation, per region, area-weighted; language-detection gate prevents Latin-bleed on non-Latin targets
Layout IoU — mean intersection-over-union of predicted vs ground-truth bounding boxes
Kendall τ — coverage-aware, normalized to [0, 1]; measures how well the system preserves source reading order
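
The two layout components can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in `ltbench/metrics/`: boxes are axis-aligned `(x0, y0, x1, y1)` tuples paired by index, region ids are unique strings, and "coverage-aware" is read here as scaling the normalized τ by the fraction of reference regions the system returned. All function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1), x0 <= x1 and y0 <= y1


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pred: Sequence[Box], truth: Sequence[Box]) -> float:
    """Mean IoU over boxes paired by index (index pairing is an assumption)."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    if not truth:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(pred, truth)) / len(truth)


def normalized_tau(pred_order: Sequence[str], ref_order: Sequence[str]) -> float:
    """Kendall tau-a over shared region ids, mapped to [0, 1], scaled by coverage.

    The coverage scaling is one hypothetical reading of "coverage-aware".
    """
    if not ref_order:
        return 0.0
    pred_rank = {rid: i for i, rid in enumerate(pred_order)}
    shared = [rid for rid in ref_order if rid in pred_rank]  # kept in reference order
    coverage = len(shared) / len(ref_order)
    pairs = list(combinations(shared, 2))
    if not pairs:
        return coverage  # fewer than two shared regions: nothing to misorder
    concordant = sum(1 for a, b in pairs if pred_rank[a] < pred_rank[b])
    tau = (2 * concordant - len(pairs)) / len(pairs)  # tau-a in [-1, 1]
    return coverage * (tau + 1.0) / 2.0
```

For example, `normalized_tau(["a", "c", "b"], ["a", "b", "c"])` has two concordant pairs out of three, so τ = 1/3 and the normalized value is 2/3. A fully reversed order gives 0 and an identical order gives 1.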

Optional second metric: COMET-Kiwi-22, a reference-free neural quality-estimation (QE) model, can substitute for chrF in the composite, producing a parallel leaderboard. Rankings differ most on non-Latin pairs. See docs/comet-setup.md.

Full details: docs/methodology.md · docs/methodology-roadmap.md.

Submit a system

1. Run your system over data/manifest.json to produce one JSONL file per language pair.
2. Put the JSONL files in submissions/<your-system-name>/.
3. Run ltbench score --submission submissions/<your-system-name> to produce a result JSON.
4. Open a PR with the result JSON and the submission directory.
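
Writing the per-pair JSONL files might look like the sketch below. The record fields (`doc_id`, `region_id`, `bbox`, `order`, `text`), the `<pair>.jsonl` file naming, and the `my-system` directory name are all hypothetical placeholders chosen for illustration; the actual schema is whatever docs/submission.md and ltbench/schemas.py define.

```python
import json
import tempfile
from pathlib import Path


def write_submission(out_dir: Path, records_by_pair: dict[str, list[dict]]) -> list[Path]:
    """Write one JSONL file per language pair, e.g. out_dir/en-es.jsonl.

    File naming and record shape are assumptions, not the official format.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pair, records in sorted(records_by_pair.items()):
        path = out_dir / f"{pair}.jsonl"
        with path.open("w", encoding="utf-8") as fh:
            for record in records:
                # ensure_ascii=False keeps non-Latin target text readable on disk
                fh.write(json.dumps(record, ensure_ascii=False) + "\n")
        written.append(path)
    return written


# Hypothetical record: one translated region of one document.
example = {
    "en-es": [
        {"doc_id": "doc_001", "region_id": "r1", "bbox": [72, 90, 540, 130],
         "order": 0, "text": "Solicitud de visado"},
    ],
}

if __name__ == "__main__":
    # Write into a temporary directory so the example leaves no files behind.
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_submission(Path(tmp) / "my-system", example)
        print([p.name for p in paths])
```

Keeping one JSON object per line makes it easy to append regions as a system processes documents and to inspect a submission with ordinary text tools.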

Full guide: docs/submission.md.

Related work

Project	What it does	What LTB adds
OmniDocBench	Document parsing / OCR — extraction, not translation	Translation + layout scoring
socOCRbench	Multi-region OCR quality	Layout-preserved output, not just extraction
WMT shared tasks	Text-only translation quality	Document layout as a first-class signal
OmniDoc-TokenBench	VAE reconstruction of text-heavy documents	End-to-end translation evaluation

License

Code — Apache-2.0 (LICENSE)
Dataset — CC-BY-4.0 for author-curated and rileykim-derived docs; CC-BY-SA-4.0 for FLORES-derived docs (data/flores_derived/LICENSE). Licenses recorded per entry in data/manifest.json.

Citation

See docs/citation.md. The short form:

@misc{ltbench2026,
  title = {LayoutTranslateBench: A Benchmark for Document Translation with Layout Preservation},
  year  = {2026},
  url   = {https://github.com/Lawrenzho-bit/LayoutTranslateBench}
}
