Summary
We have gold-standard agronomic papers with manually verified extractions. This issue tracks the work of building a benchmark harness that evaluates candidate open-source document extractors against those ground-truth annotations, producing a reproducible accuracy report to inform our adapter layer design.
Motivation
The adapter layer (see #[related issue]) needs to produce structured document objects — section-classified blocks, structured tables, and figure references — from raw PDFs. Before committing to Marker as the extraction backend, we should have empirical evidence of where it (and alternatives) succeed or fail on our specific document type: multi-column scientific agronomic papers with complex table structures.
Scope
Extractors to evaluate (starting set, open to additions):
| Extractor | Notes |
|---|---|
| Marker | Primary candidate; JSON block tree output |
| Docling | IBM; strong table handling |
| PyMuPDF4LLM | Lightweight, Markdown output |
| Nougat | Academic-focused, LaTeX-aware |
| Unstructured | General-purpose, broad format support |
Add or remove candidates by commenting below.
Metrics
Section classification
- Accuracy = correct section labels / total blocks (per paper, then macro-averaged)
- A block is "correct" if its assigned label matches the gold label exactly
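A minimal sketch of the per-paper accuracy and its macro average. It assumes gold and extracted blocks have already been aligned one-to-one, which the issue does not yet specify; the function names and the sample labels are illustrative only.

```python
from statistics import mean


def paper_accuracy(gold_labels, predicted_labels):
    """Fraction of blocks whose predicted label equals the gold label exactly."""
    # Assumption: both lists describe the same blocks in the same order.
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted blocks must be aligned one-to-one")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def macro_accuracy(papers):
    """Average per-paper accuracies so that every paper counts equally."""
    return mean(paper_accuracy(gold, pred) for gold, pred in papers.values())


# Illustrative data, not taken from the gold set.
papers = {
    "paper_a": (
        ["abstract", "methods", "results", "results"],
        ["abstract", "methods", "results", "discussion"],
    ),
    "paper_b": (["methods", "results"], ["methods", "results"]),
}
print(macro_accuracy(papers))
```

Macro-averaging keeps a long paper with many blocks from dominating the aggregate score; a micro average (pooling all blocks) would weight papers by length instead.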
Table extraction
- Header exact match: all header cells match gold (binary per table)
- Row recall: fraction of gold rows present in extracted output (fuzzy cell match, threshold TBD)
- Footnote recall: fraction of gold footnotes recovered
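The three table metrics could be sketched as below. The fuzzy threshold is still TBD, so `FUZZY_THRESHOLD = 0.9` is only a placeholder. Using `difflib.SequenceMatcher` as the similarity measure, requiring equal row lengths, and matching footnotes by normalized equality are all assumptions, not decisions recorded in this issue.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.9  # placeholder value; the real threshold is TBD


def normalize(cell):
    """Case-fold and collapse whitespace."""
    return " ".join(str(cell).casefold().split())


def header_exact_match(gold_header, extracted_header):
    """Binary per table: every header cell must match after normalization."""
    return [normalize(c) for c in gold_header] == [normalize(c) for c in extracted_header]


def cells_match(gold_cell, extracted_cell, threshold=FUZZY_THRESHOLD):
    """Fuzzy cell comparison using a similarity ratio in [0, 1]."""
    matcher = SequenceMatcher(None, normalize(gold_cell), normalize(extracted_cell))
    return matcher.ratio() >= threshold


def rows_match(gold_row, extracted_row, threshold=FUZZY_THRESHOLD):
    """Assumption: a row matches only if it has the same number of cells."""
    return len(gold_row) == len(extracted_row) and all(
        cells_match(g, e, threshold) for g, e in zip(gold_row, extracted_row)
    )


def row_recall(gold_rows, extracted_rows, threshold=FUZZY_THRESHOLD):
    """Fraction of gold rows found in the output; each extracted row is used once."""
    if not gold_rows:
        return 1.0
    unused = list(extracted_rows)
    found = 0
    for gold_row in gold_rows:
        for i, candidate in enumerate(unused):
            if rows_match(gold_row, candidate, threshold):
                found += 1
                del unused[i]
                break
    return found / len(gold_rows)


def footnote_recall(gold_footnotes, extracted_footnotes):
    """Fraction of gold footnotes recovered (normalized equality, an assumption)."""
    if not gold_footnotes:
        return 1.0
    extracted = {normalize(f) for f in extracted_footnotes}
    return sum(normalize(f) in extracted for f in gold_footnotes) / len(gold_footnotes)
```

Consuming each extracted row at most once prevents a single extracted row from being credited against several near-identical gold rows.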
Trait extraction proxy
- Field-level exact match: for each `(trait, value, unit)` triple in gold, check exact string match in extractor output
- F1 over the full triple set per paper
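One way to compute the per-paper F1 over triples is sketched here. It assumes the extractor output has already been parsed into `(trait, value, unit)` tuples, a step this issue does not define, and it treats a triple as correct only if all three fields match after normalization.

```python
def normalize(text):
    """Case-fold and collapse whitespace."""
    return " ".join(str(text).casefold().split())


def triple_f1(gold_triples, extracted_triples):
    """F1 over the set of (trait, value, unit) triples for one paper."""
    gold = {tuple(normalize(x) for x in t) for t in gold_triples}
    pred = {tuple(normalize(x) for x in t) for t in extracted_triples}
    if not gold and not pred:
        return 1.0  # nothing to find and nothing reported
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative triples, not taken from the gold set.
gold = [("grain yield", "8.4", "t/ha"), ("plant height", "92", "cm")]
pred = [("Grain Yield", "8.4", "t/ha"), ("plant height", "95", "cm"), ("tiller count", "4", "")]
print(round(triple_f1(gold, pred), 3))
```

In this example one of three extracted triples is correct (precision 1/3) and one of two gold triples is recovered (recall 1/2), so the F1 is 0.4.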
Note: "Exact match" means case-folded, whitespace-normalized string equality. Define edge cases (e.g., `8.40` vs `8.4`) in `benchmark/SCORING.md` before running the benchmark.
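To make the edge case concrete, the sketch below shows the stated normalization next to one hypothetical rule for numeric values. Treating numerically equal strings as equal (via `decimal.Decimal`) is only one candidate policy; whether to adopt it is exactly what `SCORING.md` should decide.

```python
from decimal import Decimal, InvalidOperation


def normalize_text(value):
    """The rule stated in the note: case-fold and collapse whitespace."""
    return " ".join(value.casefold().split())


def normalize_numeric(value):
    """Hypothetical rule: compare finite numbers by value, everything else as text."""
    text = normalize_text(value)
    try:
        number = Decimal(text)
    except InvalidOperation:
        return text  # not a number, fall back to the text rule
    # NaN and infinities parse as Decimal but should not be compared by value.
    return number if number.is_finite() else text


# Under the plain text rule the two strings differ...
print(normalize_text("8.40") == normalize_text("8.4"))
# ...while the hypothetical numeric rule treats them as equal.
print(normalize_numeric("8.40") == normalize_numeric("8.4"))
```

A numeric rule like this discards reported precision, so a paper that states `8.40` and an extractor that outputs `8.4` would score as a match; if reported precision matters, the stricter text rule is the safer default.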
Deliverables
`benchmark/` directory with:
- `run_extractor.py`: CLI that takes `--extractor <name>` and `--paper <id>`, writes normalized output JSON to `results/`
- `score.py`: reads `results/` against `gold/`, prints per-paper and aggregate metrics
- `SCORING.md`: documents all normalization rules and tie-breaking decisions
- `results/summary_table.md`: filled-in comparison table (one row per extractor × paper)
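A possible skeleton for `run_extractor.py`, as a starting point only. The `--extractor` and `--paper` flags come from the list above. The `--results-dir` flag, the `results/<extractor>/<paper>.json` layout, the shape of the output JSON, and the `run_extractor` placeholder are assumptions for illustration; no real extractor is called here.

```python
import argparse
import json
from pathlib import Path

# Lower-cased names of the starting set; adjust as candidates are added or removed.
EXTRACTORS = ("marker", "docling", "pymupdf4llm", "nougat", "unstructured")


def build_parser():
    parser = argparse.ArgumentParser(description="Run one extractor on one paper.")
    parser.add_argument("--extractor", required=True, choices=EXTRACTORS)
    parser.add_argument("--paper", required=True, help="paper id in the gold set")
    # Hypothetical flag so the output location can be overridden in tests.
    parser.add_argument("--results-dir", type=Path, default=Path("results"))
    return parser


def run_extractor(extractor, paper_id):
    """Placeholder: the backend-specific call and normalization would go here."""
    return {"extractor": extractor, "paper": paper_id, "blocks": [], "tables": []}


def main(argv=None):
    args = build_parser().parse_args(argv)
    output = run_extractor(args.extractor, args.paper)
    # Assumed layout: results/<extractor>/<paper>.json
    out_path = args.results_dir / args.extractor / f"{args.paper}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(output, indent=2), encoding="utf-8")
    return out_path
```

Calling `main(["--extractor", "marker", "--paper", "paper_001"])` writes a placeholder JSON file and returns its path; invoking `main()` under an `if __name__ == "__main__":` guard would turn the module into the CLI described above. Keeping one file per extractor and paper lets `score.py` rescore without rerunning extraction.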