
[Benchmark] Evaluate open-source PDF extractors against ground-truth agronomic papers #1

Description

@koolgax99

Summary

We have gold-standard agronomic papers with manually verified extractions. This issue tracks the work of building a benchmark harness that evaluates candidate open-source document extractors against those ground-truth annotations, producing a reproducible accuracy report to inform our adapter layer design.

Motivation

The adapter layer (see #[related issue]) needs to produce structured document objects — section-classified blocks, structured tables, and figure references — from raw PDFs. Before committing to Marker as the extraction backend, we should have empirical evidence of where it (and alternatives) succeed or fail on our specific document type: multi-column scientific agronomic papers with complex table structures.

Scope

Extractors to evaluate (starting set, open to additions):

| Extractor | Notes |
| --- | --- |
| Marker | Primary candidate; JSON block tree output |
| Docling | IBM; strong table handling |
| PyMuPDF4LLM | Lightweight, Markdown output |
| Nougat | Academic-focused, LaTeX-aware |
| Unstructured | General-purpose, broad format support |

Add or remove candidates by commenting below.

Metrics

Section classification

  • Accuracy = correct section labels / total blocks (per paper, then macro-averaged)
  • A block is "correct" if its assigned label matches the gold label exactly
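
A minimal sketch of the section-classification metric, assuming the extractor's blocks have already been aligned one-to-one with the gold blocks (the alignment step itself is not specified in this issue). The function names and the example labels are hypothetical.

```python
def section_accuracy(gold_labels, predicted_labels):
    """Fraction of blocks whose predicted label equals the gold label exactly."""
    if len(gold_labels) != len(predicted_labels):
        # Assumption: blocks are pre-aligned, so both lists have the same length.
        raise ValueError("gold and predicted blocks must be aligned one-to-one")
    if not gold_labels:
        raise ValueError("a paper needs at least one gold block")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def macro_average(per_paper_scores):
    """Unweighted mean of per-paper scores, so every paper counts equally."""
    return sum(per_paper_scores.values()) / len(per_paper_scores)


# Hypothetical labels for two papers.
scores = {
    "paper_a": section_accuracy(
        ["methods", "results", "results", "discussion"],
        ["methods", "results", "methods", "discussion"],
    ),
    "paper_b": section_accuracy(["abstract", "methods"], ["abstract", "methods"]),
}
overall = macro_average(scores)
```

Macro-averaging keeps a long paper with many blocks from dominating the aggregate, which matches the "per paper, then macro-averaged" wording above.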

Table extraction

  • Header exact match: all header cells match gold (binary per table)
  • Row recall: fraction of gold rows present in extracted output (fuzzy cell match, threshold TBD)
  • Footnote recall: fraction of gold footnotes recovered
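
The three table metrics could be sketched as follows. This is only an illustration: the fuzzy cell comparison uses `difflib.SequenceMatcher` from the standard library as a stand-in, the threshold is an explicit parameter with no default because it is still TBD, and treating footnotes as normalized exact matches is an assumption rather than something this issue decides. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(cell):
    """Case-fold and collapse whitespace before comparing cells."""
    return " ".join(str(cell).split()).casefold()


def header_exact_match(gold_header, extracted_header):
    """Binary per table: every header cell must match after normalization."""
    return [normalize(c) for c in gold_header] == [normalize(c) for c in extracted_header]


def rows_match(gold_row, extracted_row, threshold):
    """Rows match if they have the same width and every cell is similar enough."""
    if len(gold_row) != len(extracted_row):
        return False
    return all(
        SequenceMatcher(None, normalize(g), normalize(e)).ratio() >= threshold
        for g, e in zip(gold_row, extracted_row)
    )


def row_recall(gold_rows, extracted_rows, threshold):
    """Fraction of gold rows found; each extracted row can be used only once."""
    if not gold_rows:
        raise ValueError("row recall is undefined for a table with no gold rows")
    unused = list(extracted_rows)
    found = 0
    for gold_row in gold_rows:
        for i, candidate in enumerate(unused):
            if rows_match(gold_row, candidate, threshold):
                found += 1
                del unused[i]
                break
    return found / len(gold_rows)


def footnote_recall(gold_footnotes, extracted_footnotes):
    """Assumption: a footnote is recovered if its normalized text appears verbatim."""
    if not gold_footnotes:
        raise ValueError("footnote recall is undefined with no gold footnotes")
    extracted = {normalize(f) for f in extracted_footnotes}
    return sum(normalize(f) in extracted for f in gold_footnotes) / len(gold_footnotes)


# Hypothetical table: the second extracted row contains an OCR-style typo.
gold_rows = [["Maize", "8.4"], ["Wheat", "5.1"]]
extracted_rows = [["maize", "8.4"], ["Wheet", "5.1"]]
strict = row_recall(gold_rows, extracted_rows, threshold=0.9)
lenient = row_recall(gold_rows, extracted_rows, threshold=0.75)
```

The example shows why the threshold matters: the same extraction scores differently under a strict and a lenient cutoff, so the chosen value belongs in SCORING.md.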

Trait extraction proxy

  • Field-level exact match: for each (trait, value, unit) triple in gold, check exact string match in extractor output
  • F1 over the full triple set per paper

Note: "Exact match" means case-folded, whitespace-normalized string equality. Define edge cases (e.g., 8.40 vs 8.4) in benchmark/SCORING.md before running.
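
As a rough sketch of the normalization rule and the triple-level F1, assuming the extractor output has already been parsed into (trait, value, unit) string triples (that parsing step is not defined in this issue). The function names and sample triples are hypothetical.

```python
def normalize(text):
    """Case-folded, whitespace-normalized form used for exact matching."""
    return " ".join(text.split()).casefold()


def triple_f1(gold_triples, extracted_triples):
    """F1 over sets of normalized (trait, value, unit) triples for one paper."""
    gold = {tuple(normalize(field) for field in t) for t in gold_triples}
    pred = {tuple(normalize(field) for field in t) for t in extracted_triples}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical triples: the first differs only in case and spacing,
# the second differs numerically as a string ("210" vs "210.0").
gold = [("Grain yield", "8.4", "t/ha"), ("Plant height", "210", "cm")]
extracted = [("grain  yield", "8.4", "T/HA"), ("Plant height", "210.0", "cm")]
score = triple_f1(gold, extracted)
```

Under this plain string rule, `210` and `210.0` count as a mismatch, which is the same kind of edge case as `8.40` vs `8.4` and needs an explicit decision in SCORING.md.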

Deliverables

  • benchmark/ directory with:
    • run_extractor.py — CLI that takes --extractor <name> and --paper <id>, writes normalized output JSON to results/
    • score.py — reads results/ against gold/, prints per-paper and aggregate metrics
    • SCORING.md — documents all normalization rules and tie-breaking decisions
  • results/summary_table.md — filled-in comparison table (one row per extractor × paper)
  • Short written findings (can be PR description): where does each extractor break down, and what does that imply for the adapter layer?
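
One possible shape for the `run_extractor.py` CLI, using only the standard library. The `--extractor` and `--paper` flags come from the deliverables above; the `--results-dir` flag, the `results/<extractor>/<paper>.json` layout, and the keys of the output JSON are assumptions made for illustration, and the actual extractor call is left as a placeholder.

```python
import argparse
import json
from pathlib import Path

# Starting set from this issue; extend as candidates are added.
EXTRACTORS = ("marker", "docling", "pymupdf4llm", "nougat", "unstructured")


def build_parser():
    parser = argparse.ArgumentParser(description="Run one extractor on one paper.")
    parser.add_argument("--extractor", required=True, choices=EXTRACTORS)
    parser.add_argument("--paper", required=True, help="paper id in the gold set")
    # Hypothetical flag: lets tests write somewhere other than results/.
    parser.add_argument("--results-dir", type=Path, default=Path("results"))
    return parser


def write_result(results_dir, extractor, paper_id, normalized):
    """Write normalized output as JSON; the directory layout is an assumption."""
    out_path = results_dir / extractor / f"{paper_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(
        json.dumps(normalized, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return out_path


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Placeholder: call the real extractor adapter here and normalize its output.
    normalized = {
        "extractor": args.extractor,
        "paper": args.paper,
        "blocks": [],
        "tables": [],
        "figures": [],
    }
    out_path = write_result(args.results_dir, args.extractor, args.paper, normalized)
    print(out_path)
    return out_path
```

In the real script, `main()` would be called under the usual `if __name__ == "__main__":` guard. Keeping one JSON file per extractor and paper lets `score.py` iterate over `results/` without re-running any extractor.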

Status: In Progress
