[Benchmark] Evaluate open-source PDF extractors against ground-truth agronomic papers

<h2>Summary</h2>
<p>We have gold-standard agronomic papers with manually verified extractions. This issue tracks the work of building a benchmark harness that evaluates candidate open-source document extractors against those ground-truth annotations, producing a reproducible accuracy report to inform our adapter layer design.</p>
<h2>Motivation</h2>
<p>The adapter layer (see #[related issue]) needs to produce structured document objects — section-classified blocks, structured tables, and figure references — from raw PDFs. Before committing to Marker as the extraction backend, we should have empirical evidence of where it (and alternatives) succeed or fail on <em>our specific document type</em>: multi-column scientific agronomic papers with complex table structures.</p>
<h2>Scope</h2>
<p><strong>Extractors to evaluate</strong> (starting set, open to additions):</p>

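<table>
<thead>
<tr><th>Extractor</th><th>Notes</th></tr>
</thead>
<tbody>
<tr><td>Marker</td><td>Primary candidate; JSON block tree output</td></tr>
<tr><td>Docling</td><td>IBM; strong table handling</td></tr>
<tr><td>PyMuPDF4LLM</td><td>Lightweight, Markdown output</td></tr>
<tr><td>Nougat</td><td>Academic-focused, LaTeX-aware</td></tr>
<tr><td>Unstructured</td><td>General-purpose, broad format support</td></tr>
</tbody>
</table>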


<p>Add or remove candidates by commenting below.</p>

<h2>Metrics</h2>
<h3>Section classification</h3>
<ul>
<li><strong>Accuracy</strong> = number of blocks with the correct section label / total number of blocks (computed per paper, then macro-averaged across papers)</li>
<li>A block is "correct" if its assigned label matches the gold label exactly</li>
</ul>
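<p>The accuracy definition above could be sketched roughly as follows. This is a minimal illustration, not the harness itself: it assumes that extracted blocks have already been aligned one-to-one with gold blocks (the issue does not say how alignment is done), and the function names and sample labels are hypothetical.</p>

```python
from statistics import mean


def paper_accuracy(gold_labels, predicted_labels):
    """Fraction of blocks whose predicted label equals the gold label exactly."""
    # Assumption: blocks are already aligned one-to-one with the gold blocks.
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted blocks must be aligned one-to-one")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


def macro_accuracy(papers):
    """Average per-paper accuracies so that every paper counts equally."""
    return mean(paper_accuracy(gold, pred) for gold, pred in papers.values())


# Hypothetical sample data: paper id -> (gold labels, predicted labels).
papers = {
    "paper_a": (
        ["abstract", "methods", "results", "results"],
        ["abstract", "methods", "results", "discussion"],
    ),
    "paper_b": (["methods", "results"], ["methods", "results"]),
}

print(macro_accuracy(papers))  # 0.875
```

<p>Macro-averaging keeps a long paper with many blocks from dominating the aggregate score; a micro-average over all blocks would weight papers by length instead.</p>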
<h3>Table extraction</h3>
<ul>
<li><strong>Header exact match</strong>: all header cells match gold (binary per table)</li>
<li><strong>Row recall</strong>: fraction of gold rows present in extracted output (fuzzy cell match, threshold TBD)</li>
<li><strong>Footnote recall</strong>: fraction of gold footnotes recovered</li>
</ul>
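<p>A rough sketch of how the three table metrics might be computed. Several things here are assumptions rather than decisions made in this issue: the <code>0.9</code> threshold is only a placeholder (the real value is still TBD), fuzzy matching uses <code>difflib.SequenceMatcher</code> from the standard library, footnotes are compared by normalized equality, and an empty gold set scores 1.0. All function names are hypothetical.</p>

```python
from difflib import SequenceMatcher

PLACEHOLDER_THRESHOLD = 0.9  # assumption: the real threshold is still TBD


def norm(cell):
    """Case-fold and collapse whitespace."""
    return " ".join(str(cell).split()).casefold()


def header_exact_match(gold_header, pred_header):
    """Binary per table: every header cell must match after normalization."""
    return [norm(c) for c in gold_header] == [norm(c) for c in pred_header]


def cells_match(gold_cell, pred_cell, threshold):
    """Fuzzy cell match based on a similarity ratio in [0, 1]."""
    ratio = SequenceMatcher(None, norm(gold_cell), norm(pred_cell)).ratio()
    return ratio >= threshold


def rows_match(gold_row, pred_row, threshold):
    """Rows match when they have the same width and every cell matches."""
    return len(gold_row) == len(pred_row) and all(
        cells_match(g, p, threshold) for g, p in zip(gold_row, pred_row)
    )


def row_recall(gold_rows, pred_rows, threshold=PLACEHOLDER_THRESHOLD):
    """Fraction of gold rows found in the output; each output row is used once."""
    if not gold_rows:
        return 1.0  # assumption: nothing to recover counts as full recall
    unused = list(pred_rows)
    found = 0
    for gold_row in gold_rows:
        for i, pred_row in enumerate(unused):
            if rows_match(gold_row, pred_row, threshold):
                found += 1
                del unused[i]
                break
    return found / len(gold_rows)


def footnote_recall(gold_footnotes, pred_footnotes):
    """Fraction of gold footnotes recovered (normalized equality assumed)."""
    if not gold_footnotes:
        return 1.0
    recovered = {norm(f) for f in pred_footnotes}
    return sum(norm(f) in recovered for f in gold_footnotes) / len(gold_footnotes)
```

<p>Consuming each extracted row at most once prevents a single output row from being credited against several gold rows.</p>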
<h3>Trait extraction proxy</h3>
<ul>
<li><strong>Field-level exact match</strong>: for each <code>(trait, value, unit)</code> triple in gold, check each field for an exact string match in the extractor output</li>
<li><strong>F1</strong> over the full triple set per paper (precision over extracted triples, recall over gold triples)</li>
</ul>
<blockquote>
<p><strong>Note:</strong> "Exact match" means case-folded, whitespace-normalized string equality. Define edge cases (e.g., <code>8.40</code> vs <code>8.4</code>) in <code>benchmark/SCORING.md</code> before running.</p>
</blockquote>
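<p>One possible reading of the "exact match" rule and the per-paper F1, shown as a sketch. It assumes the extractor output has already been reduced to a list of <code>(trait, value, unit)</code> triples by some earlier step that is not specified here; the function names and sample triples are hypothetical. Under this plain string normalization, <code>8.40</code> and <code>8.4</code> do not match, which is exactly the kind of edge case <code>SCORING.md</code> would need to settle.</p>

```python
def normalize(text):
    """Case-fold and collapse runs of whitespace into single spaces."""
    return " ".join(str(text).split()).casefold()


def triple_f1(gold_triples, pred_triples):
    """F1 over normalized (trait, value, unit) triples for one paper."""
    # Sets are used, so duplicate triples within one paper count once.
    gold = {tuple(normalize(field) for field in t) for t in gold_triples}
    pred = {tuple(normalize(field) for field in t) for t in pred_triples}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample data.
gold = [("Grain yield", "8.4", "t/ha"), ("Plant height", "210", "cm")]
pred = [
    ("grain  yield", "8.4", "T/HA"),  # matches after normalization
    ("Plant height", "2l0", "cm"),    # OCR-style error: "l" instead of "1"
    ("Protein", "9", "%"),            # not in gold
]

score = triple_f1(gold, pred)  # precision 1/3, recall 1/2
```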
<h2>Deliverables</h2>
<ul>
<li>[ ] <code>benchmark/</code> directory with:
<ul>
<li>[ ] <code>run_extractor.py</code> — CLI that takes <code>--extractor &lt;name&gt;</code> and <code>--paper &lt;id&gt;</code>, writes normalized output JSON to <code>results/</code></li>
<li>[ ] <code>score.py</code> — reads <code>results/</code> against <code>gold/</code>, prints per-paper and aggregate metrics</li>
<li>[ ] <code>SCORING.md</code> — documents all normalization rules and tie-breaking decisions</li>
</ul>
</li>
<li>[ ] <code>results/summary_table.md</code> — filled-in comparison table (one row per extractor × paper)</li>
<li>[ ] Short written findings (can be PR description): where does each extractor break down, and what does that imply for the adapter layer?</li>
</ul>
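<p>As a starting point for <code>run_extractor.py</code>, the CLI described in the deliverables might look something like the skeleton below. It only illustrates the argument handling and the output write. The extractor calls are replaced by a stub because the real integrations are not defined yet, and the <code>--results-dir</code> flag, the output schema, and the <code>&lt;extractor&gt;__&lt;paper&gt;.json</code> file naming are all assumptions.</p>

```python
import argparse
import json
from pathlib import Path


def extract_stub(paper_id):
    """Placeholder for a real extractor call; returns an empty normalized record."""
    # Assumption: the normalized schema has these top-level keys.
    return {"paper_id": paper_id, "blocks": [], "tables": [], "figures": []}


# Every candidate maps to the stub until a real adapter is written for it.
EXTRACTORS = {
    name: extract_stub
    for name in ("marker", "docling", "pymupdf4llm", "nougat", "unstructured")
}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Run one extractor on one paper.")
    parser.add_argument("--extractor", required=True, choices=sorted(EXTRACTORS))
    parser.add_argument("--paper", required=True, help="paper id")
    # Hypothetical extra flag so the output location can be overridden.
    parser.add_argument("--results-dir", default="results")
    args = parser.parse_args(argv)

    output = EXTRACTORS[args.extractor](args.paper)
    output["extractor"] = args.extractor

    out_dir = Path(args.results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{args.extractor}__{args.paper}.json"
    out_path.write_text(json.dumps(output, indent=2, sort_keys=True), encoding="utf-8")
    return out_path

# In the real script, main() would be called under an
# `if __name__ == "__main__":` guard.
```

<p>Writing one file per extractor and paper keeps runs independent, so <code>score.py</code> can be rerun without repeating any extraction.</p>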

