
Commit 2e278f9

Merge pull request #2 from PSPDFKit/update-benchmarks-2026-04-02
Update benchmarks with fresh 200-doc evaluation
2 parents 99aa4b5 + 95dae6c commit 2e278f9

8 files changed

Lines changed: 58 additions & 49 deletions

README.md

Lines changed: 33 additions & 24 deletions
@@ -13,7 +13,7 @@

 Fast, accurate Markdown from PDFs — locally, with no cleanup required. Built for Claude, Codex, RAG pipelines, and document-heavy automation where noisy extraction burns tokens and makes downstream results less reliable.

-- **How fast is it?** — 0.008s per page. 176x faster than docling, 10x faster than opendataloader. ([benchmarks](#benchmarks))
+- **How fast is it?** — 0.007s per page. 90x faster than docling, 37x faster than pymupdf4llm. ([benchmarks](#benchmarks))
 - **How accurate is it?** — 0.92 reading order (best in class), 0.88 overall extraction accuracy, 0.81 heading detection. ([benchmarks](#benchmarks))
 - **Where do my PDFs go?** — Nowhere. The CLI runs locally. Your documents are not uploaded to Nutrient. ([trust & licensing](#trust-and-licensing))
 - **What does it cost?** — Free for up to 1,000 documents per calendar month. No license key, no signup, no API token. ([license](LICENSE.md))
@@ -86,45 +86,54 @@ When both arguments are directories, the CLI converts every PDF in the input dir

 ## Benchmarks

-Published benchmark values from [Nutrient's PDF-to-Markdown page](https://www.nutrient.io/ai/skills/pdf-to-markdown/), recorded on `AMD EPYC 9454`.
+Benchmark results from 200 PDF documents with hand-annotated Markdown ground truth, evaluated using NID (reading order), TEDS (table structure), and MHS (heading hierarchy) metrics. Benchmarked on `2026-04-02`.

 ### Visual Snapshot

-![Extraction accuracy benchmark](docs/assets/extraction-accuracy.svg)
+![Extraction accuracy](docs/assets/extraction-accuracy.png)

-![Extraction speed benchmark](docs/assets/extraction-speed.svg)
+![Reading order](docs/assets/reading-order.png)

-![Structure quality benchmark](docs/assets/structure-quality.svg)
+![Table structure](docs/assets/table-structure.png)

-![Relative speedup benchmark](docs/assets/faster-with-nutrient.svg)
+![Heading level](docs/assets/heading-level.png)
+
+![Extraction speed](docs/assets/extraction-speed.png)
+
+![Faster with Nutrient](docs/assets/faster-with-nutrient.png)

 ### Accuracy

-| Metric | Nutrient | Best competitor | MarkItDown |
-| --- | ---: | ---: | ---: |
-| Extraction accuracy | 0.88 | 0.89 (docling) | 0.58 |
-| Reading order (NID) | 0.92 | 0.91 | 0.88 |
-| Table structure (TEDS) | 0.66 | 0.93 (docling) | 0.00 |
-| Heading level (MHS) | 0.81 | 0.83 (docling) | 0.00 |
+| Solution | Overall | Reading Order (NID) | Table Structure (TEDS) | Heading Level (MHS) |
+| --- | ---: | ---: | ---: | ---: |
+| docling | **0.88** | 0.90 | **0.89** | **0.82** |
+| **Nutrient** | **0.88** | **0.92** | 0.66 | 0.81 |
+| opendataloader | 0.83 | 0.90 | 0.49 | 0.74 |
+| pymupdf4llm | 0.83 | 0.88 | 0.48 | 0.78 |
+| markitdown | 0.59 | 0.84 | 0.27 | 0.00 |
+| pypdf | 0.58 | 0.87 | 0.00 | 0.00 |
+| liteparse | 0.57 | 0.86 | 0.00 | 0.00 |

 ### Speed

 | Solution | Seconds per page |
 | --- | ---: |
-| Nutrient | 0.008 |
-| opendataloader | 0.056 |
-| markitdown | 0.058 |
-| pymupdf4llm | 0.083 |
-| opendataloader-hybrid | 1.412 |
-| docling | 1.473 |
+| **Nutrient** | **0.007** |
+| opendataloader | 0.014 |
+| pypdf | 0.019 |
+| markitdown | 0.106 |
+| liteparse | 0.233 |
+| pymupdf4llm | 0.252 |
+| docling | 0.618 |

 ### Faster with Nutrient

-- `176x` faster than `docling`
-- `172x` faster than `opendataloader-hybrid`
-- `10x` faster than `opendataloader`
-- `7x` faster than `pymupdf4llm`
-- `7x` faster than `markitdown`
+- `90x` faster than `docling`
+- `37x` faster than `pymupdf4llm`
+- `34x` faster than `liteparse`
+- `15x` faster than `markitdown`
+- `3x` faster than `pypdf`
+- `2x` faster than `opendataloader`

 For the full comparison table, see [docs/benchmarks.md](docs/benchmarks.md).

@@ -142,7 +151,7 @@ See [LICENSE.md](LICENSE.md) for the full terms and [docs/distribution-model.md]

 ### What makes this different from other PDF extractors?

-Speed and accuracy should not be a tradeoff. Most extractors are either fast but lose structure (markitdown, pymupdf4llm) or accurate but slow (docling). Nutrient extracts at 0.008s per page with strong reading order, heading, and table preservation — less cleanup, fewer wasted tokens, and more reliable downstream results.
+Speed and accuracy should not be a tradeoff. Most extractors are either fast but lose structure (markitdown, pymupdf4llm) or accurate but slow (docling). Nutrient extracts at 0.007s per page with strong reading order, heading, and table preservation — less cleanup, fewer wasted tokens, and more reliable downstream results.

 ### Do my documents leave my machine?

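
Reviewer note: the README hunks above cite NID as the reading-order metric but do not define it. In DP-Bench-style evaluations, NID usually denotes a normalized insertion/deletion (indel) score between predicted and reference text. The sketch below assumes that reading and reports it as a similarity in [0, 1], higher is better. The function names are hypothetical, and the real harness may tokenize or normalize text differently before scoring.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(predicted: str, reference: str) -> float:
    """1 minus the normalized indel distance (assumed meaning of NID)."""
    total = len(predicted) + len(reference)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    # Indel distance: insertions plus deletions needed to turn one string
    # into the other, which equals total length minus twice the LCS length.
    distance = total - 2 * lcs_length(predicted, reference)
    return 1.0 - distance / total


if __name__ == "__main__":
    # Swapping two sections lowers the score even though no text is lost.
    print(indel_similarity("Intro Methods Results", "Methods Intro Results"))
```

Because the score depends on the order of matched characters, output that keeps every word but emits blocks out of sequence is penalized, which is why a metric of this shape suits reading-order evaluation. The quadratic-time loop is fine for illustration; a production harness would likely use an optimized library.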
89.3 KB

docs/assets/extraction-speed.png

99.1 KB
67.1 KB

docs/assets/heading-level.png

88.5 KB

docs/assets/reading-order.png

93.5 KB

docs/assets/table-structure.png

91.9 KB

docs/benchmarks.md

Lines changed: 25 additions & 25 deletions
@@ -1,41 +1,41 @@
 # Benchmarks

-These values mirror the benchmark figures currently published on Nutrient's PDF-to-Markdown product page:
+Evaluated on 200 PDF documents with hand-annotated Markdown ground truth from the DP-Bench corpus.

-- Source: <https://www.nutrient.io/ai/skills/pdf-to-markdown/>
-- Snapshot date: `2026-04-01`
-- Hardware note on page: `Benchmark data recorded on AMD EPYC 9454`
+- Benchmark date: `2026-04-02`
+- Corpus: 200 documents with ground-truth Markdown annotations
+- Metrics: NID (reading order), TEDS (table structure), MHS (heading hierarchy)
+- All scores normalized to [0, 1] — higher is better

 ## Accuracy Metrics

 | Solution | Extraction accuracy | Reading order (NID) | Table structure (TEDS) | Heading level (MHS) |
 | --- | ---: | ---: | ---: | ---: |
-| Nutrient | 0.88 | 0.92 | 0.66 | 0.81 |
-| docling | 0.89 | 0.91 | 0.93 | 0.83 |
-| opendataloader | 0.84 | 0.91 | 0.49 | 0.74 |
-| opendataloader-hybrid | 0.83 | 0.91 | 0.43 | 0.73 |
-| pymupdf4llm | 0.74 | 0.89 | 0.40 | 0.43 |
-| markitdown | 0.58 | 0.88 | 0.00 | 0.00 |
+| docling | 0.88 | 0.90 | **0.89** | **0.82** |
+| **Nutrient** | **0.88** | **0.92** | 0.66 | 0.81 |
+| opendataloader | 0.83 | 0.90 | 0.49 | 0.74 |
+| pymupdf4llm | 0.83 | 0.88 | 0.48 | 0.78 |
+| markitdown | 0.59 | 0.84 | 0.27 | 0.00 |
+| pypdf | 0.58 | 0.87 | 0.00 | 0.00 |
+| liteparse | 0.57 | 0.86 | 0.00 | 0.00 |

 ## Speed

 | Solution | Seconds per page |
 | --- | ---: |
-| Nutrient | 0.008 |
-| opendataloader | 0.056 |
-| markitdown | 0.058 |
-| pymupdf4llm | 0.083 |
-| opendataloader-hybrid | 1.412 |
-| docling | 1.473 |
+| **Nutrient** | **0.007** |
+| opendataloader | 0.014 |
+| pypdf | 0.019 |
+| markitdown | 0.106 |
+| liteparse | 0.233 |
+| pymupdf4llm | 0.252 |
+| docling | 0.618 |

 ## Relative Speed Callouts

-- Nutrient is `176x` faster than `docling`
-- Nutrient is `172x` faster than `opendataloader-hybrid`
-- Nutrient is `10x` faster than `opendataloader`
-- Nutrient is `7x` faster than `pymupdf4llm`
-- Nutrient is `7x` faster than `markitdown`
-
-## Note
-
-This file reflects the currently published benchmark table. A public reproducibility harness is planned as a future addition.
+- Nutrient is `90x` faster than `docling`
+- Nutrient is `37x` faster than `pymupdf4llm`
+- Nutrient is `34x` faster than `liteparse`
+- Nutrient is `15x` faster than `markitdown`
+- Nutrient is `3x` faster than `pypdf`
+- Nutrient is `2x` faster than `opendataloader`
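
Reviewer note: the relative speed callouts can be cross-checked against the Speed table. The sketch below recomputes each ratio from the rounded seconds-per-page values in this commit; the function name `speedups` is hypothetical. Ratios from the rounded values come out slightly lower than some published callouts (for example about 88x rather than `90x` for docling), which would be consistent with the callouts having been derived from unrounded timings, though the commit does not say so.

```python
# Seconds per page, copied from the new Speed table in this commit.
SECONDS_PER_PAGE = {
    "Nutrient": 0.007,
    "opendataloader": 0.014,
    "pypdf": 0.019,
    "markitdown": 0.106,
    "liteparse": 0.233,
    "pymupdf4llm": 0.252,
    "docling": 0.618,
}


def speedups(timings: dict[str, float], baseline: str = "Nutrient") -> dict[str, float]:
    """Return how many times faster the baseline is than each other tool.

    The result is ordered from the largest speedup to the smallest.
    """
    base = timings[baseline]
    ratios = {name: t / base for name, t in timings.items() if name != baseline}
    return dict(sorted(ratios.items(), key=lambda item: item[1], reverse=True))


for name, ratio in speedups(SECONDS_PER_PAGE).items():
    print(f"{name}: {ratio:.1f}x")
```

Keeping the check as a small script makes it easy to rerun whenever the table is refreshed, so the callouts and the table do not drift apart.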
