Commit b7b18a0

Initial release: Nutrient PDF to Markdown CLI
Fast, accurate Markdown from PDFs, locally and with no cleanup required. Built for Claude, Codex, Pi, Cursor, Gemini CLI, RAG pipelines, and document-heavy automation.

- 0.007s per page (90x faster than docling, 37x faster than pymupdf4llm)
- 0.92 reading order accuracy (best in class)
- 0.88 overall extraction accuracy
- Free for up to 1,000 documents/month

Includes CLI wrapper, installer, benchmark data (200 docs), and documentation.
0 parents  commit b7b18a0

21 files changed

Lines changed: 830 additions & 0 deletions

.github/workflows/validate.yml

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
name: Validate

on:
  push:
  pull_request:

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Check shell syntax
        run: npm run check

      - name: Verify installer copies local wrapper
        run: |
          tmpdir="$(mktemp -d)"
          INSTALL_DIR="$tmpdir/bin" ./install.sh
          test -x "$tmpdir/bin/pdf-to-markdown"

      - name: Verify npm pack surface
        run: npm pack --dry-run
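
The "Verify installer copies local wrapper" step can also be checked outside CI. The sketch below is a Python rendering of the `test -x` assertion only. The helper name `wrapper_is_executable` is hypothetical and not part of this commit, and the sketch assumes `install.sh` has already been run with `INSTALL_DIR` set to the directory passed in.

```python
import os
from pathlib import Path


def wrapper_is_executable(install_dir: str, name: str = "pdf-to-markdown") -> bool:
    """Mirror `test -x "$INSTALL_DIR/pdf-to-markdown"` from the workflow.

    Hypothetical helper for local verification; not shipped in the repo.
    """
    candidate = Path(install_dir) / name
    # `test -x` succeeds when the path exists and the caller may execute it.
    return candidate.exists() and os.access(candidate, os.X_OK)
```

For example, after running `INSTALL_DIR=/tmp/nutrient-bin ./install.sh`, calling `wrapper_is_executable("/tmp/nutrient-bin")` should return `True` if the installer behaved as the workflow expects.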

.gitignore

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
node_modules/
*.tgz
.DS_Store
sample-report.pdf
report.md

LICENSE.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# Nutrient Free Use License

Copyright (c) Nutrient.io

## 1. License Grant

Nutrient grants you a limited, non-exclusive, non-transferable, revocable license to use the accompanying software (the "Software") for internal business or personal purposes, subject to the terms of this License.

## 2. Free Usage Tier

You may use the Software free of charge to process up to 1,000 documents per calendar month.

For the purposes of this License:

A "document" means any individual file submitted for processing by the Software, regardless of format.

Each processing event counts as one document, including repeated or duplicate processing of the same file.

## 3. Commercial Use

You must obtain a commercial license from Nutrient if you:

Process more than 1,000 documents per calendar month

To obtain a commercial license, contact: sales@nutrient.io

## 4. Permitted Uses Within Free Tier

Subject to the 1,000 documents per calendar month limit, you may:

Use the Software for internal or external purposes

Incorporate the Software into applications, including hosted, SaaS, OEM, embedded, or white-labeled solutions

Provide the Software's functionality to third parties

## 5. Restrictions

You may not:

Circumvent, disable, or interfere with usage limits or licensing controls

Use the Software to provide a service or product that competes with Nutrient's commercial offerings

Remove or alter any copyright, trademark, or proprietary notices

Use the Software in violation of applicable laws or regulations

## 6. Ownership

The Software is licensed, not sold. Nutrient and its licensors retain all right, title, and interest in and to the Software.

## 7. Termination

This License automatically terminates if you fail to comply with any of its terms.

Upon termination, you must cease all use of the Software.

## 8. Disclaimer of Warranties

THE SOFTWARE IS PROVIDED "AS IS" AND "AS AVAILABLE", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT.

## 9. Limitation of Liability

TO THE MAXIMUM EXTENT PERMITTED BY LAW, IN NO EVENT SHALL NUTRIENT BE LIABLE FOR ANY INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL, OR EXEMPLARY DAMAGES, INCLUDING BUT NOT LIMITED TO LOSS OF PROFITS, DATA, OR USE, ARISING OUT OF OR RELATED TO THE SOFTWARE.

## 10. Governing Law

This License shall be governed by and construed in accordance with the laws of the State of North Carolina, without regard to conflict of law principles.

## 11. Usage Data

The Software may collect and transmit usage data related to performance, feature usage, and document processing activity. This data does not include the contents of documents or personally identifiable information.

By using the Software, you agree to the collection and use of such data in accordance with Nutrient's Privacy Policy.

## 12. Third-Party Software

This Software incorporates third-party open source components. A full list of acknowledgements is available at: https://www.nutrient.io/legal/acknowledgements/nutrient-cli-acknowledgements/

README.md

Lines changed: 172 additions & 0 deletions
@@ -0,0 +1,172 @@
# Nutrient PDF to Markdown

[![License: Proprietary](https://img.shields.io/badge/license-Nutrient_Free_Use-blue)](LICENSE.md)
[![macOS](https://img.shields.io/badge/macOS-arm64-brightgreen)](https://github.com/PSPDFKit/pdf-to-markdown)
[![Linux](https://img.shields.io/badge/Linux-x64_|_arm64-brightgreen)](https://github.com/PSPDFKit/pdf-to-markdown)
[![Windows](https://img.shields.io/badge/Windows-x64_(coming_soon)-yellow)](https://github.com/PSPDFKit/pdf-to-markdown)

<p align="center">
<img src="docs/assets/demo.gif" alt="pdf-to-markdown demo" width="720">
</p>

**Stop wasting your context window on PDF extraction.**

Fast, accurate Markdown from PDFs, locally and with no cleanup required. Built for Claude, Codex, RAG pipelines, and document-heavy automation where noisy extraction burns tokens and makes downstream results less reliable.

- **How fast is it?** 0.007s per page. 90x faster than docling, 37x faster than pymupdf4llm. ([benchmarks](#benchmarks))
- **How accurate is it?** 0.92 reading order (best in class), 0.88 overall extraction accuracy, 0.81 heading detection. ([benchmarks](#benchmarks))
- **Where do my PDFs go?** Nowhere. The CLI runs locally. Your documents are not uploaded to Nutrient. ([trust & licensing](#trust-and-licensing))
- **What does it cost?** Free for up to 1,000 documents per calendar month. No license key, no signup, no API token. ([license](LICENSE.md))

## Install

### Agent skill (recommended)

If you use Claude Code, Codex, Pi, Cursor, or Gemini CLI, install the [Nutrient Skills](https://github.com/pspdfkit-labs/nutrient-skills) plugin; the extraction then runs automatically when your agent needs to read a PDF:

```bash
npx skills add pspdfkit-labs/nutrient-skills --skill pdf-to-markdown
```

Or with marketplace/plugin flows (Claude Code, Codex):

```text
/plugin marketplace add pspdfkit-labs/nutrient-skills
/plugin install pdf-to-markdown@nutrient-skills
```

With Pi:

```bash
pi install git:github.com/PSPDFKit-labs/nutrient-skills
```

Once installed, just reference a PDF in your prompt, with no extra commands needed:

> "Extract the pricing table from proposal.pdf"

The skill invokes the CLI transparently and passes the resulting Markdown into your agent context.

### Standalone CLI

For use outside an agent, install the CLI directly:

```bash
curl -fsSL https://raw.githubusercontent.com/PSPDFKit/pdf-to-markdown/main/install.sh | sh
```

This installs `pdf-to-markdown` into `~/.local/bin` by default.

You can also install from a clone:

```bash
git clone https://github.com/PSPDFKit/pdf-to-markdown.git
cd pdf-to-markdown
./install.sh # or: npm install -g .
```

## Usage

### Single PDF

```bash
pdf-to-markdown input.pdf output.md
```

If `output.md` is omitted, Markdown is written to stdout.
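
Because the CLI writes to stdout when no output path is given, it can be called from a script and its Markdown captured directly. The following is a minimal sketch, assuming `pdf-to-markdown` is on `PATH` and signals failure with a non-zero exit status (the README does not document exit codes). The function names `build_command` and `pdf_to_markdown_text` are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_command(pdf_path: str, output_path: Optional[str] = None) -> List[str]:
    """Build the argv for `pdf-to-markdown input.pdf [output.md]`."""
    command = ["pdf-to-markdown", pdf_path]
    if output_path is not None:
        command.append(output_path)
    return command


def pdf_to_markdown_text(pdf_path: str) -> str:
    """Run the CLI without an output path and return the Markdown from stdout.

    Assumption: the CLI exits non-zero on failure; this is not documented
    in the README, so treat the error handling as illustrative.
    """
    result = subprocess.run(
        build_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.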

### Batch directory

```bash
pdf-to-markdown ./input-pdfs ./output-markdown
```

When both arguments are directories, the CLI converts every PDF in the input directory and writes matching Markdown files into the output directory.
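
After a batch run it can be useful to confirm that every input produced an output. The sketch below assumes that "matching" means the same base name with an `.md` extension (for example `report.pdf` becomes `report.md`) and that only top-level files are converted; neither detail is stated in the README. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Dict, List


def expected_outputs(input_dir: str, output_dir: str) -> Dict[Path, Path]:
    """Map each top-level PDF in input_dir to the Markdown path expected in output_dir.

    Assumption: `report.pdf` is written as `report.md`; the README only
    says the Markdown files are "matching".
    """
    out = Path(output_dir)
    return {
        pdf: out / (pdf.stem + ".md")
        for pdf in sorted(Path(input_dir).iterdir())
        if pdf.is_file() and pdf.suffix.lower() == ".pdf"
    }


def missing_outputs(input_dir: str, output_dir: str) -> List[Path]:
    """Return the input PDFs whose expected Markdown file does not exist yet."""
    return [
        pdf
        for pdf, markdown in expected_outputs(input_dir, output_dir).items()
        if not markdown.exists()
    ]
```

An empty list from `missing_outputs("./input-pdfs", "./output-markdown")` would indicate that, under the naming assumption above, every PDF has a corresponding Markdown file.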

## Platform Support

- macOS Apple Silicon (`Darwin/arm64`)
- Linux x86_64
- Linux arm64
- Windows x64 (coming soon)

## Benchmarks

Benchmark results from 200 PDF documents with hand-annotated Markdown ground truth, evaluated using NID (reading order), TEDS (table structure), and MHS (heading hierarchy) metrics. Benchmarked on `2026-04-02`.

### Visual Snapshot

![Extraction accuracy](docs/assets/extraction-accuracy.png)

![Reading order](docs/assets/reading-order.png)

![Table structure](docs/assets/table-structure.png)

![Heading level](docs/assets/heading-level.png)

![Extraction speed](docs/assets/extraction-speed.png)

![Faster with Nutrient](docs/assets/faster-with-nutrient.png)

### Accuracy

| Solution | Overall | Reading Order (NID) | Table Structure (TEDS) | Heading Level (MHS) |
| --- | ---: | ---: | ---: | ---: |
| docling | **0.88** | 0.90 | **0.89** | **0.82** |
| **Nutrient** | **0.88** | **0.92** | 0.66 | 0.81 |
| opendataloader | 0.83 | 0.90 | 0.49 | 0.74 |
| pymupdf4llm | 0.83 | 0.88 | 0.48 | 0.78 |
| markitdown | 0.59 | 0.84 | 0.27 | 0.00 |
| pypdf | 0.58 | 0.87 | 0.00 | 0.00 |
| liteparse | 0.57 | 0.86 | 0.00 | 0.00 |

### Speed

| Solution | Seconds per page |
| --- | ---: |
| **Nutrient** | **0.007** |
| opendataloader | 0.014 |
| pypdf | 0.019 |
| markitdown | 0.106 |
| liteparse | 0.233 |
| pymupdf4llm | 0.252 |
| docling | 0.618 |

### Faster with Nutrient

- `90x` faster than `docling`
- `37x` faster than `pymupdf4llm`
- `34x` faster than `liteparse`
- `15x` faster than `markitdown`
- `3x` faster than `pypdf`
- `2x` faster than `opendataloader`

For the full comparison table, see [docs/benchmarks.md](docs/benchmarks.md).
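
The multipliers above are ratios of each tool's seconds-per-page to Nutrient's. As a sketch, the snippet below recomputes them from the rounded values in the Speed table. Small differences from the listed figures are expected (for example, roughly 88x rather than 90x for docling) if the published multipliers were derived from unrounded timings, which is an assumption here.

```python
# Seconds per page, copied from the Speed table (rounded to three decimals).
SECONDS_PER_PAGE = {
    "Nutrient": 0.007,
    "opendataloader": 0.014,
    "pypdf": 0.019,
    "markitdown": 0.106,
    "liteparse": 0.233,
    "pymupdf4llm": 0.252,
    "docling": 0.618,
}


def speedup_over(tool: str, baseline: str = "Nutrient") -> float:
    """Return how many times faster `baseline` is than `tool`."""
    return SECONDS_PER_PAGE[tool] / SECONDS_PER_PAGE[baseline]


if __name__ == "__main__":
    for tool in SECONDS_PER_PAGE:
        if tool != "Nutrient":
            print(f"{tool}: {speedup_over(tool):.1f}x")
```

Because the inputs carry only one or two significant digits at the fast end of the table, the recomputed ratios for `pypdf` and `opendataloader` are the least precise.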

## Trust and Licensing

- Free for up to `1,000` documents per calendar month
- PDFs stay local: your documents are not uploaded to Nutrient by this extractor
- A commercial license is required for processing more than `1,000` documents per calendar month
- The extraction engine is delivered as a signed platform binary; the repo contains only the wrapper, installer, and documentation
- The license is non-transferable: you may not redistribute the binary standalone or sublicense it to third parties; embedding it in your own application is permitted under the free tier terms

See [LICENSE.md](LICENSE.md) for the full terms and [docs/distribution-model.md](docs/distribution-model.md) for details on what ships in this repo vs. the binary.

## FAQ

### What makes this different from other PDF extractors?

Speed and accuracy should not be a tradeoff. Most extractors are either fast but lose structure (pypdf, markitdown) or accurate but slow (docling). Nutrient extracts at 0.007s per page with the best reading order score in the benchmark (0.92) and heading detection close to docling's (0.81 vs. 0.82); its table structure score (0.66) trails docling's (0.89) but leads the remaining tools. The result is less cleanup, fewer wasted tokens, and more reliable downstream results.
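
### Do my documents leave my machine?

No. The CLI processes PDFs locally, and document contents are not uploaded to Nutrient. Per section 11 of [LICENSE.md](LICENSE.md), the software may transmit usage data (performance, feature usage, and document processing activity); that data excludes document contents and personally identifiable information. Note that if you feed the extracted Markdown into Claude, Codex, or another model provider, their own data policies apply.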

### Do I need a license key or API token?

No. There is no signup, no license key, and no API token. Install the CLI and start converting. The free tier (up to 1,000 documents per calendar month) is enforced via the [license terms](LICENSE.md), not a technical gate. If you need to process more than 1,000 documents per month, contact `sales@nutrient.io` for a commercial license.
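
Since the limit is a license term rather than a technical gate, keeping track of usage is left to you. The sketch below shows one way to tally processing events per calendar month, following the LICENSE.md definition that each processing event counts as one document, including repeats of the same file. `MonthlyUsage` is a hypothetical in-memory helper; a real deployment would need to persist the counts.

```python
from collections import Counter
from datetime import date
from typing import Optional

FREE_TIER_LIMIT = 1000  # documents per calendar month, per LICENSE.md


class MonthlyUsage:
    """Hypothetical in-memory tally of processing events per calendar month."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, when: Optional[date] = None) -> int:
        """Count one processing event and return that month's running total."""
        when = when or date.today()
        key = (when.year, when.month)
        # Every event counts, even if the same file was processed before.
        self._counts[key] += 1
        return self._counts[key]

    def remaining(self, when: Optional[date] = None) -> int:
        """Return how many free-tier documents are left in that month."""
        when = when or date.today()
        return max(0, FREE_TIER_LIMIT - self._counts[(when.year, when.month)])
```

Calling `record()` once per CLI invocation (or once per file in a batch run) and checking `remaining()` before large jobs gives an early warning before the commercial threshold is crossed.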

### Why is the extraction engine closed-source?

The repo is designed to be reviewable: you can read the wrapper, the installer, and the documentation. The extraction engine is distributed as a signed binary to protect the implementation while keeping the CLI surface fully transparent.
