| 1 | +# Nutrient PDF to Markdown |
| 2 | + |
| 8 | +<p align="center"> |
| 9 | + <img src="docs/assets/demo.gif" alt="pdf-to-markdown demo" width="720"> |
| 10 | +</p> |
| 11 | + |
| 12 | +**Stop wasting your context window on PDF extraction.** |
| 13 | + |
| 14 | +Fast, accurate Markdown from PDFs, generated locally with minimal cleanup. Built for Claude, Codex, RAG pipelines, and document-heavy automation, where noisy extraction burns tokens and makes downstream results less reliable. |
| 15 | + |
| 16 | +- **How fast is it?** — 0.007s per page. 90x faster than docling, 37x faster than pymupdf4llm. ([benchmarks](#benchmarks)) |
| 17 | +- **How accurate is it?** — 0.92 reading order (best in class), 0.88 overall extraction accuracy, 0.81 heading detection. ([benchmarks](#benchmarks)) |
| 18 | +- **Where do my PDFs go?** — Nowhere. The CLI runs locally. Your documents are not uploaded to Nutrient. ([trust & licensing](#trust-and-licensing)) |
| 19 | +- **What does it cost?** — Free for up to 1,000 documents per calendar month. No license key, no signup, no API token. ([license](LICENSE.md)) |
| 20 | + |
| 21 | +## Install |
| 22 | + |
| 23 | +### Agent skill (recommended) |
| 24 | + |
| 25 | +If you use Claude Code, Codex, Pi, Cursor, or Gemini CLI, install the [Nutrient Skills](https://github.com/pspdfkit-labs/nutrient-skills) plugin — the extraction runs automatically when your agent needs to read a PDF: |
| 26 | + |
| 27 | +```bash |
| 28 | +npx skills add pspdfkit-labs/nutrient-skills --skill pdf-to-markdown |
| 29 | +``` |
| 30 | + |
| 31 | +Or with marketplace/plugin flows (Claude Code, Codex): |
| 32 | + |
| 33 | +```text |
| 34 | +/plugin marketplace add pspdfkit-labs/nutrient-skills |
| 35 | +/plugin install pdf-to-markdown@nutrient-skills |
| 36 | +``` |
| 37 | + |
| 38 | +With Pi: |
| 39 | + |
| 40 | +```bash |
| 41 | +pi install git:github.com/PSPDFKit-labs/nutrient-skills |
| 42 | +``` |
| 43 | + |
| 44 | +Once installed, just reference a PDF in your prompt — no extra commands needed: |
| 45 | + |
| 46 | +> "Extract the pricing table from proposal.pdf" |
| 47 | + |
| 48 | +The skill invokes the CLI transparently and passes the resulting Markdown into your agent context. |
| 49 | + |
| 50 | +### Standalone CLI |
| 51 | + |
| 52 | +For use outside an agent, install the CLI directly: |
| 53 | + |
| 54 | +```bash |
| 55 | +curl -fsSL https://raw.githubusercontent.com/PSPDFKit/pdf-to-markdown/main/install.sh | sh |
| 56 | +``` |
| 57 | + |
| 58 | +This installs `pdf-to-markdown` into `~/.local/bin` by default. |
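
If the shell cannot find `pdf-to-markdown` after installation, `~/.local/bin` may be missing from `PATH`. The sketch below is one way to check; `path_contains` is a helper name invented for this example, not part of the installer.

```bash
# Succeed if the directory in $1 appears in the PATH-style string in $2
# (defaults to the current $PATH). Hypothetical helper for illustration.
path_contains() {
  case ":${2-$PATH}:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if ! path_contains "$HOME/.local/bin"; then
  echo "Add this to your shell profile: export PATH=\"\$HOME/.local/bin:\$PATH\""
fi
```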
| 59 | + |
| 60 | +You can also install from a clone: |
| 61 | + |
| 62 | +```bash |
| 63 | +git clone https://github.com/PSPDFKit/pdf-to-markdown.git |
| 64 | +cd pdf-to-markdown |
| 65 | +./install.sh # or: npm install -g . |
| 66 | +``` |
| 67 | + |
| 68 | +## Usage |
| 69 | + |
| 70 | +### Single PDF |
| 71 | + |
| 72 | +```bash |
| 73 | +pdf-to-markdown input.pdf output.md |
| 74 | +``` |
| 75 | + |
| 76 | +If `output.md` is omitted, Markdown is written to stdout. |
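
Because Markdown goes to stdout when no output file is given, the result can be piped into ordinary text tools. The following is a minimal sketch; the `PDF_TO_MARKDOWN` variable and both function names are assumptions made for this example, and only the `pdf-to-markdown input.pdf` invocation comes from the usage above.

```bash
# Command to run; override PDF_TO_MARKDOWN to point at a different binary.
# (Hypothetical variable, not something the CLI itself reads.)
PDF_TO_MARKDOWN="${PDF_TO_MARKDOWN:-pdf-to-markdown}"

# Print the number of words in the Markdown extracted from the PDF in $1.
markdown_word_count() {
  "$PDF_TO_MARKDOWN" "$1" | wc -w | tr -d ' '
}

# Print only the Markdown heading lines (those starting with '#').
markdown_headings() {
  "$PDF_TO_MARKDOWN" "$1" | grep '^#'
}
```

For example, `markdown_headings input.pdf` gives a quick outline of a document before the full text is passed to a model.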
| 77 | + |
| 78 | +### Batch directory |
| 79 | + |
| 80 | +```bash |
| 81 | +pdf-to-markdown ./input-pdfs ./output-markdown |
| 82 | +``` |
| 83 | + |
| 84 | +When both arguments are directories, the CLI converts every PDF in the input directory and writes matching Markdown files into the output directory. |
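
After a batch run, a quick sanity check is to compare file counts in the two directories. This sketch assumes that output files use the `.md` extension and that only the top level of each directory matters; neither detail is confirmed above, and `count_ext` and `batch_complete` are hypothetical helper names.

```bash
# Count files directly inside directory $1 whose names end in .$2.
count_ext() {
  find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# Succeed when output directory $2 holds as many .md files as
# input directory $1 holds .pdf files.
batch_complete() {
  [ "$(count_ext "$1" pdf)" -eq "$(count_ext "$2" md)" ]
}

# Possible usage after a batch conversion:
#   pdf-to-markdown ./input-pdfs ./output-markdown
#   batch_complete ./input-pdfs ./output-markdown || echo "some PDFs were not converted"
```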
| 85 | + |
| 86 | +## Platform Support |
| 87 | + |
| 88 | +- macOS Apple Silicon (`Darwin/arm64`) |
| 89 | +- Linux x86_64 |
| 90 | +- Linux arm64 |
| 91 | +- Windows x64 (coming soon) |
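
To check a machine against the list above before installing, the output of `uname` can be compared with the supported pairs. This is a sketch rather than the installer's own logic; `platform_supported` is a hypothetical helper, and it accepts `aarch64` because that is what `uname -m` commonly reports on arm64 Linux.

```bash
# Succeed if the OS/architecture pair matches a platform listed above.
# $1 is the output of `uname -s`, $2 the output of `uname -m`.
platform_supported() {
  case "$1/$2" in
    Darwin/arm64) return 0 ;;
    Linux/x86_64) return 0 ;;
    Linux/arm64 | Linux/aarch64) return 0 ;;
    *) return 1 ;;
  esac
}

if platform_supported "$(uname -s)" "$(uname -m)"; then
  echo "supported platform"
else
  echo "unsupported platform: $(uname -s)/$(uname -m)"
fi
```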
| 92 | + |
| 93 | +## Benchmarks |
| 94 | + |
| 95 | +Results are from 200 PDF documents with hand-annotated Markdown ground truth, scored with the NID (reading order), TEDS (table structure), and MHS (heading hierarchy) metrics. Benchmarks were run on `2026-04-02`. |
| 96 | + |
| 111 | +### Accuracy |
| 112 | + |
| 113 | +| Solution | Overall | Reading Order (NID) | Table Structure (TEDS) | Heading Level (MHS) | |
| 114 | +| --- | ---: | ---: | ---: | ---: | |
| 115 | +| docling | **0.88** | 0.90 | **0.89** | **0.82** | |
| 116 | +| **Nutrient** | **0.88** | **0.92** | 0.66 | 0.81 | |
| 117 | +| opendataloader | 0.83 | 0.90 | 0.49 | 0.74 | |
| 118 | +| pymupdf4llm | 0.83 | 0.88 | 0.48 | 0.78 | |
| 119 | +| markitdown | 0.59 | 0.84 | 0.27 | 0.00 | |
| 120 | +| pypdf | 0.58 | 0.87 | 0.00 | 0.00 | |
| 121 | +| liteparse | 0.57 | 0.86 | 0.00 | 0.00 | |
| 122 | + |
| 123 | +### Speed |
| 124 | + |
| 125 | +| Solution | Seconds per page | |
| 126 | +| --- | ---: | |
| 127 | +| **Nutrient** | **0.007** | |
| 128 | +| opendataloader | 0.014 | |
| 129 | +| pypdf | 0.019 | |
| 130 | +| markitdown | 0.106 | |
| 131 | +| liteparse | 0.233 | |
| 132 | +| pymupdf4llm | 0.252 | |
| 133 | +| docling | 0.618 | |
| 134 | + |
| 135 | +### Faster with Nutrient |
| 136 | + |
| 137 | +- `90x` faster than `docling` |
| 138 | +- `37x` faster than `pymupdf4llm` |
| 139 | +- `34x` faster than `liteparse` |
| 140 | +- `15x` faster than `markitdown` |
| 141 | +- `3x` faster than `pypdf` |
| 142 | +- `2x` faster than `opendataloader` |
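
The multipliers can be approximated from the Speed table by dividing each tool's seconds per page by Nutrient's. The sketch below uses the rounded table values, so results can differ slightly from the list (for example, about 88x rather than 90x for docling); the listed figures presumably come from unrounded timings, which is an assumption. `speedup` is a helper name chosen for this example.

```bash
# Print how many times faster a tool at $2 seconds/page is
# than a tool at $1 seconds/page, to one decimal place.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 0.618 0.007   # docling vs. Nutrient
speedup 0.252 0.007   # pymupdf4llm vs. Nutrient
speedup 0.014 0.007   # opendataloader vs. Nutrient
```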
| 143 | + |
| 144 | +For the full comparison table, see [docs/benchmarks.md](docs/benchmarks.md). |
| 145 | + |
| 146 | +## Trust and Licensing |
| 147 | + |
| 148 | +- Free for up to `1,000` documents per calendar month |
| 149 | +- PDFs stay local — your documents are not uploaded to Nutrient by this extractor |
| 150 | +- A commercial license is required for processing more than `1,000` documents per month |
| 151 | +- The extraction engine is delivered as a signed platform binary; the repo contains only the wrapper and documentation |
| 152 | +- The license is non-transferable — you may not redistribute the binary standalone or sublicense it to third parties; embedding it in your own application is permitted under the free tier terms |
| 153 | + |
| 154 | +See [LICENSE.md](LICENSE.md) for the full terms and [docs/distribution-model.md](docs/distribution-model.md) for details on what ships in this repo vs. the binary. |
| 155 | + |
| 156 | +## FAQ |
| 157 | + |
| 158 | +### What makes this different from other PDF extractors? |
| 159 | + |
| 160 | +Speed and accuracy should not be a tradeoff. Most extractors are either fast but lose structure (pypdf, markitdown) or accurate but slow (docling). Nutrient extracts at 0.007s per page with strong reading order, heading, and table preservation, which means less cleanup, fewer wasted tokens, and more reliable downstream results. |
| 161 | + |
| 162 | +### Do my documents leave my machine? |
| 163 | + |
| 164 | +No. The CLI processes PDFs locally. Nothing is uploaded to Nutrient. Note that if you feed the extracted Markdown into Claude, Codex, or another model provider, their own data policies apply. |
| 165 | + |
| 166 | +### Do I need a license key or API token? |
| 167 | + |
| 168 | +No. There is no signup, no license key, and no API token. Install the CLI and start converting. The free tier (up to 1,000 documents per calendar month) is enforced via the [license terms](LICENSE.md), not a technical gate. If you need to process more than 1,000 documents per month, contact `sales@nutrient.io` for a commercial license. |
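
Since the limit is not enforced technically, teams that run large batches may want to keep their own tally. The following is a minimal sketch of a local counter; the log file location and both function names are hypothetical, the CLI does not read or write this file, and the month boundary here follows the local clock, which may not match how the license defines a calendar month.

```bash
# Hypothetical counter file; not something the CLI reads or writes.
USAGE_LOG="${USAGE_LOG:-$HOME/.pdf-to-markdown-usage}"

# Append one line (the current YYYY-MM) per converted document.
record_conversion() {
  date +%Y-%m >> "$USAGE_LOG"
}

# Print how many conversions were recorded for month $1 (default: current month).
conversions_in_month() {
  month="${1:-$(date +%Y-%m)}"
  [ -f "$USAGE_LOG" ] || { echo 0; return 0; }
  grep -c "^$month\$" "$USAGE_LOG" || true
}
```

A wrapper script could call `record_conversion` after each successful `pdf-to-markdown` run and warn when `conversions_in_month` approaches 1,000.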
| 169 | + |
| 170 | +### Why is the extraction engine closed-source? |
| 171 | + |
| 172 | +The repo is designed to be reviewable — you can read the wrapper, the installer, and the documentation. The extraction engine is distributed as a signed binary to protect the implementation while keeping the CLI surface fully transparent. |