feat: add Microsoft MarkItDown engine adapter + benchmark coverage by Mihailorama · Pull Request #8 · Mihailorama/docfold

Mihailorama · 2026-04-25T18:20:30Z

Summary

Wraps Microsoft's markitdown (PDF, Office, HTML, images, CSV/JSON/XML, ePub, audio, ZIP → Markdown) in a DocumentEngine adapter so docfold can route to it like any other engine.
Registers it in benchmark.py so it runs against the 7 synthetic Latin / Arabic / Hebrew / CJK PDFs alongside pymupdf, docling, mineru, marker and the OCR engines.
Follows the mandated TDD workflow: proposal → red tests → green implementation → E2E benchmark.
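
The routing idea in the summary can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Engine` dataclass, the `PRIORITY` map, and `select_engine` are hypothetical names, not docfold's actual router API, and the priority order shown is invented for the example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional


@dataclass(frozen=True)
class Engine:
    """Hypothetical stand-in for an engine adapter (not docfold's real class)."""
    name: str
    supported_extensions: frozenset
    available: bool = True


# Hypothetical per-extension priority lists; the real order lives in router.py.
PRIORITY = {
    ".pdf": ["pymupdf", "markitdown"],
    ".docx": ["markitdown"],
    ".csv": ["markitdown"],
}


def select_engine(path: str, engines: Dict[str, Engine]) -> Optional[Engine]:
    """Return the first available engine in priority order that supports the file."""
    ext = Path(path).suffix.lower()
    for name in PRIORITY.get(ext, []):
        engine = engines.get(name)
        if engine and engine.available and ext in engine.supported_extensions:
            return engine
    return None


engines = {
    "pymupdf": Engine("pymupdf", frozenset({".pdf"})),
    "markitdown": Engine("markitdown", frozenset({".pdf", ".docx", ".csv"})),
}
chosen = select_engine("report.docx", engines)
```

With this shape, an engine that is not installed is simply skipped, and the next entry in the list for that extension is tried.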

What's new

src/docfold/engines/markitdown_engine.py — adapter; sync convert() dispatched via run_in_executor; honors MARKDOWN / TEXT / HTML / JSON; defensive is_available() so a broken transitive import (e.g. cryptography PyO3) cannot kill the whole router.
tests/engines/test_markitdown_engine.py — 12 unit tests, mocks the markitdown module so they run without the dep installed.
pyproject.toml — new markitdown extras and inclusion in [all].
src/docfold/engines/router.py — markitdown registered in extension priority lists for PDF, Office, HTML, CSV/JSON/XML, images, ePub, audio, ZIP.
benchmark.py — MarkItDownEngine() added to the candidate list.
README.md, CHANGELOG.md, docs/tasks/MARKITDOWN_ENGINE.md — engine tables, unreleased entry, feature proposal.
docs/benchmark_results.json — refreshed with markitdown numbers.
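
The two adapter techniques named above (a defensive availability check and a sync call dispatched via `run_in_executor`) can be sketched like this. It is a minimal illustration, not the code in markitdown_engine.py: the function names and the `fake_convert` stand-in are assumptions, and how broadly the real `is_available()` catches errors is not shown in this PR.

```python
import asyncio
import importlib
from typing import Callable


def is_available(module_name: str = "markitdown") -> bool:
    """Return True only if the module imports cleanly.

    Assumption: the check catches more than ImportError, because a broken
    native extension in a transitive dependency may raise something that
    does not derive from Exception.
    """
    try:
        importlib.import_module(module_name)
    except (KeyboardInterrupt, SystemExit):
        raise  # never swallow a user interrupt or interpreter exit
    except BaseException:
        return False
    return True


async def convert_in_executor(convert: Callable[[str], str], path: str) -> str:
    """Run a blocking convert() in the default thread pool."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, convert, path)


def fake_convert(path: str) -> str:
    """Hypothetical stand-in for a blocking markitdown conversion."""
    return f"# converted {path}"


markdown = asyncio.run(convert_in_executor(fake_convert, "report.docx"))
```

Running the blocking call in an executor keeps the event loop free while a large document is being converted.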

Benchmark results (this run)

Only pymupdf and markitdown were available on the build host; comparative numbers vs docling/mineru/marker/surya require a host with those deps installed.

Engine	Avg time	Avg CER	Avg WER
pymupdf	3.3 ms	0.0000	0.0000
markitdown	50.7 ms	0.0478	0.2415

markitdown scores CER = WER = 0 on the Latin-script PDFs. The measurable degradation on Arabic / Hebrew / CJK is expected, because markitdown does not do layout analysis or RTL reshaping.
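
For readers unfamiliar with the metrics, CER and WER are conventionally the Levenshtein edit distance divided by the reference length, counted in characters and words respectively. The sketch below uses that textbook definition; whether benchmark.py normalizes text or divides in exactly this way is an assumption.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits per reference character."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edits per whitespace-separated reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


# One wrong character spoils a whole word, so WER exceeds CER here.
example_cer = cer("hello world", "hello wor1d")
example_wer = wer("hello world", "hello wor1d")
```

This also shows why the WER column in the table is several times larger than the CER column: a single misplaced character counts as one character error but a full word error.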

Test plan

pytest tests/ — 322 passed
python benchmark.py — runs end-to-end with markitdown installed; report written to docs/benchmark_results.json
CI green
Re-run benchmark.py on a host with docling/mineru/marker/surya installed for full comparison

https://claude.ai/code/session_01FwNghFepN7YHe4yeU5A17L

Generated by Claude Code

Wraps markitdown (https://github.com/microsoft/markitdown) in a DocumentEngine adapter so docfold can route to it like any other engine. Registers it in benchmark.py so it runs against the 7 synthetic Latin / Arabic / Hebrew / CJK PDFs alongside pymupdf, docling, mineru, marker and the OCR engines.

- src/docfold/engines/markitdown_engine.py: adapter; dispatches sync convert() via run_in_executor, honors MARKDOWN / TEXT / HTML / JSON.
- tests/engines/test_markitdown_engine.py: 12 unit tests, mocks the markitdown module so tests run without the dep installed.
- pyproject.toml: markitdown extras (+ included in [all]).
- router.py: adds markitdown to priority lists for its formats.
- benchmark.py: registers MarkItDownEngine() in the candidate list.
- README.md / CHANGELOG.md: engine tables + unreleased entry.
- docs/tasks/MARKITDOWN_ENGINE.md: feature proposal (TDD workflow).
- docs/benchmark_results.json: refreshed with markitdown numbers.
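
Mocking the markitdown module so tests run without the dependency can be done along these lines. This is a sketch of one common approach (swapping a fake module into `sys.modules`), not the contents of test_markitdown_engine.py; `make_fake_markitdown` and `convert_with_markitdown` are hypothetical helpers. The fake mirrors only the `MarkItDown().convert(...)` call and the `text_content` attribute of its result.

```python
import sys
import types
from unittest import mock


def make_fake_markitdown(text: str) -> types.ModuleType:
    """Build a stand-in 'markitdown' module exposing a minimal MarkItDown class."""
    fake = types.ModuleType("markitdown")

    class MarkItDown:
        def convert(self, source):
            # Mimic only the attribute this sketch reads from the result.
            return types.SimpleNamespace(text_content=text)

    fake.MarkItDown = MarkItDown
    return fake


def convert_with_markitdown(path: str) -> str:
    """Hypothetical adapter body: import lazily so the fake can be swapped in."""
    from markitdown import MarkItDown
    return MarkItDown().convert(path).text_content


# patch.dict restores sys.modules when the block exits.
with mock.patch.dict(sys.modules, {"markitdown": make_fake_markitdown("# Title")}):
    result = convert_with_markitdown("report.docx")
```

The lazy import inside the function matters: a module-level `import markitdown` in the adapter would run before the test has a chance to install the fake.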

The first round of benchmarks only generated PDFs, which is the format where markitdown is *least* differentiated. This adds three synthetic non-PDF documents so the harness can show markitdown's strengths:

- DOCX built with stdlib zipfile + minimal Office Open XML (no python-docx dependency).
- HTML page with heading and paragraphs.
- CSV with five rows; ground truth is the canonical Markdown table so CER/WER measure formatting fidelity rather than how cells are joined.

Engines are now filtered per-doc by ``supported_extensions`` so PyMuPDF and OCR engines stop running on Office/web/tabular fixtures.

Local results (pymupdf + markitdown only):

Engine	Avg time	Avg CER	Avg WER	Errors
pymupdf	4.4 ms	0.0000	0.0000	0
markitdown	47.0 ms	0.0343	0.1726	0

DOCX / HTML / CSV all score CER=0.0 / WER=0 on markitdown; on a host with docling / mineru / marker / unstructured installed the comparison table will be richer.

https://claude.ai/code/session_01FwNghFepN7YHe4yeU5A17L
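
A DOCX fixture built from stdlib zipfile and minimal Office Open XML might look roughly like this. It is a sketch under assumptions: `build_docx` is a hypothetical helper rather than the function in benchmark.py, and the package holds only the three parts a bare WordprocessingML document needs.

```python
import io
import zipfile
from typing import Iterable
from xml.sax.saxutils import escape

XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    XML_DECL
    + '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)

RELS = (
    XML_DECL
    + '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    "</Relationships>"
)


def build_docx(paragraphs: Iterable[str]) -> bytes:
    """Return the bytes of a minimal .docx with one run of text per paragraph."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(p)}</w:t></w:r></w:p>'
        for p in paragraphs
    )
    document = (
        f'{XML_DECL}<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", document)
    return buf.getvalue()


docx_bytes = build_docx(["Benchmark heading", "a < b & c"])
```

Because the ground-truth text is the same list of paragraphs passed in, the fixture and its expected output cannot drift apart.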

claude added 3 commits April 24, 2026 17:58

fix: wrap long lines in router priority map for ruff E501

b982feb

Mihailorama merged commit ed94fcd into main Apr 25, 2026
10 checks passed

Mihailorama deleted the claude/markitdown-benchmarks-wPIIY branch April 25, 2026 18:29
