feat: add Microsoft MarkItDown engine adapter + benchmark coverage #8

Merged
Mihailorama merged 3 commits into main from claude/markitdown-benchmarks-wPIIY
Apr 25, 2026

@Mihailorama
Owner

Summary

  • Wraps Microsoft's markitdown (PDF, Office, HTML, images, CSV/JSON/XML, ePub, audio, ZIP → Markdown) in a DocumentEngine adapter so docfold can route to it like any other engine.
  • Registers it in benchmark.py so it runs against the 7 synthetic Latin / Arabic / Hebrew / CJK PDFs alongside pymupdf, docling, mineru, marker and the OCR engines.
  • Follows the mandated TDD workflow: proposal → red tests → green implementation → E2E benchmark.
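
Routing "like any other engine" can be pictured as a lookup in per-extension priority lists. The sketch below is only an illustration: the `PRIORITY` table, its ordering, and the `pick_engine` helper are hypothetical and are not taken from docfold's `router.py`.

```python
from pathlib import Path

# Hypothetical priority table: file extension -> engine names, most preferred first.
# The ordering here is invented for illustration and is not docfold's real configuration.
PRIORITY = {
    "pdf": ["pymupdf", "docling", "markitdown"],
    "docx": ["markitdown"],
    "csv": ["markitdown"],
}


def pick_engine(path, available):
    """Return the first engine in the priority list that is actually available."""
    ext = Path(path).suffix.lower().lstrip(".")
    for name in PRIORITY.get(ext, []):
        if name in available:
            return name
    return None  # no registered engine handles this extension


print(pick_engine("report.pdf", {"pymupdf", "markitdown"}))
print(pick_engine("report.pdf", {"markitdown"}))
```

With a table like this, adding an engine means appending its name to the lists for the extensions it supports; the selection logic itself does not change.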

What's new

  • src/docfold/engines/markitdown_engine.py — adapter; sync convert() dispatched via run_in_executor; honors MARKDOWN / TEXT / HTML / JSON; defensive is_available() so a broken transitive import (e.g. cryptography PyO3) cannot kill the whole router.
  • tests/engines/test_markitdown_engine.py — 12 unit tests, mocks the markitdown module so they run without the dep installed.
  • pyproject.toml — new markitdown extras and inclusion in [all].
  • src/docfold/engines/router.py — markitdown registered in extension priority lists for PDF, Office, HTML, CSV/JSON/XML, images, ePub, audio, ZIP.
  • benchmark.py: MarkItDownEngine() added to the candidate list.
  • README.md, CHANGELOG.md, docs/tasks/MARKITDOWN_ENGINE.md — engine tables, unreleased entry, feature proposal.
  • docs/benchmark_results.json — refreshed with markitdown numbers.
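
A minimal sketch of the adapter shape described above, assuming a simplified interface. `MarkItDownSketch`, `SketchResult`, and the injectable `converter` argument are hypothetical names for illustration; docfold's real `DocumentEngine` base class and result type may look different, and output-format handling is left out. The only markitdown calls used are `MarkItDown().convert(path)` and the result's `text_content` attribute.

```python
import asyncio
import importlib
from dataclasses import dataclass


@dataclass
class SketchResult:
    """Hypothetical result container, not docfold's real result type."""
    content: str
    engine: str


class MarkItDownSketch:
    """Illustrative adapter; the real DocumentEngine interface may differ."""

    name = "markitdown"

    def __init__(self, converter=None):
        # An injected converter lets the sketch run without markitdown installed.
        self._converter = converter

    def is_available(self) -> bool:
        if self._converter is not None:
            return True
        try:
            importlib.import_module("markitdown")
        except Exception:
            # Deliberately broader than ImportError: a transitive import can fail
            # in other ways. How broadly the real adapter catches is an assumption.
            return False
        return True

    def _convert_sync(self, path: str) -> str:
        converter = self._converter
        if converter is None:
            from markitdown import MarkItDown  # imported lazily, only when needed
            converter = MarkItDown()
        return converter.convert(path).text_content

    async def process(self, path: str) -> SketchResult:
        # convert() is synchronous, so run it in the default thread pool
        # instead of blocking the event loop.
        loop = asyncio.get_running_loop()
        text = await loop.run_in_executor(None, self._convert_sync, path)
        return SketchResult(content=text, engine=self.name)
```

Keeping the import inside `is_available()` and `_convert_sync()` means that merely importing the adapter module never triggers the markitdown import chain.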

Benchmark results (this run)

Only pymupdf and markitdown were available on the build host; comparative numbers vs docling/mineru/marker/surya require a host with those deps installed.

  Engine        Avg time   Avg CER   Avg WER
  pymupdf         3.3 ms    0.0000    0.0000
  markitdown     50.7 ms    0.0478    0.2415

markitdown is perfect (CER=WER=0) on Latin-script PDFs; measurable degradation on Arabic / Hebrew / CJK is expected — markitdown does not do layout analysis or RTL reshaping.
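
For readers unfamiliar with the metrics: CER and WER are usually defined as the edit distance between the ground truth and the engine output, divided by the length of the ground truth, counted in characters or words respectively. The sketch below uses that textbook definition; whether benchmark.py normalizes whitespace, case, or Unicode before scoring is not shown in this PR, so treat it as an approximation.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits per reference character."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)


print(cer("abcd", "abxd"))                   # one substitution over four characters
print(wer("the quick fox", "the slow fox"))  # one substitution over three words
```

Because a single wrong character invalidates a whole word, WER is typically much higher than CER on the same output, which matches the gap between the two columns above.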

Test plan

  • pytest tests/ — 322 passed
  • python benchmark.py — runs end-to-end with markitdown installed; report written to docs/benchmark_results.json
  • CI green
  • Re-run benchmark.py on a host with docling/mineru/marker/surya installed for full comparison

https://claude.ai/code/session_01FwNghFepN7YHe4yeU5A17L


Generated by Claude Code

claude added 3 commits April 24, 2026 17:58
Wraps markitdown (https://github.com/microsoft/markitdown) in a
DocumentEngine adapter so docfold can route to it like any other engine.
Registers it in benchmark.py so it runs against the 7 synthetic Latin /
Arabic / Hebrew / CJK PDFs alongside pymupdf, docling, mineru, marker
and the OCR engines.

- src/docfold/engines/markitdown_engine.py — adapter; dispatches sync
  convert() via run_in_executor, honors MARKDOWN / TEXT / HTML / JSON.
- tests/engines/test_markitdown_engine.py — 12 unit tests, mocks the
  markitdown module so tests run without the dep installed.
- pyproject.toml — markitdown extras (+ included in [all]).
- router.py — adds markitdown to priority lists for its formats.
- benchmark.py — registers MarkItDownEngine() in the candidate list.
- README.md / CHANGELOG.md — engine tables + unreleased entry.
- docs/tasks/MARKITDOWN_ENGINE.md — feature proposal (TDD workflow).
- docs/benchmark_results.json — refreshed with markitdown numbers.
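
One way to mock the markitdown module so tests run without the dependency is to place a stand-in module in `sys.modules` for the duration of the test. This is a sketch of that general technique, not the contents of test_markitdown_engine.py; `convert_with_markitdown` is a hypothetical stand-in for the adapter code under test.

```python
import sys
import types
from unittest import mock


def convert_with_markitdown(path):
    """Hypothetical code under test: imports markitdown lazily and converts."""
    from markitdown import MarkItDown
    return MarkItDown().convert(path).text_content


# Build a fake "markitdown" module exposing only what the code under test touches.
fake_instance = mock.Mock()
fake_instance.convert.return_value = mock.Mock(text_content="# Hello")
fake_module = types.ModuleType("markitdown")
fake_module.MarkItDown = mock.Mock(return_value=fake_instance)

# patch.dict restores sys.modules on exit, so the fake does not leak into other tests.
with mock.patch.dict(sys.modules, {"markitdown": fake_module}):
    output = convert_with_markitdown("report.docx")

print(output)
```

The lazy import inside the function is what makes this work: the `from markitdown import MarkItDown` statement resolves against `sys.modules` at call time, while the patch is active.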

The first round of benchmarks only generated PDFs, which is the format
where markitdown is *least* differentiated.  This adds three synthetic
non-PDF documents so the harness can show markitdown's strengths:

- DOCX built with stdlib zipfile + minimal Office Open XML (no python-docx
  dependency).
- HTML page with heading and paragraphs.
- CSV with five rows; ground truth is the canonical Markdown table so
  CER/WER measure formatting fidelity rather than how cells are joined.
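
As an illustration of the DOCX fixture approach, a document with plain paragraphs needs only three parts inside the ZIP container: the content-types manifest, the package relationships, and the main document XML. The sketch below shows one minimal layout; the `build_docx` helper is a hypothetical name, and the fixture in benchmark.py may include more parts or different text.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>'''

RELS = '''<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>'''


def build_docx(paragraphs) -> bytes:
    """Return the bytes of a minimal .docx with one plain run per paragraph."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(p)}</w:t></w:r></w:p>'
        for p in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", document)
    return buf.getvalue()


docx_bytes = build_docx(["Hello & welcome", "Second paragraph"])
print(len(docx_bytes) > 0)
```

Since the paragraph text is supplied by the caller, the same strings can serve as the ground truth for CER/WER without parsing the file back.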

Engines are now filtered per-doc by ``supported_extensions`` so PyMuPDF
and OCR engines stop running on Office/web/tabular fixtures.
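
The per-document filter can be sketched as follows, assuming each engine exposes `supported_extensions` as a set of lowercase extensions without the leading dot. That representation, the `SketchEngine` class, and the `engines_for` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SketchEngine:
    """Hypothetical stand-in for an engine with a supported_extensions attribute."""
    name: str
    supported_extensions: frozenset


def engines_for(doc_path, engines):
    """Keep only the engines that declare support for the document's extension."""
    ext = Path(doc_path).suffix.lower().lstrip(".")
    return [e for e in engines if ext in e.supported_extensions]


candidates = [
    SketchEngine("pymupdf", frozenset({"pdf"})),
    SketchEngine("markitdown", frozenset({"pdf", "docx", "html", "csv"})),
]

for doc in ["latin.pdf", "sample.docx", "table.csv"]:
    print(doc, [e.name for e in engines_for(doc, candidates)])
```

Filtering before the run keeps unsupported formats out of an engine's error count and averages, so each row of the results table reflects only documents the engine claims to handle.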

Local results (pymupdf + markitdown only):

  Engine        Avg time   Avg CER   Avg WER   Errors
  pymupdf         4.4ms     0.0000    0.0000      0
  markitdown     47.0ms     0.0343    0.1726      0

DOCX / HTML / CSV all score CER=0.0 / WER=0 on markitdown; on a host with
docling / mineru / marker / unstructured installed the comparison table
will be richer.

https://claude.ai/code/session_01FwNghFepN7YHe4yeU5A17L
@Mihailorama Mihailorama merged commit ed94fcd into main Apr 25, 2026
10 checks passed
@Mihailorama Mihailorama deleted the claude/markitdown-benchmarks-wPIIY branch April 25, 2026 18:29