feat: add Microsoft MarkItDown engine adapter + benchmark coverage #8
Merged
Wraps markitdown (https://github.com/microsoft/markitdown) in a DocumentEngine adapter so docfold can route to it like any other engine. Registers it in benchmark.py so it runs against the 7 synthetic Latin / Arabic / Hebrew / CJK PDFs alongside pymupdf, docling, mineru, marker and the OCR engines.

- src/docfold/engines/markitdown_engine.py: adapter; dispatches sync convert() via run_in_executor, honors MARKDOWN / TEXT / HTML / JSON.
- tests/engines/test_markitdown_engine.py: 12 unit tests; mocks the markitdown module so tests run without the dep installed.
- pyproject.toml: markitdown extras (also included in [all]).
- router.py: adds markitdown to priority lists for its formats.
- benchmark.py: registers MarkItDownEngine() in the candidate list.
- README.md / CHANGELOG.md: engine tables + unreleased entry.
- docs/tasks/MARKITDOWN_ENGINE.md: feature proposal (TDD workflow).
- docs/benchmark_results.json: refreshed with markitdown numbers.
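As a rough illustration of the "sync convert() via run_in_executor" dispatch described above, here is a minimal sketch. It is not the adapter from this PR: `convert_async` and `fake_sync_convert` are hypothetical names, and the stand-in converter replaces the real markitdown call so the snippet runs without the dependency. In a real adapter the blocking callable would presumably wrap something like `MarkItDown().convert(path).text_content`.

```python
import asyncio
from functools import partial
from typing import Callable


def fake_sync_convert(path: str) -> str:
    """Hypothetical stand-in for a blocking converter (not the markitdown API)."""
    return f"# Converted\n\nsource: {path}"


async def convert_async(sync_convert: Callable[[str], str], path: str) -> str:
    """Run a blocking convert function without stalling the event loop."""
    loop = asyncio.get_running_loop()
    # Passing None selects the loop's default ThreadPoolExecutor, so the
    # blocking call runs on a worker thread while the loop keeps serving
    # other coroutines.
    return await loop.run_in_executor(None, partial(sync_convert, path))


markdown = asyncio.run(convert_async(fake_sync_convert, "report.docx"))
print(markdown)
```

Taking the converter as a parameter also keeps the sketch testable without markitdown installed, which mirrors the mocking approach the unit tests in this PR are described as using.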
The first round of benchmarks only generated PDFs, which is the format where markitdown is *least* differentiated. This adds three synthetic non-PDF documents so the harness can show markitdown's strengths:

- DOCX built with stdlib zipfile + minimal Office Open XML (no python-docx dependency).
- HTML page with heading and paragraphs.
- CSV with five rows; ground truth is the canonical Markdown table so CER/WER measure formatting fidelity rather than how cells are joined.

Engines are now filtered per-doc by ``supported_extensions`` so PyMuPDF and OCR engines stop running on Office/web/tabular fixtures.

Local results (pymupdf + markitdown only):

| Engine     | Avg time | Avg CER | Avg WER | Errors |
|------------|----------|---------|---------|--------|
| pymupdf    | 4.4ms    | 0.0000  | 0.0000  | 0      |
| markitdown | 47.0ms   | 0.0343  | 0.1726  | 0      |

DOCX / HTML / CSV all score CER=0.0 / WER=0 on markitdown; on a host with docling / mineru / marker / unstructured installed the comparison table will be richer.

https://claude.ai/code/session_01FwNghFepN7YHe4yeU5A17L
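The DOCX fixture described above (stdlib zipfile plus minimal Office Open XML) could be built roughly as follows. This is a sketch under assumptions, not the code from benchmark.py: `build_docx` and the constant names are hypothetical, and only the three package parts a bare WordprocessingML document typically needs are written.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# Package-level parts required by the Open Packaging Conventions.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def build_docx(paragraphs: list[str]) -> bytes:
    """Return the bytes of a minimal .docx with one run per paragraph."""
    body = "".join(
        f'<w:p><w:r><w:t xml:space="preserve">{escape(text)}</w:t></w:r></w:p>'
        for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", RELS)
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()


docx_bytes = build_docx(["Hello, docfold", "Second paragraph & more"])
```

Because the paragraph text is known exactly, the same list of strings can serve as the ground truth when scoring an engine's output for this fixture.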
Summary

- Wraps markitdown (PDF, Office, HTML, images, CSV/JSON/XML, ePub, audio, ZIP → Markdown) in a DocumentEngine adapter so docfold can route to it like any other engine.
- Registers it in benchmark.py so it runs against the 7 synthetic Latin / Arabic / Hebrew / CJK PDFs alongside pymupdf, docling, mineru, marker and the OCR engines.

What's new
- src/docfold/engines/markitdown_engine.py: adapter; sync convert() dispatched via run_in_executor; honors MARKDOWN / TEXT / HTML / JSON; defensive is_available() so a broken transitive import (e.g. cryptography PyO3) cannot kill the whole router.
- tests/engines/test_markitdown_engine.py: 12 unit tests; mocks the markitdown module so they run without the dep installed.
- pyproject.toml: new markitdown extras and inclusion in [all].
- src/docfold/engines/router.py: markitdown registered in extension priority lists for PDF, Office, HTML, CSV/JSON/XML, images, ePub, audio, ZIP.
- benchmark.py: MarkItDownEngine() added to the candidate list.
- README.md, CHANGELOG.md, docs/tasks/MARKITDOWN_ENGINE.md: engine tables, unreleased entry, feature proposal.
- docs/benchmark_results.json: refreshed with markitdown numbers.

Benchmark results (this run)
Only pymupdf and markitdown were available on the build host; comparative numbers vs docling/mineru/marker/surya require a host with those deps installed.

markitdown is perfect (CER=WER=0) on Latin-script PDFs; measurable degradation on Arabic / Hebrew / CJK is expected, since markitdown does not do layout analysis or RTL reshaping.
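For readers unfamiliar with the metrics quoted above, CER and WER are conventionally the Levenshtein edit distance between reference and engine output, divided by the reference length, computed over characters and whitespace-separated words respectively. The sketch below shows that standard definition; whether benchmark.py normalizes text first or uses a library is not stated in this PR, so the function names and details here are assumptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits per reference character."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


print(cer("kitten", "sitting"))                       # 3 edits / 6 chars = 0.5
print(wer("the quick brown fox", "the quick fox"))    # 1 edit / 4 words = 0.25
```

Under this definition a score of 0 means the output matches the ground truth exactly, which is why reordered or unshaped RTL text shows up as a nonzero CER even when every glyph is present.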
Test plan
- pytest tests/: 322 passed
- python benchmark.py: runs end-to-end with markitdown installed; report written to docs/benchmark_results.json
- Still to do: run benchmark.py on a host with docling/mineru/marker/surya installed for full comparison

https://claude.ai/code/session_01FwNghFepN7YHe4yeU5A17L
Generated by Claude Code