High-fidelity Word (.docx) to Markdown for documents where citations, tables, footnotes, and structure need to survive conversion.
Most DOCX-to-Markdown tools do fine on simple prose, then fall over on the details that matter in real reports and papers. docx2md-cli exists to preserve Word-specific structure, such as field-code references, bibliography content controls, vertically merged tables, inline footnotes, and list numbering, so that the output needs minimal cleanup after conversion.

| Feature | docx2md-cli | Pandoc | MarkItDown | mammoth |
|---|---|---|---|---|
| Bold / Italic / Underline | ✅ | ✅ | ❌ | ✅ |
| Footnotes (inline position) | ✅ | ✅ | ❌ | ✅ |
| Field codes ([N] refs) | ✅ | Partial | ❌ | ❌ |
| Bibliography (SDT) | ✅ | ❌ | ❌ | ❌ |
| Vertical merge (vMerge) | ✅ | ❌ | ❌ | ❌ |
| Split table detection | ✅ | ❌ | ❌ | ❌ |
| Numbered list distinction | ✅ | ✅ | ❌ | ❌ |
| Nested list levels | ✅ | ✅ | ❌ | ❌ |
| Image extraction + rename | ✅ | ✅ | ❌ | ❌ |
| YAML frontmatter | ✅ | ❌ | ❌ | ❌ |

    pip install docx2md-cli
    docx2md input.docx
    docx2md input.docx -o output.md

Optional frontmatter support:

    pip install "docx2md-cli[frontmatter]"

Basic usage:

    docx2md input.docx
    docx2md input.docx -o output.md --extract-images images
    docx2md input.docx --skip-before-heading --no-frontmatter

All flags:

| Flag | Description | Example |
|---|---|---|
| `-o, --output PATH` | Write Markdown to PATH. Use `-` for stdout. | `docx2md input.docx -o output.md` |
| `--extract-images DIR` | Extract embedded images and link them in Markdown. | `docx2md input.docx --extract-images images` |
| `--skip-before-heading` | Ignore content before the first real Word heading. | `docx2md input.docx --skip-before-heading` |
| `--frontmatter FILE` | Prepend custom YAML frontmatter from a file. | `docx2md input.docx --frontmatter meta.yaml` |
| `--no-frontmatter` | Disable both auto and custom frontmatter. | `docx2md input.docx --no-frontmatter` |
| `-q, --quiet` | Suppress stats output. | `docx2md input.docx -q` |
| `--json-stats` | Emit machine-readable stats JSON. | `docx2md input.docx --json-stats` |
| `-v, --version` | Print the installed version. | `docx2md --version` |

Streaming examples:

    cat input.docx | docx2md - -o -
    docx2md input.docx --json-stats
    docx2md input.docx -o - --no-frontmatter

Python API:

    from docx2md_cli import convert

    result = convert(
        "input.docx",
        output_path="output.md",
        images_dir="images",
        skip_before_heading=False,
        frontmatter_path=None,
        frontmatter_dict=None,
        no_frontmatter=False,
        print_stats=True,
        json_stats=False,
    )

Parameters:

| Parameter | Type | Description |
|---|---|---|
| `input_path` | `str \| bytes \| …` | |
| `output_path` | `str \| None` | Write Markdown to this path. |
| `images_dir` | `str \| None` | Extract embedded images to this directory and link them in Markdown. |
| `skip_before_heading` | `bool` | Skip cover pages or prefatory content before the first Heading N paragraph. |
| `frontmatter_path` | `str \| None` | Prepend custom YAML frontmatter from this file. |
| `frontmatter_dict` | `dict \| None` | |
| `no_frontmatter` | `bool` | Disable frontmatter generation. |
| `print_stats` | `bool` | Emit conversion stats when writing output. |
| `json_stats` | `bool` | Emit stats as JSON instead of human-readable text. |
| `stats_stream` | `TextIO \| None` | |

Return value:

    print(result.lines[:3])
    print(result.stats["table_rows"])
    print(result.as_json())

`convert()` returns `ConvertResult`, which is list-like for backward compatibility and also exposes `.lines`, `.stats`, and `.as_json()`.

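
To illustrate what "list-like for backward compatibility" can look like, here is a minimal sketch of a result type that still behaves as a list of lines while carrying stats. `SketchResult` is a hypothetical name and this is not the library's actual `ConvertResult`; only the `.lines`, `.stats`, and `.as_json()` names come from the description above, and the JSON layout is an assumption.

```python
import json


class SketchResult(list):
    """Hypothetical stand-in for a list-like result object."""

    def __init__(self, lines, stats):
        super().__init__(lines)   # older callers can keep indexing and iterating
        self.stats = dict(stats)  # counters such as "table_rows"

    @property
    def lines(self):
        # Plain-list copy of the Markdown lines.
        return list(self)

    def as_json(self):
        # Assumed layout: one object holding both lines and stats.
        return json.dumps({"lines": self.lines, "stats": self.stats})


result = SketchResult(["# Title", "", "Body text"], {"table_rows": 0})
```

Subclassing `list` is one way to let code written against a plain list of lines keep working while newer code reads the extra attributes.
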
Use stdout-friendly and machine-readable modes when chaining tools:

    docx2md input.docx --json-stats
    docx2md input.docx -q -o output.md
    cat input.docx | docx2md - -o -

The same from Python:

    from docx2md_cli import convert

    result = convert("input.docx", print_stats=False, no_frontmatter=True)
    stats = result.stats
    payload = result.as_json()

- `--quiet` avoids human-oriented console output.
- `--json-stats` gives structured stats for automation.
- `-o -` writes Markdown to stdout.
- `ConvertResult` lets agents inspect lines and counters without reparsing terminal output.

Caption matching currently recognizes:

- Spanish: `Figura`, `Tabla`
- English: `Figure`, `Table`
- French: `Tableau`
- German: `Abbildung`, `Tabelle`
- Portuguese: `Tabela`
- Italian: `Tabella`

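
The keyword list above could be turned into a matcher along these lines. This is a sketch, not the tool's actual implementation: the regular expression, the `match_caption` helper, and the assumption that a caption is "keyword, number, optional title" are all illustrative.

```python
import re

# Caption keywords taken from the list above; the pattern itself is an assumption.
CAPTION_WORDS = [
    "Figura", "Tabla", "Figure", "Table", "Tableau",
    "Abbildung", "Tabelle", "Tabela", "Tabella",
]

# Longest keywords first so a shorter keyword never shadows a longer one.
_alternatives = "|".join(
    sorted(map(re.escape, CAPTION_WORDS), key=len, reverse=True)
)
CAPTION_RE = re.compile(rf"^\s*({_alternatives})\s+(\d+)\b[.:]?\s*(.*)$")


def match_caption(text):
    """Return (keyword, number, title) for a caption-like line, else None."""
    m = CAPTION_RE.match(text)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), m.group(3).strip()
```

For example, `match_caption("Tabelle 12: Ergebnisse")` yields the keyword, the number, and the title, while ordinary prose such as `"Tables are hard"` returns `None` because no number follows the keyword.
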
Word heading detection intentionally follows the standard Heading N style names.
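
As a rough illustration of heading detection by style name and of what `--skip-before-heading` implies, the sketch below works on `(style_name, text)` pairs. The helper names, the pair representation, and the exact `Heading 1` to `Heading 9` spelling are assumptions, not the converter's real internals.

```python
import re

# Matches style names "Heading 1" .. "Heading 9" (assumed exact spelling).
HEADING_STYLE_RE = re.compile(r"^Heading ([1-9])$")


def heading_level(style_name):
    """Return the heading level for a style name, or None for non-headings."""
    m = HEADING_STYLE_RE.match(style_name or "")
    return int(m.group(1)) if m else None


def skip_before_first_heading(paragraphs):
    """Drop (style_name, text) pairs that come before the first heading."""
    for i, (style, _text) in enumerate(paragraphs):
        if heading_level(style) is not None:
            return paragraphs[i:]
    # No heading found: keep everything (a design assumption of this sketch).
    return paragraphs


paras = [
    ("Title", "Cover page"),
    ("Normal", "Draft, do not distribute"),
    ("Heading 1", "Introduction"),
    ("Normal", "Body text"),
]
kept = skip_before_first_heading(paras)
```

Here `kept` starts at the `Heading 1` paragraph, so the cover-page material is dropped.
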
The converter walks the Word document body in order instead of flattening everything to plain text. A field-code state machine preserves citation references, numbering.xml is read directly to distinguish ordered vs unordered lists and nested levels, and the table walker handles vMerge and split-table cases before emitting Markdown. Footnotes are collected from footnotes.xml, bibliography SDTs are extracted, and image filenames can be derived from nearby captions.
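
The field-code idea can be sketched as a small state machine over the runs of one paragraph. This is a simplified illustration, not the converter's actual code: it relies on the WordprocessingML `w:fldChar` markers (`begin`, `separate`, `end`), ignores nested fields and runs inside hyperlinks, and the `paragraph_text` helper and the `_Ref123` bookmark are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_text(p):
    """Flatten one <w:p>, keeping only the visible result of each field."""
    # "text": outside a field, "instr": inside the instruction, "result": inside the result
    state = "text"
    out = []
    for run in p.iterfind("w:r", NS):
        for node in run:
            tag = node.tag.split("}")[-1]
            if tag == "fldChar":
                kind = node.get(f"{{{W}}}fldCharType")
                if kind == "begin":
                    state = "instr"
                elif kind == "separate":
                    state = "result"
                elif kind == "end":
                    state = "text"
            elif tag == "t" and state != "instr":
                # Visible text: ordinary runs and the field result, e.g. "[3]".
                out.append(node.text or "")
            # <w:instrText> (the field instruction) is never emitted.
    return "".join(out)


SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">See </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> REF _Ref123 </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>[3]</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t xml:space="preserve"> for details.</w:t></w:r>
</w:p>
"""

text = paragraph_text(ET.fromstring(SAMPLE.strip()))
```

Tracking the state explicitly keeps the instruction text out of the output while the displayed reference stays in its original position in the sentence.
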
Issues welcome. PRs welcome. Run pytest before submitting.
MIT