docx2md-cli

High-fidelity Word (.docx) to Markdown for documents where citations, tables, footnotes, and structure need to survive conversion.

Why This Exists

Most DOCX-to-Markdown tools handle simple prose well, then fall over on the details that matter in real reports and papers. docx2md-cli preserves Word-specific structure (field-code references, bibliography content controls, vertically merged tables, inline footnotes, and list numbering) so that the output needs minimal cleanup after conversion.

Feature Comparison

Feature	docx2md-cli	Pandoc	MarkItDown	mammoth
Bold / Italic / Underline	✅	✅	❌	✅
Footnotes (inline position)	✅	✅	❌	✅
Field codes (`[N]` refs)	✅	Partial	❌	❌
Bibliography (SDT)	✅	❌	❌	❌
Vertical merge (vMerge)	✅	❌	❌	❌
Split table detection	✅	❌	❌	❌
Numbered list distinction	✅	✅	❌	❌
Nested list levels	✅	✅	❌	❌
Image extraction + rename	✅	✅	❌	❌
YAML frontmatter	✅	❌	❌	❌

Quick Start

pip install docx2md-cli
docx2md input.docx
docx2md input.docx -o output.md

Optional frontmatter support:

pip install "docx2md-cli[frontmatter]"

CLI Reference

Basic usage:

docx2md input.docx
docx2md input.docx -o output.md --extract-images images
docx2md input.docx --skip-before-heading --no-frontmatter

All flags:

Flag	Description	Example
`-o`, `--output PATH`	Write Markdown to `PATH`. Use `-` for stdout.	`docx2md input.docx -o output.md`
`--extract-images DIR`	Extract embedded images and link them in Markdown.	`docx2md input.docx --extract-images images`
`--skip-before-heading`	Ignore content before the first real Word heading.	`docx2md input.docx --skip-before-heading`
`--frontmatter FILE`	Prepend custom YAML frontmatter from a file.	`docx2md input.docx --frontmatter meta.yaml`
`--no-frontmatter`	Disable both auto and custom frontmatter.	`docx2md input.docx --no-frontmatter`
`-q`, `--quiet`	Suppress stats output.	`docx2md input.docx -q`
`--json-stats`	Emit machine-readable stats JSON.	`docx2md input.docx --json-stats`
`-v`, `--version`	Print the installed version.	`docx2md --version`

Streaming examples:

cat input.docx | docx2md - -o -
docx2md input.docx --json-stats
docx2md input.docx -o - --no-frontmatter
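
When the CLI is driven from a script, it can help to assemble the argument list in one place. The helper below is a minimal sketch, not part of docx2md-cli: the function name `build_docx2md_command` and its keyword names are assumptions made for illustration, and only the flags listed in the table above are used.

```python
from typing import List, Optional


def build_docx2md_command(
    input_path: str,
    output: Optional[str] = None,
    images_dir: Optional[str] = None,
    skip_before_heading: bool = False,
    frontmatter: Optional[str] = None,
    no_frontmatter: bool = False,
    quiet: bool = False,
    json_stats: bool = False,
) -> List[str]:
    """Build an argv list for the docx2md CLI (hypothetical helper).

    Only flags documented in the CLI reference are emitted. Pass "-" as
    input_path or output to use stdin or stdout.
    """
    cmd = ["docx2md", input_path]
    if output is not None:
        cmd += ["-o", output]
    if images_dir is not None:
        cmd += ["--extract-images", images_dir]
    if skip_before_heading:
        cmd.append("--skip-before-heading")
    if frontmatter is not None:
        cmd += ["--frontmatter", frontmatter]
    if no_frontmatter:
        cmd.append("--no-frontmatter")
    if quiet:
        cmd.append("-q")
    if json_stats:
        cmd.append("--json-stats")
    return cmd


if __name__ == "__main__":
    # Prints the argv list; pass it to subprocess.run(cmd, check=True) to execute.
    print(build_docx2md_command("input.docx", output="-", no_frontmatter=True))
```

Returning a list rather than a single string lets the result go straight to `subprocess.run` without shell quoting.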

Python API

from docx2md_cli import convert

result = convert(
    "input.docx",
    output_path="output.md",
    images_dir="images",
    skip_before_heading=False,
    frontmatter_path=None,
    frontmatter_dict=None,
    no_frontmatter=False,
    print_stats=True,
    json_stats=False,
)

Parameters:

Parameter	Type	Description
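`input_path`	`str | bytes | …`	Source .docx document to convert.
`output_path`	`str | None`	Where to write the Markdown (CLI: `-o`).
`images_dir`	`str | None`	Directory for extracted images (CLI: `--extract-images`).
`skip_before_heading`	`bool`	Skip cover pages or prefatory content before `Heading N`.
`frontmatter_path`	`str | None`	YAML file with custom frontmatter (CLI: `--frontmatter`).
`frontmatter_dict`	`dict | None`	Custom frontmatter supplied as a dictionary.
`no_frontmatter`	`bool`	Disable frontmatter generation.
`print_stats`	`bool`	Emit conversion stats when writing output.
`json_stats`	`bool`	Emit stats as JSON instead of human-readable text.
`stats_stream`	`TextIO | None`	Stream that receives the stats output.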

Return value:

print(result.lines[:3])
print(result.stats["table_rows"])
print(result.as_json())

convert() returns ConvertResult, which is list-like for backward compatibility and also exposes .lines, .stats, and .as_json().
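
To make "list-like for backward compatibility" concrete, here is a minimal sketch of how such a result object could be shaped. It is an illustrative stand-in, not the library's actual class: the name `ConvertResultSketch`, the constructor signature, and the JSON layout produced by `as_json()` are all assumptions.

```python
import json
from typing import Any, Dict, Iterable, List


class ConvertResultSketch(list):
    """Hypothetical stand-in for a list-like conversion result.

    Subclassing list keeps older code that indexes or iterates the result
    working, while newer code can use .lines, .stats, and .as_json().
    """

    def __init__(self, lines: Iterable[str], stats: Dict[str, Any]):
        super().__init__(lines)
        self.stats = dict(stats)

    @property
    def lines(self) -> List[str]:
        # A plain-list copy of the Markdown lines.
        return list(self)

    def as_json(self) -> str:
        # Assumed layout: the real as_json() output may differ.
        return json.dumps({"lines": self.lines, "stats": self.stats})


if __name__ == "__main__":
    result = ConvertResultSketch(["# Title", "", "Body"], {"table_rows": 4})
    print(result[0])                   # list-style access
    print(result.stats["table_rows"])  # attribute-style access
    print(result.as_json())
```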

For AI Agents

Use stdout-friendly and machine-readable modes when chaining tools:

docx2md input.docx --json-stats
docx2md input.docx -q -o output.md
cat input.docx | docx2md - -o -

from docx2md_cli import convert

result = convert("input.docx", print_stats=False, no_frontmatter=True)
stats = result.stats
payload = result.as_json()

--quiet avoids human-oriented console output. --json-stats gives structured stats for automation. -o - writes Markdown to stdout. ConvertResult lets agents inspect lines and counters without reparsing terminal output.

Supported Languages

Caption matching currently recognizes:

Spanish: Figura, Tabla
English: Figure, Table
French: Tableau
German: Abbildung, Tabelle
Portuguese: Tabela
Italian: Tabella

Word heading detection intentionally follows the standard Heading N style names.
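
As a rough illustration of caption matching over the words listed above, the sketch below uses a single regular expression. It is not the converter's actual matcher: the function name `match_caption`, the case-sensitive match, and the assumption that a caption is a keyword followed by an integer are all hypothetical.

```python
import re
from typing import Optional, Tuple

# Caption keywords taken from the language list above.
CAPTION_WORDS = (
    "Figura", "Tabla",       # Spanish
    "Figure", "Table",       # English
    "Tableau",               # French
    "Abbildung", "Tabelle",  # German
    "Tabela",                # Portuguese
    "Tabella",               # Italian
)

# Assumed caption shape: keyword, whitespace, then an integer number.
CAPTION_RE = re.compile(r"^\s*(" + "|".join(CAPTION_WORDS) + r")\s+(\d+)\b")


def match_caption(text: str) -> Optional[Tuple[str, int]]:
    """Return (keyword, number) if text starts like a caption, else None."""
    m = CAPTION_RE.match(text)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


if __name__ == "__main__":
    for sample in ("Tableau 3: Résultats", "Figure 12. Overview", "Tables are useful"):
        print(sample, "->", match_caption(sample))
```

Requiring whitespace and a number after the keyword keeps ordinary sentences that merely begin with "Table" or "Figure" from being treated as captions.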

How It Works

The converter walks the Word document body in order instead of flattening everything to plain text. A field-code state machine preserves citation references, numbering.xml is read directly to distinguish ordered vs unordered lists and nested levels, and the table walker handles vMerge and split-table cases before emitting Markdown. Footnotes are collected from footnotes.xml, bibliography SDTs are extracted, and image filenames can be derived from nearby captions.
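
The field-code handling described above can be sketched as a small state machine over the runs of one paragraph. This is a simplified illustration under stated assumptions, not the converter's implementation: the function name `read_paragraph` and the sample XML are made up for the example, and nested fields are ignored. The element and attribute names (`w:fldChar`, `w:fldCharType`, `w:instrText`, `w:t`) come from WordprocessingML.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Made-up sample paragraph containing one REF field whose result is "[3]".
SAMPLE = r"""<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t xml:space="preserve">See </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> REF _Ref123 \h </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>[3]</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t xml:space="preserve"> for details.</w:t></w:r>
</w:p>"""


def q(tag: str) -> str:
    """Qualify a local tag or attribute name with the w: namespace."""
    return f"{{{W}}}{tag}"


def read_paragraph(p: ET.Element) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (visible text, [(field instruction, field result), ...]).

    States: "text" outside a field, "instr" between begin and separate,
    "result" between separate and end. Nested fields are not handled.
    """
    state = "text"
    parts: List[str] = []
    fields: List[Tuple[str, str]] = []
    instr: List[str] = []
    result: List[str] = []
    for run in p.findall("w:r", NS):
        for child in run:
            if child.tag == q("fldChar"):
                kind = child.get(q("fldCharType"))
                if kind == "begin":
                    state, instr, result = "instr", [], []
                elif kind == "separate":
                    state = "result"
                elif kind == "end":
                    fields.append(("".join(instr).strip(), "".join(result)))
                    state = "text"
            elif child.tag == q("instrText") and state == "instr":
                instr.append(child.text or "")
            elif child.tag == q("t") and state != "instr":
                if state == "result":
                    result.append(child.text or "")
                parts.append(child.text or "")
    return "".join(parts), fields


if __name__ == "__main__":
    text, fields = read_paragraph(ET.fromstring(SAMPLE))
    print(text)
    print(fields)
```

Keeping the field result in the visible text while recording the instruction separately is what lets a `[N]` reference survive into Markdown without leaking the raw field code.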

Contributing

Issues welcome. PRs welcome. Run pytest before submitting.

License

MIT
