gonzalopezgil/docx2md-cli

docx2md-cli

High-fidelity Word (.docx) to Markdown for documents where citations, tables, footnotes, and structure need to survive conversion.

Why This Exists

Most DOCX-to-Markdown tools do fine on simple prose, then fall over on the details that matter in real reports and papers. docx2md-cli exists to preserve Word-specific structure, such as field-code references, bibliography content controls, vertically merged tables, inline footnotes, and list numbering, so that the output needs minimal cleanup after conversion.

Feature Comparison

Feature docx2md-cli Pandoc MarkItDown mammoth
Bold / Italic / Underline
Footnotes (inline position)
Field codes ([N] refs) Partial
Bibliography (SDT)
Vertical merge (vMerge)
Split table detection
Numbered list distinction
Nested list levels
Image extraction + rename
YAML frontmatter

Quick Start

pip install docx2md-cli
docx2md input.docx
docx2md input.docx -o output.md

Optional frontmatter support:

pip install "docx2md-cli[frontmatter]"

CLI Reference

Basic usage:

docx2md input.docx
docx2md input.docx -o output.md --extract-images images
docx2md input.docx --skip-before-heading --no-frontmatter

All flags:

| Flag | Description | Example |
| --- | --- | --- |
| `-o, --output PATH` | Write Markdown to PATH. Use `-` for stdout. | `docx2md input.docx -o output.md` |
| `--extract-images DIR` | Extract embedded images and link them in Markdown. | `docx2md input.docx --extract-images images` |
| `--skip-before-heading` | Ignore content before the first real Word heading. | `docx2md input.docx --skip-before-heading` |
| `--frontmatter FILE` | Prepend custom YAML frontmatter from a file. | `docx2md input.docx --frontmatter meta.yaml` |
| `--no-frontmatter` | Disable both auto and custom frontmatter. | `docx2md input.docx --no-frontmatter` |
| `-q, --quiet` | Suppress stats output. | `docx2md input.docx -q` |
| `--json-stats` | Emit machine-readable stats JSON. | `docx2md input.docx --json-stats` |
| `-v, --version` | Print the installed version. | `docx2md --version` |

Streaming examples:

cat input.docx | docx2md - -o -
docx2md input.docx --json-stats
docx2md input.docx -o - --no-frontmatter
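
The streaming form can also be driven from a script. The following is a minimal sketch, not part of docx2md-cli: the helper names `build_docx2md_command` and `convert_bytes` are hypothetical, only flags listed in the table above are used, and it assumes the Markdown written to stdout is UTF-8.

```python
import subprocess
from typing import List, Optional


def build_docx2md_command(
    input_path: str = "-",
    output_path: Optional[str] = "-",
    images_dir: Optional[str] = None,
    quiet: bool = True,
    no_frontmatter: bool = False,
) -> List[str]:
    """Assemble an argument list from flags documented in the CLI reference."""
    cmd = ["docx2md", input_path]
    if output_path is not None:
        cmd += ["-o", output_path]          # "-" means stdout
    if images_dir is not None:
        cmd += ["--extract-images", images_dir]
    if quiet:
        cmd.append("--quiet")               # keep stats out of the captured output
    if no_frontmatter:
        cmd.append("--no-frontmatter")
    return cmd


def convert_bytes(docx_bytes: bytes) -> str:
    """Pipe DOCX bytes through the CLI and return the Markdown text.

    Requires docx2md on PATH. Decoding stdout as UTF-8 is an assumption.
    """
    proc = subprocess.run(
        build_docx2md_command(),            # docx2md - -o - --quiet
        input=docx_bytes,
        capture_output=True,
        check=True,                         # raise if the CLI exits non-zero
    )
    return proc.stdout.decode("utf-8")
```

Building the argument list separately from running it keeps the command easy to log or inspect before anything is executed.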

Python API

from docx2md_cli import convert

result = convert(
    "input.docx",
    output_path="output.md",
    images_dir="images",
    skip_before_heading=False,
    frontmatter_path=None,
    frontmatter_dict=None,
    no_frontmatter=False,
    print_stats=True,
    json_stats=False,
)

Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| `input_path` | `str \| bytes \| …` | |
| `output_path` | `str \| None` | |
| `images_dir` | `str \| None` | |
| `skip_before_heading` | `bool` | Skip cover pages or prefatory content before Heading N. |
| `frontmatter_path` | `str \| None` | |
| `frontmatter_dict` | `dict \| None` | |
| `no_frontmatter` | `bool` | Disable frontmatter generation. |
| `print_stats` | `bool` | Emit conversion stats when writing output. |
| `json_stats` | `bool` | Emit stats as JSON instead of human-readable text. |
| `stats_stream` | `TextIO \| None` | |

Return value:

print(result.lines[:3])
print(result.stats["table_rows"])
print(result.as_json())

convert() returns ConvertResult, which is list-like for backward compatibility and also exposes .lines, .stats, and .as_json().
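
To illustrate what "list-like for backward compatibility" can mean in practice, here is a hypothetical stand-in class. It is not the library's `ConvertResult`; the name `ListLikeResult`, the constructor, and the shape of the JSON payload are all assumptions made for this sketch.

```python
import json
from typing import Any, Dict, Iterable, List, Optional


class ListLikeResult(list):
    """Hypothetical stand-in, NOT docx2md-cli's actual ConvertResult."""

    def __init__(self, lines: Iterable[str], stats: Optional[Dict[str, Any]] = None):
        super().__init__(lines)             # the object itself behaves as a list of lines
        self.stats: Dict[str, Any] = dict(stats or {})

    @property
    def lines(self) -> List[str]:
        """Return the Markdown lines as a plain list."""
        return list(self)

    def as_json(self) -> str:
        """Serialize lines and stats; this payload layout is an assumption."""
        return json.dumps({"lines": self.lines, "stats": self.stats})


result = ListLikeResult(["# Title", "", "Body text"], {"table_rows": 4})

# Older callers that treated the return value as a list keep working.
for line in result[:3]:
    print(line)

# Newer callers can use the richer attributes instead.
print(result.stats["table_rows"])
```

Subclassing `list` is one simple way to get this behavior: indexing, slicing, and iteration continue to work while extra attributes ride along.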

For AI Agents

Use stdout-friendly and machine-readable modes when chaining tools:

docx2md input.docx --json-stats
docx2md input.docx -q -o output.md
cat input.docx | docx2md - -o -

from docx2md_cli import convert

result = convert("input.docx", print_stats=False, no_frontmatter=True)
stats = result.stats
payload = result.as_json()

--quiet avoids human-oriented console output. --json-stats gives structured stats for automation. -o - writes Markdown to stdout. ConvertResult lets agents inspect lines and counters without reparsing terminal output.
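
When an agent captures the output of `--json-stats`, it may need to pick the JSON object out of surrounding text. The sketch below assumes the stats arrive as a single JSON object somewhere in the captured output. Which stream the CLI writes to is not stated here, and the helper name `parse_json_stats` is hypothetical.

```python
import json
from typing import Any, Dict


def parse_json_stats(text: str) -> Dict[str, Any]:
    """Return the first JSON object found in `text`, ignoring text around it."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and tolerates trailing content after it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            start = text.find("{", start + 1)   # try the next candidate brace
    raise ValueError("no JSON object found in output")
```

With captured output such as `'done\n{"table_rows": 12}\n'`, the function returns a dict whose `table_rows` entry is `12`. Any other keys depend on what the tool actually emits.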

Supported Languages

Caption matching currently recognizes:

  • Spanish: Figura, Tabla
  • English: Figure, Table
  • French: Tableau
  • German: Abbildung, Tabelle
  • Portuguese: Tabela
  • Italian: Tabella

Word heading detection intentionally follows the standard Heading N style names.
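
As a rough illustration of keyword-based caption matching and style-name heading detection, the sketch below uses the keywords listed above. The exact patterns docx2md-cli applies are not documented here. The case-sensitive caption prefix, the required number after the keyword, and the tolerant `Heading N` spelling are all assumptions.

```python
import re
from typing import Optional, Tuple

# Keywords taken from the list above.
CAPTION_KEYWORDS = (
    "Figura", "Tabla", "Figure", "Table", "Tableau",
    "Abbildung", "Tabelle", "Tabela", "Tabella",
)

# Longest keywords first, so "Tableau" is tried before "Table".
_ALTERNATIVES = "|".join(sorted(CAPTION_KEYWORDS, key=len, reverse=True))
CAPTION_RE = re.compile(r"^\s*(" + _ALTERNATIVES + r")\s+(\d+)\b")

# Assumption: accept "Heading 1", "heading 1", and "Heading1" for levels 1-9.
HEADING_STYLE_RE = re.compile(r"^heading\s?([1-9])$", re.IGNORECASE)


def match_caption(text: str) -> Optional[Tuple[str, int]]:
    """Return (keyword, number) if the text starts like a caption, else None."""
    m = CAPTION_RE.match(text)
    if m is None:
        return None
    return m.group(1), int(m.group(2))


def heading_level(style_name: str) -> Optional[int]:
    """Return the heading level for a 'Heading N' style name, else None."""
    m = HEADING_STYLE_RE.match(style_name.strip())
    return int(m.group(1)) if m else None
```

Under these assumptions, a localized style name such as `Título 1` would not be treated as a heading, which matches the note that detection follows the standard Heading N names.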

How It Works

The converter walks the Word document body in order instead of flattening everything to plain text. A field-code state machine preserves citation references, numbering.xml is read directly to distinguish ordered vs unordered lists and nested levels, and the table walker handles vMerge and split-table cases before emitting Markdown. Footnotes are collected from footnotes.xml, bibliography SDTs are extracted, and image filenames can be derived from nearby captions.
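
The field-code idea can be sketched with the standard library alone. This is an illustrative state machine over WordprocessingML complex fields (`w:fldChar` with `begin`, `separate`, and `end`), not the converter's actual implementation. The function name `extract_fields` is hypothetical, and nested fields are deliberately not handled.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def extract_fields(paragraph: ET.Element) -> List[Tuple[str, str]]:
    """Return (instruction, displayed_text) for each complex field in a paragraph."""
    fields: List[Tuple[str, str]] = []
    state = "text"                      # text -> instr -> result -> text
    instr: List[str] = []
    result: List[str] = []
    for run in paragraph.iterfind("w:r", NS):
        for child in run:
            if child.tag == f"{{{W}}}fldChar":
                kind = child.get(f"{{{W}}}fldCharType")
                if kind == "begin":
                    state, instr, result = "instr", [], []
                elif kind == "separate":
                    state = "result"    # following text is the visible result
                elif kind == "end":
                    if state != "text":
                        fields.append(("".join(instr).strip(), "".join(result)))
                    state = "text"
            elif child.tag == f"{{{W}}}instrText" and state == "instr":
                instr.append(child.text or "")
            elif child.tag == f"{{{W}}}t" and state == "result":
                result.append(child.text or "")
    return fields


# A hand-written paragraph fragment: "See " followed by a REF field showing "[1]".
SAMPLE_PARAGRAPH = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>See </w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> REF _Ref123 </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>[1]</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""

print(extract_fields(ET.fromstring(SAMPLE_PARAGRAPH)))
```

Tracking the `begin`/`separate`/`end` markers is what lets a converter keep the visible `[1]` tied to its reference target instead of treating it as plain text.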

Contributing

Issues welcome. PRs welcome. Run pytest before submitting.

License

MIT
