Add --save-images flag to extract document images to disk by JunxuLin · Pull Request #1645 · microsoft/markitdown

JunxuLin · 2026-03-28T07:19:42Z

Title: Add --save-images flag to extract document images to disk

Description:

Summary

MarkItDown already has --keep-data-uris to retain images as base64 data URIs inline in the markdown. However, this has two limitations:

- It doesn't work for PDFs. The PDF converter uses pdfminer/pdfplumber to extract text directly and never produces HTML with <img src="data:..."> tags, so there is nothing for --keep-data-uris to act on. Images in PDFs are silently dropped.
- Inline base64 is unreadable for humans. A single image inflates the markdown file by hundreds of kilobytes of encoded text, making it impractical to read or version-control.

This PR adds --save-images [DIR], an opt-in flag that saves images as files on disk and references them with standard markdown image links. It also fixes a root-cause bug in the EPUB converter where images were lost regardless of any flag.
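
To illustrate the idea of replacing inline base64 with file links, here is a minimal sketch, not the code in this PR. The function name save_data_uri_images, the image_N naming scheme, and the assumption that the markdown file is written next to the images directory are all hypothetical.

```python
import base64
import re
from pathlib import Path

# Matches markdown images whose target is a base64 data URI, e.g.
# ![alt](data:image/png;base64,iVBORw0...)
# Subtypes such as svg+xml do not match this pattern and are left untouched.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<ext>[A-Za-z0-9]+);base64,(?P<data>[A-Za-z0-9+/=]+)\)"
)


def save_data_uri_images(markdown: str, images_dir: Path) -> str:
    """Write each inline base64 image to images_dir and link to the file instead."""
    images_dir.mkdir(parents=True, exist_ok=True)
    count = 0

    def replace(match: re.Match) -> str:
        nonlocal count
        count += 1
        # Hypothetical naming scheme: image_1.png, image_2.jpeg, ...
        filename = f"image_{count}.{match.group('ext')}"
        (images_dir / filename).write_bytes(base64.b64decode(match.group("data")))
        # Assumes the markdown file sits next to images_dir, so a relative link works.
        return f"![{match.group('alt')}]({images_dir.name}/{filename})"

    return DATA_URI_IMAGE.sub(replace, markdown)
```

The rewritten markdown then carries a short relative link per image rather than the encoded payload, which is what keeps the output readable and diff-friendly.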

Changes

- EPUB (fix image loss): relative <img src> paths inside the ZIP were never resolved before HTML conversion, so images were always silently dropped. They are now resolved from the ZIP and either embedded as base64 (default) or saved to files (--save-images).
- DOCX: extract base64-encoded images from mammoth's HTML output and save them as files when the flag is set.
- PPTX: save shape.image.blob to files when the flag is set.
- PDF: extract images via pdfminer and interleave them at their correct vertical position in the text using page.crop() regions, rather than appending them at the end. Table structure in cropped regions is also preserved.
- CLI: add --save-images [DIR] flag:
  - --save-images (no DIR): auto-creates images_{output_stem}/ derived from the -o filename, falling back to the input filename.
  - --save-images ./my_dir: saves to the specified path.
  - Default (no flag): unchanged; images are omitted from output.
- converter_utils/images.py: shared resolve_images_dir() helper used by all four converters to avoid code duplication.
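
For readers unfamiliar with the EPUB issue, the following sketch shows one way a relative <img src> can be resolved against the path of the XHTML file inside the ZIP. It is an illustration under assumptions, not the converter code from this PR: the function name resolve_epub_image and the archive layout in the demo are hypothetical, and URL-encoded paths or fragments in src are not handled.

```python
import base64
import io
import mimetypes
import posixpath
import zipfile
from typing import Optional


def resolve_epub_image(zf: zipfile.ZipFile, html_path: str, src: str) -> Optional[str]:
    """Return a data URI for an <img src> found in html_path, or None if unresolved."""
    if src.startswith(("data:", "http://", "https://")):
        return None  # not a path inside the archive
    # ZIP member names use forward slashes, so use posixpath rather than os.path.
    member = posixpath.normpath(posixpath.join(posixpath.dirname(html_path), src))
    try:
        data = zf.read(member)
    except KeyError:
        return None  # src points at a file that is not in the archive
    mime = mimetypes.guess_type(member)[0] or "application/octet-stream"
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


# Demo with an in-memory archive (hypothetical layout).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("OEBPS/text/ch1.xhtml", '<img src="../images/f0001-01.jpg"/>')
    zf.writestr("OEBPS/images/f0001-01.jpg", b"jpeg-bytes")

with zipfile.ZipFile(buf) as zf:
    uri = resolve_epub_image(zf, "OEBPS/text/ch1.xhtml", "../images/f0001-01.jpg")
```

Without the join against the directory of the XHTML file, "../images/f0001-01.jpg" does not match any member name in the archive, which is consistent with the silent image loss described above.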

Test

Tested against the existing test files in tests/test_files/:

# EPUB
markitdown tests/test_files/test.epub --save-images > out.md
# → images_test/ created, markdown contains ![...](images_test/f0001-01.jpg) etc.

# DOCX
markitdown tests/test_files/test.docx --save-images > out.md
# → images_test/ created, image files saved and referenced in markdown

# PPTX
markitdown tests/test_files/test.pptx --save-images > out.md
# → images_test/ created, slide images saved and referenced in markdown

# PDF — images interleaved at correct position, tables preserved
markitdown tests/test_files/SPARSE-2024-INV-1234_borderless_table.pdf --save-images > out.md
# → images_SPARSE-2024-INV-1234_borderless_table/ created with image_1.png, image_2.png
# → images appear after "Variance Analysis:" in the output, not at the end

# Custom dir
markitdown tests/test_files/test.epub --save-images ./assets > out.md
# → images saved to ./assets/

# Auto-naming from -o flag
markitdown tests/test_files/test.pdf --save-images -o result.md
# → images saved to images_result/
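
The directory names in the commands above follow the naming rule from the Changes section. A rough sketch of that rule with argparse is shown below. It is not the resolve_images_dir() helper from this PR, whose signature is not shown here: the AUTO sentinel, build_parser, and pick_images_dir are hypothetical names, and the sketch returns only a directory name without deciding where on disk it is created.

```python
import argparse
from pathlib import Path
from typing import Optional

AUTO = "__auto__"  # hypothetical sentinel meaning "--save-images given without DIR"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename")
    parser.add_argument("-o", "--output")
    # nargs="?" makes DIR optional: absent flag -> None, bare flag -> AUTO, else DIR.
    parser.add_argument("--save-images", nargs="?", const=AUTO, default=None, metavar="DIR")
    return parser


def pick_images_dir(save_images: Optional[str], output: Optional[str], filename: str) -> Optional[Path]:
    """Apply the naming rule: explicit DIR wins, else images_{stem of -o or input}."""
    if save_images is None:
        return None  # flag not given: no images are saved
    if save_images != AUTO:
        return Path(save_images)
    stem = Path(output or filename).stem
    return Path(f"images_{stem}")
```

One caveat of an optional-argument flag built this way: if the input file is placed directly after a bare --save-images, argparse treats it as DIR, so the commands above put the input file first.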

JunxuLin · 2026-03-28T07:23:17Z

@JunxuLin please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:

(default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
(when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement

JunxuLin · 2026-03-28T07:23:43Z

@microsoft-github-policy-service agree company="Microsoft"

- Fix image loss in EPUB conversion: resolve relative <img src> paths inside the ZIP and embed them as base64 or save to files
- Add --save-images [DIR] CLI flag (and save_images kwarg for the API):
  - No DIR: auto-creates images_{output_stem}/ next to the output file
  - With DIR: saves images to the specified path
- Support image extraction for EPUB, DOCX, PPTX, and PDF converters
- PDF: interleave extracted images at their correct vertical position in the text rather than appending them at the end; preserve table structure in cropped page regions when images are present
- Extract shared dir-resolution logic into converter_utils/images.py
- Add debug/ and generated image dirs to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

JunxuLin closed this Mar 28, 2026

JunxuLin reopened this Mar 28, 2026

JunxuLin force-pushed the fix/epub-image-loss branch from ecee3c1 to 47d58f2 March 29, 2026 11:24

JunxuLin changed the title ~~Fix image loss in EPUB to Markdown conversion~~ Add --save-images flag to extract document images to disk Mar 29, 2026

JunxuLin wants to merge 1 commit into microsoft:main from JunxuLin:fix/epub-image-loss