Add --save-images flag to extract document images to disk #1645

Open
JunxuLin wants to merge 1 commit into microsoft:main from JunxuLin:fix/epub-image-loss

@JunxuLin JunxuLin commented Mar 28, 2026


Title: Add --save-images flag to extract document images to disk


Description:

Summary

MarkItDown already has --keep-data-uris to retain images as base64 data URIs inline in the markdown. However, this has two limitations:

  1. It doesn't work for PDFs. The PDF converter uses pdfminer/pdfplumber to extract text directly and never produces HTML with <img src="data:..."> tags — there is nothing for --keep-data-uris to act on. Images in PDFs were silently dropped entirely.
  2. Base64 inline is unreadable for humans. A single image inflates the markdown file by hundreds of kilobytes of encoded text, making it impractical to read or version-control.

This PR adds --save-images [DIR], an opt-in flag that saves images as files on disk and references them with standard markdown image links. It also fixes a root cause bug in the EPUB converter where images were lost regardless of any flag.
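The EPUB fix described above amounts to resolving each relative `<img src>` against the location of the XHTML file inside the archive. Below is a minimal sketch of that idea using only the standard library. The function name `embed_zip_images`, the regex-based matching, and the sample paths are illustrative assumptions and are not taken from the PR's code; the sketch handles only double-quoted `src` attributes and does not decode percent-encoded paths.

```python
import base64
import io
import mimetypes
import posixpath
import re
import zipfile

# Matches <img ... src="..."> with a double-quoted src (a simplification).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)
# Anything that starts with a URL scheme (http:, data:, ...) is left alone.
HAS_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def embed_zip_images(html: str, html_path: str, archive: zipfile.ZipFile) -> str:
    """Replace relative image paths with base64 data URIs read from the ZIP.

    html_path is the archive member the HTML came from; relative src values
    are resolved against its directory. Hypothetical helper, not the PR's API.
    """
    base_dir = posixpath.dirname(html_path)
    members = set(archive.namelist())

    def replace(match: re.Match) -> str:
        src = match.group(2)
        if HAS_SCHEME.match(src):
            return match.group(0)
        member = posixpath.normpath(posixpath.join(base_dir, src))
        if member not in members:
            return match.group(0)  # unknown path: leave the tag untouched
        mime = mimetypes.guess_type(member)[0] or "application/octet-stream"
        payload = base64.b64encode(archive.read(member)).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)


# Usage: an in-memory archive standing in for an EPUB file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("OEBPS/images/a.png", b"\x89PNG")
with zipfile.ZipFile(buffer) as zf:
    result = embed_zip_images(
        '<img alt="x" src="../images/a.png"/>', "OEBPS/text/ch1.xhtml", zf
    )
```

Saving to files instead of embedding would follow the same resolution step, writing `archive.read(member)` to the images directory and substituting a relative file path for the data URI.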

Changes

  • EPUB — fix image loss: relative <img src> paths inside the ZIP were never resolved before HTML conversion, so images were always silently dropped. They are now resolved from the ZIP and either embedded as base64 (default) or saved to files (--save-images).
  • DOCX — extract base64-encoded images from mammoth's HTML output and save them as files when the flag is set.
  • PPTX — save shape.image.blob to files when the flag is set.
  • PDF — extract images via pdfminer and interleave them at their correct vertical position in the text using page.crop() regions, rather than appending them at the end. Table structure in cropped regions is also preserved.
  • CLI — add --save-images [DIR] flag:
    • --save-images (no DIR): auto-creates images_{output_stem}/ derived from the -o filename, falling back to the input filename.
    • --save-images ./my_dir: saves to the specified path.
    • Default (no flag): unchanged — images are omitted from output.
  • converter_utils/images.py — shared resolve_images_dir() helper used by all four converters to avoid code duplication.
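For reviewers, the flag semantics listed above can be sketched with `argparse`'s `nargs="?"` and a sentinel `const`. This is only an illustration of the described behaviour: the parser below is a stand-in rather than MarkItDown's actual CLI, the name `resolve_images_dir_sketch` and its signature are hypothetical (the PR's `resolve_images_dir()` may look different), and the `"output"` fallback when neither an output nor an input filename is known is an assumption.

```python
import argparse
from pathlib import Path
from typing import Optional

AUTO = "__auto__"  # sentinel meaning "--save-images was given without DIR"


def build_parser() -> argparse.ArgumentParser:
    """Stand-in parser; only the options relevant to this sketch."""
    parser = argparse.ArgumentParser(prog="markitdown-sketch")
    parser.add_argument("filename", nargs="?")
    parser.add_argument("-o", "--output")
    # nargs="?" makes DIR optional: absent flag -> None, bare flag -> AUTO.
    parser.add_argument(
        "--save-images", nargs="?", const=AUTO, default=None, metavar="DIR"
    )
    return parser


def resolve_images_dir_sketch(
    save_images: Optional[str],
    output_path: Optional[str],
    input_path: Optional[str],
) -> Optional[Path]:
    """Return the directory images should be written to, or None if disabled."""
    if save_images is None:
        return None  # default: flag not given
    if save_images != AUTO:
        return Path(save_images)  # explicit DIR
    source = output_path or input_path  # prefer -o, fall back to the input
    stem = Path(source).stem if source else "output"  # assumed fallback
    return Path(f"images_{stem}")
```

One consequence of `nargs="?"` worth noting: a bare `--save-images` placed directly before the input filename would consume that filename as `DIR`, which is presumably why the test commands put the flag after the input file.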

Test

Tested against the existing test files in tests/test_files/:

# EPUB
markitdown tests/test_files/test.epub --save-images > out.md
# → images_test/ created, markdown contains ![...](images_test/f0001-01.jpg) etc.

# DOCX
markitdown tests/test_files/test.docx --save-images > out.md
# → images_test/ created, image files saved and referenced in markdown

# PPTX
markitdown tests/test_files/test.pptx --save-images > out.md
# → images_test/ created, slide images saved and referenced in markdown

# PDF — images interleaved at correct position, tables preserved
markitdown tests/test_files/SPARSE-2024-INV-1234_borderless_table.pdf --save-images > out.md
# → images_SPARSE-2024-INV-1234_borderless_table/ created with image_1.png, image_2.png
# → images appear after "Variance Analysis:" in the output, not at the end

# Custom dir
markitdown tests/test_files/test.epub --save-images ./assets > out.md
# → images saved to ./assets/

# Auto-naming from -o flag
markitdown tests/test_files/test.pdf --save-images -o result.md
# → images saved to images_result/
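
The PDF check above (images appearing after "Variance Analysis:" rather than at the end) depends on ordering text and images by vertical position. The following is a rough, library-free sketch of that ordering logic, not the PR's implementation: `plan_page_bands` and `interleave` are hypothetical names, and coordinates are assumed to be measured from the top of the page. With pdfplumber, the `extract_text` callable could wrap something like `page.crop((0, top, page.width, bottom)).extract_text()`.

```python
from typing import Callable, Iterable, List, Tuple

Band = Tuple[str, float, float]  # (kind, top, bottom), kind is "text" or "image"


def plan_page_bands(page_height: float, images: List[Tuple[float, float]]) -> List[Band]:
    """Split a page into text and image bands ordered top to bottom.

    images holds one (top, bottom) pair per image.
    """
    bands: List[Band] = []
    cursor = 0.0
    for top, bottom in sorted(images):
        top = max(top, cursor)  # clamp images that overlap an earlier one
        if top > cursor:
            bands.append(("text", cursor, top))
        bands.append(("image", top, max(bottom, top)))
        cursor = max(cursor, bottom)
    if cursor < page_height:
        bands.append(("text", cursor, page_height))
    return bands


def interleave(
    bands: Iterable[Band],
    extract_text: Callable[[float, float], str],
    image_links: Iterable[str],
) -> str:
    """Join text bands and image links in page order.

    image_links must be in the same top-to-bottom order as the image bands.
    """
    links = iter(image_links)
    parts: List[str] = []
    for kind, top, bottom in bands:
        if kind == "image":
            parts.append(next(links))
        else:
            text = extract_text(top, bottom).strip()
            if text:
                parts.append(text)
    return "\n\n".join(parts)
```

Extracting each text band separately is also what lets table detection run per region, which matches the PR's note about preserving table structure in cropped regions.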

@JunxuLin
Author

@JunxuLin please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@JunxuLin JunxuLin closed this Mar 28, 2026
@JunxuLin
Author

@microsoft-github-policy-service agree company="Microsoft"

@JunxuLin JunxuLin reopened this Mar 28, 2026
- Fix image loss in EPUB conversion: resolve relative <img src> paths
  inside the ZIP and embed them as base64 or save to files
- Add --save-images [DIR] CLI flag (and save_images kwarg for the API):
  - No DIR: auto-creates images_{output_stem}/ next to the output file
  - With DIR: saves images to the specified path
- Support image extraction for EPUB, DOCX, PPTX, and PDF converters
- PDF: interleave extracted images at their correct vertical position
  in the text rather than appending them at the end; preserve table
  structure in cropped page regions when images are present
- Extract shared dir-resolution logic into converter_utils/images.py
- Add debug/ and generated image dirs to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@JunxuLin JunxuLin force-pushed the fix/epub-image-loss branch from ecee3c1 to 47d58f2 Compare March 29, 2026 11:24
@JunxuLin JunxuLin changed the title Fix image loss in EPUB to Markdown conversion Add --save-images flag to extract document images to disk Mar 29, 2026