Add --save-images flag to extract document images to disk #1645
Open
JunxuLin wants to merge 1 commit into microsoft:main from
@microsoft-github-policy-service agree company="Microsoft"
- Fix image loss in EPUB conversion: resolve relative <img src> paths
  inside the ZIP and embed them as base64 or save to files
- Add --save-images [DIR] CLI flag (and save_images kwarg for the API):
  - No DIR: auto-creates images_{output_stem}/ next to the output file
  - With DIR: saves images to the specified path
- Support image extraction for EPUB, DOCX, PPTX, and PDF converters
- PDF: interleave extracted images at their correct vertical position
  in the text rather than appending them at the end; preserve table
  structure in cropped page regions when images are present
- Extract shared dir-resolution logic into converter_utils/images.py
- Add debug/ and generated image dirs to .gitignore
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
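
For reviewers, the directory-resolution and EPUB path-resolution behavior described in the commit message above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function names `images_dir_sketch` and `epub_image_data_uri`, their signatures, and the choice to return `None` for a missing image are all assumptions, and the real `resolve_images_dir()` helper in converter_utils/images.py may look quite different.

```python
import base64
import mimetypes
import posixpath
import zipfile
from pathlib import Path
from typing import Optional


def images_dir_sketch(
    explicit_dir: Optional[str],
    output_path: Optional[str] = None,
    input_path: Optional[str] = None,
) -> Path:
    """Pick (and create) the directory that extracted images are written to.

    Hypothetical helper: assumes it is only called when --save-images is set,
    with explicit_dir=None meaning "flag given without DIR".
    """
    if explicit_dir:
        # With DIR: use the specified path as-is.
        target = Path(explicit_dir)
    else:
        # No DIR: derive images_{stem}/ from the -o filename,
        # falling back to the input filename.
        base = output_path or input_path
        if base is None:
            raise ValueError("need an output or input path to derive a directory")
        base_path = Path(base)
        target = base_path.parent / f"images_{base_path.stem}"
    target.mkdir(parents=True, exist_ok=True)
    return target


def epub_image_data_uri(
    epub: zipfile.ZipFile, html_member: str, src: str
) -> Optional[str]:
    """Resolve a relative <img src> against the HTML file's location in the ZIP
    and return the image as a base64 data URI (None if it is not in the archive).
    """
    # ZIP member names always use forward slashes, hence posixpath.
    member = posixpath.normpath(
        posixpath.join(posixpath.dirname(html_member), src)
    )
    try:
        data = epub.read(member)
    except KeyError:
        return None
    mime = mimetypes.guess_type(member)[0] or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

With these assumptions, `images_dir_sketch(None, output_path="out/report.md")` would create and return `out/images_report`, and an `<img src="../images/a.png">` inside `OEBPS/text/ch1.xhtml` would resolve to the ZIP member `OEBPS/images/a.png`. The sketch does not handle percent-encoded `src` values or absolute URLs.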
ecee3c1 to 47d58f2
Title: Add --save-images flag to extract document images to disk
Description:
Summary
MarkItDown already has --keep-data-uris to retain images as base64 data URIs inline in the markdown. However, this has two limitations:
- [...] <img src="data:..."> tags; there is nothing for --keep-data-uris to act on. Images in PDFs were silently dropped entirely.

This PR adds --save-images [DIR], an opt-in flag that saves images as files on disk and references them with standard markdown image links. It also fixes a root-cause bug in the EPUB converter where images were lost regardless of any flag.
Changes
- EPUB: relative <img src> paths inside the ZIP were never resolved before HTML conversion, so images were always silently dropped. They are now resolved from the ZIP and either embedded as base64 (default) or saved to files (--save-images).
- [...] shape.image.blob to files when the flag is set.
- PDF: [...] page.crop() regions, rather than appending them at the end. Table structure in cropped regions is also preserved.
- --save-images [DIR] flag:
  - --save-images (no DIR): auto-creates images_{output_stem}/ derived from the -o filename, falling back to the input filename.
  - --save-images ./my_dir: saves to the specified path.
- converter_utils/images.py: shared resolve_images_dir() helper used by all four converters to avoid code duplication.
Test
Tested against the existing test files in tests/test_files/: