-
Notifications
You must be signed in to change notification settings - Fork 1
address review feedback for docx metadata and pdf rendering #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
address review feedback for docx metadata and pdf rendering #21
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR addresses review feedback to align DOCX page metadata with the MetadataKeys constant for consistency and improves PDF rendering by using a static using alias pattern for PDFtoImage conversion helpers.
Key changes:
- Standardize metadata key usage across document converters
- Adopt established alias pattern for PDF image rendering
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/MarkItDown/MetadataKeys.cs | Added MetadataKeys.Slide constant for consistent slide metadata |
| src/MarkItDown/Converters/DocxConverter.cs | Changed hardcoded "page" to MetadataKeys.Page for consistency |
| src/MarkItDown/Converters/PptxConverter.cs | Changed hardcoded "slide" to MetadataKeys.Slide for consistency |
| src/MarkItDown/Converters/PdfConverter.cs | Added static using for PDFtoImage.Conversion and updated call to ToImages |
| Multiple test files | Added comprehensive test coverage for conversion pipeline functionality |
| Multiple infrastructure files | Added documentation and build tooling enhancements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting
| private async Task AppendMissingPageSnapshotsAsync( | ||
| IReadOnlyList<int> pageNumbers, | ||
| HashSet<int> pagesWithInlineImages, | ||
| byte[] pdfBytes, | ||
| StreamInfo streamInfo, | ||
| List<DocumentSegment> segments, | ||
| ConversionArtifacts artifacts, | ||
| CancellationToken cancellationToken) | ||
| { | ||
| if (pageNumbers.Count == 0) | ||
| { | ||
| return; | ||
| } | ||
|
|
||
| var missingPages = pageNumbers | ||
| .Where(page => !pagesWithInlineImages.Contains(page)) | ||
| .Distinct() | ||
| .OrderBy(page => page) | ||
| .ToList(); | ||
|
|
||
| if (missingPages.Count == 0) | ||
| { | ||
| return; | ||
| } | ||
|
|
||
| var renderedPages = await imageRenderer.RenderImagesAsync(pdfBytes, cancellationToken).ConfigureAwait(false); | ||
| if (renderedPages.Count == 0) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rendering snapshots hard‑fails document intelligence conversions
The new AppendMissingPageSnapshotsAsync call unconditionally runs imageRenderer.RenderImagesAsync to create snapshots for pages without inline images. When the PDF‑to‑image renderer is not available or throws (e.g., Ghostscript not installed), the exception bubbles out of BuildExtractionFromDocumentIntelligenceAsync, causing TryBuildExtractionFromDocumentIntelligenceAsync to return null and the conversion to fall back to PdfPig—which then calls the same failing renderer and the whole conversion aborts. Previously the document‑intelligence path never invoked the renderer, so conversions still succeeded without that dependency. This makes existing deployments that rely on document intelligence but do not ship PDF rendering suddenly fail; consider catching and ignoring snapshot rendering failures so that DI results can still be returned.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting
| private async Task AppendMissingPageSnapshotsAsync( | ||
| IReadOnlyList<int> pageNumbers, | ||
| HashSet<int> pagesWithInlineImages, | ||
| byte[] pdfBytes, | ||
| StreamInfo streamInfo, | ||
| List<DocumentSegment> segments, | ||
| ConversionArtifacts artifacts, | ||
| CancellationToken cancellationToken) | ||
| { | ||
| if (pageNumbers.Count == 0) | ||
| { | ||
| return; | ||
| } | ||
|
|
||
| var missingPages = pageNumbers | ||
| .Where(page => !pagesWithInlineImages.Contains(page)) | ||
| .Distinct() | ||
| .OrderBy(page => page) | ||
| .ToList(); | ||
|
|
||
| if (missingPages.Count == 0) | ||
| { | ||
| return; | ||
| } | ||
|
|
||
| var renderedPages = await imageRenderer.RenderImagesAsync(pdfBytes, cancellationToken).ConfigureAwait(false); | ||
| if (renderedPages.Count == 0) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rendering snapshots hard‑fails document intelligence conversions
The new AppendMissingPageSnapshotsAsync call unconditionally runs imageRenderer.RenderImagesAsync to create snapshots for pages without inline images. When the PDF‑to‑image renderer is not available or throws (e.g., Ghostscript not installed), the exception bubbles out of BuildExtractionFromDocumentIntelligenceAsync, causing TryBuildExtractionFromDocumentIntelligenceAsync to return null and the conversion to fall back to PdfPig—which then calls the same failing renderer and the whole conversion aborts. Previously the document‑intelligence path never invoked the renderer, so conversions still succeeded without that dependency. This makes existing deployments that rely on document intelligence but do not ship PDF rendering suddenly fail; consider catching and ignoring snapshot rendering failures so that DI results can still be returned.
Useful? React with 👍 / 👎.
|
@codex fix all comments |
|
Summary
Testing
|
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…z' into codex/fix-all-comments
pdf: tolerate missing renderer when using pdfpig
Summary
Testing
dotnet test MarkItDown.slnxhttps://chatgpt.com/codex/tasks/task_e_68e9f8ffeda0832686f0fff19e585de0