address review feedback for docx metadata and pdf rendering #21

KSemenenko · 2025-10-11T15:50:15Z

Summary

Align DOCX page metadata with the MetadataKeys.Page constant for consistency.
Call the PDFtoImage conversion helpers through a static using directive, matching the established alias pattern.

Testing

dotnet test MarkItDown.slnx

https://chatgpt.com/codex/tasks/task_e_68e9f8ffeda0832686f0fff19e585de0

Copilot

Pull Request Overview

This PR addresses review feedback to align DOCX page metadata with the MetadataKeys constant for consistency and improves PDF rendering by using a static using alias pattern for PDFtoImage conversion helpers.

Key changes:

Standardize metadata key usage across document converters
Adopt established alias pattern for PDF image rendering

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
src/MarkItDown/MetadataKeys.cs	Added MetadataKeys.Slide constant for consistent slide metadata
src/MarkItDown/Converters/DocxConverter.cs	Changed hardcoded "page" to MetadataKeys.Page for consistency
src/MarkItDown/Converters/PptxConverter.cs	Changed hardcoded "slide" to MetadataKeys.Slide for consistency
src/MarkItDown/Converters/PdfConverter.cs	Added static using for PDFtoImage.Conversion and updated call to ToImages
Multiple test files	Added comprehensive test coverage for conversion pipeline functionality
Multiple infrastructure files	Added documentation and build tooling enhancements
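
To illustrate the metadata-key change summarized in the table, here is a minimal sketch. It assumes each constant holds the same string as the literal it replaced ("page" and "slide"); the SegmentMetadata helper and its method names are hypothetical and are not taken from the repository.

```csharp
using System.Collections.Generic;
using System.Globalization;

// Assumed shape of the shared constants: the values mirror the hardcoded
// strings that the converters used before this change.
public static class MetadataKeys
{
    public const string Page = "page";
    public const string Slide = "slide";
}

// Hypothetical helper showing how converters can build segment metadata
// from the shared constants instead of repeating string literals.
public static class SegmentMetadata
{
    public static Dictionary<string, string> ForPage(int pageNumber) =>
        new Dictionary<string, string>
        {
            [MetadataKeys.Page] = pageNumber.ToString(CultureInfo.InvariantCulture),
        };

    public static Dictionary<string, string> ForSlide(int slideNumber) =>
        new Dictionary<string, string>
        {
            [MetadataKeys.Slide] = slideNumber.ToString(CultureInfo.InvariantCulture),
        };
}
```

With a single definition per key, a rename only touches MetadataKeys, and a typo in a key becomes a compile error rather than a silent metadata mismatch.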

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

chatgpt-codex-connector · 2025-10-11T15:55:16Z

src/MarkItDown/Converters/PdfConverter.cs

+    private async Task AppendMissingPageSnapshotsAsync(
+        IReadOnlyList<int> pageNumbers,
+        HashSet<int> pagesWithInlineImages,
+        byte[] pdfBytes,
+        StreamInfo streamInfo,
+        List<DocumentSegment> segments,
+        ConversionArtifacts artifacts,
+        CancellationToken cancellationToken)
+    {
+        if (pageNumbers.Count == 0)
+        {
+            return;
+        }
+
+        var missingPages = pageNumbers
+            .Where(page => !pagesWithInlineImages.Contains(page))
+            .Distinct()
+            .OrderBy(page => page)
+            .ToList();
+
+        if (missingPages.Count == 0)
+        {
+            return;
+        }
+
+        var renderedPages = await imageRenderer.RenderImagesAsync(pdfBytes, cancellationToken).ConfigureAwait(false);
+        if (renderedPages.Count == 0)
+        {


Rendering snapshots hard-fails document intelligence conversions

The new AppendMissingPageSnapshotsAsync call unconditionally runs imageRenderer.RenderImagesAsync to create snapshots for pages without inline images. When the PDF-to-image renderer is unavailable or throws (e.g., Ghostscript not installed), the exception bubbles out of BuildExtractionFromDocumentIntelligenceAsync, so TryBuildExtractionFromDocumentIntelligenceAsync returns null and the conversion falls back to PdfPig. That path calls the same failing renderer, and the whole conversion aborts. Previously the document-intelligence path never invoked the renderer, so conversions succeeded without that dependency. Existing deployments that rely on document intelligence (DI) but do not ship PDF rendering will now fail. Consider catching and ignoring snapshot rendering failures so that DI results can still be returned.

KSemenenko · 2025-10-11T16:09:28Z

@codex fix all comments

chatgpt-codex-connector · 2025-10-11T16:10:41Z

Summary

Guarded document intelligence snapshot rendering so non-cancellation failures from the optional PDF renderer are ignored, preventing unnecessary conversion aborts. (src/MarkItDown/Converters/PdfConverter.cs, L437-L452)
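
The guard described in this summary could look roughly like the sketch below. It is not the code from PdfConverter.cs: the IPdfImageRenderer interface, the SnapshotGuard helper, the MissingRenderer stub, and the use of strings as stand-ins for rendered page images are all hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the optional PDF-to-image renderer.
public interface IPdfImageRenderer
{
    Task<IReadOnlyList<string>> RenderImagesAsync(byte[] pdfBytes, CancellationToken cancellationToken);
}

public static class SnapshotGuard
{
    // Returns the rendered pages, or an empty list when the renderer fails
    // for any reason other than cancellation.
    public static async Task<IReadOnlyList<string>> TryRenderAsync(
        IPdfImageRenderer renderer,
        byte[] pdfBytes,
        CancellationToken cancellationToken)
    {
        try
        {
            return await renderer.RenderImagesAsync(pdfBytes, cancellationToken).ConfigureAwait(false);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Renderer missing or broken: skip the snapshots and let the
            // caller keep whatever text extraction already produced.
            return Array.Empty<string>();
        }
    }
}

// Hypothetical renderer that simulates a missing native dependency.
public sealed class MissingRenderer : IPdfImageRenderer
{
    public Task<IReadOnlyList<string>> RenderImagesAsync(byte[] pdfBytes, CancellationToken cancellationToken) =>
        throw new InvalidOperationException("PDF renderer is not installed.");
}
```

The exception filter lets OperationCanceledException (and its subclass TaskCanceledException) propagate, so a caller that cancels the conversion still sees the cancellation instead of an empty snapshot list.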

Testing

⚠️ dotnet test MarkItDown.slnx (fails: command not found: dotnet)

fix review feedback

bf2c540

Copilot AI review requested due to automatic review settings October 11, 2025 15:50

KSemenenko added the codex label Oct 11, 2025 — with ChatGPT Codex Connector

Copilot AI reviewed Oct 11, 2025

View reviewed changes

chatgpt-codex-connector bot reviewed Oct 11, 2025

View reviewed changes

middleware: tighten image enrichment loops

edfd1cb

KSemenenko and others added 6 commits October 11, 2025 18:15

pdf: guard snapshot rendering failures

9fcba23

pdf: tolerate missing renderer when using pdfpig

f373de0

pdf: tolerate missing renderer when using pdfpig

4b3d09c

Update src/MarkItDown/Converters/PdfConverter.cs

c8ad6d1

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'codex/add-ichatclient-support-to-parsing-pipeline-3libi…

50c1385

…z' into codex/fix-all-comments

Merge pull request #22 from managedcode/codex/fix-all-comments

04aabb7

pdf: tolerate missing renderer when using pdfpig

KSemenenko merged commit 8ec74c2 into main Oct 11, 2025
1 check passed

KSemenenko deleted the codex/add-ichatclient-support-to-parsing-pipeline-3libiz branch October 11, 2025 16:24
