Skip to content

Conversation

@KSemenenko
Copy link
Member

Summary

  • align DOCX page metadata with MetadataKeys constant for consistency
  • call PDFtoImage conversion helpers via a static using to match the established alias pattern

Testing

  • dotnet test MarkItDown.slnx

https://chatgpt.com/codex/tasks/task_e_68e9f8ffeda0832686f0fff19e585de0

Copilot AI review requested due to automatic review settings October 11, 2025 15:50
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses review feedback to align DOCX page metadata with the MetadataKeys constant for consistency and improves PDF rendering by using a static using alias pattern for PDFtoImage conversion helpers.

Key changes:

  • Standardize metadata key usage across document converters
  • Adopt established alias pattern for PDF image rendering

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/MarkItDown/MetadataKeys.cs Added MetadataKeys.Slide constant for consistent slide metadata
src/MarkItDown/Converters/DocxConverter.cs Changed hardcoded "page" to MetadataKeys.Page for consistency
src/MarkItDown/Converters/PptxConverter.cs Changed hardcoded "slide" to MetadataKeys.Slide for consistency
src/MarkItDown/Converters/PdfConverter.cs Added static using for PDFtoImage.Conversion and updated call to ToImages
Multiple test files Added comprehensive test coverage for conversion pipeline functionality
Multiple infrastructure files Added documentation and build tooling enhancements

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Comment on lines 412 to 439
private async Task AppendMissingPageSnapshotsAsync(
IReadOnlyList<int> pageNumbers,
HashSet<int> pagesWithInlineImages,
byte[] pdfBytes,
StreamInfo streamInfo,
List<DocumentSegment> segments,
ConversionArtifacts artifacts,
CancellationToken cancellationToken)
{
if (pageNumbers.Count == 0)
{
return;
}

var missingPages = pageNumbers
.Where(page => !pagesWithInlineImages.Contains(page))
.Distinct()
.OrderBy(page => page)
.ToList();

if (missingPages.Count == 0)
{
return;
}

var renderedPages = await imageRenderer.RenderImagesAsync(pdfBytes, cancellationToken).ConfigureAwait(false);
if (renderedPages.Count == 0)
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Rendering snapshots hard‑fails document intelligence conversions

The new AppendMissingPageSnapshotsAsync call unconditionally runs imageRenderer.RenderImagesAsync to create snapshots for pages without inline images. When the PDF‑to‑image renderer is not available or throws (e.g., Ghostscript not installed), the exception bubbles out of BuildExtractionFromDocumentIntelligenceAsync, causing TryBuildExtractionFromDocumentIntelligenceAsync to return null and the conversion to fall back to PdfPig—which then calls the same failing renderer and the whole conversion aborts. Previously the document‑intelligence path never invoked the renderer, so conversions still succeeded without that dependency. This makes existing deployments that rely on document intelligence but do not ship PDF rendering suddenly fail; consider catching and ignoring snapshot rendering failures so that DI results can still be returned.

Useful? React with 👍 / 👎.

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting

Comment on lines 412 to 439
private async Task AppendMissingPageSnapshotsAsync(
IReadOnlyList<int> pageNumbers,
HashSet<int> pagesWithInlineImages,
byte[] pdfBytes,
StreamInfo streamInfo,
List<DocumentSegment> segments,
ConversionArtifacts artifacts,
CancellationToken cancellationToken)
{
if (pageNumbers.Count == 0)
{
return;
}

var missingPages = pageNumbers
.Where(page => !pagesWithInlineImages.Contains(page))
.Distinct()
.OrderBy(page => page)
.ToList();

if (missingPages.Count == 0)
{
return;
}

var renderedPages = await imageRenderer.RenderImagesAsync(pdfBytes, cancellationToken).ConfigureAwait(false);
if (renderedPages.Count == 0)
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Rendering snapshots hard‑fails document intelligence conversions

The new AppendMissingPageSnapshotsAsync call unconditionally runs imageRenderer.RenderImagesAsync to create snapshots for pages without inline images. When the PDF‑to‑image renderer is not available or throws (e.g., Ghostscript not installed), the exception bubbles out of BuildExtractionFromDocumentIntelligenceAsync, causing TryBuildExtractionFromDocumentIntelligenceAsync to return null and the conversion to fall back to PdfPig—which then calls the same failing renderer and the whole conversion aborts. Previously the document‑intelligence path never invoked the renderer, so conversions still succeeded without that dependency. This makes existing deployments that rely on document intelligence but do not ship PDF rendering suddenly fail; consider catching and ignoring snapshot rendering failures so that DI results can still be returned.

Useful? React with 👍 / 👎.

@KSemenenko
Copy link
Member Author

@codex fix all comments

@chatgpt-codex-connector
Copy link

Summary

Testing

  • ⚠️ dotnet test MarkItDown.slnx (fails: command not found: dotnet)

View task →

@KSemenenko KSemenenko merged commit 8ec74c2 into main Oct 11, 2025
1 check passed
@KSemenenko KSemenenko deleted the codex/add-ichatclient-support-to-parsing-pipeline-3libiz branch October 11, 2025 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants