replace docx binary fixture with generated stream #20

KSemenenko · 2025-10-11T11:54:18Z

Summary

- Replace the doc-with-image.docx regression asset with a programmatic DOCX generator that embeds an inline PNG image.
- Update the DOCX converter regression tests to use the generated stream, so image artifacts remain validated without committing binaries.
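
For readers unfamiliar with the DOCX package layout, here is a minimal sketch of how a document with one inline PNG could be generated in memory. It is an illustration only: `DocxSketch` is a hypothetical name, the sketch uses `System.IO.Compression` directly rather than whatever the PR's `DocxInlineImageFactory` uses, it needs C# 11 for raw string literals, and the result has not been validated against Word.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Build the package in memory, then list the parts that were written.
using var docx = DocxSketch.Create();
using var package = new ZipArchive(docx, ZipArchiveMode.Read);
foreach (var entry in package.Entries)
    Console.WriteLine(entry.FullName);

// Hypothetical helper; the PR's DocxInlineImageFactory may be built differently.
static class DocxSketch
{
    // A 1x1 PNG, base64-encoded so the sketch needs no binary file.
    const string PixelPng =
        "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg==";

    const string RelNs = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    const string ContentTypes = """
        <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
        <Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
          <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
          <Default Extension="png" ContentType="image/png"/>
          <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
        </Types>
        """;

    // Main document part: one paragraph holding one inline drawing that points at rId1.
    const string Document = """
        <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
        <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"
            xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
          <w:body><w:p><w:r><w:drawing><wp:inline>
            <wp:extent cx="9525" cy="9525"/>
            <wp:docPr id="1" name="Picture 1"/>
            <a:graphic><a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
              <pic:pic>
                <pic:nvPicPr><pic:cNvPr id="1" name="image1.png"/><pic:cNvPicPr/></pic:nvPicPr>
                <pic:blipFill><a:blip r:embed="rId1"/><a:stretch><a:fillRect/></a:stretch></pic:blipFill>
                <pic:spPr>
                  <a:xfrm><a:off x="0" y="0"/><a:ext cx="9525" cy="9525"/></a:xfrm>
                  <a:prstGeom prst="rect"><a:avLst/></a:prstGeom>
                </pic:spPr>
              </pic:pic>
            </a:graphicData></a:graphic>
          </wp:inline></w:drawing></w:r></w:p></w:body>
        </w:document>
        """;

    // Builds a relationships part with a single rId1 entry.
    static string Rels(string type, string target) =>
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
        "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">" +
        $"<Relationship Id=\"rId1\" Type=\"{RelNs}/{type}\" Target=\"{target}\"/>" +
        "</Relationships>";

    public static MemoryStream Create()
    {
        var stream = new MemoryStream();
        using (var zip = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true))
        {
            WriteText(zip, "[Content_Types].xml", ContentTypes);
            WriteText(zip, "_rels/.rels", Rels("officeDocument", "word/document.xml"));
            WriteText(zip, "word/document.xml", Document);
            WriteText(zip, "word/_rels/document.xml.rels", Rels("image", "media/image1.png"));

            var png = Convert.FromBase64String(PixelPng);
            using var media = zip.CreateEntry("word/media/image1.png").Open();
            media.Write(png, 0, png.Length);
        }
        stream.Position = 0; // rewind so a converter can read from the start
        return stream;
    }

    // Writes one text part; the writer is disposed before the next entry is opened.
    static void WriteText(ZipArchive zip, string path, string xml)
    {
        using var writer = new StreamWriter(zip.CreateEntry(path).Open(), new UTF8Encoding(false));
        writer.Write(xml);
    }
}
```

A test could pass the returned stream straight to the converter under test, which is what keeps the binary fixture out of the repository.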

Testing

dotnet test MarkItDown.slnx

https://chatgpt.com/codex/tasks/task_e_68e9f8ffeda0832686f0fff19e585de0

Copilot

Pull Request Overview

This PR replaces a binary DOCX regression test asset with a programmatic document generator and implements comprehensive conversion middleware infrastructure for AI-powered image enrichment. The changes enable document converters to capture raw extraction artifacts and pass them through a configurable middleware pipeline before Markdown composition.

Key changes:

- Introduced conversion middleware architecture with pipeline execution for document post-processing
- Added AI image enrichment middleware that generates detailed descriptions using chat clients
- Replaced binary test fixture with generated DOCX containing inline PNG images
- Enhanced PDF, DOCX, and PPTX converters to capture image artifacts and execute middleware pipelines
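
The flow described above (capture image artifacts, run them through middleware, then compose Markdown) can be sketched roughly as follows. Every type name here (`ImageArtifactSketch`, `ConversionContextSketch`, `IMiddlewareSketch`, `SketchPipeline`, `MarkdownComposer`) is hypothetical and chosen for illustration; the PR's actual types under `src/MarkItDown/Conversion/` are not shown in this page and may differ. The enrichment step writes a placeholder string where a real implementation would presumably call a chat client.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// A converter would fill the context with extracted images, run the pipeline,
// and only then compose Markdown from the (possibly enriched) artifacts.
var context = new ConversionContextSketch();
context.Images.Add(new ImageArtifactSketch("image1.png", new byte[] { 0x89, 0x50 }));

var pipeline = new SketchPipeline(new IMiddlewareSketch[] { new CaptionMiddleware() });
await pipeline.ExecuteAsync(context);
Console.WriteLine(MarkdownComposer.Compose(context));

// Raw extraction artifact: the image bytes plus a slot for a generated description.
record ImageArtifactSketch(string Name, byte[] Data)
{
    public string Description { get; set; } = "";
}

class ConversionContextSketch
{
    public List<ImageArtifactSketch> Images { get; } = new();
}

interface IMiddlewareSketch
{
    Task InvokeAsync(ConversionContextSketch context);
}

// Stand-in for AI enrichment: it only writes a placeholder description.
class CaptionMiddleware : IMiddlewareSketch
{
    public Task InvokeAsync(ConversionContextSketch context)
    {
        foreach (var image in context.Images)
            image.Description = $"placeholder description for {image.Name}";
        return Task.CompletedTask;
    }
}

// Runs each middleware in registration order against the same context.
class SketchPipeline
{
    private readonly IReadOnlyList<IMiddlewareSketch> _steps;

    public SketchPipeline(IReadOnlyList<IMiddlewareSketch> steps) => _steps = steps;

    public async Task ExecuteAsync(ConversionContextSketch context)
    {
        foreach (var step in _steps)
            await step.InvokeAsync(context);
    }
}

static class MarkdownComposer
{
    // Emits an image placeholder followed by its description, when one was added.
    public static string Compose(ConversionContextSketch context)
    {
        var lines = new List<string>();
        foreach (var image in context.Images)
        {
            lines.Add($"![{image.Name}]({image.Name})");
            if (image.Description.Length > 0)
                lines.Add(image.Description);
        }
        return string.Join("\n", lines);
    }
}
```

Running middleware before composition means an enrichment step only has to mutate the artifacts; it never needs to parse or patch the final Markdown.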

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `tests/MarkItDown.Tests/Fixtures/DocxInlineImageFactory.cs` | Programmatic DOCX generator that creates documents with embedded PNG images |
| `tests/MarkItDown.Tests/RecordingPipeline.cs` | Test harness that records pipeline execution and injects test content after image placeholders |
| `src/MarkItDown/Conversion/` | New middleware infrastructure including pipeline execution, context objects, and AI enrichment middleware |
| `src/MarkItDown/Converters/DocxConverter.cs` | Enhanced to extract image artifacts and execute conversion pipeline |
| `src/MarkItDown/Converters/PptxConverter.cs` | Enhanced to extract slide images and execute conversion pipeline |
| `src/MarkItDown/Converters/PdfConverter.cs` | Enhanced to capture page snapshots and execute conversion pipeline |
| `src/MarkItDown/MarkItDown.cs` | Updated to build conversion pipeline and pass to converters |

Copilot · 2025-10-11T11:54:42Z

src/MarkItDown/Converters/PdfConverter.cs


 #pragma warning disable CA1416
-            foreach (var bitmap in Conversion.ToImages(pdfBytes, password: null, options))
+            foreach (var bitmap in PDFtoImage.Conversion.ToImages(pdfBytes, password: null, options))


Namespace alias inconsistency. The using statement imports PDFtoImage directly as Conversion (line not shown), but this code references PDFtoImage.Conversion. Remove the PDFtoImage. prefix to match the existing alias pattern.
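
As background for this comment, here is a small sketch of how a C# `using` alias relates to the fully qualified form. It uses a hypothetical stand-in namespace `Vendor.Pdf` and a made-up `PageCount` method rather than the real PDFtoImage package, so it says nothing about which form the PR's file actually needs.

```csharp
using System;
using PdfConversion = Vendor.Pdf.Conversion; // alias for the stand-in type below

var pdfBytes = new byte[] { 1, 2, 3 };

// Both calls bind to the same static method; the alias only shortens the name.
Console.WriteLine(PdfConversion.PageCount(pdfBytes));
Console.WriteLine(Vendor.Pdf.Conversion.PageCount(pdfBytes));

namespace Vendor.Pdf
{
    // Hypothetical stand-in for a third-party static class named Conversion.
    public static class Conversion
    {
        public static int PageCount(byte[] pdf) => pdf.Length;
    }
}
```

A fully qualified name is sometimes chosen deliberately when another namespace or type called `Conversion` is in scope, so whether the prefix should be dropped depends on the file's actual `using` directives, which this page does not show.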

Copilot · 2025-10-11T11:54:42Z

src/MarkItDown/Converters/DocxConverter.cs

+    {
+        var metadata = new Dictionary<string, string>
+        {
+            ["page"] = pageNumber.ToString(CultureInfo.InvariantCulture)


Metadata key inconsistency. Use the constant MetadataKeys.Page instead of the hardcoded string "page" to maintain consistency with the established pattern used elsewhere in the codebase.
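
A minimal sketch of the suggested change, assuming `MetadataKeys.Page` is a string constant. The class below is a stand-in written for this example, and its value `"page"` is an assumption; the real definition is not shown in this page.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

int pageNumber = 3;

// Index the dictionary with the shared constant instead of a repeated string literal.
var metadata = new Dictionary<string, string>
{
    [MetadataKeys.Page] = pageNumber.ToString(CultureInfo.InvariantCulture)
};

Console.WriteLine(metadata[MetadataKeys.Page]);

// Stand-in for the project's constants class; the value is assumed.
static class MetadataKeys
{
    public const string Page = "page";
}
```

Centralizing the key means a later rename touches one declaration, and a mistyped key becomes a compile error rather than a silent lookup miss.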

tests: synthesize docx image fixture

49a7d13

Copilot AI review requested due to automatic review settings October 11, 2025 11:54

KSemenenko added the codex label Oct 11, 2025 — with ChatGPT Codex Connector

Copilot AI reviewed Oct 11, 2025

KSemenenko closed this Oct 11, 2025

KSemenenko deleted the codex/add-ichatclient-support-to-parsing-pipeline branch October 26, 2025 19:18
