@KSemenenko
Member

Summary

  • embed the ODT regression sample as an inline base64 constant and materialize it into a temporary .odt during the test
  • remove the binary sample.odt fixture so the repository no longer checks in that asset
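The approach in the summary can be sketched roughly as follows. This is a minimal illustration, not the PR's actual test code: `SampleBase64` and `MaterializeTempFile` are hypothetical names, and the payload here is only the four-byte ZIP signature rather than a real ODT document.

```csharp
using System;
using System.IO;

// Hypothetical stand-in payload: the real test would embed a complete ODT (ZIP) archive.
// These four bytes are just the ZIP local-file-header signature "PK\x03\x04".
const string SampleBase64 = "UEsDBA==";

// Decode the base64 constant and write it to a uniquely named temporary file.
static string MaterializeTempFile(string base64, string extension)
{
    var path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);
    File.WriteAllBytes(path, Convert.FromBase64String(base64));
    return path;
}

var tempPath = MaterializeTempFile(SampleBase64, ".odt");
try
{
    // A real test would hand tempPath to the converter under test here.
    var bytes = File.ReadAllBytes(tempPath);
    Console.WriteLine($"{bytes.Length} bytes, starts with PK: {bytes[0] == 0x50 && bytes[1] == 0x4B}");
}
finally
{
    // File.Delete is a no-op when the file is already gone, so no File.Exists check is needed.
    File.Delete(tempPath);
}
```

Cleaning up in a `finally` block keeps the temporary file from leaking when an assertion fails partway through the test.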

Testing

  • dotnet test MarkItDown.slnx

https://chatgpt.com/codex/tasks/task_e_68eb63577ed48326bbde32ca81781d06

Copilot AI review requested due to automatic review settings October 12, 2025 12:06

Copilot AI left a comment

Pull Request Overview

This PR embeds the ODT regression sample as an inline base64 constant and materializes it into a temporary file during tests, replacing the binary fixture. It also adds support for many new document formats, adds telemetry support, and renames the main class from MarkItDown to MarkItDownClient.

Key Changes

  • Removed binary .odt fixture file and replaced with inline base64 encoding in test
  • Added comprehensive support for 20+ new document formats including DocBook, JATS, OPML, FB2, ODT, citation formats, plain-text markups, and diagram syntaxes
  • Renamed main class from MarkItDown to MarkItDownClient throughout codebase for consistency

Reviewed Changes

Copilot reviewed 65 out of 65 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/MarkItDown.Tests/NewFormatsConverterTests.cs | Added comprehensive test suite for new format converters with inline ODT base64 constant |
| src/MarkItDown/MarkItDownClient.cs | Renamed from MarkItDown, added telemetry support and registered 23 new format converters |
| src/MarkItDown/Converters/*.cs | Added 23 new converter classes supporting formats like DocBook, JATS, OPML, FB2, citation formats, markup languages, and diagram types |
| src/MarkItDown/MimeMapping.cs | Extended MIME type mappings to support all new formats with fallback logic |
| tests/MarkItDown.Tests/TestFiles/ | Added 22 new test fixture files for regression testing of new formats |
| Multiple test files | Updated class name references from MarkItDown to MarkItDownClient |

var hex = rtf.Substring(i, 2);
if (byte.TryParse(hex, System.Globalization.NumberStyles.HexNumber, null, out var value))
{
    builder.Append(Encoding.Default.GetString(new[] { value }));
Copilot AI Oct 12, 2025

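Using Encoding.Default leads to platform-dependent behavior: on .NET Framework it is the system ANSI code page, while on .NET Core and .NET 5+ it is always UTF-8, which decodes a lone byte above 0x7F to the replacement character U+FFFD. Consider an explicit single-byte encoding such as Encoding.GetEncoding(1252) for RTF hex escapes. Note that on .NET Core and later, code page 1252 is only available after registering CodePagesEncodingProvider.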

Suggested change
- builder.Append(Encoding.Default.GetString(new[] { value }));
+ builder.Append(Encoding.GetEncoding(1252).GetString(new[] { value }));
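To make the difference concrete, here is a small sketch that decodes a single escaped byte with two encodings. `DecodeHexEscape` is a hypothetical helper, not code from this PR, and Windows-1252 is assumed as the document code page (a real RTF file may declare a different one via `\ansicpg`).

```csharp
using System;
using System.Globalization;
using System.Text;

// On .NET Core / .NET 5+, legacy code pages such as 1252 must be registered before use.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
var windows1252 = Encoding.GetEncoding(1252);

// Hypothetical helper: decodes the two hex digits that follow an RTF \' escape.
// Returns an empty string when the digits are not valid hexadecimal.
static string DecodeHexEscape(string hex, Encoding encoding) =>
    byte.TryParse(hex, NumberStyles.HexNumber, CultureInfo.InvariantCulture, out var value)
        ? encoding.GetString(new[] { value })
        : string.Empty;

// Byte 0xE9 is "é" (U+00E9) in Windows-1252.
var asCp1252 = DecodeHexEscape("e9", windows1252);

// The same lone byte is an incomplete UTF-8 sequence, so it decodes to U+FFFD.
var asUtf8 = DecodeHexEscape("e9", Encoding.UTF8);

Console.WriteLine($"cp1252: U+{(int)asCp1252[0]:X4}, utf-8: U+{(int)asUtf8[0]:X4}");
```

Resolving the encoding once and reusing it, rather than calling `Encoding.GetEncoding` for every escaped byte, also avoids repeated lookups inside the parsing loop.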

Comment on lines +60 to +63
if (File.Exists(tempPath))
{
    File.Delete(tempPath);
}
Copilot AI Oct 12, 2025

The File.Exists check before File.Delete is unnecessary: File.Delete does not throw if the file does not exist.

Suggested change
- if (File.Exists(tempPath))
- {
-     File.Delete(tempPath);
- }
+ File.Delete(tempPath);

{
    if (!stream.CanSeek)
    {
        throw new FileConversionException("ODT conversion requires a seekable stream.");
Copilot AI Oct 12, 2025

Consider using a more specific exception message that explains why seekable streams are required (e.g., 'ODT files are ZIP archives that require seekable streams for random access').

Suggested change
- throw new FileConversionException("ODT conversion requires a seekable stream.");
+ throw new FileConversionException("ODT files are ZIP archives that require seekable streams for random access.");
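For callers that only have a forward-only stream, one possible workaround is to buffer it into memory before conversion. The sketch below is an illustration under assumptions, not part of this PR: `EnsureSeekable` is a hypothetical helper, and a tiny in-memory ZIP stands in for a real ODT package.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper: returns the stream itself when seekable, otherwise a buffered copy.
static Stream EnsureSeekable(Stream source)
{
    if (source.CanSeek) return source;
    var buffer = new MemoryStream();
    source.CopyTo(buffer);
    buffer.Position = 0;
    return buffer;
}

// Build a tiny ZIP in memory to stand in for an ODT package.
var zipBytes = new MemoryStream();
using (var writer = new ZipArchive(zipBytes, ZipArchiveMode.Create, leaveOpen: true))
{
    using var entry = writer.CreateEntry("content.xml").Open();
    var xml = new byte[] { 0x3C, 0x61, 0x2F, 0x3E }; // "<a/>"
    entry.Write(xml, 0, xml.Length);
}

// Simulate a forward-only source: a DeflateStream in decompress mode reports CanSeek == false.
var compressed = new MemoryStream();
using (var deflate = new DeflateStream(compressed, CompressionMode.Compress, leaveOpen: true))
{
    deflate.Write(zipBytes.ToArray(), 0, (int)zipBytes.Length);
}
compressed.Position = 0;
using var forwardOnly = new DeflateStream(compressed, CompressionMode.Decompress);

// Buffer the forward-only stream, then open it as a ZIP archive.
using var seekable = EnsureSeekable(forwardOnly);
using var archive = new ZipArchive(seekable, ZipArchiveMode.Read);
Console.WriteLine($"CanSeek: {seekable.CanSeek}, first entry: {archive.Entries[0].FullName}");
```

Buffering trades memory for convenience, so it suits small documents; for large inputs, spooling to a temporary file may be the safer choice.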

KSemenenko and others added 2 commits October 12, 2025 14:09
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@KSemenenko KSemenenko merged commit 7ec5cf6 into main Oct 12, 2025
1 check passed
@KSemenenko KSemenenko deleted the codex/add-support-for-missing-pandoc-formats branch October 12, 2025 12:11