@lesyk (Contributor) commented Feb 10, 2026

This pull request enhances the handling and extraction of complex tables from PDF files in the markitdown package. It increases the flexibility of the PDF table extraction logic to support documents with a larger number of columns, updates the package version, and adds comprehensive tests for new PDF scenarios. Additionally, it improves repository configuration for handling binary files.

@lesyk changed the title from "Extend table support for wide tables" to "[MS] Extend table support for wide tables" on Feb 10, 2026
@lesyk marked this pull request as ready for review on February 10, 2026, 11:56
@gagb (Contributor) left a comment

Thanks for the PR, @lesyk. Could you please clarify:

  • How were the adaptive constants (0.70 percentile, [25,50] clamp, 10 cols/inch threshold) chosen? Were other values tested?
  • Were the existing PDF tests run before and after this change to confirm no regressions?
  • Why was the version number bumped?

Could you also update your description to include the commands for testing your changes, and indicate that you have manually verified all changes, especially if any AI was used to write the code?

@lesyk (Contributor, Author) commented Feb 12, 2026

  • How were the adaptive constants (0.70 percentile, [25,50] clamp, 10 cols/inch threshold) chosen? Were other values tested?

We have internal testing datasets that contain a variety of different files. After a new dataset was added, we found that the old parsing process no longer worked, hence these changes.
As for the values, these seemed more stable in my testing on our datasets.
In previous PRs I have added some synthetic samples, and with each PR I add more of them.
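
For readers unfamiliar with the terms in the question, here is a minimal sketch of how a percentile-plus-clamp adaptive value and a column-density check could be computed. It is not the code from this PR: the function names, what the percentile is taken over (here, a list of gap widths), the unit of the [25, 50] clamp, and the meaning of the 10 cols/inch threshold are all assumptions made for illustration. Only the three constants come from the review comment.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th quantile (q in [0, 1]) using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def adaptive_tolerance(
    gaps: Sequence[float], q: float = 0.70, lo: float = 25.0, hi: float = 50.0
) -> float:
    """Hypothetical helper: take the 0.70 percentile of the observed gaps
    and clamp the result to the [25, 50] range."""
    return max(lo, min(hi, percentile(gaps, q)))


def exceeds_column_density(
    n_columns: int, page_width_pt: float, max_cols_per_inch: float = 10.0
) -> bool:
    """Hypothetical helper: True when the table has more than
    max_cols_per_inch columns per inch of page width (72 pt = 1 inch)."""
    return n_columns / (page_width_pt / 72.0) > max_cols_per_inch


if __name__ == "__main__":
    print(adaptive_tolerance([10, 20, 30, 40, 50]))
    print(exceeds_column_density(100, 612.0))
```

The clamp keeps a data-driven value from drifting to extremes on unusual documents, which is one plausible reason to pair it with a percentile; the PR itself does not state the rationale.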

  • Were the existing PDF tests run before and after this change to confirm no regressions?

I see no regressions on our internal datasets or in the tests I added previously.

  • Why was the version number bumped?

I think I misunderstood the versioning for beta channels. I will change it to 0.1.5b2. My mistake.

Can you please also update your description to include commands to run to test your changes and also indicate that you have manually verified all changes, especially if any AI was used to write the code.

I am following the repo's setup: pytest or hatch from the root.
