fix: handle deeply nested HTML that triggers RecursionError #1644
Open
jigangz wants to merge 1 commit into microsoft:main from
Conversation
…t#1636)
Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause markdownify's recursive DOM traversal to exceed Python's default recursion limit (1000). Previously this RecursionError was caught by the top-level _convert() dispatcher, which then fell through to PlainTextConverter, silently returning the raw HTML as 'markdown' with no warning.
This fix catches RecursionError in HtmlConverter.convert() and falls back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A warning is emitted so callers know the output is plain text rather than full markdown.
Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML
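The root cause chain can be reproduced in miniature with the standard library alone. The sketch below is a toy model, not markitdown's code: build_nested, recursive_to_markdown, and dispatch are hypothetical stand-ins for the parsed DOM, markdownify's recursive traversal, and the _convert() dispatcher, and the real classes may differ in detail.

```python
import sys


def build_nested(depth):
    """Build a toy DOM: one text leaf wrapped in `depth` single-child nodes."""
    node = "leaf text"
    for _ in range(depth):
        node = {"tag": "div", "children": [node]}
    return node


def recursive_to_markdown(node):
    """Hypothetical stand-in for a recursive DOM traversal (one frame per level)."""
    if isinstance(node, str):
        return node
    parts = []
    for child in node["children"]:
        parts.append(recursive_to_markdown(child))
    return "".join(parts)


def dispatch(raw_html, dom):
    """Toy dispatcher: try converters in order and swallow any Exception."""
    failed_attempts = []
    converters = [
        ("html", lambda: recursive_to_markdown(dom)),
        ("plain_text", lambda: raw_html),  # accepts anything, returns input as-is
    ]
    for name, convert in converters:
        try:
            return name, convert(), failed_attempts
        except Exception as exc:  # RecursionError is an Exception subclass
            failed_attempts.append((name, type(exc).__name__))
    raise RuntimeError("no converter succeeded")


# Nest deeper than the interpreter's recursion limit to force the failure.
deep_dom = build_nested(sys.getrecursionlimit() * 2)
winner, output, failed = dispatch("<div>leaf text</div>", deep_dom)
print(winner, failed)
```

In this toy model the plain-text stand-in wins and hands back the raw HTML string untouched. The RecursionError is visible only in failed_attempts, which matches the silent fall-through described above.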
Author
@microsoft-github-policy-service agree
Summary
Fix large HTML files (>3MB) with deep DOM nesting silently returning unconverted HTML instead of markdown.
Problem
When converting deeply nested HTML documents (e.g., SEC EDGAR filings like Tesla's DEF 14A proxy statement), markdownify's recursive DOM traversal exceeds Python's default recursion limit (1000), which happens at roughly 400 nesting levels. The RecursionError is caught by the top-level _convert() dispatcher's except Exception block, and the request falls through to PlainTextConverter, which returns the raw HTML as-is, with no error or warning.
Root cause chain:
1. HtmlConverter.convert() → markdownify.convert_soup() (recursive traversal)
2. Deeply nested HTML (>~400 levels) → RecursionError
3. _convert() catches it via except Exception, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via text/ prefix → true
5. PlainTextConverter.convert() returns raw HTML bytes as text
Fix
Catch RecursionError in HtmlConverter.convert() and fall back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A UserWarning is emitted so callers know the output is plain text rather than full markdown.
Changes
- packages/markitdown/src/markitdown/converters/_html_converter.py: catch RecursionError, fall back to get_text(), emit warning
- packages/markitdown/tests/test_module_misc.py: add test_deeply_nested_html_fallback verifying the fallback behavior and warning
Fixes #1636
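The fallback described under Fix can be sketched with the standard library alone. This is an illustration of the pattern, not the code in this PR: the toy DOM and the names recursive_to_markdown, iterative_get_text, and convert_with_fallback are hypothetical, and iterative_get_text stands in for the role the description assigns to BeautifulSoup's get_text().

```python
import sys
import warnings


def build_nested(depth):
    """Build a toy DOM: one text leaf wrapped in `depth` single-child nodes."""
    node = "leaf text"
    for _ in range(depth):
        node = {"children": [node]}
    return node


def recursive_to_markdown(node):
    """Hypothetical recursive converter; fails on very deep nesting."""
    if isinstance(node, str):
        return node
    parts = []
    for child in node["children"]:
        parts.append(recursive_to_markdown(child))
    return "".join(parts)


def iterative_get_text(root):
    """Collect text with an explicit stack, so depth is not bounded by the call stack."""
    parts = []
    stack = [root]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            parts.append(node)
        else:
            # Push children reversed so they are popped in document order.
            stack.extend(reversed(node["children"]))
    return "".join(parts)


def convert_with_fallback(dom):
    """Try the recursive path first; on RecursionError, warn and return plain text."""
    try:
        return recursive_to_markdown(dom)
    except RecursionError:
        warnings.warn(
            "HTML nesting too deep for markdown conversion; returning plain text.",
            UserWarning,
            stacklevel=2,
        )
        return iterative_get_text(dom)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text = convert_with_fallback(build_nested(sys.getrecursionlimit() * 2))
print(text, [str(w.message) for w in caught])
```

Catching RecursionError at the converter level, rather than leaving it to a broad except Exception further up, keeps the decision about degraded output close to the code that knows what went wrong, and the warning makes the downgrade visible to the caller.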