fix: handle deeply nested HTML that triggers RecursionError #1644

Open
jigangz wants to merge 1 commit into microsoft:main from jigangz:fix/large-html-silent-failure

jigangz commented Mar 28, 2026

Summary

Fix a silent failure in which large HTML files (>3MB) with deep DOM nesting are returned as unconverted HTML instead of Markdown.

Problem

When converting deeply nested HTML documents (e.g., SEC EDGAR filings like Tesla's DEF 14A proxy statement), markdownify's recursive DOM traversal exceeds Python's default recursion limit of 1000, which is reached at roughly 400 levels of nesting. The RecursionError is caught by the except Exception block in the top-level _convert() dispatcher, and the request falls through to PlainTextConverter, which returns the raw HTML as-is with no error or warning.
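To illustrate the failure mode, here is a minimal stdlib-only sketch. It does not use markdownify; `build_nested`, `recursive_text`, and `iterative_text` are hypothetical helpers, and the dict-based tree is an assumed stand-in for a parsed DOM. It shows a recursive traversal raising RecursionError on a chain deeper than the interpreter limit, while an explicit-stack traversal of the same tree succeeds.

```python
import sys


def build_nested(depth):
    """Build a chain of single-child nodes without recursing."""
    root = {"text": "", "children": []}
    node = root
    for _ in range(depth):
        child = {"text": "", "children": []}
        node["children"].append(child)
        node = child
    node["text"] = "leaf"
    return root


def recursive_text(node):
    """Collect text recursively; stack use grows with nesting depth."""
    return node["text"] + "".join(recursive_text(c) for c in node["children"])


def iterative_text(root):
    """Collect text in document order with an explicit stack (no recursion)."""
    parts, stack = [], [root]
    while stack:
        node = stack.pop()
        parts.append(node["text"])
        # Push children reversed so the first child is visited first.
        stack.extend(reversed(node["children"]))
    return "".join(parts)


if __name__ == "__main__":
    deep = build_nested(sys.getrecursionlimit() + 100)
    try:
        recursive_text(deep)
    except RecursionError as exc:
        print("recursive traversal failed:", exc)
    print("iterative traversal returned:", iterative_text(deep))
```

Note that each level of `recursive_text` also passes through a generator expression, so the limit is hit well before the nesting depth equals `sys.getrecursionlimit()`. A similar multiplier would be consistent with the roughly 400-level threshold reported here, though that is an inference rather than something measured against markdownify.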

Root cause chain:

  1. HtmlConverter.convert() → markdownify.convert_soup() (recursive traversal)
  2. Deep nesting (>~400 levels) → RecursionError
  3. _convert() catches it via except Exception, stores in failed_attempts
  4. PlainTextConverter.accepts() matches text/html via text/ prefix → True
  5. PlainTextConverter.convert() returns raw HTML bytes as text
  6. Caller gets "markdown" that is actually unconverted HTML
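The chain above can be reproduced in miniature. The following is a sketch under stated assumptions, not the actual markitdown dispatcher: the `*Sketch` classes and `convert_sketch` are hypothetical names that only imitate the behavior described in steps 3 to 6 (a broad `except Exception`, a `failed_attempts` list, and a `text/` prefix match).

```python
class HtmlConverterSketch:
    """Hypothetical stand-in: accepts HTML but always hits the recursion limit."""

    def accepts(self, mimetype):
        return mimetype == "text/html"

    def convert(self, data):
        raise RecursionError("maximum recursion depth exceeded")


class PlainTextConverterSketch:
    """Hypothetical stand-in: accepts anything with a text/ prefix."""

    def accepts(self, mimetype):
        return mimetype.startswith("text/")

    def convert(self, data):
        return data.decode("utf-8")  # raw HTML passes through untouched


def convert_sketch(data, mimetype, converters):
    """Try converters in order, recording failures instead of raising."""
    failed_attempts = []
    for converter in converters:
        if not converter.accepts(mimetype):
            continue
        try:
            return converter.convert(data), failed_attempts
        except Exception as exc:  # RecursionError is an Exception subclass
            failed_attempts.append((type(converter).__name__, exc))
    raise RuntimeError(f"all converters failed: {failed_attempts}")


if __name__ == "__main__":
    raw = b"<div><p>hello</p></div>"
    result, failures = convert_sketch(
        raw, "text/html", [HtmlConverterSketch(), PlainTextConverterSketch()]
    )
    print(result)    # the unconverted HTML
    print(failures)  # the swallowed RecursionError
```

Because a later converter succeeds, the recorded failure never reaches the caller, which is why the result looks like a successful conversion.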

Fix

Catch RecursionError in HtmlConverter.convert() and fall back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A UserWarning is emitted so callers know the output is plain text rather than full markdown.
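The shape of this fallback can be sketched with the standard library alone. This is an illustration, not the code in this PR: `convert_html_sketch` and `plain_text_fallback` are hypothetical names, the `to_markdown` parameter stands in for the markdownify call, and the event-based `html.parser.HTMLParser` is used as an assumed substitute for BeautifulSoup's get_text().

```python
import warnings
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Event-based parser: collects text nodes without recursing into the DOM."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text_fallback(html):
    """Stand-in for get_text(): return the concatenated text content."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


def convert_html_sketch(html, to_markdown):
    """Try the recursive converter; on RecursionError, warn and return plain text."""
    try:
        return to_markdown(html)
    except RecursionError:
        warnings.warn(
            "HTML is too deeply nested for markdown conversion; "
            "returning plain text instead.",
            UserWarning,
            stacklevel=2,
        )
        return plain_text_fallback(html)
```

Catching only RecursionError, rather than Exception, keeps unrelated conversion bugs visible to the dispatcher while still handling the one failure the fallback is designed for.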

Changes

  • packages/markitdown/src/markitdown/converters/_html_converter.py: catch RecursionError, fall back to get_text(), emit warning
  • packages/markitdown/tests/test_module_misc.py: add test_deeply_nested_html_fallback verifying the fallback behavior and warning

Fixes #1636

…t#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause
markdownify's recursive DOM traversal to exceed Python's default
recursion limit (1000). Previously this RecursionError was caught by
the top-level _convert() dispatcher, which then fell through to
PlainTextConverter — silently returning the raw HTML as 'markdown'
with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls
back to BeautifulSoup's iterative get_text() method, which handles
arbitrary nesting depths. A warning is emitted so callers know the
output is plain text rather than full markdown.

Root cause chain:
1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML
jigangz marked this pull request as ready for review on March 28, 2026, 06:00

jigangz commented Mar 29, 2026

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

Large HTML files (>3MB) silently return unconverted HTML