fix: handle deeply nested HTML that triggers RecursionError by jigangz · Pull Request #1644 · microsoft/markitdown

jigangz · 2026-03-28T05:57:00Z

Summary

Fix large HTML files (>3MB) with deep DOM nesting silently returning unconverted HTML instead of markdown.

Problem

When converting deeply nested HTML documents (e.g., SEC EDGAR filings like Tesla's DEF 14A proxy statement), markdownify's recursive DOM traversal exceeds Python's default recursion limit (1000 frames, reached at roughly 400 nesting levels). The RecursionError is caught by the top-level _convert() dispatcher's except Exception block, and the request falls through to PlainTextConverter, which returns the raw HTML as-is with no error or warning.

Root cause chain:

1. HtmlConverter.convert() → markdownify.convert_soup() (recursive traversal)
2. Deep nesting (>~400 levels) → RecursionError
3. _convert() catches it via except Exception, stores it in failed_attempts
4. PlainTextConverter.accepts() matches text/html via the text/ prefix → True
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller gets "markdown" that is actually unconverted HTML
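
The chain above can be reproduced in miniature with only the standard library. This is a rough sketch, not the markitdown code: the two converter classes and the dispatch() function are simplified, hypothetical stand-ins for HtmlConverter, PlainTextConverter, and _convert(), and the recursive _walk() helper merely imitates a traversal that uses one Python frame per nesting level.

```python
import sys


class HtmlConverter:
    """Hypothetical stand-in: recurses once per nesting level, like a recursive DOM walk."""

    def accepts(self, mimetype: str) -> bool:
        return mimetype == "text/html"

    def convert(self, data: str) -> str:
        return self._walk(data.count("<div>"))

    def _walk(self, remaining: int) -> str:
        if remaining == 0:
            return "# converted markdown"
        return self._walk(remaining - 1)  # one extra frame per level


class PlainTextConverter:
    """Hypothetical stand-in: accepts anything under text/ and returns it unchanged."""

    def accepts(self, mimetype: str) -> bool:
        return mimetype.startswith("text/")

    def convert(self, data: str) -> str:
        return data


def dispatch(data: str, mimetype: str, converters: list):
    """Simplified model of a try-each-converter dispatcher."""
    failed_attempts = []
    for converter in converters:
        if not converter.accepts(mimetype):
            continue
        try:
            return converter.convert(data), failed_attempts
        except Exception as exc:  # RecursionError is an Exception subclass, so it lands here
            failed_attempts.append((type(converter).__name__, exc))
    raise RuntimeError("no converter succeeded")


# Nest deeper than the interpreter's current recursion limit.
depth = sys.getrecursionlimit() + 100
nested = "<div>" * depth + "hello" + "</div>" * depth

result, failed = dispatch(nested, "text/html", [HtmlConverter(), PlainTextConverter()])
```

In this sketch, result is the untouched input HTML and the only trace of the failure is the RecursionError recorded in failed, which mirrors the silent fallthrough described above.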

Fix

Catch RecursionError in HtmlConverter.convert() and fall back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A UserWarning is emitted so callers know the output is plain text rather than full markdown.
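
The shape of the fix can be sketched as follows. To keep the example free of third-party dependencies it does not call markdownify or BeautifulSoup: recursive_to_markdown() is a hypothetical stand-in for the recursive conversion, and iterative_get_text() uses the standard library's event-driven html.parser as a substitute for get_text(). The function names and the warning message are illustrative assumptions, not the PR's actual code.

```python
import sys
import warnings
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes via parser events; no recursion, so depth is irrelevant."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def iterative_get_text(html: str) -> str:
    """Stand-in for a non-recursive text extraction such as get_text()."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.chunks)


def recursive_to_markdown(html: str) -> str:
    """Hypothetical stand-in for a recursive converter: one frame per nesting level."""

    def walk(remaining: int) -> int:
        return 0 if remaining == 0 else 1 + walk(remaining - 1)

    walk(html.count("<div>"))  # raises RecursionError on very deep input
    return "**" + iterative_get_text(html) + "**"  # placeholder "markdown" output


def convert_html(html: str) -> str:
    try:
        return recursive_to_markdown(html)
    except RecursionError:
        # Tell the caller that the result is plain text, not markdown.
        warnings.warn(
            "HTML nesting too deep for markdown conversion; "
            "falling back to plain-text extraction.",
            UserWarning,
            stacklevel=2,
        )
        return iterative_get_text(html)


depth = sys.getrecursionlimit() + 100
deep_html = "<div>" * depth + "hello" + "</div>" * depth

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text = convert_html(deep_html)
```

Handling the error inside the HTML converter, rather than in the dispatcher, keeps the failure from reaching the generic except Exception path, so the plain-text passthrough is never consulted for this input.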

Changes

packages/markitdown/src/markitdown/converters/_html_converter.py: catch RecursionError, fall back to get_text(), emit warning
packages/markitdown/tests/test_module_misc.py: add test_deeply_nested_html_fallback verifying the fallback behavior and warning

Fixes #1636

…t#1636)

Large HTML files with deep DOM nesting (e.g., SEC EDGAR filings) cause markdownify's recursive DOM traversal to exceed Python's default recursion limit (1000). Previously this RecursionError was caught by the top-level _convert() dispatcher, which then fell through to PlainTextConverter, silently returning the raw HTML as 'markdown' with no warning.

This fix catches RecursionError in HtmlConverter.convert() and falls back to BeautifulSoup's iterative get_text() method, which handles arbitrary nesting depths. A warning is emitted so callers know the output is plain text rather than full markdown.

Root cause chain:

1. HtmlConverter.convert() calls markdownify.convert_soup() (recursive)
2. Deeply nested HTML (>~400 levels) triggers RecursionError
3. _convert() catches all Exceptions, stores in failed_attempts
4. PlainTextConverter.accepts() matches text/html via 'text/' prefix
5. PlainTextConverter.convert() returns raw HTML bytes as text
6. Caller receives 'markdown' that is actually unconverted HTML

jigangz · 2026-03-29T15:23:30Z

@microsoft-github-policy-service agree

jigangz marked this pull request as ready for review March 28, 2026 06:00

jigangz wants to merge 1 commit into microsoft:main from jigangz:fix/large-html-silent-failure
