Trim trailing empty rows/columns from Excel sheets to prevent OOM #1638

Open
spencerogden-dsam wants to merge 1 commit into microsoft:main from spencerogden-dsam:fix/sparse-spreadsheet-oom
@spencerogden-dsam

Summary

Excel files (especially in .xls format) often pad sheets to fixed dimensions (256 columns, 65,536 rows) even when only a few cells contain data. Rendering every empty cell through DataFrame.to_html() → HtmlConverter causes extreme memory usage and enormous output.

Real-world example: A 57 KB .xls file with ~5 data columns padded to 256 columns produced 95 MB of Markdown (1,700x expansion) and consumed 13+ GB of RAM during conversion, causing OOM kills in production.
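The scale of the problem can be checked with a little arithmetic. The sketch below uses only the figures quoted above; decimal units (1 KB = 1,000 bytes) are an assumption, and the full-grid cell count is an upper bound, since the report states only that the columns were padded to 256.

```python
# Dimensions of a fully padded legacy .xls sheet, as quoted in the summary.
PADDED_COLS = 256
PADDED_ROWS = 65_536
padded_cells = PADDED_COLS * PADDED_ROWS  # upper bound on cells rendered per sheet

# Share of columns that carry no data when ~5 of 256 columns are used.
data_cols = 5
empty_col_share = (PADDED_COLS - data_cols) / PADDED_COLS

# Expansion factor for the reported file (decimal units are an assumption).
input_bytes = 57 * 1_000
output_bytes = 95 * 1_000_000
expansion = output_bytes / input_bytes

print(f"cells in a fully padded sheet: {padded_cells:,}")
print(f"share of empty columns: {empty_col_share:.0%}")
print(f"expansion factor: ~{expansion:,.0f}x")
```

With these inputs the ratio comes out slightly under the rounded 1,700x quoted above, and roughly 98% of the rendered columns contain nothing but empty cells.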

Fix

Added _trim_trailing_empty() which trims trailing all-NaN rows and columns from each sheet's DataFrame before calling to_html(). Applied to both XlsxConverter and XlsConverter.

Only trailing empties are removed; intentional blank rows or columns used as visual separators within the data area are preserved.
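The trimming behavior described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the diff is not shown on this page, so the function name `trim_trailing_empty` and its body are assumptions. It keeps everything up to the last row and last column that contain a non-NaN value, so interior blanks survive.

```python
import numpy as np
import pandas as pd


def trim_trailing_empty(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop all-NaN rows/columns after the last non-empty cell."""
    if df.empty:
        return df

    # Boolean grid: True where a cell holds a value.
    filled = df.notna().to_numpy()
    rows_with_data = np.flatnonzero(filled.any(axis=1))
    cols_with_data = np.flatnonzero(filled.any(axis=0))

    if rows_with_data.size == 0:
        # Entirely empty sheet: nothing worth rendering.
        return df.iloc[0:0, 0:0]

    last_row = int(rows_with_data[-1])
    last_col = int(cols_with_data[-1])
    # Positional slice keeps interior blank rows/columns untouched.
    return df.iloc[: last_row + 1, : last_col + 1]


# Small sheet with an interior blank row and column plus trailing padding.
sheet = pd.DataFrame(
    [
        [1.0, np.nan, 2.0, np.nan],
        [np.nan, np.nan, np.nan, np.nan],
        [3.0, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, np.nan],
    ]
)
trimmed = trim_trailing_empty(sheet)
print(trimmed.shape)
```

One caveat of this sketch: it inspects cell values only, so a trailing column that has a header name but no data would also be trimmed. Whether the PR handles that case the same way is not stated here.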

Before/After

| Metric | Before | After |
| --- | --- | --- |
| Output size (57 KB .xls) | 95 MB | ~50 KB |
| Peak memory | 13+ GB | < 100 MB |
| Conversion time | 511 s | < 1 s |

Test plan

  • Existing xlsx/xls test vectors pass (tested locally)
  • CI passes

🤖 Generated with Claude Code

Excel files (especially .xls) often pad to fixed dimensions (256 columns,
65536 rows) even when only a few cells contain data. When every empty cell
is rendered through to_html() → Markdown, this causes extreme memory usage
and enormous output. A real-world 57 KB .xls file produced 95 MB of
Markdown (1,700x expansion) and consumed 13+ GB of RAM.

This trims trailing all-NaN rows and columns before calling to_html().
Only trailing empties are removed, preserving intentional blank rows or
columns used as visual separators within the data area.

Applied to both XlsxConverter and XlsConverter.
@spencerogden-dsam
Author

spencerogden-dsam commented Mar 26, 2026 via email
