ks-xlsx-parser v0.2.0 — Benchmark + Retrievability 📊

Headline: ks-xlsx-parser now has a head-to-head benchmark against Docling on the SpreadsheetBench corpus (912 task instances, 5,458 xlsx files). ks parses 99.945% of the corpus, ties Docling at recall@1, and wins at recall@3 (+2.7 pp) and recall@5 (+1.8 pp) on apples-to-apples retrieval, with 36.9% citation-grade geometric recall that Docling structurally cannot achieve.

Plus: three quiet, RAG-breaking rendering bugs from 0.1.1 are gone.

What's new

🏁 SpreadsheetBench benchmark — make bench

A reproducible, parser-agnostic benchmark over real-world workbooks scraped from ExcelHome / Mr.Excel / r/excel:

| Metric | ks-xlsx-parser | Docling 2.93 | Δ |
|---|---|---|---|
| Parse success (5,458 files) | 99.945% | not run at scale | |
| Recall@1 (text-match) | 0.580 | 0.579 | +0.1 pp (tied) |
| Recall@3 (text-match) | 0.697 | 0.670 | +2.7 pp |
| Recall@5 (text-match) | 0.704 | 0.686 | +1.8 pp |
| Recall@5 (geometric, A1 anchor overlap) | 0.369 | 0.000 | Docling has no per-chunk anchors |
| Mean parse time per file | 251 ms | 265 ms | ks ~5% faster |

Why "geometric" recall matters for RAG: ks emits a sheet!A1:Z99 range with every chunk. A retrieval system that surfaces the chunk can render a citation that points at the exact source cells. Docling produces markdown without per-chunk anchors, so it can't satisfy this metric at all. This is the difference between "the answer was in the workbook" and "the answer was in cell C7 of the Revenue sheet."

Marker is intentionally absent: its xlsx → HTML → PDF → layout-model pipeline clocks in at over 30 minutes per workbook on CPU. The harness supports adding a Marker adapter (with tests/benchmarks/adapters/docling_adapter.py as a template); the speed wall, not the harness, is the obstacle.

Full methodology, capability matrix, and caveats: tests/benchmarks/reports/COMPARISON.md.

🔧 Three rendering bugs that were silently torpedoing retrieval

  1. Comma-formatted numbers. 1272 rendered as "1,272.00" (Excel's display format), so a user query for "1272" missed on substring match. Now: numeric cells render the raw value.
  2. Spurious scientific notation. The [=] formula marker inflated a cell past its column width, tripping a long-value fallback that rendered 1272 as "1.272000e+03". Now: column widths are computed using the same rendering pipeline that data rows will use.
  3. Embedded newlines in headers (common in CJK workbooks, e.g. "租金\n天数") tore apart the Markdown table grid. Now: collapsed to spaces.

These three together accounted for the entire retrieval-recall gap we initially measured against Docling.
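
For the curious, a tiny runnable repro of bugs 1 and 3 using the exact values above; bug 2 is a column-width interaction that needs a full render pass to trigger:

```python
# Bug 1: display formatting broke substring retrieval.
display = "1,272.00"              # what 0.1.1 rendered for the cell
raw = "1272"                      # what 0.2.0 renders
assert "1272" not in display      # the substring miss that sank recall
assert "1272" in raw              # fixed

# Bug 3: embedded newlines tore Markdown table rows apart.
header = "租金\n天数"              # CJK header with an embedded newline
row = f"| {header} | 30 |"
assert "\n" in row                # 0.1.1: the grid breaks mid-row
fixed = header.replace("\n", " ") # 0.2.0: collapse to a space
assert fixed == "租金 天数"
```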

🧹 Segmenter — no more banded-table fragmentation

Removed _detect_style_boundaries from chunking/segmenter.py. The function would split a coherent table into five fragments at fill-color band boundaries (year banding, alternating-row shading), shedding header context from the data rows. The existing connected-components + gap detection already handles real boundaries; fill banding is not a semantic one.
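
For intuition, here is a 1-D sketch of the gap detection that remains; the real segmenter works in 2-D with connected components, and the names below are illustrative, not the module's API:

```python
# Keep consecutive occupied rows together; start a new block only when
# the jump between occupied rows exceeds `gap`, i.e. at least `gap`
# blank rows intervene.
def split_on_gaps(occupied_rows: list[int], gap: int = 2) -> list[list[int]]:
    blocks: list[list[int]] = []
    for row in sorted(occupied_rows):
        if blocks and row - blocks[-1][-1] <= gap:
            blocks[-1].append(row)   # small gap: same table
        else:
            blocks.append([row])     # real boundary: new block
    return blocks

# A year-banded table filling rows 1-10 stays one block regardless of
# fill colors; the 3-blank-row gap before row 14 still splits.
print(split_on_gaps(list(range(1, 11)) + [14, 15]))
# [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [14, 15]]
```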

🛡️ GradientFill safety

Cells using GradientFill (rare but real; caught by SpreadsheetBench instance 118-8, where 8 sheets / 1,244 cells were previously lost) used to crash the sheet parser. Now: the fill is defensively skipped and the sheet keeps parsing.
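
The defensive shape, sketched with openpyxl types; ks-xlsx-parser's actual fill handling may differ:

```python
# Only read PatternFill attributes when the fill actually is one, and
# skip GradientFill instead of letting an attribute error kill the sheet.
from openpyxl.styles.fills import GradientFill, PatternFill

def fill_rgb(cell):
    fill = cell.fill
    if isinstance(fill, GradientFill):
        return None                          # skip; keep parsing the sheet
    if isinstance(fill, PatternFill) and fill.patternType:
        return fill.fgColor.rgb              # solid and banded fills
    return None
```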

🐳 Productionization

  • Makefile: make bench, make bench-robust, make bench-retrieval
  • scripts/download_corpora.sh now fetches SpreadsheetBench v0.1
  • scripts/summarize_retrieval.py — re-aggregate a partial results.ndjson if a long run gets interrupted
  • New benchmark framework supports adding parsers (Marker, hucre, others) via the NDJSON-worker protocol; see tests/benchmarks/README.md and the sketch below
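
To give the flavor of the protocol, a hypothetical worker skeleton; the field names are invented for illustration, and the real handshake is specified in tests/benchmarks/README.md:

```python
# Hypothetical NDJSON worker: one JSON object in per task, one out per
# result, so any parser can join the benchmark as a subprocess.
import json
import sys
import time

def parse(path: str) -> str:
    raise NotImplementedError("call your parser here (Marker, hucre, ...)")

for line in sys.stdin:
    task = json.loads(line)
    t0 = time.perf_counter()
    try:
        out = {"path": task["path"], "ok": True, "text": parse(task["path"]),
               "ms": round((time.perf_counter() - t0) * 1000, 1)}
    except Exception as exc:
        out = {"path": task["path"], "ok": False, "error": str(exc)}
    print(json.dumps(out), flush=True)       # one NDJSON line per input
```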

Reproduce

```bash
pip install -U ks-xlsx-parser==0.2.0        # or:
git clone https://github.com/knowledgestack/ks-xlsx-parser
cd ks-xlsx-parser
make corpus-download                        # one-time, ~100 MB
make bench                                  # ~30 min for both benchmarks
open tests/benchmarks/reports/COMPARISON.md
```

Upgrading from 0.1.1

No breaking API changes. The only behavioral change is that render_text on numeric cells now contains the raw value instead of the Excel-display-formatted string (e.g. 1272 instead of 1,272.00). If you were relying on display formatting in retrieval keys or downstream regex parsing, switch to the cell's display_value field on the ChunkDTO. For everything else, drop-in.
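
A migration sketch, with a stub standing in for the real ChunkDTO and carrying only the two fields discussed:

```python
from dataclasses import dataclass

@dataclass
class ChunkDTO:                      # stub with just the two fields discussed
    render_text: str                 # 0.2.0: raw value
    display_value: str               # Excel display-formatted string

chunk = ChunkDTO(render_text="1272", display_value="1,272.00")

# 0.1.1 habits that keyed on display formatting should move here:
formatted = chunk.display_value      # "1,272.00"
assert chunk.render_text == "1272"   # the raw value now lives here
```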

Full changelog: CHANGELOG.md.

Thanks

To the SpreadsheetBench team at Renmin University for publishing a clean, real-world xlsx corpus with structured ground truth — none of this comparison would have been possible without it.