v0.2.0 — SpreadsheetBench benchmark + retrievability #7

arnav2 (Maintainer) · 2026-05-11T20:34:03Z

0.2.0 — 2026-05-11

Headline: ks-xlsx-parser ties Docling at recall@1 and beats it at recall@3 and
recall@5 on SpreadsheetBench (912 instances, 5,458 xlsx files), with 99.945%
parse success and 36.9% citation-grade geometric recall (Docling: 0%, structurally).

Added

make bench — head-to-head benchmark vs Docling on SpreadsheetBench
tests/benchmarks/adapters/docling_adapter.py — Docling adapter
scripts/eval_retrieval.py — retrieval recall@k with embedded chunks
scripts/summarize_retrieval.py — re-aggregate partial runs
scripts/download_corpora.sh — fetches SpreadsheetBench v0.1
tests/benchmarks/reports/COMPARISON.md — full methodology + capability matrix
Release workflow now picks up docs/launch/RELEASE_NOTES_vX.Y.Z.md automatically
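
For readers unfamiliar with the metric, here is a minimal sketch of how recall@k can be computed over ranked chunk ids. It is not taken from scripts/eval_retrieval.py; the function names, the chunk ids, and the choice of the "fraction of relevant chunks retrieved" definition are all assumptions made for illustration.

```python
from typing import Iterable, Sequence, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of relevant chunk ids that appear in the top-k ranked results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = relevant_set.intersection(ranked[:k])
    return len(hits) / len(relevant_set)


def mean_recall_at_k(
    queries: Sequence[Tuple[Sequence[str], Iterable[str]]], k: int
) -> float:
    """Average recall@k over (ranked, relevant) pairs; 0.0 for empty input."""
    if not queries:
        return 0.0
    return sum(recall_at_k(r, rel, k) for r, rel in queries) / len(queries)


# Hypothetical toy data: the chunk ids below are made up.
queries = [
    (["c3", "c1", "c7"], ["c1"]),  # gold chunk at rank 2
    (["c2", "c9", "c4"], ["c2"]),  # gold chunk at rank 1
    (["c5", "c6", "c8"], ["c0"]),  # gold chunk never retrieved
]

for k in (1, 3):
    print(k, mean_recall_at_k(queries, k))
```

On this toy data one query out of three is answered at k=1 and two out of three at k=3. With a single gold chunk per query, as here, this definition coincides with a simple hit rate.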

Fixed

Numeric cells now render raw values (1272), not display-formatted strings ("1,272.00")
[=] formula marker no longer triggers spurious sci-notation truncation
Dates render as ISO YYYY-MM-DD (a midnight 00:00:00 time component is dropped)
Newlines in headers collapse to spaces (CJK regression)
GradientFill cells no longer crash the sheet parser
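
The rendering rules above (raw numbers, ISO dates without a midnight time, header newlines collapsed to spaces) can be sketched roughly as follows. This is an illustration only, not the parser's code: render_value and render_header are hypothetical names, and the handling of non-midnight datetimes is an assumption, since the notes do not say how those render.

```python
from datetime import date, datetime, time


def render_value(value) -> str:
    """Render a cell value as text (hypothetical helper, not the library API)."""
    if isinstance(value, datetime):
        # Check datetime before date: datetime is a subclass of date.
        if value.time() == time(0, 0):
            # Midnight carries no information, so keep only YYYY-MM-DD.
            return value.date().isoformat()
        # Assumption: non-midnight timestamps keep their time component.
        return value.isoformat(sep=" ")
    if isinstance(value, date):
        return value.isoformat()
    # Numbers render as their raw value, with no thousands separators
    # or fixed decimal places taken from the cell's number format.
    return str(value)


def render_header(text: str) -> str:
    """Collapse line breaks inside a header cell into single spaces."""
    return " ".join(part.strip() for part in text.splitlines() if part.strip())


print(render_value(1272))
print(render_value(datetime(2026, 5, 11)))
print(render_header("Unit\nPrice"))
```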

Changed

Removed _detect_style_boundaries in segmenter — fill-color banding is not
a semantic boundary (was splitting coherent tables into 5 header-less fragments)
tests/benchmarks/_schema.py: formulas nullable on status=ok
(Docling/Marker don't model formulas)

Removed

scripts/compare_docling.py — superseded by the unified benchmark framework

Performance

~5% faster than Docling on SpreadsheetBench (251 ms vs 265 ms mean parse time)
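
As a quick check of the arithmetic, the relative reduction in mean parse time follows directly from the two figures quoted above. The variable names are illustrative.

```python
# Mean parse times quoted in the release notes, in milliseconds.
ks_ms = 251.0
docling_ms = 265.0

# Relative reduction in mean parse time versus Docling.
reduction = (docling_ms - ks_ms) / docling_ms

# Equivalent throughput gain (files parsed per unit of time).
throughput_gain = docling_ms / ks_ms - 1.0

print(f"time reduction: {reduction:.1%}")
print(f"throughput gain: {throughput_gain:.1%}")
```

Both figures round to roughly 5%, consistent with the claim.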

Breaking

render_text on numeric cells now contains the raw value, not Excel's
display-formatted string. If you keyed off display formatting, use the
cell DTO's display_value field instead.
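
A sketch of what the migration might look like is below. It deliberately does not import ks-xlsx-parser: CellStub is a hypothetical stand-in for the cell DTO, and raw_value and the render_text helper are assumed names. Only display_value and the behavior of render_text are named in these notes.

```python
from dataclasses import dataclass


@dataclass
class CellStub:
    """Hypothetical stand-in for the cell DTO (not the real class)."""
    raw_value: float       # assumed field name
    display_value: str     # field named in the release notes


def render_text(cell: CellStub) -> str:
    """Assumed 0.2.0 behavior: numeric cells render their raw value."""
    return str(cell.raw_value)


cell = CellStub(raw_value=1272, display_value="1,272.00")

# Code that matched the display-formatted string in rendered text
# no longer finds it under 0.2.0 ...
found_in_text = "1,272.00" in render_text(cell)

# ... so read the formatted string from display_value instead.
formatted = cell.display_value

print(found_in_text, formatted)
```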

Install: pip install -U ks-xlsx-parser==0.2.0
Full notes: docs/launch/RELEASE_NOTES_v0.2.0.md
Benchmark report: tests/benchmarks/reports/COMPARISON.md

This discussion was created from the release v0.2.0 — SpreadsheetBench benchmark + retrievability.
