v0.2.0: SpreadsheetBench benchmark + retrievability fixes #4
Merged
Adds a head-to-head benchmark against Docling on the SpreadsheetBench corpus (912 instances, 5,458 xlsx files) and fixes three rendering bugs that were silently torpedoing RAG retrieval.

**Results:** ks-xlsx-parser parses 99.945% of SpreadsheetBench, ties Docling at recall@1, and wins recall@3 (+2.7 pp) / recall@5 (+1.8 pp) on apples-to-apples text match, plus 36.9% citation-grade geometric recall (Docling scores 0% structurally: it emits no A1 anchors).

**Renderer** (`src/rendering/text_renderer.py`):
- Numeric cells render the raw value (`1272`), not Excel's display-formatted string (`1,272.00`), so substring search now works.
- Column widths are computed with the `[=]` formula marker included, so long formulas no longer spuriously trigger the sci-notation long-value fallback.
- Dates render as ISO `YYYY-MM-DD`; a midnight `00:00:00` time component is dropped.
- Embedded `\n` in headers collapse to spaces so they don't tear the markdown grid (regression fix for CJK workbooks).

**Segmenter** (`src/chunking/segmenter.py`):
- Removed `_detect_style_boundaries`: fill-color banding (year/alternating shading) is not a semantic boundary and was splitting coherent tables into 5 header-less fragments.

**Parsers** (`src/parsers/cell_parser.py`):
- `GradientFill` cells no longer crash the sheet parser (caught by SpreadsheetBench instance 118-8; 8 sheets / 1,244 cells were previously lost to an `AttributeError`).

**Benchmark framework:**
- `tests/benchmarks/adapters/docling_adapter.py` (new): NDJSON-protocol Docling adapter that plugs into the existing `vs_hucre.py` harness.
- `scripts/eval_retrieval.py` (new): retrieval recall@k over SpreadsheetBench's (instruction, data_position, answer_position) triples. Persistent docling subprocess with a hard-kill timeout.
- `scripts/summarize_retrieval.py` (new): re-aggregates partial NDJSON if a long run is interrupted.
- `scripts/download_corpora.sh`: fetches SpreadsheetBench v0.1.
- `tests/benchmarks/_schema.py`: formulas are now nullable on `status=ok` (Docling/Marker don't model formulas).
- `tests/benchmarks/README.md` and `tests/benchmarks/reports/COMPARISON.md` (new): methodology and the head-to-head report.
- `Makefile`: `bench`, `bench-robust`, `bench-retrieval` targets.

**Productionization:**
- `pyproject.toml` + `ks_xlsx_parser/__init__.py`: 0.1.1 → 0.2.0.
- `CHANGELOG.md`: full 0.2.0 entry following the Keep a Changelog format.
- `docs/launch/RELEASE_NOTES_v0.2.0.md` (new): picked up automatically by the updated `release.yml` workflow as the GitHub release body.
- README: new KS-branded benchmark section near the top, with shields.io hero badges and a colored comparison table (emerald for ks-xlsx-parser, slate for Docling, rose where Docling scores 0 structurally).
- Architecture diagram moved out of the README into `docs/wiki/Architecture.md` to keep the front page scannable.
- `.gitignore`: `data/corpora/` (downloaded benchmark corpora), `.claude/` (Conductor transient state).
- `scripts/compare_docling.py` removed; superseded by the unified benchmark framework.

**Tests:** 1460 pass (no regressions). Full-corpus retrieval benchmark output is committed under `tests/benchmarks/reports/retrieval/`.
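For illustration, the renderer fixes above amount to a value-normalization pass. The sketch below is a simplified stand-in for the rules described (raw numerics, ISO dates with midnight dropped, newline collapse), not the actual `text_renderer.py` implementation; the function name is hypothetical.

```python
from datetime import datetime, date

def render_cell_text(value):
    """Illustrative sketch of the retrieval-friendly rendering rules:
    - numbers keep their raw representation, not Excel's display format
    - datetimes at exactly midnight collapse to an ISO date
    - embedded newlines flatten to spaces so markdown grids stay intact
    """
    if isinstance(value, datetime):
        if (value.hour, value.minute, value.second) == (0, 0, 0):
            return value.date().isoformat()      # "2024-03-01", no 00:00:00
        return value.isoformat(sep=" ")
    if isinstance(value, date):                  # plain dates pass through
        return value.isoformat()
    if isinstance(value, float) and value.is_integer():
        return str(int(value))                   # 1272.0 -> "1272", not "1,272.00"
    if isinstance(value, str):
        return " ".join(value.split())           # collapse \n in headers
    return str(value)
```

The key property for RAG is that a query substring like `1272` now matches the rendered text directly, with no display-format punctuation in the way.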
arnav2 added a commit that referenced this pull request on May 11, 2026
Summary
- Numeric cells render the raw value (`1272`), not Excel's `"1,272.00"`; the `[=]` formula marker no longer trips the sci-notation fallback; dates render as ISO `YYYY-MM-DD`. These were the entire pre-fix gap to Docling.
- New benchmark framework: `tests/benchmarks/adapters/docling_adapter.py`, `scripts/eval_retrieval.py`, `make bench`; reproducible end-to-end. Marker is excluded by design (xlsx → PDF → layout model runs >30 min/workbook on CPU).
- Release notes in `docs/launch/RELEASE_NOTES_v0.2.0.md`; `release.yml` picks them up automatically.

Full writeup: `tests/benchmarks/reports/COMPARISON.md`.

Test plan
- `make test`: 1460 pass, 0 fail (~2 min)
- `make bench`: 5,461/5,464 parse success (99.945%) on SpreadsheetBench robustness; the full 912-instance retrieval recall run completed, results in `tests/benchmarks/reports/`
- Regression test covers the `GradientFill` crash in `cell_parser.py`
- Tagging `v0.2.0` triggers PyPI publish via the Release workflow
- The `pypi` environment with Trusted Publishing is configured (`gh api repos/knowledgestack/ks-xlsx-parser/environments`)

Breaking changes
None API-wise. Behavioural:
`render_text` on numeric cells now contains the raw value instead of the Excel-display-formatted string (e.g. `1272` instead of `1,272.00`). If you were relying on display formatting in retrieval keys or downstream regex parsing, switch to the `display_value` field on each cell DTO.

🤖 Generated with Claude Code
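A minimal migration sketch for downstream code affected by the behavioural change. Only the `display_value` field name comes from the note above; the `Cell` dataclass and `retrieval_key` helper here are hypothetical stand-ins for the parser's actual cell DTO.

```python
from dataclasses import dataclass

@dataclass
class Cell:                 # illustrative stand-in for the cell DTO
    value: object           # raw value, e.g. 1272 (what render_text now carries)
    display_value: str      # Excel-formatted string, e.g. "1,272.00"

def retrieval_key(cell: Cell) -> str:
    # Before v0.2.0, rendered text carried "1,272.00", so regexes on it saw
    # the formatted string. Code that still needs the formatted form should
    # read display_value explicitly instead of parsing rendered text.
    return cell.display_value

cell = Cell(value=1272, display_value="1,272.00")
assert str(cell.value) == "1272"           # what now appears in rendered text
assert retrieval_key(cell) == "1,272.00"   # formatted string, fetched explicitly
```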