v0.2.0: SpreadsheetBench benchmark + retrievability fixes #4
Merged
Adds a head-to-head benchmark against Docling on the SpreadsheetBench corpus (912 instances, 5,458 xlsx files) and fixes three rendering bugs that were silently torpedoing RAG retrieval.

**Results:** ks-xlsx-parser parses 99.945% of SpreadsheetBench, ties Docling at recall@1, and wins recall@3 (+2.7 pp) / recall@5 (+1.8 pp) on apples-to-apples text match, plus 36.9% citation-grade geometric recall (Docling scores 0% structurally: it emits no A1 anchors).

**Renderer** (`src/rendering/text_renderer.py`):
- Numeric cells render the raw value (`1272`), not Excel's display-formatted string (`1,272.00`), so substring search now works.
- Column widths are computed with the `[=]` formula marker included, so long formulas no longer spuriously trigger the sci-notation long-value fallback.
- Dates render as ISO `YYYY-MM-DD`; a midnight `00:00:00` time component is dropped.
- Embedded `\n` in headers collapse to spaces so they don't tear the markdown grid (regression fix for CJK workbooks).

**Segmenter** (`src/chunking/segmenter.py`):
- Removed `_detect_style_boundaries`: fill-color banding (year/alternating shading) is not a semantic boundary and was splitting coherent tables into 5 header-less fragments.

**Parsers** (`src/parsers/cell_parser.py`):
- `GradientFill` cells no longer crash the sheet parser (caught by SpreadsheetBench instance 118-8; 8 sheets / 1,244 cells were previously lost to an `AttributeError`).

**Benchmark framework:**
- `tests/benchmarks/adapters/docling_adapter.py` (new): NDJSON-protocol Docling adapter that plugs into the existing `vs_hucre.py` harness.
- `scripts/eval_retrieval.py` (new): retrieval recall@k over SpreadsheetBench's (instruction, data_position, answer_position) triples. Persistent docling subprocess with a hard-kill timeout.
- `scripts/summarize_retrieval.py` (new): re-aggregates partial NDJSON if a long run is interrupted.
- `scripts/download_corpora.sh`: fetches SpreadsheetBench v0.1.
- `tests/benchmarks/_schema.py`: formulas are now nullable on `status=ok` (Docling/Marker don't model formulas).
- `tests/benchmarks/README.md` and `tests/benchmarks/reports/COMPARISON.md` (new): methodology and the head-to-head report.
- `Makefile`: `bench`, `bench-robust`, `bench-retrieval` targets.

**Productionization:**
- `pyproject.toml` + `ks_xlsx_parser/__init__.py`: 0.1.1 → 0.2.0.
- `CHANGELOG.md`: full 0.2.0 entry following the Keep a Changelog format.
- `docs/launch/RELEASE_NOTES_v0.2.0.md` (new): picked up automatically by the updated `release.yml` workflow as the GitHub release body.
- README: new KS-branded benchmark section near the top, with shields.io hero badges and a colored comparison table (emerald for ks-xlsx-parser, slate for Docling, rose where Docling scores 0 structurally).
- Architecture diagram moved out of the README into `docs/wiki/Architecture.md` to keep the front page scannable.
- `.gitignore`: `data/corpora/` (downloaded benchmark corpora), `.claude/` (Conductor transient state).
- `scripts/compare_docling.py` removed; superseded by the unified benchmark framework.

**Tests:** 1460 pass (no regressions). Full-corpus retrieval benchmark output is committed under `tests/benchmarks/reports/retrieval/`.
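For illustration, the renderer fixes above amount to a value-normalization pass. The sketch below is a simplified stand-in for the rules described (raw numerics, ISO dates with midnight dropped, newline collapse), not the actual `text_renderer.py` implementation; the function name is hypothetical.

```python
from datetime import datetime, date

def render_cell_text(value):
    """Illustrative sketch of the retrieval-friendly rendering rules:
    - numbers keep their raw representation, not Excel's display format
    - datetimes at exactly midnight collapse to an ISO date
    - embedded newlines flatten to spaces so markdown grids stay intact
    """
    if isinstance(value, datetime):
        if (value.hour, value.minute, value.second) == (0, 0, 0):
            return value.date().isoformat()      # "2024-03-01", no 00:00:00
        return value.isoformat(sep=" ")
    if isinstance(value, date):                  # plain dates pass through
        return value.isoformat()
    if isinstance(value, float) and value.is_integer():
        return str(int(value))                   # 1272.0 -> "1272", not "1,272.00"
    if isinstance(value, str):
        return " ".join(value.split())           # collapse \n in headers
    return str(value)
```

The key property for RAG is that a query substring like `1272` now matches the rendered text directly, with no display-format punctuation in the way.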
arnav2 added a commit that referenced this pull request on May 11, 2026
Summary
- Numeric cells render the raw value (`1272`), not Excel's `"1,272.00"`; the `[=]` formula marker no longer trips the sci-notation fallback; dates render as ISO `YYYY-MM-DD`. These were the entire pre-fix gap to Docling.
- New benchmark framework: `tests/benchmarks/adapters/docling_adapter.py`, `scripts/eval_retrieval.py`, `make bench`; reproducible end-to-end. Marker is excluded by design (xlsx → PDF → layout model runs >30 min/workbook on CPU).
- Release notes in `docs/launch/RELEASE_NOTES_v0.2.0.md`; `release.yml` picks them up automatically.

Full writeup: `tests/benchmarks/reports/COMPARISON.md`.

Test plan
- `make test`: 1460 pass, 0 fail (~2 min)
- `make bench`: 5,461/5,464 parse success (99.945%) on SpreadsheetBench robustness; the full 912-instance retrieval recall run completed, results in `tests/benchmarks/reports/`
- Regression test covers the `GradientFill` crash in `cell_parser.py`
- Tagging `v0.2.0` triggers PyPI publish via the Release workflow
- The `pypi` environment with Trusted Publishing is configured (`gh api repos/knowledgestack/ks-xlsx-parser/environments`)

Breaking changes
None API-wise. Behavioural:
`render_text` on numeric cells now contains the raw value instead of the Excel-display-formatted string (e.g. `1272` instead of `1,272.00`). If you were relying on display formatting in retrieval keys or downstream regex parsing, switch to the `display_value` field on each cell DTO.

🤖 Generated with Claude Code
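A minimal migration sketch for downstream code affected by the behavioural change. Only the `display_value` field name comes from the note above; the `Cell` dataclass and `retrieval_key` helper here are hypothetical stand-ins for the parser's actual cell DTO.

```python
from dataclasses import dataclass

@dataclass
class Cell:                 # illustrative stand-in for the cell DTO
    value: object           # raw value, e.g. 1272 (what render_text now carries)
    display_value: str      # Excel-formatted string, e.g. "1,272.00"

def retrieval_key(cell: Cell) -> str:
    # Before v0.2.0, rendered text carried "1,272.00", so regexes on it saw
    # the formatted string. Code that still needs the formatted form should
    # read display_value explicitly instead of parsing rendered text.
    return cell.display_value

cell = Cell(value=1272, display_value="1,272.00")
assert str(cell.value) == "1272"           # what now appears in rendered text
assert retrieval_key(cell) == "1,272.00"   # formatted string, fetched explicitly
```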