
v0.2.0: SpreadsheetBench benchmark + retrievability fixes (#4)

Merged
arnav2 merged 1 commit into main from arnav2/fix-excel-table-chunking on May 11, 2026

Conversation


@arnav2 arnav2 commented May 11, 2026

Summary

  • Head-to-head benchmark vs Docling on SpreadsheetBench (912 instances, 5,458 xlsx files): ks-xlsx-parser ties at recall@1 (0.580 vs 0.579), wins recall@3 (0.697 vs 0.670, +2.7 pp) and recall@5 (0.704 vs 0.686, +1.8 pp), with 36.9% citation-grade geometric recall (Docling 0% structurally — no per-chunk A1 anchors). 99.945% parse success on the full corpus.
  • Three rendering bugs fixed that were silently torpedoing retrieval: numeric cells now render the raw value (1272) not Excel's "1,272.00"; the [=] formula marker no longer trips a sci-notation fallback; dates render as ISO YYYY-MM-DD. These were the entire pre-fix gap to Docling.
  • Benchmark framework: new tests/benchmarks/adapters/docling_adapter.py, scripts/eval_retrieval.py, make bench; reproducible end-to-end. Marker is excluded by design (xlsx→PDF→layout-model = >30 min/workbook on CPU).
  • Productionization: version 0.1.1 → 0.2.0, CHANGELOG entry, docs/launch/RELEASE_NOTES_v0.2.0.md, release.yml picks up the notes automatically.
  • Architecture diagram moved from README to docs/wiki/Architecture.md. README front page now leads with a KS-branded benchmark visual.

Full writeup: tests/benchmarks/reports/COMPARISON.md.

Test plan

  • make test: 1460 pass, 0 fail (~2 min)
  • make bench: 5,461/5,464 parse success (99.945%) on SpreadsheetBench robustness; full 912-instance retrieval recall completed, results in tests/benchmarks/reports/
  • Hand-inspected the 2 ks-only errors from the first benchmark run → fixed (GradientFill crash in cell_parser.py)
  • Wait for CI green across {ubuntu, macOS} × py{3.10, 3.11, 3.12} before tagging
  • After merge to main: tag v0.2.0 → triggers PyPI publish via the Release workflow
  • Verify pypi environment with Trusted Publishing is configured (gh api repos/knowledgestack/ks-xlsx-parser/environments)

Breaking changes

No API changes. Behavioural change: render_text on numeric cells now contains the raw value instead of the Excel-display-formatted string (e.g. 1272 instead of 1,272.00). If you were relying on display formatting in retrieval keys or downstream regex parsing, switch to the display_value field on each cell DTO.
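A minimal sketch of the migration. The field names render_text and display_value come from the note above; the dataclass shape itself is an assumption, not the parser's actual DTO definition:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the cell DTO; only the two field names
# (render_text, display_value) are taken from the release notes.
@dataclass
class Cell:
    render_text: str    # raw value as of v0.2.0, e.g. "1272"
    display_value: str  # Excel display formatting, e.g. "1,272.00"

cell = Cell(render_text="1272", display_value="1,272.00")

# Pre-0.2.0 code that matched display formatting against render_text
# now fails; match against display_value instead.
assert "1,272" in cell.display_value
assert cell.render_text == "1272"
```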

🤖 Generated with Claude Code

Adds a head-to-head benchmark against Docling on the SpreadsheetBench
corpus (912 instances, 5,458 xlsx files) and fixes three rendering
bugs that were silently torpedoing RAG retrieval.

Results: ks-xlsx-parser parses 99.945% of SpreadsheetBench, ties Docling
at recall@1 and wins recall@3 (+2.7 pp) / recall@5 (+1.8 pp) on
apples-to-apples text-match, plus 36.9% citation-grade geometric
recall (Docling 0%, structurally — no A1 anchors).

Renderer (src/rendering/text_renderer.py):
- Numeric cells render the raw value (1272) not Excel's display-formatted
  string (1,272.00) — substring search now works.
- Column widths computed with the [=] formula marker included, so it no
  longer spuriously triggers the sci-notation long-value fallback.
- Dates render as ISO YYYY-MM-DD; midnight 00:00:00 dropped.
- Embedded \n in headers collapse to spaces so they don't tear the
  markdown grid (regression fix for CJK workbooks).
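The three value-rendering rules above can be sketched in one function. This is an illustrative reconstruction, not the actual src/rendering/text_renderer.py:

```python
from datetime import datetime

def render_value(value):
    """Sketch of the v0.2.0 rendering rules (illustrative only)."""
    if isinstance(value, datetime):
        # ISO date; a midnight 00:00:00 time component is dropped entirely.
        if (value.hour, value.minute, value.second) == (0, 0, 0):
            return value.strftime("%Y-%m-%d")
        return value.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(value, float) and value.is_integer():
        # Raw numeric value, no thousands separators: 1272.0 -> "1272",
        # so a substring search for "1272" hits the chunk text.
        return str(int(value))
    if isinstance(value, str):
        # Collapse embedded newlines/whitespace so multi-line headers
        # don't tear the markdown grid.
        return " ".join(value.split())
    return str(value)
```

For example, render_value(1272.0) gives "1272" rather than "1,272.00", and render_value(datetime(2026, 5, 11)) gives "2026-05-11".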

Segmenter (src/chunking/segmenter.py):
- Removed _detect_style_boundaries — fill-color banding (year/alternating
  shading) is not a semantic boundary and was splitting coherent tables
  into 5 header-less fragments.

Parsers (src/parsers/cell_parser.py):
- GradientFill cells no longer crash the sheet parser (caught by
  SpreadsheetBench instance 118-8; 8 sheets / 1,244 cells previously
  lost on AttributeError).
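The failure mode can be shown with stub fill objects: pattern fills expose fgColor, gradient fills do not, so an unconditional fill.fgColor.rgb lookup raises AttributeError mid-sheet. A defensive accessor in the style sketched here avoids that; this is an assumed illustration, not the actual cell_parser.py fix:

```python
def fill_rgb(fill):
    """Return the fill's RGB string, or None for gradient/absent fills."""
    fg = getattr(fill, "fgColor", None)
    return getattr(fg, "rgb", None)

# Hypothetical stubs mimicking the two fill shapes:
class PatternFillStub:
    class fgColor:
        rgb = "FFDCE6F1"

class GradientFillStub:
    stop = []  # gradient stops; no fgColor attribute at all

assert fill_rgb(PatternFillStub()) == "FFDCE6F1"
assert fill_rgb(GradientFillStub()) is None  # previously: AttributeError
```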

Benchmark framework:
- tests/benchmarks/adapters/docling_adapter.py (new) — NDJSON-protocol
  Docling adapter, plugs into existing vs_hucre.py harness.
- scripts/eval_retrieval.py (new) — retrieval recall@k over
  SpreadsheetBench's (instruction, data_position, answer_position)
  triples. Persistent docling subprocess with hard-kill timeout.
- scripts/summarize_retrieval.py (new) — re-aggregate partial NDJSON
  if a long run is interrupted.
- scripts/download_corpora.sh — fetches SpreadsheetBench v0.1.
- tests/benchmarks/_schema.py — formulas now nullable on status=ok
  (Docling/Marker don't model formulas).
- tests/benchmarks/README.md and tests/benchmarks/reports/COMPARISON.md
  (new) — methodology + head-to-head report.
- Makefile: bench, bench-robust, bench-retrieval.
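The recall@k numbers reported above reduce to a simple hit-rate computation. A minimal sketch of the metric, assuming chunk IDs as strings; the real scripts/eval_retrieval.py additionally maps SpreadsheetBench answer_position cells onto chunks before scoring:

```python
def recall_at_k(ranked_chunk_ids, gold_chunk_ids, k):
    """Fraction of gold chunks that appear in the top-k retrieved chunks."""
    top_k = set(ranked_chunk_ids[:k])
    hits = sum(1 for g in gold_chunk_ids if g in top_k)
    return hits / len(gold_chunk_ids)

# With one gold chunk per instance, recall@k is just the top-k hit rate,
# averaged over instances:
assert recall_at_k(["c7", "c2", "c9"], ["c2"], k=1) == 0.0
assert recall_at_k(["c7", "c2", "c9"], ["c2"], k=3) == 1.0
```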

Productionization:
- pyproject.toml + ks_xlsx_parser/__init__.py: 0.1.1 → 0.2.0.
- CHANGELOG.md: full 0.2.0 entry following Keep-a-Changelog format.
- docs/launch/RELEASE_NOTES_v0.2.0.md (new) — picked up automatically
  by the updated release.yml workflow as the GitHub release body.
- README: new KS-branded benchmark section near the top with shields.io
  hero badges + colored comparison table (emerald for ks-xlsx-parser,
  slate for Docling, rose where Docling scores 0 structurally).
- Architecture diagram moved out of README into docs/wiki/Architecture.md
  to keep the front page scannable.
- .gitignore: data/corpora/ (downloaded benchmark corpora; gitignored),
  .claude/ (Conductor transient state).
- scripts/compare_docling.py removed — superseded by the unified
  benchmark framework.

Tests: 1460 pass (no regressions). Full corpus retrieval benchmark
output committed under tests/benchmarks/reports/retrieval/.
@arnav2 arnav2 merged commit 0395de8 into main May 11, 2026
7 checks passed
arnav2 added a commit that referenced this pull request May 11, 2026
(Commit message identical to the PR description above.)