19 changes: 18 additions & 1 deletion .github/workflows/release.yml
@@ -59,14 +59,31 @@ jobs:
name: dist
path: dist

- name: Find hand-written release notes (if any)
id: notes
run: |
# Use docs/launch/RELEASE_NOTES_<tag>.md as the release body when
# present, otherwise fall back to GitHub's auto-generated notes.
NOTES="docs/launch/RELEASE_NOTES_${GITHUB_REF_NAME}.md"
if [ -f "$NOTES" ]; then
echo "path=$NOTES" >> "$GITHUB_OUTPUT"
echo "auto=false" >> "$GITHUB_OUTPUT"
echo "Found release notes: $NOTES"
else
echo "path=" >> "$GITHUB_OUTPUT"
echo "auto=true" >> "$GITHUB_OUTPUT"
echo "No hand-written release notes at $NOTES; using auto-generated."
fi

- name: Create GitHub release
uses: softprops/action-gh-release@v2
with:
files: |
dist/*.whl
dist/*.tar.gz
dist/testBench-v*.zip
generate_release_notes: true
body_path: ${{ steps.notes.outputs.path }}
generate_release_notes: ${{ steps.notes.outputs.auto == 'true' }}

pypi:
needs: build
6 changes: 6 additions & 0 deletions .gitignore
@@ -72,3 +72,9 @@ rust/**/target/
rust/**/Cargo.lock.bak
**/.cargo-lock
**/.fingerprint/

# Downloaded benchmark corpora (see scripts/download_corpora.sh)
data/corpora/

# Conductor / Claude Code transient state
.claude/
92 changes: 92 additions & 0 deletions CHANGELOG.md
@@ -47,6 +47,98 @@ Template for a new release (copy this block, fill in, move Unreleased items in):

Nothing yet. Open a PR and add your entry under the appropriate heading.

## [0.2.0] — 2026-05-11

**Benchmark + retrievability release.** Adds a head-to-head benchmark against
[Docling](https://github.com/DS4SD/docling) on the [SpreadsheetBench](https://github.com/RUCKBReasoning/SpreadsheetBench)
corpus (912 instances, 5,458 xlsx files) and fixes four rendering bugs that
were silently torpedoing RAG retrieval. ks-xlsx-parser parses **99.945%** of
SpreadsheetBench, **ties Docling at recall@1, and wins at recall@3 (+2.7 pp)
and recall@5 (+1.8 pp)**, plus 36.9% citation-grade geometric recall (Docling
scores 0% structurally: its output carries no A1 anchors).

### Added
- `tests/benchmarks/adapters/docling_adapter.py` — Docling adapter speaking the
same NDJSON-worker protocol as `ks_adapter.py` (#TBD).
- `tests/benchmarks/_runner.py`: `docling_runner` factory wired into
`vs_hucre.py`'s `--parsers` dispatch.
- `scripts/eval_retrieval.py` — retrieval-recall benchmark over
SpreadsheetBench's `(instruction, data_position, answer_position)` triples.
Uses `sentence-transformers` (default `BAAI/bge-small-en-v1.5`) and computes
geometric overlap plus numeric/date/boolean-normalized text-match recall@k;
a sketch of the overlap check follows this list. Runs Docling in a
persistent subprocess with a hard-kill timeout, because PyTorch's
table-recognition loop holds the GIL inside C code, so in-process
timeouts never fire.
- `scripts/summarize_retrieval.py` — re-aggregate a `results.ndjson` into
`summary.json` / `summary.md` if a long run is interrupted.
- `scripts/download_corpora.sh`: fetches SpreadsheetBench v0.1 (~96 MB tar.gz)
into `data/corpora/spreadsheetbench/` (gitignored).
- `tests/benchmarks/README.md` — adapter design notes + benchmark how-to.
- `tests/benchmarks/reports/COMPARISON.md` — head-to-head report incl.
methodology, capability matrix, caveats.
- `Makefile`: `bench`, `bench-robust`, `bench-retrieval` targets.
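
The geometric half of that recall@k is plain rectangle intersection on
A1-anchored ranges. A minimal sketch, assuming uppercase `sheet!A1:Z99`
anchors without `$` signs; the helper names are illustrative, not
`eval_retrieval.py`'s actual API:

```python
# Sketch of the geometric-overlap check behind geometric recall@k.
# Assumes plain uppercase A1 references ("Revenue!C5:D9"); $-anchored or
# lowercase refs are out of scope for this illustration.
import re

CELL = re.compile(r"([A-Z]+)([0-9]+)")

def col_to_index(col: str) -> int:
    """Convert an A1 column label (A, B, ..., AA) to a 1-based index."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def parse_range(ref: str) -> tuple[str, int, int, int, int]:
    """Split 'Sheet1!B2:D10' into (sheet, min_col, min_row, max_col, max_row)."""
    sheet, _, cells = ref.partition("!")
    start, _, end = cells.partition(":")
    end = end or start  # single-cell anchors like 'Revenue!C7'
    c1, r1 = CELL.match(start).groups()
    c2, r2 = CELL.match(end).groups()
    return sheet, col_to_index(c1), int(r1), col_to_index(c2), int(r2)

def ranges_overlap(a: str, b: str) -> bool:
    """True when two anchored ranges share at least one cell."""
    sa, ac1, ar1, ac2, ar2 = parse_range(a)
    sb, bc1, br1, bc2, br2 = parse_range(b)
    return (sa == sb
            and ac1 <= bc2 and bc1 <= ac2   # column intervals intersect
            and ar1 <= br2 and br1 <= ar2)  # row intervals intersect
```

`ranges_overlap("Revenue!C5:D9", "Revenue!A1:C6")` is `True`: any shared
cell between a chunk's anchor and the ground-truth `answer_position`
counts as a geometric hit.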

### Fixed
- `src/rendering/text_renderer.py`: numeric cells now render the raw value
(`1272`) instead of Excel's display-formatted string (`1,272.00`). The
display format defeated substring-match retrieval for the most common RAG
query shape ("what was the value in 2020?" → user types `1272`).
- `src/rendering/text_renderer.py`: the `[=]` formula marker no longer
spuriously inflates a cell past its column width, which used to trigger
a sci-notation fallback (`1.272000e+03`) on perfectly small values.
Column widths are now computed from the same rendered values the data
rows will use, so the long-value fallback only triggers on genuinely
too-wide values.
- `src/rendering/text_renderer.py`: dates render as ISO `YYYY-MM-DD` and drop
the spurious `00:00:00` time component on midnight datetimes.
- `src/rendering/text_renderer.py`: embedded newlines inside header cells
(e.g. `"租金\n天数"`) collapse to spaces so they don't tear apart the
Markdown grid (regression fixed for `租赁收入计提表.xlsx`-class layouts).
- `src/chunking/segmenter.py`: removed `_detect_style_boundaries`. The
function split a coherent table into 5 fragments at fill-color band
boundaries (year-banding, alternating-row shading), shedding header
context from data rows. The connected-components + gap detection
already handles real boundaries; fill banding is not a semantic boundary.
- `src/parsers/cell_parser.py`: `GradientFill` cells no longer crash the
sheet parser. Accessing `.patternType` on a `GradientFill` (vs the
expected `PatternFill`) raised `AttributeError`, which propagated up and
killed every cell on the sheet. We don't model gradients, but we no
longer drop the sheet because of them (caught by SpreadsheetBench
instance `118-8`, 8 sheets / 1,244 cells previously lost). A sketch of
the fix follows this list.
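
The `GradientFill` fix reduces to a single isinstance gate. A minimal
sketch, assuming openpyxl's fill types; the surrounding parser plumbing
is illustrative, not `cell_parser.py`'s actual code:

```python
# PatternFill and GradientFill are real openpyxl types; only PatternFill
# carries .patternType, so blind attribute access crashes on gradients.
from openpyxl.styles import PatternFill

def fill_pattern_type(cell) -> str | None:
    """Return the pattern type for PatternFill cells, None otherwise."""
    fill = cell.fill
    if isinstance(fill, PatternFill):
        return fill.patternType  # e.g. "solid", or None when unset
    return None  # gradients aren't modeled; keep the cell (and sheet) alive
```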

### Changed
- `tests/benchmarks/_schema.py`: `formulas` is now nullable on `status=ok`
records. Parsers that don't model formulas (Docling, Marker) can now
emit valid `BenchmarkRecord`s without tripping schema validation. The
schema's load-bearing `None` vs `0` distinction is preserved: `None` =
"feature not modeled by this parser", `0` = "modeled and observed zero".

### Removed
- `scripts/compare_docling.py` — superseded by the unified `tests/benchmarks/`
framework + `eval_retrieval.py`. The old script's `ScoreCard` composite
score was structurally biased (formula-preservation gave Docling a 0 by
definition while contributing 20/100 points; header-propagation used
different proxies for each parser); replaced by parser-agnostic
text-match and geometric recall metrics.

### Performance
- ks-xlsx-parser now averages ~5% faster parse times than Docling on
SpreadsheetBench (251 ms vs 265 ms mean), while producing richer output
(formulas, dependency graph, charts, named ranges, etc.).

### Docs
- `tests/benchmarks/README.md` — new — methodology + adapter design.
- `tests/benchmarks/reports/COMPARISON.md` — new — head-to-head report.
- README — new "Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench"
section near the top with the headline table.

### Internal
- `tests/test_rendering.py`: updated `test_numeric_cells_use_scientific_notation_not_truncation`
to assert the new raw-numeric rendering (test renamed
`test_numeric_cells_render_raw_not_display_formatted`).
- `.gitignore`: `data/corpora/` (downloaded benchmark corpora; can run to
several GB).
- `Makefile`: `bench`, `bench-robust`, `bench-retrieval` targets.

## [0.1.1] — 2026-04-17

**First public release.** MIT-licensed, open-sourced under the
19 changes: 18 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
.PHONY: help install test test-ci testbench testbench-build testbench-zip lint format typecheck clean corpus-download
.PHONY: help install test test-ci testbench testbench-build testbench-zip lint format typecheck clean corpus-download bench-robust bench-retrieval bench

PYTHON ?= python
PKG_VERSION := $(shell $(PYTHON) -c "import tomllib, pathlib; print(tomllib.loads(pathlib.Path('pyproject.toml').read_text())['project']['version'])")
@@ -20,6 +20,10 @@ help:
@echo " make typecheck mypy"
@echo ""
@echo " make corpus-download Fetch public XLSX corpora for extended robustness"
@echo ""
@echo " make bench-robust Robustness on SpreadsheetBench (ks vs docling, ~20 min)"
@echo " make bench-retrieval Retrieval recall on SpreadsheetBench (ks vs docling, ~40 min)"
@echo " make bench Run both benchmarks back-to-back"

install:
$(PYTHON) -m pip install -e ".[dev,api]"
@@ -62,3 +66,16 @@ clean:

corpus-download:
./scripts/download_corpora.sh

bench-robust:
@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
PYTHONPATH=src $(PYTHON) -m tests.benchmarks.vs_hucre \
--corpus data/corpora/spreadsheetbench --parsers ks,docling \
--per-file-timeout 120 \
--out tests/benchmarks/reports/spreadsheetbench

bench-retrieval:
@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
PYTHONPATH=src $(PYTHON) scripts/eval_retrieval.py --parsers ks,docling

bench: bench-robust bench-retrieval
144 changes: 88 additions & 56 deletions README.md
@@ -79,6 +79,90 @@ graph that drops straight into [LangChain](https://www.langchain.com/),

---

## 🏁 Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench

<p align="center">
<a href="tests/benchmarks/reports/COMPARISON.md"><img src="https://img.shields.io/badge/SpreadsheetBench-912%20instances%20%C2%B7%205%2C458%20xlsx-047857?style=for-the-badge&logo=microsoftexcel&logoColor=white" alt="SpreadsheetBench"></a>
<a href="tests/benchmarks/reports/COMPARISON.md"><img src="https://img.shields.io/badge/parse%20success-99.945%25-22C55E?style=for-the-badge&logo=checkmarx&logoColor=white" alt="Parse success"></a>
<a href="tests/benchmarks/reports/COMPARISON.md"><img src="https://img.shields.io/badge/recall%403%20vs%20Docling-%2B2.7%20pp-22C55E?style=for-the-badge&logo=target&logoColor=white" alt="Recall@3 vs Docling"></a>
<a href="tests/benchmarks/reports/COMPARISON.md"><img src="https://img.shields.io/badge/citation%20anchors-A1%20per%20chunk-047857?style=for-the-badge&logo=anchor&logoColor=white" alt="A1 anchors"></a>
</p>

An apples-to-apples comparison on [SpreadsheetBench v0.1](https://github.com/RUCKBReasoning/SpreadsheetBench): 912 real-world task instances curated from ExcelHome, Mr.Excel, and r/excel. For each instance we parse the input `.xlsx`, embed every chunk with `BAAI/bge-small-en-v1.5`, and check whether a chunk containing the ground-truth answer lands in the top-k by similarity to the question.
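
The measurement loop itself is small. A condensed sketch, not
`scripts/eval_retrieval.py`'s exact code; the answer/chunk normalisation
is elided here (a sketch of it appears under "What the numbers mean"):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def hit_at_k(question: str, chunks: list[str], answer: str, k: int) -> bool:
    """True when a chunk containing the answer ranks in the top-k by similarity."""
    emb = model.encode(chunks + [question], normalize_embeddings=True)
    sims = emb[:-1] @ emb[-1]      # cosine similarity via normalised dot products
    top_k = np.argsort(-sims)[:k]  # indices of the k most similar chunks
    return any(answer in chunks[i] for i in top_k)
```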

<table>
<thead>
<tr>
<th align="left">Metric</th>
<th align="center" bgcolor="#047857"><span style="color:#FFFFFF"><b>🟢 ks-xlsx-parser</b></span></th>
<th align="center" bgcolor="#475569"><span style="color:#FFFFFF"><b>⚪ Docling 2.93</b></span></th>
<th align="center">Δ</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>📊 Parse success</b><br/><sub>5,458-file corpus</sub></td>
<td align="center" bgcolor="#D1FAE5"><img src="https://img.shields.io/badge/99.945%25-047857?style=flat-square&labelColor=047857" alt="99.945%"><br/><sub>5,461 ok · 3 timeouts · 0 errors</sub></td>
<td align="center" bgcolor="#F1F5F9"><sub>not run at scale</sub></td>
<td align="center">—</td>
</tr>
<tr>
<td><b>🎯 Recall@1</b><br/><sub>text-match</sub></td>
<td align="center" bgcolor="#D1FAE5"><img src="https://img.shields.io/badge/0.580-047857?style=flat-square" alt="0.580"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/0.579-64748B?style=flat-square" alt="0.579"></td>
<td align="center"><img src="https://img.shields.io/badge/tied-22C55E?style=flat-square" alt="tied"></td>
</tr>
<tr>
<td><b>🎯 Recall@3</b><br/><sub>text-match</sub></td>
<td align="center" bgcolor="#A7F3D0"><img src="https://img.shields.io/badge/0.697-047857?style=flat-square" alt="0.697"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/0.670-64748B?style=flat-square" alt="0.670"></td>
<td align="center"><img src="https://img.shields.io/badge/%2B2.7%20pp-22C55E?style=flat-square&logo=arrowup&logoColor=white" alt="+2.7 pp"></td>
</tr>
<tr>
<td><b>🎯 Recall@5</b><br/><sub>text-match</sub></td>
<td align="center" bgcolor="#A7F3D0"><img src="https://img.shields.io/badge/0.704-047857?style=flat-square" alt="0.704"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/0.686-64748B?style=flat-square" alt="0.686"></td>
<td align="center"><img src="https://img.shields.io/badge/%2B1.8%20pp-22C55E?style=flat-square&logo=arrowup&logoColor=white" alt="+1.8 pp"></td>
</tr>
<tr>
<td><b>📍 Geometric Recall@5</b><br/><sub>chunk's <code>sheet!A1:Z99</code> overlaps the ground-truth range</sub></td>
<td align="center" bgcolor="#6EE7B7"><img src="https://img.shields.io/badge/0.369-064E3B?style=flat-square" alt="0.369"></td>
<td align="center" bgcolor="#FEE2E2"><img src="https://img.shields.io/badge/0.000-991B1B?style=flat-square" alt="0.000"></td>
<td align="center"><img src="https://img.shields.io/badge/citation--grade%20only-047857?style=flat-square&logo=anchor&logoColor=white" alt="citation-grade only"></td>
</tr>
<tr>
<td><b>⚡ Mean parse time</b><br/><sub>per file</sub></td>
<td align="center" bgcolor="#D1FAE5"><img src="https://img.shields.io/badge/251%20ms-047857?style=flat-square" alt="251 ms"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/265%20ms-64748B?style=flat-square" alt="265 ms"></td>
<td align="center"><img src="https://img.shields.io/badge/%7E5%25%20faster-22C55E?style=flat-square" alt="~5% faster"></td>
</tr>
<tr>
<td><b>🧱 Parser errors</b><br/><sub>across 912 instances</sub></td>
<td align="center" bgcolor="#D1FAE5"><img src="https://img.shields.io/badge/0-047857?style=flat-square" alt="0"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/0-64748B?style=flat-square" alt="0"></td>
<td align="center">—</td>
</tr>
</tbody>
</table>

### 💡 What the numbers mean

- **`ks-xlsx-parser` ties at recall@1 and wins recall@3 (+2.7 pp) and recall@5 (+1.8 pp).** Text-match recall is parser-agnostic — it only asks whether one of a parser's top-k chunks contains the answer string, after normalising commas, percent signs, ISO dates, and booleans on both sides (a sketch of this normalisation follows the list).
- **`ks-xlsx-parser` wins citation-grade (geometric) recall outright (0.369 vs 0.000).** Docling produces markdown without per-chunk `sheet!range` anchors, so it can't render a citation that points at the exact source cells. This is the difference between "the answer is somewhere in the workbook" and "the answer is in `Revenue!C7`."
- **`Marker` is excluded by design.** Its xlsx → HTML → PDF → layout-recognition pipeline clocks >30 min per workbook on CPU. The benchmark framework supports adding a Marker adapter when GPU is available — see [`tests/benchmarks/adapters/docling_adapter.py`](tests/benchmarks/adapters/docling_adapter.py) as a template.
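
A minimal sketch of that normalisation, under the rules named above; the
exact rule set lives in `scripts/eval_retrieval.py`:

```python
# Applied to both chunk text and ground-truth answers before substring
# matching, so "1,272.00", "1272", and 1272 all compare equal.
from datetime import datetime

def normalise(value: str) -> str:
    v = value.strip()
    if v.lower() in {"true", "false"}:         # booleans -> canonical lowercase
        return v.lower()
    stripped = v.replace(",", "").rstrip("%")  # "1,272.00" -> "1272.00"; "12%" -> "12"
    try:
        return format(float(stripped), "g")    # "1272.00" -> "1272"
    except ValueError:
        pass
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):       # dates -> ISO YYYY-MM-DD
        try:
            return datetime.strptime(v, fmt).date().isoformat()
        except ValueError:
            continue
    return v
```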

### 🔁 Reproduce

```bash
make corpus-download # one-time, ~100 MB; gitignored under data/corpora/
make bench # robustness + retrieval, ~50 min on M-series CPU
open tests/benchmarks/reports/COMPARISON.md
```

Full methodology, capability matrix, error breakdown, and caveats live in [`tests/benchmarks/reports/COMPARISON.md`](tests/benchmarks/reports/COMPARISON.md). Adapter design notes in [`tests/benchmarks/README.md`](tests/benchmarks/README.md).

---

## ✨ What you get, at a glance

<table>
@@ -161,6 +245,7 @@ That's it. Every chunk has:

## 🗺️ Table of Contents

- [🏁 Benchmark — vs Docling on SpreadsheetBench](#-benchmark--ks-xlsx-parser-vs-docling-on-spreadsheetbench)
- [🤔 Why a dedicated XLSX parser for LLMs?](#-why-a-dedicated-xlsx-parser-for-llms)
- [🏗️ Architecture](#️-architecture)
- [📦 Installation](#-installation)
@@ -201,62 +286,7 @@ corpus, and everything is open source.

## 🏗️ Architecture

```mermaid
%%{init: {'theme':'base', 'themeVariables': {
'primaryColor':'#10B981','primaryTextColor':'#fff','primaryBorderColor':'#047857',
'lineColor':'#94A3B8','secondaryColor':'#22C55E','tertiaryColor':'#34D399',
'background':'#FFFFFF','mainBkg':'#10B981','clusterBkg':'#F0FDF4'
}}}%%
flowchart TD
IN([📄 .xlsx bytes])
PARSE[["① parsers/<br/>OOXML drivers<br/><i>openpyxl + lxml</i>"]]
MODELS[["② models/<br/>Pydantic DTOs<br/><i>Workbook · Sheet · Cell · Table · Chart</i>"]]
FORMULA[["③ formula/<br/>lexer + parser<br/><i>cross-sheet · table · array</i>"]]
ANALYSIS[["④ analysis/<br/>dependency graph<br/><i>cycles · impact</i>"]]
CHARTS[["⑤ charts/<br/>OOXML chart extraction"]]
ANNOT[["⑥ annotation/<br/>semantic roles · KPIs"]]
SEG[["⑦ chunking/<br/>adaptive segmenter"]]
REND[["⑧ rendering/<br/>HTML + pipe-text<br/>token counts"]]
STORE[["🗄️ storage/<br/>JSON · DB rows · vectors"]]
VER[["✅ verification/<br/>stage assertions"]]
CMP[["🔀 comparison/<br/>multi-workbook templates"]]
EXP[["🧬 export/<br/>generated importer"]]
OUT([🤖 LLM-ready chunks<br/>with citations])

IN --> PARSE --> MODELS
MODELS --> FORMULA
MODELS --> ANALYSIS
MODELS --> CHARTS
FORMULA --> ANALYSIS
ANALYSIS --> ANNOT
CHARTS --> ANNOT
ANNOT --> SEG --> REND --> STORE
MODELS --> VER
STORE --> OUT
STORE -.-> CMP -.-> EXP

%% All-green palette: deepest for entry, lightest for auxiliary stages,
%% emerald for the headline output node.
classDef entry fill:#064E3B,stroke:#022C22,color:#fff,stroke-width:2px;
classDef parse fill:#065F46,stroke:#022C22,color:#fff,stroke-width:2px;
classDef model fill:#047857,stroke:#064E3B,color:#fff,stroke-width:2px;
classDef analyze fill:#059669,stroke:#065F46,color:#fff,stroke-width:2px;
classDef render fill:#16A34A,stroke:#166534,color:#fff,stroke-width:2px;
classDef output fill:#22C55E,stroke:#15803D,color:#fff,stroke-width:2px;
classDef aux fill:#A7F3D0,stroke:#047857,color:#065F46,stroke-width:2px;

class IN entry
class PARSE parse
class MODELS model
class FORMULA,ANALYSIS,CHARTS analyze
class ANNOT,SEG,REND render
class STORE,OUT output
class VER,CMP,EXP aux
```

The pipeline has **8 stages** (parse → analyse → annotate → segment →
render → serialise → verify → compare/export). Full breakdown in
[**Pipeline Internals**](docs/wiki/Pipeline-Internals.md).
The pipeline runs **8 deterministic stages**: parse → analyse → annotate → segment → render → serialise → verify → compare/export. Full diagram, stage-by-stage breakdown, and module map in [**docs/wiki/Architecture.md**](docs/wiki/Architecture.md). Stage internals in [**Pipeline Internals**](docs/wiki/Pipeline-Internals.md).

> [!NOTE]
> The importable module is `xlsx_parser`; `ks_xlsx_parser` is a re-export
@@ -309,6 +339,8 @@ on each release) so this README stays scannable:

## ⚔️ How it compares

This is the **structural** capability matrix. For head-to-head retrieval numbers (recall@k, geometric, latency) on a 912-instance real-world corpus, see [🏁 Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench](#-benchmark--ks-xlsx-parser-vs-docling-on-spreadsheetbench) up top.

| | pandas / openpyxl | Docling | `ks-xlsx-parser` |
|---|:---:|:---:|:---:|
| Reads values | ✅ | ✅ | ✅ |