19 changes: 18 additions & 1 deletion .github/workflows/release.yml
@@ -59,14 +59,31 @@ jobs:
name: dist
path: dist

- name: Find hand-written release notes (if any)
id: notes
run: |
# Use docs/launch/RELEASE_NOTES_<tag>.md as the release body when
# present, otherwise fall back to GitHub's auto-generated notes.
NOTES="docs/launch/RELEASE_NOTES_${GITHUB_REF_NAME}.md"
if [ -f "$NOTES" ]; then
echo "path=$NOTES" >> "$GITHUB_OUTPUT"
echo "auto=false" >> "$GITHUB_OUTPUT"
echo "Found release notes: $NOTES"
else
echo "path=" >> "$GITHUB_OUTPUT"
echo "auto=true" >> "$GITHUB_OUTPUT"
echo "No hand-written release notes at $NOTES; using auto-generated."
fi

- name: Create GitHub release
uses: softprops/action-gh-release@v2
with:
files: |
dist/*.whl
dist/*.tar.gz
dist/testBench-v*.zip
generate_release_notes: true
body_path: ${{ steps.notes.outputs.path }}
generate_release_notes: ${{ steps.notes.outputs.auto == 'true' }}

pypi:
needs: build
6 changes: 6 additions & 0 deletions .gitignore
@@ -72,3 +72,9 @@ rust/**/target/
rust/**/Cargo.lock.bak
**/.cargo-lock
**/.fingerprint/

# Downloaded benchmark corpora (see scripts/download_corpora.sh)
data/corpora/

# Conductor / Claude Code transient state
.claude/
92 changes: 92 additions & 0 deletions CHANGELOG.md
@@ -47,6 +47,98 @@ Template for a new release (copy this block, fill in, move Unreleased items in):

Nothing yet. Open a PR and add your entry under the appropriate heading.

## [0.2.0] — 2026-05-11

**Benchmark + retrievability release.** Adds a head-to-head benchmark against
[Docling](https://github.com/DS4SD/docling) on the [SpreadsheetBench](https://github.com/RUCKBReasoning/SpreadsheetBench)
corpus (912 instances, 5,458 xlsx files) and fixes four rendering bugs that
were silently torpedoing RAG retrieval. ks-xlsx-parser parses **99.945%** of
SpreadsheetBench, **ties Docling at recall@1, and wins at recall@3 (+2.7 pp)
and recall@5 (+1.8 pp)**, plus 36.9% citation-grade geometric recall (Docling
scores 0% structurally: its output carries no A1 anchors).

### Added
- `tests/benchmarks/adapters/docling_adapter.py` — Docling adapter speaking the
same NDJSON-worker protocol as `ks_adapter.py` (#TBD).
- `tests/benchmarks/_runner.py`: `docling_runner` factory wired into
`vs_hucre.py`'s `--parsers` dispatch.
- `scripts/eval_retrieval.py` — retrieval-recall benchmark over
SpreadsheetBench's `(instruction, data_position, answer_position)` triples.
Uses `sentence-transformers` (default `BAAI/bge-small-en-v1.5`) and computes
geometric overlap plus numeric/date/boolean-normalized text-match recall@k;
a sketch of the overlap check follows this list. Runs Docling in a
persistent subprocess with a hard-kill timeout, because PyTorch's
table-recognition loop holds the GIL inside C code, so in-process
timeouts never fire.
- `scripts/summarize_retrieval.py` — re-aggregate a `results.ndjson` into
`summary.json` / `summary.md` if a long run is interrupted.
- `scripts/download_corpora.sh`: fetches SpreadsheetBench v0.1 (~96 MB tar.gz)
into `data/corpora/spreadsheetbench/` (gitignored).
- `tests/benchmarks/README.md` — adapter design notes + benchmark how-to.
- `tests/benchmarks/reports/COMPARISON.md` — head-to-head report incl.
methodology, capability matrix, caveats.
- `Makefile`: `bench`, `bench-robust`, `bench-retrieval` targets.
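
The geometric half of that recall@k is plain rectangle intersection on
A1-anchored ranges. A minimal sketch, assuming uppercase `sheet!A1:Z99`
anchors without `$` signs; the helper names are illustrative, not
`eval_retrieval.py`'s actual API:

```python
# Sketch of the geometric-overlap check behind geometric recall@k.
# Assumes plain uppercase A1 references ("Revenue!C5:D9"); $-anchored or
# lowercase refs are out of scope for this illustration.
import re

CELL = re.compile(r"([A-Z]+)([0-9]+)")

def col_to_index(col: str) -> int:
    """Convert an A1 column label (A, B, ..., AA) to a 1-based index."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n

def parse_range(ref: str) -> tuple[str, int, int, int, int]:
    """Split 'Sheet1!B2:D10' into (sheet, min_col, min_row, max_col, max_row)."""
    sheet, _, cells = ref.partition("!")
    start, _, end = cells.partition(":")
    end = end or start  # single-cell anchors like 'Revenue!C7'
    c1, r1 = CELL.match(start).groups()
    c2, r2 = CELL.match(end).groups()
    return sheet, col_to_index(c1), int(r1), col_to_index(c2), int(r2)

def ranges_overlap(a: str, b: str) -> bool:
    """True when two anchored ranges share at least one cell."""
    sa, ac1, ar1, ac2, ar2 = parse_range(a)
    sb, bc1, br1, bc2, br2 = parse_range(b)
    return (sa == sb
            and ac1 <= bc2 and bc1 <= ac2   # column intervals intersect
            and ar1 <= br2 and br1 <= ar2)  # row intervals intersect
```

`ranges_overlap("Revenue!C5:D9", "Revenue!A1:C6")` is `True`: any shared
cell between a chunk's anchor and the ground-truth `answer_position`
counts as a geometric hit.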

### Fixed
- `src/rendering/text_renderer.py`: numeric cells now render the raw value
(`1272`) instead of Excel's display-formatted string (`1,272.00`). The
display format defeated substring-match retrieval for the most common RAG
query shape ("what was the value in 2020?" → user types `1272`).
- `src/rendering/text_renderer.py`: the `[=]` formula marker no longer
spuriously inflates a cell past its column width, which used to trigger
a sci-notation fallback (`1.272000e+03`) on perfectly small values.
Column widths are now computed from the same rendered values the data
rows will use, so the long-value fallback only triggers on genuinely
too-wide values.
- `src/rendering/text_renderer.py`: dates render as ISO `YYYY-MM-DD` and drop
the spurious `00:00:00` time component on midnight datetimes.
- `src/rendering/text_renderer.py`: embedded newlines inside header cells
(e.g. `"租金\n天数"`) collapse to spaces so they don't tear apart the
Markdown grid (regression fixed for `租赁收入计提表.xlsx`-class layouts).
- `src/chunking/segmenter.py`: removed `_detect_style_boundaries`. The
function split a coherent table into 5 fragments at fill-color band
boundaries (year-banding, alternating-row shading), shedding header
context from data rows. The connected-components + gap detection
already handles real boundaries; fill banding is not a semantic boundary.
- `src/parsers/cell_parser.py`: `GradientFill` cells no longer crash the
sheet parser. Accessing `.patternType` on a `GradientFill` (vs the
expected `PatternFill`) raised `AttributeError`, which propagated up and
killed every cell on the sheet. We don't model gradients, but we no
longer drop the sheet because of them (caught by SpreadsheetBench
instance `118-8`, 8 sheets / 1,244 cells previously lost). A sketch of
the fix follows this list.
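
The `GradientFill` fix reduces to a single isinstance gate. A minimal
sketch, assuming openpyxl's fill types; the surrounding parser plumbing
is illustrative, not `cell_parser.py`'s actual code:

```python
# PatternFill and GradientFill are real openpyxl types; only PatternFill
# carries .patternType, so blind attribute access crashes on gradients.
from openpyxl.styles import PatternFill

def fill_pattern_type(cell) -> str | None:
    """Return the pattern type for PatternFill cells, None otherwise."""
    fill = cell.fill
    if isinstance(fill, PatternFill):
        return fill.patternType  # e.g. "solid", or None when unset
    return None  # gradients aren't modeled; keep the cell (and sheet) alive
```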

### Changed
- `tests/benchmarks/_schema.py`: `formulas` is now nullable on `status=ok`
records. Parsers that don't model formulas (Docling, Marker) can now
emit valid `BenchmarkRecord`s without tripping schema validation. The
schema's load-bearing `None` vs `0` distinction is preserved: `None` =
"feature not modeled by this parser", `0` = "modeled and observed zero".

### Removed
- `scripts/compare_docling.py` — superseded by the unified `tests/benchmarks/`
framework + `eval_retrieval.py`. The old script's `ScoreCard` composite
score was structurally biased (formula-preservation gave Docling a 0 by
definition while contributing 20/100 points; header-propagation used
different proxies for each parser); replaced by parser-agnostic
text-match and geometric recall metrics.

### Performance
- ks-xlsx-parser now averages ~5% faster parse times than Docling on
SpreadsheetBench (251 ms vs 265 ms mean), while producing richer output
(formulas, dependency graph, charts, named ranges, etc.).

### Docs
- `tests/benchmarks/README.md` — new — methodology + adapter design.
- `tests/benchmarks/reports/COMPARISON.md` — new — head-to-head report.
- README — new "Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench"
section near the top with the headline table.

### Internal
- `tests/test_rendering.py`: updated `test_numeric_cells_use_scientific_notation_not_truncation`
to assert the new raw-numeric rendering (test renamed
`test_numeric_cells_render_raw_not_display_formatted`).
- `.gitignore`: `data/corpora/` (downloaded benchmark corpora; can run to
several GB).
- `Makefile`: `bench`, `bench-robust`, `bench-retrieval` targets.

## [0.1.1] — 2026-04-17

**First public release.** MIT-licensed, open-sourced under the
19 changes: 18 additions & 1 deletion Makefile
@@ -1,4 +1,4 @@
.PHONY: help install test test-ci testbench testbench-build testbench-zip lint format typecheck clean corpus-download
.PHONY: help install test test-ci testbench testbench-build testbench-zip lint format typecheck clean corpus-download bench-robust bench-retrieval bench

PYTHON ?= python
PKG_VERSION := $(shell $(PYTHON) -c "import tomllib, pathlib; print(tomllib.loads(pathlib.Path('pyproject.toml').read_text())['project']['version'])")
@@ -20,6 +20,10 @@ help:
@echo " make typecheck mypy"
@echo ""
@echo " make corpus-download Fetch public XLSX corpora for extended robustness"
@echo ""
@echo " make bench-robust Robustness on SpreadsheetBench (ks vs docling, ~20 min)"
@echo " make bench-retrieval Retrieval recall on SpreadsheetBench (ks vs docling, ~40 min)"
@echo " make bench Run both benchmarks back-to-back"

install:
$(PYTHON) -m pip install -e ".[dev,api]"
@@ -62,3 +66,16 @@ clean:

corpus-download:
./scripts/download_corpora.sh

bench-robust:
@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
PYTHONPATH=src $(PYTHON) -m tests.benchmarks.vs_hucre \
--corpus data/corpora/spreadsheetbench --parsers ks,docling \
--per-file-timeout 120 \
--out tests/benchmarks/reports/spreadsheetbench

bench-retrieval:
@test -d data/corpora/spreadsheetbench || (echo "Corpus missing. Run 'make corpus-download' first." && exit 1)
PYTHONPATH=src $(PYTHON) scripts/eval_retrieval.py --parsers ks,docling

bench: bench-robust bench-retrieval
144 changes: 88 additions & 56 deletions README.md
@@ -79,6 +79,90 @@ graph that drops straight into [LangChain](https://www.langchain.com/),

---

## 🏁 Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench

<p align="center">
<a href="tests/benchmarks/reports/COMPARISON.md"><img src="https://img.shields.io/badge/SpreadsheetBench-912%20instances%20%C2%B7%205%2C458%20xlsx-047857?style=for-the-badge&logo=microsoftexcel&logoColor=white" alt="SpreadsheetBench"></a>
<a href="tests/benchmarks/reports/COMPARISON.md"><img src="https://img.shields.io/badge/parse%20success-99.945%25-22C55E?style=for-the-badge&logo=checkmarx&logoColor=white" alt="Parse success"></a>
<a href="tests/benchmarks/reports/COMPARISON.md"><img src="https://img.shields.io/badge/recall%403%20vs%20Docling-%2B2.7%20pp-22C55E?style=for-the-badge&logo=target&logoColor=white" alt="Recall@3 vs Docling"></a>
<a href="tests/benchmarks/reports/COMPARISON.md"><img src="https://img.shields.io/badge/citation%20anchors-A1%20per%20chunk-047857?style=for-the-badge&logo=anchor&logoColor=white" alt="A1 anchors"></a>
</p>

An apples-to-apples comparison on [SpreadsheetBench v0.1](https://github.com/RUCKBReasoning/SpreadsheetBench): 912 real-world task instances curated from ExcelHome, Mr.Excel, and r/excel. For each instance we parse the input `.xlsx`, embed every chunk with `BAAI/bge-small-en-v1.5`, and check whether a chunk containing the ground-truth answer lands in the top-k by similarity to the question.
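
The measurement loop itself is small. A condensed sketch, not
`scripts/eval_retrieval.py`'s exact code; the answer/chunk normalisation
is elided here (a sketch of it appears under "What the numbers mean"):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")

def hit_at_k(question: str, chunks: list[str], answer: str, k: int) -> bool:
    """True when a chunk containing the answer ranks in the top-k by similarity."""
    emb = model.encode(chunks + [question], normalize_embeddings=True)
    sims = emb[:-1] @ emb[-1]      # cosine similarity via normalised dot products
    top_k = np.argsort(-sims)[:k]  # indices of the k most similar chunks
    return any(answer in chunks[i] for i in top_k)
```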

<table>
<thead>
<tr>
<th align="left">Metric</th>
<th align="center" bgcolor="#047857"><span style="color:#FFFFFF"><b>🟢 ks-xlsx-parser</b></span></th>
<th align="center" bgcolor="#475569"><span style="color:#FFFFFF"><b>⚪ Docling 2.93</b></span></th>
<th align="center">Δ</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>📊 Parse success</b><br/><sub>5,458-file corpus</sub></td>
<td align="center" bgcolor="#D1FAE5"><img src="https://img.shields.io/badge/99.945%25-047857?style=flat-square&labelColor=047857" alt="99.945%"><br/><sub>5,461 ok · 3 timeouts · 0 errors</sub></td>
<td align="center" bgcolor="#F1F5F9"><sub>not run at scale</sub></td>
<td align="center">—</td>
</tr>
<tr>
<td><b>🎯 Recall@1</b><br/><sub>text-match</sub></td>
<td align="center" bgcolor="#D1FAE5"><img src="https://img.shields.io/badge/0.580-047857?style=flat-square" alt="0.580"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/0.579-64748B?style=flat-square" alt="0.579"></td>
<td align="center"><img src="https://img.shields.io/badge/tied-22C55E?style=flat-square" alt="tied"></td>
</tr>
<tr>
<td><b>🎯 Recall@3</b><br/><sub>text-match</sub></td>
<td align="center" bgcolor="#A7F3D0"><img src="https://img.shields.io/badge/0.697-047857?style=flat-square" alt="0.697"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/0.670-64748B?style=flat-square" alt="0.670"></td>
<td align="center"><img src="https://img.shields.io/badge/%2B2.7%20pp-22C55E?style=flat-square&logo=arrowup&logoColor=white" alt="+2.7 pp"></td>
</tr>
<tr>
<td><b>🎯 Recall@5</b><br/><sub>text-match</sub></td>
<td align="center" bgcolor="#A7F3D0"><img src="https://img.shields.io/badge/0.704-047857?style=flat-square" alt="0.704"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/0.686-64748B?style=flat-square" alt="0.686"></td>
<td align="center"><img src="https://img.shields.io/badge/%2B1.8%20pp-22C55E?style=flat-square&logo=arrowup&logoColor=white" alt="+1.8 pp"></td>
</tr>
<tr>
<td><b>📍 Geometric Recall@5</b><br/><sub>chunk's <code>sheet!A1:Z99</code> overlaps the ground-truth range</sub></td>
<td align="center" bgcolor="#6EE7B7"><img src="https://img.shields.io/badge/0.369-064E3B?style=flat-square" alt="0.369"></td>
<td align="center" bgcolor="#FEE2E2"><img src="https://img.shields.io/badge/0.000-991B1B?style=flat-square" alt="0.000"></td>
<td align="center"><img src="https://img.shields.io/badge/citation--grade%20only-047857?style=flat-square&logo=anchor&logoColor=white" alt="citation-grade only"></td>
</tr>
<tr>
<td><b>⚡ Mean parse time</b><br/><sub>per file</sub></td>
<td align="center" bgcolor="#D1FAE5"><img src="https://img.shields.io/badge/251%20ms-047857?style=flat-square" alt="251 ms"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/265%20ms-64748B?style=flat-square" alt="265 ms"></td>
<td align="center"><img src="https://img.shields.io/badge/%7E5%25%20faster-22C55E?style=flat-square" alt="~5% faster"></td>
</tr>
<tr>
<td><b>🧱 Parser errors</b><br/><sub>across 912 instances</sub></td>
<td align="center" bgcolor="#D1FAE5"><img src="https://img.shields.io/badge/0-047857?style=flat-square" alt="0"></td>
<td align="center" bgcolor="#F1F5F9"><img src="https://img.shields.io/badge/0-64748B?style=flat-square" alt="0"></td>
<td align="center">—</td>
</tr>
</tbody>
</table>

### 💡 What the numbers mean

- **`ks-xlsx-parser` ties at recall@1 and wins recall@3 (+2.7 pp) and recall@5 (+1.8 pp).** Text-match recall is parser-agnostic — it only asks whether one of a parser's top-k chunks contains the answer string, after normalising commas, percent signs, ISO dates, and booleans on both sides (a sketch of this normalisation follows the list).
- **`ks-xlsx-parser` wins citation-grade (geometric) recall outright (0.369 vs 0.000).** Docling produces markdown without per-chunk `sheet!range` anchors, so it can't render a citation that points at the exact source cells. This is the difference between "the answer is somewhere in the workbook" and "the answer is in `Revenue!C7`."
- **`Marker` is excluded by design.** Its xlsx → HTML → PDF → layout-recognition pipeline clocks >30 min per workbook on CPU. The benchmark framework supports adding a Marker adapter when GPU is available — see [`tests/benchmarks/adapters/docling_adapter.py`](tests/benchmarks/adapters/docling_adapter.py) as a template.
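
A minimal sketch of that normalisation, under the rules named above; the
exact rule set lives in `scripts/eval_retrieval.py`:

```python
# Applied to both chunk text and ground-truth answers before substring
# matching, so "1,272.00", "1272", and 1272 all compare equal.
from datetime import datetime

def normalise(value: str) -> str:
    v = value.strip()
    if v.lower() in {"true", "false"}:         # booleans -> canonical lowercase
        return v.lower()
    stripped = v.replace(",", "").rstrip("%")  # "1,272.00" -> "1272.00"; "12%" -> "12"
    try:
        return format(float(stripped), "g")    # "1272.00" -> "1272"
    except ValueError:
        pass
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):       # dates -> ISO YYYY-MM-DD
        try:
            return datetime.strptime(v, fmt).date().isoformat()
        except ValueError:
            continue
    return v
```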

### 🔁 Reproduce

```bash
make corpus-download # one-time, ~100 MB; gitignored under data/corpora/
make bench # robustness + retrieval, ~50 min on M-series CPU
open tests/benchmarks/reports/COMPARISON.md
```

Full methodology, capability matrix, error breakdown, and caveats live in [`tests/benchmarks/reports/COMPARISON.md`](tests/benchmarks/reports/COMPARISON.md). Adapter design notes in [`tests/benchmarks/README.md`](tests/benchmarks/README.md).

---

## ✨ What you get, at a glance

<table>
@@ -161,6 +245,7 @@ That's it. Every chunk has:

## 🗺️ Table of Contents

- [🏁 Benchmark — vs Docling on SpreadsheetBench](#-benchmark--ks-xlsx-parser-vs-docling-on-spreadsheetbench)
- [🤔 Why a dedicated XLSX parser for LLMs?](#-why-a-dedicated-xlsx-parser-for-llms)
- [🏗️ Architecture](#️-architecture)
- [📦 Installation](#-installation)
@@ -201,62 +286,7 @@ corpus, and everything is open source.

## 🏗️ Architecture

```mermaid
%%{init: {'theme':'base', 'themeVariables': {
'primaryColor':'#10B981','primaryTextColor':'#fff','primaryBorderColor':'#047857',
'lineColor':'#94A3B8','secondaryColor':'#22C55E','tertiaryColor':'#34D399',
'background':'#FFFFFF','mainBkg':'#10B981','clusterBkg':'#F0FDF4'
}}}%%
flowchart TD
IN([📄 .xlsx bytes])
PARSE[["① parsers/<br/>OOXML drivers<br/><i>openpyxl + lxml</i>"]]
MODELS[["② models/<br/>Pydantic DTOs<br/><i>Workbook · Sheet · Cell · Table · Chart</i>"]]
FORMULA[["③ formula/<br/>lexer + parser<br/><i>cross-sheet · table · array</i>"]]
ANALYSIS[["④ analysis/<br/>dependency graph<br/><i>cycles · impact</i>"]]
CHARTS[["⑤ charts/<br/>OOXML chart extraction"]]
ANNOT[["⑥ annotation/<br/>semantic roles · KPIs"]]
SEG[["⑦ chunking/<br/>adaptive segmenter"]]
REND[["⑧ rendering/<br/>HTML + pipe-text<br/>token counts"]]
STORE[["🗄️ storage/<br/>JSON · DB rows · vectors"]]
VER[["✅ verification/<br/>stage assertions"]]
CMP[["🔀 comparison/<br/>multi-workbook templates"]]
EXP[["🧬 export/<br/>generated importer"]]
OUT([🤖 LLM-ready chunks<br/>with citations])

IN --> PARSE --> MODELS
MODELS --> FORMULA
MODELS --> ANALYSIS
MODELS --> CHARTS
FORMULA --> ANALYSIS
ANALYSIS --> ANNOT
CHARTS --> ANNOT
ANNOT --> SEG --> REND --> STORE
MODELS --> VER
STORE --> OUT
STORE -.-> CMP -.-> EXP

%% All-green palette: deepest for entry, lightest for auxiliary stages,
%% emerald for the headline output node.
classDef entry fill:#064E3B,stroke:#022C22,color:#fff,stroke-width:2px;
classDef parse fill:#065F46,stroke:#022C22,color:#fff,stroke-width:2px;
classDef model fill:#047857,stroke:#064E3B,color:#fff,stroke-width:2px;
classDef analyze fill:#059669,stroke:#065F46,color:#fff,stroke-width:2px;
classDef render fill:#16A34A,stroke:#166534,color:#fff,stroke-width:2px;
classDef output fill:#22C55E,stroke:#15803D,color:#fff,stroke-width:2px;
classDef aux fill:#A7F3D0,stroke:#047857,color:#065F46,stroke-width:2px;

class IN entry
class PARSE parse
class MODELS model
class FORMULA,ANALYSIS,CHARTS analyze
class ANNOT,SEG,REND render
class STORE,OUT output
class VER,CMP,EXP aux
```

The pipeline has **8 stages** (parse → analyse → annotate → segment →
render → serialise → verify → compare/export). Full breakdown in
[**Pipeline Internals**](docs/wiki/Pipeline-Internals.md).
The pipeline runs **8 deterministic stages**: parse → analyse → annotate → segment → render → serialise → verify → compare/export. Full diagram, stage-by-stage breakdown, and module map in [**docs/wiki/Architecture.md**](docs/wiki/Architecture.md). Stage internals in [**Pipeline Internals**](docs/wiki/Pipeline-Internals.md).

> [!NOTE]
> The importable module is `xlsx_parser`; `ks_xlsx_parser` is a re-export
@@ -309,6 +339,8 @@ on each release) so this README stays scannable:

## ⚔️ How it compares

This is the **structural** capability matrix. For head-to-head retrieval numbers (recall@k, geometric, latency) on a 912-instance real-world corpus, see [🏁 Benchmark — ks-xlsx-parser vs Docling on SpreadsheetBench](#-benchmark--ks-xlsx-parser-vs-docling-on-spreadsheetbench) up top.

| | pandas / openpyxl | Docling | `ks-xlsx-parser` |
|---|:---:|:---:|:---:|
| Reads values | ✅ | ✅ | ✅ |