Commit 85fe388
fix: pred-fail empty-gold false positive — v29 92.5% (was 93.0%, qid 518 v13 audit-correction)
CC-CX-KM /cxkm review of c74b46c surfaced a P2 systemic scoring bug:
post-fix `scripts/rescore_arcwise.py` falls through to compare_results
with pred_rows=[] after pred exec failure, and when gold also returns
0 rows, compare_results([], []) returns match=True — silent false
positive for malformed pred SQL.
Verified the CX claim independently by re-executing every stored
match=True pred against the live DB (.tmp/scan_empty_pred_fp.py).
Exactly 1 qid affected: 518 moderate card_games "format with most
banned cards". Origin: v13 helallao grok-4.1-reasoning rescue
(2026-05-18) produced pred SQL missing the `WITH banned_counts AS (`
prefix → SyntaxError on every execution → empty pred_rows → blessed
because gold also returns 0 (card_games has no Banned legalities for
that question's filter set). Silently propagated v13 → v22 → v29.
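The scan described above can be sketched as follows. This is a minimal illustration, not the contents of .tmp/scan_empty_pred_fp.py: the record layout (`qid`, `match`, `pred_sql`), the helper names, the toy `legalities` table, and the SQL strings are all assumptions, and an in-memory SQLite database stands in for the live DB.

```python
import sqlite3


def pred_executes(conn, sql):
    """Return True if the SQL runs without a database error."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


def find_false_positives(conn, records):
    """Return qids stored as match=True whose pred SQL fails to execute."""
    return [
        r["qid"]
        for r in records
        if r["match"] and not pred_executes(conn, r["pred_sql"])
    ]


# Toy database (hypothetical schema): no rows with status 'Banned'.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legalities (format TEXT, status TEXT)")
conn.execute("INSERT INTO legalities VALUES ('vintage', 'Legal')")

records = [
    # Valid SQL that legitimately returns 0 rows: not a false positive.
    {"qid": 1, "match": True,
     "pred_sql": "SELECT format FROM legalities WHERE status = 'Banned'"},
    # Invented example of a CTE body that lost its WITH prefix.
    {"qid": 518, "match": True,
     "pred_sql": "SELECT format FROM legalities WHERE status = 'Banned') "
                 "SELECT format FROM banned_counts"},
    # Broken SQL, but already stored as match=False, so not reported.
    {"qid": 3, "match": False, "pred_sql": "SELECT nope FROM"},
]

suspects = find_false_positives(conn, records)
print(suspects)
```

Only records that are both stored as match=True and fail on re-execution are flagged, which is the condition that made qid 518 a silent false positive.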
Fix structure:
1. NEW src/nl_sql/eval/metrics/execution_accuracy.py::safe_compare_pred
   short-circuits match=False when pred_failed=True. The runner already
   guards this (eval/runner.py:662); only scripts/ bypassed it.
2. scripts/audit_rescore.py + scripts/rescore_arcwise.py migrated to
   safe_compare_pred with explicit pred_failed flag. 9 other voting
   scripts left as-is (not running on v29 ceiling work; backlog).
3. 8 merged baselines (v22-v29) surgically patched: qid 518
   match=True → False + match_note annotation. summary.matched
   recomputed.
4. 3 regression tests in tests/eval/test_metrics.py::TestSafeComparePred:
   - pred_failed=True short-circuits even when both sides empty
   - pred_failed=False passes through to compare_results
   - pred_failed=True with nonempty gold still returns match=False
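The guard in step 1 and the three cases in step 4 can be sketched as below. This is an assumption-laden illustration rather than the code in execution_accuracy.py: the `CompareResult` type, the keyword-only `pred_failed` parameter, the note text, and the stand-in `compare_results` (order-insensitive row equality) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CompareResult:
    match: bool
    note: str = ""


def compare_results(pred_rows, gold_rows):
    """Stand-in comparison: order-insensitive equality of result rows."""
    same = sorted(map(tuple, pred_rows)) == sorted(map(tuple, gold_rows))
    return CompareResult(match=same)


def safe_compare_pred(pred_rows, gold_rows, *, pred_failed):
    """Never bless a pred whose execution failed, whatever gold returned."""
    if pred_failed:
        return CompareResult(match=False, note="pred execution failed")
    return compare_results(pred_rows, gold_rows)


# The bug: two empty result sets compare equal, even if pred never ran.
bug = compare_results([], [])

# The three regression cases listed in step 4.
both_empty = safe_compare_pred([], [], pred_failed=True)
passthrough = safe_compare_pred([], [], pred_failed=False)
nonempty_gold = safe_compare_pred([], [("legacy",)], pred_failed=True)

print(bug.match, both_empty.match, passthrough.match, nonempty_gold.match)
```

Making `pred_failed` a required keyword argument forces each calling script to state whether execution succeeded, so an empty `pred_rows` cannot be mistaken for a valid empty result.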
Corrected v29 triplet:
  BIRD original        : 185/200 = 92.5%  (was 186/200 / 93.0%)
  Arcwise sql_only     : 148/199 = 74.37% (was 149/199 / 74.87%)
  Arcwise full         : 136/199 = 68.34% (was 137/199 / 68.84%)
  Per-tier moderate    : 90/99 = 90.9% (was 91/99 / 91.9%, qid 518 is moderate)
  Per-tier simple      : 65/67 = 97.0% (unchanged)
  Per-tier challenging : 30/34 = 88.2% (unchanged)
Above #1 paid SOTA AskData+GPT-4o (81.95%) by +10.55pp (was +11.05pp).
Now 0.46pp below human-expert (BIRD paper 92.96%); was 0.04pp above.
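The corrected figures can be re-derived from the per-tier counts with a few lines of arithmetic. The sketch below uses only numbers quoted in this message; the `pct` helper and variable names are illustrative.

```python
def pct(matched, total, digits=1):
    """Percentage of matched items, rounded to the given number of digits."""
    return round(100 * matched / total, digits)


# Per-tier (matched, total) counts after the qid 518 correction.
tiers = {"simple": (65, 67), "moderate": (90, 99), "challenging": (30, 34)}

matched = sum(m for m, _ in tiers.values())
total = sum(t for _, t in tiers.values())
headline = pct(matched, total)

# Gaps to the two reference points quoted above.
sota_gap = round(headline - 81.95, 2)   # vs AskData+GPT-4o
human_gap = round(92.96 - headline, 2)  # vs BIRD paper human-expert

print(matched, total, headline, sota_gap, human_gap)
print(pct(148, 199, 2), pct(136, 199, 2))
```

The tier counts sum to 185/200, so the headline, the per-tier lines, and both deltas are mutually consistent.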
Audit-discipline trail across the v22-v29 chain — every baseline
re-confirmed stored ≡ true ≡ 0 mismatches post-patch via
scripts/audit_rescore.py:
v22: 177 v26: 181
v23: 178 v27: 183
v24: 179 v28: 184
v25: 180 v29: 185
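The patch-then-recount step behind this trail can be sketched as below. The baseline layout is an assumption: `match`, `match_note`, and `summary.matched` are named in this message, but the `results` list, the `patch_qid` helper, and the note text are hypothetical, and the real baselines hold 200 records rather than three.

```python
import copy


def patch_qid(baseline, qid, note):
    """Return a copy with the given qid flipped to match=False and matched recounted."""
    patched = copy.deepcopy(baseline)
    for rec in patched["results"]:
        if rec["qid"] == qid and rec["match"]:
            rec["match"] = False
            rec["match_note"] = note
    # Recompute from the records instead of decrementing, so stored == true.
    patched["summary"]["matched"] = sum(1 for r in patched["results"] if r["match"])
    return patched


# Toy baseline (hypothetical structure).
baseline = {
    "summary": {"matched": 3},
    "results": [
        {"qid": 517, "match": True},
        {"qid": 518, "match": True},
        {"qid": 519, "match": True},
    ],
}

patched = patch_qid(baseline, 518, "pred exec failed; empty-gold false positive")
repatched = patch_qid(patched, 518, "pred exec failed; empty-gold false positive")
print(patched["summary"]["matched"], repatched["summary"]["matched"])
```

Recounting from the records keeps the patch idempotent: applying it twice yields the same summary, which is the property the stored ≡ true check relies on.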
Surfaces synced: README hero + lift trace + ceiling block, Streamlit
EN+RU research_value + +pp deltas + Arcwise number, SESSION_HANDOFF
EOD-3 tl;dr, NEXT_SESSION current-state block, .deploy_hf.py HF
README short_description, docs/ui-live-{en,ru}.png re-captured from
freshly-deployed HF Space showing 92.5%.
E2E verified per feedback_deploy_e2e_gate: Playwright headless on
https://liovina-nl-sql.hf.space confirms '92.5%' (EN) / '92,5%' (RU)
rendered after redeploy.
KM was unavailable for this review (kimi normalization_error per
reference_kimi_codex_auth_fragile). CX-only review (advisory per
feedback_cx_review_status_fragile), but I cross-confirmed by
re-executing pred against the live DB — that's stronger than KM
overlap. CX [P2] verdict stands.
Gates: 333 pytest pass (+3 regression tests), ruff src/tests/scripts/
app clean, mypy --strict src clean.
Portfolio framing: catching our own false positive via CC-CX-KM and
honestly reporting the correction is a better audit-discipline story
than the inflated number was. Headline drops 0.5pp; honesty bumps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent fbb9e24 commit 85fe388
19 files changed
Lines changed: 216 additions & 130 deletions
File tree
- app
- docs
- eval/reports
- 2026-05-23
- 2026-05-24
- scripts
- src/nl_sql/eval/metrics
- tests/eval