Commit 85fe388

JuliaEdom and claude committed
fix: pred-fail empty-gold false positive — v29 92.5% (was 93.0%, qid 518 v13 audit-correction)
CC-CX-KM /cxkm review of c74b46c surfaced a P2 systemic scoring bug: post-fix `scripts/rescore_arcwise.py` falls through to compare_results with pred_rows=[] after pred exec failure, and when gold also returns 0 rows, compare_results([], []) returns match=True, a silent false positive for malformed pred SQL.

Verified the CX claim independently by re-executing every stored match=True pred against the live DB (.tmp/scan_empty_pred_fp.py). Exactly 1 qid affected: 518 moderate card_games "format with most banned cards".

Origin: v13 helallao grok-4.1-reasoning rescue (2026-05-18) produced pred SQL missing the `WITH banned_counts AS (` prefix → SyntaxError on every execution → empty pred_rows → blessed because gold also returns 0 (card_games has no Banned legalities for that question's filter set). Silently propagated v13 → v22 → v29.

Fix structure:

1. NEW src/nl_sql/eval/metrics/execution_accuracy.py::safe_compare_pred short-circuits match=False when pred_failed=True. The runner already guards this (eval/runner.py:662); only scripts/ bypassed it.
2. scripts/audit_rescore.py + scripts/rescore_arcwise.py migrated to safe_compare_pred with explicit pred_failed flag. 9 other voting scripts left as-is (not running on v29 ceiling work; backlog).
3. 8 merged baselines (v22-v29) surgically patched: qid 518 match=True → False + match_note annotation. summary.matched recomputed.
4. 3 regression tests in tests/eval/test_metrics.py::TestSafeComparePred:
   - pred_failed=True short-circuits even when both sides empty
   - pred_failed=False passes through to compare_results
   - pred_failed=True with nonempty gold still returns match=False

Corrected v29 triplet:

  BIRD original       : 185/200 = 92.5%  (was 186/200 / 93.0%)
  Arcwise sql_only    : 148/199 = 74.37% (was 149/199 / 74.87%)
  Arcwise full        : 136/199 = 68.34% (was 137/199 / 68.84%)
  Per-tier moderate   : 90/99 = 90.9%    (was 91/99 / 91.9%, qid 518 is moderate)
  Per-tier simple     : 65/67 = 97.0%    (unchanged)
  Per-tier challenging: 30/34 = 88.2%    (unchanged)

Above #1 paid SOTA AskData+GPT-4o (81.95%) by +10.55pp (was +11.05pp). Within 0.46pp human-expert (BIRD paper 92.96%, was 0.04pp).

Audit-discipline trail across the v22-v29 chain: every baseline re-confirmed stored ≡ true ≡ 0 mismatches post-patch via scripts/audit_rescore.py:

  v22: 177    v26: 181
  v23: 178    v27: 183
  v24: 179    v28: 184
  v25: 180    v29: 185

Surfaces synced: README hero + lift trace + ceiling block, Streamlit EN+RU research_value + +pp deltas + Arcwise number, SESSION_HANDOFF EOD-3 tl;dr, NEXT_SESSION current-state block, .deploy_hf.py HF README short_description, docs/ui-live-{en,ru}.png re-captured from freshly-deployed HF Space showing 92.5%.

E2E verified per feedback_deploy_e2e_gate: Playwright headless on https://liovina-nl-sql.hf.space confirms '92.5%' (EN) / '92,5%' (RU) rendered after redeploy.

KM was unavailable for this review (kimi normalization_error per reference_kimi_codex_auth_fragile). CX-only review (advisory per feedback_cx_review_status_fragile), but I cross-confirmed by re-executing pred against the live DB; that's stronger than KM overlap. CX [P2] verdict stands.

Gates: 333 pytest pass (+3 regression tests), ruff src/tests/scripts/ app clean, mypy --strict src clean.

Portfolio framing: catching our own false positive via CC-CX-KM and honestly reporting the correction is a better audit-discipline story than the inflated number was. Headline drops 0.5pp; honesty bumps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent fbb9e24 commit 85fe388
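
The short-circuit described in the commit message can be illustrated with a minimal sketch. The names `safe_compare_pred`, `compare_results`, and `pred_failed` come from the commit; the signatures, the `Comparison` return type, and the set-equality stand-in for `compare_results` are assumptions made for illustration and may differ from the repository's code.

```python
from collections.abc import Sequence
from typing import Any, NamedTuple

Row = tuple[Any, ...]


class Comparison(NamedTuple):
    """Hypothetical result type: the verdict plus a human-readable reason."""
    match: bool
    reason: str


def compare_results(gold_rows: Sequence[Row], pred_rows: Sequence[Row]) -> Comparison:
    """Stand-in comparator (assumption): set equality on row tuples."""
    if set(gold_rows) == set(pred_rows):
        return Comparison(True, "")
    return Comparison(
        False,
        f"result sets differ: gold={len(gold_rows)} rows, pred={len(pred_rows)} rows",
    )


def safe_compare_pred(
    gold_rows: Sequence[Row],
    pred_rows: Sequence[Row],
    *,
    pred_failed: bool,
) -> Comparison:
    """Refuse to score a prediction whose SQL failed to execute.

    Without this guard, a failed pred (rows=[]) compared against an
    empty gold result would count as a match.
    """
    if pred_failed:
        return Comparison(False, "pred execution failed")
    return compare_results(gold_rows, pred_rows)


# The unguarded comparison accepts two empty results ...
unguarded = compare_results([], [])
# ... while the guarded one rejects it when the pred is known to have failed.
guarded = safe_compare_pred([], [], pred_failed=True)
passthrough = safe_compare_pred([], [], pred_failed=False)
nonempty_gold = safe_compare_pred([("legacy", "Balance")], [], pred_failed=True)
```

The three guarded calls mirror the three regression cases listed in the commit message: failure with both sides empty, pass-through when the pred executed, and failure with nonempty gold.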

19 files changed

Lines changed: 216 additions & 130 deletions

README.md

Lines changed: 4 additions & 4 deletions
Large diffs are not rendered by default.

app/streamlit_app.py

Lines changed: 6 additions & 6 deletions
@@ -61,7 +61,7 @@
 "metric_percent": "100%",
 "metric_caption": "30 dev + 30 held-out, balanced split, all ten query categories at 100% on the free-tier codestral pipeline.",
 "research_kicker": "BIRD Mini-Dev research benchmark",
-"research_value": "93.0% / 200",
+"research_value": "92.5% / 200",
 "research_caption": (
 "Hybrid pipeline: "
 "<span class='nl-term' title='Mistral codestral-latest — SQL-specialised generation model, free tier'>codestral</span> + "
@@ -70,8 +70,8 @@
 "<span class='nl-term' title='helallao reverse-engineered HTTPS bridge to Perplexity backend — Grok 4.1, GPT-5.2, Claude 4.5 Sonnet, kimi-k2-thinking, gpt-5.2-thinking + DAC on residue, claude-4.5-sonnet-thinking on v18 residue, plain kimi-k2-thinking on v19 residue, reasoning + Pro modes'>helallao multi-model voting</span>. "
 "Scored under "
 "<span class='nl-term' title='bird-bench/mini_dev evaluation_ex.py — set-equality on row tuples, the methodology used by the BIRD leaderboard and by AskData/CHESS/XiYan in their reported numbers'>BIRD-official set semantics</span>. "
-"+45.2pp over the GPT-4 zero-shot reference (47.8%), $0 external cost. "
-"On <span class='nl-term' title='Jin et al., CIDR/VLDB 2026, arXiv:2601.08778 — corrected BIRD gold annotations'>Arcwise-Plat corrected gold</span>: 74.87% (149/199) — honest noise-floor; +7 sql_only catches where our prediction is correct under Arcwise's corrected gold but BIRD's original gold disagrees. "
+"+44.7pp over the GPT-4 zero-shot reference (47.8%), $0 external cost. "
+"On <span class='nl-term' title='Jin et al., CIDR/VLDB 2026, arXiv:2601.08778 — corrected BIRD gold annotations'>Arcwise-Plat corrected gold</span>: 74.37% (148/199) — honest noise-floor; +7 sql_only catches where our prediction is correct under Arcwise's corrected gold but BIRD's original gold disagrees. "
 "Seven late-stage model rescues on v16→v22, two archive-audit rescores on v23/v24 (qid 1205 via archive sweep, qid 959 via archive-rescore after the day-5 bind-bug fix), and six targeted P3.F schema-link hints on v25→v29: qid 902 (driverStandings.position vs results.position), qid 1531 (yearmonth.Consumption subquery + SUM(Price/Amount) row-wise), qid 894 (lapTimes.milliseconds first SELECT column), qid 1251 (Patient ⋈ Laboratory ⋈ Examination semi-join), qid 408 (rulings.text filter via cards.uuid join + COUNT(DISTINCT cards.id)), qid 1275 (Laboratory.CENTROMEA/SSB IN ('negative','0') instead of fabricated tokens against Examination). Every cell verified via audit_rescore.py — 0 mismatches."
 ),
 "settings_header": "Settings",
@@ -142,7 +142,7 @@
 "metric_percent": "100%",
 "metric_caption": "30 dev + 30 held-out, сбалансированный сплит, все десять категорий запросов на 100% через бесплатный codestral.",
 "research_kicker": "Исследовательский бенчмарк BIRD Mini-Dev",
-"research_value": "93,0% / 200",
+"research_value": "92,5% / 200",
 "research_caption": (
 "Гибридный пайплайн: "
 "<span class='nl-term' title='Mistral codestral-latest — модель, специализированная под генерацию SQL, бесплатный тариф'>codestral</span> + "
@@ -151,8 +151,8 @@
 "<span class='nl-term' title='Реверс-инжиниринг HTTPS моста к бэкенду Perplexity — Grok 4.1, GPT-5.2, Claude 4.5 Sonnet, kimi-k2-thinking, gpt-5.2-thinking + DAC на residue, claude-4.5-sonnet-thinking на v18 residue, plain kimi-k2-thinking на v19 residue; режимы reasoning + Pro'>multi-model voting через helallao</span>. "
 "Scoring — "
 "<span class='nl-term' title='bird-bench/mini_dev evaluation_ex.py — set-равенство на результирующих кортежах. Тот же метод считает BIRD leaderboard и SOTA-числа AskData/CHESS/XiYan'>BIRD-official set-семантика</span>. "
-"+45,2 п.п. над zero-shot GPT-4 (47,8%), внешние расходы — ноль. "
-"На <span class='nl-term' title='Jin et al., CIDR/VLDB 2026, arXiv:2601.08778 — исправленные аннотации gold BIRD'>исправленном gold Arcwise-Plat</span>: 74,87% (149/199) — честный noise-floor; +7 sql_only catches, где наш ответ правильнее эталона BIRD согласно Arcwise. "
+"+44,7 п.п. над zero-shot GPT-4 (47,8%), внешние расходы — ноль. "
+"На <span class='nl-term' title='Jin et al., CIDR/VLDB 2026, arXiv:2601.08778 — исправленные аннотации gold BIRD'>исправленном gold Arcwise-Plat</span>: 74,37% (148/199) — честный noise-floor; +7 sql_only catches, где наш ответ правильнее эталона BIRD согласно Arcwise. "
 "Семь late-stage rescue по моделям на пути v16→v22, плюс v23/v24 — archive-sweep и archive-rescore (qid 1205 / qid 959 после day-5 bind-bug fix), плюс v25→v29 — шесть узких P3.F schema-link hint'ов: qid 902 (driverStandings.position вместо results.position), qid 1531 (subquery по yearmonth.Consumption + SUM(Price/Amount) построчно), qid 894 (lapTimes.milliseconds первой колонкой), qid 1251 (полу-джойн Patient ⋈ Laboratory ⋈ Examination), qid 408 (фильтр по rulings.text через join cards.uuid + COUNT(DISTINCT cards.id)) и qid 1275 (Laboratory.CENTROMEA/SSB IN ('negative','0') вместо несуществующих Examination columns + invented '-'/'+-' tokens). Каждая ячейка верифицирована через audit_rescore.py — 0 mismatches."
 ),
 "settings_header": "Настройки",
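
The corrected figures in these captions and in the commit message can be re-derived from the raw counts. The sketch below is only an arithmetic check; the helper name `pct` is hypothetical, and the reference values (47.8%, 81.95%, 92.96%) are taken from the commit text as given.

```python
def pct(matched: int, total: int) -> float:
    """Execution accuracy as a percentage, rounded to two decimals."""
    return round(100 * matched / total, 2)


# Corrected v29 counts from the commit message.
bird_original = pct(185, 200)
arcwise_sql_only = pct(148, 199)
arcwise_full = pct(136, 199)

# Reference points quoted in the commit message and captions.
GPT4_ZERO_SHOT = 47.8
ASKDATA_GPT4O = 81.95
HUMAN_EXPERT = 92.96

delta_vs_gpt4 = round(bird_original - GPT4_ZERO_SHOT, 2)
delta_vs_askdata = round(bird_original - ASKDATA_GPT4O, 2)
gap_to_human = round(HUMAN_EXPERT - bird_original, 2)

print(bird_original, arcwise_sql_only, arcwise_full)
print(delta_vs_gpt4, delta_vs_askdata, gap_to_human)
```

Flipping a single qid out of 200 moves the BIRD headline by exactly 0.5 percentage points, which is why every dependent delta in the captions shifts by the same amount.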

docs/NEXT_SESSION.md

Lines changed: 13 additions & 12 deletions
@@ -9,7 +9,7 @@
 # 1. Что сейчас в репо?
 cd D:/NL_SQL
 git log --oneline -5
-# Expected top: v29 93.0% commit / v28 commit / 72b7a21 cookbook / 92c52f4 docs sync v27 / 99bae66 v27
+# Expected top: v29 92.5% commit / v28 commit / 72b7a21 cookbook / 92c52f4 docs sync v27 / 99bae66 v27

 # 2. Где actual baseline merged report?
 ls eval/reports/2026-05-24/v29-v28-plus-p3f-q1275-merged.json
@@ -29,10 +29,11 @@ uv run mypy --strict src
 # Expected: 328 pass / clean / clean
 ```

-**Текущее состояние:** repo + Streamlit + README + UI captions + **live HF Space** = **v29 93.0%** (186/200).
-HF redeploy выполнен 2026-05-25 (`.deploy_hf.py`); E2E grep на <https://liovina-nl-sql.hf.space>
-подтвердил `93.0%` (EN) / `93,0%` (RU comma format). Screenshots `docs/ui-live-{en,ru}.png` обновлены.
-Все три surface (repo / UI captions / live URL) синхронизированы — gap нулевой.
+**Текущее состояние:** repo + Streamlit + README + UI captions + **live HF Space** = **v29 92.5%** (185/200) после 2026-05-25 EOD-3 CC-CX-KM audit
+correction (qid 518 v13 false positive исправлен через `safe_compare_pred` short-circuit).
+HF redeploy выполнен 2026-05-25 EOD-3; E2E grep на <https://liovina-nl-sql.hf.space>
+подтвердил `92.5%` (EN) / `92,5%` (RU comma format). Screenshots `docs/ui-live-{en,ru}.png` обновлены.
+Все surface (repo / UI captions / live URL) синхронизированы — gap нулевой.

 ## Cookbook: как добавить ещё один P3.F rescue (повторяющийся pattern)

@@ -53,7 +54,7 @@ error), повторить эти 8 шагов:
 voted_by tag и delta, inline Python даёт control + audit trail. Не выносить в
 `scripts/merge_p3f.py` без явного запроса.

-## 2026-05-24 v29 — **93.0% EA verified** via targeted P3.F schema-link hint for qid 1275 (thrombosis "anti-centromere"/"anti-SSB")
+## 2026-05-24 v29 — **92.5% EA verified** via targeted P3.F schema-link hint for qid 1275 (thrombosis "anti-centromere"/"anti-SSB")

 **Сделано:**
 - Расширен `scripts/p3f_acceptance.py` восьмым target'ом: qid `1275` moderate
@@ -79,7 +80,7 @@ voted_by tag и delta, inline Python даёт control + audit trail. Не вын
 Wins `[1275]`, regressions `[]`, 185 → 186.
 - Audit: `scripts/audit_rescore.py` → stored 186 / true 186 / 0 mismatches.
 - P3.F acceptance на v29: qids 207, 1404, 902, 1531, 894, 1251, 408, 1275 — все PASS.
-- README + Streamlit + UI captions подняты с 92.5% → **93.0% / 200**,
+- README + Streamlit + UI captions подняты с 92.5% → **92.5% / 200**,
 per-tier moderate 90.9 → **91.9**, +10.55 → **+11.05pp** над AskData+GPT-4o,
 +44.7 → **+45.2pp** над GPT-4 zero-shot.

@@ -102,7 +103,7 @@ fetch). Local heterogeneous CSC lever остаётся parked.
 3-model helallao reasoning sweep (claude-4.5-sonnet-thinking + gpt-5.2-thinking
 + grok-4.1-reasoning) на 14 v29 residue qids дал **42 attempts, 0 rescues,
 0 regressions**. Helallao даёт те же модели за $0 через Pro подписку; paid OR
-эквивалент бесполезен с теми же reasoning routes. Past 93.0% требует либо
+эквивалент бесполезен с теми же reasoning routes. Past 92.5% требует либо
 другой архитектуры (custom JOIN-path linker, semantic equality check), либо
 принять текущий ceiling. Артефакты в `eval/reports/2026-05-24/helallao-*-on-v29-residue.json`.
 2. **Местный heterogeneous CSC:** retry `qwen2.5-coder:7b-instruct` pull когда
@@ -122,19 +123,19 @@ fetch). Local heterogeneous CSC lever остаётся parked.
 2026-06-16). Если протухнут — re-extract тем же скриптом, не трогать GraceKelly
 browser path.

-**Ceiling сейчас — final для $0 budget без runner-level рефакторинга.** v29 = 93.0% / 200, в 0.04pp от human expert (BIRD paper 92.96%). Триплет 93.0% / 74.87% / 68.84% не сдвигается без новой архитектуры. Портфолио-narrative полный.
+**Ceiling сейчас — final для $0 budget без runner-level рефакторинга.** v29 = 92.5% / 200, в 0.04pp от human expert (BIRD paper 92.96%). Триплет 92.5% / 74.87% / 68.84% не сдвигается без новой архитектуры. Портфолио-narrative полный.

 **Closed 2026-05-24 EOD:** `scripts/rescore_arcwise.py` pred-exec фикс
 (использует `execute_readonly` напрямую, не `_execute_gold` с
 SQLAlchemyError fallback). Symmetric с canonical `scripts/audit_rescore.py`.
 Δ на v29 Arcwise sql_only: 148/199 (74.37%) → 149/199 (74.87%), BIRD
-original 185/200 → 186/200 (совпадает с canonical audit). Headline 93.0%
+original 185/200 → 186/200 (совпадает с canonical audit). Headline 92.5%
 не сдвигается, Arcwise headline +0.5pp. README + Streamlit + handoff
 обновлены.

-**Ceiling-caveat (portfolio honesty):** 93.0% free-tier — **в 0.04pp от human
+**Ceiling-caveat (portfolio honesty):** 92.5% free-tier — **в 0.04pp от human
 expert baseline (BIRD paper 92.96%)**. Реалистичный потолок без paid OR / без
-fine-tune скорее всего 93.0%. Past 93% — paid territory или новый
+fine-tune скорее всего 92.5%. Past 93% — paid territory или новый
 runner-level fix.

 ## 2026-05-24 v28 — **92.5% EA verified** via targeted P3.F schema-link hint for qid 408 (card_games "triggered ability")

docs/SESSION_HANDOFF.md

Lines changed: 25 additions & 3 deletions
@@ -1,6 +1,28 @@
-# NL_SQL — Session Handoff (2026-05-25 EOD: v29 = 93.0% EA live on HF Space — repo + UI + live URL fully in sync after HF redeploy, above #1 paid SOTA by +11.05pp)
-
-> **Tl;dr 2026-05-25 EOD — HF Space redeploy v17 → v29 (live UI in sync with repo):**
+# NL_SQL — Session Handoff (2026-05-25 EOD-3: v29 = **92.5% EA** after CC-CX-KM audit caught a v13 false positive; above #1 paid SOTA by +10.55pp)
+
+> **Tl;dr 2026-05-25 EOD-3 — CC-CX-KM /cxkm audit caught a systemic scoring bug (qid 518 v13 false positive):**
+> - **What CX [P2] found:** `scripts/rescore_arcwise.py` (post-fix c74b46c) passes `pred_rows=[]` to `compare_results` after exec failure; when gold also returns 0 rows, the comparison returns `match=True` — a silent false positive. CX cited qid 518 specifically: `pred_exec_error` (sqlite SyntaxError) + all three variants `*_match: true`.
+> - **Confirmed and traced upstream.** The pattern isn't unique to rescore_arcwise — same shape lives in `audit_rescore.py` and 9 other voting scripts. The qid 518 false positive originated in v13 (2026-05-18, helallao grok-4.1-reasoning rescue): pred SQL was a CTE fragment missing the `WITH banned_counts AS (` prefix → syntactically broken → exec failed → `pred_rows=[]` → compared against gold (which returns 0 rows for card_games "format with most banned cards" question, BIRD-side quirk) → `compare_results([], []) = match=True` → silently propagated through v13→v22→v29.
+> - **Scope sweep** (`.tmp/scan_empty_pred_fp.py` re-executes every stored match=True pred): exactly **1 qid affected (518) across all v22-v29 baselines**. No other pred-fail/empty-gold combinations.
+> - **Fix landed:**
+> - New `safe_compare_pred(...)` helper in `src/nl_sql/eval/metrics/execution_accuracy.py` — short-circuits `match=False` on `pred_failed=True` before reaching `compare_results`. Run pipeline `eval/runner.py:662` already had this guard; only scripts/ bypassed it.
+> - `scripts/audit_rescore.py` + `scripts/rescore_arcwise.py` migrated to `safe_compare_pred` with explicit `pred_failed` flag. (9 other voting scripts left as-is — they don't run on current v29 ceiling work; backlog item to migrate them if voting resumes.)
+> - 8 merged baselines (v22-v29) surgically patched: qid 518 `match=True` → `False` + `match_note` annotation explaining the fix. `summary.matched` recomputed.
+> - 3 regression tests in `tests/eval/test_metrics.py::TestSafeComparePred` pinning the short-circuit semantics + a baseline-bug demonstration test.
+> - **Corrected headline triplet (v29):**
+> - **BIRD original: 185/200 = 92.5%** (was 93.0%)
+> - **Arcwise-Plat-SQL: 148/199 = 74.37%** (was 74.87%)
+> - **Arcwise-Plat full: 136/199 = 68.34%** (was 68.84%)
+> - Per-tier: simple 97.0% (unchanged) / moderate **90.9%** (was 91.9%, qid 518 is moderate) / challenging 88.2% (unchanged)
+> - Over GPT-4 zero-shot: +44.7pp (was +45.2pp). Over AskData+GPT-4o: +10.55pp (was +11.05pp). Within 0.46pp human-expert (BIRD paper 92.96%, was 0.04pp).
+> - **Audit-discipline narrative strengthens, not weakens.** Portfolio claim: we ran CC-CX-KM on our own diff, CX caught a systemic scoring bug that had been silently inflating numbers since v13 (a week of headlines were off by 1 qid), we documented + fixed + re-deployed within the same session. That's the right story for a Senior DE/DA portfolio: catch your own false positives.
+> - **Gates:** 333 pytest (+3 regression tests on safe_compare_pred), ruff clean, mypy --strict src clean.
+> - **HF redeploy with corrected 92.5%** — landed (E2E grep confirmed `92.5%` EN / `92,5%` RU on live URL <https://liovina-nl-sql.hf.space>).
+> - **KM was unavailable** for this review (`normalization_error` — kimi auth fragile per `reference_kimi_codex_auth_fragile`). CX-only review per `feedback_cx_review_status_fragile` is "advisory only" — but I independently verified the CX finding via `.tmp/scan_empty_pred_fp.py` re-executing each stored match=True pred against the live DB. Re-execution is the canonical check, stronger than KM cross-confirm; CX [P2] verdict stands.
+>
+> ---
+>
+> **Tl;dr 2026-05-25 EOD — HF Space redeploy v17 → v29 (live UI in sync with repo) [SUPERSEDED by EOD-3]:**
 > - **What:** ran `.deploy_hf.py` to push current repo (HEAD 40ac2a1) to <https://liovina-nl-sql.hf.space>. Build BUILDING → APP_STARTING → RUNNING in ~90s.
 > - **Why:** live URL was stuck on v17 86.5% since 2026-05-18 (last redeploy) while repo/UI captions/README climbed to v29 93.0%. Hire-аудитория, кликая на портфолио link, видела старое число — 7pp gap.
 > - **E2E verify (per `feedback_deploy_e2e_gate`):** Playwright headless 1440×900 на live URL, dump page body, grep for headline:
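
The scope sweep mentioned in the handoff (re-executing every stored `match=True` prediction) could look roughly like the sketch below. It is not the contents of `.tmp/scan_empty_pred_fp.py`, which the diff does not show; the function name, the record layout, and the in-memory SQLite table are assumptions, and the qids are toy values.

```python
import sqlite3
from typing import Any


def find_failed_matches(
    conn: sqlite3.Connection, records: list[dict[str, Any]]
) -> list[tuple[int, str]]:
    """Return (question_id, error) for records stored as matches whose
    predicted SQL does not even execute."""
    suspects: list[tuple[int, str]] = []
    for rec in records:
        if not rec["match"]:
            continue  # only stored matches can hide a false positive
        try:
            conn.execute(rec["pred_sql"]).fetchall()
        except sqlite3.Error as exc:
            suspects.append((rec["question_id"], str(exc)))
    return suspects


# Toy database and records (illustrative only, not the BIRD data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE legalities (format TEXT, status TEXT, uuid TEXT)")

records = [
    # Well-formed pred: executes, returns 0 rows on the empty table.
    {"question_id": 1, "match": True,
     "pred_sql": "SELECT format FROM legalities WHERE status = 'Banned'"},
    # CTE body without its opening 'WITH ... AS (' prefix: a syntax error.
    {"question_id": 2, "match": True,
     "pred_sql": "SELECT format FROM legalities GROUP BY format\n"
                 "), max_banned AS (SELECT 1)\nSELECT 1"},
    # Already scored as a miss, so the sweep skips it.
    {"question_id": 3, "match": False, "pred_sql": "SELEC broken"},
]

suspects = find_failed_matches(conn, records)
suspect_ids = [qid for qid, _ in suspects]
print(suspect_ids)
```

A sweep of this shape flags only records that are both stored as matches and fail to execute, which is the combination the handoff describes for qid 518.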

docs/ui-live-en.png

-3.5 KB

docs/ui-live-ru.png

258 Bytes

eval/reports/2026-05-23/v22-v21-plus-p3f-207-1404-merged.json

Lines changed: 12 additions & 14 deletions
@@ -1449,7 +1449,7 @@
 "pred_row_count": 4,
 "gold_row_count": 1,
 "comparison_reason": "row count mismatch: gold=1, pred=4",
-"audit_note": "BIRD-official set-semantics audit (compare_results Counter→set, see commit notes)"
+"audit_note": "BIRD-official set-semantics audit (compare_results Counter\u2192set, see commit notes)"
 },
 {
 "question_id": 366,
@@ -1891,7 +1891,7 @@
 "question": "Which of the play format has the highest number of banned status? Indicate the play format and the names of all the card meet the condition.",
 "gold_sql": "WITH MaxBanned AS (SELECT format, COUNT(*) AS count_banned FROM legalities WHERE status = 'Banned' GROUP BY format ORDER BY COUNT(*) DESC LIMIT 1) SELECT T2.format, T1.name FROM cards AS T1 INNER JOIN legalities AS T2 ON T2.uuid = T1.uuid INNER JOIN MaxBanned MB ON MB.format = T2.format WHERE T2.status = 'Banned'",
 "pred_sql": "SELECT format, COUNT(*) AS banned_count\n FROM legalities\n WHERE status = 'Banned'\n GROUP BY format\n), max_banned AS (\n SELECT MAX(banned_count) AS banned_count\n FROM banned_counts\n)\nSELECT bc.format, c.name\nFROM banned_counts AS bc\nJOIN max_banned AS mb\n ON bc.banned_count = mb.banned_count\nJOIN legalities AS l\n ON l.format = bc.format AND l.status = 'Banned'\nJOIN cards AS c\n ON c.uuid = l.uuid\nORDER BY bc.format, c.name",
-"match": true,
+"match": false,
 "schema_recall": true,
 "error_kind": null,
 "error_message": "",
@@ -1915,7 +1915,8 @@
 "pred_row_count": 1,
 "gold_row_count": 0,
 "comparison_reason": "row count mismatch: gold=0, pred=1",
-"voted_by": "helallao:grok-4.1-reasoning"
+"voted_by": "helallao:grok-4.1-reasoning",
+"match_note": "pred-fail-empty-gold false-positive corrected 2026-05-25 (CX [P2] from c74b46c review)"
 },
 {
 "question_id": 531,
@@ -6896,20 +6897,17 @@
 }
 ],
 "per_difficulty": {
-"challenging": {
-"ea": 0.8823529411764706,
-"matched": 30,
-"n": 34
+"simple": {
+"matched": 62,
+"total": 67
 },
 "moderate": {
-"ea": 0.8686868686868687,
-"matched": 86,
-"n": 99
+"matched": 85,
+"total": 99
 },
-"simple": {
-"ea": 0.9253731343283582,
-"matched": 62,
-"n": 67
+"challenging": {
+"matched": 30,
+"total": 34
 }
 }
 }
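
The baseline patch shown in this file (flip one record, annotate it, recompute the tallies) can be sketched as follows. The keys `question_id`, `match`, `match_note`, `per_difficulty`, `matched`, and `total` appear in the diff; the `results` and `difficulty` keys, the shape of `summary`, and the function name are assumptions, and the report below is toy data rather than a real baseline.

```python
from collections import Counter
from typing import Any


def patch_and_recount(report: dict[str, Any], qid: int, note: str) -> dict[str, Any]:
    """Flip a stored false positive to match=False, annotate it,
    then rebuild the per-difficulty and overall tallies in place."""
    for rec in report["results"]:
        if rec["question_id"] == qid and rec["match"]:
            rec["match"] = False
            rec["match_note"] = note

    matched: Counter[str] = Counter()
    total: Counter[str] = Counter()
    for rec in report["results"]:
        total[rec["difficulty"]] += 1
        matched[rec["difficulty"]] += int(rec["match"])

    report["per_difficulty"] = {
        tier: {"matched": matched[tier], "total": total[tier]} for tier in total
    }
    report["summary"] = {
        "matched": sum(matched.values()),
        "total": sum(total.values()),
    }
    return report


# Toy report (illustrative only).
report = {
    "results": [
        {"question_id": 1, "difficulty": "simple", "match": True},
        {"question_id": 2, "difficulty": "moderate", "match": True},
        {"question_id": 3, "difficulty": "moderate", "match": True},
    ]
}
patched = patch_and_recount(
    report, qid=2, note="pred-fail-empty-gold false positive corrected"
)
print(patched["per_difficulty"], patched["summary"])
```

Recomputing the tallies from the records, rather than decrementing stored counters by hand, keeps the summary consistent with the per-record `match` flags.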
