Commit 92c52f4

JuliaEdom and claude committed
docs: sync eval methodology + next-session plan to v27
- docs/03_eval_methodology.md: shipped-config example bumped from v8 (79.0% n=200, 2026-05-17) to v27 (92.0% n=200, 2026-05-24). Adds the 19 missing lift-trace rows (v9-v27) and the audit + p3f_acceptance gates. Configuration string updated to reflect the full hybrid stack (G + multi-vote + critique + selfcon + Sonnet bridge + selective fewshot + cross-Groq + M-Schema + DAC + helallao Pro/reasoning + GraceKelly + archive + targeted P3.F schema-link hints).
- docs/NEXT_SESSION.md: priority block reorganised. New per-qid classification table for the 16 v27 misses splits the residue into "clean P3.F candidates" (qids 408, 595, 694, 1168 worth one more free-tier hint sprint; qid 1275 already primed for paid OR voting) vs "query-shape / BIRD-annotation quirks" (10 qids that are ceiling friction, not lever-fixable). Ceiling caveat added: realistic free-tier ceiling without fine-tune is 93-94%; past that is paid territory.

No code changes; docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 99bae66 commit 92c52f4

2 files changed

Lines changed: 80 additions & 27 deletions

docs/03_eval_methodology.md

Lines changed: 37 additions & 12 deletions
@@ -96,24 +96,30 @@

 ### 4.2 What is reported for each configuration

-Template with real numbers for the final shipped configuration (G + multi-vote + critique + selfcon + Sonnet bridge + selective fewshot expansion + cross-Groq voting, n=200, seed=0, report 2026-05-17 night v8):
+Template with real numbers for the final shipped configuration (G + multi-vote + critique + selfcon + Sonnet bridge + selective fewshot expansion + cross-Groq voting + M-Schema + CHASE-SQL DAC + helallao Perplexity Pro/reasoning multi-model voting + GraceKelly browser-orchestrator + targeted P3.F schema-link hints + archive-sweep / archive-rescore audit; n=200, seed=0, v27 2026-05-24):

 ```
-Configuration G_hybrid+multi-vote+critique+selfcon+sonnet+fewshot5+groq3 (final shipped path)
-EA (overall): 79.0% (158/200, +31.2pp vs GPT-4 zero-shot 47.8%)
-EA (simple): 91.0% (61/67)
-EA (moderate): 75.8% (75/99)
-EA (challenging): 64.7% (22/34)
-EA (SQLite only): 79.0% (BIRD Mini-Dev is SQLite-only)
-Voting rescues: 44/200 (frozen-fail directed retry across vote buckets)
+Configuration G_hybrid+multi-vote+critique+selfcon+sonnet+fewshot5+groq3+
+mschema+dac+helallao-pro+helallao-reasoning+gracekelly+
+archive+p3f-targeted-hints (final shipped path)
+EA (overall): 92.0% (184/200, +44.2pp vs GPT-4 zero-shot 47.8%)
+EA (simple): 97.0% (65/67)
+EA (moderate): 89.9% (89/99)
+EA (challenging): 88.2% (30/34)
+EA (SQLite only): 92.0% (BIRD Mini-Dev is SQLite-only)
+Voting + targeted rescues: 70/200 (frozen-fail directed retry across vote
+buckets + 4 P3.F schema-link hints)
 Schema Recall@5: 100.0%
 SQL Validity Rate: 100.0%
-First-pass / Final EA: 47.0 / 79.0 (codestral A baseline → final)
+First-pass / Final EA: 47.0 / 92.0 (codestral A baseline → final)
 Latency P50 / P95: ~65 ms cache-hit / dozens of seconds on Sonnet-rescued tier
 Cost per query: $0 (Mistral free + Groq free + Perplexity Pro browser bridge)
+Audit: scripts/audit_rescore.py → stored 184 / true 184 / 0 mismatches
+P3.F acceptance: scripts/p3f_acceptance.py --require-pass → qids 207, 1404,
+902, 1531, 894, 1251 all PASS
 ```
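The v27 headline figures in the report block can be cross-checked from the per-tier counts alone. The following is a minimal Python sketch that assumes only the numbers quoted in the block; the names `TIERS`, `ea`, and `summarize` are illustrative and are not part of the repository.

```python
# Per-tier (correct, total) pairs copied from the v27 report block.
TIERS = {
    "simple": (65, 67),
    "moderate": (89, 99),
    "challenging": (30, 34),
}
GPT4_ZERO_SHOT_EA = 47.8  # baseline percentage quoted in the report


def ea(correct: int, total: int) -> float:
    """Execution accuracy as a percentage, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


def summarize(tiers: dict) -> dict:
    """Aggregate per-tier counts into the overall EA and the lift over the baseline."""
    correct = sum(c for c, _ in tiers.values())
    total = sum(t for _, t in tiers.values())
    overall = ea(correct, total)
    return {
        "correct": correct,
        "total": total,
        "overall": overall,
        "per_tier": {name: ea(c, t) for name, (c, t) in tiers.items()},
        "lift_pp": round(overall - GPT4_ZERO_SHOT_EA, 1),
    }


if __name__ == "__main__":
    print(summarize(TIERS))
```

Under these assumptions the tiers sum to 184/200, which matches both the 92.0% headline and the +44.2pp lift stated in the block.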

-Per-bucket lifts that compose the 79.0% headline:
+Per-bucket lifts that compose the 92.0% headline:

 ```
 A (codestral full_schema) 47.0% baseline
@@ -127,8 +133,27 @@ G + Sonnet challenging tier hybrid 57.0% +0.5pp
 + grounded-critique directed retry 72.0% +6.5pp
 + Mistral self-consistency 72.5% +0.5pp
 + Sonnet rescue on frozen-fail tail 77.0% +4.5pp (9 rescues, 0 regressions)
-+ selective fewshot_top_k=5 on residue 77.5% +0.5pp (1 rescue / 0 regressions, qid=1500)
-+ cross-Groq voting on residue (llama3.3-70b+qwen3) 79.0% +1.5pp (3 rescues / 0 regressions, qids 219+352+366)
++ selective fewshot_top_k=5 on residue 77.5% +0.5pp (qid 1500)
++ cross-Groq voting on residue 79.0% +1.5pp (qids 219+352+366)
++ gpt-oss-20b voting (v9) 80.0% +1.0pp (qids 571+1232)
++ M-Schema XiYan retry on residue (v10) 80.5% +0.5pp (qid 1525)
++ CHASE-SQL divide-and-conquer (v11) 81.0% +0.5pp (qid 1036)
++ helallao Perplexity Pro multi-model voting (v12) 82.0% +1.0pp (qids 672+988)
++ helallao reasoning-mode (grok+gpt-5.2) (v13) 84.0% +2.0pp (qids 407+518+866+1529)
++ kimi-k2-thinking reasoning on v13 residue (v14) 84.5% +0.5pp (qid 1235)
++ helallao Pro triplet retry on v14 residue (v15) 85.0% +0.5pp (qid 173)
++ DAC×reasoning combo on v15 residue (v16) 85.5% +0.5pp (qid 77)
++ post-cooldown gpt-5.2-thinking+DAC (v17) 86.0% +0.5pp (qid 896)
++ helallao gpt-5.2 Pro on v17 residue (v18) 86.5% +0.5pp (qid 989)
++ helallao claude-thinking on v18 residue (v19) 87.0% +0.5pp (qid 743)
++ helallao kimi plain on v19 residue (v20) 87.5% +0.5pp (qid 584)
++ GraceKelly Sonnet 4.6 BIRD-grain on qid 1399 (v21) 88.0% +0.5pp (qid 1399)
++ targeted P3.F schema-link merge (v22) 89.0% +1.0pp (qids 207+1404)
++ archive-sweep qid 1205 (v23) 89.5% +0.5pp (audit-discipline)
++ archive-rescore qid 959 after bind-bug fix (v24) 90.0% +0.5pp (engineering)
++ targeted P3.F hint qid 902 formula_1 (v25) 90.5% +0.5pp (driverStandings.position)
++ targeted P3.F hint qid 1531 debit_card (v26) 91.0% +0.5pp (yearmonth.Consumption)
++ targeted P3.F hints qids 894+1251 (v27) 92.0% +1.0pp (lapTimes.ms + Patient⋈Lab⋈Exam)
 ```
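Because n=200, every rescued question is worth exactly 0.5pp, so the visible part of the lift trace can be checked for internal consistency. The sketch below is an assumption-laden illustration, not repository code: `TRACE` and `check_trace` are hypothetical names, the rescue counts are read off the qid lists in each row, and the starting point is the 77.0% "Sonnet rescue" row because earlier rows are partly elided by the hunk boundary.

```python
N = 200          # evaluation set size stated in the report
START_EA = 77.0  # EA after the "Sonnet rescue on frozen-fail tail" row

# (label, reported cumulative EA in %, number of qids rescued in that step)
TRACE = [
    ("selective fewshot", 77.5, 1), ("cross-Groq voting", 79.0, 3),
    ("v9", 80.0, 2), ("v10", 80.5, 1), ("v11", 81.0, 1), ("v12", 82.0, 2),
    ("v13", 84.0, 4), ("v14", 84.5, 1), ("v15", 85.0, 1), ("v16", 85.5, 1),
    ("v17", 86.0, 1), ("v18", 86.5, 1), ("v19", 87.0, 1), ("v20", 87.5, 1),
    ("v21", 88.0, 1), ("v22", 89.0, 2), ("v23", 89.5, 1), ("v24", 90.0, 1),
    ("v25", 90.5, 1), ("v26", 91.0, 1), ("v27", 92.0, 2),
]


def check_trace(trace, start_ea=START_EA, n=N):
    """Replay the trace in whole questions and collect rows whose EA disagrees."""
    correct = round(start_ea * n / 100)  # 77.0% of 200 questions
    problems = []
    for label, reported_ea, rescued in trace:
        correct += rescued
        expected_ea = 100.0 * correct / n
        if abs(expected_ea - reported_ea) > 1e-9:
            problems.append((label, reported_ea, expected_ea))
    return correct, problems


if __name__ == "__main__":
    final_correct, problems = check_trace(TRACE)
    print(final_correct, problems)
```

Replaying the rows this way ends at 184 correct answers with no mismatching rows, which agrees with the 184/200 figure in the audit line.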

 **Selective fewshot expansion note:** a global `fewshot_top_k=5` (instead of

docs/NEXT_SESSION.md

Lines changed: 43 additions & 15 deletions
@@ -37,22 +37,50 @@
 per-tier simple 95.5 → **97.0**, moderate 88.9 → **89.9**,
 +9.05 → **+10.05pp** over AskData+GPT-4o, +43.2 → **+44.2pp** over GPT-4 zero-shot.

+**Per-qid classification of the 16 v27 misses** (done during the v26+v27 sprint; a new sprint does not need to redo it):
+
+| qid | tier | db | failure type | clean P3.F? | note |
+|---:|---|---|---|:---:|---|
+| 25 | moderate | california_schools | aggregation shape (AVG vs SUM/COUNT) | no | gold uses CAST(SUM)/COUNT >400, pred uses AVG >400 |
+| 37 | moderate | california_schools | column-order in tuple (Zip vs State swap) | no | gold (Street,City,State,Zip), pred (Street,City,Zip,State) |
+| 125 | challenging | financial | SELECT-shape quirk | **rolled back v26** | the hint fixes the JOIN, but BIRD gold still ≠ pred |
+| 349 | moderate | card_games | aggregation logic + tie-handling | no | gold filters isPromo=1 + COUNT max artist subquery |
+| 408 | moderate | card_games | aggregation (COUNT vs COUNT DISTINCT) | maybe | gold DISTINCT cards.id, pred COUNT(*); a hint may work |
+| 484 | moderate | card_games | LIMIT vs no-LIMIT | no | gold ORDER BY DESC (returns all 155), pred adds LIMIT 1 |
+| 595 | moderate | codebase_community | GROUP BY shape (1 vs 2 keys) | maybe | gold GROUP BY UserId HAVING COUNT(DISTINCT PostHistoryTypeId)=1 |
+| 694 | moderate | codebase_community | ORDER BY column choice (users vs comments CreationDate) | maybe | column-source error, candidate for a hint |
+| 930 | simple | formula_1 | rank vs LIMIT | no | gold WHERE rank=1 (returns 37), pred ORDER BY rank LIMIT 1 |
+| 1029 | moderate | european_football_2 | sort direction (ASC vs DESC) | no | BIRD gold quirk: "highest" → ASC |
+| 1094 | challenging | european_football_2 | percent-formula (SUM CASE vs MAX CASE) | no | division-by-zero risk + structural |
+| 1144 | simple | european_football_2 | tie-handling (LIMIT 1 vs WHERE=MAX) | no | BIRD gold LIMIT 1 quirk |
+| 1168 | challenging | thrombosis_prediction | extra SELECT column (Birthday) | maybe | gold has T2.Birthday as a third column |
+| 1247 | challenging | thrombosis_prediction | BIRD precedence bug | no | gold OR/AND without parentheses: annotation bug |
+| 1254 | moderate | thrombosis_prediction | date interpretation (strftime year vs raw) | no | "after 1990/1/1" ambiguous |
+| 1275 | moderate | thrombosis_prediction | value vocabulary ('-'/'+- ' vs 'negative'/'0') | **primed** | the hint pointed to the Lab table, but codestral keeps the wrong vocabulary without paid voting |
+
 **Next (priority):**
-1. Paid OpenRouter top-up ($5+): run **only** on the 16-qid v27 residue.
-qid 1275 thrombosis (anti-centromere/SSB) is a clean candidate; the hint in
-schema-link already points to the correct table.
-2. Scan the remaining 16 v27 misses for new P3.F-style targets.
-Of the 19 v25 misses, three are closed (qid 1531/894/1251); the remaining 16 are structural
-query-shape errors or BIRD gold annotation quirks (qid 25 averaging, qid 37
-sort-tiebreak, qid 125 SELECT-shape quirk, qid 349/408/484 card_games
-structure, qid 595 post-history GROUP BY, qid 694 ORDER BY column, qid 930
-Hamilton rank, qid 1029 sort direction, qid 1094 percent-formula, qid 1144
-tie-handling, qid 1168 SELECT extra column, qid 1247 BIRD precedence bug,
-qid 1254 date interpretation, qid 1275 value vocab).
-3. GraceKelly browser-orchestrator fix: cross-project (`D:/GraceKelly`).
-4. Local heterogeneous CSC: `qwen2.5-coder:7b-instruct` blocked R2.
-5. Do not build a generic FK linker.
-6. Do not run the helallao reasoning route on a single account back-to-back across models.
+1. **Paid OpenRouter top-up ($5+)** on the v27 residue, focused on the 5 "possibly clean" qids
+(408, 595, 694, 1168, 1275): claude-4.5-sonnet / gpt-5.2-thinking /
+grok-4.1-reasoning. qid 1275 is already primed (the schema-link hint points to Lab).
+Merge only `alt_match=True` + audit-rescore.
+2. **Try narrow hints for the 4 candidates without paid:** qids 408 / 595 /
+694 / 1168: same structure as v25/v26/v27 (column-source / SELECT-shape).
+Cost = Mistral free codestral only. Expected +0-2pp.
+3. **GraceKelly browser-orchestrator fix**: cross-project (`D:/GraceKelly`).
+4. **Local heterogeneous CSC:** `qwen2.5-coder:7b-instruct` blocked R2.
+5. **Do not build a generic FK linker** (v22 lesson: natural FK-looking path =
+wrong path under BIRD gold).
+6. **Do not run the helallao reasoning route** on a single account back-to-back across models
+(backend coalesces quota per account).
+7. **Do not try to fix query-shape / BIRD-annotation-quirk failures** (qids 25,
+37, 125, 349, 484, 930, 1029, 1094, 1144, 1247, 1254): hints either
+do not help or require wording that regresses other
+qids. These are ceiling friction, not fixable with a lever.
+
+**Ceiling caveat (portfolio honesty):** 92.0% free-tier is above all known
+SOTA on BIRD without fine-tuning. A realistic ceiling without paid OR / without
+fine-tune is around 93-94% (5 candidate qids + 1 primed). Human expert
+baseline 92.96%. Past 93% is paid territory.
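The ceiling estimate follows from simple counting over the 16-miss residue. The sketch below is only an illustration under stated assumptions: `CANDIDATES` and `QUIRKS` are hypothetical names for the two groups listed in priority items 1 and 7, and `ea_after` and `projection` are not repository functions.

```python
N = 200            # evaluation set size
CORRECT_V27 = 184  # correct answers at v27 (92.0%)

# Residue groups as listed in the priority block (assumed partition of the 16 misses).
CANDIDATES = {408, 595, 694, 1168, 1275}
QUIRKS = {25, 37, 125, 349, 484, 930, 1029, 1094, 1144, 1247, 1254}


def ea_after(extra_fixed: int, correct: int = CORRECT_V27, n: int = N) -> float:
    """EA in percent if `extra_fixed` additional questions become correct."""
    return 100.0 * (correct + extra_fixed) / n


def projection(candidates=CANDIDATES) -> dict:
    """Map 'number of candidate qids fixed' to the resulting EA."""
    return {k: ea_after(k) for k in range(len(candidates) + 1)}


if __name__ == "__main__":
    # The two groups should be disjoint and together cover all 16 misses.
    assert CANDIDATES.isdisjoint(QUIRKS)
    assert len(CANDIDATES | QUIRKS) == N - CORRECT_V27
    print(projection())
```

Under these assumptions, fixing two to four of the five candidates corresponds to the 93-94% range quoted in the caveat, and fixing all five would give 94.5%.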

## 2026-05-24 v26 — 91.0% EA verified via targeted P3.F schema-link hint for qid 1531
