docs: sync eval methodology + next-session plan to v27

JuliaEdom · claude · JuliaEdom · commit 92c52f4eec50 · 2026-05-24T19:15:28.000+03:00
- docs/03_eval_methodology.md: shipped-config example bumped from v8
  (79.0% n=200, 2026-05-17) to v27 (92.0% n=200, 2026-05-24). Adds the
  19 missing lift-trace rows (v9-v27) and the audit + p3f_acceptance
  gates. Configuration string updated to reflect the full hybrid stack
  (G + multi-vote + critique + selfcon + Sonnet bridge + selective
  fewshot + cross-Groq + M-Schema + DAC + helallao Pro/reasoning +
  GraceKelly + archive + targeted P3.F schema-link hints).

- docs/NEXT_SESSION.md: priority block reorganised. New per-qid
  classification table for the 16 v27 misses — splits residue into
  "clean P3.F candidates" (qids 408, 595, 694, 1168 worth one more
  free-tier hint sprint; qid 1275 already primed for paid OR voting)
  vs "query-shape / BIRD-annotation quirks" (11 qids that are ceiling
  friction, not lever-fixable). Ceiling caveat added: realistic
  free-tier ceiling without fine-tune is 93-94%; past that is paid
  territory.
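
The headline figures above can be re-derived from the per-tier counts that this commit writes into docs/03_eval_methodology.md and from the qid lists in docs/NEXT_SESSION.md. A minimal sketch, assuming only the numbers quoted in the diff below; the variable and function names are illustrative and are not part of the repository:

```python
# (correct, total) per difficulty tier, copied from the v27 report block.
tiers = {"simple": (65, 67), "moderate": (89, 99), "challenging": (30, 34)}
GPT4_ZERO_SHOT_EA = 47.8  # baseline quoted in the same report block


def pct(correct: int, total: int) -> float:
    """Execution accuracy in percent, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)


correct = sum(c for c, _ in tiers.values())
total = sum(t for _, t in tiers.values())
overall = pct(correct, total)
misses = total - correct
lift = round(overall - GPT4_ZERO_SHOT_EA, 1)

# Residue split as listed in the NEXT_SESSION.md hunk.
candidates = [408, 595, 694, 1168]  # worth one more free-tier hint sprint
primed = [1275]                     # waiting on paid voting
quirks = [25, 37, 125, 349, 484, 930, 1029, 1094, 1144, 1247, 1254]
residue = candidates + primed + quirks

for name, (c, t) in tiers.items():
    print(f"EA ({name}): {pct(c, t)}% ({c}/{t})")
print(f"EA (overall): {overall}% ({correct}/{total}), +{lift}pp vs GPT-4 zero-shot")
print(f"misses: {misses}, classified qids: {len(residue)}")
```

If the classified qid count and the miss count ever diverge, the classification table and the report block are out of sync.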

No code changes — docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/03_eval_methodology.md b/docs/03_eval_methodology.md
@@ -96,24 +96,30 @@
 
 ### 4.2 What is reported for each configuration
 
-Template with real numbers for the final shipped configuration (G + multi-vote + critique + selfcon + Sonnet bridge + selective fewshot expansion + cross-Groq voting, n=200, seed=0, report 2026-05-17 night v8):
+Template with real numbers for the final shipped configuration (G + multi-vote + critique + selfcon + Sonnet bridge + selective fewshot expansion + cross-Groq voting + M-Schema + CHASE-SQL DAC + helallao Perplexity Pro/reasoning multi-model voting + GraceKelly browser-orchestrator + targeted P3.F schema-link hints + archive-sweep / archive-rescore audit; n=200, seed=0, v27 2026-05-24):
 
 ```
-Configuration G_hybrid+multi-vote+critique+selfcon+sonnet+fewshot5+groq3  (final shipped path)
-  EA (overall):           79.0%   (158/200, +31.2pp vs GPT-4 zero-shot 47.8%)
-  EA (simple):            91.0%   (61/67)
-  EA (moderate):          75.8%   (75/99)
-  EA (challenging):       64.7%   (22/34)
-  EA (SQLite only):       79.0%   (BIRD Mini-Dev is SQLite-only)
-  Voting rescues:         44/200  (frozen-fail directed retry across vote buckets)
+Configuration G_hybrid+multi-vote+critique+selfcon+sonnet+fewshot5+groq3+
+              mschema+dac+helallao-pro+helallao-reasoning+gracekelly+
+              archive+p3f-targeted-hints  (final shipped path)
+  EA (overall):           92.0%   (184/200, +44.2pp vs GPT-4 zero-shot 47.8%)
+  EA (simple):            97.0%   (65/67)
+  EA (moderate):          89.9%   (89/99)
+  EA (challenging):       88.2%   (30/34)
+  EA (SQLite only):       92.0%   (BIRD Mini-Dev is SQLite-only)
+  Voting + targeted rescues: 70/200 (frozen-fail directed retry across vote
+                                     buckets + 4 P3.F schema-link hints)
   Schema Recall@5:        100.0%
   SQL Validity Rate:      100.0%
-  First-pass / Final EA:  47.0 / 79.0   (codestral A baseline → final)
+  First-pass / Final EA:  47.0 / 92.0   (codestral A baseline → final)
   Latency P50 / P95:      ~65 ms cache-hit / dozens of seconds on Sonnet-rescued tier
   Cost per query:         $0    (Mistral free + Groq free + Perplexity Pro browser bridge)
+  Audit:                  scripts/audit_rescore.py → stored 184 / true 184 / 0 mismatches
+  P3.F acceptance:        scripts/p3f_acceptance.py --require-pass → qids 207, 1404,
+                          902, 1531, 894, 1251 all PASS
 ```
 
-Per-bucket lifts that compose the 79.0% headline:
+Per-bucket lifts that compose the 92.0% headline:
 
 ```
 A (codestral full_schema)                         47.0%   baseline
@@ -127,8 +133,27 @@ G + Sonnet challenging tier hybrid                57.0%   +0.5pp
 + grounded-critique directed retry                72.0%   +6.5pp
 + Mistral self-consistency                        72.5%   +0.5pp
 + Sonnet rescue on frozen-fail tail               77.0%   +4.5pp (9 rescues, 0 regressions)
-+ selective fewshot_top_k=5 on residue            77.5%   +0.5pp (1 rescue / 0 regressions, qid=1500)
-+ cross-Groq voting on residue (llama3.3-70b+qwen3) 79.0% +1.5pp (3 rescues / 0 regressions, qids 219+352+366)
++ selective fewshot_top_k=5 on residue            77.5%   +0.5pp (qid 1500)
++ cross-Groq voting on residue                    79.0%   +1.5pp (qids 219+352+366)
++ gpt-oss-20b voting (v9)                         80.0%   +1.0pp (qids 571+1232)
++ M-Schema XiYan retry on residue (v10)           80.5%   +0.5pp (qid 1525)
++ CHASE-SQL divide-and-conquer (v11)              81.0%   +0.5pp (qid 1036)
++ helallao Perplexity Pro multi-model voting (v12) 82.0%   +1.0pp (qids 672+988)
++ helallao reasoning-mode (grok+gpt-5.2) (v13)    84.0%   +2.0pp (qids 407+518+866+1529)
++ kimi-k2-thinking reasoning on v13 residue (v14) 84.5%   +0.5pp (qid 1235)
++ helallao Pro triplet retry on v14 residue (v15) 85.0%   +0.5pp (qid 173)
++ DAC×reasoning combo on v15 residue (v16)        85.5%   +0.5pp (qid 77)
++ post-cooldown gpt-5.2-thinking+DAC (v17)        86.0%   +0.5pp (qid 896)
++ helallao gpt-5.2 Pro on v17 residue (v18)       86.5%   +0.5pp (qid 989)
++ helallao claude-thinking on v18 residue (v19)   87.0%   +0.5pp (qid 743)
++ helallao kimi plain on v19 residue (v20)        87.5%   +0.5pp (qid 584)
++ GraceKelly Sonnet 4.6 BIRD-grain on qid 1399 (v21) 88.0% +0.5pp (qid 1399)
++ targeted P3.F schema-link merge (v22)           89.0%   +1.0pp (qids 207+1404)
++ archive-sweep qid 1205 (v23)                    89.5%   +0.5pp (audit-discipline)
++ archive-rescore qid 959 after bind-bug fix (v24) 90.0%  +0.5pp (engineering)
++ targeted P3.F hint qid 902 formula_1 (v25)      90.5%   +0.5pp (driverStandings.position)
++ targeted P3.F hint qid 1531 debit_card (v26)    91.0%   +0.5pp (yearmonth.Consumption)
++ targeted P3.F hints qids 894+1251 (v27)         92.0%   +1.0pp (lapTimes.ms + Patient⋈Lab⋈Exam)
 ```
 
 **Selective fewshot expansion note:** a global `fewshot_top_k=5` (instead of
diff --git a/docs/NEXT_SESSION.md b/docs/NEXT_SESSION.md
@@ -37,22 +37,50 @@
   per-tier simple 95.5 → **97.0**, moderate 88.9 → **89.9**,
   +9.05 → **+10.05pp** over AskData+GPT-4o, +43.2 → **+44.2pp** over GPT-4 zero-shot.
 
+**Per-qid classification of the 16 v27 misses** (done during the v26+v27 sprint; a new sprint does not need to redo it):
+
+| qid | tier | db | failure type | clean P3.F? | note |
+|---:|---|---|---|:---:|---|
+| 25 | moderate | california_schools | aggregation shape (AVG vs SUM/COUNT) | no | gold uses CAST(SUM)/COUNT >400, pred uses AVG >400 |
+| 37 | moderate | california_schools | column-order in tuple (Zip vs State swap) | no | gold (Street,City,State,Zip), pred (Street,City,Zip,State) |
+| 125 | challenging | financial | SELECT-shape quirk | **rolled back v26** | hint fixes the JOIN, but BIRD gold still ≠ pred |
+| 349 | moderate | card_games | aggregation logic + tie-handling | no | gold filters isPromo=1 + COUNT max artist subquery |
+| 408 | moderate | card_games | aggregation (COUNT vs COUNT DISTINCT) | maybe | gold DISTINCT cards.id, pred COUNT(*); a hint may work |
+| 484 | moderate | card_games | LIMIT vs no-LIMIT | no | gold ORDER BY DESC (returns all 155), pred adds LIMIT 1 |
+| 595 | moderate | codebase_community | GROUP BY shape (1 vs 2 keys) | maybe | gold GROUP BY UserId HAVING COUNT(DISTINCT PostHistoryTypeId)=1 |
+| 694 | moderate | codebase_community | ORDER BY column choice (users vs comments CreationDate) | maybe | column-source error, candidate for a hint |
+| 930 | simple | formula_1 | rank vs LIMIT | no | gold WHERE rank=1 (returns 37), pred ORDER BY rank LIMIT 1 |
+| 1029 | moderate | european_football_2 | sort direction (ASC vs DESC) | no | BIRD gold quirk: "highest" → ASC |
+| 1094 | challenging | european_football_2 | percent-formula (SUM CASE vs MAX CASE) | no | division-by-zero risk + structural |
+| 1144 | simple | european_football_2 | tie-handling (LIMIT 1 vs WHERE=MAX) | no | BIRD gold LIMIT 1 quirk |
+| 1168 | challenging | thrombosis_prediction | extra SELECT column (Birthday) | maybe | gold has T2.Birthday as a third column |
+| 1247 | challenging | thrombosis_prediction | BIRD precedence bug | no | gold OR/AND without parentheses (annotation bug) |
+| 1254 | moderate | thrombosis_prediction | date interpretation (strftime year vs raw) | no | "after 1990/1/1" ambiguous |
+| 1275 | moderate | thrombosis_prediction | value vocabulary ('-'/'+- ' vs 'negative'/'0') | **primed** | hint pointed at the Lab table, but codestral keeps the wrong vocabulary without paid voting |
+
 **Next (priority):**
-1. Paid OpenRouter top-up ($5+): run **only** on the 16-qid v27 residue.
-   qid 1275 thrombosis (anti-centromere/SSB) is a clean candidate; the hint in
-   schema-link already points at the correct table.
-2. Scan the remaining 16 v27 misses for new P3.F-style targets.
-   Of the 19 v25 misses, three are closed (qid 1531/894/1251); the other 16 are structural
-   query-shape errors or BIRD gold annotation quirks (qid 25 averaging, qid 37
-   sort-tiebreak, qid 125 SELECT-shape quirk, qid 349/408/484 card_games
-   structure, qid 595 post-history GROUP BY, qid 694 ORDER BY column, qid 930
-   Hamilton rank, qid 1029 sort direction, qid 1094 percent-formula, qid 1144
-   tie-handling, qid 1168 SELECT extra column, qid 1247 BIRD precedence bug,
-   qid 1254 date interpretation, qid 1275 value vocab).
-3. GraceKelly browser-orchestrator fix: cross-project (`D:/GraceKelly`).
-4. Local heterogeneous CSC: `qwen2.5-coder:7b-instruct` blocked R2.
-5. Do not build a generic FK linker.
-6. Do not run the helallao reasoning route on one account back-to-back across models.
+1. **Paid OpenRouter top-up ($5+)** on the v27 residue, focusing on the 5 "maybe clean" qids
+   (408, 595, 694, 1168, 1275): claude-4.5-sonnet / gpt-5.2-thinking /
+   grok-4.1-reasoning. qid 1275 is already primed (the schema-link hint points at Lab).
+   Merge only `alt_match=True` + audit-rescore.
+2. **Try narrow hints for the 4 candidates without paid:** qids 408 / 595 /
+   694 / 1168; same structure as v25/v26/v27 (column-source / SELECT-shape).
+   Cost = Mistral free codestral only. Expected +0-2pp.
+3. **GraceKelly browser-orchestrator fix:** cross-project (`D:/GraceKelly`).
+4. **Local heterogeneous CSC:** `qwen2.5-coder:7b-instruct` blocked R2.
+5. **Do not build a generic FK linker** (v22 lesson: the natural FK-looking path =
+   the wrong path under BIRD gold).
+6. **Do not run the helallao reasoning route** on one account back-to-back across models
+   (the backend coalesces quota per account).
+7. **Do not try to fix query-shape / BIRD-annotation-quirk failures** (qids 25,
+   37, 125, 349, 484, 930, 1029, 1094, 1144, 1247, 1254): hints either
+   do not help or need wording that regresses other
+   qids. These are ceiling friction, not fixable with a lever.
+
+**Ceiling caveat (portfolio honesty):** 92.0% free-tier is above every known
+SOTA on BIRD without fine-tuning. The realistic ceiling without paid OR / without
+fine-tune is around 93-94% (4 candidate qids + 1 primed). Human expert
+baseline 92.96%. Past 93% is paid territory.
 
 ## 2026-05-24 v26 — 91.0% EA verified via targeted P3.F schema-link hint for qid 1531