docs+eval: TPD-reset day-3 sanity check + P3.F realism audit

JuliaEdom · claude · JuliaEdom · commit 75fc794b122d · 2026-05-17T15:23:51.000+03:00
- 21/21 fresh-qid retry on Groq llama70b: still 429 (98077/100000 TPD).
  Ping ≠ real prompt headroom. Operational rule documented.
- P3.F per-qid bucket audit: 22 row_count_off → only 2 clean JOIN-path
  errors; rest are query-structure mis-interpretations.
  Revise ceiling +5-10pp → +0.5-1pp. Don't build speculatively.
- Headline (v11 81.0% / 67.34% Arcwise-Plat / +6 audit catches) is final
  under the $0/chrome-free constraint.

Artefacts: eval/reports/2026-05-17c/* + docs/p3f_design.md + updates
to v11_saturation_evidence.md, NEXT_SESSION.md, SESSION_HANDOFF.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/NEXT_SESSION.md b/docs/NEXT_SESSION.md
@@ -3,6 +3,45 @@
 > One page, no fluff. Pick it up, do the work, update `SESSION_HANDOFF.md`,
 > then delete this file (or rewrite it for the next sprint).
 
+## 2026-05-18 update — Groq TPD recovery sanity check + P3.F realism audit
+
+**Sanity check executed autonomously before starting any new work:**
+
+1. **Groq llama70b TPD has NOT reset** (HEAD `d6303ff` post-EXTENDED, day-3).
+   A 38-token ping returned 200 OK, but `run_critique_retry` on the 21 unattempted
+   v11-residue qids (filtered baseline: `eval/reports/2026-05-17c/v11-residue-fresh21.json`)
+   hit 429 on 21/21. Usage is 98077/100000; real prompts (2500–11000 tokens) exceed the remaining headroom.
+   Reset windows from the 429 messages: 8m–2h10m, **rolling, not midnight-aligned**.
+   Operational rule added to SESSION_HANDOFF: to check TPD recovery, ping with a
+   real-sized prompt (≥3000 tokens), not a 5-token "pong".
+   Artefacts: `eval/reports/2026-05-17c/{v11-residue-fresh21.json, groq-llama70b-on-v11-residue-fresh21.json, llama70b-fresh21.log}`.
+
+2. **GraceKelly hybrid bridge (port 8011) is DOWN.** Chrome-gated P3.A/D/E
+   are unavailable without a user-initiated GraceKelly start.
+
+3. **P3.F realism audit** (`docs/p3f_design.md`). Memory promised P3.F +5–10pp
+   on the row_count_off bucket. A per-qid audit of the 20-case row_count_off bucket showed:
+   - 4 distinct_diff_only (bidirectional: a simple prompt rule regresses half of them)
+   - 6 missing/extra-table (5 missing, 1 extra; schema linking edge cases)
+   - 10 same_tables_diff_join, of which **only ~2 are clean JOIN-path FK-choice errors** (qid 207, 1404).
+     The rest are query-structure mis-interpretations (LIMIT/subquery/CASE shape), not JOIN-path errors.
+   - **Realistic P3.F ceiling: +0.5–1pp** on n=200 EA, not +5–10pp. Do not build speculatively.
+
+**Closed portfolio-deliverable summary:** the v11 81.0% / 67.34% Arcwise-Plat / +6 audit
+catches triplet is final for the $0/chrome-free budget. The HF Space is live.
+The video `docs/ui-live-demo.mp4` is ready. The README hero is updated.
+
+**What to do in the next session (after an explicit user mandate):**
+
+| Goal | Strategy | Expectation |
+|---|---|---|
+| Past 81% chrome-free $0 | Wait 24h+ → real-sized ping → llama70b retry 21 fresh qids | ≤1 rescue, +0–0.5pp |
+| Past 81% chrome-free $0 | Try `gemini-2.5-pro` (RPD ≥100, 10× higher than flash) on the residue | +0–1 rescue, +0–0.5pp |
+| Past 81% chrome-free $1 | OpenRouter $1 top-up unlocks 1000/day free-model requests | re-test nemotron + orthogonal free models, +0–1pp |
+| Past 81% chrome-gated | Bring up GraceKelly + GPT-5.x bridge on the residue (P3.A) | +1–3pp, orthogonal to Sonnet 4.6 |
+| Past 81% paid $1–3 | Anthropic Sonnet API sweep 38-case | +1–3pp, highest $/pp |
+| Research-grade | P3.F JOIN-path linker + CSC-SQL (see `docs/p3f_design.md`) | +2–4pp combined, multi-day |
+
 ## Context as of 2026-05-17 next-day-2 EXTENDED (six-model saturation sprint)
 
 - HEAD post-`bf26e91` + this sprint's commits (see SESSION_HANDOFF.md and `docs/v11_saturation_evidence.md`)
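
Note on the "Expectation" column in the table above: the percentage-point figures follow from the n=200 eval size, where one rescued case is worth 0.5pp. A minimal sketch of that conversion, assuming the 162/200 baseline quoted in the handoff; the helper names `accuracy_pct` and `pp_gain` are illustrative and do not exist in the repo.

```python
def accuracy_pct(correct: int, total: int = 200) -> float:
    """Execution accuracy in percent for `correct` passing cases out of `total`."""
    return 100.0 * correct / total


def pp_gain(rescued: int, baseline_correct: int = 162, total: int = 200) -> float:
    """Percentage-point lift from rescuing `rescued` residue cases.

    Defaults assume the v11 baseline of 162/200 stated in the handoff notes.
    """
    before = accuracy_pct(baseline_correct, total)
    after = accuracy_pct(baseline_correct + rescued, total)
    return after - before


if __name__ == "__main__":
    print(accuracy_pct(162))          # baseline accuracy
    for rescued in (1, 2, 4):
        print(rescued, round(pp_gain(rescued), 2))
```

Under these assumptions, "≤1 rescue" maps to at most +0.5pp, and the ~2 clean JOIN-path cases map to at most +1pp.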
diff --git a/docs/SESSION_HANDOFF.md b/docs/SESSION_HANDOFF.md
@@ -1,4 +1,15 @@
-# NL_SQL — Session Handoff (2026-05-17 next-day-2: v11 81.0% saturation confirmed × 6)
+# NL_SQL — Session Handoff (2026-05-18 day-3: v11 81.0% — TPD recovery negative + P3.F realism audit)
+
+> **Tl;dr 2026-05-18 day-3 (autonomous sanity sprint):**
+> - Groq llama70b TPD has NOT reset (98077/100000); the 21/21 fresh-qid retry hit 429.
+>   Operational rule: to check TPD recovery, ping with a ≥3000-token prompt, not a 5-token "pong".
+> - GraceKelly port 8011 is DOWN; Chrome-gated paths are unavailable without a user mandate.
+> - P3.F per-qid audit (`docs/p3f_design.md`): realistic ceiling +0.5–1pp, not +5–10pp.
+>   Memory promised that a JOIN-path linker would fix the 22 row_count_off cases; in reality only 2/20 are clean JOIN-path errors,
+>   the rest are query-structure mis-interpretations. Do not build speculatively.
+> - The headline triplet (81.0% BIRD / 67.34% Arcwise-Plat / +6 audit catches) is final.
+>
+> Artefacts of the day: `eval/reports/2026-05-17c/{v11-residue-fresh21.json, groq-llama70b-on-v11-residue-fresh21.json, llama70b-fresh21.log}`, `docs/p3f_design.md`, updates to `docs/v11_saturation_evidence.md` § day-3 + `docs/NEXT_SESSION.md`.
 
 > **Tl;dr 2026-05-17 next-day-2 (post-saturation sprint EXTENDED):** v11 81.0%
 > (162/200) is production. The v11 residue (38 fails) was checked against **six** free-tier
diff --git a/docs/p3f_design.md b/docs/p3f_design.md
@@ -0,0 +1,103 @@
+# P3.F — JOIN-path schema-linker: design analysis & realistic ceiling
+
+> Status: analysis complete, code deferred. Written 2026-05-18 after llama70b
+> TPD-reset retry sanity check (`v11_saturation_evidence.md` § day-3).
+
+## Why P3.F exists
+
+The v11 residue is 38 cases. The biggest single bucket is `row_count_off`
+(20 cases), and `feedback_bird_ceiling_physics` + memory suggested P3.F
+(custom JOIN-path schema-linker) could lift it +5–10pp by addressing
+"`row_count_off` is structural unanimous failure across all Mistral models".
+
+## Bucket sub-classification (script-derived, n=20)
+
+Running a `python -c` snippet over `eval/reports/2026-05-17/…-v11.json` with
+table-set diffing and DISTINCT diffing gave:
+
+| Sub-bucket | Count | Description |
+|---|---:|---|
+| same_tables_diff_join_cols_or_filter | 10 | Pred picks same tables as gold, but wrong JOIN ON column, wrong WHERE column, or wrong projection |
+| missing_table_in_pred | 5 | Pred substitutes wrong table or omits a required one |
+| distinct_diff_only | 4 | Bidirectional: 3 cases gold-has-DISTINCT/pred-doesn't, 1 case pred-has-DISTINCT/gold-doesn't |
+| extra_table_in_pred | 1 | Pred joined extra table that changes row count |
+
+## Per-qid audit of the "same_tables_diff_join_cols_or_filter" bucket
+
+This is the supposed P3.F target. Reading each gold ↔ pred pair:
+
+| qid | diff | Real root cause | Solvable by JOIN-path linker? |
+|---:|---|---|---|
+| 77 | mod | Pred filters on `frpm.CountyName + Low/High Grade`, gold on `schools.County + GSserved='K-9'`. Wrong **filter-column source-table**. | partially — needs column-to-table grounding heuristic |
+| 207 | chal | Pred joins `connected.bond_id`, gold joins `connected.atom_id`. Wrong **FK choice** between same tables. | **yes — classic JOIN-path** |
+| 484 | mod | Pred adds `LIMIT 1`, gold doesn't (returns all 155 cards tied at top mana cost). **Query-structure mis-interpretation**. | no |
+| 518 | mod | Gold uses WITH-clause to find max format then selects all matching cards. Pred just GROUP BY + LIMIT 1. **Query-structure mis-interpretation**. | no |
+| 930 | simple | Gold uses subquery IN (returns 37 races where Hamilton ranked 1). Pred uses JOIN + ORDER BY ASC + LIMIT 1 (returns single best race). **Semantic mis-interpretation of "highest rank"**. | no |
+| 990 | chal | Pred missed `WHERE results.time LIKE '_:%:__.___'` filter from gold. **WHERE clause omission**. | partially — needs evidence-grounded WHERE |
+| 1144 | simple | Pred uses JOIN, gold uses subquery + LIMIT 1. Pred returns 38 rows (Player_Attributes has 38 rows per player). **Subquery-vs-JOIN issue**. | no |
+| 1205 | mod | Pred has `LIMIT 1`, gold doesn't. Gold returns 67 lab records for patient 57266; pred truncates to 1. **LIMIT mis-interpretation**. | no |
+| 1399 | mod | Gold returns 14 rows (one per attendance match) via CASE WHEN. Pred returns single COUNT > 0 boolean. **Query-structure interpretation** ("Did X attend Y?" → BIRD wants per-attendance-row not single bool). | no |
+| 1404 | mod | Pred groups by `expense.expense_description`, gold groups by `event.type`. Wrong **GROUP BY column source-table**. | **yes — schema linking** |
+
+**Solvable-by-JOIN-path-linker count: ~2** (qid 207, 1404), maybe 2 more partial
+(qid 77, 990 if linker also handles WHERE-column source).
+
+## Realistic ceiling revision
+
+Earlier memory: "P3.F +5–10pp ceiling lift, days to weeks of work".
+Reality after audit: **+1–2pp on residue = +0.5–1pp on n=200 EA.** Most of the
+20 row_count_off cases are query-structure mis-interpretations (LIMIT/subquery/CASE
+shape), not JOIN-path choice errors. A schema-linker addresses 2–4 cases out
+of 38 residue.
+
+Combined with other buckets:
+- `distinct_diff_only` (4): would need a DISTINCT rule, but the errors are
+  bidirectional, so the same prompt rule would regress qid 407 (where gold
+  lacks DISTINCT but pred adds it).
+- `set_mismatch` (10), `col_projection_off` (7): not addressed by JOIN-path linker.
+
+**Total realistic chrome-free $0-budget headroom past v11 81.0%:** ≤+2.5pp.
+This matches the upper bound from `v11_saturation_evidence.md` § lower bound
+estimate (binomial CI ≤5% rescue rate across all attempted free-tier voting).
+
+## Design (sketch only, not implemented)
+
+If we did build P3.F:
+
+1. **Foreign-key candidate enumeration.** For each pair of tables (T1, T2)
+   in the retrieved set, collect ALL FK paths via SQLite `pragma foreign_key_list`
+   and via heuristic `T1.X_id ↔ T2.id` matches. Each path gets a score.
+2. **Question-token grounding.** Map question entities to columns via
+   embedding similarity against `column_name + column_description` (already
+   in chunker). Drop FK paths whose entity-mapped columns are not on the
+   path.
+3. **Re-prompt with candidate JOIN paths as hint.** "For tables {T1, T2, T3},
+   the candidate JOIN paths are: (a) T1.X = T2.X via FK; (b) T1.Y = T3.Y +
+   T3.Z = T2.Z indirect. Question 'X' suggests path (a). Use it unless the
+   evidence forces (b)."
+
+This is research-grade work. Memory `feedback_no_redraft_after_approval` +
+the realistic +0.5–1pp ceiling argue against starting it without explicit
+user mandate.
+
+## Recommendation
+
+**Don't build P3.F speculatively.** The headline 81.0% v11 + 67.34% corrected-gold
+triplet is portfolio-ready. The marginal +0.5–1pp from a JOIN-path linker
+costs days of work for a number that won't change the narrative.
+
+If user wants past 81% chrome-free, the cheaper paths are:
+1. **Wait for daily quotas to fully reset** (24h+) and re-run llama70b on
+   the 21 unattempted qids — expected ≤1 rescue but $0 cost.
+2. **Try `gemini-2.5-pro` (RPD ≥100, 5× higher than 2.5-flash)** via Google
+   AI Studio. New provider on residue, orthogonal model family.
+3. **OpenRouter paid $1 top-up** unlocks 1000 free-model requests/day —
+   not paid model usage, just lifts free-tier cap. Could re-run nemotron and
+   other free OpenRouter models under the raised daily cap.
+
+If user wants past 81% with $1–3 budget: paid Anthropic API Sonnet sweep on
+the 38 residue. Memory marks this deprecated, but it's the highest $/pp.
+
+If user wants research-grade improvement: P3.F design above + custom corrective
+self-consistency (CSC-SQL technique from `docs/bird_sota_research.md`). Multi-day
+work, expected +2–4pp combined.
diff --git a/docs/v11_saturation_evidence.md b/docs/v11_saturation_evidence.md
@@ -77,3 +77,34 @@ $0-budget chrome-free constraint as of 2026-05-17 next-day-2. The
 **67.34% / 199** corrected-gold (Arcwise-Plat) noise-floor is unchanged.
 The **+6 audit-catches** triplet stands. Live HF Space and live screenshots
 remain accurate.
+
+## 2026-05-18 — TPD-reset retry on 21 unattempted v11 residue: 21/21 still 429
+
+Day-3 sanity check. Ping `llama-3.3-70b-versatile` (38-token prompt) → 200 OK
+suggested TPD might have reset. Real retry on the 21 v11-residue qids not
+attempted in the 17-case prior run (`groq-llama70b-on-v11-residue.json`):
+**21/21 hit 429**, Groq TPD still at **98077/100000** (Used + Requested > Limit).
+
+| Reset window for hit qids | Count |
+|---|---:|
+| 8–10 min | 2 |
+| 27–60 min | 8 |
+| 1h–2h10m | 11 |
+
+Conclusion: **Groq TPD daily reset is fully-rolling, not midnight-aligned**.
+A 38-token ping consumes ~38 tokens and passes when residual headroom > 38,
+but the real `run_critique_retry` calls are 2,500–11,000 tokens each — they
+hit the daily cap on the next request.
+
+**Operational rule (added to `docs/SESSION_HANDOFF.md`):** for Groq TPD
+recovery, ping a real-sized prompt (≥3000 tokens), not a 5-token "pong",
+before launching a retry sweep.
+
+Artefacts:
+- `eval/reports/2026-05-17c/v11-residue-fresh21.json` (filtered baseline, 21 qids)
+- `eval/reports/2026-05-17c/groq-llama70b-on-v11-residue-fresh21.json` (final report: cases=0, 0 reached)
+- `eval/reports/2026-05-17c/llama70b-fresh21.log` (full 429 transcript)
+
+No update to the headline. v11 81.0% / 200 unchanged. Residue still 38/200,
+22 of which are `row_count_off` structural failures — the bucket P3.F
+targets.
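
Note on the operational rule above: the ping result and the 429s are consistent with the "Used + Requested > Limit" condition quoted from the 429 message. A minimal sketch of that arithmetic, assuming the limit check works exactly as that message states; `fits_tpd` and `headroom` are illustrative helpers, not part of the Groq API or this repo.

```python
def headroom(used: int, limit: int = 100_000) -> int:
    """Tokens left in the daily window before the cap is reached."""
    return max(limit - used, 0)


def fits_tpd(used: int, requested: int, limit: int = 100_000) -> bool:
    """True if a request of `requested` tokens stays within the daily cap.

    Assumes rejection happens when used + requested > limit,
    as the quoted 429 message suggests.
    """
    return used + requested <= limit


if __name__ == "__main__":
    used = 98_077  # usage reported in the 429 responses
    print("headroom:", headroom(used))
    for prompt_tokens in (38, 2_500, 3_000, 11_000):
        print(prompt_tokens, fits_tpd(used, prompt_tokens))
```

With 98077 tokens used, the 38-token ping fits, while every real `run_critique_retry` prompt (2,500 tokens and up) does not. This is why a ≥3000-token probe is the more informative recovery check.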
diff --git a/eval/reports/2026-05-17c/groq-llama70b-on-v11-residue-fresh21.json b/eval/reports/2026-05-17c/groq-llama70b-on-v11-residue-fresh21.json
@@ -0,0 +1,9 @@
+{
+  "alt_model": "groq:llama-3.3-70b-versatile+grounded_critique+fewshot3",
+  "summary": {
+    "voted_better": 0,
+    "voted_worse": 0,
+    "voted_same": 0
+  },
+  "records": []
+}
diff --git a/eval/reports/2026-05-17c/v11-residue-fresh21.json b/eval/reports/2026-05-17c/v11-residue-fresh21.json