Commit 75fc794

JuliaEdom and claude committed
docs+eval: TPD-reset day-3 sanity check + P3.F realism audit
- 21/21 fresh-qid retry on Groq llama70b: still 429 (98077/100000 TPD). Ping ≠ real prompt headroom. Operational rule documented.
- P3.F per-qid bucket audit: 22 row_count_off → only 2 chunked JOIN-path errors; rest are query-structure mis-interpretations. Revise ceiling +5-10pp → +0.5-1pp. Don't build speculatively.
- v11 81.0% + 67.34% Arcwise-Plat + 6 audit catches headline final for $0/chrome-free constraint.

Artefacts: eval/reports/2026-05-17c/* + docs/p3f_design.md + updates to v11_saturation_evidence.md, NEXT_SESSION.md, SESSION_HANDOFF.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d6303ff commit 75fc794

6 files changed

Lines changed: 932 additions & 1 deletion

docs/NEXT_SESSION.md

Lines changed: 39 additions & 0 deletions
@@ -3,6 +3,45 @@
33
> One page, no fluff. Pick it up, do the work, update `SESSION_HANDOFF.md`,
44
> then delete this file (or rewrite it for the next sprint).
55
6+
## 2026-05-18 update — Groq TPD recovery sanity check + P3.F realism audit
7+
8+
**Sanity check run autonomously before starting any new work:**
9+
10+
1. **Groq llama70b TPD has NOT reset** (HEAD `d6303ff` post-EXTENDED, day-3).
11+
A 38-token ping returned 200 OK, but `run_critique_retry` on the 21 unattempted
12+
v11-residue qids (filtered baseline: `eval/reports/2026-05-17c/v11-residue-fresh21.json`)
13+
hit 429 on 21/21. Usage is 98077/100000 (about 1,923 tokens of headroom); real prompts (2500–11000 tokens) exceed the cap.
14+
Reset windows reported in the 429 messages: 8m–2h10m, **rolling, not midnight-aligned**.
15+
Operational rule added to SESSION_HANDOFF: to check TPD recovery, ping with a
16+
real-sized prompt (≥3000 tokens), not a 5-token "pong".
17+
Artefacts: `eval/reports/2026-05-17c/{v11-residue-fresh21.json, groq-llama70b-on-v11-residue-fresh21.json, llama70b-fresh21.log}`.
18+
19+
2. **GraceKelly hybrid bridge (port 8011) is DOWN.** The Chrome-gated P3.A/D/E paths
20+
are unavailable until the user starts GraceKelly.
21+
22+
3. **P3.F realism audit** (`docs/p3f_design.md`). Memory promised P3.F +5–10pp
23+
on the row_count_off bucket. A per-qid audit of the 20 row_count_off cases showed:
24+
- 4 distinct_diff_only (bidirectional: a simple prompt rule regresses half of them)
25+
- 5 missing-table + 1 extra-table (schema-linking edge cases)
26+
- 10 same_tables_diff_join, of which **only ~2 are pure JOIN-path FK-choice errors** (qid 207, 1404).
27+
The rest are query-structure mis-interpretations (LIMIT/subquery/CASE shape), not JOIN-path.
28+
- **Realistic P3.F ceiling: +0.5–1pp** on n=200 EA, not +5–10pp. Do not build it speculatively.
29+
30+
**Closed portfolio-deliverable summary:** the v11 81.0% / 67.34% Arcwise-Plat / +6 audit
31+
catches triplet is final for the $0/chrome-free budget. The HF Space is live.
32+
The video docs/ui-live-demo.mp4 is ready. The README hero is updated.
33+
34+
**What to do in the next session (after an explicit user mandate):**
35+
36+
| Goal | Strategy | Expected outcome |
37+
|---|---|---|
38+
| Past 81% chrome-free $0 | Wait 24h+ → real-sized ping → llama70b retry 21 fresh qids | ≤1 rescue, +0–0.5pp |
39+
| Past 81% chrome-free $0 | Try `gemini-2.5-pro` (RPD ≥100, 10× higher than flash) on the residue | +0–1 rescue, +0–0.5pp |
40+
| Past 81% chrome-free $1 | OpenRouter $1 top-up unlocks 1000/day free-model requests | re-test nemotron + orthogonal free models, +0–1pp |
41+
| Past 81% chrome-gated | Bring up GraceKelly + GPT-5.x bridge on the residue (P3.A) | +1–3pp, orthogonal to Sonnet 4.6 |
42+
| Past 81% paid $1–3 | Anthropic Sonnet API sweep 38-case | +1–3pp, highest $/pp |
43+
| Research-grade | P3.F JOIN-path linker + CSC-SQL (see `docs/p3f_design.md`) | +2–4pp combined, multi-day |
44+
645
## Context as of 2026-05-17 next-day-2 EXTENDED (six-model saturation sprint)
746

847
- HEAD post-`bf26e91` + this sprint's commits (see SESSION_HANDOFF.md and `docs/v11_saturation_evidence.md`)

docs/SESSION_HANDOFF.md

Lines changed: 12 additions & 1 deletion
@@ -1,4 +1,15 @@
1-
# NL_SQL — Session Handoff (2026-05-17 next-day-2: v11 81.0% saturation confirmed × 6)
1+
# NL_SQL — Session Handoff (2026-05-18 day-3: v11 81.0% — TPD recovery negative + P3.F realism audit)
2+
3+
> **Tl;dr 2026-05-18 day-3 (autonomous sanity sprint):**
4+
> - Groq llama70b TPD has NOT reset (98077/100000); the 21/21 fresh-qid retry hit 429.
5+
> Operational rule: to check TPD recovery, ping with a ≥3000-token prompt, not a 5-token "pong".
6+
> - GraceKelly port 8011 is DOWN; Chrome-gated paths are unavailable without a user mandate.
7+
> - P3.F per-qid audit (`docs/p3f_design.md`): realistic ceiling +0.5–1pp, not +5–10pp.
8+
> Memory promised that a JOIN-path linker would fix 22 row_count_off cases; in reality only 2/20 are pure JOIN-path,
9+
> the rest are query-structure mis-interpretations. Do not build it speculatively.
10+
> - The headline triplet (81.0% BIRD / 67.34% Arcwise-Plat / +6 audit catches) is final.
11+
>
12+
> Artefacts of the day: `eval/reports/2026-05-17c/{v11-residue-fresh21.json, groq-llama70b-on-v11-residue-fresh21.json, llama70b-fresh21.log}`, `docs/p3f_design.md`, updates in `docs/v11_saturation_evidence.md` § day-3 + `docs/NEXT_SESSION.md`.
213
314
> **Tl;dr 2026-05-17 next-day-2 (post-saturation sprint EXTENDED):** v11 81.0%
415
> (162/200) is in production. The v11 residue (38 fails) was checked with **six** free-tier

docs/p3f_design.md

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
1+
# P3.F — JOIN-path schema-linker: design analysis & realistic ceiling
2+
3+
> Status: analysis complete, code deferred. Written 2026-05-18 after llama70b
4+
> TPD-reset retry sanity check (`v11_saturation_evidence.md` § day-3).
5+
6+
## Why P3.F exists
7+
8+
The v11 residue is 38 cases. The biggest single bucket is `row_count_off`
9+
(20 cases), and `feedback_bird_ceiling_physics` + memory suggested P3.F
10+
(custom JOIN-path schema-linker) could lift it +5–10pp by addressing
11+
"`row_count_off` is structural unanimous failure across all Mistral models".
12+
13+
## Bucket sub-classification (script-derived, n=20)
14+
15+
Running a `python -c` snippet on `eval/reports/2026-05-17/…-v11.json` with table-set
16+
diffing + DISTINCT diffing gave:
17+
18+
| Sub-bucket | Count | Description |
19+
|---|---:|---|
20+
| same_tables_diff_join_cols_or_filter | 10 | Pred picks same tables as gold, but wrong JOIN ON column, wrong WHERE column, or wrong projection |
21+
| missing_table_in_pred | 5 | Pred substitutes wrong table or omits a required one |
22+
| distinct_diff_only | 4 | Bidirectional: 3 cases gold-has-DISTINCT/pred-doesn't, 1 case pred-has-DISTINCT/gold-doesn't |
23+
| extra_table_in_pred | 1 | Pred joined extra table that changes row count |
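
The `python -c` snippet itself is not part of this commit. A minimal sketch of what table-set diffing plus DISTINCT diffing could look like, assuming the gold and predicted SQL are available as plain strings. The function names and the regex-based table extraction are illustrative assumptions, not the script that produced the counts above:

```python
import re

# Assumption: a table name directly follows FROM or JOIN. Subqueries and CTE
# names are not resolved; a real SQL parser would be more robust.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+[`\"\[]?(\w+)", re.IGNORECASE)
DISTINCT_RE = re.compile(r"\bDISTINCT\b\s*", re.IGNORECASE)


def tables(sql: str) -> set[str]:
    """Lower-cased set of table names referenced after FROM or JOIN."""
    return {name.lower() for name in TABLE_RE.findall(sql)}


def strip_distinct(sql: str) -> str:
    """SQL with DISTINCT removed, whitespace collapsed, lower-cased."""
    return " ".join(DISTINCT_RE.sub("", sql).lower().split())


def sub_bucket(gold: str, pred: str) -> str:
    """Assign a gold/pred pair to one of the four sub-buckets in the table."""
    gold_tables, pred_tables = tables(gold), tables(pred)
    if gold_tables - pred_tables:
        return "missing_table_in_pred"
    if pred_tables - gold_tables:
        return "extra_table_in_pred"
    gold_distinct = bool(DISTINCT_RE.search(gold))
    pred_distinct = bool(DISTINCT_RE.search(pred))
    if gold_distinct != pred_distinct and strip_distinct(gold) == strip_distinct(pred):
        return "distinct_diff_only"
    return "same_tables_diff_join_cols_or_filter"
```

The last branch is a catch-all, which is why that bucket still needs the per-qid reading that follows: same tables with a different query shape lands there regardless of whether the JOIN is actually at fault.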
24+
25+
## Per-qid audit of the "same_tables_diff_join_cols_or_filter" bucket
26+
27+
This is the supposed P3.F target. Reading each gold ↔ pred pair:
28+
29+
| qid | diff | Real root cause | Solvable by JOIN-path linker? |
30+
|---:|---|---|---|
31+
| 77 | mod | Pred filters on `frpm.CountyName + Low/High Grade`, gold on `schools.County + GSserved='K-9'`. Wrong **filter-column source-table**. | partially — needs column-to-table grounding heuristic |
32+
| 207 | chal | Pred joins `connected.bond_id`, gold joins `connected.atom_id`. Wrong **FK choice** between same tables. | **yes — classic JOIN-path** |
33+
| 484 | mod | Pred adds `LIMIT 1`, gold doesn't (returns all 155 cards tied at top mana cost). **Query-structure mis-interpretation**. | no |
34+
| 518 | mod | Gold uses WITH-clause to find max format then selects all matching cards. Pred just GROUP BY + LIMIT 1. **Query-structure mis-interpretation**. | no |
35+
| 930 | simple | Gold uses subquery IN (returns 37 races where Hamilton ranked 1). Pred uses JOIN + ORDER BY ASC + LIMIT 1 (returns single best race). **Semantic mis-interpretation of "highest rank"**. | no |
36+
| 990 | chal | Pred missed `WHERE results.time LIKE '_:%:__.___'` filter from gold. **WHERE clause omission**. | partially — needs evidence-grounded WHERE |
37+
| 1144 | simple | Pred uses JOIN, gold uses subquery + LIMIT 1. Pred returns 38 rows (Player_Attributes has 38 rows per player). **Subquery-vs-JOIN issue**. | no |
38+
| 1205 | mod | Pred has `LIMIT 1`, gold doesn't. Gold returns 67 lab records for patient 57266; pred truncates to 1. **LIMIT mis-interpretation**. | no |
39+
| 1399 | mod | Gold returns 14 rows (one per attendance match) via CASE WHEN. Pred returns single COUNT > 0 boolean. **Query-structure interpretation** ("Did X attend Y?" → BIRD wants per-attendance-row not single bool). | no |
40+
| 1404 | mod | Pred groups by `expense.expense_description`, gold groups by `event.type`. Wrong **GROUP BY column source-table**. | **yes — schema linking** |
41+
42+
**Solvable-by-JOIN-path-linker count: ~2** (qid 207, 1404), maybe 2 more partial
43+
(qid 77, 990 if linker also handles WHERE-column source).
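
The count can be re-derived mechanically from the last column of the table. A small sketch, with the verdicts transcribed by hand from the rows above (the dictionary name and the yes/partial/no labels are illustrative):

```python
from collections import Counter

# Verdicts copied from the "Solvable by JOIN-path linker?" column above.
VERDICTS = {
    77: "partial", 207: "yes", 484: "no", 518: "no", 930: "no",
    990: "partial", 1144: "no", 1205: "no", 1399: "no", 1404: "yes",
}

tally = Counter(VERDICTS.values())
solvable = sorted(qid for qid, verdict in VERDICTS.items() if verdict == "yes")
partial = sorted(qid for qid, verdict in VERDICTS.items() if verdict == "partial")

print(tally["yes"], tally["partial"], tally["no"])  # 2 2 6
print(solvable, partial)                            # [207, 1404] [77, 990]
```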
44+
45+
## Realistic ceiling revision
46+
47+
Earlier memory: "P3.F +5–10pp ceiling lift, days to weeks of work".
48+
Reality after audit: **+1–2pp on residue = +0.5–1pp on n=200 EA.** Most of the
49+
20 row_count_off cases are query-structure mis-interpretations (LIMIT/subquery/CASE
50+
shape), not JOIN-path choice errors. A schema-linker addresses 2–4 cases out
51+
of 38 residue.
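
For scale, each rescued case is worth 0.5pp on n=200. A short sketch of the conversion, using the 162/200 baseline quoted in `SESSION_HANDOFF.md`. Reading the +0.5–1pp ceiling as 1–2 net rescues out of the 2–4 addressable cases is an interpretation, not something the audit states:

```python
N_TOTAL = 200    # evaluation set size behind the v11 headline
N_CORRECT = 162  # v11 passes, i.e. 81.0%


def pp_gain(rescued: int, n_total: int = N_TOTAL) -> float:
    """Percentage-point lift from rescuing `rescued` previously failing cases."""
    return 100.0 * rescued / n_total


def accuracy_after(rescued: int) -> float:
    """Headline accuracy (in %) if `rescued` residue cases start passing."""
    return 100.0 * (N_CORRECT + rescued) / N_TOTAL


for rescued in (1, 2, 4):
    print(rescued, pp_gain(rescued), accuracy_after(rescued))
```

Rescuing all 4 cases, including both partials, would be the optimistic end at +2.0pp.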
52+
53+
Combined with other buckets:
54+
- `distinct_diff_only` (4): the errors run in both directions, so a single prompt
55+
rule that fixes the gold-has-DISTINCT cases would regress qid 407 (where gold lacks
56+
DISTINCT but pred adds it).
57+
- `set_mismatch` (10), `col_projection_off` (7): not addressed by JOIN-path linker.
58+
59+
**Total realistic chrome-free $0-budget headroom past v11 81.0%:** ≤+2.5pp.
60+
This matches the upper bound from `v11_saturation_evidence.md` § lower bound
61+
estimate (binomial CI ≤5% rescue rate across all attempted free-tier voting).
62+
63+
## Design (sketch only, not implemented)
64+
65+
If we did build P3.F:
66+
67+
1. **Foreign-key candidate enumeration.** For each pair of tables (T1, T2)
68+
in retrieved set, collect ALL FK paths via SQLite `pragma foreign_key_list`
69+
and via heuristic `T1.X_id ↔ T2.id` matches. Each path gets a score.
70+
2. **Question-token grounding.** Map question entities to columns via
71+
embedding similarity against `column_name + column_description` (already
72+
in chunker). Drop FK paths whose entity-mapped columns are not on the
73+
path.
74+
3. **Re-prompt with candidate JOIN paths as hint.** "For tables {T1, T2, T3},
75+
the candidate JOIN paths are: (a) T1.X = T2.X via FK; (b) T1.Y = T3.Y +
76+
T3.Z = T2.Z indirect. Question 'X' suggests path (a). Use it unless the
77+
evidence forces (b)."
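
Step 1 is the most mechanical part. A minimal sketch of direct FK enumeration with Python's `sqlite3` and `PRAGMA foreign_key_list`, assuming declared foreign keys only (no `X_id` heuristic, no multi-hop paths, no scoring). The toy schema borrows the `connected` table name from qid 207; the `atom` and `bond` tables and all helper names are hypothetical and are not the actual BIRD schema:

```python
import sqlite3


def fk_edges(conn: sqlite3.Connection, table: str) -> list[tuple[str, str, str, str]]:
    """(from_table, from_col, to_table, to_col) for each FK declared on `table`."""
    rows = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
    # Row layout: id, seq, table, from, to, on_update, on_delete, match.
    # `to` is NULL when the REFERENCES clause omits the target column.
    return [(table, row[3], row[2], row[4]) for row in rows]


def direct_paths(conn: sqlite3.Connection, t1: str, t2: str) -> list[tuple[str, str, str, str]]:
    """Declared FK edges that connect t1 and t2 in either direction."""
    edges = fk_edges(conn, t1) + fk_edges(conn, t2)
    return sorted(edge for edge in edges if {edge[0], edge[2]} == {t1, t2})


# Toy schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE atom (atom_id TEXT PRIMARY KEY);
    CREATE TABLE bond (bond_id TEXT PRIMARY KEY);
    CREATE TABLE connected (
        atom_id TEXT REFERENCES atom(atom_id),
        bond_id TEXT REFERENCES bond(bond_id)
    );
    """
)
print(direct_paths(conn, "connected", "atom"))
```

With this schema the only candidate between `connected` and `atom` is the `atom_id` edge, which is the kind of hint step 3 would pass to the model for a qid 207 style error.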
78+
79+
This is research-grade work. Memory `feedback_no_redraft_after_approval` +
80+
the realistic +0.5–1pp ceiling argue against starting it without explicit
81+
user mandate.
82+
83+
## Recommendation
84+
85+
**Don't build P3.F speculatively.** The headline 81.0% v11 + 67.34% corrected-gold
86+
triplet is portfolio-ready. The marginal +0.5–1pp from a JOIN-path linker
87+
costs days of work for a number that won't change the narrative.
88+
89+
If the user wants to get past 81% chrome-free, the cheaper paths are:
90+
1. **Wait for daily quotas to fully reset** (24h+) and re-run llama70b on
91+
the 21 unattempted qids — expected ≤1 rescue but $0 cost.
92+
2. **Try `gemini-2.5-pro` (RPD ≥100, 5× higher than 2.5-flash)** via Google
93+
AI Studio. New provider on residue, orthogonal model family.
94+
3. **OpenRouter paid $1 top-up** unlocks 1000 free-model requests/day. This is
95+
not paid model usage; it only lifts the free-tier cap. Could re-run nemotron and
96+
other free OpenRouter models under the raised daily cap.
97+
98+
If the user wants to get past 81% with a $1–3 budget: paid Anthropic API Sonnet sweep on
99+
the 38 residue. Memory marks this deprecated, but it's the highest $/pp.
100+
101+
If the user wants research-grade improvement: P3.F design above + custom corrective
102+
self-consistency (CSC-SQL technique from `docs/bird_sota_research.md`). Multi-day
103+
work, expected +2–4pp combined.

docs/v11_saturation_evidence.md

Lines changed: 31 additions & 0 deletions
@@ -77,3 +77,34 @@ $0-budget chrome-free constraint as of 2026-05-17 next-day-2. The
7777
**67.34% / 199** corrected-gold (Arcwise-Plat) noise-floor is unchanged.
7878
The **+6 audit-catches** triplet stands. Live HF Space and live screenshots
7979
remain accurate.
80+
81+
## 2026-05-18 — TPD-reset retry on 21 unattempted v11 residue: 21/21 still 429
82+
83+
Day-3 sanity check. A ping to `llama-3.3-70b-versatile` (38-token prompt) returned 200 OK,
84+
which suggested the TPD might have reset. The real retry on the 21 v11-residue qids not
85+
attempted in the 17-case prior run (`groq-llama70b-on-v11-residue.json`) said otherwise:
86+
**21/21 hit 429**, with Groq TPD still at **98077/100000** (Used + Requested > Limit).
87+
88+
| Reset window for hit qids | Count |
89+
|---|---:|
90+
| 8–10 min | 2 |
91+
| 27–60 min | 8 |
92+
| 1h–2h10m | 11 |
93+
94+
Conclusion: **Groq TPD daily reset is fully-rolling, not midnight-aligned**.
95+
A 38-token ping consumes ~38 tokens and passes when residual headroom > 38,
96+
but the real `run_critique_retry` calls are 2,500–11,000 tokens each — they
97+
hit the daily cap on the next request.
98+
99+
**Operational rule (added to `docs/SESSION_HANDOFF.md`):** for Groq TPD
100+
recovery, ping a real-sized prompt (≥3000 tokens), not a 5-token "pong",
101+
before launching a retry sweep.
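
The reason a small ping passes while a real prompt fails is plain arithmetic on the condition quoted above (Used + Requested > Limit). A sketch using the figures from this section; the function name is illustrative, and treating the prompt size as the whole request (ignoring completion tokens) is a simplifying assumption:

```python
TPD_LIMIT = 100_000  # daily token limit reported in the 429 message
TPD_USED = 98_077    # usage reported in the 429 message


def fits(requested_tokens: int, used: int = TPD_USED, limit: int = TPD_LIMIT) -> bool:
    """True when used + requested stays within the limit, i.e. no 429 expected."""
    return used + requested_tokens <= limit


print(TPD_LIMIT - TPD_USED)  # 1923 tokens of headroom
for tokens in (38, 2_500, 3_000, 11_000):
    print(tokens, fits(tokens))
```

Only the 38-token ping fits in the 1,923-token headroom, so a probe has to be at least as large as the smallest real prompt to say anything about the sweep.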
102+
103+
Artefacts:
104+
- `eval/reports/2026-05-17c/v11-residue-fresh21.json` (filtered baseline, 21 qids)
105+
- `eval/reports/2026-05-17c/groq-llama70b-on-v11-residue-fresh21.json` (final report: cases=0, 0 reached)
106+
- `eval/reports/2026-05-17c/llama70b-fresh21.log` (full 429 transcript)
107+
108+
No update to the headline. v11 81.0% / 200 unchanged. Residue still 38/200,
109+
22 of which are `row_count_off` structural failures — the bucket P3.F
110+
targets.
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
{
2+
"alt_model": "groq:llama-3.3-70b-versatile+grounded_critique+fewshot3",
3+
"summary": {
4+
"voted_better": 0,
5+
"voted_worse": 0,
6+
"voted_same": 0
7+
},
8+
"records": []
9+
}
