|
| 1 | +# P3.F — JOIN-path schema-linker: design analysis & realistic ceiling |
| 2 | + |
| 3 | +> Status: analysis complete, code deferred. Written 2026-05-18 after llama70b |
| 4 | +> TPD-reset retry sanity check (`v11_saturation_evidence.md` § day-3). |
| 5 | +
|
| 6 | +## Why P3.F exists |
| 7 | + |
| 8 | +The v11 residue is 38 cases. The biggest single bucket is `row_count_off` |
| 9 | +(20 cases), and `feedback_bird_ceiling_physics` + memory suggested P3.F |
| 10 | +(custom JOIN-path schema-linker) could lift it +5–10pp by addressing |
| 11 | +"`row_count_off` is structural unanimous failure across all Mistral models". |
| 12 | + |
| 13 | +## Bucket sub-classification (script-derived, n=20) |
| 14 | + |
| 15 | +A `python -c` snippet run on `eval/reports/2026-05-17/…-v11.json`, using table-set
| 16 | +diffing plus DISTINCT diffing, gave:
| 17 | + |
| 18 | +| Sub-bucket | Count | Description | |
| 19 | +|---|---:|---| |
| 20 | +| same_tables_diff_join_cols_or_filter | 10 | Pred picks same tables as gold, but wrong JOIN ON column, wrong WHERE column, or wrong projection | |
| 21 | +| missing_table_in_pred | 5 | Pred substitutes wrong table or omits a required one | |
| 22 | +| distinct_diff_only | 4 | Bidirectional: 3 cases gold-has-DISTINCT/pred-doesn't, 1 case pred-has-DISTINCT/gold-doesn't | |
| 23 | +| extra_table_in_pred | 1 | Pred joined extra table that changes row count | |
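
The original `python -c` snippet is not reproduced in this note. The following is a minimal sketch of how table-set diffing plus DISTINCT diffing could produce the four sub-buckets above. The function names, the regex-based table extraction, and the precedence order (missing table before extra table) are assumptions for illustration, not the script that generated the counts.

```python
import re

# Assumption: table names are the identifiers that follow FROM or JOIN.
# Subqueries ("FROM (SELECT ...") are skipped and comma joins are not handled.
_TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+[`\"\[]?(\w+)", re.IGNORECASE)
_DISTINCT_RE = re.compile(r"\bDISTINCT\b", re.IGNORECASE)


def tables_in(sql):
    """Return the lower-cased set of table names referenced by FROM/JOIN."""
    return {name.lower() for name in _TABLE_RE.findall(sql)}


def strip_distinct(sql):
    """Drop DISTINCT keywords, lower-case, and collapse whitespace."""
    return " ".join(_DISTINCT_RE.sub(" ", sql).lower().split())


def classify(gold_sql, pred_sql):
    """Assign one gold/pred pair to a row_count_off sub-bucket."""
    gold_tables, pred_tables = tables_in(gold_sql), tables_in(pred_sql)
    if gold_tables - pred_tables:
        # Also covers substitution: a gold table is absent from the prediction.
        return "missing_table_in_pred"
    if pred_tables - gold_tables:
        return "extra_table_in_pred"
    gold_has = bool(_DISTINCT_RE.search(gold_sql))
    pred_has = bool(_DISTINCT_RE.search(pred_sql))
    if gold_has != pred_has and strip_distinct(gold_sql) == strip_distinct(pred_sql):
        return "distinct_diff_only"
    return "same_tables_diff_join_cols_or_filter"


# Toy queries (not taken from BIRD) exercising each branch.
examples = {
    "missing": classify("SELECT a FROM t1 JOIN t2 ON t1.id = t2.id", "SELECT a FROM t1"),
    "extra": classify("SELECT a FROM t1", "SELECT a FROM t1 JOIN t2 ON t1.id = t2.id"),
    "distinct": classify("SELECT DISTINCT a FROM t1", "SELECT a FROM t1"),
    "same": classify(
        "SELECT a FROM t1 JOIN t2 ON t1.x = t2.x",
        "SELECT a FROM t1 JOIN t2 ON t1.y = t2.y",
    ),
}
```

A regex keeps the sketch dependency-free. A real SQL parser would be more robust for CTEs, aliases, and comma joins, at the cost of an extra dependency.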
| 24 | + |
| 25 | +## Per-qid audit of the "same_tables_diff_join_cols_or_filter" bucket |
| 26 | + |
| 27 | +This is the supposed P3.F target. Reading each gold ↔ pred pair: |
| 28 | + |
| 29 | +| qid | diff | Real root cause | Solvable by JOIN-path linker? | |
| 30 | +|---:|---|---|---| |
| 31 | +| 77 | mod | Pred filters on `frpm.CountyName + Low/High Grade`, gold on `schools.County + GSserved='K-9'`. Wrong **filter-column source-table**. | partially — needs column-to-table grounding heuristic | |
| 32 | +| 207 | chal | Pred joins `connected.bond_id`, gold joins `connected.atom_id`. Wrong **FK choice** between same tables. | **yes — classic JOIN-path** | |
| 33 | +| 484 | mod | Pred adds `LIMIT 1`, gold doesn't (returns all 155 cards tied at top mana cost). **Query-structure mis-interpretation**. | no | |
| 34 | +| 518 | mod | Gold uses WITH-clause to find max format then selects all matching cards. Pred just GROUP BY + LIMIT 1. **Query-structure mis-interpretation**. | no | |
| 35 | +| 930 | simple | Gold uses subquery IN (returns 37 races where Hamilton ranked 1). Pred uses JOIN + ORDER BY ASC + LIMIT 1 (returns single best race). **Semantic mis-interpretation of "highest rank"**. | no | |
| 36 | +| 990 | chal | Pred missed `WHERE results.time LIKE '_:%:__.___'` filter from gold. **WHERE clause omission**. | partially — needs evidence-grounded WHERE | |
| 37 | +| 1144 | simple | Pred uses JOIN, gold uses subquery + LIMIT 1. Pred returns 38 rows (Player_Attributes has 38 rows per player). **Subquery-vs-JOIN issue**. | no | |
| 38 | +| 1205 | mod | Pred has `LIMIT 1`, gold doesn't. Gold returns 67 lab records for patient 57266; pred truncates to 1. **LIMIT mis-interpretation**. | no | |
| 39 | +| 1399 | mod | Gold returns 14 rows (one per attendance match) via CASE WHEN. Pred returns single COUNT > 0 boolean. **Query-structure interpretation** ("Did X attend Y?" → BIRD wants per-attendance-row not single bool). | no | |
| 40 | +| 1404 | mod | Pred groups by `expense.expense_description`, gold groups by `event.type`. Wrong **GROUP BY column source-table**. | **yes — schema linking** | |
| 41 | + |
| 42 | +**Solvable-by-JOIN-path-linker count: ~2** (qid 207, 1404), maybe 2 more partial |
| 43 | +(qid 77, 990 if linker also handles WHERE-column source). |
| 44 | + |
| 45 | +## Realistic ceiling revision |
| 46 | + |
| 47 | +Earlier memory: "P3.F +5–10pp ceiling lift, days to weeks of work".
| 48 | +Reality after audit: **+1–2pp on residue = +0.5–1pp on n=200 EA.** Most of the |
| 49 | +20 row_count_off cases are query-structure mis-interpretations (LIMIT/subquery/CASE |
| 50 | +shape), not JOIN-path choice errors. A schema-linker addresses 2–4 cases out |
| 51 | +of 38 residue. |
| 52 | + |
| 53 | +Combined with other buckets: |
| 54 | +- `distinct_diff_only` (4): would need a DISTINCT rule in the prompt, but the
| 55 | +  errors run in both directions: the same rule would regress qid 407 (where gold
| 56 | +  lacks DISTINCT but pred adds it).
| 57 | +- `set_mismatch` (10), `col_projection_off` (7): not addressed by JOIN-path linker. |
| 58 | + |
| 59 | +**Total realistic chrome-free $0-budget headroom past v11 81.0%:** ≤+2.5pp. |
| 60 | +This matches the upper bound from `v11_saturation_evidence.md` § lower bound |
| 61 | +estimate (binomial CI ≤5% rescue rate across all attempted free-tier voting). |
| 62 | + |
| 63 | +## Design (sketch only, not implemented) |
| 64 | + |
| 65 | +If we did build P3.F: |
| 66 | + |
| 67 | +1. **Foreign-key candidate enumeration.** For each pair of tables (T1, T2) |
| 68 | + in retrieved set, collect ALL FK paths via SQLite `pragma foreign_key_list` |
| 69 | + and via heuristic `T1.X_id ↔ T2.id` matches. Each path has score. |
| 70 | +2. **Question-token grounding.** Map question entities to columns via |
| 71 | + embedding similarity against `column_name + column_description` (already |
| 72 | + in chunker). Drop FK paths whose entity-mapped columns are not on the |
| 73 | + path. |
| 74 | +3. **Re-prompt with candidate JOIN paths as hint.** "For tables {T1, T2, T3}, |
| 75 | + the candidate JOIN paths are: (a) T1.X = T2.X via FK; (b) T1.Y = T3.Y + |
| 76 | + T3.Z = T2.Z indirect. Question 'X' suggests path (a). Use it unless the |
| 77 | + evidence forces (b)." |
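
Step 1 above can be sketched as follows, assuming a SQLite database and Python's standard `sqlite3` module. `PRAGMA foreign_key_list` and `PRAGMA table_info` are real SQLite pragmas. The function names, the two score constants, and the reading of the `T1.X_id ↔ T2.id` heuristic as "a column named `<other_table>_id` points at `<other_table>.id`" are illustrative assumptions, since P3.F is not implemented. The toy schema is made up and is not a BIRD database.

```python
import sqlite3
from itertools import permutations

DECLARED_SCORE = 1.0   # hypothetical weight for FKs declared in the schema
HEURISTIC_SCORE = 0.5  # hypothetical weight for name-based guesses


def table_columns(conn, table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]


def fk_candidates(conn, tables):
    """Return (score, src, src_col, dst, dst_col) tuples, best score first."""
    found = {}
    for src, dst in permutations(tables, 2):
        # Declared FKs: rows are (id, seq, table, from, to, on_update, ...).
        for row in conn.execute(f'PRAGMA foreign_key_list("{src}")'):
            ref_table, from_col, to_col = row[2], row[3], row[4]
            if ref_table.lower() == dst.lower():
                # `to` is NULL when the FK targets the implicit primary key.
                found[(src, from_col, dst, to_col or "<pk>")] = DECLARED_SCORE
        # Heuristic (assumption): src.<dst>_id is taken to point at dst.id.
        if "id" in {c.lower() for c in table_columns(conn, dst)}:
            for col in table_columns(conn, src):
                if col.lower() == f"{dst.lower()}_id":
                    found.setdefault((src, col, dst, "id"), HEURISTIC_SCORE)
    ranked = sorted(found.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(score, *path) for path, score in ranked]


# Toy schema: one declared FK (book -> author) and one undeclared,
# name-implied link (review.book_id -> book.id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (
        id INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES author(id),
        title TEXT
    );
    CREATE TABLE review (id INTEGER PRIMARY KEY, book_id INTEGER, body TEXT);
""")
paths = fk_candidates(conn, ["author", "book", "review"])
for path in paths:
    print(path)
```

Declared keys outrank name-based guesses, so a heuristic match never overrides a schema-declared path between the same columns. Steps 2 and 3 would then filter and render this ranked list.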
| 78 | + |
| 79 | +This is research-grade work. Memory `feedback_no_redraft_after_approval` + |
| 80 | +the realistic +0.5–1pp ceiling argue against starting it without explicit |
| 81 | +user mandate. |
| 82 | + |
| 83 | +## Recommendation |
| 84 | + |
| 85 | +**Don't build P3.F speculatively.** The headline 81.0% v11 + 67.34% corrected-gold |
| 86 | +triplet is portfolio-ready. The marginal +0.5–1pp from a JOIN-path linker |
| 87 | +costs days of work for a number that won't change the narrative. |
| 88 | + |
| 89 | +If user wants past 81% chrome-free, the cheaper paths are: |
| 90 | +1. **Wait for daily quotas to fully reset** (24h+) and re-run llama70b on |
| 91 | + the 21 unattempted qids — expected ≤1 rescue but $0 cost. |
| 92 | +2. **Try `gemini-2.5-pro` (RPD ≥100, 5× higher than 2.5-flash)** via Google |
| 93 | +   AI Studio. New provider on residue, orthogonal model family.
| 94 | +3. **OpenRouter paid $1 top-up** unlocks 1000 free-model requests/day (this is
| 95 | +   not paid model usage; it only lifts the free-tier cap). Could re-run nemotron and
| 96 | +   other free OpenRouter models under the raised daily cap.
| 97 | + |
| 98 | +If user wants past 81% with $1–3 budget: paid Anthropic API Sonnet sweep on |
| 99 | +the 38 residue cases. Memory marks this deprecated, but it offers the most pp per dollar.
| 100 | + |
| 101 | +If user wants research-grade improvement: P3.F design above + custom corrective |
| 102 | +self-consistency (CSC-SQL technique from `docs/bird_sota_research.md`). Multi-day |
| 103 | +work, expected +2–4pp combined. |