62 | 62 | "metric_caption": "30 dev + 30 held-out, balanced split, all ten query categories at 100% on the free-tier codestral pipeline.", |
63 | 63 | "research_kicker": "BIRD Mini-Dev research benchmark", |
64 | 64 | "research_value": "85.0% / 200", |
65 | | - "research_caption": "Hybrid pipeline: codestral + Sonnet on challenging tier + cross-provider voting + grounded-critique directed retry + Sonnet 4.6 bridge + M-Schema compact serialization + CHASE-SQL divide-and-conquer + Perplexity Pro multi-model voting (Grok 4.1 + GPT-5.2 + Claude 4.5 Sonnet) + reasoning-mode variants (grok-4.1-reasoning + gpt-5.2-thinking + kimi-k2-thinking) + DAC×reasoning combo on residue. After day-5 evening audit of the gold-runner SQLAlchemy `:identifier` bind-bug (BIRD qids 959/989/990) — net −1 from claimed 85.5%. +37.2pp over the GPT-4 zero-shot reference (47.8%), $0 external cost. On Arcwise-Plat corrected gold (Jin et al., CIDR 2026): 67.34% — honest noise-floor after BIRD annotation fixes; +6 cases where our pred catches BIRD's wrong gold.", |
| 65 | + "research_caption": ( |
| 66 | + "Hybrid pipeline: " |
| 67 | + "<span class='nl-term' title='Mistral codestral-latest — SQL-specialised generation model, free tier'>codestral</span> + " |
| 68 | + "<span class='nl-term' title='Anthropic Claude 4.5 Sonnet via Perplexity Pro browser bridge — used on the hard tier'>Sonnet 4.6 bridge</span> + " |
| 69 | + "<span class='nl-term' title='Per-failure re-prompt with executable-shape feedback — only on frozen failures, no T=0 noise'>grounded-critique retry</span> + " |
| 70 | + "<span class='nl-term' title='helallao reverse-engineered HTTPS bridge to Perplexity backend — Grok 4.1, GPT-5.2, Claude 4.5 Sonnet, kimi-k2-thinking, reasoning + Pro modes'>helallao multi-model voting</span>. " |
| 71 | + "+37.2pp over the GPT-4 zero-shot reference (47.8%), $0 external cost. " |
| 72 | + "On <span class='nl-term' title='Jin et al., CIDR/VLDB 2026, arXiv:2601.08778 — corrected BIRD gold annotations'>Arcwise-Plat corrected gold</span>: 67.34% — honest noise-floor; +6 cases where our prediction catches BIRD's own wrong gold. " |
| 73 | + "After the day-5 audit (SQLAlchemy `:identifier` bind-bug fix in `_execute_gold` — affected BIRD qids 959 / 989 / 990) the claimed 85.5% was honestly restated to 85.0%." |
| 74 | + ), |
66 | 75 | "settings_header": "Settings", |
67 | 76 | "db_label": "Database", |
68 | 77 | "db_dialect": "Dialect", |
|
132 | 141 | "metric_caption": "30 dev + 30 held-out, сбалансированный сплит, все десять категорий запросов на 100% через бесплатный codestral.", |
133 | 142 | "research_kicker": "Исследовательский бенчмарк BIRD Mini-Dev", |
134 | 143 | "research_value": "85.0% / 200", |
135 | | - "research_caption": "Гибрид: codestral + Sonnet на challenging-тире + кросс-провайдер voting + grounded-critique directed retry + Sonnet 4.6 bridge + компактная M-Schema + CHASE-SQL divide-and-conquer + Perplexity Pro multi-model voting (Grok 4.1 + GPT-5.2 + Claude 4.5 Sonnet) + reasoning-режим (grok-4.1-reasoning + gpt-5.2-thinking + kimi-k2-thinking) + DAC×reasoning комбо на residue. После day-5 evening аудита SQLAlchemy `:identifier` bind-bug в gold-runner (BIRD qids 959/989/990) — net −1 от заявленных 85.5%. +37.2 п.п. над zero-shot GPT-4 (47.8%), внешние расходы — ноль. На исправленном gold Arcwise-Plat (Jin et al., CIDR 2026) — 67.34%, честный noise-floor после правки аннотаций BIRD; +6 случаев, где наш pred правильнее эталона BIRD.", |
| 144 | + "research_caption": ( |
| 145 | + "Гибридный пайплайн: " |
| 146 | + "<span class='nl-term' title='Mistral codestral-latest — модель, специализированная под генерацию SQL, бесплатный тариф'>codestral</span> + " |
| 147 | + "<span class='nl-term' title='Anthropic Claude 4.5 Sonnet через браузерный мост Perplexity Pro — на сложных кейсах'>мост к Sonnet 4.6</span> + " |
| 148 | + "<span class='nl-term' title='Повторный prompt со shape-фидбэком исполнения — только на зафиксированных фейлах, без шума T=0'>directed-critique retry</span> + " |
| 149 | + "<span class='nl-term' title='Реверс-инжиниринг HTTPS моста к бэкенду Perplexity — Grok 4.1, GPT-5.2, Claude 4.5 Sonnet, kimi-k2-thinking; режимы reasoning + Pro'>multi-model voting через helallao</span>. " |
| 150 | + "+37,2 п.п. над zero-shot GPT-4 (47,8%), внешние расходы — ноль. " |
| 151 | + "На <span class='nl-term' title='Jin et al., CIDR/VLDB 2026, arXiv:2601.08778 — исправленные аннотации gold BIRD'>исправленном gold Arcwise-Plat</span>: 67,34% — честный noise-floor; +6 случаев, где наш ответ правильнее эталона BIRD. " |
| 152 | + "После day-5 evening аудита (фикс SQLAlchemy `:identifier` bind-bug в `_execute_gold` — затронуты BIRD qids 959 / 989 / 990) заявленные 85,5% честно пересчитаны в 85,0%." |
| 153 | + ), |
136 | 154 | "settings_header": "Настройки", |
137 | 155 | "db_label": "База данных", |
138 | 156 | "db_dialect": "Диалект", |
@@ -458,6 +476,16 @@ def _source_link_for(db_id: str) -> tuple[str, str] | None: |
458 | 476 | line-height: 1.55; |
459 | 477 | max-width: 62ch; |
460 | 478 | } |
| 479 | +.nl-term { |
| 480 | + border-bottom: 1px dotted var(--ink-mute); |
| 481 | + cursor: help; |
| 482 | + text-decoration: none; |
| 483 | + color: inherit; |
| 484 | +} |
| 485 | +.nl-term:hover { |
| 486 | + border-bottom-color: var(--ink); |
| 487 | + color: var(--ink); |
| 488 | +} |
461 | 489 |
|
462 | 490 | /* Section rule */ |
463 | 491 | .nl-section-label { |
|
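Review note: the figures quoted in `research_caption` can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not part of the changed file. It assumes the benchmark size is the 200 shown in `research_value`, that the audit changed the tally by one case net (as the removed caption line states), and that 47.8% is the GPT-4 zero-shot reference. The helper names `correct_count` and `restated_score` are hypothetical.

```python
# Sanity check for the numbers in research_caption (illustrative only).
# Assumptions: 200 questions (from research_value), net change of -1 case
# after the audit, GPT-4 zero-shot reference of 47.8%.

TOTAL_QUESTIONS = 200
CLAIMED_PCT = 85.5
NET_AUDIT_CHANGE = -1
GPT4_ZERO_SHOT_PCT = 47.8


def correct_count(pct: float, total: int) -> int:
    """Convert an accuracy percentage into a whole number of correct cases."""
    return round(pct / 100 * total)


def restated_score(claimed_pct: float, net_change: int, total: int) -> float:
    """Apply a net change in correct cases and return the new percentage."""
    return 100 * (correct_count(claimed_pct, total) + net_change) / total


restated_pct = restated_score(CLAIMED_PCT, NET_AUDIT_CHANGE, TOTAL_QUESTIONS)
delta_pp = round(restated_pct - GPT4_ZERO_SHOT_PCT, 1)

print(f"claimed correct:  {correct_count(CLAIMED_PCT, TOTAL_QUESTIONS)}")
print(f"restated score:   {restated_pct:.1f}%")
print(f"delta vs. GPT-4:  +{delta_pp} pp")
```

Under these assumptions the restated score and the percentage-point delta agree with the caption, so the English and Russian strings are numerically consistent with `research_value`.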