| 1 | +# Phase 1 evals: context cost vs. benefit of output schemas + field filtering
| 2 | + |
| 3 | +Small, deterministic scripts to get the numbers we care about, compared across |
| 4 | +**three shippable configurations**: |
| 5 | + |
| 6 | +| Scenario | output schema | `fields` param | what it represents | |
| 7 | +|----------|:---:|:---:|--------------------| |
| 8 | +| **S1 baseline** | ✗ | ✗ | today's behavior (no experiment) | |
| 9 | +| **S2 schema+fields** | ✓ | ✓ | the full experiment | |
| 10 | +| **S3 fields-only** | ✗ | ✓ | hypothesized sweet spot: the model can filter without carrying the heavy schema | |
| 11 | + |
| 12 | +The intuition behind **S3**: the model doesn't need the full output schema to |
| 13 | +filter — it just needs the `fields` param (whose enum already tells it which |
| 14 | +fields exist). So S3 may capture almost all of the benefit at a fraction of the |
| 15 | +fixed cost. |
| 16 | + |
| 17 | +The two numbers we derive: |
| 18 | + |
| 19 | +1. **Fixed tax** — extra tokens added to the `tools/list` payload (paid once at |
| 20 | + client init) by each scenario. |
| 21 | +2. **Per-call savings** — tokens saved when the model filters a tool response to |
| 22 | + a subset of fields. |
| 23 | + |
| 24 | +From those: **break-even calls = fixed tax / average savings per call**, computed
| 25 | +per taxed scenario (S2 and S3).
| 26 | + |
| 27 | +No LLM required for the offline numbers (tokenization falls back to a chars/4 |
| 28 | +proxy if `tiktoken` isn't installed). The online A/B (step 4) runs the three |
| 29 | +scenarios through a real model over realistic multi-tool sessions. |
| 30 | + |
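The tokenizer fallback mentioned above could look roughly like the sketch below. It is not taken from `_tokenize.py`: the function names, the `cl100k_base` encoding, and rounding the chars/4 proxy up are all assumptions made for illustration.

```python
def approx_tokens(text: str) -> int:
    """Rough proxy: about one token per four characters, rounded up (assumed)."""
    return (len(text) + 3) // 4


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken when available, else use the chars/4 proxy."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding(encoding_name)
    except Exception:
        # tiktoken is missing, or its vocabulary could not be loaded offline.
        return approx_tokens(text)
    # disallowed_special=() treats strings like "<|endoftext|>" as plain text
    # instead of raising, which is what we want for arbitrary JSON payloads.
    return len(encoding.encode(text, disallowed_special=()))


if __name__ == "__main__":
    sample = '{"number": 42, "title": "Fix flaky test", "state": "open"}'
    print("tokens:", count_tokens(sample), "proxy:", approx_tokens(sample))
```

Because the proxy and a real tokenizer disagree in absolute terms, compare deltas computed with the same counter rather than mixing the two.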
| 31 | +## Setup |
| 32 | + |
| 33 | +```bash |
| 34 | +cd evals |
| 35 | +python3 -m venv .venv && source .venv/bin/activate |
| 36 | +pip install -r requirements.txt # tiktoken (+ anthropic for the online A/B) |
| 37 | +``` |
| 38 | + |
| 39 | +### Token & secrets |
| 40 | + |
| 41 | +Live tool calls and the online A/B need a real GitHub token. **Never hardcode or |
| 42 | +commit it.** Provide it via the environment only; `.env*` is gitignored: |
| 43 | + |
| 44 | +```bash |
| 45 | +export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx # read-only scope is enough |
| 46 | +``` |
| 47 | + |
| 48 | +A dummy token is used automatically for `capture_tools.py` (it never calls the API). |
| 49 | + |
| 50 | + |
| 51 | +## 1) Fixed tax |
| 52 | + |
| 53 | +Capture the tool list WITH the experiment enabled, then let the analyzer derive |
| 54 | +all three scenarios by stripping `outputSchema` and/or the `fields` property: |
| 55 | + |
| 56 | +```bash |
| 57 | +python3 capture_tools.py --features output_schemas --toolsets all \ |
| 58 | + --out out/tools.treatment.json |
| 59 | + |
| 60 | +python3 fixed_tax.py --tools out/tools.treatment.json --json-out out/fixed_tax.json |
| 61 | +# add --approx if offline without tiktoken |
| 62 | +``` |
| 63 | + |
| 64 | +`fixed_tax.py` prints the payload tokens for each scenario (S1/S2/S3), the fixed |
| 65 | +tax of each vs the S1 baseline, a component breakdown (schema vs `fields`), and a |
| 66 | +per-tool breakdown. `--json-out` writes the per-scenario taxes for step 3. |
| 67 | + |
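As a rough illustration of how the three scenarios can be derived from one captured tool list, the sketch below strips `outputSchema` and/or the `fields` input property and compares payload sizes. It assumes the standard MCP tool shape (`inputSchema.properties`, optional `outputSchema`) and uses a chars/4 proxy; the helper names are hypothetical and `fixed_tax.py` may differ in detail.

```python
import copy
import json


def strip_tool(tool: dict, *, drop_schema: bool, drop_fields: bool) -> dict:
    """Return a copy of one tool definition with the chosen parts removed."""
    out = copy.deepcopy(tool)
    if drop_schema:
        out.pop("outputSchema", None)
    if drop_fields:
        schema = out.get("inputSchema", {})
        schema.get("properties", {}).pop("fields", None)
        if "required" in schema:
            schema["required"] = [r for r in schema["required"] if r != "fields"]
    return out


def payload_tokens(tools: list[dict]) -> int:
    """chars/4 proxy over the compact serialized tool list (an approximation)."""
    text = json.dumps({"tools": tools}, separators=(",", ":"))
    return (len(text) + 3) // 4


def fixed_tax(tools: list[dict]) -> dict[str, int]:
    """Tokens each taxed scenario adds on top of the S1 baseline."""
    s1 = [strip_tool(t, drop_schema=True, drop_fields=True) for t in tools]
    s2 = tools  # captured with the experiment on: schema + fields
    s3 = [strip_tool(t, drop_schema=True, drop_fields=False) for t in tools]
    base = payload_tokens(s1)
    return {"S2": payload_tokens(s2) - base, "S3": payload_tokens(s3) - base}


if __name__ == "__main__":
    # Made-up tool definition, for illustration only.
    demo = [{
        "name": "list_issues",
        "description": "List issues",
        "inputSchema": {"type": "object", "properties": {
            "owner": {"type": "string"},
            "fields": {"type": "array", "items": {
                "type": "string", "enum": ["number", "title", "state"]}},
        }},
        "outputSchema": {"type": "object", "properties": {
            "issues": {"type": "array", "items": {"type": "object"}}}},
    }]
    print(fixed_tax(demo))
```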
| 68 | +> Tip: measure with the `--toolsets` you'd actually ship. The tax is fixed in |
| 69 | +> absolute tokens but its *relative* size shrinks the more tools you expose. |
| 70 | +
| 71 | +## 2) Per-call savings (real data) |
| 72 | + |
| 73 | +Capture real full vs filtered responses for the affected tools, straight from |
| 74 | +live GitHub (read-only), then token-diff them: |
| 75 | + |
| 76 | +```bash |
| 77 | +export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx |
| 78 | +python3 capture_fixtures.py --owner github --repo github-mcp-server --org github |
| 79 | +python3 response_savings.py --fixed-tax-json out/fixed_tax.json |
| 80 | +``` |
| 81 | + |
| 82 | +`capture_fixtures.py` calls each tool twice (no `fields` vs a small subset) and |
| 83 | +writes `fixtures/<tool>.full.json` / `.filtered.json`. Use a busy repo for a |
| 84 | +realistic upper bound, **and** a small repo for the lower bound — report a range. |
| 85 | +An example pair is included so the script also runs offline. |
| 86 | + |
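To make the full-vs-filtered comparison concrete, here is a minimal sketch of projecting a list response onto a subset of fields and measuring the difference. The top-level-key projection, the sample items, and the chars/4 counter are assumptions for illustration; the server's actual filtering and the real fixtures may behave differently.

```python
import json


def filter_fields(items: list[dict], fields: list[str]) -> list[dict]:
    """Keep only the requested top-level keys of each item."""
    wanted = set(fields)
    return [{k: v for k, v in item.items() if k in wanted} for item in items]


def approx_tokens(obj) -> int:
    """chars/4 proxy over compact JSON (stand-in for a real tokenizer)."""
    return (len(json.dumps(obj, separators=(",", ":"))) + 3) // 4


# Made-up response items, not captured data.
full = [
    {"number": 1, "title": "Crash on start", "state": "open",
     "body": "Long description. " * 20, "user": {"login": "octocat"}},
    {"number": 2, "title": "Typo in docs", "state": "closed",
     "body": "Another long description. " * 20, "user": {"login": "hubot"}},
]
filtered = filter_fields(full, ["number", "title"])

saved = approx_tokens(full) - approx_tokens(filtered)
print(f"full={approx_tokens(full)} filtered={approx_tokens(filtered)} saved={saved}")
```

The savings depend heavily on which fields are dropped, which is why the fixtures should cover both a busy and a small repo.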
| 87 | +## 3) Break-even (per scenario) |
| 88 | + |
| 89 | +`response_savings.py --fixed-tax-json out/fixed_tax.json` prints break-even for
| 90 | +both taxed scenarios (S2 and S3):
| 91 | + |
| 92 | +``` |
| 93 | +break_even_calls = scenario_fixed_tax / avg_saved_per_call |
| 94 | +``` |
| 95 | + |
| 96 | +Interpretation: |
| 97 | +- A session with **more** filtered list/search calls than `break_even_calls` is |
| 98 | + net-positive on context for that scenario. |
| 99 | +- **S3 (fields-only)** should carry a far smaller tax than S2, so its break-even
| 100 | + should be correspondingly lower; this is the configuration to scrutinize first.
| 101 | +- Short sessions (few tool calls) are where the fixed tax dominates — call this |
| 102 | + out in the writeup. |
| 103 | + |
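The break-even arithmetic above can be sketched as follows. The tax and savings figures are placeholders chosen to show the calculation, not measured results, and the function name simply mirrors the formula.

```python
import math


def break_even_calls(scenario_fixed_tax: float, avg_saved_per_call: float) -> float:
    """Filtered calls needed before per-call savings repay the fixed tax."""
    if avg_saved_per_call <= 0:
        return math.inf  # filtering never pays back the tax
    return scenario_fixed_tax / avg_saved_per_call


# Placeholder numbers for illustration only.
taxes = {"S2": 12_000, "S3": 1_500}
avg_saved = 800.0

for name, tax in taxes.items():
    calls = break_even_calls(tax, avg_saved)
    print(f"{name}: net-positive after more than {calls:.1f} filtered calls")
```

Returning infinity when the average savings are zero or negative keeps a scenario that never pays off from being reported as a small break-even.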
| 104 | +## 4) Online A/B (Phase 2 — real multi-tool sessions, all 3 scenarios) |
| 105 | + |
| 106 | +Runs the same tasks through a real model across all three scenarios, measuring |
| 107 | +cumulative prompt tokens. This is the only way to confirm the model actually |
| 108 | +*uses* `fields` and to get the true net effect — including whether S3 really is |
| 109 | +the sweet spot. |
| 110 | + |
| 111 | +**Use a model with a large context window.** The harness talks to any
| 112 | +**OpenAI-compatible** endpoint, so you don't need a paid third-party key: |
| 113 | + |
| 114 | +- **GitHub Models** (default) — authenticated with your GitHub token, no extra |
| 115 | + key. Convenient, but the free tier caps requests at **16,000 tokens**, so large |
| 116 | + unfiltered responses error out (`413`). Fine for a smoke test; **not** for the |
| 117 | + headline numbers. |
| 118 | +- **A Copilot / internal proxy** — point `--base-url` at any OpenAI-compatible |
| 119 | + endpoint you already have access to and pass its token via `--api-key-env`. This |
| 120 | + is how to run a large-context model (e.g. `claude-opus-4-6`) with no request cap |
| 121 | + and no out-of-pocket billing. |
| 122 | + |
| 123 | +```bash |
| 124 | +export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx # always: the MCP server uses this |
| 125 | + |
| 126 | +# Smoke test: GitHub Models gpt-5 (16k cap — expect overflow failures on big repos) |
| 127 | +python3 schema_fields_eval.py --model openai/gpt-5 --toolsets issues,pull_requests --repeat 3 |
| 128 | + |
| 129 | +# Recommended: a large-context model via your OpenAI-compatible endpoint |
| 130 | +export COPILOT_TOKEN=... # whatever token that endpoint needs |
| 131 | +python3 schema_fields_eval.py \ |
| 132 | + --base-url https://your-openai-compatible-endpoint/v1 \ |
| 133 | + --api-key-env COPILOT_TOKEN \ |
| 134 | + --model claude-opus-4-6 --toolsets issues,pull_requests --repeat 3 |
| 135 | + |
| 136 | +# --base-url <url> any OpenAI-compatible endpoint |
| 137 | +# --api-key-env VAR env var holding that endpoint's token |
| 138 | +# --repo owner/repo target repo for the tasks (default cli/cli; see below) |
| 139 | +# --tasks-file mytasks.txt one task per line, optionally 'tag<TAB>task' |
| 140 | +``` |
| 141 | + |
| 142 | +> Target a readable repo: the tasks run against a **live** repo, so point `--repo` |
| 143 | +> at a large, **public, SAML-free** repo a plain PAT can read (default `cli/cli`). |
| 144 | +> If you aim at a SAML-protected org repo (e.g. `github/github-mcp-server`), every |
| 145 | +> call 403s, the model only ever sees tiny error payloads, and the `fields` arms |
| 146 | +> look like pure overhead because there's nothing to filter — the experiment then |
| 147 | +> measures only the fixed schema/param tax, not the filtering payoff. Such runs |
| 148 | +> are now flagged as failures (a tool returning `isError`) and excluded from the |
| 149 | +> token comparison rather than silently counted as valid. |
| 150 | +
| 151 | +The server is booted with `--features output_schemas` so the **S2** arm has a real |
| 152 | +schema to embed; the `fields` param and server-side filtering are present in every |
| 153 | +arm regardless, so only what each arm shows the *model* differs. |
| 154 | + |
| 155 | +It prints, per scenario: cumulative prompt/completion tokens, tool-call counts, |
| 156 | +`fields` adoption, the net delta vs the S1 baseline, and a **per-task-type |
| 157 | +breakdown** (narrow / full / neutral) so you can see *where* each config helps. |
| 158 | +Only task-runs where all three arms succeeded count toward the token comparison.
| 159 | +On the capped GitHub Models endpoint, this means the biggest filtering wins (tasks
| 160 | +where the unfiltered baseline overflowed) show up in the failure counts, not the
| 161 | +token table, which is another reason to use a large-context endpoint. Use
| 162 | +`--repeat 3` or higher to average out model nondeterminism. Per-run detail is
| 163 | +written to `out/schema_fields_eval.jsonl`.
| 164 | + |
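The "all three arms succeeded" rule can be sketched as below. The record shape (`task`, `repeat`, `scenario`, `ok`, `prompt_tokens`) is hypothetical and is not the documented layout of `out/schema_fields_eval.jsonl`; adapt the keys to whatever the harness actually writes.

```python
from collections import defaultdict

SCENARIOS = ("S1", "S2", "S3")


def net_delta_vs_baseline(records: list[dict]) -> dict[str, int]:
    """Sum prompt tokens over complete task-runs and diff each arm against S1."""
    runs: dict[tuple, dict[str, dict]] = defaultdict(dict)
    for rec in records:
        runs[(rec["task"], rec["repeat"])][rec["scenario"]] = rec

    totals = {s: 0 for s in SCENARIOS}
    for arms in runs.values():
        complete = all(s in arms and arms[s]["ok"] for s in SCENARIOS)
        if not complete:
            continue  # a failed arm excludes the whole task-run from the comparison
        for s in SCENARIOS:
            totals[s] += arms[s]["prompt_tokens"]

    return {s: totals[s] - totals["S1"] for s in SCENARIOS if s != "S1"}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    demo = [
        {"task": "t1", "repeat": 0, "scenario": "S1", "ok": True, "prompt_tokens": 1000},
        {"task": "t1", "repeat": 0, "scenario": "S2", "ok": True, "prompt_tokens": 1200},
        {"task": "t1", "repeat": 0, "scenario": "S3", "ok": True, "prompt_tokens": 700},
        {"task": "t2", "repeat": 0, "scenario": "S1", "ok": False, "prompt_tokens": 0},
        {"task": "t2", "repeat": 0, "scenario": "S2", "ok": True, "prompt_tokens": 500},
        {"task": "t2", "repeat": 0, "scenario": "S3", "ok": True, "prompt_tokens": 400},
    ]
    print(net_delta_vs_baseline(demo))
```

A negative delta means the arm used fewer cumulative prompt tokens than the baseline over the task-runs that all three arms completed.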
| 165 | +> Task design matters: the default tasks are intentionally **neutral** (they do |
| 166 | +> not tell the model to "return only X"). Biasing prompts toward terse answers |
| 167 | +> would inflate the filtering arms. Keep a balanced mix of narrow/full/neutral. |
| 168 | +
| 169 | +> Cost control: the default toolsets are narrow on purpose. The relevant |
| 170 | +> differences live in the affected tools, so you don't need all 79 tools loaded |
| 171 | +> each turn. Use `fixed_tax.py` (all toolsets) for the init-tax number and the |
| 172 | +> online run for the savings/net dynamic. |
| 173 | +
| 174 | + |
| 175 | + |
| 176 | +## Honesty notes |
| 177 | + |
| 178 | +- Tokenizers differ across providers; report **deltas** and state the tokenizer. |
| 179 | +- Step 2 assumes the model actually uses `fields`. That adoption rate can only be |
| 180 | + confirmed by the Phase 2 online A/B — Phase 1 is an upper bound on benefit. |
| 181 | +- Real response sizes vary a lot by repo; capture fixtures from both a small and a |
| 182 | + large/busy repo and report a range, not a single number. |
| 183 | +- The `fields` param and the server-side filtering are **not** gated by the |
| 184 | + `output_schemas` feature flag in the server — only the `outputSchema` and the |
| 185 | + response's `structuredContent` are. So S1 (baseline) here means "pre-experiment |
| 186 | + main", and "flag off" in production today would still ship the `fields` param. |
| 187 | + Reconcile the scenario you measure with the toggle you'd actually ship. |
| 188 | +- With output schemas on, each tool result also carries a `structuredContent` |
| 189 | + duplicate of the payload. The online A/B forwards only the text content to the |
| 190 | + model (so all arms see identical response bytes); a client that also feeds |
| 191 | + `structuredContent` to the model would pay more in the S2 arm. State this |
| 192 | + assumption when you report. |
| 193 | + |
| 194 | +## Files |
| 195 | + |
| 196 | +| File | Purpose | |
| 197 | +|------|---------| |
| 198 | +| `capture_tools.py` | Boot server over stdio, dump `tools/list` result | |
| 199 | +| `fixed_tax.py` | Per-scenario token-diff (S1/S2/S3); `--json-out` for break-even | |
| 200 | +| `capture_fixtures.py` | Capture real full/filtered tool responses (live GitHub) | |
| 201 | +| `response_savings.py` | Token-diff full vs filtered responses; per-scenario break-even | |
| 202 | +| `schema_fields_eval.py` | 3-scenario (S1/S2/S3) multi-tool agent eval, prompt-token accounting |
| 203 | +| `_mcp_client.py` | Shared MCP stdio client | |
| 204 | +| `_tokenize.py` | Tokenizer helper (tiktoken or chars/4 fallback) | |
| 205 | +| `fixtures/` | Response pairs (example + captured) | |
| 206 | + |