Commit 0c7a33b

Add online multi-tool evals for 3 scenarios: (i) no output schema and no `fields` param, (ii) output schema and `fields` param, (iii) no output schema but with the `fields` param
1 parent b901f13 commit 0c7a33b

20 files changed

Lines changed: 1835 additions & 4 deletions

README.md

Lines changed: 4 additions & 0 deletions
@@ -1237,6 +1237,7 @@ The following sets of tools are available:

 - **list_branches** - List branches
   - **Required OAuth Scopes**: `repo`
+  - `fields`: Subset of branch fields to return for each branch. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
   - `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
@@ -1245,6 +1246,7 @@ The following sets of tools are available:
 - **list_commits** - List commits
   - **Required OAuth Scopes**: `repo`
   - `author`: Author username or email address to filter commits by (string, optional)
+  - `fields`: Subset of commit fields to return for each commit. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
   - `path`: Only commits containing this file path will be returned (string, optional)
@@ -1256,13 +1258,15 @@ The following sets of tools are available:

 - **list_releases** - List releases
   - **Required OAuth Scopes**: `repo`
+  - `fields`: Subset of release fields to return for each release. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
   - `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
   - `repo`: Repository name (string, required)

 - **list_tags** - List tags
   - **Required OAuth Scopes**: `repo`
+  - `fields`: Subset of tag fields to return for each tag. If omitted, all fields are returned. Use this to reduce response size when you only need specific fields. (string[], optional)
   - `owner`: Repository owner (string, required)
   - `page`: Page number for pagination (min 1) (number, optional)
   - `perPage`: Results per page for pagination (min 1, max 100) (number, optional)
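
To make the new parameter concrete, here is a minimal sketch of a JSON-RPC `tools/call` request that passes `fields` to `list_branches`. The helper `build_tool_call` and the field names `name` and `protected` are assumptions for illustration only; the values a server actually accepts are whatever the tool's `fields` enum advertises.

```python
import json


def build_tool_call(req_id: int, name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "list_branches",
    {
        "owner": "cli",
        "repo": "cli",
        "perPage": 10,
        # Hypothetical field names; check the tool's `fields` enum.
        "fields": ["name", "protected"],
    },
)
print(json.dumps(request, indent=2))
```

Omitting the `fields` key from `arguments` corresponds to the documented default of returning all fields.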

evals/.gitignore

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
.venv/
__pycache__/
out/
.env
.env.local
# Captured fixtures contain real (large) GitHub data; keep only the example committed.
fixtures/*.json
!fixtures/example_*.json

evals/README.md

Lines changed: 206 additions & 0 deletions
@@ -0,0 +1,206 @@
# Phase 1 evals: context cost vs. benefit of output schemas + field filtering

Small, deterministic scripts to get the numbers we care about, compared across
**three shippable configurations**:

| Scenario | output schema | `fields` param | what it represents |
|----------|:---:|:---:|--------------------|
| **S1 baseline** | no | no | today's behavior (no experiment) |
| **S2 schema+fields** | yes | yes | the full experiment |
| **S3 fields-only** | no | yes | hypothesized sweet spot: the model can filter without carrying the heavy schema |

The intuition behind **S3**: the model doesn't need the full output schema to
filter; it just needs the `fields` param (whose enum already tells it which
fields exist). So S3 may capture almost all of the benefit at a fraction of the
fixed cost.

The two numbers we derive:

1. **Fixed tax**: extra tokens added to the `tools/list` payload (paid once at
   client init) by each scenario.
2. **Per-call savings**: tokens saved when the model filters a tool response to
   a subset of fields.

From those: **break_even_calls = fixed_tax / avg_saved_per_call**, computed per
scenario.

No LLM required for the offline numbers (tokenization falls back to a chars/4
proxy if `tiktoken` isn't installed). The online A/B (step 4) runs the three
scenarios through a real model over realistic multi-tool sessions.

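The tokenizer fallback described above can be sketched as follows. This is not the contents of `_tokenize.py`, which this README does not show; the function name `count_tokens`, the `cl100k_base` encoding, and rounding up in the chars/4 proxy are all assumptions.

```python
import math


def count_tokens(text: str, approx: bool = False) -> int:
    """Count tokens with tiktoken when available, else use a chars/4 proxy."""
    if not approx:
        try:
            import tiktoken  # optional dependency

            # Assumed encoding; pick the one closest to the model you report on.
            enc = tiktoken.get_encoding("cl100k_base")
            # disallowed_special=() treats special-token text as plain text.
            return len(enc.encode(text, disallowed_special=()))
        except Exception:
            # Missing package, or the encoding could not be loaded offline.
            pass
    return math.ceil(len(text) / 4)


print(count_tokens("hello world", approx=True))
```

Because the proxy and a real tokenizer can disagree noticeably on JSON-heavy payloads, it is worth stating which path produced any number you report.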
## Setup

```bash
cd evals
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt # tiktoken (+ anthropic for the online A/B)
```

### Token & secrets

Live tool calls and the online A/B need a real GitHub token. **Never hardcode or
commit it.** Provide it via the environment only; `.env` and `.env.local` are
gitignored:

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx # read-only scope is enough
```

A dummy token is used automatically for `capture_tools.py` (it never calls the API).

## 1) Fixed tax

Capture the tool list WITH the experiment enabled, then let the analyzer derive
all three scenarios by stripping `outputSchema` and/or the `fields` property:

```bash
python3 capture_tools.py --features output_schemas --toolsets all \
  --out out/tools.treatment.json

python3 fixed_tax.py --tools out/tools.treatment.json --json-out out/fixed_tax.json
# add --approx if offline without tiktoken
```

`fixed_tax.py` prints the payload tokens for each scenario (S1/S2/S3), the fixed
tax of each vs the S1 baseline, a component breakdown (schema vs `fields`), and a
per-tool breakdown. `--json-out` writes the per-scenario taxes for step 3.

> Tip: measure with the `--toolsets` you'd actually ship. The tax is fixed in
> absolute tokens but its *relative* size shrinks the more tools you expose.

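The scenario derivation in step 1 can be sketched as below. This is an illustration under stated assumptions, not the code in `fixed_tax.py`: the helper names, the sample tool definition, and the chars/4 token proxy are all hypothetical.

```python
import copy
import json
import math


def derive_scenario(tools: list[dict], keep_schema: bool, keep_fields: bool) -> list[dict]:
    """Return a copy of the tool list with outputSchema and/or `fields` stripped."""
    out = copy.deepcopy(tools)
    for tool in out:
        if not keep_schema:
            tool.pop("outputSchema", None)
        if not keep_fields:
            tool.get("inputSchema", {}).get("properties", {}).pop("fields", None)
    return out


def approx_tokens(tools: list[dict]) -> int:
    """chars/4 proxy over the compact JSON payload (assumed tokenizer stand-in)."""
    return math.ceil(len(json.dumps({"tools": tools}, separators=(",", ":"))) / 4)


# Hypothetical captured tool definition with the experiment enabled.
captured = [{
    "name": "list_branches",
    "description": "List branches",
    "inputSchema": {
        "type": "object",
        "properties": {
            "owner": {"type": "string"},
            "repo": {"type": "string"},
            "fields": {"type": "array",
                       "items": {"type": "string", "enum": ["name", "protected"]}},
        },
        "required": ["owner", "repo"],
    },
    "outputSchema": {
        "type": "object",
        "properties": {"name": {"type": "string"}, "protected": {"type": "boolean"}},
    },
}]

scenarios = {
    "S1": derive_scenario(captured, keep_schema=False, keep_fields=False),
    "S2": derive_scenario(captured, keep_schema=True, keep_fields=True),
    "S3": derive_scenario(captured, keep_schema=False, keep_fields=True),
}
baseline = approx_tokens(scenarios["S1"])
fixed_tax = {name: approx_tokens(tools) - baseline for name, tools in scenarios.items()}
print(fixed_tax)
```

Deriving all three payloads from one capture keeps everything else identical, so the difference between scenarios is attributable only to the stripped parts.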
## 2) Per-call savings (real data)

Capture real full vs filtered responses for the affected tools, straight from
live GitHub (read-only), then token-diff them:

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx
python3 capture_fixtures.py --owner github --repo github-mcp-server --org github
python3 response_savings.py --fixed-tax-json out/fixed_tax.json
```

`capture_fixtures.py` calls each tool twice (no `fields` vs a small subset) and
writes `fixtures/<tool>.full.json` / `.filtered.json`. Use a busy repo for a
realistic upper bound **and** a small repo for the lower bound, then report the
range. An example pair is included so the script also runs offline.

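The full-versus-filtered comparison in step 2 amounts to a token diff between two payloads. The sketch below mimics that offline with made-up branch data; in the real setup the filtering happens server-side, and the helper names, sample records, and chars/4 proxy here are assumptions rather than the logic of `response_savings.py`.

```python
import json
import math


def filter_fields(items: list[dict], fields: list[str]) -> list[dict]:
    """Keep only the requested top-level keys of each item."""
    return [{k: item[k] for k in fields if k in item} for item in items]


def approx_tokens(payload) -> int:
    """chars/4 proxy over compact JSON (assumed tokenizer stand-in)."""
    return math.ceil(len(json.dumps(payload, separators=(",", ":"))) / 4)


# Made-up records standing in for a captured `.full.json` fixture.
full = [
    {"name": "main", "protected": True,
     "commit": {"sha": "abc123", "url": "https://example.invalid/commits/abc123"}},
    {"name": "dev", "protected": False,
     "commit": {"sha": "def456", "url": "https://example.invalid/commits/def456"}},
]
filtered = filter_fields(full, ["name"])
saved = approx_tokens(full) - approx_tokens(filtered)
print(f"full={approx_tokens(full)} filtered={approx_tokens(filtered)} saved={saved}")
```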
## 3) Break-even (per scenario)

`response_savings.py --fixed-tax-json out/fixed_tax.json` prints break-even for
both taxed scenarios:

```
break_even_calls = scenario_fixed_tax / avg_saved_per_call
```

Interpretation:
- A session with **more** filtered list/search calls than `break_even_calls` is
  net-positive on context for that scenario.
- **S3 (fields-only)** has a far smaller tax than S2, so its break-even is tiny.
  This is the configuration to scrutinize first.
- Short sessions (few tool calls) are where the fixed tax dominates. Call this
  out in the writeup.

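A worked example of the break-even formula, using made-up numbers rather than measured results. The function names and the sample taxes (3000 tokens for S2, 400 for S3) and savings (800 tokens per call) are purely illustrative.

```python
import math


def break_even_calls(fixed_tax: float, avg_saved_per_call: float) -> float:
    """Calls needed before cumulative savings cover the fixed tax."""
    if avg_saved_per_call <= 0:
        return math.inf  # filtering never pays the tax back
    return fixed_tax / avg_saved_per_call


def net_tokens(fixed_tax: float, avg_saved_per_call: float, n_calls: int) -> float:
    """Net context saved after n filtered calls (negative means net cost)."""
    return avg_saved_per_call * n_calls - fixed_tax


# Illustrative inputs only; substitute the values from out/fixed_tax.json.
for scenario, tax in {"S2": 3000, "S3": 400}.items():
    print(scenario, break_even_calls(tax, 800), net_tokens(tax, 800, 3))
```

With these inputs a three-call session is still net-negative for S2 but already net-positive for S3, which is the short-session effect the interpretation list describes.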
## 4) Online A/B (Phase 2: real multi-tool sessions, all 3 scenarios)

Runs the same tasks through a real model across all three scenarios, measuring
cumulative prompt tokens. This is the only way to confirm the model actually
*uses* `fields` and to get the true net effect, including whether S3 really is
the sweet spot.

**Use a model with a large context window.** The harness talks to any
**OpenAI-compatible** endpoint, so you don't need a paid third-party key:

- **GitHub Models** (default): authenticated with your GitHub token, no extra
  key. Convenient, but the free tier caps requests at **16,000 tokens**, so large
  unfiltered responses error out (`413`). Fine for a smoke test; **not** for the
  headline numbers.
- **A Copilot / internal proxy**: point `--base-url` at any OpenAI-compatible
  endpoint you already have access to and pass its token via `--api-key-env`. This
  is how to run a large-context model (e.g. `claude-opus-4-6`) with no request cap
  and no out-of-pocket billing.

```bash
export GITHUB_PERSONAL_ACCESS_TOKEN=ghp_xxx # always: the MCP server uses this

# Smoke test: GitHub Models gpt-5 (16k cap; expect overflow failures on big repos)
python3 schema_fields_eval.py --model openai/gpt-5 --toolsets issues,pull_requests --repeat 3

# Recommended: a large-context model via your OpenAI-compatible endpoint
export COPILOT_TOKEN=... # whatever token that endpoint needs
python3 schema_fields_eval.py \
  --base-url https://your-openai-compatible-endpoint/v1 \
  --api-key-env COPILOT_TOKEN \
  --model claude-opus-4-6 --toolsets issues,pull_requests --repeat 3

# --base-url <url> any OpenAI-compatible endpoint
# --api-key-env VAR env var holding that endpoint's token
# --repo owner/repo target repo for the tasks (default cli/cli; see below)
# --tasks-file mytasks.txt one task per line, optionally 'tag<TAB>task'
```

> Target a readable repo: the tasks run against a **live** repo, so point `--repo`
> at a large, **public, SAML-free** repo a plain PAT can read (default `cli/cli`).
> If you aim at a SAML-protected org repo (e.g. `github/github-mcp-server`), every
> call returns a 403, the model only ever sees tiny error payloads, and the
> `fields` arms look like pure overhead because there's nothing to filter. The
> experiment then measures only the fixed schema/param tax, not the filtering
> payoff. Such runs are flagged as failures (a tool returning `isError`) and
> excluded from the token comparison rather than silently counted as valid.

The server is booted with `--features output_schemas` so the **S2** arm has a real
schema to embed; the `fields` param and server-side filtering are present in every
arm regardless, so only what each arm shows the *model* differs.

It prints, per scenario: cumulative prompt/completion tokens, tool-call counts,
`fields` adoption, the net delta vs the S1 baseline, and a **per-task-type
breakdown** (narrow / full / neutral) so you can see *where* each config helps.
Only task-runs where all three arms succeeded count toward the token comparison.
On the capped GitHub Models endpoint, this means the biggest filtering wins (tasks
where the unfiltered baseline overflowed) show up in the failure counts, not the
token table, which is another reason to use a large-context endpoint. Use
`--repeat 3` or higher to average out model nondeterminism. Per-run detail is
written to `out/schema_fields_eval.jsonl`.

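The rule that only task-runs where all three arms succeeded are counted can be sketched as follows. The record layout (`task`, `repeat`, `arm`, `ok`, `prompt_tokens`) is hypothetical; the actual schema of `out/schema_fields_eval.jsonl` is not documented here.

```python
from collections import defaultdict

ARMS = ("S1", "S2", "S3")


def net_delta_vs_baseline(runs: list[dict]) -> tuple[dict[str, int], int]:
    """Sum prompt tokens per arm over task-runs where every arm succeeded."""
    by_task_run: dict[tuple, dict[str, dict]] = defaultdict(dict)
    for run in runs:
        by_task_run[(run["task"], run["repeat"])][run["arm"]] = run

    totals = {arm: 0 for arm in ARMS}
    counted = 0
    for arms in by_task_run.values():
        # Skip the whole task-run if any arm is missing or failed.
        if not all(arm in arms and arms[arm]["ok"] for arm in ARMS):
            continue
        counted += 1
        for arm in ARMS:
            totals[arm] += arms[arm]["prompt_tokens"]
    return {arm: totals[arm] - totals["S1"] for arm in ARMS}, counted


# Made-up records: task t2 is excluded because its S1 arm failed.
runs = [
    {"task": "t1", "repeat": 0, "arm": "S1", "ok": True, "prompt_tokens": 10000},
    {"task": "t1", "repeat": 0, "arm": "S2", "ok": True, "prompt_tokens": 9000},
    {"task": "t1", "repeat": 0, "arm": "S3", "ok": True, "prompt_tokens": 8000},
    {"task": "t2", "repeat": 0, "arm": "S1", "ok": False, "prompt_tokens": 0},
    {"task": "t2", "repeat": 0, "arm": "S2", "ok": True, "prompt_tokens": 5000},
    {"task": "t2", "repeat": 0, "arm": "S3", "ok": True, "prompt_tokens": 4000},
]
deltas, counted = net_delta_vs_baseline(runs)
print(deltas, counted)
```

Dropping the whole task-run rather than just the failed arm keeps the per-arm totals comparable, at the cost of hiding exactly the cases where the baseline overflowed.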
> Task design matters: the default tasks are intentionally **neutral** (they do
> not tell the model to "return only X"). Biasing prompts toward terse answers
> would inflate the filtering arms. Keep a balanced mix of narrow/full/neutral.

> Cost control: the default toolsets are narrow on purpose. The relevant
> differences live in the affected tools, so you don't need all 79 tools loaded
> each turn. Use `fixed_tax.py` (all toolsets) for the init-tax number and the
> online run for the savings/net dynamic.

## Honesty notes

- Tokenizers differ across providers; report **deltas** and state the tokenizer.
- Step 2 assumes the model actually uses `fields`. That adoption rate can only be
  confirmed by the Phase 2 online A/B, so Phase 1 is an upper bound on benefit.
- Real response sizes vary a lot by repo; capture fixtures from both a small and a
  large/busy repo and report a range, not a single number.
- The `fields` param and the server-side filtering are **not** gated by the
  `output_schemas` feature flag in the server; only the `outputSchema` and the
  response's `structuredContent` are. So S1 (baseline) here means "pre-experiment
  main", and "flag off" in production today would still ship the `fields` param.
  Reconcile the scenario you measure with the toggle you'd actually ship.
- With output schemas on, each tool result also carries a `structuredContent`
  duplicate of the payload. The online A/B forwards only the text content to the
  model (so all arms see identical response bytes); a client that also feeds
  `structuredContent` to the model would pay more in the S2 arm. State this
  assumption when you report.

## Files

| File | Purpose |
|------|---------|
| `capture_tools.py` | Boot server over stdio, dump `tools/list` result |
| `fixed_tax.py` | Per-scenario token-diff (S1/S2/S3); `--json-out` for break-even |
| `capture_fixtures.py` | Capture real full/filtered tool responses (live GitHub) |
| `response_savings.py` | Token-diff full vs filtered responses; per-scenario break-even |
| `schema_fields_eval.py` | 3-scenario (S1/S2/S3) multi-tool agent eval, prompt-token accounting |
| `_mcp_client.py` | Shared MCP stdio client |
| `_tokenize.py` | Tokenizer helper (tiktoken or chars/4 fallback) |
| `fixtures/` | Response pairs (example + captured) |

evals/_mcp_client.py

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
#!/usr/bin/env python3
"""Minimal MCP stdio client used by the eval scripts.

Spawns the GitHub MCP server, performs the MCP handshake, and exposes
`list_tools()` and `call_tool()`. stdout is newline-delimited JSON-RPC; server
logs go to stderr.

Security: never hardcode a token. The token is read from the process environment
(GITHUB_PERSONAL_ACCESS_TOKEN). A dummy non-`ghp_` token is used only as a
fallback so `tools/list` works offline without hitting GitHub.
"""

from __future__ import annotations

import json
import os
import select
import shlex
import subprocess
import sys
from pathlib import Path
from typing import Any

REPO_ROOT = Path(__file__).resolve().parent.parent
PROTOCOL_VERSION = "2025-06-18"


def text_content(result: dict) -> str:
    """Concatenate the text parts of a tools/call result."""
    parts = [c.get("text", "") for c in result.get("content", []) if c.get("type") == "text"]
    return "".join(parts)


class MCPServer:
    """Context manager that runs the server as a subprocess and talks JSON-RPC to it."""

    def __init__(
        self,
        server_cmd: str = "go run ./cmd/github-mcp-server stdio",
        extra_args: list[str] | None = None,
        env: dict[str, str] | None = None,
        cwd: Path = REPO_ROOT,
        timeout: float = 180.0,
    ) -> None:
        self.cmd = shlex.split(server_cmd) + list(extra_args or [])
        self.env = {**os.environ, **(env or {})}
        # Fallback only: a real token in the environment always wins.
        self.env.setdefault("GITHUB_PERSONAL_ACCESS_TOKEN", "dummy_token_no_network")
        self.cwd = str(cwd)
        self.timeout = timeout
        self.proc: subprocess.Popen | None = None
        self._id = 0

    def __enter__(self) -> "MCPServer":
        self.start()
        return self

    def __exit__(self, *_: Any) -> None:
        self.close()

    def start(self) -> None:
        """Spawn the server and complete the MCP initialize handshake."""
        print(f"[mcp] starting: {' '.join(self.cmd)}", file=sys.stderr)
        self.proc = subprocess.Popen(
            self.cmd,
            cwd=self.cwd,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=sys.stderr,
            text=True,
            env=self.env,
        )
        self._request(
            "initialize",
            {
                "protocolVersion": PROTOCOL_VERSION,
                "capabilities": {},
                "clientInfo": {"name": "evals", "version": "0"},
            },
        )
        self._notify("notifications/initialized")

    # -- JSON-RPC plumbing -------------------------------------------------
    def _send(self, payload: dict) -> None:
        assert self.proc and self.proc.stdin
        self.proc.stdin.write(json.dumps(payload) + "\n")
        self.proc.stdin.flush()

    def _notify(self, method: str, params: dict | None = None) -> None:
        self._send({"jsonrpc": "2.0", "method": method, "params": params or {}})

    def _read(self) -> dict:
        """Return the next JSON message from the server's stdout."""
        assert self.proc and self.proc.stdout
        while True:
            # Caveat: select() on a pipe is POSIX-only, and it watches the OS pipe,
            # not lines already buffered by the text wrapper. A second message that
            # arrived in the same chunk may wait for new pipe data or the timeout.
            ready, _, _ = select.select([self.proc.stdout], [], [], self.timeout)
            if not ready:
                raise TimeoutError("timed out waiting for server (see stderr above)")
            line = self.proc.stdout.readline()
            if line == "":
                raise EOFError("server closed stdout unexpectedly")
            line = line.strip()
            if not line:
                continue
            try:
                return json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore stray non-JSON output

    def _request(self, method: str, params: dict) -> dict:
        """Send a request and block until the response with the same id arrives."""
        self._id += 1
        req_id = self._id
        self._send({"jsonrpc": "2.0", "id": req_id, "method": method, "params": params})
        while True:
            msg = self._read()
            # Notifications and responses to other ids are skipped.
            if msg.get("id") == req_id:
                if "error" in msg:
                    raise RuntimeError(f"{method} error: {msg['error']}")
                return msg["result"]

    # -- High-level API ----------------------------------------------------
    def list_tools(self) -> list[dict]:
        """Fetch every tool definition, following `nextCursor` pagination."""
        tools: list[dict] = []
        cursor = None
        while True:
            params = {} if cursor is None else {"cursor": cursor}
            result = self._request("tools/list", params)
            tools.extend(result.get("tools", []))
            cursor = result.get("nextCursor")
            if not cursor:
                return tools

    def call_tool(self, name: str, arguments: dict) -> dict:
        return self._request("tools/call", {"name": name, "arguments": arguments})

    def close(self) -> None:
        """Close stdin, terminate the server, and kill it if it does not exit."""
        if not self.proc:
            return
        try:
            if self.proc.stdin:
                self.proc.stdin.close()
        except Exception:  # noqa: BLE001
            pass
        self.proc.terminate()
        try:
            self.proc.wait(timeout=10)
        except Exception:  # noqa: BLE001
            self.proc.kill()
        self.proc = None
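
The response-matching logic of `_read` and `_request` can be exercised without spawning a server. The sketch below is not part of the commit: `read_response` is a hypothetical standalone helper, and an in-memory stream with made-up lines stands in for the server's stdout.

```python
import io
import json


def read_response(stream, req_id: int) -> dict:
    """Scan newline-delimited JSON-RPC output for the response matching req_id."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # blank line
        try:
            msg = json.loads(line)
        except json.JSONDecodeError:
            continue  # stray non-JSON output such as a log line
        if not isinstance(msg, dict) or msg.get("id") != req_id:
            continue  # notification or a response to another request
        if "error" in msg:
            raise RuntimeError(f"request {req_id} failed: {msg['error']}")
        return msg["result"]
    raise EOFError("stream ended before a matching response arrived")


# Simulated server stdout: a log line, a blank line, a notification, then the reply.
fake_stdout = io.StringIO(
    "server starting...\n"
    "\n"
    '{"jsonrpc": "2.0", "method": "notifications/message", "params": {}}\n'
    '{"jsonrpc": "2.0", "id": 1, "result": {"tools": []}}\n'
)
result = read_response(fake_stdout, 1)
print(result)
```

Against a real server, the class above is meant to be used as a context manager, for example `with MCPServer() as server: tools = server.list_tools()`.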
