Commit 4420067
Added: Remote evals params spec
1 parent e616cf7

4 files changed: 667 additions & 0 deletions

Lines changed: 81 additions & 0 deletions
# Remote Eval Parameters: Overview

## What Are Eval Parameters?

**Eval parameters** let users configure evaluator behavior from the Braintrust Playground without changing code. Developers declare named parameters in their evaluator -- anything that affects how the eval runs: a model name, a similarity threshold, a feature flag, a service URL, a max output length, etc. The Playground renders these as UI controls (sliders, text inputs, etc.) and passes the user's chosen values to the evaluator when running.

This makes it easy to compare how a system behaves under different configurations -- for example, running the same test cases with `temperature: 0.2` vs. `temperature: 0.9`, or against a staging vs. production endpoint -- without deploying new code.

## How It Works

```
Braintrust Playground                    Developer's Machine

+----------------------+                 +-------------------------+
|                      |                 |  Dev Server             |
|  GET /list           | --------------> |                         |
|                      | <-------------- |  "food-classifier":     |
|  Render UI controls: |   parameters:   |    parameters:          |
|   model: [gpt-4 v]   |   { model: ..., |      model: "gpt-4"     |
|   temp:  [0.7 ---]   |     temp: ... } |      temperature: 0.7   |
|                      |                 |                         |
|  User changes model  |                 |                         |
|  to "gpt-4o", clicks |   POST /eval    |                         |
|  "Run"               | --------------> |    parameters:          |
|                      |   parameters:   |    { model: "gpt-4o",   |
|                      |   { model:      |      temperature: 0.7 } |
|                      |     "gpt-4o" }  |                         |
|  Results stream back | <-------------- |  task receives params   |
+----------------------+                 +-------------------------+
```

1. **Declaration**: The developer declares named parameters in the evaluator definition. Each parameter has a name, optional type, default value, and description.
2. **Discovery**: When the Playground fetches `GET /list`, the dev server includes parameter definitions in the response. The Playground renders appropriate UI controls for each parameter.
3. **Delivery**: When the user clicks "Run", the Playground sends the current parameter values in the `POST /eval` request body under the `"parameters"` key.
4. **Merging**: The dev server merges request values with evaluator defaults (request overrides defaults). This means parameters not changed by the user still have their default values.
5. **Forwarding**: The merged parameters are forwarded to the task function and all scorer functions as they run.
## Key Concepts

**Parameter definition** -- A declaration in the evaluator specifying a parameter's name, default value, and optional metadata (type, description). Defined once in code; used to populate UI controls.

**Parameter values** -- The runtime values the Playground sends per run. These override any defaults defined in the evaluator.

**Backward compatibility** -- Tasks and scorers that do not declare they want parameters must continue to work unchanged. The SDK is responsible for filtering parameters out of calls to functions that don't expect them.
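One way an SDK can honor this rule is to inspect each callable's signature before passing `parameters`. A sketch, assuming a Python SDK (the helper name is hypothetical):

```python
import inspect

# Sketch: only pass `parameters` to callables that declare it, so
# functions written before parameters existed keep working unchanged.
def call_with_optional_parameters(fn, *args, parameters):
    sig = inspect.signature(fn)
    accepts = (
        "parameters" in sig.parameters
        or any(p.kind is inspect.Parameter.VAR_KEYWORD
               for p in sig.parameters.values())
    )
    if accepts:
        return fn(*args, parameters=parameters)
    return fn(*args)

def legacy_task(input):            # predates parameters; still works
    return input.upper()

def new_task(input, parameters):   # opts in to parameters
    return f"{input} via {parameters['model']}"

print(call_with_optional_parameters(legacy_task, "apple", parameters={"model": "gpt-4"}))
# APPLE
print(call_with_optional_parameters(new_task, "apple", parameters={"model": "gpt-4"}))
# apple via gpt-4
```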
## Example

```pseudocode
# Define an evaluator with parameters
evaluator = Evaluator(
  task = (input, parameters) => MyModel.classify(input, model: parameters["model"]),
  scorers = [
    Scorer("exact_match", (expected, output) => output == expected ? 1.0 : 0.0)
  ],
  parameters = {
    "model": { type: "model", default: "gpt-4", description: "Model to use" },
    "temperature": { type: "data", default: 0.7, description: "Sampling temperature" }
  }
)
```

The Playground renders a model picker and a temperature input. When the user selects "gpt-4o" and clicks "Run", the task receives `parameters = {"model": "gpt-4o", "temperature": 0.7}` (temperature keeps its default since the user didn't change it).

## Parameters vs. Input

**Input** is per-case data — each test case has its own `input` value (e.g., `"apple"`, `"carrot"`). It varies case by case and represents *what* is being evaluated.

**Parameters** are per-run configuration — the same values apply to every test case in the run. They represent *how* the evaluator behaves.

The typical workflow: run the same dataset (same inputs) with different parameter values to compare configurations. For example, run `model: "gpt-4"` and `model: "gpt-4o"` against identical test cases, then compare scores side by side in the Playground.
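In protocol terms, the comparison workflow is two `POST /eval` request bodies that are identical except for `parameters`. A sketch (the `data` field's shape is elided here; see the dev server spec):

```python
# Sketch: two POST /eval request bodies that differ only in `parameters`,
# for side-by-side comparison over the same dataset. The `data` field is
# elided; see the dev server spec for its schema.
def eval_request_body(name: str, parameters: dict) -> dict:
    return {
        "name": name,
        "data": {},  # same dataset in both runs (shape elided)
        "parameters": parameters,
    }

run_a = eval_request_body("food-classifier", {"model": "gpt-4"})
run_b = eval_request_body("food-classifier", {"model": "gpt-4o"})

# Only the parameter values differ between the two runs.
diff = {k for k in run_a if run_a[k] != run_b[k]}
print(sorted(diff))  # ['parameters']
```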
## Further Reading

| Document | Purpose |
|----------|---------|
| [design.md](design.md) | End-to-end flow, component roles, and design decisions |
| [contracts.md](contracts.md) | Wire protocol, data types, and API schemas |
| [validation.md](validation.md) | Test scenarios and expected behaviors |

### Related Specs

- [Remote Eval Dev Server](../server/README.md) -- The broader remote eval feature this builds on
Lines changed: 176 additions & 0 deletions
# Remote Eval Parameters: Contracts

## SDK

### Evaluator

#### `parameters`

A map from parameter name to parameter spec, declared in the evaluator definition. This is the source of truth for what parameters exist and what their defaults are.

```pseudocode
evaluator.parameters = {
  "model": { type: "model", default: "gpt-4", description: "Model to use" },
  "temperature": { type: "data", default: 0.7, description: "Sampling temperature" },
  "max_length": { type: "data", default: 100, description: "Max output length" }
}
```

Each parameter spec:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `default` | `any` | No | Value used when the `POST /eval` request does not include this parameter |
| `description` | `string` | No | Human-readable description shown in the Playground UI |
| `type` | `string` | No | Type hint — `"data"` (default), `"model"`, or `"prompt"`. See `parameter` entry under `GET /list` Response Format. |

#### `task`

A callable that optionally declares a `parameters` argument. When declared, it receives the merged parameter map (request values overlaid on evaluator defaults) as a plain string-keyed object.

Tasks that do not declare `parameters` must continue to work unchanged — the SDK must not pass `parameters` to functions that don't accept it.

**Side effect**: the merged `parameters` map is passed to the task function on every test case invocation during a `POST /eval` run.

#### `scorers`

Local scorer functions follow the same contract as `task` with respect to parameters — they optionally declare `parameters` and receive the same merged map if they do. The SDK must not pass `parameters` to scorers that don't declare it.

Remote scorers (sent by the Playground in the `POST /eval` request) also receive the merged parameters via the SDK's remote scorer invocation mechanism.

**Side effect**: the merged `parameters` map is passed to every scorer function (local and remote) on every test case invocation during a `POST /eval` run.
### Dev Server

#### `GET /list`

##### Request Format

No body. Accepts both `GET` and `POST`.

```
GET /list
Authorization: Bearer <token>
X-Bt-Org-Name: <org>
```

##### Response Format

```
HTTP 200 OK
Content-Type: application/json
```

Body: a JSON object keyed by evaluator name. For each evaluator, the `parameters` field contains a `parameters` object serialized from the evaluator's `parameters` definition, or `null` if the evaluator defines no parameters.

```json
{
  "food-classifier": {
    "scores": [{ "name": "exact_match" }],
    "parameters": {
      "type": "braintrust.staticParameters",
      "schema": {
        "model": {
          "type": "model",
          "schema": { "type": "string" },
          "default": "gpt-4",
          "description": "Model to use"
        },
        "temperature": {
          "type": "data",
          "schema": { "type": "number" },
          "default": 0.7,
          "description": "Sampling temperature"
        }
      },
      "source": null
    }
  },
  "text-summarizer": {
    "scores": [],
    "parameters": null
  }
}
```

**`parameters` object:**

| Field | Type | Description |
|-------|------|-------------|
| `type` | `string` | Always `"braintrust.staticParameters"` for inline (code-defined) parameters |
| `schema` | `Record<string, parameter>` | Map of parameter name to definition |
| `source` | `null` | Always `null` for static parameters. Non-null values reference remotely-stored parameter definitions — out of scope for baseline. |

When the evaluator defines no parameters, set `"parameters": null` or omit the field.

> **Note for existing SDK implementors**: Prior to the introduction of the container format, some SDKs returned the `schema` map directly (i.e. `Record<string, parameter>`) rather than wrapping it in a `parameters` object with `type` and `source` fields. The container was introduced to distinguish static (inline) parameters from dynamic (remotely-stored) ones. If updating an existing SDK, check whether it predates this format and update accordingly.

**`parameter` entry** (each value in `schema`):

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | `string` | Yes | `"data"` for generic values; `"model"` for a model picker; `"prompt"` for a prompt editor. For a baseline implementation, `"data"` is sufficient. |
| `schema` | `object` | No | JSON Schema fragment describing the value shape. Set `type` to `"string"`, `"number"`, `"boolean"`, `"object"`, or `"array"` to match the parameter's value type. Used by the Playground to render appropriate input controls. Omit if the type is unknown or mixed. |
| `default` | `any` | No | Default value. Should match the type described by `schema`. |
| `description` | `string` | No | Human-readable description shown in the Playground UI. |

**Serialization**: each entry in `evaluator.parameters` maps to a `parameter` entry in the `schema` object. The parameter name becomes the key; the spec fields (`default`, `description`, `type`) are preserved as-is.
##### Error Responses

| Status | Condition |
|--------|-----------|
| `401 Unauthorized` | Missing or invalid auth token |

#### `POST /eval`

##### Request Format

```
POST /eval
Content-Type: application/json
Authorization: Bearer <token>
X-Bt-Org-Name: <org>
```

The `parameters` field in the request body carries the user's chosen values from the Playground UI:

```json
{
  "name": "food-classifier",
  "data": { ... },
  "parameters": {
    "model": "gpt-4o",
    "temperature": 0.9
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `parameters` | `Record<string, unknown>` | No | Parameter values chosen by the user. Keys match the evaluator's parameter names. Absent, `null`, and `{}` all mean no overrides were provided. |
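A dev server can validate and normalize this field in one step. A sketch (the helper name and error wording are illustrative, not specified):

```python
# Sketch: validate and normalize the request's `parameters` field.
# Absent, null, and {} are all treated as "no overrides"; a value that
# is present but not a JSON object is a 400 Bad Request.
class BadRequest(Exception):
    pass

def normalize_request_parameters(body: dict) -> dict:
    value = body.get("parameters")
    if value is None:          # absent or explicit null
        return {}
    if not isinstance(value, dict):
        raise BadRequest("`parameters` must be a JSON object")
    return value               # may be {}, meaning no overrides

print(normalize_request_parameters({"name": "food-classifier"}))          # {}
print(normalize_request_parameters({"parameters": None}))                 # {}
print(normalize_request_parameters({"parameters": {"model": "gpt-4o"}}))  # {'model': 'gpt-4o'}
```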
See the [Dev Server specification](../server/specification.md) for the full `POST /eval` request schema (all fields beyond `parameters`).

##### Response Format

An SSE stream. The `parameters` field has no effect on the response format — progress, summary, and done events have the same structure as without parameters.

See the [Dev Server specification](../server/specification.md) for the full SSE event schema.

**Side effect**: the merged parameters (request values overlaid on evaluator defaults) are forwarded to the task and all scorers on every test case invocation. Output values in the SSE stream reflect whatever the task produced using those parameters.

##### Error Responses

| Status | Condition |
|--------|-----------|
| `400 Bad Request` | `parameters` field is present but not a JSON object |
| `401 Unauthorized` | Missing or invalid auth token |
| `404 Not Found` | No evaluator registered with the given `name` |

---

## References

- [Braintrust: Remote evals guide](https://www.braintrust.dev/docs/evaluate/remote-evals)
- [Dev Server specification](../server/specification.md) — full `POST /eval` and `GET /list` schemas