Status: Source of truth for Phase 1 implementation
Applies to: command_evaluator, http_evaluator, llm_judge_evaluator, preflight validation, dataset/valset loading
The current (v1) evaluator input payload is:

```json
{"candidate": "<text>"}
```

This is sent:

- on `stdin` for command evaluators
- as JSON POST body for HTTP evaluators
Protocol v2 extends the input payload with optional metadata and example context:

```json
{
  "_protocol_version": 2,
  "candidate": "<text>",
  "task_model": "openai/gpt-4o-mini",
  "example": {
    "...": "dataset example object"
  }
}
```

- `candidate` (string): artifact/candidate text to score
- `_protocol_version` (integer): when present for v2 payloads, value MUST be `2`
- `task_model` (string): metadata identifying the model being optimized for task execution
- `example` (object): one dataset example record (see dataset schema below)
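The payload shape above can be assembled conditionally, since `task_model` and `example` are optional. The sketch below assumes a hypothetical `build_payload` helper; the name is illustrative and not part of the spec:

```python
import json

def build_payload(candidate, task_model=None, example=None):
    # `candidate` is the only mandatory key; the v2 metadata/context
    # keys are added only when the caller has values for them.
    payload = {"_protocol_version": 2, "candidate": candidate}
    if task_model is not None:
        payload["task_model"] = task_model
    if example is not None:
        payload["example"] = example
    return payload

print(json.dumps(build_payload("draft text", task_model="openai/gpt-4o-mini")))
```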
- No-dataset mode: payload contains `candidate` (+ optional `task_model`) and `_protocol_version: 2`.
- Dataset mode (`--dataset`): one evaluator call per candidate/example pair; payload includes `example`.
- Generalization mode (`--dataset` + `--valset`):
  - Training/evolution calls use `dataset` examples (with `example` key)
  - Validation aggregation uses `valset` examples (with `example` key)
- Evaluators that only read `candidate` remain valid under v2.
- Unknown keys MUST be ignored by evaluators unless explicitly used.
- v1 evaluators are not required to return or acknowledge protocol version.
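A minimal evaluator that satisfies these rules only reads `candidate` and ignores everything else. This is a sketch (the length-based scoring is a toy heuristic, not part of the spec):

```python
import json

def evaluate(payload: dict) -> dict:
    # Only `candidate` is read; `_protocol_version`, `task_model`,
    # `example`, and any other unknown keys are ignored, as v2 requires.
    candidate = payload["candidate"]
    score = min(len(candidate) / 100.0, 1.0)  # toy length-based heuristic
    return {"score": score, "reasoning": f"{len(candidate)} chars"}

# In a real command evaluator this would read stdin and print the result:
#     print(json.dumps(evaluate(json.load(sys.stdin))))
print(json.dumps(evaluate({"candidate": "draft", "_protocol_version": 2})))
# → {"score": 0.05, "reasoning": "5 chars"}
```

The same function scores v1 payloads (`{"candidate": "..."}`) unchanged, which is the compatibility property the bullets above guarantee.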
Evaluator output remains:

```json
{"score": 0.73, "reasoning": "...", "any_other_side_info": "..."}
```

- `score` is required
- all non-`score` keys are side information
- side information is preserved for reflection/debugging
`--score-range` introduces two validation modes:

- `unit` (default): score must be in `[0.0, 1.0]`
- `any`: score must be a finite float (unbounded)
| Component | `--score-range unit` | `--score-range any` |
|---|---|---|
| `command_evaluator` | Enforce numeric + finite + range [0,1] in parse path; preflight enforces same | Enforce numeric + finite only in parse path; preflight enforces same |
| `http_evaluator` | Enforce numeric + finite + range [0,1] in parse path; preflight enforces same | Enforce numeric + finite only in parse path; preflight enforces same |
| `llm_judge_evaluator` | Always returns/clamps to [0,1] | Always returns/clamps to [0,1] (unchanged) |
| `validate_evaluator_payload` | Enforce numeric + finite + [0,1] | Enforce numeric + finite only |
- LLM judge behavior is intentionally invariant to `--score-range`.
- In `any` mode, NaN and ±Inf remain invalid.
- For command/HTTP adapters, parse-time and preflight-time rules MUST match.
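Because parse-time and preflight-time rules must match, a single shared validator is the natural shape. The following is a sketch under that assumption; the function name and error wording are illustrative, not the actual implementation:

```python
import math

def validate_score(value, score_range="unit"):
    """Validate an evaluator score per --score-range (sketch).

    Both modes reject non-numeric values, NaN, and ±Inf; `unit`
    additionally enforces [0.0, 1.0]. Returns the score as a float
    or raises ValueError.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"score must be numeric, got {type(value).__name__}")
    score = float(value)
    if not math.isfinite(score):
        raise ValueError("score must be finite (NaN/Inf rejected in all modes)")
    if score_range == "unit" and not (0.0 <= score <= 1.0):
        raise ValueError(f"score {score} outside [0.0, 1.0] in unit mode")
    return score
```

Calling the same function from both the parse path and the preflight check keeps the two rule sets identical by construction.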
`--task-model` is metadata in Phase 1 (not hard-coupled execution logic).

When provided, include it in the evaluator payload:

```json
{
  "_protocol_version": 2,
  "candidate": "...",
  "task_model": "provider/model",
  "example": {"...": "..."}
}
```

(If not in dataset mode, omit `example`.)
For command evaluators, also set:

```
OPTIMIZE_ANYTHING_TASK_MODEL=<value of --task-model>
```
This env var is additive; existing env handling remains intact.
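The host-side invocation described above (payload on stdin, task model via env var) might look like the following sketch; `run_command_evaluator` is a hypothetical helper, not the actual adapter:

```python
import json
import os
import subprocess

def run_command_evaluator(cmd, payload, task_model=None):
    # Copy the existing environment so the env var stays additive.
    env = dict(os.environ)
    if task_model is not None:
        env["OPTIMIZE_ANYTHING_TASK_MODEL"] = task_model
    # Payload goes to the child's stdin as JSON; its stdout is the
    # evaluator output object ({"score": ..., ...}).
    proc = subprocess.run(
        cmd, input=json.dumps(payload), capture_output=True,
        text=True, env=env, check=True,
    )
    return json.loads(proc.stdout)
```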
- Judge model (`--judge-model`) remains the evaluator model.
- `task_model` identifies the target model under optimization and is passed through as metadata for future evaluator logic.
- Encoding: UTF-8
- Type: JSONL (one JSON object per non-blank line)
- Blank lines: allowed and skipped
Each non-blank line MUST parse as a JSON object:

```json
{"input": "...", "expected": "...", "metadata": {"difficulty": "easy"}}
```

No fixed field names are required by the protocol; evaluator logic defines semantic interpretation.
For both dataset and valset:
- File must be readable as UTF-8.
- Maximum non-blank valid records: 10,000.
- Each non-blank line must be valid JSON.
- Parsed JSON value must be an object (`{}`), not array/string/number/etc.
- Blank lines are skipped and not counted toward the 10,000 limit.
- On malformed JSON or non-object record, fail validation with explicit line numbers.
- Error messages MUST include:
- file path
- offending line number(s)
- brief parse/type reason
- CLI exits non-zero; optimization does not start with invalid dataset/valset.
- For each evaluation call in dataset-backed modes, one record is bound into the payload under `example`.
- `dataset` drives optimization/evolution calls.
- `valset` (if provided) drives validation/generalization aggregates.
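A loader satisfying the rules above (UTF-8, object-per-line, 10,000-record cap, line-numbered errors) can be sketched as follows; the function name and error format are assumptions, not the actual implementation:

```python
import json

MAX_RECORDS = 10_000

def load_jsonl(path):
    """Load a dataset/valset JSONL file (sketch of the validation rules).

    Returns a list of dict records; raises ValueError carrying the file
    path, the 1-based line number, and a brief reason on bad input.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines skipped, not counted toward the cap
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as e:
                raise ValueError(f"{path}:{lineno}: malformed JSON ({e.msg})")
            if not isinstance(obj, dict):
                raise ValueError(
                    f"{path}:{lineno}: record must be a JSON object, "
                    f"got {type(obj).__name__}"
                )
            records.append(obj)
            if len(records) > MAX_RECORDS:
                raise ValueError(f"{path}: exceeds {MAX_RECORDS} records")
    return records
```

The CLI would catch this ValueError, print it, and exit non-zero before starting optimization.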
The following are guaranteed not to break in v2:

- **Existing v1 evaluators remain usable**
  - Evaluators that only consume `candidate` continue to work.
- **Evaluator output schema remains stable**
  - Required output key is still `score`.
  - Additional output keys remain side-info.
- **Default score behavior remains strict unit interval**
  - Default mode is still equivalent to legacy `[0,1]` checks.
- **LLM judge continues `[0,1]` semantics**
  - No behavior change from the score-range flag.
- **Single-artifact mode remains unchanged in UX**
  - Running without `--dataset`/`--valset` preserves the current flow; only additive payload metadata may appear.
- **Protocol extension is additive, not breaking**
  - New keys (`_protocol_version`, `example`, `task_model`) are optional and ignorable by existing evaluators.
- Add `_protocol_version: 2` to outbound evaluator payloads
- Add optional `task_model` payload field + command env var
- Add optional `example` payload field in dataset-backed modes
- Implement score-range switch in both parse and preflight validators
- Implement JSONL loader with UTF-8, object-per-line, max-10k, malformed line reporting
- Ensure docs/tests reflect v2 contract and compatibility guarantees