Commit 4f23f01
committed by cellarius

docs: restructure README and update evaluator cookbook for v2 protocol

- README: add How It Works section, move protocol/intake to Reference, add intake context, polish flow and transitions
- Cookbook: update contract to Protocol v2, add dataset-aware and composite evaluator recipes, add score-range and task-model sections

1 parent 81e9c0a
2 files changed: 496 additions & 82 deletions

README.md (111 additions & 68 deletions)
@@ -28,47 +28,26 @@

```bash
optimize-anything optimize seed.txt \
  --output result.txt
```

CLI stdout returns a JSON summary — see [Result Contract](#result-contract) for the full shape.

## How It Works

optimize-anything runs a GEPA (Guided Evolutionary Prompt Algorithm) loop: propose → evaluate → reflect, repeating until the budget is exhausted or early stopping kicks in.

```
seed.txt ──► [Propose] ──► candidates
                 ▲              │
                 │         [Evaluate]
             [Reflect] ◄──── scores + diagnostics
```

1. **Propose** — The optimizer generates candidate artifacts from your seed (or from scratch in seedless mode).
2. **Evaluate** — Each candidate is scored by your evaluator. Three evaluator types are supported: a **command evaluator** (any executable that reads JSON on stdin and writes a score on stdout), an **HTTP evaluator** (a service that accepts POST requests), or a built-in **LLM judge** (no evaluator script required — just pass `--judge-model`).
3. **Reflect** — Scores and diagnostics feed back into the next proposal round. The loop continues, progressively improving the artifact toward your objective.

The evaluator is the only thing you bring. Everything else — proposal strategy, reflection, early stopping, caching, parallelism — is handled by the optimizer.

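
The propose → evaluate → reflect loop above can be sketched in a few lines. Function names here are hypothetical; the real optimizer layers reflection-guided proposals, caching, and parallel evaluation on top of this skeleton:

```python
def optimize(seed: str, evaluate, propose, budget: int = 100) -> str:
    """Illustrative sketch of the optimization loop; not the shipped implementation."""
    best, best_score = seed, evaluate(seed)
    calls = 1
    while calls < budget:
        candidate = propose(best, best_score)  # Propose: derive a new candidate
        score = evaluate(candidate)            # Evaluate: score it
        calls += 1
        if score > best_score:                 # Reflect: keep what improved
            best, best_score = candidate, score
    return best
```

With a toy evaluator (score = length) and a toy proposer (append a character), each iteration keeps the longer candidate until the budget runs out.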
## Runtime Modes

### Dataset / Valset modes

@@ -147,41 +126,6 @@

- `analyze`
- `validate`

## Claude Code Plugin

optimize-anything is also a Claude Code plugin with guided slash commands and skills.

@@ -236,6 +180,105 @@

The plugin includes three skills that Claude Code can invoke automatically:

```
→ cross-checks the result with multiple judges
```

## Reference

### Result Contract

CLI stdout returns a JSON summary with these fields:

- `best_artifact`
- `total_metric_calls`
- `score_summary` (`initial`, `latest`, `best`, deltas, `num_candidates`)
- `top_diagnostics` (**list** of `{name, value}`)
- `plateau_detected`, `plateau_guidance`
- optional `evaluator_failure_signal`
- optional `early_stopped`, `stopped_at_iteration`
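
A hypothetical summary, sketched with the fields above — the values are purely illustrative, and the delta fields inside `score_summary` are elided:

```json
{
  "best_artifact": "optimized text ...",
  "total_metric_calls": 42,
  "score_summary": {"initial": 0.41, "latest": 0.78, "best": 0.81, "num_candidates": 17},
  "top_diagnostics": [{"name": "clarity", "value": 0.9}],
  "plateau_detected": false,
  "plateau_guidance": null
}
```
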
### Evaluator Protocol (v2)

Evaluator input payload (stdin JSON for command mode, POST JSON for HTTP mode):

```json
{"_protocol_version": 2, "candidate": "...", "example": {...}, "task_model": "..."}
```

- `candidate` is required
- `_protocol_version`, `example`, and `task_model` are optional/additive
- legacy evaluators that only read `candidate` remain compatible

Evaluator output payload:

```json
{"score": 0.75, "notes": "optional diagnostics"}
```

- `score` is required
- additional keys are treated as side-info
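
To make the contract concrete, here is a minimal command evaluator sketch. The length-based scoring rule is a stand-in for real evaluation logic, not part of the protocol:

```python
#!/usr/bin/env python3
"""Minimal Protocol v2 command evaluator (illustrative scoring rule)."""
import json
import sys


def evaluate(payload: dict) -> dict:
    candidate = payload["candidate"]        # the only required input field
    example = payload.get("example")        # optional, present in dataset mode
    task_model = payload.get("task_model")  # optional metadata
    # Stand-in rule: shorter candidates score higher, clamped to [0, 1].
    score = max(0.0, min(1.0, 1.0 - len(candidate) / 1000))
    return {"score": score, "notes": f"len={len(candidate)}"}


if __name__ == "__main__":
    print(json.dumps(evaluate(json.load(sys.stdin))))
```

Registered via `--evaluator-command`, it receives the input payload on stdin and must print the output payload as a single JSON object on stdout; ignoring the optional fields keeps it compatible with legacy callers.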
### Intake

Intake is optional structured guidance you pass to the optimizer to shape how evaluation works. Instead of relying solely on an `--objective` string, intake lets you declare quality dimensions, hard constraints, evaluation patterns, and execution preferences — giving you finer control over what "better" means for your artifact.

Use intake when your evaluation criteria are multi-dimensional, when you need to enforce hard constraints, or when you want consistent evaluation behavior across runs.

Pass it inline or from a file:

```bash
# Inline
optimize-anything optimize seed.txt \
  --intake-json '{"quality_dimensions": ["clarity", "specificity"], "hard_constraints": ["max 100 words"]}' \
  --judge-model openai/gpt-4o-mini

# From file
optimize-anything optimize seed.txt \
  --intake-file intake.json \
  --judge-model openai/gpt-4o-mini
```

`optimize-anything intake` normalizes and validates these keys:

- `artifact_class`
- `quality_dimensions`
- `hard_constraints`
- `evaluation_pattern`
- `execution_mode`
- `evaluator_cwd`
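
For file-based intake, a hypothetical `intake.json` exercising all six keys might look like this — the values are invented placeholders, not a documented enum, so check `optimize-anything intake` for what each key accepts:

```json
{
  "artifact_class": "marketing copy",
  "quality_dimensions": ["clarity", "specificity"],
  "hard_constraints": ["max 100 words"],
  "evaluation_pattern": "llm judge",
  "execution_mode": "command",
  "evaluator_cwd": "."
}
```
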
### `optimize` flags (complete)

Exactly one evaluator source is required: `--evaluator-command` OR `--evaluator-url` OR `--judge-model`.

| Flag | Description | Default |
|---|---|---|
| `--no-seed` | Run without seed file; bootstrap from objective | `false` |
| `--evaluator-command <cmd...>` | Command evaluator (stdin/stdout JSON) | -- |
| `--evaluator-url <url>` | HTTP evaluator endpoint | -- |
| `--intake-json <json>` | Inline intake spec | -- |
| `--intake-file <path>` | Intake spec file | -- |
| `--evaluator-cwd <path>` | Working dir for command evaluator | -- |
| `--objective <text>` | Optimization objective | -- |
| `--background <text>` | Extra domain context | -- |
| `--dataset <train.jsonl>` | Training dataset JSONL | -- |
| `--valset <val.jsonl>` | Validation dataset JSONL (requires `--dataset`) | -- |
| `--budget <int>` | Max evaluator calls | `100` |
| `--output, -o <file>` | Write best artifact to file | -- |
| `--model <model>` | Proposer model (or env fallback) | `OPTIMIZE_ANYTHING_MODEL` |
| `--judge-model <model>` | Built-in LLM judge evaluator model | -- |
| `--judge-objective <text>` | Judge objective override | falls back to `--objective` |
| `--api-base <url>` | Override LiteLLM API base | -- |
| `--diff` | Print unified diff (seed vs best) to stderr | `false` |
| `--run-dir <path>` | Save run artifacts in timestamped run dir | -- |
| `--parallel` | Enable parallel evaluator calls | `false` |
| `--workers <int>` | Max workers for parallel evaluation | -- |
| `--cache` | Enable evaluator cache | `false` |
| `--cache-from <run-dir>` | Copy prior `fitness_cache` into new run | -- |
| `--early-stop` | Enable plateau early stop | auto on when budget > 30 |
| `--early-stop-window <int>` | Plateau window size | `10` |
| `--early-stop-threshold <float>` | Min improvement required over window | `0.005` |
| `--spec-file <path>` | Load TOML spec defaults | -- |
| `--task-model <model>` | Optional metadata forwarded to evaluators | -- |
| `--score-range unit\|any` | Score validation mode for cmd/http | `unit` |
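
For intuition, `--early-stop-window` and `--early-stop-threshold` describe a sliding-window plateau check along these lines — a sketch of the idea, not the shipped implementation:

```python
def plateau_detected(scores: list[float], window: int = 10,
                     threshold: float = 0.005) -> bool:
    """True when the best score improved by less than `threshold`
    over the last `window` evaluations. Illustrative sketch only."""
    if len(scores) <= window:
        return False  # not enough history to judge a plateau
    best_before = max(scores[:-window])  # best seen before the window
    best_now = max(scores)               # best seen overall
    return best_now - best_before < threshold
```

With the defaults, a run whose best score has gained less than 0.005 over the last 10 evaluator calls would be flagged as plateaued.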
## Learn More

- [EXAMPLES.md](EXAMPLES.md)
