eval_protocol/trainable_gepa_design.md (15 additions, 15 deletions)
@@ -1,8 +1,8 @@
## GEPA-training Interface Design for Eval Protocol

### Goals

- **Tunable prompts for existing benchmarks**: Allow benchmarks like `test_aime25.py` and `test_gpqa.py` to expose parts of their configuration (e.g., system prompts) as training parameters, without changing their core evaluation logic.
- **Tight coupling with `@evaluation_test`**: Reuse the same rollout configuration, datasets, and metrics that are already defined via `evaluation_test`, instead of duplicating that configuration in a separate training API.
- **GEPA as one optimizer backend**: Provide a clean integration point for GEPA (and potentially other optimizers later) without requiring benchmarks to depend on DSPy or GEPA directly.
@@ -14,12 +14,12 @@
- `@evaluation_test(...)`-decorated function (e.g., `test_aime25_pointwise`) that:
  - Uses `SingleTurnRolloutProcessor` (or another processor).
  - Computes per-row metrics and sets `row.evaluation_result`.
  - Adds *optional* training wiring at the bottom, under `if __name__ == "__main__":`, that:
    - Imports a training core API from `eval_protocol.training`.
    - Specifies what is tunable (e.g., the system prompt) and how to adapt rows using a candidate.
    - Invokes a training routine (GEPA-based or otherwise).
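The optional wiring described above might look like the following sketch. Since `eval_protocol.training` does not exist yet per this design, a minimal stand-in `train` routine and `Candidate` type are stubbed inline; every name in this block is illustrative, not the final API.

```python
# Hypothetical sketch of the optional training wiring a benchmark file could
# add under `if __name__ == "__main__":`. All names here are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Candidate:
    """One assignment of the tunable parameters (here: just a system prompt)."""
    system_prompt: str


def apply_candidate(row: Dict, candidate: Candidate) -> Dict:
    """Adapt an input row by prepending the candidate's system prompt."""
    messages = [{"role": "system", "content": candidate.system_prompt}]
    return {**row, "messages": messages + row["messages"]}


def train(seed: Candidate, adapt: Callable[[Dict, Candidate], Dict],
          rows: List[Dict]) -> Candidate:
    """Stand-in for a GEPA-based train routine: adapts each row with the seed
    candidate and returns it unchanged (a real optimizer would mutate and
    re-score candidates here)."""
    for row in rows:
        adapted = adapt(row, seed)
        assert adapted["messages"][0]["role"] == "system"
    return seed


if __name__ == "__main__":
    best = train(
        seed=Candidate(system_prompt="Solve the problem step by step."),
        adapt=apply_candidate,
        rows=[{"messages": [{"role": "user", "content": "2 + 2 = ?"}]}],
    )
    print(best.system_prompt)
```

The point of the sketch is the shape of the wiring: the benchmark keeps its evaluation logic untouched and only declares what is tunable plus a row-adaptation function.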

- **Training core**
  - Provides a single central abstraction:
    - **`EPParameters`**: Encapsulates everything `evaluation_test` knows about the eval in a structured form:
      - One field for every parameter that `evaluation_test` accepts (dataset sources, adapters, completion params, rollout processor, aggregation, thresholds, etc.), after parsing/env overrides.
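A minimal sketch of what `EPParameters` could look like as a dataclass. Only fields suggested by this design are shown; the types, defaults, and the `passed_threshold` name are assumptions, and the real structure would mirror every parameter `evaluation_test` accepts.

```python
# Illustrative sketch of EPParameters; field names beyond those listed in
# the design doc, and all types/defaults, are assumptions.
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class EPParameters:
    # Dataset sources, after parse_ep_* transforms
    input_dataset: Optional[list] = None
    input_rows: Optional[list] = None
    dataset_adapter: Optional[Callable] = None
    # Completion and rollout configuration
    completion_params: dict = field(default_factory=dict)
    rollout_processor: Optional[Any] = None
    # Aggregation, limits, thresholds
    aggregation_method: str = "mean"
    num_runs: int = 1
    max_dataset_rows: Optional[int] = None
    passed_threshold: Optional[float] = None  # hypothetical field name
```

A dataclass keeps the structure introspectable, which matters below when converting captured decorator kwargs into an instance field by field.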
@@ -28,8 +28,8 @@
- Build an `EPParameters` instance by introspecting an `@evaluation_test`-decorated function.
- Run a single candidate or a batch of candidates through the full rollout + evaluation pipeline, returning aggregate scores (and optionally per-row scores).
- Dataset sources (`input_dataset`, `input_rows`, dataloaders, and `dataset_adapter`), after `parse_ep_*` transforms.
- `aggregation_method`, `num_runs`, `max_dataset_rows`, etc.
- Rollout and mode information (processor, kwargs, concurrency limits, mode).
- The training core can then **directly convert `__ep_params__` into an `EPParameters` instance** without maintaining a separate training-only config.