Commit 42e0b08

committed
update
1 parent 0a26115 commit 42e0b08

1 file changed

+15
-15
lines changed

eval_protocol/trainable_gepa_design.md

Lines changed: 15 additions & 15 deletions
@@ -1,8 +1,8 @@
-## GEPA-Trainable Interface Design for Eval Protocol
+## GEPA-training Interface Design for Eval Protocol

 ### Goals

-- **Tunable prompts for existing benchmarks**: Allow benchmarks like `test_aime25.py` and `test_gpqa.py` to expose parts of their configuration (e.g., system prompts) as trainable parameters, without changing their core evaluation logic.
+- **Tunable prompts for existing benchmarks**: Allow benchmarks like `test_aime25.py` and `test_gpqa.py` to expose parts of their configuration (e.g., system prompts) as training parameters, without changing their core evaluation logic.
 - **Tight coupling with `@evaluation_test`**: Reuse the same rollout configuration, datasets, and metrics that are already defined via `evaluation_test`, instead of duplicating that configuration in a separate training API.
 - **GEPA as one optimizer backend**: Provide a clean integration point for GEPA (and potentially other optimizers later) without requiring benchmarks to depend on DSPy or GEPA directly.

@@ -14,12 +14,12 @@
 - `@evaluation_test(...)`-decorated function (e.g., `test_aime25_pointwise`) that:
 - Uses `SingleTurnRolloutProcessor` (or another processor).
 - Computes per-row metrics and sets `row.evaluation_result`.
-- Adds *optional* trainable wiring at the bottom, under `if __name__ == "__main__":`, that:
-- Imports a trainable/core API from `eval_protocol.trainable`.
+- Adds *optional* training wiring at the bottom, under `if __name__ == "__main__":`, that:
+- Imports a training/core API from `eval_protocol.training`.
 - Specifies what is tunable (e.g., the system prompt) and how to adapt rows using a candidate.
 - Invokes a train routine (GEPA-based or otherwise).

-- **Trainable core**
+- **training core**
 - Provides a single central abstraction:
 - **`EPParameters`**: Encapsulates everything `evaluation_test` knows about the eval in a structured form:
 - One field for every parameter that `evaluation_test` accepts (dataset sources, adapters, completion params, rollout processor, aggregation, thresholds, etc.), after parsing/env overrides.
@@ -28,8 +28,8 @@
 - Build an `EPParameters` instance by introspecting an `@evaluation_test`-decorated function.
 - Run a single candidate or a batch of candidates through the full rollout + evaluation pipeline, returning aggregate scores (and optionally per-row scores).

-- **GEPA adapter (e.g., `eval_protocol/trainable/gepa_adapter.py`)**
-- Wraps the trainable core and GEPA’s API:
+- **GEPA adapter (e.g., `eval_protocol/training/gepa_adapter.py`)**
+- Wraps the training core and GEPA’s API:
 - Accepts:
 - An `EPConfig`.
 - A candidate space definition (for now, implicit via `dict[str, str]` keys).
@@ -66,9 +66,9 @@ setattr(dual_mode_wrapper, "__ep_params__", ep_params)
 - Dataset sources (`input_dataset`, `input_rows`, dataloaders, and `dataset_adapter`), after `parse_ep_*` transforms.
 - `aggregation_method`, `num_runs`, `max_dataset_rows`, etc.
 - Rollout and mode information (processor, kwargs, concurrency limits, mode).
-- The trainable core can then **directly convert `__ep_params__` into an `EPParameters` instance** without maintaining a separate trainable-only config.
+- The training core can then **directly convert `__ep_params__` into an `EPParameters` instance** without maintaining a separate training-only config.

-- Trainable core will expose:
+- training core will expose:
 - A factory like:

 ```python
@@ -90,7 +90,7 @@ setattr(dual_mode_wrapper, "__ep_params__", ep_params)
 - Build an `EPParameters` from it.
 - Call into a GEPA-based trainer that uses the `EPParameters`.

-### Open Questions
+### TODO for derek to figure out: how to store the changing system prompts.

 - **Where tuned prompts live (storage format and location)**:
 - GEPA already supports a `run_dir` for logging and checkpoints.
@@ -102,7 +102,7 @@ setattr(dual_mode_wrapper, "__ep_params__", ep_params)

 ### Work Split: Person A vs Person B

-#### Person A – Trainable Core & `evaluation_test` Integration
+#### Person A – training Core & `evaluation_test` Integration

 - **1. Extend `evaluation_test` metadata (no behavior change)**
 - Populate a single `__ep_config__` dict on the decorated test function that includes:
@@ -114,7 +114,7 @@ setattr(dual_mode_wrapper, "__ep_params__", ep_params)
 - Backwards compatibility for existing tests.
 - Clear typing and docstrings to guide future use.

-- **2. Define core trainable abstractions in `eval_protocol/trainable/core.py`**
+- **2. Define core training abstractions in `eval_protocol/training/core.py`**
 - Define:
 - `EPConfig`:
 - A field for every parameter `evaluation_test` accepts (dataset, adapters, completion params, rollout processor, aggregation, thresholds, etc.).
@@ -140,12 +140,12 @@ setattr(dual_mode_wrapper, "__ep_params__", ep_params)

 #### Person B – GEPA Adapter & Benchmark Wiring

-- **4. Implement GEPA integration in `eval_protocol/trainable/gepa_adapter.py`**
+- **4. Implement GEPA integration in `eval_protocol/training/gepa_adapter.py`**
 - Define a small adapter API, e.g.:

 ```python
 class GEPATrainer:
-    def __init__(self, spec: TrainableBenchmarkSpec, inject_fn: InjectFn, ...gepa_config...):
+    def __init__(self, spec: trainingBenchmarkSpec, inject_fn: InjectFn, ...gepa_config...):
         ...

     def train(self) -> tuple[Candidate, Any]:
@@ -186,7 +186,7 @@ class GEPATrainer:
 - Add candidate-aware wiring (e.g., via dataset adapters) and an optional `__main__` entrypoint calling the same GEPA trainer.
 - This will validate that:
 - The abstractions generalize across tasks.
-- No DSPy/GEPA-specific imports leak into benchmark files (other than a small, well-defined trainable API).
+- No DSPy/GEPA-specific imports leak into benchmark files (other than a small, well-defined training API).

 ### Coordination Notes
