Merged
53 commits
a8f1c10
Bump verifiers benchmark dependency
warner-benjamin Apr 28, 2026
d009bc3
Add upstream eval config adapter
warner-benjamin Apr 28, 2026
2f02d72
Add deterministic eval path utilities
warner-benjamin Apr 28, 2026
a80e43a
Add TOML bench dry run
warner-benjamin Apr 28, 2026
e63d8da
Run TOML bench sequentially
warner-benjamin Apr 28, 2026
2fa7517
Route dynamic single runs through adapter
warner-benjamin Apr 28, 2026
829bf9d
Migrate orchestrate to TOML bench
warner-benjamin Apr 28, 2026
f631233
Discover deterministic eval outputs
warner-benjamin Apr 28, 2026
8ca8928
Preserve eval variants in processing
warner-benjamin Apr 28, 2026
cc8efc8
Convert benchmark configs to TOML
warner-benjamin Apr 28, 2026
dafc5b2
Remove YAML benchmark runner
warner-benjamin Apr 28, 2026
bd6e5d4
Document TOML benchmark workflow
warner-benjamin Apr 28, 2026
6d6b2ff
Allow namespaced orchestrate metadata
warner-benjamin Apr 28, 2026
167158d
Remove obsolete benchmark compatibility code
warner-benjamin May 6, 2026
1f6bae2
Add upstream eval config boundary
warner-benjamin May 7, 2026
520a793
Add bench output index
warner-benjamin May 7, 2026
8114ed1
Add bench index process discovery
warner-benjamin May 7, 2026
4a7b94a
Remove bench metadata monkey patch
warner-benjamin May 7, 2026
3fba5ad
Document bench sidecar workflow
warner-benjamin May 7, 2026
fc5e666
Rename eval suite configs
warner-benjamin May 7, 2026
fd8d6d1
Simplify eval output identity
warner-benjamin May 7, 2026
bb97586
Remove eval metadata helper
warner-benjamin May 7, 2026
78d8908
Default bench outputs to autoresume
warner-benjamin May 7, 2026
fd33fad
Add legacy raw run conversion script
warner-benjamin May 8, 2026
eb9be04
Convert process discovery tests to eval outputs
warner-benjamin May 8, 2026
927947f
Convert process pipeline tests to eval outputs
warner-benjamin May 8, 2026
309bf01
Remove runtime legacy manifest discovery
warner-benjamin May 8, 2026
6cd3202
Delete legacy manifest schema
warner-benjamin May 8, 2026
4efe563
Simplify process metadata normalization
warner-benjamin May 8, 2026
38d08b0
Update process docs for eval outputs
warner-benjamin May 8, 2026
8f540fe
Remove legacy manifest path helpers
warner-benjamin May 8, 2026
2d96b40
Add endpoint sampling profiles
warner-benjamin May 8, 2026
e560ed6
Preserve sampling extra_body precedence
warner-benjamin May 8, 2026
dbf2927
Upgrade verifiers and improve legacy conversion CLI
warner-benjamin May 8, 2026
2e23214
documentation pass
warner-benjamin May 8, 2026
0cdb63d
format benchmark configs
warner-benjamin May 9, 2026
0de3036
Add Medmarks endpoint aliases and smoke config
warner-benjamin May 11, 2026
cb1e6b7
Implement subprocess env lifecycle
warner-benjamin May 11, 2026
d54b2f0
Isolate missing bench env installs
warner-benjamin May 12, 2026
5ff2f65
Document isolated bench auto install
warner-benjamin May 12, 2026
652e084
Reuse empty bench output dirs
warner-benjamin May 12, 2026
6862224
Avoid reserved env task keys
warner-benjamin May 12, 2026
073c023
Add client-aware sampling sanitizers
warner-benjamin May 12, 2026
bf57af9
Derive sampling args from SDKs
warner-benjamin May 12, 2026
074cba9
Update legacy conversion output shape
warner-benjamin May 12, 2026
5d66040
Update docs for TOML bench behavior
warner-benjamin May 12, 2026
c428776
add strict answer matching option
warner-benjamin May 12, 2026
02efacf
Move Medmarks configs to top-level configs
warner-benjamin May 12, 2026
db2186e
Fix mtsamples data downloading
warner-benjamin May 12, 2026
0ecdc49
remove old configs
warner-benjamin May 12, 2026
1a0b228
update version for release
warner-benjamin May 12, 2026
7af9b4c
ruff format
warner-benjamin May 12, 2026
37cf5f2
small bug fixes
warner-benjamin May 12, 2026
7 changes: 6 additions & 1 deletion .gitignore
@@ -214,4 +214,9 @@ environments/healthbench/test*
.vscode/
pyrightconfig.json


.claude
.codex
.devcontainer
plans/
verifiers/
.gitmodules
10 changes: 5 additions & 5 deletions AGENTS.md
@@ -4,20 +4,20 @@

 - `medarc_verifiers/`: Core Python package (CLI entrypoints, parsers, rewards, orchestration utilities).
 - `environments/<env>/`: Individual Verifiers environments (each is a small Python package with `<env>.py` and its own `pyproject.toml`).
-- `configs/`: YAML configs for `medarc-eval bench` (job matrices, env configs, judge configs).
+- `configs/`: TOML configs for `medarc-eval bench`, endpoint registries, and environment/judge configs.
 - `docs/`: Usage docs for `medarc-eval` and related workflows.
 - `tests/`: `pytest` suite.

 ## Architecture Overview (Start Here)

 - **IMPORTANT: Read `docs/medarc-verifiers-architecture.md` before writing or modifying any code.**
 - Quick workflow: eval → process → winrate
-- raw outputs: `runs/raw/<run_id>/...`
+- eval outputs: `runs/evals/<model>/<env>/<variant>/...`
 - processed parquet: `runs/processed/<model>/<env>.parquet` + `runs/processed/env_index.json`
-- winrate outputs: `runs/winrate/latest.json` and `runs/winrate/latest.csv`
+- winrate outputs: `runs/processed/winrate/latest.json` and `runs/processed/winrate/latest.csv`
 - `medarc-eval` CLI entrypoint/router: (`medarc_verifiers/cli/main.py`; docs: `docs/medarc-eval.md`)
 - `medarc-orchestrate` CLI entrypoint: (`medarc_verifiers/orchestrate/cli.py`; docs: `docs/medarc-orchestrate.md`)
-- Batch resume/restart state lives in `runs/raw/<run_id>/run_manifest.json`
+- Old YAML-runner `runs/raw` artifacts must be converted with `scripts/convert_legacy_raw_runs.py` before processing.
 - Environment `load_environment()` params become CLI flags (see `medarc-eval <env> --help`).
 - Environment authoring utilities (used by `environments/*`):
 - parsing/prompts: `medarc_verifiers/parsers/`, `medarc_verifiers/prompts.py` (XML preferred; BOXED supported)
@@ -32,7 +32,7 @@
 - `uv pip install -e .`: Install `medarc-verifiers` in editable mode.
 - `vf-install <env>`: Install an environment from `environments/<env>/` in editable mode.
 - `uv run medarc-eval <ENV> -m <MODEL> -n 5`: Run a small evaluation.
-- `uv run medarc-eval bench --config configs/job.yaml`: Run a batch evaluation from a YAML config.
+- `uv run medarc-eval bench --config configs/medmarks-smoke.toml`: Run a batch evaluation from a TOML config.
 - `uv run pytest tests/`: Run the full test suite.
 - `uv run ruff check medarc_verifiers/ && uv run ruff format medarc_verifiers/`: Lint/format.

289 changes: 125 additions & 164 deletions README.md

Large diffs are not rendered by default.

50 changes: 50 additions & 0 deletions configs/README.md
@@ -0,0 +1,50 @@
# MedARC Eval TOML Configs

These configs use upstream `verifiers` TOML semantics. Repeated `env_id` entries
and `[[ablation]]` sweeps intentionally keep the upstream environment id stable;
`medarc-eval bench` writes deterministic variant directories for differing
`env_args` and `sampling_args`.

```bash
medarc-eval bench --config configs/medmarks-smoke.toml --dry-run
medarc-eval bench --config configs/medmarks-verified.toml
medarc-eval process --runs-dir runs/evals --output-dir runs/processed
```
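
As a rough illustration of what "deterministic variant directories" could mean, the sketch below derives a stable name from `env_args` and `sampling_args` by hashing a canonical JSON encoding. The helper `variant_dir_name` and the hashing scheme are assumptions made for this example; they are not taken from `medarc-eval bench`, whose actual naming may differ.

```python
import hashlib
import json


def variant_dir_name(env_args: dict, sampling_args: dict, length: int = 12) -> str:
    """Return a stable directory name for one (env_args, sampling_args) pair.

    Hypothetical helper: the naming scheme is an assumption for illustration,
    not the one `medarc-eval bench` uses.
    """
    # sort_keys plus fixed separators make the JSON independent of dict order.
    payload = json.dumps(
        {"env_args": env_args, "sampling_args": sampling_args},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return digest[:length]


# Same settings written in a different key order map to the same name.
a = variant_dir_name({"split": "test", "shuffle": True}, {"temperature": 0.0})
b = variant_dir_name({"shuffle": True, "split": "test"}, {"temperature": 0.0})
# Changing only sampling_args yields a different name under the same env_id.
c = variant_dir_name({"split": "test", "shuffle": True}, {"temperature": 0.7})
print(a == b, a == c)
```

The point of a content-derived name is that repeated `env_id` entries can share one upstream environment id while each distinct argument set still gets its own output location.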

Use `medmarks-endpoints.toml` when you want one of the Medmarks model aliases
and its sampling defaults:

```bash
medarc-eval bench \
--config configs/medmarks-verified.toml \
--endpoints-path configs/medmarks-endpoints.toml \
-m gpt-oss-20b-low \
--api-base-url https://api.pinference.ai/api/v1 \
--api-key-var PRIME_API_KEY \
--dry-run
```

`medmarks-endpoints.toml` is a portable alias registry. It maps endpoint IDs to
model IDs, client types, and sampling defaults, but intentionally omits `url`,
`key`, and `max_concurrent` because those are deployment-specific. Supply those
settings with `--provider` or with `--api-base-url` and `--api-key-var`.
The gpt-oss aliases use the Verifiers `openai_responses` client type.

For a local vLLM server exposing an OpenAI-compatible API, keep using the same
alias registry and override only the deployment settings:

```bash
VLLM_API_KEY=local-key medarc-eval bench \
--config configs/medmarks-verified.toml \
--endpoints-path configs/medmarks-endpoints.toml \
-m gpt-oss-20b-low \
--api-base-url http://127.0.0.1:8000/v1 \
--api-key-var VLLM_API_KEY \
--dry-run
```

Per-environment `[tool.verifiers.eval]` defaults are read from editable installs
where the environment `pyproject.toml` is discoverable next to the module. Wheel
installs may ignore those defaults unless the package includes `pyproject.toml`,
so production suite configs keep explicit `num_examples` and
`rollouts_per_example` values.
4 changes: 4 additions & 0 deletions configs/endpoints.toml
@@ -0,0 +1,4 @@
# Default upstream verifiers endpoint registry.
#
# Add [[endpoint]] entries here to resolve endpoint_id aliases. An empty registry
# is valid; provider/model defaults are used when no alias matches.
41 changes: 0 additions & 41 deletions configs/envs/agentclinic.yaml

This file was deleted.

17 changes: 0 additions & 17 deletions configs/envs/careqa_en.yaml

This file was deleted.

11 changes: 0 additions & 11 deletions configs/envs/careqa_open.yaml

This file was deleted.

4 changes: 0 additions & 4 deletions configs/envs/head_qa_v2.yaml

This file was deleted.

9 changes: 0 additions & 9 deletions configs/envs/healthbench.yaml

This file was deleted.

42 changes: 0 additions & 42 deletions configs/envs/longhealth.yaml

This file was deleted.

14 changes: 0 additions & 14 deletions configs/envs/m_arc.yaml

This file was deleted.

10 changes: 0 additions & 10 deletions configs/envs/med_dialog.yaml

This file was deleted.

8 changes: 0 additions & 8 deletions configs/envs/med_halt.yaml

This file was deleted.

16 changes: 0 additions & 16 deletions configs/envs/med_mcqa.yaml

This file was deleted.

6 changes: 0 additions & 6 deletions configs/envs/medagentbench.yaml

This file was deleted.

6 changes: 0 additions & 6 deletions configs/envs/medagentbenchv2.yaml

This file was deleted.

20 changes: 0 additions & 20 deletions configs/envs/medbullets.yaml

This file was deleted.

21 changes: 0 additions & 21 deletions configs/envs/medcalc_bench.yaml

This file was deleted.

7 changes: 0 additions & 7 deletions configs/envs/medcasereasoning.yaml

This file was deleted.

23 changes: 0 additions & 23 deletions configs/envs/medconceptsqa_sample.yaml

This file was deleted.

9 changes: 0 additions & 9 deletions configs/envs/medec.yaml

This file was deleted.
