MedARC-AI · warner-benjamin · May 12, 2026 · Apr 28, 2026
diff --git a/.gitignore b/.gitignore
@@ -214,4 +214,9 @@ environments/healthbench/test*
 .vscode/
 pyrightconfig.json
 
-
+.claude
+.codex
+.devcontainer
+plans/
+verifiers/
+.gitmodules
diff --git a/AGENTS.md b/AGENTS.md
@@ -4,20 +4,20 @@
 
 - `medarc_verifiers/`: Core Python package (CLI entrypoints, parsers, rewards, orchestration utilities).
 - `environments/<env>/`: Individual Verifiers environments (each is a small Python package with `<env>.py` and its own `pyproject.toml`).
-- `configs/`: YAML configs for `medarc-eval bench` (job matrices, env configs, judge configs).
+- `configs/`: TOML configs for `medarc-eval bench`, endpoint registries, and environment/judge configs.
 - `docs/`: Usage docs for `medarc-eval` and related workflows.
 - `tests/`: `pytest` suite.
 
 ## Architecture Overview (Start Here)
 
 - **IMPORTANT: Read `docs/medarc-verifiers-architecture.md` before writing or modifying any code.**
 - Quick workflow: eval → process → winrate
-  - raw outputs: `runs/raw/<run_id>/...`
+  - eval outputs: `runs/evals/<model>/<env>/<variant>/...`
   - processed parquet: `runs/processed/<model>/<env>.parquet` + `runs/processed/env_index.json`
-  - winrate outputs: `runs/winrate/latest.json` and `runs/winrate/latest.csv`
+  - winrate outputs: `runs/processed/winrate/latest.json` and `runs/processed/winrate/latest.csv`
 - `medarc-eval` CLI entrypoint/router: (`medarc_verifiers/cli/main.py`; docs: `docs/medarc-eval.md`)
 - `medarc-orchestrate` CLI entrypoint: (`medarc_verifiers/orchestrate/cli.py`; docs: `docs/medarc-orchestrate.md`)
-- Batch resume/restart state lives in `runs/raw/<run_id>/run_manifest.json`
+- Old YAML-runner `runs/raw` artifacts must be converted with `scripts/convert_legacy_raw_runs.py` before processing.
 - Environment `load_environment()` params become CLI flags (see `medarc-eval <env> --help`).
 - Environment authoring utilities (used by `environments/*`):
   - parsing/prompts: `medarc_verifiers/parsers/`, `medarc_verifiers/prompts.py` (XML preferred; BOXED supported)
@@ -32,7 +32,7 @@
 - `uv pip install -e .`: Install `medarc-verifiers` in editable mode.
 - `vf-install <env>`: Install an environment from `environments/<env>/` in editable mode.
 - `uv run medarc-eval <ENV> -m <MODEL> -n 5`: Run a small evaluation.
-- `uv run medarc-eval bench --config configs/job.yaml`: Run a batch evaluation from a YAML config.
+- `uv run medarc-eval bench --config configs/medmarks-smoke.toml`: Run a batch evaluation from a TOML config.
 - `uv run pytest tests/`: Run the full test suite.
 - `uv run ruff check medarc_verifiers/ && uv run ruff format medarc_verifiers/`: Lint/format.
 

diff --git a/README.md b/README.md
diff --git a/configs/README.md b/configs/README.md
@@ -0,0 +1,50 @@
+# MedARC Eval TOML Configs
+
+These configs use upstream `verifiers` TOML semantics. Repeated `env_id` entries
+and `[[ablation]]` sweeps intentionally keep the upstream environment id stable;
+`medarc-eval bench` writes deterministic variant directories for differing
+`env_args` and `sampling_args`.
+
+```bash
+medarc-eval bench --config configs/medmarks-smoke.toml --dry-run
+medarc-eval bench --config configs/medmarks-verified.toml
+medarc-eval process --runs-dir runs/evals --output-dir runs/processed
+```
+
+Use `medmarks-endpoints.toml` when you want one of the Medmarks model aliases
+and its sampling defaults:
+
+```bash
+medarc-eval bench \
+  --config configs/medmarks-verified.toml \
+  --endpoints-path configs/medmarks-endpoints.toml \
+  -m gpt-oss-20b-low \
+  --api-base-url https://api.pinference.ai/api/v1 \
+  --api-key-var PRIME_API_KEY \
+  --dry-run
+```
+
+`medmarks-endpoints.toml` is a portable alias registry. It maps endpoint IDs to
+model IDs, client types, and sampling defaults, but intentionally omits `url`,
+`key`, and `max_concurrent` because those are deployment-specific. Supply those
+settings with `--provider` or with `--api-base-url` and `--api-key-var`.
+The gpt-oss aliases use the Verifiers `openai_responses` client type.
+
+For a local vLLM server exposing an OpenAI-compatible API, keep using the same
+alias registry and override only the deployment settings:
+
+```bash
+VLLM_API_KEY=local-key medarc-eval bench \
+  --config configs/medmarks-verified.toml \
+  --endpoints-path configs/medmarks-endpoints.toml \
+  -m gpt-oss-20b-low \
+  --api-base-url http://127.0.0.1:8000/v1 \
+  --api-key-var VLLM_API_KEY \
+  --dry-run
+```
+
+Per-environment `[tool.verifiers.eval]` defaults are read from editable installs
+where the environment `pyproject.toml` is discoverable next to the module. Wheel
+installs may ignore those defaults unless the package includes `pyproject.toml`,
+so production suite configs keep explicit `num_examples` and
+`rollouts_per_example` values.
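
Note on the "deterministic variant directories" described at the top of this README: the diff does not show how `medarc-eval bench` derives the directory name, so the following is only a minimal sketch of one common approach, assuming the name is a hash of the canonicalized `env_args` and `sampling_args`. The helper `variant_dir_name`, the `variant-` prefix, and the 12-character digest length are all hypothetical and are not taken from the repository.

```python
import hashlib
import json


def variant_dir_name(env_args: dict, sampling_args: dict, length: int = 12) -> str:
    """Hypothetical helper: derive a stable directory name from run arguments."""
    # Sorting keys makes the serialized form independent of dict insertion order,
    # so the same arguments always hash to the same name.
    payload = json.dumps(
        {"env_args": env_args, "sampling_args": sampling_args},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"variant-{digest[:length]}"


# Same arguments in a different key order map to the same directory name.
a = variant_dir_name({"split": "test", "shuffle": False}, {"temperature": 0.0})
b = variant_dir_name({"shuffle": False, "split": "test"}, {"temperature": 0.0})
# Changing a sampling argument yields a different directory name.
c = variant_dir_name({"split": "test", "shuffle": False}, {"temperature": 0.7})
```

Under this assumption, repeated `env_id` entries that differ only in their arguments would land in separate directories beneath the same `<env>` path, while a rerun with identical arguments would resolve to the existing directory.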
diff --git a/configs/endpoints.toml b/configs/endpoints.toml
@@ -0,0 +1,4 @@
+# Default upstream verifiers endpoint registry.
+#
+# Add [[endpoint]] entries here to resolve endpoint_id aliases. An empty registry
+# is valid; provider/model defaults are used when no alias matches.
diff --git a/configs/envs/agentclinic.yaml b/configs/envs/agentclinic.yaml
diff --git a/configs/envs/careqa_en.yaml b/configs/envs/careqa_en.yaml
diff --git a/configs/envs/careqa_open.yaml b/configs/envs/careqa_open.yaml
diff --git a/configs/envs/head_qa_v2.yaml b/configs/envs/head_qa_v2.yaml
diff --git a/configs/envs/healthbench.yaml b/configs/envs/healthbench.yaml
diff --git a/configs/envs/longhealth.yaml b/configs/envs/longhealth.yaml
diff --git a/configs/envs/m_arc.yaml b/configs/envs/m_arc.yaml
diff --git a/configs/envs/med_dialog.yaml b/configs/envs/med_dialog.yaml
diff --git a/configs/envs/med_halt.yaml b/configs/envs/med_halt.yaml
diff --git a/configs/envs/med_mcqa.yaml b/configs/envs/med_mcqa.yaml
diff --git a/configs/envs/medagentbench.yaml b/configs/envs/medagentbench.yaml
diff --git a/configs/envs/medagentbenchv2.yaml b/configs/envs/medagentbenchv2.yaml
diff --git a/configs/envs/medbullets.yaml b/configs/envs/medbullets.yaml
diff --git a/configs/envs/medcalc_bench.yaml b/configs/envs/medcalc_bench.yaml
diff --git a/configs/envs/medcasereasoning.yaml b/configs/envs/medcasereasoning.yaml
diff --git a/configs/envs/medconceptsqa_sample.yaml b/configs/envs/medconceptsqa_sample.yaml
diff --git a/configs/envs/medec.yaml b/configs/envs/medec.yaml