Razorback is a Python CLI for reproducible agentic benchmark research on top of Harbor. It turns benchmark runs into defensible numbers: freeze the spec, run the job, score the result, audit for leakage, and inspect run-dir artifacts.
- rk freeze pins runtime inputs into a frozen spec and provenance.
- rk run translates a frozen Razorback spec into a Harbor job and writes canonical run-dir artifacts.
- rk score computes pass@1, Wilson intervals, and stratified summaries.
- rk audit scans traces for forbidden data acquisition and coverage gaps.
- razorback-plugin-dab emits Harbor task directories for DataAgentBench.
- spacedock_solver is the sealed Spacedock first-officer solver wrapper for runtime adapters such as Claude and Codex.
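As an illustration of the interval that rk score reports, here is a minimal sketch of the standard Wilson score interval for a pass rate. It is not Razorback's implementation: the function name wilson_interval, its arguments, and the default z = 1.96 (roughly a 95% interval) are assumptions made for this example, and pass@1 is treated simply as the fraction of tasks that passed on a single attempt.

```python
import math


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Hypothetical helper for illustration; not part of the rk CLI.
    z = 1.96 corresponds to an approximately 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must lie between 0 and trials")

    p_hat = passes / trials  # observed pass rate (pass@1 with one attempt per task)
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(passes=5, trials=10)
    print(f"pass@1 = 0.50, interval = [{low:.4f}, {high:.4f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when the pass rate is near 0 or 1, which matters for small task sets.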
src/razorback/                   core CLI, agents, provenance, scoring, audit
packages/razorback-plugin-dab/   DAB task generator plugin
examples/specs/                  example and matrix specs
examples/drivers/                matrix dispatch and aggregation scripts
examples/solver_workflows/       solver workflow READMEs
docs/razorback-implementation/   active Spacedock implementation workflow
tests/                           unit and integration tests
Razorback ships two workflow READMEs as in-package data under
src/razorback/templates/:
- src/razorback/templates/experiment-workflow/README.md: six-stage experiment workflow (pending → propose → smoke → full → analyze → conclude) for benchmark-result research.
- src/razorback/templates/run-workflow/README.md: four-stage run workflow (pending → reconciling → completed → failed) for single-run reconciliation.
Usage is copy-and-modify: copy a template directory into a new research repo (or alongside a frozen spec) and edit the per-stage prompts in place.
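The copy step can be done by hand; as a minimal sketch, it could also be scripted with the standard library. The helper name copy_workflow_template and the destination path in the usage note are hypothetical, and Razorback itself does not provide this function.

```python
import shutil
from pathlib import Path


def copy_workflow_template(template_dir: str, dest_dir: str) -> Path:
    """Copy a workflow template directory and return the path of the copied README.

    Hypothetical helper for illustration; not part of Razorback.
    """
    src = Path(template_dir)
    dest = Path(dest_dir)
    if not (src / "README.md").is_file():
        raise FileNotFoundError(f"no README.md found in template directory {src}")
    # copytree raises FileExistsError if dest already exists, so a second
    # copy cannot silently overwrite prompts that were already edited.
    shutil.copytree(src, dest)
    return dest / "README.md"
```

For example, copy_workflow_template("src/razorback/templates/experiment-workflow", "../my-research/experiment-workflow") would create the copy in a sibling research repo (the destination is a made-up path), after which the per-stage prompts in the copied README can be edited in place.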
uv sync
uv run pytest
DAB data is external to this repo. Point specs at the local DataAgentBench data checkout on the machine running the benchmark.
uv run rk freeze examples/specs/bookreview-claude-harbor-dab.yaml
uv run rk run examples/specs/bookreview-claude-harbor-dab.frozen.yaml --explain
uv run rk run examples/specs/bookreview-claude-harbor-dab.frozen.yaml --runs-dir runs/smoke
uv run rk score <run-dir> --format json
uv run rk audit <run-dir> --policy strict --format json
rk run writes one run-dir per (spec, job) under a base "runs-dir":
- Default: $RAZORBACK_RUNS_DIR if set; else $XDG_DATA_HOME/razorback/runs if set; else ~/.local/share/razorback/runs.
- Override: pass --runs-dir <path> to rk run.
The default lives OUTSIDE your git worktree on purpose: git worktree remove --force cannot destroy experiment outputs written there. If you pin a
worktree-relative path (--runs-dir _runs, --runs-dir runs/), the outputs
share the worktree's fate and are deleted along with it.
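The resolution order for the base runs-dir described above can be sketched as follows. This is an illustration under stated assumptions, not the code rk run uses: the function name resolve_runs_dir and its parameters are hypothetical, and an empty environment variable is treated as unset.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_runs_dir(
    cli_override: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the base runs-dir: --runs-dir, then env vars, then the home default.

    Hypothetical helper for illustration; not Razorback's implementation.
    """
    env = os.environ if env is None else env

    # 1. An explicit --runs-dir <path> wins over everything else.
    if cli_override:
        return Path(cli_override)
    # 2. $RAZORBACK_RUNS_DIR, if set (assumption: empty string counts as unset).
    if env.get("RAZORBACK_RUNS_DIR"):
        return Path(env["RAZORBACK_RUNS_DIR"])
    # 3. $XDG_DATA_HOME/razorback/runs, if XDG_DATA_HOME is set.
    if env.get("XDG_DATA_HOME"):
        return Path(env["XDG_DATA_HOME"]) / "razorback" / "runs"
    # 4. Fallback: ~/.local/share/razorback/runs.
    return Path.home() / ".local" / "share" / "razorback" / "runs"
```

Passing the environment as a mapping keeps the sketch easy to test without mutating os.environ.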
The active goal is to produce N=1 full-dataset benchmark numbers for
DAB and ade-bench using Codex. The first dependency is implementing
the Codex runtime adapter for spacedock_solver; DAB and
ade-bench score matrices build on that surface.