Ready-to-run benchmark task directories and parity experiments for BenchFlow.
This repo contains converted tasks and parity results — the output of running each benchmark's converter and validation pipeline. Conversion code, parity tests, and benchmark.yaml descriptors live in the main benchflow repo under benchmarks/<name>/.
datasets/
├── harvey-lab/
│   ├── benchmark.yaml              # standard benchmark descriptor
│   ├── README.md
│   ├── parity/
│   │   └── parity_experiment.json  # side-by-side parity results
│   └── tasks/
│       ├── <task-id>/
│       │   ├── task.toml
│       │   ├── instruction.md
│       │   ├── environment/
│       │   │   └── Dockerfile
│       │   └── tests/
│       │       ├── test.sh
│       │       └── evaluate.py
│       └── ...
├── programbench/
│   ├── parity/
│   │   └── parity_experiment.json  # agent + pipeline parity results
│   └── tasks/
│       ├── <owner>__<repo>.<commit>/
│       │   ├── task.toml
│       │   ├── instruction.md
│       │   ├── environment/
│       │   │   └── Dockerfile
│       │   ├── tests/
│       │   │   ├── test.sh
│       │   │   ├── verify.py
│       │   │   └── tests.json
│       │   └── solution/
│       │       └── solve.sh
│       └── ...
└── <next-benchmark>/
    └── ...
Each task directory follows the BenchFlow task format:
- task.toml: task metadata and configuration
- instruction.md: agent instructions
- environment/Dockerfile: container setup
- tests/test.sh: verifier entry point
- tests/evaluate.py: evaluation logic (for LLM-as-judge benchmarks)
- tests/verify.py: evaluation logic (for compile + unit test benchmarks)
- solution/solve.sh: oracle solution (when available)
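The task format above lends itself to a quick structural check. The following is a minimal sketch, not part of BenchFlow: `missing_task_files` and `REQUIRED_FILES` are hypothetical names, and the sketch assumes that task.toml, instruction.md, environment/Dockerfile, and tests/test.sh are present in every task, while the evaluator scripts and solution/solve.sh vary by benchmark.

```python
import sys
from pathlib import Path

# Files that every task appears to share, based on the format list above.
# This is an assumption for illustration, not an official BenchFlow schema.
REQUIRED_FILES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
)


def missing_task_files(task_dir):
    """Return the required relative paths that are absent from task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Usage: python check_task.py datasets/harvey-lab/tasks/<task-id> ...
    for arg in sys.argv[1:]:
        missing = missing_task_files(arg)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{arg}: {status}")
```

Returning the list of missing paths, rather than a boolean, makes it easy to report exactly what a converter failed to emit.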
| Dataset | Tasks | Verification | Parity | Source |
|---|---|---|---|---|
| harvey-lab | 1,251 | LLM-as-judge | 25/25 (100%) | Harvey LAB |
| programbench | 201 | compile + unit tests | 4/5 exact match | ProgramBench |
# Clone and run with BenchFlow
git clone https://github.com/benchflow-ai/benchmarks.git
cd benchmarks
benchflow run -t datasets/harvey-lab/tasks/<task-id>
benchflow run -t datasets/programbench/tasks/<task-id>
# Or use from the benchflow repo with auto-download
python benchmarks/harvey-lab/run_harvey_lab.py
python benchmarks/run_programbench.py

To add a new benchmark:

- Write a converter in benchflow-ai/benchflow/benchmarks/<name>/
- Run parity tests and record results
- Generate all tasks and add them under datasets/<name>/tasks/
- Add parity results under datasets/<name>/parity/
- Submit a PR
See CONVERT.md in the main repo for detailed instructions.
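Before submitting, it may help to confirm that a new dataset directory matches the layout shown in the tree earlier. The sketch below is illustrative only: `check_dataset_layout` is a hypothetical helper, and it assumes a new dataset uses the same parity/parity_experiment.json file name as the existing ones. It checks only that the file exists and parses as JSON, since the schema of the parity file is not described here.

```python
import json
from pathlib import Path


def check_dataset_layout(dataset_dir):
    """Return a list of problems found; an empty list means the layout looks right."""
    root = Path(dataset_dir)
    problems = []

    # datasets/<name>/tasks/ should contain at least one task directory.
    tasks_dir = root / "tasks"
    task_dirs = [p for p in tasks_dir.iterdir() if p.is_dir()] if tasks_dir.is_dir() else []
    if not task_dirs:
        problems.append("tasks/ is missing or contains no task directories")

    # File name assumed from the existing datasets; only JSON validity is checked.
    parity_file = root / "parity" / "parity_experiment.json"
    if not parity_file.is_file():
        problems.append("parity/parity_experiment.json is missing")
    else:
        try:
            json.loads(parity_file.read_text(encoding="utf-8"))
        except ValueError as exc:  # covers JSONDecodeError and bad encodings
            problems.append(f"parity/parity_experiment.json is not valid JSON: {exc}")

    return problems


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        issues = check_dataset_layout(name)
        print(f"{name}: {'ok' if not issues else '; '.join(issues)}")
```

For example, `python check_layout.py datasets/<name>` would print either `ok` or the list of problems for that dataset (the script name is arbitrary).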