BenchFlow Benchmarks

Ready-to-run benchmark task directories and parity experiments for BenchFlow.

This repo holds the output of each benchmark's converter and validation pipeline: converted tasks and parity results. Conversion code, parity tests, and benchmark.yaml descriptors live in the main benchflow repo under benchmarks/<name>/.

Structure

datasets/
├── harvey-lab/
│   ├── benchmark.yaml            # standard benchmark descriptor
│   ├── README.md
│   ├── parity/
│   │   └── parity_experiment.json # side-by-side parity results
│   └── tasks/
│       ├── <task-id>/
│       │   ├── task.toml
│       │   ├── instruction.md
│       │   ├── environment/
│       │   │   └── Dockerfile
│       │   └── tests/
│       │       ├── test.sh
│       │       └── evaluate.py
│       └── ...
├── programbench/
│   ├── parity/
│   │   └── parity_experiment.json  # agent + pipeline parity results
│   └── tasks/
│       ├── <owner>__<repo>.<commit>/
│       │   ├── task.toml
│       │   ├── instruction.md
│       │   ├── environment/
│       │   │   └── Dockerfile
│       │   ├── tests/
│       │   │   ├── test.sh
│       │   │   ├── verify.py
│       │   │   └── tests.json
│       │   └── solution/
│       │       └── solve.sh
│       └── ...
└── <next-benchmark>/
    └── ...

Each task directory follows the BenchFlow task format:

task.toml — task metadata and configuration
instruction.md — agent instructions
environment/Dockerfile — container setup
tests/test.sh — verifier entry point
tests/evaluate.py — evaluation logic (for LLM-as-judge benchmarks)
tests/verify.py — evaluation logic (for compile + unit test benchmarks)
solution/solve.sh — oracle solution (when available)
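
The file list above can be checked mechanically before running a task. The following is a minimal sketch, not part of BenchFlow itself: the function name `check_task_dir` and the split into required and optional files are assumptions drawn from the layout shown in this README, where `evaluate.py`, `verify.py`, and `solve.sh` appear only for some benchmarks.

```python
from pathlib import Path

# Files that every task directory in the layout above contains.
REQUIRED = ("task.toml", "instruction.md", "environment/Dockerfile", "tests/test.sh")
# Files that appear only for some benchmarks (assumed optional here).
OPTIONAL = ("tests/evaluate.py", "tests/verify.py", "solution/solve.sh")


def check_task_dir(task_dir):
    """Return (missing_required, present_optional) for one task directory."""
    root = Path(task_dir)
    missing = [rel for rel in REQUIRED if not (root / rel).is_file()]
    present = [rel for rel in OPTIONAL if (root / rel).is_file()]
    return missing, present
```

An empty first list means the directory has every file the format lists as unconditional. The second list shows which verifier or oracle files are present.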

Available Datasets

Dataset        Tasks   Verification           Parity            Source
harvey-lab     1,251   LLM-as-judge           25/25 (100%)      Harvey LAB
programbench   201     compile + unit tests   4/5 exact match   ProgramBench

Usage

# Clone and run with BenchFlow
git clone https://github.com/benchflow-ai/benchmarks.git
cd benchmarks
benchflow run -t datasets/harvey-lab/tasks/<task-id>
benchflow run -t datasets/programbench/tasks/<task-id>

# Or use from the benchflow repo with auto-download
python benchmarks/harvey-lab/run_harvey_lab.py
python benchmarks/run_programbench.py
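
To run more than one task from a cloned copy, the `benchflow run -t` command above can be wrapped in a small loop. This is a hedged sketch rather than an official runner: the helper names `task_commands` and `run_all` are hypothetical, it assumes `benchflow` is on your PATH, and it assumes a zero exit code indicates a successful run, which this README does not confirm.

```python
import subprocess
from pathlib import Path


def task_commands(dataset_dir):
    """Build one `benchflow run -t <task-dir>` argv per task under <dataset_dir>/tasks."""
    tasks_root = Path(dataset_dir) / "tasks"
    task_dirs = sorted(p for p in tasks_root.iterdir() if p.is_dir())
    return [["benchflow", "run", "-t", str(p)] for p in task_dirs]


def run_all(dataset_dir, limit=None):
    """Run tasks one at a time and return {task name: exit code}."""
    results = {}
    for argv in task_commands(dataset_dir)[:limit]:
        completed = subprocess.run(argv)  # inherits stdout/stderr from the caller
        results[Path(argv[-1]).name] = completed.returncode
    return results
```

For example, `run_all("datasets/programbench", limit=3)` would run the first three task directories in sorted order. Keeping `limit` small is a sensible default given the dataset sizes listed above.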

Adding a Dataset

1. Write a converter in benchflow-ai/benchflow/benchmarks/<name>/
2. Run parity tests and record results
3. Generate all tasks and add them under datasets/<name>/tasks/
4. Add parity results under datasets/<name>/parity/
5. Submit a PR

See CONVERT.md in the main repo for detailed instructions.
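
Before submitting a PR, the layout produced by the generate-tasks and add-parity-results steps can be sanity-checked locally. The sketch below is an illustration only: `check_dataset` is a hypothetical helper, and the file name `parity_experiment.json` is taken from the directory tree in this README. It checks only that the file parses as JSON and assumes nothing about its schema.

```python
import json
from pathlib import Path


def check_dataset(dataset_dir):
    """Return a list of layout problems in datasets/<name>/; an empty list means none found."""
    root = Path(dataset_dir)
    problems = []

    # Generated tasks should live in at least one subdirectory of tasks/.
    tasks_root = root / "tasks"
    if not tasks_root.is_dir():
        problems.append("missing tasks/ directory")
    elif not any(p.is_dir() for p in tasks_root.iterdir()):
        problems.append("tasks/ contains no task directories")

    # Parity results should exist and be parseable JSON.
    parity_file = root / "parity" / "parity_experiment.json"
    if not parity_file.is_file():
        problems.append("missing parity/parity_experiment.json")
    else:
        try:
            json.loads(parity_file.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append("parity/parity_experiment.json is not valid JSON")

    return problems
```

Calling `check_dataset("datasets/<name>")` with your dataset's directory returns human-readable messages that can be printed or turned into a non-zero exit status in a pre-commit script.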
