Releases: facebookresearch/ProgramBench
v1.1.0
What's Changed
This release fixes several issues with the eval harness. If you are evaluating on ProgramBench, we strongly recommend updating. Most fixes should not require rerunning agents; the exception is a small loophole described in #45 and #14 (first raised by suche-ux in #14), which is fixed by new docker images (#46). Annotating existing agent trajectories should make it easy to flag which instances were affected.
- Fix(eval): block build-script internet for submissions by @klieret in #41
- Fix(eval): Ignore flaky and otherwise unsuitable tests by @klieret in #40
- Fix(eval): evaluate in :task_cleanroom images by @klieret in #42
- Fix(eval): default to v6 docker images by @klieret in #46
New Contributors
- @dependabot[bot] made their first contribution in #26
- @arpitjain099 made their first contribution in #24
- @klieret made their first contribution in #28
- @yurekami made their first contribution in #29
Full Changelog: v1.0.2...v1.1.0
v1.0.2
This minor release ignores ~30 tests that caused hangs when evaluating incorrect solutions.
Full Changelog: v1.0.1...v1.0.2
v1.0.1
What's Changed
- Fix: stderr messages can corrupt XML coverage report (#5), thanks for the report @darshanmakwana412
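To illustrate the class of problem behind the fix above, here is a minimal sketch of how stderr output can corrupt an XML report when both streams are captured together. It is not the harness's actual code: the child command, the `<coverage>` element, and the helper names `run_child` and `parses_as_xml` are all hypothetical and exist only for this example.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Hypothetical child process: writes a warning to stderr and an XML report to stdout.
CHILD = (
    "import sys;"
    "sys.stderr.write('warning: something noisy\\n');"
    "sys.stdout.write('<coverage line-rate=\"0.5\"/>')"
)


def run_child(merge_stderr: bool) -> str:
    """Run the child and return captured stdout, optionally with stderr merged in."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        # Merging stderr into stdout mixes the warning text into the XML payload.
        stderr=subprocess.STDOUT if merge_stderr else subprocess.PIPE,
        text=True,
    )
    return proc.stdout


def parses_as_xml(text: str) -> bool:
    """Return True if the text is a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


separate = run_child(merge_stderr=False)
merged = run_child(merge_stderr=True)
```

With the streams kept separate, `parses_as_xml(separate)` succeeds. With them merged, the warning text sits outside the root element and the parse fails, which is why a report should be read from a stream that nothing else writes to.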
New Contributors
- @eltociear made their first contribution in #8
Full Changelog: v1.0.0...v1.0.1
ProgramBench 🦊
How much of SQLite, FFmpeg, or the PHP compiler can Opus 4.7 rebuild from scratch, given just an executable and no starter code or internet access?
Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end.
Read more: https://programbench.com/
