Releases · facebookresearch/ProgramBench

18 Jun 21:34

klieret

v1.1.0

ede4bdb

v1.1.0 Latest

What's Changed

This release fixes several issues with the eval harness. If you are evaluating on ProgramBench, we strongly recommend updating. Most fixes should not require rerunning agents. The exception is a small loophole described in #45 and #14 (first raised by suche-ux in #14), which is fixed by the new docker images (#46). Annotating existing agent trajectories should make it easy to flag which instances were affected.

Fix(eval): block build-script internet for submissions by @klieret in #41
Fix(eval): Ignore flaky and otherwise unsuitable tests by @klieret in #40
Fix(eval): evaluate in :task_cleanroom images by @klieret in #42
Fix(eval): default to v6 docker images by @klieret in #46

New Contributors

@dependabot[bot] made their first contribution in #26
@arpitjain099 made their first contribution in #24
@klieret made their first contribution in #28
@yurekami made their first contribution in #29

Full Changelog: v1.0.2...v1.1.0

Contributors

arpitjain099, klieret, and 2 other contributors

11 May 16:58

klieret

v1.0.2

b33e660

v1.0.2

This minor release ignores ~30 tests that caused hangs when evaluating incorrect solutions.

Full Changelog: v1.0.1...v1.0.2

07 May 12:45

klieret

v1.0.1

1fe64c8

v1.0.1

What's Changed

Fix: stderr messages can corrupt XML coverage report (#5), thanks for the report @darshanmakwana412
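
As an illustration of this class of bug (not the project's actual fix, which is in #5), the sketch below shows how merging stderr into stdout can break an XML report, and how capturing the two streams separately avoids it. The child script, the `run_report` helper, and the `<coverage>` element are hypothetical stand-ins chosen for the example.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Hypothetical child process: writes a warning to stderr and a tiny
# XML report to stdout. It stands in for any tool that emits XML.
CHILD = (
    "import sys\n"
    "sys.stderr.write('warning: something noisy\\n')\n"
    "sys.stdout.write('<coverage line-rate=\"0.5\"></coverage>')\n"
)


def run_report(merge_streams: bool) -> str:
    """Run the child and return the text that would be parsed as XML."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        # Merging stderr into stdout mixes the warning into the report.
        stderr=subprocess.STDOUT if merge_streams else subprocess.PIPE,
        text=True,
        check=True,
    )
    return result.stdout


def parses_as_xml(text: str) -> bool:
    """Return True if the text is a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


merged_ok = parses_as_xml(run_report(merge_streams=True))
separate_ok = parses_as_xml(run_report(merge_streams=False))
print(merged_ok, separate_ok)
```

Keeping the streams apart means diagnostic output can still be logged without ever reaching the XML parser.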

New Contributors

@eltociear made their first contribution in #8

Full Changelog: v1.0.0...v1.0.1

Contributors

eltociear and darshanmakwana412

05 May 14:31

klieret

v1.0.0

2803dcc

ProgramBench 🦊

How much of SQLite, FFmpeg, or the PHP compiler can Opus 4.7 rebuild from scratch, given just an executable and no starter code or internet access?

Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end.
