Fix(eval): default to v6 docker images #46
Merged
Make the v6 image set the default everywhere eval resolves a tag: the `eval` CLI `--image-tag`, the `Evaluator.evaluate` default, and both `eval_batch` entry points now default to `task_cleanroom_v6` (was `task_cleanroom`). Update docs/README.md to point inference users at the `task_cleanroom_v6` / `task_v6` tags.

Tags are otherwise unchanged: `--image-tag task_v6` still selects the full build environment, and explicit overrides are passed through verbatim.

Internal-reference-commit: a92ae6227464d6c1dbff015e6003fb70790760db
Internal-reference-commit: 22e92e67b0a399d7b9dc2f612c5f5eedec1c80cd
Internal-reference-commit: 46518a07432af49a4b86777bc2ee8b50f7548bc7
Pull request overview
This PR updates ProgramBench’s evaluation tooling and usage docs to default to the v6 Docker image tags (notably task_cleanroom_v6) wherever an image tag is implicitly selected, aligning defaults with the hardened cleanroom images referenced in issues #45/#14.
Changes:
- Default eval image tag switched from `task_cleanroom` to `task_cleanroom_v6` in `Evaluator`, `eval_batch`, and the `programbench eval` CLI option default.
- CLI help text updated to reference `task_v6` as the explicit "full build environment" override.
- Docs updated to point inference users at the `task_cleanroom_v6` / `task_v6` tags.
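The default-versus-override behavior listed above can be sketched roughly as follows. This is an illustration only, not the code from the PR: `DEFAULT_IMAGE_TAG` and `resolve_image_tag` are hypothetical names, and only the tag strings are taken from the description.

```python
from typing import Optional

# Hypothetical constant; the value is the new default named in this PR
# (the previous default was "task_cleanroom").
DEFAULT_IMAGE_TAG = "task_cleanroom_v6"


def resolve_image_tag(image_tag: Optional[str] = None) -> str:
    """Pick the image tag to evaluate against.

    Hypothetical helper: with no explicit tag, fall back to the v6
    cleanroom default; otherwise return the caller's value unchanged.
    """
    if image_tag is None:
        return DEFAULT_IMAGE_TAG
    # Explicit overrides are passed through verbatim, with no rewriting
    # of older tags such as "task_cleanroom".
    return image_tag
```

Under these assumptions, `resolve_image_tag()` yields the v6 cleanroom tag, while `resolve_image_tag("task_v6")` still selects the full build environment.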
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/programbench/eval/eval.py | Updates Evaluator default image_tag to task_cleanroom_v6. |
| src/programbench/eval/eval_batch.py | Updates eval_batch entrypoint defaults to task_cleanroom_v6. |
| src/programbench/cli/main.py | Updates --image-tag default and adjusts help text toward v6 tags. |
| docs/README.md | Updates user-facing docs/links to v6 tags for inference/evaluation guidance. |
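For the CLI row in particular, a minimal sketch of an `eval` subcommand whose `--image-tag` option defaults to the v6 cleanroom tag might look like this. It uses `argparse` purely for illustration; which CLI framework `src/programbench/cli/main.py` actually uses is not stated here, and `build_parser` and the help wording are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a stand-in parser for `programbench eval` (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="programbench")
    subcommands = parser.add_subparsers(dest="command", required=True)

    eval_cmd = subcommands.add_parser("eval")
    eval_cmd.add_argument(
        "--image-tag",
        default="task_cleanroom_v6",  # new default; was "task_cleanroom"
        help="Docker image tag to evaluate against; "
        "pass task_v6 for the full build environment.",
    )
    return parser


# Omitting the flag picks up the v6 default; an explicit value is kept as given.
default_args = build_parser().parse_args(["eval"])
override_args = build_parser().parse_args(["eval", "--image-tag", "task_v6"])
```

Keeping the default on the option itself means the help output shows users which tag they get when they omit the flag.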
Closes #45
Closes #14
In reference to #44