Fix(eval): evaluate in :task_cleanroom images #42

Merged
klieret merged 1 commit into main from feat/eval-cleanroom-default on Jun 18, 2026

@klieret klieret commented Jun 18, 2026

Submissions are now evaluated in the artifact-free cleanroom image by default instead of the full :task build environment, so a submission can't rely on build artifacts leaked into :task. --image-tag stays as an explicit override (pass --image-tag task for the full build env).

This avoids drift between the inference image and the evaluation image.
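
A minimal sketch of the new default, assuming image references take the form `<name>:<tag>`. The names `DEFAULT_IMAGE_TAG` and `image_ref` and the example image name are hypothetical and are not taken from the ProgramBench code:

```python
# Hypothetical sketch; names are illustrative, not ProgramBench's actual API.
DEFAULT_IMAGE_TAG = "task_cleanroom"  # new default (previously "task")


def image_ref(image_name: str, image_tag: str = DEFAULT_IMAGE_TAG) -> str:
    """Build a Docker image reference of the form '<name>:<tag>'."""
    return f"{image_name}:{image_tag}"


# Default: evaluate in the artifact-free cleanroom image.
cleanroom = image_ref("example/some-task")
# Explicit override: evaluate in the full build environment.
full_env = image_ref("example/some-task", image_tag="task")
```

Callers that never pass a tag pick up the cleanroom image automatically, while an explicit `task` still selects the full build environment.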

Internal-reference: fdbd6657

Closes #9

@meta-cla meta-cla Bot added the CLA Signed label Jun 18, 2026
@klieret klieret changed the title from "Change(eval): evaluate in :task_cleanroom images" to "Fix(eval): evaluate in :task_cleanroom images" Jun 18, 2026
@klieret klieret requested a review from Copilot June 18, 2026 02:37

Copilot AI left a comment

Pull request overview

This PR changes ProgramBench evaluation to run submissions in the artifact-free :task_cleanroom Docker image by default, aligning evaluation with inference and preventing reliance on leaked build artifacts. It keeps --image-tag as an explicit override to use the full :task environment when desired.

Changes:

  • Switch default image_tag from task to task_cleanroom in the core evaluator (eval.py) and batch evaluator (eval_batch.py).
  • Update the CLI programbench eval default --image-tag to task_cleanroom.
  • Expand CLI help text to explain why task_cleanroom is the default and how to override back to task.
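
The CLI side of these changes can be sketched as follows. This is an illustration using `argparse`, under the assumption that the option behaves like an ordinary string flag; the real CLI in src/programbench/cli/main.py may use a different framework, and `build_parser` and the help wording here are hypothetical:

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; not the actual ProgramBench CLI definition.
    parser = argparse.ArgumentParser(prog="programbench eval")
    parser.add_argument(
        "--image-tag",
        default="task_cleanroom",  # new default (previously "task")
        help=(
            "Docker image tag to evaluate in. Defaults to the artifact-free "
            "task_cleanroom image; pass 'task' for the full build environment."
        ),
    )
    return parser


# No flag given: the cleanroom default applies.
default_args = build_parser().parse_args([])
# Explicit override back to the full build environment.
override_args = build_parser().parse_args(["--image-tag", "task"])
```

Keeping the override as a plain flag means existing invocations that pass `--image-tag task` keep their old behavior.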

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/programbench/eval/eval.py | Updates the evaluator’s default Docker image tag to task_cleanroom. |
| src/programbench/eval/eval_batch.py | Updates batch evaluation defaults to use task_cleanroom unless overridden. |
| src/programbench/cli/main.py | Updates CLI default and help text for --image-tag to default to task_cleanroom. |

@klieret klieret merged commit 68e3da2 into main Jun 18, 2026
6 checks passed
Labels

CLA Signed (this label is managed by the Meta Open Source bot)

Development

Successfully merging this pull request may close these issues.

tomnomnom__gron task and task_cleanroom images use different Go toolchains

2 participants