Replace submit_eval_jobs.py with thin wrapper around submit_eval_jobs.sh by finbarrtimbers · Pull Request #1658 · allenai/open-instruct

finbarrtimbers · 2026-05-06T18:43:36Z

Summary

Replaces scripts/submit_eval_jobs.py's Beaker spec construction with a thin argparse translator that shells out to scripts/submit_eval_jobs.sh.
New scripts/submit_eval_jobs.sh clones olmo-eval-internal and invokes olmo-eval beaker launch directly.

Split out from the icepop branch.

Test plan

Dry-run uv run python scripts/submit_eval_jobs.py --model_name foo --location /weka/... --dry_run
Submit a real eval job

🤖 Generated with Claude Code

….sh that delegates to olmo-eval beaker launch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request refactors the evaluation job submission process, transforming scripts/submit_eval_jobs.py into a wrapper that delegates to a new shell script, scripts/submit_eval_jobs.sh, which uses olmo-eval beaker launch. The review feedback identifies a concurrency risk in the shell script's cloning logic, suggesting the use of mktemp and an exit trap for isolation. Additionally, it is recommended to use shlex.join in the Python script to ensure that logged commands are properly shell-escaped.

gemini-code-assist · 2026-05-06T18:45:04Z

+REPO_ROOT=$(git -C "$(dirname "$0")" rev-parse --show-toplevel)
+CLONE="$REPO_ROOT/olmo-eval-internal"
+REF="${OLMO_EVAL_REF:-main}"
+
+if [ ! -d "$CLONE/.git" ]; then
+    git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE"
+fi
+git -C "$CLONE" fetch --depth=1 origin "$REF"
+git -C "$CLONE" checkout FETCH_HEAD
+
+cd "$CLONE"
+exec uv run olmo-eval beaker launch -y \
+    --harness default \
+    -o provider.kind=vllm_server \
+    -o provider.trust_remote_code=true \
+    -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
+    "$@"


Cloning into a fixed path within the repository ($REPO_ROOT/olmo-eval-internal) is problematic for concurrency and repository cleanliness. If multiple jobs are submitted simultaneously, they will conflict on the same directory. Using a temporary directory with an exit trap ensures isolation and automatic cleanup. Note that exec is removed to allow the trap to fire upon completion.

Suggested change

REPO_ROOT=$(git -C "$(dirname "$0")" rev-parse --show-toplevel)

CLONE="$REPO_ROOT/olmo-eval-internal"

REF="${OLMO_EVAL_REF:-main}"

if [ ! -d "$CLONE/.git" ]; then

git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE"

fi

git -C "$CLONE" fetch --depth=1 origin "$REF"

git -C "$CLONE" checkout FETCH_HEAD

cd "$CLONE"

exec uv run olmo-eval beaker launch -y \

--harness default \

-o provider.kind=vllm_server \

-o provider.trust_remote_code=true \

-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \

"$@"

CLONE=$(mktemp -d)

trap 'rm -rf "$CLONE"' EXIT

REF="${OLMO_EVAL_REF:-main}"

git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE"

git -C "$CLONE" fetch --depth=1 origin "$REF"

git -C "$CLONE" checkout FETCH_HEAD

cd "$CLONE"

uv run olmo-eval beaker launch -y \

--harness default \

-o provider.kind=vllm_server \

-o provider.trust_remote_code=true \

-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \

"$@"

gemini-code-assist · 2026-05-06T18:45:04Z

+import os
 import subprocess


Re-add the shlex import to ensure that the command logging is properly escaped and readable.

Suggested change

import os

import subprocess

import os

import shlex

import subprocess

gemini-code-assist · 2026-05-06T18:45:04Z

+    env = os.environ.copy()
+    env["OLMO_EVAL_REF"] = args.olmo_eval_ref
+    cmd = [str(script), *launch_args]
+    print("Running:", " ".join(cmd))


Using shlex.join(cmd) is preferred over " ".join(cmd) for logging. It produces a shell-escaped string that accurately represents the command, which is especially important if any arguments (like experiment names or paths) contain spaces.

Suggested change

print("Running:", " ".join(cmd))

print("Running:", shlex.join(cmd))

chatgpt-codex-connector · 2026-05-06T18:46:17Z

💡 Codex Review

open-instruct/scripts/submit_eval_jobs.py

Line 46 in cff3d8c

out += ["-m", args.location]

Restore Beaker dataset locations in the wrapper

When --location is a bare Beaker dataset id or beaker://..., this now forwards that string directly as -m instead of resolving it to /model and mounting the dataset as the previous implementation did. The parent version explicitly accepted those locations, and the repository docs still show Beaker dataset ids for internal evals, so those legacy invocations will submit jobs whose model path is just the dataset id rather than an accessible mounted checkpoint.

open-instruct/scripts/submit_eval_jobs.py

Line 79 in cff3d8c

sys.exit(subprocess.run(cmd, env=env).returncode)

Avoid running the launcher during dry runs

With --dry_run, the wrapper still invokes submit_eval_jobs.sh; that script clones/fetches olmo-eval-internal before forwarding -d to the CLI. This changes dry-run from a local spec/command preview into an operation that needs GitHub access and writes a checkout into the repo, so the documented/test-plan dry run can fail on a fresh or offline clone before it ever reaches the dry-run path.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…-By: Claude Opus 4.7 <noreply@anthropic.com>

…ection (allenai#1650) * Add IcePop to GRPO: zero per-token loss when train/infer mismatch ratio is outside [1/β, β]. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Enable IcePop in qwen3_4b dapo math experiment script. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Shift BEAKER_IMAGE positional in qwen3_4b script so it isn't forwarded as a stray arg via $@. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Fix IcePop: reweight in-range tokens by ρ (paper eq. 2), use independent α/β with paper defaults (0.5, 5.0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Track per-boundary IcePop drop fractions (val/icepop_drop_low_frac, val/icepop_drop_high_frac). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Add IcePop training collapse postmortem doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * updated timing * Unify TIS/IcePop into a single RhoCorrection abstraction shared by grpo_fast and olmo_core paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Address PR review: restore checkpoint-state on empty steps; use explicit drop flags for icepop_drop_frac. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * cleaned up PR * Collapse IcePop bounds to a single beta (range [1/beta, beta]) and log val/rho_hist as a wandb histogram across grpo_fast and olmo-core paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Remove duplicate prompt/response length extraction to minimize diff vs main. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Move non-IcePop changes (eval script tweaks, checkpoint_state refactor) off this branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Simplify rho correction: compute ratio once, dedupe histogram accumulation across trainers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Delete compute_icepop_mask; route test through compute_rho_correction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * cleaned up pr * Address PR review: aggregate token counts across DP ranks, gate rho histograms to rank 0, token-weight icepop drop metrics. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Replace icepop_beta with separate icepop_lower_bound/icepop_upper_bound (default 0.5/2.0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Use venv-split install pattern in submit_eval_jobs.py to preserve image's pre-baked torch/nvidia. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Ignore olmo-eval-internal/ checkout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Replace submit_eval_jobs.py with shell wrapper around olmo-eval beaker launch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Restore submit_eval_jobs.py as a thin translator over submit_eval_jobs.sh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Make rho_weights required in compute_grpo_loss; ones tensor disables correction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Add CHANGELOG entry for allenai#1650. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Even more narrow IcePop * Even more narrow IcePop * Add icepop_sequence_level flag for DeepSeek-V3.2 style sequence-level IcePop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Drop submit_eval_jobs.{py,sh} changes; moved to PR allenai#1658. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Drop sp_leader_metrics change; moved to PR allenai#1659. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Rename TIS/IcePop knobs to unified rho_correction interface (use_rho_correction, rho_clamp_*, rho_mask_*). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * revert icepop changes * Revert "revert icepop changes" This reverts commit 155012c. * Revert non-rename changes in qwen3_4b_dapo_math.sh Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Restore rho_clamp_upper_bound default of 2.0 to match prior truncated_importance_sampling_ratio_cap default Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Drop redundant --rho_clamp_upper_bound 2.0 from scripts (matches default) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Default use_rho_correction=True to preserve prior TIS-on-by-default behavior; drop redundant --use_rho_correction flags from scripts Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace submit_eval_jobs.py with thin wrapper around submit_eval_jobs…

cff3d8c

….sh that delegates to olmo-eval beaker launch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist Bot reviewed May 6, 2026

View reviewed changes

finbarrtimbers added a commit that referenced this pull request May 6, 2026

Drop submit_eval_jobs.{py,sh} changes; moved to PR #1658. Co-Authored…

deedfbe

…-By: Claude Opus 4.7 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace submit_eval_jobs.py with thin wrapper around submit_eval_jobs.sh#1658

Replace submit_eval_jobs.py with thin wrapper around submit_eval_jobs.sh#1658
finbarrtimbers wants to merge 1 commit into
mainfrom
finbarr/fix-eval-script-v2

finbarrtimbers commented May 6, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 6, 2026

Uh oh!

gemini-code-assist Bot May 6, 2026

Uh oh!

gemini-code-assist Bot May 6, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	print("Running:", " ".join(cmd))
	print("Running:", shlex.join(cmd))

Conversation

finbarrtimbers commented May 6, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot commented May 6, 2026

💡 Codex Review

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant