Replace submit_eval_jobs.py with thin wrapper around submit_eval_jobs.sh#1658
Replace submit_eval_jobs.py with thin wrapper around submit_eval_jobs.sh#1658finbarrtimbers wants to merge 1 commit into
Conversation
….sh that delegates to olmo-eval beaker launch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request refactors the evaluation job submission process, transforming scripts/submit_eval_jobs.py into a wrapper that delegates to a new shell script, scripts/submit_eval_jobs.sh, which uses olmo-eval beaker launch. The review feedback identifies a concurrency risk in the shell script's cloning logic, suggesting the use of mktemp and an exit trap for isolation. Additionally, it is recommended to use shlex.join in the Python script to ensure that logged commands are properly shell-escaped.
| REPO_ROOT=$(git -C "$(dirname "$0")" rev-parse --show-toplevel) | ||
| CLONE="$REPO_ROOT/olmo-eval-internal" | ||
| REF="${OLMO_EVAL_REF:-main}" | ||
|
|
||
| if [ ! -d "$CLONE/.git" ]; then | ||
| git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE" | ||
| fi | ||
| git -C "$CLONE" fetch --depth=1 origin "$REF" | ||
| git -C "$CLONE" checkout FETCH_HEAD | ||
|
|
||
| cd "$CLONE" | ||
| exec uv run olmo-eval beaker launch -y \ | ||
| --harness default \ | ||
| -o provider.kind=vllm_server \ | ||
| -o provider.trust_remote_code=true \ | ||
| -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ | ||
| "$@" |
There was a problem hiding this comment.
Cloning into a fixed path within the repository ($REPO_ROOT/olmo-eval-internal) is problematic for concurrency and repository cleanliness. If multiple jobs are submitted simultaneously, they will conflict on the same directory. Using a temporary directory with an exit trap ensures isolation and automatic cleanup. Note that exec is removed to allow the trap to fire upon completion.
| REPO_ROOT=$(git -C "$(dirname "$0")" rev-parse --show-toplevel) | |
| CLONE="$REPO_ROOT/olmo-eval-internal" | |
| REF="${OLMO_EVAL_REF:-main}" | |
| if [ ! -d "$CLONE/.git" ]; then | |
| git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE" | |
| fi | |
| git -C "$CLONE" fetch --depth=1 origin "$REF" | |
| git -C "$CLONE" checkout FETCH_HEAD | |
| cd "$CLONE" | |
| exec uv run olmo-eval beaker launch -y \ | |
| --harness default \ | |
| -o provider.kind=vllm_server \ | |
| -o provider.trust_remote_code=true \ | |
| -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ | |
| "$@" | |
| CLONE=$(mktemp -d) | |
| trap 'rm -rf "$CLONE"' EXIT | |
| REF="${OLMO_EVAL_REF:-main}" | |
| git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE" | |
| git -C "$CLONE" fetch --depth=1 origin "$REF" | |
| git -C "$CLONE" checkout FETCH_HEAD | |
| cd "$CLONE" | |
| uv run olmo-eval beaker launch -y \ | |
| --harness default \ | |
| -o provider.kind=vllm_server \ | |
| -o provider.trust_remote_code=true \ | |
| -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \ | |
| "$@" |
| import os | ||
| import subprocess |
| env = os.environ.copy() | ||
| env["OLMO_EVAL_REF"] = args.olmo_eval_ref | ||
| cmd = [str(script), *launch_args] | ||
| print("Running:", " ".join(cmd)) |
There was a problem hiding this comment.
Using shlex.join(cmd) is preferred over " ".join(cmd) for logging. It produces a shell-escaped string that accurately represents the command, which is especially important if any arguments (like experiment names or paths) contain spaces.
| print("Running:", " ".join(cmd)) | |
| print("Running:", shlex.join(cmd)) |
💡 Codex Reviewopen-instruct/scripts/submit_eval_jobs.py Line 46 in cff3d8c When open-instruct/scripts/submit_eval_jobs.py Line 79 in cff3d8c With ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
…-By: Claude Opus 4.7 <noreply@anthropic.com>
…ection (allenai#1650) * Add IcePop to GRPO: zero per-token loss when train/infer mismatch ratio is outside [1/β, β]. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Enable IcePop in qwen3_4b dapo math experiment script. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Shift BEAKER_IMAGE positional in qwen3_4b script so it isn't forwarded as a stray arg via $@. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Fix IcePop: reweight in-range tokens by ρ (paper eq. 2), use independent α/β with paper defaults (0.5, 5.0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Track per-boundary IcePop drop fractions (val/icepop_drop_low_frac, val/icepop_drop_high_frac). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Add IcePop training collapse postmortem doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * updated timing * Unify TIS/IcePop into a single RhoCorrection abstraction shared by grpo_fast and olmo_core paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Address PR review: restore checkpoint-state on empty steps; use explicit drop flags for icepop_drop_frac. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * cleaned up PR * Collapse IcePop bounds to a single beta (range [1/beta, beta]) and log val/rho_hist as a wandb histogram across grpo_fast and olmo-core paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Remove duplicate prompt/response length extraction to minimize diff vs main. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Move non-IcePop changes (eval script tweaks, checkpoint_state refactor) off this branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Simplify rho correction: compute ratio once, dedupe histogram accumulation across trainers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Delete compute_icepop_mask; route test through compute_rho_correction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * cleaned up pr * Address PR review: aggregate token counts across DP ranks, gate rho histograms to rank 0, token-weight icepop drop metrics. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Replace icepop_beta with separate icepop_lower_bound/icepop_upper_bound (default 0.5/2.0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Use venv-split install pattern in submit_eval_jobs.py to preserve image's pre-baked torch/nvidia. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Ignore olmo-eval-internal/ checkout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Replace submit_eval_jobs.py with shell wrapper around olmo-eval beaker launch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Restore submit_eval_jobs.py as a thin translator over submit_eval_jobs.sh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Make rho_weights required in compute_grpo_loss; ones tensor disables correction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Add CHANGELOG entry for allenai#1650. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Even more narrow IcePop * Even more narrow IcePop * Add icepop_sequence_level flag for DeepSeek-V3.2 style sequence-level IcePop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Drop submit_eval_jobs.{py,sh} changes; moved to PR allenai#1658. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Drop sp_leader_metrics change; moved to PR allenai#1659. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Rename TIS/IcePop knobs to unified rho_correction interface (use_rho_correction, rho_clamp_*, rho_mask_*). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * revert icepop changes * Revert "revert icepop changes" This reverts commit 155012c. * Revert non-rename changes in qwen3_4b_dapo_math.sh Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Restore rho_clamp_upper_bound default of 2.0 to match prior truncated_importance_sampling_ratio_cap default Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Drop redundant --rho_clamp_upper_bound 2.0 from scripts (matches default) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Default use_rho_correction=True to preserve prior TIS-on-by-default behavior; drop redundant --use_rho_correction flags from scripts Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
scripts/submit_eval_jobs.py's Beaker spec construction with a thin argparse translator that shells out toscripts/submit_eval_jobs.sh.scripts/submit_eval_jobs.shclonesolmo-eval-internaland invokesolmo-eval beaker launchdirectly.Split out from the icepop branch.
Test plan
uv run python scripts/submit_eval_jobs.py --model_name foo --location /weka/... --dry_run🤖 Generated with Claude Code