Skip to content

Replace submit_eval_jobs.py with thin wrapper around submit_eval_jobs.sh#1658

Open
finbarrtimbers wants to merge 1 commit into
mainfrom
finbarr/fix-eval-script-v2
Open

Replace submit_eval_jobs.py with thin wrapper around submit_eval_jobs.sh#1658
finbarrtimbers wants to merge 1 commit into
mainfrom
finbarr/fix-eval-script-v2

Conversation

@finbarrtimbers
Copy link
Copy Markdown
Collaborator

Summary

  • Replaces scripts/submit_eval_jobs.py's Beaker spec construction with a thin argparse translator that shells out to scripts/submit_eval_jobs.sh.
  • New scripts/submit_eval_jobs.sh clones olmo-eval-internal and invokes olmo-eval beaker launch directly.

Split out from the icepop branch.

Test plan

  • Dry-run uv run python scripts/submit_eval_jobs.py --model_name foo --location /weka/... --dry_run
  • Submit a real eval job

🤖 Generated with Claude Code

….sh that delegates to olmo-eval beaker launch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the evaluation job submission process, transforming scripts/submit_eval_jobs.py into a wrapper that delegates to a new shell script, scripts/submit_eval_jobs.sh, which uses olmo-eval beaker launch. The review feedback identifies a concurrency risk in the shell script's cloning logic, suggesting the use of mktemp and an exit trap for isolation. Additionally, it is recommended to use shlex.join in the Python script to ensure that logged commands are properly shell-escaped.

Comment on lines +16 to +32
REPO_ROOT=$(git -C "$(dirname "$0")" rev-parse --show-toplevel)
CLONE="$REPO_ROOT/olmo-eval-internal"
REF="${OLMO_EVAL_REF:-main}"

if [ ! -d "$CLONE/.git" ]; then
git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE"
fi
git -C "$CLONE" fetch --depth=1 origin "$REF"
git -C "$CLONE" checkout FETCH_HEAD

cd "$CLONE"
exec uv run olmo-eval beaker launch -y \
--harness default \
-o provider.kind=vllm_server \
-o provider.trust_remote_code=true \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
"$@"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Cloning into a fixed path within the repository ($REPO_ROOT/olmo-eval-internal) is problematic for concurrency and repository cleanliness. If multiple jobs are submitted simultaneously, they will conflict on the same directory. Using a temporary directory with an exit trap ensures isolation and automatic cleanup. Note that exec is removed to allow the trap to fire upon completion.

Suggested change
REPO_ROOT=$(git -C "$(dirname "$0")" rev-parse --show-toplevel)
CLONE="$REPO_ROOT/olmo-eval-internal"
REF="${OLMO_EVAL_REF:-main}"
if [ ! -d "$CLONE/.git" ]; then
git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE"
fi
git -C "$CLONE" fetch --depth=1 origin "$REF"
git -C "$CLONE" checkout FETCH_HEAD
cd "$CLONE"
exec uv run olmo-eval beaker launch -y \
--harness default \
-o provider.kind=vllm_server \
-o provider.trust_remote_code=true \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
"$@"
CLONE=$(mktemp -d)
trap 'rm -rf "$CLONE"' EXIT
REF="${OLMO_EVAL_REF:-main}"
git clone --depth=1 https://github.com/allenai/olmo-eval-internal.git "$CLONE"
git -C "$CLONE" fetch --depth=1 origin "$REF"
git -C "$CLONE" checkout FETCH_HEAD
cd "$CLONE"
uv run olmo-eval beaker launch -y \
--harness default \
-o provider.kind=vllm_server \
-o provider.trust_remote_code=true \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
"$@"

Comment on lines +9 to 10
import os
import subprocess
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Re-add the shlex import to ensure that the command logging is properly escaped and readable.

Suggested change
import os
import subprocess
import os
import shlex
import subprocess

env = os.environ.copy()
env["OLMO_EVAL_REF"] = args.olmo_eval_ref
cmd = [str(script), *launch_args]
print("Running:", " ".join(cmd))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using shlex.join(cmd) is preferred over " ".join(cmd) for logging. It produces a shell-escaped string that accurately represents the command, which is especially important if any arguments (like experiment names or paths) contain spaces.

Suggested change
print("Running:", " ".join(cmd))
print("Running:", shlex.join(cmd))

@chatgpt-codex-connector
Copy link
Copy Markdown

💡 Codex Review

out += ["-m", args.location]

P2 Badge Restore Beaker dataset locations in the wrapper

When --location is a bare Beaker dataset id or beaker://..., this now forwards that string directly as -m instead of resolving it to /model and mounting the dataset as the previous implementation did. The parent version explicitly accepted those locations, and the repository docs still show Beaker dataset ids for internal evals, so those legacy invocations will submit jobs whose model path is just the dataset id rather than an accessible mounted checkpoint.


sys.exit(subprocess.run(cmd, env=env).returncode)

P2 Badge Avoid running the launcher during dry runs

With --dry_run, the wrapper still invokes submit_eval_jobs.sh; that script clones/fetches olmo-eval-internal before forwarding -d to the CLI. This changes dry-run from a local spec/command preview into an operation that needs GitHub access and writes a checkout into the repo, so the documented/test-plan dry run can fail on a fresh or offline clone before it ever reaches the dry-run path.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

finbarrtimbers added a commit that referenced this pull request May 6, 2026
…-By: Claude Opus 4.7 <noreply@anthropic.com>
pull Bot pushed a commit to nagyist/allenai.open-instruct that referenced this pull request May 7, 2026
…ection (allenai#1650)

* Add IcePop to GRPO: zero per-token loss when train/infer mismatch ratio is outside [1/β, β]. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Enable IcePop in qwen3_4b dapo math experiment script. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Shift BEAKER_IMAGE positional in qwen3_4b script so it isn't forwarded as a stray arg via $@. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Fix IcePop: reweight in-range tokens by ρ (paper eq. 2), use independent α/β with paper defaults (0.5, 5.0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Track per-boundary IcePop drop fractions (val/icepop_drop_low_frac, val/icepop_drop_high_frac). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Add IcePop training collapse postmortem doc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* updated timing

* Unify TIS/IcePop into a single RhoCorrection abstraction shared by grpo_fast and olmo_core paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Address PR review: restore checkpoint-state on empty steps; use explicit drop flags for icepop_drop_frac. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* cleaned up PR

* Collapse IcePop bounds to a single beta (range [1/beta, beta]) and log val/rho_hist as a wandb histogram across grpo_fast and olmo-core paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Remove duplicate prompt/response length extraction to minimize diff vs main. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Move non-IcePop changes (eval script tweaks, checkpoint_state refactor) off this branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Simplify rho correction: compute ratio once, dedupe histogram accumulation across trainers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Delete compute_icepop_mask; route test through compute_rho_correction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* cleaned up pr

* Address PR review: aggregate token counts across DP ranks, gate rho histograms to rank 0, token-weight icepop drop metrics. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Replace icepop_beta with separate icepop_lower_bound/icepop_upper_bound (default 0.5/2.0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Use venv-split install pattern in submit_eval_jobs.py to preserve image's pre-baked torch/nvidia. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Ignore olmo-eval-internal/ checkout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Replace submit_eval_jobs.py with shell wrapper around olmo-eval beaker launch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Restore submit_eval_jobs.py as a thin translator over submit_eval_jobs.sh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Make rho_weights required in compute_grpo_loss; ones tensor disables correction. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Add CHANGELOG entry for allenai#1650. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Even more narrow IcePop

* Even more narrow IcePop

* Add icepop_sequence_level flag for DeepSeek-V3.2 style sequence-level IcePop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Drop submit_eval_jobs.{py,sh} changes; moved to PR allenai#1658. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Drop sp_leader_metrics change; moved to PR allenai#1659. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Rename TIS/IcePop knobs to unified rho_correction interface (use_rho_correction, rho_clamp_*, rho_mask_*). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* revert icepop changes

* Revert "revert icepop changes"

This reverts commit 155012c.

* Revert non-rename changes in qwen3_4b_dapo_math.sh Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Restore rho_clamp_upper_bound default of 2.0 to match prior truncated_importance_sampling_ratio_cap default Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Drop redundant --rho_clamp_upper_bound 2.0 from scripts (matches default) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Default use_rho_correction=True to preserve prior TIS-on-by-default behavior; drop redundant --use_rho_correction flags from scripts Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant