Skip to content

Add olmo-eval Beaker launch integration for GRPO#1698

Draft
mnoukhov wants to merge 4 commits into
mainfrom
olmo-eval-grpo
Draft

Add olmo-eval Beaker launch integration for GRPO#1698
mnoukhov wants to merge 4 commits into
mainfrom
olmo-eval-grpo

Conversation

@mnoukhov
Copy link
Copy Markdown
Contributor

Summary

  • Add OlmoEvalLaunchConfig and launch_olmo_evals_on_weka() to launch evals via olmo-eval beaker launch after GRPO checkpoints (alongside existing oe-eval integration).
  • Wire into grpo.py and grpo_fast.py behind --try_launch_olmo_eval_jobs_on_weka; the checkpoint Weka path is passed as -m automatically.
  • Add requirements-olmo-eval.txt for optional olmo-eval CLI install (cannot be merged into the main uv env due to a rich version conflict with ai2-olmo-core).

Usage

Example flags for a GRPO run with olmo-eval auto-launch:

--try_launch_olmo_eval_jobs_on_weka \
--olmo_eval_tasks math:posttrain:dev \
--olmo_eval_cluster h100 \
--olmo_eval_groups my-experiment-group \
--olmo_eval_priority urgent \
--olmo_eval_workspace ai2/open-instruct-dev \
--output_dir /weka/oe-adapt-default/.../my-run

The launched command is equivalent to:

olmo-eval beaker launch \
  -m <checkpoint_path_on_weka> \
  -t math:posttrain:dev \
  --group my-experiment-group \
  --cluster h100 \
  --priority urgent \
  -w ai2/open-instruct-dev \
  --yes --no-follow

Test plan

  • uv run pytest open_instruct/test_olmo_eval_launch.py
  • make style && make quality
  • End-to-end Beaker run with --try_launch_olmo_eval_jobs_on_weka (requires olmo-eval CLI on PATH in training image)

Made with Cursor

mnoukhov and others added 2 commits May 22, 2026 17:48
Wire optional post-checkpoint eval launches through olmo-eval beaker launch
alongside the existing oe-eval path, with a dedicated config dataclass for
cluster, tasks, workspace, and other launch settings.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates olmo-eval into the GRPO training workflow, allowing for automated Beaker evaluation jobs to be launched at checkpoints and upon training completion. It introduces OlmoEvalLaunchConfig for configuration, command-building utilities, and execution logic using subprocesses. Reviewers identified several improvement opportunities, including making the subprocess execution more robust by handling missing dependencies and non-zero exit codes, lowering the default job priority to "normal" for better cluster resource management, and fixing a bug that caused redundant step suffixes in experiment names.

Comment thread open_instruct/olmo_eval_launch.py Outdated
olmo_eval_groups: list[str] | None = None
"""Optional Beaker group(s) for grouping related eval experiments."""

olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "urgent"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The default priority is set to "urgent". For automated evaluation jobs launched during training, it is generally better to use "normal" to avoid preempting other users' work on shared clusters, unless explicitly requested by the user via CLI flags. This also maintains consistency with the existing oe-eval integration in grpo_utils.py.

Suggested change
olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "urgent"
olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "normal"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — updated the default to "high" rather than "normal". We want auto-launched evals to run ahead of routine cluster work without preempting at "urgent" levels. Users can still override via --olmo_eval_priority.

Comment thread open_instruct/olmo_eval_launch.py Outdated
Comment on lines +154 to +161
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
logger.info(
"Olmo-eval launch finished (return code %s)\nStdout:\n%s\nStderr:\n%s",
process.returncode,
stdout.decode(),
stderr.decode(),
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation using subprocess.Popen lacks error handling for cases where the olmo-eval command is missing (which is possible since it is an optional dependency). It also doesn't check the return code for failures. Using subprocess.run with capture_output=True and text=True is more idiomatic and robust. Additionally, adding a try-except block for FileNotFoundError will prevent the training process from crashing if the tool is not installed.

Suggested change
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
logger.info(
"Olmo-eval launch finished (return code %s)\nStdout:\n%s\nStderr:\n%s",
process.returncode,
stdout.decode(),
stderr.decode(),
)
try:
result = subprocess.run(command, capture_output=True, text=True, check=False, errors="replace")
if result.returncode != 0:
logger.warning(
"Olmo-eval launch failed (return code %s)\nStdout:\n%s\nStderr:\n%s",
result.returncode,
result.stdout,
result.stderr,
)
else:
logger.info("Olmo-eval launch finished successfully.\nStdout:\n%s", result.stdout)
except FileNotFoundError:
logger.error("Failed to launch olmo-eval: 'olmo-eval' command not found. Ensure it is installed.")
except Exception:
logger.exception("An unexpected error occurred while launching olmo-eval.")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 7f8b33f: switched to subprocess.run with return-code logging, plus FileNotFoundError handling so a missing olmo-eval CLI does not crash training.

Comment thread open_instruct/grpo_fast.py Outdated
step_dir, leaderboard_name, wandb_url, training_step
)
if args.try_launch_olmo_eval_jobs_on_weka and is_beaker_job():
leaderboard_name = f"{args.hf_repo_revision or args.exp_name}_step_{training_step}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This will result in a double _step_{training_step} suffix in the experiment name. The leaderboard_name is passed to launch_olmo_evals_on_weka_wrapper, which in turn calls default_olmo_eval_experiment_name(leaderboard_name, training_step). Since the helper already appends the step suffix, the leaderboard_name passed to it should only contain the base experiment name.

Suggested change
leaderboard_name = f"{args.hf_repo_revision or args.exp_name}_step_{training_step}"
leaderboard_name = args.hf_repo_revision or args.exp_name

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7f8b33fleaderboard_name is now the base name only (hf_repo_revision or exp_name); default_olmo_eval_experiment_name appends _step_{n} once.

mnoukhov and others added 2 commits May 22, 2026 17:51
Use the training experiment name as the Beaker group when olmo_eval_groups
is unset, and install olmo-eval-internal in the main env with a rich>=14.3.4
override to resolve the cached-path transitive pin.

Co-authored-by: Cursor <cursoragent@cursor.com>
Use high default priority, robust subprocess.run error handling, and fix
double step suffix in checkpoint eval experiment names.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant