Add olmo-eval Beaker launch integration for GRPO #1698
Wire optional post-checkpoint eval launches through olmo-eval beaker launch alongside the existing oe-eval path, with a dedicated config dataclass for cluster, tasks, workspace, and other launch settings. Co-authored-by: Cursor <cursoragent@cursor.com>
Code Review
This pull request integrates olmo-eval into the GRPO training workflow, allowing for automated Beaker evaluation jobs to be launched at checkpoints and upon training completion. It introduces OlmoEvalLaunchConfig for configuration, command-building utilities, and execution logic using subprocesses. Reviewers identified several improvement opportunities, including making the subprocess execution more robust by handling missing dependencies and non-zero exit codes, lowering the default job priority to "normal" for better cluster resource management, and fixing a bug that caused redundant step suffixes in experiment names.
    olmo_eval_groups: list[str] | None = None
    """Optional Beaker group(s) for grouping related eval experiments."""

    olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "urgent"
The default priority is set to "urgent". For automated evaluation jobs launched during training, it is generally better to use "normal" to avoid preempting other users' work on shared clusters, unless explicitly requested by the user via CLI flags. This also maintains consistency with the existing oe-eval integration in grpo_utils.py.
-    olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "urgent"
+    olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "normal"
Good call — updated the default to "high" rather than "normal". We want auto-launched evals to run ahead of routine cluster work without preempting at "urgent" levels. Users can still override via --olmo_eval_priority.
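As a minimal sketch of the priority setting discussed above: the class below is a hypothetical stand-in for the relevant part of OlmoEvalLaunchConfig, not the PR's actual code. It assumes the final default of "high" and adds a runtime check, since a `Literal` annotation alone is not enforced when the dataclass is constructed.

```python
from dataclasses import dataclass
from typing import Literal, get_args

# The four priority levels named in the reviewed field.
Priority = Literal["low", "normal", "high", "urgent"]


@dataclass
class EvalPriorityConfig:
    """Hypothetical stand-in for the priority field of OlmoEvalLaunchConfig."""

    # Assumed default after the review: ahead of routine work, below "urgent".
    olmo_eval_priority: Priority = "high"

    def __post_init__(self) -> None:
        # Literal is only a type hint, so reject unknown values explicitly.
        allowed = get_args(Priority)
        if self.olmo_eval_priority not in allowed:
            raise ValueError(
                f"olmo_eval_priority must be one of {allowed}, "
                f"got {self.olmo_eval_priority!r}"
            )
```

A user who prefers the reviewer's suggestion could still construct `EvalPriorityConfig(olmo_eval_priority="normal")`, which mirrors overriding the default via --olmo_eval_priority.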
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = process.communicate()
logger.info(
    "Olmo-eval launch finished (return code %s)\nStdout:\n%s\nStderr:\n%s",
    process.returncode,
    stdout.decode(),
    stderr.decode(),
)
The current implementation using subprocess.Popen lacks error handling for cases where the olmo-eval command is missing (which is possible since it is an optional dependency). It also doesn't check the return code for failures. Using subprocess.run with capture_output=True and text=True is more idiomatic and robust. Additionally, adding a try-except block for FileNotFoundError will prevent the training process from crashing if the tool is not installed.
-process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
-stdout, stderr = process.communicate()
-logger.info(
-    "Olmo-eval launch finished (return code %s)\nStdout:\n%s\nStderr:\n%s",
-    process.returncode,
-    stdout.decode(),
-    stderr.decode(),
-)
+try:
+    result = subprocess.run(command, capture_output=True, text=True, check=False, errors="replace")
+    if result.returncode != 0:
+        logger.warning(
+            "Olmo-eval launch failed (return code %s)\nStdout:\n%s\nStderr:\n%s",
+            result.returncode,
+            result.stdout,
+            result.stderr,
+        )
+    else:
+        logger.info("Olmo-eval launch finished successfully.\nStdout:\n%s", result.stdout)
+except FileNotFoundError:
+    logger.error("Failed to launch olmo-eval: 'olmo-eval' command not found. Ensure it is installed.")
+except Exception:
+    logger.exception("An unexpected error occurred while launching olmo-eval.")
Done in 7f8b33f: switched to subprocess.run with return-code logging, plus FileNotFoundError handling so a missing olmo-eval CLI does not crash training.
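The behavior described in this thread can be exercised in isolation. The sketch below is not the code from 7f8b33f; `run_launch_command` is a hypothetical helper that follows the same pattern (subprocess.run, return-code check, FileNotFoundError handling) and returns a boolean so the caller can decide what to do without the training process crashing.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)


def run_launch_command(command: list[str]) -> bool:
    """Run a launch command and report success; hypothetical helper.

    Returns True only when the process exits with code 0. A missing
    executable or other OS-level failure is logged and reported as False.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,  # collect stdout and stderr
            text=True,            # decode output to str
            errors="replace",     # never fail on undecodable bytes
            check=False,          # inspect the return code ourselves
        )
    except FileNotFoundError:
        logger.error("Command not found: %s. Is it installed?", command[0])
        return False
    except OSError:
        logger.exception("Unexpected OS error while running %s", command[0])
        return False

    if result.returncode != 0:
        logger.warning(
            "Launch failed (return code %s)\nStdout:\n%s\nStderr:\n%s",
            result.returncode,
            result.stdout,
            result.stderr,
        )
        return False

    logger.info("Launch finished successfully.\nStdout:\n%s", result.stdout)
    return True


if __name__ == "__main__":
    # Uses the current Python interpreter as a stand-in for the real CLI.
    run_launch_command([sys.executable, "-c", "print('ok')"])
```

Catching OSError rather than a bare Exception is a design choice of this sketch; the suggested change in the review catches Exception more broadly.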
        step_dir, leaderboard_name, wandb_url, training_step
    )
    if args.try_launch_olmo_eval_jobs_on_weka and is_beaker_job():
        leaderboard_name = f"{args.hf_repo_revision or args.exp_name}_step_{training_step}"
This will result in a double _step_{training_step} suffix in the experiment name. The leaderboard_name is passed to launch_olmo_evals_on_weka_wrapper, which in turn calls default_olmo_eval_experiment_name(leaderboard_name, training_step). Since the helper already appends the step suffix, the leaderboard_name passed to it should only contain the base experiment name.
-        leaderboard_name = f"{args.hf_repo_revision or args.exp_name}_step_{training_step}"
+        leaderboard_name = args.hf_repo_revision or args.exp_name
Fixed in 7f8b33f — leaderboard_name is now the base name only (hf_repo_revision or exp_name); default_olmo_eval_experiment_name appends _step_{n} once.
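To make the double-suffix bug concrete, here is a small sketch. `default_experiment_name` is a hypothetical model of default_olmo_eval_experiment_name, assuming only what the review states: that the helper appends `_step_{n}` to whatever name it receives.

```python
def default_experiment_name(base_name: str, training_step: int) -> str:
    """Hypothetical model of the helper: appends the step suffix once."""
    return f"{base_name}_step_{training_step}"


exp_name = "my_grpo_run"  # illustrative experiment name
step = 100

# Before the fix: the caller pre-appends the suffix, so it appears twice.
buggy = default_experiment_name(f"{exp_name}_step_{step}", step)

# After the fix: the caller passes the base name only.
fixed = default_experiment_name(exp_name, step)

print(buggy)  # my_grpo_run_step_100_step_100
print(fixed)  # my_grpo_run_step_100
```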
Use the training experiment name as the Beaker group when olmo_eval_groups is unset, and install olmo-eval-internal in the main env with a rich>=14.3.4 override to resolve the cached-path transitive pin. Co-authored-by: Cursor <cursoragent@cursor.com>
Use high default priority, robust subprocess.run error handling, and fix double step suffix in checkpoint eval experiment names. Co-authored-by: Cursor <cursoragent@cursor.com>
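The group fallback mentioned in the first of the two commit messages above can be sketched as follows. `resolve_eval_groups` is a hypothetical name, and the sketch assumes that "unset" means `None`; how the PR treats an empty list is not stated, so an empty list is passed through unchanged here.

```python
from typing import Optional


def resolve_eval_groups(
    olmo_eval_groups: Optional[list[str]], exp_name: str
) -> list[str]:
    """Hypothetical helper: default the Beaker group to the experiment name."""
    if olmo_eval_groups is None:
        # Assumption: an unset value falls back to the training experiment name.
        return [exp_name]
    # Copy so callers cannot mutate the config's list by accident.
    return list(olmo_eval_groups)
```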
Summary
Adds OlmoEvalLaunchConfig and launch_olmo_evals_on_weka() to launch evals via olmo-eval beaker launch after GRPO checkpoints (alongside the existing oe-eval integration).
Wired into grpo.py and grpo_fast.py behind --try_launch_olmo_eval_jobs_on_weka; the checkpoint Weka path is passed as -m automatically.
Adds requirements-olmo-eval.txt for an optional olmo-eval CLI install (cannot be merged into the main uv env due to a rich version conflict with ai2-olmo-core).
Usage
Example flags for a GRPO run with olmo-eval auto-launch:
The launched command is equivalent to:
Test plan
uv run pytest open_instruct/test_olmo_eval_launch.py
make style && make quality
GRPO run with --try_launch_olmo_eval_jobs_on_weka (requires the olmo-eval CLI on PATH in the training image)