Add olmo-eval Beaker launch integration for GRPO by mnoukhov · Pull Request #1698 · allenai/open-instruct

mnoukhov · 2026-05-22T21:48:33Z

Summary

Add OlmoEvalLaunchConfig and launch_olmo_evals_on_weka() to launch evals via olmo-eval beaker launch after GRPO checkpoints (alongside existing oe-eval integration).
Wire into grpo.py and grpo_fast.py behind --try_launch_olmo_eval_jobs_on_weka; the checkpoint Weka path is passed as -m automatically.
Add requirements-olmo-eval.txt for optional olmo-eval CLI install (cannot be merged into the main uv env due to a rich version conflict with ai2-olmo-core).

Usage

Example flags for a GRPO run with olmo-eval auto-launch:

--try_launch_olmo_eval_jobs_on_weka \
--olmo_eval_tasks math:posttrain:dev \
--olmo_eval_cluster h100 \
--olmo_eval_groups my-experiment-group \
--olmo_eval_priority urgent \
--olmo_eval_workspace ai2/open-instruct-dev \
--output_dir /weka/oe-adapt-default/.../my-run

The launched command is equivalent to:

olmo-eval beaker launch \
  -m <checkpoint_path_on_weka> \
  -t math:posttrain:dev \
  --group my-experiment-group \
  --cluster h100 \
  --priority urgent \
  -w ai2/open-instruct-dev \
  --yes --no-follow

Test plan

uv run pytest open_instruct/test_olmo_eval_launch.py
make style && make quality
End-to-end Beaker run with --try_launch_olmo_eval_jobs_on_weka (requires olmo-eval CLI on PATH in training image)

Made with Cursor

Wire optional post-checkpoint eval launches through olmo-eval beaker launch alongside the existing oe-eval path, with a dedicated config dataclass for cluster, tasks, workspace, and other launch settings. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist

Code Review

This pull request integrates olmo-eval into the GRPO training workflow, allowing for automated Beaker evaluation jobs to be launched at checkpoints and upon training completion. It introduces OlmoEvalLaunchConfig for configuration, command-building utilities, and execution logic using subprocesses. Reviewers identified several improvement opportunities, including making the subprocess execution more robust by handling missing dependencies and non-zero exit codes, lowering the default job priority to "normal" for better cluster resource management, and fixing a bug that caused redundant step suffixes in experiment names.

gemini-code-assist · 2026-05-22T21:50:53Z

+    olmo_eval_groups: list[str] | None = None
+    """Optional Beaker group(s) for grouping related eval experiments."""
+
+    olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "urgent"


The default priority is set to "urgent". For automated evaluation jobs launched during training, it is generally better to use "normal" to avoid preempting other users' work on shared clusters, unless explicitly requested by the user via CLI flags. This also maintains consistency with the existing oe-eval integration in grpo_utils.py.

Suggested change

olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "urgent"

olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "normal"

Good call — updated the default to "high" rather than "normal". We want auto-launched evals to run ahead of routine cluster work without preempting at "urgent" levels. Users can still override via --olmo_eval_priority.

gemini-code-assist · 2026-05-22T21:50:53Z

+    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+    stdout, stderr = process.communicate()
+    logger.info(
+        "Olmo-eval launch finished (return code %s)\nStdout:\n%s\nStderr:\n%s",
+        process.returncode,
+        stdout.decode(),
+        stderr.decode(),
+    )


The current implementation using subprocess.Popen lacks error handling for cases where the olmo-eval command is missing (which is possible since it is an optional dependency). It also doesn't check the return code for failures. Using subprocess.run with capture_output=True and text=True is more idiomatic and robust. Additionally, adding a try-except block for FileNotFoundError will prevent the training process from crashing if the tool is not installed.

Suggested change

process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

stdout, stderr = process.communicate()

logger.info(

"Olmo-eval launch finished (return code %s)\nStdout:\n%s\nStderr:\n%s",

process.returncode,

stdout.decode(),

stderr.decode(),

)

try:

result = subprocess.run(command, capture_output=True, text=True, check=False, errors="replace")

if result.returncode != 0:

logger.warning(

"Olmo-eval launch failed (return code %s)\nStdout:\n%s\nStderr:\n%s",

result.returncode,

result.stdout,

result.stderr,

)

else:

logger.info("Olmo-eval launch finished successfully.\nStdout:\n%s", result.stdout)

except FileNotFoundError:

logger.error("Failed to launch olmo-eval: 'olmo-eval' command not found. Ensure it is installed.")

except Exception:

logger.exception("An unexpected error occurred while launching olmo-eval.")

Done in 7f8b33f: switched to subprocess.run with return-code logging, plus FileNotFoundError handling so a missing olmo-eval CLI does not crash training.

gemini-code-assist · 2026-05-22T21:50:53Z

                        step_dir, leaderboard_name, wandb_url, training_step
                    )
+            if args.try_launch_olmo_eval_jobs_on_weka and is_beaker_job():
+                leaderboard_name = f"{args.hf_repo_revision or args.exp_name}_step_{training_step}"


This will result in a double _step_{training_step} suffix in the experiment name. The leaderboard_name is passed to launch_olmo_evals_on_weka_wrapper, which in turn calls default_olmo_eval_experiment_name(leaderboard_name, training_step). Since the helper already appends the step suffix, the leaderboard_name passed to it should only contain the base experiment name.

Suggested change

leaderboard_name = f"{args.hf_repo_revision or args.exp_name}_step_{training_step}"

leaderboard_name = args.hf_repo_revision or args.exp_name

Fixed in 7f8b33f — leaderboard_name is now the base name only (hf_repo_revision or exp_name); default_olmo_eval_experiment_name appends _step_{n} once.

Use the training experiment name as the Beaker group when olmo_eval_groups is unset, and install olmo-eval-internal in the main env with a rich>=14.3.4 override to resolve the cached-path transitive pin. Co-authored-by: Cursor <cursoragent@cursor.com>

Use high default priority, robust subprocess.run error handling, and fix double step suffix in checkpoint eval experiment names. Co-authored-by: Cursor <cursoragent@cursor.com>

mnoukhov and others added 2 commits May 22, 2026 17:48

Add CHANGELOG entry for olmo-eval GRPO integration.

0a95607

Co-authored-by: Cursor <cursoragent@cursor.com>

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

mnoukhov and others added 2 commits May 22, 2026 17:51

Address Gemini review feedback on olmo-eval launch.

7f8b33f

Use high default priority, robust subprocess.run error handling, and fix double step suffix in checkpoint eval experiment names. Co-authored-by: Cursor <cursoragent@cursor.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add olmo-eval Beaker launch integration for GRPO#1698

Add olmo-eval Beaker launch integration for GRPO#1698
mnoukhov wants to merge 4 commits into
mainfrom
olmo-eval-grpo

mnoukhov commented May 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

mnoukhov May 22, 2026

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

mnoukhov May 22, 2026

Uh oh!

gemini-code-assist Bot May 22, 2026

Uh oh!

mnoukhov May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "urgent"
	olmo_eval_priority: Literal["low", "normal", "high", "urgent"] = "normal"

-    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
-    stdout, stderr = process.communicate()
-    logger.info(
-        "Olmo-eval launch finished (return code %s)\nStdout:\n%s\nStderr:\n%s",
-        process.returncode,
-        stdout.decode(),
-        stderr.decode(),
-    )
+    try:
+        result = subprocess.run(command, capture_output=True, text=True, check=False, errors="replace")
+        if result.returncode != 0:
+            logger.warning(
+                "Olmo-eval launch failed (return code %s)\nStdout:\n%s\nStderr:\n%s",
+                result.returncode,
+                result.stdout,
+                result.stderr,
+            )
+        else:
+            logger.info("Olmo-eval launch finished successfully.\nStdout:\n%s", result.stdout)
+    except FileNotFoundError:
+        logger.error("Failed to launch olmo-eval: 'olmo-eval' command not found. Ensure it is installed.")
+    except Exception:
+        logger.exception("An unexpected error occurred while launching olmo-eval.")

	leaderboard_name = f"{args.hf_repo_revision or args.exp_name}_step_{training_step}"
	leaderboard_name = args.hf_repo_revision or args.exp_name

Conversation

mnoukhov commented May 22, 2026

Summary

Usage

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

mnoukhov May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

mnoukhov May 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 22, 2026

Choose a reason for hiding this comment

Uh oh!

mnoukhov May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant