
Add pass@k metrics for local eval #1464

Merged
hamishivi merged 44 commits into main from eval_pass_at_k on Apr 9, 2026

Conversation

@mnoukhov
Contributor

@mnoukhov commented on Feb 6, 2026

Summary

  • add eval_pass_at_k to local GRPO eval and report eval/pass_at_1 plus eval/pass_at_k (sketched after this list)
  • add an optional eval_response_length and size vLLM max_model_len to the larger of the train/eval response lengths, so longer eval generations fit
  • on non-final steps, only evaluate once a full batch of results is ready; this avoids partially draining the eval results queue and losing results
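
The PR's implementation isn't quoted in this thread, so the following is a minimal sketch: it assumes the standard unbiased pass@k estimator (Chen et al., 2021), and the signature of get_vllm_max_model_len is hypothetical (only the helper's name appears in the commit messages below).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that a random size-k subset of the
    n sampled completions contains at least one of the c correct ones."""
    if n - c < k:  # too few incorrect samples to fill a size-k subset
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def get_vllm_max_model_len(max_prompt_length: int, response_length: int, eval_response_length: int | None) -> int:
    """Size the vLLM context window for the longer of train/eval generations."""
    return max_prompt_length + max(response_length, eval_response_length or 0)
```

Note that pass_at_k(n, c, 1) reduces to c / n, i.e. eval/pass_at_1 is mean per-sample accuracy, while eval/pass_at_k uses all eval_pass_at_k completions generated per eval prompt.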

Testing

  • .venv/bin/ruff format open_instruct *mason.py
  • .venv/bin/ruff check -q --fix open_instruct *mason.py
  • .venv/bin/python -m compileall -qq open_instruct *mason.py
  • timeout 180 .venv/bin/python -m pytest open_instruct/test_grpo_fast_eval.py -q (fails in this local env because conftest.py imports vllm, which crashes on a broken local torch install: AttributeError: module 'torch' has no attribute 'Tensor')
  • .venv/bin/ty check (blocked by the same broken local torch environment; reports repo-wide unresolved torch attributes)

@gemini-code-assist
Contributor

Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the local evaluation framework by integrating Pass@K metrics and robust model step tracking. These improvements provide a more granular and accurate understanding of model performance during training, especially for tasks where multiple solution attempts are relevant, and ensure better synchronization between the training process and the evaluation results.

Highlights

  • Pass@K Evaluation: Introduced the ability to perform Pass@K evaluation during local model assessment. This allows for generating multiple completions per evaluation prompt (controlled by eval_pass_at_k) to measure the success rate when multiple attempts are allowed.
  • Model Step Tracking in Evaluation: Implemented tracking of the model_step during inference and evaluation. This includes adding a model_step field to GenerationResult, collecting the min/max/mean/span of model steps during batch accumulation, and reporting step differences in evaluation metrics to gauge the freshness of the evaluated model (see the sketch after this list).
  • Refined Evaluation Scheduling: Modified the evaluation logic to defer processing of evaluation results on non-final training steps if the queue of results is incomplete. This ensures that evaluation metrics are calculated on full batches, preventing partial or skewed results.
  • VLLM Engine Integration: Updated VLLM engines to receive and utilize the model_step during weight synchronization and request processing, ensuring that the inference engines are aware of the current training progress. The Ray placement group strategy for VLLM engines was also adjusted from STRICT_SPREAD to SPREAD.
  • New and Updated Training Scripts: Added several new shell scripts for training Qwen models (1.5B, 1.7B, 4B) with RLZero Math, incorporating the new Pass@K evaluation parameters. Existing debug and OLMO training scripts were also updated to reflect these new evaluation capabilities and model configurations.
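
None of this code is quoted in the thread; below is a minimal sketch of the shape these highlights imply, with field and method names taken from the changelog that follows and everything else assumed.

```python
from dataclasses import dataclass


@dataclass
class GenerationResult:
    # ... existing generation fields elided ...
    model_step: int | None = None  # training step of the weights that produced this result


class LLMRayActor:
    # ... existing vLLM actor setup elided ...
    current_model_step: int = 0

    def set_model_step(self, model_step: int) -> None:
        # Called from the weight-sync thread after new weights are broadcast,
        # so subsequent requests get tagged with the step that served them.
        self.current_model_step = model_step
```

This model-step plumbing was later dropped to keep the PR scoped to pass@k (see the "Scope PR to pass@k and eval_response_length only" commit below), so the sketch reflects the state the bot reviewed, not what was merged.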

Changelog
  • open_instruct/benchmark_generators.py
    • Modified rlvr_tokenize_v1 arguments to explicitly set system_prompt_override to None.
  • open_instruct/data_loader.py
    • Added all_model_steps list to accumulate_inference_batches to collect model step information.
    • Extended accumulate_inference_batches to append result.model_step to all_model_steps if available.
    • Added calculation of model_step_min, model_step_max, model_step_mean, and model_step_span to combined_reward_metrics.
  • open_instruct/data_types.py
    • Added model_step: int | None field to the GenerationResult dataclass.
  • open_instruct/grpo_fast.py
    • Introduced eval_pass_at_k argument to Args for configuring the number of completions for Pass@K metrics, with validation.
    • Changed the Ray placement group strategy from STRICT_SPREAD to SPREAD for model and optimizer creation.
    • Modified eval_generation_config to use args.eval_pass_at_k for the number of generations (n).
    • Updated weight_sync_thread to accept vllm_engines and a weight_sync_steps_Q for model step synchronization.
    • Implemented logic in weight_sync_thread to retrieve and set the target_model_step on vLLM engines.
    • Refactored maybe_evaluate to defer evaluation on non-final steps if the evaluation results queue is incomplete (see the sketch after this changelog).
    • Passed max_possible_score to accumulate_inference_batches in maybe_evaluate.
    • Added calculation and logging of pass_at_1, pass_at_k, and model_step_diff metrics in maybe_evaluate.
    • Initialized weight_sync_steps_Q in run_training and set initial model steps for vLLM engines.
    • Added training_step to weight_sync_steps_Q to trigger model step synchronization.
  • open_instruct/test_grpo_fast_eval.py
    • Added a new test file to cover the evaluation logic in grpo_fast.py.
    • Includes tests for evaluation deferral, handling incomplete queues, and accurate recording of model step metrics.
  • open_instruct/vllm_utils.py
    • Included model_step in the GenerationResult when processing completed requests.
    • Added model_step to the request metadata using actor.current_model_step.
    • Initialized current_model_step: int = 0 in the LLMRayActor configuration.
    • Added a set_model_step method to LLMRayActor to update its internal current_model_step.
  • scripts/train/debug/grpo_fast.sh
    • Updated uv run python to uv run --active python.
    • Changed model_name_or_path to deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
    • Removed several deprecated or unused arguments and added --vllm_num_engines 1.
  • scripts/train/olmo3/7b_rlzero_math.sh
    • Updated BEAKER_IMAGE and added dynamic image selection logic.
    • Changed python mason.py to uv run mason.py.
    • Increased num_nodes from 8 to 9.
    • Updated LOCAL_EVALS and EVALS to reflect new RLZero Math datasets and Pass@K metrics.
    • Adjusted various training parameters including async_steps, num_samples_per_prompt_rollout, response_length, pack_length, total_episodes, vllm_num_engines, local_eval_every, mask_truncated_completions, and enabled eval_on_step_0.
  • scripts/train/qwen/1.5b_rlzero_math.sh
    • Added new script for training deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with RLZero Math and Pass@K evaluation.
  • scripts/train/qwen/debug_1.5b_rlzero_math.sh
    • Added new debug script for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B training.
  • scripts/train/qwen/local_1.5b_rlzero_math.sh
    • Added new local training script for Qwen/Qwen3-1.7B-Base with Pass@K evaluation.
  • scripts/train/qwen/qwen3_1.7b_rlzero_math.sh
    • Added new script for training Qwen/Qwen3-1.7B-Base with RLZero Math and Pass@K evaluation.
  • scripts/train/qwen/qwen3_4b_rlzero_math.sh
    • Added new script for training Qwen/Qwen3-4B-Base with RLZero Math and Pass@K evaluation.
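
The refactored maybe_evaluate is not quoted here either; below is a minimal sketch of the deferral behavior described above, with hypothetical queue and parameter names.

```python
import queue


def maybe_evaluate(eval_results_Q: queue.Queue, num_eval_results: int, is_final_step: bool):
    """Drain and score evaluation results only when a full batch is available."""
    if not is_final_step and eval_results_Q.qsize() < num_eval_results:
        # Deferring keeps the queue intact: draining a partial batch here
        # would drop in-flight results and skew pass@k on a later step.
        return None
    results = [eval_results_Q.get() for _ in range(num_eval_results)]
    # ... accumulate inference batches, compute eval/pass_at_1 and eval/pass_at_k, log ...
    return results
```

On the final step the guard is skipped and the drain blocks until every result arrives, so nothing is silently lost when training exits.
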
Activity
  • The pull request introduces a new eval_pass_at_k argument and associated logic to enable Pass@K evaluation, which is crucial for tasks requiring multiple attempts for a solution.
  • It refactors the evaluation pipeline to track and report the model_step during inference, providing valuable insights into the model's state at the time of evaluation.
  • New unit tests have been added to open_instruct/test_grpo_fast_eval.py to ensure the correctness of the updated evaluation deferral and model step tracking mechanisms.
  • Several training configuration scripts have been added or updated, indicating active development and expansion of supported models and evaluation strategies within the open_instruct framework.

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces pass@k metrics for local evaluation and adds tracking of model steps for generated responses. The changes span argument parsing, generation configuration, evaluation logic, and vLLM actor communication. The implementation appears solid, and the inclusion of new tests for the evaluation logic is a great addition. I've provided a few suggestions for minor improvements regarding efficiency, consistency, and code duplication.

Comment thread (outdated): open_instruct/grpo_fast.py
Comment thread (outdated): scripts/train/qwen/debug_1.5b_rlzero_math.sh
@mnoukhov changed the title from "Pass@K for local eval" to "Add pass@k metrics for local eval" on Mar 26, 2026
mnoukhov added 2 commits April 8, 2026 14:29
…esponse_length, base_env_config)

Made-with: Cursor
@mnoukhov marked this pull request as ready for review on April 8, 2026 18:31
mnoukhov added 2 commits April 8, 2026 14:38
…del-step extras

- Restore grpo_fast weight sync and placement group to match main
- Remove vLLM model_step plumbing and data_loader model_step aggregates
- Keep eval_pass_at_k, eval_response_length, get_vllm_max_model_len, local eval pass@k metrics

Made-with: Cursor
@allenai deleted a comment from gemini-code-assist (bot) on Apr 8, 2026
@allenai deleted a comment from gemini-code-assist (bot) on Apr 8, 2026
@hamishivi added this pull request to the merge queue on Apr 9, 2026
Merged via the queue into main with commit 882d6e6 on Apr 9, 2026
7 checks passed
@hamishivi deleted the eval_pass_at_k branch on April 9, 2026 18:16
davidheineman pushed a commit that referenced this pull request Apr 10, 2026
* pass in `model_name_or_path` that is on augusta and it works

* make src path list

* Refactor gs bucket download test

* download_from_gs_bucket a separate command and removed try except

* script and queue size fix

* regular oe-eval image

* fix path name

* rerun from 2k steps

* 9 nodes

* final script

* deepscaler comparison

* rlzero final

* max episodes 3k steps

* scripts

* 4b

* single node mixed GPUs

* 4 gpu

* check in stuff

* User prompt transform

* pass at k for local eval

* Add eval model-step drift metrics and queue safety

* Refine eval model-step drift metrics to diff-only

* Remove user_prompt_transform wiring

* cleanup

* weight sync trigger, cleaner

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* undo change

* Add pass@k metrics for local eval

* Add changelog entry for PR 1464

* Scope PR to pass@k and eval_response_length only; drop weight-sync/model-step extras

- Restore grpo_fast weight sync and placement group to match main
- Remove vLLM model_step plumbing and data_loader model_step aggregates
- Keep eval_pass_at_k, eval_response_length, get_vllm_max_model_len, local eval pass@k metrics

Made-with: Cursor

* eval pass at k simplified

* quality and style

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>