
Add pass@k metrics for local eval #1464

Merged
hamishivi merged 44 commits into main from eval_pass_at_k on Apr 9, 2026

Conversation

@mnoukhov
Contributor

@mnoukhov commented on Feb 6, 2026

Summary

  • add eval_pass_at_k to local GRPO eval and report eval/pass_at_1 plus eval/pass_at_k (sketched after this list)
  • add an optional eval_response_length and size vLLM max_model_len to the larger of the train/eval response lengths, so longer eval generations fit
  • on non-final steps, only evaluate once a full batch of results is ready; this avoids partially draining the eval results queue and losing results
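
The PR's implementation isn't quoted in this thread, so the following is a minimal sketch: it assumes the standard unbiased pass@k estimator (Chen et al., 2021), and the signature of get_vllm_max_model_len is hypothetical (only the helper's name appears in the commit messages below).

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that a random size-k subset of the
    n sampled completions contains at least one of the c correct ones."""
    if n - c < k:  # too few incorrect samples to fill a size-k subset
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def get_vllm_max_model_len(max_prompt_length: int, response_length: int, eval_response_length: int | None) -> int:
    """Size the vLLM context window for the longer of train/eval generations."""
    return max_prompt_length + max(response_length, eval_response_length or 0)
```

Note that pass_at_k(n, c, 1) reduces to c / n, i.e. eval/pass_at_1 is mean per-sample accuracy, while eval/pass_at_k uses all eval_pass_at_k completions generated per eval prompt.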

Testing

  • .venv/bin/ruff format open_instruct *mason.py
  • .venv/bin/ruff check -q --fix open_instruct *mason.py
  • .venv/bin/python -m compileall -qq open_instruct *mason.py
  • timeout 180 .venv/bin/python -m pytest open_instruct/test_grpo_fast_eval.py -q (fails in this local env because conftest.py imports vllm, which crashes on a broken local torch install: AttributeError: module 'torch' has no attribute 'Tensor')
  • .venv/bin/ty check (blocked by the same broken local torch environment; reports repo-wide unresolved torch attributes)

@gemini-code-assist
Contributor

Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the local evaluation framework by integrating Pass@K metrics and robust model step tracking. These improvements provide a more granular and accurate understanding of model performance during training, especially for tasks where multiple solution attempts are relevant, and ensure better synchronization between the training process and the evaluation results.

Highlights

  • Pass@K Evaluation: Introduced the ability to perform Pass@K evaluation during local model assessment. This allows for generating multiple completions per evaluation prompt (controlled by eval_pass_at_k) to measure the success rate when multiple attempts are allowed.
  • Model Step Tracking in Evaluation: Implemented tracking of the model_step during inference and evaluation. This includes adding a model_step field to GenerationResult, collecting the min/max/mean/span of model steps during batch accumulation, and reporting step differences in evaluation metrics to gauge the freshness of the evaluated model (see the sketch after this list).
  • Refined Evaluation Scheduling: Modified the evaluation logic to defer processing of evaluation results on non-final training steps if the queue of results is incomplete. This ensures that evaluation metrics are calculated on full batches, preventing partial or skewed results.
  • VLLM Engine Integration: Updated VLLM engines to receive and utilize the model_step during weight synchronization and request processing, ensuring that the inference engines are aware of the current training progress. The Ray placement group strategy for VLLM engines was also adjusted from STRICT_SPREAD to SPREAD.
  • New and Updated Training Scripts: Added several new shell scripts for training Qwen models (1.5B, 1.7B, 4B) with RLZero Math, incorporating the new Pass@K evaluation parameters. Existing debug and OLMO training scripts were also updated to reflect these new evaluation capabilities and model configurations.
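
None of this code is quoted in the thread; below is a minimal sketch of the shape these highlights imply, with field and method names taken from the changelog that follows and everything else assumed.

```python
from dataclasses import dataclass


@dataclass
class GenerationResult:
    # ... existing generation fields elided ...
    model_step: int | None = None  # training step of the weights that produced this result


class LLMRayActor:
    # ... existing vLLM actor setup elided ...
    current_model_step: int = 0

    def set_model_step(self, model_step: int) -> None:
        # Called from the weight-sync thread after new weights are broadcast,
        # so subsequent requests get tagged with the step that served them.
        self.current_model_step = model_step
```

This model-step plumbing was later dropped to keep the PR scoped to pass@k (see the "Scope PR to pass@k and eval_response_length only" commit below), so the sketch reflects the state the bot reviewed, not what was merged.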

Changelog
  • open_instruct/benchmark_generators.py
    • Modified rlvr_tokenize_v1 arguments to explicitly set system_prompt_override to None.
  • open_instruct/data_loader.py
    • Added all_model_steps list to accumulate_inference_batches to collect model step information.
    • Extended accumulate_inference_batches to append result.model_step to all_model_steps if available.
    • Added calculation of model_step_min, model_step_max, model_step_mean, and model_step_span to combined_reward_metrics.
  • open_instruct/data_types.py
    • Added model_step: int | None field to the GenerationResult dataclass.
  • open_instruct/grpo_fast.py
    • Introduced eval_pass_at_k argument to Args for configuring the number of completions for Pass@K metrics, with validation.
    • Changed the Ray placement group strategy from STRICT_SPREAD to SPREAD for model and optimizer creation.
    • Modified eval_generation_config to use args.eval_pass_at_k for the number of generations (n).
    • Updated weight_sync_thread to accept vllm_engines and a weight_sync_steps_Q for model step synchronization.
    • Implemented logic in weight_sync_thread to retrieve and set the target_model_step on vLLM engines.
    • Refactored maybe_evaluate to defer evaluation on non-final steps if the evaluation results queue is incomplete (see the sketch after this changelog).
    • Passed max_possible_score to accumulate_inference_batches in maybe_evaluate.
    • Added calculation and logging of pass_at_1, pass_at_k, and model_step_diff metrics in maybe_evaluate.
    • Initialized weight_sync_steps_Q in run_training and set initial model steps for vLLM engines.
    • Added training_step to weight_sync_steps_Q to trigger model step synchronization.
  • open_instruct/test_grpo_fast_eval.py
    • Added a new test file to cover the evaluation logic in grpo_fast.py.
    • Includes tests for evaluation deferral, handling incomplete queues, and accurate recording of model step metrics.
  • open_instruct/vllm_utils.py
    • Included model_step in the GenerationResult when processing completed requests.
    • Added model_step to the request metadata using actor.current_model_step.
    • Initialized current_model_step: int = 0 in the LLMRayActor configuration.
    • Added a set_model_step method to LLMRayActor to update its internal current_model_step.
  • scripts/train/debug/grpo_fast.sh
    • Updated uv run python to uv run --active python.
    • Changed model_name_or_path to deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
    • Removed several deprecated or unused arguments and added --vllm_num_engines 1.
  • scripts/train/olmo3/7b_rlzero_math.sh
    • Updated BEAKER_IMAGE and added dynamic image selection logic.
    • Changed python mason.py to uv run mason.py.
    • Increased num_nodes from 8 to 9.
    • Updated LOCAL_EVALS and EVALS to reflect new RLZero Math datasets and Pass@K metrics.
    • Adjusted various training parameters including async_steps, num_samples_per_prompt_rollout, response_length, pack_length, total_episodes, vllm_num_engines, local_eval_every, mask_truncated_completions, and enabled eval_on_step_0.
  • scripts/train/qwen/1.5b_rlzero_math.sh
    • Added new script for training deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B with RLZero Math and Pass@K evaluation.
  • scripts/train/qwen/debug_1.5b_rlzero_math.sh
    • Added new debug script for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B training.
  • scripts/train/qwen/local_1.5b_rlzero_math.sh
    • Added new local training script for Qwen/Qwen3-1.7B-Base with Pass@K evaluation.
  • scripts/train/qwen/qwen3_1.7b_rlzero_math.sh
    • Added new script for training Qwen/Qwen3-1.7B-Base with RLZero Math and Pass@K evaluation.
  • scripts/train/qwen/qwen3_4b_rlzero_math.sh
    • Added new script for training Qwen/Qwen3-4B-Base with RLZero Math and Pass@K evaluation.
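
The refactored maybe_evaluate is not quoted here either; below is a minimal sketch of the deferral behavior described above, with hypothetical queue and parameter names.

```python
import queue


def maybe_evaluate(eval_results_Q: queue.Queue, num_eval_results: int, is_final_step: bool):
    """Drain and score evaluation results only when a full batch is available."""
    if not is_final_step and eval_results_Q.qsize() < num_eval_results:
        # Deferring keeps the queue intact: draining a partial batch here
        # would drop in-flight results and skew pass@k on a later step.
        return None
    results = [eval_results_Q.get() for _ in range(num_eval_results)]
    # ... accumulate inference batches, compute eval/pass_at_1 and eval/pass_at_k, log ...
    return results
```

On the final step the guard is skipped and the drain blocks until every result arrives, so nothing is silently lost when training exits.
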
Activity
  • The pull request introduces a new eval_pass_at_k argument and associated logic to enable Pass@K evaluation, which is crucial for tasks requiring multiple attempts for a solution.
  • It refactors the evaluation pipeline to track and report the model_step during inference, providing valuable insights into the model's state at the time of evaluation.
  • New unit tests have been added to open_instruct/test_grpo_fast_eval.py to ensure the correctness of the updated evaluation deferral and model step tracking mechanisms.
  • Several training configuration scripts have been added or updated, indicating active development and expansion of supported models and evaluation strategies within the open_instruct framework.

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces pass@k metrics for local evaluation and adds tracking of model steps for generated responses. The changes span argument parsing, generation configuration, evaluation logic, and vLLM actor communication. The implementation appears solid, and the inclusion of new tests for the evaluation logic is a great addition. I've provided a few suggestions for minor improvements regarding efficiency, consistency, and code duplication.

Comment thread (outdated): open_instruct/grpo_fast.py
Comment thread (outdated): scripts/train/qwen/debug_1.5b_rlzero_math.sh
@mnoukhov changed the title from "Pass@K for local eval" to "Add pass@k metrics for local eval" on Mar 26, 2026
mnoukhov added 2 commits April 8, 2026 14:29
…esponse_length, base_env_config)

Made-with: Cursor
@mnoukhov marked this pull request as ready for review on April 8, 2026 18:31
mnoukhov added 2 commits April 8, 2026 14:38
…del-step extras

- Restore grpo_fast weight sync and placement group to match main
- Remove vLLM model_step plumbing and data_loader model_step aggregates
- Keep eval_pass_at_k, eval_response_length, get_vllm_max_model_len, local eval pass@k metrics

Made-with: Cursor
@allenai deleted a comment from gemini-code-assist (bot) on Apr 8, 2026
@allenai deleted a comment from gemini-code-assist (bot) on Apr 8, 2026
@hamishivi added this pull request to the merge queue on Apr 9, 2026
Merged via the queue into main with commit 882d6e6 on Apr 9, 2026
7 checks passed
@hamishivi deleted the eval_pass_at_k branch on April 9, 2026 18:16
davidheineman pushed a commit that referenced this pull request Apr 10, 2026
* pass in `model_name_or_path` that is on augusta and it works

* make src path list

* Refactor gs bucket download test

* download_from_gs_bucket a separate command and removed try except

* script and queue size fix

* regular oe-eval image

* fix path name

* rerun from 2k steps

* 9 nodes

* final script

* deepscaler comparison

* rlzero final

* max episodes 3k steps

* scripts

* 4b

* single node mixed GPUs

* 4 gpu

* check in stuff

* User prompt transform

* pass at k for local eval

* Add eval model-step drift metrics and queue safety

* Refine eval model-step drift metrics to diff-only

* Remove user_prompt_transform wiring

* cleanup

* weight sync trigger, cleaner

* Apply suggestion from @gemini-code-assist[bot]

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* undo change

* Add pass@k metrics for local eval

* Add changelog entry for PR 1464

* Scope PR to pass@k and eval_response_length only; drop weight-sync/model-step extras

- Restore grpo_fast weight sync and placement group to match main
- Remove vLLM model_step plumbing and data_loader model_step aggregates
- Keep eval_pass_at_k, eval_response_length, get_vllm_max_model_len, local eval pass@k metrics

Made-with: Cursor

* eval pass at k simplified

* quality and style

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>