Summary of Changes

Hello @mnoukhov, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the local evaluation framework by integrating pass@k metrics and model step tracking. These improvements provide a more granular and accurate view of model performance during training, especially for tasks where multiple solution attempts are relevant, and help keep the training process synchronized with the evaluation results.

Highlights
Code Review
This pull request introduces pass@k metrics for local evaluation and adds tracking of model steps for generated responses. The changes span argument parsing, generation configuration, evaluation logic, and vLLM actor communication. The implementation appears solid, and the inclusion of new tests for the evaluation logic is a great addition. I've provided a few suggestions for minor improvements regarding efficiency, consistency, and code duplication.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Force-pushed from 7aadb5b to 2f02f1b
* …esponse_length, base_env_config)

  Made-with: Cursor

* …entConfig in eval tests

  Made-with: Cursor

* …del-step extras
  - Restore grpo_fast weight sync and placement group to match main
  - Remove vLLM model_step plumbing and data_loader model_step aggregates
  - Keep eval_pass_at_k, eval_response_length, get_vllm_max_model_len, local eval pass@k metrics

  Made-with: Cursor
* pass in `model_name_or_path` that is on augusta and it works
* make src path list
* Refactor gs bucket download test
* download_from_gs_bucket a separate command and removed try except
* script and queue size fix
* regular oe-eval image
* fix path name
* rerun from 2k steps
* 9 nodes
* final script
* deepscaler comparison
* rlzero final
* max episodes 3k steps
* scripts
* 4b
* single node mixed GPUs
* 4 gpu
* check in stuff
* User prompt transform
* pass at k for local eval
* Add eval model-step drift metrics and queue safety
* Refine eval model-step drift metrics to diff-only
* Remove user_prompt_transform wiring
* cleanup
* weight sync trigger, cleaner
* Apply suggestion from @gemini-code-assist[bot]

  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* undo change
* Add pass@k metrics for local eval
* Add changelog entry for PR 1464
* Scope PR to pass@k and eval_response_length only; drop weight-sync/model-step extras
  - Restore grpo_fast weight sync and placement group to match main
  - Remove vLLM model_step plumbing and data_loader model_step aggregates
  - Keep eval_pass_at_k, eval_response_length, get_vllm_max_model_len, local eval pass@k metrics

  Made-with: Cursor
* eval pass at k simplified
* quality and style

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
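The commit list keeps a `get_vllm_max_model_len` helper whose body is not shown in this thread. A minimal sketch of the sizing logic described in the summary (sizing `max_model_len` for the larger of the train/eval response lengths); the function name matches the commit message, but the signature and parameter names here are assumptions, not the actual open_instruct API:

```python
def get_vllm_max_model_len(max_prompt_len: int, train_response_len: int, eval_response_len: int) -> int:
    """Size vLLM's max_model_len so the longer of train/eval generations fits.

    Hypothetical signature; the real helper in open_instruct may differ.
    A sequence is prompt + response, so the budget is the prompt length
    plus the larger of the two response-length limits.
    """
    return max_prompt_len + max(train_response_len, eval_response_len)
```

With, say, a 512-token prompt budget, a 1024-token train response limit, and a 2048-token eval response limit, this yields 2560 rather than 1536, so longer eval generations are not truncated.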
Summary

- Add `eval_pass_at_k` to local GRPO eval and report `eval/pass_at_1` plus `eval/pass_at_k`
- Add `eval_response_length` and size vLLM `max_model_len` for the larger of train/eval response lengths so longer eval generations fit

Testing

- `.venv/bin/ruff format open_instruct *mason.py`
- `.venv/bin/ruff check -q --fix open_instruct *mason.py`
- `.venv/bin/python -m compileall -qq open_instruct *mason.py`
- `timeout 180 .venv/bin/python -m pytest open_instruct/test_grpo_fast_eval.py -q` (fails in this local env because `conftest.py` imports `vllm`, which crashes on a broken local `torch` install: `AttributeError: module torch has no attribute Tensor`)
- `.venv/bin/ty check` (blocked by the same broken local `torch` environment and reports repo-wide unresolved `torch` attributes)
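The PR reports `eval/pass_at_1` and `eval/pass_at_k`, but the thread does not show the estimator itself. A minimal sketch of the standard unbiased pass@k estimator over n samples per prompt with c correct; whether this matches the "eval pass at k simplified" commit is an assumption:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n total (c of them correct) is correct.

    Equivalent to 1 - C(n - c, k) / C(n, k), computed in product form
    for numerical stability.
    """
    if n - c < k:
        return 1.0  # fewer than k failures available, so a success is guaranteed
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

Note that pass@1 reduces to the plain accuracy c/n, so a single helper can back both reported metrics.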