GRPO OLMo-core feature parity: eval, checkpointer, schedulers#1672
Conversation
There was a problem hiding this comment.
Code Review
This pull request brings grpo.py to feature parity with grpo_fast.py by implementing new callbacks for evaluation and timing, refactoring shared logic into grpo_utils.py, and updating the Hugging Face export process. Key improvements include a pruning checkpointer and startup verification for model saving. The review feedback identifies off-by-one errors in the evaluation scheduling logic and recommends a safety check for the tokenizer's pad token to prevent potential runtime errors.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: facc8f71bc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| trainer_callbacks["checkpointer"] = olmo_core_utils.build_checkpointer_callback( | ||
| checkpointing_steps=self.grpo_config.checkpoint_state_freq, | ||
| ephemeral_save_interval=None, |
There was a problem hiding this comment.
Handle disabled checkpoint_state_freq before building checkpointer
When an OLMo-core GRPO run uses --checkpoint_state_freq -1 (or 0) to disable periodic state checkpoints, this forwards that value directly as the OLMo-core save_interval. The installed CheckpointerCallback rejects save_interval < 1 during construction, so the run aborts in TrainerConfig.build() instead of disabling checkpointing like the GRPO config/fast path allows. Skip registering the checkpointer or pass None when the configured frequency is non-positive.
Useful? React with 👍 / 👎.
| ) | ||
| from open_instruct.grpo_fast import create_generation_configs, maybe_evaluate | ||
| from open_instruct.grpo_fast import create_generation_configs | ||
| from open_instruct.grpo_utils import maybe_evaluate |
There was a problem hiding this comment.
Retarget maybe_evaluate mocks to grpo_utils
After moving maybe_evaluate into grpo_utils, the tests still patch open_instruct.grpo_fast.accumulate_inference_batches and the old grpo_fast print helpers. Calls through this imported function now resolve data_loader_lib.accumulate_inference_batches and model_utils.print_* in grpo_utils, so the final-step/metrics tests no longer intercept the dependencies and will exercise the real queue path instead of the mocks. Update the patch targets to the new module dependencies.
Useful? React with 👍 / 👎.
…uthored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t_state_to_hf; prune permanent checkpoints Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…noreply@anthropic.com>
…, scheduler types Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…noreply@anthropic.com>
67716a3 to
f322515
Compare
…re-parity # Conflicts: # CHANGELOG.md
…re-parity # Conflicts: # CHANGELOG.md # open_instruct/grpo.py # open_instruct/olmo_core_utils.py
…core now provides these mappings Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ated script tweaks Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…po_fast and olmo_core paths Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…uthored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rimentConfig.__post_init__ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| if self.checkpoint_state_dir is not None and self.checkpoint_state_freq == -1: | ||
| if self.checkpoint_state_dir is not None and self.checkpoint_state_freq <= 0: | ||
| raise ValueError("`checkpoint_state_freq` must be greater than 0 if `checkpoint_state_dir` is provided!") | ||
| if self.save_freq != self.checkpoint_state_freq: |
There was a problem hiding this comment.
Should this warning live in grpo.py instead? GRPOExperimentConfig is shared with grpo_fast.py, and grpo_fast.py still uses save_freq for periodic model saves. Putting the warning here means non-Olmo-core runs can see an Olmo-core-specific warning. I know it says "on the olmo-core training path..." but is it better to move it?
Brings the OLMo-core GRPO trainer (
grpo.py) up to feature parity withgrpo_fast.py:EvalCallbackthat pushes eval prompts ontoprompt_Qon cadence and drains results viagrpo_utils.maybe_evaluate; newsetup_evalactor RPC andm.setup_eval.remote(...)call fromgrpo.pymain; rank-0-only eval data loader.CheckpointerCallbacktofit()driven by--checkpoint_state_freqand pruning to--keep_last_n_checkpoints. Warn if--save_freqdiffers (it's a no-op on the olmo-core path).cosine/constant/linearbranches for--lr_scheduler_type(raise on anything else).priorityso itspost_stepruns after vLLM sync, and switch to_last_step_end-based timing sotime/totalis end-to-end.qwen3_4b_dapo_math.sh/qwen3_4b_dapo_math_oc.sh— acceptBEAKER_IMAGEenv var, route checkpoints to/tmp-3m/$RUN_NAME, add--use_rho_correctiondefaults and bump--activation_memory_budgetfor the OC variant.