GRPO OLMo-core feature parity: eval, checkpointer, schedulers #1672

Merged
finbarrtimbers merged 15 commits into main from finbarr/grpo-oc-feature-parity
May 15, 2026

Conversation

Collaborator

@finbarrtimbers finbarrtimbers commented May 8, 2026

Brings the OLMo-core GRPO trainer (grpo.py) up to feature parity with grpo_fast.py:

  • Eval: new EvalCallback that pushes eval prompts onto prompt_Q on cadence and drains results via grpo_utils.maybe_evaluate; new setup_eval actor RPC and m.setup_eval.remote(...) call from grpo.py main; rank-0-only eval data loader.
  • Checkpointing: add an OLMo-core CheckpointerCallback to fit() driven by --checkpoint_state_freq and pruning to --keep_last_n_checkpoints. Warn if --save_freq differs (it's a no-op on the olmo-core path).
  • Scheduler: add explicit cosine / constant / linear branches for --lr_scheduler_type (raise on anything else).
  • StepTimingCallback: lower priority so its post_step runs after vLLM sync, and switch to _last_step_end-based timing so time/total is end-to-end.
  • Scripts: qwen3_4b_dapo_math.sh / qwen3_4b_dapo_math_oc.sh — accept BEAKER_IMAGE env var, route checkpoints to /tmp-3m/$RUN_NAME, add --use_rho_correction defaults and bump --activation_memory_budget for the OC variant.
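The scheduler bullet can be illustrated with a small, framework-free sketch. The function name `lr_multiplier` and its parameters are hypothetical and do not appear in the PR; the actual change presumably builds OLMo-core scheduler objects rather than computing a factor by hand. The sketch also assumes no warmup and decay to zero, which the PR text does not state.

```python
import math


def lr_multiplier(scheduler_type: str, step: int, total_steps: int) -> float:
    """Return the factor applied to the base learning rate at `step`.

    Hypothetical helper for illustration only: no warmup, decays to 0.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps do not extrapolate.
    progress = min(max(step / total_steps, 0.0), 1.0)
    if scheduler_type == "constant":
        return 1.0
    if scheduler_type == "linear":
        return 1.0 - progress
    if scheduler_type == "cosine":
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    # Mirror the described behaviour: anything else is an error, not a silent default.
    raise ValueError(f"Unsupported lr_scheduler_type: {scheduler_type!r}")
```

Raising on an unknown value, rather than falling back to a default, means a typo in `--lr_scheduler_type` fails at startup instead of silently training with the wrong schedule.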

finbarrtimbers added a commit that referenced this pull request May 8, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request brings grpo.py to feature parity with grpo_fast.py by implementing new callbacks for evaluation and timing, refactoring shared logic into grpo_utils.py, and updating the Hugging Face export process. Key improvements include a pruning checkpointer and startup verification for model saving. The review feedback identifies off-by-one errors in the evaluation scheduling logic and recommends a safety check for the tokenizer's pad token to prevent potential runtime errors.
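The review threads are collapsed on this page, so the specific off-by-one is not visible. As a general illustration of this class of bug only, the sketch below makes the step-indexing convention explicit in a cadence check. The name `should_eval` and its parameters are hypothetical and are not taken from the PR.

```python
def should_eval(step: int, eval_freq: int, first_step: int = 1) -> bool:
    """Return True when an eval is due after completing `step`.

    `first_step` states whether the trainer counts steps from 0 or 1;
    leaving that convention implicit is a common source of off-by-one errors.
    """
    if eval_freq <= 0:
        return False  # a non-positive cadence disables evaluation
    completed = step - first_step + 1  # number of steps finished so far
    return completed > 0 and completed % eval_freq == 0
```

With 1-indexed steps and `eval_freq=5`, evals fire after steps 5 and 10; with 0-indexed steps (`first_step=0`), the same cadence fires after steps 4 and 9.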

Comment thread open_instruct/grpo_callbacks.py
Comment thread open_instruct/grpo_callbacks.py Outdated
Comment thread open_instruct/grpo_utils.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: facc8f71bc

Comment thread open_instruct/grpo_olmo_core_actor.py Outdated
Comment on lines +380 to +382
trainer_callbacks["checkpointer"] = olmo_core_utils.build_checkpointer_callback(
checkpointing_steps=self.grpo_config.checkpoint_state_freq,
ephemeral_save_interval=None,
P2 Badge Handle disabled checkpoint_state_freq before building checkpointer

When an OLMo-core GRPO run uses --checkpoint_state_freq -1 (or 0) to disable periodic state checkpoints, this forwards that value directly as the OLMo-core save_interval. The installed CheckpointerCallback rejects save_interval < 1 during construction, so the run aborts in TrainerConfig.build() instead of disabling checkpointing like the GRPO config/fast path allows. Skip registering the checkpointer or pass None when the configured frequency is non-positive.
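One of the two options named in the comment, skipping registration when the frequency is non-positive, might look like the sketch below. `maybe_add_checkpointer` and `strict_builder` are hypothetical stand-ins; the real `olmo_core_utils.build_checkpointer_callback` takes more arguments than shown here, and `strict_builder` only imitates the reported behaviour of rejecting intervals below 1.

```python
from typing import Any, Callable


def maybe_add_checkpointer(
    callbacks: dict[str, Any],
    checkpoint_state_freq: int,
    build_callback: Callable[[int], Any],
) -> dict[str, Any]:
    """Register a checkpointer only when the configured frequency is positive."""
    if checkpoint_state_freq >= 1:
        callbacks["checkpointer"] = build_callback(checkpoint_state_freq)
    # For -1 or 0 the builder is never called, so it cannot raise.
    return callbacks


def strict_builder(save_interval: int) -> tuple[str, int]:
    """Stand-in for a builder that rejects save_interval < 1."""
    if save_interval < 1:
        raise ValueError("save_interval must be >= 1")
    return ("checkpointer", save_interval)
```

The guard keeps the "disabled" sentinel values out of the builder entirely, so a run configured with `-1` starts normally instead of failing while the trainer is being constructed.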

 )
-from open_instruct.grpo_fast import create_generation_configs, maybe_evaluate
+from open_instruct.grpo_fast import create_generation_configs
+from open_instruct.grpo_utils import maybe_evaluate
P2 Badge Retarget maybe_evaluate mocks to grpo_utils

After moving maybe_evaluate into grpo_utils, the tests still patch open_instruct.grpo_fast.accumulate_inference_batches and the old grpo_fast print helpers. Calls through this imported function now resolve data_loader_lib.accumulate_inference_batches and model_utils.print_* in grpo_utils, so the final-step/metrics tests no longer intercept the dependencies and will exercise the real queue path instead of the mocks. Update the patch targets to the new module dependencies.
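The underlying rule is that `unittest.mock.patch` must target the namespace where a name is looked up at call time, not the module that happens to re-export the function. The self-contained sketch below reproduces the situation with two throwaway modules; `old_home`, `new_home`, `accumulate`, and `evaluate` are hypothetical names standing in for `grpo_fast`, `grpo_utils`, and their helpers.

```python
import sys
import types
from unittest import mock

# `new_home` owns `evaluate` and the helper it calls.
new_home = types.ModuleType("new_home")
exec(
    "def accumulate():\n"
    "    return 'real'\n"
    "def evaluate():\n"
    "    return accumulate()\n",
    new_home.__dict__,
)

# `old_home` merely re-exports both names, as a module would after a refactor.
old_home = types.ModuleType("old_home")
old_home.accumulate = new_home.accumulate
old_home.evaluate = new_home.evaluate

# Register both so mock.patch can resolve them by dotted path.
sys.modules["new_home"] = new_home
sys.modules["old_home"] = old_home

# Stale target: replaces old_home.accumulate, which evaluate() never reads.
with mock.patch("old_home.accumulate", return_value="mocked"):
    stale_result = old_home.evaluate()

# Retargeted: replaces the name in the namespace evaluate() resolves against.
with mock.patch("new_home.accumulate", return_value="mocked"):
    retargeted_result = old_home.evaluate()
```

Here `stale_result` is `'real'` and `retargeted_result` is `'mocked'`, which is the same failure mode the comment describes: the tests keep passing through to the real dependency until the patch target follows the function to its new module.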

…uthored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t_state_to_hf; prune permanent checkpoints Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…, scheduler types Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@finbarrtimbers finbarrtimbers changed the base branch from main to finbarr/hf-export-verify May 14, 2026 17:19
@finbarrtimbers finbarrtimbers changed the base branch from finbarr/hf-export-verify to main May 14, 2026 17:20
finbarrtimbers and others added 7 commits May 15, 2026 07:20
…re-parity

# Conflicts:
#	CHANGELOG.md
#	open_instruct/grpo.py
#	open_instruct/olmo_core_utils.py
…core now provides these mappings Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ated script tweaks Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…po_fast and olmo_core paths Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…uthored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rimentConfig.__post_init__ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@finbarrtimbers finbarrtimbers enabled auto-merge May 15, 2026 18:32
-if self.checkpoint_state_dir is not None and self.checkpoint_state_freq == -1:
+if self.checkpoint_state_dir is not None and self.checkpoint_state_freq <= 0:
     raise ValueError("`checkpoint_state_freq` must be greater than 0 if `checkpoint_state_dir` is provided!")
+if self.save_freq != self.checkpoint_state_freq:
Contributor

Should this warning live in grpo.py instead? GRPOExperimentConfig is shared with grpo_fast.py, and grpo_fast.py still uses save_freq for periodic model saves. Putting the warning here means non-OLMo-core runs can see an OLMo-core-specific warning. I know it says "on the olmo-core training path..." but would it be better to move it?
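If the warning were moved as suggested, it could become a small helper called only from the OLMo-core entry point, so the shared config class stays free of path-specific messages. The sketch below is an assumption about how that could look: `warn_if_save_freq_ignored` is a hypothetical name, the warning text is illustrative rather than the PR's actual wording, and the page does not show whether the merged code took this route.

```python
import warnings


def warn_if_save_freq_ignored(save_freq: int, checkpoint_state_freq: int) -> bool:
    """Warn, and return True, when `save_freq` would be silently ignored.

    Intended to be called only from the OLMo-core training entry point,
    where checkpoint cadence follows `checkpoint_state_freq`.
    """
    if save_freq != checkpoint_state_freq:
        warnings.warn(
            "On the olmo-core training path, save_freq is a no-op; "
            "checkpoint cadence follows checkpoint_state_freq.",
            stacklevel=2,
        )
        return True
    return False
```

Keeping the check out of the shared `__post_init__` means runs that still honour `save_freq` never see a message that does not apply to them.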

@finbarrtimbers finbarrtimbers added this pull request to the merge queue May 15, 2026
Merged via the queue into main with commit e91ada4 May 15, 2026
7 checks passed
@finbarrtimbers finbarrtimbers deleted the finbarr/grpo-oc-feature-parity branch May 15, 2026 19:50