Now, grpo.py matches grpo_fast.py on qwen3_4b_dapo_math{,_oc}.sh#1642
Open
finbarrtimbers wants to merge 127 commits into
Open
Now, grpo.py matches grpo_fast.py on qwen3_4b_dapo_math{,_oc}.sh#1642finbarrtimbers wants to merge 127 commits into
grpo.py matches grpo_fast.py on qwen3_4b_dapo_math{,_oc}.sh#1642finbarrtimbers wants to merge 127 commits into
Conversation
…po_math.sh image positional leak Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pen-instruct-dev Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…val to math verifier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…/brumo eval to math verifier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>" This reverts commit cf82a70.
…h dataset=math) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ed-By: Claude Opus 4.7 <noreply@anthropic.com>
…aude Opus 4.7 <noreply@anthropic.com>
… Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion checkpointing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ng works Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ng Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…po.py hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…o-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…enabled) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ache lock contention Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sync hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…to find hang location Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gs/lm_head Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… lowercase b) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…By: Claude Opus 4.7 <noreply@anthropic.com>
…ync hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…aude Opus 4.7 <noreply@anthropic.com>
…7 <noreply@anthropic.com>
…for grpo.py Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arrier to prevent rank desync into gloo bookkeeping collective Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…npoint hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…o ranks aren't suppressed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…speed grads not on p.grad)
… as numeric, inflating loss_denominator 60x in grpo.py
…po_utils to enable eval/* metric logging Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…match-grpo notes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-experiments # Conflicts: # CHANGELOG.md
…By: Claude Opus 4.7 <noreply@anthropic.com>
…n_counts between grpo_fast and grpo_utils Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…valued response_masks summed as numerics) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…locks pytest collection) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…edundant per-consumer .bool() coercions Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uences contract) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2 tasks
…, drop redundant per-consumer .bool() coercions Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>" This reverts commit 607613d.
…umer .bool() coercions Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…y: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 8, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes
grpo.py(the olmo-core / FSDP2 GRPO path) on two axes: (1) the step-0 weight sync was broken in three independent ways, and (2) the per-step logprob recompute was running cross-doc attention while vLLM was running intra-doc, blowing upval/tis_clipfrac~570× vs the HF reference even after the weight sync was working.Bug 0 — OLMo-core trainer logprob recompute drops doc boundaries
grpo_utils.forward_for_logprobscalls the model with(input_ids, attention_mask, position_ids). HF FA3 derives intra-doccu_seqlensfrom position-id resets automatically, sogrpo_fast.pyworks. OLMo-core'sTransformerdoes not — it requires explicitdoc_lens/max_doc_lensto do intra-doc attention on packed sequences, and silently drops bothattention_maskand any inferred structure otherwise. Withpack_length=10240, every packed sequence holds multiple docs, so OLMo-core was computing full cross-doc attention while vLLM (and HF FA3) used intra-doc — a mismatch that surfaces as massively elevatedval/tis_clipfrac.Fix: add
compute_olmo_core_doc_lens(attention_mask)helper that converts the integer-coded packed mask to(doc_lens, max_doc_lens), plus apass_olmo_core_doc_lens: boolflag onforward_for_logprobs/compute_logprobs. OLMo-core call sites inolmo_core_train_modules.py(ref-policy recompute, old-logprobs recompute, per-step new-logprobs) set the flag; HF call sites are unchanged. Seedocs/grpo_divergence.mdfor the full investigation.Bug 1 — FSDP2 root-module params produced bad NCCL sends (illegal memory access)
broadcast_weights_to_vllmonly unsharded the FSDP2 submodules returned by_get_fsdp2_submodules, which deliberately excludes the rootFSDPModule. Root-level params (e.g.model.norm.weight,lm_head.weight) were therefore still DTensors backed by only the local shard. Calling.contiguous().clone()on them produced a buffer with the global stride but only the local shard's storage, so the 399th NCCL broadcast hit unmapped memory — surfacing asCuda failure 700: an illegal memory accessinpyNcclCommunicator/enqueue.cc.Fix: also call
model.unshard()/model.reshard()on the root in the FSDP2 branch, and switch the FSDP2 branch togather_whole_model=True(per-block iteration deadlocks the CUDA stream because reshard collectives interleave with trainer->vLLM NCCL sends).Bug 2 —
LLMRayActor.update_weightsdidn't match vLLM's IPC calling conventionvLLM's
IPCWeightTransferEngine.trainer_send_weightscalls the actor with a single positional dict:update_weights.remote(dict(update_info=...)). Our actor's signature wasupdate_weights(names, dtype_names, shapes, packed, model_step)— so single-GPU runs (which use the IPC backend) failed at step 0 withTypeError: missing a required argument: 'dtype_names'.Fix: unify the actor signature on
update_weights(kwargs: dict, model_step=None)and have the NCCL callers inbroadcast_weights_to_vllmbuild the same{"update_info": {...}}shape. Single path for both backends.Bug 3 — Initial weight sync wasn't fired before
trainer.fit()grpo_fastdoes an explicit pre-training weight broadcast (initialize_weight_sync) so the first NCCL collective fires from a known-good state before rollouts start racing.grpo.pyskipped this. Fix: addPolicyTrainerOLMoCoreProcess.run_initial_weight_sync()and call it fromgrpo.main()beforetrainer.fit(). Addtorch.distributed.barrier()insetup_model_update_groupso non-rank-0 ranks wait for rank 0 to finish registering NCCL weight-transfer engines.Other changes
VLLMWeightSyncCallbackgainsinflight_updates: booland skips the blockingray_get(refs)when set, matchinggrpo_fast.torch.cuda.set_device(0)inpost_stepso the NCCL broadcast targets the right device under Ray's per-actorCUDA_VISIBLE_DEVICES._collect_weight_metadatasimplified — FSDP2 path no longer needed.--output_dir/--checkpoint_state_dirto/tmp-3m/${RUN_NAME}in the two qwen3-4b DAPO scripts.Results
val/tis_clipfrac(Qwen3-4B-Base DAPO-Math, 8×H100 jupiter):parozgkegrpo_fast.py)il33h8fli3e7d0b5OLMo-core trainer is now matching vLLM bit-for-bit on logprobs (slightly better than the HF reference).
Comparing OLMo-core (
i3e7d0b5, in progress) vs HF (parozgke, finished) over the first 540 steps:Numerics — at or below HF parity with vLLM:
val/tis_clipfracval/tis_ratiopolicy/clipfrac_avgval/advantages_maxOLMo-core's
tis_clipfracis now ~2.6× lower than HF's — i.e. its trainer logprobs match vLLM more tightly than HF's do.Speed — OLMo-core dominates:
time/trainingtime/totaltime/trainer_idle_waiting_for_inferencetime/getting_responsetime/weight_syncPer-step training time scales much better on OLMo-core: as response length and pack heaviness grow, HF's
time/trainingrose from 31.8s (first 160 steps) to 45.3s (first 540), while OLMo-core's only rose from 2.8s → 7.2s. Weight sync remains the one regression (~60% slower), consistent with the OLMo-core→vLLM sync path being a bit heavier.Learning signal — converging:
objective/verifiable_correct_rateloss/policy_avgSolve-rate gap closed from ~9% (step 160) to ~3% (step 540), confirming the early gap was sampling drift rather than a real regression. Both runs use
seed=1but vLLM scheduling +async_steps=4give different prompt mixes.Runs
qwen3_4b_dapo_math_oc.shwith the doc_lens fix (in progress, tis_clipfrac confirmed at parity)qwen3_4b_dapo_math_oc.shwith weight-sync fixes (FSDP2 / NCCL multi-GPU)Test plan
qwen3_4b_dapo_math_oc.sh(FSDP2 / NCCL, multi-GPU)scripts/train/debug/single_gpu_on_beaker.sh(DeepSpeed / IPC) regressionGPU_TESTS=bypass