Restore doc-id response_masks from pack_sequences#1664
Restore doc-id response_masks from pack_sequences#1664finbarrtimbers wants to merge 126 commits into
Conversation
…po_math.sh image positional leak Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pen-instruct-dev Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…val to math verifier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…/brumo eval to math verifier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>" This reverts commit cf82a70.
…h dataset=math) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ed-By: Claude Opus 4.7 <noreply@anthropic.com>
…aude Opus 4.7 <noreply@anthropic.com>
… Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion checkpointing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ng works Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ng Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…po.py hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…o-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…enabled) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ache lock contention Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sync hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…to find hang location Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gs/lm_head Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… lowercase b) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…By: Claude Opus 4.7 <noreply@anthropic.com>
…ync hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…aude Opus 4.7 <noreply@anthropic.com>
…7 <noreply@anthropic.com>
…for grpo.py Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arrier to prevent rank desync into gloo bookkeeping collective Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…npoint hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…o ranks aren't suppressed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… 4D additive mask, so HF and OLMo-core FIXED are apples-to-apples Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… was probe artifact (HF baseline lacked intra-doc mask). Reframe grad_norm hypotheses around DTensor norm aggregation, masked_mean denominator, and accumulation boundary Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nose grad_norm parity gap Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pture Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…h script BEAKER_IMAGE positional arg shadowing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…speed grads not on p.grad)
… as numeric, inflating loss_denominator 60x in grpo.py
…po_utils to enable eval/* metric logging Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…match-grpo notes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-experiments # Conflicts: # CHANGELOG.md
…By: Claude Opus 4.7 <noreply@anthropic.com>
…n_counts between grpo_fast and grpo_utils Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…valued response_masks summed as numerics) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…locks pytest collection) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…edundant per-consumer .bool() coercions Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uences contract) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…to bool at consumers Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7681c8edae
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
|
|
||
| def _compute_per_sample_token_counts(response_masks: list[torch.Tensor], device: torch.device | str) -> torch.Tensor: | ||
| return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device) |
There was a problem hiding this comment.
Count response positions instead of summing doc IDs
When pack_sequences now emits doc-id-valued response_masks, this helper over-counts any packed sample containing document IDs greater than 1 because it sums the IDs rather than counting nonzero positions. In the OLMo-core GRPO path, token_counts is then used to weight compute_metrics_from_loss_stats, so packed batches with multiple docs report skewed loss/ratio metrics; the added TestPerSampleTokenCounts case with mask values [1, 1, 2, 2, 2, 3] would return 11 instead of the expected 6. Use (mask[:, 1:] > 0).sum() here like calculate_token_counts does.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request fixes a critical logprob divergence between the OLMo-core GRPO trainer and vLLM by deriving document boundaries from packed attention masks and passing them to the transformer. It also reverts a previous change to restore int64 doc-id-valued response masks, necessitating updates to token counting logic across the codebase. Additionally, it introduces an EvalCallback to unify evaluation for the OLMo-core path, adds a fail-fast verification for HF model export compatibility, and improves infrastructure with checkpointer pruning and optimized file copying for Weka paths. Review feedback correctly identified a bug in the new token counting utility where doc-id values were being summed instead of counting active token positions.
|
|
||
|
|
||
| def _compute_per_sample_token_counts(response_masks: list[torch.Tensor], device: torch.device | str) -> torch.Tensor: | ||
| return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device) |
There was a problem hiding this comment.
The _compute_per_sample_token_counts function incorrectly sums the doc-id values in response_masks instead of counting the number of response tokens. Since response_masks are now int64 doc-id-valued (where 0 is padding and >0 are document IDs), this will lead to incorrect token counts. It should count the number of positions where the mask is greater than 0, as intended by the restoration in the PR description.
| return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device) | |
| return torch.tensor([(mask[:, 1:] > 0).sum().float() for mask in response_masks], device=device) |
Summary
Reverts the
pack_sequenceschange from #1642 which switchedresponse_masksfrom int64 doc-id-valued to bool, breakingdata_loader.py:1451:This site relies on the doc-id encoding (
i+1for tokens belonging to samplei,0otherwise) to gather per-token advantages via integer indexing. With bool masks of lengthpack_length, numpy reinterprets the index as a boolean mask against the length-N+1advantages array and crashes:This was hit at step 0 of
qwen3_4b_dapo_math_oc.sh(Beaker01KR2ATZC84W1X1ZS47W18N21J).pack_sequencesto emit int64 doc-id-valuedresponse_masks..bool()conversions back to the consumer call sites ingrpo_fast.py,grpo_utils.py, andolmo_core_train_modules.py.(mask[:, 1:] > 0).sum()incalculate_token_countsand the_compute_per_sample_token_countstest.Original PR: #1642.
Test plan
uv run pytest open_instruct/test_rl_utils.pypassesscripts/train/qwen/qwen3_4b_dapo_math_oc.shand confirm step 0 succeedsGPU_TESTS=bypass
🤖 Generated with Claude Code