Skip to content

Restore doc-id response_masks from pack_sequences#1664

Closed
finbarrtimbers wants to merge 126 commits into
mainfrom
finbarr/fix-mask
Closed

Restore doc-id response_masks from pack_sequences#1664
finbarrtimbers wants to merge 126 commits into
mainfrom
finbarr/fix-mask

Conversation

@finbarrtimbers
Copy link
Copy Markdown
Collaborator

Summary

Reverts the pack_sequences change from #1642 which switched response_masks from int64 doc-id-valued to bool, breaking data_loader.py:1451:

lookup_advantages = np.zeros(len(advantages) + 1, dtype=np.float32)
lookup_advantages[1:] = advantages
packed_advantages = [
    torch.tensor(lookup_advantages[packed_mask], dtype=torch.float32)
    for packed_mask in packed_sequences.response_masks
]

This site relies on the doc-id encoding (i+1 for tokens belonging to sample i, 0 otherwise) to gather per-token advantages via integer indexing. With bool masks of length pack_length, numpy reinterprets the index as a boolean mask against the length-N+1 advantages array and crashes:

IndexError: boolean index did not match indexed array along axis 0;
size of axis is 129 but size of corresponding boolean axis is 9930

This was hit at step 0 of qwen3_4b_dapo_math_oc.sh (Beaker 01KR2ATZC84W1X1ZS47W18N21J).

  • Restore pack_sequences to emit int64 doc-id-valued response_masks.
  • Move the .bool() conversions back to the consumer call sites in grpo_fast.py, grpo_utils.py, and olmo_core_train_modules.py.
  • Restore (mask[:, 1:] > 0).sum() in calculate_token_counts and the _compute_per_sample_token_counts test.

Original PR: #1642.

Test plan

  • uv run pytest open_instruct/test_rl_utils.py passes
  • Re-launch scripts/train/qwen/qwen3_4b_dapo_math_oc.sh and confirm step 0 succeeds

GPU_TESTS=bypass

🤖 Generated with Claude Code

…po_math.sh image positional leak Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pen-instruct-dev Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…val to math verifier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…/brumo eval to math verifier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>"

This reverts commit cf82a70.
…h dataset=math) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ed-By: Claude Opus 4.7 <noreply@anthropic.com>
… Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion checkpointing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ng works Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ng Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…po.py hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…o-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…enabled) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ache lock contention Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sync hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…to find hang location Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gs/lm_head Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… lowercase b) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ync hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…for grpo.py Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arrier to prevent rank desync into gloo bookkeeping collective Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…npoint hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…o ranks aren't suppressed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… 4D additive mask, so HF and OLMo-core FIXED are apples-to-apples Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… was probe artifact (HF baseline lacked intra-doc mask). Reframe grad_norm hypotheses around DTensor norm aggregation, masked_mean denominator, and accumulation boundary Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nose grad_norm parity gap Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pture Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…h script BEAKER_IMAGE positional arg shadowing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… as numeric, inflating loss_denominator 60x in grpo.py
…po_utils to enable eval/* metric logging Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…match-grpo notes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…n_counts between grpo_fast and grpo_utils Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…valued response_masks summed as numerics) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…locks pytest collection) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…edundant per-consumer .bool() coercions Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uences contract) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…to bool at consumers Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7681c8edae

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".



def _compute_per_sample_token_counts(response_masks: list[torch.Tensor], device: torch.device | str) -> torch.Tensor:
return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Count response positions instead of summing doc IDs

When pack_sequences now emits doc-id-valued response_masks, this helper over-counts any packed sample containing document IDs greater than 1 because it sums the IDs rather than counting nonzero positions. In the OLMo-core GRPO path, token_counts is then used to weight compute_metrics_from_loss_stats, so packed batches with multiple docs report skewed loss/ratio metrics; the added TestPerSampleTokenCounts case with mask values [1, 1, 2, 2, 2, 3] would return 11 instead of the expected 6. Use (mask[:, 1:] > 0).sum() here like calculate_token_counts does.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a critical logprob divergence between the OLMo-core GRPO trainer and vLLM by deriving document boundaries from packed attention masks and passing them to the transformer. It also reverts a previous change to restore int64 doc-id-valued response masks, necessitating updates to token counting logic across the codebase. Additionally, it introduces an EvalCallback to unify evaluation for the OLMo-core path, adds a fail-fast verification for HF model export compatibility, and improves infrastructure with checkpointer pruning and optimized file copying for Weka paths. Review feedback correctly identified a bug in the new token counting utility where doc-id values were being summed instead of counting active token positions.



def _compute_per_sample_token_counts(response_masks: list[torch.Tensor], device: torch.device | str) -> torch.Tensor:
return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The _compute_per_sample_token_counts function incorrectly sums the doc-id values in response_masks instead of counting the number of response tokens. Since response_masks are now int64 doc-id-valued (where 0 is padding and >0 are document IDs), this will lead to incorrect token counts. It should count the number of positions where the mask is greater than 0, as intended by the restoration in the PR description.

Suggested change
return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device)
return torch.tensor([(mask[:, 1:] > 0).sum().float() for mask in response_masks], device=device)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant