Restore doc-id response_masks from pack_sequences by finbarrtimbers · Pull Request #1664 · allenai/open-instruct

finbarrtimbers · 2026-05-08T14:49:16Z

Summary

Reverts the pack_sequences change from #1642, which switched response_masks from int64 doc-id values to bool and broke the advantage lookup at data_loader.py:1451:

lookup_advantages = np.zeros(len(advantages) + 1, dtype=np.float32)
lookup_advantages[1:] = advantages
packed_advantages = [
    torch.tensor(lookup_advantages[packed_mask], dtype=torch.float32)
    for packed_mask in packed_sequences.response_masks
]

This call site relies on the doc-id encoding (i+1 for tokens belonging to sample i, 0 otherwise) to gather per-token advantages via integer indexing. With bool masks of length pack_length, NumPy instead treats the index as a boolean mask against the length-(N+1) lookup_advantages array, and the length mismatch raises:

IndexError: boolean index did not match indexed array along axis 0;
size of axis is 129 but size of corresponding boolean axis is 9930

This was hit at step 0 of qwen3_4b_dapo_math_oc.sh (Beaker 01KR2ATZC84W1X1ZS47W18N21J).
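
The failure mode can be reproduced in isolation. The following is a minimal sketch with made-up toy values (three samples, one packed row of length 8); the array contents and the names doc_id_mask, bool_mask, per_token, and raised are illustrative assumptions, not taken from data_loader.py:

```python
import numpy as np

# Hypothetical toy data: one advantage per sample.
advantages = np.array([0.5, -1.0, 2.0], dtype=np.float32)

# Doc-id encoding: i+1 for response tokens of sample i, 0 everywhere else.
doc_id_mask = np.array([0, 1, 1, 0, 2, 2, 3, 0], dtype=np.int64)

# Slot 0 of the lookup table is the "no sample" entry, so id 0 maps to 0.0.
lookup_advantages = np.zeros(len(advantages) + 1, dtype=np.float32)
lookup_advantages[1:] = advantages

# Integer indexing gathers one value per packed position.
per_token = lookup_advantages[doc_id_mask]
# per_token.tolist() == [0.0, 0.5, 0.5, 0.0, -1.0, -1.0, 2.0, 0.0]

# A bool array is interpreted as a boolean selector instead, and its length (8)
# must match the indexed axis (4), so NumPy raises IndexError.
bool_mask = doc_id_mask > 0
try:
    lookup_advantages[bool_mask]
    raised = False
except IndexError:
    raised = True
```

With the doc-id mask, every packed position receives the advantage of the sample it belongs to; with the bool mask, the lookup fails before any values are gathered.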

Restore pack_sequences to emit int64 doc-id-valued response_masks.
Move the .bool() conversions back to the consumer call sites in grpo_fast.py, grpo_utils.py, and olmo_core_train_modules.py.
Restore (mask[:, 1:] > 0).sum() in calculate_token_counts and the _compute_per_sample_token_counts test.
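
To illustrate why consumers convert to bool before using the mask as a weight, here is a small NumPy sketch. masked_mean is a hypothetical helper written for this example (the repository's own helpers operate on torch tensors and may differ), and the loss values are made up:

```python
import numpy as np

# Hypothetical packed row: doc ids 1 and 3 mark response tokens, 0 marks the rest.
doc_id_mask = np.array([[0, 1, 1, 0, 3, 3]], dtype=np.int64)
per_token_loss = np.array([[9.0, 1.0, 1.0, 9.0, 2.0, 2.0]])


def masked_mean(values, mask):
    """Weighted mean of `values`, using `mask` entries as weights."""
    weights = mask.astype(values.dtype)
    return float((values * weights).sum() / weights.sum())


# Doc ids used directly as weights: tokens of sample 3 count three times.
skewed = masked_mean(per_token_loss, doc_id_mask)       # (1 + 1 + 6 + 6) / 8

# Converting at the consumer call site restores a plain 0/1 mask.
correct = masked_mean(per_token_loss, doc_id_mask > 0)  # (1 + 1 + 2 + 2) / 4
```

On torch tensors the equivalent conversion would be mask.bool() or mask > 0.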

Original PR: #1642.

Test plan

uv run pytest open_instruct/test_rl_utils.py passes
Re-launch scripts/train/qwen/qwen3_4b_dapo_math_oc.sh and confirm step 0 succeeds

GPU_TESTS=bypass

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7681c8edae

chatgpt-codex-connector · 2026-05-08T14:53:03Z



+def _compute_per_sample_token_counts(response_masks: list[torch.Tensor], device: torch.device | str) -> torch.Tensor:
+    return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device)


Count response positions instead of summing doc IDs

When pack_sequences now emits doc-id-valued response_masks, this helper over-counts any packed sample containing document IDs greater than 1 because it sums the IDs rather than counting nonzero positions. In the OLMo-core GRPO path, token_counts is then used to weight compute_metrics_from_loss_stats, so packed batches with multiple docs report skewed loss/ratio metrics; the added TestPerSampleTokenCounts case with mask values [1, 1, 2, 2, 2, 3] would return 11 instead of the expected 6. Use (mask[:, 1:] > 0).sum() here like calculate_token_counts does.
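
The arithmetic behind the 11-versus-6 figures can be checked with a short sketch. It uses NumPy rather than torch to stay dependency-light (the slicing and comparison expressions are written the same way), and it assumes the fixture carries one leading column that the [:, 1:] slice drops; the actual test fixture may be laid out differently:

```python
import numpy as np

# Assumed layout: a leading column (removed by the [:, 1:] slice) followed by
# the doc-id values quoted in the review.
mask = np.array([[0, 1, 1, 2, 2, 2, 3]], dtype=np.int64)

# Summing the ids adds up the doc-id values themselves.
summed_ids = int(mask[:, 1:].sum())      # 1 + 1 + 2 + 2 + 2 + 3 = 11

# Comparing against zero first counts response positions.
counted = int((mask[:, 1:] > 0).sum())   # 6 nonzero positions
```

The gap between the two grows with the doc ids in the pack, so rows that hold many samples are over-weighted the most.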


gemini-code-assist

Code Review

This pull request fixes a critical logprob divergence between the OLMo-core GRPO trainer and vLLM by deriving document boundaries from packed attention masks and passing them to the transformer. It also reverts a previous change to restore int64 doc-id-valued response masks, necessitating updates to token counting logic across the codebase. Additionally, it introduces an EvalCallback to unify evaluation for the OLMo-core path, adds a fail-fast verification for HF model export compatibility, and improves infrastructure with checkpointer pruning and optimized file copying for Weka paths. Review feedback correctly identified a bug in the new token counting utility where doc-id values were being summed instead of counting active token positions.

gemini-code-assist · 2026-05-08T14:55:06Z



+def _compute_per_sample_token_counts(response_masks: list[torch.Tensor], device: torch.device | str) -> torch.Tensor:
+    return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device)


The _compute_per_sample_token_counts function incorrectly sums the doc-id values in response_masks instead of counting the number of response tokens. Since response_masks are now int64 doc-id-valued (where 0 is padding and >0 are document IDs), this will lead to incorrect token counts. It should count the number of positions where the mask is greater than 0, as intended by the restoration in the PR description.

Suggested change

-    return torch.tensor([mask[:, 1:].sum().float() for mask in response_masks], device=device)
+    return torch.tensor([(mask[:, 1:] > 0).sum().float() for mask in response_masks], device=device)

finbarrtimbers added 30 commits April 21, 2026 16:13

cbd6c38 now running with urgent
b1fb8f9 Trim default-valued args from qwen35_4b_dapo_math.sh; fix qwen3_4b_dapo_math.sh image positional leak Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
8ca3ebd Make workspace overridable in qwen3_4b_dapo_math.sh; default to ai2/open-instruct-dev Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
cf82a70 Support comma-separated pairs in --remap_verifier; route aime/brumo eval to math verifier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5ac1441 Revert "Support comma-separated pairs in --remap_verifier; route aime/brumo eval to math verifier Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>" This reverts commit cf82a70.
61f3643 Point eval to allenai/{aime,brumo}_2025_openinstruct (republished with dataset=math) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f9e13f4 Added oc script
8e51fab Map Qwen/Qwen3-4B-Base to qwen3_4B config in olmo-core path Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9e8d9ba Add GRPO epoch sweep (2, 4) for qwen3-4b DAPO math Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9894192 Force full FSDP sharding in qwen3_4b_dapo_math_oc.sh to fit in 8xH100 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bc9348f Add activation_memory_budget=0.5 to olmo-core script to enable activation checkpointing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
364ecb4 Plumb compile_model through GRPOTrainModule so activation checkpointing works Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
88e50d3 Add TORCH_LOGS + NCCL flight recorder envs to debug grpo olmo-core hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
19fb781 Diagnostic: disable compile_model and lower pack_length to isolate grpo.py hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fb0f5dc Diagnostic: reduce response_length to satisfy pack_length assertion Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
69bf561 Restore oc.sh to match math.sh hyperparams (response/pack_length, AC enabled) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
21703ed Diagnostic B: per-actor TORCHINDUCTOR_CACHE_DIR to isolate inductor cache lock contention Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
82176e6 Diag C: log sleep ENTER/EXIT with active_tasks count to trace weight sync hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7c28fac Diag D: log post_step + broadcast_weights_to_vllm per-rank per-block to find hang location Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5171e52 Fix FSDP2 weight-sync deadlock: whole-model summon + include embeddings/lm_head Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f20a710 Fix Qwen3-4B config map: qwen3_4B -> qwen3_4b (TransformerConfig uses lowercase b) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3880ded Fix qwen3_4B config name casing in OLMO_MODEL_CONFIG_MAP Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4cd64a7 Use grpo_fast.py in qwen3_4b_dapo epoch sweep to avoid FSDP2 weight-sync hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
05d27e4 Fix GRPO double optim_step: override optim_step/zero_grads as no-ops Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ef66076 Temp: restrict sweep to epochs=4 only for relaunch Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
09fb265 Revert: restore sweep to epochs=[2, 4] Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
e5fa353 Add dist.barrier() in VLLMWeightSyncCallback + distributed debug env for grpo.py Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9743c91 Remove redundant entry barrier in VLLMWeightSyncCallback; keep exit barrier to prevent rank desync into gloo bookkeeping collective Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c597245 Add phase-entry logging to weight sync + NCCL heartbeat timeout to pinpoint hang Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f9183cb Use logger.warning for weight-sync/post_step phase markers so non-zero ranks aren't suppressed Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

finbarrtimbers added 27 commits May 4, 2026 11:19

bdccbdb Parity probe: HF baseline now applies intra-doc attention masking via 4D additive mask, so HF and OLMo-core FIXED are apples-to-apples Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f738530 Add one-shot step-1 capture hook for grpo.py and grpo_fast.py to diagnose grad_norm parity gap Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
25a35e7 Wire OPEN_INSTRUCT_DUMP_DIR/STEP into qwen DAPO scripts for step-1 capture Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7381f38 fix grpo.py setup_model ModelConfig type mismatch and grpo_fast launch script BEAKER_IMAGE positional arg shadowing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
133a36c use weka path for OPEN_INSTRUCT_DUMP_DIR in grpo oc launch script Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c49c882 add step-1 dump diff script for grpo.py vs grpo_fast.py parity
a3f2ce0 match-grpo: log step-1 dump diff findings (response_masks dtype, deepspeed grads not on p.grad)
5ef0901 diff_step1_dumps: print top-level scalars and samples breakdown
58f822b diff_step1_dumps: descend into samples list of per-microbatch dicts
9514695 diff_step1_dumps: inspect raw response_mask values to test doc-id-vs-0/1 hypothesis
cb81344 fix Bug 6: calculate_token_counts treats doc-id-valued response_masks as numeric, inflating loss_denominator 60x in grpo.py
b089b60 match-grpo: verify Bug 6 fix; loss_denominator 3.98M->66.9k, grad_norm 0.035->2.23
ffa0b4e add EvalCallback to grpo.py OLMo-core path; move maybe_evaluate to grpo_utils to enable eval/* metric logging Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6b5bc1b Remove debugging/instrumentation code (step1 capture, parity probes, match-grpo notes) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f485553 Merge remote-tracking branch 'origin/main' into finbarr/post-training-experiments # Conflicts: # CHANGELOG.md
7e7efff Add ty: ignore for type narrowings exposed by main merge Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c8af9b4 Drop legacy pt-state-dict conversion script and dedupe calculate_token_counts between grpo_fast and grpo_utils Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dc2e019 add failing test for Bug 6 at olmo_core_train_modules.py:440 (doc-id-valued response_masks summed as numerics) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
b319141 fix test_grpo_fast_eval import path for relocated maybe_evaluate (unblocks pytest collection) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
607613d fix Bug 6 at source: pack_sequences emits bool response_masks, drop redundant per-consumer .bool() coercions Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
08ef8d3 update test fixtures to use bool response_masks (matches new pack_sequences contract) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
a62baf6 Merge remote-tracking branch 'origin/main' into finbarr/post-training-experiments
fb716c8 now, with icepop
80e3c2b less activation checkpointing
7d52863 Revert pack_sequences bool response_masks; keep doc-id ints, convert to bool at consumers Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7681c8e Add CHANGELOG entry Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

finbarrtimbers closed this May 8, 2026

chatgpt-codex-connector Bot reviewed May 8, 2026


gemini-code-assist Bot reviewed May 8, 2026

finbarrtimbers wants to merge 126 commits into main from finbarr/fix-mask
