From 69df861e41e3235cf0dd746e81751f63c030de4e Mon Sep 17 00:00:00 2001
From: "claude[bot]" <41898282+claude[bot]@users.noreply.github.com>
Date: Tue, 2 Jun 2026 16:07:36 +0000
Subject: [PATCH 1/2] chore(claude): learn from #374

Add rule 5 bullet on gathering row-aligned per-sample tensors in a
single gather_for_metrics dict call to survive ragged last-batch de-pad.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 CLAUDE.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/CLAUDE.md b/CLAUDE.md
index 35e4444e..1b1b709b 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -31,6 +31,7 @@ These override defaults — read them before running anything.
 5. **Distributed forward/backward must keep collective counts aligned across ranks.** FSDP / ZeRO-3 hang at NCCL with mismatched all-gather sizes when ranks disagree on what `forward` does. The failure mode is silent at CPU smoke-test time and only manifests after collectives diverge on a real run, so the patterns below are mandatory — copy them when extending the relevant code, don't reinvent them:
    - **Per-rank branch decisions that fire collectives must be OR-reduced first.** When a `forward` takes a Python-level branch based on what the local micro-batch contains (e.g. `if has_response: embed_language_tokens(...)` in `embed_prefix`), use `_global_or_branch_decisions` in `src/opentau/policies/pi07/low_level/modeling_pi07_low_level.py` — one SUM all-reduce that both OR-reduces the per-rank decisions and asserts cross-rank presence agreement. Adding a new optional branch in distributed `forward` without going through it (or an equivalent pre-branch all-reduce) is the same bug.
    - **Composite forward units must be a single `nn.Module`.** Bundle multi-component decoder steps (e.g. a backbone layer paired with an action-expert layer) into one `nn.Module` so FSDP's all-gather hook prefetches every sub-component together — like `InterleavedDecoderLayer` in `src/opentau/policies/pi07/gemma3_with_expert.py`. Calling sub-components directly on a separately-wrapped layer (`layer.input_layernorm(...)`, `layer.self_attn.q_proj(...)`) bypasses the hook and triggers mismatched all-gather sizes across ranks.
+   - **Per-sample tensors that must stay row-aligned across ranks belong in one `gather_for_metrics({...})` dict call.** `accelerate.gather_for_metrics` de-pads a ragged final batch (when total samples aren't divisible by `world_size`); issuing a separate call per tensor relies on each independent de-pad landing on the exact same trim, which silently desyncs paired rows (e.g. per-sample losses vs. their `dataset_index` / source-index provenance). Gather all the paired per-sample tensors together in a single dict call so one identical trim applies to every entry — like the validation per-sample gather in `src/opentau/scripts/train.py`. This is a correctness bug, not a hang, and the CPU suite can't catch it (the multi-rank de-pad path only fires on >1-rank hardware).
 
 ## Project overview
 

From 03feecbc8b994f5d6b1bd95b3dbef4c83071b760 Mon Sep 17 00:00:00 2001
From: Shuheng Liu <wish1104@icloud.com>
Date: Tue, 2 Jun 2026 17:28:56 +0000
Subject: [PATCH 2/2] chore(claude): refine rule 5 row-alignment bullet,
 broaden heading

---
 CLAUDE.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index 1b1b709b..c912c7d2 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -28,10 +28,10 @@ These override defaults — read them before running anything.
    - Exceptions where plain torch ops are fine: simple `.flatten()` of a contiguous suffix, single-axis `.sum()` over the last dim, contiguity calls (`.contiguous()`), and shape-preserving ops (`.to(...)`, `.float()`). Don't rewrite these just for the sake of einops.
    - When matching a reference (HuggingFace `transformers`, upstream LeRobot, the original π model code), preserve the existing op style verbatim inside that block — readability gains are not worth diff churn against an upstream reference.
 
-5. **Distributed forward/backward must keep collective counts aligned across ranks.** FSDP / ZeRO-3 hang at NCCL with mismatched all-gather sizes when ranks disagree on what `forward` does. The failure mode is silent at CPU smoke-test time and only manifests after collectives diverge on a real run, so the patterns below are mandatory — copy them when extending the relevant code, don't reinvent them:
+5. **Distributed forward/backward and metric gathers must stay consistent across ranks.** FSDP / ZeRO-3 hang at NCCL with mismatched all-gather sizes when ranks disagree on what `forward` does; gathered per-sample results silently desync rows when paired tensors are de-padded independently. Both failure modes are silent at CPU smoke-test time and only manifest on real multi-rank runs, so the patterns below are mandatory — copy them when extending the relevant code, don't reinvent them:
    - **Per-rank branch decisions that fire collectives must be OR-reduced first.** When a `forward` takes a Python-level branch based on what the local micro-batch contains (e.g. `if has_response: embed_language_tokens(...)` in `embed_prefix`), use `_global_or_branch_decisions` in `src/opentau/policies/pi07/low_level/modeling_pi07_low_level.py` — one SUM all-reduce that both OR-reduces the per-rank decisions and asserts cross-rank presence agreement. Adding a new optional branch in distributed `forward` without going through it (or an equivalent pre-branch all-reduce) is the same bug.
    - **Composite forward units must be a single `nn.Module`.** Bundle multi-component decoder steps (e.g. a backbone layer paired with an action-expert layer) into one `nn.Module` so FSDP's all-gather hook prefetches every sub-component together — like `InterleavedDecoderLayer` in `src/opentau/policies/pi07/gemma3_with_expert.py`. Calling sub-components directly on a separately-wrapped layer (`layer.input_layernorm(...)`, `layer.self_attn.q_proj(...)`) bypasses the hook and triggers mismatched all-gather sizes across ranks.
-   - **Per-sample tensors that must stay row-aligned across ranks belong in one `gather_for_metrics({...})` dict call.** `accelerate.gather_for_metrics` de-pads a ragged final batch (when total samples aren't divisible by `world_size`); issuing a separate call per tensor relies on each independent de-pad landing on the exact same trim, which silently desyncs paired rows (e.g. per-sample losses vs. their `dataset_index` / source-index provenance). Gather all the paired per-sample tensors together in a single dict call so one identical trim applies to every entry — like the validation per-sample gather in `src/opentau/scripts/train.py`. This is a correctness bug, not a hang, and the CPU suite can't catch it (the multi-rank de-pad path only fires on >1-rank hardware).
+   - **Per-sample tensors that must stay row-aligned across ranks belong in one `gather_for_metrics({...})` dict call of int tensors.** `accelerate.gather_for_metrics` de-pads the ragged final batch (total samples not divisible by `world_size`) by trimming the gathered result to `gradient_state.remainder`. Splitting the gather per quantity breaks row alignment two ways: (1) tensors and non-tensors trim differently — `gather` trims to `remainder`, the `gather_object` path used for lists (e.g. `dataset_repo_id` strings) does not — so provenance desyncs from the loss rows; (2) even all-tensor calls stay aligned only by relying on every independent trim being identical, an accelerate implementation detail. Build the provenance as int tensors and gather everything in one dict call so a single trim hits every entry, aligned by construction — like the validation per-sample gather in `src/opentau/scripts/train.py` (per-sample MSE/CE sums + counts alongside `dataset_index` and a source index). The failure is silent (per-dataset metrics bucketed under the wrong dataset, not a hang) and the CPU suite can't catch it: the multi-rank de-pad path only fires on >1-rank hardware.
 
 ## Project overview