Skip to content

chore(claude): learn from #374#375

Merged
shuheng-liu merged 2 commits into
mainfrom
chore/claude-learn-374
Jun 2, 2026
Merged

chore(claude): learn from #374#375
shuheng-liu merged 2 commits into
mainfrom
chore/claude-learn-374

Conversation

@claude
Copy link
Copy Markdown
Contributor

@claude claude Bot commented Jun 2, 2026

What this does

Captures a lesson from the review of #374 (per-(dataset, control_mode) validation aggregation) under rule 5 in CLAUDE.md, and broadens that rule so the new bullet sits honestly.

The reviewer flagged — and the author fixed in commit 8559365 — that gathering per-sample losses and their provenance indices in separate gather_for_metrics calls is fragile. accelerate.gather_for_metrics de-pads the ragged final batch by trimming the gathered result to gradient_state.remainder; a tensor and a non-tensor (e.g. dataset_repo_id strings, which take the gather_object path) trim differently, so provenance can desync from the loss rows. The fix gathers everything as int tensors in one gather_for_metrics({...}) dict call, so a single trim applies to every entry and rows stay aligned by construction. This is a non-obvious distributed correctness gotcha — silent wrong per-dataset metrics, not a hang — that the CPU suite can't catch (the multi-rank de-pad path only fires on >1-rank hardware).

Two refinements over the original lesson draft:

  • The bullet now leads with the real failure modes — the tensor/object de-pad divergence above, plus the fact that separate all-tensor calls only stay aligned by relying on an accelerate implementation detail (on the pinned 1.12.0 the trim is deterministic, so the original "the trims could land differently" framing overstated that path).
  • Rule 5's heading/intro is broadened from "collective counts aligned" to also name gather-time row alignment, so this non-hang bullet no longer sits under a hang-only title.

📝 Documentation

How it was tested

Documentation-only change to CLAUDE.md. Markdown-relevant pre-commit hooks pass (trailing-whitespace, end-of-file-fixer, typos). No code touched; nothing to run.

How to checkout & try? (for the reviewer)

git diff main -- CLAUDE.md

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

Add rule 5 bullet on gathering row-aligned per-sample tensors in a
single gather_for_metrics dict call to survive ragged last-batch de-pad.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor Author

claude Bot commented Jun 2, 2026

[claude-review] summary for commit 03feecb

No blocking issues found.

Re-verified after the refine/broaden commit. The new rule-5 bullet matches the code it cites: src/opentau/scripts/train.py:965 is the single gather_for_metrics({...}) dict call with the six entries (mse_sum/count, ce_sum/count, norm_index, source_index), and the bullet's de-pad / row-alignment rationale is corroborated by the in-code comment at train.py:950-964 (incl. the gather_object lists-not-trimmed detail). Heading broadened from "Distributed forward/backward" to "...and metric gathers" is accurate given the added bullet. Documentation-only; nothing to run.

@shuheng-liu shuheng-liu self-assigned this Jun 2, 2026
@shuheng-liu shuheng-liu self-requested a review June 2, 2026 17:52
@shuheng-liu shuheng-liu merged commit fc86d75 into main Jun 2, 2026
18 of 20 checks passed
@shuheng-liu shuheng-liu deleted the chore/claude-learn-374 branch June 2, 2026 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant