chore(claude): learn from #374 by claude[bot] · Pull Request #375 · TensorAuto/OpenTau

claude · 2026-06-02T16:07:53Z

What this does

Captures a lesson from the review of #374 (per-(dataset, control_mode) validation aggregation) under rule 5 in CLAUDE.md, and broadens that rule so the new bullet sits honestly.

The reviewer flagged — and the author fixed in commit 8559365 — that gathering per-sample losses and their provenance indices in separate gather_for_metrics calls is fragile. accelerate.gather_for_metrics de-pads the ragged final batch by trimming the gathered result to gradient_state.remainder; a tensor and a non-tensor (e.g. dataset_repo_id strings, which take the gather_object path) trim differently, so provenance can desync from the loss rows. The fix gathers everything as int tensors in one gather_for_metrics({...}) dict call, so a single trim applies to every entry and rows stay aligned by construction. This is a non-obvious distributed correctness gotcha — silent wrong per-dataset metrics, not a hang — that the CPU suite can't catch (the multi-rank de-pad path only fires on >1-rank hardware).

Two refinements over the original lesson draft:

The bullet now leads with the real failure modes — the tensor/object de-pad divergence above, plus the fact that separate all-tensor calls only stay aligned by relying on an accelerate implementation detail (on the pinned 1.12.0 the trim is deterministic, so the original "the trims could land differently" framing overstated that path).
Rule 5's heading/intro is broadened from "collective counts aligned" to also name gather-time row alignment, so this non-hang bullet no longer sits under a hang-only title.

📝 Documentation

How it was tested

Documentation-only change to CLAUDE.md. Markdown-relevant pre-commit hooks pass (trailing-whitespace, end-of-file-fixer, typos). No code touched; nothing to run.

How to checkout & try? (for the reviewer)

git diff main -- CLAUDE.md

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

Add rule 5 bullet on gathering row-aligned per-sample tensors in a single gather_for_metrics dict call to survive ragged last-batch de-pad. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

claude · 2026-06-02T16:09:12Z

[claude-review] summary for commit 03feecb

No blocking issues found.

Re-verified after the refine/broaden commit. The new rule-5 bullet matches the code it cites: src/opentau/scripts/train.py:965 is the single gather_for_metrics({...}) dict call with the six entries (mse_sum/count, ce_sum/count, norm_index, source_index), and the bullet's de-pad / row-alignment rationale is corroborated by the in-code comment at train.py:950-964 (incl. the gather_object lists-not-trimmed detail). Heading broadened from "Distributed forward/backward" to "...and metric gathers" is accurate given the added bullet. Documentation-only; nothing to run.

chore(claude): learn from #374

69df861

Add rule 5 bullet on gathering row-aligned per-sample tensors in a single gather_for_metrics dict call to survive ragged last-batch de-pad. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

shuheng-liu self-assigned this Jun 2, 2026

chore(claude): refine rule 5 row-alignment bullet, broaden heading

03feecb

shuheng-liu self-requested a review June 2, 2026 17:52

shuheng-liu approved these changes Jun 2, 2026

View reviewed changes

shuheng-liu merged commit fc86d75 into main Jun 2, 2026
18 of 20 checks passed

shuheng-liu deleted the chore/claude-learn-374 branch June 2, 2026 18:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(claude): learn from #374#375

chore(claude): learn from #374#375
shuheng-liu merged 2 commits into
mainfrom
chore/claude-learn-374

claude Bot commented Jun 2, 2026 •

edited by shuheng-liu

Loading

Uh oh!

claude Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

claude Bot commented Jun 2, 2026 • edited by shuheng-liu Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

How it was tested

How to checkout & try? (for the reviewer)

Checklist

Note: Before submitting this PR, please read the contributor guideline.

Uh oh!

claude Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Jun 2, 2026 •

edited by shuheng-liu

Loading

claude Bot commented Jun 2, 2026 •

edited

Loading