[WS1] Backward-pass consistency across all ops #153

Part of WS1 — Full Batch-Invariant Forward Chain (epic: #<WS1 tracking issue>)

## Why

A forward-aligned chain still breaks training if gradient reductions drift with batch shape: the optimizer then sees batch-dependent gradients, and the run can diverge even though inference looked aligned. Backward invariance is a separate, explicit requirement that the forward checks do not cover. This issue makes "backward also invariant" a first-class, cross-op acceptance condition.

## Scope

Make batch-invariant backward a required, tested property of every WS1 op.

- Define the backward-invariance check in the #108 harness (gradient outputs compared across batch configs, same tolerance policy as forward).
- Ensure each op's gradient reduction (`dX`, `dweight`, `dW`, etc.) uses a fixed, batch-shape-independent order, with no `atomicAdd` in backward.
- Cover the most reduction-heavy backward first: RMSNorm `dweight`, GEMM `dW`/`dX`, attention backward, embedding-grad scatter.
- Validate gradients across batch size 1 vs. N, chunked prefill on/off, and padding layouts.
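
To make the "fixed, batch-shape-independent order" requirement concrete, here is a minimal pure-Python sketch (not the project's kernels). It reduces the same per-token gradient contributions under three batch layouts, once with a split that follows the layout and once with a constant block size. The function names, the block size, and the sample values are all hypothetical and chosen only so the rounding difference is visible in float64.

```python
from typing import Dict, List, Sequence


def sequential_sum(values: Sequence[float]) -> float:
    """Plain left-to-right float accumulation (explicit loop, no compensation)."""
    acc = 0.0
    for v in values:
        acc = acc + v
    return acc


def layout_dependent_sum(chunks: Sequence[Sequence[float]]) -> float:
    """One partial sum per chunk, then combine: rounding follows the batch layout."""
    partials = [sequential_sum(chunk) for chunk in chunks]
    return sequential_sum(partials)


def fixed_order_sum(chunks: Sequence[Sequence[float]], block: int = 2) -> float:
    """Flatten to canonical token order, then reduce with a constant block size.

    `block` is a hypothetical compile-time constant; it never depends on how
    the caller split the tokens into chunks.
    """
    flat: List[float] = [v for chunk in chunks for v in chunk]
    partials = [
        sequential_sum(flat[i:i + block]) for i in range(0, len(flat), block)
    ]
    return sequential_sum(partials)


# The same four per-token contributions to one gradient element,
# presented in three different batch layouts (illustrative values).
layouts: Dict[str, List[List[float]]] = {
    "one_chunk": [[1e16, 1.0, -1e16, 1.0]],
    "two_chunks": [[1e16, 1.0], [-1e16, 1.0]],
    "per_token": [[1e16], [1.0], [-1e16], [1.0]],
}

dependent = {name: layout_dependent_sum(c) for name, c in layouts.items()}
fixed = {name: fixed_order_sum(c) for name, c in layouts.items()}

print("layout-dependent:", dependent)  # values differ between layouts
print("fixed-order:     ", fixed)      # one value for every layout
```

The fixed-order result is not more accurate than the others; it is only the same for every layout, which is the property this issue asks for.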

## Out of scope

- Re-implementing each op (each op issue owns its own backward kernel and fix; this issue owns the cross-cutting requirement, reusable gradient check, and status matrix).
- Optimizer / training-loop changes; multi-GPU gradient synchronization (WS2).
- FP8 backward.

## Acceptance criteria

- The #108 harness has a reusable gradient-invariance assertion any op-issue can call.
- Each WS1 op with a backward pass passes the gradient-invariance check across the full sweep (bitwise or within the #108 tolerance).
- No op uses `atomicAdd` or shape-dependent accumulation in its backward path.
- A short matrix records, per op, both forward-pass and backward-pass invariance status; gradient-drift reports include max abs diff, relative diff, tensor name, and first failing op.
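
One possible shape for the reusable assertion and the drift report, sketched in plain Python. The #108 harness API is not shown in this issue, so everything here is an assumption: the names `GradDrift`, `compare_gradients`, and `assert_gradient_invariant`, the use of flat lists in place of tensors, and the single `atol` parameter standing in for the #108 tolerance policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class GradDrift:
    op: str
    tensor: str
    max_abs_diff: float
    max_rel_diff: float


def compare_gradients(
    reference: Dict[str, Sequence[float]],
    candidate: Dict[str, Sequence[float]],
    ops: Sequence[Tuple[str, str]],
    atol: float = 0.0,
    eps: float = 1e-12,
) -> List[GradDrift]:
    """Compare gradients from two batch configs.

    `ops` lists (op name, tensor name) in the order the caller wants failures
    reported. atol=0.0 demands exact value equality; this sketch does not
    distinguish 0.0 from -0.0 and does not handle NaN.
    """
    failures: List[GradDrift] = []
    for op, tensor in ops:
        ref, cand = reference[tensor], candidate[tensor]
        if len(ref) != len(cand):
            raise ValueError(f"{tensor}: length {len(ref)} != {len(cand)}")
        max_abs = max((abs(r - c) for r, c in zip(ref, cand)), default=0.0)
        max_rel = max(
            (abs(r - c) / max(abs(r), eps) for r, c in zip(ref, cand)),
            default=0.0,
        )
        if max_abs > atol:
            failures.append(GradDrift(op, tensor, max_abs, max_rel))
    return failures


def assert_gradient_invariant(reference, candidate, ops, atol: float = 0.0) -> None:
    """Raise AssertionError naming the first failing op and its diffs."""
    failures = compare_gradients(reference, candidate, ops, atol=atol)
    if failures:
        first = failures[0]
        raise AssertionError(
            f"gradient drift: first failing op={first.op} tensor={first.tensor} "
            f"max_abs={first.max_abs_diff:.3e} max_rel={first.max_rel_diff:.3e} "
            f"({len(failures)} tensor(s) differ)"
        )


# Illustrative gradients for batch size 1 vs. N (made-up numbers).
ops = [("rmsnorm", "rmsnorm.dweight"), ("gemm", "gemm.dW")]
grads_b1 = {"rmsnorm.dweight": [0.5, -0.25], "gemm.dW": [1.0, 2.0]}
grads_bn = {"rmsnorm.dweight": [0.5, -0.25], "gemm.dW": [1.0, 2.0 + 2.0 ** -20]}

report = compare_gradients(grads_b1, grads_bn, ops)
for drift in report:
    print(drift)
```

An op issue would call `assert_gradient_invariant` once per pair of batch configs in the sweep; passing a nonzero `atol` is how a tolerance-based policy could be plugged in.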

## Notes

- Depends on #108; cross-cuts every op issue (RMSNorm, matmul, attention, embedding / LM head, etc.).
- Best owned as a shared concern: this issue provides the gradient check and status matrix; each op issue is responsible for fixing its own backward path and passing the check.

## Planned PRs

- [ ] Add a reusable gradient-invariance assertion to the #108 harness
- [ ] Per-op backward test requirements (RMSNorm `dweight`, GEMM `dW`/`dX`, attention backward first)
- [ ] Gradient-diff reporting utility (max abs / relative diff, tensor name, first failing op)
- [ ] Enforce no-`atomicAdd` / fixed-order accumulation in backward paths
- [ ] Per-op forward+backward invariance status matrix; wire full-chain backward into CI
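
For the no-`atomicAdd` enforcement item, one cheap option is a textual scan of backward kernel sources in CI. The sketch below is an assumption about how that could look, not an agreed design: `find_atomic_adds` and the sample kernel string are hypothetical, and the scan only understands `//` line comments.

```python
import re
from typing import List, Tuple

# Matches a call such as `atomicAdd(` or `atomicAdd (`.
ATOMIC_ADD = re.compile(r"\batomicAdd\s*\(")


def find_atomic_adds(source: str) -> List[Tuple[int, str]]:
    """Return (line number, stripped line) for each atomicAdd call in `source`.

    Text after `//` is ignored; block comments and macros are not handled.
    """
    hits: List[Tuple[int, str]] = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        code = line.split("//", 1)[0]
        if ATOMIC_ADD.search(code):
            hits.append((line_no, line.strip()))
    return hits


# Made-up backward kernel used only to exercise the scan.
kernel_src = """\
__global__ void rmsnorm_bwd(float* dweight, const float* partial, int n) {
    // atomicAdd(dweight, partial[0]);
    atomicAdd(&dweight[threadIdx.x], partial[threadIdx.x]);
}
"""

hits = find_atomic_adds(kernel_src)
for line_no, text in hits:
    print(f"line {line_no}: {text}")
```

A scan like this only catches literal `atomicAdd` calls. Shape-dependent accumulation order still has to be caught by the gradient-invariance check itself.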
