# ztensor Development Log

## 2026-04-09: Issue #79 not reproducible at ztensor primitive level

**Type:** investigation
**Tags:** gpu, issue-79, patchtst, dgx-gb10

**Problem:** zerfoo PatchTST GPU training freezes at deterministic loss
0.268357 on DGX GB10. Issue #79 hypothesized the fault lies in ztensor's
GPU engine dst-output routing (`makeGPUResult` / `SetStorage` /
`GPUStorage.Slice()`). Four hypotheses (alpha/beta/gamma/delta) were
logged in the issue.

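For context, "dst routing" here means the engine writes results into
caller-provided storage rather than allocating its own. Below is a
minimal sketch of that contract with hypothetical `Tensor`/`Engine`
names; this is illustrative, not ztensor's confirmed API:

```go
// Hypothetical sketch of a dst-routed engine contract; names and
// signatures are assumptions, not ztensor's actual API.
package compute

type Tensor interface {
	// Data reads back the tensor's backing buffer.
	Data() []float32
}

type Engine interface {
	// MatMul computes a @ b and must write the result into dst's
	// existing buffer. A routing bug (e.g. the result landing in a
	// freshly allocated buffer instead) would leave dst's readback
	// unchanged -- the zero-readback signature described in #79.
	MatMul(a, b, dst Tensor) error
}
```
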
**Investigation:** Added `TestGPUEngine_PatchTSTBackward_DstRoundTrip`
(compute/gpu_dst_roundtrip_test.go) porting the exact op sequence from
`zerfoo/timeseries/patchtst_gpu_train.go:1022-1031`:
Transpose -> Zero -> MatMul(patchesT, dX, dPEW) -> in-place Add
accumulating into a pre-seeded gradW -> `gradW.Data()` readback. Ran on
DGX GB10 via Spark pod `ztensor-issue79-repro-1775759440` (manifest at
`docs/bench/manifests/issue-79-repro.yaml`, commit 3e538e6 of
`fix/issue-79-matmul-accumulate-repro`).

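As a cross-check on the expected math, here is a pure-CPU oracle for that
op chain. It deliberately avoids the ztensor GPU API, the input values
are made up, and the shapes match the small regime noted under root
cause (totalRows=4, patchLen=3, dModel=2); the Zero step is implicit
because the oracle's matMul allocates a fresh zeroed output:

```go
package compute_test

import (
	"reflect"
	"testing"
)

// transpose returns m^T for a row-major matrix.
func transpose(m [][]float64) [][]float64 {
	out := make([][]float64, len(m[0]))
	for j := range out {
		out[j] = make([]float64, len(m))
		for i := range m {
			out[j][i] = m[i][j]
		}
	}
	return out
}

// matMul returns a @ b into a freshly zeroed output (the chain's Zero step).
func matMul(a, b [][]float64) [][]float64 {
	out := make([][]float64, len(a))
	for i := range a {
		out[i] = make([]float64, len(b[0]))
		for k := range b {
			for j := range b[0] {
				out[i][j] += a[i][k] * b[k][j]
			}
		}
	}
	return out
}

func TestPatchTSTBackward_CPUOracle(t *testing.T) {
	patches := [][]float64{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {10, 11, 12}} // (totalRows=4, patchLen=3)
	dX := [][]float64{{1, 0}, {0, 1}, {1, 1}, {2, 2}}                     // (totalRows=4, dModel=2)
	gradW := [][]float64{{0.5, 0.5}, {0.5, 0.5}, {0.5, 0.5}}              // pre-seeded dst (patchLen, dModel)

	// Transpose -> Zero -> MatMul -> in-place Add accumulate into gradW.
	dPEW := matMul(transpose(patches), dX)
	for i := range gradW {
		for j := range gradW[i] {
			gradW[i][j] += dPEW[i][j]
		}
	}

	// The GPU path's gradW.Data() readback must match this; all-zeros
	// here is the failure signature the round-trip test looks for.
	want := [][]float64{{28.5, 31.5}, {32.5, 35.5}, {36.5, 39.5}}
	if !reflect.DeepEqual(gradW, want) {
		t.Fatalf("gradW = %v, want %v", gradW, want)
	}
}
```
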
Full test suite on DGX:
```
TestGPUEngine_Add_DstRoundTrip_OutOfPlace        PASS
TestGPUEngine_Add_DstRoundTrip_InPlace           PASS
TestGPUEngine_Add_DstRoundTrip_RepeatedInPlace   PASS
TestGPUEngine_Add_DstRoundTrip_NoExplicitSync    PASS
TestGPUEngine_PatchTSTBackward_DstRoundTrip      PASS
```

**Root cause:** Not in ztensor primitives. The
`Transpose -> Zero -> MatMul -> in-place Add` chain with a pre-seeded
CPU-wrapper dst does NOT reproduce zero readback on small shapes
(totalRows=4, patchLen=3, dModel=2). None of the four hypotheses from
the issue body is triggered at this level.

**Fix:** N/A. The investigation narrows the search to factors the
ztensor test does not exercise:
1. Shape regime -- production PatchTST uses thousands of rows and dModel
   in the hundreds; the bug may only manifest under larger allocations or
   specific arena pressure (a sweep skeleton is sketched after this
   list).
2. Interaction with `encoderBackward` and multi-op state carried across
   the full batch, not just the patch-embedding backward slice.
3. The CPU-loop posEmb update at `patchtst_gpu_train.go:1012-1019`
   interleaved with GPU ops on the same stream.
4. zerfoo-side gradTs wrapper rebuild logic affecting how `.Data()`
   resolves after many accumulations.

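A shape-sweep skeleton along the lines of factor 1 might look like the
following; the mid and large shape triples are placeholders rather than
the real production configuration, and the engine wiring is elided (the
subtests skip) because this log does not pin down ztensor's API:

```go
package compute_test

import (
	"fmt"
	"testing"
)

func TestGPUEngine_PatchTSTBackward_ShapeSweep(t *testing.T) {
	cases := []struct{ totalRows, patchLen, dModel int }{
		{4, 3, 2},       // small regime: known to pass on DGX GB10
		{1024, 16, 128}, // placeholder mid regime
		{8192, 16, 512}, // placeholder production-like regime
	}
	for _, c := range cases {
		c := c
		t.Run(fmt.Sprintf("rows%d_patch%d_dmodel%d", c.totalRows, c.patchLen, c.dModel), func(t *testing.T) {
			// Allocate patches (totalRows, patchLen), dX (totalRows, dModel),
			// and a pre-seeded gradW (patchLen, dModel); run the same
			// Transpose -> Zero -> MatMul -> in-place Add chain and assert
			// gradW.Data() is non-zero after readback.
			t.Skip("engine wiring elided in this sketch")
		})
	}
}
```
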
**Impact:** Rules out ztensor engine primitive routing as the direct
cause of the frozen-loss signature. Next debugging must happen
zerfoo-side, either with a large-shape reproducer that more closely
matches the real training configuration or by instrumenting
`trainWindowedGPU` itself rather than trying to lift primitives into
ztensor tests.

## 2026-03-29 -- v1.0.0 Benchmark Baseline

Pre-v1 benchmark baseline recorded on Apple M4 (darwin/arm64, 10 cores).