Problem
makeGPUResult (gpu_kernels.go:121) allocates fresh GPU memory via pool.Alloc on every call, even when a pre-sized dst tensor is provided. It then calls dst.SetStorage(newGPUStorage), orphaning dst's previous GPUStorage. The old storage is freed only when the Go GC runs its finalizer.
This defeats the purpose of dst-param variants: callers that pre-allocate buffers and pass them as dst expect zero per-call allocation. Instead, they get the same allocation rate PLUS a deferred-free dependency on GC finalization.
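The pattern can be sketched with simplified stand-in types (GPUStorage, gpuPool, Tensor, and the alloc counter here are all hypothetical stand-ins, not the real zerfoo types):

```go
package main

import "fmt"

// Simplified stand-ins for the real types; fields are illustrative only.
type GPUStorage struct {
	n int
}

func (s *GPUStorage) Len() int { return s.n }

type gpuPool struct{ allocs int }

func (p *gpuPool) Alloc(n int) *GPUStorage {
	p.allocs++
	return &GPUStorage{n: n}
}

type Tensor struct{ storage *GPUStorage }

func (t *Tensor) SetStorage(s *GPUStorage) { t.storage = s }
func (t *Tensor) GetStorage() *GPUStorage  { return t.storage }

// makeGPUResult as described in the issue: it always allocates, even when
// dst already carries a big-enough GPUStorage, so each call orphans the
// previous buffer, which waits on a GC finalizer to be freed.
func makeGPUResult(pool *gpuPool, dst *Tensor, numElems int) *Tensor {
	fresh := pool.Alloc(numElems)
	if dst == nil {
		dst = &Tensor{}
	}
	dst.SetStorage(fresh) // old storage is now unreachable, not yet freed
	return dst
}

func main() {
	pool := &gpuPool{}
	dst := &Tensor{storage: pool.Alloc(1024)} // caller pre-allocates
	for i := 0; i < 3; i++ {
		makeGPUResult(pool, dst, 1024) // each call still allocates
	}
	fmt.Println(pool.allocs) // 4: one pre-alloc plus one per call
}
```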
Impact
At production shapes (20K+ samples, 20 channels), PatchTST GPU training performs ~20 GPU ops per batch × 300+ batches per epoch. Each op allocates and orphans a GPUStorage, creating ~6000 pending finalizer calls per epoch. The GC cannot keep pace, causing unbounded GPU memory growth → OOM or severe memory-pressure slowdown.
Bisected in zerfoo/zerfoo#373: commit 09a318c6 (E85 buffer pre-allocation) regressed 20K×20×5 training from ~60s to >300s by converting local-var results to persistent dst-param struct fields. The parent commit (using local vars that go out of scope quickly) is unaffected.
Expected behavior
When dst is provided and its existing storage is a GPUStorage with Len() >= numElems:
Compute the kernel output directly into dst.GetStorage().Ptr() (no pool.Alloc)
Update dst's shape/strides if needed
Return dst
Only allocate when dst is nil, has no GPU storage, or is undersized.
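Under those rules, the reuse branch might look like the following sketch (types, the shape handling, and the numElems computation are simplified assumptions; the kernel launch itself is omitted):

```go
package main

import (
	"fmt"
	"unsafe"
)

// Simplified stand-ins; the real zerfoo types and signatures differ.
type GPUStorage struct{ buf []float32 } // real code holds a device pointer

func (s *GPUStorage) Len() int            { return len(s.buf) }
func (s *GPUStorage) Ptr() unsafe.Pointer { return unsafe.Pointer(&s.buf[0]) }

type gpuPool struct{ allocs int }

func (p *gpuPool) Alloc(n int) *GPUStorage {
	p.allocs++
	return &GPUStorage{buf: make([]float32, n)}
}

type Tensor struct {
	storage *GPUStorage
	shape   []int
}

func (t *Tensor) GetStorage() *GPUStorage  { return t.storage }
func (t *Tensor) SetStorage(s *GPUStorage) { t.storage = s }

// makeGPUResult with the reuse check: allocate only when dst is nil,
// has no GPU storage, or is undersized.
func makeGPUResult(pool *gpuPool, dst *Tensor, shape []int) *Tensor {
	numElems := 1
	for _, d := range shape {
		numElems *= d
	}
	if dst != nil {
		if s := dst.GetStorage(); s != nil && s.Len() >= numElems {
			dst.shape = shape // update metadata; keep the existing buffer
			return dst        // kernel writes into s.Ptr(); no pool.Alloc
		}
	}
	if dst == nil {
		dst = &Tensor{}
	}
	dst.SetStorage(pool.Alloc(numElems))
	dst.shape = shape
	return dst
}

func main() {
	pool := &gpuPool{}
	dst := makeGPUResult(pool, nil, []int{32, 32}) // first call allocates
	for i := 0; i < 1000; i++ {
		makeGPUResult(pool, dst, []int{32, 32}) // steady state: zero allocs
	}
	fmt.Println(pool.allocs) // 1
}
```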
Affected code paths
makeGPUResult (gpu_kernels.go:121) — central; all GPU ops that go through it
makeGPUResultView (gpu_kernels.go:147) — similar pattern with scratchpad views
Suggested approach
Add a reuseGPUDst helper. In each GPU op, before pool.Alloc, check reuseGPUDst; if the destination is reusable, pass its existing pointer to the kernel and skip the alloc.
Update makeGPUResult to detect when the output pointer matches dst's existing pointer (skip SetStorage in that case).
Refs
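One way to sketch that helper and a call site (the name reuseGPUDst comes from this issue, but the signature, the stub kernel, and gpuAdd are illustrative assumptions, not the real API):

```go
package main

import (
	"fmt"
	"unsafe"
)

// Simplified stand-ins for the real zerfoo types.
type GPUStorage struct{ buf []float32 }

func (s *GPUStorage) Len() int            { return len(s.buf) }
func (s *GPUStorage) Ptr() unsafe.Pointer { return unsafe.Pointer(&s.buf[0]) }

type gpuPool struct{ allocs int }

func (p *gpuPool) Alloc(n int) *GPUStorage {
	p.allocs++
	return &GPUStorage{buf: make([]float32, n)}
}

type Tensor struct{ storage *GPUStorage }

func (t *Tensor) GetStorage() *GPUStorage  { return t.storage }
func (t *Tensor) SetStorage(s *GPUStorage) { t.storage = s }

// reuseGPUDst reports whether dst already carries a GPUStorage large
// enough for numElems, returning its device pointer when it does.
func reuseGPUDst(dst *Tensor, numElems int) (unsafe.Pointer, bool) {
	if dst == nil || dst.GetStorage() == nil {
		return nil, false
	}
	if s := dst.GetStorage(); s.Len() >= numElems {
		return s.Ptr(), true
	}
	return nil, false
}

// launchAddKernel is a CPU stand-in for the real GPU kernel launch.
func launchAddKernel(out, a, b unsafe.Pointer, n int) {
	o := unsafe.Slice((*float32)(out), n)
	x := unsafe.Slice((*float32)(a), n)
	y := unsafe.Slice((*float32)(b), n)
	for i := range o {
		o[i] = x[i] + y[i]
	}
}

// gpuAdd shows the call-site pattern: check reuseGPUDst before pool.Alloc,
// and skip the allocation entirely on reuse.
func gpuAdd(pool *gpuPool, a, b, dst *Tensor) *Tensor {
	n := a.GetStorage().Len()
	out, ok := reuseGPUDst(dst, n)
	if !ok {
		s := pool.Alloc(n)
		out = s.Ptr()
		if dst == nil {
			dst = &Tensor{}
		}
		dst.SetStorage(s)
	}
	launchAddKernel(out, a.GetStorage().Ptr(), b.GetStorage().Ptr(), n)
	return dst
}

func main() {
	pool := &gpuPool{}
	a := &Tensor{storage: &GPUStorage{buf: []float32{1, 2}}}
	b := &Tensor{storage: &GPUStorage{buf: []float32{3, 4}}}
	dst := gpuAdd(pool, a, b, nil) // first call allocates
	dst = gpuAdd(pool, a, b, dst)  // second call reuses dst's buffer
	fmt.Println(pool.allocs, dst.GetStorage().buf) // 1 [4 6]
}
```

Since the kernel already wrote into dst's existing buffer, makeGPUResult can compare the output pointer against dst.GetStorage().Ptr() and skip SetStorage when they match, which is exactly the detection step described above.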