GPU ops should reuse dst device memory instead of allocating per call #84

@dndungu

Description

Problem

makeGPUResult (gpu_kernels.go:121) allocates fresh GPU memory via pool.Alloc on every call, even when a pre-sized dst tensor is provided. It then calls dst.SetStorage(newGPUStorage), orphaning dst's previous GPUStorage. The old storage is freed only when the Go GC runs its finalizer.

This defeats the purpose of the dst-param variants: callers that pre-allocate buffers and pass them as dst expect zero per-call allocation. Instead they get the same per-call allocation rate plus a free that is deferred until GC finalization.
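To make the orphaning concrete, here is a minimal host-side sketch of the two patterns. The types (mockPool, mockTensor, deviceStorage) are stand-ins for illustration, not the real tensor API; the real pool frees orphaned storage via a GC finalizer rather than an explicit free.

```go
package main

// deviceStorage stands in for a GPU buffer; elems is its capacity.
type deviceStorage struct{ elems int }

// mockPool counts device buffers that have been allocated but not yet freed.
type mockPool struct{ live int }

func (p *mockPool) alloc(elems int) *deviceStorage { p.live++; return &deviceStorage{elems: elems} }
func (p *mockPool) free(s *deviceStorage)          { p.live-- }

type mockTensor struct{ storage *deviceStorage }

// currentPattern mirrors makeGPUResult today: always alloc, then swap storage,
// leaving the old buffer orphaned until a finalizer eventually frees it.
func currentPattern(p *mockPool, dst *mockTensor, elems int) {
	fresh := p.alloc(elems)
	dst.storage = fresh // previous storage is now orphaned
}

// reusePattern writes into dst's existing storage when it is large enough.
func reusePattern(p *mockPool, dst *mockTensor, elems int) {
	if dst.storage != nil && dst.storage.elems >= elems {
		return // kernel writes into the existing buffer; no alloc
	}
	dst.storage = p.alloc(elems)
}
```

Running ten "ops" against a pre-sized dst leaves eleven live buffers under the current pattern but only one under the reuse pattern, which is exactly the growth the Impact section describes.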

Impact

At production shapes (20K+ samples, 20 channels), PatchTST GPU training performs ~20 GPU ops per batch × 300+ batches per epoch. Each op allocates and orphans a GPUStorage, creating ~6000 pending finalizer calls per epoch. The GC cannot keep pace, so GPU memory grows without bound until OOM or a severe memory-pressure slowdown.

Bisected in zerfoo/zerfoo#373: commit 09a318c6 (E85 buffer pre-allocation) regressed 20K×20×5 training from ~60s to >300s by converting local-var results to persistent dst-param struct fields. The parent commit (using local vars that go out of scope quickly) is unaffected.

Expected behavior

When dst is provided and its existing storage is a GPUStorage with Len() >= numElems:

  1. Compute the kernel output directly into dst.GetStorage().Ptr() (no pool.Alloc)
  2. Update dst's shape/strides if needed
  3. Return dst

Only allocate when dst is nil, has no GPU storage, or is undersized.
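For step 2, the shape/stride update on a reused dst can be as simple as recomputing contiguous row-major strides for the output shape. This is a sketch under that assumption (contiguousStrides and numElems are hypothetical helpers, not existing functions in the codebase); numElems is also the quantity compared against Len() in the reuse check.

```go
package main

// contiguousStrides returns row-major strides (in elements) for shape,
// e.g. shape [2, 3, 4] -> strides [12, 4, 1].
func contiguousStrides(shape []int) []int {
	strides := make([]int, len(shape))
	acc := 1
	for i := len(shape) - 1; i >= 0; i-- {
		strides[i] = acc
		acc *= shape[i]
	}
	return strides
}

// numElems is the product of the shape's dimensions; the reuse path
// requires dst.GetStorage().Len() >= numElems(outShape).
func numElems(shape []int) int {
	n := 1
	for _, d := range shape {
		n *= d
	}
	return n
}
```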

Affected code paths

  • makeGPUResult (gpu_kernels.go:121) — central; all GPU ops that go through it
  • makeGPUResultView (gpu_kernels.go:147) — similar pattern with scratchpad views
  • Individual ops that allocate before calling makeGPUResult: Transpose, MatMul, Add, Sub, Mul, Sum, Reshape (partially fixed in #82, "GPUEngine.Reshape honors dst argument"), etc.

Suggested approach

  1. Add a reuseGPUDst helper:

```go
func reuseGPUDst[T tensor.Numeric](dst *tensor.TensorNumeric[T], neededElems int) (unsafe.Pointer, bool) {
	if dst == nil {
		return nil, false
	}
	gs, ok := dst.GetStorage().(*tensor.GPUStorage[T])
	if !ok || gs.Len() < neededElems {
		return nil, false
	}
	return gs.Ptr(), true // reuse existing device memory
}
```

  2. In each GPU op, check reuseGPUDst before calling pool.Alloc. If the buffer is reusable, pass the existing pointer to the kernel and skip the allocation.

  3. Update makeGPUResult to detect when the output pointer matches dst's existing pointer and skip SetStorage in that case.
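Putting the steps together, an op's prologue might look like the sketch below. All types here (gpuStorage, gpuTensor, poolAlloc) are mocks for illustration; the real reuseGPUDst operates on tensor.GPUStorage and unsafe device pointers, but the control flow is the same: try to reuse, otherwise alloc and swap storage.

```go
package main

import "unsafe"

// gpuStorage mocks a device buffer with a stable base pointer.
type gpuStorage struct {
	buf []byte // stands in for device memory
}

func (g *gpuStorage) ptr() unsafe.Pointer { return unsafe.Pointer(&g.buf[0]) }
func (g *gpuStorage) size() int           { return len(g.buf) }

type gpuTensor struct{ storage *gpuStorage }

var allocs int // counts pool allocations, to verify the reuse path skips them

func poolAlloc(n int) *gpuStorage { allocs++; return &gpuStorage{buf: make([]byte, n)} }

// opPrologue mirrors the suggested flow: reuse dst's buffer when it is big
// enough, otherwise allocate and install fresh storage (the only path that
// would call SetStorage in the real makeGPUResult).
func opPrologue(dst *gpuTensor, needed int) (out unsafe.Pointer, reused bool) {
	if dst != nil && dst.storage != nil && dst.storage.size() >= needed {
		return dst.storage.ptr(), true // kernel writes in place; no alloc
	}
	s := poolAlloc(needed)
	if dst != nil {
		dst.storage = s
	}
	return s.ptr(), false
}
```

A pre-sized dst goes through the prologue without touching the pool, and the returned pointer is dst's own, which is the pointer-match condition makeGPUResult can use to skip SetStorage.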
