GPU ops should reuse dst device memory instead of allocating per call #84

@dndungu

Description

Problem

makeGPUResult (gpu_kernels.go:121) allocates fresh GPU memory via pool.Alloc on every call, even when a pre-sized dst tensor is provided. It then calls dst.SetStorage(newGPUStorage), orphaning dst's previous GPUStorage. The old storage is freed only when the Go GC runs its finalizer.

This defeats the purpose of the dst-param variants: callers that pre-allocate buffers and pass them as dst expect zero per-call allocation. Instead they get the same per-call allocation rate plus a free that is deferred until GC finalization.
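To make the orphaning concrete, here is a minimal host-side sketch of the two patterns. The types (mockPool, mockTensor, deviceStorage) are stand-ins for illustration, not the real tensor API; the real pool frees orphaned storage via a GC finalizer rather than an explicit free.

```go
package main

// deviceStorage stands in for a GPU buffer; elems is its capacity.
type deviceStorage struct{ elems int }

// mockPool counts device buffers that have been allocated but not yet freed.
type mockPool struct{ live int }

func (p *mockPool) alloc(elems int) *deviceStorage { p.live++; return &deviceStorage{elems: elems} }
func (p *mockPool) free(s *deviceStorage)          { p.live-- }

type mockTensor struct{ storage *deviceStorage }

// currentPattern mirrors makeGPUResult today: always alloc, then swap storage,
// leaving the old buffer orphaned until a finalizer eventually frees it.
func currentPattern(p *mockPool, dst *mockTensor, elems int) {
	fresh := p.alloc(elems)
	dst.storage = fresh // previous storage is now orphaned
}

// reusePattern writes into dst's existing storage when it is large enough.
func reusePattern(p *mockPool, dst *mockTensor, elems int) {
	if dst.storage != nil && dst.storage.elems >= elems {
		return // kernel writes into the existing buffer; no alloc
	}
	dst.storage = p.alloc(elems)
}
```

Running ten "ops" against a pre-sized dst leaves eleven live buffers under the current pattern but only one under the reuse pattern, which is exactly the growth the Impact section describes.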

Impact

At production shapes (20K+ samples, 20 channels), PatchTST GPU training performs ~20 GPU ops per batch × 300+ batches per epoch. Each op allocates and orphans a GPUStorage, creating ~6000 pending finalizer calls per epoch. The GC cannot keep pace, so GPU memory grows without bound until OOM or a severe memory-pressure slowdown.

Bisected in zerfoo/zerfoo#373: commit 09a318c6 (E85 buffer pre-allocation) regressed 20K×20×5 training from ~60s to >300s by converting local-var results to persistent dst-param struct fields. The parent commit (using local vars that go out of scope quickly) is unaffected.

Expected behavior

When dst is provided and its existing storage is a GPUStorage with Len() >= numElems:

  1. Compute the kernel output directly into dst.GetStorage().Ptr() (no pool.Alloc)
  2. Update dst's shape/strides if needed
  3. Return dst

Only allocate when dst is nil, has no GPU storage, or is undersized.
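For step 2, the shape/stride update on a reused dst can be as simple as recomputing contiguous row-major strides for the output shape. This is a sketch under that assumption (contiguousStrides and numElems are hypothetical helpers, not existing functions in the codebase); numElems is also the quantity compared against Len() in the reuse check.

```go
package main

// contiguousStrides returns row-major strides (in elements) for shape,
// e.g. shape [2, 3, 4] -> strides [12, 4, 1].
func contiguousStrides(shape []int) []int {
	strides := make([]int, len(shape))
	acc := 1
	for i := len(shape) - 1; i >= 0; i-- {
		strides[i] = acc
		acc *= shape[i]
	}
	return strides
}

// numElems is the product of the shape's dimensions; the reuse path
// requires dst.GetStorage().Len() >= numElems(outShape).
func numElems(shape []int) int {
	n := 1
	for _, d := range shape {
		n *= d
	}
	return n
}
```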

Affected code paths

  • makeGPUResult (gpu_kernels.go:121) — central; all GPU ops that go through it
  • makeGPUResultView (gpu_kernels.go:147) — similar pattern with scratchpad views
  • Individual ops that allocate before calling makeGPUResult: Transpose, MatMul, Add, Sub, Mul, Sum, Reshape (partially fixed in #82, "GPUEngine.Reshape honors dst argument"), etc.

Suggested approach

  1. Add a reuseGPUDst helper:

```go
func reuseGPUDst[T tensor.Numeric](dst *tensor.TensorNumeric[T], neededElems int) (unsafe.Pointer, bool) {
	if dst == nil {
		return nil, false
	}
	gs, ok := dst.GetStorage().(*tensor.GPUStorage[T])
	if !ok || gs.Len() < neededElems {
		return nil, false
	}
	return gs.Ptr(), true // reuse existing device memory
}
```

  2. In each GPU op, check reuseGPUDst before calling pool.Alloc. If the buffer is reusable, pass the existing pointer to the kernel and skip the allocation.

  3. Update makeGPUResult to detect when the output pointer matches dst's existing pointer and skip SetStorage in that case.
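Putting the steps together, an op's prologue might look like the sketch below. All types here (gpuStorage, gpuTensor, poolAlloc) are mocks for illustration; the real reuseGPUDst operates on tensor.GPUStorage and unsafe device pointers, but the control flow is the same: try to reuse, otherwise alloc and swap storage.

```go
package main

import "unsafe"

// gpuStorage mocks a device buffer with a stable base pointer.
type gpuStorage struct {
	buf []byte // stands in for device memory
}

func (g *gpuStorage) ptr() unsafe.Pointer { return unsafe.Pointer(&g.buf[0]) }
func (g *gpuStorage) size() int           { return len(g.buf) }

type gpuTensor struct{ storage *gpuStorage }

var allocs int // counts pool allocations, to verify the reuse path skips them

func poolAlloc(n int) *gpuStorage { allocs++; return &gpuStorage{buf: make([]byte, n)} }

// opPrologue mirrors the suggested flow: reuse dst's buffer when it is big
// enough, otherwise allocate and install fresh storage (the only path that
// would call SetStorage in the real makeGPUResult).
func opPrologue(dst *gpuTensor, needed int) (out unsafe.Pointer, reused bool) {
	if dst != nil && dst.storage != nil && dst.storage.size() >= needed {
		return dst.storage.ptr(), true // kernel writes in place; no alloc
	}
	s := poolAlloc(needed)
	if dst != nil {
		dst.storage = s
	}
	return s.ptr(), false
}
```

A pre-sized dst goes through the prologue without touching the pool, and the returned pointer is dst's own, which is the pointer-match condition makeGPUResult can use to skip SetStorage.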
