1.6.0 (2026-04-17)
- compute: T1.2 add ensureNotCapturing guard and ErrCaptureIncompatibleAllocation (18e1f5a)
- compute: T2.1a add WithCapture helper for capture-aware graph lifecycle (d60c902)
- compute: T2.2 capture-aware allocWeight routing via cudaMallocAsync (2a723b7)
- compute: T2.3 pre-allocate workspace buffers at UploadWeights to avoid capture-time alloc (9f9eb5c)
- cuda: T1.1 add StreamCaptureStatus purego binding (879cbc9)
- graph: add LMHead to nonCapturableOps (07ba531)
- graph: T4.1 add capture watchdog with 30s timeout and status sampling (b3066a5)
- graph: T99.1.2 mark Gemma4PLECombinedProducer non-capturable (6c855a9)
- graph: T98.2.3 don't pool-release pass-through node inputs (6ecf8db)
1.5.0 (2026-04-10)
- compute: add AllocDeviceFloat32 and CopyToDevice to FusedEncoderProvider (8d6c90b)
- compute: add fused PatchTST encoder layer CUDA kernels (4dfd46e)
- compute: GPUEngine.Reshape honors dst argument (18a53fe)
- compute: reuse dst GPU memory instead of allocating per call (#84) (26bbd49)
- kernels: rename kernel_add in fused_encoder_bwd to avoid symbol clash (716bbd6)
1.4.0 (2026-04-06)
- graph: add NewPJRTClient for external PJRT usage (c8db036)
- graph: add PJRTPlan execution wrapper with KV cache state management (3e5cb40)
- ci: exclude metal and pjrt from go vet (5a7fdc3)
- kernels: update GemvQ5_0F32 test to match qhOffset/qsOffset signature (70f8fd5)
1.3.0 (2026-04-03)
- graph: add CompilePJRT for PJRT backend compilation (dfd77a4)
- pjrt: add buffer management (host-device transfer, readback, lifecycle) (9b5dc75)
- pjrt: add KV cache I/O rewriting and executable cache (c8decc5)
- pjrt: add PJRT C API purego bindings for plugin loading, client, and device (c675807)
- pjrt: add program execution, serialization, and full StableHLO emitter (382ea0a)
- pjrt: add StableHLO program compilation wrapper (7fcdde7)
- stablehlo: add emitter for element-wise and unary ops (499cef2)
- stablehlo: add emitter for MatMul and structural ops (13d87df)
- stablehlo: add emitter for reductions and Softmax decomposition (c07b287)
- stablehlo: add MLIR type system and SSA naming (7c68d1e)
- stablehlo: add shape inference for arithmetic ops (cac094e)
- stablehlo: add shape inference for structural ops (8bf132c)
- pjrt: centralize internal/cuda import in pjrt.go (aa8c170)
- pjrt: remove duplicate ccall/goStringN declarations (3e5fba9)
1.2.0 (2026-04-01)
- cuda: add Q6_K, Q5_K, Q5_0 GPU dequant kernels for M>1 prefill (d57e37e)
- cuda: add Q8 Gather kernel for GPU embedding lookup (30eb9c4)
- tensor: add QuantizeQ4K for float32 to Q4_K quantization (d0d3a82)
- compute: add Q4KStorage to UploadWeights F32 skip list (cc071b6)
- compute: CPU dequant fallback for Q4_K when K%256!=0 (f50ffa7)
- compute: use dequant+cuBLAS for Q4_K when K%256!=0 (5f21cbb)
- compute: use pool-backed GPUStorage for pool allocations (4367330)
- cuda: byte-wise loads in Q5_0 GEMV for ARM64 alignment (5f19e54)
- kernels: check null function pointer in FusedSoftmaxVMulF32 (935ad61)
- cuda: separated GPU layout for Q5_0 GEMV (d456c39)
1.1.3 (2026-04-01)
- compute: add Q5_0Storage B-weight handling to CPU MatMul (e7927e5)
- compute: Q5_0 GEMV byte-wise loads for ARM64 alignment (5c7ec7a)
- compute: skip Q4Storage in UploadWeights F32 loop (revert overaggressive skip) (2e91650)
- compute: skip transpose reshape fast-path for square matrices (eab19d0)
1.1.2 (2026-03-31)
- compute: upload CPU fallback MatMul results to GPU for device consistency (5bc914b)
1.1.1 (2026-03-31)
- cuda: remove float4 alignment requirement from gemv_q8_kernel (1313605)
1.1.0 (2026-03-31)
- compute: add GPUFusedSoftmaxVMul method with provider interface (d659e76)
- compute: add GPURepeatInterleave method with purego bindings (6af7b96)
- compute: add GraphCapturer interface for CUDA graph capture/replay (1f37c69)
- compute: GPU-native Copy using cudaMemcpyAsync D2D (efc8b42)
- compute: wire capture-aware pool into GPUEngine BeginCapture/EndCapture (e39b318)
- cuda: add cudaMallocAsync and cudaFreeAsync bindings (e339656)
- cuda: add cudaMemsetAsync binding and GPU-native Zero (47b5d39)
- cuda: add fused repeat-interleave kernel for GQA head expansion (91e2469)
- cuda: add fused softmax + V multiply kernel for decode attention (ef6f7ce)
- cuda: make MemPool capture-aware with SetCaptureStream (58b6337)
- gpuapi: wire FusedSoftmaxVMulF32 into KernelRunner interface (9afdb01)
- compute: copy mmap bytes to heap in mmapDevicePtr fallback (0ad23b5)
- compute: revert H2D to sync Memcpy (async breaks mmap'd tensors) (9a87e36)
- compute: use async memcpy in getDevicePtr for CUDA graph capture (b36b7ed)
1.0.0 (2026-03-30)
- release 1.0.0 (0230a86)
0.15.0 (2026-03-29)
- tensor: MmapStorage.SliceElements for zero-copy expert weight slicing (0a40e11)
- xblas: streaming GEMM for mmap'd tensors, unblocks over-RAM inference (8d80b91)
0.14.1 (2026-03-28)
- ci: exclude purego GPU binding packages from go vet (60f0f66)
- tensor: add IQ3_S to quant registry expected list (98c9237)
0.14.0 (2026-03-28)
- graph: add NodeOutput method for intermediate activation extraction (76a29c6)
0.13.0 (2026-03-28)
- xblas: add fused Q4_K GEMV kernel — 17x faster than dequant+requant (7ceb267)
0.12.0 (2026-03-28)
- tensor: make TernaryStorage implement Storage[float32] (2c8e9fa)
- kernels: add missing NSA, KV dequant, and IQ dequant fields to KernelLib (bf32aef)
0.11.0 (2026-03-27)
- compute: add CosineSimilarity to Engine[T] (204f07b)
- compute: add GPU dispatch for CosineSimilarity (40588bc)
- compute: add GPU dispatch for ternary GEMV (295f61c)
- compute: add Hadamard matrix generator (b3b3478)
- compute: add HadamardTransform to Engine[T] (5a99614)
- compute: add ReduceMax to Engine[T] (4b9b712)
- compute: add split-KV flash decode kernel with CPU reference (c16817e)
- compute: add TernaryGEMV for ternary weight matrix-vector multiply (8731bd1)
- cuda: add fused NSA three-path attention kernel stub (a024958)
- tensor: add IQ2_XXS dequantization storage (48677a7)
- tensor: add IQ3_S dequantization storage (9eab58b)
- tensor: add IQ4_NL dequantization storage (5205837)
- tensor: add TernaryStorage for 2-bit ternary weights (0f7c5ca)
0.10.1 (2026-03-27)
- tensor: remove MADV_SEQUENTIAL from MmapFile (caused 7x load regression) (8949a19)
0.10.0 (2026-03-27)
- tensor: add madvise hints for mmap'd pages (e26c8d6)
0.9.6 (2026-03-27)
- graph: skip all quantized storage in EnsureSlotsGPU/EnsureCaptureInputsGPU (0b38668)
0.9.5 (2026-03-27)
- compute: skip MmapStorage entirely in UploadWeights (8796fd0)
0.9.4 (2026-03-27)
- compute: copy mmap bytes to heap before cudaMemcpy upload (c2d68e7)
0.9.3 (2026-03-27)
- graph: skip quantized storage in PreUploadFrozenWeights (4b8388c)
0.9.2 (2026-03-27)
- compute: skip F32 MmapStorage in quantized upload path (51ed3e7)
0.9.1 (2026-03-27)
- tensor: delegate K-quant MmapStorage dequant to reference implementations (3ef8261)
0.9.0 (2026-03-27)
- compute: add MmapStorage GPU dispatch for quantized GEMV/GEMM (62f3db1)
0.8.0 (2026-03-27)
- tensor: add Q4_1/Q5_0/Q5_1 support for MmapStorage (8adb879)
0.7.0 (2026-03-27)
- tensor: add MmapStorage type and platform mmap helpers (f8b48bb)
0.6.3 (2026-03-27)
- compute: change Repeat to repeat-each semantics for GQA correctness (d3e6b96)
0.6.2 (2026-03-26)
- compute: prevent FP16 MatMul segfault on aarch64 purego (a6756c5)
0.6.1 (2026-03-26)
- compute: add VRAM bounds check for large MatMul allocations (915816c)
0.6.0 (2026-03-26)
- gguf: add shared GGUF writer package (0709c09)
0.5.1 (2026-03-26)
- cuda: raise shared memory limit for Q4 GEMV with K > 12288 (d654c72)
0.5.0 (2026-03-25)
- tensor: add MergeQ4KStorage and MergeQ6KStorage (764a750)
0.4.1 (2026-03-24)
0.4.0 (2026-03-24)
- add Q5_0 fused dequant-GEMV kernel stack (de5331f)
- add Q5_K fused dequant-GEMV kernel stack (c2ea6f7)
- batched: add batched multi-model inference (5897e29)
- compute: add ComputeAmax and ScaleForFP8 for FP8 quantization (T2.2) (8c866f4)
- compute: add native Q5_K GEMV kernel (b428f17)
- compute: add Q6_K GEMV dispatch and GPU engine integration (0528588)
- compute: dispatch FP8 MatMul to cublasLt FP8 GEMM (T2.3) (c446655)
- compute: FP16 weight upload path + PreUploadFrozenWeights skip (d893b9c)
- compute: implement hardware profiling and detection framework (c0c7ef5)
- compute: wire paged attention into GQA (T1.4) (abeff7a)
- cuda: add FP8 GEMM kernel with cublasLt bindings (T2.1) (7f524bc)
- cuda: add NVFP4 GEMV kernel for Blackwell sm_100+ (T2.5) (63fad59)
- cuda: add paged attention kernel with block-table indirection (T1.3) (e89e01d)
- cuda: add Q6_K fused dequant-GEMV kernel (8fc89db)
- cuda: add ragged batching attention kernel (T1.6) (2748ebc)
- cuda: add selective scan kernel for Mamba/SSM (T6.1) (260160e)
- cuda: implement FlashAttention-2 fused kernel with GQA support (e7000f8)
- cuda: implement warp-specialized GEMV kernel for decode phase (fc46cab)
- cuda: optimize Q4_K GEMV for sm_121 (Blackwell GB10 / DGX Spark) (3e32432)
- fpga: add FPGA runtime abstraction layer via purego (e703a86)
- gpuapi: implement Apple Metal compute shader bindings via purego (d548e22)
- graph: add fast replay path skipping PrepareSlots/EnsureSlotsGPU (e6e2355)
- graph: add gradient checkpointing (T8.9) (3cd5c01)
- graph: add kernel launch batch scheduler (cfd513b)
- graph: add SaveParameters/LoadParametersFromFile and checkpoint serialization (8a930ec), closes #96
- graph: add SlotCount method to ExecutionPlan (b8dc85f)
- graph: cache EmbeddingLookup GPU buffer for fast replay (dc595dd)
- graph: expand CUDA graph capture to 100% instruction coverage (33b54d9)
- kv: add BlockPool for paged attention (T1.1) (e851d47)
- kv: add BlockTable for per-sequence paged KV mapping (T1.2) (be1ff30)
- kv: add RadixTree for KV block prefix caching (T4.1) (0e68dc9)
- metal: port critical CUDA kernels to Metal compute shaders (3051613)
- metrics: add Add(n int64) to CounterMetric interface (64728d8)
- quant: add native Q6_K GEMV direct decode for CPU and CUDA (566136b)
- quant: add W4A16 mixed-precision dispatch (8d2f97a)
- quant: add W8A8 mixed-precision dispatch with INT8 weights/activations and FP32 accumulation (3fe0745)
- sycl: add SYCL runtime bindings via purego (b987c36)
- sycl: port GEMV and attention kernels to SYCL backend (61b0ee8)
- tensor: add AWQ dequantization support (cfbc3d0)
- tensor: add NewFloat16StorageFromRaw constructor for pre-encoded FP16 bytes (d21c355)
- tensor: add NewFloat16StorageFromRaw for FP16 GGUF loader (fbb968d)
- tensor: add NF4 quantization with double quantization (T9.3) (beaba05)
- tensor: add NVFP4 E2M1 weight storage (T2.4) (6f630dd)
- tensor: implement GPTQ dequantization (3784403)
- tensor: implement quantization format registry (f501c21)
- tensorrt: add TensorRT compilation for tabular models (90f408a)
- cuda: add gemv_q4k_sm121.cu to kernel build sources (issue #7) (0324568)
- cuda: dispatch Q4_K GEMV directly on sm_121 without re-quantization (10349fe)
- cuda: replace cgo_import_dynamic JMP trampolines with runtime.dlopen on arm64 (38f54ab), closes #3
- cuda: resolve Q5_K_M and Q6_K quantized GEMM/GEMV test failures (488862c)
- gemv: remove unused dp4a accumulator variables (3653fe1)
- graph: remove Q4Storage skip — restore cuBLAS SGEMM path (188 tok/s) (a38af9a)
- graph: restore PreUploadFrozenWeights for stable 188 tok/s baseline (2decc08)
- graph: skip BFloat16Storage in PreUploadFrozenWeights (7da3407)
- graph: skip CUDA graph capture during prefill (seqLen > 1) (e5f9ce0)
- graph: skip K-quant storage types in PreUploadFrozenWeights (23ba86d)
- graph: skip Q4Storage in EnsureCaptureInputsGPU (e4d4613)
- graph: skip quantized tensors with GPU pointers in PreUploadFrozenWeights (a7e361c)
- graph: sort Parameters() by name and add LoadParameters method (c1b853b)
- tensor: add missing NF4Storage implementation (T9.3 agent omitted impl) (1e4beaa)
- arena: add free-list for intra-pass buffer reuse (d40d6e4)
- clean Q4 GEMV restore — skip Q4 in PreUploadFrozenWeights + UploadWeights (e6a6e30)
- compute: convert Q4Storage to BF16 during upload (targeted, not all tensors) (39c77c9)
- compute: convert Q8 to float32 in UploadWeights for cuBLAS path (4d4bd8d)
- compute: upload large weight tensors as BF16 instead of F32 (e43f03f)
- gemv: add dp4a INT8 Q4_K GEMV kernel (05c3113)
- gemv: prefer dp4a Q4_K GEMV when available (ea98b7c)
- gemv: reduce dp4a kernel register pressure (cc707d5)
- gemv: wire dp4a Q4_K GEMV kernel into purego loader (8be7d1f)
- graph: add tensor lifetime analysis and intra-pass arena reuse (18e5f37)
- graph: let PreUploadFrozenWeights dequantize all quant types to float32 (fd755b4)
- graph: remove PreUploadFrozenWeights from CUDA graph executor (adb6e1c)
- graph: skip Q4Storage in PreUploadFrozenWeights for Q4 GEMV path (880b50e)
- restore Phase 6-compatible upload paths (2cc6cc3)
- restore Q4 GEMV path — skip Q4→F32 in both UploadWeights and PreUploadFrozenWeights (f6faf2a)
- transpose: restore Phase 6 GPU transpose guard (aa0541b)
- ztensor: dp4a Q4K GEMV kernel + arena free-list intra-pass reuse (4e85b12)
- graph: remove gpuPtrHolder check from PreUploadFrozenWeights (dafb96e)
0.3.2 (2026-03-21)
- cuda: use cgo build tag for arm64 dlopen trampolines (ebff59e)