1.6.0 (2026-04-17)
- compute: T1.2 add ensureNotCapturing guard and ErrCaptureIncompatibleAllocation (18e1f5a)
- compute: T2.1a add WithCapture helper for capture-aware graph lifecycle (d60c902)
- compute: T2.2 capture-aware allocWeight routing via cudaMallocAsync (2a723b7)
- compute: T2.3 pre-allocate workspace buffers at UploadWeights to avoid capture-time alloc (9f9eb5c)
- cuda: T1.1 add StreamCaptureStatus purego binding (879cbc9)
- graph: add LMHead to nonCapturableOps (07ba531)
- graph: T4.1 add capture watchdog with 30s timeout and status sampling (b3066a5)
- graph: T99.1.2 mark Gemma4PLECombinedProducer non-capturable (6c855a9)
- graph: T98.2.3 don't pool-release pass-through node inputs (6ecf8db)
1.5.0 (2026-04-10)
- compute: add AllocDeviceFloat32 and CopyToDevice to FusedEncoderProvider (8d6c90b)
- compute: add fused PatchTST encoder layer CUDA kernels (4dfd46e)
- compute: GPUEngine.Reshape honors dst argument (18a53fe)
- compute: reuse dst GPU memory instead of allocating per call (#84) (26bbd49)
- kernels: rename kernel_add in fused_encoder_bwd to avoid symbol clash (716bbd6)
1.4.0 (2026-04-06)
- graph: add NewPJRTClient for external PJRT usage (c8db036)
- graph: add PJRTPlan execution wrapper with KV cache state management (3e5cb40)
- ci: exclude metal and pjrt from go vet (5a7fdc3)
- kernels: update GemvQ5_0F32 test to match qhOffset/qsOffset signature (70f8fd5)
1.3.0 (2026-04-03)
- graph: add CompilePJRT for PJRT backend compilation (dfd77a4)
- pjrt: add buffer management (host-device transfer, readback, lifecycle) (9b5dc75)
- pjrt: add KV cache I/O rewriting and executable cache (c8decc5)
- pjrt: add PJRT C API purego bindings for plugin loading, client, and device (c675807)
- pjrt: add program execution, serialization, and full StableHLO emitter (382ea0a)
- pjrt: add StableHLO program compilation wrapper (7fcdde7)
- stablehlo: add emitter for element-wise and unary ops (499cef2)
- stablehlo: add emitter for MatMul and structural ops (13d87df)
- stablehlo: add emitter for reductions and Softmax decomposition (c07b287)
- stablehlo: add MLIR type system and SSA naming (7c68d1e)
- stablehlo: add shape inference for arithmetic ops (cac094e)
- stablehlo: add shape inference for structural ops (8bf132c)
- pjrt: centralize internal/cuda import in pjrt.go (aa8c170)
- pjrt: remove duplicate ccall/goStringN declarations (3e5fba9)
1.2.0 (2026-04-01)
- cuda: add Q6_K, Q5_K, Q5_0 GPU dequant kernels for M>1 prefill (d57e37e)
- cuda: add Q8 Gather kernel for GPU embedding lookup (30eb9c4)
- tensor: add QuantizeQ4K for float32 to Q4_K quantization (d0d3a82)
- compute: add Q4KStorage to UploadWeights F32 skip list (cc071b6)
- compute: CPU dequant fallback for Q4_K when K%256!=0 (f50ffa7)
- compute: use dequant+cuBLAS for Q4_K when K%256!=0 (5f21cbb)
- compute: use pool-backed GPUStorage for pool allocations (4367330)
- cuda: byte-wise loads in Q5_0 GEMV for ARM64 alignment (5f19e54)
- kernels: check null function pointer in FusedSoftmaxVMulF32 (935ad61)
- cuda: separated GPU layout for Q5_0 GEMV (d456c39)
1.1.3 (2026-04-01)
- compute: add Q5_0Storage B-weight handling to CPU MatMul (e7927e5)
- compute: Q5_0 GEMV byte-wise loads for ARM64 alignment (5c7ec7a)
- compute: skip Q4Storage in UploadWeights F32 loop (revert overaggressive skip) (2e91650)
- compute: skip transpose reshape fast-path for square matrices (eab19d0)
1.1.2 (2026-03-31)
- compute: upload CPU fallback MatMul results to GPU for device consistency (5bc914b)
1.1.1 (2026-03-31)
- cuda: remove float4 alignment requirement from gemv_q8_kernel (1313605)
1.1.0 (2026-03-31)
- compute: add GPUFusedSoftmaxVMul method with provider interface (d659e76)
- compute: add GPURepeatInterleave method with purego bindings (6af7b96)
- compute: add GraphCapturer interface for CUDA graph capture/replay (1f37c69)
- compute: GPU-native Copy using cudaMemcpyAsync D2D (efc8b42)
- compute: wire capture-aware pool into GPUEngine BeginCapture/EndCapture (e39b318)
- cuda: add cudaMallocAsync and cudaFreeAsync bindings (e339656)
- cuda: add cudaMemsetAsync binding and GPU-native Zero (47b5d39)
- cuda: add fused repeat-interleave kernel for GQA head expansion (91e2469)
- cuda: add fused softmax + V multiply kernel for decode attention (ef6f7ce)
- cuda: make MemPool capture-aware with SetCaptureStream (58b6337)
- gpuapi: wire FusedSoftmaxVMulF32 into KernelRunner interface (9afdb01)
- compute: copy mmap bytes to heap in mmapDevicePtr fallback (0ad23b5)
- compute: revert H2D to sync Memcpy (async breaks mmap'd tensors) (9a87e36)
- compute: use async memcpy in getDevicePtr for CUDA graph capture (b36b7ed)
1.0.0 (2026-03-30)
- release 1.0.0 (0230a86)
0.15.0 (2026-03-29)
- tensor: MmapStorage.SliceElements for zero-copy expert weight slicing (0a40e11)
- xblas: streaming GEMM for mmap'd tensors, unblocks over-RAM inference (8d80b91)
0.14.1 (2026-03-28)
- ci: exclude purego GPU binding packages from go vet (60f0f66)
- tensor: add IQ3_S to quant registry expected list (98c9237)
0.14.0 (2026-03-28)
- graph: add NodeOutput method for intermediate activation extraction (76a29c6)
0.13.0 (2026-03-28)
- xblas: add fused Q4_K GEMV kernel — 17x faster than dequant+requant (7ceb267)
0.12.0 (2026-03-28)
- tensor: make TernaryStorage implement Storage[float32] (2c8e9fa)
- kernels: add missing NSA, KV dequant, and IQ dequant fields to KernelLib (bf32aef)
0.11.0 (2026-03-27)
- compute: add CosineSimilarity to Engine[T] (204f07b)
- compute: add GPU dispatch for CosineSimilarity (40588bc)
- compute: add GPU dispatch for ternary GEMV (295f61c)
- compute: add Hadamard matrix generator (b3b3478)
- compute: add HadamardTransform to Engine[T] (5a99614)
- compute: add ReduceMax to Engine[T] (4b9b712)
- compute: add split-KV flash decode kernel with CPU reference (c16817e)
- compute: add TernaryGEMV for ternary weight matrix-vector multiply (8731bd1)
- cuda: add fused NSA three-path attention kernel stub (a024958)
- tensor: add IQ2_XXS dequantization storage (48677a7)
- tensor: add IQ3_S dequantization storage (9eab58b)
- tensor: add IQ4_NL dequantization storage (5205837)
- tensor: add TernaryStorage for 2-bit ternary weights (0f7c5ca)
0.10.1 (2026-03-27)
- tensor: remove MADV_SEQUENTIAL from MmapFile (caused 7x load regression) (8949a19)
0.10.0 (2026-03-27)
- tensor: add madvise hints for mmap'd pages (e26c8d6)
0.9.6 (2026-03-27)
- graph: skip all quantized storage in EnsureSlotsGPU/EnsureCaptureInputsGPU (0b38668)
0.9.5 (2026-03-27)
- compute: skip MmapStorage entirely in UploadWeights (8796fd0)
0.9.4 (2026-03-27)
- compute: copy mmap bytes to heap before cudaMemcpy upload (c2d68e7)
0.9.3 (2026-03-27)
- graph: skip quantized storage in PreUploadFrozenWeights (4b8388c)
0.9.2 (2026-03-27)
- compute: skip F32 MmapStorage in quantized upload path (51ed3e7)
0.9.1 (2026-03-27)
- tensor: delegate K-quant MmapStorage dequant to reference implementations (3ef8261)
0.9.0 (2026-03-27)
- compute: add MmapStorage GPU dispatch for quantized GEMV/GEMM (62f3db1)
0.8.0 (2026-03-27)
- tensor: add Q4_1/Q5_0/Q5_1 support for MmapStorage (8adb879)
0.7.0 (2026-03-27)
- tensor: add MmapStorage type and platform mmap helpers (f8b48bb)
0.6.3 (2026-03-27)
- compute: change Repeat to repeat-each semantics for GQA correctness (d3e6b96)
0.6.2 (2026-03-26)
- compute: prevent FP16 MatMul segfault on aarch64 purego (a6756c5)
0.6.1 (2026-03-26)
- compute: add VRAM bounds check for large MatMul allocations (915816c)
0.6.0 (2026-03-26)
- gguf: add shared GGUF writer package (0709c09)
0.5.1 (2026-03-26)
- cuda: raise shared memory limit for Q4 GEMV with K > 12288 (d654c72)
0.5.0 (2026-03-25)
- tensor: add MergeQ4KStorage and MergeQ6KStorage (764a750)
0.4.1 (2026-03-24)
0.4.0 (2026-03-24)
- add Q5_0 fused dequant-GEMV kernel stack (de5331f)
- add Q5_K fused dequant-GEMV kernel stack (c2ea6f7)
- batched: add batched multi-model inference (5897e29)
- compute: add ComputeAmax and ScaleForFP8 for FP8 quantization (T2.2) (8c866f4)
- compute: add native Q5_K GEMV kernel (b428f17)
- compute: add Q6_K GEMV dispatch and GPU engine integration (0528588)
- compute: dispatch FP8 MatMul to cublasLt FP8 GEMM (T2.3) (c446655)
- compute: FP16 weight upload path + PreUploadFrozenWeights skip (d893b9c)
- compute: implement hardware profiling and detection framework (c0c7ef5)
- compute: wire paged attention into GQA (T1.4) (abeff7a)
- cuda: add FP8 GEMM kernel with cublasLt bindings (T2.1) (7f524bc)
- cuda: add NVFP4 GEMV kernel for Blackwell sm_100+ (T2.5) (63fad59)
- cuda: add paged attention kernel with block-table indirection (T1.3) (e89e01d)
- cuda: add Q6_K fused dequant-GEMV kernel (8fc89db)
- cuda: add ragged batching attention kernel (T1.6) (2748ebc)
- cuda: add selective scan kernel for Mamba/SSM (T6.1) (260160e)
- cuda: implement FlashAttention-2 fused kernel with GQA support (e7000f8)
- cuda: implement warp-specialized GEMV kernel for decode phase (fc46cab)
- cuda: optimize Q4_K GEMV for sm_121 (Blackwell GB10 / DGX Spark) (3e32432)
- fpga: add FPGA runtime abstraction layer via purego (e703a86)
- gpuapi: implement Apple Metal compute shader bindings via purego (d548e22)
- graph: add fast replay path skipping PrepareSlots/EnsureSlotsGPU (e6e2355)
- graph: add gradient checkpointing (T8.9) (3cd5c01)
- graph: add kernel launch batch scheduler (cfd513b)
- graph: add SaveParameters/LoadParametersFromFile and checkpoint serialization (8a930ec), closes #96
- graph: add SlotCount method to ExecutionPlan (b8dc85f)
- graph: cache EmbeddingLookup GPU buffer for fast replay (dc595dd)
- graph: expand CUDA graph capture to 100% instruction coverage (33b54d9)
- kv: add BlockPool for paged attention (T1.1) (e851d47)
- kv: add BlockTable for per-sequence paged KV mapping (T1.2) (be1ff30)
- kv: add RadixTree for KV block prefix caching (T4.1) (0e68dc9)
- metal: port critical CUDA kernels to Metal compute shaders (3051613)
- metrics: add Add(n int64) to CounterMetric interface (64728d8)
- quant: add native Q6_K GEMV direct decode for CPU and CUDA (566136b)
- quant: add W4A16 mixed-precision dispatch (8d2f97a)
- quant: add W8A8 mixed-precision dispatch with INT8 weights/activations and FP32 accumulation (3fe0745)
- sycl: add SYCL runtime bindings via purego (b987c36)
- sycl: port GEMV and attention kernels to SYCL backend (61b0ee8)
- tensor: add AWQ dequantization support (cfbc3d0)
- tensor: add NewFloat16StorageFromRaw constructor for pre-encoded FP16 bytes (d21c355)
- tensor: add NewFloat16StorageFromRaw for FP16 GGUF loader (fbb968d)
- tensor: add NF4 quantization with double quantization (T9.3) (beaba05)
- tensor: add NVFP4 E2M1 weight storage (T2.4) (6f630dd)
- tensor: implement GPTQ dequantization (3784403)
- tensor: implement quantization format registry (f501c21)
- tensorrt: add TensorRT compilation for tabular models (90f408a)
- cuda: add gemv_q4k_sm121.cu to kernel build sources (issue #7) (0324568)
- cuda: dispatch Q4_K GEMV directly on sm_121 without re-quantization (10349fe)
- cuda: replace cgo_import_dynamic JMP trampolines with runtime.dlopen on arm64 (38f54ab), closes #3
- cuda: resolve Q5_K_M and Q6_K quantized GEMM/GEMV test failures (488862c)
- gemv: remove unused dp4a accumulator variables (3653fe1)
- graph: remove Q4Storage skip — restore cuBLAS SGEMM path (188 tok/s) (a38af9a)
- graph: restore PreUploadFrozenWeights for stable 188 tok/s baseline (2decc08)
- graph: skip BFloat16Storage in PreUploadFrozenWeights (7da3407)
- graph: skip CUDA graph capture during prefill (seqLen > 1) (e5f9ce0)
- graph: skip K-quant storage types in PreUploadFrozenWeights (23ba86d)
- graph: skip Q4Storage in EnsureCaptureInputsGPU (e4d4613)
- graph: skip quantized tensors with GPU pointers in PreUploadFrozenWeights (a7e361c)
- graph: sort Parameters() by name and add LoadParameters method (c1b853b)
- tensor: add missing NF4Storage implementation (T9.3 agent omitted impl) (1e4beaa)
- arena: add free-list for intra-pass buffer reuse (d40d6e4)
- clean Q4 GEMV restore — skip Q4 in PreUploadFrozenWeights + UploadWeights (e6a6e30)
- compute: convert Q4Storage to BF16 during upload (targeted, not all tensors) (39c77c9)
- compute: convert Q8 to float32 in UploadWeights for cuBLAS path (4d4bd8d)
- compute: upload large weight tensors as BF16 instead of F32 (e43f03f)
- gemv: add dp4a INT8 Q4_K GEMV kernel (05c3113)
- gemv: prefer dp4a Q4_K GEMV when available (ea98b7c)
- gemv: reduce dp4a kernel register pressure (cc707d5)
- gemv: wire dp4a Q4_K GEMV kernel into purego loader (8be7d1f)
- graph: add tensor lifetime analysis and intra-pass arena reuse (18e5f37)
- graph: let PreUploadFrozenWeights dequantize all quant types to float32 (fd755b4)
- graph: remove PreUploadFrozenWeights from CUDA graph executor (adb6e1c)
- graph: skip Q4Storage in PreUploadFrozenWeights for Q4 GEMV path (880b50e)
- restore Phase 6-compatible upload paths (2cc6cc3)
- restore Q4 GEMV path — skip Q4→F32 in both UploadWeights and PreUploadFrozenWeights (f6faf2a)
- transpose: restore Phase 6 GPU transpose guard (aa0541b)
- ztensor: dp4a Q4K GEMV kernel + arena free-list intra-pass reuse (4e85b12)
- graph: remove gpuPtrHolder check from PreUploadFrozenWeights (dafb96e)
0.3.2 (2026-03-21)
- cuda: use cgo build tag for arm64 dlopen trampolines (ebff59e)