Releases: zerfoo/ztensor

v0.6.2

26 Mar 16:07

0.6.2 (2026-03-26)

Bug Fixes

  • compute: prevent FP16 MatMul segfault on aarch64 purego (a6756c5)
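
The aarch64 purego path implies pure-Go FP16 handling rather than cgo. As an illustration of what that entails (a sketch, not ztensor's actual conversion code), a binary16-to-float32 decode looks like:

```go
package main

import (
	"fmt"
	"math"
)

// f16ToF32 decodes an IEEE 754 binary16 value stored as a uint16.
// Sketch only; ztensor's real FP16 path may differ.
func f16ToF32(h uint16) float32 {
	sign := uint32(h>>15) << 31
	exp := int32(h>>10) & 0x1F
	mant := uint32(h) & 0x3FF

	switch {
	case exp == 0 && mant == 0: // signed zero
		return math.Float32frombits(sign)
	case exp == 0: // subnormal half: renormalize for float32
		e := int32(-14)
		for mant&0x400 == 0 {
			mant <<= 1
			e--
		}
		mant &= 0x3FF // drop the now-implicit leading bit
		return math.Float32frombits(sign | uint32(e+127)<<23 | mant<<13)
	case exp == 0x1F: // Inf / NaN pass through
		return math.Float32frombits(sign | 0xFF<<23 | mant<<13)
	default: // normal: rebias exponent from 15 to 127
		return math.Float32frombits(sign | uint32(exp+112)<<23 | mant<<13)
	}
}

func main() {
	fmt.Println(f16ToF32(0x3C00), f16ToF32(0xC000)) // 1 -2
}
```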

v0.6.1

26 Mar 16:07

0.6.1 (2026-03-26)

Bug Fixes

  • compute: add VRAM bounds check for large MatMul allocations (915816c)
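
A bounds check of this kind reduces to size arithmetic before allocation. A minimal sketch (hypothetical names, not ztensor's actual API) for an M×K by K×N FP32 MatMul:

```go
package main

import "fmt"

// matmulBytes returns the device memory an FP32 MatMul needs for its
// A (M×K), B (K×N), and C (M×N) buffers. Sketch only.
func matmulBytes(m, n, k int64) int64 {
	const f32 = 4
	return f32 * (m*k + k*n + m*n)
}

// checkFits is a hypothetical stand-in for the kind of pre-launch
// bounds check v0.6.1 adds for large MatMul allocations.
func checkFits(m, n, k, freeVRAM int64) error {
	if need := matmulBytes(m, n, k); need > freeVRAM {
		return fmt.Errorf("matmul needs %d bytes but only %d free", need, freeVRAM)
	}
	return nil
}

func main() {
	// An 8192³ FP32 MatMul needs 3 × 256 MiB = 768 MiB.
	fmt.Println(matmulBytes(8192, 8192, 8192)) // 805306368
}
```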

v0.6.0

26 Mar 15:15

0.6.0 (2026-03-26)

Features

  • gguf: add shared GGUF writer package (0709c09)

v0.5.1

26 Mar 04:39

0.5.1 (2026-03-26)

Bug Fixes

  • cuda: raise shared memory limit for Q4 GEMV with K > 12288 (d654c72)
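
The K > 12288 threshold is consistent with a kernel that stages the FP32 activation vector in shared memory: 12288 × 4 bytes is exactly the 48 KiB default per-block shared-memory limit on most CUDA GPUs, beyond which a launch must opt in to a larger dynamic limit. A sketch of that arithmetic (the staging strategy is an assumption, not stated in the source):

```go
package main

import "fmt"

const (
	f32Bytes       = 4
	defaultSmemCap = 48 * 1024 // default per-block shared memory on most CUDA GPUs
)

// smemForGEMV returns the shared memory needed to stage a length-K
// FP32 activation vector per block (hypothetical kernel layout).
func smemForGEMV(k int) int { return k * f32Bytes }

// needsOptIn reports whether the launch must raise the dynamic limit
// (e.g. via cudaFuncAttributeMaxDynamicSharedMemorySize).
func needsOptIn(k int) bool { return smemForGEMV(k) > defaultSmemCap }

func main() {
	fmt.Println(smemForGEMV(12288), needsOptIn(12288)) // 49152 false
	fmt.Println(needsOptIn(12289))                     // true
}
```

Note how the boundary lands exactly on the release note's wording: K = 12288 still fits, and any K > 12288 requires the raised limit.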

v0.5.0

25 Mar 23:10

0.5.0 (2026-03-25)

Features

  • tensor: add MergeQ4KStorage and MergeQ6KStorage (764a750)
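
Q4_K encodes weights in 256-element super-blocks of 144 bytes (two FP16 scales, 12 bytes of sub-block scales, 128 bytes of 4-bit values), so merging two storages along the flattened dimension can be a validated byte append. A toy sketch under that assumption (MergeQ4KStorage's real signature is not shown in the source):

```go
package main

import "fmt"

const (
	q4kBlockElems = 256 // super-block size shared by all K-quants
	q4kBlockBytes = 144 // 2B d + 2B dmin + 12B scales + 128B nibbles
)

// mergeQ4K concatenates two Q4_K-encoded buffers, checking that each
// is a whole number of super-blocks. Toy sketch, not ztensor's API.
func mergeQ4K(a, b []byte) ([]byte, error) {
	for _, buf := range [][]byte{a, b} {
		if len(buf)%q4kBlockBytes != 0 {
			return nil, fmt.Errorf("length %d is not a multiple of %d", len(buf), q4kBlockBytes)
		}
	}
	out := make([]byte, 0, len(a)+len(b))
	out = append(out, a...)
	return append(out, b...), nil
}

func main() {
	a := make([]byte, 2*q4kBlockBytes) // 512 weights
	b := make([]byte, 1*q4kBlockBytes) // 256 weights
	merged, _ := mergeQ4K(a, b)
	fmt.Println(len(merged) / q4kBlockBytes * q4kBlockElems) // 768
}
```

Q6_K works the same way with 210-byte super-blocks, which is presumably why the two merge functions ship as a pair.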

v0.4.1

24 Mar 23:45

0.4.1 (2026-03-24)

Bug Fixes

  • cuda: expand libkernels.so search paths and log dlopen errors (issue #7) (b906f08)

v0.4.0

24 Mar 22:29

0.4.0 (2026-03-24)

Features

  • add Q5_0 fused dequant-GEMV kernel stack (de5331f)
  • add Q5_K fused dequant-GEMV kernel stack (c2ea6f7)
  • batched: add batched multi-model inference (5897e29)
  • compute: add ComputeAmax and ScaleForFP8 for FP8 quantization (T2.2) (8c866f4)
  • compute: add native Q5_K GEMV kernel (b428f17)
  • compute: add Q6_K GEMV dispatch and GPU engine integration (0528588)
  • compute: dispatch FP8 MatMul to cublasLt FP8 GEMM (T2.3) (c446655)
  • compute: FP16 weight upload path + PreUploadFrozenWeights skip (d893b9c)
  • compute: implement hardware profiling and detection framework (c0c7ef5)
  • compute: wire paged attention into GQA (T1.4) (abeff7a)
  • cuda: add FP8 GEMM kernel with cublasLt bindings (T2.1) (7f524bc)
  • cuda: add NVFP4 GEMV kernel for Blackwell sm_100+ (T2.5) (63fad59)
  • cuda: add paged attention kernel with block-table indirection (T1.3) (e89e01d)
  • cuda: add Q6_K fused dequant-GEMV kernel (8fc89db)
  • cuda: add ragged batching attention kernel (T1.6) (2748ebc)
  • cuda: add selective scan kernel for Mamba/SSM (T6.1) (260160e)
  • cuda: implement FlashAttention-2 fused kernel with GQA support (e7000f8)
  • cuda: implement warp-specialized GEMV kernel for decode phase (fc46cab)
  • cuda: optimize Q4_K GEMV for sm_121 (Blackwell GB10 / DGX Spark) (3e32432)
  • fpga: add FPGA runtime abstraction layer via purego (e703a86)
  • gpuapi: implement Apple Metal compute shader bindings via purego (d548e22)
  • gpuapi: implement Apple Metal compute shader bindings via purego (38212db)
  • graph: add fast replay path skipping PrepareSlots/EnsureSlotsGPU (e6e2355)
  • graph: add gradient checkpointing (T8.9) (3cd5c01)
  • graph: add kernel launch batch scheduler (cfd513b)
  • graph: add SaveParameters/LoadParametersFromFile and checkpoint serialization (8a930ec), closes #96
  • graph: add SlotCount method to ExecutionPlan (b8dc85f)
  • graph: cache EmbeddingLookup GPU buffer for fast replay (dc595dd)
  • graph: expand CUDA graph capture to 100% instruction coverage (33b54d9)
  • kv: add BlockPool for paged attention (T1.1) (e851d47)
  • kv: add BlockTable for per-sequence paged KV mapping (T1.2) (be1ff30)
  • kv: add RadixTree for KV block prefix caching (T4.1) (0e68dc9)
  • metal: port critical CUDA kernels to Metal compute shaders (3051613)
  • metrics: add Add(n int64) to CounterMetric interface (64728d8)
  • quant: add native Q6_K GEMV direct decode for CPU and CUDA (566136b)
  • quant: add W4A16 mixed-precision dispatch (8d2f97a)
  • quant: add W8A8 mixed-precision dispatch with INT8 weights/activations and FP32 accumulation (3fe0745)
  • sycl: add SYCL runtime bindings via purego (b987c36)
  • sycl: port GEMV and attention kernels to SYCL backend (61b0ee8)
  • tensor: add AWQ dequantization support (cfbc3d0)
  • tensor: add NewFloat16StorageFromRaw constructor for pre-encoded FP16 bytes (d21c355)
  • tensor: add NewFloat16StorageFromRaw for FP16 GGUF loader (fbb968d)
  • tensor: add NF4 quantization with double quantization (T9.3) (beaba05)
  • tensor: add NVFP4 E2M1 weight storage (T2.4) (6f630dd)
  • tensor: add NVFP4 E2M1 weight storage (T2.4) (ccd48ec)
  • tensor: implement GPTQ dequantization (3784403)
  • tensor: implement quantization format registry (f501c21)
  • tensorrt: add TensorRT compilation for tabular models (90f408a)
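
Several of the entries above (T1.1–T1.4) assemble paged attention: a pool of fixed-size KV blocks plus a per-sequence table that maps logical token positions to physical blocks. A toy sketch of that indirection (names and block size are illustrative, not ztensor's actual BlockPool/BlockTable types):

```go
package main

import "fmt"

const blockTokens = 16 // tokens per KV block (illustrative)

// blockPool hands out physical block IDs; freed blocks are reused.
type blockPool struct {
	free []int
}

func newBlockPool(n int) *blockPool {
	p := &blockPool{free: make([]int, n)}
	for i := range p.free {
		p.free[i] = i
	}
	return p
}

func (p *blockPool) alloc() (int, bool) {
	if len(p.free) == 0 {
		return 0, false
	}
	id := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	return id, true
}

// blockTable maps a sequence's logical block index to a physical block.
type blockTable struct {
	blocks []int
}

// slot resolves a token position to (physical block, offset), growing
// the table from the pool as the sequence extends.
func (t *blockTable) slot(pos int, p *blockPool) (block, offset int, ok bool) {
	logical := pos / blockTokens
	for len(t.blocks) <= logical {
		id, ok := p.alloc()
		if !ok {
			return 0, 0, false
		}
		t.blocks = append(t.blocks, id)
	}
	return t.blocks[logical], pos % blockTokens, true
}

func main() {
	pool := newBlockPool(4)
	var seq blockTable
	blk, off, _ := seq.slot(17, pool) // token 17 → logical block 1, offset 1
	fmt.Println(blk, off)
}
```

The attention kernel (T1.3) then reads KV entries through this table rather than from one contiguous buffer, which is what lets sequences grow without reallocation and lets the RadixTree (T4.1) share prefix blocks between sequences.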

Bug Fixes

  • cuda: add gemv_q4k_sm121.cu to kernel build sources (issue #7) (0324568)
  • cuda: dispatch Q4_K GEMV directly on sm_121 without re-quantization (10349fe)
  • cuda: replace cgo_import_dynamic JMP trampolines with runtime.dlopen on arm64 (38f54ab), closes #3
  • cuda: resolve Q5_K_M and Q6_K quantized GEMM/GEMV test failures (488862c)
  • cuda: use cgo build tag for arm64 dlopen trampolines (ebff59e)
  • gemv: remove unused dp4a accumulator variables (3653fe1)
  • graph: remove Q4Storage skip — restore cuBLAS SGEMM path (188 tok/s) (a38af9a)
  • graph: restore PreUploadFrozenWeights for stable 188 tok/s baseline (2decc08)
  • graph: skip BFloat16Storage in PreUploadFrozenWeights (7da3407)
  • graph: skip CUDA graph capture during prefill (seqLen > 1) (e5f9ce0)
  • graph: skip K-quant storage types in PreUploadFrozenWeights (23ba86d)
  • graph: skip Q4Storage in EnsureCaptureInputsGPU (e4d4613)
  • graph: skip quantized tensors with GPU pointers in PreUploadFrozenWeights (a7e361c)
  • graph: sort Parameters() by name and add LoadParameters method (c1b853b)
  • …

v0.3.1

21 Mar 07:03

0.3.1 (2026-03-21)

Bug Fixes

  • cuda: replace cgo_import_dynamic JMP trampolines with runtime.dlopen on arm64 (38f54ab), closes #3