Releases: zerfoo/ztensor

v0.6.2

26 Mar 16:07

0.6.2 (2026-03-26)

Bug Fixes

  • compute: prevent FP16 MatMul segfault on aarch64 purego (a6756c5)
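
The aarch64 purego path implies pure-Go FP16 handling rather than cgo. As an illustration of what that entails (a sketch, not ztensor's actual conversion code), a binary16-to-float32 decode looks like:

```go
package main

import (
	"fmt"
	"math"
)

// f16ToF32 decodes an IEEE 754 binary16 value stored as a uint16.
// Sketch only; ztensor's real FP16 path may differ.
func f16ToF32(h uint16) float32 {
	sign := uint32(h>>15) << 31
	exp := int32(h>>10) & 0x1F
	mant := uint32(h) & 0x3FF

	switch {
	case exp == 0 && mant == 0: // signed zero
		return math.Float32frombits(sign)
	case exp == 0: // subnormal half: renormalize for float32
		e := int32(-14)
		for mant&0x400 == 0 {
			mant <<= 1
			e--
		}
		mant &= 0x3FF // drop the now-implicit leading bit
		return math.Float32frombits(sign | uint32(e+127)<<23 | mant<<13)
	case exp == 0x1F: // Inf / NaN pass through
		return math.Float32frombits(sign | 0xFF<<23 | mant<<13)
	default: // normal: rebias exponent from 15 to 127
		return math.Float32frombits(sign | uint32(exp+112)<<23 | mant<<13)
	}
}

func main() {
	fmt.Println(f16ToF32(0x3C00), f16ToF32(0xC000)) // 1 -2
}
```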

v0.6.1

26 Mar 16:07

0.6.1 (2026-03-26)

Bug Fixes

  • compute: add VRAM bounds check for large MatMul allocations (915816c)
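
A bounds check of this kind reduces to size arithmetic before allocation. A minimal sketch (hypothetical names, not ztensor's actual API) for an M×K by K×N FP32 MatMul:

```go
package main

import "fmt"

// matmulBytes returns the device memory an FP32 MatMul needs for its
// A (M×K), B (K×N), and C (M×N) buffers. Sketch only.
func matmulBytes(m, n, k int64) int64 {
	const f32 = 4
	return f32 * (m*k + k*n + m*n)
}

// checkFits is a hypothetical stand-in for the kind of pre-launch
// bounds check v0.6.1 adds for large MatMul allocations.
func checkFits(m, n, k, freeVRAM int64) error {
	if need := matmulBytes(m, n, k); need > freeVRAM {
		return fmt.Errorf("matmul needs %d bytes but only %d free", need, freeVRAM)
	}
	return nil
}

func main() {
	// An 8192³ FP32 MatMul needs 3 × 256 MiB = 768 MiB.
	fmt.Println(matmulBytes(8192, 8192, 8192)) // 805306368
}
```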

v0.6.0

26 Mar 15:15

0.6.0 (2026-03-26)

Features

  • gguf: add shared GGUF writer package (0709c09)

v0.5.1

26 Mar 04:39

0.5.1 (2026-03-26)

Bug Fixes

  • cuda: raise shared memory limit for Q4 GEMV with K > 12288 (d654c72)
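
The K > 12288 threshold is consistent with a kernel that stages the FP32 activation vector in shared memory: 12288 × 4 bytes is exactly the 48 KiB default per-block shared-memory limit on most CUDA GPUs, beyond which a launch must opt in to a larger dynamic limit. A sketch of that arithmetic (the staging strategy is an assumption, not stated in the source):

```go
package main

import "fmt"

const (
	f32Bytes       = 4
	defaultSmemCap = 48 * 1024 // default per-block shared memory on most CUDA GPUs
)

// smemForGEMV returns the shared memory needed to stage a length-K
// FP32 activation vector per block (hypothetical kernel layout).
func smemForGEMV(k int) int { return k * f32Bytes }

// needsOptIn reports whether the launch must raise the dynamic limit
// (e.g. via cudaFuncAttributeMaxDynamicSharedMemorySize).
func needsOptIn(k int) bool { return smemForGEMV(k) > defaultSmemCap }

func main() {
	fmt.Println(smemForGEMV(12288), needsOptIn(12288)) // 49152 false
	fmt.Println(needsOptIn(12289))                     // true
}
```

Note how the boundary lands exactly on the release note's wording: K = 12288 still fits, and any K > 12288 requires the raised limit.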

v0.5.0

25 Mar 23:10

0.5.0 (2026-03-25)

Features

  • tensor: add MergeQ4KStorage and MergeQ6KStorage (764a750)
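
Q4_K encodes weights in 256-element super-blocks of 144 bytes (two FP16 scales, 12 bytes of sub-block scales, 128 bytes of 4-bit values), so merging two storages along the flattened dimension can be a validated byte append. A toy sketch under that assumption (MergeQ4KStorage's real signature is not shown in the source):

```go
package main

import "fmt"

const (
	q4kBlockElems = 256 // super-block size shared by all K-quants
	q4kBlockBytes = 144 // 2B d + 2B dmin + 12B scales + 128B nibbles
)

// mergeQ4K concatenates two Q4_K-encoded buffers, checking that each
// is a whole number of super-blocks. Toy sketch, not ztensor's API.
func mergeQ4K(a, b []byte) ([]byte, error) {
	for _, buf := range [][]byte{a, b} {
		if len(buf)%q4kBlockBytes != 0 {
			return nil, fmt.Errorf("length %d is not a multiple of %d", len(buf), q4kBlockBytes)
		}
	}
	out := make([]byte, 0, len(a)+len(b))
	out = append(out, a...)
	return append(out, b...), nil
}

func main() {
	a := make([]byte, 2*q4kBlockBytes) // 512 weights
	b := make([]byte, 1*q4kBlockBytes) // 256 weights
	merged, _ := mergeQ4K(a, b)
	fmt.Println(len(merged) / q4kBlockBytes * q4kBlockElems) // 768
}
```

Q6_K works the same way with 210-byte super-blocks, which is presumably why the two merge functions ship as a pair.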

v0.4.1

24 Mar 23:45

0.4.1 (2026-03-24)

Bug Fixes

  • cuda: expand libkernels.so search paths and log dlopen errors (issue #7) (b906f08)

v0.4.0

24 Mar 22:29

0.4.0 (2026-03-24)

Features

  • add Q5_0 fused dequant-GEMV kernel stack (de5331f)
  • add Q5_K fused dequant-GEMV kernel stack (c2ea6f7)
  • batched: add batched multi-model inference (5897e29)
  • compute: add ComputeAmax and ScaleForFP8 for FP8 quantization (T2.2) (8c866f4)
  • compute: add native Q5_K GEMV kernel (b428f17)
  • compute: add Q6_K GEMV dispatch and GPU engine integration (0528588)
  • compute: dispatch FP8 MatMul to cublasLt FP8 GEMM (T2.3) (c446655)
  • compute: FP16 weight upload path + PreUploadFrozenWeights skip (d893b9c)
  • compute: implement hardware profiling and detection framework (c0c7ef5)
  • compute: wire paged attention into GQA (T1.4) (abeff7a)
  • cuda: add FP8 GEMM kernel with cublasLt bindings (T2.1) (7f524bc)
  • cuda: add NVFP4 GEMV kernel for Blackwell sm_100+ (T2.5) (63fad59)
  • cuda: add paged attention kernel with block-table indirection (T1.3) (e89e01d)
  • cuda: add Q6_K fused dequant-GEMV kernel (8fc89db)
  • cuda: add ragged batching attention kernel (T1.6) (2748ebc)
  • cuda: add selective scan kernel for Mamba/SSM (T6.1) (260160e)
  • cuda: implement FlashAttention-2 fused kernel with GQA support (e7000f8)
  • cuda: implement warp-specialized GEMV kernel for decode phase (fc46cab)
  • cuda: optimize Q4_K GEMV for sm_121 (Blackwell GB10 / DGX Spark) (3e32432)
  • fpga: add FPGA runtime abstraction layer via purego (e703a86)
  • gpuapi: implement Apple Metal compute shader bindings via purego (d548e22)
  • gpuapi: implement Apple Metal compute shader bindings via purego (38212db)
  • graph: add fast replay path skipping PrepareSlots/EnsureSlotsGPU (e6e2355)
  • graph: add gradient checkpointing (T8.9) (3cd5c01)
  • graph: add kernel launch batch scheduler (cfd513b)
  • graph: add SaveParameters/LoadParametersFromFile and checkpoint serialization (8a930ec), closes #96
  • graph: add SlotCount method to ExecutionPlan (b8dc85f)
  • graph: cache EmbeddingLookup GPU buffer for fast replay (dc595dd)
  • graph: expand CUDA graph capture to 100% instruction coverage (33b54d9)
  • kv: add BlockPool for paged attention (T1.1) (e851d47)
  • kv: add BlockTable for per-sequence paged KV mapping (T1.2) (be1ff30)
  • kv: add RadixTree for KV block prefix caching (T4.1) (0e68dc9)
  • metal: port critical CUDA kernels to Metal compute shaders (3051613)
  • metrics: add Add(n int64) to CounterMetric interface (64728d8)
  • quant: add native Q6_K GEMV direct decode for CPU and CUDA (566136b)
  • quant: add W4A16 mixed-precision dispatch (8d2f97a)
  • quant: add W8A8 mixed-precision dispatch with INT8 weights/activations and FP32 accumulation (3fe0745)
  • sycl: add SYCL runtime bindings via purego (b987c36)
  • sycl: port GEMV and attention kernels to SYCL backend (61b0ee8)
  • tensor: add AWQ dequantization support (cfbc3d0)
  • tensor: add NewFloat16StorageFromRaw constructor for pre-encoded FP16 bytes (d21c355)
  • tensor: add NewFloat16StorageFromRaw for FP16 GGUF loader (fbb968d)
  • tensor: add NF4 quantization with double quantization (T9.3) (beaba05)
  • tensor: add NVFP4 E2M1 weight storage (T2.4) (6f630dd)
  • tensor: add NVFP4 E2M1 weight storage (T2.4) (ccd48ec)
  • tensor: implement GPTQ dequantization (3784403)
  • tensor: implement quantization format registry (f501c21)
  • tensorrt: add TensorRT compilation for tabular models (90f408a)
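
Several of the entries above (T1.1–T1.4) assemble paged attention: a pool of fixed-size KV blocks plus a per-sequence table that maps logical token positions to physical blocks. A toy sketch of that indirection (names and block size are illustrative, not ztensor's actual BlockPool/BlockTable types):

```go
package main

import "fmt"

const blockTokens = 16 // tokens per KV block (illustrative)

// blockPool hands out physical block IDs; freed blocks are reused.
type blockPool struct {
	free []int
}

func newBlockPool(n int) *blockPool {
	p := &blockPool{free: make([]int, n)}
	for i := range p.free {
		p.free[i] = i
	}
	return p
}

func (p *blockPool) alloc() (int, bool) {
	if len(p.free) == 0 {
		return 0, false
	}
	id := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	return id, true
}

// blockTable maps a sequence's logical block index to a physical block.
type blockTable struct {
	blocks []int
}

// slot resolves a token position to (physical block, offset), growing
// the table from the pool as the sequence extends.
func (t *blockTable) slot(pos int, p *blockPool) (block, offset int, ok bool) {
	logical := pos / blockTokens
	for len(t.blocks) <= logical {
		id, ok := p.alloc()
		if !ok {
			return 0, 0, false
		}
		t.blocks = append(t.blocks, id)
	}
	return t.blocks[logical], pos % blockTokens, true
}

func main() {
	pool := newBlockPool(4)
	var seq blockTable
	blk, off, _ := seq.slot(17, pool) // token 17 → logical block 1, offset 1
	fmt.Println(blk, off)
}
```

The attention kernel (T1.3) then reads KV entries through this table rather than from one contiguous buffer, which is what lets sequences grow without reallocation and lets the RadixTree (T4.1) share prefix blocks between sequences.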

Bug Fixes

  • cuda: add gemv_q4k_sm121.cu to kernel build sources (issue #7) (0324568)
  • cuda: dispatch Q4_K GEMV directly on sm_121 without re-quantization (10349fe)
  • cuda: replace cgo_import_dynamic JMP trampolines with runtime.dlopen on arm64 (38f54ab), closes #3
  • cuda: resolve Q5_K_M and Q6_K quantized GEMM/GEMV test failures (488862c)
  • cuda: use cgo build tag for arm64 dlopen trampolines (ebff59e)
  • gemv: remove unused dp4a accumulator variables (3653fe1)
  • graph: remove Q4Storage skip — restore cuBLAS SGEMM path (188 tok/s) (a38af9a)
  • graph: restore PreUploadFrozenWeights for stable 188 tok/s baseline (2decc08)
  • graph: skip BFloat16Storage in PreUploadFrozenWeights (7da3407)
  • graph: skip CUDA graph capture during prefill (seqLen > 1) (e5f9ce0)
  • graph: skip K-quant storage types in PreUploadFrozenWeights (23ba86d)
  • graph: skip Q4Storage in EnsureCaptureInputsGPU (e4d4613)
  • graph: skip quantized tensors with GPU pointers in PreUploadFrozenWeights (a7e361c)
  • graph: sort Parameters() by name and add LoadParameters method (c1b853b)
  • …

v0.3.1

21 Mar 07:03

0.3.1 (2026-03-21)

Bug Fixes

  • cuda: replace cgo_import_dynamic JMP trampolines with runtime.dlopen on arm64 (38f54ab), closes #3