Releases: zerfoo/ztensor
v0.6.2
v0.6.1
v0.6.0
v0.5.1
v0.5.0
v0.4.1
v0.4.0
0.4.0 (2026-03-24)
Features
- add Q5_0 fused dequant-GEMV kernel stack (de5331f)
- add Q5_K fused dequant-GEMV kernel stack (c2ea6f7)
- batched: add batched multi-model inference (5897e29)
- compute: add ComputeAmax and ScaleForFP8 for FP8 quantization (T2.2) (8c866f4)
- compute: add native Q5_K GEMV kernel (b428f17)
- compute: add Q6_K GEMV dispatch and GPU engine integration (0528588)
- compute: dispatch FP8 MatMul to cublasLt FP8 GEMM (T2.3) (c446655)
- compute: FP16 weight upload path + PreUploadFrozenWeights skip (d893b9c)
- compute: implement hardware profiling and detection framework (c0c7ef5)
- compute: wire paged attention into GQA (T1.4) (abeff7a)
- cuda: add FP8 GEMM kernel with cublasLt bindings (T2.1) (7f524bc)
- cuda: add NVFP4 GEMV kernel for Blackwell sm_100+ (T2.5) (63fad59)
- cuda: add paged attention kernel with block-table indirection (T1.3) (e89e01d)
- cuda: add Q6_K fused dequant-GEMV kernel (8fc89db)
- cuda: add ragged batching attention kernel (T1.6) (2748ebc)
- cuda: add selective scan kernel for Mamba/SSM (T6.1) (260160e)
- cuda: implement FlashAttention-2 fused kernel with GQA support (e7000f8)
- cuda: implement warp-specialized GEMV kernel for decode phase (fc46cab)
- cuda: optimize Q4_K GEMV for sm_121 (Blackwell GB10 / DGX Spark) (3e32432)
- fpga: add FPGA runtime abstraction layer via purego (e703a86)
- gpuapi: implement Apple Metal compute shader bindings via purego (d548e22)
- gpuapi: implement Apple Metal compute shader bindings via purego (38212db)
- graph: add fast replay path skipping PrepareSlots/EnsureSlotsGPU (e6e2355)
- graph: add gradient checkpointing (T8.9) (3cd5c01)
- graph: add kernel launch batch scheduler (cfd513b)
- graph: add SaveParameters/LoadParametersFromFile and checkpoint serialization (8a930ec), closes #96
- graph: add SlotCount method to ExecutionPlan (b8dc85f)
- graph: cache EmbeddingLookup GPU buffer for fast replay (dc595dd)
- graph: expand CUDA graph capture to 100% instruction coverage (33b54d9)
- kv: add BlockPool for paged attention (T1.1) (e851d47)
- kv: add BlockTable for per-sequence paged KV mapping (T1.2) (be1ff30)
- kv: add RadixTree for KV block prefix caching (T4.1) (0e68dc9)
- metal: port critical CUDA kernels to Metal compute shaders (3051613)
- metrics: add Add(n int64) to CounterMetric interface (64728d8)
- quant: add native Q6_K GEMV direct decode for CPU and CUDA (566136b)
- quant: add W4A16 mixed-precision dispatch (8d2f97a)
- quant: add W8A8 mixed-precision dispatch with INT8 weights/activations and FP32 accumulation (3fe0745)
- sycl: add SYCL runtime bindings via purego (b987c36)
- sycl: port GEMV and attention kernels to SYCL backend (61b0ee8)
- tensor: add AWQ dequantization support (cfbc3d0)
- tensor: add NewFloat16StorageFromRaw constructor for pre-encoded FP16 bytes (d21c355)
- tensor: add NewFloat16StorageFromRaw for FP16 GGUF loader (fbb968d)
- tensor: add NF4 quantization with double quantization (T9.3) (beaba05)
- tensor: add NVFP4 E2M1 weight storage (T2.4) (6f630dd)
- tensor: add NVFP4 E2M1 weight storage (T2.4) (ccd48ec)
- tensor: implement GPTQ dequantization (3784403)
- tensor: implement quantization format registry (f501c21)
- tensorrt: add TensorRT compilation for tabular models (90f408a)
Bug Fixes
- cuda: add gemv_q4k_sm121.cu to kernel build sources (issue #7) (0324568)
- cuda: dispatch Q4_K GEMV directly on sm_121 without re-quantization (10349fe)
- cuda: replace cgo_import_dynamic JMP trampolines with runtime.dlopen on arm64 (38f54ab), closes #3
- cuda: resolve Q5_K_M and Q6_K quantized GEMM/GEMV test failures (488862c)
- cuda: use cgo build tag for arm64 dlopen trampolines (ebff59e)
- gemv: remove unused dp4a accumulator variables (3653fe1)
- graph: remove Q4Storage skip — restore cuBLAS SGEMM path (188 tok/s) (a38af9a)
- graph: restore PreUploadFrozenWeights for stable 188 tok/s baseline (2decc08)
- graph: skip BFloat16Storage in PreUploadFrozenWeights (7da3407)
- graph: skip CUDA graph capture during prefill (seqLen > 1) (e5f9ce0)
- graph: skip K-quant storage types in PreUploadFrozenWeights (23ba86d)
- graph: skip Q4Storage in EnsureCaptureInputsGPU (e4d4613)
- graph: skip quantized tensors with GPU pointers in PreUploadFrozenWeights (a7e361c)
- graph: sort Parameters() by name and add LoadParameters method ([c1b853b](c1b853b21...