Releases: zerfoo/ztensor
v1.4.0
v1.3.0
1.3.0 (2026-04-03)
Features
- graph: add CompilePJRT for PJRT backend compilation (dfd77a4)
- pjrt: add buffer management (host-device transfer, readback, lifecycle) (9b5dc75)
- pjrt: add KV cache I/O rewriting and executable cache (c8decc5)
- pjrt: add PJRT C API purego bindings for plugin loading, client, and device (c675807)
- pjrt: add program execution, serialization, and full StableHLO emitter (382ea0a)
- pjrt: add StableHLO program compilation wrapper (7fcdde7)
- stablehlo: add emitter for element-wise and unary ops (499cef2)
- stablehlo: add emitter for MatMul and structural ops (13d87df)
- stablehlo: add emitter for reductions and Softmax decomposition (c07b287)
- stablehlo: add MLIR type system and SSA naming (7c68d1e)
- stablehlo: add shape inference for arithmetic ops (cac094e)
- stablehlo: add shape inference for structural ops (8bf132c)
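For readers unfamiliar with the "Softmax decomposition" entry above: StableHLO has no dedicated softmax op, so an emitter typically lowers it to a reduce-max, a subtract, an exponential, a reduce-sum, and a divide. The Go sketch below mirrors those five steps on a single row in plain CPU code. The function name `softmaxDecomposed` is hypothetical and is not part of ztensor's API; the actual emitter may order or fuse the steps differently.

```go
package main

import (
	"fmt"
	"math"
)

// softmaxDecomposed is a hypothetical helper (not ztensor's API) that computes
// softmax over one row using only the primitive steps a StableHLO-style
// lowering would usually emit.
func softmaxDecomposed(row []float64) []float64 {
	if len(row) == 0 {
		return nil
	}

	// Step 1: reduce-max over the row, used for numerical stability.
	maxV := math.Inf(-1)
	for _, v := range row {
		if v > maxV {
			maxV = v
		}
	}

	// Steps 2-4: subtract the max, exponentiate, and reduce-sum.
	out := make([]float64, len(row))
	sum := 0.0
	for i, v := range row {
		out[i] = math.Exp(v - maxV)
		sum += out[i]
	}

	// Step 5: divide each exponential by the sum.
	for i := range out {
		out[i] /= sum
	}
	return out
}

func main() {
	fmt.Println(softmaxDecomposed([]float64{1, 2, 3}))
}
```

Subtracting the row maximum before exponentiating keeps large logits from overflowing, which is why the reduce-max comes first in the usual decomposition.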
Bug Fixes
v1.2.0
1.2.0 (2026-04-01)
Features
- cuda: add Q6_K, Q5_K, Q5_0 GPU dequant kernels for M>1 prefill (d57e37e)
- cuda: add Q8 Gather kernel for GPU embedding lookup (30eb9c4)
- tensor: add QuantizeQ4K for float32 to Q4_K quantization (d0d3a82)
Bug Fixes
- compute: add Q4KStorage to UploadWeights F32 skip list (cc071b6)
- compute: CPU dequant fallback for Q4_K when K%256!=0 (f50ffa7)
- compute: use dequant+cuBLAS for Q4_K when K%256!=0 (5f21cbb)
- compute: use pool-backed GPUStorage for pool allocations (4367330)
- cuda: byte-wise loads in Q5_0 GEMV for ARM64 alignment (5f19e54)
- kernels: check null function pointer in FusedSoftmaxVMulF32 (935ad61)
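The two `K%256!=0` fixes above relate to the Q4_K layout: in the GGML/GGUF format, Q4_K packs weights into super-blocks of 256 values, so a kernel that walks whole super-blocks along K cannot be used directly when K is not a multiple of 256. The Go sketch below shows that kind of dispatch check in isolation. The names `chooseQ4KPath`, `"fused-q4k"`, and `"dequant-then-gemm"` are hypothetical labels for illustration and do not come from ztensor; the real selection logic may consider more conditions.

```go
package main

import "fmt"

// superBlockSize is the number of weights in one Q4_K super-block
// in the GGML/GGUF quantization format.
const superBlockSize = 256

// chooseQ4KPath is a hypothetical dispatcher (not ztensor's API). It returns
// a label for the matmul path a Q4_K weight with inner dimension k could take:
// the fused quantized kernel when k is a whole number of super-blocks,
// otherwise a dequantize-to-float32 step followed by a dense GEMM.
func chooseQ4KPath(k int) string {
	if k > 0 && k%superBlockSize == 0 {
		return "fused-q4k"
	}
	return "dequant-then-gemm"
}

func main() {
	for _, k := range []int{4096, 4000} {
		fmt.Printf("K=%d -> %s\n", k, chooseQ4KPath(k))
	}
}
```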
Performance Improvements
- cuda: separated GPU layout for Q5_0 GEMV (d456c39)
v1.1.3
1.1.3 (2026-04-01)
Bug Fixes
v1.1.2
v1.1.1
v1.1.0
1.1.0 (2026-03-31)
Features
- compute: add GPUFusedSoftmaxVMul method with provider interface (d659e76)
- compute: add GPURepeatInterleave method with purego bindings (6af7b96)
- compute: add GraphCapturer interface for CUDA graph capture/replay (1f37c69)
- compute: GPU-native Copy using cudaMemcpyAsync D2D (efc8b42)
- compute: wire capture-aware pool into GPUEngine BeginCapture/EndCapture (e39b318)
- cuda: add cudaMallocAsync and cudaFreeAsync bindings (e339656)
- cuda: add cudaMemsetAsync binding and GPU-native Zero (47b5d39)
- cuda: add fused repeat-interleave kernel for GQA head expansion (91e2469)
- cuda: add fused softmax + V multiply kernel for decode attention (ef6f7ce)
- cuda: make MemPool capture-aware with SetCaptureStream (58b6337)
- gpuapi: wire FusedSoftmaxVMulF32 into KernelRunner interface (9afdb01)