Releases: m96-chan/PyGPUkit
v0.2.10: Dynamic cuBLASLt + CUDA Graph Optimizations
What's New in v0.2.10
Dynamic cuBLASLt Loading
- Driver-only deployment: cuBLASLt is now loaded dynamically at runtime
- No CUDA Toolkit installation required for end users
- Automatic fallback to CUTLASS kernels when cuBLASLt is unavailable
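The driver-only loading pattern can be sketched with `ctypes`: probe the system loader for cuBLASLt at runtime and fall back when it is absent. This is an illustrative host-side sketch, not PyGPUkit's actual loader; the function name and the candidate library names are assumptions.

```python
import ctypes
import ctypes.util

def select_gemm_backend():
    """Probe for cuBLASLt at runtime; fall back to bundled CUTLASS kernels.

    Sketch of driver-only deployment: the library is located via the
    system loader instead of being linked against the CUDA Toolkit at
    build time, so end users need only the driver.
    """
    for name in ("cublasLt", "cublasLt64_12", "cublasLt64_13"):
        path = ctypes.util.find_library(name)
        if path is not None:
            try:
                ctypes.CDLL(path)
                return "cublaslt"
            except OSError:
                pass  # present but unloadable: keep probing
    return "cutlass"  # fallback: statically bundled kernels

backend = select_gemm_backend()
```

On a machine without the CUDA Toolkit or runtime libraries, the probe fails cleanly and the CUTLASS path is selected.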
Performance Optimizations
- GEMM descriptor caching: 2.67x faster matmul operations (395ms → 148ms for 224 matmuls)
- Zero-allocation decode path: Eliminated GPU memory allocations during token generation
- CUDA Graph integration: Reduced kernel launch overhead
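The descriptor-caching speedup comes from amortizing setup cost: building a cuBLASLt matmul descriptor involves several driver calls, so keying a cache by the problem signature lets repeated matmuls of the same shape skip that work. A minimal sketch of the pattern (not PyGPUkit's internals; names are illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def get_matmul_descriptor(m, n, k, dtype, transpose_a=False, transpose_b=False):
    # Stand-in for cublasLtMatmulDescCreate + matrix layout setup; a real
    # implementation would return handles to driver-side objects.
    return {"m": m, "n": n, "k": k, "dtype": dtype,
            "ta": transpose_a, "tb": transpose_b}

# Repeated calls with the same problem shape hit the cache instead of
# rebuilding the descriptor on every matmul.
d1 = get_matmul_descriptor(1024, 1024, 1024, "float16")
d2 = get_matmul_descriptor(1024, 1024, 1024, "float16")
assert d1 is d2  # same cached object, zero rebuild cost
```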
LLM Inference Improvements
- Flash Attention for prefill phase
- Flash Decoding for decode phase
- GPU-based top-k sampling with temperature scaling
- Chat template support (Qwen, LLaMA, Mistral, ChatML)
- Streaming generation with `generate_stream()`
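The top-k sampling step above can be written as a CPU reference in NumPy (the release's kernel runs this on the GPU; the function name and signature here are illustrative, not PyGPUkit's API):

```python
import numpy as np

def topk_sample(logits, k=50, temperature=1.0, rng=None):
    """CPU reference for top-k sampling with temperature scaling:
    scale logits, keep the k largest, softmax over them, draw one token."""
    rng = rng or np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    topk_idx = np.argpartition(scaled, -k)[-k:]      # unordered top-k indices
    topk_logits = scaled[topk_idx]
    probs = np.exp(topk_logits - topk_logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(topk_idx[rng.choice(k, p=probs)])
```

Lower temperatures sharpen the distribution toward the argmax; `k` bounds how many candidates can ever be drawn.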
New Features
- `GPUArray.narrow()` for zero-copy tensor slicing
- Fused QKV projection infrastructure
- GPU position buffer for CUDA Graph replay without recapture
Build System
- SM100/SM120 (Blackwell) CUTLASS kernel infrastructure
- CUDA 13.x compatibility improvements
v0.2.9
What's Changed
Full Changelog: v0.2.8...v0.2.9
v0.2.8
Full Changelog: v0.2.7...v0.2.8
v0.2.7
What's Changed
Full Changelog: v0.2.6...v0.2.7
v0.2.6
v0.2.5: FP16/BF16 Support, TF32 Optimization
What's New in v0.2.5
FP16 / BF16 Data Types
- FP16 (float16): Half-precision floating point
- BF16 (bfloat16): Brain floating point (better dynamic range)
- FP32 Accumulation: Numerical stability via FP32 intermediate
- Type Conversion: `astype()` for seamless dtype conversion
```python
import pygpukit as gpk
import numpy as np

# FP16 operations
a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
c = a @ b  # FP16 matmul

# BF16 operations
arr = np.random.randn(1024, 1024).astype(np.float32)
a_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
b_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
c_bf16 = a_bf16 @ b_bf16  # BF16 matmul
result = c_bf16.astype(gpk.float32)  # Convert back to FP32
```
Reduction Operations
| Operation | Description |
|---|---|
| `gpk.sum(a)` | Sum of all elements |
| `gpk.mean(a)` | Mean of all elements |
| `gpk.max(a)` | Maximum element |
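The FP32-accumulation point above is easy to see on the host: summing many FP16 values in an FP16 accumulator overflows past the FP16 maximum (65504), while an FP32 accumulator stays exact. A NumPy demonstration of the same principle:

```python
import numpy as np

# Summing 70,000 FP16 ones: the true total (70000) exceeds the FP16
# maximum of 65504, so an FP16 accumulator overflows to inf, while an
# FP32 accumulator returns the exact count.
x = np.ones(70000, dtype=np.float16)
assert np.isinf(x.sum(dtype=np.float16))   # FP16 accumulation overflows
assert x.sum(dtype=np.float32) == 70000.0  # FP32 accumulation is exact
```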
Operator Overloads
```python
c = a + b  # Element-wise add
c = a - b  # Element-wise subtract
c = a * b  # Element-wise multiply
c = a / b  # Element-wise divide
c = a @ b  # Matrix multiplication
```
TF32 v2 Optimization
- PTX `mma.sync` implementation achieving ~30 TFLOPS
- Double-buffered `cp.async` pipeline
Performance (RTX 3090 Ti)
| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|---|---|---|---|---|
| 2048×2048 | 9.6 TFLOPS | 13.2 TFLOPS | 2.4 TFLOPS | 2.4 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22.8 TFLOPS | 2.4 TFLOPS | 2.3 TFLOPS |
| 8192×8192 | 16.7 TFLOPS | 29.7 TFLOPS | 2.3 TFLOPS | 2.3 TFLOPS |
Note: FP16/BF16 use simple kernels. TensorCore optimization planned for v0.2.6.
Additional Improvements
- JIT compiler stabilization with persistent kernel cache
- Comprehensive `benchmark.py` for performance measurement
- 166 tests passing
Closed Issues
- #58 FP16/BF16 Support Types
- #59 Operator Support
- #54 Kernel Cache LRU Persistence
- #55 JIT Compiler Stabilization
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
Full Changelog: v0.2.3...v0.2.4
v0.2.3 — TF32 TensorCore Phase
Highlights
TF32 TensorCore GEMM achieving 27.5 TFLOPS on RTX 3090 Ti (8192×8192 matrices)
This release adds TensorCore acceleration using TF32 precision, providing ~1.5x speedup over FP32 for large matrix multiplications.
New Features
TF32 TensorCore GEMM (#45)
- Direct TensorCore access via PTX `mma.sync.aligned.m16n8k8` instruction
- `cp.async` double-buffered pipeline for memory latency hiding
- TF32 precision: 10-bit mantissa, 19 bits total (vs FP32's 23-bit mantissa), ~0.1% per-op error
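The ~0.1% per-op error figure follows from the 10-bit mantissa. TF32 precision can be emulated on the host by clearing the low 13 mantissa bits of an FP32 value (hardware TF32 rounds rather than truncates, so this is a close stand-in, not an exact model):

```python
import numpy as np

def tf32_round(x):
    """Emulate TF32 precision by zeroing the low 13 mantissa bits of FP32,
    leaving the 10 explicit mantissa bits that TF32 keeps."""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFFE000)).view(np.float32)

x = np.float32(1.2345678)
rel_err = abs(float(tf32_round(x)) - float(x)) / float(x)
assert rel_err < 2**-10  # sub-0.1% per-element error, matching the notes
```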
Python API (#46)
- `gp.matmul(a, b, use_tf32=True)` - explicit TF32 control
- `gp.get_device_capabilities()` - query TensorCore support
Documentation (#47)
- TF32 TensorCore Design Document
- Demo Script showcasing v0.2.3 features
Infrastructure (#48)
- Driver-only build mode support for `LaunchConfig`
- Portable `Dim3` struct for non-cudart builds
Benchmark Results (RTX 3090 Ti)
| Matrix Size | FP32 | TF32 | Speedup |
|---|---|---|---|
| 2048×2048 | 7.6 TFLOPS | 10.2 TFLOPS | 1.34x |
| 4096×4096 | 13.2 TFLOPS | 19.5 TFLOPS | 1.48x |
| 8192×8192 | 18.2 TFLOPS | 27.5 TFLOPS | 1.51x |
Requirements
- SM 80+ (Ampere architecture: RTX 30XX, A100, etc.)
- CUDA 11+ with NVRTC
- Python 3.10+
Usage
```python
import pygpukit as gp

# Check TensorCore support
caps = gp.get_device_capabilities()
print(f"TensorCore: {caps['has_tensor_cores']}")

# TF32 matrix multiplication
a = gp.randn((4096, 4096), dtype="float32")
b = gp.randn((4096, 4096), dtype="float32")
c = gp.matmul(a, b, use_tf32=True)  # ~1.5x faster
```
Full Changelog
v0.2.2 - Ampere-Optimized SGEMM
Highlights
- Ampere-Optimized SGEMM with `cp.async` software pipeline
- 18.2 TFLOPS on RTX 3090 Ti (46% efficiency, 22x vs NumPy)
- 75% performance improvement over previous version
Performance (RTX 3090 Ti)
| Matrix Size | TFLOPS | Efficiency | vs NumPy |
|---|---|---|---|
| 2048x2048 | 7.6 | 19% | 10x |
| 4096x4096 | 13.2 | 33% | 16x |
| 8192x8192 | 18.2 | 46% | 22x |
Key Features
- 4-stage software pipeline with `cp.async` for async memory transfers
- float4 (16-byte) vectorized loads for both A and B matrices
- BM=128, BN=128, BK=16 tiling with 8x8 thread tiles
- ~74 KB shared memory usage per block
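The ~74 KB figure can be roughly sanity-checked from the tile sizes above, assuming each of the 4 pipeline stages holds one 128×16 FP32 tile of A and one 16×128 FP32 tile of B (the attribution of the remainder to bank-conflict padding is our assumption, not stated in the release):

```python
# Rough shared-memory budget for the v0.2.2 tiling (FP32 = 4 bytes).
BM, BN, BK, STAGES = 128, 128, 16, 4
a_tile = BM * BK * 4                      # 8 KB of A per stage
b_tile = BK * BN * 4                      # 8 KB of B per stage
unpadded = STAGES * (a_tile + b_tile)     # all pipeline stages resident
assert unpadded == 64 * 1024              # 64 KB before any padding
# Layout padding to avoid shared-memory bank conflicts would account for
# the remaining ~10 KB that brings the reported figure to ~74 KB per block.
```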
Requirements
- SM 80+ (Ampere architecture) required for `cp.async` support
- RTX 30XX series or newer GPUs
Full Changelog
v0.2.0 - Rust-powered Scheduler
PyGPUkit v0.2.0 - Rust-powered Scheduler
This release introduces a complete Rust backend for the GPU scheduler, memory management, and kernel dispatch.
New Features
Core Infrastructure (Rust)
- Memory Pool - LRU eviction, size-class free lists
- Scheduler - Priority queue, memory reservation
- Transfer Engine - Separate H2D/D2H streams, priority
- Kernel Dispatch - Per-stream limits, lifecycle tracking
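The memory pool's size-class free lists work by rounding each request up to a fixed class and recycling freed blocks of that class before allocating new ones. A host-side toy sketch of the idea (not the Rust implementation; class names and rounding policy are illustrative):

```python
from collections import defaultdict

class SizeClassPool:
    """Minimal size-class free-list allocator: requests round up to the
    next power of two, and freed blocks are reused before new ones are
    created, avoiding repeated allocations for same-sized buffers."""

    def __init__(self):
        self.free_lists = defaultdict(list)  # size class -> free blocks
        self.created = 0                     # blocks actually allocated

    def _size_class(self, size):
        return 1 << max(size - 1, 0).bit_length()  # next power of two

    def alloc(self, size):
        cls = self._size_class(size)
        if self.free_lists[cls]:
            return self.free_lists[cls].pop()  # reuse: no new allocation
        self.created += 1
        return bytearray(cls)                  # stand-in for a GPU block

    def free(self, block):
        self.free_lists[len(block)].append(block)
```

Two requests of 1000 and 900 bytes land in the same 1024-byte class, so freeing the first satisfies the second without touching the allocator again; the Rust pool layers LRU eviction on top of this when the device fills up.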
Advanced Features (Rust)
- Admission Control - Deterministic admission, quota enforcement
- QoS Policy Framework - Guaranteed/Burstable/BestEffort tiers
- Kernel Pacing Engine - Bandwidth-based throttling per stream
- Micro-Slicing Framework - Kernel splitting, round-robin fairness
- Pinned Memory Support - Page-locked host memory with pooling
- Kernel Cache - PTX caching, LRU eviction, TTL
- GPU Partitioning - Resource isolation, multi-tenant support
- Tiled Matmul - Shared memory + double buffering
Performance (RTX 3090 Ti)
| Matrix Size | Performance | vs NumPy |
|---|---|---|
| 512x512 | 1262 GFLOPS | 11.6x |
| 1024x1024 | 1350 GFLOPS | 2.2x |
| 2048x2048 | 4417 GFLOPS | 6.1x |
| 4096x4096 | 6555 GFLOPS | 7.9x |
Test Coverage
- 106 Rust unit tests passing
- Full PyO3 bindings for all new types
- Comprehensive demo: `examples/demo_v02_full.py`
Installation
```
pip install pygpukit==0.2.0
```
Breaking Changes
- Python 3.10+ required (was 3.9+)