Releases: m96-chan/PyGPUkit

v0.2.10: Dynamic cuBLASLt + CUDA Graph Optimizations

18 Dec 12:42
cbbf111

What's New in v0.2.10

Dynamic cuBLASLt Loading

  • Driver-only deployment: cuBLASLt is now loaded dynamically at runtime
  • No CUDA Toolkit installation required for end users
  • Automatic fallback to CUTLASS kernels when cuBLASLt is unavailable (see the sketch below)
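
Backend selection is transparent to user code. Below is a minimal sketch using the public API shown elsewhere in these notes; the fallback decision itself happens inside the library:

import numpy as np
import pygpukit as gpk

# The same call works whether cuBLASLt was found at runtime or the
# CUTLASS fallback is in use; no CUDA Toolkit install is required.
a = gpk.from_numpy(np.random.randn(512, 512).astype(np.float32))
b = gpk.from_numpy(np.random.randn(512, 512).astype(np.float32))
c = a @ b  # dispatched to cuBLASLt if available, otherwise CUTLASS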

Performance Optimizations

  • GEMM descriptor caching: 2.67x faster matmul operations (395 ms → 148 ms for 224 matmuls; see the sketch below)
  • Zero-allocation decode path: Eliminated GPU memory allocations during token generation
  • CUDA Graph integration: Reduced kernel launch overhead
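
Descriptor caching helps workloads that repeat matmuls with identical shapes and dtypes, as in transformer decode loops. A rough way to observe it with the public API (timing details such as device synchronization are glossed over here):

import time
import numpy as np
import pygpukit as gpk

a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))

_ = a @ b  # first call builds and caches the GEMM descriptor
start = time.perf_counter()
for _ in range(224):
    c = a @ b  # same shapes/dtypes reuse the cached descriptor
elapsed = time.perf_counter() - start
# Note: if launches are asynchronous, this measures submission time;
# absolute numbers will differ from the figures quoted above.
print(f"224 matmuls: {elapsed * 1000:.0f} ms")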

LLM Inference Improvements

  • Flash Attention for prefill phase
  • Flash Decoding for decode phase
  • GPU-based top-k sampling with temperature scaling (see the sketch below)
  • Chat template support (Qwen, LLaMA, Mistral, ChatML)
  • Streaming generation with generate_stream()
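
The GPU sampling step combines temperature scaling with top-k filtering. For reference only, here is a NumPy sketch of the equivalent host-side logic; the actual kernel operates on device-resident logits, and the names here are illustrative:

import numpy as np

def sample_top_k(logits, k=50, temperature=0.8, seed=None):
    # Temperature-scale the logits, keep the k largest, softmax, sample.
    rng = np.random.default_rng(seed)
    scaled = logits / temperature
    top = np.argpartition(scaled, -k)[-k:]            # indices of the k largest logits
    probs = np.exp(scaled[top] - scaled[top].max())   # numerically stable softmax over top-k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

next_token = sample_top_k(np.random.randn(32000).astype(np.float32))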

New Features

  • GPUArray.narrow() for zero-copy tensor slicing (see the sketch below)
  • Fused QKV projection infrastructure
  • GPU position buffer for CUDA Graph replay without recapture
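
GPUArray.narrow() returns a view into existing device memory without copying. Its exact signature is not spelled out in these notes; the sketch below assumes a torch-style narrow(dim, start, length):

import numpy as np
import pygpukit as gpk

x = gpk.from_numpy(np.arange(16, dtype=np.float32).reshape(4, 4))
rows = x.narrow(0, 1, 2)  # assumed (dim, start, length): rows 1-2 as a zero-copy view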

Build System

  • SM100/SM120 (Blackwell) CUTLASS kernel infrastructure
  • CUDA 13.x compatibility improvements

v0.2.9

16 Dec 04:58
10fc369

What's Changed

  • feat(v0.2.9): Unified LLM Interface with ModelSpec Abstraction by @m96-chan in #80

Full Changelog: v0.2.8...v0.2.9

v0.2.8

15 Dec 23:28

Full Changelog: v0.2.7...v0.2.8

v0.2.7

15 Dec 23:04

What's Changed

  • v0.2.7: Epilogue Fusion, Multi-SM Kernels, Documentation by @m96-chan in #76

Full Changelog: v0.2.6...v0.2.7

v0.2.6

15 Dec 13:05
0555428

What's Changed

  • v0.2.6: Multi-LLM async execution, FP16/BF16 TensorCore, LLM support by @m96-chan in #66
  • v0.2.6: CUTLASS backend, Multi-LLM concurrent execution by @m96-chan in #70

Full Changelog: v0.2.5...v0.2.6

v0.2.5: FP16/BF16 Support, TF32 Optimization

15 Dec 06:21

What's New in v0.2.5

FP16 / BF16 Data Types

  • FP16 (float16): Half-precision floating point
  • BF16 (bfloat16): Brain floating point (better dynamic range)
  • FP32 Accumulation: Numerical stability via FP32 intermediate accumulation
  • Type Conversion: astype() for seamless dtype conversion
import pygpukit as gpk
import numpy as np

# FP16 operations
a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
c = a @ b  # FP16 matmul

# BF16 operations
arr = np.random.randn(1024, 1024).astype(np.float32)
a_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
b_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
c_bf16 = a_bf16 @ b_bf16  # BF16 matmul
result = c_bf16.astype(gpk.float32)  # Convert back to FP32

Reduction Operations

Operation     Description
gpk.sum(a)    Sum of all elements
gpk.mean(a)   Mean of all elements
gpk.max(a)    Maximum element
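
Usage with the operations above:

import numpy as np
import pygpukit as gpk

a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float32))
total = gpk.sum(a)   # sum over all elements
avg = gpk.mean(a)    # mean over all elements
peak = gpk.max(a)    # maximum element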

Operator Overloads

c = a + b   # Element-wise add
c = a - b   # Element-wise subtract
c = a * b   # Element-wise multiply
c = a / b   # Element-wise divide
c = a @ b   # Matrix multiplication

TF32 v2 Optimization

  • PTX mma.sync implementation achieving ~30 TFLOPS
  • Double-buffered cp.async pipeline

Performance (RTX 3090 Ti)

Matrix Size   FP32          TF32          FP16         BF16
2048×2048     9.6 TFLOPS    13.2 TFLOPS   2.4 TFLOPS   2.4 TFLOPS
4096×4096     14.7 TFLOPS   22.8 TFLOPS   2.4 TFLOPS   2.3 TFLOPS
8192×8192     16.7 TFLOPS   29.7 TFLOPS   2.3 TFLOPS   2.3 TFLOPS

Note: FP16/BF16 use simple kernels. TensorCore optimization planned for v0.2.6.

Additional Improvements

  • JIT compiler stabilization with persistent kernel cache
  • Comprehensive benchmark.py for performance measurement
  • 166 tests passing

Closed Issues

  • #58 FP16/BF16 Support Types
  • #59 Operator Support
  • #54 Kernel Cache LRU Persistence
  • #55 JIT Compiler Stabilization

Full Changelog: v0.2.4...v0.2.5

v0.2.4

14 Dec 13:56
6ddc6b0

What's Changed

  • feat(v0.2.4): single-binary distribution with dynamic NVRTC loading by @m96-chan in #57

Full Changelog: v0.2.3...v0.2.4

v0.2.3 — TF32 TensorCore Phase

14 Dec 07:07

Highlights

TF32 TensorCore GEMM achieving 27.5 TFLOPS on RTX 3090 Ti (8192×8192 matrices)

This release adds TensorCore acceleration using TF32 precision, providing ~1.5x speedup over FP32 for large matrix multiplications.

New Features

TF32 TensorCore GEMM (#45)

  • Direct TensorCore access via PTX mma.sync.aligned.m16n8k8 instruction
  • cp.async double-buffered pipeline for memory latency hiding
  • TF32 precision: 10-bit mantissa (vs FP32's 23-bit), ~0.1% per-op error (illustration below)
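
The ~0.1% figure follows from the 10-bit mantissa: truncating an FP32 value to 10 mantissa bits changes it by less than one part in 2^10 (about 9.8e-4). A NumPy illustration (truncation is used here for simplicity; the hardware conversion rounds):

import numpy as np

def to_tf32(x):
    # Keep sign, 8 exponent bits, and the top 10 mantissa bits,
    # zeroing the low 13 mantissa bits of each FP32 value.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.random.randn(1_000_000).astype(np.float32)
rel_err = np.abs(to_tf32(x) - x) / np.abs(x)
print(rel_err.max())  # below 2**-10, i.e. on the order of 0.1%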

Python API (#46)

  • gp.matmul(a, b, use_tf32=True) - explicit TF32 control
  • gp.get_device_capabilities() - query TensorCore support

Documentation (#47)

Infrastructure (#48)

  • Driver-only build mode support for LaunchConfig
  • Portable Dim3 struct for non-cudart builds

Benchmark Results (RTX 3090 Ti)

Matrix Size   FP32          TF32          Speedup
2048×2048     7.6 TFLOPS    10.2 TFLOPS   1.34x
4096×4096     13.2 TFLOPS   19.5 TFLOPS   1.48x
8192×8192     18.2 TFLOPS   27.5 TFLOPS   1.51x

Requirements

  • SM 80+ (Ampere architecture: RTX 30XX, A100, etc.)
  • CUDA 11+ with NVRTC
  • Python 3.10+

Usage

import pygpukit as gp

# Check TensorCore support
caps = gp.get_device_capabilities()
print(f"TensorCore: {caps['has_tensor_cores']}")

# TF32 matrix multiplication
a = gp.randn((4096, 4096), dtype="float32")
b = gp.randn((4096, 4096), dtype="float32")
c = gp.matmul(a, b, use_tf32=True)  # ~1.5x faster

Full Changelog

  • feat(tf32): add TF32 TensorCore GEMM kernel achieving 27 TFLOPS (#45)
  • feat(api): add matmul(use_tf32=True) and DeviceCapabilities (#46)
  • docs(v0.2.3): add TF32 design doc and demo script (#47)
  • fix(jit): add driver-only mode support for LaunchConfig (#48)

v0.2.2 - Ampere-Optimized SGEMM

13 Dec 06:12
412b550

Highlights

  • Ampere-Optimized SGEMM with cp.async software pipeline
  • 18.2 TFLOPS on RTX 3090 Ti (46% efficiency, 22x vs NumPy)
  • 75% performance improvement over previous version

Performance (RTX 3090 Ti)

Matrix Size   TFLOPS   Efficiency   vs NumPy
2048x2048     7.6      19%          10x
4096x4096     13.2     33%          16x
8192x8192     18.2     46%          22x

Key Features

  • 4-stage software pipeline with cp.async for async memory transfers
  • float4 (16-byte) vectorized loads for both A and B matrices
  • BM=128, BN=128, BK=16 tiling with 8x8 thread tiles
  • ~74 KB shared memory usage per block (rough accounting below)
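
A rough accounting for that figure, assuming 4-byte elements; the per-row padding mentioned at the end is an assumption (for shared-memory bank-conflict avoidance), not taken from the release notes:

# Staging buffers for one 128x128 output tile with BK = 16, 4 pipeline stages:
a_tile = 128 * 16 * 4            # A tile: 8192 bytes per stage
b_tile = 16 * 128 * 4            # B tile: 8192 bytes per stage
print((a_tile + b_tile) * 4)     # 65536 bytes (64 KiB) across 4 stages
# Per-row padding to avoid bank conflicts accounts for the remainder
# up to the quoted ~74 KB.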

Requirements

  • SM 80+ (Ampere architecture) required for cp.async support
  • RTX 30XX series or newer GPUs

Full Changelog

v0.2.0...v0.2.2

v0.2.0 - Rust-powered Scheduler

12 Dec 12:28

PyGPUkit v0.2.0 - Rust-powered Scheduler

This release introduces a complete Rust backend for the GPU scheduler, memory management, and kernel dispatch.

New Features

Core Infrastructure (Rust)

  • Memory Pool - LRU eviction, size-class free lists
  • Scheduler - Priority queue, memory reservation
  • Transfer Engine - Separate H2D/D2H streams, priority
  • Kernel Dispatch - Per-stream limits, lifecycle tracking

Advanced Features (Rust)

  • Admission Control - Deterministic admission, quota enforcement
  • QoS Policy Framework - Guaranteed/Burstable/BestEffort tiers
  • Kernel Pacing Engine - Bandwidth-based throttling per stream
  • Micro-Slicing Framework - Kernel splitting, round-robin fairness
  • Pinned Memory Support - Page-locked host memory with pooling
  • Kernel Cache - PTX caching, LRU eviction, TTL
  • GPU Partitioning - Resource isolation, multi-tenant support
  • Tiled Matmul - Shared memory + double buffering

Performance (RTX 3090 Ti)

Matrix Size   Performance   vs NumPy
512x512       1262 GFLOPS   11.6x
1024x1024     1350 GFLOPS   2.2x
2048x2048     4417 GFLOPS   6.1x
4096x4096     6555 GFLOPS   7.9x

Test Coverage

  • 106 Rust unit tests passing
  • Full PyO3 bindings for all new types
  • Comprehensive demo: examples/demo_v02_full.py

Installation

pip install pygpukit==0.2.0

Breaking Changes

  • Python 3.10+ required (was 3.9+)

Full Changelog

v0.1.3...v0.2.0