Releases: m96-chan/PyGPUkit
v0.2.10: Dynamic cuBLASLt + CUDA Graph Optimizations
What's New in v0.2.10
Dynamic cuBLASLt Loading
- Driver-only deployment: cuBLASLt is now loaded dynamically at runtime
- No CUDA Toolkit installation required for end users
- Automatic fallback to CUTLASS kernels when cuBLASLt is unavailable
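The driver-only loading pattern can be sketched with `ctypes`: probe the system loader for cuBLASLt at runtime and fall back when it is absent. This is an illustrative host-side sketch, not PyGPUkit's actual loader; the function name and the candidate library names are assumptions.

```python
import ctypes
import ctypes.util

def select_gemm_backend():
    """Probe for cuBLASLt at runtime; fall back to bundled CUTLASS kernels.

    Sketch of driver-only deployment: the library is located via the
    system loader instead of being linked against the CUDA Toolkit at
    build time, so end users need only the driver.
    """
    for name in ("cublasLt", "cublasLt64_12", "cublasLt64_13"):
        path = ctypes.util.find_library(name)
        if path is not None:
            try:
                ctypes.CDLL(path)
                return "cublaslt"
            except OSError:
                pass  # present but unloadable: keep probing
    return "cutlass"  # fallback: statically bundled kernels

backend = select_gemm_backend()
```

On a machine without the CUDA Toolkit or runtime libraries, the probe fails cleanly and the CUTLASS path is selected.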
Performance Optimizations
- GEMM descriptor caching: 2.67x faster matmul operations (395ms → 148ms for 224 matmuls)
- Zero-allocation decode path: Eliminated GPU memory allocations during token generation
- CUDA Graph integration: Reduced kernel launch overhead
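The descriptor-caching speedup comes from amortizing setup cost: building a cuBLASLt matmul descriptor involves several driver calls, so keying a cache by the problem signature lets repeated matmuls of the same shape skip that work. A minimal sketch of the pattern (not PyGPUkit's internals; names are illustrative):

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def get_matmul_descriptor(m, n, k, dtype, transpose_a=False, transpose_b=False):
    # Stand-in for cublasLtMatmulDescCreate + matrix layout setup; a real
    # implementation would return handles to driver-side objects.
    return {"m": m, "n": n, "k": k, "dtype": dtype,
            "ta": transpose_a, "tb": transpose_b}

# Repeated calls with the same problem shape hit the cache instead of
# rebuilding the descriptor on every matmul.
d1 = get_matmul_descriptor(1024, 1024, 1024, "float16")
d2 = get_matmul_descriptor(1024, 1024, 1024, "float16")
assert d1 is d2  # same cached object, zero rebuild cost
```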
LLM Inference Improvements
- Flash Attention for prefill phase
- Flash Decoding for decode phase
- GPU-based top-k sampling with temperature scaling
- Chat template support (Qwen, LLaMA, Mistral, ChatML)
- Streaming generation with `generate_stream()`
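The top-k sampling step above can be written as a CPU reference in NumPy (the release's kernel runs this on the GPU; the function name and signature here are illustrative, not PyGPUkit's API):

```python
import numpy as np

def topk_sample(logits, k=50, temperature=1.0, rng=None):
    """CPU reference for top-k sampling with temperature scaling:
    scale logits, keep the k largest, softmax over them, draw one token."""
    rng = rng or np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    topk_idx = np.argpartition(scaled, -k)[-k:]      # unordered top-k indices
    topk_logits = scaled[topk_idx]
    probs = np.exp(topk_logits - topk_logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(topk_idx[rng.choice(k, p=probs)])
```

Lower temperatures sharpen the distribution toward the argmax; `k` bounds how many candidates can ever be drawn.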
New Features
- `GPUArray.narrow()` for zero-copy tensor slicing
- Fused QKV projection infrastructure
- GPU position buffer for CUDA Graph replay without recapture
Build System
- SM100/SM120 (Blackwell) CUTLASS kernel infrastructure
- CUDA 13.x compatibility improvements
v0.2.9
What's Changed
Full Changelog: v0.2.8...v0.2.9
v0.2.8
Full Changelog: v0.2.7...v0.2.8
v0.2.7
What's Changed
Full Changelog: v0.2.6...v0.2.7
v0.2.6
v0.2.5: FP16/BF16 Support, TF32 Optimization
What's New in v0.2.5
FP16 / BF16 Data Types
- FP16 (float16): Half-precision floating point
- BF16 (bfloat16): Brain floating point (better dynamic range)
- FP32 Accumulation: Numerical stability via FP32 intermediate
- Type Conversion: `astype()` for seamless dtype conversion
```python
import pygpukit as gpk
import numpy as np

# FP16 operations
a = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
b = gpk.from_numpy(np.random.randn(1024, 1024).astype(np.float16))
c = a @ b  # FP16 matmul

# BF16 operations
arr = np.random.randn(1024, 1024).astype(np.float32)
a_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
b_bf16 = gpk.from_numpy(arr).astype(gpk.bfloat16)
c_bf16 = a_bf16 @ b_bf16  # BF16 matmul
result = c_bf16.astype(gpk.float32)  # Convert back to FP32
```
Reduction Operations
| Operation | Description |
|---|---|
| `gpk.sum(a)` | Sum of all elements |
| `gpk.mean(a)` | Mean of all elements |
| `gpk.max(a)` | Maximum element |
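The FP32-accumulation point above is easy to see on the host: summing many FP16 values in an FP16 accumulator overflows past the FP16 maximum (65504), while an FP32 accumulator stays exact. A NumPy demonstration of the same principle:

```python
import numpy as np

# Summing 70,000 FP16 ones: the true total (70000) exceeds the FP16
# maximum of 65504, so an FP16 accumulator overflows to inf, while an
# FP32 accumulator returns the exact count.
x = np.ones(70000, dtype=np.float16)
assert np.isinf(x.sum(dtype=np.float16))   # FP16 accumulation overflows
assert x.sum(dtype=np.float32) == 70000.0  # FP32 accumulation is exact
```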
Operator Overloads
```python
c = a + b  # Element-wise add
c = a - b  # Element-wise subtract
c = a * b  # Element-wise multiply
c = a / b  # Element-wise divide
c = a @ b  # Matrix multiplication
```
TF32 v2 Optimization
- PTX `mma.sync` implementation achieving ~30 TFLOPS
- Double-buffered `cp.async` pipeline
Performance (RTX 3090 Ti)
| Matrix Size | FP32 | TF32 | FP16 | BF16 |
|---|---|---|---|---|
| 2048×2048 | 9.6 TFLOPS | 13.2 TFLOPS | 2.4 TFLOPS | 2.4 TFLOPS |
| 4096×4096 | 14.7 TFLOPS | 22.8 TFLOPS | 2.4 TFLOPS | 2.3 TFLOPS |
| 8192×8192 | 16.7 TFLOPS | 29.7 TFLOPS | 2.3 TFLOPS | 2.3 TFLOPS |
Note: FP16/BF16 use simple kernels. TensorCore optimization planned for v0.2.6.
Additional Improvements
- JIT compiler stabilization with persistent kernel cache
- Comprehensive `benchmark.py` for performance measurement
- 166 tests passing
Closed Issues
- #58 FP16/BF16 Support Types
- #59 Operator Support
- #54 Kernel Cache LRU Persistence
- #55 JIT Compiler Stabilization
Full Changelog: v0.2.4...v0.2.5
v0.2.4
What's Changed
Full Changelog: v0.2.3...v0.2.4
v0.2.3 — TF32 TensorCore Phase
Highlights
TF32 TensorCore GEMM achieving 27.5 TFLOPS on RTX 3090 Ti (8192×8192 matrices)
This release adds TensorCore acceleration using TF32 precision, providing ~1.5x speedup over FP32 for large matrix multiplications.
New Features
TF32 TensorCore GEMM (#45)
- Direct TensorCore access via PTX `mma.sync.aligned.m16n8k8` instruction
- `cp.async` double-buffered pipeline for memory latency hiding
- TF32 precision: 10-bit mantissa, 19 bits total (vs FP32's 23-bit mantissa), ~0.1% per-op error
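The ~0.1% per-op error figure follows from the 10-bit mantissa. TF32 precision can be emulated on the host by clearing the low 13 mantissa bits of an FP32 value (hardware TF32 rounds rather than truncates, so this is a close stand-in, not an exact model):

```python
import numpy as np

def tf32_round(x):
    """Emulate TF32 precision by zeroing the low 13 mantissa bits of FP32,
    leaving the 10 explicit mantissa bits that TF32 keeps."""
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFFE000)).view(np.float32)

x = np.float32(1.2345678)
rel_err = abs(float(tf32_round(x)) - float(x)) / float(x)
assert rel_err < 2**-10  # sub-0.1% per-element error, matching the notes
```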
Python API (#46)
- `gp.matmul(a, b, use_tf32=True)` - explicit TF32 control
- `gp.get_device_capabilities()` - query TensorCore support
Documentation (#47)
- TF32 TensorCore Design Document
- Demo Script showcasing v0.2.3 features
Infrastructure (#48)
- Driver-only build mode support for `LaunchConfig`
- Portable `Dim3` struct for non-cudart builds
Benchmark Results (RTX 3090 Ti)
| Matrix Size | FP32 | TF32 | Speedup |
|---|---|---|---|
| 2048×2048 | 7.6 TFLOPS | 10.2 TFLOPS | 1.34x |
| 4096×4096 | 13.2 TFLOPS | 19.5 TFLOPS | 1.48x |
| 8192×8192 | 18.2 TFLOPS | 27.5 TFLOPS | 1.51x |
Requirements
- SM 80+ (Ampere architecture: RTX 30XX, A100, etc.)
- CUDA 11+ with NVRTC
- Python 3.10+
Usage
```python
import pygpukit as gp

# Check TensorCore support
caps = gp.get_device_capabilities()
print(f"TensorCore: {caps['has_tensor_cores']}")

# TF32 matrix multiplication
a = gp.randn((4096, 4096), dtype="float32")
b = gp.randn((4096, 4096), dtype="float32")
c = gp.matmul(a, b, use_tf32=True)  # ~1.5x faster
```
Full Changelog
v0.2.2 - Ampere-Optimized SGEMM
Highlights
- Ampere-Optimized SGEMM with `cp.async` software pipeline
- 18.2 TFLOPS on RTX 3090 Ti (46% efficiency, 22x vs NumPy)
- 75% performance improvement over previous version
Performance (RTX 3090 Ti)
| Matrix Size | TFLOPS | Efficiency | vs NumPy |
|---|---|---|---|
| 2048x2048 | 7.6 | 19% | 10x |
| 4096x4096 | 13.2 | 33% | 16x |
| 8192x8192 | 18.2 | 46% | 22x |
Key Features
- 4-stage software pipeline with `cp.async` for async memory transfers
- float4 (16-byte) vectorized loads for both A and B matrices
- BM=128, BN=128, BK=16 tiling with 8x8 thread tiles
- ~74 KB shared memory usage per block
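The ~74 KB figure can be roughly sanity-checked from the tile sizes above, assuming each of the 4 pipeline stages holds one 128×16 FP32 tile of A and one 16×128 FP32 tile of B (the attribution of the remainder to bank-conflict padding is our assumption, not stated in the release):

```python
# Rough shared-memory budget for the v0.2.2 tiling (FP32 = 4 bytes).
BM, BN, BK, STAGES = 128, 128, 16, 4
a_tile = BM * BK * 4                      # 8 KB of A per stage
b_tile = BK * BN * 4                      # 8 KB of B per stage
unpadded = STAGES * (a_tile + b_tile)     # all pipeline stages resident
assert unpadded == 64 * 1024              # 64 KB before any padding
# Layout padding to avoid shared-memory bank conflicts would account for
# the remaining ~10 KB that brings the reported figure to ~74 KB per block.
```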
Requirements
- SM 80+ (Ampere architecture) required for `cp.async` support
- RTX 30XX series or newer GPUs
Full Changelog
v0.2.0 - Rust-powered Scheduler
PyGPUkit v0.2.0 - Rust-powered Scheduler
This release introduces a complete Rust backend for the GPU scheduler, memory management, and kernel dispatch.
New Features
Core Infrastructure (Rust)
- Memory Pool - LRU eviction, size-class free lists
- Scheduler - Priority queue, memory reservation
- Transfer Engine - Separate H2D/D2H streams, priority
- Kernel Dispatch - Per-stream limits, lifecycle tracking
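The memory pool's size-class free lists work by rounding each request up to a fixed class and recycling freed blocks of that class before allocating new ones. A host-side toy sketch of the idea (not the Rust implementation; class names and rounding policy are illustrative):

```python
from collections import defaultdict

class SizeClassPool:
    """Minimal size-class free-list allocator: requests round up to the
    next power of two, and freed blocks are reused before new ones are
    created, avoiding repeated allocations for same-sized buffers."""

    def __init__(self):
        self.free_lists = defaultdict(list)  # size class -> free blocks
        self.created = 0                     # blocks actually allocated

    def _size_class(self, size):
        return 1 << max(size - 1, 0).bit_length()  # next power of two

    def alloc(self, size):
        cls = self._size_class(size)
        if self.free_lists[cls]:
            return self.free_lists[cls].pop()  # reuse: no new allocation
        self.created += 1
        return bytearray(cls)                  # stand-in for a GPU block

    def free(self, block):
        self.free_lists[len(block)].append(block)
```

Two requests of 1000 and 900 bytes land in the same 1024-byte class, so freeing the first satisfies the second without touching the allocator again; the Rust pool layers LRU eviction on top of this when the device fills up.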
Advanced Features (Rust)
- Admission Control - Deterministic admission, quota enforcement
- QoS Policy Framework - Guaranteed/Burstable/BestEffort tiers
- Kernel Pacing Engine - Bandwidth-based throttling per stream
- Micro-Slicing Framework - Kernel splitting, round-robin fairness
- Pinned Memory Support - Page-locked host memory with pooling
- Kernel Cache - PTX caching, LRU eviction, TTL
- GPU Partitioning - Resource isolation, multi-tenant support
- Tiled Matmul - Shared memory + double buffering
Performance (RTX 3090 Ti)
| Matrix Size | Performance | vs NumPy |
|---|---|---|
| 512x512 | 1262 GFLOPS | 11.6x |
| 1024x1024 | 1350 GFLOPS | 2.2x |
| 2048x2048 | 4417 GFLOPS | 6.1x |
| 4096x4096 | 6555 GFLOPS | 7.9x |
Test Coverage
- 106 Rust unit tests passing
- Full PyO3 bindings for all new types
- Comprehensive demo: `examples/demo_v02_full.py`
Installation
```
pip install pygpukit==0.2.0
```
Breaking Changes
- Python 3.10+ required (was 3.9+)