rasptorch is an experimental deep learning library inspired by PyTorch, built with a singular focus: making complex neural networks practical and efficient to run on resource-constrained hardware like the Raspberry Pi 5, by leveraging its GPU capabilities via Vulkan.
The library operates on a multi-layered architecture to maximize hardware utilization:
- CPU Backend (Software): Uses a pure NumPy-backed autograd engine and `nn` module for reliable computation when GPU acceleration is unavailable.
- GPU Backend (Hardware): Features an experimental Vulkan backend for high-speed tensor operations (elementwise math, matmul, reductions) directly on the Pi 5's GPU.
- Interface: Provides a streamlined CLI/Streamlit UI for interactive model building, training, persistence, and inspection.
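The CPU backend's approach can be pictured with a minimal reverse-mode autograd sketch over NumPy arrays. This is illustrative only; the class and method names here are hypothetical, not rasptorch's actual API:

```python
import numpy as np

class Tensor:
    """Minimal autograd node (illustrative, not rasptorch's Tensor)."""
    def __init__(self, data, parents=()):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = np.zeros_like(self.data)
        self._parents = parents
        self._backward_fn = None

    def matmul(self, other):
        out = Tensor(self.data @ other.data, parents=(self, other))
        def backward_fn():
            # d(A@B)/dA = dL/dC @ B^T ; d(A@B)/dB = A^T @ dL/dC
            self.grad += out.grad @ other.data.T
            other.grad += self.data.T @ out.grad
        out._backward_fn = backward_fn
        return out

    def sum(self):
        out = Tensor(self.data.sum(), parents=(self,))
        def backward_fn():
            self.grad += out.grad * np.ones_like(self.data)
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topological order, then chain-rule sweep from the output.
        order, seen = [], set()
        def visit(t):
            if id(t) not in seen:
                seen.add(id(t))
                for p in t._parents:
                    visit(p)
                order.append(t)
        visit(self)
        self.grad = np.ones_like(self.data)
        for t in reversed(order):
            if t._backward_fn:
                t._backward_fn()

a = Tensor([[1.0, 2.0], [3.0, 4.0]])
b = Tensor([[5.0, 6.0], [7.0, 8.0]])
loss = a.matmul(b).sum()
loss.backward()
print(a.grad)  # d(sum(a@b))/da = ones @ b^T
```

The GPU backends keep the same define-then-differentiate structure but dispatch the forward ops to device kernels.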
The Vulkan path relies on real compute shaders compiled to SPIR-V, giving deep control over the underlying hardware.
rasptorch now exposes a backend abstraction API so compute backends can be registered and connected at runtime:
```python
import rasptorch

# Inspect availability
print(rasptorch.available_backends())  # {'cpu': True, 'vulkan': ..., 'opencl': ..., 'cuda': ...}

# Try to connect a backend (falls back to CPU in non-strict mode)
active = rasptorch.connect_backend("vulkan", strict=False)
print(active.name)
```

Built-in backend adapters:

- `numpy` (NumPy adapter; internal key: `cpu`) - Pure NumPy autograd
- `vulkan` (rasptorch Vulkan kernels, with optional CPU fallback) - Optimized for Raspberry Pi 4/5 ⚡
- `opencl` (pyopencl when available, optional CPU fallback)
- `cuda` (CuPy when available, with PyTorch CUDA fallback, optional CPU fallback)
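The adapter/fallback behavior can be pictured with a minimal registry sketch. The classes and the registry itself are hypothetical, not rasptorch's internals; only the `available_backends` / `connect_backend` names mirror the public API above:

```python
# Hypothetical backend registry illustrating strict vs. fallback connect
# semantics; not rasptorch's actual implementation.

class Backend:
    def __init__(self, name, available):
        self.name = name
        self.available = available

_REGISTRY = {
    "cpu": Backend("numpy", available=True),       # NumPy path is always present
    "vulkan": Backend("vulkan", available=False),  # depends on drivers + glslc
}

def available_backends():
    return {key: b.available for key, b in _REGISTRY.items()}

def connect_backend(name, strict=False):
    key = "cpu" if name == "numpy" else name  # user-facing label -> internal key
    backend = _REGISTRY.get(key)
    if backend is not None and backend.available:
        return backend
    if strict:
        raise RuntimeError(f"backend {name!r} unavailable")
    return _REGISTRY["cpu"]  # non-strict mode falls back to the CPU path

active = connect_backend("vulkan", strict=False)
print(active.name)  # falls back to "numpy" because vulkan is marked unavailable
```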
CLI helpers:

```bash
rasptorch backend list
rasptorch backend connect numpy
rasptorch backend connect vulkan --strict

# Benchmark with auto-tuned Vulkan kernel and submission batching
rasptorch --json backend benchmark --backends numpy,vulkan,cuda --size 2048 --iterations 100 --warmup 20 --vulkan-kernel auto --vulkan-autotune-submit --seed 42
```

Note: the user-facing CLI/UI labels the CPU backend as `numpy`. Vulkan benchmark mode uses resident buffers (upload once, repeated on-device matmul, download once).

Performance (optimized): Vulkan achieves ~564 GFLOPS (78% of NumPy) on `matmul_vec4` with auto-tuning.

- `--vulkan-kernel auto` probes `matmul`, `matmul_vec4`, `matmul_a_bt`, and `matmul_a_bt_tiled` (when available) and keeps the fastest path.
- If Vulkan hits `VkErrorDeviceLost`, lower `--vulkan-submit-every` (for example, `4` or `1`) or use auto-tuning.
- Recommended: use `--vulkan-autotune-submit` to jointly probe kernel + submit chunk and pick the fastest stable combination.
- Optimizations: command buffer batching, memory-mapped buffers, automatic kernel selection.
- Tensor Operations: Support for elementwise math, matrix multiplication (`matmul`), reductions, indexing, reshaping, stacking, and broadcasting.
- Layers: Includes standard neural network blocks: `Linear`, `MLP`, `CNN`, `GRU`, `Transformer`, normalization layers, activations, pooling, embeddings, and attention.
- Training Tools: Full suite including optimizers (`SGD`), learning-rate schedulers, gradient clipping, and regularization helpers.
- Persistence: Save and load checkpoint weights without needing the full `torch` dependency.
- Interfaces: CLI (`rasptorch chat`) and Streamlit UI (`rasptorch ui`).
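The torch-free persistence above can be sketched with plain NumPy serialization. This is an illustrative format with made-up parameter names, not necessarily what rasptorch writes on disk:

```python
import io
import numpy as np

# Illustrative torch-free checkpointing: treat model parameters as a dict of
# NumPy arrays and round-trip them through np.savez. Parameter names are
# hypothetical; this is not necessarily rasptorch's on-disk format.
state = {
    "linear.weight": np.random.randn(16, 64).astype(np.float32),
    "linear.bias": np.zeros(16, dtype=np.float32),
}

buf = io.BytesIO()  # stands in for a checkpoint file on disk
np.savez(buf, **state)
buf.seek(0)

with np.load(buf) as ckpt:
    restored = {k: ckpt[k] for k in ckpt.files}

assert all(np.array_equal(state[k], restored[k]) for k in state)
```

The key point is that a checkpoint reduces to named arrays, so no `torch` import is needed to read or write one.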
A. Basic Install (CPU Only): To get the core library components running on the CPU:

```bash
pip install rasptorch
```

B. Development Install (Full Capability): For local development and access to all potential backends:

```bash
pip install -e ".[dev]"
```

C. GPU Mode Prerequisites: To utilize the GPU backend, you must meet these prerequisites:

- Raspberry Pi 5 with working Vulkan drivers.
- The `glslc` shader compiler must be available on your system `PATH`.
- When running, specify the device: `--device gpu` or `--device auto`.
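The `glslc` prerequisite can be preflighted from Python with the standard library. This is a sketch of such a check, not a rasptorch command, and the fallback choice only mirrors what an auto mode might decide:

```python
import shutil

# Preflight for the GPU path: glslc must be on PATH to compile the GLSL
# compute shaders to SPIR-V. (Standalone sketch, not a rasptorch API.)
glslc = shutil.which("glslc")
device = "gpu" if glslc else "cpu"  # assumed approximation of --device auto
print(f"glslc: {glslc or 'not found'} -> suggesting --device {device}")
```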
Start the Interactive Shell:

```bash
uv run rasptorch chat
```

Launch the Web UI:

```bash
uv run rasptorch ui
```

(This will usually open at http://localhost:8501)

Viewing Help: To see all available CLI subcommands:

```bash
uv run rasptorch --help
```

The main.py script controls the operational mode:

- `cpu`: Pure NumPy autograd execution on the CPU.
- `gpu`: Executes the training loop explicitly using the Vulkan backend kernels.
- `gpu-autograd`: An experimental mode for tracing gradients across the GPU pipeline.

Example Training Command:

```bash
uv run main.py --device gpu --epochs 50 --batch-size 32 --lr 0.01
```

rasptorch provides a built-in benchmark tool for comparing backend performance on matrix multiplication:
Quick Benchmark (Single Size):

```bash
# Benchmark with default settings (2048x2048 matmul, 100 iterations)
uv run rasptorch backend benchmark

# Benchmark with custom size and multiple backends
uv run rasptorch --json backend benchmark --backends numpy,vulkan,cuda --size 2048 --iterations 100 --warmup 20 --seed 42
```

Performance Results (Raspberry Pi 5, 2048x2048 matmul, optimized):
| Backend | Time (s) | Iterations/s | GFLOPS | Status |
|---|---|---|---|---|
| NumPy | 2.25 | 44.4 | 763 | Reference |
| Vulkan (auto-tuned) | 3.15 | 31.8 | 546 | ⚡ GPU |
| CUDA (when available) | 0.56 | 178 | 3059 | Best |
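The GFLOPS column follows from the standard cost model of dense matmul: an N x N product needs about 2·N³ floating-point operations. A quick sanity check against the NumPy row above:

```python
# Sanity-check the table's GFLOPS arithmetic: an N x N matmul costs roughly
# 2 * N**3 floating-point operations (one multiply and one add per term).
n = 2048
iterations = 100
elapsed_s = 2.25  # NumPy row from the table above

iters_per_second = iterations / elapsed_s
gflops = 2 * n**3 * iters_per_second / 1e9
print(f"{iters_per_second:.1f} it/s, {gflops:.0f} GFLOPS")
```

This reproduces the table's ~44.4 iterations/s and ~763 GFLOPS for the NumPy backend.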
Vulkan Kernel Selection:
The --vulkan-kernel auto flag intelligently probes available kernels:
- `matmul` - Basic single-threaded implementation
- `matmul_vec4` - SIMD-style vec4 operations
- `matmul_a_bt` - Matrix transpose optimization (for A @ B.T)
- `matmul_a_bt_tiled` - Tiled transpose optimization (fastest when applicable)
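A quick NumPy check of why the transposed variants exist: in the A @ B.T formulation every output element is a dot product of two matrix *rows*, so both operands stream through memory contiguously in the inner loop. This is illustrative, not the shader code:

```python
import numpy as np

# Why an A @ B^T kernel layout helps: C[i, j] = dot(a[i, :], b[j, :]), so both
# operands are read as contiguous rows in the inner loop.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 32)).astype(np.float32)
b = rng.standard_normal((48, 32)).astype(np.float32)

c = a @ b.T                            # what a matmul_a_bt-style kernel computes
c_rows = np.einsum("ik,jk->ij", a, b)  # row-by-row formulation of the same product

assert np.allclose(c, c_rows, atol=1e-4)
```

The tiled variant additionally blocks the row traversals so reused sub-rows stay in fast on-chip memory.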
Advanced Tuning:

```bash
# Auto-tune both kernel AND submission batching strategy
uv run rasptorch --json backend benchmark --backends vulkan --size 2048 \
  --iterations 100 --warmup 20 \
  --vulkan-kernel auto \
  --vulkan-autotune-submit \
  --seed 42

# Manual kernel selection with custom batch submission
uv run rasptorch --json backend benchmark --backends vulkan \
  --vulkan-kernel matmul_a_bt_tiled \
  --vulkan-submit-every 4 \
  --size 2048 --iterations 100
```

Output Format:

Results are provided in JSON format (with the --json flag) including:

- `status`: "ok" or "unavailable"
- `elapsed_seconds`: Total benchmark time
- `iterations_per_second`: Throughput metric
- `estimated_gflops`: Floating-point performance
- `checksum`: Verification result
- `kernel`: Selected kernel name (for auto mode)
- `submit_every`: Submission batch size (for Vulkan)
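A sketch of consuming such a record downstream. The sample values echo the Vulkan table row above, while `kernel` and `submit_every` here are made-up examples; the exact JSON shape emitted by `--json` may differ:

```python
import json

# Parse a benchmark record using the fields listed above. The concrete
# values in this sample are illustrative, not fresh measurements.
sample = json.loads("""
{
  "status": "ok",
  "elapsed_seconds": 3.15,
  "iterations_per_second": 31.8,
  "estimated_gflops": 546.0,
  "checksum": 1.0,
  "kernel": "matmul_vec4",
  "submit_every": 8
}
""")

if sample["status"] == "ok":
    print(f"{sample['kernel']}: {sample['estimated_gflops']:.0f} GFLOPS "
          f"(submit_every={sample['submit_every']})")
```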
Optimization Tips:

- Use `--vulkan-autotune-submit` for best results (probes kernel + batch combinations)
- If you see `VkErrorDeviceLost`, reduce `--vulkan-submit-every` (try `4` or `1`)
- Larger problem sizes better amortize GPU setup overhead
- Command buffer batching (`--vulkan-submit-every`) balances latency and throughput
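The batching trade-off in the last tip can be sketched with a back-of-envelope cost model. All constants here are assumed for illustration, not measured:

```python
# Model of how --vulkan-submit-every amortizes per-submission overhead across
# a chunk of recorded commands. SUBMIT_OVERHEAD_MS and KERNEL_MS are
# hypothetical numbers, not measurements.
SUBMIT_OVERHEAD_MS = 0.5   # fixed cost per queue submission (assumed)
KERNEL_MS = 30.0           # on-device time per matmul (assumed)
ITERATIONS = 100

totals = {}
for submit_every in (1, 4, 16):
    submits = -(-ITERATIONS // submit_every)  # ceil(ITERATIONS / submit_every)
    totals[submit_every] = ITERATIONS * KERNEL_MS + submits * SUBMIT_OVERHEAD_MS

for n, ms in totals.items():
    print(f"submit_every={n:2d}: {ms:.1f} ms total")
```

The model only captures amortization; in practice very large submit chunks can trigger `VkErrorDeviceLost`, which is why auto-tuning searches for the fastest *stable* combination.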
For detailed optimization guide, see VULKAN_OPTIMIZATION.md.
Basic tensor math is performed via:
```bash
# Create tensors
uv run rasptorch tensor random --shape 2,3,4
uv run rasptorch tensor ones --shape 5,10
```

The results show the low-level tensor capabilities.
Models are defined using structured commands:
```bash
# Simple MLP
uv run rasptorch model mlp --layers "64,32,16,2"

# Complex CNN
uv run rasptorch model cnn --in-channels 3 --out-channels "32,64,128"
```

Managing the lifecycle:

```bash
uv run rasptorch model list
uv run rasptorch model save --model-id <id> --path model.pth
```

- Performance: The fastest paths keep the computation entirely on the GPU and minimize data transfers between CPU and GPU memory.
- Fallback: If GPU operations fail due to driver issues, the system gracefully falls back to the CPU NumPy path, but performance will suffer.
- Advanced Use: For a deep dive into custom kernel optimization, refer to the source code in `rasptorch/gpu_demo.py` and `rasptorch/main.py`.