Axon provides first-class GPU support through device annotations, automatic kernel compilation, and device-aware tensor operations. Write GPU code in Axon — no CUDA C required.
Mark a function for GPU execution:
@gpu
fn vector_add(a: Tensor<Float32, [1024]>, b: Tensor<Float32, [1024]>): Tensor<Float32, [1024]> {
    a + b
}
The Axon compiler lowers @gpu functions through the MLIR backend to produce
optimized GPU kernels for the target platform (CUDA, ROCm, or Vulkan).
Explicitly mark a function for CPU-only execution:
@cpu
fn save_results(data: Tensor<Float32, [?, 10]>, path: String) {
    val file = File.create(path);
    file.write(data);
}
Specify a device target explicitly:
@device("cuda:0")
fn forward_gpu0(x: Tensor<Float32, [?, 784]>): Tensor<Float32, [?, 10]> {
    // executes on CUDA device 0
    val w = randn([784, 10]);
    x @ w
}
Tensors are transferred between devices with .to_gpu() and .to_cpu():
fn gpu_example() {
    // Create on CPU
    val cpu_tensor = randn([1024, 1024]);
    // Transfer to GPU
    val gpu_tensor = cpu_tensor.to_gpu();
    // Compute on GPU — fast!
    val result = gpu_tensor @ gpu_tensor;
    // Transfer back to CPU for I/O
    val cpu_result = result.to_cpu();
    println("{}", cpu_result);
}
Device transfer follows ownership rules — the source tensor is consumed:
val data = randn([256, 256]);
val gpu_data = data.to_gpu(); // data is moved
// println("{}", data); // ERROR[E4001]: use of moved value `data`
To keep a CPU copy, clone first:
val data = randn([256, 256]);
val backup = data.clone();
val gpu_data = data.to_gpu();
println("{}", backup); // OK — backup is a separate copy
Tensors track their device in the type system. Operations between tensors on different devices are compile-time errors:
val cpu_a = randn([100]);
val gpu_b = randn([100]).to_gpu();
// val c = cpu_a + gpu_b; // ERROR: device mismatch — cpu and gpu tensors
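One way to resolve the mismatch is to move both operands onto the same device before combining them. The sketch below uses only the transfer calls already shown in this guide. The function name `add_on_gpu` is illustrative, and it assumes that `+` accepts two GPU tensors the same way it accepts two CPU tensors:

```axon
fn add_on_gpu() {
    val cpu_a = randn([100]);
    val gpu_b = randn([100]).to_gpu();
    // Move cpu_a to the GPU first; this consumes cpu_a under the ownership rules
    val gpu_a = cpu_a.to_gpu();
    // Both operands now live on the GPU, so the device check should pass
    val c = gpu_a + gpu_b;
    // Bring the result back to the CPU for printing
    println("{}", c.to_cpu());
}
```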
Tensors created inside a @gpu function are allocated directly on the GPU:
@gpu
fn init_weights(): Tensor<Float32, [784, 256]> {
    randn([784, 256]) // created directly on GPU — no transfer needed
}
When you compile with --gpu, Axon compiles @gpu functions into GPU kernels:
# Compile for NVIDIA GPUs
axonc build model.axon --gpu cuda -O 3
# Compile for AMD GPUs
axonc build model.axon --gpu rocm -O 3
# Compile for Vulkan (cross-platform)
axonc build model.axon --gpu vulkan -O 3
The compilation pipeline has five stages:
- Frontend: Axon source → AST → Typed AST (same for all targets)
- MIR: Typed AST → Mid-level IR with device annotations
- MLIR: GPU-annotated MIR → MLIR dialects (GPU, Linalg, Tensor)
- Lowering: MLIR → NVVM (CUDA) / ROCDL (ROCm) / SPIR-V (Vulkan)
- Linking: Host code + GPU kernels → single binary
Axon applies GPU-specific optimizations:
- Kernel fusion — combine adjacent operations into single kernels
- Memory coalescing — optimize memory access patterns
- Shared memory tiling — tile matrix multiplications for cache efficiency
- Async transfers — overlap computation with host↔device transfers
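As a rough illustration of the kernel fusion item, the function below chains two elementwise additions. Whether the compiler actually fuses them into a single kernel is an assumption here, not something this guide guarantees, and the function name `add3` is hypothetical:

```axon
@gpu
fn add3(
    a: Tensor<Float32, [1024]>,
    b: Tensor<Float32, [1024]>,
    c: Tensor<Float32, [1024]>,
): Tensor<Float32, [1024]> {
    // Two adjacent elementwise additions: a natural fusion candidate,
    // since the intermediate (a + b) need not be written back to memory
    a + b + c
}
```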
Select a specific GPU by index and place tensors on it with .to_device():
use std.device.{Device, cuda};
fn main() {
    val dev0 = cuda(0); // first GPU
    val dev1 = cuda(1); // second GPU
    val a = randn([1024, 1024]).to_device(dev0);
    val b = randn([1024, 1024]).to_device(dev1);
}
Split batches across GPUs:
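use std.device.cuda;
fn train_multi_gpu(model: &mut NeuralNet, data: &DataLoader) {
    val devices = [cuda(0), cuda(1)];
    for batch in data {
        val (inputs, targets) = batch;
        // Split inputs and targets the same way so each chunk keeps its own labels
        val input_chunks = inputs.chunk(devices.len(), dim: 0);
        val target_chunks = targets.chunk(devices.len(), dim: 0);
        var losses = Vec.new();
        for i in 0..devices.len() {
            // Move each chunk and its labels to the same device
            val chunk = input_chunks[i].to_device(devices[i]);
            val labels = target_chunks[i].to_device(devices[i]);
            val pred = model.forward(chunk);
            val loss = cross_entropy(pred, labels);
            losses.push(loss);
        }
        // Sum the per-device losses, then backpropagate once
        val total_loss = losses.sum();
        total_loss.backward();
    }
}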
Query the available GPUs at runtime:
use std.device;
fn main() {
    val count = device.gpu_count();
    println("Available GPUs: {}", count);
    for i in 0..count {
        val dev = device.cuda(i);
        println("  GPU {}: {} ({}MB)", i, dev.name(), dev.memory_mb());
    }
}
A complete example that multiplies two matrices on the GPU:
use std.device.cuda;
@gpu
fn matmul_gpu(
    a: Tensor<Float32, [?, ?]>,
    b: Tensor<Float32, [?, ?]>,
): Tensor<Float32, [?, ?]> {
    a @ b
}
fn main() {
    val size = 2048;
    // Create tensors on CPU
    val a = randn([size, size]);
    val b = randn([size, size]);
    // Transfer to GPU (a and b are moved)
    val ga = a.to_gpu();
    val gb = b.to_gpu();
    // GPU matrix multiply
    val gc = matmul_gpu(ga, gb);
    // Transfer the result back to the CPU
    val c = gc.to_cpu();
    println("Result shape: {}", c.shape);
    println("Result[0][0]: {}", c[0][0]);
}
Compile and run:
axonc build matmul.axon --gpu cuda -O 3 -o matmul
./matmul
# Result shape: [2048, 2048]
# Result[0][0]: 12.3456
Best practices:
- Minimize transfers — keep data on GPU as long as possible
- Batch operations — GPU shines with large, parallel workloads
- Use @gpu functions — let the compiler handle kernel generation
- Profile first — not everything benefits from GPU acceleration
- Clone before transfer if you need the CPU copy
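The advice to minimize transfers can be sketched as follows: transfer once, chain the GPU operations, and copy back only the final result. This uses only calls that appear elsewhere in this guide. The function name `chained_matmul` is illustrative:

```axon
fn chained_matmul() {
    // One host-to-device transfer up front
    val x = randn([1024, 1024]).to_gpu();
    // Both multiplications run on the GPU; the intermediate y never leaves the device
    val y = x @ x;
    val z = y @ y;
    // One device-to-host transfer at the end, for I/O only
    val out = z.to_cpu();
    println("{}", out);
}
```

Calling .to_cpu() on `y` and then .to_gpu() again before the second multiply would give the same result with two extra transfers.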
See also:
- Tensor Programming — tensor types, shapes, and operations
- Ownership & Borrowing — device-aware borrowing rules
- Tutorial: MNIST Classifier — GPU training example
- CLI Reference — --gpu build flag details