a tiny ML runtime in rust (with a bit of systems bite).
it builds a minimal autograd engine, a few matrix multiplication kernels, and a small memory system — just enough to explore where performance actually comes from.
this is not a framework. it’s a playground.
poolgrad is a single-binary runtime that implements:
- reverse-mode autograd over a dynamic graph
- multiple matmul kernels (naive, tiled, packed+simd, and an experimental mp variant)
- a simple kernel scheduler
- a gradient memory pool + lifetime-based release
everything is explicit. no hidden magic.
the only dependency is rayon (for parallel loops).
most ML systems hide everything behind large abstractions.
this project asks:
what actually makes neural networks fast?
- fewer multiplications?
- better memory access?
- vector instructions?
- less allocation?
forward builds a graph. backward walks it.
kernels are shared between forward and backward.
gradients are reused instead of reallocated (and released early when the planner says they’re dead).
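the "forward builds a graph, backward walks it" idea can be sketched with a tape of scalar nodes. this is a hypothetical minimal version, not poolgrad's actual types — each op records its parents plus the local partials, and backward walks the tape in reverse accumulating adjoints:

```rust
// minimal reverse-mode autograd sketch over scalars (illustrative only):
// forward appends nodes to a tape; backward walks it in reverse.
struct Node {
    // indices of parents and the local partial derivative w.r.t. each
    parents: Vec<(usize, f32)>,
}

struct Tape {
    nodes: Vec<Node>,
    values: Vec<f32>,
}

impl Tape {
    fn new() -> Self {
        Tape { nodes: Vec::new(), values: Vec::new() }
    }
    fn leaf(&mut self, v: f32) -> usize {
        self.nodes.push(Node { parents: vec![] });
        self.values.push(v);
        self.values.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        // d(a+b)/da = 1, d(a+b)/db = 1
        self.nodes.push(Node { parents: vec![(a, 1.0), (b, 1.0)] });
        self.values.push(self.values[a] + self.values[b]);
        self.values.len() - 1
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        // d(a*b)/da = b, d(a*b)/db = a
        let (va, vb) = (self.values[a], self.values[b]);
        self.nodes.push(Node { parents: vec![(a, vb), (b, va)] });
        self.values.push(va * vb);
        self.values.len() - 1
    }
    // seed d(out)/d(out) = 1, then propagate adjoints to parents in reverse
    fn backward(&self, out: usize) -> Vec<f32> {
        let mut grads = vec![0.0; self.nodes.len()];
        grads[out] = 1.0;
        for i in (0..=out).rev() {
            let g = grads[i];
            for &(p, local) in &self.nodes[i].parents {
                grads[p] += g * local;
            }
        }
        grads
    }
}

fn main() {
    let mut t = Tape::new();
    let x = t.leaf(2.0);
    let y = t.leaf(3.0);
    let xy = t.mul(x, y);
    let z = t.add(xy, x); // z = x*y + x
    let g = t.backward(z);
    // dz/dx = y + 1 = 4, dz/dy = x = 2
    println!("dz/dx = {}, dz/dy = {}", g[x], g[y]);
}
```

the real runtime does the same walk over tensors, with shared kernels doing the heavy lifting in each node's backward rule.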
there are four ways to multiply matrices here:
- naive loops
- tiled (better cache use)
- packed + simd (pack panels + microkernel; uses NEON on arm64 when available, AVX2+FMA on x86_64)
- mp (recursive strassen-form block transform with a packed base case): fewer multiplies, more adds, more data movement
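for reference, the tiled idea looks roughly like this (a sketch with a hypothetical tile size, not poolgrad's actual kernel): iterate over fixed-size blocks so each block of a, b, and c stays hot in cache while it's being reused.

```rust
// cache-tiled matmul sketch: B x B blocks, row-major square matrices.
const B: usize = 64; // tile edge; real kernels tune this per cache level

fn matmul_tiled(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    for ii in (0..n).step_by(B) {
        for kk in (0..n).step_by(B) {
            for jj in (0..n).step_by(B) {
                // multiply one block, clamped at the matrix edge
                for i in ii..(ii + B).min(n) {
                    for k in kk..(kk + B).min(n) {
                        let aik = a[i * n + k];
                        for j in jj..(jj + B).min(n) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let n = 128;
    let a = vec![1.0f32; n * n];
    let b = vec![2.0f32; n * n];
    let mut c = vec![0.0f32; n * n];
    matmul_tiled(&a, &b, &mut c, n);
    // each entry is sum over k of 1*2 = 2n
    assert!(c.iter().all(|&x| x == 2.0 * n as f32));
    println!("ok");
}
```

the packed+simd kernel goes further: it copies panels of a and b into contiguous buffers first, so the microkernel reads strictly sequential memory.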
the mp idea is simple:
can we compute the same result with fewer multiplications?
it’s strassen-like in spirit, applied at block granularity: the current implementation recurses when that looks profitable, and falls back to the packed kernel at the base case to keep performance stable.
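the trick at the smallest scale is strassen's 2×2 identity: 7 multiplies instead of 8, paid for with 18 adds/subs. in the mp kernel each "multiply" below is a whole sub-gemm, which is where the extra data movement comes from.

```rust
// strassen's 2x2 product with 7 multiplies (illustrative; mp applies the
// same transform at block granularity, where each m_i is a sub-gemm).
fn strassen_2x2(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let (a11, a12, a21, a22) = (a[0], a[1], a[2], a[3]);
    let (b11, b12, b21, b22) = (b[0], b[1], b[2], b[3]);
    let m1 = (a11 + a22) * (b11 + b22);
    let m2 = (a21 + a22) * b11;
    let m3 = a11 * (b12 - b22);
    let m4 = a22 * (b21 - b11);
    let m5 = (a11 + a12) * b22;
    let m6 = (a21 - a11) * (b11 + b12);
    let m7 = (a12 - a22) * (b21 + b22);
    [
        m1 + m4 - m5 + m7, // c11
        m3 + m5,           // c12
        m2 + m4,           // c21
        m1 - m2 + m3 + m6, // c22
    ]
}

fn main() {
    let c = strassen_2x2([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]);
    // naive result: [[19, 22], [43, 50]]
    println!("{:?}", c);
}
```

the extra adds also explain the slightly higher floating-point error noted below: more additions means more rounding steps.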
numbers below (kernel benchmark): apple m4 (arm64, 10 cores), rustc 1.93.1, cargo run --release, seed=0xbadc0de, POOLGRAD_MP_MAX_SIZE=512, warmup=2, trials=9, median reported.
observations from these runs:
- packed + simd dominates once sizes grow (e.g. 2.69x vs naive at 512)
- tiling helps, but only moderately
- mp reduces multiplies but is usually slower than packed (and can be slower than naive at mid sizes)
- no single kernel wins everywhere → scheduling matters (at 32 the scheduler picked naive, but packed tied for best)
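in the spirit of those observations, a scheduler can be as simple as a size threshold. the cutoff here is hypothetical, loosely read off the tables in this readme; the real scheduler may decide differently:

```rust
// size-thresholded kernel pick (sketch): small problems don't amortize
// packing overhead, large ones do.
#[derive(Debug, PartialEq)]
enum Kernel {
    Naive,
    TiledPacked,
}

fn pick_kernel(n: usize) -> Kernel {
    if n <= 32 {
        Kernel::Naive // packing/tiling overhead dominates at tiny sizes
    } else {
        Kernel::TiledPacked // packed+simd dominates from ~64 upward
    }
}

fn main() {
    println!("32 -> {:?}, 512 -> {:?}", pick_kernel(32), pick_kernel(512));
}
```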
kernel times (ms, median; square gemm):
| n | naive | tiled | packed+simd | mp |
|---|---|---|---|---|
| 32 | 0.028 | 0.026 | 0.026 | 0.034 |
| 64 | 0.208 | 0.206 | 0.023 | 0.236 |
| 128 | 0.310 | 0.298 | 0.156 | 0.363 |
| 256 | 2.115 | 1.443 | 0.902 | 1.774 |
| 512 | 15.198 | 9.163 | 5.649 | 12.657 |
example (512 × 512):
naive 15.2 ms
tiled 9.2 ms
packed+simd 5.6 ms
mp 12.7 ms
gradients are reused via a simple size-based pool.
this repo reports two different concepts for grad memory:
- live bytes: currently checked out / in-use
- reserved bytes: memory kept around for reuse (pool resident, or planner pre-alloc)
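a size-keyed pool captures both concepts in a few lines. this is a hypothetical, simpler-than-real sketch: checkout reuses a cached buffer of the right size if one exists, otherwise allocates; release returns the buffer for reuse (reserved stays resident, live is what's checked out).

```rust
// minimal size-keyed gradient buffer pool (sketch).
use std::collections::HashMap;

#[derive(Default)]
struct GradPool {
    free: HashMap<usize, Vec<Vec<f32>>>, // size -> cached buffers
    allocations: usize,
    reuses: usize,
}

impl GradPool {
    fn checkout(&mut self, len: usize) -> Vec<f32> {
        if let Some(mut buf) = self.free.get_mut(&len).and_then(|v| v.pop()) {
            self.reuses += 1;
            buf.iter_mut().for_each(|x| *x = 0.0); // grads start zeroed
            buf
        } else {
            self.allocations += 1;
            vec![0.0; len]
        }
    }
    fn release(&mut self, buf: Vec<f32>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = GradPool::default();
    for _ in 0..10 {
        let g = pool.checkout(1024); // same shape each step
        pool.release(g);             // lifetime ends -> back to the pool
    }
    // one real allocation, nine reuses
    println!("alloc={} reuse={}", pool.allocations, pool.reuses);
}
```

this is the same alloc=1 / reuse=9 pattern that shows up in the kernel + pool table below once iteration counts line up.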
training loop (pool off vs on):
note: when pool off, the runtime uses the deterministic planned backward pass and does not touch the pool (so pool metrics are 0).
| mode | allocations | reuses | peak resident bytes |
|---|---|---|---|
| pool off | 0 | 0 | 0 |
| pool on | 6 | 27 | 72 |
forward+backward step benchmark prints a clearer breakdown:
GradReservedBytes:
- pool mode: peak resident bytes (live + cached)
- plan mode: bytes pre-allocated by the planner

GradLiveBytes:
- pool mode: peak checked-out bytes
- plan mode: peak bytes checked out from planned buffers

ActivationBytes: bytes reserved by forward activations (Vec capacity in TensorStore)

TempBytes: peak bytes tracked by instrumented temporaries (TrackedBufF32)
kernel + pool interaction (forward matmul + seeded backward; scheduler-selected kernel; one pool reused across sizes):
| n | kernel | iters | time (ms) | alloc | reuse | live peak (bytes) | resident peak (bytes) |
|---|---|---|---|---|---|---|---|
| 32 | Naive | 10 | 0.305 | 1 | 9 | 4096 | 4096 |
| 64 | TiledPacked | 10 | 0.671 | 1 | 9 | 16384 | 20480 |
| 128 | TiledPacked | 10 | 4.931 | 1 | 9 | 65536 | 86016 |
| 256 | TiledPacked | 3 | 12.174 | 1 | 2 | 262144 | 348160 |
| 512 | TiledPacked | 1 | 118.022 | 1 | 0 | 1048576 | 1396736 |
run with cargo run --release. optional knobs (env vars):
POOLGRAD_FORCE_KERNEL=TiledPacked
POOLGRAD_PAR=1
POOLGRAD_MP_MAX_SIZE=512
POOLGRAD_MP_BASE_THRESHOLD=64
POOLGRAD_MP_RECURSE_MIN=128
POOLGRAD_MP_PACKED_BLOCK=64
POOLGRAD_BENCH_WARMUP=2
POOLGRAD_BENCH_TRIALS=9

notes:
- cpu only
- simd via neon (arm64) / avx2+fma (x86_64) when available
- mp is experimental — included to explore compute vs overhead tradeoffs (and has slightly higher fp error when enabled at larger sizes)
- code is intentionally small and explicit
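the simd dispatch can be sketched like this: NEON is part of the baseline aarch64 ISA, so no runtime check is needed there, while on x86_64 AVX2+FMA must be detected at runtime. a sketch, not poolgrad's actual dispatch code:

```rust
// report which simd path would be taken on this machine (sketch).
fn simd_path() -> &'static str {
    #[cfg(target_arch = "aarch64")]
    {
        "neon" // NEON is mandatory on aarch64, no detection needed
    }
    #[cfg(target_arch = "x86_64")]
    {
        // runtime check via std's CPU feature detection macro
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            "avx2+fma"
        } else {
            "scalar"
        }
    }
    #[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
    {
        "scalar"
    }
}

fn main() {
    println!("simd path: {}", simd_path());
}
```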

