a tiny ML runtime in rust (with a bit of systems bite).
it builds a minimal autograd engine, a few matrix multiplication kernels, and a small memory system — just enough to explore where performance actually comes from.
this is not a framework. it’s a playground.
poolgrad is a single-binary runtime that implements:
- reverse-mode autograd over a dynamic graph
- multiple matmul kernels (naive, tiled, packed+simd, and an experimental mp variant)
- a simple kernel scheduler
- a gradient memory pool + lifetime-based release
everything is explicit. no hidden magic.
the only dependency is rayon (for parallel loops).
most ML systems hide everything behind large abstractions.
this project asks:
what actually makes neural networks fast?
- fewer multiplications?
- better memory access?
- vector instructions?
- less allocation?
forward builds a graph. backward walks it.
kernels are shared between forward and backward.
gradients are reused instead of reallocated (and released early when the planner says they’re dead).
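the "forward builds a graph, backward walks it" idea can be sketched with a tape of scalar nodes. this is a hypothetical minimal version, not poolgrad's actual types — each op records its parents plus the local partials, and backward walks the tape in reverse accumulating adjoints:

```rust
// minimal reverse-mode autograd sketch over scalars (illustrative only):
// forward appends nodes to a tape; backward walks it in reverse.
struct Node {
    // indices of parents and the local partial derivative w.r.t. each
    parents: Vec<(usize, f32)>,
}

struct Tape {
    nodes: Vec<Node>,
    values: Vec<f32>,
}

impl Tape {
    fn new() -> Self {
        Tape { nodes: Vec::new(), values: Vec::new() }
    }
    fn leaf(&mut self, v: f32) -> usize {
        self.nodes.push(Node { parents: vec![] });
        self.values.push(v);
        self.values.len() - 1
    }
    fn add(&mut self, a: usize, b: usize) -> usize {
        // d(a+b)/da = 1, d(a+b)/db = 1
        self.nodes.push(Node { parents: vec![(a, 1.0), (b, 1.0)] });
        self.values.push(self.values[a] + self.values[b]);
        self.values.len() - 1
    }
    fn mul(&mut self, a: usize, b: usize) -> usize {
        // d(a*b)/da = b, d(a*b)/db = a
        let (va, vb) = (self.values[a], self.values[b]);
        self.nodes.push(Node { parents: vec![(a, vb), (b, va)] });
        self.values.push(va * vb);
        self.values.len() - 1
    }
    // seed d(out)/d(out) = 1, then propagate adjoints to parents in reverse
    fn backward(&self, out: usize) -> Vec<f32> {
        let mut grads = vec![0.0; self.nodes.len()];
        grads[out] = 1.0;
        for i in (0..=out).rev() {
            let g = grads[i];
            for &(p, local) in &self.nodes[i].parents {
                grads[p] += g * local;
            }
        }
        grads
    }
}

fn main() {
    let mut t = Tape::new();
    let x = t.leaf(2.0);
    let y = t.leaf(3.0);
    let xy = t.mul(x, y);
    let z = t.add(xy, x); // z = x*y + x
    let g = t.backward(z);
    // dz/dx = y + 1 = 4, dz/dy = x = 2
    println!("dz/dx = {}, dz/dy = {}", g[x], g[y]);
}
```

the real runtime does the same walk over tensors, with shared kernels doing the heavy lifting in each node's backward rule.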
there are four ways to multiply matrices here:
- naive loops
- tiled (better cache use)
- packed + simd (pack panels + microkernel; uses NEON on arm64 when available, AVX2+FMA on x86_64)
- mp (recursive strassen-form block transform with a packed base case): fewer multiplies, more adds, more data movement
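for reference, the tiled idea looks roughly like this (a sketch with a hypothetical tile size, not poolgrad's actual kernel): iterate over fixed-size blocks so each block of a, b, and c stays hot in cache while it's being reused.

```rust
// cache-tiled matmul sketch: B x B blocks, row-major square matrices.
const B: usize = 64; // tile edge; real kernels tune this per cache level

fn matmul_tiled(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    for ii in (0..n).step_by(B) {
        for kk in (0..n).step_by(B) {
            for jj in (0..n).step_by(B) {
                // multiply one block, clamped at the matrix edge
                for i in ii..(ii + B).min(n) {
                    for k in kk..(kk + B).min(n) {
                        let aik = a[i * n + k];
                        for j in jj..(jj + B).min(n) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    let n = 128;
    let a = vec![1.0f32; n * n];
    let b = vec![2.0f32; n * n];
    let mut c = vec![0.0f32; n * n];
    matmul_tiled(&a, &b, &mut c, n);
    // each entry is sum over k of 1*2 = 2n
    assert!(c.iter().all(|&x| x == 2.0 * n as f32));
    println!("ok");
}
```

the packed+simd kernel goes further: it copies panels of a and b into contiguous buffers first, so the microkernel reads strictly sequential memory.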
the mp idea is simple:
can we compute the same result with fewer multiplications?
it’s strassen-like in spirit, applied at block granularity: the current implementation recurses when that looks profitable, and falls back to the packed kernel at the base case to keep performance stable.
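the trick at the smallest scale is strassen's 2×2 identity: 7 multiplies instead of 8, paid for with 18 adds/subs. in the mp kernel each "multiply" below is a whole sub-gemm, which is where the extra data movement comes from.

```rust
// strassen's 2x2 product with 7 multiplies (illustrative; mp applies the
// same transform at block granularity, where each m_i is a sub-gemm).
fn strassen_2x2(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    let (a11, a12, a21, a22) = (a[0], a[1], a[2], a[3]);
    let (b11, b12, b21, b22) = (b[0], b[1], b[2], b[3]);
    let m1 = (a11 + a22) * (b11 + b22);
    let m2 = (a21 + a22) * b11;
    let m3 = a11 * (b12 - b22);
    let m4 = a22 * (b21 - b11);
    let m5 = (a11 + a12) * b22;
    let m6 = (a21 - a11) * (b11 + b12);
    let m7 = (a12 - a22) * (b21 + b22);
    [
        m1 + m4 - m5 + m7, // c11
        m3 + m5,           // c12
        m2 + m4,           // c21
        m1 - m2 + m3 + m6, // c22
    ]
}

fn main() {
    let c = strassen_2x2([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]);
    // naive result: [[19, 22], [43, 50]]
    println!("{:?}", c);
}
```

the extra adds also explain the slightly higher floating-point error noted below: more additions means more rounding steps.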
numbers below (kernel benchmark): apple m4 (arm64, 10 cores), rustc 1.93.1, cargo run --release, seed=0xbadc0de, POOLGRAD_MP_MAX_SIZE=512, warmup=2, trials=9, median reported.
observations from these runs:
- packed + simd dominates once sizes grow (e.g. 2.69x vs naive at 512)
- tiling helps, but only moderately
- mp reduces multiplies but is usually slower than packed (and can be slower than naive at mid sizes)
- no single kernel wins everywhere → scheduling matters (at 32 the scheduler picked naive, but packed tied for best)
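in the spirit of those observations, a scheduler can be as simple as a size threshold. the cutoff here is hypothetical, loosely read off the tables in this readme; the real scheduler may decide differently:

```rust
// size-thresholded kernel pick (sketch): small problems don't amortize
// packing overhead, large ones do.
#[derive(Debug, PartialEq)]
enum Kernel {
    Naive,
    TiledPacked,
}

fn pick_kernel(n: usize) -> Kernel {
    if n <= 32 {
        Kernel::Naive // packing/tiling overhead dominates at tiny sizes
    } else {
        Kernel::TiledPacked // packed+simd dominates from ~64 upward
    }
}

fn main() {
    println!("32 -> {:?}, 512 -> {:?}", pick_kernel(32), pick_kernel(512));
}
```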
kernel times (ms, median; square gemm):
| n | naive | tiled | packed+simd | mp |
|---|---|---|---|---|
| 32 | 0.028 | 0.026 | 0.026 | 0.034 |
| 64 | 0.208 | 0.206 | 0.023 | 0.236 |
| 128 | 0.310 | 0.298 | 0.156 | 0.363 |
| 256 | 2.115 | 1.443 | 0.902 | 1.774 |
| 512 | 15.198 | 9.163 | 5.649 | 12.657 |
example (512 × 512):
naive 15.2 ms
tiled 9.2 ms
packed+simd 5.6 ms
mp 12.7 ms
gradients are reused via a simple size-based pool.
this repo reports two different concepts for grad memory:
- live bytes: currently checked out / in-use
- reserved bytes: memory kept around for reuse (pool resident, or planner pre-alloc)
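a size-keyed pool captures both concepts in a few lines. this is a hypothetical, simpler-than-real sketch: checkout reuses a cached buffer of the right size if one exists, otherwise allocates; release returns the buffer for reuse (reserved stays resident, live is what's checked out).

```rust
// minimal size-keyed gradient buffer pool (sketch).
use std::collections::HashMap;

#[derive(Default)]
struct GradPool {
    free: HashMap<usize, Vec<Vec<f32>>>, // size -> cached buffers
    allocations: usize,
    reuses: usize,
}

impl GradPool {
    fn checkout(&mut self, len: usize) -> Vec<f32> {
        if let Some(mut buf) = self.free.get_mut(&len).and_then(|v| v.pop()) {
            self.reuses += 1;
            buf.iter_mut().for_each(|x| *x = 0.0); // grads start zeroed
            buf
        } else {
            self.allocations += 1;
            vec![0.0; len]
        }
    }
    fn release(&mut self, buf: Vec<f32>) {
        self.free.entry(buf.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = GradPool::default();
    for _ in 0..10 {
        let g = pool.checkout(1024); // same shape each step
        pool.release(g);             // lifetime ends -> back to the pool
    }
    // one real allocation, nine reuses
    println!("alloc={} reuse={}", pool.allocations, pool.reuses);
}
```

this is the same alloc=1 / reuse=9 pattern that shows up in the kernel + pool table below once iteration counts line up.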
training loop (pool off vs on):
note: when pool off, the runtime uses the deterministic planned backward pass and does not touch the pool (so pool metrics are 0).
| mode | allocations | reuses | peak resident bytes |
|---|---|---|---|
| pool off | 0 | 0 | 0 |
| pool on | 6 | 27 | 72 |
forward+backward step benchmark prints a clearer breakdown:
GradReservedBytes:
- pool mode: peak resident bytes (live + cached)
- plan mode: bytes pre-allocated by the planner

GradLiveBytes:
- pool mode: peak checked-out bytes
- plan mode: peak bytes checked out from planned buffers

ActivationBytes: bytes reserved by forward activations (Vec capacity in TensorStore)

TempBytes: peak bytes tracked by instrumented temporaries (TrackedBufF32)
kernel + pool interaction (forward matmul + seeded backward; scheduler-selected kernel; one pool reused across sizes):
| n | kernel | iters | time (ms) | alloc | reuse | live peak (bytes) | resident peak (bytes) |
|---|---|---|---|---|---|---|---|
| 32 | Naive | 10 | 0.305 | 1 | 9 | 4096 | 4096 |
| 64 | TiledPacked | 10 | 0.671 | 1 | 9 | 16384 | 20480 |
| 128 | TiledPacked | 10 | 4.931 | 1 | 9 | 65536 | 86016 |
| 256 | TiledPacked | 3 | 12.174 | 1 | 2 | 262144 | 348160 |
| 512 | TiledPacked | 1 | 118.022 | 1 | 0 | 1048576 | 1396736 |
run with cargo run --release. optional knobs (env vars):
POOLGRAD_FORCE_KERNEL=TiledPacked
POOLGRAD_PAR=1
POOLGRAD_MP_MAX_SIZE=512
POOLGRAD_MP_BASE_THRESHOLD=64
POOLGRAD_MP_RECURSE_MIN=128
POOLGRAD_MP_PACKED_BLOCK=64
POOLGRAD_BENCH_WARMUP=2
POOLGRAD_BENCH_TRIALS=9

notes:
- cpu only
- simd via neon (arm64) / avx2+fma (x86_64) when available
- mp is experimental — included to explore compute vs overhead tradeoffs (and has slightly higher fp error when enabled at larger sizes)
- code is intentionally small and explicit
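the simd dispatch can be sketched like this: NEON is part of the baseline aarch64 ISA, so no runtime check is needed there, while on x86_64 AVX2+FMA must be detected at runtime. a sketch, not poolgrad's actual dispatch code:

```rust
// report which simd path would be taken on this machine (sketch).
fn simd_path() -> &'static str {
    #[cfg(target_arch = "aarch64")]
    {
        "neon" // NEON is mandatory on aarch64, no detection needed
    }
    #[cfg(target_arch = "x86_64")]
    {
        // runtime check via std's CPU feature detection macro
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            "avx2+fma"
        } else {
            "scalar"
        }
    }
    #[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
    {
        "scalar"
    }
}

fn main() {
    println!("simd path: {}", simd_path());
}
```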

