poolgrad


a tiny ML runtime in rust (with a bit of systems bite).

it builds a minimal autograd engine, a few matrix multiplication kernels, and a small memory system — just enough to explore where performance actually comes from.

this is not a framework. it’s a playground.


what is this?

poolgrad is a single-binary runtime that implements:

  • reverse-mode autograd over a dynamic graph
  • multiple matmul kernels (naive, tiled, packed+simd, and an experimental mp variant)
  • a simple kernel scheduler
  • a gradient memory pool + lifetime-based release

everything is explicit. no hidden magic.

the only dependency is rayon (for parallel loops).


why?

most ML systems hide everything behind large abstractions.

this project asks:

what actually makes neural networks fast?

  • fewer multiplications?
  • better memory access?
  • vector instructions?
  • less allocation?

a quick look

forward builds a graph. backward walks it.

kernels are shared between forward and backward.

gradients are reused instead of reallocated (and released early when the planner says they’re dead).

(diagram: poolgrad overview)


the interesting part

there are four ways to multiply matrices here:

  • naive loops
  • tiled (better cache use)
  • packed + simd (pack panels + microkernel; uses NEON on arm64 when available, AVX2+FMA on x86_64)
  • mp (recursive strassen-form block transform with a packed base case): fewer multiplies, more adds, more data movement

the mp idea is simple:

can we compute the same result with fewer multiplications?

it’s strassen-like in spirit, applied at block granularity: the transform recurses when profitable, and falls back to the packed kernel at the base case to keep performance stable.


what happens

numbers below (kernel benchmark): apple m4 (arm64, 10 cores), rustc 1.93.1, cargo run --release, seed=0xbadc0de, POOLGRAD_MP_MAX_SIZE=512, warmup=2, trials=9, median reported.

observations from these runs:

  • packed + simd dominates once sizes grow (e.g. 2.69x vs naive at 512)
  • tiling helps, but only moderately
  • mp reduces multiplies but is usually slower than packed (and can be slower than naive at mid sizes)
  • no single kernel wins everywhere → scheduling matters (at 32 the scheduler picked naive, but packed tied for best)

kernel times (ms, median; square gemm):

n     naive    tiled    packed+simd   mp
32     0.028    0.026     0.026        0.034
64     0.208    0.206     0.023        0.236
128    0.310    0.298     0.156        0.363
256    2.115    1.443     0.902        1.774
512   15.198    9.163     5.649       12.657

example (512 × 512):

naive           15.2 ms
tiled            9.2 ms
packed+simd      5.6 ms
mp              12.7 ms

memory

gradients are reused via a simple size-based pool.

this repo reports two different concepts for grad memory:

  • live bytes: currently checked out / in-use
  • reserved bytes: memory kept around for reuse (pool resident, or planner pre-alloc)

training loop (pool off vs on):

note: with the pool off, the runtime uses the deterministic planned backward pass and never touches the pool, so pool metrics read 0.

mode       allocations   reuses   peak resident bytes
pool off   0             0        0
pool on    6             27       72

the forward+backward step benchmark prints a clearer breakdown:

  • GradReservedBytes:
    • pool mode: peak resident bytes (live + cached)
    • plan mode: bytes pre-allocated by the planner
  • GradLiveBytes:
    • pool mode: peak checked-out bytes
    • plan mode: peak bytes checked out from planned buffers
  • ActivationBytes: bytes reserved by forward activations (Vec capacity in TensorStore)
  • TempBytes: peak bytes tracked by instrumented temporaries (TrackedBufF32)

kernel + pool interaction (forward matmul + seeded backward; scheduler-selected kernel; one pool reused across sizes):

n     kernel        iters   time (ms)   alloc   reuse   live peak (bytes)   resident peak (bytes)
32    Naive         10        0.305     1       9         4096                 4096
64    TiledPacked   10        0.671     1       9        16384                20480
128   TiledPacked   10        4.931     1       9        65536                86016
256   TiledPacked    3       12.174     1       2       262144               348160
512   TiledPacked    1      118.022     1       0      1048576              1396736

run it

cargo run --release

optional knobs:

POOLGRAD_FORCE_KERNEL=TiledPacked
POOLGRAD_PAR=1
POOLGRAD_MP_MAX_SIZE=512
POOLGRAD_MP_BASE_THRESHOLD=64
POOLGRAD_MP_RECURSE_MIN=128
POOLGRAD_MP_PACKED_BLOCK=64
POOLGRAD_BENCH_WARMUP=2
POOLGRAD_BENCH_TRIALS=9

notes

  • cpu only
  • simd via neon (arm64) / avx2+fma (x86_64) when available
  • mp is experimental — included to explore compute vs overhead tradeoffs (and has slightly higher fp error when enabled at larger sizes)
  • code is intentionally small and explicit

About

a memory-aware ML runtime demonstrating compute/memory tradeoffs in CPU matmul kernels.
