Side-by-side benchmark and real-time fragmentation visualizer for custom memory allocators battling production-shaped workloads.
Engineers pick allocators — glibc malloc, jemalloc, tcmalloc, mimalloc —
without any real intuition for how they actually behave under the workloads
they run in production. Fragmentation, cache locality, and tail latency stay
invisible until something is on fire in staging.
HeapForge makes allocator internals observable. It replays the same synthesized malloc/free stream through four hand-rolled allocators in lockstep and paints every arena live at 32-byte granularity so you can see fragmentation appear and coalesce in real time.
The point isn't to beat jemalloc. The point is to develop intuition
about where each strategy wins and where it falls apart.
Four allocators, each implemented from scratch in C11 against a common arena + vtable interface:
| Name | Strategy | What it shows |
|---|---|---|
bump |
Cursor-only arena | Fastest alloc possible; zero reclamation |
segfit |
Segregated free lists | Boundary tags + coalesce on free — the classic malloc shape |
buddy |
Binary-buddy | Power-of-two split/merge; guaranteed contiguous merges |
slab |
Per-class object cache | Intrusive free lists; locality-optimised small objects |
Four workload generators that mimic real allocation patterns:
| Workload | Pattern |
|---|---|
redis_cache |
Log-normal size mix, ~20% free rate |
nginx_requests |
Bursty alloc-then-free-all request lifecycles |
python_repl |
Heavy churn, 92% of allocations in the 16–96 B bins |
adversarial |
Crafted to build a fragmentation lattice that frustrates non-coalescing allocators |
A Go server replays the trace through every allocator simultaneously, times
each call on the C side of the cgo bridge with CLOCK_MONOTONIC, and
streams tick snapshots (packed cell states + latency histograms + frag
scores) to the browser over WebSocket.
A React / TypeScript front end renders each allocator's 2 MiB arena as a 256×256 pixel heatmap, plots log₂-bucketed latency distributions, and keeps a live leaderboard across p50 / p99 / fragmentation / alloc-failure counts.
# 1. Build the C allocator library + static .a
cd c && make
cd ..
# 2. Build the Go server (requires cgo)
cd server && go build -o bin/heap-forge ./cmd/server
cd ..
# 3. Build the web bundle
cd web && pnpm install && pnpm build
cd ..
# 4. Start the server, serving the bundle
./server/bin/heap-forge -addr :8080 -web ./web/dist
# open http://localhost:8080, pick a workload, press RUN RACE.For live frontend iteration:
# terminal 1
./server/bin/heap-forge -addr :8080
# terminal 2
cd web && pnpm dev
# opens http://localhost:5173 with /api + /ws proxied to :8080┌──────────────────────────────────────────────────────────┐
│ Browser │
│ React + D3 scales + HTML5 canvas │
│ • HeapCanvas: 256×256 pixel heatmap per allocator │
│ • LatencyChart: log2-bucketed histogram, p50/p90/p99 │
│ • Scorecard: 4-way leaderboard │
└──────────────────────────────────────────────────────────┘
↑ WebSocket /ws ↓ POST /api/run
┌──────────────────────────────────────────────────────────┐
│ Go server (cmd/server) │
│ • engine.Run: replays a []workload.Op through every │
│ compiled allocator in lockstep │
│ • metrics: log-bucketed nanosecond histogram, │
│ fragmentation score │
│ • ws.Hub: broadcasts tick JSON with 2-bit packed cells │
└──────────────────────────────────────────────────────────┘
↕ cgo trampoline (times on C side)
┌──────────────────────────────────────────────────────────┐
│ libheapforge.a (C11) │
│ • arena.c: mmap-backed 2 MiB region, aligned │
│ • bump.c: 8 B header, cursor-only, dead-bit tracking │
│ • segfit.c: 12 size classes, boundary tags, coalesce │
│ • buddy.c: orders 6..21, XOR-indexed buddy lookup │
│ • slab.c: 8 classes, 16 KiB slabs, intrusive flinks │
│ • snapshot.c: per-chunk cell labels + frag scan │
└──────────────────────────────────────────────────────────┘
cgo can't call through a C function pointer directly. Every allocator
call goes through a thin C trampoline that pairs two
clock_gettime(CLOCK_MONOTONIC) reads around the vtable dispatch, so
reported latencies exclude Go-side cgo overhead:
static int hf_go_alloc(const hf_vtable* v, hf_allocator* a, size_t size,
void** ptr_out, uint64_t* out_ns) {
uint64_t t0 = hf_now_ns();
void* p = v->alloc(a, size);
uint64_t t1 = hf_now_ns();
*ptr_out = p;
*out_ns = t1 - t0;
return p ? 0 : -1;
}Every allocator exports an hf_snapshot containing a cells[65536]
array where each byte encodes one of four states (free / used / header /
reserved) for one 32-byte chunk of the 2 MiB arena. To keep the
WebSocket payload small, Go packs four 2-bit cells per byte before
base64-encoding, cutting snapshot payloads from 65 536 to 16 384 bytes
per allocator per tick (~21 KiB base64).
metrics.Histogram is a 32-bucket log2 histogram where bucket i holds
samples where 2^(i-1) < ns ≤ 2^i. Percentiles are computed by
walking the bucket array. This gives us p50, p90, p99 in O(32)
without storing a sample buffer.
frag_score = 1 − (largest_free_run / total_free_bytes)
0 = every free byte is in one contiguous run (no external fragmentation).
1 = arena is shattered into a lattice of unusably small holes. This is a
pure external fragmentation signal — bump scores 0 here because its
free region is always a single contiguous suffix of the arena, even though
it is bleeding unreclaimed allocations in the used prefix. The heap
canvas makes that failure mode obvious in a way a single scalar can't.
cd c && make test
# bump : monotonic layout, accounting, fill-to-exhaustion
# segfit : 20k random ops, final free-all coalesces to one 2 MiB block
# buddy : 30k random ops, final free-all collapses to one 2 MiB block
# slab : 40k random ops across all 8 classes, final live=0cd server && go test ./...
# allocator: cgo bridge + 5k random ops per allocator
# workload: generator correctness (no double-alloc, no use-after-free)
# metrics: log-bucket histogram + fragmentation score math
# engine: end-to-end run emits ≥10 well-formed ticks with packed cellsPlus a live smoke test:
./server/bin/heap-forge -addr :8080 &
./server/bin/smoketest -server localhost:8080 -workload redis_cache -length 4000
# connects, starts a race, asserts ≥5 ticks arrive with valid base64 cellsheap-forge/
├── c/
│ ├── include/ headers: heapforge.h, arena.h
│ ├── src/ arena, bump, segfit, buddy, slab, snapshot, registry
│ ├── test/ one test binary per allocator
│ └── Makefile builds libheapforge.a + .dylib + runs tests
├── server/ Go 1.26 module
│ ├── allocator/ cgo bindings
│ ├── workload/ 4 trace generators
│ ├── metrics/ log histogram + frag score
│ ├── engine/ race runner that emits ticks
│ ├── ws/ WebSocket hub
│ └── cmd/server|smoketest
├── web/ Vite + React + TS + Tailwind 3
│ ├── src/components Header, ControlPanel, AllocatorCard, HeapCanvas, …
│ └── scripts/ Playwright screenshot harness
└── docs/ screenshots
- Arena size = 2 MiB, chunk = 32 B. Small enough to render as a dense 256×256 canvas, large enough for every workload to push multiple size classes hard. The buddy allocator's max order is 21 (= log₂ 2 MiB).
- All four allocators share one snapshot schema. Every allocator paints
the same
cells[65536]array, so the UI treats them as interchangeable data sources with no per-allocator rendering code. - User-requested size is tracked per allocation, not block size. That lets the UI surface internal fragmentation (reserved − in-use) as well as external fragmentation.
- Latency measured on the C side of the bridge. If we timed in Go we'd be measuring the cgo call overhead, which is ~70–100 ns on Apple Silicon and would swamp the real allocator work for small, fast allocations.
- Slow clients drop ticks, not block the engine. Tick broadcasts are
non-blocking sends with
default:drop — the engine never waits on the network, so one slow browser can't back up the whole race.
- Single-threaded. Concurrent / threaded workloads aren't modelled.
- Allocators own a fixed 2 MiB arena; there's no
sbrk-style growth. - WASM build of the C allocators is an obvious future win (run in-browser,
no server round trip). The cgo harness exists partly so that WASM can
be bolted on later by swapping
Emit()implementations. - macOS and Linux only. No Windows work has been done on the C side
(the arena uses
mmap(MAP_ANON)).
MIT.
