heap·forge

Side-by-side benchmark and real-time fragmentation visualizer for custom memory allocators battling production-shaped workloads.

Why

Engineers pick allocators — glibc malloc, jemalloc, tcmalloc, mimalloc — without any real intuition for how they actually behave under the workloads they run in production. Fragmentation, cache locality, and tail latency stay invisible until something is on fire in staging.

HeapForge makes allocator internals observable. It replays the same synthesized malloc/free stream through four hand-rolled allocators in lockstep and paints every arena live at 32-byte granularity so you can see fragmentation appear and coalesce in real time.

The point isn't to beat jemalloc. The point is to develop intuition about where each strategy wins and where it falls apart.

What's inside

Four allocators, each implemented from scratch in C11 against a common arena + vtable interface:

Name	Strategy	What it shows
`bump`	Cursor-only arena	Fastest alloc possible; zero reclamation
`segfit`	Segregated free lists	Boundary tags + coalesce on free — the classic malloc shape
`buddy`	Binary-buddy	Power-of-two split/merge; guaranteed contiguous merges
`slab`	Per-class object cache	Intrusive free lists; locality-optimised small objects

Four workload generators that mimic real allocation patterns:

Workload	Pattern
`redis_cache`	Log-normal size mix, ~20% free rate
`nginx_requests`	Bursty alloc-then-free-all request lifecycles
`python_repl`	Heavy churn, 92% of allocations in the 16–96 B bins
`adversarial`	Crafted to build a fragmentation lattice that frustrates non-coalescing allocators

A Go server replays the trace through every allocator simultaneously, times each call on the C side of the cgo bridge with CLOCK_MONOTONIC, and streams tick snapshots (packed cell states + latency histograms + frag scores) to the browser over WebSocket.

A React / TypeScript front end renders each allocator's 2 MiB arena as a 256×256 pixel heatmap, plots log₂-bucketed latency distributions, and keeps a live leaderboard across p50 / p99 / fragmentation / alloc-failure counts.

Quick start

# 1. Build the C allocator library + static .a
cd c && make
cd ..

# 2. Build the Go server (requires cgo)
cd server && go build -o bin/heap-forge ./cmd/server
cd ..

# 3. Build the web bundle
cd web && pnpm install && pnpm build
cd ..

# 4. Start the server, serving the bundle
./server/bin/heap-forge -addr :8080 -web ./web/dist

# open http://localhost:8080, pick a workload, press RUN RACE.

For live frontend iteration:

# terminal 1
./server/bin/heap-forge -addr :8080

# terminal 2
cd web && pnpm dev
# opens http://localhost:5173 with /api + /ws proxied to :8080

Architecture

┌──────────────────────────────────────────────────────────┐
│ Browser                                                   │
│  React + D3 scales + HTML5 canvas                        │
│  • HeapCanvas: 256×256 pixel heatmap per allocator       │
│  • LatencyChart: log2-bucketed histogram, p50/p90/p99    │
│  • Scorecard: 4-way leaderboard                          │
└──────────────────────────────────────────────────────────┘
                    ↑ WebSocket /ws   ↓ POST /api/run
┌──────────────────────────────────────────────────────────┐
│ Go server (cmd/server)                                    │
│  • engine.Run: replays a []workload.Op through every     │
│    compiled allocator in lockstep                        │
│  • metrics: log-bucketed nanosecond histogram,           │
│    fragmentation score                                   │
│  • ws.Hub: broadcasts tick JSON with 2-bit packed cells  │
└──────────────────────────────────────────────────────────┘
                    ↕ cgo trampoline (times on C side)
┌──────────────────────────────────────────────────────────┐
│ libheapforge.a (C11)                                     │
│  • arena.c:    mmap-backed 2 MiB region, aligned         │
│  • bump.c:     8 B header, cursor-only, dead-bit tracking │
│  • segfit.c:   12 size classes, boundary tags, coalesce  │
│  • buddy.c:    orders 6..21, XOR-indexed buddy lookup    │
│  • slab.c:     8 classes, 16 KiB slabs, intrusive flinks │
│  • snapshot.c: per-chunk cell labels + frag scan         │
└──────────────────────────────────────────────────────────┘

The bridge

cgo can't call through a C function pointer directly. Every allocator call goes through a thin C trampoline that pairs two clock_gettime(CLOCK_MONOTONIC) reads around the vtable dispatch, so reported latencies exclude Go-side cgo overhead:

static int hf_go_alloc(const hf_vtable* v, hf_allocator* a, size_t size,
                       void** ptr_out, uint64_t* out_ns) {
    uint64_t t0 = hf_now_ns();
    void* p = v->alloc(a, size);
    uint64_t t1 = hf_now_ns();
    *ptr_out = p;
    *out_ns  = t1 - t0;
    return p ? 0 : -1;
}

Snapshot wire format

Every allocator exports an hf_snapshot containing a cells[65536] array where each byte encodes one of four states (free / used / header / reserved) for one 32-byte chunk of the 2 MiB arena. To keep the WebSocket payload small, Go packs four 2-bit cells per byte before base64-encoding, cutting snapshot payloads from 65 536 to 16 384 bytes per allocator per tick (~21 KiB base64).

Latency histogram

metrics.Histogram is a 32-bucket log2 histogram where bucket i holds samples where 2^(i-1) < ns ≤ 2^i. Percentiles are computed by walking the bucket array. This gives us p50, p90, p99 in O(32) without storing a sample buffer.

Fragmentation score

frag_score = 1 − (largest_free_run / total_free_bytes)

0 = every free byte is in one contiguous run (no external fragmentation). 1 = arena is shattered into a lattice of unusably small holes. This is a pure external fragmentation signal — bump scores 0 here because its free region is always a single contiguous suffix of the arena, even though it is bleeding unreclaimed allocations in the used prefix. The heap canvas makes that failure mode obvious in a way a single scalar can't.

What the tests cover

cd c && make test
# bump     : monotonic layout, accounting, fill-to-exhaustion
# segfit   : 20k random ops, final free-all coalesces to one 2 MiB block
# buddy    : 30k random ops, final free-all collapses to one 2 MiB block
# slab     : 40k random ops across all 8 classes, final live=0

cd server && go test ./...
# allocator: cgo bridge + 5k random ops per allocator
# workload:  generator correctness (no double-alloc, no use-after-free)
# metrics:   log-bucket histogram + fragmentation score math
# engine:    end-to-end run emits ≥10 well-formed ticks with packed cells

Plus a live smoke test:

./server/bin/heap-forge -addr :8080 &
./server/bin/smoketest -server localhost:8080 -workload redis_cache -length 4000
# connects, starts a race, asserts ≥5 ticks arrive with valid base64 cells

Project layout

heap-forge/
├── c/
│   ├── include/        headers: heapforge.h, arena.h
│   ├── src/            arena, bump, segfit, buddy, slab, snapshot, registry
│   ├── test/           one test binary per allocator
│   └── Makefile        builds libheapforge.a + .dylib + runs tests
├── server/             Go 1.26 module
│   ├── allocator/      cgo bindings
│   ├── workload/       4 trace generators
│   ├── metrics/        log histogram + frag score
│   ├── engine/         race runner that emits ticks
│   ├── ws/             WebSocket hub
│   └── cmd/server|smoketest
├── web/                Vite + React + TS + Tailwind 3
│   ├── src/components  Header, ControlPanel, AllocatorCard, HeapCanvas, …
│   └── scripts/        Playwright screenshot harness
└── docs/               screenshots

Design decisions

Arena size = 2 MiB, chunk = 32 B. Small enough to render as a dense 256×256 canvas, large enough for every workload to push multiple size classes hard. The buddy allocator's max order is 21 (= log₂ 2 MiB).
All four allocators share one snapshot schema. Every allocator paints the same cells[65536] array, so the UI treats them as interchangeable data sources with no per-allocator rendering code.
User-requested size is tracked per allocation, not block size. That lets the UI surface internal fragmentation (reserved − in-use) as well as external fragmentation.
Latency measured on the C side of the bridge. If we timed in Go we'd be measuring the cgo call overhead, which is ~70–100 ns on Apple Silicon and would swamp the real allocator work for small, fast allocations.
Slow clients drop ticks, not block the engine. Tick broadcasts are non-blocking sends with default: drop — the engine never waits on the network, so one slow browser can't back up the whole race.

Known limits / not shipped

Single-threaded. Concurrent / threaded workloads aren't modelled.
Allocators own a fixed 2 MiB arena; there's no sbrk-style growth.
WASM build of the C allocators is an obvious future win (run in-browser, no server round trip). The cgo harness exists partly so that WASM can be bolted on later by swapping Emit() implementations.
macOS and Linux only. No Windows work has been done on the C side (the arena uses mmap(MAP_ANON)).

License

MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
c		c
docs		docs
server		server
web		web
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
PLAN.md		PLAN.md
README.md		README.md
state.md		state.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

heap·forge

Why

What's inside

Quick start

Architecture

The bridge

Snapshot wire format

Latency histogram

Fragmentation score

What the tests cover

Project layout

Design decisions

Known limits / not shipped

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

heap·forge

Why

What's inside

Quick start

Architecture

The bridge

Snapshot wire format

Latency histogram

Fragmentation score

What the tests cover

Project layout

Design decisions

Known limits / not shipped

License

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages