Skip to content

rayancheca/heap-forge

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

heap·forge

Side-by-side benchmark and real-time fragmentation visualizer for custom memory allocators battling production-shaped workloads.

C11 Go TypeScript Build

screenshot


Why

Engineers pick allocators — glibc malloc, jemalloc, tcmalloc, mimalloc — without any real intuition for how they actually behave under the workloads they run in production. Fragmentation, cache locality, and tail latency stay invisible until something is on fire in staging.

HeapForge makes allocator internals observable. It replays the same synthesized malloc/free stream through four hand-rolled allocators in lockstep and paints every arena live at 32-byte granularity so you can see fragmentation appear and coalesce in real time.

The point isn't to beat jemalloc. The point is to develop intuition about where each strategy wins and where it falls apart.


What's inside

Four allocators, each implemented from scratch in C11 against a common arena + vtable interface:

Name Strategy What it shows
bump Cursor-only arena Fastest alloc possible; zero reclamation
segfit Segregated free lists Boundary tags + coalesce on free — the classic malloc shape
buddy Binary-buddy Power-of-two split/merge; guaranteed contiguous merges
slab Per-class object cache Intrusive free lists; locality-optimised small objects

Four workload generators that mimic real allocation patterns:

Workload Pattern
redis_cache Log-normal size mix, ~20% free rate
nginx_requests Bursty alloc-then-free-all request lifecycles
python_repl Heavy churn, 92% of allocations in the 16–96 B bins
adversarial Crafted to build a fragmentation lattice that frustrates non-coalescing allocators

A Go server replays the trace through every allocator simultaneously, times each call on the C side of the cgo bridge with CLOCK_MONOTONIC, and streams tick snapshots (packed cell states + latency histograms + frag scores) to the browser over WebSocket.

A React / TypeScript front end renders each allocator's 2 MiB arena as a 256×256 pixel heatmap, plots log₂-bucketed latency distributions, and keeps a live leaderboard across p50 / p99 / fragmentation / alloc-failure counts.


Quick start

# 1. Build the C allocator library + static .a
cd c && make
cd ..

# 2. Build the Go server (requires cgo)
cd server && go build -o bin/heap-forge ./cmd/server
cd ..

# 3. Build the web bundle
cd web && pnpm install && pnpm build
cd ..

# 4. Start the server, serving the bundle
./server/bin/heap-forge -addr :8080 -web ./web/dist

# open http://localhost:8080, pick a workload, press RUN RACE.

For live frontend iteration:

# terminal 1
./server/bin/heap-forge -addr :8080

# terminal 2
cd web && pnpm dev
# opens http://localhost:5173 with /api + /ws proxied to :8080

Architecture

┌──────────────────────────────────────────────────────────┐
│ Browser                                                   │
│  React + D3 scales + HTML5 canvas                        │
│  • HeapCanvas: 256×256 pixel heatmap per allocator       │
│  • LatencyChart: log2-bucketed histogram, p50/p90/p99    │
│  • Scorecard: 4-way leaderboard                          │
└──────────────────────────────────────────────────────────┘
                    ↑ WebSocket /ws   ↓ POST /api/run
┌──────────────────────────────────────────────────────────┐
│ Go server (cmd/server)                                    │
│  • engine.Run: replays a []workload.Op through every     │
│    compiled allocator in lockstep                        │
│  • metrics: log-bucketed nanosecond histogram,           │
│    fragmentation score                                   │
│  • ws.Hub: broadcasts tick JSON with 2-bit packed cells  │
└──────────────────────────────────────────────────────────┘
                    ↕ cgo trampoline (times on C side)
┌──────────────────────────────────────────────────────────┐
│ libheapforge.a (C11)                                     │
│  • arena.c:    mmap-backed 2 MiB region, aligned         │
│  • bump.c:     8 B header, cursor-only, dead-bit tracking │
│  • segfit.c:   12 size classes, boundary tags, coalesce  │
│  • buddy.c:    orders 6..21, XOR-indexed buddy lookup    │
│  • slab.c:     8 classes, 16 KiB slabs, intrusive flinks │
│  • snapshot.c: per-chunk cell labels + frag scan         │
└──────────────────────────────────────────────────────────┘

The bridge

cgo can't call through a C function pointer directly. Every allocator call goes through a thin C trampoline that pairs two clock_gettime(CLOCK_MONOTONIC) reads around the vtable dispatch, so reported latencies exclude Go-side cgo overhead:

static int hf_go_alloc(const hf_vtable* v, hf_allocator* a, size_t size,
                       void** ptr_out, uint64_t* out_ns) {
    uint64_t t0 = hf_now_ns();
    void* p = v->alloc(a, size);
    uint64_t t1 = hf_now_ns();
    *ptr_out = p;
    *out_ns  = t1 - t0;
    return p ? 0 : -1;
}

Snapshot wire format

Every allocator exports an hf_snapshot containing a cells[65536] array where each byte encodes one of four states (free / used / header / reserved) for one 32-byte chunk of the 2 MiB arena. To keep the WebSocket payload small, Go packs four 2-bit cells per byte before base64-encoding, cutting snapshot payloads from 65 536 to 16 384 bytes per allocator per tick (~21 KiB base64).

Latency histogram

metrics.Histogram is a 32-bucket log2 histogram where bucket i holds samples where 2^(i-1) < ns ≤ 2^i. Percentiles are computed by walking the bucket array. This gives us p50, p90, p99 in O(32) without storing a sample buffer.

Fragmentation score

frag_score = 1 − (largest_free_run / total_free_bytes)

0 = every free byte is in one contiguous run (no external fragmentation). 1 = arena is shattered into a lattice of unusably small holes. This is a pure external fragmentation signal — bump scores 0 here because its free region is always a single contiguous suffix of the arena, even though it is bleeding unreclaimed allocations in the used prefix. The heap canvas makes that failure mode obvious in a way a single scalar can't.


What the tests cover

cd c && make test
# bump     : monotonic layout, accounting, fill-to-exhaustion
# segfit   : 20k random ops, final free-all coalesces to one 2 MiB block
# buddy    : 30k random ops, final free-all collapses to one 2 MiB block
# slab     : 40k random ops across all 8 classes, final live=0
cd server && go test ./...
# allocator: cgo bridge + 5k random ops per allocator
# workload:  generator correctness (no double-alloc, no use-after-free)
# metrics:   log-bucket histogram + fragmentation score math
# engine:    end-to-end run emits ≥10 well-formed ticks with packed cells

Plus a live smoke test:

./server/bin/heap-forge -addr :8080 &
./server/bin/smoketest -server localhost:8080 -workload redis_cache -length 4000
# connects, starts a race, asserts ≥5 ticks arrive with valid base64 cells

Project layout

heap-forge/
├── c/
│   ├── include/        headers: heapforge.h, arena.h
│   ├── src/            arena, bump, segfit, buddy, slab, snapshot, registry
│   ├── test/           one test binary per allocator
│   └── Makefile        builds libheapforge.a + .dylib + runs tests
├── server/             Go 1.26 module
│   ├── allocator/      cgo bindings
│   ├── workload/       4 trace generators
│   ├── metrics/        log histogram + frag score
│   ├── engine/         race runner that emits ticks
│   ├── ws/             WebSocket hub
│   └── cmd/server|smoketest
├── web/                Vite + React + TS + Tailwind 3
│   ├── src/components  Header, ControlPanel, AllocatorCard, HeapCanvas, …
│   └── scripts/        Playwright screenshot harness
└── docs/               screenshots

Design decisions

  • Arena size = 2 MiB, chunk = 32 B. Small enough to render as a dense 256×256 canvas, large enough for every workload to push multiple size classes hard. The buddy allocator's max order is 21 (= log₂ 2 MiB).
  • All four allocators share one snapshot schema. Every allocator paints the same cells[65536] array, so the UI treats them as interchangeable data sources with no per-allocator rendering code.
  • User-requested size is tracked per allocation, not block size. That lets the UI surface internal fragmentation (reserved − in-use) as well as external fragmentation.
  • Latency measured on the C side of the bridge. If we timed in Go we'd be measuring the cgo call overhead, which is ~70–100 ns on Apple Silicon and would swamp the real allocator work for small, fast allocations.
  • Slow clients drop ticks, not block the engine. Tick broadcasts are non-blocking sends with default: drop — the engine never waits on the network, so one slow browser can't back up the whole race.

Known limits / not shipped

  • Single-threaded. Concurrent / threaded workloads aren't modelled.
  • Allocators own a fixed 2 MiB arena; there's no sbrk-style growth.
  • WASM build of the C allocators is an obvious future win (run in-browser, no server round trip). The cgo harness exists partly so that WASM can be bolted on later by swapping Emit() implementations.
  • macOS and Linux only. No Windows work has been done on the C side (the arena uses mmap(MAP_ANON)).

License

MIT.

About

Side-by-side benchmark and real-time fragmentation visualizer for custom memory allocators battling production workloads.

Topics

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors