
feat: size-routing fix for phase-crossing allocations #8

Closed
Barnadrot wants to merge 21 commits into main from feat/size-routing

Conversation

Barnadrot (Owner) commented May 6, 2026

Summary

  • Route allocations < 4096 bytes to the System allocator during active phases, preventing library-internal allocations from landing in recyclable arena memory
  • Extends Emile's flush_rayon fix (leanEthereum/leanMultisig@f5e2299), which addressed one class of phase-crossing allocations but proved insufficient against a broader second class
  • Adds ZK_ALLOC_MIN_BYTES, ZK_ALLOC_SLAB_GB, ZK_ALLOC_POISON_RESET env var controls
  • 6 new test files covering reproduction, stress (100 cycles), boundary cases, and concurrent span limits

Bug Class 1: Rayon Injector Blocks

Found by: Emile (leanEthereum/leanMultisig@f5e2299)
Fixed by: flush_rayon — 256 no-op rayon::join calls drain crossbeam-deque Injector slots

Rayon's crossbeam-deque allocates Injector blocks (~520B, 64 JobRef slots) during parallel work. Under #[global_allocator], these land in the arena during active phases. When begin_phase() recycles the slab, rayon still holds a pointer to the old block — next job push writes 17 bytes over recycled memory. Silent corruption, not a crash.
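
For reference, a minimal sketch of the flush_rayon shape (the real helper lives in the linked commit; this shape is an assumption from the description above):

```rust
// Drain stale crossbeam-deque Injector blocks by cycling rayon's scheduler;
// 256 no-op joins cover the 64-slot blocks allocated in the previous phase.
fn flush_rayon() {
    for _ in 0..256 {
        rayon::join(|| (), || ());
    }
}
```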

Characterization:

  • 10/10 phase cycles corrupted in standalone stress test; single 17-byte blast radius per boundary
  • Plonky3: deterministic crash on 2nd prove_and_verify (corruption cascades into tracing-subscriber HashMap → panic)
  • leanMultisig: does not fire under prove_loop — workload-specific, corruption misses observable state

Bug Class 2: Library Allocation Pooling

Found by: This investigation
Fixed by: Size-routing — allocations < 4096B bypass arena, go to System

flush_rayon does NOT fix Plonky3 — crashes persist with rayon-flush ON. A second class of phase-crossing allocations exists that flush_rayon cannot reach.

Root cause: Libraries like tracing-subscriber intentionally pool heap capacity across logical lifetimes. ExtensionsInner::clear() retains HashMap backing so future spans reuse it without reallocating. When the first span in a Registry slot is created during arena phase N, the HashMap backing lives in the arena. begin_phase(N+1) recycles it → use-after-free → panic/SIGSEGV.

Same pattern exists in: crossbeam-epoch Bags, sharded-slab Pages, hashbrown HashMap rehash buffers. All explicitly documented as retaining capacity.
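
The pooling idiom itself is ordinary Rust; a minimal hypothetical sketch modeled on the ExtensionsInner::clear() behavior described above (names are stand-ins, not tracing-subscriber's actual code):

```rust
use std::collections::HashMap;

// Stand-in for tracing-subscriber's ExtensionsInner.
struct Extensions {
    map: HashMap<u64, Box<[u8]>>,
}

impl Extensions {
    fn clear(&mut self) {
        // Drops the entries but deliberately keeps the backing capacity so the
        // next span in this slot reuses it without reallocating. If that
        // backing was arena-allocated in phase N, begin_phase(N+1) recycles it
        // while this struct still points at it.
        self.map.clear();
    }
}
```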

Key insight: Both bug classes share a property — problematic allocations are small (<4KB). The allocations zk-alloc needs to accelerate are large (polynomials, matrices, Merkle trees — all >>4KB). Size-routing exploits this gap. It also subsumes flush_rayon (Injector blocks are ~520B, well under threshold).
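
A minimal sketch of the routing decision under these assumptions (the crate's real slab and phase bookkeeping are elided; arena_alloc and in_arena are stand-ins for its internals):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, Ordering};

const MIN_ARENA_BYTES: usize = 4096; // default; overridable via ZK_ALLOC_MIN_BYTES

struct ZkAllocator {
    phase_active: AtomicBool,
}

impl ZkAllocator {
    // Stand-ins for the crate's per-thread bump-pointer slab internals.
    unsafe fn arena_alloc(&self, _layout: Layout) -> *mut u8 {
        unimplemented!("bump-pointer allocation from the slab")
    }
    fn in_arena(&self, _ptr: *mut u8) -> bool {
        false // real impl: range check against the slab region
    }
}

unsafe impl GlobalAlloc for ZkAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Small allocations (Injector blocks ~520 B, pooled HashMap backing,
        // sharded-slab pages) bypass the arena even mid-phase; the hot-path
        // ZK allocations (polynomials, matrices, Merkle trees) are all much
        // larger and still take the slab path.
        if layout.size() < MIN_ARENA_BYTES || !self.phase_active.load(Ordering::Relaxed) {
            System.alloc(layout)
        } else {
            self.arena_alloc(layout)
        }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Arena memory is reclaimed wholesale on begin_phase(); only
        // System-owned pointers are freed individually.
        if !self.in_arena(ptr) {
            System.dealloc(ptr, layout);
        }
    }
}
```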

Performance

No measurable regression:

  • Plonky3 prove: ~1.16-1.30s (unchanged)
  • leanMultisig prove_loop: 2.1s/proof × 30 proofs (unchanged)
  • Hot-path ZK allocations are all >> 4096 bytes and stay in arena

Validation

| Check | Result |
| --- | --- |
| Size-routing at the 4096 threshold | All Plonky3 examples pass cleanly |
| Perf vs the flush_rayon-only baseline | No regression |
| 100 phase cycles × 200 joins + 64 KB canary | 0 corrupted; with the fix disabled, SIGSEGV at cycle ~42 |
| Plonky3 prove (Poseidon1/BB, Poseidon2/KB, Poseidon2/BB-zk) | 100 iterations clean with full tracing |
| leanMultisig prove_loop | 30 proofs verified, 2.1 s/proof (no regression) |
| Threshold sweep | 512/1024/2048/4096 all safe; 64–128 unsafe (the ~520 B Injector blocks exceed those thresholds and still land in the arena) |
| Other ZK codebases (Pico, Jolt), locally available for benchmarking | No breakage. Neither is a production consumer today; leanMultisig is the current user, Plonky3 integration is planned |

Limitations

  • Threshold boundary: sharded-slab's second page (~6.4 KB) exceeds the 4096 threshold at 32+ concurrent spans. Empirically, a 512-concurrent-span workload fails at 4096 and passes at 6144. Typical ZK proving has <32 concurrent spans → safe at the default; raise ZK_ALLOC_MIN_BYTES for heavier tracing workloads.
  • Architectural: The bug class is intrinsic to #[global_allocator] arena patterns — any library that pools heap capacity across logical lifetimes is affected. Size-routing is the practical optimum; true robustness beyond the threshold requires scope-aware API changes.
  • Classification: These are memory-safety bugs (use-after-free, silent corruption). Whether corrupted prover memory could produce a verifier-accepted invalid proof (ZK soundness) is unproven in either direction — fuzzing experiment scoped as follow-up.

Test plan

  • cargo test --release — all tests pass
  • cargo clippy — clean
  • Plonky3 prove: 100 iterations clean
  • leanMultisig prove_loop: 30 proofs verified
  • Threshold sweep: 512/1024/2048/4096 all safe
  • CI

Barnadrot and others added 21 commits May 6, 2026 10:23
Route allocations smaller than MIN_ARENA_BYTES (default 4096) to System
allocator even during active phases. This prevents library-internal
allocations (tracing-subscriber Registry, hashbrown HashMap, crossbeam
Injector blocks) from landing in arena memory that gets recycled on
begin_phase().

Root cause: libraries like tracing-subscriber intentionally pool heap
capacity across logical lifetimes (ExtensionsInner::clear() retains
HashMap backing). When this pooled memory is arena-allocated, the next
phase recycles it while the library still holds a reference.

Key changes:
- MIN_ARENA_BYTES=4096 default (configurable via ZK_ALLOC_MIN_BYTES env)
- SLAB_SIZE now configurable via ZK_ALLOC_SLAB_GB env var
- POISON_RESET diagnostic mode (ZK_ALLOC_POISON_RESET=1) for debugging
- MADV_DONTNEED support in syscall module
- 6 test files covering reproduction, stress, boundary cases
- findings.tsv: full 15-finding audit trail from exp5

Verified: 100 iterations on Plonky3 prove, 30 proofs on leanMultisig,
no measurable performance regression. Replaces flush_rayon as primary
soundness fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ned Vec

A Vec >= MIN_ARENA_BYTES allocated in phase N and grown via push() in phase
N+1 (after arena reset) gets its preserved bytes silently replaced by phase
N+1's first allocation. The realloc impl calls copy_nonoverlapping(old_ptr,
new_ptr, layout.size()) — but old_ptr now references recycled memory.

Both tests (8 KB and 16 KB Vecs) confirm v[0..SIZE] reads OVERWRITE bytes
instead of the original FILL after a single push() across the boundary.

The size-routing fix (MIN_ARENA_BYTES=4096) does NOT address this — any
allocation above the threshold inherits the bug. This is a NEW class of
soundness issue, distinct from F1 (rayon Injector) and F4 (tracing Registry)
which are library-internal pooled allocations. This one is user-visible.

Tests assert the bug REPRODUCES (no fix deployed yet); flip when fix lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
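
A hedged repro sketch of the hazard this commit describes (the crate path zk_alloc and the free-function API are assumptions based on this PR; constants mirror the commit's 8 KB case):

```rust
fn phase_spanning_vec_repro() {
    zk_alloc::begin_phase(); // phase N
    let mut v: Vec<u8> = vec![0xAB; 8 * 1024]; // >= MIN_ARENA_BYTES -> arena
    zk_alloc::end_phase();

    zk_alloc::begin_phase(); // phase N+1: the slab holding v is recycled
    v.push(0); // realloc copies layout.size() bytes from the recycled old_ptr
    assert_eq!(v[0], 0xAB); // fails: FILL bytes replaced by phase N+1 data
}
```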
There is no RAII guard around begin_phase/end_phase. A panic that propagates
out before end_phase() is reached leaves ARENA_ACTIVE=true. Subsequent
allocations made by user code that believes the phase is over land in arena
and are silently recycled on the next begin_phase().

Test confirms: post-panic Vec lands at slab+0; the next begin_phase + 1MB
allocation overlaps that range; post_panic[0] reads 0x33 (the new phase's
fill) instead of the original 0xCC.

This is a public-API hazard, not a library-internal bug. Recommended fix:
introduce a PhaseGuard struct with Drop-based end_phase, or document that
phases are unsafe across panics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
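
A hedged sketch of the panic hazard (API names per this PR; may_panic is illustrative):

```rust
fn panic_leaves_phase_open() {
    zk_alloc::begin_phase();
    let _ = std::panic::catch_unwind(|| may_panic()); // unwinds past end_phase()

    // end_phase() never ran: ARENA_ACTIVE is still true, so this Vec lands
    // at slab+0 and is recycled by the next begin_phase().
    let post_panic = vec![0xCCu8; 8192];

    zk_alloc::begin_phase();
    let _filler = vec![0x33u8; 1 << 20]; // overlaps post_panic's range
    assert_eq!(post_panic[0], 0xCC); // fails: reads the new phase's 0x33 fill
}
```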
crossbeam-deque's Worker::push doubles its Buffer at fill capacity and
never shrinks. Deep rayon::join recursion (depth 512+) drives buffer
growth past 4 KB → lands in arena. Phase boundary recycles the slab; the
worker still references the old buffer pointer. If the worker subsequently
allocates heap data in phase N+1 that overlaps the buffer's offset, push/
pop reads/writes corrupt memory.

Findings:
- With ZK_ALLOC_MIN_BYTES=0 + rayon-flush off: SIGSEGV (deterministic)
- With ZK_ALLOC_MIN_BYTES=4096 (default): 0/20 cycles corrupted
- With ZK_ALLOC_MIN_BYTES=4096 + rayon-flush off: 0/20 cycles corrupted

Size-routing alone is sufficient; rayon-flush adds nothing for this case.
Worker buffers stay below the 4 KB threshold for typical workloads; the
size-routing fix correctly diverts the small intermediate growth steps.

Three tests:
- worker_deque_growth_during_phase (canary in main slab — does not fire)
- deep_recursion_phase_cycle_program_integrity (program survives 50 cycles)
- worker_buffer_growth_with_per_worker_canary (per-worker canary — fires under no-fix)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
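
The recursion driver is simple; a sketch of the pattern the tests presumably use (the name matches the nested_join mentioned in the later macOS commit):

```rust
// Deep rayon::join recursion grows the worker's crossbeam-deque Buffer;
// each doubling past 4 KB would land in the arena without size-routing.
fn nested_join(depth: usize) {
    if depth == 0 {
        return;
    }
    rayon::join(|| nested_join(depth - 1), || ());
}
```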
GENERATION is a global atomic. begin_phase() from any thread bumps it,
forcing every other thread's next allocation through the cold-reset path
and invalidating arena data they still hold. The slabs are per-thread, but
the generation counter is global — so cross-thread phase calls reset the
wrong slab from the wrong actor.

Tests:
- cross_thread_begin_phase_invalidates_data: deterministic. T1 holds an
  arena Vec; T2 calls begin_phase via barrier-controlled handshake; T1's
  next alloc lands at slab+0 over v's bytes. Asserts v_corrupted=true.
- two_threads_running_lifecycle_concurrently_corrupt_each_other: stress
  observation only — race window for inter-alloc cross-thread interference
  is too tight to assert reliably.
- concurrent_phase_stress_no_crash: 4 worker threads spam begin/end+alloc
  while main does 5000 begin/end cycles; verifies allocator atomics survive.

Same hazard class as F17 (no RAII guard): public-API misuse silently
corrupts arena data. Recommended fix: scoped begin_phase that returns
PhaseGuard, taking a marker that prevents concurrent / nested misuse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
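
A hedged sketch of the deterministic reproducer (zk_alloc paths assumed; the barriers stand in for the test's handshake):

```rust
use std::sync::{Arc, Barrier};

fn cross_thread_invalidation() {
    let barrier = Arc::new(Barrier::new(2));
    let b = barrier.clone();
    let t1 = std::thread::spawn(move || {
        zk_alloc::begin_phase();
        let v = vec![0xAAu8; 8192]; // lives in T1's per-thread slab
        b.wait(); // hand off: T2 bumps the global GENERATION
        b.wait();
        let w = vec![0x55u8; 8192]; // cold-reset path: lands at slab+0, over v
        assert_ne!(v[0], 0xAA); // v invalidated by a phase T1 never began
        drop(w);
    });
    barrier.wait();
    zk_alloc::begin_phase(); // T2: the global bump resets T1's slab view
    barrier.wait();
    t1.join().unwrap();
}
```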
Drives many crossbeam-deque buffer growths during a phase to retire
objects to crossbeam-epoch's Bag<Garbage>, then crosses the boundary and
drives more retires. F6 hypothesized Bag nodes (~1.5 KB) are caught by
size-routing; this confirms empirically.

Empirical observation:
- ZK_ALLOC_MIN_BYTES=4096 (default): 50/50 cycles clean
- ZK_ALLOC_MIN_BYTES=0 + rayon-flush off: 50/50 cycles clean (Bag-cycling
  test alone — par_iter test hangs same as F11, not a Bag issue)

Bag nodes do not appear to be a load-bearing vector for the bug class —
they're small enough to bypass arena under default routing and small
enough not to overlap dangerously even without it (in this workload).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ThreadPoolBuilder::build() allocates Registry, ThreadInfo arrays, initial
deques, and Sleep state. Built during an active phase, those land in arena;
held across phase boundaries → Registry pointers reference recycled memory;
.install() walks corrupted scheduler state.

Findings:
- ZK_ALLOC_MIN_BYTES=4096 (default): 10 cycles clean (allocations bypass
  arena under size-routing).
- ZK_ALLOC_MIN_BYTES=0 + rayon-flush off: deterministic SIGSEGV on first
  cycle's .install().
- Pool built BEFORE any phase: clean across 50 phase cycles, even without
  the fix. Confirms the hazard is specific to mid-phase construction.

Three tests: build during phase + use across boundary, build+use+drop in
single phase, build before phase + use across many.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
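
A hedged sketch of the safe construction order this commit confirms: build the pool before any phase so its Registry, deques, and Sleep state live in System memory.

```rust
fn pool_before_phase() {
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .expect("pool");

    zk_alloc::begin_phase();
    pool.install(|| {
        // parallel proving work; scheduler state is System-backed and
        // survives every begin_phase() recycle
    });
    zk_alloc::end_phase();
}
```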
Arc::new allocates ArcInner<T> = (strong_count, weak_count, T). For T
large enough that the allocation is >= MIN_ARENA_BYTES (default 4096),
the ArcInner lands in arena. Holding the Arc across a phase boundary
puts the refcount fields in the path of recycled writes — phase 2's first
allocation overwrites them with arbitrary bytes.

Test confirms strong_count reads 0x5555555555555555 after a 1 MB filler
overwrites the slab. clone() increments to 0x...5556. Arc continues
operating but refcount can never reach zero → silent memory leak. Worst
cases: refcount underflow (premature drop), use-after-free, or — if
ArcInner aligns differently — SIGSEGV on the AtomicUsize CAS.

Same fundamental bug class as F16 (data retained across phase corrupted
by reset) but applies to refcounted allocations specifically. Size-routing
covers small Arcs (e.g., Arc<u64>) but not large payloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
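
A hedged repro sketch (sizes per this commit; the zk_alloc API is assumed):

```rust
use std::sync::Arc;

fn arc_across_phase() {
    zk_alloc::begin_phase();
    let big: Arc<[u8; 8192]> = Arc::new([0u8; 8192]); // ArcInner >= 4096 -> arena

    zk_alloc::begin_phase(); // slab recycled under the ArcInner
    let _filler = vec![0x55u8; 1 << 20]; // overwrites the refcount fields
    let _clone = big.clone(); // bumps a garbage strong count (0x...5556);
                              // the count can never reach zero -> silent leak
}
```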
HashMap (hashbrown) allocates one contiguous block for ctrl bytes + bucket
array. For a HashMap::<u64, u64>::with_capacity(4096), that block is
~64 KB — above MIN_ARENA_BYTES, so it lands in arena. Held across a phase
boundary, phase 2's first allocation overwrites the entire structure.

Detection: HashMap.get on corrupted ctrl bytes (all 0x55 from filler)
infinite-loops in linear probing — every slot looks "full" but no hash
matches. Test uses a worker thread with mpsc::recv_timeout to detect hang
as the corruption signal. Confirmed: 2s timeout fires deterministically.

Same bug class as F22 (Arc) — any large allocation retained across phase
is corrupted. Worse failure mode here: silent hang rather than memory
leak.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A trait object is (data_ptr, vtable_ptr). The vtable lives in .rodata so
dispatch survives any heap corruption. The data lives in the Box's heap
allocation; held across a phase boundary, the bytes are overwritten by
phase 2's first allocation.

Tests:
- box_dyn_trait_data_corrupted_silent: confirms sentinel field reads
  0x5555555555555555 (filler bytes) and payload_sum returns garbage. Vtable
  dispatch worked; data was wrong.
- box_dyn_vtable_dispatch_survives_data_corruption: trait method only
  increments a global atomic; runs correctly across corrupted self.

Failure mode is silent wrong values — no panic, no SIGSEGV, no hang.
Worst form of bug class for correctness checks. If the trait impl had
chased a pointer in self (e.g., self.inner_box.method()), it would
SIGSEGV on dereferencing 0x5555_5555_5555_5555.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add PhaseGuard struct with Drop-based end_phase, plus a phase(|| { ... })
scoped helper. Both compose with the existing free-function API; nothing
breaks. Use in place of paired begin_phase/end_phase calls when the phase
body can panic.

Test (test_phase_guard.rs):
- phase_guard_runs_end_phase_on_panic: catch_unwind a panicking phase()
  body; verify post-unwind allocations land in System (not arena).
  Mirrors the F17 reproducer with the fix applied — passes deterministically.
- phase_guard_runs_end_phase_on_normal_return: sanity check.
- nested_phase_guards_compose: nested phase() works.

Does not address F19 (concurrent begin/end) — same begin_phase atomics,
no cross-thread protection. That needs a separate scoped/exclusive API.

cargo fmt also reformatted test_box_dyn_phase.rs in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
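
A hedged sketch of the PhaseGuard shape this commit describes (exact signatures are assumptions; begin_phase/end_phase per this PR):

```rust
pub struct PhaseGuard(());

impl PhaseGuard {
    pub fn new() -> Self {
        begin_phase();
        PhaseGuard(())
    }
}

impl Drop for PhaseGuard {
    fn drop(&mut self) {
        // Runs on normal return AND during unwinding, closing the F17 hole.
        end_phase();
    }
}

/// Scoped helper: the phase ends even if `body` panics.
pub fn phase<R>(body: impl FnOnce() -> R) -> R {
    let _guard = PhaseGuard::new();
    body()
}
```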
Document that Vec/Arc/HashMap/Box<dyn Trait> >= MIN_ARENA_BYTES held
across begin_phase() are silently overwritten. Lists each container's
specific failure mode and points to clone()-via-system as the workaround.

This is documentation only; behavior unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These tests demonstrate that retaining arena allocations across phase
boundaries silently corrupts user data — they're educational about the
phase-scoping contract, not standard regression tests. They live on the
exp5-soundness-tests branch alongside the soundness audit notes.

Removed: test_realloc_phase, test_arc_phase, test_hashmap_phase,
test_box_dyn_phase. Tests that exercise the allocator's correctness
under valid usage (test_panic_phase, test_concurrent_phase,
test_worker_deque_growth, test_crossbeam_epoch, test_threadpool_resize,
test_phase_guard) stay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an allocation's original ptr is not in the arena region (it came
from System — either pre-phase, sub-threshold, or otherwise routed),
realloc now goes through System.realloc instead of self.alloc + copy.

This prevents the silent System→arena migration on growth: a Vec
allocated outside a phase (System) that grows inside a phase would
otherwise have its new buffer placed in the arena, where it becomes
subject to phase recycling.

System.realloc may also grow in place (mremap on Linux), avoiding the
explicit copy that self.alloc + copy_nonoverlapping forces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
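
Continuing the earlier ZkAllocator sketch, the sticky-System realloc dispatch might look like this (slots into the `unsafe impl GlobalAlloc`; in_arena and the arena path remain stand-ins):

```rust
unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
    if !self.in_arena(ptr) {
        // Pointer came from System (pre-phase, sub-threshold, or otherwise
        // routed): keep it there. This blocks the silent System -> arena
        // migration on growth and lets System grow in place (mremap on Linux).
        return System.realloc(ptr, layout, new_size);
    }
    // Arena path: bump-alloc a new block and copy the preserved bytes.
    let new_layout = Layout::from_size_align_unchecked(new_size, layout.align());
    let new_ptr = self.alloc(new_layout);
    if !new_ptr.is_null() {
        std::ptr::copy_nonoverlapping(ptr, new_ptr, layout.size().min(new_size));
    }
    new_ptr
}
```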
Document explicitly that ZkAllocator dispatches between arena and System,
with size-routing and sticky-System realloc keeping the boundaries clean.
Spell out the phase-scoping contract (don't retain across begin_phase),
the available env vars (SLAB_GB / MIN_BYTES / POISON_RESET), and update
the example to use phase() / PhaseGuard for panic safety.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the single-loop example with one using phase() (panic-safe), then
add three new sections: the two-allocator model (arena vs System and what
goes where), the phase-scoping contract (don't retain across begin_phase,
construct long-lived state pre-phase, prefer phase() RAII), and a table
of the three env vars (SLAB_GB / MIN_BYTES / POISON_RESET).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On macOS aarch64 the test's two allocations land at different addresses
(T1's first alloc goes to System ~5GB while phase-2 arena allocations
sit ~500GB up), so v and w never alias and the cross-thread invalidation
isn't observable from this test. The bug is real on macOS too — the
allocator's GENERATION atomic is global — but this specific reproducer
needs aliasing to detect it.

Track aliasing via a separate AtomicBool and only assert corruption when
v and w aliased. Otherwise pass with a "test inconclusive on this
platform" note. Linux behavior unchanged (deterministic corruption).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
macOS aarch64 has a smaller default thread stack than Linux, and debug
builds inflate per-frame size. nested_join recursion at depths 1024 /
2048 / 512 overflowed the rayon worker stack on the macOS CI runner.

Cap all DEPTH constants at 256. That still drives crossbeam-deque's
Buffer to its 256-slot capacity (~4 KB including the header), which crosses
MIN_ARENA_BYTES=4096 — so the test still exercises the size-routing
boundary the original depths targeted.

Verified on Linux in both release and debug builds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 3 tests touch process-global state — ARENA_ACTIVE, the bump pointer,
and the panic hook — and Cargo runs tests in parallel by default. With
nested_phase_guards_compose racing phase_guard_runs_end_phase_on_normal_return,
the latter could observe ARENA_ACTIVE=true at the wrong moment and route
its `after` allocation into the arena, where the next phase's filler
recycled it.

Add a file-local `static PHASE_LOCK: Mutex<()>` and grab it at the top of
each test. No external deps; serializes only within this test binary.
Other test binaries keep their global state isolated by living in
separate processes (per Cargo's test-binary boundary).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
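
A hedged sketch of the file-local serialization described above:

```rust
use std::sync::{Mutex, MutexGuard};

static PHASE_LOCK: Mutex<()> = Mutex::new(());

fn serial() -> MutexGuard<'static, ()> {
    // Recover from poisoning so one panicking test doesn't fail the rest.
    PHASE_LOCK.lock().unwrap_or_else(|p| p.into_inner())
}

#[test]
fn phase_guard_runs_end_phase_on_normal_return() {
    let _serial = serial();
    // ... test body touches ARENA_ACTIVE / the bump pointer safely ...
}
```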
Barnadrot closed this on May 6, 2026.