fix: safe arena routing (size-routing + sticky-System realloc)#9
Merged
fix: safe arena routing (size-routing + sticky-System realloc)#9
Conversation
Route allocations smaller than MIN_ARENA_BYTES (default 4096) to System
even during active phases. This prevents library-internal allocations
(tracing-subscriber Registry, hashbrown HashMap, crossbeam Injector
blocks) from landing in arena memory that gets recycled on begin_phase().
Additionally:
- Sticky-System routing in realloc: if the original pointer came from
System, growth stays in System too. Prevents silent migration of
Vec/HashMap into arena on push/insert.
- PhaseGuard RAII: begin_phase() on construction, end_phase() on drop.
phase(|| { ... }) convenience wrapper. Prevents panic-leaves-arena-active.
- Configurable SLAB_SIZE via ZK_ALLOC_SLAB_GB env var (default 8).
- Module-level docs rewritten for the two-allocator model.
- README updated with usage section, phase-scoping contract, env vars.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_rayon: Tom's original MRE — rayon::join from non-worker thread corrupts canary across phase boundary (fixed by size-routing). - test_size_routing_stress: 100-cycle stress with canaries. - test_phase_guard: PhaseGuard prevents panic-leaves-arena-active, normal-return end_phase, and nested guard composition. - test_crossbeam_epoch: empirical proof that crossbeam-epoch deferred garbage (Bag nodes) stays safe under size-routing (< 4KB). - test_panic_phase: documents the panic-without-end_phase hazard that PhaseGuard addresses. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
phase(|| { ... })wrapper ensuresend_phase()runs on both normal return and panic unwindflush_rayonfix (leanEthereum/leanMultisig@f5e2299) which addressed one class of phase-crossing allocations but was proven insufficient for a broader second classZK_ALLOC_MIN_BYTES,ZK_ALLOC_SLAB_GBenv var controlsBug Class 1: Rayon Injector Blocks
Found by: Emile (leanEthereum/leanMultisig@f5e2299)
Fixed by:
flush_rayon— 256 no-oprayon::joincalls drain crossbeam-deque Injector slotsRayon's crossbeam-deque allocates Injector blocks (~520B, 64 JobRef slots) during parallel work. Under
#[global_allocator], these land in the arena during active phases. Whenbegin_phase()recycles the slab, rayon still holds a pointer to the old block — next job push writes 17 bytes over recycled memory. Silent corruption, not a crash.Characterization:
Bug Class 2: Library Allocation Pooling
Found by: This investigation
Fixed by: Size-routing — allocations < 4096B bypass arena, go to System
flush_rayondoes NOT fix Plonky3 — crashes persist with rayon-flush ON. A second class of phase-crossing allocations exists that flush_rayon cannot reach.Root cause: Libraries like
tracing-subscriberintentionally pool heap capacity across logical lifetimes.ExtensionsInner::clear()retains HashMap backing so future spans reuse it without reallocating. When the first span in a Registry slot is created during arena phase N, the HashMap backing lives in the arena.begin_phase(N+1)recycles it → use-after-free → panic/SIGSEGV.Same pattern exists in: crossbeam-epoch Bags, sharded-slab Pages, hashbrown HashMap rehash buffers. All explicitly documented as retaining capacity.
Key insight: Both bug classes share a property — problematic allocations are small (<4KB). The allocations zk-alloc needs to accelerate are large (polynomials, matrices, Merkle trees — all >>4KB). Size-routing exploits this gap. It also subsumes flush_rayon (Injector blocks are ~520B, well under threshold).
Bug Class 3: Realloc Threshold Crossing
Found by: This investigation
Fixed by: Sticky-System routing in realloc
A System-backed
Vecthat grows pastMIN_ARENA_BYTESduring an active phase silently migrates to arena viarealloc → self.alloc(new_size). The user never explicitly allocated in the arena, but the Vec is now subject to phase recycling. Two cases: (a) Vec created before any phase that grows during one, (b) small in-phase Vec that grows past threshold.Fix:
reallocchecks whether the input pointer lies in the arena region. If not, growth stays in System viaSystem::realloc.Performance
No measurable regression:
Validation
Limitations
ZK_ALLOC_MIN_BYTESfor heavier tracing workloads.#[global_allocator]arena patterns — any library that pools heap capacity across logical lifetimes is affected. Size-routing is the practical optimum; true robustness beyond the threshold requires scope-aware API changes.Test plan
cargo test --release— all tests passcargo clippy— clean