fix: safe arena routing (size-routing + sticky-System realloc) #9

Merged
Barnadrot merged 2 commits into main from fix/safe-arena-routing on May 8, 2026

Barnadrot commented May 6, 2026

Summary

  • Route allocations < 4096 bytes to System allocator during active phases, preventing library-internal allocations from landing in recyclable arena memory
  • Sticky-System routing in realloc: if a pointer originated in System, growth stays in System — prevents silent migration of Vec/HashMap into arena on push/insert
  • PhaseGuard RAII: phase(|| { ... }) wrapper ensures end_phase() runs on both normal return and panic unwind
  • Extends Emile's flush_rayon fix (leanEthereum/leanMultisig@f5e2299) which addressed one class of phase-crossing allocations but was proven insufficient for a broader second class
  • Adds ZK_ALLOC_MIN_BYTES, ZK_ALLOC_SLAB_GB env var controls
  • 5 regression test files covering reproduction, stress (100 cycles), PhaseGuard, crossbeam-epoch, and panic-phase hazard
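
The PhaseGuard item in the summary above can be pictured with a minimal RAII sketch. This is not the crate's code: the `PHASE_ACTIVE` flag, the `phase_is_active` helper, and the bodies of `begin_phase`/`end_phase` are stand-ins invented for illustration, and nested guards (which the PR's tests cover) are not modeled here.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for the allocator's phase state (assumption: the real crate
// tracks this differently and also recycles the slab in begin_phase).
static PHASE_ACTIVE: AtomicBool = AtomicBool::new(false);

fn begin_phase() {
    PHASE_ACTIVE.store(true, Ordering::SeqCst);
}

fn end_phase() {
    PHASE_ACTIVE.store(false, Ordering::SeqCst);
}

/// Hypothetical helper so callers can observe the flag.
pub fn phase_is_active() -> bool {
    PHASE_ACTIVE.load(Ordering::SeqCst)
}

/// Calls begin_phase() on construction and end_phase() on drop.
pub struct PhaseGuard(());

impl PhaseGuard {
    pub fn new() -> Self {
        begin_phase();
        PhaseGuard(())
    }
}

impl Drop for PhaseGuard {
    fn drop(&mut self) {
        // Drop runs on normal return and during panic unwinding alike.
        end_phase();
    }
}

/// Convenience wrapper: run `f` inside a phase.
pub fn phase<R>(f: impl FnOnce() -> R) -> R {
    let _guard = PhaseGuard::new();
    f()
}
```

Because `end_phase()` lives in `Drop`, a panic inside the closure cannot leave the arena marked active, provided the panic unwinds rather than aborts.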

Bug Class 1: Rayon Injector Blocks

Found by: Emile (leanEthereum/leanMultisig@f5e2299)
Fixed by: flush_rayon — 256 no-op rayon::join calls drain crossbeam-deque Injector slots

Rayon's crossbeam-deque allocates Injector blocks (~520B, 64 JobRef slots) during parallel work. Under #[global_allocator], these land in the arena during active phases. When begin_phase() recycles the slab, rayon still holds a pointer to the old block — next job push writes 17 bytes over recycled memory. Silent corruption, not a crash.

Characterization:

  • 10/10 phase cycles corrupted in standalone stress test; single 17-byte blast radius per boundary
  • Plonky3: deterministic crash on 2nd prove_and_verify (corruption cascades into tracing-subscriber HashMap → panic)
  • leanMultisig: does not fire under prove_loop — workload-specific, corruption misses observable state

Bug Class 2: Library Allocation Pooling

Found by: This investigation
Fixed by: Size-routing — allocations < 4096B bypass arena, go to System

flush_rayon does NOT fix Plonky3 — crashes persist with rayon-flush ON. A second class of phase-crossing allocations exists that flush_rayon cannot reach.

Root cause: Libraries like tracing-subscriber intentionally pool heap capacity across logical lifetimes. ExtensionsInner::clear() retains HashMap backing so future spans reuse it without reallocating. When the first span in a Registry slot is created during arena phase N, the HashMap backing lives in the arena. begin_phase(N+1) recycles it → use-after-free → panic/SIGSEGV.
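
The capacity-retention behaviour behind this root cause is visible with a plain standard-library `HashMap`. The sketch below is only an illustration of pooling, not tracing-subscriber's code; the function name is made up for the example.

```rust
use std::collections::HashMap;

/// Fill a map, clear it, and report its capacity before and after.
/// `clear()` removes the entries but keeps the backing allocation,
/// so the same heap block is reused by later inserts.
pub fn capacity_around_clear() -> (usize, usize) {
    let mut pool: HashMap<u64, u64> = HashMap::new();
    for k in 0..64 {
        pool.insert(k, k);
    }
    let before = pool.capacity();
    pool.clear();
    let after = pool.capacity();
    (before, after)
}
```

If that backing block was first allocated from the arena during phase N, the map still points at it after the slab is recycled for phase N+1, which is the hazard described above.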

Same pattern exists in: crossbeam-epoch Bags, sharded-slab Pages, hashbrown HashMap rehash buffers. All explicitly documented as retaining capacity.

Key insight: Both bug classes share a property — problematic allocations are small (<4KB). The allocations zk-alloc needs to accelerate are large (polynomials, matrices, Merkle trees — all >>4KB). Size-routing exploits this gap. It also subsumes flush_rayon (Injector blocks are ~520B, well under threshold).
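
As a rough sketch of the size-routing rule, the decision reduces to one comparison. The `Route` enum and `route_alloc` function are hypothetical names for illustration. Sending everything to System outside an active phase is also an assumption, since the description only specifies behaviour during active phases.

```rust
/// Which backing allocator serves a request (illustrative, not the crate's type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Route {
    System,
    Arena,
}

/// Default threshold named in the PR description.
pub const MIN_ARENA_BYTES: usize = 4096;

/// Hypothetical routing rule: a request goes to the arena only when a
/// phase is active and the request is at least `min_arena_bytes` large.
pub fn route_alloc(size: usize, phase_active: bool, min_arena_bytes: usize) -> Route {
    if phase_active && size >= min_arena_bytes {
        Route::Arena
    } else {
        Route::System
    }
}
```

Under this rule a ~520B Injector block goes to System at the default threshold, while with a threshold of 128 it would still land in the arena.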

Bug Class 3: Realloc Threshold Crossing

Found by: This investigation
Fixed by: Sticky-System routing in realloc

A System-backed Vec that grows past MIN_ARENA_BYTES during an active phase silently migrates to arena via realloc → self.alloc(new_size). The user never explicitly allocated in the arena, but the Vec is now subject to phase recycling. Two cases: (a) Vec created before any phase that grows during one, (b) small in-phase Vec that grows past threshold.

Fix: realloc checks whether the input pointer lies in the arena region. If not, growth stays in System via System::realloc.
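
A minimal sketch of that check, assuming the arena is one contiguous address range. `ArenaRange`, `ReallocRoute`, and `route_realloc` are hypothetical names, and what happens to pointers that are already arena-backed is left to the arena's usual growth path rather than modeled.

```rust
/// Address range of the arena slab (hypothetical helper type).
pub struct ArenaRange {
    pub base: usize,
    pub len: usize,
}

impl ArenaRange {
    /// True if `addr` lies inside `[base, base + len)`.
    pub fn contains(&self, addr: usize) -> bool {
        addr >= self.base && addr - self.base < self.len
    }
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ReallocRoute {
    /// Pointer came from System: grow via System::realloc, whatever the new size.
    StaySystem,
    /// Pointer is arena-backed: follow the arena's usual growth path.
    ArenaPath,
}

/// Sticky-System decision: the origin of the pointer, not the new size,
/// determines which allocator handles the growth.
pub fn route_realloc(arena: &ArenaRange, ptr_addr: usize) -> ReallocRoute {
    if arena.contains(ptr_addr) {
        ReallocRoute::ArenaPath
    } else {
        ReallocRoute::StaySystem
    }
}
```

Keying the decision on the pointer's origin covers both cases above: a pre-phase Vec and a small in-phase Vec are System-backed, so neither can cross into the arena by growing.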

Performance

No measurable regression:

  • Plonky3 prove: ~1.16-1.30s (unchanged)
  • leanMultisig prove_loop: 2.1s/proof × 30 proofs (unchanged)
  • Hot-path ZK allocations are all >> 4096 bytes and stay in arena

Validation

Checks and results:

  • Size-routing at 4096 threshold: all Plonky3 examples pass cleanly
  • No perf regression vs flush_rayon-only baseline
  • 100 phase cycles × 200 joins + 64KB canary: 0 corrupted; disabling the fix → SIGSEGV at cycle ~42
  • Plonky3 prove (Poseidon1/BB, Poseidon2/KB, Poseidon2/BB-zk): 100 iterations with full tracing
  • leanMultisig prove_loop: 30 proofs verified, 2.1s/proof (no regression)
  • Threshold sweep: 512/1024/2048/4096 all safe; 64-128 unsafe (the ~520B Injector blocks exceed these thresholds and still land in the arena)

Limitations

  • Threshold boundary: sharded-slab's second page (~6.4KB) exceeds the 4096 threshold at 32+ concurrent spans. Empirically: 512 concurrent spans fails at 4096, passes at 6144. Typical ZK proving has <32 concurrent spans → safe at default. Configurable via ZK_ALLOC_MIN_BYTES for heavier tracing workloads.
  • Architectural: The bug class is intrinsic to #[global_allocator] arena patterns — any library that pools heap capacity across logical lifetimes is affected. Size-routing is the practical optimum; true robustness beyond the threshold requires scope-aware API changes.
  • Classification: These are memory-safety bugs (use-after-free, silent corruption). Whether corrupted prover memory could produce a verifier-accepted invalid proof (ZK soundness) is unproven in either direction — fuzzing experiment scoped as follow-up.
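
For the ZK_ALLOC_MIN_BYTES override mentioned in the first limitation, reading the variable might look roughly like the sketch below. The function names and the fallback to the default on a missing or unparsable value are assumptions; the PR does not say how the crate treats invalid input.

```rust
/// Default threshold named in the PR description.
pub const DEFAULT_MIN_ARENA_BYTES: usize = 4096;

/// Parse an optional raw value, falling back to the default when the
/// value is absent or not a valid unsigned integer (assumed behaviour).
pub fn parse_min_bytes(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_MIN_ARENA_BYTES)
}

/// Read the threshold from the process environment.
pub fn min_arena_bytes_from_env() -> usize {
    parse_min_bytes(std::env::var("ZK_ALLOC_MIN_BYTES").ok().as_deref())
}
```

A tracing-heavy workload would then launch with, for example, `ZK_ALLOC_MIN_BYTES=6144`, the value reported above as passing with 512 concurrent spans.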

Test plan

  • cargo test --release — all tests pass
  • cargo clippy — clean
  • Plonky3 prove: 100 iterations clean
  • leanMultisig prove_loop: 30 proofs verified
  • Threshold sweep: 512/1024/2048/4096 all safe
  • CI green on Linux x86_64 + macOS aarch64
  • No perf regression confirmed on executor (running)

Barnadrot and others added 2 commits May 6, 2026 13:27
Route allocations smaller than MIN_ARENA_BYTES (default 4096) to System
even during active phases. This prevents library-internal allocations
(tracing-subscriber Registry, hashbrown HashMap, crossbeam Injector
blocks) from landing in arena memory that gets recycled on begin_phase().

Additionally:
- Sticky-System routing in realloc: if the original pointer came from
  System, growth stays in System too. Prevents silent migration of
  Vec/HashMap into arena on push/insert.
- PhaseGuard RAII: begin_phase() on construction, end_phase() on drop.
  phase(|| { ... }) convenience wrapper. Prevents panic-leaves-arena-active.
- Configurable SLAB_SIZE via ZK_ALLOC_SLAB_GB env var (default 8).
- Module-level docs rewritten for the two-allocator model.
- README updated with usage section, phase-scoping contract, env vars.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- test_rayon: Tom's original MRE — rayon::join from non-worker thread
  corrupts canary across phase boundary (fixed by size-routing).
- test_size_routing_stress: 100-cycle stress with canaries.
- test_phase_guard: PhaseGuard prevents panic-leaves-arena-active,
  normal-return end_phase, and nested guard composition.
- test_crossbeam_epoch: empirical proof that crossbeam-epoch deferred
  garbage (Bag nodes) stays safe under size-routing (< 4KB).
- test_panic_phase: documents the panic-without-end_phase hazard that
  PhaseGuard addresses.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Barnadrot merged commit 617e91a into main on May 8, 2026
5 checks passed