Skip to content

fix: phase lifecycle safety bugs (nested phase + realloc UB)#10

Open
Barnadrot wants to merge 5 commits intomainfrom
fix/phase-safety-bugs
Open

fix: phase lifecycle safety bugs (nested phase + realloc UB)#10
Barnadrot wants to merge 5 commits intomainfrom
fix/phase-safety-bugs

Conversation

@Barnadrot
Copy link
Copy Markdown
Owner

@Barnadrot Barnadrot commented May 6, 2026

Summary

Two phase lifecycle bugs found by automated bug hunter (boundary-value testing + hypothesis-driven code audit). Both are memory-safety issues in the arena's phase management — one causes silent data corruption, the other is undefined behavior per the Rust spec.

Depends on #9 (size-routing infrastructure provides PHASE_DEPTH counter and PhaseGuard).

Bug 1: Nested begin_phase() Silently Corrupts Outer Phase Data (high)

Root cause: begin_phase() unconditionally bumped GENERATION and recycled every thread's slab. When called inside an already-active phase, the inner begin_phase() reset the slab — silently overwriting all bump-allocated data from the outer phase.

Characterization:

  • Outer phase allocates 1024 bytes, writes canary pattern 0xAA
  • Inner begin_phase() + allocation overwrites the slab from offset 0
  • Outer phase reads back corrupted data — canary bytes replaced with inner phase's writes
  • Existing test test_phase_guard::nested_phase_guards_compose missed this because it allocated nothing in the outer phase

Why it matters: Today leanMultisig calls begin_phase()/end_phase() at the top-level prove boundary, so nesting doesn't occur. But the API permits it, and nothing warns the caller. As zk-alloc integrates deeper — library wrappers, PhaseGuard in helper functions, test scaffolding — a nested call silently corrupts with no diagnostic. A public allocator API that silently destroys live data on a legal call sequence is a latent safety hole.

Fix: PHASE_DEPTH atomic counter. begin_phase() increments depth; only the outermost transition (0 → 1) bumps GENERATION and activates the arena. end_phase() decrements; only the outermost transition (1 → 0) deactivates. Nested calls compose safely as no-ops.

Bug 2: realloc Uses copy_nonoverlapping on Potentially Aliased Memory (medium)

Root cause: GlobalAlloc::realloc called ptr::copy_nonoverlapping(old_ptr, new_ptr, ...) to move data during growth. When growing within the same slab (bump pointer advanced past old allocation, then old allocation is reallocated to a larger size at the same base), old_ptr and new_ptr can alias — the new allocation overlaps the source. copy_nonoverlapping is UB on overlapping regions per the Rust spec.

Characterization:

  • Allocate 64 bytes in a phase, write canary 0xBB
  • realloc to 256 bytes — new pointer returned at same base address (bump allocator reuses position)
  • copy_nonoverlapping with overlapping src/dst — UB, Miri would flag
  • On AMD Zen 4 + glibc: not observable because the SIMD memcpy implementation reads before writing. On different codegen (e.g., debug mode, different target, LTO inlining a naive copy loop) this could corrupt the upper source bytes

Fix: Replace ptr::copy_nonoverlapping with ptr::copy (memmove semantics). Handles overlapping regions correctly. Zero performance impact — realloc is not on the hot path.

Findings Summary

Hunt Category Hypothesis Severity Result
hunt-1 phase_lifecycle Nested begin_phase recycles outer slab high Confirmed + fixed
hunt-2 realloc copy_nonoverlapping on aliased src/dst medium Confirmed + fixed
hunt-3a concurrent_phase Concurrent begin/end desync ARENA_ACTIVE low Not found — routing inefficiency only, single-controller pattern avoids
hunt-3b stats OVERFLOW_COUNT skipped on first no-slab alloc low Not found — metric undercount of 1 alloc, not safety
hunt-3c stats OVERFLOW_COUNT counts failed OOM as overflow low Not found — by-design behavior
hunt-4 rayon_init Rayon init inside phase puts long-lived state in arena low Not found — size-routing already protects (all rayon allocs < 4096)

Validation

Check Result
cargo test --release — all tests pass including 2 new regression tests
hunt-1 regression test: nested phase with canary — corruption before fix, clean after
hunt-2 regression test: realloc overlap detection — UB before fix, clean after
Unbalanced end_phase() calls are no-ops (saturate at zero depth)

Test plan

  • cargo test --release — all existing + new tests pass
  • cargo +nightly miri test — verify hunt-2 fix eliminates UB under Miri
  • Integration test with leanMultisig prove workload
  • Verify no performance regression on executor benchmark

Generated with Claude Code

@Barnadrot Barnadrot changed the base branch from fix/safe-arena-routing to main May 6, 2026 20:26
@Barnadrot Barnadrot closed this May 6, 2026
@Barnadrot Barnadrot reopened this May 6, 2026
@Barnadrot Barnadrot force-pushed the fix/phase-safety-bugs branch from b4f5280 to 23004b5 Compare May 8, 2026 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant