
feat(platform-wallet)!: shutdown() joins coordinator threads and returns CoordinatorExitStatus #3954

Open
Claudius-Maginificent wants to merge 17 commits into v3.1-dev from feat/platform-wallet-shutdown-join

Conversation

@Claudius-Maginificent

@Claudius-Maginificent Claudius-Maginificent commented Jun 22, 2026

Collaborator

Why this PR exists

  • Problem: PlatformWalletManager::shutdown() never joined the sync-coordinator OS threads. The coordinators (identity-sync, platform-address-sync, shielded-sync) run on detached std::thread OS threads because the SDK futures are !Send (Handle::block_on); their JoinHandles were dropped at spawn, and shutdown() only drained via quiesce() (a while is_syncing barrier) without joining the threads or returning per-thread status.
  • What breaks without it: a consumer that drops the tokio runtime after shutdown() (dash-evo-tool's CLI one-shot / headless / stdio paths) races coordinator threads still polling tokio::time on a shutting-down runtime, which panics with "A Tokio 1.x context was found, but it is being shutdown". DET's only deterministic workaround was std::process::exit (3 sites), which skips graceful teardown and any pending coordinator persister writes.
  • Blocking relationship: branched off v3.1-dev. Requested by dash-evo-tool PR #864.

What was done?

shutdown() now joins the coordinator OS threads and returns a per-thread CoordinatorExitStatus:

  • Each coordinator stores its JoinHandle (under the background_cancel lock, so a concurrent quiesce() can't miss it). After the existing quiesce drain (gate → cancel token → while is_syncing), quiesce() takes the handle and joins it by polling JoinHandle::is_finished() in 5 ms steps, awaiting between steps so the executor stays free. Joining while the runtime is still alive is what closes the panic race. JoinHandle::join() is a happens-before barrier that orders the thread's full block_on + cleanup teardown before shutdown() returns. Polling rather than spawn_blocking(|| handle.join()) keeps the join cancellable: a spawn_blocking task can't be cancelled once started, so it would outlive the bounded timeout and leave the !Send thread running, whereas the poll yields at every step, so the tokio::time::timeout wrapping quiesce() is truly binding.
  • shutdown(&self) -> CoordinatorExitStatus { platform_address_sync, identity_sync, shielded_sync: Option<_>, event_adapter } — one CoordinatorThreadStatus per worker (shielded_sync is None in builds without the shielded feature; event_adapter is the wallet-event tokio task, not an OS thread). Each CoordinatorThreadStatus:
    • Ok — joined and returned normally
    • Stopped(Option<String>) — joined, but did not return normally for a non-panic reason (e.g. a tokio task cancel/abort surfaces as a non-panic JoinError); carries the reason when available. (A non-panic join error is no longer mis-reported as Ok.)
    • Panicked(String) — the coordinator panicked (not a silent drop)
    • Timeout — the bounded join exceeded SHUTDOWN_JOIN_TIMEOUT_SECS (30 s); reported distinctly rather than folded into a generic error
    • NotRunning — never started / already joined
    • Error(String) — an infrastructural join failure that is not a timeout
    • is_clean() / all_clean() treat only Ok and NotRunning as clean; every other variant is non-clean.
  • Panic-safety (the keystone): a Drop guard resets the is_syncing flag, so a panic inside sync_now can't leave the while is_syncing drain spinning forever — which would otherwise hang shutdown() and make Panicked unreachable. This guard is now a single reusable AtomicFlagGuard RAII helper in rs-dash-async, replacing three byte-identical copies across the sync coordinators (DRY).
  • A bounded tokio::time::timeout (SHUTDOWN_JOIN_TIMEOUT_SECS = 30 s) wraps each quiesce()+join so shutdown() can't wedge indefinitely (elapsing → Timeout). shutdown() asserts (a normal runtime check, active in release builds) that it runs on a multi-thread runtime — each coordinator's OS thread drives its loop via Handle::block_on and needs the runtime's timer/IO driver; a current_thread runtime can only service one block_on at a time and would deadlock. The existing background_generation reschedule guard is preserved.
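The bounded poll-join described above can be sketched with the standard library alone. This is a simplified, blocking illustration rather than the PR's code: `ThreadStatus`, `join_with_deadline`, and `panic_message` are hypothetical names, the status enum is trimmed to four variants, and the real helper awaits a tokio timer between polls instead of calling `thread::sleep`.

```rust
use std::any::Any;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Trimmed stand-in for the PR's per-thread status (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ThreadStatus {
    Ok,
    Panicked(String),
    Timeout,
    NotRunning,
}

impl ThreadStatus {
    /// Only `Ok` and `NotRunning` count as clean, as in the description.
    pub fn is_clean(&self) -> bool {
        matches!(self, ThreadStatus::Ok | ThreadStatus::NotRunning)
    }
}

/// Poll `is_finished()` in 5 ms steps until the thread exits or the
/// deadline elapses. Blocking sketch: the PR awaits a timer instead.
pub fn join_with_deadline(handle: Option<JoinHandle<()>>, deadline: Duration) -> ThreadStatus {
    let Some(handle) = handle else {
        return ThreadStatus::NotRunning;
    };
    let started = Instant::now();
    while !handle.is_finished() {
        if started.elapsed() >= deadline {
            // Dropping the handle detaches the thread, so the caller must
            // treat it as possibly still running.
            return ThreadStatus::Timeout;
        }
        thread::sleep(Duration::from_millis(5));
    }
    // The thread has already finished, so `join()` returns immediately.
    match handle.join() {
        Ok(()) => ThreadStatus::Ok,
        Err(payload) => ThreadStatus::Panicked(panic_message(payload)),
    }
}

/// Extract a readable message from a panic payload when it is a string.
fn panic_message(payload: Box<dyn Any + Send>) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}
```

Because the loop only ever checks `is_finished()`, giving up at the deadline never leaves a blocked `join()` call behind, which is the property the description relies on.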

Scope: Rust + FFI-internal only. The FFI platform_wallet_manager_destroy adapts to the new return type internally: it logs any non-clean exit and returns the new ErrorShutdownIncomplete (19) result code so the host can defer freeing its callback context (a coordinator that timed out may still fire one final callback). Function signatures / C ABI are unchanged (the new code is an additive enum variant). Swift gains one additive enum case, PlatformWalletResultCode.errorShutdownIncomplete = 19, plus a PlatformWalletError.shutdownIncomplete mapping, mirroring the Rust code. The SDK's PlatformWalletManager.deinit now acts on it: it captures the code from shielded_sync_stop/destroy and, only on ErrorShutdownIncomplete, deliberately retains the persistence/event callback handlers (an unbalanced Unmanaged.passRetained) so a lingering coordinator can't fire into a freed context; a clean shutdown releases them normally. (The Swift changes are not compiled in this CI environment, which has no Swift toolchain / xcframework on the build host, so they are verified by construction; an iOS build is required before merge. The Rust + FFI changes are fully built/tested.) Once landed, DET drops its 3 std::process::exit stopgaps and its probabilistic 100 ms grace sleep, uses normal runtime drop, and honors the destroy return code (tracked DET-side).
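As a rough illustration of how the destroy path could fold the exit status into a result code, here is a self-contained sketch. All type and function names are hypothetical stand-ins; `Success = 0` and the `usize` type of `detached_threads` are assumptions, and only the value 19 for the incomplete-shutdown code comes from the description.

```rust
/// Stand-in for the per-thread status, with the variants named in this PR.
#[derive(Debug, Clone, PartialEq)]
pub enum ThreadStatus {
    Ok,
    Stopped(Option<String>),
    Panicked(String),
    Timeout,
    NotRunning,
    Error(String),
    Detached,
}

impl ThreadStatus {
    /// Only `Ok` and `NotRunning` are clean; every other variant is not.
    pub fn is_clean(&self) -> bool {
        matches!(self, ThreadStatus::Ok | ThreadStatus::NotRunning)
    }
}

/// Stand-in for the aggregate returned by `shutdown()`.
pub struct ExitStatus {
    pub platform_address_sync: ThreadStatus,
    pub identity_sync: ThreadStatus,
    /// `None` models a build without the shielded feature.
    pub shielded_sync: Option<ThreadStatus>,
    pub event_adapter: ThreadStatus,
    /// Assumption: modeled here as a count of surviving orphan threads.
    pub detached_threads: usize,
}

impl ExitStatus {
    pub fn all_clean(&self) -> bool {
        self.platform_address_sync.is_clean()
            && self.identity_sync.is_clean()
            && self.shielded_sync.as_ref().map_or(true, ThreadStatus::is_clean)
            && self.event_adapter.is_clean()
            && self.detached_threads == 0
    }
}

/// Hypothetical result codes; only the value 19 is taken from the PR text.
#[repr(i32)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ResultCode {
    Success = 0,
    ErrorShutdownIncomplete = 19,
}

/// Map the aggregate status to the code the host sees from destroy.
pub fn destroy_result(status: &ExitStatus) -> ResultCode {
    if status.all_clean() {
        ResultCode::Success
    } else {
        ResultCode::ErrorShutdownIncomplete
    }
}
```

A host that receives the incomplete code would then keep its callback context alive rather than freeing it, as the paragraph above describes for the Swift deinit.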

Post-review hardening

A grumpy-review pass (CodeRabbit + an internal review) drove a round of degraded-path hardening, all Rust + FFI-internal:

  • Cancellable passes: each coordinator loop now races its in-flight pass against the cancellation token (select! { biased; _ = cancel.cancelled() => break, _ = sync_now() => {} }) on the coordinator's own thread, so stop()/quiesce() aborts a stalled SDK await directly. This shrinks the common-case drain to near-instant so the bounded join lands well inside the timeout.
  • Residual-UAF close (the definitive one): the coordinator join polls is_finished() instead of spawn_blocking(|| handle.join()), so the bounded tokio::time::timeout actually interrupts a wedged join — a spawn_blocking task is uncancellable and would outlive the timeout, leaving the !Send thread alive past shutdown() (a use-after-free / runtime-drop-panic hazard on the degraded path). start() now reaps any handle left by a prior stop() — joining it outside the background_cancel lock (after installing the new token), since the exiting thread's own epilogue also locks background_cancel; this stops a stop()→start() reschedule from detaching a live, untracked thread that shutdown() would miss (and removes a 1 s reap stall on that path). And the FFI destroy returns ErrorShutdownIncomplete on a non-clean exit so the host won't free a context a lingering thread may still touch.
  • Wedge-backstop orphan tracking (closes the last residual-UAF sliver): if start()'s 1 s reap backstop ever fires against a coordinator genuinely wedged in a non-yielding Drop, the still-live JoinHandle is now parked in a shared per-manager orphans list (the duplicated reap is consolidated into one reap_prior_or_park) instead of dropped. shutdown() drains the orphans via a bounded, yielding is_finished() poll (join_detached_orphans) and reports any survivor as the new non-clean CoordinatorThreadStatus::Detached (surfaced in CoordinatorExitStatus.detached_threads, folded into all_clean()), so destroy returns ErrorShutdownIncomplete rather than a false clean on that path too.
  • Concurrent coordinator drain: shutdown() drains the three independent coordinators (platform_address, identity, shielded) concurrently via tokio::join! — each keeping its own inner tokio::time::timeout so per-coordinator status survives — collapsing the pathological worst-case teardown from the sum of the per-coordinator timeouts (~90 s) to their max (~30 s). The three drains share no lock (each touches only its own per-manager state; the coordinator_orphans mutex is untouched by quiesce()), and the event-adapter cancel+join and join_detached_orphans stay strictly ordered after the join!.
  • Non-clean drain surfaced on every FFI teardown path: beyond destroy, clear_shielded now returns a typed PlatformWalletError::ShieldedShutdownIncomplete { status } and leaves the store intact when the shielded quiesce() times out or ends non-cleanly (so the host never wipes over a still-draining pass), and the FFI shielded_sync_stop / shielded_clear bridges return ErrorShutdownIncomplete in that case — symmetric with destroy. All three teardown paths share the SHUTDOWN_JOIN_TIMEOUT_SECS bound.
  • Invariant convergence: PlatformAddressSyncManager gained the background_generation guard its siblings already had (prevents a stop()+start() reschedule from detaching a live thread); the quiescing gate is now RAII-guarded like is_syncing; teardown mutex locks are poison-tolerant.
  • Plus #[must_use] on AtomicFlagGuard, doc/assert-message corrections, uniform failure logging, and a guard temporary no longer held across the join .await.
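The wedge-backstop orphan tracking from the list above can be approximated as follows. This is a sketch under assumptions: `Orphans`, `reap_or_park`, and `drain_orphans` are hypothetical names, the waits are blocking rather than async, and the drain is a single pass rather than the bounded poll the PR describes.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Shared per-manager list of handles that could not be reaped in time.
pub type Orphans = Arc<Mutex<Vec<JoinHandle<()>>>>;

/// Wait up to `backstop` for the prior thread. Join it if it finished,
/// otherwise park the still-live handle instead of dropping it.
/// Returns `true` when the thread was reaped.
pub fn reap_or_park(prior: JoinHandle<()>, backstop: Duration, orphans: &Orphans) -> bool {
    let started = Instant::now();
    while !prior.is_finished() && started.elapsed() < backstop {
        thread::sleep(Duration::from_millis(5));
    }
    if prior.is_finished() {
        let _ = prior.join();
        true
    } else {
        // Poison-tolerant lock: a panic elsewhere must not block teardown.
        orphans
            .lock()
            .unwrap_or_else(|e| e.into_inner())
            .push(prior);
        false
    }
}

/// Join every parked thread that has finished and return how many are
/// still running (these would be reported as detached).
pub fn drain_orphans(orphans: &Orphans) -> usize {
    let mut list = orphans.lock().unwrap_or_else(|e| e.into_inner());
    let parked = std::mem::take(&mut *list);
    for handle in parked {
        if handle.is_finished() {
            let _ = handle.join();
        } else {
            list.push(handle);
        }
    }
    list.len()
}
```

Keeping the handle in the list is what lets a later shutdown still observe the thread; dropping it would detach the thread with no way to report it.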

How Has This Been Tested?

  • cargo test -p platform-wallet --lib: 217 passed (default); 316 passed with --features shielded. Coverage:
    • shutdown_joins_all_workers_reports_ok_and_is_idempotent, shutdown_without_starting_reports_not_running
    • shutdown_waits_for_in_flight_pass_to_drain (the drain barrier is exercised; its handler-start wait is now bounded by a tokio::time::timeout so a regression fails fast instead of hanging CI)
    • shutdown_then_drop_runtime_does_not_panic — the race regression, 10 iterations each holding a live tokio::time timer across the shutdown window, using a per-iteration std::panic::catch_unwind (not a process-global panic hook, which could swallow sibling tests' diagnostics) to confirm 0 "being shutdown" panics
    • join_coordinator_thread_surfaces_panic, join_coordinator_thread_completes_within_deadline (the is_finished() poll-join completes within a bounded time without busy-spinning)
    • restart_after_stop_reaps_prior_thread (identity-sync, platform-address-sync, and shielded-sync) — a stop()→start() reschedule reaps the prior thread promptly (< 500 ms) rather than stalling on the reap backstop; proven non-vacuous against the old lock-held ordering (which stalled 1 s and detached)
    • reap_prior_or_park_parks_wedged_thread, join_detached_orphans_reports_detached_then_ok, shutdown_reports_detached_orphan_as_non_clean — the wedge-backstop orphan path: a force-wedged thread is parked (not dropped) and shutdown() reports it as non-clean Detached with !all_clean(); all three proven non-vacuous by neutering then reverting
    • coordinator_thread_status_clean_predicate (all six variants), coordinator_exit_status_all_clean, event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean (now deterministic: a permanently-pending task is aborted, so the non-panic JoinError → Stopped path always fires; asserts Stopped(_) and !is_clean()). The Timeout mapping is covered via the clean-predicate test, since triggering the real 30 s timeout deterministically is impractical.
  • cargo test -p dash-async: 4 passed (the 2 pre-existing block_on tests + 2 new AtomicFlagGuard contract tests: normal-drop clears the flag, and a drop while unwinding a catch_unwind panic still clears it).
  • cargo check -p platform-wallet-ffi clean — function signatures / C ABI unchanged. cargo clippy -p platform-wallet -p platform-wallet-ffi -p dash-async --all-targets -- -D warnings clean (both feature sets); cargo fmt clean.
  • Independent QA (security + QA review): confirmed the join closes the runtime-drop race without deadlock on the multi-thread runtime, that the is_syncing RAII guard repairs the panic-hang and makes Panicked reachable, and that the is_finished() poll-join leaves each coordinator thread either confirmed-exited or explicitly Timeout before shutdown() returns.
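The guard contract exercised by the two AtomicFlagGuard tests listed above can be sketched like this. `FlagGuard` and `run_pass` are hypothetical stand-ins, not the rs-dash-async code; the only behavior taken from the description is that the flag is cleared on drop, including during panic unwinding.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// RAII guard that clears the flag when it goes out of scope.
#[must_use = "the flag is cleared as soon as the guard is dropped"]
pub struct FlagGuard<'a> {
    flag: &'a AtomicBool,
}

impl<'a> FlagGuard<'a> {
    pub fn new(flag: &'a AtomicBool) -> Self {
        FlagGuard { flag }
    }
}

impl<'a> Drop for FlagGuard<'a> {
    fn drop(&mut self) {
        // Runs on normal return, early return, and panic unwinding.
        self.flag.store(false, Ordering::Release);
    }
}

/// Hypothetical sync pass: marks itself in flight, then optionally panics.
pub fn run_pass(is_syncing: &AtomicBool, should_panic: bool) {
    is_syncing.store(true, Ordering::Release);
    let _guard = FlagGuard::new(is_syncing);
    if should_panic {
        panic!("simulated pass failure");
    }
    // Normal work would happen here while the flag stays set.
}
```

With the guard in place, a drain loop that spins on the flag terminates even when the pass panics, which is what makes a panicked status observable at join time.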

Breaking Changes

  • PlatformWalletManager::shutdown() return type changed from () to CoordinatorExitStatus (public fields: platform_address_sync, identity_sync, shielded_sync: Option<CoordinatorThreadStatus>, event_adapter, detached_threads), and quiesce() now returns CoordinatorThreadStatus. Rust consumers must adapt.
  • CoordinatorThreadStatus (public enum) gained Stopped(Option<String>), Timeout, and Detached variants — exhaustive matches over it must add arms.
  • New public AtomicFlagGuard RAII helper exported from rs-dash-async (additive).
  • FFI: platform_wallet_manager_destroy, platform_wallet_manager_shielded_sync_stop, and platform_wallet_manager_shielded_clear now return the new PlatformWalletFFIResultCode::ErrorShutdownIncomplete (19) on a non-clean / timed-out drain instead of always ok() — a behavioral change for FFI hosts (defer freeing the callback context; the shielded store is left intact). The new result code is an additive enum variant; function signatures / C ABI are unchanged. (Title marked !.)
  • clear_shielded() (Rust, shielded feature) now returns Err(PlatformWalletError::ShieldedShutdownIncomplete { status }) and leaves the store intact on a non-clean / timed-out drain (previously it discarded the result and always wiped). New public PlatformWalletError::ShieldedShutdownIncomplete variant — exhaustive matches must add an arm.
  • Swift (additive): PlatformWalletResultCode gains case errorShutdownIncomplete = 19 and PlatformWalletError gains .shutdownIncomplete(String).

Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added or updated relevant unit/integration/functional/e2e tests
  • I have added "!" to the title and described breaking changes in the corresponding section
  • I have made corresponding changes to the documentation if needed

For repository code-owners and collaborators only

  • I have assigned this pull request to a milestone

🤖 Co-authored by Claudius the Magnificent AI Agent

Summary by CodeRabbit

  • New Features

    • Wallet manager shutdown now provides a structured completion report (clean/non-clean, timeouts, and panic outcomes) rather than being fire-and-forget.
    • Added a new error/result code for incomplete shutdown and surfaced it consistently across FFI and Swift SDK.
  • Bug Fixes

    • Improved shutdown/quiescence reliability for background sync work, including tighter cancellation handling and bounded joins.
    • Clear operation now fails safely when shutdown draining is incomplete.
  • Tests

    • Added/expanded regression and contract tests for stop→start reaping, status mapping, idempotency, panic propagation, and timeout behavior.

lklimek added 2 commits June 22, 2026 21:46
…rns CoordinatorExitStatus

The three periodic sync coordinators (platform-address, identity,
shielded) run their `!Send` loops on detached OS threads via
`Handle::block_on`. `shutdown()`/`quiesce()` previously only drained the
in-flight pass (the `is_syncing` barrier) and never joined the threads,
so a consumer that drops the tokio runtime right after `shutdown()`
(one-shot / headless / stdio) could race a coordinator still polling
`tokio::time` on a shutting-down runtime and panic with
"A Tokio 1.x context was found, but it is being shutdown".

Each coordinator now stores its OS-thread `JoinHandle`; `quiesce()` joins
it (via `spawn_blocking`, after the existing drain) and returns a
`CoordinatorThreadStatus` (NotRunning / Ok / Panicked / Error). Joining
while the runtime is still alive guarantees the loop has stopped touching
`tokio::time` before the host drops the runtime. `shutdown()` aggregates
the three into `CoordinatorExitStatus`, so a panicked loop surfaces in
the status instead of being silently dropped.

JoinHandle-join chosen over a oneshot/Notify signal: `JoinHandle::join`
natively distinguishes a clean return from a panic and waits for the
actual OS thread to terminate (not just a signal fired mid-teardown),
yielding the per-thread status for free. The generation-guard reschedule
and quiesce-drain behavior are preserved.

BREAKING CHANGE: `PlatformWalletManager::shutdown()` now returns
`CoordinatorExitStatus` instead of `()`.

FFI: the internal `shutdown()` call logs the new status; the `extern "C"`
`platform_wallet_manager_destroy` signature and C ABI are unchanged.

<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
…nnot wedge shutdown

SEC-001: Add `IsSyncingGuard` RAII struct to all three coordinator
`sync_now` (and shielded `sync_wallet`) implementations.  The guard
clears `is_syncing=false` on every exit path — normal return, early
return, and panic-unwind — so `quiesce()`'s drain loop can never spin
forever on a panicked pass, and the `Panicked` thread-exit status
becomes reachable.

SEC-002: Wrap each coordinator's `quiesce()` call in `shutdown()` with
`tokio::time::timeout(30 s)`.  On timeout the slot reports
`CoordinatorThreadStatus::Error("join timed out")` rather than hanging
forever.

SEC-003: Add `debug_assert!` in `shutdown()` that the current runtime
is `MultiThread`; document the precondition in the method doc.

F-5: In all three coordinators' `start()`, store the `JoinHandle` in
`background_join` while still holding the `background_cancel` lock —
eliminates the theoretical window where a concurrent `quiesce()` could
take a `None` handle because spawn completed before the store.

Rename `CoordinatorThreadExit` → `CoordinatorThreadStatus` with
variants `Ok / NotRunning / Panicked / Error` to match the coordinator
module's existing `super::CoordinatorThreadStatus` references (fixing
the compile break in f3354f6).  `join_coordinator_thread`'s
spawn_blocking `Err` arm now maps to `Error` rather than `Panicked`
to distinguish infra failure from thread panic (F-6 documented).

Co-Authored-By: Claudius the Magnificent <noreply@anthropic.com>

<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

The PR hardens the shutdown lifecycle for all three background sync coordinator managers (IdentitySyncManager, PlatformAddressSyncManager, ShieldedSyncManager) and the FFI integration. It introduces CoordinatorThreadStatus/CoordinatorExitStatus enums to model coordinator-exit outcomes with clean-status predicates, adds an AtomicFlagGuard RAII utility to reliably clear atomic flags on all exit paths (including panics), applies a uniform pattern across coordinators (background-thread join handles stored under cancel locks, guarded flag cleanup in sync passes, structured quiesce() return types with timeout bounds), and surfaces the result in the FFI destroy path with clean/dirty logging.

Changes

Coordinator Shutdown Hardening

| Layer / File(s) | Summary |
|---|---|
| AtomicFlagGuard RAII type in rs-dash-async (packages/rs-dash-async/src/atomic.rs, packages/rs-dash-async/src/lib.rs, packages/rs-platform-wallet/Cargo.toml) | Introduces panic-safe RAII AtomicFlagGuard<'a> that holds a reference to AtomicBool and unconditionally clears it to false with Release ordering on drop. Includes new atomic module in lib.rs, public re-export, and unit tests verifying cleanup on normal exit and panic unwinding. Workspace dependency wiring added. |
| Shutdown status types and join helper (packages/rs-platform-wallet/src/manager/mod.rs) | Defines CoordinatorThreadStatus enum (Ok, Panicked, Stopped, Timeout, NotRunning, Error) with is_clean() predicate, CoordinatorExitStatus struct with per-coordinator fields and all_clean() method, join_coordinator_thread async helper using spawn_blocking for OS-thread join classification with timeout support, and public SHUTDOWN_JOIN_TIMEOUT_SECS constant. |
| IdentitySyncManager background join and guarded flag cleanup (packages/rs-platform-wallet/src/manager/identity_sync.rs) | Adds background_join field to store spawned OS-thread JoinHandle, imports and uses AtomicFlagGuard in sync_now() for flag cleanup on all exit paths. start() drains any prior join handle and stores new one under cancel lock; background loop transitions to biased tokio::select! prioritizing cancellation. quiesce() returns CoordinatorThreadStatus, implements cancel-drain-join sequence with RAII gate reset, and delegates OS-thread join to helper. stop() recovers from poisoned mutexes. Regression test validates tight stop()→start() reap behavior. |
| PlatformAddressSyncManager background join and guarded flag cleanup (packages/rs-platform-wallet/src/manager/platform_address_sync.rs) | Adds background_join and background_generation fields to prevent stale thread cleanup. start() drains prior handles with fallback logging, stores new JoinHandle under cancel lock. Background loop uses biased tokio::select! and exit cleanup respects generation check. sync_now() holds AtomicFlagGuard across entire pass; completion callback on_platform_address_sync_completed dispatches before guard drop to prevent host-context premature release during drain. quiesce() returns CoordinatorThreadStatus and joins via helper. stop() handles poisoned mutexes. Regression test validates rapid stop()→start() reaps prior thread. |
| ShieldedSyncManager background join and guarded flag cleanup (packages/rs-platform-wallet/src/manager/shielded_sync.rs) | Adds background_join field and stores spawned OS-thread JoinHandle under cancel lock in start(), draining any prior handle with timeout fallback. Both sync_now() and sync_wallet() use AtomicFlagGuard for is_syncing to clear on all exit paths. Background loop transitions to biased tokio::select! prioritizing cancellation. quiesce() returns CoordinatorThreadStatus, sets quiescing gate under RAII guard, drains is_syncing, then joins via helper. stop() handles poisoned mutexes. Completion callbacks dispatched before guard drop. |
| Manager shutdown orchestration and timeout-bounded quiesce (packages/rs-platform-wallet/src/manager/mod.rs) | PlatformWalletManager::shutdown() returns CoordinatorExitStatus and asserts multi-thread Tokio runtime. Quiesces each coordinator with SHUTDOWN_JOIN_TIMEOUT_SECS bounds, mapping to CoordinatorThreadStatus::Timeout on deadline overrun. Cancels and timeout-joins wallet-event adapter, classifying join outcomes (panic extraction, non-panic error mapping). clear_shielded() wraps shielded-sync quiesce with timeout, returning ShieldedShutdownIncomplete { status } unless drain is clean. Comprehensive tests cover clean/non-clean shutdown, idempotency, never-started coordinators, panic surfacing, deadline behavior, in-flight drain, and runtime-drop race prevention. |
| Error type and status propagation (packages/rs-platform-wallet/src/error.rs, packages/rs-platform-wallet/src/lib.rs) | Adds ShieldedShutdownIncomplete error variant to PlatformWalletError carrying status: crate::manager::CoordinatorThreadStatus, indicating shielded clear where the sync coordinator did not drain cleanly. Exports SHUTDOWN_JOIN_TIMEOUT_SECS constant from crate root via updated pub use manager::{ ... } block. |
| FFI destroy path status inspection and logging (packages/rs-platform-wallet-ffi/src/manager.rs, packages/rs-platform-wallet-ffi/src/error.rs, packages/rs-platform-wallet-ffi/Cargo.toml) | platform_wallet_manager_destroy captures CoordinatorExitStatus, inspects all_clean(), emits tracing::warn! when incomplete and tracing::debug! on success, returns ErrorShutdownIncomplete when not clean. New ErrorShutdownIncomplete = 19 variant added with host guidance (not freeing callback context, not retrying destroy). Tokio time feature enabled. |
| FFI shielded sync stop and clear timeout integration (packages/rs-platform-wallet-ffi/src/shielded_sync.rs) | platform_wallet_manager_shielded_sync_stop wraps quiesce() with tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS), checking status and returning ErrorShutdownIncomplete when not clean. platform_wallet_manager_shielded_clear distinguishes ShieldedShutdownIncomplete and maps it to new FFI code (leaving store intact), while other failures map to ErrorWalletOperation. Documentation updated with host obligations. |
| FFI error conversion and Swift SDK mapping (packages/rs-platform-wallet-ffi/src/error.rs, packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift) | From<PlatformWalletError> conversion maps ShieldedShutdownIncomplete to ErrorShutdownIncomplete with message preservation. Swift SDK gains errorShutdownIncomplete = 19 result code and shutdownIncomplete(String) error case, with updated initializers and description methods. Unit test validates error mapping across FFI boundary. |

Sequence Diagram(s)

sequenceDiagram
    participant FFI as platform_wallet_manager_destroy
    participant Manager as PlatformWalletManager::shutdown
    participant ISM as IdentitySyncManager::quiesce
    participant PASM as PlatformAddressSyncManager::quiesce
    participant SSM as ShieldedSyncManager::quiesce
    participant Adapter as WalletEventAdapter task
    participant Thread as OS Background Thread

    FFI->>Manager: shutdown() / multi-thread Tokio
    Manager->>ISM: quiesce with timeout
    ISM->>ISM: set quiescing, cancel loop
    ISM->>ISM: wait is_syncing drain (AtomicFlagGuard)
    ISM->>Thread: join_coordinator_thread(background_join)
    Thread-->>ISM: CoordinatorThreadStatus
    ISM-->>Manager: CoordinatorThreadStatus
    Manager->>PASM: quiesce with timeout
    PASM-->>Manager: CoordinatorThreadStatus
    Manager->>SSM: quiesce with timeout
    SSM-->>Manager: CoordinatorThreadStatus
    Manager->>Adapter: cancel task + join with timeout
    Adapter-->>Manager: CoordinatorThreadStatus
    Manager-->>FFI: CoordinatorExitStatus {identity, address, shielded, adapter}
    alt all_clean() true
        FFI->>FFI: tracing::debug log success
        FFI-->>Caller: Ok()
    else all_clean() false
        FFI->>FFI: tracing::warn with status details
        FFI-->>Caller: ErrorShutdownIncomplete
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

ready for final review

Suggested reviewers

  • shumkov
  • QuantumExplorer

Poem

🐇 Hop hop, the threads now join with care,
No more free-spinning in the air!
A guard drops the flag, a status returns clean,
The shutdown now knows just what it has seen.
Warn if it's messy, debug if it's bright —
Every coordinator joins in the night. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The pull request title clearly and specifically identifies the main change: shutdown() now joins coordinator threads and returns CoordinatorExitStatus. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@lklimek lklimek marked this pull request as ready for review June 23, 2026 08:06
@lklimek lklimek requested a review from thepastaclaw June 23, 2026 08:06
@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.18%. Comparing base (96cba16) to head (261178e).
⚠️ Report is 2 commits behind head on v3.1-dev.

Additional details and impacted files
@@              Coverage Diff              @@
##           v3.1-dev    #3954       +/-   ##
=============================================
+ Coverage     52.54%   87.18%   +34.63%     
=============================================
  Files            11     2632     +2621     
  Lines          1707   327563   +325856     
=============================================
+ Hits            897   285592   +284695     
- Misses          810    41971    +41161     
| Components | Coverage Δ |
|---|---|
| dpp | 87.70% <ø> (∅) |
| drive | 86.14% <ø> (∅) |
| drive-abci | 89.45% <ø> (∅) |
| sdk | ∅ <ø> (∅) |
| dapi-client | ∅ <ø> (∅) |
| platform-version | ∅ <ø> (∅) |
| platform-value | 92.20% <ø> (∅) |
| platform-wallet | ∅ <ø> (∅) |
| drive-proof-verifier | 49.55% <ø> (∅) |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
packages/rs-platform-wallet/src/manager/identity_sync.rs (1)

418-459: 🩺 Stability & Availability | 🟠 Major

Don't overwrite an unjoined coordinator generation.

stop() removes only background_cancel and leaves the previous background_join unjoined. A stop()start() sequence passes the cancel_guard.is_some() check and overwrites background_join with a new handle, losing the old OS-thread handle before shutdown can join it. Gate restart on confirming the prior handle has been joined, or ensure every generation's handle is tracked and joined.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/identity_sync.rs` around lines 418 -
459, The start() method can overwrite an existing background_join handle before
it has been properly joined, causing resource leaks. Before spawning the new
identity-sync thread and storing its handle in background_join, add a check to
ensure any existing join handle in background_join has been properly cleaned up
or joined first. This can be done either by joining the existing handle before
proceeding, or by verifying that background_join is None before allowing the new
thread spawn to proceed. This ensures the prior thread's OS handle is not lost
and can be properly shutdown.
packages/rs-platform-wallet/src/manager/platform_address_sync.rs (1)

218-255: 🩺 Stability & Availability | 🟠 Major

Add generation guard to prevent thread handle loss after stop/start cycles.

After stop() cancels the token, background_cancel becomes None while the old thread keeps running. A subsequent start() sees cancel_guard.is_some() == false and spawns a new thread, unconditionally overwriting background_join. The old thread's join handle is lost, making it impossible to join its cleanup later. Additionally, without a generation counter, the exiting old thread clears the new generation's background_cancel token as it shuts down, creating a race where the new loop runs but appears stopped to is_running().

Both IdentitySyncManager and ShieldedSyncManager already implement this pattern: they increment background_generation on each start(), pass my_gen to the spawned thread, and check background_generation.load() == my_gen before clearing background_cancel on exit. Apply the same approach here.
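
The generation guard described above can be sketched as follows. This is an illustration under assumptions, not the crate's code: `SyncManager`, `on_thread_exit`, and the `Arc<AtomicBool>` cancel token are hypothetical stand-ins, and the thread-exit hook is called directly instead of from a real thread so the interleaving is deterministic.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical cancel token: a shared flag standing in for the real token type.
type CancelToken = Arc<AtomicBool>;

struct SyncManager {
    background_generation: AtomicU64,
    background_cancel: Mutex<Option<CancelToken>>,
}

impl SyncManager {
    fn new() -> Self {
        Self {
            background_generation: AtomicU64::new(0),
            background_cancel: Mutex::new(None),
        }
    }

    /// Installs a fresh token and returns the generation id the spawned
    /// thread should carry, or `None` if a loop is already running.
    fn start(&self) -> Option<u64> {
        let mut guard = self.background_cancel.lock().unwrap_or_else(|e| e.into_inner());
        if guard.is_some() {
            return None;
        }
        let my_gen = self.background_generation.fetch_add(1, Ordering::AcqRel) + 1;
        *guard = Some(Arc::new(AtomicBool::new(false)));
        Some(my_gen)
    }

    /// Cancels the current token and removes it from the slot.
    fn stop(&self) {
        let token = self.background_cancel.lock().unwrap_or_else(|e| e.into_inner()).take();
        if let Some(token) = token {
            token.store(true, Ordering::Release);
        }
    }

    /// Cleanup run by an exiting thread: clear the token only if no newer
    /// generation has been started in the meantime.
    fn on_thread_exit(&self, my_gen: u64) {
        let mut guard = self.background_cancel.lock().unwrap_or_else(|e| e.into_inner());
        if self.background_generation.load(Ordering::Acquire) == my_gen {
            *guard = None;
        }
    }

    fn is_running(&self) -> bool {
        self.background_cancel.lock().unwrap_or_else(|e| e.into_inner()).is_some()
    }
}
```

Without the generation comparison in `on_thread_exit`, a late exit from the first generation would clear the second generation's token and `is_running()` would report a live loop as stopped.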

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs` around
lines 218 - 255, The thread handle for the platform address sync loop can be
lost during stop/start cycles because a new thread spawns and unconditionally
overwrites the background_join handle before the old thread finishes cleanup.
Additionally, the old thread clears the new generation's background_cancel
token, creating a race condition. Implement the generation guard pattern already
used in IdentitySyncManager and ShieldedSyncManager: add a background_generation
atomic counter to the struct, increment it at the start of the start() method,
capture the current generation as my_gen before spawning the thread, pass my_gen
into the spawned thread closure, and modify the cleanup code (where
background_cancel is set to None) to only clear the token if
background_generation.load() equals my_gen, ensuring old exiting threads do not
interfere with new generations.
packages/rs-platform-wallet/src/manager/shielded_sync.rs (1)

245-300: 🩺 Stability & Availability | 🔴 Critical

Don't replace a pending shielded-sync join handle on rapid stop/start cycles.

The start() method checks only background_cancel to guard against concurrent starts. When stop() removes the token from background_cancel, a subsequent start() proceeds to spawn a new thread and overwrites background_join, even though the previous generation's thread is still winding down. The generation check (line 290) only prevents the old thread from clearing background_cancel—it does not protect background_join. This leaves prior-generation handles permanently lost.

To fix: either require quiesce() before start() to join all pending generations, or track join handles per-generation and join all pending before spawning a new thread.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs` around lines 245 -
300, The `start()` method only checks `background_cancel` to guard against
concurrent starts, but when `stop()` clears this token, a subsequent `start()`
can spawn a new thread and overwrite the `background_join` handle before the
previous generation's thread has finished cleaning up. The generation check at
the cleanup section prevents the old thread from clearing `background_cancel`,
but does not protect `background_join` from being overwritten. Fix this by
either adding logic to join all pending prior-generation join handles before
storing the new one (by tracking handles per-generation), or by ensuring that
before assigning a new join handle to `background_join` in the `start()` method,
any existing pending handle from a prior generation is properly joined and
waited for completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 706-710: The while loop waiting for handler_started to become true
has no timeout, which will cause the test to hang indefinitely if the slow
callback never executes. Wrap the entire while loop (that checks
handler_started.load(AO::Acquire)) with tokio::time::timeout() and provide an
appropriate timeout duration, then handle the timeout error case with an
assertion that explicitly fails the test with a useful message. This ensures CI
fails fast with clear feedback instead of timing out.
- Around line 483-489: In the wallet event adapter task join error handling, the
else branch that handles non-panic JoinErrors currently returns
CoordinatorThreadStatus::Ok, which should instead return
CoordinatorThreadStatus::Error with an appropriate error message. Change the
else clause that follows the is_panic() check to map the error to a
CoordinatorThreadStatus::Error variant containing details about the join error,
rather than treating it as a successful completion.
- Around line 175-181: The current implementation uses
tokio::task::spawn_blocking to wrap handle.join(), but this pattern prevents the
timeout from effectively interrupting the blocking task if the coordinator
thread hangs. Replace the spawn_blocking closure approach with explicit polling:
repeatedly check if the JoinHandle is finished using is_finished() in a loop
until the deadline is reached, and only call join() once the handle confirms it
is finished. This ensures the timeout boundary is enforced even if the
coordinator thread misbehaves or fails to clear is_syncing before join() is
called.
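
The polling join from the last item might look roughly like this. It is a synchronous, std-only analogue: the async version would await a timer between checks instead of calling `thread::sleep`. `join_with_deadline` and `JoinOutcome` are hypothetical names, and the 5 ms step is only an example value.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical result type for a bounded join.
enum JoinOutcome {
    Ok,
    Panicked,
    /// Deadline reached; the handle is handed back so the caller can retry
    /// later instead of detaching the thread.
    Timeout(JoinHandle<()>),
}

/// Polls `is_finished()` until the deadline and calls `join()` only once the
/// thread has already exited, so the final join cannot block.
fn join_with_deadline(handle: JoinHandle<()>, timeout: Duration) -> JoinOutcome {
    let deadline = Instant::now() + timeout;
    while !handle.is_finished() {
        if Instant::now() >= deadline {
            return JoinOutcome::Timeout(handle);
        }
        thread::sleep(Duration::from_millis(5));
    }
    match handle.join() {
        Ok(()) => JoinOutcome::Ok,
        Err(_) => JoinOutcome::Panicked,
    }
}
```

Because the wait is a loop of short steps, abandoning it at any step leaves no parked helper task behind, which is the property a `spawn_blocking(|| handle.join())` wrapper lacks.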


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 01cab324-105d-4c5d-afe0-0ceb6faff13e

📥 Commits

Reviewing files that changed from the base of the PR and between 9a93387 and 261178e.

📒 Files selected for processing (5)
  • packages/rs-platform-wallet-ffi/src/manager.rs
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs

lklimek and others added 2 commits June 23, 2026 11:12
Introduces `AtomicFlagGuard`, a pub RAII guard that clears an
`AtomicBool` flag to `false` (Release ordering) on drop.  The guard
does not set the flag on construction — the caller is responsible for
doing so (typically via a `compare_exchange`) — preserving the exact
semantics of the three identical `IsSyncingGuard` structs that were
copy-pasted across the platform-wallet sync coordinators.

This is the panic-safety keystone for the quiesce drain loop: if a sync
pass panics, the guard's `drop` still clears `is_syncing`, so
`quiesce()` is never permanently wedged.
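
A sketch of the guard as the message describes it, for illustration only. The tuple-struct shape and `new` constructor follow what is quoted elsewhere in this thread; `run_pass` is a hypothetical caller showing the "set via `compare_exchange`, then guard" pattern.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Clears the flag to `false` on drop. It does not set the flag on
/// construction; the caller does that first.
pub struct AtomicFlagGuard<'a>(&'a AtomicBool);

impl<'a> AtomicFlagGuard<'a> {
    pub fn new(flag: &'a AtomicBool) -> Self {
        Self(flag)
    }
}

impl Drop for AtomicFlagGuard<'_> {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release);
    }
}

/// Hypothetical sync pass: returns `false` if another pass already holds the
/// flag, otherwise runs `body` with the flag set and clears it afterwards,
/// including when `body` panics and the stack unwinds.
fn run_pass(is_syncing: &AtomicBool, body: impl FnOnce()) -> bool {
    if is_syncing
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        return false; // someone else owns the flag; do not touch it
    }
    let _guard = AtomicFlagGuard::new(is_syncing);
    body();
    true
}
```

The guard is created only after the `compare_exchange` succeeds, so a caller that loses the race never clears a flag it does not own.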

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…en runtime check

**Task 1 — new enum variants**
Add `Stopped(Option<String>)` (non-panic, non-clean task exit, e.g.
tokio cancel/abort) and `Timeout` (join exceeded
SHUTDOWN_JOIN_TIMEOUT_SECS) to `CoordinatorThreadStatus`.

- Non-panic JoinError on the event-adapter task → `Stopped(Some(...))`,
  not the previous `Ok` (wrong: a cancelled task is not a clean exit).
- Timeout on any `quiesce()` wrapper → `Timeout`, not `Error("join
  timed out")`.
- `is_clean()` now returns `true` only for `Ok` and `NotRunning`; all
  other variants — including the two new ones — are non-clean.
- Update all docs / comments that referenced the old `Error("join timed
  out")` wording.
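
For orientation, the clean/non-clean split described in this list could be modelled as below. The variant names come from this PR; the `Panicked` and `Error` payload types and the free function `all_clean` over a slice are assumptions made for the sketch.

```rust
/// Sketch of the status enum; payload types for `Panicked` and `Error` are assumed.
#[derive(Debug, Clone, PartialEq)]
pub enum CoordinatorThreadStatus {
    Ok,
    NotRunning,
    Panicked(String),
    Error(String),
    /// Non-panic, non-clean exit (for example a cancelled or aborted task).
    Stopped(Option<String>),
    /// The join did not complete within the shutdown timeout.
    Timeout,
}

impl CoordinatorThreadStatus {
    /// Only `Ok` and `NotRunning` count as clean; every other variant does not.
    pub fn is_clean(&self) -> bool {
        matches!(self, Self::Ok | Self::NotRunning)
    }
}

/// Hypothetical aggregate: clean only if every slot is clean.
pub fn all_clean(statuses: &[CoordinatorThreadStatus]) -> bool {
    statuses.iter().all(CoordinatorThreadStatus::is_clean)
}
```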

**Task 2 — promote debug_assert to assert**
`shutdown()`'s multi-thread-runtime guard was `debug_assert!`, making
it a no-op in release builds.  Changed to `assert!` — this is a real
invariant: `spawn_blocking` deadlocks on a `current_thread` runtime.

**Task 3 — bound the test wait loop**
Wrap the `while !handler_started…` polling in
`shutdown_waits_for_in_flight_pass_to_drain` with a 5 s
`tokio::time::timeout` so a broken test fails fast instead of hanging.

**Task 4 — DRY IsSyncingGuard**
Replace the three identical copy-pasted `IsSyncingGuard` structs in
`identity_sync.rs`, `platform_address_sync.rs`, and `shielded_sync.rs`
with the new `dash_async::AtomicFlagGuard`.  Adds `dash-async` to
`rs-platform-wallet/Cargo.toml`.  Zero behavioral change: construction
semantics preserved (callers set the flag via `compare_exchange` before
creating the guard; `Drop` clears it with `Ordering::Release`).

**Task 5 — new tests**
- `coordinator_thread_status_clean_predicate`: unit-tests `is_clean()`
  for all six variants including the two new ones; no real timeout needed.
- `coordinator_exit_status_all_clean`: tests `all_clean()` with
  `Timeout` and `Stopped` slots.
- `event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean`:
  aborts the adapter task before `shutdown()` and asserts the result is
  `Stopped` (covers the non-panic JoinError path).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
packages/rs-platform-wallet/src/manager/mod.rs (2)

405-408: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Handle unclean shielded quiesce before clearing state.

quiesce() now returns a meaningful shutdown status. Ignoring it lets clear_shielded() proceed to coord.clear() after Timeout, Stopped, or Panicked, which can race a still-running shielded pass that the quiesce barrier was meant to stop.

Proposed fix
-        self.shielded_sync_manager.quiesce().await;
+        let status = self.shielded_sync_manager.quiesce().await;
+        if !status.is_clean() {
+            return Err(crate::error::PlatformWalletError::ShieldedStoreError(
+                format!("shielded sync did not stop cleanly before clear: {status:?}"),
+            ));
+        }
         if let Some(coord) = self.shielded_coordinator().await {
             coord.clear().await?;
         }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 405 - 408, The
clear_shielded function ignores the return value from
self.shielded_sync_manager.quiesce().await, which now provides meaningful
shutdown status information. Instead of ignoring this result, capture the return
value and check if the quiesce completed cleanly. If quiesce returns a status
indicating Timeout, Stopped, or Panicked, return an appropriate error from
clear_shielded rather than continuing to call coord.clear(), which could race
with a still-running shielded pass. Only proceed with coord.clear() when quiesce
has successfully shut down cleanly.

463-477: 🩺 Stability & Availability | 🟠 Major

Wrap quiescing in AtomicFlagGuard to ensure cancellation-safe reset in all coordinator quiesce() implementations.

The current implementations set quiescing = true before the awaited drain loop and reset it only after. If tokio::time::timeout drops the future during the loop, the reset never executes, permanently wedging quiescing and blocking all future syncs.

All three coordinators (platform_address_sync.rs:291, identity_sync.rs:494, shielded_sync.rs:334) have identical patterns. Use the same AtomicFlagGuard approach already correctly applied to is_syncing in sync_now():

pub async fn quiesce(&self) -> super::CoordinatorThreadStatus {
    self.quiescing.store(true, Ordering::Release);
    let _quiescing_guard = AtomicFlagGuard::new(&self.quiescing);
    self.stop();
    while self.is_syncing.load(Ordering::Acquire) {
        tokio::time::sleep(Duration::from_millis(20)).await;
    }
    // quiescing.store(false) removed — guard handles reset on all exit paths
    // ...rest of implementation
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 463 - 477, The
`quiesce()` method implementations in all three coordinators (the ones
containing the timeout calls for `platform_address_sync_manager`,
`identity_sync_manager`, and `shielded_sync_manager`) don't properly handle
cancellation when `tokio::time::timeout` drops the future. Wrap the `quiescing`
flag reset in an `AtomicFlagGuard` to ensure it's reset on all exit paths
including early cancellation. In each coordinator's `quiesce()` method, after
setting `quiescing` to true, immediately create an `AtomicFlagGuard` using
`AtomicFlagGuard::new(&self.quiescing)`, and remove any manual
`quiescing.store(false)` reset call at the end since the guard will handle it
automatically. Use the same pattern already correctly implemented in
`sync_now()` for the `is_syncing` flag.
🧹 Nitpick comments (1)
packages/rs-dash-async/src/atomic.rs (1)

8-15: 📐 Maintainability & Code Quality | 🔵 Trivial

Add #[must_use] annotation to AtomicFlagGuard.

The guard is dropped as a temporary if not bound, silently resetting the flag immediately. This breaks the intended guarded scope behavior. Mark the type with #[must_use] to catch accidental non-binding at compile time.

Proposed fix
-pub struct AtomicFlagGuard<'a>(&'a AtomicBool);
+#[must_use = "AtomicFlagGuard clears the flag on drop; bind it to keep the flag set for the guarded scope"]
+pub struct AtomicFlagGuard<'a>(&'a AtomicBool);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-dash-async/src/atomic.rs` around lines 8 - 15, The
AtomicFlagGuard struct can be accidentally dropped without being bound to a
variable, causing the flag to be reset immediately and breaking the guarded
scope behavior. Add the #[must_use] attribute to the AtomicFlagGuard struct
definition to make the compiler warn when the guard is not explicitly bound to a
variable, ensuring the developer catches this mistake at compile time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 671-695: The test
`event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean` is
non-deterministic because it accepts both `CoordinatorThreadStatus::Stopped` and
`CoordinatorThreadStatus::Ok` as valid outcomes. This allows the test to pass
without actually exercising the non-panic JoinError branch. To fix this, replace
the current task abort approach with a task that is guaranteed to be pending and
never complete on its own, such as a task that awaits on a channel or a
never-resolving future. This ensures the abort always triggers the Stopped path
deterministically, and update the assertion to only expect
`CoordinatorThreadStatus::Stopped`.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 192ea517-ff8a-4b43-9773-c096391c8a49

📥 Commits

Reviewing files that changed from the base of the PR and between 261178e and 6e78b77.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (7)
  • packages/rs-dash-async/src/atomic.rs
  • packages/rs-dash-async/src/lib.rs
  • packages/rs-platform-wallet/Cargo.toml
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs
🚧 Files skipped from review as they are similar to previous changes (2)
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs

@thepastaclaw

thepastaclaw commented Jun 23, 2026

Collaborator

✅ Review complete (commit 3cca1cf)

lklimek and others added 4 commits June 23, 2026 12:46
RUST-001: tag `AtomicFlagGuard` `#[must_use]` so a stray `let _ = ..` or
bare-statement construction (which would drop the guard *immediately* and
clear the flag right back) gets caught at compile time instead of silently
un-gating the very flag it was meant to hold.

PROJ-001: lock the guard's contract down with two tests — flag cleared on a
normal drop, and (the load-bearing one) flag cleared while unwinding a
panic via `catch_unwind`. Makes the PR-body "dash-async tests" claim true.

SEC-003: spell out in the rustdoc that the clear-on-panic guarantee rides
on unwinding, so it holds under `panic = "unwind"` but not under the iOS
`panic = "abort"` profiles, where a panic aborts before any Drop runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…invariants

SEC-001 (the big one): a `shutdown()` quiesce timed out only because a
stalled in-flight pass pinned `is_syncing`, so the `while is_syncing` drain
never cleared, the quiesce future was dropped *before* the thread join, and
the `!Send` coordinator OS thread was left ALIVE — later firing host
callbacks through freed memory. Root-cause fix: race the pass body against
cancellation inside each coordinator's own loop

    tokio::select! {
        biased;
        _ = cancel.cancelled() => break,
        _ = this.sync_now(..) => {}
    }

so `stop()`/`quiesce()` cancelling the token drops the stalled `sync_now`
future *on the coordinator thread*, which unwinds to its `is_syncing`
`AtomicFlagGuard` and clears the flag promptly. The drain then frees and the
join lands far inside the timeout — the timeout can no longer strand a live
thread. Invariants preserved: the guard is constructed before any `.await`
so a cancel-drop always clears `is_syncing`; the completion-event dispatch
is the synchronous tail after the last `.await`, so it either runs in full
(then clears) or is skipped on cancel — never torn; idempotency and the
drain barrier are untouched. The inter-pass sleep was already cancel-raced.

MEDIUM-4 (RUST-002): RAII-guard `quiescing` in all three `quiesce()` via
`AtomicFlagGuard`, dropping the manual `store(false)`. A timed-out quiesce
no longer latches the gate `true` and silently bails every future pass.
Reopening on drop is safe because `stop()` already cancelled the loop.

MEDIUM-3 (SEC-005/CALL-001): give `PlatformAddressSyncManager` the
`background_generation` counter its siblings already have — bump it (AcqRel)
in `start()` and gate the thread-exit `*background_cancel = None` on
`generation == my_gen`, so a stop()+start() reschedule can't have an exiting
thread strip the new generation's token.

SEC-003: swap the `background_cancel`/`background_join` std-Mutex
`.lock().expect("… poisoned")` calls for `.lock().unwrap_or_else(|e|
e.into_inner())` across all three coordinators, so one prior panic can't
cascade into an abort on the teardown path.
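
The poison-tolerant lock pattern named here, shown in isolation. The `Option<u32>` slot and the helper names `poison` and `take_slot` are stand-ins for illustration; the real slots hold a cancel token or a join handle.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons the mutex by panicking on another thread while the lock is held.
fn poison(slot: &Arc<Mutex<Option<u32>>>) {
    let s = Arc::clone(slot);
    let _ = thread::spawn(move || {
        let _guard = s.lock().unwrap();
        panic!("simulated coordinator panic");
    })
    .join();
}

/// Takes the slot's value even if an earlier panic poisoned the mutex.
/// `PoisonError::into_inner` returns the guard, so teardown can proceed
/// instead of panicking again the way `.expect(..)` would.
fn take_slot(slot: &Mutex<Option<u32>>) -> Option<u32> {
    slot.lock().unwrap_or_else(|e| e.into_inner()).take()
}
```

This is reasonable for an `Option` slot because a half-finished update cannot leave it in a torn state; data with multi-step invariants would need more care before ignoring the poison flag.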

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SEC-002: `clear_shielded()` now wraps its `quiesce()` in the same
`SHUTDOWN_JOIN_TIMEOUT_SECS` backstop `shutdown()` uses, so a stalled
in-flight pass can't hang Clear forever. The const is now `pub` (and
re-exported from the crate root) so the FFI shielded-stop bridge can reuse
it; its doc + the `shutdown()` doc now describe it as a backstop and note
that cancellation is what makes the drain prompt.

SEC-004: bind the event-adapter join handle to a local before the join
`.await`, so the `tokio::Mutex` guard (previously a match-scrutinee
temporary) isn't held across the up-to-30s join.

PROJ-004: drop the lone `tracing::warn!` for the adapter join error inside
`shutdown()` — the returned status already carries it and the FFI `destroy`
adapter logs the aggregate once, so all four workers are now uniform.

RUST-004: rewrite the `shutdown()` `assert!` message (and the matching
docs) to name the real constraint — the coordinator OS threads each run
`Handle::block_on` and need the multi-thread runtime's timer/IO driver —
instead of blaming `spawn_blocking`, which works fine on current_thread.

PROJ-006: fix the `all_clean()` rustdoc (Stopped/Timeout/Error also make it
false, not just panics). PROJ-003: drop the dangling ephemeral `(F-6)` and
`F-2`/`F-3`/`F-7` + `(1)/(2)/(4)/(5)/(6)` markers, replacing with
self-describing prose. SEC-003: note the unwind-vs-abort caveat on the
`shutdown()` panic-safety guarantee.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SEC-002: `platform_wallet_manager_shielded_sync_stop` blocked on a bare
`quiesce()`, so a stalled in-flight pass could hang the host's stop call
forever. Wrap the quiesce in `tokio::time::timeout` reusing the library's
`SHUTDOWN_JOIN_TIMEOUT_SECS` backstop — same guarantee as `shutdown()`.
Cancellation makes the drain prompt; the timeout only matters if a pass's
drop wedges. The C signature is unchanged and the result is still discarded
(`ok` as before) — we only need the call not to hang.

Add `tokio/time` to the crate's direct features rather than leaning on
`platform-wallet` pulling it in transitively (the crate now calls
`tokio::time::timeout` directly).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Solid lifecycle hardening overall — joining coordinator threads and the RAII is_syncing guard close real races. Three in-scope blockers undermine the new shutdown contract: the FFI destroy still returns success on non-clean shutdown (Swift may free a context a coordinator thread still owns), the bounded timeout doesn't actually bound anything because spawn_blocking tasks are non-abortable, and start-after-stop overwrites the saved JoinHandle so a later shutdown cannot join the stranded thread. Two suggestions and two test-quality nits round out the review.

🔴 3 blocking | 🟡 2 suggestion(s) | 💬 2 nitpick(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet-ffi/src/manager.rs`:
- [BLOCKING] packages/rs-platform-wallet-ffi/src/manager.rs:366-376: Destroy returns ok() on non-clean shutdown — re-opens callback-after-free window
  `shutdown()` can now legitimately return `CoordinatorThreadStatus::Timeout` (and `Stopped`/`Panicked`/`Error`) meaning a coordinator's OS thread or the event-adapter task did not actually join. This FFI entry point logs that outcome and still returns `PlatformWalletFFIResult::ok()`. The Swift host (`PlatformWalletManager.swift` deinit) is documented to free the callback `context` once `platform_wallet_manager_destroy` returns; meanwhile the still-alive coordinator thread holds an `Arc<FFIEventHandler>` / `Arc<FFIPersister>` and can fire `on_*_sync_completed` or `persister.store(...)` through the now-dangling `context` pointer. That is precisely the use-after-free this PR set out to close — the previous unbounded wait would have hung instead of returning false success. Surface non-clean shutdowns as a distinct, non-ok result code so the host knows not to free its context (or keep the FFI-owned handler `Arc` alive on the non-clean path, e.g. `mem::forget`, so any lingering callback remains memory-safe).

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:178-192: Timeout doesn't actually bound shutdown — spawn_blocking is non-abortable
  `shutdown()` wraps each `quiesce()` in `tokio::time::timeout`, but `join_coordinator_thread` moves the `std::thread::JoinHandle` into `spawn_blocking(move || handle.join())`. Tokio blocking tasks cannot be aborted once started: dropping the outer timeout future stops `await`ing the `JoinHandle` but the underlying blocking task is still parked inside `handle.join()`, keeping the coordinator thread's `Arc<…SyncManager>` (and the host callback contexts it transitively holds) alive. When the caller then drops the multi-thread runtime, `Runtime::drop` returns without waiting for blocking tasks, leaving an OS thread plus a stranded blocking thread alive on a freed runtime — which is the very `"Tokio 1.x context … being shutdown"` race the PR cites in its motivation. Make the join cancellation-aware by polling `handle.is_finished()` (with a short async sleep) before the final `handle.join()`, so dropping the timeout future actually releases all state.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/mod.rs:441-459: assert! on runtime flavor is incorrect — spawn_blocking works on current_thread
  The promotion from `debug_assert!` to `assert!` is justified in the docstring by the claim that `spawn_blocking` 'is not available on `current_thread` runtimes and will panic there.' That's not how Tokio works: `spawn_blocking` dispatches to the runtime's shared blocking pool, which both `multi_thread` and `current_thread` runtimes provision; awaiting the returned `JoinHandle` simply yields the runtime task. Today's only in-tree caller (`platform-wallet-ffi/src/runtime.rs:34`) builds a multi-thread runtime, but `shutdown()` is now a public Rust API. Any downstream consumer using `#[tokio::main(flavor = "current_thread")]` (or `Builder::new_current_thread()`) will hit a release-mode panic on a configuration that would have worked. Revert to `debug_assert!` (or drop it entirely) and correct the docstring rationale.

In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/identity_sync.rs:405-448: start-after-stop overwrites background_join, dropping the previous thread handle
  `stop()` takes `background_cancel` (leaving it `None`) but never touches `background_join`. A subsequent `start()` passes the `cancel_guard.is_some()` early-return check, spawns a fresh thread, and at line 447 unconditionally overwrites `background_join` with the new handle — the previous handle is dropped (detached). The old coordinator thread is cancellation-bound but not yet exited; it can still be inside its `block_on` polling `tokio::time` and feeding the event manager when the new handle replaces it. A later `shutdown()` then can only join the new thread, so the original thread can outlive `shutdown()` and touch the host's freed callback context — recreating exactly the runtime-drop race this PR is meant to eliminate. The same pattern applies to `platform_address_sync` and `shielded_sync`. Fix by taking-and-joining (or refusing) any non-`None` handle inside the `start()` lock before installing the new one.

In `packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/platform_address_sync.rs:291-304: quiescing gate is reopened before the join completes — sync_now can slip a pass through
  `quiesce()` clears `self.quiescing` to `false` *before* taking `background_join` and awaiting `join_coordinator_thread`. The join can block for up to `SHUTDOWN_JOIN_TIMEOUT_SECS` (30s). During that window the loop is cancelled and won't start new passes, but an external caller invoking `sync_now` (e.g. an FFI on-demand sync) finds a fully open gate — `quiescing=false`, `is_syncing=false` — and runs a complete pass, including the `on_platform_address_sync_completed` host callback. That breaks the documented `quiesce()` contract that no further callback can fire after it returns, undermining the manager-level shutdown guarantee. Same pattern in `identity_sync::quiesce` and `shielded_sync::quiesce`. Move `quiescing.store(false, …)` to *after* the join completes (or have `sync_now`/`sync_wallet` also consult `background_join.is_some()`).
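
A minimal sketch of the first option (reopen the gate only after the join), assuming a hypothetical `SyncManager` with a plain `AtomicBool` gate. The `GateGuard` type and the `passes` counter are illustrative and do not come from the PR.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread::JoinHandle;

/// Reopens the gate when dropped, so every exit path clears it exactly once.
struct GateGuard<'a>(&'a AtomicBool);

impl Drop for GateGuard<'_> {
    fn drop(&mut self) {
        self.0.store(false, Ordering::Release);
    }
}

/// Hypothetical stand-in for a sync manager; not the PR's actual type.
pub struct SyncManager {
    quiescing: AtomicBool,
    background_join: Mutex<Option<JoinHandle<()>>>,
    passes: AtomicUsize,
}

impl SyncManager {
    pub fn new(worker: JoinHandle<()>) -> Self {
        Self {
            quiescing: AtomicBool::new(false),
            background_join: Mutex::new(Some(worker)),
            passes: AtomicUsize::new(0),
        }
    }

    pub fn is_quiescing(&self) -> bool {
        self.quiescing.load(Ordering::Acquire)
    }

    pub fn passes(&self) -> usize {
        self.passes.load(Ordering::Acquire)
    }

    /// On-demand pass; refused while the gate is closed.
    pub fn sync_now(&self) -> bool {
        if self.is_quiescing() {
            return false;
        }
        self.passes.fetch_add(1, Ordering::AcqRel);
        true
    }

    /// Closes the gate, joins the worker, and only then reopens the gate.
    pub fn quiesce(&self) {
        self.quiescing.store(true, Ordering::Release);
        let _gate = GateGuard(&self.quiescing);
        let handle = self.background_join.lock().unwrap().take();
        if let Some(handle) = handle {
            let _ = handle.join();
        }
        // `_gate` is dropped here, after the join has completed.
    }
}
```

Because the guard is declared before the join, the gate stays closed for the whole wait, and an on-demand `sync_now` arriving in that window is refused.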

@Claudius-Maginificent

Collaborator Author

Update — SEC-002 fixed in 93b89546ed: clear_shielded and the FFI platform_wallet_manager_shielded_sync_stop bridge now wrap quiesce() in the same SHUTDOWN_JOIN_TIMEOUT_SECS bound shutdown() uses, so a stalled shielded pass can no longer hang the host thread indefinitely.

🤖 Co-authored by Claudius the Magnificent AI Agent

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

The PR materially hardens the wallet manager shutdown lifecycle (cancellable passes, RAII quiescing gate, bounded clear/stop, AtomicFlagGuard consolidation). Prior #4/#5 are fixed in head. Remaining issues center on a single shape: bounded teardown paths (FFI destroy, clear_shielded, FFI shielded stop) translate non-clean / timed-out shutdowns into Success without exposing that to the host, which can still hand a stale OS thread to host-owned callback/context after the host frees them. Start-after-stop still drops the prior background_join. Two test-isolation nitpicks carry forward.

🔴 2 blocking | 🟡 2 suggestion(s) | 💬 1 nitpick(s)

2 additional finding(s) omitted (not in diff).

2 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:417-433: clear_shielded silently swallows quiesce timeout — host can wipe persistence with a pass still in-flight
  `clear_shielded` wraps `shielded_sync_manager.quiesce()` in `tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...)`, drops the result with `let _ = ...`, then calls `coord.clear().await` and returns `Ok(())`. The method docstring (lines 402–415) tells callers the quiesce barrier guarantees "nothing can re-persist notes after this returns" and that "the host must not commit its own persistence wipe" only when the coordinator reset fails. On the timeout path that contract is broken: the in-flight pass thread still holds the FFIPersister context and can call `persister.store(...)` *after* `clear_shielded` returned Ok and the host has wiped its SwiftData rows, leaving the host's just-deleted rows re-populated by the trailing pass against the now-cleared shared tree. The intentional discard noted in the comment is fine for diagnostics, but the safety guarantee the docs make requires propagating a distinct error (e.g. `ShieldedStoreError("quiesce timed out")`) so the host knows not to commit the wipe.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/mod.rs:184-198: Timeout cannot abort spawn_blocking — Timeout status can ship while the OS thread is still alive
  `join_coordinator_thread` moves the `std::thread::JoinHandle` into `tokio::task::spawn_blocking(move || handle.join())`. Blocking tasks are not abortable: when the outer `tokio::time::timeout` in `shutdown()` / `clear_shielded` / the FFI shielded-stop bridge fires, dropping the awaiter leaves the spawn_blocking task and its synchronous `handle.join()` running, and the only `JoinHandle` for the OS thread is already consumed by the blocking task. The cancellable-pass change makes this rare in practice, but the `CoordinatorThreadStatus::Timeout` contract is still misleading — returning `Timeout` means *we abandoned the wait*, not *the worker stopped*. Hosts that drop the runtime after `Timeout` are back in the runtime-drop-panic race this PR was opened to fix. Either narrow the doc-comment promise to say so explicitly, or wire the join through an abortable indirection (e.g. an `AbortHandle`-backed task that delegates to a child thread) so a true wedge surfaces and can be acted on.
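
For the second finding, one way to make `Timeout` mean something actionable is to poll `JoinHandle::is_finished()` and hand the handle back when the deadline passes. The sketch below is a synchronous, std-only analogue: `join_with_deadline` and `JoinOutcome` are hypothetical names, and `thread::sleep` stands in for the awaited sleep an async version would use.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum JoinOutcome {
    Exited,
    Panicked,
    Timeout,
}

/// Polls `is_finished()` every `step` until `limit` has elapsed.
/// On timeout the handle is returned, so the caller still owns the only
/// way to join the thread later.
pub fn join_with_deadline(
    handle: JoinHandle<()>,
    limit: Duration,
    step: Duration,
) -> (JoinOutcome, Option<JoinHandle<()>>) {
    let deadline = Instant::now() + limit;
    loop {
        if handle.is_finished() {
            // The thread body has returned, so join() will not block for long.
            return match handle.join() {
                Ok(()) => (JoinOutcome::Exited, None),
                Err(_) => (JoinOutcome::Panicked, None),
            };
        }
        if Instant::now() >= deadline {
            return (JoinOutcome::Timeout, Some(handle));
        }
        thread::sleep(step);
    }
}
```

Unlike a join moved into a blocking task, nothing keeps running after the function returns, and a `Timeout` result leaves the caller able to retry the join.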

In `packages/rs-platform-wallet-ffi/src/shielded_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:86-107: shielded_sync_stop returns ok() on bounded-quiesce timeout — same UAF pattern as destroy/clear
  The new `tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, manager.shielded_sync().quiesce())` correctly prevents the FFI stop call from wedging, but on timeout the future is dropped and the function returns `PlatformWalletFFIResult::ok()` with no signal that the drain didn't complete. The docstring (lines 68–85) explicitly tells hosts the call is a synchronization barrier: on return "the loop is cancelled, no new pass will start, and any in-flight pass has fully drained — its persistence callbacks have completed". That guarantee does not hold on the timeout path; the spawned shielded coordinator thread still holds its `Arc<FFIPersister>` / `Arc<FFIEventHandler>` and can still invoke completion / persister stores against the host context. Hosts using stop as a barrier before unbinding callbacks get the same callback-after-free hazard as destroy. Surface a non-success result (or extend the C ABI with a status out-param) so the host can know to retry or defer.

In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:405-470: start() after cancel-only stop() drops the previous background_join, detaching the old thread from a later shutdown()
  `stop()` takes `background_cancel` but leaves `background_join` populated. A subsequent `start()` only checks `cancel_guard.is_some()`, proceeds, spawns a new OS thread, and unconditionally overwrites `*self.background_join.lock() = Some(join)` at lines 465–468 — silently dropping the previous `JoinHandle` and detaching the prior thread. The `background_generation` counter prevents the exiting thread from stripping the new cancel token, but does not preserve join ownership. The same pattern lives in `platform_address_sync.rs` and `shielded_sync.rs`. With the new cancellable passes the old thread exits its `block_on` quickly, so the race window is small under `panic = "unwind"`. Under iOS `panic = "abort"`, a host doing `stop()` → `start()` → `shutdown()` → `drop(runtime)` can still hit the runtime-drop "being shutdown" panic this PR is closing for the start→shutdown path, because the older detached thread is invisible to the later `shutdown()`. Either gate `start()` on `background_join.is_some()` (await/quiesce the prior handle first) or take both slots atomically in `stop()`.
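
The second option (take both slots atomically in `stop()`) might look like the following std-only sketch. `Background` and `Coordinator` are hypothetical stand-ins; keeping the cancel flag and the join handle under one mutex is an assumption about how the slots could be merged, not the PR's layout.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Cancel flag and join handle guarded by one lock, so they cannot diverge.
#[derive(Default)]
struct Background {
    cancel: Option<Arc<AtomicBool>>,
    join: Option<JoinHandle<()>>,
}

/// Hypothetical stand-in for one sync coordinator.
pub struct Coordinator {
    background: Mutex<Background>,
}

impl Coordinator {
    pub fn new() -> Self {
        Self { background: Mutex::new(Background::default()) }
    }

    /// Refuses to start while either slot is still occupied.
    pub fn start(&self) -> bool {
        let mut bg = self.background.lock().unwrap();
        if bg.cancel.is_some() || bg.join.is_some() {
            return false;
        }
        let cancel = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&cancel);
        bg.join = Some(thread::spawn(move || {
            while !flag.load(Ordering::Acquire) {
                thread::sleep(Duration::from_millis(1));
            }
        }));
        bg.cancel = Some(cancel);
        true
    }

    /// Signals cancellation and hands the join handle to the caller in one
    /// critical section, so no unjoined handle is left behind in the slot.
    pub fn stop(&self) -> Option<JoinHandle<()>> {
        let mut bg = self.background.lock().unwrap();
        if let Some(cancel) = bg.cancel.take() {
            cancel.store(true, Ordering::Release);
        }
        bg.join.take()
    }
}
```

Here the caller of `stop()` becomes responsible for joining the returned handle, which makes the ownership transfer explicit instead of leaving it to a later `start()`.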

Replace the spawn_blocking-based join in join_coordinator_thread with an
is_finished() poll loop that awaits a 5ms sleep each step. spawn_blocking
tasks cannot be cancelled once started, so the prior approach left the
blocking join alive past the tokio::time::timeout wrapping quiesce() —
defeating the timeout boundary. Polling yields at each .await so the
external timeout is truly binding (threads are confirmed-exited or the
caller times out).

Each coordinator's start() now drains any handle left by a prior stop()
(is_finished spin-wait, 1s bound) before overwriting background_join, so a
stop()->start() reschedule can no longer detach a live, untracked thread
that shutdown() would miss.

FFI platform_wallet_manager_destroy now returns the new
ErrorShutdownIncomplete (19) when shutdown is not all-clean, signalling the
host must not immediately free the callback context — a lingering
coordinator may still fire one final callback. The C ABI is unchanged
(additive enum variant + degraded-path return code).
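
A rough sketch of the destroy-path mapping described above. Only the variant name and the value 19 come from this message; `ResultCode`, `ThreadStatus`, `ExitStatus`, `destroy_result` and `Success = 0` are simplified, hypothetical stand-ins for the crate's real types.

```rust
/// Simplified stand-in for the FFI result code enum.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResultCode {
    /// Assumed value for the success code.
    Success = 0,
    /// Value taken from the commit message above.
    ErrorShutdownIncomplete = 19,
}

/// Simplified per-thread status; the real type carries more detail.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThreadStatus {
    Ok,
    Timeout,
    Stopped,
    Panicked,
}

impl ThreadStatus {
    pub fn is_clean(self) -> bool {
        self == ThreadStatus::Ok
    }
}

/// Simplified stand-in for the status that shutdown() returns.
#[derive(Debug, Clone, Copy)]
pub struct ExitStatus {
    pub identity_sync: ThreadStatus,
    pub platform_address_sync: ThreadStatus,
    pub shielded_sync: ThreadStatus,
    pub event_adapter: ThreadStatus,
}

impl ExitStatus {
    pub fn all_clean(&self) -> bool {
        [
            self.identity_sync,
            self.platform_address_sync,
            self.shielded_sync,
            self.event_adapter,
        ]
        .iter()
        .all(|s| s.is_clean())
    }
}

/// Success only when every slot exited cleanly; otherwise the host is told
/// not to free its callback context yet.
pub fn destroy_result(status: &ExitStatus) -> ResultCode {
    if status.all_clean() {
        ResultCode::Success
    } else {
        ResultCode::ErrorShutdownIncomplete
    }
}
```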

Tests: deterministic Stopped path via spawn(pending).abort() -> asserts
Stopped(_) and !is_clean(); race test uses per-iteration catch_unwind
instead of a process-global panic hook.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/rs-platform-wallet-ffi/src/manager.rs (1)

366-385: 🔒 Security & Privacy | 🔴 Critical

Host ignores ErrorShutdownIncomplete return code, re-opening use-after-free window.

The Rust code correctly returns ErrorShutdownIncomplete instead of ok() on non-clean shutdown to signal the host should delay freeing the callback context. However, the Swift deinit at PlatformWalletManager.swift:158 calls platform_wallet_manager_destroy(handle).discard() — explicitly discarding the return value with no branching logic. This means the host unconditionally proceeds with cleanup regardless of whether the error code is returned, re-opening the exact use-after-free this code intends to prevent: a lingering coordinator thread holding an Arc to the event handler while the host-owned context pointer is freed.

The safety guarantee must be enforced on the host side before this fix is effective.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/rs-platform-wallet-ffi/src/manager.rs` around lines 366 - 385, The
Rust code in the shutdown function correctly returns ErrorShutdownIncomplete
when coordinators do not exit cleanly, but the host side (Swift code) ignores
this return value by calling discard() without checking the result. To enforce
the safety guarantee on the Rust side, ensure that even if the host ignores the
return code, the Rust code prevents use-after-free by maintaining ownership of
critical resources (such as the event handler Arc) until all coordinator threads
are guaranteed to have fully exited, rather than relying solely on the host
respecting the ErrorShutdownIncomplete signal. Consider adding an additional
safety mechanism within the shutdown logic to keep the callback context alive on
the Rust side until true cleanup is complete.
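
One possible shape for such a Rust-side mechanism is shared ownership: the context is released only when the last holder, including any lingering coordinator thread, drops its reference. The sketch below is std-only and hypothetical throughout. `CallbackContext`, `spawn_worker` and the `released` flag are illustrative, and it assumes the host lets Rust decide when the context is released, which the current C ABI may not allow.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Receiver;
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical wrapper around host-owned callback state.
pub struct CallbackContext {
    released: Arc<AtomicBool>,
}

impl CallbackContext {
    pub fn new(released: Arc<AtomicBool>) -> Self {
        Self { released }
    }

    /// Stand-in for invoking a host callback through the context.
    pub fn fire(&self) {}
}

impl Drop for CallbackContext {
    fn drop(&mut self) {
        // Stand-in for telling the host that the context may now be freed.
        self.released.store(true, Ordering::Release);
    }
}

/// The worker keeps its own reference until it has fully exited.
pub fn spawn_worker(ctx: Arc<CallbackContext>, exit: Receiver<()>) -> JoinHandle<()> {
    thread::spawn(move || {
        let _ = exit.recv(); // simulate a pass that is still draining
        ctx.fire(); // one final callback, still safe: the context is alive
        drop(ctx); // the worker's reference goes away last
    })
}
```

In this arrangement, dropping the manager's reference during destroy does not release the context while a worker is still running, regardless of whether the host inspects the return code.
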
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-platform-wallet-ffi/src/error.rs`:
- Around line 128-135: The Swift PlatformWalletResult enum is missing the
errorShutdownIncomplete variant that was added to the Rust FFI. Add the case
errorShutdownIncomplete = 19 to the PlatformWalletResultCode enum in the correct
position between the existing variants. Then update the init(ffi:) initializer's
switch statement to add a matching case that maps the FFI result code
PLATFORM_WALLET_FFI_RESULT_CODE_ERROR_SHUTDOWN_INCOMPLETE to the
.errorShutdownIncomplete case, ensuring the returned error code is properly
recognized instead of falling through to the default unknown error handler.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 52a4e212-979b-48ca-b4ad-cbca213672ee

📥 Commits

Reviewing files that changed from the base of the PR and between 93b8954 and 2bd9501.

📒 Files selected for processing (6)
  • packages/rs-platform-wallet-ffi/src/error.rs
  • packages/rs-platform-wallet-ffi/src/manager.rs
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Incremental review for platform-wallet shutdown hardening at head 747f5f0. The latest delta is a merge from v3.1-dev (dashmate DKG safety + Swift KeychainManager) and does not modify any of the platform-wallet shutdown/quiesce/FFI files; therefore all seven prior findings carry forward unchanged. The PR description advertises post-review hardening (ErrorShutdownIncomplete=19 result code, is_finished() poll-join, start() reaping prior background_join, deterministic Stopped test, RAII/chained panic hook) that is not present in the source — destroy still unconditionally returns ok(), join_coordinator_thread still uses uncancellable spawn_blocking, clear_shielded and shielded_sync_stop still discard the bounded-quiesce result, start() still overwrites background_join, and both regression tests retain the prior weaknesses. Three blocking issues remain: the FFI destroy / clear_shielded / shielded_sync_stop paths all report success while the coordinator OS thread may still be alive, leaving a callback-after-free window across the C/Swift boundary that the PR was specifically intended to close.

Reconciliation

  • Prior review at 93b89546: all 7 prior findings are STILL VALID at 747f5f00.
  • New latest-delta findings from 93b89546..747f5f00: none; the delta is a merge from v3.1-dev and does not touch the wallet shutdown/FFI files.
  • CodeRabbit inline findings: 0.

Carried-forward prior findings

  1. 🔴 [BLOCKING] destroy returns ok() on non-clean shutdown — callback-after-free window across the FFI boundary

    • packages/rs-platform-wallet-ffi/src/manager.rs:351-377
    • platform_wallet_manager_destroy removes the manager handle, awaits manager.shutdown(), branches only to log a warning for !status.all_clean() (lines 367–374), and unconditionally returns PlatformWalletFFIResult::ok() at line 376. The inline comment at lines 363–365 acknowledges this: 'the C ABI exposes none of that, so we just log it … and drop it.' After this PR, shutdown() can legitimately return Timeout, Stopped, Panicked, or Error — meaning a coordinator OS thread or the event-adapter task may still be alive and still able to invoke FFIPersister/FFIEventHandler callbacks through the host-owned *const c_void context pointer. Hosts (e.g. dash-evo-tool, the Swift example app) routinely free that callback context after destroy returns ok(); a lingering coordinator firing one final persister.store or on_* callback then writes to freed memory. This is the exact UAF pathway the PR sets out to close. The PR description states destroy now returns a new PlatformWalletFFIResultCode::ErrorShutdownIncomplete (19); no such variant exists in the codebase (grep confirms). Either add and propagate the ErrorShutdownIncomplete code on !status.all_clean() so the host can defer freeing its callback context, or retract the breaking-change claim and document loudly that ok() does not imply the OS thread has exited on the degraded path.
  2. 🔴 [BLOCKING] clear_shielded silently swallows quiesce timeout/non-clean status before clear()

    • packages/rs-platform-wallet/src/manager/mod.rs:417-433
    • clear_shielded wraps shielded_sync_manager.quiesce() in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) but discards the result with let _ = …, then unconditionally calls coord.clear().await and returns Ok(()). The method's own doc-comment makes the quiesce barrier the load-bearing safety mechanism that lets the host commit its persistence wipe. On Timeout (or any non-clean CoordinatorThreadStatus the rewritten quiesce can now return) the in-flight shielded pass is still capable of holding the coordinator/persister handle and writing into the very store that clear() is about to wipe — the wipe can then be silently re-populated by the surviving pass, defeating the wipe and violating the contract the FFI consumer (platform_wallet_manager_shielded_clear) relies on. Inspect the timeout result and propagate a typed PlatformWalletError on Elapsed and on !status.is_clean(), so callers do not commit their own persistence wipe after a partial drain.
  3. 🔴 [BLOCKING] shielded_sync_stop returns ok() on bounded-quiesce timeout — same UAF window as destroy/clear

    • packages/rs-platform-wallet-ffi/src/shielded_sync.rs:86-107
    • platform_wallet_manager_shielded_sync_stop wraps manager.shielded_sync().quiesce() in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) and discards the result with let _ = …, then returns PlatformWalletFFIResult::ok(). The function's own docstring promises that on return 'the loop is cancelled, no new pass will start, and any in-flight pass has fully drained'; on a Timeout (or any non-clean CoordinatorThreadStatus) that promise is broken, but the host is told the stop succeeded. Hosts that bump generation counters or release callback state on success then race the still-running coordinator thread — the same UAF pattern as destroy and clear, just on a non-teardown path that long-running apps hit far more often. Surface non-clean / Elapsed via a distinct FFI result code so the host can defer teardown / state reset until the drain actually completes.
  4. 🔴 [BLOCKING] join_coordinator_thread uses uncancellable spawn_blocking — Timeout status can ship while the !Send OS thread is still alive

    • packages/rs-platform-wallet/src/manager/mod.rs:184-198
    • join_coordinator_thread moves the std::thread::JoinHandle into tokio::task::spawn_blocking(move || handle.join()).await at line 190. spawn_blocking tasks are not abortable: when the outer tokio::time::timeout in shutdown() fires Elapsed, dropping the await handle does not cancel the inner blocking job and does not signal the OS thread. The slot is reported as CoordinatorThreadStatus::Timeout but the underlying coordinator thread is still inside Handle::block_on, still touching tokio::time, and still able to invoke host callbacks. Combined with finding 1 (destroy returning ok()), this is the residual runtime-drop / callback-after-free window the PR was sold as closing — the degraded path leaves it wide open. The PR description claims this was rewritten to poll JoinHandle::is_finished(); grep finds no is_finished() use in rs-platform-wallet at all. Either implement the poll loop (tokio::time::sleep(small_dt); if handle.is_finished() { return handle.join() … }) so the outer timeout actually binds, or document explicitly that Timeout means the OS thread may still be alive and the runtime must not be dropped — and reflect that in the FFI return codes per finding 1.
  5. 🟡 [SUGGESTION] start() after cancel-only stop() drops the previous background_join, detaching the old thread from a later shutdown()

    • packages/rs-platform-wallet/src/manager/identity_sync.rs:405-470
    • stop() (lines 480–489) takes background_cancel but never touches background_join, and the loop body's epilogue (lines 452–458) only clears background_cancel when the generation matches. A subsequent start() guards only on cancel_guard.is_some() (line 410); on a quick stop()→start() sequence cancel_guard is None, start() proceeds, spawns a new OS thread, and at lines 465–468 unconditionally writes the new JoinHandle into self.background_join, dropping (detaching) the prior, still-live JoinHandle for a loop that may still be winding down through its last pass / sleep wakeup. A subsequent shutdown() only joins the newest handle; the older thread is no longer reachable through any join barrier and can outlive shutdown(), holding the persister/event-handler context after the FFI told the host destroy/shutdown was clean. The same overwrite pattern lives in platform_address_sync.rs and shielded_sync.rs::start. The PR description claims start() now reaps the prior handle first; it does not. Either move the join slot under the same lock and reap-on-takeover (e.g. take the prior background_join and join_coordinator_thread it before installing a new one — would require start() to become async), or have stop()/the loop epilogue clear background_join too so an unjoined leftover handle cannot exist.
  6. 🔵 [NITPICK] event_adapter Stopped-path test still accepts Ok, leaving the new JoinError→Stopped mapping unverified on the abort race

    • packages/rs-platform-wallet/src/manager/mod.rs:712-743
    • event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean aborts the adapter task then accepts CoordinatorThreadStatus::Stopped(_) | CoordinatorThreadStatus::Ok (lines 731–737) with an inline comment noting 'abort() races task completion'. Because the adapter is the standard make_manager() sink (no events queued), the task can trivially drain before the 10 ms sleep elapses, in which case the assertion passes via the Ok arm and the new Ok(Err(_)) ⇒ Stopped(...) mapping in shutdown() is never actually exercised. A regression collapsing that arm back to Ok would not be caught. To match the PR description's claim that this path is now deterministic, replace the adapter handle with one running std::future::pending::<()>().await before aborting, and assert exactly Stopped(_) and !status.event_adapter.is_clean().
  7. 🔵 [NITPICK] shutdown_then_drop_runtime installs a process-global panic hook without chaining or RAII restore

    • packages/rs-platform-wallet/src/manager/mod.rs:874-931
    • std::panic::set_hook (line 878) replaces the process-wide panic hook with a closure that only increments SHUTDOWN_PANICS on messages containing 'being shutdown' and never forwards to prev_hook. The original hook is restored only after the 10-iteration loop completes (line 925); a panic anywhere in the loop body leaves the global hook replaced for the rest of the test process and silently suppresses unrelated panic diagnostics from sibling tests in the same cargo test binary. The PR description claims this is now per-iteration std::panic::catch_unwind; the code still uses set_hook. Fixes: chain prev_hook(info) for messages that don't match the filter, and restore the previous hook in an RAII guard (scopeguard / explicit struct with Drop) so a mid-loop panic cannot leave the global state mutated.
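
Findings 2 and 3 both come down to inspecting the bounded-drain result instead of discarding it. A minimal sketch of that control flow for the clear path, with hypothetical `DrainOutcome` and `ClearError` types and a `Vec<u64>` standing in for the shielded store. Skipping the wipe on a failed drain is an assumption about the intended behaviour, not something the findings spell out.

```rust
/// Simplified result of a bounded quiesce.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DrainOutcome {
    Clean,
    NotClean,
    TimedOut,
}

/// Hypothetical typed error a caller can act on.
#[derive(Debug, PartialEq, Eq)]
pub enum ClearError {
    QuiesceTimedOut,
    QuiesceNotClean,
}

/// Wipes `store` only after a clean drain; otherwise reports why it did not,
/// so the host knows not to commit its own persistence wipe.
pub fn clear_after_drain(
    outcome: DrainOutcome,
    store: &mut Vec<u64>,
) -> Result<(), ClearError> {
    match outcome {
        DrainOutcome::Clean => {
            store.clear();
            Ok(())
        }
        DrainOutcome::TimedOut => Err(ClearError::QuiesceTimedOut),
        DrainOutcome::NotClean => Err(ClearError::QuiesceNotClean),
    }
}
```

The same match could back the FFI stop bridge, returning a non-success code on the two error arms.
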
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

- [BLOCKING] packages/rs-platform-wallet-ffi/src/manager.rs:351-377: destroy returns ok() on non-clean shutdown — callback-after-free window across the FFI boundary
platform_wallet_manager_destroy removes the manager handle, awaits manager.shutdown(), branches only to log a warning for !status.all_clean() (lines 367–374), and unconditionally returns PlatformWalletFFIResult::ok() at line 376. The inline comment at lines 363–365 acknowledges this: 'the C ABI exposes none of that, so we just log it … and drop it.' After this PR, shutdown() can legitimately return Timeout, Stopped, Panicked, or Error — meaning a coordinator OS thread or the event-adapter task may still be alive and still able to invoke FFIPersister/FFIEventHandler callbacks through the host-owned `*const c_void context` pointer. Hosts (e.g. dash-evo-tool, the Swift example app) routinely free that callback context after destroy returns ok(); a lingering coordinator firing one final persister.store or on_* callback then writes to freed memory. This is the exact UAF pathway the PR sets out to close. The PR description states destroy now returns a new PlatformWalletFFIResultCode::ErrorShutdownIncomplete (19); no such variant exists in the codebase (grep confirms). Either add and propagate the ErrorShutdownIncomplete code on !status.all_clean() so the host can defer freeing its callback context, or retract the breaking-change claim and document loudly that ok() does not imply the OS thread has exited on the degraded path.

- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:417-433: clear_shielded silently swallows quiesce timeout/non-clean status before clear()
clear_shielded wraps shielded_sync_manager.quiesce() in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) but discards the result with `let _ = …`, then unconditionally calls coord.clear().await and returns Ok(()). The method's own doc-comment makes the quiesce barrier the load-bearing safety mechanism that lets the host commit its persistence wipe. On Timeout (or any non-clean CoordinatorThreadStatus the rewritten quiesce can now return) the in-flight shielded pass is still capable of holding the coordinator/persister handle and writing into the very store that clear() is about to wipe — the wipe can then be silently re-populated by the surviving pass, defeating the wipe and violating the contract the FFI consumer (platform_wallet_manager_shielded_clear) relies on. Inspect the timeout result and propagate a typed PlatformWalletError on Elapsed and on !status.is_clean(), so callers do not commit their own persistence wipe after a partial drain.

- [BLOCKING] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:86-107: shielded_sync_stop returns ok() on bounded-quiesce timeout — same UAF window as destroy/clear
platform_wallet_manager_shielded_sync_stop wraps manager.shielded_sync().quiesce() in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) and discards the result with `let _ = …`, then returns PlatformWalletFFIResult::ok(). The function's own docstring promises that on return 'the loop is cancelled, no new pass will start, and any in-flight pass has fully drained'; on a Timeout (or any non-clean CoordinatorThreadStatus) that promise is broken, but the host is told the stop succeeded. Hosts that bump generation counters or release callback state on success then race the still-running coordinator thread — the same UAF pattern as destroy and clear, just on a non-teardown path that long-running apps hit far more often. Surface non-clean / Elapsed via a distinct FFI result code so the host can defer teardown / state reset until the drain actually completes.

- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:184-198: join_coordinator_thread uses uncancellable spawn_blocking — Timeout status can ship while the !Send OS thread is still alive
join_coordinator_thread moves the std::thread::JoinHandle into tokio::task::spawn_blocking(move || handle.join()).await at line 190. spawn_blocking tasks are not abortable: when the outer tokio::time::timeout in shutdown() fires Elapsed, dropping the await handle does not cancel the inner blocking job and does not signal the OS thread. The slot is reported as CoordinatorThreadStatus::Timeout but the underlying coordinator thread is still inside Handle::block_on, still touching tokio::time, and still able to invoke host callbacks. Combined with finding 1 (destroy returning ok()), this is the residual runtime-drop / callback-after-free window the PR was sold as closing — the degraded path leaves it wide open. The PR description claims this was rewritten to poll JoinHandle::is_finished(); grep finds no is_finished() use in rs-platform-wallet at all. Either implement the poll loop (`tokio::time::sleep(small_dt); if handle.is_finished() { return handle.join() … }`) so the outer timeout actually binds, or document explicitly that Timeout means the OS thread may still be alive and the runtime must not be dropped — and reflect that in the FFI return codes per finding 1.

- [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:405-470: start() after cancel-only stop() drops the previous background_join, detaching the old thread from a later shutdown()
stop() (lines 480–489) takes background_cancel but never touches background_join, and the loop body's epilogue (lines 452–458) only clears background_cancel when the generation matches. A subsequent start() guards only on cancel_guard.is_some() (line 410); on a quick stop()→start() sequence cancel_guard is None, start() proceeds, spawns a new OS thread, and at lines 465–468 unconditionally writes the new JoinHandle into self.background_join, dropping (detaching) the prior, still-live JoinHandle for a loop that may still be winding down through its last pass / sleep wakeup. A subsequent shutdown() only joins the newest handle; the older thread is no longer reachable through any join barrier and can outlive shutdown(), holding the persister/event-handler context after the FFI told the host destroy/shutdown was clean. The same overwrite pattern lives in platform_address_sync.rs and shielded_sync.rs::start. The PR description claims start() now reaps the prior handle first; it does not. Either move the join slot under the same lock and reap-on-takeover (e.g. take the prior background_join and join_coordinator_thread it before installing a new one — would require start() to become async), or have stop()/the loop epilogue clear background_join too so an unjoined leftover handle cannot exist.

- [NITPICK] packages/rs-platform-wallet/src/manager/mod.rs:712-743: event_adapter Stopped-path test still accepts Ok, leaving the new JoinError→Stopped mapping unverified on the abort race
event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean aborts the adapter task then accepts CoordinatorThreadStatus::Stopped(_) | CoordinatorThreadStatus::Ok (lines 731–737) with an inline comment noting 'abort() races task completion'. Because the adapter is the standard make_manager() sink (no events queued), the task can trivially drain before the 10 ms sleep elapses, in which case the assertion passes via the Ok arm and the new Ok(Err(_)) ⇒ Stopped(...) mapping in shutdown() is never actually exercised. A regression collapsing that arm back to Ok would not be caught. To match the PR description's claim that this path is now deterministic, replace the adapter handle with one running std::future::pending::<()>().await before aborting, and assert exactly Stopped(_) and !status.event_adapter.is_clean().

- [NITPICK] packages/rs-platform-wallet/src/manager/mod.rs:874-931: shutdown_then_drop_runtime installs a process-global panic hook without chaining or RAII restore
std::panic::set_hook (line 878) replaces the process-wide panic hook with a closure that only increments SHUTDOWN_PANICS on messages containing 'being shutdown' and never forwards to prev_hook. The original hook is restored only after the 10-iteration loop completes (line 925); a panic anywhere in the loop body leaves the global hook replaced for the rest of the test process and silently suppresses unrelated panic diagnostics from sibling tests in the same `cargo test` binary. The PR description claims this is now per-iteration std::panic::catch_unwind; the code still uses set_hook. Fixes: chain prev_hook(info) for messages that don't match the filter, and restore the previous hook in an RAII guard (scopeguard / explicit struct with Drop) so a mid-loop panic cannot leave the global state mutated.
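
A minimal sketch of the chained-hook idea from the last nitpick, using only `std`. `PanicFilterGuard`, `install`, and `hits` are hypothetical names, not code from this PR. One deviation from the suggestion is deliberate: `std::panic::set_hook` itself panics when called from a panicking thread, so this guard does not re-install the previous hook in `Drop`; it flips a flag that turns the installed hook into a plain pass-through.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical test helper: counts panics whose message contains `needle`
/// while the guard is alive and forwards every other panic to the previous hook.
pub struct PanicFilterGuard {
    active: Arc<AtomicBool>,
    hits: Arc<AtomicUsize>,
}

impl PanicFilterGuard {
    pub fn install(needle: &'static str) -> Self {
        let active = Arc::new(AtomicBool::new(true));
        let hits = Arc::new(AtomicUsize::new(0));
        // Keep the previous hook so unrelated panics still get reported.
        let prev = panic::take_hook();
        let (a, h) = (Arc::clone(&active), Arc::clone(&hits));
        panic::set_hook(Box::new(move |info| {
            let msg = info.to_string();
            if a.load(Ordering::SeqCst) && msg.contains(needle) {
                h.fetch_add(1, Ordering::SeqCst);
            } else {
                prev(info); // chain: do not swallow other diagnostics
            }
        }));
        Self { active, hits }
    }

    pub fn hits(&self) -> usize {
        self.hits.load(Ordering::SeqCst)
    }
}

impl Drop for PanicFilterGuard {
    fn drop(&mut self) {
        // Runs during unwinding too. No set_hook call here, only a flag flip,
        // after which the hook forwards everything to the previous hook.
        self.active.store(false, Ordering::SeqCst);
    }
}
```

The trade-off is that the wrapper hook stays installed for the rest of the process; it simply stops filtering once the guard is dropped.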

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Current head fixes prior-1, prior-4, prior-6, and prior-7, but prior-2 and prior-3 remain: the shielded Clear/Stop paths still discard the bounded quiesce result and report success after a failed drain barrier. Prior-5 is only partially fixed; the overwrite path was addressed, but the new start-time drain can still detach the old coordinator thread, and the new FFI shutdown-incomplete code is not safely handled by the bundled Swift wrapper.

🔴 4 blocking

1 additional finding(s) omitted (not in diff).

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:438-444: clear_shielded still clears after a failed quiesce barrier
  Carried forward from prior-2. `clear_shielded()` wraps `shielded_sync_manager.quiesce()` in the shutdown timeout but discards both `Elapsed` and non-clean `CoordinatorThreadStatus`, then immediately calls `coord.clear().await`. The comments above this method describe quiesce as the barrier that prevents an in-flight shielded pass from re-persisting notes after Clear. If the timeout fires while the pass is still draining, this method resets shared shielded state and returns `Ok(())` while the old pass can still write through the same coordinator/persister path. Surface the non-clean status and refuse to clear unless quiesce completed cleanly.

In `packages/rs-platform-wallet-ffi/src/shielded_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:98-106: shielded_sync_stop returns success when bounded quiesce fails
  Carried forward from prior-3. The FFI stop function documents that, once it returns, no in-flight shielded pass remains and persistence callbacks have completed. The implementation still discards the `timeout(..., manager.shielded_sync().quiesce())` result and always returns `PlatformWalletFFIResult::ok()`. On timeout or another non-clean status, Swift/C callers are told it is safe to free or mutate callback and persistence context even though Rust did not establish the promised drain barrier. This should return a non-success result, preferably `ErrorShutdownIncomplete`, when quiesce does not complete cleanly.

In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/identity_sync.rs:421-443: start-time drain can still detach the old coordinator thread
  Carried forward from prior-5 in a narrowed form. The new drain takes the previous `background_join`, waits up to one second, and drops the handle if the thread is not finished; dropping a `std::thread::JoinHandle` detaches the still-running thread. Worse, `start()` holds `background_cancel` while waiting, and the exiting thread needs that same mutex in its cleanup before it can finish, so a normal stop -> start race can force the one-second path without an external wedge. After this detach, the next `shutdown()` can only join the newly stored handle and can report clean while the old callback-capable coordinator is still alive. The same pattern is present in `platform_address_sync.rs` and `shielded_sync.rs`.

In `packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift`:
- [BLOCKING] packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift:153-158: Swift deinit discards the new shutdown-incomplete signal
  The Rust FFI now returns `ErrorShutdownIncomplete = 19` when `destroy` cannot prove coordinator threads exited, with an explicit contract that the host must not free callback context immediately. Swift retains the `PlatformWalletPersistenceHandler` and `PlatformWalletEventHandler` only as fields on `PlatformWalletManager`, passes them to Rust with `Unmanaged.passUnretained`, and then calls `platform_wallet_manager_destroy(handle).discard()` in `deinit`. If Rust reports shutdown incomplete, ARC still releases those handlers as deinit completes, leaving lingering Rust coordinator callbacks with dangling context pointers. The Swift result mirror also has no case for code 19, so even non-deinit callers cannot distinguish this lifecycle-specific failure from `errorUnknown`.

Comment thread packages/rs-platform-wallet/src/manager/mod.rs Outdated
Comment thread packages/rs-platform-wallet-ffi/src/shielded_sync.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/identity_sync.rs Outdated
lklimek and others added 2 commits June 23, 2026 16:18
Extend the destroy UAF-surfacing discipline (which already returns
ErrorShutdownIncomplete=19 on a non-clean shutdown) to the shielded
clear/stop paths, so a partial/timed-out coordinator drain can no
longer be silently swallowed.

- clear_shielded now captures the quiesce result instead of discarding
  it: on a timed-out or non-clean drain it returns the new typed
  PlatformWalletError::ShieldedShutdownIncomplete (carrying the terminal
  CoordinatorThreadStatus) and leaves the commitment-tree store INTACT,
  rather than unconditionally wiping a store an in-flight pass may still
  write into. The store is wiped only on a clean drain.
- FFI shielded_sync_stop now returns ErrorShutdownIncomplete (with the
  status rendered into the message) on a non-clean/timed-out drain,
  instead of always returning ok() — symmetric with destroy. A timeout
  is reported as the Timeout status.
- FFI shielded_clear maps the new ShieldedShutdownIncomplete variant to
  ErrorShutdownIncomplete (store-reset failures still map to
  ErrorWalletOperation); the blanket From<PlatformWalletError> gains the
  same arm, pinned by a unit test.
- Swift mirror gains errorShutdownIncomplete=19 plus a richer
  PlatformWalletError.shutdownIncomplete case, wired through both the
  init(ffi:) and init(result:) switches.
- Re-export CoordinatorThreadStatus / CoordinatorExitStatus from the
  crate root so the FFI can name the status type.

BREAKING CHANGE: clear_shielded / shielded_sync_stop / shielded_clear
now report a non-clean coordinator drain instead of succeeding silently;
hosts must defer freeing their callback context and must not commit their
own persistence wipe on ErrorShutdownIncomplete.
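
The clear-only-on-clean-drain rule described in this commit can be sketched as below. This is an illustration under stated assumptions, not the crate's code: `DrainStatus`, `ClearError`, and `clear_if_drained` are hypothetical stand-ins for `CoordinatorThreadStatus`, `PlatformWalletError::ShieldedShutdownIncomplete`, and `clear_shielded`, and a `Vec<u8>` stands in for the commitment-tree store.

```rust
/// Hypothetical stand-in for the coordinator's terminal status.
#[derive(Debug, Clone, PartialEq)]
pub enum DrainStatus {
    Ok,
    NotRunning,
    Timeout,
    Stopped(String),
}

impl DrainStatus {
    pub fn is_clean(&self) -> bool {
        matches!(self, DrainStatus::Ok | DrainStatus::NotRunning)
    }
}

/// Hypothetical error carrying the non-clean status back to the caller.
#[derive(Debug, PartialEq)]
pub enum ClearError {
    ShutdownIncomplete(DrainStatus),
}

/// `drain` is `None` when the bounded wait elapsed before the drain finished.
pub fn clear_if_drained(
    drain: Option<DrainStatus>,
    store: &mut Vec<u8>,
) -> Result<(), ClearError> {
    let status = drain.unwrap_or(DrainStatus::Timeout);
    if !status.is_clean() {
        // Leave the store intact: an in-flight pass may still write into it.
        return Err(ClearError::ShutdownIncomplete(status));
    }
    store.clear();
    Ok(())
}
```

Returning the status inside the error lets an FFI bridge render it into its message, as the commit describes for `shielded_sync_stop`.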

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…d_cancel lock

All three coordinators (identity_sync, platform_address_sync,
shielded_sync) reaped the prior loop's OS thread inside start() WHILE
holding background_cancel. But the exiting prior thread's epilogue also
locks background_cancel to clear its slot, so a tight stop()→start()
deadlocked the reap: the prior thread blocked on the lock start() held,
never finished, and the is_finished() spin-wait burned the full 1 s
deadline then DETACHED the handle — a 1 s stall plus a transient
untracked thread, on the exact stop()→start() path the reap exists for.

Reorder start() to install the new cancel token + bump the generation
under the lock, then drop(cancel_guard) to release background_cancel,
and only THEN run the spin-wait + join. The prior thread's epilogue now
acquires the lock (or, for shielded, observes the bumped generation),
skips clearing the freshly-installed token, and returns, so is_finished()
trips in milliseconds and the join is near-instant. start() stays
synchronous; the 1 s deadline remains only as a genuine-wedge backstop.

Adds restart_after_stop_reaps_prior_thread regression tests to the
identity and platform-address coordinators: start → (stop+start
back-to-back) → assert the restart returns well under the 1 s deadline.
Verified non-vacuous — against the old lock-held ordering it stalls
~1.0 s and fails.
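
The reordering described above (install under the lock, release the lock, then reap) can be sketched with `std` only. Everything here is a simplified assumption: `spawn_loop`, `restart`, and `demo` are hypothetical, a `Mutex<Option<u64>>` holding a generation number stands in for `background_cancel`, and an `AtomicBool` stands in for the cancel token.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical loop thread: runs until `stop` is set, then clears its own
/// generation from the shared slot. That epilogue is the step needing the lock.
fn spawn_loop(
    slot: Arc<Mutex<Option<u64>>>,
    stop: Arc<AtomicBool>,
    generation: u64,
) -> JoinHandle<()> {
    thread::spawn(move || {
        while !stop.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(1));
        }
        let mut guard = slot.lock().unwrap();
        if *guard == Some(generation) {
            *guard = None; // only clear a slot this generation still owns
        }
    })
}

/// Returns true if the prior thread was joined before `deadline` elapsed.
pub fn restart(
    slot: &Arc<Mutex<Option<u64>>>,
    prior: JoinHandle<()>,
    prior_stop: &AtomicBool,
    new_generation: u64,
    deadline: Duration,
) -> bool {
    prior_stop.store(true, Ordering::SeqCst);
    let mut guard = slot.lock().unwrap();
    *guard = Some(new_generation);
    drop(guard); // release BEFORE reaping so the prior epilogue can take the lock

    let started = Instant::now();
    while !prior.is_finished() {
        if started.elapsed() >= deadline {
            return false; // `prior` is dropped here, which detaches the thread
        }
        thread::sleep(Duration::from_millis(1));
    }
    prior.join().is_ok()
}

pub fn demo() -> (bool, Option<u64>) {
    let slot = Arc::new(Mutex::new(Some(1)));
    let stop = Arc::new(AtomicBool::new(false));
    let prior = spawn_loop(Arc::clone(&slot), Arc::clone(&stop), 1);
    let reaped = restart(&slot, prior, &stop, 2, Duration::from_secs(1));
    let value = *slot.lock().unwrap();
    (reaped, value)
}
```

If the `drop(guard)` line were moved below the spin-wait, the prior thread would block on the lock and the wait would run to the full deadline, which is the stall the commit message describes.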

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
@Claudius-Maginificent

Collaborator Author

Re: the latest CHANGES_REQUESTED incremental review — it was performed against parent commit 747f5f00, before the post-review hardening landed, so its carried-forward findings are stale relative to the pushed tip 5f63c9544c. Current status:

  • Findings 1, 4, 5, 6, 7 (destroy returns ok(); join_coordinator_thread uses spawn_blocking; start-after-stop detach; event-adapter test accepts Ok; process-global panic hook) — fixed in 2bd9501a0e (is_finished() poll-join, FFI destroy returns ErrorShutdownIncomplete, deterministic Stopped test, per-iteration catch_unwind). Those inline threads are resolved with per-finding replies.
  • Findings 2 & 3 (clear_shielded and shielded_sync_stop discarding the bounded-quiesce() result) — valid, and fixed in 7c975ed5a6: clear_shielded now returns PlatformWalletError::ShieldedShutdownIncomplete and leaves the store intact on a non-clean/timed-out drain; the FFI shielded_sync_stop / shielded_clear bridges return ErrorShutdownIncomplete — symmetric with destroy.
  • The finding-5 reap also had a lock-ordering bug (reap held background_cancel across the join); fully closed in 5f63c9544c with a non-vacuous regression test.

Verification on 5f63c9544c: platform-wallet 214 (default) / 312 (--features shielded), dash-async 4, clippy -D warnings clean on both feature sets, fmt clean. A fresh review of 5f63c9544c would supersede this stale CHANGES_REQUESTED.

🤖 Co-authored by Claudius the Magnificent AI Agent

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 658-660: The mgr.quiesce().await call can hang indefinitely if the
restarted loop wedges, causing CI to stall. Wrap the quiesce().await call in a
timeout to prevent hanging, and assert that the timeout succeeds and the
returned status indicates a clean shutdown. This ensures the test fails fast if
the manager fails to quiesce properly rather than hanging the CI pipeline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b16b3350-f53b-4300-9c5f-2cb19cb4566a

📥 Commits

Reviewing files that changed from the base of the PR and between 2bd9501 and 5f63c95.

📒 Files selected for processing (9)
  • packages/rs-platform-wallet-ffi/src/error.rs
  • packages/rs-platform-wallet-ffi/src/shielded_sync.rs
  • packages/rs-platform-wallet/src/error.rs
  • packages/rs-platform-wallet/src/lib.rs
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/mod.rs
  • packages/rs-platform-wallet/src/manager/platform_address_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs
  • packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift
🚧 Files skipped from review as they are similar to previous changes (3)
  • packages/rs-platform-wallet/src/lib.rs
  • packages/rs-platform-wallet/src/manager/identity_sync.rs
  • packages/rs-platform-wallet/src/manager/shielded_sync.rs

Comment thread packages/rs-platform-wallet/src/manager/platform_address_sync.rs

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Latest delta resolves 2 of 4 prior blockers (clear_shielded and shielded_sync_stop now surface ErrorShutdownIncomplete) and the deterministic half of the start-time detach (cancel_guard released before reap; new regression tests in identity_sync.rs and platform_address_sync.rs assert <500 ms reap). Two carried-forward concerns remain: (1) Swift deinit still .discard()s the new shutdownIncomplete result and lets ARC free callback contexts the FFI just told it not to (convergent, all 6 agents, blocking); (2) start()'s 1 s wedge-detach backstop remains and can silently desync shutdown()'s clean-status from a still-live prior coordinator thread (downgraded to suggestion: deterministic path fixed, intentional liveness tradeoff documented in code). One additional test-coverage gap: the new restart_after_stop_reaps_prior_thread regression test was not mirrored into shielded_sync.rs.

🔴 1 blocking | 🟡 2 suggestion(s)

1 additional finding(s) omitted (not in diff).

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift`:
- [BLOCKING] packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift:153-160: deinit discards ErrorShutdownIncomplete and lets ARC free callback contexts Rust still references
  Rust's `platform_wallet_manager_shielded_sync_stop` (rs-platform-wallet-ffi/src/shielded_sync.rs:124-132) and `platform_wallet_manager_destroy` (rs-platform-wallet-ffi/src/manager.rs:367-385) now return `ErrorShutdownIncomplete = 19` on a non-clean drain, with an explicit FFI contract: "host must not free the callback context immediately — a lingering pass may still fire one final callback through it." `PlatformWalletResultCode.errorShutdownIncomplete` and `PlatformWalletError.shutdownIncomplete` were added to mirror this, but the lifecycle-critical caller here ignores the signal:

  ```swift
  deinit {
      progressPollTask?.cancel()
      if handle != NULL_HANDLE {
          platform_wallet_manager_platform_address_sync_stop(handle).discard()
          platform_wallet_manager_shielded_sync_stop(handle).discard()
          platform_wallet_manager_destroy(handle).discard()
      }
  }
  ```

  `.discard()` only frees the FFI result message; it never inspects the code. `persistenceHandler` and `eventHandler` are stored only as fields on `PlatformWalletManager` and are passed to Rust with `Unmanaged.passUnretained(...).toOpaque()`. The instant `deinit` returns, ARC releases both, and any lingering coordinator that returned `Timeout` will dereference freed memory on its next callback: the exact UAF the new return code was introduced to prevent.

  The fix needs `deinit` to observe code 19 and keep both handlers alive past the destroy (e.g. `Unmanaged.passRetained(...)` to leak them deliberately when Rust reports incomplete shutdown, or hand them to a detached holder keyed off the handle until a follow-up signal confirms drain). Storing the result and acting on it, rather than calling `.discard()`, is the minimum behavior change.

In packages/rs-platform-wallet/src/manager/identity_sync.rs:

  • [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517: Wedge-backstop detach in start() leaves a still-live coordinator invisible to shutdown()
    The deterministic stop()→start() race that forced the detach is fixed in the current head: cancel_guard is dropped before the reap (identity_sync.rs:488-494), the cancellable tokio::select! in the loop body lands cancellation promptly, and restart_after_stop_reaps_prior_thread asserts the reap completes in <500 ms. The 1 s backstop at lines 502-517 (and the identical pattern in platform_address_sync.rs:~317 and shielded_sync.rs:343-358) still exists, however, and on a genuine wedge (e.g. a sync_now future whose Drop never yields) it drops h and detaches the still-running OS thread. Because start() has already installed the new handle in background_join, a later shutdown() joins only the new thread, reports all_clean() == true, and destroy returns ok() — at which point the FFI contract permits the host to free the callback context that the detached prior thread still holds via Arc'd FFI wrappers.

    The failure mode is narrow (requires a non-yielding wedge) and the comments document it as a deliberate liveness/safety tradeoff. But the rest of this PR's lifecycle guarantees rely on shutdown()'s status honestly reflecting whether any coordinator could still call back. Two options that close it cleanly: (i) track detached handles in a per-manager orphans list that shutdown() polls and reports as a non-clean Timeout/Detached, or (ii) drop the deadline once cancellation is signalled (cancellable select! makes a real-world stall vanishingly rare).

In packages/rs-platform-wallet/src/manager/shielded_sync.rs:

  • [SUGGESTION] packages/rs-platform-wallet/src/manager/shielded_sync.rs:325-358: shielded_sync.rs missing the restart_after_stop_reaps_prior_thread regression test
    The latest delta adds restart_after_stop_reaps_prior_thread to identity_sync.rs and platform_address_sync.rs, pinning the reap-after-drop(cancel_guard) ordering at <500 ms. The same start() restructuring was applied to ShieldedSyncManager (shielded_sync.rs:325-358) but no equivalent test was added there, leaving the shielded path's stop()→start() ordering unverified. A future refactor of shielded_sync.rs that accidentally moved the reap back inside cancel_guard's lifetime would only be caught by the two siblings; the shielded path would silently regress to the lock-held 1 s detach pattern. Mirroring the existing regression test in shielded_sync.rs (gated behind #[cfg(feature = "shielded")] if needed) would pin the invariant across all three coordinators.

Comment thread packages/rs-platform-wallet/src/manager/identity_sync.rs Outdated
Comment thread packages/rs-platform-wallet/src/manager/shielded_sync.rs Outdated
lklimek and others added 2 commits June 24, 2026 09:34
Three shielded-sync hardening fixes, bringing it in line with its
identity-sync and platform-address-sync siblings.

- shielded_sync.rs exit epilogue read `background_generation` BEFORE
  acquiring `background_cancel` (load-then-lock). That stale-read TOCTOU let
  a prior thread observe a pre-bump generation, block on the lock until a
  concurrent start() released it, then null the freshly-installed token —
  leaving the new loop running but untracked via is_running()/stop(). Acquire
  the lock first and compare the generation under it, exactly like the
  siblings.

- Add the `restart_after_stop_reaps_prior_thread` regression test the
  siblings already carry. It pins the reap-after-drop(cancel_guard) reorder:
  a back-to-back stop()+start() must reap the prior OS thread in <500 ms, not
  stall ~1 s on the detach backstop. Confirmed non-vacuous — it fails at
  ~1.0 s with the reap moved back inside the lock.

- platform-wallet-ffi: the ErrorShutdownIncomplete doc only described
  destroy. It is now also returned by shielded_sync_stop and shielded_clear,
  where the manager is NOT torn down and the operation can be retried.
  Document all three callers and their differing retry semantics.
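
The load-then-lock hazard fixed in the first bullet can be reproduced deterministically with a small model. This is a sketch, not the shielded coordinator's code: `Slots`, both `epilogue_*` functions, and `start` are hypothetical, a `&'static str` stands in for the cancel token, and the `interleave` closure simulates a concurrent `start()` running at the one point where it can slip in.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

pub struct Slots {
    pub generation: AtomicU64,
    pub cancel: Mutex<Option<&'static str>>, // stands in for the cancel token
}

/// Models start(): bump the generation and install a token under the lock.
pub fn start(s: &Slots, token: &'static str) {
    let mut guard = s.cancel.lock().unwrap();
    s.generation.fetch_add(1, Ordering::SeqCst);
    *guard = Some(token);
}

/// Buggy ordering: the generation is read BEFORE the lock is taken, so a
/// start() that runs in between leaves `seen` stale.
pub fn epilogue_load_then_lock(s: &Slots, my_gen: u64, interleave: impl FnOnce()) {
    let seen = s.generation.load(Ordering::SeqCst);
    interleave();
    let mut guard = s.cancel.lock().unwrap();
    if seen == my_gen {
        *guard = None; // may null a token installed by the newer generation
    }
}

/// Fixed ordering: take the lock first, then compare under it. start() also
/// needs the lock, so nothing can change between the compare and the clear.
pub fn epilogue_lock_then_compare(s: &Slots, my_gen: u64, interleave: impl FnOnce()) {
    interleave();
    let mut guard = s.cancel.lock().unwrap();
    if s.generation.load(Ordering::SeqCst) == my_gen {
        *guard = None;
    }
}
```

With the same interleaving, the first variant wipes the freshly installed token while the second leaves it in place.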

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
PlatformWalletManager.deinit called .discard() on shielded_sync_stop and
destroy, tossing the result code. Rust now returns ErrorShutdownIncomplete
(19) on a non-clean drain, with the contract: a lingering coordinator thread
may still fire one final callback through the host-owned callback context.
But persistenceHandler/eventHandler are handed to Rust via
Unmanaged.passUnretained and kept alive only by this object's fields, so the
instant deinit returns ARC frees them — a use-after-free on that final
callback.

Capture the code via a new discardReturningCode() helper; on
ErrorShutdownIncomplete from shielded_sync_stop OR destroy, deliberately leak
one extra strong reference (an unbalanced passRetained, never released) to
each handler so it outlives any lingering thread. A clean shutdown — the
common case — takes neither branch and releases the handlers normally; we
never leak unconditionally.

UNVERIFIED locally: no Swift toolchain / xcframework on this host. Reasoned
correct-by-construction; needs an iOS-environment build to confirm.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
@Claudius-Maginificent

Collaborator Author

Swift deinit blocking finding — fixed in 5017ba1313. PlatformWalletManager.deinit now captures the result code from shielded_sync_stop and destroy (via a new discardReturningCode() helper) and, only on .errorShutdownIncomplete, deliberately leaks one extra strong ref to persistenceHandler/eventHandler (unbalanced Unmanaged.passRetained) so a lingering coordinator that returned Timeout cannot fire its final callback into a freed context. A clean shutdown takes neither branch — no unconditional leak. platform_address_sync_stop stays .discard() (cancel-only; never returns code 19).

⚠️ Build-unverified: this host has no Swift toolchain / xcframework, so the change was made correct-by-construction and is NOT compiled here — it needs an iOS-environment build before merge. Tracked.

🤖 Co-authored by Claudius the Magnificent AI Agent

…on tests

Wrap the cleanup `mgr.quiesce().await` in all three
`restart_after_stop_reaps_prior_thread` tests with a 2-second
`tokio::time::timeout`. An unbounded quiesce after the restarted loop
would hang CI forever if the loop wedges; now it fails fast with a clear
message. Also assert `status.is_clean()` on the returned
`CoordinatorThreadStatus`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

The latest delta resolves prior-1 (Swift deinit at PlatformWalletManager.swift:153-195 now captures both shielded_sync_stop and destroy result codes via discardReturningCode() and leaks persistenceHandler/eventHandler via unbalanced passRetained on .errorShutdownIncomplete) and prior-3 (restart_after_stop_reaps_prior_thread now exists in shielded_sync.rs alongside its identity/platform-address siblings). Prior-2 — the 1 s wedge-backstop detach in each coordinator's start() — remains in the tree on the deliberate-wedge path; all six agents converge on it as a suggestion, narrow but conceptually opposed to the PR's central honest-status contract. No new blocking issues introduced.

🟡 1 suggestion(s)

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517: start()'s 1 s wedge-backstop detaches a still-live coordinator that shutdown()/destroy then reports as clean
  Carried forward from prior review (still valid; no reviewer-repeat required by the verifier protocol).

  On a genuine wedge — a `sync_now` future whose `Drop` never yields, or a non-yielding step inside the SDK — the prior thread cannot reach its exit epilogue within the 1 s deadline at identity_sync.rs:502-517 (and the symmetric patterns at platform_address_sync.rs:317-332 and shielded_sync.rs:350-365). `start()` then drops `h`, detaching the still-running OS thread; `background_join` already holds the new generation's handle, so a later `quiesce()`/`shutdown()` joins only the new thread, `CoordinatorExitStatus::all_clean()` returns true, and `platform_wallet_manager_destroy` returns `ok()`.

  That directly defeats the central FFI contract this PR introduces. The Swift deinit at PlatformWalletManager.swift:190-194 only retains `persistenceHandler` / `eventHandler` when it observes `.errorShutdownIncomplete`; a falsely-clean code lets ARC free both immediately, and the detached prior thread — which still owns Arc'd Rust callback wrappers carrying the `passUnretained(self).toOpaque()` raw pointers (PlatformWalletPersistenceHandler.swift:993-996, PlatformWalletManagerAddressSync.swift:47-50) — can fire one last callback into freed memory. The same hazard exists on the timeout paths of `clear_shielded` and the FFI `platform_wallet_manager_shielded_sync_stop`, where an outer `tokio::time::timeout` firing while the inner quiesce is parked inside `join_coordinator_thread(handle).await` drops the local handle at the .await — detach.

  The cancellable `tokio::select!` in each loop body and the prompt `drop(cancel_guard)` ordering (identity_sync.rs:488-494, mirrored in the other two) make this vanishingly rare in practice — `restart_after_stop_reaps_prior_thread` now pins the common-case reap at <500 ms across all three coordinators. But the residual path is the one bypass of the PR's honest-status guarantee. Two clean closures: (a) track detached handles in a per-manager orphans list that `shutdown()` polls and surfaces as `Timeout`/`Detached`, so `destroy` returns `ErrorShutdownIncomplete` and the host keeps the callback context alive; or (b) once cancellation has been signalled, drop the 1 s deadline entirely — the cancellable select! makes a real-world stall essentially impossible, and turning a degraded shutdown into a hang is arguably less dangerous than the silent UAF window the detach can open.

Comment thread packages/rs-platform-wallet/src/manager/identity_sync.rs Outdated

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Latest delta (5017ba1..b491773) is test-only: it bounds the post-restart cleanup quiesce() in three restart-reap regression tests with a 2 s tokio::time::timeout and asserts is_clean(). No production code changed and no new defects in the delta. The prior finding about start()'s 1 s wedge-backstop silently detaching a still-live coordinator is STILL VALID on the current head for all three coordinators and is carried forward as the sole in-scope finding.

Carried-forward prior findings

  1. [suggestion] 1 s wedge-backstop in start() detaches a live prior coordinator that shutdown()/destroy then reports as clean

    packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517

    Verified against b491773. On a genuine wedge (a sync_now future whose Drop never yields, or a non-yielding step inside the SDK) the prior coordinator thread cannot reach its exit epilogue within the 1 s deadline. The loop at lines 504–513 then breaks and drops h, which detaches the still-running OS thread (dropping a std::thread::JoinHandle detaches the thread; it does not join). background_join now holds only the replacement handle installed at lines 483–486, so a later quiesce() (lines 576–581) joins only the replacement and CoordinatorExitStatus::all_clean() returns true. FFI platform_wallet_manager_destroy (packages/rs-platform-wallet-ffi/src/manager.rs:366–386) then returns ok() instead of ErrorShutdownIncomplete=19. Swift PlatformWalletManager.deinit (packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift:153–195) does NOT take the retain-leak branch and ARC releases the persister/event-handler wrappers, even though the detached orphan thread still holds Arc'd Rust callback wrappers referencing the freed passUnretained context. This is exactly the UAF class this PR exists to close, but now on the stop()→start()→destroy() degraded path the new restart bookkeeping creates.

    The biased cancel-select introduced at lines 446–461 narrows the common-case window (cancel beats sync_now's next .await), but it cannot abort an SDK step or future Drop that is itself non-yielding — which is precisely when the 1 s backstop fires. Same pattern exists in platform_address_sync.rs:317–332 and shielded_sync.rs:350–365; none of the three record the detach event into state observable by shutdown()/CoordinatorExitStatus. The new restart-reap tests only exercise the prompt-reap (<500 ms) path, not a genuine wedge past 1 s.

    Minimum FFI-correct fix options (none fit an inline patch): (a) track detached orphans in a Mutex<Vec<JoinHandle<()>>> the next quiesce()/shutdown() polls and surfaces as Timeout/non-clean status; (b) have start() flip a per-coordinator atomic on detach that quiesce() ORs into its returned CoordinatorThreadStatus; (c) have start() return an error rather than silently install a replacement when the prior couldn't be reaped, so the FFI host can escalate. The current shape — silently dropping the orphan — is the one outcome that defeats the PR's central invariant that destroy()'s result code tells the host whether it is safe to free callback contexts.

…) reports them as non-clean

Closes the residual use-after-free window left by the coordinator
reap backstop. On a tight stop()->start(), each sync coordinator
waits ~1s for the prior OS thread to finish; if that thread is
genuinely wedged in a non-yielding Drop, the backstop previously
DROPPED the still-live JoinHandle (detaching it). A later shutdown()
joined only the current handle, all_clean() returned true, and the
FFI destroy returned ok() — at which point the host could free the
callback context the detached, still-running thread might still touch.

Fix (review option i): the manager now owns a shared CoordinatorOrphans
list (Arc<Mutex<Vec<JoinHandle>>>) cloned into every coordinator. The
duplicated reap blocks in identity/platform-address/shielded start()
are consolidated into reap_prior_or_park(), which PARKS a wedged prior
thread in that list instead of dropping it (lock-ordering preserved:
drop(cancel_guard) still happens before the reap). shutdown() then
drains the list via join_detached_orphans() within a bounded, yielding
is_finished() poll and reports a new CoordinatorThreadStatus::Detached
(non-clean) in CoordinatorExitStatus::detached_threads when any orphan
is still alive at the grace deadline. all_clean() folds it in, so the
FFI destroy correctly returns ErrorShutdownIncomplete and the host
delays freeing its context. The new Detached variant re-exports through
lib.rs with its sibling statuses.

Tests (manager/mod.rs): reap_prior_or_park parks a force-wedged thread;
join_detached_orphans reports Detached then Ok; and a manager shutdown()
with a parked still-live orphan reports non-clean. All proven
non-vacuous by neutering the park/join. Cleanup quiesce/join in tests is
bounded; a wedged stand-in thread is released and joined so none leak.
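
The park-instead-of-drop behaviour described in this commit can be sketched with `std` only. The names `reap_or_park` and `join_orphans` and their signatures are hypothetical simplifications of `reap_prior_or_park` and `join_detached_orphans`; the real helpers are described as yielding polls, whereas this model blocks with `thread::sleep`.

```rust
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Shared list of threads that could not be reaped in time.
pub type Orphans = Arc<Mutex<Vec<JoinHandle<()>>>>;

/// Waits up to `deadline` for `prior`. Joins it if it finished; otherwise
/// parks the handle so a later shutdown can still observe the thread.
/// Returns true when the thread was reaped.
pub fn reap_or_park(prior: JoinHandle<()>, orphans: &Orphans, deadline: Duration) -> bool {
    let started = Instant::now();
    while !prior.is_finished() {
        if started.elapsed() >= deadline {
            orphans.lock().unwrap().push(prior); // park, do not drop
            return false;
        }
        thread::sleep(Duration::from_millis(1));
    }
    let _ = prior.join();
    true
}

/// Joins every finished orphan and returns how many are still alive once
/// `grace` has elapsed. A non-zero result maps to a non-clean status.
pub fn join_orphans(orphans: &Orphans, grace: Duration) -> usize {
    let started = Instant::now();
    loop {
        let mut list = orphans.lock().unwrap();
        let (done, alive): (Vec<_>, Vec<_>) =
            list.drain(..).partition(|h| h.is_finished());
        *list = alive; // unfinished handles stay parked for a later retry
        let remaining = list.len();
        drop(list);
        for h in done {
            let _ = h.join();
        }
        if remaining == 0 || started.elapsed() >= grace {
            return remaining;
        }
        thread::sleep(Duration::from_millis(1));
    }
}
```

Because unfinished handles are written back into the list, a second call after the thread is released reports zero, which mirrors the "Detached then Ok" test mentioned above.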

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Prior finding (start()/wedge-backstop detaching a live coordinator) is FIXED at head 76c8bee via reap_prior_or_park + CoordinatorOrphans + join_detached_orphans, with Detached folded into all_clean(). However, two related lifecycle gaps in the same orphan-tracking design remain: (1) quiesce() takes the JoinHandle out of background_join before awaiting join_coordinator_thread, so a wrapping tokio::time::timeout (used by shutdown(), clear_shielded(), and the FFI shielded_sync_stop bridge) drops the live handle and detaches the thread without parking it; on the explicit retry path the next call sees background_join == None and reports NotRunning/clean while a wedged thread may still call host callbacks. (2) clear_shielded() consults only the current shielded quiesce status and wipes the commitment-tree store even when an earlier wedged shielded coordinator is still alive in coordinator_orphans, breaking the PR's 'Clear leaves the store intact whenever a sync drain is incomplete' invariant. Both are in scope for the PR's stated FFI lifecycle/UAF hardening.

🔴 2 blocking

1 additional finding(s) omitted (not in diff).

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:240-261: Timed-out quiesce drops the JoinHandle and silently detaches a live coordinator
  join_coordinator_thread takes the JoinHandle by value. Every caller of quiesce() — shutdown(), clear_shielded(), and the FFI platform_wallet_manager_shielded_sync_stop — wraps it in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...). Each coordinator's quiesce() first synchronously takes background_join (identity_sync.rs:579, shielded_sync.rs:423, platform_address_sync.rs analogous) and only then awaits this helper. If the outer timeout fires while this loop is in tokio::time::sleep, the future is dropped, the locally-owned handle is dropped with it, and the still-live OS thread is detached — not parked in coordinator_orphans. The FFI/host contract is explicitly retry-capable here (Timeout → ErrorShutdownIncomplete → host should be able to retry stop/destroy), but a retry now sees background_join == None and join_coordinator_thread returns NotRunning, which is_clean() treats as clean. The host then frees the persister / event-handler context while a wedged thread still holds Arcs to it and may still fire one final callback — the exact UAF class the orphan list was added to close. The fix is to keep the handle reachable across timeout/cancel: either don't take() from background_join until is_finished() is true, or have join_coordinator_thread re-park the unjoined handle into a manager-scoped orphans list on Drop / on the deadline (mirroring reap_prior_or_park's park-on-wedge contract).
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the store while parked coordinator orphans may still be alive
  clear_shielded() gates the store wipe on `self.shielded_sync_manager.quiesce()` returning a clean status, but a tight stop()→start() sequence can park a previously-wedged shielded coordinator thread into coordinator_orphans (reap_prior_or_park, mod.rs:296-332). Those parked threads hold Arc references to the same shielded coordinator / persister context the clear() below is about to wipe, and shutdown() is currently the only path that drains them via join_detached_orphans. As soon as the current quiesce reports clean, this path calls coord.clear() while a parked orphan may still be inside its sync_now drop / persister fan-out, violating the PR's stated invariant that Clear leaves the store intact whenever a sync drain is incomplete and re-creating the store-desync the orphan tracking was added to surface. Mirror the shutdown() path: after quiesce, drain coordinator_orphans (with SHUTDOWN_ORPHAN_GRACE_SECS) and refuse the wipe with ShieldedShutdownIncomplete if any orphan is still Detached/Panicked.

Comment on lines +240 to 261
pub(crate) async fn join_coordinator_thread(
    handle: Option<std::thread::JoinHandle<()>>,
) -> CoordinatorThreadStatus {
    let Some(handle) = handle else {
        return CoordinatorThreadStatus::NotRunning;
    };
    // Poll until the thread exits. The coordinator was already cancelled
    // (stop() fires before quiesce() calls us), so is_finished() becomes
    // true nearly immediately — typically within a single 5 ms step.
    loop {
        if handle.is_finished() {
            return match handle.join() {
                Ok(()) => CoordinatorThreadStatus::Ok,
                Err(payload) => CoordinatorThreadStatus::Panicked(panic_message(payload)),
            };
        }
        // Yield to the executor so the outer tokio::time::timeout wrapping
        // quiesce() can fire if the deadline has passed. Without this yield
        // the loop would busy-spin and block the task.
        tokio::time::sleep(std::time::Duration::from_millis(5)).await;
    }
}

Collaborator

🔴 Blocking: Timed-out quiesce drops the JoinHandle and silently detaches a live coordinator

join_coordinator_thread takes the JoinHandle by value. Every caller of quiesce() — shutdown(), clear_shielded(), and the FFI platform_wallet_manager_shielded_sync_stop — wraps it in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...). Each coordinator's quiesce() first synchronously takes background_join (identity_sync.rs:579, shielded_sync.rs:423, platform_address_sync.rs analogous) and only then awaits this helper. If the outer timeout fires while this loop is in tokio::time::sleep, the future is dropped, the locally-owned handle is dropped with it, and the still-live OS thread is detached — not parked in coordinator_orphans. The FFI/host contract is explicitly retry-capable here (Timeout → ErrorShutdownIncomplete → host should be able to retry stop/destroy), but a retry now sees background_join == None and join_coordinator_thread returns NotRunning, which is_clean() treats as clean. The host then frees the persister / event-handler context while a wedged thread still holds Arcs to it and may still fire one final callback — the exact UAF class the orphan list was added to close. The fix is to keep the handle reachable across timeout/cancel: either don't take() from background_join until is_finished() is true, or have join_coordinator_thread re-park the unjoined handle into a manager-scoped orphans list on Drop / on the deadline (mirroring reap_prior_or_park's park-on-wedge contract).

source: ['codex']
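
The first suggested fix (do not `take()` the handle out of `background_join` until `is_finished()` is true) can be sketched as follows. This is an illustrative std-only version, not the PR's code: `JoinStatus` and `join_in_place` are hypothetical names, the slot stands in for `background_join`, and the deadline check stands in for the outer `tokio::time::timeout` dropping the future. The point is that the handle leaves the slot only in the same critical section that observes it finished, so a timed-out attempt leaves it reachable for a retry.

```rust
use std::sync::Mutex;
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical, simplified status type for this sketch.
#[derive(Debug, PartialEq)]
pub enum JoinStatus {
    NotRunning,
    Ok,
    Panicked,
    Timeout,
}

/// Joins the thread stored in `slot`, but only removes the handle from the
/// slot once the thread is known to have finished.
pub fn join_in_place(slot: &Mutex<Option<JoinHandle<()>>>, timeout: Duration) -> JoinStatus {
    let deadline = Instant::now() + timeout;
    loop {
        {
            let mut guard = slot.lock().unwrap();
            let finished = match guard.as_ref() {
                None => return JoinStatus::NotRunning,
                Some(handle) => handle.is_finished(),
            };
            if finished {
                // Safe to take: join() on a finished thread cannot block.
                let handle = guard.take().expect("slot was Some above");
                return match handle.join() {
                    Ok(()) => JoinStatus::Ok,
                    Err(_) => JoinStatus::Panicked,
                };
            }
        } // lock released before sleeping
        if Instant::now() >= deadline {
            // The handle is still in the slot, so a retry can find it.
            return JoinStatus::Timeout;
        }
        thread::sleep(Duration::from_millis(5));
    }
}
```

With this shape a retry after `Timeout` polls the same thread again instead of reporting `NotRunning`.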

Collaborator Author

Confirmed — both findings are still open at HEAD 3cca1cf833. Thanks for the catch.

F1 (this thread): join_coordinator_thread takes the JoinHandle by value, and quiesce() take()s it out of background_join before awaiting. If the wrapping tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) fires during the helper's 5 ms poll-sleep, the future is dropped and the owned handle drops with it: the OS thread is silently detached and never re-parked into coordinator_orphans. On retry, background_join == None → NotRunning → is_clean(), so destroy can free the context while the thread is still live. Confirmed UAF window.

F2 (review body): clear_shielded() (mod.rs:644-674) gates the store wipe on the current quiesce() being clean but never drains coordinator_orphans, so a prior-generation shielded thread parked by reap_prior_or_park can still hold Arc refs to the persister that clear() is about to wipe. Confirmed.

Both will be addressed before this PR merges — leaving this thread open until the fix lands.

🤖 Co-authored by Claudius the Magnificent AI Agent

…via tokio::join!

The three periodic coordinators (platform-address, identity, shielded)
were quiesced sequentially in shutdown(), making the worst-case wait
additive (~3 x SHUTDOWN_JOIN_TIMEOUT_SECS). Each quiesce() touches only
its own state (its quiescing/is_syncing atomics and its own
background_cancel/background_join mutexes) and joins its own OS thread,
sharing no lock, so racing them is sound. Drain them concurrently via
tokio::join!, collapsing the worst case to ~max(timeouts).

Each join! arm keeps its OWN inner tokio::time::timeout, so every
coordinator still yields its own per-coordinator CoordinatorThreadStatus
(a single outer timeout would flatten all three to Timeout). The event
adapter teardown and join_detached_orphans stay sequential and ordered
strictly AFTER the coordinator join!, since the adapter sinks the
coordinators' stores. The multi-thread runtime assert is unchanged.
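
A std-only analogue of the concurrent drain, to illustrate why waiting on all coordinators at once bounds the wait by the largest timeout while still yielding one status per coordinator. This is not the `tokio::join!` code from the commit: `DrainStatus` and `drain_concurrently` are hypothetical, and a single shared deadline replaces the per-arm `tokio::time::timeout`.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical, simplified per-thread status for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum DrainStatus {
    Ok,
    Panicked,
    Timeout,
}

/// Polls every handle against the same deadline, so the total wait is
/// bounded by `timeout` rather than `slots.len() * timeout`. Each slot
/// still gets its own status; an empty slot is reported as `Ok`.
pub fn drain_concurrently(
    slots: &mut [Option<JoinHandle<()>>],
    timeout: Duration,
) -> Vec<DrainStatus> {
    let deadline = Instant::now() + timeout;
    let mut statuses: Vec<Option<DrainStatus>> = slots.iter().map(|_| None).collect();
    loop {
        for (slot, status) in slots.iter_mut().zip(statuses.iter_mut()) {
            let ready = slot.as_ref().map_or(true, |h| h.is_finished());
            if status.is_none() && ready {
                *status = Some(match slot.take().map(|h| h.join()) {
                    None | Some(Ok(())) => DrainStatus::Ok,
                    Some(Err(_)) => DrainStatus::Panicked,
                });
            }
        }
        let timed_out = Instant::now() >= deadline;
        if timed_out || statuses.iter().all(|s| s.is_some()) {
            // Unresolved slots keep their handle and are reported as Timeout.
            return statuses
                .into_iter()
                .map(|s| s.unwrap_or(DrainStatus::Timeout))
                .collect();
        }
        thread::sleep(Duration::from_millis(5));
    }
}
```

One wedged thread here yields `[Ok, Timeout, Ok]` rather than three `Timeout`s, which is the property the per-arm timeouts preserve in the commit.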

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
@lklimek

lklimek commented Jun 24, 2026

Contributor

@thepastaclaw please review

@thepastaclaw thepastaclaw left a comment

Collaborator

Code Review

Incremental review against head 3cca1cf. The new tokio::join! concurrent drain in shutdown() is structurally sound — each coordinator's quiesce() touches only its own quiescing/is_syncing atomics and background_cancel/background_join mutexes, and the shared coordinator_orphans Mutex is not taken inside quiesce(), so racing the three drains is safe; per-coordinator inner timeouts correctly preserve per-thread CoordinatorThreadStatus that a single outer timeout would have flattened. However, both prior blocking findings are unchanged on this head: a timed-out quiesce still loses the only JoinHandle and silently detaches a live coordinator OS thread, and clear_shielded still wipes the shielded store without consulting coordinator_orphans. Both were unanimously re-flagged by all six agent reviews and both are required for this PR's stated lifetime/UAF guarantees to actually hold.

🔴 1 blocking

1 additional finding(s) omitted (not in diff).

1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.

🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.

In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the shielded store without draining parked coordinator orphans
  clear_shielded (mod.rs:644–674) gates coord.clear().await on shielded_sync_manager.quiesce() returning is_clean(), and the doc on SHUTDOWN_JOIN_TIMEOUT_SECS (mod.rs:407–408) explicitly lists clear_shielded and the FFI shielded-stop bridge as using the same backstop. But neither path inspects self.coordinator_orphans. Shielded coordinator start() parks a prior wedged shielded OS thread into that shared list via reap_prior_or_park (mod.rs:296–332) whenever a tight stop()→start() cannot reap within the 1 s wedge backstop. Only shutdown() actually drains the list via join_detached_orphans (mod.rs:820–824).

  A parked shielded thread still holds an Arc to the same NetworkShieldedCoordinator and persister context that coord.clear() (mod.rs:670–672) is about to wipe; the doc on lines 663–666 warns exactly about this — "a surviving pass writing into a store we just cleared, desyncing the host's own wipe from a repopulated tree." Because the orphan drain is missing, a sequence of stop()→start() that wedges the prior shielded thread, followed by clear_shielded, lets the wipe proceed while the parked thread is still alive and can persister.store(...) / fire on_shielded_* host callbacks against the cleared store.

  The same defect is in the FFI platform_wallet_manager_shielded_sync_stop bridge: it awaits only quiesce(), so a non-clean orphan goes unreported and the host (which sees ok()) is free to release the callback context that a parked shielded thread still references.

  Fix shape (symmetric with shutdown()): in both clear_shielded and platform_wallet_manager_shielded_sync_stop, after quiesce() returns clean, run a bounded join_detached_orphans(&self.coordinator_orphans, deadline) and fold the result into the precondition — return ShieldedShutdownIncomplete { status } (and leave the store intact) unless both the live drain and the orphan drain are clean. Caveat: coordinator_orphans is shared across all three coordinator kinds; either accept the conservative wait for identity/platform-address orphans before a shielded Clear, or tag parked handles with their coordinator kind so Clear can drain only the shielded subset.
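
The fix shape above amounts to a two-part precondition on the wipe. A minimal sketch under stated assumptions: `ThreadStatus`, `ClearRefused`, and `clear_if_drained` are hypothetical names, a `Vec<u8>` stands in for the commitment-tree store, and the statuses are assumed to have been produced already by the live quiesce and a bounded orphan drain.

```rust
/// Hypothetical, simplified mirror of the per-thread statuses discussed here.
#[derive(Debug, Clone, PartialEq)]
pub enum ThreadStatus {
    NotRunning,
    Ok,
    Timeout,
    Detached,
    Panicked,
}

impl ThreadStatus {
    /// Only a joined or never-started thread counts as clean.
    fn is_clean(&self) -> bool {
        matches!(self, ThreadStatus::NotRunning | ThreadStatus::Ok)
    }
}

/// Hypothetical error returned when the wipe is refused.
#[derive(Debug, PartialEq)]
pub struct ClearRefused {
    pub live: ThreadStatus,
    pub orphans_clean: bool,
}

/// Wipes `store` only if both the live coordinator drain and every parked
/// orphan are clean; otherwise leaves the store untouched.
pub fn clear_if_drained(
    store: &mut Vec<u8>,
    live: ThreadStatus,
    orphans: &[ThreadStatus],
) -> Result<(), ClearRefused> {
    let orphans_clean = orphans.iter().all(ThreadStatus::is_clean);
    if live.is_clean() && orphans_clean {
        store.clear();
        Ok(())
    } else {
        Err(ClearRefused { live, orphans_clean })
    }
}
```

Because the orphan list is shared across coordinator kinds, passing all orphan statuses here corresponds to the conservative option in the caveat; tagging handles by kind would let the caller pass only the shielded subset.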


3 participants