feat(platform-wallet)!: shutdown() joins coordinator threads and returns CoordinatorExitStatus#3954
feat(platform-wallet)!: shutdown() joins coordinator threads and returns CoordinatorExitStatus#3954Claudius-Maginificent wants to merge 17 commits into
Conversation
…rns CoordinatorExitStatus The three periodic sync coordinators (platform-address, identity, shielded) run their `!Send` loops on detached OS threads via `Handle::block_on`. `shutdown()`/`quiesce()` previously only drained the in-flight pass (the `is_syncing` barrier) and never joined the threads, so a consumer that drops the tokio runtime right after `shutdown()` (one-shot / headless / stdio) could race a coordinator still polling `tokio::time` on a shutting-down runtime and panic with "A Tokio 1.x context was found, but it is being shutdown". Each coordinator now stores its OS-thread `JoinHandle`; `quiesce()` joins it (via `spawn_blocking`, after the existing drain) and returns a `CoordinatorThreadStatus` (NotRunning / Ok / Panicked / Error). Joining while the runtime is still alive guarantees the loop has stopped touching `tokio::time` before the host drops the runtime. `shutdown()` aggregates the three into `CoordinatorExitStatus`, so a panicked loop surfaces in the status instead of being silently dropped. JoinHandle-join chosen over a oneshot/Notify signal: `JoinHandle::join` natively distinguishes a clean return from a panic and waits for the actual OS thread to terminate (not just a signal fired mid-teardown), yielding the per-thread status for free. The generation-guard reschedule and quiesce-drain behavior are preserved. BREAKING CHANGE: `PlatformWalletManager::shutdown()` now returns `CoordinatorExitStatus` instead of `()`. FFI: the internal `shutdown()` call logs the new status; the `extern "C"` `platform_wallet_manager_destroy` signature and C ABI are unchanged. <sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
…nnot wedge shutdown
SEC-001: Add `IsSyncingGuard` RAII struct to all three coordinator
`sync_now` (and shielded `sync_wallet`) implementations. The guard
clears `is_syncing=false` on every exit path — normal return, early
return, and panic-unwind — so `quiesce()`'s drain loop can never spin
forever on a panicked pass, and the `Panicked` thread-exit status
becomes reachable.
SEC-002: Wrap each coordinator's `quiesce()` call in `shutdown()` with
`tokio::time::timeout(30 s)`. On timeout the slot reports
`CoordinatorThreadStatus::Error("join timed out")` rather than hanging
forever.
SEC-003: Add `debug_assert!` in `shutdown()` that the current runtime
is `MultiThread`; document the precondition in the method doc.
F-5: In all three coordinators' `start()`, store the `JoinHandle` in
`background_join` while still holding the `background_cancel` lock —
eliminates the theoretical window where a concurrent `quiesce()` could
take a `None` handle because spawn completed before the store.
Rename `CoordinatorThreadExit` → `CoordinatorThreadStatus` with
variants `Ok / NotRunning / Panicked / Error` to match the coordinator
module's existing `super::CoordinatorThreadStatus` references (fixing
the compile break in f3354f6). `join_coordinator_thread`'s
spawn_blocking `Err` arm now maps to `Error` rather than `Panicked`
to distinguish infra failure from thread panic (F-6 documented).
Co-Authored-By: Claudius the Magnificent <noreply@anthropic.com>
<sub>🤖 Co-authored by [Claudius the Magnificent](https://github.com/lklimek/claudius) AI Agent</sub>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR hardens the shutdown lifecycle for all three background sync coordinator managers ( ChangesCoordinator Shutdown Hardening
Sequence Diagram(s)sequenceDiagram
participant FFI as platform_wallet_manager_destroy
participant Manager as PlatformWalletManager::shutdown
participant ISM as IdentitySyncManager::quiesce
participant PASM as PlatformAddressSyncManager::quiesce
participant SSM as ShieldedSyncManager::quiesce
participant Adapter as WalletEventAdapter task
participant Thread as OS Background Thread
FFI->>Manager: shutdown() / multi-thread Tokio
Manager->>ISM: quiesce with timeout
ISM->>ISM: set quiescing, cancel loop
ISM->>ISM: wait is_syncing drain (AtomicFlagGuard)
ISM->>Thread: join_coordinator_thread(background_join)
Thread-->>ISM: CoordinatorThreadStatus
ISM-->>Manager: CoordinatorThreadStatus
Manager->>PASM: quiesce with timeout
PASM-->>Manager: CoordinatorThreadStatus
Manager->>SSM: quiesce with timeout
SSM-->>Manager: CoordinatorThreadStatus
Manager->>Adapter: cancel task + join with timeout
Adapter-->>Manager: CoordinatorThreadStatus
Manager-->>FFI: CoordinatorExitStatus {identity, address, shielded, adapter}
alt all_clean() true
FFI->>FFI: tracing::debug log success
FFI-->>Caller: Ok()
else all_clean() false
FFI->>FFI: tracing::warn with status details
FFI-->>Caller: ErrorShutdownIncomplete
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## v3.1-dev #3954 +/- ##
=============================================
+ Coverage 52.54% 87.18% +34.63%
=============================================
Files 11 2632 +2621
Lines 1707 327563 +325856
=============================================
+ Hits 897 285592 +284695
- Misses 810 41971 +41161
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
packages/rs-platform-wallet/src/manager/identity_sync.rs (1)
418-459: 🩺 Stability & Availability | 🟠 MajorDon't overwrite an unjoined coordinator generation.
stop()removes onlybackground_canceland leaves the previousbackground_joinunjoined. Astop()→start()sequence passes thecancel_guard.is_some()check and overwritesbackground_joinwith a new handle, losing the old OS-thread handle before shutdown can join it. Gate restart on confirming the prior handle has been joined, or ensure every generation's handle is tracked and joined.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/identity_sync.rs` around lines 418 - 459, The start() method can overwrite an existing background_join handle before it has been properly joined, causing resource leaks. Before spawning the new identity-sync thread and storing its handle in background_join, add a check to ensure any existing join handle in background_join has been properly cleaned up or joined first. This can be done either by joining the existing handle before proceeding, or by verifying that background_join is None before allowing the new thread spawn to proceed. This ensures the prior thread's OS handle is not lost and can be properly shutdown.packages/rs-platform-wallet/src/manager/platform_address_sync.rs (1)
218-255: 🩺 Stability & Availability | 🟠 MajorAdd generation guard to prevent thread handle loss after stop/start cycles.
After
stop()cancels the token,background_cancelbecomesNonewhile the old thread keeps running. A subsequentstart()seescancel_guard.is_some() == falseand spawns a new thread, unconditionally overwritingbackground_join. The old thread's join handle is lost, making it impossible to join its cleanup later. Additionally, without a generation counter, the exiting old thread clears the new generation'sbackground_canceltoken as it shuts down, creating a race where the new loop runs but appears stopped tois_running().Both
IdentitySyncManagerandShieldedSyncManageralready implement this pattern: they incrementbackground_generationon eachstart(), passmy_gento the spawned thread, and checkbackground_generation.load() == my_genbefore clearingbackground_cancelon exit. Apply the same approach here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs` around lines 218 - 255, The thread handle for the platform address sync loop can be lost during stop/start cycles because a new thread spawns and unconditionally overwrites the background_join handle before the old thread finishes cleanup. Additionally, the old thread clears the new generation's background_cancel token, creating a race condition. Implement the generation guard pattern already used in IdentitySyncManager and ShieldedSyncManager: add a background_generation atomic counter to the struct, increment it at the start of the start() method, capture the current generation as my_gen before spawning the thread, pass my_gen into the spawned thread closure, and modify the cleanup code (where background_cancel is set to None) to only clear the token if background_generation.load() equals my_gen, ensuring old exiting threads do not interfere with new generations.packages/rs-platform-wallet/src/manager/shielded_sync.rs (1)
245-300: 🩺 Stability & Availability | 🔴 CriticalDon't replace a pending shielded-sync join handle on rapid stop/start cycles.
The
start()method checks onlybackground_cancelto guard against concurrent starts. Whenstop()removes the token frombackground_cancel, a subsequentstart()proceeds to spawn a new thread and overwritesbackground_join, even though the previous generation's thread is still winding down. The generation check (line 290) only prevents the old thread from clearingbackground_cancel—it does not protectbackground_join. This leaves prior-generation handles permanently lost.To fix: either require
quiesce()beforestart()to join all pending generations, or track join handles per-generation and join all pending before spawning a new thread.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs` around lines 245 - 300, The `start()` method only checks `background_cancel` to guard against concurrent starts, but when `stop()` clears this token, a subsequent `start()` can spawn a new thread and overwrite the `background_join` handle before the previous generation's thread has finished cleaning up. The generation check at the cleanup section prevents the old thread from clearing `background_cancel`, but does not protect `background_join` from being overwritten. Fix this by either adding logic to join all pending prior-generation join handles before storing the new one (by tracking handles per-generation), or by ensuring that before assigning a new join handle to `background_join` in the `start()` method, any existing pending handle from a prior generation is properly joined and waited for completion.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 706-710: The while loop waiting for handler_started to become true
has no timeout, which will cause the test to hang indefinitely if the slow
callback never executes. Wrap the entire while loop (that checks
handler_started.load(AO::Acquire)) with tokio::time::timeout() and provide an
appropriate timeout duration, then handle the timeout error case with an
assertion that explicitly fails the test with a useful message. This ensures CI
fails fast with clear feedback instead of timing out.
- Around line 483-489: In the wallet event adapter task join error handling, the
else branch that handles non-panic JoinErrors currently returns
CoordinatorThreadStatus::Ok, which should instead return
CoordinatorThreadStatus::Error with an appropriate error message. Change the
else clause that follows the is_panic() check to map the error to a
CoordinatorThreadStatus::Error variant containing details about the join error,
rather than treating it as a successful completion.
- Around line 175-181: The current implementation uses
tokio::task::spawn_blocking to wrap handle.join(), but this pattern prevents the
timeout from effectively interrupting the blocking task if the coordinator
thread hangs. Replace the spawn_blocking closure approach with explicit polling:
repeatedly check if the JoinHandle is finished using is_finished() in a loop
until the deadline is reached, and only call join() once the handle confirms it
is finished. This ensures the timeout boundary is enforced even if the
coordinator thread misbehaves or fails to clear is_syncing before join() is
called.
---
Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- Around line 418-459: The start() method can overwrite an existing
background_join handle before it has been properly joined, causing resource
leaks. Before spawning the new identity-sync thread and storing its handle in
background_join, add a check to ensure any existing join handle in
background_join has been properly cleaned up or joined first. This can be done
either by joining the existing handle before proceeding, or by verifying that
background_join is None before allowing the new thread spawn to proceed. This
ensures the prior thread's OS handle is not lost and can be properly shutdown.
In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 218-255: The thread handle for the platform address sync loop can
be lost during stop/start cycles because a new thread spawns and unconditionally
overwrites the background_join handle before the old thread finishes cleanup.
Additionally, the old thread clears the new generation's background_cancel
token, creating a race condition. Implement the generation guard pattern already
used in IdentitySyncManager and ShieldedSyncManager: add a background_generation
atomic counter to the struct, increment it at the start of the start() method,
capture the current generation as my_gen before spawning the thread, pass my_gen
into the spawned thread closure, and modify the cleanup code (where
background_cancel is set to None) to only clear the token if
background_generation.load() equals my_gen, ensuring old exiting threads do not
interfere with new generations.
In `@packages/rs-platform-wallet/src/manager/shielded_sync.rs`:
- Around line 245-300: The `start()` method only checks `background_cancel` to
guard against concurrent starts, but when `stop()` clears this token, a
subsequent `start()` can spawn a new thread and overwrite the `background_join`
handle before the previous generation's thread has finished cleaning up. The
generation check at the cleanup section prevents the old thread from clearing
`background_cancel`, but does not protect `background_join` from being
overwritten. Fix this by either adding logic to join all pending
prior-generation join handles before storing the new one (by tracking handles
per-generation), or by ensuring that before assigning a new join handle to
`background_join` in the `start()` method, any existing pending handle from a
prior generation is properly joined and waited for completion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 01cab324-105d-4c5d-afe0-0ceb6faff13e
📒 Files selected for processing (5)
packages/rs-platform-wallet-ffi/src/manager.rspackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rs
Introduces `AtomicFlagGuard`, a pub RAII guard that clears an `AtomicBool` flag to `false` (Release ordering) on drop. The guard does not set the flag on construction — the caller is responsible for doing so (typically via a `compare_exchange`) — preserving the exact semantics of the three identical `IsSyncingGuard` structs that were copy-pasted across the platform-wallet sync coordinators. This is the panic-safety keystone for the quiesce drain loop: if a sync pass panics, the guard's `drop` still clears `is_syncing`, so `quiesce()` is never permanently wedged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…en runtime check
**Task 1 — new enum variants**
Add `Stopped(Option<String>)` (non-panic, non-clean task exit, e.g.
tokio cancel/abort) and `Timeout` (join exceeded
SHUTDOWN_JOIN_TIMEOUT_SECS) to `CoordinatorThreadStatus`.
- Non-panic JoinError on the event-adapter task → `Stopped(Some(...))`,
not the previous `Ok` (wrong: a cancelled task is not a clean exit).
- Timeout on any `quiesce()` wrapper → `Timeout`, not `Error("join
timed out")`.
- `is_clean()` now returns `true` only for `Ok` and `NotRunning`; all
other variants — including the two new ones — are non-clean.
- Update all docs / comments that referenced the old `Error("join timed
out")` wording.
**Task 2 — promote debug_assert to assert**
`shutdown()`'s multi-thread-runtime guard was `debug_assert!`, making
it a no-op in release builds. Changed to `assert!` — this is a real
invariant: `spawn_blocking` deadlocks on a `current_thread` runtime.
**Task 3 — bound the test wait loop**
Wrap the `while !handler_started…` polling in
`shutdown_waits_for_in_flight_pass_to_drain` with a 5 s
`tokio::time::timeout` so a broken test fails fast instead of hanging.
**Task 4 — DRY IsSyncingGuard**
Replace the three identical copy-pasted `IsSyncingGuard` structs in
`identity_sync.rs`, `platform_address_sync.rs`, and `shielded_sync.rs`
with the new `dash_async::AtomicFlagGuard`. Adds `dash-async` to
`rs-platform-wallet/Cargo.toml`. Zero behavioral change: construction
semantics preserved (callers set the flag via `compare_exchange` before
creating the guard; `Drop` clears it with `Ordering::Release`).
**Task 5 — new tests**
- `coordinator_thread_status_clean_predicate`: unit-tests `is_clean()`
for all six variants including the two new ones; no real timeout needed.
- `coordinator_exit_status_all_clean`: tests `all_clean()` with
`Timeout` and `Stopped` slots.
- `event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean`:
aborts the adapter task before `shutdown()` and asserts the result is
`Stopped` (covers the non-panic JoinError path).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
packages/rs-platform-wallet/src/manager/mod.rs (2)
405-408: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winHandle unclean shielded quiesce before clearing state.
quiesce()now returns a meaningful shutdown status. Ignoring it letsclear_shielded()proceed tocoord.clear()afterTimeout,Stopped, orPanicked, which can race a still-running shielded pass that the quiesce barrier was meant to stop.Proposed fix
- self.shielded_sync_manager.quiesce().await; + let status = self.shielded_sync_manager.quiesce().await; + if !status.is_clean() { + return Err(crate::error::PlatformWalletError::ShieldedStoreError( + format!("shielded sync did not stop cleanly before clear: {status:?}"), + )); + } if let Some(coord) = self.shielded_coordinator().await { coord.clear().await?; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 405 - 408, The clear_shielded function ignores the return value from self.shielded_sync_manager.quiesce().await, which now provides meaningful shutdown status information. Instead of ignoring this result, capture the return value and check if the quiesce completed cleanly. If quiesce returns a status indicating Timeout, Stopped, or Panicked, return an appropriate error from clear_shielded rather than continuing to call coord.clear(), which could race with a still-running shielded pass. Only proceed with coord.clear() when quiesce has successfully shut down cleanly.
463-477: 🩺 Stability & Availability | 🟠 MajorWrap
quiescinginAtomicFlagGuardto ensure cancellation-safe reset in all coordinatorquiesce()implementations.The current implementations set
quiescing = truebefore the awaited drain loop and reset it only after. Iftokio::time::timeoutdrops the future during the loop, the reset never executes, permanently wedgingquiescingand blocking all future syncs.All three coordinators (
platform_address_sync.rs:291,identity_sync.rs:494,shielded_sync.rs:334) have identical patterns. Use the sameAtomicFlagGuardapproach already correctly applied tois_syncinginsync_now():pub async fn quiesce(&self) -> super::CoordinatorThreadStatus { self.quiescing.store(true, Ordering::Release); let _quiescing_guard = AtomicFlagGuard::new(&self.quiescing); self.stop(); while self.is_syncing.load(Ordering::Acquire) { tokio::time::sleep(Duration::from_millis(20)).await; } // quiescing.store(false) removed — guard handles reset on all exit paths // ...rest of implementation }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet/src/manager/mod.rs` around lines 463 - 477, The `quiesce()` method implementations in all three coordinators (the ones containing the timeout calls for `platform_address_sync_manager`, `identity_sync_manager`, and `shielded_sync_manager`) don't properly handle cancellation when `tokio::time::timeout` drops the future. Wrap the `quiescing` flag reset in an `AtomicFlagGuard` to ensure it's reset on all exit paths including early cancellation. In each coordinator's `quiesce()` method, after setting `quiescing` to true, immediately create an `AtomicFlagGuard` using `AtomicFlagGuard::new(&self.quiescing)`, and remove any manual `quiescing.store(false)` reset call at the end since the guard will handle it automatically. Use the same pattern already correctly implemented in `sync_now()` for the `is_syncing` flag.
🧹 Nitpick comments (1)
packages/rs-dash-async/src/atomic.rs (1)
8-15: 📐 Maintainability & Code Quality | 🔵 TrivialAdd
#[must_use]annotation toAtomicFlagGuard.The guard is dropped as a temporary if not bound, silently resetting the flag immediately. This breaks the intended guarded scope behavior. Mark the type with
#[must_use]to catch accidental non-binding at compile time.Proposed fix
-pub struct AtomicFlagGuard<'a>(&'a AtomicBool); +#[must_use = "AtomicFlagGuard clears the flag on drop; bind it to keep the flag set for the guarded scope"] +pub struct AtomicFlagGuard<'a>(&'a AtomicBool);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-dash-async/src/atomic.rs` around lines 8 - 15, The AtomicFlagGuard struct can be accidentally dropped without being bound to a variable, causing the flag to be reset immediately and breaking the guarded scope behavior. Add the #[must_use] attribute to the AtomicFlagGuard struct definition to make the compiler warn when the guard is not explicitly bound to a variable, ensuring the developer catches this mistake at compile time.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 671-695: The test
`event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean` is
non-deterministic because it accepts both `CoordinatorThreadStatus::Stopped` and
`CoordinatorThreadStatus::Ok` as valid outcomes. This allows the test to pass
without actually exercising the non-panic JoinError branch. To fix this, replace
the current task abort approach with a task that is guaranteed to be pending and
never complete on its own, such as a task that awaits on a channel or a
never-resolving future. This ensures the abort always triggers the Stopped path
deterministically, and update the assertion to only expect
`CoordinatorThreadStatus::Stopped`.
---
Outside diff comments:
In `@packages/rs-platform-wallet/src/manager/mod.rs`:
- Around line 405-408: The clear_shielded function ignores the return value from
self.shielded_sync_manager.quiesce().await, which now provides meaningful
shutdown status information. Instead of ignoring this result, capture the return
value and check if the quiesce completed cleanly. If quiesce returns a status
indicating Timeout, Stopped, or Panicked, return an appropriate error from
clear_shielded rather than continuing to call coord.clear(), which could race
with a still-running shielded pass. Only proceed with coord.clear() when quiesce
has successfully shut down cleanly.
- Around line 463-477: The `quiesce()` method implementations in all three
coordinators (the ones containing the timeout calls for
`platform_address_sync_manager`, `identity_sync_manager`, and
`shielded_sync_manager`) don't properly handle cancellation when
`tokio::time::timeout` drops the future. Wrap the `quiescing` flag reset in an
`AtomicFlagGuard` to ensure it's reset on all exit paths including early
cancellation. In each coordinator's `quiesce()` method, after setting
`quiescing` to true, immediately create an `AtomicFlagGuard` using
`AtomicFlagGuard::new(&self.quiescing)`, and remove any manual
`quiescing.store(false)` reset call at the end since the guard will handle it
automatically. Use the same pattern already correctly implemented in
`sync_now()` for the `is_syncing` flag.
---
Nitpick comments:
In `@packages/rs-dash-async/src/atomic.rs`:
- Around line 8-15: The AtomicFlagGuard struct can be accidentally dropped
without being bound to a variable, causing the flag to be reset immediately and
breaking the guarded scope behavior. Add the #[must_use] attribute to the
AtomicFlagGuard struct definition to make the compiler warn when the guard is
not explicitly bound to a variable, ensuring the developer catches this mistake
at compile time.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 192ea517-ff8a-4b43-9773-c096391c8a49
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (7)
packages/rs-dash-async/src/atomic.rspackages/rs-dash-async/src/lib.rspackages/rs-platform-wallet/Cargo.tomlpackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/rs-platform-wallet/src/manager/platform_address_sync.rs
- packages/rs-platform-wallet/src/manager/shielded_sync.rs
|
✅ Review complete (commit 3cca1cf) |
RUST-001: tag `AtomicFlagGuard` `#[must_use]` so a stray `let _ = ..` or bare-statement construction (which would drop the guard *immediately* and clear the flag right back) gets caught at compile time instead of silently un-gating the very flag it was meant to hold. PROJ-001: lock the guard's contract down with two tests — flag cleared on a normal drop, and (the load-bearing one) flag cleared while unwinding a panic via `catch_unwind`. Makes the PR-body "dash-async tests" claim true. SEC-003: spell out in the rustdoc that the clear-on-panic guarantee rides on unwinding, so it holds under `panic = "unwind"` but not under the iOS `panic = "abort"` profiles, where a panic aborts before any Drop runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…invariants
SEC-001 (the big one): a `shutdown()` quiesce timed out only because a
stalled in-flight pass pinned `is_syncing`, so the `while is_syncing` drain
never cleared, the quiesce future was dropped *before* the thread join, and
the `!Send` coordinator OS thread was left ALIVE — later firing host
callbacks through freed memory. Root-cause fix: race the pass body against
cancellation inside each coordinator's own loop
tokio::select! {
biased;
_ = cancel.cancelled() => break,
_ = this.sync_now(..) => {}
}
so `stop()`/`quiesce()` cancelling the token drops the stalled `sync_now`
future *on the coordinator thread*, which unwinds to its `is_syncing`
`AtomicFlagGuard` and clears the flag promptly. The drain then frees and the
join lands far inside the timeout — the timeout can no longer strand a live
thread. Invariants preserved: the guard is constructed before any `.await`
so a cancel-drop always clears `is_syncing`; the completion-event dispatch
is the synchronous tail after the last `.await`, so it either runs in full
(then clears) or is skipped on cancel — never torn; idempotency and the
drain barrier are untouched. The inter-pass sleep was already cancel-raced.
MEDIUM-4 (RUST-002): RAII-guard `quiescing` in all three `quiesce()` via
`AtomicFlagGuard`, dropping the manual `store(false)`. A timed-out quiesce
no longer latches the gate `true` and silently bails every future pass.
Reopening on drop is safe because `stop()` already cancelled the loop.
MEDIUM-3 (SEC-005/CALL-001): give `PlatformAddressSyncManager` the
`background_generation` counter its siblings already have — bump it (AcqRel)
in `start()` and gate the thread-exit `*background_cancel = None` on
`generation == my_gen`, so a stop()+start() reschedule can't have an exiting
thread strip the new generation's token.
SEC-003: swap the `background_cancel`/`background_join` std-Mutex
`.lock().expect("… poisoned")` calls for `.lock().unwrap_or_else(|e|
e.into_inner())` across all three coordinators, so one prior panic can't
cascade into an abort on the teardown path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SEC-002: `clear_shielded()` now wraps its `quiesce()` in the same `SHUTDOWN_JOIN_TIMEOUT_SECS` backstop `shutdown()` uses, so a stalled in-flight pass can't hang Clear forever. The const is now `pub` (and re-exported from the crate root) so the FFI shielded-stop bridge can reuse it; its doc + the `shutdown()` doc now describe it as a backstop and note that cancellation is what makes the drain prompt. SEC-004: bind the event-adapter join handle to a local before the join `.await`, so the `tokio::Mutex` guard (previously a match-scrutinee temporary) isn't held across the up-to-30s join. PROJ-004: drop the lone `tracing::warn!` for the adapter join error inside `shutdown()` — the returned status already carries it and the FFI `destroy` adapter logs the aggregate once, so all four workers are now uniform. RUST-004: rewrite the `shutdown()` `assert!` message (and the matching docs) to name the real constraint — the coordinator OS threads each run `Handle::block_on` and need the multi-thread runtime's timer/IO driver — instead of blaming `spawn_blocking`, which works fine on current_thread. PROJ-006: fix the `all_clean()` rustdoc (Stopped/Timeout/Error also make it false, not just panics). PROJ-003: drop the dangling ephemeral `(F-6)` and `F-2`/`F-3`/`F-7` + `(1)/(2)/(4)/(5)/(6)` markers, replacing with self-describing prose. SEC-003: note the unwind-vs-abort caveat on the `shutdown()` panic-safety guarantee. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SEC-002: `platform_wallet_manager_shielded_sync_stop` blocked on a bare `quiesce()`, so a stalled in-flight pass could hang the host's stop call forever. Wrap the quiesce in `tokio::time::timeout` reusing the library's `SHUTDOWN_JOIN_TIMEOUT_SECS` backstop — same guarantee as `shutdown()`. Cancellation makes the drain prompt; the timeout only matters if a pass's drop wedges. The C signature is unchanged and the result is still discarded (`ok` as before) — we only need the call not to hang. Add `tokio/time` to the crate's direct features rather than leaning on `platform-wallet` pulling it in transitively (the crate now calls `tokio::time::timeout` directly). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Solid lifecycle hardening overall — joining coordinator threads and the RAII is_syncing guard close real races. Three in-scope blockers undermine the new shutdown contract: the FFI destroy still returns success on non-clean shutdown (Swift may free a context a coordinator thread still owns), the bounded timeout doesn't actually bound anything because spawn_blocking tasks are non-abortable, and start-after-stop overwrites the saved JoinHandle so a later shutdown cannot join the stranded thread. Two suggestions and two test-quality nits round out the review.
🔴 3 blocking | 🟡 2 suggestion(s) | 💬 2 nitpick(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet-ffi/src/manager.rs`:
- [BLOCKING] packages/rs-platform-wallet-ffi/src/manager.rs:366-376: Destroy returns ok() on non-clean shutdown — re-opens callback-after-free window
`shutdown()` can now legitimately return `CoordinatorThreadStatus::Timeout` (and `Stopped`/`Panicked`/`Error`) meaning a coordinator's OS thread or the event-adapter task did not actually join. This FFI entry point logs that outcome and still returns `PlatformWalletFFIResult::ok()`. The Swift host (`PlatformWalletManager.swift` deinit) is documented to free the callback `context` once `platform_wallet_manager_destroy` returns; meanwhile the still-alive coordinator thread holds an `Arc<FFIEventHandler>` / `Arc<FFIPersister>` and can fire `on_*_sync_completed` or `persister.store(...)` through the now-dangling `context` pointer. That is precisely the use-after-free this PR set out to close — the previous unbounded wait would have hung instead of returning false success. Surface non-clean shutdowns as a distinct, non-ok result code so the host knows not to free its context (or keep the FFI-owned handler `Arc` alive on the non-clean path, e.g. `mem::forget`, so any lingering callback remains memory-safe).
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:178-192: Timeout doesn't actually bound shutdown — spawn_blocking is non-abortable
`shutdown()` wraps each `quiesce()` in `tokio::time::timeout`, but `join_coordinator_thread` moves the `std::thread::JoinHandle` into `spawn_blocking(move || handle.join())`. Tokio blocking tasks cannot be aborted once started: dropping the outer timeout future stops `await`ing the `JoinHandle` but the underlying blocking task is still parked inside `handle.join()`, keeping the coordinator thread's `Arc<…SyncManager>` (and the host callback contexts it transitively holds) alive. When the caller then drops the multi-thread runtime, `Runtime::drop` returns without waiting for blocking tasks, leaving an OS thread plus a stranded blocking thread alive on a freed runtime — which is the very `"Tokio 1.x context … being shutdown"` race the PR cites in its motivation. Make the join cancellation-aware by polling `handle.is_finished()` (with a short async sleep) before the final `handle.join()`, so dropping the timeout future actually releases all state.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/mod.rs:441-459: assert! on runtime flavor is incorrect — spawn_blocking works on current_thread
The promotion from `debug_assert!` to `assert!` is justified in the docstring by the claim that `spawn_blocking` 'is not available on `current_thread` runtimes and will panic there.' That's not how Tokio works: `spawn_blocking` dispatches to the runtime's shared blocking pool, which both `multi_thread` and `current_thread` runtimes provision; awaiting the returned `JoinHandle` simply yields the runtime task. Today's only in-tree caller (`platform-wallet-ffi/src/runtime.rs:34`) builds a multi-thread runtime, but `shutdown()` is now a public Rust API. Any downstream consumer using `#[tokio::main(flavor = "current_thread")]` (or `Builder::new_current_thread()`) will hit a release-mode panic on a configuration that would have worked. Revert to `debug_assert!` (or drop it entirely) and correct the docstring rationale.
In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/identity_sync.rs:405-448: start-after-stop overwrites background_join, dropping the previous thread handle
`stop()` takes `background_cancel` (leaving it `None`) but never touches `background_join`. A subsequent `start()` passes the `cancel_guard.is_some()` early-return check, spawns a fresh thread, and at line 447 unconditionally overwrites `background_join` with the new handle — the previous handle is dropped (detached). The old coordinator thread is cancellation-bound but not yet exited; it can still be inside its `block_on` polling `tokio::time` and feeding the event manager when the new handle replaces it. A later `shutdown()` then can only join the new thread, so the original thread can outlive `shutdown()` and touch the host's freed callback context — recreating exactly the runtime-drop race this PR is meant to eliminate. The same pattern applies to `platform_address_sync` and `shielded_sync`. Fix by taking-and-joining (or refusing) any non-`None` handle inside the `start()` lock before installing the new one.
In `packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/platform_address_sync.rs:291-304: quiescing gate is reopened before the join completes — sync_now can slip a pass through
`quiesce()` clears `self.quiescing` to `false` *before* taking `background_join` and awaiting `join_coordinator_thread`. The join can block for up to `SHUTDOWN_JOIN_TIMEOUT_SECS` (30s). During that window the loop is cancelled and won't start new passes, but an external caller invoking `sync_now` (e.g. an FFI on-demand sync) finds a fully open gate — `quiescing=false`, `is_syncing=false` — and runs a complete pass, including the `on_platform_address_sync_completed` host callback. That breaks the documented `quiesce()` contract that no further callback can fire after it returns, undermining the manager-level shutdown guarantee. Same pattern in `identity_sync::quiesce` and `shielded_sync::quiesce`. Move `quiescing.store(false, …)` to *after* the join completes (or have `sync_now`/`sync_wallet` also consult `background_join.is_some()`).
|
Update — SEC-002 fixed in 🤖 Co-authored by Claudius the Magnificent AI Agent |
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
The PR materially hardens the wallet manager shutdown lifecycle (cancellable passes, RAII quiescing gate, bounded clear/stop, AtomicFlagGuard consolidation). Prior #4/#5 are fixed in head. Remaining issues center on a single shape: bounded teardown paths (FFI destroy, clear_shielded, FFI shielded stop) translate non-clean / timed-out shutdowns into Success without exposing that to the host, which can still hand a stale OS thread to host-owned callback/context after the host frees them. Start-after-stop still drops the prior background_join. Two test-isolation nitpicks carry forward.
🔴 2 blocking | 🟡 2 suggestion(s) | 💬 1 nitpick(s)
2 additional finding(s) omitted (not in diff).
2 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:417-433: clear_shielded silently swallows quiesce timeout — host can wipe persistence with a pass still in-flight
`clear_shielded` wraps `shielded_sync_manager.quiesce()` in `tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...)`, drops the result with `let _ = ...`, then calls `coord.clear().await` and returns `Ok(())`. The method docstring (lines 402–415) tells callers the quiesce barrier guarantees "nothing can re-persist notes after this returns" and that "the host must not commit its own persistence wipe" only when the coordinator reset fails. On the timeout path that contract is broken: the in-flight pass thread still holds the FFIPersister context and can call `persister.store(...)` *after* `clear_shielded` returned Ok and the host has wiped its SwiftData rows, leaving the host's just-deleted rows re-populated by the trailing pass against the now-cleared shared tree. The intentional discard noted in the comment is fine for diagnostics, but the safety guarantee the docs make requires propagating a distinct error (e.g. `ShieldedStoreError("quiesce timed out")`) so the host knows not to commit the wipe.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/mod.rs:184-198: Timeout cannot abort spawn_blocking — Timeout status can ship while the OS thread is still alive
`join_coordinator_thread` moves the `std::thread::JoinHandle` into `tokio::task::spawn_blocking(move || handle.join())`. Blocking tasks are not abortable: when the outer `tokio::time::timeout` in `shutdown()` / `clear_shielded` / the FFI shielded-stop bridge fires, dropping the awaiter leaves the spawn_blocking task and its synchronous `handle.join()` running, and the only `JoinHandle` for the OS thread is already consumed by the blocking task. The cancellable-pass change makes this rare in practice, but the `CoordinatorThreadStatus::Timeout` contract is still misleading — returning `Timeout` means *we abandoned the wait*, not *the worker stopped*. Hosts that drop the runtime after `Timeout` are back in the runtime-drop-panic race this PR was opened to fix. Either narrow the doc-comment promise to say so explicitly, or wire the join through an abortable indirection (e.g. an `AbortHandle`-backed task that delegates to a child thread) so a true wedge surfaces and can be acted on.
In `packages/rs-platform-wallet-ffi/src/shielded_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:86-107: shielded_sync_stop returns ok() on bounded-quiesce timeout — same UAF pattern as destroy/clear
The new `tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, manager.shielded_sync().quiesce())` correctly prevents the FFI stop call from wedging, but on timeout the future is dropped and the function returns `PlatformWalletFFIResult::ok()` with no signal that the drain didn't complete. The docstring (lines 68–85) explicitly tells hosts the call is a synchronization barrier: on return "the loop is cancelled, no new pass will start, and any in-flight pass has fully drained — its persistence callbacks have completed". That guarantee does not hold on the timeout path; the spawned shielded coordinator thread still holds its `Arc<FFIPersister>` / `Arc<FFIEventHandler>` and can still invoke completion / persister stores against the host context. Hosts using stop as a barrier before unbinding callbacks get the same callback-after-free hazard as destroy. Surface a non-success result (or extend the C ABI with a status out-param) so the host can know to retry or defer.
In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:405-470: start() after cancel-only stop() drops the previous background_join, detaching the old thread from a later shutdown()
`stop()` takes `background_cancel` but leaves `background_join` populated. A subsequent `start()` only checks `cancel_guard.is_some()`, proceeds, spawns a new OS thread, and unconditionally overwrites `*self.background_join.lock() = Some(join)` at lines 465–468 — silently dropping the previous `JoinHandle` and detaching the prior thread. The `background_generation` counter prevents the exiting thread from stripping the new cancel token, but does not preserve join ownership. The same pattern lives in `platform_address_sync.rs` and `shielded_sync.rs`. With the new cancellable passes the old thread exits its `block_on` quickly, so the race window is small under `panic = "unwind"`. Under iOS `panic = "abort"`, a host doing `stop()` → `start()` → `shutdown()` → `drop(runtime)` can still hit the runtime-drop "being shutdown" panic this PR is closing for the start→shutdown path, because the older detached thread is invisible to the later `shutdown()`. Either gate `start()` on `background_join.is_some()` (await/quiesce the prior handle first) or take both slots atomically in `stop()`.
Replace the spawn_blocking-based join in join_coordinator_thread with an is_finished() poll loop that awaits a 5ms sleep each step. spawn_blocking tasks cannot be cancelled once started, so the prior approach left the blocking join alive past the tokio::time::timeout wrapping quiesce() — defeating the timeout boundary. Polling yields at each .await so the external timeout is truly binding (threads are confirmed-exited or the caller times out). Each coordinator's start() now drains any handle left by a prior stop() (is_finished spin-wait, 1s bound) before overwriting background_join, so a stop()->start() reschedule can no longer detach a live, untracked thread that shutdown() would miss. FFI platform_wallet_manager_destroy now returns the new ErrorShutdownIncomplete (19) when shutdown is not all-clean, signalling the host must not immediately free the callback context — a lingering coordinator may still fire one final callback. The C ABI is unchanged (additive enum variant + degraded-path return code). Tests: deterministic Stopped path via spawn(pending).abort() -> asserts Stopped(_) and !is_clean(); race test uses per-iteration catch_unwind instead of a process-global panic hook. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/rs-platform-wallet-ffi/src/manager.rs (1)
366-385: 🔒 Security & Privacy | 🔴 CriticalHost ignores
ErrorShutdownIncompletereturn code, re-opening use-after-free window.The Rust code correctly returns
ErrorShutdownIncompleteinstead ofok()on non-clean shutdown to signal the host should delay freeing the callbackcontext. However, the SwiftdeinitatPlatformWalletManager.swift:158callsplatform_wallet_manager_destroy(handle).discard()— explicitly discarding the return value with no branching logic. This means the host unconditionally proceeds with cleanup regardless of whether the error code is returned, re-opening the exact use-after-free this code intends to prevent: a lingering coordinator thread holding anArcto the event handler while the host-owned context pointer is freed.The safety guarantee must be enforced on the host side before this fix is effective.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/rs-platform-wallet-ffi/src/manager.rs` around lines 366 - 385, The Rust code in the shutdown function correctly returns ErrorShutdownIncomplete when coordinators do not exit cleanly, but the host side (Swift code) ignores this return value by calling discard() without checking the result. To enforce the safety guarantee on the Rust side, ensure that even if the host ignores the return code, the Rust code prevents use-after-free by maintaining ownership of critical resources (such as the event handler Arc) until all coordinator threads are guaranteed to have fully exited, rather than relying solely on the host respecting the ErrorShutdownIncomplete signal. Consider adding an additional safety mechanism within the shutdown logic to keep the callback context alive on the Rust side until true cleanup is complete.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-platform-wallet-ffi/src/error.rs`:
- Around line 128-135: The Swift PlatformWalletResult enum is missing the
errorShutdownIncomplete variant that was added to the Rust FFI. Add the case
errorShutdownIncomplete = 19 to the PlatformWalletResultCode enum in the correct
position between the existing variants. Then update the init(ffi:) initializer's
switch statement to add a matching case that maps the FFI result code
PLATFORM_WALLET_FFI_RESULT_CODE_ERROR_SHUTDOWN_INCOMPLETE to the
.errorShutdownIncomplete case, ensuring the returned error code is properly
recognized instead of falling through to the default unknown error handler.
---
Outside diff comments:
In `@packages/rs-platform-wallet-ffi/src/manager.rs`:
- Around line 366-385: The Rust code in the shutdown function correctly returns
ErrorShutdownIncomplete when coordinators do not exit cleanly, but the host side
(Swift code) ignores this return value by calling discard() without checking the
result. To enforce the safety guarantee on the Rust side, ensure that even if
the host ignores the return code, the Rust code prevents use-after-free by
maintaining ownership of critical resources (such as the event handler Arc)
until all coordinator threads are guaranteed to have fully exited, rather than
relying solely on the host respecting the ErrorShutdownIncomplete signal.
Consider adding an additional safety mechanism within the shutdown logic to keep
the callback context alive on the Rust side until true cleanup is complete.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 52a4e212-979b-48ca-b4ad-cbca213672ee
📒 Files selected for processing (6)
packages/rs-platform-wallet-ffi/src/error.rspackages/rs-platform-wallet-ffi/src/manager.rspackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rs
🚧 Files skipped from review as they are similar to previous changes (4)
- packages/rs-platform-wallet/src/manager/identity_sync.rs
- packages/rs-platform-wallet/src/manager/shielded_sync.rs
- packages/rs-platform-wallet/src/manager/mod.rs
- packages/rs-platform-wallet/src/manager/platform_address_sync.rs
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Incremental review for platform-wallet shutdown hardening at head 747f5f0. The latest delta is a merge from v3.1-dev (dashmate DKG safety + Swift KeychainManager) and does not modify any of the platform-wallet shutdown/quiesce/FFI files; therefore all seven prior findings carry forward unchanged. The PR description advertises post-review hardening (ErrorShutdownIncomplete=19 result code, is_finished() poll-join, start() reaping prior background_join, deterministic Stopped test, RAII/chained panic hook) that is not present in the source — destroy still unconditionally returns ok(), join_coordinator_thread still uses uncancellable spawn_blocking, clear_shielded and shielded_sync_stop still discard the bounded-quiesce result, start() still overwrites background_join, and both regression tests retain the prior weaknesses. Three blocking issues remain: the FFI destroy / clear_shielded / shielded_sync_stop paths all report success while the coordinator OS thread may still be alive, leaving a callback-after-free window across the C/Swift boundary that the PR was specifically intended to close.
Reconciliation
- Prior review at
93b89546: all 7 prior findings are STILL VALID at747f5f00. - New latest-delta findings from
93b89546..747f5f00: none; the delta is a merge fromv3.1-devand does not touch the wallet shutdown/FFI files. - CodeRabbit inline findings: 0.
Carried-forward prior findings
-
🔴 [BLOCKING] destroy returns ok() on non-clean shutdown — callback-after-free window across the FFI boundary
packages/rs-platform-wallet-ffi/src/manager.rs:351-377- platform_wallet_manager_destroy removes the manager handle, awaits manager.shutdown(), branches only to log a warning for !status.all_clean() (lines 367–374), and unconditionally returns PlatformWalletFFIResult::ok() at line 376. The inline comment at lines 363–365 acknowledges this: 'the C ABI exposes none of that, so we just log it … and drop it.' After this PR, shutdown() can legitimately return Timeout, Stopped, Panicked, or Error — meaning a coordinator OS thread or the event-adapter task may still be alive and still able to invoke FFIPersister/FFIEventHandler callbacks through the host-owned
*const c_void contextpointer. Hosts (e.g. dash-evo-tool, the Swift example app) routinely free that callback context after destroy returns ok(); a lingering coordinator firing one final persister.store or on_* callback then writes to freed memory. This is the exact UAF pathway the PR sets out to close. The PR description states destroy now returns a new PlatformWalletFFIResultCode::ErrorShutdownIncomplete (19); no such variant exists in the codebase (grep confirms). Either add and propagate the ErrorShutdownIncomplete code on !status.all_clean() so the host can defer freeing its callback context, or retract the breaking-change claim and document loudly that ok() does not imply the OS thread has exited on the degraded path.
-
🔴 [BLOCKING] clear_shielded silently swallows quiesce timeout/non-clean status before clear()
packages/rs-platform-wallet/src/manager/mod.rs:417-433- clear_shielded wraps shielded_sync_manager.quiesce() in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) but discards the result with
let _ = …, then unconditionally calls coord.clear().await and returns Ok(()). The method's own doc-comment makes the quiesce barrier the load-bearing safety mechanism that lets the host commit its persistence wipe. On Timeout (or any non-clean CoordinatorThreadStatus the rewritten quiesce can now return) the in-flight shielded pass is still capable of holding the coordinator/persister handle and writing into the very store that clear() is about to wipe — the wipe can then be silently re-populated by the surviving pass, defeating the wipe and violating the contract the FFI consumer (platform_wallet_manager_shielded_clear) relies on. Inspect the timeout result and propagate a typed PlatformWalletError on Elapsed and on !status.is_clean(), so callers do not commit their own persistence wipe after a partial drain.
-
🔴 [BLOCKING] shielded_sync_stop returns ok() on bounded-quiesce timeout — same UAF window as destroy/clear
packages/rs-platform-wallet-ffi/src/shielded_sync.rs:86-107- platform_wallet_manager_shielded_sync_stop wraps manager.shielded_sync().quiesce() in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) and discards the result with
let _ = …, then returns PlatformWalletFFIResult::ok(). The function's own docstring promises that on return 'the loop is cancelled, no new pass will start, and any in-flight pass has fully drained'; on a Timeout (or any non-clean CoordinatorThreadStatus) that promise is broken, but the host is told the stop succeeded. Hosts that bump generation counters or release callback state on success then race the still-running coordinator thread — the same UAF pattern as destroy and clear, just on a non-teardown path that long-running apps hit far more often. Surface non-clean / Elapsed via a distinct FFI result code so the host can defer teardown / state reset until the drain actually completes.
-
🔴 [BLOCKING] join_coordinator_thread uses uncancellable spawn_blocking — Timeout status can ship while the !Send OS thread is still alive
packages/rs-platform-wallet/src/manager/mod.rs:184-198- join_coordinator_thread moves the std::thread::JoinHandle into tokio::task::spawn_blocking(move || handle.join()).await at line 190. spawn_blocking tasks are not abortable: when the outer tokio::time::timeout in shutdown() fires Elapsed, dropping the await handle does not cancel the inner blocking job and does not signal the OS thread. The slot is reported as CoordinatorThreadStatus::Timeout but the underlying coordinator thread is still inside Handle::block_on, still touching tokio::time, and still able to invoke host callbacks. Combined with finding 1 (destroy returning ok()), this is the residual runtime-drop / callback-after-free window the PR was sold as closing — the degraded path leaves it wide open. The PR description claims this was rewritten to poll JoinHandle::is_finished(); grep finds no is_finished() use in rs-platform-wallet at all. Either implement the poll loop (
tokio::time::sleep(small_dt); if handle.is_finished() { return handle.join() … }) so the outer timeout actually binds, or document explicitly that Timeout means the OS thread may still be alive and the runtime must not be dropped — and reflect that in the FFI return codes per finding 1.
-
🟡 [SUGGESTION] start() after cancel-only stop() drops the previous background_join, detaching the old thread from a later shutdown()
packages/rs-platform-wallet/src/manager/identity_sync.rs:405-470- stop() (lines 480–489) takes background_cancel but never touches background_join, and the loop body's epilogue (lines 452–458) only clears background_cancel when the generation matches. A subsequent start() guards only on cancel_guard.is_some() (line 410); on a quick stop()→start() sequence cancel_guard is None, start() proceeds, spawns a new OS thread, and at lines 465–468 unconditionally writes the new JoinHandle into self.background_join, dropping (detaching) the prior, still-live JoinHandle for a loop that may still be winding down through its last pass / sleep wakeup. A subsequent shutdown() only joins the newest handle; the older thread is no longer reachable through any join barrier and can outlive shutdown(), holding the persister/event-handler context after the FFI told the host destroy/shutdown was clean. The same overwrite pattern lives in platform_address_sync.rs and shielded_sync.rs::start. The PR description claims start() now reaps the prior handle first; it does not. Either move the join slot under the same lock and reap-on-takeover (e.g. take the prior background_join and join_coordinator_thread it before installing a new one — would require start() to become async), or have stop()/the loop epilogue clear background_join too so an unjoined leftover handle cannot exist.
-
🔵 [NITPICK] event_adapter Stopped-path test still accepts Ok, leaving the new JoinError→Stopped mapping unverified on the abort race
packages/rs-platform-wallet/src/manager/mod.rs:712-743- event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean aborts the adapter task then accepts CoordinatorThreadStatus::Stopped() | CoordinatorThreadStatus::Ok (lines 731–737) with an inline comment noting 'abort() races task completion'. Because the adapter is the standard make_manager() sink (no events queued), the task can trivially drain before the 10 ms sleep elapses, in which case the assertion passes via the Ok arm and the new Ok(Err()) ⇒ Stopped(...) mapping in shutdown() is never actually exercised. A regression collapsing that arm back to Ok would not be caught. To match the PR description's claim that this path is now deterministic, replace the adapter handle with one running std::future::pending::<()>().await before aborting, and assert exactly Stopped(_) and !status.event_adapter.is_clean().
-
🔵 [NITPICK] shutdown_then_drop_runtime installs a process-global panic hook without chaining or RAII restore
packages/rs-platform-wallet/src/manager/mod.rs:874-931- std::panic::set_hook (line 878) replaces the process-wide panic hook with a closure that only increments SHUTDOWN_PANICS on messages containing 'being shutdown' and never forwards to prev_hook. The original hook is restored only after the 10-iteration loop completes (line 925); a panic anywhere in the loop body leaves the global hook replaced for the rest of the test process and silently suppresses unrelated panic diagnostics from sibling tests in the same
cargo testbinary. The PR description claims this is now per-iteration std::panic::catch_unwind; the code still uses set_hook. Fixes: chain prev_hook(info) for messages that don't match the filter, and restore the previous hook in an RAII guard (scopeguard / explicit struct with Drop) so a mid-loop panic cannot leave the global state mutated.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
- [BLOCKING] packages/rs-platform-wallet-ffi/src/manager.rs:351-377: destroy returns ok() on non-clean shutdown — callback-after-free window across the FFI boundary
platform_wallet_manager_destroy removes the manager handle, awaits manager.shutdown(), branches only to log a warning for !status.all_clean() (lines 367–374), and unconditionally returns PlatformWalletFFIResult::ok() at line 376. The inline comment at lines 363–365 acknowledges this: 'the C ABI exposes none of that, so we just log it … and drop it.' After this PR, shutdown() can legitimately return Timeout, Stopped, Panicked, or Error — meaning a coordinator OS thread or the event-adapter task may still be alive and still able to invoke FFIPersister/FFIEventHandler callbacks through the host-owned `*const c_void context` pointer. Hosts (e.g. dash-evo-tool, the Swift example app) routinely free that callback context after destroy returns ok(); a lingering coordinator firing one final persister.store or on_* callback then writes to freed memory. This is the exact UAF pathway the PR sets out to close. The PR description states destroy now returns a new PlatformWalletFFIResultCode::ErrorShutdownIncomplete (19); no such variant exists in the codebase (grep confirms). Either add and propagate the ErrorShutdownIncomplete code on !status.all_clean() so the host can defer freeing its callback context, or retract the breaking-change claim and document loudly that ok() does not imply the OS thread has exited on the degraded path.
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:417-433: clear_shielded silently swallows quiesce timeout/non-clean status before clear()
clear_shielded wraps shielded_sync_manager.quiesce() in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) but discards the result with `let _ = …`, then unconditionally calls coord.clear().await and returns Ok(()). The method's own doc-comment makes the quiesce barrier the load-bearing safety mechanism that lets the host commit its persistence wipe. On Timeout (or any non-clean CoordinatorThreadStatus the rewritten quiesce can now return) the in-flight shielded pass is still capable of holding the coordinator/persister handle and writing into the very store that clear() is about to wipe — the wipe can then be silently re-populated by the surviving pass, defeating the wipe and violating the contract the FFI consumer (platform_wallet_manager_shielded_clear) relies on. Inspect the timeout result and propagate a typed PlatformWalletError on Elapsed and on !status.is_clean(), so callers do not commit their own persistence wipe after a partial drain.
- [BLOCKING] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:86-107: shielded_sync_stop returns ok() on bounded-quiesce timeout — same UAF window as destroy/clear
platform_wallet_manager_shielded_sync_stop wraps manager.shielded_sync().quiesce() in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) and discards the result with `let _ = …`, then returns PlatformWalletFFIResult::ok(). The function's own docstring promises that on return 'the loop is cancelled, no new pass will start, and any in-flight pass has fully drained'; on a Timeout (or any non-clean CoordinatorThreadStatus) that promise is broken, but the host is told the stop succeeded. Hosts that bump generation counters or release callback state on success then race the still-running coordinator thread — the same UAF pattern as destroy and clear, just on a non-teardown path that long-running apps hit far more often. Surface non-clean / Elapsed via a distinct FFI result code so the host can defer teardown / state reset until the drain actually completes.
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:184-198: join_coordinator_thread uses uncancellable spawn_blocking — Timeout status can ship while the !Send OS thread is still alive
join_coordinator_thread moves the std::thread::JoinHandle into tokio::task::spawn_blocking(move || handle.join()).await at line 190. spawn_blocking tasks are not abortable: when the outer tokio::time::timeout in shutdown() fires Elapsed, dropping the await handle does not cancel the inner blocking job and does not signal the OS thread. The slot is reported as CoordinatorThreadStatus::Timeout but the underlying coordinator thread is still inside Handle::block_on, still touching tokio::time, and still able to invoke host callbacks. Combined with finding 1 (destroy returning ok()), this is the residual runtime-drop / callback-after-free window the PR was sold as closing — the degraded path leaves it wide open. The PR description claims this was rewritten to poll JoinHandle::is_finished(); grep finds no is_finished() use in rs-platform-wallet at all. Either implement the poll loop (`tokio::time::sleep(small_dt); if handle.is_finished() { return handle.join() … }`) so the outer timeout actually binds, or document explicitly that Timeout means the OS thread may still be alive and the runtime must not be dropped — and reflect that in the FFI return codes per finding 1.
- [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:405-470: start() after cancel-only stop() drops the previous background_join, detaching the old thread from a later shutdown()
stop() (lines 480–489) takes background_cancel but never touches background_join, and the loop body's epilogue (lines 452–458) only clears background_cancel when the generation matches. A subsequent start() guards only on cancel_guard.is_some() (line 410); on a quick stop()→start() sequence cancel_guard is None, start() proceeds, spawns a new OS thread, and at lines 465–468 unconditionally writes the new JoinHandle into self.background_join, dropping (detaching) the prior, still-live JoinHandle for a loop that may still be winding down through its last pass / sleep wakeup. A subsequent shutdown() only joins the newest handle; the older thread is no longer reachable through any join barrier and can outlive shutdown(), holding the persister/event-handler context after the FFI told the host destroy/shutdown was clean. The same overwrite pattern lives in platform_address_sync.rs and shielded_sync.rs::start. The PR description claims start() now reaps the prior handle first; it does not. Either move the join slot under the same lock and reap-on-takeover (e.g. take the prior background_join and join_coordinator_thread it before installing a new one — would require start() to become async), or have stop()/the loop epilogue clear background_join too so an unjoined leftover handle cannot exist.
- [NITPICK] packages/rs-platform-wallet/src/manager/mod.rs:712-743: event_adapter Stopped-path test still accepts Ok, leaving the new JoinError→Stopped mapping unverified on the abort race
event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean aborts the adapter task then accepts CoordinatorThreadStatus::Stopped(_) | CoordinatorThreadStatus::Ok (lines 731–737) with an inline comment noting 'abort() races task completion'. Because the adapter is the standard make_manager() sink (no events queued), the task can trivially drain before the 10 ms sleep elapses, in which case the assertion passes via the Ok arm and the new Ok(Err(_)) ⇒ Stopped(...) mapping in shutdown() is never actually exercised. A regression collapsing that arm back to Ok would not be caught. To match the PR description's claim that this path is now deterministic, replace the adapter handle with one running std::future::pending::<()>().await before aborting, and assert exactly Stopped(_) and !status.event_adapter.is_clean().
- [NITPICK] packages/rs-platform-wallet/src/manager/mod.rs:874-931: shutdown_then_drop_runtime installs a process-global panic hook without chaining or RAII restore
std::panic::set_hook (line 878) replaces the process-wide panic hook with a closure that only increments SHUTDOWN_PANICS on messages containing 'being shutdown' and never forwards to prev_hook. The original hook is restored only after the 10-iteration loop completes (line 925); a panic anywhere in the loop body leaves the global hook replaced for the rest of the test process and silently suppresses unrelated panic diagnostics from sibling tests in the same `cargo test` binary. The PR description claims this is now per-iteration std::panic::catch_unwind; the code still uses set_hook. Fixes: chain prev_hook(info) for messages that don't match the filter, and restore the previous hook in an RAII guard (scopeguard / explicit struct with Drop) so a mid-loop panic cannot leave the global state mutated.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Current head fixes prior-1, prior-4, prior-6, and prior-7, but prior-2 and prior-3 remain: the shielded Clear/Stop paths still discard the bounded quiesce result and report success after a failed drain barrier. Prior-5 is only partially fixed; the overwrite path was addressed, but the new start-time drain can still detach the old coordinator thread, and the new FFI shutdown-incomplete code is not safely handled by the bundled Swift wrapper.
🔴 4 blocking
1 additional finding(s) omitted (not in diff).
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:438-444: clear_shielded still clears after a failed quiesce barrier
Carried forward from prior-2. `clear_shielded()` wraps `shielded_sync_manager.quiesce()` in the shutdown timeout but discards both `Elapsed` and non-clean `CoordinatorThreadStatus`, then immediately calls `coord.clear().await`. The comments above this method describe quiesce as the barrier that prevents an in-flight shielded pass from re-persisting notes after Clear. If the timeout fires while the pass is still draining, this method resets shared shielded state and returns `Ok(())` while the old pass can still write through the same coordinator/persister path. Surface the non-clean status and refuse to clear unless quiesce completed cleanly.
In `packages/rs-platform-wallet-ffi/src/shielded_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet-ffi/src/shielded_sync.rs:98-106: shielded_sync_stop returns success when bounded quiesce fails
Carried forward from prior-3. The FFI stop function documents that, once it returns, no in-flight shielded pass remains and persistence callbacks have completed. The implementation still discards the `timeout(..., manager.shielded_sync().quiesce())` result and always returns `PlatformWalletFFIResult::ok()`. On timeout or another non-clean status, Swift/C callers are told it is safe to free or mutate callback and persistence context even though Rust did not establish the promised drain barrier. This should return a non-success result, preferably `ErrorShutdownIncomplete`, when quiesce does not complete cleanly.
In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/identity_sync.rs:421-443: start-time drain can still detach the old coordinator thread
Carried forward from prior-5 in a narrowed form. The new drain takes the previous `background_join`, waits up to one second, and drops the handle if the thread is not finished; dropping a `std::thread::JoinHandle` detaches the still-running thread. Worse, `start()` holds `background_cancel` while waiting, and the exiting thread needs that same mutex in its cleanup before it can finish, so a normal stop -> start race can force the one-second path without an external wedge. After this detach, the next `shutdown()` can only join the newly stored handle and can report clean while the old callback-capable coordinator is still alive. The same pattern is present in `platform_address_sync.rs` and `shielded_sync.rs`.
In `packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift`:
- [BLOCKING] packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift:153-158: Swift deinit discards the new shutdown-incomplete signal
The Rust FFI now returns `ErrorShutdownIncomplete = 19` when `destroy` cannot prove coordinator threads exited, with an explicit contract that the host must not free callback context immediately. Swift retains the `PlatformWalletPersistenceHandler` and `PlatformWalletEventHandler` only as fields on `PlatformWalletManager`, passes them to Rust with `Unmanaged.passUnretained`, and then calls `platform_wallet_manager_destroy(handle).discard()` in `deinit`. If Rust reports shutdown incomplete, ARC still releases those handlers as deinit completes, leaving lingering Rust coordinator callbacks with dangling context pointers. The Swift result mirror also has no case for code 19, so even non-deinit callers cannot distinguish this lifecycle-specific failure from `errorUnknown`.
Extend the destroy UAF-surfacing discipline (which already returns ErrorShutdownIncomplete=19 on a non-clean shutdown) to the shielded clear/stop paths, so a partial/timed-out coordinator drain can no longer be silently swallowed. - clear_shielded now captures the quiesce result instead of discarding it: on a timed-out or non-clean drain it returns the new typed PlatformWalletError::ShieldedShutdownIncomplete (carrying the terminal CoordinatorThreadStatus) and leaves the commitment-tree store INTACT, rather than unconditionally wiping a store an in-flight pass may still write into. The store is wiped only on a clean drain. - FFI shielded_sync_stop now returns ErrorShutdownIncomplete (with the status rendered into the message) on a non-clean/timed-out drain, instead of always returning ok() — symmetric with destroy. A timeout is reported as the Timeout status. - FFI shielded_clear maps the new ShieldedShutdownIncomplete variant to ErrorShutdownIncomplete (store-reset failures still map to ErrorWalletOperation); the blanket From<PlatformWalletError> gains the same arm, pinned by a unit test. - Swift mirror gains errorShutdownIncomplete=19 plus a richer PlatformWalletError.shutdownIncomplete case, wired through both the init(ffi:) and init(result:) switches. - Re-export CoordinatorThreadStatus / CoordinatorExitStatus from the crate root so the FFI can name the status type. BREAKING CHANGE: clear_shielded / shielded_sync_stop / shielded_clear now report a non-clean coordinator drain instead of succeeding silently; hosts must defer freeing their callback context and must not commit their own persistence wipe on ErrorShutdownIncomplete. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
…d_cancel lock All three coordinators (identity_sync, platform_address_sync, shielded_sync) reaped the prior loop's OS thread inside start() WHILE holding background_cancel. But the exiting prior thread's epilogue also locks background_cancel to clear its slot, so a tight stop()→start() deadlocked the reap: the prior thread blocked on the lock start() held, never finished, and the is_finished() spin-wait burned the full 1 s deadline then DETACHED the handle — a 1 s stall plus a transient untracked thread, on the exact stop()→start() path the reap exists for. Reorder start() to install the new cancel token + bump the generation under the lock, then drop(cancel_guard) to release background_cancel, and only THEN run the spin-wait + join. The prior thread's epilogue now acquires the lock (or, for shielded, observes the bumped generation), skips clearing the freshly-installed token, and returns, so is_finished() trips in milliseconds and the join is near-instant. start() stays synchronous; the 1 s deadline remains only as a genuine-wedge backstop. Adds restart_after_stop_reaps_prior_thread regression tests to the identity and platform-address coordinators: start → (stop+start back-to-back) → assert the restart returns well under the 1 s deadline. Verified non-vacuous — against the old lock-held ordering it stalls ~1.0 s and fails. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
|
Re: the latest
Verification on 🤖 Co-authored by Claudius the Magnificent AI Agent |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/rs-platform-wallet/src/manager/platform_address_sync.rs`:
- Around line 658-660: The mgr.quiesce().await call can hang indefinitely if the
restarted loop wedges, causing CI to stall. Wrap the quiesce().await call in a
timeout to prevent hanging, and assert that the timeout succeeds and the
returned status indicates a clean shutdown. This ensures the test fails fast if
the manager fails to quiesce properly rather than hanging the CI pipeline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b16b3350-f53b-4300-9c5f-2cb19cb4566a
📒 Files selected for processing (9)
packages/rs-platform-wallet-ffi/src/error.rspackages/rs-platform-wallet-ffi/src/shielded_sync.rspackages/rs-platform-wallet/src/error.rspackages/rs-platform-wallet/src/lib.rspackages/rs-platform-wallet/src/manager/identity_sync.rspackages/rs-platform-wallet/src/manager/mod.rspackages/rs-platform-wallet/src/manager/platform_address_sync.rspackages/rs-platform-wallet/src/manager/shielded_sync.rspackages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletResult.swift
🚧 Files skipped from review as they are similar to previous changes (3)
- packages/rs-platform-wallet/src/lib.rs
- packages/rs-platform-wallet/src/manager/identity_sync.rs
- packages/rs-platform-wallet/src/manager/shielded_sync.rs
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Latest delta resolves 2 of 4 prior blockers (clear_shielded and shielded_sync_stop now surface ErrorShutdownIncomplete) and the deterministic half of the start-time detach (cancel_guard released before reap; new regression tests in identity_sync.rs and platform_address_sync.rs assert <500 ms reap). Two carried-forward concerns remain: (1) Swift deinit still .discard()s the new shutdownIncomplete result and lets ARC free callback contexts the FFI just told it not to (convergent, all 6 agents, blocking); (2) start()'s 1 s wedge-detach backstop remains and can silently desync shutdown()'s clean-status from a still-live prior coordinator thread (downgraded to suggestion: deterministic path fixed, intentional liveness tradeoff documented in code). One additional test-coverage gap: the new restart_after_stop_reaps_prior_thread regression test was not mirrored into shielded_sync.rs.
🔴 1 blocking | 🟡 2 suggestion(s)
1 additional finding(s) omitted (not in diff).
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift`:
- [BLOCKING] packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift:153-160: deinit discards ErrorShutdownIncomplete and lets ARC free callback contexts Rust still references
Rust's `platform_wallet_manager_shielded_sync_stop` (rs-platform-wallet-ffi/src/shielded_sync.rs:124-132) and `platform_wallet_manager_destroy` (rs-platform-wallet-ffi/src/manager.rs:367-385) now return `ErrorShutdownIncomplete = 19` on a non-clean drain, with an explicit FFI contract: "host must not free the callback context immediately — a lingering pass may still fire one final callback through it." `PlatformWalletResultCode.errorShutdownIncomplete` and `PlatformWalletError.shutdownIncomplete` were added to mirror this, but the lifecycle-critical caller here ignores the signal:
```swift
deinit {
progressPollTask?.cancel()
if handle != NULL_HANDLE {
platform_wallet_manager_platform_address_sync_stop(handle).discard()
platform_wallet_manager_shielded_sync_stop(handle).discard()
platform_wallet_manager_destroy(handle).discard()
}
}
.discard() only frees the FFI result message; it never inspects the code. persistenceHandler and eventHandler are stored only as fields on PlatformWalletManager and are passed to Rust with Unmanaged.passUnretained(...).toOpaque(). The instant deinit returns, ARC releases both, and any lingering coordinator that returned Timeout will dereference freed memory on its next callback — the exact UAF the new return code was introduced to prevent.
The fix needs deinit to observe code 19 and keep both handlers alive past the destroy (e.g. Unmanaged.passRetained(...) to leak them deliberately when Rust reports incomplete shutdown, or hand them to a detached holder keyed off the handle until a follow-up signal confirms drain). Storing the result and acting on it — rather than .discard() — is the minimum behavior change.
In packages/rs-platform-wallet/src/manager/identity_sync.rs:
-
[SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517: Wedge-backstop detach in start() leaves a still-live coordinator invisible to shutdown()
The deterministic stop()→start() race that forced the detach is fixed in the current head:cancel_guardis dropped before the reap (identity_sync.rs:488-494), the cancellabletokio::select!in the loop body lands cancellation promptly, andrestart_after_stop_reaps_prior_threadasserts the reap completes in <500 ms. The 1 s backstop at lines 502-517 (and the identical pattern in platform_address_sync.rs:~317 and shielded_sync.rs:343-358) still exists, however, and on a genuine wedge (e.g. a sync_now future whoseDropnever yields) it dropshand detaches the still-running OS thread. Becausestart()has already installed the new handle inbackground_join, a latershutdown()joins only the new thread, reportsall_clean() == true, anddestroyreturnsok()— at which point the FFI contract permits the host to free the callback context that the detached prior thread still holds via Arc'd FFI wrappers.The failure mode is narrow (requires a non-yielding wedge) and the comments document it as a deliberate liveness/safety tradeoff. But the rest of this PR's lifecycle guarantees rely on
shutdown()'s status honestly reflecting whether any coordinator could still call back. Two options that close it cleanly: (i) track detached handles in a per-manager orphans list thatshutdown()polls and reports as a non-cleanTimeout/Detached, or (ii) drop the deadline once cancellation is signalled (cancellable select! makes a real-world stall vanishingly rare).
In packages/rs-platform-wallet/src/manager/shielded_sync.rs:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/shielded_sync.rs:325-358: shielded_sync.rs missing the restart_after_stop_reaps_prior_thread regression test
The latest delta addsrestart_after_stop_reaps_prior_threadto identity_sync.rs and platform_address_sync.rs, pinning the reap-after-drop(cancel_guard)ordering at <500 ms. The same start() restructuring was applied to ShieldedSyncManager (shielded_sync.rs:325-358) but no equivalent test was added there, leaving the shielded path's stop()→start() ordering unverified. A future refactor of shielded_sync.rs that accidentally moved the reap back insidecancel_guard's lifetime would only be caught by the two siblings; the shielded path would silently regress to the lock-held 1 s detach pattern. Mirroring the existing regression test in shielded_sync.rs (gated behind#[cfg(feature = "shielded")]if needed) would pin the invariant across all three coordinators.
</details>
Three shielded-sync hardening fixes, bringing it in line with its identity-sync and platform-address-sync siblings. - shielded_sync.rs exit epilogue read `background_generation` BEFORE acquiring `background_cancel` (load-then-lock). That stale-read TOCTOU let a prior thread observe a pre-bump generation, block on the lock until a concurrent start() released it, then null the freshly-installed token — leaving the new loop running but untracked via is_running()/stop(). Acquire the lock first and compare the generation under it, exactly like the siblings. - Add the `restart_after_stop_reaps_prior_thread` regression test the siblings already carry. It pins the reap-after-drop(cancel_guard) reorder: a back-to-back stop()+start() must reap the prior OS thread in <500 ms, not stall ~1 s on the detach backstop. Confirmed non-vacuous — it fails at ~1.0 s with the reap moved back inside the lock. - platform-wallet-ffi: the ErrorShutdownIncomplete doc only described destroy. It is now also returned by shielded_sync_stop and shielded_clear, where the manager is NOT torn down and the operation can be retried. Document all three callers and their differing retry semantics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
PlatformWalletManager.deinit called .discard() on shielded_sync_stop and destroy, tossing the result code. Rust now returns ErrorShutdownIncomplete (19) on a non-clean drain, with the contract: a lingering coordinator thread may still fire one final callback through the host-owned callback context. But persistenceHandler/eventHandler are handed to Rust via Unmanaged.passUnretained and kept alive only by this object's fields, so the instant deinit returns ARC frees them — a use-after-free on that final callback. Capture the code via a new discardReturningCode() helper; on ErrorShutdownIncomplete from shielded_sync_stop OR destroy, deliberately leak one extra strong reference (an unbalanced passRetained, never released) to each handler so it outlives any lingering thread. A clean shutdown — the common case — takes neither branch and releases the handlers normally; we never leak unconditionally. UNVERIFIED locally: no Swift toolchain / xcframework on this host. Reasoned correct-by-construction; needs an iOS-environment build to confirm. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
|
Swift 🤖 Co-authored by Claudius the Magnificent AI Agent |
…on tests Wrap the cleanup `mgr.quiesce().await` in all three `restart_after_stop_reaps_prior_thread` tests with a 2-second `tokio::time::timeout`. An unbounded quiesce after the restarted loop would hang CI forever if the loop wedges; now it fails fast with a clear message. Also assert `status.is_clean()` on the returned `CoordinatorThreadStatus`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
The latest delta resolves prior-1 (Swift deinit at PlatformWalletManager.swift:153-195 now captures both shielded_sync_stop and destroy result codes via discardReturningCode() and leaks persistenceHandler/eventHandler via unbalanced passRetained on .errorShutdownIncomplete) and prior-3 (restart_after_stop_reaps_prior_thread now exists in shielded_sync.rs alongside its identity/platform-address siblings). Prior-2 — the 1 s wedge-backstop detach in each coordinator's start() — remains in the tree on the deliberate-wedge path; all six agents converge on it as a suggestion, narrow but conceptually opposed to the PR's central honest-status contract. No new blocking issues introduced.
🟡 1 suggestion(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/identity_sync.rs`:
- [SUGGESTION] packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517: start()'s 1 s wedge-backstop detaches a still-live coordinator that shutdown()/destroy then reports as clean
Carried forward from prior review (still valid; no reviewer-repeat required by the verifier protocol).
On a genuine wedge — a `sync_now` future whose `Drop` never yields, or a non-yielding step inside the SDK — the prior thread cannot reach its exit epilogue within the 1 s deadline at identity_sync.rs:502-517 (and the symmetric patterns at platform_address_sync.rs:317-332 and shielded_sync.rs:350-365). `start()` then drops `h`, detaching the still-running OS thread; `background_join` already holds the new generation's handle, so a later `quiesce()`/`shutdown()` joins only the new thread, `CoordinatorExitStatus::all_clean()` returns true, and `platform_wallet_manager_destroy` returns `ok()`.
That directly defeats the central FFI contract this PR introduces. The Swift deinit at PlatformWalletManager.swift:190-194 only retains `persistenceHandler` / `eventHandler` when it observes `.errorShutdownIncomplete`; a falsely-clean code lets ARC free both immediately, and the detached prior thread — which still owns Arc'd Rust callback wrappers carrying the `passUnretained(self).toOpaque()` raw pointers (PlatformWalletPersistenceHandler.swift:993-996, PlatformWalletManagerAddressSync.swift:47-50) — can fire one last callback into freed memory. The same hazard exists on the timeout paths of `clear_shielded` and the FFI `platform_wallet_manager_shielded_sync_stop`, where an outer `tokio::time::timeout` firing while the inner quiesce is parked inside `join_coordinator_thread(handle).await` drops the local handle at the .await — detach.
The cancellable `tokio::select!` in each loop body and the prompt `drop(cancel_guard)` ordering (identity_sync.rs:488-494, mirrored in the other two) make this vanishingly rare in practice — `restart_after_stop_reaps_prior_thread` now pins the common-case reap at <500 ms across all three coordinators. But the residual path is the one bypass of the PR's honest-status guarantee. Two clean closures: (a) track detached handles in a per-manager orphans list that `shutdown()` polls and surfaces as `Timeout`/`Detached`, so `destroy` returns `ErrorShutdownIncomplete` and the host keeps the callback context alive; or (b) once cancellation has been signalled, drop the 1 s deadline entirely — the cancellable select! makes a real-world stall essentially impossible, and turning a degraded shutdown into a hang is arguably less dangerous than the silent UAF window the detach can open.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Latest delta (5017ba1..b491773) is test-only: it bounds the post-restart cleanup quiesce() in three restart-reap regression tests with a 2 s tokio::time::timeout and asserts is_clean(). No production code changed and no new defects in the delta. The prior finding about start()'s 1 s wedge-backstop silently detaching a still-live coordinator is STILL VALID on the current head for all three coordinators and is carried forward as the sole in-scope finding.
Carried-forward prior findings
-
[suggestion] 1 s wedge-backstop in start() detaches a live prior coordinator that shutdown()/destroy then reports as clean
packages/rs-platform-wallet/src/manager/identity_sync.rs:502-517Verified against b491773. On a genuine wedge — a sync_now future whose Drop never yields, or a non-yielding step inside the SDK — the prior coordinator thread cannot reach its exit epilogue within the 1 s deadline. The loop at lines 504–513 then
breaks and dropsh, which detaches the still-running OS thread (std::thread::JoinHandle's Drop does not join).background_joinnow holds only the replacement handle installed at lines 483–486, so a laterquiesce()(lines 576–581) joins only the replacement andCoordinatorExitStatus::all_clean()returns true. FFIplatform_wallet_manager_destroy(packages/rs-platform-wallet-ffi/src/manager.rs:366–386) then returnsok()instead ofErrorShutdownIncomplete=19. SwiftPlatformWalletManager.deinit(packages/swift-sdk/Sources/SwiftDashSDK/PlatformWallet/PlatformWalletManager.swift:153–195) does NOT take the retain-leak branch and ARC releases the persister/event-handler wrappers, even though the detached orphan thread still holds Arc/Arc referencing the freedpassUnretainedcontext — exactly the UAF class this PR exists to close, but now on the stop()→start()→destroy() degraded path the new restart bookkeeping creates.The biased cancel-select introduced at lines 446–461 narrows the common-case window (cancel beats sync_now's next .await), but it cannot abort an SDK step or future Drop that is itself non-yielding — which is precisely when the 1 s backstop fires. Same pattern exists in platform_address_sync.rs:317–332 and shielded_sync.rs:350–365; none of the three record the detach event into state observable by shutdown()/CoordinatorExitStatus. The new restart-reap tests only exercise the prompt-reap (<500 ms) path, not a genuine wedge past 1 s.
Minimum FFI-correct fix options (none fit an inline patch): (a) track detached orphans in a Mutex<Vec<JoinHandle<()>>> the next quiesce()/shutdown() polls and surfaces as Timeout/non-clean status; (b) have start() flip a per-coordinator atomic on detach that quiesce() ORs into its returned CoordinatorThreadStatus; (c) have start() return an error rather than silently install a replacement when the prior couldn't be reaped, so the FFI host can escalate. The current shape — silently dropping the orphan — is the one outcome that defeats the PR's central invariant that destroy()'s result code tells the host whether it is safe to free callback contexts.
…) reports them as non-clean Closes the residual use-after-free window left by the coordinator reap backstop. On a tight stop()->start(), each sync coordinator waits ~1s for the prior OS thread to finish; if that thread is genuinely wedged in a non-yielding Drop, the backstop previously DROPPED the still-live JoinHandle (detaching it). A later shutdown() joined only the current handle, all_clean() returned true, and the FFI destroy returned ok() — at which point the host could free the callback context the detached, still-running thread might still touch. Fix (review option i): the manager now owns a shared CoordinatorOrphans list (Arc<Mutex<Vec<JoinHandle>>>) cloned into every coordinator. The duplicated reap blocks in identity/platform-address/shielded start() are consolidated into reap_prior_or_park(), which PARKS a wedged prior thread in that list instead of dropping it (lock-ordering preserved: drop(cancel_guard) still happens before the reap). shutdown() then drains the list via join_detached_orphans() within a bounded, yielding is_finished() poll and reports a new CoordinatorThreadStatus::Detached (non-clean) in CoordinatorExitStatus::detached_threads when any orphan is still alive at the grace deadline. all_clean() folds it in, so the FFI destroy correctly returns ErrorShutdownIncomplete and the host delays freeing its context. The new Detached variant re-exports through lib.rs with its sibling statuses. Tests (manager/mod.rs): reap_prior_or_park parks a force-wedged thread; join_detached_orphans reports Detached then Ok; and a manager shutdown() with a parked still-live orphan reports non-clean. All proven non-vacuous by neutering the park/join. Cleanup quiesce/join in tests is bounded; a wedged stand-in thread is released and joined so none leak. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Prior finding (start()/wedge-backstop detaching a live coordinator) is FIXED at head 76c8bee via reap_prior_or_park + CoordinatorOrphans + join_detached_orphans, with Detached folded into all_clean(). However, two related lifecycle gaps in the same orphan-tracking design remain: (1) quiesce() takes the JoinHandle out of background_join before awaiting join_coordinator_thread, so a wrapping tokio::time::timeout (used by shutdown(), clear_shielded(), and the FFI shielded_sync_stop bridge) drops the live handle and detaches the thread without parking it; on the explicit retry path the next call sees background_join == None and reports NotRunning/clean while a wedged thread may still call host callbacks. (2) clear_shielded() consults only the current shielded quiesce status and wipes the commitment-tree store even when an earlier wedged shielded coordinator is still alive in coordinator_orphans, breaking the PR's 'Clear leaves the store intact whenever a sync drain is incomplete' invariant. Both are in scope for the PR's stated FFI lifecycle/UAF hardening.
🔴 2 blocking
1 additional finding(s) omitted (not in diff).
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:240-261: Timed-out quiesce drops the JoinHandle and silently detaches a live coordinator
join_coordinator_thread takes the JoinHandle by value. Every caller of quiesce() — shutdown(), clear_shielded(), and the FFI platform_wallet_manager_shielded_sync_stop — wraps it in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...). Each coordinator's quiesce() first synchronously takes background_join (identity_sync.rs:579, shielded_sync.rs:423, platform_address_sync.rs analogous) and only then awaits this helper. If the outer timeout fires while this loop is in tokio::time::sleep, the future is dropped, the locally-owned handle is dropped with it, and the still-live OS thread is detached — not parked in coordinator_orphans. The FFI/host contract is explicitly retry-capable here (Timeout → ErrorShutdownIncomplete → host should be able to retry stop/destroy), but a retry now sees background_join == None and join_coordinator_thread returns NotRunning, which is_clean() treats as clean. The host then frees the persister / event-handler context while a wedged thread still holds Arcs to it and may still fire one final callback — the exact UAF class the orphan list was added to close. The fix is to keep the handle reachable across timeout/cancel: either don't take() from background_join until is_finished() is true, or have join_coordinator_thread re-park the unjoined handle into a manager-scoped orphans list on Drop / on the deadline (mirroring reap_prior_or_park's park-on-wedge contract).
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the store while parked coordinator orphans may still be alive
clear_shielded() gates the store wipe on `self.shielded_sync_manager.quiesce()` returning a clean status, but a tight stop()→start() sequence can park a previously-wedged shielded coordinator thread into coordinator_orphans (reap_prior_or_park, mod.rs:296-332). Those parked threads hold Arc references to the same shielded coordinator / persister context the clear() below is about to wipe, and shutdown() is currently the only path that drains them via join_detached_orphans. As soon as the current quiesce reports clean, this path calls coord.clear() while a parked orphan may still be inside its sync_now drop / persister fan-out, violating the PR's stated invariant that Clear leaves the store intact whenever a sync drain is incomplete and re-creating the store-desync the orphan tracking was added to surface. Mirror the shutdown() path: after quiesce, drain coordinator_orphans (with SHUTDOWN_ORPHAN_GRACE_SECS) and refuse the wipe with ShieldedShutdownIncomplete if any orphan is still Detached/Panicked.
| pub(crate) async fn join_coordinator_thread( | ||
| handle: Option<std::thread::JoinHandle<()>>, | ||
| ) -> CoordinatorThreadStatus { | ||
| let Some(handle) = handle else { | ||
| return CoordinatorThreadStatus::NotRunning; | ||
| }; | ||
| // Poll until the thread exits. The coordinator was already cancelled | ||
| // (stop() fires before quiesce() calls us), so is_finished() becomes | ||
| // true nearly immediately — typically within a single 5 ms step. | ||
| loop { | ||
| if handle.is_finished() { | ||
| return match handle.join() { | ||
| Ok(()) => CoordinatorThreadStatus::Ok, | ||
| Err(payload) => CoordinatorThreadStatus::Panicked(panic_message(payload)), | ||
| }; | ||
| } | ||
| // Yield to the executor so the outer tokio::time::timeout wrapping | ||
| // quiesce() can fire if the deadline has passed. Without this yield | ||
| // the loop would busy-spin and block the task. | ||
| tokio::time::sleep(std::time::Duration::from_millis(5)).await; | ||
| } | ||
| } |
There was a problem hiding this comment.
🔴 Blocking: Timed-out quiesce drops the JoinHandle and silently detaches a live coordinator
join_coordinator_thread takes the JoinHandle by value. Every caller of quiesce() — shutdown(), clear_shielded(), and the FFI platform_wallet_manager_shielded_sync_stop — wraps it in tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, ...). Each coordinator's quiesce() first synchronously takes background_join (identity_sync.rs:579, shielded_sync.rs:423, platform_address_sync.rs analogous) and only then awaits this helper. If the outer timeout fires while this loop is in tokio::time::sleep, the future is dropped, the locally-owned handle is dropped with it, and the still-live OS thread is detached — not parked in coordinator_orphans. The FFI/host contract is explicitly retry-capable here (Timeout → ErrorShutdownIncomplete → host should be able to retry stop/destroy), but a retry now sees background_join == None and join_coordinator_thread returns NotRunning, which is_clean() treats as clean. The host then frees the persister / event-handler context while a wedged thread still holds Arcs to it and may still fire one final callback — the exact UAF class the orphan list was added to close. The fix is to keep the handle reachable across timeout/cancel: either don't take() from background_join until is_finished() is true, or have join_coordinator_thread re-park the unjoined handle into a manager-scoped orphans list on Drop / on the deadline (mirroring reap_prior_or_park's park-on-wedge contract).
source: ['codex']
There was a problem hiding this comment.
Confirmed — both findings are still open at HEAD 3cca1cf833. Thanks for the catch.
F1 (this thread): join_coordinator_thread takes the JoinHandle by value, and quiesce() take()s it out of background_join before awaiting. If the wrapping tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS, …) fires during the helper's 5 ms poll-sleep, the future is dropped and the owned handle drops with it — the OS thread is silently detached and never re-parked into coordinator_orphans. On retry background_join == None → NotRunning → is_clean(), so destroy can free the context while the thread is still live. Confirmed UAF window.
F2 (review body): clear_shielded() (mod.rs:644-674) gates the store wipe on the current quiesce() being clean but never drains coordinator_orphans, so a prior-generation shielded thread parked by reap_prior_or_park can still hold Arc refs to the persister that clear() is about to wipe. Confirmed.
Both will be addressed before this PR merges — leaving this thread open until the fix lands.
🤖 Co-authored by Claudius the Magnificent AI Agent
…via tokio::join! The three periodic coordinators (platform-address, identity, shielded) were quiesced sequentially in shutdown(), making the worst-case wait additive (~3 x SHUTDOWN_JOIN_TIMEOUT_SECS). Each quiesce() touches only its own state (its quiescing/is_syncing atomics and its own background_cancel/background_join mutexes) and joins its own OS thread, sharing no lock, so racing them is sound. Drain them concurrently via tokio::join!, collapsing the worst case to ~max(timeouts). Each join! arm keeps its OWN inner tokio::time::timeout, so every coordinator still yields its own per-coordinator CoordinatorThreadStatus (a single outer timeout would flatten all three to Timeout). The event adapter teardown and join_detached_orphans stay sequential and ordered strictly AFTER the coordinator join!, since the adapter sinks the coordinators' stores. The multi-thread runtime assert is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_0157yd3YvWeyckhfQivS9gf7
|
@thepastaclaw please review |
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Incremental review against head 3cca1cf. The new tokio::join! concurrent drain in shutdown() is structurally sound — each coordinator's quiesce() touches only its own quiescing/is_syncing atomics and background_cancel/background_join mutexes, and the shared coordinator_orphans Mutex is not taken inside quiesce(), so racing the three drains is safe; per-coordinator inner timeouts correctly preserve per-thread CoordinatorThreadStatus that a single outer timeout would have flattened. However, both prior blocking findings are unchanged on this head: a timed-out quiesce still loses the only JoinHandle and silently detaches a live coordinator OS thread, and clear_shielded still wipes the shielded store without consulting coordinator_orphans. Both were unanimously re-flagged by all six agent reviews and both are required for this PR's stated lifetime/UAF guarantees to actually hold.
🔴 1 blocking
1 additional finding(s) omitted (not in diff).
1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-platform-wallet/src/manager/mod.rs`:
- [BLOCKING] packages/rs-platform-wallet/src/manager/mod.rs:644-674: clear_shielded wipes the shielded store without draining parked coordinator orphans
clear_shielded (mod.rs:644–674) gates coord.clear().await on shielded_sync_manager.quiesce() returning is_clean(), and the doc on SHUTDOWN_JOIN_TIMEOUT_SECS (mod.rs:407–408) explicitly lists clear_shielded and the FFI shielded-stop bridge as using the same backstop. But neither path inspects self.coordinator_orphans. Shielded coordinator start() parks a prior wedged shielded OS thread into that shared list via reap_prior_or_park (mod.rs:296–332) whenever a tight stop()→start() cannot reap within the 1 s wedge backstop. Only shutdown() actually drains the list via join_detached_orphans (mod.rs:820–824).
A parked shielded thread still holds an Arc to the same NetworkShieldedCoordinator and persister context that coord.clear() (mod.rs:670–672) is about to wipe; the doc on lines 663–666 warns exactly about this — "a surviving pass writing into a store we just cleared, desyncing the host's own wipe from a repopulated tree." Because the orphan drain is missing, a sequence of stop()→start() that wedges the prior shielded thread, followed by clear_shielded, lets the wipe proceed while the parked thread is still alive and can persister.store(...) / fire on_shielded_* host callbacks against the cleared store.
The same defect is in the FFI platform_wallet_manager_shielded_sync_stop bridge: it awaits only quiesce(), so a non-clean orphan goes unreported and the host (which sees ok()) is free to release the callback context that a parked shielded thread still references.
Fix shape (symmetric with shutdown()): in both clear_shielded and platform_wallet_manager_shielded_sync_stop, after quiesce() returns clean, run a bounded join_detached_orphans(&self.coordinator_orphans, deadline) and fold the result into the precondition — return ShieldedShutdownIncomplete { status } (and leave the store intact) unless both the live drain and the orphan drain are clean. Caveat: coordinator_orphans is shared across all three coordinator kinds; either accept the conservative wait for identity/platform-address orphans before a shielded Clear, or tag parked handles with their coordinator kind so Clear can drain only the shielded subset.
Why this PR exists
PlatformWalletManager::shutdown()never joined the sync-coordinator OS threads. The coordinators (identity-sync, platform-address-sync, shielded-sync) run on detachedstd::threadOS threads because the SDK futures are!Send(Handle::block_on); theirJoinHandles were dropped at spawn, andshutdown()only drained viaquiesce()(awhile is_syncingbarrier) without joining the threads or returning per-thread status.shutdown()(dash-evo-tool's CLI one-shot / headless / stdio paths) races coordinator threads still pollingtokio::timeon a shutting-down runtime → panicA Tokio 1.x context was found, but it is being shutdown.DET's only deterministic workaround wasstd::process::exit(3 sites), which skips graceful teardown and any pending coordinator persister writes.v3.1-dev. Requested by dash-evo-tool PR build(dashmate): fix deb package release #864.What was done?
shutdown()now joins the coordinator OS threads and returns a per-threadCoordinatorExitStatus:JoinHandle(under thebackground_cancellock, so a concurrentquiesce()can't miss it). After the existing quiesce drain (gate → cancel token →while is_syncing),quiesce()takes the handle and joins it by polling [JoinHandle::is_finished()] (5 ms steps,.awaiting each step so the executor stays free) — joining while the runtime is still alive is what closes the panic race.JoinHandle::join()is a happens-before barrier that orders the thread's fullblock_on+ cleanup teardown beforeshutdown()returns. Polling rather thanspawn_blocking(|| handle.join())keeps the join cancellable: aspawn_blockingtask can't be cancelled once started, so it would outlive the bounded timeout and leave the!Sendthread running; the poll yields at every step so thetokio::time::timeoutwrappingquiesce()is truly binding.shutdown(&self) -> CoordinatorExitStatus { platform_address_sync, identity_sync, shielded_sync: Option<_>, event_adapter }— oneCoordinatorThreadStatusper worker (shielded_syncisNonein builds without theshieldedfeature;event_adapteris the wallet-eventtokiotask, not an OS thread). EachCoordinatorThreadStatus:Ok— joined and returned normallyStopped(Option<String>)— joined, but did not return normally for a non-panic reason (e.g. a tokio task cancel/abort surfaces as a non-panicJoinError); carries the reason when available. (A non-panic join error is no longer mis-reported asOk.)Panicked(String)— the coordinator panicked (not a silent drop)Timeout— the bounded join exceededSHUTDOWN_JOIN_TIMEOUT_SECS(30 s); reported distinctly rather than folded into a generic errorNotRunning— never started / already joinedError(String)— an infrastructural join failure that is not a timeoutis_clean()/all_clean()treat onlyOkandNotRunningas clean; every other variant is non-clean.Dropguard resets theis_syncingflag, so a panic insidesync_nowcan't leave thewhile is_syncingdrain spinning forever — which would otherwise hangshutdown()and makePanickedunreachable. This guard is now a single reusableAtomicFlagGuardRAII helper inrs-dash-async, replacing three byte-identical copies across the sync coordinators (DRY).tokio::time::timeout(SHUTDOWN_JOIN_TIMEOUT_SECS= 30 s) wraps eachquiesce()+join soshutdown()can't wedge indefinitely (elapsing →Timeout).shutdown()asserts (a normal runtime check, active in release builds) that it runs on a multi-thread runtime — each coordinator's OS thread drives its loop viaHandle::block_onand needs the runtime's timer/IO driver; acurrent_threadruntime can only service oneblock_onat a time and would deadlock. The existingbackground_generationreschedule guard is preserved.Scope — Rust + FFI-internal only. The FFI
platform_wallet_manager_destroyadapts to the new return type internally: it logs any non-clean exit and returns the newErrorShutdownIncomplete(19) result code so the host can defer freeing its callback context (a coordinator that timed out may still fire one final callback). Function signatures / C ABI are unchanged (the new code is an additive enum variant). Swift gains one additive enum case —PlatformWalletResultCode.errorShutdownIncomplete = 19plus aPlatformWalletError.shutdownIncompletemapping — mirroring the Rust code. The SDK'sPlatformWalletManager.deinitnow acts on it: it captures the code fromshielded_sync_stop/destroyand, only onErrorShutdownIncomplete, deliberately retains the persistence/event callback handlers (an unbalancedUnmanaged.passRetained) so a lingering coordinator can't fire into a freed context — a clean shutdown releases them normally. (The Swift changes are not compiled in this CI environment — no Swift toolchain / xcframework on the build host — so they are verified by construction; an iOS build is required before merge. The Rust + FFI changes are fully built/tested.) Once landed, DET drops its 3std::process::exitstopgaps and its probabilistic 100 ms grace sleep, uses normal runtime drop, and honors thedestroyreturn code (tracked DET-side).Post-review hardening
A grumpy-review pass (CodeRabbit + an internal review) drove a round of degraded-path hardening, all Rust + FFI-internal:
select! { biased; _ = cancel.cancelled() => break, _ = sync_now() => {} }) on the coordinator's own thread, sostop()/quiesce()aborts a stalled SDK await directly. This shrinks the common-case drain to near-instant so the bounded join lands well inside the timeout.is_finished()instead ofspawn_blocking(|| handle.join()), so the boundedtokio::time::timeoutactually interrupts a wedged join — aspawn_blockingtask is uncancellable and would outlive the timeout, leaving the!Sendthread alive pastshutdown()(a use-after-free / runtime-drop-panic hazard on the degraded path).start()now reaps any handle left by a priorstop()— joining it outside thebackground_cancellock (after installing the new token), since the exiting thread's own epilogue also locksbackground_cancel; this stops astop()→start()reschedule from detaching a live, untracked thread thatshutdown()would miss (and removes a 1 s reap stall on that path). And the FFIdestroyreturnsErrorShutdownIncompleteon a non-clean exit so the host won't free a context a lingering thread may still touch.start()'s 1 s reap backstop ever fires against a coordinator genuinely wedged in a non-yieldingDrop, the still-liveJoinHandleis now parked in a shared per-manager orphans list (the duplicated reap is consolidated into onereap_prior_or_park) instead of dropped.shutdown()drains the orphans via a bounded, yieldingis_finished()poll (join_detached_orphans) and reports any survivor as the new non-cleanCoordinatorThreadStatus::Detached(surfaced inCoordinatorExitStatus.detached_threads, folded intoall_clean()), sodestroyreturnsErrorShutdownIncompleterather than a false clean on that path too.shutdown()drains the three independent coordinators (platform_address,identity,shielded) concurrently viatokio::join!— each keeping its own innertokio::time::timeoutso per-coordinator status survives — collapsing the pathological worst-case teardown from the sum of the per-coordinator timeouts (~90 s) to their max (~30 s). The three drains share no lock (each touches only its own per-manager state; thecoordinator_orphansmutex is untouched byquiesce()), and the event-adapter cancel+join andjoin_detached_orphansstay strictly ordered after thejoin!.destroy,clear_shieldednow returns a typedPlatformWalletError::ShieldedShutdownIncomplete { status }and leaves the store intact when the shieldedquiesce()times out or ends non-cleanly (so the host never wipes over a still-draining pass), and the FFIshielded_sync_stop/shielded_clearbridges returnErrorShutdownIncompletein that case — symmetric withdestroy. All three teardown paths share theSHUTDOWN_JOIN_TIMEOUT_SECSbound.PlatformAddressSyncManagergained thebackground_generationguard its siblings already had (prevents a stop()+start() reschedule from detaching a live thread); thequiescinggate is now RAII-guarded likeis_syncing; teardown mutex locks are poison-tolerant.#[must_use]onAtomicFlagGuard, doc/assert-message corrections, uniform failure logging, and a guard temporary no longer held across the join.await.How Has This Been Tested?
cargo test -p platform-wallet --lib→ 217 passed (default); 316 passed with--features shielded. Coverage:shutdown_joins_all_workers_reports_ok_and_is_idempotent,shutdown_without_starting_reports_not_runningshutdown_waits_for_in_flight_pass_to_drain(the drain barrier is exercised; its handler-start wait is now bounded by atokio::time::timeoutso a regression fails fast instead of hanging CI)shutdown_then_drop_runtime_does_not_panic— the race regression, 10 iterations each holding a livetokio::timetimer across the shutdown window, using a per-iterationstd::panic::catch_unwind(not a process-global panic hook, which could swallow sibling tests' diagnostics) to confirm 0 "being shutdown" panicsjoin_coordinator_thread_surfaces_panic,join_coordinator_thread_completes_within_deadline(theis_finished()poll-join completes within a bounded time without busy-spinning)restart_after_stop_reaps_prior_thread(identity-sync, platform-address-sync, and shielded-sync) — astop()→start()reschedule reaps the prior thread promptly (< 500 ms) rather than stalling on the reap backstop; proven non-vacuous against the old lock-held ordering (which stalled 1 s and detached)reap_prior_or_park_parks_wedged_thread,join_detached_orphans_reports_detached_then_ok,shutdown_reports_detached_orphan_as_non_clean— the wedge-backstop orphan path: a force-wedged thread is parked (not dropped) andshutdown()reports it as non-cleanDetachedwith!all_clean(); all three proven non-vacuous by neutering then revertingcoordinator_thread_status_clean_predicate(all six variants),coordinator_exit_status_all_clean,event_adapter_non_panic_join_error_maps_to_stopped_and_is_not_clean(now deterministic — a permanently-pending task is aborted, so the non-panicJoinError→Stoppedpath always fires; assertsStopped(_)and!is_clean()). TheTimeoutmapping is covered via the clean-predicate test, since triggering the real 30 s timeout deterministically is impractical.cargo test -p dash-async→ 4 passed (the 2 pre-existingblock_ontests + 2 newAtomicFlagGuardcontract tests: normal-drop clears the flag, and a drop while unwinding acatch_unwindpanic still clears it).cargo check -p platform-wallet-fficlean — function signatures / C ABI unchanged.cargo clippy -p platform-wallet -p platform-wallet-ffi -p dash-async --all-targets -- -D warningsclean (both feature sets);cargo fmtclean.is_syncingRAII guard repairs the panic-hang and makesPanickedreachable, and that theis_finished()poll-join leaves each coordinator thread either confirmed-exited or explicitlyTimeoutbeforeshutdown()returns.Breaking Changes
PlatformWalletManager::shutdown()return type changed from()toCoordinatorExitStatus(public fields:platform_address_sync,identity_sync,shielded_sync: Option<CoordinatorThreadStatus>,event_adapter,detached_threads), andquiesce()now returnsCoordinatorThreadStatus. Rust consumers must adapt.CoordinatorThreadStatus(public enum) gainedStopped(Option<String>),Timeout, andDetachedvariants — exhaustivematches over it must add arms.AtomicFlagGuardRAII helper exported fromrs-dash-async(additive).platform_wallet_manager_destroy,platform_wallet_manager_shielded_sync_stop, andplatform_wallet_manager_shielded_clearnow return the newPlatformWalletFFIResultCode::ErrorShutdownIncomplete(19) on a non-clean / timed-out drain instead of alwaysok()— a behavioral change for FFI hosts (defer freeing the callback context; the shielded store is left intact). The new result code is an additive enum variant; function signatures / C ABI are unchanged. (Title marked!.)clear_shielded()(Rust,shieldedfeature) now returnsErr(PlatformWalletError::ShieldedShutdownIncomplete { status })and leaves the store intact on a non-clean / timed-out drain (previously it discarded the result and always wiped). New publicPlatformWalletError::ShieldedShutdownIncompletevariant — exhaustivematches must add an arm.PlatformWalletResultCodegainscase errorShutdownIncomplete = 19andPlatformWalletErrorgains.shutdownIncomplete(String).Checklist:
For repository code-owners and collaborators only
🤖 Co-authored by Claudius the Magnificent AI Agent
Summary by CodeRabbit
New Features
Bug Fixes
Tests