wallet: cooperative-yield sync (incremental persist + sentinel) #24

Merged
saroupille merged 1 commit into trilitech:main from saroupille:feat/sync-cooperative-yield on May 10, 2026
Conversation

saroupille (Collaborator) commented May 9, 2026

Problem

tzel-wallet sync scans [scanned, tree_size) sequentially, persisting wallet.json.scanned only at the end. While running it holds an exclusive lock on wallet.json.lock, so every other wallet op (shield, transfer, unshield, balance, sync, …) blocks for the full duration. On a long-running rollup that's 5–45 min wallclock — unacceptable in daemon-driven flows where users want to act now.

This is the second half of a two-PR pair. PR #23 (merged) added cmd_init --at-tree-size N so a fresh wallet skips the historical scan entirely. This PR makes the scan that DOES need to run preemptable so it doesn't starve the rest of the daemon.

Design — cooperative yield via sentinel file

Mechanism

  1. Incremental persist. cmd_rollup_sync scans in batches of --checkpoint-every commits (default DEFAULT_CHECKPOINT_EVERY = 50). Each batch ends in a save_wallet that records w.scanned = batch_end plus newly-discovered notes atomically (temp + fsync + rename + parent-dir fsync — pre-existing invariant, just called more often).
  2. Sentinel check at every checkpoint. After persisting a batch, the sync stat()s <wallet>.yield. Present ⇒ exit 0 cleanly. Next sync resumes from the persisted cursor.
  3. Stale-PID recovery (audit-driven). The daemon writes its PID into the sentinel; the CLI checks /proc/<pid> and unlinks if dead. Mirrors the existing WalletLock discipline. Without this, a daemon crash mid-slow-op would wedge every subsequent sync forever.
  4. Final flush at end-of-loop. Nullifier prune + pending-pool prune run once over the cumulative cm set accumulated across batches.
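A minimal sketch of the batch/sentinel shape described above (the `Wallet` and `save_wallet` stand-ins are illustrative; the real types live in tzel-wallet):

```rust
use std::path::Path;

// Hypothetical stand-ins for the real wallet types, just enough to show
// the checkpoint loop's shape.
struct Wallet {
    scanned: usize,
}

fn save_wallet(_w: &Wallet) {
    // Real impl: temp + fsync + rename + parent-dir fsync (atomic persist).
}

/// Scan [w.scanned, tree_size) in batches of `k`, checkpointing after each
/// batch and yielding cleanly when the sentinel file appears.
/// Returns the cursor the scan stopped at.
fn sync_batches(w: &mut Wallet, tree_size: usize, k: usize, sentinel: &Path) -> usize {
    while w.scanned < tree_size {
        // Clamp the batch end (mirrors the saturating_add + min fix).
        let batch_end = w.scanned.saturating_add(k).min(tree_size);
        // ... fetch + trial-decrypt notes in [w.scanned, batch_end) ...
        w.scanned = batch_end;
        save_wallet(w); // checkpoint: cursor + new notes persisted atomically
        if sentinel.exists() {
            // Sentinel present => exit 0 cleanly; the next sync resumes
            // from the persisted cursor. (The real code also runs the
            // stale-PID check here before deciding to yield.)
            break;
        }
    }
    w.scanned
}

fn main() {
    let mut w = Wallet { scanned: 0 };
    // No sentinel at this path, so the scan runs to tree_size.
    let done = sync_batches(&mut w, 173, 50, Path::new("/nonexistent/wallet.yield"));
    println!("{done}"); // 173
}
```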

The companion change in trilitech/tzel-infra#55 wraps slow-lane CLI ops (shield/send/unshield/withdraw) with a touch <wallet>.yield before admitting the op, RAII-removed afterwards. A long initial sync therefore no longer starves slow-lane ops: the slow lane signals "yield please" via the file, the next checkpoint sees it, the sync exits cleanly, and the slow lane takes the lock.
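The daemon-side guard can be sketched as follows (the real SyncYieldGuard lives in tzel-infra; the name and sentinel-path construction here are illustrative):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Illustrative RAII guard: arming writes the daemon's PID into
/// `<wallet>.yield`; Drop removes it, so the sentinel's lifetime brackets
/// the slow-lane operation even on early return.
struct YieldGuard {
    sentinel: PathBuf,
}

impl YieldGuard {
    fn arm(wallet: &Path) -> std::io::Result<YieldGuard> {
        let sentinel = PathBuf::from(format!("{}.yield", wallet.display()));
        // `<pid>\n` content enables the CLI's stale-PID recovery.
        fs::write(&sentinel, format!("{}\n", std::process::id()))?;
        Ok(YieldGuard { sentinel })
    }
}

impl Drop for YieldGuard {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.sentinel); // best-effort rm
    }
}

fn main() -> std::io::Result<()> {
    let wallet = std::env::temp_dir().join("wallet.json");
    let sentinel = PathBuf::from(format!("{}.yield", wallet.display()));
    {
        let _guard = YieldGuard::arm(&wallet)?;
        assert!(sentinel.exists()); // sync's next checkpoint would yield here
    } // guard dropped => sentinel removed
    assert!(!sentinel.exists());
    Ok(())
}
```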

CLI surface

tzel-wallet sync                          # K=50 default
tzel-wallet sync --checkpoint-every 200   # less overhead, slower preempt
tzel-wallet sync --checkpoint-every 20    # faster preempt, more fsyncs
TZEL_SYNC_CHECKPOINT_EVERY=100 tzel-wallet sync   # operator override

The default of 50 is a reasoned estimate, pending empirical measurement on a live octez-smart-rollup-node; the reasoning is documented inline. Tunable over 10..500, with explicit operator-vs-laptop guidance: raise to 100–200 on slow disks, drop to 20–30 on fast operator boxes that want sub-second preemption.

Considered alternative — shadow-wallet sync

A strategic-audit pass proposed an alternative design where the scan operates on a <wallet>.scan scratch copy with its own lock, and merges back into canonical at every K-commit boundary. Documented in docs/shadow-wallet-design.md (in tzel-infra) for traceability.

Why we picked cooperative-yield over shadow-wallet (per the comparative audit's verdict):

| Failure-class blast radius | Cooperative-yield (this PR) | Shadow-wallet |
| --- | --- | --- |
| Worst-case footgun | Stale sentinel file; recovered by `kill -0` + unlink; loud (sync exits early forever, immediately diagnosable, `rm <wallet>.yield` resolves) | Delta-accumulator clear-before-fsync; silent (lost nullifier observations → spent note looks unspent → double-spend rejected at proving time → opaque UX error) |
| Discovery latency | K commits | K commits (when the periodic-merge variant is chosen) |
| Lock contention | Bounded by K | Bounded by K (when periodic-merge) |
| fsync count on 67k commits | 1340 | 1340 |
| Crash class: stale sentinel | yes (recoverable) | no |
| Crash class: silent observation drift | no | yes (clear-before-rename ordering bug) |

Loud beats silent. Both designs are tactical fixes for the same constraint (file-based wallet + flock). At ~2× LOC for shadow-wallet plus a re-litigation of ADR-002, the migration cost (~2 weeks senior-eng review) buys a "canonical never holds partial scan state" property that no caller currently exercises — while introducing a subtle data-loss class that surfaces only at proving time. Decision: ship cooperative-yield as-is.

Pre-merge audit follow-ups (commits f3a0755, 3edef12)

Independent reviewers caught a real consistency bug missed by the earlier audit rounds: cmd_rollup_sync finalize pinned head_hash for the note slice + nullifiers but called the head-resolving load_pool_balances for pool reads. A slow-lane drain landing between the two reads would silently evict a still-funded pool. Fixed in f3a0755 (load_pool_balances_at_block taking an explicit block_ref) and covered by a regression test that drives two block hashes returning different balances.

A generalising review of the same pattern caught a second instance in cmd_wallet_check (banner pool summary), fixed in 3edef12. The same commit renames the head-resolving wrappers to _at_head so the naming asymmetry (_at_head vs _at_block) makes a future regression loud at code review, and enriches the stale-sentinel recovery warning with the dead daemon's PID for operator triage.
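The pinned-read shape can be sketched with illustrative types (the real helpers live in the wallet's rollup-RPC layer; `BlockRef`, `Rpc`, and the balances returned here are stand-ins):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct BlockRef(u64);

// Toy RPC: head moves when a slow-lane drain lands between two reads.
struct Rpc {
    head: BlockRef,
}

impl Rpc {
    fn resolve_head(&self) -> BlockRef {
        self.head
    }
    // Pinned form: reads at the caller's block, never re-resolving head.
    fn load_pool_balances_at_block(&self, block: BlockRef) -> (BlockRef, u64) {
        (block, if block.0 == 1 { 100 } else { 0 })
    }
    // Head-resolving wrapper, reserved for single-shot callers; the
    // `_at_head` name makes an accidental use in multi-read code loud.
    fn load_pool_balances_at_head(&self) -> (BlockRef, u64) {
        let head = self.resolve_head();
        self.load_pool_balances_at_block(head)
    }
}

fn main() {
    let mut rpc = Rpc { head: BlockRef(1) };
    let pinned = rpc.resolve_head(); // finalize pins head once...
    rpc.head = BlockRef(2); // ...then a drain lands between the two reads.
    let (at, bal) = rpc.load_pool_balances_at_block(pinned);
    assert_eq!((at, bal), (BlockRef(1), 100)); // pinned read: still funded
    let (_, head_bal) = rpc.load_pool_balances_at_head();
    assert_eq!(head_bal, 0); // unpinned read would see the pool drained
    println!("ok");
}
```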

Tests

12 cooperative-yield tests in network_profile_tests (gated on #[serial(sync_checkpoint_hook)] since they share the process-global SYNC_CHECKPOINT_HOOK):

| Test | What it asserts |
| --- | --- |
| `cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present` | Pre-existing sentinel ⇒ exit at first checkpoint; do NOT remove the caller's sentinel |
| `cmd_rollup_sync_resumes_from_checkpointed_cursor` | Sentinel tripped mid-scan ⇒ resume from the last checkpoint; notes pre-yield in run 1, post-yield in run 2 |
| `cmd_rollup_sync_intermediate_wallet_is_always_parseable` | A concurrent reader thread never sees a half-written wallet.json |
| `cmd_rollup_sync_refuses_zero_checkpoint_every` | Validation fires pre-RPC; error message stable |
| `cmd_rollup_sync_respects_custom_checkpoint_every` | K ≠ 50 honoured (locks the runtime-parameter contract) |
| `cmd_rollup_sync_http_error_mid_batch_preserves_last_checkpoint` | 5xx mid-batch ⇒ Err propagates; `scanned` at the last batch boundary; sentinel untouched by the CLI |
| `cmd_rollup_sync_panic_mid_batch_preserves_wallet_json` | Rust panic mid-batch ⇒ atomic-rename invariant holds; reload works |
| `cmd_rollup_sync_watch_with_zero_checkpoint_every_aborts_first_iter` | Watch loop aborts on validation, doesn't sleep |
| `cmd_rollup_sync_recovers_from_stale_yield_sentinel` | Sentinel with a freshly-reaped child PID ⇒ unlink + continue; scan reaches tree_size (uses a `freshly_dead_pid()` helper instead of a hardcoded 999999, robust against kernel.pid_max variation on container hosts) |
| `cmd_rollup_sync_watch_loop_body_yields_twice_then_finishes` | Three-iteration yield-then-resume mirroring `cmd_rollup_sync_watch`'s loop body: the sentinel persists across iterations 1+2 and is removed before iteration 3. Locks the property that recovered notes from earlier iterations re-feed `known_cms` on the finalize iteration via `w.notes`, so a regression that moved `seen_cms` to a process-local accumulator (or stopped folding `w.notes` into `known_cms`) would surface |
| `cmd_rollup_sync_treats_legacy_sentinel_content_as_live` | Non-PID content (`b"yield"`) ⇒ yield as before; CLI does NOT unlink (forward-compat with old daemons) |

Plus four post-audit regression tests:

  • rollup_rpc_load_pool_balances_at_block_uses_pinned_block — two block hashes returning different balances; asserts the pinned form reads from the caller's block, not from a re-resolved head.
  • rollup_rpc_load_state_snapshot_at_block_uses_pinned_block — mirror for the state-snapshot helper that cmd_wallet_check uses on the same pinned head_hash (requested by the reviewer of #1, the wallet-server HTTP API and web dapp PR).
  • cmd_rollup_sync_pins_pool_reads_to_finalize_head — e2e test driving a full sync against a stateful HTTP mock whose /head/hash route returns block_old on the first read and block_new thereafter. The fixture publishes one note at index 0 wrapping observed_cm, so the batch's seen_cms includes it; the wallet's pending_deposit.shielded_cm = Some(observed_cm), so finalize evaluates cm_observed = true. Pool funded at block_old, drained at block_new. Correct shape: pool reads hit the pinned block_old (funded) ⇒ deposit retained, AND the /head/hash counter resolves exactly once. Regression shape (load_pool_balances_at_head at the call site): pool reads land on block_new (drained) ⇒ drained_on_chain && cm_observed ⇒ deposit evicted ⇒ assertion fails.
  • fixture_for_finalize_pin_test_actually_evicts_under_regression_helper — companion that proves the fixture's failure mechanism is real (not tautological): same fixture, but invokes load_pool_balances_at_head directly + apply_scan_feed_finalize, asserts the deposit IS evicted. Locks in that the e2e test above protects a live mechanism.

The first cut of the e2e test (in commit 0392f29) was caught as tautological by independent reviewers A and B: it had tree_size = 0 (loop never ran) and shielded_cm = None (cm_observed always false), so the eviction predicate could not fire and the test passed regardless of the call-site shape. Both reviewers verified mechanically by reverting the call site to _at_head and observing the test still passed. The fixture is now corrected — and the companion test above verifies the failure mechanism via the helper without needing to mutate production code.

Result: cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib → 118 passed, 0 failed, 3 ignored.

Companion change

trilitech/tzel-infra#55 (feat/sync-cooperative-yield) — daemon dispatcher wraps slow-lane task execution with SyncYieldGuard::arm (writes the daemon PID into <wallet>.yield) + RAII Drop (rm), so a long-running sync yields the lock for shield/transfer/unshield/withdraw and resumes after.

The contract is forward-compatible: a daemon shipping the sentinel without this CLI's stale-PID recovery is harmless (legacy CLI just yields normally, no recovery benefit). Once both binaries ship, the cooperative behaviour activates.

Deferred to follow-up PRs

  • Daemon-side runner.rs pass-through of TZEL_SYNC_CHECKPOINT_EVERY via the env (unblocks operator override end-to-end).
  • friendlyError UI pattern for the validation error once the env var exposes it to non-CLI users.
  • Re-tune default K to 250–500 once the design doc's parallel-fetch patch (P2; PR #25, wallet: concurrent HTTP + parallel decrypt for cmd_rollup_sync) lands; at the post-P2 wallclock, K=50 means ~7 fsyncs/s sustained — non-trivial.
  • Long-term architectural follow-up: SQLite-backed WalletFile to delegate concurrency to the storage layer (would obsolete this PR's mechanism and the shadow-wallet alternative).

saroupille pushed a commit to saroupille/tzel that referenced this pull request May 9, 2026
Two independent reviewers on PR trilitech#24 surfaced one real bug, two test
gaps, and three docstring follow-ups. All applied:

**Real bug (Reviewer A)** — `w.scanned + checkpoint_every` at lib.rs:7721
used unchecked addition. With an adversarial `--checkpoint-every`
(close to `usize::MAX` on 64-bit), that would panic in debug / wrap
in release. Switched to `w.scanned.saturating_add(checkpoint_every)
.min(tree_size)`. The atomic-rename invariant on `save_wallet` means
a wrap couldn't have corrupted on-disk state, but the sync would have
silently misbehaved with a tiny degenerate batch. Belt-and-braces.
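The fix above can be shown in isolation (the `next_batch_end` helper name is illustrative; the commit patches the expression inline):

```rust
/// Next batch boundary for the checkpointed scan: saturating_add avoids
/// the debug-panic / release-wrap on an adversarial checkpoint_every, and
/// min clamps the final (partial) batch to the tree size.
fn next_batch_end(scanned: usize, checkpoint_every: usize, tree_size: usize) -> usize {
    scanned.saturating_add(checkpoint_every).min(tree_size)
}

fn main() {
    assert_eq!(next_batch_end(100, 50, 67_000), 150); // normal batch
    assert_eq!(next_batch_end(66_980, 50, 67_000), 67_000); // final partial batch
    assert_eq!(next_batch_end(100, usize::MAX, 67_000), 67_000); // adversarial K clamped
    println!("ok");
}
```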

**Test gaps**:
- All four cooperative-yield tests share the process-global
  `SYNC_CHECKPOINT_HOOK` (a `OnceLock<Mutex<...>>`). cargo-test runs
  in-binary tests in parallel; without serialization, one test's hook
  could fire under another test's loop. Added `serial_test` dev-dep
  and `#[serial(sync_checkpoint_hook)]` on all four (now five).
- All four tests hardcoded `DEFAULT_CHECKPOINT_EVERY` (50). A future
  regression that ignored the runtime parameter and used the const
  internally would still pass. Added `cmd_rollup_sync_respects_custom_
  checkpoint_every` exercising K=7 with a sentinel trip at K*2; locks
  the contract that the flag actually does something.

**Docstring follow-ups (Reviewer B)**:
- The previous text claimed `services/scan-bench/` of tzel-infra
  informed the K=50 default. False: scan-bench measures concurrent-
  fetch tolerance (the orthogonal N dial in design doc §2.A); the K=50
  numbers come from a `save_wallet` micro-bench that doesn't exist
  yet. Tightened to call out the dial separation + acknowledge the
  pending bench.
- Added the operator-vs-laptop calibration sentence: raise to 100-200
  on slow / synchronously-replicated storage; drop to 20-30 on fast
  operator boxes when sub-second preemption matters.
- Added a one-liner about `--watch` × `--checkpoint-every`
  composition (each polling iteration honours the value
  independently).

**Env var (Reviewer B F4)** — added `env = "TZEL_SYNC_CHECKPOINT_EVERY"`
to the clap arg, mirroring the precedent the design doc set for
`TZEL_SYNC_CONCURRENCY` (doc §4 line 187). Operator-deployed daemons
can now override K without recompiling. Required `clap` "env"
feature, added.

Tests: 104/104 pass (was 103, +1 with the new K=7 test). No
regressions to the 99 baseline tests.

Deferred to follow-up PRs (out of scope here):
- Daemon-side runner.rs pass-through of TZEL_SYNC_CHECKPOINT_EVERY
  (tzel-infra PR #55, before its merge — otherwise the daemon ships a
  hardcoded K=50 contract).
- friendlyError pattern for the validation error once the env var
  exposes it to non-CLI users.
- Re-tune K=250-500 once the design doc's 2.A (concurrent fetch)
  lands; at the post-2.A wallclock, K=50 means ~7 fsyncs/sec, which
  is non-trivial sustained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saroupille pushed a commit to saroupille/tzel that referenced this pull request May 10, 2026
The audit on PR trilitech#24 catalogued the interruption modes for
cmd_rollup_sync: sentinel-driven yield (4 tests) and zero-K validation
(1 test) were already covered, but three error paths weren't:

1. **HTTP error mid-batch** — a 5xx from rollup-rpc partway through a
   batch must propagate as Err with `wallet.json.scanned` left at the
   last successful checkpoint boundary, never at the failed cursor.
   Test mocks 503 on note index 15 with K=10; sync errors out, scanned
   sticks at 10 (first batch boundary), sentinel never created by the
   CLI (daemon owns sentinel lifecycle).

2. **Panic mid-batch** — a Rust panic from inside the scan loop (e.g.
   a future Drop touching FS state) must NOT corrupt wallet.json. The
   atomic-rename invariant in save_wallet preserves the prior on-disk
   state. Test panics from the sync_checkpoint_hook at scanned=20,
   wraps cmd_rollup_sync in catch_unwind, then asserts the wallet
   reloads cleanly with scanned=20 (the just-completed checkpoint
   value, since the panic fires AFTER the save_wallet in the loop
   body).

3. **`--watch` + `--checkpoint-every 0`** — validation must fire on
   the first iteration before any RPC; the `?` propagates and the
   watch loop aborts cleanly. Test points at an unreachable RPC URL
   so a hang would deadline rather than hide; asserts the validation
   error message survives through the watch loop.

The panic test caused PoisonError in the Mutex protecting
SYNC_CHECKPOINT_HOOK across subsequent tests in the suite. Made the
three accessor functions poison-tolerant via
`unwrap_or_else(|e| e.into_inner())` — the protected data is just
`Option<Box<dyn Fn>>`, safe to overwrite regardless of poison state.
This also defends against any future panic-from-hook regression
breaking unrelated tests.
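The poison-tolerance pattern can be reproduced minimally (a plain `Option<i32>` stands in for the real `Option<Box<dyn Fn>>` hook payload):

```rust
use std::sync::Mutex;

// Stand-in for the real SYNC_CHECKPOINT_HOOK payload.
static HOOK: Mutex<Option<i32>> = Mutex::new(None);

fn set_hook(v: Option<i32>) {
    // into_inner() recovers the guard even if a previous holder panicked;
    // the value is simply overwritten, so poison carries no information here.
    *HOOK.lock().unwrap_or_else(|e| e.into_inner()) = v;
}

fn get_hook() -> Option<i32> {
    *HOOK.lock().unwrap_or_else(|e| e.into_inner())
}

fn main() {
    std::panic::set_hook(Box::new(|_| {})); // keep test output quiet
    set_hook(Some(7));
    // Poison the mutex by panicking while holding the guard.
    let _ = std::panic::catch_unwind(|| {
        let _g = HOOK.lock().unwrap();
        panic!("hook panicked");
    });
    // A plain .unwrap() here would now propagate PoisonError; the
    // tolerant accessors still work.
    set_hook(Some(42));
    assert_eq!(get_hook(), Some(42));
    println!("ok");
}
```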

107/107 tests pass (was 104, +3 new). All cooperative-yield tests
remain serialized via #[serial(sync_checkpoint_hook)] since they
share the global hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saroupille pushed a commit to saroupille/tzel that referenced this pull request May 10, 2026
Strategic audit on PR trilitech#24 caught a real bug class: if the daemon
crashes after `touch <wallet>.yield` but before its RAII guard's
`rm`, the sentinel is permanent. Every subsequent sync would exit
at the first checkpoint, make K commits of progress, and exit
again — wedged forever, identical failure shape to the
stale-WalletLock case the existing `is_stale_wallet_lock` recovery
solves.

Mirror that discipline:

- New `is_stale_yield_sentinel(path)` reads sentinel content,
  parses first line as PID, probes `/proc/<pid>`. Forward-compat:
  any read or parse failure returns false ("treat as live"), so
  legacy daemons + tests writing `b"yield"` keep working.
- In `cmd_rollup_sync`'s sentinel-check, when the sentinel exists
  and `is_stale_yield_sentinel` returns true: unlink + log warning
  + `continue` (don't yield this iteration). Otherwise yield as
  before.
- The companion daemon-side PR (tzel-infra #55) needs a follow-up
  to write `<pid>\n` into the sentinel on touch. Until that lands,
  this CLI change is harmless: it falls through to "treat as live"
  for the legacy `b"yield"` content.

Two new tests:
- `cmd_rollup_sync_recovers_from_stale_yield_sentinel`: pre-create
  sentinel with PID 999999 (above kernel pid_max ceiling on dev
  envs), assert sync runs to tree_size and unlinks the sentinel.
- `cmd_rollup_sync_treats_legacy_sentinel_content_as_live`: pre-
  create sentinel with `b"yield"`, assert sync yields normally and
  does NOT unlink (daemon owns lifecycle for live sentinels).

109/109 tests pass (was 107, +2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saroupille saroupille marked this pull request as draft May 10, 2026 09:07
@saroupille saroupille force-pushed the feat/sync-cooperative-yield branch from 1d89749 to e69857b Compare May 10, 2026 09:54
@saroupille saroupille marked this pull request as ready for review May 10, 2026 10:03
@saroupille saroupille force-pushed the feat/sync-cooperative-yield branch 6 times, most recently from 7a3cd6b to abfdf43 Compare May 10, 2026 13:32
`tzel-wallet sync` previously scanned `[scanned, tree_size)` sequentially,
persisting `wallet.json.scanned` only at the end. While running it held an
exclusive lock on `wallet.json.lock`, so every other wallet op (shield,
transfer, unshield, balance, sync, …) blocked for the full duration. On a
long-running rollup that's 5–45 min wallclock — unacceptable in
daemon-driven flows where users want to act now.

This change makes the scan that does need to run preemptable:

1. Incremental persist. `cmd_rollup_sync` scans in batches of
   `--checkpoint-every` commits (default `DEFAULT_CHECKPOINT_EVERY = 50`,
   tunable via `TZEL_SYNC_CHECKPOINT_EVERY` env var). Each batch ends in
   a `save_wallet` that records the new cursor + newly-discovered notes
   atomically (temp + fsync + rename + parent-dir fsync, the existing
   invariant — just called more often).

2. Sentinel check at every checkpoint. After persisting a batch, the
   sync stat()s `<wallet>.yield`. Present ⇒ exit 0 cleanly. Next sync
   resumes from the persisted cursor.

3. Stale-PID recovery. The daemon writes its PID into the sentinel; the
   CLI checks `/proc/<pid>` and unlinks if dead. Mirrors the existing
   `WalletLock` discipline. The recovery warning includes the dead PID
   so operators can correlate with their daemon-restart logs. Without
   this, a daemon crash mid-slow-op would wedge every subsequent sync
   forever.

4. Final flush. Nullifier prune + pending-pool prune run once over the
   cumulative cm set accumulated across batches.

5. Consistency-pinned helpers. `load_pool_balances_at_block` and
   `load_state_snapshot_at_block` take an explicit `block_ref` so
   callers performing multiple reads (note slice + nullifiers + pool
   balances) observe the same kernel state. The head-resolving wrappers
   are renamed `_at_head` and reserved for single-shot callers; the
   naming asymmetry makes future regressions of this consistency
   property loud at code review. `cmd_rollup_sync` finalize and
   `cmd_wallet_check` both use the pinned form against a `head_hash`
   captured at the top of the function.

The companion daemon-side change in trilitech/tzel-infra#55 wraps
slow-lane CLI ops (shield/send/unshield/withdraw) with `touch
<wallet>.yield` before admit + RAII-removed after. Once both binaries
ship, a long initial sync no longer starves slow-lane ops: the
slow-lane signals "yield please" via the file, the next checkpoint
sees it, sync exits cleanly, slow-lane gets the lock. The contract is
forward-compatible: a daemon shipping the sentinel without this CLI's
stale-PID recovery is harmless.

Considered alternative: shadow-wallet sync (scan against a scratch
copy, merge back at K-commit boundaries). Documented in
`docs/shadow-wallet-design.md` (tzel-infra). Rejected after a
comparative audit: same fsync count + same K-commit discovery latency,
but introduces a silent failure class (clear-before-rename ordering on
the delta accumulator can drop nullifier observations, surfacing only
at proving time as an opaque double-spend). Cooperative-yield's
worst-case footgun is a stale sentinel — loud, immediately
diagnosable, fixable with `rm`. Loud beats silent.

Tests: 12 cooperative-yield tests in `network_profile_tests` (gated
on `#[serial(sync_checkpoint_hook)]` since they share the
process-global `SYNC_CHECKPOINT_HOOK`). They cover the primary yield
paths (pre-existing sentinel, mid-scan sentinel, custom K, watch-loop
validation, three-iteration watch-loop body), the wallet-file
invariants (intermediate parseability, panic-mid-batch atomic-rename),
and the recovery paths (HTTP error mid-batch, stale PID, legacy
non-PID content).

Plus four post-audit regression tests for the consistency-pin
invariant: `rollup_rpc_load_pool_balances_at_block_uses_pinned_block`
and `rollup_rpc_load_state_snapshot_at_block_uses_pinned_block` lock
the helper-layer property; `cmd_rollup_sync_pins_pool_reads_to_finalize_head`
e2e-tests the call site against a stateful HTTP mock; and
`fixture_for_finalize_pin_test_actually_evicts_under_regression_helper`
proves the fixture's failure mechanism is real (not tautological) by
running the same routes through the regression helper.

Result: `cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib`
→ 118 passed, 0 failed, 3 ignored.

CLI surface:
  tzel-wallet sync                          # K=50 default
  tzel-wallet sync --checkpoint-every 200   # less overhead, slower preempt
  tzel-wallet sync --checkpoint-every 20    # faster preempt, more fsyncs
  TZEL_SYNC_CHECKPOINT_EVERY=100 tzel-wallet sync  # operator override

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saroupille saroupille force-pushed the feat/sync-cooperative-yield branch from abfdf43 to 107ff88 Compare May 10, 2026 14:13
@saroupille saroupille merged commit 6772208 into trilitech:main May 10, 2026
2 checks passed
saroupille pushed a commit to saroupille/tzel that referenced this pull request May 10, 2026
The 67k-commit ushuaianet sync wallclock was dominated by sequential
ureq round trips against the rollup-node's durable-state RPC (~40 ms
each, ~95% of total). Two orthogonal accelerations land here:

* 2.A — concurrent HTTP fetch in `RollupRpc::load_notes_since_at_block`
  via `reqwest` + `futures_util::FuturesUnordered`. The async runtime
  is hosted on a dedicated worker thread so the path composes with both
  the synchronous CLI dispatch and the multi-thread tokio runtime that
  powers `tzel-detect`'s axum handlers (calling `block_on` from inside
  a tokio runtime panics — the worker-thread bridge sidesteps that
  without forcing every caller to be async).

  Concurrency tuned by `TZEL_SYNC_CONCURRENCY` (default 8, hard cap
  128). The cap rejects misconfiguration before it can fill the rollup
  node's TCP backlog. Errors short-circuit the batch — same
  abort-on-first-error contract the sequential loop had.

* 2.C — `apply_scan_feed`'s ML-KEM-768 trial-decrypt loop runs through
  rayon's default global pool. The decrypt is embarrassingly parallel:
  `try_recover_note` is `&self`, addresses don't change inside the
  call, recovered notes are independent. Recovery results are merged
  sequentially after the parallel pass to preserve the existing
  println-then-push order the test suite relies on.

A one-line stderr banner at sync start surfaces both tunables to the
operator. Composes orthogonally with `--at-tree-size` (PR trilitech#23, merged)
and the cooperative-yield work in PR trilitech#24.

NOT included: 2.B (batched RPC, deferred to upstream octez) and 2.D
(monitor stream, a separate concern).

Tests: 3 new in `network_profile_tests`:
  * concurrent_fetch_returns_same_results_as_sequential — out-of-order
    completion is reassembled correctly (N=1 vs N=8 against a
    100-commit mock).
  * concurrent_fetch_aborts_on_5xx — a 503 mid-batch propagates the
    error with HTTP status preserved, no silent retry.
  * parallel_decrypt_returns_same_results_as_sequential — single-thread
    rayon pool vs default pool yield identical recovered notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saroupille added a commit that referenced this pull request May 10, 2026
The 67k-commit ushuaianet sync wallclock was dominated by sequential
ureq round trips against the rollup-node's durable-state RPC (~40 ms
each, ~95% of total). Three orthogonal accelerations land here, all
firing on the post-#24 cooperative-yield path that `cmd_rollup_sync`
actually executes:

* 2.A — concurrent HTTP fetch in `RollupRpc::load_notes_since_at_block`
  via `reqwest` + `futures_util::FuturesUnordered`. The async runtime
  is hosted on a dedicated worker thread so the path composes with both
  the synchronous CLI dispatch and the multi-thread tokio runtime that
  powers `tzel-detect`'s axum handlers (calling `block_on` from inside
  a tokio runtime panics — the worker-thread bridge sidesteps that
  without forcing every caller to be async).

  Concurrency tuned by `TZEL_SYNC_CONCURRENCY` (default 4 — matches CI
  vCPU count, ship-safe across every rollup-node we know about; bumping
  above 4 should be opt-in once an operator measures their own
  rollup-node's tail-latency curve via `services/scan-bench/`). The 128
  hard cap rejects misconfiguration before it can fill the rollup
  node's TCP backlog. Errors short-circuit the batch — same
  abort-on-first-error contract the sequential loop had.
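One way such a tunable could be parsed, assuming reject-not-clamp semantics above the cap (the function name and the rejection of zero are this sketch's assumptions, not the shipped code):

```rust
// Defaults from the commit message; parsing shape is illustrative.
const DEFAULT_SYNC_CONCURRENCY: usize = 4;
const SYNC_CONCURRENCY_CAP: usize = 128;

/// Parse a TZEL_SYNC_CONCURRENCY-style value: absent means the shipped
/// default; anything unparseable, zero, or over the hard cap is rejected
/// outright rather than clamped, so a typo fails before it can fill the
/// rollup node's TCP backlog.
fn sync_concurrency_from(raw: Option<&str>) -> Result<usize, String> {
    let n = match raw {
        None => return Ok(DEFAULT_SYNC_CONCURRENCY),
        Some(s) => s
            .parse::<usize>()
            .map_err(|e| format!("TZEL_SYNC_CONCURRENCY: {e}"))?,
    };
    if n == 0 || n > SYNC_CONCURRENCY_CAP {
        return Err(format!(
            "TZEL_SYNC_CONCURRENCY must be in 1..={SYNC_CONCURRENCY_CAP}, got {n}"
        ));
    }
    Ok(n)
}
```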

* 2.C — `apply_scan_feed_recover_batch` (PR #24's per-batch recover
  function, the function `cmd_rollup_sync` actually calls) routes its
  ML-KEM-768 trial-decrypt loop through rayon's default global pool.
  The decrypt is embarrassingly parallel: `try_recover_note` is
  `&self`, addresses don't change inside the call, recovered notes are
  independent. Recovery results are merged sequentially after the
  parallel pass to preserve the existing println-then-push order the
  test suite relies on. (The legacy `apply_scan_feed` on the dead
  `cmd_scan` path was already parallelised in an earlier draft of this
  PR, but the post-#24 hot path runs through
  `apply_scan_feed_recover_batch` — without this commit, the rayon
  par_iter never fired on a real `tzel-wallet sync` invocation.)
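The order-preserving discipline can be sketched without rayon: chunk the inputs, recover each chunk on its own thread, then concatenate chunk results in chunk order, so the merged output matches the sequential loop exactly. (`recover_parallel` and its closure are hypothetical stand-ins for `apply_scan_feed_recover_batch` / `try_recover_note`.)

```rust
use std::thread;

/// Parallel pass over the inputs (rayon's par_iter in the real code),
/// followed by a sequential merge that drops misses while keeping
/// input order, so output order is identical to a plain sequential map.
fn recover_parallel<F>(inputs: Vec<u32>, recover: F) -> Vec<u32>
where
    F: Fn(u32) -> Option<u32> + Send + Copy + 'static,
{
    let mid = inputs.len() / 2;
    let (left, right) = (inputs[..mid].to_vec(), inputs[mid..].to_vec());
    // Two threads stand in for the thread pool; each keeps its chunk's order.
    let lh = thread::spawn(move || left.into_iter().map(recover).collect::<Vec<_>>());
    let rh = thread::spawn(move || right.into_iter().map(recover).collect::<Vec<_>>());
    let mut results = lh.join().unwrap();
    results.extend(rh.join().unwrap());
    // Sequential merge: the println-then-push step in the real code.
    results.into_iter().flatten().collect()
}
```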

* 2.A continued — `SyncFetcher` long-lived context. Pre-fix:
  `fetch_published_notes_concurrent` was called per-batch and rebuilt
  the `reqwest::Client` (and a fresh tokio runtime + thread) every
  time. At the new K=250 across a 67k-commit sync that's still ~268
  client builds, each warming a fresh connection pool from cold —
  the design doc's "amortised TCP+TLS handshake" claim was false in
  that shape. `SyncFetcher` owns one `reqwest::Client` + worker thread
  + tokio runtime that survive the whole `cmd_rollup_sync` call. Pool
  warms once. (Across `--watch` iterations the pool is rebuilt; the
  per-iteration scan is small once caught up.)
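The contract pinned by `sync_fetcher_amortises_client_across_batches` can be illustrated with a counter-tracked dummy: the expensive client (a `reqwest::Client` plus runtime in the real code, a `String` here) is built once in the constructor and reused by every batch call. All names in this std-only sketch are illustrative:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct SyncFetcher {
    client: String, // stands in for the long-lived reqwest::Client
}

impl SyncFetcher {
    /// The one expensive build; the counter lets a test observe how
    /// many times it ran.
    fn new(build_counter: &AtomicUsize) -> Self {
        build_counter.fetch_add(1, Ordering::SeqCst);
        SyncFetcher { client: String::from("warm-client") }
    }

    /// Services a batch by reusing `self.client`; no per-batch
    /// construction, so the pool warms exactly once.
    fn fetch_batch(&self, batch: usize) -> String {
        format!("{}:batch{}", self.client, batch)
    }
}
```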

A one-line stderr banner at sync start surfaces both tunables to the
operator. Composes orthogonally with `--at-tree-size` (PR #23, merged)
and the cooperative-yield work in PR #24, including the PR #24 body's
line-90 note to "re-tune K to 250–500 once P2 lands":
`DEFAULT_CHECKPOINT_EVERY` bumps from 50 → 250 (the conservative end of
that range).
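The K re-tune only changes batch geometry. A hypothetical sketch of how the `[scanned, tree_size)` range splits into checkpoint batches at the new default (`checkpoint_batches` is illustrative, not the shipped helper):

```rust
const DEFAULT_CHECKPOINT_EVERY: u64 = 250;

/// Split [scanned, tree_size) into K-commit half-open batches; in the
/// real sync loop each batch ends with a persisted cursor and a
/// sentinel check before the next one starts.
fn checkpoint_batches(scanned: u64, tree_size: u64, k: u64) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut lo = scanned;
    while lo < tree_size {
        let hi = (lo + k).min(tree_size);
        out.push((lo, hi));
        lo = hi;
    }
    out
}
```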

NOT included: 2.B (batched RPC, deferred to upstream octez) and 2.D
(monitor stream, a separate concern).

Tests: 4 in `network_profile_tests` (122 lib tests pass total, was 117
on `main` pre-#24-chain):
  * concurrent_fetch_returns_same_results_as_sequential — out-of-order
    completion is reassembled correctly (N=1 vs N=8 against a
    100-commit mock).
  * concurrent_fetch_aborts_on_5xx — a 503 mid-batch propagates the
    error with HTTP status preserved, no silent retry, AND a counted
    mock asserts cancellation short-circuited (served < total).
  * parallel_decrypt_returns_same_results_as_sequential — independent
    fixture-time oracle (50 mixed recoverable + non-recoverable
    notes); both sequential and parallel branches must produce
    exactly the oracle set. Mutating
    `apply_scan_feed_recover_batch` to drop half the recoveries fails
    this test loud (verified).
  * sync_fetcher_amortises_client_across_batches — pins the contract
    that one `reqwest::Client` services multiple sequential batches.

Co-authored-by: François Thiré <franth2@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>