wallet: cooperative-yield sync (incremental persist + sentinel) #24

Merged
saroupille merged 1 commit into trilitech:main from saroupille:feat/sync-cooperative-yield on May 10, 2026
Conversation

saroupille (Collaborator) commented May 9, 2026

Problem

tzel-wallet sync scans [scanned, tree_size) sequentially, persisting wallet.json.scanned only at the end. While running it holds an exclusive lock on wallet.json.lock, so every other wallet op (shield, transfer, unshield, balance, sync, …) blocks for the full duration. On a long-running rollup that's 5–45 min wallclock — unacceptable in daemon-driven flows where users want to act now.

This is the second half of a two-PR pair. PR #23 (merged) added cmd_init --at-tree-size N so a fresh wallet skips the historical scan entirely. This PR makes the scan that DOES need to run preemptable so it doesn't starve the rest of the daemon.

Design — cooperative yield via sentinel file

Mechanism

  1. Incremental persist. cmd_rollup_sync scans in batches of --checkpoint-every commits (default DEFAULT_CHECKPOINT_EVERY = 50). Each batch ends in a save_wallet that records w.scanned = batch_end plus newly-discovered notes atomically (temp + fsync + rename + parent-dir fsync — pre-existing invariant, just called more often).
  2. Sentinel check at every checkpoint. After persisting a batch, the sync stat()s <wallet>.yield. Present ⇒ exit 0 cleanly. Next sync resumes from the persisted cursor.
  3. Stale-PID recovery (audit-driven). The daemon writes its PID into the sentinel; the CLI checks /proc/<pid> and unlinks if dead. Mirrors the existing WalletLock discipline. Without this, a daemon crash mid-slow-op would wedge every subsequent sync forever.
  4. Final flush at end-of-loop. Nullifier prune + pending-pool prune run once over the cumulative cm set accumulated across batches.
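A minimal sketch of the batch/sentinel shape described above (the `Wallet` and `save_wallet` stand-ins are illustrative; the real types live in tzel-wallet):

```rust
use std::path::Path;

// Hypothetical stand-ins for the real wallet types, just enough to show
// the checkpoint loop's shape.
struct Wallet {
    scanned: usize,
}

fn save_wallet(_w: &Wallet) {
    // Real impl: temp + fsync + rename + parent-dir fsync (atomic persist).
}

/// Scan [w.scanned, tree_size) in batches of `k`, checkpointing after each
/// batch and yielding cleanly when the sentinel file appears.
/// Returns the cursor the scan stopped at.
fn sync_batches(w: &mut Wallet, tree_size: usize, k: usize, sentinel: &Path) -> usize {
    while w.scanned < tree_size {
        // Clamp the batch end (mirrors the saturating_add + min fix).
        let batch_end = w.scanned.saturating_add(k).min(tree_size);
        // ... fetch + trial-decrypt notes in [w.scanned, batch_end) ...
        w.scanned = batch_end;
        save_wallet(w); // checkpoint: cursor + new notes persisted atomically
        if sentinel.exists() {
            // Sentinel present => exit 0 cleanly; the next sync resumes
            // from the persisted cursor. (The real code also runs the
            // stale-PID check here before deciding to yield.)
            break;
        }
    }
    w.scanned
}

fn main() {
    let mut w = Wallet { scanned: 0 };
    // No sentinel at this path, so the scan runs to tree_size.
    let done = sync_batches(&mut w, 173, 50, Path::new("/nonexistent/wallet.yield"));
    println!("{done}"); // 173
}
```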

The companion change in trilitech/tzel-infra#55 wraps slow-lane CLI ops (shield/send/unshield/withdraw) with a touch <wallet>.yield before admitting the op, RAII-removed afterwards. A long initial sync therefore no longer starves slow-lane ops: the slow lane signals "yield please" via the file, the next checkpoint sees it, the sync exits cleanly, and the slow lane takes the lock.
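The daemon-side guard can be sketched as follows (the real SyncYieldGuard lives in tzel-infra; the name and sentinel-path construction here are illustrative):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Illustrative RAII guard: arming writes the daemon's PID into
/// `<wallet>.yield`; Drop removes it, so the sentinel's lifetime brackets
/// the slow-lane operation even on early return.
struct YieldGuard {
    sentinel: PathBuf,
}

impl YieldGuard {
    fn arm(wallet: &Path) -> std::io::Result<YieldGuard> {
        let sentinel = PathBuf::from(format!("{}.yield", wallet.display()));
        // `<pid>\n` content enables the CLI's stale-PID recovery.
        fs::write(&sentinel, format!("{}\n", std::process::id()))?;
        Ok(YieldGuard { sentinel })
    }
}

impl Drop for YieldGuard {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.sentinel); // best-effort rm
    }
}

fn main() -> std::io::Result<()> {
    let wallet = std::env::temp_dir().join("wallet.json");
    let sentinel = PathBuf::from(format!("{}.yield", wallet.display()));
    {
        let _guard = YieldGuard::arm(&wallet)?;
        assert!(sentinel.exists()); // sync's next checkpoint would yield here
    } // guard dropped => sentinel removed
    assert!(!sentinel.exists());
    Ok(())
}
```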

CLI surface

tzel-wallet sync                          # K=50 default
tzel-wallet sync --checkpoint-every 200   # less overhead, slower preempt
tzel-wallet sync --checkpoint-every 20    # faster preempt, more fsyncs
TZEL_SYNC_CHECKPOINT_EVERY=100 tzel-wallet sync   # operator override

The default of 50 is a reasoned estimate, pending empirical measurement on a live octez-smart-rollup-node; the reasoning is documented inline. Tunable over 10..500, with explicit operator-vs-laptop guidance: raise to 100–200 on slow disks, drop to 20–30 on fast operator boxes that want sub-second preemption.

Considered alternative — shadow-wallet sync

A strategic-audit pass proposed an alternative design where the scan operates on a <wallet>.scan scratch copy with its own lock, and merges back into canonical at every K-commit boundary. Documented in docs/shadow-wallet-design.md (in tzel-infra) for traceability.

Why we picked cooperative-yield over shadow-wallet (per the comparative audit's verdict):

| Failure-class blast radius | Cooperative-yield (this PR) | Shadow-wallet |
| --- | --- | --- |
| Worst-case footgun | Stale sentinel file; recovered by `kill -0` + unlink; loud (sync exits early forever, immediately diagnosable, `rm <wallet>.yield` resolves) | Delta-accumulator clear-before-fsync; silent (lost nullifier observations → spent note looks unspent → double-spend rejected at proving time → opaque UX error) |
| Discovery latency | K commits | K commits (when the periodic-merge variant is chosen) |
| Lock contention | Bounded by K | Bounded by K (when periodic-merge) |
| fsync count on 67k commits | 1340 | 1340 |
| Crash class: stale sentinel | yes (recoverable) | no |
| Crash class: silent observation drift | no | yes (clear-before-rename ordering bug) |

Loud beats silent. Both designs are tactical fixes for the same constraint (file-based wallet + flock). At ~2× LOC for shadow-wallet plus a re-litigation of ADR-002, the migration cost (~2 weeks senior-eng review) buys a "canonical never holds partial scan state" property that no caller currently exercises — while introducing a subtle data-loss class that surfaces only at proving time. Decision: ship cooperative-yield as-is.

Pre-merge audit follow-ups (commits f3a0755, 3edef12)

Independent reviewers caught a real consistency bug missed by the earlier audit rounds: cmd_rollup_sync finalize pinned head_hash for the note slice + nullifiers but called the head-resolving load_pool_balances for pool reads. A slow-lane drain landing between the two reads would silently evict a still-funded pool. Fixed in f3a0755 (load_pool_balances_at_block taking an explicit block_ref) and covered by a regression test that drives two block hashes returning different balances.

A generalising review of the same pattern caught a second instance in cmd_wallet_check (banner pool summary), fixed in 3edef12. The same commit renames the head-resolving wrappers to _at_head so the naming asymmetry (_at_head vs _at_block) makes a future regression loud at code review, and enriches the stale-sentinel recovery warning with the dead daemon's PID for operator triage.
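The pinned-read shape can be sketched with illustrative types (the real helpers live in the wallet's rollup-RPC layer; `BlockRef`, `Rpc`, and the balances returned here are stand-ins):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct BlockRef(u64);

// Toy RPC: head moves when a slow-lane drain lands between two reads.
struct Rpc {
    head: BlockRef,
}

impl Rpc {
    fn resolve_head(&self) -> BlockRef {
        self.head
    }
    // Pinned form: reads at the caller's block, never re-resolving head.
    fn load_pool_balances_at_block(&self, block: BlockRef) -> (BlockRef, u64) {
        (block, if block.0 == 1 { 100 } else { 0 })
    }
    // Head-resolving wrapper, reserved for single-shot callers; the
    // `_at_head` name makes an accidental use in multi-read code loud.
    fn load_pool_balances_at_head(&self) -> (BlockRef, u64) {
        let head = self.resolve_head();
        self.load_pool_balances_at_block(head)
    }
}

fn main() {
    let mut rpc = Rpc { head: BlockRef(1) };
    let pinned = rpc.resolve_head(); // finalize pins head once...
    rpc.head = BlockRef(2); // ...then a drain lands between the two reads.
    let (at, bal) = rpc.load_pool_balances_at_block(pinned);
    assert_eq!((at, bal), (BlockRef(1), 100)); // pinned read: still funded
    let (_, head_bal) = rpc.load_pool_balances_at_head();
    assert_eq!(head_bal, 0); // unpinned read would see the pool drained
    println!("ok");
}
```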

Tests

12 cooperative-yield tests in network_profile_tests (gated on #[serial(sync_checkpoint_hook)] since they share the process-global SYNC_CHECKPOINT_HOOK):

| Test | What it asserts |
| --- | --- |
| `cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present` | Pre-existing sentinel ⇒ exit at first checkpoint; do NOT remove the caller's sentinel |
| `cmd_rollup_sync_resumes_from_checkpointed_cursor` | Sentinel tripped mid-scan ⇒ resume from the last checkpoint; notes pre-yield in run 1, post-yield in run 2 |
| `cmd_rollup_sync_intermediate_wallet_is_always_parseable` | A concurrent reader thread never sees a half-written wallet.json |
| `cmd_rollup_sync_refuses_zero_checkpoint_every` | Validation fires pre-RPC; error message stable |
| `cmd_rollup_sync_respects_custom_checkpoint_every` | K ≠ 50 honoured (locks the runtime-parameter contract) |
| `cmd_rollup_sync_http_error_mid_batch_preserves_last_checkpoint` | 5xx mid-batch ⇒ Err propagates; `scanned` at the last batch boundary; sentinel untouched by the CLI |
| `cmd_rollup_sync_panic_mid_batch_preserves_wallet_json` | Rust panic mid-batch ⇒ atomic-rename invariant holds; reload works |
| `cmd_rollup_sync_watch_with_zero_checkpoint_every_aborts_first_iter` | Watch loop aborts on validation, doesn't sleep |
| `cmd_rollup_sync_recovers_from_stale_yield_sentinel` | Sentinel with a freshly-reaped child PID ⇒ unlink + continue; scan reaches tree_size (uses a `freshly_dead_pid()` helper instead of a hardcoded 999999, robust against kernel.pid_max variation on container hosts) |
| `cmd_rollup_sync_watch_loop_body_yields_twice_then_finishes` | Three-iteration yield-then-resume mirroring `cmd_rollup_sync_watch`'s loop body: the sentinel persists across iterations 1+2 and is removed before iteration 3. Locks the property that recovered notes from earlier iterations re-feed `known_cms` on the finalize iteration via `w.notes`, so a regression that moved `seen_cms` to a process-local accumulator (or stopped folding `w.notes` into `known_cms`) would surface |
| `cmd_rollup_sync_treats_legacy_sentinel_content_as_live` | Non-PID content (`b"yield"`) ⇒ yield as before; CLI does NOT unlink (forward-compat with old daemons) |

Plus four post-audit regression tests:

  • rollup_rpc_load_pool_balances_at_block_uses_pinned_block — two block hashes returning different balances; asserts the pinned form reads from the caller's block, not from a re-resolved head.
  • rollup_rpc_load_state_snapshot_at_block_uses_pinned_block — mirror for the state-snapshot helper that cmd_wallet_check uses on the same pinned head_hash (requested by the reviewer of #1, the wallet-server HTTP API and web dapp PR).
  • cmd_rollup_sync_pins_pool_reads_to_finalize_head — e2e test driving a full sync against a stateful HTTP mock whose /head/hash route returns block_old on the first read and block_new thereafter. The fixture publishes one note at index 0 wrapping observed_cm, so the batch's seen_cms includes it; the wallet's pending_deposit.shielded_cm = Some(observed_cm), so finalize evaluates cm_observed = true. Pool funded at block_old, drained at block_new. Correct shape: pool reads hit the pinned block_old (funded) ⇒ deposit retained, AND the /head/hash counter resolves exactly once. Regression shape (load_pool_balances_at_head at the call site): pool reads land on block_new (drained) ⇒ drained_on_chain && cm_observed ⇒ deposit evicted ⇒ assertion fails.
  • fixture_for_finalize_pin_test_actually_evicts_under_regression_helper — companion that proves the fixture's failure mechanism is real (not tautological): same fixture, but invokes load_pool_balances_at_head directly + apply_scan_feed_finalize, asserts the deposit IS evicted. Locks in that the e2e test above protects a live mechanism.

The first cut of the e2e test (in commit 0392f29) was caught as tautological by independent reviewers A and B: it had tree_size = 0 (loop never ran) and shielded_cm = None (cm_observed always false), so the eviction predicate could not fire and the test passed regardless of the call-site shape. Both reviewers verified mechanically by reverting the call site to _at_head and observing the test still passed. The fixture is now corrected — and the companion test above verifies the failure mechanism via the helper without needing to mutate production code.

Result: cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib → 118 passed, 0 failed, 3 ignored.

Companion change

trilitech/tzel-infra#55 (feat/sync-cooperative-yield) — daemon dispatcher wraps slow-lane task execution with SyncYieldGuard::arm (writes the daemon PID into <wallet>.yield) + RAII Drop (rm), so a long-running sync yields the lock for shield/transfer/unshield/withdraw and resumes after.

The contract is forward-compatible: a daemon shipping the sentinel without this CLI's stale-PID recovery is harmless (legacy CLI just yields normally, no recovery benefit). Once both binaries ship, the cooperative behaviour activates.

Deferred to follow-up PRs

  • Daemon-side runner.rs pass-through of TZEL_SYNC_CHECKPOINT_EVERY via the env (unblocks operator override end-to-end).
  • friendlyError UI pattern for the validation error once the env var exposes it to non-CLI users.
  • Re-tune default K to 250–500 once the design doc's parallel-fetch patch (P2; PR #25, wallet: concurrent HTTP + parallel decrypt for cmd_rollup_sync) lands; at the post-P2 wallclock, K=50 means ~7 fsyncs/s sustained — non-trivial.
  • Long-term architectural follow-up: SQLite-backed WalletFile to delegate concurrency to the storage layer (would obsolete this PR's mechanism and the shadow-wallet alternative).

saroupille pushed a commit to saroupille/tzel that referenced this pull request May 9, 2026
Two independent reviewers on PR trilitech#24 surfaced one real bug, two test
gaps, and three docstring follow-ups. All applied:

**Real bug (Reviewer A)** — `w.scanned + checkpoint_every` at lib.rs:7721
used unchecked addition. With an adversarial `--checkpoint-every`
(close to `usize::MAX` on 64-bit), that would panic in debug / wrap
in release. Switched to `w.scanned.saturating_add(checkpoint_every)
.min(tree_size)`. The atomic-rename invariant on `save_wallet` means
a wrap couldn't have corrupted on-disk state, but the sync would have
silently misbehaved with a tiny degenerate batch. Belt-and-braces.
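The fix above can be shown in isolation (the `next_batch_end` helper name is illustrative; the commit patches the expression inline):

```rust
/// Next batch boundary for the checkpointed scan: saturating_add avoids
/// the debug-panic / release-wrap on an adversarial checkpoint_every, and
/// min clamps the final (partial) batch to the tree size.
fn next_batch_end(scanned: usize, checkpoint_every: usize, tree_size: usize) -> usize {
    scanned.saturating_add(checkpoint_every).min(tree_size)
}

fn main() {
    assert_eq!(next_batch_end(100, 50, 67_000), 150); // normal batch
    assert_eq!(next_batch_end(66_980, 50, 67_000), 67_000); // final partial batch
    assert_eq!(next_batch_end(100, usize::MAX, 67_000), 67_000); // adversarial K clamped
    println!("ok");
}
```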

**Test gaps**:
- All four cooperative-yield tests share the process-global
  `SYNC_CHECKPOINT_HOOK` (a `OnceLock<Mutex<...>>`). cargo-test runs
  in-binary tests in parallel; without serialization, one test's hook
  could fire under another test's loop. Added `serial_test` dev-dep
  and `#[serial(sync_checkpoint_hook)]` on all four (now five).
- All four tests hardcoded `DEFAULT_CHECKPOINT_EVERY` (50). A future
  regression that ignored the runtime parameter and used the const
  internally would still pass. Added `cmd_rollup_sync_respects_custom_
  checkpoint_every` exercising K=7 with a sentinel trip at K*2; locks
  the contract that the flag actually does something.

**Docstring follow-ups (Reviewer B)**:
- The previous text claimed `services/scan-bench/` of tzel-infra
  informed the K=50 default. False: scan-bench measures concurrent-
  fetch tolerance (the orthogonal N dial in design doc §2.A); the K=50
  numbers come from a `save_wallet` micro-bench that doesn't exist
  yet. Tightened to call out the dial separation + acknowledge the
  pending bench.
- Added the operator-vs-laptop calibration sentence: raise to 100-200
  on slow / synchronously-replicated storage; drop to 20-30 on fast
  operator boxes when sub-second preemption matters.
- Added a one-liner about `--watch` × `--checkpoint-every`
  composition (each polling iteration honours the value
  independently).

**Env var (Reviewer B F4)** — added `env = "TZEL_SYNC_CHECKPOINT_EVERY"`
to the clap arg, mirroring the precedent the design doc set for
`TZEL_SYNC_CONCURRENCY` (doc §4 line 187). Operator-deployed daemons
can now override K without recompiling. Required `clap` "env"
feature, added.

Tests: 104/104 pass (was 103, +1 with the new K=7 test). No
regressions to the 99 baseline tests.

Deferred to follow-up PRs (out of scope here):
- Daemon-side runner.rs pass-through of TZEL_SYNC_CHECKPOINT_EVERY
  (tzel-infra PR #55, before its merge — otherwise the daemon ships a
  hardcoded K=50 contract).
- friendlyError pattern for the validation error once the env var
  exposes it to non-CLI users.
- Re-tune K=250-500 once the design doc's 2.A (concurrent fetch)
  lands; at the post-2.A wallclock, K=50 means ~7 fsyncs/sec, which
  is non-trivial sustained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saroupille pushed a commit to saroupille/tzel that referenced this pull request May 10, 2026
The audit on PR trilitech#24 catalogued the interruption modes for
cmd_rollup_sync: sentinel-driven yield (4 tests) and zero-K validation
(1 test) were already covered, but three error paths weren't:

1. **HTTP error mid-batch** — a 5xx from rollup-rpc partway through a
   batch must propagate as Err with `wallet.json.scanned` left at the
   last successful checkpoint boundary, never at the failed cursor.
   Test mocks 503 on note index 15 with K=10; sync errors out, scanned
   sticks at 10 (first batch boundary), sentinel never created by the
   CLI (daemon owns sentinel lifecycle).

2. **Panic mid-batch** — a Rust panic from inside the scan loop (e.g.
   a future Drop touching FS state) must NOT corrupt wallet.json. The
   atomic-rename invariant in save_wallet preserves the prior on-disk
   state. Test panics from the sync_checkpoint_hook at scanned=20,
   wraps cmd_rollup_sync in catch_unwind, then asserts the wallet
   reloads cleanly with scanned=20 (the just-completed checkpoint
   value, since the panic fires AFTER the save_wallet in the loop
   body).

3. **`--watch` + `--checkpoint-every 0`** — validation must fire on
   the first iteration before any RPC; the `?` propagates and the
   watch loop aborts cleanly. Test points at an unreachable RPC URL
   so a hang would deadline rather than hide; asserts the validation
   error message survives through the watch loop.

The panic test caused PoisonError in the Mutex protecting
SYNC_CHECKPOINT_HOOK across subsequent tests in the suite. Made the
three accessor functions poison-tolerant via
`unwrap_or_else(|e| e.into_inner())` — the protected data is just
`Option<Box<dyn Fn>>`, safe to overwrite regardless of poison state.
This also defends against any future panic-from-hook regression
breaking unrelated tests.
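The poison-tolerance pattern can be reproduced minimally (a plain `Option<i32>` stands in for the real `Option<Box<dyn Fn>>` hook payload):

```rust
use std::sync::Mutex;

// Stand-in for the real SYNC_CHECKPOINT_HOOK payload.
static HOOK: Mutex<Option<i32>> = Mutex::new(None);

fn set_hook(v: Option<i32>) {
    // into_inner() recovers the guard even if a previous holder panicked;
    // the value is simply overwritten, so poison carries no information here.
    *HOOK.lock().unwrap_or_else(|e| e.into_inner()) = v;
}

fn get_hook() -> Option<i32> {
    *HOOK.lock().unwrap_or_else(|e| e.into_inner())
}

fn main() {
    std::panic::set_hook(Box::new(|_| {})); // keep test output quiet
    set_hook(Some(7));
    // Poison the mutex by panicking while holding the guard.
    let _ = std::panic::catch_unwind(|| {
        let _g = HOOK.lock().unwrap();
        panic!("hook panicked");
    });
    // A plain .unwrap() here would now propagate PoisonError; the
    // tolerant accessors still work.
    set_hook(Some(42));
    assert_eq!(get_hook(), Some(42));
    println!("ok");
}
```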

107/107 tests pass (was 104, +3 new). All cooperative-yield tests
remain serialized via #[serial(sync_checkpoint_hook)] since they
share the global hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saroupille pushed a commit to saroupille/tzel that referenced this pull request May 10, 2026
Strategic audit on PR trilitech#24 caught a real bug class: if the daemon
crashes after `touch <wallet>.yield` but before its RAII guard's
`rm`, the sentinel is permanent. Every subsequent sync would exit
at the first checkpoint, make K commits of progress, and exit
again — wedged forever, identical failure shape to the
stale-WalletLock case the existing `is_stale_wallet_lock` recovery
solves.

Mirror that discipline:

- New `is_stale_yield_sentinel(path)` reads sentinel content,
  parses first line as PID, probes `/proc/<pid>`. Forward-compat:
  any read or parse failure returns false ("treat as live"), so
  legacy daemons + tests writing `b"yield"` keep working.
- In `cmd_rollup_sync`'s sentinel-check, when the sentinel exists
  and `is_stale_yield_sentinel` returns true: unlink + log warning
  + `continue` (don't yield this iteration). Otherwise yield as
  before.
- The companion daemon-side PR (tzel-infra #55) needs a follow-up
  to write `<pid>\n` into the sentinel on touch. Until that lands,
  this CLI change is harmless: it falls through to "treat as live"
  for the legacy `b"yield"` content.

Two new tests:
- `cmd_rollup_sync_recovers_from_stale_yield_sentinel`: pre-create
  sentinel with PID 999999 (above kernel pid_max ceiling on dev
  envs), assert sync runs to tree_size and unlinks the sentinel.
- `cmd_rollup_sync_treats_legacy_sentinel_content_as_live`: pre-
  create sentinel with `b"yield"`, assert sync yields normally and
  does NOT unlink (daemon owns lifecycle for live sentinels).

109/109 tests pass (was 107, +2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saroupille saroupille marked this pull request as draft May 10, 2026 09:07
@saroupille saroupille force-pushed the feat/sync-cooperative-yield branch from 1d89749 to e69857b Compare May 10, 2026 09:54
@saroupille saroupille marked this pull request as ready for review May 10, 2026 10:03
@saroupille saroupille force-pushed the feat/sync-cooperative-yield branch 6 times, most recently from 7a3cd6b to abfdf43 Compare May 10, 2026 13:32
`tzel-wallet sync` previously scanned `[scanned, tree_size)` sequentially,
persisting `wallet.json.scanned` only at the end. While running it held an
exclusive lock on `wallet.json.lock`, so every other wallet op (shield,
transfer, unshield, balance, sync, …) blocked for the full duration. On a
long-running rollup that's 5–45 min wallclock — unacceptable in
daemon-driven flows where users want to act now.

This change makes the scan that does need to run preemptable:

1. Incremental persist. `cmd_rollup_sync` scans in batches of
   `--checkpoint-every` commits (default `DEFAULT_CHECKPOINT_EVERY = 50`,
   tunable via `TZEL_SYNC_CHECKPOINT_EVERY` env var). Each batch ends in
   a `save_wallet` that records the new cursor + newly-discovered notes
   atomically (temp + fsync + rename + parent-dir fsync, the existing
   invariant — just called more often).

2. Sentinel check at every checkpoint. After persisting a batch, the
   sync stat()s `<wallet>.yield`. Present ⇒ exit 0 cleanly. Next sync
   resumes from the persisted cursor.

3. Stale-PID recovery. The daemon writes its PID into the sentinel; the
   CLI checks `/proc/<pid>` and unlinks if dead. Mirrors the existing
   `WalletLock` discipline. The recovery warning includes the dead PID
   so operators can correlate with their daemon-restart logs. Without
   this, a daemon crash mid-slow-op would wedge every subsequent sync
   forever.

4. Final flush. Nullifier prune + pending-pool prune run once over the
   cumulative cm set accumulated across batches.

5. Consistency-pinned helpers. `load_pool_balances_at_block` and
   `load_state_snapshot_at_block` take an explicit `block_ref` so
   callers performing multiple reads (note slice + nullifiers + pool
   balances) observe the same kernel state. The head-resolving wrappers
   are renamed `_at_head` and reserved for single-shot callers; the
   naming asymmetry makes future regressions of this consistency
   property loud at code review. `cmd_rollup_sync` finalize and
   `cmd_wallet_check` both use the pinned form against a `head_hash`
   captured at the top of the function.

The companion daemon-side change in trilitech/tzel-infra#55 wraps
slow-lane CLI ops (shield/send/unshield/withdraw) with `touch
<wallet>.yield` before admit + RAII-removed after. Once both binaries
ship, a long initial sync no longer starves slow-lane ops: the
slow-lane signals "yield please" via the file, the next checkpoint
sees it, sync exits cleanly, slow-lane gets the lock. The contract is
forward-compatible: a daemon shipping the sentinel without this CLI's
stale-PID recovery is harmless.

Considered alternative: shadow-wallet sync (scan against a scratch
copy, merge back at K-commit boundaries). Documented in
`docs/shadow-wallet-design.md` (tzel-infra). Rejected after a
comparative audit: same fsync count + same K-commit discovery latency,
but introduces a silent failure class (clear-before-rename ordering on
the delta accumulator can drop nullifier observations, surfacing only
at proving time as an opaque double-spend). Cooperative-yield's
worst-case footgun is a stale sentinel — loud, immediately
diagnosable, fixable with `rm`. Loud beats silent.

Tests: 12 cooperative-yield tests in `network_profile_tests` (gated
on `#[serial(sync_checkpoint_hook)]` since they share the
process-global `SYNC_CHECKPOINT_HOOK`). They cover the primary yield
paths (pre-existing sentinel, mid-scan sentinel, custom K, watch-loop
validation, three-iteration watch-loop body), the wallet-file
invariants (intermediate parseability, panic-mid-batch atomic-rename),
and the recovery paths (HTTP error mid-batch, stale PID, legacy
non-PID content).

Plus four post-audit regression tests for the consistency-pin
invariant: `rollup_rpc_load_pool_balances_at_block_uses_pinned_block`
and `rollup_rpc_load_state_snapshot_at_block_uses_pinned_block` lock
the helper-layer property; `cmd_rollup_sync_pins_pool_reads_to_finalize_head`
e2e-tests the call site against a stateful HTTP mock; and
`fixture_for_finalize_pin_test_actually_evicts_under_regression_helper`
proves the fixture's failure mechanism is real (not tautological) by
running the same routes through the regression helper.

Result: `cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib`
→ 118 passed, 0 failed, 3 ignored.

CLI surface:
  tzel-wallet sync                          # K=50 default
  tzel-wallet sync --checkpoint-every 200   # less overhead, slower preempt
  tzel-wallet sync --checkpoint-every 20    # faster preempt, more fsyncs
  TZEL_SYNC_CHECKPOINT_EVERY=100 tzel-wallet sync  # operator override

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saroupille saroupille force-pushed the feat/sync-cooperative-yield branch from abfdf43 to 107ff88 Compare May 10, 2026 14:13
@saroupille saroupille merged commit 6772208 into trilitech:main May 10, 2026
2 checks passed
saroupille pushed a commit to saroupille/tzel that referenced this pull request May 10, 2026
The 67k-commit ushuaianet sync wallclock was dominated by sequential
ureq round trips against the rollup-node's durable-state RPC (~40 ms
each, ~95% of total). Two orthogonal accelerations land here:

* 2.A — concurrent HTTP fetch in `RollupRpc::load_notes_since_at_block`
  via `reqwest` + `futures_util::FuturesUnordered`. The async runtime
  is hosted on a dedicated worker thread so the path composes with both
  the synchronous CLI dispatch and the multi-thread tokio runtime that
  powers `tzel-detect`'s axum handlers (calling `block_on` from inside
  a tokio runtime panics — the worker-thread bridge sidesteps that
  without forcing every caller to be async).

  Concurrency tuned by `TZEL_SYNC_CONCURRENCY` (default 8, hard cap
  128). The cap rejects misconfiguration before it can fill the rollup
  node's TCP backlog. Errors short-circuit the batch — same
  abort-on-first-error contract the sequential loop had.

* 2.C — `apply_scan_feed`'s ML-KEM-768 trial-decrypt loop runs through
  rayon's default global pool. The decrypt is embarrassingly parallel:
  `try_recover_note` is `&self`, addresses don't change inside the
  call, recovered notes are independent. Recovery results are merged
  sequentially after the parallel pass to preserve the existing
  println-then-push order the test suite relies on.

A one-line stderr banner at sync start surfaces both tunables to the
operator. Composes orthogonally with `--at-tree-size` (PR trilitech#23, merged)
and the cooperative-yield work in PR trilitech#24.

NOT included: 2.B (batched RPC, deferred to upstream octez) and 2.D
(monitor stream, a separate concern).

Tests: 3 new in `network_profile_tests`:
  * concurrent_fetch_returns_same_results_as_sequential — out-of-order
    completion is reassembled correctly (N=1 vs N=8 against a
    100-commit mock).
  * concurrent_fetch_aborts_on_5xx — a 503 mid-batch propagates the
    error with HTTP status preserved, no silent retry.
  * parallel_decrypt_returns_same_results_as_sequential — single-thread
    rayon pool vs default pool yield identical recovered notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saroupille added a commit that referenced this pull request May 10, 2026
The 67k-commit ushuaianet sync wallclock was dominated by sequential
ureq round trips against the rollup-node's durable-state RPC (~40 ms
each, ~95% of total). Three orthogonal accelerations land here, all
firing on the post-#24 cooperative-yield path that `cmd_rollup_sync`
actually executes:

* 2.A — concurrent HTTP fetch in `RollupRpc::load_notes_since_at_block`
  via `reqwest` + `futures_util::FuturesUnordered`. The async runtime
  is hosted on a dedicated worker thread so the path composes with both
  the synchronous CLI dispatch and the multi-thread tokio runtime that
  powers `tzel-detect`'s axum handlers (calling `block_on` from inside
  a tokio runtime panics — the worker-thread bridge sidesteps that
  without forcing every caller to be async).

  Concurrency tuned by `TZEL_SYNC_CONCURRENCY` (default 4 — matches CI
  vCPU count, ship-safe across every rollup-node we know about; bumping
  above 4 should be opt-in once an operator measures their own
  rollup-node's tail-latency curve via `services/scan-bench/`). The 128
  hard cap rejects misconfiguration before it can fill the rollup
  node's TCP backlog. Errors short-circuit the batch — same
  abort-on-first-error contract the sequential loop had.
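One way such a tunable could be parsed, assuming reject-not-clamp semantics above the cap (the function name and the rejection of zero are this sketch's assumptions, not the shipped code):

```rust
// Defaults from the commit message; parsing shape is illustrative.
const DEFAULT_SYNC_CONCURRENCY: usize = 4;
const SYNC_CONCURRENCY_CAP: usize = 128;

/// Parse a TZEL_SYNC_CONCURRENCY-style value: absent means the shipped
/// default; anything unparseable, zero, or over the hard cap is rejected
/// outright rather than clamped, so a typo fails before it can fill the
/// rollup node's TCP backlog.
fn sync_concurrency_from(raw: Option<&str>) -> Result<usize, String> {
    let n = match raw {
        None => return Ok(DEFAULT_SYNC_CONCURRENCY),
        Some(s) => s
            .parse::<usize>()
            .map_err(|e| format!("TZEL_SYNC_CONCURRENCY: {e}"))?,
    };
    if n == 0 || n > SYNC_CONCURRENCY_CAP {
        return Err(format!(
            "TZEL_SYNC_CONCURRENCY must be in 1..={SYNC_CONCURRENCY_CAP}, got {n}"
        ));
    }
    Ok(n)
}
```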

* 2.C — `apply_scan_feed_recover_batch` (PR #24's per-batch recover
  function, the function `cmd_rollup_sync` actually calls) routes its
  ML-KEM-768 trial-decrypt loop through rayon's default global pool.
  The decrypt is embarrassingly parallel: `try_recover_note` is
  `&self`, addresses don't change inside the call, recovered notes are
  independent. Recovery results are merged sequentially after the
  parallel pass to preserve the existing println-then-push order the
  test suite relies on. (The legacy `apply_scan_feed` on the dead
  `cmd_scan` path was already parallelised in an earlier draft of this
  PR, but the post-#24 hot path runs through
  `apply_scan_feed_recover_batch` — without this commit, the rayon
  par_iter never fired on a real `tzel-wallet sync` invocation.)
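The order-preserving discipline can be sketched without rayon: chunk the inputs, recover each chunk on its own thread, then concatenate chunk results in chunk order, so the merged output matches the sequential loop exactly. (`recover_parallel` and its closure are hypothetical stand-ins for `apply_scan_feed_recover_batch` / `try_recover_note`.)

```rust
use std::thread;

/// Parallel pass over the inputs (rayon's par_iter in the real code),
/// followed by a sequential merge that drops misses while keeping
/// input order, so output order is identical to a plain sequential map.
fn recover_parallel<F>(inputs: Vec<u32>, recover: F) -> Vec<u32>
where
    F: Fn(u32) -> Option<u32> + Send + Copy + 'static,
{
    let mid = inputs.len() / 2;
    let (left, right) = (inputs[..mid].to_vec(), inputs[mid..].to_vec());
    // Two threads stand in for the thread pool; each keeps its chunk's order.
    let lh = thread::spawn(move || left.into_iter().map(recover).collect::<Vec<_>>());
    let rh = thread::spawn(move || right.into_iter().map(recover).collect::<Vec<_>>());
    let mut results = lh.join().unwrap();
    results.extend(rh.join().unwrap());
    // Sequential merge: the println-then-push step in the real code.
    results.into_iter().flatten().collect()
}
```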

* 2.A continued — `SyncFetcher` long-lived context. Pre-fix:
  `fetch_published_notes_concurrent` was called per-batch and rebuilt
  the `reqwest::Client` (and a fresh tokio runtime + thread) every
  time. At the new K=250 across a 67k-commit sync that's still ~268
  client builds, each warming a fresh connection pool from cold —
  the design doc's "amortised TCP+TLS handshake" claim was false in
  that shape. `SyncFetcher` owns one `reqwest::Client` + worker thread
  + tokio runtime that survive the whole `cmd_rollup_sync` call. Pool
  warms once. (Across `--watch` iterations the pool is rebuilt; the
  per-iteration scan is small once caught up.)
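The contract pinned by `sync_fetcher_amortises_client_across_batches` can be illustrated with a counter-tracked dummy: the expensive client (a `reqwest::Client` plus runtime in the real code, a `String` here) is built once in the constructor and reused by every batch call. All names in this std-only sketch are illustrative:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

struct SyncFetcher {
    client: String, // stands in for the long-lived reqwest::Client
}

impl SyncFetcher {
    /// The one expensive build; the counter lets a test observe how
    /// many times it ran.
    fn new(build_counter: &AtomicUsize) -> Self {
        build_counter.fetch_add(1, Ordering::SeqCst);
        SyncFetcher { client: String::from("warm-client") }
    }

    /// Services a batch by reusing `self.client`; no per-batch
    /// construction, so the pool warms exactly once.
    fn fetch_batch(&self, batch: usize) -> String {
        format!("{}:batch{}", self.client, batch)
    }
}
```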

A one-line stderr banner at sync start surfaces both tunables to the
operator. Composes orthogonally with `--at-tree-size` (PR #23, merged)
and the cooperative-yield work in PR #24, including the PR #24 body's
line-90 note to "re-tune K to 250–500 once P2 lands":
`DEFAULT_CHECKPOINT_EVERY` bumps from 50 → 250 (the conservative end of
that range).
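The K re-tune only changes batch geometry. A hypothetical sketch of how the `[scanned, tree_size)` range splits into checkpoint batches at the new default (`checkpoint_batches` is illustrative, not the shipped helper):

```rust
const DEFAULT_CHECKPOINT_EVERY: u64 = 250;

/// Split [scanned, tree_size) into K-commit half-open batches; in the
/// real sync loop each batch ends with a persisted cursor and a
/// sentinel check before the next one starts.
fn checkpoint_batches(scanned: u64, tree_size: u64, k: u64) -> Vec<(u64, u64)> {
    let mut out = Vec::new();
    let mut lo = scanned;
    while lo < tree_size {
        let hi = (lo + k).min(tree_size);
        out.push((lo, hi));
        lo = hi;
    }
    out
}
```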

NOT included: 2.B (batched RPC, deferred to upstream octez) and 2.D
(monitor stream, a separate concern).

Tests: 4 in `network_profile_tests` (122 lib tests pass total, was 117
on `main` pre-#24-chain):
  * concurrent_fetch_returns_same_results_as_sequential — out-of-order
    completion is reassembled correctly (N=1 vs N=8 against a
    100-commit mock).
  * concurrent_fetch_aborts_on_5xx — a 503 mid-batch propagates the
    error with HTTP status preserved, no silent retry, AND a counted
    mock asserts cancellation short-circuited (served < total).
  * parallel_decrypt_returns_same_results_as_sequential — independent
    fixture-time oracle (50 mixed recoverable + non-recoverable
    notes); both sequential and parallel branches must produce
    exactly the oracle set. Mutating
    `apply_scan_feed_recover_batch` to drop half the recoveries fails
    this test loud (verified).
  * sync_fetcher_amortises_client_across_batches — pins the contract
    that one `reqwest::Client` services multiple sequential batches.

Co-authored-by: François Thiré <franth2@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>