wallet: cooperative-yield sync (incremental persist + sentinel) #24
Merged
saroupille merged 1 commit on May 10, 2026
Conversation
saroupille pushed a commit to saroupille/tzel that referenced this pull request on May 9, 2026
Two independent reviewers on PR trilitech#24 surfaced one real bug, two test gaps, and three docstring follow-ups. All applied:

**Real bug (Reviewer A)** — `w.scanned + checkpoint_every` at lib.rs:7721 used unchecked addition. With an adversarial `--checkpoint-every` (close to `usize::MAX` on 64-bit), that would panic in debug / wrap in release. Switched to `w.scanned.saturating_add(checkpoint_every).min(tree_size)`. The atomic-rename invariant on `save_wallet` means a wrap couldn't have corrupted on-disk state, but the sync would have silently misbehaved with a tiny degenerate batch. Belt-and-braces.

**Test gaps**:

- All four cooperative-yield tests share the process-global `SYNC_CHECKPOINT_HOOK` (a `OnceLock<Mutex<...>>`). cargo-test runs in-binary tests in parallel; without serialization, one test's hook could fire under another test's loop. Added `serial_test` dev-dep and `#[serial(sync_checkpoint_hook)]` on all four (now five).
- All four tests hardcoded `DEFAULT_CHECKPOINT_EVERY` (50). A future regression that ignored the runtime parameter and used the const internally would still pass. Added `cmd_rollup_sync_respects_custom_checkpoint_every` exercising K=7 with a sentinel trip at K*2; it locks the contract that the flag actually does something.

**Docstring follow-ups (Reviewer B)**:

- The previous text claimed `services/scan-bench/` of tzel-infra informed the K=50 default. False: scan-bench measures concurrent-fetch tolerance (the orthogonal N dial in design doc §2.A); the K=50 numbers come from a `save_wallet` micro-bench that doesn't exist yet. Tightened to call out the dial separation and acknowledge the pending bench.
- Added the operator-vs-laptop calibration sentence: raise to 100-200 on slow / synchronously-replicated storage; drop to 20-30 on fast operator boxes when sub-second preemption matters.
- Added a one-liner about `--watch` × `--checkpoint-every` composition (each polling iteration honours the value independently).
**Env var (Reviewer B F4)** — added `env = "TZEL_SYNC_CHECKPOINT_EVERY"` to the clap arg, mirroring the precedent the design doc set for `TZEL_SYNC_CONCURRENCY` (doc §4 line 187). Operator-deployed daemons can now override K without recompiling. Required the `clap` "env" feature; added.

Tests: 104/104 pass (was 103, +1 with the new K=7 test). No regressions to the 99 baseline tests.

Deferred to follow-up PRs (out of scope here):

- Daemon-side runner.rs pass-through of TZEL_SYNC_CHECKPOINT_EVERY (tzel-infra PR #55, before its merge — otherwise the daemon ships a hardcoded K=50 contract).
- friendlyError pattern for the validation error once the env var exposes it to non-CLI users.
- Re-tune K=250-500 once the design doc's 2.A (concurrent fetch) lands; at the post-2.A wallclock, K=50 means ~7 fsyncs/sec, which is non-trivial sustained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
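The saturating batch-boundary fix can be sketched as a free function (a hypothetical helper name; the real code computes this inline in `cmd_rollup_sync`):

```rust
// Hypothetical helper mirroring the fix: clamp the batch end so an
// adversarial --checkpoint-every can neither overflow nor overrun the tree.
fn batch_end(scanned: usize, checkpoint_every: usize, tree_size: usize) -> usize {
    scanned.saturating_add(checkpoint_every).min(tree_size)
}

fn main() {
    assert_eq!(batch_end(0, 50, 67_000), 50); // normal checkpoint boundary
    assert_eq!(batch_end(66_980, 50, 67_000), 67_000); // last partial batch clamps
    assert_eq!(batch_end(10, usize::MAX, 67_000), 67_000); // huge K saturates, no wrap
    println!("batch_end ok");
}
```

With unchecked `+`, the third case would wrap in release builds and produce a tiny degenerate batch; `saturating_add` makes the clamp to `tree_size` unconditional.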
saroupille pushed a commit to saroupille/tzel that referenced this pull request on May 10, 2026
The audit on PR trilitech#24 catalogued the interruption modes for cmd_rollup_sync: sentinel-driven yield (4 tests) and zero-K validation (1 test) were already covered, but three error paths weren't:

1. **HTTP error mid-batch** — a 5xx from rollup-rpc partway through a batch must propagate as Err with `wallet.json.scanned` left at the last successful checkpoint boundary, never at the failed cursor. Test mocks 503 on note index 15 with K=10; sync errors out, scanned sticks at 10 (first batch boundary), and the sentinel is never created by the CLI (the daemon owns the sentinel lifecycle).
2. **Panic mid-batch** — a Rust panic from inside the scan loop (e.g. a future Drop touching FS state) must NOT corrupt wallet.json. The atomic-rename invariant in save_wallet preserves the prior on-disk state. Test panics from the sync_checkpoint_hook at scanned=20, wraps cmd_rollup_sync in catch_unwind, then asserts the wallet reloads cleanly with scanned=20 (the just-completed checkpoint value, since the panic fires AFTER the save_wallet in the loop body).
3. **`--watch` + `--checkpoint-every 0`** — validation must fire on the first iteration before any RPC; the `?` propagates and the watch loop aborts cleanly. Test points at an unreachable RPC URL so a hang would deadline rather than hide; asserts the validation error message survives through the watch loop.

The panic test caused PoisonError in the Mutex protecting SYNC_CHECKPOINT_HOOK across subsequent tests in the suite. Made the three accessor functions poison-tolerant via `unwrap_or_else(|e| e.into_inner())` — the protected data is just `Option<Box<dyn Fn>>`, safe to overwrite regardless of poison state. This also defends against any future panic-from-hook regression breaking unrelated tests.

107/107 tests pass (was 104, +3 new). All cooperative-yield tests remain serialized via #[serial(sync_checkpoint_hook)] since they share the global hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
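The poison-tolerance pattern can be sketched with a stand-in hook global. The `OnceLock<Mutex<Option<Box<dyn Fn>>>>` shape follows the commit text; the accessor name and hook signature are illustrative, not the crate's real API:

```rust
use std::panic;
use std::sync::{Mutex, OnceLock};

type Hook = Box<dyn Fn(usize) + Send>;
static SYNC_CHECKPOINT_HOOK: OnceLock<Mutex<Option<Hook>>> = OnceLock::new();

// Poison-tolerant accessor: if a prior holder panicked, recover the guard
// anyway. The Option<Box<dyn Fn>> inside is safe to overwrite regardless.
fn set_hook(hook: Option<Hook>) {
    let cell = SYNC_CHECKPOINT_HOOK.get_or_init(|| Mutex::new(None));
    let mut guard = cell.lock().unwrap_or_else(|e| e.into_inner());
    *guard = hook;
}

fn main() {
    let cell = SYNC_CHECKPOINT_HOOK.get_or_init(|| Mutex::new(None));
    // Poison the mutex by panicking while its guard is held,
    // simulating the panic-from-hook test.
    let _ = panic::catch_unwind(panic::AssertUnwindSafe(|| {
        let _guard = cell.lock().unwrap();
        panic!("simulated panic-from-hook");
    }));
    assert!(cell.lock().is_err()); // a naive .unwrap() caller would now panic
    set_hook(None); // ...but the tolerant accessor still succeeds
    println!("recovered from poison");
}
```

A plain `.unwrap()` in `set_hook` would cascade the one poisoned lock into failures across every later test sharing the global.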
saroupille pushed a commit to saroupille/tzel that referenced this pull request on May 10, 2026
Strategic audit on PR trilitech#24 caught a real bug class: if the daemon crashes after `touch <wallet>.yield` but before its RAII guard's `rm`, the sentinel is permanent. Every subsequent sync would exit at the first checkpoint, make K commits of progress, and exit again — wedged forever, an identical failure shape to the stale-WalletLock case the existing `is_stale_wallet_lock` recovery solves. Mirror that discipline:

- New `is_stale_yield_sentinel(path)` reads the sentinel content, parses the first line as a PID, probes `/proc/<pid>`. Forward-compat: any read or parse failure returns false ("treat as live"), so legacy daemons + tests writing `b"yield"` keep working.
- In `cmd_rollup_sync`'s sentinel check, when the sentinel exists and `is_stale_yield_sentinel` returns true: unlink + log warning + `continue` (don't yield this iteration). Otherwise yield as before.
- The companion daemon-side PR (tzel-infra #55) needs a follow-up to write `<pid>\n` into the sentinel on touch. Until that lands, this CLI change is harmless: it falls through to "treat as live" for the legacy `b"yield"` content.

Two new tests:

- `cmd_rollup_sync_recovers_from_stale_yield_sentinel`: pre-create a sentinel with PID 999999 (above the kernel pid_max ceiling on dev envs), assert sync runs to tree_size and unlinks the sentinel.
- `cmd_rollup_sync_treats_legacy_sentinel_content_as_live`: pre-create a sentinel with `b"yield"`, assert sync yields normally and does NOT unlink (the daemon owns the lifecycle for live sentinels).

109/109 tests pass (was 107, +2 new).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
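The stale-sentinel probe described above can be sketched as follows. This is a Linux-only sketch (it probes `/proc`), with the function name taken from the commit text and everything else assumed:

```rust
use std::fs;
use std::path::Path;

// Sketch of the recovery check: first line parses as a PID and /proc/<pid>
// is gone => stale. Any read/parse failure => "treat as live", so legacy
// sentinels containing b"yield" are never unlinked by the CLI.
fn is_stale_yield_sentinel(path: &Path) -> bool {
    let Ok(content) = fs::read_to_string(path) else { return false };
    let Some(first_line) = content.lines().next() else { return false };
    let Ok(pid) = first_line.trim().parse::<u32>() else { return false };
    !Path::new(&format!("/proc/{pid}")).exists()
}

fn main() {
    let dir = std::env::temp_dir();
    let legacy = dir.join("legacy.yield");
    let live = dir.join("live.yield");
    let stale = dir.join("stale.yield");
    fs::write(&legacy, b"yield").unwrap();
    fs::write(&live, format!("{}\n", std::process::id())).unwrap();
    fs::write(&stale, b"999999\n").unwrap();
    assert!(!is_stale_yield_sentinel(&legacy)); // unparseable => treat as live
    assert!(!is_stale_yield_sentinel(&live));   // our own PID is alive
    assert!(is_stale_yield_sentinel(&stale));   // no /proc/999999 on dev envs
    println!("sentinel checks ok");
}
```

The "treat as live" default makes every failure mode conservative: the worst case is an unnecessary yield, never a deleted live sentinel.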
`tzel-wallet sync` previously scanned `[scanned, tree_size)` sequentially, persisting `wallet.json.scanned` only at the end. While running it held an exclusive lock on `wallet.json.lock`, so every other wallet op (shield, transfer, unshield, balance, sync, …) blocked for the full duration. On a long-running rollup that's 5–45 min wallclock — unacceptable in daemon-driven flows where users want to act now.

This change makes the scan that does need to run preemptable:

1. Incremental persist. `cmd_rollup_sync` scans in batches of `--checkpoint-every` commits (default `DEFAULT_CHECKPOINT_EVERY = 50`, tunable via the `TZEL_SYNC_CHECKPOINT_EVERY` env var). Each batch ends in a `save_wallet` that records the new cursor + newly-discovered notes atomically (temp + fsync + rename + parent-dir fsync, the existing invariant — just called more often).
2. Sentinel check at every checkpoint. After persisting a batch, the sync stat()s `<wallet>.yield`. Present ⇒ exit 0 cleanly. The next sync resumes from the persisted cursor.
3. Stale-PID recovery. The daemon writes its PID into the sentinel; the CLI checks `/proc/<pid>` and unlinks if dead. Mirrors the existing `WalletLock` discipline. The recovery warning includes the dead PID so operators can correlate with their daemon-restart logs. Without this, a daemon crash mid-slow-op would wedge every subsequent sync forever.
4. Final flush. Nullifier prune + pending-pool prune run once over the cumulative cm set accumulated across batches.
5. Consistency-pinned helpers. `load_pool_balances_at_block` and `load_state_snapshot_at_block` take an explicit `block_ref` so callers performing multiple reads (note slice + nullifiers + pool balances) observe the same kernel state. The head-resolving wrappers are renamed `_at_head` and reserved for single-shot callers; the naming asymmetry makes future regressions of this consistency property loud at code review.
`cmd_rollup_sync` finalize and `cmd_wallet_check` both use the pinned form against a `head_hash` captured at the top of the function.

The companion daemon-side change in trilitech/tzel-infra#55 wraps slow-lane CLI ops (shield/send/unshield/withdraw) with `touch <wallet>.yield` before admit + RAII-removed after. Once both binaries ship, a long initial sync no longer starves slow-lane ops: the slow lane signals "yield please" via the file, the next checkpoint sees it, sync exits cleanly, the slow lane gets the lock. The contract is forward-compatible: a daemon shipping the sentinel without this CLI's stale-PID recovery is harmless.

Considered alternative: shadow-wallet sync (scan against a scratch copy, merge back at K-commit boundaries). Documented in `docs/shadow-wallet-design.md` (tzel-infra). Rejected after a comparative audit: same fsync count + same K-commit discovery latency, but it introduces a silent failure class (clear-before-rename ordering on the delta accumulator can drop nullifier observations, surfacing only at proving time as an opaque double-spend). Cooperative-yield's worst-case footgun is a stale sentinel — loud, immediately diagnosable, fixable with `rm`. Loud beats silent.

Tests: 12 cooperative-yield tests in `network_profile_tests` (gated on `#[serial(sync_checkpoint_hook)]` since they share the process-global `SYNC_CHECKPOINT_HOOK`). They cover the primary yield paths (pre-existing sentinel, mid-scan sentinel, custom K, watch-loop validation, three-iteration watch-loop body), the wallet-file invariants (intermediate parseability, panic-mid-batch atomic-rename), and the recovery paths (HTTP error mid-batch, stale PID, legacy non-PID content).
Plus four post-audit regression tests for the consistency-pin invariant: `rollup_rpc_load_pool_balances_at_block_uses_pinned_block` and `rollup_rpc_load_state_snapshot_at_block_uses_pinned_block` lock the helper-layer property; `cmd_rollup_sync_pins_pool_reads_to_finalize_head` e2e-tests the call site against a stateful HTTP mock; and `fixture_for_finalize_pin_test_actually_evicts_under_regression_helper` proves the fixture's failure mechanism is real (not tautological) by running the same routes through the regression helper.

Result: `cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib` → 118 passed, 0 failed, 3 ignored.

CLI surface:

    tzel-wallet sync                                  # K=50 default
    tzel-wallet sync --checkpoint-every 200           # less overhead, slower preempt
    tzel-wallet sync --checkpoint-every 20            # faster preempt, more fsyncs
    TZEL_SYNC_CHECKPOINT_EVERY=100 tzel-wallet sync   # operator override

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
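The checkpointed scan loop described in the design (batch scan, atomic persist, sentinel check) can be sketched with stand-in types. Every name below is a stand-in for illustration, not the crate's real API:

```rust
// Minimal sketch of the cooperative-yield loop: scan K commits, persist
// atomically, then check the yield sentinel before starting the next batch.
struct Wallet { scanned: usize }

fn scan_range(_w: &mut Wallet, _from: usize, _to: usize) { /* fetch + trial-decrypt */ }
fn save_wallet(_w: &Wallet) { /* temp + fsync + rename + parent-dir fsync */ }
fn sentinel_present() -> bool { false } // stat() of <wallet>.yield in the real flow

fn sync(w: &mut Wallet, tree_size: usize, checkpoint_every: usize) {
    while w.scanned < tree_size {
        let from = w.scanned;
        // Clamped batch end: saturating_add guards against huge K values.
        let end = from.saturating_add(checkpoint_every).min(tree_size);
        scan_range(w, from, end);
        w.scanned = end;
        save_wallet(w); // atomic persist at every checkpoint boundary
        if sentinel_present() {
            return; // yield cleanly; the next sync resumes from w.scanned
        }
    }
}

fn main() {
    let mut w = Wallet { scanned: 0 };
    sync(&mut w, 120, 50); // batches: 0..50, 50..100, 100..120
    assert_eq!(w.scanned, 120);
    println!("synced to {}", w.scanned);
}
```

Because the cursor is persisted before the sentinel check, an exit at any checkpoint loses at most the not-yet-started batch, never completed work.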
saroupille pushed a commit to saroupille/tzel that referenced this pull request on May 10, 2026
The 67k-commit ushuaianet sync wallclock was dominated by sequential ureq round trips against the rollup-node's durable-state RPC (~40 ms each, ~95% of total). Two orthogonal accelerations land here:

* 2.A — concurrent HTTP fetch in `RollupRpc::load_notes_since_at_block` via `reqwest` + `futures_util::FuturesUnordered`. The async runtime is hosted on a dedicated worker thread so the path composes with both the synchronous CLI dispatch and the multi-thread tokio runtime that powers `tzel-detect`'s axum handlers (calling `block_on` from inside a tokio runtime panics — the worker-thread bridge sidesteps that without forcing every caller to be async). Concurrency is tuned by `TZEL_SYNC_CONCURRENCY` (default 8, hard cap 128). The cap rejects misconfiguration before it can fill the rollup node's TCP backlog. Errors short-circuit the batch — same abort-on-first-error contract the sequential loop had.
* 2.C — `apply_scan_feed`'s ML-KEM-768 trial-decrypt loop runs through rayon's default global pool. The decrypt is embarrassingly parallel: `try_recover_note` is `&self`, addresses don't change inside the call, recovered notes are independent. Recovery results are merged sequentially after the parallel pass to preserve the existing println-then-push order the test suite relies on.

A one-line stderr banner at sync start surfaces both tunables to the operator. Composes orthogonally with `--at-tree-size` (PR trilitech#23, merged) and the cooperative-yield work in PR trilitech#24. NOT included: 2.B (batched RPC, deferred to upstream octez) and 2.D (monitor stream, a separate concern).

Tests: 3 new in `network_profile_tests`:

* concurrent_fetch_returns_same_results_as_sequential — out-of-order completion is reassembled correctly (N=1 vs N=8 against a 100-commit mock).
* concurrent_fetch_aborts_on_5xx — a 503 mid-batch propagates the error with HTTP status preserved, no silent retry.
* parallel_decrypt_returns_same_results_as_sequential — single-thread rayon pool vs default pool yield identical recovered notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
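The reassembly property that the first test pins (out-of-order completion, in-order results) can be sketched with std threads standing in for the `reqwest` + `FuturesUnordered` machinery; the function name and payload are illustrative:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Each "request" completes out of order, but results are slotted back by
// request index, so the caller sees the sequential loop's ordering.
fn fetch_concurrent(indices: Vec<usize>) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();
    for (slot, idx) in indices.iter().copied().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || {
            // Simulated variable latency: higher indices finish first,
            // forcing out-of-order completion.
            thread::sleep(Duration::from_millis((8 - idx as u64) * 5));
            tx.send((slot, idx * 2)).unwrap(); // idx * 2 stands in for the payload
        });
    }
    drop(tx); // workers hold their own senders; close ours so rx terminates
    let mut out = vec![0; indices.len()];
    for (slot, payload) in rx {
        out[slot] = payload; // slot by request index, not completion order
    }
    out
}

fn main() {
    let got = fetch_concurrent((0..8).collect());
    let want: Vec<usize> = (0..8).map(|i| i * 2).collect();
    assert_eq!(got, want);
    println!("reassembled in order");
}
```

The indexed-slot merge is what lets N=8 produce byte-identical output to N=1, which is exactly what the equivalence test asserts.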
saroupille pushed a commit to saroupille/tzel that referenced this pull request on May 10, 2026
The 67k-commit ushuaianet sync wallclock was dominated by sequential ureq round trips against the rollup-node's durable-state RPC (~40 ms each, ~95% of total). Three orthogonal accelerations land here, all firing on the post-trilitech#24 cooperative-yield path that `cmd_rollup_sync` actually executes:

* 2.A — concurrent HTTP fetch in `RollupRpc::load_notes_since_at_block` via `reqwest` + `futures_util::FuturesUnordered`. The async runtime is hosted on a dedicated worker thread so the path composes with both the synchronous CLI dispatch and the multi-thread tokio runtime that powers `tzel-detect`'s axum handlers (calling `block_on` from inside a tokio runtime panics — the worker-thread bridge sidesteps that without forcing every caller to be async). Concurrency is tuned by `TZEL_SYNC_CONCURRENCY` (default 4 — matches CI vCPU count, ship-safe across every rollup-node we know about; bumping above 4 should be opt-in once an operator measures their own rollup-node's tail-latency curve via `services/scan-bench/`). The 128 hard cap rejects misconfiguration before it can fill the rollup node's TCP backlog. Errors short-circuit the batch — same abort-on-first-error contract the sequential loop had.
* 2.C — `apply_scan_feed_recover_batch` (PR trilitech#24's per-batch recover function, the function `cmd_rollup_sync` actually calls) routes its ML-KEM-768 trial-decrypt loop through rayon's default global pool. The decrypt is embarrassingly parallel: `try_recover_note` is `&self`, addresses don't change inside the call, recovered notes are independent. Recovery results are merged sequentially after the parallel pass to preserve the existing println-then-push order the test suite relies on. (The legacy `apply_scan_feed` on the dead `cmd_scan` path was already parallelised in an earlier draft of this PR, but the post-trilitech#24 hot path runs through `apply_scan_feed_recover_batch` — without this commit, the rayon par_iter never fired on a real `tzel-wallet sync` invocation.)
* 2.A continued — `SyncFetcher` long-lived context. Pre-fix: `fetch_published_notes_concurrent` was called per-batch and rebuilt the `reqwest::Client` (and a fresh tokio runtime + thread) every time. At the new K=250 across a 67k-commit sync that's still ~268 client builds, each warming a fresh connection pool from cold — the design doc's "amortised TCP+TLS handshake" claim was false in that shape. `SyncFetcher` owns one `reqwest::Client` + worker thread + tokio runtime that survive the whole `cmd_rollup_sync` call. The pool warms once. (Across `--watch` iterations the pool is rebuilt; the per-iteration scan is small once caught up.)

A one-line stderr banner at sync start surfaces both tunables to the operator. Composes orthogonally with `--at-tree-size` (PR trilitech#23, merged) and the cooperative-yield work in PR trilitech#24, including PR trilitech#24 body line 90's "re-tune K to 250–500 once P2 lands" — `DEFAULT_CHECKPOINT_EVERY` bumps from 50 → 250 (conservative end of that range). NOT included: 2.B (batched RPC, deferred to upstream octez) and 2.D (monitor stream, a separate concern).

Tests: 4 in `network_profile_tests` (122 lib tests pass total, was 117 on `main` pre-trilitech#24-chain):

* concurrent_fetch_returns_same_results_as_sequential — out-of-order completion is reassembled correctly (N=1 vs N=8 against a 100-commit mock).
* concurrent_fetch_aborts_on_5xx — a 503 mid-batch propagates the error with HTTP status preserved, no silent retry, AND a counted mock asserts cancellation short-circuited (served < total).
* parallel_decrypt_returns_same_results_as_sequential — independent fixture-time oracle (50 mixed recoverable + non-recoverable notes); both sequential and parallel branches must produce exactly the oracle set. Mutating `apply_scan_feed_recover_batch` to drop half the recoveries fails this test loud (verified).
* sync_fetcher_amortises_client_across_batches — pins the contract that one `reqwest::Client` services multiple sequential batches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
saroupille added a commit that referenced this pull request on May 10, 2026
The 67k-commit ushuaianet sync wallclock was dominated by sequential ureq round trips against the rollup-node's durable-state RPC (~40 ms each, ~95% of total). Three orthogonal accelerations land here, all firing on the post-#24 cooperative-yield path that `cmd_rollup_sync` actually executes: * 2.A — concurrent HTTP fetch in `RollupRpc::load_notes_since_at_block` via `reqwest` + `futures_util::FuturesUnordered`. The async runtime is hosted on a dedicated worker thread so the path composes with both the synchronous CLI dispatch and the multi-thread tokio runtime that powers `tzel-detect`'s axum handlers (calling `block_on` from inside a tokio runtime panics — the worker-thread bridge sidesteps that without forcing every caller to be async). Concurrency tuned by `TZEL_SYNC_CONCURRENCY` (default 4 — matches CI vCPU count, ship-safe across every rollup-node we know about; bumping above 4 should be opt-in once an operator measures their own rollup-node's tail-latency curve via `services/scan-bench/`). The 128 hard cap rejects misconfiguration before it can fill the rollup node's TCP backlog. Errors short-circuit the batch — same abort-on-first-error contract the sequential loop had. * 2.B — `apply_scan_feed_recover_batch` (PR #24's per-batch recover function, the function `cmd_rollup_sync` actually calls) routes its ML-KEM-768 trial-decrypt loop through rayon's default global pool. The decrypt is embarrassingly parallel: `try_recover_note` is `&self`, addresses don't change inside the call, recovered notes are independent. Recovery results are merged sequentially after the parallel pass to preserve the existing println-then-push order the test suite relies on. (The legacy `apply_scan_feed` on the dead `cmd_scan` path was already parallelised in an earlier draft of this PR, but the post-#24 hot path runs through `apply_scan_feed_recover_batch` — without this commit, the rayon par_iter never fired on a real `tzel-wallet sync` invocation.) 
* 2.A continued — `SyncFetcher` long-lived context. Pre-fix: `fetch_published_notes_concurrent` was called per-batch and rebuilt the `reqwest::Client` (and a fresh tokio runtime + thread) every time. At the new K=250 across a 67k-commit sync that's still ~268 client builds, each warming a fresh connection pool from cold — the design doc's "amortised TCP+TLS handshake" claim was false in that shape. `SyncFetcher` owns one `reqwest::Client` + worker thread + tokio runtime that survive the whole `cmd_rollup_sync` call. Pool warms once. (Across `--watch` iterations the pool is rebuilt; the per-iteration scan is small once caught up.) A one-line stderr banner at sync start surfaces both tunables to the operator. Composes orthogonally with `--at-tree-size` (PR #23, merged) and the cooperative-yield work in PR #24, including PR #24 body line 90's "re-tune K to 250–500 once P2 lands" — `DEFAULT_CHECKPOINT_EVERY` bumps from 50 → 250 (conservative end of that range). NOT included: 2.B (batched RPC, deferred to upstream octez) and 2.D (monitor stream, a separate concern). Tests: 4 in `network_profile_tests` (122 lib tests pass total, was 117 on `main` pre-#24-chain): * concurrent_fetch_returns_same_results_as_sequential — out-of-order completion is reassembled correctly (N=1 vs N=8 against a 100-commit mock). * concurrent_fetch_aborts_on_5xx — a 503 mid-batch propagates the error with HTTP status preserved, no silent retry, AND a counted mock asserts cancellation short-circuited (served < total). * parallel_decrypt_returns_same_results_as_sequential — independent fixture-time oracle (50 mixed recoverable + non-recoverable notes); both sequential and parallel branches must produce exactly the oracle set. Mutating `apply_scan_feed_recover_batch` to drop half the recoveries fails this test loud (verified). * sync_fetcher_amortises_client_across_batches — pins the contract that one `reqwest::Client` services multiple sequential batches. 
Co-authored-by: François Thiré <franth2@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
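The amortisation contract pinned by `sync_fetcher_amortises_client_across_batches` reduces to an ownership shape: one long-lived context built once, reused for every batch. A rough std-only sketch, with a hypothetical unit `Client` standing in for the expensive `reqwest::Client` + runtime + worker thread the real `SyncFetcher` owns:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times the expensive resource is constructed.
static CLIENT_BUILDS: AtomicUsize = AtomicUsize::new(0);

// Stand-in for an expensive-to-build HTTP client (pool, TLS config).
struct Client;

impl Client {
    fn new() -> Self {
        CLIENT_BUILDS.fetch_add(1, Ordering::SeqCst);
        Client
    }
    fn get(&self, index: usize) -> String {
        format!("note-{index}")
    }
}

// Long-lived fetcher: the client is built once and reused for every batch,
// so the connection pool warms once per sync call instead of once per batch.
struct SyncFetcher {
    client: Client,
}

impl SyncFetcher {
    fn new() -> Self {
        SyncFetcher { client: Client::new() }
    }
    fn fetch_batch(&self, start: usize, len: usize) -> Vec<String> {
        (start..start + len).map(|i| self.client.get(i)).collect()
    }
}

fn main() {
    let fetcher = SyncFetcher::new();
    for batch in 0..268 {
        let _ = fetcher.fetch_batch(batch * 250, 250);
    }
    assert_eq!(CLIENT_BUILDS.load(Ordering::SeqCst), 1);
    println!("one client build across 268 batches");
}
```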
Problem
`tzel-wallet sync` scans `[scanned, tree_size)` sequentially, persisting `wallet.json.scanned` only at the end. While running it holds an exclusive lock on `wallet.json.lock`, so every other wallet op (shield, transfer, unshield, balance, sync, …) blocks for the full duration. On a long-running rollup that's 5–45 min wallclock — unacceptable in daemon-driven flows where users want to act now.

This is the second half of a two-PR pair. PR #23 (merged) added `cmd_init --at-tree-size N` so a fresh wallet skips the historical scan entirely. This PR makes the scan that DOES need to run preemptable so it doesn't starve the rest of the daemon.

Design — cooperative yield via sentinel file
Mechanism
* `cmd_rollup_sync` scans in batches of `--checkpoint-every` commits (default `DEFAULT_CHECKPOINT_EVERY = 50`). Each batch ends in a `save_wallet` that records `w.scanned = batch_end` plus newly-discovered notes atomically (temp + fsync + rename + parent-dir fsync — a pre-existing invariant, just called more often).
* At each checkpoint the CLI looks for the sentinel file `<wallet>.yield`. Present ⇒ exit 0 cleanly. Next sync resumes from the persisted cursor.
* Stale-sentinel recovery: the CLI reads the arming daemon's PID from the sentinel, probes `/proc/<pid>`, and unlinks if dead. Mirrors the existing `WalletLock` discipline. Without this, a daemon crash mid-slow-op would wedge every subsequent sync forever.

The companion change in trilitech/tzel-infra#55 wraps slow-lane CLI ops (shield/send/unshield/withdraw) with `touch <wallet>.yield` before admit + RAII-removed after. So a long initial sync no longer starves slow-lane ops: the slow-lane signals "yield please" via the file, the next checkpoint sees it, sync exits cleanly, and the slow-lane gets the lock.

CLI surface
The default 50 is reasoned, pending empirical measurement on a live `octez-smart-rollup-node`. Documented inline. Tunable range 10..500, with explicit operator-vs-laptop guidance: raise to 100–200 on slow disks, drop to 20–30 on fast operator boxes wanting sub-second preemption.

Considered alternative — shadow-wallet sync
A strategic-audit pass proposed an alternative design where the scan operates on a `<wallet>.scan` scratch copy with its own lock, and merges back into canonical at every K-commit boundary. Documented in `docs/shadow-wallet-design.md` (in tzel-infra) for traceability.

Why we picked cooperative-yield over shadow-wallet (per the comparative audit's verdict):
Cooperative-yield's failure mode (`kill -0` + unlink) is loud: sync exits early forever, immediately diagnosable, and `rm <wallet>.yield` resolves it. Loud beats silent. Both designs are tactical fixes for the same constraint (file-based wallet + flock). At ~2× the LOC for shadow-wallet plus a re-litigation of ADR-002, the migration cost (~2 weeks of senior-eng review) buys a "canonical never holds partial scan state" property that no caller currently exercises — while introducing a subtle data-loss class that surfaces only at proving time. Decision: ship cooperative-yield as-is.
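The mechanism above — K-commit batches, a durable cursor per batch, and a clean exit when the sentinel appears — can be sketched std-only. Function names here (`next_checkpoint`, `run_sync`) are illustrative, not the real API; `persist` stands in for the atomic `save_wallet`, and the saturating boundary computation mirrors the overflow fix called out in the audit follow-ups:

```rust
use std::fs;
use std::path::Path;

// Next batch boundary: saturating_add guards an adversarial
// --checkpoint-every near usize::MAX; min clamps the final partial batch.
fn next_checkpoint(scanned: usize, checkpoint_every: usize, tree_size: usize) -> usize {
    scanned.saturating_add(checkpoint_every).min(tree_size)
}

// One sync pass over [scanned, tree_size): persist the cursor at every
// K-commit boundary, then exit cleanly if the yield sentinel has appeared.
fn run_sync(
    mut scanned: usize,
    tree_size: usize,
    checkpoint_every: usize,
    sentinel: &Path,
    persist: &mut dyn FnMut(usize),
) -> usize {
    while scanned < tree_size {
        let batch_end = next_checkpoint(scanned, checkpoint_every, tree_size);
        // ... scan and trial-decrypt commits in [scanned, batch_end) ...
        scanned = batch_end;
        persist(scanned); // cursor is durable before the sentinel check
        if sentinel.exists() {
            break; // cooperative yield: next invocation resumes from `scanned`
        }
    }
    scanned
}

fn main() {
    let sentinel = std::env::temp_dir().join("demo-wallet.yield");
    let _ = fs::remove_file(&sentinel);

    let mut cursor = 0;
    assert_eq!(run_sync(0, 120, 50, &sentinel, &mut |c| cursor = c), 120);
    assert_eq!(cursor, 120); // boundaries hit: 50, 100, 120

    fs::write(&sentinel, b"yield").unwrap();
    assert_eq!(run_sync(0, 120, 50, &sentinel, &mut |c| cursor = c), 50);
    fs::remove_file(&sentinel).unwrap();

    // An adversarial K cannot overflow past tree_size.
    assert_eq!(next_checkpoint(10, usize::MAX, 120), 120);
    println!("checkpointing and yield behave as described");
}
```

Because the cursor is persisted before the sentinel check, a yield never loses scanned work: the next sync resumes exactly at the last durable boundary.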
Pre-merge audit follow-ups (commits f3a0755, 3edef12)
Independent reviewers caught a real consistency bug missed by the earlier audit rounds:
`cmd_rollup_sync` finalize pinned `head_hash` for the note slice + nullifiers but called the head-resolving `load_pool_balances` for pool reads. A slow-lane drain landing between the two reads would silently evict a still-funded pool. Fixed in f3a0755 (`load_pool_balances_at_block` taking an explicit `block_ref`) and covered by a regression test that drives two block hashes returning different balances.

A generalising review of the same pattern caught a second instance in `cmd_wallet_check` (banner pool summary), fixed in 3edef12. The same commit renames the head-resolving wrappers to `_at_head` so the naming asymmetry (`_at_head` vs `_at_block`) makes a future regression loud at code review, and enriches the stale-sentinel recovery warning with the dead daemon's PID for operator triage.

Tests
12 cooperative-yield tests in `network_profile_tests` (gated on `#[serial(sync_checkpoint_hook)]` since they share the process-global `SYNC_CHECKPOINT_HOOK`):

* `cmd_rollup_sync_exits_at_first_checkpoint_when_sentinel_present`
* `cmd_rollup_sync_resumes_from_checkpointed_cursor`
* `cmd_rollup_sync_intermediate_wallet_is_always_parseable`
* `cmd_rollup_sync_refuses_zero_checkpoint_every`
* `cmd_rollup_sync_respects_custom_checkpoint_every`
* `cmd_rollup_sync_http_error_mid_batch_preserves_last_checkpoint` — `scanned` at last batch boundary; sentinel untouched by CLI
* `cmd_rollup_sync_panic_mid_batch_preserves_wallet_json`
* `cmd_rollup_sync_watch_with_zero_checkpoint_every_aborts_first_iter`
* `cmd_rollup_sync_recovers_from_stale_yield_sentinel` — uses a `freshly_dead_pid()` helper instead of a hardcoded `999999`, robust against `kernel.pid_max` lore on container hosts
* `cmd_rollup_sync_watch_loop_body_yields_twice_then_finishes` — exercises `cmd_rollup_sync_watch`'s loop body; sentinel persists across iters 1+2, removed before iter 3. Locks the property that recovered notes from earlier iterations re-feed `known_cms` on the finalize iteration via `w.notes`, so a regression that moved `seen_cms` to a process-local accumulator (or stopped folding `w.notes` into `known_cms`) would surface
* `cmd_rollup_sync_treats_legacy_sentinel_content_as_live` — legacy sentinel content (`b"yield"`) ⇒ yield as before; CLI does NOT unlink (forward-compat with old daemons)

Plus four post-audit regression tests:
* `rollup_rpc_load_pool_balances_at_block_uses_pinned_block` — two block hashes returning different balances; asserts the pinned form reads from the caller's block, not from a re-resolved head.
* `rollup_rpc_load_state_snapshot_at_block_uses_pinned_block` — mirror for the state snapshot helper that `cmd_wallet_check` uses on the same pinned `head_hash` (Reviewer #1's ask).
* `cmd_rollup_sync_pins_pool_reads_to_finalize_head` — e2e test driving a full sync against a stateful HTTP mock whose `/head/hash` route returns `block_old` on the first read and `block_new` thereafter. The fixture publishes one note at index 0 wrapping `observed_cm`, so the batch's `seen_cms` includes it; the wallet's `pending_deposit.shielded_cm = Some(observed_cm)`, so finalize evaluates `cm_observed = true`. Pool funded at `block_old`, drained at `block_new`. Correct shape: pool reads at the pinned `block_old` (funded) ⇒ deposit retained AND the `head/hash` counter resolves exactly once. Regression (`load_pool_balances_at_head` at the call site): pool reads land on `block_new` ⇒ `drained_on_chain && cm_observed` ⇒ deposit evicted ⇒ assertion fails.
* `fixture_for_finalize_pin_test_actually_evicts_under_regression_helper` — companion that proves the fixture's failure mechanism is real (not tautological): same fixture, but invokes `load_pool_balances_at_head` directly + `apply_scan_feed_finalize`, asserts the deposit IS evicted. Locks in that the e2e test above protects a live mechanism.

The first cut of the e2e test (in commit 0392f29) was caught as tautological by independent reviewers A and B: it had `tree_size = 0` (the loop never ran) and `shielded_cm = None` (`cm_observed` always false), so the eviction predicate could not fire and the test passed regardless of the call-site shape. Both reviewers verified mechanically by reverting the call site to `_at_head` and observing the test still passed. The fixture is now corrected — and the companion test above verifies the failure mechanism via the helper without needing to mutate production code.

Result:
`cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib` → 118 passed, 0 failed, 3 ignored.

Companion change
trilitech/tzel-infra#55 (`feat/sync-cooperative-yield`) — the daemon dispatcher wraps slow-lane task execution with `SyncYieldGuard::arm` (writes the daemon PID into `<wallet>.yield`) + RAII `Drop` (rm), so a long-running sync yields the lock for shield/transfer/unshield/withdraw and resumes after.

The contract is forward-compatible: a daemon shipping the sentinel without this CLI's stale-PID recovery is harmless (the legacy CLI just yields normally, no recovery benefit). Once both binaries ship, the cooperative behaviour activates.
Deferred to follow-up PRs
* `runner.rs` pass-through of `TZEL_SYNC_CHECKPOINT_EVERY` via the env (unblocks operator override end-to-end).
* `friendlyError` UI pattern for the validation error once the env var exposes it to non-CLI users.
* `WalletFile` delegating concurrency to the storage layer (would obsolete this PR's mechanism and the shadow-wallet alternative).
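The deferred env pass-through could take roughly this shape — a hypothetical sketch, not the shipped code (the function and error strings are invented; only the env var name, the default, and the reject-zero rule come from this PR):

```rust
// Parse a TZEL_SYNC_CHECKPOINT_EVERY override for --checkpoint-every:
// reject zero (the CLI refuses it), fall back to the default when unset.
const DEFAULT_CHECKPOINT_EVERY: usize = 50;

fn checkpoint_every_from_env(raw: Option<&str>) -> Result<usize, String> {
    match raw {
        None => Ok(DEFAULT_CHECKPOINT_EVERY),
        Some(s) => match s.trim().parse::<usize>() {
            Ok(0) => Err("checkpoint-every must be non-zero".to_string()),
            // Guidance, not enforcement: 10..500, higher on slow disks,
            // lower when sub-second preemption matters.
            Ok(k) => Ok(k),
            Err(_) => Err(format!("invalid checkpoint-every: {s:?}")),
        },
    }
}

fn main() {
    assert_eq!(checkpoint_every_from_env(None), Ok(50));
    assert_eq!(checkpoint_every_from_env(Some("7")), Ok(7));
    assert!(checkpoint_every_from_env(Some("0")).is_err());
    assert!(checkpoint_every_from_env(Some("lots")).is_err());
    println!("env override parsed, zero rejected");
}
```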