wallet: concurrent HTTP + parallel decrypt for cmd_rollup_sync #25

Merged

saroupille merged 1 commit into trilitech:main from saroupille:feat/sync-acceleration on May 10, 2026

Conversation

@saroupille (Collaborator) commented May 10, 2026

Summary

The 67k-commit ushuaianet sync wallclock is dominated by sequential ureq
round trips against the rollup-node's durable-state RPC (~40 ms each,
~95% of total). This PR ships three orthogonal accelerations from the
sync-acceleration design (2.A concurrent HTTP, 2.B/2.C parallel ML-KEM-768
trial-decrypt) and bumps DEFAULT_CHECKPOINT_EVERY per PR #24's deferred
retune. The composition fires on the post-#24 cooperative-yield path that
cmd_rollup_sync actually executes — earlier drafts of this PR added
rayon to the legacy apply_scan_feed (reachable only from the dead
cmd_scan entrypoint), which meant the decrypt parallelism never fired
on a real tzel-wallet sync invocation. A reviewer caught that; this
revision fixes it.

This PR body deliberately drops the speedup numbers it previously claimed
("45 min → 5-6 min", "5-7× HTTP", "4-6× decrypt"). Those were estimates
without measurements behind them; the bench harness in
tzel-infra PR #57
now produces real numbers, and the design doc's §3.3 will be backfilled
when those land. Honest claims: each of the three layers does measurably
fire on the sync path (verified by adversarial mutation), and the
post-fix composition is the shape the design doc intended from the start.

2.A — concurrent HTTP fetch

RollupRpc::load_notes_since_at_block previously issued one synchronous
ureq::get per note. It now drives reqwest + futures_util::FuturesUnordered
with a tunable concurrency budget. Out-of-order completions are reassembled
by index; errors short-circuit the batch (same abort-on-first-error
contract the sequential loop had — verified by a counted mock that
asserts cancellation actually short-circuits).

The async runtime is hosted on a dedicated worker thread so the path
composes with both the synchronous CLI dispatch and the multi-thread
tokio runtime that powers tzel-detect's axum handlers — calling
block_on from inside a tokio runtime panics, the worker-thread bridge
sidesteps that without forcing every caller to be async.
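For illustration, the concurrent-fetch shape can be sketched with std-only
primitives (no reqwest, tokio, or FuturesUnordered here; `fetch_note` is a
hypothetical stand-in for the real per-note HTTP GET). The two invariants
are the same as in the PR: out-of-order completions land in index slots,
and the first error propagates without draining the rest.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the per-note HTTP GET (reqwest in the PR).
fn fetch_note(index: usize) -> Result<String, String> {
    Ok(format!("note-{index}"))
}

// Fetch `count` notes with at most `budget` in flight. Out-of-order
// completions are reassembled by index; the first error aborts the
// batch without waiting for the remaining in-flight requests.
fn fetch_concurrent(count: usize, budget: usize) -> Result<Vec<String>, String> {
    let (tx, rx) = mpsc::channel();
    let mut slots: Vec<Option<String>> = vec![None; count];
    let (mut next, mut in_flight, mut done) = (0usize, 0usize, 0usize);
    while done < count {
        // Keep the pipeline full up to the concurrency budget.
        while next < count && in_flight < budget {
            let (tx, i) = (tx.clone(), next);
            thread::spawn(move || {
                let _ = tx.send((i, fetch_note(i)));
            });
            next += 1;
            in_flight += 1;
        }
        let (i, res) = rx.recv().map_err(|e| e.to_string())?;
        in_flight -= 1;
        done += 1;
        slots[i] = Some(res?); // abort-on-first-error contract
    }
    Ok(slots.into_iter().map(Option::unwrap).collect())
}
```

`fetch_concurrent(10, 4)` returns the ten notes in index order regardless of
completion order, which is the property the N=1-vs-N=8 equivalence test pins.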

SyncFetcher long-lived context: cmd_rollup_sync builds one
reqwest::Client + worker thread + tokio runtime and reuses them across
every per-K-commit batch in the scan. That pool warms exactly once
instead of once per batch (~268 batches at K=250 across 67k commits).
Across --watch iterations the pool is rebuilt; the per-iteration scan
is small once the wallet has caught up, so cross-iteration amortisation
is on the follow-up list rather than in this PR.
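The long-lived context can be sketched the same way. This hypothetical
std-only SyncFetcher uses a plain worker thread where the PR hosts a tokio
runtime plus one reqwest::Client, but the contract is identical: one worker
survives every batch, and callers stay synchronous by blocking on a reply
channel.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;

// Hypothetical std-only shape of the long-lived fetch context. In the
// PR the worker thread hosts a tokio runtime + one reqwest::Client;
// here a plain thread stands in for both.
struct SyncFetcher {
    jobs: Sender<(Vec<usize>, Sender<Vec<String>>)>,
}

impl SyncFetcher {
    fn new() -> Self {
        let (jobs, rx) = mpsc::channel::<(Vec<usize>, Sender<Vec<String>>)>();
        thread::spawn(move || {
            // One long-lived worker: setup cost is paid once, not once
            // per batch; the loop ends when the job sender is dropped.
            for (batch, reply) in rx {
                let notes = batch.iter().map(|i| format!("note-{i}")).collect();
                let _ = reply.send(notes);
            }
        });
        Self { jobs }
    }

    // Synchronous facade: callers block on the reply channel instead
    // of becoming async themselves.
    fn fetch_batch(&self, indices: Vec<usize>) -> Vec<String> {
        let (reply, rx) = mpsc::channel();
        self.jobs.send((indices, reply)).expect("worker alive");
        rx.recv().expect("worker replied")
    }
}
```

Two back-to-back `fetch_batch` calls hit the same worker, which is the
warm-pool contract `sync_fetcher_amortises_client_across_batches` checks.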

2.B / 2.C — parallel decrypt on the post-#24 hot path

apply_scan_feed_recover_batch (PR #24's per-batch recover function,
the function cmd_rollup_sync actually calls) routes its trial-decrypt
loop through rayon's default global pool. The decrypt is embarrassingly
parallel: try_recover_note takes &self, addresses don't change
inside the call, recovered notes are independent. Recovery results are
merged sequentially after the parallel pass to preserve the existing
println!-then-push order the test suite relies on. Adversarial check:
running the test suite with RAYON_NUM_THREADS=1 adds ~20 s of
wallclock vs the default pool, proving rayon now fires on the sync
path.
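A dependency-free sketch of the same shape (the PR routes this through
rayon's par_iter on the global pool; `try_recover_note` here is a
hypothetical stand-in for the ML-KEM-768 trial-decrypt): per-note work runs
in parallel, and a sequential merge afterwards preserves input order.

```rust
use std::thread;

struct Note { payload: u64 }

// Stand-in for the ML-KEM-768 trial-decrypt: "recovers" even payloads.
fn try_recover_note(note: &Note) -> Option<u64> {
    (note.payload % 2 == 0).then_some(note.payload)
}

// Parallel trial-decrypt with a sequential merge pass. std::thread::scope
// chunks keep the sketch dependency-free while preserving the shape:
// per-note work in parallel, results merged in input order afterwards.
// Assumes `threads >= 1`.
fn recover_batch(notes: &[Note], threads: usize) -> Vec<u64> {
    let chunk = ((notes.len() + threads - 1) / threads).max(1);
    let mut per_chunk = Vec::new();
    thread::scope(|s| {
        let handles: Vec<_> = notes
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(try_recover_note).collect::<Vec<_>>()))
            .collect();
        for h in handles {
            per_chunk.push(h.join().unwrap());
        }
    });
    // Sequential merge preserves input order (the println-then-push contract).
    per_chunk.into_iter().flatten().flatten().collect()
}
```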

Tunables

  • TZEL_SYNC_CONCURRENCY (default 4, hard cap 128). 4 matches
    typical CI vCPU count and stays comfortably below any stock
    rollup-node's connection tolerance — deliberately conservative,
    chosen without cross-operator measurement. Higher concurrency
    (8, 16, 32, …) is reasonable on tuned operator boxes but should be
    opt-in once an operator measures their own rollup-node's
    tail-latency curve via the bench harness in tzel-infra PR #57.
    The 128 hard cap rejects misconfiguration before it can fill the
    rollup-node's TCP backlog.
  • --checkpoint-every N / TZEL_SYNC_CHECKPOINT_EVERY /
    DEFAULT_CHECKPOINT_EVERY — bumped from 50 → 250 per PR #24's body
    line 90 ("Re-tune default K to 250–500 once the design doc's
    parallel-fetch patch (P2, see PR #25) lands"). 250 is the
    conservative end of that range — it drops the per-checkpoint fsync
    rate from ~7/s to ~1.4/s sustained at the post-PR throughput.
    Loss-on-interrupt grows from ~2 s to ~10 s, still well under the
    user-visible threshold for cooperative-yield UX.
  • A one-line stderr banner at sync start surfaces both tunables to the
    operator.
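A minimal sketch of how the concurrency tunable might be read, assuming
reject-not-clamp semantics for the cap (the PR text only says the 128 cap
"rejects misconfiguration"; the function name and error strings here are
hypothetical):

```rust
const DEFAULT_SYNC_CONCURRENCY: usize = 4;
const SYNC_CONCURRENCY_HARD_CAP: usize = 128;

// Hypothetical parse of the TZEL_SYNC_CONCURRENCY value. Would be
// called as:
//   parse_sync_concurrency(std::env::var("TZEL_SYNC_CONCURRENCY").ok().as_deref())
fn parse_sync_concurrency(raw: Option<&str>) -> Result<usize, String> {
    let Some(raw) = raw else {
        return Ok(DEFAULT_SYNC_CONCURRENCY); // env var unset: ship-safe default
    };
    let n: usize = raw
        .parse()
        .map_err(|_| format!("TZEL_SYNC_CONCURRENCY: not a number: {raw}"))?;
    // Reject rather than clamp, so misconfiguration fails before the
    // sync starts instead of silently hammering the rollup node.
    if n == 0 || n > SYNC_CONCURRENCY_HARD_CAP {
        return Err(format!(
            "TZEL_SYNC_CONCURRENCY must be in 1..={SYNC_CONCURRENCY_HARD_CAP}, got {n}"
        ));
    }
    Ok(n)
}
```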

Composition with related PRs

  • Builds on --at-tree-size (#23,
    merged on main) — operators that start fresh from an explicit
    cursor still benefit from the concurrent fetch on subsequent syncs.
  • Composes orthogonally with the cooperative-yield work in #24
    (merged): each K-commit batch from #24 is fetched concurrently
    within the batch and trial-decrypted in parallel by this PR.
    PR #24's apply_scan_feed_recover_batch is exactly the function
    that gained the rayon par_iter here.
  • Bench harness + design doc live in
    tzel-infra PR #57
    (separate). The harness wraps tzel-wallet sync in a subprocess and
    times it end-to-end; the README walks through the sandbox-first
    recipe (services/wallet/scripts/install-sandbox.sh → bench).

NOT included (deferred)

  • 2.B (true batched RPC) — deferred to upstream octez (the rollup
    node doesn't currently expose a batch durable-read endpoint).
  • 2.D — monitor stream. Separate concern.
  • Connection-pool reuse across --watch iterations — --watch rebuilds
    SyncFetcher per iteration; per-iteration scans are small enough
    that one fresh handshake per poll stays bounded.

Tests

4 tests in network_profile_tests:

  • concurrent_fetch_returns_same_results_as_sequential — drives a
    100-commit mock rollup, scans with TZEL_SYNC_CONCURRENCY=1 and
    =8, asserts identical NoteMemos in identical order. Validates
    that out-of-order completion is reassembled correctly.
  • concurrent_fetch_aborts_on_5xx — mock returns 503 on the index-7
    length probe; sync at concurrency=4 against a 50-note batch. Asserts
    the error propagates with HTTP status preserved, attributed to a
    rollup RPC URL, AND that the number of requests served by the
    counted mock is strictly less than the index count — a "drain
    everything before propagating" refactor would still satisfy the old
    assertion but would fail this strengthened one.
  • parallel_decrypt_returns_same_results_as_sequential — builds a
    mixed fixture (recoverable + non-recoverable notes) with the
    recoverability of each index pre-computed at fixture-construction
    time as an independent oracle. Both sequential (one-thread rayon
    pool) and parallel (default global pool) branches must produce
    exactly the oracle set, AND must match each other. Verified
    adversarially: mutating apply_scan_feed_recover_batch to drop half
    the recoveries (take(feed.notes.len() / 2)) makes this test fail
    loud — replacing the previous tautological shape that compared
    apply_scan_feed against itself.
  • sync_fetcher_amortises_client_across_batches — pins the contract
    that one reqwest::Client services multiple sequential batches via
    SyncFetcher. Counts served HTTP requests across two back-to-back
    fetches and asserts both batches see the same warm pool.

122 lib tests pass total (was 117 on main before the #24/#25 chain
landed; #24 added the cooperative-yield coverage, this PR adds the four
sync-acceleration tests above).

Test plan

  • cargo +nightly-2025-07-14 build -p tzel-wallet-app
  • cargo +nightly-2025-07-14 test -p tzel-wallet-app --lib — 122 ok
  • Adversarial B1 (drop half recoveries in
    apply_scan_feed_recover_batch) — parallel_decrypt_returns_same_results_as_sequential
    fails loud. Reverted.
  • Adversarial B2 (RAYON_NUM_THREADS=1) — full lib-test runtime
    grows from ~50 s → ~69 s, proving rayon fires on the sync path.
  • cargo +nightly-2025-07-14 clippy -p tzel-wallet-app --all-targets -- -D warnings — no regression vs pre-PR state (37 pre-existing
    tzel-core errors unchanged; nothing new from this PR).

Review-thread audit (pre-merge)

Reviewers A and B caught four blockers + several should-fix items on the
first pass; this rebase addresses each:

  • B1 — parallel_decrypt_returns_same_results_as_sequential was
    tautological (sequential vs parallel both routed through the same
    function). Rewritten with an independent fixture-time oracle.
  • B2 — Rayon parallelism only fired in the legacy apply_scan_feed,
    not in apply_scan_feed_recover_batch (the post-#24 hot path).
    Added rayon to the latter.
  • B3 — reqwest::Client was rebuilt per batch (~268 cold-start
    handshakes per 67k-commit sync at K=250). Added SyncFetcher so the
    client survives the whole cmd_rollup_sync call.
  • B4 — Bench harness called tzel-wallet rollup-sync
    --rollup-node-url (wrong subcommand, wrong flag). Fixed in
    tzel-infra PR #57 to run profile init-shadownet first, then sync.
  • S1 — K bumped 50 → 250 per PR #24 body line 90.
  • S2 — Test count corrected (122, was 117 pre-#24).
  • S3 — Default concurrency dropped 8 → 4 absent measurement
    justifying anything higher.
  • S4 — Design doc updated to reflect that pool reuse is per
    cmd_rollup_sync call, not across --watch iterations.
  • S5 — concurrent_fetch_aborts_on_5xx strengthened with a
    served-requests counter so a "drain everything before propagating"
    refactor would fail it.

🤖 Generated with Claude Code

@saroupille saroupille force-pushed the feat/sync-acceleration branch 2 times, most recently from 447bcce to c68a92d Compare May 10, 2026 15:30
The 67k-commit ushuaianet sync wallclock was dominated by sequential
ureq round trips against the rollup-node's durable-state RPC (~40 ms
each, ~95% of total). Three orthogonal accelerations land here, all
firing on the post-trilitech#24 cooperative-yield path that `cmd_rollup_sync`
actually executes:

* 2.A — concurrent HTTP fetch in `RollupRpc::load_notes_since_at_block`
  via `reqwest` + `futures_util::FuturesUnordered`. The async runtime
  is hosted on a dedicated worker thread so the path composes with both
  the synchronous CLI dispatch and the multi-thread tokio runtime that
  powers `tzel-detect`'s axum handlers (calling `block_on` from inside
  a tokio runtime panics — the worker-thread bridge sidesteps that
  without forcing every caller to be async).

  Concurrency tuned by `TZEL_SYNC_CONCURRENCY` (default 4 — matches CI
  vCPU count, ship-safe across every rollup-node we know about; bumping
  above 4 should be opt-in once an operator measures their own
  rollup-node's tail-latency curve via `services/scan-bench/`). The 128
  hard cap rejects misconfiguration before it can fill the rollup
  node's TCP backlog. Errors short-circuit the batch — same
  abort-on-first-error contract the sequential loop had.

* 2.B — `apply_scan_feed_recover_batch` (PR trilitech#24's per-batch recover
  function, the function `cmd_rollup_sync` actually calls) routes its
  ML-KEM-768 trial-decrypt loop through rayon's default global pool.
  The decrypt is embarrassingly parallel: `try_recover_note` is
  `&self`, addresses don't change inside the call, recovered notes are
  independent. Recovery results are merged sequentially after the
  parallel pass to preserve the existing println-then-push order the
  test suite relies on. (The legacy `apply_scan_feed` on the dead
  `cmd_scan` path was already parallelised in an earlier draft of this
  PR, but the post-trilitech#24 hot path runs through
  `apply_scan_feed_recover_batch` — without this commit, the rayon
  par_iter never fired on a real `tzel-wallet sync` invocation.)

* 2.A continued — `SyncFetcher` long-lived context. Pre-fix:
  `fetch_published_notes_concurrent` was called per-batch and rebuilt
  the `reqwest::Client` (and a fresh tokio runtime + thread) every
  time. At the new K=250 across a 67k-commit sync that's still ~268
  client builds, each warming a fresh connection pool from cold —
  the design doc's "amortised TCP+TLS handshake" claim was false in
  that shape. `SyncFetcher` owns one `reqwest::Client` + worker thread
  + tokio runtime that survive the whole `cmd_rollup_sync` call. Pool
  warms once. (Across `--watch` iterations the pool is rebuilt; the
  per-iteration scan is small once caught up.)

A one-line stderr banner at sync start surfaces both tunables to the
operator. Composes orthogonally with `--at-tree-size` (PR trilitech#23, merged)
and the cooperative-yield work in PR trilitech#24, including PR trilitech#24 body line
90's "re-tune K to 250–500 once P2 lands" — `DEFAULT_CHECKPOINT_EVERY`
bumps from 50 → 250 (conservative end of that range).

NOT included: 2.B (batched RPC, deferred to upstream octez) and 2.D
(monitor stream, a separate concern).

Tests: 4 in `network_profile_tests` (122 lib tests pass total, was 117
on `main` pre-trilitech#24-chain):
  * concurrent_fetch_returns_same_results_as_sequential — out-of-order
    completion is reassembled correctly (N=1 vs N=8 against a
    100-commit mock).
  * concurrent_fetch_aborts_on_5xx — a 503 mid-batch propagates the
    error with HTTP status preserved, no silent retry, AND a counted
    mock asserts cancellation short-circuited (served < total).
  * parallel_decrypt_returns_same_results_as_sequential — independent
    fixture-time oracle (50 mixed recoverable + non-recoverable
    notes); both sequential and parallel branches must produce
    exactly the oracle set. Mutating
    `apply_scan_feed_recover_batch` to drop half the recoveries fails
    this test loud (verified).
  * sync_fetcher_amortises_client_across_batches — pins the contract
    that one `reqwest::Client` services multiple sequential batches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@saroupille saroupille force-pushed the feat/sync-acceleration branch from c68a92d to 2461c33 Compare May 10, 2026 15:56
@saroupille saroupille merged commit 7e48045 into trilitech:main May 10, 2026
2 checks passed
saroupille pushed a commit to saroupille/tzel that referenced this pull request May 10, 2026
The combined `rust-and-cairo` job started clipping cold runs once
the wallet test suite grew (PR trilitech#25 added the concurrent-fetch +
parallel-decrypt machinery; the resulting +16s on
`tzel-wallet-app --lib` pushed total wallclock past ~9 min and hit a
runner shutdown signal mid-Cairo on the first post-merge main run,
SHA 7e48045).

Two changes:

1. Split `rust-and-cairo` into `rust-tests` and `cairo-tests`
   parallel jobs. Each runs on its own ubuntu-latest runner with its
   own wallclock budget. Cairo doesn't gate on Rust completion any
   more; failures surface independently.

2. Add `Swatinem/rust-cache@v2.7.8` (SHA-pinned per the workflow's
   third-party-action convention) to both Rust-touching jobs
   (`rust-tests` and `ocaml`'s cross-impl interop step). Cargo
   registry + target/ amortise across runs; cache key bumps on
   `Cargo.lock` change. Warm runs go from ~9 min cold to ~3-4 min.

The `ocaml` job is unchanged structurally — just gains the cargo
cache for its Rust cross-impl test.

No code changes; workflow-only.
saroupille added a commit that referenced this pull request May 10, 2026
Co-authored-by: François Thiré <franth2@gmail.com>