
perf(pm): TTY-gate progress + cooperative yield in spawn loop#2915

Draft
elrrrrrrr wants to merge 10 commits into next from experiment/install-tty-gate-coop-yield

Conversation

@elrrrrrrr
Contributor

What

Smaller alternative to #2911. Two install hot-path changes layered on top of the new (post-#2905) zero-copy + par_chunks(64) baseline:

  1. TTY-gate progress bar (carried from experiment/install-tty-gate parent commit): non-TTY callers skip indicatif's internal mutex, removing ~9k atomic ops per install.

  2. Single consume_budget().await at the top of install_packages spawn loop: gives the tokio runtime a per-iteration window to drain socket reads on in-flight tarball downloads, replacing the implicit yield that the indicatif mutex used to provide.

```rust
for (path, package) in packages.iter() {
    tokio::task::consume_budget().await;  // ← only addition
    // ... existing logic, unchanged ...
}
```

Why this instead of the bigger #2911 partition

#2911's classify + 3-phase pipeline did fix the TCP-level starvation (zwin 14 → 0), but the N=4 phases-bench runs measured a consistent +1.05s p3 regression. Mechanism: the partition design pushed all cheap paths (omit / cpu-incompat / file: link) before any spawns, then opened Phase 3 with 64 in-flight downloads simultaneously. That concentrated disk burst widened p3 σ vs the baseline's interleaved cheap+heavy schedule.

This PR keeps the original interleaved schedule (cheap paths inline with spawns) and only adds the runtime-yield hint. Same TCP fix, smaller change, no concentrated burst.

Pcap mechanism evidence (carried over from partition experiments)

| Variant | utoo zwin | utoo-next zwin |
| --- | --- | --- |
| TTY-gate alone (no yield) | 14-16 | 0 |
| TTY-gate + cooperative yield (this PR) | 0 | 0 |

consume_budget was added in tokio 1.41 and is the right primitive for this case: ~5ns when it doesn't yield (the common case), ~100ns when the per-task tick budget is exhausted. utoo pins tokio 1.51.

Companion / supersedes

🤖 Generated with Claude Code

elrrrrrrr and others added 9 commits May 7, 2026 11:20
`install.rs` had 8 raw `PROGRESS_BAR.inc(1)` calls plus a `set_length` that
bypassed the `IS_TTY` short-circuit already present in `ProgressReceiver`.
indicatif always takes the internal `Mutex<ProgressState>` write-lock
even with a hidden draw target, so non-TTY runs (CI, piped output)
were paying ~9k Mutex acquisitions per install. Wrapped them in
`progress_inc` / `progress_set_length` helpers that match the
receiver's gating pattern.

This is 1/3 of the triplet from #2902, split out for independent
A/B benchmarking against the recently-merged baseline #2887.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pcap-only bench was previously a one-off that captured `p1_resolve`
across `utoo` and `bun`, and assumed the project tree was already
cloned by `pm-bench-phases.sh` running in the same job. That gave us
metadata fan-out, but install-phase regressions (#2902 / #2903 /
#2904 / #2905 σ widening on `p0_full_cold`) live in the tarball
download path, not in resolve.

This commit makes the pcap bench self-contained and covers both
phases for three PMs:

- Self-clone the project if `$PROJECT_DIR` is missing (mirrors
  `pm-bench-phases.sh`), so this script runs as a standalone CI job.
- Add a `<pm>-install` capture per PM: lock pre-existing,
  `cache + node_modules` wiped, then `<pm> install`. This is the
  cold-tarball-download phase where the σ-widening lives.
- Add `utoo-next` as a third PM: built upstream by `build-linux`'s
  bench-baseline step (now also gated on `pm-bench-pcap`), downloaded
  via the same artifact path as `bench-phases-linux`. Skipped in
  local runs where `$UTOO_NEXT_BIN` is unset.

Workflow change:

- `pm-bench-pcap-linux` now downloads the `utoo-next-linux-x64`
  artifact and exports `UTOO_NEXT_BIN` exactly like
  `bench-phases-linux` does.
- `Build next branch utoo` and `Upload utoo-next binary` steps in
  `build-linux` now also fire for `inputs.target == 'pm-bench-pcap'`,
  not only `pm-bench-phases`.

Outputs in `/tmp/pm-bench-pcap`:

  dns.txt
  utoo-{resolve,install}.{pcap,log}
  utoo-next-{resolve,install}.{pcap,log}   (when UTOO_NEXT_BIN set)
  bun-{resolve,install}.{pcap,log}

Drives the analysis of whether the install hot-path's increased
concurrency (FuturesUnordered streaming, zero-copy tar, TTY-gate)
saturates outbound TCP and starves the download path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Building on the install-phase pcap capture from the previous commit,
post-process each .pcap with tshark to extract pre-TLS metrics that
directly probe the "install greediness starves download" hypothesis
without needing TLS session-key dumping:

  zero_windows    — receive buffer full → server paused. Direct evidence
                    that the app's tokio runtime is not draining the
                    socket fast enough between extracts.
  retransmits     — server resent because ACK was late. Indirect
                    evidence of receive-side stall.
  duplicate_acks  — receiver re-sent ACK because it perceived a gap.
  stream_gap_*    — inter-packet gap distribution per TCP stream
                    (p50 / p99 / max in microseconds). p99 / max measure
                    the longest pause an active connection experienced —
                    if utoo shows multi-hundred-ms gaps where utoo-next
                    shows tens of ms, install is freezing the runtime
                    mid-download.
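The stream_gap percentiles can be sketched with a small awk pass over per-stream packet timestamps (which a real run would first pull out of the pcap, e.g. with tshark's `-T fields -e frame.time_epoch` filtered to one `tcp.stream`). The timestamps below are synthetic and the nearest-rank percentile rule is an illustrative choice, not necessarily the script's exact one:

```shell
#!/usr/bin/env bash
# Sketch: inter-packet gap percentiles for one TCP stream. Timestamps
# (epoch seconds) are synthetic; percentile = nearest-rank index,
# clamped to >= 1.
ts='0.000000
0.001000
0.002000
0.004000
0.254000'   # one 250 ms stall at the tail

gaps=$(awk '
NR > 1 { gap[NR-1] = ($1 - prev) * 1e6 }   # gap to previous packet, in µs
{ prev = $1 }
END {
    n = NR - 1
    for (i = 2; i <= n; i++) {             # insertion sort: per-stream n is small
        v = gap[i]
        for (j = i - 1; j >= 1 && gap[j] > v; j--) gap[j+1] = gap[j]
        gap[j+1] = v
    }
    i50 = int(n * 0.50); if (i50 < 1) i50 = 1
    i99 = int(n * 0.99); if (i99 < 1) i99 = 1
    printf "gap_p50_us=%.0f gap_p99_us=%.0f gap_max_us=%.0f\n", gap[i50], gap[i99], gap[n]
}' <<< "$ts")
echo "$gaps"   # gap_p50_us=1000 gap_p99_us=2000 gap_max_us=250000
```

A multi-hundred-millisecond gap_max on a stream whose p50 sits in single-digit milliseconds is exactly the "runtime frozen mid-download" signature these fields exist to surface.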

Per-capture summaries land at $PCAP_DIR/<name>.summary.json. They are
aggregated into a top-level summary.json via jq -s, so artifact
consumers can compare metrics across PMs without re-parsing the 100s
of MB of raw pcaps.

Single-pass tshark over the pcap with -T fields keeps cost bounded
to ~1 minute per 1 GB of capture; the full analysis pass runs after
all captures so it does not bleed into wall-clock measurement.

Workflow change:

  Install pcap tools step now also installs tshark + jq, with
  wireshark-common pre-seeded so tshark installs non-interactively
  (we only read existing pcaps, no setuid dumpcap needed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The analysis pass aborted the whole job on the first .pcap because
the wall-time grep returned no match, and `set -eo pipefail` propagated
that exit-1 through the `local x; x=$(grep | awk)` assignment: the
multi-line `local x; x=$(...)` form does NOT mask the exit code,
unlike one-line `local x=$(...)` (a classic bash gotcha).

Two-part fix:

1. Drop into `set +e` / `set +o pipefail` for the analysis function
   body. The metrics are diagnostic — one tshark hiccup or an empty
   log line should not nuke a 25-minute capture run. Strict mode is
   restored at the end of the function so the rest of the script
   keeps its safety net.

2. Replace `grep -oE | awk` with awk-only. awk returns 0 even when
   no record matches, so empty-result log files no longer trip
   pipefail. Same parse, fewer pipes.
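Both halves of the gotcha are easy to confirm in isolation. A minimal standalone demo (function names are illustrative, this is not the bench script itself):

```shell
#!/usr/bin/env bash
# Demo 1: `local x=$(cmd)` masks cmd's exit status; the two-line form keeps it.
one_line() {
    local x=$(false)    # $? becomes `local`'s own status: 0
    echo "one-line \$?=$?"
}
two_line() {
    local x
    x=$(false)          # bare assignment keeps the substitution's status: 1
    echo "two-line \$?=$?"
}
one_line                # prints: one-line $?=0
two_line                # prints: two-line $?=1

# Demo 2: awk exits 0 when no record matches, grep exits 1, so awk-only
# pipelines survive `set -o pipefail` on empty log files.
printf '' | awk '/wall/ { print $2 }'
echo "awk-on-empty \$?=$?"   # prints: awk-on-empty $?=0
```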

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iagnosis

The TCP-level analysis (zero-copy retx=123 vs baseline 4-18) gave
strong evidence that utoo's receiver runtime is under back-pressure
during install, but it doesn't tell us *why*. The leading hypothesis
is disk IO saturation: rayon's parallel `fs::create + write_all` over
80k+ files in the ant-design tarball burst can outrun GitHub Actions
runners' Azure-disk IOPS budget, blocking write threads → tokio
threads back up → socket buffers fill.

This commit adds an iostat-x sampler to each capture:

  capture_one() now spawns `iostat -x -y 1` in parallel with
  tcpdump, writing per-second device samples to
  $PCAP_DIR/<name>.iostat.txt. Both samplers are torn down with
  the workload command.

  analyze_pcap() parses the iostat log via column-position lookup
  (sysstat header row → column index map) and extracts:
    io_util_max_pct       — peak disk-busy percentage
    io_util_avg_pct       — average disk-busy percentage
    io_w_iops_max         — peak write IOPS
    io_w_kbs_max          — peak write throughput (kB/s)
    io_w_await_max_ms     — peak write queue wait (ms)
    io_samples            — sample count for sanity check
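The header-driven column lookup can be sketched as below. The iostat sample is synthetic and trimmed; real `iostat -x -y 1` output repeats the header block and interleaves avg-cpu sections, which the actual parser has to skip:

```shell
#!/usr/bin/env bash
# Sketch: resolve %util and w/s by name from the sysstat header row instead
# of hardcoding positions (column order shifts between sysstat versions).
sample='Device            r/s     w/s     rkB/s     wkB/s   w_await  %util
sda              0.00  812.00      0.00  92416.00      4.10  97.60
sda              0.00  655.00      0.00  71204.00      2.80  84.20'

summary=$(awk '
NR == 1 {                                  # header row: build name -> index map
    for (i = 1; i <= NF; i++) col[$i] = i
    next
}
{
    if ($col["%util"] > util_max)  util_max  = $col["%util"]
    if ($col["w/s"]   > wiops_max) wiops_max = $col["w/s"]
    n++
}
END { printf "io_util_max_pct=%s io_w_iops_max=%s io_samples=%d\n", util_max, wiops_max, n }
' <<< "$sample")
echo "$summary"   # io_util_max_pct=97.60 io_w_iops_max=812.00 io_samples=2
```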

These six fields land in summary.json alongside the TCP metrics, so
artifact consumers can directly cross-correlate disk pressure with
TCP back-pressure within the same capture window.

The decision rule:
  * If io_util_max_pct stays high (>80%) on the experiment branch
    while baseline same-PM utoo-next stays low → install path is
    saturating disk and that's the mechanism.
  * If both branches show similar low %util, disk is not the
    bottleneck and we keep looking (e.g. CPU contention).

Workflow: apt install adds `sysstat` (iostat lives there). It is
preinstalled on ubuntu-latest images today, but pinning the
dependency makes future image rebuilds resilient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smaller alternative to the partition refactor in #2911. Keep the
original Vec<JoinHandle> + sequential drain loop structure (no
restructuring) and only add a single per-iteration
`tokio::task::consume_budget().await` at the top of the spawn loop
body.

Mechanism check from the partition pcap experiments
(run 25553669984 + run 25552156559): with cooperative-yield hint at
the per-iteration boundary, utoo install zwin events drop from 14
to 0, matching utoo-next baseline. Without it, the synchronous
spawn loop runs ~2000 packages back-to-back on non-TTY CI (after
TTY-gate's mutex removal) without giving the runtime a window to
drain socket reads.

Why the partition was dropped: the bigger refactor showed a
consistent +1s p3 mean regression across 4 attempts (utoo p3 =
7.42s avg, utoo-next = 6.37s, bun = 6.90s). The partition pushed
all cheap paths (omit / cpu-incompat / file: link) before any
spawns, then opened Phase 3 with all 64 in-flight downloads at
once — a more concentrated disk burst than the original
cheap-path/heavy-path interleaved schedule. The TCP-level fix
worked, but disk-side back-pressure widened p3 σ.

This commit keeps the original interleaved schedule (cheap paths
inline with spawns) and adds only the runtime-yield hint at the
top of each iteration. Same structural principle as #2911's Phase
3 yield, applied to the original loop without the surrounding
restructure.

Cost: ~5ns per iteration when the per-task tick budget isn't
exhausted (the common case — JoinHandle.await later in the
iteration resets budget). ~100ns when the budget exhausts after a
run of cheap iterations and the runtime preempts to drain sockets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr added benchmark Run pm-bench on PR A-Pkg Manager Area: Package Manager labels May 8, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces cooperative scheduling using tokio::task::consume_budget() in the package installation loop to prevent socket read starvation on non-TTY environments. It also gates progress bar updates behind TTY checks to avoid unnecessary mutex contention in CI or piped outputs. A review comment suggests updating the documentation for progress_inc to accurately reflect its generic parameter.

Comment thread crates/pm/src/util/logger.rs
@elrrrrrrr elrrrrrrr marked this pull request as ready for review May 8, 2026 12:52
@elrrrrrrr elrrrrrrr added benchmark Run pm-bench on PR and removed benchmark Run pm-bench on PR labels May 9, 2026
@github-actions

github-actions Bot commented May 9, 2026

📊 pm-bench-phases · 3258e65 · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 8.80s | 0.11s | 9.98s | 9.93s | 745M | 331.0K |
| utoo-next | 8.12s | 0.20s | 10.23s | 12.09s | 953M | 125.7K |
| utoo-npm | 8.01s | 0.03s | 10.69s | 12.23s | 1.28G | 173.3K |
| utoo | 8.52s | 0.61s | 10.25s | 12.17s | 972M | 121.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 14.9K | 17.4K | 1.17G | 6M | 1.84G | 1.72G | 1M |
| utoo-next | 119.1K | 82.9K | 1.14G | 5M | 1.68G | 1.68G | 2M |
| utoo-npm | 122.7K | 85.2K | 1.14G | 5M | 1.68G | 1.68G | 2M |
| utoo | 133.4K | 83.8K | 1.14G | 5M | 1.68G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.85s | 0.03s | 3.91s | 1.05s | 502M | 165.5K |
| utoo-next | 3.10s | 0.11s | 5.31s | 1.92s | 610M | 85.7K |
| utoo-npm | 3.05s | 0.02s | 5.30s | 1.84s | 609M | 79.2K |
| utoo | 3.01s | 0.08s | 5.30s | 1.92s | 613M | 83.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 7.7K | 4.7K | 201M | 3M | 105M | - | 1M |
| utoo-next | 68.1K | 115.2K | 199M | 2M | 7M | 3M | 2M |
| utoo-npm | 68.3K | 112.6K | 199M | 2M | 7M | 3M | 2M |
| utoo | 67.5K | 112.2K | 199M | 2M | 7M | 3M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 6.71s | 0.06s | 6.11s | 9.75s | 645M | 212.8K |
| utoo-next | 6.58s | 1.09s | 4.86s | 10.65s | 456M | 56.3K |
| utoo-npm | 6.05s | 0.08s | 5.27s | 10.96s | 905M | 115.2K |
| utoo | 6.35s | 1.56s | 4.84s | 10.45s | 478M | 59.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 4.1K | 6.9K | 995M | 4M | 1.74G | 1.74G | 1M |
| utoo-next | 101.8K | 49.6K | 964M | 3M | 1.67G | 1.67G | 2M |
| utoo-npm | 103.6K | 62.9K | 964M | 2M | 1.67G | 1.67G | 2M |
| utoo | 91.3K | 50.9K | 964M | 2M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.32s | 0.04s | 0.22s | 2.30s | 137M | 32.7K |
| utoo-next | 2.20s | 0.24s | 0.49s | 3.73s | 80M | 18.6K |
| utoo-npm | 2.19s | 0.06s | 0.50s | 3.78s | 84M | 19.2K |
| utoo | 2.07s | 0.06s | 0.47s | 3.76s | 79M | 18.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 262 | 27 | 5M | 48K | 1.88G | 1.72G | 1M |
| utoo-next | 41.0K | 18.9K | 308K | 7K | 1.68G | 1.68G | 2M |
| utoo-npm | 46.0K | 21.1K | 323K | 29K | 1.68G | 1.68G | 2M |
| utoo | 40.8K | 18.9K | 309K | 11K | 1.68G | 1.68G | 2M |

npmmirror.com: no output captured.

@elrrrrrrr elrrrrrrr marked this pull request as draft May 9, 2026 07:59

Labels

A-Pkg Manager Area: Package Manager benchmark Run pm-bench on PR
