perf(pm): TTY-gate progress + cooperative yield in spawn loop #2915
Draft
Conversation
`install.rs` had 8 raw `PROGRESS_BAR.inc(1)` call sites plus a `set_length` that bypassed the `IS_TTY` short-circuit already present in `ProgressReceiver`. indicatif always takes the internal `Mutex<ProgressState>` write-lock even with a hidden draw target, so non-TTY runs (CI, piped output) were paying ~9k Mutex acquisitions per install. Wrapped them in `progress_inc` / `progress_set_length` helpers that match the receiver's gating pattern.

This is 1/3 of the triplet from #2902, split out for independent A/B benchmarking against the recently-merged baseline #2887.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
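A minimal sketch of the gating pattern described above, with hypothetical names and signatures (the review comment further down notes the real `progress_inc` takes a generic parameter); `is_tty()` stands in for the cached `IS_TTY` check in `ProgressReceiver`:

```rust
use indicatif::ProgressBar;
use std::io::IsTerminal;
use std::sync::OnceLock;

// Cache the TTY check once, mirroring the IS_TTY short-circuit in the receiver.
fn is_tty() -> bool {
    static IS_TTY: OnceLock<bool> = OnceLock::new();
    *IS_TTY.get_or_init(|| std::io::stdout().is_terminal())
}

// Non-TTY callers return before ever touching indicatif's Mutex<ProgressState>.
fn progress_inc(bar: &ProgressBar, delta: u64) {
    if is_tty() {
        bar.inc(delta);
    }
}

fn progress_set_length(bar: &ProgressBar, len: u64) {
    if is_tty() {
        bar.set_length(len);
    }
}
```

Routing the raw `PROGRESS_BAR.inc(1)` / `set_length` calls through helpers of this shape is what removes the ~9k per-install mutex acquisitions on CI and piped output.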
The pcap-only bench was previously a one-off that captured `p1_resolve` across `utoo` and `bun`, and assumed the project tree was already cloned by `pm-bench-phases.sh` running in the same job. That gave us metadata fan-out, but install-phase regressions (#2902 / #2903 / #2904 / #2905 σ widening on `p0_full_cold`) live in the tarball download path, not in resolve. This commit makes the pcap bench self-contained and covers both phases for three PMs:

- Self-clone the project if `$PROJECT_DIR` is missing (mirrors `pm-bench-phases.sh`), so this script runs as a standalone CI job.
- Add a `<pm>-install` capture per PM: lock pre-existing, `cache + node_modules` wiped, then `<pm> install`. This is the cold-tarball-download phase where the σ-widening lives.
- Add `utoo-next` as a third PM: built upstream by `build-linux`'s bench-baseline step (now also gated on `pm-bench-pcap`), downloaded via the same artifact path as `bench-phases-linux`. Skipped in local runs where `$UTOO_NEXT_BIN` is unset.

Workflow change:

- `pm-bench-pcap-linux` now downloads the `utoo-next-linux-x64` artifact and exports `UTOO_NEXT_BIN` exactly like `bench-phases-linux` does.
- `Build next branch utoo` and `Upload utoo-next binary` steps in `build-linux` now also fire for `inputs.target == 'pm-bench-pcap'`, not only `pm-bench-phases`.

Outputs in `/tmp/pm-bench-pcap`:

- dns.txt
- utoo-{resolve,install}.{pcap,log}
- utoo-next-{resolve,install}.{pcap,log} (when UTOO_NEXT_BIN set)
- bun-{resolve,install}.{pcap,log}

Drives the analysis of whether the install hot-path's increased concurrency (FuturesUnordered streaming, zero-copy tar, TTY-gate) saturates outbound TCP and starves the download path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Building on the install-phase pcap capture from the previous commit,
post-process each .pcap with tshark to extract pre-TLS metrics that
directly probe the "install greediness starves download" hypothesis
without needing TLS session-key dumping:
zero_windows — receive buffer full → server paused. Direct evidence
that the app's tokio runtime is not draining the
socket fast enough between extracts.
retransmits — server resent because ACK was late. Indirect
evidence of receive-side stall.
duplicate_acks — receiver re-sent ACK because it perceived a gap.
stream_gap_* — inter-packet gap distribution per TCP stream
(p50 / p99 / max in microseconds). p99 / max measure
the longest pause an active connection experienced —
if utoo shows multi-hundred-ms gaps where utoo-next
shows tens of ms, install is freezing the runtime
mid-download.
Per-capture summaries land at $PCAP_DIR/<name>.summary.json. They are
aggregated into a top-level summary.json via jq -s, so artifact
consumers can compare metrics across PMs without re-parsing the 100s
of MB of raw pcaps.
Single-pass tshark over the pcap with -T fields keeps cost bounded
to ~1 minute per 1 GB of capture; the full analysis pass runs after
all captures so it does not bleed into wall-clock measurement.
Workflow change:
The `Install pcap tools` step now also installs tshark + jq, with
wireshark-common pre-seeded so tshark installs non-interactively
(we only read existing pcaps, no setuid dumpcap needed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The analysis pass aborted the whole job on the first .pcap: the wall-time grep returned no match, and under `set -eo pipefail` that exit-1 propagated through the `local x; x=$(grep | awk)` assignment (the multi-line `local x; x=$(...)` form does NOT mask the exit code, unlike `local x=$(...)` on one line — bash gotcha).

Two-part fix:

1. Drop into `set +e` / `set +o pipefail` for the analysis function body. The metrics are diagnostic — one tshark hiccup or an empty log line should not nuke a 25-minute capture run. Strict mode is restored at the end of the function so the rest of the script keeps its safety net.
2. Replace `grep -oE | awk` with awk-only. awk returns 0 even when no record matches, so empty-result log files no longer trip pipefail. Same parse, fewer pipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TCP-level analysis (zero-copy retx=123 vs baseline 4-18) gave
strong evidence that utoo's receiver runtime is under back-pressure
during install, but it doesn't tell us *why*. The leading hypothesis
is disk IO saturation: rayon's parallel `fs::create + write_all` over
80k+ files in the ant-design tarball burst can outrun GitHub Actions
runners' Azure-disk IOPS budget, blocking write threads → tokio
threads back up → socket buffers fill.
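For context, a minimal sketch (hypothetical names, simplified from the real zero-copy extraction path) of the parallel write burst this hypothesis points at — rayon fans per-file writes across its whole thread pool, so a slow disk can pin every worker at once:

```rust
use rayon::prelude::*;
use std::{fs, io::Write, path::PathBuf};

// entries: (destination path, file bytes) pairs from an unpacked tarball.
fn write_entries(entries: &[(PathBuf, Vec<u8>)]) -> std::io::Result<()> {
    entries.par_iter().try_for_each(|(path, bytes)| {
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?; // directory fan-out
        }
        // One create + one write per file; 80k+ files in a burst means the
        // whole rayon pool can block on disk IO at the same time.
        let mut file = fs::File::create(path)?;
        file.write_all(bytes)
    })
}
```

If the runner's IOPS budget is exhausted here, the write threads block, tokio threads back up behind them, and socket buffers fill — exactly what the iostat samples are meant to confirm or rule out.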
This commit adds an `iostat -x` sampler to each capture:
capture_one() now spawns `iostat -x -y 1` in parallel with
tcpdump, writing per-second device samples to
$PCAP_DIR/<name>.iostat.txt. Both samplers are torn down with
the workload command.
analyze_pcap() parses the iostat log via column-position lookup
(sysstat header row → column index map) and extracts:
io_util_max_pct — peak disk-busy percentage
io_util_avg_pct — average disk-busy percentage
io_w_iops_max — peak write IOPS
io_w_kbs_max — peak write throughput (kB/s)
io_w_await_max_ms — peak write queue wait (ms)
io_samples — sample count for sanity check
These six fields land in summary.json alongside the TCP metrics, so
artifact consumers can directly cross-correlate disk pressure with
TCP back-pressure within the same capture window.
The decision rule:
* If io_util_max_pct stays high (>80%) on the experiment branch
  while the utoo-next baseline stays low → the install path is
  saturating disk and that's the mechanism.
* If both branches show similar low %util, disk is not the
bottleneck and we keep looking (e.g. CPU contention).
Workflow: apt install adds `sysstat` (which provides iostat). It is
preinstalled on ubuntu-latest images today, but declaring the
dependency explicitly keeps the job resilient to future image changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Smaller alternative to the partition refactor in #2911. Keep the original `Vec<JoinHandle>` + sequential drain loop structure (no restructuring) and only add a single per-iteration `tokio::task::consume_budget().await` at the top of the spawn loop body.

Mechanism check from the partition pcap experiments (run 25553669984 + run 25552156559): with the cooperative-yield hint at the per-iteration boundary, utoo install zwin events drop from 14 to 0, matching the utoo-next baseline. Without it, the synchronous spawn loop runs ~2000 packages back-to-back on non-TTY CI (after TTY-gate's mutex removal) without giving the runtime a window to drain socket reads.

Why drop the partition: the bigger refactor showed a consistent +1s p3 mean regression across 4 attempts (utoo p3 = 7.42s avg, utoo-next = 6.37s, bun = 6.90s). The partition pushed all cheap paths (omit / cpu-incompat / file: link) before any spawns, then opened Phase 3 with all 64 in-flight downloads at once — a more concentrated disk burst than the original cheap-path/heavy-path interleaved schedule. The TCP-level fix worked, but disk-side back-pressure widened p3 σ.

This commit keeps the original interleaved schedule (cheap paths inline with spawns) and adds only the runtime-yield hint at the top of each iteration. Same structural principle as #2911's Phase 3 yield, applied to the original loop without the surrounding restructure.

Cost: ~5ns per iteration when the per-task tick budget isn't exhausted (the common case — JoinHandle.await later in the iteration resets the budget); ~100ns when the budget exhausts after a run of cheap iterations and the runtime preempts to drain sockets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
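A minimal sketch (hypothetical types and helper names, not the exact `install_packages` signature) of the loop shape described above — the original `Vec<JoinHandle>` spawn loop and sequential drain, with the single yield hint added at the top of each iteration:

```rust
use tokio::task::{consume_budget, JoinHandle};

// Hypothetical stand-ins for the real package type and helpers.
struct Package { cheap: bool }
fn link_in_place(_pkg: &Package) {}
async fn download_and_extract(_pkg: Package) {}

async fn install_packages(packages: Vec<Package>) {
    let mut handles: Vec<JoinHandle<()>> = Vec::new();

    for pkg in packages {
        // The only change: ~free while per-task budget remains; once the
        // budget is exhausted this yields to the runtime, giving in-flight
        // tarball downloads a window to drain their socket reads.
        consume_budget().await;

        if pkg.cheap {
            link_in_place(&pkg); // cheap paths stay inline (interleaved schedule)
            continue;
        }
        handles.push(tokio::spawn(download_and_extract(pkg)));
    }

    for handle in handles {
        let _ = handle.await; // sequential drain, unchanged
    }
}
```

On non-TTY CI this yield hint stands in for the implicit scheduling point that the indicatif mutex used to provide before the TTY-gate removed it.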
Contributor
Code Review
This pull request introduces cooperative scheduling using tokio::task::consume_budget() in the package installation loop to prevent socket read starvation in non-TTY environments. It also gates progress bar updates behind TTY checks to avoid unnecessary mutex contention in CI or piped output. A review comment suggests updating the documentation for progress_inc to accurately reflect its generic parameter.
📊 pm-bench-phases
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 8.80s | 0.11s | 9.98s | 9.93s | 745M | 331.0K |
| utoo-next | 8.12s | 0.20s | 10.23s | 12.09s | 953M | 125.7K |
| utoo-npm | 8.01s | 0.03s | 10.69s | 12.23s | 1.28G | 173.3K |
| utoo | 8.52s | 0.61s | 10.25s | 12.17s | 972M | 121.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.9K | 17.4K | 1.17G | 6M | 1.84G | 1.72G | 1M |
| utoo-next | 119.1K | 82.9K | 1.14G | 5M | 1.68G | 1.68G | 2M |
| utoo-npm | 122.7K | 85.2K | 1.14G | 5M | 1.68G | 1.68G | 2M |
| utoo | 133.4K | 83.8K | 1.14G | 5M | 1.68G | 1.68G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.85s | 0.03s | 3.91s | 1.05s | 502M | 165.5K |
| utoo-next | 3.10s | 0.11s | 5.31s | 1.92s | 610M | 85.7K |
| utoo-npm | 3.05s | 0.02s | 5.30s | 1.84s | 609M | 79.2K |
| utoo | 3.01s | 0.08s | 5.30s | 1.92s | 613M | 83.5K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 7.7K | 4.7K | 201M | 3M | 105M | - | 1M |
| utoo-next | 68.1K | 115.2K | 199M | 2M | 7M | 3M | 2M |
| utoo-npm | 68.3K | 112.6K | 199M | 2M | 7M | 3M | 2M |
| utoo | 67.5K | 112.2K | 199M | 2M | 7M | 3M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.71s | 0.06s | 6.11s | 9.75s | 645M | 212.8K |
| utoo-next | 6.58s | 1.09s | 4.86s | 10.65s | 456M | 56.3K |
| utoo-npm | 6.05s | 0.08s | 5.27s | 10.96s | 905M | 115.2K |
| utoo | 6.35s | 1.56s | 4.84s | 10.45s | 478M | 59.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.1K | 6.9K | 995M | 4M | 1.74G | 1.74G | 1M |
| utoo-next | 101.8K | 49.6K | 964M | 3M | 1.67G | 1.67G | 2M |
| utoo-npm | 103.6K | 62.9K | 964M | 2M | 1.67G | 1.67G | 2M |
| utoo | 91.3K | 50.9K | 964M | 2M | 1.67G | 1.67G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.32s | 0.04s | 0.22s | 2.30s | 137M | 32.7K |
| utoo-next | 2.20s | 0.24s | 0.49s | 3.73s | 80M | 18.6K |
| utoo-npm | 2.19s | 0.06s | 0.50s | 3.78s | 84M | 19.2K |
| utoo | 2.07s | 0.06s | 0.47s | 3.76s | 79M | 18.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 262 | 27 | 5M | 48K | 1.88G | 1.72G | 1M |
| utoo-next | 41.0K | 18.9K | 308K | 7K | 1.68G | 1.68G | 2M |
| utoo-npm | 46.0K | 21.1K | 323K | 29K | 1.68G | 1.68G | 2M |
| utoo | 40.8K | 18.9K | 309K | 11K | 1.68G | 1.68G | 2M |
npmmirror.com: no output captured.
What
Smaller alternative to #2911. Two install hot-path changes layered on top of the new (post-#2905) zero-copy + par_chunks(64) baseline:

1. TTY-gate progress bar (carried from the `experiment/install-tty-gate` parent commit): non-TTY callers skip indicatif's internal mutex, removing ~9k atomic ops per install.
2. Single `consume_budget().await` at the top of the `install_packages` spawn loop: gives the tokio runtime a per-iteration window to drain socket reads on in-flight tarball downloads, replacing the implicit yield that the indicatif mutex used to provide.

Why this and not the bigger #2911 partition

#2911's classify + 3-phase pipeline did fix the TCP-level starvation (zwin 14 → 0), but the N=4 phases bench measured a consistent +1.05s p3 regression. Mechanism: the partition design pushed all cheap paths (omit / cpu-incompat / `file:` link) before any spawns, then opened Phase 3 with 64 in-flight downloads simultaneously. That concentrated disk burst widened p3 σ vs the baseline's interleaved cheap+heavy schedule.

This PR keeps the original interleaved schedule (cheap paths inline with spawns) and only adds the runtime-yield hint. Same TCP fix, smaller change, no concentrated burst.
Pcap mechanism evidence (carried over from partition experiments)
`consume_budget` was added in tokio 1.41 and is the right primitive for this case: ~5ns when it doesn't yield (the common case), ~100ns when the per-task tick budget exhausts. utoo pins tokio 1.51.
🤖 Generated with Claude Code