experiment(pm): swap DiskManifestStore for NoopStore (measurement-only) #2918
…down
p1_resolve has been ~0.9s behind bun on the phases bench for the past
several PRs. A pcap of prior runs showed bun opening ~260 parallel
TCP streams against registry.npmjs.org during resolve, while utoo
opened ~70 (the manifests-concurrency-limit cap of 64 was at saturation).
Adding fetch-breakdown timing in ruborist showed where p1's 22s
(local Mac) actually goes:
fetch-timings: n=2730
sum_request = 1089s (88% — TCP+TLS+HTTP RTT to first byte)
sum_body = 138s (11% — body download)
sum_parse = 2s (0.16% — simd_json on rayon)
The dominant cost is per-request RTT, not parsing or body transfer.
The lever is the cap on concurrent in-flight requests.
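As a rough sanity check on that lever: with request RTT dominating, preload wall is approximately total request time divided by the in-flight cap. A sketch of the arithmetic (illustrative only, not code from this PR):

```rust
// Back-of-envelope capacity model for an RTT-dominated preload phase.
// Numbers come from the fetch-timings breakdown above.
fn expected_wall_secs(n_fetches: f64, avg_request_secs: f64, concurrency: f64) -> f64 {
    // Each in-flight slot serves one request at a time, so:
    // wall ≈ (n_fetches × avg_request) / concurrency
    n_fetches * avg_request_secs / concurrency
}
```

With sum_request = 1089s over 2730 fetches, a cap of 64 predicts ~17s of preload wall (the observed 22s adds body and dispatch overhead on top), while a cap of 256 predicts ~4.3s.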
This commit:
1. Adds `crates/ruborist/src/util/timing.rs` — process-wide atomic
accumulator that records per-fetch (request_us, body_us,
parse_us, bytes) inside both `fetch_full_manifest` and
`fetch_version_manifest`. Reset before each preload phase, dumped
at INFO level after preload + bfs.
2. Bumps `manifests-concurrency-limit` default 64 → 256 to match
bun's observed working point against npmjs.org.
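A minimal sketch of such an accumulator using only std atomics (field and method names here are illustrative; the real module is `crates/ruborist/src/util/timing.rs`):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide, lock-free timing accumulator: every fetch records into
// shared atomic counters; a reset runs before each preload phase.
#[derive(Default)]
pub struct FetchTimings {
    n: AtomicU64,
    request_us: AtomicU64,
    body_us: AtomicU64,
    parse_us: AtomicU64,
    bytes: AtomicU64,
}

impl FetchTimings {
    // Called once per fetch from the manifest fetchers.
    pub fn record(&self, request_us: u64, body_us: u64, parse_us: u64, bytes: u64) {
        self.n.fetch_add(1, Ordering::Relaxed);
        self.request_us.fetch_add(request_us, Ordering::Relaxed);
        self.body_us.fetch_add(body_us, Ordering::Relaxed);
        self.parse_us.fetch_add(parse_us, Ordering::Relaxed);
        self.bytes.fetch_add(bytes, Ordering::Relaxed);
    }

    // Reset before each preload phase; dump at INFO after preload + bfs.
    pub fn reset(&self) {
        for f in [&self.n, &self.request_us, &self.body_us, &self.parse_us, &self.bytes] {
            f.store(0, Ordering::Relaxed);
        }
    }

    pub fn count(&self) -> u64 {
        self.n.load(Ordering::Relaxed)
    }

    pub fn sum_request_us(&self) -> u64 {
        self.request_us.load(Ordering::Relaxed)
    }
}
```

Relaxed ordering is enough here: the counters are statistical sums, not synchronization points.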
CI bench will validate. Expected: p1 utoo wall drops toward bun's
range (~2.3s on GHA).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes, prompted by the GHA bench on the previous commit
(PR #2916, run 25559625024) showing the concurrency=256 hypothesis
was wrong in GHA's environment.

Revert concurrency 256 → 64
---------------------------
The new fetch-timing instrumentation shipped in the previous commit
caught the surprise: the profile on GHA is the *opposite* of what
local Mac measurements suggested.

    metric       local Mac   GHA Linux
    avg_request  399ms       70ms    ← network MUCH faster on GHA
    avg_body     50ms        20ms
    avg_parse    730µs       266ms   ← parse 365× SLOWER on GHA

Mechanism: `parse_json_off_runtime` dispatches to `rayon::spawn`,
and rayon's pool size is `num_cpus` (= 2 on GHA ubuntu-latest).
Bumping concurrency 64 → 256 queued 256 manifest parses behind 2
rayon workers — head-of-line blocking. avg_parse jumped from ~10ms
to 266ms wall, dragging p1 utoo wall from 3.10s up to 3.33s.

Restore manifest-bench
----------------------
Brought back `crates/manifest-bench` (originally landed in the
post-#2818 driver hunt, dropped in af714eb once #2818 graduated).
It's a single-binary HTTP-only fetch tool that strips out the
ruborist pipeline (no BFS, no dedup, no parse, no project cache,
no lockfile write) — it fires `GET <registry>/<name>` in parallel
and reports the same diag shape as the new `p1-breakdown` lines.

Goal: separate the network ceiling from the resolver pipeline so
the next round of p1 experiments (parse offload, partial parse,
dedicated parse pool, etc.) can be evaluated against a stable
"pure network" baseline.

Knobs (unchanged from the original drop):
    --concurrency N     sweep without rebuilding utoo
    --reps N            run the same workload back-to-back
    --single-version    use /<name>/latest (smaller bodies)
    --user-agent X      UA-fingerprint experiments
    --http1-only        H2 vs H1 toggle
    --accept X          override Accept header

Same TLS stack as ruborist (rustls + aws-lc-rs, native roots).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inux

build-linux now also builds + uploads `manifest-bench` when a phases
bench is going to run (label or dispatch). bench-phases-linux
downloads the binary and runs it after the regular phase-isolated
benchmark.

Sweep mirrors the original (#2818-era) wire-in:
    concurrency: 32 / 64 / 96 / 128 / 192 / 256 (HTTP/1.1, full manifest)
    protocol:    H1 vs H2-negotiate (cap=128)
    endpoint:    full vs `/<name>/latest` (cap=128, smaller bodies)
    UA:          default vs `Bun/1.2.21` (cap=128)

Output goes to /tmp/pm-bench-output/manifest-bench-npmjs.log and
ships in the existing pm-bench-logs-linux artifact — no PR comment
surface (the headline phases bench comment stays the same).

Why now: the new ruborist `p1-breakdown` instrumentation showed
sum_parse on GHA can dominate when concurrency is bumped (at 256:
sum_parse 728s vs sum_request 193s). To attribute the bun-vs-utoo
gap on p1_resolve we need a "pure HTTP" baseline that strips out
ruborist's parse / BFS / dedup / lockfile path. manifest-bench is
that baseline: same TLS stack as ruborist (rustls + aws-lc-rs,
native roots), no resolver pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI fetch-breakdown on GHA (run 25562552058, conc=64) showed parse
queueing on rayon dominates the gap to manifest-bench's pure-HTTP
baseline:
manifest-bench (pure HTTP, conc=64): 2.12s wall
utoo p1 (full ruborist): 3.10s wall ← +1.0s overhead
↑ sum_parse 95s vs sum_request 95s, parse 50% of work-time
↑ avg_parse 30ms wall vs ~5ms actual CPU — the 25ms extra is rayon
queue wait
Mechanism: 64 concurrent tasks all dispatching parse to rayon's pool
(size = num_cpus = 2 on GHA). Queue depth grows to ~32 per worker.
Each parse waits 25ms+ in queue before running its 5ms of CPU work.
Round 1 fix: inline parse, drop the rayon hop. simd_json on a tokio
worker thread is fast (~5ms for 115KB JSON), and the tokio runtime's
cooperative budget naturally rebalances CPU across the 64 tasks.
Expected on next CI:
- avg_parse drops from 30ms wall → ~5-10ms wall (close to CPU-only)
- preload_wall drops from 5.4s → ~3.5-4s for cold runs
- p1 hyperfine wall drops from 3.10s → 2.3-2.5s, narrowing the gap
to manifest-bench's 2.12s ceiling
If parse becomes the new bottleneck (CPU-bound), next round could
look at partial parse / lazy field access. If wall doesn't drop,
hypothesis is wrong and we look elsewhere (BFS, dedup, lockfile).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 1 (inline parse) reverted on data: GHA showed a +0.37s p1
regression because parse blocked tokio runtime workers, dropping
eff_parallel 42 → 35 even though per-fetch work-time fell.
avg_request rose from 35ms → 52ms — symptomatic of socket reads
being delayed by the parsing task on the same worker.

    metric        round 0 (rayon)   round 1 (inline)
    p1 wall       3.27s             3.64s ⚠️ +0.37s
    avg_parse     30ms (queued)     300µs ✓
    avg_request   35ms              52ms ⚠️ +17ms (worker contention)
    eff_parallel  42                35 ⚠️

Round 2 attempts the third option: `tokio::task::spawn_blocking`.
- rayon's pool was too small (num_cpus = 2 on GHA) — 64 concurrent
  parses queued behind 2 workers, parse wall 30ms.
- inline parse held the tokio worker hostage during the simd_json
  call, starving in-flight socket reads.
- tokio's blocking pool has a much larger default cap (512), so 64
  concurrent parses never queue. Unlike rayon there's no contention
  with the install path's parallel-write rayon usage. Unlike inline,
  the tokio runtime workers stay free to drive network I/O.

Expected on next CI:
- avg_parse drops to ~5-10ms wall (close to the CPU floor, no queue)
- avg_request stays ~35ms (workers free for I/O)
- eff_parallel returns to ~50, possibly higher
- p1 wall drops toward manifest-bench's 2.10s ceiling

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
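Why the larger pool avoids queueing can be modeled with plain std threads standing in for tokio's blocking pool (a toy model under that assumption, not the real runtime code):

```rust
use std::thread;

// Toy model: when the worker pool exceeds the number of concurrent
// jobs (tokio's blocking pool defaults to 512 threads vs 64 in-flight
// parses), every short CPU job starts immediately — no head-of-line
// wait behind a 2-thread rayon pool, and no runtime worker is blocked.
fn parse_all_unqueued(bodies: Vec<Vec<u8>>) -> usize {
    let handles: Vec<_> = bodies
        .into_iter()
        // One worker per job; thread::spawn stands in for spawn_blocking,
        // and body.len() stands in for the simd_json parse.
        .map(|body| thread::spawn(move || body.len()))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

The kernel scheduler still bounds real CPU parallelism at num_cpus; the point is only that no job waits in a software queue behind a too-small pool.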
Round 2 moved parse_json_off_runtime off rayon (-0.11s p1). But
fetch-breakdown still showed avg_request 41ms vs round 0's 35ms,
hinting at a second source of rayon contention.

Found it: `extract_core_version_off_runtime` is also on
`rayon::spawn`. On npmjs.org's `!supports_semver` path EVERY fetch
resolves through `resolve_via_full_manifest`, which fetches the full
packument once per package name (deduped via inflight_full) and then
calls `extract_core_version_off_runtime` per (name, spec) to
materialize the chosen version into a `CoreVersionManifest`.

So per fetch we hit rayon TWICE — once for the JSON parse (round 2
moved it to spawn_blocking), and once for `get_core_version` (still
on rayon). The second hop has the same head-of-line blocking
signature as the first: 64 concurrent resolves dispatching to a
2-thread rayon pool.

Round 3: move extract_core_version_off_runtime to spawn_blocking for
the same reasons. The work is a lazy JSON re-parse (`raw_json`
sub-tree decoding) — genuinely blocking, well-suited to tokio's
blocking pool.

Expected: utoo p1 wall drops further toward manifest-bench's 2.10s
ceiling. avg_request should fall back from 41ms → ~35ms (rayon
contention removed from the fetch task's await chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes for round 4 of p1 optimization:
1. Revert `extract_core_version_off_runtime` from spawn_blocking back
to rayon::spawn (round 3). Within-run measurement showed +0.42s
regression vs utoo-next (round 2 was +0.11s). Likely cause: this
function is called per (name, spec), so multi-spec packages call
it 2-5x per fetch. spawn_blocking's per-dispatch overhead exceeds
rayon queue savings at this multiplier.
2. Add `serialize_us` and `cache_export_us` to the p1-breakdown line
so we can attribute the remaining gap. Currently:
manifest-bench wall: 2.10s (pure HTTP ceiling)
utoo p1 wall (round 2): 3.16s
gap: 1.06s
We have:
preload_wall ≈ 2.7s (logged)
bfs_wall ≈ 0.3s (logged)
serialize_us ?
cache_export_us ? ← suspected: full manifest deep-clone
into ProjectCacheData for ~2730 entries
Next round will have data to choose between attacking serialize,
cache export, or the BFS loop body.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 4 measured serialize_us = 15ms and cache_export_us = 34ms — both
tiny — confirming the 1s gap from manifest-bench (utoo p1 = 3.16s vs
mb wall = 2.10s) is not in post-build code.
Per-fetch math also pointed at main-loop bookkeeping:
manifest-bench: eff_parallel = 52 (sum_work 111s / wall 2.14s)
utoo preload : eff_parallel = 43 (sum_work 120s / wall 2.85s)
Same conc=64 cap, but utoo loses 9 effective slots — most likely
the main loop's serial bookkeeping (dedup hash insert, format!
key, extract_transitive_deps, queue push, 3-4 receiver events)
holds the flow between futures.next() returning and the next
fetch dispatch.
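For reference, the eff_parallel figure used in these breakdowns is simply total per-fetch work time divided by wall time (a sketch of the derivation, not the instrumentation itself):

```rust
// Average number of usefully-busy in-flight slots out of the
// concurrency cap: if 64 slots were always busy, sum of per-fetch
// work time would equal 64 × wall.
fn eff_parallel(sum_work_secs: f64, wall_secs: f64) -> f64 {
    sum_work_secs / wall_secs
}
```

For manifest-bench above: 111s / 2.14s ≈ 52 busy slots out of the 64 cap; any serial bookkeeping between completions shows up directly as lost slots.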
This commit splits the main loop into two timed segments:
preload_loop_dispatch_us: time spent in the `while in_flight <
concurrency` block — popping pending,
dedup check, futures.push.
preload_loop_result_us: time spent processing each completed
future — extract_transitive_deps,
pending.extend, on_manifest.
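The two segments can be sketched like this (variable names and the synthetic workload are illustrative; the real loop awaits fetch futures between the segments):

```rust
use std::time::Instant;

// Two timed segments around a work loop: dispatch bookkeeping vs
// result processing. Returns (dispatch_us, result_us, completed).
fn run_loop(items: Vec<u32>) -> (u128, u128, usize) {
    let mut dispatch_us: u128 = 0;
    let mut result_us: u128 = 0;
    let mut pending = items;
    let mut completed: Vec<u32> = Vec::new();

    loop {
        // preload_loop_dispatch_us: pop pending, dedup check, push future.
        let t = Instant::now();
        let next = pending.pop();
        dispatch_us += t.elapsed().as_micros();
        let Some(item) = next else { break };

        // (the fetch itself runs concurrently, outside both segments)

        // preload_loop_result_us: process the completed future.
        let t = Instant::now();
        completed.push(item); // stand-in for extract_transitive_deps etc.
        result_us += t.elapsed().as_micros();
    }
    (dispatch_us, result_us, completed.len())
}
```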
If dispatch+result sum approaches preload_wall, the main loop is
the bottleneck and we need to either (a) split processing onto a
dedicated task, or (b) use unbounded futures with a downstream
consumer. If they're small, the gap is elsewhere (per-task
overhead in resolve_package's inflight gates).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 5 main-loop instrumentation showed the preload main loop itself
is fast (15-25ms total dispatch+result). The 0.8s gap from
manifest-bench's 2.10s wall lives INSIDE the spawned fetch tasks.

Per-fetch wall (warm runs):
    measured: avg_request 30ms + avg_body 6ms + avg_parse 2.5ms = ~38ms
    derived:  preload_wall 2.4s × eff_parallel(43) / 2730 = ~38ms
    delta:    ~12ms unaccounted per task

That 12ms is `extract_core_version_off_runtime` queueing on rayon's
2-thread pool. extract is called per (name, spec) — for ant-design
that's ~3000+ calls. With pool=2 and 64 concurrent fetches each
dispatching extract, the queue depth grows; each task waits its turn
before extract returns.

Bump the rayon pool to `max(num_cpus, 8)` on non-Windows. Sizing the
pool above the CPU count for short blocking JSON ops (parse +
extract) replaces FIFO queueing with parallel dispatch. Real CPU
contention is bounded by num_cpus (the kernel scheduler still
gates), so the extra pool threads just hold ready-to-run dispatches
in parallel rather than serialised in a queue.

Why not just spawn_blocking (the round 3 attempt): tokio's blocking
pool defaults to 512 threads, but its per-dispatch overhead was
higher than rayon's even with queueing — round 3 regressed by 0.5s.

Expected: extract queue wait drops from ~12ms to ~1-2ms wall, and p1
preload_wall narrows toward manifest-bench's 2.10s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
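The sizing rule itself is a one-liner; a std-only sketch, with the rayon builder call left as a comment since how the pool is installed is an assumption here:

```rust
use std::thread;

// max(num_cpus, 8): oversize the pool for short blocking JSON ops so
// dispatches run in parallel instead of queueing FIFO behind num_cpus
// workers. Real CPU contention stays bounded by the kernel scheduler.
fn rayon_pool_size() -> usize {
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    cpus.max(8)
    // Presumably fed into something like:
    // rayon::ThreadPoolBuilder::new().num_threads(size).build_global()
}
```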
Adds `BuildDepsOptions::skip_preload` so callers without a pipeline
consumer (utoo deps / package-lock-only) can drop the up-front
preload phase entirely. BFS now batches prefetch per level across
the whole frontier, then runs the existing sequential
process_dependency walk against the warmed cache.

For install paths (Context::pipeline_deps_options), skip_preload
stays false so PackageResolved events still feed the download/clone
pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds resolver::fast_preload, a manifest-bench-style flat
FuturesUnordered over service::manifest::fetch_full_manifest. It
warms MemoryCache (both full_manifests and version_manifests slots)
synchronously after each fetch, so the BFS phase is pure cache-hit:
no rayon hop on extract_core_version, no OnceMap gates, no
DiskManifestStore writes, no PackageResolved events.

Wired into service::api::build_deps: when the caller asks to skip
preload (Context::build_deps for `utoo deps`) and there's no warm
project cache, fast_preload runs ahead of build_deps_with_config.
Install paths still go through preload_manifests so the pipeline
keeps its early-start signal.

Also reverts the per-level prefetch I added in 394f6c9 — with
fast_preload pre-warming everything, BFS doesn't need its own
prefetch wave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1 of fast_preload called settle_spec inline on the tokio worker —
each settle ran simd_json::to_borrowed_value over the full
manifest's raw bytes (5-10ms per spec) right on the runtime thread.
CI showed it starved sibling fetches: avg_request rose +3ms,
avg_parse jumped 5 → 11ms, and p1_resolve regressed +1.0s vs the
preload+BFS baseline (4.0s vs 3.0s).

Fix: route every settle through extract_core_version_off_runtime
(the same rayon::spawn helper the BFS path uses), and merge fetch
and settle completions into a single FuturesUnordered so
backpressure on either side throttles the other. Sibling specs that
arrived during a fetch are now stashed by name (HashMap, not a
linear scan), then dispatched as their own settle futures when the
fetch lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone manifest-bench HTTP-only sweep (npmjs, h1) shows wall
bottoming out at concurrency=96 (1817ms). The earlier 256 regression
was caused by rayon-queued parses behind 2 workers, which no longer
applies now that fetch parse is on spawn_blocking and settle is
rayon-dispatched off the runtime.

fast_preload's wave-shaped transitive walk currently runs at
eff_parallel ~35 against the 64 cap because pending refills lag
settles; raising the cap to 96 gives headroom for sustained
in-flight on the deep waves without crossing the npmjs per-IP
tail-latency cliff that conc 128+ trips.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

`UnifiedRegistry::resolve_version_manifest`'s first cache check
(service/registry.rs:347) keys on `(name, spec)` — the original spec
string the caller passed, e.g. `^4.0.0`. settle_future was only
populating `(name, resolved_version)` (e.g. `4.17.21`), so on every
BFS edge for `lodash@^4.0.0`-style specs the warm path missed and
fell into the OnceMap inflight gate + `resolve_via_full_manifest`
re-walk before recovering the manifest from the
`(name, resolved_version)` slot we'd already set.

Now settle writes both keys, so BFS hits the early-return at
service/registry.rs:347 with no further dispatch. Saves ~1
OnceMap+resolve_target_version round-trip per unique (name, spec)
the BFS encounters (≈3000 calls on ant-design-x).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
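A toy model of the double write (types are stand-ins; the real slots live in ruborist's MemoryCache):

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Version-manifest cache keyed on (name, key), where key is either the
// original range spec ("^4.0.0") or the resolved version ("4.17.21").
// String stands in for the real CoreVersionManifest.
type VersionCache = HashMap<(String, String), Arc<String>>;

// Settle writes BOTH keys so the BFS early-return hits whichever form
// of the key the caller passes.
fn settle(cache: &mut VersionCache, name: &str, spec: &str, resolved: &str, manifest: Arc<String>) {
    cache.insert((name.into(), spec.into()), Arc::clone(&manifest));
    cache.insert((name.into(), resolved.into()), manifest);
}
```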
Previous fast_preload (v2) dispatched primary settles to rayon as
separate FuturesUnordered futures. CI breakdown showed eff_parallel
~44 against the conc=96 cap — the wave-shaped transitive walk was
held back by settle dispatch RTT: each fetch landed → primary settle
queued → settle popped → only then did `pending` get transitive deps
and fill the next dispatch wave.

v3 folds the primary settle into the fetch task itself via
`tokio::task::spawn_blocking`. The fetch task does the network
round-trip and the primary version-extract on the same blocking-pool
slot, then returns with the resolved CoreVersionManifest attached.
The main loop pulls one Fetched event and immediately extends
`pending` — no second `next().await` to wait through the queue.
Sibling specs (rare; same name, different range) still go through
the rayon settle_future path so the primary path stays lean.

Carries primary_spec through FastEvent so the fused path can
populate both `(name, primary_spec)` and `(name, resolved_version)`
cache slots — preserving the 6455852 BFS fast-path win. A
FetchOutcome enum replaces the by-value FetchManifestResult to avoid
a full FullManifest clone (HashMap+Vec) per fetch event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…json
The fast_preload hot path was paying TWO simd_json passes per
manifest:
1. fetch_full_manifest's parse_json_off_runtime did a typed
simd_json::serde::from_slice<FullManifest> (envelope + IgnoredAny
visitor on `versions` keys, ~3-5ms on a 100KB body).
2. Primary settle re-parsed the same raw bytes with
simd_json::to_borrowed_value (~5-10ms) to extract one version's
subtree.
Both passes went through simd_json's Tape constructor — duplicated
work. CI showed avg_parse 5-7ms × 2700 fetches = 14-19s of CPU sum
on 2-core GHA, where the spawn_blocking pool's overlapping schedule
masked some of the cost but not all.
Adds `service::manifest::fetch_full_manifest_with_settle`: same HTTP
+ retry + ETag machinery as `fetch_full_manifest`, but the parse
step does ONE `to_borrowed_value` and extracts:
* envelope (`name`, `dist-tags`, `versions` keys) into FullManifest
manually (no typed serde), and
* the resolved version's subtree as a typed CoreVersionManifest
(serde-deserializing that single subtree via the borrowed value).
fast_preload's fetch task switches to this entry point — primary
settle is now a free byproduct of the fetch parse, not a separate
`to_borrowed_value` pass. Sibling specs (same name, different
range) still go through the rayon settle_future path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After 671ac98's combined-parse fetch path eliminated the double
simd_json pass, the spawn_blocking pool's contention ceiling rose
enough that bumping concurrency past 96 no longer queues parses
behind the 2-core CPU. manifest-bench's most recent good-network
sweep on GHA showed conc=128 hitting 1500ms vs conc=96 at 1566ms —
small but real headroom for fast_preload's late-wave saturation now
that initial waves fill faster.

Risk: on slower-network runs (npmjs per-IP throttle), conc=128
widens p99. Earlier conc-sweep data was mixed — accepting that
variance for the average-case improvement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
542d7f1's conc=128 bench landed in a slow-network run (mb best
2010ms vs 1500ms in the prior good-network run; bun also bumped to
2.14s vs 1.83s). The adjusted gap to mb best stayed flat (~700ms
either way), so conc=128 didn't beat 96 across runs.

Picking 96 as the conservative default: at or near best on every
GHA run we've measured, never the worst, and it leaves headroom for
npmjs's per-IP throttling to absorb without compounding p99.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…preload)
Adds resolver::mb_resolve module + service::build_deps_mb entry point
as a parallel-track alternative to fast_preload, structured to
match manifest-bench's main-loop shape as closely as correctness
allows. Hypothesis under test: fast_preload's eff_parallel caps at
~50/96 because the FastEvent enum match + cache writes + sibling
deferred bookkeeping in the main loop competes with tokio runtime
workers for the 2 CPU cores on GHA, stalling socket I/O drive.
mb_fetch pushes ALL per-fetch work into the spawned future itself
(including cache writes), so the main loop is reduced to:
while let Some(deps) = futs.next().await {
pending.extend(deps);
refill_to_cap(...);
}
Sibling specs (multiple ranges on same package) are NOT deferred at
queue level — racing fetches for the same name both proceed. The
race converges naturally: first fetch to land populates
full_manifests, subsequent racers find the cache hit on entry and
short-circuit to a sibling-style settle. Wastes ~5-50 network
requests in real workloads but eliminates the HashMap probe + drain
overhead from the hot loop.
Wired in via UTOO_RESOLVE=mb env var:
- Context::build_deps (utoo deps) routes through build_deps_mb
- pipeline::resolve_with_pipeline (utoo install) also routes
through it; pipeline workers still start but don't pipeline
during fetch (mb_fetch emits no PackageResolved events) — install
becomes phase-sequential, useful for resolve-phase A/B.
bench script enables UTOO_RESOLVE=mb so CI measures the new path
against existing baselines (utoo-next/utoo-npm/bun ignore the env
var). Comment the export line to A/B back against fast_preload.
Old fast_preload + UnifiedRegistry paths untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v1/v2 ran parse work in spawn_blocking inside each fetch future,
which competed with tokio runtime workers for the 2 GHA cores. CI
showed eff_parallel capped at 47/96 vs manifest-bench standalone's
75/96 on the same box. Hypothesis: parse CPU starves socket drive.

v3 separates the two phases:

* PHASE 1 — `mb_style_pure_fetch` is a structural copy of
  `manifest-bench`'s main loop: the future body does ONLY GET +
  body recv, refilling 1-for-1 on completion. Zero per-future CPU
  work, so tokio runtime workers retain full CPU for socket drive.
* PHASE 2 — bulk rayon par_iter parse: for each body, parse the
  `FullManifest` envelope via simd_json::to_borrowed_value, resolve
  every queued spec for this name against the just-parsed manifest,
  populate cache slots, and collect transitive deps. Runs off the
  tokio runtime entirely (spawn_blocking → rayon par_iter).

Phases alternate until pending is exhausted. A typical project takes
3-5 iterations as the dep tree fans out wave by wave.

The point of the split is the `phase1_http_wall` trace — measured in
isolation from any parse work, it should match manifest-bench's
standalone wall (~1.5-2.0s for 2733 names @ conc=96). If it does,
the remaining gap to mb is concentrated in phase 2 work, which is
inherent to discovering transitive deps from a non-flat name list.

Tracing per iteration:
    p1-breakdown mb_fetch iter=N phase1_http_wall=Xms n=Y bytes=Z
    p1-breakdown mb_fetch iter=N phase2_parse_wall=Xms settles=Y new_transitives=Z
    p1-breakdown mb_fetch total_wall=Xms iters=Y

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
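The alternation can be modeled on a toy dependency graph (the graph and names are stand-ins; real phase 1 is network I/O and real phase 2 is a rayon par_iter):

```rust
use std::collections::{HashMap, HashSet};

// Wave-by-wave walk: each iteration "fetches" the whole frontier
// (phase 1), then "parses" it to discover new transitive deps
// (phase 2), until the frontier is empty.
// Returns (total names fetched, iterations taken).
fn wave_walk(graph: &HashMap<&str, Vec<&str>>, roots: Vec<&str>) -> (usize, usize) {
    let mut done: HashSet<&str> = HashSet::new();
    let mut pending: Vec<&str> = roots;
    let mut iters = 0;
    while !pending.is_empty() {
        iters += 1;
        // PHASE 1 stand-in: take the whole frontier, deduped by name.
        let wave: Vec<&str> = pending.drain(..).filter(|n| done.insert(*n)).collect();
        // PHASE 2 stand-in: per-body parse, collect new transitives.
        for name in wave {
            if let Some(deps) = graph.get(name) {
                pending.extend(deps.iter().copied().filter(|d| !done.contains(*d)));
            }
        }
    }
    (done.len(), iters)
}
```

The iteration count corresponds to the dep tree's wave depth, matching the 3-5 iterations described above for typical projects.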
v3 dropped the (name, spec) HashSet from v1/v2 on the assumption
that name-level dedup via done_names was sufficient. It wasn't:
sibling-settle's extract_transitive can re-introduce specs we've
already settled (peer/optional dep cycles trivially trigger this),
so the outer while-loop never terminated. CI 25589397823 hung on
`Run phase-isolated benchmark · npmjs` for ~25 min before being
cancelled — the bench's first utoo p1_resolve hyperfine run got
stuck in an infinite settle loop.

Fix: maintain `seen_specs: HashSet<(String, String)>` across all
iterations; filter both the initial seed and every wave of new
transitives through it before extending pending_specs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
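The shape of the fix, sketched with an illustrative function name:

```rust
use std::collections::HashSet;

// Filter a wave of (name, spec) pairs through a persistent seen-set.
// HashSet::insert returns false for pairs already seen, so
// re-introduced specs (peer/optional dep cycles) are dropped and the
// outer loop is guaranteed to terminate.
fn filter_new(
    seen: &mut HashSet<(String, String)>,
    wave: Vec<(String, String)>,
) -> Vec<(String, String)> {
    wave.into_iter().filter(|spec| seen.insert(spec.clone())).collect()
}
```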
New crate `crates/preload-bench/` is a fully standalone bench that:

* Uses the SAME HTTP setup as `manifest-bench` (own reqwest::Client
  built per rep with aws-lc-rs TLS, pool_max_idle_per_host(256), no
  proxy, default DNS, no retry, h1_only).
* Discovers names by walking transitive deps from a package.json
  root — instead of consuming a flat name list like manifest-bench.
* Per future, does GET + body recv + spawn_blocking parse → returns
  transitive deps → main loop refills on completion.
* Has no dependency on ruborist or any utoo internals (own
  simd_json, own dedup, own everything).

The point: prove (or disprove) that a fully ruborist-independent
streaming preload can hit standalone manifest-bench's wall on the
same workload. ruborist's path runs at ~2.18s for ant-design's
~2700 names; manifest-bench standalone runs the same workload at
~1.6s. The gap could be in any number of things — DNS layer, retry,
pool config, parse-CPU contention, registry single-flight gates.
preload-bench eliminates all of those simultaneously so we can read
the wall directly.

Wired into bench-phases-linux: builds + uploads the preload-bench
binary alongside manifest-bench, then runs a conc=64/96/128 sweep
against the same project after the standalone manifest-bench sweep.
The bench script reverts UTOO_RESOLVE=mb so utoo runs the default
fast_preload — giving a third datapoint (utoo wall on the
integrated path) alongside manifest-bench (HTTP-only ceiling) and
preload-bench (streaming-with-walk ceiling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y path
Step 1 of staged service-layer ablation. Rewrites mb_resolve as a
fully self-contained streaming preload mirroring preload-bench's
loop shape verbatim, but living inside ruborist so it can populate
MemoryCache for the BFS phase.
Bypasses every other ruborist service layer:
* service::http::get_client — own reqwest::Client built per call,
no global LazyLock, no shared_resolver dns layer, no
connect_timeout, pool_max_idle_per_host(256).
* service::manifest::fetch_full_manifest_with_settle — own GET +
body.bytes() + spawn_blocking(simd_json::to_borrowed_value),
no RetryIf, no FETCH_TIMINGS.
* service::registry::UnifiedRegistry — no OnceMap, no
ManifestStore, no EventReceiver.
Only service::* touched is MemoryCache writes (DashMap inserts) so
BFS has data to read from.
PM is unaware: dispatch happens entirely inside
service::api::build_deps when skip_preload=true and no warm cache.
Removes the previous UTOO_RESOLVE=mb env-var gating from
pm::helper::ruborist_context::Context::build_deps and
pipeline::resolve_with_pipeline. Removes the now-unused
service::api::build_deps_mb sibling entry point.
Expected: utoo p1_resolve drops from ~2.67s toward preload-bench's
~2.57s (or better since ruborist fetches fewer names than
preload-bench). The remaining gap to mb's ~1.99s would isolate
incremental layer effects we add back next:
- tokio runtime config / cooperative scheduling
- reqwest::Client provider differences (TLS, DNS)
- cache layer (DashMap vs DiskManifestStore reads on the cold path)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Side experiment branched off perf/p1-resolve-concurrency at 01d1513.
Question under test: how much of utoo's p1/p3 wall comes from the
per-fetch disk-cache existence-check IO that
service::registry::UnifiedRegistry issues alongside each manifest
fetch (store.load_versions / store.load_version_manifest +
fire-and-forget store.store_*)?

Swaps `Context::manifest_store` from `DiskManifestStore` to
`NoopStore`, which makes every store call a no-op without touching
the filesystem. Affects ALL paths that go through `Context`:

* `utoo deps` (lockfile-only): already bypasses UnifiedRegistry via
  mb_resolve, so no perf impact expected — confirms the baseline.
* `utoo install` (pipeline path): preload_manifests still goes
  through UnifiedRegistry, so this swap removes per-fetch disk IO
  from the install resolve phase. The p3_cold_install delta is the
  meaningful number.
* BFS edges that miss MemoryCache and fall into
  resolve_via_full_manifest: no disk fallback, so a cold cache miss
  falls straight to network instead of checking disk first.

NOT for landing — measurement-only branch. Compare against 01d1513
to read the disk-cache IO cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code Review
This pull request introduces performance optimizations and benchmarking tools for manifest preloading, including new benchmark crates and a lean mb_fetch path for lockfile-only resolutions. It optimizes JSON parsing by moving it to tokio::task::spawn_blocking and adds comprehensive timing instrumentation. Reviewers identified several bugs in the new preloading logic, such as potential leaks of sibling specs during fetch failures or cache misses, and misleading success reporting for failed HTTP requests. Additionally, feedback pointed out redundant cache writes and missing client configurations, such as timeouts and TLS settings, in the standalone fetcher.
    if out.fetched
        && let Some(siblings) = deferred_by_name.remove(&out.name)
        && let Some(raw) = body_cache.lock().get(&out.name).cloned()
    {
        for sibling_spec in siblings {
            futs.push(spawn_settle(
                out.name.clone(),
                sibling_spec,
                Arc::clone(&raw),
                cache.clone(),
                peer_deps,
            ));
        }
    }
There is a potential leak of sibling specs. If out.fetched is true but the body_cache lookup fails (e.g. due to a race or an earlier error), deferred_by_name.remove() still executes, consuming the siblings. Since they are neither processed nor put back, they are lost. The removal should be decoupled from the processing logic to ensure they are always cleared from the map even on failure.
Suggested change:

-if out.fetched
-    && let Some(siblings) = deferred_by_name.remove(&out.name)
-    && let Some(raw) = body_cache.lock().get(&out.name).cloned()
-{
-    for sibling_spec in siblings {
-        futs.push(spawn_settle(
-            out.name.clone(),
-            sibling_spec,
-            Arc::clone(&raw),
-            cache.clone(),
-            peer_deps,
-        ));
-    }
-}
+// Drain sibling specs deferred while the fetch was in flight.
+if let Some(siblings) = deferred_by_name.remove(&out.name) {
+    if out.fetched {
+        if let Some(raw) = body_cache.lock().get(&out.name).cloned() {
+            for sibling_spec in siblings {
+                futs.push(spawn_settle(
+                    out.name.clone(),
+                    sibling_spec,
+                    Arc::clone(&raw),
+                    cache.clone(),
+                    peer_deps,
+                ));
+            }
+        }
+    }
+}
    if let Some(siblings) = deferred_by_name.remove(&name) {
        for sibling_spec in siblings {
            futs.push(Box::pin(settle_future(
                name.clone(),
                sibling_spec,
                Arc::clone(&full_arc),
                cache.clone(),
                peer_deps,
            )));
        }
    }
Sibling specs are leaked if the primary fetch fails. The deferred_by_name.remove(&name) call is inside the FetchOutcome::Ok branch. If the outcome is Err, the siblings remain in the map forever, and since no more futures will be spawned for that name, they are never processed.
Suggested change:

-if let Some(siblings) = deferred_by_name.remove(&name) {
-    for sibling_spec in siblings {
-        futs.push(Box::pin(settle_future(
-            name.clone(),
-            sibling_spec,
-            Arc::clone(&full_arc),
-            cache.clone(),
-            peer_deps,
-        )));
-    }
-}
+// Sibling specs that were stashed while the
+// fetch was in flight: dispatch each as a
+// separate settle future.
+let siblings = deferred_by_name.remove(&name);
+if let Some(siblings) = siblings {
+    for sibling_spec in siblings {
+        futs.push(Box::pin(settle_future(
+            name.clone(),
+            sibling_spec,
+            Arc::clone(&full_arc),
+            cache.clone(),
+            peer_deps,
+        )));
+    }
+}
    fn build_mb_client() -> Result<reqwest::Client> {
        reqwest::Client::builder()
            .no_proxy()
            .pool_max_idle_per_host(256)
            .http1_only()
            .build()
            .context("build reqwest client for mb_resolve")
    }
The reqwest::Client in build_mb_client is missing important configurations that are present in the main ruborist HTTP client and the other benchmark tools in this PR. Specifically, it lacks timeouts and TLS root certificate configuration. Without these, the benchmark might hang indefinitely on network issues or fail to connect to registries using custom CAs.
```diff
 fn build_mb_client() -> Result<reqwest::Client> {
     reqwest::Client::builder()
         .no_proxy()
         .pool_max_idle_per_host(256)
         .http1_only()
+        .connect_timeout(std::time::Duration::from_secs(10))
+        .timeout(std::time::Duration::from_secs(30))
         .build()
         .context("build reqwest client for mb_resolve")
 }
```
```rust
    return FetchOutcome {
        name,
        transitives: Vec::new(),
        fetched: true,
    };
}
```
```rust
// Don't overwrite full_manifest — the original fetcher
// already set it. Only populate the version-manifest
// slots so BFS hits the (name, spec) early-return.
cache.set_full_manifest(name.clone(), full_arc);
cache.set_version_manifest(name.clone(), spec, Arc::clone(&core_arc));
cache.set_version_manifest(name.clone(), resolved, core_arc);
transitives
```
The code contradicts the comment: it overwrites `full_manifest` in the cache even though the comment states it shouldn't. The write is also redundant, since the original fetcher already populated this slot.
```rust
Some((_full_arc, resolved, core_arc, transitives)) => {
    // Don't overwrite full_manifest — the original fetcher
    // already set it. Only populate the version-manifest
    // slots so BFS hits the (name, spec) early-return.
    cache.set_version_manifest(name.clone(), spec, Arc::clone(&core_arc));
    cache.set_version_manifest(name.clone(), resolved, core_arc);
    transitives
}
```

```rust
if out.transitives.is_empty() && out.fetched {
    // Empty result from a fetch is ambiguous (no transitives
    // OR a fetch/parse failure). Track conservatively as
    // success — the FETCH_TIMINGS-equivalent counter is
    // omitted in this path on purpose to keep the future
    // body lean.
    stats.success += 1;
} else if out.fetched {
    stats.success += 1;
}
```
📊 pm-bench-phases

| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 8.78s | 0.16s | 9.98s | 9.64s | 761M | 329.6K |
| utoo-next | 8.04s | 0.08s | 10.14s | 11.58s | 949M | 119.1K |
| utoo-npm | 9.85s | 2.55s | 10.57s | 11.99s | 1.36G | 176.6K |
| utoo | 7.82s | 0.47s | 10.17s | 11.18s | 1.35G | 187.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 15.1K | 17.5K | 1.17G | 6M | 1.84G | 1.72G | 1M |
| utoo-next | 122.9K | 80.1K | 1.14G | 5M | 1.68G | 1.68G | 2M |
| utoo-npm | 131.5K | 88.6K | 1.14G | 5M | 1.68G | 1.68G | 2M |
| utoo | 102.4K | 91.3K | 1.14G | 5M | 1.68G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.90s | 0.09s | 3.87s | 1.05s | 490M | 170.6K |
| utoo-next | 5.07s | 1.85s | 5.05s | 2.02s | 609M | 82.0K |
| utoo-npm | 2.90s | 0.05s | 5.07s | 1.95s | 600M | 79.9K |
| utoo | 2.77s | 0.12s | 4.90s | 1.16s | 1.05G | 153.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 8.0K | 4.6K | 202M | 3M | 105M | - | 1M |
| utoo-next | 70.5K | 109.3K | 199M | 2M | 7M | 3M | 2M |
| utoo-npm | 69.4K | 112.2K | 199M | 2M | 7M | 3M | 2M |
| utoo | 43.9K | 7.4K | 199M | 3M | - | 3M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.80s | 0.14s | 6.14s | 9.35s | 608M | 201.5K |
| utoo-next | 5.82s | 0.08s | 4.88s | 10.10s | 449M | 56.5K |
| utoo-npm | 6.03s | 0.23s | 5.28s | 10.55s | 887M | 112.2K |
| utoo | 7.22s | 2.94s | 4.99s | 10.20s | 639M | 76.0K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.7K | 6.4K | 995M | 3M | 1.74G | 1.74G | 1M |
| utoo-next | 94.0K | 44.2K | 964M | 2M | 1.67G | 1.67G | 2M |
| utoo-npm | 102.5K | 61.3K | 964M | 2M | 1.67G | 1.67G | 2M |
| utoo | 101.2K | 70.9K | 964M | 3M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.49s | 0.06s | 0.17s | 2.46s | 137M | 31.6K |
| utoo-next | 2.26s | 0.19s | 0.47s | 3.73s | 79M | 18.1K |
| utoo-npm | 2.22s | 0.06s | 0.52s | 3.77s | 85M | 19.7K |
| utoo | 2.08s | 0.04s | 0.50s | 3.72s | 80M | 18.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 236 | 21 | 3K | 6K | 1.84G | 1.74G | 1M |
| utoo-next | 42.4K | 18.8K | 307K | 10K | 1.68G | 1.68G | 2M |
| utoo-npm | 46.8K | 20.3K | 324K | 16K | 1.68G | 1.68G | 2M |
| utoo | 42.7K | 20.5K | 309K | 13K | 1.68G | 1.68G | 2M |
npmmirror.com: no output captured.
Side experiment branched off perf/p1-resolve-concurrency at 01d1513.
Question under test: how much of utoo's p1/p3 wall comes from the per-fetch disk-cache existence-check IO that `service::registry::UnifiedRegistry` issues alongside each manifest fetch.
Swaps `Context::manifest_store` from `DiskManifestStore` to `NoopStore`. NOT for landing — measurement-only branch. Compare against #2916 to read the disk-cache IO cost.