[Feat] Add per-stage pipeline store and layerwise overlap metrics #933

Draft

dante159753 wants to merge 54 commits into ModelEngine-Group:develop from dante159753:pipeline-layerwise-metrics

Conversation

@dante159753 (Contributor) commented Apr 24, 2026

Purpose

Adds observability needed to diagnose pipeline store (Cache|Posix) per-tier performance and to verify that UCMLayerWiseConnector's load/forward/save overlap actually hides backend latency.

Modifications

Pipeline Store (C++ side):

  • Cache: per-task load/dump duration + bandwidth, queue wait, dispatch, backend-wait, H2D/D2H, backend-submit durations; lookup hit/miss counters and instantaneous hit rate gauge.
  • Posix: per-IO S2H/H2S duration + bandwidth, queue wait, failure counters.

Layerwise Connector (Python side):

  • wait_blocking_ms: primary signal for overlap health (near 0 = perfect overlap; if it tracks load_duration, overlap has degenerated to serial).
  • inter_wait_interval_ms, next_layer_submit_ms, first_layer_submit_ms, save_submit_ms, save_per_layer_wait_ms, save_tail_total_ms, stalled_layers_total.

Infrastructure:

  • Change the metrics library from STATIC to SHARED so cachestore.so, posixstore.so, and ucmmetrics.so share one Metrics singleton. With a STATIC metrics library, the function-local GetInstance() produced a separate instance in each .so, and all C++ UpdateStats() calls from the stores were silently dropped (see the sketch below).
  • Set INSTALL_RPATH=$ORIGIN/../../shared/metrics on cachestore.so and posixstore.so; $ORIGIN on ucmmetrics.so.
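For illustration, a minimal sketch of the singleton pattern involved (a hypothetical simplification, not the actual UC::Metrics class): with a function-local static like this linked from a STATIC archive, every .so carries its own copy of the instance, which is why store-side calls never reached the exporter's instance.

```cpp
#include <map>
#include <string>

// Hypothetical simplification of the metrics singleton, for illustration only.
class Metrics {
public:
    static Metrics& GetInstance() {
        // With a STATIC metrics library, this function-local static is duplicated in
        // every .so that links the archive, so cachestore.so, posixstore.so, and
        // ucmmetrics.so each see a different instance. Building the library SHARED
        // leaves exactly one definition in the process.
        static Metrics instance;
        return instance;
    }
    void UpdateStats(const std::string& name, double value) { stats_[name] += value; }

private:
    Metrics() = default;
    std::map<std::string, double> stats_;
};
```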

Test

dante159753 and others added 5 commits April 27, 2026 11:18
Adds observability needed to diagnose pipeline store (Cache|Posix)
per-tier performance and to verify that UCMLayerWiseConnector's
load/forward/save overlap actually hides backend latency.

Pipeline Store (C++ side):
  - Cache: per-task load/dump duration + bandwidth, queue wait, dispatch,
    backend-wait, H2D/D2H, backend-submit durations; lookup hit/miss
    counters and instantaneous hit rate gauge.
  - Posix: per-IO S2H/H2S duration + bandwidth, queue wait, failure
    counters.

Layerwise Connector (Python side):
  - wait_blocking_ms: primary signal for overlap health (near 0 = perfect
    overlap; if it tracks load_duration, overlap has degenerated to
    serial).
  - inter_wait_interval_ms, next_layer_submit_ms, first_layer_submit_ms,
    save_submit_ms, save_per_layer_wait_ms, save_tail_total_ms,
    stalled_layers_total.

Infrastructure:
  - Change metrics library from STATIC to SHARED so cachestore.so,
    posixstore.so, and ucmmetrics.so share one Metrics singleton. With a
    STATIC metrics library the function-local GetInstance() produced a
    separate instance in each .so and all C++ UpdateStats() calls from
    the stores were silently dropped.
  - Set INSTALL_RPATH=$ORIGIN/../../shared/metrics on cachestore.so and
    posixstore.so; $ORIGIN on ucmmetrics.so.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fold a short UpdateStats call back onto one line per clang-format 20.
Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds 20 new panels to examples/metrics/grafana.json covering the metrics
introduced in the previous commit:

Pipeline / Cache stage (9 panels):
  Hit Rate (full width), Load/Dump Duration + Bandwidth, Backend Wait,
  H2D / D2H durations, Backend Submit Ratio.

Pipeline / Posix stage (4 panels):
  S2H and H2S bandwidth and duration.

Layerwise Connector (7 panels):
  Wait Blocking (full-width key metric), Inter-Wait Interval, Stalled
  Layers Rate, First Layer Submit, Save Tail, Next Layer Submit, Save
  Per-Layer Wait.

Thresholds are set on the most actionable panels: Hit Rate (red < 0.5,
green >= 0.8), Backend Submit Ratio (green < 0.3, red >= 0.7), Wait
Blocking (green 0, red >= 20 ms), Save Tail (green 0, red >= 50 ms).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds docs/source/user-guide/metrics/performance_analysis.md covering
diagnosis of the Cache|Posix pipeline store in both layerwise and
non-layerwise mode using the per-stage and layerwise metrics.

Sections:
  1. Architecture and load/dump data flow with metric annotations.
  2. Critical metrics ranked by diagnostic priority.
  3. Nine bottleneck playbooks (low hit rate, slow loads, slow Posix,
     dump back-pressure, no layerwise speedup, layerwise TTFT
     regression, layerwise save tail, non-layerwise dump-bound,
     worker pool starvation) - each with metric signature and
     concrete tunables.
  4. Layerwise vs non-layerwise diagnostic differences.
  5. PromQL recipes for hit rate, miss ratio, p99 decomposition,
     overlap loss, dump back-pressure, worker utilization.
  6. Tunables indexed by symptom.
  7. Honest list of what these metrics cannot tell you.

Wires the new page into the User Guide toctree in docs/source/index.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous ASCII flowcharts in performance_analysis.md misaligned in
the Sphinx HTML output because the box-drawing characters and CJK
punctuation have inconsistent monospace widths. Replace them with
Mermaid flowcharts:

  - Storage tier overview (vLLM Worker → CacheStore → PosixStore)
  - LOAD path with per-stage metric annotations on each node
    (queue waits, dispatch, posix S2H, backend wait, H2D, epilog)
  - Cache-hit fast path
  - DUMP path showing the user-visible chain plus the asynchronous
    BackendDumpStage / Posix H2S branch with a dashed edge

Color-codes nodes by tier (Cache blue, Posix orange, completion green)
so the tier hand-offs are visible at a glance.

Wire-up:
  - Add sphinxcontrib-mermaid to docs/requirements-docs.txt.
  - Register the extension in docs/source/conf.py.
  - Set myst_fence_as_directive = ["mermaid"] so plain ```mermaid
    fences work both on GitHub (native rendering) and on Sphinx /
    ReadTheDocs.

Drop the now-unused 'promql' language tag from PromQL examples - the
default Pygments PromQL lexer rejects the colon in 'ucm:metric_name'
and emitted highlighting warnings on every build.

Verified locally with `sphinx-build -W`: my page now builds without
warnings; mermaid blocks render as <pre class="mermaid"> for the
client-side mermaid.js to pick up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dante159753 force-pushed the pipeline-layerwise-metrics branch from a56840a to 3fe0dd4 on April 27, 2026 03:21
dante159753 and others added 11 commits April 27, 2026 11:30
… range

The previous Posix bandwidth buckets (0.05, 0.1, 0.2, 0.5, 1, 2, 4, 8,
12, 16, 24, 32) had only four sample points across the entire range
where actual production performance lives, so p50/p90/p99 collapsed to
a single bucket and changes within the band were invisible.

New layout:
  - 0.05 / 0.1 / 0.2 / 0.5  -> degraded paths
  - 1, 1.5, 2, 2.5, 3, 3.5, 4  -> 0.5 GB/s steps (slow/saturated NVMe)
  - 5, 6, 7, 8, 9, 10, 11, 12  -> 1 GB/s steps (typical)
  - 14, 16, 20, 24, 32  -> sparse headroom

24 buckets total per metric (previously 12). Applied to both
pipeline_posix_s2h_bandwidth_gbps (read) and
pipeline_posix_h2s_bandwidth_gbps (write).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each of the 10 critical latency / bandwidth histograms gets:
  1. A heatmap panel showing the full distribution shape over time
     (cluster-wide aggregation, full bucket density visible).
  2. A p50 / p90 / p99 time-series panel with per-worker breakdown,
     line styles distinguishing the three quantiles (p50 solid, p90
     dashed, p99 thick dashed).

Metrics covered:
  Cache:     load_duration_ms, dump_duration_ms,
             load_backend_wait_duration_ms, load_bandwidth_gbps,
             dump_bandwidth_gbps
  Posix:     s2h_duration_ms, h2s_duration_ms,
             s2h_bandwidth_gbps, h2s_bandwidth_gbps
  Layerwise: wait_blocking_ms

Grouped into three collapsible row sections (Cache / Posix /
Layerwise), collapsed by default so the existing dashboard scrolls
unchanged. Adds 23 panels (3 rows + 20 children); existing 29 panels
untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every timeseries panel had spanNulls=false, which combined with
showPoints=auto rendered intermittent metrics (cache miss only,
layerwise-only, dump on tp_rank=0 only, NaN from rate(_sum)/rate(_count)
when count=0, histogram_quantile with no observations) as scattered
discrete points instead of continuous lines.

Set spanNulls to 60000 ms across all 39 panels: gaps under one minute
are bridged so normal "quiet window" sparseness reads as a smooth line,
while real outages longer than 60s still break the line and remain
visible.

No query, color, or layout changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cache and Posix stores can be used standalone (Posix can run without
Cache; Cache always sits on top of some backend but isn't pipeline-
specific), so the pipeline_ prefix on their metric names misrepresented
the binding. The pipeline_ framing only makes sense for the composite
PipelineStore wrapper, not for the underlying stores' own
instrumentation.

Renames (181 references across 8 files):
  pipeline_cache_*  ->  cache_*
  pipeline_posix_*  ->  posix_*

Touched:
  - examples/metrics/metrics_configs.yaml  (registration)
  - examples/metrics/grafana.json          (panel queries)
  - docs/source/user-guide/metrics/performance_analysis.md (prose & PromQL)
  - ucm/store/cache/cc/{trans_manager.h,buffer_manager.h,
                        load_queue.cc,dump_queue.cc} (UpdateStats calls)
  - ucm/store/posix/cc/trans_queue.cc      (UpdateStats calls)

Plus clang-format reflow on five C++ files where the now-shorter
metric-name string literals fit back onto one line.

layerwise_* metrics keep their prefix - they live in the connector
layer, not the store layer, and the prefix correctly identifies
the layerwise overlap mechanism.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
quantiles their description promises

Panels 17/18/21/22 (Connector Load/Save Duration/Speed) all advertised
"P50, P90, P95, P99 and Average" in their description, but each ran a
single rate(_sum)/rate(_count) target that only produces the average.
The percentiles were fictional.

Replace each panel's single avg target with five targets:
  A: p50 (histogram_quantile(0.5, sum by (le, worker_id) (rate(_bucket))))
  B: p90 (histogram_quantile(0.9, ...))
  C: p95 (histogram_quantile(0.95, ...))
  D: p99 (histogram_quantile(0.99, ...))
  E: avg (the original rate(_sum)/rate(_count))

Per-worker breakdown so an outlier worker stands out as its own line.
Distinct line styles (p50 solid, p90/p95/p99 dashed at decreasing
period, p99 thicker, avg solid with 5% fill) keep the multi-series
panel readable. Legend switched to table mode so worker rows can be
scanned at a glance.

This was a pre-existing issue: the Cache|Posix metric prefix rename in
the previous commit was the only recent change near these panels, and
the description<->query mismatch was not introduced by it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 17 main-level latency and throughput panels (Cache / Posix /
Layerwise) previously rendered only the rate(_sum)/rate(_count)
average, so a single hot tail or a slow worker was invisible. Each now
exposes four lines per worker: p50 (solid), p90 (dashed 10-6), p99
(dashed 4-4, thicker), avg (solid with 5% fill).

Panels touched:
  Cache:     load_duration, load_bandwidth, dump_duration,
             dump_bandwidth, load_backend_wait, h2d_duration,
             d2h_duration  (7 panels)
  Posix:     s2h_bandwidth, h2s_bandwidth, s2h_duration,
             h2s_duration  (4 panels)
  Layerwise: wait_blocking, inter_wait_interval, first_layer_submit,
             save_tail_total, next_layer_submit,
             save_per_layer_wait  (6 panels)

Skipped (intentionally - not latency/throughput):
  - Hit-rate / counter-rate / count panels (id 14-16, 19-20, 100,
    108, 115).
  - Distribution-row heatmaps (id 230, 232, ... 248) which already
    show the full shape.
  - Distribution-row dedicated quantile panels (id 231, 233, ... 249)
    which already render p50/p90/p99 - now somewhat redundant with the
    upgraded main panels but kept for the focused deep-dive view
    inside the collapsible distribution rows.

Legend switched to table mode so worker rows can be scanned at a
glance; tooltip set to multi-series sorted desc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
prefix from titles

Two cleanups in one pass.

1) Remove redundant quantile panels from distribution rows.
   The previous commit gave every main-level latency / throughput
   panel its own p50/p90/p99 lines (commit de667fb), which made the
   dedicated p50/p90/p99 panels inside the collapsed distribution
   rows duplicate work. Drop those 10 panels (id 231, 233, 235, 237,
   239, 241, 243, 245, 247, 249) and lay out the remaining heatmaps
   2-up at w=12 (with the Cache backend_wait and Layerwise blocking
   heatmaps full-width because they are alone on their row).

   Total panels: 52 -> 42 (top-level unchanged at 32 since rows are
   counted as containers).

2) Strip leftover "Pipeline" wording from titles to match the metric
   rename in commit acd5af0:
     - "Pipeline / Cache Load Duration" -> "Cache Load Duration"
     - "Pipeline Cache -- Distributions" -> "Cache -- Distributions"
     - "Cache / Cache Load Duration (heatmap)" -> "Cache / Load Duration"
     - "(heatmap)" suffix dropped since rows now contain only heatmaps.
   The Layerwise / * panel titles are unchanged - layerwise is the
   correct prefix for those metrics.

   Queries themselves were already migrated and contain no
   pipeline_* references.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The four count panels (Connector Load/Save Requests/Blocks Num) used
rate(_sum)/rate(_count) which yields 0/0 NaN whenever no batch
happened in the rate window, so the dashboard frequently showed gaps
or single isolated points even though the underlying metrics were
healthy.

Each panel is now split into two:

  Rate panel (existing id 15/16/19/20, renamed):
    expr: rate(ucm:METRIC_count[$__rate_interval])
    unit: ops (events/sec)
    Always defined when there is any activity in the window - no more
    NaN gaps. Per worker, single line.

  Size distribution panel (new id 130-133):
    p50 / p90 / p99 / avg of per-batch value (request count or block
    count). Same quantile + avg multi-line treatment as the duration
    and bandwidth panels.

Layout shift: the new size-distribution rows sit right under their
sibling rate panel. All panels with y >= 8 shifted by +8 to make room
for the load distributions; everything with y >= 24 shifted by +16 to
also accommodate the save distributions. Distribution-row sections
(id 200/210/220) re-packed so rows sit immediately after the previous
row's last child (closing a 16-grid-unit slack inherited from the
earlier cleanup commit).

Final connector section:
  Hit Rate (full width)
  Load Req Rate    | Load Blk Rate
  Load Req Size    | Load Blk Size
  Load Duration    | Load Speed
  Save Req Rate    | Save Blk Rate
  Save Req Size    | Save Blk Size
  Save Duration    | Save Speed

Top-level panel count: 32 -> 36 (4 new dist panels).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The single grafana.json had grown to 36 top-level panels + 10 nested
heatmap children (52 total, ~7000 lines), too big to edit ergonomically
and mixing concerns (overview / store-tier diagnosis / advanced
layerwise) for different audiences.

Drop 2 redundant panels:
  - id=16 Connector Load Blocks Rate
  - id=20 Connector Save Blocks Rate
Both queries were rate(_count) of metrics observed in the same
update_stats({...}) call as their Requests Rate siblings, producing
mathematically identical time series. The size-distribution panels for
the same metrics (130/131/132/133) are NOT redundant and stay.

Split the remaining ~50 panels into three module dashboards under
examples/metrics/:

  grafana_connector.json (11 panels)
    Audience: anyone running UCM. Top-level activity, hit rate,
    per-batch sizes, end-to-end load/save durations and speeds.

  grafana_pipeline_store.json (13 main + 11 in collapsible
    distribution rows = 24 total)
    Audience: people diagnosing storage tier perf. Cache hit
    rate / backend submit ratio at top, then per-stage Cache and
    Posix latency / bandwidth, then Cache + Posix distribution
    heatmaps in collapsible rows.

  grafana_layerwise.json (8 main + 1 in collapsible row = 9 total)
    Audience: layerwise mode users. Wait_blocking key signal full
    width at top, then stalls / submit costs / save tail, plus
    a layerwise wait_blocking heatmap.

Per-dashboard hygiene:
  - Fresh uid (ucm-connector-overview / ucm-pipeline-store /
    ucm-layerwise).
  - version=1, panel ids renumbered from 1, gridPos repacked from y=0.
  - Tagged ucm + <module>; each carries an "Other UCM dashboards"
    dropdown link in the header that auto-discovers siblings by tag.
  - Cache Hit Rate full-width at top of pipeline_store so the 9 Cache
    panels pair cleanly without bleeding into the Posix section.
  - templating, time, refresh copied verbatim from the original.

Documentation:
  - docs/source/user-guide/metrics/metrics.md: replace the single
    "Import Dashboard" section with a "Pick the dashboard you need"
    table.
  - docs/source/developer-guide/add_metrics.md: update the
    "add a new panel" pointer to the right module dashboard.

Verified: JSON validity, panel id uniqueness within each dashboard,
all metric refs resolve in metrics_configs.yaml, no leftover standalone
rate(_count) for the dropped duplicate metrics, original grafana.json
removed, pre-commit clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
}
if (!task.waiter) {
holder_.push_back(std::move(task));
UC::Metrics::UpdateStats("cache_load_backend_wait_duration_ms",

Bug: cache_load_backend_wait_duration_ms is recorded twice in this function.

  • First time here (line 157-158) in the if (!task.waiter) branch
  • Second time at line 168-169 in the normal flow after stream.Synchronize()

This double-counts the observation: the same histogram is updated from two different execution paths for one load. It should be recorded only once, on whichever path is actually taken.

break;
}
auto tpEnd = NowTime::Now();
UC::Metrics::UpdateStats("cache_load_backend_wait_duration_ms",

Duplicate recording: This is the second place where cache_load_backend_wait_duration_ms is recorded. See my comment at line 157-158 for the first occurrence.

Both paths record the same metric, which inflates the Prometheus histogram with duplicate observations. Consider removing one occurrence or restructuring the logic so the duration is recorded only on the path actually taken, as sketched below.
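A minimal sketch of one such restructuring, assuming the two recording sites belong to alternative paths of the same load (identifiers here are illustrative stand-ins, not the store's actual code): measure across the wait and observe the histogram at a single merge point.

```cpp
#include <chrono>
#include <string>

// Stand-in for UC::Metrics::UpdateStats; the real call forwards to the metrics singleton.
void UpdateStats(const std::string& name, double ms) { (void)name; (void)ms; }

// Record the backend wait exactly once per load, regardless of which path ran.
void WaitForBackend(bool hasWaiter /* stand-in for task.waiter != nullptr */) {
    const auto tpStart = std::chrono::steady_clock::now();
    if (hasWaiter) {
        // ... wait for the backend / stream.Synchronize() equivalent ...
    }
    const auto tpEnd = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(tpEnd - tpStart).count();
    UpdateStats("cache_load_backend_wait_duration_ms", ms);  // single observation point
}
```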

UC_DEBUG("Cache task({},{},{},{}) dispatching.", id, brief, num, size);
w->SetEpilog([id, brief = std::move(brief), num, size, tp] {
w->SetEpilog([id, brief = std::move(brief), num, size, tp, isLoad] {
auto cost = NowTime::Now() - tp;

Potential edge case: When cost is very small (e.g., sub-millisecond), bwGbps could become extremely large due to division by a tiny number. Consider adding a minimum threshold for cost before calculating bandwidth to avoid unrealistic spikes in metrics.

Example: if (cost > 1e-6) { bwGbps = ... } else { bwGbps = 0.0; }
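Expanded into a self-contained sketch (variable names are illustrative; this assumes `cost` is the elapsed time in seconds, as the inline example implies):

```cpp
#include <cstdint>

// Guarded bandwidth sample: a near-zero cost would otherwise divide the byte count
// by a tiny number and push an absurd GB/s value into the histogram's top bucket.
double SafeBandwidthGbps(std::uint64_t sizeBytes, double costSeconds) {
    constexpr double kMinCostSeconds = 1e-6;  // below this, the timing is too coarse to trust
    if (costSeconds <= kMinCostSeconds) {
        return 0.0;  // or skip the UpdateStats() observation for this sample entirely
    }
    return static_cast<double>(sizeBytes) / costSeconds / 1e9;
}
```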

ios.waiter->Done();
return;
}
auto tp = NowTime::Now();

Same edge case as in trans_manager.h: when cost is very small, bwGbps could become extremely large. Consider adding a minimum-threshold check before calculating bandwidth to avoid unrealistic spikes.

- name: "cache_lookup_hit_rate"
documentation: "Instantaneous Cache stage hit rate from the most recent lookup call"
multiprocess_mode: "livemostrecent"


Missing metric definition: layerwise_first_layer_requests is used in the code (ucm_connector.py:849) but not defined in the histogram configuration here. Should this be a gauge or counter? Please add the corresponding definition.

Posix's existing posix_*_bandwidth_gbps histogram observes one sample
per shard IO call (bytes / wall_time_of_that_call / 1e9). That
granularity cannot reveal:

  * Real per-worker GB/s — 8 concurrent IO threads each at 1 GB/s
    still show as ~1 GB/s in the histogram instead of aggregating
    to 8.
  * Disk utilisation — idle gaps between IOs are not in the
    denominator, so 50%-utilised disks look identical to saturated
    ones.

Add posix_s2h_bytes_total and posix_h2s_bytes_total counters that
accumulate only on successful IOs, in both posix engines:

  * trans_queue.cc (psync): in the success branch after the s.Failure()
    check, in both LoadWorker and DumpWorker.
  * io_engine_aio.h (aio): in OnIoCallback's else-branch after the
    result.error == 0 check; via the dump template parameter we cover
    s2h (load) and h2s (dump) with one line.

The aio open-failure path in OnOpenCallback intentionally does NOT
update the counter (no bytes ever moved).
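A sketch of the success-branch pattern described above (assumed shape only; the stand-in UpdateStats represents the UC::Metrics call, and the callback signature is illustrative rather than the engines' actual one):

```cpp
#include <cstdint>
#include <string>

// Stand-in for UC::Metrics::UpdateStats.
void UpdateStats(const std::string& name, double value) { (void)name; (void)value; }

// The bytes counters only move when the IO actually succeeded; failures and open
// errors leave them untouched, so rate(_bytes_total) reflects data that really moved.
void OnIoCompleted(int error, std::uint64_t bytes, bool isDump) {
    if (error != 0) {
        return;  // failure counters are handled elsewhere; no bytes were transferred
    }
    UpdateStats(isDump ? "posix_h2s_bytes_total" : "posix_s2h_bytes_total",
                static_cast<double>(bytes));
}
```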

PromQL form: rate(ucm:posix_s2h_bytes_total[$__rate_interval]) / 1e9
gives true per-worker GB/s — multi-thread IO aggregates naturally and
idle gaps land in the wall-clock denominator.

Grafana (production + CI variants of grafana_pipeline_store.json):
  * Rename four existing Posix bandwidth panels (two top-level, two
    nested heatmaps) with the "(per shard)" suffix, and add a
    description sentence saying they do NOT aggregate across IO
    threads.
  * Add two new "(per worker)" timeseries panels at y=48 between the
    per-shard row and the duration row, plotting rate(_bytes_total) /
    1e9 broken down by worker_id. CI variant adds job="$job" to the
    label matcher.

Doc: performance_analysis.md gets a decision-split entry pair (5a
per-worker / 5b per-shard) in the §2 ranking table, plus an explainer
block under §3.3 covering when to read which metric.

"Cache Backend Submit Ratio (load)" measures the fraction of load
shards that missed the Cache buffer and descended to the backend
(true cache miss at load time). The "submit" verb suggested a
half-finished state and routinely misled diagnostics; the action
itself is a backend-load, so name the panel and underlying counter
accordingly.

Renames:
  * counter   cache_load_backend_submit_shards_total
              -> cache_load_backend_shards_total
  * histogram cache_dump_backend_submit_duration_ms
              -> cache_dump_backend_duration_ms
  * panel     "Cache Backend Submit Ratio (load)"
              -> "Cache Backend Load Ratio"

The dump-side histogram is renamed for symmetry; its documentation
now clarifies that it measures the synchronous hand-off duration only
and does NOT include the lower tier's actual write time.

Files touched: load_queue.cc / dump_queue.cc (UpdateStats string
literals), metrics_configs.yaml (counter+histogram definitions with
revised documentation), grafana_pipeline_store.json in both production
and CI (panel title, expr, plus description suffix linking the ratio
to its component metrics), and performance_analysis.md (mermaid dump
diagram node D plus two PromQL example references).

Layerwise *_submit_ms metrics are intentionally NOT renamed: those
measure dispatcher synchronous cost, where "submit" is the accurate
dispatcher-pattern verb.

Mirror the posix per-worker bytes-counter pattern at the Cache stage.
The existing cache_*_bandwidth_gbps histogram samples once per Cache
task (size = shardSize * num_shards, cost = task wall time) and so
does NOT aggregate across concurrent tasks or include idle gaps —
same limitation that motivated the posix-side per-worker view.

C++ (trans_manager.h epilog):
  * UpdateStats("cache_load_bytes_total", size)  in load branch
  * UpdateStats("cache_dump_bytes_total", size)  in dump branch
The counters reuse the same per-task `size` value already computed
for the bandwidth_gbps histogram; semantic matches the existing
cache_*_blocks_total counters (incremented on every completed task).

yaml (metrics_configs.yaml): two new counter definitions describing
how to use rate(_bytes_total) / 1e9 for real per-worker GB/s.

Grafana (production + CI grafana_pipeline_store.json):
  * Rename 4 existing Cache bandwidth panels (two top-level, two
    nested heatmaps) with "(per task)" suffix and a clarifying
    description sentence — Cache's native event granularity is a
    task, not a shard, so the qualifier is "per task" (not the
    "per shard" wording used on the posix side).
  * Add 2 new "(per worker)" timeseries panels at y=40 between the
    Cache region and the Posix region, plotting
    rate(cache_*_bytes_total) / 1e9 by worker_id. Posix region
    shifted down by 8.

Doc (performance_analysis.md): extends the per-shard / per-worker
explainer block with a mirror entry for cache per-task / per-worker.

Most diagnostic sessions want the service-level signal, not 8 lines
per panel. Add a single user-visible Grafana template variable
`perWorker` (label: View) at the top of every dashboard with two
values:
  * Aggregated (default) — collapse worker_id from grouping; one line
    per panel (or 4 quantile lines for histogram-quantile panels).
  * Per Worker — keep worker_id in grouping; per-worker lines.

Key PromQL trick: the variable's per-worker value is the literal
", worker_id" (comma-prefixed). Every existing `sum by (...)` clause
is rewritten to anchor on `model_name` (which all exprs already
filter on) and append the variable value:

  sum by (le, model_name${perWorker:raw}) (rate(metric_bucket{...}))
  sum by (model_name${perWorker:raw}) (rate(metric{...}))

Anchoring on `model_name` (always pinned to one value via the existing
template filter) avoids the trailing-comma PromQL error when the
variable is empty, while still giving aggregated semantics.

Scope (one Python transform script run on 7 dashboards):
  * 275 `expr` strings refactored across all panel types:
    histogram_quantile, rate division, single rates, ratios, gauges.
  * 275 `legendFormat` strings: drop "worker-" prefix so legend reads
    "{{worker_id}}" instead of "worker-{{worker_id}}".
  * 12 additional vLLM-style `sum by(le)` quantile exprs rewritten
    so they also honour the toggle.
  * 8 panel titles renamed:
      "(per worker)" -> "(aggregated)"
    — toggle now controls grouping; the suffix communicates the
    panel's default view, not the (now toggleable) breakdown.

Heatmap panels (sum by(le) on `increase()`) and categorical panels
(sum by(finished_reason) etc.) intentionally left untouched: their
grouping is already aggregated over workers by design.

Docs:
  * performance_analysis.md — explainer blocks now contrast
    (per shard) / (per task) vs (aggregated), and mention the View
    toggle as the way to drill into per-worker.
  * metrics.md — new "View toggle" subsection under "Import
    Dashboards" describing the new selector.
ucm:load_speed and ucm:save_speed are histograms of per-call
instantaneous speed (size_in_call / duration_of_call), so toggling
the dashboard View=Aggregated only pools observations across workers
into a single distribution — the resulting quantile is still
"typical single-call speed" and does NOT sum across workers. Users
seeing similar p50 between Aggregated and Per Worker have hit this
limitation.

Mirror the cache_*_bytes_total / posix_*_bytes_total pattern at the
connector layer:

  * Add load_bytes_total + save_bytes_total counters in
    metrics_configs.yaml.
  * In ucm_connector.py, accumulate the per-call byte count
    (num_blocks * block_data_size) into the same update_stats block
    that already records load_speed / save_speed (so it only fires
    on the successful is_load / is_save path, no failure pollution).

Grafana (production + CI grafana_connector.json):

  * Rename "Connector Load Speed" -> "Connector Load Speed
    (per task)" and same for Save, with a description sentence
    saying the histogram does not sum across workers.
  * Add new "Connector Load Bandwidth (aggregated)" and "Connector
    Save Bandwidth (aggregated)" panels at y=48 plotting
    sum by (${perWorker:raw}) (rate(*_bytes_total)) / 1e9. CI variant
    carries the extra job="$job" label matcher. Unit "gbytes" to
    match the existing aggregated panels in pipeline_store.

Docs:
  * performance_analysis.md gets a Connector "(per task) vs
    (aggregated)" explainer block mirroring the existing posix /
    cache pairs.
  * metrics.md "Available Metrics" table adds the two new counter
    rows.

Several UC_DEBUG lines in the cache/posix paths print durations and
counts that are useful for diagnosis but never made it into metrics.
This commit promotes four such "debug-only" measurements into proper
Prometheus metrics while leaving the existing UC_DEBUG lines in place
(so verbose-log workflows are unchanged).

Cache buffer lookup duration (buffer_manager.h L45/92/111/127):
  * cache_lookup_duration_ms        — fast in-memory hit/miss scan
  * cache_lookup_backend_duration_ms — backend descent when buffer
                                       missing or miss-blocks need
                                       backend resolution
  Reason: vLLM scheduler hits Lookup on every decision; lookup
  latency directly drives scheduling overhead and TTFT.

Cache dump shard counters (dump_queue.cc DumpOneTask):
  * cache_dump_shards_total          — total shards in the task
  * cache_dump_backend_shards_total  — shards actually pushed to
                                       backend (excludes
                                       !handle.Owner() skips)
  Mirror of the existing cache_load_shards_total /
  cache_load_backend_shards_total pair. Emitted BEFORE the
  early-return at L125 so 0/N tasks are recorded.

Posix task-level duration (io_engine_psync.h + io_engine_aio.h):
  * posix_load_task_duration_ms
  * posix_dump_task_duration_ms
  Both engines capture t->type in the SetEpilog lambda and emit the
  appropriate metric. io_engine_psync.h gains an include of
  metrics_api.h. Use case: directly compare with
  cache_load_duration_ms / cache_dump_duration_ms (same task-level
  granularity) to isolate the cache layer's H2D/D2H + serialization
  overhead vs the posix layer's actual disk time.
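A sketch of the epilog-capture pattern described above (assumed shape; the real engines use their own task, clock, and metrics types):

```cpp
#include <chrono>
#include <functional>
#include <string>

// Stand-in for UC::Metrics::UpdateStats.
void UpdateStats(const std::string& name, double ms) { (void)name; (void)ms; }

// Capture the task direction and start time when the epilog is attached, so the
// completion callback emits the matching task-level duration metric.
std::function<void()> MakeTaskEpilog(bool isDump) {
    const auto tpStart = std::chrono::steady_clock::now();
    return [tpStart, isDump] {
        const double ms = std::chrono::duration<double, std::milli>(
                              std::chrono::steady_clock::now() - tpStart)
                              .count();
        UpdateStats(isDump ? "posix_dump_task_duration_ms" : "posix_load_task_duration_ms",
                    ms);
    };
}
```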

Grafana (production + CI grafana_pipeline_store.json):
  * Insert "Cache Lookup Duration" / "Cache Lookup Backend Duration"
    timeseries panels at y=48 (between cache aggregated row and
    posix region). Posix region shifted down by 8.
  * Insert "Posix Load Task Duration" / "Posix Dump Task Duration"
    at y=80 (between posix per-shard duration row and the
    distributions rows). Distributions rows shifted down by 8.
  * All 4 new panels share the existing histogram-quantile template
    (p50/p90/p99/avg + perWorker toggle support).

docs/source/user-guide/metrics/metrics.md: "Available Metrics" table
gains 6 new entries grouped under Lookup, Shard Counters, and Posix
Task Duration headings.

Out of scope (deferred): ds3fs / compress / pcstore stores are
still uninstrumented — same pattern would apply but they are not
on the user's active CI path.
@dante159753 marked this pull request as draft on May 13, 2026 02:31