[Feat] Add per-stage pipeline store and layerwise overlap metrics #933

Draft

dante159753 wants to merge 54 commits into ModelEngine-Group:develop from dante159753:pipeline-layerwise-metrics

Conversation

@dante159753 (Contributor) commented Apr 24, 2026

Purpose

Adds observability needed to diagnose pipeline store (Cache|Posix) per-tier performance and to verify that UCMLayerWiseConnector's load/forward/save overlap actually hides backend latency.

Modifications

Pipeline Store (C++ side):

  • Cache: per-task load/dump duration + bandwidth, queue wait, dispatch, backend-wait, H2D/D2H, backend-submit durations; lookup hit/miss counters and instantaneous hit rate gauge.
  • Posix: per-IO S2H/H2S duration + bandwidth, queue wait, failure counters.

Layerwise Connector (Python side):

  • wait_blocking_ms: primary signal for overlap health (near 0 = perfect overlap; if it tracks load_duration, overlap has degenerated to serial).
  • inter_wait_interval_ms, next_layer_submit_ms, first_layer_submit_ms, save_submit_ms, save_per_layer_wait_ms, save_tail_total_ms, stalled_layers_total.

Infrastructure:

  • Change the metrics library from STATIC to SHARED so cachestore.so, posixstore.so, and ucmmetrics.so share one Metrics singleton. With a STATIC metrics library, the function-local GetInstance() produced a separate instance in each .so, and all C++ UpdateStats() calls from the stores were silently dropped (see the sketch below).
  • Set INSTALL_RPATH=$ORIGIN/../../shared/metrics on cachestore.so and posixstore.so; $ORIGIN on ucmmetrics.so.
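For illustration, a minimal sketch of the singleton pattern involved (a hypothetical simplification, not the actual UC::Metrics class): with a function-local static like this linked from a STATIC archive, every .so carries its own copy of the instance, which is why store-side calls never reached the exporter's instance.

```cpp
#include <map>
#include <string>

// Hypothetical simplification of the metrics singleton, for illustration only.
class Metrics {
public:
    static Metrics& GetInstance() {
        // With a STATIC metrics library, this function-local static is duplicated in
        // every .so that links the archive, so cachestore.so, posixstore.so, and
        // ucmmetrics.so each see a different instance. Building the library SHARED
        // leaves exactly one definition in the process.
        static Metrics instance;
        return instance;
    }
    void UpdateStats(const std::string& name, double value) { stats_[name] += value; }

private:
    Metrics() = default;
    std::map<std::string, double> stats_;
};
```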

Test

dante159753 and others added 5 commits April 27, 2026 11:18
Adds observability needed to diagnose pipeline store (Cache|Posix)
per-tier performance and to verify that UCMLayerWiseConnector's
load/forward/save overlap actually hides backend latency.

Pipeline Store (C++ side):
  - Cache: per-task load/dump duration + bandwidth, queue wait, dispatch,
    backend-wait, H2D/D2H, backend-submit durations; lookup hit/miss
    counters and instantaneous hit rate gauge.
  - Posix: per-IO S2H/H2S duration + bandwidth, queue wait, failure
    counters.

Layerwise Connector (Python side):
  - wait_blocking_ms: primary signal for overlap health (near 0 = perfect
    overlap; if it tracks load_duration, overlap has degenerated to
    serial).
  - inter_wait_interval_ms, next_layer_submit_ms, first_layer_submit_ms,
    save_submit_ms, save_per_layer_wait_ms, save_tail_total_ms,
    stalled_layers_total.

Infrastructure:
  - Change metrics library from STATIC to SHARED so cachestore.so,
    posixstore.so, and ucmmetrics.so share one Metrics singleton. With a
    STATIC metrics library the function-local GetInstance() produced a
    separate instance in each .so and all C++ UpdateStats() calls from
    the stores were silently dropped.
  - Set INSTALL_RPATH=$ORIGIN/../../shared/metrics on cachestore.so and
    posixstore.so; $ORIGIN on ucmmetrics.so.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Fold a short UpdateStats call back onto one line per clang-format 20.
Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds 20 new panels to examples/metrics/grafana.json covering the metrics
introduced in the previous commit:

Pipeline / Cache stage (9 panels):
  Hit Rate (full width), Load/Dump Duration + Bandwidth, Backend Wait,
  H2D / D2H durations, Backend Submit Ratio.

Pipeline / Posix stage (4 panels):
  S2H and H2S bandwidth and duration.

Layerwise Connector (7 panels):
  Wait Blocking (full-width key metric), Inter-Wait Interval, Stalled
  Layers Rate, First Layer Submit, Save Tail, Next Layer Submit, Save
  Per-Layer Wait.

Thresholds are set on the most actionable panels: Hit Rate (red < 0.5,
green >= 0.8), Backend Submit Ratio (green < 0.3, red >= 0.7), Wait
Blocking (green 0, red >= 20 ms), Save Tail (green 0, red >= 50 ms).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds docs/source/user-guide/metrics/performance_analysis.md covering
diagnosis of the Cache|Posix pipeline store in both layerwise and
non-layerwise mode using the per-stage and layerwise metrics.

Sections:
  1. Architecture and load/dump data flow with metric annotations.
  2. Critical metrics ranked by diagnostic priority.
  3. Nine bottleneck playbooks (low hit rate, slow loads, slow Posix,
     dump back-pressure, no layerwise speedup, layerwise TTFT
     regression, layerwise save tail, non-layerwise dump-bound,
     worker pool starvation) - each with metric signature and
     concrete tunables.
  4. Layerwise vs non-layerwise diagnostic differences.
  5. PromQL recipes for hit rate, miss ratio, p99 decomposition,
     overlap loss, dump back-pressure, worker utilization.
  6. Tunables indexed by symptom.
  7. Honest list of what these metrics cannot tell you.

Wires the new page into the User Guide toctree in docs/source/index.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous ASCII flowcharts in performance_analysis.md misaligned in
the Sphinx HTML output because the box-drawing characters and CJK
punctuation have inconsistent monospace widths. Replace them with
Mermaid flowcharts:

  - Storage tier overview (vLLM Worker → CacheStore → PosixStore)
  - LOAD path with per-stage metric annotations on each node
    (queue waits, dispatch, posix S2H, backend wait, H2D, epilog)
  - Cache-hit fast path
  - DUMP path showing the user-visible chain plus the asynchronous
    BackendDumpStage / Posix H2S branch with a dashed edge

Color-codes nodes by tier (Cache blue, Posix orange, completion green)
so the tier hand-offs are visible at a glance.

Wire-up:
  - Add sphinxcontrib-mermaid to docs/requirements-docs.txt.
  - Register the extension in docs/source/conf.py.
  - Set myst_fence_as_directive = ["mermaid"] so plain ```mermaid
    fences work both on GitHub (native rendering) and on Sphinx /
    ReadTheDocs.

Drop the now-unused 'promql' language tag from PromQL examples - the
default Pygments PromQL lexer rejects the colon in 'ucm:metric_name'
and emitted highlighting warnings on every build.

Verified locally with `sphinx-build -W`: my page now builds without
warnings; mermaid blocks render as <pre class="mermaid"> for the
client-side mermaid.js to pick up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dante159753 force-pushed the pipeline-layerwise-metrics branch from a56840a to 3fe0dd4 on April 27, 2026 03:21
dante159753 and others added 11 commits April 27, 2026 11:30
… range

The previous Posix bandwidth buckets (0.05, 0.1, 0.2, 0.5, 1, 2, 4, 8,
12, 16, 24, 32) had only four sample points across the entire range
where actual production performance lives, so p50/p90/p99 collapsed to
a single bucket and changes within the band were invisible.

New layout:
  - 0.05 / 0.1 / 0.2 / 0.5  -> degraded paths
  - 1, 1.5, 2, 2.5, 3, 3.5, 4  -> 0.5 GB/s steps (slow/saturated NVMe)
  - 5, 6, 7, 8, 9, 10, 11, 12  -> 1 GB/s steps (typical)
  - 14, 16, 20, 24, 32  -> sparse headroom

24 buckets total per metric (previously 12). Applied to both
pipeline_posix_s2h_bandwidth_gbps (read) and
pipeline_posix_h2s_bandwidth_gbps (write).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each of the 10 critical latency / bandwidth histograms gets:
  1. A heatmap panel showing the full distribution shape over time
     (cluster-wide aggregation, full bucket density visible).
  2. A p50 / p90 / p99 time-series panel with per-worker breakdown,
     line styles distinguishing the three quantiles (p50 solid, p90
     dashed, p99 thick dashed).

Metrics covered:
  Cache:     load_duration_ms, dump_duration_ms,
             load_backend_wait_duration_ms, load_bandwidth_gbps,
             dump_bandwidth_gbps
  Posix:     s2h_duration_ms, h2s_duration_ms,
             s2h_bandwidth_gbps, h2s_bandwidth_gbps
  Layerwise: wait_blocking_ms

Grouped into three collapsible row sections (Cache / Posix /
Layerwise), collapsed by default so the existing dashboard scrolls
unchanged. Adds 23 panels (3 rows + 20 children); existing 29 panels
untouched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every timeseries panel had spanNulls=false, which combined with
showPoints=auto rendered intermittent metrics (cache miss only,
layerwise-only, dump on tp_rank=0 only, NaN from rate(_sum)/rate(_count)
when count=0, histogram_quantile with no observations) as scattered
discrete points instead of continuous lines.

Set spanNulls to 60000 ms across all 39 panels: gaps under one minute
are bridged so normal "quiet window" sparseness reads as a smooth line,
while real outages longer than 60s still break the line and remain
visible.

No query, color, or layout changes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cache and Posix stores can be used standalone (Posix can run without
Cache; Cache always sits on top of some backend but isn't pipeline-
specific), so the pipeline_ prefix on their metric names misrepresented
the binding. The pipeline_ framing only makes sense for the composite
PipelineStore wrapper, not for the underlying stores' own
instrumentation.

Renames (181 references across 8 files):
  pipeline_cache_*  ->  cache_*
  pipeline_posix_*  ->  posix_*

Touched:
  - examples/metrics/metrics_configs.yaml  (registration)
  - examples/metrics/grafana.json          (panel queries)
  - docs/source/user-guide/metrics/performance_analysis.md (prose & PromQL)
  - ucm/store/cache/cc/{trans_manager.h,buffer_manager.h,
                        load_queue.cc,dump_queue.cc} (UpdateStats calls)
  - ucm/store/posix/cc/trans_queue.cc      (UpdateStats calls)

Plus clang-format reflow on five C++ files where the now-shorter
metric-name string literals fit back onto one line.

layerwise_* metrics keep their prefix - they live in the connector
layer, not the store layer, and the prefix correctly identifies
the layerwise overlap mechanism.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
quantiles their description promises

Panels 17/18/21/22 (Connector Load/Save Duration/Speed) all advertised
"P50, P90, P95, P99 and Average" in their description, but each ran a
single rate(_sum)/rate(_count) target that only produces the average.
The percentiles were fictional.

Replace each panel's single avg target with five targets:
  A: p50 (histogram_quantile(0.5, sum by (le, worker_id) (rate(_bucket))))
  B: p90 (histogram_quantile(0.9, ...))
  C: p95 (histogram_quantile(0.95, ...))
  D: p99 (histogram_quantile(0.99, ...))
  E: avg (the original rate(_sum)/rate(_count))

Per-worker breakdown so an outlier worker stands out as its own line.
Distinct line styles (p50 solid, p90/p95/p99 dashed at decreasing
period, p99 thicker, avg solid with 5% fill) keep the multi-series
panel readable. Legend switched to table mode so worker rows can be
scanned at a glance.

This was a pre-existing issue: the Cache|Posix metric prefix rename in
the previous commit was the only recent change near these panels, and
the description<->query mismatch was not introduced by it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 17 main-level latency and throughput panels (Cache / Posix /
Layerwise) previously rendered only the rate(_sum)/rate(_count)
average, so a single hot tail or a slow worker was invisible. Each now
exposes four lines per worker: p50 (solid), p90 (dashed 10-6), p99
(dashed 4-4, thicker), avg (solid with 5% fill).

Panels touched:
  Cache:     load_duration, load_bandwidth, dump_duration,
             dump_bandwidth, load_backend_wait, h2d_duration,
             d2h_duration  (7 panels)
  Posix:     s2h_bandwidth, h2s_bandwidth, s2h_duration,
             h2s_duration  (4 panels)
  Layerwise: wait_blocking, inter_wait_interval, first_layer_submit,
             save_tail_total, next_layer_submit,
             save_per_layer_wait  (6 panels)

Skipped (intentionally - not latency/throughput):
  - Hit-rate / counter-rate / count panels (id 14-16, 19-20, 100,
    108, 115).
  - Distribution-row heatmaps (id 230, 232, ... 248) which already
    show the full shape.
  - Distribution-row dedicated quantile panels (id 231, 233, ... 249)
    which already render p50/p90/p99 - now somewhat redundant with the
    upgraded main panels but kept for the focused deep-dive view
    inside the collapsible distribution rows.

Legend switched to table mode so worker rows can be scanned at a
glance; tooltip set to multi-series sorted desc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
prefix from titles

Two cleanups in one pass.

1) Remove redundant quantile panels from distribution rows.
   The previous commit gave every main-level latency / throughput
   panel its own p50/p90/p99 lines (commit de667fb), which made the
   dedicated p50/p90/p99 panels inside the collapsed distribution
   rows duplicate work. Drop those 10 panels (id 231, 233, 235, 237,
   239, 241, 243, 245, 247, 249) and lay out the remaining heatmaps
   2-up at w=12 (with the Cache backend_wait and Layerwise blocking
   heatmaps full-width because they are alone on their row).

   Total panels: 52 -> 42 (top-level unchanged at 32 since rows are
   counted as containers).

2) Strip leftover "Pipeline" wording from titles to match the metric
   rename in commit acd5af0:
     - "Pipeline / Cache Load Duration" -> "Cache Load Duration"
     - "Pipeline Cache -- Distributions" -> "Cache -- Distributions"
     - "Cache / Cache Load Duration (heatmap)" -> "Cache / Load Duration"
     - "(heatmap)" suffix dropped since rows now contain only heatmaps.
   The Layerwise / * panel titles are unchanged - layerwise is the
   correct prefix for those metrics.

   Queries themselves were already migrated and contain no
   pipeline_* references.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The four count panels (Connector Load/Save Requests/Blocks Num) used
rate(_sum)/rate(_count) which yields 0/0 NaN whenever no batch
happened in the rate window, so the dashboard frequently showed gaps
or single isolated points even though the underlying metrics were
healthy.

Each panel is now split into two:

  Rate panel (existing id 15/16/19/20, renamed):
    expr: rate(ucm:METRIC_count[$__rate_interval])
    unit: ops (events/sec)
    Always defined when there is any activity in the window - no more
    NaN gaps. Per worker, single line.

  Size distribution panel (new id 130-133):
    p50 / p90 / p99 / avg of per-batch value (request count or block
    count). Same quantile + avg multi-line treatment as the duration
    and bandwidth panels.

Layout shift: the new size-distribution rows sit right under their
sibling rate panel. All panels with y >= 8 shifted by +8 to make room
for the load distributions; everything with y >= 24 shifted by +16 to
also accommodate the save distributions. Distribution-row sections
(id 200/210/220) re-packed so rows sit immediately after the previous
row's last child (closing a 16-grid-unit slack inherited from the
earlier cleanup commit).

Final connector section:
  Hit Rate (full width)
  Load Req Rate    | Load Blk Rate
  Load Req Size    | Load Blk Size
  Load Duration    | Load Speed
  Save Req Rate    | Save Blk Rate
  Save Req Size    | Save Blk Size
  Save Duration    | Save Speed

Top-level panel count: 32 -> 36 (4 new dist panels).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The single grafana.json had grown to 36 top-level panels + 10 nested
heatmap children (52 total, ~7000 lines), too big to edit ergonomically
and mixing concerns (overview / store-tier diagnosis / advanced
layerwise) for different audiences.

Drop 2 redundant panels:
  - id=16 Connector Load Blocks Rate
  - id=20 Connector Save Blocks Rate
Both queries were rate(_count) of metrics observed in the same
update_stats({...}) call as their Requests Rate siblings, producing
mathematically identical time series. The size-distribution panels for
the same metrics (130/131/132/133) are NOT redundant and stay.

Split the remaining ~50 panels into three module dashboards under
examples/metrics/:

  grafana_connector.json (11 panels)
    Audience: anyone running UCM. Top-level activity, hit rate,
    per-batch sizes, end-to-end load/save durations and speeds.

  grafana_pipeline_store.json (13 main + 11 in collapsible
    distribution rows = 24 total)
    Audience: people diagnosing storage tier perf. Cache hit
    rate / backend submit ratio at top, then per-stage Cache and
    Posix latency / bandwidth, then Cache + Posix distribution
    heatmaps in collapsible rows.

  grafana_layerwise.json (8 main + 1 in collapsible row = 9 total)
    Audience: layerwise mode users. Wait_blocking key signal full
    width at top, then stalls / submit costs / save tail, plus
    a layerwise wait_blocking heatmap.

Per-dashboard hygiene:
  - Fresh uid (ucm-connector-overview / ucm-pipeline-store /
    ucm-layerwise).
  - version=1, panel ids renumbered from 1, gridPos repacked from y=0.
  - Tagged ucm + <module>; each carries an "Other UCM dashboards"
    dropdown link in the header that auto-discovers siblings by tag.
  - Cache Hit Rate full-width at top of pipeline_store so the 9 Cache
    panels pair cleanly without bleeding into the Posix section.
  - templating, time, refresh copied verbatim from the original.

Documentation:
  - docs/source/user-guide/metrics/metrics.md: replace the single
    "Import Dashboard" section with a "Pick the dashboard you need"
    table.
  - docs/source/developer-guide/add_metrics.md: update the
    "add a new panel" pointer to the right module dashboard.

Verified: JSON validity, panel id uniqueness within each dashboard,
all metric refs resolve in metrics_configs.yaml, no leftover standalone
rate(_count) for the dropped duplicate metrics, original grafana.json
removed, pre-commit clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
}
if (!task.waiter) {
holder_.push_back(std::move(task));
UC::Metrics::UpdateStats("cache_load_backend_wait_duration_ms",

Bug: cache_load_backend_wait_duration_ms is recorded twice in this function.

  • First time here (line 157-158) in the if (!task.waiter) branch
  • Second time at line 168-169 in the normal flow after stream.Synchronize()

This double-counts the observation: the same histogram is updated from two different execution paths for one load. It should be recorded only once, on whichever path is actually taken.

break;
}
auto tpEnd = NowTime::Now();
UC::Metrics::UpdateStats("cache_load_backend_wait_duration_ms",

Duplicate recording: This is the second place where cache_load_backend_wait_duration_ms is recorded. See my comment at line 157-158 for the first occurrence.

Both paths record the same metric, which inflates the Prometheus histogram with duplicate observations. Consider removing one occurrence or restructuring the logic so the duration is recorded only on the path actually taken, as sketched below.
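A minimal sketch of one such restructuring, assuming the two recording sites belong to alternative paths of the same load (identifiers here are illustrative stand-ins, not the store's actual code): measure across the wait and observe the histogram at a single merge point.

```cpp
#include <chrono>
#include <string>

// Stand-in for UC::Metrics::UpdateStats; the real call forwards to the metrics singleton.
void UpdateStats(const std::string& name, double ms) { (void)name; (void)ms; }

// Record the backend wait exactly once per load, regardless of which path ran.
void WaitForBackend(bool hasWaiter /* stand-in for task.waiter != nullptr */) {
    const auto tpStart = std::chrono::steady_clock::now();
    if (hasWaiter) {
        // ... wait for the backend / stream.Synchronize() equivalent ...
    }
    const auto tpEnd = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(tpEnd - tpStart).count();
    UpdateStats("cache_load_backend_wait_duration_ms", ms);  // single observation point
}
```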

UC_DEBUG("Cache task({},{},{},{}) dispatching.", id, brief, num, size);
w->SetEpilog([id, brief = std::move(brief), num, size, tp] {
w->SetEpilog([id, brief = std::move(brief), num, size, tp, isLoad] {
auto cost = NowTime::Now() - tp;

Potential edge case: When cost is very small (e.g., sub-millisecond), bwGbps could become extremely large due to division by a tiny number. Consider adding a minimum threshold for cost before calculating bandwidth to avoid unrealistic spikes in metrics.

Example: if (cost > 1e-6) { bwGbps = ... } else { bwGbps = 0.0; }
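Expanded into a self-contained sketch (variable names are illustrative; this assumes `cost` is the elapsed time in seconds, as the inline example implies):

```cpp
#include <cstdint>

// Guarded bandwidth sample: a near-zero cost would otherwise divide the byte count
// by a tiny number and push an absurd GB/s value into the histogram's top bucket.
double SafeBandwidthGbps(std::uint64_t sizeBytes, double costSeconds) {
    constexpr double kMinCostSeconds = 1e-6;  // below this, the timing is too coarse to trust
    if (costSeconds <= kMinCostSeconds) {
        return 0.0;  // or skip the UpdateStats() observation for this sample entirely
    }
    return static_cast<double>(sizeBytes) / costSeconds / 1e9;
}
```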

ios.waiter->Done();
return;
}
auto tp = NowTime::Now();

Same edge case as in trans_manager.h: when cost is very small, bwGbps could become extremely large. Consider adding a minimum-threshold check before calculating bandwidth to avoid unrealistic spikes.

- name: "cache_lookup_hit_rate"
documentation: "Instantaneous Cache stage hit rate from the most recent lookup call"
multiprocess_mode: "livemostrecent"


Missing metric definition: layerwise_first_layer_requests is used in the code (ucm_connector.py:849) but not defined in the histogram configuration here. Should this be a gauge or counter? Please add the corresponding definition.

Posix's existing posix_*_bandwidth_gbps histogram observes one sample
per shard IO call (bytes / wall_time_of_that_call / 1e9). That
granularity cannot reveal:

  * Real per-worker GB/s — 8 concurrent IO threads each at 1 GB/s
    still show as ~1 GB/s in the histogram instead of aggregating
    to 8.
  * Disk utilisation — idle gaps between IOs are not in the
    denominator, so 50%-utilised disks look identical to saturated
    ones.

Add posix_s2h_bytes_total and posix_h2s_bytes_total counters that
accumulate only on successful IOs, in both posix engines:

  * trans_queue.cc (psync): in the success branch after the s.Failure()
    check, in both LoadWorker and DumpWorker.
  * io_engine_aio.h (aio): in OnIoCallback's else-branch after the
    result.error == 0 check; via the dump template parameter we cover
    s2h (load) and h2s (dump) with one line.

The aio open-failure path in OnOpenCallback intentionally does NOT
update the counter (no bytes ever moved).
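A sketch of the success-branch pattern described above (assumed shape only; the stand-in UpdateStats represents the UC::Metrics call, and the callback signature is illustrative rather than the engines' actual one):

```cpp
#include <cstdint>
#include <string>

// Stand-in for UC::Metrics::UpdateStats.
void UpdateStats(const std::string& name, double value) { (void)name; (void)value; }

// The bytes counters only move when the IO actually succeeded; failures and open
// errors leave them untouched, so rate(_bytes_total) reflects data that really moved.
void OnIoCompleted(int error, std::uint64_t bytes, bool isDump) {
    if (error != 0) {
        return;  // failure counters are handled elsewhere; no bytes were transferred
    }
    UpdateStats(isDump ? "posix_h2s_bytes_total" : "posix_s2h_bytes_total",
                static_cast<double>(bytes));
}
```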

PromQL form: rate(ucm:posix_s2h_bytes_total[$__rate_interval]) / 1e9
gives true per-worker GB/s — multi-thread IO aggregates naturally and
idle gaps land in the wall-clock denominator.

Grafana (production + CI variants of grafana_pipeline_store.json):
  * Rename four existing Posix bandwidth panels (two top-level, two
    nested heatmaps) with the "(per shard)" suffix, and add a
    description sentence saying they do NOT aggregate across IO
    threads.
  * Add two new "(per worker)" timeseries panels at y=48 between the
    per-shard row and the duration row, plotting rate(_bytes_total) /
    1e9 broken down by worker_id. CI variant adds job="$job" to the
    label matcher.

Doc: performance_analysis.md gets a decision-split entry pair (5a
per-worker / 5b per-shard) in the §2 ranking table, plus an explainer
block under §3.3 covering when to read which metric.

"Cache Backend Submit Ratio (load)" measures the fraction of load
shards that missed the Cache buffer and descended to the backend
(true cache miss at load time). The "submit" verb suggested a
half-finished state and routinely misled diagnostics; the action
itself is a backend-load, so name the panel and underlying counter
accordingly.

Renames:
  * counter   cache_load_backend_submit_shards_total
              -> cache_load_backend_shards_total
  * histogram cache_dump_backend_submit_duration_ms
              -> cache_dump_backend_duration_ms
  * panel     "Cache Backend Submit Ratio (load)"
              -> "Cache Backend Load Ratio"

The dump-side histogram is renamed for symmetry; its documentation
now clarifies that it measures the synchronous hand-off duration only
and does NOT include the lower tier's actual write time.

Files touched: load_queue.cc / dump_queue.cc (UpdateStats string
literals), metrics_configs.yaml (counter+histogram definitions with
revised documentation), grafana_pipeline_store.json in both production
and CI (panel title, expr, plus description suffix linking the ratio
to its component metrics), and performance_analysis.md (mermaid dump
diagram node D plus two PromQL example references).

Layerwise *_submit_ms metrics are intentionally NOT renamed: those
measure dispatcher synchronous cost, where "submit" is the accurate
dispatcher-pattern verb.

Mirror the posix per-worker bytes-counter pattern at the Cache stage.
The existing cache_*_bandwidth_gbps histogram samples once per Cache
task (size = shardSize * num_shards, cost = task wall time) and so
does NOT aggregate across concurrent tasks or include idle gaps —
same limitation that motivated the posix-side per-worker view.

C++ (trans_manager.h epilog):
  * UpdateStats("cache_load_bytes_total", size)  in load branch
  * UpdateStats("cache_dump_bytes_total", size)  in dump branch
The counters reuse the same per-task `size` value already computed
for the bandwidth_gbps histogram; semantic matches the existing
cache_*_blocks_total counters (incremented on every completed task).

yaml (metrics_configs.yaml): two new counter definitions describing
how to use rate(_bytes_total) / 1e9 for real per-worker GB/s.

Grafana (production + CI grafana_pipeline_store.json):
  * Rename 4 existing Cache bandwidth panels (two top-level, two
    nested heatmaps) with "(per task)" suffix and a clarifying
    description sentence — Cache's native event granularity is a
    task, not a shard, so the qualifier is "per task" (not the
    "per shard" wording used on the posix side).
  * Add 2 new "(per worker)" timeseries panels at y=40 between the
    Cache region and the Posix region, plotting
    rate(cache_*_bytes_total) / 1e9 by worker_id. Posix region
    shifted down by 8.

Doc (performance_analysis.md): extends the per-shard / per-worker
explainer block with a mirror entry for cache per-task / per-worker.

Most diagnostic sessions want the service-level signal, not 8 lines
per panel. Add a single user-visible Grafana template variable
`perWorker` (label: View) at the top of every dashboard with two
values:
  * Aggregated (default) — collapse worker_id from grouping; one line
    per panel (or 4 quantile lines for histogram-quantile panels).
  * Per Worker — keep worker_id in grouping; per-worker lines.

Key PromQL trick: the variable's per-worker value is the literal
", worker_id" (comma-prefixed). Every existing `sum by (...)` clause
is rewritten to anchor on `model_name` (which all exprs already
filter on) and append the variable value:

  sum by (le, model_name${perWorker:raw}) (rate(metric_bucket{...}))
  sum by (model_name${perWorker:raw}) (rate(metric{...}))

Anchoring on `model_name` (always pinned to one value via the existing
template filter) avoids the trailing-comma PromQL error when the
variable is empty, while still giving aggregated semantics.

Scope (one Python transform script run on 7 dashboards):
  * 275 `expr` strings refactored across all panel types:
    histogram_quantile, rate division, single rates, ratios, gauges.
  * 275 `legendFormat` strings: drop "worker-" prefix so legend reads
    "{{worker_id}}" instead of "worker-{{worker_id}}".
  * 12 additional vLLM-style `sum by(le)` quantile exprs rewritten
    so they also honour the toggle.
  * 8 panel titles renamed:
      "(per worker)" -> "(aggregated)"
    — toggle now controls grouping; the suffix communicates the
    panel's default view, not the (now toggleable) breakdown.

Heatmap panels (sum by(le) on `increase()`) and categorical panels
(sum by(finished_reason) etc.) intentionally left untouched: their
grouping is already aggregated over workers by design.

Docs:
  * performance_analysis.md — explainer blocks now contrast
    (per shard) / (per task) vs (aggregated), and mention the View
    toggle as the way to drill into per-worker.
  * metrics.md — new "View toggle" subsection under "Import
    Dashboards" describing the new selector.
ucm:load_speed and ucm:save_speed are histograms of per-call
instantaneous speed (size_in_call / duration_of_call), so toggling
the dashboard View=Aggregated only pools observations across workers
into a single distribution — the resulting quantile is still
"typical single-call speed" and does NOT sum across workers. Users
seeing similar p50 between Aggregated and Per Worker have hit this
limitation.

Mirror the cache_*_bytes_total / posix_*_bytes_total pattern at the
connector layer:

  * Add load_bytes_total + save_bytes_total counters in
    metrics_configs.yaml.
  * In ucm_connector.py, accumulate the per-call byte count
    (num_blocks * block_data_size) into the same update_stats block
    that already records load_speed / save_speed (so it only fires
    on the successful is_load / is_save path, no failure pollution).

Grafana (production + CI grafana_connector.json):

  * Rename "Connector Load Speed" -> "Connector Load Speed
    (per task)" and same for Save, with a description sentence
    saying the histogram does not sum across workers.
  * Add new "Connector Load Bandwidth (aggregated)" and "Connector
    Save Bandwidth (aggregated)" panels at y=48 plotting
    sum by (${perWorker:raw}) (rate(*_bytes_total)) / 1e9. CI variant
    carries the extra job="$job" label matcher. Unit "gbytes" to
    match the existing aggregated panels in pipeline_store.

Docs:
  * performance_analysis.md gets a Connector "(per task) vs
    (aggregated)" explainer block mirroring the existing posix /
    cache pairs.
  * metrics.md "Available Metrics" table adds the two new counter
    rows.

Several UC_DEBUG lines in the cache/posix paths print durations and
counts that are useful for diagnosis but never made it into metrics.
This commit promotes four such "debug-only" measurements into proper
Prometheus metrics while leaving the existing UC_DEBUG lines in place
(so verbose-log workflows are unchanged).

Cache buffer lookup duration (buffer_manager.h L45/92/111/127):
  * cache_lookup_duration_ms        — fast in-memory hit/miss scan
  * cache_lookup_backend_duration_ms — backend descent when buffer
                                       missing or miss-blocks need
                                       backend resolution
  Reason: vLLM scheduler hits Lookup on every decision; lookup
  latency directly drives scheduling overhead and TTFT.

Cache dump shard counters (dump_queue.cc DumpOneTask):
  * cache_dump_shards_total          — total shards in the task
  * cache_dump_backend_shards_total  — shards actually pushed to
                                       backend (excludes
                                       !handle.Owner() skips)
  Mirror of the existing cache_load_shards_total /
  cache_load_backend_shards_total pair. Emitted BEFORE the
  early-return at L125 so 0/N tasks are recorded.

Posix task-level duration (io_engine_psync.h + io_engine_aio.h):
  * posix_load_task_duration_ms
  * posix_dump_task_duration_ms
  Both engines capture t->type in the SetEpilog lambda and emit the
  appropriate metric. io_engine_psync.h gains an include of
  metrics_api.h. Use case: directly compare with
  cache_load_duration_ms / cache_dump_duration_ms (same task-level
  granularity) to isolate the cache layer's H2D/D2H + serialization
  overhead vs the posix layer's actual disk time.
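A sketch of the epilog-capture pattern described above (assumed shape; the real engines use their own task, clock, and metrics types):

```cpp
#include <chrono>
#include <functional>
#include <string>

// Stand-in for UC::Metrics::UpdateStats.
void UpdateStats(const std::string& name, double ms) { (void)name; (void)ms; }

// Capture the task direction and start time when the epilog is attached, so the
// completion callback emits the matching task-level duration metric.
std::function<void()> MakeTaskEpilog(bool isDump) {
    const auto tpStart = std::chrono::steady_clock::now();
    return [tpStart, isDump] {
        const double ms = std::chrono::duration<double, std::milli>(
                              std::chrono::steady_clock::now() - tpStart)
                              .count();
        UpdateStats(isDump ? "posix_dump_task_duration_ms" : "posix_load_task_duration_ms",
                    ms);
    };
}
```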

Grafana (production + CI grafana_pipeline_store.json):
  * Insert "Cache Lookup Duration" / "Cache Lookup Backend Duration"
    timeseries panels at y=48 (between cache aggregated row and
    posix region). Posix region shifted down by 8.
  * Insert "Posix Load Task Duration" / "Posix Dump Task Duration"
    at y=80 (between posix per-shard duration row and the
    distributions rows). Distributions rows shifted down by 8.
  * All 4 new panels share the existing histogram-quantile template
    (p50/p90/p99/avg + perWorker toggle support).

docs/source/user-guide/metrics/metrics.md: "Available Metrics" table
gains 6 new entries grouped under Lookup, Shard Counters, and Posix
Task Duration headings.

Out of scope (deferred): ds3fs / compress / pcstore stores are
still uninstrumented — same pattern would apply but they are not
on the user's active CI path.
@dante159753 marked this pull request as draft on May 13, 2026 02:31