client: add pre-throttling demand RU/s metric #10582
JmPotato wants to merge 3 commits into tikv:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver for each of the changed files; approvers can indicate their approval by writing a comment.
📝 Walkthrough: Adds client-side pre-throttling demand RU/s tracking: records total demanded RU, computes a time-aware EMA of demanded RU/s, exposes it via a new per-resource-group Prometheus gauge, and ensures metric label cleanup during resource-group removal. Includes tests validating accumulation and EMA.
Sequence diagram:

```mermaid
sequenceDiagram
    participant Client
    participant GroupController
    participant TokenCounter
    participant Metrics
    Client->>GroupController: Send request (RU v) -- pre-throttle sample
    GroupController->>TokenCounter: recordDemand(v) and calcDemandAvg(now)
    TokenCounter-->>GroupController: updated avgDemandRUPerSec
    GroupController->>Metrics: DemandRUPerSecGauge.Set(resource_group, avgDemandRUPerSec)
    GroupController->>TokenCounter: Reserve()/AcquireTokens() (may block/reject)
    TokenCounter-->>GroupController: allow/reject + consumption accounting
```
🚥 Pre-merge checks: 4 passed ✅, 1 failed ❌ (warning).
```go
tokenRequestCounter:     metrics.ResourceGroupTokenRequestCounter.WithLabelValues(oldName, name),
runningKVRequestCounter: metrics.GroupRunningKVRequestCounter.WithLabelValues(name),
consumeTokenHistogram:   metrics.TokenConsumedHistogram.WithLabelValues(name),
demandRUPerSecGauge:     metrics.DemandRUPerSecGauge.WithLabelValues(name),
```
Will these metrics be cleaned up?
Good catch. On the current master only ResourceGroupStatusGauge is cleaned in cleanUpResourceGroup, so this new gauge would leak label series the same way most other per-group metrics already do. The leak impact is small in practice (resource groups are typically long-lived, client-side only, bounded per process), but it's still a real leak and not a design decision. I'll add DeleteLabelValues for DemandRUPerSecGauge in the cleanup path to keep this PR self-consistent. Cleaning up the other pre-existing per-group metrics is out of scope here — I'm tracking that in a separate branch.
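To make the leak concrete, here is a minimal stdlib-only Go sketch. The map stands in for a `prometheus.GaugeVec`, which keeps one child series per label value until it is explicitly deleted; the method names mirror the real `WithLabelValues`/`DeleteLabelValues` API, but the types and the `cleanUpResourceGroup` helper below are illustrative stand-ins, not the pd client code.

```go
package main

import "fmt"

// gaugeVec is a toy stand-in for prometheus.GaugeVec: every label value
// creates a child series that lives until it is explicitly deleted.
type gaugeVec struct{ series map[string]float64 }

func (g *gaugeVec) withLabelValues(group string)   { g.series[group] = 0 }
func (g *gaugeVec) deleteLabelValues(group string) { delete(g.series, group) }

// cleanUpResourceGroup models the fix: removing a resource group must also
// drop its metric series, otherwise the label series leaks for the process
// lifetime even after the group is gone.
func cleanUpResourceGroup(demandGauge *gaugeVec, name string) {
	demandGauge.deleteLabelValues(name)
}

func main() {
	demand := &gaugeVec{series: map[string]float64{}}
	demand.withLabelValues("rg1") // group created: one child series exists
	cleanUpResourceGroup(demand, "rg1")
	fmt.Println("series left:", len(demand.series)) // no leaked label series
}
```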
```go
// ResourceGroupStatusGauge comments placeholder
ResourceGroupStatusGauge *prometheus.GaugeVec
// DemandRUPerSecGauge is the EMA of demanded RU/s before throttling per resource group.
DemandRUPerSecGauge *prometheus.GaugeVec
```
If it is throttled, will the demand be recorded?
Yes — that's exactly the design intent of this metric. Demand is accumulated at request entry (onRequestWaitImpl etc.) before acquireTokens is called, and on throttle error only consumption is rolled back via sub(gc.mu.consumption, delta); demandRUTotal is never subtracted. So throttled requests still count toward demand, which is the whole point of exposing this separately from the existing consumption-based avgRUPerSec. I'll also tighten the Help text / doc comment to make this invariant explicit ("including requests rejected by the token bucket").
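A minimal sketch of that accounting rule, using a simplified stand-in struct (the field names echo the PR, but this is illustrative, not the real cost controller): demand is recorded at request entry and never subtracted, while consumption is rolled back when the token bucket rejects the request.

```go
package main

import "fmt"

// costState is a toy stand-in for the controller's accounting state:
// demandRUTotal only ever grows; consumption is rolled back on rejection.
type costState struct {
	demandRUTotal float64 // monotonic: includes throttled requests
	consumption   float64 // post-throttling: rejections are subtracted back
}

// onRequest records demand before any token acquisition is attempted;
// throttled=true models a token-bucket rejection.
func (s *costState) onRequest(ru float64, throttled bool) {
	s.demandRUTotal += ru // never subtracted, even on rejection
	s.consumption += ru
	if throttled {
		s.consumption -= ru // only consumption is rolled back
	}
}

func main() {
	s := &costState{}
	s.onRequest(10, false) // accepted
	s.onRequest(40, true)  // rejected by the token bucket
	// demand sees both requests; consumption sees only the accepted one
	fmt.Printf("demand=%.0f consumption=%.0f\n", s.demandRUTotal, s.consumption)
}
```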
Add a new client-side Prometheus gauge `resource_manager_client_resource_group_demand_ru_per_sec` that tracks the EMA of demanded RU/s before Resource Control throttling takes effect.

The existing `avgRUPerSec` is based on post-throttling consumption: when a request is rejected by the token bucket, its RU cost is subtracted from the consumption counter. This means the consumption-based EMA underreports the true workload demand when the resource group is actively throttled.

The new demand metric samples RU cost at every `onRequestWaitImpl`, `onResponseImpl`, `onResponseWaitImpl`, and `addRUConsumption` entry point, accumulating into a monotonically increasing `demandRUTotal` counter that is never subtracted on throttle failure. A demand EMA is then computed using the same `movingAvgFactor` as the consumption EMA and flushed to the gauge on each `updateAvgRequestResourcePerSec` tick.

This enables operators to:

- See per-instance RU demand in Grafana (natural `instance` label).
- Aggregate cluster-wide demand via `sum by (resource_group)`.
- Identify the true workload peak via `max_over_time(...)`.

Close tikv#10581

Signed-off-by: JmPotato <ghzpotato@gmail.com>
Signed-off-by: JmPotato <github@ipotato.me>
Address review feedback on tikv#10582:

- Delete `DemandRUPerSecGauge` label series in `cleanUpResourceGroup` so the new gauge does not leak labels when a resource group is deleted. Add a TODO tracking the remaining per-group metrics that still leak (TokenConsumedHistogram, GroupRunningKVRequestCounter, SuccessfulRequestDuration, FailedRequestCounter, ResourceGroupTokenRequestCounter, RequestRetryCounter, FailedLimitReserveDuration).
- Clarify the metric Help text and doc comment to make the pre-throttling semantics explicit: the EMA includes requests rejected by the token bucket, which is the whole reason this metric is exposed separately from the consumption-based `avg_ru_per_sec`.

Signed-off-by: JmPotato <github@ipotato.me>
JmPotato force-pushed the branch from 5933f85 to 4d6937d (compare).
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##           master   #10582      +/-   ##
==========================================
- Coverage   78.96%   78.94%   -0.03%
==========================================
  Files         532      532
  Lines       71883    72003     +120
==========================================
+ Hits        56766    56840      +74
- Misses      11093    11130      +37
- Partials     4024     4033       +9
==========================================
```

Flags with carried forward coverage won't be shown.
The new `demand_ru_per_sec` gauge promised "every entry point, never subtracted on throttle failure", but `onResponseWaitImpl` increments `mu.demandRUTotal` only after `acquireTokens` succeeds, so a throttle rejection silently drops the demand sample -- exactly the case the metric is meant to surface.

Root cause: the increment was co-located with the consumption update inside the post-acquire lock block, even though demand and consumption have different lifetimes (demand is monotonic; consumption is rolled back on rejection). Four call sites carried the same inline expression, making the wrong placement easy to add and hard to notice.

This commit makes the invariant structural:

* Add `(*groupCostController).recordDemand`, the single point where `mu.demandRUTotal` grows. Its doc comment states the rule: callers MUST invoke it before any limiter wait/acquire so demand survives a rejection.
* Route `onRequestWaitImpl`, `onResponseImpl`, `onResponseWaitImpl`, and `addRUConsumption` through `recordDemand`. In `onResponseWaitImpl` this also hoists the call above `acquireTokens`, fixing the bug.
* Add `TestDemandRUCapturedOnResponseWaitThrottle` to lock the invariant in via the throttle-fail path.
* Rewrite the EMA portion of `TestDemandRUTracking`: the previous version assigned `gc.run.now` only to have `updateRunState` immediately overwrite it with `time.Now()`, so the two-tick EMA assertion was a no-op. The new version drives `calcDemandAvg` directly with hand-set timestamps and asserts the actual EMA trajectory.
* Mirror the `acceleratedReportingPeriod` failpoint into `calcDemandAvg` so any test that accelerates `calcAvg` accelerates the demand EMA in lockstep.
* `calcDemandAvg` now returns whether it actually updated; the gauge Set is gated on that so we never re-publish a stale value when no time has elapsed. Drop the `< 0` clamp -- the input counter is monotonically increasing, so the EMA cannot go negative.
* Extend the leak TODO in `cleanUpResourceGroup` to include `LowTokenRequestNotifyCounter`, which has the same per-group label cardinality as the others on the list.

Signed-off-by: JmPotato <github@ipotato.me>
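The `recordDemand` refactor described in this commit can be sketched as follows. The types and method bodies are simplified stand-ins for illustration, not the real pd client code; what the sketch preserves is the structural invariant: demand is recorded before the acquire, so a token-bucket rejection cannot drop the sample.

```go
package main

import (
	"errors"
	"fmt"
)

// groupCostController is a toy stand-in for the client's cost controller.
type groupCostController struct {
	demandRUTotal float64
	consumption   float64
}

var errThrottled = errors.New("token bucket rejected request")

// recordDemand is the single point where demandRUTotal grows.
// Callers MUST invoke it before any limiter wait/acquire.
func (gc *groupCostController) recordDemand(ru float64) { gc.demandRUTotal += ru }

// acquireTokens models the limiter: throttle=true forces a rejection.
func (gc *groupCostController) acquireTokens(ru float64, throttle bool) error {
	if throttle {
		return errThrottled
	}
	gc.consumption += ru
	return nil
}

// onResponseWait hoists recordDemand above acquireTokens, so the demand
// sample survives even when the acquire fails.
func (gc *groupCostController) onResponseWait(ru float64, throttle bool) error {
	gc.recordDemand(ru) // before acquire: survives a rejection
	return gc.acquireTokens(ru, throttle)
}

func main() {
	gc := &groupCostController{}
	err := gc.onResponseWait(25, true) // throttled request
	fmt.Printf("err=%v demand=%.0f consumption=%.0f\n",
		err, gc.demandRUTotal, gc.consumption)
}
```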
@JmPotato: The following tests failed:
```go
gc.mu.Lock()
latestConsumption := *gc.mu.consumption
gc.mu.Unlock()
if equalRU(latestConsumption, *gc.run.consumption) {
```
Will the metrics be deleted if the request RU is large?
What problem does this PR solve?
Issue Number: Close #10581
What is changed and how does it work?
Check List
Tests
Release note