client/resource_group: cache request source RU metrics #10588
YuhaoZhang00 wants to merge 2 commits into tikv:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
📝 Walkthrough

Per-request-source RU/WRU metrics were added and wired into the resource group controllers: a new Prometheus counter was introduced, and RequestInfo now exposes RequestSource().

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant GroupController
    participant GlobalController
    participant Prometheus
    Client->>GroupController: send request (RequestInfo includes RequestSource)
    GroupController->>GroupController: compute RU/WRU delta
    GroupController->>GroupController: getOrCreateRequestSourceMetricsState(request_source)
    GroupController->>Prometheus: increment RequestSourceRUCounter(resource_group, request_source, type)
    GroupController->>Client: respond
    Note over GlobalController,GroupController: On shutdown/cleanup
    GlobalController->>GroupController: trigger cleanup/tombstone
    GroupController->>Prometheus: delete labeled series / cleanup
    GroupController->>GlobalController: remove cached request-source state
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
Hi @YuhaoZhang00. Thanks for your PR. I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed: 1dd52fd to ebd4f63 (Compare)
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
Force-pushed: ebd4f63 to 492976a (Compare)
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

@@ Coverage Diff @@
## master #10588 +/- ##
==========================================
+ Coverage 78.80% 78.96% +0.15%
==========================================
Files 523 532 +9
Lines 70529 71931 +1402
==========================================
+ Hits 55580 56799 +1219
- Misses 10955 11107 +152
- Partials 3994 4025 +31
Flags with carried forward coverage won't be shown.
🧹 Nitpick comments (1)
client/resource_group/controller/request_source_metrics_test.go (1)
Lines 24-35: Consider increasing the channel buffer or using an unbuffered pattern for robustness. The channel buffer of 8 could cause the goroutine to block if the collector produces more metrics than the buffer size before the main routine starts consuming. While this is unlikely in controlled test scenarios, a more robust pattern would be to use an unbuffered channel and start consuming immediately, or to increase the buffer size.
♻️ Suggested improvement for robustness
```diff
 func collectorMetricCount(collector prometheus.Collector) int {
-	ch := make(chan prometheus.Metric, 8)
+	ch := make(chan prometheus.Metric, 128)
 	go func() {
 		collector.Collect(ch)
 		close(ch)
 	}()
 	count := 0
 	for range ch {
 		count++
 	}
 	return count
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@client/resource_group/controller/request_source_metrics_test.go` around lines 24 - 35, In collectorMetricCount, avoid the fixed small buffered channel which can block if collector emits >8 metrics; change ch := make(chan prometheus.Metric, 8) to either an unbuffered channel (ch := make(chan prometheus.Metric)) so the main goroutine immediately consumes while the goroutine runs, or increase the buffer to a safely large value (e.g., 256/1024) to prevent blocking; ensure this change is applied in the collectorMetricCount function that calls collector.Collect(ch).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bae52b87-fccd-416c-a4c3-3b5c37206d41
📒 Files selected for processing (6)
- client/resource_group/controller/global_controller.go
- client/resource_group/controller/group_controller.go
- client/resource_group/controller/metrics/metrics.go
- client/resource_group/controller/model.go
- client/resource_group/controller/request_source_metrics_test.go
- client/resource_group/controller/testutil.go
/cc @JmPotato ptal
@YuhaoZhang00: GitHub didn't allow me to request PR reviews from the following users: ptal. Note that only tikv members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/release-note-none
```go
func (mc *groupMetricsCollection) cleanupRequestSourceMetrics(resourceGroupName string) {
	mc.sourceMetricsMu.Lock()
	defer mc.sourceMetricsMu.Unlock()
	for requestSource := range mc.sourceMetrics {
```
Will it leak if getOrCreateRequestSourceMetrics creates a new one?
No extra leak from this cache:
1. These cached request-source metrics are cleaned up when the resource group is cleaned up (cleanupRequestSourceMetrics() called), so they do not stay around forever.
2. The request_source cardinality is also bounded in practice. In TiDB/client-go it currently comes from a small set of hardcoded values (< 100), so we do not expect it to grow uncontrollably.
If the concern is about a concurrency issue: all sourceMetrics operations are wrapped in mutex locks.
/hold
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
Force-pushed: ca98e85 to 3fe18d6 (Compare)
/unhold
[FORMAT CHECKER NOTIFICATION] 📖 For more info, you can check the "Linking issues" section in CONTRIBUTING.md.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@client/resource_group/controller/global_controller.go`:
- Around line 483-489: getOrCreateRequestSourceMetricsState may return a stale
requestSourceMetricsState that has been marked closed by cleanup(), causing
callers to stop emitting metrics; change getOrCreateRequestSourceMetricsState to
detect state.closed after a Load/LoadOrStore and, if closed, retry by creating a
fresh requestSourceMetricsState and atomically replacing the map entry (e.g.,
loop: Load, if missing create and LoadOrStore, if loaded and closed attempt
CompareAndSwap/Store after validating it is still closed or Delete+retry) so
callers never get a closed state; apply the same pattern to the other similar
helpers noted (the other getOrCreate variants around the 492-497 and 624-626
ranges) so closed entries are always recreated instead of reused.
- Around line 339-342: The shutdown path currently calls the global
RequestSourceRUCounter.Reset() which wipes metric series for other controllers;
instead, invoke this controller's cleanup() to delete only the labels tracked in
requestSourceStates (the existing cleanup method already calls DeleteLabelValues
for each tracked request source). Replace the RequestSourceRUCounter.Reset()
call in the loopCtx.Done() case with a call to cleanup() (while keeping
ResourceGroupStatusGauge.Reset() if intended) so only this controller's metrics
are removed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2d11e34f-f662-42e2-8f08-75f5a8dbd897
📒 Files selected for processing (4)
- client/resource_group/controller/global_controller.go
- client/resource_group/controller/group_controller.go
- client/resource_group/controller/group_controller_test.go
- client/resource_group/controller/request_source_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- client/resource_group/controller/group_controller_test.go
```go
case <-c.loopCtx.Done():
	metrics.ResourceGroupStatusGauge.Reset()
	metrics.RequestSourceRUCounter.Reset()
	return
```
🧩 Analysis chain

🏁 Scripts executed (repository: tikv/pd):

```shell
# First, locate the file and check its size
wc -l client/resource_group/controller/global_controller.go

# Read the section around line 339-342
sed -n '330,350p' client/resource_group/controller/global_controller.go

# Check the requestSourceStates declaration
sed -n '140,160p' client/resource_group/controller/global_controller.go

# Search for RequestSourceRUCounter definition
rg "RequestSourceRUCounter" client/resource_group/

# Find metrics definition file
find client/resource_group/ -name "*metric*" -type f

# Check getOrCreateRequestSourceMetricsState implementation
sed -n '480,510p' client/resource_group/controller/global_controller.go

# Read the metrics definition file
cat -n client/resource_group/controller/metrics/metrics.go

# Check the requestSourceMetricsState struct and its cleanup method
rg -A 30 "type requestSourceMetricsState struct" client/resource_group/controller/

# Check what cleanup() does
rg -B 5 -A 15 "func.*cleanup\(\)" client/resource_group/controller/global_controller.go

# Check how other controllers are handling metrics on shutdown
rg -B 5 -A 10 "loopCtx.Done()" client/resource_group/controller/

# Check if there are multiple ResourceGroupsController instances in tests/code
rg "NewResourceGroupsController|new.*ResourceGroupsController" client/resource_group/ -A 3
```
Replace the global RequestSourceRUCounter.Reset() with targeted cleanup of this controller's tracked metrics.
Calling .Reset() on a process-global CounterVec at shutdown will erase metric series still owned by other active controllers. Clean up only the entries tracked by this controller's requestSourceStates instead, using the existing cleanup() method which already calls DeleteLabelValues() for each tracked request source.
Replace global Reset with per-controller cleanup:

```diff
 case <-c.loopCtx.Done():
 	metrics.ResourceGroupStatusGauge.Reset()
-	metrics.RequestSourceRUCounter.Reset()
+	c.requestSourceStates.Range(func(key, value any) bool {
+		value.(*requestSourceMetricsState).cleanup()
+		c.requestSourceStates.Delete(key)
+		return true
+	})
 	return
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
case <-c.loopCtx.Done():
	metrics.ResourceGroupStatusGauge.Reset()
	c.requestSourceStates.Range(func(key, value any) bool {
		value.(*requestSourceMetricsState).cleanup()
		c.requestSourceStates.Delete(key)
		return true
	})
	return
```
```go
func (c *ResourceGroupsController) getOrCreateRequestSourceMetricsState(name string) *requestSourceMetricsState {
	if state, ok := c.requestSourceStates.Load(name); ok {
		return state.(*requestSourceMetricsState)
	}
	state := newRequestSourceMetricsState(name)
	actual, _ := c.requestSourceStates.LoadOrStore(name, state)
	return actual.(*requestSourceMetricsState)
```
Closed metric-state entries can be reused during cleanup/recreate.
cleanup() marks state.closed = true before the sync.Map entry is removed. A concurrent getOrCreateRequestSourceMetricsState() can Load() that stale object and attach it to a fresh controller; after that, groupMetricsCollection.getOrCreateRequestSourceMetrics() always returns nil, so request-source RU silently stops emitting for that group.
🔁 Recreate closed state instead of returning it:

```diff
 func (c *ResourceGroupsController) getOrCreateRequestSourceMetricsState(name string) *requestSourceMetricsState {
-	if state, ok := c.requestSourceStates.Load(name); ok {
-		return state.(*requestSourceMetricsState)
-	}
-	state := newRequestSourceMetricsState(name)
-	actual, _ := c.requestSourceStates.LoadOrStore(name, state)
-	return actual.(*requestSourceMetricsState)
+	for {
+		if v, ok := c.requestSourceStates.Load(name); ok {
+			state := v.(*requestSourceMetricsState)
+			state.mu.RLock()
+			closed := state.closed
+			state.mu.RUnlock()
+			if !closed {
+				return state
+			}
+			c.requestSourceStates.CompareAndDelete(name, state)
+			continue
+		}
+
+		state := newRequestSourceMetricsState(name)
+		actual, loaded := c.requestSourceStates.LoadOrStore(name, state)
+		if !loaded {
+			return state
+		}
+
+		state = actual.(*requestSourceMetricsState)
+		state.mu.RLock()
+		closed := state.closed
+		state.mu.RUnlock()
+		if !closed {
+			return state
+		}
+		c.requestSourceStates.CompareAndDelete(name, state)
+	}
 }
```

Also applies to: 492-497, 624-626
/ok-to-test
```go
	return tmp.(*groupCostController), loaded
}

func (c *ResourceGroupsController) getOrCreateRequestSourceMetricsState(name string) *requestSourceMetricsState {
```
Will there be a race between create and cleanup?
|
@YuhaoZhang00: The following test failed.

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What problem does this PR solve?
Issue Number: ref pingcap/tidb#64339.
client-go is moving RU-by-request-source accounting out of the interceptor hot path. Add pd/client's own request-source RU metrics and cache the corresponding metric handles.

Related PR:
What is changed and how does it work?
This PR makes pd/client request accounting aware of RequestSource and records RU-by-request-source metrics inside the existing resource-group controller.

Implementation details:

- Add RequestSource() to controller.RequestInfo
- Record RequestSourceRUCounter under resource_manager_client_request with labels (resource_group, request_source), in a shared per-resource-group state managed by groupCostController
- Reuse the same request-source metric state across normal / tombstone / revived group controllers
- Delete cached handles and Prometheus series when the resource group is finally cleaned up

This keeps the existing metric dimensions, but moves the metric ownership to pd/client and avoids repeated WithLabelValues() on the hot path in client-go.

Change log (2026-04-13)
Before this change, request-source metric state was controller-instance scoped, which could break cleanup across tombstone / revive.
Check List
Tests performed:

- Ran ADD INDEX locally; the DDL-related RU showed up, such as:
  - internal_ddl wru: +56.40898437500003
  - leader_internal_ddl rru: +37.54666388932296
  - internal_DistTask wru: +40.74453125
  - leader_internal_DistTask rru: +59.991488047526154

  but no fine-grained request_source matching add_index / merge_temp_index appeared in the new metric.

This is consistent with the bypass logic working in client-go: the fine-grained add_index / merge_temp_index requests are bypassed before entering pd/client RU accounting, while other non-bypassed DDL-related requests in the same workflow are still visible through coarse DDL sources.
Summary by CodeRabbit

- New Features
- Bug Fixes
- Tests
- Chores