client/resource_group: cache request source RU metrics#10588

Open
YuhaoZhang00 wants to merge 2 commits into tikv:master from YuhaoZhang00:rg-request-source-metrics

Conversation

Contributor

@YuhaoZhang00 YuhaoZhang00 commented Apr 9, 2026

What problem does this PR solve?

Issue Number: ref pingcap/tidb#64339.

client-go is moving RU-by-request-source accounting out of the interceptor hot path.

Add request-source RU metrics owned by pd/client and cache the corresponding metric handles.

Related PR:

What is changed and how does it work?

This PR makes pd/client request accounting aware of RequestSource and records RU-by-request-source metrics inside the existing resource-group controller.

Implementation details:

  • add RequestSource() to controller.RequestInfo
  • add RequestSourceRUCounter under resource_manager_client_request
  • cache rru/wru counters per (resource_group, request_source) in a shared per-resource-group state managed by groupCostController
  • reuse the same request-source metric state across normal / tombstone / revived group controllers
  • delete cached handles and Prometheus series when the resource group is finally cleaned up
  • record request-side and response-side RU deltas through the existing accounting flow

This keeps the existing metric dimensions, but moves the metric ownership to pd/client and avoids repeated WithLabelValues() on the hot path in client-go.
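
The handle-caching idea described above can be illustrated with a minimal sketch. This is not the PR's code: `counter`, `sourceCounters`, and `sourceMetricsCache` are hypothetical stand-ins built only on the standard library, and the `lookups` field merely simulates the cost of a `WithLabelValues()` call so that the effect of caching is observable.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical stand-in for a Prometheus counter handle.
type counter struct {
	mu    sync.Mutex
	value float64
}

func (c *counter) Add(v float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value += v
}

func (c *counter) Value() float64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}

// sourceCounters groups the read and write RU handles of one request source.
type sourceCounters struct {
	rru *counter
	wru *counter
}

// sourceMetricsCache caches handles per request source for one resource group.
type sourceMetricsCache struct {
	mu       sync.Mutex
	lookups  int // counts simulated label lookups (assumption for illustration)
	bySource map[string]*sourceCounters
}

func newSourceMetricsCache() *sourceMetricsCache {
	return &sourceMetricsCache{bySource: make(map[string]*sourceCounters)}
}

// getOrCreate returns the cached handles, resolving them only on first use.
func (m *sourceMetricsCache) getOrCreate(source string) *sourceCounters {
	m.mu.Lock()
	defer m.mu.Unlock()
	if sc, ok := m.bySource[source]; ok {
		return sc
	}
	m.lookups++ // stands in for the label lookup paid once per source
	sc := &sourceCounters{rru: &counter{}, wru: &counter{}}
	m.bySource[source] = sc
	return sc
}

// cleanup drops every cached handle and reports how many were removed.
func (m *sourceMetricsCache) cleanup() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	n := len(m.bySource)
	for source := range m.bySource {
		delete(m.bySource, source)
	}
	return n
}

func main() {
	cache := newSourceMetricsCache()
	for i := 0; i < 3; i++ {
		cache.getOrCreate("internal_ddl").wru.Add(1.5)
	}
	cache.getOrCreate("leader_internal_ddl").rru.Add(2)
	fmt.Println("lookups:", cache.lookups)
	fmt.Println("internal_ddl wru:", cache.getOrCreate("internal_ddl").wru.Value())
	fmt.Println("removed on cleanup:", cache.cleanup())
}
```

In this sketch the repeated requests from the same source pay for the lookup once; every later call only takes the mutex and reads the map.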

Change log (2026-04-13)

Before this change, request-source metric state was controller-instance scoped, which could break cleanup across tombstone / revive.

  • keep request-source metric state per resource group instead of per controller instance
  • preserve request-source metric bookkeeping across tombstone / revive paths
  • clean up request-source metric state on final resource-group cleanup

Check List

Tests

  • Unit test
  • Manual test

Performed ADD INDEX locally. DDL-related RU showed up, for example:

  • internal_ddl wru: +56.40898437500003
  • leader_internal_ddl rru: +37.54666388932296
  • internal_DistTask wru: +40.74453125
  • leader_internal_DistTask rru: +59.991488047526154

However, no fine-grained request_source matching add_index / merge_temp_index appeared in the new metric.

This is consistent with the bypass logic working in client-go: the fine-grained add_index / merge_temp_index requests are bypassed before entering pd/client RU accounting, while other non-bypassed DDL-related requests in the same workflow are still visible through coarse DDL sources.

Release note

None.

Summary by CodeRabbit

  • New Features

    • Added request-source level RU/WRU metrics with Prometheus support for per-source consumption tracking.
  • Bug Fixes

    • Ensured request-source metrics are cleaned up when resource groups are removed.
    • Reset request-source RU metrics during controller shutdown to avoid stale data.
  • Tests

    • Added tests validating request-source metrics caching, recording, and cleanup across lifecycle paths.
  • Chores

    • Updated request info contract to include request-source information.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 9, 2026
Contributor

ti-chi-bot bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rleungx for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Apr 9, 2026

Walkthrough

Per-request-source RU/WRU metrics were added and wired into resource group controllers: a new Prometheus counter was introduced, RequestInfo now exposes RequestSource(), group controllers cache per-source metric state and record deltas, and global/controller shutdown and cleanup paths remove those metrics.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Global Controller | client/resource_group/controller/global_controller.go | Added requestSourceStates cache; reset RequestSourceRUCounter on loop shutdown; pass per-group request-source state into newGroupCostController; cleanup request-source state when groups are deleted/tombstoned. |
| Group Controller | client/resource_group/controller/group_controller.go | Added requestSourceMetricsState into groupMetricsCollection; methods to lazily create per-request-source counters, record RU/WRU via addRequestSourceRU, and clean up cached metrics; wired recordings into request lifecycle handlers. |
| Metrics Definition | client/resource_group/controller/metrics/metrics.go | Introduced exported RequestSourceRUCounter *prometheus.CounterVec and registered it with labels {resource_group, request_source, type}. |
| Model / Test Helpers | client/resource_group/controller/model.go, client/resource_group/controller/testutil.go | Extended RequestInfo interface with RequestSource() string; updated TestRequestInfo to include requestSource field and implement the method. |
| Tests | client/resource_group/controller/request_source_metrics_test.go, client/resource_group/controller/group_controller_test.go | Added tests validating caching, counter values, and cleanup; updated test call sites to pass additional nil parameter to newGroupCostController. |

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant GroupController
  participant GlobalController
  participant Prometheus
  Client->>GroupController: send request (RequestInfo includes RequestSource)
  GroupController->>GroupController: compute RU/WRU delta
  GroupController->>GroupController: getOrCreateRequestSourceMetricsState(request_source)
  GroupController->>Prometheus: increment RequestSourceRUCounter(resource_group, request_source, type)
  GroupController->>Client: respond
  Note over GlobalController,GroupController: On shutdown/cleanup
  GlobalController->>GroupController: trigger cleanup/tombstone
  GroupController->>Prometheus: delete labeled series / cleanup
  GroupController->>GlobalController: remove cached request-source state

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

size/L, type/development

Suggested reviewers

  • JmPotato
  • disksing
  • nolouch

Poem

🐇
I count small hops in metrics bright,
Per-source RU in soft moonlight.
Maps stay neat, labels take flight,
Counters hum through cleanup night.
A rabbit nods — tidy bytes!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main change: adding caching for request source RU metrics in the client resource group module. |
| Description check | ✅ Passed | The description covers the problem statement, implementation details, testing approach, and change log; it follows the template structure with required sections completed. |

Contributor

ti-chi-bot bot commented Apr 9, 2026

Hi @YuhaoZhang00. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added contribution This PR is from a community contributor. dco-signoff: no Indicates the PR's author has not signed dco. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2026
@YuhaoZhang00 YuhaoZhang00 changed the title from "tso, server: add debug logs for TSO sync, closure, and forwarding paths" to "client/resource_group: cache request source RU metrics" Apr 9, 2026
@YuhaoZhang00 YuhaoZhang00 force-pushed the rg-request-source-metrics branch from 1dd52fd to ebd4f63 Compare April 9, 2026 03:23
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 9, 2026
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
@YuhaoZhang00 YuhaoZhang00 force-pushed the rg-request-source-metrics branch from ebd4f63 to 492976a Compare April 9, 2026 03:29
@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. and removed dco-signoff: no Indicates the PR's author has not signed dco. labels Apr 9, 2026

codecov bot commented Apr 9, 2026

Codecov Report

❌ Patch coverage is 89.36170% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.96%. Comparing base (5885cec) to head (492976a).
⚠️ Report is 58 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10588      +/-   ##
==========================================
+ Coverage   78.80%   78.96%   +0.15%     
==========================================
  Files         523      532       +9     
  Lines       70529    71931    +1402     
==========================================
+ Hits        55580    56799    +1219     
- Misses      10955    11107     +152     
- Partials     3994     4025      +31     
Flag Coverage Δ
unittests 78.96% <89.36%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@YuhaoZhang00 YuhaoZhang00 marked this pull request as ready for review April 9, 2026 05:00
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 9, 2026

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
client/resource_group/controller/request_source_metrics_test.go (1)

24-35: Consider increasing channel buffer or using unbuffered pattern for robustness.

The channel buffer of 8 could cause the goroutine to block if the collector produces more metrics than the buffer size before the main routine starts consuming. While this is unlikely in controlled test scenarios, a more robust pattern would be to use an unbuffered channel and start consuming immediately, or increase the buffer size.

♻️ Suggested improvement for robustness
 func collectorMetricCount(collector prometheus.Collector) int {
-	ch := make(chan prometheus.Metric, 8)
+	ch := make(chan prometheus.Metric, 128)
 	go func() {
 		collector.Collect(ch)
 		close(ch)
 	}()
 	count := 0
 	for range ch {
 		count++
 	}
 	return count
 }
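
The unbuffered alternative mentioned in the nitpick can be sketched without Prometheus. In this illustration `collectFunc` and `emitN` are hypothetical stand-ins for a collector's `Collect` method, and the values sent are plain integers rather than metrics. The sketch only shows that a producer goroutine paired with an immediately draining consumer completes regardless of how many values are emitted.

```go
package main

import "fmt"

// collectFunc is a hypothetical stand-in for a collector's Collect method.
type collectFunc func(ch chan<- int)

// countEmitted drains an unbuffered channel while the producer runs
// in its own goroutine, so the producer never waits on buffer capacity.
func countEmitted(collect collectFunc) int {
	ch := make(chan int) // unbuffered: each send pairs with a receive below
	go func() {
		collect(ch)
		close(ch) // closing ends the range loop in the consumer
	}()
	count := 0
	for range ch {
		count++
	}
	return count
}

// emitN returns a producer that sends n values (illustrative helper).
func emitN(n int) collectFunc {
	return func(ch chan<- int) {
		for i := 0; i < n; i++ {
			ch <- i
		}
	}
}

func main() {
	fmt.Println(countEmitted(emitN(1000)))
}
```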
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/resource_group/controller/request_source_metrics_test.go` around lines
24 - 35, In collectorMetricCount, avoid the fixed small buffered channel which
can block if collector emits >8 metrics; change ch := make(chan
prometheus.Metric, 8) to either an unbuffered channel (ch := make(chan
prometheus.Metric)) so the main goroutine immediately consumes while the
goroutine runs, or increase the buffer to a safely large value (e.g., 256/1024)
to prevent blocking; ensure this change is applied in the collectorMetricCount
function that calls collector.Collect(ch).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bae52b87-fccd-416c-a4c3-3b5c37206d41

📥 Commits

Reviewing files that changed from the base of the PR and between b21a183 and 492976a.

📒 Files selected for processing (6)
  • client/resource_group/controller/global_controller.go
  • client/resource_group/controller/group_controller.go
  • client/resource_group/controller/metrics/metrics.go
  • client/resource_group/controller/model.go
  • client/resource_group/controller/request_source_metrics_test.go
  • client/resource_group/controller/testutil.go

@YuhaoZhang00
Contributor Author

/cc @JmPotato ptal

@ti-chi-bot ti-chi-bot bot requested a review from JmPotato April 9, 2026 05:17
Contributor

ti-chi-bot bot commented Apr 9, 2026

@YuhaoZhang00: GitHub didn't allow me to request PR reviews from the following users: ptal.

Note that only tikv members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @JmPotato ptal


@YuhaoZhang00
Contributor Author

/release-note-none

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Apr 9, 2026
func (mc *groupMetricsCollection) cleanupRequestSourceMetrics(resourceGroupName string) {
	mc.sourceMetricsMu.Lock()
	defer mc.sourceMetricsMu.Unlock()
	for requestSource := range mc.sourceMetrics {
Member

Will it leak if getOrCreateRequestSourceMetrics creates a new one?

Contributor Author

@YuhaoZhang00 YuhaoZhang00 Apr 9, 2026

No extra leak from this cache:

1. These cached request-source metrics are cleaned up when the resource group is cleaned up (cleanupRequestSourceMetrics() called), so they do not stay around forever.

2. The request_source cardinality is also bounded in practice. In TiDB/client-go it currently comes from a small set of hardcoded values (< 100), so we do not expect it to grow uncontrollably.

If the concern is about concurrency, all sourceMetrics operations are guarded by a mutex.

@YuhaoZhang00
Contributor Author

/hold

@ti-chi-bot ti-chi-bot bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. dco-signoff: no Indicates the PR's author has not signed dco. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 10, 2026
Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
@YuhaoZhang00 YuhaoZhang00 force-pushed the rg-request-source-metrics branch from ca98e85 to 3fe18d6 Compare April 13, 2026 08:18
@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. and removed dco-signoff: no Indicates the PR's author has not signed dco. labels Apr 13, 2026
@YuhaoZhang00
Contributor Author

/unhold

@ti-chi-bot ti-chi-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 13, 2026
@YuhaoZhang00 YuhaoZhang00 requested a review from rleungx April 13, 2026 08:19
Contributor

ti-chi-bot bot commented Apr 13, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456, multiple issues should use full syntax for each issue and be separated by a comma, like: Issue Number: close #123, ref #456.

📖 For more info, you can check the "Linking issues" section in the CONTRIBUTING.md.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/resource_group/controller/global_controller.go`:
- Around line 483-489: getOrCreateRequestSourceMetricsState may return a stale
requestSourceMetricsState that has been marked closed by cleanup(), causing
callers to stop emitting metrics; change getOrCreateRequestSourceMetricsState to
detect state.closed after a Load/LoadOrStore and, if closed, retry by creating a
fresh requestSourceMetricsState and atomically replacing the map entry (e.g.,
loop: Load, if missing create and LoadOrStore, if loaded and closed attempt
CompareAndSwap/Store after validating it is still closed or Delete+retry) so
callers never get a closed state; apply the same pattern to the other similar
helpers noted (the other getOrCreate variants around the 492-497 and 624-626
ranges) so closed entries are always recreated instead of reused.
- Around line 339-342: The shutdown path currently calls the global
RequestSourceRUCounter.Reset() which wipes metric series for other controllers;
instead, invoke this controller's cleanup() to delete only the labels tracked in
requestSourceStates (the existing cleanup method already calls DeleteLabelValues
for each tracked request source). Replace the RequestSourceRUCounter.Reset()
call in the loopCtx.Done() case with a call to cleanup() (while keeping
ResourceGroupStatusGauge.Reset() if intended) so only this controller's metrics
are removed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2d11e34f-f662-42e2-8f08-75f5a8dbd897

📥 Commits

Reviewing files that changed from the base of the PR and between ca98e85 and 3fe18d6.

📒 Files selected for processing (4)
  • client/resource_group/controller/global_controller.go
  • client/resource_group/controller/group_controller.go
  • client/resource_group/controller/group_controller_test.go
  • client/resource_group/controller/request_source_metrics_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • client/resource_group/controller/group_controller_test.go

Comment on lines 339 to 342
case <-c.loopCtx.Done():
	metrics.ResourceGroupStatusGauge.Reset()
	metrics.RequestSourceRUCounter.Reset()
	return

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, locate the file and check its size
wc -l client/resource_group/controller/global_controller.go

Repository: tikv/pd

Length of output: 108


🏁 Script executed:

# Read the section around line 339-342
sed -n '330,350p' client/resource_group/controller/global_controller.go

Repository: tikv/pd

Length of output: 968


🏁 Script executed:

# Check the requestSourceStates declaration
sed -n '140,160p' client/resource_group/controller/global_controller.go

Repository: tikv/pd

Length of output: 752


🏁 Script executed:

# Search for RequestSourceRUCounter definition
rg "RequestSourceRUCounter" client/resource_group/

Repository: tikv/pd

Length of output: 4324


🏁 Script executed:

# Find metrics definition file
find client/resource_group/ -name "*metric*" -type f

Repository: tikv/pd

Length of output: 166


🏁 Script executed:

# Check getOrCreateRequestSourceMetricsState implementation
sed -n '480,510p' client/resource_group/controller/global_controller.go

Repository: tikv/pd

Length of output: 1236


🏁 Script executed:

# Read the metrics definition file
cat -n client/resource_group/controller/metrics/metrics.go

Repository: tikv/pd

Length of output: 8800


🏁 Script executed:

# Check the requestSourceMetricsState struct and its cleanup method
rg -A 30 "type requestSourceMetricsState struct" client/resource_group/controller/

Repository: tikv/pd

Length of output: 2634


🏁 Script executed:

# Check what cleanup() does
rg -B 5 -A 15 "func.*cleanup\(\)" client/resource_group/controller/global_controller.go

Repository: tikv/pd

Length of output: 33


🏁 Script executed:

# Check how other controllers are handling metrics on shutdown
rg -B 5 -A 10 "loopCtx.Done()" client/resource_group/controller/

Repository: tikv/pd

Length of output: 1522


🏁 Script executed:

# Check if there are multiple ResourceGroupsController instances in tests/code
rg "NewResourceGroupsController\|new.*ResourceGroupsController" client/resource_group/ -A 3

Repository: tikv/pd

Length of output: 33


Replace the global RequestSourceRUCounter.Reset() with targeted cleanup of this controller's tracked metrics.

Calling .Reset() on a process-global CounterVec at shutdown will erase metric series still owned by other active controllers. Clean up only the entries tracked by this controller's requestSourceStates instead, using the existing cleanup() method which already calls DeleteLabelValues() for each tracked request source.

Replace global Reset with per-controller cleanup
 		case <-c.loopCtx.Done():
 			metrics.ResourceGroupStatusGauge.Reset()
-			metrics.RequestSourceRUCounter.Reset()
+			c.requestSourceStates.Range(func(key, value any) bool {
+				value.(*requestSourceMetricsState).cleanup()
+				c.requestSourceStates.Delete(key)
+				return true
+			})
 			return
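
The difference between a global reset and a targeted cleanup can be illustrated with a small standard-library sketch. Nothing here is the PR's code or the Prometheus API: `seriesVec` is a hypothetical stand-in for a process-global counter vector, `labelKey` models the {resource_group, request_source, type} labels, and `controller` is an assumed owner that remembers which series it created.

```go
package main

import (
	"fmt"
	"sync"
)

// labelKey models one labeled series (hypothetical type).
type labelKey struct{ group, source, typ string }

// seriesVec is a stand-in for a process-global counter vector.
type seriesVec struct {
	mu     sync.Mutex
	series map[labelKey]float64
}

func newSeriesVec() *seriesVec {
	return &seriesVec{series: make(map[labelKey]float64)}
}

func (v *seriesVec) Add(k labelKey, d float64) {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.series[k] += d
}

// Delete removes a single series and reports whether it existed.
func (v *seriesVec) Delete(k labelKey) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	_, ok := v.series[k]
	delete(v.series, k)
	return ok
}

// Reset removes every series, including those owned by other controllers.
func (v *seriesVec) Reset() {
	v.mu.Lock()
	defer v.mu.Unlock()
	v.series = make(map[labelKey]float64)
}

func (v *seriesVec) Len() int {
	v.mu.Lock()
	defer v.mu.Unlock()
	return len(v.series)
}

// controller records into the shared vector and tracks its own series.
type controller struct {
	vec     *seriesVec
	tracked sync.Map // labelKey -> struct{}
}

func (c *controller) record(k labelKey, d float64) {
	c.vec.Add(k, d)
	c.tracked.Store(k, struct{}{})
}

// cleanup deletes only the series this controller created.
func (c *controller) cleanup() {
	c.tracked.Range(func(key, _ any) bool {
		c.vec.Delete(key.(labelKey))
		c.tracked.Delete(key)
		return true
	})
}

func main() {
	vec := newSeriesVec()
	a := &controller{vec: vec}
	b := &controller{vec: vec}
	a.record(labelKey{"rg1", "internal_ddl", "wru"}, 1)
	b.record(labelKey{"rg2", "internal_ddl", "wru"}, 1)

	a.cleanup()
	fmt.Println("series left after targeted cleanup:", vec.Len())
	vec.Reset()
	fmt.Println("series left after global reset:", vec.Len())
}
```

In this sketch the targeted cleanup leaves the other controller's series in place, while the reset removes it as well, which is the behavior the review comment warns about.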

Comment on lines +483 to +489
func (c *ResourceGroupsController) getOrCreateRequestSourceMetricsState(name string) *requestSourceMetricsState {
	if state, ok := c.requestSourceStates.Load(name); ok {
		return state.(*requestSourceMetricsState)
	}
	state := newRequestSourceMetricsState(name)
	actual, _ := c.requestSourceStates.LoadOrStore(name, state)
	return actual.(*requestSourceMetricsState)

⚠️ Potential issue | 🟠 Major

Closed metric-state entries can be reused during cleanup/recreate.

cleanup() marks state.closed = true before the sync.Map entry is removed. A concurrent getOrCreateRequestSourceMetricsState() can Load() that stale object and attach it to a fresh controller; after that, groupMetricsCollection.getOrCreateRequestSourceMetrics() always returns nil, so request-source RU silently stops emitting for that group.

🔁 Recreate closed state instead of returning it
 func (c *ResourceGroupsController) getOrCreateRequestSourceMetricsState(name string) *requestSourceMetricsState {
-	if state, ok := c.requestSourceStates.Load(name); ok {
-		return state.(*requestSourceMetricsState)
-	}
-	state := newRequestSourceMetricsState(name)
-	actual, _ := c.requestSourceStates.LoadOrStore(name, state)
-	return actual.(*requestSourceMetricsState)
+	for {
+		if v, ok := c.requestSourceStates.Load(name); ok {
+			state := v.(*requestSourceMetricsState)
+			state.mu.RLock()
+			closed := state.closed
+			state.mu.RUnlock()
+			if !closed {
+				return state
+			}
+			c.requestSourceStates.CompareAndDelete(name, state)
+			continue
+		}
+
+		state := newRequestSourceMetricsState(name)
+		actual, loaded := c.requestSourceStates.LoadOrStore(name, state)
+		if !loaded {
+			return state
+		}
+
+		state = actual.(*requestSourceMetricsState)
+		state.mu.RLock()
+		closed := state.closed
+		state.mu.RUnlock()
+		if !closed {
+			return state
+		}
+		c.requestSourceStates.CompareAndDelete(name, state)
+	}
 }

Also applies to: 492-497, 624-626
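
The retry pattern suggested above can be exercised in isolation with a minimal sketch. The `metricsState` and `registry` types are hypothetical simplifications, not the PR's types, and the sketch assumes Go 1.20 or later for `sync.Map.CompareAndDelete`. It shows only the lookup side: a closed entry is removed and replaced instead of being handed back to the caller.

```go
package main

import (
	"fmt"
	"sync"
)

// metricsState is a simplified, hypothetical per-group state.
type metricsState struct {
	mu     sync.RWMutex
	name   string
	closed bool
}

func (s *metricsState) isClosed() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.closed
}

// close marks the state as no longer usable, as a cleanup path would.
func (s *metricsState) close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.closed = true
}

// registry maps a group name to its state (name -> *metricsState).
type registry struct {
	states sync.Map
}

// getOrCreate returns a state that was open at the time of the check.
func (r *registry) getOrCreate(name string) *metricsState {
	for {
		if v, ok := r.states.Load(name); ok {
			state := v.(*metricsState)
			if !state.isClosed() {
				return state
			}
			// Remove the stale entry only if it is still the one we saw.
			r.states.CompareAndDelete(name, state)
			continue
		}
		fresh := &metricsState{name: name}
		if _, loaded := r.states.LoadOrStore(name, fresh); !loaded {
			return fresh
		}
		// Another goroutine stored first; loop to re-check its entry.
	}
}

func main() {
	r := &registry{}
	first := r.getOrCreate("rg1")
	fmt.Println("same state reused:", first == r.getOrCreate("rg1"))

	first.close()
	second := r.getOrCreate("rg1")
	fmt.Println("closed state replaced:", second != first && !second.isClosed())
}
```

A state can still be closed immediately after this function returns it, so callers in a real implementation would presumably need to handle that case as well; the sketch does not address it.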

Member

rleungx commented Apr 13, 2026

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Apr 13, 2026
	return tmp.(*groupCostController), loaded
}

func (c *ResourceGroupsController) getOrCreateRequestSourceMetricsState(name string) *requestSourceMetricsState {
Member

Will there be a race between create and cleanup?

Contributor

ti-chi-bot bot commented Apr 13, 2026

@YuhaoZhang00: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-integration-realcluster-test | 3fe18d6 | link | true | /test pull-integration-realcluster-test |

Full PR test history. Your PR dashboard.

Labels

contribution This PR is from a community contributor. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/needs-linked-issue ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
