server, schedule: Add observability for PD API leader-ready and cluster startup paths. #10523
Conversation
* add metrics and log
* add metrics and log
* add metrics and log
* add leader ready log

Signed-off-by: ystaticy <y_static_y@sina.com>
(cherry picked from commit a7c733b)
Validation:
@coderabbitai full review
✅ Actions performed: full review triggered.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Added latency instrumentation and structured logging across region labeler creation, cluster startup, and leader campaign readiness phases. Metrics track operation durations using Prometheus histograms, and phase-level timing logs enable observability into startup and readiness sequences.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
server/server.go (1)
Lines 1828-1849: ⚠️ Potential issue | 🟡 Minor — Emit the ready-to-serve log at the actual readiness boundary.

Line 1828 says `PromoteSelf()` starts accepting the remaining APIs, but `"PD leader is ready to serve"` and `total-cost` are logged only after `CheckPDVersionWithClusterVersion()`. That overstates leader-ready latency and makes the final message later than the real serve point. Move the ready log just after `PromoteSelf()`, or rename the last message to something like `leader initialization completed`.

⏱️ Proposed fix

```diff
 log.Info("[leader-ready] start to promote leader")
 enableLeaderStart := time.Now()
 s.member.PromoteSelf()
 enableLeaderDuration := time.Since(enableLeaderStart)
 member.ServiceMemberGauge.WithLabelValues(PD).Set(1)
 log.Info("[leader-ready] promote leader completed", zap.Duration("cost", enableLeaderDuration))
 defer resetLeaderOnce.Do(func() {
 	// as soon as cancel the leadership keepalive, then other member have chance
 	// to be new leader.
 	cancel()
 	s.member.Resign()
 	member.ServiceMemberGauge.WithLabelValues(PD).Set(0)
 })
+totalDuration := time.Since(leaderReadyStart)
+log.Info("[leader-ready] PD leader is ready to serve", zap.String("leader-name", s.Name()), zap.Duration("total-cost", totalDuration))
 log.Info("[leader-ready] start to check PD version with cluster version")
 versionCheckStart := time.Now()
 CheckPDVersionWithClusterVersion(s.persistOptions)
 versionCheckDuration := time.Since(versionCheckStart)
 log.Info("[leader-ready] check PD version with cluster version completed", zap.Duration("cost", versionCheckDuration))
-totalDuration := time.Since(leaderReadyStart)
-log.Info("[leader-ready] PD leader is ready to serve", zap.String("leader-name", s.Name()), zap.Duration("total-cost", totalDuration))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/server.go` around lines 1828 - 1849, The ready-to-serve log is emitted after CheckPDVersionWithClusterVersion(), overstating readiness latency; move the "PD leader is ready to serve" log (and the total-cost calculation from leaderReadyStart) to immediately after s.member.PromoteSelf() (i.e., right after PromoteSelf() and the member.ServiceMemberGauge.WithLabelValues(PD).Set(1) call) so the message reflects the actual serve boundary, and then log the version check completion separately (or alternatively rename the final log to "leader initialization completed" if you prefer keeping it after CheckPDVersionWithClusterVersion()); adjust uses of leaderReadyStart/totalDuration accordingly and keep the defer block (resetLeaderOnce/Resign/cancel) unchanged.

server/cluster/cluster.go (1)
Lines 375-393: ⚠️ Potential issue | 🟡 Minor — Include placement-rule initialization in this phase's timing/error logs.

`loadClusterInfoDuration` is captured before `c.ruleManager.Initialize()`, and an error from lines 388-390 returns without any `[leader-ready]` log. On placement-rules-enabled clusters, the success path underreports cost and the failure path disappears from the trace. Move the duration calculation below the initialize call, or give placement-rule initialization its own timed step.

🧭 Proposed fix

```diff
 log.Info("[leader-ready] start to load cluster info")
 loadClusterInfoStart := time.Now()
 cluster, err := c.LoadClusterInfo()
 if err != nil {
 	log.Error("[leader-ready] failed to load cluster info", errs.ZapError(err), zap.Duration("cost", time.Since(loadClusterInfoStart)))
 	return err
 }
-loadClusterInfoDuration := time.Since(loadClusterInfoStart)
 if cluster == nil {
+	loadClusterInfoDuration := time.Since(loadClusterInfoStart)
 	log.Warn("[leader-ready] cluster is not bootstrapped", zap.Duration("cost", loadClusterInfoDuration))
 	return nil
 }
 if c.opt.IsPlacementRulesEnabled() {
-	err := c.ruleManager.Initialize(c.opt.GetMaxReplicas(), c.opt.GetLocationLabels(), c.opt.GetIsolationLevel(), false)
-	if err != nil {
+	ruleInitStart := time.Now()
+	err := c.ruleManager.Initialize(c.opt.GetMaxReplicas(), c.opt.GetLocationLabels(), c.opt.GetIsolationLevel(), false)
+	if err != nil {
+		log.Error("[leader-ready] failed to initialize placement rules", errs.ZapError(err), zap.Duration("cost", time.Since(ruleInitStart)))
 		return err
 	}
+	log.Info("[leader-ready] initialize placement rules completed", zap.Duration("cost", time.Since(ruleInitStart)))
 }
+loadClusterInfoDuration := time.Since(loadClusterInfoStart)
 log.Info("[leader-ready] load cluster info completed", zap.Duration("cost", loadClusterInfoDuration))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/cluster/cluster.go` around lines 375 - 393, The timing and error logging for leader-ready currently measures loadClusterInfoDuration before calling c.ruleManager.Initialize, so placement-rule initialization errors are returned without any [leader-ready] log and success cost underreports; update the block around LoadClusterInfo and c.ruleManager.Initialize to either (a) move loadClusterInfoDuration := time.Since(loadClusterInfoStart) until after the Initialize call so the final log.Info("[leader-ready] load cluster info completed", ...) includes the initialize time, or (b) add a separate timed step for placement-rule initialization that records its own start, computes its duration, logs success with zap.Duration and logs errors with errs.ZapError and zap.Duration before returning; reference c.LoadClusterInfo, c.ruleManager.Initialize, c.opt.IsPlacementRulesEnabled, loadClusterInfoStart and loadClusterInfoDuration when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/cluster/cluster.go`:
- Around line 330-333: The current deferred observation (start := time.Now();
defer { raftClusterStartDuration.Observe(...) }) records durations for early
returns and "already running" paths; change it so only successful Raft startups
are recorded: remove the global defer around start, keep start := time.Now()
only where needed, and call
raftClusterStartDuration.Observe(time.Since(start).Seconds()) explicitly after
the code path that completes a real cluster start (e.g., after the Raft
bootstrap/Start call returns success) inside the success branch of the function
(or alternatively add a low-cardinality label like success="true"/"false" and
set it only on completion). Apply the same adjustment for the other similar
blocks around the start/ defer usage (the blocks at the other occurrences).
---
Outside diff comments:
In `@server/cluster/cluster.go`:
- Around line 375-393: The timing and error logging for leader-ready currently
measures loadClusterInfoDuration before calling c.ruleManager.Initialize, so
placement-rule initialization errors are returned without any [leader-ready] log
and success cost underreports; update the block around LoadClusterInfo and
c.ruleManager.Initialize to either (a) move loadClusterInfoDuration :=
time.Since(loadClusterInfoStart) until after the Initialize call so the final
log.Info("[leader-ready] load cluster info completed", ...) includes the
initialize time, or (b) add a separate timed step for placement-rule
initialization that records its own start, computes its duration, logs success
with zap.Duration and logs errors with errs.ZapError and zap.Duration before
returning; reference c.LoadClusterInfo, c.ruleManager.Initialize,
c.opt.IsPlacementRulesEnabled, loadClusterInfoStart and loadClusterInfoDuration
when making the change.
In `@server/server.go`:
- Around line 1828-1849: The ready-to-serve log is emitted after
CheckPDVersionWithClusterVersion(), overstating readiness latency; move the "PD
leader is ready to serve" log (and the total-cost calculation from
leaderReadyStart) to immediately after s.member.PromoteSelf() (i.e., right after
PromoteSelf() and the member.ServiceMemberGauge.WithLabelValues(PD).Set(1) call)
so the message reflects the actual serve boundary, and then log the version
check completion separately (or alternatively rename the final log to "leader
initialization completed" if you prefer keeping it after
CheckPDVersionWithClusterVersion()); adjust uses of
leaderReadyStart/totalDuration accordingly and keep the defer block
(resetLeaderOnce/Resign/cancel) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f5707703-8d9a-4db0-8bea-dacee6d447ae
📒 Files selected for processing (5)
- pkg/schedule/labeler/labeler.go
- pkg/schedule/labeler/metrics.go
- server/cluster/cluster.go
- server/cluster/metrics.go
- server/server.go
/retest
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Re-reviewed the latest update. The two remaining findings are now addressed.

I do not have a new finding on this follow-up delta.
🧹 Nitpick comments (2)
server/cluster/cluster.go (2)
Line 400: Minor — Use `errs.ZapError(err)` for consistency.

All other error logs in this function use `errs.ZapError(err)` (lines 347, 379, 425, 446), but this line uses `zap.Error(err)`.

🔧 Suggested fix

```diff
-	log.Error("[leader-ready] region labeler creation failed", zap.Error(err), zap.Duration("cost", labelerDuration))
+	log.Error("[leader-ready] region labeler creation failed", errs.ZapError(err), zap.Duration("cost", labelerDuration))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/cluster/cluster.go` at line 400, Replace the inconsistent zap.Error usage in the log call that reports "region labeler creation failed": locate the log.Error invocation that passes zap.Error(err) along with zap.Duration("cost", labelerDuration) and change the error encoder to use errs.ZapError(err) so the call matches the other error logs in this function.
Lines 405-409: Consider adding phase logging for affinity manager creation.

For consistency with the surrounding instrumentation (labeler at lines 395-403, slow store at lines 412-419), the affinity manager creation step lacks `[leader-ready]` start/completion logs and timing. If this operation is expected to be non-trivial, consider adding similar instrumentation.

🔧 Suggested addition for consistency

```diff
+log.Info("[leader-ready] creating affinity manager")
+affinityStart := time.Now()
 // create affinity manager with region labeler for key range validation and rebuild
 c.affinityManager, err = affinity.NewManager(c.ctx, c.storage, c, c.GetOpts(), c.regionLabeler)
+affinityDuration := time.Since(affinityStart)
 if err != nil {
+	log.Error("[leader-ready] affinity manager creation failed", errs.ZapError(err), zap.Duration("cost", affinityDuration))
 	return err
 }
+log.Info("[leader-ready] affinity manager created", zap.Duration("cost", affinityDuration))
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/cluster/cluster.go` around lines 405 - 409, The affinity manager creation (c.affinityManager = affinity.NewManager(...)) lacks the `[leader-ready]` phase start/completion logs and timing used for neighboring steps; wrap the call to affinity.NewManager (the call creating c.affinityManager using c.ctx, c.storage, c.GetOpts(), c.regionLabeler) with the same leader-ready instrumentation pattern used for the labeler and slow store: emit a `[leader-ready]` start log for "affinity-manager" before the call, record timing (e.g., start time and defer an ObserveDuration or call the same Done/Complete helper), and log success or error on completion so the operation is measured and appears consistent with the other phases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@server/cluster/cluster.go`:
- Line 400: Replace the inconsistent zap.Error usage in the log call that
reports "region labeler creation failed": locate the log.Error invocation that
passes zap.Error(err) along with zap.Duration("cost", labelerDuration) and
change the error encoder to use errs.ZapError(err) so the call matches the other
error logs in this function.
- Around line 405-409: The affinity manager creation (c.affinityManager =
affinity.NewManager(...)) lacks the `[leader-ready]` phase start/completion logs
and timing used for neighboring steps; wrap the call to affinity.NewManager (the
call creating c.affinityManager using c.ctx, c.storage, c.GetOpts(),
c.regionLabeler) with the same leader-ready instrumentation pattern used for the
labeler and slow store: emit a `[leader-ready]` start log for "affinity-manager"
before the call, record timing (e.g., start time and defer an ObserveDuration or
call the same Done/Complete helper), and log success or error on completion so
the operation is measured and appears consistent with the other phases.
Actionable comments posted: 1
♻️ Duplicate comments (1)
server/cluster/cluster.go (1)
Lines 330-333: ⚠️ Potential issue | 🟡 Minor — Keep `raftClusterStartDuration` scoped to successful startups.

This defer still runs on the already-running, not-bootstrapped, and error paths, so the histogram mixes no-ops and aborted attempts with a real Raft cluster start.

Also applies to: lines 338-340, 382-385
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/cluster/cluster.go` around lines 330 - 333, The current defer that calls raftClusterStartDuration.Observe(time.Since(start).Seconds()) runs for all code paths (including "already running", aborted, and error paths) and pollutes the histogram; remove the global defer and instead record the elapsed time only on successful startup/bootstrapped completion points (e.g., after the code paths that actually bootstrap/start the Raft cluster). Locate the start variable and the defer around raftClusterStartDuration in cluster.go (and the similar blocks at the other noted spots) and move the Observe call into the success branches (or add an explicit conditional that checks success) so only genuine, completed Raft starts update raftClusterStartDuration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/cluster/cluster.go`:
- Around line 434-441: The post-load "completed" logs are misleading because
loadExternalTS() and loadMinResolvedTS() currently swallow storage errors;
update the code so the caller can know success: modify loadExternalTS and
loadMinResolvedTS to return an error (or a bool success), propagate that result
back to the leader-ready sequence, and only emit "completed" when the call
succeeded; alternatively, if you prefer a quicker change, change the log
messages in the leader-ready block to non-success wording such as "attempted" or
"finished" (e.g., use "attempted to load external timestamp" / "attempted to
load min resolved ts") so they don't imply success when the functions failed.
Ensure you reference and adjust the caller in the leader-ready block and the
implementations of loadExternalTS and loadMinResolvedTS accordingly.
---
Duplicate comments:
In `@server/cluster/cluster.go`:
- Around line 330-333: The current defer that calls
raftClusterStartDuration.Observe(time.Since(start).Seconds()) runs for all code
paths (including "already running", aborted, and error paths) and pollutes the
histogram; remove the global defer and instead record the elapsed time only on
successful startup/bootstrapped completion points (e.g., after the code paths
that actually bootstrap/start the Raft cluster). Locate the start variable and
the defer around raftClusterStartDuration in cluster.go (and the similar blocks
at the other noted spots) and move the Observe call into the success branches
(or add an explicit conditional that checks success) so only genuine, completed
Raft starts update raftClusterStartDuration.
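The shape of the requested change — dropping the global defer and observing only after a genuine start completes — can be sketched with a stand-in for the Prometheus histogram. Only `raftClusterStartDuration.Observe` corresponds to the real code; the slice-backed `observe` and the `alreadyRunning` flag are illustrative:

```go
package main

import (
	"fmt"
	"time"
)

// observed stands in for the Prometheus histogram raftClusterStartDuration.
var observed []float64

func observe(seconds float64) { observed = append(observed, seconds) }

func bootstrap() error { return nil } // stand-in for the real Raft bootstrap

// startCluster records a duration only when a real start completes, so the
// "already running" no-op and error paths never pollute the histogram.
func startCluster(alreadyRunning bool) error {
	if alreadyRunning {
		return nil // nothing observed on the no-op path
	}
	start := time.Now()
	if err := bootstrap(); err != nil {
		return err // nothing observed on the failure path
	}
	observe(time.Since(start).Seconds()) // only genuine, completed starts
	return nil
}

func main() {
	_ = startCluster(true)
	_ = startCluster(false)
	fmt.Println(len(observed)) // 1
}
```

The alternative the comment mentions — keeping the defer but adding a low-cardinality `success` label — trades this explicit placement for one extra label on the metric.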
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f60d10bd-2354-4177-b3f8-974928fce9a4
📒 Files selected for processing (2)
- server/cluster/cluster.go
- server/server.go
🚧 Files skipped from review as they are similar to previous changes (1)
- server/server.go
Signed-off-by: bufferflies <1045931706@qq.com>
Actionable comments posted: 1
🧹 Nitpick comments (1)
server/api/scheduler.go (1)
Lines 131-153: Nice fix to return 400 here; consider applying the same to ScatterRangeScheduler input parsing.

This branch now correctly treats request-option parsing failures as client errors. In the same handler, lines 163, 168, and 173 still return 500 for comparable input-validation failures.

♻️ Suggested consistency patch

```diff
--- a/server/api/scheduler.go
+++ b/server/api/scheduler.go
@@
 	case types.ScatterRangeScheduler:
 		if err := apiutil.CollectEscapeStringOption("start_key", input, collector); err != nil {
-			h.r.JSON(w, http.StatusInternalServerError, err.Error())
+			h.r.JSON(w, http.StatusBadRequest, err.Error())
 			return
 		}
@@
 		if err := apiutil.CollectEscapeStringOption("end_key", input, collector); err != nil {
-			h.r.JSON(w, http.StatusInternalServerError, err.Error())
+			h.r.JSON(w, http.StatusBadRequest, err.Error())
 			return
 		}
@@
 		if err := apiutil.CollectStringOption("range_name", input, collector); err != nil {
-			h.r.JSON(w, http.StatusInternalServerError, err.Error())
+			h.r.JSON(w, http.StatusBadRequest, err.Error())
 			return
 		}
```

As per coding guidelines: "HTTP handlers must validate payloads and return proper status codes; avoid panics".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/api/scheduler.go` around lines 131 - 153, The ScatterRangeScheduler input parsing currently treats apiutil.CollectStringOption failures as server errors; change those error responses to client errors by returning HTTP 400 with the validation error message instead of 500. Locate the ScatterRangeScheduler handler where apiutil.CollectStringOption is called (the same pattern as the earlier "rule"/"engine"/"timeout"/"alias" block) and replace the h.r.JSON(w, http.StatusInternalServerError, err.Error()) responses with h.r.JSON(w, http.StatusBadRequest, err.Error()) so input-validation failures consistently return 400; keep the existing collector behavior for default values unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/utils/apiutil/apiutil.go`:
- Line 346: The error message is being preformatted before passing to
ErrOptionTypeNotMatch, producing awkward output; instead pass the raw option
identifier into ErrOptionTypeNotMatch so the error template can render the full
sentence. Update the FastGenByArgs call in apiutil.go (the
ErrOptionTypeNotMatch.FastGenByArgs invocation) to pass option (the raw option
name) as the argument rather than fmt.Sprintf("%s must be string", option), so
the template controls the final wording.
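The point of this fix — letting the error template own the wording while callers pass only the raw option identifier — can be illustrated with a plain fmt-based stand-in. The real `ErrOptionTypeNotMatch` lives in `pkg/errs` and uses PD's error framework; the template string below is an assumption for illustration only:

```go
package main

import "fmt"

// errOptionTypeNotMatchTmpl stands in for ErrOptionTypeNotMatch: the template
// controls the sentence, so callers pass only the raw option identifier.
const errOptionTypeNotMatchTmpl = "the option %s must be a string"

func optionTypeError(option string) error {
	// Pass the bare option name, not a preformatted "%s must be string" string,
	// which would otherwise render doubly-worded output through the template.
	return fmt.Errorf(errOptionTypeNotMatchTmpl, option)
}

func main() {
	fmt.Println(optionTypeError("rule")) // the option rule must be a string
}
```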
---
Nitpick comments:
In `@server/api/scheduler.go`:
- Around line 131-153: The ScatterRangeScheduler input parsing currently treats
apiutil.CollectStringOption failures as server errors; change those error
responses to client errors by returning HTTP 400 with the validation error
message instead of 500. Locate the ScatterRangeScheduler handler where
apiutil.CollectStringOption is called (the same pattern as the earlier
"rule"/"engine"/"timeout"/"alias" block) and replace the h.r.JSON(w,
http.StatusInternalServerError, err.Error()) responses with h.r.JSON(w,
http.StatusBadRequest, err.Error()) so input-validation failures consistently
return 400; keep the existing collector behavior for default values unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f3053176-d46c-4d50-bdd6-9162f2c7e479
📒 Files selected for processing (4)
- errors.toml
- pkg/errs/errno.go
- pkg/utils/apiutil/apiutil.go
- server/api/scheduler.go
✅ Files skipped from review due to trivial changes (1)
- pkg/errs/errno.go
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #10523      +/-   ##
==========================================
+ Coverage   78.88%   78.94%   +0.06%
==========================================
  Files         530      532       +2
  Lines       71548    71965     +417
==========================================
+ Hits        56439    56814     +375
- Misses      11092    11121      +29
- Partials     4017     4030      +13
```

Flags with carried forward coverage won't be shown.
Re-reviewed the latest follow-up. The two requested adjustments look good to me.

I do not have a new finding on this delta.
Re-reviewed the latest follow-up. This delta looks clean to me.

I do not have a new finding on this delta.
@ystaticy: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
raft-cluster-start logs
Re-reviewed the latest follow-up. This delta looks clean to me.

I do not have a new finding on this delta.
This reverts commit a3d89ab. Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Force-pushed 2d932f0 to 54003b3 (compare)
raft-cluster-start logs
/retest
Force-pushed 8346b46 to 71defe7 (compare)
Signed-off-by: tongjian <1045931706@qq.com>
Force-pushed 71defe7 to 6ccc82c (compare)
/retest
Signed-off-by: tongjian <1045931706@qq.com>
Force-pushed d90a231 to 112cda4 (compare)
[APPROVALNOTIFIER] This PR is APPROVED.

This pull-request has been approved by: lhy1024, rleungx, ystaticy. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/retest
What problem does this PR solve?
Issue Number: ref #10516
author: @ystaticy
cp a7c733be

What is changed and how does it work?
Check List
Tests
Code changes
Release note
Summary by CodeRabbit