server, schedule: Add observability for PD API leader-ready and cluster startup paths. #10523

Merged
ti-chi-bot[bot] merged 10 commits into tikv:master from bufferflies:pr-merge/a7c733be-pd-api-metrics-logs
Apr 10, 2026

Conversation

@bufferflies
Contributor

@bufferflies bufferflies commented Mar 31, 2026

What problem does this PR solve?

Issue Number: ref #10516

author: @ystaticy

cp a7c733be

What is changed and how does it work?

Add observability for PD API leader-ready and cluster startup paths.

- add startup duration metrics for raft cluster and region labeler creation
- add leader-ready phase logs in server and cluster startup flow
- keep current master keyspace-group, runner, and PromoteSelf logic while porting the observability changes

Check List

Tests

  • Unit test

Code changes

  • No code

Release note

Improved observability for PD leader-ready and cluster startup by adding metrics and detailed logs.

Summary by CodeRabbit

  • Monitoring
    • Added performance timing metrics for cluster startup and leader readiness processes
    • Improved observability with detailed phase-level logging during system initialization

* add metrics and log

Signed-off-by: ystaticy <y_static_y@sina.com>

* add metrics and log

Signed-off-by: ystaticy <y_static_y@sina.com>

* add metrics and log

Signed-off-by: ystaticy <y_static_y@sina.com>

* add leader ready log

Signed-off-by: ystaticy <y_static_y@sina.com>

---------

Signed-off-by: ystaticy <y_static_y@sina.com>
(cherry picked from commit a7c733b)
@ti-chi-bot ti-chi-bot bot added the release-note (Denotes a PR that will be considered when it comes time to generate release notes.) and dco-signoff: yes (Indicates the PR's author has signed the dco.) labels Mar 31, 2026
@bufferflies
Contributor Author

Validation:

  • go test ./pkg/schedule/labeler -count=1
  • go test ./server/cluster ./server -run TestDoesNotExist -count=1

@coderabbitai full review
@PR-Review

@ti-chi-bot ti-chi-bot bot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) label Mar 31, 2026
@coderabbitai

coderabbitai bot commented Mar 31, 2026

@bufferflies Sure, I'll kick off a full review of the PR now!


🧠 Learnings used
Learnt from: CR
Repo: tikv/pd PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-03-24T12:03:59.274Z
Learning: Applies to tests/server/join/join_test.go : Leakcheck excludes `tests/server/join/join_test.go`
✅ Actions performed

Full review triggered.

@coderabbitai

coderabbitai bot commented Mar 31, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Added latency instrumentation and structured logging across region labeler creation, cluster startup, and leader campaign readiness phases. Metrics track operation durations using Prometheus histograms, and phase-level timing logs enable observability into startup and readiness sequences.

Changes

  • Labeler Instrumentation (pkg/schedule/labeler/labeler.go, pkg/schedule/labeler/metrics.go): Added latency instrumentation to NewRegionLabeler with a new histogram metric newRegionLabelerDuration measuring creation time in seconds. Included an info log on successful rule loading with the rule count. Refactored the metrics declaration into a grouped var (...) block and registered the new metric.
  • Cluster Startup Instrumentation (server/cluster/cluster.go, server/cluster/metrics.go): Added comprehensive phase-level timing to (*RaftCluster).Start with a deferred duration observation labeled by bootstrap status. Instrumented phases: InitCluster, checkTSOService, LoadClusterInfo, placement rules, region labeler, slow-store status, replication mode manager, external TS, minimum resolved TS, keyspace-group bootstrap, scheduling service, background jobs, and runner startup. Added a corresponding raftClusterStartDuration histogram vector metric with exponential buckets.
  • Leader Campaign Readiness Instrumentation (server/server.go): Added phase-level timing instrumentation to the (*Server).campaignLeader() readiness flow with cost tracking for eight sub-steps (lease keep, config reload, TTL options, encryption init, callbacks, cluster creation, ID allocator, version check). Changed error-level logs to warnings with phase-specific elapsed durations. Updated the final readiness completion log to include the total cost from election success through readiness completion.
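
The walkthrough notes that raftClusterStartDuration uses exponential histogram buckets. The sketch below shows how such buckets partition startup durations; the start, factor, and count values are illustrative assumptions rather than the PR's actual configuration, and the plain slice stands in for prometheus.ExponentialBuckets:

```go
package main

import (
	"fmt"
	"math"
)

// expBuckets mirrors the shape of prometheus.ExponentialBuckets(start, factor,
// count): count upper bounds, each `factor` times the previous one. The
// concrete values used below are assumptions for illustration.
func expBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start * math.Pow(factor, float64(i))
	}
	return buckets
}

// observe returns the index of the first bucket whose upper bound holds v,
// i.e. which histogram bucket a given startup duration would land in.
// Returning len(buckets) models the implicit +Inf overflow bucket.
func observe(buckets []float64, v float64) int {
	for i, ub := range buckets {
		if v <= ub {
			return i
		}
	}
	return len(buckets)
}

func main() {
	b := expBuckets(0.1, 2, 10) // upper bounds 0.1s .. 51.2s
	fmt.Println(observe(b, 0.05), observe(b, 1.5), observe(b, 120)) // prints: 0 4 10
}
```

Exponential buckets keep resolution high for fast starts while still capturing multi-second outliers without an excessive bucket count.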

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Suggested reviewers

  • okJiang
  • JmPotato

Poem

🐰 Metrics bloom in the morning light,
Phase timings measured, precise and bright,
From labeler to leader we trace each step,
Observability woven, no stone unswept!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description check (✅ Passed): the PR description provides an issue reference, a summary of changes with the commit message, checklist completion, and a release note; the structure aligns with the template.
  • Title check (✅ Passed): the title "server, schedule: Add observability for PD API leader-ready and cluster startup paths" directly matches the main changes across multiple files: adding metrics and logs for the leader-ready and cluster startup paths.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
server/server.go (1)

1828-1849: ⚠️ Potential issue | 🟡 Minor

Emit the ready-to-serve log at the actual readiness boundary.

Line 1828 says PromoteSelf() starts accepting the remaining APIs, but "PD leader is ready to serve" and total-cost are logged only after CheckPDVersionWithClusterVersion(). That overstates leader-ready latency and makes the final message later than the real serve point. Move the ready log just after PromoteSelf(), or rename the last message to something like leader initialization completed.

⏱️ Proposed fix
 	log.Info("[leader-ready] start to promote leader")
 	enableLeaderStart := time.Now()
 	s.member.PromoteSelf()
 	enableLeaderDuration := time.Since(enableLeaderStart)
 	member.ServiceMemberGauge.WithLabelValues(PD).Set(1)
 	log.Info("[leader-ready] promote leader completed", zap.Duration("cost", enableLeaderDuration))
 	defer resetLeaderOnce.Do(func() {
 		// as soon as cancel the leadership keepalive, then other member have chance
 		// to be new leader.
 		cancel()
 		s.member.Resign()
 		member.ServiceMemberGauge.WithLabelValues(PD).Set(0)
 	})
+	totalDuration := time.Since(leaderReadyStart)
+	log.Info("[leader-ready] PD leader is ready to serve", zap.String("leader-name", s.Name()), zap.Duration("total-cost", totalDuration))
 
 	log.Info("[leader-ready] start to check PD version with cluster version")
 	versionCheckStart := time.Now()
 	CheckPDVersionWithClusterVersion(s.persistOptions)
 	versionCheckDuration := time.Since(versionCheckStart)
 	log.Info("[leader-ready] check PD version with cluster version completed", zap.Duration("cost", versionCheckDuration))
-	totalDuration := time.Since(leaderReadyStart)
-	log.Info("[leader-ready] PD leader is ready to serve", zap.String("leader-name", s.Name()), zap.Duration("total-cost", totalDuration))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/server.go` around lines 1828 - 1849, The ready-to-serve log is emitted
after CheckPDVersionWithClusterVersion(), overstating readiness latency; move
the "PD leader is ready to serve" log (and the total-cost calculation from
leaderReadyStart) to immediately after s.member.PromoteSelf() (i.e., right after
PromoteSelf() and the member.ServiceMemberGauge.WithLabelValues(PD).Set(1) call)
so the message reflects the actual serve boundary, and then log the version
check completion separately (or alternatively rename the final log to "leader
initialization completed" if you prefer keeping it after
CheckPDVersionWithClusterVersion()); adjust uses of
leaderReadyStart/totalDuration accordingly and keep the defer block
(resetLeaderOnce/Resign/cancel) unchanged.
server/cluster/cluster.go (1)

375-393: ⚠️ Potential issue | 🟡 Minor

Include placement-rule initialization in this phase's timing/error logs.

loadClusterInfoDuration is captured before c.ruleManager.Initialize(), and an error from Lines 388-390 returns without any [leader-ready] log. On placement-rules-enabled clusters, the success path underreports cost and the failure path disappears from the trace. Move the duration calculation below the initialize call, or give placement-rule initialization its own timed step.

🧭 Proposed fix
 	log.Info("[leader-ready] start to load cluster info")
 	loadClusterInfoStart := time.Now()
 	cluster, err := c.LoadClusterInfo()
 	if err != nil {
 		log.Error("[leader-ready] failed to load cluster info", errs.ZapError(err), zap.Duration("cost", time.Since(loadClusterInfoStart)))
 		return err
 	}
-	loadClusterInfoDuration := time.Since(loadClusterInfoStart)
 	if cluster == nil {
+		loadClusterInfoDuration := time.Since(loadClusterInfoStart)
 		log.Warn("[leader-ready] cluster is not bootstrapped", zap.Duration("cost", loadClusterInfoDuration))
 		return nil
 	}
 	if c.opt.IsPlacementRulesEnabled() {
-		err := c.ruleManager.Initialize(c.opt.GetMaxReplicas(), c.opt.GetLocationLabels(), c.opt.GetIsolationLevel(), false)
-		if err != nil {
+		ruleInitStart := time.Now()
+		err := c.ruleManager.Initialize(c.opt.GetMaxReplicas(), c.opt.GetLocationLabels(), c.opt.GetIsolationLevel(), false)
+		if err != nil {
+			log.Error("[leader-ready] failed to initialize placement rules", errs.ZapError(err), zap.Duration("cost", time.Since(ruleInitStart)))
 			return err
 		}
+		log.Info("[leader-ready] initialize placement rules completed", zap.Duration("cost", time.Since(ruleInitStart)))
 	}
+	loadClusterInfoDuration := time.Since(loadClusterInfoStart)
 	log.Info("[leader-ready] load cluster info completed", zap.Duration("cost", loadClusterInfoDuration))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/cluster/cluster.go` around lines 375 - 393, The timing and error
logging for leader-ready currently measures loadClusterInfoDuration before
calling c.ruleManager.Initialize, so placement-rule initialization errors are
returned without any [leader-ready] log and success cost underreports; update
the block around LoadClusterInfo and c.ruleManager.Initialize to either (a) move
loadClusterInfoDuration := time.Since(loadClusterInfoStart) until after the
Initialize call so the final log.Info("[leader-ready] load cluster info
completed", ...) includes the initialize time, or (b) add a separate timed step
for placement-rule initialization that records its own start, computes its
duration, logs success with zap.Duration and logs errors with errs.ZapError and
zap.Duration before returning; reference c.LoadClusterInfo,
c.ruleManager.Initialize, c.opt.IsPlacementRulesEnabled, loadClusterInfoStart
and loadClusterInfoDuration when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/cluster/cluster.go`:
- Around line 330-333: The current deferred observation (start := time.Now();
defer { raftClusterStartDuration.Observe(...) }) records durations for early
returns and "already running" paths; change it so only successful Raft startups
are recorded: remove the global defer around start, keep start := time.Now()
only where needed, and call
raftClusterStartDuration.Observe(time.Since(start).Seconds()) explicitly after
the code path that completes a real cluster start (e.g., after the Raft
bootstrap/Start call returns success) inside the success branch of the function
(or alternatively add a low-cardinality label like success="true"/"false" and
set it only on completion). Apply the same adjustment for the other similar
blocks around the start/ defer usage (the blocks at the other occurrences).

---

Outside diff comments:
In `@server/cluster/cluster.go`:
- Around line 375-393: The timing and error logging for leader-ready currently
measures loadClusterInfoDuration before calling c.ruleManager.Initialize, so
placement-rule initialization errors are returned without any [leader-ready] log
and success cost underreports; update the block around LoadClusterInfo and
c.ruleManager.Initialize to either (a) move loadClusterInfoDuration :=
time.Since(loadClusterInfoStart) until after the Initialize call so the final
log.Info("[leader-ready] load cluster info completed", ...) includes the
initialize time, or (b) add a separate timed step for placement-rule
initialization that records its own start, computes its duration, logs success
with zap.Duration and logs errors with errs.ZapError and zap.Duration before
returning; reference c.LoadClusterInfo, c.ruleManager.Initialize,
c.opt.IsPlacementRulesEnabled, loadClusterInfoStart and loadClusterInfoDuration
when making the change.

In `@server/server.go`:
- Around line 1828-1849: The ready-to-serve log is emitted after
CheckPDVersionWithClusterVersion(), overstating readiness latency; move the "PD
leader is ready to serve" log (and the total-cost calculation from
leaderReadyStart) to immediately after s.member.PromoteSelf() (i.e., right after
PromoteSelf() and the member.ServiceMemberGauge.WithLabelValues(PD).Set(1) call)
so the message reflects the actual serve boundary, and then log the version
check completion separately (or alternatively rename the final log to "leader
initialization completed" if you prefer keeping it after
CheckPDVersionWithClusterVersion()); adjust uses of
leaderReadyStart/totalDuration accordingly and keep the defer block
(resetLeaderOnce/Resign/cancel) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5707703-8d9a-4db0-8bea-dacee6d447ae

📥 Commits

Reviewing files that changed from the base of the PR and between 0a0dcfd and 6977c00.

📒 Files selected for processing (5)
  • pkg/schedule/labeler/labeler.go
  • pkg/schedule/labeler/metrics.go
  • server/cluster/cluster.go
  • server/cluster/metrics.go
  • server/server.go

@bufferflies
Contributor Author

/retest

Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
@bufferflies
Contributor Author

Re-reviewed the latest update on top of commit 457b82905a95903f00106d99791531f5e56446f8.

The two remaining findings are now addressed:

  • the PD leader is ready to serve log is emitted immediately after PromoteSelf() / ServiceMemberGauge.Set(1)
  • placement-rule initialization now has its own timed [leader-ready] success/error logs, and load cluster info completed is measured after that step

I do not have a new finding on this follow-up delta.


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
server/cluster/cluster.go (2)

400-400: Minor: Use errs.ZapError(err) for consistency.

All other error logs in this function use errs.ZapError(err) (lines 347, 379, 425, 446), but this line uses zap.Error(err).

🔧 Suggested fix
-		log.Error("[leader-ready] region labeler creation failed", zap.Error(err), zap.Duration("cost", labelerDuration))
+		log.Error("[leader-ready] region labeler creation failed", errs.ZapError(err), zap.Duration("cost", labelerDuration))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/cluster/cluster.go` at line 400, Replace the inconsistent zap.Error
usage in the log call that reports "region labeler creation failed": locate the
log.Error invocation that passes zap.Error(err) along with zap.Duration("cost",
labelerDuration) and change the error encoder to use errs.ZapError(err) so the
call matches the other error logs in this function.

405-409: Consider adding phase logging for affinity manager creation.

For consistency with the surrounding instrumentation (labeler at lines 395-403, slow store at lines 412-419), the affinity manager creation step lacks [leader-ready] start/completion logs and timing. If this operation is expected to be non-trivial, consider adding similar instrumentation.

🔧 Suggested addition for consistency
+	log.Info("[leader-ready] creating affinity manager")
+	affinityStart := time.Now()
 	// create affinity manager with region labeler for key range validation and rebuild
 	c.affinityManager, err = affinity.NewManager(c.ctx, c.storage, c, c.GetOpts(), c.regionLabeler)
+	affinityDuration := time.Since(affinityStart)
 	if err != nil {
+		log.Error("[leader-ready] affinity manager creation failed", errs.ZapError(err), zap.Duration("cost", affinityDuration))
 		return err
 	}
+	log.Info("[leader-ready] affinity manager created", zap.Duration("cost", affinityDuration))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/cluster/cluster.go` around lines 405 - 409, The affinity manager
creation (c.affinityManager = affinity.NewManager(...)) lacks the
`[leader-ready]` phase start/completion logs and timing used for neighboring
steps; wrap the call to affinity.NewManager (the call creating c.affinityManager
using c.ctx, c.storage, c.GetOpts(), c.regionLabeler) with the same leader-ready
instrumentation pattern used for the labeler and slow store: emit a
`[leader-ready]` start log for "affinity-manager" before the call, record timing
(e.g., start time and defer an ObserveDuration or call the same Done/Complete
helper), and log success or error on completion so the operation is measured and
appears consistent with the other phases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/cluster/cluster.go`:
- Line 400: Replace the inconsistent zap.Error usage in the log call that
reports "region labeler creation failed": locate the log.Error invocation that
passes zap.Error(err) along with zap.Duration("cost", labelerDuration) and
change the error encoder to use errs.ZapError(err) so the call matches the other
error logs in this function.
- Around line 405-409: The affinity manager creation (c.affinityManager =
affinity.NewManager(...)) lacks the `[leader-ready]` phase start/completion logs
and timing used for neighboring steps; wrap the call to affinity.NewManager (the
call creating c.affinityManager using c.ctx, c.storage, c.GetOpts(),
c.regionLabeler) with the same leader-ready instrumentation pattern used for the
labeler and slow store: emit a `[leader-ready]` start log for "affinity-manager"
before the call, record timing (e.g., start time and defer an ObserveDuration or
call the same Done/Complete helper), and log success or error on completion so
the operation is measured and appears consistent with the other phases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 61381446-51a7-4351-9193-33587638a943

📥 Commits

Reviewing files that changed from the base of the PR and between 6977c00 and 8430964.

📒 Files selected for processing (1)
  • server/cluster/cluster.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
server/cluster/cluster.go (1)

330-333: ⚠️ Potential issue | 🟡 Minor

Keep raftClusterStartDuration scoped to successful startups.

This defer still runs on the already running, not-bootstrapped, and error paths, so the histogram mixes no-ops / aborted attempts with a real Raft cluster start.

Also applies to: 338-340, 382-385
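
The concern here is that a function-wide defer records every exit path. A hedged stdlib sketch of the suggested shape, where the observation happens only on the path that completes a real start (the observed slice stands in for the Prometheus histogram, and the bool arguments/returns stand in for the real RaftCluster state; names are illustrative):

```go
package main

import (
	"fmt"
	"time"
)

// observed stands in for raftClusterStartDuration; in the real code this
// would be a histogram Observe call.
var observed []float64

// startCluster records elapsed time only on the branch that actually
// completes a cluster start, instead of via a defer that would also fire on
// the "already running" and error returns.
func startCluster(alreadyRunning bool) bool {
	if alreadyRunning {
		return false // no-op path: nothing recorded
	}
	start := time.Now()
	// ... real bootstrap work would happen here ...
	observed = append(observed, time.Since(start).Seconds()) // success path only
	return true
}

func main() {
	startCluster(true)  // not recorded
	startCluster(false) // recorded
	fmt.Println(len(observed)) // prints: 1
}
```

This keeps no-op and aborted attempts out of the histogram, so percentiles reflect only genuine startups.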

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/cluster/cluster.go` around lines 330 - 333, The current defer that
calls raftClusterStartDuration.Observe(time.Since(start).Seconds()) runs for all
code paths (including "already running", aborted, and error paths) and pollutes
the histogram; remove the global defer and instead record the elapsed time only
on successful startup/bootstrapped completion points (e.g., after the code paths
that actually bootstrap/start the Raft cluster). Locate the start variable and
the defer around raftClusterStartDuration in cluster.go (and the similar blocks
at the other noted spots) and move the Observe call into the success branches
(or add an explicit conditional that checks success) so only genuine, completed
Raft starts update raftClusterStartDuration.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server/cluster/cluster.go`:
- Around line 434-441: The post-load "completed" logs are misleading because
loadExternalTS() and loadMinResolvedTS() currently swallow storage errors;
update the code so the caller can know success: modify loadExternalTS and
loadMinResolvedTS to return an error (or a bool success), propagate that result
back to the leader-ready sequence, and only emit "completed" when the call
succeeded; alternatively, if you prefer a quicker change, change the log
messages in the leader-ready block to non-success wording such as "attempted" or
"finished" (e.g., use "attempted to load external timestamp" / "attempted to
load min resolved ts") so they don't imply success when the functions failed.
Ensure you reference and adjust the caller in the leader-ready block and the
implementations of loadExternalTS and loadMinResolvedTS accordingly.

---

Duplicate comments:
In `@server/cluster/cluster.go`:
- Around line 330-333: The current defer that calls
raftClusterStartDuration.Observe(time.Since(start).Seconds()) runs for all code
paths (including "already running", aborted, and error paths) and pollutes the
histogram; remove the global defer and instead record the elapsed time only on
successful startup/bootstrapped completion points (e.g., after the code paths
that actually bootstrap/start the Raft cluster). Locate the start variable and
the defer around raftClusterStartDuration in cluster.go (and the similar blocks
at the other noted spots) and move the Observe call into the success branches
(or add an explicit conditional that checks success) so only genuine, completed
Raft starts update raftClusterStartDuration.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f60d10bd-2354-4177-b3f8-974928fce9a4

📥 Commits

Reviewing files that changed from the base of the PR and between 8430964 and 457b829.

📒 Files selected for processing (2)
  • server/cluster/cluster.go
  • server/server.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/server.go

Signed-off-by: bufferflies <1045931706@qq.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
server/api/scheduler.go (1)

131-153: Nice fix to return 400 here; consider applying the same to ScatterRangeScheduler input parsing.

This branch now correctly treats request-option parsing failures as client errors. In the same handler, Line 163, Line 168, and Line 173 still return 500 for comparable input-validation failures.

♻️ Suggested consistency patch
--- a/server/api/scheduler.go
+++ b/server/api/scheduler.go
@@
 	case types.ScatterRangeScheduler:
 		if err := apiutil.CollectEscapeStringOption("start_key", input, collector); err != nil {
-			h.r.JSON(w, http.StatusInternalServerError, err.Error())
+			h.r.JSON(w, http.StatusBadRequest, err.Error())
 			return
 		}
@@
 		if err := apiutil.CollectEscapeStringOption("end_key", input, collector); err != nil {
-			h.r.JSON(w, http.StatusInternalServerError, err.Error())
+			h.r.JSON(w, http.StatusBadRequest, err.Error())
 			return
 		}
@@
 		if err := apiutil.CollectStringOption("range_name", input, collector); err != nil {
-			h.r.JSON(w, http.StatusInternalServerError, err.Error())
+			h.r.JSON(w, http.StatusBadRequest, err.Error())
 			return
 		}

As per coding guidelines "HTTP handlers must validate payloads and return proper status codes; avoid panics".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/api/scheduler.go` around lines 131 - 153, The ScatterRangeScheduler
input parsing currently treats apiutil.CollectStringOption failures as server
errors; change those error responses to client errors by returning HTTP 400 with
the validation error message instead of 500. Locate the ScatterRangeScheduler
handler where apiutil.CollectStringOption is called (the same pattern as the
earlier "rule"/"engine"/"timeout"/"alias" block) and replace the h.r.JSON(w,
http.StatusInternalServerError, err.Error()) responses with h.r.JSON(w,
http.StatusBadRequest, err.Error()) so input-validation failures consistently
return 400; keep the existing collector behavior for default values unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/utils/apiutil/apiutil.go`:
- Line 346: The error message is being preformatted before passing to
ErrOptionTypeNotMatch, producing awkward output; instead pass the raw option
identifier into ErrOptionTypeNotMatch so the error template can render the full
sentence. Update the FastGenByArgs call in apiutil.go (the
ErrOptionTypeNotMatch.FastGenByArgs invocation) to pass option (the raw option
name) as the argument rather than fmt.Sprintf("%s must be string", option), so
the template controls the final wording.

---

Nitpick comments:
In `@server/api/scheduler.go`:
- Around line 131-153: The ScatterRangeScheduler input parsing currently treats
apiutil.CollectStringOption failures as server errors; change those error
responses to client errors by returning HTTP 400 with the validation error
message instead of 500. Locate the ScatterRangeScheduler handler where
apiutil.CollectStringOption is called (the same pattern as the earlier
"rule"/"engine"/"timeout"/"alias" block) and replace the h.r.JSON(w,
http.StatusInternalServerError, err.Error()) responses with h.r.JSON(w,
http.StatusBadRequest, err.Error()) so input-validation failures consistently
return 400; keep the existing collector behavior for default values unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f3053176-d46c-4d50-bdd6-9162f2c7e479

📥 Commits

Reviewing files that changed from the base of the PR and between 457b829 and a3d89ab.

📒 Files selected for processing (4)
  • errors.toml
  • pkg/errs/errno.go
  • pkg/utils/apiutil/apiutil.go
  • server/api/scheduler.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/errs/errno.go

@codecov

codecov bot commented Apr 2, 2026

Codecov Report

❌ Patch coverage is 90.24390% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.94%. Comparing base (3eb99ae) to head (112cda4).
⚠️ Report is 18 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10523      +/-   ##
==========================================
+ Coverage   78.88%   78.94%   +0.06%     
==========================================
  Files         530      532       +2     
  Lines       71548    71965     +417     
==========================================
+ Hits        56439    56814     +375     
- Misses      11092    11121      +29     
- Partials     4017     4030      +13     
Flag Coverage Δ
unittests 78.94% <90.24%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.


@bufferflies
Contributor Author

Re-reviewed the latest follow-up on head 3e30a4ec54aef5a81613657324caadbf8048b351.

The two requested adjustments look good to me:

  • the newly added leader-ready failure logs are now downgraded from log.Error to log.Warn
  • raftClusterStartDuration now distinguishes bootstrap vs non-bootstrap starts while keeping the existing "record the total RaftCluster.Start() wall-clock duration" semantics

I do not have a new finding on this delta.
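
The bootstrap/non-bootstrap split described above corresponds to a histogram vector keyed by a low-cardinality label. A stdlib sketch of that idea (the map stands in for a prometheus HistogramVec, and the label values are illustrative):

```go
package main

import "fmt"

// startSeconds stands in for raftClusterStartDuration as a HistogramVec:
// one series of observations per bootstrap-label value.
var startSeconds = map[string][]float64{}

// recordStart observes a start duration under a "true"/"false" bootstrap
// label, so bootstrap and non-bootstrap starts stay distinguishable while the
// metric still covers the total RaftCluster.Start() wall-clock duration.
func recordStart(bootstrapped bool, seconds float64) {
	label := "false"
	if bootstrapped {
		label = "true"
	}
	startSeconds[label] = append(startSeconds[label], seconds)
}

func main() {
	recordStart(true, 2.5)
	recordStart(false, 0.3)
	fmt.Println(len(startSeconds["true"]), len(startSeconds["false"])) // prints: 1 1
}
```

Keeping the label set to two fixed values avoids the cardinality problems that free-form labels would introduce.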

@bufferflies bufferflies requested review from lhy1024 and rleungx April 2, 2026 08:48
@bufferflies
Contributor Author

Re-reviewed the latest follow-up on head 9b040a37042ec80e042de24bd3ba3d74e9d67ced.

This delta looks clean to me. The newly added trace logs in server/server.go and server/cluster/cluster.go are now consistently renamed from [leader-ready] to [raft-cluster-start], and I do not see a new behavior regression introduced by this rename-only follow-up.

I do not have a new finding on this delta.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) and approved labels Apr 3, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 3, 2026

@ystaticy: adding LGTM is restricted to approvers and reviewers in OWNERS files.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@bufferflies bufferflies changed the title from "server, schedule: add PD API metrics and leader-ready logs" to "server, schedule: add PD API metrics and raft-cluster-start logs" Apr 3, 2026
@bufferflies
Contributor Author

Re-reviewed the latest follow-up on head 2d932f0c9512076526a4a72c262e11f8b94a3045.

This delta looks clean to me. It removes the [raft-cluster-start] prefix from the previously added logs in server/server.go and server/cluster/cluster.go, without changing the log placement, severity, or staged timing behavior that was reviewed earlier.

I do not have a new finding on this delta.

This reverts commit a3d89ab.

Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
Signed-off-by: bufferflies <1045931706@qq.com>
@bufferflies bufferflies force-pushed the pr-merge/a7c733be-pd-api-metrics-logs branch from 2d932f0 to 54003b3 Compare April 7, 2026 08:57
@bufferflies bufferflies requested a review from rleungx April 7, 2026 11:28
@bufferflies bufferflies changed the title from "server, schedule: add PD API metrics and raft-cluster-start logs" to "server, schedule: Add observability for PD API leader-ready and cluster startup paths." Apr 7, 2026
@bufferflies
Contributor Author

/retest

@bufferflies bufferflies force-pushed the pr-merge/a7c733be-pd-api-metrics-logs branch from 8346b46 to 71defe7 Compare April 8, 2026 12:38
Signed-off-by: tongjian <1045931706@qq.com>
@bufferflies bufferflies force-pushed the pr-merge/a7c733be-pd-api-metrics-logs branch from 71defe7 to 6ccc82c Compare April 8, 2026 12:40
@bufferflies bufferflies requested a review from rleungx April 8, 2026 12:41
@bufferflies
Contributor Author

/retest

Signed-off-by: tongjian <1045931706@qq.com>
@bufferflies bufferflies force-pushed the pr-merge/a7c733be-pd-api-metrics-logs branch from d90a231 to 112cda4 Compare April 9, 2026 13:38
@ti-chi-bot ti-chi-bot bot added the lgtm label Apr 9, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lhy1024, rleungx, ystaticy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm (Indicates a PR needs 1 more LGTM.) label Apr 9, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 9, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-03 03:21:03.599949788 +0000 UTC m=+494468.805309845: ☑️ agreed by lhy1024.
  • 2026-04-09 14:19:35.341841486 +0000 UTC m=+1052380.547201543: ☑️ agreed by rleungx.

@bufferflies
Contributor Author

/retest

@ti-chi-bot ti-chi-bot bot merged commit b0a3c90 into tikv:master Apr 10, 2026
32 checks passed

Labels

  • approved
  • dco-signoff: yes (Indicates the PR's author has signed the dco.)
  • lgtm
  • release-note (Denotes a PR that will be considered when it comes time to generate release notes.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
