tests(tso): add dynamic switching integration tests #10478

Open
YuhaoZhang00 wants to merge 4 commits into tikv:master from YuhaoZhang00:tso/add-dynamic-switching-integration-tests

Conversation

@YuhaoZhang00
Contributor

@YuhaoZhang00 YuhaoZhang00 commented Mar 23, 2026

What problem does this PR solve?

Issue Number: ref #10329

What is changed and how does it work?

Add integration tests for TSO dynamic switching between PD and TSO microservice, covering forward transition, fallback when the microservice stops, and leader transfer resilience.

Also add a unit test for IsServiceIndependent.

Check List

Tests

  • Unit test
  • Integration test

Code changes

None

Side effects

None

Related changes

None

Release note

None.

Summary by CodeRabbit

  • Tests
    • Added unit tests validating service-independence behavior across configurations, dynamic switching states, and server shutdown.
    • Added integration tests exercising TSO dynamic switching: transitions between service modes, fallback when a service stops, and resilience across leader changes.
    • Introduced deterministic timing controls and a reusable helper to assert globally monotonic timestamps in integration scenarios.

Add integration tests for TSO dynamic switching between PD and TSO
microservice, covering forward transition, fallback when the
microservice stops, and leader transfer resilience. Also add a unit
test for IsServiceIndependent.

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
@ti-chi-bot ti-chi-bot bot added the do-not-merge/needs-triage-completed, release-note-none, and dco-signoff: yes labels Mar 23, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 23, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign huachaohuang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the contribution label Mar 23, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 23, 2026

Hi @YuhaoZhang00. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the needs-ok-to-test label Mar 23, 2026
@coderabbitai

coderabbitai bot commented Mar 23, 2026

📝 Walkthrough

Added new tests exercising TSO dynamic switching and service-independence: a unit test for Server.IsServiceIndependent and multiple integration tests validating dynamic TSO enable/disable and leader-transfer behaviors, plus a helper to assert monotonic TSO responses.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Server unit tests: server/server_test.go | New test helper newTestServer(...) and TestIsServiceIndependent covering IsServiceIndependent under combinations of keyspace-group and TSO dynamic-switching settings and server running state. |
| TSO integration tests: tests/integrations/tso/client_test.go | Added three integration tests (TestDynamicSwitchingPDToTSO, TestDynamicSwitchingTSOToPDFallback, TestDynamicSwitchingWithLeaderTransfer) and helper waitAndCheckTSOMonotonic(...); added use of failpoints and mcsconst.TSOServiceName checks to validate dynamic switching and monotonic GetTS behavior. |

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

lgtm

Suggested reviewers

  • lhy1024
  • okJiang

Poem

I'm a rabbit in a test-tree glade,
hopping through switches the devs have made.
TSO flips on, then off, then flows —
timestamps climb where the leader goes.
🐇✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding dynamic switching integration tests for TSO, which directly reflects the content of the changeset. |
| Description check | ✅ Passed | The description includes required sections: issue reference (ref #10329), clear explanation of changes (integration and unit tests for TSO dynamic switching), appropriate test checklist items, and release note (None). |

@ti-chi-bot ti-chi-bot bot added the size/L label Mar 23, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
server/server_test.go (1)

29-40: Build persistOptions from the target microservice config.

Line 59 flips s3.cfg.Microservice.EnableTSODynamicSwitching after newTestServer has already created persistOptions, but IsServiceIndependent reads that flag through s.GetMicroserviceConfig(). That makes this case depend on cfg/persistOptions aliasing instead of the behavior under test. It would be more robust to pass the flag into the helper and set it before config.NewPersistOptions(cfg) runs.

♻️ Suggested refactor
-func newTestServer(t *testing.T, keyspaceGroupEnabled bool) *Server {
+func newTestServer(t *testing.T, keyspaceGroupEnabled, tsoDynamicSwitchingEnabled bool) *Server {
 	t.Helper()

 	cfg := config.NewConfig()
+	cfg.Microservice.EnableTSODynamicSwitching = tsoDynamicSwitchingEnabled
 	s := &Server{
 		ctx:                    context.Background(),
 		cfg:                    cfg,
 		persistOptions:         config.NewPersistOptions(cfg),
 		isKeyspaceGroupEnabled: keyspaceGroupEnabled,
 	}
 	atomic.StoreInt64(&s.isRunning, 1)
 	return s
 }
@@
-	s := newTestServer(t, false)
+	s := newTestServer(t, false, false)
@@
-	s2 := newTestServer(t, true)
+	s2 := newTestServer(t, true, false)
@@
-	s3 := newTestServer(t, true)
-	s3.cfg.Microservice.EnableTSODynamicSwitching = true
+	s3 := newTestServer(t, true, true)

Also applies to: 58-60
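
The aliasing concern can be illustrated with a small standalone sketch. Everything in it (`microserviceConfig`, `config`, `aliasedOptions`, `snapshotOptions`) is a hypothetical stand-in, not PD's real `config.Config` or `PersistOptions`, and it makes no claim about whether the real `config.NewPersistOptions` aliases or copies. It only shows that a flag flipped after construction is observed when the derived options share memory with the config, and missed when they hold a copy.

```go
package main

import "fmt"

// microserviceConfig and config are hypothetical stand-ins for illustration;
// they are not PD's real configuration types.
type microserviceConfig struct {
	EnableTSODynamicSwitching bool
}

type config struct {
	Microservice microserviceConfig
}

// aliasedOptions keeps a pointer to the caller's config, so later mutations
// of that config are visible through it.
type aliasedOptions struct{ cfg *config }

func (o *aliasedOptions) dynamicSwitching() bool {
	return o.cfg.Microservice.EnableTSODynamicSwitching
}

// snapshotOptions copies the config at construction time, so later mutations
// of the caller's config are not visible through it.
type snapshotOptions struct{ cfg config }

func (o *snapshotOptions) dynamicSwitching() bool {
	return o.cfg.Microservice.EnableTSODynamicSwitching
}

func newAliased(cfg *config) *aliasedOptions {
	return &aliasedOptions{cfg: cfg}
}

func newSnapshot(cfg *config) *snapshotOptions {
	return &snapshotOptions{cfg: *cfg}
}

func main() {
	// Flag flipped after the options were built: the outcome depends on
	// whether the options alias the config.
	late := &config{}
	a, s := newAliased(late), newSnapshot(late)
	late.Microservice.EnableTSODynamicSwitching = true
	fmt.Println(a.dynamicSwitching(), s.dynamicSwitching())

	// Flag set before the options were built: both variants agree.
	early := &config{}
	early.Microservice.EnableTSODynamicSwitching = true
	fmt.Println(newAliased(early).dynamicSwitching(), newSnapshot(early).dynamicSwitching())
}
```

Setting the flag before the options are constructed gives the same answer under either implementation, which is why passing it into the helper makes the test independent of that detail.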

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/server_test.go` around lines 29 - 40, newTestServer currently builds
persistOptions before the microservice flag EnableTSODynamicSwitching is set,
causing tests to rely on cfg/persistOptions aliasing; change newTestServer (and
its callers) to accept an enableTSODynamicSwitching bool, set
cfg.Microservice.EnableTSODynamicSwitching = enableTSODynamicSwitching on the
cfg returned by config.NewConfig() before calling config.NewPersistOptions(cfg),
so persistOptions is constructed from the intended microservice config;
reference newTestServer, Server, config.NewConfig, config.NewPersistOptions,
GetMicroserviceConfig, and IsServiceIndependent when making this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/tso/client_test.go`:
- Around line 849-866: The loop doesn't assert leadership actually changed and
only checks TSO availability once; update the loop in the test to capture the
previous leader via pdCluster.WaitLeader(), call ResignLeaderWithRetry(), then
call pdCluster.WaitLeader() again and assert the new leader != previous leader
(use leaderName variable and compare), and invoke per-transfer availability
checks (either call utils.WaitForTSOServiceAvailable(ctx, re, pdClient) or the
existing checkTSO(...) around each transfer) before continuing; also keep the
ServiceIndependent assertion using
GetServer(...).GetServer().IsServiceIndependent(mcsconst.TSOServiceName) as you
already do.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fc5fc0eb-a84f-44d8-aa75-735ed49992f5

📥 Commits

Reviewing files that changed from the base of the PR and between d7b6380 and 1437b3e.

📒 Files selected for processing (2)
  • server/server_test.go
  • tests/integrations/tso/client_test.go

@okJiang
Member

okJiang commented Mar 25, 2026

/ok-to-test

@ti-chi-bot ti-chi-bot bot added the ok-to-test label and removed the needs-ok-to-test label Mar 25, 2026
@YuhaoZhang00
Contributor Author

/retest

Move WaitForTSOServiceAvailable inside the leader transfer loop so that
TSO availability is verified per-iteration rather than only once after
all transfers complete.

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
tests/integrations/tso/client_test.go (1)

850-865: ⚠️ Potential issue | 🟡 Minor

Capture old leader and assert leadership actually changed.

The loop overwrites leaderName without first capturing it, so there's no assertion that leadership actually transferred to a different node. The same PD member could regain leadership, making the test less meaningful.

🔧 Suggested fix
 	for range 2 {
-		leaderName = pdCluster.WaitLeader()
-		re.NotEmpty(leaderName)
-		err = pdCluster.GetServer(leaderName).ResignLeaderWithRetry()
+		oldLeaderName := pdCluster.WaitLeader()
+		re.NotEmpty(oldLeaderName)
+		err = pdCluster.GetServer(oldLeaderName).ResignLeaderWithRetry()
 		re.NoError(err)
 		leaderName = pdCluster.WaitLeader()
 		re.NotEmpty(leaderName)
+		re.NotEqual(oldLeaderName, leaderName)

 		// ServiceIndependent must remain set after leader transfer.
 		newLeader := pdCluster.GetServer(leaderName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/tso/client_test.go` around lines 850 - 865, Before calling
ResignLeaderWithRetry(), save the current leader into a distinct variable (e.g.,
oldLeader := leaderName) and after pdCluster.WaitLeader() completes, assert the
returned leaderName is different from oldLeader to ensure leadership actually
changed; then use pdCluster.GetServer(leaderName) (the newLeader) for the
existing ServiceIndependent assertion (GetServer().IsServiceIndependent) and for
utils.WaitForTSOServiceAvailable checks. Adjust references to
leaderName/oldLeader so the resign call uses oldLeader and the subsequent
assertions use the newLeader to verify the transfer.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 79259e1b-cf56-4658-aca2-8c63bc6ae97b

📥 Commits

Reviewing files that changed from the base of the PR and between 1437b3e and ed20c9e.

📒 Files selected for processing (1)
  • tests/integrations/tso/client_test.go

@YuhaoZhang00
Contributor Author

/retest

@codecov

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.93%. Comparing base (033398b) to head (ed20c9e).
⚠️ Report is 27 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10478      +/-   ##
==========================================
+ Coverage   78.86%   78.93%   +0.06%     
==========================================
  Files         529      530       +1     
  Lines       71102    71548     +446     
==========================================
+ Hits        56072    56473     +401     
- Misses      11014    11044      +30     
- Partials     4016     4031      +15     
Flag Coverage Δ
unittests 78.93% <ø> (+0.06%) ⬆️

@YuhaoZhang00
Contributor Author

/retest

@YuhaoZhang00
Contributor Author

/test pull-unit-test-next-gen-3

Member

@JmPotato JmPotato left a comment

Overall the test structure is good and covers the key dynamic switching scenarios (forward, fallback, leader transfer). However, there are two main issues:

  1. The usePDServiceMode failpoint bypasses client-side service-mode discovery, so these tests only validate server-side behavior. The scope should be documented.
  2. TSO monotonicity is not checked across switches — WaitForTSOServiceAvailable only proves eventual availability, not correctness. checkTSOMonotonic() should be used instead.

The TestIsServiceIndependent unit test is well-structured and covers the state matrix thoroughly.

}

// TestDynamicSwitchingPDToTSO tests that when dynamic switching is enabled and a TSO
// microservice starts, PD stops serving TSO locally, sets ServiceIndependent,
Member

All three new integration tests (TestDynamicSwitchingPDToTSO, TestDynamicSwitchingTSOToPDFallback, TestDynamicSwitchingWithLeaderTransfer) enable the usePDServiceMode failpoint, which pins the client to PD_SVC_MODE by short-circuiting updateServiceModeLoop(). This means the tests do not exercise the client-side service-mode discovery path at all.

If the intent is to test server-side dynamic switching behavior only, please add a comment clarifying this scope limitation and noting that client-side service-mode discovery is covered by TestTSOServiceSwitch in tests/integrations/mcs/tso/server_test.go.

Otherwise, if these tests are meant to be end-to-end dynamic switching tests, the failpoint should be removed.

// Start TSO microservice.
tsoCluster, err := tests.NewTestTSOCluster(ctx, 1, backendEndpoints)
re.NoError(err)
defer tsoCluster.Destroy()
Member

This test (and the two below) only uses WaitForTSOServiceAvailable(), which is essentially an Eventually check that client.GetTS() returns a nil error. It proves that TSO becomes available at some point, but does not verify that timestamps remain monotonically increasing across the switch.

The repo already has a stronger helper checkTSOMonotonic() (used in TestTSOServiceSwitch). Consider tracking a globalLastTS across the switch boundary and asserting monotonicity:

var globalLastTS uint64
re.NoError(checkTSOMonotonic(ctx, pdClient, &globalLastTS, 10)) // before switch
// ... start TSO microservice ...
re.NoError(checkTSOMonotonic(ctx, pdClient, &globalLastTS, 10)) // after switch

Without this, a timestamp regression during the switch would not be caught.
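
To make the point concrete, here is a minimal standalone sketch of carrying a last-seen timestamp across a switch. The names `tsProvider`, `counterProvider`, and `checkMonotonic` are hypothetical and exist only for this illustration; this is not the repo's `checkTSOMonotonic`, and plain `uint64` counters stand in for real TSO values.

```go
package main

import "fmt"

// tsProvider is a hypothetical stand-in for a client that returns timestamps.
type tsProvider interface {
	getTS() (uint64, error)
}

// counterProvider hands out strictly increasing timestamps above `next`.
type counterProvider struct{ next uint64 }

func (p *counterProvider) getTS() (uint64, error) {
	p.next++
	return p.next, nil
}

// checkMonotonic draws n timestamps and fails if any is not strictly greater
// than *lastTS. *lastTS is updated in place so the caller can carry it across
// a switch between providers.
func checkMonotonic(p tsProvider, lastTS *uint64, n int) error {
	for i := 0; i < n; i++ {
		ts, err := p.getTS()
		if err != nil {
			return err
		}
		if ts <= *lastTS {
			return fmt.Errorf("timestamp regression: got %d after %d", ts, *lastTS)
		}
		*lastTS = ts
	}
	return nil
}

func main() {
	var globalLastTS uint64

	// Before the switch: timestamps 101..110.
	before := &counterProvider{next: 100}
	fmt.Println(checkMonotonic(before, &globalLastTS, 10))

	// A well-behaved successor continues above the last issued timestamp.
	good := &counterProvider{next: globalLastTS}
	fmt.Println(checkMonotonic(good, &globalLastTS, 10))

	// A faulty successor restarts from a lower value. Because lastTS is
	// shared across the switch, the regression is reported.
	bad := &counterProvider{next: 50}
	fmt.Println(checkMonotonic(bad, &globalLastTS, 10))
}
```

An availability-only check would accept the faulty successor in the last case, since each of its calls succeeds; only the shared last-seen value exposes the regression.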


// PD should start serving TSO.
utils.WaitForTSOServiceAvailable(ctx, re, pdClient)

Member

nit: The loop doesn't assert that leadership actually changed after ResignLeaderWithRetry(). Consider capturing oldLeaderName before resign and asserting re.NotEqual(oldLeaderName, leaderName) to ensure the test is truly exercising a leader transfer scenario.

oldLeaderName := pdCluster.WaitLeader()
err = pdCluster.GetServer(oldLeaderName).ResignLeaderWithRetry()
re.NoError(err)
leaderName = pdCluster.WaitLeader()
re.NotEqual(oldLeaderName, leaderName)
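
As a rough illustration of why the inequality check matters, the sketch below uses a hypothetical `fakeCluster` (not PD's test cluster API) in which the resigned member can win the next election. A check that only requires a non-empty leader name would pass in that case; comparing against the captured old leader does not.

```go
package main

import "fmt"

// fakeCluster is a hypothetical stand-in for a PD test cluster; it only
// tracks member names and which one currently leads.
type fakeCluster struct {
	members []string
	leader  int
	// sticky simulates the resigned member winning the next election.
	sticky bool
}

func (c *fakeCluster) waitLeader() string {
	return c.members[c.leader]
}

// resignLeader triggers an election; unless sticky is set, the next member wins.
func (c *fakeCluster) resignLeader() {
	if !c.sticky {
		c.leader = (c.leader + 1) % len(c.members)
	}
}

// transferAndVerify resigns the current leader and returns an error unless a
// different member holds leadership afterwards.
func transferAndVerify(c *fakeCluster) (string, error) {
	oldLeader := c.waitLeader()
	c.resignLeader()
	newLeader := c.waitLeader()
	if newLeader == "" {
		return "", fmt.Errorf("no leader elected")
	}
	if newLeader == oldLeader {
		return "", fmt.Errorf("leader did not change: still %q", oldLeader)
	}
	return newLeader, nil
}

func main() {
	c := &fakeCluster{members: []string{"pd-1", "pd-2", "pd-3"}}
	leader, err := transferAndVerify(c)
	fmt.Println(leader, err)

	// The same member is re-elected: a non-empty check alone would pass here.
	c.sticky = true
	leader, err = transferAndVerify(c)
	fmt.Println(leader, err)
}
```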

tsoCluster.Destroy()

// PD should resume serving TSO locally.
testutil.Eventually(re, func() bool {
Member

nit: Missing blank line between TestDynamicSwitchingTSOToPDFallback and TestDynamicSwitchingWithLeaderTransfer.

@ti-chi-bot
Contributor

ti-chi-bot bot commented Apr 8, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-08 04:08:35 UTC: ✖️🔁 reset by JmPotato.

- Add scope comments noting usePDServiceMode failpoint limits tests to
  server-side behavior; client-side discovery is covered elsewhere.
- Replace WaitForTSOServiceAvailable + separate monotonicity check with
  unified waitAndCheckTSOMonotonic that validates every successful GetTS
  (including the first post-switchover sample) against globalLastTS.
- Refactor newTestServer to accept tsoDynamicSwitchingEnabled param so
  persistOptions is built from the intended config, not pointer aliasing.
- Add blank line between TestDynamicSwitchingTSOToPDFallback and
  TestDynamicSwitchingWithLeaderTransfer.

Signed-off-by: Yuhao Zhang <yhzhang00@outlook.com>
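
The combined wait-and-check idea from the commit message above can be sketched roughly as follows. This is not the PR's `waitAndCheckTSOMonotonic`; the names `flakyProvider` and `waitAndCheckMonotonic`, the parameters, and the retry policy are all assumptions made for illustration. Errors are retried until a deadline, and every successful sample, including the first one after the outage, is compared against the shared last-seen value.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// flakyProvider is a hypothetical client that fails a fixed number of times
// (as it might during a switchover) before returning increasing timestamps.
type flakyProvider struct {
	failures int
	next     uint64
}

func (p *flakyProvider) getTS() (uint64, error) {
	if p.failures > 0 {
		p.failures--
		return 0, errors.New("tso service unavailable")
	}
	p.next++
	return p.next, nil
}

// waitAndCheckMonotonic collects `samples` successful timestamps. Failed calls
// are retried every `tick` until `timeout` has elapsed. Each successful
// timestamp must be strictly greater than *lastTS, which is updated in place.
func waitAndCheckMonotonic(p *flakyProvider, lastTS *uint64, samples int, timeout, tick time.Duration) error {
	deadline := time.Now().Add(timeout)
	for got := 0; got < samples; {
		ts, err := p.getTS()
		if err != nil {
			if time.Now().After(deadline) {
				return fmt.Errorf("timed out waiting for timestamps: %w", err)
			}
			time.Sleep(tick)
			continue
		}
		if ts <= *lastTS {
			return fmt.Errorf("timestamp regression: got %d after %d", ts, *lastTS)
		}
		*lastTS = ts
		got++
	}
	return nil
}

func main() {
	lastTS := uint64(200)

	// Three failed calls, then timestamps 201..205.
	ok := &flakyProvider{failures: 3, next: 200}
	err := waitAndCheckMonotonic(ok, &lastTS, 5, time.Second, time.Millisecond)
	fmt.Println(err, lastTS)

	// The first timestamp after the outage is below lastTS, so it is rejected
	// instead of being accepted as proof of availability.
	regressed := &flakyProvider{failures: 3, next: 10}
	err = waitAndCheckMonotonic(regressed, &lastTS, 5, time.Second, time.Millisecond)
	fmt.Println(err)
}
```

Folding the wait and the check into one loop avoids a gap where the first post-switch timestamp is consumed by the availability probe and never validated.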
@ti-chi-bot ti-chi-bot bot added the size/XL label and removed the size/L label Apr 9, 2026
@YuhaoZhang00 YuhaoZhang00 requested a review from JmPotato April 9, 2026 08:56

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
tests/integrations/tso/client_test.go (1)

884-890: ⚠️ Potential issue | 🟡 Minor

Assert actual leader change after resignation.

Between Line 885 and Line 890, the test waits for a leader before and after resignation but never checks they differ. This can pass without a real transfer.

♻️ Proposed tightening
 	for range 2 {
-		leaderName = pdCluster.WaitLeader()
-		re.NotEmpty(leaderName)
-		err = pdCluster.GetServer(leaderName).ResignLeaderWithRetry()
+		oldLeaderName := pdCluster.WaitLeader()
+		re.NotEmpty(oldLeaderName)
+		err = pdCluster.GetServer(oldLeaderName).ResignLeaderWithRetry()
 		re.NoError(err)
 		leaderName = pdCluster.WaitLeader()
 		re.NotEmpty(leaderName)
+		re.NotEqual(oldLeaderName, leaderName)
 
 		// ServiceIndependent must remain set after leader resignation.
 		newLeader := pdCluster.GetServer(leaderName)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/tso/client_test.go` around lines 884 - 890, The test
currently records leaderName before and after calling
pdCluster.GetServer(leaderName).ResignLeaderWithRetry() but only asserts
non-emptiness; update the test to assert the leader actually changed by checking
that the new leaderName != old leaderName after pdCluster.WaitLeader() returns
(use the same leaderName variable or a new one like newLeader and assert
inequality), while retaining existing re.NotEmpty and re.NoError checks around
pdCluster.WaitLeader() and ResignLeaderWithRetry().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 12a6f11a-388d-41f2-9474-98bd756c6231

📥 Commits

Reviewing files that changed from the base of the PR and between ed20c9e and 43309f6.

📒 Files selected for processing (2)
  • server/server_test.go
  • tests/integrations/tso/client_test.go

Labels

contribution, dco-signoff: yes, ok-to-test, release-note-none, size/XL

3 participants