fix: deflake //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test_colocate #10548
Draft
basvandijk wants to merge 2 commits into
…threshold The colocated 120-subnet XNet SLO test flakes on the "Send rate below 0.3" check. With 120 single-node subnets sharing performance hardware, the per-subnet block production rate varies between runs: on loaded runs 2-3 straggler subnets dip just below the 0.3 threshold (0.28-0.297), while on retries the minimum jumps to 0.43-0.45 with no subnets below threshold. Add a with_send_rate_threshold builder to the shared Config and lower this test's threshold to 0.2 to absorb the hardware variance while still catching systemic XNet regressions (the median send rate is ~0.7).
Pull request overview
This PR deflakes the 120-subnet XNet SLO system test by making the send-rate SLO threshold configurable in the shared xnet_slo_test_lib::Config, and then lowering the threshold specifically for the colocated 120-subnet staging test to better tolerate run-to-run hardware variance.
Changes:
- Added a `Config::with_send_rate_threshold(f64)` builder to customize the send-rate SLO threshold.
- Lowered `xnet_slo_120_subnets_staging_test`'s send-rate threshold from the default 0.3 to 0.2.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| rs/tests/message_routing/xnet/xnet_slo_120_subnets_staging_test.rs | Uses the new builder to reduce the send-rate threshold to 0.2 for the 120-subnet colocated run. |
| rs/tests/message_routing/xnet/slo_test_lib/xnet_slo_test_lib.rs | Introduces a with_send_rate_threshold builder to override the default send-rate SLO threshold per test. |
Addresses Copilot review: the builder consumes self, so cloning the whole Config is unnecessary. Mutate the owned self directly, consistent with with_resource_overrides.
Alternative: #10552.
Summary
Deflakes `//rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test_colocate`, which intermittently fails the "Send rate below 0.3" SLO check.

Root cause analysis

I downloaded the last week's flaky runs and inspected the `FAILED.log` of each (attempt 1) alongside the `PASSED.log` retry (attempt 2). In every flaky run the only failure was the send-rate check, failing on just 2–3 of the 120 subnets.

The `send_rate` metric is essentially the fraction of the theoretical maximum block/message rate a subnet achieves. The test asserts every subnet stays `>= SEND_RATE_THRESHOLD` (0.3).

This test colocates 120 single-node subnets on shared performance hardware, so the per-subnet block production rate varies between runs depending on load. The bulk of subnets land at 0.46–0.95 (median ~0.7), but on loaded runs a couple of "straggler" subnets dip just below 0.3 (0.28–0.297). On the retry the same subnets comfortably exceed 0.3 (min ~0.43–0.45). The 0.3 threshold sits right in this run-to-run variance band, which is what makes the test flaky.

This is a different root cause from the previous fix attempt (#10243), which addressed call timeouts/latency.
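The per-subnet check described above can be sketched roughly as follows. This is a minimal illustration and not the test library's code: the helper `subnets_below_threshold` and the function `sample_loaded_run` are hypothetical names, and the sample rates are illustrative values picked from the ranges quoted above, not measured data.

```rust
/// Hypothetical helper: returns `(subnet index, send rate)` for every
/// subnet whose send rate is strictly below `threshold`.
pub fn subnets_below_threshold(send_rates: &[f64], threshold: f64) -> Vec<(usize, f64)> {
    send_rates
        .iter()
        .copied()
        .enumerate()
        .filter(|&(_, rate)| rate < threshold)
        .collect()
}

/// Illustrative sample only (not measured data): most subnets sit well
/// above the threshold, while two stragglers fall just under 0.3.
pub fn sample_loaded_run() -> Vec<f64> {
    vec![0.70, 0.28, 0.95, 0.297, 0.46]
}
```

With this sample, a threshold of 0.3 flags the two stragglers, while a threshold of 0.2 flags none. That matches the pattern in the flaky runs, where only a couple of subnets fail the check.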
Fix
- Add a `with_send_rate_threshold` builder to the shared `Config` (consistent with the existing `with_call_timeouts` / `with_payload_bytes` builders).
- Lower this test's send-rate threshold from the default 0.3 to 0.2.

0.2 sits safely below the worst observed straggler (0.28) while still catching systemic XNet regressions: a genuine messaging regression would depress send rates across all subnets (median ~0.7), not just a couple of stragglers, and a subnet producing fewer than 0.2 blocks/s (1 block per 5 s) clearly indicates a real problem. Only this test is affected; the 3-/29-subnet and compatibility tests keep their existing thresholds.
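For readers unfamiliar with the pattern, a consuming builder of this kind might look like the sketch below. The struct is a hypothetical stand-in: the real `Config` in `xnet_slo_test_lib` has other fields, and the constructor, the `subnets` field and the getters here are assumptions made for illustration. Only the builder name and the 0.3 default come from this PR.

```rust
/// Default send-rate threshold, as quoted in the PR description.
pub const SEND_RATE_THRESHOLD: f64 = 0.3;

/// Hypothetical stand-in for the shared test `Config`.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    subnets: usize,
    send_rate_threshold: f64,
}

impl Config {
    /// Assumed constructor: starts from the default threshold.
    pub fn new(subnets: usize) -> Self {
        Config {
            subnets,
            send_rate_threshold: SEND_RATE_THRESHOLD,
        }
    }

    /// Builder that takes `self` by value, so it can mutate the owned
    /// value in place and return it without cloning.
    pub fn with_send_rate_threshold(mut self, threshold: f64) -> Self {
        self.send_rate_threshold = threshold;
        self
    }

    /// Assumed getter for the configured threshold.
    pub fn send_rate_threshold(&self) -> f64 {
        self.send_rate_threshold
    }

    /// Assumed getter for the subnet count.
    pub fn subnets(&self) -> usize {
        self.subnets
    }
}
```

A test would then opt in with something like `Config::new(120).with_send_rate_threshold(0.2)`, while tests that do not call the builder keep the default.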
Verification
- `cargo clippy` on `xnet-slo-test-lib` and `message-routing-system-tests-xnet`: clean.
- `bazel build //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test_colocate`: succeeds.

This PR was created following the steps in `.claude/skills/fix-flaky-tests/SKILL.md`.