Skip to content

fix: deflake //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test_colocate#10548

Draft
basvandijk wants to merge 2 commits into
masterfrom
ai/deflake-xnet_slo_120_subnets_staging_test-2026-06-23
Draft

fix: deflake //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test_colocate#10548
basvandijk wants to merge 2 commits into
masterfrom
ai/deflake-xnet_slo_120_subnets_staging_test-2026-06-23

Conversation

@basvandijk

Copy link
Copy Markdown
Collaborator

Summary

Deflakes //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test_colocate, which intermittently fails the "Send rate below 0.3" SLO check.

Root cause analysis

I downloaded the last week's flaky runs and inspected the FAILED.log of each (attempt 1) alongside the PASSED.log retry (attempt 2). In every flaky run the only failure was the send-rate check, failing on just 2–3 of the 120 subnets:

Invocation (date) Attempt 1 (FAILED) min send rate # subnets < 0.3 Attempt 2 (PASSED) min send rate # subnets < 0.3
2026-06-19 0.292 2 0.452 0
2026-06-21 0.290 3 0.442 0
2026-06-22 0.295 2 0.453 0
2026-06-23 0.280 2 0.433 0

The send_rate metric is essentially the fraction of the theoretical maximum block/message rate a subnet achieves. The test asserts every subnet stays >= SEND_RATE_THRESHOLD (0.3).

This test colocates 120 single-node subnets on shared performance hardware, so the per-subnet block production rate varies between runs depending on load. The bulk of subnets land at 0.46–0.95 (median ~0.7), but on loaded runs a couple of "straggler" subnets dip just below 0.3 (0.28–0.297). On the retry the same subnets comfortably exceed 0.3 (min ~0.43–0.45). The 0.3 threshold sits right in this run-to-run variance band, which is what makes the test flaky.

This is a different root cause from the previous fix attempt (#10243), which addressed call timeouts/latency.

Fix

  • Add a with_send_rate_threshold builder to the shared Config (consistent with the existing with_call_timeouts / with_payload_bytes builders).
  • Lower this test's send-rate threshold from 0.3 to 0.2.

0.2 sits safely below the worst observed straggler (0.28) while still catching systemic XNet regressions: a genuine messaging regression would depress send rates across all subnets (median ~0.7), not just a couple of stragglers, and a subnet producing fewer than 0.2 blocks/s (1 block per 5 s) clearly indicates a real problem. Only this test is affected; the 3-/29-subnet and compatibility tests keep their existing thresholds.

Verification

  • cargo clippy on xnet-slo-test-lib and message-routing-system-tests-xnet: clean.
  • bazel build //rs/tests/message_routing/xnet:xnet_slo_120_subnets_staging_test_colocate: succeeds.
  • The full system test is left to CI (it allocates 120+ VMs on shared Farm infrastructure per run).

This PR was created following the steps in .claude/skills/fix-flaky-tests/SKILL.md.

…threshold

The colocated 120-subnet XNet SLO test flakes on the "Send rate below 0.3" check. With 120 single-node subnets sharing performance hardware, the per-subnet block production rate varies between runs: on loaded runs 2-3 straggler subnets dip just below the 0.3 threshold (0.28-0.297), while on retries the minimum jumps to 0.43-0.45 with no subnets below threshold.

Add a with_send_rate_threshold builder to the shared Config and lower this test's threshold to 0.2 to absorb the hardware variance while still catching systemic XNet regressions (the median send rate is ~0.7).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR deflakes the 120-subnet XNet SLO system test by making the send-rate SLO threshold configurable in the shared xnet_slo_test_lib::Config, and then lowering the threshold specifically for the colocated 120-subnet staging test to better tolerate run-to-run hardware variance.

Changes:

  • Added a Config::with_send_rate_threshold(f64) builder to customize the send-rate SLO threshold.
  • Lowered xnet_slo_120_subnets_staging_test’s send-rate threshold from the default 0.3 to 0.2.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
rs/tests/message_routing/xnet/xnet_slo_120_subnets_staging_test.rs Uses the new builder to reduce the send-rate threshold to 0.2 for the 120-subnet colocated run.
rs/tests/message_routing/xnet/slo_test_lib/xnet_slo_test_lib.rs Introduces a with_send_rate_threshold builder to override the default send-rate SLO threshold per test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rs/tests/message_routing/xnet/slo_test_lib/xnet_slo_test_lib.rs Outdated
Addresses Copilot review: the builder consumes self, so cloning the whole Config is unnecessary. Mutate the owned self directly, consistent with with_resource_overrides.
@basvandijk

Copy link
Copy Markdown
Collaborator Author

Alternative: #10552.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants