chore: support allocating testnets to the local DC #10122

Open
basvandijk wants to merge 12 commits into master from basvandijk/allocate_testnet_to_local_dc
basvandijk (Collaborator) commented May 7, 2026

What

Support optionally allocating Farm testnets to the same DC as the one where the GitHub runner is running.

Why

The nested system-tests are quite flaky:

$ bazel run //ci/githubstats:query -- top 100 flaky --gt 0 --columns=label,total,flaky,flaky% --include=%nested%
...
┍━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━┯━━━━━━━━━━┑
│    │ label                                                       │   total │   flaky │   flaky% │
┝━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━┿━━━━━━━━━━┥
│  0 │ //rs/tests/nested/nns_recovery:nr_all_broken_seq_np_actions │      37 │       3 │      8.1 │
│  1 │ //rs/tests/nested:hostos_upgrade_smoke_test                 │      45 │       2 │      4.4 │
│  2 │ //rs/tests/nested/nns_recovery:nr_broken_dfinity_node       │      37 │       1 │      2.7 │
│  3 │ //rs/tests/nested:hostos_upgrade_smoke_test_head_nns        │      37 │       1 │      2.7 │
│  4 │ //rs/tests/nested:registration                              │      45 │       1 │      2.2 │
┕━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━┷━━━━━━━━━━┙

This is because these are the only tests that use the SetupOS disk images, which are very large (2.6G). Downloading these images on Farm hosts often times out, especially if the transfer has to cross the Atlantic, i.e. when the Farm host is in zh1 and the image was built in dm1 or vice versa.

We should avoid these cross-DC transfers. One way of doing that is to force a testnet to be allocated to the same DC as the one where the image was created, which is the DC where the GitHub runner is running.

How

  • This introduces a new function on SystemTestGroup: .allocate_testnet_to_local_dc(). When set, the Farm group is created with required_host_features set to the DC of the GitHub runner.
  • The DC of the GitHub runner is determined via the NODE_NAME environment variable. On CI this has a value like dm1-spm34, where the part before the - denotes the DC.
  • The DC is emitted as a bazel volatile status variable in bazel/workspace_status.sh. It has to be volatile because we don't want a different DC to invalidate a previously cached test.
  • A new //rs/tests:DC.txt target is introduced which outputs a DC.txt file with the value of the volatile DC variable.
  • system_tests passes the contents of //rs/tests:DC.txt to the test as an environment variable. If that variable is non-empty, the test uses it to set the required_host_features of the group.
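
As an illustration of the steps above, here is a minimal sketch of how the DC could be derived from a NODE_NAME value and turned into a host-feature constraint. It is not the code in this PR: the function names `dc_from_node_name` and `required_host_features` (as a free function) and the `dc=<name>` feature string are assumptions made for the example.

```rust
/// Extracts the DC prefix from a runner node name such as "dm1-spm34".
/// Returns `None` when the name contains no '-' or the prefix is empty.
/// (Hypothetical helper, not taken from the PR.)
pub fn dc_from_node_name(node_name: &str) -> Option<&str> {
    let (dc, _host) = node_name.split_once('-')?;
    if dc.is_empty() {
        None
    } else {
        Some(dc)
    }
}

/// Turns the DC value handed to the test into a list of required host
/// features. An empty (or whitespace-only) value means "no constraint".
/// The "dc=<name>" format is an assumption for illustration only.
pub fn required_host_features(dc: &str) -> Vec<String> {
    let dc = dc.trim();
    if dc.is_empty() {
        Vec::new()
    } else {
        vec![format!("dc={dc}")]
    }
}
```

Trimming the value guards against a trailing newline in a file such as DC.txt, and returning an empty list for an empty value keeps runs outside CI (where NODE_NAME may be unset) unconstrained.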

Future Work

With this new mechanism we could also consider moving the Farm metadata to volatile status, meaning changes to it won't invalidate cached tests anymore. See: #10136.

github-actions bot added the chore label May 7, 2026

basvandijk (Collaborator, Author) commented May 7, 2026

Before merging this we need to add a new feature to Farm. The problem is that when the bazel-test-all job is run from dm1 (which is often the case since we have more runners there than in zh1) we quickly allocate all available resources on the dm1 Farm hosts. This is because these Farm hosts were originally meant for performance tests and hence only allow a maximum of 64 vCPUs per host, while regular hosts allow 256 vCPUs.

We can't simply increase the 64 to 256 since that will cause performance tests to allocate multiple VMs per host which makes their measurements / benchmarks unreliable.

What I think we need is a new VmAllocationMode that is specifically designed for performance tests. Let's call it PerformanceAllocation for now. We would then increase the max vCPUs to 256, and PerformanceAllocation would behave like the default MinIntraDistanceLoadBalanceAllocation but without its first ordering property:

  • the number of VMs of the group that the host has already allocated in
    descending order (VMs of a testnet are grouped together on a host as
    much as possible).

This will cause VMs of performance tests to be load-balanced over the dm1 hosts instead of being colocated together on a single host before spilling over to other hosts.
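
To make the difference concrete, here is a rough sketch of the two host orderings. It is not Farm's implementation: the `Host` type, the `Mode` enum and `rank_hosts` are hypothetical, and using free vCPUs as the remaining load-balancing key is an assumption, since the other ordering properties of MinIntraDistanceLoadBalanceAllocation are not listed here.

```rust
use std::cmp::Reverse;

/// Hypothetical view of a Farm host; field names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Host {
    pub name: String,
    /// VMs of the requesting group already placed on this host.
    pub group_vms: u32,
    /// vCPUs still available on this host (assumed load-balancing key).
    pub free_vcpus: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode {
    /// Stand-in for the default mode: colocate with the group first,
    /// then prefer the host with the most free vCPUs.
    Colocate,
    /// Stand-in for the proposed PerformanceAllocation: the colocation
    /// key is dropped, so only the load-balancing key remains.
    Performance,
}

/// Returns host names in the order in which they would be tried.
pub fn rank_hosts(hosts: &[Host], mode: Mode) -> Vec<String> {
    let mut ranked: Vec<&Host> = hosts.iter().collect();
    match mode {
        Mode::Colocate => {
            ranked.sort_by_key(|h| (Reverse(h.group_vms), Reverse(h.free_vcpus)))
        }
        Mode::Performance => ranked.sort_by_key(|h| Reverse(h.free_vcpus)),
    }
    ranked.into_iter().map(|h| h.name.clone()).collect()
}
```

With the colocation key in place, a host that already runs VMs of the group is always tried first, even if it is nearly full. Without it, the next VM goes to the least loaded host, which is the spreading behaviour described above.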

basvandijk marked this pull request as ready for review May 7, 2026 18:09
basvandijk requested review from a team as code owners May 7, 2026 18:09
basvandijk added the CI_ALL_BAZEL_TARGETS (Runs all bazel targets) label May 8, 2026

3 participants