chore: support allocating testnets to the local DC #10122
basvandijk wants to merge 12 commits into master
Before merging this we need to add a new feature to Farm. The problem is that when the We can't simply increase the 64 to 256 since that will cause performance tests to allocate multiple VMs per host which makes their measurements / benchmarks unreliable. What I think we need is a new
This will cause VMs of performance tests to be load-balanced over the
What
Support optionally allocating Farm testnets to the same DC where the GitHub runner is running.
Why
The nested system-tests are quite flaky:
It's because these are the only tests that use the SetupOS disk images, which are very large (2.6G). Downloading these images on Farm hosts often times out, especially if the transfer has to cross the Atlantic, i.e. when the Farm host is in `zh1` and the image was built in `dm1` or vice versa.
We should avoid these cross-DC transfers. One way of doing that is forcing a testnet to be allocated to the same DC where the image was created, which is the DC where the GitHub runner is running.
How
- `SystemTestGroup`: `.allocate_testnet_to_local_dc()`. When set, the Farm group is created with `required_host_features` set to the DC of the GitHub runner.
- `NODE_NAME` environment variable. This will have a value on CI like `dm1-spm34`, where the part before the `-` denotes the DC.
- `DC` is output as a bazel volatile status variable in `bazel/workspace_status.sh`. It has to be volatile because we don't want a different DC to invalidate a previously cached test.
- `//rs/tests:DC.txt` target is introduced which outputs a `DC.txt` file with the value of the volatile `DC` variable.
- `system_tests` will use the contents of `//rs/tests:DC.txt` as an environment variable and read that environment variable to determine the `required_host_features` of the group in case it's non-empty.
Future Work
With this new mechanism we could also consider making the Farm metadata volatile, so that changes to it no longer invalidate cached tests. See: #10136.
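For reviewers who want a concrete picture of the DC derivation described under How, here is a minimal sketch of how a workspace status script could turn `NODE_NAME` into a volatile `DC` line. It is not the code from this PR: the helper name `dc_from_node_name` is hypothetical, and printing an empty DC when `NODE_NAME` is unset or contains no `-` is an assumption about how non-CI builds should behave.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not taken from the PR): print the DC prefix of a node
# name, i.e. everything before the first "-". Assumption: when the name is
# empty or contains no "-", print an empty line so the DC stays unset.
dc_from_node_name() {
  local node_name="${1:-}"
  if [[ "$node_name" == *-* ]]; then
    printf '%s\n' "${node_name%%-*}"
  else
    printf '\n'
  fi
}

# Bazel treats workspace status keys without the STABLE_ prefix as volatile,
# so a changed value here does not by itself invalidate cached test results.
echo "DC $(dc_from_node_name "${NODE_NAME:-}")"
```

With `NODE_NAME=dm1-spm34` the script prints `DC dm1`. Without `NODE_NAME` it prints a `DC` key with an empty value, which corresponds to the case where no `required_host_features` constraint is applied.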