fix: [CI-21435]: Add Windows network recovery for hotpool VMs #324
Open
anurag-harness wants to merge 1 commit into
Windows VMs lose outbound internet connectivity after GCE suspend/resume when used as hotpool instances. After resume, DHCP leases may expire, DNS cache is stale, and system clock drifts, similar to the Linux ARM64 clock issue (CI-21434).

When the connectivity check fails on Windows, this fix attempts network recovery by:
- Renewing DHCP lease (ipconfig /renew)
- Flushing DNS cache (ipconfig /flushdns)
- Syncing system clock (w32tm /resync)
- Re-adding DNS servers (8.8.8.8, 1.1.1.1)

The recovery runs before returning the error, so the next RetryHealth attempt benefits from the restored networking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Problem
Windows VMs lose outbound internet connectivity when reused from the hotpool (GCE suspend/resume). The lite-engine connectivity check (8.8.8.8:53) fails, causing the VM to be destroyed and wasting the hotpool benefit.
Root Cause
After GCE suspend/resume on Windows:
- DHCP leases may expire
- DNS cache is stale
- System clock drifts
Solution
When the connectivity check fails on Windows, attemptNetworkRecovery() runs before returning the error:
- ipconfig /renew: renew DHCP lease
- ipconfig /flushdns: flush stale DNS cache
- w32tm /resync /nowait: sync system clock
- netsh interface ipv4 add dnsserver: re-add DNS servers (8.8.8.8, 1.1.1.1)
The RetryHealth caller in drone-runner-aws retries the health check, so the next attempt benefits from the restored networking.
Changes
- handler/health.go: attemptNetworkRecovery() on connectivity failure
- handler/network_recovery_windows.go
- handler/network_recovery_other.go
Test Plan
GOOS=windows GOARCH=amd64
Related
🤖 Generated with Claude Code