
fix: [CI-21435]: Add Windows network recovery for hotpool VMs#324

Open
anurag-harness wants to merge 1 commit into harness:main from anurag-harness:CI-21435

anurag-harness (Contributor) commented Mar 18, 2026
Summary

  • Add a Windows network recovery mechanism that runs when the connectivity check fails during a health check
  • After GCE suspend/resume, Windows hotpool VMs can lose outbound connectivity due to expired DHCP leases, stale DNS cache, and clock drift
  • Recovery attempts: DHCP renewal, DNS flush, clock sync, DNS server re-configuration
  • Similar pattern to Linux ARM64 clock fix (CI-21434)

Problem

Windows VMs lose outbound internet connectivity when reused from the hotpool (GCE suspend/resume). The lite-engine connectivity check against 8.8.8.8:53 then fails, so the VM is destroyed and the benefit of keeping it in the hotpool is lost.

Root Cause

After GCE suspend/resume on Windows:

  1. The DHCP lease may have expired during suspension
  2. The DNS resolver cache is stale
  3. The system clock has drifted (similar to the Linux chrony issue, CI-21434), causing TLS failures
  4. No post-resume recovery mechanism exists for Windows

Solution

When the connectivity check fails on Windows, attemptNetworkRecovery() runs before returning the error:

  • ipconfig /renew — renew DHCP lease
  • ipconfig /flushdns — flush stale DNS cache
  • w32tm /resync /nowait — sync system clock
  • netsh interface ipv4 add dnsserver — re-add DNS servers (8.8.8.8, 1.1.1.1)

The RetryHealth caller in drone-runner-aws retries the health check, so the next attempt benefits from the restored networking.

Changes

| File | Change |
| --- | --- |
| handler/health.go | Call attemptNetworkRecovery() on connectivity failure |
| handler/network_recovery_windows.go | Windows-specific recovery: DHCP, DNS, clock sync |
| handler/network_recovery_other.go | No-op for non-Windows platforms |

Test Plan

  • Cross-compiles for GOOS=windows GOARCH=amd64
  • Builds cleanly on macOS/Linux (no-op path)
  • Deploy to a Windows hotpool and verify VMs recover connectivity after suspend/resume
  • Verify health check passes on second retry after recovery

Related

  • JIRA: CI-21435
  • Similar fix: CI-21434 (Linux ARM64 chrony clock stepping)

🤖 Generated with Claude Code

Windows VMs lose outbound internet connectivity after GCE
suspend/resume when used as hotpool instances. After resume,
DHCP leases may expire, DNS cache is stale, and system clock
drifts — similar to the Linux ARM64 clock issue (CI-21434).

When the connectivity check fails on Windows, this fix attempts
network recovery by:
- Renewing DHCP lease (ipconfig /renew)
- Flushing DNS cache (ipconfig /flushdns)
- Syncing system clock (w32tm /resync)
- Re-adding DNS servers (8.8.8.8, 1.1.1.1)

The recovery runs before returning the error, so the next
RetryHealth attempt benefits from the restored networking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>