Fix zombie tart processes blocking subsequent builds #25
Merged
rcurranmoz merged 12 commits into mozilla-platform-ops:main on Apr 27, 2026
When packer times out waiting for SSH, its graceful shutdown fails (no SSH connection to the VM), leaving the `tart run` process running. The cleanup step's `tart delete` silently fails on a running VM, so the zombie persists and consumes the macOS VM system limit slot. Subsequent builds get the system limit warning and their VMs get no vmnet network access, causing cascading SSH timeouts. Fix: kill `tart run` before `tart delete` in both cleanup steps. Add 5s sleep after the start-of-job kill to let vmnet recover. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
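The ordering described in this commit (kill the lingering `tart run`, wait briefly, then delete) can be sketched in Python as below. This is only an illustration of the sequence, not the workflow's actual cleanup step: the function name, the `run` and `sleep` parameters, and the default 2-second settle time are assumptions made so the sketch can be exercised without tart installed.

```python
import subprocess
import time


def cleanup_vm(vm_name, run=subprocess.run, sleep=time.sleep, settle_seconds=2):
    """Kill any lingering `tart run <vm_name>` process, then delete the VM.

    `run` and `sleep` are injectable (hypothetical parameters) so the
    ordering can be checked without touching a real VM.
    """
    # pkill exits non-zero when nothing matched; that is acceptable here,
    # so the exit status is deliberately not checked.
    run(["pkill", "-9", "-f", f"tart run {vm_name}"], check=False)
    # Give the killed process a moment to exit before deleting.
    sleep(settle_seconds)
    # With the process gone, `tart delete` is no longer aimed at a running VM.
    return run(["tart", "delete", vm_name], check=False)
```

Calling `cleanup_vm("some-vm")` would issue the three steps in order; the VM name is whatever the build actually created.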
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests if the GitHub Actions runner's shell can reach a tart VM's SSH port via nc, to isolate whether the 'no route to host' error from packer-plugin-tart is a Go-specific or runner-environment issue. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The packer-plugin-tart v1.15.1 runs VMs with --graphics --vnc-experimental by default. headless=true disables this, running tart without a window. This is a bisect to determine if graphics mode is causing the no-route-to-host SSH failure in the Go plugin. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
packer-plugin-tart gets EHOSTUNREACH immediately on every connection attempt to bridge100 VMs (192.168.64.x) when spawned from the GitHub Actions runner process hierarchy. Interactive SSH sessions can reach the same IPs fine. Add ssh_proxy.py: a bidirectional TCP proxy that listens on 127.0.0.1:2222, calls tart ip to discover the VM's current IP, and forwards connections to VM:22 from within the runner's shell subprocess (which may not share the same routing restriction as Go's net.Dial in packer-plugin-tart). Configure packer source blocks to use ssh_host=127.0.0.1 / ssh_port=2222 so packer's SSH communicator goes through the proxy instead of bridge100 directly. Also add a background net-probe loop alongside each packer build that tests nc from the runner's bash subprocess every 15s — this will confirm whether the runner shell itself can reach the VM (diagnostic data regardless of fix). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
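For readers unfamiliar with the pattern, a bidirectional TCP proxy of the kind this commit describes might look roughly like the sketch below. It is not the PR's `ssh_proxy.py`: the function names and the `resolve_target` callback are hypothetical, and where the real proxy calls `tart ip` to find the VM, this sketch simply asks the callback for a `(host, port)` pair on each new connection.

```python
import socket
import threading


def pump(src, dst):
    """Copy bytes from src to dst until src reaches EOF, then half-close dst."""
    try:
        while True:
            data = src.recv(65536)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def handle(client, resolve_target):
    """Connect to the current target and relay bytes in both directions."""
    try:
        upstream = socket.create_connection(resolve_target(), timeout=10)
    except OSError:
        client.close()
        return
    upstream.settimeout(None)  # back to blocking mode for the relay itself
    forward = threading.Thread(target=pump, args=(client, upstream), daemon=True)
    forward.start()
    pump(upstream, client)
    forward.join()
    client.close()
    upstream.close()


def serve(listen_addr, resolve_target):
    """Start accepting on listen_addr in a background thread; return the listener."""
    server = socket.create_server(listen_addr)

    def accept_loop():
        while True:
            try:
                client, _ = server.accept()
            except OSError:
                return  # the listening socket was closed
            threading.Thread(
                target=handle, args=(client, resolve_target), daemon=True
            ).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server
```

In the setup described here the call would be along the lines of `serve(("127.0.0.1", 2222), lambda: (vm_ip, 22))`, with `vm_ip` obtained from tart.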
…o \$!) The previous commit used command substitution \$(start_ssh_proxy ...) to capture the background PID. The Python proxy inherits the command substitution pipe's write end (stdout) and keeps it open indefinitely, so the \$() call never returns — builder.sh hangs before packer even starts. Fix: start the proxy and net-probe directly with & and capture PID via \$! instead of via a helper function called through \$(...). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
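The same pitfall exists when a helper is launched from Python: if a long-lived child inherits a pipe that the caller is reading, the caller waits for an EOF that never arrives. A rough Python analogue of "start with `&`, read `$!`" is sketched below. The function name and the log-file argument are assumptions for illustration; `builder.sh` itself does this in shell.

```python
import subprocess


def start_background(cmd, log_path):
    """Start cmd without waiting and return its Popen handle.

    The handle's .pid plays the role of $! in the shell version.
    """
    log = open(log_path, "ab")
    try:
        # stdout and stderr go to a file, never to a pipe owned by the caller,
        # so nothing ends up blocked waiting for EOF from a long-lived child.
        return subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
        )
    finally:
        # The child holds its own duplicate of the descriptor.
        log.close()
```

The returned handle can later be stopped with `terminate()` and reaped with `wait()`, which corresponds to killing the saved PID in the shell script.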
packer-plugin-tart overrides ssh_host by calling 'tart ip' and using the result directly, ignoring any ssh_host set in HCL. The VM's bridge100 IP (192.168.64.x) is unreachable from packer-plugin-tart's process context (EHOSTUNREACH), but IS reachable from the runner's shell subprocess. A tart wrapper script is prepended to PATH before packer runs. It intercepts 'tart ip <vm>' to wait for the VM to have a real IP, then returns 127.0.0.1 instead. packer-plugin-tart then connects to 127.0.0.1:2222, which is our Python proxy (running in the shell context that CAN reach bridge100) forwarding to VM:22. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
packer-plugin-tart calls 'tart ip --wait 120 <vm_name>', not 'tart ip <vm_name>'.
The wrapper was passing only \$2 ('--wait') to the real tart, causing it to fail
or hang. Pass '\$@' instead to forward all arguments unchanged, then replace the
returned IP with 127.0.0.1.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
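The wrapper logic after this fix (forward every argument untouched, rewrite only the output of `tart ip`) can be illustrated in Python as follows. The PR's wrapper is a script placed ahead of the real tart on `PATH`; this sketch is a hypothetical re-expression of that behavior, and the function name and injectable `run` parameter are assumptions. The `/opt/homebrew/bin/tart` path is the one mentioned elsewhere in this PR.

```python
import subprocess

REAL_TART = "/opt/homebrew/bin/tart"  # adjust if tart lives elsewhere
PROXY_HOST = "127.0.0.1"


def tart_wrapper(argv, run=subprocess.run):
    """Run the real tart with argv unchanged; return (exit_code, stdout_text).

    Only `tart ip ...` is intercepted. Every other subcommand is passed
    straight through with inherited stdio, so stdout_text is empty for those.
    """
    if argv[:1] != ["ip"]:
        return run([REAL_TART, *argv]).returncode, ""

    # Forward all arguments (for example `ip --wait 120 <vm>`), not just one.
    result = run([REAL_TART, *argv], stdout=subprocess.PIPE, text=True)
    if result.returncode == 0 and result.stdout.strip():
        # The VM has a real address; report the proxy address in its place.
        return 0, PROXY_HOST + "\n"
    return result.returncode, result.stdout
```

A thin entry point would write the returned text to stdout and exit with the returned code, so the caller sees the same interface as the real binary.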
The tart wrapper makes 'tart ip' return 127.0.0.1, so the net-probe was testing 127.0.0.1:22 (runner's own sshd) -- always PASS, useless. Fix: net-probe and pre-packer diagnostics now call /opt/homebrew/bin/tart directly to get the VM's real bridge100 IP. Add pre-packer diagnostics that run in the runner subprocess context: bridge100 interface state, routing table, ARP, nc and ping to the real VM IP. This will tell us definitively whether the routing issue is a process isolation problem or an ARP/DHCP timing problem. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
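A probe that bypasses the wrapper could be sketched like this. The function name, the injectable `run` parameter, and the `nc` flags (`-z` to test the port without sending data, `-w 2` for a two-second timeout) are assumptions; the commit only states that the probe calls `/opt/homebrew/bin/tart` directly and tests the real IP with `nc`.

```python
import subprocess

REAL_TART = "/opt/homebrew/bin/tart"  # bypasses any wrapper earlier on PATH


def probe_vm_ssh(vm_name, run=subprocess.run):
    """Return (ip, reachable) for the VM's real address, or (None, False)."""
    ip_result = run([REAL_TART, "ip", vm_name], capture_output=True, text=True)
    ip = ip_result.stdout.strip()
    if ip_result.returncode != 0 or not ip:
        return None, False
    # Assumed flags: -z only checks that the port accepts a connection,
    # -w 2 gives up after two seconds.
    nc_result = run(
        ["/usr/bin/nc", "-z", "-w", "2", ip, "22"],
        capture_output=True,
        text=True,
    )
    return ip, nc_result.returncode == 0
```

Because the address comes from the real binary, a PASS here reflects reachability of the VM itself and not of the runner's own sshd on 127.0.0.1.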
/usr/bin/nc is a macOS platform binary with implicit access to bridge100 virtual network interfaces. Homebrew Python's socket.create_connection() gets EHOSTUNREACH on the same IP that nc reaches successfully -- confirmed by simultaneous [net-probe] PASS and [proxy] FAIL in build logs. Replace Python socket.create_connection() with subprocess.Popen(['/usr/bin/nc', vm_ip, '22']) and pipe client <-> nc stdin/stdout bidirectionally. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The pipe-threading approach (client_to_nc + nc_to_client threads) was causing the SSH handshake to fail — connections were opening and closing in <1 second. Root cause: Python thread scheduling + pipe buffering interacts badly with SSH's initial binary banner exchange. Fix: pass the packer client socket fd directly as nc's stdin and stdout. nc handles all bidirectional I/O natively without Python intermediation, eliminating buffering and threading race conditions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
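Handing the accepted socket to a child process as its stdin and stdout takes only a few lines with `subprocess.Popen`, which accepts raw file descriptors. The sketch below shows the general shape under stated assumptions: the function name and the `relay_cmd` parameter are hypothetical, and in the setup described here `relay_cmd` would be `["/usr/bin/nc", vm_ip, "22"]`.

```python
import socket
import subprocess


def hand_off_to_relay(client, relay_cmd):
    """Run relay_cmd with the client socket as both its stdin and stdout."""
    # The child expects ordinary blocking reads and writes.
    client.setblocking(True)
    fd = client.fileno()
    proc = subprocess.Popen(relay_cmd, stdin=fd, stdout=fd)
    # The child now holds its own copies of the descriptor, so the parent's
    # copy can be closed; Python no longer touches the byte stream at all.
    client.close()
    return proc
```

Since the relay process reads and writes the socket directly, there are no Python threads or intermediate pipes left to interfere with the SSH banner exchange.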
Summary
- When packer times out waiting for SSH, it leaves behind a running `tart run` process (graceful shutdown fails: no SSH connection to the VM)
- The cleanup step's `tart delete` silently fails on a running VM
- The zombie `tart run` persists and consumes the macOS 1-VM system limit, so the next build's VM gets no vmnet network, causing cascading SSH timeouts

Changes
- Add `pkill -9 -f "tart run sequoia-gecko{1,3}b"` + `sleep 2` before `tart delete` in both cleanup steps
- Add `sleep 5` after the start-of-job kill step to let vmnet recover fully before starting the new build

Test plan
🤖 Generated with Claude Code