Fix zombie tart processes blocking subsequent builds #25

Merged
rcurranmoz merged 12 commits into mozilla-platform-ops:main from
rcurranmoz:fix-zombie-cleanup
Apr 27, 2026


@rcurranmoz
Collaborator

Summary

  • When packer times out waiting for SSH, it exits without killing the tart run process (graceful shutdown fails because there is no SSH connection to the VM)
  • The cleanup step's tart delete silently fails on a running VM
  • The zombie tart run persists and consumes the macOS 1-VM system limit, so the next build's VM gets no vmnet network access, which causes cascading SSH timeouts

Changes

  • Add pkill -9 -f "tart run sequoia-gecko{1,3}b" + sleep 2 before tart delete in both cleanup steps
  • Add sleep 5 after the start-of-job kill step to let vmnet recover fully before starting the new build
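
The kill-then-delete ordering described above can be sketched as follows. This is only a Python illustration of the sequence; the PR itself describes shell commands in the cleanup steps. The function names, the injectable `run` and `sleep` parameters, and the expanded VM names (`sequoia-gecko1b`, `sequoia-gecko3b`, read from the `{1,3}` shorthand) are assumptions.

```python
import subprocess
import time


def cleanup_commands(vm_name):
    """Return the two cleanup commands for one VM, in the order they must run."""
    return [
        # Force-kill any leftover `tart run <vm>` process first ...
        ["pkill", "-9", "-f", f"tart run {vm_name}"],
        # ... so that the delete is no longer attempted on a running VM.
        ["tart", "delete", vm_name],
    ]


def cleanup_vm(vm_name, settle_seconds=2.0, run=subprocess.run, sleep=time.sleep):
    """Kill the zombie, pause briefly, then delete the VM.

    `run` and `sleep` are injectable so the ordering can be checked
    without touching real processes (an assumption of this sketch).
    """
    kill_cmd, delete_cmd = cleanup_commands(vm_name)
    # pkill exits nonzero when nothing matched; that is not an error here.
    run(kill_cmd, check=False)
    sleep(settle_seconds)
    run(delete_cmd, check=False)


# Hypothetical usage, assuming these are the two VM names:
#   for name in ("sequoia-gecko1b", "sequoia-gecko3b"):
#       cleanup_vm(name)
```

Neither command is treated as fatal in this sketch, since a cleanup step usually has to tolerate both "nothing to kill" and "nothing to delete".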

Test plan

  • Build gecko-1b completes without SSH timeout
  • Build gecko-3b completes
  • Both images pushed to OCI registry

🤖 Generated with Claude Code

rcurranmoz and others added 12 commits April 27, 2026 15:13
When packer times out waiting for SSH, its graceful shutdown fails
(no SSH connection to the VM), leaving the `tart run` process running.
The cleanup step's `tart delete` silently fails on a running VM, so
the zombie persists and consumes the macOS VM system limit slot.
Subsequent builds get the system limit warning and their VMs get no
vmnet network access, causing cascading SSH timeouts.

Fix: kill `tart run` before `tart delete` in both cleanup steps.
Add 5s sleep after the start-of-job kill to let vmnet recover.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tests if the GitHub Actions runner's shell can reach a tart VM's SSH
port via nc, to isolate whether the 'no route to host' error from
packer-plugin-tart is a Go-specific or runner-environment issue.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
packer-plugin-tart v1.15.1 runs VMs with --graphics --vnc-experimental
by default. Setting headless=true disables this and runs tart without a window.

This is a bisect to determine if graphics mode is causing the no-route-to-host
SSH failure in the Go plugin.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
packer-plugin-tart gets EHOSTUNREACH immediately on every connection attempt
to bridge100 VMs (192.168.64.x) when spawned from the GitHub Actions runner
process hierarchy. Interactive SSH sessions can reach the same IPs fine.

Add ssh_proxy.py: a bidirectional TCP proxy that listens on 127.0.0.1:2222,
calls tart ip to discover the VM's current IP, and forwards connections to
VM:22 from within the runner's shell subprocess (which may not share the same
routing restriction as Go's net.Dial in packer-plugin-tart).
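
A minimal sketch of the kind of proxy this paragraph describes, assuming a thread-per-connection design. It is not the actual `ssh_proxy.py`; all function names here are hypothetical. It uses a direct `socket.create_connection()` to the VM, which is the approach a later commit in this PR replaces with `/usr/bin/nc`.

```python
import socket
import subprocess
import threading


def discover_vm_ip(vm_name, tart_path="tart"):
    """Ask `tart ip <vm>` for the VM's current address."""
    result = subprocess.run(
        [tart_path, "ip", vm_name], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def make_listener(host="127.0.0.1", port=2222):
    """Open the local listening socket that packer would connect to."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    return listener


def pump(src, dst):
    """Copy bytes one way until src reaches EOF, then half-close dst."""
    try:
        while True:
            chunk = src.recv(65536)
            if not chunk:
                break
            dst.sendall(chunk)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def handle(client, resolve_target):
    """Connect upstream and relay traffic in both directions."""
    try:
        upstream = socket.create_connection(resolve_target(), timeout=10)
    except OSError:
        client.close()
        return
    upstream.settimeout(None)  # block normally once connected
    forward = threading.Thread(target=pump, args=(client, upstream), daemon=True)
    forward.start()
    pump(upstream, client)
    forward.join()
    client.close()
    upstream.close()


def serve(listener, resolve_target):
    """Accept connections until the listener is closed."""
    while True:
        try:
            client, _ = listener.accept()
        except OSError:
            return
        threading.Thread(
            target=handle, args=(client, resolve_target), daemon=True
        ).start()


# Hypothetical usage ("<vm>" is a placeholder):
#   serve(make_listener(), lambda: (discover_vm_ip("<vm>"), 22))
```

Resolving the target on every connection, rather than once at startup, is a design choice in this sketch so that the proxy can start before the VM has an address.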

Configure packer source blocks to use ssh_host=127.0.0.1 / ssh_port=2222 so
packer's SSH communicator goes through the proxy instead of bridge100 directly.

Also add a background net-probe loop alongside each packer build that tests
nc from the runner's bash subprocess every 15s — this will confirm whether
the runner shell itself can reach the VM (diagnostic data regardless of fix).
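
The probe loop could look roughly like the sketch below. The PR describes it as a background loop run from the runner's bash subprocess, so this Python version is only an illustration; the function names, the 3-second `nc` timeout, and the injectable `run` and `log` parameters are assumptions. The `[net-probe] PASS` / `FAIL` log format follows the build-log lines quoted in a later commit of this PR.

```python
import subprocess
import threading


def nc_probe_command(vm_ip, port=22, timeout_s=3, nc_path="/usr/bin/nc"):
    """Build an `nc` invocation that only tests whether the port accepts connections."""
    # -z: scan without sending data; -w: give up after timeout_s seconds.
    return [nc_path, "-z", "-w", str(timeout_s), vm_ip, str(port)]


def probe_once(vm_ip, run=subprocess.run):
    """Return "PASS" if nc could connect, "FAIL" otherwise."""
    result = run(nc_probe_command(vm_ip), capture_output=True)
    return "PASS" if result.returncode == 0 else "FAIL"


def probe_loop(get_ip, stop, interval_s=15.0, log=print, run=subprocess.run):
    """Probe repeatedly until `stop` (a threading.Event) is set."""
    while not stop.is_set():
        log(f"[net-probe] {probe_once(get_ip(), run=run)}")
        # Event.wait doubles as an interruptible sleep.
        stop.wait(interval_s)


# Hypothetical usage alongside a build (get_vm_ip is a placeholder):
#   stop = threading.Event()
#   threading.Thread(target=probe_loop, args=(get_vm_ip, stop), daemon=True).start()
#   ... run packer ...
#   stop.set()
```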

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o $!)

The previous commit used command substitution $(start_ssh_proxy ...) to
capture the background PID. The Python proxy inherits the command substitution
pipe's write end (stdout) and keeps it open indefinitely, so the $() call
never returns, and builder.sh hangs before packer even starts.

Fix: start the proxy and net-probe directly with & and capture PID via $!
instead of via a helper function called through $(...).
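
The mechanism behind the hang can be reproduced in a few lines. The sketch below is an illustration, not code from the PR: it captures a shell's stdout through a pipe, once with a background child that inherits the pipe and once with the child's output redirected away. The `sleep 2` stand-in for the proxy and the names used are assumptions. Note that the PR's actual fix is different: it avoids the capture entirely and reads `$!` directly.

```python
import subprocess
import time

# Background child inherits the capture pipe as its stdout.
INHERITS_PIPE = "sleep 2 & echo $!"
# Background child's output is detached from the capture pipe.
DETACHED = "sleep 2 >/dev/null 2>&1 & echo $!"


def timed_capture(script):
    """Run `script` under sh, capture stdout, and report how long the capture took."""
    start = time.monotonic()
    result = subprocess.run(
        ["sh", "-c", script], stdout=subprocess.PIPE, text=True
    )
    return time.monotonic() - start, result.stdout.strip()


# Hypothetical usage:
#   held, pid = timed_capture(INHERITS_PIPE)   # waits for the sleep to finish
#   freed, pid = timed_capture(DETACHED)       # returns almost immediately
```

In both cases the shell prints the PID and exits right away. The reader only sees end-of-file once every copy of the pipe's write end is closed, so a long-lived child that still holds it (a proxy that never exits, in the PR's case) keeps the capture waiting indefinitely.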

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
packer-plugin-tart overrides ssh_host by calling 'tart ip' and using the
result directly, ignoring any ssh_host set in HCL. The VM's bridge100 IP
(192.168.64.x) is unreachable from packer-plugin-tart's process context
(EHOSTUNREACH), but IS reachable from the runner's shell subprocess.

A tart wrapper script is prepended to PATH before packer runs. It
intercepts 'tart ip <vm>' to wait for the VM to have a real IP, then
returns 127.0.0.1 instead. packer-plugin-tart then connects to
127.0.0.1:2222, which is our Python proxy (running in the shell context
that CAN reach bridge100) forwarding to VM:22.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
packer-plugin-tart calls 'tart ip --wait 120 <vm_name>', not 'tart ip <vm_name>'.
The wrapper was passing only $2 ('--wait') to the real tart, causing it to fail
or hang. Pass '$@' instead to forward all arguments unchanged, then replace the
returned IP with 127.0.0.1.
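
The wrapper's logic, with the argument-forwarding fix applied, can be sketched like this. The real wrapper is presumably a shell script placed ahead of the real binary on PATH; this Python version only illustrates the behavior, and the function names are hypothetical. The `/opt/homebrew/bin/tart` path is the one a later commit in this PR uses to reach the real binary.

```python
import os
import subprocess
import sys

REAL_TART = "/opt/homebrew/bin/tart"
PROXY_HOST = "127.0.0.1"


def is_ip_query(args):
    """True when the invocation is `tart ip ...`, whatever flags follow."""
    return len(args) > 0 and args[0] == "ip"


def run_real_tart(args):
    """Invoke the real tart with every argument forwarded unchanged."""
    result = subprocess.run([REAL_TART, *args], capture_output=True, text=True)
    return result.returncode, result.stdout


def answer_ip_query(args, run_real=run_real_tart):
    """Let the real `tart ip` wait for an address, then report the proxy host."""
    code, out = run_real(list(args))  # e.g. ["ip", "--wait", "120", "<vm>"]
    if code == 0 and out.strip():
        return code, PROXY_HOST + "\n"
    # On failure, pass the real result through untouched.
    return code, out


def main(argv):
    args = argv[1:]
    if not is_ip_query(args):
        # Everything except `tart ip` runs the real binary directly.
        os.execv(REAL_TART, [REAL_TART, *args])
    code, out = answer_ip_query(args)
    sys.stdout.write(out)
    return code
```

Forwarding the whole argument list is what makes the wrapper independent of how the plugin chooses to call `tart ip`.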

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The tart wrapper makes 'tart ip' return 127.0.0.1, so the net-probe was
testing 127.0.0.1:22 (runner's own sshd) -- always PASS, useless.
Fix: net-probe and pre-packer diagnostics now call /opt/homebrew/bin/tart
directly to get the VM's real bridge100 IP.

Add pre-packer diagnostics that run in the runner subprocess context:
bridge100 interface state, routing table, ARP, nc and ping to the real VM
IP. This will tell us definitively whether the routing issue is a process
isolation problem or an ARP/DHCP timing problem.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
/usr/bin/nc is a macOS platform binary with implicit access to bridge100
virtual network interfaces. Homebrew Python's socket.create_connection()
gets EHOSTUNREACH on the same IP that nc reaches successfully -- confirmed
by simultaneous [net-probe] PASS and [proxy] FAIL in build logs.

Replace Python socket.create_connection() with subprocess.Popen(['/usr/bin/nc',
vm_ip, '22']) and pipe client <-> nc stdin/stdout bidirectionally.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The pipe-threading approach (client_to_nc + nc_to_client threads) was
causing the SSH handshake to fail: connections were opening and closing
in <1 second. Root cause: Python thread scheduling and pipe buffering
interact badly with SSH's initial binary banner exchange.

Fix: pass the packer client socket fd directly as nc's stdin and stdout.
nc handles all bidirectional I/O natively without Python intermediation,
eliminating buffering and threading race conditions.
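
A minimal sketch of the fd hand-off this commit describes, assuming the proxy already holds an accepted client socket. The function names are hypothetical and this is not the actual `ssh_proxy.py` code; only the `/usr/bin/nc <vm_ip> 22` command line comes from the PR.

```python
import socket
import subprocess


def nc_command(vm_ip, port=22, nc_path="/usr/bin/nc"):
    """Command line for a plain nc connection to the VM's SSH port."""
    return [nc_path, vm_ip, str(port)]


def hand_off(client, command):
    """Start `command` with the client socket as both its stdin and stdout.

    The child reads from and writes to the socket itself, so no Python
    thread or intermediate pipe sits in the data path.
    """
    fd = client.fileno()
    return subprocess.Popen(command, stdin=fd, stdout=fd)


# Hypothetical usage inside an accept loop (vm_ip is a placeholder):
#   client, _ = listener.accept()
#   proc = hand_off(client, nc_command(vm_ip))
#   client.close()   # the child keeps its own copy of the descriptor
#   proc.wait()
```

Closing the parent's copy of the socket after the hand-off means the connection ends as soon as `nc` exits, rather than lingering until the proxy process cleans up.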

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rcurranmoz rcurranmoz merged commit d2b6557 into mozilla-platform-ops:main Apr 27, 2026
2 checks passed
