Squad/Mission worker processes become orphaned and spin at 100% CPU when TUI session dies #858
Description
Summary
When a Droid TUI session terminates unexpectedly (terminal crash, window close, network drop), all droid exec --input-format stream-jsonrpc worker processes spawned by the daemon for Squad/Mission sessions continue running indefinitely. They consume 100% CPU each, are not cleaned up, and can crash the host machine by exhausting all CPU and memory.
Reproduction
- Start a Droid session in any terminal
- Launch a Squad (/squad) or Mission (/enter-mission) that spawns multiple workers
- Kill the terminal application (not /exit; simulate a crash: kill -9 the terminal PID, close the window, or let it crash)
- Observe: all droid exec worker processes remain running
# After killing the terminal:
ps aux | grep 'droid exec' | grep -v grep | wc -l
# Returns 63 (or however many workers were spawned)
# Each at 15-40% CPU, collectively pegging all cores:
ps aux -r | head -20
Impact
- Severity: Critical. This crashed my Mac Studio (M2 Ultra, 64GB): 63 orphaned workers consumed 100% CPU across all cores, and macOS showed "Your system has run out of application memory" along with the force-quit dialog.
- Workers never self-terminate. They must be manually killed (ps aux | grep 'droid exec' | awk '{print $2}' | xargs kill -9).
- The 30-minute orphan cleanup interval (ORPHAN_CLEANUP_INTERVAL_MS = 1800000) only handles terminal PTY associations, not droid exec worker processes.
Root Cause (from binary analysis of v0.88.0)
Architecture
TUI (droid) ←WebSocket→ Daemon (factoryd)
                          ↓ child_process.spawn()
                        droid exec --input-format stream-jsonrpc (worker 1)
                        droid exec --input-format stream-jsonrpc (worker 2)
                        ...
                        droid exec --input-format stream-jsonrpc (worker N)
Workers are spawned by the daemon (factoryd) as child processes, not by the TUI. The daemon uses child_process.spawn() with stdio: ["pipe","pipe","pipe"] (class vgH → S8A wrapper → U2H transport).
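For readers less familiar with this layout, the spawn pattern described above looks roughly like the following in Node.js. This is a minimal sketch, not the decompiled code: `spawnWorker` is a hypothetical name, and only `child_process.spawn()` and the `stdio` option come from the binary analysis.

```javascript
const { spawn } = require("node:child_process");

// Hypothetical helper. The default command line mirrors the worker processes
// named in this issue; both parameters exist so the sketch can be run with
// any long-lived command when droid is not installed.
function spawnWorker(command = "droid", args = ["exec", "--input-format", "stream-jsonrpc"]) {
  // stdin, stdout and stderr are all pipes whose other ends the caller holds.
  const child = spawn(command, args, { stdio: ["pipe", "pipe", "pipe"] });
  // Without an 'error' listener, a failed spawn (e.g. ENOENT) would throw.
  child.on("error", (err) => console.error(`worker spawn failed: ${err.message}`));
  return child;
}
```

The parent of each worker in this pattern is whichever process called `spawn()`, which here is the daemon. Closing the terminal therefore delivers nothing to the worker by itself; the worker lives until the daemon acts on the returned handle.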
What happens when TUI dies
- TUI process dies → WebSocket to daemon drops
- Daemon detects the WebSocket close event
- Bug: The daemon does NOT call S8A.close() on the worker transport objects associated with that session
- Worker processes continue running with broken stdio pipes
- Workers enter a hot loop (likely retrying writes to broken pipes without proper EPIPE handling)
- The existing setupOrphanCleanup() only cleans up terminal PTY associations via terminalManager, not U2H-managed worker transports
The close() method exists but is never called
S8A.close() correctly sends SIGTERM → waits 5s → SIGKILL:
async close() {
  let proc = this._childProcess;
  proc.stdin.end();
  proc.kill("SIGTERM");
  setTimeout(() => proc.kill("SIGKILL"), 5000);
}
This method works when called (e.g., normal session end via /exit). It is simply never invoked when the TUI disconnects unexpectedly.
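If the disconnect path is going to await these shutdowns, it may help for the close routine to report when the child has actually exited. The following is a sketch of the same SIGTERM then SIGKILL sequence under that assumption; `closeChild` and `graceMs` are hypothetical names, not identifiers from the binary.

```javascript
// Hypothetical variant of the close() sequence: resolves once the child has
// exited and cancels the SIGKILL escalation if SIGTERM was enough.
function closeChild(proc, graceMs = 5000) {
  return new Promise((resolve) => {
    // Already exited, or never started (a failed spawn leaves pid undefined).
    if (proc.pid === undefined || proc.exitCode !== null || proc.signalCode !== null) {
      resolve({ code: proc.exitCode, signal: proc.signalCode });
      return;
    }
    // Escalate only if the child is still alive after the grace period.
    const timer = setTimeout(() => proc.kill("SIGKILL"), graceMs);
    proc.once("exit", (code, signal) => {
      clearTimeout(timer);
      resolve({ code, signal });
    });
    if (proc.stdin && !proc.stdin.destroyed) proc.stdin.end();
    proc.kill("SIGTERM");
  });
}

// Usage (inside an async function): const { signal } = await closeChild(child);
```

Clearing the timer on exit means a shutdown of many workers does not leave dozens of pending five-second timers behind.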
Three specific bugs
- No session→worker cleanup on WebSocket disconnect: When handleClose() fires for a TUI WebSocket, the daemon should enumerate all U2H transport objects associated with that client's sessions and call close() on each.
- Orphan cleanup scope too narrow: setupOrphanCleanup() only sweeps terminal PTYs. Worker droid exec processes spawned via U2H are not tracked or swept.
- No process group isolation: Workers don't share a process group with the TUI session. They can't receive SIGHUP when the session leader dies because the daemon (their actual parent) is still alive.
Suggested Fix
Immediate (handles the common case):
- On WebSocket disconnect in _b.handleClose(), look up all active sessions for that connection
- For each session, find associated U2H transport objects in the droidHandler (u8 class)
- Call S8A.close() on each (SIGTERM → SIGKILL)
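The three steps above can be sketched as a small registry keyed by connection. Everything named here (`WorkerRegistry`, `register`, `closeAllFor`, the ids) is hypothetical; the only assumption taken from the analysis is that each transport exposes a `close()` method.

```javascript
// Hypothetical bookkeeping: which worker transports belong to which connection.
class WorkerRegistry {
  constructor() {
    // connectionId -> Map(sessionId -> Set(transport))
    this.byConnection = new Map();
  }

  register(connectionId, sessionId, transport) {
    let sessions = this.byConnection.get(connectionId);
    if (!sessions) this.byConnection.set(connectionId, (sessions = new Map()));
    let transports = sessions.get(sessionId);
    if (!transports) sessions.set(sessionId, (transports = new Set()));
    transports.add(transport);
  }

  // Intended to be called from the WebSocket close handler.
  // Returns the number of transports on which close() was attempted.
  async closeAllFor(connectionId) {
    const sessions = this.byConnection.get(connectionId);
    if (!sessions) return 0;
    this.byConnection.delete(connectionId);
    const transports = [...sessions.values()].flatMap((set) => [...set]);
    // allSettled: one failing close() must not stop the remaining ones.
    // The async wrapper turns a synchronous throw into a rejection.
    const results = await Promise.allSettled(
      transports.map((t) => (async () => t.close())())
    );
    return results.length;
  }
}
```

Removing the connection entry before closing keeps a second close event for the same connection from signalling the workers twice.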
Safety net (handles edge cases):
- Reduce ORPHAN_CLEANUP_INTERVAL_MS from 1800000 (30min) to 60000 (1min)
- Extend orphan cleanup to also check for droid exec child processes whose parent session is no longer active
- Consider spawning workers with a process group so kill(-pgid) can clean up the entire group
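A sketch of how the sweep and the process-group idea could fit together on POSIX systems follows. `spawnInOwnGroup`, `killGroup`, `sweepOrphans`, the `tracked` map and the `isSessionActive` callback are all hypothetical; the sketch relies on Node's `detached: true` spawn option, which makes the child the leader of a new process group on non-Windows platforms, and on `process.kill()` with a negative pid to signal that group.

```javascript
const { spawn } = require("node:child_process");

// Hypothetical: spawn a worker as the leader of its own process group (POSIX).
function spawnInOwnGroup(command, args) {
  return spawn(command, args, { detached: true, stdio: ["pipe", "pipe", "pipe"] });
}

// Signal the whole group; a negative pid targets the process group.
function killGroup(child, signal = "SIGTERM") {
  if (child.pid === undefined) return false; // spawn never succeeded
  try {
    process.kill(-child.pid, signal);
    return true;
  } catch (err) {
    if (err.code === "ESRCH") return false; // group is already gone
    throw err;
  }
}

// Hypothetical sweep. `tracked` maps sessionId -> array of child processes;
// `isSessionActive` stands in for whatever liveness check the daemon has.
// Returns how many process groups were signalled.
function sweepOrphans(tracked, isSessionActive) {
  let signalled = 0;
  for (const [sessionId, children] of tracked) {
    if (isSessionActive(sessionId)) continue;
    for (const child of children) {
      if (killGroup(child)) signalled++;
    }
    tracked.delete(sessionId);
  }
  return signalled;
}

// Usage: setInterval(() => sweepOrphans(tracked, isSessionActive), 60000).unref();
```

Signalling the group rather than the single pid also covers any helper processes a worker may have started itself.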
Defense in depth:
- Workers should handle EPIPE/broken pipe on stdout/stdin gracefully and self-terminate instead of spinning
- Workers could implement a heartbeat with the daemon and self-terminate if no heartbeat received within N seconds
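On the worker side, both ideas could look roughly like this. It is a sketch under stated assumptions: `HeartbeatWatchdog` and `installSelfTermination` are hypothetical names, the heartbeat message format is left open, and `exit` is injected so the logic can be exercised without ending the process.

```javascript
// Hypothetical watchdog: calls onExpire if beat() is not called within timeoutMs.
class HeartbeatWatchdog {
  constructor(timeoutMs, onExpire) {
    this.timeoutMs = timeoutMs;
    this.onExpire = onExpire;
    this.timer = null;
  }
  // Call on every heartbeat received from the daemon; restarts the countdown.
  beat() {
    clearTimeout(this.timer);
    this.timer = setTimeout(this.onExpire, this.timeoutMs);
  }
  stop() {
    clearTimeout(this.timer);
    this.timer = null;
  }
}

// Hypothetical worker-side wiring.
function installSelfTermination({ stdin, stdout, exit, heartbeatMs }) {
  // A write to a pipe whose reader is gone surfaces as an 'error' event with
  // code EPIPE. Other stream errors are ignored in this sketch; a real worker
  // would log them.
  stdout.on("error", (err) => {
    if (err.code === "EPIPE") exit(0);
  });
  // 'end' fires once the parent's side of stdin is closed and the stream has
  // been fully read, so no further requests can arrive.
  stdin.on("end", () => exit(0));
  const watchdog = new HeartbeatWatchdog(heartbeatMs, () => exit(1));
  watchdog.beat(); // start counting immediately
  return watchdog;
}

// Usage in a worker:
//   const wd = installSelfTermination({
//     stdin: process.stdin, stdout: process.stdout,
//     exit: (code) => process.exit(code), heartbeatMs: 30000,
//   });
//   ...then call wd.beat() whenever a heartbeat message arrives.
```

Exiting instead of retrying is the point: a worker that stops on the first EPIPE cannot end up in the kind of write-retry loop suspected above.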
Environment
- Droid version: 0.88.0
- OS: macOS 15.5 (Darwin 25.4.0), arm64
- Hardware: Mac Studio M2 Ultra, 64GB RAM
- Terminal: Kaku (WezTerm fork, Rust-based) — terminal crash triggered the issue, but the bug is in the daemon, not the terminal
- Trigger: Ran /squad, terminal crashed under memory pressure from 63 concurrent worker output streams, all 63 droid exec processes survived and spun at 100% CPU