Mission/subagent worker processes become orphaned at 100% CPU when TUI session dies — daemon never cleans up U2H transports #859
Description
Summary
When a Droid TUI session terminates (terminal crash, force-quit, window close), all droid exec --input-format stream-jsonrpc worker processes spawned by the daemon continue running indefinitely and together saturate every CPU core. They are never cleaned up and can crash the host machine.
Reproduction
- Start a Droid session, launch a Mission or use subagents that spawn multiple workers
- Force-quit the terminal (or let it crash under memory pressure from worker output)
- Workers survive and spin:
ps aux | grep 'droid exec' | grep -v grep | wc -l
# 63 orphaned processes
ps aux -r | head -10
# Each worker at 15-40% CPU, collectively pegging all cores
- Workers must be manually killed; they never self-terminate.
Impact
Critical — crashed a Mac Studio (M2 Ultra, 64GB). Orphaned workers consumed 100% CPU across all cores until macOS showed "Your system has run out of application memory."
Root Cause (from runtime debugging of v0.88.0)
How workers are spawned
TUI (droid)
→ daemon (factoryd) via WebSocket
→ handleInitializeSession()
→ new U2H() (transport)
→ vgH.spawn()
→ child_process.spawn("droid", ["exec", "--input-format", "stream-jsonrpc", ...])
→ stdio: ["pipe", "pipe", "pipe"] ← all piped to daemon
Workers are spawned by the daemon (factoryd), not the TUI. The daemon is a separate long-lived process.
The S8A.close() method (works correctly when called)
async close() {
  this.isClosing = true;
  let proc = this._childProcess;
  // Send SIGTERM, wait 5s, then SIGKILL
  proc.stdin.end();
  proc.kill("SIGTERM");
  setTimeout(() => proc.kill("SIGKILL"), 5000);
}
This is fine: SIGTERM with a 5s SIGKILL fallback.
The bug: close() is never called when the TUI dies
- TUI crashes (terminal dies, window closes, Ctrl+C, etc.)
- TUI's WebSocket to the daemon drops
- Daemon detects the disconnection
- The orphan cleanup runs every 30 minutes (ORPHAN_CLEANUP_INTERVAL_MS = 1800000)
- But it only cleans up terminals, not droid exec worker processes
The orphan cleanup code:
setupOrphanCleanup() {
  this.terminalManager.startOrphanCleanup({
    intervalMs: 1800000, // 30 MINUTES
    onOrphanFound: (H) => {
      // Only cleans up terminal associations
      this.terminalManager.removeTerminalAssociations(H);
      this.terminalManager.closeTerminal(H);
      // DOES NOT call S8A.close() on worker processes!
    }
  });
}
Three specific bugs
- No parent-death signal propagation: Workers are spawned with stdio: ["pipe","pipe","pipe"], but when the daemon's WebSocket to the TUI drops, nobody calls S8A.close() on the worker transport objects. The daemon keeps running, the workers keep running, but nobody is reading their output.
- Orphan cleanup only handles terminals: The 30-minute orphan sweep only cleans up terminal PTY associations. Worker droid exec processes spawned via U2H transport are not tracked or cleaned up by this mechanism.
- No process group / session leader: Workers are spawned as plain child processes of the daemon. They don't share a process group with the TUI. So when the TUI dies, the workers' parent (daemon) is still alive, so they don't get SIGHUP. And the daemon never finds out that the TUI-side session is dead, because there's no health check on the session→worker mapping.
Why 100% CPU
When the TUI dies, the pipe FDs to stdin/stdout of the workers become broken. But the workers are running the full droid agent loop — they keep trying to stream LLM responses, execute tools, and write output to broken pipes. Without proper EPIPE/SIGPIPE handling, they spin.
Suggested Fix
Immediate (handles the common case):
- On WebSocket disconnect in _b.handleClose(), enumerate all active sessions for that connection
- For each session, find associated U2H transport objects in the droidHandler (u8 class)
- Call S8A.close() on each (SIGTERM → SIGKILL)
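The three steps above could be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: SessionTransportRegistry, register() and closeAllForConnection() are hypothetical names, and each transport is only assumed to expose a close() method that behaves like the S8A.close() shown earlier.

```javascript
// Hypothetical sketch: the class and method names below are illustrative
// and do not come from the Droid codebase.
class SessionTransportRegistry {
  constructor() {
    // connectionId -> Map(sessionId -> Set of transports)
    this.byConnection = new Map();
  }

  // Record a transport when a session is initialized on a connection.
  register(connectionId, sessionId, transport) {
    let sessions = this.byConnection.get(connectionId);
    if (!sessions) {
      sessions = new Map();
      this.byConnection.set(connectionId, sessions);
    }
    let transports = sessions.get(sessionId);
    if (!transports) {
      transports = new Set();
      sessions.set(sessionId, transports);
    }
    transports.add(transport);
  }

  // Close every transport owned by a dropped connection.
  // Resolves to the number of transports on which close() was attempted.
  async closeAllForConnection(connectionId) {
    const sessions = this.byConnection.get(connectionId);
    if (!sessions) return 0;
    this.byConnection.delete(connectionId);

    const pending = [];
    for (const transports of sessions.values()) {
      for (const transport of transports) {
        // The async wrapper turns a synchronous throw into a rejection,
        // so one failing close() cannot skip the remaining transports.
        pending.push((async () => transport.close())());
      }
    }
    await Promise.allSettled(pending);
    return pending.length;
  }
}
```

Under these assumptions, the WebSocket close handler would only need to call closeAllForConnection() with the id of the dropped connection. Promise.allSettled is used so that a transport whose close() fails does not block cleanup of the others.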
Safety net (handles edge cases):
- Reduce ORPHAN_CLEANUP_INTERVAL_MS from 1800000 (30min) to 60000 (1min) or less
- Extend orphan cleanup to also check for droid exec child processes whose parent session is no longer active
- Consider spawning workers with a process group so kill(-pgid) can clean up the entire group
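A rough sketch of how the safety net might look in Node.js follows. All function names here are hypothetical. It relies on two documented POSIX behaviours: spawning with detached: true makes the child the leader of a new process group, and passing a negative pid to process.kill() signals that whole group. How the daemon would list its workers and active sessions is left as injected callbacks, since the issue does not show that bookkeeping.

```javascript
const { spawn } = require("node:child_process");

// Hypothetical helper: spawn a worker as the leader of its own process group.
// On POSIX, detached: true makes the child a process group (and session) leader.
function spawnWorkerInOwnGroup(command, args) {
  return spawn(command, args, {
    detached: true,
    stdio: ["pipe", "pipe", "pipe"],
  });
}

// Signal a whole process group: a negative pid targets the group on POSIX.
// Returns false if the group no longer exists.
function killProcessGroup(pgid, signal = "SIGTERM") {
  try {
    process.kill(-pgid, signal);
    return true;
  } catch (err) {
    if (err.code === "ESRCH") return false;
    throw err;
  }
}

// Pure selection step: workers whose session is no longer active.
// `workers` is an iterable of { pid, sessionId }; `activeSessionIds` is a Set.
function findOrphanedWorkers(workers, activeSessionIds) {
  return [...workers].filter((w) => !activeSessionIds.has(w.sessionId));
}

// Hypothetical sweep: every intervalMs, signal the groups of orphaned workers.
// Returns a function that stops the sweep.
function startWorkerOrphanSweep({ listWorkers, listActiveSessionIds, intervalMs = 60000 }) {
  const timer = setInterval(() => {
    const orphans = findOrphanedWorkers(listWorkers(), listActiveSessionIds());
    for (const { pid } of orphans) killProcessGroup(pid);
  }, intervalMs);
  timer.unref(); // do not keep the daemon alive just for the sweep
  return () => clearInterval(timer);
}
```

Keeping the selection step separate from the kill step makes the sweep easy to test without real processes. Note that killProcessGroup(pid) only reaches the whole group if the worker was spawned as a group leader, as in spawnWorkerInOwnGroup().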
Defense in depth:
- Workers should handle EPIPE/broken pipe on stdout/stdin gracefully and self-terminate instead of spinning
- Workers could implement a heartbeat with the daemon and self-terminate if no heartbeat received within N seconds
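The two worker-side ideas could look something like the sketch below. The function names are hypothetical, and the sketch assumes the worker is already reading from stdin (otherwise the 'end' event would not fire). The exit function and clock are injectable only so the logic can be exercised without ending the process.

```javascript
// Hypothetical worker-side guards; names are illustrative, not Droid APIs.

// Exit when the daemon's side of the pipes goes away.
function installPipeGuards({ stdin, stdout, exit = (code) => process.exit(code) }) {
  // Writing to a closed pipe surfaces as an 'error' event with code EPIPE.
  // Other stdout errors are ignored here; real code should at least log them.
  stdout.on("error", (err) => {
    if (err && err.code === "EPIPE") exit(1);
  });
  // 'end' on stdin means the parent closed its write end.
  stdin.on("end", () => exit(0));
}

// Report expiry once if no heartbeat arrives within timeoutMs.
function createHeartbeatWatchdog({ timeoutMs, onExpire, now = Date.now }) {
  let lastBeat = now();
  let expired = false;
  return {
    // Call whenever a heartbeat message from the daemon is received.
    beat() {
      lastBeat = now();
    },
    // Call periodically; fires onExpire at most once.
    check() {
      if (!expired && now() - lastBeat > timeoutMs) {
        expired = true;
        onExpire();
      }
      return expired;
    },
  };
}

// Possible wiring inside a worker (the 15 second timeout is an arbitrary example):
//   installPipeGuards({ stdin: process.stdin, stdout: process.stdout });
//   const watchdog = createHeartbeatWatchdog({
//     timeoutMs: 15000,
//     onExpire: () => process.exit(1),
//   });
//   setInterval(() => watchdog.check(), 1000).unref();
```

Either guard alone would bound how long an orphaned worker can run. Together they cover both a closed pipe and a daemon that is alive but no longer tracking the session.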
Environment
- Droid v0.88.0, macOS 26.4 (Build 25E246), arm64
- Mac Studio M2 Ultra, 64GB RAM