Squad/Mission worker processes become orphaned and spin at 100% CPU when TUI session dies #858

@harshav167

Summary

When a Droid TUI session terminates unexpectedly (terminal crash, window close, network drop), every droid exec --input-format stream-jsonrpc worker process that the daemon spawned for Squad/Mission sessions keeps running indefinitely. The workers spin on the CPU, are never cleaned up, and together can crash the host machine by exhausting all CPU and memory.

Reproduction

  1. Start a Droid session in any terminal
  2. Launch a Squad (/squad) or Mission (/enter-mission) that spawns multiple workers
  3. Kill the terminal application (not /exit — simulate a crash: kill -9 the terminal PID, close the window, or let it crash)
  4. Observe: all droid exec worker processes remain running

# After killing the terminal:
ps aux | grep 'droid exec' | grep -v grep | wc -l
# Returns 63 (or however many workers were spawned)

# Each at 15-40% CPU, collectively pegging all cores:
ps aux -r | head -20

Impact

  • Severity: Critical. This crashed my Mac Studio (M2 Ultra, 64GB): 63 orphaned workers saturated all cores, and macOS showed "Your system has run out of application memory" along with the Force Quit dialog.
  • Workers never self-terminate. They must be killed manually (ps aux | grep 'droid exec' | grep -v grep | awk '{print $2}' | xargs kill -9).
  • The 30-minute orphan cleanup interval (ORPHAN_CLEANUP_INTERVAL_MS = 1800000) only handles terminal PTY associations, not droid exec worker processes.

Root Cause (from binary analysis of v0.88.0)

Architecture

TUI (droid) ←WebSocket→ Daemon (factoryd)
                           ↓ child_process.spawn()
                     droid exec --input-format stream-jsonrpc (worker 1)
                     droid exec --input-format stream-jsonrpc (worker 2)
                     ...
                     droid exec --input-format stream-jsonrpc (worker N)

Workers are spawned by the daemon (factoryd) as child processes, not by the TUI. The daemon uses child_process.spawn() with stdio: ["pipe","pipe","pipe"] (class vgHS8A wrapper → U2H transport).

What happens when TUI dies

  1. TUI process dies → WebSocket to daemon drops
  2. Daemon detects the WebSocket close event
  3. Bug: The daemon does NOT call S8A.close() on the worker transport objects associated with that session
  4. Worker processes continue running with broken stdio pipes
  5. Workers enter a hot loop (likely retrying writes to broken pipes without proper EPIPE handling)
  6. The existing setupOrphanCleanup() only cleans up terminal PTY associations via terminalManager, not U2H-managed worker transports

The close() method exists but is never called

S8A.close() correctly sends SIGTERM → waits 5s → SIGKILL:

async close() {
    let proc = this._childProcess;
    proc.stdin.end();                                 // close stdin so the worker sees EOF
    proc.kill("SIGTERM");                             // ask the worker to shut down
    setTimeout(() => proc.kill("SIGKILL"), 5000);     // force-kill after 5s
}

This method works when called (e.g., normal session end via /exit). It is simply never invoked when the TUI disconnects unexpectedly.

Three specific bugs

  1. No session→worker cleanup on WebSocket disconnect: When handleClose() fires for a TUI WebSocket, the daemon should enumerate all U2H transport objects associated with that client's sessions and call close() on each.

  2. Orphan cleanup scope too narrow: setupOrphanCleanup() only sweeps terminal PTYs. Worker droid exec processes spawned via U2H are not tracked or swept.

  3. No process group isolation: Workers are children of the daemon, not of the TUI, so they are not in the TUI session's process group. The SIGHUP delivered when the terminal goes away never reaches them, and because the daemon (their actual parent) is still alive, nothing else signals them either.
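
The third point can be illustrated with Node's documented detached spawn option, which on POSIX systems makes the child the leader of a new process group so the whole group can be signalled through a negative PID. This is only a sketch under assumptions: spawnWorkerInOwnGroup and killWorkerGroup are hypothetical names, not functions recovered from the Droid binary, and it does not apply to Windows.

```javascript
const { spawn } = require("node:child_process");

// Hypothetical helper: start a worker as the leader of its own process group.
function spawnWorkerInOwnGroup(command, args) {
  return spawn(command, args, {
    detached: true, // POSIX: child becomes leader of a new process group and session
    stdio: ["pipe", "pipe", "pipe"], // same stdio layout the issue describes
  });
}

// Hypothetical helper: signal every process in the worker's group.
// Returns false if the spawn failed or the group no longer exists.
function killWorkerGroup(child, signal = "SIGTERM") {
  if (child.pid === undefined) return false; // spawn never produced a PID
  try {
    process.kill(-child.pid, signal); // negative PID targets the process group
    return true;
  } catch (err) {
    if (err.code === "ESRCH") return false; // group already gone
    throw err;
  }
}
```

Signalling the group rather than the single PID also reaches any grandchildren a worker may have started, which a plain proc.kill() would miss.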

Suggested Fix

Immediate (handles the common case):

  • On WebSocket disconnect in _b.handleClose(), look up all active sessions for that connection
  • For each session, find associated U2H transport objects in the droidHandler (u8 class)
  • Call S8A.close() on each (SIGTERM → SIGKILL)
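
A minimal sketch of the immediate fix, assuming the daemon can key worker transports by connection. WorkerRegistry, register, and closeAllFor are hypothetical names introduced for illustration; the only thing taken from the issue is that each transport exposes a close() method.

```javascript
// Hypothetical registry: maps a connection ID to the worker transports it owns.
class WorkerRegistry {
  constructor() {
    this.byConnection = new Map(); // connectionId -> Set of transports
  }

  // Record a transport when the daemon spawns a worker for a connection.
  register(connectionId, transport) {
    if (!this.byConnection.has(connectionId)) {
      this.byConnection.set(connectionId, new Set());
    }
    this.byConnection.get(connectionId).add(transport);
  }

  // Intended to be called from the WebSocket close handler.
  // Closes every transport for the connection and forgets them.
  // Resolves to the number of transports whose close() succeeded.
  async closeAllFor(connectionId) {
    const transports = [...(this.byConnection.get(connectionId) ?? [])];
    this.byConnection.delete(connectionId);
    // The async wrapper turns a synchronous throw into a rejection, and
    // allSettled ensures one failing close() does not skip the others.
    const results = await Promise.allSettled(
      transports.map(async (t) => t.close())
    );
    return results.filter((r) => r.status === "fulfilled").length;
  }
}
```

In this sketch the close handler would only need one call, for example registry.closeAllFor(connectionId), regardless of how many workers the session spawned.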

Safety net (handles edge cases):

  • Reduce ORPHAN_CLEANUP_INTERVAL_MS from 1800000 (30min) to 60000 (1min)
  • Extend orphan cleanup to also check for droid exec child processes whose parent session is no longer active
  • Consider spawning each worker in its own process group so kill(-pgid) can clean up the worker and any descendants in one call
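
The extended sweep could look roughly like the following. It assumes the daemon keeps a list of worker records with a sessionId and a transport, and can report which session IDs are still active; findOrphanedWorkers, startWorkerSweep, and both callbacks are hypothetical and not taken from the binary.

```javascript
// Hypothetical: return the tracked workers whose owning session is gone.
function findOrphanedWorkers(workers, activeSessionIds) {
  return workers.filter((w) => !activeSessionIds.has(w.sessionId));
}

// Hypothetical periodic sweep. `getWorkers` and `getActiveSessionIds` are
// assumed callbacks supplied by the daemon; the default interval is the
// 1-minute value suggested in the list above.
function startWorkerSweep(getWorkers, getActiveSessionIds, intervalMs = 60000) {
  const timer = setInterval(() => {
    const orphans = findOrphanedWorkers(getWorkers(), getActiveSessionIds());
    for (const worker of orphans) {
      // transport.close() is assumed to behave like the close() quoted in
      // the issue; failures are swallowed so one bad worker cannot stop the sweep.
      Promise.resolve()
        .then(() => worker.transport.close())
        .catch(() => {});
    }
  }, intervalMs);
  timer.unref(); // do not keep the daemon alive just for the sweep
  return () => clearInterval(timer); // call to stop sweeping
}
```

Keeping the orphan check as a pure function separates the policy (which workers count as orphaned) from the timer, so the policy can be tested without spawning processes.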

Defense in depth:

  • Workers should handle EPIPE/broken pipe on stdout/stdin gracefully and self-terminate instead of spinning
  • Workers could implement a heartbeat with the daemon and self-terminate if no heartbeat received within N seconds
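
Both worker-side defenses can be sketched in a few lines. The names installPipeGuards and HeartbeatWatchdog are hypothetical, the timeout is a placeholder for the unspecified N seconds, and how the daemon would actually deliver heartbeats is left open because the issue does not describe it.

```javascript
// Hypothetical guard: exit instead of retrying when a stdio pipe breaks.
// `exit` is injectable so the behavior can be exercised without ending
// the current process. Errors other than EPIPE are ignored in this sketch.
function installPipeGuards(streams, exit = (code) => process.exit(code)) {
  for (const stream of streams) {
    stream.on("error", (err) => {
      if (err && err.code === "EPIPE") exit(0);
    });
  }
}

// Hypothetical watchdog: tracks the time of the last heartbeat and reports
// when more than `timeoutMs` has passed since then. `now` is injectable
// so the clock can be controlled in tests.
class HeartbeatWatchdog {
  constructor(timeoutMs, now = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastBeat = now();
  }

  // Call whenever a heartbeat arrives from the daemon.
  beat() {
    this.lastBeat = this.now();
  }

  // True once the daemon has been silent for longer than the timeout.
  isExpired() {
    return this.now() - this.lastBeat > this.timeoutMs;
  }
}
```

A worker might call installPipeGuards([process.stdout, process.stderr]) at startup and poll isExpired() on a timer, exiting when it returns true. Either mechanism alone would stop the spin; together they cover both a broken pipe and a daemon that is alive but has forgotten the worker.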

Environment

  • Droid version: 0.88.0
  • OS: macOS 15.5 (Darwin 25.4.0), arm64
  • Hardware: Mac Studio M2 Ultra, 64GB RAM
  • Terminal: Kaku (WezTerm fork, Rust-based) — terminal crash triggered the issue, but the bug is in the daemon, not the terminal
  • Trigger: Ran /squad, terminal crashed under memory pressure from 63 concurrent worker output streams, all 63 droid exec processes survived and spun at 100% CPU
