Mission/subagent worker processes become orphaned at 100% CPU when TUI session dies — daemon never cleans up U2H transports #859

@harshav167

Summary

When a Droid TUI session terminates (terminal crash, force-quit, window close), the droid exec --input-format stream-jsonrpc worker processes spawned by the daemon keep running indefinitely and together saturate the CPU (15-40% per worker in the run below, 100% across all cores). They are never cleaned up and can crash the host machine.

Reproduction

  1. Start a Droid session, launch a Mission or use subagents that spawn multiple workers
  2. Force-quit the terminal (or let it crash under memory pressure from worker output)
  3. Workers survive and spin:
ps aux | grep 'droid exec' | grep -v grep | wc -l
# 63 orphaned processes

ps aux -r | head -10
# Each worker at 15-40% CPU, collectively pegging all cores
  4. Workers must be manually killed; they never self-terminate.

Impact

Critical — crashed a Mac Studio (M2 Ultra, 64GB). Orphaned workers consumed 100% CPU across all cores until macOS showed "Your system has run out of application memory."

Root Cause (from runtime debugging of v0.88.0)

How workers are spawned

TUI (droid)
  → daemon (factoryd) via WebSocket
    → handleInitializeSession()
      → new U2H() (transport)
        → vgH.spawn()
          → child_process.spawn("droid", ["exec", "--input-format", "stream-jsonrpc", ...])
          → stdio: ["pipe", "pipe", "pipe"]  ← all piped to daemon

Workers are spawned by the daemon (factoryd), not the TUI. The daemon is a separate long-lived process.

The S8A.close() method (works correctly when called)

async close() {
    this.isClosing = true;
    let proc = this._childProcess;
    // Send SIGTERM, wait 5s, then SIGKILL
    proc.stdin.end();
    proc.kill("SIGTERM");
    setTimeout(() => proc.kill("SIGKILL"), 5000);
}

This is fine — SIGTERM with a 5s SIGKILL fallback.

The bug: close() is never called when the TUI dies

  1. TUI crashes (terminal dies, window closes, Ctrl+C, etc.)
  2. TUI's WebSocket to the daemon drops
  3. Daemon detects the disconnection
  4. The orphan cleanup runs every 30 minutes (ORPHAN_CLEANUP_INTERVAL_MS = 1800000)
  5. But it only cleans up terminals, not droid exec worker processes

The orphan cleanup code:

setupOrphanCleanup() {
    this.terminalManager.startOrphanCleanup({
        intervalMs: 1800000,  // 30 MINUTES
        onOrphanFound: (H) => {
            // Only cleans up terminal associations
            this.terminalManager.removeTerminalAssociations(H);
            this.terminalManager.closeTerminal(H);
            // DOES NOT call S8A.close() on worker processes!
        }
    });
}

Three specific bugs

  1. No parent-death signal propagation: Workers are spawned with stdio: ["pipe","pipe","pipe"] but when the daemon's WebSocket to the TUI drops, nobody calls S8A.close() on the worker transport objects. The daemon keeps running, the workers keep running, but nobody is reading their output.

  2. Orphan cleanup only handles terminals: The 30-minute orphan sweep only cleans up terminal PTY associations. Worker droid exec processes spawned via U2H transport are not tracked or cleaned up by this mechanism.

  3. No process group / session leader: Workers are spawned as plain child processes of the daemon. They don't share a process group with the TUI, so when the TUI dies the workers' parent (the daemon) is still alive and they don't get SIGHUP. The daemon sees the WebSocket drop, but it never learns that the session's workers are now orphaned, because there's no health check on the session→worker mapping.

Why 100% CPU

When the TUI dies, nothing drains the workers' stdout or feeds their stdin any more: those pipes end at the daemon, which has lost the client the traffic was meant for. But the workers are still running the full droid agent loop, so they keep trying to stream LLM responses, execute tools, and write output to pipes that are broken or unread. Without proper EPIPE/SIGPIPE handling, they spin.

Suggested Fix

Immediate (handles the common case):

  • On WebSocket disconnect in _b.handleClose(), enumerate all active sessions for that connection
  • For each session, find associated U2H transport objects in the droidHandler (u8 class)
  • Call S8A.close() on each (SIGTERM → SIGKILL)
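
To make the immediate fix concrete, here is a minimal sketch of the bookkeeping it needs. Everything in it is hypothetical: TransportRegistry, track, untrack and closeAllFor are illustration names, not identifiers from the Droid bundle, and the only assumption about a transport is that it exposes a close() method like the one shown above.

```javascript
// Hypothetical registry: maps a connection id to the worker transports
// spawned on its behalf. None of these names come from the Droid codebase.
class TransportRegistry {
  constructor() {
    this.byConnection = new Map(); // connectionId -> Set of transports
  }

  // Record a transport when a session on this connection spawns a worker.
  track(connectionId, transport) {
    if (!this.byConnection.has(connectionId)) {
      this.byConnection.set(connectionId, new Set());
    }
    this.byConnection.get(connectionId).add(transport);
  }

  // Forget a transport whose worker exited on its own.
  untrack(connectionId, transport) {
    const transports = this.byConnection.get(connectionId);
    if (!transports) return;
    transports.delete(transport);
    if (transports.size === 0) this.byConnection.delete(connectionId);
  }

  // Intended to be called from the WebSocket close handler. Closes every
  // transport owned by the dropped connection and resolves with the number
  // of transports it attempted to close.
  closeAllFor(connectionId) {
    const transports = this.byConnection.get(connectionId);
    if (!transports) return Promise.resolve(0);
    this.byConnection.delete(connectionId);
    // The async wrapper turns a synchronous throw from close() into a
    // rejection, so one failing transport cannot block the others.
    const pending = [...transports].map(async (t) => t.close());
    return Promise.allSettled(pending).then((results) => results.length);
  }
}
```

With something like this in place, the disconnect handler only needs one call such as registry.closeAllFor(connectionId), and the existing SIGTERM → SIGKILL logic in close() does the rest.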

Safety net (handles edge cases):

  • Reduce ORPHAN_CLEANUP_INTERVAL_MS from 1800000 (30min) to 60000 (1min) or less
  • Extend orphan cleanup to also check for droid exec child processes whose parent session is no longer active
  • Consider spawning workers with a process group so kill(-pgid) can clean up the entire group
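
The process-group idea could look roughly like the sketch below. It relies on two documented Node.js behaviours on POSIX systems: spawning with detached: true makes the child the leader of a new process group, and process.kill() with a negative pid signals that whole group. The function names spawnWorkerInOwnGroup and killWorkerGroup are made up for illustration, and this is not how the daemon spawns workers today.

```javascript
const { spawn } = require("node:child_process");

// Spawn a worker as the leader of its own process group (POSIX only).
// With `detached: true` the child's pid doubles as its process group id.
function spawnWorkerInOwnGroup(command, args) {
  return spawn(command, args, {
    detached: true,
    stdio: ["pipe", "pipe", "pipe"], // same piping as the current spawn call
  });
}

// Signal the worker and everything it spawned by targeting the group.
// Returns false if there is nothing to signal (spawn failed or group gone).
function killWorkerGroup(child, signal = "SIGTERM") {
  if (typeof child.pid !== "number") return false; // spawn never succeeded
  try {
    process.kill(-child.pid, signal); // negative pid = whole process group
    return true;
  } catch (err) {
    if (err.code === "ESRCH") return false; // group no longer exists
    throw err;
  }
}
```

The trade-off is that a detached group no longer dies with the daemon's own group, so the daemon has to signal it explicitly on every shutdown path. The benefit is that tool subprocesses started by a worker are cleaned up together with it.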

Defense in depth:

  • Workers should handle EPIPE/broken pipe on stdout/stdin gracefully and self-terminate instead of spinning
  • Workers could implement a heartbeat with the daemon and self-terminate if no heartbeat received within N seconds
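
For the worker side, a rough sketch of both ideas follows. It assumes the worker is a Node.js process, where a write to a closed pipe surfaces as an 'error' event with code EPIPE on the stream. guardStream and HeartbeatWatchdog are hypothetical names, and the 15-second timeout in the wiring comment is an arbitrary placeholder for N.

```javascript
// Invoke onBroken when the reader of a stream has gone away (EPIPE).
// Other stream errors are deliberately ignored in this sketch.
function guardStream(stream, onBroken) {
  stream.on("error", (err) => {
    if (err.code === "EPIPE") onBroken(err);
  });
}

// Fires onTimeout if no heartbeat has been recorded within timeoutMs.
class HeartbeatWatchdog {
  constructor(timeoutMs, onTimeout) {
    this.timeoutMs = timeoutMs;
    this.onTimeout = onTimeout;
    this.lastBeat = Date.now();
    this.timer = null;
  }

  // Record a heartbeat received from the daemon.
  beat(now = Date.now()) {
    this.lastBeat = now;
  }

  // Returns true (and calls onTimeout) if the last heartbeat is too old.
  check(now = Date.now()) {
    if (now - this.lastBeat > this.timeoutMs) {
      this.onTimeout();
      return true;
    }
    return false;
  }

  // Poll periodically; unref() so the timer alone never keeps the worker alive.
  start(intervalMs = 1000) {
    this.timer = setInterval(() => this.check(), intervalMs);
    this.timer.unref();
  }

  stop() {
    clearInterval(this.timer);
  }
}

// Possible wiring inside a worker (not executed here):
//   guardStream(process.stdout, () => process.exit(1));
//   process.stdin.on("end", () => process.exit(0)); // daemon closed stdin
//   const dog = new HeartbeatWatchdog(15000, () => process.exit(1));
//   dog.start(); // and call dog.beat() on every heartbeat message
```

Either mechanism alone would bound how long an orphaned worker can run. Together they cover both a closed pipe and a daemon that is alive but no longer talking to the worker.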

Environment

  • Droid v0.88.0, macOS 26.4 (Build 25E246), arm64
  • Mac Studio M2 Ultra, 64GB RAM
