Mission/subagent worker processes become orphaned at 100% CPU when TUI session dies — daemon never cleans up U2H transports

## Summary

When a Droid TUI session terminates (terminal crash, force-quit, window close), all `droid exec --input-format stream-jsonrpc` worker processes spawned by the daemon keep running indefinitely and burn CPU (15-40% each in the capture below, enough to saturate every core). They are never cleaned up and can crash the host machine.

## Reproduction

1. Start a Droid session, launch a Mission or use subagents that spawn multiple workers
2. Force-quit the terminal (or let it crash under memory pressure from worker output)
3. Workers survive and spin:

```bash
ps aux | grep 'droid exec' | grep -v grep | wc -l
# 63 orphaned processes

ps aux -r | head -10
# Each worker at 15-40% CPU, collectively pegging all cores
```

4. Workers must be manually killed — they never self-terminate.

## Impact

**Critical** — crashed a Mac Studio (M2 Ultra, 64GB). Orphaned workers consumed 100% CPU across all cores until macOS showed "Your system has run out of application memory."

## Root Cause (from runtime debugging of v0.88.0)

### How workers are spawned

```
TUI (droid)
  → daemon (factoryd) via WebSocket
    → handleInitializeSession()
      → new U2H() (transport)
        → vgH.spawn()
          → child_process.spawn("droid", ["exec", "--input-format", "stream-jsonrpc", ...])
          → stdio: ["pipe", "pipe", "pipe"]  ← all piped to daemon
```

Workers are spawned by the **daemon** (`factoryd`), not the TUI. The daemon is a separate long-lived process.

### The S8A.close() method (works correctly when called)

```javascript
async close() {
    this.isClosing = true;
    let proc = this._childProcess;
    // Send SIGTERM, wait 5s, then SIGKILL
    proc.stdin.end();
    proc.kill("SIGTERM");
    setTimeout(() => proc.kill("SIGKILL"), 5000);
}
```

This is fine — SIGTERM with a 5s SIGKILL fallback.

### The bug: close() is never called when the TUI dies

1. **TUI crashes** (terminal dies, window closes, Ctrl+C, etc.)
2. TUI's WebSocket to the daemon **drops**
3. Daemon detects the disconnection
4. The orphan cleanup runs every **30 minutes** (`ORPHAN_CLEANUP_INTERVAL_MS = 1800000`)
5. But it only cleans up **terminals**, not `droid exec` worker processes

The orphan cleanup code:

```javascript
setupOrphanCleanup() {
    this.terminalManager.startOrphanCleanup({
        intervalMs: 1800000,  // 30 MINUTES
        onOrphanFound: (H) => {
            // Only cleans up terminal associations
            this.terminalManager.removeTerminalAssociations(H);
            this.terminalManager.closeTerminal(H);
            // DOES NOT call S8A.close() on worker processes!
        }
    });
}
```

### Three specific bugs

1. **No parent-death signal propagation**: Workers are spawned with `stdio: ["pipe","pipe","pipe"]` but when the daemon's WebSocket to the TUI drops, nobody calls `S8A.close()` on the worker transport objects. The daemon keeps running, the workers keep running, but nobody is reading their output.

2. **Orphan cleanup only handles terminals**: The 30-minute orphan sweep only cleans up terminal PTY associations. Worker `droid exec` processes spawned via `U2H` transport are not tracked or cleaned up by this mechanism.

3. **No process group / session leader**: Workers are spawned as plain child processes of the daemon and do not share a process group with the TUI. When the TUI dies, their parent (the daemon) is still alive, so they never receive SIGHUP. The daemon sees the WebSocket drop but never learns that the session's workers are now orphaned, because nothing health-checks the session→worker mapping.

### Why 100% CPU

When the TUI dies, the workers' stdin/stdout pipes effectively go dead: they terminate at the daemon, which no longer has a client to forward to. The workers, however, are still running the full droid agent loop. They keep trying to stream LLM responses, execute tools, and write output that nobody reads. Without proper EPIPE/SIGPIPE handling, they most likely spin.

## Suggested Fix

**Immediate** (handles the common case):
- On WebSocket disconnect in `_b.handleClose()`, enumerate all active sessions for that connection
- For each session, find associated `U2H` transport objects in the `droidHandler` (`u8` class)
- Call `S8A.close()` on each (SIGTERM → SIGKILL)
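
The immediate fix could be sketched as follows. This is a minimal illustration rather than Droid's code: `TransportRegistry`, its methods, and the connection ids are hypothetical names, and each transport is only assumed to expose an async `close()` like the one shown earlier.

```javascript
// Hypothetical registry mapping a WebSocket connection id to the worker
// transports started on its behalf. None of these names come from Droid.
class TransportRegistry {
  constructor() {
    this.byConnection = new Map(); // connectionId -> Set of transports
  }

  register(connectionId, transport) {
    if (!this.byConnection.has(connectionId)) {
      this.byConnection.set(connectionId, new Set());
    }
    this.byConnection.get(connectionId).add(transport);
  }

  unregister(connectionId, transport) {
    const transports = this.byConnection.get(connectionId);
    if (!transports) return;
    transports.delete(transport);
    if (transports.size === 0) this.byConnection.delete(connectionId);
  }

  // Intended to be called from the WebSocket close handler. Closes every
  // transport owned by the connection; one failing close() does not stop
  // the others. Resolves to the number of transports closed successfully.
  async closeAll(connectionId) {
    const transports = this.byConnection.get(connectionId);
    if (!transports) return 0;
    this.byConnection.delete(connectionId);
    const results = await Promise.allSettled(
      [...transports].map((t) => Promise.resolve().then(() => t.close()))
    );
    return results.filter((r) => r.status === "fulfilled").length;
  }
}
```

The daemon's disconnect handler would then call `registry.closeAll(connectionId)`. How that maps onto `_b.handleClose()` and the `u8` handler is unverified.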

**Safety net** (handles edge cases):
- Reduce `ORPHAN_CLEANUP_INTERVAL_MS` from 1800000 (30min) to 60000 (1min) or less
- Extend orphan cleanup to also check for `droid exec` child processes whose parent session is no longer active
- Consider spawning workers with a process group so `kill(-pgid)` can clean up the entire group
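
The process-group idea can be sketched with Node's `child_process` API. This is a POSIX-only illustration under assumptions: `spawnInOwnGroup` and `killGroup` are hypothetical helper names, and nothing here is taken from how `vgH.spawn()` actually works.

```javascript
const { spawn } = require("node:child_process");

// Hypothetical helper: start a worker as the leader of its own process group.
// On non-Windows platforms, `detached: true` makes the child the leader of a
// new process group and session, so its group id equals its pid.
function spawnInOwnGroup(command, args) {
  return spawn(command, args, {
    detached: true,
    stdio: ["pipe", "pipe", "pipe"], // same piping as the original spawn call
  });
}

// Hypothetical helper: signal the worker and everything it spawned.
// A negative pid passed to process.kill() targets the whole process group.
function killGroup(child, signal = "SIGTERM") {
  if (child.pid === undefined) return false; // spawn failed, nothing to kill
  try {
    process.kill(-child.pid, signal);
    return true;
  } catch (err) {
    if (err.code === "ESRCH") return false; // group has already exited
    throw err;
  }
}
```

With this in place, a cleanup path could call `killGroup(child)` and escalate to `killGroup(child, "SIGKILL")` after a grace period, mirroring the 5s fallback in `close()`.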

**Defense in depth**:
- Workers should handle EPIPE/broken pipe on stdout/stdin gracefully and self-terminate instead of spinning
- Workers could implement a heartbeat with the daemon and self-terminate if no heartbeat received within N seconds
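
Both worker-side defenses could look roughly like the sketch below. It assumes a heartbeat message from the daemon that does not exist today, and `HeartbeatWatchdog` and `exitOnBrokenPipe` are hypothetical names, not part of Droid.

```javascript
// Hypothetical watchdog: the worker calls beat() whenever a heartbeat arrives
// from the daemon and self-terminates via onExpired() if none arrives in time.
class HeartbeatWatchdog {
  constructor({ timeoutMs, onExpired, now = Date.now }) {
    this.timeoutMs = timeoutMs;
    this.onExpired = onExpired;
    this.now = now; // injectable clock, mainly for testing
    this.lastBeat = now();
    this.expired = false;
  }

  beat() {
    this.lastBeat = this.now();
  }

  // Fires onExpired() at most once; returns whether the watchdog has expired.
  check() {
    if (!this.expired && this.now() - this.lastBeat > this.timeoutMs) {
      this.expired = true;
      this.onExpired();
    }
    return this.expired;
  }

  start(intervalMs = 1000) {
    const timer = setInterval(() => this.check(), intervalMs);
    timer.unref(); // the watchdog alone should not keep the process alive
    return timer;
  }
}

// Exit instead of retrying when the reader on the other end of a pipe is gone.
function exitOnBrokenPipe(stream, exit = () => process.exit(1)) {
  stream.on("error", (err) => {
    if (err.code === "EPIPE") exit();
    else throw err; // keep the default crash behavior for other errors
  });
}
```

A worker might wire this up with `exitOnBrokenPipe(process.stdout)` and `new HeartbeatWatchdog({ timeoutMs: 30000, onExpired: () => process.exit(1) }).start()`. The 30-second timeout is an arbitrary placeholder for the "N seconds" above.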

## Environment

- Droid v0.88.0, macOS 26.4 (Build 25E246), arm64
- Mac Studio M2 Ultra, 64GB RAM
