Squad/Mission worker processes become orphaned and spin at 100% CPU when TUI session dies

## Summary

When a Droid TUI session terminates unexpectedly (terminal crash, window close, network drop), all `droid exec --input-format stream-jsonrpc` worker processes spawned by the daemon for Squad/Mission sessions continue running indefinitely. They are never cleaned up. Together they saturate every CPU core (15-40% each in the 63-worker case below) and can exhaust memory, crashing the host machine.

## Reproduction

1. Start a Droid session in any terminal
2. Launch a Squad (`/squad`) or Mission (`/enter-mission`) that spawns multiple workers
3. Kill the terminal application (not `/exit` — simulate a crash: `kill -9` the terminal PID, close the window, or let it crash)
4. Observe: all `droid exec` worker processes remain running

```bash
# After killing the terminal:
ps aux | grep 'droid exec' | grep -v grep | wc -l
# Returns 63 (or however many workers were spawned)

# Each at 15-40% CPU, collectively pegging all cores:
ps aux -r | head -20
```

## Impact

- **Severity: Critical.** This crashed my Mac Studio (M2 Ultra, 64GB): 63 orphaned workers saturated all cores, and macOS showed the "Your system has run out of application memory" force-quit dialog.
- Workers never self-terminate. They must be killed manually, e.g. `pkill -9 -f 'droid exec'`. Note that this also kills any `droid exec` processes that are still in legitimate use.
- The 30-minute orphan cleanup interval (`ORPHAN_CLEANUP_INTERVAL_MS = 1800000`) only handles terminal PTY associations, not `droid exec` worker processes.

## Root Cause (from binary analysis of v0.88.0)

### Architecture

```
TUI (droid) ←WebSocket→ Daemon (factoryd)
                           ↓ child_process.spawn()
                     droid exec --input-format stream-jsonrpc (worker 1)
                     droid exec --input-format stream-jsonrpc (worker 2)
                     ...
                     droid exec --input-format stream-jsonrpc (worker N)
```

Workers are spawned by the **daemon** (`factoryd`) as child processes, not by the TUI. The daemon uses `child_process.spawn()` with `stdio: ["pipe","pipe","pipe"]` (class `vgH` → `S8A` wrapper → `U2H` transport).

### What happens when TUI dies

1. TUI process dies → WebSocket to daemon drops
2. Daemon detects the WebSocket close event
3. **Bug**: The daemon does NOT call `S8A.close()` on the worker transport objects associated with that session
4. Worker processes continue running with broken stdio pipes
5. Workers enter a hot loop (likely retrying writes to broken pipes without proper EPIPE handling)
6. The existing `setupOrphanCleanup()` only cleans up terminal PTY associations via `terminalManager`, not `U2H`-managed worker transports

### The close() method exists but is never called

`S8A.close()` correctly sends SIGTERM → waits 5s → SIGKILL:

```javascript
async close() {
    let proc = this._childProcess;
    proc.stdin.end();                             // close the worker's stdin
    proc.kill("SIGTERM");                         // ask the worker to exit
    setTimeout(() => proc.kill("SIGKILL"), 5000); // force-kill after 5s
}
```

This method works when called (e.g., normal session end via `/exit`). It is simply never invoked when the TUI disconnects unexpectedly.

### Three specific bugs

1. **No session→worker cleanup on WebSocket disconnect**: When `handleClose()` fires for a TUI WebSocket, the daemon should enumerate all `U2H` transport objects associated with that client's sessions and call `close()` on each.

2. **Orphan cleanup scope too narrow**: `setupOrphanCleanup()` only sweeps terminal PTYs. Worker `droid exec` processes spawned via `U2H` are not tracked or swept.

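3. **No process group isolation**: Workers are children of the daemon, not of the TUI, so they sit outside the TUI's session and process group. They never receive SIGHUP when the terminal goes away, because the daemon (their actual parent) is still alive. They are also not placed in a per-session process group that the daemon could signal as a unit.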
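
For the third point, here is a minimal sketch of group-level cleanup using Node's `detached` spawn option on POSIX. `spawnInOwnGroup` and `killGroup` are hypothetical helper names, not identifiers from the bundle, and I have not verified how the daemon actually configures `spawn()` beyond the `stdio` setting noted above.

```javascript
const { spawn } = require("node:child_process");

// Spawn a worker as the leader of its own process group.
// With detached: true on POSIX, the child becomes a group leader,
// so its pid doubles as its process group id.
function spawnInOwnGroup(command, args) {
  return spawn(command, args, {
    detached: true,
    stdio: ["pipe", "pipe", "pipe"],
  });
}

// Signal every process in the worker's group. A negative pid targets
// the whole group. Returns false if the group no longer exists.
function killGroup(child, signal = "SIGTERM") {
  if (typeof child.pid !== "number") return false; // spawn failed
  try {
    process.kill(-child.pid, signal);
    return true;
  } catch (err) {
    if (err.code === "ESRCH") return false; // already gone
    throw err;
  }
}
```

With this shape, anything the worker itself spawns stays in the same group unless it deliberately leaves it, so one `killGroup(worker)` reaches the whole subtree.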

## Suggested Fix

**Immediate** (handles the common case):
- On WebSocket disconnect in `_b.handleClose()`, look up all active sessions for that connection
- For each session, find associated `U2H` transport objects in the `droidHandler` (`u8` class)
- Call `S8A.close()` on each (SIGTERM → SIGKILL)
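
The bookkeeping this needs could look roughly like the sketch below, assuming the daemon can tag each worker transport with the connection that owns it. `WorkerRegistry`, `track`, `untrack`, and `closeAllFor` are hypothetical names; the only thing assumed about a transport is that it exposes the `close()` method described above.

```javascript
// Hypothetical registry: maps a TUI connection id to the worker
// transports spawned on its behalf. Names are illustrative only.
class WorkerRegistry {
  constructor() {
    this.byConnection = new Map(); // connectionId -> Set<transport>
  }

  // Record a transport when the daemon spawns a worker for a connection.
  track(connectionId, transport) {
    let set = this.byConnection.get(connectionId);
    if (!set) {
      set = new Set();
      this.byConnection.set(connectionId, set);
    }
    set.add(transport);
  }

  // Forget a transport once its worker has exited normally.
  untrack(connectionId, transport) {
    const set = this.byConnection.get(connectionId);
    if (!set) return;
    set.delete(transport);
    if (set.size === 0) this.byConnection.delete(connectionId);
  }

  // Call from the WebSocket close handler. Returns how many transports
  // were asked to close; one failing close() does not stop the rest.
  closeAllFor(connectionId) {
    const set = this.byConnection.get(connectionId);
    if (!set) return 0;
    this.byConnection.delete(connectionId);
    let asked = 0;
    for (const transport of set) {
      try {
        // close() may be async; swallow rejections so cleanup continues.
        Promise.resolve(transport.close()).catch(() => {});
      } catch (_) {
        // synchronous throw from close(): ignore and keep going
      }
      asked += 1;
    }
    return asked;
  }
}
```

The close handler would then only need the connection id: `registry.closeAllFor(connectionId)`.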

**Safety net** (handles edge cases):
- Reduce `ORPHAN_CLEANUP_INTERVAL_MS` from 1800000 (30min) to 60000 (1min)
- Extend orphan cleanup to also check for `droid exec` child processes whose parent session is no longer active
- Consider spawning workers with a process group so `kill(-pgid)` can clean up the entire group
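
A rough sketch of the extended sweep, under the assumption that the daemon can list its spawned workers along with the session each belongs to. `findOrphanedWorkers`, `startOrphanSweep`, and the injected callbacks are all hypothetical, since the real daemon internals are only known from the minified bundle.

```javascript
// Hypothetical sweep: given the workers the daemon has spawned and the
// set of session ids that still have a live client, return the orphans.
function findOrphanedWorkers(workers, activeSessionIds) {
  return workers.filter((w) => !activeSessionIds.has(w.sessionId));
}

// Run the sweep on a timer. listWorkers, listActiveSessions, and
// closeWorker are injected because the daemon's real APIs are unknown.
function startOrphanSweep({
  listWorkers,
  listActiveSessions,
  closeWorker,
  intervalMs = 60_000, // the 1-minute interval suggested above
}) {
  const sweep = () => {
    const orphans = findOrphanedWorkers(listWorkers(), listActiveSessions());
    for (const w of orphans) closeWorker(w);
    return orphans.length;
  };
  const timer = setInterval(sweep, intervalMs);
  timer.unref(); // do not keep the daemon alive just for the sweep
  return { sweep, stop: () => clearInterval(timer) };
}
```

Returning `sweep` alongside `stop` lets the daemon also trigger a sweep immediately after a disconnect instead of waiting for the next tick.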

**Defense in depth**:
- Workers should handle EPIPE/broken pipe on stdout/stdin gracefully and self-terminate instead of spinning
- Workers could implement a heartbeat with the daemon and self-terminate if no heartbeat received within N seconds
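
On the worker side, both ideas could be sketched as follows. This assumes the worker talks to the daemon over its stdin/stdout pipes, as the `stdio: ["pipe","pipe","pipe"]` spawn suggests. `installPipeGuards` and `HeartbeatWatchdog` are hypothetical names, and the timeout value is left to the caller.

```javascript
// Worker-side guard (hypothetical): exit instead of retrying when the
// pipes to the daemon break. Streams and exit are parameters so the
// logic can be exercised without killing the current process.
function installPipeGuards({ stdin, stdout, exit }) {
  // A write to a closed pipe surfaces as an 'error' event with code EPIPE.
  // Other stdout errors are ignored by this sketch.
  stdout.on("error", (err) => {
    if (err && err.code === "EPIPE") exit(0);
  });
  // The daemon closing its end of stdin shows up as 'end' / 'close'.
  stdin.on("end", () => exit(0));
  stdin.on("close", () => exit(0));
}

// Heartbeat watchdog (hypothetical): the worker calls beat() whenever it
// hears from the daemon and polls isExpired() on a timer.
class HeartbeatWatchdog {
  constructor(timeoutMs, now = Date.now) {
    this.timeoutMs = timeoutMs;
    this.now = now;
    this.lastBeat = now();
  }

  beat() {
    this.lastBeat = this.now();
  }

  // True once more than timeoutMs has passed since the last beat.
  isExpired() {
    return this.now() - this.lastBeat > this.timeoutMs;
  }
}
```

In a real worker this would be wired up as `installPipeGuards({ stdin: process.stdin, stdout: process.stdout, exit: process.exit })`. The `end` event only fires while stdin is being read, which should already be the case for a worker consuming JSON-RPC from it.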

## Environment

- **Droid version**: 0.88.0
- **OS**: macOS 15.5 (Darwin 25.4.0), arm64
- **Hardware**: Mac Studio M2 Ultra, 64GB RAM
- **Terminal**: Kaku (WezTerm fork, Rust-based) — terminal crash triggered the issue, but the bug is in the daemon, not the terminal
- **Trigger**: Ran `/squad`, terminal crashed under memory pressure from 63 concurrent worker output streams, all 63 `droid exec` processes survived and spun at 100% CPU
