Auto-run stop/kill unreliable — hung processes survive termination #733

@kringle25

Problem

Stopping a hung or non-progressing auto-run frequently does not work. The UI appears to accept the stop request but the underlying process continues running, forcing a full app restart to recover.

Root Cause Analysis

Extracted and analyzed the compiled source (ProcessManager.js, web-server-factory.js, messageHandlers.js, IPC handlers). The stop mechanism has several architectural weaknesses:

1. Six-hop fire-and-forget kill chain

UI "Stop" → WebSocket → Web Server → IPC to renderer → Renderer handler → IPC to main → ProcessManager.kill()

If any link in this chain is blocked (renderer busy, IPC backed up, handler not registered), the stop request silently disappears. The stopAutoRun callback returns true/false but is fire-and-forget with no confirmation back to the UI.
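One way to close the confirmation gap described above is to track every stop request until the main process reports that the agent has actually exited. The sketch below is illustrative only: `StopRequestTracker`, its method names, and the session IDs are hypothetical and not taken from Maestro's code. Timestamps are passed in explicitly so the logic does not depend on any particular timer API.

```typescript
// Hypothetical tracker; names are illustrative, not Maestro's actual API.
class StopRequestTracker {
  // sessionId -> time (ms) at which the first stop request was sent
  private readonly sentAt = new Map<string, number>();

  /** Record that a stop request for `sessionId` was sent at time `now`. */
  send(sessionId: string, now: number): void {
    if (!this.sentAt.has(sessionId)) this.sentAt.set(sessionId, now);
  }

  /** Call when the main process confirms that the agent process exited. */
  confirm(sessionId: string): void {
    this.sentAt.delete(sessionId);
  }

  /** Sessions whose stop request has been unconfirmed for at least `timeoutMs`. */
  overdue(now: number, timeoutMs: number): string[] {
    const result: string[] = [];
    this.sentAt.forEach((sent, id) => {
      if (now - sent >= timeoutMs) result.push(id);
    });
    return result;
  }
}
```

With something like this in the UI layer, a request that is lost anywhere in the chain shows up as overdue instead of disappearing, which is the point at which a fallback action can be offered.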

2. Process removed from tracking before confirmed dead

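ProcessManager.kill() removes the process from the processes Map immediately (line ~239), then sets a 2-second timer to escalate to SIGKILL. Because the entry is already gone, nothing checks whether the process actually exited. If SIGKILL cannot take effect yet (for example, because the process is blocked in uninterruptible I/O), Maestro treats the process as dead while it is still running.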
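A minimal sketch of the alternative ordering, assuming a registry keyed by PID: the entry moves to a "stopping" state when the kill is requested and is removed only when an exit is confirmed. `ProcessRegistry` and its methods are hypothetical names for illustration; the real ProcessManager may be structured differently.

```typescript
// Hypothetical registry; names are illustrative, not Maestro's actual API.
type TrackedState = "running" | "stopping";

class ProcessRegistry {
  private readonly states = new Map<number, TrackedState>();

  /** Start tracking a freshly spawned process. */
  track(pid: number): void {
    this.states.set(pid, "running");
  }

  /** Mark a kill as requested. The entry is kept so escalation can still find it. */
  markStopping(pid: number): boolean {
    if (!this.states.has(pid)) return false;
    this.states.set(pid, "stopping");
    return true;
  }

  /** Intended to be called only from the child's exit handler, where death is confirmed. */
  confirmExit(pid: number): void {
    this.states.delete(pid);
  }

  /** State to show in the UI; undefined means the process is confirmed gone. */
  stateOf(pid: number): TrackedState | undefined {
    return this.states.get(pid);
  }
}
```

The design choice here is that "not in the map" always means "confirmed exited", so the UI can keep showing "stopping…" for as long as the entry exists.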

3. No process-group killing

The kill appears to target only the agent's own PID. Child processes spawned by the CLI agent (tool calls, bash commands, subagents) are not signalled, and may not even be in the same process group. Killing the parent leaves orphaned children holding resources, which keeps the session effectively hung.

4. Race condition with auto-run re-spawn

If the auto-run engine advances to the next task while the kill is in flight, the new spawn can race with the termination — the kill targets a process that's already been replaced.
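One possible guard against this race, sketched under the assumption that each auto-run session spawns one agent process at a time: give every spawn a generation number, have the stop request capture the generation it targets, and refuse new spawns while a stop is pending. `SpawnGuard` and its methods are hypothetical and not part of Maestro.

```typescript
// Hypothetical guard, one instance per auto-run session.
class SpawnGuard {
  private generation = 0;
  private stopRequested = false;

  /** Returns a new generation number, or null if a stop is pending. */
  trySpawn(): number | null {
    if (this.stopRequested) return null;
    this.generation += 1;
    return this.generation;
  }

  /** Blocks further spawns and returns the generation the kill should target. */
  requestStop(): number {
    this.stopRequested = true;
    return this.generation;
  }

  /** A kill is valid only if no newer process has replaced its target. */
  isCurrent(generation: number): boolean {
    return generation === this.generation;
  }

  /** Call once the targeted process is confirmed dead and the run may continue. */
  resume(): void {
    this.stopRequested = false;
  }
}
```

Because the auto-run engine would have to go through `trySpawn()`, it cannot advance to the next task between the stop request and the confirmed exit.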

5. Silent error swallowing in escalation path

On macOS/Linux, the SIGTERM → SIGKILL escalation has no verification step. On Windows, taskkill errors are logged at debug level only (line ~260) and don't trigger further action.
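A verification step could be as small as a liveness probe run after the escalation timer fires. The sketch below relies on the documented behaviour of signal 0, which performs the existence and permission checks without delivering a signal. The probe function is injected so the logic can be exercised without real processes; in Node it could be `(pid, signal) => process.kill(pid, signal)`. The names `ProbeFn` and `isAlive` are hypothetical.

```typescript
// Hypothetical helper; the probe is injected rather than calling process.kill directly.
type ProbeFn = (pid: number, signal: number) => void;

/** Returns false only when the probe reports that no such process exists. */
function isAlive(pid: number, probe: ProbeFn): boolean {
  try {
    probe(pid, 0); // signal 0: error checking only, nothing is delivered
    return true;
  } catch (err) {
    const code = (err as { code?: string }).code;
    // ESRCH means the process is gone. Any other error (for example EPERM,
    // where the process exists but may not be signalled) is treated as alive.
    return code !== "ESRCH";
  }
}
```

If `isAlive` still returns true after SIGKILL, that is a condition worth logging above debug level and surfacing to the user instead of assuming success.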

Proposed Fix

  1. Process-group killing — Spawn agent processes with setsid / detached: true and kill via kill(-pgid, signal) so the entire process tree dies together.
  2. Synchronous kill confirmation — Don't report "stopped" to the UI until waitpid or a process exit event confirms death. Keep a "stopping…" state visible to the user.
  3. Direct main-process kill path — Add a process:forceKill IPC handler in the main process that bypasses the renderer entirely. The renderer may itself be blocked by a hung IPC call to the agent.
  4. Timeout + auto-escalate — If soft stop (SIGINT → SIGTERM) doesn't produce a confirmed exit within 5 seconds, automatically escalate to SIGKILL on the entire process group.
  5. UI fallback button — If the first stop attempt doesn't confirm within ~3 seconds, surface a "Force Kill" button that uses the direct main-process path.
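Items 1 and 4 above can be sketched together as a small escalation routine. This is a sketch under stated assumptions, not a proposed implementation: `stopProcessGroup`, `ESCALATION`, and the injected `kill`, `schedule`, `hasExited`, and `onError` callbacks are hypothetical names, and the 2000 ms delay before SIGTERM is an assumed value. Only the 5000 ms SIGKILL deadline comes from the proposal.

```typescript
// Hypothetical helper names. The 2000 ms SIGTERM delay is an assumption;
// the 5000 ms SIGKILL deadline follows item 4 of the proposal.
type StopSignal = "SIGINT" | "SIGTERM" | "SIGKILL";
type KillFn = (pid: number, signal: StopSignal) => void;
type ScheduleFn = (fn: () => void, delayMs: number) => void;

const ESCALATION: Array<{ atMs: number; signal: StopSignal }> = [
  { atMs: 0, signal: "SIGINT" },
  { atMs: 2000, signal: "SIGTERM" },
  { atMs: 5000, signal: "SIGKILL" },
];

function stopProcessGroup(
  pgid: number,
  kill: KillFn,
  schedule: ScheduleFn,
  hasExited: () => boolean,
  onError: (signal: StopSignal, err: unknown) => void
): void {
  for (const step of ESCALATION) {
    schedule(() => {
      if (hasExited()) return; // exit already confirmed, nothing to escalate
      try {
        kill(-pgid, step.signal); // a negative PID addresses the whole process group
      } catch (err) {
        onError(step.signal, err); // surface failures instead of swallowing them
      }
    }, step.atMs);
  }
}
```

In Node this could be wired up with `kill = (pid, signal) => process.kill(pid, signal)` and `schedule = (fn, ms) => { setTimeout(fn, ms); }`, with the agent spawned using `detached: true` so that on macOS and Linux its PID is also its process-group ID. Negative PIDs do not apply on Windows, where `taskkill /T` is the closest analogue.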

Environment

  • macOS Darwin 25.4.0
  • Maestro desktop app (Electron)
  • Agent: Claude Code CLI

Labels

  • bug