
Harden Every Code session cleanup and worker recovery #417

@cbusillo

Objective

Make Every Code local sessions fail closed and clean up reliably when the worker crashes, API access is unavailable, or a session exits before producing a PR.

Finish Line

Every Code local sessions fail closed and clean up after worker/API failures

Current Status

State: Partially done, reopened. PR #424 shipped terminal cleanup/no-PR-exit hardening, but SYO #97 showed the local worker can still die permanently on a Launchplane API read timeout and leave a stale pid file.
Next action: Add worker API timeout resilience/self-restart behavior so transient Launchplane API failures do not stop the daemon indefinitely; make stale pid repair automatic in the wrapper/start path.
Blocked by: None.
Last verified: 2026-05-07. SYO #97 was queued by the webhook at 21:46:11Z but stayed unclaimed until manual worker restart at 22:21:00Z. After restart, it transitioned to running and opened tmux session every-code-every-code-cbusillo-sellyouroutboard-97-d78b8a8f38a15b57.
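
One way to keep a transient Launchplane API read timeout from stopping the daemon is to wrap each API call in a retry loop with capped exponential backoff. The sketch below is illustrative only: `call_with_retry`, `backoff_delays`, the `on_failure` hook, and the choice of `TimeoutError`/`ConnectionError` as the transient errors are all assumptions, not the worker's actual code or the Launchplane client API.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, cap: float = 60.0) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield delay
        delay = min(cap, delay * 2)


def call_with_retry(
    fetch: Callable[[], T],
    *,
    max_attempts: Optional[int] = None,  # None means retry forever
    base: float = 1.0,
    cap: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    on_failure: Callable[[int, Exception], None] = lambda attempt, exc: None,
) -> T:
    """Call `fetch` until it succeeds, sleeping between transient failures."""
    delays = backoff_delays(base, cap)
    attempt = 0
    while True:
        attempt += 1
        try:
            return fetch()
        except (TimeoutError, ConnectionError) as exc:
            # Hypothetical hook: emit a visible health signal (log, status file).
            on_failure(attempt, exc)
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Full jitter spreads retries out after a shared outage.
            sleep(random.uniform(0, next(delays)))
```

With `max_attempts=None` the loop never gives up, so the `on_failure` hook is what keeps the failure visible. A real HTTP client may raise its own timeout type, in which case the `except` clause would need to name it.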

Scope

  • Local Every Code worker lifecycle and restart behavior.
  • Session/process reconciliation for tmux sessions, DUI threads, worktrees, and child dev servers.
  • Handling of work requests that never produce result_pr_url.
  • Handling of closed source issues/PRs with pending feedback records.
  • Diagnostics for stale pid files and API auth/endpoint failures.
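
As a rough illustration of stale pid diagnostics, the wrapper's status/start path could check whether the recorded pid still refers to a live process and remove the file when it does not. This is a minimal POSIX-only sketch; the function names and the pid file layout (a single integer in a text file) are assumptions.

```python
import os
from pathlib import Path
from typing import Optional


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def repair_stale_pid_file(pid_file: Path) -> Optional[int]:
    """Return the live worker pid, or remove a stale pid file and return None."""
    try:
        text = pid_file.read_text().strip()
    except FileNotFoundError:
        return None
    try:
        pid = int(text)
    except ValueError:
        pid = -1  # unreadable contents are treated as stale
    if pid_is_alive(pid):
        return pid
    pid_file.unlink(missing_ok=True)
    return None
```

A pid-only check can be fooled by pid reuse, so a production version would probably also compare the process command line or start time before trusting the file.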

Acceptance Criteria

  • Worker API/auth failures do not permanently stop the worker without a visible health signal or restart path.
  • Stale pid files are detected and repaired by the worker wrapper/status/start flow.
  • If an Every Code session exits before creating a PR, the work request moves to blocked or another terminal state with a useful error summary instead of remaining in running indefinitely.
  • If a source issue/PR closes, any matching running session is terminated and associated child processes are cleaned up.
  • Session cleanup terminates dev servers spawned under the session worktree, not just the tmux session.
  • Pending feedback for terminal requests is marked ignored or otherwise excluded from future delivery.
  • DUI thread close state follows the same terminal request/session state as the TUI/tmux session.
  • Tests cover worker API failure retry/recovery, no-PR early exit, closed issue cleanup, child process cleanup, and pending feedback cleanup.
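
For the child process criterion, one possible approach is to start each session command as the leader of its own process group, so that cleanup can signal the whole group rather than a single pid. The sketch below assumes the worker launches the session command itself; the function names are hypothetical, and processes that tmux starts on its own server, or that detach into a new session, would need a different discovery method that is not shown here.

```python
import os
import signal
import subprocess
import time
from typing import Sequence


def start_session_process(argv: Sequence[str], cwd: str) -> subprocess.Popen:
    """Start a session command as the leader of a new process group."""
    # start_new_session=True calls setsid() in the child, so anything the
    # command spawns (for example a dev server) stays in the same group
    # unless it deliberately detaches.
    return subprocess.Popen(argv, cwd=cwd, start_new_session=True)


def terminate_session_group(proc: subprocess.Popen, grace_seconds: float = 5.0) -> None:
    """Send SIGTERM to the whole group, then SIGKILL whatever is left."""
    pgid = proc.pid  # after setsid() the leader's pid is also the group id
    try:
        os.killpg(pgid, signal.SIGTERM)
    except (ProcessLookupError, PermissionError):
        # Group already gone; some BSD-derived systems report EPERM
        # when only zombies remain.
        proc.wait()
        return
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline and proc.poll() is None:
        time.sleep(0.05)
    try:
        os.killpg(pgid, signal.SIGKILL)  # sweep up any stragglers
    except (ProcessLookupError, PermissionError):
        pass
    proc.wait()
```

A call such as `terminate_session_group(proc)` would then replace a plain `proc.terminate()`, which only signals the leader and can leave a dev server running under the worktree.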

Relationships

Related to Every Code preview validation and feedback routing work (#373, #393, #394, #406).

Validation

  • Unit tests for reconciliation decisions.
  • Local smoke: create or simulate an Every Code request that exits before PR creation and confirm no tmux/dev-server/DUI residue remains.
  • Local smoke: stop/restart worker during pending feedback and confirm it recovers without stale sessions.

Decisions

  • Treat stale local sessions as operational debt that Launchplane should actively reconcile, not as manual cleanup.
  • Prefer terminal request state plus visible summary over leaving a session open when no PR exists.
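
These decisions could be captured as a small pure function, which would also make the reconciliation logic easy to unit test. Everything in the sketch below is hypothetical: the `RequestSnapshot` fields, the `Action` names, and the choice of blocked for a no-PR exit (which is still an open question) are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    NOTHING = "nothing"
    MARK_BLOCKED = "mark_blocked"
    TERMINATE_SESSION = "terminate_session"


@dataclass(frozen=True)
class RequestSnapshot:
    state: str                    # assumed values: "queued", "running", "blocked", "done"
    session_alive: bool           # does the tmux session still exist?
    source_open: bool             # is the source issue/PR still open?
    result_pr_url: Optional[str]  # None until the session produces a PR


@dataclass(frozen=True)
class Decision:
    action: Action
    summary: str


def reconcile(req: RequestSnapshot) -> Decision:
    """Decide what the reconciler should do for one work request."""
    if req.state != "running":
        return Decision(Action.NOTHING, "request is not running")
    if not req.source_open:
        return Decision(
            Action.TERMINATE_SESSION,
            "source issue/PR closed; terminate session and child processes",
        )
    if not req.session_alive and req.result_pr_url is None:
        return Decision(Action.MARK_BLOCKED, "session exited before creating a PR")
    return Decision(Action.NOTHING, "no reconciliation needed")


def should_deliver_feedback(req: RequestSnapshot) -> bool:
    """Pending feedback is only delivered to requests that are still running."""
    return req.state == "running"
```

Keeping the decision separate from the side effects (killing tmux sessions, updating request state) means the same function can drive both the live reconciler and the tests.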

Open Questions

  • Should a no-PR early exit be represented as blocked, as done with no result, or as a new terminal state?
  • Should the worker wrapper be launchd-managed, self-restarting, or both?

Labels

plan (Durable planning issue), plan:active (Current active plan)
