
fix: detect engine task death mid-turn and recover UI immediately#2585

Open
gordonlu wants to merge 5 commits into Hmbown:main from gordonlu:feat/engine-death-recovery

@gordonlu gordonlu commented Jun 2, 2026

Problem

When the engine task panics (or exits unexpectedly) between TurnStarted and TurnComplete, the event channel tx_event is dropped and the mpsc channel closes silently. The UI's event loop uses while let Ok(event) = rx.try_recv() — on Err(TryRecvError::Disconnected) the loop simply exits without any recovery. The turn remains stuck with is_loading=true and runtime_turn_status=Some("in_progress") until the 300-second TURN_STALL_WATCHDOG_TIMEOUT fires in reconcile_turn_liveness().
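The silent exit described above can be reproduced in isolation. The following is a minimal sketch that assumes `std::sync::mpsc` as a stand-in for the project's channel (the real channel type is not shown in this PR); `drain_count` is a hypothetical helper that mirrors the `while let Ok(event)` drain pattern.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};

/// Mirrors the drain pattern from the description: any `Err` ends the loop,
/// so "empty" and "disconnected" look identical to the caller.
/// (Hypothetical helper; std's mpsc is an assumption, not the project's channel.)
fn drain_count(rx: &Receiver<u32>) -> usize {
    let mut n = 0;
    while let Ok(_event) = rx.try_recv() {
        n += 1;
    }
    n
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send(1).unwrap();

    // One queued event is drained.
    assert_eq!(drain_count(&rx), 1);
    // Queue is empty but the sender is still alive.
    assert_eq!(drain_count(&rx), 0);

    // Simulate engine death: the only sender is dropped.
    drop(tx);
    // Disconnected yields the same observable result as "empty".
    assert_eq!(drain_count(&rx), 0);
    // Only inspecting the error itself reveals the difference.
    assert_eq!(rx.try_recv(), Err(TryRecvError::Disconnected));
}
```

Because the loop discards the error value, the caller needs a separate check after the drain to tell the two cases apart.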

The root cause is in spawn_supervised (utils.rs): it wraps the future in catch_unwind, logs the panic, writes a dump file, and exits — but never sends an EngineEvent::Error over the channel. The UI has no way to know the engine died.
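For illustration only, here is one way a supervised wrapper could report a caught panic over the channel before its sender is dropped. This PR does not change `spawn_supervised`; the sketch assumes a plain thread and `std::sync::mpsc`, and `spawn_reporting` and this two-variant `EngineEvent` are hypothetical stand-ins rather than the project's types.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::mpsc::{self, Sender};
use std::thread;

// Hypothetical, reduced event type for the sketch.
#[derive(Debug, PartialEq)]
enum EngineEvent {
    TurnStarted,
    Error(String),
}

/// Hypothetical supervised spawn: runs `work` on a thread and, if it panics,
/// sends an error event before the sender is dropped at thread exit.
fn spawn_reporting<F>(tx: Sender<EngineEvent>, work: F) -> thread::JoinHandle<()>
where
    F: FnOnce(&Sender<EngineEvent>) + Send + 'static,
{
    thread::spawn(move || {
        let outcome = catch_unwind(AssertUnwindSafe(|| work(&tx)));
        if outcome.is_err() {
            // Best effort: the receiver may already be gone.
            let _ = tx.send(EngineEvent::Error("engine task panicked".into()));
        }
    })
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let handle = spawn_reporting(tx, |tx| {
        tx.send(EngineEvent::TurnStarted).unwrap();
        panic!("simulated engine failure");
    });
    // The panic was caught inside the thread, so join succeeds.
    handle.join().unwrap();

    // The receiver sees the normal event, then the error, then disconnect.
    let events: Vec<EngineEvent> = rx.iter().collect();
    assert_eq!(events.len(), 2);
    assert_eq!(events[0], EngineEvent::TurnStarted);
}
```

A receiver-side check is still useful alongside this, since a sender can also disappear without a panic being caught (for example on an early return).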

Fix

Add a post-loop check for TryRecvError::Disconnected after the existing while let Ok(event) event processing loop (ui.rs). If the channel is disconnected while app.is_loading is true, the UI immediately:

  • Finalizes in-flight streaming thinking / assistant / tool cells
  • Resets is_loading, runtime_turn_status, turn_started_at, dispatch_started_at, etc.
  • Pushes an error toast: "Engine process has terminated unexpectedly."

This reduces the recovery window from 300 seconds to ~1 frame (~16ms).
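The shape of the check can be sketched outside the TUI. The example below is an approximation under stated assumptions: `App`, `EngineEvent`, and `pump_events` are reduced, hypothetical stand-ins, `std::sync::mpsc` replaces the project's channel, and the disconnect is captured inside the drain loop rather than by a second `try_recv()` call after it.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};

// Hypothetical stand-ins for the real event and app types.
enum EngineEvent {
    TurnStarted,
    TurnComplete,
}

#[derive(Default)]
struct App {
    is_loading: bool,
    runtime_turn_status: Option<&'static str>,
    toasts: Vec<&'static str>,
}

/// Drains queued events, then recovers if the channel is disconnected
/// while a turn is still in flight. Returns true when recovery ran.
fn pump_events(app: &mut App, rx: &Receiver<EngineEvent>) -> bool {
    let disconnected = loop {
        match rx.try_recv() {
            Ok(EngineEvent::TurnStarted) => {
                app.is_loading = true;
                app.runtime_turn_status = Some("in_progress");
            }
            Ok(EngineEvent::TurnComplete) => {
                app.is_loading = false;
                app.runtime_turn_status = None;
            }
            Err(TryRecvError::Empty) => break false,
            Err(TryRecvError::Disconnected) => break true,
        }
    };

    if disconnected && app.is_loading {
        app.is_loading = false;
        app.runtime_turn_status = None;
        app.toasts.push("Engine process has terminated unexpectedly.");
        return true;
    }
    false
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut app = App::default();

    tx.send(EngineEvent::TurnStarted).unwrap();
    assert!(!pump_events(&mut app, &rx)); // engine alive, turn in progress
    drop(tx); // engine dies mid-turn
    assert!(pump_events(&mut app, &rx));
    assert!(!app.is_loading && app.runtime_turn_status.is_none());

    // A turn that completed before the sender went away needs no recovery.
    let (tx, rx) = mpsc::channel();
    tx.send(EngineEvent::TurnStarted).unwrap();
    tx.send(EngineEvent::TurnComplete).unwrap();
    drop(tx);
    assert!(!pump_events(&mut app, &rx));
    assert_eq!(app.toasts.len(), 1);
}
```

Gating recovery on `is_loading` keeps a clean shutdown, where the sender is dropped after the turn finished, from raising a spurious error toast.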

Testing

engine_event_channel_disconnect_recovers_mid_turn_ui_state — creates a real mpsc channel, drops the sender to simulate engine death, sets up the app in mid-turn state, applies the recovery logic, and verifies the UI state is fully cleaned up.

Related

Fixes the symptom described in #2583. The root cause of why the engine panics can be diagnosed via crash dumps in ~/.codewhale/crashes/.

Greptile Summary

This PR adds a post-event-loop check in run_event_loop (ui.rs) that detects engine task death mid-turn via rx.is_closed() and immediately resets all in-flight UI state (streaming, loading flags, turn metadata) and pushes an error toast, reducing the recovery window from the 300-second stall watchdog to a single frame.

  • ui.rs: inserts a disconnect recovery block after the while let Ok(event) drain loop; uses is_closed() to distinguish a disconnected channel from an empty one, sets needs_redraw = true, and clears a superset of the state cleared by reconcile_turn_liveness Branch 3.
  • tests.rs: adds engine_event_channel_disconnect_recovers_mid_turn_ui_state which creates a real mpsc channel, drops the sender, and asserts recovery; the recovery logic is inlined in the test rather than extracted to a shared helper.

Confidence Score: 4/5

The core recovery path works correctly for the common case; the omission of drain_pending_steers means steer messages composed mid-turn are silently discarded rather than resurfaced.

The disconnect recovery block is a clean, isolated addition and the channel-closed check is correctly placed after the event drain. The one concrete correctness gap is that app.drain_pending_steers() is not called — the TurnComplete::Failed branch explicitly handles this to avoid silent message loss, but the new fast-recovery path does not.

The recovery block in ui.rs around line 2256 deserves attention for the missing pending_steers drain before merging.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/tui/src/tui/ui.rs | Adds post-event-loop disconnect detection: resets in-flight streaming/loading state and pushes an error toast. Recovery is fast and correct for the core case, but omits draining pending_steers, so steer messages composed mid-turn are silently discarded rather than requeued. |
| crates/tui/src/tui/ui/tests.rs | Adds engine_event_channel_disconnect_recovers_mid_turn_ui_state test; verifies the recovery block clears loading state and pushes the toast. The recovery logic is inlined in the test rather than extracted to a shared function. |

Sequence Diagram

sequenceDiagram
    participant Engine as Engine Task
    participant Channel as mpsc Channel
    participant EventLoop as run_event_loop
    participant App as App State

    Engine->>Channel: "TurnStarted {turn_id}"
    Channel->>EventLoop: Ok(TurnStarted)
    EventLoop->>App: "is_loading=true"

    Note over Engine: panic / unexpected exit
    Engine--xChannel: tx_event dropped

    EventLoop->>Channel: try_recv() Err(Disconnected)
    Note over EventLoop: while let Ok exits silently

    EventLoop->>Channel: "rx.is_closed() = true"
    Note over EventLoop: NEW post-loop disconnect check
    EventLoop->>App: finalize streaming cells
    EventLoop->>App: "is_loading=false, runtime_turn_status=None"
    EventLoop->>App: push error toast
    EventLoop->>App: "needs_redraw=true"

Reviews (5): Last reviewed commit: "fix: remove unused mut on rx in test"

Greptile also left 1 inline comment on this PR.

When the engine task panics (caught by spawn_supervised's catch_unwind)
between TurnStarted and TurnComplete, the event channel's sender is
dropped and try_recv returns Disconnected. The UI's event loop
(while let Ok(event) = rx.try_recv()) exits silently on Err, leaving
the turn stuck with is_loading=true until the 300-second
TURN_STALL_WATCHDOG_TIMEOUT.

Add a post-loop check for TryRecvError::Disconnected after the event
processing loop. If the channel is disconnected while is_loading is
true, immediately finalize streaming state and reset the UI, reducing
recovery from 300 seconds to ~1 frame.

github-actions Bot commented Jun 2, 2026

Thanks @gordonlu for taking the time to contribute.

This repository is currently observing a maintainer-managed contribution gate in dry-run mode, so this pull request is staying open. When enforcement is enabled, pull requests from contributors who are not listed in .github/APPROVED_CONTRIBUTORS will be closed automatically.

Please read CONTRIBUTING.md for the expected contribution shape. A maintainer can grant PR access by commenting /lgtm on a pull request.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces immediate UI state recovery and error reporting when the engine process terminates unexpectedly mid-turn, rather than waiting for a watchdog timeout. It achieves this by checking for a disconnected event channel when the app is loading. The review feedback points out a critical gap: if the engine terminates during a manual compaction or purge, the UI will remain stuck because app.is_loading is false. The reviewer suggests expanding the detection and recovery logic to handle compaction and purging states, and updating the corresponding unit test to verify these additional states are cleared.

Comment thread crates/tui/src/tui/ui.rs Outdated
Comment on lines +2245 to +2265
if app.is_loading && matches!(rx.try_recv(), Err(TryRecvError::Disconnected)) {
    streaming_thinking::finalize_current(app);
    app.finalize_streaming_assistant_as_interrupted();
    app.finalize_active_cell_as_interrupted();
    app.streaming_state.reset();
    app.streaming_message_index = None;
    app.streaming_thinking_active_entry = None;

    app.is_loading = false;
    app.turn_started_at = None;
    app.turn_last_activity_at = None;
    app.runtime_turn_status = None;
    app.runtime_turn_id = None;
    app.dispatch_started_at = None;
    app.user_scrolled_during_stream = false;
    app.push_status_toast(
        "Engine process has terminated unexpectedly.",
        StatusToastLevel::Error,
        None,
    );
}

high

Correctness Bug: Engine death during compaction or purge is not detected

If the engine process terminates unexpectedly while a manual compaction (e.g., triggered via Ctrl+L or /compact) or a purge is in progress, app.is_loading will be false while app.is_compacting or app.is_purging is true.

Because the condition only checks app.is_loading, the UI will fail to detect the engine's termination in these states, leaving the UI permanently stuck with is_compacting or is_purging set to true.

Suggestion

Expand the condition to check for app.is_compacting and app.is_purging, and ensure these states (along with active_allowed_tools, agent_progress, and agent_activity_started_at) are fully reset during recovery.

            if (app.is_loading || app.is_compacting || app.is_purging) && matches!(rx.try_recv(), Err(TryRecvError::Disconnected)) {
                streaming_thinking::finalize_current(app);
                app.finalize_streaming_assistant_as_interrupted();
                app.finalize_active_cell_as_interrupted();
                app.streaming_state.reset();
                app.streaming_message_index = None;
                app.streaming_thinking_active_entry = None;

                app.is_loading = false;
                app.is_compacting = false;
                app.is_purging = false;
                app.active_allowed_tools = None;
                app.agent_progress.clear();
                app.agent_activity_started_at = None;
                app.turn_started_at = None;
                app.turn_last_activity_at = None;
                app.runtime_turn_status = None;
                app.runtime_turn_id = None;
                app.dispatch_started_at = None;
                app.user_scrolled_during_stream = false;
                app.push_status_toast(
                    "Engine process has terminated unexpectedly.",
                    StatusToastLevel::Error,
                    None,
                );
            }

Comment thread crates/tui/src/tui/ui/tests.rs Outdated
Comment on lines +2630 to +2661
    if app.is_loading && matches!(rx.try_recv(), Err(TryRecvError::Disconnected)) {
        streaming_thinking::finalize_current(&mut app);
        app.finalize_streaming_assistant_as_interrupted();
        app.finalize_active_cell_as_interrupted();
        app.streaming_state.reset();
        app.streaming_message_index = None;
        app.streaming_thinking_active_entry = None;

        app.is_loading = false;
        app.turn_started_at = None;
        app.turn_last_activity_at = None;
        app.runtime_turn_status = None;
        app.runtime_turn_id = None;
        app.dispatch_started_at = None;
        app.user_scrolled_during_stream = false;
        app.push_status_toast(
            "Engine process has terminated unexpectedly.",
            StatusToastLevel::Error,
            None,
        );
    }

    // Verify the fix: UI state is fully recovered
    assert!(!app.is_loading, "loading must be cleared");
    assert!(app.runtime_turn_status.is_none(), "turn status cleared");
    assert!(app.runtime_turn_id.is_none(), "turn id cleared");
    assert!(app.streaming_message_index.is_none());
    assert!(!app.user_scrolled_during_stream);
    let toast = app.status_toasts.back().expect("error toast pushed");
    assert_eq!(toast.level, StatusToastLevel::Error);
    assert!(toast.text.contains("Engine process has terminated"));
}

medium

Suggestion

Update the test's recovery block and assertions to match the expanded recovery logic, ensuring that is_compacting and is_purging are also verified to be cleared.

    // Apply the same post-loop logic from ui.rs
    if (app.is_loading || app.is_compacting || app.is_purging) && matches!(rx.try_recv(), Err(TryRecvError::Disconnected)) {
        streaming_thinking::finalize_current(&mut app);
        app.finalize_streaming_assistant_as_interrupted();
        app.finalize_active_cell_as_interrupted();
        app.streaming_state.reset();
        app.streaming_message_index = None;
        app.streaming_thinking_active_entry = None;

        app.is_loading = false;
        app.is_compacting = false;
        app.is_purging = false;
        app.active_allowed_tools = None;
        app.agent_progress.clear();
        app.agent_activity_started_at = None;
        app.turn_started_at = None;
        app.turn_last_activity_at = None;
        app.runtime_turn_status = None;
        app.runtime_turn_id = None;
        app.dispatch_started_at = None;
        app.user_scrolled_during_stream = false;
        app.push_status_toast(
            "Engine process has terminated unexpectedly.",
            StatusToastLevel::Error,
            None,
        );
    }

    // Verify the fix: UI state is fully recovered
    assert!(!app.is_loading, "loading must be cleared");
    assert!(!app.is_compacting, "compaction must be cleared");
    assert!(!app.is_purging, "purge must be cleared");
    assert!(app.runtime_turn_status.is_none(), "turn status cleared");
    assert!(app.runtime_turn_id.is_none(), "turn id cleared");
    assert!(app.streaming_message_index.is_none());
    assert!(!app.user_scrolled_during_stream);
    let toast = app.status_toasts.back().expect("error toast pushed");
    assert_eq!(toast.level, StatusToastLevel::Error);
    assert!(toast.text.contains("Engine process has terminated"));

greptile-apps Bot commented Jun 2, 2026

Comment thread crates/tui/src/tui/ui.rs
Comment on lines +2256 to +2270
app.active_allowed_tools = None;
app.agent_progress.clear();
app.agent_activity_started_at = None;
app.turn_started_at = None;
app.turn_last_activity_at = None;
app.runtime_turn_status = None;
app.runtime_turn_id = None;
app.dispatch_started_at = None;
app.user_scrolled_during_stream = false;
app.push_status_toast(
    "Engine process has terminated unexpectedly.",
    StatusToastLevel::Error,
    None,
);
app.needs_redraw = true;

P1 Pending steers silently lost on engine disconnect

The TurnComplete handler has an explicit "hard-fail recovery" path that calls app.drain_pending_steers() and requeues those messages so they are not silently lost (see the TurnOutcomeStatus::Failed branch). This disconnect recovery block is intended to substitute for that path when the engine dies without ever emitting TurnComplete, but it never drains app.pending_steers. Any steer messages the user composed mid-turn and held with Esc will be silently discarded rather than surfaced in the queue where the user can see and re-send them.
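A rough sketch of what draining on disconnect might look like follows. The `App` struct here is a reduced, hypothetical stand-in: `pending_steers` and `drain_pending_steers` are named in this PR, but their real types and signatures are not shown, and `queued_messages` and `recover_from_engine_disconnect` are assumed names for the user-visible queue and the recovery entry point.

```rust
use std::collections::VecDeque;

// Hypothetical, reduced app state; field types are assumptions.
#[derive(Default)]
struct App {
    is_loading: bool,
    pending_steers: Vec<String>,
    queued_messages: VecDeque<String>,
}

impl App {
    /// Takes every held steer, leaving the pending list empty.
    /// (Assumed signature; the real method may differ.)
    fn drain_pending_steers(&mut self) -> Vec<String> {
        std::mem::take(&mut self.pending_steers)
    }

    /// Disconnect recovery that requeues held steers instead of dropping them.
    fn recover_from_engine_disconnect(&mut self) {
        self.is_loading = false;
        for steer in self.drain_pending_steers() {
            // Preserve the order in which the user composed the steers.
            self.queued_messages.push_back(steer);
        }
    }
}

fn main() {
    let mut app = App {
        is_loading: true,
        pending_steers: vec!["use the other branch".to_string()],
        ..App::default()
    };

    app.recover_from_engine_disconnect();

    assert!(!app.is_loading);
    assert!(app.pending_steers.is_empty());
    assert_eq!(app.queued_messages.len(), 1);
}
```

Requeueing rather than clearing keeps the messages visible, so the user can decide whether to re-send them once the engine is available again.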

@Hmbown Hmbown added this to the v0.8.51 milestone Jun 2, 2026