fix: detect engine task death mid-turn and recover UI immediately#2585
fix: detect engine task death mid-turn and recover UI immediately#2585gordonlu wants to merge 5 commits into
Conversation
When the engine task panics (caught by spawn_supervised's catch_unwind) between TurnStarted and TurnComplete, the event channel's sender is dropped and try_recv returns Disconnected. The UI's event loop (while let Ok(event) = rx.try_recv()) exits silently on Err, leaving the turn stuck with is_loading=true until the 300-second TURN_STALL_WATCHDOG_TIMEOUT. Add a post-loop check for TryRecvError::Disconnected after the event processing loop. If the channel is disconnected while is_loading is true, immediately finalize streaming state and reset the UI, reducing recovery from 300 seconds to ~1 frame.
|
Thanks @gordonlu for taking the time to contribute. This repository is currently observing a maintainer-managed contribution gate in dry-run mode, so this pull request is staying open. When enforcement is enabled, pull requests from contributors who are not listed in Please read |
There was a problem hiding this comment.
Code Review
This pull request introduces immediate UI state recovery and error reporting when the engine process terminates unexpectedly mid-turn, rather than waiting for a watchdog timeout. It achieves this by checking for a disconnected event channel when the app is loading. The review feedback points out a critical gap: if the engine terminates during a manual compaction or purge, the UI will remain stuck because app.is_loading is false. The reviewer suggests expanding the detection and recovery logic to handle compaction and purging states, and updating the corresponding unit test to verify these additional states are cleared.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if app.is_loading && matches!(rx.try_recv(), Err(TryRecvError::Disconnected)) { | ||
| streaming_thinking::finalize_current(app); | ||
| app.finalize_streaming_assistant_as_interrupted(); | ||
| app.finalize_active_cell_as_interrupted(); | ||
| app.streaming_state.reset(); | ||
| app.streaming_message_index = None; | ||
| app.streaming_thinking_active_entry = None; | ||
|
|
||
| app.is_loading = false; | ||
| app.turn_started_at = None; | ||
| app.turn_last_activity_at = None; | ||
| app.runtime_turn_status = None; | ||
| app.runtime_turn_id = None; | ||
| app.dispatch_started_at = None; | ||
| app.user_scrolled_during_stream = false; | ||
| app.push_status_toast( | ||
| "Engine process has terminated unexpectedly.", | ||
| StatusToastLevel::Error, | ||
| None, | ||
| ); | ||
| } |
There was a problem hiding this comment.
Correctness Bug: Engine death during compaction or purge is not detected
If the engine process terminates unexpectedly while a manual compaction (e.g., triggered via Ctrl+L or /compact) or a purge is in progress, app.is_loading will be false while app.is_compacting or app.is_purging is true.
Because the condition only checks app.is_loading, the UI will fail to detect the engine's termination in these states, leaving the UI permanently stuck with is_compacting or is_purging set to true.
Suggestion
Expand the condition to check for app.is_compacting and app.is_purging, and ensure these states (along with active_allowed_tools, agent_progress, and agent_activity_started_at) are fully reset during recovery.
if (app.is_loading || app.is_compacting || app.is_purging) && matches!(rx.try_recv(), Err(TryRecvError::Disconnected)) {
streaming_thinking::finalize_current(app);
app.finalize_streaming_assistant_as_interrupted();
app.finalize_active_cell_as_interrupted();
app.streaming_state.reset();
app.streaming_message_index = None;
app.streaming_thinking_active_entry = None;
app.is_loading = false;
app.is_compacting = false;
app.is_purging = false;
app.active_allowed_tools = None;
app.agent_progress.clear();
app.agent_activity_started_at = None;
app.turn_started_at = None;
app.turn_last_activity_at = None;
app.runtime_turn_status = None;
app.runtime_turn_id = None;
app.dispatch_started_at = None;
app.user_scrolled_during_stream = false;
app.push_status_toast(
"Engine process has terminated unexpectedly.",
StatusToastLevel::Error,
None,
);
}| if app.is_loading && matches!(rx.try_recv(), Err(TryRecvError::Disconnected)) { | ||
| streaming_thinking::finalize_current(&mut app); | ||
| app.finalize_streaming_assistant_as_interrupted(); | ||
| app.finalize_active_cell_as_interrupted(); | ||
| app.streaming_state.reset(); | ||
| app.streaming_message_index = None; | ||
| app.streaming_thinking_active_entry = None; | ||
|
|
||
| app.is_loading = false; | ||
| app.turn_started_at = None; | ||
| app.turn_last_activity_at = None; | ||
| app.runtime_turn_status = None; | ||
| app.runtime_turn_id = None; | ||
| app.dispatch_started_at = None; | ||
| app.user_scrolled_during_stream = false; | ||
| app.push_status_toast( | ||
| "Engine process has terminated unexpectedly.", | ||
| StatusToastLevel::Error, | ||
| None, | ||
| ); | ||
| } | ||
|
|
||
| // Verify the fix: UI state is fully recovered | ||
| assert!(!app.is_loading, "loading must be cleared"); | ||
| assert!(app.runtime_turn_status.is_none(), "turn status cleared"); | ||
| assert!(app.runtime_turn_id.is_none(), "turn id cleared"); | ||
| assert!(app.streaming_message_index.is_none()); | ||
| assert!(!app.user_scrolled_during_stream); | ||
| let toast = app.status_toasts.back().expect("error toast pushed"); | ||
| assert_eq!(toast.level, StatusToastLevel::Error); | ||
| assert!(toast.text.contains("Engine process has terminated")); | ||
| } |
There was a problem hiding this comment.
Suggestion
Update the test's recovery block and assertions to match the expanded recovery logic, ensuring that is_compacting and is_purging are also verified to be cleared.
// Apply the same post-loop logic from ui.rs
if (app.is_loading || app.is_compacting || app.is_purging) && matches!(rx.try_recv(), Err(TryRecvError::Disconnected)) {
streaming_thinking::finalize_current(&mut app);
app.finalize_streaming_assistant_as_interrupted();
app.finalize_active_cell_as_interrupted();
app.streaming_state.reset();
app.streaming_message_index = None;
app.streaming_thinking_active_entry = None;
app.is_loading = false;
app.is_compacting = false;
app.is_purging = false;
app.active_allowed_tools = None;
app.agent_progress.clear();
app.agent_activity_started_at = None;
app.turn_started_at = None;
app.turn_last_activity_at = None;
app.runtime_turn_status = None;
app.runtime_turn_id = None;
app.dispatch_started_at = None;
app.user_scrolled_during_stream = false;
app.push_status_toast(
"Engine process has terminated unexpectedly.",
StatusToastLevel::Error,
None,
);
}
// Verify the fix: UI state is fully recovered
assert!(!app.is_loading, "loading must be cleared");
assert!(!app.is_compacting, "compaction must be cleared");
assert!(!app.is_purging, "purge must be cleared");
assert!(app.runtime_turn_status.is_none(), "turn status cleared");
assert!(app.runtime_turn_id.is_none(), "turn id cleared");
assert!(app.streaming_message_index.is_none());
assert!(!app.user_scrolled_during_stream);
let toast = app.status_toasts.back().expect("error toast pushed");
assert_eq!(toast.level, StatusToastLevel::Error);
assert!(toast.text.contains("Engine process has terminated"));…isconnect detection to compaction/purge
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
| app.active_allowed_tools = None; | ||
| app.agent_progress.clear(); | ||
| app.agent_activity_started_at = None; | ||
| app.turn_started_at = None; | ||
| app.turn_last_activity_at = None; | ||
| app.runtime_turn_status = None; | ||
| app.runtime_turn_id = None; | ||
| app.dispatch_started_at = None; | ||
| app.user_scrolled_during_stream = false; | ||
| app.push_status_toast( | ||
| "Engine process has terminated unexpectedly.", | ||
| StatusToastLevel::Error, | ||
| None, | ||
| ); | ||
| app.needs_redraw = true; |
There was a problem hiding this comment.
Pending steers silently lost on engine disconnect
The TurnComplete handler has an explicit "hard-fail recovery" path that calls app.drain_pending_steers() and requeues those messages so they are not silently lost (see the TurnOutcomeStatus::Failed branch). This disconnect recovery block is intended to substitute for that path when the engine dies without ever emitting TurnComplete, but it never drains app.pending_steers. Any steer messages the user composed mid-turn and held with Esc will be silently discarded rather than surfaced in the queue where the user can see and re-send them.
Problem
When the engine task panics (or exits unexpectedly) between
TurnStartedandTurnComplete, the event channeltx_eventis dropped and the mpsc channel closes silently. The UI's event loop useswhile let Ok(event) = rx.try_recv()— onErr(TryRecvError::Disconnected)the loop simply exits without any recovery. The turn remains stuck withis_loading=trueandruntime_turn_status=Some("in_progress")until the 300-secondTURN_STALL_WATCHDOG_TIMEOUTfires inreconcile_turn_liveness().The root cause is in
spawn_supervised(utils.rs): it wraps the future incatch_unwind, logs the panic, writes a dump file, and exits — but never sends anEngineEvent::Errorover the channel. The UI has no way to know the engine died.Fix
Add a post-loop check for
TryRecvError::Disconnectedafter the existingwhile let Ok(event)event processing loop (ui.rs). If the channel is disconnected whileapp.is_loadingis true, the UI immediately:is_loading,runtime_turn_status,turn_started_at,dispatch_started_at, etc.This reduces the recovery window from 300 seconds to ~1 frame (~16ms).
Testing
engine_event_channel_disconnect_recovers_mid_turn_ui_state— creates a real mpsc channel, drops the sender to simulate engine death, sets up the app in mid-turn state, applies the recovery logic, and verifies the UI state is fully cleaned up.Related
Fixes the symptom described in #2583. The root cause of why the engine panics can be diagnosed via crash dumps in
~/.codewhale/crashes/.Greptile Summary
This PR adds a post-event-loop check in
run_event_loop(ui.rs) that detects engine task death mid-turn viarx.is_closed()and immediately resets all in-flight UI state (streaming, loading flags, turn metadata) and pushes an error toast, reducing the recovery window from the 300-second stall watchdog to a single frame.ui.rs: inserts a disconnect recovery block after thewhile let Ok(event)drain loop; usesis_closed()to distinguish a disconnected channel from an empty one, setsneeds_redraw = true, and clears a superset of the state cleared byreconcile_turn_livenessBranch 3.tests.rs: addsengine_event_channel_disconnect_recovers_mid_turn_ui_statewhich creates a real mpsc channel, drops the sender, and asserts recovery; the recovery logic is inlined in the test rather than extracted to a shared helper.Confidence Score: 4/5
The core recovery path works correctly for the common case; the omission of drain_pending_steers means steer messages composed mid-turn are silently discarded rather than resurfaced.
The disconnect recovery block is a clean, isolated addition and the channel-closed check is correctly placed after the event drain. The one concrete correctness gap is that app.drain_pending_steers() is not called — the TurnComplete::Failed branch explicitly handles this to avoid silent message loss, but the new fast-recovery path does not.
The recovery block in ui.rs around line 2256 deserves attention for the missing pending_steers drain before merging.
Important Files Changed
Sequence Diagram
sequenceDiagram participant Engine as Engine Task participant Channel as mpsc Channel participant EventLoop as run_event_loop participant App as App State Engine->>Channel: "TurnStarted {turn_id}" Channel->>EventLoop: Ok(TurnStarted) EventLoop->>App: "is_loading=true" Note over Engine: panic / unexpected exit Engine--xChannel: tx_event dropped EventLoop->>Channel: try_recv() Err(Disconnected) Note over EventLoop: while let Ok exits silently EventLoop->>Channel: "rx.is_closed() = true" Note over EventLoop: NEW post-loop disconnect check EventLoop->>App: finalize streaming cells EventLoop->>App: "is_loading=false, runtime_turn_status=None" EventLoop->>App: push error toast EventLoop->>App: "needs_redraw=true"Reviews (5): Last reviewed commit: "fix: remove unused mut on rx in test" | Re-trigger Greptile