Skip to content

Commit cbd980e

Browse files
PureWeenCopilot
andauthored
fix: session relaunch resilience + model selection + IDLE-DEFER-REARM (#472)
## Summary Sessions that are actively processing during a PolyPilot relaunch now survive correctly. Model selection no longer silently resets to Haiku after abort/reconnect. ## Poll-then-resume pattern When restoring sessions after relaunch, sessions detected as actively processing on the CLI headless server are handled with a new poll-then-resume approach: 1. **Do NOT call ResumeSessionAsync** on active sessions (kills in-flight tools) 2. **PollEventsAndResumeWhenIdleAsync** watches events.jsonl every 5s for terminal events 3. When the CLI finishes, safely resume and load response from disk 4. Watchdog runs concurrently as safety net (600s timeout) ### Key changes: - `IsSessionStillProcessing` — blacklist of terminal events instead of whitelist of active events - `PollEventsAndResumeWhenIdleAsync` — new file-based poller for active session completion detection - Generation guard (INV-3/INV-12) and IsMultiAgentSession (INV-9) on poller path - History merge on UI thread for thread safety - LastUpdatedAt reset on restore for correct UI activity display ## Model selection fixes - `GetSessionModel` prioritizes user's explicit choice over usage-reported model - Fallback uses `DefaultModel` instead of alphabetical first (was Haiku) - `SessionStartEvent` no longer overwrites user model choice after abort/reconnect ## Known limitation The CLI headless server does not always write `session.idle` to events.jsonl when the subscriber is disconnected (issue #299). The poller cannot detect these completions and falls back to the 600s watchdog timeout. Clean repro added to #299. ## Testing - 3059/3059 tests passing - Updated StuckSessionRecoveryTests and EventsJsonlParsingTests for blacklist approach - Verified end-to-end with live multi-agent sessions across multiple relaunches --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent fcbae6a commit cbd980e

14 files changed

Lines changed: 523 additions & 69 deletions

File tree

.claude/skills/performance-optimization/SKILL.md

Lines changed: 69 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -2,15 +2,18 @@
22
name: performance-optimization
33
description: >
44
Performance invariants and optimization knowledge for PolyPilot's render pipeline,
5-
session switching, persistence, and caching layers. Use when: (1) Modifying
6-
RefreshSessions, GetOrganizedSessions, or SafeRefreshAsync, (2) Touching
7-
SaveActiveSessionsToDisk, SaveOrganization, or SaveUiState, (3) Working with
8-
LoadPersistedSessions or session directory scanning, (4) Modifying markdown
9-
rendering or the message cache, (5) Optimizing render cycle performance or
10-
adding Blazor component rendering, (6) Working with debounce timers or
11-
DisposeAsync cleanup. Covers: session-switch bottleneck fix, debounce flush
12-
requirements, expensive operation guards, cache invalidation strategy, and
13-
render cycle analysis.
5+
session switching, persistence, caching layers, and startup performance. Use when:
6+
(1) Modifying RefreshSessions, GetOrganizedSessions, or SafeRefreshAsync,
7+
(2) Touching SaveActiveSessionsToDisk, SaveOrganization, or SaveUiState,
8+
(3) Working with LoadPersistedSessions or session directory scanning,
9+
(4) Modifying markdown rendering or the message cache,
10+
(5) Optimizing render cycle performance or adding Blazor component rendering,
11+
(6) Working with debounce timers or DisposeAsync cleanup,
12+
(7) Debugging startup slowness or blue screen on launch,
13+
(8) Modifying InitializeAsync or RestoreSessionsInBackgroundAsync.
14+
Covers: session-switch bottleneck fix, debounce flush requirements, expensive
15+
operation guards, cache invalidation strategy, render cycle analysis, and
16+
startup performance debugging.
1417
---
1518

1619
# Performance Optimization
@@ -99,3 +102,60 @@ item heights — chat messages have variable height from markdown rendering.
99102
### ChatMessageItem has no ShouldRender()
100103
Always re-renders when parent calls `StateHasChanged()`. Adding a content
101104
hash comparison would skip re-renders for unchanged messages.
105+
106+
## Startup Performance Debugging
107+
108+
### Architecture
109+
Startup has two phases:
110+
1. **UI thread (blocking):** MAUI bootstrap → BlazorWebView creation → Blazor framework init → Dashboard `OnInitialized``InitializeAsync` → fires `RestoreSessionsInBackgroundAsync` on ThreadPool
111+
2. **Background thread (non-blocking):** Read `active-sessions.json` → create lazy placeholders → eager-resume candidates → `ReconcileOrganization`
112+
113+
The UI renders as soon as `InitializeAsync` returns. Session restore NEVER blocks the UI thread (runs via `Task.Run`). If you see a blue screen, the problem is in phase 1 (UI thread), not phase 2.
114+
115+
### Instrumentation
116+
`[STARTUP-TIMING]` log tags in `~/.polypilot/console.log`:
117+
```
118+
[STARTUP-TIMING] LoadOrganization: 63ms ← reads organization.json
119+
[STARTUP-TIMING] Pre-restore: 64ms ← time before Task.Run
120+
[STARTUP-TIMING] Session loop complete: 35056ms ← background thread, NOT blocking UI
121+
[STARTUP-TIMING] RestoreSessionsInBackground: 35095ms ← total background time
122+
```
123+
124+
### How to debug startup slowness
125+
126+
**Step 1: Measure the UI-visible delay**
127+
```
128+
BlazorDevFlow ConfigureHandler → Dashboard Restoring UI state
129+
```
130+
Find both timestamps in console.log for the current PID. The gap is the user-visible startup time. Normal: 5-8 seconds.
131+
132+
**Step 2: Identify which phase is slow**
133+
- If `[STARTUP-TIMING] Pre-restore` is high (>500ms): `LoadOrganization` or `InitializeAsync` is slow on the UI thread
134+
- If `[STARTUP-TIMING] Session loop` is high but UI rendered fast: background restore is slow but not blocking the user — acceptable
135+
- If neither timing tag appears: the slowdown is in Blazor framework init (before our code runs) — check system load, WebView issues
136+
137+
**Step 3: A/B test against main**
138+
```bash
139+
# Save current branch
140+
git stash && git checkout main
141+
cd PolyPilot && ./relaunch.sh
142+
# Measure: BlazorDevFlow Config → Dashboard Restore gap
143+
144+
# Switch back
145+
git checkout <branch> && git stash pop
146+
cd PolyPilot && ./relaunch.sh
147+
# Compare gaps
148+
```
149+
150+
⚠️ **Use `git stash` + `git checkout` — do NOT `git checkout origin/main -- .`** (checks out files but keeps wrong branch name, confuses the app's branch display).
151+
152+
**Step 4: Common false alarms**
153+
- CPU load from concurrent test runs or builds causes 2-3x slowdown in Blazor init
154+
- DLL file locks from running app cause build failures (retry after a few seconds)
155+
- The first launch after a `dotnet clean` is always slower (JIT compilation)
156+
- Run the comparison 2-3 times each to account for variance
157+
158+
### Critical rules
159+
- **NEVER block the UI thread during restore.** All session loading uses `Task.Run` + `ConfigureAwait(false)`. Violations cause blue screen.
160+
- **`LoadPersistedSessions()` is O(N) on ALL session directories** (750+). Never call from `InitializeAsync` or any UI-triggered path. See PERF-2.
161+
- **InvokeOnUI callbacks during restore compete with Blazor rendering.** Minimize them. The restore path should batch state changes and call `NotifyStateChanged` once at the end, not per-session.

.claude/skills/processing-state-safety/SKILL.md

Lines changed: 24 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -8,11 +8,12 @@ description: >
88
SDK event handlers, (4) Debugging stuck sessions showing "Thinking..." forever
99
or spinner stuck, (5) Modifying IsResumed, HasUsedToolsThisTurn, or ActiveToolCallCount,
1010
(6) Adding diagnostic log tags, (7) Modifying session restore paths
11-
(RestoreSingleSessionAsync) that must initialize watchdog-dependent state,
11+
(RestorePreviousSessionsAsync / EnsureSessionConnectedAsync) that must initialize
12+
watchdog-dependent state,
1213
(8) Modifying ReconcileOrganization or any code that reads Organization.Sessions
1314
during the IsRestoring window, (9) Session appears hung or unresponsive after tool use.
1415
Covers: 18 invariants from 13 PRs of fix cycles,
15-
the 16 code paths that set/clear IsProcessing, and common regression patterns.
16+
the 21+ code paths that set/clear IsProcessing, and common regression patterns.
1617
---
1718

1819
# Processing State Safety
@@ -151,9 +152,9 @@ cross-thread fields without a tracking comment explaining the gap.
151152
causing stale renders.
152153

153154
### INV-9: Session restore must initialize all watchdog-dependent state
154-
The restore path (`RestoreSingleSessionAsync`) is separate from `SendPromptAsync`.
155-
Any field that affects watchdog timeout selection or dispatch routing must be
156-
initialized in BOTH paths:
155+
The restore path (`RestorePreviousSessionsAsync` + `EnsureSessionConnectedAsync`) is
156+
separate from `SendPromptAsync`. Any field that affects watchdog timeout selection or
157+
dispatch routing must be initialized in BOTH paths:
157158
- `IsMultiAgentSession` — set via `IsSessionInMultiAgentGroup()` before `StartProcessingWatchdog`
158159
- `HasReceivedEventsSinceResume` / `HasUsedToolsThisTurn` — set via `GetEventsFileRestoreHints()`
159160
- `IsResumed` — set on the `AgentSessionInfo` when `isStillProcessing` is true
@@ -315,10 +316,10 @@ complete the response while sub-agents are still working.
315316
5. **Missing state initialization on session restore**`IsMultiAgentSession`,
316317
`IsResumed`, and other flags must be set on restored sessions BEFORE
317318
`StartProcessingWatchdog` is called. The restore path in
318-
`RestoreSingleSessionAsync` is separate from `SendPromptAsync` and must
319-
independently initialize all state the watchdog depends on. PR #284 fixed
320-
`IsMultiAgentSession` not being set during restore, causing the watchdog
321-
to use 120s instead of 600s for multi-agent workers.
319+
`RestorePreviousSessionsAsync` / `EnsureSessionConnectedAsync` is separate
320+
from `SendPromptAsync` and must independently initialize all state the watchdog
321+
depends on. PR #284 fixed `IsMultiAgentSession` not being set during restore,
322+
causing the watchdog to use 120s instead of 600s for multi-agent workers.
322323

323324
**Retired mistake (was #2):** *ActiveToolCallCount as sole tool signal* — still relevant per
324325
INV-5, but the more impactful version is #2 above (suppressing the fallback entirely).
@@ -345,15 +346,26 @@ When a session shows "Thinking..." indefinitely:
345346
| Symptom | Likely Cause | Fix |
346347
|---------|-------------|-----|
347348
| `[SEND]` then silence | SDK never responded, watchdog will catch at 120s | Wait or abort |
348-
| `[EVT] TurnEnd` but no `[IDLE]` | Zero-idle SDK bug | Watchdog catches at 30s fallback (INV-10) |
349-
| `[IDLE-DEFER]` then long silence | Background tasks (sub-agents/shells) active but never completed | Check agent status; watchdog will eventually catch (INV-18) |
349+
| `[EVT] TurnEnd` but no `[IDLE]` | `session.idle` is ephemeral (never on disk). If live events stopped: IDLE-DEFER deferred the idle and `IsProcessing` was cleared before the follow-up arrived (#403) | Watchdog catches; #403 fix re-arms IsProcessing |
350+
| `[IDLE-DEFER]` then long silence | Background tasks (sub-agents/shells) active | Check `[IDLE-DIAG]` for backgroundTasks count; watchdog will eventually catch (INV-18) |
351+
| `[IDLE-DEFER]` with `IsProcessing=False` | IDLE-DEFER fired but IsProcessing was already cleared by watchdog/reconnect | #403 IDLE-DEFER-REARM fix re-arms IsProcessing |
350352
| `[COMPLETE]` fired but spinner persists | UI thread not notified | Check INV-2, INV-8 |
351353
| `[WATCHDOG]` clears but re-sticks | New turn started before watchdog callback ran | Check INV-3 generation guard |
354+
| After relaunch: session shows "Working" for 600s | Session was active on CLI during relaunch; poll-then-resume waiting for `session.shutdown` (only disk-persisted terminal event) | Normal — watchdog clears at 600s. `session.idle` is ephemeral, never on disk |
352355

353356
5. **Nuclear option** — user clicks Stop (AbortSessionAsync, path #5/#6).
354357

358+
## Key Facts About session.idle
359+
360+
- `session.idle` is **`ephemeral: true`** in the SDK schema — intentionally NOT written to events.jsonl
361+
- Events written to disk: `session.start`, `session.resume`, `session.shutdown` — NOT `session.idle`
362+
- PolyPilot receives `SessionIdleEvent` over the live SDK event stream
363+
- CLI correctly populates `backgroundTasks` field (proven with `[IDLE-DIAG]` instrumentation)
364+
- When `backgroundTasks` is active, IDLE-DEFER defers completion until a subsequent idle with empty backgroundTasks
365+
- After app relaunch, the poll-then-resume pattern watches events.jsonl for `session.shutdown` only (the only terminal event on disk)
366+
355367
## Regression History
356368

357369
10 PRs of fix/regression cycles: #141#147#148#153#158#163#164#276#284#332.
358-
Additional safety PRs: #373 (orphaned state guards), #375 (premature idle re-arm), #399 (IDLE-DEFER for background tasks).
370+
Additional safety PRs: #373 (orphaned state guards), #375 (premature idle re-arm), #399 (IDLE-DEFER for background tasks), #472 (poll-then-resume + IDLE-DEFER-REARM + model selection).
359371
See `references/regression-history.md` for the full timeline with root causes.

.github/copilot-instructions.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -239,7 +239,7 @@ The UI shows: "Sending…" → "Server connected…" → "Thinking…" → "Work
239239
**CRITICAL**: Every code path that sets `IsProcessing = false` must clear 9 companion fields and call `FlushCurrentResponse`. This is the most recurring bug category (13 PRs of fix/regression cycles). **Read `.claude/skills/processing-state-safety/SKILL.md` before modifying ANY processing path.** There are 15+ such paths across CopilotService.cs, Events.cs, Bridge.cs, Organization.cs, and Providers.cs.
240240

241241
### Content Persistence
242-
`FlushCurrentResponse` is also called on `AssistantTurnEndEvent` to persist accumulated response text at each sub-turn boundary. This prevents content loss if the app restarts between `turn_end` and `session.idle` (e.g., "zero-idle sessions" where the SDK never emits `session.idle`). The flush includes a dedup guard to prevent duplicate messages from event replay on resume.
242+
`FlushCurrentResponse` is also called on `AssistantTurnEndEvent` to persist accumulated response text at each sub-turn boundary. This prevents content loss if the app restarts between `turn_end` and `session.idle`. When the IDLE-DEFER logic defers `session.idle` (active background tasks), the flush ensures content from the foreground turn is saved. The flush includes a dedup guard to prevent duplicate messages from event replay on resume.
243243

244244
### Processing Watchdog
245245
The processing watchdog (`RunProcessingWatchdogAsync` in `CopilotService.Events.cs`) detects stuck sessions by checking how long since the last SDK event. It checks every 15 seconds and has three timeout tiers:
@@ -252,7 +252,7 @@ The processing watchdog (`RunProcessingWatchdogAsync` in `CopilotService.Events.
252252

253253
For multi-agent sessions, Case B also checks **file-size-growth**: if events.jsonl hasn't grown for `WatchdogCaseBMaxStaleChecks` (2) consecutive deferrals, the session is force-completed — the connection is dead. This catches `ConnectionLostException` scenarios where mtime stays fresh but no new data arrives, reducing detection from 30+ min to ~360s (3 cycles: 1 baseline + 2 stale checks). The 1800s freshness window is preserved.
254254

255-
Note: Some sessions never receive `session.idle` events (SDK/CLI bug). In these "zero-idle" cases, `IsProcessing` is only cleared by the watchdog or user abort. The turn_end flush (see Content Persistence above) ensures response content is not lost.
255+
Note: `session.idle` is an ephemeral event (`ephemeral: true` in the SDK schema) — it is delivered over the live event stream but intentionally NOT written to `events.jsonl`. When `session.idle` includes active `backgroundTasks` (sub-agents, shells), the IDLE-DEFER logic defers completion until a subsequent idle arrives with empty/null backgroundTasks. In rare cases where `IsProcessing` was already cleared (by watchdog timeout or reconnect) before the deferred idle arrives, the session may appear stuck until the watchdog fires again — see issue #403.
256256

257257
When the watchdog fires, it marshals state mutations to the UI thread via `InvokeOnUI()` and adds a system warning message.
258258

PolyPilot.Tests/BackgroundTasksIdleTests.cs

Lines changed: 108 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
using GitHub.Copilot.SDK;
2+
using PolyPilot.Models;
23
using PolyPilot.Services;
34

45
namespace PolyPilot.Tests;
@@ -110,4 +111,111 @@ public void HasActiveBackgroundTasks_DataNull_ReturnsFalse()
110111
var idle = new SessionIdleEvent { Data = null! };
111112
Assert.False(CopilotService.HasActiveBackgroundTasks(idle));
112113
}
114+
115+
// --- IDLE-DEFER-REARM tests (issue #403) ---
116+
// These test the expected re-arm conditions when session.idle arrives
117+
// with active background tasks but IsProcessing is already false.
118+
119+
[Fact]
120+
public void IdleDeferRearm_ShouldRearm_WhenBackgroundTasksActiveAndNotProcessing()
121+
{
122+
// Scenario: watchdog or reconnect cleared IsProcessing,
123+
// then session.idle arrives with active background tasks.
124+
// Expected: should re-arm IsProcessing=true.
125+
var info = new AgentSessionInfo { Name = "test", Model = "test-model" };
126+
info.IsProcessing = false; // Cleared by watchdog
127+
128+
var idle = MakeIdle(new SessionIdleDataBackgroundTasks
129+
{
130+
Agents = new[]
131+
{
132+
new SessionIdleDataBackgroundTasksAgentsItem
133+
{
134+
AgentId = "agent-1", AgentType = "copilot", Description = "worker"
135+
}
136+
},
137+
Shells = Array.Empty<SessionIdleDataBackgroundTasksShellsItem>()
138+
});
139+
140+
// Verify preconditions
141+
Assert.False(info.IsProcessing);
142+
Assert.True(CopilotService.HasActiveBackgroundTasks(idle));
143+
144+
// The re-arm logic: if !IsProcessing && HasActiveBackgroundTasks && !WasUserAborted → re-arm
145+
bool shouldRearm = !info.IsProcessing && CopilotService.HasActiveBackgroundTasks(idle);
146+
Assert.True(shouldRearm, "Should re-arm when background tasks active and not processing");
147+
148+
// Simulate re-arm
149+
info.IsProcessing = true;
150+
info.ProcessingPhase = 3; // Working (background tasks)
151+
info.ProcessingStartedAt ??= DateTime.UtcNow;
152+
153+
Assert.True(info.IsProcessing);
154+
Assert.Equal(3, info.ProcessingPhase);
155+
Assert.NotNull(info.ProcessingStartedAt);
156+
}
157+
158+
[Fact]
159+
public void IdleDeferRearm_ShouldNotRearm_WhenNoBackgroundTasks()
160+
{
161+
// Scenario: session.idle arrives with no background tasks and IsProcessing=false.
162+
// Expected: should NOT re-arm — this is a normal completion.
163+
var info = new AgentSessionInfo { Name = "test", Model = "test-model" };
164+
info.IsProcessing = false;
165+
166+
var idle = MakeIdle(bt: null);
167+
168+
bool shouldRearm = !info.IsProcessing && CopilotService.HasActiveBackgroundTasks(idle);
169+
Assert.False(shouldRearm, "Should NOT re-arm when no background tasks");
170+
}
171+
172+
[Fact]
173+
public void IdleDeferRearm_ShouldNotRearm_WhenAlreadyProcessing()
174+
{
175+
// Scenario: session.idle with background tasks but IsProcessing is already true.
176+
// Expected: should NOT re-arm (already armed) — just defer.
177+
var info = new AgentSessionInfo { Name = "test", Model = "test-model" };
178+
info.IsProcessing = true; // Already processing
179+
180+
var idle = MakeIdle(new SessionIdleDataBackgroundTasks
181+
{
182+
Agents = new[]
183+
{
184+
new SessionIdleDataBackgroundTasksAgentsItem
185+
{
186+
AgentId = "a1", AgentType = "copilot", Description = ""
187+
}
188+
},
189+
Shells = Array.Empty<SessionIdleDataBackgroundTasksShellsItem>()
190+
});
191+
192+
bool shouldRearm = !info.IsProcessing && CopilotService.HasActiveBackgroundTasks(idle);
193+
Assert.False(shouldRearm, "Should NOT re-arm when already processing");
194+
}
195+
196+
[Fact]
197+
public void IdleDeferRearm_ShouldNotRearm_WhenUserAborted()
198+
{
199+
// Scenario: user clicked Stop, then session.idle with background tasks arrives.
200+
// Expected: should NOT re-arm — user explicitly stopped.
201+
var info = new AgentSessionInfo { Name = "test", Model = "test-model" };
202+
info.IsProcessing = false;
203+
bool wasUserAborted = true;
204+
205+
var idle = MakeIdle(new SessionIdleDataBackgroundTasks
206+
{
207+
Agents = new[]
208+
{
209+
new SessionIdleDataBackgroundTasksAgentsItem
210+
{
211+
AgentId = "a1", AgentType = "copilot", Description = ""
212+
}
213+
},
214+
Shells = Array.Empty<SessionIdleDataBackgroundTasksShellsItem>()
215+
});
216+
217+
// The full re-arm condition includes WasUserAborted check
218+
bool shouldRearm = !info.IsProcessing && CopilotService.HasActiveBackgroundTasks(idle) && !wasUserAborted;
219+
Assert.False(shouldRearm, "Should NOT re-arm when user aborted");
220+
}
113221
}

PolyPilot.Tests/EventsJsonlParsingTests.cs

Lines changed: 11 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -87,23 +87,27 @@ public void TitleCleaning_RemovesNewlines()
8787
[Fact]
8888
public void IsSessionStillProcessing_ActiveEventTypes()
8989
{
90+
// Terminal events are the only ones that indicate processing is complete.
91+
// Everything else (including intermediate events like assistant.turn_end,
92+
// assistant.message, tool.execution_end) means the session is still active.
93+
var terminalEvents = new[] { "session.idle", "session.error", "session.shutdown" };
94+
95+
// These should indicate the session is still processing (not terminal)
9096
var activeEvents = new[]
9197
{
9298
"assistant.turn_start", "tool.execution_start",
9399
"tool.execution_progress", "assistant.message_delta",
94100
"assistant.reasoning", "assistant.reasoning_delta",
95-
"assistant.intent"
101+
"assistant.intent", "assistant.turn_end",
102+
"assistant.message", "session.start"
96103
};
97-
98-
// These should indicate the session is still processing
99104
foreach (var eventType in activeEvents)
100105
{
101-
Assert.Contains(eventType, activeEvents);
106+
Assert.DoesNotContain(eventType, terminalEvents);
102107
}
103108

104-
// These should NOT indicate processing
105-
var inactiveEvents = new[] { "session.idle", "assistant.message", "session.start" };
106-
foreach (var eventType in inactiveEvents)
109+
// These SHOULD indicate processing is complete (terminal) — must not appear in activeEvents
110+
foreach (var eventType in terminalEvents)
107111
{
108112
Assert.DoesNotContain(eventType, activeEvents);
109113
}

0 commit comments

Comments
 (0)