fix: session relaunch resilience + model selection + IDLE-DEFER-REARM (#472)
## Summary
Sessions that are actively processing during a PolyPilot relaunch now
survive correctly. Model selection no longer silently resets to Haiku
after abort/reconnect.
## Poll-then-resume pattern
When restoring sessions after relaunch, sessions detected as actively
processing on the CLI headless server are handled with a new
poll-then-resume approach:
1. **Do NOT call ResumeSessionAsync** on active sessions (kills
in-flight tools)
2. **PollEventsAndResumeWhenIdleAsync** watches events.jsonl every 5s
for terminal events
3. When the CLI finishes, safely resume and load response from disk
4. Watchdog runs concurrently as safety net (600s timeout)
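The four steps above can be sketched roughly as follows. This is a minimal Python illustration, not the actual C# `PollEventsAndResumeWhenIdleAsync`: the `"type"` JSON field, the contents of `TERMINAL_EVENTS`, and the `resume` callback are all assumptions made for the example. Only the 5s interval and 600s timeout come from the description above.

```python
import json
import time
from pathlib import Path

# Hypothetical terminal event types; the real list lives in PolyPilot.
TERMINAL_EVENTS = {"session.shutdown", "session.idle"}


def last_event_type(events_path):
    """Return the "type" of the last parseable line in events.jsonl, or None."""
    try:
        lines = Path(events_path).read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        return None
    for line in reversed(lines):
        if not line.strip():
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip a partially written trailing line
        if isinstance(event, dict):
            return event.get("type")
    return None


def poll_until_idle(events_path, resume, interval_s=5.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll the events file and call resume() once a terminal event appears.

    Returns True if resume() ran, False if the timeout elapsed first
    (in the real app the watchdog covers that case).
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if last_event_type(events_path) in TERMINAL_EVENTS:
            resume()  # safe now: nothing is in flight on the CLI side
            return True
        sleep(interval_s)
    return False
```

The point of the shape is that `resume()` is never called while the last event on disk is non-terminal, which is what protects in-flight tools.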
### Key changes:
- `IsSessionStillProcessing` — blacklist of terminal events instead of
whitelist of active events
- `PollEventsAndResumeWhenIdleAsync` — new file-based poller for active
session completion detection
- Generation guard (INV-3/INV-12) and IsMultiAgentSession (INV-9) on
poller path
- History merge on UI thread for thread safety
- LastUpdatedAt reset on restore for correct UI activity display
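To illustrate the first bullet, here is a small Python sketch of the difference between the two checks. The event names are hypothetical placeholders, and the motivation given in the comments (unknown event types) is an assumption about why the blacklist was preferred, not something the commit states.

```python
# Hypothetical event names, for illustration only.
ACTIVE_EVENTS = {"assistant.turn_start", "tool.execution_start"}
TERMINAL_EVENTS = {"session.shutdown", "session.idle"}


def still_processing_whitelist(last_event):
    """Old style: only known active events count as processing."""
    return last_event in ACTIVE_EVENTS


def still_processing_blacklist(last_event):
    """New style: anything that is not a known terminal event counts as processing."""
    return last_event is not None and last_event not in TERMINAL_EVENTS


# An event type the whitelist has never heard of:
unknown = "assistant.message_delta"
print(still_processing_whitelist(unknown))  # False: session looks finished
print(still_processing_blacklist(unknown))  # True: session is treated as active
```

With a whitelist, every new event type risks a false "finished" reading; with a blacklist, only the short list of terminal events has to be kept accurate.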
## Model selection fixes
- `GetSessionModel` prioritizes user's explicit choice over
usage-reported model
- Fallback uses `DefaultModel` instead of alphabetical first (was Haiku)
- `SessionStartEvent` no longer overwrites user model choice after
abort/reconnect
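The priority order described above might look like this in outline. This is a hedged Python sketch: the function names, parameters, and model names are illustrative and do not reflect the real `GetSessionModel` signature.

```python
def get_session_model(user_choice, usage_reported, default_model):
    """Pick a model: explicit user choice, then usage-reported, then the default."""
    for candidate in (user_choice, usage_reported, default_model):
        if candidate:
            return candidate
    return None


def old_fallback(available_models):
    """The previous fallback as described: alphabetical first."""
    return sorted(available_models)[0]


models = ["sonnet", "haiku", "opus"]          # hypothetical model names
print(old_fallback(models))                     # haiku
print(get_session_model(None, None, "sonnet"))  # sonnet
```

The same ordering explains the `SessionStartEvent` fix: a model reported by the server after reconnect sits in the second slot and cannot displace an explicit user choice.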
## Known limitation
The CLI headless server does not always write `session.idle` to
events.jsonl when the subscriber is disconnected (issue #299). The
poller cannot detect these completions and falls back to the 600s
watchdog timeout. Clean repro added to #299.
## Testing
- 3059/3059 tests passing
- Updated StuckSessionRecoveryTests and EventsJsonlParsingTests for
blacklist approach
- Verified end-to-end with live multi-agent sessions across multiple
relaunches
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The UI renders as soon as `InitializeAsync` returns. Session restore NEVER blocks the UI thread (runs via `Task.Run`). If you see a blue screen, the problem is in phase 1 (UI thread), not phase 2.

### Instrumentation
`[STARTUP-TIMING]` log tags in `~/.polypilot/console.log`:
```
[STARTUP-TIMING] RestoreSessionsInBackground: 35095ms ← total background time
```

### How to debug startup slowness

**Step 1: Measure the UI-visible delay**
```
BlazorDevFlow ConfigureHandler → Dashboard Restoring UI state
```
Find both timestamps in console.log for the current PID. The gap is the user-visible startup time. Normal: 5-8 seconds.

**Step 2: Identify which phase is slow**
- If `[STARTUP-TIMING] Pre-restore` is high (>500ms): `LoadOrganization` or `InitializeAsync` is slow on the UI thread
- If `[STARTUP-TIMING] Session loop` is high but UI rendered fast: background restore is slow but not blocking the user — acceptable
- If neither timing tag appears: the slowdown is in Blazor framework init (before our code runs) — check system load, WebView issues
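
The Step 2 checks can be automated with a short script. The Python sketch below assumes every tag follows the `[STARTUP-TIMING] Name: 1234ms` shape of the one sample line shown under Instrumentation. That shape is an assumption for `Pre-restore` and `Session loop`, and the function names are illustrative.

```python
import re

# Matches e.g. "[STARTUP-TIMING] RestoreSessionsInBackground: 35095ms"
TIMING_RE = re.compile(r"\[STARTUP-TIMING\]\s+(?P<name>.+?):\s+(?P<ms>\d+)ms")


def parse_startup_timings(log_text):
    """Collect {tag name: milliseconds} from console.log text (last value wins)."""
    timings = {}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            timings[match.group("name")] = int(match.group("ms"))
    return timings


def diagnose(timings, pre_restore_limit_ms=500):
    """Apply the Step 2 rules to the parsed timings."""
    if not timings:
        return "no timing tags: check Blazor framework init, system load, WebView"
    if timings.get("Pre-restore", 0) > pre_restore_limit_ms:
        return "Pre-restore is slow: LoadOrganization or InitializeAsync on the UI thread"
    return "UI-thread phase looks fine: remaining time is background restore"
```

Feeding it the lines for a single PID keeps an earlier launch in the same log file from overwriting the values being compared.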

**Step 3: A/B test against main**
```bash
# Stash local changes, then switch to main
git stash && git checkout main
cd PolyPilot && ./relaunch.sh
# Measure: BlazorDevFlow Config → Dashboard Restore gap

# Switch back and restore the stashed changes
git checkout <branch> && git stash pop
cd PolyPilot && ./relaunch.sh
# Compare gaps
```
⚠️ **Use `git stash` + `git checkout` — do NOT `git checkout origin/main -- .`** (checks out files but keeps wrong branch name, confuses the app's branch display).

**Step 4: Common false alarms**
- CPU load from concurrent test runs or builds causes 2-3x slowdown in Blazor init
- DLL file locks from running app cause build failures (retry after a few seconds)
- The first launch after a `dotnet clean` is always slower (JIT compilation)
- Run the comparison 2-3 times each to account for variance

### Critical rules
- **NEVER block the UI thread during restore.** All session loading uses `Task.Run` + `ConfigureAwait(false)`. Violations cause blue screen.
- **`LoadPersistedSessions()` is O(N) on ALL session directories** (750+). Never call from `InitializeAsync` or any UI-triggered path. See PERF-2.
- **InvokeOnUI callbacks during restore compete with Blazor rendering.** Minimize them. The restore path should batch state changes and call `NotifyStateChanged` once at the end, not per-session.
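
The last rule (batch, then notify once) can be shown in a few lines. This Python sketch uses a hypothetical `SessionStore` class, with an `on_changed` callback standing in for `NotifyStateChanged`. It is not the app's real restore code.

```python
class SessionStore:
    """Toy model of the restore path: many state changes, one UI notification."""

    def __init__(self, on_changed):
        self._on_changed = on_changed  # stand-in for NotifyStateChanged
        self.sessions = []

    def restore_all(self, loaded_sessions):
        """Apply every restored session, then notify the UI exactly once."""
        for session in loaded_sessions:
            self.sessions.append(session)  # no per-session notification here
        if loaded_sessions:
            self._on_changed()
```

With 750+ session directories, the per-session variant would queue hundreds of UI callbacks that compete with rendering, while this variant queues one.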
-(RestoreSingleSessionAsync) that must initialize watchdog-dependent state,
+(RestorePreviousSessionsAsync / EnsureSessionConnectedAsync) that must initialize
+watchdog-dependent state,
(8) Modifying ReconcileOrganization or any code that reads Organization.Sessions
during the IsRestoring window, (9) Session appears hung or unresponsive after tool use.
Covers: 18 invariants from 13 PRs of fix cycles,
-the 16 code paths that set/clear IsProcessing, and common regression patterns.
+the 21+ code paths that set/clear IsProcessing, and common regression patterns.
---
# Processing State Safety
@@ -151,9 +152,9 @@ cross-thread fields without a tracking comment explaining the gap.
causing stale renders.
### INV-9: Session restore must initialize all watchdog-dependent state
-The restore path (`RestoreSingleSessionAsync`) is separate from `SendPromptAsync`.
-Any field that affects watchdog timeout selection or dispatch routing must be
-initialized in BOTH paths:
+The restore path (`RestorePreviousSessionsAsync` + `EnsureSessionConnectedAsync`) is
+separate from `SendPromptAsync`. Any field that affects watchdog timeout selection or
+dispatch routing must be initialized in BOTH paths:
- `IsMultiAgentSession`: set via `IsSessionInMultiAgentGroup()` before `StartProcessingWatchdog`
- `HasReceivedEventsSinceResume` / `HasUsedToolsThisTurn`: set via `GetEventsFileRestoreHints()`
- `IsResumed`: set on the `AgentSessionInfo` when `isStillProcessing` is true
@@ -315,10 +316,10 @@ complete the response while sub-agents are still working.
5. **Missing state initialization on session restore**: `IsMultiAgentSession`,
`IsResumed`, and other flags must be set on restored sessions BEFORE
`StartProcessingWatchdog` is called. The restore path in
-`RestoreSingleSessionAsync` is separate from `SendPromptAsync` and must
-independently initialize all state the watchdog depends on. PR #284 fixed
-`IsMultiAgentSession` not being set during restore, causing the watchdog
-to use 120s instead of 600s for multi-agent workers.
+`RestorePreviousSessionsAsync` / `EnsureSessionConnectedAsync` is separate
+from `SendPromptAsync` and must independently initialize all state the watchdog
+depends on. PR #284 fixed `IsMultiAgentSession` not being set during restore,
+causing the watchdog to use 120s instead of 600s for multi-agent workers.

**Retired mistake (was #2):** *ActiveToolCallCount as sole tool signal* is still relevant per
INV-5, but the more impactful version is #2 above (suppressing the fallback entirely).
@@ -345,15 +346,26 @@ When a session shows "Thinking..." indefinitely:
| Symptom | Likely Cause | Fix |
|---------|-------------|-----|
|`[SEND]` then silence | SDK never responded, watchdog will catch at 120s | Wait or abort |
-| `[EVT] TurnEnd` but no `[IDLE]` | Zero-idle SDK bug | Watchdog catches at 30s fallback (INV-10) |
-| `[IDLE-DEFER]` then long silence | Background tasks (sub-agents/shells) active but never completed | Check agent status; watchdog will eventually catch (INV-18) |
+| `[EVT] TurnEnd` but no `[IDLE]` | `session.idle` is ephemeral (never on disk). If live events stopped: IDLE-DEFER deferred the idle and `IsProcessing` was cleared before the follow-up arrived (#403) | Watchdog catches; #403 fix re-arms IsProcessing |
+| `[IDLE-DEFER]` then long silence | Background tasks (sub-agents/shells) active | Check `[IDLE-DIAG]` for backgroundTasks count; watchdog will eventually catch (INV-18) |
+| `[IDLE-DEFER]` with `IsProcessing=False` | IDLE-DEFER fired but IsProcessing was already cleared by watchdog/reconnect | #403 IDLE-DEFER-REARM fix re-arms IsProcessing |
|`[COMPLETE]` fired but spinner persists | UI thread not notified | Check INV-2, INV-8 |
|`[WATCHDOG]` clears but re-sticks | New turn started before watchdog callback ran | Check INV-3 generation guard |
+| After relaunch: session shows "Working" for 600s | Session was active on CLI during relaunch; poll-then-resume waiting for `session.shutdown` (only disk-persisted terminal event) | Normal: watchdog clears at 600s. `session.idle` is ephemeral, never on disk |

5. **Nuclear option**: user clicks Stop (AbortSessionAsync, path #5/#6).

+## Key Facts About session.idle
+
+- `session.idle` is **`ephemeral: true`** in the SDK schema, so it is intentionally NOT written to events.jsonl
+- Events written to disk: `session.start`, `session.resume`, `session.shutdown` (NOT `session.idle`)
+- PolyPilot receives `SessionIdleEvent` over the live SDK event stream
+- CLI correctly populates `backgroundTasks` field (proven with `[IDLE-DIAG]` instrumentation)
+- When `backgroundTasks` is active, IDLE-DEFER defers completion until a subsequent idle with empty backgroundTasks
+- After app relaunch, the poll-then-resume pattern watches events.jsonl for `session.shutdown` only (the only terminal event on disk)
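
The IDLE-DEFER behaviour, including the #403 re-arm, can be modelled as a small state machine. The Python sketch below is an interpretation of the facts listed above, not the real handler: the `Session` fields and the returned tags are illustrative stand-ins for `IsProcessing` and the `[IDLE-DEFER]` / `[COMPLETE]` log lines.

```python
from dataclasses import dataclass


@dataclass
class Session:
    is_processing: bool = True   # stand-in for IsProcessing
    idle_deferred: bool = False  # True while waiting for a follow-up idle


def on_session_idle(session, background_tasks):
    """Handle one live session.idle event and return a tag for the action taken."""
    if background_tasks:
        # Sub-agents or shells are still running: defer completion.
        session.idle_deferred = True
        if not session.is_processing:
            # The flag was cleared earlier (watchdog or reconnect). Re-arm it so
            # the session stays marked busy until the follow-up idle arrives.
            session.is_processing = True
            return "IDLE-DEFER-REARM"
        return "IDLE-DEFER"
    # Empty or missing backgroundTasks: the turn is really over.
    session.idle_deferred = False
    session.is_processing = False
    return "COMPLETE"
```

A sequence of idle with tasks, watchdog clear, idle with tasks, idle without tasks therefore yields `IDLE-DEFER`, `IDLE-DEFER-REARM`, `COMPLETE` in this model.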
**CRITICAL**: Every code path that sets `IsProcessing = false` must clear 9 companion fields and call `FlushCurrentResponse`. This is the most recurring bug category (13 PRs of fix/regression cycles). **Read `.claude/skills/processing-state-safety/SKILL.md` before modifying ANY processing path.** There are 15+ such paths across CopilotService.cs, Events.cs, Bridge.cs, Organization.cs, and Providers.cs.
### Content Persistence
-`FlushCurrentResponse` is also called on `AssistantTurnEndEvent` to persist accumulated response text at each sub-turn boundary. This prevents content loss if the app restarts between `turn_end` and `session.idle` (e.g., "zero-idle sessions" where the SDK never emits `session.idle`). The flush includes a dedup guard to prevent duplicate messages from event replay on resume.
+`FlushCurrentResponse` is also called on `AssistantTurnEndEvent` to persist accumulated response text at each sub-turn boundary. This prevents content loss if the app restarts between `turn_end` and `session.idle`. When the IDLE-DEFER logic defers `session.idle` (active background tasks), the flush ensures content from the foreground turn is saved. The flush includes a dedup guard to prevent duplicate messages from event replay on resume.
### Processing Watchdog
The processing watchdog (`RunProcessingWatchdogAsync` in `CopilotService.Events.cs`) detects stuck sessions by checking how long since the last SDK event. It checks every 15 seconds and has three timeout tiers:
@@ -252,7 +252,7 @@ The processing watchdog (`RunProcessingWatchdogAsync` in `CopilotService.Events.
For multi-agent sessions, Case B also checks **file-size-growth**: if events.jsonl hasn't grown for `WatchdogCaseBMaxStaleChecks` (2) consecutive deferrals, the session is force-completed — the connection is dead. This catches `ConnectionLostException` scenarios where mtime stays fresh but no new data arrives, reducing detection from 30+ min to ~360s (3 cycles: 1 baseline + 2 stale checks). The 1800s freshness window is preserved.
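
The file-size-growth check amounts to counting consecutive deferrals without growth. In the Python sketch below, `StaleFileDetector` is a hypothetical name, and `max_stale_checks=2` mirrors `WatchdogCaseBMaxStaleChecks`. The real watchdog presumably combines this with the mtime freshness window, which is omitted here.

```python
class StaleFileDetector:
    """Track events.jsonl size across watchdog deferrals."""

    def __init__(self, max_stale_checks=2):
        self.max_stale_checks = max_stale_checks
        self._last_size = None
        self._stale = 0

    def check(self, current_size):
        """Record one deferral; return True when the session should be force-completed."""
        if self._last_size is None or current_size > self._last_size:
            self._stale = 0   # first call is the baseline; growth resets the count
        else:
            self._stale += 1  # no new data since the previous check
        self._last_size = current_size
        return self._stale >= self.max_stale_checks
```

Three calls with an unchanged size return `False`, `False`, `True`, matching the "1 baseline + 2 stale checks" count. At roughly 120s per cycle, as the ~360s figure suggests, that is where the detection time comes from.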

-Note: Some sessions never receive `session.idle` events (SDK/CLI bug). In these "zero-idle" cases, `IsProcessing` is only cleared by the watchdog or user abort. The turn_end flush (see Content Persistence above) ensures response content is not lost.
+Note: `session.idle` is an ephemeral event (`ephemeral: true` in the SDK schema): it is delivered over the live event stream but intentionally NOT written to `events.jsonl`. When `session.idle` includes active `backgroundTasks` (sub-agents, shells), the IDLE-DEFER logic defers completion until a subsequent idle arrives with empty/null backgroundTasks. In rare cases where `IsProcessing` was already cleared (by watchdog timeout or reconnect) before the deferred idle arrives, the session may appear stuck until the watchdog fires again (see issue #403).

When the watchdog fires, it marshals state mutations to the UI thread via `InvokeOnUI()` and adds a system warning message.