fix(desktop): resolve WSL backend getting stuck connecting (#3611)#3623
fix(desktop): resolve WSL backend getting stuck connecting (#3611)#3623Jgratton24 wants to merge 7 commits into
Conversation
Three independently-mergeable changes addressing two distinct root causes of the same code=0/no-port-bound symptom, plus the missing recovery path that turns either into a permanent hang: 1. Hold the WSL bootstrap stdin stream open (Stream.concat(Stream.never)) instead of closing it the instant the single JSON chunk drains. The prior behavior raced stdin EOF against the child's readline 'line' vs 'close' listeners across the extra wsl.exe relay hop, sometimes losing the bootstrap envelope entirely. Adds a regression test capturing command.options.stdin and asserting it never completes for the WSL path, plus a companion assertion that the Windows-native fd3 path is unaffected. 2. Validate the resolved WSL Node's version against nodeEngineRange during ensureNodePty preflight. Previously, once any node resolved the version was never checked, so a too-old Node (e.g. nvm default 18) would load and run the server bundle, but import.meta.main is undefined pre-20.11/22.x, so the launch gate never fires and the process exits 0 with no output -- identical symptoms to (1), entirely independent cause. 3. Cap post-preflight, pre-readiness restart loops the same way the pre-spawn fatal-preflight path already is. Previously any clean exit before readiness (including both causes above) looped forever with no escalation, which is why a persisted wslOnly:true could trap a user across restarts/reinstalls. Gated to bootstrapDelivery:"stdin" so a Windows-native primary crash-looping for an unrelated reason never shows the WSL-specific "falling back to Windows" dialog.
readBootstrapEnvelope's cleanup() (closes the readline interface, destroys its backing stream) was only ever wired as the Effect async registration's interrupt finalizer, so it never ran on a normal successful parse, decode failure, fd-unavailable, or stream-close outcome. The readline interface and its stream stayed attached to the bootstrap fd for the rest of the process lifetime. That was harmless while the desktop side closed stdin immediately after writing the bootstrap line. It stopped being harmless once the WSL bootstrap path (this branch's earlier stdin-EOF-race fix) started holding stdin open indefinitely: a readline interface left idling on a never-closing, wsl.exe-relayed stdin pipe surfaced a spurious EAGAIN 'error' event well after the envelope was already read. Node's readline.Interface registers its own internal 'error' listener on the input stream and re-emits it onto itself; since nothing here listened for 'error' on the interface itself, the unhandled re-emit crashed the process before it ever logged "Listening on" -- the WSL backend crash-looping observed during live validation. Call cleanup() at the start of every terminal branch (handleError, handleLine, handleClose) so the reader stops touching the fd as soon as it resolves one way or another, instead of only on interruption. Updated bootstrap.test.ts: now that cleanup() always destroys the underlying autoClose stream, two tests that separately closed the same fd in an acquireRelease finalizer raced that destroy and hit EBADF; switched them to the no-acquireRelease pattern already used elsewhere in this file for the same reason. Added a regression test asserting the fd is actually closed after a successful parse.
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
ApprovabilityVerdict: Needs human review 1 blocking correctness issue found. Multiple unresolved review comments identify potential bugs in the never-ready counter logic - it may count exits that already reached readiness and may mishandle fd3 vs stdin paths. These substantive correctness concerns about the fix itself warrant human review before merging. You can customize Macroscope's approvability policy. Learn more. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63c4409383
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| config.value.bootstrapDelivery === "stdin" && | ||
| nextState.restartAttempt >= MAX_PREFLIGHT_FAILURE_ATTEMPTS |
There was a problem hiding this comment.
Track never-ready exits separately from restart attempts
When WSL has several transient preflight retries first (for example WSL cold-start/wslpath failures), scheduleRestart has already incremented restartAttempt, and a later clean preflight does not reset it. This new cap therefore treats the first backend process exit before readiness as if it had already exited five times, invoking onPreflightFailed and potentially disabling/falling back from WSL after a single process failure rather than after consecutive never-ready exits. A separate consecutive-exit counter, or resetting this counter once preflight succeeds and the process is spawned, would avoid conflating these paths.
Useful? React with 👍 / 👎.
Check the resolved Node version against nodeEngineRange only once the rest of the preflight probe has succeeded (exitCode === 0), instead of unconditionally as soon as a node path resolves. This avoids masking a more specific, unrelated probe failure (e.g. the exitCode === 3 unpacked-deps case) behind a Node-version error in the rare case both conditions are true at once.
…ing Node version Three independent automated reviewers (Macroscope, Cursor Bugbot, Codex) flagged that the never-ready cap reused restartAttempt, which is also incremented by ordinary pre-spawn preflight retries (e.g. WSL cold-start) and never reset on a fresh start. This let the cap trip on far fewer than 5 actual post-spawn never-ready exits, and gave no fresh retry budget after it fired once. Add neverReadyAttempt, a counter scoped strictly to post-spawn exits, reset on a successful onReady or a clean start from a stopped state (mirroring the existing preflightFailureAttempt pattern). Also fail closed in the WSL Node engine-range check: if a range is required but the probe didn't report a version, reject rather than silently letting an unchecked Node through (Cursor Bugbot).
There was a problem hiding this comment.
🟡 Medium
finalizeRun unconditionally increments neverReadyAttempt on every exit, including runs that had already reached readiness. If a healthy WSL backend becomes ready, later exits, and the restarted process starts failing before readiness, the counter starts from 1 instead of 0, so the never-ready fallback trips after only 4 consecutive never-ready exits instead of the intended 5. The counter should only increment for runs that never reached readiness.
🤖 Copy this AI Prompt to have your agent fix this:
In file @apps/desktop/src/backend/DesktopBackendManager.ts around line 625:
`finalizeRun` unconditionally increments `neverReadyAttempt` on every exit, including runs that had already reached readiness. If a healthy WSL backend becomes ready, later exits, and the restarted process starts failing before readiness, the counter starts from `1` instead of `0`, so the never-ready fallback trips after only 4 consecutive never-ready exits instead of the intended 5. The counter should only increment for runs that never reached readiness.
| ...latest, | ||
| active: Option.none<ActiveBackendRun>(), | ||
| ready: false, | ||
| neverReadyAttempt: latest.neverReadyAttempt + 1, |
There was a problem hiding this comment.
Counts ready run exits
Medium Severity
finalizeRun always increments neverReadyAttempt on every process exit, even when latest.ready was already true for that run. That counts operational shutdowns after a successful HTTP readiness check toward the “exited without becoming ready” stdin cap, which can push a healthy WSL backend toward the Windows fallback after later bootstrap failures.
Reviewed by Cursor Bugbot for commit 4d3e49b. Configure here.
… raw stream An adversarial review correctly found that registering the error handler on the raw stream (rather than the readline Interface) left a gap: an error on the stream before any line ever arrives is an uncaught exception regardless of cleanup(), because readline's own internal error listener is attached to the stream first and re-emits onto the interface with no listener, which throws before our handler ever runs. Verified directly against the production WSL Node version (v24.14.0): an error emitted on the stream before any line/close event crashes the process when only a stream-level listener is registered, and is handled cleanly once the listener moves to the interface. The previous fix's cleanup()-on-every-terminal-path logic still matters for errors *after* a line has been read or the reader has otherwise started winding down; this closes the remaining pre-first-line gap.
…sn't take Extract surfaceFatalFailure as the common tail of the fatal-preflight cap and the never-ready cap, and give the never-ready path the same already-surfaced guard the preflight path has: if the cap fired and the config still resolves a stdin backend that exits before ready, park the instance instead of re-showing the fallback dialog on every exit. Also fix two type errors in the new tests (stdin captures the full CommandInput | StdinConfig union, fd3 keeps its PlatformError channel).
Drop parseNodeVersion in favor of parseToolchainReport().nodeVersion, and source the out-of-range message from formatMissingToolsReason so the two paths can't give conflicting nvm advice (the resolved node path is still appended to pin down which install needs switching). A version mismatch is deterministic between back-to-back preflights, so return retryLimit: 1 and surface the dialog on the first attempt instead of burning the default five rounds; the fail-closed missing-version case keeps the default allowance for one-off probe hiccups. The probe also no longer pays a second Node cold start per preflight: the existing heredoc invocation now reports nodeVersion over fd 3.
…gg#3611) Consolidates fixes from PRs pingdotgg#3613, pingdotgg#3621, and pingdotgg#3623 to address both root causes and add a fail-safe: 1. DesktopWslEnvironment: Parse the Node version in the WSL node-pty probe and validate it against options.nodeEngineRange. Preflight now fails cleanly with an actionable error if the distro's default Node is incompatible. 2. DesktopBackendManager: Track neverReadyAttempt to cap consecutive post-spawn exits on stdin delivery. Prevents the desktop from permanently getting stuck restarting if the WSL backend consistently fails before readiness. 3. bootstrap: Clean up the readline interface on all event paths to prevent leaks when stdin stays open, and register the error listener on the readline interface to safely catch early stream errors.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 80f8711. Configure here.
| yield* logInstanceError("backend still never-ready after fallback; stopping", { | ||
| reason, | ||
| attempt: nextState.neverReadyAttempt, | ||
| }); |
There was a problem hiding this comment.
Never-ready cap counts fd3 exits
Medium Severity
neverReadyAttempt increases on every backend exit before readiness, including bootstrapDelivery:'fd3', but the over-cap branch that stops without calling onPreflightFailed runs only for stdin. After a Windows fallback keeps crash-looping with desiredRunning still true, switching back to WSL can exceed the cap on the first failed spawn and park silently with no fallback dialog.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 80f8711. Configure here.


Fixes #3611
What Changed
Five related fixes for the WSL backend getting permanently stuck on "Connecting to WSL...":
DesktopBackendManager.ts), fixing a race where the envelope could be lost if stdin closed before the WSL-relayed reader finished reading it. Same mechanism as fix(desktop): prevent WSL backend exiting with code 0 by keeping stdi… #3613.DesktopWslEnvironment.ts), so an incompatiblenvmdefault (e.g. Node 18) is caught at preflight instead of the server silently no-op'ing, and fail closed if the probe doesn't report a version at all when a range is required. Same fix as fix(desktop): Validate WSL node version against engine range after probe success #3621.DesktopBackendManager.ts), so any backend that keeps exiting before becoming ready surfaces an actionable error and recovers automatically — including across reinstall, where a persistedwslOnly: truepreviously left the app stuck with no way out. Uses a dedicated counter scoped strictly to post-spawn exits, separate from the general restart-backoff counter, so ordinary pre-spawn preflight retries (e.g. WSL cold-start) can't trip it early, and resets on a fresh start so the cap firing once doesn't permanently remove the retry budget.apps/server/src/bootstrap.ts).apps/server/src/bootstrap.ts), closing a gap where an error on the stream before any line ever arrived was still an uncaught exception even with fix light mode #4 in place — Node'sreadline.Interfacere-emits stream errors onto itself, and nothing was listening there.Why
#3611 has two independent root causes that produce the identical "stuck connecting" symptom, plus a structural gap that turns either one into a permanent trap:
wslOnlyback tofalse.Fixes #4 and #5 exist because of what we found testing #1 against a real WSL2 box rather than just unit tests: holding stdin open forever (#1 / #3613's approach) only works if something on the server side actually stops reading from that now-permanently-open stream once it has what it needs, and never crashes on a stream-level error at any point in its lifecycle. The existing bootstrap reader did neither. We observed this live once: the WSL backend crash-looped 5/5 times with an uncaught
EAGAINbefore ever logging "Listening on", with #1 applied but not yet #4. We were not able to independently reproduce that exact EAGAIN trigger in isolation afterward (a standalone script holding stdin open against realwsl.exefor several minutes did not reproduce it), so we're not claiming it's a guaranteed, deterministic failure — treat it as a real observed failure, not a proven-deterministic one.What we have deterministically reproduced, independent of that one observation, is a structurally related bug in the same reader: verified directly against the production WSL Node version (v24.14.0), an error on the bootstrap stream before any line is ever read is an uncaught exception with only a stream-level error listener (fix #4 alone doesn't help here, since
cleanup()is never reached — the process crashes before our own handler ever runs), and is handled cleanly once the listener is registered on the readline interface instead (fix #5). Combined, #4 and #5 mean the bootstrap reader no longer crashes the process on a stream error at any point in its lifecycle, before or after the envelope is parsed. #3613 alone leaves both exposed.We bundled all five here because #4 and #5 only make sense in the context of #1, and #3 is a generic safety net for the failure class #1's and #2's bugs both fall into. We're aware #1 and #2 overlap with #3613 and #3621 respectively and are happy to drop either piece from this PR if those merge first — #3, #4, and #5 are the parts that aren't covered anywhere else yet.
UI Changes
No new UI, but fix #3 means the existing "WSL backend couldn't start, falling back to Windows" dialog (already shown in #3621's after-screenshot) now also fires for failures that happen after a successful preflight, not just preflight failures.
Validation
Validated against the real running app on Windows + WSL2, and against the actual production WSL Node version directly, not just unit tests:
main.DesktopBackendManager.test.ts,DesktopWslEnvironment.test.ts, andapps/server/src/bootstrap.test.ts, including regression tests for the never-ready-counter isolation/reset (Git integration: branch picker + worktrees #3) and the fail-closed missing-version case (Add open-in-editor feature with Cmd+O shortcut #2); existing suites pass with no regressions; typecheck and lint clean. (We could not find a reliable way to add an automated in-process unit test for fix feat: Github Integration #5's exact scenario — Node's ownfs.ReadStreaminternals don't honor module-level mocking ofnode:fsfrom the test process — so that one rests on the direct production-Node verification above rather than an automated test.)Checklist
Note
Medium Risk
Changes desktop child-process bootstrap I/O, WSL preflight gating, and backend restart/fallback behavior—high user impact on Windows+WSL but scoped with fd3/stdin branching and new regression tests.
Overview
Addresses #3611 by fixing several ways the WSL backend could stay on "Connecting" forever.
For stdin-delivered bootstrap (
bootstrapDelivery: 'stdin'), the desktop now keeps stdin open after writing the JSON line (Stream.never) so early EOF cannot race the child's readline reader and drop the envelope. The serverreadBootstrapEnvelopenow tears down readline and the fd stream on every exit path and listens for errors on the readline interface instead of the raw stream, which matters once stdin stays open.DesktopBackendManageradds aneverReadyAttemptcounter (separate from preflight/restart backoff): after five consecutive post-spawn exits before HTTP readiness on the stdin path, it surfaces a fatal failure viaonPreflightFailed(Windows fallback dialog) and stops looping; fd3 Windows-native backends are not capped. Counters reset on readiness and on a fresh start after parking.WSL preflight (
ensureNodePtyImpl) now emitsnodeVersion:from the existing probe and enforcesengines.nodewhen configured—fatal withretryLimit: 1for a known mismatch, fail-closed if the version is missing when a range is required.Extensive unit tests cover stdin vs fd3 streams, never-ready escalation, counter isolation, and engine-range preflight.
Reviewed by Cursor Bugbot for commit 80f8711. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Fix WSL backend getting stuck in connecting state by capping never-ready exits and closing bootstrap stdin
bootstrapDelivery:'stdin', stdin is now held open indefinitely (viaStream.never) instead of closing after writing the bootstrap line, preventing the backend from seeing premature EOF.neverReadyAttemptcounter in DesktopBackendManager.ts: afterMAX_PREFLIGHT_FAILURE_ATTEMPTSconsecutive exits before readiness, a fatal failure is surfaced and the instance either restarts or stops.bootstrapDelivery:'stdin'that previously restarted indefinitely will now stop after hitting the never-ready cap.Macroscope summarized 80f8711.