Skip to content

fix(desktop): resolve WSL backend getting stuck connecting (#3611)#3623

Open
Jgratton24 wants to merge 7 commits into
pingdotgg:mainfrom
Jgratton24:fix/3611-wsl-stuck-connecting
Open

fix(desktop): resolve WSL backend getting stuck connecting (#3611)#3623
Jgratton24 wants to merge 7 commits into
pingdotgg:mainfrom
Jgratton24:fix/3611-wsl-stuck-connecting

Conversation

@Jgratton24

@Jgratton24 Jgratton24 commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Fixes #3611

What Changed

Five related fixes for the WSL backend getting permanently stuck on "Connecting to WSL...":

  1. Hold the WSL bootstrap stdin open instead of closing it right after writing the bootstrap envelope (DesktopBackendManager.ts), fixing a race where the envelope could be lost if stdin closed before the WSL-relayed reader finished reading it. Same mechanism as fix(desktop): prevent WSL backend exiting with code 0 by keeping stdi… #3613.
  2. Validate the WSL Node version against the server's required engine range even when a Node binary does resolve, not just when none resolves at all (DesktopWslEnvironment.ts), so an incompatible nvm default (e.g. Node 18) is caught at preflight instead of the server silently no-op'ing, and fail closed if the probe doesn't report a version at all when a range is required. Same fix as fix(desktop): Validate WSL node version against engine range after probe success #3621.
  3. Cap consecutive post-preflight backend exits and fall back to Windows instead of restarting forever (DesktopBackendManager.ts), so any backend that keeps exiting before becoming ready surfaces an actionable error and recovers automatically — including across reinstall, where a persisted wslOnly: true previously left the app stuck with no way out. Uses a dedicated counter scoped strictly to post-spawn exits, separate from the general restart-backoff counter, so ordinary pre-spawn preflight retries (e.g. WSL cold-start) can't trip it early, and resets on a fresh start so the cap firing once doesn't permanently remove the retry budget.
  4. Detach the bootstrap reader once the envelope is parsed, closing the readline interface and releasing the fd on every terminal path instead of only on interruption (apps/server/src/bootstrap.ts).
  5. Listen for the bootstrap reader's errors on the readline interface, not the raw stream (apps/server/src/bootstrap.ts), closing a gap where an error on the stream before any line ever arrived was still an uncaught exception even with fix light mode #4 in place — Node's readline.Interface re-emits stream errors onto itself, and nothing was listening there.

Why

#3611 has two independent root causes that produce the identical "stuck connecting" symptom, plus a structural gap that turns either one into a permanent trap:

Fixes #4 and #5 exist because of what we found testing #1 against a real WSL2 box rather than just unit tests: holding stdin open forever (#1 / #3613's approach) only works if something on the server side actually stops reading from that now-permanently-open stream once it has what it needs, and never crashes on a stream-level error at any point in its lifecycle. The existing bootstrap reader did neither. We observed this live once: the WSL backend crash-looped 5/5 times with an uncaught EAGAIN before ever logging "Listening on", with #1 applied but not yet #4. We were not able to independently reproduce that exact EAGAIN trigger in isolation afterward (a standalone script holding stdin open against real wsl.exe for several minutes did not reproduce it), so we're not claiming it's a guaranteed, deterministic failure — treat it as a real observed failure, not a proven-deterministic one.

What we have deterministically reproduced, independent of that one observation, is a structurally related bug in the same reader: verified directly against the production WSL Node version (v24.14.0), an error on the bootstrap stream before any line is ever read is an uncaught exception with only a stream-level error listener (fix #4 alone doesn't help here, since cleanup() is never reached — the process crashes before our own handler ever runs), and is handled cleanly once the listener is registered on the readline interface instead (fix #5). Combined, #4 and #5 mean the bootstrap reader no longer crashes the process on a stream error at any point in its lifecycle, before or after the envelope is parsed. #3613 alone leaves both exposed.

We bundled all five here because #4 and #5 only make sense in the context of #1, and #3 is a generic safety net for the failure class #1's and #2's bugs both fall into. We're aware #1 and #2 overlap with #3613 and #3621 respectively and are happy to drop either piece from this PR if those merge first — #3, #4, and #5 are the parts that aren't covered anywhere else yet.

UI Changes

No new UI, but fix #3 means the existing "WSL backend couldn't start, falling back to Windows" dialog (already shown in #3621's after-screenshot) now also fires for failures that happen after a successful preflight, not just preflight failures.

Validation

Validated against the real running app on Windows + WSL2, and against the actual production WSL Node version directly, not just unit tests:

  • Forced the original stuck-connecting bug on unpatched main.
  • Forced the EAGAIN crash live once with fix oxc stack #1 applied alone (no fix light mode #4) — confirmed 5/5 crash-loop. This specific trigger was not independently reproduced in isolation afterward; see the "Why" section for what that does and doesn't establish.
  • Independently and deterministically verified (against production WSL Node v24.14.0, outside the full app) that an error on the bootstrap stream before any line arrives is an uncaught exception without fix feat: Github Integration #5, and is handled cleanly with it.
  • Confirmed WSL-only mode reaches ready in ~25s with zero crashes once all five fixes are applied, and parallel (Windows + WSL) mode has no regression.
  • New unit tests added across DesktopBackendManager.test.ts, DesktopWslEnvironment.test.ts, and apps/server/src/bootstrap.test.ts, including regression tests for the never-ready-counter isolation/reset (Git integration: branch picker + worktrees #3) and the fail-closed missing-version case (Add open-in-editor feature with Cmd+O shortcut #2); existing suites pass with no regressions; typecheck and lint clean. (We could not find a reliable way to add an automated in-process unit test for fix feat: Github Integration #5's exact scenario — Node's own fs.ReadStream internals don't honor module-level mocking of node:fs from the test process — so that one rests on the direct production-Node verification above rather than an automated test.)

Checklist

  • This PR is small and focused — it bundles five fixes; see "Why" for why they're grouped
  • I explained what changed and why
  • I included before/after screenshots for any UI changes — no new UI, see "UI Changes"
  • I included a video for animation/interaction changes — n/a

Note

Medium Risk
Changes desktop child-process bootstrap I/O, WSL preflight gating, and backend restart/fallback behavior—high user impact on Windows+WSL but scoped with fd3/stdin branching and new regression tests.

Overview
Addresses #3611 by fixing several ways the WSL backend could stay on "Connecting" forever.

For stdin-delivered bootstrap (bootstrapDelivery: 'stdin'), the desktop now keeps stdin open after writing the JSON line (Stream.never) so early EOF cannot race the child's readline reader and drop the envelope. The server readBootstrapEnvelope now tears down readline and the fd stream on every exit path and listens for errors on the readline interface instead of the raw stream, which matters once stdin stays open.

DesktopBackendManager adds a neverReadyAttempt counter (separate from preflight/restart backoff): after five consecutive post-spawn exits before HTTP readiness on the stdin path, it surfaces a fatal failure via onPreflightFailed (Windows fallback dialog) and stops looping; fd3 Windows-native backends are not capped. Counters reset on readiness and on a fresh start after parking.

WSL preflight (ensureNodePtyImpl) now emits nodeVersion: from the existing probe and enforces engines.node when configured—fatal with retryLimit: 1 for a known mismatch, fail-closed if the version is missing when a range is required.

Extensive unit tests cover stdin vs fd3 streams, never-ready escalation, counter isolation, and engine-range preflight.

Reviewed by Cursor Bugbot for commit 80f8711. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Fix WSL backend getting stuck in connecting state by capping never-ready exits and closing bootstrap stdin

  • For bootstrapDelivery:'stdin', stdin is now held open indefinitely (via Stream.never) instead of closing after writing the bootstrap line, preventing the backend from seeing premature EOF.
  • Adds a neverReadyAttempt counter in DesktopBackendManager.ts: after MAX_PREFLIGHT_FAILURE_ATTEMPTS consecutive exits before readiness, a fatal failure is surfaced and the instance either restarts or stops.
  • The bootstrap reader in bootstrap.ts now closes its fd immediately after parsing the envelope or on error, preventing lingering readers and late error events.
  • WSL preflight in DesktopWslEnvironment.ts now validates the resolved Node.js version against a configured engine range, returning a fatal failure if the version is missing or out of range.
  • Risk: WSL instances using bootstrapDelivery:'stdin' that previously restarted indefinitely will now stop after hitting the never-ready cap.

Macroscope summarized 80f8711.

Three independently-mergeable changes addressing two distinct root
causes of the same code=0/no-port-bound symptom, plus the missing
recovery path that turns either into a permanent hang:

1. Hold the WSL bootstrap stdin stream open (Stream.concat(Stream.never))
   instead of closing it the instant the single JSON chunk drains. The
   prior behavior raced stdin EOF against the child's readline 'line' vs
   'close' listeners across the extra wsl.exe relay hop, sometimes losing
   the bootstrap envelope entirely. Adds a regression test capturing
   command.options.stdin and asserting it never completes for the WSL
   path, plus a companion assertion that the Windows-native fd3 path is
   unaffected.

2. Validate the resolved WSL Node's version against nodeEngineRange
   during ensureNodePty preflight. Previously, once any node resolved
   the version was never checked, so a too-old Node (e.g. nvm default
   18) would load and run the server bundle, but import.meta.main is
   undefined pre-20.11/22.x, so the launch gate never fires and the
   process exits 0 with no output -- identical symptoms to (1), entirely
   independent cause.

3. Cap post-preflight, pre-readiness restart loops the same way the
   pre-spawn fatal-preflight path already is. Previously any clean exit
   before readiness (including both causes above) looped forever with no
   escalation, which is why a persisted wslOnly:true could trap a user
   across restarts/reinstalls. Gated to bootstrapDelivery:"stdin" so a
   Windows-native primary crash-looping for an unrelated reason never
   shows the WSL-specific "falling back to Windows" dialog.
readBootstrapEnvelope's cleanup() (closes the readline interface,
destroys its backing stream) was only ever wired as the Effect async
registration's interrupt finalizer, so it never ran on a normal
successful parse, decode failure, fd-unavailable, or stream-close
outcome. The readline interface and its stream stayed attached to the
bootstrap fd for the rest of the process lifetime.

That was harmless while the desktop side closed stdin immediately
after writing the bootstrap line. It stopped being harmless once the
WSL bootstrap path (this branch's earlier stdin-EOF-race fix) started
holding stdin open indefinitely: a readline interface left idling on a
never-closing, wsl.exe-relayed stdin pipe surfaced a spurious EAGAIN
'error' event well after the envelope was already read. Node's
readline.Interface registers its own internal 'error' listener on the
input stream and re-emits it onto itself; since nothing here listened
for 'error' on the interface itself, the unhandled re-emit crashed the
process before it ever logged "Listening on" -- the WSL backend
crash-looping observed during live validation.

Call cleanup() at the start of every terminal branch (handleError,
handleLine, handleClose) so the reader stops touching the fd as soon
as it resolves one way or another, instead of only on interruption.

Updated bootstrap.test.ts: now that cleanup() always destroys the
underlying autoClose stream, two tests that separately closed the
same fd in an acquireRelease finalizer raced that destroy and hit
EBADF; switched them to the no-acquireRelease pattern already used
elsewhere in this file for the same reason. Added a regression test
asserting the fd is actually closed after a successful parse.
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 7f58174d-3ee5-47b7-852c-e16a1fa1eda1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions github-actions Bot added vouch:unvouched PR author is not yet trusted in the VOUCHED list. size:L 100-499 changed lines (additions + deletions). labels Jun 30, 2026
Comment thread apps/desktop/src/backend/DesktopBackendManager.ts
Comment thread apps/desktop/src/backend/DesktopBackendManager.ts
Comment thread apps/desktop/src/backend/DesktopBackendManager.ts
Comment thread apps/desktop/src/wsl/DesktopWslEnvironment.ts Outdated
@macroscopeapp

macroscopeapp Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Approvability

Verdict: Needs human review

1 blocking correctness issue found. Multiple unresolved review comments identify potential bugs in the never-ready counter logic - it may count exits that already reached readiness and may mishandle fd3 vs stdin paths. These substantive correctness concerns about the fix itself warrant human review before merging.

You can customize Macroscope's approvability policy. Learn more.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63c4409383

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +655 to +656
config.value.bootstrapDelivery === "stdin" &&
nextState.restartAttempt >= MAX_PREFLIGHT_FAILURE_ATTEMPTS

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Track never-ready exits separately from restart attempts

When WSL has several transient preflight retries first (for example WSL cold-start/wslpath failures), scheduleRestart has already incremented restartAttempt, and a later clean preflight does not reset it. This new cap therefore treats the first backend process exit before readiness as if it had already exited five times, invoking onPreflightFailed and potentially disabling/falling back from WSL after a single process failure rather than after consecutive never-ready exits. A separate consecutive-exit counter, or resetting this counter once preflight succeeds and the process is spawned, would avoid conflating these paths.

Useful? React with 👍 / 👎.

Check the resolved Node version against nodeEngineRange only once
the rest of the preflight probe has succeeded (exitCode === 0),
instead of unconditionally as soon as a node path resolves. This
avoids masking a more specific, unrelated probe failure (e.g. the
exitCode === 3 unpacked-deps case) behind a Node-version error in
the rare case both conditions are true at once.
…ing Node version

Three independent automated reviewers (Macroscope, Cursor Bugbot, Codex)
flagged that the never-ready cap reused restartAttempt, which is also
incremented by ordinary pre-spawn preflight retries (e.g. WSL cold-start)
and never reset on a fresh start. This let the cap trip on far fewer than
5 actual post-spawn never-ready exits, and gave no fresh retry budget
after it fired once. Add neverReadyAttempt, a counter scoped strictly to
post-spawn exits, reset on a successful onReady or a clean start from a
stopped state (mirroring the existing preflightFailureAttempt pattern).

Also fail closed in the WSL Node engine-range check: if a range is
required but the probe didn't report a version, reject rather than
silently letting an unchecked Node through (Cursor Bugbot).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium

finalizeRun unconditionally increments neverReadyAttempt on every exit, including runs that had already reached readiness. If a healthy WSL backend becomes ready, later exits, and the restarted process starts failing before readiness, the counter starts from 1 instead of 0, so the never-ready fallback trips after only 4 consecutive never-ready exits instead of the intended 5. The counter should only increment for runs that never reached readiness.

🤖 Copy this AI Prompt to have your agent fix this:
In file @apps/desktop/src/backend/DesktopBackendManager.ts around line 625:

`finalizeRun` unconditionally increments `neverReadyAttempt` on every exit, including runs that had already reached readiness. If a healthy WSL backend becomes ready, later exits, and the restarted process starts failing before readiness, the counter starts from `1` instead of `0`, so the never-ready fallback trips after only 4 consecutive never-ready exits instead of the intended 5. The counter should only increment for runs that never reached readiness.

...latest,
active: Option.none<ActiveBackendRun>(),
ready: false,
neverReadyAttempt: latest.neverReadyAttempt + 1,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counts ready run exits

Medium Severity

finalizeRun always increments neverReadyAttempt on every process exit, even when latest.ready was already true for that run. That counts operational shutdowns after a successful HTTP readiness check toward the “exited without becoming ready” stdin cap, which can push a healthy WSL backend toward the Windows fallback after later bootstrap failures.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4d3e49b. Configure here.

… raw stream

An adversarial review correctly found that registering the error handler
on the raw stream (rather than the readline Interface) left a gap: an
error on the stream before any line ever arrives is an uncaught exception
regardless of cleanup(), because readline's own internal error listener
is attached to the stream first and re-emits onto the interface with no
listener, which throws before our handler ever runs.

Verified directly against the production WSL Node version (v24.14.0):
an error emitted on the stream before any line/close event crashes the
process when only a stream-level listener is registered, and is handled
cleanly once the listener moves to the interface. The previous fix's
cleanup()-on-every-terminal-path logic still matters for errors *after*
a line has been read or the reader has otherwise started winding down;
this closes the remaining pre-first-line gap.
…sn't take

Extract surfaceFatalFailure as the common tail of the fatal-preflight cap
and the never-ready cap, and give the never-ready path the same
already-surfaced guard the preflight path has: if the cap fired and the
config still resolves a stdin backend that exits before ready, park the
instance instead of re-showing the fallback dialog on every exit.

Also fix two type errors in the new tests (stdin captures the full
CommandInput | StdinConfig union, fd3 keeps its PlatformError channel).
Drop parseNodeVersion in favor of parseToolchainReport().nodeVersion, and
source the out-of-range message from formatMissingToolsReason so the two
paths can't give conflicting nvm advice (the resolved node path is still
appended to pin down which install needs switching).

A version mismatch is deterministic between back-to-back preflights, so
return retryLimit: 1 and surface the dialog on the first attempt instead
of burning the default five rounds; the fail-closed missing-version case
keeps the default allowance for one-off probe hiccups.

The probe also no longer pays a second Node cold start per preflight:
the existing heredoc invocation now reports nodeVersion over fd 3.
jibin7jose pushed a commit to jibin7jose/t3code that referenced this pull request Jul 2, 2026
…gg#3611)

Consolidates fixes from PRs pingdotgg#3613, pingdotgg#3621, and pingdotgg#3623 to address both root causes and add a fail-safe:

1. DesktopWslEnvironment: Parse the Node version in the WSL node-pty probe and validate it against options.nodeEngineRange. Preflight now fails cleanly with an actionable error if the distro's default Node is incompatible.

2. DesktopBackendManager: Track neverReadyAttempt to cap consecutive post-spawn exits on stdin delivery. Prevents the desktop from permanently getting stuck restarting if the WSL backend consistently fails before readiness.

3. bootstrap: Clean up the readline interface on all event paths to prevent leaks when stdin stays open, and register the error listener on the readline interface to safely catch early stream errors.

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 80f8711. Configure here.

yield* logInstanceError("backend still never-ready after fallback; stopping", {
reason,
attempt: nextState.neverReadyAttempt,
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never-ready cap counts fd3 exits

Medium Severity

neverReadyAttempt increases on every backend exit before readiness, including bootstrapDelivery:'fd3', but the over-cap branch that stops without calling onPreflightFailed runs only for stdin. After a Windows fallback keeps crash-looping with desiredRunning still true, switching back to WSL can exceed the cap on the first failed spawn and park silently with no fallback dialog.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 80f8711. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L 100-499 changed lines (additions + deletions). vouch:unvouched PR author is not yet trusted in the VOUCHED list.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Windows desktop 0.0.28 gets stuck on connecting to WSL after enabling WSL support feat: Github Integration

1 participant