
fix(spawn): retry readAlive on lock-acquire timeout #530

Merged
ALRubinger merged 1 commit into main from fix/issue-528-3a-client-lock-retry on May 7, 2026

Conversation

@ALRubinger (Owner)

Summary

Resolves finding 3a from #528 — concurrent first-run hard-failures observed at a 4/10 rate during #454 Test 6.

When the surviving daemon takes daemon.lock, it holds it for its entire lifetime (internal/server.run's defer releaseLock). Other clients that arrive during the daemon's bind window (after the parent client released the spawn lock but before the daemon binds its TCP port) hit discovery.Lock and spin until the 2 s ctx timeout, then return spawn: acquire lock: context deadline exceeded — even though the daemon is right there, just slightly slow to come up.

Fix: after discovery.Lock returns context.DeadlineExceeded, retry readAlive once. By the time 2 s has elapsed, any daemon racing for the lock has either bound its port (so the retry returns its URL) or genuinely failed (in which case the lock-acquire error still propagates).

Test plan

  • TestResolve_RetriesReadAliveOnLockTimeout — external goroutine takes daemon.lock and holds it for the duration; a second goroutine writes daemon.json + binds a listener after 100 ms; Resolve is called with a 400 ms ctx so its lock acquire fails. With the fix, the retry returns the URL; without it, Resolve hard-fails with spawn: acquire lock: context deadline exceeded (verified via stash + rerun).
  • TestResolve_LockTimeoutSurfacedWhenNoDaemon — negative case: lock held, no daemon ever appears. Confirms the lock-acquire error still propagates so the fix doesn't silently mask all timeouts.
  • All existing spawn tests still pass.
  • Coverage: internal/daemon/spawn 90.5%.

Notes

This addresses 3a only. Finding 3b (singleton invariant breaks across rm -rf while a daemon survives) is a separate bug with a different fix shape — tracked for a follow-up PR.

🤖 Generated with Claude Code

When the surviving daemon holds daemon.lock for its lifetime
(internal/server.run's `defer releaseLock`), client invocations that
arrive during the daemon's bind window can't acquire the spawn lock
and would hard-fail with "spawn: acquire lock: context deadline
exceeded" — observed at a 4-of-10 rate under #454 Test 6's concurrent
first-run.

Add a single readAlive retry after the lock-acquire timeout: a
context-deadline-exceeded error almost always means a daemon already
owns the lock and is therefore reachable, so probing once before surfacing
the error converts the failure into a successful daemon discovery.
The negative case (lock unreachable AND no daemon) still propagates
the original error.

Refs #528 (finding 3a).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@railway-app

railway-app Bot commented May 7, 2026

🚅 Deployed to the aileron-pr-530 environment in aileron

1 service not affected by this PR
  • docs

@codecov

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.42%. Comparing base (970400c) to head (e6d58be).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #530      +/-   ##
==========================================
- Coverage   82.34%   81.42%   -0.93%     
==========================================
  Files         221      221              
  Lines       21908    21911       +3     
==========================================
- Hits        18041    17840     -201     
- Misses       2758     2981     +223     
+ Partials     1109     1090      -19     
Flag         Coverage Δ
integration  9.47% <0.00%> (-8.09%) ⬇️
unit         77.83% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@ALRubinger ALRubinger merged commit aba8be6 into main May 7, 2026
10 checks passed
@ALRubinger ALRubinger deleted the fix/issue-528-3a-client-lock-retry branch May 7, 2026 19:46