fix(spawn): retry readAlive on lock-acquire timeout #530
Merged
ALRubinger merged 1 commit into main on May 7, 2026
When the surviving daemon holds daemon.lock for its lifetime (internal/server.run's `defer releaseLock`), client invocations that arrive during the daemon's bind window can't acquire the spawn lock and would hard-fail with "spawn: acquire lock: context deadline exceeded". This was observed at a 4-of-10 rate under #454 Test 6's concurrent first-run.

Add a single readAlive retry after the lock-acquire timeout: a context-deadline-exceeded almost always means a daemon already owns the lock and is therefore reachable, so probing once before surfacing the error converts the failure into a successful daemon discovery. The negative case (lock unreachable AND no daemon) still propagates the original error.

Refs #528 (finding 3a).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🚅 Deployed to the aileron-pr-530 environment in aileron. 1 service not affected by this PR.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #530 +/- ##
==========================================
- Coverage 82.34% 81.42% -0.93%
==========================================
Files 221 221
Lines 21908 21911 +3
==========================================
- Hits 18041 17840 -201
- Misses 2758 2981 +223
+ Partials 1109 1090 -19
Summary
Resolves finding 3a from #528 — concurrent first-run hard-failures observed at a 4/10 rate during #454 Test 6.
When the surviving daemon takes `daemon.lock`, it holds it for its entire lifetime (`internal/server.run`'s `defer releaseLock`). Other clients that arrive during the daemon's bind window (after the parent client released the spawn lock but before the daemon binds its TCP port) hit `discovery.Lock` and spin until the 2 s ctx timeout, then return `spawn: acquire lock: context deadline exceeded`, even though the daemon is right there, just slightly slow to come up.

Fix: after `discovery.Lock` returns `context.DeadlineExceeded`, retry `readAlive` once. By the time 2 s has elapsed, any daemon racing for the lock has either bound its port (so the retry returns its URL) or genuinely failed (in which case the lock-acquire error still propagates).

Test plan
- `TestResolve_RetriesReadAliveOnLockTimeout`: an external goroutine takes `daemon.lock` and holds it for the duration; a second goroutine writes `daemon.json` and binds a listener after 100 ms; `Resolve` is called with a 400 ms ctx so its lock acquire fails. With the fix, the retry returns the URL; without it, `Resolve` hard-fails with `spawn: acquire lock: context deadline exceeded` (verified via stash + rerun).
- `TestResolve_LockTimeoutSurfacedWhenNoDaemon`: negative case, where the lock is held and no daemon ever appears. Confirms the lock-acquire error still propagates so the fix doesn't silently mask all timeouts.
- `internal/daemon/spawn` 90.5%.

Notes
This addresses 3a only. Finding 3b (singleton invariant breaks across `rm -rf` while a daemon survives) is a separate bug with a different fix shape, tracked for a follow-up PR.

🤖 Generated with Claude Code