fix(daemon): self-terminate on lock-file inode mismatch#531
Merged
ALRubinger merged 2 commits intomainfrom May 7, 2026
Merged
Conversation
flock(2) is held against an open file descriptor, which pins the inode — not the path. An external `rm -rf ~/.aileron` (or any unlink + recreate at the same path) leaves a running daemon flock'd against an orphaned inode while a fresh daemon flock's a brand-new inode at the same path. Both processes hold valid locks on different underlying inodes and neither knows the other exists, breaking the singleton invariant. Capture the inode of daemon.lock at startup (right after acquiring the flock) and start a watcher goroutine that polls the path every 5s. When the inode changes — or the file is unlinked — the daemon shuts down cleanly. The shutdown path skips discovery.Remove so daemon.json/daemon.pid (which now belong to the daemon that took over) aren't erased. Adds discovery.LockFileInode for the inode probe. Adds tests for: the inode-stable / changes-after-unlink-recreate contract; the end-to-end daemon-shutdown-on-mismatch path; the file-disappeared (ENOENT) branch; the watcher's clean stop on context cancel. Refs #528 (finding 3b). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🚅 Deployed to the aileron-pr-531 environment in aileron 1 service not affected by this PR
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #531 +/- ##
==========================================
- Coverage 82.34% 81.42% -0.93%
==========================================
Files 221 221
Lines 21908 21959 +51
==========================================
- Hits 18041 17880 -161
- Misses 2758 2986 +228
+ Partials 1109 1093 -16
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
The previous test pattern released the first lock before unlinking and re-acquiring, which on Linux ext4 lets the kernel reuse the freed inode for the new file — producing a false-negative inode comparison and a CI-only failure. The real-world bug requires the original fd to stay held while the path is unlinked, so the inode's refcount stays > 0 and the kernel must allocate a fresh inode for the fresh file (the property the watcher relies on). Switch to t.Cleanup-deferred release1 so the original fd is held throughout the test, mirroring the still-running daemon in the production scenario. The server-side end-to-end test already exercises this correctly and was passing on Linux; only the unit test had the wrong shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Resolves finding 3b from #528 — the singleton-invariant violation observed during #454 Test 6 (three concurrently-running daemons each holding flock on a different inode).
flock(2)is held against an open file descriptor, which pins the inode — not the path. An externalrm -rf ~/.aileron(or any unlink + recreate at the same path) leaves a running daemon flock'd against an orphaned inode while a fresh daemon flock's a brand-new inode at the same path. Both processes hold valid locks on different underlying inodes and neither knows the other exists.Approach
daemon.lockat startup, immediately afterdiscovery.Lockreturns (internal/server/main.go).lockInodeWatchInterval(5 s in production, overrideable for tests). When the inode changes — or the file is unlinked — it closes ainodeMismatchchannel. The daemon's mainselectpicks up the signal and initiates clean shutdown.discovery.Removeon the inode-mismatch shutdown path:daemon.jsonanddaemon.pidnow belong to the daemon that took over, and removing them would erase its advertisement. Tracked via anatomic.Boolset just before shutdown.New helper:
discovery.LockFileInode(stateDir) (uint64, error)— wrapsos.Stat+syscall.Stat_t.Inoso the daemon can probe the path without exposing platform internals at the call site. Returns the wrappingos.ErrNotExistwhen the file has been unlinked.Test plan
discovery.TestLockFileInode_StableThenChangesAfterUnlinkRecreate— pins the inode contract: stable across calls; ENOENT after unlink; fresh inode after re-creation. This is the empirical underpinning of 3b.server.TestRun_LockInodeMismatchTriggersShutdown— end-to-end: daemon starts, externalrm+ recreate ofdaemon.lock, watcher detects within ~50 ms (test interval),run()returns nil,daemon.json/daemon.pidare intact.server.TestWatchLockInode_FileDisappeared— ENOENT branch: unlink-but-no-recreate also triggers shutdown.server.TestWatchLockInode_StopsOnContextCancel— leak prevention: watcher exits cleanly on ctx cancel and does not falsely fire the trigger.watchLockInode: 88.2% (only uncovered: transient stat-error fallthrough,log.Warn + continue).Notes
🤖 Generated with Claude Code