fix(daemon): self-terminate on lock-file inode mismatch #531

Merged
ALRubinger merged 2 commits into main from fix/issue-528-3b-inode-watch
May 7, 2026

@ALRubinger
Owner

Summary

Resolves finding 3b from #528 — the singleton-invariant violation observed during #454 Test 6 (three concurrently-running daemons each holding flock on a different inode).

A flock(2) lock belongs to the open file description, which pins the inode, not the path. An external rm -rf ~/.aileron (or any unlink + recreate at the same path) leaves a running daemon flock'd against an orphaned inode while a fresh daemon flocks a brand-new inode at the same path. Both processes hold valid locks on different underlying inodes and neither knows the other exists.

Approach

  1. Capture the inode of daemon.lock at startup, immediately after discovery.Lock returns (internal/server/main.go).
  2. Watcher goroutine stats the path every lockInodeWatchInterval (5 s in production, overridable for tests). When the inode changes, or the file is unlinked, it closes an inodeMismatch channel. The daemon's main select picks up the signal and initiates a clean shutdown.
  3. Skip discovery.Remove on the inode-mismatch shutdown path: daemon.json and daemon.pid now belong to the daemon that took over, and removing them would erase its advertisement. Tracked via an atomic.Bool set just before shutdown.

New helper: discovery.LockFileInode(stateDir) (uint64, error) wraps os.Stat + syscall.Stat_t.Ino so the daemon can probe the path without exposing platform internals at the call site. It returns an error wrapping os.ErrNotExist when the file has been unlinked.

Test plan

  • discovery.TestLockFileInode_StableThenChangesAfterUnlinkRecreate — pins the inode contract: stable across calls; ENOENT after unlink; fresh inode after re-creation. This is the empirical underpinning of 3b.
  • server.TestRun_LockInodeMismatchTriggersShutdown — end-to-end: daemon starts, external rm + recreate of daemon.lock, watcher detects within ~50 ms (test interval), run() returns nil, daemon.json/daemon.pid are intact.
  • server.TestWatchLockInode_FileDisappeared — ENOENT branch: unlink-but-no-recreate also triggers shutdown.
  • server.TestWatchLockInode_StopsOnContextCancel — leak prevention: watcher exits cleanly on ctx cancel and does not falsely fire the trigger.
  • All existing server / discovery / spawn tests still pass.
  • Coverage on watchLockInode: 88.2% (only uncovered: transient stat-error fallthrough, log.Warn + continue).

Notes

🤖 Generated with Claude Code

flock(2) is held against an open file descriptor, which pins the
inode — not the path. An external `rm -rf ~/.aileron` (or any
unlink + recreate at the same path) leaves a running daemon flock'd
against an orphaned inode while a fresh daemon flocks a brand-new
inode at the same path. Both processes hold valid locks on different
underlying inodes and neither knows the other exists, breaking the
singleton invariant.

Capture the inode of daemon.lock at startup (right after acquiring
the flock) and start a watcher goroutine that polls the path every
5s. When the inode changes — or the file is unlinked — the daemon
shuts down cleanly. The shutdown path skips discovery.Remove so
daemon.json/daemon.pid (which now belong to the daemon that took
over) aren't erased.

Adds discovery.LockFileInode for the inode probe. Adds tests for:
the inode-stable / changes-after-unlink-recreate contract; the
end-to-end daemon-shutdown-on-mismatch path; the file-disappeared
(ENOENT) branch; the watcher's clean stop on context cancel.

Refs #528 (finding 3b).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

railway-app Bot commented May 7, 2026

🚅 Deployed to the aileron-pr-531 environment in aileron

1 service not affected by this PR
  • docs


codecov Bot commented May 7, 2026

Codecov Report

❌ Patch coverage is 86.36364% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.42%. Comparing base (970400c) to head (d62f0d0).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #531      +/-   ##
==========================================
- Coverage   82.34%   81.42%   -0.93%     
==========================================
  Files         221      221              
  Lines       21908    21959      +51     
==========================================
- Hits        18041    17880     -161     
- Misses       2758     2986     +228     
+ Partials     1109     1093      -16     
| Flag | Coverage Δ |
|---|---|
| integration | 9.56% <45.45%> (-8.00%) ⬇️ |
| unit | 77.84% <86.36%> (+0.01%) ⬆️ |


The previous test pattern released the first lock before unlinking
and re-acquiring, which on Linux ext4 lets the kernel reuse the
freed inode for the new file — producing a false-negative inode
comparison and a CI-only failure.

The real-world bug requires the original fd to stay held while the
path is unlinked, so the inode's refcount stays > 0 and the kernel
must allocate a fresh inode for the fresh file (the property the
watcher relies on). Switch to t.Cleanup-deferred release1 so the
original fd is held throughout the test, mirroring the still-running
daemon in the production scenario.

The server-side end-to-end test already exercises this correctly
and was passing on Linux; only the unit test had the wrong shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ALRubinger merged commit 35c7729 into main May 7, 2026
15 of 16 checks passed
railway-app Bot temporarily deployed to aileron / aileron-pr-531 May 7, 2026 22:47 (Destroyed)
ALRubinger deleted the fix/issue-528-3b-inode-watch branch May 7, 2026 22:47