From 0b11b105b88843fc13565a7970835281b8deb6f3 Mon Sep 17 00:00:00 2001 From: Kun Chen <3233006+kunchenguid@users.noreply.github.com> Date: Wed, 24 Jun 2026 14:54:42 -0700 Subject: [PATCH 01/22] fix: harden watcher singleton locking (#71) * fix: make watcher singleton race-proof and re-arm home-scoped Concurrent fm_lock_try_acquire could produce two lock holders: the stale-steal path destroyed and recreated the lock dir without serialization, so two racers could both reclaim it. That let two fm-watch.sh watchers run in one home, doubling every wake. - fm-wake-lib.sh: single-winner acquire. Reclaim is serialized through a sibling ".steal" mutex (its abandonment floor decoupled from FM_LOCK_STALE_AFTER, never below 2s) and the holder pid is re-verified dead immediately before the rmdir. Every claim writes its pid and reads it back; a stomped pid means we lost. mkdir is the atomic arbiter, so at most one acquirer ever returns 0. - fm-watch.sh: self-eviction. Each poll a watcher checks the lock still names its pid; if another took over, it exits cleanly, so any transient duplicate self-resolves within one poll. - fm-watch-arm.sh: safe re-arm. Default no-ops via the singleton; --restart signals only this home's recorded watcher pid, never a broad pkill that would kill sibling homes' watchers. - AGENTS.md section 8: route re-arm through fm-watch-arm.sh and warn that pkill -f bin/fm-watch.sh kills other homes' watchers. - Tests: concurrency single-winner, dead-pid steal, live-lock no-op, watcher self-eviction. * no-mistakes(review): Harden watcher lock ownership * no-mistakes(review): Captain, harden watcher lock claims * no-mistakes(review): Captain, harden lock steal mutex ownership * no-mistakes(review): Captain, harden steal-aware lock claims * no-mistakes(document): Document watcher arm workflow --- AGENTS.md | 23 +-- README.md | 13 +- bin/fm-guard.sh | 2 +- bin/fm-supervise-daemon.sh | 10 +- bin/fm-wake-lib.sh | 280 +++++++++++++++++++++++++++----- bin/fm-watch-arm.sh | 76 +++++++++ bin/fm-watch.sh | 22 ++- tests/fm-wake-queue.test.sh | 313 ++++++++++++++++++++++++++++++++++++ 8 files changed, 680 insertions(+), 59 deletions(-) create mode 100755 bin/fm-watch-arm.sh diff --git a/AGENTS.md b/AGENTS.md index f268e4e6..83b25e45 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -488,9 +488,12 @@ It costs zero tokens while running and exits with one reason line when something It also writes each detected wake to the durable queue at `state/.wake-queue` before advancing suppression markers such as `.seen-*`, `.stale-*`, `.last-check`, or `.last-heartbeat`. At the start of every wake-handling turn and every recovery turn, run `bin/fm-wake-drain.sh` before peeking panes, reading status files beyond the reason line, or starting new work. The printed one-shot reason line is still useful, but the drained queue is the lossless backlog. -After handling drained wakes, re-arm `bin/fm-watch.sh` before you end the turn. -The watcher is singleton-safe: if one is already alive with a fresh liveness beacon, another invocation exits cleanly instead of creating a duplicate watcher; if the live holder's beacon is stale, the new invocation exits with an actionable failure. -Do not pkill-and-restart the watcher as a routine operation; just arm it, and let the singleton lock no-op when appropriate. +After handling drained wakes, re-arm the watcher before you end the turn by running `bin/fm-watch-arm.sh` as a background task. +The watcher is singleton-safe: acquisition is race-proof, so under any number of concurrent arms at most one watcher ever holds this home's lock, and a duplicate that somehow starts self-evicts within one poll once it sees the lock no longer names it. +If one is already alive with a fresh liveness beacon, another invocation exits cleanly instead of creating a duplicate watcher; if the live holder's beacon is stale, the new invocation exits with an actionable failure. +Re-arming is the primary model: just run `bin/fm-watch-arm.sh` and let the singleton lock no-op when a healthy watcher is already alive. +If a forced restart is ever genuinely needed, use `bin/fm-watch-arm.sh --restart`, which stops only THIS home's watcher (the pid recorded in this home's `state/.watch.lock`) and starts a fresh one. +Never `pkill -f bin/fm-watch.sh`: that pattern matches every firstmate home's watcher, including secondmate homes that run the same script, so a broad pkill from one home kills sibling homes' watchers. P2 of the watcher reliability design - proactive routing of wakes into supervisor turns for chat-mode / walk-away supervision - is provided by the optional sub-supervisor (`bin/fm-supervise-daemon.sh`, below), which is presence-gated via the `/afk` skill. P3, a blocking-waiter split, remains deferred; the one-shot restart model is otherwise preserved. Waiting on the watcher is intentionally silent. @@ -498,8 +501,10 @@ After arming it, do not send idle progress updates to the captain; wait until it Empty polls, elapsed waiting time, and "still no change" are tool bookkeeping, not conversational progress. ```sh -bin/fm-watch.sh # run in background; exits with: signal|stale|check|heartbeat -bin/fm-wake-drain.sh # drain queued wake records at turn start +bin/fm-watch-arm.sh # safe re-arm; run in background; no-ops if a healthy watcher is alive +bin/fm-watch-arm.sh --restart # home-scoped forced restart; never a broad pkill +bin/fm-watch.sh # the watcher itself; exits with: signal|stale|check|heartbeat +bin/fm-wake-drain.sh # drain queued wake records at turn start ``` On wake, in order of cheapness: @@ -528,7 +533,7 @@ The supervision scripts (`fm-peek`, `fm-send`, `fm-spawn`, `fm-teardown`, `fm-pr So the next time you touch the fleet with queued wakes or no watcher alive, the tool output itself tells you what to do - a pull-based guard that works on any harness, since it rides the script output you already read rather than a harness-specific hook. The grace window keeps normal handling (watcher briefly down between a wake and its re-arm) silent. If a guard warning says queued wakes are pending, drain them before doing anything else. -If a guard warning says watcher liveness is stale, arm `bin/fm-watch.sh` after draining any queued wakes. +If a guard warning says watcher liveness is stale, arm `bin/fm-watch-arm.sh` after draining any queued wakes. Watcher liveness is not enough if you are foreground-blocked. Whenever one or more tasks are in flight, do not run long foreground-blocking operations in your own session. This includes your own no-mistakes pipeline, long builds, and any other multi-minute command. @@ -551,7 +556,7 @@ The `/afk` skill is the explicit trigger: invoking it sets a durable away-mode f **Entering afk.** Invoke the `/afk` skill. It sets `state/.afk` (durable — recovery re-enters afk if the flag survives a restart), ensures the daemon is running (`nohup bin/fm-supervise-daemon.sh &` if the pid is dead or absent), and acknowledges. With afk active: -- **Do not separately arm `fm-watch.sh`.** The daemon manages the watcher; the singleton lock no-ops a stray arm harmlessly, but the daemon is the single owner. +- **Do not separately arm the watcher with `fm-watch-arm.sh` or `fm-watch.sh`.** The daemon manages the watcher; the singleton lock no-ops a stray arm harmlessly, but the daemon is the single owner. - **`fm-wake-drain.sh` still runs at the start of every escalated firstmate turn** - it is the lossless backstop. The daemon routes; the queue guarantees nothing is lost. The two are complementary, not redundant. **In-band sentinel marker (the load-bearing detail).** The daemon injects into the same pane the captain types into, so an escalation would otherwise look like a user message and cancel afk the moment it fired. @@ -561,7 +566,7 @@ The marker travels with the message text; it does not rely on harness-level type **Exiting afk (the captain's contract).** When firstmate receives a message while afk is active: - Leading marker present → **internal escalation**. Stay afk, process it. - Message starts with `/afk` → **afk re-invocation**. Stay afk (refresh the flag); do not treat as a return. -- Anything else → **the captain is back.** Clear `state/.afk`, stop the daemon, flush one distilled "while you were out" catch-up (drain `state/.wake-queue` + summarize any pending `state/.subsuper-escalations` and `state/.subsuper-inject-wedged` marker), and resume full per-wake responsiveness (arm `bin/fm-watch.sh`). +- Anything else → **the captain is back.** Clear `state/.afk`, stop the daemon, flush one distilled "while you were out" catch-up (drain `state/.wake-queue` + summarize any pending `state/.subsuper-escalations` and `state/.subsuper-inject-wedged` marker), and resume full per-wake responsiveness (arm `bin/fm-watch-arm.sh`). **Bias ambiguous cases toward exit** (a present captain beats token savings; a false exit is self-correcting). **Orthogonal to yolo.** afk changes how aggressively firstmate surfaces things, not who approves what. "Away" never means "approves more" — a PR, a needs-decision finding, or anything destructive still waits for the captain's explicit word. @@ -594,7 +599,7 @@ This is why fewer, cheaper firstmate turns handle the same fleet. That empty composer is the acknowledgement that the submit landed, using the same dim-ghost-aware and border-aware detector so a ghost-only or bordered-empty claude composer counts as submitted rather than a false "swallowed Enter". `fm-send.sh` shares this primitive and exits non-zero on a positively-confirmed swallow, so firstmate learns a steer did not land instead of leaving it unsubmitted. - **Marker strip** - `strip_injection_marker` removes the sentinel prefix before classification/relay, so the digest text firstmate sees is clean. -- **Portable singleton lock** - the daemon uses the repo's mkdir-based lock helper (`fm-wake-lib.sh`) instead of `flock`, which is absent on macOS. +- **Portable singleton lock** - the daemon uses the repo's portable lock helper (`fm-wake-lib.sh`) instead of `flock`, which is absent on macOS. - **Dedupe across signal/stale/scan** - `classify_signal` and `classify_stale` both check the seen-status marker before escalating, so a status escalated by one path is not re-escalated by another in the same digest. - **Auto-discovered supervisor pane** - the daemon resolves its injection target from `FM_SUPERVISOR_TARGET`, then `$TMUX_PANE` (inherited from the pane that launched it), then a `firstmate:0` fallback with a warning; the resolution source is logged at startup so a wrong-but-resolving fallback is detectable. diff --git a/README.md b/README.md index 9ab98ff2..aedab9dd 100644 --- a/README.md +++ b/README.md @@ -116,7 +116,9 @@ firstmate works from any terminal - outside tmux, crewmates land in a detached ` - **Event-driven supervision** - a zero-token bash watcher (`bin/fm-watch.sh`) sleeps on the fleet and wakes the first mate only when a crewmate reports, stalls, a PR merges, or an internal heartbeat review is due. Detected wakes are also written to a durable local queue (`state/.wake-queue`) before detector state advances, so a missed one-shot process exit can be recovered by draining the queue. - Routine watcher polling, restarts, elapsed waiting time, and unchanged heartbeat reviews stay silent; an idle crew costs you nothing. + Routine watcher polling, re-arm no-ops, elapsed waiting time, and unchanged heartbeat reviews stay silent; an idle crew costs you nothing. + Routine re-arms go through `bin/fm-watch-arm.sh`, whose default path either execs the watcher or no-ops behind the per-home singleton. + Its `--restart` mode signals only the watcher recorded in the current home's `state/.watch.lock`, so restarting one home cannot kill sibling secondmate watchers. A pull-based guard (`bin/fm-guard.sh`) warns through supervision tool output if tasks are in flight and that watcher stops running or queued wakes are waiting to be drained. A presence-gated sub-supervisor (`bin/fm-supervise-daemon.sh`) extends this for walk-away supervision: the `/afk` skill activates it, after which it self-handles routine wakes in bash and escalates only captain-relevant events as one batched, single-line digest (prefixed with an in-band sentinel marker so firstmate can tell daemon injections apart from real messages). Its injection path shares `bin/fm-tmux-lib.sh` with `fm-send.sh`, so dim-ghost-aware and border-aware composer detection plus verified submit retry stay consistent; stalled escalation delivery raises `state/.subsuper-inject-wedged` after `FM_MAX_DEFER_SECS` instead of silently deferring forever. @@ -161,6 +163,7 @@ The first mate drives these; you rarely need to, but they work by hand too. | `fm-project-mode.sh` | Resolve a project's delivery mode and `+yolo` flag from `data/projects.md` | | `fm-merge-local.sh` | Fast-forward a `local-only` project's local default branch after approval | | `fm-review-diff.sh` | Review a crewmate branch against the authoritative base, with optional `--stat` output | +| `fm-watch-arm.sh` | Safe per-home watcher re-arm; `--restart` stops only this home's recorded watcher before relaunching | | `fm-watch.sh` | Singleton-safe one-shot watcher; blocks until supervision work is due, queues it durably, then exits with one reason line | | `fm-supervise-daemon.sh` | Presence-gated sub-supervisor for walk-away (`/afk`) supervision: wraps `fm-watch.sh`, self-handles routine wakes in bash, and escalates only captain-relevant events as one verified, batched, single-line digest prefixed with a sentinel marker | | `fm-wake-drain.sh` | Atomically drain queued watcher wakes before handling supervision work | @@ -203,7 +206,9 @@ FM_HEARTBEAT=600 # base seconds between fleet reviews; backs off exponent FM_HEARTBEAT_MAX=7200 # heartbeat backoff cap FM_CHECK_INTERVAL=300 # seconds between slow checks (merged-PR polls) FM_CHECK_TIMEOUT=30 # seconds allowed per slow check script +FM_LOCK_STALE_AFTER=2 # seconds before dead-pid lock records can be reclaimed; mid-acquire locks keep at least 2s grace FM_GUARD_GRACE=300 # seconds a stale watcher beacon may age before guard warnings +FM_WATCHER_STALE_GRACE=300 # defaults to FM_GUARD_GRACE; seconds a live watcher lock may have a stale beacon before re-arm errors FM_SIGNAL_GRACE=30 # seconds to coalesce nearby status and turn-end signals into one wake FM_FLEET_SYNC_BOOTSTRAP_TIMEOUT=20 # seconds allowed for bootstrap's best-effort clone refresh FM_FLEET_PRUNE=1 # set to 0 to skip pruning local branches whose upstream is gone @@ -229,14 +234,14 @@ Tracked changes to firstmate itself, including `AGENTS.md`, `README.md`, `CONTRI When supervising live crewmates, keep long validation or build work in the background so watcher wakes can still be handled. Human-authored pull requests targeting `main` must be raised through `git push no-mistakes`; see `CONTRIBUTING.md` for the enforced contributor workflow. Local `.no-mistakes/` state and test evidence stay out of this repo; `.no-mistakes.yaml` keeps evidence in a temp directory instead. -The current watcher reliability work keeps the one-shot process model and adds a durable queue plus singleton lock. +The current watcher reliability work keeps the one-shot process model and adds a durable queue, race-proof singleton lock, duplicate self-eviction, and home-scoped arm wrapper. The presence-gated sub-supervisor (`bin/fm-supervise-daemon.sh`) provides proactive wake routing for walk-away supervision via the `/afk` skill; a blocking-waiter split remains a deferred follow-up phase. ```sh bash -n bin/*.sh # syntax-check the toolbelt shellcheck bin/*.sh tests/*.sh # lint the toolbelt and behavior tests; CI enforces this for test_script in tests/*.test.sh; do "$test_script"; done # behavior tests, matching CI -tests/fm-wake-queue.test.sh # durable wake queue, singleton behavior, sub-supervisor classifier, /afk presence-gating, border-aware composer, max-defer, and fm-send submit tests +tests/fm-wake-queue.test.sh # durable wake queue, race-proof locks, watcher singleton/re-arm behavior, sub-supervisor classifier, /afk presence-gating, border-aware composer, max-defer, and fm-send submit tests tests/fm-composer-ghost.test.sh # dim-ghost stripping, ghost-only composer detection, and escape-free peek tests tests/fm-afk-inject-e2e.test.sh # private-socket end-to-end test of the afk injection path (partial-input deferral, swallowed-Enter retry) tests/fm-bootstrap.test.sh # bootstrap dependency and feature-probe tests @@ -245,5 +250,5 @@ tests/fm-secondmate.test.sh # persistent secondmate routing, seedi tests/fm-teardown.test.sh # fm-teardown.sh safety and reminder checks: local-only fork-remote allow, truly-unpushed refuse, merged-to-main allow, no-mistakes regression, tasks-axi reminder, --force override [ "$(readlink CLAUDE.md)" = "AGENTS.md" ] [ "$(readlink .claude/skills)" = "../.agents/skills" ] -FM_HEARTBEAT=2 FM_POLL=1 bin/fm-watch.sh # watcher smoke test (prints "heartbeat") +FM_HEARTBEAT=2 FM_POLL=1 bin/fm-watch-arm.sh # watcher re-arm smoke test (prints "heartbeat") ``` diff --git a/bin/fm-guard.sh b/bin/fm-guard.sh index b4f8d95b..a6f8f1e7 100755 --- a/bin/fm-guard.sh +++ b/bin/fm-guard.sh @@ -5,7 +5,7 @@ # missing or older than FM_GUARD_GRACE seconds, prints a loud warning so the # agent sees it in the tool output of whatever it was doing - the one channel # every harness has. Normal wake handling (watcher briefly down between a wake -# and its restart) stays inside the grace window and stays silent. +# and its re-arm) stays inside the grace window and stays silent. # Always exits 0: the guard warns, it never blocks. set -u diff --git a/bin/fm-supervise-daemon.sh b/bin/fm-supervise-daemon.sh index 820daee6..f3b9cf68 100755 --- a/bin/fm-supervise-daemon.sh +++ b/bin/fm-supervise-daemon.sh @@ -48,7 +48,7 @@ # have missed (e.g. a status verb outside CAPTAIN_RE) and escalates it. # # The robustness shell from the prior always-inject version is preserved: -# single-instance lock (portable mkdir-based, no flock dependency), crash-loop +# single-instance lock (portable helper, no flock dependency), crash-loop # backoff, pane-gone guard, and a signal-trapped shutdown that flushes buffered # escalations before exit. # @@ -92,7 +92,7 @@ # FM_LOG_MAX_BYTES / FM_LOG_KEEP_LINES / FM_CRASH_* log + crash guards # FM_STATE_OVERRIDE alternate state dir (testing) # Logs each wake to state/.supervise-daemon.log (size-capped). Single -# instance via portable mkdir lock on state/.supervise-daemon.lock. Trapped +# instance via portable lock on state/.supervise-daemon.lock. Trapped # SIGTERM/SIGINT shut down within ~1s, flush escalations, release the # lock. A crashing fm-watch.sh is logged and restarted, never killing # the daemon; a tight crash-restart spin is detected and backed off. @@ -714,8 +714,8 @@ fm_super_main() { STATE="$(_state_root)" mkdir -p "$STATE" - # Source the portable lock helpers (mkdir-based, works on macOS where flock - # is absent). Export FM_STATE_OVERRIDE so the lib resolves the same state dir. + # Source the portable lock helpers (works on macOS where flock is absent). + # Export FM_STATE_OVERRIDE so the lib resolves the same state dir. # shellcheck source=bin/fm-wake-lib.sh FM_STATE_OVERRIDE="$STATE" . "$FM_DAEMON_DIR/fm-wake-lib.sh" @@ -732,7 +732,7 @@ fm_super_main() { [ -x "$WATCH" ] || { echo "error: watcher not found or not executable: $WATCH" >&2; exit 1; } - # --- single instance (portable mkdir-based lock, no flock dependency) ------ + # --- single instance (portable lock, no flock dependency) ------------------ if ! fm_lock_try_acquire "$LOCK"; then if [ -n "${FM_LOCK_HELD_PID:-}" ]; then echo "error: another fm-supervise-daemon is already running (pid $FM_LOCK_HELD_PID, lock $LOCK held)" >&2 diff --git a/bin/fm-wake-lib.sh b/bin/fm-wake-lib.sh index 99b47efd..af2112a6 100755 --- a/bin/fm-wake-lib.sh +++ b/bin/fm-wake-lib.sh @@ -23,6 +23,16 @@ fm_pid_alive() { kill -0 "$pid" 2>/dev/null } +fm_pid_identity() { + local pid=$1 out + case "$pid" in + ''|*[!0-9]*) return 1 ;; + esac + out=$(ps -p "$pid" -o lstart= -o command= 2>/dev/null) || return 1 + [ -n "$out" ] || return 1 + printf '%s\n' "$out" | sed 's/^[[:space:]]*//' +} + fm_path_mtime() { if [ "$(uname)" = Darwin ]; then stat -f %m "$1" 2>/dev/null @@ -37,32 +47,182 @@ fm_path_age() { echo $(( $(date +%s) - m )) } -fm_lock_remove_stale() { - local lockdir=$1 expected_pid=$2 current_pid - current_pid=$(cat "$lockdir/pid" 2>/dev/null || true) - [ "$current_pid" = "$expected_pid" ] || return 1 - if fm_pid_alive "$current_pid"; then +fm_lock_clean_known_files() { + local lockdir=$1 + rm -f \ + "$lockdir/pid" \ + "$lockdir/fm-home" \ + "$lockdir/pid-identity" \ + "$lockdir/watcher-path" \ + 2>/dev/null || true +} + +fm_lock_abs_path() { + local path=$1 dir base + dir=$(dirname "$path") + base=$(basename "$path") + dir=$(cd "$dir" 2>/dev/null && pwd -P) || return 1 + printf '%s/%s\n' "$dir" "$base" +} + +fm_lock_owner_dir() { + local lockdir=$1 lock_abs + lock_abs=$(fm_lock_abs_path "$lockdir") || return 1 + mktemp -d "${lock_abs}.owner.XXXXXX" 2>/dev/null +} + +fm_lock_prepare_owner() { + local ownerdir=$1 mypid back + mypid=${BASHPID:-$$} + printf '%s\n' "$mypid" > "$ownerdir/pid" 2>/dev/null || return 1 + back=$(cat "$ownerdir/pid" 2>/dev/null || true) + [ "$back" = "$mypid" ] +} + +fm_lock_link_owner() { + local lockdir=$1 owner + owner=$(readlink "$lockdir" 2>/dev/null) || return 1 + [ -n "$owner" ] || return 1 + case "$owner" in + /*) printf '%s\n' "$owner" ;; + *) printf '%s/%s\n' "$(dirname "$lockdir")" "$owner" ;; + esac +} + +fm_lock_points_to_owner() { + local lockdir=$1 ownerdir=$2 actual + actual=$(readlink "$lockdir" 2>/dev/null) || return 1 + [ "$actual" = "$ownerdir" ] +} + +fm_lock_discard_owner() { + local ownerdir=$1 + [ -n "$ownerdir" ] || return 0 + fm_lock_clean_known_files "$ownerdir" + rmdir "$ownerdir" 2>/dev/null || true +} + +fm_lock_remove_stray_owner_link() { + local lockdir=$1 ownerdir=$2 stray + stray="$lockdir/$(basename "$ownerdir")" + if [ -L "$stray" ] && [ "$(readlink "$stray" 2>/dev/null || true)" = "$ownerdir" ]; then + rm -f "$stray" 2>/dev/null || true + fi +} + +fm_lock_claim_blocked_by_steal() { + local lockdir=$1 allowed_steal_owner=${2:-} steal + steal="$lockdir.steal" + [ -e "$steal" ] || [ -L "$steal" ] || return 1 + if [ -n "$allowed_steal_owner" ] && fm_lock_points_to_owner "$steal" "$allowed_steal_owner"; then + return 1 + fi + return 0 +} + +fm_lock_claim() { + local lockdir=$1 ownerdir=$2 allowed_steal_owner=${3:-} mypid back + mypid=${BASHPID:-$$} + if ! { printf '%s\n' "$mypid" > "$ownerdir/pid"; } 2>/dev/null; then + fm_lock_discard_owner "$ownerdir" + return 1 + fi + back=$(cat "$ownerdir/pid" 2>/dev/null || true) + if [ "$back" != "$mypid" ]; then + fm_lock_discard_owner "$ownerdir" + return 1 + fi + if ! fm_lock_points_to_owner "$lockdir" "$ownerdir"; then + fm_lock_discard_owner "$ownerdir" + return 1 + fi + if fm_lock_claim_blocked_by_steal "$lockdir" "$allowed_steal_owner"; then + if fm_lock_points_to_owner "$lockdir" "$ownerdir"; then + rm -f "$lockdir" 2>/dev/null || true + fi + fm_lock_discard_owner "$ownerdir" return 1 fi - case "$current_pid" in + return 0 +} + +fm_lock_try_create() { + local lockdir=$1 allowed_steal_owner=${2:-} ownerdir + FM_LOCK_OWNER_DIR= + ownerdir=$(fm_lock_owner_dir "$lockdir") || return 1 + if [ -e "$lockdir" ] || [ -L "$lockdir" ]; then + fm_lock_discard_owner "$ownerdir" + return 1 + fi + if ! fm_lock_prepare_owner "$ownerdir"; then + fm_lock_discard_owner "$ownerdir" + return 1 + fi + if ln -s "$ownerdir" "$lockdir" 2>/dev/null && fm_lock_points_to_owner "$lockdir" "$ownerdir"; then + if fm_lock_claim "$lockdir" "$ownerdir" "$allowed_steal_owner"; then + FM_LOCK_OWNER_DIR=$ownerdir + return 0 + fi + if fm_lock_points_to_owner "$lockdir" "$ownerdir"; then + rm -f "$lockdir" 2>/dev/null || true + fi + else + fm_lock_remove_stray_owner_link "$lockdir" "$ownerdir" + fi + fm_lock_discard_owner "$ownerdir" + return 1 +} + +fm_lock_remove_path() { + local lockdir=$1 ownerdir + if [ -L "$lockdir" ]; then + ownerdir=$(fm_lock_link_owner "$lockdir" 2>/dev/null || true) + rm -f "$lockdir" 2>/dev/null || return 1 + [ -n "$ownerdir" ] && fm_lock_discard_owner "$ownerdir" + return 0 + fi + fm_lock_clean_known_files "$lockdir" + rmdir "$lockdir" 2>/dev/null +} + +fm_lock_mid_acquire_is_fresh() { + local lockdir=$1 pid=$2 mid_acquire_stale + case "$pid" in ''|*[!0-9]*) - [ "$(fm_path_age "$lockdir")" -ge "$FM_LOCK_STALE_AFTER" ] || return 1 + mid_acquire_stale=$FM_LOCK_STALE_AFTER + [ "$mid_acquire_stale" -lt 2 ] && mid_acquire_stale=2 + [ "$(fm_path_age "$lockdir")" -lt "$mid_acquire_stale" ] + return ;; esac - rm -f "$lockdir/pid" 2>/dev/null || return 1 - rmdir "$lockdir" 2>/dev/null + return 1 +} + +fm_lock_recheck_stale_owner() { + local lockdir=$1 expected_owner=$2 expected_pid=$3 actual_pid + if [ -n "$expected_owner" ]; then + fm_lock_points_to_owner "$lockdir" "$expected_owner" || return 1 + elif [ -e "$lockdir" ] || [ -L "$lockdir" ]; then + [ -d "$lockdir" ] && [ ! -L "$lockdir" ] || return 1 + fi + actual_pid=$(cat "$lockdir/pid" 2>/dev/null || true) + [ "$actual_pid" = "$expected_pid" ] || return 1 + if fm_pid_alive "$actual_pid"; then + return 1 + fi + if fm_lock_mid_acquire_is_fresh "$lockdir" "$actual_pid"; then + return 1 + fi + return 0 } fm_lock_try_acquire() { - local lockdir=$1 pid + local lockdir=$1 pid steal cur rc steal_owner primary_owner FM_LOCK_HELD_PID= - if mkdir "$lockdir" 2>/dev/null; then - if { fm_current_pid > "$lockdir/pid"; } 2>/dev/null; then - return 0 - fi - rm -f "$lockdir/pid" 2>/dev/null || true - rmdir "$lockdir" 2>/dev/null || true - return 1 + FM_LOCK_OWNER_DIR= + + if fm_lock_try_create "$lockdir"; then + return 0 fi pid=$(cat "$lockdir/pid" 2>/dev/null || true) @@ -70,29 +230,63 @@ fm_lock_try_acquire() { FM_LOCK_HELD_PID=$pid return 1 fi - case "$pid" in - ''|*[!0-9]*) - if [ "$(fm_path_age "$lockdir")" -lt "$FM_LOCK_STALE_AFTER" ]; then - FM_LOCK_HELD_PID=$pid - return 1 - fi - ;; - esac + if fm_lock_mid_acquire_is_fresh "$lockdir" "$pid"; then + FM_LOCK_HELD_PID=$pid + return 1 + fi - fm_lock_remove_stale "$lockdir" "$pid" || true - if mkdir "$lockdir" 2>/dev/null; then - if { fm_current_pid > "$lockdir/pid"; } 2>/dev/null; then - return 0 - fi - rm -f "$lockdir/pid" 2>/dev/null || true - rmdir "$lockdir" 2>/dev/null || true + steal="$lockdir.steal" + if ! fm_lock_try_acquire "$steal"; then + FM_LOCK_HELD_PID=$(cat "$lockdir/pid" 2>/dev/null || true) + FM_LOCK_OWNER_DIR= return 1 fi + steal_owner=${FM_LOCK_OWNER_DIR:-} - pid=$(cat "$lockdir/pid" 2>/dev/null || true) - # shellcheck disable=SC2034 # Read by callers after fm_lock_try_acquire returns. - FM_LOCK_HELD_PID=$pid - return 1 + cur=$(cat "$lockdir/pid" 2>/dev/null || true) + if fm_pid_alive "$cur"; then + fm_lock_release "$steal" + FM_LOCK_HELD_PID=$cur + FM_LOCK_OWNER_DIR= + return 1 + fi + if fm_lock_mid_acquire_is_fresh "$lockdir" "$cur"; then + fm_lock_release "$steal" + FM_LOCK_HELD_PID=$cur + FM_LOCK_OWNER_DIR= + return 1 + fi + if ! fm_lock_points_to_owner "$steal" "$steal_owner"; then + fm_lock_release "$steal" + FM_LOCK_HELD_PID=$(cat "$lockdir/pid" 2>/dev/null || true) + FM_LOCK_OWNER_DIR= + return 1 + fi + + primary_owner= + if [ -L "$lockdir" ]; then + primary_owner=$(fm_lock_link_owner "$lockdir" 2>/dev/null || true) + fi + cur=$(cat "$lockdir/pid" 2>/dev/null || true) + if ! fm_lock_recheck_stale_owner "$lockdir" "$primary_owner" "$cur"; then + fm_lock_release "$steal" + FM_LOCK_HELD_PID=$(cat "$lockdir/pid" 2>/dev/null || true) + FM_LOCK_OWNER_DIR= + return 1 + fi + + fm_lock_remove_path "$lockdir" || true + rc=1 + if fm_lock_try_create "$lockdir" "$steal_owner"; then + rc=0 + fi + if [ "$rc" -ne 0 ]; then + # shellcheck disable=SC2034 # Read by callers after fm_lock_try_acquire returns. + FM_LOCK_HELD_PID=$(cat "$lockdir/pid" 2>/dev/null || true) + FM_LOCK_OWNER_DIR= + fi + fm_lock_release "$steal" + return "$rc" } fm_lock_acquire_wait() { @@ -103,11 +297,21 @@ fm_lock_acquire_wait() { } fm_lock_release() { - local lockdir=$1 pid current + local lockdir=$1 pid current ownerdir current=${BASHPID:-$$} + if [ -L "$lockdir" ]; then + ownerdir=$(fm_lock_link_owner "$lockdir" 2>/dev/null || true) + [ -n "$ownerdir" ] || return 0 + pid=$(cat "$ownerdir/pid" 2>/dev/null || true) + [ "$pid" = "$current" ] || return 0 + fm_lock_points_to_owner "$lockdir" "$ownerdir" || return 0 + rm -f "$lockdir" 2>/dev/null || return 0 + fm_lock_discard_owner "$ownerdir" + return 0 + fi pid=$(cat "$lockdir/pid" 2>/dev/null || true) [ "$pid" = "$current" ] || return 0 - rm -f "$lockdir/pid" 2>/dev/null || true + fm_lock_clean_known_files "$lockdir" rmdir "$lockdir" 2>/dev/null || true } diff --git a/bin/fm-watch-arm.sh b/bin/fm-watch-arm.sh new file mode 100755 index 00000000..7fef28e9 --- /dev/null +++ b/bin/fm-watch-arm.sh @@ -0,0 +1,76 @@ +#!/usr/bin/env bash +# Safe, home-scoped (re-)arm of the firstmate watcher. +# +# Default (no args): start bin/fm-watch.sh. The watcher is a singleton per +# FM_HOME, so a second arm no-ops harmlessly when one is already alive. This is +# the primary re-arm model - just arm, never kill, and let the singleton lock +# decide. +# +# --restart: stop ONLY this FM_HOME's watcher and start a fresh one. It resolves +# the pid from THIS home's state/.watch.lock and signals exactly that pid, so it +# can never touch another home's watcher. NEVER use `pkill -f bin/fm-watch.sh`: +# that pattern matches every firstmate home's watcher (secondmate homes run the +# same script) and would kill siblings. +# +# Run this exactly like bin/fm-watch.sh - as a background task. It execs the +# watcher in the foreground, so the harness backgrounds the whole invocation. +set -u + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=bin/fm-wake-lib.sh +. "$SCRIPT_DIR/fm-wake-lib.sh" + +WATCH="$SCRIPT_DIR/fm-watch.sh" +WATCH_LOCK="$STATE/.watch.lock" + +watch_lock_matches_pid() { + local pid=$1 lock_home lock_path lock_identity current_identity + lock_home=$(cat "$WATCH_LOCK/fm-home" 2>/dev/null || true) + lock_path=$(cat "$WATCH_LOCK/watcher-path" 2>/dev/null || true) + lock_identity=$(cat "$WATCH_LOCK/pid-identity" 2>/dev/null || true) + [ "$lock_home" = "$FM_HOME" ] || return 1 + [ "$lock_path" = "$WATCH" ] || return 1 + [ -n "$lock_identity" ] || return 1 + current_identity=$(fm_pid_identity "$pid") || return 1 + [ "$current_identity" = "$lock_identity" ] +} + +clear_stale_recorded_watcher_lock() { + local lock_home lock_path lock_identity + lock_home=$(cat "$WATCH_LOCK/fm-home" 2>/dev/null || true) + lock_path=$(cat "$WATCH_LOCK/watcher-path" 2>/dev/null || true) + lock_identity=$(cat "$WATCH_LOCK/pid-identity" 2>/dev/null || true) + [ "$lock_home" = "$FM_HOME" ] || return 0 + [ "$lock_path" = "$WATCH" ] || return 0 + [ -n "$lock_identity" ] || return 0 + fm_lock_remove_path "$WATCH_LOCK" || true +} + +mode=arm +case "${1:-}" in + ''|arm|--arm) mode=arm ;; + --restart) mode=restart ;; + *) echo "usage: $(basename "$0") [--restart]" >&2; exit 2 ;; +esac + +if [ "$mode" = restart ]; then + # Home-scoped stop: only the watcher pid recorded in THIS home's lock. + lock_pid=$(cat "$WATCH_LOCK/pid" 2>/dev/null || true) + if fm_pid_alive "$lock_pid"; then + if watch_lock_matches_pid "$lock_pid"; then + kill -TERM "$lock_pid" 2>/dev/null || true + # Wait for it to actually exit before relaunching, so the fresh watcher + # either takes a released lock or reclaims a now-dead-pid stale lock instead + # of seeing the dying one as a live holder and no-opping. + i=0 + while [ "$i" -lt 50 ] && fm_pid_alive "$lock_pid"; do + sleep 0.1 + i=$((i + 1)) + done + else + clear_stale_recorded_watcher_lock + fi + fi +fi + +exec "$WATCH" diff --git a/bin/fm-watch.sh b/bin/fm-watch.sh index daa43567..5c7957ac 100755 --- a/bin/fm-watch.sh +++ b/bin/fm-watch.sh @@ -6,8 +6,8 @@ # stale: a crewmate pane stopped changing and shows no busy signature # check: