Merged
15 changes: 8 additions & 7 deletions AGENTS.md
@@ -454,6 +454,7 @@ From there the task is an ordinary ship task through its mode-specific validatio

The watcher is the backbone.
Whenever at least one task is in flight, keep `bin/fm-watch.sh` running through a harness-tracked `bin/fm-watch-arm.sh` background task.
+In a harness lane where tracked background tasks are not durable enough, run `bin/fm-watch-session.sh start` instead; it keeps a home-scoped tmux runner alive and re-arms through the same verified `fm-watch-arm.sh` path.
It costs zero tokens while running.
**Always-on wake triage.**
The watcher classifies every wake it detects in bash and absorbs the benign majority without ever waking you.
@@ -470,8 +471,6 @@ The arm chain IS the supervision: while any task is in flight, keep exactly one
Each cycle is one harness-tracked background task that blocks until an actionable wake is due (benign wakes are absorbed in bash without ending the task), fires with one reason line, and ends, so the chain survives only when firstmate starts the next cycle after each fire.
After handling the drained wakes, re-arm before you end the turn by running `bin/fm-watch-arm.sh` as its own background task.
Arm or re-arm the watcher only through the harness's own tracked background mechanism - the one that survives the call and notifies you when the process exits - so the cycle actually persists and the next wake reaches you.
-If the current harness cannot provide a reliable tracked background call, start the home-scoped durable runner with `bin/fm-watch-session.sh --start` and check it with `bin/fm-watch-session.sh --status`; it records only this `FM_HOME` in `state/.watch-session.lock` and re-arms the normal watcher from a persistent process.
-For a visible pane instead of a detached process, use `bin/fm-watch-session.sh --tmux` to print a ready-to-run tmux command, or run `bin/fm-watch-session.sh --foreground` inside a persistent tmux window.
Never fire-and-forget the watcher with a shell `&` inside another call: that backgrounded child is reaped when the call returns, so supervision silently stops, and worse, the dying process reports a false "already running" that hides the gap.
**Standalone, never bundled.**
Run `bin/fm-watch-arm.sh` as its OWN background task with nothing else in that bash, never tacked onto the tail of a multi-command call: bundled, its self-verifying status line is buried in unrelated output and it can silently no-op as a side effect of those other commands, so no fresh cycle gets established and supervision lapses unnoticed.
@@ -493,7 +492,8 @@ Empty polls, elapsed waiting time, and "still no change" are tool bookkeeping, n
```sh
bin/fm-watch-arm.sh # safe verified re-arm; run as harness-tracked background; no-ops if healthy
bin/fm-watch-arm.sh --restart # home-scoped forced restart; never a broad pkill
-bin/fm-watch-session.sh --start|--status|--stop # durable active-mode runner for this FM_HOME
+bin/fm-watch-session.sh start # durable home-scoped tmux runner for lanes without reliable tracked background tasks
+bin/fm-watch-session.sh --status # report whether this home's runner window is live
bin/fm-watch.sh # the watcher itself; exits with: signal|stale|check|heartbeat
bin/fm-wake-drain.sh # drain queued wake records at turn start; asserts guard after draining
bin/fm-crew-state.sh <id> # one-line current-state read; reconciles matching run-step, pane, and status log
@@ -523,13 +523,14 @@ This exception is narrow: ordinary crewmates still trip stale detection when the
**Watcher liveness is guarded, not just disciplined.**
Arming the watcher is the last action of every wake-handling turn - but the protocol no longer relies on remembering that.
While running, `fm-watch.sh` touches `state/.last-watcher-beat` every poll cycle.
-The supervision scripts (`fm-peek`, `fm-send`, `fm-spawn`, `fm-teardown`, `fm-pr-check`, `fm-promote`, `fm-review-diff`, `fm-fleet-sync`, `fm-update`) call `bin/fm-guard.sh` first, which warns to stderr when any task is in flight (`state/*.meta` exists) but queued wakes are pending, that beacon is missing or older than `FM_GUARD_GRACE` (default 300s), or the fresh beacon is not backed by `state/.watch.lock` naming a live watcher for this same `FM_HOME` and watcher path.
+The supervision scripts (`fm-peek`, `fm-send`, `fm-spawn`, `fm-teardown`, `fm-pr-check`, `fm-promote`, `fm-review-diff`, `fm-fleet-sync`, `fm-update`) call `bin/fm-guard.sh` first, which warns to stderr when any task is in flight (`state/*.meta` exists) but queued wakes are pending, or there is no confirmed live watcher for this same `FM_HOME`.
+A confirmed live watcher means `state/.watch.lock` names a live `bin/fm-watch.sh` process for this home and `state/.last-watcher-beat` is fresh within `FM_GUARD_GRACE` (default 300s); a fresh beacon by itself is not enough.
`bin/fm-wake-drain.sh` runs the same guard after it drains, so the liveness check also fires on a drain-and-handle turn that runs no other supervision script, narrowing the window in which a lapsed chain can hide.
-The no-watcher case leads with a prominent, bordered ●-marked banner (in-flight count, beacon/lock problem, and the exact one-line re-arm command) so it reads as an alarm rather than a buried stderr line you can skim past.
+The no-watcher case leads with a prominent, bordered ●-marked banner (in-flight count, lock state, beacon age, and the exact one-line re-arm command) so it reads as an alarm rather than a buried stderr line you can skim past.
So the next time you touch the fleet with queued wakes or no watcher alive, the tool output itself tells you what to do - a pull-based guard that works on any harness, since it rides the script output you already read rather than a harness-specific hook.
-The grace window now only helps when a live matching watcher lock is present; a fresh beacon without that lock is treated as a false-fresh state and warns.
+The grace window keeps normal handling silent only when the lock still proves a live watcher.
If a guard warning says queued wakes are pending, drain them before doing anything else.
-If a guard warning says watcher liveness is stale, arm `bin/fm-watch-arm.sh` after draining any queued wakes.
+If a guard warning says watcher liveness is stale or unconfirmed, arm `bin/fm-watch-arm.sh` after draining any queued wakes, or start `bin/fm-watch-session.sh start` in this environment.

`fm-guard.sh` carries a second, independent alarm in the same bordered ●-marked style: the **worktree-tangle** guard.
Firstmate is a treehouse-pooled git repo of itself - the primary checkout (the repo root, `FM_ROOT`) and every crewmate worktree and secondmate home are linked worktrees of one repo - and the primary must stay on its default branch.
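
The confirmed-live-watcher rule that this AGENTS.md change describes (a live lock plus a fresh beacon) can be sketched in a few lines of bash. This is an illustrative sketch under stated assumptions, not the code in `bin/fm-guard.sh`: `watcher_confirmed` and `mtime_of` are hypothetical names, `kill -0` stands in for the repo's `fm_pid_alive` and pid-identity checks, and the `FM_HOME` and watcher-path comparisons are left out.

```sh
#!/usr/bin/env bash
# Sketch only: watcher_confirmed and mtime_of are hypothetical names, and
# kill -0 is an assumed stand-in for the repo's fm_pid_alive / identity checks.
set -u

# Portable mtime in epoch seconds: try GNU stat first, then BSD stat.
mtime_of() {
  stat -c %Y "$1" 2>/dev/null || stat -f %m "$1" 2>/dev/null
}

# Succeeds only when BOTH conditions hold:
#   1. <state>/.watch.lock/pid names a live process, and
#   2. <state>/.last-watcher-beat is younger than the grace window (seconds).
watcher_confirmed() {
  local state=$1 grace=${2:-300} pid m age
  pid=$(cat "$state/.watch.lock/pid" 2>/dev/null || true)
  [ -n "$pid" ] || return 1
  kill -0 "$pid" 2>/dev/null || return 1
  m=$(mtime_of "$state/.last-watcher-beat") || return 1
  [ -n "$m" ] || return 1
  age=$(( $(date +%s) - m ))
  [ "$age" -lt "$grace" ]
}

# Demo in a throwaway directory: a fresh beacon with no lock is not enough.
demo=$(mktemp -d)
touch "$demo/.last-watcher-beat"
if watcher_confirmed "$demo" 300; then echo "confirmed"; else echo "unconfirmed: fresh beacon, no lock"; fi

# Add a lock that names this shell (a live pid) and the same check passes.
mkdir "$demo/.watch.lock"
echo "$$" > "$demo/.watch.lock/pid"
if watcher_confirmed "$demo" 300; then echo "confirmed"; else echo "unconfirmed"; fi
rm -rf "$demo"
```

The ordering mirrors the point made in the prose: the beacon's age is consulted only after the lock has named a live process, so a stale-but-recent mtime left by an exited watcher cannot pass on its own.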
89 changes: 50 additions & 39 deletions bin/fm-guard.sh
@@ -4,11 +4,12 @@
 # First, always warn if the firstmate primary checkout (FM_ROOT) is on a named
 # non-default branch, because that means firstmate-on-itself work landed in the
 # primary instead of an isolated worktree.
-# Then, if any task is in flight (a state/<id>.meta exists), prove the watcher is
-# live by checking both the liveness beacon and the home-scoped watcher lock. A
-# fresh state/.last-watcher-beat alone is not enough: a one-shot watcher can write
-# a wake and exit while leaving a fresh beacon behind. Always exits 0: the guard
-# warns, it never blocks.
+# Then, if any task is in flight (a state/<id>.meta exists) and there is no
+# confirmed live watcher for this FM_HOME (state/.watch.lock naming a live
+# bin/fm-watch.sh process for this home plus a fresh state/.last-watcher-beat),
+# prints a loud, clearly delimited banner so the agent cannot skim past it in the
+# tool output of whatever it was doing - the one channel every harness has. Always
+# exits 0: the guard warns, it never blocks.
 set -u

 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
@@ -54,34 +55,45 @@ fi

 WATCH_LOCK="$STATE/.watch.lock"
 WATCH_PATH="$SCRIPT_DIR/fm-watch.sh"
-watcher_lock_desc="no watcher lock"

-watcher_lock_healthy() {
-  local pid lock_home lock_path lock_identity current_identity
-  watcher_lock_desc="no watcher lock"
-  [ -e "$WATCH_LOCK" ] || [ -L "$WATCH_LOCK" ] || return 1
+watch_lock_matches_pid() {
+  local pid=$1 lock_home lock_path lock_identity current_identity
+  lock_home=$(cat "$WATCH_LOCK/fm-home" 2>/dev/null || true)
+  lock_path=$(cat "$WATCH_LOCK/watcher-path" 2>/dev/null || true)
+  lock_identity=$(cat "$WATCH_LOCK/pid-identity" 2>/dev/null || true)
+  [ "$lock_home" = "$FM_HOME" ] || return 1
+  [ "$lock_path" = "$WATCH_PATH" ] || return 1
+  [ -n "$lock_identity" ] || return 1
+  current_identity=$(fm_pid_identity "$pid") || return 1
+  [ "$current_identity" = "$lock_identity" ]
+}
+
+watcher_lock_desc() {
+  local pid lock_home lock_path
+  if [ ! -e "$WATCH_LOCK" ] && [ ! -L "$WATCH_LOCK" ]; then
+    echo "no watch lock"
+    return 0
+  fi
   pid=$(cat "$WATCH_LOCK/pid" 2>/dev/null || true)
   if ! fm_pid_alive "$pid"; then
-    watcher_lock_desc="watcher lock has no live pid"
-    return 1
+    echo "watch lock has no live pid"
+    return 0
   fi
   lock_home=$(cat "$WATCH_LOCK/fm-home" 2>/dev/null || true)
   lock_path=$(cat "$WATCH_LOCK/watcher-path" 2>/dev/null || true)
-  lock_identity=$(cat "$WATCH_LOCK/pid-identity" 2>/dev/null || true)
-  if [ "$lock_home" != "$FM_HOME" ] || [ "$lock_path" != "$WATCH_PATH" ] || [ -z "$lock_identity" ]; then
-    watcher_lock_desc="watcher lock does not name a live watcher for this home"
-    return 1
+  if [ "$lock_home" != "$FM_HOME" ]; then
+    echo "watch lock belongs to another FM_HOME"
+    return 0
+  fi
+  if [ "$lock_path" != "$WATCH_PATH" ]; then
+    echo "watch lock names another watcher path"
+    return 0
   fi
-  current_identity=$(fm_pid_identity "$pid") || {
-    watcher_lock_desc="watcher lock pid identity is unavailable"
-    return 1
-  }
-  if [ "$current_identity" != "$lock_identity" ]; then
-    watcher_lock_desc="watcher lock pid identity no longer matches"
-    return 1
+  if ! watch_lock_matches_pid "$pid"; then
+    echo "watch lock pid identity does not match the watcher"
+    return 0
   fi
-  watcher_lock_desc="live watcher pid=$pid"
-  return 0
+  echo "watch lock is live"
 }

 # Only act with tasks in flight; count them so the banner can say how much is
@@ -95,43 +107,42 @@ done

 [ -s "$FM_WAKE_QUEUE" ] && queue_pending=true

-# Resolve the watcher's liveness from its beacon: fresh within GRACE means a
-# watcher is alive and we stay quiet about it.
+# Resolve the watcher's liveness from both the lock and the beacon. A fresh
+# beacon alone is not proof: a one-shot watcher can leave a fresh mtime behind
+# after it exits.
 BEAT="$STATE/.last-watcher-beat"
-watcher_fresh=false
+watcher_confirmed=false
+beacon_fresh=false
 beacon_desc=never
 if [ -e "$BEAT" ]; then
   m=$(stat_mtime "$BEAT")
   if [ -n "$m" ]; then
     age=$(( $(date +%s) - m ))
     beacon_desc="${age}s ago"
-    [ "$age" -lt "$GRACE" ] && watcher_fresh=true
+    [ "$age" -lt "$GRACE" ] && beacon_fresh=true
   else
     beacon_desc=unknown
   fi
 fi
-lock_healthy=false
-watcher_lock_healthy && lock_healthy=true
-watcher_problem=
-if [ "$watcher_fresh" = false ]; then
-  watcher_problem="no fresh beacon (last beat: $beacon_desc, grace ${GRACE}s)"
-elif [ "$lock_healthy" = false ]; then
-  watcher_problem="fresh beacon but no live watcher lock: $watcher_lock_desc"
+lock_desc=$(watcher_lock_desc)
+lock_pid=$(cat "$WATCH_LOCK/pid" 2>/dev/null || true)
+if [ "$beacon_fresh" = true ] && fm_pid_alive "$lock_pid" && watch_lock_matches_pid "$lock_pid"; then
+  watcher_confirmed=true
 fi

 # No fresh watcher with tasks in flight is the dangerous state: emit a prominent,
 # bordered banner FIRST so it reads as an alarm, not a buried stderr line.
-if [ -n "$watcher_problem" ]; then
+if [ "$watcher_confirmed" = false ]; then
   if "$queue_pending"; then
     fix='After draining queued wakes, re-arm the watcher: run bin/fm-watch-arm.sh as the harness-tracked background task (never a shell & that gets reaped).'
   else
-    fix='Re-arm it NOW: run bin/fm-watch-arm.sh as the harness-tracked background task (never a shell & that gets reaped).'
+    fix='Re-arm it NOW: run bin/fm-watch-arm.sh as the harness-tracked background task, or run bin/fm-watch-session.sh start in this environment.'
   fi
   rule='━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━'
   {
     printf '●%s\n' "$rule"
     printf '● WATCHER DOWN - SUPERVISION IS OFF\n'
-    printf '● %s task(s) in flight, but watcher liveness is not proved: %s.\n' "$in_flight" "$watcher_problem"
+    printf '● %s task(s) in flight, but no watcher has a confirmed live lock (lock: %s; last beat: %s, grace %ss).\n' "$in_flight" "$lock_desc" "$beacon_desc" "$GRACE"
     printf '● Trust bin/fm-watch-arm.sh for the true state: it confirms a live watcher and a fresh beacon, or fails loudly.\n'
     printf '● %s\n' "$fix"
     printf '●%s\n' "$rule"
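
The "warns, never blocks" contract stated in the script header can be illustrated with a small standalone sketch. It assumes the banner goes to stderr, as AGENTS.md says the guard warns there; the hunk above is cut off before the closing brace, so the actual redirection is not visible in this diff. `guard_warn` is a hypothetical helper and the task count in the usage lines is made up for illustration.

```sh
#!/usr/bin/env bash
# Sketch only: guard_warn is a hypothetical helper, not a function from
# bin/fm-guard.sh. It assumes the banner is written to stderr.
set -u

# Print a bordered, ●-marked banner to stderr and always return 0, so a
# script that calls the guard first is warned but never blocked.
guard_warn() {
  local rule='━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━' line
  {
    printf '●%s\n' "$rule"
    for line in "$@"; do
      printf '● %s\n' "$line"
    done
    printf '●%s\n' "$rule"
  } >&2
  return 0
}

# Usage: warn, then carry on with the real work regardless.
# The in-flight count below is an illustrative value.
guard_warn 'WATCHER DOWN - SUPERVISION IS OFF' \
           '2 task(s) in flight, but no watcher has a confirmed live lock.'
echo "continuing with the supervision command"
```

Keeping the banner on stderr leaves stdout clean for callers that parse the supervision script's normal output, while the always-zero return keeps the guard from turning a lapsed watcher into a failed command.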