diff --git a/AGENTS.md b/AGENTS.md index f07899d9..4c10acf3 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -454,6 +454,7 @@ From there the task is an ordinary ship task through its mode-specific validatio The watcher is the backbone. Whenever at least one task is in flight, keep `bin/fm-watch.sh` running through a harness-tracked `bin/fm-watch-arm.sh` background task. +In a harness lane where tracked background tasks are not durable enough, run `bin/fm-watch-session.sh start` instead; it keeps a home-scoped tmux runner alive and re-arms through the same verified `fm-watch-arm.sh` path. It costs zero tokens while running. **Always-on wake triage.** The watcher classifies every wake it detects in bash and absorbs the benign majority without ever waking you. @@ -470,8 +471,6 @@ The arm chain IS the supervision: while any task is in flight, keep exactly one Each cycle is one harness-tracked background task that blocks until an actionable wake is due (benign wakes are absorbed in bash without ending the task), fires with one reason line, and ends, so the chain survives only when firstmate starts the next cycle after each fire. After handling the drained wakes, re-arm before you end the turn by running `bin/fm-watch-arm.sh` as its own background task. Arm or re-arm the watcher only through the harness's own tracked background mechanism - the one that survives the call and notifies you when the process exits - so the cycle actually persists and the next wake reaches you. -If the current harness cannot provide a reliable tracked background call, start the home-scoped durable runner with `bin/fm-watch-session.sh --start` and check it with `bin/fm-watch-session.sh --status`; it records only this `FM_HOME` in `state/.watch-session.lock` and re-arms the normal watcher from a persistent process. 
-For a visible pane instead of a detached process, use `bin/fm-watch-session.sh --tmux` to print a ready-to-run tmux command, or run `bin/fm-watch-session.sh --foreground` inside a persistent tmux window. Never fire-and-forget the watcher with a shell `&` inside another call: that backgrounded child is reaped when the call returns, so supervision silently stops, and worse, the dying process reports a false "already running" that hides the gap. **Standalone, never bundled.** Run `bin/fm-watch-arm.sh` as its OWN background task with nothing else in that bash, never tacked onto the tail of a multi-command call: bundled, its self-verifying status line is buried in unrelated output and it can silently no-op as a side effect of those other commands, so no fresh cycle gets established and supervision lapses unnoticed. @@ -493,7 +492,8 @@ Empty polls, elapsed waiting time, and "still no change" are tool bookkeeping, n ```sh bin/fm-watch-arm.sh # safe verified re-arm; run as harness-tracked background; no-ops if healthy bin/fm-watch-arm.sh --restart # home-scoped forced restart; never a broad pkill -bin/fm-watch-session.sh --start|--status|--stop # durable active-mode runner for this FM_HOME +bin/fm-watch-session.sh start # durable home-scoped tmux runner for lanes without reliable tracked background tasks +bin/fm-watch-session.sh --status # report whether this home's runner window is live bin/fm-watch.sh # the watcher itself; exits with: signal|stale|check|heartbeat bin/fm-wake-drain.sh # drain queued wake records at turn start; asserts guard after draining bin/fm-crew-state.sh # one-line current-state read; reconciles matching run-step, pane, and status log @@ -523,13 +523,14 @@ This exception is narrow: ordinary crewmates still trip stale detection when the **Watcher liveness is guarded, not just disciplined.** Arming the watcher is the last action of every wake-handling turn - but the protocol no longer relies on remembering that. 
While running, `fm-watch.sh` touches `state/.last-watcher-beat` every poll cycle. -The supervision scripts (`fm-peek`, `fm-send`, `fm-spawn`, `fm-teardown`, `fm-pr-check`, `fm-promote`, `fm-review-diff`, `fm-fleet-sync`, `fm-update`) call `bin/fm-guard.sh` first, which warns to stderr when any task is in flight (`state/*.meta` exists) but queued wakes are pending, that beacon is missing or older than `FM_GUARD_GRACE` (default 300s), or the fresh beacon is not backed by `state/.watch.lock` naming a live watcher for this same `FM_HOME` and watcher path. +The supervision scripts (`fm-peek`, `fm-send`, `fm-spawn`, `fm-teardown`, `fm-pr-check`, `fm-promote`, `fm-review-diff`, `fm-fleet-sync`, `fm-update`) call `bin/fm-guard.sh` first, which warns to stderr when any task is in flight (`state/*.meta` exists) and either queued wakes are pending or there is no confirmed live watcher for this same `FM_HOME`. +A confirmed live watcher means `state/.watch.lock` names a live `bin/fm-watch.sh` process for this home and `state/.last-watcher-beat` is fresh within `FM_GUARD_GRACE` (default 300s); a fresh beacon by itself is not enough. `bin/fm-wake-drain.sh` runs the same guard after it drains, so the liveness check also fires on a drain-and-handle turn that runs no other supervision script, narrowing the window in which a lapsed chain can hide. -The no-watcher case leads with a prominent, bordered ●-marked banner (in-flight count, beacon/lock problem, and the exact one-line re-arm command) so it reads as an alarm rather than a buried stderr line you can skim past. +The no-watcher case leads with a prominent, bordered ●-marked banner (in-flight count, lock state, beacon age, and the exact one-line re-arm command) so it reads as an alarm rather than a buried stderr line you can skim past. 
So the next time you touch the fleet with queued wakes or no watcher alive, the tool output itself tells you what to do - a pull-based guard that works on any harness, since it rides the script output you already read rather than a harness-specific hook. -The grace window now only helps when a live matching watcher lock is present; a fresh beacon without that lock is treated as a false-fresh state and warns. +The grace window keeps normal handling silent only when the lock still proves a live watcher. If a guard warning says queued wakes are pending, drain them before doing anything else. -If a guard warning says watcher liveness is stale, arm `bin/fm-watch-arm.sh` after draining any queued wakes. +If a guard warning says watcher liveness is stale or unconfirmed, arm `bin/fm-watch-arm.sh` after draining any queued wakes, or, in a harness without durable tracked background tasks, run `bin/fm-watch-session.sh start`. `fm-guard.sh` carries a second, independent alarm in the same bordered ●-marked style: the **worktree-tangle** guard. Firstmate is a treehouse-pooled git repo of itself - the primary checkout (the repo root, `FM_ROOT`) and every crewmate worktree and secondmate home are linked worktrees of one repo - and the primary must stay on its default branch. diff --git a/bin/fm-guard.sh b/bin/fm-guard.sh index 6d307453..1d2d9ff7 100755 --- a/bin/fm-guard.sh +++ b/bin/fm-guard.sh @@ -4,11 +4,12 @@ # First, always warn if the firstmate primary checkout (FM_ROOT) is on a named # non-default branch, because that means firstmate-on-itself work landed in the # primary instead of an isolated worktree. -# Then, if any task is in flight (a state/.meta exists), prove the watcher is -# live by checking both the liveness beacon and the home-scoped watcher lock. A -# fresh state/.last-watcher-beat alone is not enough: a one-shot watcher can write -# a wake and exit while leaving a fresh beacon behind. Always exits 0: the guard -# warns, it never blocks. 
+# Then, if any task is in flight (a state/.meta exists) and there is no +# confirmed live watcher for this FM_HOME (state/.watch.lock naming a live +# bin/fm-watch.sh process for this home plus a fresh state/.last-watcher-beat), +# prints a loud, clearly delimited banner so the agent cannot skim past it in the +# tool output of whatever it was doing - the one channel every harness has. Always +# exits 0: the guard warns, it never blocks. set -u SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" @@ -54,34 +55,45 @@ fi WATCH_LOCK="$STATE/.watch.lock" WATCH_PATH="$SCRIPT_DIR/fm-watch.sh" -watcher_lock_desc="no watcher lock" -watcher_lock_healthy() { - local pid lock_home lock_path lock_identity current_identity - watcher_lock_desc="no watcher lock" - [ -e "$WATCH_LOCK" ] || [ -L "$WATCH_LOCK" ] || return 1 +watch_lock_matches_pid() { + local pid=$1 lock_home lock_path lock_identity current_identity + lock_home=$(cat "$WATCH_LOCK/fm-home" 2>/dev/null || true) + lock_path=$(cat "$WATCH_LOCK/watcher-path" 2>/dev/null || true) + lock_identity=$(cat "$WATCH_LOCK/pid-identity" 2>/dev/null || true) + [ "$lock_home" = "$FM_HOME" ] || return 1 + [ "$lock_path" = "$WATCH_PATH" ] || return 1 + [ -n "$lock_identity" ] || return 1 + current_identity=$(fm_pid_identity "$pid") || return 1 + [ "$current_identity" = "$lock_identity" ] +} + +watcher_lock_desc() { + local pid lock_home lock_path + if [ ! -e "$WATCH_LOCK" ] && [ ! -L "$WATCH_LOCK" ]; then + echo "no watch lock" + return 0 + fi pid=$(cat "$WATCH_LOCK/pid" 2>/dev/null || true) if ! 
fm_pid_alive "$pid"; then - watcher_lock_desc="watcher lock has no live pid" - return 1 + echo "watch lock has no live pid" + return 0 fi lock_home=$(cat "$WATCH_LOCK/fm-home" 2>/dev/null || true) lock_path=$(cat "$WATCH_LOCK/watcher-path" 2>/dev/null || true) - lock_identity=$(cat "$WATCH_LOCK/pid-identity" 2>/dev/null || true) - if [ "$lock_home" != "$FM_HOME" ] || [ "$lock_path" != "$WATCH_PATH" ] || [ -z "$lock_identity" ]; then - watcher_lock_desc="watcher lock does not name a live watcher for this home" - return 1 + if [ "$lock_home" != "$FM_HOME" ]; then + echo "watch lock belongs to another FM_HOME" + return 0 + fi + if [ "$lock_path" != "$WATCH_PATH" ]; then + echo "watch lock names another watcher path" + return 0 fi - current_identity=$(fm_pid_identity "$pid") || { - watcher_lock_desc="watcher lock pid identity is unavailable" - return 1 - } - if [ "$current_identity" != "$lock_identity" ]; then - watcher_lock_desc="watcher lock pid identity no longer matches" - return 1 + if ! watch_lock_matches_pid "$pid"; then + echo "watch lock pid identity does not match the watcher" + return 0 fi - watcher_lock_desc="live watcher pid=$pid" - return 0 + echo "watch lock is live" } # Only act with tasks in flight; count them so the banner can say how much is @@ -95,43 +107,42 @@ done [ -s "$FM_WAKE_QUEUE" ] && queue_pending=true -# Resolve the watcher's liveness from its beacon: fresh within GRACE means a -# watcher is alive and we stay quiet about it. +# Resolve the watcher's liveness from both the lock and the beacon. A fresh +# beacon alone is not proof: a one-shot watcher can leave a fresh mtime behind +# after it exits. 
BEAT="$STATE/.last-watcher-beat" -watcher_fresh=false +watcher_confirmed=false +beacon_fresh=false beacon_desc=never if [ -e "$BEAT" ]; then m=$(stat_mtime "$BEAT") if [ -n "$m" ]; then age=$(( $(date +%s) - m )) beacon_desc="${age}s ago" - [ "$age" -lt "$GRACE" ] && watcher_fresh=true + [ "$age" -lt "$GRACE" ] && beacon_fresh=true else beacon_desc=unknown fi fi -lock_healthy=false -watcher_lock_healthy && lock_healthy=true -watcher_problem= -if [ "$watcher_fresh" = false ]; then - watcher_problem="no fresh beacon (last beat: $beacon_desc, grace ${GRACE}s)" -elif [ "$lock_healthy" = false ]; then - watcher_problem="fresh beacon but no live watcher lock: $watcher_lock_desc" +lock_desc=$(watcher_lock_desc) +lock_pid=$(cat "$WATCH_LOCK/pid" 2>/dev/null || true) +if [ "$beacon_fresh" = true ] && fm_pid_alive "$lock_pid" && watch_lock_matches_pid "$lock_pid"; then + watcher_confirmed=true fi # No fresh watcher with tasks in flight is the dangerous state: emit a prominent, # bordered banner FIRST so it reads as an alarm, not a buried stderr line. -if [ -n "$watcher_problem" ]; then +if [ "$watcher_confirmed" = false ]; then if "$queue_pending"; then fix='After draining queued wakes, re-arm the watcher: run bin/fm-watch-arm.sh as the harness-tracked background task (never a shell & that gets reaped).' else - fix='Re-arm it NOW: run bin/fm-watch-arm.sh as the harness-tracked background task (never a shell & that gets reaped).' + fix='Re-arm it NOW: run bin/fm-watch-arm.sh as the harness-tracked background task, or run bin/fm-watch-session.sh start in this environment.' 
fi rule='━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━' { printf '●%s\n' "$rule" printf '● WATCHER DOWN - SUPERVISION IS OFF\n' - printf '● %s task(s) in flight, but watcher liveness is not proved: %s.\n' "$in_flight" "$watcher_problem" + printf '● %s task(s) in flight, but no watcher has a confirmed live lock (lock: %s; last beat: %s, grace %ss).\n' "$in_flight" "$lock_desc" "$beacon_desc" "$GRACE" printf '● Trust bin/fm-watch-arm.sh for the true state: it confirms a live watcher and a fresh beacon, or fails loudly.\n' printf '● %s\n' "$fix" printf '●%s\n' "$rule" diff --git a/bin/fm-watch-session.sh b/bin/fm-watch-session.sh index 76cadcb0..290fca45 100755 --- a/bin/fm-watch-session.sh +++ b/bin/fm-watch-session.sh @@ -1,153 +1,120 @@ #!/usr/bin/env bash -# Home-scoped durable active watcher runner. +# Durable, home-scoped active watcher runner. # -# fm-watch-arm.sh intentionally keeps the watcher as its child. That is good for -# harness-tracked foreground tasks, but fragile when a harness cannot keep that -# foreground call alive. This wrapper gives active mode a durable process for the -# current FM_HOME: it starts a small runner that repeatedly arms the watcher, -# records the runner pid in state/.watch-session.lock, and can report or stop -# only that home-scoped runner. +# Use this in harnesses where a tracked background task is not durable enough. +# It creates one tmux window per FM_HOME/STATE pair and runs fm-watch-arm.sh in a +# loop there. The watcher itself remains the same singleton: it is still scoped by +# this home's state/.watch.lock, and no broad process matching is used. set -u SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" # shellcheck source=bin/fm-wake-lib.sh . 
"$SCRIPT_DIR/fm-wake-lib.sh" -WATCH_ARM="$SCRIPT_DIR/fm-watch-arm.sh" -SESSION_LOCK="$STATE/.watch-session.lock" -LOG="$STATE/.watch-session.log" -RUNNER_PATH="$SCRIPT_DIR/fm-watch-session.sh" +SESSION_NAME=${FM_WATCH_SESSION_TMUX_SESSION:-firstmate-watch} +HASH=$(printf '%s\n%s\n' "$FM_HOME" "$STATE" | cksum | awk '{print $1}') +WINDOW_NAME=${FM_WATCH_SESSION_TMUX_WINDOW:-fm-watch-$HASH} +TARGET="$SESSION_NAME:$WINDOW_NAME" +SESSION_DIR="$STATE/.watch-session" +ENV_FILE="$SESSION_DIR/env.sh" +RUNNER_FILE="$SESSION_DIR/runner.sh" +STOP_FILE="$SESSION_DIR/stop" +RETRY_DELAY=${FM_WATCH_SESSION_RETRY_DELAY:-5} +AFK_DELAY=${FM_WATCH_SESSION_AFK_DELAY:-15} usage() { - echo "usage: $(basename "$0") [--start|--stop|--status|--foreground|--tmux]" >&2 + echo "usage: $(basename "$0") [start|--status|status|stop|restart]" >&2 } -session_lock_matches_pid() { - local pid=$1 lock_home lock_path lock_identity current_identity - lock_home=$(cat "$SESSION_LOCK/fm-home" 2>/dev/null || true) - lock_path=$(cat "$SESSION_LOCK/runner-path" 2>/dev/null || true) - lock_identity=$(cat "$SESSION_LOCK/pid-identity" 2>/dev/null || true) - [ "$lock_home" = "$FM_HOME" ] || return 1 - [ "$lock_path" = "$RUNNER_PATH" ] || return 1 - [ -n "$lock_identity" ] || return 1 - current_identity=$(fm_pid_identity "$pid") || return 1 - [ "$current_identity" = "$lock_identity" ] +shell_quote() { + # POSIX single-quote escaping. 
+ printf "'%s'" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")" } -session_pid() { - cat "$SESSION_LOCK/pid" 2>/dev/null || true +tmux_window_exists() { + command -v tmux >/dev/null 2>&1 || return 1 + tmux has-session -t "$SESSION_NAME" 2>/dev/null || return 1 + tmux list-windows -t "$SESSION_NAME" -F '#W' 2>/dev/null | grep -Fx "$WINDOW_NAME" >/dev/null } -session_running() { - local pid - pid=$(session_pid) - fm_pid_alive "$pid" || return 1 - session_lock_matches_pid "$pid" +write_runner_files() { + mkdir -p "$SESSION_DIR" + { + printf 'export FM_HOME=%s\n' "$(shell_quote "$FM_HOME")" + printf 'export FM_ROOT_OVERRIDE=%s\n' "$(shell_quote "$FM_ROOT")" + printf 'export FM_STATE_OVERRIDE=%s\n' "$(shell_quote "$STATE")" + printf 'export PATH=%s\n' "$(shell_quote "$PATH")" + } > "$ENV_FILE" + { + printf '#!/usr/bin/env bash\n' + printf 'set -u\n' + printf '. %s\n' "$(shell_quote "$ENV_FILE")" + printf 'rm -f %s\n' "$(shell_quote "$STOP_FILE")" + printf 'while :; do\n' + printf ' [ -e %s ] && exit 0\n' "$(shell_quote "$STOP_FILE")" + printf ' if [ -e "$FM_STATE_OVERRIDE/.afk" ]; then sleep %s; continue; fi\n' "$AFK_DELAY" + printf ' %s/fm-watch-arm.sh\n' "$(shell_quote "$SCRIPT_DIR")" + printf ' rc=$?\n' + printf ' [ -e %s ] && exit 0\n' "$(shell_quote "$STOP_FILE")" + printf ' if [ "$rc" -ne 0 ]; then sleep %s; else sleep 1; fi\n' "$RETRY_DELAY" + printf 'done\n' + } > "$RUNNER_FILE" + chmod +x "$RUNNER_FILE" } -write_session_identity() { - local pid=$1 - printf '%s\n' "$FM_HOME" > "$SESSION_LOCK/fm-home" || true - printf '%s\n' "$RUNNER_PATH" > "$SESSION_LOCK/runner-path" || true - fm_pid_identity "$pid" > "$SESSION_LOCK/pid-identity" 2>/dev/null || true -} - -status_cmd() { - local pid - if session_running; then - pid=$(session_pid) - echo "watch-session: running pid=$pid home=$FM_HOME log=$LOG" - exit 0 +start_runner() { + local command + if ! 
command -v tmux >/dev/null 2>&1; then + echo "watch-session: FAILED - tmux not found" >&2 + return 1 fi - echo "watch-session: stopped home=$FM_HOME" - exit 1 -} - -stop_cmd() { - local pid i pgid - if ! session_running; then - fm_lock_remove_path "$SESSION_LOCK" 2>/dev/null || true - echo "watch-session: stopped home=$FM_HOME" + if tmux_window_exists; then + echo "watch-session: running target=$TARGET home=$FM_HOME" return 0 fi - pid=$(session_pid) - kill -TERM "$pid" 2>/dev/null || true - pgid=$(ps -p "$pid" -o pgid= 2>/dev/null | tr -d ' ' || true) - i=0 - while [ "$i" -lt 80 ] && fm_pid_alive "$pid"; do - if [ "$i" -eq 10 ] && [ "$pgid" = "$pid" ]; then - kill -TERM "-$pid" 2>/dev/null || true - fi - sleep 0.1 - i=$((i + 1)) - done - if fm_pid_alive "$pid"; then - echo "watch-session: FAILED - runner still alive pid=$pid" >&2 - return 1 + command="bash $(shell_quote "$RUNNER_FILE")" + write_runner_files + rm -f "$STOP_FILE" + if tmux has-session -t "$SESSION_NAME" 2>/dev/null; then + tmux new-window -d -t "$SESSION_NAME:" -n "$WINDOW_NAME" "$command" || { + echo "watch-session: FAILED - could not start target=$TARGET" >&2 + return 1 + } + else + tmux new-session -d -s "$SESSION_NAME" -n "$WINDOW_NAME" "$command" || { + echo "watch-session: FAILED - could not start target=$TARGET" >&2 + return 1 + } fi - fm_lock_remove_path "$SESSION_LOCK" 2>/dev/null || true - echo "watch-session: stopped pid=$pid home=$FM_HOME" + echo "watch-session: started target=$TARGET home=$FM_HOME" } -foreground_cmd() { - if ! 
fm_lock_try_acquire "$SESSION_LOCK"; then - if [ -n "${FM_LOCK_HELD_PID:-}" ] && fm_pid_alive "$FM_LOCK_HELD_PID"; then - echo "watch-session: already running pid=$FM_LOCK_HELD_PID home=$FM_HOME" >&2 - else - echo "watch-session: already running home=$FM_HOME" >&2 - fi - exit 1 +status_runner() { + if tmux_window_exists; then + echo "watch-session: running target=$TARGET home=$FM_HOME" + return 0 fi - trap 'fm_lock_release "$SESSION_LOCK"; exit 143' TERM INT HUP - trap 'fm_lock_release "$SESSION_LOCK"' EXIT - write_session_identity "${BASHPID:-$$}" - while :; do - "$WATCH_ARM" >> "$LOG" 2>&1 || true - sleep "${FM_WATCH_SESSION_REARM_DELAY:-1}" - done + echo "watch-session: stopped home=$FM_HOME" + return 1 } -start_cmd() { - local pid i - if session_running; then - pid=$(session_pid) - echo "watch-session: running pid=$pid home=$FM_HOME log=$LOG" +stop_runner() { + touch "$STOP_FILE" 2>/dev/null || true + if tmux_window_exists; then + tmux kill-window -t "$TARGET" + echo "watch-session: stopped target=$TARGET home=$FM_HOME" return 0 fi - fm_lock_remove_path "$SESSION_LOCK" 2>/dev/null || true - : > "$LOG" || { - echo "watch-session: FAILED - cannot write $LOG" >&2 - return 1 - } - if command -v setsid >/dev/null 2>&1; then - setsid "$RUNNER_PATH" --foreground >> "$LOG" 2>&1 < /dev/null & - else - nohup "$RUNNER_PATH" --foreground >> "$LOG" 2>&1 < /dev/null & - fi - pid=$! 
- i=0 - while [ "$i" -lt 80 ]; do - if session_running; then - pid=$(session_pid) - echo "watch-session: started pid=$pid home=$FM_HOME log=$LOG" - return 0 - fi - sleep 0.1 - i=$((i + 1)) - done - echo "watch-session: FAILED - runner did not confirm" >&2 - return 1 + echo "watch-session: stopped home=$FM_HOME" + return 0 } -mode=${1:---status} +mode=${1:-start} case "$mode" in - --start|start) start_cmd ;; - --stop|stop) stop_cmd ;; - --status|status) status_cmd ;; - --foreground|foreground) foreground_cmd ;; - --tmux) - echo "tmux new-window -n fm-watch-$(basename "$FM_HOME") 'cd \"$FM_ROOT\" && FM_HOME=\"$FM_HOME\" bin/fm-watch-session.sh --foreground'" - ;; + start|--start) start_runner ;; + status|--status) status_runner ;; + stop|--stop) stop_runner ;; + restart|--restart) stop_runner >/dev/null; start_runner ;; -h|--help|help) usage; exit 0 ;; *) usage; exit 2 ;; esac diff --git a/docs/architecture.md b/docs/architecture.md index 91e3d4ab..ecce91c6 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -127,5 +127,5 @@ Kill the first mate session anytime; the next one reconciles and carries on. ## Development notes -The current watcher reliability work combines always-on bash triage with a durable queue for actionable wakes, a race-proof singleton lock, duplicate self-eviction, drain-time liveness assertion, a self-verifying tracked-child arm wrapper, and a home-scoped durable active-mode session runner. +The current watcher reliability work combines always-on bash triage with a durable queue for actionable wakes, a race-proof singleton lock, duplicate self-eviction, drain-time liveness assertion that requires both a live lock and fresh beacon, a self-verifying tracked-child arm wrapper, and a home-scoped tmux session runner for harnesses without durable background tasks. 
The presence-gated sub-supervisor (`bin/fm-supervise-daemon.sh`) provides walk-away supervision via the `/afk` skill while reusing the same shared wake classifier as the always-on watcher. diff --git a/docs/scripts.md b/docs/scripts.md index 5a3adb07..44c8f4ae 100644 --- a/docs/scripts.md +++ b/docs/scripts.md @@ -23,7 +23,7 @@ Each file also starts with a short header comment. | `fm-cognee-manifest-check.sh` | Validate TSV Cognee manifest rows and verify `SOURCE_ID`, `SOURCE_PATH`, or `SEED_FILE` answer references against reopened local files | | `fm-marker-lib.sh` | Shared from-firstmate request marker and detector sourced by `fm-send.sh`, `fm-brief.sh`, and tests | | `fm-watch-arm.sh` | Verified per-home watcher re-arm; reports `started`, `healthy`, or `FAILED`; `--restart` relaunches only this home's watcher | -| `fm-watch-session.sh` | Home-scoped durable active watcher runner with `--start`, `--status`, `--stop`, `--foreground`, and `--tmux` helpers | +| `fm-watch-session.sh` | Durable home-scoped tmux runner that loops through `fm-watch-arm.sh` for harness lanes without reliable tracked background tasks | | `fm-watch.sh` | Singleton-safe always-on watcher; absorbs benign wakes in bash, queues and exits only for actionable wakes, and reverts to daemon-owned one-shot behavior while `state/.afk` exists | | `fm-supervise-daemon.sh` | Presence-gated sub-supervisor for walk-away (`/afk`) supervision: wraps `fm-watch.sh`, uses the shared wake classifier, self-handles routine wakes in bash, and escalates only captain-relevant events as one verified, batched, single-line digest prefixed with a sentinel marker | | `fm-crew-state.sh` | Print one stable current-state line for a crew by reconciling its matching no-mistakes run-step, even when the pane has closed, with pane and status-log fallback | diff --git a/tests/fm-wake-queue.test.sh b/tests/fm-wake-queue.test.sh index c04d4bf0..159ae158 100755 --- a/tests/fm-wake-queue.test.sh +++ b/tests/fm-wake-queue.test.sh @@ 
-184,7 +184,7 @@ test_drain_asserts_watcher_liveness() { : > "$err" touch "$state/.last-watcher-beat" FM_STATE_OVERRIDE="$state" FM_GUARD_GRACE=300 "$DRAIN" >/dev/null 2> "$err" || fail "drain failed with a fresh beacon" - grep -F 'fresh beacon but no live watcher lock' "$err" >/dev/null || fail "drain did not warn for a fresh beacon without a live watcher lock" + grep -F 'no watcher has a confirmed live lock' "$err" >/dev/null || fail "drain did not warn for a fresh beacon without a live watcher lock" : > "$err" sleep 300 & diff --git a/tests/fm-watch-session.test.sh b/tests/fm-watch-session.test.sh index 458e40e2..e2f380fc 100644 --- a/tests/fm-watch-session.test.sh +++ b/tests/fm-watch-session.test.sh @@ -1,79 +1,133 @@ #!/usr/bin/env bash -# tests/fm-watch-session.test.sh - durable active watcher runner wrapper. +# tests/fm-watch-session.test.sh - home-scoped durable active watcher runner. set -u # shellcheck source=tests/wake-helpers.sh . "$(dirname "${BASH_SOURCE[0]}")/wake-helpers.sh" -SESSION="$ROOT/bin/fm-watch-session.sh" -WATCH_ARM="$ROOT/bin/fm-watch-arm.sh" - +WATCH_SESSION="$ROOT/bin/fm-watch-session.sh" TMP_ROOT=$(fm_test_tmproot fm-watch-session-tests) -trap fm_test_watch_cleanup_exit EXIT -test_status_reports_missing_session() { - local dir state out status - dir=$(make_case status-missing) - state="$dir/state" - out="$dir/status.out" - status=0 - FM_HOME="$dir" FM_STATE_OVERRIDE="$state" "$SESSION" --status > "$out" || status=$? 
- [ "$status" -ne 0 ] || fail "status exited zero when no watcher session existed" - grep -F 'watch-session: stopped' "$out" >/dev/null || fail "status did not report stopped" - pass "watch-session status reports stopped for an empty home" +install_fake_tmux() { + local dir=$1 fakebin log root + fakebin=$(fm_fakebin "$dir") + log="$dir/tmux.log" + root="$dir/tmux-state" + mkdir -p "$root" + cat > "$fakebin/tmux" <<'SH' +#!/usr/bin/env bash +set -u +log=${FM_FAKE_TMUX_LOG:?} +root=${FM_FAKE_TMUX_ROOT:?} +cmd=${1:-} +shift || true +printf '%s\n' "tmux $cmd $*" >> "$log" +case "$cmd" in + has-session) + target= + while [ "$#" -gt 0 ]; do + case "$1" in -t) target=$2; shift 2 ;; *) shift ;; esac + done + target=${target%%:*} + [ -n "$target" ] && [ -d "$root/$target" ] + ;; + new-session) + session= window= command= + while [ "$#" -gt 0 ]; do + case "$1" in + -d) shift ;; + -s) session=$2; shift 2 ;; + -n) window=$2; shift 2 ;; + *) command=$1; shift ;; + esac + done + mkdir -p "$root/$session" + printf '%s\n' "$command" > "$root/$session/$window" + ;; + new-window) + target= window= command= + while [ "$#" -gt 0 ]; do + case "$1" in + -d) shift ;; + -t) target=$2; shift 2 ;; + -n) window=$2; shift 2 ;; + *) command=$1; shift ;; + esac + done + session=${target%%:*} + mkdir -p "$root/$session" + printf '%s\n' "$command" > "$root/$session/$window" + ;; + list-windows) + target= + while [ "$#" -gt 0 ]; do + case "$1" in -t) target=$2; shift 2 ;; -F) shift 2 ;; *) shift ;; esac + done + session=${target%%:*} + [ -d "$root/$session" ] || exit 1 + for f in "$root/$session"/*; do + [ -f "$f" ] || continue + basename "$f" + done | sort + ;; + kill-window) + target= + while [ "$#" -gt 0 ]; do + case "$1" in -t) target=$2; shift 2 ;; *) shift ;; esac + done + session=${target%%:*} + window=${target#*:} + rm -f "$root/$session/$window" + ;; + *) + echo "unsupported fake tmux command: $cmd" >&2 + exit 2 + ;; +esac +SH + chmod +x "$fakebin/tmux" + printf '%s\n' "$fakebin" } 
-test_start_status_stop_are_home_scoped() { - local dir state other other_state fakebin out start_pid other_pid lock_pid i - dir=$(make_case home-scoped) - state="$dir/state" - other=$(make_case other-home) - other_state="$other/state" - fakebin="$dir/fakebin" - out="$dir/session.out" +test_watch_session_start_status_stop_are_home_scoped() { + local dir fakebin state_a state_b out_a out_b status_a status_b after_stop log + dir=$(make_case session-home-scope) + fakebin=$(install_fake_tmux "$dir") + log="$dir/tmux.log" + state_a="$dir/home-a/state" + state_b="$dir/home-b/state" + mkdir -p "$state_a" "$state_b" + out_a="$dir/a.out" + out_b="$dir/b.out" + status_a="$dir/a.status" + status_b="$dir/b.status" + after_stop="$dir/after-stop.status" - PATH="$fakebin:$PATH" FM_HOME="$other" FM_STATE_OVERRIDE="$other_state" FM_POLL=5 FM_SIGNAL_GRACE=1 FM_CHECK_INTERVAL=999999 FM_HEARTBEAT=999999 "$WATCH_ARM" > "$other/watch-arm.out" & - other_pid=$! - i=0 - while [ "$i" -lt 80 ]; do - [ -s "$other_state/.watch.lock/pid" ] && [ -e "$other_state/.last-watcher-beat" ] && break - sleep 0.1 - i=$((i + 1)) - done - [ -s "$other_state/.watch.lock/pid" ] && [ -e "$other_state/.last-watcher-beat" ] || fail "other home watcher did not start" - start_pid=$(cat "$other_state/.watch.lock/pid") + PATH="$fakebin:$PATH" FM_FAKE_TMUX_LOG="$log" FM_FAKE_TMUX_ROOT="$dir/tmux-state" FM_HOME="$dir/home-a" "$WATCH_SESSION" start > "$out_a" \ + || fail "watch-session did not start home A: $(cat "$out_a" 2>/dev/null || true)" + PATH="$fakebin:$PATH" FM_FAKE_TMUX_LOG="$log" FM_FAKE_TMUX_ROOT="$dir/tmux-state" FM_HOME="$dir/home-b" "$WATCH_SESSION" start > "$out_b" \ + || fail "watch-session did not start home B: $(cat "$out_b" 2>/dev/null || true)" - PATH="$fakebin:$PATH" FM_HOME="$dir" FM_STATE_OVERRIDE="$state" FM_POLL=5 FM_SIGNAL_GRACE=1 FM_CHECK_INTERVAL=999999 FM_HEARTBEAT=999999 "$SESSION" --start > "$out" || fail "watch-session start failed: $(cat "$out")" - grep -F 'watch-session: started' 
"$out" >/dev/null || fail "start did not report a started session" - lock_pid=$(cat "$state/.watch-session.lock/pid" 2>/dev/null || true) - [ -n "$lock_pid" ] || fail "session did not record its runner pid" - kill -0 "$lock_pid" 2>/dev/null || fail "recorded session runner is not alive" + grep -F 'watch-session: started target=' "$out_a" >/dev/null || fail "home A start did not report started" + grep -F 'watch-session: started target=' "$out_b" >/dev/null || fail "home B start did not report started" + [ "$(find "$dir/tmux-state/firstmate-watch" -type f | wc -l | tr -d '[:space:]')" = 2 ] \ + || fail "expected separate tmux windows for two FM_HOME values" - : > "$out" - PATH="$fakebin:$PATH" FM_HOME="$dir" FM_STATE_OVERRIDE="$state" "$SESSION" --status > "$out" || fail "status failed for running session" - grep -F "watch-session: running pid=$lock_pid" "$out" >/dev/null || fail "status did not report the running session pid" + PATH="$fakebin:$PATH" FM_FAKE_TMUX_LOG="$log" FM_FAKE_TMUX_ROOT="$dir/tmux-state" FM_HOME="$dir/home-a" "$WATCH_SESSION" --status > "$status_a" \ + || fail "watch-session status failed for home A" + grep -F 'watch-session: running target=' "$status_a" >/dev/null || fail "home A status did not report running" - : > "$out" - PATH="$fakebin:$PATH" FM_HOME="$dir" FM_STATE_OVERRIDE="$state" "$SESSION" --stop > "$out" || fail "stop failed for running session" - grep -F "watch-session: stopped pid=$lock_pid" "$out" >/dev/null || fail "stop did not report the stopped session pid" - i=0 - while [ "$i" -lt 80 ] && kill -0 "$lock_pid" 2>/dev/null; do - sleep 0.1 - i=$((i + 1)) - done - ! 
kill -0 "$lock_pid" 2>/dev/null || fail "session runner remained alive after stop" - - kill -0 "$start_pid" 2>/dev/null || fail "stopping this home killed another home's watcher" - kill "$other_pid" "$start_pid" 2>/dev/null || true - wait "$other_pid" 2>/dev/null || true - pass "watch-session starts, reports, and stops only the current FM_HOME" -} + PATH="$fakebin:$PATH" FM_FAKE_TMUX_LOG="$log" FM_FAKE_TMUX_ROOT="$dir/tmux-state" FM_HOME="$dir/home-a" "$WATCH_SESSION" stop >/dev/null \ + || fail "watch-session stop failed for home A" + PATH="$fakebin:$PATH" FM_FAKE_TMUX_LOG="$log" FM_FAKE_TMUX_ROOT="$dir/tmux-state" FM_HOME="$dir/home-a" "$WATCH_SESSION" --status > "$after_stop" \ + && fail "home A status succeeded after stop" + grep -F 'watch-session: stopped' "$after_stop" >/dev/null || fail "home A status after stop did not report stopped" -test_source_contains_no_broad_pkill() { - ! grep -Eq 'pkill[[:space:]].*fm-watch|pkill[[:space:]]+-f' "$SESSION" || fail "watch-session uses broad pkill" - pass "watch-session does not use broad pkill" + PATH="$fakebin:$PATH" FM_FAKE_TMUX_LOG="$log" FM_FAKE_TMUX_ROOT="$dir/tmux-state" FM_HOME="$dir/home-b" "$WATCH_SESSION" --status > "$status_b" \ + || fail "stopping home A stopped home B too" + grep -F 'watch-session: running target=' "$status_b" >/dev/null || fail "home B did not remain running" + ! 
grep -F 'pkill -f' "$WATCH_SESSION" >/dev/null || fail "watch-session contains broad pkill -f" + pass "watch-session start/status/stop are scoped to one FM_HOME and never use broad pkill" } -test_status_reports_missing_session -test_start_status_stop_are_home_scoped -test_source_contains_no_broad_pkill +test_watch_session_start_status_stop_are_home_scoped diff --git a/tests/fm-watcher-lock.test.sh b/tests/fm-watcher-lock.test.sh index 5d457874..9d9563d1 100755 --- a/tests/fm-watcher-lock.test.sh +++ b/tests/fm-watcher-lock.test.sh @@ -157,7 +157,7 @@ test_guard_requires_live_matching_watch_lock() { touch "$state/.last-watcher-beat" FM_ROOT_OVERRIDE="$dir" FM_HOME="$dir" FM_STATE_OVERRIDE="$state" FM_GUARD_GRACE=300 "$ROOT/bin/fm-guard.sh" 2> "$err" >/dev/null || fail "guard failed with no lock" grep -F 'WATCHER DOWN - SUPERVISION IS OFF' "$err" >/dev/null || fail "guard stayed silent with fresh beacon but no watcher lock" - grep -F 'fresh beacon but no live watcher lock' "$err" >/dev/null || fail "guard did not explain the false-fresh beacon" + grep -F 'no watcher has a confirmed live lock' "$err" >/dev/null || fail "guard did not explain the false-fresh beacon" # A live pid is still not proof unless the lock identifies THIS home and the # current watcher script. This protects sibling homes and reused pids. @@ -180,7 +180,7 @@ test_guard_requires_live_matching_watch_lock() { fail "guard failed with mismatched lock" } grep -F 'WATCHER DOWN - SUPERVISION IS OFF' "$err" >/dev/null || fail "guard stayed silent for a lock from another home" - grep -F 'watcher lock does not name a live watcher for this home' "$err" >/dev/null || fail "guard did not explain the mismatched lock" + grep -F 'watch lock belongs to another FM_HOME' "$err" >/dev/null || fail "guard did not explain the mismatched lock" kill "$peer" 2>/dev/null || true wait "$peer" 2>/dev/null || true
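Reviewer note: the rule this change centers on is that a watcher counts as confirmed only when the lock names a live process and the beacon is fresh. A minimal standalone sketch of that rule may help when reading the guard hunk. The helper names below (`beacon_is_fresh`, `lock_pid_alive`, `watcher_confirmed`) are hypothetical and are not part of the repository. The sketch assumes a `pid` file inside the lock directory, as the diff reads it, and deliberately omits the `fm-home`, `watcher-path`, and `pid-identity` comparisons that the real `watch_lock_matches_pid` performs.

```sh
#!/usr/bin/env bash
# Illustrative sketch only: these helper names are hypothetical and do not
# exist in the repository. The real guard also compares the fm-home,
# watcher-path, and pid-identity files recorded in the lock directory.
set -u

# True when FILE exists and its mtime is younger than GRACE seconds.
beacon_is_fresh() {
  local file=$1 grace=$2 mtime now
  [ -e "$file" ] || return 1
  # Try the GNU stat spelling first, then the BSD/macOS one.
  mtime=$(stat -c %Y "$file" 2>/dev/null || stat -f %m "$file" 2>/dev/null) || return 1
  now=$(date +%s)
  [ $(( now - mtime )) -lt "$grace" ]
}

# True when LOCK_DIR/pid names a process that is still alive.
lock_pid_alive() {
  local pid
  pid=$(cat "$1/pid" 2>/dev/null || true)
  [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Confirmed only when BOTH signals agree; a fresh beacon alone is not proof.
watcher_confirmed() {
  local state=$1 grace=${2:-300}
  beacon_is_fresh "$state/.last-watcher-beat" "$grace" &&
    lock_pid_alive "$state/.watch.lock"
}
```

Calling `watcher_confirmed "$STATE" "$FM_GUARD_GRACE"` returns non-zero for a state directory that holds only a freshly touched beacon. That is the false-fresh case the updated tests assert on via the `no watcher has a confirmed live lock` banner text.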