10 changes: 7 additions & 3 deletions .agent/STALLED-WORKTREE-RECOVERY.md
@@ -7,10 +7,14 @@ The Guardex Codex launcher auto-finishes a branch only when the codex CLI exits
To act on the report:

- Inspect: `bash scripts/agent-autofinish-watch.sh --once --dry-run`
- Auto-finish once (commit dirty changes, push, create PR, attempt merge): `bash scripts/agent-autofinish-watch.sh --once --auto-merge`
- Run the daemon (poll forever, auto-finish after `--idle-seconds`): `bash scripts/agent-autofinish-watch.sh --daemon --auto-merge`
- Reap merged lanes (prune worktrees whose PR already merged): `bash scripts/agent-autofinish-watch.sh --once --auto-merge`
- Run the daemon (poll forever, reaping merged lanes each cycle): `bash scripts/agent-autofinish-watch.sh --daemon --auto-merge --interval 300`

Defaults: `--idle-seconds=900` (15 min of file silence before auto-commit) and `--branch-prefix=agent/`. The watcher is conservative — it never touches branches outside the configured prefix and only commits worktrees whose files have stopped changing.
Flags: `--idle-minutes` (default 60, or `GUARDEX_AUTOFINISH_IDLE_MINUTES`) gates how long a lane must be quiet before it counts as stalled; `--interval` sets the daemon poll seconds; `--base` overrides the inferred base branch.

The watcher is deliberately conservative. It only ever **reports** agent worktrees with unmerged work (committed-no-PR or uncommitted) — it never auto-commits, pushes, or opens a PR for un-reviewed work. `--auto-merge` only reaps lanes whose PR has already **merged** (delegating to `gx worktree prune --include-pr-merged --delete-branches`), which is what fixes the post-merge "retained for now" gap. Finishing an un-PR'd lane stays a manual `gx branch finish`. Healthy in-flight lanes (open PR, or a live process in the worktree) produce no output.

A stalled lane that holds file locks can keep blocking other agents; clear those with `gx locks reap` (removes locks from worktrees idle past `--ttl-hours` / `GUARDEX_LOCK_TTL_HOURS`, default 7 days, with no live process inside).

## Source-probe temp worktree cleanup

@@ -0,0 +1,2 @@
schema: spec-driven
created: 2026-06-30
@@ -0,0 +1,19 @@
## Why

Multi-agent worktree recovery had three gaps that strand state and block agents:

- `scripts/agent-stalled-report.sh` (a `SessionStart` hook) wrapped `scripts/agent-autofinish-watch.sh`, but that watcher was **never authored**, so the hook soft-exited 0 and merged-PR worktrees (the "retained for now" path in `agent-branch-finish.sh`) were never reaped.
- File locks recorded `claimed_at` but had no expiry. A lingering-but-idle worktree (crashed or forgotten lane) keeps blocking other agents on its files forever.
- `gx finish --all` finished each lane but never swept merged-but-stranded worktree dirs whose branch merged out-of-band.

## What Changes

- Add `scripts/agent-autofinish-watch.sh`: scans agent worktrees, reports stalled lanes (work present, no open PR, past idle gate) and merged-but-retained lanes, and under `--auto-merge` reaps merged lanes via the existing `gx worktree prune` primitive. Resolves the primary checkout via the git common dir; healthy in-flight lanes stay silent.
- Add `gx locks reap`: clears locks from worktrees idle past a TTL (`--ttl-hours` / `GUARDEX_LOCK_TTL_HOURS`, default 7d) with no live process inside. A blocked `claim` against a past-TTL lock now hints at `reap`.
- `gx finish --all` sweeps merged orphans after a fully-successful run (opt-out `--no-sweep-orphans`, never on dry-run), gated by the pure `shouldSweepOrphans` predicate.

## Impact

- Affected surfaces: `scripts/agent-autofinish-watch.sh` (new), `templates/scripts/agent-file-locks.py`, `src/finish/index.js`, `src/cli/args.js`, `.agent/STALLED-WORKTREE-RECOVERY.md`.
- Conservative by design: `--auto-merge` only reaps **merged** lanes; it does not auto-commit/push/PR un-reviewed work. Finishing un-PR'd lanes stays a reported manual action.
- Follow-up (out of scope here, blocked by a foreign lock on `src/cli/commands/claude.js`): distribute the watcher + `agent-stalled-report.sh` to target repos (pair into `templates/scripts/`, register in `MANAGED_TEMPLATE_DESTINATIONS`, add the report to `MANAGED_HOOK_FILES`). The new `gx locks reap` can clear that very stale lock once it ages out.
@@ -0,0 +1,34 @@
## ADDED Requirements

### Requirement: Stalled-lane watcher
The system SHALL provide `scripts/agent-autofinish-watch.sh`, which the `SessionStart` shim `scripts/agent-stalled-report.sh` invokes, to detect stalled agent worktrees and reap merged-but-retained lanes. It SHALL resolve the primary checkout via the git common dir so it operates correctly from inside any worktree, and SHALL emit a `[agent-autofinish-watch] agent/<branch>: <status>` line only for actionable lanes.
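
Because only actionable lanes produce a per-lane line, a consumer can filter on the fixed prefix. The sketch below is a hypothetical consumer, not the report shim from this PR; `parse_lane_lines` and the `|` separator are illustrative choices.

```bash
# Hypothetical consumer of the watcher output (not part of this PR).
# Keeps only per-lane lines and prints "<branch>|<status>". The summary line
# (scanned=... reaped=...) and dry-run notices do not match and are dropped.
parse_lane_lines() {
  sed -n 's/^\[agent-autofinish-watch\] \(agent\/[^:]*\): \(.*\)$/\1|\2/p'
}

# Usage: bash scripts/agent-autofinish-watch.sh --once --dry-run | parse_lane_lines
```

Git ref names cannot contain `:`, so splitting on the first `: ` after the branch is safe for this format.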

#### Scenario: Stalled lane is reported
- **WHEN** an agent worktree has committed or uncommitted work, no open PR, and is idle past the idle gate
- **THEN** the watcher prints an actionable `agent/<branch>: ... -> needs finish` line
- **AND** a healthy in-flight lane (open PR or a live process) produces no line.

#### Scenario: Merged lane is reaped under --auto-merge
- **WHEN** an agent branch's PR has merged but its worktree is still on disk
- **THEN** the watcher reports the lane as `prunable`
- **AND** with `--auto-merge` (and not `--dry-run`) it delegates to `gx worktree prune --include-pr-merged --delete-branches` to remove it.
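
The two scenarios above imply an order of checks: merged first, then the healthy in-flight cases, then the idle gate, then the kind of unmerged work. A minimal sketch of that precedence follows, assuming the inputs have already been gathered. `classify_lane` and the `silent` label (meaning no line is printed) are hypothetical names for illustration, and the default gate of 60 mirrors the documented `--idle-minutes` default.

```bash
# Hypothetical helper that mirrors the precedence described above.
# Args: merged_pr open_pr live_process idle_min dirty_count commits_ahead [gate_min]
# The first three arguments are 0|1 flags.
classify_lane() {
  local merged_pr="$1" open_pr="$2" live="$3"
  local idle="$4" dirty="$5" ahead="$6"
  local gate="${7:-60}"   # assumed default, matching --idle-minutes

  # A merged PR with a retained worktree wins over everything else.
  if [ "$merged_pr" -eq 1 ]; then echo "prunable"; return 0; fi

  # Healthy in-flight lanes, and lanes still inside the idle gate, stay silent.
  if [ "$open_pr" -eq 1 ] || [ "$live" -eq 1 ] || [ "$idle" -lt "$gate" ]; then
    echo "silent"; return 0
  fi

  if [ "$dirty" -gt 0 ]; then echo "needs commit + finish"; return 0; fi
  if [ "$ahead" -gt 0 ]; then echo "needs finish"; return 0; fi
  echo "silent"
}
```

For example, `classify_lane 0 0 0 90 0 2` describes a lane with two commits ahead, no PR, no live process, and 90 idle minutes.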

### Requirement: Stale lock reaping
The `gx locks` tool SHALL provide a `reap` subcommand that clears file locks held by abandoned worktrees: present on disk, idle beyond a TTL (`--ttl-hours`, `GUARDEX_LOCK_TTL_HOURS`, default 7 days), and with no live process inside. It SHALL never clear locks from a worktree that has a live process, and a blocked `claim` against a past-TTL lock SHALL surface a hint pointing at `gx locks reap`.

#### Scenario: Abandoned lock is reaped
- **WHEN** `gx locks reap` runs and a sibling worktree holds a lock older than the TTL with no live process
- **THEN** that lock entry is removed from the sibling worktree's lock file
- **AND** `--dry-run` reports the same lock without removing it.

#### Scenario: Active lock is preserved
- **WHEN** a lock is within the TTL, or its worktree has a live process
- **THEN** `reap` leaves the lock in place.
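
The reap rule reduces to a two-part predicate: past the TTL and no live process. The sketch below is only an illustration of that rule; the real logic lives in the lock tool and is not shown in this diff. `should_reap_lock` is a hypothetical name, and treating a lock exactly at the TTL as still active is an assumption.

```bash
# Hypothetical predicate illustrating the reap rule (not the PR's implementation).
# Args: lock_age_hours has_live_process(0|1) [ttl_hours]
should_reap_lock() {
  local age_hours="$1" live="$2"
  # Default of 7 days = 168 hours; GUARDEX_LOCK_TTL_HOURS overrides it.
  local ttl_hours="${3:-${GUARDEX_LOCK_TTL_HOURS:-168}}"

  # Never reap when a process is alive inside the worktree.
  [ "$live" -eq 0 ] && [ "$age_hours" -gt "$ttl_hours" ]
}

# Usage: should_reap_lock 200 0 && echo "reap"
```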

### Requirement: Bulk-finish orphan sweep
`gx finish --all` SHALL sweep merged-but-stranded worktree dirs after the per-lane loop, but only when every lane succeeded and never on a dry run; `--no-sweep-orphans` SHALL opt out of the sweep. The sweep SHALL be best-effort: a sweep failure warns but does not fail the finish.

#### Scenario: Sweep fires after a successful bulk finish
- **WHEN** `gx finish --all` completes with no failed lanes and `--no-sweep-orphans` is not set
- **THEN** it runs `gx worktree prune --include-pr-merged --delete-branches`
- **AND** with `--no-sweep-orphans`, `--dry-run`, a single-branch finish, or any failed lane, the sweep does not run.
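
The gate in this requirement can be read as a pure predicate over four inputs. The PR's actual predicate (`shouldSweepOrphans`) is JavaScript and is not reproduced in this diff, so the following is a hypothetical shell rendering of the same conditions; the function names and the warning text are assumptions.

```bash
# Hypothetical shell rendering of the sweep gate.
# Args: all_mode dry_run no_sweep_orphans failed_lanes  (first three are 0|1)
should_sweep_orphans() {
  local all_mode="$1" dry_run="$2" no_sweep="$3" failed="$4"
  [ "$all_mode" -eq 1 ] && [ "$dry_run" -eq 0 ] \
    && [ "$no_sweep" -eq 0 ] && [ "$failed" -eq 0 ]
}

# Best-effort sweep: a prune failure warns but never fails the finish.
sweep_if_allowed() {
  if should_sweep_orphans "$@"; then
    gx worktree prune --include-pr-merged --delete-branches \
      || echo "warning: orphan sweep failed" >&2
  fi
}
```

Keeping the predicate free of side effects makes each of the four "does not run" cases easy to check in isolation.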
@@ -0,0 +1,34 @@
## Definition of Done

This change is complete only when **all** of the following are true:

- Every checkbox below is checked.
- The agent branch reaches `MERGED` state on `origin` and the PR URL + state are recorded in the completion handoff.
- If any step blocks (test failure, conflict, ambiguous result), append a `BLOCKED:` line under section 4 explaining the blocker and **STOP**. Do not tick remaining cleanup boxes; do not silently skip the cleanup pipeline.

## Handoff

- Handoff: change=`agent-claude-autofinish-watcher-lock-ttl-orphan-sweep-2026-06-30-02-05`; branch=`agent/<your-name>/<branch-slug>`; scope=`watcher + lock reap + finish --all orphan sweep; distribution to target repos is a follow-up (blocked by foreign lock on src/cli/commands/claude.js)`; action=`continue this sandbox or finish cleanup after a usage-limit/manual takeover`.
- Copy prompt: Continue `agent-claude-autofinish-watcher-lock-ttl-orphan-sweep-2026-06-30-02-05` on branch `agent/<your-name>/<branch-slug>`. Work inside the existing sandbox, review `openspec/changes/agent-claude-autofinish-watcher-lock-ttl-orphan-sweep-2026-06-30-02-05/tasks.md`, continue from the current state instead of creating a new sandbox, and when the work is done run `gx branch finish --branch agent/<your-name>/<branch-slug> --base dev --via-pr --wait-for-merge --cleanup`.

## 1. Specification

- [x] 1.1 Finalize proposal scope and acceptance criteria for `agent-claude-autofinish-watcher-lock-ttl-orphan-sweep-2026-06-30-02-05`.
- [x] 1.2 Define normative requirements in `specs/autofinish-watcher-lock-ttl-orphan-sweep-for-multi-agent-worktrees/spec.md`.

## 2. Implementation

- [x] 2.1 Implement scoped behavior changes.
- [x] 2.2 Add/update focused regression coverage.

## 3. Verification

- [x] 3.1 Run targeted project verification commands.
- [x] 3.2 Run `openspec validate agent-claude-autofinish-watcher-lock-ttl-orphan-sweep-2026-06-30-02-05 --type change --strict`.
- [x] 3.3 Run `openspec validate --specs`.

## 4. Cleanup (mandatory; run before claiming completion)

- [ ] 4.1 Run the cleanup pipeline: `gx branch finish --branch agent/<your-name>/<branch-slug> --base dev --via-pr --wait-for-merge --cleanup`. This handles commit -> push -> PR create -> merge wait -> worktree prune in one invocation.
- [ ] 4.2 Record the PR URL and final merge state (`MERGED`) in the completion handoff.
- [ ] 4.3 Confirm the sandbox worktree is gone (`git worktree list` no longer shows the agent path; `git branch -a` shows no surviving local/remote refs for the branch).
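
Step 4.3 can be checked mechanically. The helpers below are a hypothetical sketch, not part of this change: both read command output on stdin so they can be tried without a repository, and the names `worktree_listed` and `branch_survives` are illustrative.

```bash
# Succeeds if `git worktree list --porcelain` output still lists the given path.
worktree_listed() {
  grep -Fxq "worktree $1"
}

# Succeeds if `git branch -a` output still shows a local or remote ref
# for the given branch.
branch_survives() {
  local branch="$1"
  # Drop the two-character status column, then compare whole ref names.
  cut -c3- | awk -v b="$branch" '
    $1 == b { found = 1 }
    index($1, "remotes/") == 1 &&
      substr($1, length($1) - length(b)) == ("/" b) { found = 1 }
    END { exit (found ? 0 : 1) }'
}

# Usage (inside the repository):
#   git worktree list --porcelain | worktree_listed "/path/to/agent/worktree"
#   git branch -a | branch_survives "agent/<your-name>/<branch-slug>"
# Cleanup is confirmed when both commands fail.
```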
280 changes: 280 additions & 0 deletions scripts/agent-autofinish-watch.sh
@@ -0,0 +1,280 @@
#!/usr/bin/env bash
# Detect stalled agent/* worktrees and (optionally) reap lanes whose PR already
# merged but whose worktree was retained on disk.
#
# This is the watcher that scripts/agent-stalled-report.sh (the SessionStart
# hook) expects. Without it, that shim soft-exits 0 and merged-PR worktrees are
# never cleaned up (the "retained for now" path in agent-branch-finish.sh).
#
# It does NOT reinvent cleanup: reaping delegates to `gx worktree prune`
# (scripts/agent-worktree-prune.sh), the existing, tested primitive.
#
# Per-lane status lines use the prefix the report shim greps:
# [agent-autofinish-watch] agent/<branch>: <status>
# A line is emitted ONLY for actionable lanes (merged-but-retained, or stalled
# with no open PR after the idle gate). Healthy in-flight lanes stay silent.
#
# Exit codes: 0 always (informational); reaping failures warn but do not fail.

set -euo pipefail

MODE="once" # once | daemon
DRY_RUN=0
AUTO_MERGE=0
INTERVAL=300
IDLE_MINUTES="${GUARDEX_AUTOFINISH_IDLE_MINUTES:-60}"
BASE_BRANCH="${GUARDEX_BASE_BRANCH:-}"
GH_BIN="${GUARDEX_GH_BIN:-gh}"
NOW_EPOCH_OVERRIDE="${GUARDEX_AUTOFINISH_NOW_EPOCH:-}"

WORKTREE_ROOT_RELS=(
  ".omx/agent-worktrees"
  ".omx/.tmp-worktrees"
  ".omc/agent-worktrees"
  ".omc/.tmp-worktrees"
)
LOCK_FILE_REL=".omx/state/agent-file-locks.json"

while [[ $# -gt 0 ]]; do
  case "$1" in
    --once) MODE="once"; shift ;;
    --daemon) MODE="daemon"; shift ;;
    --dry-run) DRY_RUN=1; shift ;;
    --auto-merge) AUTO_MERGE=1; shift ;;
    --interval)
      [[ $# -ge 2 ]] || { echo "[agent-autofinish-watch] --interval requires a value" >&2; exit 1; }
      INTERVAL="$2"; shift 2 ;;
    --idle-minutes)
      [[ $# -ge 2 ]] || { echo "[agent-autofinish-watch] --idle-minutes requires a value" >&2; exit 1; }
      IDLE_MINUTES="$2"; shift 2 ;;
    --base)
      [[ $# -ge 2 ]] || { echo "[agent-autofinish-watch] --base requires a value" >&2; exit 1; }
      BASE_BRANCH="$2"; shift 2 ;;
    -h|--help)
      echo "Usage: $0 [--once|--daemon] [--dry-run] [--auto-merge] [--interval SEC] [--idle-minutes MIN] [--base BRANCH]"
      echo "Note: merged/open PR detection reads the most recent 200 PRs per state; a"
      echo " branch whose merged PR is older than that will not be auto-reaped."
      exit 0
      ;;
    *)
      echo "[agent-autofinish-watch] Unknown argument: $1" >&2
      exit 1
      ;;
  esac
done

if ! git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  echo "[agent-autofinish-watch] Not inside a git repository." >&2
  exit 0
fi

# Resolve the PRIMARY checkout root, not the current worktree: the managed
# worktree roots (.omc/agent-worktrees, ...) live under the primary checkout,
# and refs/reflogs are shared via the common git dir. Running from inside an
# agent worktree must still see every sibling lane.
git_common_dir="$(git rev-parse --git-common-dir 2>/dev/null)"
case "$git_common_dir" in
  /*) ;;
  *) git_common_dir="$(git rev-parse --show-toplevel)/${git_common_dir}" ;;
esac
repo_root="$(cd "$(dirname "$git_common_dir")" && pwd)"

resolve_base_branch() {
  [[ -n "$BASE_BRANCH" ]] && return 0
  local head_ref
  head_ref="$(git -C "$repo_root" symbolic-ref --quiet --short refs/remotes/origin/HEAD 2>/dev/null || true)"
  if [[ -n "$head_ref" ]]; then
    BASE_BRANCH="${head_ref#origin/}"
    return 0
  fi
  for cand in main master dev; do
    if git -C "$repo_root" show-ref --verify --quiet "refs/heads/${cand}"; then
      BASE_BRANCH="$cand"
      return 0
    fi
  done
  BASE_BRANCH="main"
}

is_managed_worktree_path() {
  local entry="$1" rel
  for rel in "${WORKTREE_ROOT_RELS[@]}"; do
    [[ "$entry" == "${repo_root}/${rel}"/* ]] && return 0
  done
  return 1
}

is_temporary_worktree_path() {
  local name
  name="$(basename "$1")"
  [[ "$name" == __agent_integrate-* || "$name" == __source-probe-* ]]
}

now_epoch() {
  if [[ -n "$NOW_EPOCH_OVERRIDE" ]]; then
    printf '%s' "$NOW_EPOCH_OVERRIDE"
  else
    date +%s
  fi
}

has_live_process_in_worktree() {
  local wt="$1" proc_cwd live_cwd
  [[ -d /proc ]] || return 1
  for proc_cwd in /proc/[0-9]*/cwd; do
    [[ -e "$proc_cwd" ]] || continue
    live_cwd="$(readlink "$proc_cwd" 2>/dev/null || true)"
    [[ -n "$live_cwd" ]] || continue
    live_cwd="${live_cwd% (deleted)}"
    if [[ "$live_cwd" == "$wt" || "$live_cwd" == "${wt}"/* ]]; then
      return 0
    fi
  done
  return 1
}

branch_idle_minutes() {
  local branch="$1" wt="$2" activity_epoch="" lock_mtime now
  activity_epoch="$(git -C "$repo_root" reflog show --format='%ct' -n 1 "refs/heads/${branch}" 2>/dev/null | head -n1 | tr -d '[:space:]')"
  if [[ -z "$activity_epoch" ]]; then
    activity_epoch="$(git -C "$repo_root" log -1 --format='%ct' "$branch" 2>/dev/null | head -n1 | tr -d '[:space:]')"
  fi
  if [[ -n "$wt" && -f "${wt}/${LOCK_FILE_REL}" ]]; then
    lock_mtime="$(stat -c %Y "${wt}/${LOCK_FILE_REL}" 2>/dev/null || stat -f %m "${wt}/${LOCK_FILE_REL}" 2>/dev/null || true)"
    if [[ "$lock_mtime" =~ ^[0-9]+$ && ( -z "$activity_epoch" || "$lock_mtime" -gt "$activity_epoch" ) ]]; then
      activity_epoch="$lock_mtime"
    fi
  fi
  [[ "$activity_epoch" =~ ^[0-9]+$ ]] || { printf '%s' 999999; return; }
  now="$(now_epoch)"
  printf '%s' $(( (now - activity_epoch) / 60 ))
}

# Count uncommitted changes, ignoring lock-file churn.
dirty_count() {
  local wt="$1"
  git -C "$wt" status --porcelain -- . ":(exclude)${LOCK_FILE_REL}" 2>/dev/null | grep -c . || true
}

commits_ahead() {
  local branch="$1"
  git -C "$repo_root" rev-list --count "${BASE_BRANCH}..${branch}" 2>/dev/null || printf '0'
}

# Prefer the gx CLI; fall back to the bundled prune script.
run_prune() {
  if command -v gx >/dev/null 2>&1; then
    gx worktree prune "$@"
  else
    bash "${repo_root}/scripts/agent-worktree-prune.sh" "$@"
  fi
}

declare -A MERGED_BRANCHES=()
declare -A OPEN_BRANCHES=()

load_pr_state() {
  command -v "$GH_BIN" >/dev/null 2>&1 || return 0
  local line
  while IFS= read -r line; do
    [[ -n "$line" ]] && MERGED_BRANCHES["$line"]=1
  done < <("$GH_BIN" pr list --state merged --base "$BASE_BRANCH" --limit 200 --json headRefName --jq '.[].headRefName' 2>/dev/null || true)
  while IFS= read -r line; do
    [[ -n "$line" ]] && OPEN_BRANCHES["$line"]=1
  done < <("$GH_BIN" pr list --state open --base "$BASE_BRANCH" --limit 200 --json headRefName --jq '.[].headRefName' 2>/dev/null || true)
}

run_once() {
  resolve_base_branch
  MERGED_BRANCHES=()
  OPEN_BRANCHES=()
  load_pr_state

  local scanned=0 stalled=0 merged=0
  local cur_wt="" cur_branch=""

  while IFS= read -r line; do
    if [[ "$line" == worktree\ * ]]; then
      cur_wt="${line#worktree }"
      cur_branch=""
    elif [[ "$line" == branch\ refs/heads/* ]]; then
      cur_branch="${line#branch refs/heads/}"
    elif [[ -z "$line" ]]; then
      process_lane "$cur_wt" "$cur_branch"
      cur_wt=""; cur_branch=""
    fi
  done < <(git -C "$repo_root" worktree list --porcelain; printf '\n')

  # Reap merged-but-retained lanes before the summary so reaped= is accurate.
  if [[ "$merged" -gt 0 ]]; then
    reap_merged
  fi

  printf '[agent-autofinish-watch] scanned=%s stalled=%s merged=%s reaped=%s\n' \
    "$scanned" "$stalled" "$merged" "$reaped"
}

# process_lane mutates scanned/stalled/merged in the caller scope (bash dynamic
# scope via run_once's locals); reaped is a top-level global set by reap_merged.
process_lane() {
  local wt="$1" branch="$2"
  [[ -n "$wt" && -n "$branch" ]] || return 0
  [[ "$branch" == agent/* ]] || return 0
  is_managed_worktree_path "$wt" || return 0
  is_temporary_worktree_path "$wt" && return 0
  scanned=$((scanned + 1))

  if [[ -n "${MERGED_BRANCHES[$branch]:-}" && -d "$wt" ]]; then
    merged=$((merged + 1))
    echo "[agent-autofinish-watch] ${branch}: merged PR, worktree retained -> prunable"
    return 0
  fi

  # Open PR or live process => healthy in-flight, stay silent.
  [[ -n "${OPEN_BRANCHES[$branch]:-}" ]] && return 0
  has_live_process_in_worktree "$wt" && return 0

  local idle dirty ahead
  idle="$(branch_idle_minutes "$branch" "$wt")"
  [[ "$idle" -ge "$IDLE_MINUTES" ]] || return 0

  dirty="$(dirty_count "$wt")"
  if [[ "$dirty" -gt 0 ]]; then
    stalled=$((stalled + 1))
    echo "[agent-autofinish-watch] ${branch}: ${dirty} uncommitted change(s), idle ${idle}m -> needs commit + finish"
    return 0
  fi

  ahead="$(commits_ahead "$branch")"
  if [[ "$ahead" -gt 0 ]]; then
    stalled=$((stalled + 1))
    echo "[agent-autofinish-watch] ${branch}: ${ahead} commit(s) ahead of ${BASE_BRANCH}, no PR, idle ${idle}m -> needs finish"
  fi
}

reaped=0

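reap_merged() {
  [[ "$AUTO_MERGE" -eq 1 ]] || return 0
  if [[ "$DRY_RUN" -eq 1 ]]; then
    echo "[agent-autofinish-watch] [dry-run] would prune merged lanes: gx worktree prune --include-pr-merged --delete-branches --base ${BASE_BRANCH}"
    return 0
  fi
  local out=""
  out="$(run_prune --include-pr-merged --delete-branches --base "$BASE_BRANCH" 2>&1 || true)"
  printf '%s\n' "$out"
  local removed
  removed="$(printf '%s\n' "$out" | sed -n 's/.*removed_worktrees=\([0-9]*\).*/\1/p' | head -n1)"
  # Use `if` rather than `[[ ... ]] && ...` as the last command: when the prune
  # output has no removed_worktrees= count, a failed test would make this
  # function return non-zero and trip `set -e` in run_once, contradicting the
  # "exit 0 always" contract and killing the daemon loop.
  if [[ "$removed" =~ ^[0-9]+$ ]]; then
    reaped="$removed"
  fi
}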

if [[ "$MODE" == "daemon" ]]; then
  while true; do
    reaped=0
    run_once
    sleep "$INTERVAL"
  done
else
  reaped=0
  run_once
fi