Supervisor resilience: failure handling, conflict resolution, and observability

## Summary

Add failure handling with configurable retries, a dedicated conflict resolution agent for merge failures, resume support for crashed supervisors, and structured progress reporting (log output plus a JSON file for the VS Code extension).

## Context

Part of the Autonomous Swarm Mode epic (#557). This issue extends the supervisor loop (#562) with production-grade resilience and observability.

## Scope

### 1. Failure Handling & Retry
When an agent exits with non-zero status:
1. Release the Beads claim (`bd update <id> --status open`)
2. Clean up the failed worktree
3. Increment failure counter for this task
4. If `failureCount <= swarm.maxRetries`:
   - Log "Retrying task #N (attempt X of Y)..."
   - Create fresh worktree, spawn new agent
5. If `failureCount > swarm.maxRetries`:
   - Mark as failed in Beads (`bd close <id> --reason "Failed after N retries"`)
   - Log failure, continue with other tasks
   - **Important**: If the failed task was blocking others, those tasks remain blocked
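
The retry decision in steps 3 to 5 can be sketched as a small pure function. This is only an illustration: `onAgentFailure`, `failureLogLine`, and `failureCounts` are hypothetical names, and the sketch assumes `swarm.maxRetries` counts retries after the first attempt, so the default of 1 yields at most 2 attempts (which matches the sample report in this issue).

```typescript
// Hypothetical names throughout; only `maxRetries` mirrors `swarm.maxRetries`.
type FailureAction =
  | { kind: "retry"; attempt: number; maxAttempts: number }
  | { kind: "give_up"; attempts: number };

// Per-task failure counters, keyed by issue number.
const failureCounts = new Map<number, number>();

/**
 * Record one failed attempt for `taskId` and decide what happens next.
 * Assumption: `maxRetries` is the number of retries allowed after the
 * first attempt, so a task runs at most `maxRetries + 1` times.
 */
function onAgentFailure(taskId: number, maxRetries: number): FailureAction {
  const failures = (failureCounts.get(taskId) ?? 0) + 1;
  failureCounts.set(taskId, failures);
  if (failures <= maxRetries) {
    // The next attempt is number `failures + 1`.
    return { kind: "retry", attempt: failures + 1, maxAttempts: maxRetries + 1 };
  }
  return { kind: "give_up", attempts: failures };
}

// Build the log line for a decision, using the wording from step 4 for retries.
function failureLogLine(taskId: number, action: FailureAction): string {
  return action.kind === "retry"
    ? `Retrying task #${taskId} (attempt ${action.attempt} of ${action.maxAttempts})...`
    : `Task #${taskId} failed (attempts: ${action.attempts})`;
}
```

Claim release, worktree cleanup, and the `bd close` call are deliberately left to the caller, so the counter logic stays easy to unit test.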

At swarm completion, report aggregate failures:
```
Swarm complete: 6/7 tasks succeeded, 1 failed
Failed: #103 "Implement auth" (2 attempts, agent error)
```
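
One possible way to produce the report above is sketched below. `TaskFailure` and `formatSwarmReport` are hypothetical names, and the succeeded count is passed in explicitly because tasks left blocked behind a failure are neither succeeded nor failed.

```typescript
// Hypothetical shape for one failed task in the final report.
interface TaskFailure {
  issue: number;
  title: string;
  attempts: number;
  reason: string;
}

// Format the end-of-swarm summary followed by one line per failed task.
function formatSwarmReport(
  succeeded: number,
  total: number,
  failures: TaskFailure[]
): string {
  const header =
    failures.length === 0
      ? `Swarm complete: ${succeeded}/${total} tasks succeeded`
      : `Swarm complete: ${succeeded}/${total} tasks succeeded, ${failures.length} failed`;
  const lines = [header];
  for (const f of failures) {
    const noun = f.attempts === 1 ? "attempt" : "attempts";
    lines.push(`Failed: #${f.issue} "${f.title}" (${f.attempts} ${noun}, ${f.reason})`);
  }
  return lines.join("\n");
}
```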

### 2. Conflict Resolution Agent
When the merge queue encounters a conflict:
1. Detect merge failure (GitHub API returns conflict status)
2. Log "Merge conflict detected for PR #N. Spawning resolver..."
3. Spawn a lightweight Claude Code agent in the child's worktree:
   - Agent prompt: "Rebase branch `<child-branch>` onto `<epic-branch>`, resolve all merge conflicts preserving the intent of both changes, then force-push."
   - Use `il spin -p` with a conflict-resolution-specific prompt (new template or env var)
   - Agent has context: the PR diff, the epic branch state
4. After resolver exits:
   - Retry merge
   - If conflicts remain and `conflictRetryCount < swarm.maxConflictRetries`: repeat from step 3
   - If retries are exhausted: mark the task as failed and skip it
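
The merge/resolve cycle above might look like the following loop. It is a sketch under stated assumptions: `tryMerge` and `runResolver` are hypothetical callbacks standing in for the real merge call and the resolver spawn (neither is specified here), and `conflictRetryCount` is assumed to increase once per resolver run.

```typescript
// All names are hypothetical; the callbacks are injected so the loop can be
// tested without GitHub or a real agent process.
type MergeResult = "merged" | "conflict";

interface ConflictDeps {
  tryMerge: (prNumber: number) => Promise<MergeResult>;
  runResolver: (prNumber: number) => Promise<void>;
  log: (message: string) => void;
}

async function mergeWithResolver(
  prNumber: number,
  maxConflictRetries: number,
  deps: ConflictDeps
): Promise<"merged" | "failed"> {
  let conflictRetryCount = 0;
  // Each loop iteration starts with a merge attempt.
  while ((await deps.tryMerge(prNumber)) === "conflict") {
    if (conflictRetryCount >= maxConflictRetries) {
      deps.log(`PR #${prNumber}: conflict retries exhausted, marking task as failed`);
      return "failed";
    }
    conflictRetryCount += 1;
    deps.log(`Merge conflict detected for PR #${prNumber}. Spawning resolver...`);
    await deps.runResolver(prNumber);
  }
  return "merged";
}
```

With `maxConflictRetries = 3`, this sketch runs the resolver at most three times and attempts the merge at most four times before giving up.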

### 3. Resume Support
When supervisor starts and detects existing Beads state for this epic:
1. Read Beads task statuses (`bd list --json`)
2. Skip tasks marked as `closed` (already completed)
3. For tasks marked `in_progress`:
   - Check PID file for running processes
   - If process still running: re-attach monitoring
   - If process dead: release claim, treat as failure (retry applies)
4. For tasks marked `open`/`ready`: proceed normally
5. Log "Resuming swarm: X completed, Y in progress, Z remaining"
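
The resume classification can be illustrated with a pure function over already-parsed task records. The `TaskSnapshot` shape is hypothetical (the actual `bd list --json` output and PID file format are not described here), and the liveness check is injected; in Node.js, `process.kill(pid, 0)` inside a `try`/`catch` is one common way to implement it. Counting dead `in_progress` tasks under "remaining" in the log line is also an assumption.

```typescript
// Statuses are the ones named in this issue; the record shape is a
// hypothetical merge of Beads state and PID file contents.
type TaskStatus = "open" | "ready" | "in_progress" | "closed";

interface TaskSnapshot {
  id: string;
  status: TaskStatus;
  pid?: number; // from the PID file, if one was recorded
}

interface ResumePlan {
  completed: string[]; // closed: skip
  reattach: string[]; // in_progress and the agent process is still alive
  retry: string[]; // in_progress but the process is gone: release claim, count a failure
  remaining: string[]; // open or ready: proceed normally
}

function planResume(
  tasks: TaskSnapshot[],
  isAlive: (pid: number) => boolean
): ResumePlan {
  const plan: ResumePlan = { completed: [], reattach: [], retry: [], remaining: [] };
  for (const task of tasks) {
    if (task.status === "closed") {
      plan.completed.push(task.id);
    } else if (task.status === "in_progress") {
      // A missing PID is treated the same as a dead process.
      const alive = task.pid !== undefined && isAlive(task.pid);
      (alive ? plan.reattach : plan.retry).push(task.id);
    } else {
      plan.remaining.push(task.id);
    }
  }
  return plan;
}

function resumeLogLine(plan: ResumePlan): string {
  const remaining = plan.remaining.length + plan.retry.length;
  return `Resuming swarm: ${plan.completed.length} completed, ${plan.reattach.length} in progress, ${remaining} remaining`;
}
```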

### 4. Progress Reporting

**Terminal output:**
- On state change: log structured line with timestamp
- Periodic summary (every 30s or on change): "Active: 3/3 | Completed: 4/7 | Failed: 0 | Blocked: 0"
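
As a rough illustration of the two kinds of terminal lines, the helpers below format a timestamped event line and the periodic summary. The names are hypothetical, the bracketed ISO timestamp is just one possible layout, and reading "Active: 3/3" as running agents over a concurrency limit (`maxAgents`) is an assumption.

```typescript
// Hypothetical counters backing the periodic summary line.
interface SummaryCounts {
  active: number;
  maxAgents: number; // assumed concurrency limit
  completed: number;
  total: number;
  failed: number;
  blocked: number;
}

// One structured line per state change, prefixed with an ISO 8601 timestamp.
function formatEventLine(timestamp: Date, message: string): string {
  return `[${timestamp.toISOString()}] ${message}`;
}

// The periodic summary, in the format quoted above.
function formatSummaryLine(c: SummaryCounts): string {
  return [
    `Active: ${c.active}/${c.maxAgents}`,
    `Completed: ${c.completed}/${c.total}`,
    `Failed: ${c.failed}`,
    `Blocked: ${c.blocked}`,
  ].join(" | ");
}
```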

**JSON progress file:**
Written to `~/.config/iloom-ai/looms/<epic-loom-id>/swarm-progress.json` on every state change:

```json
{
  "epicIssue": 42,
  "epicBranch": "issue-42-swarm-mode",
  "status": "running|completed|failed|paused",
  "startedAt": "2026-02-05T10:00:00Z",
  "updatedAt": "2026-02-05T10:15:30Z",
  "dag": {
    "nodes": [
      {
        "issue": 101,
        "title": "Add settings schema",
        "status": "completed|in_progress|blocked|ready|failed",
        "agentPid": null,
        "logFile": "/path/to/agent-logs/101.log",
        "attempts": 1,
        "prNumber": 145,
        "startedAt": "...",
        "completedAt": "..."
      }
    ],
    "edges": [
      { "from": 101, "to": 103 }
    ]
  },
  "stats": {
    "total": 7,
    "completed": 4,
    "inProgress": 1,
    "failed": 1,
    "blocked": 1,
    "ready": 0
  },
  "failures": [
    { "issue": 105, "reason": "Agent exited with code 1", "attempts": 2 }
  ]
}
```
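
For reference, the node and stats portions of the example could be typed roughly as follows, with `stats` derived from the node list so the two never drift apart. The type names and the choice of which fields are nullable are assumptions drawn from the example, not a confirmed schema.

```typescript
type NodeStatus = "completed" | "in_progress" | "blocked" | "ready" | "failed";

// Mirrors one entry of `dag.nodes`; nullability is assumed.
interface SwarmNode {
  issue: number;
  title: string;
  status: NodeStatus;
  agentPid: number | null;
  logFile: string;
  attempts: number;
  prNumber: number | null;
  startedAt: string | null;
  completedAt: string | null;
}

// Mirrors the `stats` object.
interface SwarmStats {
  total: number;
  completed: number;
  inProgress: number;
  failed: number;
  blocked: number;
  ready: number;
}

// Derive the counters from node statuses instead of tracking them separately.
function computeStats(nodes: ReadonlyArray<Pick<SwarmNode, "status">>): SwarmStats {
  const stats: SwarmStats = {
    total: nodes.length,
    completed: 0,
    inProgress: 0,
    failed: 0,
    blocked: 0,
    ready: 0,
  };
  for (const node of nodes) {
    switch (node.status) {
      case "completed":
        stats.completed += 1;
        break;
      case "in_progress":
        stats.inProgress += 1;
        break;
      case "failed":
        stats.failed += 1;
        break;
      case "blocked":
        stats.blocked += 1;
        break;
      case "ready":
        stats.ready += 1;
        break;
    }
  }
  return stats;
}
```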

This file is the contract between the supervisor and the VS Code extension. The extension watches it with `fs.watch()` and renders the swarm state.
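
Because a watcher may read the file while it is being written, one option is to write to a temporary file and rename it into place. The sketch below assumes Node.js; `writeProgressFile` is a hypothetical helper, and the write-then-rename approach is a suggestion rather than something this issue prescribes.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

/**
 * Serialize `progress` and replace `filePath` in one rename, so readers
 * see either the old file or the new one, never a partial write.
 */
function writeProgressFile(filePath: string, progress: unknown): void {
  const dir = path.dirname(filePath);
  fs.mkdirSync(dir, { recursive: true });
  // Temp file lives in the same directory so the rename stays on one filesystem.
  const tmpPath = path.join(dir, `.${path.basename(filePath)}.${process.pid}.tmp`);
  fs.writeFileSync(tmpPath, JSON.stringify(progress, null, 2) + "\n", "utf8");
  fs.renameSync(tmpPath, filePath);
}
```

One trade-off to keep in mind: on Linux and macOS `fs.watch()` follows the inode, so a watcher attached to the file itself may stop receiving events once the file is replaced by rename. Watching the parent directory and filtering on the filename is a common workaround.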

## Acceptance Criteria

- [ ] Agent failures trigger claim release, worktree cleanup, and retry
- [ ] Configurable retry count from settings (default 1)
- [ ] Failed blocking tasks correctly leave downstream tasks blocked
- [ ] Merge conflicts spawn resolver agent
- [ ] Resolver agent rebases and retries merge
- [ ] Conflict retries respect maxConflictRetries setting (default 3)
- [ ] Supervisor can resume from crashed state (reads Beads + PID file)
- [ ] Progress JSON file written on every state change
- [ ] Terminal output shows clear, structured progress
- [ ] Aggregate failure report at swarm completion
- [ ] Unit tests for failure/retry state machine
- [ ] Unit tests for resume logic with various Beads states

## Scope Boundaries

- Does NOT modify the core supervisor loop structure (extends it)
- Conflict resolution agent uses a simple prompt, not a full custom agent definition (can be enhanced later)

## Dependencies

- #562 (Supervisor loop) — this extends the supervisor with resilience and reporting


Supervisor resilience: failure handling, conflict resolution, and observability #563
