Summary
Add failure handling with configurable retries, a dedicated conflict resolution agent for merge failures, resume support for crashed supervisors, and structured progress reporting (log output + JSON file for VS Code extension).
Context
Part of the Autonomous Swarm Mode epic (#557). This issue extends the supervisor loop (#562) with production-grade resilience and observability.
Scope
1. Failure Handling & Retry
When an agent exits with non-zero status:
- Release the Beads claim (bd update <id> --status open)
- Clean up the failed worktree
- Increment failure counter for this task
- If failureCount < swarm.maxRetries:
  - Log "Retrying task #N (attempt X of Y)..."
  - Create fresh worktree, spawn new agent
- If failureCount >= swarm.maxRetries:
  - Mark as failed in Beads (bd close <id> --reason "Failed after N retries")
  - Log failure, continue with other tasks
  - Important: If failed task was blocking others, those remain blocked
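The retry rule above can be sketched as a small pure function. This is only an illustration: the function and type names are hypothetical, and the only parts taken from the issue are the failureCount < swarm.maxRetries comparison and the two bd commands. What N stands for in the close reason is an assumption.

```typescript
// Hypothetical names; only the comparison and the bd commands come from the issue.
type FailureDecision = "retry" | "fail";

/** `failureCount` is the task's counter after it was incremented for this failure. */
function decideAfterFailure(failureCount: number, maxRetries: number): FailureDecision {
  // Comparison taken literally from the issue text.
  return failureCount < maxRetries ? "retry" : "fail";
}

/** argv lists for the Beads commands to run, in order, after an agent failure. */
function beadsCommandsAfterFailure(
  taskId: string,
  failureCount: number,
  maxRetries: number,
): string[][] {
  // The claim is always released first, whether or not a retry follows.
  const commands: string[][] = [["bd", "update", taskId, "--status", "open"]];
  if (decideAfterFailure(failureCount, maxRetries) === "fail") {
    // Assumption: N in "Failed after N retries" is the configured retry limit.
    commands.push(["bd", "close", taskId, "--reason", `Failed after ${maxRetries} retries`]);
  }
  return commands;
}
```

Read literally, this comparison makes the first failure final when swarm.maxRetries is 1. If the setting is meant to count retries in addition to the first attempt, the comparison would need to be <= instead.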
At swarm completion, report aggregate failures:
Swarm complete: 6/7 tasks succeeded, 1 failed
Failed: #103 "Implement auth" (2 attempts, agent error)
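A formatter for this report might look like the sketch below. The FailedTask shape and the function name are hypothetical. The succeeded count is passed in explicitly because tasks left blocked by a failure are neither succeeded nor failed.

```typescript
// Hypothetical record for one failed task in the final report.
interface FailedTask {
  issue: number;
  title: string;
  attempts: number;
  reason: string;
}

/** Build the end-of-swarm report in the format shown above. */
function formatSwarmReport(succeeded: number, total: number, failed: FailedTask[]): string {
  const lines = [
    `Swarm complete: ${succeeded}/${total} tasks succeeded, ${failed.length} failed`,
  ];
  for (const task of failed) {
    lines.push(
      `Failed: #${task.issue} "${task.title}" (${task.attempts} attempts, ${task.reason})`,
    );
  }
  return lines.join("\n");
}
```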
2. Conflict Resolution Agent
When the merge queue encounters a conflict:
- Detect merge failure (GitHub API returns conflict status)
- Log "Merge conflict detected for PR #N. Spawning resolver..."
- Spawn a lightweight Claude Code agent in the child's worktree:
  - Agent prompt: "Rebase branch <child-branch> onto <epic-branch>, resolve all merge conflicts preserving the intent of both changes, then force-push."
  - Use il spin -p with a conflict-resolution-specific prompt (new template or env var)
  - Agent has context: the PR diff, the epic branch state
- After resolver exits:
  - Retry merge
  - If still conflicts and conflictRetryCount < swarm.maxConflictRetries: repeat
  - If exhausted: mark task as failed, skip
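One way to picture the merge-and-resolve loop is the sketch below. The merge attempt and the resolver run are injected as callbacks, so nothing here claims to be the real GitHub or il interface. All names are hypothetical, and it assumes conflictRetryCount counts resolver runs.

```typescript
type MergeResult = "merged" | "conflict";

// Hypothetical seams: the real supervisor would wire these to the merge queue and the agent spawner.
interface ConflictDeps {
  tryMerge: () => Promise<MergeResult>;
  runResolver: (prompt: string) => Promise<void>;
  log: (line: string) => void;
}

/** Prompt text as quoted in the issue, with the branch names filled in. */
function resolverPrompt(childBranch: string, epicBranch: string): string {
  return `Rebase branch ${childBranch} onto ${epicBranch}, resolve all merge conflicts preserving the intent of both changes, then force-push.`;
}

async function mergeWithResolver(
  prNumber: number,
  childBranch: string,
  epicBranch: string,
  maxConflictRetries: number,
  deps: ConflictDeps,
): Promise<"merged" | "failed"> {
  // Assumption: this counter tracks how many times the resolver has run.
  let conflictRetryCount = 0;
  while (true) {
    if ((await deps.tryMerge()) === "merged") return "merged";
    // Retries exhausted: the caller marks the task as failed and skips it.
    if (conflictRetryCount >= maxConflictRetries) return "failed";
    deps.log(`Merge conflict detected for PR #${prNumber}. Spawning resolver...`);
    await deps.runResolver(resolverPrompt(childBranch, epicBranch));
    conflictRetryCount += 1;
  }
}
```

Injecting the callbacks keeps the retry logic testable without a real repository or agent process.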
3. Resume Support
When supervisor starts and detects existing Beads state for this epic:
- Read Beads task statuses (bd list --json)
- Skip tasks marked as closed (already completed)
- For tasks marked in_progress:
  - Check PID file for running processes
  - If process still running: re-attach monitoring
  - If process dead: release claim, treat as failure (retry applies)
- For tasks marked open/ready: proceed normally
- Log "Resuming swarm: X completed, Y in progress, Z remaining"
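The resume decision for each task can be sketched as a classifier over the Beads status and the recorded PID. The type and function names are hypothetical, and the mapping from actions to the X/Y/Z counts in the log line is an assumption. The liveness check uses Node's process.kill(pid, 0), which sends no signal and only tests whether the process exists.

```typescript
type BeadsStatus = "closed" | "in_progress" | "open" | "ready";
type ResumeAction = "skip" | "reattach" | "release_and_retry" | "proceed";

/** True if a process with this PID currently exists. */
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: existence check only
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

function classifyForResume(
  status: BeadsStatus,
  pid: number | null,
  alive: (pid: number) => boolean = isProcessAlive,
): ResumeAction {
  switch (status) {
    case "closed":
      return "skip";
    case "in_progress":
      // No PID on record is treated like a dead process.
      return pid !== null && alive(pid) ? "reattach" : "release_and_retry";
    case "open":
    case "ready":
      return "proceed";
  }
}

/** Assumption: completed = skipped, in progress = re-attached, remaining = everything else. */
function resumeSummary(actions: ResumeAction[]): string {
  const count = (a: ResumeAction) => actions.filter((x) => x === a).length;
  const completed = count("skip");
  const inProgress = count("reattach");
  const remaining = actions.length - completed - inProgress;
  return `Resuming swarm: ${completed} completed, ${inProgress} in progress, ${remaining} remaining`;
}
```

A bare PID check can be fooled when the operating system reuses a PID for an unrelated process, so a real implementation may want to record something extra (for example the process start time) alongside the PID.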
4. Progress Reporting
Terminal output:
- On state change: log structured line with timestamp
- Periodic summary (every 30s or on change): "Active: 3/3 | Completed: 4/7 | Failed: 0 | Blocked: 0"
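The two kinds of terminal line could be produced by small formatters like these. The names are hypothetical. The state-change line layout is an invented illustration, since the issue only asks for a structured line with a timestamp, and the second number in "Active: 3/3" is assumed to be the concurrency limit.

```typescript
// Field names mirror the "stats" object of the progress file.
interface SwarmStats {
  total: number;
  completed: number;
  inProgress: number;
  failed: number;
  blocked: number;
  ready: number;
}

/** Periodic summary line in the format quoted above. */
function formatSummary(stats: SwarmStats, maxConcurrent: number): string {
  return (
    `Active: ${stats.inProgress}/${maxConcurrent}` +
    ` | Completed: ${stats.completed}/${stats.total}` +
    ` | Failed: ${stats.failed}` +
    ` | Blocked: ${stats.blocked}`
  );
}

/** One line per state change; this layout is an assumption, not part of the spec. */
function formatStateChange(at: Date, issue: number, from: string, to: string): string {
  return `${at.toISOString()} task=#${issue} ${from} -> ${to}`;
}
```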
JSON progress file:
Written to ~/.config/iloom-ai/looms/<epic-loom-id>/swarm-progress.json on every state change:
{
  "epicIssue": 42,
  "epicBranch": "issue-42-swarm-mode",
  "status": "running|completed|failed|paused",
  "startedAt": "2026-02-05T10:00:00Z",
  "updatedAt": "2026-02-05T10:15:30Z",
  "dag": {
    "nodes": [
      {
        "issue": 101,
        "title": "Add settings schema",
        "status": "completed|in_progress|blocked|ready|failed",
        "agentPid": null,
        "logFile": "/path/to/agent-logs/101.log",
        "attempts": 1,
        "prNumber": 145,
        "startedAt": "...",
        "completedAt": "..."
      }
    ],
    "edges": [
      { "from": 101, "to": 103 }
    ]
  },
  "stats": {
    "total": 7,
    "completed": 4,
    "inProgress": 2,
    "failed": 0,
    "blocked": 1,
    "ready": 0
  },
  "failures": [
    { "issue": 105, "reason": "Agent exited with code 1", "attempts": 2 }
  ]
}
This file is the contract between the supervisor and the VS Code extension. The extension watches it with fs.watch() and renders the swarm state.
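Because a separate process reads this file, it may help to pin the shape down as types and to write it so that a reader never sees a half-written document. The sketch below is one possible approach, not the project's implementation. The type and function names are hypothetical, the nullability of every field other than agentPid is an assumption, and write-to-temp-then-rename is a swapped-in technique that the issue does not prescribe.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

type NodeStatus = "completed" | "in_progress" | "blocked" | "ready" | "failed";

interface DagNode {
  issue: number;
  title: string;
  status: NodeStatus;
  agentPid: number | null;
  logFile: string;
  attempts: number;
  prNumber: number | null; // assumption: null until a PR exists
  startedAt: string | null; // assumption: null until the agent starts
  completedAt: string | null; // assumption: null until the task finishes
}

interface SwarmProgress {
  epicIssue: number;
  epicBranch: string;
  status: "running" | "completed" | "failed" | "paused";
  startedAt: string;
  updatedAt: string;
  dag: { nodes: DagNode[]; edges: { from: number; to: number }[] };
  stats: {
    total: number; completed: number; inProgress: number;
    failed: number; blocked: number; ready: number;
  };
  failures: { issue: number; reason: string; attempts: number }[];
}

/** Derive the "stats" object from node statuses so the two cannot drift apart. */
function computeStats(nodes: DagNode[]): SwarmProgress["stats"] {
  const count = (s: NodeStatus) => nodes.filter((n) => n.status === s).length;
  return {
    total: nodes.length,
    completed: count("completed"),
    inProgress: count("in_progress"),
    failed: count("failed"),
    blocked: count("blocked"),
    ready: count("ready"),
  };
}

/** Write to a sibling temp file, then rename it over the target in one step. */
function writeProgressAtomically(file: string, progress: SwarmProgress): void {
  fs.mkdirSync(path.dirname(file), { recursive: true });
  const tmp = `${file}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(progress, null, 2));
  fs.renameSync(tmp, file);
}
```

One trade-off of the rename approach: replacing the file can cause a watcher attached to the file itself to stop receiving events on some platforms, so a reader may be more robust if it watches the containing directory and filters by filename.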
Acceptance Criteria
- Agent failures trigger claim release, worktree cleanup, and retry
- Configurable retry count from settings (default 1)
- Failed blocking tasks correctly leave downstream tasks blocked
- Merge conflicts spawn resolver agent
- Resolver agent rebases and retries merge
- Conflict retries respect maxConflictRetries setting (default 3)
- Supervisor can resume from crashed state (reads Beads + PID file)
- Progress JSON file written on every state change
- Terminal output shows clear, structured progress
- Aggregate failure report at swarm completion
- Unit tests for failure/retry state machine
- Unit tests for resume logic with various Beads states
Scope Boundaries
- Does NOT modify the core supervisor loop structure (extends it)
- Conflict resolution agent uses a simple prompt, not a full custom agent definition (can be enhanced later)
Dependencies
- Swarm supervisor loop with sequential merge queue #562 (Supervisor loop): this issue extends the supervisor with resilience and reporting