Epic: Autonomous Swarm Mode for Epics #565

Draft
acreeger wants to merge 9 commits into main from feat/issue-557__autonomous-swarm-mode
Collaborator

acreeger commented Feb 5, 2026

PR for issue #557

This PR was created automatically by iloom.

Collaborator Author

acreeger commented Feb 5, 2026

Complexity Assessment for Issue #557

Analysis Plan

  • Scan issue description and comments for scope
  • Identify files affected (new and modified)
  • Estimate lines of code and architectural signals
  • Assess breaking changes and cross-cutting concerns
  • Perform quick codebase searches for existing patterns
  • Classify complexity and document findings

Complexity Assessment

Classification: COMPLEX

Metrics:

  • Estimated files affected: 14-18 (8-10 new + 6-8 modified)
  • Estimated lines of code: 1000-1250 LOC
  • Breaking changes: Yes (il start behavior change, new --swarm flag, user interaction changes)
  • Database migrations: No (progress file only, no schema changes)
  • Cross-cutting changes: Yes (epic detection → supervisor spawn → task claiming → worktree → agent → PR merge → progress tracking)
  • File architecture quality: Good (modular structure, but start.ts is 610 LOC with complex entry logic)
  • Architectural signals triggered: External constraints (Beads), Uncertain approach (conflict resolver agent pattern), New patterns (supervisor orchestration loop), Integration points (5+ - Beads, agents, GitHub/Linear, worktrees, progress tracking), Implementation unclear (how conflict resolution coordinates with parallel agents)
  • Overall risk level: High

Reasoning: This epic implements a sophisticated multi-agent orchestration system with external Beads integration, requiring coordination across 14+ files, significant new code (~1000 LOC), and critical architectural decisions about supervisor loops, conflict resolution agents, and parallel task claiming. The cross-cutting nature of the data flow (epic detection threading through supervisor → task claiming → worktree → agent → merge) and multiple integration points with uncertain implementation approaches trigger automatic COMPLEX classification despite moderate individual file impacts.

acreeger force-pushed the feat/issue-557__autonomous-swarm-mode branch from 6f1f1e6 to f9ecc92 on February 6, 2026 02:06

Collaborator Author

acreeger commented Feb 6, 2026

Code Review: Epic #557 — Autonomous Swarm Mode

Review of all 7 commits (#558-#564) against their respective issue requirements, ADR compliance, and project guidelines.

Critical / High Priority (Must Fix)

1. Silent auto-install in CI when autoInstallBeads: false #558

File: src/lib/BeadsManager.ts (ensureInstalled())

When autoInstall is false and the environment is non-interactive (CI/CD), the code falls through to install without consent — silently downloading and executing a remote script. Contradicts the purpose of having an autoInstallBeads setting.

Fix: In non-interactive mode with autoInstall=false, throw an error telling the user to install bd manually or enable autoInstallBeads.
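
A minimal sketch of the consent gate this fix describes, assuming the decision can be reduced to two booleans. The name decideBeadsInstall and its return values are hypothetical, not the actual BeadsManager API, and the interactive "prompt" branch is an assumption about the existing behavior.

```typescript
// Hypothetical helper; the real ensureInstalled() in BeadsManager.ts may be structured differently.
type InstallDecision = "install" | "prompt";

function decideBeadsInstall(autoInstall: boolean, isInteractive: boolean): InstallDecision {
  if (autoInstall) {
    return "install"; // the user opted in via settings
  }
  if (isInteractive) {
    return "prompt"; // assumption: an interactive session can ask for consent
  }
  // Non-interactive and not opted in: never download and run a remote script silently.
  throw new Error(
    "Beads (bd) is not installed. Install bd manually or enable autoInstallBeads."
  );
}

console.log(decideBeadsInstall(true, false)); // prints "install"
console.log(decideBeadsInstall(false, true)); // prints "prompt"
```

Throwing in the last branch keeps CI runs from executing a remote script that nobody approved.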

2. PATH not updated after Beads install — #558

File: src/lib/BeadsManager.ts (ensureInstalled())

After runInstallScript() completes, isInstalled() checks command -v bd which fails because the Node process PATH is stale. Install scripts add to ~/.local/bin but the running process doesn't pick that up.

Fix: Append likely install locations (~/.local/bin) to process.env.PATH before verification check.
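
One way to sketch the PATH refresh is a pure helper that appends missing directories. appendPathDirs is a hypothetical name, and the directories are passed as absolute paths because Node does not expand ~ on its own (a caller would typically build them from os.homedir()).

```typescript
// Hypothetical helper; not the actual BeadsManager code.
function appendPathDirs(currentPath: string, extraDirs: string[], delimiter = ":"): string {
  const entries = currentPath.split(delimiter).filter((entry) => entry.length > 0);
  const seen = new Set(entries);
  for (const dir of extraDirs) {
    if (!seen.has(dir)) {
      entries.push(dir); // append so existing entries keep their lookup priority
      seen.add(dir);
    }
  }
  return entries.join(delimiter);
}

const refreshed = appendPathDirs("/usr/bin:/bin", ["/home/dev/.local/bin", "/usr/bin"]);
console.log(refreshed); // prints "/usr/bin:/bin:/home/dev/.local/bin"
```

The result would be assigned to process.env.PATH before re-running the command -v bd check.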

3. findPRForBranch swallows all errors, causing premature task closure — #562

File: src/lib/SwarmSupervisor.ts:303-316

The catch block returns null for any error (network failures, rate limits, auth errors). When it returns null, checkCompletedAgents assumes no PR was created and permanently closes the Beads task — effectively losing that agent's work. Violates "DO NOT SWALLOW ERRORS".

Fix: Propagate unexpected errors. Only return null for "no PR found" scenarios, not transient failures.

4. GIT_REMOTE not set for swarm mode — #561

File: src/commands/ignite.ts:562-571

The template's STEP 5-SWARM uses {{GIT_REMOTE}} in git push {{GIT_REMOTE}} HEAD, but GIT_REMOTE is only set inside the draft PR block. In swarm mode without draft PR, the template renders git push HEAD (empty remote).

Fix: Set GIT_REMOTE inside the swarm mode block if not already set (default to origin or settings.mergeBehavior.remote).

5. EPIC_BRANCH not validated when SWARM_MODE active — #561

File: src/commands/ignite.ts:565-566

EPIC_BRANCH is optional, but the template uses it unconditionally in gh pr create --base {{EPIC_BRANCH}}. Missing value causes gh to fail with a confusing argument parsing error.

Fix: Throw Error('ILOOM_EPIC_BRANCH is required when ILOOM_SWARM_MODE is enabled') if missing.
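
Fixes 4 and 5 could be sketched together as a single validation step that runs before the template is rendered. buildSwarmTemplateVars and the SwarmTemplateVars shape are hypothetical, and the accepted values for ILOOM_SWARM_MODE ("1" or "true") are an assumption; only the variable names and the error message come from the review.

```typescript
// Hypothetical sketch; the real logic lives in src/commands/ignite.ts and may differ.
interface SwarmTemplateVars {
  SWARM_MODE: boolean;
  EPIC_BRANCH?: string;
  GIT_REMOTE?: string;
}

function buildSwarmTemplateVars(
  env: Record<string, string | undefined>,
  configuredRemote?: string
): SwarmTemplateVars {
  const flag = env["ILOOM_SWARM_MODE"];
  const swarmMode = flag === "1" || flag === "true"; // assumption: accepted truthy values
  if (!swarmMode) {
    return { SWARM_MODE: false };
  }
  const epicBranch = env["ILOOM_EPIC_BRANCH"];
  if (!epicBranch) {
    throw new Error("ILOOM_EPIC_BRANCH is required when ILOOM_SWARM_MODE is enabled");
  }
  // Fall back to "origin" so `git push {{GIT_REMOTE}} HEAD` never renders an empty remote.
  return { SWARM_MODE: true, EPIC_BRANCH: epicBranch, GIT_REMOTE: configuredRemote ?? "origin" };
}

console.log(buildSwarmTemplateVars({ ILOOM_SWARM_MODE: "1", ILOOM_EPIC_BRANCH: "epic/557" }));
```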

6. Broken prompt interpolation in confirmSwarmMode — #559

File: src/commands/start.ts (confirmSwarmMode)

The template literal Issue #${epicDetection.totalChildren > 0 ? ...} never interpolates the issue number. Output is "Issue #is an epic with 3 child issues...".

Fix: Pass issue number into confirmSwarmMode and use it in the template.

7. parseInt NaN for --max-agents #559

File: src/cli.ts:384

parseInt('abc') returns NaN, which passes through ?? (NaN is not null/undefined) and flows into SwarmSettings.maxConcurrent. User sees "Max NaN concurrent agents".

Fix: Replace parseInt with a custom parser that validates and throws on invalid input.
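
A minimal sketch of such a parser, assuming the 1-20 bounds reported in the follow-up fix comment. parseMaxAgents is a hypothetical name; the actual parser in src/cli.ts may differ.

```typescript
// Hypothetical option parser for --max-agents.
function parseMaxAgents(raw: string, min = 1, max = 20): number {
  const trimmed = raw.trim();
  // Reject anything that is not purely digits, so "abc", "3x" and "" never become NaN or 3.
  if (!/^\d+$/.test(trimmed)) {
    throw new Error(`--max-agents must be a whole number, got "${raw}"`);
  }
  const value = Number.parseInt(trimmed, 10);
  if (value < min || value > max) {
    throw new Error(`--max-agents must be between ${min} and ${max}, got ${value}`);
  }
  return value;
}

console.log(parseMaxAgents("4")); // prints 4
```

Validating at the CLI boundary means SwarmSettings.maxConcurrent only ever receives a finite, in-range integer.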


Medium Priority (Should Fix)

8. Potential infinite loop on permanently failed tasks — #563

File: src/lib/SwarmSupervisor.ts:256

When a task is permanently failed, releaseClaim() returns it to "ready" state in Beads. The code skips claiming it (permanentlyFailed check), but readyTasks.length > 0 prevents the exit condition from being met. The loop polls every 2s indefinitely.

Fix: Filter permanently failed tasks from the exit condition: const actionableReadyTasks = readyTasks.filter(t => !this.permanentlyFailed.has(t.id)).
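
The exit condition could be sketched as a pure predicate. ReadyTask and shouldSupervisorExit are hypothetical names, and requiring zero active agents is an assumption about the rest of the loop.

```typescript
// Hypothetical sketch of the supervisor's exit check.
interface ReadyTask {
  id: string;
}

function shouldSupervisorExit(
  readyTasks: ReadyTask[],
  activeAgentCount: number,
  permanentlyFailed: ReadonlySet<string>
): boolean {
  // Tasks that will never be claimed again must not keep the loop alive.
  const actionableReadyTasks = readyTasks.filter((task) => !permanentlyFailed.has(task.id));
  return actionableReadyTasks.length === 0 && activeAgentCount === 0;
}

const failed = new Set(["gh-101"]);
console.log(shouldSupervisorExit([{ id: "gh-101" }], 0, failed)); // prints true
console.log(shouldSupervisorExit([{ id: "gh-102" }], 0, failed)); // prints false
```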

9. Conflict resolver uses wrong working directory — #563

File: src/lib/SwarmSupervisor.ts:649-651

By the time a task reaches the merge queue, activeAgents has already been cleared. The cwd falls back to epicLoom.epicLoomPath instead of the child worktree. The resolver runs without the right branch context.

Fix: Store loomPath in MergeQueueEntry when enqueuing, or keep a separate taskLoomPaths map.

10. Log file streams never closed — #562

File: src/lib/SwarmSupervisor.ts:185-190

fs.createWriteStream is called per agent but never closed. In long-running swarms, this leaks file descriptors.

Fix: Store logStream in ActiveAgent, call logStream.end() when agent completes.

11. Supervisor loop can exit prematurely — #562

File: src/lib/SwarmSupervisor.ts:134-136

If a failed agent releases its claim, the task may not appear immediately in the next ready() call (Beads internal delay). Supervisor exits before the released task becomes ready again.

Fix: Check Beads DAG overall status to distinguish "all done" from "no tasks ready right now".

12. parseInt truncates mixed-format IDs — #562

File: src/lib/SwarmSupervisor.ts:332-335

parseInt("100-fix-login", 10) returns 100. Any task ID starting with digits but containing non-digit chars is silently truncated.

Fix: Use strict regex: if (/^\d+$/.test(taskId)) return parseInt(taskId, 10); return taskId;
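
The difference between the two behaviors is easy to see in isolation. The function below is a standalone sketch of the suggested fix, not the code from SwarmSupervisor.ts, and its name is hypothetical.

```typescript
// Standalone sketch: only all-digit IDs become numbers, everything else stays a string.
function parseIssueIdentifierStrict(taskId: string): number | string {
  if (/^\d+$/.test(taskId)) {
    return Number.parseInt(taskId, 10);
  }
  return taskId;
}

console.log(Number.parseInt("100-fix-login", 10)); // prints 100 (the silent truncation)
console.log(parseIssueIdentifierStrict("100-fix-login")); // prints "100-fix-login"
console.log(parseIssueIdentifierStrict("100")); // prints 100
```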

13. ready() and list() swallow JSON parse errors — #558, #563

Files: src/lib/BeadsManager.ts

Both methods catch JSON parse failures and return [] instead of throwing. Violates "DO NOT SWALLOW ERRORS". Supervisor sees zero tasks and idles indefinitely.

Fix: Throw BeadsError on parse failure. Callers already have try/catch for graceful handling.
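
A sketch of the throwing variant, assuming bd emits a JSON array of tasks. BeadsParseError and parseTaskList are hypothetical stand-ins; the real BeadsError in this PR carries the exit code and stderr, which are omitted here.

```typescript
// Hypothetical error type standing in for the project's BeadsError.
class BeadsParseError extends Error {
  constructor(public readonly command: string, public readonly rawOutput: string, detail: string) {
    super(`Failed to parse output of "bd ${command}": ${detail}`);
    this.name = "BeadsParseError";
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on older targets
  }
}

function parseTaskList(command: string, stdout: string): Array<{ id: string }> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(stdout);
  } catch (err) {
    // Surface the failure instead of returning [], which looks like "no tasks".
    throw new BeadsParseError(command, stdout, err instanceof Error ? err.message : String(err));
  }
  if (!Array.isArray(parsed)) {
    throw new BeadsParseError(command, stdout, "expected a JSON array");
  }
  return parsed as Array<{ id: string }>;
}

console.log(parseTaskList("ready", '[{"id":"gh-54"}]').length); // prints 1
```

With this shape, an empty array means Beads really reported zero tasks, so the supervisor can tell an idle DAG apart from a broken one.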

14. Sync idempotency check uses ready() instead of full list — #558

File: src/lib/BeadsSyncService.ts

bd ready only returns unblocked tasks. Blocked/claimed/completed tasks are missed, causing re-creation attempts on re-sync.

Fix: Use beadsManager.list() for the idempotency check, or rely solely on the "already exists" catch.

15. Catch-all in detectEpic swallows all errors — #559

File: src/commands/start.ts:735-738

Catches all errors and returns null. Network errors, auth failures, type errors all treated as "not an epic". Violates project guidelines.

Fix: Narrow catch to expected API/network errors. Re-throw unexpected errors.

16. N+1 dependency fetching — #559

File: src/lib/EpicDetector.ts:107-135

Each child issue's dependencies are fetched sequentially, so 10 children means 10 serial API calls.

Fix: Use Promise.allSettled for parallel fetching.
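
A minimal sketch of the parallel version. DependencyFetcher, DependencyResult and fetchAllDependencies are hypothetical names, and the fetcher is injected so the example does not assume any particular GitHub or Linear client.

```typescript
// Hypothetical types; the real EpicDetector API may look different.
type DependencyFetcher = (issueId: string) => Promise<string[]>;

interface DependencyResult {
  issueId: string;
  dependencies: string[];
  error?: string;
}

async function fetchAllDependencies(
  issueIds: string[],
  fetchDeps: DependencyFetcher
): Promise<DependencyResult[]> {
  // Start every request at once; allSettled waits for all of them without short-circuiting.
  const settled = await Promise.allSettled(issueIds.map((id) => fetchDeps(id)));
  return settled.map((outcome, index) => {
    const issueId = issueIds[index] as string;
    if (outcome.status === "fulfilled") {
      return { issueId, dependencies: outcome.value };
    }
    // Keep the failure visible to the caller instead of dropping it.
    return { issueId, dependencies: [], error: String(outcome.reason) };
  });
}
```

Unlike Promise.all, one failed request does not discard the results of the others, so the caller decides how to treat each per-issue error.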

17. SKIP_IMPLEMENTATION path not swarm-aware — #561

File: templates/prompts/issue-prompt.txt:796-798

In swarm mode, marking SKIP_IMPLEMENTATION skips PR creation and issue closing. The supervisor has no visibility into the agent's work.

Fix: Add swarm-mode conditional that still closes the issue and reports to supervisor.

18. "Human review" text contradicts swarm mode — #561

File: templates/prompts/issue-prompt.txt:218-220

The instruction "Each step requires explicit human approval" is still rendered in swarm mode, where agents are autonomous. It could cause agents to halt waiting for approval.

Fix: Wrap in {{#unless SWARM_MODE}} or add override.

19. Missing isEpic/swarmStatus in swarm metadata path — #560

File: src/lib/LoomManager.ts:~1203-1218

Normal path propagates isEpic/swarmStatus to metadata but finishSwarmLoom does not. Metadata inconsistency.

Fix: Add same propagation to swarm path, or document intentional omission.


Low Priority (Nice to Fix)

| # | Commit | Location | Issue |
|---|--------|----------|-------|
| 20 | #558 | BeadsManager.ts (runInstallScript) | curl\|bash from unpinned main branch, with no integrity verification |
| 21 | #558 | BeadsManager.ts (execBd) | Spreads full process.env to bd subprocess, a potential secret leak |
| 22 | #558 | BeadsSyncService.test.ts | Mock missing providerName and issuePrefix properties |
| 23 | #558 | plan-prompt.txt:280-283 | Duplicate numbered list items (two 1. entries) |
| 24 | #559 | start.ts:722 | Settings loaded twice (in execute and detectEpic) |
| 25 | #559 | start.ts:225-228 | --swarm on null epicDetection produces no log message |
| 26 | #560 | LoomManager.ts:~1186 | Missing extractIssueNumber for PR-type swarm looms |
| 27 | #562 | SwarmSupervisor.ts:372-376 | Force shutdown orphans child processes |
| 28 | #562 | SwarmSupervisor.ts:279-289 | Graceful shutdown has no timeout and can hang indefinitely |
| 29 | #562 | SwarmSupervisor.ts:304-310 | PR search by issueId in:title can match wrong PRs |
| 30 | #563 | BeadsManager.ts:256-264 | list() silently returns [] on parse failure (same as #13) |
| 31 | #563 | SwarmSupervisor.ts | maxRetries=1 means 1 total attempt, 0 retries; the naming is misleading |
| 32 | #563 | SwarmSupervisor.ts:844 | Progress file not written atomically; readers could see partial JSON |
| 33 | #564 | start-swarm.test.ts:303 | Dynamic import in test violates project guidelines |
| 34 | #564 | start.ts:763 | No range validation for --max-agents CLI flag |
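
For item 32, a common pattern is to write to a temporary file and rename it over the target. The sketch below assumes a POSIX filesystem, where a rename within one directory replaces the target atomically. writeJsonAtomic, the file name, and the progress shape are hypothetical.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper; the real progress writer in SwarmSupervisor.ts may differ.
function writeJsonAtomic(filePath: string, data: unknown): void {
  // The temp file sits next to the target so the rename stays on one filesystem.
  const tmpPath = `${filePath}.${process.pid}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(data, null, 2), "utf8");
  // Readers see either the old file or the complete new one, never a partial write.
  fs.renameSync(tmpPath, filePath);
}

const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "swarm-progress-"));
const progressPath = path.join(demoDir, "progress.json");
writeJsonAtomic(progressPath, { completed: 2, total: 5 });
console.log(fs.readFileSync(progressPath, "utf8"));
```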

Positive Observations

  • Clean dependency injection throughout — all core classes accept dependencies via constructor
  • Good error typing with BeadsError preserving exit code and stderr
  • Well-structured test coverage: 98 new tests across 7 commits
  • Clean separation of concerns: BeadsManager (CLI), BeadsSyncService (sync), EpicDetector (detection), SwarmSupervisor (orchestration)
  • Swarm fast path cleanly short-circuits without touching normal flow
  • Signal handler install/removal in try/finally prevents handler leaks
  • Template conditionals preserve non-swarm workflow unchanged

Collaborator Author

acreeger commented Feb 6, 2026

Code Review Fixes Applied

All 34 issues from the code review have been addressed in commit 7991e33.

Critical/High (7 fixes)

| # | Issue | Fix |
|---|-------|-----|
| 1 | Silent auto-install in CI | Throw error in non-interactive mode when autoInstallBeads: false |
| 2 | PATH stale after install | Append ~/.local/bin, ~/.cargo/bin, /usr/local/bin to PATH before verification |
| 3 | findPRForBranch swallows errors | Only return null for "no PR found"; re-throw network/auth/rate-limit errors |
| 4 | GIT_REMOTE not set in swarm mode | Default to settings.mergeBehavior.remote or origin |
| 5 | EPIC_BRANCH not validated | Throw if missing when SWARM_MODE is enabled |
| 6 | Broken interpolation in confirmSwarmMode | Fixed template literal to include issue number |
| 7 | parseInt NaN for --max-agents | Custom parser with NaN check + range validation (1-20) |

Medium (12 fixes)

| # | Issue | Fix |
|---|-------|-----|
| 8 | Infinite loop on failed tasks | Filter permanently failed tasks from exit condition |
| 9 | Conflict resolver wrong cwd | Store loomPath in MergeQueueEntry + taskLoomPaths map |
| 10 | Log streams never closed | Store logStream in ActiveAgent, close on completion/shutdown |
| 11 | Premature supervisor exit | Track pendingReleases counter in exit condition |
| 12 | parseInt truncates IDs | Strict regex ^\d+$ before parsing |
| 13+30 | ready()/list() swallow parse errors | Throw BeadsError on JSON parse failure |
| 14 | Sync idempotency uses ready() | Changed to list() for full task visibility |
| 15 | detectEpic catch-all | Narrow to expected API/network errors only |
| 16 | N+1 dependency fetching | Parallel via Promise.allSettled |
| 17 | SKIP_IMPLEMENTATION not swarm-aware | Added swarm conditional for issue close + status reporting |
| 18 | Human review contradicts swarm | Wrapped in {{#unless SWARM_MODE}}, autonomous override added |
| 19 | Missing metadata in swarm path | Propagate isEpic/swarmStatus in finishSwarmLoom |

Low (15 fixes)

Issues #20-34: Security comments for unpinned install, env filtering for subprocess, mock property fixes, duplicate list numbering, extractIssueNumber for PR-type looms, dynamic import replaced, settings double-load removed, --swarm on non-epic warning, force shutdown process cleanup, graceful shutdown 30s timeout, PR search branch matching, maxRetries semantics documented, atomic progress file writes.

Validation

  • 3908 tests pass (119 files, 0 failures)
  • TypeScript compile: clean
  • ESLint: clean

Collaborator Author

acreeger commented Feb 8, 2026

Implementation Complete - Beads ID format fix

Summary

Fixed bug where Beads CLI rejected task IDs because they were plain GitHub issue numbers (e.g., 54) instead of the required prefix-hash format (e.g., gh-54). Added toBeadsId() and fromBeadsId() helper functions and fixed all ID mapping throughout BeadsSyncService and SwarmSupervisor.

Changes Made

  • BeadsSyncService.ts: Added toBeadsId() and fromBeadsId() exported helpers; fixed idempotency check to compare using toBeadsId(child.id) instead of raw child.id; replaced inline regex patterns with toBeadsId() calls
  • SwarmSupervisor.ts: Imported fromBeadsId; updated parseIssueIdentifier() to strip gh- prefix before parsing; updated closeIssue() and findPRForBranch() to use raw issue IDs for GitHub API calls
  • BeadsSyncService.test.ts: Updated all mock Beads task IDs to use gh- prefix; added unit tests for toBeadsId() and fromBeadsId() helpers
  • SwarmSupervisor.test.ts: Updated all Beads task IDs and assertions to use gh- prefix format

Validation Results

  • Build: Passed
  • Tests: 3905 passed / 3928 total (23 skipped)
  • All 119 test files passing

Detailed Changes by File

src/lib/BeadsSyncService.ts

Changes: Added toBeadsId() and fromBeadsId() helper functions; fixed idempotency check and replaced inline regex patterns

  • toBeadsId(): Converts raw issue IDs to Beads format (e.g., '54' -> 'gh-54'). IDs already in prefix-hash format (e.g., 'ENG-123') pass through unchanged
  • fromBeadsId(): Strips gh- prefix to recover raw issue ID. Non-gh- prefixed IDs pass through unchanged
  • Fixed existingTaskIds.has() check to use toBeadsId(child.id) so re-sync correctly detects existing tasks
  • Replaced inline child.id.match(/^[a-z]+-/) ? child.id : `gh-${child.id}` with toBeadsId() calls
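
The behavior described for the two helpers can be sketched as follows. This is an illustration built from the description above, not the code in BeadsSyncService.ts; the case-insensitive prefix check is an assumption made so that 'ENG-123' passes through as described.

```typescript
// Illustrative versions of the helpers described above (gh- prefix hardcoded, as in this commit).
function toBeadsId(issueId: string): string {
  // Assumption: any "letters-" prefix counts as already being in prefix format.
  return /^[A-Za-z]+-/.test(issueId) ? issueId : `gh-${issueId}`;
}

function fromBeadsId(beadsId: string): string {
  // Only the gh- prefix is stripped; other prefixes (e.g. Linear keys) are left alone.
  return beadsId.startsWith("gh-") ? beadsId.slice("gh-".length) : beadsId;
}

console.log(toBeadsId("54")); // prints "gh-54"
console.log(toBeadsId("ENG-123")); // prints "ENG-123"
console.log(fromBeadsId("gh-54")); // prints "54"
```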

src/lib/SwarmSupervisor.ts

Changes: Fixed ID mapping for GitHub API calls that need raw issue numbers

  • parseIssueIdentifier(): Now calls fromBeadsId() to strip gh- prefix before parsing (e.g., 'gh-100' -> 100)
  • closeIssue(): Strips gh- prefix before calling gh issue close
  • findPRForBranch(): Strips gh- prefix before searching for PRs by issue number

src/lib/BeadsSyncService.test.ts

Changes: Updated all test expectations and added helper function tests

  • All mockBeadsManager.create return values and assertions now use gh- prefixed IDs
  • Existing tasks in idempotency test use gh- prefix
  • Dependency assertions use gh- prefixed IDs
  • Added describe('toBeadsId') with 2 test cases
  • Added describe('fromBeadsId') with 3 test cases

src/lib/SwarmSupervisor.test.ts

Changes: Updated all Beads task IDs and assertions to use gh- prefix

  • All createBeadsTask() calls use gh-100, gh-101, etc.
  • All syncService.syncEpicToBeads mock results use gh- prefixed beadsTaskId
  • Assertions for beadsManager.claim, beadsManager.close, beadsManager.releaseClaim updated
  • closeIssue assertion verifies raw ID (100) is passed to gh issue close

Collaborator Author

acreeger commented Feb 8, 2026

Analysis: Research beads.role in Beads CLI

  • Fetch issue context
  • Research Beads GitHub repo documentation
  • Search Beads source code for beads.role
  • Check our BeadsManager.ts integration
  • Document findings

Executive Summary

beads.role is a git config value (maintainer or contributor) that controls how Beads routes planning issues. The warning "warning: beads.role not configured. Run 'bd init' to set." is emitted to stderr by DetectUserRole() on every bd command that invokes role detection. Our bd init --quiet --skip-hooks --skip-merge-driver call never sets the role because --quiet suppresses prompts and we run in non-TTY contexts -- the interactive role prompt in promptContributorMode() only fires when stdin is a TTY and no --contributor/--team flag is passed.

The role does not affect our DAG operations (create, ready, claim, close, dep). It only affects Beads' issue routing system, which we bypass entirely since iloom syncs issues from GitHub/Linear to Beads itself.

Impact Summary

  • The warning is cosmetic -- it does not cause bd commands to fail (exit code 0)
  • It appears on stderr during commands that trigger DetectUserRole() (routing-related operations)
  • Fix: After bd init, set the role via either git config beads.role maintainer or bd config set beads.role maintainer in the project directory
  • Only 1 file affected: /Users/adam/Documents/Projects/iloom-cli/feat-issue-557__autonomous-swarm-mode/src/lib/BeadsManager.ts (the init() method at line 168)

Complete Technical Reference

Answers to Each Question

1. What is beads.role and what does it do?

beads.role determines whether Beads treats your repository context as a maintainer (push access, in-repo storage) or contributor (fork workflow, separate planning repo). It governs how DetermineTargetRepo() in internal/routing/routing.go routes planning issues.

For iloom's use case (DAG task management with BEADS_DIR outside the repo), the role is irrelevant -- we never use Beads' issue routing system. We sync issues ourselves and only use bd create, bd ready, bd claim, bd close, and bd dep.

2. Valid values

Exactly two values, validated in cmd/bd/config.go:

  • maintainer -- repo owner or team with push access
  • contributor -- fork/OSS contributor without direct push access

Any other value triggers a warning from bd doctor and DetectUserRole().

3. How is it configured?

Primary storage: Git config (git config beads.role <value>)

Set via:

  • git config beads.role maintainer -- direct git config
  • bd config set beads.role maintainer -- validates value before writing to git config (preferred)
  • bd init interactive prompt -- asks "Contributing to someone else's repo? [y/N]" when stdin is TTY and no --contributor/--team flag

Read via:

  • git config --get beads.role
  • bd config get beads.role

Fallback: If unset, DetectUserRole() falls back to deprecated URL-based heuristic (SSH = maintainer, HTTPS without creds = contributor), which triggers the warning.

Database fallback: cmd/bd/doctor/role.go also checks SQLite for legacy configs (pre-GH#1531).

4. Is it required or just a warning we can suppress?

Just a warning. The DetectUserRole() function emits to stderr via fmt.Fprintln(os.Stderr, "warning: beads.role not configured. Run 'bd init' to set.") and then falls back to URL heuristics. The bd command still completes with exit code 0. Our execBd() method captures stderr but only throws on non-zero exit codes.

However, the warning pollutes stderr which could be confusing in logs.

5. Can we set it programmatically during our bd init call?

bd init does not accept a --role flag. The role-related flags are:

  • --contributor -- runs contributor wizard (sets role to contributor)
  • --team -- runs team wizard (sets role to maintainer)

Neither is appropriate since they trigger full wizard flows.

Best approach: After bd init, call bd config set beads.role maintainer which validates and writes to git config. This can be done via our existing execBd() method:

await this.execBd(['config', 'set', 'beads.role', 'maintainer'], { cwd: this.projectPath })

Alternatively, run git config beads.role maintainer directly via execa in the project directory, but bd config set is preferred because it validates the value.

Codebase Research Findings

Affected Area: BeadsManager.init()

Entry Point: /Users/adam/Documents/Projects/iloom-cli/feat-issue-557__autonomous-swarm-mode/src/lib/BeadsManager.ts:168-179 - the init() method
Dependencies:

  • Uses: execBd() private method (line 298)
  • Used By: SwarmSupervisor and il start swarm flow

Why the warning occurs with our current init call

Our call: bd init --quiet --skip-hooks --skip-merge-driver

The --quiet flag suppresses output but does NOT set the role. The interactive role prompt (promptContributorMode() in cmd/bd/init.go) only fires when:

  1. In a git repository (yes)
  2. No --contributor or --team flag (correct, we don't pass them)
  3. stdin is a TTY (shouldPromptForRole() check)

Since our execBd() runs with encoding: 'utf8' and no stdio: 'inherit', stdin is NOT a TTY, so the prompt is skipped entirely. The role is never set.

Beads Source Files Investigated

  • cmd/bd/doctor/role.go -- CheckBeadsRole() validates role, returns warning status if unset
  • cmd/bd/config.go -- bd config set/get beads.role implementation, validates against allowlist
  • cmd/bd/init.go -- promptContributorMode() sets role interactively during init
  • internal/routing/routing.go -- DetectUserRole() emits the warning to stderr, falls back to URL heuristic
  • internal/beads/context.go -- ErrRoleNotConfigured error constant, RequireRole() method

bd init available flags (full list)

| Flag | Description |
|------|-------------|
| --prefix/-p | Issue prefix |
| --quiet/-q | Suppress output |
| --branch/-b | Git branch for beads commits |
| --backend | Storage backend (sqlite/dolt) |
| --force | Force re-initialization |
| --contributor | Run contributor wizard |
| --team | Run team wizard |
| --stealth | Enable stealth mode |
| --setup-exclude | Configure git exclude for forks |
| --skip-hooks | Skip git hooks installation |
| --skip-merge-driver | Skip merge driver setup |
| --from-jsonl | Import from existing issues.jsonl |

No --role flag exists.

Collaborator Author

acreeger commented Feb 8, 2026

Analysis: Beads CLI (bd) Prefix System Research

  • Fetch issue context
  • Research Beads CLI source code (GitHub repo)
  • Analyze bd init prefix behavior
  • Analyze bd create --id validation logic
  • Document findings

Executive Summary

The prefix mismatch error is a direct consequence of iloom's BeadsManager.init() not passing --prefix to bd init. Beads auto-detects the prefix from the current working directory name, producing iloom-test-project-github as the prefix. When BeadsSyncService.toBeadsId() then generates IDs like gh-54, Beads rejects them because they don't start with iloom-test-project-github-. The fix requires coordinating the prefix between bd init and task ID generation.

| Question | Answer |
|----------|--------|
| Should the prefix be a short token like gh or the full repo slug like iloom-test-project-github? | Using the full repo slug prevents collisions across repos but creates verbose IDs. Beads supports any prefix format; this is an iloom design choice. |
| Should we use --prefix on init, or --force on create? | --prefix on init is the clean approach. --force bypasses all validation and is a blunt workaround. |
| Where should the prefix be derived from (repo name, org/repo, custom)? | The prefix needs to be consistent between init and ID generation. It can be derived from the repo slug or be configurable via swarm settings. |

Impact Summary

  • 2 files requiring modification: BeadsManager.ts (init method) and BeadsSyncService.ts (toBeadsId/fromBeadsId)
  • The prefix choice must be consistent between init and create calls
  • Existing tests in BeadsSyncService.test.ts assume gh- prefix and will need updating if the prefix changes

Complete Technical Reference

Problem Space Research

Problem Understanding

When iloom syncs GitHub child issues into Beads for swarm mode DAG orchestration, bd create --id gh-54 fails because the Beads database was initialized with prefix iloom-test-project-github (auto-detected from the directory name), but the ID gh-54 starts with gh, not iloom-test-project-github.

Architectural Context

Beads enforces prefix consistency: all IDs in a database must share the same prefix. This is by design to prevent cross-project contamination when databases are shared. iloom stores Beads databases outside the repo at ~/.config/iloom-ai/beads/<project-hash>, so cross-project contamination is already prevented by the project-hash directory isolation.

Edge Cases Identified

  • Multiple repos with same name: Two repos named my-app in different orgs would hash to different BEADS_DIR paths (iloom uses project path hash), so prefix collision is not a concern at the filesystem level
  • Prefix with special characters: Beads normalizes prefixes by stripping trailing hyphens. The prefix iloom-test-project-github is valid but verbose
  • Linear integration: Linear IDs like ENG-123 already have a prefix format. If Linear is the issue tracker, the prefix should match (e.g., ENG)

Third-Party Research Findings

Beads CLI (Go, ~4 months old)

Source: GitHub source code at https://github.com/steveyegge/beads (cloned and analyzed directly)

How bd init Sets the Prefix

Prefix determination follows strict precedence (from cmd/bd/init.go:122-152):

  1. --prefix / -p flag: Highest priority. bd init --prefix gh sets prefix to gh
  2. config.yaml value: config.GetString("issue-prefix") from .beads/config.yaml
  3. Auto-detect from git history: Scans JSONL for existing issues, extracts prefix from first issue
  4. Directory name fallback: filepath.Base(cwd) - the basename of CWD

After determination, trailing hyphens are stripped: strings.TrimRight(prefix, "-").
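
The precedence can be modeled in a few lines. Beads itself is written in Go, so the TypeScript below is only an illustrative model of the order described above; resolveInitPrefix and the field names are hypothetical, and the history prefix is assumed to have been extracted already.

```typescript
// Illustrative model of the bd init prefix precedence (not Beads source code).
interface PrefixSources {
  flagPrefix?: string; // --prefix / -p
  configPrefix?: string; // issue-prefix from .beads/config.yaml
  historyPrefix?: string; // prefix of the first existing issue, if any
  cwd: string; // working directory used for the basename fallback
}

function resolveInitPrefix(src: PrefixSources): string {
  const basename = src.cwd.split("/").filter((part) => part.length > 0).pop() ?? "";
  // First non-empty source wins, in the documented order.
  const chosen = src.flagPrefix || src.configPrefix || src.historyPrefix || basename;
  return chosen.replace(/-+$/, ""); // strip trailing hyphens
}

console.log(resolveInitPrefix({ cwd: "/work/iloom-test-project-github" }));
// prints "iloom-test-project-github"
console.log(resolveInitPrefix({ flagPrefix: "gh-", cwd: "/work/iloom-test-project-github" }));
// prints "gh"
```

The first call reproduces the situation in this bug: with no flag, config, or history, the directory name becomes the prefix.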

The prefix is permanently stored in SQLite: store.SetConfig(ctx, "issue_prefix", prefix) at init.go:369.

How bd create --id Validates

From cmd/bd/create.go:458-483:

  1. Reads issue_prefix from database (or daemon RPC)
  2. Also reads allowed_prefixes for multi-prefix support
  3. Calls validation.ValidateIDPrefixAllowed(explicitID, dbPrefix, allowedPrefixes, forceCreate)

The validation (internal/validation/bead.go:139-167) checks:

  • If force is true, skip all validation
  • If dbPrefix is empty, skip validation
  • If id starts with dbPrefix + "-", pass
  • If id starts with any entry in allowedPrefixes + "-", pass
  • Otherwise: error with "prefix mismatch: database uses '{dbPrefix}-' but ID '{id}' doesn't match (use --force to override)"
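
The checks above translate into a short decision function. This is a TypeScript model of the described Go logic, not Beads source; validateIdPrefixAllowed here returns an error message or null, and it takes the allowed prefixes as an already-split array, both of which are simplifications.

```typescript
// Illustrative model of the prefix validation described above.
function validateIdPrefixAllowed(
  id: string,
  dbPrefix: string,
  allowedPrefixes: string[],
  force: boolean
): string | null {
  if (force) return null; // --force skips all validation
  if (dbPrefix === "") return null; // no prefix configured, nothing to check
  if (id.startsWith(`${dbPrefix}-`)) return null;
  if (allowedPrefixes.some((p) => p !== "" && id.startsWith(`${p}-`))) return null;
  return `prefix mismatch: database uses '${dbPrefix}-' but ID '${id}' doesn't match (use --force to override)`;
}

// The failing case from this bug: ID gh-54 against an auto-detected directory-name prefix.
console.log(validateIdPrefixAllowed("gh-54", "iloom-test-project-github", [], false));
// Registering gh as an allowed prefix makes the same ID pass.
console.log(validateIdPrefixAllowed("gh-54", "iloom-test-project-github", ["gh"], false)); // prints null
```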

ID Format Requirements

From internal/validation/bead.go:55-76 (ValidateIDFormat):

  • Must contain at least one hyphen
  • Format: prefix-hash or prefix-number (e.g., bd-a3f8e9, bd-42)
  • Hierarchical: prefix-hash.number (e.g., bd-a3f8.1)
  • Multi-hyphen prefixes supported: web-app-a3f8e9 extracts as prefix web-app

Querying the Current Prefix

Two methods:

  • bd info --json returns { "issue_prefix": "..." } among other fields
  • bd config get issue_prefix reads from database (not documented for this key but uses same mechanism)

Multi-Prefix Support

allowed_prefixes is a comma-separated config value stored in the database: store.GetConfig(ctx, "allowed_prefixes"). IDs matching any prefix in this list (plus the primary issue_prefix) pass validation.

--force Bypass

bd create --id gh-54 --force skips all prefix validation. This is the emergency escape hatch but is not the intended workflow.

Codebase Research Findings

Affected Area: BeadsManager + BeadsSyncService

Entry Point for the bug: BeadsManager.init() at /Users/adam/Documents/Projects/iloom-cli/feat-issue-557__autonomous-swarm-mode/src/lib/BeadsManager.ts:168-181

The init method calls bd init --quiet --skip-hooks --skip-merge-driver without --prefix. Since BEADS_DIR is set to a directory outside the repo, and cwd is set to projectPath, Beads auto-detects the prefix from filepath.Base(projectPath).

For repo iloom-ai/iloom-test-project-github, if the checkout is at a path ending in iloom-test-project-github, the prefix becomes iloom-test-project-github.

ID Generation: BeadsSyncService.toBeadsId() at /Users/adam/Documents/Projects/iloom-cli/feat-issue-557__autonomous-swarm-mode/src/lib/BeadsSyncService.ts:30-32

This function prefixes numeric GitHub IDs with gh- (e.g., 54 becomes gh-54). This hardcoded gh- prefix does not match the database prefix.

Dependencies:

  • BeadsManager.init() is called from SwarmSupervisor during epic startup
  • BeadsSyncService.syncEpicToBeads() calls BeadsManager.create() with IDs from toBeadsId()
  • toBeadsId() and fromBeadsId() are used throughout swarm code for ID conversion

Key Design Tension

The user's stated preference is IDs that include repo info to prevent collisions (e.g., iloom-test-project-github-54). However, iloom already isolates Beads databases per-project via a SHA-256 hash of the project path (BeadsManager.computeProjectHash()), so cross-project collision is already impossible at the filesystem level. The question is whether human-readable disambiguation is worth the verbosity.

Possible Approaches (for reference, not recommendation)

  1. Pass --prefix to bd init with a value that matches what toBeadsId() generates (e.g., gh)
  2. Change toBeadsId() to generate IDs that match whatever prefix Beads auto-detects
  3. Use --force on every bd create call to bypass validation
  4. Use allowed_prefixes to register gh as an additional allowed prefix after init

Affected Files

  • /Users/adam/Documents/Projects/iloom-cli/feat-issue-557__autonomous-swarm-mode/src/lib/BeadsManager.ts:168-181 - init() method does not pass --prefix to bd init
  • /Users/adam/Documents/Projects/iloom-cli/feat-issue-557__autonomous-swarm-mode/src/lib/BeadsSyncService.ts:30-32 - toBeadsId() hardcodes gh- prefix
  • /Users/adam/Documents/Projects/iloom-cli/feat-issue-557__autonomous-swarm-mode/src/lib/BeadsSyncService.ts:40-42 - fromBeadsId() hardcodes gh- stripping
  • /Users/adam/Documents/Projects/iloom-cli/feat-issue-557__autonomous-swarm-mode/src/lib/BeadsSyncService.test.ts - Tests assume gh- prefix format throughout

Integration Points

  • BeadsManager.init() is called by SwarmSupervisor which passes this.projectPath as CWD
  • BeadsSyncService uses BeadsManager.create() with IDs from toBeadsId()
  • fromBeadsId() is used to convert Beads task IDs back to issue tracker IDs for GitHub API calls
  • The bd CLI reads BEADS_DIR env var set by BeadsManager.execBd() for all operations

Beads CLI Reference Summary

| Command | Relevant Behavior |
| --- | --- |
| `bd init --prefix <value>` | Sets prefix explicitly, stored in SQLite |
| `bd init` (no `--prefix`) | Auto-detects: config.yaml > git history > directory name |
| `bd create --id <id>` | Validates ID starts with {prefix}- |
| `bd create --id <id> --force` | Bypasses prefix validation |
| `bd info --json` | Returns issue_prefix among other config |
| `bd config set allowed_prefixes "gh,other"` | Adds allowed prefixes for multi-prefix support |
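
To make the prefix rule above concrete, here is a minimal sketch of the check that `bd create --id` is described as performing. This is an illustration only, not the Beads implementation: the function name `isAcceptedTaskId` and its parameters are hypothetical, and the `iloom-cli` prefix stands in for whatever `bd init` would auto-detect.

```typescript
// Hypothetical helper mirroring the validation rule in the reference summary:
// an ID is accepted when it starts with "<prefix>-" for the database prefix
// or for any prefix registered via allowed_prefixes.
function isAcceptedTaskId(
  id: string,
  issuePrefix: string,
  allowedPrefixes: string[] = [],
): boolean {
  return [issuePrefix, ...allowedPrefixes].some((p) => id.startsWith(`${p}-`));
}

// Database initialised without --prefix (assumed auto-detected as "iloom-cli"),
// while toBeadsId() still generates "gh-" IDs:
const rejected = isAcceptedTaskId("gh-54", "iloom-cli");
// Approach 4: register "gh" as an additional allowed prefix.
const acceptedViaAllowList = isAcceptedTaskId("gh-54", "iloom-cli", ["gh"]);
// Approach 1: initialise the database with --prefix gh.
const acceptedViaInit = isAcceptedTaskId("gh-54", "gh");

console.log(rejected, acceptedViaAllowList, acceptedViaInit); // false true true
```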

@acreeger
Collaborator Author

acreeger commented Feb 8, 2026

Implementation Complete - Fix Beads task ID prefix system

Summary

Replaced the hardcoded gh- prefix in Beads task IDs with a repo-aware prefix derived from the repository name. The prefix is now set via bd init --prefix <repo-name> and queried at runtime via bd info --json.

Changes Made

  • BeadsManager.ts: Added prefix parameter to init() and new getPrefix() method
  • BeadsSyncService.ts: toBeadsId() and fromBeadsId() now accept a prefix parameter; sync queries prefix from BeadsManager.getPrefix() with caching
  • SwarmSupervisor.ts: Added beadsPrefix to EpicLoomContext; all fromBeadsId() calls use the dynamic prefix
  • start.ts: Fetches repo info via getRepoInfo() and passes repoInfo.name as the Beads prefix

Validation Results

  • Tests: 3935 passed / 3935 total (119 test files)
  • Typecheck/Build: Passed
  • Lint: Passed

Detailed Changes by File

Files Modified

src/lib/BeadsManager.ts

Changes: Added optional prefix parameter to init() which passes --prefix to bd init; added getPrefix() method that queries bd info --json for the issue_prefix field.

src/lib/BeadsSyncService.ts

Changes: toBeadsId(issueId, prefix) and fromBeadsId(beadsId, prefix) now require a prefix parameter instead of hardcoding gh-. The BeadsSyncService class caches the prefix via getBeadsPrefix() using BeadsManager.getPrefix().
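
A rough sketch of what prefix-parameterised conversion helpers could look like. The function names come from this comment, but the bodies are an assumption: the `<prefix>-<id>` format is inferred from the test IDs mentioned elsewhere in this thread (e.g. test-repo-100), and the behaviour for IDs without the expected prefix is a guess.

```typescript
// Assumed format: "<prefix>-<issueId>".
function toBeadsId(issueId: string | number, prefix: string): string {
  return `${prefix}-${issueId}`;
}

// Strip the known prefix rather than splitting on "-", because repo names
// such as "iloom-test-project" contain hyphens themselves.
// Assumption: IDs that do not carry the prefix are returned unchanged.
function fromBeadsId(beadsId: string, prefix: string): string {
  const marker = `${prefix}-`;
  return beadsId.startsWith(marker) ? beadsId.slice(marker.length) : beadsId;
}

const beadsId = toBeadsId(54, "iloom-test-project");
console.log(beadsId);                                    // iloom-test-project-54
console.log(fromBeadsId(beadsId, "iloom-test-project")); // 54
```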

src/lib/SwarmSupervisor.ts

Changes: Added beadsPrefix field to EpicLoomContext interface. Supervisor loads prefix via beadsManager.getPrefix() after init and uses it in all fromBeadsId() calls (closeIssue, findPRForBranch, parseIssueIdentifier). Passes prefix to beadsManager.init().

src/commands/start.ts

Changes: Imports getRepoInfo from github utils. Derives beadsPrefix from repoInfo.name and passes it to the supervisor via EpicLoomContext.beadsPrefix.

src/lib/BeadsManager.test.ts

Changes: Added tests for init() with prefix parameter and getPrefix() method.

src/lib/BeadsSyncService.test.ts

Changes: Updated all test data to use iloom-test-project prefix instead of gh-. Added test for prefix caching. Updated toBeadsId/fromBeadsId tests for new signature.

src/lib/SwarmSupervisor.test.ts

Changes: Added getPrefix mock to BeadsManager, updated all task IDs from gh-XXX to test-repo-XXX, added beadsPrefix to EpicLoomContext.

src/commands/start-swarm.test.ts

Changes: Added getRepoInfo mock, added getPrefix to BeadsManager mock, verified beadsPrefix is passed in EpicLoomContext.

@acreeger
Collaborator Author

acreeger commented Feb 8, 2026

Implementation: Simplify Beads prefix system

  • Remove getPrefix() method from BeadsManager.ts
  • Update BeadsSyncService.ts to accept prefix as constructor parameter
  • Update SwarmSupervisor.ts to use epicLoom.beadsPrefix directly
  • Update start.ts to pass prefix to BeadsSyncService constructor
  • Update BeadsManager.test.ts - remove getPrefix tests
  • Update BeadsSyncService.test.ts - update constructor, remove getPrefix mock
  • Update SwarmSupervisor.test.ts - remove getPrefix mock
  • Update start-swarm.test.ts - remove getPrefix mock
  • Run pnpm build and pnpm test to verify

@acreeger
Collaborator Author

acreeger commented Feb 8, 2026

Implementation Complete - Issue #557 (Simplify Beads Prefix) ✅

Summary

Removed stale getPrefix references from test mocks across 3 test files. The production code (BeadsManager.ts, BeadsSyncService.ts, SwarmSupervisor.ts, start.ts) was already correctly using a deterministic prefix pattern -- BeadsSyncService accepts prefix via constructor, SwarmSupervisor uses epicLoom.beadsPrefix, and start.ts computes it from repoInfo.name. No production code changes were needed.

Changes Made

  • src/lib/BeadsSyncService.test.ts: Replaced stale getPrefix cache test with a test verifying the constructor-provided prefix is used for all task IDs
  • src/lib/SwarmSupervisor.test.ts: Removed getPrefix from createMockBeadsManager() mock
  • src/commands/start-swarm.test.ts: Removed getPrefix from BeadsManager mock (2 occurrences)

Validation Results

  • ✅ Build: Passed
  • ✅ Tests: 3910 passed / 3933 total (23 skipped)
  • ✅ All 119 test files passed

📋 Detailed Changes by File

Files Modified

src/lib/BeadsSyncService.test.ts

Changes: Replaced the "should cache the prefix across multiple calls within the same sync" test (which asserted mockBeadsManager.getPrefix was called once) with "should use the prefix passed via constructor for all task IDs" (which verifies both create calls use the constructor-provided prefix).

src/lib/SwarmSupervisor.test.ts

Changes: Removed getPrefix: vi.fn().mockResolvedValue(TEST_PREFIX) from the mock BeadsManager factory at line 70.

src/commands/start-swarm.test.ts

Changes: Removed all getPrefix: vi.fn().mockResolvedValue('iloom-test-project') entries from the BeadsManager mock (lines 62 and 476).

Dependencies Added

None

@acreeger
Collaborator Author

acreeger commented Feb 8, 2026

Implementation Complete - Add user-visible logging to Beads DAG sync

Summary

Added user-visible logger.info() calls to BeadsSyncService.syncEpicToBeads() so users can see the DAG setup process. Changed existing logger.debug() calls at key milestones to logger.info() and removed the redundant summary line from SwarmSupervisor since BeadsSyncService now handles the final summary.

Changes Made

  • src/lib/BeadsSyncService.ts: Changed open child issue count from debug to info, added per-task creation log, added per-dependency link log, updated final summary to show "DAG ready: N tasks, M dependencies" format with proper singular/plural
  • src/lib/SwarmSupervisor.ts: Removed redundant "Synced N tasks" log line (now handled by BeadsSyncService's "DAG ready" summary)
  • src/lib/BeadsSyncService.test.ts: Added 5 new tests verifying user-visible logging for child count, task creation, dependency linking, DAG summary, and singular forms

Validation Results

  • Tests: 3915 passed / 3915 total (22 BeadsSyncService tests including 5 new)
  • Build: Passed
  • No lint issues

Detailed Changes by File

Files Modified

src/lib/BeadsSyncService.ts

Changes: Upgraded 3 debug log calls to info-level user-visible output, added 1 new dependency log line

  • Line 87: logger.info(' Found N open child issue(s)') - shows count after filtering closed issues
  • Line 122: logger.info(' Creating task: #ID - Title') - shows each task as it's created
  • Line 155: logger.info(' Linking dependency: #X depends on #Y') - shows each dependency link
  • Lines 180-183: logger.info(' DAG ready: N tasks, M dependencies') - final summary with proper singular/plural
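
For reference, the singular/plural handling in the final summary can be sketched as below. The helper names and the exact singular wording ("1 task", "1 dependency") are assumptions; only the "DAG ready: N tasks, M dependencies" shape is taken from this comment.

```typescript
// Hypothetical helper: picks the singular form only when count is exactly 1.
function pluralize(count: number, singular: string, plural: string): string {
  return `${count} ${count === 1 ? singular : plural}`;
}

// Builds the summary line in the "DAG ready: N tasks, M dependencies" format.
function formatDagSummary(taskCount: number, dependencyCount: number): string {
  const tasks = pluralize(taskCount, "task", "tasks");
  const deps = pluralize(dependencyCount, "dependency", "dependencies");
  return `DAG ready: ${tasks}, ${deps}`;
}

console.log(formatDagSummary(6, 5)); // DAG ready: 6 tasks, 5 dependencies
console.log(formatDagSummary(1, 1)); // DAG ready: 1 task, 1 dependency
console.log(formatDagSummary(2, 0)); // DAG ready: 2 tasks, 0 dependencies
```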

src/lib/SwarmSupervisor.ts

Changes: Removed redundant summary line after sync (line 215 previously: logger.info('Synced N tasks...'))

src/lib/BeadsSyncService.test.ts

Changes: Added logger import and 5 new test cases

  • should log user-visible output for child issue count
  • should log user-visible output for each created task
  • should log user-visible output for dependencies
  • should log DAG summary with correct counts
  • should use singular forms when counts are 1

@acreeger
Copy link
Collaborator Author

acreeger commented Feb 9, 2026

Implementation: Fix swarm mode worktree reuse bug

  • Analyze bug: reuseIloom() doesn't check swarmMode, launches IDE/terminal/dev server/Claude
  • Fix reuseIloom() to add swarm mode fast path (skip color sync, issue status, launching)
  • Add test: reusing worktree with swarmMode: true should not launch components
  • Add test: reusing worktree with swarmMode: true should write metadata with swarmAgent flag
  • Run build and tests to validate

@acreeger
Collaborator Author

acreeger commented Feb 9, 2026

Implementation Complete - Issue #557 (Swarm Mode Reuse Path Fix)

Summary

Fixed the bug where SwarmSupervisor child looms launched IDE, terminals, and dev servers when reusing existing worktrees in swarm mode. The reuseIloom() method in LoomManager.ts now checks for swarmMode and calls finishSwarmLoomReuse() to skip all interactive/visual components, matching the behavior of the new worktree creation path.

Changes Made

  • src/lib/LoomManager.ts: Added finishSwarmLoomReuse() private method that mirrors finishSwarmLoom() for reused worktrees (skips color sync, issue status, IDE, terminal, dev server, Claude launch). Added swarm mode check in reuseIloom() before the launch sequence.
  • src/lib/LoomManager.test.ts: Added 3 tests for swarm mode reuse: verifying no workspace components are launched, metadata includes swarmAgent flag, and environment files are still copied.
  • src/lib/SwarmSupervisor.ts: Added beadsPrefix to EpicLoomContext, uses fromBeadsId() to strip prefix for GitHub operations (closeIssue, findPRForBranch, parseIssueIdentifier), added duplicate progress line suppression.
  • src/lib/SwarmSupervisor.test.ts: Updated task IDs to use prefixed Beads format (test-repo-100), added test for progress line deduplication.
  • Other files (BeadsManager, BeadsSyncService, start-swarm, start): Related swarm infrastructure changes.

Validation Results

  • Tests: 3919 passed / 3942 total (23 skipped)
  • Build: Passed
  • All test files: 119 passed

Detailed Changes by File

Files Modified

src/lib/LoomManager.ts

Changes: Added swarm mode fast path in reuseIloom() at line ~1417, and finishSwarmLoomReuse() private method (lines 1273-1354) that writes metadata with swarmAgent: true and neutral color, then returns a Loom without launching any interactive components.

src/lib/LoomManager.test.ts

Changes: Added 3 test cases under the existing worktree reuse describe block:

  • should not launch any workspace components when reusing worktree in swarm mode
  • should write metadata with swarmAgent flag when reusing worktree in swarm mode
  • should still copy environment files when reusing worktree in swarm mode

src/lib/SwarmSupervisor.ts

Changes: Added beadsPrefix field to EpicLoomContext, uses it in closeIssue(), findPRForBranch(), and parseIssueIdentifier() via fromBeadsId(). Added progress line dedup logic.

src/lib/SwarmSupervisor.test.ts

Changes: Updated all task IDs from '100' to 'test-repo-100' format to match Beads prefix behavior. Added dedup progress test.

@acreeger
Copy link
Collaborator Author

acreeger commented Feb 9, 2026

Analysis: SwarmSupervisor Bugs (Issue #557)

  • Analyze Bug 1: Swarm stops after first task completes
  • Analyze Bug 2: PR search hits GitHub API rate limit
  • Document findings

Executive Summary

Two bugs in SwarmSupervisor.orchestrate() cause the swarm to stop after completing just 1 of 6 tasks. Bug 1 is a premature loop exit: the exit condition only checks whether bd ready returns empty, without verifying that uncompleted tasks remain in the DAG. Bug 2 amplifies this: the PR search hits GitHub's rate limit (exhausted by the child agent), causing findPRForBranch to throw, which is caught and swallowed -- the task gets marked "completed without PR" and its PR is never merged.

HIGH/CRITICAL Risks

  • Premature loop exit loses work: The exit condition at SwarmSupervisor.ts:279 does not verify all tasks are finished. If bd ready returns empty for even one iteration (propagation delay, closeTask failure, or Beads state inconsistency), the loop breaks and the swarm exits with uncompleted tasks.
  • Silent closeTask failure blocks dependents: closeTask() at line 760-765 swallows all errors. If bd close fails, dependent tasks never get unblocked in the DAG, and the swarm exits thinking there is nothing left to do.
  • Rate limit causes missed PR merge: When findPRForBranch fails due to rate limits, the task is marked "completed without PR" -- but the agent likely DID create a PR. That PR is never merged into the epic branch.

Impact Summary

  • 2 methods need modification in /src/lib/SwarmSupervisor.ts
  • 1 method needs rate-limit retry logic: findPRForBranch()
  • Loop exit condition at line 279 needs to incorporate total task completion check
  • Tests in /src/lib/SwarmSupervisor.test.ts need updates for new behaviors

Complete Technical Reference

Problem Space Research

Problem Understanding

The swarm supervisor orchestrates N child agents working on an epic's sub-issues. It relies on Beads (a DAG task tracker) to determine which tasks are ready (unblocked). When a task completes, closing it in Beads should unblock dependent tasks, making them appear in subsequent bd ready calls. Two failures conspire to break this flow.

Architectural Context

The supervisor loop follows a poll-based model: each iteration queries bd ready, claims/spawns agents, checks for completions, processes merges, then checks exit. The exit condition assumes that if bd ready returns empty AND no agents are running AND no merges pending, all work is done. This assumption is invalid when tasks exist in the DAG but aren't yet surfaced by bd ready.

Codebase Research Findings

Bug 1: Premature Loop Exit

Root Cause: The exit condition at /src/lib/SwarmSupervisor.ts:278-281 is:

const actionableReadyTasks = readyTasks.filter(t => !this.permanentlyFailed.has(t.id))
if (actionableReadyTasks.length === 0 && this.activeAgents.size === 0 && this.mergeQueue.length === 0 && this.pendingReleases === 0) {
    break
}

This checks only the current bd ready output. It does not check whether uncompleted tasks remain in the DAG. After task 54 completes:

  1. checkCompletedAgents() processes task 54's completion (line 462-508)
  2. closeTask() is called (line 501), which calls beadsManager.close() to mark it done in Beads
  3. On the next iteration, bd ready is called (line 241). If the 5 dependent tasks haven't been unblocked yet (due to Beads propagation timing, or if closeTask silently failed), bd ready returns []
  4. Exit condition is met: readyTasks empty, no active agents, no merge queue, no pending releases -> loop breaks

Contributing factor: closeTask() at line 760-765 wraps beadsManager.close() in try/catch and only warns on failure. If the Beads close command fails, dependent tasks are never unblocked, but the supervisor doesn't know.

The correct exit condition should verify that result.completed + result.failed + this.activeAgents.size + this.mergeQueue.length >= result.totalTasks, i.e., all tasks have been accounted for. Only then is it safe to exit.

Bug 2: Rate-Limited PR Search

Entry Point: /src/lib/SwarmSupervisor.ts:775-809 - findPRForBranch()

Mechanism:

  1. Child agent (task 54) runs il spin -p, which internally makes many GitHub API calls during its work session
  2. When agent completes, supervisor calls findPRForBranch() at line 482
  3. findPRForBranch() executes gh pr list --search "is:pr is:open 54 in:title" (line 782-783)
  4. This search uses GitHub's GraphQL API, which shares the rate limit budget with the child agent
  5. GitHub returns "API rate limit already exceeded" with exit code 1
  6. The error is caught at line 795-808 and re-thrown (it doesn't match "no pull requests" patterns)
  7. The re-thrown error is caught at line 483-486 in checkCompletedAgents(), logged, and prNumber stays null
  8. Falls through to line 498-503: task marked "completed without PR", never merged

Sub-issues with the search query:

executeGhCommand behavior (/src/utils/github.ts:13-32): No retry logic. No rate-limit detection. Throws on any non-zero exit code from gh. The 30-second timeout at line 19 is the only protection.

Affected Files

  • /src/lib/SwarmSupervisor.ts:278-281 - Loop exit condition: too aggressive, doesn't check for uncompleted tasks remaining in DAG
  • /src/lib/SwarmSupervisor.ts:760-765 - closeTask(): swallows errors that could prevent dependent task unblocking
  • /src/lib/SwarmSupervisor.ts:775-809 - findPRForBranch(): no rate-limit retry, imprecise title search query
  • /src/lib/SwarmSupervisor.ts:480-504 - checkCompletedAgents(): catches PR search failure and proceeds as if no PR exists
  • /src/lib/SwarmSupervisor.test.ts - Tests need cases for: premature exit with remaining tasks, rate-limited PR search retry

Similar Patterns Found

  • The pendingReleases counter (lines 166, 383-385, 543) was introduced to prevent premature exit after claim releases. This is the same class of problem -- the exit condition needs awareness of "tasks that exist but aren't yet visible in bd ready." The pendingReleases approach is a band-aid; checking total task completion would be more robust.

Edge Cases Identified

  • Beads propagation delay: After bd close, there may be a brief window where bd ready hasn't re-evaluated the DAG. A single poll cycle returning empty would cause premature exit.
  • All tasks blocked by failed task: If task A fails permanently and tasks B-F depend on A, they will never become ready. The current code handles this via permanentlyFailed filter, but does NOT detect that B-F are permanently stuck. The loop would exit correctly (no actionable tasks), but result.failed would only count 1, not 6.
  • Rate limit recovery: GitHub rate limits reset on a per-hour window. A retry with exponential backoff (e.g., wait 60s) could recover without user intervention, but would slow the loop significantly for one task.
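
The "all tasks blocked by failed task" edge case could in principle be detected by walking the dependency graph. The sketch below is not part of SwarmSupervisor; the function name and the in-memory dependency map are hypothetical, and a real implementation would need to obtain the dependency data from Beads.

```typescript
// dependsOn maps a task ID to the IDs it depends on.
// Returns every task that can never become ready because it depends,
// directly or transitively, on a permanently failed task.
function findPermanentlyBlocked(
  dependsOn: Map<string, string[]>,
  failed: Set<string>,
): Set<string> {
  const blocked = new Set<string>();
  let changed = true;
  // Repeat until a full pass adds nothing new (fixed point).
  while (changed) {
    changed = false;
    dependsOn.forEach((requirements, taskId) => {
      if (failed.has(taskId) || blocked.has(taskId)) return;
      const stuck = requirements.some((dep) => failed.has(dep) || blocked.has(dep));
      if (stuck) {
        blocked.add(taskId);
        changed = true;
      }
    });
  }
  return blocked;
}

// Task A fails permanently; B, D, E and F depend on A, and C depends on B.
const exampleDeps = new Map<string, string[]>([
  ["A", []],
  ["B", ["A"]],
  ["C", ["B"]],
  ["D", ["A"]],
  ["E", ["A"]],
  ["F", ["A"]],
]);
const stuckTasks = findPermanentlyBlocked(exampleDeps, new Set(["A"]));
console.log(stuckTasks.size); // 5, so 6 tasks in total will never complete
```

Counting these tasks alongside the failed one would let result.failed reflect all 6 tasks rather than 1.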

Medium Severity Risks

@acreeger
Collaborator Author

acreeger commented Feb 9, 2026

Combined Analysis & Plan - Issue #557 (Swarm Bugs)

Executive Summary

Three bugs in SwarmSupervisor: (1) the orchestration loop exits after the first task completes because the exit condition doesn't verify all tasks are done -- blocked tasks won't appear in bd ready but aren't finished; (2) findPRForBranch uses imprecise --search with title matching instead of exact --head <branch> matching; (3) no retry logic for rate-limited GitHub API calls. Additionally, closeTask() silently swallows errors, preventing dependent tasks from being unblocked.

HIGH/CRITICAL Risks

  • Premature exit loses work: With 6 tasks where task 3 depends on task 1, completing task 1 and closing it in Beads can cause bd ready to return empty for one cycle (before task 3 becomes unblocked), triggering the exit condition with 5 tasks remaining.
  • closeTask swallowing errors blocks DAG: If bd close fails in processMergeQueue, dependent tasks are never unblocked and the swarm silently stalls or exits incomplete.

Implementation Overview

High-Level Execution Phases

  1. Fix premature exit: Add result.completed + result.failed >= result.totalTasks check to exit condition
  2. Fix closeTask error handling: Propagate errors from closeTask in critical paths (merge queue), keep non-fatal in failure paths
  3. Store branch names: Add taskBranchNames map and populate from loom.branch in claimAndSpawnAgent
  4. Fix findPRForBranch: Switch from --search to --head <branch-name>
  5. Add rate limit retry: Create executeGhCommandWithRetry wrapper with exponential backoff for rate-limited calls
  6. Update tests: Cover all three fixes

Quick Stats

  • 2 files to modify (SwarmSupervisor.ts, SwarmSupervisor.test.ts)
  • 1 file to modify for retry utility (github.ts)
  • 0 new files
  • Dependencies: None

Complete Analysis & Implementation Details

Research Findings

Problem Space

  • Problem: SwarmSupervisor has three bugs that cause premature exit, imprecise PR matching, and fragility under rate limits.
  • Architectural context: SwarmSupervisor orchestrates headless agents via Beads DAG; all three bugs are in the supervisor loop and its helper methods.
  • Edge cases: (1) All tasks blocked by a single dependency that just closed; (2) PR branch name not matching issue ID format; (3) GitHub returning 403 rate limit during critical merge flow.

Codebase Research

  • Exit condition: SwarmSupervisor.ts:278-281 -- checks actionableReadyTasks.length === 0 && activeAgents.size === 0 && mergeQueue.length === 0 && pendingReleases === 0 but doesn't check if all tasks are actually done.
  • closeTask: SwarmSupervisor.ts:760-765 -- catches all errors from beadsManager.close() and logs warning only. Called from processMergeQueue:580, handleMergeConflict:665, handleAgentFailure:550, and checkCompletedAgents:501.
  • findPRForBranch: SwarmSupervisor.ts:775-809 -- uses --search with rawId in:title. The _epicLoom param is unused (prefixed with underscore).
  • Branch name availability: claimAndSpawnAgent at line 392-406 calls loomManager.createIloom() which returns a Loom object with .branch field, but only .path is stored in taskLoomPaths.
  • executeGhCommand: github.ts:13-32 -- thin wrapper with no retry logic.
  • ActiveAgent interface: SwarmSupervisor.ts:29-40 -- has no branch name field.

Affected Files

  • /src/lib/SwarmSupervisor.ts:141-173 - Private fields (add taskBranchNames map)
  • /src/lib/SwarmSupervisor.ts:278-281 - Exit condition in run() loop
  • /src/lib/SwarmSupervisor.ts:409 - claimAndSpawnAgent where loom.path is stored (also store loom.branch)
  • /src/lib/SwarmSupervisor.ts:566-607 - processMergeQueue where closeTask errors need to propagate
  • /src/lib/SwarmSupervisor.ts:760-765 - closeTask method (split into critical/non-critical paths)
  • /src/lib/SwarmSupervisor.ts:775-809 - findPRForBranch method
  • /src/lib/SwarmSupervisor.ts:732-736 - mergePR (add retry)
  • /src/utils/github.ts:13-32 - executeGhCommand (add retry wrapper)
  • /src/lib/SwarmSupervisor.test.ts - Tests for all three fixes

Medium Severity Risks

  • Retry delay in tests: Rate limit retry with real delays will slow tests; need to ensure sleep mock covers the retry utility.

Implementation Plan

Automated Test Cases to Create

Test File: /src/lib/SwarmSupervisor.test.ts (MODIFY)

describe('premature exit fix', () => {
  it('should NOT exit when bd ready returns empty but tasks remain incomplete', async () => {
    // Setup: 3 tasks synced, task 1 ready, tasks 2+3 blocked on task 1
    // Cycle 1: bd ready returns [task1], claim+spawn, agent completes
    // Cycle 2: bd ready returns [] (task1 just closed, task2/3 not yet unblocked)
    // Verify: loop does NOT exit. Cycle 3: bd ready returns [task2, task3]
    // result.completed should be 3, not 1
  })
})

describe('closeTask error propagation', () => {
  it('should propagate closeTask error in processMergeQueue and count as failure', async () => {
    // Setup: agent succeeds, PR found, merge succeeds, bd close throws
    // Verify: error is thrown/caught at processMergeQueue level, task counted as failed
  })

  it('should still log-and-continue for closeTask errors in handleAgentFailure', async () => {
    // Setup: agent fails, exhausts retries, bd close throws
    // Verify: still counted as failed, no crash
  })
})

describe('findPRForBranch uses --head', () => {
  it('should search PR by head branch name instead of title search', async () => {
    // Verify executeGhCommand called with ['pr', 'list', '--state', 'open', '--json', 'number,headRefName', '--head', '<branch-name>']
  })
})

describe('rate limit retry', () => {
  it('should retry on rate limit error with backoff', async () => {
    // Setup: executeGhCommand fails with rate limit error, then succeeds
    // Verify: retried and succeeded
  })

  it('should give up after max retries', async () => {
    // Setup: executeGhCommand keeps failing with rate limit
    // Verify: eventually throws
  })
})

Test File: /src/utils/github.test.ts (CHECK IF EXISTS, otherwise add to SwarmSupervisor.test.ts)

describe('executeGhCommandWithRetry', () => {
  // Test retry on rate limit (403 + "rate limit")
  // Test no retry on non-rate-limit errors
  // Test max retries respected
})

Files to Modify

1. /src/utils/github.ts:13-32

Change: Add executeGhCommandWithRetry function that wraps executeGhCommand with exponential backoff for rate limit errors (HTTP 403 or "rate limit" in error message).

Implementation guidance:
// Add after executeGhCommand (line 32).
// Requires at top of file: import { setTimeout as sleep } from 'timers/promises'
export async function executeGhCommandWithRetry<T = unknown>(
  args: string[],
  options?: { cwd?: string; timeout?: number; maxRetries?: number; baseDelayMs?: number }
): Promise<T> {
  const maxRetries = options?.maxRetries ?? 3
  const baseDelayMs = options?.baseDelayMs ?? 5000

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await executeGhCommand<T>(args, options)
    } catch (error) {
      // 'rate limit' also matches "secondary rate limit" and "API rate limit" messages
      const isRateLimit = error instanceof Error &&
        (error.message.includes('rate limit') || error.message.includes('403'))

      // Rethrow non-rate-limit errors immediately, and rate-limit errors once retries are exhausted
      if (!isRateLimit || attempt >= maxRetries) throw error

      const delay = baseDelayMs * Math.pow(2, attempt) // exponential backoff: 5s, 10s, 20s with the defaults
      logger.warn(`GitHub rate limit hit, retrying in ${delay / 1000}s (attempt ${attempt + 1}/${maxRetries})...`)
      await sleep(delay)
    }
  }
  throw new Error('unreachable')
}

2. /src/lib/SwarmSupervisor.ts:11

Change: Import executeGhCommandWithRetry alongside executeGhCommand.

3. /src/lib/SwarmSupervisor.ts:172

Change: Add private taskBranchNames: Map<string, string> = new Map() field after taskLoomPaths.

4. /src/lib/SwarmSupervisor.ts:409

Change: After this.taskLoomPaths.set(task.id, loom.path), add this.taskBranchNames.set(task.id, loom.branch).

5. /src/lib/SwarmSupervisor.ts:278-281 (exit condition)

Change: Add a check that all tasks are accounted for before allowing exit. Replace the current exit condition with:

const allTasksAccountedFor = result.totalTasks === 0 || (result.completed + result.failed >= result.totalTasks)
const actionableReadyTasks = readyTasks.filter(t => !this.permanentlyFailed.has(t.id))
if (allTasksAccountedFor && actionableReadyTasks.length === 0 && this.activeAgents.size === 0 && this.mergeQueue.length === 0 && this.pendingReleases === 0) {
  break
}
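
To see why the extra check matters, the exit condition can be restated as a pure function and evaluated against the failing scenario (6 tasks, 1 completed, bd ready temporarily empty). This is a sketch for illustration: the `LoopState` shape and `shouldExit` name are hypothetical stand-ins for the supervisor's fields, not code from SwarmSupervisor.ts.

```typescript
// Hypothetical snapshot of the values the exit condition reads.
interface LoopState {
  totalTasks: number;
  completed: number;
  failed: number;
  actionableReady: number; // ready tasks not in permanentlyFailed
  activeAgents: number;
  mergeQueue: number;
  pendingReleases: number;
}

// Previous behaviour: exit as soon as nothing is ready, running, or queued.
function shouldExitOld(s: LoopState): boolean {
  return s.actionableReady === 0 && s.activeAgents === 0 &&
    s.mergeQueue === 0 && s.pendingReleases === 0;
}

// Proposed behaviour: additionally require every task to be accounted for.
function shouldExit(s: LoopState): boolean {
  const allTasksAccountedFor =
    s.totalTasks === 0 || s.completed + s.failed >= s.totalTasks;
  return allTasksAccountedFor && shouldExitOld(s);
}

// Task 54 just closed; its 5 dependents are not yet visible in bd ready.
const afterFirstTask: LoopState = {
  totalTasks: 6, completed: 1, failed: 0,
  actionableReady: 0, activeAgents: 0, mergeQueue: 0, pendingReleases: 0,
};
console.log(shouldExitOld(afterFirstTask)); // true  (premature exit)
console.log(shouldExit(afterFirstTask));    // false (keeps polling)
console.log(shouldExit({ ...afterFirstTask, completed: 6 })); // true
```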

6. /src/lib/SwarmSupervisor.ts:760-765 (closeTask)

Change: Make closeTask propagate errors by default. Add a swallow option parameter for non-critical call sites. Critical callers (processMergeQueue line 580, handleMergeConflict line 665) should let the error propagate. Non-critical callers (handleAgentFailure line 550, checkCompletedAgents line 501) should pass { swallow: true }.

Implementation guidance:
// Replace closeTask:
private async closeTask(taskId: string, reason?: string, options?: { swallow?: boolean }): Promise<void> {
  try {
    await this.beadsManager.close(taskId, reason)
  } catch (error) {
    const message = `Failed to close Beads task ${taskId}: ${error instanceof Error ? error.message : 'Unknown error'}`
    if (options?.swallow) {
      logger.warn(message)
      return
    }
    logger.error(message)
    throw error
  }
}

// Update call sites:
// Line 501 (checkCompletedAgents, no-PR path): closeTask(id, reason, { swallow: true })
// Line 550 (handleAgentFailure): closeTask(id, reason, { swallow: true })
// Line 580 (processMergeQueue): closeTask(id, reason) -- let it throw, caught by outer try/catch
// Line 665 (handleMergeConflict): closeTask(id, reason) -- let it throw, caught by outer try/catch

7. /src/lib/SwarmSupervisor.ts:775-809 (findPRForBranch)

Change: Accept branch name, use --head instead of --search. Use executeGhCommandWithRetry for resilience.

Implementation guidance:
// Change signature to accept epicLoom (remove underscore) and use branch name:
private async findPRForBranch(issueId: string, epicLoom: EpicLoomContext): Promise<number | null> {
  // Get the branch name from our stored map
  const branchName = this.taskBranchNames.get(issueId)
  if (!branchName) {
    logger.warn(`No branch name stored for task ${issueId}, falling back to title search`)
    // Fallback: use old approach if branch name unknown
    // (shouldn't happen in normal flow)
    return this.findPRForBranchByTitle(issueId)
  }

  try {
    const prList = await executeGhCommandWithRetry<Array<{ number: number; headRefName: string }>>(
      ['pr', 'list', '--state', 'open', '--json', 'number,headRefName', '--head', branchName],
    )
    if (prList.length > 0 && prList[0]) {
      return prList[0].number
    }
    return null
  } catch (error: unknown) {
    // Same error handling as before for "no PR found" edge cases
    if (error instanceof Error) {
      const msg = error.message.toLowerCase()
      if (msg.includes('no pull requests match') || msg.includes('no open pull requests')) {
        return null
      }
    }
    throw error
  }
}

// Keep old method as fallback (rename from current findPRForBranch):
private async findPRForBranchByTitle(issueId: string): Promise<number | null> {
  // ... existing title-search logic ...
}

8. /src/lib/SwarmSupervisor.ts:732-736 (mergePR)

Change: Use executeGhCommandWithRetry instead of executeGhCommand for merge operations.

private async mergePR(prNumber: number): Promise<void> {
  await executeGhCommandWithRetry(['pr', 'merge', String(prNumber), '--merge', '--delete-branch'])
}

Detailed Execution Order

NOTE: These steps are executed in a SINGLE implementation run.

  1. Add retry utility to github.ts

    • Files: /src/utils/github.ts
    • Add import { setTimeout as sleep } from 'timers/promises' at top
    • Add executeGhCommandWithRetry function after line 32 -> Verify: TypeScript compiles
  2. Add taskBranchNames field and populate it

    • Files: /src/lib/SwarmSupervisor.ts
    • Add field at line ~172, populate in claimAndSpawnAgent at line ~409 -> Verify: loom.branch stored
  3. Fix exit condition

    • Files: /src/lib/SwarmSupervisor.ts
    • Replace lines 276-281 with allTasksAccountedFor check -> Verify: loop continues when tasks remain
  4. Fix closeTask error handling

    • Files: /src/lib/SwarmSupervisor.ts
    • Update closeTask signature and body at line 760-765
    • Update 4 call sites (lines ~501, ~550, ~580, ~665) -> Verify: critical paths throw, non-critical swallow
  5. Fix findPRForBranch to use --head

    • Files: /src/lib/SwarmSupervisor.ts
    • Update import at line 11 to add executeGhCommandWithRetry
    • Rewrite findPRForBranch at line 775 -> Verify: uses --head with branch name
  6. Update mergePR to use retry

    • Files: /src/lib/SwarmSupervisor.ts
    • Replace executeGhCommand with executeGhCommandWithRetry in mergePR at line 733 -> Verify: merges retry on rate limit
  7. Update and add tests

    • Files: /src/lib/SwarmSupervisor.test.ts
    • Add premature exit test, closeTask error test, findPRForBranch --head test, rate limit retry test -> Verify: pnpm test passes
  8. Build verification

    • Run pnpm build to verify TypeScript compiles successfully

Dependencies and Configuration

None

@acreeger
Collaborator Author

acreeger commented Feb 9, 2026

Summary: Swarm Bug Fixes & Resilience Improvements

Four swarm mode bugs were fixed, along with progress-logging and test-stability improvements:

Bug 1: Premature Loop Exit (Critical)

Problem: The swarm orchestration loop exited after completing just 1 of 6 tasks. The exit condition only checked if bd ready returned empty + no active agents, but didn't verify all tasks were actually done. Between one task closing and its dependents being unblocked, bd ready could temporarily return empty, triggering premature exit.

Fix: Added allTasksAccountedFor check (result.completed + result.failed >= result.totalTasks) to the exit condition in SwarmSupervisor.ts. The loop now only exits when all tasks are accounted for.

Bug 2: Imprecise PR Search

Problem: findPRForBranch() used gh pr list --search "54 in:title" which was imprecise (could match wrong PRs) and used GraphQL (heavier on rate limits). It also failed due to a GitHub API rate limit during testing.

Fix:

  • Switched to gh pr list --head <branch-name> for exact matching by branch name
  • Added taskBranchNames map to store branch names from loom.branch when agents spawn
  • Falls back to title search if no branch name is stored
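The lookup strategy above could be sketched as a pure function that builds the `gh pr list` arguments. This is an illustration only: `buildPrListArgs`, the map's key type, and the `--json`/`--limit` choices are assumptions, not the code in `SwarmSupervisor.ts`.

```typescript
// Hypothetical helper: picks exact branch matching when a branch name was
// recorded for the task, otherwise falls back to the older title search.
function buildPrListArgs(
  taskId: string,
  taskBranchNames: Map<string, string>, // assumed: task ID -> branch name
): string[] {
  const common = ["--json", "number,url", "--limit", "1"];
  const branch = taskBranchNames.get(taskId);
  if (branch !== undefined) {
    // Exact match on the PR's head branch.
    return ["pr", "list", "--head", branch, ...common];
  }
  // Fallback: imprecise title search, as in the original implementation.
  return ["pr", "list", "--search", `${taskId} in:title`, ...common];
}
```

Keeping the argument construction separate from the process spawn makes both paths easy to unit test without calling the CLI.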

Bug 3: No Rate Limit Resilience

Problem: When GitHub API calls hit rate limits, the supervisor just failed and moved on. Critical operations like PR merge and PR search had no retry logic.

Fix: Added executeGhCommandWithRetry() in github.ts with exponential backoff for rate-limited calls (403, "rate limit", "secondary rate limit"). Used for mergePR() and findPRForBranch().
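For readers unfamiliar with the pattern, here is a rough sketch of retry with exponential backoff for rate-limited calls. It is not the actual `executeGhCommandWithRetry()` signature: `withRateLimitRetry`, `RetryOptions`, the default retry count, and the base delay are all assumed for illustration. Only the matched patterns (403, "rate limit", "secondary rate limit") are taken from the description above.

```typescript
type SleepFn = (ms: number) => Promise<void>;

const defaultSleep: SleepFn = (ms) =>
  new Promise<void>((resolve) => {
    setTimeout(resolve, ms);
  });

// Matches 403 and "rate limit" (which also covers "secondary rate limit").
// How the real code inspects errors is an assumption.
function isRateLimitError(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return /\b403\b|rate limit/i.test(message);
}

// Hypothetical options; the defaults are illustrative, not the project's.
interface RetryOptions {
  maxRetries?: number;
  baseDelayMs?: number;
  sleepFn?: SleepFn;
}

async function withRateLimitRetry<T>(
  run: () => Promise<T>,
  { maxRetries = 3, baseDelayMs = 1000, sleepFn = defaultSleep }: RetryOptions = {},
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await run();
    } catch (error) {
      // Give up on non-rate-limit errors or once the retry budget is spent.
      if (attempt >= maxRetries || !isRateLimitError(error)) throw error;
      // Delay doubles each attempt: base, 2x base, 4x base, ...
      await sleepFn(baseDelayMs * 2 ** attempt);
    }
  }
}
```

Errors that do not look like rate limiting are rethrown immediately, so a genuine failure (for example, a missing PR) is not delayed by the backoff.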

Bug 4: closeTask Error Swallowing

Problem: closeTask() silently swallowed all errors from bd close. If closing a task failed, dependent tasks were never unblocked in the Beads DAG, but the supervisor didn't know and exited incomplete.

Fix: Made closeTask() propagate errors by default in critical paths (merge queue, conflict resolution). Non-critical paths (agent failure handling, no-PR completion) pass { swallow: true } to keep the old behavior.
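The propagate-by-default behavior might look roughly like this. The `{ swallow: true }` option comes from the description above; the `runBdClose` parameter is a hypothetical stand-in for whatever shells out to `bd close`, injected here so the sketch stays self-contained.

```typescript
interface CloseTaskOptions {
  swallow?: boolean; // when true, log the failure and continue
}

// Sketch only: the real closeTask is a SwarmSupervisor method with its own
// signature; `runBdClose` is an assumption made for illustration.
async function closeTask(
  taskId: string,
  runBdClose: (taskId: string) => Promise<void>,
  options: CloseTaskOptions = {},
): Promise<void> {
  try {
    await runBdClose(taskId);
  } catch (error) {
    // Critical paths (default): surface the failure so the supervisor knows
    // dependents were never unblocked.
    if (!options.swallow) throw error;
    // Non-critical paths: keep the old behavior, but leave a trace.
    const message = error instanceof Error ? error.message : String(error);
    console.warn(`Failed to close task ${taskId}: ${message}`);
  }
}
```

Making propagation the default means a new call site has to opt in to swallowing explicitly, rather than silently inheriting it.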

Additional Fix: Progress Logging

  • Status line now only prints when values change (no more repeated identical lines)
  • Dots (.) printed to stderr between changes to show liveness
  • Added detailed DAG sync logging (task creation, dependency linking, summary)
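The status-line deduplication described above can be sketched as a small closure. `createProgressReporter` and its injected writer callbacks are hypothetical; the real supervisor presumably writes to the console and stderr directly.

```typescript
// Returns a reporter that prints a status line only when it changes and
// emits a liveness dot otherwise. Writers are injected so the sketch does
// not depend on any particular output stream.
function createProgressReporter(
  writeLine: (line: string) => void,
  writeDot: () => void,
): (status: string) => void {
  let lastStatus: string | undefined;
  return (status: string): void => {
    if (status === lastStatus) {
      writeDot(); // unchanged: show liveness without repeating the line
      return;
    }
    lastStatus = status;
    writeLine(status);
  };
}
```

In a Node.js process this could be wired up as `createProgressReporter(console.log, () => process.stderr.write("."))`, which matches the dots-to-stderr behavior listed above.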

OOM Investigation Finding

During testing, we discovered that SwarmSupervisor.test.ts caused OOM crashes when run alongside other test files. Root cause: the process.stderr.write('.') in the progress logging creates an infinite tight loop when the sleep between iterations is mocked to resolve instantly (as the vitest mocks in that file did) and nothing bounds the number of loop iterations.

Fix: Added sleepFn injection parameter to executeGhCommandWithRetry() and ensured the SwarmSupervisor test mocks properly bound the orchestration loop iterations.

Lesson learned: Mocking Node.js built-in modules like timers/promises at file level can destabilize vitest worker processes. Prefer dependency injection (sleepFn parameter) over module-level mocks for timer-dependent code.
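As an illustration of that lesson, a polling loop can take its sleep function as a parameter, so a test can pass an instant sleep while bounding the loop through the completion check. `pollUntilDone` and `demo` are hypothetical names, and the default interval is assumed.

```typescript
type SleepFn = (ms: number) => Promise<void>;

// Polls until `isDone` resolves true, sleeping between checks.
// Returns how many times it slept.
async function pollUntilDone(
  isDone: () => Promise<boolean>,
  sleepFn: SleepFn,
  intervalMs = 5000, // assumed default, for illustration only
): Promise<number> {
  let ticks = 0;
  while (!(await isDone())) {
    ticks++;
    await sleepFn(intervalMs);
  }
  return ticks;
}

// Test-style usage: no module-level timer mock is needed. The fake `isDone`
// reports completion after three checks, which bounds the loop even though
// the injected sleep resolves immediately.
async function demo(): Promise<number> {
  let remaining = 3;
  const instantSleep: SleepFn = async () => {};
  return pollUntilDone(async () => remaining-- <= 0, instantSleep);
}
```

If the fake `isDone` never returned `true`, the instant sleep would turn this into the same tight loop described above, which is why the test double has to bound the iterations.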

Files Changed

  • src/lib/SwarmSupervisor.ts - Loop exit condition, closeTask error handling, findPRForBranch rewrite, progress dedup, branch name tracking
  • src/utils/github.ts - Added executeGhCommandWithRetry with exponential backoff
  • src/lib/BeadsSyncService.ts - DAG sync logging, deterministic prefix
  • src/lib/BeadsManager.ts - bd init --prefix, bd config set beads.role maintainer
  • src/lib/LoomManager.ts - Swarm fast path for reused worktrees
  • src/commands/start.ts - Pluralization fix, beadsPrefix from repoInfo.name
  • Tests updated: SwarmSupervisor.test.ts, BeadsSyncService.test.ts, BeadsManager.test.ts, github.test.ts, start-swarm.test.ts

…ntation

Fix all critical, medium, and low priority issues from code review:

- Eliminate error swallowing in BeadsManager, SwarmSupervisor, and start.ts
- Add input validation for --max-agents (NaN check + range 1-20)
- Fix silent auto-install in CI when autoInstallBeads is false
- Update PATH after Beads install for verification check
- Prevent infinite loop on permanently failed tasks
- Fix conflict resolver using wrong working directory
- Close log file streams on agent completion
- Add 30s timeout to graceful shutdown
- Write progress file atomically via temp+rename
- Fix broken prompt interpolation in confirmSwarmMode
- Set GIT_REMOTE and validate EPIC_BRANCH in swarm mode
- Make templates swarm-aware (autonomous mode, SKIP_IMPLEMENTATION)
- Parallel dependency fetching via Promise.allSettled
- Filter subprocess environment to prevent secret leakage
- Fix parseInt truncation for mixed-format task IDs
- Propagate isEpic/swarmStatus metadata in swarm finish path
- Replace dynamic import with static import in tests
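
The `--max-agents` validation listed above (NaN check plus range 1-20) could look something like the following. `parseMaxAgents` and the error wording are assumptions for illustration, not the code in this commit.

```typescript
// Hypothetical parser for the --max-agents flag value.
function parseMaxAgents(raw: string): number {
  const value = Number(raw);
  // Number.isInteger rejects NaN (e.g. "abc") as well as fractions ("3.5").
  if (!Number.isInteger(value)) {
    throw new Error(`--max-agents must be an integer, got "${raw}"`);
  }
  if (value < 1 || value > 20) {
    throw new Error(`--max-agents must be between 1 and 20, got ${value}`);
  }
  return value;
}
```

Using `Number()` instead of `parseInt()` means mixed input such as `"5abc"` is rejected outright instead of being truncated to `5`.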

@acreeger force-pushed the feat/issue-557__autonomous-swarm-mode branch from 7991e33 to e4d4a83 on February 9, 2026 04:54

Catch "already initialized" error from `bd init` and treat as success,
fixing re-runs on existing epic looms. Also improve execBd() error
handling to distinguish CLI failures from unexpected runtime errors.

Fixes #571