feat(runtime): implement migration failure recovery — Task 14#19

Merged
simonovic86 merged 1 commit into main from claude/loving-dhawan on Mar 5, 2026

@simonovic86
Owner

Summary

Implements robust agent migration with retry, fallback, and lease-aware recovery to handle failures gracefully while preserving the single-instance invariant (EI-1).

Key Components

Migration Retry & Fallback

  • internal/migration/retry.go: Error classification (retriable, fatal, ambiguous) and exponential backoff calculation
  • MigrateAgentWithRetry: Orchestrates retry loop across multiple candidate peers with fallback
  • FS-2 safety: Ambiguous transfers (sent but no confirmation) trigger RECOVERY_REQUIRED rather than retrying to different targets

Peer Registry

  • internal/registry/registry.go: Peer discovery with health tracking, cached prices and capabilities, and candidate filtering by price, capability, and health
  • SelectCandidates: Ordered peer selection for migration fallback
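The filtering described above could be sketched roughly as follows. The `Peer` fields, the cheapest-first ordering, and the "wasm" capability name are assumptions; the description only says that candidates are filtered by price, capability, and health and returned in some order.

```go
package main

import (
	"fmt"
	"sort"
)

// Peer is a hypothetical registry entry; field names are illustrative.
type Peer struct {
	ID           string
	Price        uint64
	Healthy      bool
	Capabilities map[string]bool
}

// SelectCandidates keeps healthy peers priced at or below maxPrice that offer
// every required capability, then orders them cheapest first (ties by ID).
func SelectCandidates(peers []Peer, maxPrice uint64, required []string) []Peer {
	var out []Peer
	for _, p := range peers {
		if !p.Healthy || p.Price > maxPrice {
			continue
		}
		ok := true
		for _, c := range required {
			if !p.Capabilities[c] {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Price != out[j].Price {
			return out[i].Price < out[j].Price
		}
		return out[i].ID < out[j].ID
	})
	return out
}

func main() {
	wasm := map[string]bool{"wasm": true}
	peers := []Peer{
		{ID: "a", Price: 5, Healthy: true, Capabilities: wasm},
		{ID: "b", Price: 3, Healthy: false, Capabilities: wasm}, // unhealthy
		{ID: "c", Price: 2, Healthy: true, Capabilities: wasm},
		{ID: "d", Price: 9, Healthy: true, Capabilities: wasm}, // too expensive
	}
	for _, p := range SelectCandidates(peers, 6, []string{"wasm"}) {
		fmt.Println(p.ID, p.Price) // prints "c 2" then "a 5"
	}
}
```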

Lease State Transitions

  • RevertHandoff(): HANDOFF_INITIATED → ACTIVE_OWNER when migration fails before transfer
  • Recover(): RECOVERY_REQUIRED → ACTIVE_OWNER with fresh epoch (major+1) after detecting unresponsive owner
  • Lease recovery in tick loop: Auto-recovers from RECOVERY_REQUIRED during periodic ticks

Configuration & CLI

  • --migration-retries: Max retry attempts per target (default: 3)
  • --migration-retry-delay: Initial backoff delay (default: 1s)
  • internal/config/config.go: MigrationMaxRetries, MigrationRetryDelay fields with validation

Main Loop Refactoring

  • handleTick(): Encapsulates lease check, agent tick, and divergence verification
  • applyCLIOverrides(): Consolidated CLI flag handling
  • Tick result enum: Distinguishes normal/fast-path/recovery/stopped outcomes

Supporting Changes

  • internal/pricing/service.go: ScanPeerPrices() for bulk peer price queries during migration decisions
  • internal/p2p/node.go: ConnectedPeers() for discovering migration candidates
  • ROADMAP.md: Updated to Phase 5 (Hardening); Tasks 12–14 marked complete

Safety Properties

  • FS-1 (Migration Continuity): Source retains authority until target confirms
  • FS-2 (Migration Continuity): Ambiguous transfers do not retry to different targets
  • EI-1 (Single Active Instance): Enforced via lease state transitions and registry tracking
  • Epoch versioning: Major version increment on recovery prevents stale lease conflicts

Testing

  • Comprehensive unit tests for retry classification, backoff, registry filtering, and lease state transitions
  • Error path coverage: timeouts, connection failures, rejections, ambiguous cases

Add robust migration retry with exponential backoff, fallback to
alternative peers, and lease-aware recovery for the FS-2 ambiguous
transfer case. Preserves single-instance invariant (EI-1) throughout.

Components:
- Peer registry with health tracking and candidate selection
- Retry policy with error classification (retriable/fatal/ambiguous)
- MigrateAgentWithRetry orchestrating retry loop with peer fallback
- Lease transitions: RevertHandoff and Recover state machine methods
- Lease recovery in tick loop for RECOVERY_REQUIRED auto-recovery
- DivergenceMigrate escalation wired to retry-capable migration
- Configuration: --migration-retries, --migration-retry-delay flags
- Roadmap updated: Tasks 11–14 marked complete

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@simonovic86 merged commit eb9cada into main on Mar 5, 2026
1 check passed
@simonovic86 deleted the claude/loving-dhawan branch on March 5, 2026 16:21