feat(runtime): implement lease-based authority epochs #14
Add time-bounded lease authority so nodes must hold a valid, non-expired lease to tick an agent. Leases auto-renew locally; expiry triggers RECOVERY_REQUIRED state (EI-6: safety over liveness). This is the foundation of Phase 5 (Hardening).

- New internal/authority/ package: Epoch (MajorVersion, LeaseGeneration), State machine (5 states), Lease lifecycle with injectable clock
- Checkpoint format bumped to v0x03 (81-byte header): adds epoch and lease expiry fields, backward-compatible with v0x02
- Pre-tick lease validation in runner with auto-renewal
- Migration advances epoch (MajorVersion+1) with handoff/retired state transitions
- CLI flags: --lease-duration (60s), --lease-grace (10s)
- Inspector displays epoch and lease metadata for v0x03 checkpoints

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
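The pre-tick check described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in internal/authority/: the field names, the 15s renewal window, the `simulate` helper, and the choice to bump LeaseGeneration on each renewal are all assumptions, and the grace period is left out because the PR text does not say how it is applied.

```go
package main

import (
	"fmt"
	"time"
)

// State models two of the five authority states named in the PR.
type State string

const (
	ActiveOwner      State = "ACTIVE_OWNER"
	RecoveryRequired State = "RECOVERY_REQUIRED"
)

// Epoch pairs the two counters from the PR description.
type Epoch struct {
	MajorVersion    uint64
	LeaseGeneration uint64
}

// Lease is a hypothetical lease; the real struct may differ.
type Lease struct {
	Epoch         Epoch
	State         State
	Expiry        time.Time
	Duration      time.Duration
	RenewalWindow time.Duration
	now           func() time.Time // injectable clock, so tests control time
}

// NewLease starts an active lease that expires one Duration from now.
func NewLease(d, window time.Duration, now func() time.Time) *Lease {
	return &Lease{State: ActiveOwner, Expiry: now().Add(d), Duration: d, RenewalWindow: window, now: now}
}

// CheckAndRenew is the pre-tick gate: it reports whether a tick may run.
func (l *Lease) CheckAndRenew() bool {
	if l.State != ActiveOwner {
		return false
	}
	t := l.now()
	if !t.Before(l.Expiry) {
		// Expired: stop ticking rather than risk two owners (safety over liveness).
		l.State = RecoveryRequired
		return false
	}
	if l.Expiry.Sub(t) <= l.RenewalWindow {
		// Close to expiry: renew locally (assumed to bump LeaseGeneration).
		l.Expiry = t.Add(l.Duration)
		l.Epoch.LeaseGeneration++
	}
	return true
}

// simulate runs three pre-tick checks against a fake clock.
func simulate() ([]bool, *Lease) {
	clock := time.Unix(0, 0)
	now := func() time.Time { return clock }
	lease := NewLease(60*time.Second, 15*time.Second, now)

	var allowed []bool
	clock = clock.Add(10 * time.Second) // t=10s: well inside the lease
	allowed = append(allowed, lease.CheckAndRenew())
	clock = clock.Add(40 * time.Second) // t=50s: inside the renewal window, renews to t=110s
	allowed = append(allowed, lease.CheckAndRenew())
	clock = clock.Add(120 * time.Second) // t=170s: past the renewed expiry
	allowed = append(allowed, lease.CheckAndRenew())
	return allowed, lease
}

func main() {
	allowed, lease := simulate()
	fmt.Println(allowed, lease.State, lease.Epoch.LeaseGeneration)
}
```

Because the clock is passed in as a function, the expiry path can be exercised without sleeping: the third check is refused and the lease ends in RECOVERY_REQUIRED after exactly one renewal.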
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ea8a33ce8
    RenewalWindow: cfg.LeaseRenewalWindow,
    GracePeriod:   cfg.LeaseGracePeriod,
}
instance.Lease = authority.NewLease(leaseCfg)
Preserve checkpoint epoch when initializing local lease
This always creates a fresh lease with NewLease, which resets authority to epoch (0,0) even when LoadCheckpointFromStorage just loaded a v0x03 checkpoint containing a higher epoch from prior renewals or migrations. After a restart, the node can therefore emit regressed epochs on later migrations, breaking monotonic authority ordering and weakening the anti-clone guarantees introduced by this change; initialize from the checkpoint epoch instead of unconditionally bootstrapping a new one.
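One way to act on this suggestion is to seed the lease from the loaded checkpoint and fall back to a zero epoch only when no lease-aware checkpoint exists. The sketch below is hypothetical: `Checkpoint`, `leaseFor`, and the lexicographic `Less` ordering are stand-ins for illustration and are not taken from the PR's code.

```go
package main

import "fmt"

// Epoch pairs the two counters from the PR description.
type Epoch struct {
	MajorVersion    uint64
	LeaseGeneration uint64
}

// Less assumes lexicographic ordering: MajorVersion first, then LeaseGeneration.
func (e Epoch) Less(o Epoch) bool {
	if e.MajorVersion != o.MajorVersion {
		return e.MajorVersion < o.MajorVersion
	}
	return e.LeaseGeneration < o.LeaseGeneration
}

// Lease is reduced to its epoch for this illustration.
type Lease struct{ Epoch Epoch }

// Checkpoint is a stand-in for a parsed checkpoint header.
type Checkpoint struct {
	Version uint8
	Epoch   Epoch
}

// leaseFor chooses the starting epoch for a local lease. A v0x03 (or later)
// checkpoint carries an epoch, so it is preserved; anything else starts at (0,0).
func leaseFor(cp *Checkpoint) *Lease {
	if cp != nil && cp.Version >= 0x03 {
		return &Lease{Epoch: cp.Epoch}
	}
	return &Lease{}
}

func main() {
	cp := &Checkpoint{Version: 0x03, Epoch: Epoch{MajorVersion: 2, LeaseGeneration: 7}}
	restored := leaseFor(cp)
	fresh := leaseFor(nil)
	fmt.Println(restored.Epoch, fresh.Epoch, fresh.Epoch.Less(restored.Epoch))
}
```

With this shape, a restart after several renewals resumes at the checkpointed epoch instead of regressing to (0,0), which keeps later migrations monotonic.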
MajorVersion:    epoch.MajorVersion,
LeaseGeneration: epoch.LeaseGeneration,
Use active lease epoch when building migration package
MajorVersion/LeaseGeneration are taken from the stored checkpoint, but the target-side migration path advances lease epoch only in memory and does not immediately persist that new epoch. In a rapid A→B→C handoff (before B writes another checkpoint), B will still package A’s old epoch from disk and C will receive a non-incremented major version, causing epoch rollback across hops; package the epoch from the live instance lease or persist the updated checkpoint before migration.
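The rollback described here can be reproduced with a small model of the two-hop case. Everything in this sketch is hypothetical (the `Node`, `Package`, `accept`, and `twoHops` names, and the assumption that LeaseGeneration resets to 0 on migration); it only illustrates why packaging from the live lease differs from packaging from disk.

```go
package main

import "fmt"

// Epoch pairs the two counters from the PR description.
type Epoch struct {
	MajorVersion    uint64
	LeaseGeneration uint64
}

// Node tracks the epoch in its last persisted checkpoint and in its live lease.
type Node struct {
	OnDisk Epoch
	Live   Epoch
}

// Package is a stand-in for the migration package sent to the target.
type Package struct{ Epoch Epoch }

// accept models the target side as the review describes it: the received
// checkpoint is stored as-is, and the advanced epoch exists only in memory.
func accept(p Package) *Node {
	return &Node{
		OnDisk: p.Epoch,
		Live:   Epoch{MajorVersion: p.Epoch.MajorVersion + 1}, // LeaseGeneration reset is an assumption
	}
}

// twoHops runs A -> B -> C with no checkpoint written on B in between and
// returns the epoch C ends up holding.
func twoHops(useLive bool) Epoch {
	pack := func(n *Node) Package {
		if useLive {
			return Package{Epoch: n.Live} // suggested fix: read the live lease
		}
		return Package{Epoch: n.OnDisk} // reported behavior: read the stored checkpoint
	}
	a := &Node{OnDisk: Epoch{MajorVersion: 1, LeaseGeneration: 4}, Live: Epoch{MajorVersion: 1, LeaseGeneration: 4}}
	b := accept(pack(a))
	c := accept(pack(b))
	return c.Live
}

func main() {
	fmt.Println("from disk:", twoHops(false))
	fmt.Println("from live lease:", twoHops(true))
}
```

In this model, packaging from disk leaves C at MajorVersion 2, the same value B held, while packaging from the live lease gives C MajorVersion 3. Persisting the updated checkpoint on the target before any further migration would close the same gap.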
Summary
Implement lease-based authority system (Phase 5: Hardening) to enforce time-bounded exclusivity of agent execution across cluster nodes.
Changes
Core Authority System
- internal/authority/ (new package)
  - authority.go: Epoch ordering logic (MajorVersion + LeaseGeneration) and authority states (ACTIVE_OWNER, HANDOFF_INITIATED, HANDOFF_PENDING, RETIRED, RECOVERY_REQUIRED)
  - lease.go: Lease lifecycle management with renewal, expiry validation, grace periods, and state transitions
Checkpoint Format Upgrade
- ParseCheckpointHeader() to support both versions
- checkpoint_v3.bin for format testing
Agent & Runtime Integration
- internal/agent/instance.go: Lease *authority.Lease field to Instance
- cmd/igord/main.go: --lease-duration, --lease-grace
- internal/runner/runner.go: CheckAndRenewLease() and HandleLeaseExpiry() for lease lifecycle
Configuration & Services
- internal/config/config.go: Lease duration, renewal window, and grace period config with validation
- internal/migration/service.go: Lease config passed through migration service
- internal/inspector/inspector.go: Checkpoint inspection displays epoch and lease expiry
Safety Properties (EI-6)
Testing