Description
Problem
Deterministic deploy recovery will cover the most common, well-understood failure modes, but it will never cover every local Docker and Azure deployment edge case.
After the CLI has:
- a stronger deploy-time recovery baseline
- Copilot SDK integration
- and AI-powered diagnosis with structured deployment context
there should be a preview self-healing workflow that can inspect deployment state, choose from explicit repair tools, apply bounded remediations, and either recover the deployment or stop with precise next steps.
Proposed Solution
Add a preview agentic self-healing deployment workflow built on the GitHub Copilot SDK.
This is not a freeform shell agent. The model should orchestrate a small set of explicit, typed repair tools implemented in Go, with approval boundaries and an audit trail.
Possible command surfaces:
- `gh devlake repair`
- `gh devlake diagnose --fix`
- `gh devlake deploy local --self-heal=preview`

Final command naming can be decided during implementation, but the behavior should be the same: inspect -> choose bounded repair -> apply -> retry -> report.
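The inspect -> choose -> apply -> retry -> report behavior could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the planned implementation: `Workflow`, `Action`, `ErrNoSafeAction`, and `demoWorkflow` are hypothetical names, and the simulated port conflict stands in for real deployment state.

```go
package main

import (
	"errors"
	"fmt"
)

// Action is a hypothetical bounded repair selected after inspection.
type Action struct {
	Name  string
	Apply func() error
}

// Workflow wires the stages together; every field is illustrative.
type Workflow struct {
	MaxAttempts int
	Inspect     func() (string, error)
	Choose      func(state string) (Action, bool) // false: no safe action
	Retry       func() error                      // re-runs the failed deploy step
}

// ErrNoSafeAction tells the caller to print next steps instead of guessing.
var ErrNoSafeAction = errors.New("no safe repair action available")

// Run performs inspect -> choose -> apply -> retry at most MaxAttempts times
// and returns a human-readable report of every step taken.
func (w Workflow) Run() ([]string, error) {
	var report []string
	for attempt := 1; attempt <= w.MaxAttempts; attempt++ {
		state, err := w.Inspect()
		if err != nil {
			return report, fmt.Errorf("inspect: %w", err)
		}
		action, ok := w.Choose(state)
		if !ok {
			report = append(report, fmt.Sprintf("attempt %d: no safe action for %q", attempt, state))
			return report, ErrNoSafeAction
		}
		if err := action.Apply(); err != nil {
			report = append(report, fmt.Sprintf("attempt %d: %s failed: %v", attempt, action.Name, err))
			continue
		}
		report = append(report, fmt.Sprintf("attempt %d: applied %s", attempt, action.Name))
		if err := w.Retry(); err != nil {
			report = append(report, fmt.Sprintf("attempt %d: retry failed: %v", attempt, err))
			continue
		}
		report = append(report, fmt.Sprintf("attempt %d: deployment recovered", attempt))
		return report, nil
	}
	return report, fmt.Errorf("gave up after %d attempts", w.MaxAttempts)
}

// demoWorkflow simulates a port conflict that a single repair resolves.
func demoWorkflow() Workflow {
	conflict := true
	return Workflow{
		MaxAttempts: 3,
		Inspect: func() (string, error) {
			if conflict {
				return "port_conflict", nil
			}
			return "unknown", nil
		},
		Choose: func(state string) (Action, bool) {
			if state != "port_conflict" {
				return Action{}, false
			}
			return Action{
				Name:  "rewrite_local_ports_to_alt_bundle",
				Apply: func() error { conflict = false; return nil },
			}, true
		},
		Retry: func() error {
			if conflict {
				return errors.New("compose up failed")
			}
			return nil
		},
	}
}

func main() {
	report, err := demoWorkflow().Run()
	for _, line := range report {
		fmt.Println(line)
	}
	if err != nil {
		fmt.Println("stopped:", err)
	}
}
```

The loop has a single exit for success, one for "no safe action", and one for the attempt limit, so it cannot run open-ended regardless of what the chooser returns.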
Architecture
Layering
- Deterministic recovery first: rely on the deploy-time classifier and bounded-retry groundwork from Improve deploy-time error classification and bounded recovery #142.
- Diagnosis second: reuse the structured health/connection/pipeline context from `gh devlake diagnose`.
- Agentic repair third: let Copilot decide among explicit repair tools and approval-gated actions.
Repair model
- Reuse the `internal/copilot/` foundation from Integrate Copilot SDK (Go): `internal/copilot` package + `gh devlake insights` #63.
- Add a repair-oriented tool surface rather than generic shell execution.
- Require approval for mutating actions.
- Keep every repair step auditable in CLI output.
- Stop after bounded attempts; when confidence is low, fall back to diagnosis and recommended commands.
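The stop condition in the last bullet might look something like the sketch below. Everything here is an assumption for illustration: the `Proposal` and `Decision` types, the `Decide` function, and the numeric confidence score and threshold are hypothetical, since the issue does not specify how confidence is represented.

```go
package main

import "fmt"

// Proposal is a hypothetical model suggestion: one tool plus a confidence score.
type Proposal struct {
	Tool       string
	Confidence float64 // assumed range 0..1
}

// Decision is either a repair to run or a list of recommended commands.
type Decision struct {
	Repair    bool
	Tool      string
	NextSteps []string
}

// Decide falls back to recommendations when the attempt budget is spent,
// no tool was proposed, or confidence is below the threshold.
func Decide(p Proposal, attempts, maxAttempts int, minConfidence float64, fallback []string) Decision {
	if attempts >= maxAttempts || p.Tool == "" || p.Confidence < minConfidence {
		return Decision{NextSteps: fallback}
	}
	return Decision{Repair: true, Tool: p.Tool}
}

func main() {
	fallback := []string{"gh devlake diagnose"}
	low := Decide(Proposal{Tool: "retry_local_compose_up", Confidence: 0.4}, 0, 3, 0.7, fallback)
	fmt.Println(low.Repair, low.NextSteps)
	high := Decide(Proposal{Tool: "retry_local_compose_up", Confidence: 0.9}, 0, 3, 0.7, fallback)
	fmt.Println(high.Repair, high.Tool)
}
```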
Candidate repair tools
- `inspect_local_port_conflicts`
- `rewrite_local_ports_to_alt_bundle`
- `cleanup_partial_local_artifacts`
- `retry_local_compose_up`
- `check_azure_prereqs`
- `start_mysql_if_stopped`
- `purge_soft_deleted_key_vault`
- `rerun_bicep_deploy`
- `collect_deploy_logs`

These tool names are illustrative, but the important constraint is that each one is:
- explicit
- typed
- narrow in side effects
- implemented in deterministic Go code
Safety boundaries
- No arbitrary shell planning/execution by the model.
- No silent destructive cleanup.
- Mutating steps must either:
  - require confirmation, or
  - be explicitly classified as safe/idempotent in code.
- Repairs must log what changed and why.
- Bounded retry only; no open-ended loops.
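A minimal sketch of how the approval gate and audit trail could fit together, assuming a hypothetical `Step` type whose safety classification is set in Go code and an `approve` callback that a real CLI would back with an interactive prompt. The classifications in `demoSteps` are illustrative assumptions, not decisions made in this issue.

```go
package main

import "fmt"

// Step is a hypothetical repair step with its safety classification.
type Step struct {
	Name           string
	Reason         string // why the step was selected
	Mutating       bool
	SafeIdempotent bool // set in code, never by the model
	Apply          func() (changed string, err error)
}

// AuditEntry records what was attempted, why, and what happened.
type AuditEntry struct {
	Step, Reason, Outcome string
}

// Execute runs at most maxSteps steps. Mutating steps that are not classified
// as safe/idempotent run only if approve returns true.
func Execute(steps []Step, approve func(Step) bool, maxSteps int) []AuditEntry {
	var audit []AuditEntry
	for i, s := range steps {
		if i >= maxSteps {
			audit = append(audit, AuditEntry{s.Name, s.Reason, "not run: step limit reached"})
			break
		}
		if s.Mutating && !s.SafeIdempotent && !approve(s) {
			audit = append(audit, AuditEntry{s.Name, s.Reason, "skipped: approval declined"})
			continue
		}
		changed, err := s.Apply()
		outcome := "ok: " + changed
		if err != nil {
			outcome = "failed: " + err.Error()
		}
		audit = append(audit, AuditEntry{s.Name, s.Reason, outcome})
	}
	return audit
}

// demoSteps uses tool names from this issue with assumed classifications.
func demoSteps() []Step {
	return []Step{
		{Name: "collect_deploy_logs", Reason: "gather context",
			Apply: func() (string, error) { return "no changes", nil }},
		{Name: "retry_local_compose_up", Reason: "transient failure", Mutating: true, SafeIdempotent: true,
			Apply: func() (string, error) { return "containers restarted", nil }},
		{Name: "purge_soft_deleted_key_vault", Reason: "name collision", Mutating: true,
			Apply: func() (string, error) { return "vault purged", nil }},
	}
}

func main() {
	deny := func(Step) bool { return false } // stand-in for a declined prompt
	for _, e := range Execute(demoSteps(), deny, 5) {
		fmt.Printf("%s (%s): %s\n", e.Step, e.Reason, e.Outcome)
	}
}
```

Skipped and unrun steps are recorded alongside applied ones, so the audit output shows declined destructive actions as well as the changes that were made.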
Likely Files
| File | Change |
|---|---|
| `cmd/diagnose.go` or `cmd/repair.go` | Preview repair command surface |
| `internal/copilot/` | Repair-oriented session/tool orchestration |
| `internal/repair/` | Deterministic repair helpers and safety boundaries |
| `cmd/deploy_local.go` / `cmd/deploy_azure.go` | Shared recovery primitives consumed by repair tools |
| `README.md` / docs | Preview workflow, safety model, and examples |
Acceptance Criteria
- A preview repair workflow exists (`repair` or `diagnose --fix`).
- The workflow uses explicit repair tools, not arbitrary shell execution.
- The model can inspect deployment state and select from bounded repair actions.
- Mutating repairs are approval-gated or explicitly safe/idempotent by implementation.
- Repair output includes an audit trail of the attempted fixes and retry results.
- When repair confidence is low or no safe action exists, the workflow stops and prints precise next steps instead of guessing.
- `go build ./...`, `go test ./...`, and `go vet ./...` pass.
- README/docs clearly describe the preview status and safety boundaries.
Dependencies
Blocked by:
- Improve deploy-time error classification and bounded recovery #142 (deterministic deploy recovery groundwork)
- Integrate Copilot SDK (Go): `internal/copilot` package + `gh devlake insights` #63 (Copilot SDK integration + `gh devlake insights`)
- Add `gh devlake diagnose`: AI-powered troubleshooting #64 (`gh devlake diagnose`)
Target Version
v0.4.4 — preview work within the active v0.4.x line once the deterministic recovery and AI diagnosis foundations are in place.
References
- Improve deploy-time error classification and bounded recovery #142 (deterministic deploy recovery groundwork)
- Integrate Copilot SDK (Go): `internal/copilot` package + `gh devlake insights` #63 (Copilot SDK integration)
- Add `gh devlake diagnose`: AI-powered troubleshooting #64 (AI-powered diagnosis)
- `internal/copilot/` (future shared SDK/session foundation)