Add preview agentic self-healing deployment workflow #143

@ewega

Problem

Deterministic deploy recovery will cover the most common, well-understood failure modes, but it will never cover every local Docker and Azure deployment edge case.

After the CLI has:

  • a stronger deploy-time recovery baseline
  • Copilot SDK integration
  • and AI-powered diagnosis with structured deployment context

there should be a preview self-healing workflow that can inspect deployment state, choose from explicit repair tools, apply bounded remediations, and either recover the deployment or stop with precise next steps.

Proposed Solution

Add a preview agentic self-healing deployment workflow built on the GitHub Copilot SDK.

This is not a freeform shell agent. The model should orchestrate a small set of explicit, typed repair tools implemented in Go, with approval boundaries and an audit trail.

Possible command surfaces:

gh devlake repair
gh devlake diagnose --fix
gh devlake deploy local --self-heal=preview

Final command naming can be decided during implementation, but the behavior should be the same: inspect -> choose bounded repair -> apply -> retry -> report.
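The inspect -> choose -> apply -> retry -> report cycle could be sketched as a bounded loop. This is a minimal illustration, not a proposed implementation: `State`, `Hooks`, `RunRepair`, `demoHooks`, and both error values are hypothetical names, and the model's choice is stubbed out as a plain function rather than a Copilot SDK call.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical snapshot of deployment health.
type State struct {
	Healthy bool
	Problem string
}

// Hooks groups the three steps of the cycle. All fields are illustrative;
// in the real workflow Choose would be backed by the model picking a tool.
type Hooks struct {
	Inspect func() State
	Choose  func(State) (string, bool) // repair name, or false if no safe action exists
	Apply   func(repair string) error
}

var (
	ErrNoSafeAction      = errors.New("no safe repair action; stopping with next steps")
	ErrAttemptsExhausted = errors.New("repair attempts exhausted")
)

// RunRepair runs at most maxAttempts repair attempts and returns a report
// of what was tried. It never loops beyond the bound.
func RunRepair(h Hooks, maxAttempts int) ([]string, error) {
	var report []string
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		st := h.Inspect()
		if st.Healthy {
			return append(report, "deployment healthy"), nil
		}
		repair, ok := h.Choose(st)
		if !ok {
			return append(report, "no safe action for: "+st.Problem), ErrNoSafeAction
		}
		if err := h.Apply(repair); err != nil {
			report = append(report, fmt.Sprintf("attempt %d: %s failed: %v", attempt, repair, err))
			continue
		}
		report = append(report, fmt.Sprintf("attempt %d: applied %s", attempt, repair))
	}
	// Final inspection after the last permitted attempt.
	if h.Inspect().Healthy {
		return append(report, "deployment healthy"), nil
	}
	return report, ErrAttemptsExhausted
}

// demoHooks simulates a port conflict that a repair either fixes or does not.
func demoHooks(fixable bool) Hooks {
	healthy := false
	return Hooks{
		Inspect: func() State {
			if healthy {
				return State{Healthy: true}
			}
			return State{Problem: "port 8080 already in use"}
		},
		Choose: func(State) (string, bool) { return "rewrite_local_ports_to_alt_bundle", true },
		Apply: func(string) error {
			healthy = fixable
			return nil
		},
	}
}

func main() {
	report, err := RunRepair(demoHooks(true), 3)
	for _, line := range report {
		fmt.Println(line)
	}
	fmt.Println("error:", err)
}
```

Whatever command surface is chosen, it could share one loop of this shape, so the retry bound and the report format stay identical across `repair`, `diagnose --fix`, and `--self-heal=preview`.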

Architecture

Layering

  1. Deterministic recovery first: rely on the deploy-time classifier and bounded-retry groundwork from #142 (Improve deploy-time error classification and bounded recovery).
  2. Diagnosis second: reuse the structured health/connection/pipeline context from gh devlake diagnose.
  3. Agentic repair third: let Copilot choose among explicit repair tools and approval-gated actions.
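One way to picture the layering is an ordered chain in which each layer runs only if the previous one did not recover the deployment. The sketch below is an assumption-laden illustration: `DeployContext`, `Layer`, `Recover`, `DemoLayers`, and the failure class strings are all hypothetical, and the layers are stubs rather than the real classifier, `diagnose` output, or Copilot session.

```go
package main

import "fmt"

// DeployContext carries the failure through the layers (hypothetical type).
type DeployContext struct {
	Failure   string   // classified deploy failure
	Diagnosis string   // filled in by the diagnosis layer
	Trail     []string // layers that ran, in order
}

// Layer is one recovery stage; Run reports whether the deployment recovered.
type Layer struct {
	Name string
	Run  func(*DeployContext) bool
}

// Recover runs layers in order and stops at the first one that succeeds.
func Recover(c *DeployContext, layers []Layer) (string, bool) {
	for _, l := range layers {
		c.Trail = append(c.Trail, l.Name)
		if l.Run(c) {
			return l.Name, true
		}
	}
	return "", false
}

// DemoLayers stubs the three layers with made-up failure classes.
func DemoLayers() []Layer {
	deterministic := map[string]bool{"port_conflict": true}
	agentRepairable := map[string]bool{"soft_deleted_key_vault": true}
	return []Layer{
		{Name: "deterministic", Run: func(c *DeployContext) bool {
			return deterministic[c.Failure]
		}},
		{Name: "diagnosis", Run: func(c *DeployContext) bool {
			// Diagnosis only gathers context; it never repairs anything.
			c.Diagnosis = "structured context for " + c.Failure
			return false
		}},
		{Name: "agentic", Run: func(c *DeployContext) bool {
			// The agentic layer requires diagnosis context before acting.
			return c.Diagnosis != "" && agentRepairable[c.Failure]
		}},
	}
}

func main() {
	for _, failure := range []string{"port_conflict", "soft_deleted_key_vault", "unknown"} {
		c := &DeployContext{Failure: failure}
		layer, ok := Recover(c, DemoLayers())
		fmt.Printf("%s: recovered=%v by=%q trail=%v\n", failure, ok, layer, c.Trail)
	}
}
```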

Repair model

Candidate repair tools

inspect_local_port_conflicts
rewrite_local_ports_to_alt_bundle
cleanup_partial_local_artifacts
retry_local_compose_up
check_azure_prereqs
start_mysql_if_stopped
purge_soft_deleted_key_vault
rerun_bicep_deploy
collect_deploy_logs

These tool names are illustrative, but the important constraint is that each one is:

  • explicit
  • typed
  • narrow in side effects
  • implemented in deterministic Go code
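To make "explicit" and "typed" concrete, here is a minimal sketch of a tool registry in which the model can only name a registered tool and pass arguments that decode strictly into a Go struct. `Tool`, `Registry`, `PortArgs`, `NewRegistry`, and the injected `inUse` probe are assumptions for illustration; how the Copilot SDK actually registers tools is not shown here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// PortArgs is the typed input for the port inspection tool (hypothetical).
type PortArgs struct {
	Ports []int `json:"ports"`
}

// Tool is one narrow repair capability implemented in Go.
type Tool struct {
	Name     string
	Mutating bool
	Run      func(raw json.RawMessage) (string, error)
}

// Registry is the closed set of tools the model may select from.
type Registry map[string]Tool

// Call rejects any name that is not registered, so there is no path
// from model output to arbitrary command execution.
func (r Registry) Call(name string, raw json.RawMessage) (string, error) {
	t, ok := r[name]
	if !ok {
		return "", fmt.Errorf("unknown repair tool %q", name)
	}
	return t.Run(raw)
}

// decodeStrict fails on fields the target struct does not declare.
func decodeStrict(raw json.RawMessage, v any) error {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	return dec.Decode(v)
}

// NewRegistry builds a registry with a single read-only tool. The inUse
// probe is injected so the example stays deterministic.
func NewRegistry(inUse func(port int) bool) Registry {
	inspect := Tool{
		Name: "inspect_local_port_conflicts",
		Run: func(raw json.RawMessage) (string, error) {
			var args PortArgs
			if err := decodeStrict(raw, &args); err != nil {
				return "", fmt.Errorf("invalid arguments: %w", err)
			}
			busy := []int{}
			for _, p := range args.Ports {
				if p < 1 || p > 65535 {
					return "", fmt.Errorf("port %d out of range", p)
				}
				if inUse(p) {
					busy = append(busy, p)
				}
			}
			return fmt.Sprintf("ports in use: %v", busy), nil
		},
	}
	return Registry{inspect.Name: inspect}
}

func main() {
	reg := NewRegistry(func(p int) bool { return p == 8080 })
	out, err := reg.Call("inspect_local_port_conflicts", json.RawMessage(`{"ports":[8080,4000]}`))
	fmt.Println(out, err)
	_, err = reg.Call("run_shell", json.RawMessage(`{"cmd":"rm -rf /"}`))
	fmt.Println(err)
}
```

The `Mutating` flag is carried on the tool itself so that approval rules can be enforced in code rather than left to the model's judgment.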

Safety boundaries

  • No arbitrary shell planning/execution by the model.
  • No silent destructive cleanup.
  • Mutating steps must either:
    • require confirmation, or
    • be explicitly classified as safe/idempotent in code.
  • Repairs must log what changed and why.
  • Bounded retry only; no open-ended loops.
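The confirmation and logging rules above might be enforced by a small gate that every repair passes through. This is a sketch under stated assumptions: `Action`, `Gate`, `AuditEntry`, and `ErrNotApproved` are hypothetical names, and which tools count as safe/idempotent is a classification the implementation would have to make (the flags below are placeholders, not a recommendation).

```go
package main

import (
	"errors"
	"fmt"
)

// Action is a repair about to be executed (hypothetical type).
type Action struct {
	Name     string
	Reason   string // why the repair was chosen; recorded in the audit trail
	Mutating bool
	Safe     bool // explicitly classified as safe/idempotent in code
	Run      func() error
}

// AuditEntry records what was attempted, why, and the outcome.
type AuditEntry struct {
	Action, Reason, Result string
}

// Gate enforces approval for mutating actions and keeps the audit trail.
type Gate struct {
	Confirm func(Action) bool // e.g. an interactive prompt; nil means deny
	Log     []AuditEntry
}

var ErrNotApproved = errors.New("mutating repair not approved")

// Execute runs read-only and safe actions directly; any other mutating
// action runs only after confirmation. Every outcome is logged.
func (g *Gate) Execute(a Action) error {
	if a.Mutating && !a.Safe && (g.Confirm == nil || !g.Confirm(a)) {
		g.record(a, "skipped: not approved")
		return ErrNotApproved
	}
	if err := a.Run(); err != nil {
		g.record(a, "failed: "+err.Error())
		return err
	}
	g.record(a, "applied")
	return nil
}

func (g *Gate) record(a Action, result string) {
	g.Log = append(g.Log, AuditEntry{Action: a.Name, Reason: a.Reason, Result: result})
}

func main() {
	gate := &Gate{Confirm: func(Action) bool { return false }} // user declines everything
	noop := func() error { return nil }
	_ = gate.Execute(Action{Name: "collect_deploy_logs", Reason: "gather evidence", Run: noop})
	_ = gate.Execute(Action{Name: "purge_soft_deleted_key_vault", Reason: "name collision", Mutating: true, Run: noop})
	for _, e := range gate.Log {
		fmt.Printf("%s (%s): %s\n", e.Action, e.Reason, e.Result)
	}
}
```

Denying by default when no `Confirm` callback is set keeps non-interactive runs from silently applying destructive repairs.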

Likely Files

| File | Change |
| --- | --- |
| cmd/diagnose.go or cmd/repair.go | Preview repair command surface |
| internal/copilot/ | Repair-oriented session/tool orchestration |
| internal/repair/ | Deterministic repair helpers and safety boundaries |
| cmd/deploy_local.go / cmd/deploy_azure.go | Shared recovery primitives consumed by repair tools |
| README.md / docs | Preview workflow, safety model, and examples |

Acceptance Criteria

  • A preview repair workflow exists (repair or diagnose --fix).
  • The workflow uses explicit repair tools, not arbitrary shell execution.
  • The model can inspect deployment state and select from bounded repair actions.
  • Mutating repairs are approval-gated or explicitly safe/idempotent by implementation.
  • Repair output includes an audit trail of the attempted fixes and retry results.
  • When repair confidence is low or no safe action exists, the workflow stops and prints precise next steps instead of guessing.
  • go build ./..., go test ./..., and go vet ./... pass.
  • README/docs clearly describe the preview status and safety boundaries.

Dependencies

Blocked by:

Target Version

v0.4.4 — preview work within the active v0.4.x line once the deterministic recovery and AI diagnosis foundations are in place.

References

Labels: enhancement