Skip to content

Improve deploy-time error classification and bounded recovery #142

@ewega

Description

@ewega

Problem

gh devlake deploy local and gh devlake deploy azure already recover a few known failure modes, but deployment resilience is still uneven.

Today the CLI can:

  • auto-run az login when Azure auth is missing
  • start a stopped MySQL server during Azure deploy
  • purge a soft-deleted Key Vault before retrying Azure deployment
  • save partial Azure state so cleanup still works after mid-flight failure
  • print a friendlier message for some local Docker port conflicts

But local deploy recovery is still too narrow for real-world edge cases. In particular, Docker bind failures like:

  • ports are not available
  • address already in use
  • failed programming external connectivity

can bypass the current friendly path, dump raw compose errors, and force the user to manually inspect and retry. The CLI also lacks a shared deploy-time error taxonomy and a consistent bounded retry story across local and Azure flows.

Proposed Solution

Add a deterministic deploy recovery layer for known failure classes. This issue is intentionally non-agentic and does not depend on Copilot SDK.

The CLI should:

  1. classify known deploy-time failures
  2. apply only safe, explicit repairs
  3. retry at most once
  4. print exactly what it detected, what repair it applied, and what happened next

Scope

Local deploy

  • Expand Docker bind/port conflict detection to catch additional variants, including:
    • ports are not available
    • address already in use
    • failed programming external connectivity
  • Keep arbitrary custom ports out of scope for now.
  • Instead, support only the already-known alternate local port bundle:
    • backend: 8085
    • grafana: 3004
    • config UI: 4004
  • For official and fork local deploy flows, allow a bounded fallback from the default bundle (8080/3002/4000) to the alternate bundle (8085/3004/4004), then retry once.
  • Preserve the current custom flow as a manual/user-controlled path.
  • Improve remediation output when the CLI can identify the conflicting container or compose file.

Azure deploy

  • Formalize the existing bounded-recovery behavior under a clearer error-classification model.
  • Keep existing safe repairs first-class and visible in output:
    • missing Azure auth -> az login
    • stopped MySQL -> start and continue
    • soft-deleted Key Vault -> purge and continue
    • migration still warming up -> bounded wait with clear status text
  • Do not add silent infinite retries or broad catch-all retry loops.

Architecture Notes

  • Reuse the repo's existing assumption that local discovery/start/status know two well-known local port bundles (8080/3002/4000 and 8085/3004/4004) instead of inventing arbitrary port support.
  • Keep repair actions as deterministic Go code in the CLI.
  • Prefer a shared helper/classifier over scattering more substring matching across commands.
  • Ensure follow-on commands (status, start, discovery-driven configure flows) continue to work against whichever healthy endpoint actually comes up.

Likely Files

File Change
cmd/deploy_local.go Broaden local error classification, bounded port fallback, retry/output flow
cmd/deploy_azure.go Normalize bounded-recovery reporting and retry behavior
cmd/start.go Keep companion URL handling aligned with the fallback port bundle
cmd/status.go Ensure status output reflects the actual healthy local endpoint after fallback
internal/devlake/discovery.go Keep discovery aligned with supported local port bundles
docs/deploy.md Document deterministic recovery behavior and manual fallback steps
docs/status.md / README.md Update local endpoint and troubleshooting guidance if behavior changes

Acceptance Criteria

  • Local deploy recognizes the common Docker bind-error variants listed above and surfaces a friendly remediation path.
  • When the default local port bundle is unavailable, official/fork flows can recover via the alternate well-known bundle (8085/3004/4004) with a single bounded retry.
  • custom local deploy remains manual; no arbitrary port-management feature is introduced.
  • Recovery output clearly states: detected failure, chosen repair, retry attempt, and final outcome.
  • Azure deploy keeps existing safe repairs but reports them through a clearer bounded-recovery flow.
  • There are no unbounded or silent retry loops.
  • go build ./..., go test ./..., and go vet ./... pass.
  • Docs/help text are updated where behavior or guidance changes.

Dependencies

Blocks: #143 — preview agentic self-healing deployment workflow

Target Version

v0.4.1 — near-term deployment resilience and UX improvement that belongs with the post-v0.4.0 hardening line, even though it also supports the later AI operations roadmap.

References

  • cmd/deploy_local.go — current compose-up path and narrow port-conflict handling
  • cmd/deploy_azure.go — existing bounded Azure repairs already in place
  • cmd/start.go — current well-known local companion URL mapping
  • cmd/status.go — current endpoint discovery/status reporting
  • internal/devlake/discovery.go — current local endpoint discovery logic

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions