Description
Problem
`gh devlake deploy local` and `gh devlake deploy azure` already recover from a few known failure modes, but deployment resilience is still uneven.
Today the CLI can:
- auto-run `az login` when Azure auth is missing
- start a stopped MySQL server during Azure deploy
- purge a soft-deleted Key Vault before retrying Azure deployment
- save partial Azure state so cleanup still works after mid-flight failure
- print a friendlier message for some local Docker port conflicts
But local deploy recovery is still too narrow for real-world edge cases. In particular, Docker bind failures like:
- `ports are not available`
- `address already in use`
- `failed programming external connectivity`
can bypass the current friendly path, dump raw compose errors, and force the user to manually inspect and retry. The CLI also lacks a shared deploy-time error taxonomy and a consistent bounded retry story across local and Azure flows.
Proposed Solution
Add a deterministic deploy recovery layer for known failure classes. This issue is intentionally non-agentic and does not depend on Copilot SDK.
The CLI should:
- classify known deploy-time failures
- apply only safe, explicit repairs
- retry at most once
- print exactly what it detected, what repair it applied, and what happened next
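The four behaviors above could fit into one small helper. The following is only a sketch under assumptions: `FailureClass`, `Repair`, `runWithRecovery`, and `errBind` are hypothetical names invented for illustration and do not come from the existing CLI code.

```go
package main

import (
	"errors"
	"fmt"
)

// FailureClass is a hypothetical label for a known deploy-time failure.
type FailureClass string

const (
	ClassUnknown      FailureClass = "unknown"
	ClassPortConflict FailureClass = "port-conflict"
)

// Repair is a hypothetical safe, explicit fix for one failure class.
type Repair struct {
	Description string
	Apply       func() error
}

// errBind stands in for a real Docker bind failure in this sketch.
var errBind = errors.New("bind: address already in use")

// runWithRecovery runs deploy once. If it fails with a class that has a
// registered repair, it applies the repair and retries exactly once.
// The returned report lists what was detected, repaired, and the outcome.
func runWithRecovery(deploy func() error, classify func(error) FailureClass, repairs map[FailureClass]Repair) ([]string, error) {
	firstErr := deploy()
	if firstErr == nil {
		return []string{"outcome: deploy succeeded"}, nil
	}
	class := classify(firstErr)
	report := []string{fmt.Sprintf("detected: %s", class)}
	repair, known := repairs[class]
	if !known {
		return append(report, "outcome: no safe repair known, not retrying"), firstErr
	}
	report = append(report, "repair: "+repair.Description)
	if err := repair.Apply(); err != nil {
		return append(report, "outcome: repair failed, not retrying"), fmt.Errorf("repair for %s: %w", class, err)
	}
	report = append(report, "retry: attempt 1 of 1")
	if err := deploy(); err != nil {
		return append(report, "outcome: retry failed"), err
	}
	return append(report, "outcome: deploy succeeded after repair"), nil
}

func main() {
	attempts := 0
	deploy := func() error {
		attempts++
		if attempts == 1 {
			return errBind // first attempt hits the simulated conflict
		}
		return nil
	}
	classify := func(err error) FailureClass {
		if errors.Is(err, errBind) {
			return ClassPortConflict
		}
		return ClassUnknown
	}
	repairs := map[FailureClass]Repair{
		ClassPortConflict: {
			Description: "switch to the alternate port bundle",
			Apply:       func() error { return nil },
		},
	}
	report, err := runWithRecovery(deploy, classify, repairs)
	for _, line := range report {
		fmt.Println(line)
	}
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

Because there is no loop, the single-retry bound holds by construction: a failing repair or a failing second attempt returns immediately.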
Scope
Local deploy
- Expand Docker bind/port conflict detection to catch additional variants, including:
  - `ports are not available`
  - `address already in use`
  - `failed programming external connectivity`
- Keep arbitrary custom ports out of scope for now.
- Instead, support only the already-known alternate local port bundle:
  - backend: `8085`
  - grafana: `3004`
  - config UI: `4004`
- For official and fork local deploy flows, allow a bounded fallback from the default bundle (`8080/3002/4000`) to the alternate bundle (`8085/3004/4004`), then retry once.
- Preserve the current `custom` flow as a manual/user-controlled path.
- Improve remediation output when the CLI can identify the conflicting container or compose file.
Azure deploy
- Formalize the existing bounded-recovery behavior under a clearer error-classification model.
- Keep existing safe repairs first-class and visible in output:
  - missing Azure auth -> `az login`
  - stopped MySQL -> start and continue
  - soft-deleted Key Vault -> purge and continue
  - migration still warming up -> bounded wait with clear status text
- Do not add silent infinite retries or broad catch-all retry loops.
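One way to keep the "migration still warming up" case bounded and visible is a poll loop with a fixed attempt limit and a status callback. This is a minimal sketch: `waitBounded`, `errStillWarming`, and the `check` and `status` callbacks are hypothetical and do not reflect how `cmd/deploy_azure.go` is written today.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStillWarming is returned when the attempt limit is exhausted.
var errStillWarming = errors.New("migration still warming up after bounded wait")

// waitBounded calls check up to maxAttempts times, sleeping interval
// between attempts, and reports progress through status on every attempt.
func waitBounded(check func() (bool, error), maxAttempts int, interval time.Duration, status func(string)) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ready, err := check()
		if err != nil {
			return err // unexpected errors are surfaced, not retried
		}
		if ready {
			status(fmt.Sprintf("migration ready after %d check(s)", attempt))
			return nil
		}
		status(fmt.Sprintf("migration still warming up (check %d of %d)", attempt, maxAttempts))
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return errStillWarming
}

func main() {
	checks := 0
	readyOnThird := func() (bool, error) {
		checks++
		return checks >= 3, nil
	}
	err := waitBounded(readyOnThird, 5, 10*time.Millisecond, func(msg string) {
		fmt.Println(msg)
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

The caller chooses both the attempt limit and the interval, so the total wait has an explicit upper bound and every attempt produces a status line.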
Architecture Notes
- Reuse the repo's existing assumption that local discovery/start/status know two well-known local port bundles (`8080/3002/4000` and `8085/3004/4004`) instead of inventing arbitrary port support.
- Keep repair actions as deterministic Go code in the CLI.
- Prefer a shared helper/classifier over scattering more substring matching across commands.
- Ensure follow-on commands (`status`, `start`, discovery-driven configure flows) continue to work against whichever healthy endpoint actually comes up.
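To illustrate the last note, follow-on commands could resolve the active endpoint by probing the two well-known bundles in order. The sketch below assumes a hypothetical `healthy` probe and `firstHealthy` helper; the actual logic in `internal/devlake/discovery.go` is not shown in this issue and may differ.

```go
package main

import "fmt"

// PortBundle is a hypothetical grouping of the three local ports.
type PortBundle struct {
	Backend, Grafana, ConfigUI int
}

// knownBundles lists the two well-known local bundles, default first.
var knownBundles = []PortBundle{
	{Backend: 8080, Grafana: 3002, ConfigUI: 4000},
	{Backend: 8085, Grafana: 3004, ConfigUI: 4004},
}

// firstHealthy returns the first bundle whose backend port passes the
// caller-supplied health probe, or false when none does.
func firstHealthy(bundles []PortBundle, healthy func(backendPort int) bool) (PortBundle, bool) {
	for _, bundle := range bundles {
		if healthy(bundle.Backend) {
			return bundle, true
		}
	}
	return PortBundle{}, false
}

func main() {
	// Simulate a deploy that fell back to the alternate bundle.
	onlyAlternateUp := func(port int) bool { return port == 8085 }
	if bundle, ok := firstHealthy(knownBundles, onlyAlternateUp); ok {
		fmt.Printf("backend: http://localhost:%d\n", bundle.Backend)
		fmt.Printf("grafana: http://localhost:%d\n", bundle.Grafana)
		fmt.Printf("config UI: http://localhost:%d\n", bundle.ConfigUI)
	}
}
```

Passing the probe in as a function keeps the bundle selection testable without a running Docker stack.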
Likely Files
| File | Change |
|---|---|
| `cmd/deploy_local.go` | Broaden local error classification, bounded port fallback, retry/output flow |
| `cmd/deploy_azure.go` | Normalize bounded-recovery reporting and retry behavior |
| `cmd/start.go` | Keep companion URL handling aligned with the fallback port bundle |
| `cmd/status.go` | Ensure status output reflects the actual healthy local endpoint after fallback |
| `internal/devlake/discovery.go` | Keep discovery aligned with supported local port bundles |
| `docs/deploy.md` | Document deterministic recovery behavior and manual fallback steps |
| `docs/status.md` / `README.md` | Update local endpoint and troubleshooting guidance if behavior changes |
Acceptance Criteria
- Local deploy recognizes the common Docker bind-error variants listed above and surfaces a friendly remediation path.
- When the default local port bundle is unavailable, official/fork flows can recover via the alternate well-known bundle (`8085/3004/4004`) with a single bounded retry.
- `custom` local deploy remains manual; no arbitrary port-management feature is introduced.
- Recovery output clearly states: detected failure, chosen repair, retry attempt, and final outcome.
- Azure deploy keeps existing safe repairs but reports them through a clearer bounded-recovery flow.
- There are no unbounded or silent retry loops.
- `go build ./...`, `go test ./...`, and `go vet ./...` pass.
- Docs/help text are updated where behavior or guidance changes.
Dependencies
Blocks: #143 — preview agentic self-healing deployment workflow
Target Version
v0.4.1 — near-term deployment resilience and UX improvement that belongs with the post-v0.4.0 hardening line, even though it also supports the later AI operations roadmap.
References
- `cmd/deploy_local.go`: current compose-up path and narrow port-conflict handling
- `cmd/deploy_azure.go`: existing bounded Azure repairs already in place
- `cmd/start.go`: current well-known local companion URL mapping
- `cmd/status.go`: current endpoint discovery/status reporting
- `internal/devlake/discovery.go`: current local endpoint discovery logic