fix(deploy): three real bugs caught running the full docker stack e2e #12
Merged
PR #11 shipped the Dockerfiles + docker-compose + K8s manifests untested end-to-end (only `docker compose build` had succeeded). Bringing the stack up with `docker compose up -d` and walking the full Phantom auth -> create vault -> MCP session flow against it surfaced three real bugs. This PR fixes all three; the same flow now succeeds end-to-end.

### Bug 1: `uvicorn` binary not on $PATH

`pip install --target=/install` (in the api/Dockerfile builder stage) skips `bin/` scripts. The runtime container ran:

```
CMD ["uvicorn", "api.app:app", ...]
```

and crashed:

```
exec: "uvicorn": executable file not found in $PATH: unknown
```

Fix: invoke via `python -m uvicorn` so we don't depend on bin shims surviving the `--target` install. (A short illustration of the shim-vs-module distinction is at the end of this description.)

### Bug 2: Postgres tables never created on first boot

The lifespan hook only ran `init_models()` when `DATABASE_URL` started with `"sqlite"`. Production uses Alembic, so that gate was right for prod but wrong for `docker compose up` (which uses Postgres just like prod, but is a developer-facing convenience). The first MCP session call blew up:

```
UndefinedTableError: relation "mcp_sessions" does not exist
```

Fix: also run `init_models()` when `APP_ENV != "production"`. The SQLite path stays unchanged, and production still bypasses it (Alembic owns the schema). A sketch of the relaxed gate is also at the end.

### Bug 3: `CONNECTION_VAULT_KEY` YAML-parsed as integer 0

docker-compose.yaml had:

```yaml
CONNECTION_VAULT_KEY: 0000000000000000000000000000000000000000000000000000000000000000
```

YAML treats long all-zero numerics as an int, so the container env was literally `CONNECTION_VAULT_KEY=0`, and the crypto module bailed:

```
RuntimeError: CONNECTION_VAULT_KEY must be hex
```

Fix: quote the value as a string. Same fix `.env.example` was already getting right, because .env files are pure text. (A two-line repro of the parse behavior is at the end as well.)

### Verification — full stack

```
$ docker compose up -d --build
$ python e2e_smoke.py
AUTH_OK challenge -> Ed25519 sign -> session
CREATE_OK vault_pda=6kPj8M1d... tx_b64_len=756
LIST_OK count=1
MCP_SESSION_OK /mcp/EOtTcPBh... bound to vault
MCP_TOOLS_LIST_OK 4 tools: aceguard_balance, aceguard_history, aceguard_spend, aceguard_pay_for_api

$ cd api && PYTHONPATH=.. .venv/bin/python -m pytest tests/
35 passed in 0.53s
```

The 35-case backend test suite stays green because the lifespan hook change is a pure relaxation (broader cases run `init_models`, narrower cases unchanged). Tests use SQLite, which already triggered the SQLite branch.

### Out of scope

- Real Alembic migrations for production. Tracked separately; not on the hackathon critical path, because the production deploy runs the same lifespan hook and `APP_ENV=production` keeps it off.
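For the record, here is why the `python -m` form is immune to the `--target` problem. This is a stdlib-only illustration, not code from this repo: `pip install --target` copies the importable package onto the target but skips the `bin/` console-script shims, so the shim lookup fails while the module lookup still succeeds.

```python
# Illustration (not repo code): after a `pip install --target` style
# install, the console-script shim is missing but the module is not.
import importlib.util
import shutil

# The bin/ shim that a bare `uvicorn` in CMD relies on:
print("uvicorn on $PATH:", shutil.which("uvicorn") is not None)

# The importable package, which is all `python -m uvicorn` needs:
print("uvicorn importable:", importlib.util.find_spec("uvicorn") is not None)
```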
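And a minimal sketch of the relaxed lifespan gate from Bug 2, assuming FastAPI is installed. `init_models()` and the config plumbing here are stand-ins for this repo's internals, not the actual code:

```python
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI

# Stand-ins for the repo's real settings (names are illustrative).
DATABASE_URL = os.environ.get("DATABASE_URL", "sqlite+aiosqlite:///./dev.db")
APP_ENV = os.environ.get("APP_ENV", "development")

async def init_models() -> None:
    """Stand-in for the real helper (create_all() against the engine)."""

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Old gate: DATABASE_URL.startswith("sqlite") only.
    # New gate: any non-production env also gets tables created at boot,
    # so `docker compose up` (Postgres, developer-facing) works first try.
    # Production still skips this; Alembic owns the schema there.
    if DATABASE_URL.startswith("sqlite") or APP_ENV != "production":
        await init_models()
    yield

app = FastAPI(lifespan=lifespan)
```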
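The int coercion from Bug 3 is easy to reproduce. This is the PyYAML equivalent (PyYAML assumed installed); Compose's Go parser evidently resolved it the same way under YAML 1.1 rules, given the `CONNECTION_VAULT_KEY=0` the container actually saw:

```python
import yaml  # pip install pyyaml

key = "0" * 64

# Unquoted: a leading-zero digit run resolves as an int under YAML 1.1.
print(yaml.safe_load(f"CONNECTION_VAULT_KEY: {key}"))
# {'CONNECTION_VAULT_KEY': 0}  -> the container env became "0"

# Quoted: stays a 64-char string, which the hex check accepts.
print(yaml.safe_load(f'CONNECTION_VAULT_KEY: "{key}"'))
# {'CONNECTION_VAULT_KEY': '000...000'}  (64 zeros, as a str)
```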
acedatacloud-dev added a commit that referenced this pull request on May 4, 2026:
…site is live (#14)

PR #11/#12/#13 shipped manifests modeled on a generic K8s setup. None of those actually fit the AceDataCloud TKE cluster + nginx-router ingress + wildcard-cert convention, so when the user opened https://x402guard.acedata.cloud/ they got a "Kubernetes Ingress Controller Fake Certificate" and a 404 (the LB had no rule for the host). This PR aligns everything with the platform's conventions, and the site is now live at https://x402guard.acedata.cloud/ with a real Let's Encrypt cert from the existing tls-wildcard-acedata-cloud secret.

### Conventions adopted (matching Wisdom + Nexior + MCPs/* in this org)

| Setting | Now | Was |
| --- | --- | --- |
| namespace | `acedatacloud` | `x402guard` |
| ingress class | annotation `kubernetes.io/ingress.class: nginx-router` | `ingressClassName: nginx` |
| TLS secret | `tls-wildcard-acedata-cloud`, already in the cluster, signed `*.acedata.cloud` | `x402guard-tls` + cert-manager annotation |
| image-pull secret | `docker-registry`, already in the namespace | missing `imagePullSecrets` entirely |
| build tag | `${TAG}`, substituted by sed in deploy/run.sh | `__BUILD__` |
| service names | `x402guard-api` / `x402guard-web`, prefixed to avoid colliding with other tenants in the shared namespace | `api` / `web` |
| storage class | `cbs-ssd` (WaitForFirstConsumer, 10Gi minimum) | `cbs` default, which fails to bind because `cbs` is Immediate-binding and zone-pinned |

### What changes

- deploy/production/namespace.yaml: DELETED (use the existing acedatacloud ns)
- deploy/production/configmap.yaml: DELETED (env values inlined into the Deployment)
- deploy/production/api.yaml: namespace + names + imagePullSecrets + annotation; `${TAG}` placeholder
- deploy/production/web.yaml: same
- deploy/production/ingress.yaml: nginx-router annotation; tls-wildcard-acedata-cloud; 5 path rules (/api, /mcp, /.well-known, /health, /) all on a single Ingress
- deploy/production/postgres.yaml: NEW — single-replica StatefulSet on cbs-ssd with a 10Gi PVC. POSTGRES_PASSWORD reads from the same x402guard-secrets the api consumes. The cluster has no shared Postgres, so x402guard hosts its own.
- deploy/run.sh: sed `${TAG}` -> `$BUILD_NUMBER`, apply the 4 yamls in order, then rollout wait + /health probe (a stdlib sketch of that probe follows the Out of scope note below). Bails clearly if the secret is missing.
- docker-compose.yaml: services renamed api -> x402guard-api / web -> x402guard-web so the nginx upstream `x402guard-api` works in both docker-compose and K8s without separate configs.
- web/deploy/nginx.conf: proxy_pass updated to http://x402guard-api:8000 in all 4 locations.

### Live verification (against https://x402guard.acedata.cloud/)

```
$ curl -sS https://x402guard.acedata.cloud/health
{"status":"ok","version":"0.1.0"}

$ curl -sS https://x402guard.acedata.cloud/.well-known/x402guard
{"service":"x402guard","version":"0.1.0","cluster":"mainnet",
 "agent_vault_program_id":"5s9rscxc...","usdc_mint":"EPjFWdd5..."}

$ curl -sS https://x402guard.acedata.cloud/ | grep '<title>'
<title>x402guard - Solana-native AI agent wallets</title>

$ openssl s_client ... | openssl x509 -noout -subject -issuer
subject=CN=acedata.cloud
issuer=Let's Encrypt E8
```

Pods (`kubectl -n acedatacloud get pods -l app=x402guard`):

```
x402guard-api-79c7d796b7-cdlpd   1/1   Running
x402guard-api-79c7d796b7-f9mpc   1/1   Running
x402guard-postgres-0             1/1   Running
x402guard-web-5869d7cd49-29772   1/1   Running
x402guard-web-5869d7cd49-zvgcb   1/1   Running
```

### Bugs caught while bringing the cluster live

Not fixed in this PR, but worth recording so the next deploy doesn't hit them again:

- The initial image push was darwin/arm64, because `docker compose build` uses the host arch on macOS and the cluster is amd64 -> CrashLoopBackOff with "exec format error". Fix: build with `docker buildx build --platform linux/amd64`. The CI workflow .github/workflows/deploy.yaml already does this via docker/build-push-action, which defaults to linux/amd64, but the local-deploy fallback path needs the explicit platform flag. (A pre-push arch guard is sketched after the Out of scope note.)
- The cbs storage class is Immediate-binding and zone-pinned, and our cluster happened to have no spare capacity in the picked zone, so PVCs stayed Pending. cbs-ssd uses WaitForFirstConsumer and binds in whatever zone the pod actually scheduled into.
- The cbs-ssd minimum disk size is 10Gi (a Tencent Cloud limit). 5Gi requests fail with "disk size is invalid. Must in [10, 32000]".

### Out of scope

- The CI workflow .github/workflows/deploy.yaml doesn't run yet (DEPLOY_TO_K8S repo var unset). This first deploy was driven from a workstation using the kubeconfig pulled via .claude/scripts/tke.py. Subsequent deploys will go through CI once the cluster credentials are loaded into the GHCR-secrets vault.
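For reference, a stdlib-only sketch of the rollout /health probe that deploy/run.sh ends with. The URL is the live endpoint verified above; the retry timing is illustrative, not what run.sh actually uses:

```python
import json
import sys
import time
import urllib.request

URL = "https://x402guard.acedata.cloud/health"

def wait_healthy(url: str = URL, timeout_s: float = 120, interval_s: float = 5) -> None:
    """Poll the health endpoint until it reports ok or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                body = json.load(resp)
            if body.get("status") == "ok":
                print(f"healthy: {body}")
                return
        except (OSError, ValueError) as exc:  # connection, TLS, HTTP, bad JSON
            print(f"not ready yet: {exc}")
        time.sleep(interval_s)
    sys.exit(f"no healthy response from {url} within {timeout_s}s")

if __name__ == "__main__":
    wait_healthy()
```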
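And a small pre-push guard against the darwin/arm64 mistake recurring. This is a suggestion rather than repo code; the image tag is a placeholder, and it only shells out to `docker image inspect`:

```python
import subprocess
import sys

IMAGE = "x402guard-api:latest"  # placeholder tag; substitute the real one
WANT = "linux/amd64"            # what the TKE nodes run

def image_platform(image: str) -> str:
    """Read os/arch off the locally built image via `docker image inspect`."""
    out = subprocess.run(
        ["docker", "image", "inspect", image,
         "--format", "{{.Os}}/{{.Architecture}}"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

if __name__ == "__main__":
    got = image_platform(IMAGE)
    if got != WANT:
        sys.exit(f"{IMAGE} is {got}, not {WANT}: rebuild with "
                 f"`docker buildx build --platform {WANT}` before pushing")
    print(f"{IMAGE} is {got}, safe to push")
```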