
fix(deploy): three real bugs caught running the full docker stack e2e #12

Merged

acedatacloud-dev merged 1 commit into main from fix/docker-stack-end-to-end on May 4, 2026

Conversation

@acedatacloud-dev
Member

Why

PR #11 shipped the Docker + K8s deploy scaffold, but I only verified that docker compose build succeeded. Bringing the stack up with docker compose up -d and walking the full Phantom-auth → create-vault → MCP-session flow against the live stack surfaced three real bugs. This PR fixes all three; the same flow now succeeds end-to-end.

Bugs and fixes

1. uvicorn binary not on $PATH

pip install --target=/install in the API Dockerfile's builder stage skips bin/ scripts, so the runtime image's CMD ["uvicorn", "api.app:app", ...] crashed on start with:

exec: "uvicorn": executable file not found in $PATH: unknown

Fix: invoke via python -m uvicorn so we don't depend on bin shims surviving the --target install.
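
For reference, a quick way to see the failure mode from inside the runtime image (a minimal sketch; it only assumes uvicorn was installed via the --target path described above):

```python
# Sketch: after `pip install --target=/install uvicorn`, the package is importable
# but no bin/ console script landed on $PATH — hence the switch to `python -m uvicorn`.
import importlib.util
import shutil

print(importlib.util.find_spec("uvicorn") is not None)  # True: module was copied
print(shutil.which("uvicorn"))                           # None: no entry-point shim
```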

2. Postgres tables never created on first boot

The lifespan hook only ran init_models() for SQLite URLs. Production runs Alembic, so that gate was right for prod but wrong for docker compose up, which is a developer-facing convenience that uses the same Postgres engine as prod. The first MCP session call failed with:

UndefinedTableError: relation "mcp_sessions" does not exist

Fix: also run init_models() when APP_ENV != "production". The SQLite path is unchanged, and production still bypasses it (Alembic owns the schema there).
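
Roughly, the new gate looks like the sketch below. It uses the names from this description (init_models, APP_ENV, DATABASE_URL); the real lifespan hook in the API app may read them from a settings object instead:

```python
# Sketch of the relaxed lifespan gate (names assumed from the description above;
# the actual api/ module layout may differ).
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI

async def init_models() -> None:
    """Placeholder: the real version creates tables from the SQLAlchemy metadata."""

@asynccontextmanager
async def lifespan(app: FastAPI):
    database_url = os.environ.get("DATABASE_URL", "sqlite+aiosqlite:///./dev.db")
    app_env = os.environ.get("APP_ENV", "development")
    # Old gate: SQLite only. New gate: also any non-production env, so the Postgres
    # started by `docker compose up` gets its tables on first boot; production
    # stays on Alembic.
    if database_url.startswith("sqlite") or app_env != "production":
        await init_models()
    yield

app = FastAPI(lifespan=lifespan)
```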

3. CONNECTION_VAULT_KEY YAML-parsed as integer 0

docker-compose.yaml had:

CONNECTION_VAULT_KEY: 0000000000000000000000000000000000000000000000000000000000000000

YAML treats a long all-zero numeric as an int, so the container saw CONNECTION_VAULT_KEY=0 and the crypto module crashed with:

RuntimeError: CONNECTION_VAULT_KEY must be hex

Fix: quote the value as a string. The .env.example was already correct because dotenv files are plain text.
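
The parse is easy to reproduce outside compose; a minimal sketch with PyYAML (compose uses its own YAML library, but an unquoted run of zeros resolves to an int there as well):

```python
# Demo: the unquoted 64-zero scalar resolves to the int 0; quoting it keeps the
# hex string intact.
import yaml

zeros = "0" * 64
print(yaml.safe_load(f"CONNECTION_VAULT_KEY: {zeros}"))    # {'CONNECTION_VAULT_KEY': 0}
print(yaml.safe_load(f'CONNECTION_VAULT_KEY: "{zeros}"'))  # 64-char string preserved
```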

Verification — full stack walked end-to-end

$ docker compose up -d --build
$ python e2e_smoke.py
  AUTH_OK              challenge → Ed25519 sign → session token
  CREATE_OK            vault_pda=6kPj8M1d... tx_b64_len=756
  LIST_OK              count=1
  MCP_SESSION_OK       /mcp/EOtTcPBh... bound to vault
  MCP_TOOLS_LIST_OK    4 tools: aceguard_balance, aceguard_history,
                       aceguard_spend, aceguard_pay_for_api

$ cd api && PYTHONPATH=.. pytest tests/
  35 passed in 0.53s

The backend test suite is unchanged because the lifespan-hook change is a pure relaxation (more environments now run init_models; none stop), and the tests use SQLite, which already took the SQLite branch.

Out of scope

Real Alembic migrations for production. Tracked separately; not on the hackathon critical path, because the production deploy runs the same lifespan hook and APP_ENV=production keeps it off.

acedatacloud-dev merged commit b988cbf into main on May 4, 2026
2 checks passed
acedatacloud-dev deleted the fix/docker-stack-end-to-end branch on May 4, 2026 at 18:06
acedatacloud-dev added a commit that referenced this pull request May 4, 2026
…site is live (#14)

PR #11/#12/#13 shipped manifests modeled on a generic K8s setup. None of
those actually fit the AceDataCloud TKE cluster + nginx-router ingress
+ wildcard-cert convention, so when the user opened
https://x402guard.acedata.cloud/ they got a "Kubernetes Ingress
Controller Fake Certificate" + 404 (the LB had no rule for the host).

This PR aligns everything with the platform's conventions and the site
is now live at https://x402guard.acedata.cloud/ with a real Let's
Encrypt cert from the existing tls-wildcard-acedata-cloud secret.

Conventions adopted (matching Wisdom + Nexior + MCPs/* in this org):

  namespace                 acedatacloud (was: x402guard)
  ingress class             annotation kubernetes.io/ingress.class:
                            nginx-router (was: ingressClassName: nginx)
  TLS secret                tls-wildcard-acedata-cloud, already in the
                            cluster, signed *.acedata.cloud (was:
                            x402guard-tls + cert-manager annotation)
  image-pull secret         docker-registry, already in the namespace
                            (was: missing imagePullSecrets entirely)
  build tag                 ${TAG} substituted by sed in deploy/run.sh
                            (was: __BUILD__)
  service names             x402guard-api / x402guard-web — qualified
                            with project prefix to avoid colliding with
                            other tenants in acedatacloud namespace
                            (was: api / web)
  storage class             cbs-ssd (WaitForFirstConsumer, 10Gi minimum)
                            (was: cbs default — fails to bind because
                            cbs is Immediate-binding zone-pinned)

What changes:

  deploy/production/
    namespace.yaml             DELETED (use existing acedatacloud ns)
    configmap.yaml             DELETED (env values inlined into Deployment)
    api.yaml                   namespace + names + imagePullSecrets +
                               annotation; ${TAG} placeholder
    web.yaml                   same
    ingress.yaml               nginx-router annotation;
                               tls-wildcard-acedata-cloud;
                               5 path rules (/api, /mcp, /.well-known,
                               /health, /) all on a single Ingress
    postgres.yaml              NEW — single-replica StatefulSet on cbs-ssd
                               with a 10Gi PVC. POSTGRES_PASSWORD reads
                               from the same x402guard-secrets the api
                               consumes. Cluster has no shared Postgres
                               so x402guard hosts its own.

  deploy/run.sh                Sed ${TAG} -> $BUILD_NUMBER + apply 4 yaml
                               in order; rollout wait + /health probe.
                               Bails clearly if the secret is missing.

  docker-compose.yaml          Service names renamed
                               api -> x402guard-api / web -> x402guard-web
                               so the nginx upstream `x402guard-api`
                               works in both docker-compose and K8s
                               without separate configs.

  web/deploy/nginx.conf        proxy_pass updated to http://x402guard-api:8000
                               in all 4 locations.

Live verification (against https://x402guard.acedata.cloud/):

  $ curl -sS https://x402guard.acedata.cloud/health
    {"status":"ok","version":"0.1.0"}
  $ curl -sS https://x402guard.acedata.cloud/.well-known/x402guard
    {"service":"x402guard","version":"0.1.0","cluster":"mainnet",
     "agent_vault_program_id":"5s9rscxc...","usdc_mint":"EPjFWdd5..."}
  $ curl -sS https://x402guard.acedata.cloud/ | grep '<title>'
    <title>x402guard - Solana-native AI agent wallets</title>
  $ openssl s_client ... | openssl x509 -noout -subject -issuer
    subject=CN=acedata.cloud
    issuer=Let's Encrypt E8

  Pods (kubectl -n acedatacloud get pods -l app=x402guard):
    x402guard-api-79c7d796b7-cdlpd   1/1 Running
    x402guard-api-79c7d796b7-f9mpc   1/1 Running
    x402guard-postgres-0             1/1 Running
    x402guard-web-5869d7cd49-29772   1/1 Running
    x402guard-web-5869d7cd49-zvgcb   1/1 Running

Bugs caught while bringing the cluster live (not in this PR but worth
recording so the next deploy doesn't hit them again):

  - Initial image push was darwin/arm64 because docker compose build
    uses the host arch on macOS. The cluster is amd64 -> CrashLoopBackOff
    with "exec format error". Fix: build with docker buildx --platform linux/amd64.
    The CI workflow .github/workflows/deploy.yaml already does this
    via docker/build-push-action which defaults to linux/amd64, but
    the local-deploy fallback path needs the explicit platform flag.

  - cbs storage class is Immediate-binding zone-pinned and our cluster
    happened to have no spare capacity in the picked zone, so PVCs
    stayed Pending. cbs-ssd uses WaitForFirstConsumer and binds in
    the same zone the pod actually scheduled into.

  - cbs-ssd minimum disk size is 10Gi (Tencent Cloud limit). 5Gi
    requests fail with "disk size is invalid. Must in [10, 32000]".

Out of scope:
  - The CI workflow .github/workflows/deploy.yaml doesn't run yet
    (DEPLOY_TO_K8S repo var unset). This first deploy was driven from
    a workstation using the kubeconfig pulled via .claude/scripts/tke.py.
    Subsequent deploys will go through CI once the cluster credentials
    are loaded into the GHCR-secrets vault.