
fix(deploy): three real bugs caught running the full docker stack e2e #12

Merged

acedatacloud-dev merged 1 commit into main from fix/docker-stack-end-to-end on May 4, 2026

Conversation

@acedatacloud-dev
Member

Why

PR #11 shipped the Docker + K8s deploy scaffold, but I only verified that docker compose build succeeded. Bringing the stack up with docker compose up -d and walking the full Phantom-auth → create-vault → MCP-session flow against the live stack surfaced three real bugs. This PR fixes all three; the same flow now succeeds end-to-end.

Bugs and fixes

1. uvicorn binary not on $PATH

pip install --target=/install in the API Dockerfile's builder stage skips bin/ scripts, so the runtime image's CMD ["uvicorn", "api.app:app", ...] crashed on start with:

exec: "uvicorn": executable file not found in $PATH: unknown

Fix: invoke via python -m uvicorn so we don't depend on bin shims surviving the --target install.
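
For reference, a quick way to see the failure mode from inside the runtime image (a minimal sketch; it only assumes uvicorn was installed via the --target path described above):

```python
# Sketch: after `pip install --target=/install uvicorn`, the package is importable
# but no bin/ console script landed on $PATH — hence the switch to `python -m uvicorn`.
import importlib.util
import shutil

print(importlib.util.find_spec("uvicorn") is not None)  # True: module was copied
print(shutil.which("uvicorn"))                           # None: no entry-point shim
```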

2. Postgres tables never created on first boot

The lifespan hook only ran init_models() for SQLite URLs. Production runs Alembic, so that gate was right for prod but wrong for docker compose up, which is a developer-facing convenience that uses the same Postgres engine as prod. The first MCP session call failed with:

UndefinedTableError: relation "mcp_sessions" does not exist

Fix: also run init_models() when APP_ENV != "production". The SQLite path is unchanged, and production still bypasses it (Alembic owns the schema there).
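
Roughly, the new gate looks like the sketch below. It uses the names from this description (init_models, APP_ENV, DATABASE_URL); the real lifespan hook in the API app may read them from a settings object instead:

```python
# Sketch of the relaxed lifespan gate (names assumed from the description above;
# the actual api/ module layout may differ).
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI

async def init_models() -> None:
    """Placeholder: the real version creates tables from the SQLAlchemy metadata."""

@asynccontextmanager
async def lifespan(app: FastAPI):
    database_url = os.environ.get("DATABASE_URL", "sqlite+aiosqlite:///./dev.db")
    app_env = os.environ.get("APP_ENV", "development")
    # Old gate: SQLite only. New gate: also any non-production env, so the Postgres
    # started by `docker compose up` gets its tables on first boot; production
    # stays on Alembic.
    if database_url.startswith("sqlite") or app_env != "production":
        await init_models()
    yield

app = FastAPI(lifespan=lifespan)
```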

3. CONNECTION_VAULT_KEY YAML-parsed as integer 0

docker-compose.yaml had:

CONNECTION_VAULT_KEY: 0000000000000000000000000000000000000000000000000000000000000000

YAML treats a long all-zero numeric as an int, so the container saw CONNECTION_VAULT_KEY=0 and the crypto module crashed with:

RuntimeError: CONNECTION_VAULT_KEY must be hex

Fix: quote the value as a string. The .env.example was already correct because dotenv files are plain text.
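
The parse is easy to reproduce outside compose; a minimal sketch with PyYAML (compose uses its own YAML library, but an unquoted run of zeros resolves to an int there as well):

```python
# Demo: the unquoted 64-zero scalar resolves to the int 0; quoting it keeps the
# hex string intact.
import yaml

zeros = "0" * 64
print(yaml.safe_load(f"CONNECTION_VAULT_KEY: {zeros}"))    # {'CONNECTION_VAULT_KEY': 0}
print(yaml.safe_load(f'CONNECTION_VAULT_KEY: "{zeros}"'))  # 64-char string preserved
```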

Verification — full stack walked end-to-end

$ docker compose up -d --build
$ python e2e_smoke.py
  AUTH_OK              challenge → Ed25519 sign → session token
  CREATE_OK            vault_pda=6kPj8M1d... tx_b64_len=756
  LIST_OK              count=1
  MCP_SESSION_OK       /mcp/EOtTcPBh... bound to vault
  MCP_TOOLS_LIST_OK    4 tools: aceguard_balance, aceguard_history,
                       aceguard_spend, aceguard_pay_for_api

$ cd api && PYTHONPATH=.. pytest tests/
  35 passed in 0.53s

The backend test suite is unchanged because the lifespan-hook change is a pure relaxation (more environments now run init_models; none stop), and the tests use SQLite, which already took the SQLite branch.

Out of scope

Real Alembic migrations for production. Tracked separately; not on the hackathon critical path, because the production deploy runs the same lifespan hook and APP_ENV=production keeps it off.

acedatacloud-dev merged commit b988cbf into main on May 4, 2026
2 checks passed
acedatacloud-dev deleted the fix/docker-stack-end-to-end branch on May 4, 2026 at 18:06
acedatacloud-dev added a commit that referenced this pull request May 4, 2026
…site is live (#14)

PR #11/#12/#13 shipped manifests modeled on a generic K8s setup. None of
those actually fit the AceDataCloud TKE cluster + nginx-router ingress
+ wildcard-cert convention, so when the user opened
https://x402guard.acedata.cloud/ they got a "Kubernetes Ingress
Controller Fake Certificate" + 404 (the LB had no rule for the host).

This PR aligns everything with the platform's conventions and the site
is now live at https://x402guard.acedata.cloud/ with a real Let's
Encrypt cert from the existing tls-wildcard-acedata-cloud secret.

Conventions adopted (matching Wisdom + Nexior + MCPs/* in this org):

  namespace                 acedatacloud (was: x402guard)
  ingress class             annotation kubernetes.io/ingress.class:
                            nginx-router (was: ingressClassName: nginx)
  TLS secret                tls-wildcard-acedata-cloud, already in the
                            cluster, signed *.acedata.cloud (was:
                            x402guard-tls + cert-manager annotation)
  image-pull secret         docker-registry, already in the namespace
                            (was: missing imagePullSecrets entirely)
  build tag                 ${TAG} substituted by sed in deploy/run.sh
                            (was: __BUILD__)
  service names             x402guard-api / x402guard-web — qualified
                            with project prefix to avoid colliding with
                            other tenants in acedatacloud namespace
                            (was: api / web)
  storage class             cbs-ssd (WaitForFirstConsumer, 10Gi minimum)
                            (was: cbs default — fails to bind because
                            cbs is Immediate-binding zone-pinned)

What changes:

  deploy/production/
    namespace.yaml             DELETED (use existing acedatacloud ns)
    configmap.yaml             DELETED (env values inlined into Deployment)
    api.yaml                   namespace + names + imagePullSecrets +
                               annotation; ${TAG} placeholder
    web.yaml                   same
    ingress.yaml               nginx-router annotation;
                               tls-wildcard-acedata-cloud;
                               5 path rules (/api, /mcp, /.well-known,
                               /health, /) all on a single Ingress
    postgres.yaml              NEW — single-replica StatefulSet on cbs-ssd
                               with a 10Gi PVC. POSTGRES_PASSWORD reads
                               from the same x402guard-secrets the api
                               consumes. Cluster has no shared Postgres
                               so x402guard hosts its own.

  deploy/run.sh                Sed ${TAG} -> $BUILD_NUMBER + apply 4 yaml
                               in order; rollout wait + /health probe.
                               Bails clearly if the secret is missing.

  docker-compose.yaml          Service names renamed
                               api -> x402guard-api / web -> x402guard-web
                               so the nginx upstream `x402guard-api`
                               works in both docker-compose and K8s
                               without separate configs.

  web/deploy/nginx.conf        proxy_pass updated to http://x402guard-api:8000
                               in all 4 locations.

Live verification (against https://x402guard.acedata.cloud/):

  $ curl -sS https://x402guard.acedata.cloud/health
    {"status":"ok","version":"0.1.0"}
  $ curl -sS https://x402guard.acedata.cloud/.well-known/x402guard
    {"service":"x402guard","version":"0.1.0","cluster":"mainnet",
     "agent_vault_program_id":"5s9rscxc...","usdc_mint":"EPjFWdd5..."}
  $ curl -sS https://x402guard.acedata.cloud/ | grep '<title>'
    <title>x402guard - Solana-native AI agent wallets</title>
  $ openssl s_client ... | openssl x509 -noout -subject -issuer
    subject=CN=acedata.cloud
    issuer=Let's Encrypt E8

  Pods (kubectl -n acedatacloud get pods -l app=x402guard):
    x402guard-api-79c7d796b7-cdlpd   1/1 Running
    x402guard-api-79c7d796b7-f9mpc   1/1 Running
    x402guard-postgres-0             1/1 Running
    x402guard-web-5869d7cd49-29772   1/1 Running
    x402guard-web-5869d7cd49-zvgcb   1/1 Running

Bugs caught while bringing the cluster live (not in this PR but worth
recording so the next deploy doesn't hit them again):

  - Initial image push was darwin/arm64 because docker compose build
    uses the host arch on macOS. The cluster is amd64 -> CrashLoopBackOff
    with "exec format error". Fix: build with docker buildx --platform linux/amd64.
    The CI workflow .github/workflows/deploy.yaml already does this
    via docker/build-push-action which defaults to linux/amd64, but
    the local-deploy fallback path needs the explicit platform flag.

  - cbs storage class is Immediate-binding zone-pinned and our cluster
    happened to have no spare capacity in the picked zone, so PVCs
    stayed Pending. cbs-ssd uses WaitForFirstConsumer and binds in
    the same zone the pod actually scheduled into.

  - cbs-ssd minimum disk size is 10Gi (Tencent Cloud limit). 5Gi
    requests fail with "disk size is invalid. Must in [10, 32000]".

Out of scope:
  - The CI workflow .github/workflows/deploy.yaml doesn't run yet
    (DEPLOY_TO_K8S repo var unset). This first deploy was driven from
    a workstation using the kubeconfig pulled via .claude/scripts/tke.py.
    Subsequent deploys will go through CI once the cluster credentials
    are loaded into the GHCR-secrets vault.