CredentialEngine · arielr-lt · May 20, 2026 · May 20, 2026
diff --git a/CI-CD.md b/CI-CD.md
@@ -38,6 +38,67 @@ This mirrors the environment model in the xTRA Design Document: `DEVELOPMENT →
 
 ---
 
+## What happens when you merge to `main`
+
+Concrete walkthrough. Say you merge `feat/something` at `10:00:00`. The commit sha is `abc1234…`.
+
+| Time | Event | Why |
+|---|---|---|
+| `10:00:00` | Merge to `main` lands | GitHub fires a `push` event |
+| `10:00:05` | `Release` workflow starts | Its trigger is `push: branches: [main]` |
+| `10:00:05` | `Build Base Image` evaluates trigger | Only fires if `base.Dockerfile` changed in this commit. Usually skipped |
+| `~10:06:00` | `Release` finishes | Three jobs ran: `lint` and `test` non-blocking (`continue-on-error: true`), `publish` independently built+pushed `ctdl-xtra-test/{api,worker}:sha-abc1234` and `:main-latest` |
+| `10:06:05` | `Deploy to TEST` auto-fires | Triggered by `workflow_run` event from Release. Its `if:` checks `conclusion == 'success'` and proceeds |
+| `~10:10:00` | TEST is live with `sha-abc1234` | Pods rolled out, db-migrate Completed, app reachable at `xtra-test.credentialengineregistry.org` |
+
+Then the pipeline **stops automatically**. SANDBOX and PRODUCTION don't move on their own.
+
+When you're ready to promote:
+
+| Action | Effect | Time |
+|---|---|---|
+| **Actions → Promote to SANDBOX → Run workflow**, type `sha-abc1234` | `crane copy` TEST → SANDBOX ECR (no rebuild, same digest), then deploy to `ctdl-xtra-sandbox` | ~4 min |
+| **Actions → Promote to PRODUCTION → Run workflow**, type `sha-abc1234` | `crane copy` SANDBOX → PROD ECR (no rebuild, same digest), then deploy to `ctdl-xtra-prod` | ~4 min |
+
+The image you put in PRODUCTION is byte-for-byte the same image that was tested in TEST and SANDBOX — never rebuilt.
+
+---
+
+## How `workflow_run` chains Release → Deploy to TEST
+
+GitHub Actions has a `workflow_run` trigger that lets one workflow fire automatically when another workflow finishes. It's how we get auto-deploy to TEST without a human in the middle.
+
+In `deploy-test.yml`:
+
+```yaml
+on:
+  workflow_run:
+    workflows: ["Release"]   # upstream workflow's name
+    types: [completed]        # fire when it finishes (success or fail)
+    branches: [main]          # only when Release ran on main
+```
+
+The deploy job guards with `if: github.event.workflow_run.conclusion == 'success'`, so a failed Release still triggers Deploy to TEST but the job exits as "Skipped" instead of deploying a broken build.
+
+Two quirks worth knowing:
+
+1. **The triggered workflow runs against the default branch (`main`), not against the commit that produced the upstream run.** We work around it by passing `ref: ${{ github.event.workflow_run.head_sha }}` to `actions/checkout` so the deploy script and manifests match the commit that was actually built. The image tag is computed from the same sha (`sha-${head_sha:0:7}`).
+2. **You can also run `deploy-test.yml` manually** via `workflow_dispatch` with an explicit `image_tag` input — useful for rolling back to an older sha without going through Release.
+
+---
+
+## Failure modes
+
+| What fails | What you'll see | Effect |
+|---|---|---|
+| Lint or tests in Release | Red ⚠ on the run but Release still succeeds | None — those jobs have `continue-on-error: true`. Will be tightened later. |
+| Docker build in Release | Red ✗ on Release | Deploy to TEST runs but the `if:` guard skips the deploy job (you'll see a grey "Skipped" run) |
+| `ctdl-xtra-db-migrate` Job fails | Deploy workflow fails after waiting up to 5 min | Pods don't get rolled out. `kubectl describe job/ctdl-xtra-db-migrate` shows the error |
+| App pods don't become Ready | `kubectl rollout status` times out at 5 min, deploy fails | Old pods keep serving (current rollout strategy doesn't drop them until new ones are healthy) |
+| `crane copy` source missing | Promote workflow fails immediately | Promotion didn't happen; target ECR unchanged |
+
+---
+
 ## Workflows
 
 ### CI (`ci.yml`) — Pull Request

diff --git a/INFRASTRUCTURE-SUMMARY.md b/INFRASTRUCTURE-SUMMARY.md
@@ -56,6 +56,13 @@ Runs **inside the cluster** as a StatefulSet (not AWS ElastiCache).
 - ReadWriteMany — both API and Worker pods see the same files
 - Persistent across deploys and pod restarts
 
+| | Test | Sandbox | Production |
+|---|---|---|---|
+| File system id | `fs-0b3bfe1beaf5ed573` | `fs-093db3f7f152e19d7` | `fs-0ade50f1051fb7502` |
+| Name tag | `ctdl-xtra-test-files` | `ctdl-xtra-sandbox-files` | `ctdl-xtra-prod-files` |
+
+`deploy-app.sh` looks up the file system id by Name tag at deploy time and substitutes it into the static PV manifest, so you don't reference the id by hand.
+
 ## Secrets
 
 Stored in **AWS Secrets Manager**, synced into the cluster by the External Secrets operator. Pods read them as plain env vars (`envFrom: secretRef`).
@@ -78,6 +85,18 @@ Current keys in the app secret:
 - `NODE_ENV`
 - `PORT`
 
+## Nodes
+
+Two managed node groups per cluster (`system` and `app`), labeled `nodepool=system` and `nodepool=app`. The app group also carries `workload=ctdl-xtra`, which is the selector redis pins itself to.
+
+| | Test | Sandbox | Production |
+|---|---|---|---|
+| System node group | 1× t3.medium (1-2 autoscale) | 1× t3.medium (1-2 autoscale) | 2× t3.medium (2-4 autoscale) |
+| App node group | 1× t3.medium (1-2 autoscale) | 1× t3.medium (1-2 autoscale) | 2× t3.large (2-4 autoscale) |
+| NAT Gateway | 1 (shared across AZs) | 1 (shared across AZs) | 1 (shared across AZs) |
+
+Cluster autoscaler runs in each cluster and grows/shrinks the node groups based on pending pods.
+
 ## Deployments
 
 In each cluster's `ctdl-xtra` namespace:
@@ -89,7 +108,17 @@ In each cluster's `ctdl-xtra` namespace:
 | `redis` | StatefulSet | 1 | 1 | 1 |
 | `ctdl-xtra-db-migrate` | Job (one-shot per deploy) | — | — | — |
 
-The migrate Job runs `drizzle-kit migrate` before each rollout via the deploy script.
+The migrate Job runs `drizzle-kit migrate` before each rollout via the deploy script. Test and Sandbox use lighter resource requests/limits than Production to fit on t3.medium.
+
+## Cluster add-ons
+
+Same five Helm/manifest installs in every cluster, all managed via `k8s-manifests/<env>/addons/install-foundation.sh`:
+
+- **cert-manager** — TLS via Let's Encrypt (HTTP-01 / DNS-01)
+- **external-secrets** — syncs AWS Secrets Manager → K8s Secrets
+- **ingress-nginx** — L7 ingress + ALB
+- **metrics-server** — pod/node metrics for HPA and `kubectl top`
+- **cluster-autoscaler** — scales the EKS node groups based on pending pods
 
 ## Deploying