diff --git a/docs/INDEX.md b/docs/INDEX.md index cdc4076..d280baa 100644 --- a/docs/INDEX.md +++ b/docs/INDEX.md @@ -70,11 +70,13 @@ ### `docs/operations/guides/` — Runbooks +- [`884-demo-promotion-cheatsheet.md`](operations/guides/884-demo-promotion-cheatsheet.md) — Tactical runbook layered on `demo-environment-update.md` for the #884 self-serve promotion: SSM keys, MDR API + Cognito + SAM Flyway, frontend, user cleanup. - [`add-data-source.md`](operations/guides/add-data-source.md) — Adding a new data source to a LIF deployment: source schema, JSONata mappings, pipeline wiring. - [`adding-a-new-microservice.md`](operations/guides/adding-a-new-microservice.md) — Runbook for standing up a new HTTP microservice: Polylith brick layout, pyproject hygiene, Dockerfile2, AuthMiddleware wiring, docker-compose entry. - [`creating-a-data-source-adapter.md`](operations/guides/creating-a-data-source-adapter.md) — Reference for the data source adapter contract: what adapters are, what they receive, what they return. - [`demo-environment-update.md`](operations/guides/demo-environment-update.md) — End-to-end runbook for promoting dev images to demo. - [`load-testing.md`](operations/guides/load-testing.md) — Load testing notes for LIF services. +- [`self-serve-registration-walkthrough.md`](operations/guides/self-serve-registration-walkthrough.md) — End-to-end walkthrough of the #884 self-serve flow (register → workspace → invite → switch); tester checklist + admin/operator notes for verifying it on dev or demo. ### `docs/operations/proposals/` — Proposed work diff --git a/docs/operations/guides/884-demo-promotion-cheatsheet.md b/docs/operations/guides/884-demo-promotion-cheatsheet.md new file mode 100644 index 0000000..27df129 --- /dev/null +++ b/docs/operations/guides/884-demo-promotion-cheatsheet.md @@ -0,0 +1,150 @@ +# Self-Serve (#884) Demo Promotion Cheatsheet + +Tactical runbook for promoting the #884 self-registration story from dev → demo. 
Built on findings from the dev debugging session on 2026-05-26 (one day before the 2026-05-27 client demo). + +**Use alongside, not instead of, [`demo-environment-update.md`](demo-environment-update.md).** That guide is the canonical end-to-end. This one focuses only on the #884-specific deltas that we discovered the hard way on dev. + +## TL;DR — execution order + +```bash +# 1. Pre-flight SSM params (creates post-confirm key on both sides) +AWS_PROFILE=lif ./scripts/setup-mdr-api-keys.sh demo +AWS_PROFILE=lif ./scripts/setup-mdr-api-keys.sh demo --apply + +# 2. Promote demo image tags (standard process) +AWS_PROFILE=lif ./scripts/release-demo.sh +AWS_PROFILE=lif ./scripts/release-demo.sh --apply + +# 3. Deploy demo-lif-mdr-api (new task def picks up POST_CONFIRM + Cognito + tenant routing env vars) +./aws-deploy.sh -s demo --only-stack demo-lif-mdr-api + +# 4. Deploy demo-lif-mdr-cognito (adds lif-team group + replaces post-confirmation Lambda code) +./aws-deploy.sh -s demo --only-stack demo-lif-mdr-cognito + +# 4b. Deploy SAM mdr-database (Flyway migrations V1.2/V1.3/V1.4) +# CRITICAL: without this, the new post-confirm Lambda's /tenants/provision +# call 500s because the `clone_lif_schema` PG function (V1.2) doesn't +# exist. Discovered during the 2026-05-27 demo promotion. +# BUILDX_NO_DEFAULT_ATTESTATIONS=1 is required on Apple Silicon — otherwise +# Docker emits an OCI multi-arch image index that Lambda's image pull +# can't follow. +cd sam && BUILDX_NO_DEFAULT_ATTESTATIONS=1 AWS_PROFILE=lif \ + bash deploy-sam.sh -s ../demo -d mdr-database +cd .. + +# 5. Deploy MDR frontend (covers PR #935 redirect fix as well) +AWS_PROFILE=lif ./scripts/release-demo-frontend.sh main --apply + +# 6. Clean out stale test users (their eval-* groups exist but their tenant schemas don't) +# — see "Stale user cleanup" below + +# 7. 
Add the demo account to lif-team if you want it pre-staged +AWS_PROFILE=lif aws cognito-idp admin-add-user-to-group \ + --user-pool-id $(aws cloudformation describe-stacks --stack-name demo-lif-mdr-cognito \ + --query 'Stacks[0].Outputs[?OutputKey==`UserPoolId`].OutputValue' --output text) \ + --username \ + --group-name lif-team +``` + +## Why this is more than a normal demo update + +The standard `demo-environment-update.md` flow promotes the latest image tags. For #884, **the CloudFormation templates also drifted from prod** over the 6 weeks between when the work landed and when we're shipping it. The templates carry: + +- A new `LifTeamGroup` resource in `cognito-selfserve.yml` +- A new post-confirmation Lambda body that calls MDR's `POST /tenants/provision` +- New env vars + secret refs in `lif-mdr-api-taskdef-includes.yml` and `service.yml` (`POST_CONFIRM` key, `COGNITO_USER_POOL_ID`, `COGNITO_SPA_CLIENT_ID`, `COGNITO_REGION`, `TENANT_ROUTING__ENABLED`, `TENANT_ROUTING__SERVICE_SCHEMA`) + +Image-only promotion (step 2 alone) **will not** ship the schema-per-tenant or invite features. Steps 3-4 are required. + +## Required application-code fixes that must be in the image you promote + +The promotion only ships fixes that have already merged into `main` and made it into the MDR API image. Two of those landed late in dev debug and you'll want to verify they're in the `lif_mdr_api:latest` image you're promoting: + +| PR | What it fixes | How to spot the unfixed-version symptom | +|---|---|---| +| **[#949](https://github.com/LIF-Initiative/lif-core/pull/949)** | Tenant search_path missing `public` → PG enum types (`elementtype`, `datamodelelementtype`) fail to resolve | `GET /datamodels/with_details/` returns 500 for `OrgLIF` models (e.g. StateU LIF #17, Org2 LIF #18); browser console misreports it as a CORS error. Fix: `SET search_path TO "", public` in `components/lif/mdr_utils/database_setup.py`. 
| +| **[#940](https://github.com/LIF-Initiative/lif-core/pull/940)** + **[#939](https://github.com/LIF-Initiative/lif-core/pull/939)** | V1.3 migration ran the buggy V1.2 `clone_lif_schema`; `flyway repair` now runs before `migrate` | The Flyway Lambda fails on first deploy with `cannot insert a non-DEFAULT value into column "Id"`; see step 4b in the TL;DR above. | + +Verify before deploy: +```bash +# Confirm the image tag in demo's params has both fixes (these PRs merged 2026-05-26+) +grep -E "ImageUrl|ImageTag" cloudformation/demo-lif-mdr-api.params +# Pull the image's commit SHA from its tag and check `git log` to confirm +# the merge commits for #949 + #940 are reachable. +``` + +If demo's `ImageUrl` is `…/lif_mdr_api:latest` (rare for demo, which usually pins a timestamped tag), step 2 above picks up whatever the latest dev build is. If pinned, make sure the pinned tag was built AFTER 2026-05-26 (when #949 + #940 merged). + +## Findings worth keeping in mind + +| # | Finding | Demo-day implication | +|---|---|---| +| 1 | Post-confirmation Lambda is the **only** thing that provisions a new tenant's PG schema. If the old Lambda body is deployed (no MDR call), users get Cognito groups but no schemas. | After step 4, do a fresh self-serve registration end-to-end before the demo. If the new tenant's `/explore` works, you're good. | +| 2 | Cognito group name `lif-team` (Precedence: 10) → routes to `tenant_lif_team` schema (precedence beats auto-created `eval-` groups, so the LIF data wins in the JWT's group ordering). | The demo path "I can switch to the original LIF data" relies on this. Verify the group exists in demo after step 4. | +| 3 | `lif_workspace` cookie is HttpOnly + `SameSite=Lax`; frontend must use `withCredentials: true` (shipped on `main`); backend CORS must set `allow_credentials=true` (already wired in `dev-lif-mdr-api.params`, mirror in `demo-lif-mdr-api.params`). | Pre-flight check: `grep CORS demo-lif-mdr-api.params`. Must allow the demo frontend origin. 
| +| 4 | After joining a new group via invite or admin-add, the user's existing JWT does **not** reflect it. They must log out + back in. | Tell the demo audience this *before* they click. Otherwise the new workspace card just doesn't appear. | +| 5 | The `EnableCognitoAuth=false` param is **legacy** (it controlled the old ALB-Cognito stub). Self-serve Cognito runs independently of this flag. Don't get tempted to flip it. | n/a — just don't touch it. | +| 6 | MDR API logs `DATABASE_URL` with the credential in plaintext to CloudWatch at startup (`lif.mdr_utils.database_setup`). **Pre-existing, not from #884.** | Out of scope for the demo; capture as a post-demo follow-up. Sensitive in shared log groups. | +| 7 | **`clone_lif_schema` copies tables but NOT PG `CREATE TYPE` definitions.** Tenant-scoped queries that cast against an enum (e.g. `'Entity'::elementtype`) fail because the type lookup uses search_path, which previously didn't include `public`. PR [#949](https://github.com/LIF-Initiative/lif-core/pull/949) appends `public` to the search_path as a fallback so types resolve correctly. | Without #949 in the deployed image: opening any `OrgLIF` data model (StateU LIF, Org2 LIF, future per-org models) returns 500 from `/datamodels/with_details/`, surfaced in the UI as "Network Error" + a misleading CORS error in console. Fixed dev on 2026-05-26 evening; must be in demo image too. Permanent fix is to extend `clone_lif_schema` to copy types, then drop the fallback. | + +## Stale user cleanup + +If anyone has tested registration on demo before steps 3-4 deployed (or before today), their Cognito group exists but their PG schema doesn't. 
Two paths: + +**Path A (clean — drop + re-register):** +```bash +USER_POOL_ID=$(AWS_PROFILE=lif aws cloudformation describe-stacks \ + --stack-name demo-lif-mdr-cognito \ + --query 'Stacks[0].Outputs[?OutputKey==`UserPoolId`].OutputValue' --output text) + +# For each test user: +AWS_PROFILE=lif aws cognito-idp admin-delete-user --user-pool-id "$USER_POOL_ID" --username + +# Find the orphan eval-* group (named after their sub): +AWS_PROFILE=lif aws cognito-idp list-groups --user-pool-id "$USER_POOL_ID" --query 'Groups[?starts_with(GroupName,`eval-`)].GroupName' +AWS_PROFILE=lif aws cognito-idp delete-group --user-pool-id "$USER_POOL_ID" --group-name + +# They re-register fresh; new Lambda fires; PG schema gets created +``` + +**Path B (retroactive — manually provision the existing user's schema):** + +Call MDR's `POST /tenants/provision` with the `mdr-post-confirm` API key: +```bash +API_KEY=$(AWS_PROFILE=lif aws ssm get-parameter \ + --name /demo/mdr-post-confirm/MdrApiKey --with-decryption \ + --query 'Parameter.Value' --output text) + +curl -X POST "https://mdr-api.demo.lif.unicon.net/tenants/provision" \ + -H "X-API-Key: $API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"group": "eval-"}' +``` + +Path A is cleaner for the demo because it exercises the full live flow. + +## Smoke tests before the demo + +After all steps complete, verify in this order: + +1. **MDR API health** — `curl https://mdr-api.demo.lif.unicon.net/health-check` returns 200 +2. **Cognito group exists** — `aws cognito-idp list-groups | grep lif-team` shows it with Precedence 10 +3. **Fresh registration works** — register a test email; confirm; should land on `/workspaces` showing one card; click it; `/explore` shows seed data (the new tenant has just-cloned `public` data — not the LIF team data, but it should *load*) +4. **lif-team join works** — add the test user to lif-team via CLI; sign out + back in; `/workspaces` now shows two cards; clicking lif-team enters real LIF data +5. 
**Invite flow works** — generate invite link from lif-team workspace; open in another browser profile / incognito; register a second test user; click the invite URL; should succeed, prompt re-login; second user sees lif-team in their list + +## Rollback plan + +If a step fails or the demo blows up on stage: + +- **Step 3 (mdr-api) rolls back** automatically via CloudFormation if the task fails to start. The old task def revision is preserved. +- **Step 4 (cognito) rolls back** the same way. The `lif-team` group would be deleted; existing users' `eval-*` groups stay. UserPool itself is `Replace: False`, so user accounts and existing memberships are safe. +- **Frontend rollback** — re-run `release-demo-frontend.sh --apply`. +- **Database rollback** — n/a. The schema-per-tenant cutover happened in #883 Phase 2 PR 3 (already on demo); we're not changing schemas in this promotion. + +## Related links + +- [Self-serve auth design doc](../../design/cross-cutting/self-serve-tenant-auth.md) +- [Demo environment update guide](demo-environment-update.md) — full update process +- [Self-serve registration walkthrough](self-serve-registration-walkthrough.md) — tester flow +- PRs: #914 (workspace listing), #918 (invites), #931 (reset), #932 (export), #934 (UI), #935 (redirect fix) diff --git a/docs/operations/guides/self-serve-registration-walkthrough.md b/docs/operations/guides/self-serve-registration-walkthrough.md new file mode 100644 index 0000000..475ed8d --- /dev/null +++ b/docs/operations/guides/self-serve-registration-walkthrough.md @@ -0,0 +1,152 @@ +# Self-Serve Registration Walkthrough + +End-to-end walkthrough of the self-registration → workspace selection → invite flow shipped under issue **#884**. Use this to validate the feature on `dev` or `demo`, or to onboard a tester. 
+ +For the architectural backstory (Cognito self-serve stack, post-confirmation Lambda, schema-per-tenant), see [`docs/design/cross-cutting/self-serve-tenant-auth.md`](../../design/cross-cutting/self-serve-tenant-auth.md). + +## Environments + +| Environment | Frontend URL | API URL | +|---|---|---| +| dev | https://mdr.dev.lif.unicon.net | https://mdr-api.dev.lif.unicon.net | +| demo | https://mdr.demo.lif.unicon.net | https://mdr-api.demo.lif.unicon.net | + +Pick a fresh email address per tester — Cognito won't let two users share an email, and registration leaves a persistent user pool record. + +## Tester walkthrough + +### 1. Register + +1. Go to the frontend URL. +2. Click **Sign In / Register** on the landing page. +3. Cognito's hosted UI loads. Click **Sign up**. +4. Fill in: + - **Email** (will receive a 6-digit confirmation code) + - **Password** (Cognito enforces complexity rules) + - **Organization** (free text — your school / employer) + - **Role** (free text — your title or function) + - **Reason** (free text — why you're evaluating LIF) +5. Submit. Cognito sends a confirmation code by email. +6. Enter the code on the next screen. + +Behind the scenes: Cognito fires its post-confirmation Lambda, which calls the MDR's `POST /tenants/provision`. That clones the `public` schema into a new `tenant_` schema in Postgres. **This is idempotent and async** — if you reach step 7 below before it finishes (rare; sub-second in practice), retry the page. + +### 2. First sign-in → workspace landing + +7. After confirmation, you're redirected back through the SPA. Auth callback completes, then you land on **`/workspaces`**. +8. Because you have exactly one workspace (the one just provisioned for you), the page auto-selects it and forwards you to **`/explore`**. You should not need to click anything. +9. If something fails — wrong cookie origin, race with provisioning, etc. — the auto-forward stops and you'll see the picker with an error callout. Refresh the page. 
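
The retry advice above (provisioning is asynchronous, so the first page load can race it) can be automated in a tester script. The sketch below is only an illustration under stated assumptions: `wait_for_workspaces` and the injected `fetch_workspaces` callable are hypothetical names, and this guide does not specify how a script would list workspaces (endpoint, auth), so that part is left to the caller.

```python
import time
from typing import Callable, List


def wait_for_workspaces(
    fetch_workspaces: Callable[[], List[str]],  # hypothetical: returns visible workspace names
    attempts: int = 5,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll until at least one workspace is visible or the attempts run out."""
    for attempt in range(attempts):
        workspaces = fetch_workspaces()
        if workspaces:
            return workspaces
        # Do not sleep after the final attempt; just give up.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return []
```

Injecting `sleep` keeps the helper testable without real delays; an empty return value corresponds to the "refresh the page" case in step 9.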
+ +### 3. Generate an invite link + +10. Click **Workspaces** in the top nav (or **Switch workspace** in the user dropdown). You're back at `/workspaces`. +11. Click **Invite** on your workspace card. A dialog opens: *"Invite someone to '\'"*. +12. Click **Generate invite link**. The dialog shows: + - A URL of the form `https://mdr.dev.lif.unicon.net/invite/accept?token=…` + - An expiry timestamp ("Expires \<7 days from now\>") + - A copy-to-clipboard button +13. Copy the URL. Send it to a second tester (or paste it into a different browser profile for solo testing). + +### 4. Accept an invite (second tester) + +The recipient needs their own Cognito account first. They can: +- Register fresh (step 1 above), then visit the invite URL while signed in, **or** +- Sign in to an existing account, then visit the invite URL. + +14. Visit the invite URL. The page is `/invite/accept` and shows **Accept invite**. +15. If you're not signed in, the AuthGuard bounces you through Cognito login first; you'll land back on the invite page after sign-in (search params preserved). +16. Click **Accept invite**. +17. Success state: **"You're in."** with a button **Sign in again to refresh**. +18. Click that button. You'll be logged out and bounced back through Cognito's login. After sign-in, you'll see *both* workspaces at `/workspaces` — your original (if any) and the invited one. + +**Why the re-sign-in is required:** the Cognito JWT was issued before the new group was added; it doesn't reflect the new membership until refreshed. Forcing a re-login is the cheapest way to get a fresh ID token with the updated `cognito:groups` claim. + +### 5. Switch workspaces + +19. On `/workspaces`, click **Open** on a different workspace card. The page navigates to `/explore`, but you're now operating against the other tenant's data. +20. The selection is stored in an HMAC-signed cookie (`lif_workspace`, `SameSite=Lax`). It survives browser restarts until expiry. 
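
The signed-cookie idea in step 20 can be sketched as follows. This is a minimal illustration, not the MDR implementation: the `group.expiry.signature` layout, the helper names `sign_workspace` and `verify_workspace`, and the choice of SHA-256 are all assumptions. Only the cookie name `lif_workspace`, the fact that it is HMAC-signed, and the fact that it expires come from this guide.

```python
import hashlib
import hmac
import time
from typing import Optional


def _signature(secret: bytes, payload: str) -> str:
    # Hex-encoded HMAC-SHA256 over the payload (digest choice is an assumption).
    return hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_workspace(secret: bytes, group: str, expires_at: int) -> str:
    """Build an assumed 'group.expiry.signature' value for the lif_workspace cookie."""
    payload = f"{group}.{expires_at}"
    return f"{payload}.{_signature(secret, payload)}"


def verify_workspace(secret: bytes, cookie: str, now: Optional[int] = None) -> Optional[str]:
    """Return the group if the signature matches and the value is unexpired, else None."""
    try:
        payload, sig = cookie.rsplit(".", 1)
        group, expiry_text = payload.rsplit(".", 1)
        expires_at = int(expiry_text)
    except ValueError:
        return None  # malformed value
    expected = _signature(secret, payload)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8")):
        return None
    current = int(time.time()) if now is None else now
    if current >= expires_at:
        return None
    return group
```

Because verification recomputes the HMAC with the current secret, a value signed under a different secret fails the comparison and is treated the same as a tampered one.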
+ +### Edge cases to spot-check + +| Scenario | Expected behavior | +|---|---| +| Visit `/invite/accept` with no `?token=` | **"Missing invite token"** card | +| Visit an expired invite link (> 7 days) | **"This invite has expired"** card (HTTP 410 under the hood) | +| Visit a tampered invite link (bad signature) | **"Invite link is invalid"** card | +| Click **Open** on workspace A, then quickly **Open** workspace B | Both buttons disabled while either select is in flight; first wins, second is a no-op | +| Network failure during accept | **"Something went wrong"** card with **Try again** button | +| Legacy username/password mode (Cognito disabled) | `/invite/accept` shows **"Invites require Cognito sign-in"** card before any backend call | + +## Admin / operator notes + +### Monitor a registration + +Cognito console → **User pools** → `\-lif-mdr-selfserve` → **Users**. Newly confirmed users show up with status `CONFIRMED` and a `custom:organization` / `custom:role` / `custom:reason` attribute set. + +Group membership for a user: **Users** → click the user → **Group memberships**. The post-confirmation Lambda adds them to a group named after their sub. + +### Export the registration list + +For outreach (who has signed up, what they wrote in `reason`, etc.): + +```bash +# Dry-run / preview (read-only, no --apply needed) +AWS_PROFILE=lif uv run scripts/export_cognito_registrations.py dev + +# Write to a file +AWS_PROFILE=lif uv run scripts/export_cognito_registrations.py dev --output dev-registrations.csv + +# JSON instead of CSV +AWS_PROFILE=lif uv run scripts/export_cognito_registrations.py dev --format json +``` + +IAM permissions required: `cognito-idp:ListUsers`, `cognito-idp:AdminListGroupsForUser`, `cloudformation:DescribeStacks`. The script reads the UserPoolId from the `-lif-mdr-cognito` stack outputs. + +### Verify a tenant schema was provisioned + +The post-confirmation Lambda's job is to make `tenant_` exist in the MDR Postgres database. 
To check: + +```bash +# Connect to the MDR DB via the bastion / however you normally reach it +psql "$MDR_DB_URL" -c "\dn tenant_*" +``` + +You should see one `tenant_` schema per confirmed user (plus `tenant_lif_team` as the default service schema). + +### When something goes wrong + +| Symptom | Likely cause | Where to look | +|---|---|---| +| User confirms email but lands on **"No workspaces yet"** | Post-confirmation Lambda failed | CloudWatch Logs → `/aws/lambda/-lif-mdr-cognito-PostConfirmationLambda-*` | +| Invite link 400s with "Token signature invalid" (or workspace cookie silently ignored) | The shared HMAC secret rotated between issue and accept | Check `//mdr-api/MdrAuthJwtSecretKey` SSM parameter — this single key signs HS256 JWTs *and* HMACs the workspace cookie *and* HMACs invite tokens, so rotating it invalidates all three simultaneously | +| Invite link 400s with `cognito_sub` error | Frontend has Cognito disabled but backend has it enabled (or vice-versa) | `VITE_COGNITO_DOMAIN` / `VITE_COGNITO_CLIENT_ID` in the frontend build vs. backend `cognito_auth__*` settings | +| `/tenants/select` returns 200 but `/explore` still shows the wrong data | Workspace cookie not being sent | Check axios `withCredentials: true` (PR #934) and CORS `allow_credentials=true` on the backend; both required for cross-origin | +| Two registrations from the same email | They aren't; Cognito rejects duplicate emails. The second registrant is silently re-using the first account. | Cognito console → user pool → look at sign-in counts on the existing user | + +### Reset / clean up test users + +In dev or demo, to clean out test registrants between demo runs: + +1. **Cognito side:** User pool → select user → **Actions** → **Delete user**. This frees the email for re-use. +2. **Database side:** the orphaned `tenant_` schema stays. 
To drop it: + ```sql + DROP SCHEMA tenant_ CASCADE; + ``` + Use the [#931 workspace-reset endpoint](https://github.com/LIF-Initiative/lif-core/pull/931) if you want a programmatic path; the SQL above is the manual equivalent. + +There is currently **no UI** to delete a workspace — by design for v1. Operators handle it. + +## Out of scope (known v1 limitations) + +These are deliberate cuts for the 2026-05-27 demo. Flagged here so testers don't report them as bugs: + +- No header indicator of *which* workspace is currently selected (cookie is HttpOnly; surfacing the value needs localStorage / a context wired in) +- No reset-workspace button in the UI (backend [#931](https://github.com/LIF-Initiative/lif-core/pull/931) exists; UI wiring deferred) +- Invite tokens are **reusable** until expiry (no single-use enforcement; would need a server-side store) +- Invites only work for users with an existing Cognito account; the invite flow doesn't compose with the sign-up flow yet (sign up first, *then* click the invite) + +## Related docs + +- [`docs/design/cross-cutting/self-serve-tenant-auth.md`](../../design/cross-cutting/self-serve-tenant-auth.md) — architectural narrative (Cognito → Lambda → schema-per-tenant → cookies → invites) +- [`docs/proposals/mdr-self-serve-registration.md`](../../proposals/mdr-self-serve-registration.md) — the broader self-serve roadmap proposal +- [`docs/operations/guides/demo-environment-update.md`](demo-environment-update.md) — how to promote dev → demo (use this *before* a tester runs through the above on `demo`)