Skip to content

feat: deploy app layer (Coder, Keycloak, GitLab, AI Gateway) on GovCloud EKS#5

Merged
ausbru87 merged 16 commits into
mainfrom
feat/app-platform-deploy
Jun 7, 2026
Merged

feat: deploy app layer (Coder, Keycloak, GitLab, AI Gateway) on GovCloud EKS#5
ausbru87 merged 16 commits into
mainfrom
feat/app-platform-deploy

Conversation

@ausbru87
Copy link
Copy Markdown
Collaborator

@ausbru87 ausbru87 commented Jun 7, 2026

What this does

Deploys and validates the full demo stack on the live GovCloud EKS cluster
(us-gov-west-1, usgov.coderdemo.io): Coder + Keycloak SSO + GitLab +
AI Gateway + a Claude Code workspace template
.

Service URL State
Coder https://dev.usgov.coderdemo.io Running, licensed (AI Governance + premium), OIDC SSO live
Keycloak https://auth.usgov.coderdemo.io Realm coder imported; authorize flow verified
GitLab https://gitlab.usgov.coderdemo.io Single-container (embedded Postgres)

Verified end to end

  • SSO: Coder shows "Sign in with Keycloak"; Keycloak authorize for client
    coder + redirect /api/v2/users/oidc/callback returns the login page (200).
  • AI Gateway: POST /api/v2/aibridge/anthropic/v1/messages routes through to
    api.anthropic.com (currently 502 "keys failed authentication", a placeholder
    key; see below). Providers anthropic + anthropic-bedrock (IRSA) are enabled.
  • Workspace template: claude-code pushed; a test workspace built, the agent
    connected and went healthy, and Claude Code + AgentAPI + code-server installed.
  • Platform: internet-facing NLB + ingress-nginx (ACM TLS), in-cluster hairpin
    verified, EBS CSI via IRSA, RDS roles/dbs created (force_ssl).

Hardening and documentation (latest update)

  • Every workspace template requires in-boundary GitLab login. The
    claude-code template declares data "coder_external_auth" "gitlab"; the
    active template version's /external-auth lists gitlab as required. The
    agent git credential helper then injects a short-lived token, so no PATs or
    SSH keys live in the workspace and no auth path leaves the boundary.
  • Path-based workspace apps disabled (CODER_DISABLE_PATH_APPS=true, Helm
    rev 4). All template apps use subdomain = true, so apps are subdomain-only.
  • Classification banner set to green UNCLASSIFIED - USGOVCLOUD and the
    application name set to USGOV Coder Demo via the idempotent
    scripts/set-appearance.sh (runtime appearance applied over the API, not a
    Helm value; the name is configurable via APP_NAME).
  • As-built documentation added under docs/as-built/: architecture and
    flows, per-component configuration, and a declarative-vs-imperative ledger
    with a Terraform reconciliation backlog. Indexed from docs/00-INDEX.md.

Multi-tenant IdP sync (Keycloak -> Coder)

Models a true multi-tenant hierarchy in Keycloak and syncs it into Coder via
OIDC IdP sync (organization + group + role). No org/group/role is assigned by
hand in Coder.

  • 3 organizations: coder (Platform Engineering), alpha (Mission Partner
    Alpha), bravo (Mission Partner Bravo). assign_default=false, so org
    membership is purely claim-driven.
  • One full-path groups claim (Group Membership mapper on the coder
    client) drives org sync, per-org group sync, and per-org role sync
    (organization-admin / organization-template-admin / organization-auditor).
  • 8 persona users (platform lead, SRE/template-admin, org admins, devs, data
    scientist, cross-tenant ISSO/auditor).
  • Tenant orgs are functional: org-scoped provisioner key + external
    provisioner daemon per tenant (reusing the coder SA) and the claude-code
    template pushed into all three orgs.
  • Verified end to end with scripts/verify-oidc-login.py (a real Keycloak
    login per persona): each lands in the right org(s)/group(s)/role(s); Alpha,
    Bravo, and Platform stay isolated; the auditor spans both tenants read-only.

Idempotent scripts: scripts/setup-keycloak-hierarchy.py,
scripts/setup-coder-idp-sync.py, scripts/verify-oidc-login.py. Details in
docs/as-built/45-idp-sync-personas.md.

Secrets management (External Secrets Operator + AWS Secrets Manager)

Runtime secrets are now sourced from AWS Secrets Manager and synced into
Kubernetes by the External Secrets Operator over IRSA. ASM is the source
of truth; nothing sensitive is in git, and the app manifests are unchanged.

  • ESO (chart 2.6.0, ns external-secrets, ECR-mirrored image) authenticates via
    IRSA role usgov-coderdemo-external-secrets (least-privilege: read-only on
    usgov-coderdemo/*, no static keys).
  • The 9 runtime app secrets were migrated into ASM
    (scripts/migrate-secrets-to-asm.py); a ClusterSecretStore +
    per-secret ExternalSecret materialize them back with the same names/keys.
  • ESO adopted the existing Secrets with byte-identical data (verified via
    sha256 before/after), so running pods were not disrupted; store Valid, all 9
    SecretSynced, and delete/recreate recovery confirmed.
  • EKS Secrets envelope encryption with a customer-managed KMS key is codified
    in terraform/secrets-hardening.tf but not applied (irreversible
    re-encrypt; needs a maintenance decision).

Details in docs/as-built/85-secrets-management.md.

Observability (in-cluster Prometheus + Grafana)

In-boundary, in-cluster metrics and dashboards so the demo shows live
control-plane telemetry without leaving the GovCloud boundary. The AWS-native
managed variant (AMP/AMG, CloudWatch -> Security Lake) is the production target
and is planned separately (see below), not built here.

  • Stack (deploy/observability/, Helm release kps, ns monitoring):
    kube-prometheus-stack 86.2.0 (Prometheus + Grafana + operator), ECR-mirrored
    images. Trimmed for the demo: Alertmanager, node-exporter, kube-state-metrics,
    bundled rules, and the EKS control-plane ServiceMonitors are off; the kubelet
    ServiceMonitor is kept for cAdvisor container CPU/memory.
  • Coder metrics enabled (deploy/coder/values.yaml, ADD-only to respect the
    coderd drift guard): CODER_PROMETHEUS_ENABLE=true,
    CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112, agent stats on. A headless
    coder-metrics Service + ServiceMonitor scrapes the control plane;
    up{job="coder-metrics"} is 1.
  • Coder Grafana dashboards (from github.com/coder/observability) render
    live data at https://grafana.usgov.coderdemo.io (HTTP 200, valid TLS). The
    Grafana admin password is sourced from AWS Secrets Manager
    (usgov-coderdemo/observability/grafana) and synced by ESO; no password in
    git. Log panels are wired to the in-cluster Loki datasource (below), and a
    dedicated AI Governance dashboard (uid: ai-governance,
    deploy/observability/dashboards-ai-governance.yaml) merges AI Gateway
    provider health (coder_aibridged_provider_info) with Agent Firewall
    (Boundary) proxy activity in one view.
  • In-cluster logs (Loki + Promtail): a single-binary Loki 3.5.9 (10Gi gp3
    PVC) plus a Promtail DaemonSet aggregate all pod logs in-boundary
    (deploy/observability/loki.yaml, promtail.yaml); a loki Grafana
    datasource (deploy/observability/loki-datasource.yaml) feeds the log panels.
    Images are ECR-mirrored. Log aggregation stays entirely inside the GovCloud
    boundary; no external log sink is used.
  • Keycloak SSO (one SSO): Grafana logs in through the same realm (coder)
    as Coder via a confidential OIDC client grafana
    (scripts/setup-grafana-oidc.py, authorization-code + PKCE; secret in AWS
    Secrets Manager usgov-coderdemo/observability/grafana-oauth, ESO-synced, no
    secret in git). Keycloak group membership maps to a Grafana org role
    (contains(groups[*], '/platform') && 'Admin' || 'Viewer'); the local admin
    login is kept as break-glass. Verified per persona with a headless login:
    pat.platform (/platform) -> Admin, dana.dev (/alpha) -> Viewer.
  • SIEM-ready logs + audit: structured JSON server logs
    (CODER_LOGGING_JSON=/dev/stderr, CODER_LOGGING_HUMAN=/dev/null); licensed
    audit logging is already entitled and on (/audit). Coder has no single
    CODER_LOG_FORMAT flag, so JSON is selected by pointing CODER_LOGGING_JSON
    at a sink.

Verified live: coder Helm rev 5 healthy; monitoring pods Running (grafana 3/3,
prometheus 2/2, operator 1/1); the grafana-admin ExternalSecret is
SecretSynced; the Coder Control Plane dashboard renders end to end. Details in
docs/as-built/55-observability.md and deploy/observability/README.md.

Planned (design + issues, nothing applied)

Forward-looking designs added under docs/plans/ with companion GitHub issues.
Nothing in these plans changes the live environment; GitOps live migration is
deliberately deferred (adopt the current state in place later).

GitLab SSO + one super admin

GitLab now signs in through the same Keycloak realm (coder) as Coder and
Grafana, so the demo is one SSO. OmniAuth openid_connect is configured in the
GitLab StatefulSet; the realm client gitlab is created by
scripts/setup-gitlab-oidc.py, with its secret in AWS Secrets Manager and
synced by ESO (no secret in git). The local root form stays as break-glass.

GitLab stays Community Edition, which does not implement OIDC group-to-role
mapping (admin_groups is an EE feature). So persona users and the GitLab admin
flag are provisioned explicitly by scripts/setup-gitlab-users.py (idempotent,
gitlab-rails), linking each openid_connect identity and making only the
operator account austen.platform an instance admin (the persona accounts stay
non-admin, preserving tenant isolation).

A dedicated operator account austen.platform (separate from the demo personas)
is the single super-admin SSO identity. scripts/grant-coder-owner.py (default
username austen.platform) grants it the Coder site Owner role, and
scripts/setup-gitlab-users.py makes it GitLab Administrator; it is also Grafana
Admin, so one Keycloak identity is super admin across Coder (Owner), GitLab
(Administrator), and Grafana (Admin)
. It is created in
scripts/setup-keycloak-hierarchy.py with passkey (WebAuthn) + TOTP enrollment
required at first login
(Keycloak required actions webauthn-register +
CONFIGURE_TOTP). pat.platform is reverted to a normal Platform lead
(org-admin of the coder org only). Verified live per persona. Per-app local
break-glass admins remain; credentials live in generated-secrets.env / AWS
Secrets Manager, not git.

One action remains (external dependency)

No real Anthropic API key was available in the environment, so the anthropic
provider is seeded with a placeholder. To make AI respond: sign in as owner →
Admin > AI > Providers (/ai/settings) → set the real sk-ant-... key on the
provider named anthropic (do this in the UI, not the coder-ai secret).
Details in STATUS.md.

Decision log & deviations to reconcile into Terraform

Why standard EKS instead of Auto Mode: EKS Auto Mode node provisioning is
broken in this GovCloud account. The AWS-managed SLR AWSServiceRoleForAmazonEKS
lacks iam:AddRoleToInstanceProfile / iam:TagInstanceProfile, so Auto Mode
NodeClass validation never succeeds. The cluster was converted to standard EKS.

Live-cluster changes not yet in terraform/ (see deploy/platform/README.md):

  1. Auto Mode disabled; managed node group mng (3x m5.xlarge), node role usgov-coderdemo-mngnode.
  2. EBS CSI IRSA role usgov-coderdemo-ebs-csi + addon service-account-role-arn.
  3. Self-managed addons (vpc-cni, kube-proxy, coredns, aws-ebs-csi-driver) + gp3 default StorageClass.
  4. ingress-nginx + aws-load-balancer-controller (Helm) for the NLB.
  5. Workspace RBAC for the Coder SA in coder-workspaces.

Fixes folded into the manifests/template:

  • ingress-nginx: aws-load-balancer-type: external (LB Controller, not Auto Mode NLB).
  • keycloak realm: removed non-standard _comment_* keys that abort realm import.
  • coder values: AI provider name must be anthropic; the gateway routes by
    provider name and the claude-code module hardcodes /api/v2/aibridge/anthropic.
  • claude-code template: allow_privilege_escalation: true so the agentapi module
    can sudo-install to /usr/local/bin.
  • gitlab: gp3 StorageClass; removed mattermost key (removed in GitLab 19.0);
    added the VPC CIDR to gitlab_rails['monitoring_whitelist'] so kubelet health
    probes (from the node IP) pass instead of getting 404.

AI provider note: providers are DB-managed since v2.34 (env only seeds once).
The running instance has anthropic (placeholder key) + anthropic-bedrock; a
coderd restart was verified to skip re-seeding the soft-deleted old provider and
not crash (no drift).


Authored by Coder Agents on behalf of @ausbru87.

ausbru87 added 16 commits June 7, 2026 08:05
…oud EKS

Brings up and validates the full demo stack on the live us-gov-west-1 cluster:

- Coder v2.34.0 (Helm) with Keycloak OIDC SSO, AI Governance license, and
  AI Gateway providers (anthropic + anthropic-bedrock via IRSA).
- Keycloak 26.6.3 with realm `coder` import (client + demo user).
- GitLab CE 19.0.1 single-container (embedded Postgres).
- claude-code workspace template (Coder Agents + Claude Code + AgentAPI).
- Platform layer: ingress-nginx + internet-facing NLB (AWS LB Controller),
  EBS CSI IRSA, gp3 StorageClass, RDS roles/dbs, workspace RBAC.

Fixes applied during bring-up:
- ingress-nginx: aws-load-balancer-type=external (standard EKS, not Auto Mode).
- keycloak realm: drop non-standard _comment_* keys that break realm import.
- coder values: AI provider name must be `anthropic` (AI Gateway routes by
  provider name; the claude-code module hardcodes /api/v2/aibridge/anthropic).
- claude-code template: allow_privilege_escalation=true so the agentapi
  module can sudo-install to /usr/local/bin.
- gitlab: gp3 StorageClass; remove mattermost key (removed in GitLab 19.0);
  add VPC CIDR to monitoring_whitelist so kubelet health probes pass.

NOTE: EKS Auto Mode node provisioning is broken in this GovCloud account, so
the cluster runs as standard EKS. See STATUS.md and deploy/platform/README.md
for the deviations to reconcile into Terraform.

Authored by Coder Agents on behalf of @ausbru87.
…rnal auth

Disable Coder's built-in github.com providers and route git through the
in-cluster GitLab instead, so no auth path leaves the GovCloud boundary.

- CODER_OAUTH2_GITHUB_DEFAULT_PROVIDER_ENABLE=false disables the default
  GitHub login (was enabled out-of-the-box via Coder's hosted GitHub app).
- Configure a GitLab external-auth provider (CODER_EXTERNAL_AUTH_0_*) against
  gitlab.usgov.coderdemo.io using an instance-wide OAuth app; id/secret come
  from Secret coder-external-auth. Declaring an explicit external-auth provider
  also suppresses Coder's default github.com external-auth injection.

Login is now Keycloak SSO + local password owner only.

Authored by Coder Agents on behalf of @ausbru87.
Harden the demo Coder deployment along three axes the user requested:

- Every workspace template now requires in-boundary GitLab login. The
  claude-code template declares `data "coder_external_auth" "gitlab"`, so a
  workspace must complete the GitLab OAuth flow before the agent is ready;
  the agent git credential helper then injects a short-lived token for
  clone/fetch/push. No PATs/SSH keys in the workspace, no out-of-boundary
  auth path.
- Disable path-based workspace apps (CODER_DISABLE_PATH_APPS=true). All
  templates serve apps with subdomain=true, so apps are now subdomain-only
  and the same-origin path-app surface is removed.
- Add scripts/set-appearance.sh to set the green "UNCLASSIFIED - USGOVCLOUD"
  classification banner. Appearance is a runtime DB setting (premium-gated),
  not a Helm value, so the script makes it reproducible and idempotent.

Verified live: template version /external-auth lists gitlab as required,
deployment config disable_path_apps=true, GET /api/v2/appearance shows the
banner.

Generated by Coder Agents.
Add docs/as-built/, the engineering record of what is deployed and how it is
configured, produced by a fan-out of documentation agents and cross-checked
against live read-only state:

- 00-overview: architecture, component map, topology diagram, core flows.
- 10-infrastructure: GovCloud substrate (VPC, EKS standard-not-Auto-Mode and
  why, node group, IRSA, RDS, ECR, Route53, ACM, NLB).
- 20-platform-kubernetes: namespaces, ingress, storage, workspace RBAC, Secrets.
- 30-coder-control-plane: values.yaml walkthrough, OIDC SSO, auth hardening,
  licensing, appearance.
- 40-identity-keycloak: realm coder, OIDC client, the no-group-sync gap.
- 50-gitlab-scm: in-boundary GitLab, the OAuth app, per-workspace git auth.
- 60-ai-gateway: AI Bridge providers, name-based routing, end-to-end flow,
  remaining action.
- 70-workspace-templates: the claude-code template and required GitLab auth.
- 80-iac-vs-imperative: declarative (Terraform) vs imperative ledger plus a
  reconciliation backlog.
- 90-operations-runbook: day-2 ops and known gaps.

Cross-linked from docs/00-INDEX.md and STATUS.md. Verified emdash/endash-free.

Generated by Coder Agents.
Model a true multi-tenant hierarchy in Keycloak and sync it into Coder via OIDC
IdP sync (organization + group + role), with personas for the demo.

Organizations: coder (display "Platform Engineering"), alpha ("Mission Partner
Alpha"), bravo ("Mission Partner Bravo").

Keycloak (realm coder): a hierarchical group tree plus one Group Membership
mapper emitting a full-path `groups` claim (ID + access + userinfo), and 8
persona users. Coder runs runtime per-org IdP sync (not legacy env vars):
- organization sync: field=groups, assign_default=false, /platform|/alpha|/bravo
- group sync (per org): team subgroups -> pre-created Coder groups
- role sync (per org): role subgroups -> organization-admin /
  organization-template-admin / organization-auditor

Tenant orgs are functional: an org-scoped provisioner key + external provisioner
daemon per tenant (deploy/coder/provisioners.yaml, reusing the coder SA), and
the claude-code template pushed into all three orgs.

Verified end to end with scripts/verify-oidc-login.py: a real Keycloak login per
persona lands them in the correct org(s), group(s), and role(s), with tenant
isolation (Alpha vs Bravo vs Platform) and a cross-tenant ISSO/auditor.

New idempotent scripts:
- scripts/setup-keycloak-hierarchy.py (Keycloak Admin REST API)
- scripts/setup-coder-idp-sync.py (Coder API: orgs, groups, sync, no secrets)
- scripts/verify-oidc-login.py (real OIDC login -> org/role/group report)

Docs: docs/as-built/45-idp-sync-personas.md; updated 40-identity-keycloak.md,
as-built README, and STATUS.md.

Generated by Coder Agents.
Move the demo's runtime secrets to AWS Secrets Manager as the source of truth
and sync them into Kubernetes with the External Secrets Operator over IRSA, so
no secret material lives in git or in a local file.

- Mirror the ESO image into ECR (scripts/images.txt) and deploy ESO (chart
  2.6.0, ns external-secrets) with deploy/platform/external-secrets/values.yaml.
- IRSA role usgov-coderdemo-external-secrets: least-privilege
  secretsmanager:GetSecretValue/DescribeSecret on usgov-coderdemo/* only, no
  static keys. Codified in terraform/secrets-hardening.tf.
- Migrate the 9 runtime app secrets (coder/keycloak/gitlab) into ASM with
  scripts/migrate-secrets-to-asm.py (values passed via mode-600 temp files).
- ClusterSecretStore aws-secretsmanager + one ExternalSecret per app secret
  (dataFrom extract, creationPolicy Owner). ESO adopted the existing Secrets
  in place with byte-identical data (no app disruption); store Valid, all 9
  SecretSynced; delete/recreate recovery verified.
- EKS Secrets envelope encryption with a customer-managed KMS key is codified
  in terraform/secrets-hardening.tf but NOT applied (irreversible re-encrypt;
  needs a maintenance decision).

Docs: docs/as-built/85-secrets-management.md; updated 80-iac-vs-imperative.md,
the example secret files, STATUS.md, and the docs index.

Generated by Coder Agents.
Add three design-only plans (nothing applied to the live environment) with
companion GitHub issues, plus an index.

- plans/observability-aws-native.md: the production AWS-native target the
  in-cluster Prometheus/Grafana stack should evolve into (Amazon Managed
  Prometheus + Grafana for metrics; CloudWatch -> Firehose -> S3 -> Athena with
  an optional Amazon Security Lake OCSF path for audit/SIEM). Issues #13-#20.
  Grounded in read-only us-gov-west-1 calls: AMP managed scraper is absent in
  GovCloud (self-managed ADOT + SigV4), AMG auth via SAML to Keycloak (IAM
  Identity Center not enabled), Security Lake optional.
- plans/gitops-control-plane.md: Argo CD control plane sourced from the
  in-cluster GitLab, app-of-apps over the existing deploy/ paths, adopt-in-place
  (manual sync, no prune, no self-heal). Issues #6-#12.
- plans/gitops-adoption.md: per-workload GitOps adoption and the non-Argo state
  (Coder API via Argo PostSync Jobs, Keycloak via keycloak-config-cli, AWS stays
  Terraform). Issues #21-#29.

GitOps live migration is deliberately deferred: leave the current imperative
state in place and adopt it later.

Generated by Coder Agents.
Add an in-boundary, in-cluster observability stack and wire Coder into it, so
the demo shows live control-plane metrics and dashboards without leaving the
GovCloud boundary. The AWS-native managed variant (AMP/AMG) is planned
separately in docs/plans/ and intentionally not built here.

Stack (deploy/observability/, Helm release kps, ns monitoring):
- kube-prometheus-stack 86.2.0 (Prometheus + Grafana + operator), trimmed for
  the demo: Alertmanager, node-exporter, kube-state-metrics, bundled rules, and
  the EKS control-plane ServiceMonitors are off; the kubelet ServiceMonitor is
  kept for cAdvisor container CPU/memory. Images mirrored into ECR
  (scripts/images.txt) and the chart overridden to the mirror.
- coder-metrics.yaml: a headless Service (ns coder, :2112) selecting only the
  control-plane pod, plus ServiceMonitor/coder. Prometheus discovers it
  (serviceMonitorSelectorNilUsesHelmValues=false); up{job="coder-metrics"}=1.
- dashboards-coder.yaml: six Prometheus-backed Coder dashboards from
  github.com/coder/observability as sidecar-imported ConfigMaps, rendering live
  data. Log-only panels and the agent-boundaries dashboard are omitted (no Loki).
- grafana-ingress.yaml: host grafana.usgov.coderdemo.io behind the existing NLB
  (ACM wildcard TLS); HTTP 200 with valid TLS.

Coder server (deploy/coder/values.yaml): ADD only, to respect the coderd
AI-provider drift guard.
- CODER_PROMETHEUS_ENABLE=true, CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112,
  CODER_PROMETHEUS_COLLECT_AGENT_STATS=true.
- Structured JSON logs for SIEM readiness: CODER_LOGGING_JSON=/dev/stderr and
  CODER_LOGGING_HUMAN=/dev/null. Coder has no single CODER_LOG_FORMAT flag;
  JSON is selected by pointing CODER_LOGGING_JSON at a sink.

Secrets: the Grafana admin password lives in AWS Secrets Manager
(usgov-coderdemo/observability/grafana) and is synced into the grafana-admin
Secret by a new ExternalSecret; no password in git.

Audit: licensed audit logging is already entitled and on (/audit); the JSON
server logs make coderd shippable to a downstream SIEM.

Verified live: coder Helm rev 5 healthy (1/1); monitoring pods Running
(grafana 3/3, prometheus 2/2, operator 1/1); grafana + dev hosts return 200;
the grafana-admin ExternalSecret is SecretSynced; the Coder Control Plane
dashboard renders live data end to end.

Docs: docs/as-built/55-observability.md; updated the as-built README, the docs
index, and STATUS.md.

Generated by Coder Agents.
Make the demo one SSO: Grafana now logs in through the same Keycloak realm
(coder) as Coder, instead of local-admin only. The local admin login form is
kept enabled as break-glass.

- scripts/setup-grafana-oidc.py (idempotent): register a confidential OIDC
  client `grafana` in the realm (authorization-code + PKCE S256, redirect
  https://grafana.usgov.coderdemo.io/login/generic_oauth) with the same
  full-path `groups` group-membership mapper the coder client uses, then read
  the client secret and upsert it to AWS Secrets Manager at
  usgov-coderdemo/observability/grafana-oauth.
- ESO ExternalSecret grafana-oauth (ns monitoring) syncs that secret into a
  Kubernetes Secret; Grafana consumes it via the env var
  GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET (grafana.envValueFrom), so no secret is in
  git.
- kube-prometheus-stack-values.yaml: enable [auth.generic_oauth] against the
  realm auth/token/userinfo endpoints (scopes openid email profile) and map
  group membership to a Grafana org role:
  contains(groups[*], '/platform') && 'Admin' || 'Viewer'. allow_sign_up
  auto-provisions users; allow_assign_grafana_admin is off so the server-admin
  flag stays local.

Verified live (helm release kps upgraded, Grafana rolled out): the login page
shows "Sign in with Keycloak"; /login/generic_oauth redirects to the realm with
client_id=grafana and PKCE; a headless authorization-code login per persona
confirms role mapping (pat.platform in /platform -> Admin, /api/org/users 200;
dana.dev in /alpha -> Viewer, /api/org/users 403), both authLabels Generic
OAuth and isExternallySynced.

Docs: docs/as-built/55-observability.md and deploy/observability/README.md gain
an SSO section; STATUS.md notes the one-SSO Grafana login.

Generated by Coder Agents.
Make GitLab sign in through the same Keycloak realm (coder) as Coder and
Grafana, and give the demo a single SSO identity that is super admin across all
three. Stays on GitLab Community Edition (no EE switch).

GitLab SSO (deploy/gitlab/statefulset.yaml):
- OmniAuth openid_connect provider in GITLAB_OMNIBUS_CONFIG (auth-code + PKCE,
  uid_field preferred_username, JIT sign-on). Auto-redirect is intentionally not
  set so the local root form remains as break-glass.
- scripts/setup-gitlab-oidc.py registers the confidential realm client `gitlab`
  and stores its secret in AWS Secrets Manager (usgov-coderdemo/gitlab/oidc);
  ESO syncs it to the gitlab-oidc Secret, injected as GITLAB_OIDC_CLIENT_SECRET.

CE role limitation, handled explicitly:
- GitLab CE does not implement OIDC group-to-role assignment (admin_groups is an
  EE feature; this gitlab-ce image has no openid_connect group code path). The
  admin_groups line is left as a documented no-op (EE-forward-compatible).
- scripts/setup-gitlab-users.py (idempotent, gitlab-rails) populates the eight
  personas, links each openid_connect identity (extern_uid = preferred_username),
  and sets GitLab instance admin only on pat.platform, mirroring the Coder
  org-admin mapping and preserving tenant isolation.

Unified super admin:
- scripts/grant-coder-owner.py grants the Coder site Owner role to pat.platform
  (site roles are not claim-driven and persist across logins). With the GitLab
  admin flag and the existing Grafana /platform -> Admin mapping, the single SSO
  identity pat.platform is super admin in Coder, GitLab, and Grafana.
- Local break-glass admins remain per app; GitLab root was given a known
  password (stored in ASM usgov-coderdemo/gitlab/secrets root_password and the
  local secrets file), since the first-boot random root password was gone.

Verified live: pat.platform SSO -> GitLab is_admin=true (/admin 200), Coder site
roles [owner], Grafana org Admin; dana.dev -> regular/Viewer. Root login works
with the reset password.

Docs: docs/as-built/50-gitlab-scm.md gains a Keycloak SSO section and the CE
limitation; STATUS.md gains a single sign-on + super admin summary.

Generated by Coder Agents.
…orgs

The unified super admin signs in via Keycloak but only saw one Coder org,
because org membership is IdP-synced from the `groups` claim and pat.platform
was only in /platform (-> the coder org). Add pat.platform to the /alpha and
/bravo Keycloak groups (and their org-admin role subgroups) in
scripts/setup-keycloak-hierarchy.py, so org sync makes Pat a member and
organization-admin of all three orgs on login. Combined with the Coder site
Owner role and GitLab/Grafana admin, one Keycloak login is now admin across the
whole stack and the Coder org switcher shows Platform, Alpha, and Bravo.

Verified live with scripts/verify-oidc-login.py (a real OIDC login, which runs
the sync): pat.platform -> coder/alpha/bravo all organization-admin, site
roles [owner]. Tenant isolation is unchanged for the mission-partner personas.

Docs: STATUS.md and docs/as-built/45-idp-sync-personas.md updated to reflect
pat.platform as the all-orgs super admin (deliberate exception to isolation).

Generated by Coder Agents.
Add a dedicated operator account austen.platform (its own SUPERADMIN_PASSWORD)
that is super admin across the stack through a single Keycloak login: Coder site
Owner plus org-admin in all three orgs, GitLab instance admin, and Grafana org
Admin (via the /platform group rule). Revert pat.platform to a normal Platform
lead persona: Platform org-admin only, no Coder site Owner, not a GitLab admin.

- setup-keycloak-hierarchy.py: add austen.platform in the platform/alpha/bravo
  org and org-admins groups with a per-user password_env (SUPERADMIN_PASSWORD);
  trim pat.platform back to the /platform groups.
- setup-gitlab-users.py: provision austen.platform as instance admin, mark every
  demo persona (including pat.platform) a regular user, and support per-persona
  password env over stdin.
- grant-coder-owner.py: default target is now austen.platform.
- Docs (STATUS.md, 45-idp-sync-personas.md, 50-gitlab-scm.md,
  55-observability.md): describe the operator super admin and the pat.platform
  revert.

Verified live via headless SSO: austen is owner and org-admin in all orgs, a
GitLab admin, and a Grafana Admin; pat is org-admin in coder only, no Coder site
role, and a GitLab non-admin.

Generated by Coder Agents.
Give the operator super admin austen.platform the Keycloak webauthn-register and
CONFIGURE_TOTP required actions so its first Keycloak sign in forces WebAuthn
passkey and TOTP enrollment. The actions are applied only while the matching
credential is missing, so reconciles never re-force enrollment.

- setup-keycloak-hierarchy.py: add required_actions to the austen.platform spec
  plus an ensure_required_actions() reconciler keyed on existing credentials.
- Docs (STATUS.md, 45-idp-sync-personas.md): note the enforced enrollment and
  that the headless verify probe no longer applies to austen.platform.

Applied live: austen.platform requiredActions=[CONFIGURE_TOTP, webauthn-register]
with only a password credential; the other personas are unaffected.

Generated by Coder Agents.
…Coder Demo

Make the Coder dashboard application_name configurable via APP_NAME (default
"USGOV Coder Demo") instead of empty, so the demo deployment shows a branded
name in the UI title and login page. Applied live via PUT /api/v2/appearance;
the UNCLASSIFIED announcement banner is preserved.

Generated by Coder Agents.
Deploy a lean single-binary Grafana Loki (filesystem gp3 PVC, tsdb schema v13,
168h retention) and a node-level Promtail DaemonSet into the monitoring
namespace, with both images mirrored to ECR. Add a Grafana datasource ConfigMap
with uid "loki" so the generated Coder dashboards' log panels resolve to the live
log store instead of erroring.

Clean up the coder-status dashboard: replace the upstream LGTM "Observability
Tools" row (distributed Loki, Grafana Agent, config reloaders, storage/CPU/RAM)
with Prometheus, Loki, and Promtail up panels, and repoint the Workspace Builds
and Postgres panels to coderd_* metrics that exist in this stack. Update the
observability README, STATUS.md, and the as-built doc.

Generated by Coder Agents.
Add a single Grafana dashboard (uid ai-governance, title "AI Governance")
covering both the AI Gateway (AI Bridge) and the Agent Firewall (Boundary),
replacing the two missing add-on dashboards.

New ConfigMap deploy/observability/dashboards-ai-governance.yaml (ns
monitoring, label grafana_dashboard: "1") so it never conflicts with
dashboards-coder.yaml. AI Gateway panels read coder_aibridged_* (configured
providers, reload health, provider inventory) and stream AI Bridge logs from
Loki. Agent Firewall panels read agent_boundary_log_proxy_batches_forwarded_total
and stream Boundary logs from Loki. Every panel targets datasource uid
prometheus or uid loki.

All ten query panels verified HTTP 200 through Grafana /api/ds/query; usage
panels read 0 or stay sparse until live AI traffic occurs (placeholder
Anthropic key). Documents the dashboard in docs/as-built/55-observability.md
and STATUS.md.

Generated by Coder Agents.
@ausbru87 ausbru87 merged commit 34d43d8 into main Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant