From d4d3a9d54a7b7dfd5a389e79cef7d8488bf1b050 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 08:05:47 +0000 Subject: [PATCH 01/16] feat: deploy app layer (Coder, Keycloak, GitLab, AI Gateway) on GovCloud EKS Brings up and validates the full demo stack on the live us-gov-west-1 cluster: - Coder v2.34.0 (Helm) with Keycloak OIDC SSO, AI Governance license, and AI Gateway providers (anthropic + anthropic-bedrock via IRSA). - Keycloak 26.6.3 with realm `coder` import (client + demo user). - GitLab CE 19.0.1 single-container (embedded Postgres). - claude-code workspace template (Coder Agents + Claude Code + AgentAPI). - Platform layer: ingress-nginx + internet-facing NLB (AWS LB Controller), EBS CSI IRSA, gp3 StorageClass, RDS roles/dbs, workspace RBAC. Fixes applied during bring-up: - ingress-nginx: aws-load-balancer-type=external (standard EKS, not Auto Mode). - keycloak realm: drop non-standard _comment_* keys that break realm import. - coder values: AI provider name must be `anthropic` (AI Gateway routes by provider name; the claude-code module hardcodes /api/v2/aibridge/anthropic). - claude-code template: allow_privilege_escalation=true so the agentapi module can sudo-install to /usr/local/bin. - gitlab: gp3 StorageClass; remove mattermost key (removed in GitLab 19.0); add VPC CIDR to monitoring_whitelist so kubelet health probes pass. NOTE: EKS Auto Mode node provisioning is broken in this GovCloud account, so the cluster runs as standard EKS. See STATUS.md and deploy/platform/README.md for the deviations to reconcile into Terraform. Authored by Coder Agents on behalf of @ausbru87. --- .gitignore | 1 + STATUS.md | 97 ++++- coder-templates/claude-code/README.md | 191 +++++++++ coder-templates/claude-code/main.tf | 447 ++++++++++++++++++++++ deploy/CONVENTIONS.md | 109 ++++++ deploy/coder/README.md | 185 +++++++++ deploy/coder/secrets.example.yaml | 55 +++ deploy/coder/values.yaml | 169 ++++++++ deploy/gitlab/README.md | 176 +++++++++ deploy/gitlab/ingress.yaml | 36 ++ deploy/gitlab/secrets.example.yaml | 25 ++ deploy/gitlab/service.yaml | 19 + deploy/gitlab/statefulset.yaml | 210 ++++++++++ deploy/keycloak/README.md | 148 +++++++ deploy/keycloak/deployment.yaml | 174 +++++++++ deploy/keycloak/ingress.yaml | 33 ++ deploy/keycloak/kustomization.yaml | 26 ++ deploy/keycloak/realm-coder.json | 75 ++++ deploy/keycloak/secrets.example.yaml | 47 +++ deploy/keycloak/service.yaml | 21 + deploy/platform/README.md | 107 ++++++ deploy/platform/ingress-nginx-values.yaml | 39 ++ deploy/platform/nodepool.yaml | 60 +++ deploy/platform/workspace-rbac.yaml | 35 ++ scripts/images.txt | 17 +- 25 files changed, 2477 insertions(+), 25 deletions(-) create mode 100644 coder-templates/claude-code/README.md create mode 100644 coder-templates/claude-code/main.tf create mode 100644 deploy/CONVENTIONS.md create mode 100644 deploy/coder/README.md create mode 100644 deploy/coder/secrets.example.yaml create mode 100644 deploy/coder/values.yaml create mode 100644 deploy/gitlab/README.md create mode 100644 deploy/gitlab/ingress.yaml create mode 100644 deploy/gitlab/secrets.example.yaml create mode 100644 deploy/gitlab/service.yaml create mode 100644 deploy/gitlab/statefulset.yaml create mode 100644 deploy/keycloak/README.md create mode 100644 deploy/keycloak/deployment.yaml create mode 100644 deploy/keycloak/ingress.yaml create mode 100644 deploy/keycloak/kustomization.yaml create mode 100644 deploy/keycloak/realm-coder.json create mode 100644 deploy/keycloak/secrets.example.yaml create mode 100644 deploy/keycloak/service.yaml create mode 100644 deploy/platform/README.md create mode 100644 deploy/platform/ingress-nginx-values.yaml create mode 100644 deploy/platform/nodepool.yaml create mode 100644 deploy/platform/workspace-rbac.yaml diff --git a/.gitignore b/.gitignore index c99652c..2f9f6ac 100644 --- a/.gitignore +++ b/.gitignore @@ -14,3 +14,4 @@ secrets/ *.pem *.key *.tfvars.local +.substrate-outputs.json diff --git a/STATUS.md b/STATUS.md index 2c88337..7c02ae0 100644 --- a/STATUS.md +++ b/STATUS.md @@ -1,34 +1,91 @@ # Demo build status Single source of progress truth for the lean Coder+AI GovCloud demo. -Plan: see chat plan file. Target: `us-gov-west-1`, `usgov.coderdemo.io`. +Target: `us-gov-west-1`, `usgov.coderdemo.io`. Account `430737322961`. + +> Overnight autonomous build by Coder Agents. **The full stack is deployed and +> running.** One action remains before AI responses work end to end: drop a real +> Anthropic API key into the `anthropic` AI provider (see "Remaining action"). + +## Live environment + +| Service | URL | Auth / notes | +|---|---|---| +| Coder | https://dev.usgov.coderdemo.io | Owner login (password) or "Sign in with Keycloak" (OIDC). | +| Keycloak | https://auth.usgov.coderdemo.io | Realm `coder` imported; admin console at `/admin`. | +| GitLab | https://gitlab.usgov.coderdemo.io | root + `GITLAB_ROOT_PASSWORD` (embedded Postgres). | + +All credentials generated overnight are in **`~/.config/usgov-coderdemo/generated-secrets.env`** +(gitignored, mode 600): Coder owner, Keycloak admin, Keycloak `demo` user, DB +passwords, and the Coder<->Keycloak OIDC client secret. The GitLab root password +is `GITLAB_ROOT_PASSWORD` in `~/.config/usgov-coderdemo/env`. ## Foundations - [x] GovCloud creds (`demoenv-usgov`, acct 430737322961) - [x] Service quotas verified healthy - [x] ACM cert issued + sufficient (`*.usgov.coderdemo.io`) - [x] Route53 zone `Z06701704WFETYIRU5C8` + NS delegation LIVE -- [ ] Bedrock Claude Sonnet 4.5 model access (needs Anthropic agreement via the account PAIRED with GovCloud) — BLOCKED on identifying paired account -- [x] Bedrock path proven: `amazon.nova-pro-v1:0` invokes in GovCloud (fallback model if Claude slips) - -## Build (T0 substrate) -- [x] Terraform backend: S3 (versioned/encrypted) + DynamoDB lock created -- [x] VPC (single, 3 AZ, 1 NAT) — authored + validates -- [x] EKS (Auto Mode, k8s 1.36) + cluster/node IAM + admin access entry — authored + validates -- [x] RDS PostgreSQL 18.4 (Multi-AZ instance) — authored + validates -- [x] IRSA OIDC + Bedrock IAM role (coder/coder SA -> bedrock:InvokeModel allowlist) — authored + validates -- [x] Outputs (cluster, oidc, bedrock role, rds, ecr registry) — authored -- [x] `terraform plan` clean: **39 to add, 0 change, 0 destroy** — awaiting user go-ahead to apply -- [ ] ECR repos + mirrored images (repos auto-created by `scripts/mirror-images.sh`) -- [ ] NLB + ingress controller (cert wired) — post-apply (Helm) +- [x] DNS: `dev` / `auth` / `gitlab` / `*` alias A records -> ingress NLB +- [ ] Bedrock Claude Sonnet 4.5 model access (needs Anthropic agreement via the + account PAIRED with GovCloud) — still gated +- [x] Bedrock fallback proven: `amazon.nova-pro-v1:0` invokes in GovCloud + +## Substrate (Terraform, applied — PR #4 merged) +- [x] VPC (single, 3 AZ, 1 NAT), RDS PostgreSQL 18.4, ECR, IRSA OIDC + Bedrock role +- [x] EKS cluster `usgov-coderdemo` (k8s 1.36) +- [x] ECR repos + 4 mirrored images (+ `docker-hub/library/postgres:18-alpine` for db bootstrap) + +> **Deviation from Terraform (reconcile later):** EKS Auto Mode node provisioning +> is broken in this GovCloud account (the AWS service-linked role +> `AWSServiceRoleForAmazonEKS` lacks `iam:AddRoleToInstanceProfile`/`TagInstanceProfile`, +> so NodeClass validation never succeeds). Auto Mode was disabled and the cluster +> converted to **standard EKS**. See "Deviations to reconcile into Terraform". + +## Platform (live cluster) +- [x] 3x m5.xlarge managed node group `mng` (node role `usgov-coderdemo-mngnode`), k8s 1.36 +- [x] Addons: vpc-cni, kube-proxy, coredns, aws-ebs-csi-driver (IRSA role `usgov-coderdemo-ebs-csi`) +- [x] `gp3` default StorageClass (encrypted, WaitForFirstConsumer) +- [x] aws-load-balancer-controller + ingress-nginx -> internet-facing NLB (ACM TLS termination) +- [x] In-cluster NLB hairpin to the public hostnames verified (valid TLS) — OIDC + agents work server-side +- [x] RDS roles/dbs: `coder` (owns db `coder`), `keycloak` (owns db `keycloak`); `rds.force_ssl=1` ## Apps (T1) -- [ ] Coder (Keycloak OIDC, `dev.`) -- [ ] Keycloak (`auth.`) -- [ ] GitLab single-container (`gitlab.`) -- [ ] AI Gateway -> Bedrock (Claude Sonnet 4.5) -- [ ] Workspace template with Coder Agents + Claude Code -- [ ] Test workspace validated +- [x] Keycloak (`auth.`) — realm `coder` imported; authorize flow for client `coder` returns the login page +- [x] Coder (`dev.`) v2.34.0 — licensed (AI Governance add-on + premium, entitled+enabled); OIDC SSO live +- [x] AI Gateway providers (DB-managed): `anthropic` (direct, enabled) + `anthropic-bedrock` (IRSA, enabled) +- [x] AI Gateway routing verified end to end: `POST /api/v2/aibridge/anthropic/v1/messages` + reaches api.anthropic.com (currently 502 "keys failed authentication" — placeholder key) +- [x] Template `claude-code` pushed; test workspace built, agent connected + healthy, + Claude Code + AgentAPI + code-server installed +- [x] GitLab single-container (`gitlab.`) — embedded Postgres; first boot can take ~15-20 min +- [ ] **Real Anthropic key in the `anthropic` provider** (see below) — only thing gating live AI + +## Remaining action (to make AI respond) + +The AI path is fully wired but seeded with a **placeholder** Anthropic key (no +real key was available in the environment overnight). To finish: + +1. Sign in to https://dev.usgov.coderdemo.io as the owner (creds in + `generated-secrets.env`). +2. Go to **Admin settings > AI > Providers** (`/ai/settings`). +3. Edit the provider named **`anthropic`** and replace its API key with the real + `sk-ant-...` key. (Do this in the UI, **not** by editing the `coder-ai` + k8s secret — the provider config lives in the database now.) +4. Re-run the routing check; it should return 200. + +Alternative (in-boundary): enable **Bedrock** Claude Sonnet 4.5 model access in +the GovCloud console, then point Claude Code at the `anthropic-bedrock` provider +(rename it to `anthropic`, or set the workspace model). Bedrock access is still +gated; Nova Pro is the proven fallback. + +## Deviations to reconcile into Terraform +1. Auto Mode disabled; standard managed node group `mng` (3x m5.xlarge, `AL2023_x86_64_STANDARD`). +2. New node role `usgov-coderdemo-mngnode` (worker/CNI/ECR/SSM/EBS policies). The + original Auto Mode node role `usgov-coderdemo-node` is left untouched/unused. +3. EBS CSI IRSA role `usgov-coderdemo-ebs-csi` + addon `service-account-role-arn`. +4. Self-managed addons (vpc-cni, kube-proxy, coredns, aws-ebs-csi-driver) and `gp3` StorageClass. +5. ingress-nginx + aws-load-balancer-controller (Helm) replacing the Auto Mode NLB path. +6. Workspace RBAC: `deploy/platform/workspace-rbac.yaml` (coder SA -> coder-workspaces ns). ## Out of scope (demo) OpenShift, Istio, observability, full identity sync. diff --git a/coder-templates/claude-code/README.md b/coder-templates/claude-code/README.md new file mode 100644 index 0000000..3efa388 --- /dev/null +++ b/coder-templates/claude-code/README.md @@ -0,0 +1,191 @@ +# Claude Code on Coder Agents (GovCloud demo template) + +Coder workspace template that runs **Claude Code as a Coder Agent** inside a +Kubernetes pod on the EKS cluster, wired through the **Coder AI Gateway (AI +Bridge)**. The workspace never holds a raw Anthropic API key: every request is +proxied through Coder using the workspace owner's session token and routed to +the configured provider (Anthropic-direct primary, Bedrock secondary) +in-boundary. + +Launching the template as a **Coder Task** opens the Claude Code chat UI and +seeds the agent with the task prompt. + +- `main.tf` — the template (providers `coder` + `kubernetes`). +- Workspace image: `codercom/enterprise-base:ubuntu-noble-20260601`, pulled + from the ECR mirror. + +## What's inside + +| Piece | Resource | Notes | +|---|---|---| +| Agent | `coder_agent.main` | startup script, metadata, `display_apps` (VS Code Desktop, web terminal, SSH) | +| Claude Code | `module.claude_code` (`registry.coder.com/coder/claude-code/coder` **4.7.3**) | `enable_aibridge = true`, bundles AgentAPI + Claude Code web app, outputs `task_app_id` | +| Coder Task | `coder_ai_task.claude_code` | binds the Task UI to the Claude Code app; only created in a Task context | +| Browser IDE | `module.code_server` (`code-server` 1.3.1) | extra `coder_app` tile | +| Compute | `kubernetes_pod_v1.workspace` + `kubernetes_persistent_volume_claim_v1.home` | sizing from `cpu` / `memory` / `disk_size` parameters | +| AI auth | `coder_env.anthropic_auth_token` | exports `ANTHROPIC_AUTH_TOKEN` = session token | + +Parameters: `cpu`, `memory`, `disk_size`, and `ai_prompt` (fallback prompt for +non-Task builds). + +## AI Gateway wiring (end to end) + +1. The `claude_code` module is configured with `enable_aibridge = true`. On the + agent it sets: + - `ANTHROPIC_BASE_URL = /api/v2/aibridge/anthropic` + - `CLAUDE_API_KEY = ` + + With `CODER_ACCESS_URL=https://dev.usgov.coderdemo.io` the base URL resolves + to `https://dev.usgov.coderdemo.io/api/v2/aibridge/anthropic`. +2. This template additionally exports `ANTHROPIC_AUTH_TOKEN` (the same session + token) to match the AI Gateway client contract in `deploy/CONVENTIONS.md`. +3. Claude Code calls `ANTHROPIC_BASE_URL`. The Coder AI Gateway authenticates + the session token, applies governance/audit, and forwards the request to the + active provider: + - **Anthropic-direct** (primary) — egress via the NAT gateway. + - **Bedrock** (secondary) — IRSA on the `coder/coder` service account, model + `us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0`, in-region only. + +No Anthropic key is stored in the workspace; the session token is the only +credential and it is scoped to the workspace owner. + +### Model selection + +Model is left at the module default on purpose, because the requested model +name must match whichever provider the Gateway has live: + +- Anthropic-direct: an Anthropic id, e.g. `claude-sonnet-4-5-20250929`. +- Bedrock (GovCloud): the inference profile + `us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0`. + +Pin one by uncommenting `model = "..."` in the module block once the live +provider is confirmed. Bedrock Claude access was still gated at authoring time +(see `STATUS.md`), so the safe default is to let Claude Code/Gateway negotiate. + +### Why module 4.7.3 and `enable_aibridge` (not `enable_ai_gateway`) + +Verified against the Coder registry: + +- `deploy/CONVENTIONS.md` and `versions.lock.yaml` pin the claude-code module + to **4.7.3**. +- In **4.7.x the input is `enable_aibridge`**. The `enable_ai_gateway` rename + (and an `ANTHROPIC_AUTH_TOKEN` the module sets itself) only appear in the + **5.x** line. +- The 5.x refactor **removed** the bundled AgentAPI integration and the + `task_app_id` output, which `coder_ai_task` requires. Staying on 4.7.3 is what + makes the Coder Tasks wiring in this template work. + +If the project later moves to claude-code 5.x, switch `enable_aibridge` → +`enable_ai_gateway`, drop the explicit `coder_env.anthropic_auth_token`, and add +a standalone `agentapi` module to supply `task_app_id` for `coder_ai_task`. + +## Cluster prerequisites + +The platform layer (Coder server + ingress + namespaces) is out of scope for +this directory. Before pushing/using the template, ensure: + +1. **Coder server** 2.34.0 with the AI Governance add-on license and the AI + Gateway providers configured (Anthropic-direct + Bedrock). See + `deploy/coder/`. +2. **Wildcard access URL** set so subdomain apps work + (`CODER_WILDCARD_ACCESS_URL=*.usgov.coderdemo.io`). The Claude Code web app + and code-server use `subdomain = true`. +3. **Workspaces namespace** exists: + + ```bash + kubectl create namespace coder-workspaces + ``` + +4. **Provisioner RBAC** — the Coder provisioner (service account `coder` in the + `coder` namespace) must be able to manage pods/PVCs in `coder-workspaces`. + Example (apply with the platform layer, not from this directory): + + ```yaml + apiVersion: rbac.authorization.k8s.io/v1 + kind: Role + metadata: + name: coder-workspace-provisioner + namespace: coder-workspaces + rules: + - apiGroups: [""] + resources: ["pods", "persistentvolumeclaims"] + verbs: ["create", "get", "list", "watch", "update", "patch", "delete"] + - apiGroups: [""] + resources: ["pods/exec", "pods/log"] + verbs: ["get", "create"] + - apiGroups: [""] + resources: ["events"] + verbs: ["get", "list", "watch"] + --- + apiVersion: rbac.authorization.k8s.io/v1 + kind: RoleBinding + metadata: + name: coder-workspace-provisioner + namespace: coder-workspaces + roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: coder-workspace-provisioner + subjects: + - kind: ServiceAccount + name: coder + namespace: coder + ``` + +5. **Image pull** — the EKS node IAM role needs ECR read + (`ecr:GetAuthorizationToken`, `ecr:BatchGetImage`, + `ecr:GetDownloadUrlForLayer`) for + `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com`. With that on the node + role, no `imagePullSecret` is required on the pod. The image must already be + mirrored into ECR (`scripts/mirror-images.sh`). + +## Pushing the template + +From the repo root: + +```bash +# First time: create the template. +coder templates push claude-code \ + --directory coder-templates/claude-code \ + --variable namespace=coder-workspaces + +# Subsequent updates push a new version. +coder templates push claude-code \ + --directory coder-templates/claude-code +``` + +Override the image or namespace at push time if needed: + +```bash +coder templates push claude-code \ + --directory coder-templates/claude-code \ + --variable namespace=coder-workspaces \ + --variable workspace_image=430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/codercom/enterprise-base:ubuntu-noble-20260601 +``` + +Template variables: + +| Variable | Default | Purpose | +|---|---|---| +| `namespace` | `coder-workspaces` | namespace for workspace pods | +| `workspace_image` | ECR-mirrored `enterprise-base` | workspace container image | +| `use_kubeconfig` | `false` | use a host kubeconfig instead of in-cluster config | + +## Using it + +- **As a workspace**: create a workspace from the template, open VS Code / + terminal / code-server, and run `claude` in the workspace. +- **As a Task**: create a Coder Task from this template and enter a prompt. + Coder injects the prompt via `data.coder_task.me.prompt`, the + `coder_ai_task` resource binds the Task UI to the Claude Code app, and the + agent reports status back to the Coder UI through AgentAPI. + +## Verification status + +| Item | Source | Status | +|---|---|---| +| claude-code 4.7.3 inputs (`enable_aibridge`, `workdir`, `ai_prompt`, `report_tasks`, `subdomain`) and `task_app_id` output | module `main.tf` / `README.md` at tag `release/coder/claude-code/v4.7.3` | verified | +| `coder_ai_task.app_id` + `data.coder_task` (`enabled`, `prompt`) | `coder/terraform-provider-coder` docs; first shipped in provider **v2.13.0** | verified | +| Workspace image tag | Docker Hub `codercom/enterprise-base` | verified (`ubuntu-noble-20260601`) | +| `code-server` 1.3.1 | registry tag `release/coder/code-server/v1.3.1` | verified (latest is 1.5.0) | +| Live AI Gateway routing / Bedrock model access | runtime cluster | NOT verified here (no live infra access; Bedrock Claude access gated per `STATUS.md`) | diff --git a/coder-templates/claude-code/main.tf b/coder-templates/claude-code/main.tf new file mode 100644 index 0000000..a4d49e6 --- /dev/null +++ b/coder-templates/claude-code/main.tf @@ -0,0 +1,447 @@ +# ============================================================================= +# Claude Code on Coder Agents — GovCloud demo workspace template +# ============================================================================= +# Runs Claude Code as a Coder Agent inside a Kubernetes pod on the EKS +# cluster. Claude Code is wired through the Coder AI Gateway (AI Bridge) +# so the workspace never holds a raw Anthropic key: requests are proxied +# through Coder using the workspace owner's session token and routed to +# the configured provider (Anthropic-direct primary / Bedrock secondary) +# in-boundary. +# +# Launching this template as a Coder Task surfaces the Claude Code chat UI +# (via the bundled AgentAPI app) and seeds the agent with the task prompt. +# +# VERSION / INPUT NAMING — verified against the Coder registry: +# - claude-code module is pinned to 4.7.3 (the version in +# deploy/CONVENTIONS.md / versions.lock.yaml). +# - In 4.7.3 the AI Gateway input is named `enable_aibridge` (NOT +# `enable_ai_gateway`). The `enable_ai_gateway` rename landed in the +# 5.x line, which also REMOVED the bundled AgentAPI integration and +# the `task_app_id` output that `coder_ai_task` depends on. Staying on +# 4.7.3 is what makes the Coder Tasks wiring below possible. +# - `enable_aibridge = true` makes the module set, on the agent: +# ANTHROPIC_BASE_URL = /api/v2/aibridge/anthropic +# CLAUDE_API_KEY = +# With CODER_ACCESS_URL=https://dev.usgov.coderdemo.io the base URL +# resolves to https://dev.usgov.coderdemo.io/api/v2/aibridge/anthropic. +# - We additionally export ANTHROPIC_AUTH_TOKEN (session token) to match +# the AI Gateway client contract in deploy/CONVENTIONS.md. +# +# See README.md for the end-to-end AI Gateway wiring and cluster +# prerequisites (namespace + provisioner RBAC). +# ============================================================================= + +terraform { + required_providers { + coder = { + source = "coder/coder" + # `data.coder_task` and `coder_ai_task.app_id` require provider >= 2.13.0. + version = ">= 2.13.0" + } + kubernetes = { + source = "hashicorp/kubernetes" + version = ">= 2.23" + } + } +} + +# ----------------------------------------------------------------------------- +# Providers +# ----------------------------------------------------------------------------- + +provider "coder" {} + +variable "use_kubeconfig" { + type = bool + description = "Use a host kubeconfig instead of in-cluster config. Leave false when the Coder provisioner runs inside the cluster." + default = false +} + +variable "namespace" { + type = string + description = "Kubernetes namespace that hosts workspace pods. The platform layer must create this namespace and grant the provisioner RBAC (see README)." + default = "coder-workspaces" +} + +# Workspace container image (ECR mirror). +# +# Upstream ref : docker.io/codercom/enterprise-base:ubuntu-noble-20260601 +# ECR mirror : per deploy/CONVENTIONS.md the docker.io -> ECR mapping is +# docker.io/: -> /docker-hub/: +# +# codercom/enterprise-base is Coder's maintained Kubernetes workspace base +# image: runs as user `coder` (uid 1000), ships git/curl/sudo, and is the +# canonical base for Coder's official Kubernetes template. Claude Code and +# AgentAPI install as standalone binaries into $HOME/.local/bin, so no +# Node.js/npm is required in the base image. +variable "workspace_image" { + type = string + description = "Fully-qualified workspace image. Defaults to the ECR-mirrored codercom/enterprise-base." + default = "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/codercom/enterprise-base:ubuntu-noble-20260601" +} + +provider "kubernetes" { + config_path = var.use_kubeconfig ? "~/.kube/config" : null +} + +data "coder_provisioner" "me" {} +data "coder_workspace" "me" {} +data "coder_workspace_owner" "me" {} + +# Populated when the workspace is created as a Coder Task. `enabled` is +# false for a normal workspace build, and `prompt` carries the task prompt. +data "coder_task" "me" {} + +# ----------------------------------------------------------------------------- +# Parameters — sizing and the AI task prompt +# ----------------------------------------------------------------------------- + +data "coder_parameter" "cpu" { + name = "cpu" + display_name = "CPU Cores" + description = "CPU limit for the workspace pod." + type = "number" + default = "4" + mutable = true + icon = "/icon/memory.svg" + + option { + name = "2 Cores" + value = "2" + } + option { + name = "4 Cores" + value = "4" + } + option { + name = "8 Cores" + value = "8" + } +} + +data "coder_parameter" "memory" { + name = "memory" + display_name = "Memory (GB)" + description = "Memory limit for the workspace pod." + type = "number" + default = "8" + mutable = true + icon = "/icon/memory.svg" + + option { + name = "4 GB" + value = "4" + } + option { + name = "8 GB" + value = "8" + } + option { + name = "16 GB" + value = "16" + } +} + +data "coder_parameter" "disk_size" { + name = "disk_size" + display_name = "Disk Size (GB)" + description = "Persistent /home/coder volume size. Cannot be changed after creation." + type = "number" + default = "20" + mutable = false + icon = "/icon/database.svg" + + option { + name = "10 GB" + value = "10" + } + option { + name = "20 GB" + value = "20" + } + option { + name = "50 GB" + value = "50" + } +} + +# Fallback prompt for non-Task workspace builds. When the workspace is +# launched as a Coder Task, data.coder_task.me.prompt takes precedence. +data "coder_parameter" "ai_prompt" { + name = "ai_prompt" + display_name = "Initial AI Prompt" + description = "Seed prompt for Claude Code. Ignored when launched as a Coder Task (the Task prompt is used instead)." + type = "string" + default = "" + mutable = true + icon = "/icon/claude.svg" +} + +locals { + # Prefer the Coder Task prompt; fall back to the parameter for plain builds. + effective_prompt = data.coder_task.me.prompt != "" ? data.coder_task.me.prompt : data.coder_parameter.ai_prompt.value + + # For documentation/readme parity. The claude-code module derives the + # same value internally from data.coder_workspace.me.access_url. + ai_gateway_anthropic_url = "${data.coder_workspace.me.access_url}/api/v2/aibridge/anthropic" +} + +# ----------------------------------------------------------------------------- +# Agent +# ----------------------------------------------------------------------------- + +resource "coder_agent" "main" { + arch = data.coder_provisioner.me.arch + os = "linux" + + # Claude Code + AgentAPI are installed by the claude-code module's own + # coder_script (native binaries into $HOME/.local/bin). This startup + # script only normalizes PATH and signals readiness. + startup_script = <<-EOT + #!/bin/bash + set -e + touch ~/.bashrc + grep -qF '$HOME/.local/bin' ~/.profile 2>/dev/null || \ + echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.profile + echo "=== Workspace ready ===" + EOT + + env = { + EDITOR = "code" + VISUAL = "code" + + # No docker socket in the pod; opt out of devcontainer auto-detection + # so the dashboard does not hang polling `docker ps`. + CODER_AGENT_DEVCONTAINERS_ENABLE = "false" + } + + metadata { + display_name = "CPU Usage" + key = "cpu_usage" + script = "coder stat cpu" + interval = 10 + timeout = 1 + } + + metadata { + display_name = "Memory Usage" + key = "mem_usage" + script = "coder stat mem" + interval = 10 + timeout = 1 + } + + metadata { + display_name = "Disk Usage" + key = "disk_usage" + script = "coder stat disk --path /home/coder" + interval = 60 + timeout = 1 + } + + display_apps { + vscode = true + vscode_insiders = false + web_terminal = true + ssh_helper = true + port_forwarding_helper = true + } +} + +# ----------------------------------------------------------------------------- +# AI Gateway client auth +# ----------------------------------------------------------------------------- +# The claude-code module (enable_aibridge = true) already sets +# ANTHROPIC_BASE_URL and CLAUDE_API_KEY. We additionally export +# ANTHROPIC_AUTH_TOKEN with the workspace owner's session token to match +# the AI Gateway client contract documented in deploy/CONVENTIONS.md. Both +# carry the same session token, so there is no conflict; no raw Anthropic +# API key is ever placed in the workspace. +resource "coder_env" "anthropic_auth_token" { + agent_id = coder_agent.main.id + name = "ANTHROPIC_AUTH_TOKEN" + value = data.coder_workspace_owner.me.session_token +} + +# ----------------------------------------------------------------------------- +# Claude Code (Coder registry module) + Coder Task +# ----------------------------------------------------------------------------- + +module "claude_code" { + source = "registry.coder.com/coder/claude-code/coder" + version = "4.7.3" + agent_id = coder_agent.main.id + + # Required by the module: directory Claude Code runs in. Pre-created and + # trust-accepted by the module. + workdir = "/home/coder" + + # Route Claude Code through the Coder AI Gateway (AI Bridge) instead of + # talking to api.anthropic.com directly. Sets ANTHROPIC_BASE_URL + + # CLAUDE_API_KEY (session token) on the agent. Mutually exclusive with + # claude_api_key / claude_code_oauth_token. + enable_aibridge = true + + # Coder Tasks: seed the agent and report task status to the Coder UI via + # AgentAPI. Empty string for plain builds -> Claude Code starts idle. + ai_prompt = local.effective_prompt + report_tasks = true + + # Serve the Claude Code web app on a subdomain. Requires the wildcard + # access URL (*.usgov.coderdemo.io) configured on the Coder server. + subdomain = true + + # Model selection is intentionally left at the module default. With the + # AI Gateway, the requested model name must match the active provider: + # - Anthropic-direct (primary): an Anthropic model id, e.g. + # "claude-sonnet-4-5-20250929". + # - Bedrock (secondary): the GovCloud inference profile, e.g. + # "us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0". + # Pin one explicitly only after confirming which provider is live: + # model = "claude-sonnet-4-5-20250929" +} + +# Marks this workspace build as a Coder AI Task and binds the Task UI to the +# Claude Code AgentAPI app. Only created in a Task context so normal +# workspace builds are unaffected. +resource "coder_ai_task" "claude_code" { + count = data.coder_task.me.enabled ? data.coder_workspace.me.start_count : 0 + app_id = module.claude_code.task_app_id +} + +# code-server — VS Code in the browser (an additional coder_app tile). +module "code_server" { + count = data.coder_workspace.me.start_count + source = "registry.coder.com/coder/code-server/coder" + version = "1.3.1" + agent_id = coder_agent.main.id + folder = "/home/coder" + subdomain = true + order = 1 +} + +# ----------------------------------------------------------------------------- +# Kubernetes resources +# ----------------------------------------------------------------------------- + +resource "kubernetes_persistent_volume_claim_v1" "home" { + metadata { + name = "coder-${data.coder_workspace.me.id}-home" + namespace = var.namespace + labels = { + "app.kubernetes.io/name" = "coder-workspace" + "app.kubernetes.io/instance" = "coder-${data.coder_workspace.me.id}" + "app.kubernetes.io/part-of" = "coder" + } + } + wait_until_bound = false + spec { + access_modes = ["ReadWriteOnce"] + resources { + requests = { + storage = "${data.coder_parameter.disk_size.value}Gi" + } + } + } + + lifecycle { + ignore_changes = all + } +} + +resource "kubernetes_pod_v1" "workspace" { + count = data.coder_workspace.me.start_count + + metadata { + name = "coder-${data.coder_workspace.me.id}" + namespace = var.namespace + labels = { + "app.kubernetes.io/name" = "coder-workspace" + "app.kubernetes.io/instance" = "coder-${data.coder_workspace.me.id}" + "app.kubernetes.io/part-of" = "coder" + } + } + + spec { + # enterprise-base runs as the `coder` user (uid/gid 1000). + security_context { + run_as_user = 1000 + fs_group = 1000 + } + + container { + name = "dev" + image = var.workspace_image + image_pull_policy = "IfNotPresent" + command = ["sh", "-c", coder_agent.main.init_script] + + security_context { + run_as_user = 1000 + # enterprise-base grants the coder user passwordless sudo. The + # claude-code/agentapi module installs the agentapi binary to + # /usr/local/bin via sudo, which requires privilege escalation. + # Disabling it sets the kernel no_new_privs flag and breaks that + # install (and the Coder Tasks chat UI it powers). + allow_privilege_escalation = true + } + + env { + name = "CODER_AGENT_TOKEN" + value = coder_agent.main.token + } + + env { + name = "CODER_AGENT_URL" + value = data.coder_workspace.me.access_url + } + + resources { + requests = { + "cpu" = "500m" + "memory" = "${max(2, floor(data.coder_parameter.memory.value / 2))}Gi" + } + limits = { + "cpu" = "${data.coder_parameter.cpu.value}" + "memory" = "${data.coder_parameter.memory.value}Gi" + } + } + + volume_mount { + mount_path = "/home/coder" + name = "home" + read_only = false + } + } + + volume { + name = "home" + persistent_volume_claim { + claim_name = kubernetes_persistent_volume_claim_v1.home.metadata[0].name + } + } + + affinity { + pod_anti_affinity { + preferred_during_scheduling_ignored_during_execution { + weight = 1 + pod_affinity_term { + topology_key = "kubernetes.io/hostname" + label_selector { + match_expressions { + key = "app.kubernetes.io/name" + operator = "In" + values = ["coder-workspace"] + } + } + } + } + } + } + } + + # The agent token is baked into init_script; ignore_changes keeps a + # running pod intact across template re-applies / prebuild claims. + lifecycle { + ignore_changes = all + } +} diff --git a/deploy/CONVENTIONS.md b/deploy/CONVENTIONS.md new file mode 100644 index 0000000..f2921e5 --- /dev/null +++ b/deploy/CONVENTIONS.md @@ -0,0 +1,109 @@ +# App-layer conventions (the contract) + +Shared facts for every app-layer workstream. Read this before drafting. +Draft files ONLY in your assigned directory. Do not run terraform, kubectl, +helm, or aws against live infra. Do not edit `terraform/` or other +workstreams' directories. Return a concise report plus the list of container +images your workstream needs (fully-qualified upstream refs, pinned tags). + +## Account / region + +- Partition `aws-us-gov`, account `430737322961`, region `us-gov-west-1`. +- Domain: `usgov.coderdemo.io`. + +## Hostnames (single ACM cert covers `usgov.coderdemo.io` + `*.usgov.coderdemo.io`) + +| Host | Service | +|---|---| +| `dev.usgov.coderdemo.io` | Coder dashboard | +| `*.usgov.coderdemo.io` | Coder workspace apps (wildcard) | +| `auth.usgov.coderdemo.io` | Keycloak | +| `gitlab.usgov.coderdemo.io` | GitLab | + +ACM cert ARN: `arn:aws-us-gov:acm:us-gov-west-1:430737322961:certificate/7f4fc566-8efd-4aa5-b6ba-3b0c9a535d12` + +## Ingress (locked) + +One internet-facing **NLB → ingress-nginx → one ACM cert**. TLS terminates at +the NLB via the AWS LB annotations on the ingress-nginx controller Service +(`aws-load-balancer-ssl-cert` = the ACM ARN, `aws-load-balancer-type=external`, +`nlb-target-type=ip`, ssl-ports=443). Backends are plain HTTP. Each app exposes +an `Ingress` with `ingressClassName: nginx` and its host from the table above. +The platform layer (owned by the orchestrator) installs ingress-nginx and the +namespaces; your workstream only declares its own `Ingress` object. + +## Namespaces + +`coder`, `keycloak`, `gitlab`. Service accounts created per app. + +## Versions (source of truth: `versions.lock.yaml`) + +- EKS / k8s **1.36**, PostgreSQL **18.4** +- Coder **2.34.0** (Helm chart + `ghcr.io/coder/coder:v2.34.0`) +- Keycloak **26.6.3** +- GitLab CE **19.0.1** +- claude-code Coder module **4.7.3** + +## Images (ECR mirror; no pull-through in GovCloud) + +Registry: `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com`. The orchestrator +populates `scripts/images.txt` from your reported images. Mirror path mapping +(`scripts/mirror-images.sh`): + +- `docker.io/:` → `/docker-hub/:` +- `quay.io/:` → `/quay/:` +- `ghcr.io/:` → `/ghcr/:` + +Reference ECR images by the mirrored path. Report the upstream refs you used. + +## Database (RDS PostgreSQL 18.4, single instance) + +- Endpoint: `terraform -chdir=terraform output -raw rds_endpoint` (host only). +- Master creds: Secrets Manager `usgov-coderdemo/rds/master` (JSON: + `username`,`password`,`host`,`port`). Master user `dbadmin`. +- Logical databases (the orchestrator's db-init job creates these + roles): + - `coder` (already the instance default db) + - `keycloak` + - `gitlabhq_production` +- Assume each app reads its DB password from a k8s Secret named + `-db` (key `password`) that the platform layer will create. Declare the + Secret name you expect; do not invent passwords. + +## AI path (Coder AI Gateway) + +Two providers configured; AI Governance Add-On license is present. + +1. **Anthropic-direct (PRIMARY for demo reliability)** — points at + `api.anthropic.com`; egress leaves the VPC via the NAT gateway. API key + comes from a k8s Secret (key `ANTHROPIC_API_KEY`); never hardcode it. +2. **Bedrock (in-boundary, SECONDARY)** — IRSA, no static keys. The Coder + service account `coder/coder` is annotated with + `eks.amazonaws.com/role-arn: arn:aws-us-gov:iam::430737322961:role/usgov-coderdemo-coder-bedrock`. + Region `us-gov-west-1`; model + `us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0`; Nova Pro + (`amazon.nova-pro-v1:0`) is the proven fallback. Claude access is still + gated, so Bedrock may be disabled at demo time but must be wired. + +Verify exact env var / values schema against +`https://coder.com/docs/ai-coder/ai-gateway` (provider env vars like +`BEDROCK_REGION`, `BEDROCK_MODEL`, and indexed +`CODER_AI_GATEWAY___`). Client uses +`ANTHROPIC_BASE_URL=/api/v2/aibridge/anthropic` + +`ANTHROPIC_AUTH_TOKEN=`. + +## Coder server env (highlights) + +- `CODER_ACCESS_URL=https://dev.usgov.coderdemo.io` +- `CODER_WILDCARD_ACCESS_URL=*.usgov.coderdemo.io` +- OIDC via Keycloak realm `coder`, client `coder`, issuer + `https://auth.usgov.coderdemo.io/realms/coder`. + +## Directory ownership + +| Dir | Workstream | +|---|---| +| `deploy/coder/` | Coder Helm values + Ingress | +| `deploy/keycloak/` | Keycloak Deployment/Service/Ingress + realm import | +| `deploy/gitlab/` | GitLab single-container + Ingress | +| `coder-templates/claude-code/` | Workspace template (Coder Agents + Claude Code) | +| `deploy/platform/` , `scripts/images.txt` | Orchestrator (do not edit) | diff --git a/deploy/coder/README.md b/deploy/coder/README.md new file mode 100644 index 0000000..e51b349 --- /dev/null +++ b/deploy/coder/README.md @@ -0,0 +1,185 @@ +# Coder control plane (`deploy/coder/`) + +Helm values + Ingress for the Coder dashboard at `dev.usgov.coderdemo.io`, +pinned to **Coder v2.34.0** (official chart, `ghcr.io/coder/coder:v2.34.0` +mirrored to ECR). Read [`deploy/CONVENTIONS.md`](../CONVENTIONS.md) first. + +## Files + +| File | Purpose | +|------------------------|----------------------------------------------------------------| +| `values.yaml` | Helm values for the Coder chart (Deployment, SA, Service, Ingress, env). | +| `secrets.example.yaml` | Placeholder manifests for the 3 Secrets `values.yaml` consumes. | +| `README.md` | This file. | + +## What the platform layer must provide first + +This workstream declares only Coder objects. Before installing, the +orchestrator/platform layer must have: + +- The `coder` namespace. +- `ingress-nginx` installed, fronted by the internet-facing **NLB + ACM cert** + (TLS terminates at the NLB; backends are plain HTTP). +- The ECR mirror populated with `ghcr/coder/coder:v2.34.0` + (`scripts/mirror-images.sh`). +- The `coder` logical DB + role on RDS (db-init job). +- The three Secrets in `secrets.example.yaml` (`coder-db`, `coder-oidc`, + `coder-ai`), or you create them by hand (below). + +## Install order + +```sh +# 0. Context: point kubectl at the EKS cluster (platform layer owns this). + +# 1. Namespace (skip if the platform layer already created it). +kubectl create namespace coder + +# 2. Secrets. Prefer imperative creation so secrets never touch git. +# Replace placeholders with real values first. +kubectl create secret generic coder-db -n coder \ + --from-literal=url='postgres://coder:PASSWORD@RDS_ENDPOINT:5432/coder?sslmode=require' +kubectl create secret generic coder-oidc -n coder \ + --from-literal=client-secret='KEYCLOAK_CODER_CLIENT_SECRET' +kubectl create secret generic coder-ai -n coder \ + --from-literal=ANTHROPIC_API_KEY='sk-ant-...' +# (Or: edit secrets.example.yaml, then `kubectl apply -n coder -f secrets.example.yaml`.) + +# 3. Add the chart repo and install/upgrade. +helm repo add coder-v2 https://helm.coder.com/v2 +helm repo update +helm upgrade --install coder coder-v2/coder \ + --namespace coder \ + --version 2.34.0 \ + --values deploy/coder/values.yaml + +# 4. Apply the AI Governance Add-On license (see "Licensing" below). +``` + +`RDS_ENDPOINT` (host only) comes from +`terraform -chdir=terraform output -raw rds_endpoint`. At authoring time the +Terraform apply had not run (plan: 39 to add), so `secrets.example.yaml` uses a +`REPLACE_ME_RDS_ENDPOINT` placeholder. + +## How values map to the demo + +| Requirement | Where in `values.yaml` | +|---|---| +| Dashboard host `dev.usgov.coderdemo.io` | `coder.ingress.host` + `CODER_ACCESS_URL` | +| Workspace-app wildcard `*.usgov.coderdemo.io` | `coder.ingress.wildcardHost` + `CODER_WILDCARD_ACCESS_URL` (single-level, matches the one ACM cert) | +| TLS terminated upstream at NLB | `coder.ingress.tls.enable: false`, no `coder.tls.secretNames`, `ssl-redirect: "false"` | +| Sits behind ingress-nginx (no 2nd LB) | `coder.service.type: ClusterIP` (chart default is `LoadBalancer`) | +| Postgres `coder` DB | `CODER_PG_CONNECTION_URL` from Secret `coder-db` key `url` | +| Keycloak SSO | `CODER_OIDC_*` env; client secret from Secret `coder-oidc` key `client-secret` | +| Bedrock IRSA | `coder.serviceAccount.annotations[eks.amazonaws.com/role-arn]` | +| AI Gateway providers | `CODER_AI_GATEWAY_PROVIDER_0_*` (Anthropic-direct) + `_1_*` (Bedrock) | + +## AI Gateway provider schema (verified) + +Verified against **v2.34.0** source and docs: + +- Docs: and + `.../ai-gateway/providers` (the AI Gateway product was formerly "AI Bridge"; + API paths still use `/api/v2/aibridge/...`). +- Parser: `cli/server.go` `readAIProvidersForPrefix` (env prefix + `CODER_AI_GATEWAY_PROVIDER_`). +- Seeding/type resolution: `coderd/ai_providers_migrate.go` + `SeedAIProvidersFromEnv`. + +Indexed scheme is `CODER_AI_GATEWAY_PROVIDER__` (literal word +`PROVIDER`, numeric index `` starting at 0). Recognized `` keys: + +``` +TYPE # openai | anthropic | bedrock | azure | google | + # openai-compat | openrouter | vercel | copilot +NAME # unique, lowercase, hyphenated (routing id) +KEY | KEYS # bearer key(s); KEYS is comma-separated (max 5) +BASE_URL +BEDROCK_BASE_URL +BEDROCK_REGION +BEDROCK_ACCESS_KEY | BEDROCK_ACCESS_KEYS +BEDROCK_ACCESS_KEY_SECRET | BEDROCK_ACCESS_KEY_SECRETS +BEDROCK_MODEL +BEDROCK_SMALL_FAST_MODEL +``` + +Notes: + +- The convention's guessed `CODER_AI_GATEWAY___` shape is + **not** what v2.34 uses; the real prefix is `CODER_AI_GATEWAY_PROVIDER__`. +- A Bedrock provider can be declared as `TYPE=bedrock` (used here, most + self-documenting) **or** `TYPE=anthropic` with `BEDROCK_*` fields set; the + server detects "Bedrock" whenever `BEDROCK_REGION`/`BEDROCK_BASE_URL`/access + keys are present (`IsBedrockConfigured`). Both seed an equivalent provider + that routes through aibridge's Anthropic client. +- Do **not** attach a key to the Bedrock provider. With no static creds the AWS + SDK default credential chain resolves the **IRSA** web-identity token from the + annotated service account. The IAM role must allow `bedrock:InvokeModel` and + `bedrock:InvokeModelWithResponseStream` (the Terraform `coder_bedrock` role + grants exactly these for the inference profile + Nova Pro). +- Client side (set in the workspace template, not here): + `ANTHROPIC_BASE_URL=/api/v2/aibridge/anthropic` and + `ANTHROPIC_AUTH_TOKEN=`. + +### IMPORTANT: provider env vars seed the DB ONCE + +Since v2.34, AI Gateway providers live in the **database**, managed at +`https://dev.usgov.coderdemo.io/ai/settings` or the AI Providers API. The +`CODER_AI_GATEWAY_PROVIDER_*` env vars are **deprecated** and only **seed** the +DB on the first startup. After seeding: + +- The database is authoritative; editing a provider in the dashboard is **not** + overwritten by env on restart. +- **Changing a seeded env var later makes `coderd` fail to start** (drift guard). + To rotate the Anthropic key or change a model, do it in `/ai/settings`, then + update/remove the matching env var to match (or remove the env vars entirely + once seeded). + +This matters for Helm: a later `helm upgrade` that changes any +`CODER_AI_GATEWAY_PROVIDER_*` value (or the `coder-ai` secret contents) will +break startup unless you first reconcile the change in the dashboard. Treat +these values as one-time seed config. + +## Licensing (AI Governance Add-On) + +AI Gateway requires the **AI Governance Add-On** license. There is **no +`CODER_LICENSE` server env var** in v2.34 (the chart/server does not read a +license from env or a Secret). The license is a JWT applied at runtime and +stored in the DB. Apply it after install via CLI or UI: + +```sh +# CLI (as a Coder admin/owner): +coder licenses add -f /path/to/coder.lic +# or paste the JWT in the dashboard: Admin settings > Licenses > Add a license. +``` + +Confirm the add-on is active before relying on AI Gateway. `/ai/settings` is +inaccessible / providers will not serve without the add-on entitlement. + +## Open questions / risks + +1. **`coder-db` secret shape.** `values.yaml` expects Secret `coder-db` with a + full connection URL under key `url`. CONVENTIONS says the platform creates + `-db` with key `password` only. Reconcile with the platform layer: + either it also publishes `url`, or add a small step to assemble the URL from + `password` + `rds_endpoint`. (Documented in `secrets.example.yaml`.) +2. **Bedrock model access still gated.** `InvokeModel` on + `us-gov.anthropic.claude-sonnet-4-5-...` returns AccessDenied until model + access is enabled (STATUS.md gating item). The Bedrock provider is wired but + may need to be disabled at demo time from `/ai/settings`; Nova Pro + (`amazon.nova-pro-v1:0`) is the proven fallback. +3. **Claude Code + Bedrock beta header (GovCloud).** Known issue + `coder/aibridge#221`: Claude Code sends an `anthropic-beta` flag that Bedrock + in GovCloud rejects (`invalid beta flag`). This can break the + Bedrock-through-AI-Gateway path for Claude Code specifically; Anthropic-direct + (primary) is unaffected. Validate before relying on Bedrock for the live demo. +4. **IRSA STS in GovCloud.** IRSA exchanges the SA token via + `AssumeRoleWithWebIdentity`. `AWS_REGION` + `AWS_STS_REGIONAL_ENDPOINTS=regional` + are set so the SDK uses the GovCloud regional STS endpoint; verify the role + trust policy lists the cluster OIDC provider and the `coder:coder` SA once the + cluster is up. +5. **Provider seeding vs. Helm drift.** See the boxed note above. Decide the + long-term source of truth (dashboard) and keep Helm env values frozen after + first boot, or remove them post-seed. +6. **Could not verify live.** Terraform had not been applied (no AWS creds in + this sandbox), so RDS endpoint, the OIDC client secret, and IRSA end-to-end + were not exercised. Values use documented placeholders. diff --git a/deploy/coder/secrets.example.yaml b/deploy/coder/secrets.example.yaml new file mode 100644 index 0000000..403ba52 --- /dev/null +++ b/deploy/coder/secrets.example.yaml @@ -0,0 +1,55 @@ +# Example k8s Secret manifests for the Coder control plane (namespace: coder). +# +# DO NOT COMMIT REAL SECRETS. Every value below is a REPLACE_ME placeholder. +# In the real deploy these Secrets are created by the platform layer +# (orchestrator) or applied out-of-band; this file documents the exact +# names/keys that deploy/coder/values.yaml expects so the two stay in sync. +# +# Apply (after `kubectl create namespace coder`) only if the platform layer +# has not already created them: +# kubectl apply -n coder -f secrets.example.yaml # AFTER replacing values +# +# Prefer creating them imperatively so secrets never touch git, e.g.: +# kubectl create secret generic coder-db -n coder \ +# --from-literal=url='postgres://coder:PASSWORD@RDS_ENDPOINT:5432/coder?sslmode=require' +--- +apiVersion: v1 +kind: Secret +metadata: + name: coder-db + namespace: coder +type: Opaque +stringData: + # Full PostgreSQL connection URL for the `coder` database on RDS. + # Host: `terraform -chdir=terraform output -raw rds_endpoint` (host only). + # User `coder` + its password: the platform db-init job creates the role; + # the password is the one the platform stores in the `coder-db` Secret + # (CONVENTIONS says key `password`). Assemble the URL from that password, + # the RDS endpoint, port 5432, db `coder`. RDS enforces TLS, so sslmode=require. + # + # postgres://coder:@:5432/coder?sslmode=require + url: "postgres://coder:REPLACE_ME_DB_PASSWORD@REPLACE_ME_RDS_ENDPOINT:5432/coder?sslmode=require" +--- +apiVersion: v1 +kind: Secret +metadata: + name: coder-oidc + namespace: coder +type: Opaque +stringData: + # Keycloak client secret for confidential client `coder` in realm `coder` + # (Keycloak admin console: Clients > coder > Credentials). Owned by the + # deploy/keycloak workstream; coordinate the value with them. + client-secret: "REPLACE_ME_KEYCLOAK_CODER_CLIENT_SECRET" +--- +apiVersion: v1 +kind: Secret +metadata: + name: coder-ai + namespace: coder +type: Opaque +stringData: + # Anthropic API key for the PRIMARY (anthropic-direct) AI Gateway provider. + # Source: Anthropic Console (console.anthropic.com) > API Keys. Begins `sk-ant-`. + # The Bedrock (SECONDARY) provider uses IRSA, so it needs NO key here. + ANTHROPIC_API_KEY: "REPLACE_ME_sk-ant-xxxxxxxx" diff --git a/deploy/coder/values.yaml b/deploy/coder/values.yaml new file mode 100644 index 0000000..48b33ea --- /dev/null +++ b/deploy/coder/values.yaml @@ -0,0 +1,169 @@ +# Helm values for the official Coder chart, pinned to v2.34.0. +# Chart + appVersion 2.34.0 (see versions.lock.yaml / deploy/CONVENTIONS.md). +# +# Scope of this file: the Coder control-plane only. The platform layer +# (orchestrator) owns ingress-nginx, the NLB + ACM cert, the `coder` +# namespace, and the k8s Secrets referenced below. This file only declares +# Coder's Deployment, ServiceAccount, Service, and Ingress. +# +# TLS: terminates UPSTREAM at the NLB (ACM cert) via the ingress-nginx +# controller Service annotations. Traffic from nginx to Coder is plain HTTP, +# so the Ingress here requests NO certificate (no cert-manager). +# +# >>> READ deploy/coder/README.md before installing. <<< +# In particular, the AI Gateway provider env vars below are DEPRECATED as of +# v2.34: they SEED the database ONCE on first startup, then the database is +# authoritative. Changing them after seeding makes coderd fail to start. + +coder: + # Image is the upstream ghcr.io/coder/coder:v2.34.0 mirrored into private + # ECR (no pull-through cache in GovCloud). Mirror mapping from CONVENTIONS: + # ghcr.io/coder/coder:v2.34.0 + # -> 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/coder/coder:v2.34.0 + image: + repo: "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/coder/coder" + tag: "v2.34.0" + pullPolicy: IfNotPresent + + # The chart always creates a ServiceAccount named `coder`. Annotate it for + # IRSA so the AI Gateway Bedrock provider can call bedrock:InvokeModel and + # bedrock:InvokeModelWithResponseStream with temporary credentials (no static + # AWS keys). Role authored in terraform/ (output: bedrock_role_arn). + serviceAccount: + name: coder + workspacePerms: true + enableDeployments: true + annotations: + eks.amazonaws.com/role-arn: "arn:aws-us-gov:iam::430737322961:role/usgov-coderdemo-coder-bedrock" + + # Coder sits BEHIND ingress-nginx, so its own Service must NOT provision a + # second load balancer. Override the chart default (LoadBalancer) to ClusterIP. + service: + enable: true + type: ClusterIP + + # We set CODER_ACCESS_URL explicitly below, so do not let the chart inject a + # cluster-internal access URL. + envUseClusterAccessURL: false + + # One internet-facing NLB -> ingress-nginx -> this Ingress. TLS already + # terminated at the NLB, so tls.enable=false (plain HTTP backend, no cert + # requested here). + ingress: + enable: true + className: "nginx" + host: "dev.usgov.coderdemo.io" + wildcardHost: "*.usgov.coderdemo.io" + tls: + enable: false + annotations: + # TLS is terminated upstream at the NLB; do not redirect to HTTPS at nginx + # (backend is plain HTTP) to avoid a redirect loop. + nginx.ingress.kubernetes.io/ssl-redirect: "false" + # Coder relies on long-lived websockets (web terminal, agent, logs) and + # streams large payloads; relax nginx proxy limits accordingly. + nginx.ingress.kubernetes.io/proxy-read-timeout: "86400" + nginx.ingress.kubernetes.io/proxy-send-timeout: "86400" + nginx.ingress.kubernetes.io/proxy-body-size: "0" + + env: + # --- Core access URLs ------------------------------------------------- + - name: CODER_ACCESS_URL + value: "https://dev.usgov.coderdemo.io" + # Single-level wildcard so the one ACM cert covers dashboard + apps. + - name: CODER_WILDCARD_ACCESS_URL + value: "*.usgov.coderdemo.io" + + # --- Database --------------------------------------------------------- + # Full libpq/URL connection string for the `coder` DB on RDS, supplied by + # the platform layer as Secret `coder-db` (key `url`). See secrets.example.yaml. + - name: CODER_PG_CONNECTION_URL + valueFrom: + secretKeyRef: + name: coder-db + key: url + + # --- Keycloak OIDC SSO ------------------------------------------------ + - name: CODER_OIDC_ISSUER_URL + value: "https://auth.usgov.coderdemo.io/realms/coder" + - name: CODER_OIDC_CLIENT_ID + value: "coder" + - name: CODER_OIDC_CLIENT_SECRET + valueFrom: + secretKeyRef: + name: coder-oidc + key: client-secret + # Standard OIDC claims emitted by a Keycloak realm. + - name: CODER_OIDC_SCOPES + value: "openid,profile,email" + - name: CODER_OIDC_EMAIL_FIELD + value: "email" + - name: CODER_OIDC_USERNAME_FIELD + value: "preferred_username" + # Demo: let SSO users self-provision on first login. + - name: CODER_OIDC_ALLOW_SIGNUPS + value: "true" + - name: CODER_OIDC_SIGN_IN_TEXT + value: "Sign in with Keycloak" + + # --- AI Gateway (AI Governance Add-On) -------------------------------- + # AI Gateway is enabled by default in v2.34; set explicitly for clarity. + # NOTE: provider env vars are deprecated and seed the DB ONCE. After the + # first successful startup, manage providers at /ai/settings. Do NOT edit + # these values in place afterward or coderd will refuse to start (drift + # guard). See README "AI Gateway provider seeding" for the safe workflow. + - name: CODER_AI_GATEWAY_ENABLED + value: "true" + + # Provider 0 = Anthropic-direct (PRIMARY, demo reliability). + # Egress to api.anthropic.com leaves the VPC via the NAT gateway. + # + # NAME MUST be "anthropic": the claude-code workspace module (4.7.3) + # hardcodes ANTHROPIC_BASE_URL=/api/v2/aibridge/anthropic, and + # the AI Gateway routes by provider NAME (verified: POST + # /api/v2/aibridge//v1/messages). A name like "anthropic-direct" + # makes that route 404, so Claude Code cannot reach the provider. + - name: CODER_AI_GATEWAY_PROVIDER_0_TYPE + value: "anthropic" + - name: CODER_AI_GATEWAY_PROVIDER_0_NAME + value: "anthropic" + - name: CODER_AI_GATEWAY_PROVIDER_0_BASE_URL + value: "https://api.anthropic.com" + - name: CODER_AI_GATEWAY_PROVIDER_0_KEY + valueFrom: + secretKeyRef: + name: coder-ai + key: ANTHROPIC_API_KEY + + # Provider 1 = Amazon Bedrock (SECONDARY, in-boundary, IRSA, NO static keys). + # `bedrock` type authenticates via the AWS SDK default credential chain, + # which picks up the IRSA web-identity token from the annotated SA above. + # Bedrock-ness is detected by BEDROCK_REGION (no default); no API key is + # attached. Claude Sonnet 4.5 access is still gated, so this provider may be + # disabled at demo time from /ai/settings, but it is wired here. + - name: CODER_AI_GATEWAY_PROVIDER_1_TYPE + value: "bedrock" + - name: CODER_AI_GATEWAY_PROVIDER_1_NAME + value: "anthropic-bedrock" + - name: CODER_AI_GATEWAY_PROVIDER_1_BEDROCK_REGION + value: "us-gov-west-1" + - name: CODER_AI_GATEWAY_PROVIDER_1_BEDROCK_MODEL + value: "us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0" + # Claude Code uses a Haiku-class "small fast model" for background tasks. + # Nova Pro is the proven in-GovCloud fallback. + - name: CODER_AI_GATEWAY_PROVIDER_1_BEDROCK_SMALL_FAST_MODEL + value: "amazon.nova-pro-v1:0" + + # --- AWS SDK / IRSA resolution (GovCloud) ----------------------------- + # Give the AWS SDK an explicit region for endpoint construction and for the + # regional STS AssumeRoleWithWebIdentity call used by IRSA. + - name: AWS_REGION + value: "us-gov-west-1" + - name: AWS_DEFAULT_REGION + value: "us-gov-west-1" + - name: AWS_STS_REGIONAL_ENDPOINTS + value: "regional" + + # Single replica for the demo. HA (replicaCount > 1) is an Enterprise feature + # and is out of scope. + replicaCount: 1 diff --git a/deploy/gitlab/README.md b/deploy/gitlab/README.md new file mode 100644 index 0000000..eb0dc7d --- /dev/null +++ b/deploy/gitlab/README.md @@ -0,0 +1,176 @@ +# GitLab CE (single-container Omnibus) — `gitlab.usgov.coderdemo.io` + +GitLab CE **19.0.1** deployed as the single-container Omnibus image (not the Helm +chart), in namespace `gitlab`, behind the shared NLB (TLS) + ingress-nginx. + +This is a **demo** footprint: one StatefulSet replica, embedded PostgreSQL/Redis, +monitoring and extra services trimmed off. + +## Topology + +``` +client ──HTTPS──> NLB (terminates TLS, ACM cert) + └─HTTP──> ingress-nginx + └─HTTP──> Service gitlab:80 + └─> Pod gitlab-0 (bundled NGINX :80 -> Workhorse/Puma) +``` + +TLS is terminated upstream, so the pod's bundled NGINX serves plain HTTP and we +force `X-Forwarded-Proto=https` so GitLab generates correct `https://` links and +does not redirect-loop. See the `GITLAB_OMNIBUS_CONFIG` block in +[`statefulset.yaml`](./statefulset.yaml). + +## Image + +| Upstream (pinned) | ECR (mirrored, used by manifests) | +|----------------------------------------|----------------------------------------------------------------------------------------------| +| `docker.io/gitlab/gitlab-ce:19.0.1-ce.0` | `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/gitlab/gitlab-ce:19.0.1-ce.0` | + +Add the upstream ref to `scripts/images.txt` (orchestrator-owned) so +`scripts/mirror-images.sh` mirrors it. No other images are required: the Omnibus +image bundles NGINX, Puma, Workhorse, Sidekiq, Gitaly, Redis, and PostgreSQL. + +## Prerequisites (owned by the platform layer, not this directory) + +- Namespace `gitlab` exists. +- ingress-nginx is installed and the NLB is wired to the ACM cert (per + `deploy/CONVENTIONS.md`). +- A gp3 StorageClass from EKS Auto Mode named **`auto-ebs-sc`** + (provisioner `ebs.csi.eks.amazonaws.com`). EKS Auto Mode does not ship a + default StorageClass, so the platform must create one. If it is named + differently, update `storageClassName` in `statefulset.yaml`. See + [Open questions](#open-questions--risks). +- The `gitlab-ce` image mirrored into ECR (above). + +## Install order + +```bash +# 1) Create the real root-password Secret (do NOT apply secrets.example.yaml as-is). +kubectl -n gitlab create secret generic gitlab-secrets \ + --from-literal=initial_root_password='' + +# 2) StatefulSet (also creates the ServiceAccount + the 3 PVCs). +kubectl apply -f statefulset.yaml + +# 3) Service + Ingress. +kubectl apply -f service.yaml +kubectl apply -f ingress.yaml + +# 4) Watch it boot (first boot runs DB migrations; allow several minutes). +kubectl -n gitlab rollout status statefulset/gitlab --timeout=20m +kubectl -n gitlab get pods -w +``` + +## First login / root password + +- User: `root` +- Password: the value you put in `gitlab-secrets.initial_root_password`. + +If you did **not** set the Secret before first boot, GitLab auto-generates a +password and writes it to a file inside the pod (valid for ~24h after first +reconfigure): + +```bash +kubectl -n gitlab exec gitlab-0 -- cat /etc/gitlab/initial_root_password +``` + +To change the password later (the Secret is ignored after first boot): + +```bash +kubectl -n gitlab exec -it gitlab-0 -- gitlab-rake "gitlab:password:reset[root]" +``` + +## Database: embedded PostgreSQL (chosen) vs shared RDS + +**Decision: use the embedded, bundled PostgreSQL** (the Omnibus default). Its data +persists under `/var/opt/gitlab/postgresql` on the `var-opt-gitlab` PVC. + +Why embedded for this single-container demo: + +- Simplest path; the Omnibus image is designed to run its own tightly-coupled + PostgreSQL + Redis. Fewest moving parts for a time-boxed demo. +- No dependency on the orchestrator's db-init job creating the + `gitlabhq_production` database, role, and `gitlab-db` Secret, and no GitLab + schema migrations run against the shared RDS instance. +- Decoupled blast radius: GitLab keeps working independently of RDS health. +- Version-safe: GitLab 19 requires **PostgreSQL 17+**; the bundled engine always + satisfies this with no drift risk. + +Tradeoff: GitLab's data is not under RDS automated backups/Multi-AZ; durability +relies on the EBS PVC plus GitLab's own backup tooling +(`gitlab-backup`). Acceptable for a demo. The shared RDS is **PostgreSQL 18.4**, +which also satisfies the 17+ minimum if you later want managed storage. + +### Switching to shared RDS (alternative, not enabled) + +If you ever need managed storage, disable the embedded Postgres and point GitLab +at RDS. Add to `GITLAB_OMNIBUS_CONFIG`, and inject the password from a +platform-provided `gitlab-db` Secret (key `password`, per `deploy/CONVENTIONS.md`): + +```ruby +postgresql['enable'] = false +gitlab_rails['db_adapter'] = 'postgresql' +gitlab_rails['db_host'] = '' +gitlab_rails['db_port'] = 5432 +gitlab_rails['db_database'] = 'gitlabhq_production' +gitlab_rails['db_username'] = 'gitlab' +gitlab_rails['db_password'] = ENV['GITLAB_DB_PASSWORD'] # from gitlab-db Secret +``` + +This requires the orchestrator to have created the `gitlabhq_production` database, +the `gitlab` role, and the `gitlab-db` Secret first. Redis would still be embedded. + +## Optional: Keycloak OIDC (do not block the demo on this) + +GitLab can SSO against Keycloak (`auth.usgov.coderdemo.io`, realm to be confirmed). +This is optional; the root login above is enough to demo GitLab. When ready, add an +`openid_connect` provider to `GITLAB_OMNIBUS_CONFIG` (sketch, verify against the +Keycloak realm/client and store the client secret in a Secret): + +```ruby +gitlab_rails['omniauth_enabled'] = true +gitlab_rails['omniauth_allow_single_sign_on'] = ['openid_connect'] +gitlab_rails['omniauth_block_auto_created_users'] = false +gitlab_rails['omniauth_providers'] = [ + { + name: 'openid_connect', + label: 'Keycloak', + args: { + name: 'openid_connect', + scope: ['openid', 'profile', 'email'], + response_type: 'code', + issuer: 'https://auth.usgov.coderdemo.io/realms/', + discovery: true, + client_auth_method: 'query', + uid_field: 'preferred_username', + client_options: { + identifier: 'gitlab', + secret: ENV['GITLAB_OIDC_CLIENT_SECRET'], + redirect_uri: 'https://gitlab.usgov.coderdemo.io/users/auth/openid_connect/callback' + } + } + } +] +``` + +## Open questions / risks + +1. **StorageClass name.** Manifests assume `auto-ebs-sc` (the AWS-documented EKS + Auto Mode gp3 class). Confirm the exact name the platform layer created; the + `deploy/CONVENTIONS.md` text says "gp3 storage class" generically. If wrong, + the three PVCs stay `Pending`. +2. **Git over SSH is not exposed.** The NLB only terminates 443; there is no + path for git+SSH (port 22). Clone/push over HTTPS works. Wire SSH later if the + demo needs it (separate NLB listener + Service of type LoadBalancer or a TCP + ingress). +3. **Resource sizing.** Requests 1 CPU / 4Gi, limits 2 CPU / 8Gi. If the node is + tight, boots get slower and OOM risk rises; tune in `statefulset.yaml`. +4. **First-boot time.** GitLab can take several minutes (migrations); the startup + probe allows ~15 min. Do not mistake a slow first boot for a failure. +5. **Keycloak realm/client** for the optional OIDC block above is unconfirmed. +6. **Backups.** With embedded Postgres, there is no managed backup. Add + `gitlab-backup` + an S3 target if the data must survive PVC loss. + +--- + +*Authored by Coder Agents. Scope: `deploy/gitlab/` only.* diff --git a/deploy/gitlab/ingress.yaml b/deploy/gitlab/ingress.yaml new file mode 100644 index 0000000..f12fa64 --- /dev/null +++ b/deploy/gitlab/ingress.yaml @@ -0,0 +1,36 @@ +--- +# GitLab Ingress. TLS is terminated upstream at the NLB with the shared ACM cert, +# so there is NO tls: block here and traffic to the backend is plain HTTP. +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: gitlab + namespace: gitlab + labels: + app: gitlab + annotations: + # Allow arbitrarily large bodies: git pushes, LFS objects, CI artifacts, + # container/registry uploads. "0" disables the limit. + nginx.ingress.kubernetes.io/proxy-body-size: "0" + # Long-running git/HTTP operations and web-terminal/websocket streams. + nginx.ingress.kubernetes.io/proxy-read-timeout: "3600" + nginx.ingress.kubernetes.io/proxy-send-timeout: "3600" + nginx.ingress.kubernetes.io/proxy-http-version: "1.1" + # TLS is terminated at the NLB; ingress-nginx receives plain HTTP, so do not + # force an HTTPS redirect here (it would loop). GitLab itself emits https URLs + # via X-Forwarded-Proto (see statefulset.yaml). + nginx.ingress.kubernetes.io/ssl-redirect: "false" + nginx.ingress.kubernetes.io/force-ssl-redirect: "false" +spec: + ingressClassName: nginx + rules: + - host: gitlab.usgov.coderdemo.io + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: gitlab + port: + number: 80 diff --git a/deploy/gitlab/secrets.example.yaml b/deploy/gitlab/secrets.example.yaml new file mode 100644 index 0000000..09eef94 --- /dev/null +++ b/deploy/gitlab/secrets.example.yaml @@ -0,0 +1,25 @@ +--- +# EXAMPLE ONLY. Copy to a real, untracked file, replace REPLACE_ME, and apply. +# Do NOT commit a real password. This Secret holds ONLY the initial root +# password, which GitLab consumes on its first boot to seed the "root" user. +# +# kubectl -n gitlab create secret generic gitlab-secrets \ +# --from-literal=initial_root_password='' +# +# Notes: +# - Must be at least 8 characters (GitLab rejects weak/short passwords). +# - Applied/changed AFTER first boot, this value is ignored: the password then +# lives in the database. Change it via the GitLab UI or a Rails console instead. +# - This is unrelated to GitLab's internally generated /etc/gitlab/gitlab-secrets.json +# (DB encryption keys), which GitLab creates itself and we persist on the +# etc-gitlab PVC. Different thing, similar name. +apiVersion: v1 +kind: Secret +metadata: + name: gitlab-secrets + namespace: gitlab + labels: + app: gitlab +type: Opaque +stringData: + initial_root_password: REPLACE_ME diff --git a/deploy/gitlab/service.yaml b/deploy/gitlab/service.yaml new file mode 100644 index 0000000..a58aa00 --- /dev/null +++ b/deploy/gitlab/service.yaml @@ -0,0 +1,19 @@ +--- +# ClusterIP service fronting the GitLab pod. ingress-nginx routes to this on :80. +# Also the StatefulSet's governing service (serviceName: gitlab). +apiVersion: v1 +kind: Service +metadata: + name: gitlab + namespace: gitlab + labels: + app: gitlab +spec: + type: ClusterIP + selector: + app: gitlab + ports: + - name: http + port: 80 + targetPort: http + protocol: TCP diff --git a/deploy/gitlab/statefulset.yaml b/deploy/gitlab/statefulset.yaml new file mode 100644 index 0000000..1ac467e --- /dev/null +++ b/deploy/gitlab/statefulset.yaml @@ -0,0 +1,210 @@ +--- +# GitLab CE 19.0.1, single-container Omnibus image (NOT the Helm chart). +# +# Topology: internet-facing NLB (terminates TLS with the ACM cert) -> ingress-nginx +# (plain HTTP) -> this pod's bundled NGINX on port 80 -> Workhorse/Puma. +# Because TLS is terminated upstream, the bundled NGINX listens on HTTP only and +# we force X-Forwarded-Proto=https so GitLab builds correct https:// URLs and does +# not redirect-loop. +# +# A StatefulSet (not a Deployment) is used because GitLab is stateful and each of +# the three persistent paths is a ReadWriteOnce EBS volume; the StatefulSet gives +# stable PVCs and never tries to run two pods against the same RWO volume. +# +# Database: EMBEDDED PostgreSQL bundled in the Omnibus image (default). Its data +# lives under /var/opt/gitlab/postgresql on the data PVC. See README.md for the +# rationale and for the shared-RDS alternative. +apiVersion: v1 +kind: ServiceAccount +metadata: + name: gitlab + namespace: gitlab + labels: + app: gitlab +# NOTE: the platform layer may already create this ServiceAccount ("service +# accounts created per app" in deploy/CONVENTIONS.md). Applying it here is +# harmless and keeps this directory self-contained. GitLab with embedded +# Postgres needs no IRSA annotation. +--- +apiVersion: apps/v1 +kind: StatefulSet +metadata: + name: gitlab + namespace: gitlab + labels: + app: gitlab +spec: + serviceName: gitlab + replicas: 1 + # Single RWO data volume: never run two pods at once. + podManagementPolicy: OrderedReady + updateStrategy: + type: RollingUpdate + selector: + matchLabels: + app: gitlab + template: + metadata: + labels: + app: gitlab + spec: + serviceAccountName: gitlab + # GitLab reconfigure can take several minutes; give it time to drain. + terminationGracePeriodSeconds: 120 + securityContext: + # The Omnibus image runs its services under runit as root by design. + # Do not force runAsNonRoot here. + fsGroup: 0 + containers: + - name: gitlab + # Mirrored to ECR per deploy/CONVENTIONS.md: + # docker.io/gitlab/gitlab-ce:19.0.1-ce.0 + # -> 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/gitlab/gitlab-ce:19.0.1-ce.0 + image: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/gitlab/gitlab-ce:19.0.1-ce.0 + imagePullPolicy: IfNotPresent + # The hostname GitLab uses for its own checks; matches external_url host. + env: + - name: GITLAB_INITIAL_ROOT_PASSWORD + valueFrom: + secretKeyRef: + name: gitlab-secrets + key: initial_root_password + # optional so the pod still starts on later restarts (after the + # first boot the root password lives in the database, not here). + optional: true + - name: GITLAB_OMNIBUS_CONFIG + value: |- + external_url 'https://gitlab.usgov.coderdemo.io' + + ## Run behind an upstream proxy that terminates TLS (NLB + ingress-nginx). + ## Bundled NGINX speaks plain HTTP on :80; do not terminate TLS here and + ## do not redirect to https (the NLB already did). + nginx['listen_port'] = 80 + nginx['listen_https'] = false + nginx['redirect_http_to_https'] = false + nginx['proxy_set_headers'] = { + 'X-Forwarded-Proto' => 'https', + 'X-Forwarded-Ssl' => 'on' + } + ## Trust in-cluster proxies so client IPs are logged and users are not + ## shown as signed in from the proxy address (RFC1918 ranges). + nginx['real_ip_header'] = 'X-Forwarded-For' + nginx['real_ip_recursive'] = 'on' + nginx['real_ip_trusted_addresses'] = ['10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16'] + + ## Health endpoints (/-/health, /-/readiness, /-/liveness) are + ## IP-allowlisted to 127.0.0.0/8 by default, so kubelet probes + ## from the node IP get 404. Allow the VPC CIDR so the probes pass. + gitlab_rails['monitoring_whitelist'] = ['127.0.0.0/8', '10.0.0.0/16'] + + ## Initial root password, first boot only, injected from the + ## gitlab-secrets Secret. Ignored once the DB has a root user. + if ENV['GITLAB_INITIAL_ROOT_PASSWORD'] && !ENV['GITLAB_INITIAL_ROOT_PASSWORD'].empty? + gitlab_rails['initial_root_password'] = ENV['GITLAB_INITIAL_ROOT_PASSWORD'] + end + + ## Embedded PostgreSQL (bundled) is the default; nothing to set. + ## To use the shared RDS gitlabhq_production instead, see README.md. + + ## ---- Trim the footprint for a demo ---- + puma['worker_processes'] = 2 + sidekiq['concurrency'] = 10 + ## Umbrella switch: disables prometheus, alertmanager, and every + ## bundled exporter (node/redis/postgres/gitlab) in one line. + prometheus_monitoring['enable'] = false + puma['exporter_enabled'] = false + sidekiq['metrics_enabled'] = false + ## Services not needed for this demo. + registry['enable'] = false + gitlab_pages['enable'] = false + gitlab_kas['enable'] = false + ## Return freed memory to the OS sooner (memory-constrained tuning). + gitlab_rails['env'] = { 'MALLOC_CONF' => 'dirty_decay_ms:1000,muzzy_decay_ms:1000' } + ports: + - name: http + containerPort: 80 + protocol: TCP + volumeMounts: + - name: etc-gitlab + mountPath: /etc/gitlab + - name: var-opt-gitlab + mountPath: /var/opt/gitlab + - name: var-log-gitlab + mountPath: /var/log/gitlab + - name: dshm + mountPath: /dev/shm + resources: + # Modest but functional. GitLab needs real memory even when trimmed; + # tune down only if the demo node is tight (expect slower boots/OOM risk). + requests: + cpu: "1" + memory: 4Gi + limits: + cpu: "2" + memory: 8Gi + # GitLab takes minutes to boot (asset load + DB migrations on first run). + # The startup probe gives it up to ~15 minutes before liveness kicks in. + startupProbe: + httpGet: + path: /-/health + port: http + initialDelaySeconds: 60 + periodSeconds: 15 + timeoutSeconds: 5 + failureThreshold: 60 + livenessProbe: + httpGet: + path: /-/health + port: http + periodSeconds: 30 + timeoutSeconds: 5 + failureThreshold: 5 + readinessProbe: + httpGet: + path: /-/readiness + port: http + periodSeconds: 15 + timeoutSeconds: 5 + failureThreshold: 3 + volumes: + # Mirrors the Omnibus docker recommendation of --shm-size for Postgres. + - name: dshm + emptyDir: + medium: Memory + sizeLimit: 256Mi + # Three PVCs map 1:1 to the three Omnibus persistence paths. StorageClass + # "gp3" is the cluster default (provisioner ebs.csi.aws.com, + # WaitForFirstConsumer, encrypted) created by the platform layer on standard + # EKS. The EKS Auto Mode "auto-ebs-sc" class is not used (Auto Mode was + # disabled; see terraform reconciliation notes). + volumeClaimTemplates: + - metadata: + name: etc-gitlab + labels: + app: gitlab + spec: + accessModes: ["ReadWriteOnce"] + storageClassName: gp3 + resources: + requests: + storage: 2Gi + - metadata: + name: var-opt-gitlab + labels: + app: gitlab + spec: + accessModes: ["ReadWriteOnce"] + storageClassName: gp3 + resources: + requests: + storage: 20Gi + - metadata: + name: var-log-gitlab + labels: + app: gitlab + spec: + accessModes: ["ReadWriteOnce"] + storageClassName: gp3 + resources: + requests: + storage: 5Gi diff --git a/deploy/keycloak/README.md b/deploy/keycloak/README.md new file mode 100644 index 0000000..fc8850f --- /dev/null +++ b/deploy/keycloak/README.md @@ -0,0 +1,148 @@ +# Keycloak (`auth.usgov.coderdemo.io`) + +Keycloak **26.6.3** for the GovCloud Coder demo. Runs in namespace `keycloak`, +behind the locked ingress path: + +``` +client --HTTPS--> NLB (TLS terminated, ACM cert) --HTTP--> ingress-nginx --HTTP--> keycloak:8080 +``` + +Backed by the shared RDS PostgreSQL 18.4 instance (logical database `keycloak`). +Provides OIDC SSO for Coder (`dev.usgov.coderdemo.io`) via realm `coder`. + +## Files + +| File | Purpose | +|---|---| +| `deployment.yaml` | `ServiceAccount` + `Deployment` (Keycloak 26.6.3, postgres, proxy/hostname/health config) | +| `service.yaml` | `ClusterIP` on `8080` (management `9000` deliberately not exposed) | +| `ingress.yaml` | `ingressClassName: nginx`, host `auth.usgov.coderdemo.io`, plain-HTTP backend | +| `realm-coder.json` | Realm `coder`: confidential client `coder` + `demo` user + token settings | +| `secrets.example.yaml` | Placeholder `keycloak-db` / `keycloak-admin` Secrets (REPLACE_ME) | +| `kustomization.yaml` | Wires the manifests + generates the realm-import ConfigMap from the JSON | + +## Image + +- Upstream (pinned): `quay.io/keycloak/keycloak:26.6.3` +- Referenced as ECR mirror: `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/quay/keycloak/keycloak:26.6.3` +- Add `quay.io/keycloak/keycloak:26.6.3` to `scripts/images.txt` (orchestrator-owned) so the mirror job pulls it. + +## Verified Keycloak 26.x configuration + +26.x changed several knobs from older majors. What this manifest uses and why +(docs: , +, +, +, +): + +| Setting | Value | Notes | +|---|---|---| +| `KC_PROXY_HEADERS` | `xforwarded` | Replaces the removed `KC_PROXY=edge` (deprecated since v24). Parses `X-Forwarded-*`. | +| `KC_HTTP_ENABLED` | `true` | Required when TLS terminates at the proxy (edge termination). | +| `KC_HOSTNAME` | `https://auth.usgov.coderdemo.io` | hostname v2: a full URL fixes scheme/host/port. Chosen because the L4 NLB terminates TLS and does not inject a trustworthy `X-Forwarded-Proto`. With a full URL, `hostname-strict` stays at its secure default (`true`); we do **not** set it to `false`. | +| `KC_DB` | `postgres` | Build-time option (applied by the implicit build on `start`). | +| `KC_DB_URL` | `jdbc:postgresql://REPLACE_WITH_RDS_ENDPOINT:5432/keycloak` | Full JDBC URL to the RDS endpoint + `keycloak` db. | +| `KC_DB_USERNAME` / `KC_DB_PASSWORD` | from Secret `keycloak-db` | Keys `username` / `password`. | +| `KC_HEALTH_ENABLED` / `KC_METRICS_ENABLED` | `true` | Build-time options; expose `/health` + `/metrics` on management port **9000**. | +| `KC_CACHE` | `local` | Single replica; avoids the default `jdbc-ping` cluster discovery. | +| `KC_BOOTSTRAP_ADMIN_USERNAME` / `KC_BOOTSTRAP_ADMIN_PASSWORD` | from Secret `keycloak-admin` | **Renamed in 26.0** from `KEYCLOAK_ADMIN` / `KEYCLOAK_ADMIN_PASSWORD`. First-boot only. | + +Health endpoints on `:9000` (probes in `deployment.yaml`): +`/health/started` (startup), `/health/live` (liveness), `/health/ready` (readiness). + +### `start` vs `start --optimized` + +This manifest uses **`start --import-realm`** (not `--optimized`). + +`KC_DB`, `KC_HEALTH_ENABLED`, `KC_METRICS_ENABLED`, and `KC_CACHE` are +**build-time** options. `--optimized` tells Keycloak to skip the build and +assume a pre-built image. The stock upstream image we mirror is **not** built +for postgres, so `start --optimized` would ignore `KC_DB` and fall back to the +H2 dev database. Plain `start` runs the build automatically on first boot +(slower start, hence the generous `startupProbe`), which is correct for an +unmodified mirrored image. + +To switch to `--optimized` later, bake a custom image +(`FROM .../quay/keycloak/keycloak:26.6.3` + `RUN kc.sh build --db=postgres +--health-enabled=true --metrics-enabled=true --cache=local`), push it to ECR, +and change the args to `start --optimized --import-realm`. That introduces a +build pipeline outside the current "mirror upstream only" convention, so it is +left as a future hardening step (see open questions). + +## Realm import + +`--import-realm` imports every `*.json` under `/opt/keycloak/data/import` on +startup. The `kustomization.yaml` generates ConfigMap `keycloak-realm-coder` +from `realm-coder.json`, mounted read-only at that path. Import is idempotent: +if realm `coder` already exists it is skipped (logged), so leaving the flag on +across restarts is safe. + +`realm-coder.json` defines: + +- Confidential OIDC client `coder` (standard flow), redirect URIs + `https://dev.usgov.coderdemo.io/api/v2/users/oidc/callback` and + `https://dev.usgov.coderdemo.io/*`, web origins `+`. +- User `demo` (`demo@usgov.coderdemo.io`, `emailVerified: true`). +- Token settings: 5-min access tokens, 30-min idle / 10-hour max SSO session. + +Two placeholders in the JSON are **not** k8s Secrets and must be set before/after import: + +- `coder` client `secret` -> must equal the value Coder reads from Secret + `coder-oidc` (owned by `deploy/coder/`). Issuer for Coder: + `https://auth.usgov.coderdemo.io/realms/coder`. +- `demo` user password. + +Alternative to `--import-realm`: run a one-off `kc.sh import --file +/opt/keycloak/data/import/realm-coder.json` as a Job, then run `start` without +the flag. + +## Install order + +1. Platform layer is up: `keycloak` namespace, ingress-nginx + NLB + ACM cert, + RDS reachable, and the `keycloak` logical db + role created by the db-init job. +2. Mirror the image: ensure `quay.io/keycloak/keycloak:26.6.3` is in + `scripts/images.txt`, then run `scripts/mirror-images.sh`. +3. Create Secrets (real values, not committed): + ```sh + kubectl -n keycloak create secret generic keycloak-db \ + --from-literal=username=keycloak --from-literal=password='<…>' + kubectl -n keycloak create secret generic keycloak-admin \ + --from-literal=username=admin --from-literal=password='<…>' + ``` +4. Set the real RDS endpoint in `deployment.yaml` (`KC_DB_URL`, + `REPLACE_WITH_RDS_ENDPOINT`) and the realm placeholders in `realm-coder.json`. +5. Apply: + ```sh + kubectl apply -k deploy/keycloak/ + ``` + (or apply `deployment.yaml`, `service.yaml`, `ingress.yaml` individually after + creating the `keycloak-realm-coder` ConfigMap with + `kubectl -n keycloak create configmap keycloak-realm-coder --from-file=deploy/keycloak/realm-coder.json`). +6. Verify: `kubectl -n keycloak rollout status deploy/keycloak`, then browse + `https://auth.usgov.coderdemo.io/realms/coder/.well-known/openid-configuration`. + +## Open questions / risks + +1. **ingress-nginx `X-Forwarded-Proto` (platform-owned).** With an L4 NLB doing + TLS termination, the controller sees plain HTTP and forwards + `X-Forwarded-Proto: http`. Pinning `KC_HOSTNAME` to a full `https://` URL + makes Keycloak independent of this, and the Ingress sets + `ssl-redirect: "false"` to avoid a redirect loop. Confirm the controller is + not separately forcing SSL redirects for this host. +2. **`keycloak-db` username key.** CONVENTIONS only guarantees the `password` + key. This manifest also reads `username` (role `keycloak`). If the platform + Secret omits `username`, either add it or hardcode `KC_DB_USERNAME=keycloak`. +3. **RDS endpoint injection.** `KC_DB_URL` carries a literal placeholder. Decide + whether the orchestrator templates this (kustomize/helm) or it is filled at + apply time. Verify the RDS role requires no `?sslmode=` / TLS JDBC params in + this VPC; add them to the JDBC URL if enforced. +4. **Client/user secrets in realm JSON.** `coder` client secret and `demo` + password are committed as `REPLACE_ME` placeholders. The client secret must + be kept in sync with Secret `coder-oidc`. A realm import cannot natively pull + these from a k8s Secret; if that coupling is undesirable, import the realm + without the secret and set it via `kcadm` post-import. +5. **`start` vs pre-built `--optimized`.** Demo uses plain `start` (auto-build, + slower cold start). If startup time matters, switch to a pre-built ECR image + (see above); needs orchestrator buy-in for a build step. +6. **Single replica.** No HA; `KC_CACHE=local`. Fine for the demo, not for prod. diff --git a/deploy/keycloak/deployment.yaml b/deploy/keycloak/deployment.yaml new file mode 100644 index 0000000..8f52cc6 --- /dev/null +++ b/deploy/keycloak/deployment.yaml @@ -0,0 +1,174 @@ +# Keycloak 26.6.3 Deployment for the GovCloud Coder demo. +# +# Topology (locked by deploy/CONVENTIONS.md): +# client --HTTPS--> NLB (TLS terminated via ACM) --HTTP--> ingress-nginx --HTTP--> this pod:8080 +# +# Because TLS is terminated upstream at the L4 NLB, the pod only ever sees plain +# HTTP and cannot reliably learn the external scheme from headers. We therefore +# pin KC_HOSTNAME to a full https:// URL so Keycloak always builds correct +# issuer/redirect URLs, and set KC_PROXY_HEADERS=xforwarded so origin checks use +# the forwarded headers that ingress-nginx adds. +# +# Start command: `start --import-realm` (NOT `start --optimized`). +# `--optimized` requires the image to have been pre-built with `kc.sh build +# --db=postgres` (KC_DB / KC_HEALTH_ENABLED / KC_METRICS_ENABLED / KC_CACHE are +# build-time options). The stock upstream image we mirror is not pre-built for +# postgres, so we use plain `start`, which runs the build step automatically on +# first boot. See README.md "start vs start --optimized". +--- +apiVersion: v1 +kind: ServiceAccount +metadata: + name: keycloak + namespace: keycloak + labels: + app.kubernetes.io/name: keycloak + app.kubernetes.io/part-of: usgov-coderdemo +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: keycloak + namespace: keycloak + labels: + app.kubernetes.io/name: keycloak + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/version: 26.6.3 +spec: + # Single replica for the demo. HA/clustering (jdbc-ping cache stack) is out of + # scope; KC_CACHE=local below avoids cluster discovery on a single pod. + replicas: 1 + strategy: + type: Recreate + selector: + matchLabels: + app.kubernetes.io/name: keycloak + template: + metadata: + labels: + app.kubernetes.io/name: keycloak + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/version: 26.6.3 + spec: + serviceAccountName: keycloak + securityContext: + runAsNonRoot: true + fsGroup: 1000 + seccompProfile: + type: RuntimeDefault + containers: + - name: keycloak + # Upstream: quay.io/keycloak/keycloak:26.6.3 + # Mirrored to ECR per CONVENTIONS (quay.io/ -> /quay/). + image: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/quay/keycloak/keycloak:26.6.3 + imagePullPolicy: IfNotPresent + args: + - start + - --import-realm + env: + # ---- Database (RDS PostgreSQL 18.4, logical db `keycloak`) ---- + - name: KC_DB + value: postgres + # Host-only RDS endpoint comes from `terraform -chdir=terraform + # output -raw rds_endpoint`. Not a secret; fill in at deploy time + # (or template via kustomize/helm). Port 5432 is the RDS PG default. + - name: KC_DB_URL + value: jdbc:postgresql://REPLACE_WITH_RDS_ENDPOINT:5432/keycloak + - name: KC_DB_USERNAME + valueFrom: + secretKeyRef: + name: keycloak-db + key: username + - name: KC_DB_PASSWORD + valueFrom: + secretKeyRef: + name: keycloak-db + key: password + # ---- Hostname / proxy (26.x hostname v2 semantics) ---- + # Full URL => scheme/host/port are fixed; safe behind an L4 TLS + # terminator that does not forward a trustworthy X-Forwarded-Proto. + - name: KC_HOSTNAME + value: https://auth.usgov.coderdemo.io + # xforwarded replaces the removed KC_PROXY=edge option (>= v24). + - name: KC_PROXY_HEADERS + value: xforwarded + # Required when TLS is terminated at the proxy (edge termination). + - name: KC_HTTP_ENABLED + value: "true" + # ---- Health / metrics on the management port (9000) ---- + - name: KC_HEALTH_ENABLED + value: "true" + - name: KC_METRICS_ENABLED + value: "true" + # ---- Single-node cache (no clustering for the demo) ---- + - name: KC_CACHE + value: local + # ---- Initial admin bootstrap (26.0+ variable names) ---- + # Only consumed on first start against an empty DB; ignored (with a + # log warning) once the initial admin exists. + - name: KC_BOOTSTRAP_ADMIN_USERNAME + valueFrom: + secretKeyRef: + name: keycloak-admin + key: username + - name: KC_BOOTSTRAP_ADMIN_PASSWORD + valueFrom: + secretKeyRef: + name: keycloak-admin + key: password + ports: + - name: http + containerPort: 8080 + protocol: TCP + - name: management + containerPort: 9000 + protocol: TCP + # Probes hit the management interface (9000), never proxied externally. + startupProbe: + httpGet: + path: /health/started + port: management + # First boot runs DB migration + the implicit build step, so allow + # generous time before liveness takes over (up to ~5 min). + periodSeconds: 5 + failureThreshold: 60 + timeoutSeconds: 5 + livenessProbe: + httpGet: + path: /health/live + port: management + periodSeconds: 10 + failureThreshold: 6 + timeoutSeconds: 5 + readinessProbe: + httpGet: + path: /health/ready + port: management + periodSeconds: 10 + failureThreshold: 6 + timeoutSeconds: 5 + resources: + requests: + cpu: 500m + memory: 1280Mi + limits: + # Keycloak sizes its heap from the container memory limit; the + # upstream guidance is at least 750Mi, 2Gi for production-ready. + memory: 2Gi + securityContext: + allowPrivilegeEscalation: false + readOnlyRootFilesystem: false + capabilities: + drop: + - ALL + volumeMounts: + - name: realm-import + mountPath: /opt/keycloak/data/import + readOnly: true + volumes: + # Realm JSON(s) placed here are imported when `--import-realm` is set. + # ConfigMap is generated from realm-coder.json (see kustomization.yaml + # or the manual `kubectl create configmap` command in README.md). + - name: realm-import + configMap: + name: keycloak-realm-coder diff --git a/deploy/keycloak/ingress.yaml b/deploy/keycloak/ingress.yaml new file mode 100644 index 0000000..2bc515f --- /dev/null +++ b/deploy/keycloak/ingress.yaml @@ -0,0 +1,33 @@ +# Ingress for Keycloak at auth.usgov.coderdemo.io. +# +# TLS is terminated upstream at the NLB (single ACM cert, configured on the +# ingress-nginx controller Service by the platform layer). This Ingress declares +# only the plain-HTTP backend route, per deploy/CONVENTIONS.md. +# +# ssl-redirect is disabled because the controller receives plain HTTP from the +# NLB; leaving the default on would cause an HTTP->HTTPS redirect loop. The +# larger proxy buffer accommodates Keycloak's sizable auth cookies/headers. +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: keycloak + namespace: keycloak + labels: + app.kubernetes.io/name: keycloak + app.kubernetes.io/part-of: usgov-coderdemo + annotations: + nginx.ingress.kubernetes.io/ssl-redirect: "false" + nginx.ingress.kubernetes.io/proxy-buffer-size: "128k" +spec: + ingressClassName: nginx + rules: + - host: auth.usgov.coderdemo.io + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: keycloak + port: + number: 8080 diff --git a/deploy/keycloak/kustomization.yaml b/deploy/keycloak/kustomization.yaml new file mode 100644 index 0000000..668a7cd --- /dev/null +++ b/deploy/keycloak/kustomization.yaml @@ -0,0 +1,26 @@ +# Convenience kustomization for the Keycloak workstream. +# +# It wires deployment/service/ingress together and generates the realm-import +# ConfigMap directly from realm-coder.json (single source of truth, no +# duplication). disableNameSuffixHash keeps the name stable as +# `keycloak-realm-coder`, which the Deployment volume references. +# +# secrets.example.yaml is intentionally NOT included here; provision the real +# `keycloak-db` and `keycloak-admin` Secrets out of band (see README.md). +apiVersion: kustomize.config.k8s.io/v1beta1 +kind: Kustomization + +namespace: keycloak + +resources: + - deployment.yaml + - service.yaml + - ingress.yaml + +configMapGenerator: + - name: keycloak-realm-coder + files: + - realm-coder.json + +generatorOptions: + disableNameSuffixHash: true diff --git a/deploy/keycloak/realm-coder.json b/deploy/keycloak/realm-coder.json new file mode 100644 index 0000000..96e0702 --- /dev/null +++ b/deploy/keycloak/realm-coder.json @@ -0,0 +1,75 @@ +{ + "realm": "coder", + "displayName": "Coder (GovCloud Demo)", + "enabled": true, + "sslRequired": "external", + "registrationAllowed": false, + "loginWithEmailAllowed": true, + "duplicateEmailsAllowed": false, + "resetPasswordAllowed": true, + "editUsernameAllowed": false, + "accessTokenLifespan": 300, + "accessTokenLifespanForImplicitFlow": 900, + "ssoSessionIdleTimeout": 1800, + "ssoSessionMaxLifespan": 36000, + "offlineSessionIdleTimeout": 2592000, + "requiredCredentials": [ + "password" + ], + "clients": [ + { + "clientId": "coder", + "name": "Coder", + "description": "Coder dashboard OIDC login (confidential, standard flow).", + "protocol": "openid-connect", + "enabled": true, + "publicClient": false, + "clientAuthenticatorType": "client-secret", + "secret": "REPLACE_WITH_CODER_OIDC_CLIENT_SECRET", + "standardFlowEnabled": true, + "implicitFlowEnabled": false, + "directAccessGrantsEnabled": false, + "serviceAccountsEnabled": false, + "fullScopeAllowed": true, + "redirectUris": [ + "https://dev.usgov.coderdemo.io/api/v2/users/oidc/callback", + "https://dev.usgov.coderdemo.io/*" + ], + "webOrigins": [ + "+" + ], + "attributes": { + "post.logout.redirect.uris": "https://dev.usgov.coderdemo.io/*" + }, + "defaultClientScopes": [ + "web-origins", + "profile", + "roles", + "email" + ], + "optionalClientScopes": [ + "offline_access" + ] + } + ], + "users": [ + { + "username": "demo", + "enabled": true, + "emailVerified": true, + "email": "demo@usgov.coderdemo.io", + "firstName": "Demo", + "lastName": "User", + "credentials": [ + { + "type": "password", + "value": "REPLACE_WITH_DEMO_USER_PASSWORD", + "temporary": false + } + ], + "realmRoles": [ + "default-roles-coder" + ] + } + ] +} diff --git a/deploy/keycloak/secrets.example.yaml b/deploy/keycloak/secrets.example.yaml new file mode 100644 index 0000000..2917514 --- /dev/null +++ b/deploy/keycloak/secrets.example.yaml @@ -0,0 +1,47 @@ +# EXAMPLE secrets for Keycloak. DO NOT COMMIT REAL VALUES. +# +# Per deploy/CONVENTIONS.md the platform layer normally provisions the app DB +# Secret (`-db`). These manifests document the exact keys this workstream +# expects so the Deployment can be applied standalone for testing. Replace every +# REPLACE_ME and apply into the `keycloak` namespace. +# +# Create real secrets without committing them, e.g.: +# kubectl -n keycloak create secret generic keycloak-db \ +# --from-literal=username=keycloak \ +# --from-literal=password='' +# kubectl -n keycloak create secret generic keycloak-admin \ +# --from-literal=username=admin \ +# --from-literal=password='' +# +# NOTE: two more placeholders live in realm-coder.json (not k8s Secrets): +# - the `coder` client secret -> must equal Secret `coder-oidc` (deploy/coder) +# - the `demo` user password +--- +apiVersion: v1 +kind: Secret +metadata: + name: keycloak-db + namespace: keycloak + labels: + app.kubernetes.io/name: keycloak + app.kubernetes.io/part-of: usgov-coderdemo +type: Opaque +stringData: + # CONVENTIONS guarantees the `password` key. `username` is the logical role + # the orchestrator's db-init job creates for the `keycloak` database. + username: REPLACE_ME + password: REPLACE_ME +--- +apiVersion: v1 +kind: Secret +metadata: + name: keycloak-admin + namespace: keycloak + labels: + app.kubernetes.io/name: keycloak + app.kubernetes.io/part-of: usgov-coderdemo +type: Opaque +stringData: + # Initial bootstrap admin. Consumed only on first start against an empty DB. + username: REPLACE_ME + password: REPLACE_ME diff --git a/deploy/keycloak/service.yaml b/deploy/keycloak/service.yaml new file mode 100644 index 0000000..f5ec848 --- /dev/null +++ b/deploy/keycloak/service.yaml @@ -0,0 +1,21 @@ +# ClusterIP service for Keycloak. Only the main HTTP port (8080) is exposed to +# ingress-nginx. The management port (9000, health/metrics) is intentionally NOT +# exposed via the Service; probes reach it pod-locally and it must not be +# reachable through the proxy. +apiVersion: v1 +kind: Service +metadata: + name: keycloak + namespace: keycloak + labels: + app.kubernetes.io/name: keycloak + app.kubernetes.io/part-of: usgov-coderdemo +spec: + type: ClusterIP + selector: + app.kubernetes.io/name: keycloak + ports: + - name: http + port: 8080 + targetPort: http + protocol: TCP diff --git a/deploy/platform/README.md b/deploy/platform/README.md new file mode 100644 index 0000000..e463e83 --- /dev/null +++ b/deploy/platform/README.md @@ -0,0 +1,107 @@ +# Platform layer (orchestrator-owned) + +Brings up the shared cluster platform that every app layer depends on: +node group, addons, storage, ingress + NLB, RDS roles/databases, and +workspace RBAC. These steps were executed live against the cluster during the +overnight build; this README is the reproducible record. + +> **Context:** EKS Auto Mode node provisioning is broken in this GovCloud +> account (the AWS-managed `AWSServiceRoleForAmazonEKS` SLR lacks +> `iam:AddRoleToInstanceProfile` / `iam:TagInstanceProfile`, so Auto Mode +> NodeClass validation never succeeds). The cluster was converted to standard +> EKS. The items below are not yet in `terraform/`; see `STATUS.md` +> "Deviations to reconcile into Terraform". + +Prereqs for every command: + +```sh +. ~/.config/usgov-coderdemo/env # AWS_PROFILE, region (sh: use ".", not "source") +export KUBECONFIG=./kubeconfig +``` + +## 1. Compute: disable Auto Mode, create a managed node group + +```sh +aws eks update-cluster-config --name usgov-coderdemo \ + --compute-config enabled=false \ + --storage-config '{"blockStorage":{"enabled":false}}' \ + --kubernetes-network-config '{"elasticLoadBalancing":{"enabled":false}}' + +# Node role usgov-coderdemo-mngnode: AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, +# AmazonEC2ContainerRegistryReadOnly, AmazonSSMManagedInstanceCore, AmazonEBSCSIDriverPolicy. +# Managed node group `mng`: 3x m5.xlarge, AL2023_x86_64_STANDARD, private subnets, +# min 2 / desired 3 / max 4. +``` + +## 2. Addons + +```sh +# vpc-cni, kube-proxy, coredns: default config. +# aws-ebs-csi-driver: needs IRSA (node IMDS hop limit blocks the controller's +# default credential path), so it gets its own role: +aws iam create-role --role-name usgov-coderdemo-ebs-csi \ + --assume-role-policy-document file://ebs-trust.json # trusts the cluster OIDC provider, + # sub system:serviceaccount:kube-system:ebs-csi-controller-sa +aws iam attach-role-policy --role-name usgov-coderdemo-ebs-csi \ + --policy-arn arn:aws-us-gov:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy +aws eks update-addon --cluster-name usgov-coderdemo --addon-name aws-ebs-csi-driver \ + --service-account-role-arn arn:aws-us-gov:iam::430737322961:role/usgov-coderdemo-ebs-csi \ + --resolve-conflicts PRESERVE +``` + +`gp3` is the default StorageClass (provisioner `ebs.csi.aws.com`, encrypted, +`WaitForFirstConsumer`). + +## 3. Ingress + NLB + +The AWS Load Balancer Controller (Helm, `kube-system`) provisions an +internet-facing NLB for the ingress-nginx controller Service. TLS terminates at +the NLB with the shared ACM cert; backends are plain HTTP. + +```sh +helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \ + --namespace ingress-nginx --create-namespace --version 4.15.1 \ + --values ingress-nginx-values.yaml +``` + +The controller Service uses `aws-load-balancer-type: external` so the LB +controller (not the in-tree provider) manages the NLB. Public subnets are +auto-discovered via the `kubernetes.io/role/elb=1` tag. + +## 4. DNS + +Route53 alias A records in zone `Z06701704WFETYIRU5C8` point `dev`, `auth`, +`gitlab`, and `*` (workspace apps) at the ingress NLB. In-cluster hairpin to +these public hostnames is verified (valid TLS), so Coder's server-side OIDC +calls and workspace agents work. + +## 5. RDS roles + databases + +Run in-cluster (RDS is private; the workspace cannot reach it directly). A Job +using the mirrored `postgres:18-alpine` image connects as the master user and +creates roles + databases. Idempotent. Note `rds.force_ssl=1`, so all clients +use TLS (`sslmode=require` / JDBC `?sslmode=require`). + +- Role `coder` owns database `coder` (and its `public` schema). +- Role `keycloak` owns database `keycloak`. +- GitLab uses the Omnibus **embedded** Postgres (no RDS database). + +RDS requires the master user to be a member of a role before transferring +ownership to it (`GRANT TO dbadmin;`). + +## 6. Application secrets + +Created imperatively (never committed). See each app's `secrets.example.yaml`: +`coder-db`, `coder-oidc`, `coder-ai` (coder ns); `keycloak-db`, `keycloak-admin` +(keycloak ns); `gitlab-secrets` (gitlab ns). Values are in +`~/.config/usgov-coderdemo/generated-secrets.env`. + +## 7. Workspace RBAC + +`workspace-rbac.yaml` grants the `coder/coder` ServiceAccount permission to +manage pods/PVCs in `coder-workspaces` (the Helm chart only grants this in the +release namespace). + +```sh +kubectl apply -f workspace-rbac.yaml +``` diff --git a/deploy/platform/ingress-nginx-values.yaml b/deploy/platform/ingress-nginx-values.yaml new file mode 100644 index 0000000..d99f43d --- /dev/null +++ b/deploy/platform/ingress-nginx-values.yaml @@ -0,0 +1,39 @@ +# ingress-nginx values for standard EKS with the AWS Load Balancer Controller. +# +# The AWS Load Balancer Controller provisions an internet-facing NLB for the +# controller's type: LoadBalancer Service (opted in via +# aws-load-balancer-type: external). The NLB terminates TLS using the single ACM +# cert (ssl-cert + ssl-ports=443) and forwards decrypted TCP to the controller's +# plain HTTP port, so backends are HTTP and every app Ingress uses +# ingressClassName: nginx. Public subnets are auto-discovered via the +# kubernetes.io/role/elb=1 tag. +controller: + replicaCount: 2 + service: + type: LoadBalancer + annotations: + service.beta.kubernetes.io/aws-load-balancer-type: external + service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing + service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip + service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws-us-gov:acm:us-gov-west-1:430737322961:certificate/7f4fc566-8efd-4aa5-b6ba-3b0c9a535d12" + service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443" + service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp + service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true" + # TLS terminates at the NLB; forward the 443 listener to the controller's + # plain HTTP container port. + targetPorts: + http: http + https: http + config: + # The L4 NLB does not add X-Forwarded-Proto, so disable nginx's own + # redirect to avoid http->https loops; clients reach the stack over 443. + use-forwarded-headers: "true" + ssl-redirect: "false" + # Coder workspace apps and terminals need websockets and large bodies. + proxy-body-size: "0" + proxy-read-timeout: "3600" + proxy-send-timeout: "3600" + ingressClassResource: + name: nginx + default: false + allowSnippetAnnotations: false diff --git a/deploy/platform/nodepool.yaml b/deploy/platform/nodepool.yaml new file mode 100644 index 0000000..ab04636 --- /dev/null +++ b/deploy/platform/nodepool.yaml @@ -0,0 +1,60 @@ +# Custom NodeClass + NodePool owned by this project. +# +# Why this exists: at cluster-init the EKS Auto Mode `default` NodeClass failed +# to finish creating its node instance profile (created the profile but never +# attached the role), and its controller is now wedged retrying CreateInstanceProfile +# and hitting EntityAlreadyExists. Rather than touch the EKS-managed `default` +# objects, we declare our own NodeClass with an identical role/SG but a distinct +# name, so Auto Mode mints a brand-new instance profile that creates cleanly. +# Nodes are restricted to the private subnets. +apiVersion: eks.amazonaws.com/v1 +kind: NodeClass +metadata: + name: coder +spec: + role: usgov-coderdemo-node + subnetSelectorTerms: + - id: subnet-06f6e0e790ba2a2b5 + - id: subnet-0e876c98a365ef368 + - id: subnet-0ae1157cce4a2d949 + securityGroupSelectorTerms: + - id: sg-02219a6a7996d66a4 + ephemeralStorage: + size: 80Gi + iops: 3000 + throughput: 125 +--- +apiVersion: karpenter.sh/v1 +kind: NodePool +metadata: + name: coder-general +spec: + disruption: + budgets: + - nodes: 10% + consolidateAfter: 30s + consolidationPolicy: WhenEmptyOrUnderutilized + template: + spec: + expireAfter: 336h + terminationGracePeriod: 24h0m0s + nodeClassRef: + group: eks.amazonaws.com + kind: NodeClass + name: coder + requirements: + - key: karpenter.sh/capacity-type + operator: In + values: ["on-demand"] + - key: eks.amazonaws.com/instance-category + operator: In + values: ["c", "m", "r"] + - key: eks.amazonaws.com/instance-generation + operator: Gt + values: ["4"] + - key: kubernetes.io/arch + operator: In + values: ["amd64"] + - key: kubernetes.io/os + operator: In + values: ["linux"] diff --git a/deploy/platform/workspace-rbac.yaml b/deploy/platform/workspace-rbac.yaml new file mode 100644 index 0000000..23ef956 --- /dev/null +++ b/deploy/platform/workspace-rbac.yaml @@ -0,0 +1,35 @@ +# Grants the Coder control-plane ServiceAccount (coder/coder) permission to +# manage workspace pods and PVCs in the coder-workspaces namespace. The Helm +# chart's serviceAccount.workspacePerms only creates this Role in the release +# namespace (coder); workspaces in coder-templates/claude-code/ run in +# coder-workspaces, so the same permissions are replicated here. +apiVersion: rbac.authorization.k8s.io/v1 +kind: Role +metadata: + name: coder-workspace-perms + namespace: coder-workspaces + labels: + app.kubernetes.io/part-of: usgov-coderdemo +rules: + - apiGroups: [""] + resources: ["pods", "persistentvolumeclaims"] + verbs: ["create", "delete", "deletecollection", "get", "list", "patch", "update", "watch"] + - apiGroups: ["apps"] + resources: ["deployments"] + verbs: ["create", "delete", "deletecollection", "get", "list", "patch", "update", "watch"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: RoleBinding +metadata: + name: coder-workspace-perms + namespace: coder-workspaces + labels: + app.kubernetes.io/part-of: usgov-coderdemo +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: Role + name: coder-workspace-perms +subjects: + - kind: ServiceAccount + name: coder + namespace: coder diff --git a/scripts/images.txt b/scripts/images.txt index a1dc674..3f7f7dd 100644 --- a/scripts/images.txt +++ b/scripts/images.txt @@ -7,8 +7,15 @@ # docker.io/bitnami/postgresql:16 -> /docker-hub/bitnami/postgresql:16 # ghcr.io/org/app:v1.2.3 -> /ghcr/org/app:v1.2.3 # quay.io/keycloak/keycloak:26.0 -> /quay/keycloak/keycloak:26.0 -# -# Populate with the images each workstream actually needs. Examples (commented): -# docker.io/library/nginx:1.27 -# quay.io/keycloak/keycloak:26.0 -# quay.io/jetstack/cert-manager-controller:v1.16.2 + +# --- Coder control plane (deploy/coder) --- +ghcr.io/coder/coder:v2.34.0 + +# --- Keycloak SSO (deploy/keycloak) --- +quay.io/keycloak/keycloak:26.6.3 + +# --- GitLab single-container omnibus (deploy/gitlab) --- +docker.io/gitlab/gitlab-ce:19.0.1-ce.0 + +# --- Workspace base image for the Claude Code template (coder-templates/claude-code) --- +docker.io/codercom/enterprise-base:ubuntu-noble-20260601 From 7b12706db56410f2d2b75d015b37da9d799bb8cc Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 13:48:27 +0000 Subject: [PATCH 02/16] feat(deploy/coder): keycloak-only login + in-boundary GitLab git external auth Disable Coder's built-in github.com providers and route git through the in-cluster GitLab instead, so no auth path leaves the GovCloud boundary. - CODER_OAUTH2_GITHUB_DEFAULT_PROVIDER_ENABLE=false disables the default GitHub login (was enabled out-of-the-box via Coder's hosted GitHub app). - Configure a GitLab external-auth provider (CODER_EXTERNAL_AUTH_0_*) against gitlab.usgov.coderdemo.io using an instance-wide OAuth app; id/secret come from Secret coder-external-auth. Declaring an explicit external-auth provider also suppresses Coder's default github.com external-auth injection. Login is now Keycloak SSO + local password owner only. Authored by Coder Agents on behalf of @ausbru87. --- STATUS.md | 11 +++++++++ deploy/coder/secrets.example.yaml | 16 ++++++++++++ deploy/coder/values.yaml | 41 +++++++++++++++++++++++++++++++ 3 files changed, 68 insertions(+) diff --git a/STATUS.md b/STATUS.md index 7c02ae0..d9d8f4e 100644 --- a/STATUS.md +++ b/STATUS.md @@ -87,5 +87,16 @@ gated; Nova Pro is the proven fallback. 5. ingress-nginx + aws-load-balancer-controller (Helm) replacing the Auto Mode NLB path. 6. Workspace RBAC: `deploy/platform/workspace-rbac.yaml` (coder SA -> coder-workspaces ns). +## Auth boundary hardening +- [x] Disabled Coder's built-in **GitHub login** default provider + (`CODER_OAUTH2_GITHUB_DEFAULT_PROVIDER_ENABLE=false`). Login is now + Keycloak SSO + local password owner only (no github.com egress). +- [x] Configured **GitLab external auth** for git-in-workspaces against the + in-cluster GitLab (instance-wide OAuth app; id/secret in Secret + `coder-external-auth`). This also suppresses Coder's default github.com + external-auth provider, so no auth path leaves the GovCloud boundary. + (App id/secret recorded in `generated-secrets.env` as + `GITLAB_CODER_OAUTH_*`.) + ## Out of scope (demo) OpenShift, Istio, observability, full identity sync. diff --git a/deploy/coder/secrets.example.yaml b/deploy/coder/secrets.example.yaml index 403ba52..8d3e96b 100644 --- a/deploy/coder/secrets.example.yaml +++ b/deploy/coder/secrets.example.yaml @@ -53,3 +53,19 @@ stringData: # Source: Anthropic Console (console.anthropic.com) > API Keys. Begins `sk-ant-`. # The Bedrock (SECONDARY) provider uses IRSA, so it needs NO key here. ANTHROPIC_API_KEY: "REPLACE_ME_sk-ant-xxxxxxxx" +--- +apiVersion: v1 +kind: Secret +metadata: + name: coder-external-auth + namespace: coder +type: Opaque +stringData: + # Instance-wide GitLab OAuth application (Admin > Applications, or minted via + # the GitLab Rails console). Redirect URI MUST be + # https://dev.usgov.coderdemo.io/external-auth/gitlab/callback and scopes + # read_user/read_repository/write_repository. Consumed by the + # CODER_EXTERNAL_AUTH_0_CLIENT_ID/SECRET env in values.yaml so git-in-workspaces + # authenticates against the in-cluster GitLab (in-boundary) instead of github.com. + gitlab-client-id: "REPLACE_ME_GITLAB_APP_ID" + gitlab-client-secret: "REPLACE_ME_GITLAB_APP_SECRET" diff --git a/deploy/coder/values.yaml b/deploy/coder/values.yaml index 48b33ea..c79248c 100644 --- a/deploy/coder/values.yaml +++ b/deploy/coder/values.yaml @@ -106,6 +106,47 @@ coder: - name: CODER_OIDC_SIGN_IN_TEXT value: "Sign in with Keycloak" + # --- Auth boundary hardening ------------------------------------------ + # Coder enables a built-in GitHub login + GitHub external-auth provider by + # default (Coder's own hosted GitHub app), both of which call github.com, + # i.e. OUT of the GovCloud boundary. Disable it so login is Keycloak-only + # (plus the local password owner) and git stays on the in-cluster GitLab. + - name: CODER_OAUTH2_GITHUB_DEFAULT_PROVIDER_ENABLE + value: "false" + + # --- Git external auth: in-cluster GitLab (in-boundary) ---------------- + # Authenticates git operations inside workspaces against the in-cluster + # GitLab via an instance-wide OAuth app (minted in GitLab; id/secret live + # in Secret `coder-external-auth`). Configuring an explicit external-auth + # provider also SUPPRESSES Coder's built-in github.com default provider. + # Self-managed GitLab needs explicit auth/token/validate URLs. + - name: CODER_EXTERNAL_AUTH_0_ID + value: "gitlab" + - name: CODER_EXTERNAL_AUTH_0_TYPE + value: "gitlab" + - name: CODER_EXTERNAL_AUTH_0_DISPLAY_NAME + value: "GitLab" + - name: CODER_EXTERNAL_AUTH_0_CLIENT_ID + valueFrom: + secretKeyRef: + name: coder-external-auth + key: gitlab-client-id + - name: CODER_EXTERNAL_AUTH_0_CLIENT_SECRET + valueFrom: + secretKeyRef: + name: coder-external-auth + key: gitlab-client-secret + - name: CODER_EXTERNAL_AUTH_0_AUTH_URL + value: "https://gitlab.usgov.coderdemo.io/oauth/authorize" + - name: CODER_EXTERNAL_AUTH_0_TOKEN_URL + value: "https://gitlab.usgov.coderdemo.io/oauth/token" + - name: CODER_EXTERNAL_AUTH_0_VALIDATE_URL + value: "https://gitlab.usgov.coderdemo.io/oauth/token/info" + - name: CODER_EXTERNAL_AUTH_0_REGEX + value: "gitlab\\.usgov\\.coderdemo\\.io" + - name: CODER_EXTERNAL_AUTH_0_SCOPES + value: "read_user read_repository write_repository" + # --- AI Gateway (AI Governance Add-On) -------------------------------- # AI Gateway is enabled by default in v2.34; set explicitly for clarity. # NOTE: provider env vars are deprecated and seed the DB ONCE. After the From 95a1eb2be64e1b544746035924a899d6f6f11740 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 14:03:19 +0000 Subject: [PATCH 03/16] feat: per-template GitLab auth, disable path apps, UNCLASSIFIED banner Harden the demo Coder deployment along three axes the user requested: - Every workspace template now requires in-boundary GitLab login. The claude-code template declares `data "coder_external_auth" "gitlab"`, so a workspace must complete the GitLab OAuth flow before the agent is ready; the agent git credential helper then injects a short-lived token for clone/fetch/push. No PATs/SSH keys in the workspace, no out-of-boundary auth path. - Disable path-based workspace apps (CODER_DISABLE_PATH_APPS=true). All templates serve apps with subdomain=true, so apps are now subdomain-only and the same-origin path-app surface is removed. - Add scripts/set-appearance.sh to set the green "UNCLASSIFIED - USGOVCLOUD" classification banner. Appearance is a runtime DB setting (premium-gated), not a Helm value, so the script makes it reproducible and idempotent. Verified live: template version /external-auth lists gitlab as required, deployment config disable_path_apps=true, GET /api/v2/appearance shows the banner. Generated by Coder Agents. --- STATUS.md | 17 +++++++ coder-templates/claude-code/main.tf | 18 +++++++ deploy/coder/values.yaml | 9 ++++ scripts/set-appearance.sh | 77 +++++++++++++++++++++++++++++ 4 files changed, 121 insertions(+) create mode 100755 scripts/set-appearance.sh diff --git a/STATUS.md b/STATUS.md index d9d8f4e..788ede8 100644 --- a/STATUS.md +++ b/STATUS.md @@ -97,6 +97,23 @@ gated; Nova Pro is the proven fallback. external-auth provider, so no auth path leaves the GovCloud boundary. (App id/secret recorded in `generated-secrets.env` as `GITLAB_CODER_OAUTH_*`.) +- [x] **Every workspace template requires GitLab login.** The `claude-code` + template declares `data "coder_external_auth" "gitlab"` (id `gitlab`), + so each workspace must complete the in-boundary GitLab OAuth flow before + the agent reports ready; the agent's git credential helper then injects a + short-lived token for clone/fetch/push. Verified: the active template + version's `/external-auth` lists `gitlab` as required. + +## Demo hardening (runtime + Helm) +- [x] **Path-based workspace apps disabled** (`CODER_DISABLE_PATH_APPS=true`, + Helm rev 4). Workspace apps are served only from their own + `*.usgov.coderdemo.io` subdomains (all templates use `subdomain = true`), + removing the same-origin path-app attack surface. Verified live + (`deployment/config.disable_path_apps = true`). +- [x] **Classification banner** enabled: green `UNCLASSIFIED - USGOVCLOUD` + (`#007a33`). This is a runtime DB setting (premium-gated), NOT in Helm; + reproduce with `scripts/set-appearance.sh` (idempotent). Verified via + `GET /api/v2/appearance`. ## Out of scope (demo) OpenShift, Istio, observability, full identity sync. diff --git a/coder-templates/claude-code/main.tf b/coder-templates/claude-code/main.tf index a4d49e6..a57dc64 100644 --- a/coder-templates/claude-code/main.tf +++ b/coder-templates/claude-code/main.tf @@ -92,6 +92,24 @@ data "coder_workspace_owner" "me" {} # false for a normal workspace build, and `prompt` carries the task prompt. data "coder_task" "me" {} +# ----------------------------------------------------------------------------- +# Git external auth — in-cluster GitLab (in-boundary) +# ----------------------------------------------------------------------------- +# Every workspace authenticates git against the in-cluster GitLab through +# Coder's external-auth provider `gitlab` (configured on the Coder server, see +# deploy/coder/values.yaml CODER_EXTERNAL_AUTH_0_*). Declaring this data source +# makes the workspace REQUIRE a GitLab login: the dashboard surfaces a "Login +# with GitLab" control and the agent only reports the auth as satisfied once +# the owner has completed the OAuth flow. The Coder agent's git credential +# helper then injects the short-lived OAuth token for any clone/fetch/push to +# gitlab.usgov.coderdemo.io. No PATs or SSH keys live in the workspace, and no +# auth path leaves the GovCloud boundary. +# +# id MUST match CODER_EXTERNAL_AUTH_0_ID on the Coder server ("gitlab"). +data "coder_external_auth" "gitlab" { + id = "gitlab" +} + # ----------------------------------------------------------------------------- # Parameters — sizing and the AI task prompt # ----------------------------------------------------------------------------- diff --git a/deploy/coder/values.yaml b/deploy/coder/values.yaml index c79248c..71fdd08 100644 --- a/deploy/coder/values.yaml +++ b/deploy/coder/values.yaml @@ -147,6 +147,15 @@ coder: - name: CODER_EXTERNAL_AUTH_0_SCOPES value: "read_user read_repository write_repository" + # --- App isolation hardening ------------------------------------------ + # Disable path-based workspace apps so every workspace app is served only + # from its own *.usgov.coderdemo.io subdomain. Path apps share the main + # dashboard origin and can make authenticated requests to the Coder API, + # so disabling them is the hardened posture. All templates here serve + # their apps with subdomain = true, so nothing relies on path apps. + - name: CODER_DISABLE_PATH_APPS + value: "true" + # --- AI Gateway (AI Governance Add-On) -------------------------------- # AI Gateway is enabled by default in v2.34; set explicitly for clarity. # NOTE: provider env vars are deprecated and seed the DB ONCE. After the diff --git a/scripts/set-appearance.sh b/scripts/set-appearance.sh new file mode 100755 index 0000000..67bfd0f --- /dev/null +++ b/scripts/set-appearance.sh @@ -0,0 +1,77 @@ +#!/usr/bin/env bash +# ============================================================================= +# set-appearance.sh — set the Coder dashboard appearance (classification banner) +# ============================================================================= +# The appearance config (announcement banners, app name, logo) is a RUNTIME +# setting stored in the Coder database, NOT in the Helm chart. This script +# reproduces it idempotently so the demo banner survives a fresh deploy. +# +# Requires the premium/Enterprise license (announcement banners are gated). +# +# Usage: +# ./scripts/set-appearance.sh +# +# Env (with sane demo defaults): +# DEMO_CODER_URL default https://dev.usgov.coderdemo.io +# BANNER_MESSAGE default "UNCLASSIFIED - USGOVCLOUD" +# BANNER_COLOR default "#007a33" (IC/DoD UNCLASSIFIED green) +# Admin creds are read from ~/.config/usgov-coderdemo/generated-secrets.env. +# +# NOTE: This intentionally uses DEMO_CODER_URL, not CODER_URL. When this runs +# inside a Coder workspace, the agent already exports CODER_URL pointing at the +# HOST Coder (e.g. https://dev.coder.com); reusing it would target the wrong +# deployment. +set -euo pipefail + +export CODER_URL="${DEMO_CODER_URL:-https://dev.usgov.coderdemo.io}" +export BANNER_MESSAGE="${BANNER_MESSAGE:-UNCLASSIFIED - USGOVCLOUD}" +export BANNER_COLOR="${BANNER_COLOR:-#007a33}" +SECRETS="${HOME}/.config/usgov-coderdemo/generated-secrets.env" + +# shellcheck disable=SC1090 +. "${SECRETS}" +export CODER_ADMIN_EMAIL CODER_ADMIN_PASSWORD + +# Login and PUT the appearance in one Python pass: avoids shell JSON quoting +# bugs and is resilient to special characters in the banner or credentials. +python3 - <<'PY' +import json, os, urllib.request, urllib.error, sys + +base = os.environ["CODER_URL"].rstrip("/") + + +def call(method, path, body=None, token=None): + headers = {"Content-Type": "application/json"} + if token: + headers["Coder-Session-Token"] = token + data = json.dumps(body).encode() if body is not None else None + req = urllib.request.Request(base + path, data=data, headers=headers, method=method) + try: + r = urllib.request.urlopen(req) + raw = r.read().decode() + return r.status, (json.loads(raw) if raw else None) + except urllib.error.HTTPError as e: + print(f"FAILED: {method} {path} -> {e.code} {e.read().decode()[:300]}", file=sys.stderr) + sys.exit(1) + + +_, login = call("POST", "/api/v2/users/login", { + "email": os.environ["CODER_ADMIN_EMAIL"], + "password": os.environ["CODER_ADMIN_PASSWORD"], +}) +token = login["session_token"] + +call("PUT", "/api/v2/appearance", { + "application_name": "", + "logo_url": "", + "service_banner": {"enabled": False}, + "announcement_banners": [{ + "enabled": True, + "message": os.environ["BANNER_MESSAGE"], + "background_color": os.environ["BANNER_COLOR"], + }], +}, token=token) + +status, appearance = call("GET", "/api/v2/appearance", token=token) +print("appearance set:", json.dumps(appearance["announcement_banners"])) +PY From d051303880a39979171cd6ccbeccea088e13c19b Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 14:18:54 +0000 Subject: [PATCH 04/16] docs: add as-built documentation for the GovCloud Coder demo Add docs/as-built/, the engineering record of what is deployed and how it is configured, produced by a fan-out of documentation agents and cross-checked against live read-only state: - 00-overview: architecture, component map, topology diagram, core flows. - 10-infrastructure: GovCloud substrate (VPC, EKS standard-not-Auto-Mode and why, node group, IRSA, RDS, ECR, Route53, ACM, NLB). - 20-platform-kubernetes: namespaces, ingress, storage, workspace RBAC, Secrets. - 30-coder-control-plane: values.yaml walkthrough, OIDC SSO, auth hardening, licensing, appearance. - 40-identity-keycloak: realm coder, OIDC client, the no-group-sync gap. - 50-gitlab-scm: in-boundary GitLab, the OAuth app, per-workspace git auth. - 60-ai-gateway: AI Bridge providers, name-based routing, end-to-end flow, remaining action. - 70-workspace-templates: the claude-code template and required GitLab auth. - 80-iac-vs-imperative: declarative (Terraform) vs imperative ledger plus a reconciliation backlog. - 90-operations-runbook: day-2 ops and known gaps. Cross-linked from docs/00-INDEX.md and STATUS.md. Verified emdash/endash-free. Generated by Coder Agents. --- STATUS.md | 3 + docs/00-INDEX.md | 12 + docs/as-built/00-overview.md | 225 +++++++++++++++++ docs/as-built/10-infrastructure.md | 243 ++++++++++++++++++ docs/as-built/20-platform-kubernetes.md | 161 ++++++++++++ docs/as-built/30-coder-control-plane.md | 313 ++++++++++++++++++++++++ docs/as-built/40-identity-keycloak.md | 231 +++++++++++++++++ docs/as-built/50-gitlab-scm.md | 185 ++++++++++++++ docs/as-built/60-ai-gateway.md | 214 ++++++++++++++++ docs/as-built/70-workspace-templates.md | 201 +++++++++++++++ docs/as-built/80-iac-vs-imperative.md | 131 ++++++++++ docs/as-built/90-operations-runbook.md | 222 +++++++++++++++++ docs/as-built/README.md | 30 +++ 13 files changed, 2171 insertions(+) create mode 100644 docs/as-built/00-overview.md create mode 100644 docs/as-built/10-infrastructure.md create mode 100644 docs/as-built/20-platform-kubernetes.md create mode 100644 docs/as-built/30-coder-control-plane.md create mode 100644 docs/as-built/40-identity-keycloak.md create mode 100644 docs/as-built/50-gitlab-scm.md create mode 100644 docs/as-built/60-ai-gateway.md create mode 100644 docs/as-built/70-workspace-templates.md create mode 100644 docs/as-built/80-iac-vs-imperative.md create mode 100644 docs/as-built/90-operations-runbook.md create mode 100644 docs/as-built/README.md diff --git a/STATUS.md b/STATUS.md index 788ede8..e20dcb1 100644 --- a/STATUS.md +++ b/STATUS.md @@ -3,6 +3,9 @@ Single source of progress truth for the lean Coder+AI GovCloud demo. Target: `us-gov-west-1`, `usgov.coderdemo.io`. Account `430737322961`. +> Engineering "as-built" documentation (architecture, configuration, and the +> declarative-vs-imperative ledger) lives in [`docs/as-built/`](docs/as-built/README.md). + > Overnight autonomous build by Coder Agents. **The full stack is deployed and > running.** One action remains before AI responses work end to end: drop a real > Anthropic API key into the `anthropic` AI provider (see "Remaining action"). diff --git a/docs/00-INDEX.md b/docs/00-INDEX.md index eff1592..acfec90 100644 --- a/docs/00-INDEX.md +++ b/docs/00-INDEX.md @@ -4,11 +4,23 @@ | Audience | File | |---|---| +| **As-built (what was actually deployed)** | **[as-built/README.md](as-built/README.md)** | | Human setup | [PRE-REQUISITES.md](PRE-REQUISITES.md) | | Orchestrator | [swarm/ORCHESTRATOR.md](swarm/ORCHESTRATOR.md) | | **All agents** | **[AGENT-PRD.md](AGENT-PRD.md)** | | Subagents | [swarm/RULES.md](swarm/RULES.md) + [swarm/workstreams/](swarm/workstreams/) | +## As-built (current deployment) + +The engineering record of what is deployed and how it is configured. The swarm +and workstream docs below describe the planned build; `as-built/` describes the +live result. + +- [as-built/README.md](as-built/README.md) (index) +- [as-built/00-overview.md](as-built/00-overview.md): architecture + flows +- [as-built/80-iac-vs-imperative.md](as-built/80-iac-vs-imperative.md): declarative vs imperative ledger +- [as-built/90-operations-runbook.md](as-built/90-operations-runbook.md): day-2 ops + ## Architecture - [architecture/overview.md](architecture/overview.md) diff --git a/docs/as-built/00-overview.md b/docs/as-built/00-overview.md new file mode 100644 index 0000000..4eba221 --- /dev/null +++ b/docs/as-built/00-overview.md @@ -0,0 +1,225 @@ +# As-built overview: Coder + AI demo on AWS GovCloud + +Status source of truth: [`STATUS.md`](../../STATUS.md). This document describes +the environment **as it was actually built**, which differs in places from the +original target design in [`docs/architecture/`](../architecture/). Those +deviations are called out inline. + +- Region / account: `us-gov-west-1`, account `430737322961`, partition + `aws-us-gov`. Everything runs inside the GovCloud boundary. +- Domain: `usgov.coderdemo.io`. +- Coder version: `v2.34.0` (confirmed live via + `GET https://dev.usgov.coderdemo.io/api/v2/buildinfo` -> `v2.34.0+3006da5`), + licensed with the AI Governance add-on plus premium entitlements. + +## What the demo proves + +A self-contained, in-boundary developer platform where every authentication, +source-control, and AI path stays inside AWS GovCloud: + +1. **Coder control plane** on EKS as the single governance and workspace plane + (`deploy/coder/values.yaml`). +2. **Keycloak SSO** as the identity provider via OIDC (realm `coder`), so users + sign in with "Sign in with Keycloak" instead of any external IdP + (`deploy/coder/values.yaml`, `deploy/keycloak/realm-coder.json`). +3. **In-boundary GitLab** as the source-control manager, wired as a Coder git + external-auth provider so workspace git operations use short-lived + in-boundary OAuth tokens (`deploy/gitlab/`, `deploy/coder/values.yaml`). +4. **Coder AI Gateway (AI Bridge)** as the governed egress for model traffic, + fronting two providers: `anthropic` (direct to `api.anthropic.com` over the + NAT gateway) and `anthropic-bedrock` (Amazon Bedrock in-region via IRSA, no + static keys). +5. **Coder Agents running Claude Code** in workspace pods, talking only to the + AI Gateway with the owner's session token, never holding a raw model key + (`coder-templates/claude-code/main.tf`). + +The hardening posture removes external egress paths: Coder's built-in GitHub +login default provider is disabled, path-based workspace apps are disabled, and +the only model egress is the governed AI Gateway path. + +> **Deviation from the target design.** The original architecture docs +> (`docs/architecture/overview.md`, `target-architecture.md`) placed GitLab on +> EC2, RDS on PostgreSQL 17 Multi-AZ, and reserved Istio, OpenShift, Grafana, +> and full identity sync for later phases. As built, GitLab runs **in-cluster** +> on EKS as a single-container StatefulSet with embedded Postgres, RDS is a +> **Multi-AZ PostgreSQL 18.4** instance (one shared instance backing both the +> `coder` and `keycloak` databases, on version 18.4 rather than 17), and Istio +> / OCP / observability / identity sync are **out of scope** (`STATUS.md`). The identity doc +> (`docs/architecture/identity.md`) names realm `usgov`; the realm that was +> actually imported and is in use is **`coder`** (`STATUS.md`, +> `deploy/CONVENTIONS.md`, `deploy/keycloak/README.md`). + +## Component map + +| Layer | Component | Where | Notes | +|---|---|---|---| +| Edge | Internet-facing NLB + ACM cert `*.usgov.coderdemo.io` | AWS | TLS terminates at the NLB; backends are plain HTTP (`deploy/CONVENTIONS.md`). | +| Edge | Route53 zone `Z06701704WFETYIRU5C8` | AWS | Alias A records `dev` / `auth` / `gitlab` / `*` -> NLB (`deploy/platform/README.md`). | +| Ingress | ingress-nginx (Helm chart `4.15.1`) + aws-load-balancer-controller | ns `ingress-nginx` | 2 controller replicas; `className: nginx`. | +| Control plane | Coder `v2.34.0` (1 replica) | ns `coder` | OIDC SSO, AI Gateway, GitLab external auth, path apps disabled (`deploy/coder/values.yaml`). | +| Identity | Keycloak `26.6.3`, realm `coder` | ns `keycloak` | OIDC client `coder`; admin console `/admin` (`deploy/keycloak/`). | +| SCM | GitLab CE `19.0.1-ce.0`, embedded Postgres | ns `gitlab` | Single-container Omnibus StatefulSet `gitlab-0` (`deploy/gitlab/`). | +| Workspaces | Claude Code template pods | ns `coder-workspaces` | `enterprise-base` image, gp3 PVC, Claude Code + AgentAPI + code-server (`coder-templates/claude-code/main.tf`). | +| Data | RDS PostgreSQL `18.4`, single instance | AWS | Databases `coder` and `keycloak`; `rds.force_ssl=1` (`deploy/CONVENTIONS.md`, `STATUS.md`). | +| Registry | ECR `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com` | AWS | Mirrored images, no pull-through in GovCloud (`scripts/mirror-images.sh`). | +| AI egress | AI Gateway -> `api.anthropic.com` via NAT; or Bedrock via IRSA | AWS | Provider `anthropic` (direct) and `anthropic-bedrock` (Bedrock). | + +EKS detail: cluster `usgov-coderdemo`, k8s `1.36`, standard EKS (Auto Mode was +abandoned in this account), managed node group `mng` of 3x `m5.xlarge` +(`AL2023_x86_64_STANDARD`, static capacity). Provisioners are **internal only**: +3 built-in provisioner daemons run in the coderd pod, with no external daemons +(`STATUS.md`, facts sheet). See `20`/`10` companion docs below. + +## Topology + +```mermaid +flowchart TB + user([Demo user / browser]) + + subgraph gov [AWS GovCloud us-gov-west-1 / account 430737322961] + r53[Route53 zone usgov.coderdemo.io] + nlb[Internet-facing NLB
ACM TLS *.usgov.coderdemo.io] + + subgraph eks [EKS cluster usgov-coderdemo / k8s 1.36] + nginx[ingress-nginx controller] + coder[Coder control plane
ns coder / v2.34.0] + kc[Keycloak
ns keycloak / realm coder] + gl[GitLab CE
ns gitlab / embedded Postgres] + ws[Workspace pods
ns coder-workspaces
Claude Code agent] + end + + rds[(RDS PostgreSQL 18.4
dbs coder + keycloak)] + bedrock[Amazon Bedrock] + nat[NAT gateway] + end + + anthropic([api.anthropic.com]) + + user --> r53 --> nlb --> nginx + nginx --> coder + nginx --> kc + nginx --> gl + coder --> rds + kc --> rds + coder -->|IRSA role usgov-coderdemo-coder-bedrock| bedrock + coder -->|AI Bridge egress| nat --> anthropic + coder --> ws + ws -->|session token| coder +``` + +ASCII summary for terminals: + +```text + Internet + | + Route53 (usgov.coderdemo.io) + | + NLB (ACM TLS *.usgov.coderdemo.io) + | + ingress-nginx + / | \ + Coder Keycloak GitLab (all on EKS; GitLab embeds its own Postgres) + | | + +--------+--> RDS PostgreSQL 18.4 (coder, keycloak dbs) + | + +--> coder SA -> Bedrock (IRSA, in-region, no static key) + +--> AI Bridge -> NAT gateway -> api.anthropic.com + | + +--> workspace pods (coder-workspaces) -> back to Coder via session token +``` + +## Core flows + +### A. User login / SSO via Keycloak OIDC + +1. A user opens `https://dev.usgov.coderdemo.io` and chooses "Sign in with + Keycloak" (button text from `CODER_OIDC_SIGN_IN_TEXT`). +2. Coder redirects to the Keycloak issuer + `https://auth.usgov.coderdemo.io/realms/coder`, with client id `coder` and + scopes `openid,profile,email` (`deploy/coder/values.yaml`). +3. Keycloak authenticates the user in realm `coder` and redirects back to + `https://dev.usgov.coderdemo.io/api/v2/users/oidc/callback` + (`deploy/keycloak/realm-coder.json`). +4. Coder validates the token server-side. The in-cluster NLB hairpin to the + public `auth.` hostname presents valid TLS, so server-side OIDC works + (`STATUS.md`, `deploy/platform/README.md`). Coder maps `email_field=email` + and `username_field=preferred_username`, and `CODER_OIDC_ALLOW_SIGNUPS=true` + lets a first-time SSO user self-provision. +5. **No group or role sync.** OIDC `group_field` is empty and the realm has no + group-claim mapper, so login grants an account only; group and role mapping + is a known gap (`STATUS.md`, facts sheet). The GitHub default login provider + is disabled, so the only sign-in paths are Keycloak SSO and the local + password owner (`deploy/coder/values.yaml`). + +### B. Workspace create -> GitLab external auth -> agent ready + +1. The user creates a workspace (or Coder Task) from the single template + `claude-code`. +2. The template declares `data "coder_external_auth" "gitlab"` (id `gitlab`), + so the dashboard surfaces a "Login with GitLab" control and the build blocks + until the owner completes the in-boundary GitLab OAuth flow + (`coder-templates/claude-code/main.tf`). Coder uses the GitLab endpoints + `…/oauth/authorize`, `…/oauth/token`, and `…/oauth/token/info` + (`deploy/coder/values.yaml`). +3. An in-process provisioner (one of the 3 built-in daemons in coderd) applies + the template Terraform: it creates a gp3 PVC and a pod in + `coder-workspaces`. The `coder-workspace-perms` Role/RoleBinding lets the + `coder` service account manage pods and PVCs in that namespace + (`deploy/platform/workspace-rbac.yaml`). +4. The pod boots the ECR-mirrored `enterprise-base` image and the agent connects + using `CODER_AGENT_TOKEN` / `CODER_AGENT_URL`. The `claude-code` module + installs Claude Code and AgentAPI; `code-server` is installed as an extra app + (`coder-templates/claude-code/main.tf`). +5. The agent reports ready once external auth is satisfied. Its git credential + helper then injects a short-lived GitLab OAuth token for clone / fetch / push + to `gitlab.usgov.coderdemo.io`. No PATs or SSH keys live in the workspace + (`STATUS.md`). + +### C. Claude Code request -> AI Bridge -> provider + +1. On the agent, the `claude-code` module sets + `ANTHROPIC_BASE_URL=https://dev.usgov.coderdemo.io/api/v2/aibridge/anthropic` + and `CLAUDE_API_KEY=`; the template also exports + `ANTHROPIC_AUTH_TOKEN` (the same session token). No raw Anthropic key is + placed in the workspace (`coder-templates/claude-code/main.tf`). +2. Claude Code POSTs to + `…/api/v2/aibridge/anthropic/v1/messages` with the session token. +3. The AI Gateway authenticates the session, applies governance and audit (AI + Governance add-on), routes by provider **name** `anthropic`, and forwards to + that provider's base URL `https://api.anthropic.com`. Egress leaves the VPC + through the single NAT gateway (`deploy/coder/values.yaml`, facts sheet). +4. The alternative provider `anthropic-bedrock` (type `bedrock`) calls Bedrock + in-region using the coder service account IRSA role + `usgov-coderdemo-coder-bedrock` (no static key), model + `us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0`, with small-fast model + `amazon.nova-pro-v1:0` (`deploy/coder/values.yaml`). +5. **Current state.** The `anthropic` provider holds a **placeholder** key, so + routing is verified end to end but returns `502 "all configured keys failed + authentication"`. The remaining action is to paste a real `sk-ant-...` key + into the `anthropic` provider at `/ai/settings` (UI, not the k8s secret). + Bedrock Claude Sonnet 4.5 access is still gated; `amazon.nova-pro-v1:0` is + the proven in-GovCloud fallback (`STATUS.md`, facts sheet). + +## Detailed companion documents + +The deep-dive as-built documents live in this directory, numbered by layer. +Confirmed filenames are linked. Companion docs that may still be in flight +(`10`, `20`, `60`, `70`, `80`) follow the same `NN-topic.md` convention; +confirm exact names against the directory listing if a link does not resolve. + +| Doc | Topic | +|---|---| +| `10-*.md` | Infrastructure substrate (Terraform): VPC, RDS, ECR, IRSA OIDC provider + Bedrock role, EKS cluster. | +| `20-*.md` | EKS platform: standard node group `mng`, addons, gp3 storage, ingress / NLB / DNS, workspace RBAC. | +| [`30-coder-control-plane.md`](30-coder-control-plane.md) | Coder Helm values, server env, hardening, licensing. | +| [`40-identity-keycloak.md`](40-identity-keycloak.md) | Keycloak realm `coder`, OIDC client, SSO config, identity gaps. | +| [`50-gitlab-scm.md`](50-gitlab-scm.md) | In-boundary GitLab and the Coder git external-auth provider. | +| `60-*.md` | AI Gateway / AI Bridge, DB-managed providers, Bedrock IRSA. | +| `70-*.md` | `claude-code` workspace template, Coder Agents, Tasks, code-server. | +| `80-*.md` | Additional layer (for example networking or security hardening); confirm topic in the directory. | +| [`90-operations-runbook.md`](90-operations-runbook.md) | Day-2 operations: access, upgrades, template push, image mirroring, known gaps. | + +--- + +*As-built documentation authored by Coder Agents. Read-only; grounded in repo +files and `STATUS.md`.* diff --git a/docs/as-built/10-infrastructure.md b/docs/as-built/10-infrastructure.md new file mode 100644 index 0000000..bac42d1 --- /dev/null +++ b/docs/as-built/10-infrastructure.md @@ -0,0 +1,243 @@ +# As-built: AWS GovCloud infrastructure + +As-built record of the AWS GovCloud substrate for the Coder demo. Every row is +grounded in a repo file or a read-only command run against the live account on +2026-06-07. Values that could not be verified are marked "unverified". + +## Account and region + +| Fact | Value | Source | +|---|---|---| +| Partition | `aws-us-gov` | `terraform/providers.tf`, `versions.lock.yaml` | +| Account | `430737322961` | `.substrate-outputs.json`, live `aws sts`/IAM ARNs | +| Region | `us-gov-west-1` | `terraform/variables.tf` (`region`), `versions.lock.yaml` | +| Public domain | `usgov.coderdemo.io` | `terraform/variables.tf` (`domain`), `versions.lock.yaml` | +| Terraform state backend | S3 `usgov-coderdemo-tfstate-430737322961`, DynamoDB lock `usgov-coderdemo-tflock`, encrypted | `terraform/backend.tf` | + +The backend S3 bucket and DynamoDB table are referenced by `backend.tf` but are +not declared in this Terraform; they are bootstrap inputs created out of band. + +## Component summary + +| Component | Identifier / key values | Source | +|---|---|---| +| VPC | `vpc-08a88ce74ae217bc7`, CIDR `10.0.0.0/16`, 3 AZ, 1 NAT gateway, 1 IGW | `terraform/vpc.tf`; live `aws ec2 describe-vpcs` | +| Public subnets | 3, `10.0.0.0/20` (1a), `10.0.16.0/20` (1b), `10.0.32.0/20` (1c), `map_public_ip_on_launch=true` | `terraform/vpc.tf`; live `aws ec2 describe-subnets` | +| Private subnets | 3, `10.0.48.0/20` (1a), `10.0.64.0/20` (1b), `10.0.80.0/20` (1c) | `terraform/vpc.tf`; live `aws ec2 describe-subnets` | +| NAT gateway | `nat-05f778038711165c0` in public subnet `subnet-081b77ab74f26fc2f`; egress path for `api.anthropic.com` | `terraform/vpc.tf`; live `aws ec2 describe-nat-gateways` | +| EKS cluster | `usgov-coderdemo`, k8s `1.36`, STANDARD (Auto Mode disabled), endpoint public+private | live `aws eks describe-cluster`; `terraform/eks.tf` | +| Cluster IAM role | `usgov-coderdemo-cluster` | `terraform/iam-eks.tf`; live IAM | +| Managed node group | `mng`: 3x `m5.xlarge`, `AL2023_x86_64_STANDARD`, ON_DEMAND, min2/desired3/max4, static, 20Gi disk, private subnets | live `aws eks describe-nodegroup`; `deploy/platform/README.md` | +| Node IAM role | `usgov-coderdemo-mngnode` (5 managed policies) | live IAM; `deploy/platform/README.md` | +| Unused node role | `usgov-coderdemo-node` (original Auto Mode role, left attached, unused) | `terraform/iam-eks.tf`; live IAM | +| Cluster addons | `vpc-cni`, `kube-proxy`, `coredns`, `aws-ebs-csi-driver` | live `aws eks list-addons` | +| EBS CSI IRSA role | `usgov-coderdemo-ebs-csi` (`AmazonEBSCSIDriverPolicy`) | live IAM; `deploy/platform/README.md` | +| Coder Bedrock IRSA role | `usgov-coderdemo-coder-bedrock`, inline policy `bedrock-invoke` | `terraform/irsa.tf`; live IAM | +| OIDC provider (IRSA) | `arn:aws-us-gov:iam::430737322961:oidc-provider/oidc.eks.us-gov-west-1.amazonaws.com/id/E9DB9E591C95ECB91F44EDCF38F146F2` | `terraform/irsa.tf`; `.substrate-outputs.json` | +| RDS instance | `usgov-coderdemo-pg`, PostgreSQL `18.4`, `db.m6g.large`, Multi-AZ, 50Gi gp3 encrypted, private | `terraform/rds.tf`; live `aws rds describe-db-instances` | +| RDS endpoint | `usgov-coderdemo-pg.crhk7w9eko3r.us-gov-west-1.rds.amazonaws.com:5432` | `.substrate-outputs.json`; live RDS | +| RDS security group | `sg-0f80f84106ca6502e`, ingress tcp/5432 from `10.0.0.0/16` | `terraform/rds.tf`; live RDS | +| RDS master secret | Secrets Manager `usgov-coderdemo/rds/master` (user `dbadmin`) | `terraform/rds.tf`; `.substrate-outputs.json` | +| ECR registry | `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com` | `.substrate-outputs.json`; `terraform/outputs.tf` | +| ACM certificate | `arn:aws-us-gov:acm:us-gov-west-1:430737322961:certificate/7f4fc566-8efd-4aa5-b6ba-3b0c9a535d12` (`*.usgov.coderdemo.io` + apex) | `versions.lock.yaml`; `deploy/platform/ingress-nginx-values.yaml` | +| Route53 zone | `Z06701704WFETYIRU5C8` (`usgov.coderdemo.io`) | `terraform/variables.tf`; live `aws route53` | +| Ingress NLB | internet-facing NLB `k8s-ingressn-ingressn-e16fe3cd33-c002102481951644.elb.us-gov-west-1.amazonaws.com` | live `kubectl`/`aws route53` | + +## VPC and egress + +The VPC is a single `10.0.0.0/16` network spanning three AZ +(`us-gov-west-1a/1b/1c`). Each AZ has one public `/20` and one private `/20` +subnet. Public subnets carry `map_public_ip_on_launch=true` and are tagged +`kubernetes.io/role/elb=1`; private subnets are tagged +`kubernetes.io/role/internal-elb=1`. Both subnet sets are tagged +`kubernetes.io/cluster/usgov-coderdemo=shared` (`terraform/vpc.tf`). + +A single NAT gateway (`nat-05f778038711165c0`) lives in a public subnet; the +private route table sends `0.0.0.0/0` through it. The public route table sends +`0.0.0.0/0` through the internet gateway. One NAT was a deliberate choice to +stay within the default Elastic IP quota and reduce cost +(`terraform/vpc.tf`). The NAT gateway is the only egress path out of the +boundary, used by the Anthropic-direct AI provider to reach `api.anthropic.com`. + +## EKS cluster: standard, not Auto Mode + +Live state shows the cluster running k8s `1.36` with Auto Mode fully disabled: + +``` +computeConfig.enabled = false +storageConfig.blockStorage = false +kubernetesNetworkConfig.elasticLoadBalancing.enabled = false +``` + +(`aws eks describe-cluster`). This is a deliberate divergence from +`terraform/eks.tf`, which still declares Auto Mode (`compute_config.enabled = +true`, `node_pools = ["general-purpose","system"]`, managed block storage and +elastic load balancing). + +Why Auto Mode was abandoned: in this GovCloud account the AWS-managed +service-linked role `AWSServiceRoleForAmazonEKS` lacks +`iam:AddRoleToInstanceProfile` and `iam:TagInstanceProfile`, so Auto Mode +NodeClass validation never completes (the controller creates an instance +profile but never attaches the role, then wedges on `EntityAlreadyExists`). +The cluster was converted to standard EKS instead of fighting the SLR +(`deploy/platform/README.md`, `deploy/platform/nodepool.yaml`, `STATUS.md`). +The abandoned Auto Mode NodeClass/NodePool workaround remains in the repo at +`deploy/platform/nodepool.yaml` but is not applied to the standard cluster. + +The cluster IAM role `usgov-coderdemo-cluster` still carries the five Auto Mode +policies (`AmazonEKSClusterPolicy`, `AmazonEKSComputePolicy`, +`AmazonEKSBlockStoragePolicy`, `AmazonEKSLoadBalancingPolicy`, +`AmazonEKSNetworkingPolicy`) from Terraform even though compute/storage/LB are +now self-managed (live `aws iam list-attached-role-policies`). + +## Managed node group `mng` + +| Attribute | Value | +|---|---| +| Instance type | `m5.xlarge` x3 | +| AMI type | `AL2023_x86_64_STANDARD` | +| Capacity type | `ON_DEMAND` | +| Scaling | min 2, desired 3, max 4 (static; no Karpenter, no cluster-autoscaler) | +| Disk | 20Gi | +| Subnets | the 3 private subnets | +| Node role | `usgov-coderdemo-mngnode` | +| Node version | `1.36` | + +Source: live `aws eks describe-nodegroup --cluster-name usgov-coderdemo +--nodegroup-name mng`. Live nodes report `v1.36.1-eks-3385e9b` on Amazon Linux +2023 (`kubectl get nodes -o wide`). + +The node role `usgov-coderdemo-mngnode` has five attached managed policies +(live `aws iam list-attached-role-policies`): + +- `AmazonEKSWorkerNodePolicy` +- `AmazonEKS_CNI_Policy` +- `AmazonEC2ContainerRegistryReadOnly` +- `AmazonSSMManagedInstanceCore` +- `AmazonEBSCSIDriverPolicy` + +The original Auto Mode node role `usgov-coderdemo-node` (from +`terraform/iam-eks.tf`) still exists but is unused (live `aws iam get-role`). + +## EBS CSI driver and IRSA + +`aws-ebs-csi-driver` runs as an EKS addon. Its controller could not reach IMDS +for credentials, so it uses IRSA: role `usgov-coderdemo-ebs-csi`, trusting the +cluster OIDC provider for service account `kube-system:ebs-csi-controller-sa`, +attached to `AmazonEBSCSIDriverPolicy` (one attached policy, no inline policies; +live `aws iam list-attached-role-policies` / `list-role-policies`). The addon +was bound to the role with `--service-account-role-arn` +(`deploy/platform/README.md`). The default StorageClass is `gp3` (documented in +the platform layer doc). + +## Coder Bedrock IRSA role + +Authored in `terraform/irsa.tf` (output `bedrock_role_arn`). The EKS OIDC +provider is registered as an IAM OIDC identity provider; the role trust policy +restricts `sts:AssumeRoleWithWebIdentity` to service account +`system:serviceaccount:coder:coder` with audience `sts.amazonaws.com`. + +Inline policy `bedrock-invoke`, exact live actions and resources from +`aws iam get-role-policy --role-name usgov-coderdemo-coder-bedrock +--policy-name bedrock-invoke`: + +- Actions: `bedrock:InvokeModel`, `bedrock:InvokeModelWithResponseStream` +- Resources: + - `arn:aws-us-gov:bedrock:us-gov-west-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0` + - `arn:aws-us-gov:bedrock:us-gov-west-1::foundation-model/amazon.nova-pro-v1:0` + - `arn:aws-us-gov:bedrock:us-gov-west-1:430737322961:inference-profile/us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0` + - `arn:aws-us-gov:bedrock:us-gov-east-1::foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0` + +The `us-gov.` cross-region inference profile can route to both GovCloud +regions, so the underlying foundation model is allowlisted in both +`us-gov-west-1` and `us-gov-east-1`; Nova Pro is allowlisted in-region only +(`terraform/irsa.tf`). + +## RDS PostgreSQL + +| Attribute | Value | +|---|---| +| Identifier | `usgov-coderdemo-pg` | +| Engine | PostgreSQL `18.4` | +| Class | `db.m6g.large` | +| Storage | 50Gi gp3, encrypted, autoscale to 200Gi | +| Multi-AZ | true (standby instance; Multi-AZ DB clusters are unsupported in GovCloud) | +| Public access | false | +| Endpoint | `usgov-coderdemo-pg.crhk7w9eko3r.us-gov-west-1.rds.amazonaws.com:5432` | +| Default db / master | db `coder`, master user `dbadmin` | +| Security group | `sg-0f80f84106ca6502e` | +| TLS enforcement | `rds.force_ssl=1` | + +Sources: `terraform/rds.tf`; live `aws rds describe-db-instances`. The security +group allows tcp/5432 from the VPC CIDR `10.0.0.0/16` only (`terraform/rds.tf`). +`rds.force_ssl=1` is set on the in-use parameter group `default.postgres18` +(live `aws rds describe-db-parameters`), so all clients connect with TLS +(`sslmode=require`). + +Logical databases and roles were created imperatively after provisioning (not in +Terraform): role `coder` owns database `coder`, role `keycloak` owns database +`keycloak` (`deploy/platform/README.md`). GitLab does not use RDS; it runs the +Omnibus embedded PostgreSQL (`deploy/gitlab/README.md`, +`deploy/gitlab/statefulset.yaml`). + +Master credentials are stored in Secrets Manager `usgov-coderdemo/rds/master` +as JSON (`username`, `password`, `host`, `port`); the secret is created by +Terraform (`terraform/rds.tf`, output `rds_secret_arn`). + +## ECR registry and image mirror + +Registry host: `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com` +(`terraform/outputs.tf` is a derived value; the host is not a managed resource). +GovCloud has no ECR pull-through cache, so images are mirrored with `crane` by +`scripts/mirror-images.sh` reading `scripts/images.txt`. The script maps +upstream registries to ECR repository paths +(`docker.io -> docker-hub/...`, `ghcr.io -> ghcr/...`, `quay.io -> quay/...`) +and creates each repo IMMUTABLE with scan-on-push. + +Mirrored repositories present live (`aws ecr describe-repositories`): + +| ECR repository | Upstream (pinned) | Used by | +|---|---|---| +| `ghcr/coder/coder` | `ghcr.io/coder/coder:v2.34.0` | Coder control plane | +| `quay/keycloak/keycloak` | `quay.io/keycloak/keycloak:26.6.3` | Keycloak | +| `docker-hub/gitlab/gitlab-ce` | `docker.io/gitlab/gitlab-ce:19.0.1-ce.0` | GitLab | +| `docker-hub/codercom/enterprise-base` | `docker.io/codercom/enterprise-base:ubuntu-noble-20260601` | Workspace base image | +| `docker-hub/library/postgres` | `postgres:18-alpine` | DB bootstrap job | + +Sources: `scripts/images.txt`, `scripts/mirror-images.sh`, +`deploy/*/README.md`; live `aws ecr describe-repositories`. + +## DNS, ACM, and the NLB ingress path + +Route53 hosted zone `Z06701704WFETYIRU5C8` holds these records (live +`aws route53 list-resource-record-sets`): + +| Record | Type | Target | +|---|---|---| +| `usgov.coderdemo.io` | NS, SOA | delegation | +| `dev.usgov.coderdemo.io` | A (alias) | ingress NLB | +| `auth.usgov.coderdemo.io` | A (alias) | ingress NLB | +| `gitlab.usgov.coderdemo.io` | A (alias) | ingress NLB | +| `*.usgov.coderdemo.io` | A (alias) | ingress NLB | +| `_2632...usgov.coderdemo.io` | CNAME | ACM DNS validation | + +All four service/wildcard records are alias A records pointing at the +internet-facing NLB +`k8s-ingressn-ingressn-e16fe3cd33-c002102481951644.elb.us-gov-west-1.amazonaws.com`. +The Route53 zone and the ACM certificate pre-exist this Terraform (referenced by +ID/ARN in `terraform/variables.tf`); the records were created imperatively at +deploy time (`deploy/platform/README.md`). + +Ingress path: + +``` +client --HTTPS 443--> NLB (TLS terminated, ACM *.usgov.coderdemo.io) + --HTTP--> ingress-nginx controller --HTTP--> app pods +``` + +The single ACM certificate `7f4fc566-8efd-4aa5-b6ba-3b0c9a535d12` covers the +apex and the single-level wildcard `*.usgov.coderdemo.io`, which matches both +the Coder dashboard host and the workspace-app wildcard. TLS terminates at the +NLB; traffic from the NLB to nginx and from nginx to pods is plain HTTP. NLB +provisioning, listener, and TLS detail are covered in +`docs/as-built/20-platform-kubernetes.md`. diff --git a/docs/as-built/20-platform-kubernetes.md b/docs/as-built/20-platform-kubernetes.md new file mode 100644 index 0000000..da6027e --- /dev/null +++ b/docs/as-built/20-platform-kubernetes.md @@ -0,0 +1,161 @@ +# As-built: Kubernetes platform layer + +The shared cluster platform that every app depends on: namespaces, ingress and +the NLB, storage, workspace RBAC, and the platform-owned Secrets. Grounded in +repo files and read-only `kubectl` output captured 2026-06-07 against EKS +cluster `usgov-coderdemo` (k8s 1.36). Mutating steps were performed during the +overnight build; `deploy/platform/README.md` is the reproducible record. + +## Namespaces and what runs in each + +Live `kubectl get ns` plus `kubectl get pods -A -o wide`: + +| Namespace | Workloads (live) | +|---|---| +| `coder` | Coder control plane `coder` (Deployment, 1 replica) | +| `coder-workspaces` | Workspace pods (e.g. `coder-8e0c3f4a-...`, 1/1 Running) | +| `gitlab` | `gitlab-0` (StatefulSet, embedded Postgres/Redis) | +| `keycloak` | `keycloak` (Deployment, 1 replica) | +| `ingress-nginx` | `ingress-nginx-controller` (2 replicas) | +| `kube-system` | `aws-load-balancer-controller` (2), `aws-node`/vpc-cni, `coredns` (2), `kube-proxy`, `ebs-csi-controller` (2) + `ebs-csi-node` (DaemonSet) | + +The `coder` and `coder-workspaces` namespaces are split on purpose: the control +plane runs in `coder`, while it provisions workspace pods into +`coder-workspaces` (see workspace RBAC below and +`coder-templates/claude-code/main.tf`). + +## Ingress: NLB, aws-load-balancer-controller, ingress-nginx + +Two controllers cooperate: + +- `aws-load-balancer-controller` (Helm release in `kube-system`, 2 replicas) + provisions and manages the NLB for the ingress-nginx controller Service. +- `ingress-nginx` (Helm chart `4.15.1`, 2 controller replicas) is the in-cluster + ingress. Every app `Ingress` uses `ingressClassName: nginx`. + +The ingress-nginx controller Service is `type: LoadBalancer` and is opted in to +the LB controller (not the in-tree provider) via +`aws-load-balancer-type: external`. Live annotations on the Service +(`kubectl get svc -n ingress-nginx ingress-nginx-controller`): + +| Annotation | Value | +|---|---| +| `aws-load-balancer-type` | `external` | +| `aws-load-balancer-scheme` | `internet-facing` | +| `aws-load-balancer-nlb-target-type` | `ip` | +| `aws-load-balancer-backend-protocol` | `tcp` | +| `aws-load-balancer-ssl-cert` | `arn:aws-us-gov:acm:us-gov-west-1:430737322961:certificate/7f4fc566-8efd-4aa5-b6ba-3b0c9a535d12` | +| `aws-load-balancer-ssl-ports` | `443` | +| `aws-load-balancer-cross-zone-load-balancing-enabled` | `true` | + +These match `deploy/platform/ingress-nginx-values.yaml`. Public subnets are +auto-discovered through the `kubernetes.io/role/elb=1` subnet tag. + +TLS terminates at the NLB. Both Service ports forward to the controller's plain +HTTP container port (live `.spec.ports`): port `443` -> `targetPort: http` and +port `80` -> `targetPort: http`. So the NLB decrypts on 443 and forwards plain +TCP to nginx HTTP, and traffic from nginx to pods is also plain HTTP. To avoid +an http->https redirect loop on an L4 NLB that does not inject a trustworthy +`X-Forwarded-Proto`, the controller config sets `ssl-redirect: "false"` and +`use-forwarded-headers: "true"`, plus websocket-friendly timeouts and +`proxy-body-size: "0"` (`deploy/platform/ingress-nginx-values.yaml`). + +Two IngressClasses exist live: `nginx` (`k8s.io/ingress-nginx`, used by all app +ingresses) and `alb` (`ingress.k8s.aws/alb`, shipped by the LB controller, not +used by any app). All three app ingresses (`coder`, `gitlab`, `keycloak`) +resolve to the same NLB address (live `kubectl get ingress -A`). + +Hairpin: the Route53 names resolve to the public NLB, and in-cluster requests to +those public hostnames route back through the NLB with valid TLS. This lets +Coder's server-side OIDC calls to Keycloak and workspace agent connections work +without split-horizon DNS (`deploy/platform/README.md`, `STATUS.md`). + +## Storage + +Live `kubectl get sc`: + +| StorageClass | Provisioner | Default | Binding | Encrypted | Expansion | +|---|---|---|---|---|---| +| `gp3` | `ebs.csi.aws.com` | yes | `WaitForFirstConsumer` | `true` | `true` | +| `gp2` | `kubernetes.io/aws-ebs` (in-tree) | no | `WaitForFirstConsumer` | n/a | `false` | + +`gp3` is the platform-created default and is the class every workload uses +(`gp3` parameters `type=gp3`, `encrypted=true`; live `kubectl get sc gp3 -o +yaml`). `gp2` is the legacy in-tree class that ships with the cluster and is not +used. GitLab's three PVCs and the workspace home PVC request `gp3` +(`deploy/gitlab/statefulset.yaml`, `coder-templates/claude-code/main.tf`). + +Note: `deploy/gitlab/README.md` mentions an EKS Auto Mode class `auto-ebs-sc`, +but that is superseded; the committed `statefulset.yaml` and the live cluster +both use `gp3` (Auto Mode was disabled). + +## Workspace RBAC + +`deploy/platform/workspace-rbac.yaml` declares a `Role` + `RoleBinding` named +`coder-workspace-perms` in the `coder-workspaces` namespace, binding the +`coder/coder` ServiceAccount. The rules grant, on `pods` and +`persistentvolumeclaims` (core API) and `deployments` (`apps`): +`create, delete, deletecollection, get, list, patch, update, watch`. + +This is needed because the Coder Helm chart's `serviceAccount.workspacePerms` +only creates the equivalent Role in the release namespace (`coder`), but +workspaces run in `coder-workspaces`. Live state confirms the Role exists in +both namespaces (`kubectl get role,rolebinding -n coder` and +`-n coder-workspaces`): + +| Namespace | Role | RoleBinding | Origin | +|---|---|---|---| +| `coder` | `coder-workspace-perms` | `coder` | Coder Helm chart (`serviceAccount.workspacePerms: true`) | +| `coder-workspaces` | `coder-workspace-perms` | `coder-workspace-perms` | `deploy/platform/workspace-rbac.yaml` (applied imperatively) | + +## Platform-owned Kubernetes Secrets + +The platform layer creates the application Secrets imperatively so they never +touch git; the committed `secrets.example.yaml` files document the exact +names/keys (`deploy/platform/README.md`, `deploy/coder/secrets.example.yaml`). +The four Secrets in the `coder` namespace and their consumers: + +| Secret | Keys | Consumed by | How | +|---|---|---|---| +| `coder-db` | `url` | Coder control plane | `CODER_PG_CONNECTION_URL` (full libpq URL to the `coder` RDS database, `sslmode=require`) | +| `coder-oidc` | `client-secret` | Coder control plane | `CODER_OIDC_CLIENT_SECRET` for Keycloak realm `coder`, client `coder` | +| `coder-ai` | `ANTHROPIC_API_KEY` | Coder AI Gateway, provider `anthropic` | `CODER_AI_GATEWAY_PROVIDER_0_KEY` (Anthropic-direct; the Bedrock provider uses IRSA and needs no key) | +| `coder-external-auth` | `gitlab-client-id`, `gitlab-client-secret` | Coder external auth | `CODER_EXTERNAL_AUTH_0_CLIENT_ID` / `_SECRET` for in-cluster GitLab git auth | + +Source: `deploy/coder/values.yaml` (env `valueFrom.secretKeyRef`) and +`deploy/coder/secrets.example.yaml`. + +For completeness, the other app namespaces own their own Secrets, also created +imperatively (`deploy/platform/README.md`, `deploy/keycloak/README.md`, +`deploy/gitlab/README.md`): + +| Secret | Namespace | Consumed by | +|---|---|---| +| `keycloak-db` (`username`,`password`) | `keycloak` | Keycloak `KC_DB_USERNAME`/`KC_DB_PASSWORD` | +| `keycloak-admin` (`username`,`password`) | `keycloak` | Keycloak bootstrap admin | +| `gitlab-secrets` (`initial_root_password`) | `gitlab` | GitLab `GITLAB_INITIAL_ROOT_PASSWORD` (first boot only) | + +## Coder ServiceAccount and IRSA + +The Helm chart creates ServiceAccount `coder` in the `coder` namespace and +annotates it for IRSA. Live annotation (`kubectl get sa coder -n coder`): +`eks.amazonaws.com/role-arn: +arn:aws-us-gov:iam::430737322961:role/usgov-coderdemo-coder-bedrock`. This is +how the AI Gateway Bedrock provider authenticates without static AWS keys +(`deploy/coder/values.yaml`; the role and policy are documented in +`docs/as-built/10-infrastructure.md`). + +## Helm releases vs applied manifests + +Live Helm releases (`kubectl get secret -A -l owner=helm`): + +| Release | Namespace | Revisions | +|---|---|---| +| `coder` | `coder` | v1..v4 | +| `ingress-nginx` | `ingress-nginx` | v1 | +| `aws-load-balancer-controller` | `kube-system` | v1 | + +Keycloak and GitLab are not Helm releases; they are plain manifests applied with +`kubectl apply` (`kubectl apply -k deploy/keycloak/`, `kubectl apply -f +deploy/gitlab/*.yaml`). See `docs/as-built/80-iac-vs-imperative.md` for the full +declarative-vs-imperative ledger. diff --git a/docs/as-built/30-coder-control-plane.md b/docs/as-built/30-coder-control-plane.md new file mode 100644 index 0000000..7fe8c3b --- /dev/null +++ b/docs/as-built/30-coder-control-plane.md @@ -0,0 +1,313 @@ +# 30. Coder control plane (as-built) + +Coder **v2.34.0** control plane for the GovCloud demo, served at +`https://dev.usgov.coderdemo.io`. This document walks `deploy/coder/values.yaml` +section by section and records what was verified against the live deployment. + +Scope of `values.yaml`: the Coder control plane only (Deployment, +ServiceAccount, Service, Ingress, and `env`). The platform layer owns +ingress-nginx, the NLB plus ACM cert, the `coder` namespace, and the k8s +Secrets referenced below. Source: `deploy/coder/values.yaml:1-16`, +`deploy/coder/README.md:15-27`. + +## Verification method + +Read-only. Logged in to the demo Coder with the admin credentials +(`POST /api/v2/users/login`) to obtain a session token, then issued `GET` +requests with the `Coder-Session-Token` header. Cluster facts came from +`kubectl get` and `helm list` against `./kubeconfig`. AWS facts came from +read-only `aws iam get-role` / `get-role-policy`. No mutating call was made. +Always target `https://dev.usgov.coderdemo.io` explicitly; the ambient +`$CODER_URL` points at a different host Coder and was not used. + +| Check | Source / command | Result | +|---|---|---| +| Server version | `GET /api/v2/buildinfo` | `v2.34.0+3006da5` | +| Helm release | `helm -n coder list` | `coder-2.34.0`, app `2.34.0`, revision **4**, `deployed` | +| Deployment image | `kubectl -n coder get deploy coder -o jsonpath` | `.../ghcr/coder/coder:v2.34.0` | +| Replicas | same | `1` | +| Service type | `kubectl -n coder get svc coder` | `ClusterIP` | +| Ingress | `kubectl -n coder get ingress` | class `nginx`, hosts `dev.` + `*.usgov.coderdemo.io` | + +## Image (ECR ghcr mirror) + +```yaml +coder: + image: + repo: "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/coder/coder" + tag: "v2.34.0" + pullPolicy: IfNotPresent +``` + +The upstream `ghcr.io/coder/coder:v2.34.0` is mirrored into private ECR because +GovCloud has no pull-through cache. The mirror path follows the convention +`ghcr.io/:` to `/ghcr/:` +(`deploy/CONVENTIONS.md:47-57`). The chart version is pinned to `2.34.0` +(`deploy/coder/README.md:48-53`, `deploy/CONVENTIONS.md:39-45`). Source: +`deploy/coder/values.yaml:18-26`. Verified live: the running Deployment uses +`430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/coder/coder:v2.34.0`. + +## ServiceAccount and IRSA (Bedrock) + +```yaml +serviceAccount: + name: coder + workspacePerms: true + enableDeployments: true + annotations: + eks.amazonaws.com/role-arn: "arn:aws-us-gov:iam::430737322961:role/usgov-coderdemo-coder-bedrock" +``` + +The chart always creates a ServiceAccount named `coder`. It is annotated for +IRSA so the AI Gateway Bedrock provider can call Bedrock with temporary +credentials and no static AWS keys. `workspacePerms: true` and +`enableDeployments: true` let the in-pod provisioner manage workspace pods and +Deployments. Source: `deploy/coder/values.yaml:28-37`. + +Verified live: the SA `coder/coder` carries the annotation +`eks.amazonaws.com/role-arn = arn:aws-us-gov:iam::430737322961:role/usgov-coderdemo-coder-bedrock` +(`kubectl -n coder get sa coder -o jsonpath`). The IRSA chain itself is +documented in `60-ai-gateway.md`. + +## Service (ClusterIP behind nginx) + +```yaml +service: + enable: true + type: ClusterIP +envUseClusterAccessURL: false +``` + +Coder sits behind ingress-nginx, so its Service must not provision a second +load balancer. The chart default is `LoadBalancer`; it is overridden to +`ClusterIP`. `envUseClusterAccessURL: false` stops the chart from injecting a +cluster-internal access URL because `CODER_ACCESS_URL` is set explicitly below. +Source: `deploy/coder/values.yaml:39-47`. Verified live: Service type is +`ClusterIP`. + +## Ingress (host, wildcard, TLS off, websocket annotations) + +```yaml +ingress: + enable: true + className: "nginx" + host: "dev.usgov.coderdemo.io" + wildcardHost: "*.usgov.coderdemo.io" + tls: + enable: false + annotations: + nginx.ingress.kubernetes.io/ssl-redirect: "false" + nginx.ingress.kubernetes.io/proxy-read-timeout: "86400" + nginx.ingress.kubernetes.io/proxy-send-timeout: "86400" + nginx.ingress.kubernetes.io/proxy-body-size: "0" +``` + +One internet-facing NLB routes to ingress-nginx, which routes to this Ingress. +TLS terminates upstream at the NLB via the ACM cert, so `tls.enable=false` and +the backend is plain HTTP. `ssl-redirect: "false"` avoids a redirect loop +because nginx talks plain HTTP to Coder. The two 86400-second proxy timeouts +and `proxy-body-size: "0"` support Coder's long-lived websockets (web terminal, +agent, logs) and large streamed payloads. Source: +`deploy/coder/values.yaml:49-67`; ingress contract in +`deploy/CONVENTIONS.md:25-33`. + +Verified live: the `coder` Ingress has `ingressClassName: nginx`, rules for +`dev.usgov.coderdemo.io` and `*.usgov.coderdemo.io`, and exactly the four nginx +annotations above (`kubectl -n coder get ingress`). + +## Access URLs + +```yaml +env: + - name: CODER_ACCESS_URL + value: "https://dev.usgov.coderdemo.io" + - name: CODER_WILDCARD_ACCESS_URL + value: "*.usgov.coderdemo.io" +``` + +The single-level wildcard lets the one ACM cert cover the dashboard and all +workspace apps. Source: `deploy/coder/values.yaml:69-75`. Verified live: +`GET /api/v2/deployment/config` reports `access_url=https://dev.usgov.coderdemo.io` +and `wildcard_access_url=*.usgov.coderdemo.io`. + +## Database + +`CODER_PG_CONNECTION_URL` is taken from Secret `coder-db` key `url`, a full +libpq connection string for the `coder` database on RDS. Source: +`deploy/coder/values.yaml:77-84`, `deploy/coder/secrets.example.yaml:16-31`. +The connection string enforces `sslmode=require` because RDS sets +`rds.force_ssl=1` (`deploy/coder/secrets.example.yaml:28-31`). The exact +credential value was not read (read-only, secrets out of scope). + +## OIDC SSO to Keycloak + +```yaml +- name: CODER_OIDC_ISSUER_URL + value: "https://auth.usgov.coderdemo.io/realms/coder" +- name: CODER_OIDC_CLIENT_ID + value: "coder" +- name: CODER_OIDC_CLIENT_SECRET # from Secret coder-oidc key client-secret +- name: CODER_OIDC_SCOPES + value: "openid,profile,email" +- name: CODER_OIDC_EMAIL_FIELD + value: "email" +- name: CODER_OIDC_USERNAME_FIELD + value: "preferred_username" +- name: CODER_OIDC_ALLOW_SIGNUPS + value: "true" +- name: CODER_OIDC_SIGN_IN_TEXT + value: "Sign in with Keycloak" +``` + +SSO points at the Keycloak realm `coder`. The confidential client secret comes +from Secret `coder-oidc` key `client-secret` (`deploy/coder/secrets.example.yaml:33-43`). +Self-provisioning on first login is enabled for the demo. Source: +`deploy/coder/values.yaml:86-107`. + +Verified live (`GET /api/v2/deployment/config`, `oidc` block): +`issuer_url=https://auth.usgov.coderdemo.io/realms/coder`, `client_id=coder`, +`email_field=email`, `username_field=preferred_username`, +`scopes=[openid, profile, email]`, `allow_signups=true`, +`sign_in_text="Sign in with Keycloak"`. Note: `oidc.group_field` is empty +(`None`), confirming group/role sync is not configured (known gap, tracked in +the facts sheet and STATUS notes). + +## Auth boundary hardening + +Three settings keep all login and git egress inside the GovCloud boundary. + +1. **GitHub default login provider disabled.** + `CODER_OAUTH2_GITHUB_DEFAULT_PROVIDER_ENABLE=false`. Coder's built-in GitHub + login uses Coder's hosted GitHub app and calls github.com, which is out of + boundary. Disabling it makes login Keycloak SSO plus the local password owner + only. Source: `deploy/coder/values.yaml:109-115`. Verified live: + `deployment/config` `oauth2.github.default_provider_enable=false`. + +2. **GitLab git external auth (in-boundary SCM).** + `CODER_EXTERNAL_AUTH_0_*` declares an explicit external-auth provider for the + in-cluster GitLab: id `gitlab`, type `gitlab`, display `GitLab`, client + id/secret from Secret `coder-external-auth` + (`deploy/coder/secrets.example.yaml:57-71`), explicit auth URL + `.../oauth/authorize`, token URL `.../oauth/token`, validate URL + `.../oauth/token/info`, regex `gitlab\.usgov\.coderdemo\.io`, scopes + `read_user read_repository write_repository`. Self-managed GitLab needs the + explicit URLs. Declaring this provider also suppresses Coder's built-in + github.com default external-auth provider. Source: + `deploy/coder/values.yaml:117-148`. Verified live: `deployment/config` + `external_auth[0]` matches (type `gitlab`, id `gitlab`, the three GitLab + OAuth URLs, regex, and the three scopes). In-workspace git auth is detailed + in `70-workspace-templates.md`. + +3. **Path-based workspace apps disabled.** + `CODER_DISABLE_PATH_APPS=true`. Path apps share the dashboard origin and can + make authenticated requests to the Coder API, so disabling them is the + hardened posture; every template here serves apps from its own subdomain. + Source: `deploy/coder/values.yaml:150-157`. Verified live: + `deployment/config` `disable_path_apps=true`. + +## AI Gateway env (seed-once provider config) + +```yaml +- name: CODER_AI_GATEWAY_ENABLED + value: "true" +# Provider 0: Anthropic direct +- CODER_AI_GATEWAY_PROVIDER_0_TYPE = "anthropic" +- CODER_AI_GATEWAY_PROVIDER_0_NAME = "anthropic" +- CODER_AI_GATEWAY_PROVIDER_0_BASE_URL = "https://api.anthropic.com" +- CODER_AI_GATEWAY_PROVIDER_0_KEY # from Secret coder-ai key ANTHROPIC_API_KEY +# Provider 1: Amazon Bedrock (IRSA, no static key) +- CODER_AI_GATEWAY_PROVIDER_1_TYPE = "bedrock" +- CODER_AI_GATEWAY_PROVIDER_1_NAME = "anthropic-bedrock" +- CODER_AI_GATEWAY_PROVIDER_1_BEDROCK_REGION = "us-gov-west-1" +- CODER_AI_GATEWAY_PROVIDER_1_BEDROCK_MODEL = "us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0" +- CODER_AI_GATEWAY_PROVIDER_1_BEDROCK_SMALL_FAST_MODEL = "amazon.nova-pro-v1:0" +# AWS SDK / IRSA resolution +- AWS_REGION = "us-gov-west-1" +- AWS_DEFAULT_REGION = "us-gov-west-1" +- AWS_STS_REGIONAL_ENDPOINTS = "regional" +``` + +AI Gateway is enabled by default in v2.34; it is set explicitly here for +clarity. Provider 0 is Anthropic-direct (primary), keyed from Secret `coder-ai`. +Provider 1 is Amazon Bedrock (secondary), authenticated by IRSA with no static +key; Bedrock-ness is detected from `BEDROCK_REGION`. The AWS region and regional +STS endpoint settings make the SDK use the GovCloud regional STS endpoint for +the IRSA `AssumeRoleWithWebIdentity` exchange. Source: +`deploy/coder/values.yaml:159-215`. Provider behavior, routing, and the IRSA +chain are documented in `60-ai-gateway.md`. + +Verified live: `deployment/config` `ai.bridge.enabled=true`, and the seeded +`ai.bridge.providers` array contains `anthropic` (type `anthropic`, base +`https://api.anthropic.com`) and `anthropic-bedrock` (type `bedrock`, region +`us-gov-west-1`, model `us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0`, +small fast model `amazon.nova-pro-v1:0`). `chat.ai_gateway_routing_enabled` is +`true`. + +### Deprecated-AI-provider-seed drift guard + +The `CODER_AI_GATEWAY_PROVIDER_*` env vars are deprecated as of v2.34. They seed +the database once on first startup; after that the database is authoritative and +providers are managed at `/ai/settings`. Editing a seeded env var in place later +(or changing the `coder-ai` secret contents) makes `coderd` refuse to start +(the drift guard). The safe workflow is to change providers in the dashboard, +then reconcile or remove the matching env vars. Treat these values as one-time +seed config and freeze them after first boot. Source: +`deploy/coder/values.yaml:13-16, 159-164`, `deploy/coder/README.md:123-140`. + +## Replicas + +```yaml +replicaCount: 1 +``` + +Single replica for the demo. HA (`replicaCount > 1`) is an Enterprise feature +and out of scope. Source: `deploy/coder/values.yaml:217-219`. Verified live: +Deployment `coder` has `spec.replicas=1`. + +## Licensing and entitlements (AI Governance Add-On plus premium) + +AI Gateway requires the AI Governance Add-On license. Per +`deploy/coder/README.md:142-156`, v2.34 has no `CODER_LICENSE` server env var +(the chart/server does not read a license from env or a Secret); the license is +a JWT applied at runtime and stored in the database, via `coder licenses add` or +the dashboard. A `CODER_LICENSE` value does exist in the operator's local env +file, but that is for applying the license with the CLI, not for the chart to +read. + +Verified live (`GET /api/v2/entitlements`): `has_license=true`, no warnings, and +the following are entitled and enabled: `aibridge`, `ai_governance_user_limit` +(limit 30, actual 1), `appearance`, `audit_log`, `connection_log`, +`high_availability`, `multiple_external_auth`, `multiple_organizations`, +`template_rbac`, `workspace_prebuilds`, and other premium features. This +confirms both the AI Governance add-on and the broader premium entitlement. + +## Appearance banner (runtime DB setting, not Helm) + +The classification banner is a runtime database setting, not part of Helm. It is +applied idempotently by `scripts/set-appearance.sh` and shows green +`UNCLASSIFIED - USGOVCLOUD` (`#007a33`). The `appearance` feature is +premium-gated. Source: `STATUS.md:107-116`. + +Verified live (`GET /api/v2/appearance`): `service_banner` and +`announcement_banners[0]` are both `enabled=true`, message +`UNCLASSIFIED - USGOVCLOUD`, background color `#007a33`. + +## Secrets consumed (names and keys only) + +| Secret | Keys | Used by | +|---|---|---| +| `coder-db` | `url` | `CODER_PG_CONNECTION_URL` | +| `coder-oidc` | `client-secret` | `CODER_OIDC_CLIENT_SECRET` | +| `coder-ai` | `ANTHROPIC_API_KEY` | `CODER_AI_GATEWAY_PROVIDER_0_KEY` (seed only) | +| `coder-external-auth` | `gitlab-client-id`, `gitlab-client-secret` | `CODER_EXTERNAL_AUTH_0_CLIENT_ID/SECRET` | + +Source: `deploy/coder/secrets.example.yaml` (all values are `REPLACE_ME` +placeholders in the repo; real values are created out-of-band by the platform +layer). Secret values were not read. + +## Notes and known gaps + +- OIDC group/role sync is not configured (`oidc.group_field` empty live); a + documented future-work item. +- The `anthropic` provider is currently seeded with a placeholder key; making AI + respond requires pasting a real key at `/ai/settings`. See `60-ai-gateway.md`. diff --git a/docs/as-built/40-identity-keycloak.md b/docs/as-built/40-identity-keycloak.md new file mode 100644 index 0000000..ffc9337 --- /dev/null +++ b/docs/as-built/40-identity-keycloak.md @@ -0,0 +1,231 @@ +# 40. Identity: Keycloak SSO (as-built) + +As-built, read-only documentation of the Keycloak identity layer for the +GovCloud Coder demo. Every nontrivial claim below is grounded in a repo file +path or a live read-only command. Items that could not be verified from a repo +file or a permitted GET are marked "unverified". + +- Keycloak URL: `https://auth.usgov.coderdemo.io` (realm `coder`). +- Coder URL it serves SSO for: `https://dev.usgov.coderdemo.io`. +- Namespace: `keycloak`. Source: `deploy/keycloak/`. + +## Deployment + +Keycloak runs as a single-replica `Deployment` in namespace `keycloak` +(`deploy/keycloak/deployment.yaml`). + +| Aspect | Value | Source | +|---|---|---| +| Image | `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/quay/keycloak/keycloak:26.6.3` (ECR mirror of `quay.io/keycloak/keycloak:26.6.3`) | `deploy/keycloak/deployment.yaml` | +| Replicas | 1, `strategy.type: Recreate`, `KC_CACHE=local` (no clustering; HA out of scope) | `deploy/keycloak/deployment.yaml` | +| Start command | `start --import-realm` (not `--optimized`; stock image is not pre-built for postgres, so plain `start` runs the build step on first boot) | `deploy/keycloak/deployment.yaml`, `deploy/keycloak/README.md` | +| Database | RDS PostgreSQL, logical db `keycloak`, `KC_DB=postgres`, `KC_DB_URL=jdbc:postgresql://:5432/keycloak`; credentials from Secret `keycloak-db` (keys `username`/`password`) | `deploy/keycloak/deployment.yaml` | +| Hostname / proxy | `KC_HOSTNAME=https://auth.usgov.coderdemo.io`, `KC_PROXY_HEADERS=xforwarded`, `KC_HTTP_ENABLED=true`. Full https URL pins scheme/host because TLS terminates upstream at the L4 NLB | `deploy/keycloak/deployment.yaml`, `deploy/keycloak/README.md` | +| Bootstrap admin | `KC_BOOTSTRAP_ADMIN_USERNAME`/`KC_BOOTSTRAP_ADMIN_PASSWORD` from Secret `keycloak-admin` (first boot only) | `deploy/keycloak/deployment.yaml` | +| Health/metrics | `KC_HEALTH_ENABLED`/`KC_METRICS_ENABLED=true` on management port `9000`; startup/liveness/readiness probes hit `/health/started`, `/health/live`, `/health/ready` | `deploy/keycloak/deployment.yaml` | + +Network path (`deploy/keycloak/ingress.yaml`, `service.yaml`): + +``` +client --HTTPS--> NLB (TLS terminated, ACM cert) --HTTP--> ingress-nginx --HTTP--> Service keycloak:8080 --> pod :8080 +``` + +- `Service` is `ClusterIP` exposing only HTTP `8080`; the management port `9000` + is intentionally not exposed through the Service (`deploy/keycloak/service.yaml`). +- `Ingress` is `ingressClassName: nginx`, host `auth.usgov.coderdemo.io`, with + `ssl-redirect: "false"` (backend is plain HTTP, avoids a redirect loop) and a + larger `proxy-buffer-size` for Keycloak's auth cookies + (`deploy/keycloak/ingress.yaml`). +- The realm JSON is mounted from a ConfigMap (`keycloak-realm-coder`) generated + from `realm-coder.json` by `deploy/keycloak/kustomization.yaml` (with + `disableNameSuffixHash: true`). + +Secrets are provisioned out of band (not committed). `secrets.example.yaml` +documents the expected keys for `keycloak-db` and `keycloak-admin` +(`deploy/keycloak/secrets.example.yaml`). + +### Live verification + +``` +kubectl -n keycloak get deploy,svc,ingress + deployment.apps/keycloak 1/1 + service/keycloak ClusterIP 8080/TCP + ingress/keycloak nginx auth.usgov.coderdemo.io -> NLB + +curl -sS https://auth.usgov.coderdemo.io/realms/coder/.well-known/openid-configuration + issuer = https://auth.usgov.coderdemo.io/realms/coder (HTTP 200) +``` + +The discovery document confirms the realm is live and the issuer matches the +value Coder is configured with. + +## Realm `coder` (`realm-coder.json`) + +The realm is imported from `deploy/keycloak/realm-coder.json`. Import is +idempotent: if realm `coder` already exists it is skipped +(`deploy/keycloak/README.md`). + +Realm-level settings (`realm-coder.json`): + +- `enabled: true`, `displayName: "Coder (GovCloud Demo)"`, + `sslRequired: "external"`. +- `registrationAllowed: false`, `loginWithEmailAllowed: true`, + `resetPasswordAllowed: true`, `editUsernameAllowed: false`. +- Token settings: `accessTokenLifespan: 300` (5 min), + `ssoSessionIdleTimeout: 1800` (30 min idle), + `ssoSessionMaxLifespan: 36000` (10 h max), `offlineSessionIdleTimeout: 2592000`. + +OIDC client `coder` (`realm-coder.json`): + +| Field | Value | +|---|---| +| `clientId` | `coder` | +| Type | Confidential (`publicClient: false`, `clientAuthenticatorType: client-secret`) | +| Flows | `standardFlowEnabled: true`; implicit, direct-access-grants, and service-accounts all disabled | +| `secret` | `REPLACE_WITH_CODER_OIDC_CLIENT_SECRET` placeholder in the committed JSON; the real value must equal what Coder reads from Secret `coder-oidc` | +| `redirectUris` | `https://dev.usgov.coderdemo.io/api/v2/users/oidc/callback` and `https://dev.usgov.coderdemo.io/*` | +| `webOrigins` | `+` | +| `defaultClientScopes` | `web-origins`, `profile`, `roles`, `email` | +| `optionalClientScopes` | `offline_access` | +| `post.logout.redirect.uris` | `https://dev.usgov.coderdemo.io/*` | + +User defined in the realm (`realm-coder.json`): + +- `demo` / `demo@usgov.coderdemo.io`, `enabled: true`, `emailVerified: true`, + realm role `default-roles-coder`. Password is a placeholder + (`REPLACE_WITH_DEMO_USER_PASSWORD`) in the committed JSON and is set out of + band. + +The committed realm JSON defines only the realm flags, the single `coder` +client (with its default/optional client scopes), and the one `demo` user. It +declares no realm groups, no `defaultGroups`, and no custom protocol mappers or +client scopes beyond the Keycloak built-ins. Verified by grepping +`realm-coder.json`: the only matches for group/scope keys are the client's +`defaultClientScopes` and `optionalClientScopes` arrays; there is no `groups` +array, no `defaultGroups`, and no `protocolMappers`. + +## How Coder OIDC SSO is wired to Keycloak + +Configured in the Coder Helm values (`deploy/coder/values.yaml`, env block): + +| Coder env var | Value | Notes | +|---|---|---| +| `CODER_OIDC_ISSUER_URL` | `https://auth.usgov.coderdemo.io/realms/coder` | matches the realm issuer | +| `CODER_OIDC_CLIENT_ID` | `coder` | matches the realm client | +| `CODER_OIDC_CLIENT_SECRET` | from Secret `coder-oidc`, key `client-secret` | must match the realm client `secret` | +| `CODER_OIDC_SCOPES` | `openid,profile,email` | | +| `CODER_OIDC_EMAIL_FIELD` | `email` | | +| `CODER_OIDC_USERNAME_FIELD` | `preferred_username` | | +| `CODER_OIDC_ALLOW_SIGNUPS` | `true` | SSO users self-provision on first login | +| `CODER_OIDC_SIGN_IN_TEXT` | `Sign in with Keycloak` | login-button label | + +GitHub's built-in default login provider is disabled +(`CODER_OAUTH2_GITHUB_DEFAULT_PROVIDER_ENABLE=false`), so the dashboard login +options are the local password owner plus "Sign in with Keycloak" +(`deploy/coder/values.yaml`, and `STATUS.md` "Auth boundary hardening"). + +### Login UX + +- The Coder login screen shows a "Sign in with Keycloak" button + (`CODER_OIDC_SIGN_IN_TEXT`). Clicking it runs the standard OIDC + authorization-code flow against the `coder` realm client. +- Because `CODER_OIDC_ALLOW_SIGNUPS=true`, a Keycloak user who logs in for the + first time is auto-provisioned a Coder account; username is taken from + `preferred_username` and email from `email`. + +### Live verification (Coder's view of OIDC) + +Logged into `https://dev.usgov.coderdemo.io` (the demo Coder, explicitly, not +the ambient `$CODER_URL`) with the admin credentials from +`generated-secrets.env`, then `GET /api/v2/deployment/config`. The `oidc` block +reports: + +``` +issuer_url = https://auth.usgov.coderdemo.io/realms/coder +client_id = coder +scopes = ["openid","profile","email"] +email_field = email +username_field = preferred_username +allow_signups = true +sign_in_text = Sign in with Keycloak +``` + +This matches `deploy/coder/values.yaml` exactly. + +## Configured vs NOT configured + +### Configured and working + +- OIDC SSO end to end: realm `coder`, confidential client `coder`, issuer and + client id matching on both sides, standard authorization-code flow. +- Identity claim mapping: email from `email`, username from + `preferred_username`. +- Self-service signup on first SSO login (`allow_signups: true`). +- Boundary hardening: GitHub default login disabled, so no github.com login + egress. + +### NOT configured (known gap): IdP group sync and role mapping + +There is no Keycloak-to-Coder group sync or role mapping. This is a deliberate, +documented gap (see also `STATUS.md` "Out of scope: full identity sync" and the +facts sheet). Evidence from the live `GET /api/v2/deployment/config` `oidc` +block on the demo Coder: + +``` +groups_field = "" (no claim is read for group membership) +group_mapping = {} (no OIDC-group -> Coder-group mapping) +group_auto_create = false (Coder will not create groups from claims) +user_role_field = "" (no claim is read for site roles) +user_role_mapping = {} (no OIDC-claim -> Coder-role mapping) +group_regex_filter = ".*" (default; inert because groups_field is empty) +group_allow_list = null (default) +``` + +On the Keycloak side, the realm `coder` has no groups and no group-claim mapper: +`realm-coder.json` defines no `groups`/`defaultGroups` and no +`protocolMappers`, so even if Coder read a `groups` field there is currently no +`groups` claim emitted in the token. + +Net effect: all SSO users land as ordinary members of the default Coder +organization. Group membership and site roles are managed manually inside +Coder, not driven by the IdP. + +### What enabling group sync would require (future work, not implemented) + +Documentation only. Do not implement as part of this as-built pass. To wire +Keycloak group sync into Coder you would need all of: + +1. Keycloak: create the groups in realm `coder` (and assign users), then add a + "Group Membership" protocol mapper (on a client scope or the `coder` client) + that emits a `groups` claim in the token. Decide whether the claim is full + group paths or names. +2. Coder: set `CODER_OIDC_GROUP_FIELD` (the deployment-config key surfaces as + `groups_field`) to the claim name, for example `groups`. Optionally set + `CODER_OIDC_GROUP_MAPPING` to translate IdP group names to Coder group IDs, + and `CODER_OIDC_GROUP_AUTO_CREATE=true` if Coder should create missing + groups. `CODER_OIDC_GROUP_REGEX_FILTER` can scope which groups are honored. +3. For site-role sync (separate from groups): add a realm/role mapper that emits + a roles claim, then set `CODER_OIDC_USER_ROLE_FIELD` and + `CODER_OIDC_USER_ROLE_MAPPING` on Coder. + +Note: OIDC-driven group and role sync is a Coder premium/enterprise capability. +This deployment is licensed (premium + AI Governance per `STATUS.md`), so the +gating is configuration effort, not licensing. None of the above is wired today. + +## Sources + +Repo files: + +- `deploy/keycloak/deployment.yaml`, `service.yaml`, `ingress.yaml`, + `kustomization.yaml`, `realm-coder.json`, `secrets.example.yaml`, `README.md` +- `deploy/coder/values.yaml` (OIDC env block) +- `STATUS.md` + +Live read-only commands run (GET only): + +- `kubectl -n keycloak get deploy,svc,ingress` +- `curl -sS https://auth.usgov.coderdemo.io/realms/coder/.well-known/openid-configuration` +- `POST /api/v2/users/login` then `GET /api/v2/deployment/config` against + `https://dev.usgov.coderdemo.io` (admin creds from `generated-secrets.env`; + no secret values reproduced here) +- `grep` over `deploy/keycloak/realm-coder.json` for group/mapper keys diff --git a/docs/as-built/50-gitlab-scm.md b/docs/as-built/50-gitlab-scm.md new file mode 100644 index 0000000..9e8d9f1 --- /dev/null +++ b/docs/as-built/50-gitlab-scm.md @@ -0,0 +1,185 @@ +# 50. In-boundary GitLab SCM (as-built) + +As-built, read-only documentation of the in-boundary GitLab source-control +manager and how it is wired into Coder as a git external-auth provider. Every +nontrivial claim is grounded in a repo file path or a live read-only command. +Items that could not be verified from a repo file or a permitted GET are marked +"unverified". + +- GitLab URL: `https://gitlab.usgov.coderdemo.io` (namespace `gitlab`). +- Purpose: in-boundary SCM for workspaces; git auth stays inside the GovCloud + boundary (no github.com egress). +- Source: `deploy/gitlab/`, `deploy/coder/values.yaml`. + +## Deployment + +GitLab CE runs as a single-container Omnibus image in a one-replica +`StatefulSet` (not the Helm chart), in namespace `gitlab` +(`deploy/gitlab/statefulset.yaml`, `deploy/gitlab/README.md`). + +| Aspect | Value | Source | +|---|---|---| +| Image | `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/gitlab/gitlab-ce:19.0.1-ce.0` (ECR mirror of `docker.io/gitlab/gitlab-ce:19.0.1-ce.0`) | `deploy/gitlab/statefulset.yaml` | +| Workload | `StatefulSet`, `replicas: 1`, `serviceName: gitlab`, `OrderedReady`; chosen over a Deployment because the data volumes are RWO and two pods must never share them | `deploy/gitlab/statefulset.yaml` | +| Database | EMBEDDED PostgreSQL bundled in the Omnibus image (the default), data under `/var/opt/gitlab/postgresql` on the `var-opt-gitlab` PVC. NOT the shared RDS instance | `deploy/gitlab/statefulset.yaml`, `deploy/gitlab/README.md` | +| External URL | `external_url 'https://gitlab.usgov.coderdemo.io'`; bundled NGINX on plain HTTP `:80`, `listen_https=false`, `redirect_http_to_https=false`, forces `X-Forwarded-Proto=https` | `deploy/gitlab/statefulset.yaml` | +| Storage | 3 gp3 PVCs via `volumeClaimTemplates`: `etc-gitlab` 2Gi, `var-opt-gitlab` 20Gi, `var-log-gitlab` 5Gi (cluster-default `gp3`, encrypted, WaitForFirstConsumer) | `deploy/gitlab/statefulset.yaml` | +| Trimmed footprint | registry, pages, KAS, and all bundled exporters/prometheus disabled; `puma.worker_processes=2`, `sidekiq.concurrency=10` | `deploy/gitlab/statefulset.yaml` | +| Root bootstrap | `GITLAB_INITIAL_ROOT_PASSWORD` from Secret `gitlab-secrets` (key `initial_root_password`, `optional: true`); consumed on first boot only, then the password lives in the DB | `deploy/gitlab/statefulset.yaml`, `deploy/gitlab/secrets.example.yaml` | +| First-boot time | First boot runs DB migrations and asset load; the `startupProbe` allows roughly 15 minutes (`initialDelaySeconds: 60`, `periodSeconds: 15`, `failureThreshold: 60`) before liveness takes over. README notes ~15 to 20 min; do not mistake a slow first boot for failure | `deploy/gitlab/statefulset.yaml`, `deploy/gitlab/README.md`, `STATUS.md` | + +Network path (`deploy/gitlab/ingress.yaml`, `service.yaml`): + +``` +client --HTTPS--> NLB (TLS terminated, ACM cert) --HTTP--> ingress-nginx --HTTP--> Service gitlab:80 --> pod gitlab-0 (bundled NGINX :80 -> Workhorse/Puma) +``` + +- `Service` is `ClusterIP` on port `80` and is also the StatefulSet governing + service (`deploy/gitlab/service.yaml`). +- `Ingress` is `ingressClassName: nginx`, host `gitlab.usgov.coderdemo.io`, with + `ssl-redirect`/`force-ssl-redirect` false, `proxy-body-size: "0"` (large git + pushes/LFS), and 3600s read/send timeouts (`deploy/gitlab/ingress.yaml`). +- Git over SSH (port 22) is not exposed; only HTTPS 443 is fronted by the NLB + (`deploy/gitlab/README.md`, open questions). Clone/push over HTTPS is the + supported path. + +Why embedded Postgres rather than RDS (`deploy/gitlab/README.md`): fewest moving +parts for a single-container demo, no dependency on an orchestrator-created +`gitlabhq_production` db/role/Secret, decoupled blast radius from RDS health, +and the bundled engine always meets GitLab 19's PostgreSQL 17+ requirement. The +tradeoff is no managed backups/Multi-AZ. A shared-RDS alternative is sketched in +the README but is not enabled. + +### Live verification + +``` +kubectl -n gitlab get statefulset,svc,ingress + statefulset.apps/gitlab 1/1 + service/gitlab ClusterIP 80/TCP + ingress/gitlab nginx gitlab.usgov.coderdemo.io -> NLB + +curl -sS -o /dev/null -w '%{http_code}' https://gitlab.usgov.coderdemo.io/oauth/authorize + 302 (OAuth authorize endpoint live; redirects to login when called without params) +``` + +## The instance-wide OAuth app "Coder" + +To let Coder authenticate git operations, an instance-wide OAuth application +named "Coder" was minted in GitLab (`deploy/coder/values.yaml` comment; +`STATUS.md` "Auth boundary hardening"). Its parameters: + +| Property | Value | Source | +|---|---|---| +| Application name | `Coder` | `deploy/coder/values.yaml` comment, `STATUS.md` | +| Redirect URI | `https://dev.usgov.coderdemo.io/external-auth/gitlab/callback` | Coder external-auth callback shape `/external-auth//callback`, with access_url `https://dev.usgov.coderdemo.io` and id `gitlab` | +| Scopes | `read_user read_repository write_repository` | `deploy/coder/values.yaml` (`CODER_EXTERNAL_AUTH_0_SCOPES`); verified live in Coder config | +| Scope | Instance-wide (admin-owned application, not a user/group app) | `deploy/coder/values.yaml` comment, `STATUS.md` | +| `organization_id` | `1` (the default GitLab organization). Recent GitLab associates OAuth applications with an organization; an instance-wide application is scoped to the default org id `1`. See note below | build context / facts sheet | + +Note on `organization_id=1`: this detail comes from the build context (facts +sheet / lead) and is consistent with how recent GitLab scopes instance-wide +applications to the default organization. It was not independently re-verified +in this read-only pass, because confirming it requires an authenticated admin +API call to GitLab (a login POST and a token), which is outside the GET-only +constraint of this documentation task. Treat the specific value as unverified +here. + +The application's client id and secret are recorded in the gitignored +`~/.config/usgov-coderdemo/generated-secrets.env` as `GITLAB_CODER_OAUTH_APP_ID` +and `GITLAB_CODER_OAUTH_SECRET` (key names confirmed; secret values are not +reproduced in this doc). + +## How it maps to Coder external auth + +Coder consumes the GitLab OAuth app as external-auth provider index 0 +(`deploy/coder/values.yaml`, env block): + +| Coder env var | Value | +|---|---| +| `CODER_EXTERNAL_AUTH_0_ID` | `gitlab` | +| `CODER_EXTERNAL_AUTH_0_TYPE` | `gitlab` | +| `CODER_EXTERNAL_AUTH_0_DISPLAY_NAME` | `GitLab` | +| `CODER_EXTERNAL_AUTH_0_CLIENT_ID` | from Secret `coder-external-auth`, key `gitlab-client-id` | +| `CODER_EXTERNAL_AUTH_0_CLIENT_SECRET` | from Secret `coder-external-auth`, key `gitlab-client-secret` | +| `CODER_EXTERNAL_AUTH_0_AUTH_URL` | `https://gitlab.usgov.coderdemo.io/oauth/authorize` | +| `CODER_EXTERNAL_AUTH_0_TOKEN_URL` | `https://gitlab.usgov.coderdemo.io/oauth/token` | +| `CODER_EXTERNAL_AUTH_0_VALIDATE_URL` | `https://gitlab.usgov.coderdemo.io/oauth/token/info` | +| `CODER_EXTERNAL_AUTH_0_REGEX` | `gitlab\.usgov\.coderdemo\.io` | +| `CODER_EXTERNAL_AUTH_0_SCOPES` | `read_user read_repository write_repository` | + +The OAuth app client id and secret are supplied to Coder via the k8s Secret +`coder-external-auth` (keys `gitlab-client-id` and `gitlab-client-secret`); +these correspond to `GITLAB_CODER_OAUTH_APP_ID` and `GITLAB_CODER_OAUTH_SECRET` +in `generated-secrets.env`. + +A self-managed GitLab requires the explicit auth/token/validate URLs above +(`deploy/coder/values.yaml` comment). Configuring an explicit external-auth +provider also suppresses Coder's built-in github.com default external-auth +provider, so no auth path leaves the GovCloud boundary (`STATUS.md`). + +### Live verification (Coder's view of external auth) + +From `GET /api/v2/deployment/config` against `https://dev.usgov.coderdemo.io`, +the `external_auth` entry reports: + +``` +id = gitlab +type = gitlab +display_name = GitLab +auth_url = https://gitlab.usgov.coderdemo.io/oauth/authorize +token_url = https://gitlab.usgov.coderdemo.io/oauth/token +validate_url = https://gitlab.usgov.coderdemo.io/oauth/token/info +regex = gitlab\.usgov\.coderdemo\.io +scopes = [read_user, read_repository, write_repository] +``` + +This matches `deploy/coder/values.yaml` exactly. + +## Every workspace template requires GitLab login + +The `claude-code` template declares the GitLab external-auth data source, which +makes a GitLab login mandatory before a workspace agent reports ready +(`coder-templates/claude-code/main.tf`): + +```hcl +data "coder_external_auth" "gitlab" { + id = "gitlab" # MUST match CODER_EXTERNAL_AUTH_0_ID on the Coder server +} +``` + +Per the template comment and `STATUS.md`: declaring this data source surfaces a +"Login with GitLab" control on the dashboard; the agent only reports auth as +satisfied once the owner completes the in-boundary GitLab OAuth flow. The Coder +agent's git credential helper then injects a short-lived OAuth token for any +clone/fetch/push to `gitlab.usgov.coderdemo.io`, so no PATs or SSH keys live in +the workspace. `STATUS.md` records this as verified: the active template +version's `/external-auth` lists `gitlab` as required. + +## Notes and out of scope + +- GitLab to Keycloak SSO (OIDC) is optional and NOT enabled. `deploy/gitlab/README.md` + includes an `openid_connect` omniauth sketch, but the as-built login is root + plus local GitLab users. +- Git over SSH is not wired (NLB terminates 443 only). HTTPS clone/push is the + supported path. +- Backups: with embedded Postgres there is no managed backup; durability relies + on the EBS PVC plus GitLab's own backup tooling (`deploy/gitlab/README.md`). + +## Sources + +Repo files: + +- `deploy/gitlab/statefulset.yaml`, `service.yaml`, `ingress.yaml`, + `secrets.example.yaml`, `README.md` +- `deploy/coder/values.yaml` (external-auth env block) +- `coder-templates/claude-code/main.tf` (the `coder_external_auth` data source) +- `STATUS.md` + +Live read-only commands run (GET only): + +- `kubectl -n gitlab get statefulset,svc,ingress` +- `curl -sS -o /dev/null -w '%{http_code}' https://gitlab.usgov.coderdemo.io/oauth/authorize` +- `POST /api/v2/users/login` then `GET /api/v2/deployment/config` against + `https://dev.usgov.coderdemo.io` (admin creds from `generated-secrets.env`; + no secret values reproduced here) +- Secret key-name listing from `generated-secrets.env` (names only; no values) diff --git a/docs/as-built/60-ai-gateway.md b/docs/as-built/60-ai-gateway.md new file mode 100644 index 0000000..832b803 --- /dev/null +++ b/docs/as-built/60-ai-gateway.md @@ -0,0 +1,214 @@ +# 60. AI Gateway / AI Bridge (as-built) + +How the Coder AI Gateway (formerly "AI Bridge") is wired for the GovCloud demo: +two providers, routing by provider name, the Bedrock IRSA credential chain, and +the verified end-to-end routing proof. The product is the AI Gateway; the API +paths still use `/api/v2/aibridge/...` (`deploy/coder/README.md:80-82`). + +## Verification method + +Read-only. Session token obtained via `POST /api/v2/users/login`, then `GET` +requests against `https://dev.usgov.coderdemo.io` with the +`Coder-Session-Token` header. AWS facts came from read-only `aws iam get-role` +and `aws iam get-role-policy`. The routing probe is a `POST` to the gateway, so +it was not re-executed here (read-only, GET-only); the 502 proof below is cited +from `STATUS.md` and the facts sheet, and the live providers/config it depends +on were independently re-verified. + +## Enabled by default in v2.34 + +AI Gateway is enabled by default in v2.34 and is set explicitly in Helm +(`CODER_AI_GATEWAY_ENABLED=true`, `deploy/coder/values.yaml:159-166`). It +requires the AI Governance Add-On entitlement. + +Verified live: + +- `GET /api/v2/deployment/config` reports `ai.bridge.enabled=true` and + `chat.ai_gateway_routing_enabled=true`. +- `GET /api/v2/entitlements` reports `aibridge` entitled and enabled, and + `ai_governance_user_limit` entitled and enabled (limit 30, actual 1). + +## Providers are database-managed (seed-once) + +Since v2.34, AI Gateway providers live in the database and are managed at +`https://dev.usgov.coderdemo.io/ai/settings` or the AI Providers API. The +`CODER_AI_GATEWAY_PROVIDER_*` env vars are deprecated and only seed the DB on +first startup; afterward the database is authoritative. Changing a seeded env +var later makes `coderd` fail to start (the drift guard). Source: +`deploy/coder/README.md:123-140`, `deploy/coder/values.yaml:13-16, 159-164`. +See `30-coder-control-plane.md` for the Helm-side seed/drift detail. + +Verified live: `GET /api/v2/ai/providers` returns two providers (below). These +match the seeded `ai.bridge.providers` array in `deployment/config`. + +### Provider `anthropic` (direct, primary) + +```json +{ + "type": "anthropic", + "name": "anthropic", + "display_name": "Anthropic (direct)", + "enabled": true, + "base_url": "https://api.anthropic.com", + "api_keys": [{ "masked": "sk-a...ings", ... }], + "settings": null +} +``` + +Direct provider; egress to `api.anthropic.com` leaves the VPC via the NAT +gateway (`deploy/coder/values.yaml:168-186`, `deploy/CONVENTIONS.md:76-78`). +The key is seeded from Secret `coder-ai` key `ANTHROPIC_API_KEY` on first boot +and is then managed in the DB. + +The live masked key `sk-a...ings` is consistent with the placeholder +`sk-ant-REPLACE_ME_set_real_key_via_ai_settings` (it ends in `ings`). No real +Anthropic key exists anywhere in this environment. + +Remaining user action: sign in as the owner, open Admin settings > AI > +Providers (`/ai/settings`), edit the provider named `anthropic`, and paste the +real `sk-ant-...` key. Do this in the UI, not by editing the `coder-ai` k8s +secret, because the provider config now lives in the database. Source: +`STATUS.md:61-74`, facts sheet "Remaining action". + +### Provider `anthropic-bedrock` (Bedrock via IRSA, secondary) + +```json +{ + "type": "bedrock", + "name": "anthropic-bedrock", + "display_name": "anthropic-bedrock", + "enabled": true, + "base_url": "", + "api_keys": [], + "settings": { + "_type": "bedrock", + "model": "us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0", + "region": "us-gov-west-1", + "small_fast_model": "amazon.nova-pro-v1:0" + } +} +``` + +In-boundary provider with no static key (`api_keys` is empty); it authenticates +through IRSA. The primary model is the GovCloud Claude Sonnet 4.5 inference +profile; the small fast model (the Haiku-class background model Claude Code +uses) is `amazon.nova-pro-v1:0`. Source: `deploy/coder/values.yaml:188-205`, +verified live via `GET /api/v2/ai/providers`. + +Claude Sonnet 4.5 access on Bedrock is still gated; it needs an Anthropic +agreement via the account paired with GovCloud. `amazon.nova-pro-v1:0` is the +proven fallback that invokes in GovCloud today. Source: `STATUS.md:29-31`, +`deploy/coder/README.md:165-169`. + +## Routing path and why the provider must be named `anthropic` + +The gateway routes by provider **name**: + +``` +POST /api/v2/aibridge//v1/messages +``` + +The provider must be named `anthropic` because the claude-code workspace module +(4.7.3) hardcodes `ANTHROPIC_BASE_URL=/api/v2/aibridge/anthropic`. +With `CODER_ACCESS_URL=https://dev.usgov.coderdemo.io` that resolves to +`https://dev.usgov.coderdemo.io/api/v2/aibridge/anthropic`. A name like +`anthropic-direct` would make that route 404, so Claude Code could not reach the +provider. Source: `deploy/coder/values.yaml:171-179`, +`deploy/coder/README.md:119-121`, `coder-templates/claude-code/main.tf:22-26`. + +This is why the Anthropic-direct provider is named exactly `anthropic` and the +Bedrock provider is named `anthropic-bedrock`. To route Claude Code to Bedrock, +you either rename the Bedrock provider to `anthropic` or set the workspace model +to a Bedrock id (`STATUS.md:76-79`). + +## End-to-end request flow + +A request from a workspace's Claude Code to the upstream model: + +1. Claude Code in the workspace pod reads `ANTHROPIC_BASE_URL` + (`/api/v2/aibridge/anthropic`) and a bearer token. The token is + the workspace owner's Coder session token (`CLAUDE_API_KEY` set by the + module, plus `ANTHROPIC_AUTH_TOKEN` exported by the template), not a raw + Anthropic key. Source: `coder-templates/claude-code/main.tf:22-28, 269-282`. +2. The request hits `POST /api/v2/aibridge/anthropic/v1/messages` on the Coder + server. +3. The AI Gateway authenticates the session token, applies governance and + audit, then looks up the provider whose name matches the path segment + (`anthropic`). +4. The gateway forwards to that provider's upstream: + - `anthropic` (direct): `https://api.anthropic.com`, egress via the NAT + gateway. + - `anthropic-bedrock`: AWS Bedrock in `us-gov-west-1` using IRSA credentials, + in-region only. +5. The upstream response streams back through the gateway to Claude Code. + +No Anthropic key is stored in the workspace; the session token is the only +credential and it is scoped to the workspace owner. Source: +`coder-templates/claude-code/README.md:31-50`. + +## Bedrock IRSA credential chain (verified live) + +The Bedrock provider attaches no static key, so the AWS SDK default credential +chain resolves the IRSA web-identity token from the annotated service account. +The chain, verified read-only: + +1. **ServiceAccount annotation.** SA `coder/coder` is annotated + `eks.amazonaws.com/role-arn = arn:aws-us-gov:iam::430737322961:role/usgov-coderdemo-coder-bedrock`. + Verified: `kubectl -n coder get sa coder -o jsonpath`. Declared at + `deploy/coder/values.yaml:32-37`. +2. **STS AssumeRoleWithWebIdentity.** The role trust policy allows + `sts:AssumeRoleWithWebIdentity` from the cluster OIDC provider + (`oidc-provider/oidc.eks.us-gov-west-1.amazonaws.com/id/E9DB9E591C95ECB91F44EDCF38F146F2`), + conditioned on `aud = sts.amazonaws.com` and + `sub = system:serviceaccount:coder:coder`. The SDK uses the GovCloud regional + STS endpoint because `AWS_REGION=us-gov-west-1` and + `AWS_STS_REGIONAL_ENDPOINTS=regional` are set + (`deploy/coder/values.yaml:207-215`). Verified: + `aws iam get-role --role-name usgov-coderdemo-coder-bedrock`. +3. **bedrock:InvokeModel.** The inline policy `bedrock-invoke` grants + `bedrock:InvokeModel` and `bedrock:InvokeModelWithResponseStream` on an + allowlisted resource set: + - `foundation-model/anthropic.claude-sonnet-4-5-20250929-v1:0` + (us-gov-west-1 and us-gov-east-1) + - `foundation-model/amazon.nova-pro-v1:0` (us-gov-west-1) + - `inference-profile/us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0` + (us-gov-west-1, account `430737322961`) + + Verified: `aws iam get-role-policy --role-name usgov-coderdemo-coder-bedrock + --policy-name bedrock-invoke`. This matches the expectation in the facts + sheet and `deploy/coder/README.md:114-118`. + +## Verified 502 routing proof and what 200 requires + +Per `STATUS.md:56-57` and the facts sheet, the routing path was verified end to +end: `POST /api/v2/aibridge/anthropic/v1/messages` reaches `api.anthropic.com` +and returns **502 "all configured keys failed authentication"** with the +placeholder key. The 502 (an upstream auth rejection, not a 404) proves the full +path works: client to gateway to upstream Anthropic. The route only resolves +because the provider is named `anthropic`. + +This routing probe is a `POST`, so it was not re-run under this read-only task. +The supporting state was independently re-verified live: the `anthropic` +provider exists, is enabled, points at `https://api.anthropic.com`, and still +carries the placeholder key (masked `sk-a...ings`). + +A `200` requires a working upstream credential: + +- Anthropic-direct: paste a real `sk-ant-...` key into the `anthropic` provider + at `/ai/settings`, then re-run the routing check. Source: `STATUS.md:63-74`. +- Bedrock (in-boundary alternative): enable Claude Sonnet 4.5 model access in + the GovCloud console, then route Claude Code at the `anthropic-bedrock` + provider (rename it to `anthropic` or set the workspace model). Bedrock access + is still gated; Nova Pro is the proven fallback. Source: `STATUS.md:76-79`. + +## Known issues + +- **Bedrock model access gated.** `InvokeModel` on + `us-gov.anthropic.claude-sonnet-4-5-...` returns AccessDenied until model + access is enabled; the provider is wired but may be disabled at demo time. + Source: `deploy/coder/README.md:165-169`, `STATUS.md:29-31`. +- **Claude Code Bedrock beta header in GovCloud.** Known issue + `coder/aibridge#221`: Claude Code sends an `anthropic-beta` flag that GovCloud + Bedrock rejects (`invalid beta flag`), which can break the + Bedrock-through-gateway path for Claude Code specifically. Anthropic-direct is + unaffected. Source: `deploy/coder/README.md:170-174`. diff --git a/docs/as-built/70-workspace-templates.md b/docs/as-built/70-workspace-templates.md new file mode 100644 index 0000000..6198954 --- /dev/null +++ b/docs/as-built/70-workspace-templates.md @@ -0,0 +1,201 @@ +# 70. Workspace template: `claude-code` (as-built) + +The single workspace template `claude-code` +(`coder-templates/claude-code/main.tf`) runs Claude Code as a Coder Agent in a +Kubernetes pod, wired through the AI Gateway, and now requires in-boundary +GitLab login. This documents the pod, the modules, Coder Tasks, parameters, and +the GitLab external-auth requirement. + +## Verification method + +Read-only. Session token via `POST /api/v2/users/login`, then `GET` against +`https://dev.usgov.coderdemo.io`. Template facts come from +`coder-templates/claude-code/main.tf` and its `README.md`, cross-checked against +the live active template version. + +Verified live: + +- `GET /api/v2/organizations/5de29a6d-8836-4643-a42b-2cb807c8e3e2/templates`: + one template, `claude-code`, active version `3c0614b5-...`. +- That version's provisioner job status is `succeeded`. + +## Pod, PVC, image, and security context + +The template provisions one Kubernetes pod and one PVC in namespace +`coder-workspaces` (the `namespace` variable defaults to `coder-workspaces`, +`coder-templates/claude-code/main.tf:60-64`). + +- **Pod.** `kubernetes_pod_v1.workspace`, created only when the workspace is + started (`count = data.coder_workspace.me.start_count`). Labeled + `app.kubernetes.io/name=coder-workspace`. Source: `main.tf:370-381`. +- **PVC.** `kubernetes_persistent_volume_claim_v1.home`, `ReadWriteOnce`, size + `${disk_size}Gi`, mounted at `/home/coder`. The cluster default StorageClass + is `gp3` (encrypted, `WaitForFirstConsumer`), so the home volume lands on gp3. + `wait_until_bound = false`. Source: `main.tf:345-368, 427-439`; StorageClass + per the facts sheet. +- **Image.** `var.workspace_image` defaults to the ECR-mirrored + `430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/codercom/enterprise-base:ubuntu-noble-20260601`. + `enterprise-base` runs as user `coder` (uid 1000) and ships git/curl/sudo; + Claude Code and AgentAPI install as standalone binaries into + `$HOME/.local/bin`, so no Node.js is needed in the base image. + `image_pull_policy = IfNotPresent`. Source: `main.tf:66-81, 390-394`. +- **Security context.** Pod-level `run_as_user=1000`, `fs_group=1000`; + container-level `run_as_user=1000`, `allow_privilege_escalation=true`. + Privilege escalation must stay enabled because the claude-code/agentapi module + installs the `agentapi` binary to `/usr/local/bin` via passwordless sudo; + disabling it sets `no_new_privs` and breaks that install and the Coder Tasks + chat UI it powers. Source: `main.tf:383-404`. +- **Resources.** Requests `cpu=500m` and `memory=max(2, floor(memory/2))Gi`; + limits `cpu=${cpu}` and `memory=${memory}Gi`. Source: `main.tf:416-425`. +- **Scheduling and stability.** Soft pod anti-affinity by hostname; both the pod + and the PVC use `lifecycle { ignore_changes = all }` so a running pod survives + template re-applies and prebuild claims (the agent token is baked into + `init_script`). Source: `main.tf:441-465`. + +The agent container receives `CODER_AGENT_TOKEN` and `CODER_AGENT_URL` (the +access URL) as env (`main.tf:406-414`). + +## Agent + +`coder_agent.main` (`main.tf:211-267`): a small startup script that only +normalizes `PATH` (adds `$HOME/.local/bin`) and signals readiness, because the +claude-code module's own `coder_script` installs Claude Code and AgentAPI as +native binaries. Agent env sets `EDITOR`/`VISUAL=code` and +`CODER_AGENT_DEVCONTAINERS_ENABLE=false` (no docker socket in the pod, so +devcontainer auto-detection is disabled to avoid the dashboard hanging on +`docker ps`). Metadata reports CPU, memory, and disk usage. `display_apps` +enables VS Code Desktop, web terminal, SSH helper, and port-forwarding helper. + +## Claude Code module 4.7.3 (and why not 5.x) + +```hcl +module "claude_code" { + source = "registry.coder.com/coder/claude-code/coder" + version = "4.7.3" + agent_id = coder_agent.main.id + workdir = "/home/coder" + enable_aibridge = true + ai_prompt = local.effective_prompt + report_tasks = true + subdomain = true +} +``` + +The module is pinned to **4.7.3** (`deploy/CONVENTIONS.md:39-45`). In 4.7.x the +AI Gateway input is `enable_aibridge` (not `enable_ai_gateway`). With +`enable_aibridge = true` the module sets, on the agent, +`ANTHROPIC_BASE_URL=/api/v2/aibridge/anthropic` and +`CLAUDE_API_KEY=`. Source: `main.tf:14-32, +288-320`. + +Why not 5.x: the `enable_ai_gateway` rename landed in the 5.x line, which also +removed the bundled AgentAPI integration and the `task_app_id` output that +`coder_ai_task` depends on. Staying on 4.7.3 is what makes the Coder Tasks +wiring below possible. If the project later moves to 5.x, switch to +`enable_ai_gateway`, drop the explicit `coder_env.anthropic_auth_token`, and add +a standalone `agentapi` module to supply `task_app_id`. Source: +`coder-templates/claude-code/README.md:65-80`. + +Model selection is left at the module default on purpose: the requested model +name must match whichever provider the gateway has live (an Anthropic id for +direct, the GovCloud inference profile for Bedrock). Source: `main.tf:312-320`. + +### AI Gateway client auth in the template + +The module already sets `ANTHROPIC_BASE_URL` and `CLAUDE_API_KEY`. The template +additionally exports `ANTHROPIC_AUTH_TOKEN` (the same session token) via +`coder_env.anthropic_auth_token` to match the AI Gateway client contract in +`deploy/CONVENTIONS.md`. Both carry the same session token, so no raw Anthropic +key is ever placed in the workspace. Source: `main.tf:269-282`, +`deploy/CONVENTIONS.md:90-92`. The full routing flow is in `60-ai-gateway.md`. + +## code-server + +`module.code_server` (`registry.coder.com/coder/code-server/coder` **1.3.1**) +adds VS Code in the browser as an extra `coder_app` tile, folder `/home/coder`, +`subdomain = true`. Source: `main.tf:331-339`. Both the Claude Code web app and +code-server use `subdomain = true`, which requires the wildcard access URL +configured on the server (this aligns with path apps being disabled; see +`30-coder-control-plane.md`). + +## Coder Tasks + +Three pieces wire Coder Tasks: + +- `data.coder_task.me` (`main.tf:91-93`): populated when the workspace is created + as a Task. `enabled` is false for a normal build; `prompt` carries the task + prompt. `local.effective_prompt` prefers the Task prompt and falls back to the + `ai_prompt` parameter (`main.tf:198-205`). +- `report_tasks = true` on the module: reports task status to the Coder UI via + AgentAPI (`main.tf:303-306`). +- `coder_ai_task.claude_code` (`main.tf:322-328`): marks the build as a Coder AI + Task and binds the Task UI to the Claude Code AgentAPI app + (`app_id = module.claude_code.task_app_id`). It is created only in a Task + context (`count = data.coder_task.me.enabled ? start_count : 0`), so normal + builds are unaffected. + +The `coder` provider is pinned `>= 2.13.0` because `data.coder_task` and +`coder_ai_task.app_id` first shipped in provider v2.13.0 (`main.tf:34-46`, +`coder-templates/claude-code/README.md:188`). + +## Parameters + +| Parameter | Type | Mutable | Default | Options | +|---|---|---|---|---| +| `cpu` | number | yes | `4` | 2, 4, 8 | +| `memory` (GB) | number | yes | `8` | 4, 8, 16 | +| `disk_size` (GB) | number | **no** | `20` | 10, 20, 50 | +| `ai_prompt` | string | yes | `""` | (free text) | + +`disk_size` is immutable because it sizes the persistent `/home/coder` volume, +which cannot be changed after creation. `ai_prompt` is the fallback seed prompt +for non-Task builds and is ignored when the workspace is launched as a Task. +Source: `main.tf:117-196`. Verified live against the active template version +(`GET /api/v2/templateversions/3c0614b5-.../rich-parameters`): the four +parameters, their types, mutability, defaults, and options all match the table. + +## Required GitLab external auth (new requirement) + +```hcl +data "coder_external_auth" "gitlab" { + id = "gitlab" +} +``` + +The template declares `data "coder_external_auth" "gitlab"` with `id = "gitlab"`, +which must match `CODER_EXTERNAL_AUTH_0_ID` on the server. Declaring this data +source makes every workspace require a GitLab login: the dashboard surfaces a +"Login with GitLab" control, and the agent only reports ready once the owner +completes the in-boundary GitLab OAuth flow. Source: `main.tf:95-111`. + +Verified live: the active template version's external-auth list +(`GET /api/v2/templateversions/3c0614b5-.../external-auth`) contains exactly one +entry, `gitlab` (type `gitlab`, display `GitLab`, authenticate URL +`https://dev.usgov.coderdemo.io/external-auth/gitlab`), confirming GitLab login +is required by this version. + +This satisfies the directive that every workspace template should include +external-auth through GitLab. Source: `STATUS.md:100-105`. + +### How in-workspace git auth works + +After the owner completes the GitLab OAuth flow, the Coder agent's git +credential helper injects a short-lived OAuth token for any clone/fetch/push to +`gitlab.usgov.coderdemo.io`. No PATs and no SSH keys live in the workspace, and +no auth path leaves the GovCloud boundary. The server-side provider (id +`gitlab`, type `gitlab`, the in-cluster GitLab OAuth app, scopes +`read_user read_repository write_repository`) is defined in +`deploy/coder/values.yaml:117-148`; the regex `gitlab\.usgov\.coderdemo\.io` +scopes which remotes the helper authenticates. Source: `main.tf:95-111`, +`deploy/coder/values.yaml:117-148`. The server-side external-auth config and its +boundary rationale are in `30-coder-control-plane.md`. + +## Cluster prerequisites (for reference) + +The platform layer owns these (not this template directory): the +`coder-workspaces` namespace, provisioner RBAC letting the `coder/coder` SA +manage pods/PVCs in `coder-workspaces`, and ECR read on the node IAM role so the +pod image pulls without an imagePullSecret. Source: +`coder-templates/claude-code/README.md:82-141`. The workspace RBAC also exists +in both `coder` and `coder-workspaces` namespaces per the facts sheet +(`deploy/platform/workspace-rbac.yaml`). diff --git a/docs/as-built/80-iac-vs-imperative.md b/docs/as-built/80-iac-vs-imperative.md new file mode 100644 index 0000000..186407a --- /dev/null +++ b/docs/as-built/80-iac-vs-imperative.md @@ -0,0 +1,131 @@ +# As-built: declarative (Terraform) vs imperative (CLI/Helm/kubectl/SQL/API) + +A complete ledger of what is managed by Terraform versus what was done by hand +during the overnight build. Grounded in `terraform/*.tf`, `deploy/*`, +`scripts/*`, `STATUS.md`, and read-only live output captured 2026-06-07. + +Bottom line: the AWS substrate primitives (VPC, RDS, base IAM, IRSA OIDC + +Bedrock role, the EKS cluster object) are Terraform. Everything inside or on top +of the cluster (node group, addons, storage, ingress, DNS records, DB +roles/schemas, ECR repos and image content, all Kubernetes objects, Keycloak, +GitLab, Coder, and runtime config) was applied imperatively. The Terraform +`helm` and `kubernetes` providers are declared in `terraform/versions.tf` but no +`helm_release` or `kubernetes_*` resources exist, so no in-cluster object is +under Terraform control. + +## Declarative (Terraform, `terraform/`, PR #4 merged) + +| Resource | Terraform source | Notes | +|---|---|---| +| VPC `10.0.0.0/16`, IGW | `vpc.tf` | `aws_vpc.this`, `aws_internet_gateway.this` | +| 3 public + 3 private subnets, tagged for ELB | `vpc.tf` | `aws_subnet.public/private` | +| 1 NAT gateway + EIP, route tables, associations | `vpc.tf` | single NAT by design (EIP quota/cost) | +| EKS cluster `usgov-coderdemo` (k8s 1.36) | `eks.tf` | declared as Auto Mode; live cluster is standard (see drift note) | +| EKS deployer access entry + cluster-admin association | `eks.tf` | `aws_eks_access_entry.deployer` + policy association | +| Cluster IAM role `usgov-coderdemo-cluster` + 5 policies | `iam-eks.tf` | Auto Mode compute/storage/LB/networking policies, still attached | +| Auto Mode node role `usgov-coderdemo-node` + 2 policies | `iam-eks.tf` | provisioned but unused (node group uses a different role) | +| IAM OIDC provider for the cluster | `irsa.tf` | `aws_iam_openid_connect_provider.eks` | +| Coder Bedrock IRSA role `usgov-coderdemo-coder-bedrock` + inline `bedrock-invoke` | `irsa.tf` | trust limited to `coder:coder` SA | +| RDS subnet group, security group | `rds.tf` | SG allows tcp/5432 from `10.0.0.0/16` | +| RDS instance `usgov-coderdemo-pg` (PG 18.4, Multi-AZ) + master password | `rds.tf` | `random_password.db` | +| Secrets Manager `usgov-coderdemo/rds/master` + version | `rds.tf` | master `dbadmin` creds JSON | +| Outputs (`.substrate-outputs.json`) | `outputs.tf` | includes `ecr_registry` as a derived string only | +| S3/DynamoDB state backend config | `backend.tf` | bucket + lock table are bootstrap inputs, not managed here | + +Inputs referenced but NOT managed by this Terraform (pre-existing, passed by +ID/ARN in `variables.tf`): the Route53 hosted zone `Z06701704WFETYIRU5C8` and +the ACM certificate `7f4fc566-8efd-4aa5-b6ba-3b0c9a535d12`. The `ecr_registry` +output is a constructed string; the ECR repositories themselves are not +Terraform resources (created by the mirror script, see below). + +### Terraform-vs-live drift (declared but diverged) + +- `eks.tf` declares Auto Mode (`compute_config.enabled = true`, + `node_pools = ["general-purpose","system"]`, managed `block_storage` and + `elastic_load_balancing`). Live `aws eks describe-cluster` shows all three + disabled. The cluster runs as standard EKS. Reason: the GovCloud + `AWSServiceRoleForAmazonEKS` SLR lacks `iam:AddRoleToInstanceProfile` / + `iam:TagInstanceProfile`, so Auto Mode NodeClass validation never succeeds + (`deploy/platform/README.md`, `STATUS.md`). +- The cluster IAM role keeps the Auto Mode compute/storage/LB policies even + though those functions are now self-managed. + +## Imperative (CLI / Helm / kubectl / SQL / API) + +| Action | Mechanism | Evidence | +|---|---|---| +| Disable EKS Auto Mode (compute/storage/ELB) | `aws eks update-cluster-config` | `deploy/platform/README.md`; live `describe-cluster` all `false` | +| Node IAM role `usgov-coderdemo-mngnode` + 5 managed policies | AWS CLI/IAM | live `aws iam list-attached-role-policies` | +| Managed node group `mng` (3x m5.xlarge, AL2023, static 2/3/4) | `aws eks create-nodegroup` (CLI) | live `aws eks describe-nodegroup`; `deploy/platform/README.md` | +| EBS CSI IRSA role `usgov-coderdemo-ebs-csi` + trust + `AmazonEBSCSIDriverPolicy` | AWS CLI/IAM | live IAM; `deploy/platform/README.md` | +| Bind EBS CSI addon to the IRSA role | `aws eks update-addon --service-account-role-arn` | `deploy/platform/README.md` | +| Self-managed addons: `vpc-cni`, `kube-proxy`, `coredns`, `aws-ebs-csi-driver` | `aws eks create-addon` | live `aws eks list-addons` (TF sets `bootstrap_self_managed_addons=false`) | +| Default `gp3` StorageClass (encrypted, WaitForFirstConsumer) | `kubectl apply` | live `kubectl get sc gp3 -o yaml` | +| `aws-load-balancer-controller` (kube-system) | Helm | live Helm release `aws-load-balancer-controller.v1` | +| `ingress-nginx` chart 4.15.1 (+ internet-facing NLB) | Helm | live Helm release `ingress-nginx.v1`; `deploy/platform/ingress-nginx-values.yaml` | +| Route53 alias A records `dev`/`auth`/`gitlab`/`*` -> NLB | AWS CLI | live `aws route53`; `deploy/platform/README.md` | +| RDS roles + databases (`coder`, `keycloak`) | in-cluster SQL Job (`postgres:18-alpine`) | `deploy/platform/README.md` | +| ECR repositories + image mirroring (5 images) | `scripts/mirror-images.sh` (crane) + `scripts/images.txt` | live `aws ecr describe-repositories` | +| k8s Secrets `coder-db`, `coder-oidc`, `coder-ai`, `coder-external-auth` | `kubectl create secret` | `deploy/coder/secrets.example.yaml`; `deploy/platform/README.md` | +| k8s Secrets `keycloak-db`, `keycloak-admin`, `gitlab-secrets` | `kubectl create secret` | `deploy/keycloak/README.md`, `deploy/gitlab/README.md` | +| Workspace RBAC in `coder-workspaces` | `kubectl apply -f deploy/platform/workspace-rbac.yaml` | live `kubectl get role -n coder-workspaces` | +| Keycloak Deployment/Service/Ingress + realm `coder` import | `kubectl apply -k deploy/keycloak/` | `deploy/keycloak/*`; live pod `keycloak` | +| GitLab StatefulSet/Service/Ingress (embedded Postgres) | `kubectl apply -f deploy/gitlab/*` | `deploy/gitlab/*`; live pod `gitlab-0` | +| Coder control plane | Helm release `coder` (4 revisions) + `deploy/coder/values.yaml` | live Helm release `coder.v1..v4` | +| Coder AI Gateway providers (`anthropic`, `anthropic-bedrock`) | env-seeded once, then DB-authoritative | `deploy/coder/values.yaml`; `STATUS.md` | +| Coder classification banner (`UNCLASSIFIED - USGOVCLOUD`) | `scripts/set-appearance.sh` (runtime DB setting) | `scripts/set-appearance.sh`; `STATUS.md` | +| Coder AI Governance add-on license | `coder licenses add` / UI (runtime JWT in DB) | `deploy/coder/README.md`; `STATUS.md` | +| GitLab instance-wide OAuth app (id/secret -> `coder-external-auth`) | GitLab API / Rails console | `STATUS.md`; `deploy/coder/secrets.example.yaml` | +| Coder template `claude-code` push | `coder templates push` | `coder-templates/claude-code/main.tf`; `STATUS.md` | + +Unverified detail: the `aws-load-balancer-controller` almost certainly uses its +own IRSA role, but the exact role name was not checked live, so it is left +unverified here. + +Abandoned artifact: `deploy/platform/nodepool.yaml` is an Auto Mode +NodeClass/NodePool workaround that was not applied to the standard cluster; it +remains in the repo for history only. + +## Reconciliation backlog (to fold into Terraform) + +This mirrors the `STATUS.md` "Deviations to reconcile into Terraform" list and +expands it with every imperative item found above. Ordered roughly by layer. + +1. Flip `terraform/eks.tf` from Auto Mode to standard EKS (disable + `compute_config`, `storage_config.block_storage`, and + `kubernetes_network_config.elastic_load_balancing`); drop the unused Auto + Mode policies from the cluster role if no longer needed. +2. Add a managed node group `mng` (3x m5.xlarge, `AL2023_x86_64_STANDARD`, + static min2/desired3/max4, private subnets) as + `aws_eks_node_group`. +3. Add node role `usgov-coderdemo-mngnode` with its five managed policies; + decide whether to remove the now-unused `usgov-coderdemo-node` role. +4. Add the EBS CSI IRSA role `usgov-coderdemo-ebs-csi` and manage the four EKS + addons as `aws_eks_addon` (with the CSI addon's `service_account_role_arn`). +5. Manage the `gp3` default StorageClass (kubernetes provider or a bootstrap + manifest). +6. Manage `aws-load-balancer-controller` and `ingress-nginx` (and the LB + controller IRSA role) via the `helm` provider; capture the NLB annotations. +7. Manage Route53 alias A records (`dev`, `auth`, `gitlab`, `*`) -> + ingress NLB as `aws_route53_record` (alias to the NLB). +8. Codify RDS role/database creation (`coder`, `keycloak`) instead of the ad hoc + SQL Job, or document it as an explicit post-apply step. +9. Manage ECR repositories as `aws_ecr_repository` (the registry host is already + an output); keep image mirroring (`scripts/mirror-images.sh`) as an explicit + pipeline step since image content is not Terraform's job. +10. Decide a source of truth for Kubernetes Secrets (`coder-db`, `coder-oidc`, + `coder-ai`, `coder-external-auth`, `keycloak-db`, `keycloak-admin`, + `gitlab-secrets`); keep real values out of git. +11. Manage workspace RBAC (`coder-workspaces` Role/RoleBinding) declaratively. +12. Manage Keycloak (Deployment/Service/Ingress + realm import) and GitLab + (StatefulSet/Service/Ingress) manifests under a GitOps or Terraform path. +13. Manage the Coder Helm release and `values.yaml` declaratively; note that AI + Gateway provider env vars only seed the DB once, so treat them as one-time + seed config and manage providers in the DB afterward. +14. Treat these as runtime/out-of-band, not Terraform: the AI Governance license + JWT, the appearance banner DB setting, the GitLab OAuth app, and the Coder + template push. Document them as runbook steps. + +Note: the Route53 hosted zone and ACM certificate are pre-existing inputs and do +not need to be created by Terraform; only the records inside the zone are part of +the backlog. diff --git a/docs/as-built/90-operations-runbook.md b/docs/as-built/90-operations-runbook.md new file mode 100644 index 0000000..5f4cce2 --- /dev/null +++ b/docs/as-built/90-operations-runbook.md @@ -0,0 +1,222 @@ +# Day-2 operations runbook + +Operational reference for the GovCloud Coder demo. Status source of truth: +[`STATUS.md`](../../STATUS.md). All commands are run from the repo root +`/home/coder/demoenv-workspace/usgov-coderdemo` unless noted. The shell is +`sh`, so source files with `.`, not `source`. + +## Live endpoints + +Verified live (read-only) at authoring time. Codes are the raw HTTP status of +an unauthenticated `GET /`. + +| Service | URL | Live check | Notes | +|---|---|---|---| +| Coder | `https://dev.usgov.coderdemo.io` | `200` (`/api/v2/buildinfo` -> `v2.34.0+3006da5`) | Owner password login or "Sign in with Keycloak". | +| Keycloak | `https://auth.usgov.coderdemo.io` | `302` (redirect to login) | Realm `coder`; admin console at `/admin`, master realm, user `admin`. | +| GitLab | `https://gitlab.usgov.coderdemo.io` | `302` (redirect to login) | Root login; embedded Postgres. | + +Re-check any endpoint without printing secrets: + +```sh +. ~/.config/usgov-coderdemo/env >/dev/null 2>&1 +for h in dev auth gitlab; do + printf '%s -> ' "$h" + curl -sS -o /dev/null -m 20 -w '%{http_code}\n' "https://$h.usgov.coderdemo.io" +done +``` + +## Source environment + kubeconfig + +```sh +cd /home/coder/demoenv-workspace/usgov-coderdemo +. ~/.config/usgov-coderdemo/env # AWS profile/region and GitLab root password +export KUBECONFIG=./kubeconfig # cluster usgov-coderdemo +kubectl get nodes # sanity check +``` + +## Logging into the Coder API / CLI + +> **CODER_URL gotcha.** When this runs inside a Coder workspace, the agent +> ambiently exports `CODER_URL=https://dev.coder.com` (the **host** Coder, not +> this demo). Always target the demo explicitly with +> `https://dev.usgov.coderdemo.io`; do not reuse `$CODER_URL`. The helper +> scripts use a separate `DEMO_CODER_URL` for exactly this reason +> (`scripts/set-appearance.sh`). + +Locate a Coder CLI binary: + +```sh +ls -t /tmp/coder.*/coder | head -1 # host CLI cached in the workspace +# in-pod binary, if exec'ing into the coder pod: /opt/coder +``` + +CLI login against the demo: + +```sh +CODER_URL=https://dev.usgov.coderdemo.io "$(ls -t /tmp/coder.*/coder | head -1)" login https://dev.usgov.coderdemo.io +``` + +API login (owner credentials from `generated-secrets.env`, never echo them): + +```sh +. ~/.config/usgov-coderdemo/generated-secrets.env +TOKEN=$(curl -sS https://dev.usgov.coderdemo.io/api/v2/users/login \ + -H 'Content-Type: application/json' \ + -d "{\"email\":\"$CODER_ADMIN_EMAIL\",\"password\":\"$CODER_ADMIN_PASSWORD\"}" \ + | python3 -c 'import sys,json; print(json.load(sys.stdin)["session_token"])') +# Use it: curl -H "Coder-Session-Token: $TOKEN" https://dev.usgov.coderdemo.io/api/v2/users/me +``` + +Reference: org id `5de29a6d-8836-4643-a42b-2cb807c8e3e2` (facts sheet). + +## Credentials map + +Where each secret lives. Do not print values. + +| Secret | Location | Contents | +|---|---|---| +| AWS profile / region, GitLab root password | `~/.config/usgov-coderdemo/env` | `GITLAB_ROOT_PASSWORD`, AWS profile/region; Docker Hub creds for mirroring. Source before AWS commands. | +| Generated app credentials | `~/.config/usgov-coderdemo/generated-secrets.env` (gitignored, mode 600) | Coder owner (`CODER_ADMIN_EMAIL` / `CODER_ADMIN_PASSWORD`), Keycloak admin, Keycloak `demo` user, DB passwords, Coder<->Keycloak OIDC client secret, GitLab OAuth app id/secret (`GITLAB_CODER_OAUTH_*`). | +| RDS master | AWS Secrets Manager `usgov-coderdemo/rds/master` | JSON `username`,`password`,`host`,`port`; master user `dbadmin`. | +| Coder k8s Secrets (ns `coder`) | k8s | `coder-db` (key `url`), `coder-oidc` (key `client-secret`), `coder-ai` (key `ANTHROPIC_API_KEY`, currently a placeholder), `coder-external-auth` (keys `gitlab-client-id`, `gitlab-client-secret`). | +| Keycloak k8s Secrets (ns `keycloak`) | k8s | `keycloak-db` (`username`/`password`), `keycloak-admin` (`username`/`password`). | +| GitLab k8s Secret (ns `gitlab`) | k8s | `gitlab-secrets` (`initial_root_password`). | + +Sources: `deploy/platform/README.md`, `deploy/coder/`, `deploy/keycloak/`, +`deploy/gitlab/`, `STATUS.md`, facts sheet. + +> **AI provider key is in the database, not a k8s Secret.** Since v2.34 the AI +> Gateway providers live in the Coder DB. Rotate the Anthropic key in the UI at +> `/ai/settings`, not by editing the `coder-ai` Secret. Editing a seeded +> `CODER_AI_GATEWAY_PROVIDER_*` env var or the secret after first boot makes +> coderd refuse to start (drift guard) (`deploy/coder/README.md`). + +## Helm upgrade pattern (Coder) + +```sh +helm upgrade coder ~/.cache/helm/repository/coder_helm_2.34.0.tgz \ + --namespace coder \ + --values deploy/coder/values.yaml \ + --timeout 6m +kubectl -n coder rollout status deploy/coder +``` + +Caution: the `CODER_AI_GATEWAY_PROVIDER_*` env vars in `values.yaml` only seed +the DB on first startup. A later upgrade that changes any of those values (or +the `coder-ai` secret) breaks startup unless you first reconcile the change in +`/ai/settings`. Treat them as one-time seed config (`deploy/coder/README.md`). + +## Pushing a template + +The single template is `claude-code` (`coder-templates/claude-code/`). From the +repo root, targeting the demo Coder: + +```sh +# First time: create the template. +coder templates push claude-code \ + --directory coder-templates/claude-code \ + --variable namespace=coder-workspaces + +# Subsequent updates push a new version. +coder templates push claude-code \ + --directory coder-templates/claude-code +``` + +Variables: `namespace` (default `coder-workspaces`), `workspace_image` +(default ECR-mirrored `enterprise-base`), `use_kubeconfig` (default `false`). +The provisioner is in-process in coderd, so leave `use_kubeconfig=false` +(`coder-templates/claude-code/README.md`). + +## Mirroring images + +ECR has no pull-through cache in GovCloud, so upstream images are copied in with +`crane`. The image list is `scripts/images.txt`. + +```sh +. ~/.config/usgov-coderdemo/env # Docker Hub + AWS creds, region +scripts/mirror-images.sh # add --dry-run to preview +``` + +Currently mirrored: `ghcr.io/coder/coder:v2.34.0`, +`quay.io/keycloak/keycloak:26.6.3`, `docker.io/gitlab/gitlab-ce:19.0.1-ce.0`, +`docker.io/codercom/enterprise-base:ubuntu-noble-20260601`, plus +`postgres:18-alpine` for db bootstrap (`scripts/images.txt`, `STATUS.md`). + +## Setting the classification banner + +The green `UNCLASSIFIED - USGOVCLOUD` banner (`#007a33`) is a runtime DB setting +(premium-gated), not Helm. Reproduce idempotently: + +```sh +scripts/set-appearance.sh # reads admin creds from generated-secrets.env +``` + +The script targets `DEMO_CODER_URL` (default `https://dev.usgov.coderdemo.io`), +logs in as the owner, PUTs `/api/v2/appearance`, then reads it back to confirm +(`scripts/set-appearance.sh`). + +## Checking pod health + +```sh +export KUBECONFIG=./kubeconfig +kubectl get pods -A | grep -Ev 'Running|Completed' # anything unhealthy +kubectl -n coder get pods +kubectl -n coder rollout status deploy/coder +kubectl -n keycloak rollout status deploy/keycloak +kubectl -n gitlab rollout status statefulset/gitlab +kubectl -n ingress-nginx get pods # expect 2 controller replicas +kubectl -n coder-workspaces get pods # active workspace pods + +# Logs and recent events when a pod is unhappy: +kubectl -n logs --tail=200 +kubectl -n get events --sort-by=.lastTimestamp | tail -30 +``` + +Expected namespaces: `coder`, `coder-workspaces`, `gitlab`, `ingress-nginx`, +`keycloak`. Coder and Keycloak run 1 replica each; GitLab is the `gitlab-0` +StatefulSet; ingress-nginx runs 2 controller replicas (facts sheet, `STATUS.md`). + +## Known gaps / remaining actions + +1. **Real Anthropic key not set.** The `anthropic` provider holds a placeholder + (`sk-ant-REPLACE_ME_...`). AI requests return `502 "all configured keys + failed authentication"` until a real `sk-ant-...` key is pasted into the + `anthropic` provider at `/ai/settings` (UI, not the `coder-ai` secret). No + real Anthropic key exists anywhere in the environment (`STATUS.md`, facts + sheet). +2. **Bedrock Claude Sonnet 4.5 access gated.** Model access for + `us-gov.anthropic.claude-sonnet-4-5-20250929-v1:0` needs an Anthropic + agreement via the account paired with GovCloud. The proven in-GovCloud + fallback that does invoke today is `amazon.nova-pro-v1:0` (`STATUS.md`). +3. **No group / role sync.** Keycloak realm `coder` has no groups and no + group-claim mapper; Coder OIDC `group_field` and role mapping are empty. + Login provisions a bare account only (facts sheet). +4. **Internal provisioners only.** 3 built-in provisioner daemons run inside the + coderd pod; there are no external provisioner daemons and `daemon_psk` is not + set (facts sheet). +5. **Terraform reconciliation backlog.** Several pieces were applied + imperatively (CLI/Helm/kubectl/API) and are not yet in `terraform/`: Auto + Mode disabled plus standard node group `mng` and node role + `usgov-coderdemo-mngnode`; EBS CSI IRSA role and addon SA role; self-managed + addons and the `gp3` StorageClass; ingress-nginx and + aws-load-balancer-controller via Helm; RDS roles/dbs created via SQL; ECR + image mirroring; Route53 records; k8s Secrets; Keycloak realm import; the + Coder Helm release plus runtime appearance banner and DB-seeded AI providers; + the GitLab OAuth app minted via API; and the Coder template push. See + `STATUS.md` "Deviations to reconcile into Terraform". + +Out of scope for the demo: OpenShift, Istio, observability, full identity sync +(`STATUS.md`). + +## Related documents + +- [`00-overview.md`](00-overview.md): executive and architecture overview, the + three core flows, and the component map. +- Layer deep-dives `10` through `80` in this directory. +- [`STATUS.md`](../../STATUS.md): canonical build status. + +--- + +*As-built runbook authored by Coder Agents. Read-only; grounded in repo files +and `STATUS.md`.* diff --git a/docs/as-built/README.md b/docs/as-built/README.md new file mode 100644 index 0000000..4723e1a --- /dev/null +++ b/docs/as-built/README.md @@ -0,0 +1,30 @@ +# As-built documentation + +This directory is the engineering "as-built" record of the GovCloud Coder demo: +what was deployed, how it is configured and architected, and which parts are +declarative (Terraform) versus imperative (CLI, Helm, kubectl, SQL, API). + +Live status and the credentials map live in the repo-root `STATUS.md`. These +docs explain the *how* and *why* behind that status. + +## Read in this order + +| Doc | Scope | +|---|---| +| [00-overview.md](00-overview.md) | Executive + architecture overview, component map, topology diagram, and the three core flows (SSO login, workspace create + GitLab auth, Claude Code through the AI Gateway). Start here. | +| [10-infrastructure.md](10-infrastructure.md) | AWS GovCloud substrate: account/region/partition, VPC, EKS (standard, not Auto Mode, and why), node group, IRSA roles, RDS, ECR, Route53, ACM, NLB. | +| [20-platform-kubernetes.md](20-platform-kubernetes.md) | Kubernetes platform layer: namespaces, ingress-nginx + load-balancer-controller, `gp3` StorageClass, workspace RBAC, platform-owned Secrets. | +| [30-coder-control-plane.md](30-coder-control-plane.md) | Coder v2.34.0 control plane: a section-by-section walkthrough of `deploy/coder/values.yaml`, OIDC SSO, auth-boundary hardening, licensing, appearance. | +| [40-identity-keycloak.md](40-identity-keycloak.md) | Keycloak realm `coder`, the OIDC client, the SSO wiring, and the configured-vs-not gap (no group/role sync). | +| [50-gitlab-scm.md](50-gitlab-scm.md) | In-boundary GitLab SCM, the instance-wide OAuth app, and how every workspace authenticates git against it. | +| [60-ai-gateway.md](60-ai-gateway.md) | AI Gateway / AI Bridge: DB-managed providers (`anthropic` direct + `anthropic-bedrock` IRSA), name-based routing, the end-to-end request flow, and the remaining action to make AI respond. | +| [70-workspace-templates.md](70-workspace-templates.md) | The `claude-code` workspace template: pod/PVC, the claude-code module (4.7.3), Coder Tasks, parameters, and the required GitLab external auth. | +| [80-iac-vs-imperative.md](80-iac-vs-imperative.md) | The declarative-versus-imperative ledger and the Terraform reconciliation backlog. | +| [90-operations-runbook.md](90-operations-runbook.md) | Day-2 operations: env/kubeconfig, API/CLI login, Helm upgrade, template push, image mirroring, banner, health checks, known gaps. | + +## One thing to know before reading + +The AI path is fully wired but the `anthropic` provider holds a **placeholder** +key. Pasting a real Anthropic key into that provider at `/ai/settings` is the +only step left before live AI responses work. See +[60-ai-gateway.md](60-ai-gateway.md) and `STATUS.md`. From b26256f5c27e6452712130e49b081a34a9e5a470 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 17:53:49 +0000 Subject: [PATCH 05/16] feat: multi-tenant IdP sync (Keycloak orgs/groups/roles -> Coder) Model a true multi-tenant hierarchy in Keycloak and sync it into Coder via OIDC IdP sync (organization + group + role), with personas for the demo. Organizations: coder (display "Platform Engineering"), alpha ("Mission Partner Alpha"), bravo ("Mission Partner Bravo"). Keycloak (realm coder): a hierarchical group tree plus one Group Membership mapper emitting a full-path `groups` claim (ID + access + userinfo), and 8 persona users. Coder runs runtime per-org IdP sync (not legacy env vars): - organization sync: field=groups, assign_default=false, /platform|/alpha|/bravo - group sync (per org): team subgroups -> pre-created Coder groups - role sync (per org): role subgroups -> organization-admin / organization-template-admin / organization-auditor Tenant orgs are functional: an org-scoped provisioner key + external provisioner daemon per tenant (deploy/coder/provisioners.yaml, reusing the coder SA), and the claude-code template pushed into all three orgs. Verified end to end with scripts/verify-oidc-login.py: a real Keycloak login per persona lands them in the correct org(s), group(s), and role(s), with tenant isolation (Alpha vs Bravo vs Platform) and a cross-tenant ISSO/auditor. New idempotent scripts: - scripts/setup-keycloak-hierarchy.py (Keycloak Admin REST API) - scripts/setup-coder-idp-sync.py (Coder API: orgs, groups, sync, no secrets) - scripts/verify-oidc-login.py (real OIDC login -> org/role/group report) Docs: docs/as-built/45-idp-sync-personas.md; updated 40-identity-keycloak.md, as-built README, and STATUS.md. Generated by Coder Agents. --- STATUS.md | 19 ++- deploy/coder/provisioners.yaml | 137 ++++++++++++++++ docs/as-built/40-identity-keycloak.md | 74 ++++----- docs/as-built/45-idp-sync-personas.md | 151 ++++++++++++++++++ docs/as-built/README.md | 3 +- scripts/setup-coder-idp-sync.py | 195 +++++++++++++++++++++++ scripts/setup-keycloak-hierarchy.py | 220 ++++++++++++++++++++++++++ scripts/verify-oidc-login.py | 132 ++++++++++++++++ 8 files changed, 882 insertions(+), 49 deletions(-) create mode 100644 deploy/coder/provisioners.yaml create mode 100644 docs/as-built/45-idp-sync-personas.md create mode 100755 scripts/setup-coder-idp-sync.py create mode 100755 scripts/setup-keycloak-hierarchy.py create mode 100755 scripts/verify-oidc-login.py diff --git a/STATUS.md b/STATUS.md index e20dcb1..e713a24 100644 --- a/STATUS.md +++ b/STATUS.md @@ -118,5 +118,22 @@ gated; Nova Pro is the proven fallback. reproduce with `scripts/set-appearance.sh` (idempotent). Verified via `GET /api/v2/appearance`. +## Identity / multi-tenancy (Keycloak -> Coder IdP sync) +- [x] **3 Coder organizations**: `coder` (display "Platform Engineering"), + `alpha` ("Mission Partner Alpha"), `bravo` ("Mission Partner Bravo"). +- [x] **Org + group + role sync** from a single full-path `groups` OIDC claim + (Group Membership mapper on the `coder` client). `assign_default=false`; + runtime per-org IdP sync (not legacy env vars). Configured by + `scripts/setup-keycloak-hierarchy.py` + `scripts/setup-coder-idp-sync.py` + (both idempotent). +- [x] **8 persona users** in realm `coder` (platform lead, SRE/template-admin, + org admins, developers, data scientist, cross-tenant ISSO/auditor). +- [x] **Verified end to end** with `scripts/verify-oidc-login.py`: each persona + lands in the right org(s)/group(s)/role(s); tenant isolation holds. +- [x] **Tenant provisioners + templates**: external provisioner daemon per + tenant org (`deploy/coder/provisioners.yaml`, org-scoped keys) + the + `claude-code` template pushed into all three orgs. +- See `docs/as-built/45-idp-sync-personas.md` for the full hierarchy + matrix. + ## Out of scope (demo) -OpenShift, Istio, observability, full identity sync. +OpenShift, Istio, observability. diff --git a/deploy/coder/provisioners.yaml b/deploy/coder/provisioners.yaml new file mode 100644 index 0000000..3f7bf91 --- /dev/null +++ b/deploy/coder/provisioners.yaml @@ -0,0 +1,137 @@ +# ============================================================================= +# Per-organization external provisioner daemons (multi-tenant demo) +# ============================================================================= +# Each tenant Coder organization (alpha, bravo) needs its own provisioner +# because built-in provisioners only serve the default organization. These +# daemons authenticate with an org-scoped provisioner KEY (created out of band, +# stored in Secret coder-provisioner-, key `key`) so no PSK is shared. +# +# They reuse the `coder` ServiceAccount, which already has IRSA plus the +# workspace RBAC (Roles in `coder` and `coder-workspaces`), so the Terraform +# kubernetes provider can create workspace pods/PVCs in-cluster. NOTE: reusing +# one ServiceAccount and one workspace namespace means the tenant boundary here +# is Coder org/RBAC/provisioner-key isolation, NOT Kubernetes namespace +# isolation. That is an acceptable demo posture; hard k8s tenancy would need +# per-org namespaces, ServiceAccounts, and IRSA roles. +# +# Connect over the in-cluster Service (plain HTTP) to avoid the NLB hairpin. +# +# Create the key + secret first, for example: +# coder provisioner keys create alpha-eks --org alpha +# kubectl -n coder create secret generic coder-provisioner-alpha \ +# --from-literal=key= +# (scripts/setup-coder-idp-sync.py documents the org slugs/IDs.) +# ----------------------------------------------------------------------------- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: coder-provisioner-alpha + namespace: coder + labels: + app.kubernetes.io/name: coder-provisioner + app.kubernetes.io/instance: alpha + app.kubernetes.io/part-of: coder +spec: + replicas: 1 + selector: + matchLabels: + app.kubernetes.io/name: coder-provisioner + app.kubernetes.io/instance: alpha + template: + metadata: + labels: + app.kubernetes.io/name: coder-provisioner + app.kubernetes.io/instance: alpha + app.kubernetes.io/part-of: coder + spec: + serviceAccountName: coder + securityContext: + runAsUser: 1000 + runAsGroup: 1000 + fsGroup: 1000 + containers: + - name: provisionerd + image: "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/coder/coder:v2.34.0" + imagePullPolicy: IfNotPresent + command: ["/opt/coder"] + args: ["provisionerd", "start", "--name", "alpha-eks"] + env: + - name: CODER_URL + value: "http://coder.coder.svc.cluster.local" + - name: CODER_PROVISIONER_DAEMON_KEY + valueFrom: + secretKeyRef: + name: coder-provisioner-alpha + key: key + - name: CODER_CACHE_DIRECTORY + value: "/home/coder/.cache/coder" + resources: + requests: + cpu: "250m" + memory: "256Mi" + limits: + cpu: "2" + memory: "1Gi" + volumeMounts: + - name: home + mountPath: /home/coder + volumes: + - name: home + emptyDir: {} +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: coder-provisioner-bravo + namespace: coder + labels: + app.kubernetes.io/name: coder-provisioner + app.kubernetes.io/instance: bravo + app.kubernetes.io/part-of: coder +spec: + replicas: 1 + selector: + matchLabels: + app.kubernetes.io/name: coder-provisioner + app.kubernetes.io/instance: bravo + template: + metadata: + labels: + app.kubernetes.io/name: coder-provisioner + app.kubernetes.io/instance: bravo + app.kubernetes.io/part-of: coder + spec: + serviceAccountName: coder + securityContext: + runAsUser: 1000 + runAsGroup: 1000 + fsGroup: 1000 + containers: + - name: provisionerd + image: "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/coder/coder:v2.34.0" + imagePullPolicy: IfNotPresent + command: ["/opt/coder"] + args: ["provisionerd", "start", "--name", "bravo-eks"] + env: + - name: CODER_URL + value: "http://coder.coder.svc.cluster.local" + - name: CODER_PROVISIONER_DAEMON_KEY + valueFrom: + secretKeyRef: + name: coder-provisioner-bravo + key: key + - name: CODER_CACHE_DIRECTORY + value: "/home/coder/.cache/coder" + resources: + requests: + cpu: "250m" + memory: "256Mi" + limits: + cpu: "2" + memory: "1Gi" + volumeMounts: + - name: home + mountPath: /home/coder + volumes: + - name: home + emptyDir: {} diff --git a/docs/as-built/40-identity-keycloak.md b/docs/as-built/40-identity-keycloak.md index ffc9337..6aa9548 100644 --- a/docs/as-built/40-identity-keycloak.md +++ b/docs/as-built/40-identity-keycloak.md @@ -164,53 +164,33 @@ This matches `deploy/coder/values.yaml` exactly. - Boundary hardening: GitHub default login disabled, so no github.com login egress. -### NOT configured (known gap): IdP group sync and role mapping - -There is no Keycloak-to-Coder group sync or role mapping. This is a deliberate, -documented gap (see also `STATUS.md` "Out of scope: full identity sync" and the -facts sheet). Evidence from the live `GET /api/v2/deployment/config` `oidc` -block on the demo Coder: - -``` -groups_field = "" (no claim is read for group membership) -group_mapping = {} (no OIDC-group -> Coder-group mapping) -group_auto_create = false (Coder will not create groups from claims) -user_role_field = "" (no claim is read for site roles) -user_role_mapping = {} (no OIDC-claim -> Coder-role mapping) -group_regex_filter = ".*" (default; inert because groups_field is empty) -group_allow_list = null (default) -``` - -On the Keycloak side, the realm `coder` has no groups and no group-claim mapper: -`realm-coder.json` defines no `groups`/`defaultGroups` and no -`protocolMappers`, so even if Coder read a `groups` field there is currently no -`groups` claim emitted in the token. - -Net effect: all SSO users land as ordinary members of the default Coder -organization. Group membership and site roles are managed manually inside -Coder, not driven by the IdP. - -### What enabling group sync would require (future work, not implemented) - -Documentation only. Do not implement as part of this as-built pass. To wire -Keycloak group sync into Coder you would need all of: - -1. Keycloak: create the groups in realm `coder` (and assign users), then add a - "Group Membership" protocol mapper (on a client scope or the `coder` client) - that emits a `groups` claim in the token. Decide whether the claim is full - group paths or names. -2. Coder: set `CODER_OIDC_GROUP_FIELD` (the deployment-config key surfaces as - `groups_field`) to the claim name, for example `groups`. Optionally set - `CODER_OIDC_GROUP_MAPPING` to translate IdP group names to Coder group IDs, - and `CODER_OIDC_GROUP_AUTO_CREATE=true` if Coder should create missing - groups. `CODER_OIDC_GROUP_REGEX_FILTER` can scope which groups are honored. -3. For site-role sync (separate from groups): add a realm/role mapper that emits - a roles claim, then set `CODER_OIDC_USER_ROLE_FIELD` and - `CODER_OIDC_USER_ROLE_MAPPING` on Coder. - -Note: OIDC-driven group and role sync is a Coder premium/enterprise capability. -This deployment is licensed (premium + AI Governance per `STATUS.md`), so the -gating is configuration effort, not licensing. None of the above is wired today. +### IdP organization, group, and role sync (CONFIGURED) + +Keycloak-to-Coder sync is now wired and verified. Organization sync, group sync, +and role sync all read a single full-path `groups` claim that a Group Membership +mapper on the `coder` client emits. See +[45-idp-sync-personas.md](45-idp-sync-personas.md) for the full hierarchy, +persona matrix, and verification. + +High-level: + +1. Keycloak realm `coder` has a hierarchical group tree (`/platform`, `/alpha`, + `/bravo` with team and role subgroups) and 8 persona users, created by + `scripts/setup-keycloak-hierarchy.py`. The `coder` client emits the + `groups` claim (full path; ID + access + userinfo). +2. Coder runs runtime IdP sync (not the legacy `CODER_OIDC_*` env vars): + organization sync (`field=groups`, `organization_assign_default=false`), + per-org group sync, and per-org role sync mapping to `organization-admin`, + `organization-template-admin`, and `organization-auditor`. Configured by + `scripts/setup-coder-idp-sync.py`. + +The legacy deployment-config keys (`groups_field`, `user_role_field`, etc.) +remain empty on purpose: this deployment uses the runtime per-org IdP sync +settings instead, which are required for multi-organization sync. + +Net effect: SSO users are placed into the correct Coder organization(s), groups, +and roles automatically on login, with no manual assignment. Tenant isolation +(Alpha vs Bravo vs Platform) is enforced by organization membership. ## Sources diff --git a/docs/as-built/45-idp-sync-personas.md b/docs/as-built/45-idp-sync-personas.md new file mode 100644 index 0000000..e41c93b --- /dev/null +++ b/docs/as-built/45-idp-sync-personas.md @@ -0,0 +1,151 @@ +# As-built: IdP sync, organizations, and demo personas + +Keycloak (`realm coder`) is the identity source. Coder consumes a single +full-path `groups` OIDC claim and runs three IdP sync passes on every login: +**organization sync**, **group sync**, and **role sync**. This gives true +multi-tenancy (isolated Coder organizations) plus realistic personas, all +modeled in Keycloak and synced automatically. No org/group/role is assigned by +hand in Coder. + +Built by two idempotent scripts: + +- `scripts/setup-keycloak-hierarchy.py` - groups, the group-membership claim + mapper on the `coder` client, and the persona users (Keycloak Admin REST API). +- `scripts/setup-coder-idp-sync.py` - Coder organizations, groups, and the + org/group/role sync settings (Coder API). + +Verify end to end with `scripts/verify-oidc-login.py ...` (drives a real +OIDC login and prints the resulting orgs/roles/groups). + +## Organizations (tenants) + +| Coder org (slug) | Display name | Role in the demo | +|---|---|---| +| `coder` (default) | Platform Engineering | Central platform team. Owns the built-in provisioners. | +| `alpha` | Mission Partner Alpha | Tenant. Own provisioner (`alpha-eks`) + `claude-code` template. | +| `bravo` | Mission Partner Bravo | Tenant. Own provisioner (`bravo-eks`) + `claude-code` template. | + +Tenant isolation boundary is Coder organization membership, RBAC, and +per-org provisioner keys. Workspaces for all orgs currently share the +`coder-workspaces` namespace and the `coder` ServiceAccount, so this is org/RBAC +isolation, not Kubernetes-namespace isolation (see +[30-coder-control-plane.md](30-coder-control-plane.md) and +[20-platform-kubernetes.md](20-platform-kubernetes.md)). + +## Keycloak group tree and the `groups` claim + +One Group Membership mapper on the `coder` client emits the full group path as a +JSON array claim named `groups`, in the ID token, access token, and userinfo. +Users are explicitly added to the org group, their team subgroup, and any role +subgroup (Keycloak does not imply parent membership). + +``` +/platform org-sync -> coder (Platform Engineering) +/platform/platform-admins group-sync -> group "platform-admins" +/platform/sre group-sync -> group "sre" +/platform/org-admins role-sync -> organization-admin +/platform/template-admins role-sync -> organization-template-admin +/alpha org-sync -> alpha +/alpha/developers group-sync -> group "developers" +/alpha/data-science group-sync -> group "data-science" +/alpha/security group-sync -> group "security" +/alpha/org-admins role-sync -> organization-admin +/alpha/auditors role-sync -> organization-auditor +/bravo org-sync -> bravo +/bravo/developers group-sync -> group "developers" +/bravo/org-admins role-sync -> organization-admin +/bravo/auditors role-sync -> organization-auditor +``` + +Example decoded ID token claim (persona `morgan.isso`): +`"groups": ["/alpha", "/alpha/auditors", "/bravo", "/bravo/auditors"]`. + +## Coder sync configuration + +- **Organization sync** (deployment-level, `/api/v2/settings/idpsync/organization`): + `field=groups`, `organization_assign_default=false` (membership is purely + claim-driven), mapping `/platform`,`/alpha`,`/bravo` to the org IDs. +- **Group sync** (per org, `.../settings/idpsync/groups`): `field=groups`, + `auto_create_missing_groups=false`. Groups are pre-created. +- **Role sync** (per org, `.../settings/idpsync/roles`): `field=groups`, mapping + role subgroups to the exact role IDs `organization-admin`, + `organization-template-admin`, `organization-auditor`. + +The local `admin` owner is a non-OIDC break-glass account and is unaffected by +`assign_default=false`. The legacy Keycloak `demo` user is in no mapped group, +so with `assign_default=false` it lands in no organization by design. + +## Personas (Keycloak realm `coder`) + +All persona users have `emailVerified=true` and share the password in +`DEMO_USER_PASSWORD` (`~/.config/usgov-coderdemo/generated-secrets.env`). +Email is `@usgov.coderdemo.io`. + +| Username | Name | Org | Org role | Groups | +|---|---|---|---|---| +| pat.platform | Pat Rivera | Platform Engineering | organization-admin | platform-admins | +| sky.sre | Sky Nguyen | Platform Engineering | organization-template-admin | sre | +| alex.admin | Alex Carter | Mission Partner Alpha | organization-admin | (none) | +| dana.dev | Dana Brooks | Mission Partner Alpha | member | developers | +| quinn.data | Quinn Lee | Mission Partner Alpha | member | data-science | +| morgan.isso | Morgan Diaz | Alpha + Bravo | organization-auditor (both) | (none) | +| riley.admin | Riley Fox | Mission Partner Bravo | organization-admin | (none) | +| jordan.dev | Jordan Kim | Mission Partner Bravo | member | developers | + +## Verified login matrix + +Run `scripts/verify-oidc-login.py` (fresh cookie jar per user, real Keycloak +login). Confirmed output: + +``` +pat.platform -> coder organization-admin groups=[platform-admins] +sky.sre -> coder organization-template-admin groups=[sre] +alex.admin -> alpha organization-admin groups=[] +dana.dev -> alpha member groups=[developers] +quinn.data -> alpha member groups=[data-science] +morgan.isso -> alpha organization-auditor groups=[] + -> bravo organization-auditor groups=[] +riley.admin -> bravo organization-admin groups=[] +jordan.dev -> bravo member groups=[developers] +``` + +Tenant isolation holds: Alpha users see only Alpha, Bravo users see only Bravo, +Platform users see only Platform. The ISSO/auditor spans both tenants read-only. + +## Provisioners and templates per tenant org + +Each tenant org has its own external provisioner daemon +(`deploy/coder/provisioners.yaml`, Deployments `coder-provisioner-alpha` / +`coder-provisioner-bravo`) authenticated with an org-scoped provisioner key +(Secret `coder-provisioner-`), reusing the `coder` ServiceAccount. The +`claude-code` template is pushed into all three orgs; its import (terraform +init/plan) ran on each org's daemon. + +Workspace builds in any org require the user to complete the in-boundary GitLab +external auth first (every template declares `data coder_external_auth +"gitlab"`, see [70-workspace-templates.md](70-workspace-templates.md)). + +## Demo flow + +1. Log in as `pat.platform`: lands in Platform Engineering as org admin. +2. Log in (incognito) as `dana.dev`: lands only in Mission Partner Alpha, group + developers, no admin. Cannot see Bravo or Platform. +3. Log in as `riley.admin`: Bravo org admin; manage Bravo members/templates. +4. Log in as `morgan.isso`: auditor in both Alpha and Bravo; read-only audit + access, no build/admin rights. + +After changing Keycloak group membership, sync applies on the user's next login; +use a fresh/incognito session to avoid a cached session. + +## Re-run / reset + +``` +. ~/.config/usgov-coderdemo/generated-secrets.env +export KEYCLOAK_ADMIN_USERNAME KEYCLOAK_ADMIN_PASSWORD DEMO_USER_PASSWORD +python3 scripts/setup-keycloak-hierarchy.py # Keycloak groups/mapper/users +python3 scripts/setup-coder-idp-sync.py # Coder orgs/groups/sync +export DEMO_USER_PASSWORD +python3 scripts/verify-oidc-login.py pat.platform dana.dev morgan.isso riley.admin +``` + +Both setup scripts are idempotent. diff --git a/docs/as-built/README.md b/docs/as-built/README.md index 4723e1a..f676e28 100644 --- a/docs/as-built/README.md +++ b/docs/as-built/README.md @@ -15,7 +15,8 @@ docs explain the *how* and *why* behind that status. | [10-infrastructure.md](10-infrastructure.md) | AWS GovCloud substrate: account/region/partition, VPC, EKS (standard, not Auto Mode, and why), node group, IRSA roles, RDS, ECR, Route53, ACM, NLB. | | [20-platform-kubernetes.md](20-platform-kubernetes.md) | Kubernetes platform layer: namespaces, ingress-nginx + load-balancer-controller, `gp3` StorageClass, workspace RBAC, platform-owned Secrets. | | [30-coder-control-plane.md](30-coder-control-plane.md) | Coder v2.34.0 control plane: a section-by-section walkthrough of `deploy/coder/values.yaml`, OIDC SSO, auth-boundary hardening, licensing, appearance. | -| [40-identity-keycloak.md](40-identity-keycloak.md) | Keycloak realm `coder`, the OIDC client, the SSO wiring, and the configured-vs-not gap (no group/role sync). | +| [40-identity-keycloak.md](40-identity-keycloak.md) | Keycloak realm `coder`, the OIDC client, the SSO wiring, and IdP sync status. | +| [45-idp-sync-personas.md](45-idp-sync-personas.md) | Multi-tenant org/group/role hierarchy, the persona users, and the verified Keycloak-to-Coder IdP sync (org + group + role). | | [50-gitlab-scm.md](50-gitlab-scm.md) | In-boundary GitLab SCM, the instance-wide OAuth app, and how every workspace authenticates git against it. | | [60-ai-gateway.md](60-ai-gateway.md) | AI Gateway / AI Bridge: DB-managed providers (`anthropic` direct + `anthropic-bedrock` IRSA), name-based routing, the end-to-end request flow, and the remaining action to make AI respond. | | [70-workspace-templates.md](70-workspace-templates.md) | The `claude-code` workspace template: pod/PVC, the claude-code module (4.7.3), Coder Tasks, parameters, and the required GitLab external auth. | diff --git a/scripts/setup-coder-idp-sync.py b/scripts/setup-coder-idp-sync.py new file mode 100755 index 0000000..a36dc21 --- /dev/null +++ b/scripts/setup-coder-idp-sync.py @@ -0,0 +1,195 @@ +#!/usr/bin/env python3 +""" +setup-coder-idp-sync.py - configure Coder organizations, groups, and OIDC IdP +sync (organizations + groups + roles) for the GovCloud multi-tenant demo. + +Idempotent: safe to re-run. Discovers existing orgs/groups by name and only +creates what is missing, then PATCHes the sync settings to the desired state. + +Targets the demo Coder explicitly (NOT the ambient $CODER_URL, which inside a +Coder workspace points at the host Coder). Admin creds come from +~/.config/usgov-coderdemo/generated-secrets.env. + +Usage: + python3 scripts/setup-coder-idp-sync.py + +The Keycloak side (groups, group-membership mapper, persona users) is created +by scripts/setup-keycloak-hierarchy.py. Both read from the same hierarchy +described in docs/as-built/45-idp-sync-personas.md. +""" +import json +import os +import sys +import urllib.request +import urllib.error + +BASE = os.environ.get("DEMO_CODER_URL", "https://dev.usgov.coderdemo.io").rstrip("/") + +# --- Desired hierarchy ------------------------------------------------------- +# Organizations: slug -> display name. "coder" is the pre-existing default org. +ORGS = { + "coder": "Platform Engineering", + "alpha": "Mission Partner Alpha", + "bravo": "Mission Partner Bravo", +} + +# Coder groups to pre-create per org slug (do not rely on auto-create). +GROUPS = { + "coder": ["platform-admins", "sre"], + "alpha": ["developers", "data-science", "security"], + "bravo": ["developers"], +} + +# Organization sync (deployment-level): full Keycloak group path -> org slug. +ORG_SYNC = { + "/platform": "coder", + "/alpha": "alpha", + "/bravo": "bravo", +} + +# Group sync per org slug: Keycloak group path -> Coder group name (in that org). +GROUP_SYNC = { + "coder": { + "/platform/platform-admins": "platform-admins", + "/platform/sre": "sre", + }, + "alpha": { + "/alpha/developers": "developers", + "/alpha/data-science": "data-science", + "/alpha/security": "security", + }, + "bravo": { + "/bravo/developers": "developers", + }, +} + +# Role sync per org slug: Keycloak group path -> list of Coder org role names. +ROLE_SYNC = { + "coder": { + "/platform/org-admins": ["organization-admin"], + "/platform/template-admins": ["organization-template-admin"], + }, + "alpha": { + "/alpha/org-admins": ["organization-admin"], + "/alpha/auditors": ["organization-auditor"], + }, + "bravo": { + "/bravo/org-admins": ["organization-admin"], + "/bravo/auditors": ["organization-auditor"], + }, +} + + +def login(): + secrets = os.path.expanduser("~/.config/usgov-coderdemo/generated-secrets.env") + creds = {} + with open(secrets) as f: + for line in f: + line = line.strip() + if "=" in line and not line.startswith("#"): + k, v = line.split("=", 1) + creds[k] = v + body = json.dumps({ + "email": creds["CODER_ADMIN_EMAIL"], + "password": creds["CODER_ADMIN_PASSWORD"], + }).encode() + req = urllib.request.Request(BASE + "/api/v2/users/login", data=body, + headers={"Content-Type": "application/json"}) + return json.load(urllib.request.urlopen(req))["session_token"] + + +TOKEN = None + + +def api(method, path, body=None, ok=(200, 201)): + headers = {"Coder-Session-Token": TOKEN, "Content-Type": "application/json"} + data = json.dumps(body).encode() if body is not None else None + req = urllib.request.Request(BASE + path, data=data, headers=headers, method=method) + try: + r = urllib.request.urlopen(req) + raw = r.read().decode() + return r.status, (json.loads(raw) if raw else None) + except urllib.error.HTTPError as e: + return e.code, e.read().decode() + + +def main(): + global TOKEN + TOKEN = login() + + # 1. Organizations ------------------------------------------------------- + _, orgs = api("GET", "/api/v2/organizations") + by_slug = {o["name"]: o for o in orgs} + org_id = {} + for slug, display in ORGS.items(): + if slug in by_slug: + o = by_slug[slug] + org_id[slug] = o["id"] + if o.get("display_name") != display: + code, _ = api("PATCH", f"/api/v2/organizations/{o['id']}", + {"display_name": display}) + print(f"org {slug}: display_name -> {display!r} (HTTP {code})") + else: + print(f"org {slug}: exists ({o['id']})") + else: + code, o = api("POST", "/api/v2/organizations", + {"name": slug, "display_name": display}) + if code not in (200, 201): + print(f"FAILED creating org {slug}: {code} {o}", file=sys.stderr) + sys.exit(1) + org_id[slug] = o["id"] + print(f"org {slug}: CREATED ({o['id']})") + + # 2. Groups (pre-create) ------------------------------------------------- + group_id = {} # (slug, name) -> id + for slug, names in GROUPS.items(): + _, existing = api("GET", f"/api/v2/organizations/{org_id[slug]}/groups") + ex = {g["name"]: g["id"] for g in existing} + for name in names: + if name in ex: + group_id[(slug, name)] = ex[name] + print(f"group {slug}/{name}: exists") + else: + code, g = api("POST", f"/api/v2/organizations/{org_id[slug]}/groups", + {"name": name, "display_name": name}) + if code not in (200, 201): + print(f"FAILED group {slug}/{name}: {code} {g}", file=sys.stderr) + sys.exit(1) + group_id[(slug, name)] = g["id"] + print(f"group {slug}/{name}: CREATED") + + # 3. Organization sync (deployment-level) -------------------------------- + org_mapping = {path: [org_id[slug]] for path, slug in ORG_SYNC.items()} + code, _ = api("PATCH", "/api/v2/settings/idpsync/organization", { + "field": "groups", + "mapping": org_mapping, + "organization_assign_default": False, + }) + print(f"org-sync: field=groups assign_default=false (HTTP {code})") + + # 4. Group sync (per org) ------------------------------------------------ + for slug, mapping in GROUP_SYNC.items(): + m = {path: [group_id[(slug, name)]] for path, name in mapping.items()} + code, _ = api("PATCH", + f"/api/v2/organizations/{org_id[slug]}/settings/idpsync/groups", { + "field": "groups", + "mapping": m, + "regex_filter": None, + "auto_create_missing_groups": False, + }) + print(f"group-sync[{slug}]: {len(m)} mappings (HTTP {code})") + + # 5. Role sync (per org) ------------------------------------------------- + for slug, mapping in ROLE_SYNC.items(): + code, _ = api("PATCH", + f"/api/v2/organizations/{org_id[slug]}/settings/idpsync/roles", { + "field": "groups", + "mapping": mapping, + }) + print(f"role-sync[{slug}]: {len(mapping)} mappings (HTTP {code})") + + print("\nOrg IDs:", json.dumps(org_id)) + + +if __name__ == "__main__": + main() diff --git a/scripts/setup-keycloak-hierarchy.py b/scripts/setup-keycloak-hierarchy.py new file mode 100755 index 0000000..78d31c4 --- /dev/null +++ b/scripts/setup-keycloak-hierarchy.py @@ -0,0 +1,220 @@ +#!/usr/bin/env python3 +""" +setup-keycloak-hierarchy.py - build the Keycloak realm `coder` group/user +hierarchy and the OIDC `groups` claim mapper that Coder IdP sync consumes. + +Idempotent: re-running ensures the desired state (groups, the group-membership +protocol mapper on the `coder` client, persona users + memberships) without +duplicating anything. + +Reads admin + demo-user credentials from +~/.config/usgov-coderdemo/generated-secrets.env: + KEYCLOAK_ADMIN_USERNAME, KEYCLOAK_ADMIN_PASSWORD, DEMO_USER_PASSWORD + +Pairs with scripts/setup-coder-idp-sync.py (the Coder side). The hierarchy is +documented in docs/as-built/45-idp-sync-personas.md. +""" +import json +import os +import sys +import urllib.parse +import urllib.request +import urllib.error + +KC = os.environ.get("KEYCLOAK_URL", "https://auth.usgov.coderdemo.io").rstrip("/") +REALM = "coder" +CLIENT_ID = "coder" + +# Group tree: top-level (org) -> subgroups (teams + role groups). +GROUP_TREE = { + "platform": ["platform-admins", "sre", "org-admins", "template-admins"], + "alpha": ["developers", "data-science", "security", "org-admins", "auditors"], + "bravo": ["developers", "org-admins", "auditors"], +} + +# Persona users -> full group paths they belong to. +USERS = { + "pat.platform": { + "first": "Pat", "last": "Rivera", + "groups": ["/platform", "/platform/platform-admins", "/platform/org-admins"], + }, + "sky.sre": { + "first": "Sky", "last": "Nguyen", + "groups": ["/platform", "/platform/sre", "/platform/template-admins"], + }, + "alex.admin": { + "first": "Alex", "last": "Carter", + "groups": ["/alpha", "/alpha/org-admins"], + }, + "dana.dev": { + "first": "Dana", "last": "Brooks", + "groups": ["/alpha", "/alpha/developers"], + }, + "quinn.data": { + "first": "Quinn", "last": "Lee", + "groups": ["/alpha", "/alpha/data-science"], + }, + "morgan.isso": { + "first": "Morgan", "last": "Diaz", + "groups": ["/alpha", "/alpha/auditors", "/bravo", "/bravo/auditors"], + }, + "riley.admin": { + "first": "Riley", "last": "Fox", + "groups": ["/bravo", "/bravo/org-admins"], + }, + "jordan.dev": { + "first": "Jordan", "last": "Kim", + "groups": ["/bravo", "/bravo/developers"], + }, +} + +EMAIL_DOMAIN = "usgov.coderdemo.io" + + +def read_secrets(): + path = os.path.expanduser("~/.config/usgov-coderdemo/generated-secrets.env") + out = {} + with open(path) as f: + for line in f: + line = line.strip() + if "=" in line and not line.startswith("#"): + k, v = line.split("=", 1) + out[k] = v + return out + + +SECRETS = read_secrets() +TOKEN = None + + +def token(): + data = urllib.parse.urlencode({ + "grant_type": "password", + "client_id": "admin-cli", + "username": SECRETS["KEYCLOAK_ADMIN_USERNAME"], + "password": SECRETS["KEYCLOAK_ADMIN_PASSWORD"], + }).encode() + req = urllib.request.Request( + KC + "/realms/master/protocol/openid-connect/token", data=data, + headers={"Content-Type": "application/x-www-form-urlencoded"}) + return json.load(urllib.request.urlopen(req))["access_token"] + + +def kc(method, path, body=None, ok=(200, 201, 204)): + headers = {"Authorization": "Bearer " + TOKEN} + data = None + if body is not None: + headers["Content-Type"] = "application/json" + data = json.dumps(body).encode() + req = urllib.request.Request(KC + "/admin/realms/" + REALM + path, + data=data, headers=headers, method=method) + try: + r = urllib.request.urlopen(req) + raw = r.read().decode() + return r.status, (json.loads(raw) if raw else None) + except urllib.error.HTTPError as e: + return e.code, e.read().decode() + + +def ensure_groups(): + """Create the group tree if missing; return {full_path: id}.""" + # Top-level groups. + _, tops = kc("GET", "/groups?max=200") + top_by_name = {g["name"]: g for g in tops} + for name in GROUP_TREE: + if name not in top_by_name: + code, _ = kc("POST", "/groups", {"name": name}) + print(f"group /{name}: CREATED (HTTP {code})") + # Re-fetch to get ids. + _, tops = kc("GET", "/groups?max=200") + top_by_name = {g["name"]: g for g in tops} + + paths = {} + for name, children in GROUP_TREE.items(): + top = top_by_name[name] + paths["/" + name] = top["id"] + _, existing = kc("GET", f"/groups/{top['id']}/children?max=200") + child_by_name = {g["name"]: g for g in existing} + for child in children: + if child not in child_by_name: + code, _ = kc("POST", f"/groups/{top['id']}/children", {"name": child}) + print(f"group /{name}/{child}: CREATED (HTTP {code})") + _, existing = kc("GET", f"/groups/{top['id']}/children?max=200") + for g in existing: + paths[f"/{name}/{g['name']}"] = g["id"] + return paths + + +def ensure_mapper(): + """Group-membership mapper on the coder client -> full-path `groups` claim.""" + _, clients = kc("GET", "/clients?clientId=" + CLIENT_ID) + cid = clients[0]["id"] + _, mappers = kc("GET", f"/clients/{cid}/protocol-mappers/models") + existing = {m["name"]: m for m in (mappers or [])} + desired_config = { + "full.path": "true", + "id.token.claim": "true", + "access.token.claim": "true", + "userinfo.token.claim": "true", + "lightweight.claim": "false", + "claim.name": "groups", + } + rep = { + "name": "groups", + "protocol": "openid-connect", + "protocolMapper": "oidc-group-membership-mapper", + "config": desired_config, + } + if "groups" in existing: + m = existing["groups"] + rep["id"] = m["id"] + code, _ = kc("PUT", f"/clients/{cid}/protocol-mappers/models/{m['id']}", rep) + print(f"client mapper 'groups': updated (HTTP {code})") + else: + code, _ = kc("POST", f"/clients/{cid}/protocol-mappers/models", rep) + print(f"client mapper 'groups': CREATED (HTTP {code})") + + +def ensure_users(paths): + pw = SECRETS["DEMO_USER_PASSWORD"] + for username, spec in USERS.items(): + _, found = kc("GET", "/users?exact=true&username=" + urllib.parse.quote(username)) + if found: + uid = found[0]["id"] + print(f"user {username}: exists") + else: + rep = { + "username": username, + "email": f"{username}@{EMAIL_DOMAIN}", + "firstName": spec["first"], + "lastName": spec["last"], + "enabled": True, + "emailVerified": True, + } + code, _ = kc("POST", "/users", rep) + _, found = kc("GET", "/users?exact=true&username=" + urllib.parse.quote(username)) + uid = found[0]["id"] + print(f"user {username}: CREATED (HTTP {code})") + # Password (non-temporary so login is immediate). + kc("PUT", f"/users/{uid}/reset-password", + {"type": "password", "value": pw, "temporary": False}) + # Group memberships (PUT is idempotent). + for gpath in spec["groups"]: + gid = paths[gpath] + code, _ = kc("PUT", f"/users/{uid}/groups/{gid}") + print(f" {username}: groups -> {', '.join(spec['groups'])}") + + +def main(): + global TOKEN + TOKEN = token() + paths = ensure_groups() + ensure_mapper() + ensure_users(paths) + print("\nGroup paths:") + for p in sorted(paths): + print(" ", p) + + +if __name__ == "__main__": + main() diff --git a/scripts/verify-oidc-login.py b/scripts/verify-oidc-login.py new file mode 100755 index 0000000..4d708fd --- /dev/null +++ b/scripts/verify-oidc-login.py @@ -0,0 +1,132 @@ +#!/usr/bin/env python3 +""" +verify-oidc-login.py - drive a real Coder OIDC login through Keycloak for a +persona user and report the org membership, org roles, and groups that IdP sync +assigned. Proves the Keycloak -> Coder sync end to end. + +Usage: + DEMO_USER_PASSWORD=... python3 scripts/verify-oidc-login.py dana.dev [more...] + +Read-only against Coder (it only logs in and GETs). Uses a fresh cookie jar per +user so there is no cached SSO session. +""" +import sys +import re +import json +import os +import http.cookiejar +import urllib.request +import urllib.parse +import urllib.error + +CODER = os.environ.get("DEMO_CODER_URL", "https://dev.usgov.coderdemo.io").rstrip("/") +PW = os.environ["DEMO_USER_PASSWORD"] + + +class NoRedirect(urllib.request.HTTPRedirectHandler): + def redirect_request(self, req, fp, code, msg, headers, newurl): + return None + + +def opener(): + cj = http.cookiejar.CookieJar() + op = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj), NoRedirect) + op.addheaders = [("User-Agent", "coder-demo-verify/1.0")] + return op, cj + + +def req(op, url, data=None, ctype=None): + headers = {} + if ctype: + headers["Content-Type"] = ctype + r = urllib.request.Request(url, data=data, headers=headers) + try: + resp = op.open(r) + return resp.getcode(), resp.headers, resp.read().decode("utf-8", "replace") + except urllib.error.HTTPError as e: + return e.code, e.headers, e.read().decode("utf-8", "replace") + + +def login(user): + op, cj = opener() + # 1. Initiate OIDC at Coder -> 302 to Keycloak authorize. + code, h, _ = req(op, CODER + "/api/v2/users/oidc/callback") + if code not in (302, 307) or "Location" not in h: + return None, f"initiate: expected redirect, got {code}" + authz = h["Location"] + # 2. Load the Keycloak login page. + code, h, body = req(op, authz) + if code != 200: + # Could be a direct 302 if already authenticated (should not happen on a + # fresh jar). Surface for debugging. + return None, f"authorize: expected 200 login page, got {code} loc={h.get('Location')}" + m = re.search(r'action="([^"]+)"', body) + if not m: + return None, "authorize: could not find login form action" + action = m.group(1).replace("&", "&") + # 3. Submit credentials -> 302 back to the Coder callback with code+state. + form = urllib.parse.urlencode({"username": user, "password": PW, "credentialId": ""}).encode() + code, h, body = req(op, action, data=form, + ctype="application/x-www-form-urlencoded") + if code not in (302, 307) or "Location" not in h: + return None, f"login POST: expected redirect, got {code} (bad credentials or extra form field?)" + cb = h["Location"] + if "/oidc/callback" not in cb: + return None, f"login POST: unexpected redirect {cb[:120]}" + # 4. Coder consumes the code, sets the session cookie, redirects to the app. + code, h, _ = req(op, cb) + if code not in (302, 307): + return None, f"coder callback: expected redirect, got {code}" + tok = None + for c in cj: + if c.name == "coder_session_token": + tok = c.value + if not tok: + return None, "coder callback: no coder_session_token cookie set" + return tok, None + + +def capi(tok, path): + r = urllib.request.Request(CODER + path, headers={"Coder-Session-Token": tok}) + try: + return json.load(urllib.request.urlopen(r)) + except urllib.error.HTTPError as e: + return {"ERROR": e.code, "body": e.read().decode()[:200]} + + +def report(user): + tok, err = login(user) + if err: + print(f"\n## {user}: LOGIN FAILED: {err}") + return + me = capi(tok, "/api/v2/users/me") + orgs = capi(tok, "/api/v2/users/me/organizations") + site_roles = [r["name"] if isinstance(r, dict) else r for r in me.get("roles", [])] + print(f"\n## {user} ({me.get('email')}) site_roles={site_roles or '[]'}") + if isinstance(orgs, dict): + print(" orgs ERROR:", orgs) + return + for o in orgs: + oid = o["id"] + # roles for this member in this org + members = capi(tok, f"/api/v2/organizations/{oid}/members") + roles = [] + if isinstance(members, list): + for mem in members: + if mem.get("user_id") == me["id"]: + roles = [r["name"] if isinstance(r, dict) else r for r in mem.get("roles", [])] + # groups in this org the user belongs to + groups = capi(tok, f"/api/v2/organizations/{oid}/groups") + my_groups = [] + if isinstance(groups, list): + for g in groups: + ids = [mm.get("id") for mm in (g.get("members") or [])] + if me["id"] in ids and g["name"] != "Everyone": + my_groups.append(g["name"]) + print(f" org {o['name']:8} display={o.get('display_name'):22} roles={roles or ['member']} groups={my_groups}") + + +if __name__ == "__main__": + users = sys.argv[1:] or ["dana.dev"] + for u in users: + report(u) From 5f4e46382d2195047f4c570e075194cd936f3026 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 18:17:50 +0000 Subject: [PATCH 06/16] feat: secrets via External Secrets Operator + AWS Secrets Manager (IRSA) Move the demo's runtime secrets to AWS Secrets Manager as the source of truth and sync them into Kubernetes with the External Secrets Operator over IRSA, so no secret material lives in git or in a local file. - Mirror the ESO image into ECR (scripts/images.txt) and deploy ESO (chart 2.6.0, ns external-secrets) with deploy/platform/external-secrets/values.yaml. - IRSA role usgov-coderdemo-external-secrets: least-privilege secretsmanager:GetSecretValue/DescribeSecret on usgov-coderdemo/* only, no static keys. Codified in terraform/secrets-hardening.tf. - Migrate the 9 runtime app secrets (coder/keycloak/gitlab) into ASM with scripts/migrate-secrets-to-asm.py (values passed via mode-600 temp files). - ClusterSecretStore aws-secretsmanager + one ExternalSecret per app secret (dataFrom extract, creationPolicy Owner). ESO adopted the existing Secrets in place with byte-identical data (no app disruption); store Valid, all 9 SecretSynced; delete/recreate recovery verified. - EKS Secrets envelope encryption with a customer-managed KMS key is codified in terraform/secrets-hardening.tf but NOT applied (irreversible re-encrypt; needs a maintenance decision). Docs: docs/as-built/85-secrets-management.md; updated 80-iac-vs-imperative.md, the example secret files, STATUS.md, and the docs index. Generated by Coder Agents. --- STATUS.md | 17 ++ deploy/coder/secrets.example.yaml | 5 + deploy/gitlab/secrets.example.yaml | 3 + deploy/keycloak/secrets.example.yaml | 4 + .../secretstore-and-externalsecrets.yaml | 179 ++++++++++++++++++ deploy/platform/external-secrets/values.yaml | 39 ++++ docs/00-INDEX.md | 1 + docs/as-built/80-iac-vs-imperative.md | 12 +- docs/as-built/85-secrets-management.md | 114 +++++++++++ docs/as-built/README.md | 1 + scripts/images.txt | 4 + scripts/migrate-secrets-to-asm.py | 92 +++++++++ terraform/secrets-hardening.tf | 94 +++++++++ 13 files changed, 562 insertions(+), 3 deletions(-) create mode 100644 deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml create mode 100644 deploy/platform/external-secrets/values.yaml create mode 100644 docs/as-built/85-secrets-management.md create mode 100755 scripts/migrate-secrets-to-asm.py create mode 100644 terraform/secrets-hardening.tf diff --git a/STATUS.md b/STATUS.md index e713a24..6c37c5a 100644 --- a/STATUS.md +++ b/STATUS.md @@ -135,5 +135,22 @@ gated; Nova Pro is the proven fallback. `claude-code` template pushed into all three orgs. - See `docs/as-built/45-idp-sync-personas.md` for the full hierarchy + matrix. +## Secrets management (ESO + AWS Secrets Manager) +- [x] **AWS Secrets Manager is the source of truth** for the 9 runtime app + secrets (`usgov-coderdemo/{coder,keycloak,gitlab}/*`). No secret material + in git. +- [x] **External Secrets Operator** (chart 2.6.0, ns `external-secrets`, ECR + mirror image) syncs ASM into the app namespaces via IRSA role + `usgov-coderdemo-external-secrets` (read-only, scoped to + `usgov-coderdemo/*`, no static keys). ClusterSecretStore + `aws-secretsmanager` Valid; all 9 ExternalSecrets SecretSynced. +- [x] Migrated with `scripts/migrate-secrets-to-asm.py`; ESO adopted the + existing Secrets with byte-identical data (no app disruption); + delete/recreate recovery verified. +- [ ] **EKS Secrets envelope encryption (customer KMS)**: NOT applied + (irreversible re-encrypt; needs a maintenance decision). Codified in + `terraform/secrets-hardening.tf`. +- See `docs/as-built/85-secrets-management.md`. + ## Out of scope (demo) OpenShift, Istio, observability. diff --git a/deploy/coder/secrets.example.yaml b/deploy/coder/secrets.example.yaml index 8d3e96b..8a557cd 100644 --- a/deploy/coder/secrets.example.yaml +++ b/deploy/coder/secrets.example.yaml @@ -1,5 +1,10 @@ # Example k8s Secret manifests for the Coder control plane (namespace: coder). # +# NOTE: these Secrets are now produced by the External Secrets Operator from +# AWS Secrets Manager (usgov-coderdemo/coder/*). See +# docs/as-built/85-secrets-management.md. This file is kept only to document the +# Secret names and keys the control plane consumes. +# # DO NOT COMMIT REAL SECRETS. Every value below is a REPLACE_ME placeholder. # In the real deploy these Secrets are created by the platform layer # (orchestrator) or applied out-of-band; this file documents the exact diff --git a/deploy/gitlab/secrets.example.yaml b/deploy/gitlab/secrets.example.yaml index 09eef94..44b53d8 100644 --- a/deploy/gitlab/secrets.example.yaml +++ b/deploy/gitlab/secrets.example.yaml @@ -1,5 +1,8 @@ --- # EXAMPLE ONLY. Copy to a real, untracked file, replace REPLACE_ME, and apply. +# NOTE: this Secret is now produced by the External Secrets Operator from AWS +# Secrets Manager (usgov-coderdemo/gitlab/secrets). See +# docs/as-built/85-secrets-management.md. This file documents the key only. # Do NOT commit a real password. This Secret holds ONLY the initial root # password, which GitLab consumes on its first boot to seed the "root" user. # diff --git a/deploy/keycloak/secrets.example.yaml b/deploy/keycloak/secrets.example.yaml index 2917514..1a2753d 100644 --- a/deploy/keycloak/secrets.example.yaml +++ b/deploy/keycloak/secrets.example.yaml @@ -1,5 +1,9 @@ # EXAMPLE secrets for Keycloak. DO NOT COMMIT REAL VALUES. # +# NOTE: these Secrets are now produced by the External Secrets Operator from +# AWS Secrets Manager (usgov-coderdemo/keycloak/*). See +# docs/as-built/85-secrets-management.md. This file documents the keys only. +# # Per deploy/CONVENTIONS.md the platform layer normally provisions the app DB # Secret (`-db`). These manifests document the exact keys this workstream # expects so the Deployment can be applied standalone for testing. Replace every diff --git a/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml b/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml new file mode 100644 index 0000000..d8c1002 --- /dev/null +++ b/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml @@ -0,0 +1,179 @@ +# ============================================================================= +# ClusterSecretStore + ExternalSecrets: sync usgov-coderdemo/* from AWS Secrets +# Manager into the app namespaces' Kubernetes Secrets. +# ============================================================================= +# The ClusterSecretStore authenticates to ASM via the ESO controller's IRSA +# ServiceAccount (external-secrets/external-secrets), so no static AWS keys. +# Each ExternalSecret extracts the JSON keys of one ASM secret into a target +# Kubernetes Secret with the SAME name/keys the apps already reference, so the +# control-plane manifests (deploy/coder, deploy/keycloak, deploy/gitlab) are +# unchanged. ASM is the source of truth; ESO owns these Secrets and refreshes +# them hourly. +# ----------------------------------------------------------------------------- +apiVersion: external-secrets.io/v1 +kind: ClusterSecretStore +metadata: + name: aws-secretsmanager +spec: + provider: + aws: + service: SecretsManager + region: us-gov-west-1 + auth: + jwt: + serviceAccountRef: + name: external-secrets + namespace: external-secrets +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: coder-db + namespace: coder +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: coder-db + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/coder/db +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: coder-oidc + namespace: coder +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: coder-oidc + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/coder/oidc +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: coder-ai + namespace: coder +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: coder-ai + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/coder/ai +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: coder-external-auth + namespace: coder +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: coder-external-auth + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/coder/external-auth +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: coder-provisioner-alpha + namespace: coder +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: coder-provisioner-alpha + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/coder/provisioner-alpha +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: coder-provisioner-bravo + namespace: coder +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: coder-provisioner-bravo + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/coder/provisioner-bravo +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: keycloak-admin + namespace: keycloak +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: keycloak-admin + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/keycloak/admin +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: keycloak-db + namespace: keycloak +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: keycloak-db + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/keycloak/db +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: gitlab-secrets + namespace: gitlab +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: gitlab-secrets + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/gitlab/secrets diff --git a/deploy/platform/external-secrets/values.yaml b/deploy/platform/external-secrets/values.yaml new file mode 100644 index 0000000..169210c --- /dev/null +++ b/deploy/platform/external-secrets/values.yaml @@ -0,0 +1,39 @@ +# Helm values for External Secrets Operator (chart/appVersion 2.6.0). +# +# ESO syncs secrets from AWS Secrets Manager (usgov-coderdemo/*) into Kubernetes +# Secrets, so ASM is the source of truth and no secret material lives in git or +# in a local file. The controller authenticates to ASM via IRSA (no static AWS +# keys): its ServiceAccount is annotated with role usgov-coderdemo-external-secrets, +# which is allowed secretsmanager:GetSecretValue/DescribeSecret on +# arn:aws-us-gov:secretsmanager:us-gov-west-1:430737322961:secret:usgov-coderdemo/*. +# +# Images are pulled from the private ECR mirror (no GHCR pull-through in +# GovCloud). All three components share one image. Install with release name +# `external-secrets` in namespace `external-secrets` so the controller SA is +# named `external-secrets` (matches the IRSA trust subject). + +installCRDs: true +crds: + createClusterSecretStore: true + +image: + repository: "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/external-secrets/external-secrets" + tag: "v2.6.0" + pullPolicy: IfNotPresent + +serviceAccount: + name: external-secrets + annotations: + eks.amazonaws.com/role-arn: "arn:aws-us-gov:iam::430737322961:role/usgov-coderdemo-external-secrets" + +webhook: + image: + repository: "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/external-secrets/external-secrets" + tag: "v2.6.0" + pullPolicy: IfNotPresent + +certController: + image: + repository: "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/ghcr/external-secrets/external-secrets" + tag: "v2.6.0" + pullPolicy: IfNotPresent diff --git a/docs/00-INDEX.md b/docs/00-INDEX.md index acfec90..054c4d4 100644 --- a/docs/00-INDEX.md +++ b/docs/00-INDEX.md @@ -19,6 +19,7 @@ live result. - [as-built/README.md](as-built/README.md) (index) - [as-built/00-overview.md](as-built/00-overview.md): architecture + flows - [as-built/80-iac-vs-imperative.md](as-built/80-iac-vs-imperative.md): declarative vs imperative ledger +- [as-built/85-secrets-management.md](as-built/85-secrets-management.md): secrets via ESO + AWS Secrets Manager - [as-built/90-operations-runbook.md](as-built/90-operations-runbook.md): day-2 ops ## Architecture diff --git a/docs/as-built/80-iac-vs-imperative.md b/docs/as-built/80-iac-vs-imperative.md index 186407a..5066ab2 100644 --- a/docs/as-built/80-iac-vs-imperative.md +++ b/docs/as-built/80-iac-vs-imperative.md @@ -68,6 +68,8 @@ Terraform resources (created by the mirror script, see below). | ECR repositories + image mirroring (5 images) | `scripts/mirror-images.sh` (crane) + `scripts/images.txt` | live `aws ecr describe-repositories` | | k8s Secrets `coder-db`, `coder-oidc`, `coder-ai`, `coder-external-auth` | `kubectl create secret` | `deploy/coder/secrets.example.yaml`; `deploy/platform/README.md` | | k8s Secrets `keycloak-db`, `keycloak-admin`, `gitlab-secrets` | `kubectl create secret` | `deploy/keycloak/README.md`, `deploy/gitlab/README.md` | +| Migrate the 9 runtime Secrets to AWS Secrets Manager + sync via External Secrets Operator (IRSA) | Helm (ESO chart 2.6.0) + `scripts/migrate-secrets-to-asm.py` + ExternalSecret CRDs | live ESO; `docs/as-built/85-secrets-management.md` | +| ESO IRSA role `usgov-coderdemo-external-secrets` (Secrets Manager read) | AWS CLI/IAM | live IAM; codified in `terraform/secrets-hardening.tf` | | Workspace RBAC in `coder-workspaces` | `kubectl apply -f deploy/platform/workspace-rbac.yaml` | live `kubectl get role -n coder-workspaces` | | Keycloak Deployment/Service/Ingress + realm `coder` import | `kubectl apply -k deploy/keycloak/` | `deploy/keycloak/*`; live pod `keycloak` | | GitLab StatefulSet/Service/Ingress (embedded Postgres) | `kubectl apply -f deploy/gitlab/*` | `deploy/gitlab/*`; live pod `gitlab-0` | @@ -113,9 +115,13 @@ expands it with every imperative item found above. Ordered roughly by layer. 9. Manage ECR repositories as `aws_ecr_repository` (the registry host is already an output); keep image mirroring (`scripts/mirror-images.sh`) as an explicit pipeline step since image content is not Terraform's job. -10. Decide a source of truth for Kubernetes Secrets (`coder-db`, `coder-oidc`, - `coder-ai`, `coder-external-auth`, `keycloak-db`, `keycloak-admin`, - `gitlab-secrets`); keep real values out of git. +10. Source of truth for Kubernetes Secrets: RESOLVED. The 9 runtime app Secrets + now live in AWS Secrets Manager (`usgov-coderdemo/*`) and are synced by the + External Secrets Operator via IRSA (`docs/as-built/85-secrets-management.md`). + Remaining: import the live ESO IAM role into Terraform + (`terraform/secrets-hardening.tf`) before a reconciliation apply, and enable + EKS Secrets envelope encryption with the customer-managed KMS key defined in + that file (IRREVERSIBLE, not yet applied). 11. Manage workspace RBAC (`coder-workspaces` Role/RoleBinding) declaratively. 12. Manage Keycloak (Deployment/Service/Ingress + realm import) and GitLab (StatefulSet/Service/Ingress) manifests under a GitOps or Terraform path. diff --git a/docs/as-built/85-secrets-management.md b/docs/as-built/85-secrets-management.md new file mode 100644 index 0000000..f80198c --- /dev/null +++ b/docs/as-built/85-secrets-management.md @@ -0,0 +1,114 @@ +# As-built: secrets management (External Secrets Operator + AWS Secrets Manager) + +Runtime secrets are sourced from **AWS Secrets Manager** (ASM) and synced into +Kubernetes by the **External Secrets Operator** (ESO), which authenticates to +ASM with **IRSA** (no static AWS keys). ASM is the source of truth; no secret +material is committed to git, and the app control-plane manifests are unchanged +because ESO produces Kubernetes Secrets with the same names and keys the apps +already reference. + +This replaces the earlier bootstrap approach (secrets generated into a local +gitignored file and applied as plain `kubectl create secret`). That file +(`~/.config/usgov-coderdemo/generated-secrets.env`) is retained only as the +break-glass bootstrap source for setup scripts; it is gitignored. + +## Flow + +``` +AWS Secrets Manager (usgov-coderdemo/*) + | GetSecretValue / DescribeSecret (IRSA: usgov-coderdemo-external-secrets) + v +External Secrets Operator (ns external-secrets) + ClusterSecretStore "aws-secretsmanager" -> ExternalSecret (per app secret) + | writes/owns + v +Kubernetes Secret (coder/keycloak/gitlab ns) -> consumed by app pods (secretKeyRef) +``` + +## What runs where + +| Piece | Detail | +|---|---| +| ESO | Helm chart `external-secrets` 2.6.0, ns `external-secrets` (controller + webhook + cert-controller, all 1/1). Image from the ECR mirror `ghcr/external-secrets/external-secrets:v2.6.0`. Values: `deploy/platform/external-secrets/values.yaml`. | +| IRSA role | `usgov-coderdemo-external-secrets`. Trust: `system:serviceaccount:external-secrets:external-secrets`. Policy: `secretsmanager:GetSecretValue` + `DescribeSecret` on `arn:aws-us-gov:secretsmanager:us-gov-west-1:430737322961:secret:usgov-coderdemo/*` only. Codified in `terraform/secrets-hardening.tf`. | +| Store | `ClusterSecretStore/aws-secretsmanager` (AWS SecretsManager, region us-gov-west-1, `auth.jwt.serviceAccountRef` -> the ESO controller SA). Status `Valid`. | +| ExternalSecrets | One per app secret (`deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml`), `dataFrom.extract`, `creationPolicy: Owner`, `refreshInterval: 1h`. | + +## ASM secret layout + +Each ASM secret is a JSON object whose keys match the target Kubernetes Secret +keys. ESO `extract` materializes them 1:1. + +| ASM secret | JSON keys | Kubernetes Secret (ns/name) | +|---|---|---| +| usgov-coderdemo/coder/db | url | coder/coder-db | +| usgov-coderdemo/coder/oidc | client-secret | coder/coder-oidc | +| usgov-coderdemo/coder/ai | ANTHROPIC_API_KEY | coder/coder-ai | +| usgov-coderdemo/coder/external-auth | gitlab-client-id, gitlab-client-secret | coder/coder-external-auth | +| usgov-coderdemo/coder/provisioner-alpha | key | coder/coder-provisioner-alpha | +| usgov-coderdemo/coder/provisioner-bravo | key | coder/coder-provisioner-bravo | +| usgov-coderdemo/keycloak/admin | username, password | keycloak/keycloak-admin | +| usgov-coderdemo/keycloak/db | username, password | keycloak/keycloak-db | +| usgov-coderdemo/gitlab/secrets | initial_root_password | gitlab/gitlab-secrets | + +`usgov-coderdemo/rds/master` (the RDS master credential) predates this and is +managed by Terraform; the apps do not read it. + +## Migration (one time, idempotent) + +`scripts/migrate-secrets-to-asm.py` reads the live Kubernetes Secrets (the prior +source of truth) and writes each as a JSON ASM secret. Values are passed to the +AWS CLI via mode-600 temp files, never on the command line. ESO then adopted the +existing Secrets in place. + +## Verification (performed) + +- ClusterSecretStore: `Ready=True reason=Valid` (IRSA to ASM works). +- All 9 ExternalSecrets: `SecretSynced=True`. +- ESO adopted the pre-existing Secrets with byte-identical data (sha256 of the + data map matched before and after for all 9), so running pods were not + disrupted. ESO now owns them (`reconcile.external-secrets.io/managed=true`, + ownerReference to the ExternalSecret). +- Recovery proven: deleting `coder/coder-ai` caused ESO to rebuild it from ASM + within seconds with the identical value. + +## Operational notes + +- **Rotation:** update the value in ASM (or `put-secret-value`). ESO refreshes + the Kubernetes Secret within `refreshInterval` (1h) or immediately if the + Secret is deleted. Pods that read a secret as an env var (`secretKeyRef`) only + pick up a new value on restart; roll the relevant Deployment after rotation. +- **Least privilege:** the ESO role can only read `usgov-coderdemo/*` and cannot + write to ASM. Rotation is a separate, deliberate action. +- **No secrets in git:** only `deploy/*/secrets.example.yaml` placeholders are + committed. Real values live in ASM; the local `generated-secrets.env` is + gitignored and outside the repo. + +## Still on the backlog + +- **EKS Secrets envelope encryption with a customer-managed KMS key.** Today the + cluster uses only the default AWS-managed etcd encryption + (`encryptionConfig=null`). `terraform/secrets-hardening.tf` defines the CMK and + documents the `encryption_config` to add to the cluster. Enabling it is + IRREVERSIBLE and triggers a re-encrypt, so it is intentionally not applied yet; + it needs an explicit maintenance decision. +- **Fold the live ESO IAM role into a Terraform apply** (created via CLI; import + before apply). See `docs/as-built/80-iac-vs-imperative.md`. + +## Reproduce + +``` +. ~/.config/usgov-coderdemo/env +export KUBECONFIG=./kubeconfig +# 1. mirror the ESO image (already in scripts/images.txt) +scripts/mirror-images.sh +# 2. ESO IAM role: see terraform/secrets-hardening.tf (or the CLI in git history) +# 3. install ESO +helm upgrade --install external-secrets external-secrets/external-secrets \ + --version 2.6.0 -n external-secrets --create-namespace \ + -f deploy/platform/external-secrets/values.yaml +# 4. seed ASM from the current cluster secrets +python3 scripts/migrate-secrets-to-asm.py +# 5. store + ExternalSecrets +kubectl apply -f deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml +``` diff --git a/docs/as-built/README.md b/docs/as-built/README.md index f676e28..9bc60d8 100644 --- a/docs/as-built/README.md +++ b/docs/as-built/README.md @@ -21,6 +21,7 @@ docs explain the *how* and *why* behind that status. | [60-ai-gateway.md](60-ai-gateway.md) | AI Gateway / AI Bridge: DB-managed providers (`anthropic` direct + `anthropic-bedrock` IRSA), name-based routing, the end-to-end request flow, and the remaining action to make AI respond. | | [70-workspace-templates.md](70-workspace-templates.md) | The `claude-code` workspace template: pod/PVC, the claude-code module (4.7.3), Coder Tasks, parameters, and the required GitLab external auth. | | [80-iac-vs-imperative.md](80-iac-vs-imperative.md) | The declarative-versus-imperative ledger and the Terraform reconciliation backlog. | +| [85-secrets-management.md](85-secrets-management.md) | Runtime secrets via External Secrets Operator + AWS Secrets Manager (IRSA): ASM layout, migration, verification, and the EKS CMK backlog. | | [90-operations-runbook.md](90-operations-runbook.md) | Day-2 operations: env/kubeconfig, API/CLI login, Helm upgrade, template push, image mirroring, banner, health checks, known gaps. | ## One thing to know before reading diff --git a/scripts/images.txt b/scripts/images.txt index 3f7f7dd..ab9d18d 100644 --- a/scripts/images.txt +++ b/scripts/images.txt @@ -19,3 +19,7 @@ docker.io/gitlab/gitlab-ce:19.0.1-ce.0 # --- Workspace base image for the Claude Code template (coder-templates/claude-code) --- docker.io/codercom/enterprise-base:ubuntu-noble-20260601 + +# --- External Secrets Operator (deploy/platform/external-secrets) --- +# Controller, webhook, and cert-controller all use this single image. +ghcr.io/external-secrets/external-secrets:v2.6.0 diff --git a/scripts/migrate-secrets-to-asm.py b/scripts/migrate-secrets-to-asm.py new file mode 100755 index 0000000..8dea5fe --- /dev/null +++ b/scripts/migrate-secrets-to-asm.py @@ -0,0 +1,92 @@ +#!/usr/bin/env python3 +""" +migrate-secrets-to-asm.py - copy the demo's runtime Kubernetes Secrets into AWS +Secrets Manager under the usgov-coderdemo/* prefix, so External Secrets Operator +can sync them back into the cluster (ASM becomes the source of truth). + +Idempotent: creates each ASM secret if missing, otherwise puts a new value. +Reads the live cluster secrets (the current source of truth) and writes the +exact same key/value map as a JSON ASM secret. Secret values are passed to the +AWS CLI via mode-600 temp files (file://), never on the command line. + +Usage: + . ~/.config/usgov-coderdemo/env + export KUBECONFIG=./kubeconfig + python3 scripts/migrate-secrets-to-asm.py [--dry-run] +""" +import base64 +import json +import os +import subprocess +import sys +import tempfile + +REGION = os.environ.get("AWS_DEFAULT_REGION", "us-gov-west-1") +DRY = "--dry-run" in sys.argv[1:] + +# ASM secret name -> (namespace, kubernetes secret name) +MAPPING = { + "usgov-coderdemo/coder/db": ("coder", "coder-db"), + "usgov-coderdemo/coder/oidc": ("coder", "coder-oidc"), + "usgov-coderdemo/coder/ai": ("coder", "coder-ai"), + "usgov-coderdemo/coder/external-auth": ("coder", "coder-external-auth"), + "usgov-coderdemo/coder/provisioner-alpha": ("coder", "coder-provisioner-alpha"), + "usgov-coderdemo/coder/provisioner-bravo": ("coder", "coder-provisioner-bravo"), + "usgov-coderdemo/keycloak/admin": ("keycloak", "keycloak-admin"), + "usgov-coderdemo/keycloak/db": ("keycloak", "keycloak-db"), + "usgov-coderdemo/gitlab/secrets": ("gitlab", "gitlab-secrets"), +} + + +def sh(args, check=True, capture=True): + return subprocess.run(args, check=check, + stdout=subprocess.PIPE if capture else None, + stderr=subprocess.PIPE) + + +def read_k8s_secret(ns, name): + out = sh(["kubectl", "-n", ns, "get", "secret", name, "-o", "json"]).stdout + data = json.loads(out).get("data", {}) + return {k: base64.b64decode(v).decode("utf-8") for k, v in data.items()} + + +def asm_exists(name): + r = sh(["aws", "secretsmanager", "describe-secret", "--region", REGION, + "--secret-id", name], check=False) + return r.returncode == 0 + + +def put_asm(name, payload): + fd, path = tempfile.mkstemp(prefix="asm-", suffix=".json") + try: + os.fchmod(fd, 0o600) + with os.fdopen(fd, "w") as f: + json.dump(payload, f) + ref = "file://" + path + if asm_exists(name): + sh(["aws", "secretsmanager", "put-secret-value", "--region", REGION, + "--secret-id", name, "--secret-string", ref]) + return "updated" + else: + sh(["aws", "secretsmanager", "create-secret", "--region", REGION, + "--name", name, + "--description", "usgov-coderdemo demo secret (synced to k8s by ESO)", + "--secret-string", ref]) + return "created" + finally: + os.unlink(path) + + +def main(): + for asm_name, (ns, k8s_name) in MAPPING.items(): + payload = read_k8s_secret(ns, k8s_name) + keys = ",".join(sorted(payload)) + if DRY: + print(f"[dry-run] {asm_name} <- {ns}/{k8s_name} keys=[{keys}]") + continue + action = put_asm(asm_name, payload) + print(f"{action:8} {asm_name} <- {ns}/{k8s_name} keys=[{keys}]") + + +if __name__ == "__main__": + main() diff --git a/terraform/secrets-hardening.tf b/terraform/secrets-hardening.tf new file mode 100644 index 0000000..7905aea --- /dev/null +++ b/terraform/secrets-hardening.tf @@ -0,0 +1,94 @@ +# ============================================================================= +# secrets-hardening.tf - IaC for the secrets-management hardening. +# ============================================================================= +# RECONCILIATION BACKLOG. These resources describe the desired state. The live +# environment was built imperatively (the External Secrets Operator IAM role was +# created with the AWS CLI; see scripts and docs/as-built/85-secrets-management.md), +# so on a reconciliation pass the existing role must be imported before apply: +# +# terraform import aws_iam_role.external_secrets usgov-coderdemo-external-secrets +# +# Reuses aws_iam_openid_connect_provider.eks and locals from irsa.tf. + +# --- External Secrets Operator IRSA role ------------------------------------ +# ESO's controller ServiceAccount (external-secrets/external-secrets) assumes +# this role to read demo secrets from AWS Secrets Manager. No static AWS keys. +data "aws_iam_policy_document" "external_secrets_assume" { + statement { + actions = ["sts:AssumeRoleWithWebIdentity"] + effect = "Allow" + + principals { + type = "Federated" + identifiers = [aws_iam_openid_connect_provider.eks.arn] + } + + condition { + test = "StringEquals" + variable = "${local.oidc_issuer_host}:sub" + values = ["system:serviceaccount:external-secrets:external-secrets"] + } + + condition { + test = "StringEquals" + variable = "${local.oidc_issuer_host}:aud" + values = ["sts.amazonaws.com"] + } + } +} + +resource "aws_iam_role" "external_secrets" { + name = "${var.cluster_name}-external-secrets" + assume_role_policy = data.aws_iam_policy_document.external_secrets_assume.json + description = "External Secrets Operator: read usgov-coderdemo/* from Secrets Manager (IRSA)" +} + +# Least-privilege: read only the demo's secrets, no write, no other prefixes. +data "aws_iam_policy_document" "external_secrets" { + statement { + sid = "ReadDemoSecrets" + effect = "Allow" + actions = [ + "secretsmanager:GetSecretValue", + "secretsmanager:DescribeSecret", + ] + resources = [ + "arn:${data.aws_partition.current.partition}:secretsmanager:${var.region}:${data.aws_caller_identity.current.account_id}:secret:usgov-coderdemo/*", + ] + } +} + +resource "aws_iam_role_policy" "external_secrets" { + name = "secretsmanager-read" + role = aws_iam_role.external_secrets.id + policy = data.aws_iam_policy_document.external_secrets.json +} + +# --- EKS Secrets envelope encryption (customer-managed KMS) ------------------ +# Backlog hardening: encrypt Kubernetes Secrets at rest in etcd with a CMK, on +# top of the default AWS-managed etcd encryption. Enabling envelope encryption +# on a cluster is IRREVERSIBLE and triggers a re-encrypt, so it is NOT applied +# yet; it needs an explicit maintenance decision. To enable, create the key +# below and add an `encryption_config` block to aws_eks_cluster.this in eks.tf: +# +# encryption_config { +# provider { key_arn = aws_kms_key.eks_secrets.arn } +# resources = ["secrets"] +# } +# +# Once the key exists, this can also be enabled out of band with: +# aws eks associate-encryption-config --cluster-name usgov-coderdemo \ +# --encryption-config '[{"provider":{"keyArn":""},"resources":["secrets"]}]' +resource "aws_kms_key" "eks_secrets" { + description = "${var.cluster_name} EKS Secrets envelope encryption" + deletion_window_in_days = 14 + enable_key_rotation = true + tags = { + Name = "${var.cluster_name}-eks-secrets" + } +} + +resource "aws_kms_alias" "eks_secrets" { + name = "alias/${var.cluster_name}-eks-secrets" + target_key_id = aws_kms_key.eks_secrets.key_id +} From 612481c4811df18ef990bd714ff2afea68c9e433 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 18:54:08 +0000 Subject: [PATCH 07/16] docs(plans): AWS-native observability and GitOps adoption plans Add three design-only plans (nothing applied to the live environment) with companion GitHub issues, plus an index. - plans/observability-aws-native.md: the production AWS-native target the in-cluster Prometheus/Grafana stack should evolve into (Amazon Managed Prometheus + Grafana for metrics; CloudWatch -> Firehose -> S3 -> Athena with an optional Amazon Security Lake OCSF path for audit/SIEM). Issues #13-#20. Grounded in read-only us-gov-west-1 calls: AMP managed scraper is absent in GovCloud (self-managed ADOT + SigV4), AMG auth via SAML to Keycloak (IAM Identity Center not enabled), Security Lake optional. - plans/gitops-control-plane.md: Argo CD control plane sourced from the in-cluster GitLab, app-of-apps over the existing deploy/ paths, adopt-in-place (manual sync, no prune, no self-heal). Issues #6-#12. - plans/gitops-adoption.md: per-workload GitOps adoption and the non-Argo state (Coder API via Argo PostSync Jobs, Keycloak via keycloak-config-cli, AWS stays Terraform). Issues #21-#29. GitOps live migration is deliberately deferred: leave the current imperative state in place and adopt it later. Generated by Coder Agents. --- docs/plans/README.md | 29 ++ docs/plans/gitops-adoption.md | 359 +++++++++++++++++ docs/plans/gitops-control-plane.md | 522 +++++++++++++++++++++++++ docs/plans/observability-aws-native.md | 458 ++++++++++++++++++++++ 4 files changed, 1368 insertions(+) create mode 100644 docs/plans/README.md create mode 100644 docs/plans/gitops-adoption.md create mode 100644 docs/plans/gitops-control-plane.md create mode 100644 docs/plans/observability-aws-native.md diff --git a/docs/plans/README.md b/docs/plans/README.md new file mode 100644 index 0000000..67ac897 --- /dev/null +++ b/docs/plans/README.md @@ -0,0 +1,29 @@ +# Plans (design proposals) + +Forward-looking design documents for the GovCloud Coder demo. These are +**proposals for LATER adoption**: nothing in this directory has been applied to +the live cluster, AWS, Coder, Keycloak, or GitLab. Each plan has companion +GitHub issues that track the implementation work. + +The engineering record of what is actually deployed lives in +[`../as-built/`](../as-built/README.md); this directory describes where parts of +that deployment are intended to evolve. + +| Plan | Scope | Issues | +|---|---|---| +| [observability-aws-native.md](observability-aws-native.md) | The production, AWS-native observability + audit target the in-cluster Prometheus/Grafana stack should evolve into: Amazon Managed Prometheus + Grafana for metrics, and CloudWatch -> Firehose -> S3 -> Athena with an optional Amazon Security Lake (OCSF) path for audit/SIEM. | #13-#20 | +| [gitops-control-plane.md](gitops-control-plane.md) | The GitOps control plane and bootstrap: Argo CD installed in-cluster, sourcing from the in-cluster GitLab, with an app-of-apps over the existing `deploy/` paths and a non-disruptive adopt-in-place strategy. | #6-#12 | +| [gitops-adoption.md](gitops-adoption.md) | Per-workload GitOps adoption details and the application state a GitOps controller cannot natively reconcile (Coder API config via Argo Jobs, Keycloak realm via keycloak-config-cli, AWS substrate stays Terraform). | #21-#29 | + +## Relationship between the plans + +- The two GitOps plans are siblings: **gitops-control-plane** decides and + bootstraps the controller (the "where it syncs from" and "how it is + installed"), while **gitops-adoption** designs, per workload, how each live + resource is adopted without disruption. They deliberately do not duplicate + each other. +- **observability-aws-native** is independent of GitOps: it is the managed-AWS + target for the observability stack documented as-built in + [`../as-built/55-observability.md`](../as-built/55-observability.md). Its + Phase 0 (enable Coder Prometheus metrics + JSON audit logging) is already done + by the in-cluster build; the remaining phases are the AWS-native migration. diff --git a/docs/plans/gitops-adoption.md b/docs/plans/gitops-adoption.md new file mode 100644 index 0000000..60be97d --- /dev/null +++ b/docs/plans/gitops-adoption.md @@ -0,0 +1,359 @@ +# Plan: per-workload GitOps adoption and non-Kubernetes app state + +Status: PLAN ONLY. Nothing changes now. This is a design for a LATER, deliberate +adoption. Every step below is non-disruptive by construction: the live resources +keep running and a GitOps controller adopts them in place. + +Scope boundary with the sibling plan: a separate effort designs the GitOps +**control plane** (the Argo CD vs Flux choice, the in-cluster GitLab as the git +source, bootstrap, app-of-apps, and repo layout). This document assumes that +control plane exists and instead designs, per workload, **how each live workload +is adopted into GitOps without disruption**, plus **how to handle the state a +GitOps controller cannot natively reconcile**. Control-plane bootstrap issues are +not duplicated here. + +This plan uses Argo CD terminology (Application, sync waves, PreSync/PostSync +hooks, resource tracking) because it is the most common choice and the sibling +plan is leaning that way; the same techniques map to Flux (Kustomization, +HelmRelease, dependsOn, health gates, and Flux Kustomize health checks). + +Grounding: `STATUS.md`, `docs/as-built/` (read in full), and the live `deploy/`, +`scripts/`, and `terraform/` trees. Investigation was read-only; the cluster was +not reachable from the planning workspace (no in-boundary AWS CLI), so live diffs +are part of the execution steps, not this plan. + +## 1. What we are adopting + +Three classes of state, handled differently: + +1. **Helm releases** (4): `coder`, `ingress-nginx`, + `aws-load-balancer-controller`, `external-secrets`. These are CLI-installed + Helm releases. GitOps adopts them in place. +2. **kubectl-applied manifests**: keycloak (kustomize), gitlab (StatefulSet), the + 2 Coder provisioner Deployments, the ExternalSecrets + ClusterSecretStore, + workspace RBAC, the `gp3` StorageClass, plus the **new** in-cluster monitoring + stack. Most are already YAML in git; a few are live-only and must be authored + into git before adoption. +3. **State a GitOps controller cannot natively reconcile**: Coder application + config applied through the Coder API, Keycloak realm config applied through the + Keycloak Admin API, and the AWS substrate (Terraform). Each gets a dedicated + strategy in section 6. + +## 2. Per-workload adoption table + +Source chart facts come from `versions.lock.yaml`, `deploy/CONVENTIONS.md`, +`deploy/platform/README.md`, and `docs/as-built/`. "Type" is how the GitOps +controller renders the source (Helm, Kustomize, or plain directory of manifests). + +| Workload | Type | Source (chart/version + values, or manifest path) | Adoption method | Diff and landmine notes | +|---|---|---|---|---| +| coder | Helm | chart `coder` 2.34.0 (repo `helm.coder.com/v2`), values `deploy/coder/values.yaml`, ns `coder`, live revs v1..v4 | Argo Application, Helm source, in place | AI Gateway provider env vars are **seed-once** with a drift guard (`docs/as-built/30-coder-control-plane.md`): editing a seeded `CODER_AI_GATEWAY_PROVIDER_*` value or the `coder-ai` secret makes coderd refuse to start. Freeze that env block and manage providers through the DB/API. License, appearance banner, and IdP sync are DB state, not Helm (section 6). SA `coder` carries the Bedrock IRSA annotation; keep it. | +| ingress-nginx | Helm | chart `ingress-nginx` 4.15.1 (repo `kubernetes.github.io/ingress-nginx`), values `deploy/platform/ingress-nginx-values.yaml`, ns `ingress-nginx`, live rev v1 | Argo Application, Helm source, in place | The controller `Service` (type LoadBalancer) **owns the live internet-facing NLB** that all DNS aliases point to. A recreate of that Service re-provisions a new NLB and breaks `dev`/`auth`/`gitlab`/`*` DNS. The benign-diff gate must show **zero** change on the Service `.spec` and its six `aws-load-balancer-*` annotations. Add `ignoreDifferences` for Service `.status` and any LB-controller-mutated fields. | +| aws-load-balancer-controller | Helm | chart `aws-load-balancer-controller` (`eks-charts`), ns `kube-system`, live rev v1, **no values file committed** | Author values, then Argo Application, Helm source, in place | **No committed values file** (installed with CLI flags). Reconstruct desired values from the live release (`helm get values`) before adoption: `clusterName`, `region=us-gov-west-1`, `vpcId`, image from the ECR mirror, and the controller `serviceAccount` + its IRSA role (role name unverified in the as-built ledger; capture it live). Owns CRDs `TargetGroupBinding` and `IngressClassParams`; use ServerSideApply so the large CRDs do not hit the last-applied annotation limit. It actively reconciles the ingress-nginx NLB, so adopt it before or together with ingress-nginx. | +| external-secrets | Helm | chart `external-secrets` 2.6.0 (repo `external-secrets`), values `deploy/platform/external-secrets/values.yaml`, ns `external-secrets`, live components controller+webhook+cert-controller | Argo Application, Helm source, in place | Chart sets `installCRDs: true` and `crds.createClusterSecretStore: true`. The ClusterSecretStore also exists in the manifests file (next row), so pick **one** owner to avoid a fight: let the operator Application own only the operator + CRDs, and let a separate Application own the `ClusterSecretStore` + `ExternalSecret` CRs. Use ServerSideApply for CRDs. | +| ClusterSecretStore + 9 ExternalSecrets | Kustomize/dir | `deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml` | Argo Application, in place | Already declarative and git-friendly. CRs must sync **after** the ESO operator + CRDs are healthy (sync wave). ESO already owns the 9 target Secrets with `creationPolicy: Owner`; adoption is metadata-only. ASM is the source of truth, no secret material in git. | +| keycloak | Kustomize | `deploy/keycloak/` (deployment, service, ingress + `configMapGenerator` of `realm-coder.json`, `disableNameSuffixHash: true`) | Argo Application, Kustomize source, in place | ConfigMap name `keycloak-realm-coder` is stable (hash suffix disabled), so adoption is clean. `start --import-realm` only imports on first boot and skips an existing realm, so groups/mapper/persona users are **not** reconciled by re-apply; that is realm API state (section 6.2). | +| gitlab | dir | `deploy/gitlab/statefulset.yaml`, `service.yaml`, `ingress.yaml`, ServiceAccount | Argo Application, in place | StatefulSet with `volumeClaimTemplates` (3 RWO gp3 PVCs holding the only copy of GitLab + embedded Postgres data). **Never** use `Replace=true` and never delete/recreate: that orphans or destroys the data PVCs. Pod template and `volumeClaimTemplates` selectors are immutable; ServerSideApply with annotation tracking avoids touching them. | +| coder-provisioner-alpha / -bravo | dir | `deploy/coder/provisioners.yaml` (2 Deployments, ns `coder`) | Argo Application, in place | Clean. They consume `coder-provisioner-{alpha,bravo}` secrets (ESO from ASM) and the org-scoped provisioner key. The key is create-once API state (section 6.1). Labels are `app.kubernetes.io/*`, fine under annotation tracking. | +| workspace RBAC | dir | `deploy/platform/workspace-rbac.yaml` (Role + RoleBinding in `coder-workspaces`) | Argo Application, in place | Clean, low risk. The Coder Helm chart also makes a same-named Role in ns `coder`; keep them in separate Applications so the two `coder-workspace-perms` Roles do not collide. | +| gp3 StorageClass | dir | **Live only; not in git** (`kubectl apply` during build) | Author manifest from live, then Argo Application | Reconstruct from `kubectl get sc gp3 -o yaml` (strip runtime fields), keep the `storageclass.kubernetes.io/is-default-class` annotation. Cluster-scoped, key fields immutable; adopt in place, never delete/recreate (would interrupt dynamic provisioning). | +| namespaces | dir | Partly created via Helm `--create-namespace` / ad hoc | Author manifests or use `CreateNamespace=true` | `coder`, `coder-workspaces`, `keycloak`, `gitlab`, `ingress-nginx`, `external-secrets`, plus a new `monitoring`. Author explicit Namespace manifests so ownership is unambiguous and labels (for example `pod-security`) are declarative. | +| monitoring (Prometheus/Grafana) | Helm | **New, greenfield** (`kube-prometheus-stack`), ns `monitoring` | Install **fresh under GitOps**, not an adoption | Being added now. Install it through the GitOps controller from day one so there is no CLI release to adopt later. Needs ECR-mirrored images, gp3-backed PVCs, a Grafana admin secret via ESO/ASM, and (if exposed) a `grafana.usgov.coderdemo.io` ingress under the existing wildcard cert. | + +## 3. Helm release adoption: the ownership and label landmines + +A Helm CLI install and a GitOps controller track ownership differently, and the +gap is where adoption breaks if done naively. + +### 3.1 How the two systems mark ownership + +- **Helm CLI** records release state in a Secret + `sh.helm.release.v1..` (labels `owner=helm`), and stamps every + rendered object with `app.kubernetes.io/managed-by: Helm` plus the annotations + `meta.helm.sh/release-name` and `meta.helm.sh/release-namespace`. +- **Argo CD** renders a Helm chart with `helm template` (no Tiller, no release + Secret) and applies the output. By default it tracks ownership with the label + `app.kubernetes.io/instance`. + +### 3.2 The label collision (the main landmine) + +Helm charts already set `app.kubernetes.io/instance` to the release name, and +many charts put that label inside **immutable** selectors +(`Deployment.spec.selector`, `StatefulSet.spec.selector`, Service selectors). If +Argo's default label tracking writes a different `app.kubernetes.io/instance` +value, it will try to mutate an immutable selector and the sync fails, or it will +fight the chart on every reconcile. + +Mitigation (set once on the GitOps control plane, so noted here only as a +dependency): switch Argo's resource tracking to the annotation method +(`application.resourceTrackingMethod: annotation`, tracking via +`argocd.argoproj.io/tracking-id`). Argo then never touches +`app.kubernetes.io/instance`. This is mandatory before adopting any of the four +Helm releases. + +### 3.3 The stale Helm release Secret + +After Argo adopts a release via `helm template`, the old +`sh.helm.release.v1..*` Secrets remain but are inert (`helm list` may still +show the release; it is no longer the source of truth). Keep them until adoption +is verified for rollback, then delete them to avoid two apparent owners. + +### 3.4 CRDs + +`external-secrets` and `aws-load-balancer-controller` ship CRDs. CRDs are large +and exceed the client-side last-applied-configuration annotation limit, so adopt +them with **ServerSideApply=true**. Decide explicitly whether the chart manages +CRDs (`installCRDs: true` for ESO today) or whether CRDs are split into their own +Application; do not let two Applications both own a CRD. + +### 3.5 Verifying a benign diff before the first sync + +For each release, before flipping the Application to a synced/managed state: + +1. Render exactly what GitOps will apply: + `helm template --version -n -f `. +2. Server-side dry-run diff against live: `kubectl diff -f rendered.yaml` + (or `argocd app diff ` once the Application exists, unsynced). +3. **Accept only metadata diffs**: the `managed-by` label flipping from `Helm`, + the added `argocd.argoproj.io/tracking-id` annotation, and removal of the + `meta.helm.sh/*` annotations. +4. **Block on any spec diff**: zero change to the ingress-nginx Service `.spec` + and its `aws-load-balancer-*` annotations, to any Deployment/StatefulSet + selector or pod template, to image tags, to replica counts, to CRD specs, and + to StorageClass parameters. A spec diff means the committed values do not match + the live release and must be reconciled in git first. +5. Add `ignoreDifferences` (with `RespectIgnoreDifferences=true`) for fields that + controllers or webhooks mutate: the ingress-nginx Service `.status`, + LB-controller-added annotations, and ESO-managed Secret `data`. +6. Sync with **ServerSideApply=true** and **Replace=false**. + +### 3.6 Per release + +- **coder**: values already match the live release in `deploy/coder/values.yaml`. + The only behavioral trap is the seed-once AI provider env block plus drift + guard; freeze it (or remove it after providers are managed in the DB, per + `docs/as-built/30-coder-control-plane.md`). DB-only state (license, banner, IdP + sync) is section 6.1. +- **ingress-nginx**: the highest-risk adoption because the Service owns the NLB. + Gate hard on a zero Service-spec diff. +- **aws-load-balancer-controller**: reconstruct the missing values from the live + release first (section 2). Adopt before/with ingress-nginx since it reconciles + that NLB. +- **external-secrets**: split operator+CRDs from the CRs; ServerSideApply for the + CRDs. + +## 4. kubectl-applied manifest adoption + +These are plain manifests or kustomize; Argo renders them natively. Adoption is +metadata-only (add the tracking annotation, set the Application to own them). + +- **keycloak (kustomize)**: point an Application at `deploy/keycloak/`. Clean + because the generated realm ConfigMap has a stable name. Realm content drift is + handled in section 6.2, not by this Application. +- **gitlab (StatefulSet)**: protect the data PVCs. Annotation tracking + + ServerSideApply, `Replace=false`, and a sync policy that never prunes the PVCs. + Treat the StatefulSet selector and `volumeClaimTemplates` as immutable. +- **coder provisioner Deployments**: straightforward; depend on the ESO-synced + provisioner-key secrets existing first (sync wave after ESO). +- **workspace RBAC**: straightforward Role/RoleBinding adoption. +- **gp3 StorageClass and namespaces**: author the missing manifests into git from + the live objects first (section 2), then adopt. Cluster-scoped, immutable key + fields, adopt in place. + +## 5. ESO and ASM secrets slot into GitOps cleanly + +The secrets layer is already the GitOps-friendly part of the stack +(`docs/as-built/85-secrets-management.md`): + +- The 9 `ExternalSecret` CRs and the `ClusterSecretStore` are declarative YAML in + `deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml`, with no + secret material in git. They commit as-is. +- ASM is the source of truth; ESO authenticates with IRSA (no static keys) and + owns the 9 target Secrets with `creationPolicy: Owner`. A GitOps controller + reconciles the CRs; ESO reconciles the actual secret data out of band. Set + Argo to **ignore the managed Secret `data`** so it never shows spurious drift on + values it cannot see. +- Ordering: the ESO operator and its CRDs must be healthy before the + `ClusterSecretStore`/`ExternalSecret` CRs sync, and the target Secrets must + exist before the apps that mount them. Express this with sync waves: ESO + operator (wave 0), CRs (wave 1), apps (wave 2+). +- This makes secrets the cleanest workloads to adopt and a good early proof that + the GitOps plumbing works before touching the NLB-bearing workloads. + +## 6. State a GitOps controller cannot natively reconcile + +Argo and Flux reconcile Kubernetes API objects. They do not natively reconcile +state that lives behind an application API or in AWS. Three areas, each with a +recommendation. + +### 6.1 Coder application config (Coder API + DB state) + +In scope: organizations, group/role IdP sync settings, AI providers, templates, +appearance banner, provisioner keys, and the license. None of these are +Kubernetes objects; they live in the Coder database and are set through the Coder +API. Existing idempotent automation already covers most of it: +`scripts/setup-coder-idp-sync.py` (orgs, groups, org/group/role sync, discover +then PATCH), `scripts/set-appearance.sh` (PUT `/api/v2/appearance`). + +Options considered: + +- **Argo PreSync/PostSync Jobs running the existing scripts** (packaged as a + container image, hooks gated by sync waves). +- A Terraform/Crossplane Coder provider. A community `coderd` Terraform provider + exists and can manage some surfaces (orgs, groups, roles, templates, users) but + does not cover appearance or AI providers, and it adds a second state store and + a separate apply lifecycle alongside Terraform-for-AWS. +- Keep it as a CI pipeline (for example in-boundary GitLab CI). + +Recommendation, **per surface** (one size does not fit all): + +| Coder surface | Mechanism | Why / idempotency / secrets | +|---|---|---| +| Orgs + group/role IdP sync | **Argo PostSync Job** running `setup-coder-idp-sync.py` | Already idempotent (discover then PATCH). Runs after coderd is healthy. Admin creds come from an ESO-synced Secret, never git. | +| Appearance banner | **Argo PostSync Job** running `set-appearance.sh` | Idempotent PUT. Premium-gated; depends on the license being present. Creds via ESO. | +| AI providers | **Argo PostSync Job** that reconciles via the Coder API, reading the key from ASM via ESO at runtime | DB is authoritative with a seed-once env drift guard, so the safe path is API-managed and the Helm provider env frozen. The real `sk-ant-...` key stays in ASM, injected at Job runtime, never in git. | +| Provisioner keys | **One-time bootstrap Job** ("create only if absent in ASM"), key written back to ASM for ESO to sync | Not a reconcile loop: re-creating a key rotates it. Guard on absence to stay idempotent. | +| Templates (`coder templates push`) | **CI pipeline** (in-boundary GitLab CI) on the template repo | A versioned build-and-publish action, not a declarative reconcile; fits CI better than a hook. Pin the template version in git. | +| License (JWT) | **Out-of-band runbook**, value in ASM | Runtime JWT applied by CLI/UI; treat as deliberate break-glass, not reconciled. | + +Net recommendation: **Argo PostSync (and one bootstrap) Jobs for the reconcilable +DB state, CI for templates, runbook for the license.** This reuses the existing, +proven idempotent scripts; keeps all secrets in ASM/ESO and out of git; and +respects the AI provider drift guard by managing providers through the API rather +than the frozen Helm env. Revisit a `coderd` Terraform provider later only if the +managed surface grows enough to justify a second state store. + +Secret-handling implication to call out: the Jobs need a Coder admin session. +Source those admin credentials from an ESO-synced Secret (ASM), scope them +tightly, and prefer a dedicated automation account over the break-glass owner. + +### 6.2 Keycloak realm config (Keycloak Admin API) + +In scope: the realm import, the group tree, the group-membership protocol mapper +on the `coder` client, and the 8 persona users. Today these are created by the +imperative `scripts/setup-keycloak-hierarchy.py`, and `start --import-realm` only +seeds the realm on first boot (it skips an existing realm), so none of this is +reconciled after day one. + +Options considered: keycloak-config-cli as an Argo Job, the Keycloak Operator, or +realm import on boot. + +Recommendation: **keycloak-config-cli as an Argo PostSync Job.** It applies a +git-committed, declarative realm config and reconciles the realm to that desired +state on every run (managed import), covering the realm, groups, the mapper, and +users, which boot-time import cannot. It replaces the imperative hierarchy script +with a declarative file. Secrets (the OIDC client secret, persona passwords) are +injected via ESO-synced env and variable substitution, never committed. + +- Why not the **Keycloak Operator**: the live Keycloak is a plain Deployment, not + operator-managed. Adopting the Operator means re-platforming the Keycloak + instance itself (it manages the workload and realm import CRs), which is a much + larger change than the realm-config problem we are solving. Its + `KeycloakRealmImport` is also import-shaped, not full reconcile. +- Why not **realm import on boot**: it only runs on first boot and skips an + existing realm, so it cannot reconcile drift or apply post-hoc groups, mappers, + or users. That is exactly the current gap. + +Execution notes: mirror the keycloak-config-cli image into ECR; run it as a Job +with admin creds from ESO; commit the realm config with placeholders plus env +substitution. Order it after the keycloak Deployment is healthy. + +### 6.3 AWS substrate and the imperative reconciliation backlog (Terraform) + +This stays **Terraform**, not GitOps. See `docs/as-built/80-iac-vs-imperative.md` +for the full ledger and backlog. A GitOps controller cannot create the cluster, +node group, IRSA roles, Route53 records, ASM secrets, or EKS envelope encryption; +it runs **inside** the cluster those things create. + +Ordering relative to GitOps (the key cross-reference): + +1. **Terraform first.** Fold the imperative backlog into Terraform: standard EKS + (drop Auto Mode), the `mng` node group and `usgov-coderdemo-mngnode` role, the + four EKS addons and the EBS CSI IRSA role, the `gp3`-backing addon, the Route53 + alias records, the ECR repos, and the IRSA roles GitOps depends on (the ESO + role `usgov-coderdemo-external-secrets`, the LB controller role, the coder + Bedrock role). The ESO role was created by CLI, so **import it into Terraform + state before apply** rather than recreating it (recreating breaks ESO auth). + Route53 alias records point at the live NLB; adopt them into Terraform without + delete/recreate so DNS never drops. +2. **Then the GitOps control plane bootstrap** (sibling plan). +3. **Then per-workload adoption** (this plan). + +Independent and deferred: **EKS Secrets envelope encryption with the +customer-managed KMS key** (`terraform/secrets-hardening.tf`) is irreversible and +gated on a maintenance window; it is orthogonal to GitOps and should not block +adoption. + +This is a cross-reference and ordering dependency, not new GitOps work. It is +tracked as a single issue that points at the existing backlog. + +## 7. Ordered adoption sequence + +0. **Prerequisites** (not this plan's issues): + - Sibling: GitOps control plane installed (Argo CD), in-cluster GitLab as the + git source, app-of-apps, and **annotation resource tracking** set (section + 3.2). + - Terraform: substrate backlog reconciled, ESO IRSA role imported, ASM secrets + and IRSA roles and Route53 records present (section 6.3). CMK deferred. +1. **Close the git gaps**: author the `gp3` StorageClass and Namespace manifests + from live, and reconstruct the `aws-load-balancer-controller` Helm values from + the live release. +2. **Adopt foundational, no-data, no-LB objects**: namespaces, `gp3` + StorageClass, workspace RBAC. Lowest risk, proves the plumbing. +3. **Adopt external-secrets**: operator + CRDs (wave 0), then the + `ClusterSecretStore` + 9 `ExternalSecret` CRs (wave 1). Confirms secret + plumbing before app adoption. +4. **Adopt aws-load-balancer-controller** (reconstructed values; it reconciles + the NLB). Zero spec diff required. +5. **Adopt ingress-nginx** (the Service owns the live NLB). Hard gate on a zero + Service-spec diff. +6. **Adopt keycloak and gitlab** (protect the gitlab data PVCs; never Replace). +7. **Adopt coder + the 2 provisioner Deployments** (freeze the AI provider seed + env; benign diff only). +8. **Install the monitoring stack fresh under GitOps** (greenfield, not an + adoption). +9. **Layer the non-Argo app-state controllers**: keycloak-config-cli Job (6.2), + Coder API PostSync/bootstrap Jobs (6.1), CI for templates, runbook for the + license. +10. For every step: render, diff, confirm benign, sync with ServerSideApply, + verify health, then proceed. Keep the prior Helm release Secrets until each + adoption is verified, then delete them. + +## 8. Risks and rollback + +- **NLB re-provision** (ingress-nginx / LB controller): the top risk. Mitigate + with the zero Service-spec diff gate and annotation tracking; rollback is to + unmanage the Application (leave resources in place) and re-pin DNS if needed. +- **StatefulSet data loss** (gitlab): never Replace or prune PVCs; rollback is to + re-point the Application and re-attach the existing PVCs. +- **coderd refuses to start** (AI provider drift guard): freeze the seed env; + manage providers through the API only. +- **Immutable selector mutation** (Helm label collision): fixed by annotation + tracking before any Helm adoption. +- **Double CRD / ClusterSecretStore ownership**: assign exactly one Application + per cluster-scoped object. + +## 9. Issue map + +These adoption work items are filed as GitHub issues on `coder/usgov-coderdemo` +(label `gitops`): + +1. Adopt the `coder` Helm release into GitOps in place (chart 2.34.0). +2. Adopt the `ingress-nginx` Helm release into GitOps in place (chart 4.15.1; owns + the NLB). +3. Adopt the `aws-load-balancer-controller` Helm release into GitOps (reconstruct + values; CRDs). +4. Adopt the `external-secrets` Helm release plus the ClusterSecretStore and + ExternalSecrets into GitOps (sync waves). +5. Adopt the kubectl-applied manifests into GitOps (keycloak, gitlab, provisioner + Deployments, workspace RBAC, `gp3` StorageClass, namespaces). +6. Add the in-cluster monitoring stack (Prometheus/Grafana) GitOps-native. +7. Reconcile Coder API application state via Argo PostSync and bootstrap Jobs. +8. Reconcile the Keycloak realm via keycloak-config-cli as an Argo Job. +9. Cross-reference: Terraform AWS substrate reconcile as a GitOps prerequisite + (ordering only; see `docs/as-built/80-iac-vs-imperative.md`). + +--- + +*Planning document authored by Coder Agents. Read-only investigation; no cluster, +AWS, Coder, or Keycloak state was changed.* diff --git a/docs/plans/gitops-control-plane.md b/docs/plans/gitops-control-plane.md new file mode 100644 index 0000000..5dd9de0 --- /dev/null +++ b/docs/plans/gitops-control-plane.md @@ -0,0 +1,522 @@ +# Plan: GitOps control plane for the GovCloud Coder demo + +Status: PLANNING ONLY. This document is a design proposal for LATER adoption. +Nothing in this plan has been applied. No change has been made to the cluster, +AWS, Coder, Keycloak, GitLab, or git as part of writing it. The investigation +behind it was read-only (`kubectl get`, `helm list`, repo reads). The goal is to +improve maintainability by moving the in-cluster state from imperative CLI steps +to a declarative, auditable GitOps controller, without disrupting the live demo. + +Scope of this document: the GitOps **control plane and bootstrap** only (which +controller, where it syncs from, how it is installed and reconciled, how it +integrates with the existing secrets stack, and the non-disruptive adoption +strategy). A sibling plan covers the **per-workload adoption details** and the +non-Argo application state (Coder and Keycloak API configuration, Terraform +reconciliation). This document deliberately stays at the control-plane level and +does not duplicate per-workload adoption steps. + +## 1. Current state (confirmed live, read-only) + +Captured against EKS cluster `usgov-coderdemo` (k8s 1.36) on 2026-06-07 with +`. ~/.config/usgov-coderdemo/env && export KUBECONFIG=./kubeconfig`. + +No GitOps controller exists yet. `kubectl get ns`, `kubectl get crd`, and +`helm list -A` show no Argo or Flux namespaces or CRDs. + +In-cluster state is split between Helm releases and `kubectl`-applied manifests: + +| Mechanism | Object | Namespace | +|---|---|---| +| Helm | `coder` (rev 4) | `coder` | +| Helm | `ingress-nginx` (rev 1) | `ingress-nginx` | +| Helm | `aws-load-balancer-controller` (rev 1) | `kube-system` | +| Helm | `external-secrets` (rev 1) | `external-secrets` | +| kubectl | Keycloak Deployment/Service/Ingress + realm import | `keycloak` | +| kubectl | GitLab StatefulSet/Service/Ingress (embedded Postgres) | `gitlab` | +| kubectl | 2 Coder provisioner Deployments (`alpha`, `bravo`) | `coder` | +| kubectl | `ClusterSecretStore` + 9 `ExternalSecret` objects | cluster + app ns | +| kubectl | workspace RBAC (`coder-workspace-perms`) | `coder-workspaces` | +| kubectl | `gp3` default StorageClass | cluster | + +A monitoring stack is being added to the cluster now; it should be folded into +the same GitOps model once it lands, as a new app under `gitops/apps`. + +The AWS substrate (VPC, EKS, node group and IAM, RDS, ECR, IRSA roles, Route53, +ACM, KMS) is Terraform and stays Terraform. Secrets are sourced from AWS Secrets +Manager (ASM) and synced into Kubernetes by the External Secrets Operator (ESO) +via IRSA role `usgov-coderdemo-external-secrets`; the `ClusterSecretStore` +`aws-secretsmanager` reports `Valid`/`Ready`. No secret material is in git. + +GovCloud has no ECR pull-through cache, so every image is mirrored into private +ECR by `scripts/mirror-images.sh` from `scripts/images.txt`. + +Repo facts: remote `github.com/coder/usgov-coderdemo`, working branch +`feat/app-platform-deploy`. A historical `gitops/` placeholder is referenced in +`docs/repo-layout.md` (labelled "OCP Argo apps") but does not exist on disk; this +plan defines the `gitops/` tree from scratch. + +## 2. Goals and non-goals + +Goals: + +- One declarative source of truth for in-cluster state, reconciled by a + controller, replacing the ad hoc `helm`/`kubectl` steps. +- Strictly in-boundary: the controller syncs from the in-cluster GitLab at + `gitlab.usgov.coderdemo.io`, not from github.com. No github.com egress on the + reconcile path. +- Non-disruptive adoption of the already-running releases and manifests. The + live demo keeps working throughout; adoption is verified with + `argocd app diff` before any sync. +- No secrets in git. The controller manages `ExternalSecret` references only; + ESO continues to own the actual Kubernetes Secrets from ASM. + +Non-goals (for this control-plane plan): + +- Per-workload adoption mechanics and runtime app config (Coder/Keycloak API, + license JWT, appearance banner, AI provider DB seed, GitLab OAuth app). Owned + by the sibling plan. +- Moving the AWS substrate into GitOps. That stays Terraform. +- Enabling auto-sync, self-heal, or prune before the demo. Deferred by design + (see Section 9). +- Relocating existing manifests. Files stay where they are in `deploy/`; the + GitOps layer only adds Argo `Application` objects that point at those paths + (see Section 6). + +## 3. Decision: Argo CD vs Flux + +**Decision: adopt Argo CD. Commit to it for this environment.** The tradeoff is +recorded below so the choice can be revisited if the end customer standardizes on +a Flux-based reference architecture. + +Why Argo CD here: + +1. **The UI is a demo asset.** This is a customer-facing demo platform whose + whole point is to show a governed, in-boundary developer platform. Argo CD + ships a first-class web UI that visualizes the application tree, sync state, + and drift. That dashboard is itself a demo artifact: it makes the "everything + is declarative and reconciled from in-boundary git" story visible on screen. + Flux is intentionally UI-less in its core (third-party UIs exist but are a + separate install and less prominent). +2. **It reuses the existing Keycloak SSO.** Argo CD can authenticate its UI and + API against the Keycloak realm `coder` over OIDC, reinforcing the same + in-boundary identity story already used by Coder. One more relying party on + the existing IdP, no new auth path. +3. **App-of-apps maps cleanly onto incremental, non-disruptive adoption.** Argo + CD's `Application` and `AppProject` model, plus `argocd app diff`, lets us + adopt one existing release or manifest set at a time and prove the diff is + benign before syncing. This matches the careful adoption posture this live + environment requires. +4. **Broad adoption and operator familiarity** lower the support burden for a + demo that platform engineers will run and extend. + +The case for Flux (the recorded tradeoff): + +- **DoD Platform One Big Bang uses Flux.** If this demo needs to align with a + customer's Big Bang reference architecture, Flux would be the native choice and + would reduce friction with that ecosystem. This is the single strongest reason + to revisit the decision, and it is a real one for a GovCloud, DoD-adjacent + audience. +- Flux has a **smaller footprint** (fewer components, so fewer images to mirror + into ECR) and a GitOps-native, Kustomize-first multi-tenancy model. + +Resolution: pick Argo CD now for the demo's UI value, SSO reuse, and adoption +ergonomics. If Big Bang alignment becomes a hard requirement for a specific +engagement, treat that as the trigger to re-evaluate Flux. Record this as a +reversible decision, not a permanent platform standard. + +Version note: pin to a currently supported Argo CD release. As of late May 2026 +the supported minor lines are v3.4, v3.3, and v3.2; v3.1 reached end of life on +2026-05-06. Plan on the latest v3.4.x patch at adoption time and confirm the +exact patch then. + +## 4. Target architecture + +``` + IN-BOUNDARY (GovCloud, no github.com on the reconcile path) + +-------------------------------------------------------------------------+ + | | + | operator / in-boundary CI | + | | git push (mirror from github.com origin, done off the | + | | reconcile path) | + | v | + | In-cluster GitLab (gitlab.usgov.coderdemo.io / gitlab.gitlab.svc) | + | project: platform/usgov-coderdemo (authoritative for GitOps) | + | | ^ | + | | | webhook (push) -> /api/webhook ; poll fallback (~3m) | + | | | read-only deploy token (read_repository) | + | v | | + | +--------------------- ns: argocd ---------------------------------+ | + | | Argo CD | | + | | application-controller / repo-server / server(UI+API) / | | + | | applicationset-controller / redis / dex(optional) | | + | | images: ECR mirror (quay/argoproj/argocd, redis, dex) | | + | | UI/API auth: Keycloak realm `coder` (OIDC) | | + | +-----------------------------+------------------------------------+ | + | | renders + applies (helm template / kustomize / manifests) | | + | v | | + | app-of-apps root -> AppProjects -> child Applications | | + | | | | + | +--> platform: ingress-nginx, aws-load-balancer-controller, | | + | | external-secrets, gp3 StorageClass, ws RBAC | | + | +--> coder: coder (Helm) + provisioners | | + | +--> keycloak (manifests) | | + | +--> gitlab (manifests) | | + | +--> secrets-config: ClusterSecretStore + ExternalSecrets | | + | +--> argocd (self-management) | | + | | | + | ExternalSecrets (in git, references only) | | + | | reconciled by Argo CD | | + | v | | + | ESO (IRSA: usgov-coderdemo-external-secrets) --> reads ASM | | + | | writes/owns | | + | v | | + | Kubernetes Secrets (NOT in git, NOT pruned by Argo) | | + | | | + +----------------------------------------------------------------------+ | + | | + | AWS Secrets Manager (usgov-coderdemo/*) <-- source of truth, secrets | + | ECR (image mirror, no pull-through) | + | Terraform substrate: VPC, EKS, RDS, IRSA, Route53, ACM, KMS | + +--------------------------------------------------------------------------+ +``` + +Key properties: the reconcile loop (GitLab to Argo CD to cluster) is entirely +in-cluster. The only out-of-boundary touch is the operator mirroring the repo +from github.com into GitLab, and that happens off the reconcile path. Secret +material never enters git; Argo manages references, ESO owns the Secrets. + +## 5. In-boundary source: the in-cluster GitLab + +The canonical repo lives at `github.com/coder/usgov-coderdemo`, which is out of +boundary. The architecture goal is strictly in-boundary, so Argo CD must sync +from the in-cluster GitLab, not from github.com. + +### 5.1 Authoritative source and mirroring direction + +- Create a GitLab project, for example `platform/usgov-coderdemo`, on the + in-cluster GitLab. This project becomes the **authoritative source for what + Argo CD reconciles**. +- github.com remains the public collaboration mirror. The mirror direction is + **github.com to GitLab** (push into GitLab), performed by an operator or an + in-boundary CI runner that has github read access. Do not use GitLab's "pull + mirror" feature pointed at github.com, because that would put github.com egress + back on the platform's critical path, defeating the boundary goal. +- Document the push step as a release action (for example a `git push gitlab` + to a second remote) so the in-boundary source is updated deliberately and the + reconcile loop never depends on github.com reachability. + +### 5.2 Repo URL the controller uses + +Two valid options; pick one and be consistent with the deploy token host: + +- **In-cluster Service URL (recommended):** + `http://gitlab.gitlab.svc.cluster.local/platform/usgov-coderdemo.git`. Keeps + all reconcile traffic inside the cluster, avoids the NLB hairpin, and needs no + TLS trust configuration (GitLab serves plain HTTP on the Service port behind + the bundled NGINX). Simplest and most in-boundary. +- **Public hostname over the NLB hairpin:** + `https://gitlab.usgov.coderdemo.io/platform/usgov-coderdemo.git`. Uses the + valid ACM TLS path but adds an NLB round trip; only choose this if you want + Argo to validate the public certificate. + +Recommendation: use the in-cluster Service URL for the Argo repo definition. + +### 5.3 Repository credentials (deploy token) + +- Mint a **GitLab project deploy token** scoped to `read_repository` on + `platform/usgov-coderdemo` (read-only; the controller never pushes). +- Store the deploy token in ASM (for example `usgov-coderdemo/argocd/gitlab-repo` + with keys `username` and `password`), consistent with the existing + ASM-plus-ESO pattern. An `ExternalSecret` then materializes a Kubernetes Secret + in the `argocd` namespace carrying the Argo repository label + `argocd.argoproj.io/secret-type: repository` with `url`, `username`, and + `password`. This keeps the credential out of git and rotation in ASM. +- Bootstrap ordering matters: ESO and the repo-credential `ExternalSecret` must + exist before Argo CD first tries to pull from GitLab. ESO is already installed, + so the only new prerequisite is the repo-cred `ExternalSecret`. + +### 5.4 Change delivery: webhook plus poll + +- Configure a GitLab **project webhook** to Argo CD's `/api/webhook` endpoint for + push-triggered sync (low latency for demos). Protect it with a webhook secret, + also delivered via ASM/ESO. +- Keep Argo's **polling reconcile** as the fallback (default around 3 minutes) so + reconciliation still happens if a webhook is missed. For a demo, polling alone + is acceptable; the webhook is a nice-to-have for snappy syncs. + +## 6. Repo layout for GitOps (no files moved yet) + +Add a `gitops/` tree that contains only Argo CD objects (the controller install +and the `Application`/`AppProject`/`ApplicationSet` definitions). The existing +manifests and Helm values stay exactly where they are under `deploy/`; each +`Application` points its `source.path` at the current `deploy/...` location. This +is the "adopt in place, do not relocate" principle, so the diff between git and +the live cluster stays minimal during adoption. + +Proposed layout: + +``` +gitops/ + bootstrap/ + argocd/ # Argo CD install (Helm values for the chart, + # image repos overridden to the ECR mirror) + root-app.yaml # app-of-apps root Application + projects/ + platform.yaml # AppProject: platform infra + apps.yaml # AppProject: coder/keycloak/gitlab + argocd.yaml # AppProject: argocd self-management + apps/ + platform/ + ingress-nginx.yaml # Application -> deploy/platform ingress-nginx values + aws-load-balancer-controller.yaml + external-secrets.yaml # Application -> ESO Helm release + storageclass-gp3.yaml # Application -> gp3 StorageClass manifest + workspace-rbac.yaml # Application -> deploy/platform/workspace-rbac.yaml + secrets/ + secretstore-externalsecrets.yaml # Application -> deploy/platform/external-secrets + coder/ + coder.yaml # Application -> deploy/coder (Helm: chart + values.yaml) + provisioners.yaml # Application -> deploy/coder/provisioners.yaml + keycloak/ + keycloak.yaml # Application -> deploy/keycloak (kustomize) + gitlab/ + gitlab.yaml # Application -> deploy/gitlab/*.yaml +``` + +Pattern choice: + +- **App-of-apps (recommended to start).** A single root `Application` + (`gitops/bootstrap/root-app.yaml`) enumerates the child `Application` objects + under `gitops/apps`. Explicit, easy to reason about, easy to adopt one child at + a time, and easy to keep some children on manual sync while others advance. +- **ApplicationSet (alternative, for later scale).** A git-directory generator + over `gitops/apps/*` removes the per-app boilerplate. Good once the set of apps + stabilizes and you want uniform policy. Note it as the evolution path, not the + starting point, because per-app policy variation is exactly what we want during + adoption. + +`AppProject` boundaries: define at least three projects so each app set is +restricted to its source repo (the one GitLab project) and its destination +namespaces. Least privilege at the project layer is what makes a broad +controller safe (see Section 8). + +How Helm releases are represented: each existing Helm release becomes an +`Application` whose `source` is the chart (mirrored chart or repo) with +`helm.valueFiles` pointing at the committed `deploy//values.yaml`. Argo +renders the chart with `helm template` and applies the result; it does not call +`helm install`. The implications of that are covered in Section 9. + +## 7. Bootstrap: installing and self-reconciling the controller + +### 7.1 Image mirroring to ECR (no pull-through in GovCloud) + +Add the Argo CD component images to `scripts/images.txt` and mirror them with the +existing `scripts/mirror-images.sh` (crane to private ECR). Argo CD is a handful +of images: + +- `quay.io/argoproj/argocd:` (one image backs the application-controller, + repo-server, server/UI, applicationset-controller, and notifications). +- A Redis image used by the chart (commonly `docker.io/library/redis:`; + confirm the exact repo/tag the chosen chart version pins). +- `ghcr.io/dexidp/dex:` only if Dex is used for SSO bundling. If Argo CD is + pointed straight at Keycloak OIDC, Dex can be omitted, removing one image. + +Override the chart's image repositories to the ECR mirror paths +(`/quay/argoproj/argocd`, `/docker-hub/library/redis`, +`/ghcr/dexidp/dex`) following the existing mirror path convention. +These ECR pulls work with the node role's `AmazonEC2ContainerRegistryReadOnly`, +so no new IRSA is required just to pull Argo's images. + +### 7.2 Install method and self-management + +- **First install is imperative**, like everything else in this build: install + Argo CD via its Helm chart into namespace `argocd`, with images pointed at the + ECR mirror. Use server-side apply for the install. The official Argo CD install + guidance calls for `--server-side --force-conflicts` because the CRDs exceed + the client-side apply size limit; the same constraint applies when Argo manages + CRD-bearing charts (see Section 9). +- **Then Argo CD manages Argo CD.** Add an `argocd` `Application` (under the + `argocd` AppProject) whose source is `gitops/bootstrap/argocd`. After the + initial bootstrap, all future Argo CD config and upgrades flow through git like + any other app. This is the standard self-management pattern and closes the loop + so the controller is not itself a snowflake. +- The app-of-apps `root-app.yaml` is applied once by hand during bootstrap; + thereafter it reconciles itself and its children from git. + +### 7.3 RBAC and IRSA + +- **Controller RBAC:** the application-controller needs permission to reconcile + across the managed namespaces (`coder`, `coder-workspaces`, `keycloak`, + `gitlab`, `ingress-nginx`, `external-secrets`, `kube-system` for the LB + controller, and cluster-scoped objects like the StorageClass and + ClusterSecretStore). The Argo install ships a ClusterRole for this. Constrain + the blast radius at the `AppProject` layer instead of widening or narrowing the + ClusterRole: each project allowlists only its destination namespaces and the + single GitLab source repo. +- **UI/API RBAC:** wire Argo CD's UI and API to Keycloak OIDC (realm `coder`) and + map a platform-admin group to the Argo `admin` role via `policy.csv`; everyone + else defaults to read-only. This reuses the in-boundary IdP and avoids a + separate Argo local-admin password as the standing credential. +- **IRSA:** Argo CD core needs no AWS credentials to reconcile manifests, so no + new IRSA role is required for the controller itself. The only AWS dependency is + pulling images from ECR, already covered by the node role. (If a future + ApplicationSet cloud generator or an image updater against ECR is added, that + would need its own IRSA role; out of scope here.) + +## 8. Secrets and ESO integration + +The hard rule is unchanged: no secret material in git. GitOps and the existing +ESO/ASM stack divide responsibility cleanly: + +- **Argo CD manages the references.** The `ClusterSecretStore` and the nine + `ExternalSecret` objects (`deploy/platform/external-secrets/...`) contain only + pointers to ASM, no secret values, so they are safe to keep in git and + reconcile with Argo. +- **ESO owns the actual Kubernetes Secrets.** ESO writes them with + `creationPolicy: Owner` from ASM via IRSA. Those Secrets are not in git and + must not be managed or pruned by Argo. +- **Keep Argo and ESO from fighting over Secrets.** Because the Kubernetes + Secrets are not rendered from git, Argo will not track them under normal + operation. The risk is namespace-level pruning or orphaned-resource handling + deleting an ESO-owned Secret. Mitigations: keep `prune: false` during adoption + (Section 9); set `AppProject` `orphanedResources` to `warn`, never delete; and + if needed add `Secret` to Argo's resource exclusions for the managed + namespaces. Argo 3.x already ships sensible default `resource.exclusions`; + extend them rather than reduce them. +- **The Argo repo credential is itself a secret**, handled the same in-boundary + way: deploy token in ASM, surfaced into the `argocd` namespace by an + `ExternalSecret` carrying the Argo repository label (Section 5.3). This keeps + the one credential Argo needs out of git and rotatable in ASM, and it means the + whole platform, including the GitOps controller's own inputs, follows one + secrets pattern. + +## 9. Non-disruptive adoption strategy (the careful part) + +This is a live environment. Adoption must not restart or revert running +workloads. The strategy is to install the controller, point Applications at the +existing state with all automation off, prove the diff is benign, and only then +consider enabling automation, after the demo. + +### 9.1 Sync policy phases + +1. **Phase 0: install only.** Argo CD running; no child Application syncing yet + (Applications created with automated sync disabled, or not created at all). +2. **Phase 1: adopt with everything off.** For each existing release or manifest + set, create an `Application` pointing at the GitLab source with: + - manual sync (no `automated`), + - `prune: false`, + - `selfHeal: false`. + The app will report `OutOfSync` or `Synced` but will not change anything. +3. **Phase 2: verify the diff is benign.** Run `argocd app diff ` and + confirm the only differences are metadata Argo adds (its tracking + annotation/labels), not spec changes. If the diff shows real spec drift, + reconcile the git source to match live before going further (do not sync to + "fix" live). +4. **Phase 3: adopt in place.** Once the diff is benign, let Argo take ownership + (a manual sync that only adds tracking metadata). Still no prune, no self-heal, + no auto-sync. +5. **Phase 4 (after the demo): enable automation per app.** Turn on `automated`, + then `selfHeal`, then `prune`, one app at a time, lowest-risk first, watching + each. Auto-sync is deliberately deferred until after the demo. + +### 9.2 Order of adoption (lowest risk first) + +1. Leaf, stateless manifests with no Helm bookkeeping: `gp3` StorageClass, + workspace RBAC, and the `ClusterSecretStore`/`ExternalSecret` set. +2. The plain `kubectl`-applied app manifests: Keycloak, GitLab, the Coder + provisioner Deployments. These were applied with `kubectl`, so there is no + competing Helm release to reconcile. +3. The Helm releases last: `external-secrets`, `aws-load-balancer-controller`, + `ingress-nginx`, then `coder`. These carry the ownership and values landmines + below, so they are adopted only after the no-Helm items prove the workflow. + +### 9.3 Landmines when adopting CLI-installed Helm releases + +These are the specific traps and how to defuse each: + +- **Ownership metadata (Helm vs Argo).** Argo does not use Helm to install; it + runs `helm template` and applies the output, taking ownership via its tracking + annotation/label. After adoption, the old Helm release Secret + (`sh.helm.release.v1...`) is orphaned bookkeeping: both Helm and Argo believe + they own the objects. Plan: let Argo become the owner, verify the app is + `Healthy`/`Synced`, then clean up the stale Helm release record (for example + `helm uninstall --keep-resources`, or simply leave the orphaned release Secret + and stop using `helm upgrade`). Decide and document which tool is authoritative + per release; do not run `helm upgrade` against an Argo-managed release. +- **Values drift.** `coder` is at Helm revision 4. Live values set across those + upgrades may not all be captured in `deploy/coder/values.yaml`. Before + adoption, capture live values (`helm get values coder -n coder`) and reconcile + them into the committed values file so the rendered manifest matches live. + Otherwise `argocd app diff` shows spurious changes and a sync would revert real + live configuration. Repeat for each Helm release. +- **CRDs.** ESO and the AWS load balancer controller install CRDs. Argo CD has + known CRD-size handling caveats: large CRDs exceed the client-side apply + annotation limit, which is why the official install itself requires + `--server-side --force-conflicts`. Use `ServerSideApply=true` for CRD-bearing + apps, keep CRDs out of prune, and be deliberate about `Replace`. ESO CRDs in + particular are large; server-side apply is the safe default. +- **Resource tracking method.** Prefer annotation-based resource tracking so Argo + does not mutate `spec` or label selectors on adoption. Label-based tracking can + touch immutable selector fields and trigger churn; annotation tracking avoids + that. Confirm the controller is set to annotation tracking before adopting. +- **`helm template` vs `helm install` semantics.** Chart hooks, `lookup` + functions, and `.Release.IsInstall` can render differently under + `helm template`. Verify the rendered output of each chart matches what is live + before syncing, especially for charts that branch on install-vs-upgrade. + +## 10. Scope: what GitOps manages vs what stays Terraform or scripts + +| Layer | Owner after this plan | +|---|---| +| In-cluster Helm releases (coder, ingress-nginx, aws-load-balancer-controller, external-secrets) | GitOps (Argo CD) | +| In-cluster manifests (keycloak, gitlab, provisioners, workspace RBAC, gp3 SC, ClusterSecretStore + ExternalSecrets) | GitOps (Argo CD) | +| The new monitoring stack (once it lands) | GitOps (Argo CD), as a new `gitops/apps` app | +| Argo CD itself | GitOps (self-managed app-of-apps) | +| AWS substrate (VPC, EKS cluster, node group + IAM, RDS, ECR repos, IRSA roles, Route53, ACM, KMS) | Terraform | +| Image mirroring into ECR (crane) | Script (`scripts/mirror-images.sh`), a pipeline step; image content is not Argo's job | +| DB roles/schemas (`coder`, `keycloak`) | Script / one-time job (or fold into Terraform later) | +| Runtime app config: Coder license JWT, appearance banner, AI provider DB seed, GitLab OAuth app, Keycloak realm runtime config, IdP sync, Coder template push | Out-of-band / runtime (sibling plan), NOT GitOps | +| Kubernetes Secrets material | ESO from ASM (referenced, never stored, by GitOps) | + +Boundary with the sibling plan: this document stops at the control plane and +bootstrap. The per-workload adoption details (how each Coder/Keycloak/GitLab +workload is cut over, and how the non-Argo application state and Terraform +reconciliation are handled) are owned by the sibling plan and are not duplicated +here. + +## 11. Risks and open questions + +- **Mirror discipline.** The in-boundary GitLab is only as current as the last + github-to-GitLab push. A missed push means Argo reconciles stale desired state. + Mitigation: make the push a required release step or an in-boundary CI job. +- **Single GitLab as a dependency.** GitLab uses embedded Postgres on one PVC + with no managed backup. If GitLab is down, Argo cannot fetch new desired state + (already-synced state keeps running). Acceptable for a demo; note it. +- **Coder values reconciliation effort.** Capturing four revisions of live Coder + values into the committed values file is the most error-prone adoption task and + should get the most diff scrutiny. +- **Open question:** confirm the exact Redis image/tag the chosen Argo CD chart + version pins, and whether Dex is needed or Keycloak OIDC is wired directly + (affects the image mirror list). +- **Open question:** confirm whether the monitoring stack should be adopted as + part of the first GitOps cut or after the four Helm releases, given it is being + installed imperatively right now. + +## 12. Implementation roadmap (maps to the GitHub issues) + +1. Choose and install the Argo CD control plane (namespace, Helm, ECR images, + self-managed app-of-apps root). +2. Mirror Argo CD component images into ECR (`scripts/images.txt`). +3. Stand up the in-cluster GitLab project, push the repo into it, and configure + Argo repo credentials (deploy token via ASM/ESO) plus webhook and poll. +4. Scaffold the `gitops/` app-of-apps and `AppProject` layout, with Applications + referencing the existing `deploy/` paths (no files moved). +5. Argo CD bootstrap RBAC, Keycloak SSO for the UI, and least-privilege + AppProjects. +6. Secrets/ESO integration guardrails (no secrets in git; prune and + orphaned-resource protections; server-side apply for CRDs). +7. Non-disruptive adoption runbook with `argocd app diff` verification (values + reconcile, ownership and CRD landmines, staged ordering, defer auto-sync). + +These map one-to-one to the `gitops`-labelled GitHub issues filed alongside this +plan. diff --git a/docs/plans/observability-aws-native.md b/docs/plans/observability-aws-native.md new file mode 100644 index 0000000..d5cec0a --- /dev/null +++ b/docs/plans/observability-aws-native.md @@ -0,0 +1,458 @@ +# Plan: AWS-native observability and audit pipeline (production target) + +Status: PLAN (design only). No cluster, AWS, Coder, or Keycloak changes were +made to produce this document. Every AWS availability claim below is grounded in +a read-only `aws` call run on 2026-06-07 against account `430737322961`, +partition `aws-us-gov`, region `us-gov-west-1`. Items that could not be fully +verified are marked "unverified" or "to verify". + +This is the production, AWS-native target that the demo's in-cluster +Prometheus plus Grafana stack (built by a separate workstream) should evolve +into. It covers two pipelines: + +1. Metrics: Coder Prometheus endpoint to Amazon Managed Prometheus (AMP) to + Amazon Managed Grafana (AMG). +2. Audit and SIEM: Coder structured JSON logs to CloudWatch Logs, then to + Kinesis Data Firehose, S3, and Athena, with an optional Amazon Security Lake + (OCSF) path, plus alerting on sensitive audit actions. + +## 1. Context and current state + +| Fact | Value | Source | +|---|---|---| +| Account / partition / region | `430737322961` / `aws-us-gov` / `us-gov-west-1` | `.substrate-outputs.json`, live `aws sts get-caller-identity` | +| EKS cluster | `usgov-coderdemo`, k8s 1.36, standard (not Auto Mode), `authenticationMode=API` | live `aws eks describe-cluster` | +| EKS OIDC provider (IRSA) | `arn:aws-us-gov:iam::430737322961:oidc-provider/oidc.eks.us-gov-west-1.amazonaws.com/id/E9DB9E591C95ECB91F44EDCF38F146F2` | `.substrate-outputs.json`, `terraform/irsa.tf` | +| OIDC issuer host | `oidc.eks.us-gov-west-1.amazonaws.com/id/E9DB9E591C95ECB91F44EDCF38F146F2` | `.substrate-outputs.json` | +| Coder | v2.34.0, ns `coder`, served at `https://dev.usgov.coderdemo.io` | `docs/as-built/30-coder-control-plane.md` | +| Coder metrics port (live) | `CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112` already set on the Deployment | live `kubectl -n coder get deploy coder` | +| Coder Service ports (live) | only `http` 80 to container `8080`. Port `2112` is not in the Service or as a named containerPort. | live `kubectl -n coder get svc/deploy coder` | +| Identity | Keycloak realm `coder` at `https://auth.usgov.coderdemo.io/realms/coder`, OIDC client `coder` | `docs/as-built/40-identity-keycloak.md` | +| IRSA precedent | `usgov-coderdemo-coder-bedrock`, `usgov-coderdemo-external-secrets`, `usgov-coderdemo-ebs-csi` use the cluster OIDC provider with least-privilege inline policies | `terraform/irsa.tf`, `terraform/secrets-hardening.tf` | +| In-cluster demo stack | A separate workstream is building Prometheus plus Grafana in-cluster for the demo. Reserved host `metrics.usgov.coderdemo.io`. | `docs/AGENT-PRD.md`, `docs/decisions-locked.md` | +| Secrets pattern | AWS Secrets Manager is source of truth, synced by External Secrets Operator via IRSA | `docs/as-built/85-secrets-management.md` | + +The cluster already exposes the Coder metrics address, but the metrics port is +not yet wired into a Service and `CODER_PROMETHEUS_ENABLE` is not present, so +metrics are not yet scrapeable end to end. JSON logging is not enabled. + +## 2. Verified GovCloud service availability + +All calls below were read-only (`list` / `describe`) under `AWS_PROFILE` +`demoenv-usgov`, region `us-gov-west-1`, CLI `aws-cli/2.34.63`. + +| Service | Probe | Result | Conclusion | +|---|---|---|---| +| Amazon Managed Prometheus (AMP) | `aws amp list-workspaces` | `{"workspaces": []}` | Available. | +| AMP managed scraper (collector) | `aws amp list-scrapers` | `AccessDeniedException: Unable to determine service/operation name to be authorized` on a current CLI | The managed scraper operation is not served by the regional endpoint. Treat the AMP managed collector as NOT available in `us-gov-west-1`. Use a self-managed ADOT collector. (Verified by probe; see caveats.) | +| Amazon Managed Grafana (AMG) | `aws grafana list-workspaces` | `{"workspaces": []}` | Available. | +| IAM Identity Center | `aws sso-admin list-instances` | `{"Instances": []}` | Not enabled. Account is standalone (`organizations:DescribeOrganization` returns `AWSOrganizationsNotInUseException`). AMG SAML is the simpler auth path. | +| Amazon Security Lake | `aws securitylake list-data-lakes` | `{"dataLakes": []}` | API available, not enabled. | +| AWS Security Hub | `aws securityhub describe-hub` | `InvalidAccessException: not subscribed` | API available, not enabled. | +| Amazon Detective | `aws detective list-graphs` | `{"GraphList": []}` | API available, not enabled. | +| Kinesis Data Firehose | `aws firehose list-delivery-streams` | empty list | Available. | +| Amazon Athena | `aws athena list-work-groups` | `primary` workgroup, engine v3 | Available. | +| AWS Glue | `aws glue get-databases` | `{"DatabaseList": []}` | Available. | +| Amazon EventBridge | `aws events list-event-buses` | `default` bus present | Available. | +| Amazon SNS | `aws sns list-topics` | empty list | Available. | +| CloudWatch Logs | `aws logs describe-log-groups` | existing groups present | Available. | +| AWS KMS | `aws kms list-aliases` | AWS-managed aliases present, no CMKs yet | Available. | +| EKS addon `adot` | `aws eks describe-addon-versions --addon-name adot` | `v0.151.0-eksbuild.1` (and older) | ADOT EKS managed addon available. | +| EKS addon `amazon-cloudwatch-observability` | `describe-addon-versions` | `v6.2.0-eksbuild.1` | CloudWatch agent plus Fluent Bit addon available. | +| EKS addon `eks-pod-identity-agent` | `describe-addon-versions` | `v1.3.10-eksbuild.3`, compatible with 1.36 | EKS Pod Identity available as an alternative to IRSA. | + +To verify before build (not provable read-only without creating resources): + +- AMP remote write hostname for GovCloud. The standard pattern is + `https://aps-workspaces.us-gov-west-1.amazonaws.com/workspaces//api/v1/remote_write`. + Confirm whether a FIPS endpoint + (`aps-workspaces-fips.us-gov-west-1.amazonaws.com`) is required by the + compliance posture. +- AMG SAML federation against Keycloak (assertion attributes, role mapping). +- Security Lake OCSF custom-source registration and the Coder-to-OCSF mapping. + +## 3. Target architecture + +```mermaid +flowchart TB + subgraph EKS["EKS cluster usgov-coderdemo (ns coder, observability, amazon-cloudwatch)"] + coder["coderd v2.34.0\n/metrics on :2112\nJSON logs to stdout"] + adot["ADOT Collector (Deployment)\nprometheus receiver + sigv4auth\nIRSA: usgov-coderdemo-adot-amp"] + fb["Fluent Bit (DaemonSet)\ntail container stdout\nIRSA: usgov-coderdemo-fluentbit-cwl"] + coder -->|scrape :2112| adot + coder -->|stdout JSON| fb + end + + subgraph Metrics["Metrics pipeline"] + amp["Amazon Managed Prometheus\nworkspace usgov-coderdemo"] + amg["Amazon Managed Grafana\nSAML to Keycloak\nCoder dashboards"] + adot -->|remote_write SigV4| amp + amp -->|PromQL query SigV4| amg + end + + subgraph Audit["Audit and SIEM pipeline"] + cwl["CloudWatch Logs\n/coder/audit (retention set)"] + fh["Kinesis Data Firehose"] + s3["S3 bucket\ndate-partitioned audit archive"] + glue["Glue Data Catalog"] + athena["Athena (engine v3)"] + slake["Amazon Security Lake\nOCSF normalization (optional)"] + shub["Security Hub / Detective\n(optional)"] + fb -->|PutLogEvents| cwl + cwl -->|subscription filter| fh + fh -->|PutObject| s3 + s3 --> glue --> athena + s3 -. "OCSF transform" .-> slake --> shub + end + + subgraph Alerting["Alerting"] + sns["SNS topic\ncoder-audit-alerts"] + eb["EventBridge rule\nsensitive audit actions"] + cwl -->|metric filter + alarm| sns + cwl -->|forwarder to events| eb -->|rule target| sns + amg -->|Grafana alerts| sns + end +``` + +## 4. Component table + +| # | Component | AWS service / object | Auth | Runs where | Notes | +|---|---|---|---|---|---| +| M1 | Coder metrics endpoint | n/a (coderd) | n/a | ns `coder` | `:2112/metrics`; enable plus expose via Service. | +| M2 | Metrics scraper | ADOT Collector (self-managed Deployment, or the `adot` EKS addon) | IRSA `usgov-coderdemo-adot-amp` | ns `observability` | Prometheus receiver scrapes `:2112`; `prometheusremotewrite` exporter with `sigv4auth`. Managed AMP scraper is unavailable in this region. | +| M3 | Metrics store | Amazon Managed Prometheus workspace | SigV4 | AWS managed | Receives `remote_write`; 150-day default retention. | +| M4 | Dashboards | Amazon Managed Grafana workspace | SAML to Keycloak; AMP data source via AMG service role | AWS managed | Import Coder's published Grafana dashboards. | +| A1 | Log emitter | coderd JSON logs to stdout | n/a | ns `coder` | `CODER_LOGGING_JSON=/dev/stderr`; audit events are in coderd logs. | +| A2 | Log shipper | Fluent Bit DaemonSet (or `amazon-cloudwatch-observability` addon) | IRSA `usgov-coderdemo-fluentbit-cwl` | ns `amazon-cloudwatch` | Tails coder pod stdout to CloudWatch Logs group `/coder/audit`. | +| A3 | Log store and retention | CloudWatch Logs group `/coder/audit` | IAM | AWS managed | Set retention via `logs:PutRetentionPolicy`. | +| A4 | Stream to lake | Kinesis Data Firehose | Firehose service role | AWS managed | Source: CloudWatch Logs subscription filter. | +| A5 | Archive | S3 bucket (date-partitioned) | IAM, SSE-KMS | AWS managed | `year=/month=/day=/` prefixing for Athena. | +| A6 | Catalog and query | Glue Data Catalog plus Athena | IAM | AWS managed | Glue table or crawler over the S3 prefix. | +| A7 | OCSF SIEM (optional) | Amazon Security Lake plus Security Hub plus Detective | IAM | AWS managed | Coder custom source, OCSF mapping; compliance-grade path. | +| AL1 | Alert detection | CloudWatch Logs metric filter plus alarm, and/or EventBridge rule | IAM | AWS managed | Sensitive actions (license, user role change, template push, login failures). | +| AL2 | Notification | SNS topic `coder-audit-alerts` | IAM | AWS managed | Email/Slack/PagerDuty subscribers; also AMG alerts. | + +## 5. Metrics pipeline detail + +### 5.1 Coder metrics + +Coder exposes Prometheus metrics when `CODER_PROMETHEUS_ENABLE=true`. The live +Deployment already sets `CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112` but does not set +`CODER_PROMETHEUS_ENABLE`, and `:2112` is not exposed by the `coder` Service. +Required Coder config (see section 8). The scraper reaches metrics over the pod +network at `:2112`; a dedicated headless Service or a Prometheus +annotation/PodMonitor selects the pod. + +### 5.2 Scraper: self-managed ADOT collector with SigV4 + +The AMP managed collector (`aws amp list-scrapers`) is not available in +`us-gov-west-1` (verified by probe), so run the AWS Distro for OpenTelemetry +(ADOT) collector in-cluster. Two deployment options: + +- The `adot` EKS managed addon (`v0.151.0-eksbuild.1` available), or +- A self-managed ADOT Collector Deployment from the ECR mirror (consistent with + the no-pull-through-cache constraint in `docs/as-built/10-infrastructure.md`). + +Collector config outline (concept, not applied): + +```yaml +receivers: + prometheus: + config: + scrape_configs: + - job_name: coderd + kubernetes_sd_configs: [{ role: pod }] + relabel_configs: + - source_labels: [__meta_kubernetes_namespace] + regex: coder + action: keep + - source_labels: [__meta_kubernetes_pod_container_port_number] + regex: "2112" + action: keep +extensions: + sigv4auth: + region: us-gov-west-1 + service: aps +exporters: + prometheusremotewrite: + endpoint: https://aps-workspaces.us-gov-west-1.amazonaws.com/workspaces//api/v1/remote_write + auth: { authenticator: sigv4auth } +service: + extensions: [sigv4auth] + pipelines: + metrics: + receivers: [prometheus] + exporters: [prometheusremotewrite] +``` + +The `sigv4auth` extension signs `remote_write` with the IRSA-provided role +credentials. No static AWS keys, matching the Bedrock and ESO precedent. + +### 5.3 AMP workspace + +Create one AMP workspace (alias `usgov-coderdemo`). Encrypt with a CMK if the +posture requires it (KMS is available; no CMK exists yet). Default retention is +150 days; adjust to the compliance requirement. + +### 5.4 Amazon Managed Grafana and auth + +AMG requires either IAM Identity Center or SAML for user auth. IAM Identity +Center has no instances and the account is not in an Organization (both +verified), so enabling Identity Center is extra scope. Recommended: + +- User auth: SAML federation directly to Keycloak (realm `coder`), reusing the + existing IdP and persona/group model in + `docs/as-built/45-idp-sync-personas.md`. Map a Keycloak group to the AMG Admin + role and another to Viewer/Editor. +- Data source auth: the AMG workspace IAM role (service-managed) granted + read-only AMP query permissions (and CloudWatch read for the optional + CloudWatch data source). + +Dashboards: import Coder's published Grafana dashboards (the coderd and +workspace dashboards from the coder/coder monitoring docs; confirm exact source +and version at build time). Add the AMP workspace as the Prometheus data source +with SigV4 enabled. + +## 6. Audit and SIEM pipeline detail + +### 6.1 Why JSON logs carry audit data + +Coder's audit log is stored in the database and surfaced through the API and UI. +It is also emitted to the coderd process logs: Coder documents that server +errors, audit logs, user activities, and SSO/OIDC events are all captured in the +coderd logs. Enabling JSON logging therefore turns coderd stdout into a +machine-parsable audit stream that Fluent Bit can ship. This is the integration +seam for the SIEM. (The Coder audit API is an alternative pull-based source, but +the stdout-plus-Fluent-Bit path matches the EKS-native pattern requested.) + +### 6.2 Fluent Bit to CloudWatch Logs + +Run Fluent Bit as a DaemonSet (or use the `amazon-cloudwatch-observability` +addon, `v6.2.0` available). Tail the `coder` namespace container stdout, filter +to audit-bearing records, and write to CloudWatch Logs group `/coder/audit`. +Authenticate with IRSA role `usgov-coderdemo-fluentbit-cwl`. Set group retention +explicitly with `logs:PutRetentionPolicy` (for example 365 days; choose per +policy). Encrypt the group with a CMK if required. + +### 6.3 Firehose to S3, Glue, Athena + +A CloudWatch Logs subscription filter on `/coder/audit` streams to a Kinesis +Data Firehose delivery stream, which writes to an S3 bucket with date +partitioning (`s3://usgov-coderdemo-audit/coder/audit/year=YYYY/month=MM/day=DD/`). +Use Firehose dynamic partitioning and, optionally, Parquet conversion for cheaper +Athena scans. Catalog the prefix with a Glue table (or crawler) and query in +Athena (engine v3 available). Apply an S3 lifecycle policy for cold storage +(Glacier transition) and expiration aligned to the retention requirement; this S3 +archive, not the Coder database, is the long-term system of record. + +### 6.4 Optional: Amazon Security Lake (OCSF) + +For a compliance-grade SIEM, register Coder as a Security Lake custom source and +normalize audit events to OCSF (for example Authentication, Account Change, and +API Activity classes) using a Glue or Lambda transform. Security Lake then +manages the partitioned OCSF S3 store and exposes it to subscribers (Athena, +Security Hub, Detective). Caveats: Security Hub is not subscribed and Detective +has no graph yet (both verified); both must be enabled, and the Coder-to-OCSF +field mapping must be authored and maintained. + +### 6.5 Alerting on sensitive audit actions + +Sensitive actions to alert on include: license add/remove, user role or +organization-membership change, template create/push, external-auth or OIDC +config change, owner login, and repeated login failures. + +- Simplest path: CloudWatch Logs metric filters on `/coder/audit` patterns, with + CloudWatch Alarms that publish to SNS topic `coder-audit-alerts`. +- Richer routing: a CloudWatch Logs subscription filter to a small Lambda that + emits structured events to EventBridge; EventBridge rules then match action + types and target SNS (and other targets). EventBridge does not read CloudWatch + Logs content directly, so a forwarder is required. +- Dashboards-side: AMG Grafana alerts on metric thresholds (for example error + rate, provisioner failures) to the same SNS topic. + +## 7. IAM and IRSA requirements (least privilege) + +All roles trust the existing cluster OIDC provider +(`arn:aws-us-gov:iam::430737322961:oidc-provider/oidc.eks.us-gov-west-1.amazonaws.com/id/E9DB9E591C95ECB91F44EDCF38F146F2`) +with `aud = sts.amazonaws.com` and a `sub` pinned to the exact service account, +exactly like `terraform/irsa.tf`. EKS Pod Identity is an available alternative +(addon present) if the team prefers it over IRSA. + +### 7.1 ADOT scraper role `usgov-coderdemo-adot-amp` + +- Trust `sub = system:serviceaccount:observability:adot-collector`. +- Policy: `aps:RemoteWrite` on + `arn:aws-us-gov:aps:us-gov-west-1:430737322961:workspace/`. + +### 7.2 Fluent Bit role `usgov-coderdemo-fluentbit-cwl` + +- Trust `sub = system:serviceaccount:amazon-cloudwatch:fluent-bit`. +- Policy: `logs:CreateLogStream`, `logs:PutLogEvents`, + `logs:DescribeLogStreams`, `logs:PutRetentionPolicy`, and + `logs:CreateLogGroup` (scoped to + `arn:aws-us-gov:logs:us-gov-west-1:430737322961:log-group:/coder/audit:*`). + +### 7.3 Firehose service role `usgov-coderdemo-firehose-audit` + +- Trust principal `firehose.amazonaws.com`. +- Policy: `s3:PutObject`/`s3:GetBucketLocation`/`s3:ListBucket` on the audit + bucket and prefix; `kms:GenerateDataKey`/`kms:Decrypt` on the bucket CMK; + `logs:PutLogEvents` to the Firehose error log group; Glue read if format + conversion is enabled. + +### 7.4 CloudWatch Logs to Firehose role `usgov-coderdemo-cwl-to-firehose` + +- Trust principal `logs.us-gov-west-1.amazonaws.com`. +- Policy: `firehose:PutRecord`, `firehose:PutRecordBatch` on the delivery stream. + +### 7.5 AMG workspace IAM role (data sources) + +- Used by the AMG workspace to query data. +- Policy: `aps:QueryMetrics`, `aps:GetLabels`, `aps:GetSeries`, + `aps:GetMetricMetadata`, `aps:ListWorkspaces` on the AMP workspace; optional + CloudWatch read (`cloudwatch:GetMetricData`, `logs:StartQuery`, + `logs:GetQueryResults`) for the CloudWatch data source. + +### 7.6 EventBridge to SNS and Security Lake roles (optional) + +- EventBridge target permission via an SNS topic resource policy allowing + `events.amazonaws.com` to `sns:Publish`, or an EventBridge role. +- Security Lake custom-source role and the transform (Glue/Lambda) execution + role if the OCSF path is adopted. + +## 8. Coder configuration required + +Add to `deploy/coder/values.yaml` `coder.env` (declarative; apply via Helm). +Note `CODER_LOG_FORMAT` is not a current Coder key; the real keys are +`CODER_LOGGING_HUMAN` and `CODER_LOGGING_JSON`. + +```yaml +# Metrics +- name: CODER_PROMETHEUS_ENABLE + value: "true" +- name: CODER_PROMETHEUS_ADDRESS + value: "0.0.0.0:2112" # already live +- name: CODER_PROMETHEUS_COLLECT_AGENT_STATS + value: "true" +- name: CODER_PROMETHEUS_COLLECT_DB_METRICS + value: "false" # enable deliberately; can be high-cardinality +# Structured logging for SIEM (audit events are in coderd logs) +- name: CODER_LOGGING_JSON + value: "/dev/stderr" +- name: CODER_LOGGING_HUMAN + value: "/dev/null" # avoid duplicate human-format lines +``` + +Also expose the metrics port so the scraper can reach it. Either add `2112` as a +named container port plus a small headless Service, or select the pod with a +PodMonitor / Prometheus pod annotations. Keep `:2112` cluster-internal; do not +route it through the ingress NLB. + +Retention notes: + +- Coder has no built-in audit-log TTL; the `audit_log` table grows unbounded in + RDS. Treat S3 (with a lifecycle policy) and the CloudWatch Logs retention + setting as the retention controls for the SIEM, and track RDS audit-table + growth as a separate day-2 item. +- The `connection_log` and `audit_log` premium features are already entitled and + enabled (`docs/as-built/30-coder-control-plane.md`). + +## 9. GovCloud caveats (consolidated) + +- AMP managed collector (scraper) is not available in `us-gov-west-1` (verified + by probe). Use a self-managed ADOT collector with SigV4. This is the single + biggest divergence from a typical commercial-region design. +- AMG auth needs IAM Identity Center or SAML. Identity Center is not enabled and + the account is standalone (verified). SAML to Keycloak is recommended; turning + on Identity Center is optional extra scope. +- Security Hub is not subscribed and Detective has no graph (verified). The + Security Lake / Security Hub / Detective path requires enabling these first and + authoring the OCSF mapping. +- Confirm whether FIPS endpoints are mandated. AMP, Firehose, and CloudWatch + Logs publish FIPS endpoints in GovCloud; the ADOT `sigv4auth` endpoint and + Firehose destination may need the `-fips` hostnames. +- No ECR pull-through cache in GovCloud. ADOT, Fluent Bit, and any collector + images must be mirrored into ECR per `scripts/mirror-images.sh` + (`docs/as-built/10-infrastructure.md`). +- Coder's egress to public Anthropic already uses the single NAT gateway; the + observability pipeline should prefer VPC endpoints (interface endpoints for + AMP, CloudWatch Logs, Firehose, STS, S3 gateway) to keep telemetry traffic in + the boundary and off the NAT path. Verify endpoint availability per service in + `us-gov-west-1`. + +## 10. Migration path from the in-cluster demo stack + +The separate workstream is standing up in-cluster Prometheus plus Grafana for +the demo. Evolve, do not rebuild: + +Keep: + +- The Coder metrics configuration (section 8) and the metric set. +- Grafana dashboards and alert rules. Re-import the same dashboard JSON into AMG; + port Alertmanager rules to AMG alerts or to CloudWatch alarms. +- The scrape selection logic (PodMonitor/labels) can be reused by the ADOT + prometheus receiver. + +Replace: + +- In-cluster Prometheus server to AMP. As a low-risk bridge, configure the + existing in-cluster Prometheus `remote_write` to AMP (dual-run) before cutting + the scraper over to a standalone ADOT collector, then retire the in-cluster + Prometheus. +- In-cluster Grafana to AMG (SAML to Keycloak). +- In-cluster Alertmanager to AMG alerts plus SNS. + +Sequence: enable Coder config first (works for both stacks), add AMP and +`remote_write` from the existing Prometheus, stand up AMG against AMP, validate +dashboards, then remove the in-cluster Prometheus/Grafana/Alertmanager. + +## 11. Phased rollout + +| Phase | Scope | Risk | Depends on | +|---|---|---|---| +| 0 | Coder config: enable Prometheus, expose `:2112`, JSON logging, retention plan | Low (Helm rev) | none | +| 1 | AMP workspace plus ADOT collector (IRSA, SigV4 remote_write); verify series in AMP | Medium | Phase 0 | +| 2 | AMG workspace, SAML to Keycloak, AMP data source, import Coder dashboards | Medium | Phase 1 | +| 3 | Fluent Bit DaemonSet to CloudWatch Logs `/coder/audit` (IRSA), set retention | Low/Medium | Phase 0 | +| 4 | Firehose to S3 (date-partitioned) plus Glue plus Athena | Medium | Phase 3 | +| 5 | Alerting: CloudWatch metric filters / EventBridge to SNS, plus AMG alerts | Low | Phases 2 and 3 | +| 6 | Optional: Security Lake OCSF custom source plus Security Hub/Detective | High | Phase 4 | +| 7 | Decommission in-cluster demo stack; reconcile all IAM/IRSA into Terraform | Medium | Phases 2 and 4 | + +## 12. Open questions and to-verify + +- AMP and Firehose FIPS endpoint requirement for this boundary. +- Exact source and version of Coder's published Grafana dashboards to import. +- Retention numbers (CloudWatch Logs days, S3 lifecycle, AMP days) per the + compliance requirement. +- Whether to standardize on IRSA or EKS Pod Identity for new workloads (both + available); this plan uses IRSA to match existing precedent. +- CMK strategy (per-service CMKs vs a shared observability CMK) for AMP, the S3 + audit bucket, and the CloudWatch Logs group. + +## Appendix: verification commands (read-only, 2026-06-07) + +``` +aws sts get-caller-identity +aws amp list-workspaces --region us-gov-west-1 +aws amp list-scrapers --region us-gov-west-1 # AccessDenied: unsupported op +aws grafana list-workspaces --region us-gov-west-1 +aws sso-admin list-instances --region us-gov-west-1 # [] +aws organizations describe-organization # not in an org +aws securitylake list-data-lakes --region us-gov-west-1 +aws securityhub describe-hub --region us-gov-west-1 # not subscribed +aws detective list-graphs --region us-gov-west-1 +aws firehose list-delivery-streams --region us-gov-west-1 +aws athena list-work-groups --region us-gov-west-1 +aws glue get-databases --region us-gov-west-1 +aws events list-event-buses --region us-gov-west-1 +aws sns list-topics --region us-gov-west-1 +aws logs describe-log-groups --region us-gov-west-1 +aws kms list-aliases --region us-gov-west-1 +aws eks describe-cluster --name usgov-coderdemo --region us-gov-west-1 +aws eks describe-addon-versions --addon-name adot --region us-gov-west-1 +aws eks describe-addon-versions --addon-name amazon-cloudwatch-observability --region us-gov-west-1 +aws eks describe-addon-versions --addon-name eks-pod-identity-agent --region us-gov-west-1 +``` + +Generated by Coder Agents. From bea8f9311c2b453f7bee05355bc33eb32aa39f5b Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 18:54:28 +0000 Subject: [PATCH 08/16] feat: in-cluster Prometheus + Grafana observability for Coder Add an in-boundary, in-cluster observability stack and wire Coder into it, so the demo shows live control-plane metrics and dashboards without leaving the GovCloud boundary. The AWS-native managed variant (AMP/AMG) is planned separately in docs/plans/ and intentionally not built here. Stack (deploy/observability/, Helm release kps, ns monitoring): - kube-prometheus-stack 86.2.0 (Prometheus + Grafana + operator), trimmed for the demo: Alertmanager, node-exporter, kube-state-metrics, bundled rules, and the EKS control-plane ServiceMonitors are off; the kubelet ServiceMonitor is kept for cAdvisor container CPU/memory. Images mirrored into ECR (scripts/images.txt) and the chart overridden to the mirror. - coder-metrics.yaml: a headless Service (ns coder, :2112) selecting only the control-plane pod, plus ServiceMonitor/coder. Prometheus discovers it (serviceMonitorSelectorNilUsesHelmValues=false); up{job="coder-metrics"}=1. - dashboards-coder.yaml: six Prometheus-backed Coder dashboards from github.com/coder/observability as sidecar-imported ConfigMaps, rendering live data. Log-only panels and the agent-boundaries dashboard are omitted (no Loki). - grafana-ingress.yaml: host grafana.usgov.coderdemo.io behind the existing NLB (ACM wildcard TLS); HTTP 200 with valid TLS. Coder server (deploy/coder/values.yaml): ADD only, to respect the coderd AI-provider drift guard. - CODER_PROMETHEUS_ENABLE=true, CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112, CODER_PROMETHEUS_COLLECT_AGENT_STATS=true. - Structured JSON logs for SIEM readiness: CODER_LOGGING_JSON=/dev/stderr and CODER_LOGGING_HUMAN=/dev/null. Coder has no single CODER_LOG_FORMAT flag; JSON is selected by pointing CODER_LOGGING_JSON at a sink. Secrets: the Grafana admin password lives in AWS Secrets Manager (usgov-coderdemo/observability/grafana) and is synced into the grafana-admin Secret by a new ExternalSecret; no password in git. Audit: licensed audit logging is already entitled and on (/audit); the JSON server logs make coderd shippable to a downstream SIEM. Verified live: coder Helm rev 5 healthy (1/1); monitoring pods Running (grafana 3/3, prometheus 2/2, operator 1/1); grafana + dev hosts return 200; the grafana-admin ExternalSecret is SecretSynced; the Coder Control Plane dashboard renders live data end to end. Docs: docs/as-built/55-observability.md; updated the as-built README, the docs index, and STATUS.md. Generated by Coder Agents. --- STATUS.md | 38 +- deploy/coder/values.yaml | 25 + deploy/observability/README.md | 135 + deploy/observability/coder-metrics.yaml | 57 + deploy/observability/dashboards-coder.yaml | 9069 +++++++++++++++++ deploy/observability/grafana-ingress.yaml | 34 + .../kube-prometheus-stack-values.yaml | 164 + deploy/observability/namespace.yaml | 9 + .../secretstore-and-externalsecrets.yaml | 17 + docs/00-INDEX.md | 11 + docs/as-built/55-observability.md | 179 + docs/as-built/README.md | 1 + scripts/images.txt | 10 + 13 files changed, 9748 insertions(+), 1 deletion(-) create mode 100644 deploy/observability/README.md create mode 100644 deploy/observability/coder-metrics.yaml create mode 100644 deploy/observability/dashboards-coder.yaml create mode 100644 deploy/observability/grafana-ingress.yaml create mode 100644 deploy/observability/kube-prometheus-stack-values.yaml create mode 100644 deploy/observability/namespace.yaml create mode 100644 docs/as-built/55-observability.md diff --git a/STATUS.md b/STATUS.md index 6c37c5a..82f9c21 100644 --- a/STATUS.md +++ b/STATUS.md @@ -152,5 +152,41 @@ gated; Nova Pro is the proven fallback. `terraform/secrets-hardening.tf`. - See `docs/as-built/85-secrets-management.md`. +## Observability (in-cluster Prometheus + Grafana) +- [x] **In-boundary metrics + dashboards** via the + `prometheus-community/kube-prometheus-stack` Helm release `kps` (ns + `monitoring`, ECR-mirrored images). Prometheus (2/2), Grafana (3/3), and + the operator (1/1) are healthy. +- [x] **Coder Prometheus metrics enabled** (`CODER_PROMETHEUS_ENABLE=true`, + `CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112`, agent stats on). A headless + `coder-metrics` Service + ServiceMonitor scrapes the control plane; + Prometheus `up{job="coder-metrics"}` is `1`. +- [x] **Six Coder Grafana dashboards** (from `github.com/coder/observability`) + render live data at `https://grafana.usgov.coderdemo.io` (valid TLS, + HTTP 200). Grafana admin password lives in AWS Secrets Manager + (`usgov-coderdemo/observability/grafana`) and is synced by ESO. +- [x] **Structured JSON server logs** (`CODER_LOGGING_JSON=/dev/stderr`, + `CODER_LOGGING_HUMAN=/dev/null`) make coderd SIEM-ready; audit logging is + entitled + on (`/audit`). +- [ ] AWS-native managed variant (AMP + AMG, CloudWatch -> Security Lake) is the + production target, planned only. See + [`docs/plans/observability-aws-native.md`](docs/plans/observability-aws-native.md) + and issues #13-#20. +- See `docs/as-built/55-observability.md` and `deploy/observability/README.md`. + +## Planned (design + issues, nothing applied) +- [ ] **GitOps control plane** (Argo CD, sourced from the in-cluster GitLab, + app-of-apps over the existing `deploy/` paths, adopt-in-place): + [`docs/plans/gitops-control-plane.md`](docs/plans/gitops-control-plane.md), + issues #6-#12. +- [ ] **Per-workload GitOps adoption** + non-Kubernetes app state (Coder API via + Argo Jobs, Keycloak via keycloak-config-cli, AWS stays Terraform): + [`docs/plans/gitops-adoption.md`](docs/plans/gitops-adoption.md), + issues #21-#29. +- [ ] **AWS-native observability** (AMP/AMG, CloudWatch/Firehose/S3/Athena, + optional Security Lake): + [`docs/plans/observability-aws-native.md`](docs/plans/observability-aws-native.md), + issues #13-#20. + ## Out of scope (demo) -OpenShift, Istio, observability. +OpenShift, Istio. diff --git a/deploy/coder/values.yaml b/deploy/coder/values.yaml index 71fdd08..afa674d 100644 --- a/deploy/coder/values.yaml +++ b/deploy/coder/values.yaml @@ -214,6 +214,31 @@ coder: - name: AWS_STS_REGIONAL_ENDPOINTS value: "regional" + # --- Observability: Prometheus metrics -------------------------------- + # Serve coderd_* and Go runtime metrics so the in-cluster Prometheus can + # scrape them. Bind to 0.0.0.0 (not the 127.0.0.1 default) so the metrics + # endpoint is reachable from the cluster network; a separate ClusterIP + # Service `coder-metrics` (deploy/observability) exposes port 2112 and a + # ServiceMonitor scrapes it. COLLECT_AGENT_STATS adds per-workspace agent + # metrics (connections, RX/TX, latency) used by the workspace dashboards. + - name: CODER_PROMETHEUS_ENABLE + value: "true" + - name: CODER_PROMETHEUS_ADDRESS + value: "0.0.0.0:2112" + - name: CODER_PROMETHEUS_COLLECT_AGENT_STATS + value: "true" + + # --- Observability: structured (JSON) logs ---------------------------- + # Emit machine-parseable JSON logs to stderr so the cluster log pipeline + # can ship them to a SIEM. The human-readable stream (default /dev/stderr) + # is silenced to /dev/null so stderr carries JSON only, avoiding duplicate + # log lines. Coder has no single CODER_LOG_FORMAT flag; JSON output is + # selected by pointing CODER_LOGGING_JSON at a sink (here, stderr). + - name: CODER_LOGGING_HUMAN + value: "/dev/null" + - name: CODER_LOGGING_JSON + value: "/dev/stderr" + # Single replica for the demo. HA (replicaCount > 1) is an Enterprise feature # and is out of scope. replicaCount: 1 diff --git a/deploy/observability/README.md b/deploy/observability/README.md new file mode 100644 index 0000000..07785d9 --- /dev/null +++ b/deploy/observability/README.md @@ -0,0 +1,135 @@ +# Observability stack (in-cluster metrics + dashboards) + +In-boundary, in-cluster metrics and dashboards for the GovCloud demo. It scrapes +the Coder control plane's Prometheus metrics and renders Coder's prebuilt +Grafana dashboards with live data, reachable over HTTPS at +`https://grafana.usgov.coderdemo.io`. + +This is the reliable in-cluster implementation. The AWS-native managed variant +(Amazon Managed Prometheus / Grafana, Security Lake) is planned separately and +is not built here. + +## What runs + +| Piece | Detail | +|---|---| +| Helm release | `kps` = `prometheus-community/kube-prometheus-stack` chart `86.2.0` (prometheus-operator `v0.91.0`), namespace `monitoring`. Values: `kube-prometheus-stack-values.yaml`. | +| Prometheus | StatefulSet `prometheus-kps-kube-prometheus-stack-prometheus`, 20Gi gp3 PVC, 7d retention. Service `kps-kube-prometheus-stack-prometheus:9090`. | +| Grafana | Deployment `kps-grafana`, 5Gi gp3 PVC. Service `kps-grafana:80`. Admin password from AWS Secrets Manager via ESO. | +| Prometheus operator | Deployment `kps-kube-prometheus-stack-operator`. Admission webhooks disabled. | +| Coder scrape | `coder-metrics` headless Service (port 2112) + `ServiceMonitor/coder`, both in namespace `coder`. Prometheus job `coder-metrics`. | +| Dashboards | Six Coder dashboards as ConfigMaps in `monitoring`, imported by the Grafana sidecar (label `grafana_dashboard: "1"`). | +| Ingress | `grafana` Ingress (className `nginx`, host `grafana.usgov.coderdemo.io`, TLS terminated upstream at the NLB). | + +Disabled to keep the demo lean and cut image mirroring: Alertmanager, +node-exporter, kube-state-metrics, bundled alert rules, and the managed EKS +control-plane ServiceMonitors. The kubelet ServiceMonitor is kept so cAdvisor +container CPU and memory metrics power the dashboards' resource panels. + +## Images (ECR mirror) + +GovCloud has no pull-through cache, so every image is mirrored into private ECR +(`scripts/images.txt` + `scripts/mirror-images.sh`) and the chart values point +at the mirror: + +- `quay/prometheus/prometheus:v3.12.0-distroless` +- `quay/prometheus-operator/prometheus-operator:v0.91.0` +- `quay/prometheus-operator/prometheus-config-reloader:v0.91.0` +- `docker-hub/grafana/grafana:13.0.1-security-01` +- `quay/kiwigrid/k8s-sidecar:2.7.3` + +## The scrape path + +1. `coderd` serves Prometheus metrics on `0.0.0.0:2112` (env vars + `CODER_PROMETHEUS_ENABLE`, `CODER_PROMETHEUS_ADDRESS`, + `CODER_PROMETHEUS_COLLECT_AGENT_STATS` in `deploy/coder/values.yaml`). The + Coder chart's own Service has no metrics port, so `coder-metrics.yaml` adds a + headless Service that exposes 2112 for the control-plane pod. +2. `ServiceMonitor/coder` selects that Service. Prometheus is configured with + `serviceMonitorSelectorNilUsesHelmValues: false`, so it discovers the + ServiceMonitor across namespaces. Scraping adds `namespace` and `pod` target + labels. +3. The Coder dashboards filter on `namespace="coder"` and `pod=~"coder.*"`, + which the scraped series satisfy, so panels render without extra config. + +## Dashboards + +`dashboards-coder.yaml` carries six Prometheus-backed dashboards taken from +`github.com/coder/observability` (`compiled/resources.yaml`): Coder Control +Plane (`coderd`), Coder Status (`coder-status`), Coder Prebuilds, Coder +Provisioners, Coder Workspaces, and Coder Workspace Detail. Every panel targets +datasource uid `prometheus`, which the kube-prometheus-stack Grafana +auto-provisions and marks default. + +The purely log-based `agent-boundaries` dashboard is omitted, and a few log +panels inside the workspaces / provisionerd / workspace-detail dashboards show +no data, because this stack ships metrics only (no Loki). Their Prometheus +panels render live. + +## Grafana admin credentials (ESO + AWS Secrets Manager) + +The admin password is generated once and stored as JSON +`{"admin-user","admin-password"}` in AWS Secrets Manager at +`usgov-coderdemo/observability/grafana`. The ESO `ClusterSecretStore` +`aws-secretsmanager` syncs it into the Kubernetes Secret `grafana-admin` +(namespace `monitoring`) via the ExternalSecret added to +`deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml`. Grafana +consumes it through `admin.existingSecret`. The ESO IAM role only allows reading +`usgov-coderdemo/*`, so this path is in policy. No password is committed to git. + +Rotate by writing a new value to the ASM secret, then deleting the +`grafana-admin` Secret (ESO rebuilds it) or waiting for the 1h refresh, and +restart the Grafana pod to pick up the env value. + +## Reproduce + +```sh +. ~/.config/usgov-coderdemo/env +export KUBECONFIG=./kubeconfig + +# 1. Mirror the observability images into ECR. +bash scripts/mirror-images.sh + +# 2. Enable Coder metrics + JSON logs (already in deploy/coder/values.yaml). +helm upgrade coder ~/.cache/helm/repository/coder_helm_2.34.0.tgz \ + --namespace coder --values deploy/coder/values.yaml --timeout 6m +kubectl -n coder rollout status deploy/coder + +# 3. Grafana admin secret in ASM (generate; pass via a mode-600 file, not argv). +# aws secretsmanager create-secret \ +# --name usgov-coderdemo/observability/grafana \ +# --secret-string file:///path/to/grafana.json # {"admin-user","admin-password"} + +# 4. Namespace + ESO ExternalSecret for the Grafana admin secret. +kubectl apply -f deploy/observability/namespace.yaml +kubectl apply -f deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml +kubectl -n monitoring get externalsecret grafana-admin # Ready=SecretSynced + +# 5. Install the stack. +helm install kps ~/.cache/helm/repository/kube-prometheus-stack-86.2.0.tgz \ + --namespace monitoring --values deploy/observability/kube-prometheus-stack-values.yaml --timeout 8m + +# 6. Coder scrape target, Grafana Ingress, dashboards. +kubectl apply -f deploy/observability/coder-metrics.yaml +kubectl apply -f deploy/observability/grafana-ingress.yaml +kubectl apply -f deploy/observability/dashboards-coder.yaml +``` + +To regenerate `dashboards-coder.yaml` from upstream, extract the +`coder-dashboard-*` ConfigMaps from +`https://raw.githubusercontent.com/coder/observability/main/compiled/resources.yaml`, +relabel them with `grafana_dashboard: "1"`, set namespace `monitoring`, and drop +the `agent-boundaries` (Loki-only) dashboard. + +## Verify + +```sh +# Coder target UP +kubectl -n monitoring port-forward svc/kps-kube-prometheus-stack-prometheus 9090:9090 & +curl -s 'http://localhost:9090/api/v1/query?query=up{job="coder-metrics"}' + +# Grafana over HTTPS (valid TLS) + datasource + dashboards (admin from ASM) +GPW=$(kubectl -n monitoring get secret grafana-admin -o jsonpath='{.data.admin-password}' | base64 -d) +curl -s -o /dev/null -w '%{http_code} ssl=%{ssl_verify_result}\n' https://grafana.usgov.coderdemo.io/login +curl -s -u "admin:$GPW" 'https://grafana.usgov.coderdemo.io/api/search?type=dash-db&query=Coder' +``` diff --git a/deploy/observability/coder-metrics.yaml b/deploy/observability/coder-metrics.yaml new file mode 100644 index 0000000..89e7df8 --- /dev/null +++ b/deploy/observability/coder-metrics.yaml @@ -0,0 +1,57 @@ +# Scrape path for the Coder control plane's Prometheus metrics. +# +# The Coder Helm chart's own Service only exposes the HTTP app port (80 -> 8080) +# and has no metrics port. coderd serves Prometheus metrics on :2112 (enabled by +# CODER_PROMETHEUS_ENABLE + CODER_PROMETHEUS_ADDRESS=0.0.0.0:2112 in +# deploy/coder/values.yaml). This headless Service exposes that port for the +# control-plane pod only, and the ServiceMonitor below tells the in-cluster +# Prometheus to scrape it. Prometheus adds `namespace` and `pod` target labels, +# which the Coder dashboards filter on (namespace `coder`, pod `coder.*`). +apiVersion: v1 +kind: Service +metadata: + name: coder-metrics + namespace: coder + labels: + app.kubernetes.io/name: coder + app.kubernetes.io/part-of: coder + app.kubernetes.io/component: metrics +spec: + type: ClusterIP + clusterIP: None + # Select only the Coder control-plane pod. The external provisioner pods do + # not carry app.kubernetes.io/name=coder, so they are excluded (they do not + # serve :2112). + selector: + app.kubernetes.io/instance: coder + app.kubernetes.io/name: coder + ports: + - name: metrics + port: 2112 + targetPort: 2112 + protocol: TCP +--- +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: coder + namespace: coder + labels: + app.kubernetes.io/name: coder + app.kubernetes.io/component: metrics + # Belt-and-suspenders: the stack also selects ServiceMonitors regardless of + # this label (serviceMonitorSelectorNilUsesHelmValues: false). + release: kps +spec: + namespaceSelector: + matchNames: + - coder + selector: + matchLabels: + app.kubernetes.io/name: coder + app.kubernetes.io/component: metrics + endpoints: + - port: metrics + path: /metrics + interval: 30s + scheme: http diff --git a/deploy/observability/dashboards-coder.yaml b/deploy/observability/dashboards-coder.yaml new file mode 100644 index 0000000..c20ba60 --- /dev/null +++ b/deploy/observability/dashboards-coder.yaml @@ -0,0 +1,9069 @@ +# Coder Grafana dashboards (generated, do not hand-edit). +# +# Source: github.com/coder/observability compiled/resources.yaml (the chart's +# rendered output). These are the six Prometheus-backed Coder dashboards. The +# selectors are already expanded for this deployment (namespace `coder`, +# pods `coder.*`), and every panel points at datasource uid `prometheus`, which +# the kube-prometheus-stack Grafana auto-provisions. +# +# The Grafana sidecar imports any ConfigMap labelled `grafana_dashboard: "1"` +# from any namespace, so these live in the `monitoring` namespace next to +# Grafana. Regenerate with deploy/observability/README.md instructions. +# +# The purely log-based `agent-boundaries` dashboard is intentionally omitted: +# this stack ships metrics only (no Loki). A few panels in the workspaces, +# provisionerd, and workspace-detail dashboards are log-based and will show no +# data for the same reason; their Prometheus panels render live. +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: coder-dashboard-coderd + namespace: monitoring + labels: + grafana_dashboard: "1" + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: grafana-dashboard +data: + coder-coderd.json: |- + { + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "links": [], + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Down" + }, + "properties": [ + { + "id": "thresholds", + "value": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 0, + "y": 0 + }, + "id": 10, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "value_and_name", + "wideLayout": false + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "count(up{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`} == 1) or vector(0)", + "instant": true, + "legendFormat": "Up", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(count(up{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`} == 0) or vector(0)) > 0", + "hide": false, + "instant": true, + "legendFormat": "Down", + "range": false, + "refId": "B" + } + ], + "title": "Replicas", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 6, + "w": 6, + "x": 6, + "y": 0 + }, + "id": 18, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "One or more replicas are required to be running in order to serve the control-plane.\n\nSee [High Availability](https://coder.com/docs/v2/latest/admin/high-availability) for details on how to\nrun multiple `coderd` replicas.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "#EAB839", + "value": 0.9 + }, + { + "color": "red", + "value": 1 + } + ] + }, + "unit": "percentunit" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Enabled" + }, + "properties": [ + { + "id": "mappings", + "value": [ + { + "options": { + "0": { + "index": 1, + "text": "No" + }, + "1": { + "index": 0, + "text": "Yes" + } + }, + "type": "value" + }, + { + "options": { + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + } + ] + }, + { + "id": "thresholds", + "value": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 12, + "y": 0 + }, + "id": 32, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "value_and_name", + "wideLayout": false + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_license_user_limit_enabled)", + "instant": true, + "legendFormat": "Enabled", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(\n max(coderd_license_active_users) / max(coderd_license_limit_users)\n) > 0", + "hide": false, + "instant": false, + "legendFormat": "Usage", + "range": true, + "refId": "B" + } + ], + "title": "Enterprise License", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 6, + "w": 6, + "x": 18, + "y": 0 + }, + "id": 33, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "If you would like to try Coder's [Enterprise features](https://coder.com/docs/v2/latest/enterprise), you can [request a trial license](https://coder.com/docs/v2/latest/faqs#how-do-i-add-an-enterprise-license).", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/(Requested|Limit)/" + }, + "properties": [ + { + "id": "custom.lineStyle", + "value": { + "dash": [ + 0, + 10 + ], + "fill": "dot" + } + }, + { + "id": "custom.fillOpacity", + "value": 5 + }, + { + "id": "custom.drawStyle", + "value": "line" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Requested" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Limit" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 0, + "y": 6 + }, + "id": 25, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`}[$__rate_interval]))", + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "max(kube_pod_container_resource_limits{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`, resource=\"cpu\"})", + "hide": false, + "instant": false, + "legendFormat": "Limit", + "range": true, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "max(kube_pod_container_resource_requests{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`, resource=\"cpu\"})", + "hide": false, + "instant": false, + "legendFormat": "Requested", + "range": true, + "refId": "B" + } + ], + "title": "CPU Usage Seconds", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 6, + "w": 6, + "x": 6, + "y": 6 + }, + "id": 26, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "The cumulative CPU used per core-second. If `coderd` was using a full CPU core, that would be represented as 1 second.\n\nRequests & limits are shown if set.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "fixedColor": "red", + "mode": "shades" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 0, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Requested" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Limit" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 4, + "x": 12, + "y": 6 + }, + "id": 30, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum by (reason) (\n count_over_time(kube_pod_container_status_terminated_reason{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`}[$__interval])\n)", + "hide": false, + "instant": false, + "legendFormat": "{{reason}}", + "range": true, + "refId": "C" + } + ], + "title": "Terminations", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "decimals": 0, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 0.0001 + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 6, + "w": 2, + "x": 16, + "y": 6 + }, + "id": 34, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(increase(kube_pod_container_status_restarts_total{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`}[$__range]))", + "hide": false, + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "B" + } + ], + "title": "Restarts", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 6, + "w": 6, + "x": 18, + "y": 6 + }, + "id": 31, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "Pods can be terminated for several reasons:\n- `OOMKilled`: pod exceeded its defined memory limit or was terminated by the OS for using excessive memory (if no limit defined)\n- `Error`: usually attributeable to a configuration problem\n- `Evicted`: pod has been evicted from node for overusing resources and will be rescheduled on another node is possible", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "bytes" + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/(Requested|Limit)/" + }, + "properties": [ + { + "id": "custom.lineStyle", + "value": { + "dash": [ + 0, + 10 + ], + "fill": "dot" + } + }, + { + "id": "custom.fillOpacity", + "value": 5 + }, + { + "id": "custom.drawStyle", + "value": "line" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Requested" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Limit" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 0, + "y": 12 + }, + "id": 29, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max by (pod) (container_memory_working_set_bytes{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`})", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "max(kube_pod_container_resource_limits{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`, resource=\"memory\"})", + "hide": false, + "instant": false, + "legendFormat": "Limit", + "range": true, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "max(kube_pod_container_resource_requests{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`, resource=\"memory\"})", + "hide": false, + "instant": false, + "legendFormat": "Requested", + "range": true, + "refId": "B" + } + ], + "title": "RAM Usage", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 6, + "w": 6, + "x": 6, + "y": 12 + }, + "id": 28, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "This shows the total memory used by each `coderd` container; it is the same metric which the [OOM killer](https://www.kernel.org/doc/gorman/html/understand/understand016.html) uses.\n\nRequests & limits are shown if set.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "orange", + "value": 100 + }, + { + "color": "red", + "value": 500 + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Errors" + }, + "properties": [ + { + "id": "unit", + "value": "short" + }, + { + "id": "thresholds", + "value": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + } + } + ] + } + ] + }, + "gridPos": { + "h": 3, + "w": 4, + "x": 12, + "y": 12 + }, + "id": 16, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "quantile(0.5, coder_pubsub_send_latency_seconds)", + "instant": false, + "legendFormat": "Send", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "quantile(0.5, coder_pubsub_receive_latency_seconds)", + "hide": false, + "instant": false, + "legendFormat": "Receive", + "range": true, + "refId": "B" + } + ], + "title": "Pubsub Latency (Median)", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Errors" + }, + "properties": [ + { + "id": "unit", + "value": "short" + }, + { + "id": "thresholds", + "value": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 2, + "x": 16, + "y": 12 + }, + "id": 22, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "(\n sum(increase(coder_pubsub_latency_measure_errs_total[$__range]))\n / count(coder_pubsub_latency_measure_errs_total)\n) or vector(0)", + "hide": false, + "instant": false, + "legendFormat": "Errors", + "range": true, + "refId": "B" + } + ], + "title": "Pubsub Errors", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 6, + "w": 6, + "x": 18, + "y": 12 + }, + "id": 19, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "`coderd` uses Postgres for passing messages between subcomponents for coordination and signalling;\nthis is called \"pubsub\" (or publish-subscribe).\n\nWe measure the time for messages to be sent and received. Latencies higher than 500ms will likely lead to\nyour Coder deployment feeling sluggish. High latency is usually an indication that your Postgres server is under-resourced on CPU.\n\nHigh values for median should be concerning,\nwhile the 90th percentile shows the outliers.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "orange", + "value": 100 + }, + { + "color": "red", + "value": 500 + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Errors" + }, + "properties": [ + { + "id": "unit", + "value": "short" + }, + { + "id": "thresholds", + "value": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + } + } + ] + } + ] + }, + "gridPos": { + "h": 3, + "w": 4, + "x": 12, + "y": 15 + }, + "id": 21, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "mean" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "quantile(0.9, coder_pubsub_send_latency_seconds)", + "instant": false, + "legendFormat": "Send", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "quantile(0.9, coder_pubsub_receive_latency_seconds)", + "hide": false, + "instant": false, + "legendFormat": "Receive", + "range": true, + "refId": "B" + } + ], + "title": "Pubsub Latency (P90)", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 0, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + }, + "unit": "reqps" + }, + "overrides": [] + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 0, + "y": 18 + }, + "id": 35, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by(pod) (rate(coderd_api_requests_processed_total{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`}[$__rate_interval]))", + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "API Requests", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 6, + "w": 6, + "x": 6, + "y": 18 + }, + "id": 36, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "This shows the number of requests per second each `coderd` replica is handling.\n\nHeavy skewing towards a single `coderd` replica indicates faulty loadbalancing.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + } + ], + "refresh": "30s", + "schemaVersion": 39, + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-12h", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "Coder Control Plane", + "uid": "coderd", + "version": 6, + "weekStart": "" + } +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: coder-dashboard-status + namespace: monitoring + labels: + grafana_dashboard: "1" + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: grafana-dashboard +data: + coder-status.json: |- + { + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": false, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "links": [], + "panels": [ + { + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 9, + "title": "Application", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Down" + }, + "properties": [ + { + "id": "thresholds", + "value": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 1 + } + ] + } + } + ] + } + ] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 0, + "y": 1 + }, + "id": 10, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "value_and_name", + "wideLayout": false + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "count(up{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`} == 1) or vector(0) > 0", + "instant": true, + "legendFormat": "Up", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "count(up{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`} == 0) or vector(0) > 0", + "hide": false, + "instant": true, + "legendFormat": "Down", + "range": false, + "refId": "B" + } + ], + "title": "Coder Replicas", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 4, + "y": 1 + }, + "id": 16, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "value_and_name", + "wideLayout": false + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(coderd_provisionerd_num_daemons{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`})", + "instant": true, + "legendFormat": "Built-in", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(coderd_provisionerd_num_daemons{pod=~`coder-provisioner.*`, namespace=`coder`})", + "hide": false, + "instant": true, + "legendFormat": "External", + "range": false, + "refId": "B" + } + ], + "title": "Provisioners", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + } + }, + "mappings": [] + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "failed" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + }, + { + "id": "displayName", + "value": "Failed" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "success" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + }, + { + "id": "displayName", + "value": "Success" + } + ] + } + ] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 8, + "y": 1 + }, + "id": 17, + "options": { + "displayLabels": [ + "name", + "value" + ], + "legend": { + "displayMode": "list", + "placement": "bottom", + "showLegend": true, + "values": [ + "percent" + ] + }, + "pieType": "pie", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "tooltip": { + "mode": "multi", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "round(sum by (status) (increase(coderd_provisionerd_job_timings_seconds_count{pod!=``}[$__range])))", + "instant": true, + "legendFormat": "{{status}}", + "range": false, + "refId": "A" + } + ], + "title": "Workspace Builds", + "type": "piechart" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 12, + "y": 1 + }, + "id": 18, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "count(kube_pod_status_ready{condition=\"true\", namespace=`coder-workspaces`} == 1)\nor\nsum(max by (workspace_owner, template_name, template_version) (coderd_workspace_latest_build_status{status=\"succeeded\", workspace_transition=\"start\"}))\nor\nvector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Running Workspaces", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "decimals": 0, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Down" + }, + "1": { + "color": "green", + "index": 0, + "text": "Up" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/.*RAM/" + }, + "properties": [ + { + "id": "unit", + "value": "bytes" + } + ] + } + ] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 16, + "y": 1 + }, + "id": 15, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(\n max_over_time(\n rate(container_cpu_usage_seconds_total{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`}[1h:1m])\n [$__range:]\n )\n)", + "instant": true, + "legendFormat": "Control Plane CPU", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(\n max_over_time(\n rate(container_cpu_usage_seconds_total{pod=~`coder-provisioner.*`, namespace=`coder`}[1h:1m])\n [$__range:]\n )\n)", + "hide": false, + "instant": true, + "legendFormat": "Provisioner CPU", + "range": false, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(\n max_over_time(\n container_memory_working_set_bytes{pod=~`coder.*`, pod!~`.*provisioner.*`, namespace=`coder`}\n [$__range:]\n )\n)", + "hide": false, + "instant": true, + "legendFormat": "Control Plane RAM", + "range": false, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(\n max_over_time(\n container_memory_working_set_bytes{pod=~`coder-provisioner.*`, namespace=`coder`}\n [$__range:]\n )\n)", + "hide": false, + "instant": true, + "legendFormat": "Provisioner RAM", + "range": false, + "refId": "D" + } + ], + "title": "Resource Usage High Watermark (Cumulative)", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Down" + }, + "1": { + "color": "green", + "index": 0, + "text": "Up" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 4, + "x": 20, + "y": 1 + }, + "id": 19, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "min(pg_up) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Postgres", + "type": "stat" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 8 + }, + "id": 3, + "panels": [], + "title": "Observability Tools", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Down" + }, + "1": { + "color": "green", + "index": 0, + "text": "Up" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 0, + "y": 9 + }, + "id": 1, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "min(up{job=\"coder-observability/prometheus/server\"}) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Prometheus", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Down" + }, + "1": { + "color": "green", + "index": 0, + "text": "Up" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 4, + "y": 9 + }, + "id": 4, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "min(up{job=\"coder-observability/loki/write\"}) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Loki Write Path", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Down" + }, + "1": { + "color": "green", + "index": 0, + "text": "Up" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 8, + "y": 9 + }, + "id": 5, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "min(up{job=\"coder-observability/loki/read\"}) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Loki Read Path", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Down" + }, + "1": { + "color": "green", + "index": 0, + "text": "Up" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 12, + "y": 9 + }, + "id": 6, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "min(up{job=\"coder-observability/loki/backend\", container=\"loki\"}) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Loki Backend", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Down" + }, + "1": { + "color": "green", + "index": 0, + "text": "Up" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 16, + "y": 9 + }, + "id": 7, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "min(up{job=\"coder-observability/loki/canary\"}) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Loki Canary", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Down" + }, + "1": { + "color": "green", + "index": 0, + "text": "Up" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 20, + "y": 9 + }, + "id": 8, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "min(up{job=\"coder-observability/grafana-agent/grafana-agent\"}) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Grafana Agent", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Unhealthy" + }, + "1": { + "color": "green", + "index": 0, + "text": "Healthy" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 0, + "y": 14 + }, + "id": 12, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "prometheus_config_last_reload_successful{job=\"coder-observability/prometheus/server\"}", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Prometheus Config", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Unhealthy" + }, + "1": { + "color": "green", + "index": 0, + "text": "Healthy" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 4, + "y": 14 + }, + "id": 14, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "min(loki_runtime_config_last_reload_successful) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Loki Config", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "options": { + "0": { + "color": "red", + "index": 1, + "text": "Unhealthy" + }, + "1": { + "color": "green", + "index": 0, + "text": "Healthy" + } + }, + "type": "value" + }, + { + "options": { + "match": "null", + "result": { + "color": "orange", + "index": 2, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "empty", + "result": { + "color": "orange", + "index": 3, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null+nan", + "result": { + "index": 4, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 8, + "y": 14 + }, + "id": 13, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "min(agent_config_last_load_successful{job=\"coder-observability/grafana-agent/grafana-agent\"}) or vector(0)", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Grafana Agent Config", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "percentunit" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Retention Limit" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Write-Ahead Log" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Storage" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "#f9f9fb", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 12, + "y": 14 + }, + "id": 11, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "(\n prometheus_tsdb_wal_storage_size_bytes{job=\"coder-observability/prometheus/server\"} +\n prometheus_tsdb_storage_blocks_bytes{job=\"coder-observability/prometheus/server\"} +\n prometheus_tsdb_symbol_table_size_bytes{job=\"coder-observability/prometheus/server\"}\n)\n/\nprometheus_tsdb_retention_limit_bytes{job=\"coder-observability/prometheus/server\"}", + "instant": false, + "legendFormat": "Retention limit used", + "range": true, + "refId": "A" + } + ], + "title": "Prometheus Storage", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + }, + "unit": "none" + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 16, + "y": 14 + }, + "id": 20, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "text": { + "titleSize": 20, + "valueSize": 35 + }, + "textMode": "auto", + "wideLayout": false + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(kube_pod_container_resource_requests{namespace=\"coder-observability\", resource=\"cpu\"})", + "hide": false, + "instant": true, + "legendFormat": "Requested", + "range": false, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(\n max_over_time(\n rate(container_cpu_usage_seconds_total{namespace=\"coder-observability\"}[$__rate_interval])\n [$__range:]\n )\n)", + "hide": false, + "instant": true, + "legendFormat": "High Watermark", + "range": false, + "refId": "D" + } + ], + "title": "CPU", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 5, + "w": 4, + "x": 20, + "y": 14 + }, + "id": 21, + "options": { + "colorMode": "none", + "graphMode": "area", + "justifyMode": "center", + "orientation": "vertical", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "text": { + "titleSize": 20, + "valueSize": 35 + }, + "textMode": "value_and_name", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(kube_pod_container_resource_requests{namespace=\"coder-observability\", resource=\"memory\"})", + "hide": false, + "instant": true, + "legendFormat": "Requested", + "range": false, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(\n max_over_time(container_memory_working_set_bytes{namespace=\"coder-observability\"}[$__range])\n)", + "instant": true, + "legendFormat": "High Watermark", + "range": false, + "refId": "A" + } + ], + "title": "RAM", + "type": "stat" + } + ], + "refresh": "30s", + "schemaVersion": 39, + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-24h", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "Coder Status", + "uid": "coder-status", + "version": 1, + "weekStart": "" + } +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: coder-dashboard-prebuilds + namespace: monitoring + labels: + grafana_dashboard: "1" + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: grafana-dashboard +data: + coder-prebuilds.json: |- + { + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "id": 132, + "links": [], + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "fixedColor": "text", + "mode": "fixed" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 4, + "w": 4, + "x": 0, + "y": 0 + }, + "id": 49, + "interval": "30s", + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "vertical", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(max(coderd_prebuilt_workspaces_desired) by (template_name, preset_name)) or vector(0)", + "instant": true, + "interval": "", + "legendFormat": "Desired", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(max(coderd_prebuilt_workspaces_running) by (template_name, preset_name)) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Running", + "range": false, + "refId": "D" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(max(coderd_prebuilt_workspaces_eligible) by (template_name, preset_name)) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Eligible", + "range": false, + "refId": "E" + } + ], + "title": "Current: Global", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "fixedColor": "text", + "mode": "fixed" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 4, + "w": 4, + "x": 4, + "y": 0 + }, + "id": 48, + "interval": "30s", + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "vertical", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(max by (template_name, preset_name) (coderd_prebuilt_workspaces_created_total)) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Created", + "range": false, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(max by (template_name, preset_name) (coderd_prebuilt_workspaces_failed_total)) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Failed", + "range": false, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(max by (template_name, preset_name) (coderd_prebuilt_workspaces_claimed_total)) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Claimed", + "range": false, + "refId": "A" + } + ], + "title": "All Time: Global", + "type": "stat" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 4 + }, + "id": 2, + "panels": [], + "repeat": "preset", + "title": "$preset", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "fixedColor": "text", + "mode": "fixed" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 0, + "y": 5 + }, + "id": 1, + "interval": "30s", + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "vertical", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_prebuilt_workspaces_created_total{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Created", + "range": false, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_prebuilt_workspaces_failed_total{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Failed", + "range": false, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_prebuilt_workspaces_claimed_total{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Claimed", + "range": false, + "refId": "A" + } + ], + "title": "All Time: $preset", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "axisSoftMax": 10, + "axisSoftMin": 0, + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 13, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "smooth", + "lineStyle": { + "fill": "solid" + }, + "lineWidth": 2, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 0, + "fieldMinMax": false, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Failed" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Created" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Desired" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Running" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Eligible" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Claimed" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "dark-green", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 9, + "x": 6, + "y": 5 + }, + "id": 38, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "floor(max(increase(coderd_prebuilt_workspaces_created_total{template_name=~\"$template\", preset_name=~\"$preset\"}[$__rate_interval]))) or vector(0)", + "hide": false, + "instant": false, + "interval": "", + "legendFormat": "Created", + "range": true, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "floor(max(increase(coderd_prebuilt_workspaces_failed_total{template_name=~\"$template\", preset_name=~\"$preset\"}[$__rate_interval]))) or vector(0)", + "hide": false, + "instant": false, + "interval": "", + "legendFormat": "Failed", + "range": true, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "floor(max(increase(coderd_prebuilt_workspaces_claimed_total{template_name=~\"$template\", preset_name=~\"$preset\"}[$__rate_interval]))) or vector(0)", + "hide": false, + "instant": false, + "interval": "", + "legendFormat": "Claimed", + "range": true, + "refId": "F" + } + ], + "title": "Pool Operations: $preset", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "axisSoftMax": 10, + "axisSoftMin": 0, + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 18, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "smooth", + "lineStyle": { + "fill": "solid" + }, + "lineWidth": 2, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 0, + "fieldMinMax": false, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Desired" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + }, + { + "id": "custom.lineStyle", + "value": { + "dash": [ + 10, + 10 + ], + "fill": "dash" + } + }, + { + "id": "custom.fillOpacity", + "value": 85 + }, + { + "id": "custom.fillBelowTo", + "value": "Running" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Running" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + }, + { + "id": "custom.fillBelowTo", + "value": "Eligible" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Eligible" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 9, + "x": 15, + "y": 5 + }, + "id": 5, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "max(coderd_prebuilt_workspaces_desired{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "instant": false, + "interval": "", + "legendFormat": "Desired", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "max(coderd_prebuilt_workspaces_running{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "hide": false, + "instant": false, + "interval": "", + "legendFormat": "Running", + "range": true, + "refId": "D" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "max(coderd_prebuilt_workspaces_eligible{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "hide": false, + "instant": false, + "interval": "", + "legendFormat": "Eligible", + "range": true, + "refId": "E" + } + ], + "title": "Pool Capacity: $preset", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "fixedColor": "text", + "mode": "fixed" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 0, + "y": 8 + }, + "id": 31, + "interval": "30s", + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "vertical", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_prebuilt_workspaces_desired{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "instant": true, + "interval": "", + "legendFormat": "Desired", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_prebuilt_workspaces_running{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Running", + "range": false, + "refId": "D" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_prebuilt_workspaces_eligible{template_name=~\"$template\", preset_name=~\"$preset\"}) or vector(0)", + "hide": false, + "instant": true, + "interval": "", + "legendFormat": "Eligible", + "range": false, + "refId": "E" + } + ], + "title": "Current: $preset", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Compares the total number of regular workspace creations to prebuilt workspace claims to date.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 0, + "y": 11 + }, + "id": 51, + "options": { + "colorMode": "none", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "horizontal", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "value_and_name", + "wideLayout": true + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "editorMode": "code", + "exemplar": false, + "expr": "sum(max by (template_name, preset_name) (\n coderd_workspace_creation_total{\n template_name=~\"$template\", preset_name=~\"$preset\"\n }\n)) or vector(0)", + "instant": false, + "legendFormat": "Regular workspaces created", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum(max by (template_name, preset_name) (\n coderd_prebuilt_workspaces_claimed_total{\n template_name=~\"$template\", preset_name=~\"$preset\"\n }\n)) or vector(0)\n", + "hide": false, + "instant": false, + "legendFormat": "Prebuilt workspaces claimed", + "range": true, + "refId": "B" + } + ], + "title": "All Time: Regular vs Prebuilt", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "Median (p50) build time in seconds for Regular Workspace Creation, Prebuilt Workspace Creation, and Prebuilt Workspace Claim", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "smooth", + "lineStyle": { + "dash": [ + 10, + 10 + ], + "fill": "dash" + }, + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Regular Creation" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Prebuild Creation" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Prebuild Claim" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 9, + "x": 6, + "y": 11 + }, + "id": 50, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "editorMode": "code", + "expr": "histogram_quantile(0.5,\n sum(\n coderd_workspace_creation_duration_seconds{\n template_name=~\"$template\", preset_name=~\"$preset\", type=\"regular\"\n }\n )\n)\nor vector(0)", + "legendFormat": "Regular Creation", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "histogram_quantile(0.5,\n sum(\n coderd_workspace_creation_duration_seconds{\n template_name=~\"$template\", preset_name=~\"$preset\", type=\"prebuild\"\n }\n )\n)\nor vector(0)", + "hide": false, + "instant": false, + "legendFormat": "Prebuild Creation", + "range": true, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "histogram_quantile(0.5,\n sum(\n coderd_prebuilt_workspace_claim_duration_seconds{\n template_name=~\"$template\", preset_name=~\"$preset\"\n }\n )\n)\nor vector(0)", + "hide": false, + "instant": false, + "legendFormat": "Prebuild Claim", + "range": true, + "refId": "C" + } + ], + "title": "Workspace Build Latency p50", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "95th-percentile (p95) build time in seconds for Regular Workspace Creation, Prebuilt Workspace Creation, and Prebuilt Workspace Claim.", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "barWidthFactor": 0.6, + "drawStyle": "line", + "fillOpacity": 0, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "smooth", + "lineStyle": { + "dash": [ + 10, + 10 + ], + "fill": "dash" + }, + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": 0 + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "Regular Creation" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Prebuild Creation" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "yellow", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Prebuild Claim" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 9, + "x": 15, + "y": 11 + }, + "id": 53, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "hideZeros": false, + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "editorMode": "code", + "expr": "histogram_quantile(0.95,\n sum(\n coderd_workspace_creation_duration_seconds{\n template_name=~\"$template\", preset_name=~\"$preset\", type=\"regular\"\n }\n )\n)\nor vector(0)", + "legendFormat": "Regular Creation", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "histogram_quantile(0.95,\n sum(\n coderd_workspace_creation_duration_seconds{\n template_name=~\"$template\", preset_name=~\"$preset\", type=\"prebuild\"\n }\n )\n)\nor vector(0)", + "hide": false, + "instant": false, + "legendFormat": "Prebuild Creation", + "range": true, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "histogram_quantile(0.95,\n sum(\n coderd_prebuilt_workspace_claim_duration_seconds{\n template_name=~\"$template\", preset_name=~\"$preset\"\n }\n )\n)\nor vector(0)", + "hide": false, + "instant": false, + "legendFormat": "Prebuild Claim", + "range": true, + "refId": "C" + } + ], + "title": "Workspace Build Latency p95", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "max": 100, + "min": 0, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": 0 + }, + { + "color": "#EAB839", + "value": 50 + }, + { + "color": "green", + "value": 75 + } + ] + }, + "unit": "percent" + }, + "overrides": [] + }, + "gridPos": { + "h": 3, + "w": 6, + "x": 0, + "y": 14 + }, + "id": 54, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "auto", + "orientation": "auto", + "percentChangeColorMode": "standard", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "12.1.0", + "targets": [ + { + "editorMode": "code", + "expr": "clamp_max(\n 100 *\n (\n sum(\n coderd_prebuilt_workspaces_claimed_total{\n template_name=\"$template\", preset_name=\"$preset\"\n }\n ) or vector(0)\n )\n /\n clamp_min(\n ( \n sum(\n coderd_prebuilt_workspaces_claimed_total{\n template_name=\"$template\", preset_name=\"$preset\"\n }\n ) or vector(0))\n +\n (\n sum(\n coderd_workspace_creation_total{\n template_name=\"$template\", preset_name=\"$preset\"\n }\n ) or vector(0)),\n 1\n ),\n 100\n)", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "All Time: Prebuilds Usage %", + "type": "stat" + } + ], + "preload": false, + "refresh": "30s", + "schemaVersion": 41, + "tags": [], + "templating": { + "list": [ + { + "current": { + "text": "coder", + "value": "coder" + }, + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "definition": "label_values(coderd_prebuilt_workspaces_desired,template_name)", + "includeAll": false, + "label": "Template", + "name": "template", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(coderd_prebuilt_workspaces_desired,template_name)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + }, + { + "current": { + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "definition": "label_values(coderd_prebuilt_workspaces_desired{template_name=~\"$template\"},preset_name)", + "includeAll": true, + "label": "Preset", + "multi": true, + "name": "preset", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(coderd_prebuilt_workspaces_desired{template_name=~\"$template\"},preset_name)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 1, + "regex": "", + "type": "query" + } + ] + }, + "time": { + "from": "now-12h", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "Coder Prebuilds", + "uid": "cej6jysyme22oa", + "version": 5 + } +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: coder-dashboard-provisionerd + namespace: monitoring + labels: + grafana_dashboard: "1" + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: grafana-dashboard +data: + coder-provisionerd.json: |- + { + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "target": { + "limit": 100, + "matchAny": false, + "tags": [], + "type": "dashboard" + }, + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "links": [], + "panels": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 6, + "x": 0, + "y": 0 + }, + "id": 17, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "value_and_name", + "wideLayout": false + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(coderd_provisionerd_num_daemons{pod=~`coder.*`, pod!~`.*provisioner.*`})", + "instant": true, + "legendFormat": "Built-in", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(coderd_provisionerd_num_daemons{pod=~`coder-provisioner.*`, namespace=`coder`})", + "hide": false, + "instant": true, + "legendFormat": "External", + "range": false, + "refId": "B" + } + ], + "title": "Provisioners", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 6, + "x": 6, + "y": 0 + }, + "id": 20, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "Provisioners are responsible for building workspaces.\n\n`coderd` runs built-in provisioners by default. Control this with the `CODER_PROVISIONER_DAEMONS` environment variable or `--provisioner-daemons` flag.\n\nYou can also consider [External Provisioners](https://coder.com/docs/v2/latest/admin/provisioners). Running both built-in and external provisioners is perfectly valid,\nalthough dedicated (external) provisioners will generally give the best build performance.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + } + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 6, + "x": 12, + "y": 0 + }, + "id": 21, + "options": { + "colorMode": "value", + "graphMode": "area", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "last" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "(sum(coderd_provisionerd_jobs_current) > 0) or vector(0)", + "instant": false, + "legendFormat": "Current", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(coderd_provisionerd_num_daemons)", + "hide": false, + "instant": true, + "legendFormat": "Capacity", + "range": false, + "refId": "B" + } + ], + "title": "Builds", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 6, + "x": 18, + "y": 0 + }, + "id": 22, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "The maximum number of simultaneous builds is equivalent to the number of `provisionerd` daemons running.\n\nThe \"Capacity\" panel shows the how many simultaneous builds are possible.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "fieldMinMax": false, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + }, + "unit": "s" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 6, + "x": 0, + "y": 7 + }, + "id": 23, + "options": { + "colorMode": "value", + "graphMode": "none", + "justifyMode": "center", + "orientation": "auto", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "showPercentChange": false, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "histogram_quantile(0.5, sum by(le) (rate(coderd_provisionerd_job_timings_seconds_bucket[$__range])))", + "hide": false, + "instant": true, + "legendFormat": "Median", + "range": false, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "histogram_quantile(0.9, sum by(le) (rate(coderd_provisionerd_job_timings_seconds_bucket[$__range])))", + "hide": false, + "instant": true, + "legendFormat": "90th Percentile", + "range": false, + "refId": "A" + } + ], + "title": "Build Times", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 6, + "x": 6, + "y": 7 + }, + "id": 24, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "This shows the median and 90th percentile workspace build times.\n\nLong build times can impede developers' productivity while they wait for workspaces to start or be created.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "normal" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 0, + "fieldMinMax": false, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "failed" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + }, + { + "id": "displayName", + "value": "Failure" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "success" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + }, + { + "id": "displayName", + "value": "Success" + } + ] + } + ] + }, + "gridPos": { + "h": 7, + "w": 6, + "x": 12, + "y": 7 + }, + "id": 25, + "interval": "1h", + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "multi", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (status) (increase(coderd_provisionerd_job_timings_seconds_count[$__interval]))", + "hide": false, + "instant": false, + "interval": "1h", + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "Build Count Per Hour", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 6, + "x": 18, + "y": 7 + }, + "id": 26, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "_NOTE: this will not show the current hour._", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "fieldMinMax": false, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/(Limit|Requested)/" + }, + "properties": [ + { + "id": "custom.drawStyle", + "value": "line" + }, + { + "id": "custom.fillOpacity", + "value": 5 + }, + { + "id": "custom.lineStyle", + "value": { + "dash": [ + 0, + 10 + ], + "fill": "dot" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Limit" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Requested" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 7, + "w": 6, + "x": 0, + "y": 14 + }, + "id": 28, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (pod) (rate(container_cpu_usage_seconds_total{pod=~`coder-provisioner.*`, namespace=`coder`}[$__rate_interval]))", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(kube_pod_container_resource_limits{pod=~`coder-provisioner.*`, namespace=`coder`, resource=\"cpu\"})", + "hide": false, + "instant": false, + "legendFormat": "Limit", + "range": true, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(kube_pod_container_resource_requests{pod=~`coder-provisioner.*`, namespace=`coder`, resource=\"cpu\"})", + "hide": false, + "instant": false, + "legendFormat": "Requested", + "range": true, + "refId": "C" + } + ], + "title": "CPU Usage Seconds", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 6, + "x": 6, + "y": 14 + }, + "id": 30, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "The cumulative CPU used per core-second. If the process was using a full CPU core, that would be represented as 1 second.\n\nRequests & limits are shown if set.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "never", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "fieldMinMax": false, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + }, + "unit": "bytes" + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/(Limit|Requested)/" + }, + "properties": [ + { + "id": "custom.drawStyle", + "value": "line" + }, + { + "id": "custom.fillOpacity", + "value": 5 + }, + { + "id": "custom.lineStyle", + "value": { + "dash": [ + 0, + 10 + ], + "fill": "dot" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Limit" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "orange", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Requested" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "green", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 7, + "w": 6, + "x": 12, + "y": 14 + }, + "id": 29, + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "single", + "sort": "none" + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max by (pod) (container_memory_working_set_bytes{pod=~`coder-provisioner.*`, namespace=`coder`})", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(kube_pod_container_resource_limits{pod=~`coder-provisioner.*`, namespace=`coder`, resource=\"memory\"})", + "hide": false, + "instant": false, + "legendFormat": "Limit", + "range": true, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(kube_pod_container_resource_requests{pod=~`coder-provisioner.*`, namespace=`coder`, resource=\"memory\"})", + "hide": false, + "instant": false, + "legendFormat": "Requested", + "range": true, + "refId": "C" + } + ], + "title": "RAM Usage", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 6, + "x": 18, + "y": 14 + }, + "id": 31, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "This shows the total memory used by each container; it is the same metric which the [OOM killer](https://www.kernel.org/doc/gorman/html/understand/understand016.html) uses.\n\nRequests & limits are shown if set.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "gridPos": { + "h": 18, + "w": 18, + "x": 0, + "y": 21 + }, + "id": 27, + "options": { + "dedupStrategy": "exact", + "enableLogDetails": true, + "prettifyLogMessage": false, + "showCommonLabels": false, + "showLabels": false, + "showTime": true, + "sortOrder": "Descending", + "wrapLogMessage": false + }, + "targets": [ + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "editorMode": "code", + "expr": "{namespace=~`(coder|coder)`, logger=~\"(.*runner|terraform|provisioner.*)\"}", + "queryType": "range", + "refId": "A" + } + ], + "title": "Logs", + "type": "logs" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 6, + "x": 18, + "y": 21 + }, + "id": 32, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "This panel shows all logs across built-in and [external provisioners](https://coder.com/docs/v2/latest/admin/provisioners).", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + } + ], + "refresh": "30s", + "schemaVersion": 39, + "tags": [], + "templating": { + "list": [] + }, + "time": { + "from": "now-12h", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "Coder Provisioners", + "uid": "provisionerd", + "version": 10, + "weekStart": "" + } +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: coder-dashboard-workspaces + namespace: monitoring + labels: + grafana_dashboard: "1" + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: grafana-dashboard +data: + coder-workspaces.json: |- + { + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "links": [], + "panels": [ + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "description": "", + "gridPos": { + "h": 1.2, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 28, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "**HINT**: use the dropdowns above to filter by specific workspaces and/or templates.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 1.2 + }, + "id": 31, + "title": "Resources", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 1, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "s" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 0, + "y": 2.2 + }, + "id": 33, + "options": { + "legend": { + "calcs": [ + "mean", + "stdDev", + "min", + "max", + "lastNotNull" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true, + "sortBy": "Max", + "sortDesc": true + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum by (pod) (rate(container_cpu_usage_seconds_total{namespace=`coder-workspaces`}[$__rate_interval]))", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "B" + } + ], + "title": "CPU Usage", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 1, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "bytes" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 10, + "y": 2.2 + }, + "id": 37, + "options": { + "legend": { + "calcs": [ + "mean", + "stdDev", + "min", + "max", + "lastNotNull" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true, + "sortBy": "Max", + "sortDesc": true + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "max by (pod) (container_memory_working_set_bytes{namespace=`coder-workspaces`})", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "B" + } + ], + "title": "RAM Usage", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 8, + "w": 4, + "x": 20, + "y": 2.2 + }, + "id": 36, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "The cumulative CPU used per core-second. If a workspace was using a full CPU core, that would be represented as 1 second.\n\nSee the Kubernetes [documentation](https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#cpu-units) for more details.\n\nThe total memory used by each workspace container is represented; it is the same metric which the [OOM killer](https://www.kernel.org/doc/gorman/html/understand/understand016.html) uses.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 1, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 0, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "none" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 0, + "y": 10.2 + }, + "id": 38, + "options": { + "legend": { + "calcs": [ + "sum" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true, + "sortBy": "Max", + "sortDesc": true + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum by (pod) (\n round(increase(kube_pod_container_status_restarts_total{namespace=`coder-workspaces`}[$__interval]))\n) > 0", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "B" + } + ], + "title": "Pod Restarts", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 1, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "none" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "decimals": 0, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "none" + }, + "overrides": [] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 10, + "y": 10.2 + }, + "id": 39, + "options": { + "legend": { + "calcs": [ + "sum" + ], + "displayMode": "table", + "placement": "bottom", + "showLegend": true, + "sortBy": "Max", + "sortDesc": true + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum by (pod, reason) (\n count_over_time(kube_pod_container_status_terminated_reason{namespace=`coder-workspaces`}[$__interval])\n)", + "hide": false, + "instant": false, + "legendFormat": "{{pod}}:{{reason}}", + "range": true, + "refId": "B" + } + ], + "title": "Terminations", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 8, + "w": 4, + "x": 20, + "y": 10.2 + }, + "id": 40, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "Pods can be terminated for several reasons:\n- `OOMKilled`: pod exceeded its defined memory limit or was terminated by the OS for using excessive memory (if no limit defined)\n- `Error`: usually attributeable to a configuration problem\n- `Evicted`: pod has been evicted from node for overusing resources and will be rescheduled on another node is possible\n\nPod restarts are not necessarily problematic, but they are worth noting.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 18.2 + }, + "id": 30, + "panels": [], + "title": "Builds", + "type": "row" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 1, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "normal" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "DESTROY" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "STOP" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "START" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 0, + "y": 19.2 + }, + "id": 2, + "interval": "5m", + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum by (workspace_transition) (\n (\n # Since new series are created and are initially set to a value of 1, we cannot use \"increase\" (because an increase from to 1 does not yield 1).\n # So we compare the current series to an interval ago to see if we have any new series and then sum the series we find. \n (\n coderd_workspace_builds_total{status=\"success\", workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"} - \n coderd_workspace_builds_total{status=\"success\", workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"} offset $__interval\n ) >= 0) \n or coderd_workspace_builds_total{status=\"success\", workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"}\n) > 0", + "hide": false, + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "B" + } + ], + "title": "Successful Builds by State", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "axisBorderShow": false, + "axisCenteredZero": false, + "axisColorMode": "text", + "axisLabel": "", + "axisPlacement": "auto", + "barAlignment": 0, + "drawStyle": "bars", + "fillOpacity": 100, + "gradientMode": "none", + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + }, + "insertNulls": false, + "lineInterpolation": "linear", + "lineWidth": 1, + "pointSize": 5, + "scaleDistribution": { + "type": "linear" + }, + "showPoints": "auto", + "spanNulls": false, + "stacking": { + "group": "A", + "mode": "normal" + }, + "thresholdsStyle": { + "mode": "off" + } + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "DESTROY" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "red", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "STOP" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "purple", + "mode": "fixed" + } + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "START" + }, + "properties": [ + { + "id": "color", + "value": { + "fixedColor": "blue", + "mode": "fixed" + } + } + ] + } + ] + }, + "gridPos": { + "h": 8, + "w": 10, + "x": 10, + "y": 19.2 + }, + "id": 1, + "interval": "5m", + "options": { + "legend": { + "calcs": [], + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum by (workspace_transition) (\n (\n # Since new series are created and are initially set to a value of 1, we cannot use \"increase\" (because an increase from to 1 does not yield 1).\n # So we compare the current series to an interval ago to see if we have any new series and then sum the series we find. \n (\n coderd_workspace_builds_total{status=\"failed\", workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"} - \n coderd_workspace_builds_total{status=\"failed\", workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"} offset $__interval\n ) >= 0) \n or coderd_workspace_builds_total{status=\"failed\", workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"}\n) > 0", + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "Unsuccessful Builds by State", + "type": "timeseries" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 8, + "w": 4, + "x": 20, + "y": 19.2 + }, + "id": 34, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "Workspaces \"transition\" between `STOP`, `START`, and `DESTROY` states.\n\nWorkspaces transition between states when a \"build\" is initiated, which is an execution of `terraform` against the chosen template.\n\nUse the \"Build Count\" table to identify workspace owners which may be struggling with template builds, in order to proactively reach out to them with assistance.\n\nConsult the [Template documentation](https://coder.com/docs/v2/latest/templates) for more information.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": true, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "status" + }, + "properties": [ + { + "id": "custom.cellOptions", + "value": { + "type": "color-text" + } + }, + { + "id": "mappings", + "value": [ + { + "options": { + "failed": { + "color": "orange", + "index": 1, + "text": "Failure" + }, + "success": { + "color": "green", + "index": 0, + "text": "Success" + } + }, + "type": "value" + } + ] + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Workspace Transition" + }, + "properties": [ + { + "id": "custom.cellOptions", + "value": { + "type": "color-text" + } + }, + { + "id": "mappings", + "value": [ + { + "options": { + "DESTROY": { + "color": "red", + "index": 0 + }, + "START": { + "color": "blue", + "index": 1 + }, + "STOP": { + "color": "purple", + "index": 2 + } + }, + "type": "value" + } + ] + } + ] + } + ] + }, + "gridPos": { + "h": 10, + "w": 20, + "x": 0, + "y": 27.2 + }, + "id": 6, + "interval": "", + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "enablePagination": true, + "fields": [], + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "Time" + } + ] + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (workspace_name, workspace_owner, status, template_name, template_version, workspace_transition) (\n # Since new series are created and are initially set to a value of 1, we cannot use \"increase\" (because an increase from to 1 does not yield 1).\n # So we compare the current series to an interval ago to see if we have any new series and then sum the series we find. \n ((\n coderd_workspace_builds_total{workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"} - \n coderd_workspace_builds_total{workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"} offset $__interval\n ) >= 0) \n or coderd_workspace_builds_total{workspace_name=~\"$workspace_name\", template_name=~\"$template_name\"}\n) > 0", + "format": "table", + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "Build Log", + "transformations": [ + { + "disabled": true, + "id": "groupBy", + "options": { + "fields": { + "Count": { + "aggregations": [ + "sum" + ], + "operation": "aggregate" + }, + "Status": { + "aggregations": [], + "operation": "groupby" + }, + "Template Name": { + "aggregations": [], + "operation": "groupby" + }, + "Template Version": { + "aggregations": [], + "operation": "groupby" + }, + "Total": { + "aggregations": [ + "sum" + ], + "operation": "aggregate" + }, + "Value": { + "aggregations": [ + "sum" + ], + "operation": "aggregate" + }, + "Workspace Name": { + "aggregations": [], + "operation": "groupby" + }, + "Workspace Ownert": { + "aggregations": [], + "operation": "groupby" + }, + "Workspace Transition": { + "aggregations": [], + "operation": "groupby" + }, + "status": { + "aggregations": [], + "operation": "groupby" + }, + "template_name": { + "aggregations": [], + "operation": "groupby" + }, + "template_version": { + "aggregations": [], + "operation": "groupby" + }, + "workspace_name": { + "aggregations": [], + "operation": "groupby" + }, + "workspace_owner": { + "aggregations": [], + "operation": "groupby" + }, + "workspace_transition": { + "aggregations": [], + "operation": "groupby" + } + } + } + }, + { + "id": "sortBy", + "options": { + "fields": {}, + "sort": [ + { + "desc": true, + "field": "Value" + } + ] + } + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": false + }, + "includeByName": {}, + "indexByName": {}, + "renameByName": { + "Value": "Count", + "Value (sum)": "Total", + "status": "Status", + "template_name": "Template Name", + "template_version": "Template Version", + "workspace_name": "Workspace Name", + "workspace_owner": "Workspace Owner", + "workspace_transition": "Workspace Transition" + } + } + } + ], + "type": "table" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 10, + "w": 4, + "x": 20, + "y": 27.2 + }, + "id": 29, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "This table shows a reverse-chronological log of all workspace builds.\n\nThe \"Count\" field shows the count of events which occurred within a minute, grouped by all columns.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + } + }, + "mappings": [], + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 5, + "x": 0, + "y": 37.2 + }, + "id": 8, + "interval": "1h", + "options": { + "displayLabels": [ + "name" + ], + "legend": { + "displayMode": "list", + "placement": "bottom", + "showLegend": true, + "values": [ + "percent" + ] + }, + "pieType": "pie", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "tooltip": { + "mode": "multi", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "count by (workspace_owner) (coderd_workspace_latest_build_status{template_name=~\"$template_name\"})", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Workspace by User", + "type": "piechart" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + } + }, + "mappings": [], + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 5, + "x": 5, + "y": 37.2 + }, + "id": 9, + "interval": "1h", + "options": { + "displayLabels": [ + "name" + ], + "legend": { + "displayMode": "list", + "placement": "bottom", + "showLegend": true, + "values": [ + "percent" + ] + }, + "pieType": "pie", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "tooltip": { + "mode": "multi", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "count by (workspace_owner, template_name) (coderd_workspace_latest_build_status{template_name=~\"$template_name\"})", + "instant": true, + "legendFormat": "{{workspace_owner}}:{{template_name}}", + "range": false, + "refId": "A" + } + ], + "title": "Workspace by User/Template", + "type": "piechart" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + } + }, + "mappings": [], + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 5, + "x": 10, + "y": 37.2 + }, + "id": 4, + "interval": "1h", + "options": { + "displayLabels": [ + "name" + ], + "legend": { + "displayMode": "list", + "placement": "bottom", + "showLegend": true, + "values": [ + "percent" + ] + }, + "pieType": "pie", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "tooltip": { + "mode": "multi", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "count by (template_name) (coderd_workspace_latest_build_status{template_name=~\"$template_name\"})", + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "A" + } + ], + "title": "Template Usage", + "type": "piechart" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "color": { + "mode": "palette-classic" + }, + "custom": { + "hideFrom": { + "legend": false, + "tooltip": false, + "viz": false + } + }, + "mappings": [], + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 7, + "w": 5, + "x": 15, + "y": 37.2 + }, + "id": 5, + "interval": "1h", + "options": { + "displayLabels": [], + "legend": { + "displayMode": "list", + "placement": "bottom", + "showLegend": true, + "values": [ + "percent" + ] + }, + "pieType": "pie", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "tooltip": { + "mode": "multi", + "sort": "none" + } + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "count by (template_name, template_version) (coderd_workspace_latest_build_status{template_name=~\"$template_name\"})", + "instant": true, + "legendFormat": "{{template_name}}:{{template_version}}", + "range": false, + "refId": "A" + } + ], + "title": "Template Version Usage", + "type": "piechart" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 4, + "x": 20, + "y": 37.2 + }, + "id": 24, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "These charts show the distribution of workspaces and templates.\n\nUse these charts to identify which users have outdated templates, and which templates are the most/least popular in your organisation.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 44.2 + }, + "id": 32, + "panels": [], + "title": "Logs", + "type": "row" + }, + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "gridPos": { + "h": 10, + "w": 20, + "x": 0, + "y": 45.2 + }, + "id": 7, + "options": { + "dedupStrategy": "exact", + "enableLogDetails": true, + "prettifyLogMessage": false, + "showCommonLabels": false, + "showLabels": false, + "showTime": false, + "sortOrder": "Descending", + "wrapLogMessage": true + }, + "targets": [ + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "editorMode": "code", + "expr": "{namespace=~`(coder|coder)`, logger=~\"(.*runner|terraform|provisioner.*)\"} |~ \"$workspace_name\" or \"$template_name\"", + "queryType": "range", + "refId": "A" + } + ], + "title": "Logs", + "type": "logs" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 10, + "w": 4, + "x": 20, + "y": 45.2 + }, + "id": 22, + "links": [ + { + "title": "Provisioners Dashboard", + "url": "/d/provisionerd/provisioners?${__url_time_range}" + } + ], + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "These are the logs produced by the [Provisioners](/d/provisionerd/provisioners?${__url_time_range}).\n\nUse the dropdowns at the top to filter the logs down to a specific workspace and/or template.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + } + ], + "refresh": "30s", + "schemaVersion": 39, + "tags": [], + "templating": { + "list": [ + { + "allValue": "", + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "definition": "label_values(coderd_workspace_builds_total,workspace_name)", + "hide": 0, + "includeAll": true, + "label": "Workspace Name Filter", + "multi": true, + "name": "workspace_name", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(coderd_workspace_builds_total,workspace_name)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 2, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + }, + { + "allValue": "", + "current": { + "selected": true, + "text": [ + "All" + ], + "value": [ + "$__all" + ] + }, + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "definition": "label_values(coderd_workspace_builds_total,template_name)", + "hide": 0, + "includeAll": true, + "label": "Template Name Filter", + "multi": true, + "name": "template_name", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(coderd_workspace_builds_total,template_name)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 2, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + } + ] + }, + "time": { + "from": "now-12h", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "Coder Workspaces", + "uid": "workspaces", + "version": 2, + "weekStart": "" + } +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: coder-dashboard-workspace-detail + namespace: monitoring + labels: + grafana_dashboard: "1" + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: grafana-dashboard +data: + coder-workspaces-detail.json: |- + { + "annotations": { + "list": [ + { + "builtIn": 1, + "datasource": { + "type": "grafana", + "uid": "-- Grafana --" + }, + "enable": true, + "hide": true, + "iconColor": "rgba(0, 211, 255, 1)", + "name": "Annotations & Alerts", + "type": "dashboard" + } + ] + }, + "editable": true, + "fiscalYearStartMonth": 0, + "graphTooltip": 0, + "links": [], + "panels": [ + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "description": "", + "gridPos": { + "h": 1.2, + "w": 24, + "x": 0, + "y": 0 + }, + "id": 28, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "**HINT**: use the dropdowns above to filter by specific workspace(s).", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "blue", + "value": null + } + ] + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "CPUs Requested" + }, + "properties": [ + { + "id": "unit", + "value": "none" + }, + { + "id": "decimals", + "value": 2 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "RAM Requested" + }, + "properties": [ + { + "id": "unit", + "value": "bytes" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "PVC Capacity" + }, + "properties": [ + { + "id": "unit", + "value": "bytes" + } + ] + } + ] + }, + "gridPos": { + "h": 4, + "w": 20, + "x": 0, + "y": 1.2 + }, + "id": 29, + "options": { + "colorMode": "none", + "graphMode": "none", + "justifyMode": "center", + "orientation": "vertical", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "/.*/", + "values": false + }, + "showPercentChange": false, + "text": { + "titleSize": 20, + "valueSize": 40 + }, + "textMode": "value_and_name", + "wideLayout": false + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "group by (template_name) (coderd_agents_up{workspace_name=~\"$workspace_name\"})", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "Template Name", + "range": false, + "refId": "B" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "group by (template_version) (coderd_agents_up{workspace_name=~\"$workspace_name\"})", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "Template Version", + "range": false, + "refId": "A" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "group by (username) (coderd_agents_up{workspace_name=~\"$workspace_name\"})", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "Owner", + "range": false, + "refId": "C" + } + ], + "title": "Details", + "transformations": [ + { + "id": "concatenate", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true, + "Value #A": true, + "Value #B": true, + "Value #C": true, + "Value #D": true + }, + "includeByName": {}, + "indexByName": { + "CPUs Requested": 7, + "PVC Capacity": 9, + "RAM Requested": 8, + "Time": 0, + "Value #A": 5, + "Value #B": 3, + "Value #C": 6, + "template_name": 2, + "template_version": 4, + "username": 1 + }, + "renameByName": { + "Value #C": "", + "lifecycle_state": "Agent State", + "template_name": "Template", + "template_version": "Template Version", + "username": "Owner" + } + } + } + ], + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 8, + "w": 4, + "x": 20, + "y": 1.2 + }, + "id": 38, + "links": [ + { + "title": "Provisioners Dashboard", + "url": "/d/provisionerd/provisioners?${__url_time_range}" + } + ], + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "Essential information about the selected workspace.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "fieldConfig": { + "defaults": { + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "blue", + "value": null + } + ] + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "CPUs Requested" + }, + "properties": [ + { + "id": "unit", + "value": "none" + }, + { + "id": "decimals", + "value": 2 + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "RAM Requested" + }, + "properties": [ + { + "id": "unit", + "value": "bytes" + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "PVC Capacity" + }, + "properties": [ + { + "id": "unit", + "value": "bytes" + } + ] + } + ] + }, + "gridPos": { + "h": 4, + "w": 20, + "x": 0, + "y": 5.2 + }, + "id": 36, + "options": { + "reduceOptions": { + "values": false, + "calcs": [ + "lastNotNull" + ], + "fields": "/.*/" + }, + "orientation": "vertical", + "textMode": "value_and_name", + "wideLayout": false, + "colorMode": "none", + "graphMode": "none", + "justifyMode": "center", + "showPercentChange": false, + "text": { + "titleSize": 20, + "valueSize": 40 + } + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(kube_pod_container_resource_requests{pod=~\".*$workspace_name.*\", namespace=`coder-workspaces`, resource=\"cpu\"})", + "format": "time_series", + "hide": false, + "instant": true, + "legendFormat": "CPUs Requested", + "range": false, + "refId": "D" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(kube_pod_container_resource_requests{pod=~\".*$workspace_name.*\", namespace=`coder-workspaces`, resource=\"memory\"})", + "format": "time_series", + "hide": false, + "instant": true, + "legendFormat": "RAM Requested", + "range": false, + "refId": "E" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum(\n kube_pod_spec_volumes_persistentvolumeclaims_info{pod=~\".*$workspace_name.*\",namespace=`coder-workspaces`}\n * on(persistentvolumeclaim) group_right\n group by (persistentvolumeclaim, persistentvolume) (\n label_replace(\n kube_persistentvolume_claim_ref,\n \"persistentvolumeclaim\",\n \"$1\",\n \"name\",\n \"(.+)\"\n )\n )\n * on (persistentvolume)\n kube_persistentvolume_capacity_bytes\n)", + "format": "time_series", + "hide": false, + "instant": true, + "legendFormat": "PVC Capacity", + "range": false, + "refId": "F" + } + ], + "title": "Resources", + "transformations": [ + { + "id": "concatenate", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true, + "Value #A": true, + "Value #B": true, + "Value #C": true, + "Value #D": true + }, + "includeByName": {}, + "indexByName": { + "CPUs Requested": 7, + "PVC Capacity": 9, + "RAM Requested": 8, + "Time": 0, + "Value #A": 5, + "Value #B": 3, + "Value #C": 6, + "template_name": 2, + "template_version": 4, + "username": 1 + }, + "renameByName": { + "Value #C": "", + "lifecycle_state": "Agent State", + "template_name": "Template", + "template_version": "Template Version", + "username": "Owner" + } + } + } + ], + "type": "stat", + "description": "" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "created": { + "color": "light-blue", + "index": 1, + "text": "Created" + }, + "off": { + "color": "text", + "index": 8, + "text": "Off" + }, + "ready": { + "color": "green", + "index": 0, + "text": "Ready" + }, + "shutdown_error": { + "color": "red", + "index": 7, + "text": "Shutdown Error" + }, + "shutdown_timeout": { + "color": "purple", + "index": 6, + "text": "Shutdown Timeout" + }, + "shutting_down": { + "color": "light-purple", + "index": 5, + "text": "Shutting Down" + }, + "start_error": { + "color": "red", + "index": 4, + "text": "Start Error" + }, + "start_timeout": { + "color": "orange", + "index": 3, + "text": "Start Timeout" + }, + "starting": { + "color": "super-light-green", + "index": 2, + "text": "Starting" + } + }, + "type": "value" + }, + { + "options": { + "match": "empty", + "result": { + "color": "text", + "index": 9, + "text": "Unknown" + } + }, + "type": "special" + }, + { + "options": { + "match": "null", + "result": { + "color": "text", + "index": 10, + "text": "Unknown" + } + }, + "type": "special" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "text", + "value": null + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 6, + "w": 4, + "x": 0, + "y": 9.2 + }, + "id": 35, + "options": { + "colorMode": "background", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "/^lifecycle_state$/", + "values": false + }, + "showPercentChange": false, + "text": { + "valueSize": 50 + }, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max by (lifecycle_state) (coderd_agents_connections{workspace_name=~\"$workspace_name\"})", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "D" + } + ], + "title": "Agent Lifecycle State", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "mappings": [ + { + "options": { + "-1": { + "color": "light-orange", + "index": 0, + "text": "Not completed yet" + } + }, + "type": "value" + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "#EAB839", + "value": 60 + }, + { + "color": "red", + "value": 120 + } + ] + }, + "unit": "s" + }, + "overrides": [] + }, + "gridPos": { + "h": 6, + "w": 3, + "x": 4, + "y": 9.2 + }, + "id": 33, + "options": { + "colorMode": "background", + "graphMode": "none", + "justifyMode": "auto", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "/^Value$/", + "values": false + }, + "showPercentChange": false, + "text": { + "valueSize": 50 + }, + "textMode": "value", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_agentstats_startup_script_seconds{workspace_name=~\"$workspace_name\"}) or vector(-1)", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "__auto", + "range": false, + "refId": "C" + } + ], + "title": "Agent Startup Script Execution Time", + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "short" + }, + "overrides": [] + }, + "gridPos": { + "h": 6, + "w": 3, + "x": 7, + "y": 9.2 + }, + "id": 39, + "options": { + "colorMode": "background", + "graphMode": "none", + "justifyMode": "center", + "orientation": "horizontal", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "/.*/", + "values": false + }, + "showPercentChange": false, + "text": { + "titleSize": 20, + "valueSize": 50 + }, + "textMode": "value_and_name", + "wideLayout": false + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max by (app) (\n label_replace(\n {workspace_name=~\"$workspace_name\", __name__=~\"coderd_agentstats_session_count_.*\"},\n \"app\",\n \"$1\",\n \"__name__\",\n \"coderd_agentstats_session_count_(.*)\"\n )\n)>0", + "format": "time_series", + "hide": false, + "instant": true, + "legendFormat": "{{app}}", + "range": false, + "refId": "C" + } + ], + "title": "App Session Counts", + "transformations": [ + { + "id": "concatenate", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true + }, + "includeByName": {}, + "indexByName": {}, + "renameByName": {} + } + } + ], + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + }, + "unit": "s" + }, + "overrides": [ + { + "matcher": { + "id": "byRegexp", + "options": "/.*Bytes/" + }, + "properties": [ + { + "id": "unit", + "value": "bytes" + } + ] + } + ] + }, + "gridPos": { + "h": 6, + "w": 10, + "x": 10, + "y": 9.2 + }, + "id": 34, + "options": { + "colorMode": "none", + "graphMode": "none", + "justifyMode": "center", + "orientation": "vertical", + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "/.*/", + "values": false + }, + "showPercentChange": false, + "text": { + "titleSize": 20, + "valueSize": 50 + }, + "textMode": "auto", + "wideLayout": true + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(coderd_agents_connection_latencies_seconds{workspace_name=~\"$workspace_name\"})", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "Connection Latency", + "range": false, + "refId": "C" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(sum by (pod) (sum_over_time(coderd_agentstats_rx_bytes{workspace_name=~\"$workspace_name\"}[$__range])))", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "Received Bytes", + "range": false, + "refId": "rx" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "max(sum by (pod) (sum_over_time(coderd_agentstats_tx_bytes{workspace_name=~\"$workspace_name\"}[$__range])))", + "format": "table", + "hide": false, + "instant": true, + "legendFormat": "Transmitted Bytes", + "range": false, + "refId": "tx" + } + ], + "title": "Networking", + "transformations": [ + { + "id": "merge", + "options": {} + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true + }, + "includeByName": {}, + "indexByName": {}, + "renameByName": { + "Value #A": "Received Bytes", + "Value #B": "Transmitted Bytes", + "Value #C": "Connection Latency", + "Value #rx": "Received Bytes", + "Value #tx": "Transmitted Bytes" + } + } + } + ], + "type": "stat" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 6, + "w": 4, + "x": 20, + "y": 9.2 + }, + "id": 40, + "links": [ + { + "title": "Provisioners Dashboard", + "url": "/d/provisionerd/provisioners?${__url_time_range}" + } + ], + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "Essential information about this workspace's agent.\n\nRead more about the agent [here](https://coder.com/docs/v2/latest/about/architecture#agents).", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "fieldConfig": { + "defaults": { + "color": { + "mode": "thresholds" + }, + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": true, + "inspect": false + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + }, + { + "color": "red", + "value": 80 + } + ] + }, + "unit": "short" + }, + "overrides": [ + { + "matcher": { + "id": "byName", + "options": "status" + }, + "properties": [ + { + "id": "custom.cellOptions", + "value": { + "type": "color-text" + } + }, + { + "id": "mappings", + "value": [ + { + "options": { + "failed": { + "color": "orange", + "index": 1, + "text": "Failure" + }, + "success": { + "color": "green", + "index": 0, + "text": "Success" + } + }, + "type": "value" + } + ] + } + ] + }, + { + "matcher": { + "id": "byName", + "options": "Workspace Transition" + }, + "properties": [ + { + "id": "custom.cellOptions", + "value": { + "type": "color-text" + } + }, + { + "id": "mappings", + "value": [ + { + "options": { + "DESTROY": { + "color": "red", + "index": 0 + }, + "START": { + "color": "blue", + "index": 1 + }, + "STOP": { + "color": "purple", + "index": 2 + } + }, + "type": "value" + } + ] + } + ] + } + ] + }, + "gridPos": { + "h": 7, + "w": 20, + "x": 0, + "y": 15.2 + }, + "id": 6, + "interval": "", + "options": { + "cellHeight": "sm", + "footer": { + "countRows": false, + "enablePagination": true, + "fields": [], + "reducer": [ + "sum" + ], + "show": false + }, + "showHeader": true, + "sortBy": [ + { + "desc": true, + "displayName": "Time" + } + ] + }, + "pluginVersion": "10.4.0", + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "exemplar": false, + "expr": "sum by (workspace_name, workspace_owner, status, template_name, template_version, workspace_transition) (\n # Since new series are created and are initially set to a value of 1, we cannot use \"increase\" (because an increase from to 1 does not yield 1).\n # So we compare the current series to an interval ago to see if we have any new series and then sum the series we find. \n ((\n coderd_workspace_builds_total{workspace_name=~\"$workspace_name\"} - \n coderd_workspace_builds_total{workspace_name=~\"$workspace_name\"} offset $__interval\n ) >= 0) \n or coderd_workspace_builds_total{workspace_name=~\"$workspace_name\"}\n) > 0", + "format": "table", + "instant": false, + "legendFormat": "__auto", + "range": true, + "refId": "A" + } + ], + "title": "Build Log", + "transformations": [ + { + "disabled": true, + "id": "groupBy", + "options": { + "fields": { + "Count": { + "aggregations": [ + "sum" + ], + "operation": "aggregate" + }, + "Status": { + "aggregations": [], + "operation": "groupby" + }, + "Template Name": { + "aggregations": [], + "operation": "groupby" + }, + "Template Version": { + "aggregations": [], + "operation": "groupby" + }, + "Total": { + "aggregations": [ + "sum" + ], + "operation": "aggregate" + }, + "Value": { + "aggregations": [ + "sum" + ], + "operation": "aggregate" + }, + "Workspace Name": { + "aggregations": [], + "operation": "groupby" + }, + "Workspace Ownert": { + "aggregations": [], + "operation": "groupby" + }, + "Workspace Transition": { + "aggregations": [], + "operation": "groupby" + }, + "status": { + "aggregations": [], + "operation": "groupby" + }, + "template_name": { + "aggregations": [], + "operation": "groupby" + }, + "template_version": { + "aggregations": [], + "operation": "groupby" + }, + "workspace_name": { + "aggregations": [], + "operation": "groupby" + }, + "workspace_owner": { + "aggregations": [], + "operation": "groupby" + }, + "workspace_transition": { + "aggregations": [], + "operation": "groupby" + } + } + } + }, + { + "id": "sortBy", + "options": { + "fields": {}, + "sort": [ + { + "desc": true, + "field": "Value" + } + ] + } + }, + { + "id": "organize", + "options": { + "excludeByName": { + "Time": false + }, + "includeByName": {}, + "indexByName": {}, + "renameByName": { + "Value": "Count", + "Value (sum)": "Total", + "status": "Status", + "template_name": "Template Name", + "template_version": "Template Version", + "workspace_name": "Workspace Name", + "workspace_owner": "Workspace Owner", + "workspace_transition": "Workspace Transition" + } + } + } + ], + "type": "table" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 7, + "w": 4, + "x": 20, + "y": 15.2 + }, + "id": 37, + "links": [ + { + "title": "Provisioners Dashboard", + "url": "/d/provisionerd/provisioners?${__url_time_range}" + } + ], + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "This table shows a reverse-chronological log of all workspace builds.\n\nThe \"Count\" field shows the count of events which occurred within a minute, grouped by all columns.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + }, + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "gridPos": { + "h": 10, + "w": 20, + "x": 0, + "y": 22.2 + }, + "id": 7, + "options": { + "dedupStrategy": "exact", + "enableLogDetails": true, + "prettifyLogMessage": false, + "showCommonLabels": false, + "showLabels": false, + "showTime": true, + "sortOrder": "Descending", + "wrapLogMessage": false + }, + "targets": [ + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "editorMode": "code", + "expr": "{namespace=~`(coder|coder)`, logger=~\"(.*runner|terraform|provisioner.*)\"} |~ \"$workspace_name\" | line_format `{{ printf \"[\\033[35m\" }}{{.pod}}{{ printf \"\\033[0m]\\t\" }}{{ __line__ }}`", + "hide": false, + "queryType": "range", + "refId": "A" + }, + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "editorMode": "code", + "expr": "{namespace=`coder-workspaces`, pod=~\".*($workspace_name).*\"} | line_format `{{ printf \"[\\033[32m\" }}{{.pod}}{{ printf \"\\033[0m]\\t\" }}{{ __line__ }}`", + "hide": false, + "queryType": "range", + "refId": "B" + } + ], + "title": "Logs", + "type": "logs" + }, + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "description": "", + "gridPos": { + "h": 10, + "w": 4, + "x": 20, + "y": 22.2 + }, + "id": 24, + "options": { + "code": { + "language": "plaintext", + "showLineNumbers": false, + "showMiniMap": false + }, + "content": "The logs to the left come both from provisioners and workspace logs.\n\nProvisioner logs matching the name filter are highlighted in magenta, while\nworkspace logs matching the name filter are highlighted in green.", + "mode": "markdown" + }, + "pluginVersion": "10.4.0", + "transparent": true, + "type": "text" + } + ], + "refresh": "30s", + "schemaVersion": 39, + "tags": [], + "templating": { + "list": [ + { + "allValue": "", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "definition": "label_values(coderd_agents_up,workspace_name)", + "hide": 0, + "includeAll": false, + "label": "Workspace Name Filter", + "multi": false, + "name": "workspace_name", + "options": [], + "query": { + "qryType": 1, + "query": "label_values(coderd_agents_up,workspace_name)", + "refId": "PrometheusVariableQueryEditor-VariableQuery" + }, + "refresh": 2, + "regex": "", + "skipUrlSync": false, + "sort": 1, + "type": "query" + } + ] + }, + "time": { + "from": "now-12h", + "to": "now" + }, + "timepicker": {}, + "timezone": "browser", + "title": "Coder Workspace Detail", + "uid": "workspace-detail", + "version": 9, + "weekStart": "" + } diff --git a/deploy/observability/grafana-ingress.yaml b/deploy/observability/grafana-ingress.yaml new file mode 100644 index 0000000..d13a107 --- /dev/null +++ b/deploy/observability/grafana-ingress.yaml @@ -0,0 +1,34 @@ +# Ingress for Grafana at grafana.usgov.coderdemo.io. +# +# Same pattern as deploy/keycloak/ingress.yaml: TLS terminates upstream at the +# NLB (single ACM wildcard cert on the ingress-nginx controller Service). This +# Ingress declares only the plain-HTTP backend route. The Route53 `*` alias +# already resolves grafana.usgov.coderdemo.io to the NLB. +# +# ssl-redirect is disabled because the controller receives plain HTTP from the +# NLB; leaving the default on would cause an HTTP->HTTPS redirect loop. The +# Grafana service created by the kube-prometheus-stack release is `kps-grafana` +# on port 80. +apiVersion: networking.k8s.io/v1 +kind: Ingress +metadata: + name: grafana + namespace: monitoring + labels: + app.kubernetes.io/name: grafana + app.kubernetes.io/part-of: usgov-coderdemo + annotations: + nginx.ingress.kubernetes.io/ssl-redirect: "false" +spec: + ingressClassName: nginx + rules: + - host: grafana.usgov.coderdemo.io + http: + paths: + - path: / + pathType: Prefix + backend: + service: + name: kps-grafana + port: + number: 80 diff --git a/deploy/observability/kube-prometheus-stack-values.yaml b/deploy/observability/kube-prometheus-stack-values.yaml new file mode 100644 index 0000000..90c0901 --- /dev/null +++ b/deploy/observability/kube-prometheus-stack-values.yaml @@ -0,0 +1,164 @@ +# Helm values for prometheus-community/kube-prometheus-stack, pinned to chart +# 86.2.0 (appVersion / prometheus-operator v0.91.0). +# +# Scope: the in-cluster metrics + dashboards stack for the GovCloud demo, +# installed into the `monitoring` namespace. It scrapes the Coder control +# plane's /metrics endpoint (via the ServiceMonitor in deploy/observability) +# and renders Coder's prebuilt Grafana dashboards with live data. +# +# Design choices for GovCloud + a lean demo: +# - Every image is pulled from the private ECR mirror (no pull-through cache +# in GovCloud). Tags match scripts/images.txt. +# - Alertmanager, node-exporter, and kube-state-metrics are DISABLED to cut +# the image-mirroring surface and footprint. The Coder dashboards' core +# panels (coderd_*, up, and cAdvisor CPU/memory from the kubelet) render; +# a handful of kube-state-metrics-only panels stay empty by design. +# - The managed EKS control-plane ServiceMonitors are disabled (their +# endpoints are not scrapeable on EKS and would show as permanently-down +# targets). The kubelet ServiceMonitor is kept for container resource +# metrics. +# - Prometheus and Grafana persist on gp3 PVCs (modest sizes). +# - Grafana's admin password comes from a Kubernetes Secret that External +# Secrets Operator syncs from AWS Secrets Manager (admin.existingSecret). + +crds: + enabled: true + +# --- Components we do not run (cuts images + footprint) --------------------- +alertmanager: + enabled: false +nodeExporter: + enabled: false +kubeStateMetrics: + enabled: false + +# Managed EKS control-plane components are not directly scrapeable; disabling +# their ServiceMonitors keeps the demo's target list clean. Keep the kubelet so +# cAdvisor container CPU/memory metrics power the Coder dashboards. +kubeApiServer: + enabled: false +kubeControllerManager: + enabled: false +kubeScheduler: + enabled: false +kubeProxy: + enabled: false +kubeEtcd: + enabled: false +coreDns: + enabled: false +kubelet: + enabled: true + +# Bundled alerting rules assume kube-state-metrics / node-exporter; skip them. +defaultRules: + create: false +windowsMonitoring: + enabled: false + +# --- Prometheus operator ---------------------------------------------------- +prometheusOperator: + # Admission webhooks pull an extra registry.k8s.io cert-gen image that the + # ECR mirror script does not handle; disable them (not needed for the demo). + admissionWebhooks: + enabled: false + tls: + enabled: false + image: + registry: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com + repository: quay/prometheus-operator/prometheus-operator + tag: v0.91.0 + prometheusConfigReloader: + image: + registry: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com + repository: quay/prometheus-operator/prometheus-config-reloader + tag: v0.91.0 + resources: + requests: + cpu: 50m + memory: 128Mi + limits: + memory: 256Mi + +# --- Prometheus ------------------------------------------------------------- +prometheus: + prometheusSpec: + image: + registry: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com + repository: quay/prometheus/prometheus + tag: v3.12.0-distroless + # Discover ServiceMonitors / PodMonitors / rules regardless of the Helm + # release label, across all namespaces. This lets the Coder ServiceMonitor + # in the `coder` namespace be selected without extra labels. + serviceMonitorSelectorNilUsesHelmValues: false + podMonitorSelectorNilUsesHelmValues: false + ruleSelectorNilUsesHelmValues: false + probeSelectorNilUsesHelmValues: false + scrapeConfigSelectorNilUsesHelmValues: false + retention: 7d + resources: + requests: + cpu: 200m + memory: 768Mi + limits: + memory: 1536Mi + storageSpec: + volumeClaimTemplate: + spec: + storageClassName: gp3 + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 20Gi + +# --- Grafana ---------------------------------------------------------------- +grafana: + enabled: true + image: + registry: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com + repository: docker-hub/grafana/grafana + tag: 13.0.1-security-01 + # Admin credentials from the ESO-synced Secret `grafana-admin` (sourced from + # AWS Secrets Manager usgov-coderdemo/observability/grafana). + admin: + existingSecret: grafana-admin + userKey: admin-user + passwordKey: admin-password + # The bundled Kubernetes dashboards need kube-state-metrics, which is off. + defaultDashboardsEnabled: false + service: + type: ClusterIP + # Grafana is exposed through our own Ingress (grafana-ingress.yaml); the NLB + # terminates TLS, so set root_url to the public HTTPS host for correct login + # redirects. + ingress: + enabled: false + "grafana.ini": + server: + root_url: https://grafana.usgov.coderdemo.io + sidecar: + image: + registry: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com + repository: quay/kiwigrid/k8s-sidecar + tag: 2.7.3 + dashboards: + enabled: true + label: grafana_dashboard + labelValue: "1" + searchNamespace: ALL + datasources: + enabled: true + persistence: + enabled: true + type: pvc + storageClassName: gp3 + accessModes: + - ReadWriteOnce + size: 5Gi + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + memory: 512Mi diff --git a/deploy/observability/namespace.yaml b/deploy/observability/namespace.yaml new file mode 100644 index 0000000..00f5409 --- /dev/null +++ b/deploy/observability/namespace.yaml @@ -0,0 +1,9 @@ +# Dedicated namespace for the in-cluster observability stack +# (kube-prometheus-stack: Prometheus + Grafana + the Prometheus operator). +apiVersion: v1 +kind: Namespace +metadata: + name: monitoring + labels: + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: observability diff --git a/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml b/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml index d8c1002..fcda2c0 100644 --- a/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml +++ b/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml @@ -177,3 +177,20 @@ spec: dataFrom: - extract: key: usgov-coderdemo/gitlab/secrets +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: grafana-admin + namespace: monitoring +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: grafana-admin + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/observability/grafana diff --git a/docs/00-INDEX.md b/docs/00-INDEX.md index 054c4d4..a2b42d3 100644 --- a/docs/00-INDEX.md +++ b/docs/00-INDEX.md @@ -18,10 +18,21 @@ live result. - [as-built/README.md](as-built/README.md) (index) - [as-built/00-overview.md](as-built/00-overview.md): architecture + flows +- [as-built/55-observability.md](as-built/55-observability.md): in-cluster Prometheus + Grafana observability - [as-built/80-iac-vs-imperative.md](as-built/80-iac-vs-imperative.md): declarative vs imperative ledger - [as-built/85-secrets-management.md](as-built/85-secrets-management.md): secrets via ESO + AWS Secrets Manager - [as-built/90-operations-runbook.md](as-built/90-operations-runbook.md): day-2 ops +## Plans (design proposals, not yet applied) + +Forward-looking designs with companion GitHub issues. Nothing in these plans is +applied to the live environment. + +- [plans/README.md](plans/README.md) (index) +- [plans/observability-aws-native.md](plans/observability-aws-native.md): AWS-native metrics + audit pipeline (AMP/AMG, CloudWatch/Firehose/S3/Athena, optional Security Lake) +- [plans/gitops-control-plane.md](plans/gitops-control-plane.md): Argo CD control plane sourced from the in-cluster GitLab +- [plans/gitops-adoption.md](plans/gitops-adoption.md): per-workload GitOps adoption + non-Kubernetes app state + ## Architecture - [architecture/overview.md](architecture/overview.md) diff --git a/docs/as-built/55-observability.md b/docs/as-built/55-observability.md new file mode 100644 index 0000000..563db00 --- /dev/null +++ b/docs/as-built/55-observability.md @@ -0,0 +1,179 @@ +# 55. Observability (as-built) + +In-boundary, in-cluster metrics and dashboards for the GovCloud demo: the +`prometheus-community/kube-prometheus-stack` Helm release `kps` (Prometheus + +Grafana + the Prometheus operator) in the `monitoring` namespace, scraping the +Coder control plane's Prometheus metrics and rendering Coder's prebuilt Grafana +dashboards with live data at `https://grafana.usgov.coderdemo.io`. Coder audit +logging is entitled and on; structured JSON server logs make it SIEM-ready. + +This is the reliable in-cluster implementation. The AWS-native managed variant +(Amazon Managed Prometheus / Grafana, Security Lake) is planned separately and +is intentionally not built here. + +Source of truth for the manifests and the reproduce/verify steps: +`deploy/observability/` and `deploy/observability/README.md`. Coder server +changes are in `deploy/coder/values.yaml`; the Grafana admin ExternalSecret is +in `deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml`. + +## Verification method + +Mutating steps (helm install/upgrade, kubectl apply, one ASM +`create-secret`) were performed during this build against `./kubeconfig` and the +`us-gov-west-1` account `430737322961`. Live checks used read-only `kubectl get`, +the Prometheus HTTP API over a `port-forward`, and authenticated calls to the +public Grafana host. The Grafana admin password was read from the synced +Kubernetes Secret and never printed. Always target +`https://grafana.usgov.coderdemo.io` explicitly. + +## Coder server changes (deploy/coder/values.yaml) + +Four env vars were ADDED (the existing AI-provider seed env vars were left +untouched so the coderd drift guard does not trip): + +| Env var | Value | Purpose | +|---|---|---| +| `CODER_PROMETHEUS_ENABLE` | `true` | Serve Prometheus metrics. | +| `CODER_PROMETHEUS_ADDRESS` | `0.0.0.0:2112` | Bind the metrics endpoint to the pod network (the default `127.0.0.1` is not scrapeable). | +| `CODER_PROMETHEUS_COLLECT_AGENT_STATS` | `true` | Emit per-workspace agent stats used by the workspace dashboards. | +| `CODER_LOGGING_JSON` | `/dev/stderr` | Emit structured JSON logs to stderr. | +| `CODER_LOGGING_HUMAN` | `/dev/null` | Silence the duplicate human stream so stderr carries JSON only. | + +Note on logging: Coder has no single `CODER_LOG_FORMAT` flag. JSON output is +selected by pointing `--log-json` / `CODER_LOGGING_JSON` at a sink, and the +human stream (default `/dev/stderr`) is redirected to `/dev/null` to avoid +duplicate lines. Verified live: the coder pod's stderr is single-stream JSON +(for example `{"ts":...,"level":"INFO","msg":"serving connection",...}`). + +Helm release `coder` went to revision 5; the Deployment rolled out 1/1. The +metrics endpoint returns `coderd_*` series: exec into the coder pod and run +`wget -qO- http://localhost:2112/metrics` to see +`coderd_api_requests_processed_total` and the `coderd_agentapi_*` family. + +## The stack (Helm release kps, ns monitoring) + +| Component | Live object | Storage | +|---|---|---| +| Prometheus | `prometheus-kps-kube-prometheus-stack-prometheus-0` (2/2), Service `kps-kube-prometheus-stack-prometheus:9090` | 20Gi gp3 PVC, 7d retention | +| Grafana | `kps-grafana` (3/3: grafana + dashboard sidecar + datasource sidecar), Service `kps-grafana:80` | 5Gi gp3 PVC | +| Operator | `kps-kube-prometheus-stack-operator` (1/1) | n/a | + +Chart `kube-prometheus-stack-86.2.0` (operator `v0.91.0`). To keep the demo lean +and cut image mirroring, Alertmanager, node-exporter, kube-state-metrics, the +bundled alert rules, and the managed EKS control-plane ServiceMonitors are +disabled. The kubelet ServiceMonitor is kept so cAdvisor container CPU/memory +metrics power the dashboards' resource panels (9 kubelet targets are up). + +### Images (ECR mirror, no pull-through in GovCloud) + +Mirrored via `scripts/images.txt` + `scripts/mirror-images.sh`; chart values +override the image repos to the mirror: + +- `quay/prometheus/prometheus:v3.12.0-distroless` +- `quay/prometheus-operator/prometheus-operator:v0.91.0` +- `quay/prometheus-operator/prometheus-config-reloader:v0.91.0` +- `docker-hub/grafana/grafana:13.0.1-security-01` +- `quay/kiwigrid/k8s-sidecar:2.7.3` + +## Scrape path (Coder) + +`coderd` serves metrics on `:2112`. The Coder chart Service exposes only the app +port, so `deploy/observability/coder-metrics.yaml` adds: + +- a headless Service `coder-metrics` (ns `coder`, port 2112) selecting only the + control-plane pod (`app.kubernetes.io/name=coder`, `instance=coder`); the + external provisioner pods do not match and are excluded, and +- `ServiceMonitor/coder` (ns `coder`) selecting that Service. + +Prometheus is set with `serviceMonitorSelectorNilUsesHelmValues: false`, so it +discovers the ServiceMonitor without a release label and adds `namespace` and +`pod` target labels. Verified live (Prometheus `/api/v1/targets`): job +`coder-metrics` is `up`, target +`http://10.0.x.x:2112/metrics`, labels `namespace="coder"`, +`pod="coder-...."`, `lastError` empty. PromQL spot checks: +`up{job="coder-metrics"}` is `1`, `sum(coderd_api_requests_processed_total)` +returns a live counter, and `container_cpu_usage_seconds_total{namespace="coder"}` +has series (cAdvisor). + +## Grafana + +- Datasource: the chart auto-provisions `Prometheus` (uid `prometheus`, + default), URL `http://kps-kube-prometheus-stack-prometheus.monitoring:9090`. + Verified via `GET /api/datasources`. +- Dashboards: six Prometheus-backed Coder dashboards from + `github.com/coder/observability` are shipped as ConfigMaps + (`dashboards-coder.yaml`) labelled `grafana_dashboard: "1"` and imported by + the Grafana sidecar (`NAMESPACE=ALL`). Verified via + `GET /api/search?type=dash-db`: Coder Control Plane (`coderd`), Coder Status, + Coder Prebuilds, Coder Provisioners, Coder Workspaces, Coder Workspace Detail. + Every panel targets datasource uid `prometheus`; the dashboard selectors are + already scoped to `namespace="coder"`, `pod=~"coder.*"`, which match the + scraped series. +- Live data: through Grafana's datasource proxy, the main Coder Control Plane + dashboard query + `sum by(pod) (rate(coderd_api_requests_processed_total{...}[5m]))` returns a + series, and `up{job="coder-metrics"}` returns `1`. So the main dashboard + renders live data end to end (Grafana to Prometheus to coderd). +- The purely log-based `agent-boundaries` dashboard is omitted, and a few log + panels inside the workspaces / provisionerd / workspace-detail dashboards show + no data, because this stack ships metrics only (no Loki). Their Prometheus + panels render live. + +### Admin credentials (ESO + AWS Secrets Manager) + +The admin password is generated once and stored as JSON +`{"admin-user","admin-password"}` in AWS Secrets Manager at +`usgov-coderdemo/observability/grafana`. The ESO `ClusterSecretStore` +`aws-secretsmanager` syncs it into the Kubernetes Secret `grafana-admin` +(ns `monitoring`) through the ExternalSecret added to +`deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml`; Grafana +reads it via `admin.existingSecret`. The ESO IAM role only allows reading +`usgov-coderdemo/*`, so this path is in policy, and no password is in git. +Verified live: ExternalSecret `grafana-admin` is `Ready=True` reason +`SecretSynced`; the Secret carries keys `admin-user` and `admin-password`; and +logging in to the public Grafana with that password succeeds. + +## Ingress (HTTPS) + +`deploy/observability/grafana-ingress.yaml` follows the platform pattern +(`deploy/keycloak/ingress.yaml`): `ingressClassName: nginx`, host +`grafana.usgov.coderdemo.io`, `nginx.ingress.kubernetes.io/ssl-redirect: +"false"`, no TLS block. One internet-facing NLB terminates TLS with the ACM +wildcard cert and forwards plain HTTP to ingress-nginx, which routes to +`kps-grafana:80`. The Route53 `*` alias already resolves the host. Verified +live: `https://grafana.usgov.coderdemo.io/login` returns HTTP `200` with +`ssl_verify_result=0` (valid TLS, no `-k`), and `/api/health` reports +`database: ok`, version `13.0.1+security-01`. + +## Audit logging + +Audit logging is a licensed Coder feature and is already entitled and enabled +(`GET /api/v2/entitlements`: `audit_log` and `connection_log` entitled + +enabled, see `30-coder-control-plane.md`). The in-product audit view is the +Coder dashboard's `/audit` page, which records who did what (logins, template +and workspace changes, user and org administration). No env var is required to +turn it on beyond the license. + +For SIEM ingestion, `CODER_LOGGING_JSON=/dev/stderr` makes the coderd server +logs structured JSON on stderr, so the cluster log pipeline can ship them to a +downstream SIEM without parsing free text. The audit records themselves remain +queryable through the Coder API and `/audit` UI, and audit entries are retained +indefinitely by default (`CODER_AUDIT_LOGS_RETENTION` default `0` = keep +forever). + +## Reaching Grafana + +- URL: `https://grafana.usgov.coderdemo.io` (valid TLS via the ACM wildcard). +- User: `admin`. Password: the value synced into the `grafana-admin` Secret from + ASM `usgov-coderdemo/observability/grafana` + (`kubectl -n monitoring get secret grafana-admin -o jsonpath='{.data.admin-password}' | base64 -d`). +- Open the "Coder Control Plane" dashboard for live control-plane metrics. + +## Notes and known gaps + +- Metrics only: no Loki/logs datasource, so log-based panels and the + `agent-boundaries` dashboard are inactive by design. +- kube-state-metrics is disabled, so the dashboards' pod resource limit/request + and restart/terminated-reason panels (which depend on `kube_pod_*`) stay + empty; container CPU/memory usage panels (cAdvisor via the kubelet) render. +- Alerting is out of scope: Alertmanager and the bundled alert rules are off. diff --git a/docs/as-built/README.md b/docs/as-built/README.md index 9bc60d8..805c651 100644 --- a/docs/as-built/README.md +++ b/docs/as-built/README.md @@ -18,6 +18,7 @@ docs explain the *how* and *why* behind that status. | [40-identity-keycloak.md](40-identity-keycloak.md) | Keycloak realm `coder`, the OIDC client, the SSO wiring, and IdP sync status. | | [45-idp-sync-personas.md](45-idp-sync-personas.md) | Multi-tenant org/group/role hierarchy, the persona users, and the verified Keycloak-to-Coder IdP sync (org + group + role). | | [50-gitlab-scm.md](50-gitlab-scm.md) | In-boundary GitLab SCM, the instance-wide OAuth app, and how every workspace authenticates git against it. | +| [55-observability.md](55-observability.md) | In-cluster observability: kube-prometheus-stack (Prometheus + Grafana) in the `monitoring` namespace, Coder Prometheus metrics, the six Coder Grafana dashboards at `grafana.usgov.coderdemo.io`, structured JSON logs, and audit logging. The AWS-native managed variant is planned in `docs/plans/`. | | [60-ai-gateway.md](60-ai-gateway.md) | AI Gateway / AI Bridge: DB-managed providers (`anthropic` direct + `anthropic-bedrock` IRSA), name-based routing, the end-to-end request flow, and the remaining action to make AI respond. | | [70-workspace-templates.md](70-workspace-templates.md) | The `claude-code` workspace template: pod/PVC, the claude-code module (4.7.3), Coder Tasks, parameters, and the required GitLab external auth. | | [80-iac-vs-imperative.md](80-iac-vs-imperative.md) | The declarative-versus-imperative ledger and the Terraform reconciliation backlog. | diff --git a/scripts/images.txt b/scripts/images.txt index ab9d18d..8672dd4 100644 --- a/scripts/images.txt +++ b/scripts/images.txt @@ -23,3 +23,13 @@ docker.io/codercom/enterprise-base:ubuntu-noble-20260601 # --- External Secrets Operator (deploy/platform/external-secrets) --- # Controller, webhook, and cert-controller all use this single image. ghcr.io/external-secrets/external-secrets:v2.6.0 + +# --- Observability stack (deploy/observability) --- +# kube-prometheus-stack 86.2.0 (Prometheus + Grafana + operator), minimal set. +# Alertmanager, node-exporter, and kube-state-metrics are disabled, so their +# images are intentionally not mirrored. +quay.io/prometheus/prometheus:v3.12.0-distroless +quay.io/prometheus-operator/prometheus-operator:v0.91.0 +quay.io/prometheus-operator/prometheus-config-reloader:v0.91.0 +docker.io/grafana/grafana:13.0.1-security-01 +quay.io/kiwigrid/k8s-sidecar:2.7.3 From 5fa87f5be8d9e8305dd894fd73c8ae166c3eb5d9 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 19:37:24 +0000 Subject: [PATCH 09/16] feat: Grafana single sign-on via Keycloak for the observability stack Make the demo one SSO: Grafana now logs in through the same Keycloak realm (coder) as Coder, instead of local-admin only. The local admin login form is kept enabled as break-glass. - scripts/setup-grafana-oidc.py (idempotent): register a confidential OIDC client `grafana` in the realm (authorization-code + PKCE S256, redirect https://grafana.usgov.coderdemo.io/login/generic_oauth) with the same full-path `groups` group-membership mapper the coder client uses, then read the client secret and upsert it to AWS Secrets Manager at usgov-coderdemo/observability/grafana-oauth. - ESO ExternalSecret grafana-oauth (ns monitoring) syncs that secret into a Kubernetes Secret; Grafana consumes it via the env var GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET (grafana.envValueFrom), so no secret is in git. - kube-prometheus-stack-values.yaml: enable [auth.generic_oauth] against the realm auth/token/userinfo endpoints (scopes openid email profile) and map group membership to a Grafana org role: contains(groups[*], '/platform') && 'Admin' || 'Viewer'. allow_sign_up auto-provisions users; allow_assign_grafana_admin is off so the server-admin flag stays local. Verified live (helm release kps upgraded, Grafana rolled out): the login page shows "Sign in with Keycloak"; /login/generic_oauth redirects to the realm with client_id=grafana and PKCE; a headless authorization-code login per persona confirms role mapping (pat.platform in /platform -> Admin, /api/org/users 200; dana.dev in /alpha -> Viewer, /api/org/users 403), both authLabels Generic OAuth and isExternallySynced. Docs: docs/as-built/55-observability.md and deploy/observability/README.md gain an SSO section; STATUS.md notes the one-SSO Grafana login. Generated by Coder Agents. --- STATUS.md | 7 + deploy/observability/README.md | 31 ++- .../kube-prometheus-stack-values.yaml | 40 +++- .../secretstore-and-externalsecrets.yaml | 17 ++ docs/as-built/55-observability.md | 48 +++- scripts/setup-grafana-oidc.py | 214 ++++++++++++++++++ 6 files changed, 349 insertions(+), 8 deletions(-) create mode 100755 scripts/setup-grafana-oidc.py diff --git a/STATUS.md b/STATUS.md index 82f9c21..efc4bf3 100644 --- a/STATUS.md +++ b/STATUS.md @@ -165,6 +165,13 @@ gated; Nova Pro is the proven fallback. render live data at `https://grafana.usgov.coderdemo.io` (valid TLS, HTTP 200). Grafana admin password lives in AWS Secrets Manager (`usgov-coderdemo/observability/grafana`) and is synced by ESO. +- [x] **Grafana Keycloak SSO (one SSO)**: Grafana signs in via the same realm + (`coder`) through a confidential OIDC client `grafana` + (`scripts/setup-grafana-oidc.py`, PKCE; secret in ASM + `usgov-coderdemo/observability/grafana-oauth`, ESO-synced). Group + membership maps to org role: `/platform` -> Grafana `Admin`, others -> + `Viewer`; local admin kept as break-glass. Verified per persona + (`pat.platform` Admin, `dana.dev` Viewer). - [x] **Structured JSON server logs** (`CODER_LOGGING_JSON=/dev/stderr`, `CODER_LOGGING_HUMAN=/dev/null`) make coderd SIEM-ready; audit logging is entitled + on (`/audit`). diff --git a/deploy/observability/README.md b/deploy/observability/README.md index 07785d9..a2d0f76 100644 --- a/deploy/observability/README.md +++ b/deploy/observability/README.md @@ -15,7 +15,7 @@ is not built here. |---|---| | Helm release | `kps` = `prometheus-community/kube-prometheus-stack` chart `86.2.0` (prometheus-operator `v0.91.0`), namespace `monitoring`. Values: `kube-prometheus-stack-values.yaml`. | | Prometheus | StatefulSet `prometheus-kps-kube-prometheus-stack-prometheus`, 20Gi gp3 PVC, 7d retention. Service `kps-kube-prometheus-stack-prometheus:9090`. | -| Grafana | Deployment `kps-grafana`, 5Gi gp3 PVC. Service `kps-grafana:80`. Admin password from AWS Secrets Manager via ESO. | +| Grafana | Deployment `kps-grafana`, 5Gi gp3 PVC. Service `kps-grafana:80`. Keycloak SSO (generic OAuth) + local admin break-glass; admin password and OIDC client secret from AWS Secrets Manager via ESO. | | Prometheus operator | Deployment `kps-kube-prometheus-stack-operator`. Admission webhooks disabled. | | Coder scrape | `coder-metrics` headless Service (port 2112) + `ServiceMonitor/coder`, both in namespace `coder`. Prometheus job `coder-metrics`. | | Dashboards | Six Coder dashboards as ConfigMaps in `monitoring`, imported by the Grafana sidecar (label `grafana_dashboard: "1"`). | @@ -66,6 +66,25 @@ panels inside the workspaces / provisionerd / workspace-detail dashboards show no data, because this stack ships metrics only (no Loki). Their Prometheus panels render live. +## Single sign-on (Keycloak) + +Grafana logs in through the same Keycloak realm (`coder`) as Coder, so the demo +is one SSO. `scripts/setup-grafana-oidc.py` (idempotent) registers a confidential +OIDC client `grafana` (authorization-code + PKCE, redirect +`https://grafana.usgov.coderdemo.io/login/generic_oauth`, full-path `groups` +mapper) and writes its client secret to AWS Secrets Manager at +`usgov-coderdemo/observability/grafana-oauth`. ESO syncs that into the +`grafana-oauth` Secret, and Grafana reads it via the +`GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET` env (`grafana.envValueFrom`), so no secret +is in git. + +The `[auth.generic_oauth]` block in `kube-prometheus-stack-values.yaml` maps +group membership to a Grafana org role +(`contains(groups[*], '/platform') && 'Admin' || 'Viewer'`): Platform +Engineering administers Grafana, everyone else is read-only `Viewer`. The local +admin login form is kept enabled as break-glass. See +`docs/as-built/55-observability.md` for the verified persona role mapping. + ## Grafana admin credentials (ESO + AWS Secrets Manager) The admin password is generated once and stored as JSON @@ -100,10 +119,12 @@ kubectl -n coder rollout status deploy/coder # --name usgov-coderdemo/observability/grafana \ # --secret-string file:///path/to/grafana.json # {"admin-user","admin-password"} -# 4. Namespace + ESO ExternalSecret for the Grafana admin secret. +# 4. Namespace + ESO ExternalSecrets (Grafana admin + OIDC client secret). +# First register the Keycloak client and publish its secret to ASM. +python3 scripts/setup-grafana-oidc.py kubectl apply -f deploy/observability/namespace.yaml kubectl apply -f deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml -kubectl -n monitoring get externalsecret grafana-admin # Ready=SecretSynced +kubectl -n monitoring get externalsecret grafana-admin grafana-oauth # Ready=SecretSynced # 5. Install the stack. helm install kps ~/.cache/helm/repository/kube-prometheus-stack-86.2.0.tgz \ @@ -132,4 +153,8 @@ curl -s 'http://localhost:9090/api/v1/query?query=up{job="coder-metrics"}' GPW=$(kubectl -n monitoring get secret grafana-admin -o jsonpath='{.data.admin-password}' | base64 -d) curl -s -o /dev/null -w '%{http_code} ssl=%{ssl_verify_result}\n' https://grafana.usgov.coderdemo.io/login curl -s -u "admin:$GPW" 'https://grafana.usgov.coderdemo.io/api/search?type=dash-db&query=Coder' + +# Keycloak SSO button + redirect (client_id=grafana, PKCE) +curl -s https://grafana.usgov.coderdemo.io/login | grep -o '"oauth":{[^}]*}' +curl -s -o /dev/null -D - https://grafana.usgov.coderdemo.io/login/generic_oauth | grep -i '^location:' ``` diff --git a/deploy/observability/kube-prometheus-stack-values.yaml b/deploy/observability/kube-prometheus-stack-values.yaml index 90c0901..7faf201 100644 --- a/deploy/observability/kube-prometheus-stack-values.yaml +++ b/deploy/observability/kube-prometheus-stack-values.yaml @@ -120,11 +120,20 @@ grafana: repository: docker-hub/grafana/grafana tag: 13.0.1-security-01 # Admin credentials from the ESO-synced Secret `grafana-admin` (sourced from - # AWS Secrets Manager usgov-coderdemo/observability/grafana). + # AWS Secrets Manager usgov-coderdemo/observability/grafana). The local admin + # login is kept as break-glass alongside Keycloak SSO below. admin: existingSecret: grafana-admin userKey: admin-user passwordKey: admin-password + # Keycloak OIDC client secret (ESO-synced from + # usgov-coderdemo/observability/grafana-oauth) injected as the env var that + # overrides [auth.generic_oauth] client_secret, so no secret sits in git. + envValueFrom: + GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: + secretKeyRef: + name: grafana-oauth + key: client-secret # The bundled Kubernetes dashboards need kube-state-metrics, which is off. defaultDashboardsEnabled: false service: @@ -137,6 +146,35 @@ grafana: "grafana.ini": server: root_url: https://grafana.usgov.coderdemo.io + # Single sign-on via the same Keycloak realm (`coder`) as the rest of the + # stack. The confidential client `grafana` is created by + # scripts/setup-grafana-oidc.py; the secret is injected via the env var + # above. The local admin login form is left enabled as break-glass. + auth: + oauth_auto_login: false + disable_login_form: false + "auth.generic_oauth": + enabled: true + name: Keycloak + icon: signin + client_id: grafana + # client_secret comes from GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET (env). + scopes: "openid email profile" + auth_url: https://auth.usgov.coderdemo.io/realms/coder/protocol/openid-connect/auth + token_url: https://auth.usgov.coderdemo.io/realms/coder/protocol/openid-connect/token + api_url: https://auth.usgov.coderdemo.io/realms/coder/protocol/openid-connect/userinfo + login_attribute_path: preferred_username + email_attribute_path: email + name_attribute_path: name + use_pkce: true + allow_sign_up: true + # Map Keycloak group membership to a Grafana org role. Platform + # Engineering (group path /platform) administers Grafana; every other + # authenticated realm user gets read-only Viewer. + role_attribute_path: "contains(groups[*], '/platform') && 'Admin' || 'Viewer'" + role_attribute_strict: false + # Org role only; the Grafana server-admin flag stays with the local admin. + allow_assign_grafana_admin: false sidecar: image: registry: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com diff --git a/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml b/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml index fcda2c0..bedc731 100644 --- a/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml +++ b/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml @@ -194,3 +194,20 @@ spec: dataFrom: - extract: key: usgov-coderdemo/observability/grafana +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: grafana-oauth + namespace: monitoring +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: grafana-oauth + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/observability/grafana-oauth diff --git a/docs/as-built/55-observability.md b/docs/as-built/55-observability.md index 563db00..80e079b 100644 --- a/docs/as-built/55-observability.md +++ b/docs/as-built/55-observability.md @@ -4,8 +4,10 @@ In-boundary, in-cluster metrics and dashboards for the GovCloud demo: the `prometheus-community/kube-prometheus-stack` Helm release `kps` (Prometheus + Grafana + the Prometheus operator) in the `monitoring` namespace, scraping the Coder control plane's Prometheus metrics and rendering Coder's prebuilt Grafana -dashboards with live data at `https://grafana.usgov.coderdemo.io`. Coder audit -logging is entitled and on; structured JSON server logs make it SIEM-ready. +dashboards with live data at `https://grafana.usgov.coderdemo.io`. Grafana signs +in through the same Keycloak realm (`coder`) as the rest of the stack, so the +demo is one SSO. Coder audit logging is entitled and on; structured JSON server +logs make it SIEM-ready. This is the reliable in-cluster implementation. The AWS-native managed variant (Amazon Managed Prometheus / Grafana, Security Lake) is planned separately and @@ -119,6 +121,41 @@ has series (cAdvisor). no data, because this stack ships metrics only (no Loki). Their Prometheus panels render live. +### Single sign-on (Keycloak OIDC) + +Grafana logs in through the same Keycloak realm (`coder`) as Coder, so the demo +is one SSO. A confidential OIDC client `grafana` is registered in the realm by +`scripts/setup-grafana-oidc.py` (idempotent): standard authorization-code flow +with PKCE (S256), redirect URI +`https://grafana.usgov.coderdemo.io/login/generic_oauth`, and the same full-path +`groups` group-membership mapper the `coder` client uses. The script reads the +client secret and stores it in AWS Secrets Manager at +`usgov-coderdemo/observability/grafana-oauth` (`{"client-secret"}`); ESO syncs it +into the Kubernetes Secret `grafana-oauth`, and Grafana consumes it through the +env var `GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET` (set via `grafana.envValueFrom`), +so no secret is in git. + +Grafana's `[auth.generic_oauth]` (in `kube-prometheus-stack-values.yaml`) points +at the realm's auth/token/userinfo endpoints with scopes `openid email profile` +and maps Keycloak group membership to a Grafana org role: + +``` +role_attribute_path: contains(groups[*], '/platform') && 'Admin' || 'Viewer' +``` + +so Platform Engineering (group path `/platform`) administers Grafana and every +other authenticated realm user gets read-only `Viewer`. `allow_sign_up: true` +auto-provisions users on first login; `allow_assign_grafana_admin: false` keeps +the Grafana server-admin flag with the local account. The local admin login form +is intentionally left enabled (`disable_login_form: false`) as break-glass. + +Verified live with a headless authorization-code login per persona: the login +page shows "Sign in with Keycloak"; `/login/generic_oauth` redirects to the realm +with `client_id=grafana` and PKCE; `pat.platform` (`/platform`) lands as org +role `Admin` (the admin-only `/api/org/users` returns 200) while `dana.dev` +(`/alpha`) lands as `Viewer` (same endpoint returns 403). Both arrive +`authLabels: ["Generic OAuth"]`, `isExternallySynced: true`. + ### Admin credentials (ESO + AWS Secrets Manager) The admin password is generated once and stored as JSON @@ -164,8 +201,11 @@ forever). ## Reaching Grafana - URL: `https://grafana.usgov.coderdemo.io` (valid TLS via the ACM wildcard). -- User: `admin`. Password: the value synced into the `grafana-admin` Secret from - ASM `usgov-coderdemo/observability/grafana` +- SSO (preferred): click **Sign in with Keycloak** and authenticate against the + `coder` realm. Platform Engineering personas get Grafana `Admin`; other realm + users get `Viewer`. See "Single sign-on (Keycloak OIDC)". +- Break-glass: user `admin`, password the value synced into the `grafana-admin` + Secret from ASM `usgov-coderdemo/observability/grafana` (`kubectl -n monitoring get secret grafana-admin -o jsonpath='{.data.admin-password}' | base64 -d`). - Open the "Coder Control Plane" dashboard for live control-plane metrics. diff --git a/scripts/setup-grafana-oidc.py b/scripts/setup-grafana-oidc.py new file mode 100755 index 0000000..19229ed --- /dev/null +++ b/scripts/setup-grafana-oidc.py @@ -0,0 +1,214 @@ +#!/usr/bin/env python3 +""" +setup-grafana-oidc.py - register the Grafana OIDC client in the Keycloak realm +`coder` so Grafana logs in with the same SSO as Coder, and publish the client +secret to AWS Secrets Manager for ESO to sync. + +Idempotent: re-running ensures the desired client + group-membership mapper and +upserts the secret. It does NOT rotate the Keycloak client secret on each run; it +reads the current secret and writes that value to ASM. + +What it does: + 1. Create/update a confidential OIDC client `grafana` (standard flow, PKCE + S256, redirect URI https://grafana.usgov.coderdemo.io/login/generic_oauth). + 2. Add the same full-path `groups` group-membership mapper used by the `coder` + client, so Grafana can map Keycloak group membership to Grafana org roles. + 3. Read the client secret and upsert it into AWS Secrets Manager at + usgov-coderdemo/observability/grafana-oauth as {"client-secret": "..."}. + +Reads admin credentials from ~/.config/usgov-coderdemo/generated-secrets.env: + KEYCLOAK_ADMIN_USERNAME, KEYCLOAK_ADMIN_PASSWORD + +Pairs with deploy/observability/kube-prometheus-stack-values.yaml (Grafana +generic_oauth config) and the grafana-oauth ExternalSecret in +deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml. +""" +import json +import os +import subprocess +import tempfile +import urllib.error +import urllib.parse +import urllib.request + +KC = os.environ.get("KEYCLOAK_URL", "https://auth.usgov.coderdemo.io").rstrip("/") +REALM = "coder" +CLIENT_ID = "grafana" +GRAFANA_URL = "https://grafana.usgov.coderdemo.io" +REGION = "us-gov-west-1" +ASM_NAME = "usgov-coderdemo/observability/grafana-oauth" + +DESIRED_CLIENT = { + "clientId": CLIENT_ID, + "name": "Grafana", + "description": "Grafana (observability) SSO via Keycloak realm coder.", + "enabled": True, + "protocol": "openid-connect", + "publicClient": False, + "standardFlowEnabled": True, + "implicitFlowEnabled": False, + "directAccessGrantsEnabled": False, + "serviceAccountsEnabled": False, + "clientAuthenticatorType": "client-secret", + "rootUrl": GRAFANA_URL, + "baseUrl": "/", + "redirectUris": [GRAFANA_URL + "/login/generic_oauth"], + "webOrigins": [GRAFANA_URL], + "attributes": { + "pkce.code.challenge.method": "S256", + "post.logout.redirect.uris": GRAFANA_URL + "/*", + }, +} + +# Same full-path groups mapper the coder client uses, so role mapping in Grafana +# can key off Keycloak group paths (e.g. /platform). +GROUPS_MAPPER = { + "name": "groups", + "protocol": "openid-connect", + "protocolMapper": "oidc-group-membership-mapper", + "config": { + "full.path": "true", + "id.token.claim": "true", + "access.token.claim": "true", + "userinfo.token.claim": "true", + "lightweight.claim": "false", + "claim.name": "groups", + }, +} + +TOKEN = None + + +def read_secrets(): + path = os.path.expanduser("~/.config/usgov-coderdemo/generated-secrets.env") + out = {} + with open(path) as f: + for line in f: + line = line.strip() + if "=" in line and not line.startswith("#"): + k, v = line.split("=", 1) + out[k] = v + return out + + +SECRETS = read_secrets() + + +def token(): + data = urllib.parse.urlencode({ + "grant_type": "password", + "client_id": "admin-cli", + "username": SECRETS["KEYCLOAK_ADMIN_USERNAME"], + "password": SECRETS["KEYCLOAK_ADMIN_PASSWORD"], + }).encode() + req = urllib.request.Request( + KC + "/realms/master/protocol/openid-connect/token", data=data, + headers={"Content-Type": "application/x-www-form-urlencoded"}) + return json.load(urllib.request.urlopen(req))["access_token"] + + +def kc(method, path, body=None): + headers = {"Authorization": "Bearer " + TOKEN} + data = None + if body is not None: + headers["Content-Type"] = "application/json" + data = json.dumps(body).encode() + req = urllib.request.Request(KC + "/admin/realms/" + REALM + path, + data=data, headers=headers, method=method) + try: + r = urllib.request.urlopen(req) + raw = r.read().decode() + return r.status, (json.loads(raw) if raw else None) + except urllib.error.HTTPError as e: + return e.code, e.read().decode() + + +def ensure_client(): + _, clients = kc("GET", "/clients?clientId=" + CLIENT_ID) + if clients: + cid = clients[0]["id"] + rep = dict(clients[0]) + rep.update(DESIRED_CLIENT) + # Merge attributes so we do not drop Keycloak-managed defaults. + attrs = dict(clients[0].get("attributes") or {}) + attrs.update(DESIRED_CLIENT["attributes"]) + rep["attributes"] = attrs + code, _ = kc("PUT", f"/clients/{cid}", rep) + print(f"client '{CLIENT_ID}': updated (HTTP {code})") + else: + code, _ = kc("POST", "/clients", DESIRED_CLIENT) + print(f"client '{CLIENT_ID}': CREATED (HTTP {code})") + _, clients = kc("GET", "/clients?clientId=" + CLIENT_ID) + cid = clients[0]["id"] + return cid + + +def ensure_mapper(cid): + _, mappers = kc("GET", f"/clients/{cid}/protocol-mappers/models") + existing = {m["name"]: m for m in (mappers or [])} + rep = dict(GROUPS_MAPPER) + if "groups" in existing: + rep["id"] = existing["groups"]["id"] + code, _ = kc("PUT", + f"/clients/{cid}/protocol-mappers/models/{rep['id']}", rep) + print(f"client mapper 'groups': updated (HTTP {code})") + else: + code, _ = kc("POST", f"/clients/{cid}/protocol-mappers/models", rep) + print(f"client mapper 'groups': CREATED (HTTP {code})") + + +def client_secret(cid): + _, body = kc("GET", f"/clients/{cid}/client-secret") + if isinstance(body, dict) and body.get("value"): + return body["value"] + # No secret yet (should not happen for a confidential client); generate one. + _, body = kc("POST", f"/clients/{cid}/client-secret") + return body["value"] + + +def asm_exists(name): + r = subprocess.run( + ["aws", "secretsmanager", "describe-secret", "--region", REGION, + "--secret-id", name], + stdout=subprocess.PIPE, stderr=subprocess.PIPE) + return r.returncode == 0 + + +def put_asm(name, payload): + fd, path = tempfile.mkstemp(prefix="asm-", suffix=".json") + try: + os.fchmod(fd, 0o600) + with os.fdopen(fd, "w") as f: + json.dump(payload, f) + ref = "file://" + path + if asm_exists(name): + subprocess.run( + ["aws", "secretsmanager", "put-secret-value", "--region", REGION, + "--secret-id", name, "--secret-string", ref], + check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + return "updated" + subprocess.run( + ["aws", "secretsmanager", "create-secret", "--region", REGION, + "--name", name, + "--description", "usgov-coderdemo Grafana OIDC client secret (ESO).", + "--secret-string", ref], + check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + return "created" + finally: + os.unlink(path) + + +def main(): + global TOKEN + TOKEN = token() + cid = ensure_client() + ensure_mapper(cid) + secret = client_secret(cid) + action = put_asm(ASM_NAME, {"client-secret": secret}) + print(f"ASM {ASM_NAME}: {action} (client-secret, {len(secret)} chars)") + print("\nNext: kubectl apply the grafana-oauth ExternalSecret, then helm " + "upgrade kps and roll Grafana.") + + +if __name__ == "__main__": + main() From 36cc6bf78811a46fe222c9ef69c0ee519ddb7e30 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 20:20:10 +0000 Subject: [PATCH 10/16] feat: GitLab Keycloak SSO + unified demo super admin Make GitLab sign in through the same Keycloak realm (coder) as Coder and Grafana, and give the demo a single SSO identity that is super admin across all three. Stays on GitLab Community Edition (no EE switch). GitLab SSO (deploy/gitlab/statefulset.yaml): - OmniAuth openid_connect provider in GITLAB_OMNIBUS_CONFIG (auth-code + PKCE, uid_field preferred_username, JIT sign-on). Auto-redirect is intentionally not set so the local root form remains as break-glass. - scripts/setup-gitlab-oidc.py registers the confidential realm client `gitlab` and stores its secret in AWS Secrets Manager (usgov-coderdemo/gitlab/oidc); ESO syncs it to the gitlab-oidc Secret, injected as GITLAB_OIDC_CLIENT_SECRET. CE role limitation, handled explicitly: - GitLab CE does not implement OIDC group-to-role assignment (admin_groups is an EE feature; this gitlab-ce image has no openid_connect group code path). The admin_groups line is left as a documented no-op (EE-forward-compatible). - scripts/setup-gitlab-users.py (idempotent, gitlab-rails) populates the eight personas, links each openid_connect identity (extern_uid = preferred_username), and sets GitLab instance admin only on pat.platform, mirroring the Coder org-admin mapping and preserving tenant isolation. Unified super admin: - scripts/grant-coder-owner.py grants the Coder site Owner role to pat.platform (site roles are not claim-driven and persist across logins). With the GitLab admin flag and the existing Grafana /platform -> Admin mapping, the single SSO identity pat.platform is super admin in Coder, GitLab, and Grafana. - Local break-glass admins remain per app; GitLab root was given a known password (stored in ASM usgov-coderdemo/gitlab/secrets root_password and the local secrets file), since the first-boot random root password was gone. Verified live: pat.platform SSO -> GitLab is_admin=true (/admin 200), Coder site roles [owner], Grafana org Admin; dana.dev -> regular/Viewer. Root login works with the reset password. Docs: docs/as-built/50-gitlab-scm.md gains a Keycloak SSO section and the CE limitation; STATUS.md gains a single sign-on + super admin summary. Generated by Coder Agents. --- STATUS.md | 18 ++ deploy/gitlab/statefulset.yaml | 60 +++++ .../secretstore-and-externalsecrets.yaml | 17 ++ docs/as-built/50-gitlab-scm.md | 46 +++- scripts/grant-coder-owner.py | 98 ++++++++ scripts/setup-gitlab-oidc.py | 214 ++++++++++++++++++ scripts/setup-gitlab-users.py | 115 ++++++++++ 7 files changed, 565 insertions(+), 3 deletions(-) create mode 100755 scripts/grant-coder-owner.py create mode 100755 scripts/setup-gitlab-oidc.py create mode 100755 scripts/setup-gitlab-users.py diff --git a/STATUS.md b/STATUS.md index efc4bf3..4ffc2f4 100644 --- a/STATUS.md +++ b/STATUS.md @@ -135,6 +135,24 @@ gated; Nova Pro is the proven fallback. `claude-code` template pushed into all three orgs. - See `docs/as-built/45-idp-sync-personas.md` for the full hierarchy + matrix. +## Single sign-on + demo super admin +- [x] **One SSO across the stack**: Coder, GitLab, and Grafana all authenticate + against the Keycloak realm `coder`. Grafana via generic OAuth + (`scripts/setup-grafana-oidc.py`); GitLab via OmniAuth `openid_connect` + (`scripts/setup-gitlab-oidc.py`, in `deploy/gitlab/statefulset.yaml`). +- [x] **GitLab CE caveat**: CE has no OIDC group-to-role mapping (an EE + feature), so GitLab persona users + the instance admin attribute are + provisioned explicitly by `scripts/setup-gitlab-users.py` + (`pat.platform` -> admin; others regular). +- [x] **Unified super admin**: the SSO identity `pat.platform` is super admin in + all three (Coder site Owner via `scripts/grant-coder-owner.py`, GitLab + Administrator, Grafana org Admin). Sign in with "Keycloak" on each. +- [x] **Local break-glass admins** remain per app (Coder owner, GitLab root, + Grafana admin). Credentials live in + `~/.config/usgov-coderdemo/generated-secrets.env` and AWS Secrets Manager + (`usgov-coderdemo/gitlab/secrets` `root_password`, + `usgov-coderdemo/observability/grafana`); none are committed to git. + ## Secrets management (ESO + AWS Secrets Manager) - [x] **AWS Secrets Manager is the source of truth** for the 9 runtime app secrets (`usgov-coderdemo/{coder,keycloak,gitlab}/*`). No secret material diff --git a/deploy/gitlab/statefulset.yaml b/deploy/gitlab/statefulset.yaml index 1ac467e..6fa0132 100644 --- a/deploy/gitlab/statefulset.yaml +++ b/deploy/gitlab/statefulset.yaml @@ -72,6 +72,15 @@ spec: # optional so the pod still starts on later restarts (after the # first boot the root password lives in the database, not here). optional: true + # Keycloak OIDC client secret, ESO-synced from AWS Secrets Manager + # (usgov-coderdemo/gitlab/oidc). Referenced as ENV[...] inside the + # omniauth block below so no secret is written into this manifest. + - name: GITLAB_OIDC_CLIENT_SECRET + valueFrom: + secretKeyRef: + name: gitlab-oidc + key: client-secret + optional: true - name: GITLAB_OMNIBUS_CONFIG value: |- external_url 'https://gitlab.usgov.coderdemo.io' @@ -103,6 +112,57 @@ spec: gitlab_rails['initial_root_password'] = ENV['GITLAB_INITIAL_ROOT_PASSWORD'] end + ## ---- Keycloak SSO (OpenID Connect) ---- + ## Single sign-on against the same realm (coder) as Coder and + ## Grafana. The confidential client `gitlab` is created by + ## scripts/setup-gitlab-oidc.py; its secret is injected via the + ## GITLAB_OIDC_CLIENT_SECRET env above. Auto sign-on creates the + ## GitLab user on first Keycloak login (JIT). We deliberately do + ## NOT set auto_sign_in_with_provider, so the local username and + ## password form stays available as break-glass for root. + ## + ## Role mapping: GitLab Community Edition does NOT implement + ## OIDC group-to-role assignment. admin_groups / required_groups + ## / external_groups are EE features; CE ships only the SAML and + ## LDAP equivalents, with no openid_connect code path. The + ## admin_groups line below is therefore a NO-OP on this CE image + ## and is kept only so it activates automatically if the image is + ## ever switched to GitLab EE (Free tier). On CE, the instance + ## admin attribute is provisioned explicitly by + ## scripts/setup-gitlab-users.py (Platform lead pat.platform). + ## Per-group membership/roles is not possible on either edition + ## without SAML Group Sync (a Premium feature). + if ENV['GITLAB_OIDC_CLIENT_SECRET'] && !ENV['GITLAB_OIDC_CLIENT_SECRET'].empty? + gitlab_rails['omniauth_enabled'] = true + gitlab_rails['omniauth_allow_single_sign_on'] = ['openid_connect'] + gitlab_rails['omniauth_block_auto_created_users'] = false + gitlab_rails['omniauth_providers'] = [ + { + name: 'openid_connect', + label: 'Keycloak', + args: { + name: 'openid_connect', + scope: ['openid', 'profile', 'email'], + response_type: 'code', + issuer: 'https://auth.usgov.coderdemo.io/realms/coder', + discovery: true, + client_auth_method: 'basic', + uid_field: 'preferred_username', + pkce: true, + client_options: { + identifier: 'gitlab', + secret: ENV['GITLAB_OIDC_CLIENT_SECRET'], + redirect_uri: 'https://gitlab.usgov.coderdemo.io/users/auth/openid_connect/callback', + gitlab: { + groups_attribute: 'groups', + admin_groups: ['/platform/platform-admins', '/platform/org-admins'] + } + } + } + } + ] + end + ## Embedded PostgreSQL (bundled) is the default; nothing to set. ## To use the shared RDS gitlabhq_production instead, see README.md. diff --git a/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml b/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml index bedc731..f483936 100644 --- a/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml +++ b/deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml @@ -211,3 +211,20 @@ spec: dataFrom: - extract: key: usgov-coderdemo/observability/grafana-oauth +--- +apiVersion: external-secrets.io/v1 +kind: ExternalSecret +metadata: + name: gitlab-oidc + namespace: gitlab +spec: + refreshInterval: 1h + secretStoreRef: + name: aws-secretsmanager + kind: ClusterSecretStore + target: + name: gitlab-oidc + creationPolicy: Owner + dataFrom: + - extract: + key: usgov-coderdemo/gitlab/oidc diff --git a/docs/as-built/50-gitlab-scm.md b/docs/as-built/50-gitlab-scm.md index 9e8d9f1..5dc3bfe 100644 --- a/docs/as-built/50-gitlab-scm.md +++ b/docs/as-built/50-gitlab-scm.md @@ -155,11 +155,51 @@ clone/fetch/push to `gitlab.usgov.coderdemo.io`, so no PATs or SSH keys live in the workspace. `STATUS.md` records this as verified: the active template version's `/external-auth` lists `gitlab` as required. +## Keycloak SSO (OpenID Connect) + +GitLab signs in through the same Keycloak realm (`coder`) as Coder and Grafana, +so the demo is one SSO. The OmniAuth `openid_connect` provider is configured in +`GITLAB_OMNIBUS_CONFIG` (`deploy/gitlab/statefulset.yaml`): + +- Confidential realm client `gitlab` (PKCE S256, redirect + `https://gitlab.usgov.coderdemo.io/users/auth/openid_connect/callback`), + created by `scripts/setup-gitlab-oidc.py`. The client secret lives in AWS + Secrets Manager (`usgov-coderdemo/gitlab/oidc`), is synced by ESO into the + `gitlab-oidc` Secret, and is injected as `GITLAB_OIDC_CLIENT_SECRET` + (referenced as `ENV[...]` in the omnibus config), so no secret is in git. +- Auto sign-on (JIT) creates the GitLab user on first Keycloak login; + `uid_field` is `preferred_username`. Auto-redirect is deliberately NOT set, so + the local username/password form stays available as break-glass for root. + +Verified live: the sign-in page shows a "Keycloak" button; the OmniAuth request +phase redirects to the realm with `client_id=gitlab` and PKCE; a headless +authorization-code login provisions the persona and returns to the dashboard. + +### Roles: CE limitation and explicit user provisioning + +GitLab **Community Edition does not implement OIDC group-to-role assignment**. +`admin_groups` / `required_groups` / `external_groups` are EE features; this CE +image (`gitlab-ce` 19.0.1, no `ee/` directory) ships only the SAML and LDAP +equivalents with no `openid_connect` code path, so the `admin_groups` line in the +omnibus config is a no-op (kept only so it activates if the image is ever +switched to GitLab EE). Per-group membership/roles is impossible on either +edition without SAML Group Sync (a Premium feature). + +Because of that, the persona users and the instance admin attribute are +provisioned explicitly by `scripts/setup-gitlab-users.py` (idempotent, +`gitlab-rails`): it creates the eight personas from +`scripts/setup-keycloak-hierarchy.py`, links each to its `openid_connect` +identity (`extern_uid = preferred_username`) so SSO lands on the right account, +and sets GitLab instance admin only on `pat.platform` (the Platform lead), +mirroring the Coder org-admin mapping while keeping tenant isolation. Verified +live: `pat.platform` SSO login is `is_admin=true` (`/admin` returns 200); +`dana.dev` is a regular user (`/admin` returns 404). + ## Notes and out of scope -- GitLab to Keycloak SSO (OIDC) is optional and NOT enabled. `deploy/gitlab/README.md` - includes an `openid_connect` omniauth sketch, but the as-built login is root - plus local GitLab users. +- GitLab to Keycloak SSO (OIDC) is now ENABLED (see "Keycloak SSO" above). + GitLab CE has no OIDC group-to-role mapping, so the instance admin attribute + is provisioned by `scripts/setup-gitlab-users.py`, not by group claims. - Git over SSH is not wired (NLB terminates 443 only). HTTPS clone/push is the supported path. - Backups: with embedded Postgres there is no managed backup; durability relies diff --git a/scripts/grant-coder-owner.py b/scripts/grant-coder-owner.py new file mode 100755 index 0000000..6755d13 --- /dev/null +++ b/scripts/grant-coder-owner.py @@ -0,0 +1,98 @@ +#!/usr/bin/env python3 +""" +grant-coder-owner.py - grant the Coder site-wide Owner role to a demo persona so +one Keycloak SSO identity (default pat.platform, the Platform lead) is super +admin across Coder, GitLab, and Grafana. + +Coder organization/role IdP sync only manages org-scoped roles; the site-wide +Owner role is not claim-driven, so it is assigned explicitly here. Site roles are +not overwritten by the per-org IdP sync, so this persists across logins. + +Idempotent: re-running is a no-op if the user already has Owner. Targets the demo +Coder explicitly (NOT the ambient $CODER_URL). Admin creds come from +~/.config/usgov-coderdemo/generated-secrets.env. + +Usage: + python3 scripts/grant-coder-owner.py [username] # default: pat.platform +""" +import json +import os +import sys +import urllib.error +import urllib.request + +BASE = os.environ.get("DEMO_CODER_URL", "https://dev.usgov.coderdemo.io").rstrip("/") +USERNAME = sys.argv[1] if len(sys.argv) > 1 else "pat.platform" +EMAIL_DOMAIN = "usgov.coderdemo.io" + + +def creds(): + out = {} + path = os.path.expanduser("~/.config/usgov-coderdemo/generated-secrets.env") + with open(path) as f: + for line in f: + line = line.strip() + if "=" in line and not line.startswith("#"): + k, v = line.split("=", 1) + out[k] = v + return out + + +C = creds() + + +def login(): + body = json.dumps({"email": C["CODER_ADMIN_EMAIL"], + "password": C["CODER_ADMIN_PASSWORD"]}).encode() + req = urllib.request.Request(BASE + "/api/v2/users/login", data=body, + headers={"Content-Type": "application/json"}) + return json.load(urllib.request.urlopen(req))["session_token"] + + +TOKEN = None + + +def api(method, path, body=None): + headers = {"Coder-Session-Token": TOKEN, "Content-Type": "application/json"} + data = json.dumps(body).encode() if body is not None else None + req = urllib.request.Request(BASE + path, data=data, headers=headers, method=method) + try: + r = urllib.request.urlopen(req) + raw = r.read().decode() + return r.status, (json.loads(raw) if raw else None) + except urllib.error.HTTPError as e: + return e.code, e.read().decode() + + +def main(): + global TOKEN + TOKEN = login() + # Coder sanitizes usernames (e.g. drops dots), so resolve by email. + email = f"{USERNAME}@{EMAIL_DOMAIN}" + code, res = api("GET", "/api/v2/users?q=" + email) + user = None + if code == 200: + for u in (res.get("users") or []): + if u.get("email", "").lower() == email.lower(): + user = u + break + if user is None: + print(f"user {email}: not found (must SSO-login to Coder once first)", + file=sys.stderr) + sys.exit(1) + roles = sorted({r["name"] if isinstance(r, dict) else r + for r in (user.get("roles") or [])}) + if "owner" in roles: + print(f"{user['username']} ({email}): already site Owner ({user['id']})") + return + code, res = api("PUT", f"/api/v2/users/{user['id']}/roles", {"roles": ["owner"]}) + if code != 200: + print(f"{email}: grant failed ({code}) {res}", file=sys.stderr) + sys.exit(1) + new = sorted({r["name"] if isinstance(r, dict) else r + for r in (res.get("roles") or [])}) + print(f"{user['username']} ({email}): site roles -> {new}") + + +if __name__ == "__main__": + main() diff --git a/scripts/setup-gitlab-oidc.py b/scripts/setup-gitlab-oidc.py new file mode 100755 index 0000000..b27bc3a --- /dev/null +++ b/scripts/setup-gitlab-oidc.py @@ -0,0 +1,214 @@ +#!/usr/bin/env python3 +""" +setup-gitlab-oidc.py - register the GitLab OIDC client in the Keycloak realm +`coder` so GitLab logs in with the same SSO as Coder and Grafana, and publish the +client secret to AWS Secrets Manager for ESO to sync. + +Idempotent: re-running ensures the desired client + full-path `groups` mapper and +upserts the secret. It does NOT rotate the Keycloak client secret on each run; it +reads the current secret and writes that value to ASM. + +What it does: + 1. Create/update a confidential OIDC client `gitlab` (standard flow, PKCE + S256, redirect URI + https://gitlab.usgov.coderdemo.io/users/auth/openid_connect/callback). + 2. Add the same full-path `groups` group-membership mapper the coder and + grafana clients use, so GitLab can map group membership to the instance + admin attribute (admin_groups). + 3. Read the client secret and upsert it into AWS Secrets Manager at + usgov-coderdemo/gitlab/oidc as {"client-secret": "..."}. + +Reads admin credentials from ~/.config/usgov-coderdemo/generated-secrets.env: + KEYCLOAK_ADMIN_USERNAME, KEYCLOAK_ADMIN_PASSWORD + +Pairs with deploy/gitlab/statefulset.yaml (the openid_connect omniauth block in +GITLAB_OMNIBUS_CONFIG) and the gitlab-oidc ExternalSecret in +deploy/platform/external-secrets/secretstore-and-externalsecrets.yaml. +""" +import json +import os +import subprocess +import tempfile +import urllib.error +import urllib.parse +import urllib.request + +KC = os.environ.get("KEYCLOAK_URL", "https://auth.usgov.coderdemo.io").rstrip("/") +REALM = "coder" +CLIENT_ID = "gitlab" +GITLAB_URL = "https://gitlab.usgov.coderdemo.io" +REGION = "us-gov-west-1" +ASM_NAME = "usgov-coderdemo/gitlab/oidc" + +DESIRED_CLIENT = { + "clientId": CLIENT_ID, + "name": "GitLab", + "description": "GitLab SCM SSO via Keycloak realm coder.", + "enabled": True, + "protocol": "openid-connect", + "publicClient": False, + "standardFlowEnabled": True, + "implicitFlowEnabled": False, + "directAccessGrantsEnabled": False, + "serviceAccountsEnabled": False, + "clientAuthenticatorType": "client-secret", + "rootUrl": GITLAB_URL, + "baseUrl": "/", + "redirectUris": [GITLAB_URL + "/users/auth/openid_connect/callback"], + "webOrigins": [GITLAB_URL], + "attributes": { + "pkce.code.challenge.method": "S256", + "post.logout.redirect.uris": GITLAB_URL + "/*", + }, +} + +# Same full-path groups mapper the coder/grafana clients use, so GitLab's +# admin_groups can key off Keycloak group paths (e.g. /platform/platform-admins). +GROUPS_MAPPER = { + "name": "groups", + "protocol": "openid-connect", + "protocolMapper": "oidc-group-membership-mapper", + "config": { + "full.path": "true", + "id.token.claim": "true", + "access.token.claim": "true", + "userinfo.token.claim": "true", + "lightweight.claim": "false", + "claim.name": "groups", + }, +} + +TOKEN = None + + +def read_secrets(): + path = os.path.expanduser("~/.config/usgov-coderdemo/generated-secrets.env") + out = {} + with open(path) as f: + for line in f: + line = line.strip() + if "=" in line and not line.startswith("#"): + k, v = line.split("=", 1) + out[k] = v + return out + + +SECRETS = read_secrets() + + +def token(): + data = urllib.parse.urlencode({ + "grant_type": "password", + "client_id": "admin-cli", + "username": SECRETS["KEYCLOAK_ADMIN_USERNAME"], + "password": SECRETS["KEYCLOAK_ADMIN_PASSWORD"], + }).encode() + req = urllib.request.Request( + KC + "/realms/master/protocol/openid-connect/token", data=data, + headers={"Content-Type": "application/x-www-form-urlencoded"}) + return json.load(urllib.request.urlopen(req))["access_token"] + + +def kc(method, path, body=None): + headers = {"Authorization": "Bearer " + TOKEN} + data = None + if body is not None: + headers["Content-Type"] = "application/json" + data = json.dumps(body).encode() + req = urllib.request.Request(KC + "/admin/realms/" + REALM + path, + data=data, headers=headers, method=method) + try: + r = urllib.request.urlopen(req) + raw = r.read().decode() + return r.status, (json.loads(raw) if raw else None) + except urllib.error.HTTPError as e: + return e.code, e.read().decode() + + +def ensure_client(): + _, clients = kc("GET", "/clients?clientId=" + CLIENT_ID) + if clients: + cid = clients[0]["id"] + rep = dict(clients[0]) + rep.update(DESIRED_CLIENT) + attrs = dict(clients[0].get("attributes") or {}) + attrs.update(DESIRED_CLIENT["attributes"]) + rep["attributes"] = attrs + code, _ = kc("PUT", f"/clients/{cid}", rep) + print(f"client '{CLIENT_ID}': updated (HTTP {code})") + else: + code, _ = kc("POST", "/clients", DESIRED_CLIENT) + print(f"client '{CLIENT_ID}': CREATED (HTTP {code})") + _, clients = kc("GET", "/clients?clientId=" + CLIENT_ID) + cid = clients[0]["id"] + return cid + + +def ensure_mapper(cid): + _, mappers = kc("GET", f"/clients/{cid}/protocol-mappers/models") + existing = {m["name"]: m for m in (mappers or [])} + rep = dict(GROUPS_MAPPER) + if "groups" in existing: + rep["id"] = existing["groups"]["id"] + code, _ = kc("PUT", + f"/clients/{cid}/protocol-mappers/models/{rep['id']}", rep) + print(f"client mapper 'groups': updated (HTTP {code})") + else: + code, _ = kc("POST", f"/clients/{cid}/protocol-mappers/models", rep) + print(f"client mapper 'groups': CREATED (HTTP {code})") + + +def client_secret(cid): + _, body = kc("GET", f"/clients/{cid}/client-secret") + if isinstance(body, dict) and body.get("value"): + return body["value"] + _, body = kc("POST", f"/clients/{cid}/client-secret") + return body["value"] + + +def asm_exists(name): + r = subprocess.run( + ["aws", "secretsmanager", "describe-secret", "--region", REGION, + "--secret-id", name], + stdout=subprocess.PIPE, stderr=subprocess.PIPE) + return r.returncode == 0 + + +def put_asm(name, payload): + fd, path = tempfile.mkstemp(prefix="asm-", suffix=".json") + try: + os.fchmod(fd, 0o600) + with os.fdopen(fd, "w") as f: + json.dump(payload, f) + ref = "file://" + path + if asm_exists(name): + subprocess.run( + ["aws", "secretsmanager", "put-secret-value", "--region", REGION, + "--secret-id", name, "--secret-string", ref], + check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + return "updated" + subprocess.run( + ["aws", "secretsmanager", "create-secret", "--region", REGION, + "--name", name, + "--description", "usgov-coderdemo GitLab OIDC client secret (ESO).", + "--secret-string", ref], + check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE) + return "created" + finally: + os.unlink(path) + + +def main(): + global TOKEN + TOKEN = token() + cid = ensure_client() + ensure_mapper(cid) + secret = client_secret(cid) + action = put_asm(ASM_NAME, {"client-secret": secret}) + print(f"ASM {ASM_NAME}: {action} (client-secret, {len(secret)} chars)") + print("\nNext: kubectl apply the gitlab-oidc ExternalSecret, then apply the " + "statefulset (GitLab reconfigure + restart takes a few minutes).") + + +if __name__ == "__main__": + main() diff --git a/scripts/setup-gitlab-users.py b/scripts/setup-gitlab-users.py new file mode 100755 index 0000000..e3faec0 --- /dev/null +++ b/scripts/setup-gitlab-users.py @@ -0,0 +1,115 @@ +#!/usr/bin/env python3 +""" +setup-gitlab-users.py - populate the in-boundary GitLab with the demo persona +users and set the GitLab instance admin attribute, since GitLab Community +Edition does not implement OIDC group-to-role assignment (admin_groups is an EE +feature and is a no-op on the CE image, see deploy/gitlab/statefulset.yaml). + +Idempotent: re-running finds existing users (including any JIT-created by an SSO +login) and reconciles name, admin flag, active state, and the openid_connect +identity (extern_uid = username) so a Keycloak SSO login lands on the right +account. Mirrors the personas in scripts/setup-keycloak-hierarchy.py. + +Mapping applied (mirrors the Coder org-admin role; only Platform Engineering +gets GitLab instance admin, to preserve tenant isolation): + pat.platform -> instance admin (Platform lead) + all other personas -> regular users + +Runs gitlab-rails inside the gitlab-0 pod (ROPC/password grant is disabled, so a +REST token is not available without a bootstrap). The demo password is read from +~/.config/usgov-coderdemo/generated-secrets.env (DEMO_USER_PASSWORD) and passed +to the pod over stdin, never on the command line. + +Usage (from the repo root, with the demo kubeconfig): + . ~/.config/usgov-coderdemo/env && export KUBECONFIG=./kubeconfig + python3 scripts/setup-gitlab-users.py +""" +import os +import subprocess +import sys + +NAMESPACE = "gitlab" +POD = "gitlab-0" + +RUBY = r''' +admin = User.find_by(username: "root") +org = Organizations::Organization.default_organization +pw = ENV["DEMO_USER_PASSWORD"].to_s +abort("DEMO_USER_PASSWORD not provided") if pw.empty? + +personas = [ + ["pat.platform", "Pat Rivera", true], + ["sky.sre", "Sky Nguyen", false], + ["alex.admin", "Alex Carter", false], + ["dana.dev", "Dana Brooks", false], + ["quinn.data", "Quinn Lee", false], + ["morgan.isso", "Morgan Diaz", false], + ["riley.admin", "Riley Fox", false], + ["jordan.dev", "Jordan Kim", false], +] + +personas.each do |uname, fullname, is_admin| + email = "#{uname}@usgov.coderdemo.io" + u = User.find_by(username: uname) + if u.nil? + res = Users::CreateService.new( + admin, + username: uname, email: email, name: fullname, + password: pw, password_confirmation: pw, + skip_confirmation: true, organization_id: org.id + ).execute + u = res.is_a?(User) ? res : User.find_by(username: uname) + unless u&.persisted? + puts "#{uname}: CREATE FAILED: #{u&.errors&.full_messages&.join('; ')}" + next + end + end + u.name = fullname + u.admin = is_admin + u.state = "active" + u.save!(validate: false) + unless u.identities.exists?(provider: "openid_connect") + u.identities.create!(provider: "openid_connect", extern_uid: uname) + end + puts "#{uname}: id=#{u.id} admin=#{u.admin} oidc=#{u.identities.where(provider: "openid_connect").first&.extern_uid}" +end +''' + + +def read_demo_password(): + path = os.path.expanduser("~/.config/usgov-coderdemo/generated-secrets.env") + with open(path) as f: + for line in f: + line = line.strip() + if line.startswith("DEMO_USER_PASSWORD="): + return line.split("=", 1)[1] + print("DEMO_USER_PASSWORD not found in generated-secrets.env", file=sys.stderr) + sys.exit(1) + + +def kubectl_exec(stdin_data, shell_cmd): + return subprocess.run( + ["kubectl", "-n", NAMESPACE, "exec", "-i", POD, "--", "sh", "-c", shell_cmd], + input=stdin_data, text=True, capture_output=True) + + +def main(): + pw = read_demo_password() + # 1. Stage the Ruby script in the pod (contains no secret). + r = kubectl_exec(RUBY, "cat > /tmp/setup-gitlab-users.rb") + if r.returncode != 0: + print(r.stderr, file=sys.stderr) + sys.exit(1) + # 2. Run it with the password supplied over stdin -> env (not argv). + r = kubectl_exec( + pw, + 'read -r PW; DEMO_USER_PASSWORD="$PW" gitlab-rails runner ' + '/tmp/setup-gitlab-users.rb; rc=$?; rm -f /tmp/setup-gitlab-users.rb; exit $rc') + sys.stdout.write(r.stdout) + if r.returncode != 0: + sys.stderr.write(r.stderr) + sys.exit(r.returncode) + + +if __name__ == "__main__": + main() From 549a1b6f853b1190b3ca2d58232b53a31b940961 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 20:32:40 +0000 Subject: [PATCH 11/16] feat: make the demo super admin (pat.platform) a member of all Coder orgs The unified super admin signs in via Keycloak but only saw one Coder org, because org membership is IdP-synced from the `groups` claim and pat.platform was only in /platform (-> the coder org). Add pat.platform to the /alpha and /bravo Keycloak groups (and their org-admin role subgroups) in scripts/setup-keycloak-hierarchy.py, so org sync makes Pat a member and organization-admin of all three orgs on login. Combined with the Coder site Owner role and GitLab/Grafana admin, one Keycloak login is now admin across the whole stack and the Coder org switcher shows Platform, Alpha, and Bravo. Verified live with scripts/verify-oidc-login.py (a real OIDC login, which runs the sync): pat.platform -> coder/alpha/bravo all organization-admin, site roles [owner]. Tenant isolation is unchanged for the mission-partner personas. Docs: STATUS.md and docs/as-built/45-idp-sync-personas.md updated to reflect pat.platform as the all-orgs super admin (deliberate exception to isolation). Generated by Coder Agents. --- STATUS.md | 7 +++++-- docs/as-built/45-idp-sync-personas.md | 14 ++++++++++---- scripts/setup-keycloak-hierarchy.py | 10 +++++++++- 3 files changed, 24 insertions(+), 7 deletions(-) diff --git a/STATUS.md b/STATUS.md index 4ffc2f4..ba13bf6 100644 --- a/STATUS.md +++ b/STATUS.md @@ -145,8 +145,11 @@ gated; Nova Pro is the proven fallback. provisioned explicitly by `scripts/setup-gitlab-users.py` (`pat.platform` -> admin; others regular). - [x] **Unified super admin**: the SSO identity `pat.platform` is super admin in - all three (Coder site Owner via `scripts/grant-coder-owner.py`, GitLab - Administrator, Grafana org Admin). Sign in with "Keycloak" on each. + all three (Coder site Owner via `scripts/grant-coder-owner.py` plus + org-admin in every org, GitLab Administrator, Grafana org Admin). Pat is a + member of all three Coder orgs (added to the `/alpha` and `/bravo` Keycloak + groups in `scripts/setup-keycloak-hierarchy.py`), so the org switcher shows + Platform, Alpha, and Bravo. Sign in with "Keycloak" on each app. - [x] **Local break-glass admins** remain per app (Coder owner, GitLab root, Grafana admin). Credentials live in `~/.config/usgov-coderdemo/generated-secrets.env` and AWS Secrets Manager diff --git a/docs/as-built/45-idp-sync-personas.md b/docs/as-built/45-idp-sync-personas.md index e41c93b..94dc3ef 100644 --- a/docs/as-built/45-idp-sync-personas.md +++ b/docs/as-built/45-idp-sync-personas.md @@ -83,7 +83,7 @@ Email is `@usgov.coderdemo.io`. | Username | Name | Org | Org role | Groups | |---|---|---|---|---| -| pat.platform | Pat Rivera | Platform Engineering | organization-admin | platform-admins | +| pat.platform | Pat Rivera | All (Platform + Alpha + Bravo) | organization-admin (all) + site Owner | platform-admins | | sky.sre | Sky Nguyen | Platform Engineering | organization-template-admin | sre | | alex.admin | Alex Carter | Mission Partner Alpha | organization-admin | (none) | | dana.dev | Dana Brooks | Mission Partner Alpha | member | developers | @@ -99,6 +99,8 @@ login). Confirmed output: ``` pat.platform -> coder organization-admin groups=[platform-admins] + -> alpha organization-admin groups=[] + -> bravo organization-admin groups=[] sky.sre -> coder organization-template-admin groups=[sre] alex.admin -> alpha organization-admin groups=[] dana.dev -> alpha member groups=[developers] @@ -109,8 +111,11 @@ riley.admin -> bravo organization-admin groups=[] jordan.dev -> bravo member groups=[developers] ``` -Tenant isolation holds: Alpha users see only Alpha, Bravo users see only Bravo, -Platform users see only Platform. The ISSO/auditor spans both tenants read-only. +Tenant isolation holds for the mission-partner personas: Alpha users see only +Alpha, Bravo users see only Bravo, and the ISSO/auditor spans both tenants +read-only. `pat.platform` is the deliberate exception: it is the demo super admin +(site Owner + org-admin in all three orgs + GitLab Administrator + Grafana +Admin), so a single Keycloak login administers the whole stack. ## Provisioners and templates per tenant org @@ -127,7 +132,8 @@ external auth first (every template declares `data coder_external_auth ## Demo flow -1. Log in as `pat.platform`: lands in Platform Engineering as org admin. +1. Log in as `pat.platform`: the demo super admin. Lands in all three orgs as + org admin (and is site Owner); switch orgs from the org picker. 2. Log in (incognito) as `dana.dev`: lands only in Mission Partner Alpha, group developers, no admin. Cannot see Bravo or Platform. 3. Log in as `riley.admin`: Bravo org admin; manage Bravo members/templates. diff --git a/scripts/setup-keycloak-hierarchy.py b/scripts/setup-keycloak-hierarchy.py index 78d31c4..f0b390e 100755 --- a/scripts/setup-keycloak-hierarchy.py +++ b/scripts/setup-keycloak-hierarchy.py @@ -36,7 +36,15 @@ USERS = { "pat.platform": { "first": "Pat", "last": "Rivera", - "groups": ["/platform", "/platform/platform-admins", "/platform/org-admins"], + # Platform lead and demo super admin: a member of ALL tenant orgs (and + # org-admin in each) so a single Keycloak login is admin across every + # Coder org. Combined with the site Owner role and GitLab/Grafana admin, + # pat.platform is super admin across the whole stack. + "groups": [ + "/platform", "/platform/platform-admins", "/platform/org-admins", + "/alpha", "/alpha/org-admins", + "/bravo", "/bravo/org-admins", + ], }, "sky.sre": { "first": "Sky", "last": "Nguyen", From 4c0658776da16f9099e914f2d926376424fb75d0 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 20:51:27 +0000 Subject: [PATCH 12/16] feat: dedicated operator super admin austen.platform Add a dedicated operator account austen.platform (its own SUPERADMIN_PASSWORD) that is super admin across the stack through a single Keycloak login: Coder site Owner plus org-admin in all three orgs, GitLab instance admin, and Grafana org Admin (via the /platform group rule). Revert pat.platform to a normal Platform lead persona: Platform org-admin only, no Coder site Owner, not a GitLab admin. - setup-keycloak-hierarchy.py: add austen.platform in the platform/alpha/bravo org and org-admins groups with a per-user password_env (SUPERADMIN_PASSWORD); trim pat.platform back to the /platform groups. - setup-gitlab-users.py: provision austen.platform as instance admin, mark every demo persona (including pat.platform) a regular user, and support per-persona password env over stdin. - grant-coder-owner.py: default target is now austen.platform. - Docs (STATUS.md, 45-idp-sync-personas.md, 50-gitlab-scm.md, 55-observability.md): describe the operator super admin and the pat.platform revert. Verified live via headless SSO: austen is owner and org-admin in all orgs, a GitLab admin, and a Grafana Admin; pat is org-admin in coder only, no Coder site role, and a GitLab non-admin. Generated by Coder Agents. --- STATUS.md | 22 ++++++---- docs/as-built/45-idp-sync-personas.md | 38 +++++++++++++---- docs/as-built/50-gitlab-scm.md | 16 ++++---- docs/as-built/55-observability.md | 2 +- scripts/grant-coder-owner.py | 10 ++--- scripts/setup-gitlab-users.py | 59 ++++++++++++++++----------- scripts/setup-keycloak-hierarchy.py | 22 ++++++---- 7 files changed, 108 insertions(+), 61 deletions(-) diff --git a/STATUS.md b/STATUS.md index ba13bf6..e91a5ad 100644 --- a/STATUS.md +++ b/STATUS.md @@ -135,7 +135,7 @@ gated; Nova Pro is the proven fallback. `claude-code` template pushed into all three orgs. - See `docs/as-built/45-idp-sync-personas.md` for the full hierarchy + matrix. -## Single sign-on + demo super admin +## Single sign-on + operator super admin - [x] **One SSO across the stack**: Coder, GitLab, and Grafana all authenticate against the Keycloak realm `coder`. Grafana via generic OAuth (`scripts/setup-grafana-oidc.py`); GitLab via OmniAuth `openid_connect` @@ -143,13 +143,17 @@ gated; Nova Pro is the proven fallback. - [x] **GitLab CE caveat**: CE has no OIDC group-to-role mapping (an EE feature), so GitLab persona users + the instance admin attribute are provisioned explicitly by `scripts/setup-gitlab-users.py` - (`pat.platform` -> admin; others regular). -- [x] **Unified super admin**: the SSO identity `pat.platform` is super admin in - all three (Coder site Owner via `scripts/grant-coder-owner.py` plus - org-admin in every org, GitLab Administrator, Grafana org Admin). Pat is a - member of all three Coder orgs (added to the `/alpha` and `/bravo` Keycloak - groups in `scripts/setup-keycloak-hierarchy.py`), so the org switcher shows - Platform, Alpha, and Bravo. Sign in with "Keycloak" on each app. + (`austen.platform` -> admin; all demo personas regular). +- [x] **Unified super admin**: a dedicated operator SSO identity + `austen.platform` (its own `SUPERADMIN_PASSWORD`, not a demo persona) is + super admin in all three (Coder site Owner via + `scripts/grant-coder-owner.py` plus org-admin in every org, GitLab + Administrator, Grafana org Admin). It is a member of all three Coder orgs + (the `/platform`, `/alpha`, and `/bravo` Keycloak groups in + `scripts/setup-keycloak-hierarchy.py`), so the org switcher shows Platform, + Alpha, and Bravo. The `pat.platform` persona is a normal Platform lead + (Platform org-admin only, not a site Owner and not a GitLab admin). Sign in + with "Keycloak" on each app. - [x] **Local break-glass admins** remain per app (Coder owner, GitLab root, Grafana admin). Credentials live in `~/.config/usgov-coderdemo/generated-secrets.env` and AWS Secrets Manager @@ -192,7 +196,7 @@ gated; Nova Pro is the proven fallback. `usgov-coderdemo/observability/grafana-oauth`, ESO-synced). Group membership maps to org role: `/platform` -> Grafana `Admin`, others -> `Viewer`; local admin kept as break-glass. Verified per persona - (`pat.platform` Admin, `dana.dev` Viewer). + (`austen.platform` Admin, `dana.dev` Viewer). - [x] **Structured JSON server logs** (`CODER_LOGGING_JSON=/dev/stderr`, `CODER_LOGGING_HUMAN=/dev/null`) make coderd SIEM-ready; audit logging is entitled + on (`/audit`). diff --git a/docs/as-built/45-idp-sync-personas.md b/docs/as-built/45-idp-sync-personas.md index 94dc3ef..26fa480 100644 --- a/docs/as-built/45-idp-sync-personas.md +++ b/docs/as-built/45-idp-sync-personas.md @@ -83,7 +83,7 @@ Email is `@usgov.coderdemo.io`. | Username | Name | Org | Org role | Groups | |---|---|---|---|---| -| pat.platform | Pat Rivera | All (Platform + Alpha + Bravo) | organization-admin (all) + site Owner | platform-admins | +| pat.platform | Pat Rivera | Platform Engineering | organization-admin | platform-admins | | sky.sre | Sky Nguyen | Platform Engineering | organization-template-admin | sre | | alex.admin | Alex Carter | Mission Partner Alpha | organization-admin | (none) | | dana.dev | Dana Brooks | Mission Partner Alpha | member | developers | @@ -92,15 +92,27 @@ Email is `@usgov.coderdemo.io`. | riley.admin | Riley Fox | Mission Partner Bravo | organization-admin | (none) | | jordan.dev | Jordan Kim | Mission Partner Bravo | member | developers | +## Operator super admin (not a demo persona) + +`austen.platform` (Austen Platform) is the dedicated operator account, separate +from the eight demo personas and with its own password in `SUPERADMIN_PASSWORD`. +It belongs to the `/platform`, `/alpha`, and `/bravo` Keycloak groups (org-admin +in each) and is additionally granted the Coder **site Owner** role +(`scripts/grant-coder-owner.py`), **GitLab instance admin** +(`scripts/setup-gitlab-users.py`), and **Grafana org Admin** (via the `/platform` +group rule). One Keycloak login therefore administers the entire stack: every +Coder org, GitLab, and Grafana. + ## Verified login matrix Run `scripts/verify-oidc-login.py` (fresh cookie jar per user, real Keycloak login). Confirmed output: ``` +austen.platform -> coder organization-admin groups=[platform-admins] site_roles=[owner] + -> alpha organization-admin groups=[] + -> bravo organization-admin groups=[] pat.platform -> coder organization-admin groups=[platform-admins] - -> alpha organization-admin groups=[] - -> bravo organization-admin groups=[] sky.sre -> coder organization-template-admin groups=[sre] alex.admin -> alpha organization-admin groups=[] dana.dev -> alpha member groups=[developers] @@ -113,9 +125,10 @@ jordan.dev -> bravo member groups=[developers] Tenant isolation holds for the mission-partner personas: Alpha users see only Alpha, Bravo users see only Bravo, and the ISSO/auditor spans both tenants -read-only. `pat.platform` is the deliberate exception: it is the demo super admin -(site Owner + org-admin in all three orgs + GitLab Administrator + Grafana -Admin), so a single Keycloak login administers the whole stack. +read-only. The operator account `austen.platform` is the deliberate exception: +it is super admin (site Owner + org-admin in all three orgs + GitLab +Administrator + Grafana Admin), so a single Keycloak login administers the whole +stack. `pat.platform` is a normal Platform lead (Platform org-admin only). ## Provisioners and templates per tenant org @@ -132,8 +145,8 @@ external auth first (every template declares `data coder_external_auth ## Demo flow -1. Log in as `pat.platform`: the demo super admin. Lands in all three orgs as - org admin (and is site Owner); switch orgs from the org picker. +1. Log in as `austen.platform`: the operator super admin. Lands in all three + orgs as org admin (and is site Owner); switch orgs from the org picker. 2. Log in (incognito) as `dana.dev`: lands only in Mission Partner Alpha, group developers, no admin. Cannot see Bravo or Platform. 3. Log in as `riley.admin`: Bravo org admin; manage Bravo members/templates. @@ -155,3 +168,12 @@ python3 scripts/verify-oidc-login.py pat.platform dana.dev morgan.isso riley.adm ``` Both setup scripts are idempotent. + +The operator super admin `austen.platform` also needs its cross-app admin grants +(idempotent; credentials are read from `generated-secrets.env`): + +``` +python3 scripts/grant-coder-owner.py austen.platform # Coder site Owner +python3 scripts/setup-gitlab-users.py # GitLab instance admin +# Grafana org Admin is automatic via the /platform group rule +``` diff --git a/docs/as-built/50-gitlab-scm.md b/docs/as-built/50-gitlab-scm.md index 5dc3bfe..5565458 100644 --- a/docs/as-built/50-gitlab-scm.md +++ b/docs/as-built/50-gitlab-scm.md @@ -187,13 +187,15 @@ edition without SAML Group Sync (a Premium feature). Because of that, the persona users and the instance admin attribute are provisioned explicitly by `scripts/setup-gitlab-users.py` (idempotent, -`gitlab-rails`): it creates the eight personas from -`scripts/setup-keycloak-hierarchy.py`, links each to its `openid_connect` -identity (`extern_uid = preferred_username`) so SSO lands on the right account, -and sets GitLab instance admin only on `pat.platform` (the Platform lead), -mirroring the Coder org-admin mapping while keeping tenant isolation. Verified -live: `pat.platform` SSO login is `is_admin=true` (`/admin` returns 200); -`dana.dev` is a regular user (`/admin` returns 404). +`gitlab-rails`): it creates the eight demo personas from +`scripts/setup-keycloak-hierarchy.py` plus the operator super admin +`austen.platform`, links each to its `openid_connect` identity +(`extern_uid = preferred_username`) so SSO lands on the right account, and sets +GitLab instance admin only on `austen.platform` (the operator super admin), +keeping every demo persona (including the Platform lead `pat.platform`) a regular +user to preserve tenant isolation. Verified live: `austen.platform` SSO login is +`is_admin=true` (`/admin` returns 200); `pat.platform` and `dana.dev` are regular +users (`/admin` returns 404). ## Notes and out of scope diff --git a/docs/as-built/55-observability.md b/docs/as-built/55-observability.md index 80e079b..c267a33 100644 --- a/docs/as-built/55-observability.md +++ b/docs/as-built/55-observability.md @@ -151,7 +151,7 @@ is intentionally left enabled (`disable_login_form: false`) as break-glass. Verified live with a headless authorization-code login per persona: the login page shows "Sign in with Keycloak"; `/login/generic_oauth` redirects to the realm -with `client_id=grafana` and PKCE; `pat.platform` (`/platform`) lands as org +with `client_id=grafana` and PKCE; `austen.platform` (`/platform`) lands as org role `Admin` (the admin-only `/api/org/users` returns 200) while `dana.dev` (`/alpha`) lands as `Viewer` (same endpoint returns 403). Both arrive `authLabels: ["Generic OAuth"]`, `isExternallySynced: true`. diff --git a/scripts/grant-coder-owner.py b/scripts/grant-coder-owner.py index 6755d13..bae4a05 100755 --- a/scripts/grant-coder-owner.py +++ b/scripts/grant-coder-owner.py @@ -1,8 +1,8 @@ #!/usr/bin/env python3 """ -grant-coder-owner.py - grant the Coder site-wide Owner role to a demo persona so -one Keycloak SSO identity (default pat.platform, the Platform lead) is super -admin across Coder, GitLab, and Grafana. +grant-coder-owner.py - grant the Coder site-wide Owner role to a user so one +Keycloak SSO identity (default austen.platform, the operator super admin) is +super admin across Coder, GitLab, and Grafana. Coder organization/role IdP sync only manages org-scoped roles; the site-wide Owner role is not claim-driven, so it is assigned explicitly here. Site roles are @@ -13,7 +13,7 @@ ~/.config/usgov-coderdemo/generated-secrets.env. Usage: - python3 scripts/grant-coder-owner.py [username] # default: pat.platform + python3 scripts/grant-coder-owner.py [username] # default: austen.platform """ import json import os @@ -22,7 +22,7 @@ import urllib.request BASE = os.environ.get("DEMO_CODER_URL", "https://dev.usgov.coderdemo.io").rstrip("/") -USERNAME = sys.argv[1] if len(sys.argv) > 1 else "pat.platform" +USERNAME = sys.argv[1] if len(sys.argv) > 1 else "austen.platform" EMAIL_DOMAIN = "usgov.coderdemo.io" diff --git a/scripts/setup-gitlab-users.py b/scripts/setup-gitlab-users.py index e3faec0..63f8937 100755 --- a/scripts/setup-gitlab-users.py +++ b/scripts/setup-gitlab-users.py @@ -10,10 +10,10 @@ identity (extern_uid = username) so a Keycloak SSO login lands on the right account. Mirrors the personas in scripts/setup-keycloak-hierarchy.py. -Mapping applied (mirrors the Coder org-admin role; only Platform Engineering +Mapping applied (mirrors the Coder org-admin role; only the operator super admin gets GitLab instance admin, to preserve tenant isolation): - pat.platform -> instance admin (Platform lead) - all other personas -> regular users + austen.platform -> instance admin (operator super admin) + all demo personas -> regular users Runs gitlab-rails inside the gitlab-0 pod (ROPC/password grant is disabled, so a REST token is not available without a bootstrap). The demo password is read from @@ -34,21 +34,26 @@ RUBY = r''' admin = User.find_by(username: "root") org = Organizations::Organization.default_organization -pw = ENV["DEMO_USER_PASSWORD"].to_s -abort("DEMO_USER_PASSWORD not provided") if pw.empty? +pwmap = { + "DEMO_USER_PASSWORD" => ENV["DEMO_USER_PASSWORD"].to_s, + "SUPERADMIN_PASSWORD" => ENV["SUPERADMIN_PASSWORD"].to_s, +} personas = [ - ["pat.platform", "Pat Rivera", true], - ["sky.sre", "Sky Nguyen", false], - ["alex.admin", "Alex Carter", false], - ["dana.dev", "Dana Brooks", false], - ["quinn.data", "Quinn Lee", false], - ["morgan.isso", "Morgan Diaz", false], - ["riley.admin", "Riley Fox", false], - ["jordan.dev", "Jordan Kim", false], + ["austen.platform", "Austen Platform", true, "SUPERADMIN_PASSWORD"], + ["pat.platform", "Pat Rivera", false, "DEMO_USER_PASSWORD"], + ["sky.sre", "Sky Nguyen", false, "DEMO_USER_PASSWORD"], + ["alex.admin", "Alex Carter", false, "DEMO_USER_PASSWORD"], + ["dana.dev", "Dana Brooks", false, "DEMO_USER_PASSWORD"], + ["quinn.data", "Quinn Lee", false, "DEMO_USER_PASSWORD"], + ["morgan.isso", "Morgan Diaz", false, "DEMO_USER_PASSWORD"], + ["riley.admin", "Riley Fox", false, "DEMO_USER_PASSWORD"], + ["jordan.dev", "Jordan Kim", false, "DEMO_USER_PASSWORD"], ] -personas.each do |uname, fullname, is_admin| +personas.each do |uname, fullname, is_admin, pw_env| + pw = pwmap[pw_env].to_s + abort("password #{pw_env} not provided") if pw.empty? email = "#{uname}@usgov.coderdemo.io" u = User.find_by(username: uname) if u.nil? @@ -76,15 +81,19 @@ ''' -def read_demo_password(): +def read_passwords(): path = os.path.expanduser("~/.config/usgov-coderdemo/generated-secrets.env") + out = {} with open(path) as f: for line in f: line = line.strip() - if line.startswith("DEMO_USER_PASSWORD="): - return line.split("=", 1)[1] - print("DEMO_USER_PASSWORD not found in generated-secrets.env", file=sys.stderr) - sys.exit(1) + for key in ("DEMO_USER_PASSWORD", "SUPERADMIN_PASSWORD"): + if line.startswith(key + "="): + out[key] = line.split("=", 1)[1] + if "DEMO_USER_PASSWORD" not in out: + print("DEMO_USER_PASSWORD not found in generated-secrets.env", file=sys.stderr) + sys.exit(1) + return out["DEMO_USER_PASSWORD"], out.get("SUPERADMIN_PASSWORD", "") def kubectl_exec(stdin_data, shell_cmd): @@ -94,17 +103,19 @@ def kubectl_exec(stdin_data, shell_cmd): def main(): - pw = read_demo_password() + demo_pw, super_pw = read_passwords() # 1. Stage the Ruby script in the pod (contains no secret). r = kubectl_exec(RUBY, "cat > /tmp/setup-gitlab-users.rb") if r.returncode != 0: print(r.stderr, file=sys.stderr) sys.exit(1) - # 2. Run it with the password supplied over stdin -> env (not argv). + # 2. Run it with the passwords supplied over stdin -> env (not argv). r = kubectl_exec( - pw, - 'read -r PW; DEMO_USER_PASSWORD="$PW" gitlab-rails runner ' - '/tmp/setup-gitlab-users.rb; rc=$?; rm -f /tmp/setup-gitlab-users.rb; exit $rc') + demo_pw + "\n" + super_pw + "\n", + 'read -r DPW; read -r SPW; ' + 'DEMO_USER_PASSWORD="$DPW" SUPERADMIN_PASSWORD="$SPW" ' + 'gitlab-rails runner /tmp/setup-gitlab-users.rb; ' + 'rc=$?; rm -f /tmp/setup-gitlab-users.rb; exit $rc') sys.stdout.write(r.stdout) if r.returncode != 0: sys.stderr.write(r.stderr) diff --git a/scripts/setup-keycloak-hierarchy.py b/scripts/setup-keycloak-hierarchy.py index f0b390e..1fbc6c8 100755 --- a/scripts/setup-keycloak-hierarchy.py +++ b/scripts/setup-keycloak-hierarchy.py @@ -34,18 +34,23 @@ # Persona users -> full group paths they belong to. USERS = { - "pat.platform": { - "first": "Pat", "last": "Rivera", - # Platform lead and demo super admin: a member of ALL tenant orgs (and - # org-admin in each) so a single Keycloak login is admin across every - # Coder org. Combined with the site Owner role and GitLab/Grafana admin, - # pat.platform is super admin across the whole stack. + # Operator super admin (not a demo persona). Dedicated account for the demo + # operator: a member of ALL tenant orgs (org-admin in each) plus the Coder + # site Owner role, GitLab instance admin, and Grafana admin, so one Keycloak + # login administers the whole stack. Uses its own SUPERADMIN_PASSWORD. + "austen.platform": { + "first": "Austen", "last": "Platform", + "password_env": "SUPERADMIN_PASSWORD", "groups": [ "/platform", "/platform/platform-admins", "/platform/org-admins", "/alpha", "/alpha/org-admins", "/bravo", "/bravo/org-admins", ], }, + "pat.platform": { + "first": "Pat", "last": "Rivera", + "groups": ["/platform", "/platform/platform-admins", "/platform/org-admins"], + }, "sky.sre": { "first": "Sky", "last": "Nguyen", "groups": ["/platform", "/platform/sre", "/platform/template-admins"], @@ -184,8 +189,11 @@ def ensure_mapper(): def ensure_users(paths): - pw = SECRETS["DEMO_USER_PASSWORD"] for username, spec in USERS.items(): + pw_env = spec.get("password_env") + pw = SECRETS.get(pw_env) if pw_env else None + if not pw: + pw = SECRETS["DEMO_USER_PASSWORD"] _, found = kc("GET", "/users?exact=true&username=" + urllib.parse.quote(username)) if found: uid = found[0]["id"] From dc188f930168a33b99690a7144db34cf66f29d81 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 21:01:03 +0000 Subject: [PATCH 13/16] feat: enforce passkey + TOTP enrollment for austen.platform Give the operator super admin austen.platform the Keycloak webauthn-register and CONFIGURE_TOTP required actions so its first Keycloak sign in forces WebAuthn passkey and TOTP enrollment. The actions are applied only while the matching credential is missing, so reconciles never re-force enrollment. - setup-keycloak-hierarchy.py: add required_actions to the austen.platform spec plus an ensure_required_actions() reconciler keyed on existing credentials. - Docs (STATUS.md, 45-idp-sync-personas.md): note the enforced enrollment and that the headless verify probe no longer applies to austen.platform. Applied live: austen.platform requiredActions=[CONFIGURE_TOTP, webauthn-register] with only a password credential; the other personas are unaffected. Generated by Coder Agents. --- STATUS.md | 6 +++++ docs/as-built/45-idp-sync-personas.md | 8 ++++++ scripts/setup-keycloak-hierarchy.py | 35 +++++++++++++++++++++++++++ 3 files changed, 49 insertions(+) diff --git a/STATUS.md b/STATUS.md index e91a5ad..9ca2817 100644 --- a/STATUS.md +++ b/STATUS.md @@ -154,6 +154,12 @@ gated; Nova Pro is the proven fallback. Alpha, and Bravo. The `pat.platform` persona is a normal Platform lead (Platform org-admin only, not a site Owner and not a GitLab admin). Sign in with "Keycloak" on each app. +- [x] **Operator MFA enrollment enforced**: `austen.platform` carries the + Keycloak `webauthn-register` and `CONFIGURE_TOTP` required actions, so the + first Keycloak sign in forces passkey + TOTP enrollment + (`scripts/setup-keycloak-hierarchy.py`). The stock browser flow then + challenges TOTP on later logins; re-running the script never re-forces + enrollment once the credentials exist. - [x] **Local break-glass admins** remain per app (Coder owner, GitLab root, Grafana admin). Credentials live in `~/.config/usgov-coderdemo/generated-secrets.env` and AWS Secrets Manager diff --git a/docs/as-built/45-idp-sync-personas.md b/docs/as-built/45-idp-sync-personas.md index 26fa480..46d89fc 100644 --- a/docs/as-built/45-idp-sync-personas.md +++ b/docs/as-built/45-idp-sync-personas.md @@ -103,6 +103,14 @@ in each) and is additionally granted the Coder **site Owner** role group rule). One Keycloak login therefore administers the entire stack: every Coder org, GitLab, and Grafana. +On its first Keycloak sign in `austen.platform` is forced to enroll a WebAuthn +passkey and TOTP: it carries the `webauthn-register` and `CONFIGURE_TOTP` +required actions (set by `scripts/setup-keycloak-hierarchy.py`, only while the +matching credential is missing, so reconciles never re-force enrollment). The +stock browser flow challenges TOTP on subsequent logins. Because of this, the +headless `verify-oidc-login.py` probe does not apply to `austen.platform` once +the required actions are set, until enrollment is completed interactively. + ## Verified login matrix Run `scripts/verify-oidc-login.py` (fresh cookie jar per user, real Keycloak diff --git a/scripts/setup-keycloak-hierarchy.py b/scripts/setup-keycloak-hierarchy.py index 1fbc6c8..95c8522 100755 --- a/scripts/setup-keycloak-hierarchy.py +++ b/scripts/setup-keycloak-hierarchy.py @@ -11,6 +11,11 @@ ~/.config/usgov-coderdemo/generated-secrets.env: KEYCLOAK_ADMIN_USERNAME, KEYCLOAK_ADMIN_PASSWORD, DEMO_USER_PASSWORD +The operator super admin (austen.platform) is additionally given the +webauthn-register and CONFIGURE_TOTP required actions, so Keycloak forces passkey ++ TOTP enrollment on its next sign in. The actions are only (re)applied while the +matching credential is still missing, so a reconcile never forces re-enrollment. + Pairs with scripts/setup-coder-idp-sync.py (the Coder side). The hierarchy is documented in docs/as-built/45-idp-sync-personas.md. """ @@ -41,6 +46,8 @@ "austen.platform": { "first": "Austen", "last": "Platform", "password_env": "SUPERADMIN_PASSWORD", + # Force passkey + TOTP enrollment on first sign in (super admin account). + "required_actions": ["webauthn-register", "CONFIGURE_TOTP"], "groups": [ "/platform", "/platform/platform-admins", "/platform/org-admins", "/alpha", "/alpha/org-admins", @@ -83,6 +90,15 @@ EMAIL_DOMAIN = "usgov.coderdemo.io" +# Maps a required action to the credential type it provisions. Used to enforce +# MFA enrollment only while the credential is still missing, so reconciles stay +# idempotent and never force a re-enrollment once the user is set up. +REQUIRED_ACTION_CREDENTIAL = { + "CONFIGURE_TOTP": "otp", + "webauthn-register": "webauthn", + "webauthn-register-passwordless": "webauthn-passwordless", +} + def read_secrets(): path = os.path.expanduser("~/.config/usgov-coderdemo/generated-secrets.env") @@ -219,6 +235,25 @@ def ensure_users(paths): gid = paths[gpath] code, _ = kc("PUT", f"/users/{uid}/groups/{gid}") print(f" {username}: groups -> {', '.join(spec['groups'])}") + ensure_required_actions(uid, username, spec.get("required_actions") or []) + + +def ensure_required_actions(uid, username, desired): + """Add required actions that enforce MFA enrollment, but only while the + matching credential is missing (so a reconcile never re-forces enrollment).""" + if not desired: + return + _, creds = kc("GET", f"/users/{uid}/credentials") + have = {c.get("type") for c in (creds or [])} + pending = [a for a in desired + if REQUIRED_ACTION_CREDENTIAL.get(a) not in have] + _, rep = kc("GET", f"/users/{uid}") + current = list(rep.get("requiredActions") or []) + merged = current + [a for a in pending if a not in current] + if merged != current: + rep["requiredActions"] = merged + kc("PUT", f"/users/{uid}", rep) + print(f" {username}: required actions -> {merged or '[]'}") def main(): From f73f7fca42e1cfb7389d0f5b72b1c3abd45bd754 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 21:02:23 +0000 Subject: [PATCH 14/16] feat(scripts/set-appearance.sh): set Coder application name to USGOV Coder Demo Make the Coder dashboard application_name configurable via APP_NAME (default "USGOV Coder Demo") instead of empty, so the demo deployment shows a branded name in the UI title and login page. Applied live via PUT /api/v2/appearance; the UNCLASSIFIED announcement banner is preserved. Generated by Coder Agents. --- scripts/set-appearance.sh | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/scripts/set-appearance.sh b/scripts/set-appearance.sh index 67bfd0f..e10ca0f 100755 --- a/scripts/set-appearance.sh +++ b/scripts/set-appearance.sh @@ -13,6 +13,7 @@ # # Env (with sane demo defaults): # DEMO_CODER_URL default https://dev.usgov.coderdemo.io +# APP_NAME default "USGOV Coder Demo" # BANNER_MESSAGE default "UNCLASSIFIED - USGOVCLOUD" # BANNER_COLOR default "#007a33" (IC/DoD UNCLASSIFIED green) # Admin creds are read from ~/.config/usgov-coderdemo/generated-secrets.env. @@ -24,6 +25,7 @@ set -euo pipefail export CODER_URL="${DEMO_CODER_URL:-https://dev.usgov.coderdemo.io}" +export APP_NAME="${APP_NAME:-USGOV Coder Demo}" export BANNER_MESSAGE="${BANNER_MESSAGE:-UNCLASSIFIED - USGOVCLOUD}" export BANNER_COLOR="${BANNER_COLOR:-#007a33}" SECRETS="${HOME}/.config/usgov-coderdemo/generated-secrets.env" @@ -62,7 +64,7 @@ _, login = call("POST", "/api/v2/users/login", { token = login["session_token"] call("PUT", "/api/v2/appearance", { - "application_name": "", + "application_name": os.environ["APP_NAME"], "logo_url": "", "service_banner": {"enabled": False}, "announcement_banners": [{ @@ -73,5 +75,6 @@ call("PUT", "/api/v2/appearance", { }, token=token) status, appearance = call("GET", "/api/v2/appearance", token=token) +print("application_name:", json.dumps(appearance.get("application_name"))) print("appearance set:", json.dumps(appearance["announcement_banners"])) PY From 176b9a7ccd539af792d03b5d4cee25a702e5e783 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 21:31:28 +0000 Subject: [PATCH 15/16] feat: add in-cluster Loki + Promtail logging and wire Grafana log panels Deploy a lean single-binary Grafana Loki (filesystem gp3 PVC, tsdb schema v13, 168h retention) and a node-level Promtail DaemonSet into the monitoring namespace, with both images mirrored to ECR. Add a Grafana datasource ConfigMap with uid "loki" so the generated Coder dashboards' log panels resolve to the live log store instead of erroring. Clean up the coder-status dashboard: replace the upstream LGTM "Observability Tools" row (distributed Loki, Grafana Agent, config reloaders, storage/CPU/RAM) with Prometheus, Loki, and Promtail up panels, and repoint the Workspace Builds and Postgres panels to coderd_* metrics that exist in this stack. Update the observability README, STATUS.md, and the as-built doc. Generated by Coder Agents. --- STATUS.md | 24 +- deploy/observability/README.md | 77 +- deploy/observability/dashboards-coder.yaml | 1024 +------------------- deploy/observability/loki-datasource.yaml | 35 + deploy/observability/loki.yaml | 239 +++++ deploy/observability/promtail.yaml | 271 ++++++ docs/as-built/55-observability.md | 117 ++- scripts/images.txt | 9 + 8 files changed, 773 insertions(+), 1023 deletions(-) create mode 100644 deploy/observability/loki-datasource.yaml create mode 100644 deploy/observability/loki.yaml create mode 100644 deploy/observability/promtail.yaml diff --git a/STATUS.md b/STATUS.md index 9ca2817..6328d01 100644 --- a/STATUS.md +++ b/STATUS.md @@ -183,7 +183,7 @@ gated; Nova Pro is the proven fallback. `terraform/secrets-hardening.tf`. - See `docs/as-built/85-secrets-management.md`. -## Observability (in-cluster Prometheus + Grafana) +## Observability (in-cluster Prometheus + Grafana + Loki) - [x] **In-boundary metrics + dashboards** via the `prometheus-community/kube-prometheus-stack` Helm release `kps` (ns `monitoring`, ECR-mirrored images). Prometheus (2/2), Grafana (3/3), and @@ -196,6 +196,25 @@ gated; Nova Pro is the proven fallback. render live data at `https://grafana.usgov.coderdemo.io` (valid TLS, HTTP 200). Grafana admin password lives in AWS Secrets Manager (`usgov-coderdemo/observability/grafana`) and is synced by ESO. +- [x] **In-cluster logs via Loki + Promtail** (hand-rolled manifests + `deploy/observability/loki.yaml` + `promtail.yaml`, ECR-mirrored + `grafana/loki:3.5.9` and `grafana/promtail:3.5.9`). Single-binary Loki + stores on a 10Gi gp3 PVC (filesystem, tsdb schema v13, 168h retention); a + Promtail DaemonSet tails `/var/log/pods` on every node and pushes pod logs + to `loki.monitoring.svc:3100`, covering namespaces `coder`, + `coder-workspaces`, `gitlab`, `keycloak`, `monitoring`, and + `external-secrets`. A Grafana datasource ConfigMap + (`deploy/observability/loki-datasource.yaml`, uid `loki`) provisions it via + the sidecar, so the Coder dashboards' log panels (workspace-detail "Logs", + provisionerd, workspaces) resolve instead of showing a datasource error. + Prometheus scrapes both (`up{job="loki"}` and `up{job="promtail"}` are + `1`). +- [x] **`coder-status` dashboard adapted to this stack**: the upstream + "Observability Tools" row (distributed Loki, Grafana Agent, config + reloaders, storage/CPU/RAM) was replaced with Prometheus, Loki, and + Promtail `up` panels; "Workspace Builds" repointed to + `coderd_workspace_latest_build_status` and "Postgres" to a real + `coderd_db_tx_duration_seconds` signal (no postgres_exporter runs). - [x] **Grafana Keycloak SSO (one SSO)**: Grafana signs in via the same realm (`coder`) through a confidential OIDC client `grafana` (`scripts/setup-grafana-oidc.py`, PKCE; secret in ASM @@ -205,7 +224,8 @@ gated; Nova Pro is the proven fallback. (`austen.platform` Admin, `dana.dev` Viewer). - [x] **Structured JSON server logs** (`CODER_LOGGING_JSON=/dev/stderr`, `CODER_LOGGING_HUMAN=/dev/null`) make coderd SIEM-ready; audit logging is - entitled + on (`/audit`). + entitled + on (`/audit`). Promtail also ships these lines to the + in-cluster Loki, so they are queryable in Grafana. - [ ] AWS-native managed variant (AMP + AMG, CloudWatch -> Security Lake) is the production target, planned only. See [`docs/plans/observability-aws-native.md`](docs/plans/observability-aws-native.md) diff --git a/deploy/observability/README.md b/deploy/observability/README.md index a2d0f76..c8c25f3 100644 --- a/deploy/observability/README.md +++ b/deploy/observability/README.md @@ -1,9 +1,9 @@ -# Observability stack (in-cluster metrics + dashboards) +# Observability stack (in-cluster metrics, logs + dashboards) -In-boundary, in-cluster metrics and dashboards for the GovCloud demo. It scrapes -the Coder control plane's Prometheus metrics and renders Coder's prebuilt -Grafana dashboards with live data, reachable over HTTPS at -`https://grafana.usgov.coderdemo.io`. +In-boundary, in-cluster metrics, logs, and dashboards for the GovCloud demo. It +scrapes the Coder control plane's Prometheus metrics, collects pod logs into +Loki, and renders Coder's prebuilt Grafana dashboards with live data, reachable +over HTTPS at `https://grafana.usgov.coderdemo.io`. This is the reliable in-cluster implementation. The AWS-native managed variant (Amazon Managed Prometheus / Grafana, Security Lake) is planned separately and @@ -19,6 +19,9 @@ is not built here. | Prometheus operator | Deployment `kps-kube-prometheus-stack-operator`. Admission webhooks disabled. | | Coder scrape | `coder-metrics` headless Service (port 2112) + `ServiceMonitor/coder`, both in namespace `coder`. Prometheus job `coder-metrics`. | | Dashboards | Six Coder dashboards as ConfigMaps in `monitoring`, imported by the Grafana sidecar (label `grafana_dashboard: "1"`). | +| Loki | Single-binary Deployment `loki` (10Gi gp3 PVC, filesystem object store, tsdb schema v13, ~168h retention). Service `loki:3100`. Config `loki.yaml`. Scraped via `ServiceMonitor/loki`. | +| Promtail | DaemonSet `promtail` tailing `/var/log/pods` on every node and pushing to `loki.monitoring.svc:3100`. Config `promtail.yaml`. Scraped via `ServiceMonitor/promtail`. | +| Loki datasource | `loki-datasource.yaml` ConfigMap (label `grafana_datasource: "1"`, uid `loki`), provisioned by the Grafana sidecar. Powers the dashboards' log panels. | | Ingress | `grafana` Ingress (className `nginx`, host `grafana.usgov.coderdemo.io`, TLS terminated upstream at the NLB). | Disabled to keep the demo lean and cut image mirroring: Alertmanager, @@ -37,6 +40,8 @@ at the mirror: - `quay/prometheus-operator/prometheus-config-reloader:v0.91.0` - `docker-hub/grafana/grafana:13.0.1-security-01` - `quay/kiwigrid/k8s-sidecar:2.7.3` +- `docker-hub/grafana/loki:3.5.9` +- `docker-hub/grafana/promtail:3.5.9` ## The scrape path @@ -61,10 +66,43 @@ Provisioners, Coder Workspaces, and Coder Workspace Detail. Every panel targets datasource uid `prometheus`, which the kube-prometheus-stack Grafana auto-provisions and marks default. -The purely log-based `agent-boundaries` dashboard is omitted, and a few log -panels inside the workspaces / provisionerd / workspace-detail dashboards show -no data, because this stack ships metrics only (no Loki). Their Prometheus -panels render live. +The purely log-based `agent-boundaries` dashboard is omitted. The log panels in +the workspaces / provisionerd / workspace-detail dashboards target datasource +uid `loki`, which the in-cluster Loki + Promtail stack (below) backs through +`loki-datasource.yaml`, so they resolve and query live log data instead of +erroring. + +The `coder-status` dashboard's "Observability Tools" row originally came from the +upstream coder/observability LGTM reference and checked components this demo does +not run (distributed Loki read/write/backend/canary, Grafana Agent, config +reloaders, Prometheus storage, CPU, RAM). It was replaced with `up` panels for +the components this stack actually runs: Prometheus, the single-binary Loki, and +Promtail. The same dashboard's "Workspace Builds" panel was repointed to +`coderd_workspace_latest_build_status` (the previous +`coderd_provisionerd_job_timings_seconds_count` has no series here), and its +"Postgres" panel to a boolean over `coderd_db_tx_duration_seconds` (there is no +postgres_exporter, so `pg_up` never existed). + +## The logging path + +1. A `promtail` DaemonSet runs on every node and tails the real container log + files under `/var/log/pods` (containerd on EKS), discovering pods with the + Kubernetes `pod` service-discovery role. It attaches `namespace`, `pod`, + `container`, `app`, and `node_name` labels and pushes batches to + `http://loki.monitoring.svc:3100/loki/api/v1/push`. There is no namespace + filter, so all workload namespaces (`coder`, `coder-workspaces`, `gitlab`, + `keycloak`, `monitoring`, `external-secrets`, and the rest) are captured. +2. `loki` runs as a single binary (`-target=all`, in-memory ring, + `replication_factor: 1`) with filesystem object storage and a tsdb shipper on + a 10Gi gp3 PVC. `auth_enabled` is false (single tenant, in-cluster only) and + the compactor enforces ~168h retention. +3. `loki-datasource.yaml` provisions a Grafana datasource with uid exactly + `loki`. The generated Coder dashboards reference that uid on their log panels, + so creating it wires those panels to the live store. `isDefault` stays false + so Prometheus remains the default datasource. +4. `ServiceMonitor/loki` and `ServiceMonitor/promtail` let Prometheus scrape both + components' `/metrics`, so `up{job="loki"}` and `up{job="promtail"}` drive the + `coder-status` dashboard's Loki and Promtail panels. ## Single sign-on (Keycloak) @@ -134,6 +172,14 @@ helm install kps ~/.cache/helm/repository/kube-prometheus-stack-86.2.0.tgz \ kubectl apply -f deploy/observability/coder-metrics.yaml kubectl apply -f deploy/observability/grafana-ingress.yaml kubectl apply -f deploy/observability/dashboards-coder.yaml + +# 7. Logging stack: Loki, Promtail, and the Grafana Loki datasource. +# The images were mirrored by step 1 (they are in scripts/images.txt). +kubectl apply -f deploy/observability/loki.yaml +kubectl -n monitoring rollout status deploy/loki +kubectl apply -f deploy/observability/promtail.yaml +kubectl apply -f deploy/observability/loki-datasource.yaml +kubectl -n monitoring rollout status ds/promtail ``` To regenerate `dashboards-coder.yaml` from upstream, extract the @@ -157,4 +203,17 @@ curl -s -u "admin:$GPW" 'https://grafana.usgov.coderdemo.io/api/search?type=dash # Keycloak SSO button + redirect (client_id=grafana, PKCE) curl -s https://grafana.usgov.coderdemo.io/login | grep -o '"oauth":{[^}]*}' curl -s -o /dev/null -D - https://grafana.usgov.coderdemo.io/login/generic_oauth | grep -i '^location:' + +# Loki ingesting logs (labels + a sample query through a port-forward) +kubectl -n monitoring port-forward svc/loki 3100:3100 & +curl -s 'http://localhost:3100/loki/api/v1/labels' +curl -s -G 'http://localhost:3100/loki/api/v1/query_range' \ + --data-urlencode 'query={namespace="coder"}' --data-urlencode 'limit=1' + +# Loki + Promtail scraped by Prometheus (both 1) +curl -s 'http://localhost:9090/api/v1/query?query=up{job="loki"}' +curl -s 'http://localhost:9090/api/v1/query?query=up{job="promtail"}' + +# Loki datasource present in Grafana with uid loki +curl -s -u "admin:$GPW" 'https://grafana.usgov.coderdemo.io/api/datasources/uid/loki' ``` diff --git a/deploy/observability/dashboards-coder.yaml b/deploy/observability/dashboards-coder.yaml index c20ba60..41a277f 100644 --- a/deploy/observability/dashboards-coder.yaml +++ b/deploy/observability/dashboards-coder.yaml @@ -1,4 +1,5 @@ -# Coder Grafana dashboards (generated, do not hand-edit). +# Coder Grafana dashboards (generated upstream; locally adapted, see the note +# at the end of this header). # # Source: github.com/coder/observability compiled/resources.yaml (the chart's # rendered output). These are the six Prometheus-backed Coder dashboards. The @@ -10,10 +11,20 @@ # from any namespace, so these live in the `monitoring` namespace next to # Grafana. Regenerate with deploy/observability/README.md instructions. # -# The purely log-based `agent-boundaries` dashboard is intentionally omitted: -# this stack ships metrics only (no Loki). A few panels in the workspaces, -# provisionerd, and workspace-detail dashboards are log-based and will show no -# data for the same reason; their Prometheus panels render live. +# The purely log-based `agent-boundaries` dashboard is intentionally omitted. +# The log-based panels in the workspaces, provisionerd, and workspace-detail +# dashboards reference datasource uid `loki`, which is backed by the in-cluster +# single-binary Loki + Promtail stack (deploy/observability/loki.yaml and +# promtail.yaml) through deploy/observability/loki-datasource.yaml, so those +# panels render live log data. +# +# Local adaptation: the `coder-status` dashboard's "Observability Tools" row was +# copied from the upstream coder/observability LGTM reference and checked +# components this demo does not run (distributed Loki read/write/backend/canary, +# Grafana Agent, config reloaders, Prometheus storage, CPU, RAM). That row was +# replaced with panels for what this stack actually runs: Prometheus, the +# single-binary Loki, and Promtail. The "Workspace Builds" and "Postgres" +# panels were likewise repointed to coderd_* metrics that exist in this stack. --- apiVersion: v1 kind: ConfigMap @@ -1845,7 +1856,7 @@ data: }, "editorMode": "code", "exemplar": false, - "expr": "round(sum by (status) (increase(coderd_provisionerd_job_timings_seconds_count{pod!=``}[$__range])))", + "expr": "sum by (status) (coderd_workspace_latest_build_status)", "instant": true, "legendFormat": "{{status}}", "range": false, @@ -2203,7 +2214,7 @@ data: "uid": "prometheus" }, "editorMode": "code", - "expr": "min(pg_up) or vector(0)", + "expr": "(sum(rate(coderd_db_tx_duration_seconds_count[5m])) > bool 0) or vector(0)", "instant": true, "legendFormat": "__auto", "range": false, @@ -2333,7 +2344,7 @@ data: "uid": "prometheus" }, "editorMode": "code", - "expr": "min(up{job=\"coder-observability/prometheus/server\"}) or vector(0)", + "expr": "min(up{job=\"kps-kube-prometheus-stack-prometheus\"}) or vector(0)", "instant": true, "legendFormat": "__auto", "range": false, @@ -2450,14 +2461,14 @@ data: "uid": "prometheus" }, "editorMode": "code", - "expr": "min(up{job=\"coder-observability/loki/write\"}) or vector(0)", + "expr": "min(up{job=\"loki\"}) or vector(0)", "instant": true, "legendFormat": "__auto", "range": false, "refId": "A" } ], - "title": "Loki Write Path", + "title": "Loki", "type": "stat" }, { @@ -2567,1003 +2578,14 @@ data: "uid": "prometheus" }, "editorMode": "code", - "expr": "min(up{job=\"coder-observability/loki/read\"}) or vector(0)", - "instant": true, - "legendFormat": "__auto", - "range": false, - "refId": "A" - } - ], - "title": "Loki Read Path", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "description": "", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [ - { - "options": { - "0": { - "color": "red", - "index": 1, - "text": "Down" - }, - "1": { - "color": "green", - "index": 0, - "text": "Up" - } - }, - "type": "value" - }, - { - "options": { - "match": "null", - "result": { - "color": "orange", - "index": 2, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "empty", - "result": { - "color": "orange", - "index": 3, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "null+nan", - "result": { - "index": 4, - "text": "Unknown" - } - }, - "type": "special" - } - ], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "red", - "value": 80 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 12, - "y": 9 - }, - "id": 6, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "auto", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "textMode": "auto", - "wideLayout": true - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "expr": "min(up{job=\"coder-observability/loki/backend\", container=\"loki\"}) or vector(0)", - "instant": true, - "legendFormat": "__auto", - "range": false, - "refId": "A" - } - ], - "title": "Loki Backend", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "description": "", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [ - { - "options": { - "0": { - "color": "red", - "index": 1, - "text": "Down" - }, - "1": { - "color": "green", - "index": 0, - "text": "Up" - } - }, - "type": "value" - }, - { - "options": { - "match": "null", - "result": { - "color": "orange", - "index": 2, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "empty", - "result": { - "color": "orange", - "index": 3, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "null+nan", - "result": { - "index": 4, - "text": "Unknown" - } - }, - "type": "special" - } - ], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "red", - "value": 80 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 16, - "y": 9 - }, - "id": 7, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "auto", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "textMode": "auto", - "wideLayout": true - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "expr": "min(up{job=\"coder-observability/loki/canary\"}) or vector(0)", + "expr": "min(up{job=\"promtail\"}) or vector(0)", "instant": true, "legendFormat": "__auto", "range": false, "refId": "A" } ], - "title": "Loki Canary", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "description": "", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [ - { - "options": { - "0": { - "color": "red", - "index": 1, - "text": "Down" - }, - "1": { - "color": "green", - "index": 0, - "text": "Up" - } - }, - "type": "value" - }, - { - "options": { - "match": "null", - "result": { - "color": "orange", - "index": 2, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "empty", - "result": { - "color": "orange", - "index": 3, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "null+nan", - "result": { - "index": 4, - "text": "Unknown" - } - }, - "type": "special" - } - ], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "red", - "value": 80 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 20, - "y": 9 - }, - "id": 8, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "auto", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "textMode": "auto", - "wideLayout": true - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "expr": "min(up{job=\"coder-observability/grafana-agent/grafana-agent\"}) or vector(0)", - "instant": true, - "legendFormat": "__auto", - "range": false, - "refId": "A" - } - ], - "title": "Grafana Agent", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "description": "", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [ - { - "options": { - "0": { - "color": "red", - "index": 1, - "text": "Unhealthy" - }, - "1": { - "color": "green", - "index": 0, - "text": "Healthy" - } - }, - "type": "value" - }, - { - "options": { - "match": "null", - "result": { - "color": "orange", - "index": 2, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "empty", - "result": { - "color": "orange", - "index": 3, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "null+nan", - "result": { - "index": 4, - "text": "Unknown" - } - }, - "type": "special" - } - ], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "red", - "value": 80 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 0, - "y": 14 - }, - "id": 12, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "auto", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "textMode": "auto", - "wideLayout": true - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "expr": "prometheus_config_last_reload_successful{job=\"coder-observability/prometheus/server\"}", - "instant": true, - "legendFormat": "__auto", - "range": false, - "refId": "A" - } - ], - "title": "Prometheus Config", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "description": "", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [ - { - "options": { - "0": { - "color": "red", - "index": 1, - "text": "Unhealthy" - }, - "1": { - "color": "green", - "index": 0, - "text": "Healthy" - } - }, - "type": "value" - }, - { - "options": { - "match": "null", - "result": { - "color": "orange", - "index": 2, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "empty", - "result": { - "color": "orange", - "index": 3, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "null+nan", - "result": { - "index": 4, - "text": "Unknown" - } - }, - "type": "special" - } - ], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "red", - "value": 80 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 4, - "y": 14 - }, - "id": 14, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "auto", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "textMode": "auto", - "wideLayout": true - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "expr": "min(loki_runtime_config_last_reload_successful) or vector(0)", - "instant": true, - "legendFormat": "__auto", - "range": false, - "refId": "A" - } - ], - "title": "Loki Config", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "description": "", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [ - { - "options": { - "0": { - "color": "red", - "index": 1, - "text": "Unhealthy" - }, - "1": { - "color": "green", - "index": 0, - "text": "Healthy" - } - }, - "type": "value" - }, - { - "options": { - "match": "null", - "result": { - "color": "orange", - "index": 2, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "empty", - "result": { - "color": "orange", - "index": 3, - "text": "Unknown" - } - }, - "type": "special" - }, - { - "options": { - "match": "null+nan", - "result": { - "index": 4, - "text": "Unknown" - } - }, - "type": "special" - } - ], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "red", - "value": null - }, - { - "color": "red", - "value": 80 - } - ] - } - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 8, - "y": 14 - }, - "id": 13, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "auto", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "textMode": "auto", - "wideLayout": true - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "exemplar": false, - "expr": "min(agent_config_last_load_successful{job=\"coder-observability/grafana-agent/grafana-agent\"}) or vector(0)", - "instant": true, - "legendFormat": "__auto", - "range": false, - "refId": "A" - } - ], - "title": "Grafana Agent Config", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "green", - "value": null - }, - { - "color": "red", - "value": 80 - } - ] - }, - "unit": "percentunit" - }, - "overrides": [ - { - "matcher": { - "id": "byName", - "options": "Retention Limit" - }, - "properties": [ - { - "id": "color", - "value": { - "fixedColor": "red", - "mode": "fixed" - } - } - ] - }, - { - "matcher": { - "id": "byName", - "options": "Write-Ahead Log" - }, - "properties": [ - { - "id": "color", - "value": { - "fixedColor": "purple", - "mode": "fixed" - } - } - ] - }, - { - "matcher": { - "id": "byName", - "options": "Storage" - }, - "properties": [ - { - "id": "color", - "value": { - "fixedColor": "#f9f9fb", - "mode": "fixed" - } - } - ] - } - ] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 12, - "y": 14 - }, - "id": 11, - "options": { - "colorMode": "value", - "graphMode": "area", - "justifyMode": "auto", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "textMode": "auto", - "wideLayout": true - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "expr": "(\n prometheus_tsdb_wal_storage_size_bytes{job=\"coder-observability/prometheus/server\"} +\n prometheus_tsdb_storage_blocks_bytes{job=\"coder-observability/prometheus/server\"} +\n prometheus_tsdb_symbol_table_size_bytes{job=\"coder-observability/prometheus/server\"}\n)\n/\nprometheus_tsdb_retention_limit_bytes{job=\"coder-observability/prometheus/server\"}", - "instant": false, - "legendFormat": "Retention limit used", - "range": true, - "refId": "A" - } - ], - "title": "Prometheus Storage", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "description": "", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "text", - "value": null - } - ] - }, - "unit": "none" - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 16, - "y": 14 - }, - "id": 20, - "options": { - "colorMode": "value", - "graphMode": "none", - "justifyMode": "center", - "orientation": "auto", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "text": { - "titleSize": 20, - "valueSize": 35 - }, - "textMode": "auto", - "wideLayout": false - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "exemplar": false, - "expr": "sum(kube_pod_container_resource_requests{namespace=\"coder-observability\", resource=\"cpu\"})", - "hide": false, - "instant": true, - "legendFormat": "Requested", - "range": false, - "refId": "C" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "exemplar": false, - "expr": "sum(\n max_over_time(\n rate(container_cpu_usage_seconds_total{namespace=\"coder-observability\"}[$__rate_interval])\n [$__range:]\n )\n)", - "hide": false, - "instant": true, - "legendFormat": "High Watermark", - "range": false, - "refId": "D" - } - ], - "title": "CPU", - "type": "stat" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "description": "", - "fieldConfig": { - "defaults": { - "color": { - "mode": "thresholds" - }, - "mappings": [], - "thresholds": { - "mode": "absolute", - "steps": [ - { - "color": "text", - "value": null - } - ] - }, - "unit": "bytes" - }, - "overrides": [] - }, - "gridPos": { - "h": 5, - "w": 4, - "x": 20, - "y": 14 - }, - "id": 21, - "options": { - "colorMode": "none", - "graphMode": "area", - "justifyMode": "center", - "orientation": "vertical", - "reduceOptions": { - "calcs": [ - "lastNotNull" - ], - "fields": "", - "values": false - }, - "showPercentChange": false, - "text": { - "titleSize": 20, - "valueSize": 35 - }, - "textMode": "value_and_name", - "wideLayout": true - }, - "pluginVersion": "10.4.0", - "targets": [ - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "exemplar": false, - "expr": "sum(kube_pod_container_resource_requests{namespace=\"coder-observability\", resource=\"memory\"})", - "hide": false, - "instant": true, - "legendFormat": "Requested", - "range": false, - "refId": "B" - }, - { - "datasource": { - "type": "prometheus", - "uid": "prometheus" - }, - "editorMode": "code", - "exemplar": false, - "expr": "sum(\n max_over_time(container_memory_working_set_bytes{namespace=\"coder-observability\"}[$__range])\n)", - "instant": true, - "legendFormat": "High Watermark", - "range": false, - "refId": "A" - } - ], - "title": "RAM", + "title": "Promtail", "type": "stat" } ], diff --git a/deploy/observability/loki-datasource.yaml b/deploy/observability/loki-datasource.yaml new file mode 100644 index 0000000..e8ab84e --- /dev/null +++ b/deploy/observability/loki-datasource.yaml @@ -0,0 +1,35 @@ +# Grafana datasource for the in-cluster Loki (deploy/observability/loki.yaml). +# +# The kube-prometheus-stack Grafana runs the kiwigrid sidecar with +# `sidecar.datasources.enabled: true`, which provisions any ConfigMap labelled +# `grafana_datasource: "1"` (from any namespace) as a Grafana datasource. So +# adding this ConfigMap is enough; no Helm upgrade is required. +# +# The uid is pinned to EXACTLY "loki" because the generated Coder dashboards +# (deploy/observability/dashboards-coder.yaml) reference datasource uid "loki" +# on their log panels (workspace-detail "Logs", and panels in provisionerd and +# workspaces). Creating this datasource with that uid wires those panels to the +# live log store. isDefault stays false so Prometheus remains the default. +apiVersion: v1 +kind: ConfigMap +metadata: + name: loki-datasource + namespace: monitoring + labels: + grafana_datasource: "1" + app.kubernetes.io/name: loki + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: grafana-datasource +data: + loki-datasource.yaml: |- + apiVersion: 1 + datasources: + - name: Loki + type: loki + uid: loki + access: proxy + url: http://loki.monitoring.svc:3100 + isDefault: false + editable: false + jsonData: + maxLines: 1000 diff --git a/deploy/observability/loki.yaml b/deploy/observability/loki.yaml new file mode 100644 index 0000000..7e673ef --- /dev/null +++ b/deploy/observability/loki.yaml @@ -0,0 +1,239 @@ +# Grafana Loki, single-binary (monolithic) mode, for the GovCloud demo. +# +# Scope: the in-cluster log store for the `monitoring` observability stack. +# Promtail (deploy/observability/promtail.yaml) pushes node-level pod logs here, +# and Grafana queries it through the `loki` datasource +# (deploy/observability/loki-datasource.yaml). +# +# Design choices for GovCloud + a lean demo: +# - The image is pulled from the private ECR mirror (no pull-through cache in +# GovCloud). Tag matches scripts/images.txt (grafana/loki:3.5.9). +# - Single process running every Loki target (default `-target=all`), an +# in-memory ring with replication_factor 1, and filesystem object storage on +# a modest gp3 PVC. No object store, no microservices, no cache services. +# - tsdb shipper with schema v13, ~168h (7d) retention enforced by the +# compactor. auth_enabled is false (single tenant, in-cluster only). +# - The Deployment uses the Recreate strategy because the chunk/index data +# lives on a single ReadWriteOnce volume that only one pod may attach. +# - A ServiceMonitor scrapes Loki's own /metrics so the status dashboard's +# "Loki" panel reflects a real, scraped target. +apiVersion: v1 +kind: ConfigMap +metadata: + name: loki-config + namespace: monitoring + labels: + app.kubernetes.io/name: loki + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +data: + config.yaml: |- + auth_enabled: false + + server: + http_listen_port: 3100 + grpc_listen_port: 9095 + log_level: info + # Allow larger log lines and query payloads than the conservative + # defaults; harmless for a low-volume demo. + grpc_server_max_recv_msg_size: 16777216 + grpc_server_max_send_msg_size: 16777216 + + common: + instance_addr: 127.0.0.1 + path_prefix: /loki + replication_factor: 1 + ring: + kvstore: + store: inmemory + storage: + filesystem: + chunks_directory: /loki/chunks + rules_directory: /loki/rules + + schema_config: + configs: + - from: 2024-01-01 + store: tsdb + object_store: filesystem + schema: v13 + index: + prefix: index_ + period: 24h + + storage_config: + tsdb_shipper: + active_index_directory: /loki/tsdb-index + cache_location: /loki/tsdb-cache + filesystem: + directory: /loki/chunks + + compactor: + working_directory: /loki/compactor + retention_enabled: true + delete_request_store: filesystem + + limits_config: + retention_period: 168h + reject_old_samples: true + reject_old_samples_max_age: 168h + volume_enabled: true + max_query_series: 10000 + + query_range: + results_cache: + cache: + embedded_cache: + enabled: true + max_size_mb: 100 + + ruler: + storage: + type: local + local: + directory: /loki/rules + + # Air-gapped: never phone home with usage analytics. + analytics: + reporting_enabled: false +--- +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: loki-data + namespace: monitoring + labels: + app.kubernetes.io/name: loki + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +spec: + accessModes: + - ReadWriteOnce + storageClassName: gp3 + resources: + requests: + storage: 10Gi +--- +apiVersion: v1 +kind: Service +metadata: + name: loki + namespace: monitoring + labels: + app.kubernetes.io/name: loki + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +spec: + type: ClusterIP + selector: + app.kubernetes.io/name: loki + ports: + - name: http + port: 3100 + targetPort: http + protocol: TCP + - name: grpc + port: 9095 + targetPort: grpc + protocol: TCP +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: loki + namespace: monitoring + labels: + app.kubernetes.io/name: loki + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +spec: + replicas: 1 + strategy: + type: Recreate + selector: + matchLabels: + app.kubernetes.io/name: loki + template: + metadata: + labels: + app.kubernetes.io/name: loki + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging + annotations: + # Roll the pod when the rendered config changes. + checksum/config: loki-config-v1 + spec: + securityContext: + runAsNonRoot: true + runAsUser: 10001 + runAsGroup: 10001 + fsGroup: 10001 + containers: + - name: loki + image: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/grafana/loki:3.5.9 + imagePullPolicy: IfNotPresent + args: + - -config.file=/etc/loki/config.yaml + ports: + - name: http + containerPort: 3100 + protocol: TCP + - name: grpc + containerPort: 9095 + protocol: TCP + readinessProbe: + httpGet: + path: /ready + port: http + initialDelaySeconds: 30 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 6 + livenessProbe: + httpGet: + path: /ready + port: http + initialDelaySeconds: 45 + periodSeconds: 30 + timeoutSeconds: 5 + failureThreshold: 6 + resources: + requests: + cpu: 100m + memory: 256Mi + limits: + memory: 512Mi + volumeMounts: + - name: config + mountPath: /etc/loki + - name: data + mountPath: /loki + volumes: + - name: config + configMap: + name: loki-config + - name: data + persistentVolumeClaim: + claimName: loki-data +--- +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: loki + namespace: monitoring + labels: + app.kubernetes.io/name: loki + app.kubernetes.io/component: logging + release: kps +spec: + namespaceSelector: + matchNames: + - monitoring + selector: + matchLabels: + app.kubernetes.io/name: loki + endpoints: + - port: http + path: /metrics + interval: 30s + scheme: http diff --git a/deploy/observability/promtail.yaml b/deploy/observability/promtail.yaml new file mode 100644 index 0000000..be58b84 --- /dev/null +++ b/deploy/observability/promtail.yaml @@ -0,0 +1,271 @@ +# Promtail node-level log collector for the GovCloud demo. +# +# Scope: a DaemonSet that tails every pod's container logs from /var/log/pods on +# each node, attaches Kubernetes metadata (namespace, pod, container, app, node), +# and pushes them to the in-cluster Loki (deploy/observability/loki.yaml). This +# covers all workload namespaces, including coder, coder-workspaces, gitlab, +# keycloak, monitoring, and external-secrets. +# +# Design choices for GovCloud + a lean demo: +# - The image is pulled from the private ECR mirror (no pull-through cache in +# GovCloud). Tag matches scripts/images.txt (grafana/promtail:3.5.9). +# - Promtail discovers pods with the Kubernetes SD `pod` role and reads the +# real log files under /var/log/pods (containerd on EKS), so no container +# runtime socket is needed. It runs as root because those files are +# root-owned, with read-only host mounts. +# - A headless Service plus ServiceMonitor scrape Promtail's own /metrics so +# the status dashboard's "Promtail" panel reflects real, scraped targets. +apiVersion: v1 +kind: ServiceAccount +metadata: + name: promtail + namespace: monitoring + labels: + app.kubernetes.io/name: promtail + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: promtail + labels: + app.kubernetes.io/name: promtail + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +rules: + - apiGroups: [""] + resources: + - nodes + - nodes/proxy + - services + - endpoints + - pods + verbs: ["get", "list", "watch"] +--- +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: promtail + labels: + app.kubernetes.io/name: promtail + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +roleRef: + apiGroup: rbac.authorization.k8s.io + kind: ClusterRole + name: promtail +subjects: + - kind: ServiceAccount + name: promtail + namespace: monitoring +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: promtail-config + namespace: monitoring + labels: + app.kubernetes.io/name: promtail + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +data: + config.yaml: |- + server: + http_listen_port: 3101 + grpc_listen_port: 0 + log_level: info + + positions: + filename: /run/promtail/positions.yaml + + clients: + - url: http://loki.monitoring.svc:3100/loki/api/v1/push + + scrape_configs: + # Tail every pod container log on this node and label it with Kubernetes + # metadata. No namespace filter, so all namespaces are captured. + - job_name: kubernetes-pods + pipeline_stages: + - cri: {} + kubernetes_sd_configs: + - role: pod + relabel_configs: + - source_labels: + - __meta_kubernetes_pod_controller_name + regex: ([0-9a-z-.]+?)(-[0-9a-f]{8,10})? + action: replace + target_label: __tmp_controller_name + - source_labels: + - __meta_kubernetes_pod_label_app_kubernetes_io_name + - __meta_kubernetes_pod_label_app + - __tmp_controller_name + - __meta_kubernetes_pod_name + regex: ^;*([^;]+)(;.*)?$ + action: replace + target_label: app + - source_labels: + - __meta_kubernetes_pod_node_name + action: replace + target_label: node_name + - source_labels: + - __meta_kubernetes_namespace + action: replace + target_label: namespace + - source_labels: + - __meta_kubernetes_pod_name + action: replace + target_label: pod + - source_labels: + - __meta_kubernetes_pod_container_name + action: replace + target_label: container + # Only tail running pods that actually have a log path on this node. + - source_labels: + - __meta_kubernetes_pod_uid + - __meta_kubernetes_pod_container_name + separator: / + action: replace + replacement: /var/log/pods/*$1/*.log + target_label: __path__ + # Static pods (e.g. addons) expose a config-hash annotation instead of + # a stable uid; map those to their on-disk path as well. + - source_labels: + - __meta_kubernetes_pod_annotationpresent_kubernetes_io_config_hash + - __meta_kubernetes_pod_container_name + separator: / + regex: true/(.*) + action: replace + replacement: /var/log/pods/*$1/*.log + target_label: __path__ +--- +apiVersion: v1 +kind: Service +metadata: + name: promtail + namespace: monitoring + labels: + app.kubernetes.io/name: promtail + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +spec: + type: ClusterIP + clusterIP: None + selector: + app.kubernetes.io/name: promtail + ports: + - name: http-metrics + port: 3101 + targetPort: http-metrics + protocol: TCP +--- +apiVersion: apps/v1 +kind: DaemonSet +metadata: + name: promtail + namespace: monitoring + labels: + app.kubernetes.io/name: promtail + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging +spec: + selector: + matchLabels: + app.kubernetes.io/name: promtail + updateStrategy: + type: RollingUpdate + template: + metadata: + labels: + app.kubernetes.io/name: promtail + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: logging + annotations: + checksum/config: promtail-config-v1 + spec: + serviceAccountName: promtail + securityContext: + runAsUser: 0 + runAsGroup: 0 + containers: + - name: promtail + image: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/grafana/promtail:3.5.9 + imagePullPolicy: IfNotPresent + args: + - -config.file=/etc/promtail/config.yaml + env: + - name: HOSTNAME + valueFrom: + fieldRef: + fieldPath: spec.nodeName + ports: + - name: http-metrics + containerPort: 3101 + protocol: TCP + readinessProbe: + httpGet: + path: /ready + port: http-metrics + initialDelaySeconds: 10 + periodSeconds: 10 + timeoutSeconds: 5 + failureThreshold: 6 + resources: + requests: + cpu: 50m + memory: 128Mi + limits: + memory: 256Mi + securityContext: + readOnlyRootFilesystem: true + allowPrivilegeEscalation: false + capabilities: + drop: + - ALL + volumeMounts: + - name: config + mountPath: /etc/promtail + - name: positions + mountPath: /run/promtail + - name: varlog + mountPath: /var/log + readOnly: true + tolerations: + - effect: NoSchedule + operator: Exists + - effect: NoExecute + operator: Exists + volumes: + - name: config + configMap: + name: promtail-config + - name: positions + hostPath: + path: /var/lib/promtail + type: DirectoryOrCreate + - name: varlog + hostPath: + path: /var/log + type: Directory +--- +apiVersion: monitoring.coreos.com/v1 +kind: ServiceMonitor +metadata: + name: promtail + namespace: monitoring + labels: + app.kubernetes.io/name: promtail + app.kubernetes.io/component: logging + release: kps +spec: + namespaceSelector: + matchNames: + - monitoring + selector: + matchLabels: + app.kubernetes.io/name: promtail + endpoints: + - port: http-metrics + path: /metrics + interval: 30s + scheme: http diff --git a/docs/as-built/55-observability.md b/docs/as-built/55-observability.md index c267a33..c94da34 100644 --- a/docs/as-built/55-observability.md +++ b/docs/as-built/55-observability.md @@ -1,13 +1,15 @@ # 55. Observability (as-built) -In-boundary, in-cluster metrics and dashboards for the GovCloud demo: the +In-boundary, in-cluster metrics, logs, and dashboards for the GovCloud demo: the `prometheus-community/kube-prometheus-stack` Helm release `kps` (Prometheus + Grafana + the Prometheus operator) in the `monitoring` namespace, scraping the Coder control plane's Prometheus metrics and rendering Coder's prebuilt Grafana -dashboards with live data at `https://grafana.usgov.coderdemo.io`. Grafana signs -in through the same Keycloak realm (`coder`) as the rest of the stack, so the -demo is one SSO. Coder audit logging is entitled and on; structured JSON server -logs make it SIEM-ready. +dashboards with live data at `https://grafana.usgov.coderdemo.io`. Alongside it, +hand-rolled manifests add an in-cluster Grafana Loki and a Promtail DaemonSet +that collects pod logs from every node, queried in Grafana through a `loki` +datasource. Grafana signs in through the same Keycloak realm (`coder`) as the +rest of the stack, so the demo is one SSO. Coder audit logging is entitled and +on; structured JSON server logs make it SIEM-ready and are shipped to Loki. This is the reliable in-cluster implementation. The AWS-native managed variant (Amazon Managed Prometheus / Grafana, Security Lake) is planned separately and @@ -116,10 +118,11 @@ has series (cAdvisor). `sum by(pod) (rate(coderd_api_requests_processed_total{...}[5m]))` returns a series, and `up{job="coder-metrics"}` returns `1`. So the main dashboard renders live data end to end (Grafana to Prometheus to coderd). -- The purely log-based `agent-boundaries` dashboard is omitted, and a few log - panels inside the workspaces / provisionerd / workspace-detail dashboards show - no data, because this stack ships metrics only (no Loki). Their Prometheus - panels render live. +- The purely log-based `agent-boundaries` dashboard is omitted. The log panels in + the workspaces / provisionerd / workspace-detail dashboards target datasource + uid `loki`, which the in-cluster Loki + Promtail stack backs through + `loki-datasource.yaml` (see "Logging" below), so they resolve and query live + log data instead of erroring. ### Single sign-on (Keycloak OIDC) @@ -170,6 +173,94 @@ Verified live: ExternalSecret `grafana-admin` is `Ready=True` reason `SecretSynced`; the Secret carries keys `admin-user` and `admin-password`; and logging in to the public Grafana with that password succeeds. +## Logging (Loki + Promtail) + +Hand-rolled manifests in `deploy/observability/` add an in-cluster log store and +collector next to the metrics stack. They are plain Kubernetes objects (not part +of the `kps` Helm release) and use ECR-mirrored images. + +| Component | Live object | Storage | +|---|---|---| +| Loki | Deployment `loki` (single binary, `-target=all`), Service `loki:3100` | 10Gi gp3 PVC `loki-data` | +| Promtail | DaemonSet `promtail` (one pod per node) | host `/var/log` read-only + host `/var/lib/promtail` positions | + +Images (mirrored via `scripts/images.txt` + `scripts/mirror-images.sh`): + +- `docker-hub/grafana/loki:3.5.9` +- `docker-hub/grafana/promtail:3.5.9` + +Loki (`loki.yaml`) runs monolithic: `auth_enabled: false`, an in-memory ring with +`replication_factor: 1`, filesystem object storage, and a tsdb shipper with +schema `v13`. The compactor enforces ~168h (7d) retention. Everything lives under +`/loki` on the PVC. The container runs as the image's nonroot user (uid 10001), +and the Deployment uses the `Recreate` strategy because the data sits on a single +ReadWriteOnce volume. + +Promtail (`promtail.yaml`) runs as a DaemonSet under a ServiceAccount with a +ClusterRole granting read access to pods/nodes/services/endpoints for Kubernetes +service discovery. It tails the real container log files under `/var/log/pods` +(containerd on EKS) with the `pod` SD role, attaches `namespace`, `pod`, +`container`, `app`, and `node_name` labels, and pushes to +`http://loki.monitoring.svc:3100/loki/api/v1/push`. There is no namespace filter, +so every workload namespace is captured. Verified live: Loki's +`/loki/api/v1/labels` returns `app`, `container`, `namespace`, `node_name`, +`pod`, ...; `/loki/api/v1/label/namespace/values` lists `coder`, +`coder-workspaces`, `gitlab`, `keycloak`, `monitoring`, and `external-secrets` +(plus `ingress-nginx` and `kube-system`); and a `{namespace="coder"}` +`query_range` returns coderd JSON log lines (including `msg:"audit_log"`). + +### Grafana Loki datasource (how the log panels are powered) + +The kube-prometheus-stack Grafana runs the kiwigrid sidecar with +`sidecar.datasources.enabled: true`, which provisions any ConfigMap labelled +`grafana_datasource: "1"` as a datasource. `loki-datasource.yaml` is that +ConfigMap: it defines a Loki datasource with `access: proxy`, URL +`http://loki.monitoring.svc:3100`, `isDefault: false`, and uid EXACTLY `loki`. No +Helm upgrade is needed. + +That uid is a contract: the generated Coder dashboards (`dashboards-coder.yaml`) +reference datasource uid `loki` on their log panels, the workspace-detail "Logs" +panel and the provisionerd / workspaces "Logs" panels. Before this datasource +existed those panels errored ("datasource loki not found"); creating it with the +matching uid resolves them. Verified live: `GET /api/datasources` lists `Loki` +(type `loki`, uid `loki`, default false); a labels call and a `{namespace="coder"}` +`query_range` through the Grafana datasource proxy +(`/api/datasources/proxy/uid/loki/...`) both return `success` with log lines; and +`POST /api/ds/query` for the workspace-detail `{namespace="coder-workspaces"}` +query returns HTTP 200 with log frames. + +The workspaces / provisionerd "Logs" panels additionally filter on a `logger` +label that Promtail does not emit, so they resolve but are legitimately empty; +the workspace-detail panel that matches `coder-workspaces` pods returns live +workspace logs. + +### Prometheus scraping of Loki and Promtail + +`loki.yaml` and `promtail.yaml` each ship a `ServiceMonitor` (selected because +`serviceMonitorSelectorNilUsesHelmValues: false`), so Prometheus scrapes their +`/metrics`. Verified live: `up{job="loki"}` is `1` and `min(up{job="promtail"})` +is `1` (one target per node). These drive the `coder-status` dashboard's Loki and +Promtail panels below. + +## coder-status dashboard adaptation + +The `coder-status` dashboard (`coder-dashboard-status` in `dashboards-coder.yaml`, +uid `coder-status`) shipped an "Observability Tools" row copied from the upstream +coder/observability LGTM reference. That row probed components this demo does not +run, so most tiles were permanently red or empty. It was rebuilt to reflect this +stack, and two unrelated broken panels were repointed at metrics that exist here: + +| Panel | Before | After | +|---|---|---| +| Observability Tools row | Loki Write/Read/Backend/Canary, Grafana Agent, Prometheus/Loki/Grafana-Agent config reload, Prometheus Storage, CPU, RAM | Three `up` stat panels: Prometheus (`up{job="kps-kube-prometheus-stack-prometheus"}`), Loki (`up{job="loki"}`), Promtail (`up{job="promtail"}`) | +| Workspace Builds | `coderd_provisionerd_job_timings_seconds_count` (no series here, so "No data") | `sum by (status) (coderd_workspace_latest_build_status)` | +| Postgres | `pg_up` (no postgres_exporter, so "Down") | `(sum(rate(coderd_db_tx_duration_seconds_count[5m])) > bool 0) or vector(0)` | + +Verified live through Grafana `POST /api/ds/query`: all five changed panels return +HTTP 200; the three `up` panels and Postgres each evaluate to `1`, and Workspace +Builds returns `1` for `status="succeeded"`. The header comment in +`dashboards-coder.yaml` documents this adaptation. + ## Ingress (HTTPS) `deploy/observability/grafana-ingress.yaml` follows the platform pattern @@ -211,8 +302,12 @@ forever). ## Notes and known gaps -- Metrics only: no Loki/logs datasource, so log-based panels and the - `agent-boundaries` dashboard are inactive by design. +- In-cluster logs are present: a single-binary Loki plus a Promtail DaemonSet + back the `loki` datasource, so the log-based panels resolve. The + `agent-boundaries` dashboard is still omitted (it is purely log-based and was + not part of the dashboard set shipped here). The workspaces / provisionerd + "Logs" panels filter on a `logger` label Promtail does not emit, so they are + legitimately empty while error-free. - kube-state-metrics is disabled, so the dashboards' pod resource limit/request and restart/terminated-reason panels (which depend on `kube_pod_*`) stay empty; container CPU/memory usage panels (cAdvisor via the kubelet) render. diff --git a/scripts/images.txt b/scripts/images.txt index 8672dd4..3c35d0a 100644 --- a/scripts/images.txt +++ b/scripts/images.txt @@ -33,3 +33,12 @@ quay.io/prometheus-operator/prometheus-operator:v0.91.0 quay.io/prometheus-operator/prometheus-config-reloader:v0.91.0 docker.io/grafana/grafana:13.0.1-security-01 quay.io/kiwigrid/k8s-sidecar:2.7.3 + +# --- Logging stack (deploy/observability) --- +# In-cluster logs: single-binary Grafana Loki plus a node-level Promtail +# collector DaemonSet. Both pinned to the 3.5 release line (released in +# lockstep). Loki persists on a filesystem PVC; Promtail tails /var/log/pods +# and pushes to Loki. Referenced from manifests as the ECR mirror path +# docker-hub/grafana/loki and docker-hub/grafana/promtail. +docker.io/grafana/loki:3.5.9 +docker.io/grafana/promtail:3.5.9 From a24fba5167dbbcf26f15488c929ca5a387edd182 Mon Sep 17 00:00:00 2001 From: Austen Bruhn Date: Sun, 7 Jun 2026 21:41:39 +0000 Subject: [PATCH 16/16] feat: add merged AI Governance Grafana dashboard Add a single Grafana dashboard (uid ai-governance, title "AI Governance") covering both the AI Gateway (AI Bridge) and the Agent Firewall (Boundary), replacing the two missing add-on dashboards. New ConfigMap deploy/observability/dashboards-ai-governance.yaml (ns monitoring, label grafana_dashboard: "1") so it never conflicts with dashboards-coder.yaml. AI Gateway panels read coder_aibridged_* (configured providers, reload health, provider inventory) and stream AI Bridge logs from Loki. Agent Firewall panels read agent_boundary_log_proxy_batches_forwarded_total and stream Boundary logs from Loki. Every panel targets datasource uid prometheus or uid loki. All ten query panels verified HTTP 200 through Grafana /api/ds/query; usage panels read 0 or stay sparse until live AI traffic occurs (placeholder Anthropic key). Documents the dashboard in docs/as-built/55-observability.md and STATUS.md. Generated by Coder Agents. --- STATUS.md | 10 + .../dashboards-ai-governance.yaml | 745 ++++++++++++++++++ docs/as-built/55-observability.md | 35 + 3 files changed, 790 insertions(+) create mode 100644 deploy/observability/dashboards-ai-governance.yaml diff --git a/STATUS.md b/STATUS.md index 6328d01..dcf566b 100644 --- a/STATUS.md +++ b/STATUS.md @@ -215,6 +215,16 @@ gated; Nova Pro is the proven fallback. Promtail `up` panels; "Workspace Builds" repointed to `coderd_workspace_latest_build_status` and "Postgres" to a real `coderd_db_tx_duration_seconds` signal (no postgres_exporter runs). +- [x] **Merged AI Governance dashboard** (`deploy/observability/dashboards-ai-governance.yaml`, + uid `ai-governance`, ns `monitoring`) covers the AI Gateway (AI Bridge) and + the Agent Firewall (Boundary) in one view, replacing the two add-on + dashboards. AI Gateway panels use `coder_aibridged_*` (configured providers, + reload health, provider inventory) plus AI Bridge Loki logs + (`{namespace="coder"} |~ "aibridged"`); Agent Firewall panels use + `agent_boundary_log_proxy_batches_forwarded_total` plus Boundary Loki logs + (`{namespace="coder-workspaces"} |= "boundary"`). All ten query panels + verified HTTP 200 via Grafana `/api/ds/query`; usage panels read `0` until + live AI traffic occurs (placeholder Anthropic key). - [x] **Grafana Keycloak SSO (one SSO)**: Grafana signs in via the same realm (`coder`) through a confidential OIDC client `grafana` (`scripts/setup-grafana-oidc.py`, PKCE; secret in ASM diff --git a/deploy/observability/dashboards-ai-governance.yaml b/deploy/observability/dashboards-ai-governance.yaml new file mode 100644 index 0000000..234a8fd --- /dev/null +++ b/deploy/observability/dashboards-ai-governance.yaml @@ -0,0 +1,745 @@ +# AI Governance Grafana dashboard (Coder AI Governance add-on). +# +# A single merged dashboard (uid ai-governance) that covers BOTH halves of the +# Coder AI Governance add-on: +# - AI Gateway, implemented in Coder as AI Bridge (aibridged). Panels read the +# coder_aibridged_* Prometheus series (configured providers and provider +# reload health) and stream AI Bridge logs from Loki (namespace coder). +# - Agent Firewall, implemented in Coder as Boundary. Panels read the +# agent_boundary_log_proxy_batches_forwarded_total Prometheus counter and +# stream Boundary logs from Loki (namespace coder-workspaces). +# +# The Grafana sidecar imports any ConfigMap labelled grafana_dashboard: "1", so +# this lives in the monitoring namespace next to Grafana. Every panel points at +# datasource uid prometheus or uid loki, both provisioned by the +# kube-prometheus-stack Grafana (loki via deploy/observability/loki-datasource.yaml). +# +# The live Anthropic key is a placeholder, so real AI calls fail and AI/agent +# traffic is minimal. Usage panels therefore read 0 or stay sparse; that is +# expected. Every panel still targets a real series or log query and returns a +# clean result rather than a datasource error. +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: coder-dashboard-ai-governance + namespace: monitoring + labels: + grafana_dashboard: "1" + app.kubernetes.io/part-of: usgov-coderdemo + app.kubernetes.io/component: grafana-dashboard +data: + ai-governance.json: |- + { + "uid": "ai-governance", + "title": "AI Governance", + "description": "Coder AI Governance add-on: merged view of the AI Gateway (AI Bridge) and the Agent Firewall (Boundary).", + "tags": [ + "coder", + "ai-governance" + ], + "style": "dark", + "timezone": "", + "editable": true, + "graphTooltip": 0, + "fiscalYearStartMonth": 0, + "weekStart": "", + "schemaVersion": 39, + "version": 1, + "refresh": "1m", + "time": { + "from": "now-6h", + "to": "now" + }, + "timepicker": {}, + "annotations": { + "list": [] + }, + "templating": { + "list": [] + }, + "links": [], + "panels": [ + { + "id": 1, + "type": "text", + "title": "AI Governance (AI Gateway + Agent Firewall)", + "datasource": null, + "gridPos": { + "h": 4, + "w": 24, + "x": 0, + "y": 0 + }, + "options": { + "mode": "markdown", + "content": "This is the Coder AI Governance view, covering the AI Gateway (AI Bridge) and the Agent Firewall (Boundary). AI Gateway panels show configured providers and reload health from Prometheus, plus the AI Bridge log stream from Loki. Agent Firewall panels show forwarded Boundary batches from Prometheus, plus the Boundary log stream from Loki. Traffic dependent panels populate as AI and agent activity occurs, so they can legitimately read 0 or stay sparse until live usage happens." + } + }, + { + "id": 2, + "type": "row", + "title": "AI Gateway (AI Bridge)", + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 4 + }, + "panels": [] + }, + { + "id": 3, + "type": "stat", + "title": "Configured Providers", + "description": "Number of AI providers configured in the AI Gateway (AI Bridge), from coder_aibridged_provider_info. Expected to be 2 in this demo (anthropic and anthropic-bedrock).", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "gridPos": { + "h": 6, + "w": 5, + "x": 0, + "y": 5 + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "count(coder_aibridged_provider_info) or vector(0)", + "refId": "A", + "instant": true, + "range": false, + "format": "time_series" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "decimals": 0, + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "red", + "value": null + }, + { + "color": "green", + "value": 1 + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "textMode": "auto", + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto" + } + }, + { + "id": 4, + "type": "stat", + "title": "Provider Reload Status", + "description": "Whether the AI Bridge most recent provider config reload succeeded. Healthy means the last reload attempt matched the last successful reload (no failed reload in between). Compares coder_aibridged_providers_last_reload_timestamp_seconds with the success variant.", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "gridPos": { + "h": 6, + "w": 5, + "x": 5, + "y": 5 + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "coder_aibridged_providers_last_reload_timestamp_seconds == bool coder_aibridged_providers_last_reload_success_timestamp_seconds", + "refId": "A", + "instant": true, + "range": false, + "format": "time_series" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "decimals": 0, + "color": { + "mode": "thresholds" + }, + "mappings": [ + { + "type": "value", + "options": { + "1": { + "text": "Healthy", + "color": "green", + "index": 0 + }, + "0": { + "text": "Reload Failed", + "color": "red", + "index": 1 + } + } + } + ], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "textMode": "value", + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto" + } + }, + { + "id": 5, + "type": "stat", + "title": "Last Successful Reload", + "description": "Time of the AI Bridge last successful provider reload, shown relative to now. It updates only when providers reload (startup or config change), so an older value reflects a stable provider configuration, not an error.", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "gridPos": { + "h": 6, + "w": 5, + "x": 10, + "y": 5 + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "coder_aibridged_providers_last_reload_success_timestamp_seconds * 1000", + "refId": "A", + "instant": true, + "range": false, + "format": "time_series" + } + ], + "fieldConfig": { + "defaults": { + "unit": "dateTimeFromNow", + "decimals": 0, + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "textMode": "value", + "colorMode": "none", + "graphMode": "none", + "justifyMode": "auto" + } + }, + { + "id": 6, + "type": "table", + "title": "Provider Inventory", + "description": "Inventory of AI Gateway providers from coder_aibridged_provider_info: provider name, type, and enabled status.", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "gridPos": { + "h": 6, + "w": 9, + "x": 15, + "y": 5 + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "coder_aibridged_provider_info", + "refId": "A", + "instant": true, + "range": false, + "format": "table" + } + ], + "transformations": [ + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true, + "__name__": true, + "Value": true, + "container": true, + "endpoint": true, + "instance": true, + "job": true, + "namespace": true, + "pod": true, + "service": true + }, + "renameByName": { + "provider_name": "Provider", + "provider_type": "Type", + "status": "Status" + }, + "indexByName": { + "provider_name": 0, + "provider_type": 1, + "status": 2 + } + } + } + ], + "fieldConfig": { + "defaults": { + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": true + } + }, + "overrides": [] + }, + "options": { + "showHeader": true, + "cellHeight": "sm" + } + }, + { + "id": 7, + "type": "logs", + "title": "AI Bridge Log Stream", + "description": "Live AI Bridge (AI Gateway) log stream from Loki, namespace coder. Volume depends on AI usage; expect lifecycle lines such as provider pool reloads when AI traffic is low.", + "datasource": { + "type": "loki", + "uid": "loki" + }, + "gridPos": { + "h": 9, + "w": 16, + "x": 0, + "y": 11 + }, + "targets": [ + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "editorMode": "code", + "expr": "{namespace=\"coder\"} |~ \"aibridged\"", + "refId": "A", + "queryType": "range" + } + ], + "options": { + "showTime": true, + "showLabels": false, + "showCommonLabels": false, + "wrapLogMessage": true, + "prettifyLogMessage": false, + "enableLogDetails": true, + "dedupStrategy": "none", + "sortOrder": "Descending" + } + }, + { + "id": 8, + "type": "timeseries", + "title": "AI Bridge Log Event Rate", + "description": "Count of AI Bridge log events in 5m windows from Loki. Indicates AI Gateway activity over time; flat or absent when there is no AI traffic.", + "datasource": { + "type": "loki", + "uid": "loki" + }, + "gridPos": { + "h": 9, + "w": 8, + "x": 16, + "y": 11 + }, + "targets": [ + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "editorMode": "code", + "expr": "sum(count_over_time({namespace=\"coder\"} |~ \"aibridged\" [5m]))", + "refId": "A", + "queryType": "range", + "legendFormat": "aibridged events" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "drawStyle": "line", + "lineWidth": 1, + "fillOpacity": 10, + "showPoints": "auto", + "spanNulls": false, + "axisPlacement": "auto" + }, + "color": { + "mode": "palette-classic" + }, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + } + }, + { + "id": 9, + "type": "row", + "title": "Agent Firewall (Boundary)", + "collapsed": false, + "gridPos": { + "h": 1, + "w": 24, + "x": 0, + "y": 20 + }, + "panels": [] + }, + { + "id": 10, + "type": "stat", + "title": "Total Boundary Batches Forwarded", + "description": "Total Boundary (Agent Firewall) log proxy batches forwarded, summed across workspaces, from agent_boundary_log_proxy_batches_forwarded_total. Increments as agents forward boundary egress logs; reads 0 until agent activity occurs.", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "gridPos": { + "h": 6, + "w": 6, + "x": 0, + "y": 21 + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum(agent_boundary_log_proxy_batches_forwarded_total) or vector(0)", + "refId": "A", + "instant": true, + "range": false, + "format": "time_series" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "decimals": 0, + "color": { + "mode": "thresholds" + }, + "mappings": [], + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [] + }, + "options": { + "reduceOptions": { + "calcs": [ + "lastNotNull" + ], + "fields": "", + "values": false + }, + "orientation": "auto", + "textMode": "auto", + "colorMode": "value", + "graphMode": "none", + "justifyMode": "auto" + } + }, + { + "id": 11, + "type": "timeseries", + "title": "Forwarded Boundary Batches by Workspace", + "description": "Per workspace rate of Boundary batches forwarded (5m), from agent_boundary_log_proxy_batches_forwarded_total. Each line is a workspace and user. Stays at 0 until the agent firewall forwards egress logs.", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "gridPos": { + "h": 6, + "w": 18, + "x": 6, + "y": 21 + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "sum by (workspace_name, username) (rate(agent_boundary_log_proxy_batches_forwarded_total[5m]))", + "refId": "A", + "instant": false, + "range": true, + "format": "time_series", + "legendFormat": "{{workspace_name}} / {{username}}" + } + ], + "fieldConfig": { + "defaults": { + "unit": "short", + "custom": { + "drawStyle": "line", + "lineWidth": 1, + "fillOpacity": 10, + "showPoints": "auto", + "spanNulls": false, + "axisPlacement": "auto" + }, + "color": { + "mode": "palette-classic" + }, + "thresholds": { + "mode": "absolute", + "steps": [ + { + "color": "green", + "value": null + } + ] + } + }, + "overrides": [] + }, + "options": { + "legend": { + "displayMode": "list", + "placement": "bottom", + "showLegend": true + }, + "tooltip": { + "mode": "multi", + "sort": "desc" + } + } + }, + { + "id": 12, + "type": "table", + "title": "Active Boundary Agents", + "description": "Boundary agents currently exporting metrics, grouped by workspace, user, template, and agent, with total batches forwarded.", + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "gridPos": { + "h": 7, + "w": 10, + "x": 0, + "y": 27 + }, + "targets": [ + { + "datasource": { + "type": "prometheus", + "uid": "prometheus" + }, + "editorMode": "code", + "expr": "agent_boundary_log_proxy_batches_forwarded_total", + "refId": "A", + "instant": true, + "range": false, + "format": "table" + } + ], + "transformations": [ + { + "id": "organize", + "options": { + "excludeByName": { + "Time": true, + "__name__": true, + "container": true, + "endpoint": true, + "instance": true, + "job": true, + "namespace": true, + "pod": true, + "service": true + }, + "renameByName": { + "workspace_name": "Workspace", + "username": "User", + "template_name": "Template", + "agent_name": "Agent", + "Value": "Batches" + }, + "indexByName": { + "workspace_name": 0, + "username": 1, + "template_name": 2, + "agent_name": 3, + "Value": 4 + } + } + } + ], + "fieldConfig": { + "defaults": { + "custom": { + "align": "auto", + "cellOptions": { + "type": "auto" + }, + "filterable": true + } + }, + "overrides": [] + }, + "options": { + "showHeader": true, + "cellHeight": "sm" + } + }, + { + "id": 13, + "type": "logs", + "title": "Boundary Log Stream", + "description": "Live Boundary (Agent Firewall) log stream from Loki, namespace coder-workspaces. Currently proxy lifecycle and debug lines; allow and deny egress events appear here as AI traffic grows.", + "datasource": { + "type": "loki", + "uid": "loki" + }, + "gridPos": { + "h": 7, + "w": 14, + "x": 10, + "y": 27 + }, + "targets": [ + { + "datasource": { + "type": "loki", + "uid": "loki" + }, + "editorMode": "code", + "expr": "{namespace=\"coder-workspaces\"} |= \"boundary\"", + "refId": "A", + "queryType": "range" + } + ], + "options": { + "showTime": true, + "showLabels": false, + "showCommonLabels": false, + "wrapLogMessage": true, + "prettifyLogMessage": false, + "enableLogDetails": true, + "dedupStrategy": "none", + "sortOrder": "Descending" + } + } + ] + } diff --git a/docs/as-built/55-observability.md b/docs/as-built/55-observability.md index c94da34..1680b71 100644 --- a/docs/as-built/55-observability.md +++ b/docs/as-built/55-observability.md @@ -261,6 +261,41 @@ HTTP 200; the three `up` panels and Postgres each evaluate to `1`, and Workspace Builds returns `1` for `status="succeeded"`. The header comment in `dashboards-coder.yaml` documents this adaptation. +## AI Governance dashboard + +`deploy/observability/dashboards-ai-governance.yaml` adds one merged dashboard +(ConfigMap `coder-dashboard-ai-governance`, ns `monitoring`, label +`grafana_dashboard: "1"`, uid `ai-governance`, title "AI Governance") for the +Coder AI Governance add-on. It replaces the two separate add-on dashboards with a +single view that covers both halves of the feature, and it lives in its own file +so it never conflicts with `dashboards-coder.yaml`. Every panel targets datasource +uid `prometheus` or uid `loki`. + +The "AI Gateway (AI Bridge)" row reads the `coder_aibridged_*` Prometheus series: +a Configured Providers stat (`count(coder_aibridged_provider_info)`, value `2`), +a Provider Reload Status stat +(`coder_aibridged_providers_last_reload_timestamp_seconds == bool` the success +variant, value `1` = Healthy), a Last Successful Reload stat +(`coder_aibridged_providers_last_reload_success_timestamp_seconds`), and a +Provider Inventory table (provider name, type, status). A logs panel streams +`{namespace="coder"} |~ "aibridged"` from Loki, with a companion log event rate +timeseries (`count_over_time` over 5m). + +The "Agent Firewall (Boundary)" row reads +`agent_boundary_log_proxy_batches_forwarded_total`: a total forwarded batches +stat, a per workspace forwarded rate timeseries +(`sum by (workspace_name, username) (rate(...[5m]))`), and an Active Boundary +Agents table (workspace, user, template, agent). A logs panel streams +`{namespace="coder-workspaces"} |= "boundary"` from Loki. + +AI Bridge exposes no token, request, or latency Prometheus metrics, and the live +Anthropic key is a placeholder, so AI and agent traffic is minimal. Usage panels +(Total Boundary Batches, the per workspace rate, and the log streams) therefore +read `0` or stay sparse; that is expected and documented in each panel +description. Verified live through Grafana `POST /api/ds/query`: all ten query +panels return HTTP 200 with real series or a clean `0`, and Grafana +`GET /api/dashboards/uid/ai-governance` returns the imported dashboard. + ## Ingress (HTTPS) `deploy/observability/grafana-ingress.yaml` follows the platform pattern