Adopt Istio service mesh for mTLS and rationalize ingress (NLB + nginx) #30

@ausbru87

Summary

Adopt Istio service mesh (sidecar mode) on the GovCloud EKS cluster, primarily
to get workload-to-workload mTLS, and use an Istio ingress gateway to
rationalize the current NLB plus ingress-nginx chain. The same gateway redesign
fixes the confirmed Keycloak Account Console failure, whose root cause is that
the L4 NLB terminates TLS and forwards plain HTTP, so X-Forwarded-Proto: http
reaches Keycloak and its cookies are issued without Secure/SameSite=None.

This issue is planning only. Nothing here is applied to live infrastructure.
Note that STATUS.md currently lists Istio under "Out of scope (demo)", so
adopting it is a deliberate scope change that should be approved before any
implementation work starts.

Motivation: confirmed Keycloak cookie root cause

The Keycloak Account Console fails with "Something went wrong / Server responded
with an invalid status." This has already been root-caused; do not re-debug it
by hand-editing nginx in the live cluster.

  • The internet-facing NLB terminates TLS with the ACM cert
    (arn:aws-us-gov:acm:us-gov-west-1:430737322961:certificate/7f4fc566-8efd-4aa5-b6ba-3b0c9a535d12)
    and forwards decrypted plain HTTP to ingress-nginx. Confirmed live: the NLB
    has a TLS listener on 443 bound to that cert and a TCP listener on 80; the
    controller Service maps both http and https target ports to the plain
    http container port.
  • ingress-nginx therefore sets X-Forwarded-Proto: http. Keycloak detects a
    non-secure context (the logs show "Non-secure context detected; cookies are
    not secured") and issues AUTH_SESSION_ID and KC_RESTART without Secure and
    without SameSite=None. Verified with curl -D -.
  • The Account Console performs a silent single sign-on through a hidden iframe.
    Cookies sent in a third-party iframe context must carry both SameSite=None
    and Secure, so the browser drops the non-secure cookies and the Account
    Console API call returns an invalid status.
  • Coder, GitLab, and Grafana are unaffected because they build cookie security
    from their hardcoded https URLs. Keycloak ties the cookie Secure flag to the
    actual request scheme it sees.
  • deploy/keycloak/deployment.yaml already sets KC_PROXY_HEADERS=xforwarded,
    KC_HOSTNAME=https://auth.usgov.coderdemo.io, and KC_HTTP_ENABLED=true.
    Those fix issuer and redirect URL generation, but they do not change the
    cookie Secure decision, which still follows the request scheme. The correct
    fix is to make the proxy chain present X-Forwarded-Proto: https to Keycloak,
    which an Istio ingress gateway (or any proper L7 proxy layer) can do
    deterministically.

Current architecture (as-built, verified live)

  • Chain: client to NLB over HTTPS (ACM TLS terminated at L4), NLB to
    ingress-nginx over plain HTTP, ingress-nginx to app pods over plain HTTP.
  • One internet-facing NLB provisioned by the AWS Load Balancer Controller from
    the ingress-nginx controller Service (deploy/platform/ingress-nginx-values.yaml,
    chart 4.15.1, controller v1.15.1, 2 replicas).
  • Hostnames on one ACM cert covering usgov.coderdemo.io and
    *.usgov.coderdemo.io:
    • dev.usgov.coderdemo.io (Coder dashboard) and *.usgov.coderdemo.io
      (Coder workspace apps, wildcard subdomain apps, websockets, large bodies).
    • auth.usgov.coderdemo.io (Keycloak, ClusterIP Service on 8080).
    • gitlab.usgov.coderdemo.io (GitLab).
    • grafana.usgov.coderdemo.io (Grafana).
  • Every app exposes an Ingress with ingressClassName: nginx, ssl-redirect
    disabled (no X-Forwarded-Proto-driven redirect because the L4 NLB does not
    add it), and websocket and large-body tuning for Coder and GitLab.
  • No Istio is installed today (istio-system namespace does not exist).

Goals and non-goals

Goals:

  • Mesh-wide mTLS in STRICT mode for in-cluster service-to-service traffic.
  • A single, well-defined L7 ingress that presents X-Forwarded-Proto: https to
    backends, fixing the Keycloak cookie problem at the architecture level.
  • Reduce the NLB plus nginx redundancy where it is safe to do so.
  • Keep the migration non-disruptive for the four live hostnames.

Non-goals:

  • Istio ambient (ztunnel) mode. Start with the sidecar data plane; ambient can
    be evaluated later.
  • Multi-cluster mesh, egress gateway lockdown, and authorization policy beyond
    mTLS. Those are follow-ups once STRICT mTLS is stable.

1. Istio installation approach (air-gapped GovCloud EKS)

Air-gap and ECR mirroring:

  • GovCloud has no ECR pull-through cache, so every image must be mirrored into
    430737322961.dkr.ecr.us-gov-west-1.amazonaws.com via scripts/images.txt
    plus scripts/mirror-images.sh (crane).
  • Istio publishes images to gcr.io/istio-release (canonical) and to
    docker.io/istio (Docker Hub mirror). The components needed for sidecar mode
    are pilot (istiod), proxyv2 (sidecar and the gateway data plane), and
    install-cni if the Istio CNI plugin is used.
  • Important tooling gap: scripts/mirror-images.sh ecr_repo_path() only maps
    docker.io, ghcr.io, and quay.io. A gcr.io/istio-release reference will
    hit the unsupported upstream registry host error. Two options:
    1. Pin Istio images to docker.io/istio/* so they map cleanly to
      <ecr>/docker-hub/istio/* with no script change (recommended for the least
      change), or
    2. Extend mirror-images.sh to add a gcr.io to gcr/ mapping and pull from
      gcr.io/istio-release.
  • Add the chosen refs to scripts/images.txt, for example
    docker.io/istio/pilot:<ver>, docker.io/istio/proxyv2:<ver>, and
    docker.io/istio/install-cni:<ver>, pinned to one stable Istio release.
  • Override the install to pull from ECR by setting the Istio hub to
    <ecr>/docker-hub/istio and tag to the pinned version. istioctl and the
    Helm charts default hub to docker.io/istio, so this is a single override.
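
As a minimal sketch of that single override, assuming the istioctl path is chosen (the install method is still an open question), an IstioOperator overlay could pin the hub and tag as shown here. The metadata name is hypothetical and <ver> remains a placeholder for the pinned release. With Helm, the corresponding chart values are global.hub and global.tag.

```yaml
# Sketch only: point every Istio component image at the ECR mirror.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: ecr-mirror-overlay   # hypothetical name
  namespace: istio-system
spec:
  # Assumes images were mirrored from docker.io/istio/* to <ecr>/docker-hub/istio/*
  hub: 430737322961.dkr.ecr.us-gov-west-1.amazonaws.com/docker-hub/istio
  # Placeholder: replace with the single pinned Istio release
  tag: "<ver>"
```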

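Install method (istioctl vs Helm):

  • Not yet decided. The current lean is Helm if the GitOps control plane lands
    first, otherwise istioctl (see the open questions). Either method must apply
    the ECR hub and tag override described above.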

Compatibility risk to verify first:

  • The cluster is EKS k8s 1.36, which is newer than the Kubernetes versions most
    Istio releases certify against. Confirm the target Istio version supports k8s
    1.36 on its official support matrix before committing. This is a hard
    gating item.

2. Mesh-wide mTLS STRICT via PeerAuthentication

  • Final state: a mesh-wide PeerAuthentication in istio-system with
    mtls.mode: STRICT, so all sidecar-to-sidecar traffic must be mTLS.
  • Roll out permissive first. Install with the default PERMISSIVE behavior, inject
    sidecars namespace by namespace, confirm traffic is healthy, then flip to
    STRICT. PERMISSIVE accepts both mTLS and plain text so injection does not break
    callers mid-migration.
  • Required exceptions and care:
    • Namespaces that are not sidecar-injected (kube-system, the Istio control
      plane, and anything intentionally left out) are not governed by
      PeerAuthentication, which only applies to injected workloads. The two
      crossing directions need separate handling: injected workloads that must
      still accept plain text from non-injected callers need a workload-scoped
      PeerAuthentication with a portLevelMtls exception, and injected clients
      calling non-injected services rely on auto mTLS or a DestinationRule with
      trafficPolicy.tls.mode: DISABLE.
    • Kubelet health and readiness probes: Keycloak probes hit the management port
      9000, Coder and others use HTTP probes. Use Istio probe rewrite (enabled by
      default) so the kubelet, which has no sidecar, can still reach probe
      endpoints under STRICT.
    • The RDS PostgreSQL connection leaves the mesh to an AWS endpoint and is
      already TLS with rds.force_ssl=1. It is mesh-external, so it is governed by
      a ServiceEntry plus DestinationRule, not by PeerAuthentication.
    • GitLab (single-container omnibus with embedded Postgres and internal
      Workhorse and Puma ports) and Coder workspace pods are the highest-risk
      injection targets. Treat gitlab, coder, and coder-workspaces as the
      last namespaces to flip to STRICT and validate carefully (websockets, git
      over http, the Coder agent tunnels).
    • External Secrets Operator, Prometheus scraping, and Loki and Promtail must
      keep working; Prometheus scrape of mTLS targets needs either the merged
      Istio metrics endpoints or scrape configuration that tolerates the sidecar.

3. Ingress gateway: replace nginx, or sit behind the NLB

Three layouts were considered:

  • A. Keep NLB at L4 (TLS terminated by ACM), put the Istio ingress gateway behind
    it in place of ingress-nginx. The gateway is the single L7 entry, sets
    forwarded headers correctly, and routes by host to the meshed services.
  • B. Keep NLB plus ingress-nginx and add Istio only for east-west mTLS, with no
    Istio gateway. This fixes nothing about the Keycloak header problem and leaves
    the redundancy in place.
  • C. Move TLS termination off the NLB into the Istio gateway (NLB passes TCP 443
    straight through), terminating with a cert mounted in the gateway.

Recommendation: option A. Keep the NLB doing L4 with the ACM cert (FIPS-validated
ACM termination at the edge is desirable in GovCloud and avoids managing cert
material in-cluster), and replace ingress-nginx with the Istio ingress gateway as
the L7 layer. The gateway receives the plain HTTP that the NLB forwards, applies
host-based routing through Gateway and VirtualService objects, and presents a
correct X-Forwarded-Proto: https to backends.

Tradeoffs:

  • Option A keeps the proven ACM edge and only swaps the in-cluster L7. The
    gateway must be told the real external scheme is https even though it receives
    plain HTTP from the L4 NLB; do this with the gateway topology and forwarded
    header settings (see section 4).
  • Option C gives end-to-end TLS to the gateway but requires putting a server cert
    into the cluster (ACM private cert export is restricted; this likely pulls in
    cert-manager or a manually managed secret) and reworking the NLB to TCP
    passthrough. More moving parts and more risk for this environment, so it is not
    recommended now.

4. X-Forwarded-Proto fix tied to the gateway design

This is the concrete fix for the Keycloak cookie failure.

  • The Istio ingress gateway is an Envoy proxy. Even though the NLB hands it plain
    HTTP, the gateway is the trust boundary for forwarded headers and can
    normalize them.
  • By default Envoy derives X-Forwarded-Proto from the downstream connection,
    which is plain HTTP here, so the gateway would emit http unless told
    otherwise. Configure the gateway listener and route so that traffic arriving
    on the external 443 path is presented to backends with
    X-Forwarded-Proto: https, and have Envoy overwrite rather than append
    client-supplied forwarded headers at this edge hop (Envoy
    xff_num_trusted_hops, exposed in Istio as gatewayTopology.numTrustedProxies,
    tuned for an L4 NLB in front).
  • With Keycloak behind the gateway and seeing X-Forwarded-Proto: https plus its
    existing KC_PROXY_HEADERS=xforwarded, Keycloak treats the request as secure
    and issues AUTH_SESSION_ID and KC_RESTART with Secure and
    SameSite=None, so the Account Console silent single sign-on iframe works.
  • Keep KC_HOSTNAME=https://auth.usgov.coderdemo.io for stable issuer URLs; the
    scheme fix is about cookies, the hostname pin is about URL generation. Both are
    needed.
  • Acceptance for this item: curl -D - against the Keycloak login path through
    the gateway shows Secure and SameSite=None on the session cookies, and the
    Account Console loads.
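
One possible shape for the Keycloak route is sketched below, under several assumptions: the gateway name, the istio-ingress namespace, the istio: ingressgateway pod label, and the Keycloak Service name and namespace are all hypothetical, and networking.istio.io/v1 assumes a recent Istio release. Setting the header through the VirtualService is only one candidate mechanism (an EnvoyFilter is another), so whether it is sufficient should be confirmed with the curl -D - acceptance check above rather than taken as given.

```yaml
# Plain-HTTP listener, because the NLB has already terminated TLS.
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: public-gateway        # hypothetical name
  namespace: istio-ingress    # assumed gateway namespace
spec:
  selector:
    istio: ingressgateway     # assumed label on the gateway pods
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - auth.usgov.coderdemo.io
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: keycloak              # hypothetical name
  namespace: keycloak         # assumed Keycloak namespace
spec:
  hosts:
    - auth.usgov.coderdemo.io
  gateways:
    - istio-ingress/public-gateway
  http:
    - headers:
        request:
          set:
            # Overwrite (not append) so a client-supplied value cannot win.
            X-Forwarded-Proto: https
      route:
        - destination:
            host: keycloak.keycloak.svc.cluster.local  # assumed Service DNS name
            port:
              number: 8080    # Keycloak ClusterIP Service port from the as-built notes
```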

5. Certificate handling

  • Edge TLS stays on ACM at the NLB (option A). No public-facing server cert lives
    in the cluster, which keeps the GovCloud edge posture unchanged.
  • Mesh mTLS certificates are issued and rotated automatically by istiod (the
    built-in Istio CA, SPIFFE identities per workload). These are internal and are
    not the ACM cert; the two concerns are independent.
  • cert-manager interplay: not required for option A. It only becomes relevant if
    option C is chosen (a gateway-terminated server cert) or if the Istio CA should
    be backed by an intermediate from a managed PKI (cert-manager istio-csr).
    Recommend deferring cert-manager unless a compliance requirement forces an
    external issuer for the mesh CA.
  • Document the mesh trust domain and root rotation runbook, since STRICT mTLS
    makes istiod the CA for all in-mesh traffic.

6. Rationalize the NLB plus nginx redundancy

Target end state:

  • Keep: the internet-facing NLB with ACM L4 TLS (one external entry point, one
    cert, the existing DNS alias records keep pointing at it).
  • Replace: ingress-nginx with the Istio ingress gateway Service, which becomes
    the new target of the NLB. The four app Ingress objects become Istio Gateway
    plus VirtualService objects (or are kept temporarily through the Istio
    ingress class during cutover).
  • Remove after cutover: the ingress-nginx Helm release and its controller, once
    all four hosts are served by the gateway and verified.

Non-disruptive migration path (no downtime for dev, auth, gitlab,
grafana):

  1. Install Istio (PERMISSIVE), mirror images, verify, do not touch ingress yet.
  2. Stand up the Istio ingress gateway as a separate Service. Do not move the NLB
    to it yet. The NLB still targets ingress-nginx.
  3. Inject sidecars and add Gateway plus VirtualService for one low-risk host
    first (Grafana is the best canary; it is unaffected by the cookie bug and
    least critical). Validate through a temporary path, a second test NLB, or
    traffic weighting.
  4. Cut hosts over one at a time, validating each. An L4 NLB does not see the
    HTTP Host header, so repointing its target group (or the AWS Load Balancer
    Controller Service ownership) from ingress-nginx to the Istio gateway moves
    every hostname at once; a true host-by-host cutover needs per-host DNS
    records pointing at a second NLB in front of the gateway. Do Keycloak
    deliberately, since it is both the cookie fix and the highest blast radius
    for single sign-on.
  5. Once all four hosts run through the gateway and are verified, decommission
    ingress-nginx.
  6. Only after ingress is stable, progress the mTLS rollout to STRICT per
    section 2.

A safe variant during validation is to keep both ingress-nginx and the Istio
gateway live and split by host in DNS (the L4 NLB cannot route on the Host
header, so each path needs its own NLB), so any single host can be rolled back
quickly by repointing its record to the old path.

7. Risks, rollback, and phased rollout

Risks:

  • k8s 1.36 may be ahead of the chosen Istio version's support matrix (gating).
  • GitLab omnibus and Coder workspace traffic (websockets, git over http, agent
    tunnels) can break under sidecar injection or STRICT mTLS if ports or probes
    are mishandled.
  • Air-gap image gaps (the gcr.io mapping is unsupported by the mirror script
    today) can stall an install midway.
  • The Keycloak cutover is the same step that fixes the bug and risks single
    sign-on for Coder, GitLab, and Grafana, which all depend on the realm.

Rollback:

  • Ingress: repoint the NLB target (or the affected host's DNS record) back to
    ingress-nginx (keep the release until the very end).
  • mTLS: flip the namespace or mesh PeerAuthentication back to PERMISSIVE, which
    immediately re-accepts plain text.
  • Injection: remove the namespace istio-injection label and restart workloads
    to drop sidecars.
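
For reference, the injection toggle in the last bullet is a plain namespace label. A minimal sketch, using the coder namespace named earlier purely as an example; a revision-based install would use the istio.io/rev label instead, which is not assumed here.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: coder
  labels:
    # Present: new pods in this namespace get a sidecar.
    # Removed: restarted pods come back without one (the rollback path).
    istio-injection: enabled
```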

Phased rollout (summary):

  1. Mirror images, install Istio control plane in PERMISSIVE, verify.
  2. Stand up the ingress gateway alongside ingress-nginx.
  3. Migrate hosts to the gateway one at a time (Grafana, then Coder and GitLab,
    then Keycloak), validating each, fixing the Keycloak header and cookie path.
  4. Decommission ingress-nginx.
  5. Inject sidecars namespace by namespace under PERMISSIVE; validate.
  6. Flip to STRICT mesh-wide; keep documented exceptions.

8. Observability integration

  • Istio control plane and Envoy sidecars expose Prometheus metrics. Wire them
    into the existing kube-prometheus-stack (monitoring namespace) with
    PodMonitor or ServiceMonitor for istiod and the sidecars, accounting for STRICT
    mTLS on scrape paths.
  • Add the standard Istio mesh, workload, and service Grafana dashboards to the
    existing Grafana so request rates, mTLS success, and gateway latency are
    visible next to the Coder dashboards.
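  • Envoy and istiod logs flow to the existing Loki and Promtail pipeline like
    other pods. Envoy access logs are off in the default Istio profile, so enable
    them (meshConfig.accessLogFile: /dev/stdout or the Telemetry API) to make
    access logs queryable in Grafana.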
  • Optional: deploy Kiali for mesh topology and config validation. It is useful
    during the STRICT cutover to spot non-mTLS edges, but it is another image to
    mirror and another SSO integration, so treat it as optional and add it behind
    Keycloak single sign-on if adopted.
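
A possible starting point for the sidecar scrape is sketched below. It assumes the Prometheus Operator CRDs from kube-prometheus-stack are installed, that the Prometheus instance selects monitors by a release label (the label value here is a guess), and that injected pods carry the security.istio.io/tlsMode label and expose the proxy metrics port under the name http-envoy-prom. The resource name is hypothetical. The proxy metrics port is typically served outside the mTLS-intercepted inbound path, which is why this scrape is expected to keep working under STRICT, but that should be verified on the chosen Istio version.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: istio-sidecars              # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack  # assumption: label the Prometheus CR selects on
spec:
  namespaceSelector:
    any: true                       # sidecars live in every injected namespace
  selector:
    matchExpressions:
      - key: security.istio.io/tlsMode   # assumed to be set on injected pods
        operator: Exists
  podMetricsEndpoints:
    - port: http-envoy-prom         # Envoy stats port on the istio-proxy container
      path: /stats/prometheus
      interval: 30s
```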

Open questions and decisions to confirm

  • Approve the scope change (Istio is currently "Out of scope (demo)" in
    STATUS.md).
  • Confirm a specific Istio version that supports EKS k8s 1.36.
  • Choose istioctl vs Helm (lean toward Helm if the GitOps control plane lands
    first; otherwise istioctl).
  • Confirm option A (NLB L4 ACM plus Istio gateway) over option C
    (gateway-terminated TLS).
  • Decide whether Kiali is in scope for the demo.

Acceptance criteria

  • Istio control plane runs from ECR-mirrored images, istioctl verify-install
    passes, and image refs are pinned in scripts/images.txt.
  • All four hostnames are served through the Istio ingress gateway behind the
    existing NLB with the ACM cert, with no downtime during cutover.
  • The Keycloak Account Console loads; curl -D - shows session cookies with
    Secure and SameSite=None; Coder, GitLab, and Grafana single sign-on still
    work.
  • ingress-nginx is removed after cutover.
  • Mesh-wide PeerAuthentication is STRICT with documented, justified exceptions;
    all app traffic stays healthy.
  • Istio metrics and dashboards are visible in the existing Grafana, and Envoy
    logs reach Loki.

Generated by Coder Agents on behalf of @ausbru87 (ausbru87).
