Summary
Adopt Istio service mesh (sidecar mode) on the GovCloud EKS cluster, primarily
to get workload-to-workload mTLS, and use an Istio ingress gateway to
rationalize the current NLB plus ingress-nginx chain. The same gateway redesign
fixes the confirmed Keycloak Account Console failure, whose root cause is that
the L4 NLB terminates TLS and forwards plain HTTP, so X-Forwarded-Proto: http
reaches Keycloak and its cookies are issued without Secure/SameSite=None.
This issue is planning only. Nothing here is applied to live infrastructure.
Note that STATUS.md currently lists Istio under "Out of scope (demo)", so
adopting it is a deliberate scope change that should be approved before any
implementation work starts.
Motivation: confirmed Keycloak cookie root cause
The Keycloak Account Console fails with "Something went wrong / Server responded
with an invalid status." This has already been root-caused; do not re-debug it
by hand-editing nginx in the live cluster.
- The internet-facing NLB terminates TLS with the ACM cert
(arn:aws-us-gov:acm:us-gov-west-1:430737322961:certificate/7f4fc566-8efd-4aa5-b6ba-3b0c9a535d12)
and forwards decrypted plain HTTP to ingress-nginx. Confirmed live: the NLB
has a TLS listener on 443 bound to that cert and a TCP listener on 80; the
controller Service maps both http and https target ports to the plain
http container port.
- ingress-nginx therefore sets X-Forwarded-Proto: http. Keycloak detects a
non-secure context (it logs "Non-secure context detected; cookies are not
secured") and issues AUTH_SESSION_ID and KC_RESTART without Secure and
without SameSite=None. Verified with curl -D -.
- The Account Console performs a silent single sign-on through a hidden iframe.
Third-party iframe context requires SameSite=None; Secure cookies, so the
browser drops the non-secure cookies and the API call returns an invalid
status.
- Coder, GitLab, and Grafana are unaffected because they build cookie security
from their hardcoded https URLs. Keycloak ties the cookie Secure flag to the
actual request scheme it sees.
deploy/keycloak/deployment.yaml already sets KC_PROXY_HEADERS=xforwarded,
KC_HOSTNAME=https://auth.usgov.coderdemo.io, and KC_HTTP_ENABLED=true.
Those fix issuer and redirect URL generation, but they do not change the
cookie Secure decision, which still follows the request scheme. The correct
fix is to make the proxy chain present X-Forwarded-Proto: https to Keycloak,
which an Istio ingress gateway (or any proper L7 proxy layer) can do
deterministically.
Current architecture (as-built, verified live)
- Chain: client to NLB over HTTPS (ACM TLS terminated at L4), NLB to
ingress-nginx over plain HTTP, ingress-nginx to app pods over plain HTTP.
- One internet-facing NLB provisioned by the AWS Load Balancer Controller from
the ingress-nginx controller Service (deploy/platform/ingress-nginx-values.yaml,
chart 4.15.1, controller v1.15.1, 2 replicas).
- Hostnames on one ACM cert covering usgov.coderdemo.io and
*.usgov.coderdemo.io:
  - dev.usgov.coderdemo.io (Coder dashboard) and *.usgov.coderdemo.io
    (Coder workspace apps, wildcard subdomain apps, websockets, large bodies).
  - auth.usgov.coderdemo.io (Keycloak, ClusterIP Service on 8080).
  - gitlab.usgov.coderdemo.io (GitLab).
  - grafana.usgov.coderdemo.io (Grafana).
- Every app exposes an Ingress with ingressClassName: nginx, ssl-redirect
disabled (no X-Forwarded-Proto-driven redirect because the L4 NLB does not
add it), and websocket and large-body tuning for Coder and GitLab.
- No Istio is installed today (the istio-system namespace does not exist).
Goals and non-goals
Goals:
- Mesh-wide mTLS in STRICT mode for in-cluster service-to-service traffic.
- A single, well-defined L7 ingress that presents X-Forwarded-Proto: https to
backends, fixing the Keycloak cookie problem at the architecture level.
- Reduce the NLB plus nginx redundancy where it is safe to do so.
- Keep the migration non-disruptive for the four live hostnames.
Non-goals:
- Istio ambient (ztunnel) mode. Start with the sidecar data plane; ambient can
be evaluated later.
- Multi-cluster mesh, egress gateway lockdown, and authorization policy beyond
mTLS. Those are follow-ups once STRICT mTLS is stable.
1. Istio installation approach (air-gapped GovCloud EKS)
Air-gap and ECR mirroring:
- GovCloud has no ECR pull-through cache, so every image must be mirrored into
430737322961.dkr.ecr.us-gov-west-1.amazonaws.com via scripts/images.txt
plus scripts/mirror-images.sh (crane).
- Istio publishes images to gcr.io/istio-release (canonical) and to
docker.io/istio (Docker Hub mirror). The components needed for sidecar mode
are pilot (istiod), proxyv2 (sidecar and the gateway data plane), and
install-cni if the Istio CNI plugin is used.
- Important tooling gap: scripts/mirror-images.sh ecr_repo_path() only maps
docker.io, ghcr.io, and quay.io. A gcr.io/istio-release reference will
hit the "unsupported upstream registry host" error. Two options:
  - Pin Istio images to docker.io/istio/* so they map cleanly to
    <ecr>/docker-hub/istio/* with no script change (recommended for the least
    change), or
  - Extend mirror-images.sh to add a gcr.io to gcr/ mapping and pull from
    gcr.io/istio-release.
- Add the chosen refs to scripts/images.txt, for example
docker.io/istio/pilot:<ver>, docker.io/istio/proxyv2:<ver>, and
docker.io/istio/install-cni:<ver>, pinned to one stable Istio release.
- Override the install to pull from ECR by setting the Istio hub to
<ecr>/docker-hub/istio and tag to the pinned version. istioctl and the
Helm charts default hub to docker.io/istio, so this is a single override.
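The registry mapping and the gcr.io gap can be modelled in a few lines of Python. This is an illustrative sketch, not the actual scripts/mirror-images.sh logic: only the docker.io to docker-hub/ prefix and the proposed gcr.io to gcr/ prefix come from this issue, while the function name and the ghcr.io and quay.io prefixes are assumptions.

```python
# Hypothetical model of the registry-to-ECR mapping described in this issue.
ECR = "430737322961.dkr.ecr.us-gov-west-1.amazonaws.com"

# docker.io -> docker-hub/ is stated in this issue.
# The ghcr.io and quay.io prefixes are assumed for illustration only.
PREFIXES = {
    "docker.io": "docker-hub",
    "ghcr.io": "ghcr",  # assumption
    "quay.io": "quay",  # assumption
}


def ecr_ref(upstream: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Map an upstream ref such as docker.io/istio/pilot:<ver> to its ECR mirror ref."""
    host, sep, path = upstream.partition("/")
    if not sep or host not in prefixes:
        # Mirrors the failure mode the issue describes for gcr.io references.
        raise ValueError(f"unsupported upstream registry host: {host}")
    return f"{ECR}/{prefixes[host]}/{path}"


if __name__ == "__main__":
    # Option 1: pin to docker.io/istio/*, no mapping change needed.
    print(ecr_ref("docker.io/istio/proxyv2:<ver>"))
    # Option 2: extend the mapping with gcr.io -> gcr/.
    extended = {**PREFIXES, "gcr.io": "gcr"}
    print(ecr_ref("gcr.io/istio-release/proxyv2:<ver>", extended))
```

With the default table a gcr.io reference raises, which is the tooling gap; adding the single gcr.io entry is the whole of the second option in this model.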
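Install method (istioctl vs Helm):
- istioctl: drive the install from an IstioProfile/IstioOperator config checked
into the repo (rendered or applied through the same GitOps path as the rest of
deploy/). istioctl gives a clean istioctl install, istioctl verify-install,
and istioctl x precheck, which de-risk a first install on a new k8s minor.
- Helm: the Istio charts (base, istiod, gateway), which fit the existing Helm
and future Argo CD adoption better. If the GitOps control plane work (issues
#6 to #12 and #21 to #29) lands first, prefer the Helm charts so Istio is
reconciled the same way as ingress-nginx.
- Either way, set hub/tag to the ECR mirror and pin a single Istio version.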
Compatibility risk to verify first:
- The cluster is EKS k8s 1.36, which is newer than the Kubernetes versions most
Istio releases certify against. Confirm the target Istio version supports k8s
1.36 on its official support matrix before committing. This is a hard
gating item.
2. Mesh-wide mTLS STRICT via PeerAuthentication
- Final state: a mesh-wide PeerAuthentication in istio-system with
mtls.mode: STRICT, so all sidecar-to-sidecar traffic must be mTLS.
- Roll out permissive first. Install with the default PERMISSIVE behavior, inject
sidecars namespace by namespace, confirm traffic is healthy, then flip to
STRICT. PERMISSIVE accepts both mTLS and plain text so injection does not break
callers mid-migration.
- Required exceptions and care:
  - Namespaces that are not sidecar-injected (kube-system, the Istio control
    plane, and anything intentionally left out) must not be forced to STRICT. A
    mesh-wide STRICT policy is only enforced by injected workloads, so
    plain-text callers without a sidecar that reach an injected workload need a
    per-port PeerAuthentication portLevelMtls exception, and calls from
    sidecars to non-injected services may need a DestinationRule with the right
    trafficPolicy.tls.mode.
  - Kubelet health and readiness probes: Keycloak probes hit the management
    port 9000, Coder and others use HTTP probes. Use Istio probe rewrite
    (enabled by default) so the kubelet, which has no sidecar, can still reach
    probe endpoints under STRICT.
  - The RDS PostgreSQL connection leaves the mesh to an AWS endpoint and is
    already TLS with rds.force_ssl=1. It is mesh-external, so it is governed by
    a ServiceEntry plus DestinationRule, not by PeerAuthentication.
  - GitLab (single-container omnibus with embedded Postgres and internal
    workhorse and unicorn ports) and Coder workspace pods are the highest-risk
    injection targets. Treat gitlab, coder, and coder-workspaces as the
    last namespaces to flip to STRICT and validate carefully (websockets, git
    over http, the Coder agent tunnels).
  - External Secrets Operator, Prometheus scraping, and Loki and Promtail must
    keep working; Prometheus scrape of mTLS targets needs either the merged
    Istio metrics endpoints or scrape configuration that tolerates the sidecar.
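The PERMISSIVE-then-STRICT rollout can be expressed as a small manifest generator. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the policy name "default" is a convention rather than something this issue specifies, and the security.istio.io/v1 API version assumes a recent Istio release (older releases serve v1beta1). It relies on namespace-level PeerAuthentication taking precedence over the mesh-wide one.

```python
import json


def peer_authentication(namespace: str, mode: str, name: str = "default") -> dict:
    """Build a selector-less PeerAuthentication, which applies to the whole namespace."""
    if mode not in {"STRICT", "PERMISSIVE", "DISABLE"}:
        raise ValueError(f"unknown mTLS mode: {mode}")
    return {
        # Assumes a recent Istio; older releases use security.istio.io/v1beta1.
        "apiVersion": "security.istio.io/v1",
        "kind": "PeerAuthentication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"mtls": {"mode": mode}},
    }


def rollout_manifests(pending_namespaces: list[str]) -> list[dict]:
    """Mesh-wide STRICT in istio-system plus PERMISSIVE overrides for unvalidated namespaces."""
    manifests = [peer_authentication("istio-system", "STRICT")]
    manifests += [peer_authentication(ns, "PERMISSIVE") for ns in pending_namespaces]
    return manifests


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    for manifest in rollout_manifests(["gitlab", "coder", "coder-workspaces"]):
        print(json.dumps(manifest, indent=2))
```

Shrinking the pending list to empty as each namespace is validated leaves only the mesh-wide STRICT policy, which is the final state described above.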
3. Ingress gateway: replace nginx, or sit behind the NLB
Three layouts were considered:
- A. Keep NLB at L4 (TLS terminated by ACM), put the Istio ingress gateway behind
it in place of ingress-nginx. The gateway is the single L7 entry, sets
forwarded headers correctly, and routes by host to the meshed services.
- B. Keep NLB plus ingress-nginx and add Istio only for east-west mTLS, with no
Istio gateway. This fixes nothing about the Keycloak header problem and leaves
the redundancy in place.
- C. Move TLS termination off the NLB into the Istio gateway (NLB passes TCP 443
straight through), terminating with a cert mounted in the gateway.
Recommendation: option A. Keep the NLB doing L4 with the ACM cert (FIPS-validated
ACM termination at the edge is desirable in GovCloud and avoids managing cert
material in-cluster), and replace ingress-nginx with the Istio ingress gateway as
the L7 layer. The gateway terminates the NLB-forwarded HTTP, applies host-based
routing through Gateway and VirtualService objects, and injects a correct
X-Forwarded-Proto: https to backends.
Tradeoffs:
- Option A keeps the proven ACM edge and only swaps the in-cluster L7. The
gateway must be told the real external scheme is https even though it receives
plain HTTP from the L4 NLB; do this with the gateway topology and forwarded
header settings (see section 4).
- Option C gives end-to-end TLS to the gateway but requires putting a server cert
into the cluster (ACM private cert export is restricted; this likely pulls in
cert-manager or a manually managed secret) and reworking the NLB to TCP
passthrough. More moving parts and more risk for this environment, so it is not
recommended now.
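For option A, the host-based routing objects might look like the following, generated here as Python dictionaries. Everything specific is an assumption to verify: the istio: ingressgateway selector label, the Keycloak Service DNS name, the object names and namespaces, and in particular the use of a VirtualService request header override as the way to present X-Forwarded-Proto: https. Only the hostname and the Keycloak Service port 8080 come from this issue.

```python
import json


def gateway(name: str, namespace: str, hosts: list[str]) -> dict:
    """Gateway accepting the plain HTTP that the L4 NLB forwards after ACM TLS termination."""
    return {
        "apiVersion": "networking.istio.io/v1",  # assumes a recent Istio release
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"istio": "ingressgateway"},  # assumed gateway pod label
            "servers": [
                {"port": {"number": 80, "name": "http", "protocol": "HTTP"}, "hosts": hosts}
            ],
        },
    }


def virtual_service(name: str, namespace: str, host: str, gateway_ref: str,
                    backend_host: str, backend_port: int) -> dict:
    """Route one external host to its backend and overwrite the forwarded scheme."""
    return {
        "apiVersion": "networking.istio.io/v1",
        "kind": "VirtualService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "gateways": [gateway_ref],
            "http": [
                {
                    # "set" overwrites any client-supplied value instead of appending.
                    "headers": {"request": {"set": {"X-Forwarded-Proto": "https"}}},
                    "route": [
                        {"destination": {"host": backend_host, "port": {"number": backend_port}}}
                    ],
                }
            ],
        },
    }


if __name__ == "__main__":
    gw = gateway("edge", "istio-system", ["auth.usgov.coderdemo.io"])
    vs = virtual_service(
        "keycloak", "keycloak", "auth.usgov.coderdemo.io", "istio-system/edge",
        "keycloak.keycloak.svc.cluster.local",  # hypothetical Service name
        8080,
    )
    print(json.dumps([gw, vs], indent=2))
```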
4. X-Forwarded-Proto fix tied to the gateway design
This is the concrete fix for the Keycloak cookie failure.
- The Istio ingress gateway is an Envoy proxy. Even though the NLB hands it
plain HTTP, the gateway is the trust boundary for forwarded headers and can
normalize them.
- Configure the gateway so that traffic arriving on the external 443 path is
presented to backends as https. Because the L4 NLB adds no headers, Envoy would
otherwise derive X-Forwarded-Proto from the plain-HTTP connection it receives,
so the https value has to be set explicitly at the gateway listener or route.
Also have Envoy overwrite rather than append client-supplied forwarded headers
at this edge hop (xff_num_trusted_hops, which Istio exposes as
gatewayTopology.numTrustedProxies, tuned for an L4 NLB in front).
- With Keycloak behind the gateway and seeing X-Forwarded-Proto: https plus its
existing KC_PROXY_HEADERS=xforwarded, Keycloak treats the request as secure
and issues AUTH_SESSION_ID and KC_RESTART with Secure and
SameSite=None, so the Account Console silent single sign-on iframe works.
- Keep KC_HOSTNAME=https://auth.usgov.coderdemo.io for stable issuer URLs; the
scheme fix is about cookies, the hostname pin is about URL generation. Both are
needed.
- Acceptance for this item: curl -D - against the Keycloak login path through
the gateway shows Secure and SameSite=None on the session cookies, and the
Account Console loads.
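The cookie acceptance check can be automated against a saved curl -D - header dump. The sketch below is one way to do it; the function name is hypothetical and the sample headers in the demo are made up for illustration, not captured from the live cluster.

```python
def insecure_cookies(header_dump: str,
                     names: tuple[str, ...] = ("AUTH_SESSION_ID", "KC_RESTART")) -> dict[str, list[str]]:
    """Return {cookie name: missing attributes} for watched cookies lacking Secure or SameSite=None."""
    problems: dict[str, list[str]] = {}
    for line in header_dump.splitlines():
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "set-cookie":
            continue
        parts = [p.strip() for p in value.split(";")]
        name = parts[0].partition("=")[0]
        if name not in names:
            continue
        # Attribute names are case-insensitive, so compare in lowercase.
        attrs = {p.lower() for p in parts[1:]}
        missing = []
        if "secure" not in attrs:
            missing.append("Secure")
        if "samesite=none" not in attrs:
            missing.append("SameSite=None")
        if missing:
            problems[name] = missing
    return problems


if __name__ == "__main__":
    # Illustrative dump only; real input would come from curl -D - through the gateway.
    sample = (
        "HTTP/1.1 200 OK\r\n"
        "Set-Cookie: AUTH_SESSION_ID=abc; Path=/realms/demo/; HttpOnly\r\n"
        "Set-Cookie: KC_RESTART=xyz; Path=/realms/demo/; Secure; SameSite=None\r\n"
    )
    print(insecure_cookies(sample))
```

An empty result means both session cookies carry Secure and SameSite=None, which matches the acceptance condition for this item.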
5. Certificate handling
- Edge TLS stays on ACM at the NLB (option A). No public-facing server cert lives
in the cluster, which keeps the GovCloud edge posture unchanged.
- Mesh mTLS certificates are issued and rotated automatically by istiod (the
built-in Istio CA, SPIFFE identities per workload). These are internal and are
not the ACM cert; the two concerns are independent.
- cert-manager interplay: not required for option A. It only becomes relevant if
option C is chosen (a gateway-terminated server cert) or if the Istio CA should
be backed by an intermediate from a managed PKI (cert-manager istio-csr).
Recommend deferring cert-manager unless a compliance requirement forces an
external issuer for the mesh CA.
- Document the mesh trust domain and root rotation runbook, since STRICT mTLS
makes istiod the CA for all in-mesh traffic.
6. Rationalize the NLB plus nginx redundancy
Target end state:
- Keep: the internet-facing NLB with ACM L4 TLS (one external entry point, one
cert, the existing DNS alias records keep pointing at it).
- Replace: ingress-nginx with the Istio ingress gateway Service, which becomes
the new target of the NLB. The four app Ingress objects become Istio Gateway
plus VirtualService objects (or are kept temporarily through the Istio
ingress class during cutover).
- Remove after cutover: the ingress-nginx Helm release and its controller, once
all four hosts are served by the gateway and verified.
Non-disruptive migration path (no downtime for dev, auth, gitlab, grafana):
- Install Istio (PERMISSIVE), mirror images, verify, do not touch ingress yet.
- Stand up the Istio ingress gateway as a separate Service. Do not move the NLB
to it yet. The NLB still targets ingress-nginx.
- Inject sidecars and add Gateway plus VirtualService for one low-risk host
first (Grafana is the best canary; it is unaffected by the cookie bug and
least critical). Validate through a temporary path or a second test NLB or by
weighting.
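- Cut hosts over one at a time, validating each. An L4 NLB listener cannot
route on hostname, so a per-host cutover means moving that host's DNS record to
a second NLB that fronts the Istio gateway; repointing the existing NLB target
group (or the AWS Load Balancer Controller Service ownership) from
ingress-nginx to the Istio gateway moves every host at once. Do Keycloak
deliberately, since it is both the cookie fix and the highest blast radius for
single sign-on.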
- Once all four hosts run through the gateway and are verified, decommission
ingress-nginx.
- Only after ingress is stable, progress the mTLS rollout to STRICT per
section 2.
A safe variant during validation is to keep both ingress-nginx and the Istio
gateway live and split by host via DNS (an L4 NLB cannot split by hostname
itself, so the gateway needs its own NLB for this), so any single host can be
rolled back quickly by repointing it to the old path.
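For canary validation through a second test NLB, curl's --resolve flag lets a real hostname be tested against the test NLB address before any DNS change. The helper below only builds the command; its name is hypothetical, the IP is a documentation placeholder, and it assumes the test NLB terminates TLS with the same ACM cert so the hostname still validates.

```python
import shlex


def canary_curl(host: str, nlb_ip: str, path: str = "/") -> list[str]:
    """Build a curl command that sends https://host/path to nlb_ip instead of the DNS answer."""
    return [
        "curl", "-sS",
        "-D", "-",          # dump response headers to stdout
        "-o", "/dev/null",  # discard the body
        "--resolve", f"{host}:443:{nlb_ip}",
        f"https://{host}{path}",
    ]


if __name__ == "__main__":
    # 203.0.113.10 is a documentation address standing in for the test NLB IP.
    print(shlex.join(canary_curl("grafana.usgov.coderdemo.io", "203.0.113.10")))
```

Because the hostname and SNI are unchanged, the same command run with the production NLB address gives a like-for-like comparison of headers between the old and new paths.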
7. Risks, rollback, and phased rollout
Risks:
- k8s 1.36 may be ahead of the chosen Istio version's support matrix (gating).
- GitLab omnibus and Coder workspace traffic (websockets, git over http, agent
tunnels) can break under sidecar injection or STRICT mTLS if ports or probes
are mishandled.
- Air-gap image gaps (the gcr.io mapping is unsupported by the mirror script
today) can stall an install midway.
- The Keycloak cutover is the same step that fixes the bug and risks single
sign-on for Coder, GitLab, and Grafana, which all depend on the realm.
Rollback:
- Ingress: repoint the affected host back to the ingress-nginx path (keep the
release until the very end).
- mTLS: flip the namespace or mesh PeerAuthentication back to PERMISSIVE, which
immediately re-accepts plain text.
- Injection: remove the namespace istio-injection label and restart workloads
to drop sidecars.
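The mTLS and injection rollback steps map onto a handful of kubectl invocations, sketched here as a command builder so they can be reviewed before anything is run. The function name and the policy name "default" are assumptions, and only Deployments are restarted; StatefulSets or DaemonSets in the namespace would need the equivalent restart.

```python
import json
import shlex


def rollback_commands(namespace: str, policy_name: str = "default") -> list[list[str]]:
    """Commands to relax mesh-wide mTLS and drop sidecars from one namespace."""
    patch = json.dumps({"spec": {"mtls": {"mode": "PERMISSIVE"}}})
    return [
        # Flip the mesh-wide policy in istio-system back to PERMISSIVE.
        ["kubectl", "patch", "peerauthentication", policy_name,
         "-n", "istio-system", "--type", "merge", "-p", patch],
        # A trailing dash removes the label from the namespace.
        ["kubectl", "label", "namespace", namespace, "istio-injection-"],
        # Restart Deployments so new pods start without the sidecar.
        ["kubectl", "rollout", "restart", "deployment", "-n", namespace],
    ]


if __name__ == "__main__":
    for cmd in rollback_commands("gitlab"):
        print(shlex.join(cmd))
```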
Phased rollout (summary):
- Mirror images, install Istio control plane in PERMISSIVE, verify.
- Stand up the ingress gateway alongside ingress-nginx.
- Migrate hosts to the gateway one at a time (Grafana, then Coder and GitLab,
then Keycloak), validating each, fixing the Keycloak header and cookie path.
- Decommission ingress-nginx.
- Inject sidecars namespace by namespace under PERMISSIVE; validate.
- Flip to STRICT mesh-wide; keep documented exceptions.
8. Observability integration
- Istio control plane and Envoy sidecars expose Prometheus metrics. Wire them
into the existing kube-prometheus-stack (monitoring namespace) with
PodMonitor or ServiceMonitor for istiod and the sidecars, accounting for STRICT
mTLS on scrape paths.
- Add the standard Istio mesh, workload, and service Grafana dashboards to the
existing Grafana so request rates, mTLS success, and gateway latency are
visible next to the Coder dashboards.
- Envoy and istiod logs flow to the existing Loki and Promtail pipeline like
other pods, so access logs are queryable in Grafana.
- Optional: deploy Kiali for mesh topology and config validation. It is useful
during the STRICT cutover to spot non-mTLS edges, but it is another image to
mirror and another SSO integration, so treat it as optional and add it behind
Keycloak single sign-on if adopted.
Open questions and decisions to confirm
- Approve the scope change (Istio is currently "Out of scope (demo)" in
STATUS.md).
- Confirm a specific Istio version that supports EKS k8s 1.36.
- Choose istioctl vs Helm (lean toward Helm if the GitOps control plane lands
first; otherwise istioctl).
- Confirm option A (NLB L4 ACM plus Istio gateway) over option C
(gateway-terminated TLS).
- Decide whether Kiali is in scope for the demo.
Acceptance criteria
- Istio control plane runs from ECR-mirrored images, istioctl verify-install
passes, and image refs are pinned in scripts/images.txt.
- All four hostnames are served through the Istio ingress gateway behind the
existing NLB with the ACM cert, with no downtime during cutover.
- The Keycloak Account Console loads; curl -D - shows session cookies with
Secure and SameSite=None; Coder, GitLab, and Grafana single sign-on still
work.
- ingress-nginx is removed after cutover.
- Mesh-wide PeerAuthentication is STRICT with documented, justified exceptions;
all app traffic stays healthy.
- Istio metrics and dashboards are visible in the existing Grafana, and Envoy
logs reach Loki.
Generated by Coder Agents on behalf of @ausbru87 (ausbru87).