
Gatewayapi Namespaced Mode#4690

Open
radixo wants to merge 3 commits into tigera:master from radixo:gatewayapi-deployment-enterprise

Conversation

@radixo
Contributor

@radixo radixo commented Apr 14, 2026

Description

Replace the previous Gateway-API install — which ran an Envoy Gateway controller in tigera-gateway and deployed all proxy workloads in that same namespace — with a single envoy-gateway controller in calico-system running with deploy.type=GatewayNamespace, so proxy workloads land in each Gateway's own namespace.

This is a breaking change for clusters running the legacy install. Existing Gateway CRs do not need edits — tigera-gateway-class and its controllerName are preserved — but proxy Pods, their Services and LoadBalancer addresses are recreated in each Gateway's own namespace on first reconcile after upgrade. Anything pinned to tigera-gateway (NetworkPolicies, monitoring, RBAC, external DNS) must follow.

  • Single controller in calico-system with controllerName=gateway.envoyproxy.io/gatewayclass-controller (chart default) and deploy.type=GatewayNamespace. ControllerName + GatewayClass name are deliberately reused from the legacy install so existing Gateway CRs continue to be claimed.
  • Auto-provisioned tigera-gateway-class GatewayClass + EnvoyProxy. Users can declare additional classes via GatewayAPI.Spec.GatewayClasses; all classes target the single controller.
  • Embed gateway-helm.tgz and render at runtime via the Helm SDK; result is cached per process via sync.Once. Replaces the previous pre-rendered YAML.
  • Per-namespace Enterprise resources for each Gateway-hosting namespace: waf-http-filter SA + per-namespace waf-http-filter-gateway-resources RoleBinding (least-privilege Gateway-API reads), tigera-operator-secrets RoleBinding, tigera-pull-secret copy. Cluster-scoped perms (licensekeys, tokenreviews) go through a single shared waf-http-filter-gateway-namespaces ClusterRoleBinding whose Subjects list is recomputed each reconcile.
  • Reserved-namespace guard (calico-system, tigera-operator): the operator does not create or delete the shared tigera-operator-secrets RoleBinding or tigera-pull-secret copy in those namespaces — the core Installation controller owns them.
  • Per-namespace cleanup when a namespace no longer hosts a Gateway: pull Secret → SA + RoleBinding → tigera-operator-secrets RoleBinding. That RoleBinding is what grants the operator Secret-delete perms, so reversing the order yields a 403 and aborts the reconcile.
  • Explicit upgrade-cleanup queue for the legacy tigera-gateway install: controller Deployment/Service/SAs/ConfigMap, certgen Job + RBAC, namespaced Role/RoleBinding, copied pull Secrets, tigera-operator-secrets RoleBinding, envoy-gateway-topology-injector.tigera-gateway MWC, the orphaned waf-http-filter-cluster-scoped and waf-http-filter-gateway-resources ClusterRoleBindings, and the deprecated combined waf-http-filter ClusterRole/ClusterRoleBinding. Pull Secrets are queued before tigera-operator-secrets. The tigera-gateway Namespace itself is intentionally not queued — users may have placed their own resources in it.
  • v3 NetworkPolicy (calico-system.envoy-gateway) under the calico-system tier to keep the controller + certgen Job working under default-deny. Selector covers both app.kubernetes.io/name=gateway-helm and app=certgen (the chart applies different labels to the certgen Job vs its pod template). Egress: DNS + kube-apiserver, then Pass. Ingress: 9443 (topology-injector webhook), 18000-18002 (xDS), 19001 (metrics).
  • Tests: UT coverage for OSS + Enterprise renders, GatewayClass + EnvoyProxy customisation, per-namespace lifecycle (create + cleanup), reserved-namespace guards, Secret-before-RoleBinding ordering, and the explicit legacy-cleanup queue (with the tigera-gateway Namespace asserted not in the delete list). FV coverage: deploys the controller in calico-system and asserts nothing lands in tigera-gateway, provisions and cleans up per-namespace resources, GatewayClass + EnvoyProxy cleanup, custom EnvoyProxy watch, l7-log-collector owning-gateway env wiring, custom EnvoyGateway ConfigMap.
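
The per-process caching mentioned in the Helm bullet above can be sketched as follows. This is a minimal illustration of the `sync.Once` pattern only, not the operator's actual code: `renderedChart`, `expensiveRender`, and `cachedRender` are hypothetical names, and the real render goes through the Helm SDK rather than the placeholder used here.

```go
package main

import (
	"fmt"
	"sync"
)

// renderedChart is a hypothetical stand-in for the objects produced by
// rendering the embedded gateway-helm.tgz chart.
type renderedChart struct {
	manifests []string
}

var (
	renderOnce   sync.Once
	renderResult *renderedChart
	renderErr    error
	renderCalls  int // counts real renders, for illustration only
)

// expensiveRender is a placeholder for the Helm SDK render (assumption:
// the real function loads the embedded archive and templates it).
func expensiveRender() (*renderedChart, error) {
	renderCalls++
	return &renderedChart{
		manifests: []string{"Deployment/envoy-gateway", "Service/envoy-gateway"},
	}, nil
}

// cachedRender runs expensiveRender at most once per process and returns
// the stored result (or the stored error) on every later call.
func cachedRender() (*renderedChart, error) {
	renderOnce.Do(func() {
		renderResult, renderErr = expensiveRender()
	})
	return renderResult, renderErr
}

func main() {
	for i := 0; i < 3; i++ {
		chart, err := cachedRender()
		if err != nil {
			panic(err)
		}
		fmt.Println(len(chart.manifests), "manifests; renders so far:", renderCalls)
	}
}
```

Two consequences of this pattern are worth keeping in mind: a failed render is cached along with its error until the process restarts, and callers share one result, so they should copy it before mutating.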

Security

The Enterprise per-namespace render copies tigera-pull-secret into every namespace that hosts a Gateway, so permissive RBAC on those namespaces can expose the pull secret. Reserved namespaces are excluded from create and delete of the shared resources, so the operator does not clobber core-Installation-owned secrets.
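
As a rough sketch of how the reserved-namespace guard and the cleanup ordering from the description fit together, the function below builds a deletion plan for one namespace. The resource names come from the description; `cleanupPlan` and the `Kind/name` string encoding are hypothetical, and the sketch assumes the `waf-http-filter` ServiceAccount and its RoleBinding are still cleaned up in reserved namespaces, which the description does not state explicitly.

```go
package main

import "fmt"

// reservedNamespaces lists the namespaces whose shared resources are owned
// by the core Installation controller, per the PR description.
var reservedNamespaces = map[string]bool{
	"calico-system":   true,
	"tigera-operator": true,
}

// cleanupPlan returns the ordered list of objects to delete when a namespace
// no longer hosts a Gateway. The pull Secret comes first and the
// tigera-operator-secrets RoleBinding last, because that RoleBinding is what
// grants permission to delete the Secret.
func cleanupPlan(namespace string) []string {
	manageShared := !reservedNamespaces[namespace]

	var plan []string
	if manageShared {
		plan = append(plan, "Secret/tigera-pull-secret")
	}
	plan = append(plan,
		"ServiceAccount/waf-http-filter",
		"RoleBinding/waf-http-filter-gateway-resources",
	)
	if manageShared {
		plan = append(plan, "RoleBinding/tigera-operator-secrets")
	}
	return plan
}

func main() {
	for _, ns := range []string{"team-a", "calico-system"} {
		fmt.Println(ns, cleanupPlan(ns))
	}
}
```

Encoding the order in a single slice keeps the Secret-before-RoleBinding constraint in one place, which also makes it straightforward to assert in a unit test.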

Upgrade / compatibility

In-place upgrade. The single controller in calico-system claims all tigera-gateway-class Gateways unchanged. Proxy Pods, their Services, and LoadBalancer addresses are recreated in each Gateway's own namespace on first reconcile, so any cluster setting pinned to tigera-gateway (NetworkPolicies, monitoring, RBAC, external DNS) must be repointed to the Gateway's own namespace. The legacy controller install is removed automatically; the tigera-gateway Namespace itself is preserved in case it holds user resources.

Calico-private operator RBAC update required: envoy-gateway-topology-injector.calico-system added to the mutatingwebhookconfigurations resourceNames list with update and delete verbs (the legacy .tigera-gateway entry is retained with delete so the upgrade-cleanup can reap it).

Release Note

**Breaking change.** The Calico Ingress Gateway controller now runs in
`calico-system` and provisions proxy workloads in the Gateway's own namespace
(`deploy.type=GatewayNamespace`). Existing `tigera-gateway-class` Gateways
keep working without CR edits, but their proxy Pods are relaunched in the
Gateway's own namespace; any NetworkPolicy, monitoring, RBAC or external DNS
previously pinned to `tigera-gateway` must be repointed to the Gateway's own
namespace. The legacy controller install is removed automatically on upgrade;
the `tigera-gateway` Namespace itself is left in place in case it holds user
resources.

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

Member

@electricjesus electricjesus left a comment


Good work on the runtime Helm rendering migration — dropping 53K lines of pre-rendered YAML is a huge win. The GatewayNamespace mode looks solid with good test coverage. A few observations below.

Comment thread pkg/render/gatewayapi/gateway_api.go Outdated
Comment thread pkg/render/gatewayapi/gateway_api.go Outdated
Comment thread pkg/render/gatewayapi/gateway_api.go Outdated
CurrentGatewayClasses: set.New[string](),
}

if gatewayAPI.Spec.GatewayDeploymentMode == nil {
Member


Nit: The CRD already has +kubebuilder:default=ControllerNamespace, so any persisted GatewayAPI resource will have this field populated by the API server. This runtime defaulting only matters for in-memory objects that were never persisted (tests?). Not a problem, just noting the redundancy — if the CRD default is the source of truth, a comment here explaining why you also default in code would help future readers.

Contributor Author


The in-code default is used by the tests.
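
For readers following this thread, the redundancy under discussion looks roughly like the sketch below: a pointer field that the API server fills in from the CRD default, plus an in-code fallback for objects built in memory. The types here are simplified stand-ins for the operator's API types, the string values are assumptions based on the constant names quoted in the review, and `defaultDeploymentMode` is a hypothetical helper.

```go
package main

import "fmt"

// GatewayDeploymentMode is a simplified stand-in for the operator API type.
type GatewayDeploymentMode string

const (
	// String values are assumed to match the constant names.
	ControllerNamespace GatewayDeploymentMode = "ControllerNamespace"
	GatewayNamespace    GatewayDeploymentMode = "GatewayNamespace"
)

// GatewayAPISpec keeps only the field relevant to this discussion.
type GatewayAPISpec struct {
	GatewayDeploymentMode *GatewayDeploymentMode
}

// defaultDeploymentMode fills in the field when it is nil. Persisted objects
// already carry the CRD default, so this only takes effect for objects that
// never went through the API server, such as test fixtures.
func defaultDeploymentMode(spec *GatewayAPISpec) {
	if spec.GatewayDeploymentMode == nil {
		mode := ControllerNamespace
		spec.GatewayDeploymentMode = &mode
	}
}

func main() {
	inMemory := GatewayAPISpec{} // e.g. built directly in a unit test
	defaultDeploymentMode(&inMemory)
	fmt.Println("defaulted to:", *inMemory.GatewayDeploymentMode)

	explicit := GatewayNamespace
	persisted := GatewayAPISpec{GatewayDeploymentMode: &explicit}
	defaultDeploymentMode(&persisted)
	fmt.Println("left as:", *persisted.GatewayDeploymentMode)
}
```

Defaulting before any dereference also protects later code that reads `*spec.GatewayDeploymentMode` from a nil-pointer panic.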

@radixo
Contributor Author

radixo commented Apr 15, 2026

Good work on the runtime Helm rendering migration — dropping 53K lines of pre-rendered YAML is a huge win. The GatewayNamespace mode looks solid with good test coverage. A few observations below.

All good catches man, all sorted

Comment thread pkg/render/gatewayapi/gateway_api.go Outdated
Member

@electricjesus electricjesus left a comment


LGTM!

// Gateway resources using operator-managed GatewayClasses. These namespaces need
// per-namespace Enterprise resources (SA, CRB, pull secrets).
if *gatewayAPI.Spec.GatewayDeploymentMode == operatorv1.GatewayDeploymentModeGatewayNamespace &&
	variant.IsEnterprise() {
Member


Why does the Variant matter here at all?

Contributor Author


The variant check gates resources that are rendered only with an Enterprise license, such as WAF.

@radixo radixo force-pushed the gatewayapi-deployment-enterprise branch from 04d49c6 to 8c6f64c Compare April 23, 2026 21:22
radixo and others added 3 commits May 7, 2026 17:31
- Swap the checked-in gateway_api_resources.yaml for the embedded gateway-helm.tgz rendered via the helm SDK at startup; K8SGatewayAPICRDs/GatewayAPICRDs now take a runtime.Scheme and return an error (istio_controller updated for the new signature)
- Deploy two envoy-gateway controllers: legacy in tigera-gateway (user-declared classes via Spec.GatewayClasses) and a new one in calico-system with deploy.type=GatewayNamespace; auto-provision the tigera-gateway-class-ns GatewayClass bound to the new controller
- Group the tigera-gateway install behind legacyObjects/legacyTeardownObjects so the eventual deprecation is a single delete
- HasLegacyGateways classifier in the controller: build a className -> controllerName map seeded from Spec.GatewayClasses + existing GatewayClass resources, classify every live Gateway; when no Gateway targets the tigera-gateway controller, the install is torn down; during the teardown-then-redeploy race the legacy render is deferred to avoid a "Namespace is terminating, skipping creation" log flood
- Legacy teardown queues only the Namespace + cluster-scoped objects + the Deployment (for status.RemoveDeployments); in-namespace RBAC/Secrets ride the cascade to avoid the tigera-operator-secrets RoleBinding race
- Move the shared waf-http-filter ClusterRoles out of the legacy bundle so the calico-system-side proxies keep their cluster-scoped perms after tigera-gateway is retired
- Per-namespace Enterprise resources (SA, RoleBindings, pull secret, shared CRB subject) for namespaces hosting a namespaced-class Gateway; reserved namespaces skip shared resource create/delete; Secret goes before RoleBinding on cleanup to avoid 403
- Gate v3 NetworkPolicies on the calico-system Tier; render calico-system.envoy-gateway allow for the controller and certgen
- Update unit tests and Makefile/docs accordingly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Cover the calico-system envoy-gateway controller lifecycle, per-namespace resource provisioning and cleanup, custom EnvoyProxy and EnvoyGateway ConfigMap watches, owning-gateway env vars in l7-log-collector, and the legacy-class teardown path
- Teardown sequencing for tigera-gateway cascading

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lico-system

- Render one envoy-gateway controller in calico-system with deploy.type=GatewayNamespace
- Auto-provision tigera-gateway-class; honour user overrides if redeclared in Spec.GatewayClasses
- Enumerate every operator-owned object from the legacy tigera-gateway install for cleanup (pull Secrets before tigera-operator-secrets); keep the Namespace itself in case users placed their own resources there
- Point GatewayAPI finalizer at the calico-system envoy-gateway Deployment
- Drop dual-controller fixtures and the legacy-undeploy test; consolidate FV tests to the calico-system layout
@radixo radixo force-pushed the gatewayapi-deployment-enterprise branch from 0d63b8f to d3ef961 Compare May 8, 2026 17:05