extensions: authenticate allocation endpoint with requestheader client CA #4582

Open
adilburaksen wants to merge 2 commits into agones-dev:main from adilburaksen:fix/allocation-rbac-requestheader-auth


adilburaksen commented May 23, 2026

Summary

This PR fixes an authorization bypass in Agones' extension API server where any in-cluster workload could call the allocation endpoint directly, bypassing Kubernetes RBAC entirely.

Problem

The Agones extension API server (port 8082) serves /apis/allocation.agones.dev/v1/namespaces/…/gameserverallocations via its own HTTP mux. When a client calls this endpoint through the Kubernetes aggregation layer, kube-apiserver proxies the request and presents a client certificate signed by the requestheader client CA (published in kube-system/extension-apiserver-authentication).

Previously, the server did not verify that certificate. Any pod with network access to port 8082 could send a raw HTTP request and receive a valid response — no Kubernetes ServiceAccount token, no RBAC check, nothing.

Fix

Three files changed:

pkg/util/https/server.go

  • Set ClientAuth: tls.RequestClientCert on the TLS config.
  • This asks callers to present a certificate but does not require one, so webhook callers that do not present a cert are unaffected.
  • The cert becomes available in r.TLS.PeerCertificates for the auth middleware.

pkg/util/apiserver/apiserver.go

  • Added SetRequestHeaderCA(*x509.CertPool) — called at startup to supply the CA.
  • Added authenticatedHandler — wraps any ErrorHandlerFunc to verify the first peer certificate against the CA pool with ExtKeyUsageClientAuth. Mirrors the logic in k8s.io/apiserver/pkg/authentication/request/x509.
  • If no CA is configured (unit tests, local dev), the handler passes through unchanged.
  • Discovery (/apis/allocation.agones.dev/v1) and OpenAPI handlers are left unauthenticated — they carry no sensitive data and are called by tools that do not present certs.
  • Only the /apis/…/namespaces/ resource handler (i.e. the allocation endpoint) requires the cert.

cmd/extensions/main.go

  • Added loadRequestHeaderCA — reads kube-system/extension-apiserver-authentication ConfigMap, parses requestheader-client-ca-file PEM, returns an *x509.CertPool.
  • Wired in after NewAPIServer; if the ConfigMap is absent or empty (e.g. clusters that don't use the aggregation layer), the server starts with a warning and no auth enforcement, preserving backward compatibility.
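The ConfigMap-parsing half of loadRequestHeaderCA might look like the sketch below. Fetching the ConfigMap itself would go through client-go, for example clientset.CoreV1().ConfigMaps("kube-system").Get(ctx, "extension-apiserver-authentication", metav1.GetOptions{}). That call is omitted here so the example needs only the standard library. The names caPoolFromConfigMapData and errNoRequestHeaderCA are hypothetical. A missing or empty key returns a sentinel error, which a caller could log as the startup warning before running without enforcement.

```go
package main

import (
	"crypto/x509"
	"errors"
)

// requestHeaderCAKey is the ConfigMap key named in the PR description.
const requestHeaderCAKey = "requestheader-client-ca-file"

// errNoRequestHeaderCA (hypothetical sentinel) signals "no CA available";
// a caller could log it as a warning and start without enforcement.
var errNoRequestHeaderCA = errors.New("requestheader-client-ca-file missing or empty")

// caPoolFromConfigMapData (hypothetical helper) builds a CertPool from the
// PEM bundle stored under requestHeaderCAKey in a ConfigMap's Data map.
func caPoolFromConfigMapData(data map[string]string) (*x509.CertPool, error) {
	pemBundle, ok := data[requestHeaderCAKey]
	if !ok || pemBundle == "" {
		return nil, errNoRequestHeaderCA
	}
	pool := x509.NewCertPool()
	// AppendCertsFromPEM reports false when no certificate could be parsed.
	if !pool.AppendCertsFromPEM([]byte(pemBundle)) {
		return nil, errors.New("no valid PEM certificates under " + requestHeaderCAKey)
	}
	return pool, nil
}
```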

Testing

Existing unit tests pass (go test ./pkg/util/apiserver/...). The authenticatedHandler nil-CA path ensures existing tests continue to work without TLS setup.

Integration / e2e: the TLS ClientAuth setting (RequestClientCert) is transparent to callers that don't present a cert (webhook callers), and kube-apiserver always presents the requestheader cert when proxying aggregation-layer requests.


@markmandel
Member

/gcbrun

@agones-bot
Collaborator

Build Failed 😭

Build Id: c818caef-47eb-44ed-90a9-b5a07b23a94f

Status: FAILURE

To get permission to view the Cloud Build view, join the agones-discuss Google Group.

@adilburaksen
Author

These failures appear to be pre-existing infrastructure issues unrelated to this PR's changes:

Step #26 (upgrade-test): The build fails with code=404, Not found: projects/agones-images/locations/us-east1/clusters/standard-upgrade-test-cluster-1-31. The GKE cluster does not exist — this is an infrastructure issue.

Step #28 (e2e-test): Of the four clusters tested, three pass and one fails:

  • generic-1.33: SUCCESS
  • gke-autopilot-1.34: SUCCESS
  • generic-1.35: SUCCESS
  • gke-autopilot-1.35: FAILURE

Since generic-1.35 and gke-autopilot-1.35 run the same Kubernetes version and our change (adding RequestClientCert to the TLS config and the authenticatedHandler middleware) would affect all clusters equally, the gke-autopilot-1.35-specific failure looks like a pre-existing flaky test or autopilot-specific infrastructure issue rather than a regression introduced here.

Happy to re-trigger (/gcbrun) if that would be helpful to confirm.

@markmandel
Member

/gcbrun

It's a flake we see on autopilot sometimes.

VERBOSE: time="2026-05-23 05:04:19.565" level=info msg="2026/05/23 05:04:18 Starting TCP server, listening on port 7654" options="&PodLogOptions{Container:game-server,Follow:false,Previous:false,SinceSeconds:nil,SinceTime:<nil>,Timestamps:false,TailLines:nil,LimitBytes:nil,InsecureSkipTLSVerifyBackend:false,Stream:nil,}" test=TestGameServerTcpProtocol
VERBOSE: time="2026-05-23 05:04:19.565" level=info msg="2026/05/23 05:04:18 Marking this server as ready" options="&PodLogOptions{Container:game-server,Follow:false,Previous:false,SinceSeconds:nil,SinceTime:<nil>,Timestamps:false,TailLines:nil,LimitBytes:nil,InsecureSkipTLSVerifyBackend:false,Stream:nil,}" test=TestGameServerTcpProtocol
VERBOSE: time="2026-05-23 05:04:19.565" level=info msg="---End of container logs---" options="&PodLogOptions{Container:game-server,Follow:false,Previous:false,SinceSeconds:nil,SinceTime:<nil>,Timestamps:false,TailLines:nil,LimitBytes:nil,InsecureSkipTLSVerifyBackend:false,Stream:nil,}" test=TestGameServerTcpProtocol
VERBOSE: time="2026-05-23 05:04:19.623" level=warning msg="Error opening log stream for container" error="previous terminated container \"game-server\" in pod \"game-serverxzq28\" not found" options="&PodLogOptions{Container:game-server,Follow:false,Previous:true,SinceSeconds:nil,SinceTime:<nil>,Timestamps:false,TailLines:nil,LimitBytes:nil,InsecureSkipTLSVerifyBackend:false,Stream:nil,}" test=TestGameServerTcpProtocol
VERBOSE:     gameserver_test.go:1068: 
VERBOSE:         	Error Trace:	/go/src/agones.dev/agones/test/e2e/gameserver_test.go:1068
VERBOSE:         	Error:      	Received unexpected error:
VERBOSE:         	            	dial tcp 35.227.71.13:7855: connect: connection refused
VERBOSE:         	Test:       	TestGameServerTcpProtocol
VERBOSE: --- FAIL: TestGameServerTcpProtocol (2.64s)

Tempted to add a retry in there for CI, but it shouldn't be needed -- but I digress.

@markmandel
Member

Also noting, you will need to sign your commits per DCO.

…t CA

The extensions HTTPS server exposes the allocation resource handler
(/apis/allocation.agones.dev/v1/namespaces/...) to any in-cluster
workload that can reach agones-controller-service:443 directly,
bypassing Kubernetes RBAC for the aggregated resource.

Fix using the standard Kubernetes aggregation-layer authentication
protocol (https://kubernetes.io/docs/tasks/extend-kubernetes/configure-aggregation-layer/):

* pkg/util/https/server.go — add ClientAuth: RequestClientCert so the
  TLS handshake forwards the client certificate (from the kube-apiserver
  proxy) into r.TLS.PeerCertificates without requiring callers that do
  not present a cert (e.g. webhooks) to do so.

* pkg/util/apiserver/apiserver.go — add SetRequestHeaderCA(*x509.CertPool)
  and an authenticatedHandler wrapper. The wrapper verifies the leaf
  certificate in r.TLS.PeerCertificates against the CA pool with
  ExtKeyUsageClientAuth, mirroring the approach in
  k8s.io/apiserver/pkg/authentication/request/x509. Discovery and
  OpenAPI handlers are left unauthenticated. If no CA is configured
  (e.g. in unit tests) the handler is called directly.

* cmd/extensions/main.go — load requestheader-client-ca-file from the
  kube-system/extension-apiserver-authentication ConfigMap (populated by
  Kubernetes for every aggregated API server) and wire it into the
  APIServer via SetRequestHeaderCA. A log warning is emitted if the
  ConfigMap is unavailable; auth is then effectively disabled for that
  start-up, preserving the previous behaviour while making the failure
  visible.

Fixes: agones-dev#4572
Signed-off-by: adilburaksen <adilburaksen@gmail.com>
adilburaksen force-pushed the fix/allocation-rbac-requestheader-auth branch from 08a3dfe to 8bc0208 on May 23, 2026 19:45
@agones-bot
Collaborator

Build Succeeded 🥳

Build Id: 5c38641c-8e08-4493-bc28-2d26dea689a5

The following development artifacts have been built, and will exist for the next 30 days:

A preview of the website (the last 30 builds are retained):

To install this version:

git fetch https://github.com/googleforgames/agones.git pull/4582/head:pr_4582 && git checkout pr_4582
helm install agones ./install/helm/agones --namespace agones-system --set agones.image.registry=us-docker.pkg.dev/agones-images/ci --set agones.image.tag=1.58.0-dev-08a3dfe
