Description
Several TestCounters subtests are intermittently failing in the e2e-stable CI step on a GKE Autopilot cluster because the Agones SDK sidecar gRPC server on 127.0.0.1:9357 returns connection refused instead of the expected domain-level error responses.
CI Build
Logs: https://console.cloud.google.com/cloud-build/builds/fbedf122-d973-46db-ba07-1fbe141601ae;step=2?project=agones-images
- Cloud product: gke-autopilot
- Feature gates active: CountsAndLists=true, SidecarContainers=true, GKEAutopilotExtendedDurationPods=true, DisableResyncOnSDKServer=true
Note: the total failure count in this run (20) exceeded --rerun-fails-max-failures=10, so no automatic re-runs were attempted.
Failing Tests
All failing subtests share the same pattern — the test sends a UDP message to simple-game-server, which then calls the Agones SDK gRPC server on 127.0.0.1:9357. The expected response is a specific SDK-level error (e.g. out-of-range), but instead the game server returns a connection refused error:
Error Trace: /go/src/agones.dev/agones/test/e2e/gameserver_test.go:1651
Error: Not equal:
expected: "ERROR: could not increment Counter games by amount 50: rpc error: code = Unknown desc = out of range. Count must be within range [0,Capacity]. Found Count: 51, Capacity: 50\n"
actual : "ERROR: could not increment Counter games by amount 50: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:9357: connect: connection refused\"\n"
--- FAIL: TestCounters/IncrementCounter_Past_Capacity (0.08s)
Affected subtests:
TestCounters/IncrementCounter_Past_Capacity
TestCounters/IncrementCounter_Negative
TestCounters/DecrementCounter_Past_Capacity
TestCounters/SetCounterCount_Past_Capacity
TestCounters/SetCounterCount_Past_Zero
Root Cause Analysis
TestCounters creates a single GameServer, waits for it to reach Ready, then runs all subtests against that shared GameServer. The failing subtests are the ones that trigger an SDK gRPC call from inside simple-game-server to the Agones SDK server on 127.0.0.1:9357.
A connection refused error on a loopback address means nothing is listening on the SDK server gRPC port at the time of the call. With SidecarContainers=true, the SDK server runs as a proper Kubernetes sidecar container. A likely cause is a race condition: the SDK server has processed the Ready() call (causing the GameServer to transition to the Ready state), but the gRPC listener on port 9357 is momentarily unavailable afterwards. This could happen, for example, because of a brief disconnect/reconnect cycle in the SDK server, or because the listener has not yet been re-established after some internal state change triggered by the Ready() transition.
Since the subtests iterate over a Go map (random order), the failing ones are those that happen to be scheduled during the brief window when the port is unavailable.
This may be more likely to surface on GKE Autopilot due to the additional scheduling and networking latency compared to standard GKE.
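To help confirm the diagnosis above, the following is a minimal sketch of how a process inside the pod could tell "nothing is listening" apart from other dial failures. It is illustrative only: probeLoopback and demo are hypothetical helpers, not part of Agones or simple-game-server, and the sketch uses an ephemeral loopback port rather than 127.0.0.1:9357.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"syscall"
	"time"
)

// probeLoopback (hypothetical helper) dials addr once and classifies the result:
// "listening" if the TCP connection was accepted, "refused" if the kernel
// rejected it because no socket is listening, or "other: ..." for anything else.
func probeLoopback(addr string, timeout time.Duration) string {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err == nil {
		conn.Close()
		return "listening"
	}
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "refused"
	}
	return "other: " + err.Error()
}

// demo probes the same loopback port twice: once while a listener is bound,
// and once right after the listener has been closed.
func demo() (bound, unbound string, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0") // ephemeral port, not 9357
	if err != nil {
		return "", "", err
	}
	addr := ln.Addr().String()
	bound = probeLoopback(addr, time.Second)
	ln.Close()
	unbound = probeLoopback(addr, time.Second)
	return bound, unbound, nil
}

func main() {
	bound, unbound, err := demo()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("while bound:", bound)
	fmt.Println("after close:", unbound)
}
```

Logging a classification like this at the moment of failure would show whether the listener is really gone, as opposed to a timeout or a reset on an established connection.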
Potential Solutions / Areas for Exploration
- Investigate SDK server stability after Ready() with SidecarContainers=true: Check whether the SDK server gRPC listener on port 9357 can ever become temporarily unavailable after the Ready() call is processed. A connection refused on loopback suggests the listener has stopped; this should not be possible for a stable sidecar, so it warrants investigation.
- Investigate DisableResyncOnSDKServer=true interaction: This feature gate is active in the e2e-stable run. It is worth confirming whether disabling SDK server resyncs has any effect on port availability around state transitions.
- Add retry in simple-game-server for SDK gRPC calls: The game server could retry the SDK connection on Unavailable errors with a short backoff before returning the error to the caller. This would make the test more resilient to transient SDK server unavailability.
- Add a post-Ready SDK connectivity check in the test framework: Before CreateGameServerAndWaitUntilReady returns, verify the SDK server port is actually accepting connections, ensuring the GameServer is truly ready for SDK interactions.
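The retry idea above could look roughly like the sketch below. It is written against the standard library only, so the SDK call and its errors are simulated: retryTransient, simulate, errUnavailable and errOutOfRange are all hypothetical names, not existing simple-game-server code. In a real implementation the isTransient predicate would presumably check status.Code(err) == codes.Unavailable from google.golang.org/grpc. Only transient errors are retried; domain errors such as out-of-range are returned immediately, because the tests assert on them.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for a gRPC Unavailable error and an SDK domain error.
var (
	errUnavailable = errors.New("unavailable: connection refused")
	errOutOfRange  = errors.New("out of range")
)

// retryTransient calls op up to attempts times. It returns as soon as op
// succeeds or fails with an error that isTransient rejects, and it doubles
// the backoff between attempts. If ctx is cancelled while waiting, it
// returns ctx.Err(). If every attempt fails transiently, the last error
// is returned.
func retryTransient(ctx context.Context, attempts int, backoff time.Duration,
	isTransient func(error) bool, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil || !isTransient(err) {
			return err
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2
	}
	return err
}

// simulate models an SDK call that fails with errUnavailable for the first
// `failures` calls and then returns the domain error errOutOfRange.
// It reports how many calls were made and the final error.
func simulate(failures int) (calls int, err error) {
	op := func() error {
		calls++
		if calls <= failures {
			return errUnavailable
		}
		return errOutOfRange
	}
	isTransient := func(e error) bool { return errors.Is(e, errUnavailable) }
	err = retryTransient(context.Background(), 4, time.Millisecond, isTransient, op)
	return calls, err
}

func main() {
	calls, err := simulate(2)
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

A bounded attempt count keeps a genuinely dead SDK server visible as a failure instead of hiding it behind indefinite retries. For the post-Ready connectivity check, note that the test process runs outside the pod and cannot dial 127.0.0.1:9357 directly, so any such probe would have to execute inside the pod; how best to trigger it is left open here.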