Description:
We are observing a regression in the xDS client state when an ADS stream restarts. If the management server (Google Traffic Director) provides a stale version_info on the new stream (lower/older than the version previously ACKed on the closed stream), the gRPC Java xDS client incorrectly concludes that currently active resources have been deleted. This leads to the client unwatching the resources and moving the channel into TRANSIENT_FAILURE, causing RPC drops.
Environment:
- gRPC Java Version: 1.80.0-SNAPSHOT
- Control Plane: Google Traffic Director (Cloud Service Mesh)
- Protocol: xDS v3 (ADS / SotW)
Steps to Reproduce / Log Analysis:
- Initial State (Stream A): The client is subscribed to two clusters: ...711645838861579111 (Cluster A) and ...8058834216449389430 (Cluster B). The client receives and ACKs CDS Version: 1771507339513814471 (Nonce 3). Note: In this specific trace, the server response only contained Cluster B, causing an initial (perhaps intended) unsubscription from Cluster A.
- Stream Restart: The ADS stream restarts. The nonce counter resets to 1.
- Regression (Stream B): The client sends a DiscoveryRequest for Cluster B. The server responds with CDS Version: 1771507283347080789 (Nonce 1). Critical Issue: This version_info is numerically lower than the version previously ACKed on Stream A (1771507283347080789 vs. 1771507339513814471). This stale version contains Cluster A but not Cluster B.
- Failure: The xDS client processes the nonce: 1 response. Because it is a "State of the World" (SotW) update and Cluster B is missing from the resources list, the client concludes Cluster B no longer exists:
  [xds-client<9>] Conclude ... resource ...8058834216449389430 not exist
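To make the failure step concrete, here is a minimal model of the SotW reconciliation as I understand it from the trace. It is not the actual grpc-java XdsClient code; the class name, method name, and the "cluster-A" / "cluster-B" labels are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of SotW handling; NOT the real grpc-java implementation. */
final class SotwDeletionSketch {
  /**
   * Returns the subscribed resource names that are absent from a SotW response.
   * In a State-of-the-World update these are the resources a client would
   * conclude no longer exist.
   */
  static Set<String> concludedMissing(Set<String> subscribed, Set<String> inResponse) {
    Set<String> missing = new LinkedHashSet<>(subscribed);
    missing.removeAll(inResponse);
    return missing;
  }
}
```

With the Stream B state from the trace (subscribed to Cluster B only, response containing Cluster A only), this returns Cluster B, which matches the "not exist" log line above.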
Expected Behavior: The xDS client should perform a "version sanity check" across stream restarts. If a new stream provides a version_info that is chronologically or numerically older than the last successfully applied version, the client should either:
- Ignore the stale update, or
- Log a warning and wait for a more recent version before performing destructive deletions of active resources.
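A rough sketch of what such a check could look like. This is only an illustration under a strong assumption: that version_info is a decimal number that grows over time, as it appears to in this trace. xDS generally treats version_info as an opaque string, so a real fix may need a different signal. StaleVersionGuard and its methods are hypothetical names, not existing grpc-java APIs.

```java
import java.math.BigInteger;

/** Hypothetical guard that remembers the highest applied version across stream restarts. */
final class StaleVersionGuard {
  private BigInteger lastApplied; // null until a version has been applied

  /** Returns true if versionInfo is numerically lower than the last applied version. */
  boolean isStale(String versionInfo) {
    BigInteger candidate = parse(versionInfo);
    // Non-numeric versions cannot be ordered under this assumption, so never flag them.
    return candidate != null && lastApplied != null && candidate.compareTo(lastApplied) < 0;
  }

  /** Records a version after the update has been applied; keeps the maximum seen. */
  void recordApplied(String versionInfo) {
    BigInteger applied = parse(versionInfo);
    if (applied != null && (lastApplied == null || applied.compareTo(lastApplied) > 0)) {
      lastApplied = applied;
    }
  }

  private static BigInteger parse(String versionInfo) {
    try {
      return new BigInteger(versionInfo);
    } catch (NumberFormatException | NullPointerException e) {
      return null; // opaque or missing version
    }
  }
}
```

With the versions from this trace, recording 1771507339513814471 from Stream A and then checking 1771507283347080789 from Stream B reports the update as stale, at which point the client could skip the deletion step and log a warning instead.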
Actual Behavior:
2026-02-19T13 21 07 885258 00 00 ns psm-ds-client.txt
The client accepts the stale version as the new "truth" for the stream. Since the requested resource is missing in that stale snapshot, the client deletes the resource locally, leading to TRANSIENT_FAILURE.