xDS: Stale version_info on ADS stream restart causes incorrect resource deletion and TRANSIENT_FAILURE #12663

@kannanjgithub

Description:

We are observing a regression in the xDS client state when an ADS stream restarts. If the management server (Google Traffic Director) provides a stale version_info on the new stream (lower/older than the version previously ACKed on the closed stream), the gRPC Java xDS client incorrectly concludes that currently active resources have been deleted. This leads to the client unwatching the resources and moving the channel into TRANSIENT_FAILURE, causing RPC drops.

Environment:

  • gRPC Java Version: 1.80.0-SNAPSHOT

  • Control Plane: Google Traffic Director (Cloud Service Mesh)

  • Protocol: xDS v3 (ADS / SotW)

Steps to Reproduce / Log Analysis:

  1. Initial State (Stream A): The client is subscribed to two clusters: ...711645838861579111 (Cluster A) and ...8058834216449389430 (Cluster B). The client receives and ACKs CDS Version: 1771507339513814471 (Nonce 3). Note: In this specific trace, the server response only contained Cluster B, causing an initial (perhaps intended) unsubscription from Cluster A.

  2. Stream Restart: The ADS stream restarts. The nonce counter resets to 1.

  3. Regression (Stream B): The client sends a DiscoveryRequest for Cluster B. The server responds with CDS Version: 1771507283347080789 (Nonce 1). Critical Issue: This version_info is numerically older than the version previously processed (previous ...3814471 vs. new ...7080789). This stale version contains Cluster A but not Cluster B.

  4. Failure: The xDS client processes the nonce: 1 response. Because it is a "State of the World" (SotW) update and Cluster B is missing from the resources list, the client concludes Cluster B no longer exists:
     [xds-client<9>] Conclude ... resource ...8058834216449389430 not exist
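
The deletion in step 4 follows from how a SotW response is interpreted: any subscribed resource absent from the response is treated as removed. A minimal sketch of that inference is below. The class and method names are hypothetical and are not the gRPC Java implementation, and the short names "cluster-a" and "cluster-b" stand in for the elided resource names in the trace.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the actual gRPC Java XdsClient code.
final class SotwDeletionSketch {
  /**
   * Returns the subscribed resources that a State-of-the-World response
   * implicitly deletes, i.e. those subscribed to but absent from the response.
   */
  static Set<String> impliedDeletions(Set<String> subscribed, Set<String> inResponse) {
    Set<String> missing = new HashSet<>(subscribed);
    missing.removeAll(inResponse);
    return missing;
  }

  public static void main(String[] args) {
    // Stream B: the client is subscribed to Cluster B only, and the stale
    // response carries Cluster A only.
    Set<String> deleted = impliedDeletions(Set.of("cluster-b"), Set.of("cluster-a"));
    System.out.println(deleted); // Cluster B is concluded to not exist
  }
}
```

With the stale snapshot from step 3, the only subscribed resource is the one reported missing, which matches the "not exist" log line above.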

Expected Behavior: The xDS client should perform a "version sanity check" across stream restarts. If a new stream provides a version_info that is chronologically or numerically older than the last successfully applied version, the client should either:

  1. Ignore the stale update.

  2. Log a warning and wait for a more recent version before performing destructive deletions of active resources.
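
A rough sketch of what such a check might look like is below. It is only an illustration under stated assumptions: `StaleVersionGuard` is a hypothetical helper, not an existing gRPC Java class; it assumes version_info parses as a decimal integer that increases over time, as the values in this trace do, which the xDS protocol does not guarantee in general; and it assumes one guard per resource type that outlives the individual ADS stream.

```java
import java.math.BigInteger;

// Hypothetical helper; not part of gRPC Java. One instance per resource type
// (e.g. CDS), kept across ADS stream restarts.
final class StaleVersionGuard {
  private BigInteger lastApplied; // null until the first version is applied

  /** Returns true if the response should be applied, false if it looks stale. */
  boolean shouldApply(String versionInfo) {
    BigInteger incoming;
    try {
      incoming = new BigInteger(versionInfo);
    } catch (NumberFormatException e) {
      // Assumption: a non-numeric version cannot be ordered, so it is not blocked.
      return true;
    }
    if (lastApplied != null && incoming.compareTo(lastApplied) < 0) {
      return false; // older than the last applied version: skip destructive handling
    }
    lastApplied = incoming;
    return true;
  }

  public static void main(String[] args) {
    StaleVersionGuard cds = new StaleVersionGuard();
    System.out.println(cds.shouldApply("1771507339513814471")); // Stream A, Nonce 3
    System.out.println(cds.shouldApply("1771507283347080789")); // Stream B, Nonce 1
  }
}
```

Fed the two versions from the trace, the guard applies the first and flags the second as stale, at which point the client could log a warning and keep Cluster B instead of deleting it.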

Actual Behavior:

2026-02-19T13 21 07 885258 00 00 ns psm-ds-client.txt

The client accepts the stale version as the new "truth" for the stream. Since the requested resource is missing in that stale snapshot, the client deletes the resource locally, leading to TRANSIENT_FAILURE.
