Skip to content

fix(gax): adjust RetryAlgorithm logic for timeout v. exception#13243

Draft
whowes wants to merge 5 commits into
mainfrom
whowes/issue-12373-flaky
Draft

fix(gax): adjust RetryAlgorithm logic for timeout v. exception#13243
whowes wants to merge 5 commits into
mainfrom
whowes/issue-12373-flaky

Conversation

@whowes

@whowes whowes commented May 20, 2026

Copy link
Copy Markdown
Contributor

The current RetryAlgorithm logic is susceptible to an edge case if a short RPC deadline is set near the end of polling. If an exception results from a too-short deadline, the exception thrown by the associated Future will be an ExecutionException rather than e.g. CancellationException (for LRO polling). This likely causes flakiness observed in #12373 and #3277.

This change ensures that shouldRetryBasedOnTiming executes if the operation hasn't completed successfully, even when shouldRetryBasedOnResult is false (e.g. due to non-retriable RPC exception due to the short deadline). This way a timeout-related exception will be thrown instead if the overall timeout has been reached.

Resolves #12373

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the RetryAlgorithm to short-circuit the retry check when an operation succeeds and the result algorithm indicates no retry is necessary. This change prevents the timing algorithm from erroneously throwing a CancellationException on successful operations. New unit tests were added to verify both the short-circuiting behavior and the standard path. Feedback suggests adding a missing assertion to one of the new test cases to explicitly verify the return value of the retry decision.

@whowes whowes force-pushed the whowes/issue-12373-flaky branch 2 times, most recently from 3cff2d3 to 54b9310 Compare May 20, 2026 20:40
@whowes

whowes commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the RetryAlgorithm.shouldRetry method to short-circuit when an operation has succeeded. Specifically, if the result-based retry check returns false and there is no previous exception, the method now returns false immediately without calling the timing-based retry check. This change prevents the timing-based check from erroneously throwing a CancellationException on successful operations. Corresponding unit tests were added to verify the short-circuiting behavior and ensure timing checks still occur when an error is present. I have no feedback to provide as there were no review comments to evaluate.

@whowes whowes requested a review from blakeli0 May 21, 2026 02:29
@whowes whowes force-pushed the whowes/issue-12373-flaky branch from 239df6a to 17c1ee4 Compare May 21, 2026 17:07
@sonarqubecloud

Copy link
Copy Markdown

@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed for 'gapic-generator-java-root'

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@lqiu96 lqiu96 self-assigned this Jun 2, 2026
@lqiu96 lqiu96 self-requested a review June 2, 2026 15:21
Comment on lines +166 to +169
// exception, defer to any exception thrown by shouldRetryBasedOnTiming.
// This lets a timeout-related exception get propagated when justified
// rather than e.g. exceptions caused by very short RPC deadlines on a
// final polling attempt.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

qq, I'm not sure I'm following this comment. I might not be understanding this fully.

What do you mean by defer to any exception thrown by shouldRetryBasedOnTiming`? Is this referring to an exception thrown by the ExponentialRetryAlgorithm?

IIRC, short RPC deadlines that can't be processed in time will result in a DEADLINE_EXCEEDED status code and that may or may not be retried as a retryable status code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Is this referring to an exception thrown by the ExponentialRetryAlgorithm?" Not by ExponentialRetryAlgorithm itself, but by its subclass OperationTimedPollAlgorithm used for LRO polling, which throws a CancellationException when the cumulative timeout for all polls is exhausted.

With the current structure of RetryAlgorithm.createNextAttempt() the CancellationException-throwing codepath doesn't get reached if the final LRO polling RPC fails. That results in an ExecutionException getting propagated instead, which is misleading when that failure is triggered by an overly-truncated deadline. The mismatch of expectations is IMO likely at the root of the test flakes from #12373 and #3277.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[sdk-platform-java] flaky test: showcase test ITLongRunningOperation

2 participants