Skip to content

HDDS-14830. Handle interrupt gracefully in XceiverClientGrpc.sendCommandWithRetry#10178

Open
chihsuan wants to merge 1 commit intoapache:masterfrom
chihsuan:HDDS-14830
Open

HDDS-14830. Handle interrupt gracefully in XceiverClientGrpc.sendCommandWithRetry#10178
chihsuan wants to merge 1 commit intoapache:masterfrom
chihsuan:HDDS-14830

Conversation

@chihsuan
Copy link
Copy Markdown
Contributor

@chihsuan chihsuan commented May 2, 2026

What changes were proposed in this pull request?

XceiverClientGrpc.sendCommandWithRetry retries a request across the pipeline's datanodes and rethrows the last captured ioException after the loop. The catch (InterruptedException) branch logged the error and restored the interrupt flag, but did not assign ioException or exit the loop. With the flag restored, each subsequent Future.get() throws InterruptedException immediately, so the loop falls through every datanode and exits with ioException == null. The post-loop Objects.requireNonNull(ioException, ...) then throws NullPointerException, masking the real interrupt — which is what the EC reconstruction worker pool sees during datanode shutdown.

The fix throws InterruptedIOException (with the original InterruptedException as cause) directly from the catch block. This matches the existing convention already used in XceiverClientGrpc.sendCommand (lines 396-411) and XceiverClientSpi.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14830

How was this patch tested?

New unit test TestXceiverClientGrpc#testInterruptedCommandThrowsInterruptedIOException reproduces the original NPE deterministically (pre-sets the test thread's interrupt flag and stubs sendCommandAsync to return a never-completing future) and asserts:

  • InterruptedIOException is thrown — the regression that this PR fixes (was NullPointerException).
  • The cause is the original InterruptedException — debugging info preserved.
  • Thread.currentThread().isInterrupted() is still true — callers can still detect the interrupt.

The flag is cleared in finally before client.close() runs, so channel shutdown is not affected by the test setup.

The test fails on master without the production-code change in this PR, confirming it exercises the actual bug path:

Screenshot 2026-05-02 at 7 47 28 PM

Local checks: mvn checkstyle:check -pl hadoop-hdds/client,hadoop-ozone/integration-test → 0 violations.
Full CI: Runs via the build-branch workflow on the fork. https://github.com/chihsuan/ozone/actions/runs/25251708235

@chihsuan chihsuan marked this pull request as ready for review May 2, 2026 13:45
@ivandika3 ivandika3 requested a review from jojochuang May 2, 2026 14:05
Copy link
Copy Markdown
Contributor

@Gargi-jais11 Gargi-jais11 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @chihsuan for working on this. LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants