
KAFKA-20403 : streams - Fix stream threads interruptions #21970

Open
muralibasani wants to merge 6 commits into apache:trunk from muralibasani:KAFKA-20403

Contributor

@muralibasani muralibasani commented Apr 4, 2026

https://issues.apache.org/jira/browse/KAFKA-20403

As per the ticket, this adds Thread.currentThread().interrupt() together with a log.warn where InterruptedException is caught.
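
For readers less familiar with the idiom, here is a minimal sketch of what "interrupt together with a warning" amounts to, outside of Kafka Streams. The class name `InterruptRestoreSketch` and the helper `awaitQuietly` are hypothetical, and `System.err` stands in for the logger; this is an illustration under those assumptions, not the code changed by this PR.

```java
import java.util.concurrent.CountDownLatch;

public class InterruptRestoreSketch {

    // Hypothetical helper: waits on a latch and reports whether the wait completed.
    // On interruption it restores the interrupt flag instead of swallowing it.
    public static boolean awaitQuietly(final CountDownLatch latch) {
        try {
            latch.await();
            return true;
        } catch (final InterruptedException e) {
            // Throwing InterruptedException clears the flag, so set it again to let
            // code further up the stack still observe the interruption.
            Thread.currentThread().interrupt();
            // Stand-in for log.warn(...)
            System.err.println("WARN: interrupted while waiting: " + e);
            return false;
        }
    }

    public static void main(final String[] args) throws InterruptedException {
        final CountDownLatch neverCountedDown = new CountDownLatch(1);
        final boolean[] flagSeen = new boolean[1];
        final Thread worker = new Thread(() -> {
            awaitQuietly(neverCountedDown);
            // Record whether the flag survived the catch block.
            flagSeen[0] = Thread.currentThread().isInterrupted();
        });
        worker.start();
        worker.interrupt();
        worker.join();
        System.out.println("interrupt flag preserved: " + flagSeen[0]);
    }
}
```

Without the `Thread.currentThread().interrupt()` call, the caller of `awaitQuietly` would have no way to tell that an interruption had been requested.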

@github-actions bot added the triage, streams, and small labels on Apr 4, 2026
@mjsax added the ci-approved label and removed the triage label on Apr 4, 2026
Member

@mjsax mjsax left a comment

Thanks for the PR. Nice catch!

    version.topologyCV.await();
} catch (final InterruptedException e) {
    Thread.currentThread().interrupt();
    log.error("StreamThread was interrupted while waiting on empty topology", e);
Member

Should this be a warn-log only?

    }
    return result;
} catch (final InterruptedException ignored) {
    Thread.currentThread().interrupt();
Member

Should we add a warn-log?

    Thread.currentThread().interrupt();
    // we interrupt the thread for shut down and pause.
    // we can ignore this exception.
    log.debug("Await unblocked: Interrupted while waiting for processable tasks");
Member

Should this also be a warn-log?

Contributor Author

Could we keep debug here, since the interrupt is expected on shutdown or pause?

Member

Digging into the code, I am actually not sure when we would interrupt a thread, so I am a little unsure about the existing comment: "we interrupt the thread for shut down and pause".

Contributor Author

That looks like a pre-existing comment. I have removed it.

As for the await-unblock case: since it is not an error, isn't log.debug better than log.warn?

@muralibasani
Contributor Author

@mjsax thank you for the review. Pushed changes.

@muralibasani changed the title from "KAFKA-20403 : Fix stream threads interruptions" to "KAFKA-20403 : streams - Fix stream threads interruptions" on Apr 5, 2026
@@ -147,8 +147,8 @@ public void awaitProcessableTasks(final Supplier<Boolean> isShuttingDown) throws
log.debug("Not awaiting since shutdown was requested");
}
} catch (final InterruptedException ignored) {
Contributor

Consider handling InterruptedException in waitIfAllChangelogsCompletelyRead the same way as the other sites: Thread.currentThread().interrupt() (+ optional warn).

Contributor Author

As mentioned in the description, that path (waitIfAllChangelogsCompletelyRead) is never triggered. Since it is a private method, restoring the interrupt there would turn the wait into a tight loop unless we also add a break, which seems unnecessary IMO. I would prefer to leave it as is.
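
The tight-loop concern can be reproduced in isolation. The sketch below is not the Kafka Streams method; `TightLoopSketch` and `spinCount` are hypothetical names, and the loop is capped by `maxIterations` only so the demonstration terminates. It relies on the documented behavior that `Condition.await()` throws `InterruptedException` immediately when the interrupt flag is already set on entry.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class TightLoopSketch {

    // Counts how often await() throws InterruptedException within maxIterations.
    // Call this only with the interrupt flag already set; otherwise await() blocks.
    public static int spinCount(final boolean breakOnInterrupt, final int maxIterations) {
        final ReentrantLock lock = new ReentrantLock();
        final Condition condition = lock.newCondition();
        int interrupts = 0;
        lock.lock();
        try {
            for (int i = 0; i < maxIterations; i++) {
                try {
                    condition.await(); // throws at once if the flag is already set
                } catch (final InterruptedException e) {
                    interrupts++;
                    Thread.currentThread().interrupt(); // restore the flag
                    if (breakOnInterrupt) {
                        break; // without this, the next await() throws again
                    }
                }
            }
        } finally {
            lock.unlock();
        }
        return interrupts;
    }

    public static void main(final String[] args) {
        Thread.currentThread().interrupt();
        System.out.println("without break: " + spinCount(false, 1000) + " interrupts");
        System.out.println("with break:    " + spinCount(true, 1000) + " interrupts");
        Thread.interrupted(); // clear the flag before exiting
    }
}
```

Restoring the flag inside a wait loop therefore has to be paired with leaving the loop (a `break`, a `return`, or an exception); otherwise the thread spins.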

Member

@mjsax mjsax Apr 6, 2026

Not sure if I follow the argument?

In the end, KS code does not call interrupt() by itself. Following this argument, InterruptedException would not need to be handled anywhere, because it should never happen. It's still good to add a defensive guard for it, as proposed on this PR.

I was digging around a little bit, and found other existing code doing:

} catch (final InterruptedException fatalException) {
    // this should not happen; if it ever happens it indicate a bug
    Thread.currentThread().interrupt();
    log.error(INTERRUPTED_ERROR_MESSAGE, fatalException);
    throw new IllegalStateException(INTERRUPTED_ERROR_MESSAGE, fatalException);
}

So maybe we should follow this pattern everywhere (maybe except in shutdown code path)?

Or, considering the comment below:

// we interrupt the thread for shut down and pause.

We might want to always handle InterruptedException gracefully and remove existing code that treats it as fatal?

Contributor

Agree we should still handle interrupts defensively here.

For waitIfAllChangelogsCompletelyRead, IMHO graceful handling (restore interrupt + leave the await() loop) is preferable over the fatal IllegalStateException style. Shutdown already uses isRunning + signalAll(), and this is an idle condition wait, not a “must complete this step” path.
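
To make the "restore interrupt + leave the await() loop" option concrete, here is a rough sketch of an idle condition wait whose normal wake-up path is a running flag plus signalAll(). All names (`IdleWaitSketch`, `awaitWork`, `offerWork`, `shutdown`) are made up for illustration and do not mirror the actual Kafka Streams classes.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class IdleWaitSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition workAvailable = lock.newCondition();
    private final AtomicBoolean isRunning = new AtomicBoolean(true);
    private boolean hasWork = false; // guarded by lock

    // Returns true if work is available, false if woken by shutdown or interrupt.
    public boolean awaitWork() {
        lock.lock();
        try {
            while (isRunning.get() && !hasWork) {
                try {
                    workAvailable.await();
                } catch (final InterruptedException e) {
                    // Graceful handling: restore the flag and leave the loop.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            return hasWork;
        } finally {
            lock.unlock();
        }
    }

    public void offerWork() {
        lock.lock();
        try {
            hasWork = true;
            workAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Regular shutdown path: no interrupt needed, just flip the flag and signal.
    public void shutdown() {
        isRunning.set(false);
        lock.lock();
        try {
            workAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }

    public static void main(final String[] args) throws InterruptedException {
        final IdleWaitSketch sketch = new IdleWaitSketch();
        final Thread waiter = new Thread(() ->
            System.out.println("work arrived: " + sketch.awaitWork()));
        waiter.start();
        sketch.shutdown(); // wakes the waiter without any interrupt
        waiter.join();
    }
}
```

In this shape an interrupt is only a defensive exit: the caller sees `false`, can check the restored flag, and decides what to do next.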

Contributor Author

Agreed on being consistent. Updated to restore the interrupt here.

    Thread.currentThread().interrupt();
    log.warn("Interrupted while waiting for tasks {} to be locked",
        ids.stream().map(TaskId::toString).collect(Collectors.joining(",")));
    break;
Contributor

One concern on TaskManager.maybeLockTasks: exiting the loop with a break after InterruptedException means the method returns without the lock having been acquired, but every caller still continues as if it had (handleCorruption, handleAssignment, handleRevocation, closeRunningTasksDirty, closeAndCleanUpTasks, commit, etc.). When schedulingTaskManager != null, that can allow the stream thread to do commit/suspend/close work while task executors might still be running the same tasks.

Could we Thread.currentThread().interrupt() as you have, but then fail the operation instead of falling through (e.g. throw new StreamsException(..., e) or another unchecked error that fits this module) rather than break? That keeps the “don’t swallow interrupt” fix without proceeding into the locked critical section without a lock.
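
A small sketch of the "restore the flag, then fail instead of falling through" shape, under stated assumptions: `LockOrFailSketch` and `awaitLocked` are hypothetical, a plain `Future<Void>` stands in for whatever the real code waits on, and `IllegalStateException` stands in for `StreamsException` so the example has no Kafka dependency.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class LockOrFailSketch {

    // Hypothetical stand-in for "wait until the tasks are locked".
    // Returns normally only if the future completed, i.e. the lock is held.
    public static void awaitLocked(final Future<Void> lockFuture) {
        try {
            lockFuture.get();
        } catch (final InterruptedException e) {
            // Do not swallow the interrupt...
            Thread.currentThread().interrupt();
            // ...and do not fall through into the critical section either.
            // IllegalStateException is a stand-in for StreamsException here.
            throw new IllegalStateException("Interrupted while waiting for tasks to be locked", e);
        } catch (final ExecutionException e) {
            throw new IllegalStateException("Failed to lock tasks", e.getCause());
        }
    }

    public static void main(final String[] args) {
        Thread.currentThread().interrupt();
        try {
            awaitLocked(new CompletableFuture<>()); // never completes
            System.out.println("entered critical section"); // not reached when interrupted
        } catch (final IllegalStateException e) {
            System.out.println("operation failed: " + e.getMessage());
        } finally {
            Thread.interrupted(); // clear the flag before exiting
        }
    }
}
```

The trade-off raised in the thread still applies: every caller then has to be prepared for the unchecked exception.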

Contributor Author

Good point. I have reverted the interrupt that I introduced here.
Rather than throwing (which a few of the callers do not handle),
I think it is better to ignore the interrupt and go back to the previous behavior.


3 participants