KAFKA-20372: Fix flaky test in NamedTopologyIntegrationTest#21968

Open
chickenchickenlove wants to merge 1 commit into apache:trunk from chickenchickenlove:KAFKA-20372-2
Conversation

Contributor

@chickenchickenlove chickenchickenlove commented Apr 4, 2026

Description

  • Fix flaky `NamedTopologyIntegrationTest#shouldAddToEmptyInitialTopologyRemoveResetOffsetsThenAddSameNamedTopologyWithRepartitioning`
```java
...
assertThat(waitUntilMinKeyValueRecordsReceived(consumerConfig, COUNT_OUTPUT, 5), equalTo(COUNT_OUTPUT_DATA));
assertThat(waitUntilMinKeyValueRecordsReceived(consumerConfig, SUM_OUTPUT, 5), equalTo(SUM_OUTPUT_DATA));
...
```

This PR fixes a flaky test caused by not deterministically initializing all tasks in the topology.

The current test only sends records from `STANDARD_INPUT_DATA` to partition 0. Because of that, the existing condition only guarantees that the task for partition 0 has been initialized.

However, the test setup contains both partition 0 and partition 1.
If the task for partition 1 has not been initialized yet, it may remain in the state-updating phase and only later become an active task after state restore completes.
If `streams.removeNamedTopology(...)` is called during that window, only the already-initialized tasks are removed; the task for partition 1 can survive as a suspended active task. Then, when `streams.cleanUpNamedTopology(...)` is called, the local state directory is deleted.

Later, when `streams.addNamedTopology(...)` is invoked again, the surviving task transitions from suspended to restoring. Because its state directory has already been deleted, the restore can fail with a runtime error. This makes the test flaky.

To make the test deterministic, this PR also sends data to partition 1 and waits for those records to be processed. This ensures that the tasks for both partitions 0 and 1 are initialized before `removeNamedTopology(...)` is called.
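The race described above can be sketched with a small stdlib-only model. This is purely illustrative: `Task` and `survivorsAfterRemove` are hypothetical names standing in for the behavior the PR describes, not Kafka Streams APIs.

```java
import java.util.ArrayList;
import java.util.List;

public class RemoveRaceSketch {

    // Hypothetical stand-in for a stream task: one per input partition.
    record Task(int partition, boolean initialized) { }

    // Models the removal behavior the PR describes: only tasks that finished
    // initializing are removed; a still-restoring task survives, suspended.
    static List<Task> survivorsAfterRemove(List<Task> tasks) {
        List<Task> leftover = new ArrayList<>();
        for (Task t : tasks) {
            if (!t.initialized()) {
                leftover.add(t);
            }
        }
        return leftover;
    }

    public static void main(String[] args) {
        // Flaky case: records only reached partition 0, so the partition-1
        // task may still be restoring when the topology is removed.
        List<Task> flaky = List.of(new Task(0, true), new Task(1, false));
        System.out.println("survivors (flaky): " + survivorsAfterRemove(flaky)); // partition-1 task survives

        // Fixed case: records were sent to (and processed from) both
        // partitions, so both tasks are initialized before removal.
        List<Task> fixed = List.of(new Task(0, true), new Task(1, true));
        System.out.println("survivors (fixed): " + survivorsAfterRemove(fixed)); // nothing left behind
    }
}
```

The leftover task is what later trips over the cleaned-up state directory; once both tasks are initialized before removal, nothing survives to hit that path.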

Test results on my local machine

  • Before: 12/17 runs passed.
  • After: 89/89 runs passed.

@github-actions github-actions bot added triage PRs from the community streams tests Test fixes (including flaky tests) labels Apr 4, 2026
@chickenchickenlove
Contributor Author

@lianetm Hi!
When you get a chance, could you take a look?

@chickenchickenlove
Contributor Author

I think a simpler approach would be to change the topic partition count from 2 to 1.

Throughout this test class, we already wait for the tasks for partition 0 to initialize by expecting 5 records at each step of the tests. Also, aside from this particular test, no other test in this class has been flaky.

In other words, all the existing tests in this class appear to rely only on the conditions for partition 0. However, this particular test seems to have unintentionally started depending on partition 1 as well, which I believe is what makes it flaky from time to time.

For that reason, I think changing the partition count from 2 to 1 could be one way to resolve the flaky behavior without changing the meaning of the tests in this class.

If you have a preferred direction, please let me know, and I would be happy to update it accordingly.
