KAFKA-20372: Fix flaky test in NamedTopologyIntegrationTest#21968
KAFKA-20372: Fix flaky test in NamedTopologyIntegrationTest#21968chickenchickenlove wants to merge 1 commit intoapache:trunkfrom
Conversation
|
@lianetm Hi! |
|
I think a simpler approach would be to change the topic partition count from 2 to 1. Throughout this test class, we have already been waiting for the initialization of the tasks related to Partition 0 by expecting 5 records in all parts of the test. Also, aside from the current test, there have been no flaky tests in this class. In other words, all the existing tests in this class appear to rely on the conditions for Partition 0. However, this particular test seems to have unintentionally started depending on Partition 1 as well, which I believe is what makes it flaky from time to time. For that reason, I think changing the partition count from 2 to 1 could be one way to resolve the flaky behavior without changing the meaning of the tests in this class. If you have a preferred direction, please let me know, and I would be happy to update it accordingly. |
Description
NamedTopologyIntegrationTest shouldAddToEmptyInitialTopologyRemoveResetOffsetsThenAddSameNamedTopologyWithRepartitioningThis PR fixes a flaky test caused by not deterministically initializing all tasks in the topology.
The current test only sends records from
STANDARD_INPUT_DATAto partition0. Because of that, the existing condition only guarantees that the task for partition0has been initialized.However, the test setup contains both partition
0and partition1.If the task for partition
1has not been initialized yet, it may remain in the state-updating phase and only later become an active task after state restore completes.If
streams.removeNamedTopology(...)is called during that window, only the initialized tasks are removed, due to the current behavior ofstreams.removeNamedTopology(...). As a result, the task for partition1can remain as a suspended active task. Then, whenstreams.cleanUpNamedTopology(...)is called, the local state directory is deleted.Later, when
streams.addNamedTopology(...)is invoked again, the suspended active task transitions from suspended to restoring. At that point, because the state directory has already been removed, the test can fail with a runtime error. This makes the test flaky.To make the test deterministic, this PR also sends data to partition
1and waits for those records to be processed. This ensures that the tasks for both partitions0and1are initialized beforeremoveNamedTopology(...).Test result in my local
12/17.89/89.