[FLINK-39921][runtime/tests] Fix the flaky test case ExecutionTimeBasedSlowTaskDetectorTest due to unexpected ComponentMainThreadExecutor setting. #28434

Open: och5351 wants to merge 1 commit into apache:master from och5351:feature/FLINK-39921

och5351 (Contributor) commented Jun 14, 2026

What is the purpose of the change

ExecutionTimeBasedSlowTaskDetectorTest was flaky due to two issues.

FLINK-38114 introduced thenComposeAsync(tryGetTaskDeploymentDescriptorForSlot, jobMasterMainThreadExecutor) in Execution.deploy(), which posts TDD creation to the jobMasterMainThreadExecutor after task restore serialization completes in the IO executor. Before FLINK-38114, TDD creation happened synchronously on the main thread.
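The hand-off described above can be illustrated with plain `CompletableFuture` calls. This is a minimal sketch, not Flink code: the two executors and the thread names `io-thread` and `main-thread` are hypothetical stand-ins for the IO executor and the jobMasterMainThreadExecutor, and the string values stand in for the restore data and the TDD.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThenComposeAsyncSketch {

    /** Returns the name of the thread that ran the composed (second) stage. */
    static String runComposedStage() throws Exception {
        // Hypothetical stand-ins for the IO executor and the job master main-thread executor.
        ExecutorService ioExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "io-thread"));
        ExecutorService mainThreadExecutor =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "main-thread"));
        try {
            return CompletableFuture
                    // Stage 1: simulated task restore serialization on the IO executor.
                    .supplyAsync(() -> "serialized-restore", ioExecutor)
                    // Stage 2: simulated TDD creation, handed to the main-thread executor
                    // once stage 1 has completed.
                    .thenComposeAsync(
                            restore -> CompletableFuture.completedFuture(
                                    Thread.currentThread().getName()),
                            mainThreadExecutor)
                    .get();
        } finally {
            ioExecutor.shutdown();
            mainThreadExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Prints "composed stage ran on: main-thread".
        System.out.println("composed stage ran on: " + runComposedStage());
    }
}
```

Note that the composed stage is submitted through `mainThreadExecutor.execute(...)` by whichever thread completes stage 1, which is typically the IO thread. That submission is the call that matters for Issue 1.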

Issue 1 — Wrong ComponentMainThreadExecutor setting

createExecutionGraph() used forMainThread(), which asserts that execute() is called from the registered main (test) thread. When the IO thread triggered the thenComposeAsync callback and called execute(), an AssertionError was thrown, transitioning the execution to FAILED and causing:

java.lang.IllegalStateException: BUG: trying to schedule a region which is not in CREATED state
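The failure mode in Issue 1 can be reproduced in isolation. The sketch below is modelled on the behaviour described in this PR, not on Flink's actual implementation: `CheckingExecutor` and `NoCheckExecutor` are hypothetical classes that play the roles of the `forMainThread()` adapter and `NoMainThreadCheckComponentMainThreadExecutor` respectively.

```java
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicReference;

public class MainThreadCheckSketch {

    /** Hypothetical executor that rejects execute() calls from any thread but its creator. */
    static final class CheckingExecutor implements Executor {
        private final Thread owner = Thread.currentThread();

        @Override
        public void execute(Runnable command) {
            if (Thread.currentThread() != owner) {
                throw new AssertionError(
                        "execute() called from " + Thread.currentThread().getName());
            }
            command.run();
        }
    }

    /** Hypothetical executor that runs the command directly, with no thread check. */
    static final class NoCheckExecutor implements Executor {
        @Override
        public void execute(Runnable command) {
            command.run();
        }
    }

    /** Calls execute() from a separate "io-thread" and reports whether it threw. */
    static boolean failsWhenCalledFromOtherThread(Executor executor)
            throws InterruptedException {
        AtomicReference<Throwable> error = new AtomicReference<>();
        Thread io = new Thread(() -> {
            try {
                executor.execute(() -> { });
            } catch (Throwable t) {
                error.set(t);
            }
        }, "io-thread");
        io.start();
        io.join();
        return error.get() != null;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("checking executor fails: "
                + failsWhenCalledFromOtherThread(new CheckingExecutor()));
        System.out.println("no-check executor fails: "
                + failsWhenCalledFromOtherThread(new NoCheckExecutor()));
    }
}
```

In the test under discussion, the equivalent of the `AssertionError` surfaces inside a future callback, which is why it shows up indirectly as a FAILED execution rather than as an immediate test failure.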

Issue 2 — Missing waitForTaskDeploymentDescriptorsCreation()

Without waiting for async TDD creation to complete, switchAllVerticesToRunning() raced with IO threads still accessing execution graph internals, occasionally producing:

AssertionError: Expected size:<4> but was:<3>
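The general remedy for this kind of race is to wait for all pending asynchronous work before inspecting shared state. The sketch below shows that pattern with plain `CompletableFuture`s. It assumes nothing about how `waitForTaskDeploymentDescriptorsCreation()` is implemented, and `deployAndCount` and the `vertex-N` names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class WaitBeforeAssertSketch {

    /** Starts one async "deployment" per vertex and returns how many completed. */
    static int deployAndCount(int vertices) throws Exception {
        ExecutorService ioExecutor = Executors.newFixedThreadPool(2);
        List<String> deployed = new CopyOnWriteArrayList<>();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < vertices; i++) {
                String name = "vertex-" + i;
                // Simulated async descriptor creation on an IO thread.
                pending.add(CompletableFuture.runAsync(() -> deployed.add(name), ioExecutor));
            }
            // Without this wait, reading deployed.size() could observe a partial result.
            CompletableFuture
                    .allOf(pending.toArray(new CompletableFuture[0]))
                    .get(10, TimeUnit.SECONDS);
            return deployed.size();
        } finally {
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("deployed vertices: " + deployAndCount(4));
    }
}
```

Dropping the `allOf(...).get(...)` line makes the returned count depend on thread timing, which mirrors the intermittent size mismatch reported above.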

Brief change log

  • Replace ComponentMainThreadExecutorServiceAdapter.forMainThread() with NoMainThreadCheckComponentMainThreadExecutor in createExecutionGraph() and testAllTasksInCreatedAndNoSlowTasks(), so that IO threads can call execute() without tripping the main-thread assertion.
  • Add ExecutionUtils.waitForTaskDeploymentDescriptorsCreation() after startScheduling() in createExecutionGraph() and createDynamicExecutionGraph(), so that async TDD creation completes before switchAllVerticesToRunning() is called.

Verifying this change

Ran ExecutionTimeBasedSlowTaskDetectorTest with @RepeatedTest(100000) on each test method individually. All 100,000 repetitions passed with no failures.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

…edSlowTaskDetectorTest due to unexpected ComponentMainThreadExecutor setting.

Co-authored-by: Yuepeng Pan <hipanyuepeng@gmail.com>

flinkbot (Collaborator) commented Jun 14, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

och5351 force-pushed the feature/FLINK-39921 branch from f21facd to 7c12e36 on June 14, 2026 15:23

och5351 (Contributor, Author) commented Jun 14, 2026

Hi @lihaosky, @RocMarshal!
Could you please review this fix for the flaky test?

RocMarshal self-assigned this Jun 14, 2026