
[Bug]Fix FlinkJobStatusWatcher deadlock & NullPointerException#4327

Merged
wolfboys merged 2 commits into apache:dev-2.1.7 from Li-GL:fix-flink-job-status-watcher
Jan 26, 2026

Conversation


@Li-GL Li-GL commented Jan 23, 2026

What changes were proposed in this pull request


1. Deadlock bug
After creating and starting a new StreamPark service, adding and launching a new task leaves it stuck in the "starting" state for 4 minutes. The cause is a deadlock between thread A, which runs the doWatch() method, and thread B, which executes appFuture asynchronously: thread A holds the lock on the current object instance (this), while thread B must initialize two lazy val variables whose initializers wait for that same lock.

Timeline of the Deadlock:

T0: Thread-A (main thread) enters doWatch()
T1: Thread-A acquires the this.synchronized lock
T2: Thread-A creates Future tasks and submits them to the thread pool
T3: Thread-B (Future thread) picks up a task from the thread pool and starts execution
T4: Thread-B executes touchApplicationJob(id)
T5: Thread-B accesses the lazy vals FLINK_CLIENT_TIMEOUT_SEC and FLINK_REST_AWAIT_TIMEOUT_SEC, whose initialization attempts to acquire the lock on the 'this' object
    → Blocks waiting (because Thread-A holds the lock)
T6: Thread-A executes Await.ready(..., timeout)
    → Waits for Thread-B to complete
T7: Timeout is reached (2 minutes, with the two timeout variables totaling 4 minutes)
T8: Await.ready throws a TimeoutException
T9: Thread-A catches the exception, logs it, and releases the lock
T10: After the lock is released, Thread-B acquires the lock and continues execution
    → However, by this point, the Future is already marked as failed/timed out

Solution: Change the two timeout variables from lazy val to val
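The deadlock can be reproduced with a minimal, self-contained sketch. The LazyWatcher/FixedWatcher classes, the timeoutSec field, and the shortened 2-second timeout are hypothetical stand-ins for the real FlinkJobStatusWatcher and its FLINK_CLIENT_TIMEOUT_SEC / FLINK_REST_AWAIT_TIMEOUT_SEC fields; the behavior assumes Scala 2's default lazy val encoding, whose initializer synchronizes on the enclosing instance:

```scala
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

class LazyWatcher {
  // Before the fix: a lazy val's initializer synchronizes on `this` (Scala 2 encoding).
  lazy val timeoutSec: Long = 120L

  def doWatch(): Boolean = this.synchronized {
    // The worker thread must initialize the lazy val, which needs this's monitor,
    // but the main thread holds that monitor until Await returns: a deadlock
    // that surfaces as a TimeoutException.
    val f = Future(timeoutSec)
    try { Await.result(f, 2.seconds); true }
    catch { case _: TimeoutException => false }
  }
}

class FixedWatcher {
  // After the fix: a plain val is assigned in the constructor; reading it needs no lock.
  val timeoutSec: Long = 120L

  def doWatch(): Boolean = this.synchronized {
    val f = Future(timeoutSec)
    try { Await.result(f, 2.seconds); true }
    catch { case _: TimeoutException => false }
  }
}
```

Here new LazyWatcher().doWatch() blocks until the 2-second timeout and returns false, while new FixedWatcher().doWatch() returns true, mirroring the lazy val → val fix.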

2. NPE bug
When a task has actually completed, the front-end status still shows "running". After the task completes, Flink deletes the deployment, so the jobs/overview API becomes unavailable before the watcher has updated the status. The watcher then falls back to the inferStateFromK8sEvent method, where the latest variable is very likely null: watchController cache entries expire after 20 seconds, and once an entry has expired the latest status of the task can no longer be retrieved. This causes a NullPointerException in the condition:
if watchController.canceling.has(id) || latest.jobState.equals(...)

PS: However, the exception is not thrown (likely caught or suppressed somewhere).

    val latest: JobStatusCV = watchController.jobStatuses.get(trackId)
    // whether deployment exists on kubernetes cluster
    val deployExists = KubernetesRetriever.isDeploymentExists(
      trackId.namespace,
      trackId.clusterId
    )
    val jobState = trackId match {
      case id
          if watchController.canceling.has(id) || latest.jobState.equals(
            FlinkJobState.CANCELLING) =>
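A minimal sketch of a null-safe guard for the condition above, using a simplified stand-in for JobStatusCV (the actual fix in this PR may differ in detail):

```scala
// Hypothetical simplified stand-in for the real JobStatusCV case class.
case class JobStatusCV(jobState: String)

// After the 20-second cache TTL expires, the lookup returns null:
val latest: JobStatusCV = null

// Before the fix: latest.jobState.equals(...) throws a NullPointerException.
// A null-safe variant wraps the possibly-null value in Option before dereferencing:
val isCancelling = Option(latest).exists(_.jobState == "CANCELLING")
```

With an expired (null) cache entry, isCancelling evaluates to false instead of throwing an NPE, so the match can fall through to the branches that consult the deployment state.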

Brief change log

  1. Change the two timeout variables FLINK_CLIENT_TIMEOUT_SEC and FLINK_REST_AWAIT_TIMEOUT_SEC from lazy val to val
  2. Fix the NullPointerException caused by the expired (null) cached job status

Verifying this change

This change is a trivial rework / code cleanup without any test coverage.


Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency): (yes / no)


@Li-GL Li-GL changed the title [Bug]Fix FlinkJobStatusWatcher deadlock [Bug]Fix FlinkJobStatusWatcher deadlock & NullPointerException Jan 23, 2026
@wolfboys wolfboys requested a review from RocMarshal January 23, 2026 15:23
RocMarshal
RocMarshal previously approved these changes Jan 23, 2026
Contributor

@RocMarshal RocMarshal left a comment


Thanks @Li-GL for the patch.

LGTM +1.

@RocMarshal
Contributor

RocMarshal commented Jan 23, 2026

It seems that the CI has failed.
I'll have a look next week.

@wolfboys wolfboys changed the base branch from dev-2.1.7 to dev January 26, 2026 01:54
@wolfboys wolfboys dismissed RocMarshal’s stale review January 26, 2026 01:54

The base branch was changed.

@wolfboys wolfboys changed the base branch from dev to dev-2.1.7 January 26, 2026 01:55
@wolfboys
Member

2.1.7 is a released version. Please submit them to the dev branch.

@Li-GL
Author

Li-GL commented Jan 26, 2026

2.1.7 is a released version. Please submit them to the dev branch.

However, the issue was introduced in version 2.1.6, and at that time, version 2.1.6 had not been merged into the dev branch, so the dev branch does not have this bug.

@wolfboys
Member

Please resubmit this PR against the dev branch. I will merge it; the build error is unrelated to your changes.

@wolfboys wolfboys merged commit 16bfa38 into apache:dev-2.1.7 Jan 26, 2026
16 of 25 checks passed
