[Bug] Fix FlinkJobStatusWatcher deadlock & NullPointerException #4327
Merged
wolfboys merged 2 commits into apache:dev-2.1.7 on Jan 26, 2026
RocMarshal (Contributor) previously approved these changes on Jan 23, 2026
Thanks @Li-GL for the patch.
LGTM +1.
Contributor
It seems that the CI has failed.
Member
2.1.7 is a released version. Please submit these changes to the dev branch.
Author
However, the issue was introduced in version 2.1.6, and at that time version 2.1.6 had not been merged into the dev branch, so the dev branch does not have this bug.
Member
Please resubmit this PR against the dev branch. I will merge it; the build error is unrelated to your changes.

What changes were proposed in this pull request
1. Deadlock bug
After creating and starting a new StreamPark service, adding and launching a new task leaves it in the "starting" state for 4 minutes. The cause is a deadlock between thread A, which runs the `doWatch()` method, and thread B, which executes `appFuture` asynchronously. Thread A holds the lock on the current object instance (`this`), while thread B evaluates two `lazy val` fields whose initialization has to wait for that same lock.
Solution: change the two timeout variables from `lazy val` to `val`.
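The lock interaction described for the deadlock can be reproduced in isolation. The following is a minimal sketch, not the StreamPark code: `WatcherSketch`, `LazyWatcher`, `EagerWatcher`, `timeoutSec` and the `demo.timeout` property are hypothetical names, and it assumes Scala 2.x, where initializing a `lazy val` member synchronizes on the enclosing instance. A 2-second wait stands in for the 4-minute stall so the demo terminates.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for the watcher: only doWatch and the lazy val / val
// distinction come from the PR description; everything else is illustrative.
abstract class WatcherSketch {
  def timeoutSec: Long

  // Thread A: holds the monitor of `this` while waiting for the async task.
  def doWatch(): String = this.synchronized {
    // Thread B: reads the timeout field on a pool thread.
    val task = Future(timeoutSec)
    try {
      Await.result(task, 2.seconds)
      "completed"
    } catch {
      case _: TimeoutException => "timed out"
    }
  }
}

// On Scala 2.x the first read of a lazy val locks `this`, so thread B blocks
// until thread A gives up waiting and leaves the synchronized block.
class LazyWatcher extends WatcherSketch {
  lazy val timeoutSec: Long = sys.props.getOrElse("demo.timeout", "120").toLong
}

// A plain val is initialized in the constructor; reading it needs no lock.
class EagerWatcher extends WatcherSketch {
  val timeoutSec: Long = sys.props.getOrElse("demo.timeout", "120").toLong
}

object LazyValLockDemo {
  def main(args: Array[String]): Unit = {
    println(s"lazy val: ${new LazyWatcher().doWatch()}")
    println(s"val:      ${new EagerWatcher().doWatch()}")
  }
}
```

Under the Scala 2.x assumption, the `lazy val` variant waits out the full timeout while the `val` variant returns immediately. Other compiler versions may initialize lazy values differently, so the first result should not be relied on outside that assumption.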
2. NPE bug
When a task has actually completed, the front-end status still shows "running". After the task completes, Flink deletes the deployment, so the `jobs/overview` API becomes unavailable. The watcher has not yet had time to update the status, so it falls back to the `inferStateFromK8sEvent` method to update it. There, the `latest` variable is very likely `null`, because the `watchController` cache expires after 20 seconds, and once it has expired the latest status of the current task can no longer be retrieved. This leads to a NullPointerException in the condition `if watchController.canceling.has(id) || latest.jobState.equals(...)`.
PS: However, the exception does not surface (it is likely caught or suppressed somewhere).
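One common way to guard a condition of this shape is to wrap the possibly-null value in `Option`. The sketch below is an assumption about how such a guard could look, not the change made in this PR: `JobStatusSketch`, `unsafeCheck`, `safeCheck` and the `expected` parameter are hypothetical, and the boolean `canceling` stands in for `watchController.canceling.has(id)`.

```scala
// Hypothetical, simplified type: the real StreamPark classes differ.
final case class JobStatusSketch(jobState: String)

object NullSafeInference {
  // Same shape as the condition quoted in the PR description: when `canceling`
  // is false and `latest` is null, dereferencing `latest` throws an NPE.
  def unsafeCheck(canceling: Boolean, latest: JobStatusSketch, expected: String): Boolean =
    canceling || latest.jobState.equals(expected)

  // Null-safe variant: Option(null) is None, so an expired cache entry
  // simply makes the second operand false instead of throwing.
  def safeCheck(canceling: Boolean, latest: JobStatusSketch, expected: String): Boolean =
    canceling || Option(latest).exists(_.jobState == expected)
}
```

With this guard, a missing cache entry is treated as "state unknown" rather than as an error; whether that is the right fallback for the watcher is a design decision the sketch does not settle.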
Brief change log
Change `FLINK_CLIENT_TIMEOUT_SEC` and `FLINK_REST_AWAIT_TIMEOUT_SEC` from `lazy val` to `val`
Verifying this change
This change is a trivial rework / code cleanup without any test coverage.
(or)
This change is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts