feat(plugin): Added plugin-ready check endpoints and optimized local plugin startup logic#600
feat(plugin): Added plugin-ready check endpoints and optimized local plugin startup logic#600NieRonghua wants to merge 1 commit intolanggenius:mainfrom
Conversation
添加 /ready/check 端点用于检查插件启动就绪状态,支持 Kubernetes 就绪探针 引入插件最大重试次数配置,默认为 15 次,可自定义重试策略 优化本地插件监控逻辑,分离初始插件和运行时插件的就绪判断 修复 go.mod 中 trace 包依赖重复声明问题
Summary of ChangesHello @NieRonghua, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses critical race conditions and stability issues in the plugin daemon's readiness probe within Kubernetes environments. By introducing a sophisticated "Initial Plugin Set Locking Strategy," the system now accurately reports readiness only after all initial plugins have completed their startup attempts, preventing traffic from being routed to unready pods. Furthermore, it significantly improves startup performance through configurable retry limits and provides detailed observability into plugin states, ensuring that runtime plugin changes do not destabilize the pod's readiness. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
@Yeuoly please review this PR |
There was a problem hiding this comment.
Code Review
This pull request introduces a new readiness check mechanism to solve a race condition during plugin startup. However, a critical bug in the local plugin monitoring loop will cause the daemon to panic and crash when it encounters a new plugin, leading to a Denial of Service (DoS). Additionally, the new readiness endpoint exposes detailed plugin metadata publicly, which should be restricted or minimized to prevent information leakage. My review also identified a design issue with global state and opportunities to improve code clarity and adhere to Go idioms. Addressing these issues will significantly improve the robustness, security, and maintainability of the code.
| retry = LocalPluginFailsRecord{ | ||
| RetryCount: 0, | ||
| LastTriedAt: time.Now(), | ||
| LastError: err.Error(), |
There was a problem hiding this comment.
A critical bug exists where the code attempts to call .Error() on the err variable, which is guaranteed to be nil at this point (due to an early return if err is not nil on line 62). This will cause a runtime panic, leading to a Denial of Service (DoS) of the plugin daemon whenever a new plugin is detected. When creating a new LocalPluginFailsRecord, the LastError field should be initialized to an empty string, as there is no error to record yet.
LastError: ""| var initialPlugins = &initialPluginSet{ | ||
| ids: make(map[string]bool), | ||
| } |
There was a problem hiding this comment.
The use of a global variable initialPlugins to store state is a significant design issue. Global state can lead to subtle bugs in concurrent programs, makes testing more difficult, and tightly couples different parts of the application. This state should be encapsulated within the ControlPanel struct.
Please consider the following refactoring:
- Move the
initialPluginSetstruct definition tointernal/core/control_panel/daemon.go(or make it public inreadiness.go). - Add
initialPlugins *initialPluginSetas a field to theControlPanelstruct indaemon.go. - Initialize this field in
NewControlPanel. - Update
lockInitialPlugins,getInitialPluginSet, andisInitialPluginsReadyinreadiness.goto operate onc.initialPluginsinstead of the global variable.
This will improve encapsulation and make the code more robust and testable.
| c.JSON(http.StatusOK, gin.H{ | ||
| "status": "ok", | ||
| "ready": true, | ||
| "reason": report.Reason, | ||
| "detail": report.Plugins, | ||
| }) | ||
| return | ||
| } | ||
|
|
||
| c.JSON(http.StatusServiceUnavailable, gin.H{ | ||
| "status": "unready", | ||
| "ready": false, | ||
| "reason": report.Reason, | ||
| "detail": report.Plugins, | ||
| }) |
There was a problem hiding this comment.
The /ready/check endpoint is publicly accessible and returns detailed information about all installed plugins via the report.Plugins field. This includes plugin IDs, versions, and checksums. Exposing this information publicly can help an attacker identify vulnerable components or gain insights into the system's configuration. Consider restricting access to this endpoint to internal monitoring systems or minimizing the data returned in the response.
| func (c *ControlPanel) isInitialPluginsReady( | ||
| current []plugin_entities.PluginUniqueIdentifier, | ||
| initialExpected *int, | ||
| initialRunning *int, | ||
| initialMissing *[]string, | ||
| initialFailed *[]string, | ||
| ) bool { |
There was a problem hiding this comment.
The function signature of isInitialPluginsReady is not idiomatic Go. Using multiple pointer arguments to return values is more common in C and can be cumbersome. A more idiomatic approach is to return a struct containing the results. This improves readability and maintainability.
Consider refactoring this to return a status struct. For example:
type initialPluginsStatus struct {
Ready bool
Expected int
Running int
Missing []string
Failed []string
}
func (c *ControlPanel) getInitialPluginsStatus(current []plugin_entities.PluginUniqueIdentifier) initialPluginsStatus {
// ... implementation from isInitialPluginsReady ...
return initialPluginsStatus{
Ready: len(missingList) == 0,
Expected: expected,
Running: running,
Missing: missingList,
Failed: failedList,
}
}Then, in updateLocalReadinessSnapshot, you could use it like this:
initialStatus := c.getInitialPluginsStatus(expected)
snapshot := &LocalReadinessSnapshot{
Ready: initialStatus.Ready,
InitialPluginsReady: initialStatus.Ready,
InitialExpected: initialStatus.Expected,
// ... and so on
}| "github.com/langgenius/dify-plugin-daemon/internal/types/app" | ||
| ) | ||
|
|
||
| func ReadyCheck(appConfig *app.Config) gin.HandlerFunc { |
There was a problem hiding this comment.
The appConfig parameter is unused in this function, as indicated by _ = appConfig on line 13. To improve code clarity and maintain a clean API, this parameter should be removed from the function signature. The call site in internal/server/http_server.go should be updated accordingly.
| func ReadyCheck(appConfig *app.Config) gin.HandlerFunc { | |
| func ReadyCheck() gin.HandlerFunc { |
Description
Problem
In Kubernetes environments, the plugin daemon exhibits a race condition during startup that causes service disruption:
Additionally, when users install new plugins at runtime after the pod becomes Ready, the readiness probe returns 503, causing K8s to remove the pod from Service endpoints and interrupt traffic flow.
Root Causes Addressed
Solution: Initial Plugin Set Locking Strategy
Implements an intelligent readiness mechanism that:
Key principle: Once a pod is Ready, it will NEVER become NotReady due to runtime plugin additions.
Changes Made
Code Implementation
LocalReadinessSnapshotstructure separating initial/runtime plugin statesinitialPluginSetlocking mechanism (thread-safe withsync.RWMutex)isInitialPluginsReady()function for atomic readiness determinationlockInitialPlugins()one-time locking at first startupgetInitialPluginSet()atomic read-only accessDocumentation Updates
Configuration
PLUGIN_LOCAL_MAX_RETRY_COUNT(default: 5, was hardcoded 15)Performance Impact
API Response Format
New
/ready/checkendpoint returns:InitialPluginsReadyandRuntimePluginsLoadingfieldsBackward Compatibility
✅ Fully backward compatible
/health/checkendpoint unchangedChanges
internal/core/control_panel/readiness.go: Initial plugin set locking mechanismREADME.md: Updated Health Endpoints documentationTECHNICAL_PLAN.md: Complete technical specificationCOMMUNITY_ISSUE.md: Issue template with scenariosINITIAL_PLUGIN_SET_LOCKING_STRATEGY.md: Implementation guideType of Change
Essential Checklist
Testing
Bug Fix (if applicable)
Fixes #598)Additional Information
#598