eventservice: optimize scanwindow #4950
Signed-off-by: dongmen <414110582@qq.com>
Code Review
This pull request replaces the existing scan interval adjustment logic with a new adaptiveScanWindowController that utilizes Exponential Moving Averages (EMAs) and a pressure score for more stable memory pressure management. The update includes comprehensive simulation tests and enhanced Prometheus metrics for monitoring controller decisions. Review feedback highlights a non-monotonic discontinuity in the emergency brake calculation, potential over-throttling caused by latching peak usage values, the presence of magic numbers, and the use of a redundant maxFloat64 helper that should be replaced by the built-in max function.
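For readers unfamiliar with the smoothing the summary refers to, here is a minimal sketch of a fast and a slow EMA over memory-usage samples. The `ema` type, its `update` method, and the alpha values are hypothetical and chosen for illustration; the PR's actual fields, constants, and pressure-score formula are not shown on this page.

```go
package main

import "fmt"

// ema is a hypothetical helper, not the PR's type.
// alpha close to 1 reacts quickly; alpha close to 0 smooths heavily.
type ema struct {
	alpha  float64
	value  float64
	seeded bool
}

// update folds one sample into the average and returns the new value.
// The first sample seeds the average directly.
func (e *ema) update(sample float64) float64 {
	if !e.seeded {
		e.value = sample
		e.seeded = true
		return e.value
	}
	e.value = e.alpha*sample + (1-e.alpha)*e.value
	return e.value
}

func main() {
	fast := ema{alpha: 0.5} // assumed smoothing factor
	slow := ema{alpha: 0.1} // assumed smoothing factor
	// A single spike to 0.9 moves the fast EMA far more than the slow one.
	for _, usage := range []float64{0.2, 0.2, 0.9, 0.2, 0.2} {
		f := fast.update(usage)
		s := slow.update(usage)
		fmt.Printf("usage=%.2f fast=%.3f slow=%.3f\n", usage, f, s)
	}
}
```

Comparing both averages against a threshold lets a controller react to sustained pressure while discounting a single outlier sample.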
    func scanWindowEmergencyBrakeInterval(current time.Duration) time.Duration {
        if current <= 6*defaultScanInterval {
            return max(current/2, defaultScanInterval)
        }
        return max(current/4, minScanInterval)
    }
The logic in scanWindowEmergencyBrakeInterval has two significant issues:
- Discontinuity: There is a sharp jump at the boundary of 6*defaultScanInterval (30s). For an input of 30s, it returns 15s (current/2), but for 30.1s, it returns ~7.5s (current/4). This non-monotonic behavior can cause unstable oscillations in the scan interval.
- Unreachable Minimum: The function floors at defaultScanInterval (5s) for any current <= 30s. This makes the minScanInterval (1s) constant effectively unreachable during emergency pressure (98%+ usage) if the interval has already been reduced to a moderate level. If the goal is to allow the interval to drop to 1s under extreme congestion, the floor should be minScanInterval in both branches.
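One possible shape that addresses both points is sketched below. It is not the PR's code: `brakeIntervalSketch` is a hypothetical name, the constant values (5s and 1s) are assumed from the figures quoted in this comment, and capping the halved value at 3*defaultScanInterval is just one way to keep large intervals aggressively reduced without a jump at the boundary.

```go
package main

import (
	"fmt"
	"time"
)

// Values assumed from the review comment (5s default, 1s minimum).
const (
	defaultScanInterval = 5 * time.Second
	minScanInterval     = 1 * time.Second
)

// brakeIntervalSketch is a hypothetical alternative, not the PR's code.
// It halves the interval, caps the result at 3*defaultScanInterval, and
// floors it at minScanInterval, so the output never decreases as the
// input grows and there is no jump at the 30s boundary.
func brakeIntervalSketch(current time.Duration) time.Duration {
	halved := min(current/2, 3*defaultScanInterval)
	return max(halved, minScanInterval)
}

func main() {
	inputs := []time.Duration{
		1500 * time.Millisecond,
		4 * time.Second,
		30 * time.Second,
		30100 * time.Millisecond,
		2 * time.Minute,
	}
	for _, in := range inputs {
		fmt.Printf("%v -> %v\n", in, brakeIntervalSketch(in))
	}
}
```

Because both branches of the original are replaced by a single clamp, inputs just below and just above 30s map to the same 15s, and repeated emergency braking can walk the interval down to minScanInterval.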
    return c.fastUsageEMA >= scanWindowHighPressureThreshold ||
        c.slowUsageEMA >= scanWindowHighPressureThreshold ||
        usage.max >= memoryUsageHighThreshold
Using usage.max in shouldReduceForHighPressureLocked may lead to excessive interval reductions. Since usage.max tracks the peak usage over the last 30 seconds, a single transient spike will keep triggering reductions every 10 seconds (the cooldown) for the entire duration the spike remains in the window, even if the current usage (usage.last) and EMAs indicate that pressure has subsided. Consider relying primarily on EMAs or the current report value to ensure the response is truly adaptive to the present state.
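A minimal sketch of that suggestion follows, checking the EMAs and the most recent sample instead of the windowed peak. The type names, the `shouldReduceSketch` method, and the threshold values are hypothetical stand-ins; only the field names `fastUsageEMA`, `slowUsageEMA`, `last`, and `max` come from the review.

```go
package main

import "fmt"

// Threshold values are assumptions for illustration only.
const (
	highPressureThreshold    = 0.60 // stand-in for scanWindowHighPressureThreshold
	memoryUsageHighThreshold = 0.70 // value assumed
)

// usageStats mirrors the fields named in the review: last is the most
// recent report, max is the peak over the sliding window.
type usageStats struct {
	last float64
	max  float64
}

type controllerSketch struct {
	fastUsageEMA float64
	slowUsageEMA float64
}

// shouldReduceSketch ignores usage.max, so a spike that has already
// passed cannot keep triggering reductions while it sits in the window.
func (c *controllerSketch) shouldReduceSketch(usage usageStats) bool {
	return c.fastUsageEMA >= highPressureThreshold ||
		c.slowUsageEMA >= highPressureThreshold ||
		usage.last >= memoryUsageHighThreshold
}

func main() {
	c := &controllerSketch{fastUsageEMA: 0.30, slowUsageEMA: 0.25}
	stale := usageStats{last: 0.30, max: 0.95} // spike still in window
	live := usageStats{last: 0.95, max: 0.95}  // spike happening now
	fmt.Println(c.shouldReduceSketch(stale), c.shouldReduceSketch(live))
}
```

Whether the peak should be dropped entirely or merely down-weighted is a design decision for the author; the sketch only shows the simplest variant.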
    c.fastUsageEMA < memoryUsageLowThreshold+0.03 &&
        c.slowUsageEMA < memoryUsageLowThreshold+0.02
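The literals 0.03 and 0.02 in this fragment are presumably the magic numbers mentioned in the review summary. A sketch of how they could be given names is below; the constant names, the `belowLowPressureSketch` function, and the value of `memoryUsageLowThreshold` are all hypothetical.

```go
package main

import "fmt"

const (
	memoryUsageLowThreshold = 0.20 // value assumed for illustration

	// Hypothetical names for the margins added to the low threshold.
	fastEMALowMargin = 0.03
	slowEMALowMargin = 0.02
)

// belowLowPressureSketch reports whether both averages sit under the
// low threshold plus their respective margins.
func belowLowPressureSketch(fastUsageEMA, slowUsageEMA float64) bool {
	return fastUsageEMA < memoryUsageLowThreshold+fastEMALowMargin &&
		slowUsageEMA < memoryUsageLowThreshold+slowEMALowMargin
}

func main() {
	fmt.Println(belowLowPressureSketch(0.22, 0.21))
	fmt.Println(belowLowPressureSketch(0.24, 0.21))
}
```

Named constants also give a natural place for a comment explaining why the fast and slow margins differ.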
    func maxFloat64(a float64, b float64) float64 {
        if a > b {
            return a
        }
        return b
    }
The maxFloat64 helper function is redundant as Go 1.21+ provides a built-in max function. The codebase already uses the built-in max in other parts of this file (e.g., lines 450, 464), so this helper should be removed and its call sites (lines 679, 681, 687) updated to use the built-in function for consistency.
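The swap can be checked with a small standalone program. It copies the helper quoted above and compares it with the built-in `max` (Go 1.21 or later); the sample values are arbitrary.

```go
package main

import "fmt"

// maxFloat64 is the helper quoted in the review.
func maxFloat64(a float64, b float64) float64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	a, b := 0.25, 0.75
	// For ordinary values the helper and the built-in agree.
	fmt.Println(maxFloat64(a, b) == max(a, b))

	// They differ when an argument is NaN: the built-in returns NaN,
	// while the helper returns b because the comparison a > b is false.
	var zero float64
	nan := zero / zero
	fmt.Println(max(nan, 1.0), maxFloat64(nan, 1.0))
}
```

The NaN difference only matters if a usage ratio can ever be NaN, for example from dividing by a zero quota; if so, it is worth confirming which behavior the call sites expect before replacing the helper.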
What problem does this PR solve?
Issue Number: close #xxx
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note