Fix adaptive metrics decay when provider metrics are not updated#16048
SURYAS1306 wants to merge 3 commits into apache:3.3
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 3.3 #16048 +/- ##
============================================
- Coverage 60.75% 60.73% -0.03%
+ Complexity 11757 11752 -5
============================================
Files 1952 1952
Lines 89012 89012
Branches 13421 13421
============================================
- Hits 54079 54059 -20
- Misses 29367 29382 +15
- Partials 5566 5571 +5
Hi maintainers, this PR fixes the adaptive metrics decay issue when provider metrics are not updated and adds a unit test covering the scenario. All checks are green. I’d really appreciate your review when you have time. Thanks!
You'd better add a comparison test to ensure that this PR also performs well under high-QPS conditions.
Hi @zrlw, thanks for the suggestion.
zrlw left a comment:
We still need to evaluate this PR carefully as we should draw on relatively mature industry solutions to refactor the adaptive algorithm.
Hi @zrlw, thanks for the feedback.

Understood: this PR focuses on fixing the incorrect decay behavior when provider metrics are not updated. The added high-QPS style test is intended to validate the correctness and stability of the existing adaptive logic under more realistic conditions, without altering the overall strategy.

I agree that the adaptive algorithm itself is an important topic and could benefit from deeper discussion and comparison with more mature industry solutions. I’m happy to participate in that discussion or help explore alternative designs if there is a preferred direction.

For now, this PR intentionally keeps the change minimal and low-risk, addressing the concrete issue of unstable decay behavior without introducing broader algorithmic refactoring. Please let me know how you’d like to proceed. Thanks.
What is the purpose of the change?
This PR fixes an issue in AdaptiveLoadBalance / AdaptiveMetrics where latency decay behaves incorrectly when provider metrics are not updated for a period of time.
Currently, when no new provider metrics arrive, getLoad() may repeatedly apply the penalty branch or aggressively right-shift lastLatency, which can result in stale or extreme values dominating the EWMA. This makes adaptive load balancing unstable, especially in low-QPS or intermittent-update scenarios.
This PR ensures that latency decays safely and progressively instead of collapsing or being stuck at penalty values.
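The failure mode described above can be illustrated with a small self-contained sketch. It is not the Dubbo source and not the code in this PR: the class `DecaySketch`, its method names, and the 100 ms timeout are hypothetical, and `progressiveDecay` is only one possible shape of a "safe and progressive" decay, assumed here for comparison.

```java
// Illustrative sketch only: not the Dubbo source and not the change in this PR.
// All names and constants below are assumptions chosen for the example.
final class DecaySketch {

    // Aggressive decay: right-shift by the number of elapsed timeout windows plus one.
    // After a long idle period the shift distance grows and the value collapses to 0.
    static long shiftDecay(long lastLatency, long elapsedMs, long timeoutMs) {
        long multiple = elapsedMs / timeoutMs + 1;
        return lastLatency >> multiple;
    }

    // Penalty branch: pin the latency at twice the timeout.
    // If this branch is taken on every call, the value never recovers.
    static long penalty(long timeoutMs) {
        return timeoutMs * 2L;
    }

    // Hypothetical progressive decay: halve once per call, with a floor of 1 ms.
    static long progressiveDecay(long lastLatency) {
        return Math.max(1L, lastLatency >> 1);
    }

    public static void main(String[] args) {
        long timeoutMs = 100;
        long lastLatency = 80;

        System.out.println("shift after 50 ms idle:  " + shiftDecay(lastLatency, 50, timeoutMs));
        System.out.println("shift after 650 ms idle: " + shiftDecay(lastLatency, 650, timeoutMs));
        System.out.println("penalty value:           " + penalty(timeoutMs));

        long value = lastLatency;
        for (int step = 1; step <= 4; step++) {
            value = progressiveDecay(value);
            System.out.println("progressive step " + step + ": " + value);
        }
    }
}
```

With these assumed numbers, the shift-based decay drops an 80 ms latency to 0 after 650 ms without updates, the penalty branch holds it at 200 ms, and the bounded variant steps down 40, 20, 10, 5.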
Fixes #15810
What is changed?
1. Improved decay logic in AdaptiveMetrics#getLoad()
2. Added unit test
Added testAdaptiveMetricsDecayWithoutProviderUpdate, which covers the case where provider metrics are not updated.
Why is this needed?
Adaptive load balancing relies on EWMA latency to reflect recent performance trends.
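Without this fix, stale or extreme latency values can dominate the EWMA when provider metrics are not updated for a period of time.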
This change makes adaptive load balancing more stable, realistic, and robust under real-world traffic patterns.
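To show why a stale sample matters for an EWMA, here is a minimal sketch of the standard EWMA recurrence, `ewma = beta * ewma + (1 - beta) * sample`. The class `EwmaSketch`, the smoothing factor `beta = 0.5`, and the latency values are assumptions for illustration, not values taken from Dubbo or from this PR.

```java
// Illustrative sketch only: a generic EWMA update, not the Dubbo implementation.
final class EwmaSketch {

    // Standard EWMA recurrence: beta weights the history, (1 - beta) weights the new sample.
    static double update(double previousEwma, double sample, double beta) {
        return beta * previousEwma + (1 - beta) * sample;
    }

    public static void main(String[] args) {
        double beta = 0.5;           // assumed smoothing factor
        double ewma = 50.0;          // assumed healthy latency estimate in ms
        double staleSample = 200.0;  // assumed stale penalty value fed in repeatedly

        // Feeding the same stale sample on every call pulls the average toward it,
        // even though no new measurement has arrived.
        for (int call = 1; call <= 5; call++) {
            ewma = update(ewma, staleSample, beta);
            System.out.println("after " + call + " stale updates: " + ewma);
        }
    }
}
```

Each call halves the remaining distance to the stale sample, so a handful of calls without fresh provider metrics is enough for that sample to dominate the estimate.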
Verifying this change
mvn -pl dubbo-cluster -am test
Checklist