[CELEBORN-2264] Support cancel shuffle when write bytes exceeds threshold#3601
yew1eb wants to merge 7 commits into apache:main
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #3601      +/-   ##
==========================================
- Coverage   67.13%   66.91%   -0.22%
==========================================
  Files         357      357
  Lines       21860    21932      +72
  Branches     1943     1949       +6
==========================================
  Hits        14674    14674
- Misses       6166     6244      +78
+ Partials     1020     1014       -6
eolivelli left a comment:
I have left one question about protocol compatibility.
Can you please add some unit tests?
common/src/main/scala/org/apache/celeborn/common/protocol/message/ControlMessages.scala
RexXiong left a comment:
Thanks. BTW, when the configuration is changed, you should also execute `UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite`.
common/src/main/scala/org/apache/celeborn/common/protocol/message/ControlMessages.scala
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
tests/spark-it/src/test/scala/org/apache/celeborn/tests/client/LifecycleManagerSuite.scala
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala
Wouldn't a server-side approach like #3336 make more sense for handling this? Just thinking out loud. A few cons I can see with this approach:
but the shuffle data will be stored on the server until shuffle cleanup happens.
Cons with the server-side approach:
@SteNicholas @RexXiong, I wanted to know your thoughts on this.
Thank you @s0nskar. I believe this PR primarily focuses on controlling the bytes written by an individual shuffle. Compared to QuotaManager, this is more fine-grained. Doing such fine-grained control on the server side would put too much pressure on the server, so this PR is a good supplement to the QuotaManager. Managing this on the client side also allows flexible, on-demand control. cc @yew1eb
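To make the granularity point concrete, here is a minimal sketch of per-shuffle byte accounting kept on the client side. It is not taken from this PR: the class `PerShuffleWriteTracker` and its methods are hypothetical names chosen for illustration, and the real change may track bytes quite differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical illustration only; these names are not Celeborn APIs.
class PerShuffleWriteTracker {
    // One counter per shuffle ID, so each shuffle is accounted for separately.
    private final Map<Integer, LongAdder> bytesByShuffle = new ConcurrentHashMap<>();

    // Called whenever a task reports bytes written for a shuffle.
    void record(int shuffleId, long bytes) {
        bytesByShuffle.computeIfAbsent(shuffleId, id -> new LongAdder()).add(bytes);
    }

    // Total bytes recorded so far for one shuffle (0 if never seen).
    long written(int shuffleId) {
        LongAdder adder = bytesByShuffle.get(shuffleId);
        return adder == null ? 0L : adder.sum();
    }

    // Drop the counter once the shuffle is finished or cancelled.
    void remove(int shuffleId) {
        bytesByShuffle.remove(shuffleId);
    }
}
```

Because the counters live with the client, the workers do no extra bookkeeping per shuffle, which is the pressure argument made above.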
What changes were proposed in this pull request?
This patch adds a configurable threshold check for shuffle write bytes.
Why are the changes needed?
The shuffle is canceled automatically if its written bytes exceed the threshold, to avoid exhausting cluster resources.
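The check described here can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this PR: `ShuffleWriteLimit`, `onWrite`, and the convention that a non-positive threshold disables the check are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; names and semantics are illustrative, not Celeborn's API.
class ShuffleWriteLimit {
    private final long maxBytes; // assumption: a value <= 0 disables the check
    private final AtomicLong written = new AtomicLong();
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    ShuffleWriteLimit(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    // Adds the reported bytes and returns true while the shuffle may continue.
    boolean onWrite(long bytes) {
        long total = written.addAndGet(bytes);
        // "Exceeds" is read strictly: hitting the threshold exactly is still allowed.
        if (maxBytes > 0 && total > maxBytes) {
            cancelled.set(true);
        }
        return !cancelled.get();
    }

    boolean isCancelled() {
        return cancelled.get();
    }
}
```

Once `onWrite` returns false, the caller would be expected to stop the shuffle; how that cancellation is propagated is outside this sketch.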
Does this PR resolve a correctness bug?
No.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI and manual testing.