Skip to content

[CELEBORN-2264] Support cancel shuffle when write bytes exceeds threshold#3601

Open
yew1eb wants to merge 7 commits intoapache:mainfrom
yew1eb:CELEBORN_2264
Open

[CELEBORN-2264] Support cancel shuffle when write bytes exceeds threshold#3601
yew1eb wants to merge 7 commits intoapache:mainfrom
yew1eb:CELEBORN_2264

Conversation

@yew1eb
Copy link
Copy Markdown
Contributor

@yew1eb yew1eb commented Feb 11, 2026

What changes were proposed in this pull request?

This patch adds configurable threshold check for shuffle write bytes.

Why are the changes needed?

Shuffle will be canceled automatically if write bytes exceed the threshold to avoid cluster resource exhaustion.

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

CI and Manual testing.

@yew1eb yew1eb marked this pull request as draft February 11, 2026 03:47
@codecov
Copy link
Copy Markdown

codecov bot commented Feb 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.91%. Comparing base (2dd1b7a) to head (d15feca).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3601      +/-   ##
==========================================
- Coverage   67.13%   66.91%   -0.22%     
==========================================
  Files         357      357              
  Lines       21860    21932      +72     
  Branches     1943     1949       +6     
==========================================
  Hits        14674    14674              
- Misses       6166     6244      +78     
+ Partials     1020     1014       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left one question about protocol compatibility

Can you please add some unit tests ?

Copy link
Copy Markdown
Contributor

@RexXiong RexXiong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, BTW when the configuration is changed, you should also execute UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite

@s0nskar
Copy link
Copy Markdown
Contributor

s0nskar commented Mar 11, 2026

Wouldn't server side approach like – #3336 makes more sense to handle this. Just thinking out loud, Few cons i can see with this approach:

  1. We are not considering the existing shuffle data stored for the app on Celeborn server or multiple shuffle stages running in parallel.

  2. We are removing the written bytes as soon as all mappers are completed

      if (shuffleWriteLimitEnabled) {
        shuffleTotalWrittenBytes.remove(shuffleId)
      }

but the shuffle data will be stored on the server till shuffle cleanup happens.

  1. No central config management, such configs should be managed by config store so it can be applied globally to all apps, instead of each app having control on such configs. (Override functionality can be provided for certain apps)

Cons with server side approach –

  1. Since it relies on heartbeats, for very high throughput applications the difference between threshold and actual killing can be large but for normal applications it should be fine.

@SteNicholas @RexXiong wanted to know your thoughts on this?

@RexXiong
Copy link
Copy Markdown
Contributor

Wouldn't server side approach like – #3336 makes more sense to handle this. Just thinking out loud, Few cons i can see with this approach:

  1. We are not considering the existing shuffle data stored for the app on Celeborn server or multiple shuffle stages running in parallel.
  2. We are removing the written bytes as soon as all mappers are completed
      if (shuffleWriteLimitEnabled) {
        shuffleTotalWrittenBytes.remove(shuffleId)
      }

but the shuffle data will be stored on the server till shuffle cleanup happens.

  1. No central config management, such configs should be managed by config store so it can be applied globally to all apps, instead of each app having control on such configs. (Override functionality can be provided for certain apps)

Cons with server side approach –

  1. Since it relies on heartbeats, for very high throughput applications the difference between threshold and actual killing can be large but for normal applications it should be fine.

@SteNicholas @RexXiong wanted to know your thoughts on this?

Thank you @s0nskar , I believe this PR primarily focuses on controlling the data bytes of individual shuffle writes. Compared to QuotaManager, this is relatively more fine-grained. If fine-grained control is done on the server side, it would put too much pressure on the server, making it a good supplement to the QuotaManager. Managing this on the client side also allows for flexible and on-demand control. cc @yew1eb

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants