[xla:gpu] Add async thunk passes for optimizing async execution #39435
Closed
ezhulenev wants to merge 1 commit into openxla:main from
seantalts
requested changes
Apr 3, 2026
Member
Could you add a couple of tests? (I think these might fail today.)
- Redundant pair nested inside a non-redundant AsyncStartThunk: an AsyncStartThunk whose nested sequence contains [start_inner, done_inner], where the outer start and done are not adjacent. The inner pair should still get inlined.
- Same for ExpandAsyncScopeThunkPass: an AsyncStartThunk whose nested sequence contains [kernel, start, kernel, done, kernel]. The inner scope should get expanded.
It would probably also be good to have a test that combines the passes and makes sure they interact well, and one that covers multiple independent async pairs.
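To make the first case concrete, here is a minimal sketch of the nested scenario on a toy model. Everything in it (ToyThunk, Kernel, Start, Done, RemoveRedundantPairs, the async_id field) is hypothetical and written only for illustration; it does not use XLA's thunk classes or the pass from this PR. It assumes that handling the nested case means recursing into a start thunk's nested sequence before deciding whether the outer pair is redundant.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a thunk. Hypothetical: this is NOT XLA's Thunk class.
struct ToyThunk {
  enum Kind { kKernel, kAsyncStart, kAsyncDone };
  Kind kind;
  int async_id;                  // pairs a start with its done (assumption)
  std::string name;
  std::vector<ToyThunk> nested;  // only meaningful for kAsyncStart
};

// Hypothetical helpers for building toy sequences.
ToyThunk Kernel(std::string name) {
  return ToyThunk{ToyThunk::kKernel, -1, std::move(name), {}};
}
ToyThunk Start(int id, std::vector<ToyThunk> nested) {
  return ToyThunk{ToyThunk::kAsyncStart, id, "start", std::move(nested)};
}
ToyThunk Done(int id) {
  return ToyThunk{ToyThunk::kAsyncDone, id, "done", {}};
}

// Inlines every start/done pair whose done immediately follows its start.
// Nested sequences are processed first, so an inner redundant pair is
// inlined even when the enclosing start/done pair is not adjacent.
std::vector<ToyThunk> RemoveRedundantPairs(const std::vector<ToyThunk>& seq) {
  std::vector<ToyThunk> out;
  for (std::size_t i = 0; i < seq.size(); ++i) {
    ToyThunk t = seq[i];
    if (t.kind == ToyThunk::kAsyncStart) {
      t.nested = RemoveRedundantPairs(t.nested);
    }
    const bool redundant = t.kind == ToyThunk::kAsyncStart &&
                           i + 1 < seq.size() &&
                           seq[i + 1].kind == ToyThunk::kAsyncDone &&
                           seq[i + 1].async_id == t.async_id;
    if (redundant) {
      // Splice the nested thunks in place of the pair and skip the done.
      out.insert(out.end(), t.nested.begin(), t.nested.end());
      ++i;
    } else {
      out.push_back(std::move(t));
    }
  }
  return out;
}
```

A test along these lines would build an outer start whose nested sequence is [start_inner, done_inner], place an unrelated kernel between the outer start and done, and check that the outer pair survives while its nested sequence no longer contains the inner pair.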
Contributor
Author
Looks like once we already have a thunk sequence it's too late to optimize it; it must be done earlier, at the HLO level. Abandoning it for now.
Thunk passes that correct LHS scheduling decisions if buffer assignment permits it:
- Implement RemoveRedundantAsyncThunkPass: removes redundant async start/done thunk pairs where an AsyncDoneThunk immediately follows its matching AsyncStartThunk. In this case there is no actual asynchronous execution (the done thunk just waits for an event recorded right before it), so we inline the nested thunk sequence from the start thunk, avoiding the overhead of creating an async execution scope (recording events and synchronizing streams).
- Implement ExpandAsyncScopeThunkPass: a new thunk pass that widens async execution scopes by moving AsyncStartThunk as far up the thunk sequence as possible and AsyncDoneThunk as far down as possible, maximizing the overlap window between async operations (e.g., collectives) and compute.

Conflict detection uses BufferUse::ReadWriteSet and ResourceUse::ReadWriteSet, with buffer and resource uses collected transitively via the Thunk::Walk API. AsyncDoneThunk (which reports no uses of its own) inherits the uses of all matching AsyncStartThunks to handle pipelined async chains correctly.
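
The conflict rule behind the scope expansion can be sketched on a toy model. This is a simplified illustration, not the code from this PR: ToyRWSet, HasConflict, HoistIndex and SinkIndex are hypothetical names, buffers are plain integer ids, and resource uses are ignored. It assumes two thunks conflict when either one writes a buffer the other reads or writes, and that a start (or done) thunk may move past a neighbor only when there is no such conflict. In this model a done thunk would be given the same set as its start thunk, mirroring the inheritance described above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy read/write set keyed by integer buffer ids. Hypothetical stand-in,
// not BufferUse::ReadWriteSet.
struct ToyRWSet {
  std::set<int> reads;
  std::set<int> writes;
};

// True if the two id sets share at least one element.
bool Intersects(const std::set<int>& a, const std::set<int>& b) {
  for (int x : a) {
    if (b.count(x) > 0) return true;
  }
  return false;
}

// Conflict: write/write, write/read, or read/write on a common buffer.
// Read/read on the same buffer is not a conflict.
bool HasConflict(const ToyRWSet& a, const ToyRWSet& b) {
  return Intersects(a.writes, b.writes) || Intersects(a.writes, b.reads) ||
         Intersects(a.reads, b.writes);
}

// Earliest index the thunk at `pos` can be hoisted to without crossing a
// conflicting thunk (how far up a start thunk could move in this model).
std::size_t HoistIndex(const std::vector<ToyRWSet>& uses, std::size_t pos) {
  std::size_t target = pos;
  while (target > 0 && !HasConflict(uses[target - 1], uses[pos])) --target;
  return target;
}

// Latest index the thunk at `pos` can be sunk to without crossing a
// conflicting thunk (how far down a done thunk could move in this model).
std::size_t SinkIndex(const std::vector<ToyRWSet>& uses, std::size_t pos) {
  std::size_t target = pos;
  while (target + 1 < uses.size() &&
         !HasConflict(uses[target + 1], uses[pos])) {
    ++target;
  }
  return target;
}
```

For example, with a sequence where thunk 0 writes buffer 1, thunk 1 touches unrelated buffers, and thunk 2 is a start that reads buffer 1, the start can be hoisted above thunk 1 but not above thunk 0. A real pass would additionally have to keep each start before its matching done and account for resource uses.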