[xla:gpu] Add async thunk passes for optimizing async execution #39435

Closed
ezhulenev wants to merge 1 commit into openxla:main from
ezhulenev:async-thunk-passes-0

Conversation

Contributor

@ezhulenev ezhulenev commented Mar 18, 2026

Thunk passes that correct latency hiding scheduler (LHS) scheduling decisions when buffer assignment permits:

  • Implement RemoveRedundantAsyncThunkPass — removes redundant async start/done thunk pairs where an AsyncDoneThunk immediately follows its matching AsyncStartThunk. In this case there is no actual asynchronous execution (the done thunk just waits for an event recorded right before it), so we inline the nested thunk sequence from the start thunk, avoiding the overhead of creating an async execution scope (recording events and synchronizing streams).

  • Implement ExpandAsyncScopeThunkPass — a new thunk pass that widens async execution scopes by moving AsyncStartThunk as far up the thunk sequence as possible and AsyncDoneThunk as far down as possible, maximizing the overlap window between async operations (e.g., collectives) and compute.

  • Conflict detection uses BufferUse::ReadWriteSet and ResourceUse::ReadWriteSet, with buffer and resource uses collected transitively via the Thunk::Walk API. AsyncDoneThunk (which reports no uses of its own) inherits the uses of all matching AsyncStartThunks to handle pipelined async chains correctly.
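
The scope-expansion idea and its read/write conflict check can be sketched on a toy model. This is only an illustration in Python, not the XLA implementation: the `Thunk` dataclass, `expand_async_scopes`, and the buffer names are all hypothetical, only buffer uses are modeled (no resource uses), and nested thunk sequences are ignored. A done thunk has no uses of its own here, so it borrows the uses of its matching start.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(eq=False)  # identity comparison, so list.index finds this exact object
class Thunk:
    """Hypothetical stand-in for a thunk; not the XLA class."""
    name: str
    kind: str = "kernel"  # "kernel", "start", or "done"
    reads: FrozenSet[str] = frozenset()
    writes: FrozenSet[str] = frozenset()
    async_id: Optional[int] = None


def expand_async_scopes(thunks: List[Thunk]) -> List[Thunk]:
    """Hoist start thunks and sink done thunks past non-conflicting neighbors."""
    seq = list(thunks)
    starts = {t.async_id: t for t in seq if t.kind == "start"}

    def uses(t: Thunk):
        # A done thunk inherits the uses of its matching start thunk.
        if t.kind == "done":
            s = starts[t.async_id]
            return s.reads, s.writes
        return t.reads, t.writes

    def conflict(a: Thunk, b: Thunk) -> bool:
        a_reads, a_writes = uses(a)
        b_reads, b_writes = uses(b)
        # Write-write, write-read, and read-write overlaps are conflicts.
        return bool(a_writes & (b_reads | b_writes)) or bool(a_reads & b_writes)

    # Move every start thunk up until it meets a conflicting thunk.
    for t in [t for t in seq if t.kind == "start"]:
        i = seq.index(t)
        while i > 0 and not conflict(t, seq[i - 1]):
            seq[i - 1], seq[i] = seq[i], seq[i - 1]
            i -= 1

    # Move every done thunk down until it meets a conflicting thunk.
    for t in [t for t in reversed(seq) if t.kind == "done"]:
        i = seq.index(t)
        while i < len(seq) - 1 and not conflict(t, seq[i + 1]):
            seq[i], seq[i + 1] = seq[i + 1], seq[i]
            i += 1

    return seq


sequence = [
    Thunk("k0", writes=frozenset({"x"})),                          # produces x
    Thunk("k1", reads=frozenset({"a"}), writes=frozenset({"c"})),  # independent
    Thunk("ar-start", "start", frozenset({"x"}), frozenset({"y"}), async_id=1),
    Thunk("ar-done", "done", async_id=1),
    Thunk("k2", reads=frozenset({"c"}), writes=frozenset({"d"})),  # independent
    Thunk("k3", reads=frozenset({"y"}), writes=frozenset({"e"})),  # consumes y
]
expanded = [t.name for t in expand_async_scopes(sequence)]
print(expanded)
```

In this sketch the start thunk stops below `k0` (which writes the buffer it reads) and the done thunk stops above `k3` (which reads the buffer it writes), so `k1` and `k2` end up inside the async scope.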

@ezhulenev ezhulenev force-pushed the async-thunk-passes-0 branch 6 times, most recently from ae3fac7 to 5a820d2 Compare March 19, 2026 20:46
@ezhulenev ezhulenev requested review from olegshyshkov and removed request for pifon2a March 19, 2026 21:03
@ezhulenev ezhulenev force-pushed the async-thunk-passes-0 branch 2 times, most recently from 506f29a to 445de51 Compare March 22, 2026 18:12
@ezhulenev ezhulenev force-pushed the async-thunk-passes-0 branch from 445de51 to dc1989a Compare March 24, 2026 00:53
@ezhulenev ezhulenev force-pushed the async-thunk-passes-0 branch from dc1989a to d1dd496 Compare April 2, 2026 21:57
Member

@seantalts seantalts left a comment

otherwise looks good to me

Member

Could you add a couple of tests? (I think these might fail today.)

  1. Redundant pair nested inside a non-redundant AsyncStartThunk: An AsyncStartThunk whose nested sequence contains [start_inner, done_inner], where the outer start and done are not adjacent. The inner pair should still get inlined.
  2. Same for ExpandAsyncScopeThunkPass: An AsyncStartThunk whose nested sequence contains [kernel, start, kernel, done, kernel]. The inner scope should get expanded.

Probably also good to have a test that combines the passes and makes sure they interact well, and also test multiple independent async pairs.
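
For reference, the first requested case (plus the multiple-independent-pairs case) can be sketched on a toy model. This is an assumption-laden Python illustration, not the pass from this PR: the `Thunk` dataclass and `remove_redundant_async` are hypothetical names, and the point is only that recursing into nested sequences before checking adjacency lets an inner redundant pair be inlined even when the outer pair is not adjacent.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Thunk:
    """Hypothetical stand-in for a thunk with a nested thunk sequence."""
    name: str
    kind: str = "kernel"  # "kernel", "start", or "done"
    async_id: Optional[int] = None
    nested: List["Thunk"] = field(default_factory=list)


def remove_redundant_async(seq: List[Thunk]) -> List[Thunk]:
    """Inline start/done pairs where the done immediately follows its start."""
    out: List[Thunk] = []
    i = 0
    while i < len(seq):
        thunk = seq[i]
        # Recurse first so pairs nested inside any thunk are handled too.
        thunk.nested = remove_redundant_async(thunk.nested)
        nxt = seq[i + 1] if i + 1 < len(seq) else None
        if (
            thunk.kind == "start"
            and nxt is not None
            and nxt.kind == "done"
            and nxt.async_id == thunk.async_id
        ):
            # No real overlap is possible: replace the pair with the start's body.
            out.extend(thunk.nested)
            i += 2
        else:
            out.append(thunk)
            i += 1
    return out


# Case 1: a redundant inner pair inside a non-redundant outer scope.
inner = [
    Thunk("inner-start", "start", async_id=1, nested=[Thunk("inner-kernel")]),
    Thunk("inner-done", "done", async_id=1),
]
outer = [
    Thunk("outer-start", "start", async_id=0, nested=inner),
    Thunk("kernel"),
    Thunk("outer-done", "done", async_id=0),
]
result = remove_redundant_async(outer)
top_level = [t.name for t in result]
outer_body = [t.name for t in result[0].nested]

# Multiple independent redundant pairs in one sequence.
pairs = [
    Thunk("s1", "start", async_id=1, nested=[Thunk("a")]),
    Thunk("d1", "done", async_id=1),
    Thunk("s2", "start", async_id=2, nested=[Thunk("b")]),
    Thunk("d2", "done", async_id=2),
]
flattened = [t.name for t in remove_redundant_async(pairs)]
print(top_level, outer_body, flattened)
```

In the toy model the outer scope survives because a kernel sits between its start and done, while the inner pair collapses to its nested kernel.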

@ezhulenev ezhulenev marked this pull request as draft April 7, 2026 05:04
Contributor Author

Looks like once we already have a thunk sequence it's too late to optimize it; this has to be done earlier, at the HLO level. Abandoning it for now.

@ezhulenev ezhulenev closed this Apr 16, 2026
2 participants