@therault
Contributor

Currently, the GPU manager systematically records at least 3 polling events per task (more for TTG): one after kernel_push, one after kernel submission, and one after kernel_pop. The event recorded after kernel_push is needed whenever some input data is not yet on the GPU, even if the task itself did not schedule a cudaMemcpyAsync: the cudaMemcpyAsync it depends on may already be executing, and we rely on the serialization of the input stream to guarantee ordering.
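The ordering argument above can be illustrated with a toy model of an in-order stream. This is only a sketch: `toy_stream_t`, `toy_event_t` and the `toy_*` functions are hypothetical names made up for the illustration and are neither PaRSEC nor CUDA APIs. It assumes that operations on one stream complete strictly in submission order, so an event recorded by a task that scheduled no copy still completes only after the copies enqueued before it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model (hypothetical): operations on a stream complete in submission order. */
typedef struct {
    unsigned long submitted;  /* operations enqueued so far */
    unsigned long completed;  /* operations finished so far */
} toy_stream_t;

/* An event remembers how many operations were enqueued when it was recorded. */
typedef struct {
    unsigned long seq;
} toy_event_t;

/* Enqueue one operation (for example an asynchronous copy) on the stream. */
static void toy_stream_submit(toy_stream_t *s)
{
    assert(s != NULL);
    s->submitted++;
}

/* Let the device finish the oldest pending operation, if any. */
static void toy_stream_progress(toy_stream_t *s)
{
    assert(s != NULL);
    if (s->completed < s->submitted) {
        s->completed++;
    }
}

/* Record an event behind everything currently enqueued on the stream. */
static toy_event_t toy_event_record(const toy_stream_t *s)
{
    assert(s != NULL);
    toy_event_t e = { s->submitted };
    return e;
}

/* The event is done once every operation enqueued before it has completed. */
static bool toy_event_done(const toy_stream_t *s, toy_event_t e)
{
    assert(s != NULL);
    return s->completed >= e.seq;
}
```

In this model, if task A submits a copy and task B (which needs the same data but schedules nothing) records an event right after, B's event stays pending until A's copy has completed. An event recorded on an idle stream is done immediately, which is the case where polling it brings no ordering benefit.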

However, in many cases (typically O(nt^3-nt^2) tasks for a GEMM or a POTRF that fits in GPU memory), all data is already on the GPU. In that case, recording an event and waiting for it to exit the queue is unnecessary overhead.
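To give a rough sense of scale, the following sketch counts polling events for a GEMM with nt^3 tasks. It assumes exactly 3 events per task in the baseline (as described above) and that about `c * nt^2` tasks still have to wait on the input stream; the constant `c` and both function names are hypothetical, chosen only for the illustration, and are not measured values.

```c
#include <assert.h>

/* Baseline: 3 polling events per task (after push, after submission, after pop). */
static long long events_baseline(long long nt)
{
    assert(nt >= 0);
    return 3 * nt * nt * nt;
}

/* With the skip: every task keeps 2 events, and only the tasks that still
 * wait on the input stream (assumed to be about c * nt^2) keep the third. */
static long long events_with_skip(long long nt, long long c)
{
    assert(nt >= 0 && c >= 0);
    long long tasks = nt * nt * nt;
    long long waiting = c * nt * nt;
    if (waiting > tasks) {
        waiting = tasks;  /* cannot have more waiting tasks than tasks */
    }
    return 2 * tasks + waiting;
}
```

Under these assumptions, with `nt = 16` and `c = 3` the baseline is 12288 events and the count with the skip is 8960, so the saving approaches one third of all polled events as nt grows.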

This PR skips recording the event on the input stream when all input data is already on the GPU.
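A minimal sketch of that decision, under stated assumptions: the `data_state_t` states, `stage_in_plan_t` and `plan_stage_in` are hypothetical names invented for this illustration and do not reflect the actual PaRSEC data structures or the code in this PR. The point is only that an input already on the GPU contributes nothing, an input in flight requires waiting on the input stream without a new copy, and an input still on the host requires both a copy and an event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical residency states for one input of a task. */
typedef enum {
    DATA_ON_GPU,       /* a valid copy is already resident on the device */
    DATA_IN_TRANSFER,  /* a copy scheduled by an earlier task is still in flight */
    DATA_ON_HOST       /* no device copy yet: this task must schedule the transfer */
} data_state_t;

typedef struct {
    int  transfers_scheduled;  /* copies this task enqueues on the input stream */
    bool record_event;         /* whether an event must be recorded after stage-in */
} stage_in_plan_t;

/* Decide what the stage-in of a task with n inputs has to do. */
static stage_in_plan_t plan_stage_in(const data_state_t *inputs, size_t n)
{
    assert(inputs != NULL || n == 0);
    stage_in_plan_t plan = { 0, false };
    for (size_t i = 0; i < n; i++) {
        switch (inputs[i]) {
        case DATA_ON_HOST:
            plan.transfers_scheduled++;  /* new copy, so the task must wait for it */
            plan.record_event = true;
            break;
        case DATA_IN_TRANSFER:
            plan.record_event = true;    /* no new copy, but wait on the stream */
            break;
        case DATA_ON_GPU:
            break;                       /* nothing to do for this input */
        }
    }
    return plan;
}
```

When `record_event` is false, the task could go straight to an execution stream; whenever any input is on the host or in flight, the event on the input stream is still recorded.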

Pro:

  • reduces the number of events to poll

Con:

  • will increase the cases of out-of-order execution

@QingleiCao has tested the PR with variable tile sizes on DPLASMA: some benefit is observed for small tiles, and no effect (detrimental or beneficial) is observed for large tiles. @devreal wanted to test with TTG, where this is expected to bring more benefit (very little work in many tasks), and this PR should make it easier for him to test the combination of multiple branches.

@therault therault requested a review from a team as a code owner October 16, 2024 13:20
Contributor

@bosilca bosilca left a comment

By this logic, all submission functions should now return a positive number, including the DTD GPU submission, the coroutines, and all the user-provided stage_in and stage_out functions.

@devreal
Contributor

devreal commented Jan 23, 2025

@therault pretty please :)

@therault therault force-pushed the gpu-skip-records-when-nothing-scheduled-on-stream branch from 71edab8 to 385a251 Compare February 6, 2025 12:24
@therault therault force-pushed the gpu-skip-records-when-nothing-scheduled-on-stream branch from c740700 to 22d8b8a Compare March 27, 2025 14:28
@therault therault marked this pull request as ready for review March 27, 2025 14:30
therault and others added 4 commits March 27, 2025 15:37
When a GPU kernel had all its data already on the GPU, we would still schedule
a record event on the input stream. That would delay the scheduling of the kernel
until every asyncMemcpy scheduled earlier had completed its execution,
reducing parallelism between execution and I/O.

With this change, the aim is to entirely skip recording the event on the input
stream and directly schedule the ready-to-execute kernel on an exec stream of
the GPU.

Note that the behavior of data already in transfer is unchanged: no additional
transfer is scheduled, but the task needs to wait on the input stream progress
to be ready to execute.
…lready in transfer from CPU to GPU, fallback on scheduling another copy from the CPU if the transfer is yet incomplete.
…ng event while doing nothing to the GPU seems dubious to me)
@therault therault force-pushed the gpu-skip-records-when-nothing-scheduled-on-stream branch from 22d8b8a to 78a9749 Compare March 27, 2025 14:37