Make GPU manager skip records when nothing scheduled on input stream #681

therault · 2024-10-16T13:20:35Z

Currently, the GPU manager systematically records 3 polling events (or more for TTG) per task: one after kernel_push, one after kernel submission, and one after kernel_pop. The event recorded after kernel_push is needed whenever any input data is not yet on the GPU, even if the task itself did not schedule a cudaMemcpyAsync: the cudaMemcpyAsync it depends on is already in progress, and we rely on the serialization of the input stream to guarantee ordering.

However, in many cases (typically O(nt^3-nt^2) tasks for a GEMM that fits in GPU memory, or for a POTRF that fits in GPU memory), all data are already on the GPU. In that case, recording an event and waiting for it to exit the queue is unnecessary overhead.

This PR skips recording the event on the input stream when all input data is already on the GPU.
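The decision described above can be sketched as a small C model. This is a minimal illustration, not the code in parsec/mca/device/device_gpu.c: the type `flow_state_t` and the functions `stage_in_count_pending` and `must_record_input_event` are hypothetical names introduced here, and the asynchronous copy is only simulated by a state change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical residency states for one input flow of a task. */
typedef enum {
    FLOW_ON_DEVICE,   /* a valid copy already lives in GPU memory */
    FLOW_IN_TRANSFER, /* an earlier task scheduled the copy; still in flight */
    FLOW_ON_HOST      /* no copy on the GPU and none in flight */
} flow_state_t;

/* Walk the input flows of a task. Host-only data gets a (simulated)
 * transfer scheduled; data already in flight schedules nothing new.
 * Returns how many flows force the task to wait on the input stream,
 * and stores in *scheduled how many new transfers were issued. */
static int stage_in_count_pending(flow_state_t *flows, size_t nb_flows,
                                  int *scheduled)
{
    assert(flows != NULL || nb_flows == 0);
    int pending = 0;
    *scheduled = 0;
    for (size_t i = 0; i < nb_flows; i++) {
        switch (flows[i]) {
        case FLOW_ON_HOST:
            /* Stands in for an asynchronous host-to-device copy. */
            flows[i] = FLOW_IN_TRANSFER;
            (*scheduled)++;
            pending++;
            break;
        case FLOW_IN_TRANSFER:
            /* No new copy, but stream ordering is still required. */
            pending++;
            break;
        case FLOW_ON_DEVICE:
            break;
        }
    }
    return pending;
}

/* The event on the input stream is only needed when something is pending. */
static bool must_record_input_event(int pending)
{
    return pending > 0;
}
```

In this model, a task whose flows are all `FLOW_ON_DEVICE` yields zero pending flows and can go straight to an exec stream, while a task with one flow in `FLOW_IN_TRANSFER` still records the event even though it scheduled no copy of its own.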

Pro:

reduces the number of events to poll

Con:

will increase the cases of out-of-order execution

@QingleiCao has tested the PR with variable tile sizes, and some benefit is observed for small tiles. No effect (detrimental or beneficial) is observed for large tiles on DPLASMA. @devreal wanted to test with TTG, where this is expected to bring more benefit (very small work in many tasks), and this PR should make it easier for him to test the combination of multiple branches.

bosilca

By this logic, all submission functions should now return a positive number, including the DTD GPU submission, the coroutines, and all the user-provided stage_in and stage_out functions.

parsec/mca/device/device_gpu.c

devreal · 2025-01-23T19:21:33Z

@therault pretty please :)

When a GPU kernel had all its data already on the GPU, we would still record an event on the input stream. That delayed the scheduling of the kernel until every asyncMemcpy scheduled before it had completed, reducing parallelism between execution and I/O. With this change, the aim is to skip recording the event on the input stream entirely and to schedule the ready-to-execute kernel directly on an exec stream of the GPU. Note that the behavior for data already in transfer is unchanged: no additional transfer is scheduled, but the task needs to wait for the input stream to progress before it is ready to execute.
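To make the delay concrete, here is a toy timing model in C. It is only a sketch under simplifying assumptions: the input stream is treated as a strict in-order queue, the remaining duration of each queued transfer is known, and event polling latency is ignored. The functions `event_fire_time` and `kernel_ready_time` are hypothetical and do not exist in PaRSEC.

```c
#include <assert.h>
#include <stddef.h>

/* Time at which an event recorded at the tail of an in-order stream fires.
 * The stream runs its operations one after another, so the event completes
 * only after every operation queued before it has finished. */
static double event_fire_time(const double *durations, size_t n, double now)
{
    assert(durations != NULL || n == 0);
    double t = now;
    for (size_t i = 0; i < n; i++) {
        t += durations[i];
    }
    return t;
}

/* Earliest time a kernel can be handed to an exec stream.
 * If all its inputs are resident and the skip is enabled, it does not
 * wait for unrelated transfers queued on the input stream. */
static double kernel_ready_time(const double *queued, size_t n, double now,
                                int all_inputs_on_device, int skip_enabled)
{
    if (all_inputs_on_device && skip_enabled) {
        return now;
    }
    return event_fire_time(queued, n, now);
}
```

With three unrelated transfers of 2.0, 3.0 and 1.5 time units queued at time 1.0, a kernel whose inputs are all resident becomes ready at 7.5 when the event is recorded and at 1.0 when it is skipped. A kernel with data still in transfer waits for the event in both cases.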


therault requested a review from a team as a code owner October 16, 2024 13:20

bosilca reviewed Oct 16, 2024


abouteiller marked this pull request as draft October 25, 2024 14:57

devreal mentioned this pull request Nov 26, 2024

PaRSEC now allows DSLs to free the gpu task TESSEorg/ttg#307

Open

devreal assigned therault Jan 23, 2025

therault force-pushed the gpu-skip-records-when-nothing-scheduled-on-stream branch from 71edab8 to 385a251 Compare February 6, 2025 12:24

therault force-pushed the gpu-skip-records-when-nothing-scheduled-on-stream branch from c740700 to 22d8b8a Compare March 27, 2025 14:28

therault marked this pull request as ready for review March 27, 2025 14:30

therault and others added 4 commits March 27, 2025 15:37

Patch from Aurelien: count properly how_many transfers have been done and don't forget to call complete_stage (b803c9c)

Rollback unintended change in PR -- do not change behavior wrt data already in transfer from CPU to GPU, fallback on scheduling another copy from the CPU if the transfer is yet incomplete. (1ae9564)

Close events that might have been started (even if starting a profiling event while doing nothing to the GPU seems dubious to me) (78a9749)

therault force-pushed the gpu-skip-records-when-nothing-scheduled-on-stream branch from 22d8b8a to 78a9749 Compare March 27, 2025 14:37

bosilca reviewed Mar 28, 2025


devreal approved these changes Apr 3, 2025
