Offload device task release to worker threads #687
Conversation
Add a LIFO for activities that are high-priority to the context. These activities are picked up by worker threads. With GPU execution, worker threads are mostly idle, so they can spare cycles handling the release of successor tasks, including potential communication.
A similar mechanism could apply to incoming communication, to relieve the communication thread and offload task release upon completion of a remote dep receive.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
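For illustration, here is a minimal sketch of the mechanism described above: a context-level LIFO of high-priority activities pushed by the device manager and drained by otherwise-idle worker threads before they fall back to the regular scheduler. All names (`hp_activity_t`, `context_t`, `hp_lifo_*`) are invented for this sketch and are not PaRSEC's actual types or API.

```c
/* Illustrative sketch only: a context-level LIFO of high-priority
 * "activities" (e.g. releasing the successors of a completed device task),
 * pushed by the GPU manager and drained by otherwise-idle worker threads.
 * All names here are placeholders, not PaRSEC's actual API. */
#include <stdatomic.h>
#include <stdio.h>

typedef struct hp_activity_s {
    struct hp_activity_s *next;
    void (*run)(void *arg);    /* e.g. release successors / trigger comms */
    void *arg;
} hp_activity_t;

typedef struct context_s {
    _Atomic(hp_activity_t *) hp_lifo;   /* context-level high-priority LIFO */
} context_t;

/* Producer side: the device manager pushes the completion work instead of
 * performing the release itself. */
static void hp_lifo_push(context_t *ctx, hp_activity_t *act)
{
    hp_activity_t *head = atomic_load_explicit(&ctx->hp_lifo, memory_order_relaxed);
    do {
        act->next = head;
    } while (!atomic_compare_exchange_weak_explicit(&ctx->hp_lifo, &head, act,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Consumer side: a worker thread pops one activity, or NULL if the LIFO is
 * empty so it can fall back to its normal task selection.  NOTE: a
 * production lock-free LIFO also needs ABA/memory-reclamation handling,
 * omitted here for brevity. */
static hp_activity_t *hp_lifo_pop(context_t *ctx)
{
    hp_activity_t *head = atomic_load_explicit(&ctx->hp_lifo, memory_order_acquire);
    while (head != NULL &&
           !atomic_compare_exchange_weak_explicit(&ctx->hp_lifo, &head, head->next,
                                                  memory_order_acquire,
                                                  memory_order_acquire))
        ;  /* CAS failure reloads 'head'; retry */
    return head;
}

/* What a worker would do each iteration: drain high-priority activities
 * first, then go to the regular scheduler. */
static void worker_poll(context_t *ctx)
{
    hp_activity_t *act;
    while ((act = hp_lifo_pop(ctx)) != NULL)
        act->run(act->arg);
    /* ... regular task selection would follow here ... */
}

static void demo_release(void *arg) { printf("releasing successors of %s\n", (const char *)arg); }

int main(void)
{
    context_t ctx;
    atomic_init(&ctx.hp_lifo, NULL);
    hp_activity_t act = { .next = NULL, .run = demo_release, .arg = "device task" };
    hp_lifo_push(&ctx, &act);   /* device manager side */
    worker_poll(&ctx);          /* worker thread side */
    return 0;
}
```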
|
We tried hard to avoid any context-level structures for managing tasks, in order to avoid contention between all the threads. This PR bypasses the scheduler and creates exactly what we were trying to avoid. I think PR #509 provides a cleaner and less intrusive solution (i.e., allow co-managers for the GPU that will share the completion burden).
|
I agree that this solution could lead to contention. I liked the idea of being able to also offload communication completion, though. I would argue that it's not much different from the default scheduler, which creates contention once the system queue is hit. My problem with #509 is that it permanently takes away an additional compute thread. That's worst if there are only two threads, since we then only have a maximum of two tasks on the device before both threads go out and submit another two. Granted, that is a bit of an extreme case... We also have no control over which thread is taken, so if we catch two threads across different NUMA nodes we could end up with quite some NUMA traffic just from the constant polling alone. Lastly, by having one thread do all the releases we a) serialize them; and b) the successors are likely to end up in the system queue, instead of being potentially distributed across several threads. There is your contention again...
|
Then you don't need an additional mechanism; simply push the tasks back into the PaRSEC scheduler. As long as the tasks are marked as being in the completion state, that should be handled nicely. If you want them to be executed quickly you can even bump their priority.
|
That is an option, yes. My concern is that we don't control the scheduler so we don't know where these tasks end up. With the default scheduler, pushing the tasks back will land them first in the local task queue of the GPU manager and then overflow into the global queue. In both cases, they won't be touched until worker threads run out of work locally, so they will end up delayed. Sure, they'll still have the highest priority in the global queue but as long as other threads have local work they won't touch the global queue. |
|
Set the distance to 1 and the tasks shall not be pushed into the local queue. If that's not the case, the code needs to be fixed. |
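For illustration, a minimal sketch of this alternative under stated assumptions: `sched_push()`, `task_t`, and its fields are placeholders rather than PaRSEC's actual scheduling API, and a distance greater than 0 is assumed to mean "skip the calling thread's local queue and push hierarchically instead".

```c
/* Sketch of the alternative discussed above: instead of a separate
 * context-level LIFO, hand the completed device task straight back to the
 * scheduler with a bumped priority and distance 1 so it skips the GPU
 * manager's local queue.  All names here are placeholders, not PaRSEC's
 * actual API. */
#include <limits.h>
#include <stdio.h>

typedef struct task_s {
    int priority;
    int in_completion;   /* assumed flag: only the release step remains */
} task_t;

/* Stand-in for the runtime's scheduling entry point.  'distance' > 0 is
 * assumed to bypass the calling thread's local queue. */
static int sched_push(task_t *task, int distance)
{
    printf("scheduling task prio=%d at distance %d\n", task->priority, distance);
    return 0;
}

/* Called by the GPU manager when a device task completes. */
static void release_completed_device_task(task_t *task)
{
    task->in_completion = 1;        /* completion logic runs on whichever thread picks it up */
    task->priority      = INT_MAX;  /* bump so an idle worker grabs it soon */
    sched_push(task, /* distance = */ 1);  /* do not land in the manager's local queue */
}

int main(void)
{
    task_t t = { .priority = 0, .in_completion = 0 };
    release_completed_device_task(&t);
    return 0;
}
```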
|
These tasks will still be pushed into the global queue once the local queue of the GPU manager is full. |
|
With distance 1 they should never be pushed into the GPU manager's local queue, but pushed hierarchically from there. But you are right, they will eventually end up in the global queue with high priority, which guarantees they will be executed relatively soon. And at least there is a single queue handling the contention, instead of two.