Offload device task release to worker threads #687
Conversation
Add a LIFO for activities that are high-priority to the context. These activities are picked up by worker threads. With GPU execution, worker threads are mostly idle, so they can spare cycles handling the release of successor tasks, including potential communication.
A similar mechanism could apply to incoming communication, to relieve the communication thread and offload task release upon completion of a remote dep receive.
Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
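For illustration, here is a minimal sketch of the mechanism described above: a context-level LIFO of high-priority activities pushed by the device manager and drained by otherwise-idle worker threads before they fall back to the regular scheduler. All names (`hp_activity_t`, `context_t`, `hp_lifo_*`) are invented for this sketch and are not PaRSEC's actual types or API.

```c
/* Illustrative sketch only: a context-level LIFO of high-priority
 * "activities" (e.g. releasing the successors of a completed device task),
 * pushed by the GPU manager and drained by otherwise-idle worker threads.
 * All names here are placeholders, not PaRSEC's actual API. */
#include <stdatomic.h>
#include <stdio.h>

typedef struct hp_activity_s {
    struct hp_activity_s *next;
    void (*run)(void *arg);    /* e.g. release successors / trigger comms */
    void *arg;
} hp_activity_t;

typedef struct context_s {
    _Atomic(hp_activity_t *) hp_lifo;   /* context-level high-priority LIFO */
} context_t;

/* Producer side: the device manager pushes the completion work instead of
 * performing the release itself. */
static void hp_lifo_push(context_t *ctx, hp_activity_t *act)
{
    hp_activity_t *head = atomic_load_explicit(&ctx->hp_lifo, memory_order_relaxed);
    do {
        act->next = head;
    } while (!atomic_compare_exchange_weak_explicit(&ctx->hp_lifo, &head, act,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Consumer side: a worker thread pops one activity, or NULL if the LIFO is
 * empty so it can fall back to its normal task selection.  NOTE: a
 * production lock-free LIFO also needs ABA/memory-reclamation handling,
 * omitted here for brevity. */
static hp_activity_t *hp_lifo_pop(context_t *ctx)
{
    hp_activity_t *head = atomic_load_explicit(&ctx->hp_lifo, memory_order_acquire);
    while (head != NULL &&
           !atomic_compare_exchange_weak_explicit(&ctx->hp_lifo, &head, head->next,
                                                  memory_order_acquire,
                                                  memory_order_acquire))
        ;  /* CAS failure reloads 'head'; retry */
    return head;
}

/* What a worker would do each iteration: drain high-priority activities
 * first, then go to the regular scheduler. */
static void worker_poll(context_t *ctx)
{
    hp_activity_t *act;
    while ((act = hp_lifo_pop(ctx)) != NULL)
        act->run(act->arg);
    /* ... regular task selection would follow here ... */
}

static void demo_release(void *arg) { printf("releasing successors of %s\n", (const char *)arg); }

int main(void)
{
    context_t ctx;
    atomic_init(&ctx.hp_lifo, NULL);
    hp_activity_t act = { .next = NULL, .run = demo_release, .arg = "device task" };
    hp_lifo_push(&ctx, &act);   /* device manager side */
    worker_poll(&ctx);          /* worker thread side */
    return 0;
}
```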
|
We tried hard to avoid any context-level structures for managing tasks, in order to avoid contention between all the threads. This PR bypasses the scheduler and creates exactly what we were trying to avoid. I think PR #509 provides a cleaner and less intrusive solution (i.e., allow co-managers for the GPU that will share the completion burden).
|
I agree that this solution could lead to contention. I liked the idea of being able to also offload communication completion, though. I would argue that it's not much different from the default scheduler, which creates contention once the system queue is hit. My problem with #509 is that it permanently takes away an additional compute thread. That's worst if there are only two threads, since we then only have a maximum of two tasks on the device before both threads go out and submit another two. Granted, that is a bit of an extreme case... We also have no control over which thread is taken, so if we catch two threads across different NUMA nodes we could end up with quite some NUMA traffic just from the constant polling alone. Lastly, by having one thread do all the releases we a) serialize them; and b) the successors are likely to end up in the system queue, instead of being potentially distributed across several threads. There is your contention again...
|
Then you don't need an additional mechanism; simply push the tasks back into the PaRSEC scheduler. As long as the tasks are marked as being in the completion state, that should be handled nicely. If you want them to be executed quickly you can even bump their priority.
|
That is an option, yes. My concern is that we don't control the scheduler so we don't know where these tasks end up. With the default scheduler, pushing the tasks back will land them first in the local task queue of the GPU manager and then overflow into the global queue. In both cases, they won't be touched until worker threads run out of work locally, so they will end up delayed. Sure, they'll still have the highest priority in the global queue but as long as other threads have local work they won't touch the global queue. |
|
Set the distance to 1 and the tasks shall not be pushed into the local queue. If that's not the case, the code needs to be fixed. |
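For illustration, a minimal sketch of this alternative under stated assumptions: `sched_push()`, `task_t`, and its fields are placeholders rather than PaRSEC's actual scheduling API, and a distance greater than 0 is assumed to mean "skip the calling thread's local queue and push hierarchically instead".

```c
/* Sketch of the alternative discussed above: instead of a separate
 * context-level LIFO, hand the completed device task straight back to the
 * scheduler with a bumped priority and distance 1 so it skips the GPU
 * manager's local queue.  All names here are placeholders, not PaRSEC's
 * actual API. */
#include <limits.h>
#include <stdio.h>

typedef struct task_s {
    int priority;
    int in_completion;   /* assumed flag: only the release step remains */
} task_t;

/* Stand-in for the runtime's scheduling entry point.  'distance' > 0 is
 * assumed to bypass the calling thread's local queue. */
static int sched_push(task_t *task, int distance)
{
    printf("scheduling task prio=%d at distance %d\n", task->priority, distance);
    return 0;
}

/* Called by the GPU manager when a device task completes. */
static void release_completed_device_task(task_t *task)
{
    task->in_completion = 1;        /* completion logic runs on whichever thread picks it up */
    task->priority      = INT_MAX;  /* bump so an idle worker grabs it soon */
    sched_push(task, /* distance = */ 1);  /* do not land in the manager's local queue */
}

int main(void)
{
    task_t t = { .priority = 0, .in_completion = 0 };
    release_completed_device_task(&t);
    return 0;
}
```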
|
These tasks will still be pushed into the global queue once the local queue of the GPU manager is full. |
|
With distance 1 they should never be pushed into the GPU manager's local queue, but pushed hierarchically from there. But you are right, they will eventually end up in the global queue with high priority, which guarantees they will be executed relatively soon. And at least there is a single queue handling the contention, instead of two.