
Conversation

@bosilca
Contributor

@bosilca bosilca commented Sep 10, 2024

Add support for sending and receiving the data directly from and to devices. There are a few caveats (noted in the commit log).

Note: because it includes the span renaming, this PR changes the public API and will need a version bump to 5.x.

  1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity (see the sketch after this list).

  2. Pass the correct data copy across the entire system, instead of
    falling back to the data copy of device 0 (CPU memory).
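
A minimal sketch of the owner-compute selection described in item 1, using illustrative stand-in types rather than the actual PaRSEC structures; the incoming copy is placed on whichever device currently owns the copy of the affinity data:

/* Illustrative sketch only: pick the device for a receive-side copy from the
 * task affinity, before any scheduler decision exists. Types are stand-ins. */
#include <stddef.h>

#define MAX_DEVICES 8

typedef struct copy_s { void *device_private; } copy_t;
typedef struct data_s { int owner_device; copy_t *copies[MAX_DEVICES]; } data_t;
typedef struct task_s { data_t *affinity; } task_t;

static int select_recv_device(const task_t *task)
{
    const data_t *aff = task->affinity;
    /* Static information only: the device owning the affinity data wins. */
    if (NULL != aff && aff->owner_device > 0 && aff->owner_device < MAX_DEVICES
        && NULL != aff->copies[aff->owner_device])
        return aff->owner_device;   /* owner-compute: allocate on that GPU */
    return 0;                       /* otherwise fall back to device 0 (CPU) */
}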

TODOs

  • rebase on c11 atomic fix
  • Add a configure option to enable GPU-aware communications.
  • Add a runtime configuration to turn on/off the gpu-aware comms?
  • Pass -g 2 tests
  • Failure with ctest get_best_device: scheduling.c:157: int __parsec_execute(parsec_execution_stream_t *, parsec_task_t *): Assertion `NULL != copy->original && NULL != copy->original->device_copies[0]' failed.
  • Failure with ctest nvlink, stress (segfault); details of why (it's because of using NEW): Topic/cuda aware communications #671 (comment)
  • Failure with ctest stage (presumably identical to intermittent failure in gemm/potrf): device_gpu.c:2470: int parsec_device_kernel_epilog(parsec_device_gpu_module_t *, parsec_gpu_task_t *): Assertion `PARSEC_DATA_STATUS_UNDER_TRANSFER == cpu_copy->data_transfer_status' failed.
  • RO data between tasks may reach an assert when doing D2D between devices that do not have peer_access between them (relevant only for weirdos)
  • POTRF crashes on Frontier (due to using uninitialized/after-free device_copies, 8976f0b)
  • readers values are miscounted when 2 or more GPUs are used per rank (Topic/cuda aware communications #671 (comment))

@bosilca bosilca requested a review from a team as a code owner September 10, 2024 04:35
@bosilca bosilca force-pushed the topic/cuda_aware_communications branch from 968bf7e to 6f2e034 on September 10, 2024 04:38
@bosilca bosilca force-pushed the topic/cuda_aware_communications branch 2 times, most recently from b3dfcdc to 0838a95, on September 10, 2024 05:03

@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from efa8386 to ab1a74a on October 11, 2024 18:07
@abouteiller
Contributor

Now passing 1-gpu/node, 8 ranks PTG POTRF.
Sorry, I had to force-push; there were issues with rebasing on master.

@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from cd7c475 to 3bab2d5 on October 16, 2024 20:43
bosilca and others added 10 commits October 30, 2024 09:59
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
This allows checking whether the data can be sent and received directly to
and from GPU buffers.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
This is a multi-part patch that allows the CPU to prepare a data copy
mapped onto a device.

1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity.

2. Pass the correct data copy across the entire system, instead of
   falling back to the data copy of device 0 (CPU memory).

Add a configure option to enable GPU-aware communications.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Name the data_t allocated for temporaries, allowing developers to track
them through the execution. Add the keys to all outputs (tasks and
copies).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
copy if we are passed-in a GPU copy, and we need to retain/release the
copies that we are swapping
@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from eb5c782 to 3e0cb38 on October 31, 2024 14:53
…ut-only flows, for which checking if they are control flows segfaults
@G-Ragghianti
Contributor

I think we need to create a CI test that targets gpu_nvidia and issues the job to that runner, correct?

@abouteiller
Contributor

abouteiller commented Jan 13, 2025

Failure in stress (and similar in nvlink) due to the code generating a pushback event when transferring the last tile on the GEMM -> DISCARD_C flow (m >= mt+1). This tile has no original->device_copies[0] because it was created directly without a backing DC (from a NEW in MAKE_C).

See further discussion in #671 (comment)

d@00000 GPU[1:cuda(0)]: Retrieve data (if any) for GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_scheduler:2719
d@00000 GPU[1:cuda(0)]: Try to Pop GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_pop:2264
d@00000 GPU[1:cuda(0)]: read copy 0x7ff06462f970 [ref_count 1] on flow A has readers (1) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: read copy 0x7ff064002c10 [ref_count 2] on flow C has readers (0) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: OUT Data copy 0x7ff064002c10 [ref_count 2] for flow C @parsec_device_kernel_pop:2330
Process 2891337 stopped
* thread #11, name = 'stress', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x70)
    frame #0: 0x00007ffff7eaa89b libparsec.so.4`parsec_device_kernel_pop(gpu_device=0x0000555555f7b7b0, gpu_task=0x00007ff06462a8c0, gpu_stream=0x0000555555f7bc68) at device_gpu.c:2341:17
   2338             if( gpu_task->pushout & (1 << i) ) {
   2339                 /* TODO: make sure no readers are working on the CPU version */
   2340                 original = gpu_copy->original;
-> 2341                 PARSEC_DEBUG_VERBOSE(10, parsec_gpu_output_stream,
   2342                                     "GPU[%d:%s]:\tMove D2H data <%s:%x> copy %p [ref_count %d] -- D:%p -> H:%p requested",
   2343                                     gpu_device->super.device_index, gpu_device->super.name, flow->name, original->key, gpu_copy, gpu_copy->super.super.obj_reference_count,
   2344                                      (void*)gpu_copy->device_private, original->device_copies[0]->device_private);

A potential fix is to allocate a dev0 copy, like is done for the network-received tiles; not sure why it doesn't already.
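
A minimal sketch of that potential fix, with stand-in types (not the actual PaRSEC structures): if a tile created from NEW carries only a GPU copy, attach a device-0 shadow copy before anything dereferences device_copies[0]:

/* Illustrative sketch only: guarantee a device-0 (CPU) shadow copy exists. */
#include <stdlib.h>

#define MAX_DEVICES 8

typedef struct copy_s { void *device_private; } copy_t;
typedef struct data_s { copy_t *device_copies[MAX_DEVICES]; size_t nb_elts; } data_t;

static void ensure_dev0_copy(data_t *original)
{
    if (NULL != original->device_copies[0]) return;        /* already shadowed */
    copy_t *cpu_copy = calloc(1, sizeof(*cpu_copy));
    cpu_copy->device_private = malloc(original->nb_elts);  /* host backing */
    original->device_copies[0] = cpu_copy;                 /* attach on device 0 */
}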

@abouteiller abouteiller linked an issue Feb 12, 2025 that may be closed by this pull request
* from the data and eventually release their memory.
*/
parsec_data_copy_detach(data, copy, copy->device_index);
zone_free((zone_malloc_t *)copy->arena_chunk, copy->device_private);
Contributor

Has this been checked for per-tile allocated data? The release of that memory is different and should not go into the zone allocator.
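
A minimal sketch of the distinction being asked about, with a hypothetical per_tile_release() helper and stand-in types (not the actual PaRSEC API): only copies backed by an arena chunk are returned to the zone allocator.

/* Illustrative sketch only: route the release according to how the copy's
 * memory was obtained. per_tile_release() is a hypothetical placeholder. */
#include <stddef.h>

typedef struct zone_malloc_s zone_malloc_t;
typedef struct copy_s {
    void *arena_chunk;      /* non-NULL only for zone-allocated copies */
    void *device_private;
} copy_t;

extern void zone_free(zone_malloc_t *zone, void *ptr);
extern void per_tile_release(void *ptr);   /* hypothetical */

static void release_copy_memory(copy_t *copy)
{
    if (NULL != copy->arena_chunk)
        zone_free((zone_malloc_t *)copy->arena_chunk, copy->device_private);
    else
        per_tile_release(copy->device_private);   /* not the zone allocator */
}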

gpu_device->super.device_index, gpu_device->super.name, original->key, (void*)gpu_copy->device_private);
}
assert(0 != (gpu_copy->flags & PARSEC_DATA_FLAG_PARSEC_OWNED) );
assert(0 != (gpu_copy->flags & PARSEC_DATA_FLAG_PARSEC_OWNED));
Contributor

I know this has been here before, but: how should we handle copies that are not owned by parsec? They still seem to end up in the LRU even if they are only managed, not owned.
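
A minimal sketch of one way to honor that distinction, with stand-in flags and types (not the actual PaRSEC API): only parsec-owned copies are placed on the eviction LRU, while managed-only copies are left to their owner.

/* Illustrative sketch only: gate LRU insertion on the ownership flag. */
#include <stdint.h>
#include <stddef.h>

#define FLAG_PARSEC_OWNED (1u << 0)  /* stand-in for PARSEC_DATA_FLAG_PARSEC_OWNED */

typedef struct copy_s { uint32_t flags; struct copy_s *lru_next; } copy_t;
typedef struct lru_s  { copy_t *head; } lru_t;

static void retire_copy(lru_t *lru, copy_t *copy)
{
    if (copy->flags & FLAG_PARSEC_OWNED) {   /* eligible for eviction by the runtime */
        copy->lru_next = lru->head;
        lru->head = copy;
    }
    /* else: merely managed; its lifetime belongs to the application */
}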

@abouteiller
Contributor

I merged with master, but apparently I missed a defect in the error-case printouts that causes the CI failures.

gpu_copy->coherency_state = PARSEC_DATA_COHERENCY_SHARED;
assert(PARSEC_DATA_STATUS_UNDER_TRANSFER == cpu_copy->data_transfer_status);
cpu_copy->data_transfer_status = PARSEC_DATA_STATUS_COMPLETE_TRANSFER;
if( 0 == (parsec_mpi_allow_gpu_memory_communications & PARSEC_RUNTIME_SEND_GPU_MEMORY) ) {
Contributor

@abouteiller abouteiller Feb 26, 2025

Followup to #671 (comment)
The issue with stress and friends stems from here: when we pushout and there is a successor task, the successor must receive the CPU copy as input (otherwise it will reference a gpu_copy that is now in the read LRU).

A fix would need to distinguish between the case where we pushout to satisfy a communication and the case where we pushout to satisfy the input of a CPU-only task.
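
A minimal sketch of that distinction, with illustrative names only (not the actual PaRSEC structures): record why the flow is pushed out, and only swap the successor's input to the CPU copy when the pushout feeds a CPU-side consumer.

/* Illustrative sketch only: choose the successor's input copy from the
 * recorded pushout reason. */
#include <stddef.h>

enum pushout_reason { PUSHOUT_NONE, PUSHOUT_FOR_COMM, PUSHOUT_FOR_CPU_TASK };

typedef struct copy_s { int device; } copy_t;
typedef struct flow_state_s {
    enum pushout_reason reason;
    copy_t *gpu_copy;
    copy_t *cpu_copy;
} flow_state_t;

static copy_t *successor_input(const flow_state_t *fs)
{
    if (PUSHOUT_FOR_CPU_TASK == fs->reason)
        return fs->cpu_copy;   /* CPU consumer must not see the LRU'd gpu_copy */
    return fs->gpu_copy;       /* comm-only pushout: GPU successor keeps its copy */
}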

Contributor Author

If a pushout is requested then we should always pass back the CPU copy (after it was updated). I need to understand the case you are describing here. Is there a simple reproducer I can play with?

Contributor

The reproducer is the stress CI tester.

@abouteiller
Contributor

I merged with master, but apparently I missed a defect in the error-case printouts that causes the CI failures.

CI failure due to using data_in uninitialized in the new context nc during ontask, introduced by changes to task_snprintf in 108b778.

@abouteiller abouteiller added the API Change (Change to the public API, backward incompatible, version major bump) and enhancement (New feature or request) labels Mar 14, 2025
@abouteiller abouteiller added this to the v5.0 milestone Mar 14, 2025
@devreal
Contributor

devreal commented Mar 26, 2025

I see dangling copies on the device. This might just require a fix in the cleanup code (ignoring data that only has the device copy):

[****] TIME(s)      7.35572 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  102044.227028 gflops - ENQ&PROG&DEST      7.75205 :   96827.061209 gflops - ENQ      0.39626 - DEST      0.00008
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11ca00000) and it is discarding it!
[****] TIME(s)      6.82278 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  110015.002854 gflops - ENQ&PROG&DEST      6.82296 :  110012.229403 gflops - ENQ      0.00009 - DEST      0.00008
[****] TIME(s)      6.70474 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111951.924658 gflops - ENQ&PROG&DEST      6.70490 :  111949.236353 gflops - ENQ      0.00009 - DEST      0.00007
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fded0a00000) and it is discarding it!
[****] TIME(s)      6.64373 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  112979.994900 gflops - ENQ&PROG&DEST      6.64388 :  112977.490061 gflops - ENQ      0.00009 - DEST      0.00006
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11aa00000) and it is discarding it!
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe138a00000) and it is discarding it!
[****] TIME(s)      6.71053 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111855.299517 gflops - ENQ&PROG&DEST      6.71068 :  111852.904738 gflops - ENQ      0.00009 - DEST      0.00005

Mismatch in printf type sizes might lead to segmentation faults.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
@abouteiller
Contributor

I see dangling copies on the device. This might just require a fix in the cleanup code (ignoring data that only has the device copy):

[****] TIME(s)      7.35572 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  102044.227028 gflops - ENQ&PROG&DEST      7.75205 :   96827.061209 gflops - ENQ      0.39626 - DEST      0.00008
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11ca00000) and it is discarding it!
[****] TIME(s)      6.82278 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  110015.002854 gflops - ENQ&PROG&DEST      6.82296 :  110012.229403 gflops - ENQ      0.00009 - DEST      0.00008
[****] TIME(s)      6.70474 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111951.924658 gflops - ENQ&PROG&DEST      6.70490 :  111949.236353 gflops - ENQ      0.00009 - DEST      0.00007
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fded0a00000) and it is discarding it!
[****] TIME(s)      6.64373 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  112979.994900 gflops - ENQ&PROG&DEST      6.64388 :  112977.490061 gflops - ENQ      0.00009 - DEST      0.00006
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11aa00000) and it is discarding it!
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe138a00000) and it is discarding it!
[****] TIME(s)      6.71053 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111855.299517 gflops - ENQ&PROG&DEST      6.71068 :  111852.904738 gflops - ENQ      0.00009 - DEST      0.00005

This warning is overcautious now that we have GPU-only copies created from the network. Ideally we would find a way to discriminate between real leaks from the application and these temporaries being reclaimed.

@bosilca
Contributor Author

bosilca commented Mar 28, 2025

The original associated with these device-owned copies should not have a valid dc?
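
A minimal sketch of that discriminator, with stand-in types and a plain fprintf in place of the real warning macro (not the actual PaRSEC API): skip the warning when the data has no backing dc, i.e. it is one of the network-created temporaries.

/* Illustrative sketch only: warn about a discarded master copy only when the
 * data is backed by a data collection. */
#include <stddef.h>
#include <stdio.h>

typedef struct dc_s data_collection_t;          /* opaque stand-in */
typedef struct data_s {
    data_collection_t *dc;                      /* NULL for network temporaries */
    int   owner_device;
    void *master_ptr;
} data_t;

static void warn_if_real_leak(const data_t *d, int device)
{
    if (d->owner_device != device) return;      /* this device is not the owner */
    if (NULL == d->dc) return;                  /* reclaimed temporary: stay quiet */
    fprintf(stderr, "W: GPU[%d] still OWNS the master memory copy for data %p "
                    "and it is discarding it!\n", device, d->master_ptr);
}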

@devreal
Contributor

devreal commented Jul 14, 2025

Here is what I think happens in the stress benchmark:

  • We allocate a new C tile, which has a host and device copy.
  • At the end of the GEMM task, the device copy gets passed through the reshape code and a reference is added to the device copy because that is the copy that is captured in the parsec_dep_data_description_t:
    data.data   = this_task->data._f_A.data_out;
  • We release the host copy at the end of release_deps_of_stress_GEMM. However, the host copy only has a single reference (the data), so the host copy is detached from the parsec_data_t. The next time we use that parsec_data_t (in the next GEMM) we are missing the host copy.

I don't understand the reshape code, and I was hoping never to have to touch it. I suspect that the reshape code was not designed with GPUs in mind, but I could be wrong. I will need some help digging through this and figuring out:

  1. Whether capturing the device copy in reshape is correct.
  2. How we can make sure that the host copy is not accidentally released (see the sketch below).
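
A minimal sketch of one answer to question 2, with stand-in types and reference counting (not the actual PaRSEC or reshape API): keep an extra reference on the host copy while the device copy is the one captured in the dep description, so release_deps cannot drop the host copy's last reference and detach it.

/* Illustrative sketch only: pin the host copy when the device copy is the one
 * handed to the dependency machinery. */
#include <stddef.h>

typedef struct data_s data_t;
typedef struct copy_s { int ref_count; data_t *original; } copy_t;
struct data_s { copy_t *host_copy; copy_t *gpu_copy; };

static void copy_retain(copy_t *c) { c->ref_count++; }

static void copy_release(copy_t *c)
{
    if (0 == --c->ref_count && NULL != c->original) {
        /* last reference gone: detach this copy from its data_t */
        if (c->original->host_copy == c) c->original->host_copy = NULL;
        if (c->original->gpu_copy  == c) c->original->gpu_copy  = NULL;
    }
}

/* Capture the device copy for data.data, but keep the host copy alive. */
static copy_t *capture_output(data_t *d)
{
    copy_retain(d->gpu_copy);    /* reference held by the dep description */
    copy_retain(d->host_copy);   /* pin: survives the release_deps path */
    return d->gpu_copy;
}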
