
Conversation

@bosilca
Contributor

@bosilca bosilca commented Sep 10, 2024

Add support for sending and receiving the data directly from and to devices. There are a few caveats (noted in the commit log).

Note: because it includes the span renaming, this PR changes the public API and will need a version bump to 5.x.

  1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity (see the sketch after this list).

  2. Pass the correct data copy across the entire system, instead of
    falling back to the data copy of device 0 (CPU memory).
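
A minimal sketch of the owner-compute selection described in item 1, using illustrative stand-in types rather than the actual PaRSEC structures; the incoming copy is placed on whichever device currently owns the copy of the affinity data:

/* Illustrative sketch only: pick the device for a receive-side copy from the
 * task affinity, before any scheduler decision exists. Types are stand-ins. */
#include <stddef.h>

#define MAX_DEVICES 8

typedef struct copy_s { void *device_private; } copy_t;
typedef struct data_s { int owner_device; copy_t *copies[MAX_DEVICES]; } data_t;
typedef struct task_s { data_t *affinity; } task_t;

static int select_recv_device(const task_t *task)
{
    const data_t *aff = task->affinity;
    /* Static information only: the device owning the affinity data wins. */
    if (NULL != aff && aff->owner_device > 0 && aff->owner_device < MAX_DEVICES
        && NULL != aff->copies[aff->owner_device])
        return aff->owner_device;   /* owner-compute: allocate on that GPU */
    return 0;                       /* otherwise fall back to device 0 (CPU) */
}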

TODOs

  • rebase on c11 atomic fix
  • Add a configure option to enable GPU-aware communications.
  • Add a runtime configuration to turn on/off the gpu-aware comms?
  • Pass -g 2 tests
  • Failure with ctest get_best_device: scheduling.c:157: int __parsec_execute(parsec_execution_stream_t *, parsec_task_t *): Assertion `NULL != copy->original && NULL != copy->original->device_copies[0]' failed.
  • Failure with ctest nvlink, stress (segfault); details of why (it's because of using NEW): Topic/cuda aware communications #671 (comment)
  • Failure with ctest stage (presumably identical to intermittent failure in gemm/potrf): device_gpu.c:2470: int parsec_device_kernel_epilog(parsec_device_gpu_module_t *, parsec_gpu_task_t *): Assertion `PARSEC_DATA_STATUS_UNDER_TRANSFER == cpu_copy->data_transfer_status' failed.
  • RO data between tasks may reach an assert when doing D2D between devices that do not have peer_access between them (relevant only for weirdos)
  • POTRF crashes on Frontier (due to using uninitialized/after-free device_copies, 8976f0b)
  • readers values are miscounted when 2 or more GPUs are used per rank (Topic/cuda aware communications #671 (comment))

@bosilca bosilca requested a review from a team as a code owner September 10, 2024 04:35
@bosilca bosilca force-pushed the topic/cuda_aware_communications branch from 968bf7e to 6f2e034 on September 10, 2024 04:38
@bosilca bosilca force-pushed the topic/cuda_aware_communications branch 2 times, most recently from b3dfcdc to 0838a95, on September 10, 2024 05:03

@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from efa8386 to ab1a74a on October 11, 2024 18:07
@abouteiller
Contributor

Now passing 1-gpu/node, 8 ranks PTG POTRF.
Sorry, I had to force-push; there were issues with rebasing on master.

@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from cd7c475 to 3bab2d5 on October 16, 2024 20:43
bosilca and others added 10 commits October 30, 2024 09:59
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
This allows checking whether the data can be sent and received directly to
and from GPU buffers.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
This is a multi-part patch that allows the CPU to prepare a data copy
mapped onto a device.

1. The first question is: how is such a device selected?

The allocation of such a copy happens well before the scheduler is invoked
for a task, in fact before the task is even ready. Thus, we need to
decide on the location of this copy based only on static
information, such as the task affinity. Therefore, this approach only
works for owner-compute types of tasks, where the task will be executed
on the device that owns the data used for the task affinity.

2. Pass the correct data copy across the entire system, instead of
   falling back to the data copy of device 0 (CPU memory).

Add a configure option to enable GPU-aware communications.

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Name the data_t allocated for temporaries, allowing developers to track
them through the execution. Add the keys to all outputs (tasks and
copies).

Signed-off-by: George Bosilca <gbosilca@nvidia.com>
Signed-off-by: George Bosilca <gbosilca@nvidia.com>
copy if we are passed-in a GPU copy, and we need to retain/release the
copies that we are swapping
@abouteiller abouteiller force-pushed the topic/cuda_aware_communications branch from eb5c782 to 3e0cb38 on October 31, 2024 14:53
…ut-only flows, for which checking if they are control flows segfaults
@G-Ragghianti
Contributor

I think we need to create a CI test that targets gpu_nvidia and issues the job to that runner, correct?

@abouteiller
Contributor

abouteiller commented Jan 13, 2025

Failure in stress (and similar in nvlink) due to the code generating a pushback event when transferring the last tile on the GEMM -> DISCARD_C flow (m >= mt+1). This tile has no original->device_copies[0] because it was created directly without a backing DC (from a NEW in MAKE_C).

See further discussion in #671 (comment)

d@00000 GPU[1:cuda(0)]: Retrieve data (if any) for GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_scheduler:2719
d@00000 GPU[1:cuda(0)]: Try to Pop GEMM(79, 0, 0)[79, 0, 0]<0> keys = {4f, f000000000000001, 4f} {tp: 2} @parsec_device_kernel_pop:2264
d@00000 GPU[1:cuda(0)]: read copy 0x7ff06462f970 [ref_count 1] on flow A has readers (1) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: read copy 0x7ff064002c10 [ref_count 2] on flow C has readers (0) @parsec_device_kernel_pop:2323
d@00000 GPU[1:cuda(0)]: OUT Data copy 0x7ff064002c10 [ref_count 2] for flow C @parsec_device_kernel_pop:2330
Process 2891337 stopped
* thread #11, name = 'stress', stop reason = signal SIGSEGV: address not mapped to object (fault address: 0x70)
    frame #0: 0x00007ffff7eaa89b libparsec.so.4`parsec_device_kernel_pop(gpu_device=0x0000555555f7b7b0, gpu_task=0x00007ff06462a8c0, gpu_stream=0x0000555555f7bc68) at device_gpu.c:2341:17
   2338             if( gpu_task->pushout & (1 << i) ) {
   2339                 /* TODO: make sure no readers are working on the CPU version */
   2340                 original = gpu_copy->original;
-> 2341                 PARSEC_DEBUG_VERBOSE(10, parsec_gpu_output_stream,
   2342                                     "GPU[%d:%s]:\tMove D2H data <%s:%x> copy %p [ref_count %d] -- D:%p -> H:%p requested",
   2343                                     gpu_device->super.device_index, gpu_device->super.name, flow->name, original->key, gpu_copy, gpu_copy->super.super.obj_reference_count,
   2344                                      (void*)gpu_copy->device_private, original->device_copies[0]->device_private);

A potential fix is to allocate a dev0 copy, like is done for the network-received tiles; not sure why it doesn't already.
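
A minimal sketch of that potential fix, with stand-in types (not the actual PaRSEC structures): if a tile created from NEW carries only a GPU copy, attach a device-0 shadow copy before anything dereferences device_copies[0]:

/* Illustrative sketch only: guarantee a device-0 (CPU) shadow copy exists. */
#include <stdlib.h>

#define MAX_DEVICES 8

typedef struct copy_s { void *device_private; } copy_t;
typedef struct data_s { copy_t *device_copies[MAX_DEVICES]; size_t nb_elts; } data_t;

static void ensure_dev0_copy(data_t *original)
{
    if (NULL != original->device_copies[0]) return;        /* already shadowed */
    copy_t *cpu_copy = calloc(1, sizeof(*cpu_copy));
    cpu_copy->device_private = malloc(original->nb_elts);  /* host backing */
    original->device_copies[0] = cpu_copy;                 /* attach on device 0 */
}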

@abouteiller abouteiller linked an issue Feb 12, 2025 that may be closed by this pull request
* from the data and eventually release their memory.
*/
parsec_data_copy_detach(data, copy, copy->device_index);
zone_free((zone_malloc_t *)copy->arena_chunk, copy->device_private);
Contributor

Has this been checked for per-tile allocated data? The release of that memory is different and should not go into the zone allocator.
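
A minimal sketch of the distinction being asked about, with a hypothetical per_tile_release() helper and stand-in types (not the actual PaRSEC API): only copies backed by an arena chunk are returned to the zone allocator.

/* Illustrative sketch only: route the release according to how the copy's
 * memory was obtained. per_tile_release() is a hypothetical placeholder. */
#include <stddef.h>

typedef struct zone_malloc_s zone_malloc_t;
typedef struct copy_s {
    void *arena_chunk;      /* non-NULL only for zone-allocated copies */
    void *device_private;
} copy_t;

extern void zone_free(zone_malloc_t *zone, void *ptr);
extern void per_tile_release(void *ptr);   /* hypothetical */

static void release_copy_memory(copy_t *copy)
{
    if (NULL != copy->arena_chunk)
        zone_free((zone_malloc_t *)copy->arena_chunk, copy->device_private);
    else
        per_tile_release(copy->device_private);   /* not the zone allocator */
}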

gpu_device->super.device_index, gpu_device->super.name, original->key, (void*)gpu_copy->device_private);
}
assert(0 != (gpu_copy->flags & PARSEC_DATA_FLAG_PARSEC_OWNED) );
assert(0 != (gpu_copy->flags & PARSEC_DATA_FLAG_PARSEC_OWNED));
Contributor

I know this has been here before, but: how should we handle copies that are not owned by parsec? They still seem to end up in the LRU even if they are only managed, not owned.
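
A minimal sketch of one way to honor that distinction, with stand-in flags and types (not the actual PaRSEC API): only parsec-owned copies are placed on the eviction LRU, while managed-only copies are left to their owner.

/* Illustrative sketch only: gate LRU insertion on the ownership flag. */
#include <stdint.h>
#include <stddef.h>

#define FLAG_PARSEC_OWNED (1u << 0)  /* stand-in for PARSEC_DATA_FLAG_PARSEC_OWNED */

typedef struct copy_s { uint32_t flags; struct copy_s *lru_next; } copy_t;
typedef struct lru_s  { copy_t *head; } lru_t;

static void retire_copy(lru_t *lru, copy_t *copy)
{
    if (copy->flags & FLAG_PARSEC_OWNED) {   /* eligible for eviction by the runtime */
        copy->lru_next = lru->head;
        lru->head = copy;
    }
    /* else: merely managed; its lifetime belongs to the application */
}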

@abouteiller
Contributor

I merged with master, but apparently I missed a defect in the error-case printouts that causes the CI failures.

gpu_copy->coherency_state = PARSEC_DATA_COHERENCY_SHARED;
assert(PARSEC_DATA_STATUS_UNDER_TRANSFER == cpu_copy->data_transfer_status);
cpu_copy->data_transfer_status = PARSEC_DATA_STATUS_COMPLETE_TRANSFER;
if( 0 == (parsec_mpi_allow_gpu_memory_communications & PARSEC_RUNTIME_SEND_GPU_MEMORY) ) {
Contributor

@abouteiller abouteiller Feb 26, 2025

Followup to #671 (comment)
The issue with stress and friends stems from here: when we pushout and there is a successor task, the successor must receive the CPU copy as input (otherwise it will reference a gpu_copy that is now in the read LRU).

A fix would need to distinguish between the case where we pushout to satisfy a communication and the case where we pushout to satisfy the input of a CPU-only task.
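
A minimal sketch of that distinction, with illustrative names only (not the actual PaRSEC structures): record why the flow is pushed out, and only swap the successor's input to the CPU copy when the pushout feeds a CPU-side consumer.

/* Illustrative sketch only: choose the successor's input copy from the
 * recorded pushout reason. */
#include <stddef.h>

enum pushout_reason { PUSHOUT_NONE, PUSHOUT_FOR_COMM, PUSHOUT_FOR_CPU_TASK };

typedef struct copy_s { int device; } copy_t;
typedef struct flow_state_s {
    enum pushout_reason reason;
    copy_t *gpu_copy;
    copy_t *cpu_copy;
} flow_state_t;

static copy_t *successor_input(const flow_state_t *fs)
{
    if (PUSHOUT_FOR_CPU_TASK == fs->reason)
        return fs->cpu_copy;   /* CPU consumer must not see the LRU'd gpu_copy */
    return fs->gpu_copy;       /* comm-only pushout: GPU successor keeps its copy */
}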

Contributor Author

If a pushout is requested then we should always pass back the CPU copy (after it was updated). I need to understand the case you are describing here. Is there a simple reproducer I can play with?

Contributor

The reproducer is the stress CI tester.

@abouteiller
Contributor

I merged with master, but apparently I missed a defect in the error-case printouts that causes the CI failures.

CI failure due to using data_in uninitialized in the new context nc during ontask, introduced by changes to task_snprintf in 108b778.

@abouteiller abouteiller added the API Change (Change to the public API, backward incompatible, version major bump) and enhancement (New feature or request) labels Mar 14, 2025
@abouteiller abouteiller added this to the v5.0 milestone Mar 14, 2025
@devreal
Contributor

devreal commented Mar 26, 2025

I see dangling copies on the device. This might just require a fix in the cleanup code (ignoring data that only has the device copy):

[****] TIME(s)      7.35572 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  102044.227028 gflops - ENQ&PROG&DEST      7.75205 :   96827.061209 gflops - ENQ      0.39626 - DEST      0.00008
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11ca00000) and it is discarding it!
[****] TIME(s)      6.82278 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  110015.002854 gflops - ENQ&PROG&DEST      6.82296 :  110012.229403 gflops - ENQ      0.00009 - DEST      0.00008
[****] TIME(s)      6.70474 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111951.924658 gflops - ENQ&PROG&DEST      6.70490 :  111949.236353 gflops - ENQ      0.00009 - DEST      0.00007
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fded0a00000) and it is discarding it!
[****] TIME(s)      6.64373 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  112979.994900 gflops - ENQ&PROG&DEST      6.64388 :  112977.490061 gflops - ENQ      0.00009 - DEST      0.00006
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11aa00000) and it is discarding it!
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe138a00000) and it is discarding it!
[****] TIME(s)      6.71053 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111855.299517 gflops - ENQ&PROG&DEST      6.71068 :  111852.904738 gflops - ENQ      0.00009 - DEST      0.00005

Mismatch in printf type sizes might lead to segmentation faults.

Signed-off-by: Joseph Schuchart <joseph.schuchart@stonybrook.edu>
@abouteiller
Contributor

I see dangling copies on the device. This might just require a fix in the cleanup code (ignoring data that only has the device copy):

[****] TIME(s)      7.35572 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  102044.227028 gflops - ENQ&PROG&DEST      7.75205 :   96827.061209 gflops - ENQ      0.39626 - DEST      0.00008
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11ca00000) and it is discarding it!
[****] TIME(s)      6.82278 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  110015.002854 gflops - ENQ&PROG&DEST      6.82296 :  110012.229403 gflops - ENQ      0.00009 - DEST      0.00008
[****] TIME(s)      6.70474 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111951.924658 gflops - ENQ&PROG&DEST      6.70490 :  111949.236353 gflops - ENQ      0.00009 - DEST      0.00007
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fded0a00000) and it is discarding it!
[****] TIME(s)      6.64373 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  112979.994900 gflops - ENQ&PROG&DEST      6.64388 :  112977.490061 gflops - ENQ      0.00009 - DEST      0.00006
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe11aa00000) and it is discarding it!
W@00002 GPU[2:hip(1)] still OWNS the master memory copy for data 0 (0x7fe138a00000) and it is discarding it!
[****] TIME(s)      6.71053 : dpotrf	PxQxg=   2 2   2 NB= 2048 N=  131072 :  111855.299517 gflops - ENQ&PROG&DEST      6.71068 :  111852.904738 gflops - ENQ      0.00009 - DEST      0.00005

This warning is overcautious now that we have GPU-only copies created from the network. Ideally we would find a way to discriminate between real leaks from the application and these temporaries being reclaimed.

@bosilca
Contributor Author

bosilca commented Mar 28, 2025

The original associated with these device-owned copies should not have a valid dc?
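
A minimal sketch of that discriminator, with stand-in types and a plain fprintf in place of the real warning macro (not the actual PaRSEC API): skip the warning when the data has no backing dc, i.e. it is one of the network-created temporaries.

/* Illustrative sketch only: warn about a discarded master copy only when the
 * data is backed by a data collection. */
#include <stddef.h>
#include <stdio.h>

typedef struct dc_s data_collection_t;          /* opaque stand-in */
typedef struct data_s {
    data_collection_t *dc;                      /* NULL for network temporaries */
    int   owner_device;
    void *master_ptr;
} data_t;

static void warn_if_real_leak(const data_t *d, int device)
{
    if (d->owner_device != device) return;      /* this device is not the owner */
    if (NULL == d->dc) return;                  /* reclaimed temporary: stay quiet */
    fprintf(stderr, "W: GPU[%d] still OWNS the master memory copy for data %p "
                    "and it is discarding it!\n", device, d->master_ptr);
}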

@devreal
Contributor

devreal commented Jul 14, 2025

Here is what I think happens in the stress benchmark:

  • We allocate a new C tile, which has a host and device copy.
  • At the end of the GEMM task, the device copy gets passed through the reshape code and a reference is added to the device copy because that is the copy that is captured in the parsec_dep_data_description_t:
    data.data   = this_task->data._f_A.data_out;
  • We release the host copy at the end of release_deps_of_stress_GEMM. However, the host copy only has a single reference (the data), so the host copy is detached from the parsec_data_t. The next time we use that parsec_data_t (in the next GEMM) we are missing the host copy.

I don't understand the reshape code, and I was hoping never to have to touch it. I suspect that the reshape code was not designed with GPUs in mind, but I could be wrong. I will need some help digging through this and figuring out:

  1. Whether capturing the device copy in reshape is correct.
  2. How we can make sure that the host copy is not accidentally released (see the sketch below).
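
A minimal sketch of one answer to question 2, with stand-in types and reference counting (not the actual PaRSEC or reshape API): keep an extra reference on the host copy while the device copy is the one captured in the dep description, so release_deps cannot drop the host copy's last reference and detach it.

/* Illustrative sketch only: pin the host copy when the device copy is the one
 * handed to the dependency machinery. */
#include <stddef.h>

typedef struct data_s data_t;
typedef struct copy_s { int ref_count; data_t *original; } copy_t;
struct data_s { copy_t *host_copy; copy_t *gpu_copy; };

static void copy_retain(copy_t *c) { c->ref_count++; }

static void copy_release(copy_t *c)
{
    if (0 == --c->ref_count && NULL != c->original) {
        /* last reference gone: detach this copy from its data_t */
        if (c->original->host_copy == c) c->original->host_copy = NULL;
        if (c->original->gpu_copy  == c) c->original->gpu_copy  = NULL;
    }
}

/* Capture the device copy for data.data, but keep the host copy alive. */
static copy_t *capture_output(data_t *d)
{
    copy_retain(d->gpu_copy);    /* reference held by the dep description */
    copy_retain(d->host_copy);   /* pin: survives the release_deps path */
    return d->gpu_copy;
}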
