490 | 490 | <div class="card grid grid-cols-4 justify-center items-center"> |
491 | 491 | <img class="shadow rounded-full max-w-full h-auto align-middle border-none" src="./team-images/abdullah.png" width="100px" /> |
492 | 492 | <p class="grid-colspan-3"> |
493 | | - <b>Alumni:</b> Muhammad Abdullah Soyturk |
| 493 | + <b>Alumni:</b> Muhammed Abdullah Soytürk |
494 | 494 | <br> |
495 | 495 | <b>Research Focus:</b> Scalable deep learning. |
496 | 496 | </p> |
@@ -531,6 +531,7 @@ BeyondMoore Software Ecosystem |
531 | 531 | **_Profiling Tools_** |
532 | 532 | * [Snoopie](#SNOOPIE): A Multi-GPU Communication Profiler and Visualiser |
533 | 533 | * [PES AMD vs Intel](#PRECISE-EVENT-SAMPLING): A Precise Event Sampling Benchmark Suite |
| 534 | +* [aCG](#ACG): CPU- and GPU-initiated Communication Strategies for CG Methods |
534 | 535 |
|
535 | 536 | </div> |
536 | 537 |
|
@@ -606,13 +607,32 @@ single-process, multi-threaded, and multi-process codes. More details about the |
606 | 607 | </div> |
607 | 608 |
|
608 | 609 |
|
| 610 | +<div id="ACG" class="h-auto bg-gray-100 rounded-s p-4 border-solid border-1 border-gray-200 flex flex-row justify-start items-start gap-5 transform transition-all hover:shadow-gray-100 hover:shadow-lg shadow-none"> |
| 611 | + <div class="flex flex-col justify-start"> |
| 612 | + <div class="flex flex-row gap-2 justify-start items-center flex-shrink"> |
| 613 | + <img width="32" src="./assets/git.webp" /> |
| 614 | + <a href="https://github.com/ParCoreLab/aCG" class="text-xl font-semibold font-sans visited:text-teal-700">CPU- and GPU-initiated Communication Strategies for CG Methods</a> |
| 615 | + </div> |
| 616 | + <p class="text-lg">This work revisits Conjugate Gradient (CG) parallelization for large-scale multi-GPU systems, addressing challenges from low computational intensity and communication overhead. We develop scalable CG and pipelined CG solvers for NVIDIA and AMD GPUs, employing GPU-aware MPI, NCCL/RCCL, and NVSHMEM for both CPU- and GPU-initiated communication. A monolithic GPU-offloaded variant further enables fully device-driven execution, removing CPU involvement. Optimizations across all designs reduce data transfers and synchronization costs. Evaluations on SuiteSparse matrices and a real finite element application show 8–14% gains over state-of-the-art on single GPUs and 5–15% improvements in strong scaling tests on over 1,000 GPUs. While CPU-driven variants currently benefit from stronger library support, results highlight the promising scalability of GPU-initiated execution for future large-scale systems.</p> |
| 617 | + |
| 618 | + <p> |
| 619 | + <a href="https://github.com/ParCoreLab/aCG" class="text-xl font-semibold font-sans visited:text-teal-700">More details and git repository of the project.</a> |
| 620 | + </p> |
| 621 | + </div> |
| 622 | + <div class="grid h-auto justify-center place-items-center"> |
| 623 | + <img width="400px" src="./assets/CG.png" /> |
| 624 | + </div> |
| 625 | + </div> |
| 626 | + |
609 | 627 | <div id="CPU-FREE-MODEL-COMPILER" class="h-auto bg-gray-100 rounded-s p-4 border-solid border-1 border-gray-200 flex flex-row justify-start items-start gap-5 transform transition-all hover:shadow-gray-100 hover:shadow-lg shadow-none"> |
610 | 628 | <div clas="flex flex-col justify-start"> |
611 | 629 | <div class="flex flex-row gap-2 justify-start items-center flex-shrink"> |
612 | 630 | <img width="32" src="./assets/git.webp" /> |
613 | 631 | <a href="https://github.com/ParCoreLab/" class="text-xl font-semibold font-sans visited:text-teal-700">CPU Free Model Compiler</a> |
614 | 632 | </div> |
615 | | - <p class="text-lg">We're actively crafting a compiler to empower developers to write high-level Python code that compiles into efficient CPU-free device code. This compiler integrates GPU-initiated communication libraries, NVSHMEM for NVIDIA and ROC_SHMEM for AMD, enabling GPU communication directly within Python code. With automatic generation of GPU-initiated communication calls and persistent kernels, we aim to streamline development workflows. Our prototype will be available soon.</p> |
| 633 | + <p class="text-lg">We're actively crafting a compiler to empower developers to write high-level Python code that compiles into efficient CPU-free device code. This compiler integrates GPU-initiated communication libraries, NVSHMEM for NVIDIA and ROC_SHMEM for AMD, enabling GPU communication directly within Python code. With automatic generation of GPU-initiated communication calls and persistent kernels, we aim to streamline development workflows.</p> <p> |
| 634 | + <a href="https://github.com/ParCoreLab/CPU-Free-model" class="text-xl font-semibold font-sans visited:text-teal-700">More details and git repository of the project.</a> |
| 635 | + </p> |
616 | 636 | </div> |
617 | 637 | <div class="grid h-auto justify-center place-items-center"> |
618 | 638 | <img width="300px" src="./assets/dace-compiler.png" /> |
@@ -689,7 +709,7 @@ Graphs</a> <a href="https://docs.google.com/presentation/d/1po87zQeUQb5l12AXB5RM |
689 | 709 | <div class="card text-lg">Tugba Torun, Ameer Taweel, Didem Unat (2024) <a href="https://arxiv.org/pdf/2405.04944">A Sparse Tensor Generator with Efficient Feature Extraction</a>. <span class="italic">Accepted for publication; online release pending</span>. <a class="italic" href="https://arxiv.org/pdf/2405.04944">preprint pdf</a> </div> |
690 | 710 |
|
691 | 711 | <div class="card text-lg"> Javid Baydamirli, Tal Ben Nun, Didem Unat (2024) <a href="https://ieeexplore.ieee.org/abstract/document/10820747">Autonomous Execution for Multi-GPU Systems: |
692 | | -Compiler Support</a>. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. <a class="italic" download href="./assets/preprint-pdfs/P3HPC_____Autonomous_Execution_for_Multi_GPU_Systems__Compiler_Support-2 (1).pdf">preprint pdf</a> |
| 712 | +Compiler Support</a> <a href="https://docs.google.com/presentation/d/1nBsANrcLh0Tnc2qqqDL_-6khqo-Y-_mX5kfJbmRwawE/edit?slide=id.p#slide=id.p">(presentation)</a>. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. <a class="italic" download href="./assets/preprint-pdfs/P3HPC_____Autonomous_Execution_for_Multi_GPU_Systems__Compiler_Support-2 (1).pdf">preprint pdf</a> |
693 | 713 | </div> |
694 | 714 | <div class="card text-lg"> Javid Baydamirli, Tal Ben Nun, Didem Unat (2024) <a href="https://sc24.supercomputing.org/proceedings/workshops/workshop_pages/ws_p3hpc108.html">Autonomous Execution for Multi-GPU Systems: |
695 | 715 | Compiler Support</a> <a href="https://sc24.conference-program.com/presentation/?id=ws_p3hpc108&sess=sess751">(presentation)</a>. In the 2024 International Workshop on Performance, Portability, and Productivity in HPC. <a class="italic" download href="./assets/preprint-pdfs/sc24-workshop-autonomous-execution-for-multi-gpu-systems-compiler-support.pdf">preprint pdf</a> |
|