|
1 | 1 | [](){#ref-software-communication} |
2 | 2 | # Communication Libraries |
3 | 3 |
|
4 | | -CSCS provides common communication libraries optimized for the [Slingshot 11 network on Alps][ref-alps-hsn]. |
| 4 | +Communication libraries, such as MPI and NCCL, are among the building blocks of high-performance scientific and ML workloads. |
| 5 | +Broadly speaking, there are two levels of communication: |
| 6 | + |
| 7 | +* **Intra-node** communication between two processes on the same node. |
| 8 | +* **Inter-node** communication between processes on different nodes, over the [Slingshot 11 network][ref-alps-hsn] that connects nodes on Alps. |
| 9 | + |
| 10 | +To get the best inter-node performance on Alps, communication libraries need to be configured to use the [libfabric][ref-communication-libfabric] library, which has an optimized back end for the Slingshot 11 network. |
| 11 | + |
| 12 | +As such, communication libraries are part of the "base layer" of libraries and tools used by all workloads to fully utilize the hardware on Alps. |
| 13 | +They comprise the *network* layer in the following stack: |
| 14 | + |
| 15 | +* **CPU**: compilers with support for building applications optimized for the CPU architecture on the node. |
| 16 | +* **GPU**: CUDA and ROCm provide compilers and runtime libraries for NVIDIA and AMD GPUs respectively. |
| 17 | +* **Network**: libfabric, MPI, NCCL, and NVSHMEM need to be configured for the Slingshot network. |
| 18 | + |
| 19 | +CSCS provides communication libraries optimized for libfabric and Slingshot in uenv, and guidance on how to create container images that use them. |
| 20 | +This section of the documentation provides advice on how to build and install software to use these libraries, and how to deploy them. |
5 | 21 |
|
6 | 22 | For most scientific applications relying on MPI, [Cray MPICH][ref-communication-cray-mpich] is recommended. |
7 | 23 | [MPICH][ref-communication-mpich] and [OpenMPI][ref-communication-openmpi] may also be used, with limitations. |
8 | 24 | Cray MPICH, MPICH, and OpenMPI make use of [libfabric][ref-communication-libfabric] to interact with the underlying network. |
9 | 25 |
|
10 | | -Most machine learning applications rely on [NCCL][ref-communication-nccl] or [RCCL][ref-communication-rccl] for high-performance implementations of collectives. |
11 | | -NCCL and RCCL have to be configured with a plugin using [libfabric][ref-communication-libfabric] to make full use of the Slingshot network. |
| 26 | +Most machine learning applications rely on [NCCL][ref-communication-nccl] for high-performance implementations of collectives. |
| 27 | +NCCL has to be configured with a plugin using [libfabric][ref-communication-libfabric] to make full use of the Slingshot network. |
12 | 28 |
|
13 | 29 | See the individual pages for each library for information on how to use and best configure the libraries. |
14 | 30 |
|
15 | | -* [Cray MPICH][ref-communication-cray-mpich] |
16 | | -* [MPICH][ref-communication-mpich] |
17 | | -* [OpenMPI][ref-communication-openmpi] |
18 | | -* [NCCL][ref-communication-nccl] |
19 | | -* [RCCL][ref-communication-rccl] |
20 | | -* [libfabric][ref-communication-libfabric] |
| 31 | +<div class="grid cards" markdown> |
| 32 | + |
| 33 | +- __Low Level__ |
| 34 | + |
| 35 | + Learn about the low-level networking library libfabric, and how to use it in uenv and containers. |
| 36 | + |
| 37 | + [:octicons-arrow-right-24: libfabric][ref-communication-libfabric] |
| 38 | + |
| 39 | +</div> |
| 40 | +<div class="grid cards" markdown> |
| 41 | + |
| 42 | +- __MPI__ |
| 43 | + |
| 44 | + Cray MPICH is the most optimized and best-tested MPI implementation on Alps, and is used by uenv. |
| 45 | + |
| 46 | + [:octicons-arrow-right-24: Cray MPICH][ref-communication-cray-mpich] |
| 47 | + |
| 48 | + For compatibility in containers: |
| 49 | + |
| 50 | + [:octicons-arrow-right-24: MPICH][ref-communication-mpich] |
| 51 | + |
| 52 | + OpenMPI can also be built in containers or in uenv: |
| 53 | + |
| 54 | + [:octicons-arrow-right-24: OpenMPI][ref-communication-openmpi] |
| 55 | + |
| 56 | +</div> |
| 57 | +<div class="grid cards" markdown> |
| 58 | + |
| 59 | +- __Machine Learning__ |
| 60 | + |
| 61 | + Communication libraries used by ML tools like Torch, and some simulation codes. |
| 62 | + |
| 63 | + [:octicons-arrow-right-24: NCCL][ref-communication-nccl] |
| 64 | + |
| 65 | + [:octicons-arrow-right-24: NVSHMEM][ref-communication-nvshmem] |
| 66 | + |
| 67 | +</div> |