3 changes: 2 additions & 1 deletion docs/cheaha/hardware.md
@@ -12,7 +12,7 @@ The following hardware summaries may be useful for selecting partitions for work

### Summary

The table below contains a summary of the computational resources available on Cheaha and relevant Quality of Service (QoS) Limits. QoS limits allow us to balance usage and ensure fairness for all researchers using the cluster. QoS limits are not a guarantee of resource availability.
The table below contains a summary of the computational resources available on Cheaha and relevant Quality of Service (QoS) Limits. QoS limits allow us to balance usage and ensure fairness for all researchers using the cluster. QoS limits are not a guarantee of resource availability. Please refer to the [Estimating Compute Resources](../cheaha/slurm/submitting_jobs.md/#estimating-compute-resources) page when requesting resources.

In the table, [Slurm](./slurm/introduction.md) partitions are grouped by shared QoS limits on cores, memory, and GPUs. Node limits are applied to partitions independently. All limits are applied to researchers independently.
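As a minimal sketch of how a partition and resources are requested in practice, the batch script below uses standard `#SBATCH` directives; the partition name and resource values are illustrative only and should be chosen from the table and your own estimates:

```shell
#!/bin/bash
#SBATCH --job-name=example        # illustrative job name
#SBATCH --partition=express       # choose a partition from the table below
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10        # request stays well under the QoS core limit
#SBATCH --mem=8G                  # memory per node for this job
#SBATCH --time=01:00:00           # wall time within the partition's limit

srun hostname
```

Requests that fit within the QoS limits can start sooner; requests at or near the limits may queue behind your own running jobs, as the examples below illustrate.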

Expand All @@ -21,6 +21,7 @@ Examples of how to make use of the table:
- Suppose you submit 30 jobs to the "express" partition, and suppose each job needs 10 cores each. Hypothetically, in order for all of the jobs to start at once, 300 cores would be required. The QoS limit on cores is 264 on the "express" partition, so at most 26 jobs (260 cores) can start at once. The remaining 4 jobs will be held in queue, because starting one more would go beyond the QoS limit (270 > 264).
- Suppose you submit 5 jobs to the "medium" partition and 5 to the "long" partition, each requiring 1 node. Then, 10 total nodes would be needed. In this case, it is possible for all 10 jobs to start at once, because node limits are applied to each partition independently: the 5 jobs on the "medium" partition do not count against the node limit of the "long" partition, and vice versa.
- Suppose you submit 5 jobs to the "amperenodes" partition and 5 to "amperenodes-medium", for a total of 10 A100 GPUs. Additionally, you also submit 4 jobs to the "pascalnodes" partition totaling 8 P100 GPUs. Then 4 of the "gpu: ampere" group jobs can start at once, because the QoS limit is 4 GPUs there. Additionally, all 4 of the "gpu: pascal" group jobs can start at once, because the QoS limit is 8 GPUs there. In this case, the QoS for each group is separate.
- Suppose you submit 6 jobs to the "medium" partition, and suppose each job requests 600 GB of memory. If all of the jobs were to start at once, the total memory required would be 3600 GB total (6 jobs × 600 GB). The QoS limit on memory for the "medium" partition is 3072 GB per user (see the value in parentheses under column "Mem GB/Node (Limit/Person)"). This means at most 5 jobs (5 × 600 GB = 3000 GB) can start at once, and the 6th job will remain queued, because starting it would exceed your memory QoS limit (3600 GB > 3072 GB).
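The arithmetic in the examples above follows one pattern: the number of identical jobs that can start at once is the QoS limit divided (rounding down) by the per-job request. A small sketch, using the illustrative limits and requests from this page:

```python
def max_startable_jobs(qos_limit, per_job_request):
    """How many identical jobs can start at once under a per-user QoS limit."""
    return qos_limit // per_job_request

# "express" partition: 264-core QoS limit, 10 cores per job
print(max_startable_jobs(264, 10))    # 26 jobs (260 cores); the rest queue

# "medium" partition: 3072 GB memory QoS limit, 600 GB per job
print(max_startable_jobs(3072, 600))  # 5 jobs (3000 GB); the 6th queues
```

The same calculation applies per QoS group, since each group's limits are tracked separately.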

{{ read_csv('cheaha/res/hardware_summary_cheaha.csv', keep_default_na=False) }}
<!-- fix headers -->