Multi server multi gpu #1367
Force-pushed from 5f40515 to 0905ce4.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1367      +/-   ##
==========================================
- Coverage   86.96%   85.34%   -1.62%
==========================================
  Files          26       27       +1
  Lines        3712     3877     +165
==========================================
+ Hits         3228     3309      +81
- Misses        484      568      +84
Force-pushed from 33ef29e to b9c119c.
Can you confirm which SLURM script you used to check this, so I can match it?
bw4sz left a comment
I want to approve this. @jveitchmichaelis, any objections? I have spoken to Comet and they agree that the lack of a multi-node GPU utilization graph is probably on their end and not anything wrong here.
Remove references to torchrun; we can use srun alone.
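For context, here is a minimal Python sketch of why srun alone can be enough: srun exports `SLURM_PROCID`, `SLURM_LOCALID` and `SLURM_NTASKS` to every task it launches, so a script can read its rank layout from the environment without a separate launcher. The names `slurm_task_info` and `SlurmTask` are hypothetical and are not helpers from this PR; the example values are illustrative only.

```python
import os
from typing import Mapping, NamedTuple


class SlurmTask(NamedTuple):
    """Rank layout of one task launched by srun (hypothetical helper type)."""

    global_rank: int
    local_rank: int
    world_size: int


def slurm_task_info(env: Mapping[str, str] = os.environ) -> SlurmTask:
    """Read the per-task variables that srun sets in each task's environment.

    Falls back to a single-process layout when the variables are absent,
    for example when the script runs outside a Slurm allocation.
    """
    return SlurmTask(
        global_rank=int(env.get("SLURM_PROCID", "0")),
        local_rank=int(env.get("SLURM_LOCALID", "0")),
        world_size=int(env.get("SLURM_NTASKS", "1")),
    )


# Illustrative values: the third task of a job with four tasks in total.
info = slurm_task_info(
    {"SLURM_PROCID": "2", "SLURM_LOCALID": "0", "SLURM_NTASKS": "4"}
)
print(info)
```

Passing the environment as an argument keeps the function easy to test outside a cluster; inside a job it can be called with no arguments.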
To do: @henrykironde to compare this with #1304 and decide whether both are needed. We still want to get this one done, because it is broad and leaving it open could make rebasing harder.
… evaluate, and predict
Adds reproducible Slurm helpers for multinode and large-tile prediction workflows.
Docs for multi-GPU and multi-node workflows.
Force-pushed from b9c119c to cc12a58.
    return dist.get_rank() == 0


def should_sync(trainer: Any | None = None) -> bool:
Is this required? I thought torchmetrics handles syncing?
We do not need this for tensor metrics; TorchMetrics already syncs those across ranks. For non-tensor results (pandas DataFrames), we handle distribution explicitly. Without gathering predictions across ranks, each GPU keeps its own DataFrame, which led to duplicated or inconsistent outputs.
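A minimal sketch of the gathering step described above, assuming `torch.distributed` is the backend. `is_distributed` and `gather_records` are hypothetical names, not the helpers added in this PR, and plain dict records stand in for DataFrame rows (for example the output of `df.to_dict("records")`) so the example runs even where pandas or torch is not installed.

```python
from typing import Any

try:
    import torch.distributed as dist
except ImportError:  # torch not installed: behave as a single process
    dist = None


def is_distributed() -> bool:
    """True only when a torch.distributed process group is up and running."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def gather_records(local_records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the records from every rank, concatenated in rank order.

    Outside a distributed run this returns a copy of the local records.
    """
    if not is_distributed():
        return list(local_records)
    # all_gather_object pickles arbitrary Python objects, so it works for
    # non-tensor results that tensor collectives cannot handle.
    gathered: list[Any] = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_records)
    return [row for part in gathered for row in part]


# Single-process usage: the local predictions come back unchanged.
local = [{"image_path": "tile_0.tif", "score": 0.9}]
merged = gather_records(local)
```

After the gather every rank holds the full set of rows, so writing results from rank 0 only avoids duplicated output files. If the sampler pads the dataset so that ranks get equal-sized shards, the merged rows may still contain repeats that need to be dropped.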
@jveitchmichaelis - can you take a look at this response? Either merge if @henrykironde's response clears things up, or the two of you get together to figure it out, so we can get this one merged. Thanks!