forked from hpcaitech/ColossalAI-Benchmark
Open
Description
🐛 Describe the bug
Solves two key problems:
- Too many WandB loggers per node. Switch from one logger per GPU to one per node (since each process can see all 4 GPUs).
- Makes launching easier: a single loop over the nodes in the Slurm job.
Refactor this line in the Slurm launcher:
ssh "$local_node_hostname" \
"export DATA=$DATA; conda activate $CONDA_ENV_NAME; python $TRAIN_FILEPATH --config $CONFIG_FILEPATH --host $MAIN_HOST --port $MAIN_HOST_PORT --world_size $WORLD_SIZE --rank $localrank" &
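A minimal sketch of what the single loop over nodes could look like, reusing the variables from the line above. The function names `build_launch_cmd` and `launch_all_nodes` and the `DRY_RUN` switch are hypothetical, not part of the existing launcher. It also assumes that the training script accepts one process per node, so that `--rank` is the node index and `WORLD_SIZE` is the number of nodes; that would need to be confirmed against the training code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: launch one process per node instead of one per GPU.
# Assumes DATA, CONDA_ENV_NAME, TRAIN_FILEPATH, CONFIG_FILEPATH, MAIN_HOST,
# MAIN_HOST_PORT and WORLD_SIZE are already set by the surrounding script.

# Build the remote command for one node (same flags as the original line).
build_launch_cmd() {
  local node_rank="$1"
  printf 'export DATA=%s; conda activate %s; python %s --config %s --host %s --port %s --world_size %s --rank %s' \
    "$DATA" "$CONDA_ENV_NAME" "$TRAIN_FILEPATH" "$CONFIG_FILEPATH" \
    "$MAIN_HOST" "$MAIN_HOST_PORT" "$WORLD_SIZE" "$node_rank"
}

# Loop once over the given hostnames; the node index is used as the rank
# (an assumption, see the note above). With DRY_RUN=1 (the default here)
# the ssh commands are only printed, not executed.
launch_all_nodes() {
  local dry_run="${DRY_RUN:-1}"
  local node_rank=0
  local node cmd
  for node in "$@"; do
    cmd="$(build_launch_cmd "$node_rank")"
    if [ "$dry_run" = "1" ]; then
      echo "ssh $node \"$cmd\""
    else
      ssh "$node" "$cmd" &
    fi
    node_rank=$((node_rank + 1))
  done
  # Wait for the background ssh sessions when actually launching.
  if [ "$dry_run" != "1" ]; then
    wait
  fi
}
```

Inside a Slurm job, the hostnames could be passed in with `launch_all_nodes $(scontrol show hostnames "$SLURM_JOB_NODELIST")`. With one process per node, a single WandB logger per process would then give one logger per node.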
Environment
No response