Switch from Colossal Launcher to torchrun #1

@KastanDay

Description

🐛 Describe the bug

Solves two key problems:

  • Too many WandB loggers per node: switch from one logger per GPU to one per node (since each process can see all 4 GPUs).
  • Makes launching easier: a single loop over the nodes in the Slurm job.
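
The one-logger-per-node point could be enforced in the launcher itself. A minimal sketch, assuming torchrun-style `LOCAL_RANK` numbering and wandb's `WANDB_MODE=disabled` switch; the default value of 3 is only a demo value for running this standalone:

```shell
# Sketch: keep one active WandB logger per node by disabling wandb on
# every local rank except 0. torchrun exports LOCAL_RANK per worker;
# the fallback of 3 is an assumption so the snippet runs standalone.
LOCAL_RANK="${LOCAL_RANK:-3}"
if [ "$LOCAL_RANK" -ne 0 ]; then
    # WANDB_MODE=disabled turns wandb logging into a no-op in the training script
    export WANDB_MODE=disabled
fi
```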

Refactor this line in the Slurm launcher:

ssh "$local_node_hostname" \
    "export DATA=$DATA; conda activate $CONDA_ENV_NAME; python $TRAIN_FILEPATH --config $CONFIG_FILEPATH --host $MAIN_HOST --port $MAIN_HOST_PORT --world_size $WORLD_SIZE --rank $localrank" &
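
With torchrun, the per-GPU python processes are spawned locally on each node, so the Slurm script needs only one command per node. A minimal sketch of the refactored loop, reusing the launcher's variable names where they exist; `NODE_LIST`, `NNODES`, the placeholder values, and the dry-run `echo` are assumptions for illustration:

```shell
# Sketch: one torchrun invocation per node replaces the per-GPU ssh loop.
# NODE_LIST/NNODES and the placeholder values are assumptions; in the real
# launcher they would come from Slurm (e.g. scontrol show hostnames).
# The command is echoed rather than ssh'd so the sketch runs anywhere.
NODE_LIST="node0 node1"
NNODES=2
MAIN_HOST=node0
MAIN_HOST_PORT=29500
TRAIN_FILEPATH=train.py
CONFIG_FILEPATH=config.py
node_rank=0
for node in $NODE_LIST; do
    echo "ssh $node torchrun --nnodes $NNODES --nproc_per_node 4" \
         "--node_rank $node_rank --master_addr $MAIN_HOST" \
         "--master_port $MAIN_HOST_PORT $TRAIN_FILEPATH --config $CONFIG_FILEPATH"
    node_rank=$((node_rank + 1))
done
```

torchrun sets `RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` for each worker, so the training script would no longer need its `--host/--port/--world_size/--rank` flags (assuming it is updated to read those environment variables).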

Environment

No response
