This guide provides an educational overview of how to deploy this project across multiple machines, how to scale the model, and how to estimate the required GPU resources.
This project uses Distributed Data-Parallel (DDP) training, which is a common strategy for training large models. Here’s how it works:
- Data Parallelism: The training data is split into smaller chunks, and each chunk is processed by a different GPU.
- Model Replication: A complete copy of the model is loaded onto each GPU.
- Gradient Synchronization: During the backward pass, each GPU computes the gradients for its chunk of data. These gradients are then averaged across all GPUs using a process called All-Reduce. This ensures that all models are updated with the same gradients, keeping them synchronized.
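To make the All-Reduce step concrete, here is a toy sketch of the gradient averaging it performs, using plain Python lists in place of real tensors (the four gradient vectors are made-up values, one per hypothetical GPU):

```python
# Toy illustration of the averaging that All-Reduce performs, assuming
# 4 GPUs that each computed gradients on a different chunk of data.
grads_per_gpu = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

n = len(grads_per_gpu)
averaged = [sum(g[i] for g in grads_per_gpu) / n
            for i in range(len(grads_per_gpu[0]))]

# After All-Reduce, every GPU holds the same averaged gradient vector,
# so every model replica applies an identical update and stays in sync.
print(averaged)  # [4.0, 5.0]
```

In real DDP this averaging happens inside the backward pass, overlapped with gradient computation, but the arithmetic is exactly this.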
torchrun: This is a utility provided by PyTorch that simplifies the process of launching distributed training jobs. It automatically sets up the necessary environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`, etc.) that are used by `torch.distributed` to coordinate the processes.
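A minimal sketch of how a training script consumes those variables — the variable names are the standard ones torchrun sets, but the fallback values here are illustrative (a 2-node, 4-GPU-per-node job):

```python
import os

# torchrun exports these for every process it launches; the defaults below
# are only illustrative stand-ins for when torchrun did not set them.
rank = int(os.environ.get("RANK", "5"))              # global process index
world_size = int(os.environ.get("WORLD_SIZE", "8"))  # total number of processes
local_rank = int(os.environ.get("LOCAL_RANK", "1"))  # GPU index on this machine

# torch.distributed.init_process_group("nccl") reads RANK and WORLD_SIZE from
# the environment itself; scripts typically only use LOCAL_RANK directly,
# to select the device for this process:
device = f"cuda:{local_rank}"
```

This is why the same script works unchanged on one node or many: only the environment changes.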
The README.md provides commands for running on a single machine with multiple GPUs. To run on multiple machines, you need to extend this to a multi-node setup.
- Networking: Your machines must be able to communicate with each other over the network. They should be on the same subnet, or you need to configure your firewall to allow traffic on a specific port.
- Shared Storage (Optional but Recommended): For simplicity, it's best to have a shared network file system (like NFS) where your dataset and code are stored. This ensures that all nodes have access to the same files. If you don't have a shared file system, you'll need to copy the code and data to each machine.
- SSH Access: You'll need to be able to SSH into each machine to launch the training script.
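Before launching anything, it is worth verifying that the worker machines can actually reach the master's rendezvous port. Here is a small hypothetical helper (not part of the project) that checks TCP reachability with only the standard library:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False

# e.g., from a worker node, before launching the job:
# port_reachable("machine-1", 29500)
```

If this returns False from a worker, fix networking (firewall, subnet, hostname resolution) before debugging torchrun itself.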
You will use torchrun on each machine. You need to designate one machine as the "master" node, which will be responsible for coordinating the other nodes.
Let's say you have two machines, machine-1 (master) and machine-2.
On machine-1 (Master Node):
```bash
torchrun --nproc-per-node=4 \
    --nnodes=2 \
    --node-rank=0 \
    --rdzv-id=123 \
    --rdzv-backend=c10d \
    --rdzv-endpoint="machine-1:29500" \
    pretrain.py \
    arch=trm \
    data_paths="[data/arc1concept-aug-1000]" \
    arch.L_layers=2 \
    arch.H_cycles=3 arch.L_cycles=4 \
    +run_name="my-distributed-run" ema=True
```

On machine-2 (Worker Node):
```bash
torchrun --nproc-per-node=4 \
    --nnodes=2 \
    --node-rank=1 \
    --rdzv-id=123 \
    --rdzv-backend=c10d \
    --rdzv-endpoint="machine-1:29500" \
    pretrain.py \
    arch=trm \
    data_paths="[data/arc1concept-aug-1000]" \
    arch.L_layers=2 \
    arch.H_cycles=3 arch.L_cycles=4 \
    +run_name="my-distributed-run" ema=True
```

- `--nproc-per-node`: The number of GPUs to use on each machine.
- `--nnodes`: The total number of machines (nodes) participating in the training.
- `--node-rank`: A unique ID for each machine, from 0 to `nnodes - 1`. The master node should always have rank 0.
- `--rdzv-id`: A unique ID for the job. This should be the same on all nodes.
- `--rdzv-backend`: The rendezvous backend. `c10d` is the standard TCP-based backend.
- `--rdzv-endpoint`: The address and port of the master node. All nodes will connect to this address to coordinate.
The size of the model and the batch size are the primary factors that determine the GPU memory you'll need.
- `hidden_size`: The embedding dimension. The attention and feed-forward weight matrices both scale quadratically with this value, so it dominates parameter memory.
- `num_heads`: The number of attention heads. More heads increase parallelism but also memory usage.
- `L_layers`: The number of "low-level" transformer layers. This is a linear scaling factor.
- `H_cycles`, `L_cycles`: The number of recursive cycles. These don't significantly change the model's memory footprint, but they increase the computational cost (and thus the training time).
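The quadratic scaling in `hidden_size` can be seen from a rough per-layer parameter count. The formula below is a generic approximation for a standard transformer layer (4 attention projections plus a 2-matrix feed-forward block with a hypothetical width multiplier of 4), not this project's exact architecture:

```python
def layer_params(hidden_size: int, ffn_mult: int = 4) -> int:
    """Rough parameter count of one generic transformer layer."""
    attn = 4 * hidden_size * hidden_size              # Q, K, V, output projections
    ffn = 2 * hidden_size * (ffn_mult * hidden_size)  # up- and down-projections
    return attn + ffn

# Doubling hidden_size roughly quadruples per-layer parameters:
print(layer_params(512), layer_params(1024))  # 3145728 12582912
```

Every term is a product of two factors proportional to `hidden_size`, which is why doubling it quadruples the count.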
- Baseline: The README.md mentions that the ARC-AGI-1 experiment runs on 4 H100 GPUs (80 GB of VRAM each). This gives you a starting point.
- Doubling `hidden_size`: Both the attention and feed-forward weight matrices scale quadratically with `hidden_size`, so doubling it roughly quadruples parameter memory, while activation memory roughly doubles. Overall, expect memory usage to increase by about 3-4x.
- Doubling `L_layers`: If you double the number of layers, you will roughly double the memory required for the model's parameters and activations.
- Batch Size: The `global_batch_size` in `config/cfg_pretrain.yaml` is the total batch size across all GPUs. The per-GPU batch size is `global_batch_size / WORLD_SIZE`, where `WORLD_SIZE` is the total number of GPUs (`nproc-per-node * nnodes`). If you run out of memory, the first thing to try is reducing the `global_batch_size`.
- Mixed-Precision Training: The `forward_dtype: bfloat16` in `config/arch/trm.yaml` indicates that the model uses 16-bit precision for the forward pass. This significantly reduces memory usage compared to 32-bit floating point (float32).
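The batch-size arithmetic for the two-node example above works out as follows. `global_batch_size` is a real config key, but the value 768 here is purely illustrative:

```python
# Hypothetical numbers matching the two-node example: 2 machines x 4 GPUs.
nproc_per_node = 4
nnodes = 2
world_size = nproc_per_node * nnodes      # 8 GPUs in total

global_batch_size = 768                   # illustrative value, not the project default
per_gpu_batch = global_batch_size // world_size
print(world_size, per_gpu_batch)  # 8 96
```

So each GPU processes 96 examples per step; halving `global_batch_size` would halve that per-GPU load (and its activation memory) without touching any other setting.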
- Start Small: Begin with the default configuration on a single machine.
- Monitor GPU Usage: Use `nvidia-smi` to monitor the VRAM usage.
- Extrapolate:
  - If you want to train a larger model, make one change at a time (e.g., increase `hidden_size`) and observe the new memory usage.
  - If a single GPU is not enough, add more GPUs; the total memory available is the memory per GPU multiplied by the number of GPUs. Note that with DDP each GPU still holds a full copy of the model, so extra GPUs reduce the per-GPU batch size, not the per-GPU model memory.
Running experiments on multiple machines can be expensive. Here are some best practices:
- Use `wandb`: The project is already integrated with Weights & Biases (`wandb`). Use it to log your experiments; this will help you keep track of your results and compare different runs.
- Start with Small-Scale Experiments: Before launching a full multi-day training run, test your setup with a small dataset or for a small number of epochs. This will help you catch any bugs or configuration issues early.
- Use Checkpointing: The `checkpoint_every_eval` option in `config/cfg_pretrain.yaml` is very useful. It saves the model periodically, so you can resume training if it gets interrupted.
- Automate Your Workflow: For launching jobs on multiple machines, consider using a tool like `tmux` or `screen` to manage your SSH sessions, or a more advanced cluster management tool if you have one available.
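One simple way to automate the two-node launch is a small script that builds each node's torchrun command and fires it over SSH. This is a hypothetical sketch, assuming passwordless SSH and identical code/data paths on every node (or shared storage, as recommended above); the flag values mirror the two-node example:

```python
import subprocess

# Hypothetical host list for the two-node example above.
hosts = ["machine-1", "machine-2"]

def launch_cmd(node_rank: int) -> str:
    """Build the torchrun command for one node (flags from the example above)."""
    return (
        "torchrun --nproc-per-node=4 --nnodes=2 "
        f"--node-rank={node_rank} --rdzv-id=123 --rdzv-backend=c10d "
        '--rdzv-endpoint="machine-1:29500" pretrain.py arch=trm'
    )

# Fire-and-forget launch on each node (uncomment to actually run):
# for rank, host in enumerate(hosts):
#     subprocess.Popen(["ssh", host, launch_cmd(rank)])
```

A real setup would also change into the repo directory and activate the environment inside the SSH command, and would capture each node's output to a log file.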
By following this guide, you should be well-equipped to deploy this project on your machines and start experimenting with different model sizes and configurations. Good luck with your educational project!