Computer Vision Center | VLR Group
Contact: tockier@cvc.uab.cat
Massive-scale VLM pre-training and finetuning on HPC environments. It is specifically designed and tested for Marenostrum 5 and JUPITER. It works similarly to torchtitan, relying only on native torch code for the distributed implementation. It is compatible with the HF state-dict format and loads weights from an HF snapshot directory.
See SCALABILITY.md and USAGE.md for more details.
- Supported Architectures: Qwen3.5, Qwen3-VL and Qwen3 (text).
- 2D Parallelism: FSDP/DDP (Single & Multi-node) and Tensor Parallelism (TP) support. Tested scaling up to 256 GPUs.
- Optimized Dataloading: NVIDIA Energon integration with offline data packing for high-throughput data ingestion.
- State Management: Fully distributed model, optimizer, and scheduler checkpointing.
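To illustrate how a 2D data-parallel x tensor-parallel layout partitions ranks, the following plain-Python sketch computes which groups a global rank belongs to. It assumes a row-major (dp, tp) mesh in which TP ranks are contiguous, which is the layout a 2D mesh of that shape typically yields; the function names are illustrative and not taken from this repository.

```python
def mesh_coords(rank: int, world_size: int, tp_size: int) -> tuple[int, int]:
    """Return (dp_index, tp_index) of a global rank in a row-major (dp, tp) mesh."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    return divmod(rank, tp_size)


def tp_group(rank: int, world_size: int, tp_size: int) -> list[int]:
    """Ranks that shard the same layers with this rank (contiguous block)."""
    dp_index, _ = mesh_coords(rank, world_size, tp_size)
    start = dp_index * tp_size
    return list(range(start, start + tp_size))


def dp_group(rank: int, world_size: int, tp_size: int) -> list[int]:
    """Ranks holding the same TP shard, across which FSDP/DDP operates."""
    _, tp_index = mesh_coords(rank, world_size, tp_size)
    return list(range(tp_index, world_size, tp_size))
```

For example, with 8 GPUs and a TP size of 4, rank 5 shares its TP group with ranks 4 to 7 and its data-parallel group with rank 1. Keeping TP ranks contiguous tends to keep the heaviest collective traffic inside a node.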
We use the same environment on MN5, JUPITER, and our local clusters.
Attention in Qwen3.5 relies on the torch.nn.attention.varlen.varlen_attn implementation in torch=2.11.0 (see here). We do not require flash_attn, since it is difficult to install on JUPITER (an ARM system).
To use torch=2.10.0 you MUST install flash_attn; see here for the CUDA kernels.
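The version rule above can be expressed as a small check. This is only a sketch of the decision logic in plain Python: the function names and the returned labels are hypothetical, and the repository may select its attention backend differently.

```python
import importlib.util


def parse_version(version: str) -> tuple[int, int]:
    """Extract (major, minor) from strings such as '2.11.0' or '2.10.0+cu128'."""
    major, minor = version.split("+")[0].split(".")[:2]
    return int(major), int(minor)


def flash_attn_installed() -> bool:
    """True if the flash_attn package can be imported in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def pick_attention_backend(torch_version: str, has_flash_attn: bool) -> str:
    """Prefer the native varlen kernel; fall back to flash_attn on older torch."""
    if parse_version(torch_version) >= (2, 11):
        return "varlen_attn"
    if has_flash_attn:
        return "flash_attn"
    raise RuntimeError(
        f"torch {torch_version} has no native varlen attention; install flash_attn"
    )
```

In practice one would call `pick_attention_backend(torch.__version__, flash_attn_installed())` once at startup and fail early if neither option is available.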
Support for ROCm systems (LUMI) is work in progress.
- torch=2.11.0 ideally; also works with torch=2.10.0 + flash_attn
- transformers=5.3.0

- torch=2.11.0
- flash-linear-attention
- causal-conv1d
- transformers=5.6.0
Datasets are expected to be in CrudeWebdataset format. We use https://github.com/NVIDIA/Megatron-Energon to handle the raw data and tokenize it on the fly. This is an asynchronous process that does not impact model performance. Online data packing is used by default. Metadatasets (multiple sources) are supported.
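As a rough picture of what sequence packing does, the sketch below packs tokenized samples into fixed-size buffers with a greedy first-fit rule and derives the cumulative sequence boundaries that variable-length attention kernels usually take as input. The first-fit strategy and the function names are assumptions made for illustration; they are not the packing logic used by Energon or by this repository.

```python
from itertools import accumulate


def pack_first_fit(lengths: list[int], max_len: int) -> list[list[int]]:
    """Assign sample indices to packs so each pack holds at most max_len tokens."""
    packs: list[list[int]] = []
    free: list[int] = []  # remaining token capacity of each pack
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"sample {idx} has {n} tokens, more than {max_len}")
        for p, room in enumerate(free):
            if n <= room:  # first pack with enough room wins
                packs[p].append(idx)
                free[p] -= n
                break
        else:  # no existing pack fits, so open a new one
            packs.append([idx])
            free.append(max_len - n)
    return packs


def cu_seqlens(lengths: list[int], pack: list[int]) -> list[int]:
    """Cumulative boundaries of the samples in one pack, starting at 0."""
    return [0, *accumulate(lengths[i] for i in pack)]
```

With sample lengths `[5, 3, 4, 2]` and `max_len=8`, this produces two packs, `[0, 1]` and `[2, 3]`, whose boundaries are `[0, 5, 8]` and `[0, 4, 6]`. The boundaries keep attention from leaking between samples that share a buffer.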
Use utils/down.py on a login node to pre-download model weights and tokenizers to a shared filesystem. The model's architecture configuration is read from the downloaded files.
Loading Mechanism: During training, models are instantiated directly from these local paths. The architecture is initialized purely in PyTorch, and the offline weights are mapped and loaded directly into the native state dictionary.
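The mapping step can be pictured as a key-renaming pass followed by a consistency check. The sketch below uses plain dictionaries in place of tensors, and the prefix rule in the usage note is hypothetical; if the native module names already match the HF checkpoint keys, the rule list would simply be empty.

```python
def remap_keys(hf_state: dict, rules: list[tuple[str, str]]) -> dict:
    """Rename checkpoint keys by applying the first matching prefix rule."""
    out = {}
    for key, value in hf_state.items():
        for old, new in rules:
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out


def diff_keys(expected, loaded) -> tuple[list[str], list[str]]:
    """Return (missing, unexpected) key names, as a strict load would report."""
    expected, loaded = set(expected), set(loaded)
    return sorted(expected - loaded), sorted(loaded - expected)
```

For instance, `remap_keys(state, [("model.", "backbone.")])` would rename `model.embed_tokens.weight` to `backbone.embed_tokens.weight` and leave other keys untouched. Running `diff_keys` against the native model's parameter names before loading makes mismatches visible instead of silently leaving weights uninitialized.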
- Ensure your datasets are formatted as NVIDIA Energon webdatasets.
- Configure your hyperparameters and environment variables in the configs/ directory.
- Launch the distributed training job using the environment-specific script:
# For Marenostrum 5
./scripts/mn5_finetune.sh --config [toml file]
# For JUPITER
./scripts/jup_finetune.sh --config [toml file]

In configs/ you can find several examples. Look into the jup and mn5 directories to see the configs for the respective HPC systems.
Note: The scripts/ directory contains both direct CLI launch scripts and SLURM batch scripts.
The codebase demonstrates linear scaling up to 256 GPUs using FSDP and Tensor Parallelism. For a detailed breakdown of throughput, GPU efficiency, and scaling characteristics, please refer to SCALABILITY.md.
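For readers comparing their own runs, "linear scaling" can be quantified as the measured speedup divided by the ideal speedup. The helper below is a minimal sketch of that ratio; the numbers in the usage note are made up for illustration and are not measurements from this codebase.

```python
def scaling_efficiency(
    base_gpus: int, base_throughput: float, gpus: int, throughput: float
) -> float:
    """Measured speedup over the baseline divided by the ideal (linear) speedup."""
    if base_gpus <= 0 or gpus <= 0 or base_throughput <= 0:
        raise ValueError("GPU counts and baseline throughput must be positive")
    ideal_speedup = gpus / base_gpus
    measured_speedup = throughput / base_throughput
    return measured_speedup / ideal_speedup
```

A value of 1.0 means perfectly linear scaling. As a hypothetical example, going from 8 GPUs at 1000 tokens/s to 16 GPUs at 1800 tokens/s gives an efficiency of 0.9.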
- The entire workflow training -> checkpoints -> eval/usage needs a lot of work.
- Static shape compilation (torch.compile with fullgraph=True) is pending.
- A better data packing implementation is needed.