This document provides instructions for end-to-end training of the Qwen3 model using the ChatLearn, PyTorch FSDP, and vLLM frameworks.
- Docker Image Preparation
We recommend running the following example in PAI DSW/DLC. You need to use the following image to launch the instance.
```
dsw-registry.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312
```
You can use a VPC address to accelerate image pulling; the image address should be adjusted based on the current region. For example, to launch a DSW instance in Shanghai, you can use the following image:
```
dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312
```
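The VPC address differs from the public one only by a `-vpc` suffix on the registry prefix, so it can be derived mechanically. A minimal sketch (using the `cn-shanghai` example above; substitute your own region):

```shell
# Derive the VPC registry address from the public one by inserting
# "-vpc" into the registry prefix (region shown: cn-shanghai).
IMAGE="dsw-registry.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312"
VPC_IMAGE="${IMAGE/dsw-registry./dsw-registry-vpc.}"
echo "$VPC_IMAGE"
```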
- Code Preparation
```bash
git clone https://github.com/alibaba/ChatLearn.git && cd ChatLearn
```
We take MATH-lighteval as an example.
```bash
# download dataset
mkdir -p dataset
modelscope download --dataset AI-ModelScope/MATH-lighteval --local_dir dataset/MATH-lighteval
# preprocess dataset
python chatlearn/data/data_preprocess/math_lighteval.py --input_dir dataset/MATH-lighteval --local_dir dataset/MATH-lighteval
```
You can run the following command to start training:
Run these commands on a server with 8 GPUs:
```bash
# download model weight
modelscope download --model Qwen/Qwen3-8B --local_dir pretrained_models/Qwen3-8B
bash scripts/fsdp_vllm/train_fsdp_vllm_qwen3_8b_grpo.sh
```
If you want to use Wandb to log the training process, set your API key:
```bash
export WANDB_API_KEY="Your-Wandb-api-key"
```
Then change the configuration to:
```
runtime_args.log_args_dict.enable_wandb=True
runtime_args.log_args_dict.wandb_project="Your-Wandb-Project-Name"
```
Saving FSDP models is time-consuming, so ChatLearn provides an offline model conversion feature that converts FSDP-sharded checkpoints back to HuggingFace format. The script is as follows:
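The dotted keys above imply a nested configuration. Assuming a YAML layout (this nesting is inferred from the key paths, not confirmed by this document), the equivalent config fragment would look like:

```yaml
# Hypothetical YAML equivalent of the dotted override keys above
runtime_args:
  log_args_dict:
    enable_wandb: True
    wandb_project: "Your-Wandb-Project-Name"
```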
```bash
export CHATLEARN=$(pwd)
python chatlearn/offline_ckpt_converter.py \
    --hf_dir ${CHATLEARN}/Qwen3-8B/ \
    --ckpt_dir ${CHATLEARN}/output/qwen3-grpo-8b/save_model/policy_trainer \
    --save_dir ${CHATLEARN}/output/qwen3-grpo-8b/save_model/huggingface/ \
    --iter 200 \
    --groupgemm 0
```
If you are training an MoE model with GroupGEMM, make sure to set:
```
--groupgemm 1
```
This script converts the final FSDP-sharded model after training back into a HuggingFace model and saves it to "${CHATLEARN}/output/qwen3-grpo-8b/save_model/huggingface/".
- Set models.policy_trainer.packing=True and configure models.policy_trainer.max_token_in_packing to the maximum token count that fits in GPU memory.
- For the Qwen3-MoE model, enable models.policy_trainer.groupgemm=True to activate the GroupGEMM patch, which improves MoE layer training speed.
- If model initialization runs into memory pressure, enable models.policy_trainer.meta_init=True to mitigate it; this may add extra initialization time.
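Packing groups variable-length samples into batches bounded by a token budget, so GPU memory use depends on `max_token_in_packing` rather than on batch size times max length. A minimal greedy-packing sketch, for illustration only (this is not ChatLearn's actual implementation; the budget parameter mirrors the `max_token_in_packing` option above):

```python
def pack_greedy(lengths, max_token_in_packing):
    """Greedily pack sequence lengths into bins whose total token
    count never exceeds max_token_in_packing (illustrative sketch)."""
    bins = []  # each bin holds the lengths of the sequences packed together
    for n in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + n <= max_token_in_packing:
                b.append(n)  # fits in an existing bin
                break
        else:
            bins.append([n])  # open a new bin
    return bins

# Example: six responses packed under a 1024-token budget
packed = pack_greedy([900, 500, 400, 300, 200, 100], 1024)
```

Each bin then becomes one forward pass, which is why lowering `max_token_in_packing` is the first lever when packing is enabled and memory runs short.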
If you see the following Ray error:
```
ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 10 seconds. Otherwise, this may indicate that the execution is hanging.
```
check the model input parameters: if models.policy.tensor_model_parallel_size is not 1, set models.policy.enforce_eager=True.
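If the execution is genuinely long-running rather than hanging, the timeout named in the error message can also be raised through its environment variable before launching training (the value 300 below is an arbitrary example, not a recommendation from this document):

```shell
# Raise Ray's Compiled Graph timeout from its 10-second default
# (300 seconds is an arbitrary example value).
export RAY_CGRAPH_get_timeout=300
```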
- If models.policy_trainer.packing=True, try reducing models.policy_trainer.max_token_in_packing.
- If models.policy_trainer.packing=False, decrease runtime_args.train_micro_batch_size.
- If OOM persists even with runtime_args.train_micro_batch_size=1, or when models.policy_trainer.max_token_in_packing is smaller than the generation length, increase models.policy_trainer.ulysses_sequence_parallel_size (recommended: a power of 2, not exceeding the number of GPUs per node).
- Consider scaling up the number of GPUs: because FSDP shards parameters and optimizer states across all ranks, per-GPU memory consumption decreases roughly in proportion to the total GPU count.
- Increase models.policy.gpu_memory_utilization (recommended: no higher than 0.95).