
End-to-End GRPO Training Tutorial with FSDP

This document provides instructions for end-to-end GRPO training with the ChatLearn framework, PyTorch FSDP, and vLLM, using the Qwen3 model.

Environment Setup

  1. Docker Image Preparation

We recommend running the following example in PAI DSW/DLC. You need to use the following image to launch the instance.

dsw-registry.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312

You can use a VPC address to accelerate image pulling. Adjust the image address to match your region. For example, to launch a DSW instance in Shanghai, use the following image: dsw-registry-vpc.cn-shanghai.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312.
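
As an illustration, the VPC address for another region is formed by swapping the region segment of the registry host. The region ID cn-hangzhou below is only an example; if you pull the image yourself (for instance on an ECS instance), it would look like:

# example only: replace cn-hangzhou with your actual region ID
docker pull dsw-registry-vpc.cn-hangzhou.cr.aliyuncs.com/pai-training-algorithm/chatlearn:torch2.6.0-vllm0.8.5-ubuntu24.04-cuda12.6-py312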

  2. Code Preparation
git clone https://github.com/alibaba/ChatLearn.git && cd ChatLearn

Data Preparation

We use the MATH-lighteval dataset as an example.

# download dataset
mkdir -p dataset
modelscope download --dataset AI-ModelScope/MATH-lighteval --local_dir dataset/MATH-lighteval
# preprocess dataset
python chatlearn/data/data_preprocess/math_lighteval.py --input_dir dataset/MATH-lighteval --local_dir dataset/MATH-lighteval
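
If you want to confirm that the download and preprocessing produced output before starting training, a quick check (the exact file layout depends on the preprocessing script) is:

# optional: inspect the prepared dataset directory
ls -lh dataset/MATH-lighteval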

Training

You can run the following command to start training:

Qwen3-8B

Run the following commands on a server with 8 GPUs:

# download model weight
modelscope download --model Qwen/Qwen3-8B --local_dir pretrained_models/Qwen3-8B
bash scripts/fsdp_vllm/train_fsdp_vllm_qwen3_8b_grpo.sh
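
Training runs for a while, so it can be convenient (this is not part of the original tutorial) to keep a copy of the console output for later inspection:

# optional: keep a persistent log of the run
bash scripts/fsdp_vllm/train_fsdp_vllm_qwen3_8b_grpo.sh 2>&1 | tee qwen3_8b_grpo.log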

Using Wandb

If you want to use Wandb to log the training process, first export your API key:

export WANDB_API_KEY="Your-Wandb-api-key"

Then enable Wandb in the configuration:

runtime_args.log_args_dict.enable_wandb=True
runtime_args.log_args_dict.wandb_project="Your-Wandb-Project-Name"
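
Since the API key is read from the environment at run time, a small pre-flight check (not part of the original tutorial) can catch a missing key before you launch:

# optional: fail fast if the API key is missing before starting a long run
[ -n "${WANDB_API_KEY}" ] || { echo "WANDB_API_KEY is not set"; exit 1; }
bash scripts/fsdp_vllm/train_fsdp_vllm_qwen3_8b_grpo.sh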

Model Conversion

Saving FSDP models is time-consuming, so ChatLearn provides an offline model conversion feature that converts FSDP-sharded checkpoints back to the HuggingFace format. The script is as follows:

export CHATLEARN=$(pwd)
python chatlearn/offline_ckpt_converter.py \
    --hf_dir ${CHATLEARN}/Qwen3-8B/ \
    --ckpt_dir ${CHATLEARN}/output/qwen3-grpo-8b/save_model/policy_trainer \
    --save_dir ${CHATLEARN}/output/qwen3-grpo-8b/save_model/huggingface/ \
    --iter 200 \
    --groupgemm 0

If you are training an MoE model with groupgemm, please make sure to set:

   --groupgemm 1

This script will convert the final FSDP sharded model after training back into a HuggingFace model and save it in the path "${CHATLEARN}/output/qwen3-grpo-8b/save_model/huggingface/".
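
To verify the conversion, a quick sanity check (not part of the conversion script itself) is to load the converted checkpoint with HuggingFace Transformers:

# optional: confirm the converted checkpoint loads as a regular HuggingFace model
python -c "
from transformers import AutoModelForCausalLM
path = 'output/qwen3-grpo-8b/save_model/huggingface/'
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype='auto')
print(type(model).__name__, sum(p.numel() for p in model.parameters()), 'parameters')
"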

FAQ

How to Speed Up PolicyTrainer Training?

  1. Set models.policy_trainer.packing=True and configure models.policy_trainer.max_token_in_packing to the maximum token count that fits GPU memory.

  2. For the Qwen3-MoE model, enable models.policy_trainer.groupgemm=True to activate the GroupGEMM patch, improving MoE layer training speed (an illustrative combination of these settings is shown after this list).
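
As an illustration, the combined overrides might look like the following. The token count is an example value, not a recommendation, and groupgemm only applies to MoE models:

# illustrative speed-up overrides; tune max_token_in_packing to your GPU memory
models.policy_trainer.packing=True
models.policy_trainer.max_token_in_packing=16384
models.policy_trainer.groupgemm=True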

Why Does FSDP Initialization Cause Ray OOM Errors When Loading Weights with Transformers?

Enable models.policy_trainer.meta_init=True to mitigate this issue. Note that this may add some extra initialization time.

Why Does This Error Occur During Inference?

ray.exceptions.RayChannelTimeoutError: System error: If the execution is expected to take a long time, increase RAY_CGRAPH_get_timeout which is currently 10 seconds. Otherwise, this may indicate that the execution is hanging.

Check the model configuration: if models.policy.tensor_model_parallel_size is not 1, set models.policy.enforce_eager=True.
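
If the workload genuinely needs more time, you can also raise the timeout named in the error message before launching; 300 seconds below is just an example value:

# example: give Ray's compiled graph execution more time before it reports a hang
export RAY_CGRAPH_get_timeout=300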

Why Does torch.OutOfMemoryError: CUDA Out of Memory Occur During Training?

  1. If models.policy_trainer.packing=True, try reducing models.policy_trainer.max_token_in_packing.

  2. If models.policy_trainer.packing=False, decrease runtime_args.train_micro_batch_size.

  3. If OOM persists even with runtime_args.train_micro_batch_size=1 or when models.policy_trainer.max_token_in_packing is smaller than the generation length, increase models.policy_trainer.ulysses_sequence_parallel_size (recommended: a power of 2, not exceeding the number of GPUs per node). An illustrative combination is shown after this list.
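
For example, a combination that trades some throughput for memory might look like this (values are illustrative; adjust them to your sequence lengths and GPU count):

# illustrative memory-saving overrides (example values only)
models.policy_trainer.packing=True
models.policy_trainer.max_token_in_packing=8192
models.policy_trainer.ulysses_sequence_parallel_size=2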

Why Does CUDA OOM Still Occur After These Adjustments?

Consider scaling up the number of GPUs: FSDP shards model and optimizer states across all workers, so the per-GPU memory footprint shrinks roughly in proportion to the total GPU count.

Why Does vLLM Initialization Cause CUDA OOM?

Increase models.policy.gpu_memory_utilization (recommended: no higher than 0.95).