GIFT is a unified post-training framework designed to bridge the optimization gap within the prevailing SFT-to-RL paradigm. By replacing traditional one-hot SFT with a finite-temperature Gibbs distribution, GIFT establishes a distributional bridge that preserves base priors while ensuring consistency with global post-training objectives. We theoretically and empirically demonstrate that GIFT provides an optimal initialization for RL in mathematical reasoning. Specifically, we show that standard SFT is merely a degenerate zero-temperature limit of this ideal policy. Our results indicate that GIFT significantly outperforms robust SFT variants across diverse and out-of-distribution benchmarks. Furthermore, geometric and distributional analyses reveal that GIFT preserves the exploration landscape, facilitating accelerated convergence and superior asymptotic performance to unlock the model’s full reasoning potential.
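For intuition only, below is a minimal numerical sketch of a finite-temperature Gibbs target of the form `pi_T(y|x) ∝ pi_base(y|x) * exp(r(x,y)/T)` over a small candidate set. This is an illustration under our own assumptions, not the released training code; the candidate log-probabilities, rewards, and temperatures are made up.

```python
import numpy as np

# Hypothetical base-model log-probabilities and rewards for a small set of
# candidate responses to a single prompt. All values are illustrative.
base_logprob = np.array([-2.0, -1.0, -3.0, -0.5])   # log pi_base(y | x)
reward = np.array([1.0, 0.2, 0.9, 0.1])             # r(x, y), e.g. a correctness score

def gibbs_target(base_logprob, reward, temperature):
    """Tempered target pi_T(y|x) proportional to pi_base(y|x) * exp(r(x,y)/T)."""
    logits = base_logprob + reward / temperature
    logits -= logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

for T in [2.0, 0.5, 0.05]:
    print(T, np.round(gibbs_target(base_logprob, reward, T), 3))
# As T -> 0 the target collapses onto the highest-reward candidate,
# i.e. a one-hot label: the standard SFT target as a zero-temperature limit.
```

At moderate temperatures the target keeps probability mass on the base model's other plausible responses, which is one way to read the claim above that GIFT preserves base priors while SFT discards them.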
Installation:

```bash
cd verl
pip install -e .
```

Run GIFT training:

```bash
bash exp_scripts/run_gift_training.sh
```

Edit `exp_scripts/run_gift_training.sh` to set your own paths:
```bash
train_file="/path/to/your/train.parquet"
val_file="/path/to/your/val.parquet"
model_path="/path/to/your/base_model"
```

Your parquet files should contain the following columns (see the example below):
- `prompt_key`: column name for input prompts (default: `sft_prompt`)
- `response_key`: column name for target responses (default: `solution`)
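As a concrete (hypothetical) example, a compatible training file can be written with pandas using the default column names; the prompts and solutions below are placeholders.

```python
import pandas as pd

# Toy placeholder data; replace with your own prompts and target solutions.
df = pd.DataFrame({
    "sft_prompt": [
        "Solve: 2x + 3 = 7. Put the final answer in \\boxed{}.",
        "What is the sum of the first 10 positive integers?",
    ],
    "solution": [
        "2x = 4, so x = 2. The answer is \\boxed{2}.",
        "The sum is 10 * 11 / 2 = 55. The answer is \\boxed{55}.",
    ],
})

# Writing parquet requires pyarrow or fastparquet to be installed.
df.to_parquet("/path/to/your/train.parquet", index=False)
```

If your columns use different names, set `prompt_key` and `response_key` in `exp_scripts/run_gift_training.sh` accordingly.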
GIFT builds upon veRL and DeepMath, uses vLLM for inference, and uses Math-Verify for math reasoning evaluation. We thank the open-source community for the code, datasets, and backbone models, including veRL, LUFFY, and ReLIFT.
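For reference, a minimal sketch of answer checking with Math-Verify is shown below; the gold and predicted strings are placeholders, and the exact API should be confirmed against the Math-Verify documentation.

```python
# Minimal answer-equivalence check with Math-Verify (pip install math-verify).
from math_verify import parse, verify

gold = parse("\\boxed{55}")
prediction = parse("The sum is $10 \\cdot 11 / 2 = 55$, so the answer is \\boxed{55}.")

# True if the parsed answers are mathematically equivalent.
print(verify(gold, prediction))
```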