Hi. First, thanks for sharing and open sourcing your code.
I would like to clarify how packing was performed for SFT data for training Fast-dLLMv2.
The paper mentions that training was on nvidia/Llama-Nemotron-Post-Training-Dataset with a context length of 2048.
Looking at the dataset, the samples tend to be long, with many around 1000 tokens, so it's hard to see how more than one sample could be packed into a 2048-token context unless a sample is allowed to split across sequences. That could be problematic for SFT, since some training sequences would then require the model to predict an answer (or part of one) without the corresponding question in its attention window.
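To make the concern concrete, here is a minimal sketch of what I mean by "packing with splitting" (hypothetical illustration, not code from this repo): tokenized samples are concatenated into one stream and cut into fixed-length sequences, so a sample can straddle a boundary.

```python
# Hypothetical sketch of packing with splitting: concatenate token-id
# lists and chop the stream into context_len chunks, so a sample may
# straddle a sequence boundary. Token ids below are made up.

def pack_with_splitting(samples, context_len=2048):
    """Concatenate token-id lists, then cut into context_len chunks.

    A sample crossing a chunk boundary is split: its tail tokens land
    in a sequence that does not contain the question tokens.
    """
    stream = [tok for sample in samples for tok in sample]
    return [stream[i:i + context_len]
            for i in range(0, len(stream), context_len)]

# Three fake "samples" of 1000, 1500, and 600 tokens.
samples = [list(range(1000)), list(range(1500)), list(range(600))]
sequences = pack_with_splitting(samples, context_len=2048)

# The 1500-token sample starts inside sequence 0 and spills into
# sequence 1: its last 452 tokens appear without the first 1048.
print([len(s) for s in sequences])  # [2048, 1052]
```

With this scheme every sequence is exactly full, but the second sequence here opens mid-answer, which is exactly the situation I'm asking about.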
So my question is: was SFT packing performed while allowing samples to split across sequences within a batch?
Thanks