Hi. First, thanks for sharing and open sourcing your code.
I would like to clarify how packing was performed for SFT data for training Fast-dLLMv2.
The paper mentions that training was on nvidia/Llama-Nemotron-Post-Training-Dataset with a context length of 2048.
Looking at the dataset, the samples tend to be long, with many around 1000 tokens, so it's hard to see how more than one sample could be packed into a 2048-token context unless a sample is allowed to split across sequences. That could be problematic for SFT, since some training sequences would then require the model to predict an answer (or part of one) without the corresponding question in its attention window.
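To make the concern concrete, here is a minimal sketch of what I mean by "packing with splitting" (hypothetical illustration, not code from this repo): tokenized samples are concatenated into one stream and cut into fixed-length sequences, so a sample can straddle a boundary.

```python
# Hypothetical sketch of packing with splitting: concatenate token-id
# lists and chop the stream into context_len chunks, so a sample may
# straddle a sequence boundary. Token ids below are made up.

def pack_with_splitting(samples, context_len=2048):
    """Concatenate token-id lists, then cut into context_len chunks.

    A sample crossing a chunk boundary is split: its tail tokens land
    in a sequence that does not contain the question tokens.
    """
    stream = [tok for sample in samples for tok in sample]
    return [stream[i:i + context_len]
            for i in range(0, len(stream), context_len)]

# Three fake "samples" of 1000, 1500, and 600 tokens.
samples = [list(range(1000)), list(range(1500)), list(range(600))]
sequences = pack_with_splitting(samples, context_len=2048)

# The 1500-token sample starts inside sequence 0 and spills into
# sequence 1: its last 452 tokens appear without the first 1048.
print([len(s) for s in sequences])  # [2048, 1052]
```

With this scheme every sequence is exactly full, but the second sequence here opens mid-answer, which is exactly the situation I'm asking about.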
So my question is: was SFT packing performed while allowing samples to split across sequences within a batch?
Thanks