Refactor dataset module for train

**Steps:**
1. Dataset Refactor: https://github.com/NVIDIA-NeMo/RL/pull/977
2. Decouple Train and Validation Dataset:
    1. SFT/RL: https://github.com/NVIDIA-NeMo/RL/pull/1649
    2. RM/DPO: https://github.com/NVIDIA-NeMo/RL/pull/1763
3. Multiple Datasets Support:
    1. SFT/RL: https://github.com/NVIDIA-NeMo/RL/pull/1691
    2. RM/DPO:
4. Clean up:
    1. Clean up GRPO: environments: https://github.com/NVIDIA-NeMo/RL/pull/1841
    2. Refactor data processor.
    3. Refactor prompt management.
    4. Unify dataset_name and dataset_cls.

**Step 1: Dataset Refactor**
1. Add general dataset classes for the different modes: `sft_dataset`, `preference_dataset` (for RM and DPO), and `rl_dataset`. Keys such as `prompt_key`, `chosen_key`, and `rejected_key` specify how to read a local or HuggingFace dataset, so supporting a new dataset no longer requires writing a new dataset class.
2. Keep the built-in datasets (e.g. `open_assistant`, `HelpSteer3`) so that others can accurately reproduce our results.

After the refactor, the usage will become:
1. For the built-in datasets, the usage is the same as before.
2. For general datasets (local or HuggingFace), an example for DPO is below.
```yaml
data:
    train_data_path: /path/to/local/train_dataset.jsonl
    val_data_path: /path/to/local/val_dataset.jsonl
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
```
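
As a rough illustration of how configurable keys could replace per-dataset classes, here is a minimal sketch in Python. The function name `load_preference_records` and the normalized output fields are assumptions made for this example and are not taken from the NeMo RL codebase; only `prompt_key`, `chosen_key`, and `rejected_key` come from the config above.

```python
import json
from typing import Any


def load_preference_records(
    path: str,
    prompt_key: str = "prompt",
    chosen_key: str = "chosen",
    rejected_key: str = "rejected",
) -> list[dict[str, Any]]:
    """Read a JSONL file and normalize each row to prompt/chosen/rejected.

    Hypothetical helper: the key arguments say which columns of the raw
    file hold the prompt, the preferred response, and the rejected one.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between JSON rows
            row = json.loads(line)
            records.append(
                {
                    "prompt": row[prompt_key],
                    "chosen": row[chosen_key],
                    "rejected": row[rejected_key],
                }
            )
    return records
```

With this shape, a dataset whose columns are named differently only needs different key values in the config rather than a new class.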

**Step 2: Decouple Train and Validation Dataset**
The train and validation datasets are currently coupled, which means we need to write the same logic twice (once for train and once for eval) whenever we add support for a new dataset, so it is better to decouple them.
After this, the usage will become:
```yaml
data:
    train:
        data_path: /path/to/local/train_dataset.jsonl
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
    validation:
        data_path: /path/to/local/val_dataset.jsonl
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
```
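
The decoupling can be sketched as one code path that is applied to each split independently. This is a minimal sketch under assumptions: `build_split` and `build_datasets` are hypothetical names, the returned dictionaries stand in for real dataset objects, and treating `validation` as optional is an assumption of this example rather than something the config above states.

```python
from typing import Any, Optional


def build_split(split_cfg: dict[str, Any]) -> dict[str, Any]:
    """Describe one split using only its own config block.

    Hypothetical stand-in for constructing a dataset object: it pulls out
    the dataset name and path and treats everything else as key settings.
    """
    cfg = dict(split_cfg)  # copy so the caller's config is not mutated
    name = cfg.pop("dataset_name")
    path = cfg.pop("data_path")
    return {"dataset_name": name, "data_path": path, "keys": cfg}


def build_datasets(
    data_cfg: dict[str, Any],
) -> tuple[dict[str, Any], Optional[dict[str, Any]]]:
    """Build train and validation with the same function, independently."""
    train = build_split(data_cfg["train"])
    # Assumption: a missing or empty validation block means no validation set.
    val_cfg = data_cfg.get("validation")
    val = build_split(val_cfg) if val_cfg else None
    return train, val
```

Because each split goes through the same function, adding support for a new dataset touches one place instead of separate train and eval branches.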

**Step 3: Multiple Datasets Support**
After this, the usage will become:
```yaml
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: context
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    default:
        # the values below are used as defaults when a dataset does not specify them
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
```
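
The override behaviour described in the comments above can be sketched as a shallow merge of the `default` block into each dataset entry. This is only an illustration: `resolve_dataset_configs` is a hypothetical name, and both the shallow-merge semantics and the acceptance of a single mapping per split are assumptions of this example, not confirmed behaviour of the linked PRs.

```python
from typing import Any


def resolve_dataset_configs(data_cfg: dict[str, Any]) -> dict[str, list[dict[str, Any]]]:
    """Fill each dataset entry with defaults; entry values take precedence."""
    defaults = data_cfg.get("default") or {}
    resolved: dict[str, list[dict[str, Any]]] = {}
    for split in ("train", "validation"):
        entries = data_cfg.get(split) or []
        if isinstance(entries, dict):
            entries = [entries]  # assumption: also accept a single dataset mapping
        # Later keys win in a dict merge, so per-dataset values override defaults.
        resolved[split] = [{**defaults, **entry} for entry in entries]
    return resolved


# The config above, written as a plain dict for illustration.
example_cfg = {
    "train": [
        {"data_path": "/path/to/local/train_dataset_1.jsonl", "prompt_key": "context"},
        {"data_path": "/path/to/local/train_dataset_2.jsonl"},
    ],
    "validation": [{"data_path": "/path/to/local/val_dataset.jsonl"}],
    "default": {
        "dataset_name": "BinaryPreferenceDataset",
        "prompt_key": "prompt",
        "chosen_key": "chosen",
        "rejected_key": "rejected",
    },
}
resolved_example = resolve_dataset_configs(example_cfg)
```

Here the first train entry keeps `prompt_key: context` while inheriting the other defaults, and the second train entry inherits all of them.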

**Related issues / discussions**
https://github.com/NVIDIA-NeMo/RL/issues/688, https://github.com/NVIDIA-NeMo/RL/discussions/830
