@yuki-97 yuki-97 commented Dec 17, 2025

Related issue: #1050

  1. Split train and val in the built-in datasets, so that we can unblock multiple-dataset support.
  2. Unify the built-in datasets under nemo_rl/data/datasets/response_datasets/ into a similar format.
  3. Remove duplicated dataset names: clevr_cogent and openmathinstruct2.

New Param
Add a new param split_validation_size to handle the case where one dataset is used for both training and validation (e.g., OpenMathInstruct-2 in examples/configs/grpo_math_1B.yaml).

  1. If data.train.split_validation_size > 0 and data.validation is None, part of the training dataset will be used as the validation dataset.
  2. If data.train.split_validation_size > 0 and data.validation is not None, both the held-out part of the training dataset and the provided validation dataset will be used for validation.
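The split behavior described above can be sketched as follows. This is an illustrative approximation only, not the actual NeMo RL implementation: the function name and the plain-list handling are assumptions, and the real code operates on dataset objects rather than Python lists.

```python
import random

def split_train_validation(examples, split_validation_size, seed=42):
    """Sketch: carve a validation set out of the training data.

    `split_validation_size` is the fraction held out for validation;
    `seed` makes the split deterministic across runs.
    """
    if not 0 < split_validation_size < 1:
        raise ValueError("split_validation_size must be in (0, 1)")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # deterministic shuffle
    n_val = max(1, int(len(examples) * split_validation_size))
    val = [examples[i] for i in indices[:n_val]]
    train = [examples[i] for i in indices[n_val:]]
    return train, val

# With split_validation_size: 0.05 and 100 examples, 5 go to validation.
train, val = split_train_validation(list(range(100)), split_validation_size=0.05)
print(len(train), len(val))  # 95 5
```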

Usage

data:
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override input_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl  # local file or hf_org/hf_dataset_name (HuggingFace)
    input_key: question
    split: train  # used for HuggingFace datasets
    split_validation_size: 0.05  # use 5% of the training data as validation data
    seed: 42  # seed for train/validation split when split_validation_size > 0
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # the vars below will be used as default values if a dataset doesn't specify them
    dataset_name: ResponseDataset
    input_key: input
    output_key: output
    prompt_file: null
    system_prompt_file: null
    processor: "sft_processor"
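The fallback behavior of the `default` block can be illustrated with a small sketch. `apply_dataset_defaults` is a hypothetical helper (NeMo RL's actual merge logic may differ): keys set on the dataset entry win, and anything missing falls back to `default`.

```python
def apply_dataset_defaults(dataset_cfg, default_cfg):
    """Sketch: per-dataset keys override defaults; missing keys fall back."""
    return {**default_cfg, **dataset_cfg}

default = {
    "dataset_name": "ResponseDataset",
    "input_key": "input",
    "output_key": "output",
    "processor": "sft_processor",
}
# The train entry overrides input_key and inherits everything else.
train_entry = {
    "data_path": "/path/to/local/train_dataset.jsonl",
    "input_key": "question",
}
merged = apply_dataset_defaults(train_entry, default)
print(merged["input_key"], merged["output_key"])  # question output
```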

Migration Guide

  1. For datasets that load from a local JSONL file or HuggingFace (openai_format and ResponseDataset)
    # old
    data:
      dataset_name: ResponseDataset
      train_data_path: <PathToTrainingDataset>
      val_data_path: <PathToValidationDataset>
      input_key: <QuestionKey>
      output_key: <AnswerKey>
      train_split: <TrainSplit>
      val_split: <ValSplit>
    
    # new
    data:
      # other data settings, see `examples/configs/sft.yaml` for more details
      ...
      # dataset settings
      train:
        # this dataset will override input_key and use the default values for other vars
        data_path: /path/to/local/train_dataset.jsonl  # local file or hf_org/hf_dataset_name (HuggingFace)
        input_key: question
        split: train  # used for HuggingFace datasets
        split_validation_size: 0.05  # use 5% of the training data as validation data
        seed: 42  # seed for train/validation split when split_validation_size > 0
      validation:
        # this dataset will use the default values for other vars except data_path
        data_path: /path/to/local/val_dataset.jsonl
      default:
        # the vars below will be used as default values if a dataset doesn't specify them
        dataset_name: ResponseDataset
        input_key: input
        output_key: output
        prompt_file: null
        system_prompt_file: null
        processor: "sft_processor"
  2. For some built-in datasets that need changes
    1. DAPOMath17K
      # old
      data:
        dataset_name: DAPOMath17K
      
      # new
      data:
        train:
          dataset_name: DAPOMath17K
        validation:
          dataset_name: DAPOMathAIME2024
    2. DeepScaler
      # old
      data:
        dataset_name: DeepScaler
      
      # new
      data:
        train:
          dataset_name: DeepScaler
        validation:
          dataset_name: AIME2024
          repeat: 16
    3. clevr-cogent
      # old
      data:
        dataset_name: clevr-cogent
        split: trainA
      
      # new
      data:
        train:
          dataset_name: clevr-cogent
          split: train
        validation:
          dataset_name: clevr-cogent
          split: valA
    4. HelpSteer3
      # old
      data:
        dataset_name: HelpSteer3
        split: preference
      
      # new
      data:
        train:
          dataset_name: HelpSteer3
          split: train
        validation:
          dataset_name: HelpSteer3
          split: validation
  3. For other built-in datasets, you only need to move them and set the correct split, e.g.
    # old
    data:
      dataset_name: "squad"
    
    # new
    data:
      train:
        dataset_name: "squad"
        split: "train"
      validation:
        dataset_name: "squad"
        split: "validation"
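The flat-to-nested migration above can be sketched as a small helper. `migrate_flat_data_config` is hypothetical (not shipped with NeMo RL) and only covers the keys shown in the examples: it maps the old train_data_path/val_data_path/train_split/val_split keys onto the new train/validation/default structure.

```python
def migrate_flat_data_config(old):
    """Sketch: convert an old flat data config into the new nested layout."""
    new = {"train": {}, "validation": {}, "default": {}}
    if "train_data_path" in old:
        new["train"]["data_path"] = old["train_data_path"]
    if "train_split" in old:
        new["train"]["split"] = old["train_split"]
    if "val_data_path" in old:
        new["validation"]["data_path"] = old["val_data_path"]
    if "val_split" in old:
        new["validation"]["split"] = old["val_split"]
    # Shared keys become per-dataset defaults.
    for key in ("dataset_name", "input_key", "output_key"):
        if key in old:
            new["default"][key] = old[key]
    return new

old = {
    "dataset_name": "ResponseDataset",
    "train_data_path": "/data/train.jsonl",
    "val_data_path": "/data/val.jsonl",
    "input_key": "question",
    "output_key": "answer",
}
print(migrate_flat_data_config(old)["train"]["data_path"])  # /data/train.jsonl
```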

Test Result

algo          result
sft           (image)
sft-vlm       (image)
grpo          (image)
grpo-vlm      (image)
distillation  (image)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for separate training and validation dataset configuration with new train and validation blocks in data settings
    • Introduced new datasets: AIME2024, DAPOMath variants with automatic validation split capability
    • Enhanced dataset framework with improved flexibility for processor selection and environment configuration
  • Documentation

    • Updated guides with new data configuration structure and examples for train/validation dataset setup
    • Clarified supported dataset listings and configuration format for multi-dataset training scenarios
  • Bug Fixes & Improvements

    • Improved dataset loading workflow with better support for shared datasets and per-task processing
    • Streamlined configuration migration from flat to nested dataset structure across all example configs


@yuki-97 yuki-97 added the CI:L0 Run doctests and unit tests label Dec 17, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f8dcf7c to 2f78c84 Compare December 18, 2025 05:05
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2f78c84 to fd448be Compare December 18, 2025 05:23
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2aa7ce0 to 6a093d1 Compare December 18, 2025 07:08
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
Contributor

@terrykong terrykong left a comment


some initial thoughts

since it's a big PR @ashors1 could you help as a second review?

assert hasattr(data, "processor"), "Dataset must have a processor attribute"
task_data_processors[task_name] = (task_spec, data.processor)
# setup train dataset
update_single_dataset_config(data_config["train"], data_config)
Contributor

wdyt about just expecting users to populate the train config? then we don't have dup keys

Contributor Author

I think we should have default values, especially once we support multiple datasets in the next PR; otherwise people would need to write the same settings for every dataset, and the data config would get redundant.

I'm also wondering whether it's better to put the defaults in an explicit block alongside train and validation; that seems more direct than leaving them at the top level. wdyt?

# now
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: question
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
    prompt_file: null
    system_prompt_file: null
    env_name: math

# add `default`
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: question
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    default:
        # will use below vars as default values if dataset doesn't specify it
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
        prompt_file: null
        system_prompt_file: null
        env_name: math

Contributor

I feel like it's better to be explicit rather than rely on a fallback, since it's not clear what needs what: to understand the relationship between default and each dataset, users would need to inspect the code.

I agree it's somewhat redundant, but it's more explicit.

could you get feedback from research team to see what they'd prefer?

Contributor Author

@yuki-97 yuki-97 Jan 12, 2026

as discussed offline, use the default one.

code update: 5edeafe, cc6a2dd
config update: 01cb6d1
doc update: 2853f0e

@yuki-97 yuki-97 changed the title feat: split train val dataset and refactor for response dataset refactor: split train val dataset in response dataset Dec 18, 2025
@yuki-97 yuki-97 changed the title refactor: split train val dataset in response dataset refactor: split train and val dataset in response dataset Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch 2 times, most recently from 6b34af3 to fea258d Compare December 19, 2025 15:50
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Dec 19, 2025
yuki-97 and others added 26 commits January 19, 2026 23:29
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Rayen <ruit@nvidia.com>

update all run_xxx and recipe of response dataset to use default

fix missing default
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f9def0d to ec862a3 Compare January 20, 2026 10:40
Contributor Author

yuki-97 commented Jan 20, 2026

Running the nightly tests now; they need some minor fixes, which I will push later.


Labels

CI:L1 Run doctests, unit tests, and functional tests documentation Improvements or additions to documentation


Development

Successfully merging this pull request may close these issues.

6 participants