@yuki-97 yuki-97 commented Dec 17, 2025

Related issue: #1050

  1. Split train and val in the built-in datasets, so that we can unblock multiple-dataset support.
  2. Unify the built-in datasets under nemo_rl/data/datasets/response_datasets/ into a similar format.
  3. Remove duplicated dataset names: clevr_cogent and openmathinstruct2.

New Param
Add a new param split_validation_size to handle the case where one dataset is used for both training and validation (e.g., OpenMathInstruct-2 in examples/configs/grpo_math_1B.yaml).

  1. If data.train.split_validation_size > 0 and data.validation is None, part of the training dataset will be used as the validation dataset.
  2. If data.train.split_validation_size > 0 and data.validation is not None, both the held-out part of the training dataset and the provided validation dataset will be used for validation.
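The split behavior described above can be sketched as follows. This is an illustrative approximation only, not the actual NeMo RL implementation: the function name and the plain-list handling are assumptions, and the real code operates on dataset objects rather than Python lists.

```python
import random

def split_train_validation(examples, split_validation_size, seed=42):
    """Sketch: carve a validation set out of the training data.

    `split_validation_size` is the fraction held out for validation;
    `seed` makes the split deterministic across runs.
    """
    if not 0 < split_validation_size < 1:
        raise ValueError("split_validation_size must be in (0, 1)")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # deterministic shuffle
    n_val = max(1, int(len(examples) * split_validation_size))
    val = [examples[i] for i in indices[:n_val]]
    train = [examples[i] for i in indices[n_val:]]
    return train, val

# With split_validation_size: 0.05 and 100 examples, 5 go to validation.
train, val = split_train_validation(list(range(100)), split_validation_size=0.05)
print(len(train), len(val))  # 95 5
```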

Usage

data:
  # other data settings, see `examples/configs/sft.yaml` for more details
  ...
  # dataset settings
  train:
    # this dataset will override input_key and use the default values for other vars
    data_path: /path/to/local/train_dataset.jsonl  # local file or hf_org/hf_dataset_name (HuggingFace)
    input_key: question
    split: train  # used for HuggingFace datasets
    split_validation_size: 0.05  # use 5% of the training data as validation data
    seed: 42  # seed for train/validation split when split_validation_size > 0
  validation:
    # this dataset will use the default values for other vars except data_path
    data_path: /path/to/local/val_dataset.jsonl
  default:
    # the vars below will be used as default values if a dataset doesn't specify them
    dataset_name: ResponseDataset
    input_key: input
    output_key: output
    prompt_file: null
    system_prompt_file: null
    processor: "sft_processor"
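The fallback behavior of the `default` block can be illustrated with a small sketch. `apply_dataset_defaults` is a hypothetical helper (NeMo RL's actual merge logic may differ): keys set on the dataset entry win, and anything missing falls back to `default`.

```python
def apply_dataset_defaults(dataset_cfg, default_cfg):
    """Sketch: per-dataset keys override defaults; missing keys fall back."""
    return {**default_cfg, **dataset_cfg}

default = {
    "dataset_name": "ResponseDataset",
    "input_key": "input",
    "output_key": "output",
    "processor": "sft_processor",
}
# The train entry overrides input_key and inherits everything else.
train_entry = {
    "data_path": "/path/to/local/train_dataset.jsonl",
    "input_key": "question",
}
merged = apply_dataset_defaults(train_entry, default)
print(merged["input_key"], merged["output_key"])  # question output
```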

Migration Guide

  1. For datasets that load from a local JSONL file or HuggingFace (openai_format and ResponseDataset)
    # old
    data:
      dataset_name: ResponseDataset
      train_data_path: <PathToTrainingDataset>
      val_data_path: <PathToValidationDataset>
      input_key: <QuestionKey>
      output_key: <AnswerKey>
      train_split: <TrainSplit>
      val_split: <ValSplit>
    
    # new
    data:
      # other data settings, see `examples/configs/sft.yaml` for more details
      ...
      # dataset settings
      train:
        # this dataset will override input_key and use the default values for other vars
        data_path: /path/to/local/train_dataset.jsonl  # local file or hf_org/hf_dataset_name (HuggingFace)
        input_key: question
        split: train  # used for HuggingFace datasets
        split_validation_size: 0.05  # use 5% of the training data as validation data
        seed: 42  # seed for train/validation split when split_validation_size > 0
      validation:
        # this dataset will use the default values for other vars except data_path
        data_path: /path/to/local/val_dataset.jsonl
      default:
        # the vars below will be used as default values if a dataset doesn't specify them
        dataset_name: ResponseDataset
        input_key: input
        output_key: output
        prompt_file: null
        system_prompt_file: null
        processor: "sft_processor"
  2. For some built-in datasets that need changes
    1. DAPOMath17K
      # old
      data:
        dataset_name: DAPOMath17K
      
      # new
      data:
        train:
          dataset_name: DAPOMath17K
        validation:
          dataset_name: DAPOMathAIME2024
    2. DeepScaler
      # old
      data:
        dataset_name: DeepScaler
      
      # new
      data:
        train:
          dataset_name: DeepScaler
        validation:
          dataset_name: AIME2024
          repeat: 16
    3. clevr-cogent
      # old
      data:
        dataset_name: clevr-cogent
        split: trainA
      
      # new
      data:
        train:
          dataset_name: clevr-cogent
          split: train
        validation:
          dataset_name: clevr-cogent
          split: valA
    4. HelpSteer3
      # old
      data:
        dataset_name: HelpSteer3
        split: preference
      
      # new
      data:
        train:
          dataset_name: HelpSteer3
          split: train
        validation:
          dataset_name: HelpSteer3
          split: validation
  3. For other built-in datasets, you only need to move them and set the correct split, e.g.
    # old
    data:
      dataset_name: "squad"
    
    # new
    data:
      train:
        dataset_name: "squad"
        split: "train"
      validation:
        dataset_name: "squad"
        split: "validation"
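The flat-to-nested migration above can be sketched as a small helper. `migrate_flat_data_config` is hypothetical (not shipped with NeMo RL) and only covers the keys shown in the examples: it maps the old train_data_path/val_data_path/train_split/val_split keys onto the new train/validation/default structure.

```python
def migrate_flat_data_config(old):
    """Sketch: convert an old flat data config into the new nested layout."""
    new = {"train": {}, "validation": {}, "default": {}}
    if "train_data_path" in old:
        new["train"]["data_path"] = old["train_data_path"]
    if "train_split" in old:
        new["train"]["split"] = old["train_split"]
    if "val_data_path" in old:
        new["validation"]["data_path"] = old["val_data_path"]
    if "val_split" in old:
        new["validation"]["split"] = old["val_split"]
    # Shared keys become per-dataset defaults.
    for key in ("dataset_name", "input_key", "output_key"):
        if key in old:
            new["default"][key] = old[key]
    return new

old = {
    "dataset_name": "ResponseDataset",
    "train_data_path": "/data/train.jsonl",
    "val_data_path": "/data/val.jsonl",
    "input_key": "question",
    "output_key": "answer",
}
print(migrate_flat_data_config(old)["train"]["data_path"])  # /data/train.jsonl
```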

Test Result

algo          result
sft           (image)
sft-vlm       (image)
grpo          (image)
grpo-vlm      (image)
distillation  (image)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for separate training and validation dataset configuration with new train and validation blocks in data settings
    • Introduced new datasets: AIME2024, DAPOMath variants with automatic validation split capability
    • Enhanced dataset framework with improved flexibility for processor selection and environment configuration
  • Documentation

    • Updated guides with new data configuration structure and examples for train/validation dataset setup
    • Clarified supported dataset listings and configuration format for multi-dataset training scenarios
  • Bug Fixes & Improvements

    • Improved dataset loading workflow with better support for shared datasets and per-task processing
    • Streamlined configuration migration from flat to nested dataset structure across all example configs


@yuki-97 yuki-97 added the CI:L0 Run doctests and unit tests label Dec 17, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f8dcf7c to 2f78c84 Compare December 18, 2025 05:05
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2f78c84 to fd448be Compare December 18, 2025 05:23
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from 2aa7ce0 to 6a093d1 Compare December 18, 2025 07:08
@yuki-97 yuki-97 added CI:L0 Run doctests and unit tests and removed CI:L0 Run doctests and unit tests labels Dec 18, 2025
Contributor

@terrykong terrykong left a comment


some initial thoughts

since it's a big PR @ashors1 could you help as a second review?

assert hasattr(data, "processor"), "Dataset must have a processor attribute"
task_data_processors[task_name] = (task_spec, data.processor)
# setup train dataset
update_single_dataset_config(data_config["train"], data_config)
Contributor

wdyt about just expecting users to populate the train config? then we don't have dup keys

Contributor Author

I think we should have default values, especially once we support multiple datasets in the next PR; otherwise people would need to write the same settings for every dataset, and the data config would get redundant.

I'm also wondering whether it's better to put the defaults in an explicit block alongside train and validation; that seems more direct than leaving them at the top level. wdyt?

# now
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: question
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    # will use below vars as default values if dataset doesn't specify it
    dataset_name: BinaryPreferenceDataset
    prompt_key: prompt
    chosen_key: chosen
    rejected_key: rejected
    prompt_file: null
    system_prompt_file: null
    env_name: math

# add `default`
data:
    train:
        # this dataset will override prompt_key and use the default values for other vars
        - data_path: /path/to/local/train_dataset_1.jsonl
          prompt_key: question
        # this dataset will use all the default values
        - data_path: /path/to/local/train_dataset_2.jsonl
    validation:
        - data_path: /path/to/local/val_dataset.jsonl
    default:
        # will use below vars as default values if dataset doesn't specify it
        dataset_name: BinaryPreferenceDataset
        prompt_key: prompt
        chosen_key: chosen
        rejected_key: rejected
        prompt_file: null
        system_prompt_file: null
        env_name: math

Contributor

I feel like it's better to be explicit rather than rely on a fallback, since it's not clear what needs what: to understand the relationship between default and each dataset, users would need to inspect the code.

I agree it's somewhat redundant, but it's more explicit.

could you get feedback from research team to see what they'd prefer?

Contributor Author

@yuki-97 yuki-97 Jan 12, 2026

as discussed offline, use the default one.

code update: 5edeafe, cc6a2dd
config update: 01cb6d1
doc update: 2853f0e

@yuki-97 yuki-97 changed the title feat: split train val dataset and refactor for response dataset refactor: split train val dataset in response dataset Dec 18, 2025
@yuki-97 yuki-97 changed the title refactor: split train val dataset in response dataset refactor: split train and val dataset in response dataset Dec 18, 2025
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch 2 times, most recently from 6b34af3 to fea258d Compare December 19, 2025 15:50
@yuki-97 yuki-97 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L0 Run doctests and unit tests labels Dec 19, 2025
yuki-97 and others added 26 commits January 19, 2026 23:29
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Rayen <ruit@nvidia.com>

update all run_xxx and recipe of response dataset to use default

fix missing default
@yuki-97 yuki-97 force-pushed the yukih/split-train-val-dataset branch from f9def0d to ec862a3 Compare January 20, 2026 10:40
Contributor Author

yuki-97 commented Jan 20, 2026

Running the nightly tests now; they need some minor fixes, which I will push later.


Labels

CI:L1 Run doctests, unit tests, and functional tests documentation Improvements or additions to documentation


Development

Successfully merging this pull request may close these issues.

6 participants