Skip to content

feat(datasets): allow per-dataset override of mixture val_split_ratio#377

Merged
shuheng-liu merged 2 commits into
mainfrom
claude/loving-carson-7ea150
Jun 2, 2026
Merged

feat(datasets): allow per-dataset override of mixture val_split_ratio#377
shuheng-liu merged 2 commits into
mainfrom
claude/loving-carson-7ea150

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

What this does

Re-enables a per-dataset override of the validation split ratio in a dataset mixture.

Previously, validation used a single mixture-level DatasetMixtureConfig.val_split_ratio (default 0.05) applied uniformly to every dataset. A per-dataset DatasetConfig.val_split_ratio field existed but had been deprecated and ignored — setting it only emitted a DeprecationWarning.

This PR turns DatasetConfig.val_split_ratio into a real per-dataset override:

  • None (the default) inherits DatasetMixtureConfig.val_split_ratio.
  • Any value wins for that dataset only — including 0.0 to opt a single dataset out of validation while the rest of the mixture keeps a split.

The override is resolved in make_dataset exactly like the existing tolerance_s / skip_timestamp_check inherit-on-None pattern (effective = cfg.x if cfg.x is not None else mixture.x). Per-dataset values are range-validated ([0, 1]) in DatasetConfig.__post_init__, and the now-obsolete deprecation warning in DatasetMixtureConfig.__post_init__ is removed.

Also hardens fit_fast_tokenizer._build_train_cfg: it already forces the mixture-level ratio to 0.0 to guarantee a single (train-only) dataset; it now zeroes per-dataset overrides too (rebuilding the dataset configs via dataclasses.replace, leaving the caller's config untouched), so the "no validation split" invariant holds for any input config.

Backward-compatible: configs that set only the mixture-level ratio behave identically.

Label: 🗃️ Feature

How it was tested

  • pre-commit run --all-files clean (ruff lint+format, pyupgrade, etc.) on the changed files.
  • CPU suite green for the touched areas: pytest -m "not gpu" tests/configs/test_default.py tests/datasets/test_datasets.py (93 passed) and tests/scripts/test_fit_fast_tokenizer.py (9 passed).
  • New / updated tests:
    • tests/datasets/test_datasets.py: test_make_dataset_per_dataset_val_split_ratio_wins (per-dataset 0.2 beats mixture 0.05) and test_make_dataset_inherits_mixture_val_split_ratio (None inherits mixture 0.1).
    • tests/configs/test_default.py: flipped test_val_split_ratio_warns_when_child_overridestest_val_split_ratio_no_warning_when_child_overrides (a per-dataset override is supported now, so it must not warn); added test_dataset_config_val_split_ratio_out_of_range_raises.

Not a policy change (no model / training-loop / optimizer / sampler files touched), so GPU and regression suites were not run.

How to checkout & try? (for the reviewer)

pytest -sx tests/datasets/test_datasets.py::test_make_dataset_per_dataset_val_split_ratio_wins \
          tests/datasets/test_datasets.py::test_make_dataset_inherits_mixture_val_split_ratio
pytest -sx tests/configs/test_default.py::test_val_split_ratio_no_warning_when_child_overrides \
          tests/configs/test_default.py::test_dataset_config_val_split_ratio_out_of_range_raises
pytest -sx tests/scripts/test_fit_fast_tokenizer.py

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

`DatasetConfig.val_split_ratio` was deprecated and ignored, so the
mixture-level `val_split_ratio` applied uniformly to every dataset.
Restore it as a real per-dataset override: `None` inherits the mixture
default, any value (incl. `0.0` to opt out of validation) wins for that
dataset only. Mirrors the existing `tolerance_s` / `skip_timestamp_check`
inherit-on-None pattern, resolved in `make_dataset`.

Also zero out per-dataset overrides in `fit_fast_tokenizer._build_train_cfg`
so its "no validation split" invariant holds regardless of the input config.
@shuheng-liu shuheng-liu added the feature New feature or request label Jun 2, 2026
@shuheng-liu shuheng-liu self-assigned this Jun 2, 2026
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the per-dataset val_split_ratio override. Logic is correct and mirrors the existing tolerance_s/skip_timestamp_check inherit-on-None pattern; the fit_fast_tokenizer zeroing is correctly non-mutating via dataclasses.replace. One non-blocking note inline about the advertised 0.0 opt-out path.

Comment thread src/opentau/datasets/factory.py
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Jun 2, 2026

[claude-review] summary for commit b0c392a

No blocking issues found.

  • resolvedsrc/opentau/datasets/factory.py:273 — My prior suggestion (the advertised 0.0 per-dataset opt-out was untested, and its behavior on the explicit-weights path unverified) is addressed by commit b0c392a. tests/datasets/test_datasets.py::test_make_dataset_per_dataset_val_split_ratio_zero_opts_out covers the factory path (empty val Subset, all samples retained for train), and tests/datasets/test_dataset_mixture.py::test_calculate_sample_weights_skips_empty_member proves an empty member contributes no samples even with an explicit positive weight — consistent with _calculate_sample_weights skipping length-0 datasets (dataset_mixture.py:833-835).

Otherwise clean (carried from prior review): the inherit-on-None resolution in make_dataset correctly mirrors the existing tolerance_s/skip_timestamp_check pattern; the [0,1] validation in DatasetConfig.__post_init__ is correct; and the fit_fast_tokenizer per-dataset zeroing is non-mutating via dataclasses.replace (caller config untouched, datasets_for_fit correctly threaded into mixture_cfg_for_fit, import dataclasses present at line 1016).

Address review feedback that the advertised `0.0` opt-out was untested.
Add a factory-level test (0.0 yields an empty val Subset, all samples stay
in train) and a mixture-level test proving an empty member contributes no
samples even with a positive mixture weight — `_calculate_sample_weights`
skips length-0 datasets, so the opt-out works on the explicit-weights path
too, not only the inferred-weights path.
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed at b0c392a. The prior non-blocking suggestion (untested 0.0 per-dataset opt-out, and whether it holds on the explicit-weights path) is now resolved: test_make_dataset_per_dataset_val_split_ratio_zero_opts_out covers the factory-level empty val split, and test_calculate_sample_weights_skips_empty_member proves an empty member contributes no samples even with an explicit positive weight — matching _calculate_sample_weights skipping length-0 datasets at dataset_mixture.py:833. No blocking issues found.

@shuheng-liu shuheng-liu marked this pull request as ready for review June 2, 2026 19:22
@shuheng-liu shuheng-liu merged commit b732e68 into main Jun 2, 2026
18 checks passed
@shuheng-liu shuheng-liu deleted the claude/loving-carson-7ea150 branch June 2, 2026 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant