feat(datasets): allow per-dataset override of mixture val_split_ratio by shuheng-liu · Pull Request #377 · TensorAuto/OpenTau

shuheng-liu · 2026-06-02T19:02:13Z

What this does

Re-enables a per-dataset override of the validation split ratio in a dataset mixture.

Previously, validation used a single mixture-level DatasetMixtureConfig.val_split_ratio (default 0.05) applied uniformly to every dataset. A per-dataset DatasetConfig.val_split_ratio field existed but had been deprecated and ignored — setting it only emitted a DeprecationWarning.

This PR turns DatasetConfig.val_split_ratio into a real per-dataset override:

None (the default) inherits DatasetMixtureConfig.val_split_ratio.
Any value wins for that dataset only — including 0.0 to opt a single dataset out of validation while the rest of the mixture keeps a split.

The override is resolved in make_dataset exactly like the existing tolerance_s / skip_timestamp_check inherit-on-None pattern (effective = cfg.x if cfg.x is not None else mixture.x). Per-dataset values are range-validated ([0, 1]) in DatasetConfig.__post_init__, and the now-obsolete deprecation warning in DatasetMixtureConfig.__post_init__ is removed.

Also hardens fit_fast_tokenizer._build_train_cfg: it already forces the mixture-level ratio to 0.0 to guarantee a single (train-only) dataset; it now zeroes per-dataset overrides too (rebuilding the dataset configs via dataclasses.replace, leaving the caller's config untouched), so the "no validation split" invariant holds for any input config.

Backward-compatible: configs that set only the mixture-level ratio behave identically.

Label: 🗃️ Feature

How it was tested

pre-commit run --all-files clean (ruff lint+format, pyupgrade, etc.) on the changed files.
CPU suite green for the touched areas: pytest -m "not gpu" tests/configs/test_default.py tests/datasets/test_datasets.py (93 passed) and tests/scripts/test_fit_fast_tokenizer.py (9 passed).
New / updated tests:
- tests/datasets/test_datasets.py: test_make_dataset_per_dataset_val_split_ratio_wins (per-dataset 0.2 beats mixture 0.05) and test_make_dataset_inherits_mixture_val_split_ratio (None inherits mixture 0.1).
- tests/configs/test_default.py: flipped test_val_split_ratio_warns_when_child_overrides → test_val_split_ratio_no_warning_when_child_overrides (a per-dataset override is supported now, so it must not warn); added test_dataset_config_val_split_ratio_out_of_range_raises.

Not a policy change (no model / training-loop / optimizer / sampler files touched), so GPU and regression suites were not run.

How to checkout & try? (for the reviewer)

pytest -sx tests/datasets/test_datasets.py::test_make_dataset_per_dataset_val_split_ratio_wins \
          tests/datasets/test_datasets.py::test_make_dataset_inherits_mixture_val_split_ratio

pytest -sx tests/configs/test_default.py::test_val_split_ratio_no_warning_when_child_overrides \
          tests/configs/test_default.py::test_dataset_config_val_split_ratio_out_of_range_raises

pytest -sx tests/scripts/test_fit_fast_tokenizer.py

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

`DatasetConfig.val_split_ratio` was deprecated and ignored, so the mixture-level `val_split_ratio` applied uniformly to every dataset. Restore it as a real per-dataset override: `None` inherits the mixture default, any value (incl. `0.0` to opt out of validation) wins for that dataset only. Mirrors the existing `tolerance_s` / `skip_timestamp_check` inherit-on-None pattern, resolved in `make_dataset`. Also zero out per-dataset overrides in `fit_fast_tokenizer._build_train_cfg` so its "no validation split" invariant holds regardless of the input config.

claude

Reviewed the per-dataset val_split_ratio override. Logic is correct and mirrors the existing tolerance_s/skip_timestamp_check inherit-on-None pattern; the fit_fast_tokenizer zeroing is correctly non-mutating via dataclasses.replace. One non-blocking note inline about the advertised 0.0 opt-out path.

claude · 2026-06-02T19:05:31Z

[claude-review] summary for commit b0c392a

No blocking issues found.

resolved — src/opentau/datasets/factory.py:273 — My prior suggestion (the advertised 0.0 per-dataset opt-out was untested, and its behavior on the explicit-weights path unverified) is addressed by commit b0c392a. tests/datasets/test_datasets.py::test_make_dataset_per_dataset_val_split_ratio_zero_opts_out covers the factory path (empty val Subset, all samples retained for train), and tests/datasets/test_dataset_mixture.py::test_calculate_sample_weights_skips_empty_member proves an empty member contributes no samples even with an explicit positive weight — consistent with _calculate_sample_weights skipping length-0 datasets (dataset_mixture.py:833-835).

Otherwise clean (carried from prior review): the inherit-on-None resolution in make_dataset correctly mirrors the existing tolerance_s/skip_timestamp_check pattern; the [0,1] validation in DatasetConfig.__post_init__ is correct; and the fit_fast_tokenizer per-dataset zeroing is non-mutating via dataclasses.replace (caller config untouched, datasets_for_fit correctly threaded into mixture_cfg_for_fit, import dataclasses present at line 1016).

Address review feedback that the advertised `0.0` opt-out was untested. Add a factory-level test (0.0 yields an empty val Subset, all samples stay in train) and a mixture-level test proving an empty member contributes no samples even with a positive mixture weight — `_calculate_sample_weights` skips length-0 datasets, so the opt-out works on the explicit-weights path too, not only the inferred-weights path.

claude

Re-reviewed at b0c392a. The prior non-blocking suggestion (untested 0.0 per-dataset opt-out, and whether it holds on the explicit-weights path) is now resolved: test_make_dataset_per_dataset_val_split_ratio_zero_opts_out covers the factory-level empty val split, and test_calculate_sample_weights_skips_empty_member proves an empty member contributes no samples even with an explicit positive weight — matching _calculate_sample_weights skipping length-0 datasets at dataset_mixture.py:833. No blocking issues found.

shuheng-liu added the feature New feature or request label Jun 2, 2026

shuheng-liu self-assigned this Jun 2, 2026

claude Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread src/opentau/datasets/factory.py

claude Bot reviewed Jun 2, 2026

View reviewed changes

shuheng-liu marked this pull request as ready for review June 2, 2026 19:22

shuheng-liu merged commit b732e68 into main Jun 2, 2026
18 checks passed

shuheng-liu deleted the claude/loving-carson-7ea150 branch June 2, 2026 19:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(datasets): allow per-dataset override of mixture val_split_ratio#377

feat(datasets): allow per-dataset override of mixture val_split_ratio#377
shuheng-liu merged 2 commits into
mainfrom
claude/loving-carson-7ea150

shuheng-liu commented Jun 2, 2026

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

claude Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shuheng-liu commented Jun 2, 2026

What this does

How it was tested

How to checkout & try? (for the reviewer)

Checklist

Note: Before submitting this PR, please read the contributor guideline.

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

claude Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Jun 2, 2026 •

edited

Loading