feat(datasets): allow per-dataset override of mixture val_split_ratio #377
`DatasetConfig.val_split_ratio` was deprecated and ignored, so the mixture-level `val_split_ratio` applied uniformly to every dataset. Restore it as a real per-dataset override: `None` inherits the mixture default, any value (incl. `0.0` to opt out of validation) wins for that dataset only. Mirrors the existing `tolerance_s` / `skip_timestamp_check` inherit-on-None pattern, resolved in `make_dataset`. Also zero out per-dataset overrides in `fit_fast_tokenizer._build_train_cfg` so its "no validation split" invariant holds regardless of the input config.
Reviewed the per-dataset val_split_ratio override. Logic is correct and mirrors the existing tolerance_s/skip_timestamp_check inherit-on-None pattern; the fit_fast_tokenizer zeroing is correctly non-mutating via dataclasses.replace. One non-blocking note inline about the advertised 0.0 opt-out path.
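For readers unfamiliar with the non-mutating pattern the review refers to, here is a minimal sketch. `DatasetCfg`, `MixtureCfg`, and `build_train_cfg` are hypothetical stand-ins, not the project's actual classes or the body of `_build_train_cfg`; only `dataclasses.replace` is the real standard-library call.

```python
import dataclasses
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetCfg:  # hypothetical stand-in for DatasetConfig
    name: str
    val_split_ratio: Optional[float] = None


@dataclass
class MixtureCfg:  # hypothetical stand-in for DatasetMixtureConfig
    datasets: list
    val_split_ratio: float = 0.05


def build_train_cfg(mixture: MixtureCfg) -> MixtureCfg:
    """Return a train-only copy with every validation ratio forced to 0.0."""
    # dataclasses.replace builds new instances, so the caller's objects are untouched.
    train_datasets = [dataclasses.replace(d, val_split_ratio=0.0) for d in mixture.datasets]
    return dataclasses.replace(mixture, datasets=train_datasets, val_split_ratio=0.0)


original = MixtureCfg(datasets=[DatasetCfg("a"), DatasetCfg("b", val_split_ratio=0.2)])
train_only = build_train_cfg(original)
```

After the call, `train_only` carries `0.0` everywhere while `original` still holds its `0.2` override and `0.05` mixture default.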
[claude-review] summary for commit b0c392a: No blocking issues found.
Otherwise clean (carried from prior review): the inherit-on-None resolution in |
Address review feedback that the advertised `0.0` opt-out was untested. Add a factory-level test (0.0 yields an empty val Subset, all samples stay in train) and a mixture-level test proving an empty member contributes no samples even with a positive mixture weight — `_calculate_sample_weights` skips length-0 datasets, so the opt-out works on the explicit-weights path too, not only the inferred-weights path.
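The two behaviours these tests pin down can be illustrated with a small pure-Python sketch. `split_indices` and `sample_weights` are hypothetical helpers written for this illustration; the real split logic and `_calculate_sample_weights` may differ in how they pick validation samples and normalize weights.

```python
def split_indices(n: int, val_split_ratio: float) -> tuple:
    """Split range(n) into (train, val) index lists; assumes val is taken from the tail."""
    n_val = int(n * val_split_ratio)
    indices = list(range(n))
    return indices[: n - n_val], indices[n - n_val:]


def sample_weights(lengths: list, weights: list) -> list:
    """Per-sample weights for concatenated members, skipping length-0 members."""
    out = []
    for length, weight in zip(lengths, weights):
        if length == 0:
            continue  # an empty member contributes no samples, whatever its weight
        out.extend([weight / length] * length)
    total = sum(out)
    # Renormalize so the remaining samples sum to 1 (empty input yields an empty list).
    return [w / total for w in out] if out else []


train_idx, val_idx = split_indices(10, 0.0)   # ratio 0.0: everything stays in train
val_weights = sample_weights([0, 2], [0.9, 0.1])  # first member is an empty val split
```

Here the member with weight `0.9` has no samples, so only the second member's two samples receive weight.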
Re-reviewed at b0c392a. The prior non-blocking suggestion (untested 0.0 per-dataset opt-out, and whether it holds on the explicit-weights path) is now resolved: test_make_dataset_per_dataset_val_split_ratio_zero_opts_out covers the factory-level empty val split, and test_calculate_sample_weights_skips_empty_member proves an empty member contributes no samples even with an explicit positive weight — matching _calculate_sample_weights skipping length-0 datasets at dataset_mixture.py:833. No blocking issues found.
What this does
Re-enables a per-dataset override of the validation split ratio in a dataset mixture.
Previously, validation used a single mixture-level `DatasetMixtureConfig.val_split_ratio` (default `0.05`) applied uniformly to every dataset. A per-dataset `DatasetConfig.val_split_ratio` field existed but had been deprecated and ignored; setting it only emitted a `DeprecationWarning`.

This PR turns `DatasetConfig.val_split_ratio` into a real per-dataset override:

- `None` (the default) inherits `DatasetMixtureConfig.val_split_ratio`.
- Any explicit value wins for that dataset only, e.g. `0.0` to opt a single dataset out of validation while the rest of the mixture keeps a split.

The override is resolved in `make_dataset` exactly like the existing `tolerance_s` / `skip_timestamp_check` inherit-on-None pattern (`effective = cfg.x if cfg.x is not None else mixture.x`). Per-dataset values are range-validated (`[0, 1]`) in `DatasetConfig.__post_init__`, and the now-obsolete deprecation warning in `DatasetMixtureConfig.__post_init__` is removed.

Also hardens `fit_fast_tokenizer._build_train_cfg`: it already forces the mixture-level ratio to `0.0` to guarantee a single (train-only) dataset; it now zeroes per-dataset overrides too (rebuilding the dataset configs via `dataclasses.replace`, leaving the caller's config untouched), so the "no validation split" invariant holds for any input config.

Backward-compatible: configs that set only the mixture-level ratio behave identically.
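As a rough illustration of the inherit-on-None resolution and range check described above, consider the sketch below. The class and function names (`DatasetCfg`, `MixtureCfg`, `effective_val_split_ratio`) are hypothetical simplifications, not the actual `DatasetConfig`, `DatasetMixtureConfig`, or `make_dataset` code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetCfg:  # hypothetical stand-in for DatasetConfig
    name: str
    val_split_ratio: Optional[float] = None  # None means "inherit from the mixture"

    def __post_init__(self) -> None:
        # Range-check explicit overrides only; None is always valid.
        if self.val_split_ratio is not None and not 0.0 <= self.val_split_ratio <= 1.0:
            raise ValueError(f"val_split_ratio must be in [0, 1], got {self.val_split_ratio}")


@dataclass
class MixtureCfg:  # hypothetical stand-in for DatasetMixtureConfig
    datasets: list = field(default_factory=list)
    val_split_ratio: float = 0.05


def effective_val_split_ratio(cfg: DatasetCfg, mixture: MixtureCfg) -> float:
    # Test against None, not truthiness, so an explicit 0.0 override still wins.
    return cfg.val_split_ratio if cfg.val_split_ratio is not None else mixture.val_split_ratio


mixture = MixtureCfg(
    datasets=[
        DatasetCfg("a"),                       # inherits the mixture default
        DatasetCfg("b", val_split_ratio=0.2),  # explicit override
        DatasetCfg("c", val_split_ratio=0.0),  # opts out of validation
    ],
)
resolved = {d.name: effective_val_split_ratio(d, mixture) for d in mixture.datasets}
```

The `is not None` check is the important design detail: a truthiness test such as `cfg.val_split_ratio or mixture.val_split_ratio` would silently replace `0.0` with the mixture default and break the opt-out.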
Label: 🗃️ Feature
How it was tested
- `pre-commit run --all-files` clean (ruff lint+format, pyupgrade, etc.) on the changed files.
- `pytest -m "not gpu" tests/configs/test_default.py tests/datasets/test_datasets.py` (93 passed) and `tests/scripts/test_fit_fast_tokenizer.py` (9 passed).
- `tests/datasets/test_datasets.py`: `test_make_dataset_per_dataset_val_split_ratio_wins` (per-dataset `0.2` beats mixture `0.05`) and `test_make_dataset_inherits_mixture_val_split_ratio` (`None` inherits mixture `0.1`).
- `tests/configs/test_default.py`: flipped `test_val_split_ratio_warns_when_child_overrides` → `test_val_split_ratio_no_warning_when_child_overrides` (a per-dataset override is supported now, so it must not warn); added `test_dataset_config_val_split_ratio_out_of_range_raises`.

Not a policy change (no model / training-loop / optimizer / sampler files touched), so GPU and regression suites were not run.
How to checkout & try? (for the reviewer)
    pytest -sx tests/datasets/test_datasets.py::test_make_dataset_per_dataset_val_split_ratio_wins \
        tests/datasets/test_datasets.py::test_make_dataset_inherits_mixture_val_split_ratio
    pytest -sx tests/configs/test_default.py::test_val_split_ratio_no_warning_when_child_overrides \
        tests/configs/test_default.py::test_dataset_config_val_split_ratio_out_of_range_raises

Checklist
Note: Before submitting this PR, please read the contributor guideline.