Option to fix preprocessing seed in finetuning #771
Conversation
Pull request overview
Adds support for keeping preprocessing randomness fixed during fine-tuning so that stochastic preprocessing choices (e.g., column permutations) remain consistent across batches/epochs, and updates the codebase to pass/derive preprocessing random_state explicitly.
Changes:
- Introduces `use_fixed_preprocessing_seed` on `FinetunedTabPFNClassifier`/`FinetunedTabPFNRegressor` and wires it into the fine-tuning data pipeline.
- Refactors preprocessing randomness plumbing (`rng` → `random_state`, plus separate `data_shuffle_seed` vs `preprocessing_random_state`) across classifier/regressor + preprocessing utilities.
- Updates finetuning/inference tests and refreshes reference predictions to reflect the new deterministic behavior.
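For context, a minimal usage sketch of the new option. Only `use_fixed_preprocessing_seed`, `FinetunedTabPFNClassifier`, and `FinetunedTabPFNRegressor` come from this PR; the import path, constructor signature, and scikit-learn-style `fit`/`score` calls are assumptions for illustration, not the definitive API.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Import path assumed; the option itself is introduced by this PR.
from tabpfn.finetuning import FinetunedTabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep stochastic preprocessing choices (e.g. column permutations)
# identical across batches/epochs during fine-tuning.
clf = FinetunedTabPFNClassifier(use_fixed_preprocessing_seed=True)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```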
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 6 comments.
Summary per file:
| File | Description |
|---|---|
| tests/test_inference.py | Updates TabPFNEnsemblePreprocessor construction to use random_state=. |
| tests/test_finetuning_regressor.py | Updates finetuning dataset chunk helper call to new seed/random_state parameters. |
| tests/test_finetuning_classifier.py | Refactors helper usage for dataset chunk creation; improves patching approach; adds test covering fixed preprocessing seed behavior. |
| tests/reference_predictions/darwin_arm64/regressor_tiny_dataset_v2_fit_preprocessors.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/regressor_tiny_dataset_v2.5_low_memory.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/regressor_tiny_dataset_v2.5_fit_with_cache.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/regressor_tiny_dataset_v2.5_fit_preprocessors.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/regressor_tiny_dataset_several_devices.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/classifier_tiny_dataset_v2_fit_preprocessors.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/classifier_tiny_dataset_v2.5_low_memory.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/classifier_tiny_dataset_v2.5_fit_with_cache.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/classifier_tiny_dataset_v2.5_fit_preprocessors.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/classifier_tiny_dataset_differentiable_input.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/classifier_tiny_dataset_5_estimators.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/classifier_iris_dataset_several_devices.json | Updates reference predictions after preprocessing randomness changes. |
| tests/reference_predictions/darwin_arm64/classifier_iris_dataset.json | Updates reference predictions after preprocessing randomness changes. |
| src/tabpfn/regressor.py | Switches preprocessing RNG usage to explicit random_state derived via infer_random_state. |
| src/tabpfn/preprocessing/initialization.py | Adds a new helper module for feature tagging + dtype sanitization + ordinal encoding setup. |
| src/tabpfn/preprocessing/ensemble.py | Renames rng to random_state and standardizes seed derivation via infer_random_state. |
| src/tabpfn/preprocessing/__init__.py | Minor import formatting cleanup. |
| src/tabpfn/finetuning/finetuned_regressor.py | Exposes use_fixed_preprocessing_seed in the regressor fine-tuning wrapper API/docs. |
| src/tabpfn/finetuning/finetuned_classifier.py | Exposes use_fixed_preprocessing_seed in the classifier fine-tuning wrapper API/docs; minor type signature tweak. |
| src/tabpfn/finetuning/finetuned_base.py | Implements fixed-vs-varying preprocessing random state selection in the fine-tuning loop (see the sketch after this table). |
| src/tabpfn/finetuning/data_util.py | Splits data shuffling seed from preprocessing random state in dataset chunk creation. |
| src/tabpfn/classifier.py | Switches preprocessing RNG usage to explicit random_state derived via infer_random_state. |
| src/tabpfn/base.py | Changes model initialization helper to return only byte_size (no RNG), aligning RNG handling elsewhere. |
| examples/finetune_regressor.py | Updates example docstring wording (VRAM statement). |
| examples/finetune_classifier.py | Updates example docstring and estimator counts/random_state usage. |
| changelog/771.added.md | Adds changelog entry for use_fixed_preprocessing_seed. |
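The split between `data_shuffle_seed` and `preprocessing_random_state` in `finetuned_base.py`/`data_util.py` can be pictured roughly as follows. This is an illustrative sketch only, not the actual implementation: the function name and the seed-derivation scheme are made up.

```python
import numpy as np


def derive_seeds(
    base_seed: int, epoch: int, use_fixed_preprocessing_seed: bool
) -> tuple[int, int]:
    """Hypothetical seed derivation for one fine-tuning epoch."""
    rng = np.random.default_rng(base_seed + epoch)
    # Data shuffling stays stochastic: a fresh seed every epoch.
    data_shuffle_seed = int(rng.integers(0, 2**31 - 1))
    if use_fixed_preprocessing_seed:
        # Stochastic preprocessing (e.g. column permutations) is held
        # identical across batches/epochs.
        preprocessing_random_state = base_seed
    else:
        # Preprocessing randomness also varies per epoch.
        preprocessing_random_state = int(rng.integers(0, 2**31 - 1))
    return data_shuffle_seed, preprocessing_random_state
```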
psinger-prior left a comment
LGTM! Please check / address the two comments if needed.
Fixing the seed will, e.g., keep column permutations the same across batches. This is expected to improve results in finetuning.
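To see why a fixed seed keeps such permutations stable, here is a tiny standalone numpy check (unrelated to the TabPFN code itself):

```python
import numpy as np

# Re-seeding with the same value yields the same column permutation,
# so a fixed preprocessing seed sees identical permutations in every batch.
perm_batch_1 = np.random.default_rng(42).permutation(8)
perm_batch_2 = np.random.default_rng(42).permutation(8)
assert (perm_batch_1 == perm_batch_2).all()
```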