Add patient-level leakage checks for dataset splits by tippered1-debug · Pull Request #1159 · sunlabuiuc/PyHealth

tippered1-debug · 2026-06-10T23:09:35Z

Summary

This PR adds helper utilities to audit patient-level leakage across dataset splits.

Healthcare datasets often contain multiple samples per patient. If the same patient appears in both train and validation/test splits, evaluation metrics may be inflated because the model is partially evaluated on patients already seen during training.

The new helpers inspect the actual samples in each split and verify that patient IDs are disjoint.
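To make the failure mode concrete, here is a minimal sketch using toy data only. The sample layout (dicts with a `patient_id` key) and the helper names `sample_level_split`, `patient_level_split`, and `shared_patients` are hypothetical and are not part of PyHealth. It contrasts shuffling individual samples with shuffling patients.

```python
import random

# Toy data: 6 patients with 3 samples each (hypothetical, not a PyHealth dataset).
samples = [{"patient_id": f"p{p}", "visit": v} for p in range(6) for v in range(3)]

def sample_level_split(samples, train_frac, seed):
    """Shuffle individual samples, ignoring which patient they belong to."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def patient_level_split(samples, train_frac, seed):
    """Shuffle patients, then assign all of a patient's samples to one side."""
    rng = random.Random(seed)
    patients = sorted({s["patient_id"] for s in samples})
    rng.shuffle(patients)
    cut = int(len(patients) * train_frac)
    train_patients = set(patients[:cut])
    train = [s for s in samples if s["patient_id"] in train_patients]
    test = [s for s in samples if s["patient_id"] not in train_patients]
    return train, test

def shared_patients(a, b):
    """Return the patient IDs that appear in both splits."""
    return {s["patient_id"] for s in a} & {s["patient_id"] for s in b}

leaky = shared_patients(*sample_level_split(samples, 0.75, seed=0))
clean = shared_patients(*patient_level_split(samples, 0.75, seed=0))
print("sample-level overlap:", sorted(leaky))
print("patient-level overlap:", sorted(clean))
```

With 18 samples and a 0.75 train fraction, the sample-level split puts 13 samples in train. Since every patient has exactly 3 samples, at least one patient must straddle both sides, so the overlap is never empty. The patient-level split is disjoint by construction.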

Changes

Add get_patient_ids(...) to extract patient IDs from datasets, subsets, sample collections, or ID collections.
Add check_patient_disjoint(...) to return a report with split counts and patient overlaps.
Add assert_patient_disjoint(...) to raise a clear error when patient overlap is detected.
Export the helpers from pyhealth.datasets.
Add focused tests for patient-level splits, sample-level leakage detection, conformal splits, and missing patient_id errors.
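The PR description does not show the helpers' signatures or return types, so the following is only a rough sketch of how checks with these three roles could be written. The names `patient_ids_of`, `disjoint_report`, and `require_disjoint`, the keyword-argument interface, and the report layout are all assumptions made for illustration. They are not the code added in this PR.

```python
from itertools import combinations

def patient_ids_of(split):
    """Collect patient IDs from an iterable of sample dicts or of raw IDs."""
    ids = set()
    for item in split:
        if isinstance(item, dict):
            if "patient_id" not in item:
                raise KeyError("sample is missing 'patient_id'")
            ids.add(item["patient_id"])
        else:
            # Assume anything that is not a dict is already a patient ID.
            ids.add(item)
    return ids

def disjoint_report(**splits):
    """Return per-split patient counts and pairwise patient overlaps."""
    ids = {name: patient_ids_of(split) for name, split in splits.items()}
    overlaps = {}
    for a, b in combinations(ids, 2):
        shared = ids[a] & ids[b]
        if shared:
            overlaps[(a, b)] = sorted(shared)
    return {
        "counts": {name: len(v) for name, v in ids.items()},
        "overlaps": overlaps,
        "is_disjoint": not overlaps,
    }

def require_disjoint(**splits):
    """Raise ValueError if any two splits share a patient."""
    report = disjoint_report(**splits)
    if not report["is_disjoint"]:
        raise ValueError(f"patient overlap between splits: {report['overlaps']}")
    return report

# Hypothetical usage: patient "p2" leaks from train into test.
train = [{"patient_id": "p1"}, {"patient_id": "p2"}]
test = [{"patient_id": "p2"}, {"patient_id": "p3"}]
report = disjoint_report(train=train, test=test)
print(report["overlaps"])
```

Separating a report-returning function from an asserting one lets a caller either log the overlap for inspection or fail fast in a training script.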

Tests

python -m unittest tests.core.test_patient_disjoint -v

Add patient-level leakage checks for dataset splits

a57a07d

tippered1-debug wants to merge 1 commit into sunlabuiuc:master from tippered1-debug:reliability/patient-disjoint-checks
