Skip to content

Fix data leakage in _LifelinesAdapter when y has multiple columns#971

Open
aditi-malviya666 wants to merge 1 commit intosktime:mainfrom
aditi-malviya666:fix-lifelines-leakage
Open

Fix data leakage in _LifelinesAdapter when y has multiple columns#971
aditi-malviya666 wants to merge 1 commit intosktime:mainfrom
aditi-malviya666:fix-lifelines-leakage

Conversation

@aditi-malviya666
Copy link
Copy Markdown

Reference Issues/PRs

Fixes #947

What does this implement/fix? Explain your changes.

This PR fixes a data leakage issue in _LifelinesAdapter.

Previously, when y contained multiple columns, all columns were concatenated into the training DataFrame. Since lifelines treats any column not specified as duration_col or event_col as covariates, this caused unintended leakage of target information into the feature set.

The fix ensures that only:

  • feature columns from X
  • the first column of y (used as duration_col)
  • the event column derived from C (if provided)

are included in the DataFrame passed to the lifelines estimator.

Additionally, a warning is raised when y contains multiple columns to inform the user that only the first column is used.

Does your contribution introduce a new dependency? If yes, which one?

No new dependencies are introduced.

What should a reviewer concentrate their feedback on?

  • Correctness of DataFrame construction in _fit
  • Whether excluding extra columns from y is consistent with expected API behavior
  • Handling of censoring column (C) and compatibility with lifelines

Did you add any tests for the change?

No new tests have been added yet, but the fix has been manually validated using a reproducible example that previously demonstrated leakage.

Any other comments?

This fix prevents unintended target leakage and aligns the adapter behavior with lifelines expectations.

PR checklist

For all contributions
  • I've added myself to the list of contributors
  • The PR title starts with [BUG]
For new estimators
  • Not applicable

@fkiraly fkiraly added bug module:survival&time-to-event module for time-to-event prediction aka survival prediction labels Mar 26, 2026
Copy link
Copy Markdown
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment on #947, can you answer there why this fix is needed?

As far as I understand, only single-collumn y will reach the code that you are changing.

@fkiraly
Copy link
Copy Markdown
Collaborator

fkiraly commented Mar 26, 2026

likely an AI hallucinated fix for an AI hallucinated bug, see #947

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI overuse suspected bug module:survival&time-to-event module for time-to-event prediction aka survival prediction

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Data leakage in _LifelinesAdapter when y has multiple columns

2 participants