@EBjerrum
Owner

@EBjerrum EBjerrum commented Mar 8, 2025

Pull request with applicability domain estimators.

The estimators have been adapted from another open-source project, https://github.com/OlivierBeq/MLChemAD, after discussion with Olivier. He was cool about it, although he didn't have time to contribute to Scikit-Mol right now.

It's a lot of new files in the applicability directory with various approaches and algorithms. I've tried to make them as sklearn-like as possible and standardized the API somewhat compared to the original methods.

The new notebook illustrates their use with two examples, but I don't have practical experience with all of the methods. Some of them don't seem to work that well, but that could depend on the combination of ML method, dataset and features.

@EBjerrum EBjerrum requested a review from asiomchen March 8, 2025 11:20
@asiomchen
Collaborator

Hey there, here are my initial thoughts on the code (I just walked through it quickly, without looking at the original repo or the statistical background).

So each AD method returns values in a different range, right? Maybe it would be good to map them to the range 0.0-1.0 (higher is better) where possible.

Speaking of the classes themselves, I think it would be nice to have a class method that creates an AD model from the corresponding prediction model, so it would be easy to check, e.g., whether the input mols are in the AD before the actual prediction. (We might have this option in the upcoming server option; I'm working on that PR and will share it soon.)

Below is a snippet showing how it might look:

# original model
pipe = Pipeline([
    ('fp', MorganFingerprintTransformer(fpSize=2048, radius=2, useCounts=True)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])
# The AD model is created from the original model; the default could be
# PCA keeping 0.9 of the variance
LeverageApplicabilityDomain.from_pipeline(pipe, percentile=X, scale="PCA", scale_kwargs={"n_components": 0.9})
# which would return something like:
Pipeline(steps=[('fp', MorganFingerprintTransformer(useCounts=True)), ('pca', PCA(n_components=0.9)), ('scaler', StandardScaler()), ('leverage', LeverageApplicabilityDomain(percentile=X))])

@EBjerrum
Owner Author

Thanks for your input. Compared to the original repo, I have tried to standardize the output of the AD estimators. Each AD estimator should offer the raw scores from the original method, group labels of -1 or 1 similar to some scikit-learn tools, and an option for a softer transition score from 0 to 1. As far as I remember, .transform() outputs the raw scores, .predict() outputs the groups, and the soft score comes from a special method. I guess it can be confusing; I'll check if I've been explicit enough about this in the documentation.
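To make the three output types concrete, here is a minimal sketch of the convention, not one of the estimators in this PR. The class CentroidDistanceAD, its distance-to-centroid score, and the method name score_transform are illustrative assumptions; only .transform() for raw scores and .predict() for the -1/1 groups follow the description above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted


class CentroidDistanceAD(BaseEstimator, TransformerMixin):
    """Toy applicability domain: Euclidean distance to the training centroid.

    Hypothetical illustration only; not an estimator from this PR.
    """

    def __init__(self, percentile=95.0):
        self.percentile = percentile

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.centroid_ = X.mean(axis=0)
        train_scores = np.linalg.norm(X - self.centroid_, axis=1)
        # Threshold: the chosen percentile of the training distances.
        self.threshold_ = np.percentile(train_scores, self.percentile)
        # Spread of the training scores, used to scale the soft transition.
        self.scale_ = train_scores.std() or 1.0
        return self

    def transform(self, X):
        """Raw scores (lower = closer to the training data), shape (n_samples, 1)."""
        check_is_fitted(self)
        X = np.asarray(X, dtype=float)
        return np.linalg.norm(X - self.centroid_, axis=1).reshape(-1, 1)

    def predict(self, X):
        """Group labels: 1 inside the domain, -1 outside."""
        return np.where(self.transform(X).ravel() <= self.threshold_, 1, -1)

    def score_transform(self, X):
        """Soft score between 0 and 1; 0.5 at the threshold, higher = more inside."""
        z = (self.transform(X).ravel() - self.threshold_) / self.scale_
        # Clip to keep the exponential well-behaved for very distant samples.
        return 1.0 / (1.0 + np.exp(np.clip(z, -50.0, 50.0)))


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
ad = CentroidDistanceAD(percentile=95).fit(X_train)

X_new = np.array([[0.0, 0.0], [100.0, 100.0]])
raw = ad.transform(X_new)        # raw distances
labels = ad.predict(X_new)       # 1 / -1 groups
soft = ad.score_transform(X_new)  # soft scores, higher is better
```

The soft score here is just a logistic function centred on the fitted threshold, which is one possible way to get a "higher is better" value in a common range across methods.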

I'm not sure I fully understand your suggestion to build an AD model from a prediction model. You may or may not want the same featurization, and the model or pipeline will not contain any information about which molecules were used to fit it. We also have the complication that you may want to insert some steps, like PCA, between the standard featurization and the AD estimation.
But as a convenience function, we could do something that simply uses the pipeline minus the predictor, as I guess that's a fairly common approach. More complex workflows would then need some manual work with scikit-learn.
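A rough sketch of such a convenience function, using only plain scikit-learn, could look like this. The helper name ad_pipeline_from and its extra_steps argument are hypothetical, LocalOutlierFactor merely stands in for an AD estimator because it also predicts 1 / -1, and the numeric toy data replaces a real fingerprint step.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def ad_pipeline_from(pipe, ad_estimator, extra_steps=()):
    """Build an unfitted AD pipeline from `pipe` minus its final predictor.

    Hypothetical helper: reuses the featurization steps, optionally inserts
    extra steps (e.g. a PCA), and appends the AD estimator as the last step.
    """
    # Unfitted copies of every step except the final predictor.
    feature_steps = clone(pipe[:-1]).steps
    return Pipeline(feature_steps + list(extra_steps) + [("ad", ad_estimator)])


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(int)

# Stand-in prediction pipeline; a real one would start with a fingerprint transformer.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# The AD pipeline still has to be fitted on the training data, because the
# prediction pipeline does not store the samples it was fitted on.
ad_pipe = ad_pipeline_from(pipe, LocalOutlierFactor(novelty=True))
ad_pipe.fit(X_train)

X_new = np.array([[0.0, 0.0, 0.0], [50.0, 50.0, 50.0]])
in_domain = ad_pipe.predict(X_new)  # 1 = inside the domain, -1 = outside
```

Passing something like extra_steps=[("pca", PCA(n_components=0.9))] would cover the PCA case; anything more involved would still be assembled by hand.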

@asiomchen
Collaborator

Oh, right, I forgot about the need for the original data to fit the estimator.

I just assumed that the pattern is always the same. I think that even if the logic of the AD estimator construction changes from estimator to estimator, it is worth adding this class method to provide a good default for each estimator, so the user doesn't need to deeply understand exactly what has to be done.

I will review the code and logic for the estimators later. You were right, the PR is huge; sorry that it's taking me ages to go through.

@EBjerrum EBjerrum merged commit 281ba47 into main Apr 6, 2025
11 checks passed
@EBjerrum EBjerrum deleted the AD_estimators_cleanup branch April 6, 2025 08:24