AD estimators #65

EBjerrum · 2025-03-08T11:20:24Z

Pull request with applicability domain estimators.

The estimators have been adapted from another open-source project, https://github.com/OlivierBeq/MLChemAD, after discussion with Olivier. He was cool about it, although he didn't have time to contribute to Scikit-Mol right now.

It's a lot of new files in the applicability directory with various approaches and algorithms. I've tried to make them as sklearn-like as possible and standardized the API somewhat compared to the original methods.

The new notebook illustrates their use with two examples, but I don't have practical experience with all of the methods. Some of them don't seem to work that well, but that could depend on the combination of ML method, dataset and features.


asiomchen · 2025-03-30T21:05:40Z

Hey there, here are my initial thoughts on the code (I just walked through it quickly, without looking at the original repo or the statistical background).

So each AD method returns values in a different range, right? It might be good to map them to the range 0.0-1.0 (higher is better) where possible.

As for the classes themselves, I think it would be nice to have a class method that creates an AD model from the corresponding model, so it would be easy to check, e.g., whether the input mols are in the AD before the actual prediction. (We might have this option in the upcoming server option; I'm working on that PR and will share it soon.)

Below is a snippet showing what it might look like:

# original model
pipe = Pipeline([
    ('fp', MorganFingerprintTransformer(fpSize=2048, radius=2, useCounts=True)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])
# AD model is created from the original model, default can be to use PCA
# and 0.9 of variance
LeverageApplicabilityDomain.from_pipeline(pipe, percentile=X, scale="PCA", scale_kwargs={"n_components": 0.9})
# which would return something like:
# Pipeline(steps=[('fp', MorganFingerprintTransformer(useCounts=True)), ('pca', PCA(n_components=0.9)), ('scaler', StandardScaler()), ('leverage', LeverageApplicabilityDomain(percentile=X))])

EBjerrum · 2025-03-31T09:55:06Z

Thanks for your input. Compared to the original repo, I have tried to standardize the output of the AD estimators. Each AD estimator should be able to return the raw scores from the original method, group labels of -1 or 1 similar to some scikit-learn tools, and optionally a softer transition score from 0 to 1. As far as I remember, .transform() outputs the raw scores, .predict() outputs the groups, and the soft score comes from a special method. I guess it can be confusing, so I'll check whether I've been explicit enough about this in the documentation.
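To make the three output types concrete, here is a minimal sketch with a toy estimator. `ToyCentroidAD` and its `soft_score` method are hypothetical names made up for illustration and are not the scikit-mol classes; the raw score (distance to the training centroid) and the logistic soft score are likewise assumptions chosen for brevity.

```python
import numpy as np


class ToyCentroidAD:
    """Hypothetical toy AD estimator; not part of scikit-mol."""

    def __init__(self, percentile=95.0):
        self.percentile = percentile

    def _raw(self, X):
        # Raw score: Euclidean distance to the training centroid
        # (lower means closer to the training data).
        return np.linalg.norm(np.asarray(X, dtype=float) - self.center_, axis=1)

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.center_ = X.mean(axis=0)
        # The threshold is a percentile of the raw scores of the training set.
        self.threshold_ = np.percentile(self._raw(X), self.percentile)
        return self

    def transform(self, X):
        # Raw scores in the method's own units, one column per sample row.
        return self._raw(X).reshape(-1, 1)

    def predict(self, X):
        # Group labels: 1 = inside the domain, -1 = outside.
        return np.where(self._raw(X) <= self.threshold_, 1, -1)

    def soft_score(self, X):
        # Soft score in (0, 1): 0.5 at the threshold, higher inside the domain.
        return 1.0 / (1.0 + np.exp(self._raw(X) - self.threshold_))


rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
ad = ToyCentroidAD(percentile=95).fit(X_train)

# One point at the centre of the training data, one far away from it.
X_new = np.array([[0.0, 0.0, 0.0, 0.0], [10.0, 10.0, 10.0, 10.0]])
raw = ad.transform(X_new)
labels = ad.predict(X_new)
soft = ad.soft_score(X_new)
```

The point of the split is that `transform` stays faithful to the original method, while `predict` and the soft score share one threshold, so the three outputs cannot disagree about which side of the boundary a sample is on.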

I'm not sure I fully understand your suggestion to build an AD model from a prediction model. You may or may not want the same featurization, and the model or pipeline does not contain any information about which molecules were used to fit it. We also have the complication that you may want to insert some steps after the standard featurization and before the AD estimation, like PCA.
But as a convenience function, we could do something that simply uses the pipeline minus the predictor, as I guess that's a fairly common approach. More complex workflows would then need some manual work with scikit-learn.
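A rough sketch of such a convenience function, assuming plain scikit-learn pipelines. `ad_pipeline_from` is a hypothetical helper, and `LocalOutlierFactor(novelty=True)` only stands in for an AD estimator because it follows the same -1/1 `predict` convention; random numeric data replaces the molecule featurization.

```python
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def ad_pipeline_from(pipe, ad_estimator, extra_steps=()):
    """Hypothetical helper: reuse every step of `pipe` except the final predictor."""
    # Slicing a Pipeline returns a new Pipeline; clone() gives unfitted copies,
    # so fitting the AD pipeline does not touch the original model.
    featurizer = clone(pipe[:-1])
    return Pipeline(featurizer.steps + list(extra_steps) + [("ad", ad_estimator)])


rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = (X_train[:, 0] > 0).astype(int)

# Original model: featurization step(s) followed by the predictor.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
pipe.fit(X_train, y_train)

# AD pipeline: same featurization, optional extra steps, then the AD estimator.
ad_pipe = ad_pipeline_from(
    pipe,
    LocalOutlierFactor(novelty=True),  # stand-in for an AD estimator
    extra_steps=[("pca", PCA(n_components=0.9))],
)
# The training data still has to be supplied; the fitted model does not store it.
ad_pipe.fit(X_train)

X_new = np.vstack([np.zeros(5), np.full(5, 25.0)])
in_domain = ad_pipe.predict(X_new)  # 1 = inside, -1 = outside
```

The helper only assembles the pipeline. Fitting remains a separate call, because the training molecules cannot be recovered from the fitted predictor.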

asiomchen · 2025-03-31T21:51:58Z

Oh, right, I forgot that the original data is needed to fit the estimator.

I just assumed that the pattern is always the same. I think that even if the logic of the AD estimator construction changes from estimator to estimator, it is worth adding this class method to provide a good default for each estimator, so the user doesn't need to deeply understand exactly what has to be done.

I will review the code and logic for the estimators later. You were right, the PR is huge; sorry that it's taking me ages to go through.

docs/notebooks/12_applicability_domain.ipynb

scikit_mol/applicability/README.md

scikit_mol/applicability/__init__.py

scikit_mol/applicability/base.py

scikit_mol/applicability/kernel_density.py

scikit_mol/applicability/knn.py

tests/applicability/test_base.py


Esben Jannik Bjerrum added 23 commits November 22, 2024 11:11

Added draft support for multi-output prediction and AD estimation

96daff0

Merge branch 'main' into AD_calculation.py

9c2678f

Fixed type on Readme

83cde65

fixed conflict

43088a6

Added AD domain estimators from MLChemAD, need some work on consistent API

7c48055

Updated README with a reference

2f3abfd

Developed a base_class for making AD estimators consistent

198cfba

Made first transition for the kNN implementation. Currently only supports distance based methods, and only bit fingerprints for jaccard/tanimoto distances.

d5d84ad

Added leverage to our tests.

51efb25

Added more AD estimators as children of base_class

f75bc80

Moved rest of AD estimators. All test runs.

9cafe9f

WIP on the adapter. Not fully there yet as some complications in the parallel helper functions.

c2a9dc4

Merge branch 'main' into AD_calculation.py

f41929e

work in progress on adapters

f2ab158

It seems to be getting there with the EstimatorUnion. Got feature_names to work.

4273936

predicttotransformwrapper seems to be working now.

159432f

Also got the fit_transform to work. Seemingly getting there.

9279008

Experimental adapters. WIP

d19322f

Cleaning up.

8dfa312

Further fixes in tests and some estimators

756e6c9

Setting numpy random seed automatically for consistent testing behaviour.

64bc117

Merge branch 'main' into AD_estimators_cleanup

c6669b3

Adding a link for the new notebook

e8a4c7b

EBjerrum requested a review from asiomchen March 8, 2025 11:20