AD estimators #65
Conversation
…orts distance-based methods, and only bit fingerprints for Jaccard/Tanimoto distances.
…parallel helper functions.
Hey there, here are my initial thoughts on the code (I just walked through it quickly, without looking at the original repo or the statistical background). So each AD method returns its value in a different range, right? Maybe it would be good to have it in the range 0.0-1.0 (higher is better) where possible. Speaking of the classes themselves, I think it would be nice to have a class method that creates the AD model from the corresponding model, so it is easy to check, e.g., whether the input mols are in the AD before the actual prediction (we might have this option in the upcoming server feature - working on that PR, will share it soon). Below is a snippet showing how it might look:
# original model
pipe = Pipeline([
    ('fp', MorganFingerprintTransformer(fpSize=2048, radius=2, useCounts=True)),
    ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
])

# AD model is created from the original model; the default could be to use PCA
# and 0.9 of the variance
LeverageApplicabilityDomain.from_pipeline(pipe, percentile=X, scale="PCA", scale_kwargs={"n_components": 0.9})
# ...which would build roughly this pipeline:
Pipeline(steps=[('fp', MorganFingerprintTransformer(useCounts=True)),
                ('pca', PCA(n_components=0.9)),
                ('scaler', StandardScaler()),
                ('leverage', LeverageApplicabilityDomain(percentile=X))])
Thanks for your input. Compared to the original repo, I have tried to standardize the output of the AD estimators. The AD estimators should offer both the raw scores from the original method and group scores of -1 or 1, similar to some scikit-learn tools, as well as an option for a softer transition score from 0 to 1. As far as I remember, .transform() outputs the raw scores, .predict() the groups, and the soft score is a separate method. I guess it can be confusing; I'll check whether I've been explicit enough about this in the documentation. I'm not sure I fully understand your suggestion to build an AD model from a prediction model. You may or may not want the same featurization, and the model or pipeline will not contain any information about which molecules were used to fit it. We also have the complication that you may want to insert some steps between the standard featurization and the AD estimation, like PCA.
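To make the three kinds of output concrete, here is a minimal sketch (the applicability module path, the default constructor arguments and the name of the soft-score method are assumptions, not necessarily the actual API in this PR):

from rdkit import Chem
from scikit_mol.fingerprints import MorganFingerprintTransformer
from scikit_mol.applicability import LeverageApplicabilityDomain  # module path assumed

# Toy molecules; in practice, fit on the same training set as the prediction model
train_mols = [Chem.MolFromSmiles(s) for s in ["CCO", "c1ccccc1", "CCN(CC)CC", "CC(=O)O", "c1ccncc1"]]
query_mols = [Chem.MolFromSmiles(s) for s in ["CCOC", "OC1CCCCC1"]]

# Featurize training and query molecules with the same fingerprint settings
fp = MorganFingerprintTransformer(fpSize=2048, radius=2, useCounts=True)
X_train = fp.transform(train_mols)
X_query = fp.transform(query_mols)

ad = LeverageApplicabilityDomain()   # default parameters assumed
ad.fit(X_train)

raw_scores = ad.transform(X_query)   # raw scores from the original method
groups = ad.predict(X_query)         # -1 (outside AD) or 1 (inside AD), like sklearn outlier detectors
# soft = ad.score_transform(X_query) # hypothetical name for the soft 0-1 score method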
Oh, right, I forgot about the need for the original data to fit the estimator. I just assumed that the pattern is always the same. I think that even if the logic of the AD estimator construction changes from estimator to estimator, it is worth adding this class method to provide a good default, so the user does not need to deeply understand exactly what has to be done. I will review the code and logic for the estimators later - you were right, the PR is huge, sorry that it's taking me ages to go through.
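For concreteness, a rough sketch of what such a convenience helper could look like (the function name, signature and defaults are illustrative assumptions, with the parameter names borrowed from the snippet above, not code from this PR):

from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scikit_mol.applicability import LeverageApplicabilityDomain  # module path assumed

def ad_pipeline_from_model(model_pipeline, percentile=None, n_components=0.9):
    """Hypothetical helper: reuse the featurizer of a prediction pipeline and build
    an AD pipeline with a PCA + scaling default on top of it. The training molecules
    still have to be passed to .fit(), since a fitted pipeline does not store them."""
    featurizer = clone(model_pipeline.steps[0][1])  # assumes the first step is the featurizer
    return Pipeline([
        ("fp", featurizer),
        ("pca", PCA(n_components=n_components)),
        ("scaler", StandardScaler()),
        ("ad", LeverageApplicabilityDomain(percentile=percentile)),
    ])

# ad_pipe = ad_pipeline_from_model(pipe, percentile=95)
# ad_pipe.fit(train_mols)                  # the same molecules used to fit `pipe`
# in_domain = ad_pipe.predict(query_mols)  # check the AD before calling pipe.predict()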
Pull request with applicability domain estimators.
The estimators have been adapted from another open-source project, https://github.com/OlivierBeq/MLChemAD, after discussion with Olivier. He was cool about it, although he doesn't have time to contribute to Scikit-Mol right now.
There are a lot of new files in the applicability directory covering various approaches and algorithms. I've tried to make them as sklearn-like as possible and have standardized the API somewhat compared to the original methods.
The new notebook illustrates their use with two examples, but I don't have practical experience with all of the methods. Some of them don't seem to work that well, but that could depend on the combination of ML method, dataset and features.
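A quick way to probe that on a given dataset is to fit the different estimators on the same training features and compare how they flag a held-out set; a minimal sketch, assuming the module path and that the estimators share the fit/predict API described above:

from scikit_mol.fingerprints import MorganFingerprintTransformer
from scikit_mol.applicability import LeverageApplicabilityDomain  # module path assumed

fp = MorganFingerprintTransformer(fpSize=2048, radius=2)
X_train = fp.transform(train_mols)  # train_mols / test_mols: your RDKit Mol lists
X_test = fp.transform(test_mols)

estimators = [
    ("leverage", LeverageApplicabilityDomain()),
    # add the other estimators from the applicability directory here to compare them
]
for name, ad in estimators:
    ad.fit(X_train)
    inside = (ad.predict(X_test) == 1).mean()
    print(f"{name}: {inside:.1%} of test molecules flagged as inside the AD")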