Hello,
I tried to bump the version of RDKit in this project and ran into reproducibility issues for the classification outcomes. This is due to some well-known changes in RDKit's generation of Morgan fingerprints for molecular graphs with added hydrogens (see rdkit-discuss). Hydrogens are added in NPClassifier's calculate_fingerprint function.
The classification models were trained on fingerprints generated this way, so the last version of RDKit you can use is rdkit-pypi==2021.9.4. Any later version may give irreproducible classifications!
My test code:
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors
smiles = "C[C@]1(C=C[C@H]2C[C@H](CC[C@@H]2[C@H]1C(=O)CCO)CO)O"
mol1 = Chem.MolFromSmiles(smiles)
mol_fp1 = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol1, radius=1, bitInfo={}, nBits=2048)
# always passes
assert list(mol_fp1.GetOnBits()) == [29, 80, 142, 222, 473, 494, 622, 650, 787, 807, 848, 926, 1019, 1057, 1060,
1083, 1154, 1274, 1292, 1325, 1516, 1564, 1764, 1873, 1917]
mol2 = Chem.MolFromSmiles(smiles)
mol2 = Chem.AddHs(mol2)
mol_fp2 = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol2, radius=1, bitInfo={}, nBits=2048)
# passes with rdkit-pypi==2021.9.4
# fails with rdkit-pypi==2021.9.5.1 and beyond
assert list(mol_fp2.GetOnBits()) == [2, 88, 107, 114, 449, 650, 664, 695, 788, 807, 836, 866, 906, 955, 1060, 1233,
1380, 1455, 1477, 1652, 1673, 1804, 1871, 1886, 1917, 2003]

NB: I use pip-based dependency management, which is why I refer to the RDKit artifacts on PyPI. If you use Conda to install RDKit, as in the Docker image, you'll probably run into incompatibilities with the Boost library at runtime. A working RDKit version from conda-forge with reproducible classifications is rdkit=2021.09.5.
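
In case it helps, the second assertion from my test code could be turned into a guard that runs before any classification. This is only a minimal sketch: the names `diff_on_bits` and `assert_reproducible_fingerprints` are hypothetical and not part of NPClassifier, and the reference bits are simply the ones I observed with rdkit-pypi==2021.9.4. Checking the actual fingerprint rather than the version string seems safer to me, since the pip and Conda builds with similar version numbers do not behave the same.

```python
# Reference data copied from the test code above: on-bits of the Morgan
# fingerprint (radius 1, 2048 bits) WITH explicit hydrogens, as observed
# with rdkit-pypi==2021.9.4.
REFERENCE_SMILES = "C[C@]1(C=C[C@H]2C[C@H](CC[C@@H]2[C@H]1C(=O)CCO)CO)O"
REFERENCE_ON_BITS = [2, 88, 107, 114, 449, 650, 664, 695, 788, 807, 836, 866,
                     906, 955, 1060, 1233, 1380, 1455, 1477, 1652, 1673, 1804,
                     1871, 1886, 1917, 2003]


def diff_on_bits(observed, expected=REFERENCE_ON_BITS):
    """Return (missing, unexpected) bit indices, each as a sorted list."""
    observed_set, expected_set = set(observed), set(expected)
    missing = sorted(expected_set - observed_set)
    unexpected = sorted(observed_set - expected_set)
    return missing, unexpected


def assert_reproducible_fingerprints():
    """Hypothetical guard: raise if the installed RDKit yields different bits."""
    # Imported lazily so diff_on_bits stays usable without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import rdMolDescriptors

    mol = Chem.AddHs(Chem.MolFromSmiles(REFERENCE_SMILES))
    fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=1, nBits=2048)
    missing, unexpected = diff_on_bits(fp.GetOnBits())
    if missing or unexpected:
        raise RuntimeError(
            "Morgan fingerprint differs from the reference; classifications "
            f"may not be reproducible (missing={missing}, unexpected={unexpected})"
        )
```

Calling `assert_reproducible_fingerprints()` once at start-up would fail fast on an incompatible RDKit build instead of silently returning different classes.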