Summary
In our tests, nvmolkit.substructure.hasSubstructMatch / countSubstructMatches /
getSubstructMatches returned no match for recursive SMARTS environments $(...) when the
target molecule had 128 or fewer atoms. Recursive SMARTS matched RDKit once the target had
≥ 129 atoms, and non-recursive SMARTS matched RDKit at all sizes we tried. Every molecule we tested
below 129 atoms showed the discrepancy, but we have not tested exhaustively — so this describes the
behavior we observed, not a proven universal rule.
The flip point in our runs coincided with kMaxTargetAtoms = 128 in src/substruct/substruct_constants.h.
The docstrings for all three entry points state "Supports recursive SMARTS queries", and no exception
is raised, so the incorrect result is silent.
Environment
| Field | Value |
| --- | --- |
| nvMolKit | 0.5.0 (wheel nvmolkit-0.5.0.post1) |
| Repo commit | 6f967ed (VERSION 0.5.0, main) |
| RDKit | 2026.03.1 |
| PyTorch | 2.12.0+cu126 (CUDA 12.6) |
| GPU | NVIDIA A100 80GB PCIe (sm_80) |
| Driver | 565.57.01 |
| Python / OS | 3.12.13 / Linux x86_64 (glibc 2.35) |
We have only reproduced this on the single configuration above; behavior on other GPUs / RDKit
versions / builds is unknown.
Affected API
hasSubstructMatch, countSubstructMatches, getSubstructMatches (shared preprocessing path;
demonstrated below with hasSubstructMatch).
Test molecules
Query: [$(NC=O)] — a recursive SMARTS matching a nitrogen bonded to a carbonyl (amide-type N).
Small molecules where we observed RDKit and nvMolKit disagree (all contain an amide N):
| Molecule | SMILES | Atoms | RDKit | nvMolKit (observed) |
| --- | --- | --- | --- | --- |
| acetamide | CC(=O)N | 4 | match | no match |
| formamide | NC=O | 3 | match | no match |
| N-methylacetamide | CC(=O)NC | 5 | match | no match |
| benzamide | NC(=O)c1ccccc1 | 9 | match | no match |
| urea | NC(=O)N | 4 | match | no match |
Controls where RDKit and nvMolKit agreed (true negatives — no amide N):
| Molecule | SMILES | Atoms | RDKit | nvMolKit (observed) |
| --- | --- | --- | --- | --- |
| methylamine | CN | 2 | no match | no match |
| acetic acid | CC(=O)O | 4 | no match | no match |
Copy-paste list:
disagree = ["CC(=O)N", "NC=O", "CC(=O)NC", "NC(=O)c1ccccc1", "NC(=O)N"] # RDKit match, nvMolKit no match
agree = ["CN", "CC(=O)O"] # both no match
For the size threshold, glycine chains H-(Gly)ₙ-OH ("NCC(=O)" + "NCC(=O)"*(n-1) + "O") give a clean
atom-count sweep; every chain with n ≥ 2 contains a peptide (amide) bond.
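As a quick sanity check on the sweep, the sketch below builds the chain SMILES with the same expression and counts heavy atoms without RDKit. The helper names `glycine_chain` and `heavy_atoms` are ours, and the letter-counting shortcut is an assumption that only holds for these chains, where every atom is a single-letter N, C or O symbol. It shows that each residue adds 4 heavy atoms, so the count is 4n + 1.

```python
def glycine_chain(n: int) -> str:
    """SMILES for H-(Gly)n-OH, built the same way as in this report."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return "NCC(=O)" + "NCC(=O)" * (n - 1) + "O"


def heavy_atoms(smiles: str) -> int:
    # Shortcut valid only for these chains: every atom is written as a single
    # letter (N, C, O), so counting letters counts heavy atoms.
    return sum(ch.isalpha() for ch in smiles)


for n in (31, 32):
    print(n, heavy_atoms(glycine_chain(n)))
# 31 125
# 32 129
```

Because the count steps by 4, the sweep lands on 125 and 129 atoms and skips 126 to 128.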
Steps to reproduce
Minimal — acetamide contains an amide nitrogen:
from rdkit.Chem import MolFromSmiles, MolFromSmarts
from nvmolkit.substructure import hasSubstructMatch
q = MolFromSmarts("[$(NC=O)]") # recursive: an N bonded to a carbonyl
m = MolFromSmiles("CC(=O)N") # acetamide (4 atoms)
print(m.HasSubstructMatch(q)) # RDKit -> True
print(int(hasSubstructMatch([m], [q])[0, 0])) # nvMolKit -> 0 in our runs
A full self-contained, deterministic reproducer is in the collapsible section at the bottom.
Observed behavior
1. Small molecules — recursive queries returned 0 matches; non-recursive queries agreed with RDKit:
pattern      recursive  rd_hits  nv_hits  agree  verdict
[$(NC=O)]    True             5        0    50%  disagree
[$([OH])]    True             2        0    80%  disagree
[$([#6])]    True            10        0     0%  disagree  <- "any carbon" wrapped recursively, also 0
C=O          False            6        6   100%  agree
[OH]         False            2        2   100%  agree
[#6]         False           10       10   100%  agree
[$([#6])] wraps "any carbon" in a recursive environment; logically it should match every
carbon-containing molecule, but it returned zero matches.
2. Size threshold — recursive [$(NC=O)] on glycine chains H-(Gly)ₙ-OH:
n_res  atoms  RDKit  nvMolKit
    8     33  True   False
   24     97  True   False
   30    121  True   False
   31    125  True   False
   32    129  True   True    <-- nvMolKit started matching here
   33    133  True   True
In the glycine sweep the flip was sharp: no matches up to 125 atoms (n = 31) and correct matches from
129 atoms (n = 32). The chains grow by 4 atoms per residue, so this sweep alone does not sample 126 to
128 atoms. We also saw the split on a 2,000-molecule ChEMBL sample (≈0% recall for ≤128-atom molecules,
≈91% for ≥129-atom molecules).
For the larger molecules where nvMolKit did return matches, the atom indices were identical to RDKit, so
the core matching appears correct in those cases; the smaller molecules were the ones skipped.
3. Deterministic: identical across 5 repeated runs and across batchSize ∈ {16 … 100000} in our
testing — we did not observe run-to-run or batch-ordering variation.
Note on why this can be easy to miss: a query panel restricted to small molecules fails silently, so the
problem only surfaces when results are compared against RDKit, and a molecule set whose first entries
are large (≥ 129 atoms) can show partial agreement. Comparing a recursive query on a small molecule
against RDKit is the most direct check.
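The per-size recall quoted in item 2 can be computed along the following lines. This is a minimal sketch, not the script we used for the ChEMBL sample: `recall_by_size` and its `max_small` parameter are illustrative names, and the input rows are the six glycine results from the table in item 2.

```python
def recall_by_size(rows, max_small=128):
    """rows: iterable of (atom_count, rdkit_match, nvmolkit_match) tuples.

    Returns nvMolKit recall against RDKit, split at max_small atoms.
    A bucket with no RDKit positives yields None.
    """
    counts = {"small": [0, 0], "large": [0, 0]}  # [nvMolKit hits, RDKit positives]
    for atoms, rd, nv in rows:
        if not rd:
            continue  # recall only looks at molecules RDKit matched
        bucket = "small" if atoms <= max_small else "large"
        counts[bucket][1] += 1
        counts[bucket][0] += int(nv)
    return {k: (tp / pos if pos else None) for k, (tp, pos) in counts.items()}


# (atoms, RDKit, nvMolKit) copied from the glycine-chain table above
glycine_rows = [
    (33, True, False), (97, True, False), (121, True, False),
    (125, True, False), (129, True, True), (133, True, True),
]
print(recall_by_size(glycine_rows))
# {'small': 0.0, 'large': 1.0}
```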
Expected behavior
We would expect recursive SMARTS results to match RDKit Mol.HasSubstructMatch regardless of molecule
size, as non-recursive SMARTS already did in our tests. For example,
hasSubstructMatch([acetamide], [MolFromSmarts("[$(NC=O)]")]) would be expected to return 1.
Root-cause hypothesis (speculative)
The threshold we observed equals kMaxTargetAtoms = 128 (src/substruct/substruct_constants.h; kernel
template Config_T128_Q64_B8). One possibility is that the recursive match-bit buffer
(recursiveMatchBits, stride maxTargetAtoms — see src/substruct/substruct_kernels.h and the
"paint mode" kernel for recursive SMARTS preprocessing) is only populated for targets spanning more than
one 128-atom tile, so molecules fitting in a single tile (≤ 128 atoms) skip the recursive-bit step. We
have not confirmed this in the source — it is only a hypothesis consistent with the observed threshold.
Worth checking:
- The condition gating the recursive-bit "paint" launch on target atom count / tile count.
- Indexing of recursiveMatchBits by maxTargetAtoms stride for single-tile molecules.
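To make the hypothesis concrete, here is a toy model of the tile arithmetic. It is not nvMolKit code: `tile_count` and `hypothetical_paint_runs` are invented names, and the `> 1` gate is exactly the unconfirmed guess described above. It only shows that a gate of that shape would flip between 128 and 129 atoms, which is where we saw the behavior change.

```python
TILE_ATOMS = 128  # same value as kMaxTargetAtoms quoted in this report


def tile_count(num_atoms: int) -> int:
    """Number of 128-atom tiles needed to cover a target (ceiling division)."""
    return -(-num_atoms // TILE_ATOMS)


def hypothetical_paint_runs(num_atoms: int) -> bool:
    # HYPOTHETICAL gate, not taken from the nvMolKit source: the recursive-bit
    # step only runs when the target spans more than one tile.
    return tile_count(num_atoms) > 1


for atoms in (125, 128, 129):
    print(atoms, tile_count(atoms), hypothetical_paint_runs(atoms))
# 125 1 False
# 128 1 False
# 129 2 True
```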
A regression test comparing a recursive query on a small molecule against RDKit would likely catch this;
the existing suite may pass because it does not appear to cover that specific case.
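One possible shape for such a regression check is sketched below. The helper `find_disagreements` and the two matcher wrappers are illustrative names, not part of nvMolKit or its test suite. The wrappers use the same calls as the minimal reproducer and import RDKit and nvMolKit lazily, so the comparison helper itself has no dependencies.

```python
from typing import Callable, List, Sequence, Tuple

# (target SMILES, query SMARTS) -> does the query match the target?
Matcher = Callable[[str, str], bool]


def find_disagreements(
    reference: Matcher,
    candidate: Matcher,
    targets: Sequence[str],
    queries: Sequence[str],
) -> List[Tuple[str, str, bool, bool]]:
    """Return (target, query, reference_result, candidate_result) for every mismatch."""
    mismatches = []
    for target in targets:
        for query in queries:
            ref, cand = reference(target, query), candidate(target, query)
            if ref != cand:
                mismatches.append((target, query, ref, cand))
    return mismatches


def rdkit_matcher(target: str, query: str) -> bool:
    from rdkit.Chem import MolFromSmarts, MolFromSmiles  # lazy import
    return MolFromSmiles(target).HasSubstructMatch(MolFromSmarts(query))


def nvmolkit_matcher(target: str, query: str) -> bool:
    # Mirrors the call used in the minimal reproducer in this report.
    from rdkit.Chem import MolFromSmarts, MolFromSmiles
    from nvmolkit.substructure import hasSubstructMatch
    result = hasSubstructMatch([MolFromSmiles(target)], [MolFromSmarts(query)])
    return bool(result[0, 0])
```

In a test, `find_disagreements(rdkit_matcher, nvmolkit_matcher, ["CC(=O)N", "CN"], ["[$(NC=O)]", "C=O"])` would be expected to return an empty list once recursive queries agree with RDKit on small molecules.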
Full self-contained deterministic reproducer (no data files)
import sys

import numpy as np
from rdkit import RDLogger
from rdkit.Chem import MolFromSmarts, MolFromSmiles
from nvmolkit.substructure import hasSubstructMatch

RDLogger.DisableLog("rdApp.*")


def nv(targets, queries):
    # nvMolKit result as a boolean [target, query] matrix
    return np.asarray(hasSubstructMatch(targets, queries)).astype(bool)


def rd(targets, queries):
    # RDKit reference result in the same [target, query] layout
    return np.array([[t.HasSubstructMatch(q) for q in queries] for t in targets], dtype=bool)


# A. minimal cases
q = MolFromSmarts("[$(NC=O)]")
cases = ["CC(=O)N", "NC=O", "CC(=O)NC", "NC(=O)c1ccccc1", "NC(=O)N", "CN", "CC(=O)O"]
mols = [MolFromSmiles(s) for s in cases]
print("A. [$(NC=O)] on small molecules (smiles, atoms, RDKit, nvMolKit):")
for s, m in zip(cases, mols):
    print(f" {s:18s} {m.GetNumAtoms():3d} {m.HasSubstructMatch(q)!s:5s} {bool(nv([m],[q])[0,0])}")

# B. recursion is the trigger
smis = ["CC(=O)N", "NC=O", "NC(=O)N", "CN", "CCO", "c1ccccc1"]
mols = [MolFromSmiles(s) for s in smis]
probes = ["[$(NC=O)]", "[$([#6])]", "C=O", "[#6]"]
qs = [MolFromSmarts(p) for p in probes]
R, N = rd(mols, qs), nv(mols, qs)
print("\nB. recursive vs non-recursive (pattern, rd_hits, nv_hits, agree%):")
for j, p in enumerate(probes):
    print(f" {p:12s} {int(R[:,j].sum()):3d} {int(N[:,j].sum()):3d} {100*(R[:,j]==N[:,j]).mean():5.0f}%")

# C. 128-atom threshold via glycine chains
print("\nC. [$(NC=O)] on glycine chains (n_res, atoms, RDKit, nvMolKit):")
for n in [8, 24, 30, 31, 32, 33, 36]:
    smi = "NCC(=O)" + "NCC(=O)" * (n - 1) + "O"
    m = MolFromSmiles(smi)
    print(f" {n:3d} {m.GetNumAtoms():4d} {m.HasSubstructMatch(q)!s:5s} {bool(nv([m],[q])[0,0])}")

# verdict: exit code 1 if acetamide still fails to match
bug = not bool(nv([MolFromSmiles("CC(=O)N")], [q])[0, 0])
print("\nBUG REPRODUCED" if bug else "\nnot reproduced (fixed?)")
sys.exit(1 if bug else 0)