Recursive SMARTS $(...) return no matches for small target molecules (≤128 atoms in our tests) #208

@xinyu-dev

Summary

In our tests, nvmolkit.substructure.hasSubstructMatch / countSubstructMatches /
getSubstructMatches returned no match for recursive SMARTS environments $(...) when the
target molecule had 128 or fewer atoms. Recursive SMARTS results agreed with RDKit once the target had
≥ 129 atoms, and non-recursive SMARTS results agreed with RDKit at all sizes we tried. Every target
below 129 atoms that RDKit matched showed the discrepancy, but we have not tested exhaustively, so this
describes the behavior we observed, not a proven universal rule.

The flip point in our runs coincided with kMaxTargetAtoms = 128 in src/substruct/substruct_constants.h.
The docstrings for all three entry points state "Supports recursive SMARTS queries", and no exception
is raised, so the incorrect result is silent.

Environment

Field          Value
nvMolKit       0.5.0 (wheel nvmolkit-0.5.0.post1)
Repo commit    6f967ed (VERSION 0.5.0, main)
RDKit          2026.03.1
PyTorch        2.12.0+cu126 (CUDA 12.6)
GPU            NVIDIA A100 80GB PCIe (sm_80)
Driver         565.57.01
Python / OS    3.12.13 / Linux x86_64 (glibc 2.35)

We have only reproduced this on the single configuration above; behavior on other GPUs / RDKit
versions / builds is unknown.

Affected API

hasSubstructMatch, countSubstructMatches, getSubstructMatches (shared preprocessing path;
demonstrated below with hasSubstructMatch).

Test molecules

Query: [$(NC=O)] — a recursive SMARTS matching a nitrogen bonded to a carbonyl (amide-type N).

Small molecules where we observed RDKit and nvMolKit disagree (all contain an amide N):

Molecule            SMILES            Atoms  RDKit  nvMolKit (observed)
acetamide           CC(=O)N               4  match  no match
formamide           NC=O                  3  match  no match
N-methylacetamide   CC(=O)NC              5  match  no match
benzamide           NC(=O)c1ccccc1        9  match  no match
urea                NC(=O)N               4  match  no match

Controls where RDKit and nvMolKit agreed (true negatives — no amide N):

Molecule      SMILES    Atoms  RDKit     nvMolKit (observed)
methylamine   CN            2  no match  no match
acetic acid   CC(=O)O       4  no match  no match

Copy-paste list:

disagree = ["CC(=O)N", "NC=O", "CC(=O)NC", "NC(=O)c1ccccc1", "NC(=O)N"]  # RDKit match, nvMolKit no match
agree    = ["CN", "CC(=O)O"]                                              # both no match

For the size threshold, glycine chains H-(Gly)ₙ-OH ("NCC(=O)" + "NCC(=O)"*(n-1) + "O") give a clean
atom-count sweep; every chain with n ≥ 2 contains a peptide (amide) bond.
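
The atom counts in the sweep follow from the template itself: each residue contributes four heavy atoms and the terminal O adds one, giving 4n + 1. A minimal pure-Python sketch of that arithmetic is below. The helper names are ours, and counting letters as atoms is only valid for these single-letter-element SMILES; MolFromSmiles(smi).GetNumAtoms() remains the authoritative count.

```python
def glycine_chain_smiles(n_res: int) -> str:
    """SMILES for H-(Gly)n-OH using the template from this report."""
    if n_res < 1:
        raise ValueError("n_res must be >= 1")
    return "NCC(=O)" + "NCC(=O)" * (n_res - 1) + "O"


def heavy_atom_count(smiles: str) -> int:
    """Count heavy atoms by counting letters.

    Only valid here because every atom in these SMILES is a single
    uppercase letter (N, C, O); this is not a general SMILES parser.
    """
    return sum(ch.isalpha() for ch in smiles)


# Print n, the counted atoms, and the closed form 4n + 1 side by side.
for n in (8, 24, 30, 31, 32, 33):
    smi = glycine_chain_smiles(n)
    print(n, heavy_atom_count(smi), 4 * n + 1)
```

Because the count steps by 4 per residue, the sweep includes chains of 125 and 129 atoms but none with exactly 126 to 128 atoms.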

Steps to reproduce

Minimal — acetamide contains an amide nitrogen:

from rdkit.Chem import MolFromSmiles, MolFromSmarts
from nvmolkit.substructure import hasSubstructMatch

q = MolFromSmarts("[$(NC=O)]")                  # recursive: an N bonded to a carbonyl
m = MolFromSmiles("CC(=O)N")                    # acetamide (4 atoms)

print(m.HasSubstructMatch(q))                   # RDKit    -> True
print(int(hasSubstructMatch([m], [q])[0, 0]))   # nvMolKit -> 0 in our runs

A full self-contained, deterministic reproducer is in the collapsible section at the bottom.

Observed behavior

1. Small molecules — recursive queries returned 0 matches; non-recursive queries agreed with RDKit:

pattern      recursive  rd_hits  nv_hits   agree  verdict
[$(NC=O)]         True        5        0     50%   disagree
[$([OH])]         True        2        0     80%   disagree
[$([#6])]         True       10        0      0%   disagree   <- "any carbon" wrapped recursively, also 0
C=O              False        6        6    100%   agree
[OH]             False        2        2    100%   agree
[#6]             False       10       10    100%   agree

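[$([#6])] wraps "any carbon" in a recursive environment; logically it should match every
carbon-containing molecule, but it returned zero matches. The percentages imply a 10-molecule panel
for this table; section B of the reproducer at the bottom uses 6 molecules and four of these patterns,
so its hit counts are smaller but show the same recursive / non-recursive split.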
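
The agree column can be recomputed from the hit counts. The sketch below assumes a panel of 10 molecules (not stated in the table, inferred from the percentages) and that every nvMolKit hit is also an RDKit hit, which holds trivially when nv_hits is 0. The function name is ours.

```python
def agreement_percent(n_molecules: int, rd_hits: int, nv_hits: int) -> float:
    """Percent of molecules on which both tools give the same answer.

    Assumes nvMolKit produced no false positives, so the only
    disagreements are RDKit hits that nvMolKit missed.
    """
    if not 0 <= nv_hits <= rd_hits <= n_molecules:
        raise ValueError("expected 0 <= nv_hits <= rd_hits <= n_molecules")
    missed = rd_hits - nv_hits
    return 100.0 * (n_molecules - missed) / n_molecules


# (pattern, rd_hits, nv_hits) copied from the table above.
rows = [
    ("[$(NC=O)]", 5, 0),
    ("[$([OH])]", 2, 0),
    ("[$([#6])]", 10, 0),
    ("C=O", 6, 6),
    ("[OH]", 2, 2),
    ("[#6]", 10, 10),
]
for pattern, rd_hits, nv_hits in rows:
    print(f"{pattern:12s} {agreement_percent(10, rd_hits, nv_hits):5.0f}%")
```

This also shows why a high agree value is not reassuring on its own: a pattern that rarely matches still agrees on every true negative.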

2. Size threshold — recursive [$(NC=O)] on glycine chains H-(Gly)ₙ-OH:

n_res atoms  RDKit  nvMolKit
    8    33   True     False
   24    97   True     False
   30   121   True     False
   31   125   True     False
   32   129   True      True   <-- nvMolKit started matching here
   33   133   True      True

In our runs the boundary was sharp: 0% recall at ≤ 128 atoms, correct at ≥ 129 atoms. The glycine sweep
steps by 4 atoms per residue, so on its own it brackets the flip between 125 and 129 atoms. We also saw
the same split on a 2,000-molecule ChEMBL sample (≈0% recall for ≤128-atom molecules, ≈91% for ≥129-atom
molecules). For the larger molecules where nvMolKit did return matches, the atom indices were identical
to RDKit, so the core matching appears correct in those cases; the smaller molecules were the ones skipped.
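
Recall split by atom count can be computed as sketched below, treating RDKit as ground truth. The rows are the glycine-chain results from the table above; the function names and the 128-atom default threshold are our choices for illustration, and the ChEMBL sample is not reproduced here.

```python
from typing import List, Optional, Tuple

# (atoms, rdkit_match, nvmolkit_match) from the glycine-chain table.
GLYCINE_ROWS: List[Tuple[int, bool, bool]] = [
    (33, True, False),
    (97, True, False),
    (121, True, False),
    (125, True, False),
    (129, True, True),
    (133, True, True),
]


def recall(rows: List[Tuple[int, bool, bool]]) -> Optional[float]:
    """Fraction of RDKit-positive rows that nvMolKit also matched."""
    hits = [nv for _, rd, nv in rows if rd]
    if not hits:
        return None  # recall is undefined without any positives
    return sum(hits) / len(hits)


def recall_by_size(rows, threshold: int = 128):
    """Return (recall at <= threshold atoms, recall above threshold)."""
    small = [r for r in rows if r[0] <= threshold]
    large = [r for r in rows if r[0] > threshold]
    return recall(small), recall(large)


print(recall_by_size(GLYCINE_ROWS))
```

Reporting recall per size bucket, rather than one overall figure, is what makes the threshold visible in a mixed-size dataset.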

3. Deterministic: identical across 5 repeated runs and across batchSize ∈ {16 … 100000} in our
testing — we did not observe run-to-run or batch-ordering variation.

Note on why this can be easy to miss: no exception is raised, so on a query panel restricted to small
molecules the failure only surfaces when results are compared against RDKit; and a molecule set whose
first entries are large (≥ 129 atoms) can show partial agreement, which can pass for a working
implementation. Comparing a recursive query on a small molecule against RDKit is the most direct check.

Expected behavior

We would expect recursive SMARTS results to match RDKit Mol.HasSubstructMatch regardless of molecule
size, as non-recursive SMARTS already did in our tests. For example,
hasSubstructMatch([acetamide], [MolFromSmarts("[$(NC=O)]")]) would be expected to return 1.

Root-cause hypothesis (speculative)

The threshold we observed equals kMaxTargetAtoms = 128 (src/substruct/substruct_constants.h; kernel
template Config_T128_Q64_B8). One possibility is that the recursive match-bit buffer
(recursiveMatchBits, stride maxTargetAtoms — see src/substruct/substruct_kernels.h and the
"paint mode" kernel for recursive SMARTS preprocessing) is only populated for targets spanning more than
one 128-atom tile, so molecules fitting in a single tile (≤ 128 atoms) skip the recursive-bit step. We
have not confirmed this in the source — it is only a hypothesis consistent with the observed threshold.

Worth checking:

  • The condition gating the recursive-bit "paint" launch on target atom count / tile count.
  • Indexing of recursiveMatchBits by maxTargetAtoms stride for single-tile molecules.
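
To make the hypothesis concrete, here is a toy model of the gating condition it describes. This is not nvMolKit code: recursive_bits_painted is a hypothetical name for the suspected condition, and only the value 128 comes from the report. It shows that a "more than one tile" gate would flip exactly between 128 and 129 atoms.

```python
K_MAX_TARGET_ATOMS = 128  # value the report cites from substruct_constants.h


def tile_count(num_atoms: int, tile_size: int = K_MAX_TARGET_ATOMS) -> int:
    """Number of tile_size-atom tiles needed to cover a target (ceiling division)."""
    if num_atoms < 1:
        raise ValueError("num_atoms must be >= 1")
    return (num_atoms + tile_size - 1) // tile_size


def recursive_bits_painted(num_atoms: int) -> bool:
    """HYPOTHETICAL gate: paint recursive bits only for multi-tile targets.

    This models the suspected bug, not confirmed behavior in the source.
    """
    return tile_count(num_atoms) > 1


# Print atoms, tiles, and whether the hypothesized gate would fire.
for atoms in (4, 125, 128, 129, 133):
    print(atoms, tile_count(atoms), recursive_bits_painted(atoms))
```

If the real condition has this shape, the observed boundary would be explained; an off-by-one in the stride indexing could produce the same boundary, so the model does not distinguish between the two items above.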

A regression test comparing a recursive query on a small molecule against RDKit would likely catch this;
the existing suite may pass because it does not appear to cover that specific case.
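
A minimal sketch of such a regression check is below. To keep it runnable without a GPU, the two matchers are injected as callables and the demo uses stand-ins: the RDKit results are copied from the tables in this report, and the nvMolKit stand-in returns the observed "no match". The names find_disagreements, rdkit_from_table and observed_nvmolkit are ours.

```python
from typing import Callable, Iterable, List, Tuple

# RDKit results for [$(NC=O)], copied from the tables in this report.
RDKIT_RESULT = {
    "CC(=O)N": True,
    "NC=O": True,
    "CC(=O)NC": True,
    "NC(=O)c1ccccc1": True,
    "NC(=O)N": True,
    "CN": False,
    "CC(=O)O": False,
}


def find_disagreements(
    targets: Iterable,
    queries: Iterable,
    reference: Callable[[object, object], bool],
    candidate: Callable[[object, object], bool],
) -> List[Tuple[object, object]]:
    """Return every (target, query) pair where the two matchers differ."""
    queries = list(queries)  # allow re-iteration for each target
    return [
        (t, q)
        for t in targets
        for q in queries
        if bool(reference(t, q)) != bool(candidate(t, q))
    ]


def rdkit_from_table(target, query) -> bool:
    """Stand-in for RDKit: look up the result recorded in the report."""
    return RDKIT_RESULT[target]


def observed_nvmolkit(target, query) -> bool:
    """Stand-in for the observed nvMolKit behavior: never a match."""
    return False


failures = find_disagreements(
    RDKIT_RESULT, ["[$(NC=O)]"], rdkit_from_table, observed_nvmolkit
)
print(len(failures), "disagreements")
```

With the real libraries, reference would be lambda t, q: t.HasSubstructMatch(q) and candidate would wrap hasSubstructMatch([t], [q])[0, 0] as in the minimal example; a test would then assert that the returned list is empty.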

Full self-contained deterministic reproducer (no data files)
import sys
import numpy as np
from rdkit import RDLogger
from rdkit.Chem import MolFromSmarts, MolFromSmiles
from nvmolkit.substructure import hasSubstructMatch

RDLogger.DisableLog("rdApp.*")


def nv(targets, queries):
    return np.asarray(hasSubstructMatch(targets, queries)).astype(bool)


def rd(targets, queries):
    return np.array([[t.HasSubstructMatch(q) for q in queries] for t in targets], dtype=bool)


# A. minimal cases
q = MolFromSmarts("[$(NC=O)]")
cases = ["CC(=O)N", "NC=O", "CC(=O)NC", "NC(=O)c1ccccc1", "NC(=O)N", "CN", "CC(=O)O"]
mols = [MolFromSmiles(s) for s in cases]
print("A. [$(NC=O)] on small molecules (smiles, atoms, RDKit, nvMolKit):")
for s, m in zip(cases, mols):
    print(f"  {s:18s} {m.GetNumAtoms():3d}  {m.HasSubstructMatch(q)!s:5s}  {bool(nv([m],[q])[0,0])}")

# B. recursion is the trigger
smis = ["CC(=O)N", "NC=O", "NC(=O)N", "CN", "CCO", "c1ccccc1"]
mols = [MolFromSmiles(s) for s in smis]
probes = ["[$(NC=O)]", "[$([#6])]", "C=O", "[#6]"]
qs = [MolFromSmarts(p) for p in probes]
R, N = rd(mols, qs), nv(mols, qs)
print("\nB. recursive vs non-recursive (pattern, rd_hits, nv_hits, agree%):")
for j, p in enumerate(probes):
    print(f"  {p:12s} {int(R[:,j].sum()):3d} {int(N[:,j].sum()):3d} {100*(R[:,j]==N[:,j]).mean():5.0f}%")

# C. 128-atom threshold via glycine chains
print("\nC. [$(NC=O)] on glycine chains (n_res, atoms, RDKit, nvMolKit):")
for n in [8, 24, 30, 31, 32, 33, 36]:
    smi = "NCC(=O)" + "NCC(=O)" * (n - 1) + "O"
    m = MolFromSmiles(smi)
    print(f"  {n:3d} {m.GetNumAtoms():4d}  {m.HasSubstructMatch(q)!s:5s}  {bool(nv([m],[q])[0,0])}")

# verdict
bug = not bool(nv([MolFromSmiles("CC(=O)N")], [q])[0, 0])
print("\nBUG REPRODUCED" if bug else "\nnot reproduced (fixed?)")
sys.exit(1 if bug else 0)

Labels: bug