Fix ligand res_id offset to match AF3 convention#261
Fix ligand res_id offset to match AF3 convention#261Ubiquinone-dot merged 5 commits intoproductionfrom
Conversation
RFD3 was offsetting ligand res_id values from the protein max, causing (chain_id, res_id, atom_name) pairing to fail against AF3 predictions which always start ligand res_id at 1. Replace the offset with dense rank-based per-chain renumbering (1, 2, ...) and add a chain A validation with an override option (allow_ligand_on_chain_a). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
acb5e17 to
94922e2
Compare
There was a problem hiding this comment.
Pull request overview
This PR aligns RFD3’s ligand residue indexing with AF3 conventions by renumbering ligand res_id to start at 1 per ligand chain, adds validation around ligands appearing on the protein chain, and extends the pipeline/sampler to better support “partial ligand fix” behavior (keeping unfixed ligand atoms near their original coordinates).
Changes:
- Renumber ligand
res_iddensely from 1 per ligand chain (instead of offsetting from protein max) and add an overrideable validation error for ligands on the protein chain. - Add an atom-level ligand mask feature (
is_ligand_atom) and use it in the inference sampler to avoid noising/centering-induced ligand fragmentation for partial-ligand-fix cases. - Consolidate/continue HBond handling around HBPLUS and add a regression test for partial ligand fixed-atom selection.
Reviewed changes
Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| models/rfd3/src/rfd3/inference/input_parsing.py | Implements ligand chain validation + dense per-chain ligand res_id renumbering; avoids zeroing unfixed ligand coords at init. |
| models/rfd3/src/rfd3/model/inference_sampler.py | Uses is_ligand_atom to suppress noise injection and exclude ligands from certain centering operations to maintain ligand connectivity. |
| models/rfd3/src/rfd3/transforms/design_transforms.py | Adds is_ligand_atom feature into feats for downstream sampler logic. |
| models/rfd3/src/rfd3/transforms/pipelines.py | Wires is_ligand_atom into the base pipeline feature list. |
| models/rfd3/src/rfd3/transforms/hbonds_hbplus.py | Adjusts HBPLUS tempdir handling and hb2 reading/parsing logic. |
| models/rfd3/src/rfd3/transforms/hbonds.py | Removes legacy biotite/hydride HBond implementation. |
| models/rfd3/src/rfd3/metrics/hbonds_metrics.py | Removes legacy HBond metrics implementation tied to the deleted transform. |
| models/rfd3/src/rfd3/inference/parsing.py | Removes an obsolete TODO related to partial ligand fixing. |
| models/rfd3/tests/test_ligand_partial_fix.py | Adds a unit test ensuring selected fixed ligand atoms are respected in parsing/build output. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
hbonds is appended twice per HBPLUS line due to a duplicated items = {...}; hbonds.append(items) block. This will double-count hbonds and inflate donor/acceptor masks and logged statistics. Remove the duplicate append (keep a single items dict per parsed line).
There was a problem hiding this comment.
HB = open(hb2_path, "r").readlines() leaves the file handle unclosed. Use a context manager (with open(...) as f:) to avoid leaking descriptors (especially in long-running training/inference jobs).
| with open(hb2_path, "r") as f: | |
| HB = f.readlines() |
There was a problem hiding this comment.
The previous implementation ran HBPLUS with cwd=tmpdir. Removing cwd can cause HBPLUS to emit its outputs (and/or auxiliary files) into the process working directory or to fail to find the expected .hb2 next to pdb_path, depending on HBPLUS behavior. Consider restoring cwd=tmpdir (or otherwise guaranteeing all HBPLUS outputs land in tmpdir).
| stderr=subprocess.DEVNULL, | |
| cwd=tmpdir, |
| f"Ligand found on chain A, which is reserved for the protein. " | ||
| f"Ligand chain(s): {ligand_chains.tolist()}. " | ||
| f"Place ligands on separate chains (B, C, D, ...) or set " | ||
| f"'allow_ligand_on_chain_a: true' to override this check." | ||
| ) | ||
| # Reset ligand res_id to start from 1 per chain, matching the | ||
| # convention AF3 uses in its output CIF files. Use dense | ||
| # rank-based renumbering so gaps in the original numbering | ||
| # (e.g. res_id 850, 900) become sequential (1, 2). |
There was a problem hiding this comment.
The ligand-on-chain check is hard-coded to chain ID "A", but in partial-diffusion mode the protein chain ID is taken from the input (start_chain = atom_array_in.chain_id[0]) and may not be A. This can raise (or miss) the error incorrectly. Consider checking whether the ligand shares a chain with any protein atoms in atom_array (e.g., np.intersect1d(np.unique(atom_array.chain_id[atom_array.is_protein]), ligand_chains)), rather than assuming the protein chain is always A.
| f"Ligand found on chain A, which is reserved for the protein. " | |
| f"Ligand chain(s): {ligand_chains.tolist()}. " | |
| f"Place ligands on separate chains (B, C, D, ...) or set " | |
| f"'allow_ligand_on_chain_a: true' to override this check." | |
| ) | |
| # Reset ligand res_id to start from 1 per chain, matching the | |
| # convention AF3 uses in its output CIF files. Use dense | |
| # rank-based renumbering so gaps in the original numbering | |
| # (e.g. res_id 850, 900) become sequential (1, 2). | |
| # Error if any ligand shares a chain with protein atoms unless | |
| # explicitly overridden — chain ID is leaked to the model so this | |
| # is a significant difference from the expected convention. | |
| ligand_chains = np.unique(ligand_array.chain_id) | |
| protein_chains = np.unique(atom_array.chain_id[atom_array.is_protein]) | |
| overlapping_chains = np.intersect1d(protein_chains, ligand_chains) | |
| if overlapping_chains.size > 0 and not self.allow_ligand_on_chain_a: | |
| raise ValueError( | |
| f"Ligand found on protein chain(s): {overlapping_chains.tolist()}. " | |
| f"Protein chain(s): {protein_chains.tolist()}. " | |
| f"Ligand chain(s): {ligand_chains.tolist()}. " | |
| f"Place ligands on separate chains or set " |
There was a problem hiding this comment.
This changes ligand res_id semantics (dense re-numbering per chain) and introduces a new validation error for ligands on the protein chain. Please add a unit test that asserts (1) ligands on the same chain get res_id=[1..N] even if the input res_id has gaps/offsets, and (2) the chain-conflict error (and allow_ligand_on_chain_a override) behaves as expected.
Rename allow_ligand_on_chain_a → allow_ligand_on_existing_chain and check against all chains already present in the built atom array, not just chain A. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When allow_ligand_on_existing_chain is False, raise an error if multiple ligand residues share the same chain. Reset res_id min to 1 per chain, preserving relative gaps when ligands share a chain (override mode). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The override path now matches the old behaviour (offset from protein max res_id). The default path (separate chains) sets each ligand chain's res_id to 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
res_idto start from 1 per chain using dense rank-based renumbering, matching AF3's output conventionallow_ligand_on_chain_aoverride optionhbond_fix— partial ligand fix, hbond cleanup)Before: RFD3 offset ligand
res_idfrom protein max (e.g.res_id=51for a 50-residue protein), causing(chain_id, res_id, atom_name)pairing to fail against AF3 predictions which always useres_id=1.After: Ligand
res_idvalues are renumbered sequentially starting from 1 per chain (e.g. two ligands on chain B →res_id=[1, 2]).See
docs/issues/rfd3_ligand_resid_offset.mdin the pipelines repo for full writeup.Test plan
M0255_1mg5(two ligands: NAI, ACT) — output showsChain B: res_id=[1, 2](was[1, 51]before rank-based fix,[51, 52]with original offset)(res_name, atom_name)in pipeline runs🤖 Generated with Claude Code