Skip to content

Fix ligand res_id offset to match AF3 convention#261

Merged
Ubiquinone-dot merged 5 commits intoproductionfrom
fix/rfd3-ligand-resid-offset
Apr 5, 2026
Merged

Fix ligand res_id offset to match AF3 convention#261
Ubiquinone-dot merged 5 commits intoproductionfrom
fix/rfd3-ligand-resid-offset

Conversation

@Ubiquinone-dot
Copy link
Copy Markdown
Collaborator

Summary

  • Reset ligand res_id to start from 1 per chain using dense rank-based renumbering, matching AF3's output convention
  • Add validation error when a ligand is on chain A (the protein chain), with allow_ligand_on_chain_a override option
  • Includes changes from Fix HBond conflict and stabilize tempdir usage #231 (hbond_fix — partial ligand fix, hbond cleanup)

Before: RFD3 offset ligand res_id from protein max (e.g. res_id=51 for a 50-residue protein), causing (chain_id, res_id, atom_name) pairing to fail against AF3 predictions which always use res_id=1.

After: Ligand res_id values are renumbered sequentially starting from 1 per chain (e.g. two ligands on chain B → res_id=[1, 2]).

See docs/issues/rfd3_ligand_resid_offset.md in the pipelines repo for full writeup.

Test plan

  • Ran inference with M0255_1mg5 (two ligands: NAI, ACT) — output shows Chain B: res_id=[1, 2] (was [1, 51] before rank-based fix, [51, 52] with original offset)
  • Verify RMSD pairing no longer falls back to (res_name, atom_name) in pipeline runs

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 5, 2026 04:20
RFD3 was offsetting ligand res_id values from the protein max, causing
(chain_id, res_id, atom_name) pairing to fail against AF3 predictions
which always start ligand res_id at 1. Replace the offset with dense
rank-based per-chain renumbering (1, 2, ...) and add a chain A
validation with an override option (allow_ligand_on_chain_a).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Ubiquinone-dot Ubiquinone-dot force-pushed the fix/rfd3-ligand-resid-offset branch from acb5e17 to 94922e2 Compare April 5, 2026 04:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aligns RFD3’s ligand residue indexing with AF3 conventions by renumbering ligand res_id to start at 1 per ligand chain, adds validation around ligands appearing on the protein chain, and extends the pipeline/sampler to better support “partial ligand fix” behavior (keeping unfixed ligand atoms near their original coordinates).

Changes:

  • Renumber ligand res_id densely from 1 per ligand chain (instead of offsetting from protein max) and add an overrideable validation error for ligands on the protein chain.
  • Add an atom-level ligand mask feature (is_ligand_atom) and use it in the inference sampler to avoid noising/centering-induced ligand fragmentation for partial-ligand-fix cases.
  • Consolidate/continue HBond handling around HBPLUS and add a regression test for partial ligand fixed-atom selection.

Reviewed changes

Copilot reviewed 1 out of 1 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
models/rfd3/src/rfd3/inference/input_parsing.py Implements ligand chain validation + dense per-chain ligand res_id renumbering; avoids zeroing unfixed ligand coords at init.
models/rfd3/src/rfd3/model/inference_sampler.py Uses is_ligand_atom to suppress noise injection and exclude ligands from certain centering operations to maintain ligand connectivity.
models/rfd3/src/rfd3/transforms/design_transforms.py Adds is_ligand_atom feature into feats for downstream sampler logic.
models/rfd3/src/rfd3/transforms/pipelines.py Wires is_ligand_atom into the base pipeline feature list.
models/rfd3/src/rfd3/transforms/hbonds_hbplus.py Adjusts HBPLUS tempdir handling and hb2 reading/parsing logic.
models/rfd3/src/rfd3/transforms/hbonds.py Removes legacy biotite/hydride HBond implementation.
models/rfd3/src/rfd3/metrics/hbonds_metrics.py Removes legacy HBond metrics implementation tied to the deleted transform.
models/rfd3/src/rfd3/inference/parsing.py Removes an obsolete TODO related to partial ligand fixing.
models/rfd3/tests/test_ligand_partial_fix.py Adds a unit test ensuring selected fixed ligand atoms are respected in parsing/build output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 130 to 143
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hbonds is appended twice per HBPLUS line due to a duplicated items = {...}; hbonds.append(items) block. This will double-count hbonds and inflate donor/acceptor masks and logged statistics. Remove the duplicate append (keep a single items dict per parsed line).

Suggested change

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HB = open(hb2_path, "r").readlines() leaves the file handle unclosed. Use a context manager (with open(...) as f:) to avoid leaking descriptors (especially in long-running training/inference jobs).

Suggested change
with open(hb2_path, "r") as f:
HB = f.readlines()

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous implementation ran HBPLUS with cwd=tmpdir. Removing cwd can cause HBPLUS to emit its outputs (and/or auxiliary files) into the process working directory or to fail to find the expected .hb2 next to pdb_path, depending on HBPLUS behavior. Consider restoring cwd=tmpdir (or otherwise guaranteeing all HBPLUS outputs land in tmpdir).

Suggested change
stderr=subprocess.DEVNULL,
cwd=tmpdir,

Copilot uses AI. Check for mistakes.
Comment on lines +682 to +690
f"Ligand found on chain A, which is reserved for the protein. "
f"Ligand chain(s): {ligand_chains.tolist()}. "
f"Place ligands on separate chains (B, C, D, ...) or set "
f"'allow_ligand_on_chain_a: true' to override this check."
)
# Reset ligand res_id to start from 1 per chain, matching the
# convention AF3 uses in its output CIF files. Use dense
# rank-based renumbering so gaps in the original numbering
# (e.g. res_id 850, 900) become sequential (1, 2).
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ligand-on-chain check is hard-coded to chain ID "A", but in partial-diffusion mode the protein chain ID is taken from the input (start_chain = atom_array_in.chain_id[0]) and may not be A. This can raise (or miss) the error incorrectly. Consider checking whether the ligand shares a chain with any protein atoms in atom_array (e.g., np.intersect1d(np.unique(atom_array.chain_id[atom_array.is_protein]), ligand_chains)), rather than assuming the protein chain is always A.

Suggested change
f"Ligand found on chain A, which is reserved for the protein. "
f"Ligand chain(s): {ligand_chains.tolist()}. "
f"Place ligands on separate chains (B, C, D, ...) or set "
f"'allow_ligand_on_chain_a: true' to override this check."
)
# Reset ligand res_id to start from 1 per chain, matching the
# convention AF3 uses in its output CIF files. Use dense
# rank-based renumbering so gaps in the original numbering
# (e.g. res_id 850, 900) become sequential (1, 2).
# Error if any ligand shares a chain with protein atoms unless
# explicitly overridden — chain ID is leaked to the model so this
# is a significant difference from the expected convention.
ligand_chains = np.unique(ligand_array.chain_id)
protein_chains = np.unique(atom_array.chain_id[atom_array.is_protein])
overlapping_chains = np.intersect1d(protein_chains, ligand_chains)
if overlapping_chains.size > 0 and not self.allow_ligand_on_chain_a:
raise ValueError(
f"Ligand found on protein chain(s): {overlapping_chains.tolist()}. "
f"Protein chain(s): {protein_chains.tolist()}. "
f"Ligand chain(s): {ligand_chains.tolist()}. "
f"Place ligands on separate chains or set "

Copilot uses AI. Check for mistakes.
Comment on lines 693 to 701
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes ligand res_id semantics (dense re-numbering per chain) and introduces a new validation error for ligands on the protein chain. Please add a unit test that asserts (1) ligands on the same chain get res_id=[1..N] even if the input res_id has gaps/offsets, and (2) the chain-conflict error (and allow_ligand_on_chain_a override) behaves as expected.

Copilot uses AI. Check for mistakes.
Ubiquinone-dot and others added 4 commits April 4, 2026 21:29
Rename allow_ligand_on_chain_a → allow_ligand_on_existing_chain and
check against all chains already present in the built atom array,
not just chain A.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When allow_ligand_on_existing_chain is False, raise an error if
multiple ligand residues share the same chain. Reset res_id min to 1
per chain, preserving relative gaps when ligands share a chain
(override mode).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The override path now matches the old behaviour (offset from protein
max res_id). The default path (separate chains) sets each ligand
chain's res_id to 1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Ubiquinone-dot Ubiquinone-dot merged commit b5395e4 into production Apr 5, 2026
3 checks passed
@Ubiquinone-dot Ubiquinone-dot deleted the fix/rfd3-ligand-resid-offset branch April 5, 2026 04:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants