[codex] fix #42 with query-only skip-MSA feature generation #611

Merged: DimaMolod merged 5 commits into main from fix-42-skip-msa on Apr 10, 2026

@DimaMolod
Collaborator

Summary

This PR adds a real --skip_msa mode to create_individual_features.py so AlphaPulldown can generate query-only single-sequence features instead of running bulk MSA searches.

Closes #42.

Root Cause

AlphaPulldown always built feature pickles through the standard AF2 or MMseqs2 MSA paths. Even when users wanted a single-sequence workflow, the feature-generation step still ran bulk MSA searches and stored full MSA-derived tensors. There was also no persisted marker to prevent those query-only features from being used later with --pair_msa=True, which would be semantically invalid.

What Changed

  • add --skip_msa to create_individual_features.py
  • for AF2 classic mode, build query-only monomer features directly and still run template search from a synthetic single-sequence alignment
  • for MMseqs2 mode, switch to ColabFold-style single_sequence behavior and keep template-only search support
  • for AF3 mode, prefill protein and RNA chains with query-only custom MSAs so the AF3 data pipeline skips MSA search while still allowing template search
  • persist a skip_msa marker on generated monomer objects
  • propagate that marker to chopped objects during prediction setup
  • reject run_structure_prediction.py --pair_msa=True when any interactor was generated with --skip_msa
  • prevent the TrueMultimer reuse shortcut from silently reusing an old bulk-MSA pickle when --skip_msa is requested
  • document --skip_msa in the README feature-generation flag section
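
To make the marker-related bullets concrete, here is a minimal sketch of what "query-only" features and a persisted, propagated skip_msa marker could look like. The class MonomerSketch and the helper query_only_a3m are hypothetical names used only for illustration; they are not AlphaPulldown's actual objects, and the real implementation in this PR may differ.

```python
from dataclasses import dataclass, replace


def query_only_a3m(description: str, sequence: str) -> str:
    """Return an A3M alignment that contains only the query sequence."""
    return f">{description}\n{sequence}\n"


@dataclass(frozen=True)
class MonomerSketch:
    """Hypothetical stand-in for a monomer feature object (illustration only)."""

    description: str
    sequence: str
    skip_msa: bool = False  # persisted marker: True means query-only features

    def chop(self, start: int, end: int) -> "MonomerSketch":
        """Return the 1-based inclusive region start..end, keeping the marker."""
        # replace() copies every field that is not overridden, so skip_msa
        # travels with the chopped object automatically.
        return replace(
            self,
            description=f"{self.description}_{start}-{end}",
            sequence=self.sequence[start - 1:end],
        )


monomer = MonomerSketch("protA", "MKTAYIAKQR", skip_msa=True)
chopped = monomer.chop(2, 5)
```

In this sketch the chopped object inherits skip_msa because the copy starts from the parent's fields, which is one simple way to avoid losing the marker during prediction setup.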

User Impact

Users can now generate lightweight query-only feature pickles across AF2, MMseqs2, and AF3 workflows without running bulk MSA searches. Those pickles are guarded at prediction time so they cannot be paired accidentally; they must be used with --pair_msa=False.

Validation

  • PYTHONPATH=/tmp/ap_pyshim:$PYTHONPATH conda run -n AlphaPulldown python -m pytest -q test/unit/test_modelling_setup.py test/unit/test_script_entrypoints.py test/unit/test_objects.py test/integration/test_create_individual_features.py
  • Result: 137 passed

Note: the temporary PYTHONPATH shim was only needed on this workstation to work around the local AlphaFold/Biopython import mismatch during test collection. It is not part of the repo changes.

@DimaMolod marked this pull request as ready for review on April 10, 2026 09:43
@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6eaa0922e8

Comment on lines +351 to +355
if FLAGS.pair_msa and any(getattr(interactor, "skip_msa", False) for interactor in interactors):
    raise ValueError(
        "--skip_msa generates query-only MSAs and cannot be combined with "
        "--pair_msa=True. Re-run structure prediction with --pair_msa=False."
    )

P1: Restrict skip_msa pairing check to multimer jobs

pre_modelling_setup now raises as soon as any interactor has skip_msa=True, before determining whether the job is monomeric or multimeric. Since pair_msa defaults to True in this script, a single-chain prediction built with --skip_msa now fails even though no MSA pairing is performed for monomer runs, which blocks the query-only workflow unless users pass an unrelated override.
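
One possible shape for the narrower check is sketched below. The function name check_pairing_allowed is hypothetical, and the sketch assumes that a job with a single interactor is monomeric; how the real pre_modelling_setup distinguishes monomer from multimer jobs (for example, homo-oligomers) may differ.

```python
from types import SimpleNamespace


def check_pairing_allowed(interactors, pair_msa: bool) -> None:
    """Raise only when MSA pairing would be attempted on query-only features."""
    # Assumption for this sketch: exactly one interactor means a monomer job.
    is_multimer = len(interactors) > 1
    uses_skip_msa = any(getattr(i, "skip_msa", False) for i in interactors)
    if pair_msa and is_multimer and uses_skip_msa:
        raise ValueError(
            "--skip_msa generates query-only MSAs and cannot be combined with "
            "--pair_msa=True. Re-run structure prediction with --pair_msa=False."
        )


# A single query-only chain passes even with the default pair_msa=True.
check_pairing_allowed([SimpleNamespace(skip_msa=True)], pair_msa=True)
```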


Comment on lines +477 to +483
if FLAGS.skip_msa and not getattr(monomer, "skip_msa", False):
    logging.info(
        "Existing monomer features for %s were generated with bulk MSAs. "
        "Recomputing query-only features for --skip_msa.",
        source_name,
    )
    return None

P1: Block reuse of skip_msa cache for full-MSA generation

The reuse guard only handles --skip_msa reusing a bulk-MSA pickle, but not the reverse mismatch. If a cached monomer was created with skip_msa=True and the user reruns without --skip_msa, this path still reuses the query-only pickle, silently skipping full MSA regeneration and propagating skip_msa=True into outputs (which can later trigger pairing errors in prediction).
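
A symmetric guard that covers both directions could look roughly like the sketch below. The helper reusable_monomer is a hypothetical name, not a function from the repository; it assumes, as the getattr default in the snippet above does, that a cached object without a skip_msa attribute was built with bulk MSAs.

```python
from types import SimpleNamespace
from typing import Any, Optional


def reusable_monomer(cached: Any, skip_msa_requested: bool) -> Optional[Any]:
    """Return the cached monomer only if its MSA mode matches the request."""
    # Objects without the marker are treated as bulk-MSA features.
    cached_skip_msa = bool(getattr(cached, "skip_msa", False))
    if cached_skip_msa != skip_msa_requested:
        return None  # mismatch in either direction: the caller should recompute
    return cached


query_only = SimpleNamespace(skip_msa=True)
bulk = SimpleNamespace()  # no marker, so treated as bulk-MSA

# Reverse mismatch from the review comment: query-only cache, full MSA requested.
rejected = reusable_monomer(query_only, skip_msa_requested=False)
```

Comparing the two flags for equality handles the case already covered by the PR and the reverse case in one condition, so neither kind of stale pickle is reused silently.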


@DimaMolod merged commit 2c71500 into main on Apr 10, 2026
6 checks passed
@DimaMolod deleted the fix-42-skip-msa branch on April 10, 2026 13:57
Linked issue (may be closed by merging this pull request): Requesting feature: ability to skip MSA generation step
