fix: resolve breaking issues in docking pipeline #2
- Makefile: expand LD_LIBRARY_PATH with NVIDIA CUDA library paths to fix the DiffDock NVRTC crash at runtime
- scripts/run_guild.py: add a --no-decoys flag to allow running without a decoy file present
- guild/bulk.py: add the PROTEINS_FOLDER import, prefer a single-chain PDB as the Boltz2 template, and retry without a template on an empty manifest (fixes the Boltz2 template-parsing IndexError)
Pull request overview
This PR addresses runtime breakages observed during GPU docking runs by improving container CUDA library visibility, adding a CLI option to skip decoys, and making Boltz2 templating more robust during bulk docking.
Changes:
- Add a `--no-decoys` flag to allow running the pipeline when a decoy file is unavailable.
- Improve Boltz2 template handling by preferring a single-chain cleaned PDB template and retrying without a template when Boltz produces an empty manifest.
- Extend `LD_LIBRARY_PATH` in Docker run targets to include additional NVIDIA CUDA-related shared library paths.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/run_guild.py | Adds `--no-decoys` CLI flag and wires it into `BulkRun(use_decoys=...)`. |
| guild/bulk.py | Uses a single-chain cleaned PDB as the Boltz template and retries Boltz without a template when the manifest is empty. |
| Makefile | Updates GPU docker run targets to include additional NVIDIA library paths in `LD_LIBRARY_PATH`. |
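
The flag wiring described for scripts/run_guild.py can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `build_parser` and `resolve_use_decoys` are hypothetical names, and the real script presumably passes the resulting boolean on as `BulkRun(use_decoys=...)`.

```python
import argparse
from typing import Sequence


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; the real script likely defines more options."""
    parser = argparse.ArgumentParser(description="Docking pipeline runner (sketch).")
    # store_true makes the flag default to False, so decoys stay enabled
    # unless the user explicitly opts out.
    parser.add_argument(
        "--no-decoys",
        action="store_true",
        help="Run without a decoy file.",
    )
    return parser


def resolve_use_decoys(argv: Sequence[str]) -> bool:
    """Translate the negative CLI flag into a positive use_decoys boolean."""
    args = build_parser().parse_args(argv)
    return not args.no_decoys
```

Inverting the flag at the boundary keeps the downstream parameter positive (`use_decoys`), which tends to read more clearly than carrying a negated name through the code.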

    with open(manifest_path) as _mf:
        _manifest = _json.load(_mf)

manifest.json is loaded without any error handling; if the file is truncated/invalid (e.g., Boltz interrupted mid-write) this will raise JSONDecodeError and abort the entire batch. Consider wrapping the manifest load in try/except and treating parse errors like an empty manifest (log + retry without template, or skip with warning).

Suggested change:

    - with open(manifest_path) as _mf:
    -     _manifest = _json.load(_mf)
    + try:
    +     with open(manifest_path) as _mf:
    +         _manifest = _json.load(_mf)
    + except (_json.JSONDecodeError, OSError) as exc:
    +     logger.warning(
    +         f"Boltz2 produced unreadable manifest for {run_id}: {exc}. "
    +         "Retrying without template..."
    +     )
    +     _manifest = {}

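
One way to keep this handling testable is to isolate it in a small helper. The sketch below is an assumption about how that could look, not code from the PR: `load_manifest_records` is a hypothetical name, and treating a missing file the same as an unreadable one is a choice made here (the PR's code only inspects the manifest when the file exists).

```python
import json
import logging

logger = logging.getLogger(__name__)


def load_manifest_records(manifest_path):
    """Return the manifest's records, or [] if it is missing, unreadable, or empty."""
    try:
        with open(manifest_path) as fh:
            manifest = json.load(fh)
    except (json.JSONDecodeError, OSError) as exc:
        # A truncated file (e.g. an interrupted write) raises JSONDecodeError;
        # a missing or unreadable file raises an OSError subclass.
        logger.warning("Unreadable manifest %s: %s", manifest_path, exc)
        return []
    if not isinstance(manifest, dict):
        # Valid JSON, but not the expected object shape.
        return []
    return manifest.get("records") or []
```

The caller can then decide to retry without a template whenever the returned list is empty, regardless of why it is empty.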

    # Check if Boltz produced valid output (manifest with records).
    # Template PDB parsing can fail silently in Boltz2, resulting
    # in an empty manifest. If that happens, retry without the template.
    manifest_path = (
        f"{boltz_out_dir}/boltz_results_{run_id}_boltz/processed/manifest.json"
    )
    if os.path.exists(manifest_path):
        import json as _json

        with open(manifest_path) as _mf:
            _manifest = _json.load(_mf)
        if not _manifest.get("records"):
            logger.warning(
                f"Boltz2 produced empty manifest for {run_id} "
                "(likely template parsing failure). Retrying without template..."
            )

The new Boltz empty-manifest retry path is complex and impacts run stability, but there are no unit tests exercising it. Since this repo already has BulkRun tests, consider adding a test that mocks deploy_boltz/generate_boltz_yaml and verifies a retry occurs when manifest.json has no records.
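
A test along these lines might look like the sketch below. It is only an illustration of the mocking pattern: `run_with_template_fallback` is a hypothetical stand-in for the retry logic in guild/bulk.py, and a real test would instead patch `deploy_boltz` and `generate_boltz_yaml`, whose signatures are not shown in this PR.

```python
from unittest import mock


def run_with_template_fallback(run_boltz, read_records, template):
    """Hypothetical stand-in: run once with the template, then once more
    without it if the first run produced no manifest records."""
    run_boltz(template=template)
    if template is not None and not read_records():
        run_boltz(template=None)


def test_retries_without_template_on_empty_manifest():
    run_boltz = mock.Mock()
    read_records = mock.Mock(return_value=[])  # simulates an empty manifest

    run_with_template_fallback(run_boltz, read_records, template="protein.pdb")

    # Expect the templated run first, followed by the template-free retry.
    assert run_boltz.call_args_list == [
        mock.call(template="protein.pdb"),
        mock.call(template=None),
    ]


def test_no_retry_when_manifest_has_records():
    run_boltz = mock.Mock()
    read_records = mock.Mock(return_value=[{"id": "a"}])

    run_with_template_fallback(run_boltz, read_records, template="protein.pdb")

    run_boltz.assert_called_once_with(template="protein.pdb")
```

Both functions follow pytest's naming convention, so they would be collected automatically if placed in a test module.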

    -e LD_LIBRARY_PATH=/opt/localcolabfold/.pixi/envs/default/lib:/usr/local/lib:/app/.venv/lib/python3.10/site-packages/nvidia/cu13/lib:/app/.venv/lib/python3.10/site-packages/nvidia/cuda_nvrtc/lib:/app/.venv/lib/python3.10/site-packages/nvidia/cudnn/lib:/app/.venv/lib/python3.10/site-packages/nvidia/cublas/lib \
    guild:latest \

LD_LIBRARY_PATH is fully hardcoded here and duplicated across multiple targets, which can drift from the value baked into the image and is easy to forget to update in one place. Consider factoring this into a Makefile variable and/or appending to the container’s existing LD_LIBRARY_PATH instead of replacing it entirely.

    -e LD_LIBRARY_PATH=/opt/localcolabfold/.pixi/envs/default/lib:/usr/local/lib:/app/.venv/lib/python3.10/site-packages/nvidia/cu13/lib:/app/.venv/lib/python3.10/site-packages/nvidia/cuda_nvrtc/lib:/app/.venv/lib/python3.10/site-packages/nvidia/cudnn/lib:/app/.venv/lib/python3.10/site-packages/nvidia/cublas/lib \
    guild:latest \

LD_LIBRARY_PATH is duplicated here (and differs from the image’s ENV LD_LIBRARY_PATH), which increases the chance of future drift between targets/images. Consider reusing a single Makefile variable (shared with run-boltz) and/or appending to the existing container LD_LIBRARY_PATH rather than replacing it.
Fixes three breaking issues discovered during GPU docking runs (Boltz2 + DiffDock).
All three fixes were validated end-to-end on a GPU node.