MarkusKhoa/protein-project

Protein Structure Analysis Project

Project Overview

This repository contains implementations and experiments for protein structure analysis using several deep learning approaches.

Reproducible Pipeline (Binding Site Prediction)

The run_pipeline.py script provides an end-to-end, reproducible pipeline extracted from GraphSAGE-improving.ipynb. It supports multiple GNN backbones: GraphSAGE, GCN, and GAT.

Quick Start

# Full run with default config (GraphSAGE)
python run_pipeline.py --config configs/graphsage_default.yaml

# Select backbone via CLI
python run_pipeline.py --config configs/graphsage_default.yaml --model gcn
python run_pipeline.py --config configs/graphsage_default.yaml --model gat

# Smoke test (minimal data for quick verification)
python run_pipeline.py --config configs/graphsage_default.yaml --smoke

CLI Options

  • --config – Path to YAML config (default: built-in defaults)
  • --model – GNN backbone: graphsage, gcn, or gat
  • --device – Device: cuda or cpu
  • --seed – Random seed for reproducibility
  • --save-dir – Output directory for checkpoints and metrics
  • --smoke – Smoke test with 4 train + 2 test samples

Data Paths

Update configs/graphsage_default.yaml or pass a custom config:

  • train_csv: Training CSV with prot_id, sequence, labels (list format)
  • test_csv: Test CSV with same columns
  • pdb_dir: Directory containing PDB files (e.g. {prot_id}.pdb or {prot_id}_alphafold.pdb)
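
A minimal config covering just these keys might look like the sketch below; the paths are placeholders, and any additional options supported by run_pipeline.py are omitted:

```yaml
# configs/graphsage_default.yaml (illustrative; adjust paths to your data)
train_csv: data/train.csv
test_csv: data/test.csv
pdb_dir: data/esmFold_pdb_files
```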

Artifacts

Outputs are saved under artifacts/ (or --save-dir):

  • {backbone}_best_model.pth: Best model checkpoint
  • run_metadata.json: Config, metrics, timestamp

Pipeline Layout

  • pipeline/config.py – Configuration dataclasses
  • pipeline/io.py – Data loading and path resolution
  • pipeline/embeddings.py – ESM-2 tokenization and embeddings
  • pipeline/graph_features.py – Structure features and graph construction
  • pipeline/models.py – GCN, GraphSAGE, GAT backbones
  • pipeline/losses.py – Loss functions and binding features
  • pipeline/train.py – Training loop
  • pipeline/evaluate.py – Evaluation metrics
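
The modules are listed roughly in the order run_pipeline.py invokes them. A self-contained toy sketch of that control flow (every function below is a stand-in with invented signatures; the real APIs live in the files listed above):

```python
# Toy stand-ins for the pipeline stages; real signatures differ.
def load_data(csv_path):                # pipeline/io.py
    return [{"prot_id": "P1", "sequence": "MKT", "labels": [0, 1, 0]}]

def embed(records):                     # pipeline/embeddings.py (ESM-2)
    return {r["prot_id"]: [0.0] * len(r["sequence"]) for r in records}

def build_graphs(records, embeddings):  # pipeline/graph_features.py
    return [{"nodes": embeddings[r["prot_id"]], "y": r["labels"]}
            for r in records]

def train(graphs, model="graphsage"):   # pipeline/models.py + train.py
    return {"model": model, "epochs_run": 1}

def evaluate(model_state, graphs):      # pipeline/evaluate.py
    return {"n_graphs": len(graphs)}

records = load_data("train.csv")
graphs = build_graphs(records, embed(records))
metrics = evaluate(train(graphs), graphs)
print(metrics)  # → {'n_graphs': 1}
```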

Existing helper scripts (data_preparation.py, features_extraction.py, alphafold_data_ingestion.py, etc.) remain for creating multi-label and processed datasets.

Repository Structure

  • run_pipeline.py: Main CLI for binding site prediction
  • pipeline/: Modular pipeline package
  • configs/: YAML configuration files
  • GraphSAGE-improving.ipynb: Original notebook (reference)
  • data_preparation.py, features_extraction.py, etc.: Dataset preparation helpers

Important Notice

Before running any code:

  1. Data Paths: Update paths in configs/graphsage_default.yaml or your config
  2. PDB Files: Ensure data/esmFold_pdb_files (or your pdb_dir) contains PDB files for all protein IDs in train/test CSVs
  3. Execution: run_pipeline.py handles embeddings, graph construction, training, and evaluation in one command
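
Point 2 can be checked up front with a small helper (not part of this repo; missing_pdbs is a hypothetical name, and it tests both filename patterns mentioned under Data Paths):

```python
import csv
from pathlib import Path

def missing_pdbs(csv_path, pdb_dir):
    """Return prot_ids from a train/test CSV with no matching PDB file.

    Checks both filename patterns mentioned under "Data Paths":
    {prot_id}.pdb and {prot_id}_alphafold.pdb.
    """
    pdb_dir = Path(pdb_dir)
    missing = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            pid = row["prot_id"]
            patterns = (f"{pid}.pdb", f"{pid}_alphafold.pdb")
            if not any((pdb_dir / p).is_file() for p in patterns):
                missing.append(pid)
    return missing
```

Run it on both train_csv and test_csv before a full training run; an empty list means every protein ID has a structure file.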

Prerequisites

  • Python: 3.10 or 3.11 is recommended (broad wheel support for PyTorch and scientific stacks; some packages skip 3.9 or cap below 3.12).
  • Hardware: GPU optional; CUDA builds of PyTorch are separate from this repo.

Installation (reproducible setup)

Structure-related code (pipeline/graph_features.py, features_extraction.py, utils.py) imports MDTraj and may call DSSP (secondary structure). Follow the path that matches your OS.

Option A — Conda (recommended, especially on Windows)

Conda-forge provides prebuilt mdtraj and the mkdssp (DSSP) binary, avoiding MSVC compiler errors and missing PyPI packages.

conda create -n esm-orion python=3.11 -y
conda activate esm-orion
conda config --env --add channels conda-forge
conda config --env --set channel_priority strict
conda install -y mdtraj mkdssp
pip install -r requirements.txt
  • mdtraj: On Windows, pip install mdtraj often falls back to a source build and fails without Microsoft C++ Build Tools. Installing mdtraj from conda-forge avoids that.
  • mkdssp: DSSP is a standalone program, not a Python package named dssp on PyPI. Do not add dssp to requirements.txt. After conda install mkdssp, the executable should be on your PATH inside the env.
  • Biopython DSSP: features_extraction.py uses DSSP(..., dssp='/usr/bin/dssp') for the Biopython-based path. On Windows, point that argument to your conda env’s mkdssp.exe (for example under %CONDA_PREFIX%\Scripts\). The MDTraj-based helpers use md.compute_dssp and rely on mkdssp on PATH.
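
The DSSP lookup described above can be made portable with a small helper (a sketch, not repo code; resolve_dssp is a hypothetical name):

```python
import os
import shutil
import sys

def resolve_dssp() -> str:
    """Locate a DSSP executable, preferring whatever is on PATH.

    Mirrors the notes above: conda-forge installs `mkdssp` into the
    active env, while some Linux packages still ship it as `dssp`.
    """
    for name in ("mkdssp", "dssp"):
        found = shutil.which(name)
        if found:
            return found
    # Fall back to the active conda env's Scripts/ (Windows) or bin/ dir.
    prefix = os.environ.get("CONDA_PREFIX")
    if prefix:
        exe = "mkdssp.exe" if sys.platform == "win32" else "mkdssp"
        for sub in ("Scripts", "bin"):
            candidate = os.path.join(prefix, sub, exe)
            if os.path.isfile(candidate):
                return candidate
    raise FileNotFoundError("No mkdssp/dssp executable found; see install notes")
```

The returned path can then be passed as the dssp= argument to Biopython's DSSP class in place of the hard-coded '/usr/bin/dssp'.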

Option B — pip-only (Linux / macOS, or Windows with C++ Build Tools)

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt
  • Windows + pip: If mdtraj tries to compile from source, install Visual Studio Build Tools (Desktop development with C++) or prefer Option A.
  • DSSP binary: Still required for md.compute_dssp / Biopython DSSP. Install mkdssp from your OS package manager, conda-forge, or build from source, and ensure it is on PATH.

Optional checks

python -c "import mdtraj as md; print('mdtraj', md.__version__)"
mkdssp --version   # or: dssp --version

Contact

For questions or issues, please open a GitHub issue in this repository.
