RIBEX combines protein language model embeddings with graph-derived positional encodings from the human STRING protein-protein interaction network for RNA-binding protein prediction. The repository contains the raw-data builders, embedding generation, dataset assembly, FiLM-PE training, LoRA fine-tuning, and the explainability pipeline used for PE-scan clustering and enrichment. For more information, refer to our bioRxiv paper.
Create the shared conda environment on Lustre:
conda env create -f environment.yaml
conda activate rbp_ig_lustre

Set the storage root used by the scripts. The code defaults to /path/to/RBP_IG_storage, but setting it explicitly is safer:
export REPOSITORY=/path/to/RBP_IG_storage

Several legacy HPC launchers also contain explicit placeholders such as /path/to/RBP_IG, /path/to/RBP_IG_storage, and /path/to/miniconda3/bin/activate. Replace those with your local checkout path, storage path, and conda installation before submitting jobs.
Optional: if you do not want W&B runs uploaded, run
export WANDB_MODE=offline

The scripts read and write under ${REPOSITORY}/data:
${REPOSITORY}/data/
├── data_original/
│ ├── bressin19/
│ ├── InterPro/
│ └── RIC/
├── data_raw/
├── data_sets/
├── embeddings/
├── figures/
├── logs/
├── models/
└── splits/
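If you prefer to create the layout up front, a minimal sketch is below. It simply mirrors the tree shown above; whether the pipeline scripts create missing folders on demand is an assumption, so pre-creating them is harmless either way.

```python
import os
import tempfile

# Subfolders mirroring the ${REPOSITORY}/data tree shown above.
SUBDIRS = [
    "data_original/bressin19",
    "data_original/InterPro",
    "data_original/RIC",
    "data_raw",
    "data_sets",
    "embeddings",
    "figures",
    "logs",
    "models",
    "splits",
]

def create_storage_layout(repository: str) -> None:
    """Create the data tree under the given storage root."""
    for sub in SUBDIRS:
        os.makedirs(os.path.join(repository, "data", sub), exist_ok=True)

# Demo on a throwaway root; in practice pass os.environ["REPOSITORY"].
demo_root = tempfile.mkdtemp()
create_storage_layout(demo_root)
```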
The Git checkout itself is used for code, helper scripts, random-search launchers, and local run folders such as LoRA trial directories.
The command inventory is in pipeline.sh. The effective order is:
- Put the original source files into ${REPOSITORY}/data/data_original/bressin19, ${REPOSITORY}/data/data_original/InterPro, and ${REPOSITORY}/data/data_original/RIC.
- Build the harmonised raw tables:
python3 scripts/data_raw/generate_Bressin19.py
python3 scripts/data_raw/generate_InterPro.py
python3 scripts/data_raw/generate_RIC.py
python3 scripts/data_raw/analyze.py
- Run the sequence clustering step before dataset generation. This appends cluster_number to the raw TSVs and writes the MMseqs2 clustering files used later for leakage-aware splits:
python3 scripts/data_raw/cluster_tsv_data.py
- Generate embeddings. The full set of model-specific commands is in pipeline.sh; a common example is:
python3 scripts/embeddings/generate.py --device cuda:0 --languageModel esm2_t33_650M_UR50D --precision f16 --maxSeqLen 2000
- Build the downstream datasets:
python3 scripts/data_sets/generate.py
python3 scripts/data_sets/analyze.py

This creates files such as ${REPOSITORY}/data/data_sets/RIC_human_fine-tuning.pkl and ${REPOSITORY}/data/data_sets/bressin19_human_fine-tuning.pkl.
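A quick sanity check that the expected pickles were produced can look like the sketch below; the two file names come from the step above, while the helper itself is illustrative rather than part of the repository.

```python
import os
import pickle
import tempfile

EXPECTED = ("RIC_human_fine-tuning.pkl", "bressin19_human_fine-tuning.pkl")

def check_datasets(repository: str, names=EXPECTED):
    """Return the expected dataset pickles that are missing or unreadable."""
    problems = []
    for name in names:
        path = os.path.join(repository, "data", "data_sets", name)
        try:
            with open(path, "rb") as f:
                pickle.load(f)
        except (OSError, pickle.UnpicklingError, EOFError):
            problems.append(name)
    return problems

# Demo against a throwaway root containing only one of the two files.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "data", "data_sets"))
with open(os.path.join(demo_root, "data", "data_sets", EXPECTED[0]), "wb") as f:
    pickle.dump({"placeholder": True}, f)
missing = check_datasets(demo_root)
```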
RIBEX expects the precomputed positional-encoding assets:
${REPOSITORY}/data/data_sets/ranks_personalized_page_rank_0.5_v12_all.npy
${REPOSITORY}/data/data_sets/gene_names_0.5_v12_all.npy
If you need to regenerate them from STRING:
- Confirm the latest official STRING release on the version-history page: https://string-db.org/cgi/access. At the time of writing, the current STRING release is 12.0.
- Go to the official download page: https://string-db.org/cgi/download.pl.
- Restrict the download to Homo sapiens / taxon 9606 and download the filtered v12 full interaction table named 9606.protein.links.full.v12.0.txt.gz.
- Place that file at ${REPOSITORY}/data/data_original/string_db/9606.protein.links.full.v12.0.txt.gz (create the directory first with mkdir -p ${REPOSITORY}/data/data_original/string_db).
- Generate the global PPI positional-encoding assets with:
python3 scripts/data_sets/positional_encoding.py \
    --string-links ${REPOSITORY}/data/data_original/string_db/9606.protein.links.full.v12.0.txt.gz

This writes:
${REPOSITORY}/data/data_sets/ranks_personalized_page_rank_0.5_v12_all.npy
${REPOSITORY}/data/data_sets/gene_names_0.5_v12_all.npy
At training and inference time, gene IDs are mapped to STRING IDs through the STRING get_string_ids API in positional_encoding_processing.py.
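The actual asset builder is scripts/data_sets/positional_encoding.py; the sketch below only illustrates the underlying computation, personalized PageRank by power iteration, on a toy dense adjacency matrix. Reading the 0.5 in the asset filenames as the damping/restart parameter is an assumption.

```python
import numpy as np

def personalized_pagerank(adj, seed_idx, alpha=0.5, n_iter=100):
    """Power iteration for personalized PageRank: with probability alpha
    follow a random edge, otherwise restart at the seed node."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-stochastic transition matrix (rows with no edges stay zero).
    P = np.divide(adj, out_deg, out=np.zeros_like(adj), where=out_deg > 0)
    restart = np.zeros(n)
    restart[seed_idx] = 1.0
    r = restart.copy()
    for _ in range(n_iter):
        r = alpha * (P.T @ r) + (1.0 - alpha) * restart
    return r

# Toy 4-protein interaction graph; node 0 is the seed protein.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
scores = personalized_pagerank(adj, seed_idx=0)
ranks = np.argsort(-scores)  # node order, most related to the seed first
```

In the real assets, one such ranking per protein over the full STRING graph is what ends up in ranks_personalized_page_rank_0.5_v12_all.npy, with gene_names_0.5_v12_all.npy holding the node order.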
Generate fair train/held-out splits that are consistent across FiLM PE and LoRA:
bash scripts/data_util/generate_shared_splits_any.sh RIC 2023

This writes split files under ${REPOSITORY}/data/splits/.
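The real split generator is the shell script above; as a sketch of the leakage-aware idea (assuming cluster_number from the MMseqs2 step groups homologous sequences), whole clusters are assigned to one side so no cluster straddles the split:

```python
import random

def cluster_aware_split(cluster_ids, holdout_frac=0.2, seed=2023):
    """Assign every sequence to 'train' or 'holdout' by its cluster, so
    members of one MMseqs2 cluster never land on both sides of the split."""
    clusters = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_holdout = max(1, int(round(len(clusters) * holdout_frac)))
    holdout = set(clusters[:n_holdout])
    return ["holdout" if c in holdout else "train" for c in cluster_ids]

# One label per sequence, driven entirely by its cluster membership.
labels = cluster_aware_split([0, 0, 1, 2, 2, 3, 4, 4], seed=2023)
```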
Launch the LoRA random search:
bash scripts/training/run_scripts/run_LoRA_fine_tuning_random_search.sh

Useful overrides:
SEED=2024 LM_NAME=protT5_xl_uniref50 NUM_TRIALS=30 bash scripts/training/run_scripts/run_LoRA_fine_tuning_random_search.sh

Launch the FiLM-PE random search:
bash scripts/training/run_scripts/run_FiLM_PE_fine_tuning_random_search.sh

Useful overrides:
SEED=2024 LM_NAME=esm2_t36_3B_UR50D NUM_TRIALS=30 bash scripts/training/run_scripts/run_FiLM_PE_fine_tuning_random_search.sh

Both launchers now enforce the same protocol on the held-out split:
- Train each trial on the shared training split.
- Use the saved held-out predictions per epoch.
- Split that held-out set into:
- 1/3 nested validation for best-epoch selection
- 2/3 nested test for reporting the hyperparameter combination
- Rank hyperparameter combinations by nested-validation AUPRC only.
- Report the corresponding nested-test metrics separately.
The post-search evaluator is evaluate_random_search_nested_holdout.py.
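The protocol above can be sketched as follows; average_precision is a plain AUPRC implementation, the per-epoch score arrays stand in for the saved held-out predictions, and the fixed first-third split is for illustration only (the real evaluator records its split in nested_validation_split.tsv and nested_test_split.tsv).

```python
import numpy as np

def average_precision(y_true, y_score):
    """AUPRC as average precision: mean precision at each positive, ranked by score."""
    order = np.argsort(-y_score)
    y = y_true[order]
    precision = np.cumsum(y) / (np.arange(len(y)) + 1)
    return float(precision[y == 1].mean())

def nested_holdout_select(y_true, per_epoch_scores, val_frac=1 / 3):
    """Split the held-out set into nested validation and nested test,
    pick the best epoch by validation AUPRC, report test AUPRC there."""
    n_val = int(len(y_true) * val_frac)
    val, test = np.arange(n_val), np.arange(n_val, len(y_true))
    val_ap = [average_precision(y_true[val], s[val]) for s in per_epoch_scores]
    best_epoch = int(np.argmax(val_ap))
    return best_epoch, average_precision(y_true[test], per_epoch_scores[best_epoch][test])

# Toy held-out labels and two epochs of saved predictions:
# epoch 0 ranks positives last, epoch 1 ranks them perfectly.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
best_epoch, test_auprc = nested_holdout_select(y, [1.0 - y, y.astype(float)])
```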
Each search writes:
results/random_search/<search_tag>/
├── manifest.tsv
├── nested_validation_split.tsv
├── nested_test_split.tsv
├── random_search_per_epoch.tsv
├── random_search_leaderboard.tsv
└── best_trial.json
If you want a single final model after the search, rerun scripts/training/train.py once with the selected hyperparameters from best_trial.json.
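Turning best_trial.json into that final invocation might look like the sketch below; the "hyperparameters" key and the flag names are assumptions about the file's schema, so check your own best_trial.json before relying on this.

```python
import json
import os
import tempfile

def final_train_command(best_trial_path):
    """Build a train.py command line from the selected hyperparameters.
    Assumes best_trial.json stores them under a 'hyperparameters' key."""
    with open(best_trial_path) as f:
        trial = json.load(f)
    flags = [f"--{k} {v}" for k, v in sorted(trial.get("hyperparameters", {}).items())]
    return "python3 scripts/training/train.py " + " ".join(flags)

# Demo with a fabricated trial file (field names are illustrative).
path = os.path.join(tempfile.mkdtemp(), "best_trial.json")
with open(path, "w") as f:
    json.dump({"hyperparameters": {"lr": 1e-4, "lora_rank": 8}}, f)
cmd = final_train_command(path)
```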
The reproducible PE-scan clustering workflow is:
bash repro_pe_scan_pipeline.sh

That runs:
- analyze_pe_scan_effect.py
- cluster_pe_scan_nodes.py
- enrichment_pe_clusters.py
- plot_pe_clusters_enrichment_labeled.py
- pipeline.sh is the quick reference for the main commands.
- The repository has many historical run folders; the new random-search launchers isolate trials by run_tag so evaluation only picks up the intended batch.
- The raw-data builders use online InterPro / MobiDB / STRING services, so network access is required when regenerating those assets from scratch.