HIM-AIM/AptaFlow

AptaFlow

AptaFlow is a conditional Poisson Flow–based generative framework for protein-specific aptamer design. It integrates:

  • CPAP: Comparative Protein–Aptamer Pretraining module
  • PCPFGM: Protein-Conditioned Poisson Flow Generative Model

Together, these modules enable cross-target conditional generation without per-target retraining.

Quickstart

1) Environment

We recommend using conda to reproduce the environment:

conda env create -f env.yaml
conda activate pfgmpp

2) Checkpoints

  • CPAP: training-runs/cpap_cluster_train_0605/training-state-010003.pt
  • PCPFGM (ACPFGM in our runs): training-runs/MASSA_CPAP_AdaLN_101_cluster_22_ESMC_final_true/training-state-067738.pt

3) Datasets

Datasets used in this work are provided under datasets/.

  • datasets/MASSA_labels/: protein embeddings generated by the pretrained MASSA model (.npy, named by index)
  • datasets/CPAP_MASSA_labels_clustered/: protein conditioning vectors produced by CPAP (.npy, named by index)
  • datasets/133_split_to_100_20000_cluster.zip: processed HT-SELEX aptamer dataset for PCPFGM training
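Because the label files are named by index, it can help to confirm which protein indices are actually present before training or generation. The following is a minimal sketch, not part of the repository: the helper names `list_label_indices` and `missing_indices` are hypothetical, and the code assumes each file is named `<index>.npy` (for example `126.npy`), which the description above implies but does not spell out.

```python
from pathlib import Path


def list_label_indices(label_dir):
    """Return the sorted integer indices of `<index>.npy` files in label_dir.

    Assumption (not confirmed by the repository): each label file is named
    by its integer index, e.g. `126.npy`. Files with other names are ignored.
    """
    indices = []
    for path in Path(label_dir).glob("*.npy"):
        if path.stem.isdigit():
            indices.append(int(path.stem))
    return sorted(indices)


def missing_indices(indices):
    """Return the indices absent between the smallest and largest index found."""
    if not indices:
        return []
    present = set(indices)
    return [i for i in range(min(indices), max(indices) + 1) if i not in present]
```

For example, `missing_indices(list_label_indices("datasets/CPAP_MASSA_labels_clustered"))` would list any gaps in the label directory, so a `--class` value can be checked against the files that exist.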

Data and model weights will be released after this work is published.

Training (PCPFGM)

Train a PCPFGM model with protein conditioning:

python train.py \
  --outdir training-runs \
  --name MASSA_CPAP_AdaLN_101_cluster_22_ESMC_final_true \
  --data datasets/133_split_to_100_20000_cluster.zip \
  --cond 1 \
  --arch ncsnpp \
  --pfgmpp 1 \
  --batch 32 \
  --aug_dim 2048 \
  --cpap 1

By default, the script reads protein condition labels from datasets/CPAP_MASSA_labels_clustered. If you use a different label directory, update train/dataset.py:75 accordingly.

Key Arguments

  • --data: Training dataset file or directory
  • --pfgmpp: Enable PFGM++ training
  • --aug_dim: Augmented dimension for Poisson flow
  • --cpap: Enable protein conditioning via CPAP
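When launching several runs (for instance, with different batch sizes or augmented dimensions), it may be convenient to assemble the training command programmatically. The sketch below is a hypothetical wrapper, not part of the repository: `build_train_command` and the run name `my_run` are illustrative, and only flags that appear in the command above are used.

```python
import shlex


def build_train_command(name, data, batch=32, aug_dim=2048, outdir="training-runs"):
    """Assemble the train.py invocation shown above as an argv list.

    Only flags documented in this README are emitted; the defaults mirror
    the example command. This helper is hypothetical.
    """
    return [
        "python", "train.py",
        "--outdir", outdir,
        "--name", name,
        "--data", data,
        "--cond", "1",
        "--arch", "ncsnpp",
        "--pfgmpp", "1",
        "--batch", str(batch),
        "--aug_dim", str(aug_dim),
        "--cpap", "1",
    ]


cmd = build_train_command(
    name="my_run",  # hypothetical run name
    data="datasets/133_split_to_100_20000_cluster.zip",
)
# Print a shell-quoted form of the command for inspection.
print(shlex.join(cmd))
```

To launch the run, the list can be passed to `subprocess.run(cmd, check=True)` from the repository root.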

Generation

Step 1: Obtain Protein Conditions

  1. Use MASSA to generate protein embeddings (.npy).
  2. Put embedding files into a directory (e.g. datasets/MASSA_labels/).
  3. Convert MASSA embeddings into CPAP protein conditions:

python cpap_sample.py \
  --input_dir datasets/MASSA_labels \
  --output_dir datasets/CPAP_MASSA_labels_clustered \
  --ckpt training-runs/cpap_cluster_train_0605/training-state-010003.pt

To train a custom CPAP module on your own data, use cpap_train.py.

Step 2: Generate aptamer sequences

Generate aptamer sequences conditioned on a specific protein (e.g. --class 126):

python generate.py \
  --seeds 0-2999 \
  --outdir training-runs/MASSA_CPAP_AdaLN_101_cluster_22_ESMC_final_true \
  --pfgmpp 1 \
  --aug_dim 2048 \
  --class 126 \
  --use_pickle 0 \
  --save_images \
  --cpap True

By default, this script reads protein condition labels under datasets/CPAP_MASSA_labels_clustered. If you use custom labels, modify the relevant label-loading code in generate.py:378-388.

Key Arguments

  • --outdir: directory containing .pt weight files (if multiple weights exist, each will be used)
  • --class: index of the target protein condition
  • --seeds: random seed range (controls the number of generated sequences)
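To make the relationship between `--seeds` and the number of outputs concrete, the sketch below expands a seed specification into an explicit list. It assumes the range is inclusive on both ends and that one sequence is generated per seed, as in the EDM generation script this codebase is adapted from; the parser in generate.py may differ, and `parse_seed_range` is a hypothetical name.

```python
import re


def parse_seed_range(spec):
    """Expand a spec such as '0-2999' or '1,4,7-9' into a list of ints.

    Assumption: ranges are inclusive on both ends.
    """
    seeds = []
    range_re = re.compile(r"^(\d+)-(\d+)$")
    for part in spec.split(","):
        part = part.strip()
        match = range_re.match(part)
        if match:
            start, stop = int(match.group(1)), int(match.group(2))
            seeds.extend(range(start, stop + 1))  # inclusive upper bound
        else:
            seeds.append(int(part))
    return seeds


seeds = parse_seed_range("0-2999")
print(len(seeds))  # 3000 seeds under the inclusive-range assumption
```

Under these assumptions, `--seeds 0-2999` corresponds to 3000 generated sequences per checkpoint found in `--outdir`.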

Acknowledgements

Parts of this codebase are adapted from or inspired by:

  • EDM / PFGM++ (Xu et al.)
  • Diffusion Transformer (DiT)
  • MASSA and RNA-FM pretrained encoders

We thank the original authors for making their implementations publicly available.
