AptaFlow is a conditional Poisson Flow–based generative framework for protein-specific aptamer design. It integrates:
- CPAP: Comparative Protein–Aptamer Pretraining module
- PCPFGM: Protein-Conditioned Poisson Flow Generative Model
Together, these components enable cross-target conditional generation without per-target retraining.
We recommend using conda to reproduce the environment:
conda env create -f env.yaml
conda activate pfgmpp
Model checkpoints:
- CPAP: training-runs/cpap_cluster_train_0605/training-state-010003.pt
- PCPFGM (ACPFGM in our runs): training-runs/MASSA_CPAP_AdaLN_101_cluster_22_ESMC_final_true/training-state-067738.pt
Datasets used in this work are provided under datasets/.
- datasets/MASSA_labels/: protein embeddings generated by the pretrained MASSA model (.npy, named by index)
- datasets/CPAP_MASSA_labels_clustered/: protein conditioning vectors produced by CPAP (.npy, named by index)
- datasets/133_split_to_100_20000_cluster.zip: processed HT-SELEX aptamer dataset for PCPFGM training
Data and model weights will be released after this work is published.
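Because the label files are described as .npy files named by index, the set of usable protein condition indices can be read off the filenames. The sketch below is illustrative and not part of the codebase: the helper name `available_indices` is hypothetical, and it assumes each file is named with a bare integer, such as 126.npy.

```python
from pathlib import Path


def available_indices(label_dir):
    """Return the sorted integer indices that have an <index>.npy file in label_dir."""
    indices = []
    for path in Path(label_dir).glob("*.npy"):
        # Assumption: files are named by a bare integer index, e.g. 126.npy.
        # Files with any other stem are skipped.
        if path.stem.isdecimal():
            indices.append(int(path.stem))
    return sorted(indices)


if __name__ == "__main__":
    found = available_indices("datasets/CPAP_MASSA_labels_clustered")
    print(f"{len(found)} protein conditions available")
```

A listing like this may help confirm that a protein index exists before it is used as a condition.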
Train a PCPFGM model with protein conditioning:
python train.py \
--outdir training-runs \
--name MASSA_CPAP_AdaLN_101_cluster_22_ESMC_final_true \
--data datasets/133_split_to_100_20000_cluster.zip \
--cond 1 \
--arch ncsnpp \
--pfgmpp 1 \
--batch 32 \
--aug_dim 2048 \
--cpap 1
By default, the script reads protein condition labels from datasets/CPAP_MASSA_labels_clustered. If you use a different label directory, update train/dataset.py:75 accordingly.
Key Arguments
- --data: Training dataset file or directory
- --pfgmpp: Enable PFGM++ training
- --aug_dim: Augmented dimension for Poisson flow
- --cpap: Enable protein conditioning via CPAP
- Use MASSA to generate protein embeddings (.npy).
- Put embedding files into a directory (e.g. datasets/MASSA_labels/).
- Convert MASSA embeddings into CPAP protein conditions:
python cpap_sample.py \
--input_dir datasets/MASSA_labels \
--output_dir datasets/CPAP_MASSA_labels_clustered \
--ckpt training-runs/cpap_cluster_train_0605/training-state-010003.pt
To train a personalized CPAP module on your own data, use cpap_train.py.
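After running the conversion, it can be useful to check that every MASSA embedding has a corresponding CPAP condition file. The following is a minimal sketch, assuming that cpap_sample.py writes each output under the same filename as its input; that naming is an assumption rather than documented behaviour, and `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_outputs(input_dir, output_dir):
    """Return sorted names of input .npy files with no same-named file in output_dir."""
    # Assumption: the conversion keeps filenames unchanged (e.g. 126.npy -> 126.npy).
    inputs = {p.name for p in Path(input_dir).glob("*.npy")}
    outputs = {p.name for p in Path(output_dir).glob("*.npy")}
    return sorted(inputs - outputs)


if __name__ == "__main__":
    missing = missing_outputs("datasets/MASSA_labels",
                              "datasets/CPAP_MASSA_labels_clustered")
    print(f"{len(missing)} embeddings without a CPAP condition file")
```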
Generate aptamer sequences conditioned on a specific protein (e.g. --class 126):
python generate.py \
--seeds 0-2999 \
--outdir training-runs/MASSA_CPAP_AdaLN_101_cluster_22_ESMC_final_true \
--pfgmpp 1 \
--aug_dim 2048 \
--class 126 \
--use_pickle 0 \
--save_images \
--cpap True
By default, this script reads protein condition labels under datasets/CPAP_MASSA_labels_clustered. If you use custom labels, modify the relevant label-loading code in generate.py:378-388.
Key Arguments
- --outdir: directory containing .pt weight files (if multiple weights exist, each will be used)
- --class: index of the target protein condition
- --seeds: random seed range (controls the number of generated sequences)
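To illustrate how a seed range determines the number of generated sequences, the sketch below expands a range string into individual seeds. It assumes an inclusive start-end convention with optional comma-separated parts; the function `parse_seed_range` is hypothetical and may differ from the parsing actually done in generate.py.

```python
def parse_seed_range(spec):
    """Expand a string such as '0-2999' or '0,2,5-7' into a list of integer seeds."""
    seeds = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Assumption: ranges are inclusive on both ends.
            start, end = part.split("-", 1)
            seeds.extend(range(int(start), int(end) + 1))
        else:
            seeds.append(int(part))
    return seeds


seeds = parse_seed_range("0-2999")
print(len(seeds))  # 3000
```

Under this convention, --seeds 0-2999 corresponds to 3000 seeds, one per generated sequence.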
Parts of this codebase are adapted from or inspired by:
- EDM / PFGM++ (Xu et al.)
- Diffusion Transformer (DiT)
- MASSA and RNA-FM pretrained encoders
We thank the original authors for making their implementations publicly available.