Motivation: Traditional de novo drug design prioritizes physicochemical properties, yet often overlooks the objective of modulating biological states. Consequently, phenotypic drug discovery (PDD) has emerged in de novo design, generating novel molecules conditioned on transcriptomic profiles for the desired biological activity. Although large-scale datasets have facilitated the application of deep learning models to PDD, current architectures encounter critical limitations, leading to limited molecular validity, structural redundancy, or high inference latency. Furthermore, reliance on restricted genes, cell lines, and standard evaluation protocols may limit the rigorous assessment of structural generalization and de novo design capabilities.
Results: We propose DEG2MOL, the first conditional latent flow matching framework for PDD that generates molecules by transforming Gaussian noise into molecular embeddings guided by Gene Ontology-informed differentially expressed genes (DEG) information. DEG2MOL achieved superior performance in generating valid and unique molecules with faster inference speed compared to baselines, maintaining a uniqueness score of 0.87 across both random and scaffold splits, confirming its capacity for de novo drug design rather than simple memorization. We substantiated the biological relevance of the generated molecules through molecular docking simulations, which confirmed robust binding interactions comparable to those of the reference drugs. DEG2MOL further demonstrated generalizability across knockdown and knockout profiles validated against known inhibitors, notably extending to single-cell Perturb-seq data. Overall, DEG2MOL establishes a robust framework for transcriptome-guided de novo drug design based solely on DEG profiles.
# PyTorch (CUDA support recommended)
pip install torch torchvision torchaudio
# Flow Matching and ODE Solver
pip install torchdiffeq
# Data Processing
pip install pandas numpy scipy
# Molecular Processing and Evaluation
pip install rdkit
# Progress Display
pip install tqdm
# Optional: Experiment Tracking
pip install wandb

The project uses the following data formats:
- DEG Data (`.feather` format)
  - Columns: `cmap_name` (molecule identifier), gene names (12,014 genes)
  - File locations: `data/{data_type}/train.feather`, `data/{data_type}/valid.feather`
  - Example data types: `KO`, `KD`, `Perturb-seq`
- Gene Order File (`.csv` format)
  - File that defines the standard order of gene names
  - Default path: `data/first_GO_matrix_cmap_12014x1574.csv`
  - Gene names stored as index
- Molecular Latent Representations (`.npz` format)
  - Molecular latent representations encoded by ScafVAE
  - File location: `{task_path}/scaf/{cmap_name}.npz`
  - One `.npz` file per molecule
- Molecular Feature Data (`.npz` format)
  - Additional feature information for molecules
  - File location: `{task_path}/feat/{cmap_name}.npz`
data/
├── {data_type}/
│ ├── train.feather # Training DEG data
│ └── valid.feather # Validation DEG data
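The file locations listed above can be assembled programmatically before training. The following is a minimal sketch using only the standard library; the helper names (`deg_split_paths`, `molecule_paths`, `missing_files`) and the example values `"task"` and `"aspirin"` are hypothetical and not part of this repository.

```python
from pathlib import Path


def deg_split_paths(data_type, root="data"):
    """Return the train/valid DEG table paths for one data type (e.g. "KO")."""
    base = Path(root) / data_type
    return {"train": base / "train.feather", "valid": base / "valid.feather"}


def molecule_paths(task_path, cmap_name):
    """Return the per-molecule latent (scaf) and feature (feat) .npz paths."""
    base = Path(task_path)
    return {
        "scaf": base / "scaf" / f"{cmap_name}.npz",
        "feat": base / "feat" / f"{cmap_name}.npz",
    }


def missing_files(paths):
    """List the paths that do not exist on disk, preserving input order."""
    return [Path(p) for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    # "task" and "aspirin" are placeholder values for illustration only.
    wanted = list(deg_split_paths("KO").values())
    wanted += list(molecule_paths("task", "aspirin").values())
    for path in missing_files(wanted):
        print(f"missing: {path.as_posix()}")
```

Running a check like this before `train.py` may help catch a molecule whose `.npz` files were never produced, though how the training script itself handles missing files is not documented here.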
- DEG Encoder: Model that encodes DEG data into latent space
  - Default path: `checkpoints/DEGMON_AE_Best_model.pth`
  - Supports Autoencoder types
- ScafVAE: Molecular encoding/decoding model
  - Automatically loaded from ScafVAE library
Train the Flow Matching model.
python train.py \
--use_ema \
--use_amp \
--use_scheduler \
--save_dir ./checkpoints

- `--combine_method`: Condition combination method (`sum`, `concat`, `cross_attn`)
- `--use_ema`: Whether to use Exponential Moving Average
- `--use_amp`: Whether to use Mixed Precision Training
- `--cfg_drop_prob`: Classifier-free guidance dropout probability (default: 0.3)
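To illustrate what a classifier-free guidance dropout probability such as `--cfg_drop_prob` typically controls, here is a minimal NumPy sketch. It assumes the "null" condition is an all-zero vector and that dropout is decided once per sample; both are assumptions, since `train.py` may use a learned null embedding or a different scheme. The function name `drop_condition` is hypothetical.

```python
import numpy as np


def drop_condition(cond, drop_prob=0.3, rng=None):
    """Zero out whole condition rows with probability `drop_prob`.

    Assumption: the unconditional ("null") condition is an all-zero vector.
    Returns the modified conditions and the boolean mask of dropped rows.
    """
    rng = np.random.default_rng() if rng is None else rng
    drop = rng.random(cond.shape[0]) < drop_prob  # one decision per sample
    dropped = np.where(drop[:, None], 0.0, cond)
    return dropped, drop


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cond = rng.normal(size=(8, 4))  # toy batch: 8 DEG embeddings of size 4
    dropped, mask = drop_condition(cond, drop_prob=0.3, rng=rng)
    print("dropped rows:", np.flatnonzero(mask))
```

Training on a mix of conditioned and unconditioned samples is what lets a single network provide both velocity estimates that guidance needs at sampling time.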
Generate molecules and evaluate using the trained model.
python test.py \
--num_samples 100 \
--guidance_scale 3 \
--conditional

- `--conditional`: Enable conditional generation mode
- `--num_samples`: Number of molecules to generate per test sample
- `--guidance_scale`: Classifier-free guidance scale
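As a rough illustration of how a guidance scale enters flow matching sampling, the sketch below combines conditional and unconditional velocity estimates and integrates from noise with a fixed-step Euler loop. Several things here are assumptions: the guidance formula is the common `v_uncond + s * (v_cond - v_uncond)` form, the Euler loop stands in for the `torchdiffeq` solver listed in the requirements, and `toy_velocity` is a hypothetical replacement for the trained network.

```python
import numpy as np


def guided_velocity(v_cond, v_uncond, guidance_scale):
    """Common classifier-free guidance rule (assumed, not taken from test.py)."""
    return v_uncond + guidance_scale * (v_cond - v_uncond)


def euler_sample(velocity_fn, x0, cond, guidance_scale=3.0, steps=50):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v_c = velocity_fn(x, t, cond)   # conditional estimate
        v_u = velocity_fn(x, t, None)   # unconditional estimate
        x = x + dt * guided_velocity(v_c, v_u, guidance_scale)
    return x


def toy_velocity(x, t, cond):
    """Hypothetical stand-in for the trained model.

    With a condition, it is the straight-line field pointing at `cond`;
    without one, it returns zero velocity.
    """
    if cond is None:
        return np.zeros_like(x)
    return (cond - x) / (1.0 - t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise = rng.normal(size=3)                   # Gaussian starting point
    target = np.array([1.0, 2.0, 3.0])           # toy "molecular embedding"
    print(euler_sample(toy_velocity, noise, target, guidance_scale=1.0, steps=10))
```

With `guidance_scale=1.0` the rule reduces to the purely conditional velocity; larger values push samples further along the direction that the condition adds, usually trading diversity for condition adherence.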
Generate molecules for new DEG data using the trained model.
python inference.py \
--model_checkpoint ./checkpoints/DEG2MOL_best_model.pth \
--data_type Perturb-seq \
--num_samples 100 \
--guidance_scale 3

- `--data_type`: Data type (`KO`, `KD`, `Perturb-seq`)
- Training: Checkpoint files are saved in `--save_dir`
- Testing/Inference: Generated molecule dictionary is saved as a `.pkl` file
  - Filename: `{data_type}_generated_molecules_dict_{guidance_scale}.pkl`
  - Format: `{sample_name}_{idx}: {'generated_mols': [list of Mol objects]}`
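The output dictionary described above can be read back with the standard `pickle` module. The sketch below is a hedged example: `load_generated` and `group_by_sample` are hypothetical helper names, and the grouping assumes that `idx` contains no underscore so that the key can be split on its last `_`. Unpickling real output requires RDKit to be importable, because the values are Mol objects; the demo uses placeholder strings instead.

```python
import pickle
from collections import defaultdict


def load_generated(path):
    """Load a generated-molecule dictionary. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def group_by_sample(generated):
    """Merge entries keyed as '{sample_name}_{idx}' into one list per sample.

    Assumption: idx has no underscore, so the key is split on its last '_'.
    """
    grouped = defaultdict(list)
    for key, entry in generated.items():
        sample_name, _idx = key.rsplit("_", 1)
        grouped[sample_name].extend(entry["generated_mols"])
    return dict(grouped)


if __name__ == "__main__":
    # Placeholder strings stand in for RDKit Mol objects in this toy example.
    toy = {
        "geneA_0": {"generated_mols": ["mol1", "mol2"]},
        "geneA_1": {"generated_mols": ["mol3"]},
    }
    for sample, mols in group_by_sample(toy).items():
        print(sample, len(mols))
```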
- Input: Molecular latent representation `x`, time `t`, DEG condition `c`
- Structure:
  - Time embedding (Sinusoidal)
  - Condition combination (sum/concat/cross-attention)
  - Gated MLP blocks
  - Output projection
- Features: Residual connections, Layer normalization, Dropout support
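For readers unfamiliar with sinusoidal time embeddings, the following NumPy sketch shows one common formulation. The frequency schedule (`max_period=10000`, sine half followed by cosine half) follows the usual Transformer-style convention and is an assumption; the model in this repository may order or scale the components differently. The function name is hypothetical.

```python
import numpy as np


def sinusoidal_time_embedding(t, dim, max_period=10000.0):
    """Map scalar times to `dim`-dimensional sin/cos features.

    Assumptions: `dim` is even, frequencies decay geometrically from 1 down
    towards 1/max_period, and the output is [sin | cos] concatenated.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    t = np.atleast_1d(np.asarray(t, dtype=float))
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t[:, None] * freqs[None, :]          # shape: (batch, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


if __name__ == "__main__":
    emb = sinusoidal_time_embedding([0.0, 0.5, 1.0], dim=8)
    print(emb.shape)
```

The resulting vector is what would then be merged with the DEG condition by one of the combination methods (sum, concat, or cross-attention) before the gated MLP blocks.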
- AE Mode: `GO_Autoencoder` (autoencoder-based encoder)
  - Architecture: `[12014, 1574, 1386, 951, 515] → latent_dim`
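The width list above implies a chain of layers, each mapping one width to the next. The sketch below derives the per-layer shapes and a dense parameter count in plain Python. It assumes fully connected layers with biases; the default gene order file name (`first_GO_matrix_cmap_12014x1574`) suggests the first layer may instead be restricted by a Gene Ontology matrix, in which case the dense count is only an upper bound. The helper names and the `latent_dim=128` value are hypothetical.

```python
def layer_shapes(widths, latent_dim):
    """Pair consecutive widths into (in_features, out_features) tuples."""
    dims = list(widths) + [latent_dim]
    return list(zip(dims[:-1], dims[1:]))


def dense_param_count(shapes):
    """Weights plus biases, assuming every layer is a dense linear layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


if __name__ == "__main__":
    widths = [12014, 1574, 1386, 951, 515]
    shapes = layer_shapes(widths, latent_dim=128)  # 128 is a placeholder value
    for n_in, n_out in shapes:
        print(f"{n_in} -> {n_out}")
    print("dense parameters (upper bound):", dense_param_count(shapes))
```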
