DEG2MOL: Conditional Latent Flow Matching for Transcriptome-Guided De Novo Drug Design

Abstract

Motivation: Traditional de novo drug design prioritizes physicochemical properties, yet often overlooks the objective of modulating biological states. Consequently, phenotypic drug discovery (PDD) has emerged in de novo design, generating novel molecules conditioned on transcriptomic profiles for the desired biological activity. Although large-scale datasets have facilitated the application of deep learning models to PDD, current architectures encounter critical limitations, leading to limited molecular validity, structural redundancy, or high inference latency. Furthermore, reliance on restricted genes, cell lines, and standard evaluation protocols may limit the rigorous assessment of structural generalization and de novo design capabilities.

Results: We propose DEG2MOL, the first conditional latent flow matching framework for PDD that generates molecules by transforming Gaussian noise into molecular embeddings guided by Gene Ontology-informed differentially expressed genes (DEG) information. DEG2MOL achieved superior performance in generating valid and unique molecules with faster inference speed compared to baselines, maintaining a uniqueness score of 0.87 across both random and scaffold splits, confirming its capacity for de novo drug design rather than simple memorization. We substantiated the biological relevance of the generated molecules through molecular docking simulations, which confirmed robust binding interactions comparable to those of the reference drugs. DEG2MOL further demonstrated generalizability across knockdown and knockout profiles validated against known inhibitors, notably extending to single-cell Perturb-seq data. Overall, DEG2MOL establishes a robust framework for transcriptome-guided de novo drug design based solely on DEG profiles.

Environment Setting

Required Packages

# PyTorch (CUDA support recommended)
pip install torch torchvision torchaudio

# Flow Matching and ODE Solver
pip install torchdiffeq

# Data Processing
pip install pandas numpy scipy

# Molecular Processing and Evaluation
pip install rdkit

# Progress Display
pip install tqdm

# Optional: Experiment Tracking
pip install wandb

Data

Data Format

The project uses the following data formats:

DEG Data (.feather format)
- Columns: cmap_name (molecule identifier), gene names (12,014 genes)
- File locations: data/{data_type}/train.feather, data/{data_type}/valid.feather
- Example data types: KO, KD, Perturb-seq
Gene Order File (.csv format)
- File that defines the standard order of gene names
- Default path: data/first_GO_matrix_cmap_12014x1574.csv
- Gene names stored as index
Molecular Latent Representations (.npz format)
- Molecular latent representations encoded by ScafVAE
- File location: {task_path}/scaf/{cmap_name}.npz
- One .npz file per molecule
Molecular Feature Data (.npz format)
- Additional feature information for molecules
- File location: {task_path}/feat/{cmap_name}.npz

Data Directory Structure

data/
├── {data_type}/
│   ├── train.feather      # Training DEG data
│   └── valid.feather      # Validation DEG data

Pre-trained Models

DEG Encoder: Model that encodes DEG data into latent space
- Default path: checkpoints/DEGMON_AE_Best_model.pth
- Supports Autoencoder types
ScafVAE: Molecular encoding/decoding model
- Automatically loaded from ScafVAE library

Implementation

1. Training

Train the Flow Matching model.

python train.py \
    --use_ema \
    --use_amp \
    --use_scheduler \
    --save_dir ./checkpoints

Key Parameters

--combine_method: Condition combination method (sum, concat, cross_attn)
--use_ema: Whether to use Exponential Moving Average
--use_amp: Whether to use Mixed Precision Training
--cfg_drop_prob: Classifier-free guidance dropout probability (default: 0.3)

2. Testing

Generate molecules and evaluate using the trained model.

python test.py \
    --num_samples 100 \
    --guidance_scale 3 \
    --conditional

Key Parameters

--conditional: Enable conditional generation mode
--num_samples: Number of molecules to generate per test sample
--guidance_scale: Classifier-free guidance scale

3. Inference

Generate molecules for new DEG data using the trained model.

python inference.py \
    --model_checkpoint ./checkpoints/DEG2MOL_best_model.pth \
    --data_type Perturb-seq \
    --num_samples 100 \
    --guidance_scale 3 \

Key Parameters

--data_type: Data type (KO, KD, Perturb-seq)

Output Files

Training: Checkpoint files are saved in --save_dir
Testing/Inference: Generated molecule dictionary is saved as a .pkl file
- Filename: {data_type}_generated_molecules_dict_{guidance_scale}.pkl
- Format: {sample_name}_{idx}: {'generated_mols': [list of Mol objects]}

Model Architecture

Gated Conditional Flow MLP

Input: Molecular latent representation x, time t, DEG condition c
Structure:
- Time embedding (Sinusoidal)
- Condition combination (sum/concat/cross-attention)
- Gated MLP blocks
- Output projection
Features: Residual connections, Layer normalization, Dropout support

DEG Encoder

AE Mode: GO_Autoencoder - Autoencoder-based encoder
- Architecture: [12014, 1574, 1386, 951, 515] → latent_dim

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
figures		figures
models		models
.gitignore		.gitignore
README.md		README.md
inference.py		inference.py
test.py		test.py
train.py		train.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DEG2MOL: Conditional Latent Flow Matching for Transcriptome-Guided De Novo Drug Design

Abstract

Environment Setting

Required Packages

Data

Data Format

Data Directory Structure

Pre-trained Models

Implementation

1. Training

Key Parameters

2. Testing

Key Parameters

3. Inference

Key Parameters

Output Files

Model Architecture

Gated Conditional Flow MLP

DEG Encoder

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

DEG2MOL: Conditional Latent Flow Matching for Transcriptome-Guided De Novo Drug Design

Abstract

Environment Setting

Required Packages

Data

Data Format

Data Directory Structure

Pre-trained Models

Implementation

1. Training

Key Parameters

2. Testing

Key Parameters

3. Inference

Key Parameters

Output Files

Model Architecture

Gated Conditional Flow MLP

DEG Encoder

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages