Skip to content

BioinfoMachineLearning/GRNformer

Repository files navigation

GRNFormer - Accurate Gene Regulatory Network Inference Using Graph Transformer

GRNFormer is an advanced variational graph transformer autoencoder model designed to accurately infer regulatory relationships between transcription factors (TFs) and target genes from single-cell RNA-seq transcriptomics data, while supporting generalization across species and cell types.

GRNFormer

Overview

GRNFormer consists of three main novel designs:

  1. TFWalker: A de-novo Transcription Factor (TF) centered subgraph sampling method to extract local or neighborhood co-expression of a transcription factor (TF) to facilitate GRN inference.

  2. End-to-End Learning:

    • GeneTranscoder: A transformer encoder representation module for encoding single-cell RNA-seq (scRNA-seq) gene expression data across different species and cell types.
    • A graph transformer model with a GRNFormer Encoder and a variational GRNFormer decoder coupled with GRN inference module for the reconstruction of GRNs.
  3. Novel Inference Strategy: Incorporates both node features and edge features to infer GRNs for given gene expression data of any given length.

Pipeline

Given a scRNA-seq dataset, a gene co-expression network is first constructed, from which a set of subgraphs are sampled by TF-Walker. The subgraphs are processed by GeneTranscoder to generate node and edge embeddings, which are fed to the variational graph transformer autoencoder to learn a GRN representation. The representation is used to infer a gene regulatory sub-network for each subgraph. The subnetworks are aggregated to construct a full GRN.

Installation

Prerequisites

  • Python 3.11+
  • CUDA-capable GPU (recommended for training)
  • Conda or Miniconda

Setup

  1. Clone the repository:
git clone https://github.com/BioinfoMachineLearning/GRNformer.git
cd GRNformer
  1. Set up conda environment and install necessary packages using the setup script:
./setup.sh

Alternatively, you can manually create the environment:

conda env create -f environment.yml
conda activate grnformer_env

Usage

Quick Start: Inference on Your Data

Run GRNFormer inference on a sample gene expression file:

python infer_grn.py \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --output_file /path/to/predicted-edges.csv \
    --coexpression_threshold 0.1 \
    --max_subgraph_size 100

Input File Formats:

  • expression-file.csv: Gene expression matrix with genes as rows and cells as columns (or vice versa - the script handles both orientations)
  • listoftfs.csv: List of transcription factor gene names (one per line or comma-separated)
  • output_file: Path where the predicted GRN edges will be saved (CSV format: source, target, weight/score)

Optional Parameters:

  • --coexpression_threshold (default: 0.1): Threshold for constructing the co-expression network. Lower values result in denser networks, while higher values create sparser networks.
  • --max_subgraph_size (default: 100): Maximum number of nodes in each TF-centered subgraph sampled by TFWalker. Adjust based on your dataset size and computational resources.

Evaluation with Ground Truth

Standard Evaluation

Run GRNFormer to evaluate performance when a ground truth network is available:

python eval_grn.py \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --net_file /path/to/ground-truth-network.csv \
    --output_file /path/to/predicted-edges.csv

Additional Input:

  • ground-truth-network.csv: Ground truth network edges (CSV format: source, target)

Custom Evaluation with Configurable Parameters

For evaluation with custom coexpression threshold and subgraph size:

python eval_grn_custom.py \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --net_file /path/to/ground-truth-network.csv \
    --output_file /path/to/predicted-edges.csv \
    --ckpt_path /path/to/checkpoint.ckpt \
    --coexpression_threshold 0.1 \
    --max_subgraph_size 100

Additional Parameters:

  • --ckpt_path: Path to the trained model checkpoint file
  • --coexpression_threshold (default: 0.1): Threshold for co-expression network construction
  • --max_subgraph_size (default: 100): Maximum subgraph size for TFWalker sampling

Perturbation Evaluation

Evaluate model robustness under various perturbation conditions (noise and dropout):

Single test with specific perturbation:

python eval_grn_perturb.py \
    --single_test \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --net_file /path/to/ground-truth-network.csv \
    --output_file /path/to/predicted-edges.csv \
    --ckpt_path /path/to/checkpoint.ckpt \
    --noise_std 0.1 \
    --dropout_fraction 0.05 \
    --coexpression_threshold 0.1 \
    --max_subgraph_size 100

Full perturbation sweep (tests multiple noise and dropout levels):

python eval_grn_perturb.py \
    --exp_file /path/to/expression-file.csv \
    --tf_file /path/to/listoftfs.csv \
    --net_file /path/to/ground-truth-network.csv \
    --output_file /path/to/predicted-edges.csv \
    --ckpt_path /path/to/checkpoint.ckpt \
    --noise_levels 0.0 0.05 0.1 0.15 0.2 \
    --dropout_levels 0.0 0.05 0.1 0.15 \
    --output_dir ./outputs/perturbation_results \
    --coexpression_threshold 0.1 \
    --max_subgraph_size 100

Perturbation Parameters:

  • --noise_std: Standard deviation of Gaussian noise to add to expression data (for single test)
  • --dropout_fraction: Fraction of genes to randomly drop (for single test)
  • --noise_levels: Space-separated list of noise levels for sweep (e.g., "0.0 0.05 0.1 0.15 0.2")
  • --dropout_levels: Space-separated list of dropout fractions for sweep (e.g., "0.0 0.05 0.1 0.15")
  • --absolute_noise: Use absolute noise values instead of scaled (default: noise is scaled relative to data std)
  • --output_dir: Directory to save perturbation sweep results
  • --coexpression_threshold (default: 0.1): Threshold for co-expression network construction
  • --max_subgraph_size (default: 100): Maximum subgraph size for TFWalker sampling

Evaluation on Test Datasets

Download BEELINE Datasets

Download BEELINE sc-RNAseq datasets:

python collect_data.py --data_dir ./Data/scRNA-seq/

The downloaded datasets can be found in:

  • Data/scRNA-seq/ - Expression data
  • Data/scRNA-seq-Networks/ - Network data

Run Evaluation Pipeline

Run the evaluation pipeline on test datasets with all subset creations:

python evaluation_pipeline.py \
    --dataset_file Data/mESC.csv \
    --output_dir ./outputs/evaluation

Training from Scratch

1. Prepare Datasets

Download BEELINE sc-RNAseq datasets:

python collect_data.py --data_dir ./Data/scRNA-seq/

Note: Before beginning training, copy all the Regulatory Networks (Non-specific-Chip-seq-network.csv, STRING-network.csv, [cell-type]-Chip-seq-network.csv) and TFs.csv file to the corresponding cell-type datasets in ./Data/scRNA-seq/[cell-type]/.

2. Combine Networks

For generalization training, GRNformer combines all the networks for every training dataset:

python dataset_combiner.py \
    --cell-type-network ./Data/scRNA-seq/hESC/hESC-Chip-seq-network.csv \
    --non-specific-network ./Data/scRNA-seq/hESC/Non-specific-Chip-seq-network.csv \
    --string-network ./Data/scRNA-seq/hESC/STRING-network.csv \
    --output-file ./Data/scRNA-seq/hESC/hESC-combined.csv

3. Create Dataset Splits

Create dataset and splits for training, validation, and testing:

python create_dataset.py \
    --dataset_dir ./Data/sc-RNAseq \
    --dataset_name ./Data/train_list.csv

4. Train the Model

Train the model from scratch using the configuration file:

python main.py fit --config config/grnformer.yaml

You can customize training parameters by editing config/grnformer.yaml or by passing command-line arguments.

Datasets

Available Datasets

Project Structure

GRNformer/
├── src/
│   ├── models/
│   │   └── grnformer/
│   │       ├── model.py          # Main GRNFormer model
│   │       └── network.py        # Network architecture
│   └── datamodules/
│       ├── grn_datamodule.py     # Training data module
│       ├── grn_dataset_inference.py  # Inference dataset
│       └── grn_dataset_test.py   # Test dataset
├── config/
│   └── grnformer.yaml            # Training configuration
├── main.py                       # Training entry point
├── infer_grn.py                  # Inference script
├── eval_grn.py                   # Standard evaluation script
├── eval_grn_custom.py            # Custom evaluation with configurable parameters
├── eval_grn_perturb.py           # Perturbation evaluation script
├── evaluation_pipeline.py        # Full evaluation pipeline
├── create_dataset.py             # Dataset creation
├── dataset_combiner.py            # Network combination
├── collect_data.py                # Data download
└── environment.yml               # Conda environment

Citation

If you use GRNFormer in your research, please cite:

@article {Hegde2025.01.26.634966,
	author = {Hegde, Akshata and Cheng, Jianlin},
	title = {GRNFormer: Accurate Gene Regulatory Network Inference Using Graph Transformer},
	elocation-id = {2025.01.26.634966},
	year = {2025},
	doi = {10.1101/2025.01.26.634966},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2025/01/27/2025.01.26.634966},
	eprint = {https://www.biorxiv.org/content/early/2025/01/27/2025.01.26.634966.full.pdf},
	journal = {bioRxiv}
}

License

See LICENSE file for details.

Contact

For questions or issues, please open an issue on the GitHub repository.

About

Transformer models for predicting gene regulatory networks from omics data

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published