single-cell Multiome Analysis using Variational-inference and Enhancer-driven Regulatory-networks to Inform Cell-atlas Structure
Disclaimer: This pipeline is built to run with large (>100 sample) datasets on the Slurm HPC at the NIH (Biowulf).
This work is now published in Cell Reports (https://doi.org/10.1016/j.celrep.2026.117110)! Please use this publication for citation.
Focus on the analysis, not the processing.
The workflow was built to minimize the time required to process data and get the single-cell scientist analyzing high-quality processed data and generating figures. We tried to strike a balance between giving the user control. The scripts and Snakemake pipeline work for the Multiome datasets output from the 10X Multiome kit (https://www.10xgenomics.com/support/epi-multiome), generated by the Single Cell Expert group at Center for Alzheimer's and Related Dementias at the NIH. The multiome pipeline was built to take several batches of human brain single-nuclei sequencing samples and process them into a multiome atlas object for further analysis.
The modules of Scanpy (https://github.com/scverse/scanpy), SCVI (https://github.com/scverse/scvi-tools), snapATAC2 (https://github.com/kaizhang/SnapATAC2), pycisTopic (https://github.com/aertslab/pycisTopic), and MACS (https://github.com/macs3-project/MACS) are utilized heavily to produce a multiome atlas with minimal batch effects.
Copy this repository to where you will be working with your data. This folder will be where output data is stored, while intermediary files will be stored in a separate folder to be defined by the user. It is important that the output of your CellRanger-ARC run has the format: <data_dir>/<sample>/!
- Metadata file in .csv format, example in
input/example_metadata.csv. A minimal metadata file should include:- Sequencing batch, called Use_batch in example (indicating the separate sequencing folder for each run of
CellRanger-ARC, denoted batch"Sequencing batch>") - Sample ID, called Sample in the example, indicates where name of each sample
- Sample comparison, usually a disease or diagnosis, called 'Primary Diagnosis' in the example, as well for the values of this column;
- Names of the control condition, called 'control' in the example
- A list of conditions to compare to the control, called 'diseases' in the example
- Sequencing batch, called Use_batch in example (indicating the separate sequencing folder for each run of
- Cell-typing table with marker genes, in .csv format, see example in
input/example_marker_genes.csv - Directory of the CellRanger output, the data here is the output of
CellRanger, the raw data required includes;data_dir/{sample}/raw_feature_bc_matrix.h5data_dir/{sample}/atac_fragments.tsv.gzdata_dir/{sample}/atac_possorted_bam.bam
In addition, the snakefile requires modifications to fit your project. The top section "Parameter" should be modified for your dataset, include quality control values, where the input metadata and cell/cell gene marker files are stored. Input files should have their values match the parameters section.
- RNA and ATAC Multiome atlas object
- DEG tables
- DAR tables
- Motif enrichment
- Footprinting results
- Co-accessibility networks
- Functional enrichment (GSEA, GREAT)
- Uses Singularity images for reproducible runs (scVI and poissonVI modeling will be fast utilizing a GPU-enabled environment)
- Snakemake runs steps until all output files are created
- Genes used for celltyping are input from
input/example_marker_genes.csv - Both RNA and ATAC processing has to be done to run this pipeline
- Data needs to be stored in a specific heirarchy
- Cellbender needs to be run after CellRanger to compensate for ambient RNA
- Differential Gene Expression and Differential Accessibility of Regions analysis are done
Once set up, this complete pipeline can be run by simply typing bash snakemake.sh in terminal in an HPC running Slurm. This is a work in progress and has not been tested on other devices.
- Ambient RNA correction (Cellbender)
- Per-Sample RNA Processing
- Creation of per-sample AnnData objects in Scanpy
- Metadata integration
- Gene counts
- %Mt and %Rb
- Doublet score (scrublet)
- QC filtering (%Mt, %Rb, doublet score, min. # genes)
- Merge filtered samples
- Feature Selection & Modeling
- Highly variable gene selection
- GPU-enabled scVI modeling (two-stage: initial & after cluster/cell-type QC)
- UMAP dimensionality reduction and leiden clustering
- Cell-Type Annotation
- Marker-based annotation
- Leiden-cluster QC
- Cell-type specific QC
- Rerun feature selection and modeling
- Pseudobulking by cell-type
- Differential Expression
- Regression Analyses
- Gene set enrichment analysis (GSEA)
- Cell-cell communication
- Per-Sample ATAC Processing
- Create ATAC AnnData objects from fragment files
- Metadata integration
- Fragment counts
- TSSE score
- ATAC-specific QC visualization
- Min. number of fragments
- Joint RNA–ATAC filtering per sample
- Merge filtered samples
- Sort ATAC fragments by cell-type, subtype
- Generate bigWig tracks (raw + normalized)
- Call ATAC peaks via MACS2
- cell-type +/- disease state
- Consensus peak calling across cell-types
- Merge and count peaks into cellxpeak AnnData object
- Dimensionality reduction with PoissonVI
- Pseudobulk ATAC matrices produced for each comparison
- Differential Accessibility (DAR)
- Co-accessibility Networks
- Cicero/CIRCE used to infer co-accessible peak networks
- CCAN (cis-co-accessibility network) modules extracted
- DARs mapped onto CCANs
- Global and differential motif enrichment using JASPAR motifs Motifs tested within DARs
- ATAC Footprinting (TOBIAS)
- ATACorrect bias correction
- Footprint scoring
- Differential footprinting
- GREAT Analysis
- Map DARs to nearby genes
- Ontology and pathway enrichment for regulatory regions
- Cell-type- and disease-specific GREAT outputs
- RNA + ATAC merged into a MuData object
