Skip to content

wejlab/scCombat_seq

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

scComBat-Seq

A practical guide to applying ComBat variants for batch correction in single-cell RNA-seq data.

ComBat-Seq uses a negative binomial model to correct batch effects directly in raw count matrices — preserving integer counts for downstream differential expression analysis. This repository implements 4 variants, benchmarks them against Harmony, LIGER, and Seurat RPCA across 10 datasets, and provides a decision tree for selecting the right method based on your data's design characteristics.


Repository Structure

scCombat_seq/
├── R/
│   ├── core/
│   │   ├── combat_variants.R      # ComBat_combo(): all 4 variants + adaptive covariate
│   │   ├── benchmark_methods.R    # benchmark_batch_methods(): Harmony, LIGER, Seurat RPCA
│   │   ├── evaluation.R           # score_and_plot(), summarise_metrics(): kBET + LISI
│   │   ├── preprocessing.R        # filter_data_mtx(), preprocess_big()
│   │   ├── qc_checks.R            # check_data_characteristics(), Cramér's V diagnostic
│   │   └── combat_nb_math.R       # NB model math: match_quantiles(), mapDisp()
│   ├── simulation/
│   │   ├── simulate_splatter.R    # simulate_condition() via Splatter
│   │   └── simulate_scdesign3.R   # simulate_scdesign3() via scDesign3
│   └── visualization/
│       └── plot_umap.R            # generate_umap_pair(), assemble_figure()
├── analysis/
│   ├── config/
│   │   └── datasets.yaml          # Dataset registry (paths, column names, parameters)
│   ├── run_combat_variants.R      # run_dataset_analysis(): parameterized pipeline runner
│   └── run_simulation.R           # Simulation sweep (S1/S2/S3 confounding conditions)
├── tests/
│   ├── run_tests.R                # Master test runner
│   ├── test-combat-variants.R
│   └── test-benchmark.R
├── archive/                       # Original scripts (reference only, not maintained)
├── Data/                          # NOT included — see download instructions below
├── results/                       # Gitignored — generated by pipeline
├── figures/                       # Publication figures
└── run_all.R                      # Entry point

Data Download

Datasets are not included in this repository (~8 GB total). Download all 10 benchmark datasets from:

Tran, H.T.N., Ang, K.S., Chevrier, M. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9

After downloading, place each dataset under Data/dataset{N}/. File paths and column names are pre-configured in analysis/config/datasets.yaml.


Quick Start

# Install dependencies (run once)
install.packages(c("here", "yaml", "ggplot2", "patchwork"))
BiocManager::install(c("sva", "scater", "kBET"))
install.packages(c("Seurat", "harmony", "liger"))

# Run full pipeline on dataset1
source("run_all.R")

# Or run a single dataset directly
source("R/core/combat_variants.R")
source("R/core/evaluation.R")
source("analysis/run_combat_variants.R")

run_dataset_analysis(
  dataset_id = "dataset1",
  subset     = 0.5,       # use 0.1 for a quick dev test
  output_dir = "results"
)

The 4 ComBat Variants

Variant Description Best for
combat_seq Batch correction only (NB model) No cell type labels; severe confounding
combat_scseq Batch + celltype covariate; adaptive selection Balanced or partially confounded designs
combat_pcseq ComBat on PCA embeddings Log-normalized input; large datasets
combat_ind Per-celltype correction, then merge Severe batch×celltype confounding

All variants are exposed through a single function:

ComBat_combo(seu,
             combat_seq   = TRUE,
             combat_scseq = TRUE,
             combat_pcseq = TRUE,
             combat_ind   = TRUE,
             use_adaptive_covariate = TRUE)   # adaptive celltype covariate (default ON)

Method Selection Guide

Q1: Need a corrected COUNT MATRIX (for DESeq2 / edgeR)?
 ├─ YES → Must use ComBat family (Harmony/LIGER produce embeddings only)
 └─ NO  → Any integration method is acceptable

Q2: Cell type labels available before integration?
 ├─ YES (marker genes, SingleR, Azimuth) → ComBat family  [go to Q3]
 └─ NO  → Harmony (fast), Seurat RPCA (multi-modal), scVI (trajectories)

Q3: Batch × celltype confounding level?
    (diagnosed by qc_checks.R; Cramér's V reported as diagnostic)
 ├─ Low  (V < 0.2, all types in all batches) → combat_scseq
 ├─ Moderate (V 0.2–0.5, some types missing) → combat_scseq (adaptive, default)
 └─ High (V ≥ 0.5, some types exclusive to one batch)
      ├─ Preserve cell type boundaries → combat_ind
      └─ Batch-only correction needed  → combat_seq

Key Results — Dataset 1 (Dendritic Cells, confounded design, V ≈ 0.58)

Method kBET batch_LISI celltype_LISI
combat_ind 0.665 1.627 1.112
combat_seq 0.632 1.659 1.148
combat_scseq (adaptive) 0.554 1.673 1.348
combat_pcseq 0.501 1.509 1.144
Harmony 0.428 1.513 1.140

On confounded designs, all ComBat variants outperform Harmony. Harmony's soft-clustering diversity penalty pushes single-batch cell types toward other clusters, silently distorting biology — while combat_ind respects cell type boundaries entirely.


Project Status

Status: Handed off to ComBatSuite team (May 2026) — contact jr689@njit.edu

Completed
─────────
✅ All 4 variants implemented and tested (R/core/combat_variants.R)
✅ Adaptive celltype covariate selection (use_adaptive_covariate = TRUE, default)
✅ Dataset 1 fully benchmarked (results/dataset1/)
✅ Benchmark comparisons: Harmony, LIGER, Seurat RPCA (R/core/benchmark_methods.R)
✅ Simulation scaffold: Splatter + scDesign3 (R/simulation/)
✅ Decision tree framework drafted (see Overleaf paper draft)

Pending
───────
🔲 Simulation sweep S1/S2/S3 — validate decision tree across confounding levels
🔲 Datasets 2–10 benchmarked via run_dataset_analysis()
🔲 Downstream DE analysis: ComBat counts → DESeq2 vs Harmony embeddings → Wilcoxon
🔲 Bias-corrected Cramér's V in qc_checks.R (Bergsma 2013 formula)

Citation

Ren, J. et al. scComBat-Seq: batch effect adjustment for single-cell RNA sequencing data. (in preparation)

Benchmark datasets from:

Tran, H.T.N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9


License

MIT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • R 100.0%