A practical guide to applying ComBat variants for batch correction in single-cell RNA-seq data.
ComBat-Seq uses a negative binomial model to correct batch effects directly in raw count matrices — preserving integer counts for downstream differential expression analysis. This repository implements 4 variants, benchmarks them against Harmony, LIGER, and Seurat RPCA across 10 datasets, and provides a decision tree for selecting the right method based on your data's design characteristics.
scCombat_seq/
├── R/
│ ├── core/
│ │ ├── combat_variants.R # ComBat_combo(): all 4 variants + adaptive covariate
│ │ ├── benchmark_methods.R # benchmark_batch_methods(): Harmony, LIGER, Seurat RPCA
│ │ ├── evaluation.R # score_and_plot(), summarise_metrics(): kBET + LISI
│ │ ├── preprocessing.R # filter_data_mtx(), preprocess_big()
│ │ ├── qc_checks.R # check_data_characteristics(), Cramér's V diagnostic
│ │ └── combat_nb_math.R # NB model math: match_quantiles(), mapDisp()
│ ├── simulation/
│ │ ├── simulate_splatter.R # simulate_condition() via Splatter
│ │ └── simulate_scdesign3.R # simulate_scdesign3() via scDesign3
│ └── visualization/
│ └── plot_umap.R # generate_umap_pair(), assemble_figure()
├── analysis/
│ ├── config/
│ │ └── datasets.yaml # Dataset registry (paths, column names, parameters)
│ ├── run_combat_variants.R # run_dataset_analysis(): parameterized pipeline runner
│ └── run_simulation.R # Simulation sweep (S1/S2/S3 confounding conditions)
├── tests/
│ ├── run_tests.R # Master test runner
│ ├── test-combat-variants.R
│ └── test-benchmark.R
├── archive/ # Original scripts (reference only, not maintained)
├── Data/ # NOT included — see download instructions below
├── results/ # Gitignored — generated by pipeline
├── figures/ # Publication figures
└── run_all.R # Entry point
Datasets are not included in this repository (~8 GB total). Download all 10 benchmark datasets from:
Tran, H.T.N., Ang, K.S., Chevrier, M. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9
After downloading, place each dataset under Data/dataset{N}/. File paths and column names are pre-configured in analysis/config/datasets.yaml.
# Install dependencies (run once)
install.packages(c("here", "yaml", "ggplot2", "patchwork"))
BiocManager::install(c("sva", "scater", "kBET"))
install.packages(c("Seurat", "harmony", "liger"))
# Run full pipeline on dataset1
source("run_all.R")
# Or run a single dataset directly
source("R/core/combat_variants.R")
source("R/core/evaluation.R")
source("analysis/run_combat_variants.R")
run_dataset_analysis(
dataset_id = "dataset1",
subset = 0.5, # use 0.1 for a quick dev test
output_dir = "results"
)| Variant | Description | Best for |
|---|---|---|
combat_seq |
Batch correction only (NB model) | No cell type labels; severe confounding |
combat_scseq |
Batch + celltype covariate; adaptive selection | Balanced or partially confounded designs |
combat_pcseq |
ComBat on PCA embeddings | Log-normalized input; large datasets |
combat_ind |
Per-celltype correction, then merge | Severe batch×celltype confounding |
All variants are exposed through a single function:
ComBat_combo(seu,
combat_seq = TRUE,
combat_scseq = TRUE,
combat_pcseq = TRUE,
combat_ind = TRUE,
use_adaptive_covariate = TRUE) # adaptive celltype covariate (default ON)Q1: Need a corrected COUNT MATRIX (for DESeq2 / edgeR)?
├─ YES → Must use ComBat family (Harmony/LIGER produce embeddings only)
└─ NO → Any integration method is acceptable
Q2: Cell type labels available before integration?
├─ YES (marker genes, SingleR, Azimuth) → ComBat family [go to Q3]
└─ NO → Harmony (fast), Seurat RPCA (multi-modal), scVI (trajectories)
Q3: Batch × celltype confounding level?
(diagnosed by qc_checks.R; Cramér's V reported as diagnostic)
├─ Low (V < 0.2, all types in all batches) → combat_scseq
├─ Moderate (V 0.2–0.5, some types missing) → combat_scseq (adaptive, default)
└─ High (V ≥ 0.5, some types exclusive to one batch)
├─ Preserve cell type boundaries → combat_ind
└─ Batch-only correction needed → combat_seq
| Method | kBET | batch_LISI | celltype_LISI |
|---|---|---|---|
| combat_ind | 0.665 | 1.627 | 1.112 |
| combat_seq | 0.632 | 1.659 | 1.148 |
| combat_scseq (adaptive) | 0.554 | 1.673 | 1.348 |
| combat_pcseq | 0.501 | 1.509 | 1.144 |
| Harmony | 0.428 | 1.513 | 1.140 |
On confounded designs, all ComBat variants outperform Harmony. Harmony's soft-clustering diversity penalty pushes single-batch cell types toward other clusters, silently distorting biology — while combat_ind respects cell type boundaries entirely.
Status: Handed off to ComBatSuite team (May 2026) — contact jr689@njit.edu
Completed
─────────
✅ All 4 variants implemented and tested (R/core/combat_variants.R)
✅ Adaptive celltype covariate selection (use_adaptive_covariate = TRUE, default)
✅ Dataset 1 fully benchmarked (results/dataset1/)
✅ Benchmark comparisons: Harmony, LIGER, Seurat RPCA (R/core/benchmark_methods.R)
✅ Simulation scaffold: Splatter + scDesign3 (R/simulation/)
✅ Decision tree framework drafted (see Overleaf paper draft)
Pending
───────
🔲 Simulation sweep S1/S2/S3 — validate decision tree across confounding levels
🔲 Datasets 2–10 benchmarked via run_dataset_analysis()
🔲 Downstream DE analysis: ComBat counts → DESeq2 vs Harmony embeddings → Wilcoxon
🔲 Bias-corrected Cramér's V in qc_checks.R (Bergsma 2013 formula)
Ren, J. et al. scComBat-Seq: batch effect adjustment for single-cell RNA sequencing data. (in preparation)
Benchmark datasets from:
Tran, H.T.N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9
MIT