scComBat-Seq

A practical guide to applying ComBat variants for batch correction in single-cell RNA-seq data.

ComBat-Seq uses a negative binomial model to correct batch effects directly in raw count matrices — preserving integer counts for downstream differential expression analysis. This repository implements 4 variants, benchmarks them against Harmony, LIGER, and Seurat RPCA across 10 datasets, and provides a decision tree for selecting the right method based on your data's design characteristics.

Repository Structure

scCombat_seq/
├── R/
│   ├── core/
│   │   ├── combat_variants.R      # ComBat_combo(): all 4 variants + adaptive covariate
│   │   ├── benchmark_methods.R    # benchmark_batch_methods(): Harmony, LIGER, Seurat RPCA
│   │   ├── evaluation.R           # score_and_plot(), summarise_metrics(): kBET + LISI
│   │   ├── preprocessing.R        # filter_data_mtx(), preprocess_big()
│   │   ├── qc_checks.R            # check_data_characteristics(), Cramér's V diagnostic
│   │   └── combat_nb_math.R       # NB model math: match_quantiles(), mapDisp()
│   ├── simulation/
│   │   ├── simulate_splatter.R    # simulate_condition() via Splatter
│   │   └── simulate_scdesign3.R   # simulate_scdesign3() via scDesign3
│   └── visualization/
│       └── plot_umap.R            # generate_umap_pair(), assemble_figure()
├── analysis/
│   ├── config/
│   │   └── datasets.yaml          # Dataset registry (paths, column names, parameters)
│   ├── run_combat_variants.R      # run_dataset_analysis(): parameterized pipeline runner
│   └── run_simulation.R           # Simulation sweep (S1/S2/S3 confounding conditions)
├── tests/
│   ├── run_tests.R                # Master test runner
│   ├── test-combat-variants.R
│   └── test-benchmark.R
├── archive/                       # Original scripts (reference only, not maintained)
├── Data/                          # NOT included — see download instructions below
├── results/                       # Gitignored — generated by pipeline
├── figures/                       # Publication figures
└── run_all.R                      # Entry point

Data Download

Datasets are not included in this repository (~8 GB total). Download all 10 benchmark datasets from:

Tran, H.T.N., Ang, K.S., Chevrier, M. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9

After downloading, place each dataset under Data/dataset{N}/. File paths and column names are pre-configured in analysis/config/datasets.yaml.

Quick Start

# Install dependencies (run once)
install.packages(c("here", "yaml", "ggplot2", "patchwork"))
BiocManager::install(c("sva", "scater", "kBET"))
install.packages(c("Seurat", "harmony", "liger"))

# Run full pipeline on dataset1
source("run_all.R")

# Or run a single dataset directly
source("R/core/combat_variants.R")
source("R/core/evaluation.R")
source("analysis/run_combat_variants.R")

run_dataset_analysis(
  dataset_id = "dataset1",
  subset     = 0.5,       # use 0.1 for a quick dev test
  output_dir = "results"
)

The 4 ComBat Variants

Variant	Description	Best for
`combat_seq`	Batch correction only (NB model)	No cell type labels; severe confounding
`combat_scseq`	Batch + celltype covariate; adaptive selection	Balanced or partially confounded designs
`combat_pcseq`	ComBat on PCA embeddings	Log-normalized input; large datasets
`combat_ind`	Per-celltype correction, then merge	Severe batch×celltype confounding

All variants are exposed through a single function:

ComBat_combo(seu,
             combat_seq   = TRUE,
             combat_scseq = TRUE,
             combat_pcseq = TRUE,
             combat_ind   = TRUE,
             use_adaptive_covariate = TRUE)   # adaptive celltype covariate (default ON)

Method Selection Guide

Q1: Need a corrected COUNT MATRIX (for DESeq2 / edgeR)?
 ├─ YES → Must use ComBat family (Harmony/LIGER produce embeddings only)
 └─ NO  → Any integration method is acceptable

Q2: Cell type labels available before integration?
 ├─ YES (marker genes, SingleR, Azimuth) → ComBat family  [go to Q3]
 └─ NO  → Harmony (fast), Seurat RPCA (multi-modal), scVI (trajectories)

Q3: Batch × celltype confounding level?
    (diagnosed by qc_checks.R; Cramér's V reported as diagnostic)
 ├─ Low  (V < 0.2, all types in all batches) → combat_scseq
 ├─ Moderate (V 0.2–0.5, some types missing) → combat_scseq (adaptive, default)
 └─ High (V ≥ 0.5, some types exclusive to one batch)
      ├─ Preserve cell type boundaries → combat_ind
      └─ Batch-only correction needed  → combat_seq

Key Results — Dataset 1 (Dendritic Cells, confounded design, V ≈ 0.58)

Method	kBET	batch_LISI	celltype_LISI
combat_ind	0.665	1.627	1.112
combat_seq	0.632	1.659	1.148
combat_scseq (adaptive)	0.554	1.673	1.348
combat_pcseq	0.501	1.509	1.144
Harmony	0.428	1.513	1.140

On confounded designs, all ComBat variants outperform Harmony. Harmony's soft-clustering diversity penalty pushes single-batch cell types toward other clusters, silently distorting biology — while combat_ind respects cell type boundaries entirely.

Project Status

Status: Handed off to ComBatSuite team (May 2026) — contact jr689@njit.edu

Completed
─────────
✅ All 4 variants implemented and tested (R/core/combat_variants.R)
✅ Adaptive celltype covariate selection (use_adaptive_covariate = TRUE, default)
✅ Dataset 1 fully benchmarked (results/dataset1/)
✅ Benchmark comparisons: Harmony, LIGER, Seurat RPCA (R/core/benchmark_methods.R)
✅ Simulation scaffold: Splatter + scDesign3 (R/simulation/)
✅ Decision tree framework drafted (see Overleaf paper draft)

Pending
───────
🔲 Simulation sweep S1/S2/S3 — validate decision tree across confounding levels
🔲 Datasets 2–10 benchmarked via run_dataset_analysis()
🔲 Downstream DE analysis: ComBat counts → DESeq2 vs Harmony embeddings → Wilcoxon
🔲 Bias-corrected Cramér's V in qc_checks.R (Bergsma 2013 formula)

Citation

Ren, J. et al. scComBat-Seq: batch effect adjustment for single-cell RNA sequencing data. (in preparation)

Benchmark datasets from:

Tran, H.T.N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
R		R
analysis		analysis
archive		archive
tests		tests
.gitignore		.gitignore
README.md		README.md
run_all.R		run_all.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

scComBat-Seq

Repository Structure

Data Download

Quick Start

The 4 ComBat Variants

Method Selection Guide

Key Results — Dataset 1 (Dendritic Cells, confounded design, V ≈ 0.58)

Project Status

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

scComBat-Seq

Repository Structure

Data Download

Quick Start

The 4 ComBat Variants

Method Selection Guide

Key Results — Dataset 1 (Dendritic Cells, confounded design, V ≈ 0.58)

Project Status

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages