eSMC2 — Tutorials for the Statistical Population Genomics Book Chapter

This project includes worked examples, helper scripts and tiny demo data sets that accompany the “Inference of Biological Traits using the Sequentially Markovian Coalescent” chapter (link).

topic	tool / section	folder
Self‑fertilisation (σ) • constant σ • transition outcrossing → selfing • F‑based quick estimate	eSMC2 teSMC heterozygosity formula	`Selfing/eSMC2‑Selfing/` `Selfing/teSMC‑Transition/` `Selfing/Heterozygosity/`
Dormancy (β)	eSMC2	`eSMC2‑Dormancy/`
Sweepstakes / β‑coalescent (α)	SMBC	`SMBC-Sweepstakes/`
Multiple markers	SMCtheo, SMCm	`SMCtheo/` & `SMCm/`
Data preprocessing	VCF → mask → multi‑het‑sep	`SMC_data_processing/`

1 Quick start

Clone the repo

git clone https://github.com/StatisticalPopulationGenomics-2ndEd/eSMC2.git
cd eSMC2

Open any *.html or *.pdf inside the tutorial folders for an immediate walk‑through (no software required).
To rerun the code you need R (≥ 4.1) plus the eSMC2 package (see install guide below).

2 Installing eSMC2

Download the latest eSMC2 .tar.gz from
https://github.com/TPPSellinger/eSMC2/releases
Install it in R:

# install devtools once
install.packages("devtools")
library(devtools)

# local path to the .tar.gz you downloaded from
# https://github.com/TPPSellinger/eSMC2/releases
path <- "~/Downloads/eSMC2_5.1.3.tar.gz"   # adjust as needed
install_local(path)

# no root permissions? install into a personal library
withr::with_libpaths(new = "~/R/local_lib",
                     install_local(path))

3 Input format – “Segregating matrix”

eSMC2() does not read VCFs directly. Instead it needs a segregation matrix that captures, for every SNP, (1) the alleles carried by each haplotype, (2) how many bases were callable at that position, and (3) the SNP’s genomic coordinate. It is similar to a transposed Multihetsep file (see further information in the MSMC guide
https://github.com/stschiff/msmc-tools/blob/master/msmc-tutorial/guide.md):

Rows 1 … M → one row per haplotype (phased). Alleles can be A/C/G/T or integer codes 0/1/2/3 (faster). Missing data = “N” or dot.
Row M + 1 → number of called sites per column; this lets the HMM ignore coverage gaps just like MSMC.
Row M + 2 → genomic position along the scaffold/chromosome; must be strictly increasing.

Columns therefore correspond to SNPs; rows to biological information.
Conversion from VCF is explained in SMC_data_processing/, which uses bcftools, GATK, and msmc-tools).

Example snippet (3 haplotypes, 5 SNPs)

0 1 0 1 0
1 1 0 0 0
0 0 0 1 1
6 6 5 6 5         #   <- row M+1
10012 10547 13001 18110 19455   # <- row M+2

4 Quick run with phased haplotypes (one single `.mhs`)

library(eSMC2)

mhs_path <- "./CEU_regions.mhs"
M        <- 20  # number of haplotypes (rows 1..M in the .mhs)

# Returns a list: one matrix per chromosome/scaffold
obs_chr <- Get_real_data(dirname(mhs_path), M, basename(mhs_path), "\t")

res <- eSMC2(
  n   = 40,
  rho = 3.6e-8 / 6.95e-9,
  obs_chr,                 # third positional arg = list of matrices
  Rho = FALSE,
  NC  = length(obs_chr)    # number of chromosomes/scaffolds
)

5 Multiple unphased diploids (two common input patterns)

Treat each diploid as one chromosome: two pseudo-haploid rows + the two meta rows. Build a list with one matrix per diploid and pass that list as the third positional argument to eSMC2().

Set NC = number_of_individuals × chromosomes_per_individual.

Each matrix in the list must have:

Rows 1…M: haplotypes (A/C/G/T or integer codes)
Row M+1: number of called sites
Row M+2: genomic position (strictly increasing)

You can feed eSMC2 in two common ways:

Pattern A — one combined `.mhs` with many individuals

library(eSMC2)

mhs_path <- "./CEU_regions.mhs"
M_full   <- 20          # 10 diploids × 2
dip_ids  <- 1:4         # choose diploids
chr_keep <- 1           # choose chromosome/scaffold

Os_chr <- Get_real_data(dirname(mhs_path), M_full, basename(mhs_path), "\t")
M      <- nrow(Os_chr[[chr_keep]]) - 2L
tail   <- (M+1L):(M+2L)

# one matrix per diploid: hap1, hap2, meta1, meta2
obs_list <- lapply(dip_ids, function(i) {
  rows <- c(2*i-1L, 2*i, tail)
  Oi   <- Os_chr[[chr_keep]][rows, , drop = FALSE]
  mode(Oi) <- "character"   # keep A/C/G/T; use integer codes for speed if desired
  Oi
})

res <- eSMC2(
  n   = 40,
  rho = 3.6e-8 / 6.95e-9,
  obs_list,
  SF  = TRUE, Rho = FALSE, BoxP = c(3,3),
  NC  = length(obs_list)
)

Pattern B — one `.mhs` per individual

library(eSMC2)

files <- list.files("Dormancy/", pattern = "\\.multihetsep\\.txt$", full.names = TRUE)
M     <- 2  # haplotypes per file

mats <- lapply(files, function(f) Get_real_data(dirname(f), M, basename(f), "\t"))

res <- eSMC2(
  n   = 40,
  rho = (8.0e-8/6) / 4.33e-9,    # example: annual recomb / (clonal+sexual cycles), then / mu
  mats,
  SB  = TRUE, SF = FALSE, Rho = FALSE,
  BoxP = c(3,3), BoxB = c(0.3,0.99),
  NC  = length(mats)
)

eSMC2 uses a composite-likelihood over “chromosomes”. For unphased diploids we treat each individual as one chromosome composed of two pseudo-haploid rows plus the two meta rows from the .mhs. Subsetting per individual gives the per-diploid matrices you pass in a list. This is analogous to how MSMC2 consume multiple chromosomes.

Selfers — quick note:
In strong selfing species, within-individual heterozygosity is low, so a single diploid may carry few segregating sites.
eSMC2 will run, but estimates can be noisy.
Many analyses instead form pseudo-diploids between individuals (often requires phasing) to boost signal.

See Dormancy/ and Selfing/eSMC2-Selfing/ in this repo for full examples.

6 Plotting in eSMC2

eSMC2() works in coalescent units, so the times and population sizes it outputs are still in a “scaled” space. To plot the results in actual generations and diploid effective population size, you first need to convert them using the mutation rate you supplied into the model.

symbol	returned by `eSMC2`	meaning	rescaling factor	becomes
`Tc`	`results$Tc`	lower & upper bound of each hidden‑state interval (in coalescent units)	$2N_e$ generations → multiply by $N_e$	time grid (generations ago)
`Xi`	`results$Xi`	population‑size parameter for each interval: $\Xi_t = N_{e,t}/N_{e,0}$	multiply by $N_{e,0}$	effective $N_e(t)$

The “hidden” scaling factor $N_{e,0}$ can be recovered from the mutation rate that maximises the likelihood (results$mu) and the empirical mutation rate you provided (mu).

Time axis (generations)

# generations ago for the midpoint of each interval
Ne0  <- results$mu / mu          # inferred 2Ne (diploids) in generations
time <- results$Tc * Ne0         # coalescent units × generations

If you prefer calendar years, multiply by the generation time in years (e.g. gen_time <- 1 for annual plants).

Population size (diploids)

Ne_t <- results$Xi * 0.5 * Ne0   # Xi × (Ne0 / 2)   because Ne0 is 2Ne

Plot log10(Ne_t) against time (log‑scaled x‑axis) to reproduce the classic MSMC/MSMC2 stepwise skyline. These formulas are identical to the post‑processing steps documented for MSMC2; only the object names differ.

Example plot

plot(time, log10(Ne_t), type = "s", log = "x",
     xlab = "Generations ago", ylab = expression(log[10]~N[e]))

7 Warnings & good practice

⚠︎ Computational cost increases when …	⚠︎ Accuracy decreases when …
• > 1 chromosome in one analysis is used • > 10 haplotypes are simultaneously analyzed • > 60 hidden states are used	• < 30 hidden states are used • sequences < 1 Mb • < 1000 SNPs • ρ ≫ θ per bp

Depending on the ecological parameters, the recombination rate can be underestimated if, per base pair, the recombination rate is higher than the mutation rate.

eSMC2 cannot disentangle selfing from seed‑banking; therefore it is recommended to avoid simultaneous estimation unless strong priors are available.

8 Repository layout

PopulationGenomicsChapter/
├── Selfing/
│   ├── eSMC2‑Selfing/
│   ├── teSMC‑Transition/
│   └── Heterozygosity/
├── Dormancy/
├── SMBC/
├── SMCtheo/
├── SMCm/
└── SMC_data_processing/

Each sub‑folder contains its own README plus runnable example.

9 Citing this work

Sellinger T. et al. (2020), Inference of past demography, dormancy and self‑fertilization rates from whole genome sequence data. PLOS Genetics.
Sellinger T. et al. (2021), Limits and convergence properties of the sequentially Markovian coalescent. Molecular Ecology Resources.
Struett S. et al. (2023), Joint inference of evolutionary transitions to self-fertilization and demographic history using whole-genome sequences. Elife.
Korfmann K. (2024), Simultaneous Inference of Past Demography and Selection from the Ancestral Recombination Graph under the Beta Coalescent. PCI.
Sellinger T. et al. (2024), Improved inference of population histories by integrating genomic and epigenomic data. Elife.

10 Support

For assistance or inquiries, please email Lukas Metzger (lukas.metzger@tum.de) or Miles Anderson (miles.anderson@tum.de).

Last updated: 2025‑07‑25

Name		Name	Last commit message	Last commit date
Latest commit History 102 Commits
SMC_data_processing		SMC_data_processing
SMCm		SMCm
SMCtheo		SMCtheo
Selfing		Selfing
eSMC2-Dormancy		eSMC2-Dormancy
Example_plot.png		Example_plot.png
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

eSMC2 — Tutorials for the Statistical Population Genomics Book Chapter

1 Quick start

2 Installing eSMC2

3 Input format – “Segregating matrix”

Example snippet (3 haplotypes, 5 SNPs)

4 Quick run with phased haplotypes (one single `.mhs`)

5 Multiple unphased diploids (two common input patterns)

Pattern A — one combined `.mhs` with many individuals

Pattern B — one `.mhs` per individual

6 Plotting in eSMC2

Time axis (generations)

Population size (diploids)

Example plot

7 Warnings & good practice

8 Repository layout

9 Citing this work

10 Support

About

Uh oh!

Releases

Packages

Languages

StatisticalPopulationGenomics-2ndEd/eSMC2

Folders and files

Latest commit

History

Repository files navigation

eSMC2 — Tutorials for the Statistical Population Genomics Book Chapter

1 Quick start

2 Installing eSMC2

3 Input format – “Segregating matrix”

Example snippet (3 haplotypes, 5 SNPs)

4 Quick run with phased haplotypes (one single .mhs)

5 Multiple unphased diploids (two common input patterns)

Pattern A — one combined .mhs with many individuals

Pattern B — one .mhs per individual

6 Plotting in eSMC2

Time axis (generations)

Population size (diploids)

Example plot

7 Warnings & good practice

8 Repository layout

9 Citing this work

10 Support

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

eSMC2 — Tutorials for the Statistical Population Genomics Book Chapter

2 Installing eSMC2

4 Quick run with phased haplotypes (one single `.mhs`)

Pattern A — one combined `.mhs` with many individuals

Pattern B — one `.mhs` per individual

Packages