This project includes worked examples, helper scripts and tiny demo data sets that accompany the “Inference of Biological Traits using the Sequentially Markovian Coalescent” chapter (link).
| topic | tool / section | folder |
|---|---|---|
| Self‑fertilisation (σ) • constant σ • transition outcrossing → selfing • F‑based quick estimate |
eSMC2 teSMC heterozygosity formula |
Selfing/eSMC2‑Selfing/Selfing/teSMC‑Transition/Selfing/Heterozygosity/ |
| Dormancy (β) | eSMC2 | eSMC2‑Dormancy/ |
| Sweepstakes / β‑coalescent (α) | SMBC | SMBC-Sweepstakes/ |
| Multiple markers | SMCtheo, SMCm | SMCtheo/ & SMCm/ |
| Data preprocessing | VCF → mask → multi‑het‑sep | SMC_data_processing/ |
-
Clone the repo
git clone https://github.com/StatisticalPopulationGenomics-2ndEd/eSMC2.git cd eSMC2 -
Open any
*.htmlor*.pdfinside the tutorial folders for an immediate walk‑through (no software required). -
To rerun the code you need R (≥ 4.1) plus the
eSMC2package (see install guide below).
-
Download the latest eSMC2
.tar.gzfrom
https://github.com/TPPSellinger/eSMC2/releases -
Install it in R:
# install devtools once
install.packages("devtools")
library(devtools)
# local path to the .tar.gz you downloaded from
# https://github.com/TPPSellinger/eSMC2/releases
path <- "~/Downloads/eSMC2_5.1.3.tar.gz" # adjust as needed
install_local(path)
# no root permissions? install into a personal library
withr::with_libpaths(new = "~/R/local_lib",
install_local(path))eSMC2() does not read VCFs directly.
Instead it needs a segregation matrix that captures, for every SNP,
(1) the alleles carried by each haplotype, (2) how many bases were
callable at that position, and (3) the SNP’s genomic coordinate.
It is similar to a transposed Multihetsep file (see further information in the MSMC guide
https://github.com/stschiff/msmc-tools/blob/master/msmc-tutorial/guide.md):
- Rows 1 … M → one row per haplotype (phased). Alleles can be A/C/G/T or integer codes 0/1/2/3 (faster). Missing data = “N” or dot.
- Row M + 1 → number of called sites per column; this lets the HMM ignore coverage gaps just like MSMC.
- Row M + 2 → genomic position along the scaffold/chromosome; must be strictly increasing.
Columns therefore correspond to SNPs; rows to biological information.
Conversion from VCF is explained in SMC_data_processing/, which uses bcftools, GATK, and
msmc-tools).
0 1 0 1 0
1 1 0 0 0
0 0 0 1 1
6 6 5 6 5 # <- row M+1
10012 10547 13001 18110 19455 # <- row M+2library(eSMC2)
mhs_path <- "./CEU_regions.mhs"
M <- 20 # number of haplotypes (rows 1..M in the .mhs)
# Returns a list: one matrix per chromosome/scaffold
obs_chr <- Get_real_data(dirname(mhs_path), M, basename(mhs_path), "\t")
res <- eSMC2(
n = 40,
rho = 3.6e-8 / 6.95e-9,
obs_chr, # third positional arg = list of matrices
Rho = FALSE,
NC = length(obs_chr) # number of chromosomes/scaffolds
)Treat each diploid as one chromosome: two pseudo-haploid rows + the two meta rows. Build a list with one matrix per diploid and pass that list as the third positional argument to eSMC2().
Set NC = number_of_individuals × chromosomes_per_individual.
Each matrix in the list must have:
- Rows 1…M: haplotypes (A/C/G/T or integer codes)
- Row M+1: number of called sites
- Row M+2: genomic position (strictly increasing)
You can feed eSMC2 in two common ways:
library(eSMC2)
mhs_path <- "./CEU_regions.mhs"
M_full <- 20 # 10 diploids × 2
dip_ids <- 1:4 # choose diploids
chr_keep <- 1 # choose chromosome/scaffold
Os_chr <- Get_real_data(dirname(mhs_path), M_full, basename(mhs_path), "\t")
M <- nrow(Os_chr[[chr_keep]]) - 2L
tail <- (M+1L):(M+2L)
# one matrix per diploid: hap1, hap2, meta1, meta2
obs_list <- lapply(dip_ids, function(i) {
rows <- c(2*i-1L, 2*i, tail)
Oi <- Os_chr[[chr_keep]][rows, , drop = FALSE]
mode(Oi) <- "character" # keep A/C/G/T; use integer codes for speed if desired
Oi
})
res <- eSMC2(
n = 40,
rho = 3.6e-8 / 6.95e-9,
obs_list,
SF = TRUE, Rho = FALSE, BoxP = c(3,3),
NC = length(obs_list)
)library(eSMC2)
files <- list.files("Dormancy/", pattern = "\\.multihetsep\\.txt$", full.names = TRUE)
M <- 2 # haplotypes per file
mats <- lapply(files, function(f) Get_real_data(dirname(f), M, basename(f), "\t"))
res <- eSMC2(
n = 40,
rho = (8.0e-8/6) / 4.33e-9, # example: annual recomb / (clonal+sexual cycles), then / mu
mats,
SB = TRUE, SF = FALSE, Rho = FALSE,
BoxP = c(3,3), BoxB = c(0.3,0.99),
NC = length(mats)
)eSMC2 uses a composite-likelihood over “chromosomes”. For unphased diploids we treat each individual as one chromosome composed of two pseudo-haploid rows plus the two meta rows from the .mhs. Subsetting per individual gives the per-diploid matrices you pass in a list. This is analogous to how MSMC2 consume multiple chromosomes.
Selfers — quick note:
In strong selfing species, within-individual heterozygosity is low, so a single diploid may carry few segregating sites.
eSMC2 will run, but estimates can be noisy.
Many analyses instead form pseudo-diploids between individuals (often requires phasing) to boost signal.
See Dormancy/ and Selfing/eSMC2-Selfing/ in this repo for full examples.
eSMC2() works in coalescent units, so the times and population sizes it outputs are still in a “scaled” space.
To plot the results in actual generations and diploid effective population
size, you first need to convert them using the mutation rate you supplied into the model.
| symbol | returned by eSMC2
|
meaning | rescaling factor | becomes |
|---|---|---|---|---|
Tc |
results$Tc |
lower & upper bound of each hidden‑state interval (in coalescent units) |
|
time grid (generations ago) |
Xi |
results$Xi |
population‑size parameter for each interval: |
multiply by |
effective |
The “hidden” scaling factor results$mu) and the empirical
mutation rate you provided (mu).
# generations ago for the midpoint of each interval
Ne0 <- results$mu / mu # inferred 2Ne (diploids) in generations
time <- results$Tc * Ne0 # coalescent units × generationsIf you prefer calendar years, multiply by the generation time in years
(e.g. gen_time <- 1 for annual plants).
Ne_t <- results$Xi * 0.5 * Ne0 # Xi × (Ne0 / 2) because Ne0 is 2NePlot log10(Ne_t) against time (log‑scaled x‑axis) to reproduce the
classic MSMC/MSMC2 stepwise skyline.
These formulas are identical to the post‑processing steps documented for
MSMC2; only the object names differ.
plot(time, log10(Ne_t), type = "s", log = "x",
xlab = "Generations ago", ylab = expression(log[10]~N[e]))| ⚠︎ Computational cost increases when … | ⚠︎ Accuracy decreases when … |
|---|---|
| • > 1 chromosome in one analysis is used • > 10 haplotypes are simultaneously analyzed • > 60 hidden states are used |
• < 30 hidden states are used • sequences < 1 Mb • < 1000 SNPs • ρ ≫ θ per bp |
Depending on the ecological parameters, the recombination rate can be underestimated if, per base pair, the recombination rate is higher than the mutation rate.
eSMC2 cannot disentangle selfing from seed‑banking; therefore it is recommended to avoid simultaneous estimation unless strong priors are available.
PopulationGenomicsChapter/
├── Selfing/
│ ├── eSMC2‑Selfing/
│ ├── teSMC‑Transition/
│ └── Heterozygosity/
├── Dormancy/
├── SMBC/
├── SMCtheo/
├── SMCm/
└── SMC_data_processing/
Each sub‑folder contains its own README plus runnable example.
- Sellinger T. et al. (2020), Inference of past demography, dormancy and self‑fertilization rates from whole genome sequence data. PLOS Genetics.
- Sellinger T. et al. (2021), Limits and convergence properties of the sequentially Markovian coalescent. Molecular Ecology Resources.
- Struett S. et al. (2023), Joint inference of evolutionary transitions to self-fertilization and demographic history using whole-genome sequences. Elife.
- Korfmann K. (2024), Simultaneous Inference of Past Demography and Selection from the Ancestral Recombination Graph under the Beta Coalescent. PCI.
- Sellinger T. et al. (2024), Improved inference of population histories by integrating genomic and epigenomic data. Elife.
For assistance or inquiries, please email Lukas Metzger (lukas.metzger@tum.de) or Miles Anderson (miles.anderson@tum.de).
Last updated: 2025‑07‑25
