Skip to content

Tutorials for the book chapter to infer biological traits using the Sequentially Markovian Coalescent

Notifications You must be signed in to change notification settings

StatisticalPopulationGenomics-2ndEd/eSMC2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

102 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

eSMC2 — Tutorials for the Statistical Population Genomics Book Chapter

This project includes worked examples, helper scripts and tiny demo data sets that accompany the “Inference of Biological Traits using the Sequentially Markovian Coalescent” chapter (link).

topic tool / section folder
Self‑fertilisation (σ)
  • constant σ
  • transition outcrossing → selfing
  • F‑based quick estimate
eSMC2
teSMC
heterozygosity formula
Selfing/eSMC2‑Selfing/
Selfing/teSMC‑Transition/
Selfing/Heterozygosity/
Dormancy (β) eSMC2 eSMC2‑Dormancy/
Sweepstakes / β‑coalescent (α) SMBC SMBC-Sweepstakes/
Multiple markers SMCtheo, SMCm SMCtheo/ & SMCm/
Data preprocessing VCF → mask → multi‑het‑sep SMC_data_processing/

1 Quick start

  1. Clone the repo

    git clone https://github.com/StatisticalPopulationGenomics-2ndEd/eSMC2.git
    cd eSMC2
  2. Open any *.html or *.pdf inside the tutorial folders for an immediate walk‑through (no software required).

  3. To rerun the code you need R (≥ 4.1) plus the eSMC2 package (see install guide below).


2 Installing eSMC2

  1. Download the latest eSMC2 .tar.gz from
    https://github.com/TPPSellinger/eSMC2/releases

  2. Install it in R:

# install devtools once
install.packages("devtools")
library(devtools)

# local path to the .tar.gz you downloaded from
# https://github.com/TPPSellinger/eSMC2/releases
path <- "~/Downloads/eSMC2_5.1.3.tar.gz"   # adjust as needed
install_local(path)

# no root permissions? install into a personal library
withr::with_libpaths(new = "~/R/local_lib",
                     install_local(path))

3 Input format – “Segregating matrix”

eSMC2() does not read VCFs directly. Instead it needs a segregation matrix that captures, for every SNP, (1) the alleles carried by each haplotype, (2) how many bases were callable at that position, and (3) the SNP’s genomic coordinate. It is similar to a transposed Multihetsep file (see further information in the MSMC guide
https://github.com/stschiff/msmc-tools/blob/master/msmc-tutorial/guide.md):

  • Rows 1 … M  → one row per haplotype (phased). Alleles can be A/C/G/T or integer codes 0/1/2/3 (faster). Missing data = “N” or dot.
  • Row M + 1   → number of called sites per column; this lets the HMM ignore coverage gaps just like MSMC.
  • Row M + 2   → genomic position along the scaffold/chromosome; must be strictly increasing.

Columns therefore correspond to SNPs; rows to biological information.
Conversion from VCF is explained in SMC_data_processing/, which uses bcftools, GATK, and msmc-tools).

Example snippet (3 haplotypes, 5 SNPs)

0 1 0 1 0
1 1 0 0 0
0 0 0 1 1
6 6 5 6 5         #   <- row M+1
10012 10547 13001 18110 19455   # <- row M+2

4 Quick run with phased haplotypes (one single .mhs)

library(eSMC2)

mhs_path <- "./CEU_regions.mhs"
M        <- 20  # number of haplotypes (rows 1..M in the .mhs)

# Returns a list: one matrix per chromosome/scaffold
obs_chr <- Get_real_data(dirname(mhs_path), M, basename(mhs_path), "\t")

res <- eSMC2(
  n   = 40,
  rho = 3.6e-8 / 6.95e-9,
  obs_chr,                 # third positional arg = list of matrices
  Rho = FALSE,
  NC  = length(obs_chr)    # number of chromosomes/scaffolds
)

5 Multiple unphased diploids (two common input patterns)

Treat each diploid as one chromosome: two pseudo-haploid rows + the two meta rows. Build a list with one matrix per diploid and pass that list as the third positional argument to eSMC2().

Set NC = number_of_individuals × chromosomes_per_individual.

Each matrix in the list must have:

  • Rows 1…M: haplotypes (A/C/G/T or integer codes)
  • Row M+1: number of called sites
  • Row M+2: genomic position (strictly increasing)

You can feed eSMC2 in two common ways:


Pattern A — one combined .mhs with many individuals

library(eSMC2)

mhs_path <- "./CEU_regions.mhs"
M_full   <- 20          # 10 diploids × 2
dip_ids  <- 1:4         # choose diploids
chr_keep <- 1           # choose chromosome/scaffold

Os_chr <- Get_real_data(dirname(mhs_path), M_full, basename(mhs_path), "\t")
M      <- nrow(Os_chr[[chr_keep]]) - 2L
tail   <- (M+1L):(M+2L)

# one matrix per diploid: hap1, hap2, meta1, meta2
obs_list <- lapply(dip_ids, function(i) {
  rows <- c(2*i-1L, 2*i, tail)
  Oi   <- Os_chr[[chr_keep]][rows, , drop = FALSE]
  mode(Oi) <- "character"   # keep A/C/G/T; use integer codes for speed if desired
  Oi
})

res <- eSMC2(
  n   = 40,
  rho = 3.6e-8 / 6.95e-9,
  obs_list,
  SF  = TRUE, Rho = FALSE, BoxP = c(3,3),
  NC  = length(obs_list)
)

Pattern B — one .mhs per individual

library(eSMC2)

files <- list.files("Dormancy/", pattern = "\\.multihetsep\\.txt$", full.names = TRUE)
M     <- 2  # haplotypes per file

mats <- lapply(files, function(f) Get_real_data(dirname(f), M, basename(f), "\t"))

res <- eSMC2(
  n   = 40,
  rho = (8.0e-8/6) / 4.33e-9,    # example: annual recomb / (clonal+sexual cycles), then / mu
  mats,
  SB  = TRUE, SF = FALSE, Rho = FALSE,
  BoxP = c(3,3), BoxB = c(0.3,0.99),
  NC  = length(mats)
)

eSMC2 uses a composite-likelihood over “chromosomes”. For unphased diploids we treat each individual as one chromosome composed of two pseudo-haploid rows plus the two meta rows from the .mhs. Subsetting per individual gives the per-diploid matrices you pass in a list. This is analogous to how MSMC2 consume multiple chromosomes.

Selfers — quick note:
In strong selfing species, within-individual heterozygosity is low, so a single diploid may carry few segregating sites.
eSMC2 will run, but estimates can be noisy.
Many analyses instead form pseudo-diploids between individuals (often requires phasing) to boost signal.

See Dormancy/ and Selfing/eSMC2-Selfing/ in this repo for full examples.


6 Plotting in eSMC2

eSMC2() works in coalescent units, so the times and population sizes it outputs are still in a “scaled” space. To plot the results in actual generations and diploid effective population size, you first need to convert them using the mutation rate you supplied into the model.

symbol returned by eSMC2 meaning rescaling factor becomes
Tc results$Tc lower & upper bound of each hidden‑state interval (in coalescent units) $2N_e$ generations → multiply by $N_e$ time grid (generations ago)
Xi results$Xi population‑size parameter for each interval: $\Xi_t = N_{e,t}/N_{e,0}$ multiply by $N_{e,0}$ effective $N_e(t)$

The “hidden” scaling factor $N_{e,0}$ can be recovered from the mutation rate that maximises the likelihood (results$mu) and the empirical mutation rate you provided (mu).


Time axis (generations)

# generations ago for the midpoint of each interval
Ne0  <- results$mu / mu          # inferred 2Ne (diploids) in generations
time <- results$Tc * Ne0         # coalescent units × generations

If you prefer calendar years, multiply by the generation time in years (e.g. gen_time <- 1 for annual plants).


Population size (diploids)

Ne_t <- results$Xi * 0.5 * Ne0   # Xi × (Ne0 / 2)   because Ne0 is 2Ne

Plot log10(Ne_t) against time (log‑scaled x‑axis) to reproduce the classic MSMC/MSMC2 stepwise skyline. These formulas are identical to the post‑processing steps documented for MSMC2; only the object names differ.

Example plot

plot(time, log10(Ne_t), type = "s", log = "x",
     xlab = "Generations ago", ylab = expression(log[10]~N[e]))


7 Warnings & good practice

⚠︎ Computational cost increases when … ⚠︎ Accuracy decreases when …
• > 1 chromosome in one analysis is used
• > 10 haplotypes are simultaneously analyzed
• > 60 hidden states are used
• < 30 hidden states are used
• sequences < 1 Mb
• < 1000 SNPs
• ρ ≫ θ per bp

Depending on the ecological parameters, the recombination rate can be underestimated if, per base pair, the recombination rate is higher than the mutation rate.

eSMC2 cannot disentangle selfing from seed‑banking; therefore it is recommended to avoid simultaneous estimation unless strong priors are available.


8 Repository layout

PopulationGenomicsChapter/
├── Selfing/
│   ├── eSMC2‑Selfing/
│   ├── teSMC‑Transition/
│   └── Heterozygosity/
├── Dormancy/
├── SMBC/
├── SMCtheo/
├── SMCm/
└── SMC_data_processing/

Each sub‑folder contains its own README plus runnable example.


9 Citing this work

  • Sellinger T. et al. (2020), Inference of past demography, dormancy and self‑fertilization rates from whole genome sequence data. PLOS Genetics.
  • Sellinger T. et al. (2021), Limits and convergence properties of the sequentially Markovian coalescent. Molecular Ecology Resources.
  • Struett S. et al. (2023), Joint inference of evolutionary transitions to self-fertilization and demographic history using whole-genome sequences. Elife.
  • Korfmann K. (2024), Simultaneous Inference of Past Demography and Selection from the Ancestral Recombination Graph under the Beta Coalescent. PCI.
  • Sellinger T. et al. (2024), Improved inference of population histories by integrating genomic and epigenomic data. Elife.

10 Support

For assistance or inquiries, please email Lukas Metzger (lukas.metzger@tum.de) or Miles Anderson (miles.anderson@tum.de).

Last updated: 2025‑07‑25

About

Tutorials for the book chapter to infer biological traits using the Sequentially Markovian Coalescent

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published