From d87917f2d2085c1ce843eded2d9860352ac72f7c Mon Sep 17 00:00:00 2001
From: Mingye Wang
Date: Sun, 25 Jan 2026 12:23:52 +0800
Subject: [PATCH] Update codon models and frequencies documentation

Add documentation about codon models from a close reading of
https://github.com/iqtree/iqtree3/blob/master/model/modelcodon.cpp.

The main change is a detailed note on how the mechanistic models compute
rates. This is required because

* `MG` does not actually implement what MG1994 does, specifically
  deviating in how multi-nt changes are handled: the original paper sets
  a rate of 0 while IQ-TREE does something else.
* The 1KT{S,V} and 2K models are very reasonable mechanistic approaches
  to multi-nt changes, but people won't know they're reasonable unless
  we describe them.

I also added a mention of `GY0K`. It does not have a literature
reference but is in the code. Describing it first makes describing all
the GY-with-kappa models easier.

Other small changes are:

* Shortened the parameter sentence, since it's too repetitive. We've
  already defined what dn/ds, ts, and tv are, so just use them.
* Added the `ECM` alias for `ECMK07` because it is in the code.
* Mention the difference between `MG` and `GY0K` in the "substitution
  rates" section by pointing to the frequency section. Adding a pointer
  here helps answer questions like "what's the difference between MG2K
  and GY2K?"
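
As a rough illustration of the rate rules documented in the diff below, here is a minimal Python sketch. It is not IQ-TREE code: `count_changes`, `relative_rate`, and the `KIND` table are hypothetical names made up for this note. The sketch follows the wording of the new documentation rather than `modelcodon.cpp` line by line. It assumes the caller already knows whether a change is synonymous, and it leaves out the codon-frequency factor that distinguishes `MG` from `GY0K`.

```python
PURINES = {"A", "G"}

# Hypothetical lookup (not an IQ-TREE structure): which kappa rule each
# documented model name uses.
KIND = {
    "MG": "base", "GY0K": "base",
    "MGK": "ratio", "GY": "ratio",
    "MG1KTS": "ts", "GY1KTS": "ts",
    "MG1KTV": "tv", "GY1KTV": "tv",
    "MG2K": "both", "GY2K": "both",
}


def count_changes(codon_a, codon_b):
    """Return (transitions, transversions) between two codons."""
    n_ts = n_tv = 0
    for x, y in zip(codon_a, codon_b):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if (x in PURINES) == (y in PURINES):
            n_ts += 1
        else:
            n_tv += 1
    return n_ts, n_tv


def relative_rate(model, synonymous, n_ts, n_tv, omega, kappa=1.0, kappa2=1.0):
    """Rate multiplier for one codon change, without codon frequencies."""
    r = 1.0 if synonymous else omega  # base rate shared by all models
    kind = KIND[model]
    if kind == "base":
        return r
    if kind == "ratio":  # kappa = ts/tv
        return r * (1.0 if n_tv > 0 else kappa)
    if kind == "ts":  # kappa = ts
        return r * kappa ** n_ts
    if kind == "tv":  # kappa = tv
        return r * kappa ** n_tv
    # "both": kappa = ts, kappa2 = tv
    return r * kappa ** n_ts * kappa2 ** n_tv


# AAA (Lys) -> GAC (Asp): one transition, one transversion, nonsynonymous.
n_ts, n_tv = count_changes("AAA", "GAC")
rate = relative_rate("MG2K", False, n_ts, n_tv, omega=0.5, kappa=2.0, kappa2=3.0)
```

With these inputs the `2K` rule gives 0.5 × 2.0 × 3.0 = 3.0 for a two-nucleotide change, which is the nonzero multi-nt behaviour the message above contrasts with MG1994.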
---
 doc/Substitution-Models.md | 53 +++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 21 deletions(-)

diff --git a/doc/Substitution-Models.md b/doc/Substitution-Models.md
index 0866f95..f0af4cd 100644
--- a/doc/Substitution-Models.md
+++ b/doc/Substitution-Models.md
@@ -284,7 +284,7 @@ To apply a codon model one should use the option `-st CODON` to tell IQ-TREE tha
 
 | Code | Genetic code meaning |
 |---------|------------------------------------------------------------------------|
-| CODON1 | The Standard Code (same as `-st CODON`)|
+| CODON1 | The Standard Code (same as `-st CODON`) |
 | CODON2 | The Vertebrate Mitochondrial Code |
 | CODON3 | The Yeast Mitochondrial Code |
 | CODON4 | The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code |
@@ -309,24 +309,35 @@ To apply a codon model one should use the option `-st CODON` to tell IQ-TREE tha
 
 IQ-TREE supports several codon models:
 
-| Model | Explanation |
-|------------------|------------------------------------------------------------------------|
-| MG | Nonsynonymous/synonymous (dn/ds) rate ratio ([Muse and Gaut, 1994]).
-| MGK | Like `MG` with additional transition/transversion (ts/tv) rate ratio.
-| MG1KTS or MGKAP2 | Like `MG` with a transition rate ([Kosiol et al., 2007]).
-| MG1KTV or MGKAP3 | Like `MG` with a transversion rate ([Kosiol et al., 2007]).
-| MG2K or MGKAP4 | Like `MG` with a transition rate and a transversion rate ([Kosiol et al., 2007]).
-| GY | Nonsynonymous/synonymous and transition/transversion rate ratios ([Goldman and Yang, 1994]).
-| GY1KTS or GYKAP2 | Like `GY` with a transition rate ([Kosiol et al., 2007]).
-| GY1KTV or GYKAP3 | Like `GY` with a transversion rate ([Kosiol et al., 2007]).
-| GY2K or GYKAP4 | Like `GY` with a transition rate and a transversion rate ([Kosiol et al., 2007]).
-| ECMK07 or KOSI07 | Empirical codon model ([Kosiol et al., 2007]).
-| ECMrest | Restricted version of `ECMK07` that allows only one nucleotide exchange.
-| ECMS05 or SCHN05 | Empirical codon model ([Schneider et al., 2005]).
-
-Users could specify the model parameters (e.g., Nonsynonymous/synonymous (dn/ds) rate ratio, and/or transition/transversion (ts/tv) rate ratio, and/or transition rate, and/or a transversion rate) by `{,[],[]}`. For example, `MG2K{1.0,0.3,0.5}` specifies the nonsynonymous/synonymous (dn/ds) rate ratio, the transition rate, and the transversion rate are 1.0, 0.3, 0.5, respectively. The number of input parameters depends on the definition of each model.
-
-The last three models (`ECMK07`, `ECMrest` or `ECMS05`) are called *empirical* codon models, whereas the others are called *mechanistic* codon models.
+| Model | Explanation |
+|-------------------------|------------------------------------------------------------------------|
+| MG | Nonsynonymous/synonymous (dn/ds) rate ratio ([Muse and Gaut, 1994]). |
+| MGK | Like `MG` with a transition/transversion (ts/tv) rate ratio. |
+| MG1KTS or MGKAP2 | Like `MG` with a transition (ts) rate ([Kosiol et al., 2007]). |
+| MG1KTV or MGKAP3 | Like `MG` with a transversion (tv) rate ([Kosiol et al., 2007]). |
+| MG2K or MGKAP4 | Like `MG` with a transition (ts) rate and a transversion (tv) rate ([Kosiol et al., 2007]). |
+| GY0K or GYKAP1 | Nonsynonymous/synonymous (dn/ds) rate ratio. |
+| GY | Like `GY0K` with a transition/transversion (ts/tv) rate ratio ([Goldman and Yang, 1994]). |
+| GY1KTS or GYKAP2 | Like `GY0K` with a transition (ts) rate ([Kosiol et al., 2007]). |
+| GY1KTV or GYKAP3 | Like `GY0K` with a transversion (tv) rate ([Kosiol et al., 2007]). |
+| GY2K or GYKAP4 | Like `GY0K` with a transition (ts) rate and a transversion (tv) rate ([Kosiol et al., 2007]). |
+| ECM or ECMK07 or KOSI07 | Empirical codon model ([Kosiol et al., 2007]). |
+| ECMrest | Restricted version of `ECMK07` that allows only one nucleotide exchange. |
+| ECMS05 or SCHN05 | Empirical codon model ([Schneider et al., 2005]). |
+
+The mechanistic models compute rates as follows:
+
+* For `MG` and `GY0K`, *omega* = dn/ds. *Rate* = 1.0 if synonymous else *omega*. This is the base rate *r* used below.
+* For `MGK` and `GY`, *kappa* = ts/tv. *Rate* = *r* × (1.0 if *number_of_transversions* > 0 else *kappa*).
+* For `MG1KTS`/`GY1KTS`, *kappa* = ts. *Rate* = *r* × (*kappa* ^ *number_of_transitions*).
+* For `MG1KTV`/`GY1KTV`, *kappa* = tv. *Rate* = *r* × (*kappa* ^ *number_of_transversions*).
+* For `MG2K`/`GY2K`, *kappa* = ts, *kappa2* = tv. *Rate* = *r* × (*kappa* ^ *number_of_transitions*) × (*kappa2* ^ *number_of_transversions*).
+
+`MG` and `GY0K` differ in how they handle codon frequencies. See the next section for more details.
+
+Users could specify the model parameters by `{,[],[]}`. For example, `MG2K{1.0,0.3,0.5}` specifies dn/ds = 1.0, ts = 0.3, tv = 0.5. The number of input parameters depends on the definition of each model.
+
+The last three models (`ECMK07`, `ECMrest`, and `ECMS05`) are called *empirical* codon models, whereas the others are called *mechanistic* codon models.
 The empirical models can only be used with the standard genetic code.
 
 Moreover, IQ-TREE supports combined empirical-mechanistic codon models using an underscore separator (`_`). For example:
@@ -344,8 +355,8 @@ Thus, there can be many such combinations.
 
 IQ-TREE supports the following codon frequencies:
 
-| FreqType | Explanation |
-|----------|------------------------------------------------------------------------|
+| FreqType | df | Explanation |
+|----------|----|------------------------------------------------------------------------|
 | +F | Empirical codon frequencies counted from the data. In AliSim, if users neither specify base frequencies nor supply an input alignment, AliSim will generate base frequencies from empirical distributions.|
 | +FQ | Equal codon frequencies.|
 | +F1X4 | Unequal nucleotide frequencies but equal nt frequencies over three codon positions. In AliSim, if users don't supply an input alignment, the base frequencies are randomly generated based on empirical distributions, or users could specify the frequencies via `+F1X4{,...,}`.|