Bio720 - Simulating Data

Ian Dworkin

11/19/2018
One of the most important skills in your toolkit as a new computational biologist is the ability to perform simulations; in particular, to simulate data or to evaluate models numerically (or both).
In biology, mathematical models are the basis for much of the theoretical and conceptual background in disciplines like ecology, evolution, population genetics, molecular evolution and biochemistry.
Many of the models that are developed are not analytically tractable: without making very strong biological assumptions, there are no closed-form solutions. That is, the models cannot be solved for the general case.
- For such models, "solutions" can still be found.
- For some models, stability analysis can be performed (among other techniques).
- However, most often scientists resort to computers to identify numerical solutions.
- Dynamical models, which include the majority of models in biology, fall into two broad categories:
    - Deterministic, where the outcome of the model depends entirely on the model itself and the starting conditions.
    - Stochastic, where random events influence the outcomes of the model, and only the probabilities of events can be predicted, not exact outcomes.
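A minimal sketch of the distinction (the exponential-growth model and the parameter values here are illustrative, not from the course): the same growth recursion can be iterated deterministically, or with random noise added to the growth rate.

```r
set.seed(720)
n_gen <- 20
N_det <- numeric(n_gen)
N_sto <- numeric(n_gen)
N_det[1] <- N_sto[1] <- 10   # starting population size

for (i in 2:n_gen) {
    N_det[i] <- 1.1 * N_det[i - 1]   # deterministic: identical on every run
    # stochastic: the growth rate itself fluctuates each generation
    N_sto[i] <- N_sto[i - 1] * 1.1 * exp(rnorm(1, mean = 0, sd = 0.1))
}

N_det[n_gen]   # always the same value
N_sto[n_gen]   # differs from run to run (unless the seed is fixed)
```

Re-running the deterministic loop always reproduces the same trajectory; re-running the stochastic one does not.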
- In population genetics we often start with a simple deterministic model of how selection on an allele influences its frequency in the population over time.
- In a simple haploid model we can envision two alternative alleles, A and a.
- a is initially fixed in the population, but the new mutation A arrives and it is beneficial. What will happen to this allele?
- The frequency of A at time $t$ is $p(t)$.
- The fitness of A is $W_A$, and the fitness of a is $W_a$.
- The allele frequency in the next generation is

$$p(t+1) = \frac{W_A \, p(t)}{\overline{W}}$$

where $\overline{W} = p(t)\,W_A + (1 - p(t))\,W_a$ is the mean fitness of the population.

- How would you convert this into R code for one generation to the next?
- Start with what variables you need.
- We need variables for the fitness values of each allele, $W_A$ and $W_a$, and for the allele frequency of A at time $t$, $p(t)$, and at time $t+1$, $p(t+1)$.
- With these we can write the equation for allele frequency change from one generation to the next.
```r
p_t1 <- function(w_A, w_a, p_t0) {
    w_bar <- (p_t0 * w_A) + ((1 - p_t0) * w_a)  # mean population fitness
    p_t1 <- (w_A * p_t0) / w_bar
    return(p_t1)
}

p_t1(w_A = 1.1, w_a = 1.0, p_t0 = 0.5)
```

```
## [1] 0.5238095
```
```r
replicate(n = 100, p_t1(1.1, 1.0, 0.5))
```

```
##   [1] 0.5238095 0.5238095 0.5238095 0.5238095 0.5238095 0.5238095 0.5238095
##   [8] 0.5238095 0.5238095 0.5238095 0.5238095 0.5238095 0.5238095 0.5238095
```

(all 100 replicates return the identical value, 0.5238095)
- Now let's extend this across many generations. We need to rewrite the function, since we expect the allele frequencies to change each generation. Mean population fitness will also change each generation (as allele frequency changes).
- Write this simulation (a for loop would be sensible), run it for 200 generations with a starting allele frequency of p = 0.01, and look at the dynamics. Wrap it all in a function called `haploid_selection`.
```r
haploid_selection <- function(p0 = 0.01, w1 = 1, w2 = 0.9, n = 100) {
    # initialize vectors to store allele frequencies and mean population fitness
    p <- rep(NA, n)
    w_bar <- rep(NA, n)

    # starting conditions
    p[1] <- p0   # starting allele frequency
    w_bar[1] <- (p[1] * w1) + ((1 - p[1]) * w2)

    # now we loop from generation to generation
    for (i in 2:n) {
        w_bar[i - 1] <- (p[i - 1] * w1) + ((1 - p[i - 1]) * w2)  # mean population fitness
        p[i] <- (w1 * p[i - 1]) / w_bar[i - 1]
    }
    return(p)
}

p <- haploid_selection()
generations <- 1:length(p)

plot(p ~ generations, pch = 20,
     ylab = "allele frequency",
     xlab = "generation")
```

- Try altering the fitness advantage of A a little, or reducing the initial allele frequency.
- Is this deterministic or stochastic?
```r
haploid.selection <- function(p0 = 0.01, w1 = 1, w2 = 0.9, n = 100) {
    # initialize vectors to store p, delta p and mean population fitness
    p <- rep(NA, n)
    delta_p <- rep(NA, n)
    w_bar <- rep(NA, n)

    # starting conditions
    p[1] <- p0        # starting allele frequency
    delta_p[1] <- 0   # change in allele frequency
    w_bar[1] <- (p[1] * w1) + ((1 - p[1]) * w2)

    # loop from generation to generation
    for (i in 2:n) {
        w_bar[i - 1] <- (p[i - 1] * w1) + ((1 - p[i - 1]) * w2)
        p[i] <- (w1 * p[i - 1]) / w_bar[i - 1]
        delta_p[i] <- p[i] - p[i - 1]
    }

    if (any(p > 0.9999)) {
        fixation <- which.max(p > 0.9999)  # first generation past the threshold
        cat("fixation for A1 occurs approximately at generation:", fixation)
    } else {
        maxAlleleFreq <- max(p)
        cat("fixation of A1 does not occur, max. allele frequency is:",
            round(maxAlleleFreq, digits = 2))
    }

    # let's make some plots
    par(mfrow = c(2, 2))

    # 1. mean population fitness over time
    plot(x = 1:n, y = w_bar,
         xlab = "generations",
         ylab = expression(bar(w)),
         pch = 20, col = "red", cex = 2, cex.lab = 1.5, cex.main = 2.5,
         main = paste("p0 = ", p0, "and s = ", (1 - (w2/w1))))

    # 2. allele frequency over time
    plot(x = 1:n, y = p,
         xlab = "generations",
         ylab = "Allele frequency (p)",
         pch = 20, col = "red", cex.lab = 1.5)

    # 3. plot of p[t+1] vs p[t]
    p.1 <- p[-n]
    p.2 <- p[-1]
    plot(p.2 ~ p.1,
         xlab = expression(p[t]),
         ylab = expression(p[t+1]),
         pch = 20, col = "red", cex = 2, cex.lab = 1.5)

    # 4. change in allele frequency across generations
    plot(x = 2:n, y = delta_p[-1],
         xlab = "generation",
         ylab = expression(paste(Delta, "p")),
         pch = 20, col = "red", cex = 2, cex.lab = 1.5)
}

haploid.selection(p0 = 0.0001, w1 = 1, w2 = 0.987, n = 1000)
```

```
## fixation of A1 does not occur, max. allele frequency is: 0.98
```
- The real power of your computational skills comes from performing stochastic simulations on a computer.
- That is, adding sources of random variation at various places in your model, and allowing that variation to propagate and influence the outcomes.
- Computers cannot generate truly random numbers.
- Instead they use pseudo-random number generation.
- The details of how they do it don't matter at the moment.
- What does matter is the effect it has.
- Try running this code and compare your answers with the person next to you.
```r
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 4.983739
```

```r
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 4.965949
```

```r
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 5.044017
```
Now try it again, but this time let's set the seed for the random number generator.

```r
set.seed(720)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 5.01245
```

```r
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 4.99981
```

```r
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 4.941751
```

```r
set.seed(720)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 5.01245
```

```r
set.seed(720)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 5.01245
```

```r
set.seed(720)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```

```
## [1] 5.01245
```

```r
rm(x)
```

- Why did you get different results the first time, the same sequence of results the second time, and the exact same result the third time?
- So are these numbers truly random? No; but as long as we are aware of that, this behaviour can be very useful for checking our work.
- If you don't explicitly set the seed for random number generation, R initializes it unpredictably (from the current time and process ID).
- Let's say we want to simulate rolling a regular 6-sided die. How would we accomplish this?
- R has the `sample()` function:

```r
sample(x, size, replace = FALSE, prob = NULL)
```

- `x` is the object that you are sampling from.
- `size` is how many samples you want from the object `x`.
- `replace` is whether you want to allow sampling with replacement. i.e. with a jar full of jellybeans, do you take one out, record its colour, and then put it back before shaking the jar and selecting the next one? Or, once you take it out, can it not be selected again (`replace = FALSE`)?
- `prob` will be explained a bit later.
- Let's try this. A 6-sided die can only land on one face, between 1 and 6 (`1:6`). So one roll of the die:

```r
sample(1:6, size = 1, replace = TRUE)
```

```
## [1] 3
```
- Run this a few times.
- Use the replicate function to repeat this roll of the die 100 times.
- Plot the distribution of rolls (how many 1's, 2's, etc.).
- Use the replicate function to repeat this roll of the die 1000 times.

```r
rolls <- replicate(1000, sample(1:6, size = 1, replace = TRUE))
barplot(table(rolls))
set.seed(NULL)
```

- We don't need to invoke the replicate function at all.
- We can simply change the `size` argument in `sample()`.

```r
more_rolls <- sample(1:6, size = 1000, replace = TRUE)
barplot(table(more_rolls))

sample(1:6, size = 1000, replace = FALSE)
```

```
## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'
```
- Why? Because there are only 6 values: the function samples them randomly, but removes each one after it is drawn. So we can only sample 6 times from a vector of length 6.
- This is particularly useful for getting a different permutation of the order in which you take the jellybeans out, but with the same sample of jellybeans.
- Try it a few times with the dice example.

```r
sample(1:6, size = 6, replace = FALSE)
```

```
## [1] 1 3 4 6 5 2
```

```r
sample(1:6, size = 6, replace = FALSE)
```

```
## [1] 1 2 6 4 3 5
```

```r
sample(1:6, size = 6, replace = FALSE)
```

```
## [1] 3 4 6 5 1 2
```
- Sampling with or without replacement is extremely useful in many settings, and forms the basis of an important family of computationally intensive statistical approaches known as empirical resampling methods. These include:
    - randomization/permutation tests (sampling without replacement): essentially shuffling the order of observations relative to a treatment variable (at its simplest);
    - the non-parametric bootstrap (sampling with replacement): a common method for determining standard errors and confidence intervals on estimated quantities, based on properties of the observed distribution.
- Bio708 may introduce you to these ideas, but I have a series of YouTube screencasts on the topic if you want to know more.
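A minimal sketch of both resampling ideas using `sample()` (the data and group labels here are invented for illustration):

```r
set.seed(720)
x <- rnorm(20, mean = 5, sd = 2)   # some "observed" data

# non-parametric bootstrap: resample WITH replacement, recompute the estimate
boot_means <- replicate(1000,
    mean(sample(x, size = length(x), replace = TRUE)))
sd(boot_means)   # bootstrap standard error of the mean

# permutation test: shuffling labels is just sampling WITHOUT replacement
group <- rep(c("control", "treatment"), each = 10)
obs_diff <- diff(tapply(x, group, mean))
perm_diffs <- replicate(1000,
    diff(tapply(x, sample(group), mean)))   # shuffle the group labels
mean(abs(perm_diffs) >= abs(obs_diff))      # two-sided permutation p-value
```

Since the labels here are assigned arbitrarily, the permutation p-value should be unremarkable; with real treatment effects it would tend toward zero.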
- Genetic drift: a change in allele frequencies due not to natural selection, but to stochastic sampling of which alleles are transmitted from one generation to the next.
- It is particularly important in small populations, and has important implications for species of conservation concern.
- Imagine we have two alleles at a given genetic locus, "A" and "a", in a population of diploids (each individual has 2 alleles).
- There are a total of 20 individuals, and anyone can mate with anyone else.
- Population size stays constant (20 offspring, for a total of 40 alleles).
- If the initial allele frequencies for A and a are equal ($f(A) = f(a) = 0.5$), can you write a simulation to see how allele frequencies may change from one generation to the next?
- What are the allele frequencies in the next generation?
- Things to consider:
    - Obviously use `sample(x, size, replace)`.
    - What is your vector of observations in this case (what do we want to sample from)?
    - How many alleles do we want in the next generation, and which argument allows for this?
    - Do we want `replace = TRUE` or `replace = FALSE`?
    - Do you expect to get the same result as the person next to you? Why or why not?
```r
allele_counts <- sample(c("A", "a"),
                        size = 40,
                        replace = TRUE,
                        prob = NULL)

table(allele_counts)/length(allele_counts)
```

```
## allele_counts
##    a    A 
## 0.45 0.55
```
- Now our allele frequencies are $f(A) \neq f(a)$, so we have to let our function know that the probabilities of choosing the "a" and "A" alleles differ.
- This is where the `prob` argument becomes very important.
- Instead of assuming each element of the input vector (in this case the alleles "A" and "a") is sampled with equal probability, we can sample them according to their allele frequencies.
```r
allele_counts <- sample(c("a", "A"),
                        size = 40,
                        replace = TRUE,
                        prob = c(0.5, 0.5))

allele_freq1 <- table(allele_counts)/length(allele_counts)
allele_freq1
```

```
## allele_counts
##    a    A 
## 0.45 0.55
```

```r
allele_counts2 <- sample(c("a", "A"),
                         size = 40,
                         replace = TRUE,
                         prob = allele_freq1)

allele_freq2 <- table(allele_counts2)/length(allele_counts2)
allele_freq2
```

```
## allele_counts2
##    a    A 
## 0.25 0.75
```
How might you generalize this function to allow you to alter the population size, the starting allele frequency, and an arbitrary number of generations?

- This will be one of the questions on the assignment.
At the heart of stochastic simulation is the need for a probability distribution to simulate from. The easiest is, of course, the uniform distribution: the idea is to take random draws from this distribution upon repeated sampling. The set of functions we will use all start with a lower case r, for random, like `rnorm()`, `runif()`, `rbinom()`, etc.
- Let's start by generating a single random number from a uniform distribution on [0, 1].

```r
runif(n = 1, min = 0, max = 1)
```

```
## [1] 0.1127679
```

- The `rNameOfDist` functions generate random numbers.
- `min` sets the lowest possible value, `max` the highest; `n` is the number of draws from the distribution.
- We could actually use `runif` to generate random draws from all other distributions if we wanted to, but it is unnecessary here.
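To illustrate that claim, the inverse-CDF (inverse transform) trick converts uniform draws into draws from another distribution. A sketch for the exponential distribution (the rate value of 2 is arbitrary):

```r
set.seed(720)
u <- runif(10000)   # uniform draws on [0, 1]
x <- -log(u) / 2    # inverse CDF of an Exponential(rate = 2)

mean(x)                       # should be close to 1/rate = 0.5
mean(rexp(10000, rate = 2))   # compare with R's built-in generator
```

The same idea underlies random number generation for many distributions whose CDFs can be inverted.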
If we wanted 10000 such numbers we could use a for loop, or `replicate()` in R. However, the easier way is to ask for 10000 numbers with the `n = 10000` argument.

```r
ru1 <- runif(n = 10000, min = 0, max = 1)
length(ru1)
```

```
## [1] 10000
```

```r
head(ru1)
```

```
## [1] 0.08486152 0.90806514 0.95425768 0.29654024 0.47065853 0.56805520
```

```r
tail(ru1)
```

```
## [1] 0.76798117 0.08949722 0.04894085 0.03430759 0.09629528 0.05323584
```

- If I plotted a histogram of this data, what should it look like (at least theoretically)?

```r
par(mfrow = c(1, 1))
hist(ru1, freq = F)
curve(dunif(x, 0, 1), 0, 1,
      add = T, col = "red", lwd = 2)
```

with the theoretical uniform distribution on [0, 1] in red.
- If we wanted to look at 100 draws from a normal distribution with mean = 5 and sd = 2, $x \sim N(\mu = 5, \sigma = 2)$:

```r
random.normal.100 <- rnorm(n = 100, mean = 5, sd = 2)

par(mfrow = c(2, 2))
plot(random.normal.100)
hist(random.normal.100)
qqnorm(y = random.normal.100)
```

- Repeat this simulation a few times to confirm it is changing.
- Say I was doing this to get a sense of what would happen in a particular experiment with 100 samples. I may want to repeat the simulation a few times. How might I do this efficiently?

```r
random.normal.100.rep <- replicate(n = 4,
                                   rnorm(100, 5, 2))

par(mfrow = c(2, 2))
apply(X = random.normal.100.rep,
      MARGIN = 2, FUN = hist)

apply(X = random.normal.100.rep, MARGIN = 2, FUN = mean)
apply(random.normal.100.rep, 2, sd)
```

The advantage of this approach is that you can also just do:

```r
summary(random.normal.100.rep)
```

```
##        V1               V2                V3               V4       
##  Min.   :0.8106   Min.   :-0.5143   Min.   :-1.113   Min.   :1.926  
##  1st Qu.:3.7606   1st Qu.: 3.2204   1st Qu.: 4.135   1st Qu.:4.122  
##  Median :4.8543   Median : 4.4249   Median : 5.284   Median :5.476  
##  Mean   :4.9794   Mean   : 4.5496   Mean   : 5.185   Mean   :5.401  
##  3rd Qu.:6.0381   3rd Qu.: 6.0378   3rd Qu.: 6.509   3rd Qu.:6.541  
##  Max.   :9.5285   Max.   : 8.8425   Max.   : 9.442   Max.   :9.299
```
- This is an example of a very simple Monte Carlo simulation.
- Let's use this to do something a bit more interesting: simulate from a regression model, $Y \sim N(\mu = a + b x, \sigma)$.
- We will make the slope $b = 0.7$, the intercept $a = 5$, and the residual standard deviation $\sigma = 2$.
```r
par(mfrow = c(1, 1))
a = 5    # intercept
b = 0.7  # slope
x <- seq(2, 20)  # values of our predictor "x"

# we are expressing the relationship between y and x as a linear model;
# in this case we are generating the data using such a model
y_fixed <- a + b*x

plot(y_fixed ~ x, main = "Deterministic Component of the model")
abline(a = 5, b = 0.7)

# now add the stochastic component
y.sim.1 <- rnorm(length(x), mean = y_fixed, sd = 2)

plot(y.sim.1 ~ x, pch = 20)
abline(a = 5, b = 0.7, col = "red")  # expected relationship based on the parameters we used

y.sim.1.lm <- lm(y.sim.1 ~ x)  # fit a regression with the simulated data
abline(reg = y.sim.1.lm, lty = 2, col = "blue")  # estimated relationship from the simulated data
```

- Re-run this simulation a few times to see how it changes.
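To quantify that run-to-run variation rather than just eyeballing it, we can collect the estimated slope from many simulated data sets (a sketch; the parameters are redefined here so the block runs on its own):

```r
set.seed(720)
a <- 5
b <- 0.7
x <- seq(2, 20)
y_fixed <- a + b * x

sim_slope <- function() {
    y_sim <- rnorm(length(x), mean = y_fixed, sd = 2)
    coef(lm(y_sim ~ x))[2]   # extract the estimated slope
}

slopes <- replicate(1000, sim_slope())
mean(slopes)   # centred near the true slope, 0.7
sd(slopes)     # Monte Carlo estimate of the slope's standard error
hist(slopes)
```

The spread of this distribution is the sampling variation in the slope estimate that the repeated `abline()` calls below visualize as a fan of lines.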
- Keeping the same deterministic parameters, change the residual standard error (stochastic variation) to $\sigma = 2.5$.
- Plot the deterministic fit first.
- Simulate the y values 25 times, each time re-fitting the regression and plotting the lines.

```r
plot(y_fixed ~ x, col = "black", type = "n")
abline(a = 5, b = 0.7)

simulated_regression <- function() {
    y.sim.1 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
    y.sim.1.lm <- lm(y.sim.1 ~ x)
    abline(reg = y.sim.1.lm, col = "#FF000032")
}

replicate(n = 25, simulated_regression())

# or with a for loop instead (in blue)
for (i in 1:25) {
    y.sim.1 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
    y.sim.1.lm <- lm(y.sim.1 ~ x)
    abline(reg = y.sim.1.lm, col = "#0000FF32")
}
```

- Sometimes we don't want to sample from a probability distribution, but from discrete categories (say, the numbers of A, C, T and G in a sequence), like we did with the 6-sided die.
- The all-powerful `sample()` is the workhorse function for this.
- Say you wanted to generate a random DNA sequence 100000 base pairs long, with no GC bias. Do this with the sample function.
- Confirm the length of the sequence.
- What is the distribution of each nucleotide?
```r
seq1_no_bias <- sample(c("A", "C", "G", "T"),
                       size = 100000, replace = TRUE)

table(seq1_no_bias)/length(seq1_no_bias)
```

```
## seq1_no_bias
##       A       C       G       T 
## 0.24951 0.24898 0.25081 0.25070
```

```r
length(seq1_no_bias)
```

```
## [1] 100000
```

```r
head(seq1_no_bias)
```

```
## [1] "C" "A" "A" "C" "G" "A"
```

- `replace = TRUE` means the function can repeatedly draw from the letters A, C, G and T; otherwise it could only use each letter once (for a maximum of 4 draws).
- Now we want to collapse this into a single sequence.

```r
seq1 <- paste(seq1_no_bias, sep = "", collapse = "")
nchar(seq1)
```

```
## [1] 100000
```
- How many times does AACTTTT occur in this simulated sequence?
- R has many string manipulation functions (see `?grep` and `?regex`).
- We will learn more about them, and the powerful stringi and stringr packages, next week.
- Here we will see just a few.
- Use `gregexpr()` to count the number of occurrences of your pattern (the g is for global).

```r
x <- gregexpr("AACTTTT", seq1, fixed = T, useBytes = T)
length(unlist(x))
```

```
## [1] 5
```

- `fixed = T` means the pattern is not treated as a regular expression, but interpreted as-is. This can speed things up, at the cost of only allowing fixed patterns.
- `useBytes = T` does byte-by-byte comparison. This also speeds up computation, but at the cost of generality.
- In general, use the default settings for fixed and useBytes unless speed is really important.
- Say our sequence simulation is supposed to be from Drosophila, with a 60:40 AT:GC bias (i.e., 40% GC). We can just add in the `prob` argument. Generate a sequence of 100000 base pairs.

```r
seq2_60GCbias <- sample(c("A", "C", "G", "T"),
                        size = 100000,
                        prob = c(30, 20, 20, 30),
                        replace = TRUE)

table(seq2_60GCbias)/length(seq2_60GCbias)
```

```
## seq2_60GCbias
##       A       C       G       T 
## 0.29895 0.20129 0.19913 0.30063
```

```r
seq2_60GCbias <- paste0(seq2_60GCbias, collapse = "")
nchar(seq2_60GCbias)
```

```
## [1] 100000
```

```r
y <- gregexpr("AACTTTT", seq2_60GCbias, fixed = T, useBytes = T)
length(unlist(y))
```

```
## [1] 11
```
- The AT-biased sequence has more matches. How would you check whether that is expected?
- Can you think of some reasons?
- What happens if there are no matches? Check.
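One way to check the observed counts (a sketch; this assumes sites are drawn independently, which is true of our simulations): by linearity of expectation, the expected number of matches is the per-window match probability times the number of starting positions.

```r
L <- 100000          # sequence length
n_windows <- L - 6   # starting positions for a 7-base pattern

# unbiased sequence: each base has probability 1/4
(1/4)^7 * n_windows   # approx. 6.1 expected AACTTTT matches

# biased sequence: p(A) = p(T) = 0.3, p(C) = p(G) = 0.2
# AACTTTT contains two A's, one C and four T's
(0.3^2 * 0.2 * 0.3^4) * n_windows   # approx. 14.6 expected matches
```

The observed counts above (5 and 11) are both within sampling noise of these expectations, and the AT-rich pattern is indeed expected more often in the AT-biased sequence.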
- `gregexpr()` only includes non-overlapping matches: if the pattern occurs in overlapping parts of the sequence, it is only counted once.
- What happens if there is whitespace? My code does not account for it.
- What happens if some nucleotides are lower case?
- What happens if ambiguous nucleotides, insertions, deletions, or Ns are included?
- DNA is generally double stranded, so we need to deal with the reverse complement as well.
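The overlap caveat can be checked directly: a zero-width lookahead in Perl-compatible regex mode reports every starting position, including overlapping ones (a sketch):

```r
s <- "AAAA"

# non-overlapping: "AA" is found at positions 1 and 3 only
length(unlist(gregexpr("AA", s, fixed = TRUE)))

# overlapping: the zero-width lookahead matches at positions 1, 2 and 3
length(unlist(gregexpr("(?=AA)", s, perl = TRUE)))
```

For short, repetitive motifs the two counts can differ substantially.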
- We could definitely rewrite this function to handle all such exceptions: use functions like `toupper()` to make everything upper case, check for other nucleotides, and remove leading and trailing whitespace with `trimws()` (which is really just a wrapper around `sub()`). From a learning perspective this is a useful exercise, which I leave to you.
- However, if our goal is to manipulate nucleotide sequences, we should be smart about it and use a well-developed tool like the Biostrings library. Other than for the purpose of learning the skills yourself (DIY, or when learning to program), or for some specialized purpose no existing tool covers, using well-developed libraries makes sense: they have been vetted by many users and developers, and are less buggy and probably more computationally efficient than what most of us would write.
- The `chartr(old, new, x)` function translates one set of characters to another in a string `x`.
- `rev()` reverses the order of the elements of a vector (but not the characters of a single string).
```r
seq3_60GCbias <- sample(c("A", "C", "G", "T"),
                        size = 20,
                        prob = c(30, 20, 20, 30),
                        replace = TRUE)
seq3_60GCbias
```

```
##  [1] "A" "C" "C" "C" "G" "G" "T" "G" "A" "G" "A" "G" "A" "T" "A" "T" "T"
## [18] "T" "A" "C"
```

```r
seq3_60GCbias_rev <- rev(seq3_60GCbias)
seq3_60GCbias_revcomp <- chartr("ACTG", "TGAC", seq3_60GCbias_rev)
seq3_60GCbias_revcomp <- paste0(seq3_60GCbias_revcomp, collapse = "")
```

- For this simulation example, why might the reverse complement not matter much?