---
title: "Bio720 - Simulating Data"
author: "Ian Dworkin"
date: "11/19/2018"
output:
  pdf_document:
    toc: yes
  ioslides_presentation:
    incremental: yes
    keep_md: yes
    widescreen: yes
  html_document:
    keep_md: yes
    toc: yes
  slidy_presentation:
    incremental: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Introduction
One of the most important skills for a new computational biologist is the ability to perform simulations: in particular, to simulate data, to evaluate models numerically, or both.
## Introduction
In biology, mathematical models are the basis for much of the theoretical and conceptual background in disciplines like ecology, evolution, population genetics, molecular evolution and biochemistry.
## Introduction
Many of the models that are developed are not *analytically* tractable. That is, without making very strong biological assumptions there are no *closed form solutions*: the models cannot be solved for the general case.
## Using computers to find numerical solutions
- For such models, "solutions" can still be found.
- For some models, stability analysis can be performed (among other techniques).
- However, most often, scientists resort to computers to identify *numerical solutions*
## Deterministic VS. Stochastic
- *Dynamical* models, which include the majority of models in biology, fall into two broad categories.
- **Deterministic** - where the outcome of the model entirely depends on the model itself and the starting conditions.
- **Stochastic** - where random events influence the outcomes of the model, and only the probability of events can be predicted, not exact outcomes.
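As a quick illustration of the distinction, compare a deterministic population-growth step with a stochastic version of the same step (`logistic_step` is a hypothetical toy model made up for this example, not one used later in these slides):

```{r}
# deterministic update: the same input always gives the same output
logistic_step <- function(n, r = 1.5, K = 100) {
    n + r * n * (1 - n / K)  # discrete logistic growth
}

# stochastic version: offspring number is a Poisson draw around the same expectation
logistic_step_stoch <- function(n, r = 1.5, K = 100) {
    rpois(1, lambda = n + r * n * (1 - n / K))
}

replicate(5, logistic_step(10))       # five identical values
replicate(5, logistic_step_stoch(10)) # values differ from draw to draw
```

Only the probability of each outcome of the stochastic version can be predicted, even though its expectation is the deterministic value.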
## A simple deterministic one-locus model of natural selection.
- In population genetics we often start with a simple deterministic model of how selection on an allele influences its frequency in the population over time.
- In a simple haploid model we can envision two alternative alleles, *A* and *a*.
- *a* is initially fixed in the population, but the new mutation *A* arrives and it is beneficial. What will happen to this allele?
## Haploid selection model
- Frequency of *A* at time *t* is $p(t)$
- Fitness of *A* is $W_A$, and for *a* is $W_a$
$p(t+1) = \frac{p(t) W_A}{p(t) W_A + W_a(1-p(t))}$
$p(t+1) = \frac{p(t) W_A} {\overline W }$
- Where $\overline{W}$ is mean fitness for the population
- How would you convert this into *R* code for one generation to the next?
- Start with what variables you need.
## Converting this into R code - What variables
- We need variables for the fitness values of each allele, $W_A$ and $W_a$, and for the allele frequency of *A*, $p(t)$, at times $t$ and $t+1$.
- With this we can write the equation for allele frequency change from one generation to the next.
## Converting this into R code - What variables
```{r echo=TRUE}
p_t1 <- function(w_A, w_a, p_t0) {
    w_bar <- (p_t0 * w_A) + ((1 - p_t0) * w_a) # mean population fitness
    p_t1 <- (w_A * p_t0) / w_bar
    return(p_t1)
}

p_t1(w_A = 1.1, w_a = 1.0, p_t0 = 0.5)
```
## Is this deterministic or stochastic, what would be a quick way to check?
## Is this deterministic or stochastic, what would be a quick way to check?
```{r echo=TRUE}
replicate(n = 100, p_t1(1.1, 1.0, 0.5))
```
## Allele frequency dynamics.
- Now let's extend this across many generations. We need to rewrite the function, as we expect the allele frequency to change each generation; mean population fitness will change along with it.
- Write this simulation (a *for loop* would be sensible), run it for 200 generations with a starting allele frequency of p = 0.01, and look at the dynamics. Wrap it all in a function called `haploid_selection`.
##
```{r echo=TRUE}
haploid_selection <- function(p0 = 0.01, w1 = 1, w2 = 0.9, n = 100) {
    # Initialize vectors to store allele frequencies and mean pop fitness
    p <- rep(NA, n)  # a vector to store allele frequencies
    w_bar <- rep(NA, n)

    # starting conditions
    p[1] <- p0 # starting allele frequency
    w_bar[1] <- (p[1] * w1) + ((1 - p[1]) * w2)

    # now we need to loop from generation to generation
    for (i in 2:n) {
        w_bar[i - 1] <- (p[i - 1] * w1) + ((1 - p[i - 1]) * w2) # mean population fitness
        p[i] <- (w1 * p[i - 1]) / w_bar[i - 1]
    }
    return(p)
}
```
## Test the model and plot it.
```{r echo=TRUE}
p <- haploid_selection()
generations <- 1:length(p)
plot(p ~ generations, pch = 20,
ylab = "allele frequency",
xlab = "generation")
```
## Using simple numerical simulation to gain intuition for the system.
- Try altering the fitness advantage of *A* a little, or reducing the initial allele frequency.
- Is this deterministic or stochastic?
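A sketch of one such exploration (the function is re-defined here so the chunk is self-contained; the particular fitness values compared are arbitrary choices):

```{r}
haploid_selection <- function(p0 = 0.01, w1 = 1, w2 = 0.9, n = 100) {
    p <- rep(NA, n)
    p[1] <- p0
    for (i in 2:n) {
        w_bar <- (p[i - 1] * w1) + ((1 - p[i - 1]) * w2)
        p[i] <- (w1 * p[i - 1]) / w_bar
    }
    p
}

p_strong <- haploid_selection(w2 = 0.9)  # strong selection (s = 0.1)
p_weak <- haploid_selection(w2 = 0.99)   # weak selection (s = 0.01)

# under weaker selection the allele spreads much more slowly
c(strong = p_strong[100], weak = p_weak[100])
```

Because the model is deterministic, re-running either call always returns exactly the same trajectory.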
## Making a more general function
```{r, echo = TRUE}
haploid.selection <- function(p0 = 0.01, w1 = 1, w2 = 0.9, n = 100) {
    # Initialize vectors to store p, delta p and mean pop fitness
    p <- rep(NA, n)
    delta_p <- rep(NA, n)
    w_bar <- rep(NA, n)

    # starting conditions
    p[1] <- p0        # starting allele frequency
    delta_p[1] <- 0   # change in allele frequency
    w_bar[1] <- (p[1] * w1) + ((1 - p[1]) * w2)

    # now we need to loop from generation to generation
    for (i in 2:n) {
        w_bar[i - 1] <- (p[i - 1] * w1) + ((1 - p[i - 1]) * w2)
        p[i] <- (w1 * p[i - 1]) / w_bar[i - 1]
        delta_p[i] <- p[i] - p[i - 1]
    }

    if (any(p > 0.9999)) {
        fixation <- which.max(p > 0.9999) # first generation where p exceeds 0.9999
        cat("fixation for A1 occurs approximately at generation:", fixation)
    } else {
        maxAlleleFreq <- max(p)
        cat("fixation of A1 does not occur, max. allele frequency is:",
            round(maxAlleleFreq, digits = 2))
    }

    # Let's make some plots
    par(mfrow = c(2, 2))

    # 1. mean population fitness over time
    plot(x = 1:n, y = w_bar,
         xlab = "generations",
         ylab = expression(bar(w)),
         pch = 20, col = "red", cex = 2, cex.lab = 1.5, cex.main = 2.5,
         main = paste("p0 = ", p0, "and s = ", (1 - (w2/w1))))

    # 2. allele frequency over time
    plot(x = 1:n, y = p,
         xlab = "generations",
         ylab = "Allele frequency (p)",
         pch = 20, col = "red", cex.lab = 1.5)

    # 3. plot of p[t+1] vs p[t]
    p.1 <- p[-n]
    p.2 <- p[-1]
    plot(p.2 ~ p.1,
         xlab = expression(p[t]),
         ylab = expression(p[t + 1]),
         pch = 20, col = "red", cex = 2, cex.lab = 1.5)

    # 4. change in allele frequency over time
    plot(x = 2:n, y = delta_p[-1],
         xlab = "generation",
         ylab = expression(paste(Delta, "p")),
         pch = 20, col = "red", cex = 2, cex.lab = 1.5)
}
```
## Let's see what this does.
```{r}
haploid.selection(p0 = 0.0001, w1 = 1, w2 = 0.987, n = 1000)
```
## Stochastic simulations
- The real power of your computational skills comes from performing stochastic simulations on a computer.
- That is, adding sources of random variation at various places in your model and allowing them to *propagate* and influence the outcomes.
## Is it really random?
- Computers cannot generate truly random numbers.
- Instead they use *pseudo-random* number generation.
- The details of how they do this don't matter at the moment.
- What does matter is the effect it has.
## Is it really random?
- try running this code and compare your answers with the person next to you.
```{r}
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```
## Is it really random?
Now try it again, but let's use the *seed* for the random number generator
```{r}
set.seed(720)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
```
## resetting the seed to the same number each time.
```{r}
set.seed(720)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
set.seed(720)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
set.seed(720)
x <- rnorm(1000, mean = 5, sd = 1)
mean(x)
rm(x)
```
## Can you explain what is happening?
- Why did you get different results the first time, the same sequence of results the second time and the exact same results the third time?
## Is it really random?
- No, but as long as we are aware of it, this can be very useful to check our work.
- If you don't specifically set the seed for random number generation, it defaults to the current time and date.
## rolling the dice.
- let's say we want to simulate the rolling of a regular 6-sided die. How would we accomplish this?
- *R* has the `sample()` function.
`sample(x, size, replace = FALSE, prob = NULL)`
- where `x` is the object that you are sampling from.
- `size` is how many samples you want from the object `x`
- `replace` is whether you want to allow sampling with replacement. i.e. with a jar full of jellybeans, do you take one out, record its colour and then put it back before shaking the jar and selecting the next colour? Or once you take it out, you can not select it again (`replace = FALSE`).
- `prob` will be explained a bit later.
## rolling the dice
- let's try this. A 6-sided die can land on only one face, with a value between 1 and 6 (`1:6`).
So one roll of the die
```{r}
sample(1:6, size = 1, replace = TRUE)
```
- Run this a few times.
## rolling the die many times.
- use the replicate function to repeat this roll of the die 1000 times.
- plot the distribution of rolls (how many 1s, 2s, etc.)
## rolling the die many times.
- use the replicate function to repeat this roll of the die 1000 times.
```{r}
rolls <- replicate(1000, sample(1:6, size = 1, replace = TRUE))
```
```{r}
barplot(table(rolls))
```
## Resetting the seed back to "random" based on time and date
```{r}
set.seed(NULL)
```
## A better way to roll the die
- We don't need to invoke the replicate function at all.
- We can simply change the `size` argument in `sample()`.
```{r}
more_rolls <- sample(1:6, size = 1000, replace = TRUE)
barplot(table(more_rolls))
```
## what happens with `replace = FALSE`?
```{r, error= TRUE}
sample(1:6, size = 1000, replace = FALSE)
```
- Why?
## what happens with `replace = FALSE`?
```{r, error= TRUE}
sample(1:6, size = 1000, replace = FALSE)
```
- Why? Because there are only 6 values: each is sampled randomly and then removed, so we can draw at most 6 times from a vector of length 6.
## How to use `replace = FALSE`
- This is particularly useful to get a different permutation of the order that you take the jelly beans out, but with the same sample of jelly beans.
- Try it a few times with the dice example.
```{r}
sample(1:6, size = 6, replace = FALSE)
sample(1:6, size = 6, replace = FALSE)
sample(1:6, size = 6, replace = FALSE)
```
## How is the sample function useful?
- There are many situations where sampling with or without replacement is extremely useful; it forms the basis for an important class of computationally intensive statistical approaches known as *empirical resampling methods*. These include:
- randomization/permutation tests (sampling without replacement). Essentially shuffling the order of observations relative to a treatment variable (at its simplest).
- non-parametric bootstrap (sampling with replacement), is a common method to determine standard errors and confidence intervals on estimated quantities based on properties of the observed distribution.
- Bio708 may introduce you to these ideas, but I have a series of youtube screencasts on it if you want to know more.
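A rough sketch of both ideas using `sample()` (the data here are hypothetical, simulated just for illustration, and the procedures are deliberately stripped down):

```{r}
set.seed(720)
# hypothetical data: two groups of 20 with a small true difference in means
group <- rep(c("ctrl", "trt"), each = 20)
y <- c(rnorm(20, mean = 5), rnorm(20, mean = 5.5))
obs_diff <- mean(y[group == "trt"]) - mean(y[group == "ctrl"])

# permutation test: shuffle the labels (sampling WITHOUT replacement)
perm_diffs <- replicate(1000, {
    shuffled <- sample(group)  # same labels, new random order
    mean(y[shuffled == "trt"]) - mean(y[shuffled == "ctrl"])
})
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))

# non-parametric bootstrap: resample the observations WITH replacement
boot_means <- replicate(1000, mean(sample(y, replace = TRUE)))
boot_se <- sd(boot_means)  # bootstrap standard error of the mean
```

The only moving part in both procedures is `sample()`: labels are permuted without replacement, observations are resampled with replacement.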
## Using the sample function to model genetic drift
- Genetic drift is a change in allele frequencies due not to natural selection but to stochastic sampling of which alleles are transmitted from one generation to the next.
- It is particularly important in small populations, and has important implications in species with conservation concerns.
- Imagine we have two alleles at a given genetic locus, "A" and "a", in a population of diploids (each individual has 2 alleles).
- There are a total of 20 individuals, and anyone can mate with anyone else.
- population size stays constant (20 offspring for a total of 40 alleles).
- If the initial allele frequency for A and a are equal ($f(A)= f(a) = 0.5$), can you write a simulation to see how allele frequencies may change from one generation to the next?
- What are the allele frequencies in the next generation?
- Things to consider:
- Obviously using `sample(x, size, replace)`
- What is your vector of observations in this case (what do we want to sample from)?
- How many alleles do we want in the next generation, and which argument allows for this?
- Do we want `replace = TRUE` or `replace = FALSE`?
- Do you expect to get the same result as the person next to you? Why or why not?
## Using the sample function to model genetic drift
```{r}
allele_counts <- sample(c("A", "a"),
size = 40,
replace = TRUE,
prob = NULL)
table(allele_counts)/length(allele_counts)
```
## How about if we wanted to go one more generation?
- Now our allele frequencies may no longer be equal ($f(A) \neq f(a)$), so we have to let our function know that the probabilities of choosing the "A" and "a" alleles differ.
- This is where the `prob` argument becomes very important.
- Instead of assuming each element of the input vector (in this case alleles "A" and "a") is sampled with equal probability, we can sample them according to their allele frequencies.
```{r}
allele_counts <- sample(c("a", "A"),
size = 40,
replace = TRUE,
prob = c(0.5, 0.5))
allele_freq1 <- table(allele_counts)/length(allele_counts)
allele_freq1
allele_counts2 <- sample(c("a", "A"),
size = 40,
replace = TRUE,
prob = allele_freq1)
allele_freq2 <- table(allele_counts2)/length(allele_counts2)
allele_freq2
```
## How might you generalize this function to allow you to alter population size, starting allele frequency and arbitrary number of generations?
- This will be one of the questions on the assignment
## Numerical stochastic simulations, the basics.
At the heart of stochastic simulations is the need for a probability distribution to simulate from; the easiest, of course, is the *uniform distribution*. The idea is to take random draws from this distribution upon repeated sampling. The functions we will use all start with a lower case *r*, for random, like `rnorm()`, `runif()`, `rbinom()`, etc.
## `runif`
- Let's start with generating a single random number from a uniform distribution on [0,1]
```{r}
runif(n = 1, min = 0, max = 1)
```
- The rNameOfDist functions generate random numbers.
- `min` sets the lowest and `max` the highest possible value; `n` is the number of draws from this distribution.
- We can actually use the `runif` to generate random draws from all other distributions if we wanted to, but it is unnecessary here.
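The reason `runif()` can stand in for other distributions is *inverse transform sampling*: feeding uniform draws on [0,1] through a distribution's quantile function yields draws from that distribution. A minimal sketch, using the normal as an example:

```{r}
set.seed(720)
u <- runif(10000)                # uniform draws on [0, 1]
x <- qnorm(u, mean = 5, sd = 1)  # distributed the same as rnorm(10000, 5, 1)
c(mean = mean(x), sd = sd(x))    # both close to the target values
```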
## `runif()`
If we wanted 10000 such numbers we could use a for loop, or `replicate()`. However, the easier way is to ask for 10000 numbers directly with the `n = 10000` argument.
```{r}
ru1 <- runif(n = 10000, min = 0, max = 1)
length(ru1)
head(ru1)
tail(ru1)
```
- If I plotted a histogram of this data, what should it look like (at least theoretically)?
##
```{r}
par(mfrow = c(1,1))
hist(ru1, freq = F)
curve(dunif(x, 0, 1), 0, 1,
add = T, col = "red", lwd = 2)
```
With the theoretical uniform distribution on [0,1] in red
## simulating from a normal distribution
- If we wanted to look at 100 draws from a normal distribution with mean =5 and sd=2
- $x \sim N(\mu = 5, \sigma = 2)$
```{r}
random.normal.100 <- rnorm(n = 100, mean = 5, sd = 2)
par(mfrow=c(2,2))
plot(random.normal.100)
hist(random.normal.100)
qqnorm(y = random.normal.100)
```
- repeat this simulation a few times to confirm it is changing...
## More efficient random sampling using replicate
- Say I was doing this to get a sense of what would happen in a particular experiment with 100 samples. I may want to repeat this a few times. How might I do this efficiently?
## More efficient random sampling using replicate
```{r results= 'hide'}
random.normal.100.rep <- replicate(n = 4,
rnorm(100, 5, 2))
par(mfrow=c(2,2))
apply(X=random.normal.100.rep,
MARGIN=2, FUN=hist)
apply(X=random.normal.100.rep, MARGIN=2, FUN=mean)
apply(random.normal.100.rep, 2, sd)
```
##
the advantage of this approach is you can also just do
```{r}
summary(random.normal.100.rep)
```
## monte carlo methods
- This is an example of a very simple *monte carlo simulation*
- Let's use this to do something a bit more interesting, simulate from a regression model.
- $Y \sim N(\mu = a + b*x, \sigma)$
- We will make the slope $b = 0.7$, the intercept $a = 5$, and the residual standard deviation $\sigma = 2$
## First the deterministic part
```{r}
par(mfrow=c(1,1))
a = 5 # intercept
b = 0.7 # slope
x <- seq(2,20) # values of our predictor "x"
y_fixed <- a + b*x # we are expressing the relationship between y and x as a linear model. In this case we are generating the data using such a model.
plot(y_fixed ~ x, main= "Deterministic Component of the model") # A linear model
abline(a=5, b=0.7)
```
## add in stochastic component now
```{r}
y.sim.1 <- rnorm(length(x), mean = y_fixed, sd = 2)
plot(y.sim.1 ~ x, pch = 20)
abline(a = 5, b = 0.7, col = "red") # Expected relationship based on the parameters we used.
y.sim.1.lm <- lm(y.sim.1 ~ x) # fit a regression with simulated data
abline(reg = y.sim.1.lm, lty = 2, col = "blue") # estimated values based on simulated data.
```
- Re-run this simulation a few times to see how it changes.
## how about if you wanted to simulate this relationship 25 times?
- Keeping the same deterministic parameters, but changing the residual standard deviation (stochastic variation) to $\sigma = 2.5$
- plot the deterministic fit first.
- Simulate the y values 25 times, each time re-fitting the regression and plotting the lines.
## how about if you wanted to simulate this relationship 25 times?
```{r, results = 'hide'}
plot(y_fixed ~ x, col = "black", type = "n")
abline(a = 5, b = 0.7)

simulated_regression <- function() {
    y.sim.1 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
    y.sim.1.lm <- lm(y.sim.1 ~ x)
    abline(reg = y.sim.1.lm, col = "#FF000032")
}
replicate(n = 25, simulated_regression())

# or with a for loop instead (in blue)
for (i in 1:25) {
    y.sim.1 <- rnorm(length(x), mean = y_fixed, sd = 2.5)
    y.sim.1.lm <- lm(y.sim.1 ~ x)
    abline(reg = y.sim.1.lm, col = "#0000FF32")
}
```
## Back to the all important `sample()` function. Simple simulations of sequences.
- Sometimes we don't want to sample from a probability distribution, but from discrete categories (say the number of A, C, T and G in a sequence), like we did with the 6-sided die.
- The all-powerful `sample()` is the workhorse function for this!
## a random 100000bp sequence with no biases.
- Say you wanted to generate a random DNA sequence with 100000 base pairs, with no GC bias. Do this with the sample function.
- Confirm the length of the sequence.
- What is the distribution of each nucleotide?
## a random 100000bp sequence with no biases.
```{r}
seq1_no_bias <- sample(c("A","C","G", "T"),
size = 100000, replace = TRUE)
table(seq1_no_bias)/length(seq1_no_bias)
length(seq1_no_bias)
head(seq1_no_bias)
```
- `replace = TRUE` means it can repeatedly draw from the letters ACGT, otherwise it could only use each once (for a maximum of 4 draws)
## Convert this vector to a single string, and check how many "letters" are in it
## Convert this vector to a single string, and check how many "letters" are in it
- Now we want to convert this to a single sequence.
```{r}
seq1 <- paste(seq1_no_bias, sep = "", collapse = "")
nchar(seq1)
```
## Let's say we wanted to identify how often a simple sequence motif occurs.
- How many times does AACTTTT occur in this simulated sequence?
- R has many string manipulation functions (see `?grep` and `?regex`).
- We will learn more about them, and the powerful *stringi* and *stringr* packages next week.
- Here we will see just a few.
- Use `gregexpr()` to count the number of occurrences of your pattern (the *g* is for global).
## Let's say we wanted to identify how often a simple sequence motif occurs.
```{r}
x <- gregexpr("AACTTTT", seq1, fixed = T, useBytes = T)
length(unlist(x))
```
- `fixed = TRUE` means the pattern is not treated as a regular expression but interpreted as is. This can speed things up, at the cost of only allowing fixed patterns.
- `useBytes = TRUE` does byte-by-byte comparison, which also speeds up computation, but at the cost of generality.
- In general, use the default settings for `fixed` and `useBytes` unless speed is really important.
## with some GC bias
- Say our sequence simulation is supposed to be from *Drosophila* with a 60:40 GC:AT bias. We can just add in the prob argument. Generate a sequence of 100000 base pairs.
```{r}
seq2_60GCbias <- sample(c("A","C","G", "T"),
size = 100000,
prob = c(30,20,20,30),
replace = TRUE)
table(seq2_60GCbias)/length(seq2_60GCbias)
seq2_60GCbias <- paste0(seq2_60GCbias, collapse="")
nchar(seq2_60GCbias)
y <- gregexpr("AACTTTT", seq2_60GCbias, fixed = T, useBytes = T)
length(unlist(y))
```
- How would you check?
## The code I have written is extremely fragile and error prone.
- Can you think of some reasons?
## The code I have written is extremely fragile and error prone.
- What happens if there are no matches? Check.
- `gregexpr()` counts only non-overlapping matches: if the pattern occurs in overlapping parts of the sequence, the overlapping occurrences are not all counted.
- What happens if there is whitespace? My code does not account for it.
- What happens if some nucleotides are lower case?
- What happens if ambiguous nucleotides, insertions, deletions, or Ns are included?
- DNA is generally double stranded, so we need to deal with the reverse complement as well.
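Two of these fragilities can be patched directly. A zero-width lookahead (a Perl-style regular expression, so it cannot be combined with `fixed = TRUE`) counts overlapping matches, and a small wrapper (`count_matches()` is a hypothetical helper written for this example, not a base R function) guards against the `-1` that `gregexpr()` returns when there is no match:

```{r}
count_matches <- function(pattern, x, ...) {
    m <- unlist(gregexpr(pattern, x, ...))
    if (m[1] == -1) 0 else length(m)
}

s <- "AAAA"
count_matches("AA", s)                  # 2: non-overlapping matches only
count_matches("(?=AA)", s, perl = TRUE) # 3: the lookahead counts overlaps
count_matches("AACTTTT", "GGGG")        # 0: no match, rather than a spurious 1
```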
## So what should we do.
- We could definitely rewrite this function to handle all such exceptions: use `toupper()` to make everything upper case, check for other nucleotides, and remove leading and trailing white space with `trimws()` (which is really just a wrapper around `sub()`). From a learning perspective this is a useful exercise, which I leave to you.
- However, if our goal is to manipulate nucleotide sequences, we should be smart about it and use a well developed tool like the *Biostrings* library. Other than for learning the skills to DIY, or for some specialized purpose no existing tool covers, using well developed libraries makes sense: they have been vetted by many users and developers, are less buggy, and are probably more computationally efficient than what most of us would write.
## reverse complement with base R
- The `chartr(old, new, x)` function translates one set of characters to another in the string *x*.
- `rev()` reverses the order of the elements of a vector (but not the characters of a single string).
```{r}
seq3_60GCbias <- sample(c("A","C","G", "T"),
size = 20,
prob = c(30,20,20,30),
replace = TRUE)
seq3_60GCbias
seq3_60GCbias_rev <- rev(seq3_60GCbias)
seq3_60GCbias_revcomp <- chartr("ACTG", "TGAC", seq3_60GCbias_rev)
seq3_60GCbias_revcomp <- paste0(seq3_60GCbias_revcomp, collapse="")
```
## Do we care about the reverse complement for this example?
- For this simulation example, why might the reverse complement not matter much?