Bio720/R_exercises/Bio720_R_InClassExercise2.Rmd at master · DworkinLab/Bio720 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
title: "Bio720 R In Class Exercise 2"
author: "Ian Dworkin"
date: "`r format(Sys.time(),'%d %b %Y')`"
output:
  html_document:
    keep_md: yes
    number_sections: yes
    toc: yes
editor_options:
  chunk_output_type: console
---
# Introduction

In class assignment. Please work in groups of 2-3. When you are finished with your assignment, please (as both a .Rmd and a .md file) add this to your class github repo (or make a new one if you really want).

## for loops and the apply family of functions
During the past two week for, you learned a bit about how to use `for` loops and today the `apply` family of functions. Now we are going to do some in class practice and extend your understanding of how they work. In case you want some alternative tutorials [here](https://www.datacamp.com/community/tutorials/tutorial-on-loops-in-r#gs.FgFVHlY) is one that goes into more detail.

1. What happens to `x` for each of these functions after you run them? Please explain what is happening in each case. i.e. Why do these behave differently?

```{r}
x <- 1

for (i in 1:9) {
    x <- x + 1
    print(x)
}

x

```

**VS**

```{r, error=TRUE}
rm(x)

countFun <-function(x)
{

  for (i in 1:10) {
    x <- x + 1
    print(x)
}}

countFun(1)
x
```


2. Write a for loop that generates a vector which is the square of all integers from 1 to 10000. Make this all happen within the loop. i.e. We are going to use the `system.time()` function to monitor how long it takes. This is pretty coarse, and a way better way of doing this is with the `microbenchmark` package (which is worth learning), but for now this function will do.


3.  Now do the same thing as in question 2, but *pre-allocate* memory for the vector `x` (or whatever you want to call it) to store the 10000 values *before* running the `for` loop. You don't need to do anything differently, just initialize a vector of length 10000. Which runs faster? Please explain why?

4. What is a very "`R`" to make this (question 2 and 3) even faster? (i.e. think about vectorization)

5. Why is this faster? So what is the lesson learned from this?

6 - Compare the following pieces of code for both what they are doing, and for speed (you can wrap this in `system.time()`). We are using the `rnorm` function which generates random numbers from a normal distribution (with a particular mean and standard deviation)

```{r}
n <- 10000 # just run this once, each time you change n

rm(x)
system.time(x <- replicate(n = n, expr = rnorm(1, mean = 5, sd = 5)))

```

**VS**

```{r}
rm(x)
x <- rep(NA, n)
system.time(for (i in 1:n) x[i] <- rnorm(1, mean = 5, sd = 5))
```

Repeat each of these 5 times, and write down how long they take. Now repeat this for `n = 100000` iterations. Repeat for `n = 1000000`

6. (continued) Is there a difference in speed? Can you explain why? What are these two pieces of code doing differently?


7. Can you think of a faster way to generate the random numbers? hint, look at the help for `rnorm`.

## Running a power analysis.

In statistics, we often want to check (before doing an experiment) how likely we will be able to detect "an effect". This is true whether we are doing RNAseq, or a simple t-test. A tests statistical power is a function of three things. First, the magnitude of the observed effect (i.e. does your experiment have a big effect on gene expression or really small subtle effects). Second the amount of variation within each treatment group (if what you measure is really noisy, it will make it more difficult to get accurate estimates) and of course sample size. So when statistically oriented researchers are testing new methods, or just planning experiments, they will often use simulations to perform power analyses.  These require repeating the same simulations over and over again and checking results against some expectation or some hypothetical threshold (i.e. statistical significance is something some researchers care about... although you should focus on magnitude of effect size and confidence intervals... but that is another conversation).

Here is an example of a very simple simulation, which generates a linear relationship between two variables (the dependent variable *x* and the response variable *y*), simulates values of *x* based on this relationship and the amount of variation (noise) and then fits a linear regression and extracts the p-value associated with the slope of the relationship.

```{r}
rm(list=ls())

# sample size, n
n <- 20

# intercept, a
a = 3

# slope, b
b = 0.3

# independent/explanatory variable
x <- rnorm( n = n, mean = 10, sd = 2)

# response/dependent variable
y <- rnorm(n = length(x), mean = a + b*x, sd = 1 )

plot(y ~ x, pch = 20, cex = 1.5)

# use the `lm` function to fit a linear model (including   a regression like here)
mod_1 <- lm(y ~ x)

summary(mod_1) # just to look at.

(p_val <- summary(mod_1)$coef[2,4])

```

Run the code above a few times just to convince yourself that you are generating "new data" each time, and as such the results from the regression are different (including the estimated slope, and associated stats like p-values.)

8. How would you take this idea and create a vector that stores the p_values from these simulations, and allows you to repeat the simulation 1000 times using a `for` loop? Then use this vector to find out what proportion of p-values are less than 0.05. You can finish this code up by using a histogram (with the `hist` function) to look at the distribution of p-values.


9. How would you do the same thing as in question 9 but by writing an explicit function and using `replicate` instead of a `for` loop? Why are the results from question 8 and 9 not the same?


You can use this to get more involved. Say you wanted to see what would happened if you varied the sample size from n = 10 to n = 100 or varied the slope from 0 to 0.5? How would you do this using both approaches from questions 9 and 10 (you may need to use nested for loops). I will show you in the answers.