---
title: "Statistics"
output:
  html_document:
    toc: true
    toc_float:
      collapsed: false
    number_sections: false
    toc_depth: 1
    #code_folding: hide
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(message=FALSE,warning=FALSE, cache=TRUE)
```

## Null Distributions


### 1) Sample from a normal distribution to create a null
```{r}
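# Draw 10 observations from a standard normal distribution
Stat <- rnorm(10,0,1)
hist(Stat)

# A much larger sample looks closer to the normal curve
hist(rnorm(1000000,0,1))

# Repeat the 10-observation draw 100 times (one sample per column)
replicate(100,rnorm(10,0,1))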
```

### 2) Get the mean differences
```{r}

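# Get the sample mean
mean(Stat)

# Use a for loop to collect 1,000 sample means
savemeans <- numeric(1000)
for(i in 1:1000){
  savemeans[i] <- mean(rnorm(10,0,1))
}

# Null distribution of the difference between two sample means
null <- replicate(10000,mean(rnorm(10,0,1)) - mean(rnorm(10,0,1)))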
```
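
As a quick check on the simulated means (a sketch added for illustration, not part of the original exercise), the standard deviation of many sample means should sit close to the standard error, sd/sqrt(n) = 1/sqrt(10). The names `sim_means` and `se_theory` and the seed are assumptions made for this example.

```{r}
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# 10,000 sample means, each from n = 10 standard normal observations
sim_means <- replicate(10000, mean(rnorm(10, 0, 1)))

# Theoretical standard error of the mean: sd / sqrt(n)
se_theory <- 1 / sqrt(10)

sd(sim_means)  # simulated spread of the sample means
se_theory      # analytic value for comparison
```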

### 3) Is it by chance? Alpha = .05
```{r}

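# Directional (one-tailed): 95% of the null differences fall below this value
Crit_val <- sort(null)
Crit_val[9500]

# Nondirectional (two-tailed): 95% of the absolute differences fall below this value
Absolute_cv <- sort(abs(null))
Absolute_cv[9500]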
```
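
The simulated critical values can be compared against an analytic answer. This is a hedged sketch: because both samples come from a normal distribution with known sd = 1, the difference between two means of n = 10 is itself normal with standard error sqrt(1/10 + 1/10), so `qnorm()` gives the cutoffs directly. The names `null_diff` and `se_diff` and the seed are assumptions made for this example.

```{r}
set.seed(2)  # arbitrary seed, only so the sketch is reproducible

# Null distribution of mean differences, regenerated so this chunk runs on its own
null_diff <- replicate(10000, mean(rnorm(10, 0, 1)) - mean(rnorm(10, 0, 1)))

# Standard error of the difference between two independent means (sd = 1, n = 10 each)
se_diff <- sqrt(1/10 + 1/10)

# Directional: simulated vs. analytic 95th percentile
quantile(null_diff, .95)
qnorm(.95, 0, se_diff)

# Nondirectional: simulated vs. analytic cutoff for the absolute difference
quantile(abs(null_diff), .95)
qnorm(.975, 0, se_diff)
```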

### T test

Show that the properties of a simulated t-distribution are the same as the properties of the analytic t-distribution. Assume df = 9.

For example, what are the probabilities of t(9) >= .5, 1, 1.5, 2, and 2.5? These p-values can be obtained using the `pt()` function (the upper-tail probability is `1 - pt(t, 9)`).

Create a simulated t-distribution for the null hypothesis with df = 9. Here, the model situation involves taking samples of size n = 10 from a normal distribution. The t-value is computed for each sample as (sample mean - 0)/SEM. The process is repeated 10,000 times, and each t-value is saved. The resulting 10,000 t-values are the simulated t-distribution.

Using the simulated t-distribution, find the probability of t(9) >= .5, 1, 1.5, 2, and 2.5.

Compare the two sets of probabilities to show that the difference is small. What happens to the difference if the simulation is repeated fewer times (e.g., 100) vs. more times (e.g., 100,000)?

```{r}
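# Simulated null t-distribution: 10,000 one-sample t-values, n = 10 (df = 9)
t_dist <- replicate(10000,t.test(rnorm(10,0,1))$statistic)

hist(t_dist)

# Proportion of simulated t-values at or above each cutoff
length(t_dist[t_dist>=.5])/10000
length(t_dist[t_dist>=1])/10000
length(t_dist[t_dist>=1.5])/10000
length(t_dist[t_dist>=2])/10000
length(t_dist[t_dist>=2.5])/10000


#With a loop
ts <- c(.5,1,1.5,2,2.5)
sim_t <- c()
for(i in ts){
  sim_t <- c(sim_t,length(t_dist[t_dist >= i])/10000)
}

# Simulated minus analytic upper-tail probabilities
sim_t-(1-pt(ts,9))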
```
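
The question about running the simulation fewer or more times can be explored with a sketch like the one below. It is one possible approach, not the original solution: the helper `sim_error()` and the names `ts_cut`, `analytic_p`, and `errors` are hypothetical, and the t-value is computed by hand as (sample mean - 0)/SEM rather than with `t.test()` so that 100,000 repetitions run quickly.

```{r}
set.seed(3)  # arbitrary seed, only so the sketch is reproducible

ts_cut <- c(.5, 1, 1.5, 2, 2.5)
analytic_p <- 1 - pt(ts_cut, 9)  # analytic upper-tail probabilities, df = 9

# Hypothetical helper: largest gap between simulated and analytic probabilities
sim_error <- function(n_sims) {
  # t = (sample mean - 0) / SEM for each sample of n = 10
  sim <- replicate(n_sims, {
    x <- rnorm(10, 0, 1)
    mean(x) / (sd(x) / sqrt(10))
  })
  sim_p <- sapply(ts_cut, function(cutoff) mean(sim >= cutoff))
  max(abs(sim_p - analytic_p))
}

# Largest discrepancy for 100, 10,000, and 100,000 simulated samples
errors <- sapply(c(100, 10000, 100000), sim_error)
errors
```

The discrepancy typically shrinks as the number of simulated samples grows, although any single run can deviate from that pattern by chance.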

### Correlation

Sample A and Sample B both have 10 observations randomly sampled from the same normal distribution with mean = 0 and sd = 1. The expected correlation between A and B is 0, because the two samples are drawn independently of each other.

1. Generate the distribution of correlations (Pearson r values) that could be obtained by chance (simulate 10,000 times).

2. Find the critical value such that an absolute correlation at least that large occurs less than 5% of the time by chance.

3. Find the critical value when the sample size is increased to 100.

```{r}
# Null distribution of r for n = 10, then the two-tailed 5% critical value
null <- replicate(10000,cor(rnorm(10,0,1),rnorm(10,0,1)))
sorted_null <- sort(abs(null))
sorted_null[9500]

# Same procedure with n = 100
null <- replicate(10000,cor(rnorm(100,0,1),rnorm(100,0,1)))
sorted_null <- sort(abs(null))
sorted_null[9500]
```
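
The simulated critical values can be checked against an analytic one. This is a sketch under the assumption that both samples are independent normal draws, as in the exercise: in that case t = r * sqrt(df) / sqrt(1 - r^2) follows a t-distribution with df = n - 2, and solving for r gives r = t / sqrt(t^2 + df). The helper `crit_r()` is hypothetical and not part of the original notes.

```{r}
# Hypothetical helper: two-tailed critical value of Pearson r for sample size n
crit_r <- function(n, alpha = .05) {
  df <- n - 2
  t_crit <- qt(1 - alpha / 2, df)  # two-tailed critical t
  t_crit / sqrt(t_crit^2 + df)     # convert critical t back to r
}

crit_r(10)   # compare with the simulated value for n = 10
crit_r(100)  # compare with the simulated value for n = 100
```

The analytic values come out at roughly 0.63 for n = 10 and 0.20 for n = 100, so the simulated critical values should land near them.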

## F-values

```{r}
A <- c(1,2,3,4)
B <- c(3,4,5,6)
C <- c(5,6,7,8)
```
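
For the three groups above, F can be computed by hand from the sums of squares. The following is a minimal sketch of that calculation; the groups are repeated so the chunk runs on its own, and the names `groups`, `ss_between`, `ss_within`, `F_value`, and `aov_F` are assumptions made for this example. The result is cross-checked against `aov()`.

```{r}
A <- c(1,2,3,4)
B <- c(3,4,5,6)
C <- c(5,6,7,8)
groups <- list(A = A, B = B, C = C)

grand_mean <- mean(unlist(groups))

# Between-groups SS: group size times squared distance of each group mean from the grand mean
ss_between <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))

# Within-groups SS: squared deviations of each score from its own group mean
ss_within <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))

df_between <- length(groups) - 1                      # 3 groups - 1 = 2
df_within  <- length(unlist(groups)) - length(groups) # 12 scores - 3 groups = 9

# F is the ratio of the two mean squares
F_value <- (ss_between / df_between) / (ss_within / df_within)
F_value

# Cross-check with aov()
anova_df <- data.frame(con = rep(c("A","B","C"), each = 4), DV = c(A, B, C))
aov_F <- summary(aov(DV ~ con, anova_df))[[1]]$`F value`[1]
aov_F
```

Here the between-groups sum of squares is 32 and the within-groups sum of squares is 15, so F = (32/2)/(15/9) = 9.6.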

## F simulation

Assume that there are three groups of different subjects. Each group has four subjects. The subject means for all subjects are sampled randomly from a normal distribution with mean = 0 and sd = 1.

1. Run a simulation 10,000 times to construct the simulated F-distribution. On each run, sample new numbers into the three groups, then compute F and save it.

2. Using the simulated F-distribution, what is the critical value of F with alpha set at .05?

3. Compare your answer from above to the answer obtained using `qf`, which computes the critical value directly.

```{r}
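# Simulate one ANOVA under the null and return its F-value
run_anova <- function(){
  # Three groups of four subjects, all drawn from the same normal distribution
  A <- rnorm(4,0,1)
  B <- rnorm(4,0,1)
  C <- rnorm(4,0,1)

  con <- rep(c("A","B","C"),each=4)
  DV <- c(A,B,C)
  df <- data.frame(con,DV)
  sum_out <- summary(aov(DV~con,df))
  return(sum_out[[1]]$`F value`[1])
}

# 10,000 simulated F-values, so that index 9500 marks the 95th percentile
save_fs <- replicate(10000,run_anova())
hist(save_fs)

# Simulated critical value for alpha = .05
sort(save_fs)[9500]

# Analytic critical value: df between = 2, df within = 9
qf(.95,2,9)
```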