---
title: "Advanced Statistical Inference. R labs"
author: "Alex Sanchez-Pla"
output:
  html_document:
    code_folding: show
    toc: yes
    toc_float:
      toc_collapsed: yes
    toc_depth: 3
    theme: cosmo
    highlight: textmate
    number_sections: yes
editor_options:
  chunk_output_type: console
  markdown:
    wrap: 72
# theme args should be one of: "default", "cerulean", "journal", "flatly", "darkly", "readable", "spacelab", "united", "cosmo", "lumen", "paper", "sandstone", "simplex", "yeti"
---
This notebook contains code and examples to extend or clarify some
topics from the "Advanced Statistical Inference" course.
# Using simulation to estimate properties of statistics
Simulation is often used to study sampling properties of estimators.
The example below compares three different estimators for the mean of a
distribution based on an iid sample of size n:
- sample mean
- 20% trimmed mean (may work better with heavy tails...)
- median (may not work well for asymmetric distributions...)
```{r}
# function to view the first k lines of a data frame
view <- function(dat, k){
  message <- paste("First", k, "rows")
  krows <- dat[1:k, ]
  cat(message, "\n", "\n")
  print(krows)
}

# function to calculate summary statistics across the 1000 data sets
simsum <- function(dat, trueval){
  S <- nrow(dat)
  MCmean <- apply(dat, 2, mean)
  MCbias <- MCmean - trueval
  MCrelbias <- MCbias / trueval
  MCstddev <- sqrt(apply(dat, 2, var))
  MCMSE <- apply((dat - trueval)^2, 2, mean)
  # MCMSE <- MCbias^2 + MCstddev^2  # alternative lazy calculation
  MCRE <- MCMSE[1] / MCMSE
  sumdat <- rbind(rep(trueval, 3), S, MCmean, MCbias, MCrelbias,
                  MCstddev, MCMSE, MCRE)
  names <- c("true value", "# sims", "MC mean", "MC bias", "MC relative bias",
             "MC standard deviation", "MC MSE", "MC relative efficiency")
  ests <- c("Sample mean", "Trimmed mean", "Median")
  dimnames(sumdat) <- list(names, ests)
  round(sumdat, 5)
}

# function to generate S data sets of size n from a normal
# distribution with mean mu and variance sigma^2
generate.normal <- function(S, n, mu, sigma){
  dat <- matrix(rnorm(n * S, mu, sigma), ncol = n, byrow = TRUE)
  # Note: for this very simple data generation, we can get the data
  # in one step, which requires no looping. In more complex
  # statistical models, looping is often required to set up each
  # data set, because the scenario is much more complicated. Here is
  # a loop to get the same data as above; try running the program and
  # see how much longer it takes!
  # dat <- NULL
  # for (i in 1:S){
  #   Y <- rnorm(n, mu, sigma)
  #   dat <- rbind(dat, Y)
  # }
  out <- list(dat = dat)
  return(out)
}

# function to generate S data sets of size n from a gamma
# distribution with mean mu and variance sigma^2 mu^2
generate.gamma <- function(S, n, mu, sigma){
  a <- 1 / (sigma^2)
  s <- mu / a
  dat <- matrix(rgamma(n * S, shape = a, scale = s), ncol = n, byrow = TRUE)
  # Alternative loop
  # dat <- NULL
  # for (i in 1:S){
  #   Y <- rgamma(n, shape = a, scale = s)
  #   dat <- rbind(dat, Y)
  # }
  out <- list(dat = dat)
  return(out)
}

# function to generate S data sets of size n from a t distribution
# with df degrees of freedom centered at the value mu (a t distribution
# has mean 0 and variance df/(df-2) for df > 2)
generate.t <- function(S, n, mu, df){
  dat <- matrix(mu + rt(n * S, df), ncol = n, byrow = TRUE)
  # Alternative loop
  # dat <- NULL
  # for (i in 1:S){
  #   Y <- mu + rt(n, df)
  #   dat <- rbind(dat, Y)
  # }
  out <- list(dat = dat)
  return(out)
}

# function to compute the 20% trimmed mean
trimmean <- function(Y){ mean(Y, trim = 0.2) }
```
A possibly "fair" comparison would be to generate data from each of
these distributions with the same mean and variance and see how the
three methods perform on a relative basis under each condition.
```{r}
# set the seed for the simulation
set.seed(3)
# set number of simulated data sets and sample size
S <- 1000
n <- 15
# generate data --Distribution choices are normal with mu,sigma
# (rnorm), gamma (rgamma) and student t (rt)
mu <- 1
sigma <- sqrt(5/3)
# out <- generate.normal(S,n,mu,sigma) # generate normal samples
# out <- generate.gamma(S,n,mu,sigma) # generate gamma samples
out <- generate.t(S,n,mu,5) # generate t_5 samples
# get the sample means from each data set
outsampmean <- apply(out$dat,1,mean)
# get the sample trimmed mean from each data set
outtrimmean <- apply(out$dat,1,trimmean)
# get the sample median from each data set
outmedian <- apply(out$dat,1,median)
# now we can summarize -- remember, mu is the true value of the mean
# for the true distribution of the data
# get the 1000 estimates for each method
summary.sim <- data.frame(mean=outsampmean,trim=outtrimmean,
median=outmedian)
# print out the first 5 of them (round to 4 decimal places)
view(round(summary.sim,4),5)
# get summary statistics for each estimator
results <- simsum(summary.sim,mu)
results
#############################################################
# Some additional calculation for the sample mean
# average of estimated standard errors for sample mean
# usual standard error for sample mean from each data set
sampmean.ses <- sqrt(apply(out$dat,1,var)/n)
# take the average
ave.sampmeanses <- mean(sampmean.ses)
# coverage of usual confidence interval based on sample mean
t05 <- qt(0.975,n-1)
coverage <- sum((outsampmean-t05*sampmean.ses <= mu) &
(outsampmean+t05*sampmean.ses >= mu))/S
coverage
```
# Investigating Convergence Concepts
## The law of large numbers
The LLN can be illustrated in the context of a sequence of iid Bernoulli
random variables.
We fix the parameter at $p=0.3$. We then consider a varying number $n$ of
such random variables and the corresponding sums, which we know are
binomial with parameters $(n,p)$. The plot below draws, for each $n$, the
10th and the 90th percentile of $Bin(n,p)/n$, the running proportion.
```{r}
p = 0.3
n = 20*(1:50)
L = qbinom(0.10, n, p)/n
U = qbinom(0.90, n, p)/n
plot(n, rep(p, length(n)), type = "n", ylim = c(min(L), max(U)), xlab="n", ylab = "")
segments(n, L, n, U, lwd = 2, col = "black")
abline(h = p, lty = 2)
```
- What does this sequence converge to?
- Does the plot suggest such convergence? Change it to a logarithmic
  scale.
```{r}
n = 2^(10:20)
L = qbinom(0.10, n, p)/n
U = qbinom(0.90, n, p)/n
plot(n, rep(p, length(n)), log = "x", type = "n", ylim = c(min(L), max(U)), xlab="n", ylab = "")
segments(n, L, n, U, lwd = 2, col = "black")
abline(h = p, lty = 2)
```
## The Central limit theorem
The Law of Large Numbers provides a first-order limit, which is
deterministic: The sequence of partial means converges to the mean of
the underlying distribution generating the random variables.
The Central Limit Theorem provides a second-order limit, which does not
exist in a deterministic sense, but does in the distributional sense:
the sequence of partial means, when standardized, converges in
distribution to the standard normal distribution. (This assumes that the
underlying distribution has a finite second moment.)
To see this, we go back to the binomial example above, but now we track
several quantiles of the successive (standardized) binomial
distributions.
```{r}
p = 0.3
n = 20*(1:50)
Q = matrix(NA, 50, 9) # each row stores the quantiles of standardized binomial distribution
for (k in 1:50){
  Q[k, ] = (qbinom((1:9)/10, n[k], p) - n[k]*p)/sqrt(n[k]*p*(1-p))
}
matplot(n, Q, type = "l", lty = 1, lwd = 2, col = 4, xlab = "n", ylab = "quantiles")
abline(h = qnorm((1:9)/10), lty = 2) # quantiles of the standard normal distribution
```
### The CLT and sampling distribution of the mean
A typical illustration of the CLT for the sample mean.
```{r}
datos <- list()
datos[[1]] <- matrix(runif(10000), nrow = 1000)
datos[[2]] <- matrix(rexp(10000), nrow = 1000)
datos[[3]] <- matrix(rnorm(10000), nrow = 1000)
names(datos) <- c("Uniform", "Exponential", "Gaussian")
opt <- par(mfrow = c(3, 3))
for (i in 1:3){
  x <- datos[[i]]
  m1 <- x[, 1]
  m2 <- apply(x[, 1:2], 1, mean)
  m3 <- apply(x, 1, mean)
  hist(m1, 20, main = paste(names(datos)[i], "n=1", sep = "\n"), cex.main = 0.8)
  hist(m2, 20, main = paste(names(datos)[i], "n=2 (mean)", sep = "\n"), cex.main = 0.8)
  hist(m3, 20, main = paste(names(datos)[i], "n=10 (mean)", sep = "\n"), cex.main = 0.8)
}
par(opt)
```
Strictly speaking, we are not illustrating the CLT here. What should we
do for a better illustration?
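One possible answer (a sketch, not from the original notes): the CLT is a statement about the *standardized* sample mean, so we should subtract the true mean, rescale by $\sigma/\sqrt{n}$, and compare the resulting histogram with the $N(0,1)$ density. Here this is done for exponential(1) samples (mean 1, sd 1); the sample size of 10 and the number of replicates are arbitrary choices.

```r
# Sketch: standardized sample means of exponential(1) data vs N(0,1)
set.seed(1)
n <- 10                                      # sample size (our choice)
S <- 1000                                    # number of simulated samples
x <- matrix(rexp(n * S), nrow = S)           # S samples of size n
z <- sqrt(n) * (apply(x, 1, mean) - 1) / 1   # standardized sample means
hist(z, breaks = 30, freq = FALSE,
     main = "Standardized means vs N(0,1)")
curve(dnorm(x), add = TRUE, lwd = 2)         # overlay the limit density
```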
## On the difference between Convergence in Probability and Almost sure convergence
This is a simple analogy, found on Quora, that captures some aspects of
the difference:

- Suppose we have a committee with 100 members.
- Imagine there is one meeting of the committee every day throughout the
  year, and that we want to know about attendance on the committee.
- Almost sure convergence is a bit like asking whether almost all
  members had perfect attendance.
- Convergence in probability is a bit like asking whether all meetings
  were almost full.
- If almost all members have perfect attendance, then each meeting must
  be almost full (convergence almost surely implies convergence in
  probability).
- But if all meetings were nearly full, it isn't necessary that any
  member has perfect attendance (e.g., each member might have been
  absent on one distinct day). Convergence in probability does not imply
  convergence almost surely.
Convergence almost surely means that $X_n(w)$ gets close to $X(w)$, as
$n$ increases, for almost every element $w$ of the sample space. It is a
statement about the convergence of $X_n(w)$ for individual $w$'s.

Convergence in probability means that the set of elements $w$ of the
sample space for which $X_n(w)$ is close to $X(w)$ has probability
approaching 1. It is a statement about the size of the set of $w$'s
satisfying the closeness property, not about any one $w$.
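To make the distinction concrete, here is a minimal R sketch of a standard counterexample (the "sliding bump", or typewriter, sequence — not part of the original notes): for $n = 2^m + k$, $X_n$ is the indicator of an interval of length $2^{-m}$ that sweeps across $[0,1)$. The interval shrinks, so $X_n \to 0$ in probability; but every point $w$ is hit once per block, i.e. infinitely often, so $X_n(w)$ does not converge for any $w$.

```r
# Sliding-bump sequence: converges to 0 in probability, but not a.s.
X <- function(n, w){
  m <- floor(log2(n))   # block index: n = 2^m + k
  k <- n - 2^m          # position of the bump inside block m
  as.numeric(w >= k / 2^m & w < (k + 1) / 2^m)
}
w <- 0.3                              # one fixed sample point
sum(sapply(1:200, X, w = w))          # w keeps being hit: no a.s. convergence
# fraction of (n, random w) pairs in the bump: small, and shrinking with n
mean(sapply(1:10000, function(n) X(n, runif(1))))
```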
# The empirical distribution function
## Building the empirical distribution function
- The empirical distribution function can be intuitively described
  through the following algorithm:
  - Take a sample $\mathbf{x_0}$ and consider it "fixed".
  - Sort it.
  - Consider values $x$ that go from $-\infty$ to $\infty$ or, simply,
    from below the smallest value (e.g. $\min(\mathbf{x_0})-1$) to
    above the largest value (e.g. $\max(\mathbf{x_0})+1$).
  - Be sure to increase $x$ in such a way that every value in the
    sample will be "hit" by it.
  - Build two vectors as follows:
    - $x_0=\min(\mathbf{x_0})-1$
    - $y_0 = 0$
    - While $x_i < \max(\mathbf{x_0})+1$ do
      - $x_{i+1} = x_i+\Delta$
      - if $x_{i+1} \in \mathbf{x_0}^{sorted}$ then
        $y_{i+1}=y_i+\frac{1}{n}$, else $y_{i+1}=y_i$
  - The values of $y$ form an approximation to the empirical cdf of
    $\mathbf{x_0}$.
Implement an R function that generates the empirical distribution
function given a sample.
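A minimal sketch of such a function (one possible answer to the exercise): instead of literally incrementing by $1/n$ while stepping the grid, it uses the equivalent counting form $F_n(x) = \#\{x_i \le x\}/n$ at each grid point. The function name and `delta` default are our choices.

```r
# Sketch: manual empirical cdf evaluated on a fine grid
my_ecdf <- function(x0, delta = 0.01) {
  xs <- sort(x0)
  n <- length(xs)
  grid <- seq(min(xs) - 1, max(xs) + 1, by = delta)
  # proportion of sample values at or below each grid point
  y <- vapply(grid, function(x) sum(xs <= x) / n, numeric(1))
  list(x = grid, y = y)
}
res <- my_ecdf(c(0.4, -0.6, 1.8, 0.5))
plot(res$x, res$y, type = "s", xlab = "x", ylab = "Fn(x)")
```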
R has a built-in function, `ecdf`, that computes the empirical cdf.
```{r}
x0<-c(0.4,-0.6, 1.8,0.5,-2.0, 0.7, -0.5, -1.1, -0.2, -0.7)
Fn <- ecdf(x0)
round(Fn(sort(x0)), 2)
knots(Fn)
Fn(knots(Fn))
```
## Plotting the empirical cdf with R
Functions `ecdf` and `plot.ecdf` facilitate the visualization of the
empirical cdf.
```{r}
x0<-c(0.4,-0.6, 1.8,0.5,-2.0, 0.7, -0.5, -1.1, -0.2, -0.7)
Fn <- ecdf(x0)
plot(Fn)
```
```{r}
op <- par(mfrow=c(3,1), mgp=c(1.5, 0.8,0), mar= .1+c(3,3,2,1))
set.seed(123)
x0 <- round(rnorm(12),3)
x0
Fn12 <- ecdf(x0)
summary(Fn12)
plot(Fn12)
plot(Fn12, verticals= TRUE, do.points = FALSE)
plot(Fn12 , lwd = 2) ; mtext("lwd = 2", adj=1)
xx <- unique(sort(c(seq(-3, 2, length=201), knots(Fn12))))
lines(xx, Fn12(xx), col='blue')
abline(v=knots(Fn12),lty=2,col='gray70')
plot(xx, Fn12(xx), type='o', cex=.1)#- plot.default {ugly}
plot(Fn12, col.hor='red', add= TRUE) #- plot method
abline(v=knots(Fn12),lty=2,col='gray70')
## luxury plot
plot(Fn12, verticals=TRUE, col.points='blue',
col.hor='red', col.vert='bisque')
##-- this works too (automatic call to ecdf(.)):
plot.ecdf(rnorm(24))
title("via simple plot.ecdf(x)", adj=1)
par(op)
```
## Showing that $F_n()$ approximates $F()$
For the normal distribution
```{r}
op<-par(mfrow=c(2,2))
for(n in c(10, 20, 50, 100)){
x<- sort(rnorm(n))
plot.ecdf(x, main=paste ("Empirical cdf \n (Samples from N(0,1)), n=", n), cex.main=0.7)
tcdf<- pnorm(x)
lines(x,tcdf)
}
par(op)
```
Or the uniform distribution
```{r}
op<-par(mfrow=c(2,2))
for(n in c(10, 20, 50, 100)){
x<- sort(runif(n))
plot.ecdf(x, main=paste ("Empirical cdf (Samples from U(0,1)), n=", n))
tcdf<- punif(x)
lines(x,tcdf)
}
```
# The bootstrap
## Basic resampling (i): Standard error of the mean
```{r}
# Assign the data to a vector
x <- c(94, 197, 16, 38, 99, 141, 23)
# Compute the mean and its standard error
theta <- mean(x)
stderrtheta <- sqrt(var(x) / length(x))
# Display the mean and its standard error
cat("theta = ", theta, "\n")
cat("Standard error of theta = ", stderrtheta, "\n")
# Prepare the resampling
# Number of resamples
nboot <- 100
# Create a vector to store the resampled statistics
distboot <- rep(NA, nboot)
# Resampling
for (b in 1:nboot){
  x.boot <- sample(x, replace = TRUE)
  distboot[b] <- mean(x.boot)
}
hist(distboot)
# Bootstrap standard error
stderrboot <- sqrt(var(distboot))
# Display the bootstrap standard error
cat("Bootstrap standard error = ", stderrboot, "\n")
# Bootstrap distribution
results <- summary(distboot)
print(results)
```
```{r}
bootstrapsSeMean <- function(x, B){
  res <- numeric(B)
  for (i in 1:B)
    res[i] <- mean(sample(x, replace = TRUE))
  sd(res)
}
x <- c(94, 197, 16, 38, 99, 141, 23)
bootstrapsSeMean(x, 100)
```
## Basic resampling (ii): Standard error of the median
```{r}
bootstrapSeMedian <- function(x, B){
  res <- numeric(B)
  for (i in 1:B)
    res[i] <- median(sample(x, replace = TRUE))
  sd(res)
}
bootstrapSeMedian(x, 1000)
```
## The bootstrap distribution
The histogram of all estimations computed on all the resamples is an
approximation to the *bootstrap* distribution of the statistic.
```{r}
if (!require(bootstrap)) install.packages("bootstrap", dep=TRUE)
library(bootstrap)
data('law')
set.seed(1)
theta <- function(ind) {
  cor(law[ind, 1], law[ind, 2], method = "pearson")
}
theta0 <- cor(law$LSAT, law$GPA)
# for(b in 1:5){
# indiv<- sample(1:15, replace=TRUE)
# show(t(resample <- law[indiv,]))
# show(cor(resample)[1,2])
# }
law.boot <- bootstrap(1:15, 1000, theta)
summary(law.boot)
```
An approximation to the bootstrap distribution.
```{r}
hist(law.boot$thetastar)
```
## Bootstrap confidence intervals
There are many ways to use the bootstrap to compute confidence
intervals. The simplest one assumes normality of the estimator and uses
the bootstrap standard error to build normal confidence intervals:
```{r}
ci1 <- qnorm(p=c(.025, .975),
mean=theta0,
sd=sd(law.boot$thetastar))
ci1
```
The *bootstrap percentile* method relies on the quantiles of the
bootstrap distribution:
```{r}
quantile(law.boot$thetastar, c(0.025, 0.975))
```
For improved interval estimators it is preferable to use built-in
functions such as those available in the packages `boot` or `bootstrap`.
## Resampling with the `bootstrap` package
Programming the resampling process from scratch can be useful for a
better understanding, but it may be slower and more complex (depending
on the statistic to be bootstrapped), and often less efficient.

Resampling packages make the process easier to run and, although they
are more of the "black box" type, they should be the option of choice
for many problems.
Information on how to use some resampling packages can be found here:

- `boot`
  - [Package site](https://cran.r-project.org/web/packages/boot/index.html)
  - [A tutorial](https://www.datacamp.com/community/tutorials/bootstrap-r)
  - [Another tutorial with references](https://www.statmethods.net/advstats/bootstrapping.html)
- `resample`
  - [Package site](https://rdrr.io/cran/resample/)
  - [Topics](https://www.rdocumentation.org/packages/resample/versions/0.4/topics)
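As a quick sketch of the `boot` interface: the statistic function must accept the data and a vector of resampled indices. The data here are the mouse-survival values used in the earlier standard-error example; the helper name `meanFun` is our choice.

```r
# Sketch: bootstrap SE of the mean with the (recommended) boot package
library(boot)
x <- c(94, 197, 16, 38, 99, 141, 23)
meanFun <- function(d, i) mean(d[i])   # d = data, i = resampled indices
set.seed(1)
b <- boot(data = x, statistic = meanFun, R = 1000)
b           # prints the original estimate, bias and std. error
sd(b$t)     # the bootstrap standard error of the mean
```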
## Example: Bootstraping a multivariate data set (Exercise U-2-I-9)
File `scores5.dat` contains the scores in 5 topics of 88 U.B. students.
1.  This data matrix has a $5 \times 5$ variance-covariance matrix with
    positive eigenvalues
    $\lambda_{1} > \lambda_{2} > ... > \lambda_{5}$. Let $\theta$ be:

$$\theta= \dfrac{\lambda_{1}} {\sum_{i=1}^{5} \lambda_{i}}.$$

$\theta$ represents the proportion of variability explained by the
first principal component.

Let $\hat{\lambda}_{1} > \hat{\lambda}_{2} > ... > \hat{\lambda}_{5}$ be
the eigenvalues of $\widehat{\Sigma}$, the MLE of $\Sigma$. From here,
by the functional invariance principle of the MLE, one has:

$$\hat{\theta}= \dfrac{\hat{\lambda}_{1}} {\sum_{i=1}^{5} \hat{\lambda}_{i}}$$
### Describe how one can resample the data matrix to obtain an approximation of the bootstrap distribution of $\hat{\theta}$. Using the process derived above, compute a bootstrap estimate of the standard error of $\hat{\theta}$.
This is a scenario in which each observation is a multivariate random
vector with five components, corresponding to a row of the data matrix.

Resampling from the empirical distribution means resampling the data
matrix **by rows**, in order to keep each original observation intact.

A bootstrap sample of the data matrix $X$ is a matrix $X_B$ with the
same dimensions as $X$ whose rows are randomly sampled with replacement
from the rows of $X$.
In this case, we proceed as follows. Repeat $B = 1000$ times:

1.  Create a resampled data set by sampling *whole* rows with
    replacement (as many rows as the original matrix has).
2.  Compute the covariance matrix of the resampled data set, then its
    eigenvalues $\lambda_{i}^b$, $i=1,...,5$, and then
    $\hat{\theta}^b= \dfrac{\lambda_{1}^b} {\sum_{i} \lambda_{i}^b}$.

Finally, compute the standard error of $\hat{\theta}$ from the $B$
bootstrap values.
```{r}
datamatrix<-read.delim("scores5.dat")
covmatrix<-cov(datamatrix)
ev<-eigen(covmatrix)
evalues<-ev$values
theta <- evalues[1]/sum(evalues); theta
```
```{r}
set.seed(2)
n <- 1000
distboot <- rep(NA, n)
for (i in 1:n){
  bootindexes <- sample(nrow(datamatrix), replace = TRUE)
  bootmatrix <- datamatrix[bootindexes, ]
  covmatrix1 <- cov(bootmatrix)
  ev1 <- eigen(covmatrix1)
  evalues1 <- ev1$values
  distboot[i] <- evalues1[1] / sum(evalues1)
}
stderrboot <- sqrt(var(distboot)); stderrboot
```
### Provide a bias corrected estimate of $\theta$.

The bias of an estimator is defined as
$$
b(\hat \theta) = E(\hat \theta) - \theta.
$$
Bias can be estimated using the bootstrap as follows. Thinking of the
equivalence of "roles" between the real world and the bootstrap world,
we can set:

- The role of $\theta$ is played by $\hat \theta$.
- The role of $E(\hat \theta)$ is played by
  $\widehat{E_B(\hat \theta)}$.
- The bootstrap bias estimate is then
  $$
  b_B(\hat \theta) = \widehat{E_B(\hat \theta)} - \hat \theta.
  $$
And the bias corrected estimate of $\theta$ is
$$
\hat \theta_{BC} = \hat \theta -b_B(\hat \theta) =
2 \hat \theta- \widehat{E_B(\hat \theta)}.
$$
Finally, recall that although these should be exact bootstrap estimates,
they can be approximated using Monte Carlo resampling and taking the
mean of the resampled values:
$$
\widehat{E_B(\hat \theta)} \simeq \frac 1B \sum_{b=1}^B \hat \theta^b
$$
In our example we would simply do:
```{r}
Etheta <- mean(distboot);Etheta
theta <- evalues[1]/sum(evalues);theta
btheta<- Etheta - theta; btheta
thetaCB <- 2 *theta - Etheta; thetaCB
```
Notice that **all results are very similar**. That is, in spite of how
"reasonable" the approach may seem, we may need some theoretical results
or a simulation analysis to decide to what extent the bias adjustment
improves the estimator.
# Hypothesis tests, power function and sample size
## Power function
Given a test of simple hypotheses $H_0: \theta=\theta_0$, vs
$H_1:\, \theta=\theta_1$ about a parameter $\theta$, with *critical
(rejection) region* $W$,
- The power of the test is the probability of correctly rejecting the
  null hypothesis, that is, the probability of finding a sample in the
  critical region under the alternative hypothesis:
  $$
  \beta = \mbox{Power} = P\{X\in W \mid H_1\}.
  $$
- If $H_1$ is compound, and there is a range of possible values for
  $\theta$, the power function gives the power of the test as a
  function of $\theta$: $\mbox{Power}=\beta(\theta)$.
For example: given a sample of a binomial distribution with $n=10$,
which we want to use to test the hypotheses $p=0.2$ vs $p>0.2$, the
critical region with size $\alpha < 0.05$ is
$W=\{ x \in \{5, 6, 7, 8, 9, 10\}\}$ because:
```{r}
probsacum <- pbinom(0:10, size=10, prob=0.2)
names(probsacum)<-0:10
print(round(probsacum,3))
```
The power function is then $\beta(p)=P_p(W)=P_p(X \geq 5)$, which can be
plotted with:
```{r}
probabs <- seq(0, 1, by = 0.01)
potenc <- 1 - pbinom(4, size = 10, prob = probabs)
plot(probabs, potenc, main = "Power function")
abline(v = 0.2); abline(h = 1 - pbinom(4, size = 10, prob = 0.2))
```
## The power of a test
A simple way to define the power of a test is the probability of the
critical region under the alternative hypothesis.

In simple cases it is possible to calculate this power analytically or
using R functions such as `power.t.test` and related ones.
For example, suppose we wish to study a drug that lowers blood pressure.
We have 50 people in a control group and 50 in a treated group and we
expect the pressure to have a mean value of 150 (mmHg) and a deviation
of 15 mmHg in the control group, and to be *at least 10mmHg lower* in
the treated group.
With this information we can now calculate the power function for a
sample size (n) and a given effect size or minimum difference ("delta").
```{r}
power.t.test(n=50,
delta=10,
sd=15,
sig.level=0.01,
power=NULL,
type="two.sample",
alternative="one.sided")
```
## Sample size
Given that power and sample size are usually inversely related, the same
function can be used to find the sample size required to reach a certain
power ("power"), given the values of the effect size ("delta"), standard
deviation ("sd") and significance level ("sig.level").
To compute $n$ here, set this parameter to "NULL" and assign values to
all the remaining parameters.
```{r}
power.t.test(n=NULL,
delta=10,
sd=15,
sig.level=0.01,
power=0.95,
type="two.sample",
alternative="one.sided")
```
Imagine now (Exercise 1.2.4) you are required to find the sample size
needed to have a 95% chance of rejecting $H_0:\, \mu = 0$ vs
$H_1:\, \mu >0$ at $\alpha = 0.01$ if the true change is a 1.5 sec.
decrease in time.

Assuming the standard deviation is 3, and has been estimated, Chihara
and Hesterberg (Mathematical Statistics with Resampling and R) provide
the following explanation:
```{r}
power.t.test(n=NULL,
delta=1.5,
sd=3,
sig.level=0.01,
power=0.95,
type="one.sample",
alternative="one.sided")
```
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics("images/sampleSize1a.png")
```
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics("images/sampleSize1b.png")
```
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics("images/sampleSize1c.png")
```
# Introducing permutation tests
One of the most common problems in statistical inference is not being
able to verify some assumptions of the assumed model for the data, such
as normality.

In these cases, it is usual to consider alternatives such as
nonparametric tests (often criticized as having little power) or
permutation tests.

The logic of a permutation test, which we introduce in a simplified way
for the case of a two-group comparison problem, is as follows:
- Suppose we have obtained two samples, of $n_1$ observations from a first group and $n_2$ observations from a second group, respectively.
- To estimate the p-value of a test of the hypotheses $H_0:\, \mu_1 = \mu_2$ versus $H_1:\, \mu_1 \neq \mu_2$, we must know the probability distribution, under the null hypothesis, of the test statistic, which can be the same as that of the two-sample t-test.
- Now, we need to realize that, if the null hypothesis were true, the labels we have assigned to individuals as belonging to group 1 and group 2 _have no reason to be_ since all would belong to the same group and would have been left by chance in the sample that we have considered of the first group or that we have considered of the second.
- One way to evaluate to what extent this has been so is the following:
1. Consider all possible permutations of the two samples
2. On each permutation, the separation between group 1 and group 2 will be performed (for example, the first $n_1$ elements of the permuted sample are assigned to group 1 and the remainder to group 2).
3. After reallocation, the value of the test statistic is calculated.
The values of the test statistic calculated on all permutations constitute an approximation to *the distribution of the test statistic under the null hypothesis (that is: there are not two groups but one)*,
so the p-value of the test can be calculated as:
$$
\begin{align*}
p & = P_{H_0} \left[ T \left(\mathbf{X_1}, \mathbf{X_2} \right) \geq T \left(\mathbf{x_1^0}, \mathbf{x_2^0} \right) \right] \\
& \simeq \frac{\# \left\{ T \left(\mathbf{X_1}, \mathbf{X_2}\right) \geq T \left(\mathbf{x_1^0}, \mathbf{x_2^0}\right) \right\}}{\mbox{Number of permutations}}
\end{align*}
$$
_In practice, the number of permutations is usually very large, so the above process is carried out in an approximate way, taking a smaller number of random permutations_.
The figure below illustrates the procedure used in a permutations test
to compare two samples.
```{r, echo=FALSE, out.width="100%"}
knitr::include_graphics("images/permutationtests.png")
```
## Example
We want to compare the mean survival time in weeks of two groups of
mice affected by liver disease that have been treated with a drug ($z$)
or a placebo ($y$).
The values obtained have been:
```{=tex}
\begin{align*}
z & = & 94, 197, 16, 38, 99, 141, 23 \\
y & = & 52, 104, 146, 10, 51, 30, 40, 27, 46.
\end{align*}
```
- Implement a permutation test as described above to compare the means
of the two groups, using a two-sample t-statistic as a test
statistic.
- Apply the implemented test to compare the obtained values. Base the
estimate of the p-value on 10,000 random permutations.
- In case you don't want to use a permutation test, what would be an
alternative reasonable approach?
- Compare the results obtained with the alternative suggested with
those of the permutation tests. Which one do you think is more
appropriate in this case? What are the advantages or disadvantages
of each of them?
### Building a permutations distribution
```{r}
tr <- c( 94, 197, 16, 38, 99, 141, 23)
y <- c( 52, 104, 146, 10, 51, 30, 40, 27, 46)
testInSample<- mean(tr) - mean(y)
```
The difference in means is `r testInSample`.
To obtain a single permutation of the data, we combine the data and then
sample without replacement two "mock" tr and y vectors.
```{r}
testFun <- function(x1, x2){
  mean(x2) - mean(x1)
}
nt<- length(tr)
ny<- length(y)
N<- nt+ny
s <- sample(c(tr, y), N, FALSE)
trp <- s[1:nt]
yp <- s[(nt+1):N]
testFun(yp, trp)
```
If we repeat this process a large number of times, we can build our
approximate permutation distribution (i.e., the sampling distribution
for the mean-difference).
The result will be a vector of the differences from each permutation
(i.e., our distribution):
```{r}
NPerm <- 2000
testValues <- numeric(NPerm)
for (i in 1:NPerm){
  permutedS <- sample(s, N, FALSE)
  trp <- permutedS[1:nt]
  yp <- permutedS[(nt+1):N]
  testValues[i] <- testFun(yp, trp)
}
summary(testValues)
```
We can look at our distribution using hist and draw a vertical line for
our observed difference:
```{r}
par(mfrow=c(1,1))
hist(testValues)
abline(v = testInSample, col = "blue", lwd = 2)
```
A rough estimate of the permutation p-value, based on the formula above, is:
```{r}
sum(testValues >= testInSample)/length(testValues)
```
That is, in this example **we would not reject $H_0$ based on the results of the permutation test performed**.
### Relying on pre-built packages
We don't always need to build our own permutation distributions (though
it is good to know how to do it). R provides a package to conduct
permutation tests called coin. We can compare our p-value (and
associated inference) from above with the result from `coin`.
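A sketch of how that comparison might look, assuming the `coin` package is installed: `oneway_test` with a `distribution = approximate(...)` argument runs an approximate (Monte Carlo) permutation test; note that the argument name for the number of resamples has changed across `coin` versions, and the two-sided default here differs from the one-sided count we computed by hand.

```r
# Sketch: approximate permutation test of the mouse data with coin
library(coin)
tr <- c(94, 197, 16, 38, 99, 141, 23)
y  <- c(52, 104, 146, 10, 51, 30, 40, 27, 46)
datos <- data.frame(weeks = c(tr, y),
                    group = factor(rep(c("treated", "placebo"), c(7, 9))))
set.seed(1)
res <- oneway_test(weeks ~ group, data = datos,
                   distribution = approximate(nresample = 10000))
pvalue(res)   # compare with the hand-built permutation p-value
```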
# The delta method

This example is taken directly from the practical part of the
excellent tutorial on the delta method available at: [Delta Method in
Epidemiology: An Applied and Reproducible
Tutorial](https://migariane.github.io/DeltaMethodEpiTutorial.nb.html)

To illustrate the use of the delta method, the authors simulate data
based on a cancer epidemiology example.

The goal is to estimate the effect of the presence of comorbidities
(binary indicator) on one-year cancer mortality in cancer patients,
controlling for the confounding effect of age, in a cohort of 1,000
middle-aged individuals.
It is assumed to be an extremely lethal type of cancer (e.g.,
pancreatic cancer), so a high one-year mortality rate can be expected.

- Age in years was generated as a normal random variable with mean 65
  years and standard deviation of 5 years.
- Comorbidities were generated as a binary indicator, as a function of
  age, using a binomial model.
- Patients' one-year mortality was generated as a function of the
  patients' age and the presence of comorbidities using a binomial
  distribution.

The data generation and model specifications are provided below:
```{r}
library(dplyr)
generateData <- function(n, seed){
  set.seed(seed)
  age <- rnorm(n, 65, 5)
  cmbd <- rbinom(n, size = 1, prob = plogis(1 - 0.05 * age))
  Y <- rbinom(n, size = 1, prob = plogis(1 - 0.02 * age - 0.02 * cmbd))
  data.frame(Y, cmbd, age)
}
# Describing the data
data <- generateData(n = 1000, seed = 777)
str(data)
summarize(data,
          Status = mean(Y),
          Comorbidities = mean(cmbd),
          Age = mean(age))
```
## Delta-method for a single proportion parameter