BigData/12_partIV_data_analyticsI.Rmd at master · umatter/BigData · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
# (PART) Application: Topics in Big Data Econometrics {-}

# Introduction {#a .unnumbered}

>"There has also been an intellectual convergence across fields—machine learning and computer science, modern computational and Bayesian statistics, and data-driven social sciences and economics—that has raised the breadth and quality of applied analysis elsewhere. The machine learners have taught us how to automate and scale, the economists bring tools for causal and structural modeling, and the statisticians make sure that everyone remembers to keep track of uncertainty." [@taddy2020business,p.ix]

Far from being a comprehensive overview of Big Data Analytics applications in applied econometrics, the goal of this part of the book is to connect the conceptual and practical material covered thus far with analytics settings and point you to potentially interesting approaches and tools that may be useful in your specific field of applying Big Data Analytics. While all of the examples, case studies, and tutorials presented in this part are in the context of economic research or business analytics, the approaches and tools discussed are often easily transferable to other domains of applying Big Data Analytics.

First, Chapter 12 presents a few brief case studies pointing to a small exemplary selection of common bottlenecks in everyday data analytics tasks, such as the estimation of fixed effects models. The purpose of this chapter is to review some of the key concepts discussed in the previous two parts, whereby each of the case studies refers to some of the previously discussed perspectives on/approaches to Big Data: realizing why an analytics task is a burden for the available resources and considering an alternative statistical procedure, writing efficient R code for simple computations, and scaling up the computing resources. You can easily skip this chapter if you are already well familiar with the topics covered in the previous parts. Chapters 13–15 cover specific topic domains in the realm of applied Big Data Analytics that are common in modern econometrics: training machine learning models in predictive econometrics using GPUs\index{Graphics Processing Unit (GPU)} (Chapter 13), estimating linear and generalized linear models (e.g., classification models) with large scale datasets (Chapter 14), and performing large-scale text analysis (Chapter 15).


# Bottlenecks in Everyday Data Analytics Tasks

This chapter presents three examples of how the lessons from the previous chapters could be applied in everyday data analytics tasks. The first section focuses on the statistics perspective: compute something in a different way (with a different algorithm) but end up with essentially the same result. It is also an illustration of how diverse the already implemented solutions for working with large data in applied econometrics in the R-universe are, and how it makes sense to first look into a more efficient algorithm/statistical procedure than directly use specialized packages such as `bigmemory`\index{bigmemory Package} or even scale up in the cloud. The second section is a reminder (in an extremely simple setting) of how we can use R more efficiently when taking basic R characteristics into consideration. It is an example and detailed illustration of how adapting a few simple coding habits with basic R can substantially improve the efficiency of your code for larger workloads. Finally, the third section in this chapter re-visits the topics of scaling up both locally and in the cloud.


## Case study: Efficient fixed effects estimation

In this case study we look into a very common computational problem in applied econometrics: estimation of a fixed effects\index{Fixed Effects} model with various fixed-effects units (i.e., many intercepts). The aim of this case study is to give an illustration of how a specific statistical procedure can help us reduce the computational burden substantially (here, by reducing the number of columns in the model matrix and therefore the burden of computing the inverse of a huge model matrix). The context of this tutorial builds on a study called ["Friends in High Places"](https://www.aeaweb.org/articles?id=10.1257/pol.6.3.63) by @cohen_malloy. Cohen and Malloy show that US Senators who are alumni of the same university/college tend to help each other out in votes on industrial policies if the corresponding policy is highly relevant for the state of one senator but not relevant for the state of the other senator. The data is provided along with the published article and can be accessed here: [http://doi.org/10.3886/E114873V1](http://doi.org/10.3886/E114873V1). The data (and code) is provided in STATA format. We can import the main dataset with the `foreign` package \index{foreign Package} [@foreign]. For data handling we load the `data.table` package\index{data.table Package} and for hypotheses tests we load the `lmtest` package \index{lmtest Package} [@lmtest].

```{r message=FALSE, warning=FALSE}
# SET UP ------------------
# load packages
library(foreign)
library(data.table)
library(lmtest)
# fix vars
DATA_PATH <- "data/data_for_tables.dta"

# import data
cm <- as.data.table(read.dta(DATA_PATH))
# keep only clean obs
cm <- cm[!(is.na(yes)
           |is.na(pctsumyessameparty)
           |is.na(pctsumyessameschool)
           |is.na(pctsumyessamestate))]

```


As part of this case study, we will replicate parts of Table 3 of the main article (p. 73). Specifically, we will estimate specifications (1) and (2). In both specifications, the dependent variable is an indicator `yes` that is equal to 1 if the corresponding senator voted Yes on the given bill and 0 otherwise. The main explanatory variables of interest are `pctsumyessameschool` (the percentage of senators from the same school as the corresponding senator who voted Yes on the given bill), `pctsumyessamestate` (the percentage of senators from the same state as the corresponding senator who voted Yes on the given bill), and `pctsumyessameparty` (the percentage of senators from the same party as the corresponding senator who voted Yes on the given bill). Specification 1 accounts for congress (time) fixed effects\index{Fixed Effects} and senator (individual) fixed effects\index{Fixed Effects}, and specification 2 accounts for congress-session-vote fixed effects\index{Fixed Effects} and senator fixed effects\index{Fixed Effects}.

First, let us look at a very simple example to highlight where the computational burden in the estimation of such specifications is coming from. In terms of the regression\index{Regression} model 1, the fixed effect\index{Fixed Effects} specification means that we introduce an indicator variable (an intercept) for $N-1$ senators and $M-1$ congresses. That is, the simple model matrix ($X$) without accounting for fixed effects\index{Fixed Effects} has dimensions $425653\times4$.

```{r}
# pooled model (no FE)
model0 <-   yes ~
  pctsumyessameschool +
  pctsumyessamestate +
  pctsumyessameparty

dim(model.matrix(model0, data=cm))
```
In contrast, the model matrix of specification (1) is of dimensions $425653\times221$, and the model matrix of specification (2) even of $425653\times6929$.

```{r}
model1 <-
  yes ~ pctsumyessameschool +
        pctsumyessamestate +
        pctsumyessameparty +
        factor(congress) +
        factor(id) -1
mm1 <- model.matrix(model1, data=cm)
dim(mm1)
```

Using OLS\index{Ordinary Least Squares (OLS)} to estimate such a model thus involves the computation of a very large matrix inversion (because $\hat{\beta}_{OLS} = (\mathbf{X}^\intercal\mathbf{X})^{-1}\mathbf{X}^{\intercal}\mathbf{y}$). In addition, the model matrix for specification 2 is about 22GB, which might further slow down the computer due to a lack of physical memory or even crash the R session altogether.

<!-- ```{r} -->
<!-- as.numeric(object.size(mm2))*9.31e-10 -->
<!-- ``` -->

In order to set a point of reference, we first estimate specification (1) with standard OLS.


```{r message=FALSE, warning=FALSE}

# fit specification (1)
runtime <- system.time(fit1 <- lm(data = cm, formula = model1))
coeftest(fit1)[2:4,]
# median amount of time needed for estimation
runtime[3]
```

As expected, this takes quite some time to compute. However, there is an alternative approach to estimating such models that substantially reduces the computational burden by "sweeping out the fixed effects dummies". In the simple case of only one fixed effect variable (e.g., only individual fixed effects), the trick is called "within transformation"\index{Within Transformation} or "demeaning" and is quite simple to implement. For each of the categories in the fixed effect variable, compute the mean of the covariate and subtract the mean from the covariate's value.

```{r}
# illustration of within transformation for the senator fixed effects
cm_within <-
  with(cm, data.table(yes = yes - ave(yes, id),
                      pctsumyessameschool = pctsumyessameschool -
                        ave(pctsumyessameschool, id),
                      pctsumyessamestate = pctsumyessamestate -
                        ave(pctsumyessamestate, id),
                      pctsumyessameparty = pctsumyessameparty -
                        ave(pctsumyessameparty, id)
                      ))

# comparison of dummy fixed effects estimator and within estimator
dummy_time <- system.time(fit_dummy <-
              lm(yes ~ pctsumyessameschool +
                       pctsumyessamestate +
                       pctsumyessameparty +
                       factor(id) -1, data = cm
                         ))
within_time <- system.time(fit_within <-
                             lm(yes ~ pctsumyessameschool +
                                      pctsumyessamestate +
                                      pctsumyessameparty -1,
                                      data = cm_within))
# computation time comparison
as.numeric(within_time[3])/as.numeric(dummy_time[3])

# comparison of estimates
coeftest(fit_dummy)[1:3,]
coeftest(fit_within)

```

Unfortunately, we cannot simply apply the same procedure in a specification with several fixed effects\index{Fixed Effects} variables. However, @GAURE20138 provides a generalization of the linear within-estimator\index{Within Estimator} to several fixed effects\index{Fixed Effects} variables. This method is implemented in the `lfe` package \index{lfe Package} [@gaure_2013]. With this package, we can easily estimate both fixed-effect specifications (as well as the corresponding cluster-robust standard errors\index{Cluster-Robust Standard Errors}) in order to replicate the original results by @cohen_malloy.

```{r warning=FALSE, message=FALSE}
library(lfe)

# model and clustered SE specifications
model1 <- yes ~ pctsumyessameschool +
                pctsumyessamestate +
                pctsumyessameparty |congress+id|0|id
model2 <- yes ~ pctsumyessameschool +
                pctsumyessamestate +
                pctsumyessameparty |congress_session_votenumber+id|0|id

# estimation
fit1 <- felm(model1, data=cm)
fit2 <- felm(model2, data=cm)
```

Finally we can display the regression\index{Regression} table.

```{r warning=FALSE, message=FALSE}
stargazer::stargazer(fit1,fit2,
                     type="text",
                     dep.var.labels = "Vote (yes/no)",
                     covariate.labels = c("School Connected Votes",
                                          "State Votes",
                                          "Party Votes"),
                     keep.stat = c("adj.rsq", "n"))
```


## Case study: Loops, memory, and vectorization

We first read the `economics` dataset into R and extend it by duplicating its rows to get a slightly larger dataset (this step can easily be adapted to create a very large dataset).

```{r}
# read dataset into R
economics <- read.csv("data/economics.csv")
# have a look at the data
head(economics, 2)
# create a 'large' dataset out of this
for (i in 1:3) {
     economics <- rbind(economics, economics)
}
dim(economics)

```

The goal of this code example is to compute real personal consumption expenditures, assuming that `pce` in the `economics` dataset provides nominal personal consumption expenditures. Thus, we divide each value in the vector `pce` by a deflator `1.05`.

### Naïve approach (ignorant of R)

The first approach we take is based on a simple `for` loop\index{For-Loop}. In each iteration one element in `pce` is divided by the `deflator`, and the resulting value is stored as a new element in the vector `pce_real`.

```{r}
# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- c()
n_obs <- length(economics$pce)
for (i in 1:n_obs) {
  pce_real <- c(pce_real, economics$pce[i]/deflator)
}

# look at the result
head(pce_real, 2)


```


How long does it take?

```{r}
# Naïve approach (ignorant of R)
deflator <- 1.05 # define deflator
# iterate through each observation
pce_real <- list()
n_obs <- length(economics$pce)
time_elapsed <-
     system.time(
         for (i in 1:n_obs) {
              pce_real <- c(pce_real, economics$pce[i]/deflator)
})

time_elapsed

```


Assuming a linear time algorithm ($O(n)$), we need that much time for one additional row of data:

```{r}

time_per_row <- time_elapsed[3]/n_obs
time_per_row

```


If we are dealing with Big Data, say 100 million rows, that is

```{r}
# in seconds
(time_per_row*100^4)
# in minutes
(time_per_row*100^4)/60
# in hours
(time_per_row*100^4)/60^2

```


Can we improve this?


### Improvement 1: Pre-allocation of memory

In the naïve approach taken above, each iteration of the loop causes R to re-allocate memory because the number of elements in the vector `pce_element` is changing. In simple terms, this means that R needs to execute more steps in each iteration. We can improve this with a simple trick by initiating the vector to the right size to begin with (filled with `NA` values).

```{r}
# Improve memory allocation (still somewhat ignorant of R)
deflator <- 1.05 # define deflator
n_obs <- length(economics$pce)
# allocate memory beforehand
# Initialize the vector to the right size
pce_real <- rep(NA, n_obs)
# iterate through each observation
time_elapsed <-
     system.time(
         for (i in 1:n_obs) {
              pce_real[i] <- economics$pce[i]/deflator
})


```

Let's see if this helped to make the code faster.

```{r}

time_per_row <- time_elapsed[3]/n_obs
time_per_row

```

Again, we can extrapolate (approximately) the computation time, assuming the dataset had millions of rows.

```{r}
# in seconds
(time_per_row*100^4)
# in minutes
(time_per_row*100^4)/60
# in hours
(time_per_row*100^4)/60^2

```

This looks much better, but we can do even better.

### Improvement 2: Exploit vectorization
\index{Vectorization}
In this approach, we exploit the fact that in R, 'everything is a vector' and that many of the basic R functions (such as math operators) are *vectorized*. In simple terms, this means that a vectorized operation is implemented in such a way that it can take advantage of the similarity of each of the vector's elements. That is, R only has to figure out once how to apply a given function to a vector element in order to apply it to all elements of the vector. In a simple loop, R has to go through the same 'preparatory' steps again and again in each iteration; this is time-intensive.

In this example, we specifically exploit that the division operator `/` is actually a vectorized function. Thus, the division by our `deflator` is applied to each element of `economics$pce`.

```{r}
# Do it 'the R way'
deflator <- 1.05 # define deflator
# Exploit R's vectorization
time_elapsed <-
     system.time(
     pce_real <- economics$pce/deflator
          )
# same result
head(pce_real, 2)

```


Now this is much faster. In fact, `system.time()`\index{system.time()} is not precise enough to capture the time elapsed. In order to measure the improvement, we use `microbenchmark::microbenchmark()`\index{microbenchmark()} to measure the elapsed time in microseconds (millionth of a second).

```{r}
library(microbenchmark)
# measure elapsed time in microseconds (avg.)
time_elapsed <-
  summary(microbenchmark(pce_real <- economics$pce/deflator))$mean
# per row (in sec)
time_per_row <- (time_elapsed/n_obs)/10^6

```

Now we get a more precise picture regarding the improvement due to vectorization:

```{r}
# in seconds
(time_per_row*100^4)
# in minutes
(time_per_row*100^4)/60
# in hours
(time_per_row*100^4)/60^2

```


## Case study: Bootstrapping and parallel processing

In this example, we estimate a simple regression\index{Regression} model that aims to assess racial discrimination in the context of police stops.^[Note that this example aims to illustrate a point about computation in an applied econometrics context. It does not make any argument whatsoever about identification or the broader research question.] The example is based on the 'Minneapolis Police Department 2017 Stop Dataset', containing data on nearly all stops made by the Minneapolis Police Department for the year 2017.

We start by importing the data into R.

```{r message=FALSE, warning=FALSE, echo=FALSE, tidy=FALSE, tidy.opts=list(arrow=TRUE, width.cutoff=60)}
url <-
"https://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv"
stopdata <- data.table::fread(url)
```

```{r message=FALSE, warning=FALSE, eval=FALSE, tidy=FALSE, tidy.opts=list(arrow=TRUE, width.cutoff=60)}
url <-
"https://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv"
stopdata <- data.table::fread(url)
```

We specify a simple linear probability model that aims to test whether a person identified as 'white' is less likely to have their vehicle searched when stopped by the police. In order to take into account level differences between different police precincts, we add precinct indicators to the regression\index{Regression} specification.

First, let's remove observations with missing entries (`NA`) and code our main explanatory variable and the dependent variable.

```{r}
# remove incomplete obs
stopdata <- na.omit(stopdata)
# code dependent var
stopdata$vsearch <- 0
stopdata$vsearch[stopdata$vehicleSearch=="YES"] <- 1
# code explanatory var
stopdata$white <- 0
stopdata$white[stopdata$race=="White"] <- 1
```


We specify our baseline model as follows.

```{r}
model <- vsearch ~ white + factor(policePrecinct)
```

and estimate the linear probability model via OLS (the `lm` function).

```{r}
fit <- lm(model, stopdata)
summary(fit)
```

A potential problem with this approach (and there might be many more in this simple example) is that observations stemming from different police precincts might be correlated over time. If that is the case, we likely underestimate the coefficient's standard errors. There is a standard approach to computing estimates for so-called *cluster-robust* standard errors\index{Cluster-Robust Standard Errors}, which would take the problem of correlation over time within clusters into consideration (and deliver a more conservative estimate of the SEs). However, this approach only works well if the number of clusters in the data is roughly 50 or more. Here we only have five.

The alternative approach is to compute bootstrapped clustered standard errors. That is, we apply the [bootstrap resampling procedure](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))\index{Bootstrap Resampling} at the cluster level. Specifically, we draw $B$ samples (with replacement), estimate and record the coefficient vector for each bootstrap-sample, and then estimate $SE_{boot}$ based on the standard deviation of all respective estimated coefficient values.

```{r message=FALSE}
# load packages
library(data.table)
# set the 'seed' for random numbers (makes the example reproducible)
set.seed(2)

# set number of bootstrap iterations
B <- 10
# get selection of precincts
precincts <- unique(stopdata$policePrecinct)
# container for coefficients
boot_coefs <- matrix(NA, nrow = B, ncol = 2)
# draw bootstrap samples, estimate model for each sample
for (i in 1:B) {

     # draw sample of precincts (cluster level)
     precincts_i <- base::sample(precincts, size = 5, replace = TRUE)
     # get observations
     bs_i <-
          lapply(precincts_i, function(x){
               stopdata[stopdata$policePrecinct==x,]
     } )
     bs_i <- rbindlist(bs_i)

     # estimate model and record coefficients
     boot_coefs[i,] <- coef(lm(model, bs_i))[1:2] # ignore FE-coefficients
}
```


Finally, let's compute $SE_{boot}$.

```{r}
se_boot <- apply(boot_coefs,
                 MARGIN = 2,
                 FUN = sd)
se_boot
```


Note that even with a very small $B$, computing $SE_{boot}$ takes some time to compute. When setting $B$ to over 500, computation time will be substantial. Also note that running this code hardly uses up more memory than the very simple approach without bootstrapping (after all, in each bootstrap iteration the dataset used to estimate the model is approximately the same size as the original dataset). There is little we can do to improve the script's performance regarding memory. However, we can tell R how to allocate CPU\index{Central Processing Unit (CPU)} resources more efficiently to handle that many regression\index{Regression} estimates.

In particular, we can make use of the fact that most modern computing environments (such as a laptop) have CPUs\index{Central Processing Unit (CPU)} with several *cores*. We can exploit this fact by instructing the computer to run the computations *in parallel* (simultaneously computing on several cores). The following code is a parallel implementation of our bootstrap procedure that does exactly that.

```{r message=FALSE}
# load packages for parallel processing
library(doSNOW)
# get the number of cores available
ncores <- parallel::detectCores()
# set cores for parallel processing
ctemp <- makeCluster(ncores) #
registerDoSNOW(ctemp)


# set number of bootstrap iterations
B <- 10
# get selection of precincts
precincts <- unique(stopdata$policePrecinct)
# container for coefficients
boot_coefs <- matrix(NA, nrow = B, ncol = 2)

# bootstrapping in parallel
boot_coefs <-
     foreach(i = 1:B, .combine = rbind, .packages="data.table") %dopar% {
          # draw sample of precincts (cluster level)
          precincts_i <- base::sample(precincts, size = 5, replace = TRUE)
          # get observations
          bs_i <- lapply(precincts_i, function(x) {
            stopdata[stopdata$policePrecinct==x,]
          })
          bs_i <- rbindlist(bs_i)
          # estimate model and record coefficients
          coef(lm(model, bs_i))[1:2] # ignore FE-coefficients
     }
# be a good citizen and stop the snow clusters
stopCluster(cl = ctemp)

```


As a last step, we again compute $SE_{boot}$.

```{r}
se_boot <- apply(boot_coefs,
                 MARGIN = 2,
                 FUN = sd)
se_boot
```


### Parallelization with an EC2 instance


This short tutorial illustrates how to scale up the computation of clustered standard errors shown above by running it on an AWS EC2\index{EC2} instance. Note that there are a few things that we need to keep in mind to make the script run on an AWS EC2\index{EC2} instance in RStudio Server\index{RStudio Server}. First, our EC2\index{EC2} instance is a Linux machine.  When running R on a Linux machine, there is an additional step to install R packages (at least for most of the packages): R packages need to be compiled before they can be installed. The command to install packages is exactly the same (`install.packages()`), and normally you only notice a slight difference in the output shown in the R console during installation (and the installation process takes a little longer than you are used to). Apart from that, using R via RStudio Server\index{RStudio Server} in the cloud looks/feels very similar if not identical to when using R/RStudio locally. For this step of the case study, first follow the instructions of how to set up an AWS EC2\index{EC2} instance with R/RStudio Server\index{RStudio Server} in Chapter 7. Then, open a browser window, log in to RStudio Server\index{RStudio Server} on the EC2\index{EC2} instance, and copy and paste the code below to a new R-file on the EC2\index{EC2} instance (note that you might have to install the `data.table` and `doSNOW` packages\index{doSNOW Package} before running the code).

When executing the code below line-by-line, you will notice that essentially all parts of the script work exactly as on your local machine. This is one of the great advantages of running R/RStudio Server\index{RStudio Server} in the cloud. You can implement your entire data analysis locally (based on a small sample), test it locally, and then move it to the cloud and run it at a larger scale in exactly the same way (even with the same Graphical User Interface (GUI)).

```{r eval=FALSE, tidy=TRUE, tidy.opts=list(arrow=TRUE, width.cutoff=60)}

# install packages
install.packages("data.table")
install.packages("doSNOW")
# load packages
library(data.table)

# fetch the data
url <-
"https://vincentarelbundock.github.io/Rdatasets/csv/carData/MplsStops.csv"
stopdata <- read.csv(url)
# remove incomplete obs
stopdata <- na.omit(stopdata)
# code dependent var
stopdata$vsearch <- 0
stopdata$vsearch[stopdata$vehicleSearch=="YES"] <- 1
# code explanatory var
stopdata$white <- 0
stopdata$white[stopdata$race=="White"] <- 1

# model fit
model <- vsearch ~ white + factor(policePrecinct)
fit <- lm(model, stopdata)
summary(fit)
# bootstrapping: normal approach
# set the 'seed' for random numbers (makes the example reproducible)
set.seed(2)
# set number of bootstrap iterations
B <- 50
# get selection of precincts
precincts <- unique(stopdata$policePrecinct)
# container for coefficients
boot_coefs <- matrix(NA, nrow = B, ncol = 2)
# draw bootstrap samples, estimate model for each sample
for (i in 1:B) {
  # draw sample of precincts (cluster level)
  precincts_i <- base::sample(precincts, size = 5, replace = TRUE)
  # get observations
  bs_i <-
    lapply(precincts_i, function(x){
      stopdata[stopdata$policePrecinct==x,]})
  bs_i <- rbindlist(bs_i)
  # estimate model and record coefficients
  boot_coefs[i,] <- coef(lm(model, bs_i))[1:2] # ignore FE-coefficients
}

se_boot <- apply(boot_coefs,
                 MARGIN = 2,
                 FUN = sd)
se_boot
```


So far, we have only demonstrated that the simple implementation (non-parallel) works both locally and in the cloud. However, the real purpose of using an EC2\index{EC2} instance in this example is to make use of the fact that we can scale up our instance to have more CPU\index{Central Processing Unit (CPU)} cores available for the parallel implementation of our bootstrap procedure. Recall that running the script below on our local machine will employ all cores available to us and compute the bootstrap resampling\index{Bootstrap Resampling} in parallel on all these cores. Exactly the same thing happens when running the code below on our simple `t2.micro` instance. However, this type of EC2\index{EC2} instance only has one core. You can check this when running the following line of code in RStudio Server\index{RStudio Server} (assuming the `doSNOW` package\index{doSNOW Package} is installed and loaded): `parallel::detectCores()`\index{detectCores()}
.

When running the entire parallel implementation below, you will thus notice that it won't compute the bootstrap SE any faster than with the non-parallel version above. However, by simply initiating another EC2\index{EC2} type with more cores, we can distribute the workload across many CPU\index{Central Processing Unit (CPU)} cores, using exactly the same R script.

```{r eval=FALSE}

# bootstrapping: parallel approaach
# install.packages("doSNOW", "parallel")
# load packages for parallel processing
library(doSNOW)
# set cores for parallel processing
ncores <- parallel::detectCores()
ctemp <- makeCluster(ncores)
registerDoSNOW(ctemp)
# set number of bootstrap iterations
B <- 50
# get selection of precincts
precincts <- unique(stopdata$policePrecinct)
# container for coefficients
boot_coefs <- matrix(NA, nrow = B, ncol = 2)

# bootstrapping in parallel
boot_coefs <-
  foreach(i = 1:B, .combine = rbind, .packages="data.table") %dopar% {
    # draw sample of precincts (cluster level)
    precincts_i <- base::sample(precincts, size = 5, replace = TRUE)
    # get observations
    bs_i <- lapply(precincts_i, function(x){
         stopdata[stopdata$policePrecinct==x,])
    }
    bs_i <- rbindlist(bs_i)

    # estimate model and record coefficients
    coef(lm(model, bs_i))[1:2] # ignore FE-coefficients
  }

# be a good citizen and stop the snow clusters
stopCluster(cl = ctemp)
# compute the bootstrapped standard errors
se_boot <- apply(boot_coefs,
                 MARGIN = 2,
                 FUN = sd)
```