---
title: "PSY4210: Linear Mixed Models (LMMs) 1"
author:
  - name: Pei Hwa Goh
  - name: Joshua F. Wiley
    email: joshua.wiley@monash.edu
  - name: Michelle Byrne
date: "`r Sys.Date()`"
format:
  html:
    toc: true
    number-sections: true
    embed-resources: true
    anchor-sections: true
    smooth-scroll: true
---
You can view, download, or pull the raw code for this content here:
https://github.com/jwiley/MonashHonoursStatistics
These are the `R` packages we will use.
```{r setup}
options(digits = 2)
## new packages are lme4, lmerTest, and multilevelTools
library(tufte)
library(haven)
library(data.table)
library(JWileymisc)
library(lme4)
library(lmerTest)
library(multilevelTools)
library(visreg)
library(ggplot2)
library(ggpubr)
## Set your working directory to the folder that has the datafiles.
## "Merged" is a merged long dataset of both baseline and daily diary data
dm <- as.data.table(read_sav("[2021] PSY4210 merged.sav"))
```
# Linear Mixed Models (LMMs) - Introduction
Often, observations are not independent.
For example:
- Repeated measures data, such as in longitudinal studies;
repeated measures experiments, where observations are clustered
within people
- When individuals are clustered or grouped, such as people within
families, schools, companies, resulting in clustering within the
higher order unit
Clustered data versus repeated measures or longitudinal data may
seem quite different, but from a statistical perspective, they
pose many of the same challenges that are solved in basically the same
ways. We will focus on repeated measures for now, but note the
statistical methods apply to both contexts.
Here is some hypothetical data on a few people where systolic blood
pressure (SBP) was measured at three time points:
| ID | SBP1 | SBP2 | SBP3 |
|----|------|------|------|
| 1 | 135 | 130 | 125 |
| 2 | 120 | 125 | 121 |
| 3 | 121 | 125 | . |
These data are stored in "wide" format. That is, each repeated
measure is stored as a separate variable. For LMMs, we typically
want data in a long format, like this:
| ID | Time | SBP |
|----|------|-----|
| 1 | 1 | 135 |
| 1 | 2 | 130 |
| 1 | 3 | 125 |
| 2 | 1 | 120 |
| 2 | 2 | 125 |
| 2 | 3 | 121 |
| 3 | 1 | 121 |
| 3 | 2 | 125 |
In a long format, all repeated measurements are stored in a single
variable, with an additional variable to indicate the time point or
which specific assessment is being examined.
In long format, multiple rows may belong to a single unit.
Not all units (here, IDs) have to have the same number of rows.
For example, some people, like ID 3, may have missed a time point.
With clustered data, you could have a large or small school with
a different number of students (where student is the repeated measure
within school rather than time points within a person).
The `reshape()` function in `R` can be used to reshape data in
a wide format to a long format or from a long format to a wide format.
See the "Working with Data" topic for examples and details.
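As a sketch, here is how `reshape()` could convert the hypothetical SBP table above from wide to long. The data values are the illustrative ones from the table, not a real dataset:

```{r}
## hypothetical wide-format SBP data, one row per person
sbp_wide <- data.frame(
  ID = 1:3,
  SBP1 = c(135, 120, 121),
  SBP2 = c(130, 125, 125),
  SBP3 = c(125, 121, NA))

## reshape wide to long: one row per ID x time point
sbp_long <- reshape(sbp_wide,
  varying = c("SBP1", "SBP2", "SBP3"),
  v.names = "SBP",
  timevar = "Time",
  idvar = "ID",
  direction = "long")

sbp_long[order(sbp_long$ID), ]
```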
Regardless of wide or long format, these data are clearly not
independent. The different blood pressure readings within a
person are likely more related to each other than to blood pressure
readings from a different person.
## Considering Non-Independence
In previous PSY4210 cohorts, there was a small data collection exercise
in which students completed some questions every day (like the BL dataset, but with repeated measures).
We load that data using the
`read_sav()` function and plot the daily energy ratings from a few
different people in the following figure.
```{r, fig.height = 4, fig.width = 5}
dd <- as.data.table(read_sav("[2021] PSY4210 DD.sav")) # daily
ggplot(dd[ID %in% c(10, 18, 40, 64, 97)],
aes(factor(ID), dEnergy)) +
geom_point(alpha = .2) +
stat_summary(fun = mean, colour = "blue", shape = 18) +
theme_pubr()
```
Not only do observations vary across days within a person, but
different people have different mean energy levels.
Our first thought on analysing data where observations are not
independent may be to consider methods we already know, such as
linear regression (or GLMs) or ANOVA. For example, we could use a
repeated measures (RM) ANOVA. RM ANOVA can handle repeated measures if:
- measurements are at discrete time points (e.g., T1, T2, T3)
- everyone has the same number of time points
- the outcome is continuous, normally distributed data
However, RM ANOVA has many limitations. It cannot handle:
- continuous time (e.g., if one person completes on days 0, 13, 22,
  and another on days 0, 1, 20)
- missing data: if someone is missing data on any time point, they are
  completely excluded from the analysis (unless you do something like
  imputation)
- continuous predictors/explanatory variables (e.g., age in years) or
  other types of outcomes
If ANOVAs are not ideal, what would happen if we use a linear
regression? The following figure shows the linear regression lines for
the association between energy and self esteem for each individual
participant (just four for example) and the single line that is the
result of an overall linear regression (in blue).
```{r, fig.height = 4, fig.width = 5}
ggplot(dd[ID %in% c(10, 18, 40, 64, 97)],
aes(dEnergy, dSE)) +
stat_smooth(method = "lm", se = FALSE, colour = "blue", size = 2) +
stat_smooth(aes(group = ID), method = "lm", se = FALSE, colour = "black") +
theme_pubr()
```
The overall linear regression has to be the same for everyone because
in linear regression we have one intercept and one slope. The fact
that different people may have different overall levels of energy or
self-esteem, and the fact that the association between energy and
self-esteem may not be the same for all people, cannot be captured by
the linear regression.
Statistically, the linear regression assumption of independence also
would be violated, making the standard errors, confidence intervals,
and p-values from a linear regression on non-independent data biased
and wrong.
These issues share a common cause. Regular regressions / GLMs cannot
capture individual differences in the level of each line (different
intercepts by ID), nor can they capture differences in the
relationship between the predictor and outcome (different slopes by
ID), and they violate the independence assumption. All of this stems
from the fact that in the regressions / GLMs we have learned so far,
all the coefficients are **fixed effects**, meaning that they are
**fixed** (assumed) to be identical for every participant.
When you have only one observation per participant, fixed effects are
necessary as it is impossible to estimate coefficients for any
individual participant. We can only analyze by aggregating across
individuals. With repeated measures, however, it is possible and
important to consider whether a coefficient differs between people.
If a regression coefficient differs for each participant, we say that
it varies, or is a random variable. Because of this, we call
coefficients that differ by person **random effects**. Unlike fixed
effects, they are not assumed to be identical for everyone, but
instead vary randomly by person.
## LMM Theory
Linear Mixed Effects Models (LMMs) extend the fixed-effects only linear
regression models covered in previous units to include both fixed and
random effects. That is, to be able to include regression coefficients
that are identical for everyone (fixed effects) and regression
coefficients that vary randomly for each participant (random
effects).
Because LMMs include both fixed and random effects, they are called
mixed models.
Let's see how this works starting with the simplest linear regression
model and then moving into the simplest linear mixed model.
The simplest linear regression model is an intercept only model.
$$ y_i = b_0 \cdot 1 + \varepsilon_i $$
The intercept here, $b_0$, is the expected outcome value when all
other predictors in the model are zero. Since there are no other
predictors, this will be the mean of the outcome. The fixed effect
means the linear regression model assumes that all participants have
the same mean. Shown graphically, here are a few participants' data
with a blue dot showing the mean for each person. The regression line
with an intercept only will be the blue line. It assumes that the mean
(intercept) for all people is identical, which is not true.
We need to make $b_0$ a random effect -- to allow it to vary randomly
across participants.
```{r, fig.height = 4, fig.width = 5}
ggplot(dd[ID %in% c(10, 18, 40, 64, 97)],
aes(ID, dEnergy)) +
geom_point(alpha = .2) +
stat_summary(fun = mean, colour = "blue", shape = 18) +
stat_smooth(method = "lm", se = FALSE, formula = y ~ 1) +
theme_pubr()
```
We can see the actual distribution of mean energy levels by ID easily
enough as follows.
```{r}
plot(testDistribution(dd[, .(
MeanEnergy = mean(dEnergy, na.rm = TRUE)), by = ID]$MeanEnergy),
varlab = "Mean (average) Energy by ID")
```
The key point of the previous graph is to show that indeed, different
participants have different mean energy levels. The mean energy levels
vary and in this case appear to fairly closely follow a normal
distribution. The means are a random variable.
If we assume that a random variable follows a particular distribution
(e.g., a normal distribution), we can describe it with two parameters:
the mean and standard deviation.
Assuming a random variable comes from a distribution gives us another
way to think about fixed and random effects.
- **Fixed Effects** approximate the distribution by: $M$ = estimated
mean; $SD$ = 0 (i.e., the standard deviation is **fixed** at 0)
- **Random Effects** approximate the distribution by:
$M$ = estimated mean; $SD$ = estimated standard deviation
(the SD is free to vary).
In this way, you can think of both fixed and random effects as using a
distribution to approximate a random variable (say, the association
between energy and self esteem for different people), but fixed
effects make the strong assumption that the distribution has $SD = 0$
whereas random effects *relax* that assumption and allow the
possibility that $SD > 0$.
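A small simulation can make the $SD = 0$ versus $SD > 0$ distinction concrete. This is an illustrative sketch; the numbers are made up, not estimated from the unit data:

```{r}
## fixed effect: one intercept shared by everyone (SD fixed at 0)
## random effect: intercepts drawn from a normal distribution (SD free to vary)
set.seed(1234)
fixed_intercepts  <- rep(3, 1000)
random_intercepts <- rnorm(1000, mean = 3, sd = 0.8)
c(M = mean(fixed_intercepts),  SD = sd(fixed_intercepts))
c(M = mean(random_intercepts), SD = sd(random_intercepts))
```

Both versions approximate the same mean; only the random effects version allows a non-zero standard deviation across people.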
## Intraclass Correlation Coefficient (ICC)
In repeated measures / multilevel (two-level) data, we can decompose
variability into two sources: variability **between** individuals and
variability **within** individuals.
We call the ratio of between variance to total variance the Intraclass
Correlation Coefficient (ICC). The ICC varies between 0 and 1.
- 0 indicates that all individual means are identical so that all
variability occurs within individuals.
- 1 indicates that within individuals all values are the same with all
variability occurring between individuals.
We can define some equivalent notations:
$$ TotalVariance = \sigma^{2}_{between} + \sigma^{2}_{within} = \sigma^{2}_{randomintercept} + \sigma^{2}_{residual} $$
With that definition of the total variance, we can define the ICC as:
$$ ICC = \frac{\sigma^{2}_{between}}{\sigma^{2}_{between} + \sigma^{2}_{within}} = \frac{\sigma^{2}_{randomintercept}}{\sigma^{2}_{randomintercept} + \sigma^{2}_{residual}} $$
A basic understanding of these equations is helpful as it lets us
interpret the ICC. Suppose that every person's mean was
identical. There would be no variation at the between person level.
That is, $\sigma^{2}_{between} = 0$. That results in:
$$ ICC = \frac{0}{0 + \sigma^{2}_{within}} = 0 $$
When there are no differences between people, $ICC = 0$.
What happens if there is no variation within people? What if all the
variation is between people?
That is, $\sigma^{2}_{within} = 0$. That results in:
$$ ICC = \frac{\sigma^{2}_{between}}{\sigma^{2}_{between} + 0} = 1 $$
When there are no differences within people, $ICC = 1$.
The ICC is the ratio of the differences between people to the total
variance. A small ICC near 0 tells us that almost all the variation
exists within person, not between person. A high ICC near 1 tells us
that almost all the variation occurs between people with very little
variation within person. The ICC can be interpreted as the percent of
total variation that is between person.
The following figure shows two examples. The High Between variance
graph would have a **high** ICC and the Low Between variance graph
would have a **low** ICC.
```{r, fig.height = 10, fig.width = 7, fig.cap = "Example showing the difference between high and low between person variance in data."}
set.seed(1234)
ex.data.1 <- data.table(
ID = factor(rep(1:4, each = 10)),
time = rep(1:10, times = 4),
y = rnorm(40, rep(1:4, each = 10), .2))
ex.data.2 <- data.table(
ID = factor(rep(1:4, each = 10)),
time = rep(1:10, times = 4),
y = rnorm(40, 2.5, 1))
ggarrange(
set_palette(ggplot(ex.data.1,
aes(time, y, colour = ID, shape = ID)) +
stat_smooth(method = "lm", formula = y ~ 1, se=FALSE) +
geom_point() +
theme_pubr(), "jco"),
set_palette(ggplot(ex.data.2,
aes(time, y, colour = ID, shape = ID)) +
stat_smooth(method = "lm", formula = y ~ 1, se=FALSE) +
geom_point() +
theme_pubr(), "jco"),
ncol = 1,
labels = c("High Between", "Low Between"))
```
:::{.callout-note collapse="true"}
Beyond the scope of this unit, sometimes we have even
more levels of data, like repeated observations measured within
people and people measured within classrooms or cities, etc. We can
similarly calculate ICCs for each level, i.e., city, person,
etc. To do this using the iccMixed() function, you would just add
more ID variables.
:::
We can calculate the ICC in `R` using the `iccMixed()` function.
It requires the name of the dependent variable, the ID variable,
indicating which rows in the dataset belong to which units, and the
dataset name. It returns the standard deviation (Sigma) attributable
to the units (ID; the between variance) and to the residual (the
within variance). The column labelled ICC is the proportion of total
variance attributable to each effect. The main ICC would be the ICC
for the ID, here 0.40 for stress.
```{r}
iccMixed("dStress", id = "ID", data = dd)
```
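The ICC formula can also be applied by hand, using variance components from an intercept-only LMM; this should closely match the `iccMixed()` output. A sketch, assuming the `dd` dataset loaded earlier:

```{r}
## intercept-only model with a random intercept by ID
m0 <- lmer(dStress ~ 1 + (1 | ID), data = dd)
vc <- as.data.frame(VarCorr(m0))
v_between <- vc$vcov[vc$grp == "ID"]        # random intercept variance
v_within  <- vc$vcov[vc$grp == "Residual"]  # residual variance
v_between / (v_between + v_within)          # the ICC
```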
The ICC can differ across variables. For example, one variable might
be quite stable across days and another very unstable. When working
with repeated measures data, the ICC for each repeated measure
variable is a useful descriptive statistic to report. For example, the
following calculates the ICC for energy, 0.28, which is lower
than that for stress, indicating that energy is less stable from day
to day than is stress, or equivalently that compared to energy, stress
has relatively more variation between people than within them.
```{r}
iccMixed("dEnergy", id = "ID", data = dd)
```
## Between and Within Effects
The ICC illustrates an idea that comes up a lot in multilevel or mixed
effects models: the idea of between and within effects. A total effect
or the observed total score can be decomposed into some part
attributable to between and some part attributable to a within
effect as shown in the following diagram.
```{r, echo = FALSE, fig.cap = "Decomposing multilevel effects"}
DiagrammeR::grViz("
digraph 'Decomposing multilevel effects' {
graph [overlap = true, fontsize = 14]
node [fontname = Helvetica, shape = rectangle]
T [label = 'Total']
B [label = 'Between']
W [label = 'Within']
T -> {B W}
}
")
```
Between effects exist **between** units whereas within effects exist
**within** a unit. In psychology, often the unit is a person and the
between effects are differences between people whereas the within
effects are differences within an individual.
:::{.callout-note collapse="true"}
When you subtract the mean from a variable, the new mean will be equal
to 0. For example, if you have three numbers:
$$ (1, 3, 5) $$
the mean is 3, if you subtract the mean you get:
$$ (-2, 0, +2) $$
and the mean of the deviations is 0.
If you have:
$$ (4, 5, 6) $$
the mean is 5 and if you subtract the mean you get:
$$ (-1, 0, +1) $$
as deviation scores and the mean of these deviations is 0.
The mean of deviations from a mean will always be 0. When we calculate
a within person variable from a total variable, we separate out the
mean as the between portion; the within portion contains deviations
from individuals' own means, and so the mean of the within portion
will be 0.
:::
For example, suppose that in one individual, Person A, over three
days, we observed the following scores on happiness: 1, 3, 5.
Those total observed scores could be decomposed into one part
attributable to a stable, between person effect (the mean) and another
part that purely captures daily fluctuations, the within person effect
(deviations from the individuals' own mean). We could break down those
numbers: 1, 3, 5 as shown in the following figure.
```{r, echo = FALSE, fig.cap = "Decomposing multilevel effects - concrete example"}
DiagrammeR::grViz("
digraph 'Decomposing multilevel effects example' {
graph [overlap = true, fontsize = 12]
node [fontname = Helvetica, shape = rectangle]
subgraph Between {
node [fontname = Helvetica, shape = rectangle]
B1 [label = 'Day 1: 3']
B2 [label = 'Day 2: 3']
B3 [label = 'Day 3: 3']
B1 -> B2
B2 -> B3
}
subgraph Within {
node [fontname = Helvetica, shape = rectangle]
W1 [label = 'Day 1: -2']
W2 [label = 'Day 2: 0']
W3 [label = 'Day 3: +2']
W1 -> W2
W2 -> W3
}
Total
Total -> Between
Total -> Within
Between -> B1 [lhead = between]
Within -> W1 [lhead = within]
}
")
```
Notice that the between person effect does not vary across days. It is
constant for a single individual. The within person portion, however,
does vary across days.
In this case, the between person effects would be the mean for each
person, matching a different intercept for each person from our ICC
example. The variance in these means (intercepts) is the between
person variance. The within person effects are individual deviations
from individuals' own means and their variance would be the residual
variance. Together those variances add up to the total variance and
these portions are what the ICC captures.
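The decomposition for Person A can be done directly in `R`:

```{r}
## Person A's happiness scores across three days
happiness <- c(1, 3, 5)
between <- mean(happiness)       # 3, constant across days
within  <- happiness - between   # -2, 0, +2, varies across days
between + within                 # recovers the original scores
mean(within)                     # deviations from own mean average to 0
```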
## Descriptive Statistics on Multilevel Data
With multilevel data, basic descriptive statistics can be calculated
in different ways. We will use the merged dataset from the daily
collection exercise. This dataset merges the baseline and daily
datasets in a single, long format file.
To start with, suppose we had such a merged file (generally, a
convenient way to store and analyze multilevel data) and we wanted to
calculate descriptive statistics on some between person variables,
such as loneliness and sex. We might start off as we have before using
`egltable()` for some summary descriptive statistics.
```{r betweendesc1}
egltable(c("loneliness", "sex"), data = dm)
```
Note that as `sex` is coded 1/2 and not a factor, by default we are
given a mean and standard deviation. This is not so helpful. We could
convert it to a factor, or set `strict=FALSE`, in which case
`egltable()` does not **strict**ly follow the class of each variable;
instead, if a variable only has a few unique values, it is assumed to
be categorical/discrete, regardless of its official class in `R`.
```{r betweendesc2}
egltable(c("loneliness", "sex"), data = dm, strict=FALSE)
```
To make things even clearer, let's remind R what 1 and 2 represent.
```{r}
dm[, sex := factor(
sex,
levels = c(1,2),
labels = c("male", "female"))]
## While we are at it, let's also tell R that relationship
## status is a factor too
dm[, relsta := factor(
relsta, levels = c(1,2,3),
labels = c("single", "in a committed exclusive relationship",
"in a committed nonexclusive relationship"))]
## now let's redo the egltable
egltable(c("loneliness", "sex"), data = dm, strict=FALSE)
## now that R knows sex is a factor, we can remove the strict argument
egltable(c("loneliness", "sex"), data = dm)
```
Now we can see the mean and standard deviation for loneliness and the
frequencies and percentages of female and male participants in the study. However,
this reveals a problem/error. There were not 43 male and 240 female participants in
the study. What this actually shows is the average loneliness, weighted by
the number of days of data available for each participant, and the
total number of daily surveys completed by male and female participants. That happens
because the between person variables are repeated in long format.
Let's look at the data just for the first two participants in the
following table. We can see that on all days, loneliness and sex scores
are identical for IDs 10 and 12.
```{r, results = 'asis'}
knitr::kable(dm[ID %in% c(10, 12), .(ID, SurveyDay, loneliness, sex)])
```
This is how between person variables are typically stored when a
dataset combines them with repeated measures data in long format.
However, it means that descriptive statistics on them are probably not
what we really wanted. The means are essentially weighted means:
participants who complete more days of data get a higher weighting
because their loneliness values are repeated more times. For
categorical/discrete variables, what we get is the number of
assessments / observations at each level of the categorical variable,
in this case the number of observations belonging to male and female
participants. When using data tables, the solution is easy: drop
duplicated rows so that only one row of data per ID is kept, and then
remake the table. The following code shows how to do this.
The difference is in using `data = dm[!duplicated(ID)]`
instead of `data = dm`.
```{r betweendesc3}
egltable(c("loneliness", "sex"), data = dm[!duplicated(ID)])
```
Yielding this nice descriptive statistics table:
```{r, results = 'asis'}
knitr::kable(egltable(c("loneliness", "sex"), data = dm[!duplicated(ID)]))
```
When working with multilevel data and a mix of between variables
(especially sociodemographics that are often asked once) and repeated
measures variables, watch out for which are which type and select only
one row per ID when calculating descriptives or graphs for between
person variables. The same applies when making exploratory plots or
figures.
```{r, fig.cap = "This is wrong, has many duplicates for each loneliness score"}
plot(testDistribution(dm$loneliness), varlab = "this is wrong")
```
```{r, fig.cap = "This is right, only has one loneliness score per person."}
plot(testDistribution(dm[!duplicated(ID)]$loneliness), varlab = "this is right")
```
For repeated measures variables or any variable that can vary within a
unit, we have several options for how to calculate descriptive
statistics. The choices depend on whether the variable is continuous
or categorical/discrete.
### Continuous Variables
With multilevel data in long format, if we calculate the mean and
variance of a variable directly, the mean averages across both units
and the observations within units. The variance incorporates both
differences between units and variance within units (how much the data
points vary around each unit's own mean). That is, it is essentially
the total variance.
Conversely, if we first average observations within a unit (e.g.,
person), then the mean will be the average of the individual averages,
and the variance will only be the variability between individual
units' means. That is, we could first create a between person variable
by averaging scores for each unit and then calculate descriptives as
usual for that variable.
```{r}
dm[, c("Bstress", "Wstress") := meanDeviations(dStress), by = ID]
egltable(c("Bstress"), data = dm[!duplicated(ID)])
```
We can interpret the descriptive statistics for `Bstress` as:
"The average level of stress across days had a mean of 4.08
(SD = 1.11)."
For continuous variables, we also could calculate descriptives
on all data points directly rather than on the average by person.
Using all data points *can* unequally weight different participants
(e.g., imagine one participant contributed 100 data points and 100
other participants each contributed 1 data point; the average would be
50% from the 100 participants with 1 data point each and 50% from the
1 participant with 100 data points). The issue of weighting tends to be
less important to the extent that clusters are all about the same
size, and makes no difference if all clusters are identical (e.g.,
everyone has exactly 10 observations). If there are no systematic
differences such that, for example, participants with the highest or
lowest levels of stress tend to have more/fewer data points, then the
mean is likely to be quite similar regardless of the approach.
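A toy example (illustrative values only, not from the unit data) shows how unequal cluster sizes shift the overall mean:

```{r}
## 1 person with 100 observations of 10; 100 people with 1 observation of 0
toy <- data.frame(
  ID = c(rep(1, 100), 2:101),
  y  = c(rep(10, 100), rep(0, 100)))
mean(toy$y)                       # 5: half the weight comes from one person
mean(tapply(toy$y, toy$ID, mean)) # about 0.1: each person counted once
```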
The standard deviation from all data points is more like the total
variance, combining variance from between and within participants.
Thus, we expect the standard deviation to be at least as large as, and
most likely larger than, the standard deviation of the individual
means, since those individual means "smooth" over or remove variation
within person.
```{r}
egltable(c("dStress"), data = dm)
```
In this example we can see that while the mean changes by .05, the
standard deviation changes by quite a bit more, reflecting the added
within person variance.
A final approach that can be taken for continuous variables is to only
pick a single observation from each unit to use in calculations. This
tends to work best with longitudinal data where, for example, you
could pick either the first day or the last day and calculate
descriptives just for that one day as an example. To do this, we use
the overall variable, but we subset the dataset to only include one
day.
```{r}
egltable(c("dStress"), data = dm[SurveyDay == 1])
```
This we can interpret as any other "usual" descriptive statistics. It
is the average level of stress and standard deviation across
participants, on the first day. If there are few time points (e.g., in
a longitudinal study with just baseline, post, and follow-up), it
might make sense to simply report the means and standard deviations of
continuous variables at each time point. With a long dataset, this can
easily be done by specifying the time point as the **g**rouping
variable. **Note:** `egltable()` does not calculate correct tests,
effect sizes, or p-values when the grouping variable is a repeated
measure rather than independent groups, so if you use this approach,
ignore the tests and p-values, which assume independence.
```{r}
egltable(c("dStress"), g = "SurveyDay", data = dm)
```
If the data are already in wide format, one would simply calculate
descriptive statistics on each of the separate variables representing
each time point.
#### Summary
With multilevel, continuous variables, three approaches we
discussed for calculating descriptive statistics are:
1. average by person and report descriptives on the individual
averages;
2. report descriptives on the overall variable which captures the
total variance but possibly unequally weights participants;
3. report descriptives on individual timepoints/assessments.
### Categorical Variables
Compared to continuous variables, there are fewer options for
presenting descriptive statistics with multilevel categorical data.
For example, suppose people reported each day whether they: walked,
rode a bike, or went to the gym for exercise.
It does not make sense to average these data per person, although it
could be possible to average, for example, the proportion of days each
participant engaged in any one of those activities.
However, categorical / discrete data are typically presented as
frequencies and percentages. Averages of frequencies, especially with
skewed data, are not all that easy to interpret. For example, suppose
that people either ride a bike or walk each day; using averages would
appear to show that on average people walk half the days and ride a
bike half the days.
The general choices, then, for multilevel categorical data
descriptives are either to report descriptives for the variable
overall or to report descriptives for a specific time point (e.g.,
day 1, 2, etc.) of longitudinal data.
```{r}
## overall
egltable("Int_Fri", data = dm, strict = FALSE)
## one day
egltable("Int_Fri", data = dm[SurveyDay==1], strict = FALSE)
```
These results show the total frequency and percentage of interactions
with a friend (1 = interacted with at least one friend that day)
overall (first table) and the frequency and percentage of people
who reported interacting with a friend on day 1 (second table).
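The same two choices can be sketched in base R with simulated data (the variable names below are hypothetical, not the book's dataset):

```r
## hypothetical long data: a 0/1 friend-interaction indicator for
## 4 people over 3 days (values are made up for illustration)
dcat <- data.frame(
  ID  = rep(1:4, each = 3),
  day = rep(1:3, times = 4),
  Int = c(1, 0, 1,  1, 1, 1,  0, 0, 1,  1, 0, 0))

## overall: percentages across all person-days
prop.table(table(dcat$Int)) * 100

## one time point: percentages across people on day 1 only
prop.table(table(dcat$Int[dcat$day == 1])) * 100
```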
### Putting it All Together
Calculating and reporting descriptive statistics from multilevel data
stored in long format can require putting everything we talked about
together. In this example, we want a descriptive statistics table for
the following variables:
- sex (measured at baseline only; categorical)
- loneliness (measured at baseline only; continuous)
- selfesteem (measured at baseline only; continuous)
- dStress (measured daily; continuous; average levels wanted)
- Int_Fri (measured daily; categorical)
First, for any continuous variables where we want to report average
levels only, here `dStress`, we create a between-person version.
Then we subset the dataset to one row per person, calculate
descriptives, and store these in `desc1`. Next we calculate any
descriptives we want on the daily data, here for `Int_Fri`.
Then we give both tables the same column names using `setnames()`, and
finally we combine the two tables by binding them row-wise with
`rbind()` and make a nice table with `kable()` from the `knitr` package.
```{r, results = 'asis'}
dm[, c("Bstress", "Wstress") := meanDeviations(dStress), by = ID]
desc1 <- egltable(c("sex", "loneliness", "selfesteem", "Bstress"),
data = dm[!duplicated(ID)], strict = FALSE)
desc2 <- egltable(c("Int_Fri"), data = dm, strict = FALSE)
setnames(desc1, c("", "M (SD)/N (%)"))
setnames(desc2, c("", "M (SD)/N (%)"))
knitr::kable(rbind(desc1, desc2))
```
From these results we can see that 86.4% of the participants were
female, the average loneliness was 2.07 (possible scores 1-4), the
average self-esteem was 3.43 (possible scores 1-5), the average
of the mean stress across days was 4.08 (possible scores 1-7), and
64% of the days completed featured an interaction with a friend.
## Linear Mixed Models
:::{.callout-note collapse="true"}
Linear mixed models (LMMs) or mixed effects models are also
sometimes called "multilevel models" or
"hierarchical linear models" (HLMs). The multilevel model or
hierarchical linear model names both come from the fact that
they are used for data with multiple, hierarchical levels,
like observations nested within people, or kids nested
within classrooms. However, all three of these are synonyms
for the same statistical model, which includes both fixed and random
effects; random effects are the method used to address the
hierarchical or multilevel nature of the data.
One difference, however, is that HLMs imply a specific hierarchy,
which may not always be present in data. For example, if all siblings
from a family were recruited as well as students in the same
classroom, there would be crossed effects with any given person a
member of a higher level family unit and a higher level classroom
unit. HLMs generally are not set up for situations where there is not a
pure hierarchy. Mixed effects models address this easily, as it just
requires different random effects (one for families, one for
classrooms, for example). Thus, all HLMs are mixed models, but not all
mixed models are HLMs. Another example would be if in an experiment,
we randomly sampled difficulty levels and every participant completed
all combinations. Difficulty level could be a random effect and
participant could be a random effect, with observations nested within
both, but they do not form a clear hierarchy (i.e., difficulty level
is not above/below participant).
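In `lme4` syntax, crossed random effects just require one random-effects term per grouping factor. A hedged sketch on simulated data (the `family` / `classroom` design and all names below are hypothetical, not from the book's dataset):

```r
library(lme4)
set.seed(1)
## fully crossed design: every family paired with every classroom,
## so neither grouping factor is nested in the other
d <- expand.grid(family = factor(1:10), classroom = factor(1:10))
d$score <- rnorm(10)[d$family] + rnorm(10)[d$classroom] + rnorm(nrow(d))

## one random intercept per grouping factor; no hierarchy is implied
mcross <- lmer(score ~ 1 + (1 | family) + (1 | classroom), data = d)
VarCorr(mcross)
```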
:::
Unlike linear regression, which models all regression coefficients as
fixed effects, mixed effects regression models some coefficients as
fixed effects and others as random effects (hence the term, "mixed"
effects).
A random effect is a regression coefficient that is allowed to vary
as a random variable. Random effects provide a way of dealing with
non-independence in the data by explicitly modeling the systematic
differences between different units.
The simplest linear mixed model (LMM) is an intercept only model,
where the intercept is allowed to be a random variable.
$$ y_{ij} = b_{0j} \cdot 1 + \varepsilon_{ij} $$
Here we have an observed outcome, $y$, with observations for a
specific person (unit), $j$, at a specific timepoint / lower level
unit, $i$. In addition, note that our intercept, which in linear
regression is $b_0$, is now $b_{0j}$, indicating that the intercept is
estimated for each unit, $j$. In psychology our units
often are people, but as noted before the "unit" may be something else,
such as students within classrooms, or kids within families, etc.
As before, we assume that:
$$ y_{ij} \sim \mathcal{N}(\boldsymbol{\eta}, \sigma_{\varepsilon}) $$
By convention, we decompose random effects, here the random intercept,
as follows:
$$ b_{0j} = \gamma_{00} + u_{0j} $$
where
- $\gamma_{00}$ is the mean intercept across all people. There are
no variable subscripts under it, only numbers, as it does *not* vary
across different people.
- $u_{0j}$ is the individual unit deviations from the mean
intercept, $\gamma_{00}$.
Under this decomposition, $\gamma_{00}$ is a fixed effect, the
average intercept across all units, and is interpreted akin to what we
are used to in linear regression, $b_0$. The $u_{0j}$ are how much each
individual unit's intercept differs from the average; they are
deviations.
We also make another, new distributional assumption: we assume that
$$ u_{0j} \sim \mathcal{N}(0, \sigma_{int}) $$
In other words, we assume that individual units' deviations from the
fixed effect (average) intercept follow a normal distribution with
mean 0 and standard deviation equal to the standard deviation of the
deviations.
The actual parameters of the model that must be estimated are:
- $\gamma_{00}$, the fixed effect intercept
- $\sigma_{int}$, the standard deviation of the individual deviations
from the fixed effect intercept, the $u_{0j}$
- $\sigma_{\varepsilon}$, the residual error term
A powerful feature of LMMs is that despite allowing/modeling
individual differences in a coefficient, here the intercept, we do not
have to estimate a separate intercept parameter for each unit /
person. In fact, it takes only one more parameter than a regular
linear regression: the standard deviation of the individual
intercepts. We also make one new distributional assumption, by
assuming that the random intercepts follow a normal distribution, but
in exchange for the one extra parameter and one new assumption, we are
able to relax the assumption of linear regression that each
observation is independent.
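To make the parameter count concrete, we can simulate data from this intercept-only model; the parameter values below are arbitrary choices for illustration:

```r
set.seed(1234)
J <- 100   # number of units (people)
n <- 10    # observations per unit
gamma00   <- 4    # fixed effect intercept
sigma_int <- 1.5  # SD of the random intercept deviations, u_0j
sigma_eps <- 2    # residual SD

u0 <- rnorm(J, mean = 0, sd = sigma_int)  # one u_0j per unit
id <- rep(1:J, each = n)
y  <- gamma00 + u0[id] + rnorm(J * n, mean = 0, sd = sigma_eps)

## although there are J = 100 distinct unit intercepts, only the
## three parameters above (gamma00, sigma_int, sigma_eps) would
## need to be estimated by an LMM
c(grand_mean = mean(y), total_SD = sd(y))
```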
To fit a LMM in `R`, we use the `lmer()` function from the `lme4`
package. `lmer()` stands for **l**inear **m**ixed **e**ffects
**r**egression, and it follows a syntax very similar to `lm()`. The
main difference/addition is a new section in parentheses to indicate
which specific regression coefficients should be random effects and
what the ID variable of the unit is that should be used for the random
effects. Here we will fit an intercept only LMM on stress from the
daily data collection exercise data. We add a random intercept by ID
using the syntax: `(1 | ID)`.
:::{.callout-note collapse="true"}
Summary of output:

- Type of model, how it was fit (Restricted Maximum Likelihood, REML),
  and how statistical inference was performed (t-tests using
  Satterthwaite's approximation method to calculate approximate
  degrees of freedom).
- The formula used to define the model, echoed for reference.
- The REML criterion, which is akin to a log likelihood value. We'll
  talk about this more in a later lecture.
- Scaled residuals, which are rescaled from the raw residuals.
  Interpret these similarly to the usual residuals from linear
  regression.
- A new section: *Random Effects*. This shows the variance and
  standard deviation (just the square root of the variance) for the
  random intercept and the residual variance (our residual error term
  from linear regression).
- The total number of observations included and the number of unique
  units, here IDs, representing different people. These numbers are
  what were included in the analysis.
- The fixed effects regression coefficient table, which for LMMs is
  interpreted comparably to the regression coefficient table for a
  linear regression. One note is that the degrees of freedom, df,
  cannot readily be calculated in LMMs. Satterthwaite's approximation
  method was used here, but this approximation is not perfect, and
  because the degrees of freedom are estimated you can get decimal
  values, not just whole numbers. The decimals are not an error.
:::
The summary shows information about how the model was estimated and
its overall results.
```{r}
m <- lmer(dStress ~ 1 + (1 | ID), data = dd)
summary(m)
```
From the output, we can see that
`r nobs(m)` observations were included in the linear mixed model, and
these observations came from
`r ngrps(m)[["ID"]]` different people.
On average, these people had a stress score of
`r fixef(m)[["(Intercept)"]]`, the fixed effect intercept, and this
was significantly different from zero. By examining the random effects
standard deviations, `Std.Dev.`, we can see that the random intercept
standard deviation is pretty close to the residual standard
deviation, which is consistent with the earlier ICC we calculated for
stress: nearly half the variance in stress is between people, with
the other half occurring within people.
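The ICC can be computed directly from the two variance components in the random effects section: the intercept variance divided by the total (intercept plus residual) variance. A hedged sketch on simulated data, so it runs standalone; with the real data one would extract `VarCorr()` from the fitted model `m` instead:

```r
library(lme4)
set.seed(1234)
## simulate data with roughly equal between- and within-person variance
id <- rep(1:100, each = 10)
y  <- rnorm(100, sd = 1.5)[id] + rnorm(1000, sd = 1.5)
m2 <- lmer(y ~ 1 + (1 | id))

vc <- as.data.frame(VarCorr(m2))
## ICC = intercept variance / (intercept variance + residual variance)
icc <- vc$vcov[vc$grp == "id"] / sum(vc$vcov)
icc  ## should be near .5 given how the data were simulated
```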
:::{.callout-note collapse="true"}
In mixed models we do not technically estimate parameters for each
individual. That is, we do not directly estimate random effects like
$b_{0j}$. We estimate the average of the distribution, $\gamma_{00}$