Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
125 changes: 125 additions & 0 deletions Daniel/Exercise 1/Solutions_Exercise_1.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
---
title: "Markup Solutions Exercise 1"
author: "Daniel Anadria"
date: 15 September 2022
output:
html_document:
css: style.css

---
<p style="text-align: center;">**Exercise 1: Git and Monte Carlo simulation with a GitHub flavor**</p>


In this exercise, I sample 100 samples from a normal distribution with the mean $\mu = 100$ and the standard deviation $\sigma = 15$..
For each of the samples, I calculate the following statistics for the mean:

- absolute bias
- standard error
- lower bound of the 95% confidence interval
- upper bound of the 95% confidence interval

Finally, I plot the resulting 95% confidence intervals.

```{r, echo=T, results='hide'}
# defining population values
population <- new.env()
population$mean <- 100
population$sd <- 15
as.list(population)

# 100 samples of 1000 observations from the population
set.seed(123) # for reproducibility
samples <- data.frame(matrix(ncol = 100, nrow = 1000))
for (i in 1:100){
samples[,i] <- rnorm(n = 1000, mean = 100, sd = 15)}
```

Computing the sample means:

```{r}
samplemeans <- data.frame(matrix(ncol = 100, nrow = 1))
for (i in 1:100){
samplemeans[,i] <- mean(samples[,i])}
```

Computing the absolute bias for each sample:

```{r}
absolutebias <- data.frame(matrix(ncol = 100, nrow = 1))
for (i in 1:100){
absolutebias[,i] <- abs(population$mean - samplemeans[,i])}
```

Computing the standard errorfor each sample:

```{r}
# SD of every sample
sampleSDs <- data.frame(matrix(ncol = 100, nrow = 1))
for (i in 1:100){
sampleSDs[,i] <- sd(samples[,i])}

# SE of every sample
sampleSEs <- data.frame(matrix(ncol = 100, nrow = 1))
for (i in 1:100){
sampleSEs[,i] <- sampleSDs[,i] / sqrt(nrow(samples))}

```

95% confidence interval for each sample:

```{r}
# lower bound
CIlower <- data.frame(matrix(ncol = 100, nrow = 1))
for (i in 1:100){
CIlower[,i] <- samplemeans[,i] - 1.959964 * sampleSEs[,i]}

# upper bound
CIupper <- data.frame(matrix(ncol = 100, nrow = 1))
for (i in 1:100){
CIupper[,i] <- samplemeans[,i] + 1.959964 * sampleSEs[,i]}
```

Preparing the data for `ggplot2`:

```{r, results='hide', message=FALSE, warning=FALSE}
library(tidyverse)
```

```{r}
data <- data.frame(sample_id = 1:100, lower_bound = t(CIlower), sample_mean = t(samplemeans), upper_bound = t(CIupper), row.names = 1:100)

data$sample_biased <- ifelse((data$upper_bound < 100) | (data$lower_bound > 100), "Yes", "No")

data_long <- data %>% pivot_longer(c(2,4), names_to = "Statistic", values_to = "Values")

```

Plotting the 95% confidence interval for each sample:

```{r, warning = FALSE, message = FALSE, error = FALSE, results='hide', fig.keep = 'all', out.width = "100%", fig.align = 'center'}

ggplot(data_long, aes(x = Values, y = sample_id)) +
geom_line(aes(colour = sample_biased, group = sample_id)) +
scale_color_manual(values=c("#00000f", "#a10a02"), name = "Biased Sample", )+
geom_point(aes(sample_mean), size = 0.75, show.legend = T)+
geom_line(aes(x = population$mean), linetype = "solid") +
xlab("95% Confidence Intervals") +
ylab("Simulation Number")+
theme_classic()

```

<p style="text-align: center;">*“A replication of the procedure that generates a 95% confidence interval that is centered around the sample mean would cover the population value at least 95 out of 100 times” (Neyman, 1934)*</p>
\n

```{r}
table(data$sample_biased)
```
In line with the Neyman's quote, in our simulated data, 6% of confidence intervals did not contain the true population mean. This is close to the expectation of 5% of the 95% confidence intervals will not contain the true population parameter. In an infinitely large sample, this proportion would equal exactly 5%.
\n
Here I present a table containing all simulated samples for which the resulting confidence interval does not contain the population mean.
\n
```{r}
data %>% filter(sample_biased == 'Yes')
```

517 changes: 517 additions & 0 deletions Daniel/Exercise 1/Solutions_Exercise_1.html

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions Daniel/Exercise 1/Two_Pieces.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
I like hiking which is unfortunately difficult in the Netherlands.
I grew up with 10 dogs in the house.
14 changes: 14 additions & 0 deletions Daniel/Exercise 1/style.css
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
/* Whole document: */
body{
font-family: Times New Roman;
font-size: 16pt;
}
/* Headers */
h1{
text-align: center;
font-size: 20pt;
font-weight: bold;
}
h2,h3,h4,h5,h6{
text-align: center;
font-size: 16pt;}
124 changes: 124 additions & 0 deletions Daniel/Exercise 2/Replication_Solution_Exercise_2.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
---
title: 'Replication of Exercise 2: Latent Class Analysis'
author: "Daniel Anadria"
output:
html_document:
df_print: paged
---

# Introduction

In this exercise, I perform latent class analysis (LCA) using the `tidySEM` package. LCA is a type of structural equation modelling analysis that can be used to discover latent subpopulations within a population sample. LCA can be seen as a data reduction technique focused on clustering similar observations together.

For this exercise, I use the `data_mix_ordinal` dataset built-in to `tidySEM`. The research question is: **How many latent classes is the sample composed of?**

# Preparing the Data

```{r}
library(tidySEM)
data <- data_mix_ordinal
data[1:4] <- lapply(data, ordered) # indicators as ordered factors
```

# Exploring the Data

An important step to preceding any statistical analysis is data exploration.

```{r}
tidySEM::descriptives(data) # descriptives
```

I see that all my indicators are ordered factors, each has four levels, 5000 observations, no missing data.

# Conducting Latent Class Analysis

LCA entails fitting successive models, each with one additional class compare to the previous one. In my example, I will fit 1 to 4 class solutions and compare their output in order to choose the best one.
Depending on your computer's computational power, this might take a while.

Before I fit a series of LCA models, I set a random seed using `set.seed()`. This is an important step as there is some inherent randomness in the LCA computations, and having the same seed number ensures that two separate researcher obtain exactly the same results when fitting LCA models.

```{r, warning=F, message=F}
set.seed(321) # setting seed
res <- mx_lca(data = data, classes = 1:4) # fitting LCA
```

# Class enumeration

Class enumeration is about comparing a sequence of LCA models fitted to the data. To do so, I use `tidySEM::table_fit()` with the results object as the input. However, the resulting model fit table contains a lot of information on each of the four models.
Hence I focus on a subset of model fit indices and classification diagnostics.

```{r}
fit_table <- table_fit(res) # model fit table
fit_table[ , c("Name", "LL", "AIC", "BIC", "Entropy", "prob_min", "prob_max", "n_min", "n_max")] # a selection from the model fit table
```
The model fit table is important for selecting the final class solution.
My subset of fit indices and classification diagnostics includes:

- **Name** indicates the $K$-class solution.
In this case, I chose to fit 1 - 4 class solutions.
- **LL** is the -2*log-likelihood of each model.
- **AIC** and **BIC** are the Akaike
and the Bayesian Information Criterion, respectively.
- **prob_min** and **prob_max** are
the lowest and the highest posterior class probability
by most likely class membership.
- **n_min** and **n_max** are the lowest and the highest
class proportion based on the posterior class probabilities.

There are several possible strategies to select the final class solution.
Here, I apply the following strategy:

First, I examine the `-2*log likelihood` which falls successively with each added class.

Then, I look at the `BIC` value which is fairly close for the one, two, and three-class solutions, but high for the four-class solution.

Classification diagnostics should not be used for model selection, but they can be used to disqualify certain solutions because they are uninterpretable. I see that `prob_min` for the four-class solution is low, therefore I disqualify this solution.

Out of the remaining three solutions, I notice that `entropy` is the highest for the three-class solution,
and it has a satisfactory `prob_min` and `n_min`. Based on this, I can retain the three class solution in model selection.
Entropy for the one-class solution will always equal to one, as it is 100% true that every case is in that class.
Based on the low entropy of the two-class solution, I eliminate this model.

Finally, when comparing the one and three-class solutions, I consult the `BIC` which tells me that the added complexity
of having three classes still explains the data better than a one-class solution.
Therefore, I select the three-class solution as my final-class solution.

# Interpreting Final Class Solution

To aid my understanding of the final class solution, I use plot it.
The resulting graph shows response patterns on all the indicators for each group.

```{r}
library(ggplot2)
plot_prob(res[[3]]) # plotting the response patterns for the final class solution
```
The plot shows the distributions of the response probabilities on the indicators
for each of the three classes.
For instance, I see that in Class 1 the most common response to u2 is 2,
while in Class 2 and Class 3 this is 0.
I can also see that response 1 is a rare response not forming the majority in
any class.
Class 2 distinguishes itself because the majority scores the response 0 category of u3 and u4,
while in Class 1 and 2 this is not the case.
Class 3 distinguishes itself because the most common response to u3 and u4 is 3.


# Extracting posterior class probabilities

Another step is to extract posterior class probabilities.

```{r}
probs <- class_prob(res[[3]]) # extracting the posterior class probabilities of the final class solution
probs$mostlikely.class # posterior class probabilities by most likely class membership
```

# Conclusion

I conclude that the three class solution is the best fit the present data.

End.

```{r}
sessionInfo()
```

1,878 changes: 1,878 additions & 0 deletions Daniel/Exercise 2/Replication_Solution_Exercise_2.html

Large diffs are not rendered by default.

124 changes: 124 additions & 0 deletions Daniel/Exercise 2/Solution_Exercise_2.Rmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
---
title: 'Exercise 2: Latent Class Analysis'
author: "Daniel Anadria"
output:
html_document:
df_print: paged
---

# Introduction

In this exercise, I perform latent class analysis (LCA) using the `tidySEM` package. LCA is a type of structural equation modelling analysis that can be used to discover latent subpopulations within a population sample. LCA can be seen as a data reduction technique focused on clustering similar observations together.

For this exercise, I use the `data_mix_ordinal` dataset built-in to `tidySEM`. The research question is: **How many latent classes is the sample composed of?**

# Preparing the Data

```{r, warning=F, message=F}
library(tidySEM)
data <- data_mix_ordinal
data[1:4] <- lapply(data, ordered) # indicators as ordered factors
```

# Exploring the Data

An important step to preceding any statistical analysis is data exploration.

```{r}
tidySEM::descriptives(data) # descriptives
```

I see that all my indicators are ordered factors, each has four levels, 5000 observations, no missing data.

# Conducting Latent Class Analysis

LCA entails fitting successive models, each with one additional class compare to the previous one. In my example, I will fit 1 to 4 class solutions and compare their output in order to choose the best one.
Depending on your computer's computational power, this might take a while.

Before I fit a series of LCA models, I set a random seed using `set.seed()`. This is an important step as there is some inherent randomness in the LCA computations, and having the same seed number ensures that two separate researcher obtain exactly the same results when fitting LCA models.

```{r, message=F, warning=F}
set.seed(123) # setting seed
res <- mx_lca(data = data, classes = 1:4) # fitting LCA
```

# Class enumeration

Class enumeration is about comparing a sequence of LCA models fitted to the data. To do so, I use `tidySEM::table_fit()` with the results object as the input. However, the resulting model fit table contains a lot of information on each of the four models.
Hence I focus on a subset of model fit indices and classification diagnostics.

```{r}
fit_table <- table_fit(res) # model fit table
fit_table[ , c("Name", "LL", "AIC", "BIC", "Entropy", "prob_min", "prob_max", "n_min", "n_max")] # a selection from the model fit table
```
The model fit table is important for selecting the final class solution.
My subset of fit indices and classification diagnostics includes:

- **Name** indicates the $K$-class solution.
In this case, I chose to fit 1 - 4 class solutions.
- **LL** is the -2*log-likelihood of each model.
- **AIC** and **BIC** are the Akaike
and the Bayesian Information Criterion, respectively.
- **prob_min** and **prob_max** are
the lowest and the highest posterior class probability
by most likely class membership.
- **n_min** and **n_max** are the lowest and the highest
class proportion based on the posterior class probabilities.

There are several possible strategies to select the final class solution.
Here, I apply the following strategy:

First, I examine the `-2*log likelihood` which falls successively with each added class.

Then, I look at the `BIC` value which is fairly close for the one, two, and three-class solutions, but high for the four-class solution.

Classification diagnostics should not be used for model selection, but they can be used to disqualify certain solutions because they are uninterpretable. I see that `prob_min` for the four-class solution is low, therefore I disqualify this solution.

Out of the remaining three solutions, I notice that `entropy` is the highest for the three-class solution,
and it has a satisfactory `prob_min` and `n_min`. Based on this, I can retain the three class solution in model selection.
Entropy for the one-class solution will always equal to one, as it is 100% true that every case is in that class.
Based on the low entropy of the two-class solution, I eliminate this model.

Finally, when comparing the one and three-class solutions, I consult the `BIC` which tells me that the added complexity
of having three classes still explains the data better than a one-class solution.
Therefore, I select the three-class solution as my final-class solution.

# Interpreting Final Class Solution

To aid my understanding of the final class solution, I use plot it.
The resulting graph shows response patterns on all the indicators for each group.

```{r}
library(ggplot2)
plot_prob(res[[3]]) # plotting the response patterns for the final class solution
```
The plot shows the distributions of the response probabilities on the indicators
for each of the three classes.
For instance, I see that in Class 1 the most common response to u2 is 2,
while in Class 2 and Class 3 this is 0.
I can also see that response 1 is a rare response not forming the majority in
any class.
Class 2 distinguishes itself because the majority scores the response 0 category of u3 and u4,
while in Class 1 and 2 this is not the case.
Class 3 distinguishes itself because the most common response to u3 and u4 is 3.


# Extracting posterior class probabilities

Another step is to extract posterior class probabilities.

```{r}
probs <- class_prob(res[[3]]) # extracting the posterior class probabilities of the final class solution
probs$mostlikely.class # posterior class probabilities by most likely class membership
```

# Conclusion

I conclude that the three class solution is the best fit the present data.

End.

```{r}
sessionInfo()
```

Loading