gerkovink · danadria · Sep 15, 2022 · Sep 15, 2022 · Oct 4, 2022 · Nov 18, 2022
diff --git a/Daniel/Exercise 1/Solutions_Exercise_1.Rmd b/Daniel/Exercise 1/Solutions_Exercise_1.Rmd
@@ -0,0 +1,125 @@
+---
+title: "Markup Solutions Exercise 1"
+author: "Daniel Anadria"
+date: 15 September 2022
+output:
+  html_document:
+    css: style.css
+
+---
+<p style="text-align: center;">**Exercise 1: Git and Monte Carlo simulation with a GitHub flavor**</p>
+
+
+In this exercise, I sample 100 samples from a normal distribution with the mean $\mu = 100$ and the standard deviation $\sigma = 15$..
+For each of the samples, I calculate the following statistics for the mean:
+
+- absolute bias
+- standard error
+- lower bound of the 95% confidence interval
+- upper bound of the 95% confidence interval
+
+Finally, I plot the resulting 95% confidence intervals.
+
+```{r, echo=T, results='hide'}
+# defining population values
+population <- new.env()
+population$mean <- 100
+population$sd <- 15
+as.list(population)
+
+# 100 samples of 1000 observations from the population
+set.seed(123) # for reproducibility
+samples <- data.frame(matrix(ncol = 100, nrow = 1000))
+for (i in 1:100){
+  samples[,i] <- rnorm(n = 1000, mean = 100, sd = 15)}
+```
+
+Computing the sample means:
+
+```{r}
+samplemeans <- data.frame(matrix(ncol = 100, nrow = 1))
+for (i in 1:100){
+  samplemeans[,i] <- mean(samples[,i])}
+```
+
+Computing the absolute bias for each sample:
+
+```{r}
+absolutebias <- data.frame(matrix(ncol = 100, nrow = 1))
+for (i in 1:100){
+  absolutebias[,i] <- abs(population$mean - samplemeans[,i])}
+```
+
+Computing the standard errorfor each sample:
+
+```{r}
+# SD of every sample
+sampleSDs <- data.frame(matrix(ncol = 100, nrow = 1))
+for (i in 1:100){
+  sampleSDs[,i] <- sd(samples[,i])}
+
+# SE of every sample
+sampleSEs <- data.frame(matrix(ncol = 100, nrow = 1))
+for (i in 1:100){
+  sampleSEs[,i] <- sampleSDs[,i] / sqrt(nrow(samples))}
+
+```
+
+95% confidence interval for each sample:
+
+```{r}
+# lower bound
+CIlower <- data.frame(matrix(ncol = 100, nrow = 1))
+for (i in 1:100){
+  CIlower[,i] <- samplemeans[,i] - 1.959964 * sampleSEs[,i]}
+
+# upper bound
+CIupper <- data.frame(matrix(ncol = 100, nrow = 1))
+for (i in 1:100){
+  CIupper[,i] <- samplemeans[,i] + 1.959964 * sampleSEs[,i]}
+```
+
+Preparing the data for `ggplot2`:
+
+```{r, results='hide', message=FALSE, warning=FALSE}
+library(tidyverse)
+```
+
+```{r}
+data <- data.frame(sample_id = 1:100, lower_bound = t(CIlower), sample_mean = t(samplemeans), upper_bound = t(CIupper), row.names = 1:100)
+
+data$sample_biased <- ifelse((data$upper_bound < 100) | (data$lower_bound > 100), "Yes", "No")
+
+data_long <- data %>% pivot_longer(c(2,4), names_to = "Statistic", values_to = "Values")
+
+```
+
+Plotting the 95% confidence interval for each sample:
+
+```{r, warning = FALSE, message = FALSE, error = FALSE, results='hide', fig.keep = 'all', out.width = "100%", fig.align = 'center'}
+
+ggplot(data_long, aes(x = Values, y = sample_id)) + 
+   geom_line(aes(colour = sample_biased, group = sample_id)) +
+   scale_color_manual(values=c("#00000f", "#a10a02"), name = "Biased Sample", )+
+   geom_point(aes(sample_mean), size = 0.75, show.legend = T)+
+   geom_line(aes(x = population$mean), linetype = "solid") +
+   xlab("95% Confidence Intervals") +
+   ylab("Simulation Number")+
+   theme_classic()
+
+```
+
+<p style="text-align: center;">*“A replication of the procedure that generates a 95% confidence interval that is centered around the sample mean would cover the population value at least 95 out of 100 times” (Neyman, 1934)*</p>
+\n
+
+```{r}
+table(data$sample_biased)
+```
+In line with the Neyman's quote, in our simulated data, 6% of confidence intervals did not contain the true population mean. This is close to the expectation of 5% of the 95% confidence intervals will not contain the true population parameter. In an infinitely large sample, this proportion would equal exactly 5%. 
+\n
+Here I present a table containing all simulated samples for which the resulting confidence interval does not contain the population mean.
+\n
+```{r}
+data %>% filter(sample_biased == 'Yes')
+```
+
diff --git a/Daniel/Exercise 1/Solutions_Exercise_1.html b/Daniel/Exercise 1/Solutions_Exercise_1.html
diff --git a/Daniel/Exercise 1/Two_Pieces.txt b/Daniel/Exercise 1/Two_Pieces.txt
@@ -0,0 +1,2 @@
+I like hiking which is unfortunately difficult in the Netherlands.
+I grew up with 10 dogs in the house.
diff --git a/Daniel/Exercise 1/style.css b/Daniel/Exercise 1/style.css
@@ -0,0 +1,14 @@
+/* Whole document: */
+body{
+  font-family: Times New Roman;
+  font-size: 16pt;
+}
+/* Headers */
+h1{
+text-align: center;
+font-size: 20pt;
+font-weight: bold;
+}
+h2,h3,h4,h5,h6{
+text-align: center;
+font-size: 16pt;}
diff --git a/Daniel/Exercise 2/Replication_Solution_Exercise_2.Rmd b/Daniel/Exercise 2/Replication_Solution_Exercise_2.Rmd
@@ -0,0 +1,124 @@
+---
+title: 'Replication of Exercise 2: Latent Class Analysis'
+author: "Daniel Anadria"
+output:
+  html_document:
+    df_print: paged
+---
+
+# Introduction
+
+In this exercise, I perform latent class analysis (LCA) using the `tidySEM` package. LCA is a type of structural equation modelling analysis that can be used to discover latent subpopulations within a population sample. LCA can be seen as a data reduction technique focused on clustering similar observations together.
+
+For this exercise, I use the `data_mix_ordinal` dataset built-in to `tidySEM`. The research question is: **How many latent classes is the sample composed of?**
+
+# Preparing the Data
+
+```{r}
+library(tidySEM)
+data <- data_mix_ordinal
+data[1:4] <- lapply(data, ordered) # indicators as ordered factors
+```
+
+# Exploring the Data
+
+An important step to preceding any statistical analysis is data exploration.
+
+```{r}
+tidySEM::descriptives(data) # descriptives
+```
+
+I see that all my indicators are ordered factors, each has four levels, 5000 observations, no missing data. 
+
+# Conducting Latent Class Analysis
+
+LCA entails fitting successive models, each with one additional class compare to the previous one. In my example, I will fit 1 to 4 class solutions and compare their output in order to choose the best one.
+Depending on your computer's computational power, this might take a while.
+
+Before I fit a series of LCA models, I set a random seed using `set.seed()`. This is an important step as there is some inherent randomness in the LCA computations, and having the same seed number ensures that two separate researcher obtain exactly the same results when fitting LCA models.
+
+```{r, warning=F, message=F}
+set.seed(321) # setting seed 
+res <- mx_lca(data = data, classes = 1:4) # fitting LCA 
+```
+
+# Class enumeration
+
+Class enumeration is about comparing a sequence of LCA models fitted to the data. To do so, I use `tidySEM::table_fit()` with the results object as the input. However, the resulting model fit table contains a lot of information on each of the four models.
+Hence I focus on a subset of model fit indices and classification diagnostics.
+
+```{r}
+fit_table <- table_fit(res) # model fit table
+fit_table[ , c("Name", "LL", "AIC", "BIC", "Entropy", "prob_min", "prob_max", "n_min", "n_max")] # a selection from the model fit table
+```
+The model fit table is important for selecting the final class solution.
+My subset of fit indices and classification diagnostics includes:
+
+- **Name** indicates the $K$-class solution.
+In this case, I chose to fit 1 - 4 class solutions.
+- **LL** is the -2*log-likelihood of each model.
+- **AIC** and **BIC** are the Akaike 
+and the Bayesian Information Criterion, respectively.
+- **prob_min** and **prob_max** are 
+the lowest and the highest posterior class probability
+by most likely class membership.
+- **n_min** and **n_max** are the lowest and the highest
+class proportion based on the posterior class probabilities.
+
+There are several possible strategies to select the final class solution.
+Here, I apply the following strategy: 
+
+First, I examine the `-2*log likelihood` which falls successively with each added class.
+
+Then, I look at the `BIC` value which is fairly close for the one, two, and three-class solutions, but high for the four-class solution.
+
+Classification diagnostics should not be used for model selection, but they can be used to disqualify certain solutions because they are uninterpretable. I see that `prob_min` for the four-class solution is low, therefore I disqualify this solution.
+
+Out of the remaining three solutions, I notice that `entropy` is the highest for the three-class solution,
+and it has a satisfactory `prob_min` and `n_min`. Based on this, I can retain the three class solution in model selection.
+Entropy for the one-class solution will always equal to one, as it is 100% true that every case is in that class.
+Based on the low entropy of the two-class solution, I eliminate this model.
+
+Finally, when comparing the one and three-class solutions, I consult the `BIC` which tells me that the added complexity
+of having three classes still explains the data better than a one-class solution.
+Therefore, I select the three-class solution as my final-class solution.
+
+# Interpreting Final Class Solution
+
+To aid my understanding of the final class solution, I use plot it. 
+The resulting graph shows response patterns on all the indicators for each group.
+
+```{r}
+library(ggplot2)
+plot_prob(res[[3]]) # plotting the response patterns for the final class solution
+```
+The plot shows the distributions of the response probabilities on the indicators
+for each of the three classes.
+For instance, I see that in Class 1 the most common response to u2 is 2,
+while in Class 2 and Class 3 this is 0.
+I can also see that response 1 is a rare response not forming the majority in 
+any class.
+Class 2 distinguishes itself because the majority scores the response 0 category of u3 and u4,
+while in Class 1 and 2 this is not the case.
+Class 3 distinguishes itself because the most common response to u3 and u4 is 3.
+
+
+# Extracting posterior class probabilities
+
+Another step is to extract posterior class probabilities.
+
+```{r}
+probs <- class_prob(res[[3]]) # extracting the posterior class probabilities of the final class solution
+probs$mostlikely.class # posterior class probabilities by most likely class membership
+```
+
+# Conclusion
+
+I conclude that the three class solution is the best fit the present data.
+
+End.
+
+```{r}
+sessionInfo()
+```
+
diff --git a/Daniel/Exercise 2/Replication_Solution_Exercise_2.html b/Daniel/Exercise 2/Replication_Solution_Exercise_2.html
diff --git a/Daniel/Exercise 2/Solution_Exercise_2.Rmd b/Daniel/Exercise 2/Solution_Exercise_2.Rmd
@@ -0,0 +1,124 @@
+---
+title: 'Exercise 2: Latent Class Analysis'
+author: "Daniel Anadria"
+output:
+  html_document:
+    df_print: paged
+---
+
+# Introduction
+
+In this exercise, I perform latent class analysis (LCA) using the `tidySEM` package. LCA is a type of structural equation modelling analysis that can be used to discover latent subpopulations within a population sample. LCA can be seen as a data reduction technique focused on clustering similar observations together.
+
+For this exercise, I use the `data_mix_ordinal` dataset built-in to `tidySEM`. The research question is: **How many latent classes is the sample composed of?**
+
+# Preparing the Data
+
+```{r, warning=F, message=F}
+library(tidySEM)
+data <- data_mix_ordinal
+data[1:4] <- lapply(data, ordered) # indicators as ordered factors
+```
+
+# Exploring the Data
+
+An important step to preceding any statistical analysis is data exploration.
+
+```{r}
+tidySEM::descriptives(data) # descriptives
+```
+
+I see that all my indicators are ordered factors, each has four levels, 5000 observations, no missing data. 
+
+# Conducting Latent Class Analysis
+
+LCA entails fitting successive models, each with one additional class compare to the previous one. In my example, I will fit 1 to 4 class solutions and compare their output in order to choose the best one.
+Depending on your computer's computational power, this might take a while.
+
+Before I fit a series of LCA models, I set a random seed using `set.seed()`. This is an important step as there is some inherent randomness in the LCA computations, and having the same seed number ensures that two separate researcher obtain exactly the same results when fitting LCA models.
+
+```{r, message=F, warning=F}
+set.seed(123) # setting seed 
+res <- mx_lca(data = data, classes = 1:4) # fitting LCA 
+```
+
+# Class enumeration
+
+Class enumeration is about comparing a sequence of LCA models fitted to the data. To do so, I use `tidySEM::table_fit()` with the results object as the input. However, the resulting model fit table contains a lot of information on each of the four models.
+Hence I focus on a subset of model fit indices and classification diagnostics.
+
+```{r}
+fit_table <- table_fit(res) # model fit table
+fit_table[ , c("Name", "LL", "AIC", "BIC", "Entropy", "prob_min", "prob_max", "n_min", "n_max")] # a selection from the model fit table
+```
+The model fit table is important for selecting the final class solution.
+My subset of fit indices and classification diagnostics includes:
+
+- **Name** indicates the $K$-class solution.
+In this case, I chose to fit 1 - 4 class solutions.
+- **LL** is the -2*log-likelihood of each model.
+- **AIC** and **BIC** are the Akaike 
+and the Bayesian Information Criterion, respectively.
+- **prob_min** and **prob_max** are 
+the lowest and the highest posterior class probability
+by most likely class membership.
+- **n_min** and **n_max** are the lowest and the highest
+class proportion based on the posterior class probabilities.
+
+There are several possible strategies to select the final class solution.
+Here, I apply the following strategy: 
+
+First, I examine the `-2*log likelihood` which falls successively with each added class.
+
+Then, I look at the `BIC` value which is fairly close for the one, two, and three-class solutions, but high for the four-class solution.
+
+Classification diagnostics should not be used for model selection, but they can be used to disqualify certain solutions because they are uninterpretable. I see that `prob_min` for the four-class solution is low, therefore I disqualify this solution.
+
+Out of the remaining three solutions, I notice that `entropy` is the highest for the three-class solution,
+and it has a satisfactory `prob_min` and `n_min`. Based on this, I can retain the three class solution in model selection.
+Entropy for the one-class solution will always equal to one, as it is 100% true that every case is in that class.
+Based on the low entropy of the two-class solution, I eliminate this model.
+
+Finally, when comparing the one and three-class solutions, I consult the `BIC` which tells me that the added complexity
+of having three classes still explains the data better than a one-class solution.
+Therefore, I select the three-class solution as my final-class solution.
+
+# Interpreting Final Class Solution
+
+To aid my understanding of the final class solution, I use plot it. 
+The resulting graph shows response patterns on all the indicators for each group.
+
+```{r}
+library(ggplot2)
+plot_prob(res[[3]]) # plotting the response patterns for the final class solution
+```
+The plot shows the distributions of the response probabilities on the indicators
+for each of the three classes.
+For instance, I see that in Class 1 the most common response to u2 is 2,
+while in Class 2 and Class 3 this is 0.
+I can also see that response 1 is a rare response not forming the majority in 
+any class.
+Class 2 distinguishes itself because the majority scores the response 0 category of u3 and u4,
+while in Class 1 and 2 this is not the case.
+Class 3 distinguishes itself because the most common response to u3 and u4 is 3.
+
+
+# Extracting posterior class probabilities
+
+Another step is to extract posterior class probabilities.
+
+```{r}
+probs <- class_prob(res[[3]]) # extracting the posterior class probabilities of the final class solution
+probs$mostlikely.class # posterior class probabilities by most likely class membership
+```
+
+# Conclusion
+
+I conclude that the three class solution is the best fit the present data.
+
+End.
+
+```{r}
+sessionInfo()
+```
+
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,2 @@
		I like hiking which is unfortunately difficult in the Netherlands.
		I grew up with 10 dogs in the house.