CohortLabeleR/README.Rmd at main · StringhiniLab/CohortLabeleR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "70%",
  fig.align = "center"
)
data(df)
data(dict_df)
data(df_missing)
data(dict_df_missing)
```

# CohortLabeleR <img src="logo.png" align="right" height="150"/>

THIS PACKAGE IS UNDER DEVELOPMENT

<!-- badges: start -->
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
<!-- badges: end -->

CohortLabeleR provides functions to easily add STATA-style variable and cell labels to entire data frames in R. It also includes tools for quick variable recoding, making it easier to work with categorical numeric variables commonly found in longitudinal cohort studies.

## Introduction

Longitudinal or cohort studies are used to understand human health and how social and environmental factors influence it. These studies follow individuals over time across different cohorts, collecting data of interest from surveys and/or biological or environmental measurements.

In terms of databases, these studies are usually reported with one file per cohort, accompanied by a dictionary that helps interpret the variables measured in each case. Although many variables are included (sometimes more than 200 per cohort), these datasets rarely exceed one gigabyte. The variables are often a mix of survey responses coded as integers in categorical variables and numeric values from biological measurements.

Performing a thorough exploratory data analysis (EDA) of these datasets can be challenging due to the large number of variables. Therefore, using programming languages like R, which can automatize data analysis easily, opens up additional possibilities for EDA. **CohortLabeleR** is an R package with three functions that simplifies processing of categorical variables coded as numbers and also enables data transfer to STATA without loss of information.

The first function, `recode_vars()`, allows you to replace numerical codes in categorical variables with text throughout the dataset. This not only enables data interpretation without repeatedly consulting the dictionary for each variable but also improves the interpretability of plots. This is particularly useful if you want to create automatically plots for each variable in the dataset for a first analysis of the data.

The `replace_missing_with_na()` function replaces missing values with `NA`. This facilitates the visualization of missing values, that are usually coded in more than one level in the categorical numerical variables.

Finally, `label_df_for_STATA()` adds labels to variable names and numeric values in categorical variables in a format that STATA can open directly, preserving information in the case of switching to STATA.

## Installation

You can install the development version of CohortLabeleR from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("StringhiniLab/CohortLabeleR")
```

## Use

This package consists of three functions.

### 1. `recode_vars()`

Many datasets, such as CLSA, use numeric codes for categorical variables. This can make it harder to label axes when you want to produce many plots or to interpret the data at a glance.
This function replaces all numeric category values in the data frame with their corresponding labels, making the data easier to read and visualize.
For example, if we want to plot the last three variables of this dataset:
```{r example, warning=FALSE, message=FALSE}
library(CohortLabeleR)
library(dplyr)
library(ggplot2)
library(purrr)
head(df)
```

```{r}

# Function for plotting
barplot <- function(var, data = df){
data |>
count(!!sym(var)) |>
ggplot(aes(x = !!sym(var), y = n)) +
  geom_col() +
  coord_flip() +
  theme_minimal()
}

# I create plots for three variables
purrr::map(colnames(df)[4:6], barplot)
```
The category levels of each variable are not possible to recognize. To fix this, you can use `recode_vars` to replace the labels directly, using the dictionary as a reference.

```{r}
# dictionary
head(dict_df)

# recoding the variables
df_recoded <- recode_vars(data = df,
            # specify here other numeric columns you would like to ignore
            ignore_columns = c(id, age),
            dictionary = dict_df,
            # dictionary column with the variable names
            var_colname = variable, # dictionary column with the variables
            # dictionary column with the category numbers
            num_colname = level_num,
            # dictionary column with the category labels
            str_colname = level_str)
head(df_recoded)
```

```{r}
purrr::map(data = df_recoded, colnames(df_recoded)[4:6], barplot)
```

### 2. `replace_missing_with_na()`

This function reads the dictionary and detects observations that are coded as missing. These missing values are represented numerically, with values like -99, which are clearly not valid results.

Databases such as CLSA, encode these missing values  distinguishing the reasons for the missing data and assigning different numeric codes to each case.
It is not the same to say that the person preferred not to answer the survey as it is to say that the survey was never conducted.
This means that there are many numeric codes that can represent a missing value.

The goal of this function is to convert all of the missing values of the categorical variables coded as numbers of a data frame to `NA`. This makes it much easier to detect the total amount of missing data per variable, regardless of the origin of the missing values.

```{r}
# There are missing values in blood_pressure_category and cholesterol_level
df_missing
```
```{r}
df_missing_NA <- replace_missing_with_na(dataset = df_missing,
                         ignore_columns = c(id, age),
                         dictionary = dict_df_missing,
                         var_colname = variable,
                         num_colname = level_num,
                         missing_colname = missing)

df_missing_NA
```

Now, let's estimate the percentage of data that is missing.

```{r}
# 2 of 10 observations are missing for cholesterol_level (20%)
# 1 of 10 observations are missing for blood_pressure_category (10%)
colMeans(is.na.data.frame(df_missing_NA))*100
```

### 3. `label_df_for_STATA()`

To open a database in a format compatible with STATA, it is advisable that:

1. **The variables have labels**.
    This allows information that is typically not encoded in the variable name, such as units or more detailed human readable text, to be associated with each data column.

2. **The numeric values of categorical variables have labels**.
    For variables reported as categorical and coded as numbers, it is possible to assign labels. This way, it is easy to visualize what each category level represents in STATA.

The function `label_df_for_STATA()` reads the data frame and the provided dictionary, and adds a variable label to all column names, as well as a label to all numeric variables reported as categorical. It is possible to exclude those numeric variables without levels using the `ignore_columns` argument.

In this way, the data frame can be saved as a `.dta` file and easily opened in STATA with all the important information needed for the analysis.


```{r}
df_stata <- label_df_for_STATA(dataset = df,
                   dictionary = dict_df_var_label,
                   ignore_columns = c("id", "age"),
                   var_label = var_label , # variable label
                   var_colname = variable, # variable colname
                   num_colname = level_num, # code number
                   str_colname = level_str) # code label

# You can recognize the variable labels (as attributes) and code labels (as labels) when checking the data frame structure
str(df_stata)
```

Finally, you can save the file using the `haven` package:

```{r, eval = FALSE}
# save the file in STATA format
haven::write_dta(df_stata, "data/2020-11-04_df_stata.dta")
```