---
title: "A1_P1_Student"
author: "Simon Hansen"
date: "7/9/2017"
output: html_document
---
# Assignment 1, Part 1: Language development in Autism Spectrum Disorder (ASD) - Brushing up your code skills
In this first part of the assignment we will brush up your programming skills and familiarize you with the data sets you will be analysing in the later parts. You will:
1) Create a Github account and link it to your RStudio
2) Use small nifty lines of code to transform several data sets into just one. The final data set will contain only the variables that are needed for the analysis in the next parts of the assignment
3) Become familiar with the tidyverse package, which you will find handy for later assignments.
## 0. First, an introduction to the data
# Language development in Autism Spectrum Disorder (ASD)
Background: Autism Spectrum Disorder is often related to language impairment. However, this phenomenon has not been empirically traced in detail: i) relying on actual naturalistic language production, ii) over extended periods of time. We therefore videotaped circa 30 kids with ASD and circa 30 comparison kids (matched by linguistic performance at visit 1) for ca. 30 minutes of naturalistic interactions with a parent. We repeated the data collection 6 times per kid, with 4 months between each visit. We transcribed the data and counted:
i) the number of words that each kid uses in each video. Same for the parent.
ii) the number of unique words that each kid uses in each video. Same for the parent.
iii) the number of morphemes per utterance (Mean Length of Utterance, MLU) displayed by each child in each video. Same for the parent.
## 1. Let's get started on GitHub
Follow the link to a Github tutorial:
https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN
In the assignments you will be asked to upload your code on Github, and the GitHub repositories will be part of the portfolio; therefore all students must make an account and link it to their RStudio (you'll thank us later for this!).
N.B. Create a GitHub repository for the Language Development in ASD set of assignments and link it to a project on your RStudio (including a working directory where you will save all your data and code for these assignments)
## 2. Now let's take dirty dirty data sets and make them into a tidy one
Set the working directory (the directory with your data and code for these assignments):
```{r}
# To set working directory (edit this path to match your own machine)
setwd("C:/Users/simon/Google Drev/Uni/Methods3/Assignment_1")
# To load packages
library(plyr)
library(stringr)
library(tidyverse)
```
Load the three data sets, after downloading them from dropbox and saving them in your working directory:
* Demographic data for the participants: https://www.dropbox.com/s/w15pou9wstgc8fe/demo_train.csv?dl=0
* Length of utterance data: https://www.dropbox.com/s/usyauqm37a76of6/LU_train.csv?dl=0
* Word data: https://www.dropbox.com/s/8ng1civpl2aux58/token_train.csv?dl=0
```{r}
#To load data
demo_data=read.csv("demo_train.csv")
lu_data=read.csv("LU_train.csv")
token_data=read.csv("token_train.csv")
```
Explore the 3 datasets (e.g. visualize them, summarize them, etc.). You will see that the data is messy, since the psychologists collected the demographic data, a linguist analyzed the length of utterance in May 2014 and the same linguist analyzed the words several months later. In particular:
- the same variables might have different names (e.g. identifier of the child)
- the same variables might report the values in different ways (e.g. visit)
Welcome to the real world of messy data :-)
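As a toy illustration of such exploration (the frames and values below are made up, not the real course data), comparing `names()` across data frames is a quick way to spot naming mismatches before merging:

```r
# Toy data frames mimicking the naming mismatch (not the real files):
demo_toy <- data.frame(Child.ID = c("Adam", "Eva"), Visit = c(1, 1))
lu_toy   <- data.frame(SUBJ = c("Adam.", "Eva."), VISIT = c("visit1.", "visit1."))

names(demo_toy)   # the child identifier is called "Child.ID" here
names(lu_toy)     # ... but "SUBJ" here

# intersect() reveals which column names are already shared:
shared <- intersect(names(demo_toy), names(lu_toy))
shared            # character(0): nothing matches yet, so merging would fail
```

Only once the shared-name set contains the identifier and visit columns will `merge()` know how to line the frames up.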
Before being able to combine the data sets we need to make sure the relevant variables have the same names and the same kind of values.
So:
2a. Find a way to transform variable names.
Tip: Look into the package data.table, or google "how to rename variables in R"
```{r}
#To rename variable names
demo_data=plyr::rename(demo_data, c("Child.ID"="ID"))
lu_data=plyr::rename(lu_data, c("SUBJ" = "ID"))
token_data=plyr::rename(token_data, c("SUBJ" = "ID"))
demo_data=plyr::rename(demo_data, c("Visit" = "VISIT" ))
```
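If you prefer not to depend on plyr, the same rename can be done in base R by assigning into `names()`. A minimal sketch on a made-up toy frame:

```r
# Toy frame (not the real LU data)
lu_toy <- data.frame(SUBJ = c("Adam.", "Eva."), VISIT = c("visit1.", "visit2."))

# Rename the SUBJ column to ID by assigning into the names vector
names(lu_toy)[names(lu_toy) == "SUBJ"] <- "ID"

names(lu_toy)  # "ID" "VISIT"
```

This avoids any package dependency, at the cost of being a little more verbose than `plyr::rename()`.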
2b. Find a way to homogenize the way "visit" is reported. If you look into the original data sets, you will see that in the LU data and the Token data, visits are called "visit 1" instead of just 1 (which is the case in the demographic data set).
Tip: There is a package called stringr, which will be very handy for you in future assignments too. We will return to this package later, but for now use str_extract() to extract only the number from the variable Visit in each data set. Tip: type ?str_extract after loading the library for examples of how to use it.
```{r}
#To change the values of visit
lu_data$VISIT=str_extract(lu_data$VISIT, "\\d") # Keeps only the first digit (the visit number)
token_data$VISIT=str_extract(token_data$VISIT, "\\d")
```
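Note that `str_extract()` returns a character vector, so the result still needs `as.numeric()` if you want actual numbers. A base-R sketch of the same idea (toy values, not the real data), using `gsub()` to strip everything that is not a digit:

```r
visits <- c("visit1.", "visit2.", "visit6.")   # toy values

digits <- gsub("\\D", "", visits)  # drop all non-digit characters
digits                             # "1" "2" "6" -> still character strings

as.numeric(digits)                 # 1 2 6 -> now usable as numbers
```

Either approach works for single-digit visit numbers; the character-vs-numeric distinction matters later when you model visit as a continuous predictor.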
2c. We also need to make a small adjustment to the content of the Child.ID column in the demographic data. Within this column, names that are not abbreviations do not end with "." (e.g. Adam), whereas they do in the other two data sets (e.g. Adam.). If the content of the two variables isn't identical, the data sets will not merge correctly.
We wish to remove the "." at the end of names in the LU data and the tokens data.
For this, sapply() (a member of the apply() family) can be used, although a vectorized string function is simpler.
Tip: Take a look into the gsub() function.
Tip: A possible solution has one line of code for each child name that is to be changed. Another combines mutate() and recode()
Tip: You will have to do identical work for both data sets, so to save time on the copy/paste use your editor's find-and-replace (cmd+F/ctrl+F). Add the data frame name (e.g. LU_data) in the first box, add the data frame name you wish to change it to (e.g. Tokens_data) in the other box, and press replace.
```{r}
#To remove punctuation to streamline the string
lu_data$ID=gsub('[[:punct:] ]+','',lu_data$ID) # Removes all punctuation and spaces (including the trailing ".")
demo_data$ID=gsub('[[:punct:] ]+','',demo_data$ID)
token_data$ID=gsub('[[:punct:] ]+','',token_data$ID)
```
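A narrower alternative, closer to the stated goal of removing only a trailing ".", anchors the pattern to the end of the string so any other punctuation is left untouched. A toy sketch (made-up IDs):

```r
ids <- c("Adam.", "Eva.", "AD1")   # toy IDs, not the real data

# sub() with "\\.$" matches a literal dot only at the end of the string
sub("\\.$", "", ids)               # "Adam" "Eva" "AD1"
```

For this data set both approaches give the same result, since the IDs contain no other punctuation; the anchored version is simply the safer habit.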
2d. Now that the nitty gritty details of the different data sets are fixed, we want to make a subset of each data set containing only the variables that we wish to use in the final data set.
For this we use the tidyverse package, which contains the function select().
The variables we need are: Child.ID, Visit, Ethnicity, Diagnosis, Gender, Age, ADOS, MullenRaw, ExpressiveLangRaw, MOT_MLU, MOT_LUstd, CHI_MLU, CHI_LUstd, types_MOT, types_CHI, tokens_MOT, tokens_CHI.
* ADOS indicates the severity of the autistic symptoms (the higher the worse)
* MullenRaw indicates non verbal IQ
* ExpressiveLangRaw indicates verbal IQ
* MLU stands for mean length of utterance
* types stands for unique words (e.g. even if "doggie" is used 100 times it only counts for 1)
* tokens stands for overall amount of words (if "doggie" is used 100 times it counts for 100)
It would be smart to rename MullenRaw and ExpressiveLangRaw into something you can remember (e.g. nonVerbalIQ, verbalIQ).
```{r}
#To subset interesting variables
demo_sub=select(demo_data, ID, VISIT, Ethnicity, Gender, Age, Diagnosis, ADOS, MullenRaw, ExpressiveLangRaw)
lu_sub=select(lu_data, ID, VISIT, MOT_MLU, MOT_LUstd, CHI_MLU, CHI_LUstd)
token_sub=select(token_data, ID, VISIT, types_MOT, types_CHI, tokens_MOT, tokens_CHI)
#To rename variables
demo_sub=plyr::rename(demo_sub, c("MullenRaw" = "nonVerbalIQ"))
demo_sub=plyr::rename(demo_sub, c("ExpressiveLangRaw" = "verbalIQ"))
```
2e. Finally we are ready to merge all the data sets into just one.
Google "How to merge datasets in R"
Tip: Use the merge() function for this.
Tip: Merge only works for two data frames at the time.
Tip: Check the number of observations in the datasets before and after merging. What is going on?
```{r}
# To merge data
data1=merge(demo_sub, lu_sub)
DATA=merge(data1, token_sub)
# There are fewer rows after merging because the demographic data contains more observations than the other two data frames; rows without a match in every data set are dropped.
```
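The row-count drop can be seen on a toy example (made-up frames): by default `merge()` performs an inner join, keeping only keys present in both frames, whereas `all = TRUE` keeps unmatched rows and fills the gaps with NA:

```r
# Toy frames: "Ida" appears in a but not in b
a <- data.frame(ID = c("Adam", "Eva", "Ida"), x = 1:3)
b <- data.frame(ID = c("Adam", "Eva"),        y = 4:5)

nrow(merge(a, b))              # 2: "Ida" has no match in b and is dropped
nrow(merge(a, b, all = TRUE))  # 3: "Ida" is kept, with y = NA
```

Comparing `nrow()` before and after each merge, as the tip suggests, immediately tells you whether observations were silently dropped.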
Are we done yet?
If you look at the data set now, you'll see a lot of NAs in the variables ADOS, nonVerbalIQ (MullenRaw) and verbalIQ (ExpressiveLangRaw). These measures were not taken at all visits. Additionally, we only want these measures for the first visit (Riccardo will explain why in class).
So let's make sure that we select only these variables as collected during the first visit for each child and repeat these values throughout all other visits.
Tip: one solution requires you to select only the rows corresponding to visit 1 in a new dataset, to rename the columns of the relevant variables and to merge it back to the old dataset.
Tip: subset() and select() might be useful.
Tip: the final dataset should have as many rows as the old one.
```{r}
# To only get data for visit 1
Visit1_data=subset(DATA, DATA$VISIT == 1)
#To only select relevant columns
ID_visit=select(Visit1_data, ID, ADOS, nonVerbalIQ, verbalIQ)
#To drop the ADOS, nonVerbalIQ and verbalIQ columns (positions 7-9), which hold the NAs
DATA_NEW=DATA[,-7:-9]
#To merge datasets back together
DATA2 = merge(DATA_NEW, ID_visit, by = "ID")
```
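The logic of this step can be seen on a toy frame (made-up values): keep the visit-1 measurement, drop the NA-ridden column, and merge it back by ID so the value repeats across every visit:

```r
# Toy data: ADOS was only measured at visit 1
d  <- data.frame(ID = c("a","a","b","b"), VISIT = c(1,2,1,2), ADOS = c(10, NA, 5, NA))

v1 <- d[d$VISIT == 1, c("ID", "ADOS")]  # visit-1 values only
d$ADOS <- NULL                          # drop the column with NAs
d  <- merge(d, v1, by = "ID")           # visit-1 ADOS now repeats for every visit

d  # same 4 rows as before, but ADOS is filled in for every visit, no NAs left
```

Because the merge is by ID only (not by ID and VISIT), each child's single visit-1 value is recycled across all of that child's rows.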
Now, we are almost ready to actually start working with the data. However, here are some additional finishing touches:
* in some experiments your participants must be anonymous. Therefore we wish to turn the Child.ID into numbers.
Tip: as.numeric() might be a useful function, but not alone.
* Note that visit is (probably) not defined as numeric. Turn it into a numeric variable
* In order to make it easier to work with this nice, clean dataset in the future, it is practical to make sure the variables have sensible values. E.g. right now gender is marked 1 and 2, but in two weeks you will not be able to remember which gender was connected to which number, so change the values from 1 and 2 to F and M in the gender variable. For the same reason, you should also change the values of Diagnosis from A and B to ASD (autism spectrum disorder) and TD (typically developing).
Tip: Google "how to rename levels in R".
```{r}
# To make ID anonymous
DATA2$ID=as.factor(DATA2$ID)
DATA2$ID=as.numeric(DATA2$ID)
# VISIT is already numeric (it kept its integer type from the demographic data through the merges)
# To change the gender variable
DATA2$Gender=as.factor(DATA2$Gender)
DATA2$Gender=revalue(DATA2$Gender, c("1"="M", "2"="F"))
# To change diagnosis variable
DATA2$Diagnosis=revalue(DATA2$Diagnosis, c("A"="ASD", "B"="TD"))
```
Save the data set into a csv file. Hint: look into write.csv().
```{r}
write.csv(DATA2, "clean_data.csv")
```
3) Now that we have a nice clean data set to use for the analysis next week, we shall play around with it a bit. The following exercises are not relevant for the analysis, but are here so you can get familiar with the functions within the tidyverse package.
Here's the link to a very helpful book, which explains each function:
http://r4ds.had.co.nz/index.html
1) USING FILTER
List all kids who:
1. have a mean length of utterance (across all visits) of more than 2.7 morphemes.
2. have a mean length of utterance of less than 1.5 morphemes at the first visit
3. have not completed all trials. Tip: Use pipes to solve this
```{r}
# To load data
clean_data=read.csv("clean_data.csv")
clean_data=clean_data[,-1] # Drop the first column (row numbers added by write.csv)
# Mean length of utterance of more than 2.7 morphemes (averaged across visits)
clean_data %>% group_by(ID) %>% summarise(MEAN = mean(CHI_MLU)) %>% filter(MEAN > 2.7)
# Less than 1.5 morphemes on the first visit
t=filter(clean_data, VISIT == 1 & CHI_MLU < 1.5)
# Have not completed all trials (fewer than 6 visits)
missed_trials = clean_data %>% group_by(ID) %>% tally()
filter(missed_trials, n < 6)
```
USING ARRANGE
1. Sort kids to find the kid who produced the most words on the 6th visit
2. Sort kids to find the kid who produced the fewest words on the 1st visit.
```{r}
# To sort the kids in descending order, so the one who produced the most words on the 6th visit comes first
Visit_6=subset(clean_data, clean_data$VISIT==6)
arrange(Visit_6, desc(tokens_CHI))
# ID = 55 produced the most words (1294 tokens)
# To sort the kids in ascending order, so the one who produced the fewest words on the 1st visit comes first
Visit_1=subset(clean_data, clean_data$VISIT==1)
arrange(Visit_1, tokens_CHI)
# ID = 57 produced the fewest words on the first visit (0 tokens)
```
USING SELECT
1. Make a subset of the data including only kids with ASD, and only the MLU and word-token variables.
2. What happens if you include the name of a variable multiple times in a select() call?
```{r}
# Subset with only ASD children
ASD_data=subset(clean_data, Diagnosis == "ASD")
ASD_data2=select(ASD_data, ID, VISIT, CHI_MLU, tokens_CHI)
# What happens if you include the name of a variable multiple times?
ASD_data3=select(ASD_data, ID, ID, VISIT, CHI_MLU, tokens_CHI)
# Nothing happens. The variable only appears once.
```
USING MUTATE, SUMMARISE and PIPES
1. Add a column to the data set that represents the mean number of words spoken during all visits.
2. Use the summarise function and pipes to add a column to the data set containing the mean number of words produced by each child across all visits. HINT: group by Child.ID
3. The solution to task above enables us to assess the average amount of words produced by each child. Why don't we just use these average values to describe the language production of the children? What is the advantage of keeping all the data?
```{r}
# To add a column with the mean number of words produced at each visit
data2 = clean_data %>% group_by(VISIT) %>% mutate(Mean_words_VISIT = mean(tokens_CHI))
# To add a column with the mean number of words spoken by each child across visits
data3 = data2 %>% group_by(ID) %>% mutate(Mean_words_CHI = mean(tokens_CHI))
# 3. Averaging the number of words loses the development each child undergoes across the visits. By keeping all the data we can compare the developmental trajectories of typically developing and autistic children.
```