---
title: "R Crashcourse"
author: "Michael Weisner"
date: "February 14, 2019"
output:
  word_document: default
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Acknowledgements

This tutorial is heavily based on and influenced by the excellent R for Social Scientists Data Carpentry workshop, available [here](https://datacarpentry.org/r-socialsci/00-intro/index.html).

# Setup

**[Data Carpentry Setup Information](https://datacarpentry.org/r-socialsci/setup.html)**

## R and RStudio
R is an open source scripting language targeted at statistical analysis. It includes graphics capabilities and tools for data analysis and for reading and writing data to and from files. [Source](https://datacarpentry.org/r-socialsci/00-intro/index.html)


RStudio is an Integrated Development Environment (IDE) designed to make R easier to use. The open-source RStudio IDE is free under the Affero General Public License (AGPL) v3.

The RStudio interface is divided into four panes. The top left is for editing scripts and R Markdown files. The top right shows your environment, which lists the variables you have created. The bottom right is for packages and help (along with files and plots). The bottom left is your Console, where you can enter commands to R directly.

## Getting Started

It's best to organize things into a specific working directory, or project folder. We'll do so by creating a new project for this tutorial.

* Under the `File` menu, click on `New project`, choose `New directory`, then `New project`
* Enter a name for this new folder (or “directory”), and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., `~/textmining`)
* Click on `Create project`
* Create a new file where we will type our code. Go to `File` > `New File` > `R Markdown`. Click the save icon on your toolbar and save the file as “`textmining.Rmd`”.
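
Once the project exists, you can check the working directory from the console. This is a minimal sketch using base R only; the folder name in the comment is just the example from the steps above.

```{r}
# Print the folder R currently treats as its working directory.
# Inside an RStudio project this is normally the project folder (e.g., ~/textmining).
current_dir <- getwd()
current_dir

# List the files R can see there; relative paths are resolved against this folder.
project_files <- list.files(current_dir)
project_files
```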

## R Markdown Files

R Markdown files are a kind of notebook in which "chunks" of R code can be run step by step. This keeps notes and prose separate from your code, prevents all of your code from running together, and lets you create reports from the same file.

Detailed Cheatsheets can be found here: [http://rstudio.com/cheatsheets](http://rstudio.com/cheatsheets)

R Markdown files contain three important types of content:

1. An (optional) YAML header surrounded by `---`
2. Chunks of R code, opened by `` ```{r} `` and closed by `` ``` ``
3. Text mixed with simple text formatting like `# heading` and `_italics_`.

<!--  --- -->
<!--  title: "Diamond sizes" -->
<!--  date: 2016-08-25 -->
<!--  output: html_document -->
<!--  --- -->

<!--  ```{r} -->
<!--  library(ggplot2) -->
<!--  library(dplyr) -->
<!--  data(diamonds) -->
<!--  smaller <- diamonds %>%  -->
<!--   filter(carat <= 2.5) -->
<!--  ``` -->

<!--  We have data about r nrow(diamonds) diamonds. Only  -->
<!--  r nrow(diamonds) - nrow(smaller) are larger than -->
<!--  2.5 carats. The distribution of the remainder is shown -->
<!--  below: -->

<!--  ```{r} -->
<!--  smaller %>%  -->
<!--  ggplot(aes(carat)) +  -->
<!--   geom_freqpoly(binwidth = 0.01) -->
<!-- ``` -->


### Creating Chunks

Chunks are the building blocks of an R Markdown file. You can use the Insert button at the top right of the main window (it has a green C and an Insert dropdown) or press Control + Alt + I to insert a chunk, like so:

```{r}
# empty chunk
```

Each chunk has a little green arrow at the right that runs the code it contains. You can also run code by highlighting it and pressing Control + Enter.

### Installing Packages

In a chunk, you can install packages with the `install.packages()` function:


```{r}
# install.packages("tidyverse")
```
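
Before installing, it can help to check which packages are already present. The sketch below is one possible approach, not part of the original workshop; the `needed_pkgs` vector is a hypothetical list, and the actual install call is left commented out.

```{r}
# Hypothetical list of packages to check; adjust to your needs.
needed_pkgs <- c("stats", "tidyverse")

# requireNamespace() returns TRUE when a package is installed, without attaching it.
is_installed <- vapply(needed_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed_pkgs[!is_installed]
missing_pkgs

# Uncomment to install only what is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```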


# Basics of R

## Math

You can do simple math with R. Try it in the console:

```{r}
3+5
```

```{r}
12/7
```
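
A few more arithmetic operators are worth knowing. This short sketch shows operator precedence, exponents, integer division, and remainders; the result objects are named only so they can be inspected afterwards.

```{r}
precedence <- 2 + 3 * 4      # multiplication happens before addition: 14
grouped    <- (2 + 3) * 4    # parentheses change the order: 20
power      <- 2^10           # exponentiation: 1024
int_div    <- 7 %/% 2        # integer division: 3
remainder  <- 7 %% 2         # remainder (modulo): 1

c(precedence, grouped, power, int_div, remainder)
```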

## Creating Objects
Objects are named values that we store for later use. R uses `<-` to assign values. It can store numbers, characters and words, true and false values, and collections of these such as vectors, matrices, and arrays. The type of an object matters for many kinds of analysis.

```{r}
weight_lbs <- 5.0
weight_kgs <- weight_lbs / 2.205 # 1 kg is roughly 2.205 lbs, so divide to convert

weight_lbs
weight_kgs
```
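
One point that often surprises newcomers: an object computed from another object does not update when the original changes. A small sketch, using hypothetical area values rather than anything from the workshop data:

```{r}
area_hectares <- 1.0
area_acres <- area_hectares * 2.47  # computed once, from the value at this moment

area_hectares <- 50                 # reassigning does not touch area_acres
area_acres                          # still 2.47
```

To refresh `area_acres`, you would need to run the assignment line again.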

### Vectors
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the `c()` function.

#### Assign Values to Variables
```{r}
no_membrs <- c(3, 7, 10, 6)
no_membrs
```

```{r}
respondent_wall_type <- c("muddaub", "burntbricks", "sunbricks")
respondent_wall_type
```

#### Length
```{r}
length(no_membrs)
length(respondent_wall_type)
```

#### Class
```{r}
class(no_membrs)
class(respondent_wall_type)
```

#### Structure
```{r}
str(no_membrs)
str(respondent_wall_type)
```

#### Types of Vectors:

* "character" are letters and words
* "numeric" (or "double") are numbers
* "logical" for TRUE and FALSE (the boolean data type)
* "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
* "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
* "raw" for bitstreams that we won’t discuss further
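
The types listed above can be checked with `typeof()`. The sketch below also shows what happens when types are mixed in one vector: R silently converts every element to a single common type. The vector names are illustrative only.

```{r}
type_double  <- typeof(c(1.5, 2))        # "double"
type_integer <- typeof(c(1L, 2L))        # "integer"
type_logical <- typeof(c(TRUE, FALSE))   # "logical"

# A vector holds one type only, so mixed input is coerced to character here.
mixed <- c(1, "a", TRUE)
mixed
class(mixed)                             # "character"
```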

#### Factors

Factors are a special kind of vector.

Factors hold categorical or ordinal data. Here's an example of an ordered factor:
```{r}
credit_rating <- c("AAA", "AA", "A", "BBB", "AA", "BBB", "A")

credit_factor_ordered <- factor(credit_rating, ordered = TRUE,
                                levels = c("AAA", "AA", "A", "BBB"))

plot(credit_factor_ordered)
```
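
To see what an ordered factor stores, it can help to look at its levels and the integer codes behind them. This is a minimal sketch that rebuilds the same ratings under a separate, hypothetical name (`ratings`) so it runs on its own.

```{r}
ratings <- factor(c("AAA", "AA", "A", "BBB", "AA", "BBB", "A"),
                  ordered = TRUE,
                  levels = c("AAA", "AA", "A", "BBB"))

levels(ratings)                       # the level order we declared
rating_codes <- as.integer(ratings)   # position of each value in that order
rating_codes
rating_counts <- table(ratings)       # how often each level occurs
rating_counts

# Ordered factors support comparisons based on level position.
first_before_fourth <- ratings[1] < ratings[4]
first_before_fourth
```

Note that the comparison follows the order given in `levels`, so here "AAA" counts as the lowest level simply because it is listed first.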

#### Subsetting Vectors

We can extract specific values from a vector like so:

```{r}
respondent_wall_type
respondent_wall_type[2]
respondent_wall_type[c(3, 2)]
no_membrs
no_membrs[c(TRUE, FALSE, TRUE, TRUE)]
```
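
Logical vectors for subsetting are usually produced by a condition rather than typed by hand. A short sketch, using a hypothetical `household_sizes` vector with the same values as above:

```{r}
household_sizes <- c(3, 7, 10, 6)

larger_than_five <- household_sizes[household_sizes > 5]   # keeps 7, 10, 6
larger_than_five

# Conditions can be combined with & (and) and | (or).
between_four_and_seven <- household_sizes[household_sizes >= 4 & household_sizes <= 7]
between_four_and_seven                                      # keeps 7, 6

# A negative index drops that position instead of selecting it.
without_first <- household_sizes[-1]
without_first                                               # drops the 3
```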

Or

```{r}
possessions <- c("car", "bicycle", "radio", "television", "mobile_phone")
possessions[possessions == "car" | possessions == "bicycle"] # returns both car and bicycle
possessions %in% c("car", "bicycle", "motorcycle", "truck", "boat") # returns logical TRUE or FALSE for each item
possessions[possessions %in% c("car", "bicycle", "motorcycle", "truck", "boat")] # returns the items in possessions that also appear in the list
```

Missing data shows up as `NA` values, and you'll need functions like `is.na()`, `na.omit()`, and `complete.cases()` to deal with them fully. For example:

```{r}
rooms <- c(2, 1, 1, NA, 4)
mean(rooms) # returns NA because one value is missing
mean(rooms, na.rm = TRUE)
mean(rooms[!is.na(rooms)])
```
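
The other missing-data helpers mentioned above can be sketched the same way. The `rooms_example` vector and the `toy_households` data frame below are made-up values for illustration.

```{r}
rooms_example <- c(2, 1, 1, NA, 4)

missing_flags <- is.na(rooms_example)           # TRUE where a value is missing
missing_flags
n_missing <- sum(missing_flags)                 # count of missing values
n_missing
rooms_clean <- as.numeric(na.omit(rooms_example))  # drop the NA entries
rooms_clean

# complete.cases() works row by row on a data frame.
toy_households <- data.frame(rooms = c(2, NA, 4), members = c(3, 5, NA))
complete_rows <- complete.cases(toy_households) # TRUE only for rows with no NA
complete_rows
toy_households[complete_rows, ]
```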


### Data Frames

For our purposes we're going to focus on the structure a dataset is typically stored in: a `data.frame`. A `data.frame` is how R handles things like CSV or Excel files (or most other data files, for that matter), where you have rows of observations and columns of variables, with each column holding values of one type.

https://bit.ly/2W3S2Yl

![Source: https://datacarpentry.org/r-socialsci/fig/data-frame.svg](data-frame.svg)
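
To make the "columns of one type" idea concrete, here is a small data frame built by hand. It is only a sketch with invented values; the column names loosely echo the survey data used below but are not taken from it.

```{r}
toy_survey <- data.frame(
  village   = c("God", "Chirodzo", "Ruaca"),   # character column
  no_membrs = c(3, 7, 10),                     # numeric column
  has_radio = c(TRUE, FALSE, TRUE),            # logical column
  stringsAsFactors = FALSE
)

toy_dims <- dim(toy_survey)                    # rows, then columns
toy_dims
column_classes <- sapply(toy_survey, class)    # one class per column
column_classes
```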

```{r}
library(tidyverse)
#interviews <- read_csv("https://ndownloader.figshare.com/files/11492171", na = "NULL")
interviews <- read_csv("https://bit.ly/2vnbYtu")
head(interviews)
```

```{r}
typeof(interviews) # a data frame is stored as a list of columns
str(interviews)
```

### Subsetting Dataframes

You can get specific values with positions:

```{r}
interviews[1, 1] # gets the first row of the first column (returned as a 1 x 1 tibble)
```

```{r}
interviews[1:3, 4:7] # gets rows 1 through 3 and columns 4 through 7
```
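
Columns can also be selected by name, and rows by a condition. The sketch below uses a small hand-made base R data frame (`toy_interviews`, with invented values) so it runs without downloading anything.

```{r}
toy_interviews <- data.frame(
  village   = c("God", "God", "Chirodzo", "Ruaca"),
  years_liv = c(4, 9, 15, 6),
  stringsAsFactors = FALSE
)

village_column <- toy_interviews$village        # one column as a vector
village_column
years_column <- toy_interviews[["years_liv"]]   # same idea, name given as a string
years_column

# Rows where the condition is TRUE, all columns kept.
long_term <- toy_interviews[toy_interviews$years_liv > 5, ]
long_term
```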

dplyr, a popular data manipulation package, has good tools for this as well:
```{r}
library(dplyr)
head(select(interviews, village, years_liv))
```

```{r}
head(filter(interviews, years_liv > 10))
```


# Dplyr & Tidyverse

The modern methods for subsetting and manipulating data come from the `dplyr` package, which is part of the `tidyverse` collection of packages.

The `tidyverse` gives us tibbles (tidy data frames), which mostly differ from ordinary data frames in that they print nicely and don't support row names. Tibbles are the basis for a lot of tidy data manipulation, though.

`dplyr` lets us use powerful manipulation commands and pipe (or chain) these commands together:

+ `select()`: subset columns
+ `filter()`: subset rows on conditions
+ `mutate()`: create new columns by using information from other columns
+ `group_by()` and `summarize()`: create summary statistics on grouped data
+ `arrange()`: sort results
+ `count()`: count discrete values

```{r}
interviews %>%
  select(-interview_date) %>%
  filter(respondent_wall_type == "burntbricks") %>%
  summarise(average_years = mean(years_liv))
```


```{r}
head(interviews)
interviews2 <- interviews %>%
  mutate(mem_by_years = no_membrs * years_liv)
head(interviews2)
```
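
The chunks above cover `select()`, `filter()`, `summarise()`, and `mutate()`. The remaining verbs from the list can be sketched on a small made-up tibble; the `households` data here is hypothetical and only stands in for the interview data.

```{r}
library(dplyr)

households <- tibble(
  village   = c("God", "God", "Chirodzo", "Chirodzo", "Ruaca"),
  no_membrs = c(3, 7, 10, 6, 4)
)

# group_by() + summarize(): one row of statistics per village,
# then arrange() sorts villages from largest to smallest average.
village_summary <- households %>%
  group_by(village) %>%
  summarize(mean_no_membrs = mean(no_membrs),
            n_households = n()) %>%
  arrange(desc(mean_no_membrs))
village_summary

# count() is a shortcut for counting rows per group.
village_counts <- households %>%
  count(village)
village_counts
```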


# Ggplot & Graphing
ggplot2 is the more modern way to graph with R. It's flexible and makes plots modular, in that you can add geometry on top of data. The "gg" in ggplot2 stands for the "grammar of graphics." The general steps are:

1. format your data
2. set your plane
3. add your geometry
4. add themes and tweak visuals
5. label

```{r}
library(ggplot2)
str(mpg)
?mpg
```

Let's look at engine displacement (`displ`, in litres) and highway miles per gallon (`hwy`):
```{r}
ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point()
```

```{r}
# theme_light() ships with ggplot2, so no extra theme package is needed
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = manufacturer)) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~class) +
  theme_light() +
  ggtitle("Miles per Gallon by Manufacturer per Car Type") +
  xlab("Engine Displacement (litres)") +
  ylab("Highway Miles per Gallon")
```
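
Because a plot is built from layers, you can also store it in an object and add to it step by step, following the general steps listed earlier. This is a minimal sketch using the built-in `mpg` data; the object names and label text are illustrative choices.

```{r}
library(ggplot2)

# Steps 1-2: data and plane (which variables map to x and y).
base_plot <- ggplot(mpg, aes(x = displ, y = hwy))

# Step 3: add geometry on top of the plane.
scatter <- base_plot + geom_point(alpha = 0.5)

# Steps 4-5: theme and labels.
labelled <- scatter +
  theme_light() +
  labs(title = "Highway Mileage by Engine Displacement",
       x = "Engine Displacement (litres)",
       y = "Highway Miles per Gallon")

labelled  # printing the object draws the plot
```

Storing intermediate plots like this makes it easy to try different geometry or themes on the same base without retyping the data and aesthetics.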