textmining_workshop/1. Text Mining Setup.Rmd at master · mdweisner/textmining_workshop · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
title: "Text Mining Setup"
author: "Michael Weisner"
date: "February 14, 2019"
output:
  word_document: default
  html_document: default
---
This tutorial was based on the Fall 2018 Data Mining course at Columbia University lead by Ben Goodrich.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Packages

Packages for this tutorial include

* tidyverse
* tidytext
* dplyr
* gutenbergr
* janeaustenr
* wordcloud
* ggwordcloud
* tm
* topicmodels
* text2vec
* ggplot2

## Package Installation (Optional)
```{r package installation, eval=FALSE}
# install.packages(c("tidyverse", "dplyr", "ggraph", "tidytext", "gutenbergr", "janeaustenr", "tm", "wordcloud", "topicmodels", "text2vec", "ggplot2", "ggwordcloud", "quanteda"))
```

## Basic Packages to Load
```{r basic packages}
library(tidyverse)
library(dplyr)
library(ggplot2)
```


# Tidy Text

In general, a "tidy" `data.frame`, which is what we will use for R text analysis, has

* One "observation" per row
* Each column is a variable
* Each type of observational unit can be represented as a table

Which can be seen in the images below:

![Source: https://r4ds.had.co.nz/tidy-data.html](tidy-1.png)

When applied to text data, "tidy" means a table with one "token" per row, where a "token" can be a single word or set of adjacent words. The main strength of this approach is that it integrates well with the rest of the packages in the **tidyverse**.

Non-tidy approaches (that can be made tidy) to text include

* Character vectors
* Corpora with additional metadata
* Document-term matrices


## Ways to Break Up Text Data
The `unnest_tokens` function in the **tidytext** package can parse words, sentences, paragraphs, and more into tokens, in which is removes punctuation and converts to lowercase.

The tidytext::unnest_tokens() function does the following:

1. **Convert text to lowercase**: each word found in the text will be converted to lowercase so ensure that you don’t get duplicate words due to variation in capitalization.
2. **Punctuation is removed**: all instances of periods, commas etc will be removed from your list of words , and
3. **Unique id associated with the tweet**: will be added for each occurrence of the word

The unnest_tokens() function takes two arguments:

+ The name of the column where the unique word will be stored and
+ The column name from the data.frame that you are using that you want to pull unique words from.

### Tidytext Examples:

Fundamentally you can break up your units of textual analysis by words, word pairs, sentences, paragraphs, chapters, etc.
```{r}
library(tidytext)
example(unnest_tokens)
```

# Preparing Tidy Text Data
## Stop Words

### Joins and Anti Joins

The **tidyverse** also uses some database-style logic in order to merge `data.frame`s together.

* `left_join` returns all rows of the first `data.frame` when merged with another `data.frame`.
* `inner_join` is like a `left_join` but only keeps rows that match between the two `data.frame`s according to the columns in `by` that define a key.
* `outer_join` keeps all the rows that appear in either of the two `data.frame`s.
* `anti_join` drops all the observations from the first `data.frame` that match with the second `data.frame`.

To eliminate stop words we can do so with an `anti_join`:

```{r}
library(gutenbergr)
hgwells <- gutenberg_download(c(35, 36, 5230, 159)) # Books by H.G. Wells

tidy_hgwells_stop <- hgwells %>%
  unnest_tokens(word, text)
tidy_hgwells_stop
```


### Remove Stopwords
```{r}
tidy_hgwells <- tidy_hgwells_stop %>%
  anti_join(stop_words)
tidy_hgwells
```


### Sort By Frequency
```{r}
tidy_hgwells %>%
  count(word, sort = TRUE)
```


# Sentiment Analysis in Text

Text may have a sentiment that is easy for a human to pick up on but the sentiment of individual words is subject to negation, context, sarcasm, and other linguistic problems.

Still, there are efforts to allow for analysis of the sentiment of text. The **tidytext** package includes a `data.frame` call `sentiments`

```{r}
sentiments
table(sentiments$score)
```

These are scored by three different sets of researchers. We can then `left_join` a tidy `data.frame` of texts with the `sentiments` `data.frame` to investigate whether the words used tend to be negative or positive, for example using the books writen by Jane Austen.