-
Notifications
You must be signed in to change notification settings - Fork 2
Expand file tree
/
Copy path1. Text Mining Setup.Rmd
More file actions
144 lines (97 loc) · 4.25 KB
/
1. Text Mining Setup.Rmd
File metadata and controls
144 lines (97 loc) · 4.25 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
---
title: "Text Mining Setup"
author: "Michael Weisner"
date: "February 14, 2019"
output:
word_document: default
html_document: default
---
This tutorial was based on the Fall 2018 Data Mining course at Columbia University lead by Ben Goodrich.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Packages
Packages for this tutorial include
* tidyverse
* tidytext
* dplyr
* gutenbergr
* janeaustenr
* wordcloud
* ggwordcloud
* tm
* topicmodels
* text2vec
* ggplot2
## Package Installation (Optional)
```{r package installation, eval=FALSE}
# install.packages(c("tidyverse", "dplyr", "ggraph", "tidytext", "gutenbergr", "janeaustenr", "tm", "wordcloud", "topicmodels", "text2vec", "ggplot2", "ggwordcloud", "quanteda"))
```
## Basic Packages to Load
```{r basic packages}
library(tidyverse)
library(dplyr)
library(ggplot2)
```
# Tidy Text
In general, a "tidy" `data.frame`, which is what we will use for R text analysis, has
* One "observation" per row
* Each column is a variable
* Each type of observational unit can be represented as a table
Which can be seen in the images below:

When applied to text data, "tidy" means a table with one "token" per row, where a "token" can be a single word or set of adjacent words. The main strength of this approach is that it integrates well with the rest of the packages in the **tidyverse**.
Non-tidy approaches (that can be made tidy) to text include
* Character vectors
* Corpora with additional metadata
* Document-term matrices
## Ways to Break Up Text Data
The `unnest_tokens` function in the **tidytext** package can parse words, sentences, paragraphs, and more into tokens, in which is removes punctuation and converts to lowercase.
The tidytext::unnest_tokens() function does the following:
1. **Convert text to lowercase**: each word found in the text will be converted to lowercase so ensure that you don’t get duplicate words due to variation in capitalization.
2. **Punctuation is removed**: all instances of periods, commas etc will be removed from your list of words , and
3. **Unique id associated with the tweet**: will be added for each occurrence of the word
The unnest_tokens() function takes two arguments:
+ The name of the column where the unique word will be stored and
+ The column name from the data.frame that you are using that you want to pull unique words from.
### Tidytext Examples:
Fundamentally you can break up your units of textual analysis by words, word pairs, sentences, paragraphs, chapters, etc.
```{r}
library(tidytext)
example(unnest_tokens)
```
# Preparing Tidy Text Data
## Stop Words
### Joins and Anti Joins
The **tidyverse** also uses some database-style logic in order to merge `data.frame`s together.
* `left_join` returns all rows of the first `data.frame` when merged with another `data.frame`.
* `inner_join` is like a `left_join` but only keeps rows that match between the two `data.frame`s according to the columns in `by` that define a key.
* `outer_join` keeps all the rows that appear in either of the two `data.frame`s.
* `anti_join` drops all the observations from the first `data.frame` that match with the second `data.frame`.
To eliminate stop words we can do so with an `anti_join`:
```{r}
library(gutenbergr)
hgwells <- gutenberg_download(c(35, 36, 5230, 159)) # Books by H.G. Wells
tidy_hgwells_stop <- hgwells %>%
unnest_tokens(word, text)
tidy_hgwells_stop
```
### Remove Stopwords
```{r}
tidy_hgwells <- tidy_hgwells_stop %>%
anti_join(stop_words)
tidy_hgwells
```
### Sort By Frequency
```{r}
tidy_hgwells %>%
count(word, sort = TRUE)
```
# Sentiment Analysis in Text
Text may have a sentiment that is easy for a human to pick up on but the sentiment of individual words is subject to negation, context, sarcasm, and other linguistic problems.
Still, there are efforts to allow for analysis of the sentiment of text. The **tidytext** package includes a `data.frame` call `sentiments`
```{r}
sentiments
table(sentiments$score)
```
These are scored by three different sets of researchers. We can then `left_join` a tidy `data.frame` of texts with the `sentiments` `data.frame` to investigate whether the words used tend to be negative or positive, for example using the books writen by Jane Austen.