learningr/semnet.Rmd at master · vanatteveldt/learningr · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
---
title: "Semantic Network Analysis"
author: "Kasper Welbers & Wouter van Atteveldt "
date: "May 25, 2016"
output: pdf_document
---

```{r, echo=F}
head = function(...) knitr::kable(utils::head(...))
```

One way to create semantic networks is to calculate how often words co-occur together. This co-occurence reflects a semantic relation, because it indicates that the meaning of these words is related.

In this howto we demonstrate two functions to calculate the co-occurence of words. The first is the `coOccurenceNetwork` function, which calculates the co-occurence of words within documents based on a document term matrix. The second is the `windowedCoOccurenceNetwork`, which calculates how often words co-occur within a given word distance based on tokenized texts.

We start with a simple example.

```{r, message=F}
library(semnet)
data(simple_dtm)
dtm
```

`dtm` is a document term matrix: the rows represent documents and the columns represent words. Values represent how often a word occured within a document. The co-occurence of words can then be calculated as the number of documents in which two words occur together. This is what the `coOccurenceNetwork` function does.

```{r, warning=F}
g = coOccurenceNetwork(dtm)
plot(g, vertex.size=V(g)$freq*10)
```

Of course, this method mainly becomes interesting when lots of documents are analyzed. This could for instance show how often the word 'nuclear' is used in the context of 'energy', compared to the context of 'weapons' and 'waste'. Thus, it can provide an answer to the question: if one think or talks about nuclear technology, what discourses, frames or topics come to mind?

To demonstrate the `windowedCoOccurenceNetwork` function we'll use a larger dataset, consisting of the state of the union speeches of Obama and Bush (1090 paragraphs). We'll filter the data on part-of-speech tags to contain only the nouns, names and adjectives.

```{r}
data(sotu)
sotu.tokens = sotu.tokens[sotu.tokens$pos1 %in% c('N','M','A'),]
head(sotu.tokens)
```

We are interested in three columns in the `sotu.tokens` dataframe:
* The `lemma` column, which is the lemma of a term (the non-plural basic form of a word). We use this instead of the word because we are interested in the meaning of words, for which it is generally less relevant in what specific form it is used. Thus, we consider the words "responsibility" and "responsibilities" to represent the same meaning.
* The `aid` column, which is a unique id for the document, in this case for a paragraph in the SotU speeches. We refer to this as the `context` in which a word occurs.
* The `id` column, which is the specific location of a term within a context. For example, the first row in sotu.tokens shows that in context `111541965`, the term `unfinished` was the fourth term.

These columns are the main arguments for the `windowedCoOccurenceNetwork` function. In addition, the `window.size` argument determines the word distance within which words need to occur to be counted as a co-occurence.

```{r}
g = windowedCoOccurenceNetwork(location=sotu.tokens$id,
                    term=sotu.tokens$lemma,
                    context=sotu.tokens$aid,
                    window.size=20)
class(g)
vcount(g)
ecount(g)
```

# Visualizing Semantic Networks

The output `g` is an igraph object---a popular format for representing and working with graph/network data. `vcount(g)` shows that the number of vertices (i.e. terms) is 3976. `ecount(g)` shows that the number of edges is 201792.

Naturally, this would not be an easy network to interpret. Therefore, we first filter on the most important vertices and edges. There are several methods to do so (see e.g., [Leydesdorff & Welbers, 2011]{http://arxiv.org/abs/1011.5209}). Here we use backbone extraction, which is a relatively new method (see [Kim & Kim, 2015]{http://jcom.sissa.it/archive/14/01/JCOM_1401_2015_A01}. Essentially, this method filters out edges that are not significant based on an alpha value, which can be interpreted similar to a p-value. To filter out vertices, we lower the alpha to a point where only the specified number of vertices remains.

```{r}
g_backbone = getBackboneNetwork(g, alpha=0.0001, max.vertices=100)
vcount(g_backbone)
ecount(g_backbone)
```

Now there are only 100 vertices and 255 edge left. This is a network we can interpret. Let's plot!

```{r}
plot(g_backbone)
```

Nice, but still a bit messy. We can take some additional steps to focus the analysis and add additional information. First, we can look only at the largest connected component, thus ignoring small islands of terms such as `math` and `science`.

```{r}
# select only largest connected component
g_backbone = decompose.graph(g_backbone, max.comps=1)[[1]]
plot(g_backbone)
```

Next, it would be interesting to take into account how often each term occured. This can be visualized by using the frequency of terms to set the sizes of the vertices. Also, we can use colors to indicate different clusters.

The output of the (windowed)coOccurenceNetwork function by default contains the vertex attribute `freq`, which can be used to set the vertex sizes. To find clusters, several community detection algorithms are available. To use this information for visualization some basic understanding of plotting igraph objects is required, which is out of the scope of this tutorial. We do provide a function named `setNetworkAttributes` which deals with these and some other visualization attributes.


```{r}
V(g_backbone)$cluster = edge.betweenness.community(g_backbone)$membership

g_backbone = setNetworkAttributes(g_backbone, size_attribute=V(g_backbone)$freq,
  cluster_attribute=V(g_backbone)$cluster)

plot(g_backbone)
```

Now we have a more focused and informational visualization. We can for instance see several clusters that represent important talking points, such as the health care debate and the issue of nuclear weapons. Also, we see that America is at the center of discussions, in particular in context of economy and the job market.

# Extracting Quantitative Information from networks

## Computing network metrics

The igraph package contains numerous functions for computing network statistics.
For example, this code computes the degree centrality (links per node) for the original semantic network:

```{r}
c = centralization.degree(g)
degree = data.frame(node=V(g)$name, degree=c$res)
head(arrange(degree, -degree))
```

Or you can get the most central nodes in the backgone network (using betweenness centrality)

```{r}
c = centralization.betweenness(g_backbone)
centrality = data.frame(node=V(g_backbone)$name, centrality=c$res)
head(arrange(centrality, -centrality))
```

## Extracting relations

If we want to do more quantitative anlaysis of the network it can be useful to extract all relations as a data frame:

```{r}
edges = as_data_frame(g, what="edges")
head(edges)
```

The most frequent edges are also good candidates for collocation:

```{r}
edges = arrange(edges, -weight)
head(edges)
```

## Extraecting co-occurrences per document

If you want to see in which documents an edge is contained, for example to add a time dimension to the network or to compare sources,
you can set `output.per.context=T` in the original call.
Since this will take a lot longer to run and generate a lot of output, we limit here to a sample of 10 paragraphs from Obama's speeches:

```{r}
smp = sample(sotu.meta$id[sotu.meta$headline == "Barack Obama"], 10)
edges = with(sotu.tokens[sotu.tokens$aid %in% smp, ],
  windowedCoOccurenceNetwork(location=id, term=lemma, context=aid,
                      window.size=20, output.per.context = T))
head(edges)
```

# Semnet for sentiment anlaysis

We can also use the word-windo approach for sentiment analysis.
Suppose we would want to get the sentiment in phrases around 'Iraq' and 'terror'.
First, we need to load a sentiment dictionary and for this exercise we will use a list to store the dictionary:

```{r}
lexicon = readRDS("data/lexicon.rds")
dictionary = list(
  pos = lexicon$word1[lexicon$priorpolarity == "positive"],
  neg = lexicon$word1[lexicon$priorpolarity == "negative"],
  iraq = c("Iraq", "Iraqi" ),
  terror = c("terror", "terrorism", "terrorist"))
```
Now, we can use this to make a new 'concept' column that contains that concept and the sentiment values positive or negative

```{r}
data(sotu)
sotu.tokens$concept = NA
for (concept in names(dictionary)) {
  sotu.tokens$concept[sotu.tokens$lemma %in% dictionary[[concept]]] = concept
}
table(sotu.tokens$concept)
```

We can use now get all windowed co-occurrences for these concepts using:

```{r}
hits = windowedCoOccurenceNetwork(location=sotu.tokens$id, term=sotu.tokens$concept, context=sotu.tokens$aid,
                    window.size=20, output.per.context = T)
head(hits)
```

Now, we can compute a sentiment score and get the mean sentiment per context, excluding pos and neg itself:

```{r}
hits$sentiment[hits$y == "pos"] = 1
hits$sentiment[hits$y == "neg"] = -1
hits = hits[!(hits$x %in% c("pos", "neg")), ]
library(reshape2)
sent = dcast(hits, context ~ x, value.var="sentiment", fun.aggregate = mean)
head(sent)
```

Finally, let's merge that back with the metadata to get sentiment about Iraq and teror per president:

```{r}
sent = merge(sotu.meta, sent, by.x="id", by.y="context")
aggregate(sent[c("iraq", "terror")], sent["headline"], mean, na.rm=T)
```

So, both presidents are positive about (their policy in) Iraq, but while Bush is negative about terror,
Obama is actually positive (presumably mostly talking about his efforts to contain it).