-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathindex.html
More file actions
616 lines (520 loc) · 24.6 KB
/
index.html
File metadata and controls
616 lines (520 loc) · 24.6 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
<!DOCTYPE html>
<html lang="en">
<head>
<meta name="generator" content="HTML Tidy for HTML5 for Apple macOS version 5.6.0">
<meta charset="utf-8">
<title>Processing Textual Data</title>
<meta name="description" content="A framework for easily creating beautiful presentations using HTML">
<meta name="author" content="Raphael Cobe">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="dist/reset.css">
<link rel="stylesheet" href="dist/reveal.css">
<link rel="stylesheet" href="dist/style.css">
<link rel="stylesheet" href="dist/theme/night.css" id="theme"><!-- Theme used for syntax highlighting of code -->
<link rel="stylesheet" href="plugin/highlight/monokai.css" id="highlight-theme">
</head>
<body>
<div class="reveal">
<!-- Any section element inside of this container is displayed as a slide -->
<div class="slides">
<section>
<a href="https://www.datascienceschools.org/" target="_blank"><div style="height: 120px; margin: 0 auto 10rem auto; background: transparent;"><img src="img/DS_logo_CDT-RDA-mono-inverse.png"></img></div></a>
<!--<a href="https://revealjs.com"><img src="https://static.slid.es/reveal/logo-v1/reveal-white-text.svg" alt="reveal.js logo" style="height: 180px; margin: 0 auto 4rem auto; background: transparent;" class="demo-logo"></a>-->
<h3>Processing Textual Data with Python</h3>
<p><small><em><a href="http://twitter.com/raphaelmcobe">@raphaelmcobe</a></em></small></p>
</section>
<section data-markdown data-separator="^---"
data-separator-vertical="^--"
data-separator-notes="^\nNote:"
data-charset="iso-8859-15"
style="text-align: left;">
<textarea data-template>
## Introduction/Motivation
<!-- .slide: data-background="#6d8ba7" -->
--
### A (few) word (s) on unstructure data
- Structured or Semistructured data includes fields or markup that
enable it to be <span class="highlight">easily parsed by a
computer</span>;
- Language is <span class="highlight">unstructured data</span> that
has been <span class="highlight">produced by people</span> to
be <span class="highlight">understood by other people</span>;
- Retrieving information from unstructured data <span
class="highlight">is essential</span>! - It is estimated that [80%
of the world’s data is
unstructured](https://www.ibm.com/blogs/watson/2016/05/biggest-data-challenges-might-not-even-know/).
- Unstructured data <span class="fragment highlight-red">*is not
random!*</span>
- Linguistic properties that make it very understandable to other
people <!-- .element: class="fragment" -->
--
### Language as Data
- Text mining:
- Process of deriving insights from (unstructured) text data;
- Natural Language Processing (NLP):
- Part of computer science and artificial intelligence which deals
with human languages. - Sentiment Analysis:
- Just a small variant of Text mining;
Note:
This is a simple note
- Test
- Test2
--
### Understanding text is hard
- Real-world data is often <span class="highlight-r">messy</span> and <span class="highlight-r">noisy</span>;
- Several factors that makes this process hard:
- Encoding scheme used (ASCII, UTF-8, UTF-16, Latin-1, etc.); <!-- .element: class="fragment" -->
- hundreds of natural languages, each of which has different syntax
rules; <!-- .element: class="fragment" -->
- Words can be ambiguous (dependent on their context);<!-- .element: class="fragment" -->
- case-sensitive, punctuation, numbers, hyperlinks, the use of
emoticons and emojis <!-- .element: class="fragment" --> <span>😜</span><!-- .element: class="fragment" -->
- synonyms, abbreviation, acronyms, and spelling errors;<!-- .element: class="fragment" -->
- often need to understand which words in a sentence are nouns and
which are verbs; <!-- .element: class="fragment" -->
---
## Processing Textual data
<!-- .slide: data-background="#6d8ba7" -->
--
### How to process textual data?
- “the numbers don’t lie” and not “the text doesn’t lie”;
- Machine learning:
- train statistical models on language as it changes;
- building models of language on <span class="highlight">
context-specific corpora</span>;
- Model that describes language and can make inferences based on that
description
- To take advantage of the predictability of text, we need to define
a constrained, numeric decision space on which the model can compute;
--
### How to process textual data?
- Break the text processing into steps:
- <span class="fragment highlight-blue">Finding Parts of Text</span>
- Finding Sentences
- Finding People and Things
- Detecting Parts of Speech
- <span class="fragment highlight-blue">Classifying Text and
Documents</span>
- Extracting Relationships
- Combined Approaches
---
<!-- .slide: data-background="#6d8ba7" -->
## Finding Parts of Text
Examples taken from [Real Python Regular Expressions
Tutorial](https://realpython.com/regex-python/)
--
### Regular Expressions
A (very) brief history
- In 1951, mathematician Stephen Cole Kleene described the concept of
a regular language, a language that is recognizable by a finite
automaton;
- In the mid-1960s, computer science pioneer Ken Thompson, one of
the original designers of Unix, implemented pattern matching in the
QED text editor;
- Appeared in many programming languages, editors, and other tools as
a means of determining whether a string matches a specified pattern;
--
### The Python `re` module
- Regex functionality in Python resides in a module named `re`. The
`re` module contains many useful functions and methods. Let's play a
little bit with the `search()` function:
- `re.search(<regex>, <string>)`
```python[1|2|3-4]
>>> import re
>>> text = "foo123foo"
>>> re.search("123", text)
<re.Match object; span=(3, 6), match='123'>
```
--
### The Python `re` module
- `span=(3, 6)` indicates the portion of `<string>` in which the
match was found. This means the same thing as it would in slice
notation:
```python[1|2]
>>> text[3:6]
'123'
```
--
### Regular Expressions in Python
- Any character or sequence of carachters is a valid regular
expression;
- The `search()` function returns the first sequence of characters
that matches the RE;
```python[2-3|4-5]
>>> text="abc123abc"
>>> re.search("a", text)
<re.Match object; span=(0, 1), match='a'>
>>> re.search("abc", text)
<re.Match object; span=(0, 3), match='abc'>
```
--
### Regular Expressions in Python
Metacharacters
- Real power of regex matching in Python emerges when `<regex>`
contains special characters called <span class="highlight">metacharacters</span>;
- Unique meaning to the regex matching engine;
- **Character class**:
- Matches any single character that is in the class;
- Defined using square brackets ([]);
- Can be used to specify intervals
```python[1-2|3-4|5-6|7-8]
>>> re.search('[0-9]', "abc123abc")
<re.Match object; span=(3, 4), match='1'>
>>> re.search('[2-9]', "abc123abc")
<re.Match object; span=(4, 5), match='2'>
>>> re.search('[0-9][0-9][0-9]', "abc123abc")
<re.Match object; span=(3, 6), match='123'>
>>> re.search('[2-9][0-9]', "abc123abc")
<re.Match object; span=(4, 6), match='23'>
```
--
### Regular Expressions in Python
Metacharacters - character classes
- Can have all the characters enumerated:
```python[1-2|3-4]
>>> re.search('ab[cde]', "abc123abc")
<re.Match object; span=(0, 3), match='abc'>
>>> re.search('ab[cde]', "abd123abc")
<re.Match object; span=(0, 3), match='abd'>
```
- Complement a character class by specifying `^` as the first
character:
```python
>>> re.search('ab[^cde]', "abc abd abe abf")
<re.Match object; span=(12, 15), match='abf'>
```
--
### Regular Expressions in Python
Metacharacters
- The dot (.) metacharacter matches any character except a newline,
so it functions like a wildcard:
```python[1-2|3-4]
>>> re.search('1.3', "abc123abc")
<re.Match object; span=(3, 6), match='123'>
>>> print(re.search('foo.bar', 'foobar'))
None
```
--
### Regular Expressions in Python
Metacharacters
- Quantifiers:
- Indicates how many times that portion must occur for the match to succeed
- `*` matches zero or more repetitions of the preceding regex:
```python[1-2|3-4|5-6]
>>> re.search('foo-*bar', 'foobar') # Zero dashes
<re.Match object; span=(0, 6), match='foobar'>
>>> re.search('foo-*bar', 'foo-bar') # One dash
<re.Match object; span=(0, 7), match='foo-bar'>
>>> re.search('foo-*bar', 'foo--bar') # Two dashes
<re.Match object; span=(0, 8), match='foo--bar'>
```
- `.*` This matches zero or more occurrences of any character:
```python
>>> re.search('foo.*bar', '# foo $qux@grault % bar #')
<re.Match object; span=(2, 23), match='foo $qux@grault % bar'>
```
--
### Regular Expressions in Python
Metacharacters
- Quantifiers:
- `+` matches one or more repetitions of the preceding regex;
- `?` matches zero or one repetitions of the preceding regex;
- `{m}` matches exactly `m` repetitions of the preceding regex;
- `{m,n}` matches any number of repetitions of the preceding regex
from `m` to `n`, inclusive.
--
### Regular Expressions in Python
Metacharacters
- Anchors: zero-width matches.
- Don’t match any actual characters in the search string;
- Dictates a particular location in the search string where a match
must occur;
- `^` or `\A` anchor a match to the start of the text:
```python[1-2|3-4]
>>> re.search('^foo', 'foobar')
<re.Match object; span=(0, 3), match='foo'>
>>> re.search('^foo', 'barfoo')
None
```
- `$` or `\Z` anchor a match to the end of the text:
```python[1-2|3-4]
>>> re.search('bar$', 'foobar')
<re.Match object; span=(3, 6), match='bar'>
>>> re.search('bar$', 'barfoo')
None
```
--
### Regular Expressions
Why are they so important?
- Removing uninportant text:
- Use the `re.sub()` function;
- Strip all non alphanumerical characters:
```python
>>> re.sub('[^a-zA-Z0-9_]+', '', "foo1!bar#~2")
'foo1bar2'
```
- Remove all punctuation signs (`\s` represents any whitespace
character):
```python
>>> re.sub("[^a-zA-Z0-9_]\s",' ',"Foo. Bar? Foo.Bar")
'Foo Bar Foo.Bar'
```
- Remove all numbers from the text:
```python
>>> re.sub('[0-9]+', '', "Foo 123 Bar 456")
'Foo Bar '
```
---
## Further Cleaning the Text
<!-- .slide: data-background="#6d8ba7" -->
Standard text mining procedures to extract useful features from the
news contents, including:
<ul>
<li><b>tokenization</b>,</li>
<li><b>removing stopwords</b>, and </li>
<li><b>lemmatization/stemming</b></li>
</ul>
--
### Tokenization
- The process of breaking strings into tokens which in turn are small
structures or units;
- Taking individual words rather than sentences breaks down the
connections between words
- It is efficient and convenient for computers to analyze
- Most of the times, discovering what words appear in a text and
counting the times that these words appear is sufficient to give
insightful results;
- Transforms a text into a list of words;
- Remove useless information (using Regular Expressions!)
--
### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)
- Leading platform for building Python programs to work with human
language data
- A wonderful tool for teaching, and working in, computational
linguistics using Python;
- Free reference [book](http://www.nltk.org/book/)
- Open Source;
```python
>>> import nltk
>>> nltk.download()
```
--
### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)
Tokenization
```python[1-2|4-8|10-11]
# importing word_tokenize from nltk
from nltk.tokenize import word_tokenize
text = """
Ground control to Major Tom (10, 9, 8, 7).
Commencing countdown, engines on (6, 5, 4, 3).
Check ignition, and may God's love be with you (2, 1, lift off).
"""
token = word_tokenize(text)
print(token)
```
```
['Ground', 'control', 'to', 'Major', 'Tom', '(', '10', ',', '9', ',', '8', ',', '7', ')', '.', 'Commencing', 'countdown', ',', 'engines', 'on', '(', '6', ',', '5', ',', '4', ',', '3', ')', '.', 'Check', 'ignition', ',', 'and', 'may', 'God', "'s", 'love', 'be', 'with', 'you', '(', '2', ',', '1', ',', 'lift', 'off', ')', '.']
```
--
### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)
Cleaning Before Tokenization
```python[1-2|4-5|7-8|10-12|14-16]
# Remove non alphanumeric character
text = re.sub("[^a-zA-Z0-9]",' ',text)
# Remove all numbers
text = re.sub('[0-9]+', '', text)
# Remove extra whitespaces
text = re.sub('\s+', ' ', text)
# Remove beginning and trailing whitespaces
text = re.sub('^\s', '', text)
text = re.sub('\s$', '', text)
```
--
### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)
Tokenization
```python
# Now tokenize
token = word_tokenize(text)
print(token)
```
```
['Ground', 'control', 'to', 'Major', 'Tom', 'Commencing', 'countdown', 'engines', 'on', 'Check', 'ignition', 'and', 'may', 'God', 's', 'love', 'be', 'with', 'you', 'lift', 'off']
```
- Finding the frequency distinct in the tokens
```python[1|3-4]
from nltk.probability import FreqDist
fdist = FreqDist(token)
fdist
```
```
FreqDist({'Ground': 1, 'control': 1, 'to': 1, 'Major': 1, 'Tom': 1, 'Commencing': 1, 'countdown': 1, 'engines': 1, 'on': 1, 'Check': 1, ...})
```
--
### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)
Tokenization
- Let's Inspect the lines from
[Ghostbusters](https://www.imdb.com/title/tt0087332/) from the
[Cornell Movie Lines
Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):
```python[1-3|5-6|8-10]
# load data and concatenate all lyrics
movie_db = pd.read_csv("./movie_lines_db.csv")
ghostbusters_lines = movie_db[movie_db['movie_name']=='ghostbusters'].iloc[0,1]
# tokenize
token = word_tokenize(ghostbusters_lines)
# build the frequency distribution and print the top 10 occurences:
fdist = FreqDist(token)
fdist.most_common(10)
```
```
[('.', 535), ('I', 223), (',', 201), ('the', 162), ('?', 160), ('you', 158), ('to', 123), ('a', 110), ("'s", 85), ('it', 82)]
```
--
### Removing Stop Words
- Remove the useless words;
- Words that frequently appear in many text fragments, but without
significant meanings;
- Do not provide any meaning and are usually removed from texts
- e.g.: ‘I’, ‘the’, ‘a’, ‘of’
- Import the stopwords from the NLTK library;
--
### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)
Removing stopwords
- Continuing with the
[Ghostbusters](https://www.imdb.com/title/tt0087332/) lines from the
[Cornell Movie Lines
Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):
```python
# importing stopwors from nltk library
from nltk import word_tokenize
from nltk.corpus import stopwords
english_stopwords = set(stopwords.words('english'))
# Create list of tokens coverting all text to lower case
ghostbusters_token = word_tokenize(ghostsbusters_lines.lower())
# List words from lines that are not in the set of stopwords
clean_lines = [word for word in ghostbusters_token if word not in
english_stopwords]
print(clean_lines)
```
```
["'s", 'matter', ',', 'dear', '?', 'uuuuuuugh', '!', '!', 'hey', ',',
'sweetheart', ',', 'cut',...]
```
--
### Stemming
- refers to normalizing words into its base form or root form;
- two methods in Stemming namely, Porter Stemming (removes common
morphological and inflectional endings from words) and Lancaster
Stemming (a more aggressive stemming algorithm)
--
### Python [Natural Language Toolkit - NLTK](https://www.nltk.org/)
Removing stopwords
- Continuing with the
[Ghostbusters](https://www.imdb.com/title/tt0087332/) lines from the
[Cornell Movie Lines
Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html):
```python
# Importing Porterstemmer from nltk library
from nltk.stem import PorterStemmer
pst = PorterStemmer()
stemmed_tokens = [pst.stem(word) for word in clean_lines]
print(stemmed_tokes)
```
```
[...,'probabl', 'would', "'ve", '.', "n't", 'glad', 'wait', '?','go', 'say', '``', 'eight', '.', "''", "'re", 'fantast', '!','eight', "o'clock", '?', 'okay', '.', 'give', 'second', '.','leav',...]
```
---
## Sentiment Analysis
<!-- .slide: data-background="#6d8ba7" -->
--
### Sentiment Analysis
- Also known as opinion mining;
- Used to determine whether data is positive, negative or neutral;
- focus on polarity (positive, negative, neutral);
- Help businesses monitor brand and product sentiment in customer
feedback, and understand customer needs;
- Can use heavy Machine Learning techniques, such as Artificial
Neural Networks, or can be based on Rules;
--
### Sentiment Analysis
Rule Based
- Include various NLP techniques developed in computational
linguistics, such as: Stemming, Tokenization, and Text cleaning;
- Use of Sentiment Lexicons (i.e. lists of words and expressions);
- Very naive since they don't take into account how words are
combined in a sequence;
- Leave out the context;
- Often require fine-tuning and maintenance;
--
### Python Sentiment Analysis
After cleaning, tokenizing and stemming the text:
- Use well-known lexicons, e.g.: [The subjectivity
lexicon](http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/) or the
[Valence Aware Dictionary and sEntiment Reasoner -
VADER](https://github.com/cjhutto/vaderSentiment)
```python[1-4|6-7|9-11|13-14]
# first load VADER
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
# instantiate the sentiment analyzer:
vader = SentimentIntensityAnalyzer()
# check scores for some phrases:
print(vader.polarity_scores("This was the best idea I've had in a long time"))
{'neg': 0.0, 'neu': 0.682, 'pos': 0.318, 'compound': 0.6369}
print(vader.polarity_scores("This was the worst idea I've had in a long time"))
{'neg': 0.313, 'neu': 0.687, 'pos': 0.0, 'compound': -0.6249}
```
---
## Questions?
<!-- .slide: data-background="#6d8ba7" -->
---
## References
--
### Examples mainly taken from:
- https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4
- https://monkeylearn.com/sentiment-analysis/
- https://medium.com/towards-artificial-intelligence/text-mining-in-python-steps-and-examples-78b3f8fd913b
- https://towardsdatascience.com/a-step-by-step-tutorial-for-conducting-sentiment-analysis-a7190a444366
- https://towardsdatascience.com/getting-started-with-text-analysis-in-python-ca13590eb4f7
--
### Some books:
- Ingersoll, Grant, Thomas Morton, and Andrew Farris. "Taming text." How to Find, Organize, and Manipulate It, Shelter Island, NY/London (2013).
- BIRD, Steven; KLEIN, Ewan; LOPER, Edward. Natural language processing with Python: analyzing text with the natural language toolkit. " O'Reilly Media, Inc.", 2009.
- VASILIEV, Yuli. Natural Language Processing with Python and SpaCy: A Practical Introduction. No Starch Press, 2020.
- BENGFORT, Benjamin; BILBRO, Rebecca; OJEDA, Tony. Applied text analysis with python: Enabling language-aware data products with machine learning. " O'Reilly Media, Inc.", 2018.
---
## Thanks!
<!-- .slide: data-background="#6d8ba7" -->
<br/><br/><br/><br/>
#### raphaelmcobe@gmail.com
#### [@raphaelmcobe](twitter.com/raphaelmcobe)
#### [CODATA-RDA Data Science Schools](https://www.datascienceschools.org/)
</textarea>
</section>
</div>
</div>
<script src="dist/reveal.js"></script>
<script src="plugin/zoom/zoom.js"></script>
<script src="plugin/notes/notes.js"></script>
<script src="plugin/search/search.js"></script>
<script src="plugin/markdown/markdown.js"></script>
<script src="plugin/highlight/highlight.js"></script>
<script>
// Also available as an ES module, see:
// https://revealjs.com/initialization/
Reveal.initialize({
controls: true,
progress: true,
center: true,
hash: true,
width: 1280,
height: 720,
// Learn about plugins: https://revealjs.com/plugins/
plugins: [ RevealZoom, RevealNotes, RevealSearch, RevealMarkdown, RevealHighlight ]
});
</script>
</body>
</html>