# Hands-on: HTML Parsing
Goal: Learn how to use OpenRefine's HTML parsing capabilities by fetching some [David Price press releases](https://price.house.gov/newsroom/press-releases) and then parsing the content.
1. Import Data
- <span class="or-menu">Create Project > Web Addresses (URLs) ></span> `https://raw.githubusercontent.com/libjohn/openrefine/master/data/price-crawl-and-HTML-parse.csv`
- <span class="or-menu">Next >></span>
- You may want to give your project a descriptive title
- <span class="or-menu">Create Project >></span>
## Fetch
Now let’s fetch the data by crawling a few links to Congressman Price's press releases. This will return large amounts of raw HTML that can be hard to read. So, after fetching, we'll parse the result.
2. Fetch HTML
- <span class="or-menu"> prlink-href > Edit column > Add column by fetching URLs… </span>
- New column name = `raw HTML`
- Throttle delay = `2000` milliseconds
- Expression = <br>
`value`
- <span class="or-menu">OK</span>
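For readers who want to see what the fetch step amounts to outside OpenRefine, here is a minimal Python sketch, not OpenRefine's own implementation. It assumes the throttle delay is simply a pause between consecutive requests; the names `fetch_url` and `fetch_all` are hypothetical and used only for illustration.

```python
import time
from urllib.request import urlopen


def fetch_url(url):
    """Download one page and return its body as text (assumes UTF-8)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_url, delay_ms=2000, sleep=time.sleep):
    """Fetch each URL in order, pausing delay_ms between requests.

    `fetch` and `sleep` are parameters so the loop can be tried
    without touching the network (hypothetical design choice).
    """
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            sleep(delay_ms / 1000)
        pages.append(fetch(url))
    return pages
```

Each returned string plays the role of one cell in the `raw HTML` column. The pause is there to avoid putting too much load on the server being crawled.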
## Parse
Now parse the HTML data.
3. <span class="or-menu"> raw HTML > Edit column > Add column based on this column ... </span>
- New column name = `HTML title`
- expression = `value.parseHtml().select("title")[0].htmlText()` ^[The square-bracket notation (`[0]`) selects the first element of the array returned by `select()`. It is the first element because OpenRefine counts array positions from zero (e.g. 0, 1, 2, 3, 4, 5).]
- <span class="or-menu">OK</span>
3. <span class="or-menu"> raw HTML > Edit column > Add column based on this column ... </span>
- New column name = `body title`
- expression = `value.parseHtml().select("h1#page-title.title")[0].htmlText()`
- <span class="or-menu">OK</span>
4. <span class="or-menu"> raw HTML > Edit column > Add column based on this column ... </span>
- New column name = `date2`
- expression = `value.parseHtml().select("div.pane-content")[0].htmlText()`
- <span class="or-menu">OK</span>
3. <span class="or-menu"> raw HTML > Edit column > Add column based on this column ... </span>
- New column name = `dateline`
- expression = `value.parseHtml().select("div.field-item.even p strong")[0].htmlText()`
- <span class="or-menu">OK</span>
3. <span class="or-menu"> raw HTML > Edit column > Add column based on this column ... </span>
- New column name = `links`
- expression =
```
forEach(
  value.parseHtml().select("div#block-system-main")[0].select("a"),
  e,
  e.htmlAttr("href")
).join("|")
```
- <span class="or-menu">OK</span>
3. <span class="or-menu"> raw HTML > Edit column > Add column based on this column ... </span>
- New column name = `link text`
- expression =
```
forEach(
  value.parseHtml().select("div#block-system-main")[0].select("a"),
  e,
  e.htmlText()
).join("|")
```
- <span class="or-menu">OK</span>
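The last two expressions follow the same pattern: find one container element, collect every `<a>` inside it, pull one value from each link, and join the values with a pipe. The Python sketch below mirrors that pattern using only the standard library. It is a rough illustration, not how OpenRefine works internally; the names `LinkCollector` and `links_and_text` are hypothetical, and the sketch only matches a `div` by its `id` rather than supporting full CSS selectors.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and text of every <a> inside the first <div id=container_id>."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.div_depth = 0    # greater than 0 while inside the container div
        self.done = False     # True once the container div has closed
        self.in_link = False
        self.hrefs = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1          # nested div inside the container
            elif attrs.get("id") == self.container_id and not self.done:
                self.div_depth = 1           # entering the container
        elif tag == "a" and self.div_depth > 0:
            self.in_link = True
            self.hrefs.append(attrs.get("href") or "")
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.done = True             # only the first match counts, like [0]
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.texts[-1] += data           # text may arrive in several pieces


def links_and_text(html, container_id="block-system-main"):
    """Return (pipe-joined hrefs, pipe-joined link texts) for one page."""
    parser = LinkCollector(container_id)
    parser.feed(html)
    parser.close()
    return "|".join(parser.hrefs), "|".join(t.strip() for t in parser.texts)
```

Because both results are built from the same list of links in the same order, the n-th piece of `links` lines up with the n-th piece of `link text`, which is what makes the split in the next section work.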
## Inspect your work...
9. <span class="or-menu"> raw HTML > View > Collapse this column </span>
9. Click the `records` link in the "Show as: **rows** records" section, above the column headers
9. <span class="or-menu"> links > Edit cells > Split multi-valued cells... </span>
- for the *by separator* option, in the *Separator* textbox enter a pipe: `|`
- Repeat this step for the `link text` column
9. Look around. Scroll left to right and see what you've parsed.
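To make the split step concrete, here is a small Python sketch of what happens to a single row when its pipe-joined cells are split. It is a simplified model under stated assumptions, not OpenRefine's implementation: the function name `split_multi_valued` is hypothetical, and blank cells are represented as `None`.

```python
def split_multi_valued(row, columns, separator="|"):
    """Split the given columns of one row into several rows (one record).

    The first row keeps the values of all other columns; the extra
    rows leave those columns blank, as in OpenRefine's records view.
    """
    parts = {c: row[c].split(separator) if row[c] else [] for c in columns}
    n_rows = max(1, max(len(p) for p in parts.values()))
    rows = []
    for i in range(n_rows):
        # Columns that are not split appear only on the record's first row.
        new_row = {k: (v if i == 0 else None)
                   for k, v in row.items() if k not in columns}
        for c in columns:
            new_row[c] = parts[c][i] if i < len(parts[c]) else None
        rows.append(new_row)
    return rows
```

A possible usage, with made-up cell values:

```python
record = split_multi_valued(
    {"HTML title": "Example", "links": "/a|/b", "link text": "One|Two"},
    columns=["links", "link text"],
)
for r in record:
    print(r)
```

Each link ends up on its own row next to its link text, while the title stays on the first row of the record.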