collect_data_across_a_site.py
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

pages = set()


def getLinks(pageUrl):
    """Collect specific data across an entire website.

    Web crawlers would be fairly boring if all they did was hop from one
    page to the other. To make them useful, you need to be able to do
    something on the page while you're there. This scraper collects the
    title, the first paragraph of content, and the link to edit the page
    (if available).

    Args:
        pageUrl (string): path of the Wikipedia article to crawl,
            relative to http://en.wikipedia.org (e.g. '/wiki/Python').
    """
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something! Continuing.')

    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print('-' * 20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)


getLinks('')
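

# A minimal, hypothetical sketch (not in the original): the recursion above
# has no stopping condition other than running out of unseen links, so on a
# site as large as Wikipedia it will eventually hit Python's default
# recursion limit. One way to bound the crawl is a depth cap; the function
# name and the maxDepth value below are illustrative assumptions, not part
# of the source.
def getLinksDepthLimited(pageUrl, depth=0, maxDepth=3):
    """Crawl like getLinks, but stop recursing beyond maxDepth."""
    if depth > maxDepth:
        return
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs and link.attrs['href'] not in pages:
            newPage = link.attrs['href']
            pages.add(newPage)
            # Each recursive call is one level deeper in the link graph.
            getLinksDepthLimited(newPage, depth + 1, maxDepth)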