README
Sascha Zinke
Maximilian Haeckel

Starting the crawler:
% cd crawler/
% jython27 crawler.py http://www.udacity.com/cs101x/index.html -i
MainThread [INFO]: creating new Crawler..
MainThread [INFO]: starting crawling..
Thread-1 [INFO]: target(0): http://udacity.com/cs101x/index.html
Thread-1 [INFO]: target(1): http://udacity.com/cs101x/crawling.html
Thread-2 [INFO]: target(1): http://udacity.com/cs101x/flying.html
Thread-3 [INFO]: target(1): http://udacity.com/cs101x/walking.html
Thread-1 [INFO]: target(2): http://udacity.com/cs101x/kicking.html
http://udacity.com/cs101x/crawling.html
http://udacity.com/cs101x/index.html
http://udacity.com/cs101x/walking.html
http://udacity.com/cs101x/kicking.html
http://udacity.com/cs101x/flying.html
finished in 1.618000s
> kick
-> (1.405591) http://udacity.com/cs101x/kicking.html
u'<html>\n<body>\n<b>Kick! Kick! Kick!</b>\n</body>\n</html>\n'
-> (1.405591) http://udacity.com/cs101x/kicking.html
u'<html>\n<body>\n<b>Kick! Kick! Kick!</b>\n</body>\n</html>\n'
...
Note:
Check the output for import warnings. The indexer only works if the import
succeeded!
The crawling algorithm mainly consists of the following steps:

    provide initial target
              |
    ----------|
   /          |
  |           v
  |   __________________
  |  |                  |   The crawler provides an internal target
  |  | fill target queue|   queue which contains the URLs it has not
  |  |__________________|   visited yet.
  |           |
  |           |   get a target from the queue
  |           |   drop if invalid media type
  |           v
  |   __________________
  |  |                  |
  |  |  request website |   get the requested target website
  |  |__________________|
  |           |
  |           v
  |   __________________
  |  |                  |   extract interesting information
  |  |  extract inform. |   (links, metadata, mail addresses)
  |  |__________________|
  |          / \
   \        /   \
    --------     |
  fill in target |
  if not in      v
  result  __________________
         |                  |   the information is stored
         | fill result queue|   in a result queue
         |__________________|
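The steps above can be sketched as a single-threaded loop. This is a minimal
sketch, not the project's code: it uses an in-memory dictionary as a stand-in
for real HTTP requests, all names are assumptions, and the real crawler runs
the loop in several threads.

```python
import re
from queue import Queue

# Tiny in-memory "web" standing in for real HTTP requests.
PAGES = {
    'http://udacity.com/cs101x/index.html':
        '<a href="http://udacity.com/cs101x/crawling.html">crawl</a>',
    'http://udacity.com/cs101x/crawling.html':
        '<a href="http://udacity.com/cs101x/index.html">back</a>',
}

def crawl(start, max_depth=2):
    targets, seen, results = Queue(), set(), []
    targets.put((start, 0))                # provide initial target
    seen.add(start)
    while not targets.empty():
        url, depth = targets.get()         # get a target from the queue
        page = PAGES.get(url)              # request the website
        if page is None:                   # drop unreachable/invalid targets
            continue
        results.append(url)                # fill the result queue
        if depth < max_depth:
            # extract links and feed new targets back into the queue,
            # but only if they have not been seen yet
            for link in re.findall(r'href="([^"]+)"', page):
                if link not in seen:
                    seen.add(link)
                    targets.put((link, depth + 1))
    return results
```

Run with e.g. `crawl('http://udacity.com/cs101x/index.html')`; the `seen` set
plays the role the UniqueQueue plays in the real crawler.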
The crawler provides an interactive mode ('-i'). If used, the crawler takes
input to build indexer queries and writes the result to stdout. You can also
set the crawling depth ('--depth') and the number of crawler threads
('--threads').
Indexing:
This setup uses the Apache Lucene indexer in Java. To get it working, you
need to run the crawler with Jython - otherwise the import of the indexer
will fail. This has no effect on the crawling itself.
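Such an optional import can be guarded so that crawling keeps working under a
plain Python interpreter. This is only a sketch of the idea; the module name
'lucene' is an assumption, not necessarily what the project imports.

```python
# Hedged sketch: guard the indexer import so crawling still works when
# the Lucene classes are unavailable (the module name is an assumption).
try:
    import lucene                         # only resolvable with Jython/Lucene set up
    HAVE_INDEXER = True
except ImportError as err:
    print('indexer disabled: %s' % err)   # this is the import warning to check for
    HAVE_INDEXER = False
```

Code that builds indexer queries can then check HAVE_INDEXER before running.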
Some implementation details:
To provide a queue that only puts elements once, Queue.Queue's hooks for
_init, _put and _get are overridden to use the internal data type set().
This is quite cool, since it does not affect the multithreading
functionality of Queue.Queue and guarantees that no element is put a
second time. (UniqueQueue, scn.py)
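The idea could look like the following sketch. The original targets Jython
2.7's Queue.Queue; this version uses Python 3's queue module, and the member
names besides the hooks are assumptions.

```python
import queue

class UniqueQueue(queue.Queue):
    """Sketch of the UniqueQueue idea: override Queue's _init/_put/_get
    hooks to back the queue with sets, so each element is accepted once."""

    def _init(self, maxsize):
        self.queue = set()   # items waiting to be fetched
        self.seen = set()    # every item ever enqueued (assumed name)

    def _put(self, item):
        if item not in self.seen:   # silently drop repeated puts
            self.seen.add(item)
            self.queue.add(item)

    def _get(self):
        return self.queue.pop()     # arbitrary order, unlike FIFO
```

The locking in queue.Queue wraps these hooks, so thread safety is preserved;
note that with a set the retrieval order is arbitrary rather than FIFO.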
To get a website's content, the Extractor class inherits from Python's
HTMLParser class and provides methods to extract everything of a website
that might contain interesting information. When the HTMLParser encounters
an opening tag, it calls the Extractor's handle_starttag.
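The mechanism can be sketched like this, here collecting only link targets
(the original parses with the Python 2 HTMLParser and extracts more kinds of
information; the attribute names below are assumptions):

```python
from html.parser import HTMLParser

class Extractor(HTMLParser):
    """Sketch: collect href targets via the handle_starttag callback."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called by HTMLParser for every opening tag; attrs is a list
        # of (name, value) pairs. Collect href attributes of <a> tags.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

extractor = Extractor()
extractor.feed('<a href="http://udacity.com/cs101x/kicking.html">kick</a>')
```

After feed() returns, extractor.links holds the URLs found on the page.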