README
Sascha Zinke
Maximilian Haeckel

Starting the crawler:
% cd crawler/
% jython27 crawler.py http://www.udacity.com/cs101x/index.html -i
MainThread [INFO]: creating new Crawler..
MainThread [INFO]: starting crawling..
Thread-1 [INFO]: target(0): http://udacity.com/cs101x/index.html
Thread-1 [INFO]: target(1): http://udacity.com/cs101x/crawling.html
Thread-2 [INFO]: target(1): http://udacity.com/cs101x/flying.html
Thread-3 [INFO]: target(1): http://udacity.com/cs101x/walking.html
Thread-1 [INFO]: target(2): http://udacity.com/cs101x/kicking.html
http://udacity.com/cs101x/crawling.html
http://udacity.com/cs101x/index.html
http://udacity.com/cs101x/walking.html
http://udacity.com/cs101x/kicking.html
http://udacity.com/cs101x/flying.html
finished in 1.618000s
> kick
-> (1.405591) http://udacity.com/cs101x/kicking.html
u'<html>\n<body>\n<b>Kick! Kick! Kick!</b>\n</body>\n</html>\n'
-> (1.405591) http://udacity.com/cs101x/kicking.html
u'<html>\n<body>\n<b>Kick! Kick! Kick!</b>\n</body>\n</html>\n'
...
Note:
Check the output for import warnings. The indexer only works if the import
succeeded!
The crawling algorithm mainly consists of the following steps:

    provide initial target
              |
    ----------|
   /          |
  |           v
  |   __________________
  |  |                  |   The crawler provides an internal target
  |  | fill target queue|   queue which contains the URLs it has not
  |  |__________________|   visited yet.
  |           |
  |           |   get a target from the queue
  |           |   drop if invalid media type
  |           v
  |   __________________
  |  |                  |
  |  |  request website |   get the requested target website
  |  |__________________|
  |           |
  |           v
  |   __________________
  |  |                  |   extract interesting information
  |  |  extract inform. |   (links, metadata, mail addresses)
  |  |__________________|
  |          / \
   \        /   \
    --------     |
  fill in target |
  if not in      v
  result  __________________
         |                  |   the information is stored
         | fill result queue|   in a result queue
         |__________________|
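The steps above can be sketched as a single-threaded loop. This is a minimal
sketch, not the project's code: it uses an in-memory dictionary as a stand-in
for real HTTP requests, all names are assumptions, and the real crawler runs
the loop in several threads.

```python
import re
from queue import Queue

# Tiny in-memory "web" standing in for real HTTP requests.
PAGES = {
    'http://udacity.com/cs101x/index.html':
        '<a href="http://udacity.com/cs101x/crawling.html">crawl</a>',
    'http://udacity.com/cs101x/crawling.html':
        '<a href="http://udacity.com/cs101x/index.html">back</a>',
}

def crawl(start, max_depth=2):
    targets, seen, results = Queue(), set(), []
    targets.put((start, 0))                # provide initial target
    seen.add(start)
    while not targets.empty():
        url, depth = targets.get()         # get a target from the queue
        page = PAGES.get(url)              # request the website
        if page is None:                   # drop unreachable/invalid targets
            continue
        results.append(url)                # fill the result queue
        if depth < max_depth:
            # extract links and feed new targets back into the queue,
            # but only if they have not been seen yet
            for link in re.findall(r'href="([^"]+)"', page):
                if link not in seen:
                    seen.add(link)
                    targets.put((link, depth + 1))
    return results
```

Run with e.g. `crawl('http://udacity.com/cs101x/index.html')`; the `seen` set
plays the role the UniqueQueue plays in the real crawler.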
The crawler provides an interactive mode ('-i'). If used, the crawler takes
input to build indexer queries and writes the result to stdout. You can also
set the crawling depth ('--depth') and the number of crawler threads
('--threads').
Indexing:
This setup uses the Apache Lucene indexer in Java. To get it working, you
need to run the crawler with Jython - otherwise the import of the indexer
will fail. This has no effect on the crawling itself.
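Such an optional import can be guarded so that crawling keeps working under a
plain Python interpreter. This is only a sketch of the idea; the module name
'lucene' is an assumption, not necessarily what the project imports.

```python
# Hedged sketch: guard the indexer import so crawling still works when
# the Lucene classes are unavailable (the module name is an assumption).
try:
    import lucene                         # only resolvable with Jython/Lucene set up
    HAVE_INDEXER = True
except ImportError as err:
    print('indexer disabled: %s' % err)   # this is the import warning to check for
    HAVE_INDEXER = False
```

Code that builds indexer queries can then check HAVE_INDEXER before running.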
Some implementation details:
To provide a queue that only puts elements once, Queue.Queue's hooks for
_init, _put and _get are overridden to use the internal data type set().
This is quite cool, since it does not affect the multithreading
functionality of Queue.Queue and guarantees that no element is put a
second time. (UniqueQueue, scn.py)
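The idea could look like the following sketch. The original targets Jython
2.7's Queue.Queue; this version uses Python 3's queue module, and the member
names besides the hooks are assumptions.

```python
import queue

class UniqueQueue(queue.Queue):
    """Sketch of the UniqueQueue idea: override Queue's _init/_put/_get
    hooks to back the queue with sets, so each element is accepted once."""

    def _init(self, maxsize):
        self.queue = set()   # items waiting to be fetched
        self.seen = set()    # every item ever enqueued (assumed name)

    def _put(self, item):
        if item not in self.seen:   # silently drop repeated puts
            self.seen.add(item)
            self.queue.add(item)

    def _get(self):
        return self.queue.pop()     # arbitrary order, unlike FIFO
```

The locking in queue.Queue wraps these hooks, so thread safety is preserved;
note that with a set the retrieval order is arbitrary rather than FIFO.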
To get a website's content, the Extractor class inherits from Python's
HTMLParser class and provides methods to extract everything of a website
that might contain interesting information. When the HTMLParser encounters
an opening tag, it calls the Extractor's handle_starttag.
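The mechanism can be sketched like this, here collecting only link targets
(the original parses with the Python 2 HTMLParser and extracts more kinds of
information; the attribute names below are assumptions):

```python
from html.parser import HTMLParser

class Extractor(HTMLParser):
    """Sketch: collect href targets via the handle_starttag callback."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called by HTMLParser for every opening tag; attrs is a list
        # of (name, value) pairs. Collect href attributes of <a> tags.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

extractor = Extractor()
extractor.feed('<a href="http://udacity.com/cs101x/kicking.html">kick</a>')
```

After feed() returns, extractor.links holds the URLs found on the page.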