
Structure of the library

General structure

The main class of the program is MultiThreadedCrawler. It runs several threads that crawl the tree, and it is also responsible for the ancillary tasks: making sure that the state of the tree is saved to disk, setting up the logging level, etc. Its constructor accepts a list of objects of class AbstractTreeNavigator.

The class AbstractTreeNavigator is an interface between "the domain tree" explored by the crawler and the generic tree-crawling algorithm. By "the domain tree" we mean a tree-like structure present in the application domain, e.g. a web site whose pages form a tree structure. The class defines basic tree navigation operations (e.g. get_children(), move_to_parent()) that have to be implemented by a concrete subclass.

The MultiThreadedCrawler class creates a single tree-crawling thread for each provided AbstractTreeNavigator object. Each thread uses its AbstractTreeNavigator object to navigate the domain tree.
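
To make this relationship concrete, below is a minimal sketch of a navigator and of setting up the crawler with it. Only the class names MultiThreadedCrawler and AbstractTreeNavigator, the operations get_children() and move_to_parent(), and the constructor accepting a list of navigators come from the description above; the import paths, the DirectoryNavigator example and all remaining details are assumptions made for illustration.

    import os

    # Hypothetical import paths -- check the library sources for the real ones.
    from concurrent_tree_crawler.abstract_tree_navigator import AbstractTreeNavigator
    from concurrent_tree_crawler.multithreaded_crawler import MultiThreadedCrawler

    class DirectoryNavigator(AbstractTreeNavigator):
        """Example navigator that treats a file system subtree as the domain tree."""

        def __init__(self, root_path):
            self._path = root_path

        def get_children(self):
            # Names of the children of the current node in the domain tree.
            return sorted(os.listdir(self._path))

        def move_to_parent(self):
            # Make the parent directory the new current node.
            self._path = os.path.dirname(self._path)

        # A real subclass also has to implement the remaining abstract
        # operations declared by AbstractTreeNavigator.

    # One crawling thread is created per navigator (as described above), so
    # passing four navigators makes four threads explore the tree concurrently.
    crawler = MultiThreadedCrawler([DirectoryNavigator("/tmp/data") for _ in range(4)])
    crawler.run()  # method name assumed by analogy with CmdLnMultiThreadedCrawler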

Fig. 1. General UML diagram of the library


The main classes have counterparts whose names contain a CmdLn prefix, e.g. MultiThreadedCrawler has CmdLnMultiThreadedCrawler, and AbstractTreeNavigator has AbstractCmdLnTreeNavigator as its counterpart. The goal of a counterpart class is to create its main class based on the command-line parameters of the program. This way, e.g. when adding a new subclass of AbstractTreeNavigator, it is straightforward to extend the existing command-line interface without modifying the code of the library -- one just has to create a new class which inherits from an appropriate AbstractCmdLn... class.
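
The division of responsibilities can be pictured roughly as follows. The constructor call and the run() method of CmdLnMultiThreadedCrawler appear in the usage examples later on this page; everything inside the class body below is an assumption, sketched only to show the delegation pattern.

    class CmdLnMultiThreadedCrawler:
        """Sketch of the counterpart pattern, not the library's actual code."""

        def __init__(self, navigators_creator):
            self._navigators_creator = navigators_creator

        def run(self):
            options = self._parse_command_line()  # hypothetical helper
            # Build the navigators from the parsed parameters and delegate
            # the actual crawling to the main class.
            navigators = self._navigators_creator.create(options)
            MultiThreadedCrawler(navigators).run()  # method name assumed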

A sample implementation to crawl a magazine archive web site

The main application domain for this library is crawling a web site with a known tree-like structure. In the library's code, we have placed a sample implementation that crawls a magazine web site. The main page of the web site consists of links to magazine issues, and each magazine issue consists of links to articles. On each level, the links to the pages of the lower level of the hierarchy are not necessarily placed on a single page; they can be distributed among many pages connected to each other by links (see Fig. 2).

Fig. 2. Each node can correspond to many linked pages


The code used to crawl such a web site is placed in the html_multipage_navigator package. Its main class is HTMLMultipageNavigator, which is a descendant of the AbstractTreeNavigator class. As such, it is an interface between the generic tree-crawling algorithm and a domain tree consisting of HTML pages. The HTMLMultipageNavigator class assumes that the hierarchy of the web site is fixed and has a known number of levels. In our sample implementation, the fixed hierarchy is magazine -> issues -> articles, where all the articles are placed on the same (lowest) level of the tree.

Fig. 3. HTML multipage navigator

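Expressed with the Level objects described later on this page, the fixed hierarchy could look roughly like the sketch below. Only the three-level magazine -> issues -> articles structure comes from the text; the Level constructor arguments and the concrete analyzer classes are assumptions.

    class MagazineLevelsCreator(AbstractLevelsCreator):
        def create(self):
            # One Level per tree level, from the root down to the leaves; each
            # analyzer parses a single page of its level (Fig. 2 shows that a
            # node may span several linked pages).
            return [
                Level("magazine", MagazineRootAnalyzer()),  # links to issues
                Level("issue", IssueAnalyzer()),            # links to articles
                Level("article", ArticleAnalyzer()),        # leaf pages
            ]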

Using the library to create one's own crawler

There are two main ways in which this library can be used to create a crawler for a specific task; they are presented in the following subsections.

Note that in both cases a general tip applies: look at how the sample crawling task is implemented to gain some insight into proper usage of the library. The sample task is run from the file concurrent_tree_crawler/bin/sample_download_crawler.py, so you can use it as a starting point when exploring the code.

Crawling tree-like structures from a custom application domain

If you want to create a crawler that explores a specific tree-like domain (e.g. some kind of tree-like computer network, or a tree-like repository of documents), you should execute the following steps.

  1. Create a class inheriting from the AbstractTreeNavigator class. This class is responsible for navigating the tree of your application domain; let us name it MyTreeNavigator (a skeleton sketch follows this list).

  2. Create a class inheriting from AbstractCmdLnNavigatorsCreator. This class creates an appropriate number of MyTreeNavigator objects with parameters parsed from the command line; let us name it MyCmdLnNavigatorsCreator.

  3. To create a script that starts the crawling, create a file with contents similar to those of the concurrent_tree_crawler/bin/sample_download_crawler.py file, i.e. (with import lines omitted for brevity):

     navigators_creator = MyCmdLnNavigatorsCreator()
     crawler = CmdLnMultiThreadedCrawler(navigators_creator)
     crawler.run()
    

    and run the file using Python.
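
A skeleton of the classes from steps 1 and 2 might look as follows. The base class names come from this page; the method names, signatures and command-line options are assumptions made for the sketch.

    class MyTreeNavigator(AbstractTreeNavigator):
        """Step 1: navigates the tree of your application domain."""

        def get_children(self):
            # Return the children of the current node, e.g. the entries of a
            # document repository (hypothetical example).
            raise NotImplementedError

        def move_to_parent(self):
            # Make the parent of the current node the new current node.
            raise NotImplementedError

    class MyCmdLnNavigatorsCreator(AbstractCmdLnNavigatorsCreator):
        """Step 2: builds the navigators from command-line parameters."""

        def create(self, options):
            # Both the create() signature and the option names are invented
            # for this sketch.
            return [MyTreeNavigator(options.start_node)
                    for _ in range(options.threads_number)]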

Crawling HTML web sites with a specific structure

The second way should be followed when you want to create a crawler that explores an HTML web site with a known and fixed tree-like structure. In this approach, the number of tree levels is fixed; what is more, each page on a given tree level has basically the same structure (i.e. it is parsed by the same parser).

  1. For each tree level, create a class that parses and processes a single page on that level. Each such class should inherit from AbstractPageAnalyzer (a sketch of these steps follows this list).

  2. Create a class inheriting from AbstractLevelsCreator. In its create() method, this class returns a list of Level objects corresponding to consecutive levels of the explored web site tree. Each Level object contains the name of the level and an object inheriting from AbstractPageAnalyzer that is used to process a single web page on this level.

  3. Create a class inheriting from AbstractCmdLnLevelsCreator. This class creates an AbstractLevelsCreator object based on command-line parameters. Let us name this new class MyCmdLnLevelsCreator.

  4. To create a script that can be run to start the crawling, create a file with contents similar to those of the concurrent_tree_crawler/bin/sample_download_crawler.py file, i.e. (with import lines omitted for brevity):

     navigators_creator = CmdLnNavigatorsCreator(MyCmdLnLevelsCreator())
     crawler = CmdLnMultiThreadedCrawler(navigators_creator)
     crawler.run()
    

    and run the file using Python.
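
A compact sketch of steps 1-3 could look like the following. The base class names come from this page; the method names, the href patterns and the Level constructor arguments are assumptions.

    import re

    class IssueListAnalyzer(AbstractPageAnalyzer):
        """Step 1: processes a single page on the level of issue listings."""

        def get_links(self, html):
            # Extract links to the level below; the method name and the href
            # pattern are invented for this example.
            return re.findall(r'href="(/issue/[^"]+)"', html)

    class ArticleListAnalyzer(AbstractPageAnalyzer):
        """Step 1: processes a single page on the level of article listings."""

        def get_links(self, html):
            return re.findall(r'href="(/article/[^"]+)"', html)

    class MyLevelsCreator(AbstractLevelsCreator):
        """Step 2: one Level object per level of the web site tree."""

        def create(self):
            # The Level constructor arguments are assumed from the description
            # in step 2 above.
            return [Level("issue", IssueListAnalyzer()),
                    Level("article", ArticleListAnalyzer())]

    class MyCmdLnLevelsCreator(AbstractCmdLnLevelsCreator):
        """Step 3: builds the levels creator from command-line parameters."""

        def create(self, options):
            # The create() signature is hypothetical.
            return MyLevelsCreator()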