By: Parimal Mehta (@prmehta24) and Parth Panchal (@Parth19499)
This program crawls a given website and the pages it links to, up to a given depth, looking for a given keyword. It returns:
- the paragraphs in which the keyword is found
- the number of visited links
- the links visited, as well as the number of occurrences of the keyword per link (a sketch of the overall approach follows this list)
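WebCrawler.pl itself is not reproduced here, but the sketch below shows one way the modules listed in the install step support this kind of depth-limited keyword crawl. It is a minimal sketch, not the script's actual logic: the start URL, keyword, depth limit, and per-page link cap are hypothetical placeholders, and the paragraph handling is a rough stand-in.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use HTML::Strip;

my $keyword  = 'perl';                  # hypothetical search keyword
my $start    = 'https://example.com/';  # hypothetical base URL
my $maxdepth = 2;                       # hypothetical depth limit

my $ua = LWP::UserAgent->new( timeout => 10 );
my %count;    # URL => keyword occurrences in that page's source
my %seen;     # URLs already visited

sub crawl {
    my ( $url, $depth ) = @_;
    return if $depth > $maxdepth || $seen{$url}++;

    my $res = $ua->get($url);
    return unless $res->is_success;
    my $html = $res->decoded_content;

    # Count keyword occurrences in the raw source (cf. the limitation
    # noted below: this counts markup hits too, not just paragraphs).
    $count{$url} = () = $html =~ /\Q$keyword\E/gi;

    # Strip tags and print text fragments containing the keyword; the
    # real script's paragraph extraction may differ (its module list
    # includes HTML::DOM).
    my $hs   = HTML::Strip->new;
    my $text = $hs->parse($html);
    $hs->eof;
    print "$_\n" for grep { /\Q$keyword\E/i } split /\n+/, $text;

    # Collect links (made absolute via the base URL) and recurse.
    my @links;
    my $extor = HTML::LinkExtor->new(
        sub {
            my ( $tag, %attr ) = @_;
            push @links, $attr{href} if $tag eq 'a' && $attr{href};
        },
        $url
    );
    $extor->parse($html);
    splice @links, 10 if @links > 10;    # per-page link cap, cf. $linklim
    crawl( $_, $depth + 1 ) for @links;
}

crawl( $start, 0 );
print scalar( keys %seen ), " links visited\n";
print "$_: $count{$_} occurrence(s)\n" for sort keys %count;
```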
- Clone the repository with `git clone https://github.com/Parth19499/WebCrawler.git`, or download the repo and unzip it.
- In Command Prompt, install the required modules: `cpan LWP::UserAgent HTML::LinkExtor URI::URL LWP::Simple HTTP::Request HTTP::Response HTML::Strip HTML::DOM`
- In Command Prompt, navigate to the cloned folder and run: `perl WebCrawler.pl`
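If any of the modules failed to install, the script will die at compile time. A quick way to verify beforehand (this one-liner simply loads each module from the list above and prints a message only if all of them are available):

```
perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -MLWP::Simple -MHTTP::Request -MHTTP::Response -MHTML::Strip -MHTML::DOM -e "print qq(All modules found\n)"
```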
- Only allows one-word searches.
- Not all websites are accessible to the crawler.
- HTML::Strip, which is used to remove tags from the text, is not perfect (see https://metacpan.org/pod/HTML::Strip#LIMITATIONS).
- The number of occurrences stored in the hashmap counts occurrences in a page's source code, not just in its paragraphs (illustrated in the sketch below).
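To make that last limitation concrete, here is a minimal sketch (with a hypothetical keyword and page snippet) of why a source-code count can exceed the visible-text count: the keyword may also appear in attributes, URLs, or scripts that HTML::Strip removes.

```perl
use strict;
use warnings;
use HTML::Strip;

my $keyword = 'cats';                                           # hypothetical
my $html    = '<a href="/cats">pictures</a><p>I like cats.</p>';

# Raw source: matches the href *and* the paragraph text => 2.
my $in_source = () = $html =~ /\Q$keyword\E/gi;

# Stripped text: matches only the visible paragraph text => 1.
my $hs   = HTML::Strip->new;
my $text = $hs->parse($html);
$hs->eof;
my $in_text = () = $text =~ /\Q$keyword\E/gi;

print "source: $in_source, visible text: $in_text\n";   # source: 2, visible text: 1
```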
- To change the base URL for crawling, edit the `$url` variable.
- To change the number of links traversed per page, edit the `$linklim` variable (both shown in the sketch below).
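The variable names come from the script, but the surrounding code is not shown here; a configuration block of this kind typically looks something like the following (values are illustrative only):

```perl
# Near the top of WebCrawler.pl (illustrative values; edit to taste):
my $url     = 'https://example.com/';   # base URL where crawling starts
my $linklim = 10;                       # maximum links traversed per page
```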
- Store the retrieved paragraphs in a file (one possible approach is sketched below).
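This feature is not implemented yet; one possible approach, sketched under the assumption that the matching paragraphs are collected in an array, would be to append them to a results file (the filename and `@paragraphs` variable are hypothetical, not part of the current script):

```perl
use strict;
use warnings;

# Assumed: @paragraphs holds the keyword-matching paragraphs for a page.
my @paragraphs = ('Example paragraph mentioning the keyword.');

open my $fh, '>>', 'results.txt' or die "Cannot open results.txt: $!";
print {$fh} "$_\n\n" for @paragraphs;
close $fh;
```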