Web Scraper

By: Parimal Mehta (@prmehta24) and Parth Panchal (@Parth19499)

This program crawls a given website, following its links up to a given depth in search of a given keyword (a minimal sketch of such a crawl loop appears after the list below). It returns:

  • the paragraphs in which the keyword is found
  • the number of links visited
  • the links visited, along with the number of occurrences of the keyword per link
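
For orientation, here is a minimal sketch of such a crawl loop, assuming the modules listed under Prerequisites. It is an illustration, not the code in WebCrawler.pl; the keyword, base URL, depth, and per-page link cap are placeholder values.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

my $ua      = LWP::UserAgent->new(timeout => 10);
my $keyword = 'perl';   # placeholder search term
my $linklim = 10;       # max links followed per page (see Notes below)
my %count;              # keyword occurrences per visited link
my %seen;               # guards against revisiting the same URL

sub crawl {
    my ($url, $depth) = @_;
    return if $depth < 0 || $seen{$url}++;

    my $res = $ua->get($url);
    return unless $res->is_success;
    my $html = $res->decoded_content;

    # Count keyword occurrences on this page (in the raw source).
    $count{$url} = () = $html =~ /\Q$keyword\E/gi;

    # Collect links, cap them at $linklim, and recurse one level deeper.
    my @links;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, $attr{href} if $tag eq 'a' && $attr{href};
    });
    $extor->parse($html);
    splice @links, $linklim if @links > $linklim;

    crawl(url($_, $url)->abs->as_string, $depth - 1) for @links;
}

crawl('https://example.com/', 2);   # placeholder base URL and depth
print scalar(keys %count), " links visited\n";
print "$_: $count{$_} occurrence(s)\n" for sort keys %count;
```

Run against a real site, %count ends up holding one entry per visited link, which covers the second and third bullets above.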

Prerequisites

Perl 5 with the cpan client available on the command line (the required CPAN modules are installed in the second step below).

Steps to Run:

  • git clone https://github.com/Parth19499/WebCrawler.git, or download the repository and unzip it.
  • In Command Prompt, install the required modules: cpan LWP::UserAgent HTML::LinkExtor URI::URL LWP::Simple HTTP::Request HTTP::Response HTML::Strip HTML::DOM
  • In Command Prompt, navigate to the cloned folder and run perl WebCrawler.pl

Limitations

  • Only single-word searches are supported.
  • Not all websites are accessible to the crawler.
  • HTML::Strip, which is used to remove tags from the text, is not perfect (see https://metacpan.org/pod/HTML::Strip#LIMITATIONS).
  • The number of occurrences stored in the hashmap is for occurrences in the page source, not just in paragraphs (illustrated in the sketch below).
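
To illustrate the last two points, here is a minimal sketch comparing a count over the raw source with a count over HTML::Strip'ed text; the keyword and markup are made up for the example:

```perl
use strict;
use warnings;
use HTML::Strip;

my $keyword = 'perl';   # made-up search term
my $html    = '<a href="perl.html">Perl</a> <p>perl is fun</p>';

# Occurrences in the raw source: matches inside tags and attributes count too.
my $in_source = () = $html =~ /\Q$keyword\E/gi;

# Occurrences in the visible text only, after stripping tags.
my $hs   = HTML::Strip->new();
my $text = $hs->parse($html);
$hs->eof;
my $in_text = () = $text =~ /\Q$keyword\E/gi;

print "source: $in_source, visible text: $in_text\n";   # source: 3, visible text: 2
```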

Notes

  • To change the base URL for crawling, edit the $url variable
  • To change the number of links traversed per page, edit the $linklim variable (both shown in the snippet below)
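
A sketch of the two settings as they might appear in WebCrawler.pl (the values shown are illustrative, not the script's defaults):

```perl
my $url     = 'https://example.com/';   # base URL where crawling starts
my $linklim = 10;                       # maximum number of links traversed per page
```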

Future Work

  • Store the retrieved paragraphs in a file (one possible approach is sketched below)
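
A minimal sketch of that direction, assuming the matched paragraphs have been collected into an array; the array contents and output filename here are hypothetical, not part of the current script:

```perl
use strict;
use warnings;

my @paragraphs = ('a paragraph containing the keyword');   # placeholder data

# Append each matched paragraph to a results file (hypothetical name).
open my $fh, '>>', 'results.txt' or die "Cannot open results.txt: $!";
print {$fh} "$_\n\n" for @paragraphs;
close $fh or die "Cannot close results.txt: $!";
```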
