An automatic Web crawler which collects hotel reviews from TripAdvisor.com and stores data in MongoDB.
1. Install common-node (https://www.npmjs.com/package/common-node):
    sudo npm -g install common-node
2. Install PhantomJS (http://phantomjs.org/):
    sudo npm -g install phantomjs
3. Install nodejs-legacy (required by common-node):
    sudo apt-get install nodejs-legacy

- Edit src/0-add-city.js.
- Run ./0-add-cities.sh to add city data.
- Run ./1-collect-city-hotels.sh to collect hotel URLs.
- Run ./2-collect-hotel-reviews.sh to collect review URLs (you may need to run it more than once until all pages have been processed).
- Run ./3-get-reviews-html.sh to download the HTML content of the reviews (you may need to run it more than once until all reviews have been processed).
- Run ./4-get-blocked-review-html.sh to download the HTML content of reviews that were blocked in the previous step. This may take many steps (note that this script runs using nodejs, not common-node).
- Finally, run ./5-process-review-html.sh (edit success_status_code and fail_status_code before running the script).
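
The run steps above can be chained in a small wrapper script. This is only a minimal sketch and not part of the repository: the `run_repeated` and `run_pipeline` helpers are hypothetical, and the fixed repeat count (default 3) is an assumption, because the scripts do not document how to detect that all pages or reviews have been processed. It also assumes the numbered scripts are in the current directory, that they exit with a non-zero status on failure, and that src/0-add-city.js and the status codes for step 5 have already been edited.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): run a command a fixed
# number of times, stopping early if one run exits with a non-zero status.
run_repeated() {
    count=$1
    shift
    i=1
    while [ "$i" -le "$count" ]; do
        echo "run $i/$count: $*"
        "$@" || return 1
        i=$((i + 1))
    done
}

# Hypothetical wrapper: run the crawler steps in order, repeating steps 2 and 3.
# The repeat count is an assumed value; pass a different one as the first argument.
run_pipeline() {
    repeats=${1:-3}
    ./0-add-cities.sh &&
    ./1-collect-city-hotels.sh &&
    run_repeated "$repeats" ./2-collect-hotel-reviews.sh &&
    run_repeated "$repeats" ./3-get-reviews-html.sh &&
    ./4-get-blocked-review-html.sh &&
    ./5-process-review-html.sh
}
```

After sourcing the file, `run_pipeline 5` would repeat steps 2 and 3 five times each. Stopping at the first failing step is a design choice of this sketch. A fixed repeat count does not guarantee that everything has been processed, so rerun the relevant step by hand if entries remain.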