A straightforward library that allows you to crawl, clean up, and deduplicate webpages.
1. SSH into your first node (gpfs1) and install the prerequisites for LazyNLP (a quick install check follows the commands):
```
* sudo apt-get --purge remove gpfs.gss.pmcollector gpfs.gui
* sudo apt-get install -y python3 python3-dev python3-setuptools python3-pip
* git clone https://github.com/chiphuyen/lazynlp.git
* cd lazynlp
* pip3 install -r requirements.txt
* pip3 install .
```
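Once those commands finish, a quick import check (a generic Python sanity check, not part of the assignment) confirms the package is visible to Python 3:

```python
# If this prints a site-packages path, lazynlp and its
# requirements installed correctly.
import lazynlp
print(lazynlp.__file__)
```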
4. Let's use the library to crawl a medium-sized dataset. The approach: download dumps of URLs that have already been deduplicated, then clean them and prepare them for processing.
```
* pip3 install gdown
* gdown https://drive.google.com/uc?id=1hRtA3zZ0K5UHKOQ0_8d0BIc_1VyxgY51
* unzip reddit_urls.zip
```
5. In step 4 a urls.txt file was downloaded; use it as a base for the [crawler code](https://github.com/MIDS-scaling-up/v2/blob/master/week12/hw/crawler.py). The main idea is to use the function lazynlp.download_pages(), as in the sketch below.
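A minimal sketch of what that crawler needs to do, assuming urls.txt sits in the working directory (the output folder name and timeout below are illustrative choices, not requirements):

```python
# Download every page listed in urls.txt (one URL per line) into pages/.
import lazynlp

lazynlp.download_pages(
    "urls.txt",         # the deduplicated URL list from step 4
    "pages/",           # output folder for the raw downloaded pages
    timeout=30,         # seconds to wait on a slow URL before skipping it
    default_skip=True,  # skip domains/extensions known to be crawler-unfriendly
)
```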

6. Use the following URL dumps (a loop over all three lists is sketched after this block):
```
* gdown https://drive.google.com/uc?id=1hRtA3zZ0K5UHKOQ0_8d0BIc_1VyxgY51 (done in step 4)
* gdown https://drive.google.com/uc?id=1zIVaRaVqGP8VNBUT4eKAzW3gYWxNk728
* gdown https://drive.google.com/uc?id=1C5aSisXMC3S3OXBFbnETLeK3UTUXEXrC
* wget https://dumps.wikimedia.org/enwiktionary/20190301/enwiktionary-20190301-pages-articles-multistream.xml.bz2 (notice this is not a list of URLs but a dump of the text itself)
```
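One way to push all three URL lists through the same crawl (a sketch; the file names below are placeholders for whatever the downloads actually unpack to, so check them first):

```python
# Crawl each deduplicated URL list into its own folder so the runs
# don't overwrite each other. File names below are placeholders.
import os
import lazynlp

url_lists = ["urls.txt", "urls2.txt", "urls3.txt"]  # substitute real names

for link_file in url_lists:
    folder = os.path.splitext(link_file)[0] + "_pages"
    lazynlp.download_pages(link_file, folder, timeout=30, default_skip=True)
```

The wiktionary dump is the odd one out: decompress it with bunzip2 and you already have the article content locally, so there is nothing to crawl.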
7. Be creative with the crawler (multithread it, run it in the background with nohup ... &) and put the data into the distributed storage that we will use for the class lab; one possible parallel setup is sketched below.
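One possible take on the multithreading hint (a sketch under the assumption that splitting the URL list across processes is acceptable; download_pages() is network-bound, so separate processes writing to separate folders parallelize it cleanly):

```python
# Parallel crawl sketch: split the URL list into N chunks and crawl
# each chunk in its own process, each into its own output folder.
from multiprocessing import Process

import lazynlp

N = 4  # illustrative worker count; tune to your node


def split(link_file, n):
    """Write every n-th URL to its own chunk file; return the chunk names."""
    with open(link_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    names = []
    for i in range(n):
        name = f"chunk_{i}.txt"
        with open(name, "w") as out:
            out.write("\n".join(urls[i::n]) + "\n")
        names.append(name)
    return names


if __name__ == "__main__":
    procs = [
        Process(
            target=lazynlp.download_pages,
            args=(chunk, f"pages_{i}"),
            kwargs={"timeout": 30, "default_skip": True},
        )
        for i, chunk in enumerate(split("urls.txt", N))
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Launched as nohup python3 crawl.py & it keeps running after you close the SSH session; write the pages_* folders onto the distributed storage so every node can read the result.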
