A straightforward library that allows you to crawl, clean up, and deduplicate webpages.
1. SSH into your first node (gpfs1) and install the prerequisites for LazyNLP (a quick install check follows the commands):
```
* sudo apt-get --purge remove gpfs.gss.pmcollector gpfs.gui
* sudo apt-get install -y python3 python3-dev python3-setuptools python3-pip
* git clone https://github.com/chiphuyen/lazynlp.git
* cd lazynlp
* pip3 install -r requirements.txt
* pip3 install .
```
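Once those commands finish, a quick import check (a generic Python sanity check, not part of the assignment) confirms the package is visible to Python 3:

```python
# If this prints a site-packages path, lazynlp and its
# requirements installed correctly.
import lazynlp
print(lazynlp.__file__)
```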
4. Let's use the library to crawl a medium-sized dataset. The approach: download dumps of URLs that have already been deduplicated, then clean them and prepare them for processing.
```
* pip3 install gdown
* gdown https://drive.google.com/uc?id=1hRtA3zZ0K5UHKOQ0_8d0BIc_1VyxgY51
* unzip reddit_urls.zip
```
5. In step 4 a urls.txt file was downloaded; use it as a base for the [crawler code](https://github.com/MIDS-scaling-up/v2/blob/master/week12/hw/crawler.py). The main idea is to use the function lazynlp.download_pages(), as in the sketch below.
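A minimal sketch of what that crawler needs to do, assuming urls.txt sits in the working directory (the output folder name and timeout below are illustrative choices, not requirements):

```python
# Download every page listed in urls.txt (one URL per line) into pages/.
import lazynlp

lazynlp.download_pages(
    "urls.txt",         # the deduplicated URL list from step 4
    "pages/",           # output folder for the raw downloaded pages
    timeout=30,         # seconds to wait on a slow URL before skipping it
    default_skip=True,  # skip domains/extensions known to be crawler-unfriendly
)
```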

6. Use the following URL dumps (a loop over all three lists is sketched after this block):
```
* gdown https://drive.google.com/uc?id=1hRtA3zZ0K5UHKOQ0_8d0BIc_1VyxgY51 (done in step 4)
* gdown https://drive.google.com/uc?id=1zIVaRaVqGP8VNBUT4eKAzW3gYWxNk728
* gdown https://drive.google.com/uc?id=1C5aSisXMC3S3OXBFbnETLeK3UTUXEXrC
* wget https://dumps.wikimedia.org/enwiktionary/20190301/enwiktionary-20190301-pages-articles-multistream.xml.bz2 (notice this is not a list of URLs but a dump of the text itself)
```
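One way to push all three URL lists through the same crawl (a sketch; the file names below are placeholders for whatever the downloads actually unpack to, so check them first):

```python
# Crawl each deduplicated URL list into its own folder so the runs
# don't overwrite each other. File names below are placeholders.
import os
import lazynlp

url_lists = ["urls.txt", "urls2.txt", "urls3.txt"]  # substitute real names

for link_file in url_lists:
    folder = os.path.splitext(link_file)[0] + "_pages"
    lazynlp.download_pages(link_file, folder, timeout=30, default_skip=True)
```

The wiktionary dump is the odd one out: decompress it with bunzip2 and you already have the article content locally, so there is nothing to crawl.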
7. Be creative with the crawler (multithread it, run it in the background with nohup ... &) and put the data into the distributed storage that we will use for the class lab; one possible parallel setup is sketched below.
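One possible take on the multithreading hint (a sketch under the assumption that splitting the URL list across processes is acceptable; download_pages() is network-bound, so separate processes writing to separate folders parallelize it cleanly):

```python
# Parallel crawl sketch: split the URL list into N chunks and crawl
# each chunk in its own process, each into its own output folder.
from multiprocessing import Process

import lazynlp

N = 4  # illustrative worker count; tune to your node


def split(link_file, n):
    """Write every n-th URL to its own chunk file; return the chunk names."""
    with open(link_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    names = []
    for i in range(n):
        name = f"chunk_{i}.txt"
        with open(name, "w") as out:
            out.write("\n".join(urls[i::n]) + "\n")
        names.append(name)
    return names


if __name__ == "__main__":
    procs = [
        Process(
            target=lazynlp.download_pages,
            args=(chunk, f"pages_{i}"),
            kwargs={"timeout": 30, "default_skip": True},
        )
        for i, chunk in enumerate(split("urls.txt", N))
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Launched as nohup python3 crawl.py & it keeps running after you close the SSH session; write the pages_* folders onto the distributed storage so every node can read the result.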
