############################################################
Exercises for the course:
"Tools and techniques for massive data analysis"
> https://events.prace-ri.eu/conferenceDisplay.py?confId=314
> http://www.hpc.cineca.it
> http://www.prace-ri.eu
############################################################
o) Packages
- Hadoop 1.2.1 (2.5.1/2.5.2)
- Python 2.7
- MrJob 0.4.3-dev
- Apache Spark 1.1.0
o) Examples
-> MrJob (a minimal job sketch follows the command list below)
$ source /root/mrjob0.4.2/bin/activate (mrjob0.4.3-dev also available)
$ cd /course-exercises/mrjob
$ python word_count.py ../data/txt/2261.txt.utf-8
$ python word_count.py ../data/txt/2261.txt.utf-8 -r hadoop
$ python matmat.py (-r hadoop) ../data/mat/matmat_3x2_A ../data/mat/matmat_2x2_B
$ python sparse_matmat.py (-r hadoop) ../data/mat/smat_100x10_A ../data/mat/smat_10x200_B
$ python smatvec.py ../data/smat_10_5_A.txt ../data/vec_5.txt
$ python matmat.py ../data/smat_10_5_A.txt ../data/smat_5_5.txt
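# A minimal MRJob word count in the same spirit as word_count.py above.
# This is an illustrative sketch, not the repo's script; the regex
# tokenization and lower-casing are assumptions.

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every token in the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum the partial counts emitted for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

# Run it locally or on Hadoop exactly like the commands above, e.g.
# $ python my_word_count.py ../data/txt/2261.txt.utf-8 (-r hadoop)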
-> NGS
# Hadoop streaming: local debug run and Hadoop run
$ bash ngs/hs/hstream.sh
# MrJob NGS coverage job (a simplified sketch of the coverage idea follows below)
$ source /root/mrjob0.4.2/bin/activate
$ python ngs/mrjob/runner.py < data/ngs/input.sam 1> data/ngs/out.coverage 2> data/ngs/out.log
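# The per-base coverage idea behind the NGS commands above, written as a
# single Hadoop-streaming-style script. This is an illustrative sketch only
# (not the repo's hstream.sh or runner.py): it takes the covered span from
# the length of SEQ and ignores the CIGAR string entirely.

import sys

def mapper(stream):
    # emit "RNAME:position<TAB>1" for every base covered by an aligned read
    for line in stream:
        if line.startswith('@'):                 # skip SAM header lines
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 11:                     # malformed record
            continue
        rname, pos, seq = fields[2], int(fields[3]), fields[9]
        for i in range(len(seq)):
            print('%s:%d\t1' % (rname, pos + i))

def reducer(stream):
    # sum counts per position; the shuffle/sort phase groups identical keys
    current_key, total = None, 0
    for line in stream:
        key, value = line.rstrip('\n').split('\t', 1)
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, total))
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print('%s\t%d' % (current_key, total))

if __name__ == '__main__':
    mode = sys.argv[1] if len(sys.argv) > 1 else 'map'
    (mapper if mode == 'map' else reducer)(sys.stdin)

# Local pipeline equivalent of one map/reduce round:
# $ python coverage_stream.py map < data/ngs/input.sam | sort | python coverage_stream.py reduce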
-> Hadoop
$ hadoop jar $HADOOP_PREFIX/hadoop-examples-1.2.1.jar pi 3 10
-> Spark (a PySpark sketch of the max-temperature job follows the commands below)
$ cd spark
--------------
$ hadoop fs -put ../data/spark/1902 /spark/1902
$ spark-submit max_temp.py /spark/1902
$ hadoop fs -get /user/root/output/part-00000
--------------
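# A PySpark sketch of the max-temperature job run above. Illustrative only,
# not necessarily the repo's max_temp.py: the fixed NCDC-style field offsets,
# the 9999 missing-value sentinel and the quality codes are assumptions about
# the 1902 weather data.

from pyspark import SparkContext

sc = SparkContext(appName="MaxTemperature")

def parse(line):
    # year, air temperature and quality flag at fixed column offsets
    # (assumed NCDC record layout)
    year = line[15:19]
    temp = int(line[87:92])
    quality = line[92:93]
    return (year, temp, quality)

records = sc.textFile("/spark/1902").map(parse)
valid = records.filter(lambda r: r[1] != 9999 and r[2] in "01459")
max_temps = valid.map(lambda r: (r[0], r[1])).reduceByKey(max)

# lands in HDFS under the submitting user's home, e.g. /user/root/output
max_temps.saveAsTextFile("output")
sc.stop()
--------------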
-> Spark Shell
--------------
$ hadoop fs -put ../data/txt/divine_comedy.txt /spark/divine_comedy.txt
$ spark-shell
scala> val textFile = sc.textFile("/spark/divine_comedy.txt")
scala> textFile.count() // Number of items in this RDD
scala> textFile.first() // First item in this RDD
scala> val linesWithCanto = textFile.filter(line => line.contains("Canto"))
scala> textFile.filter(line => line.contains("Canto")).count()
scala> linesWithCanto.cache()
scala> linesWithCanto.count()
--------------
-> Hadoop 2.5.2
$ cd hadoop
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 16 10
--------------
$ hdfs dfs -mkdir /data
$ hdfs dfs -put ../data/txt/2261.txt.utf-8 /data/
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount /data wordcount-output
--------------
############################################################