############################################################
Exercises for the course:
"Tools and techniques for massive data analysis"
> https://events.prace-ri.eu/conferenceDisplay.py?confId=314
> http://www.hpc.cineca.it
> http://www.prace-ri.eu
############################################################
o) Packages
- Hadoop 1.2.1 (2.5.1/2.5.2)
- Python 2.7
- MrJob 0.4.3-dev
- Apache Spark 1.1.0
o) Examples
-> MrJob (a minimal job sketch follows the command list below)
$ source /root/mrjob0.4.2/bin/activate (mrjob0.4.3-dev also available)
$ cd /course-exercises/mrjob
$ python word_count.py ../data/txt/2261.txt.utf-8
$ python word_count.py ../data/txt/2261.txt.utf-8 -r hadoop
$ python matmat.py (-r hadoop) ../data/mat/matmat_3x2_A ../data/mat/matmat_2x2_B
$ python sparse_matmat.py (-r hadoop) ../data/mat/smat_100x10_A ../data/mat/smat_10x200_B
$ python smatvec.py ../data/smat_10_5_A.txt ../data/vec_5.txt
$ python matmat.py ../data/smat_10_5_A.txt ../data/smat_5_5.txt
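# A minimal MRJob word count in the same spirit as word_count.py above.
# This is an illustrative sketch, not the repo's script; the regex
# tokenization and lower-casing are assumptions.

from mrjob.job import MRJob
import re

WORD_RE = re.compile(r"[\w']+")

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # emit (word, 1) for every token in the input line
        for word in WORD_RE.findall(line):
            yield word.lower(), 1

    def reducer(self, word, counts):
        # sum the partial counts emitted for each word
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

# Run it locally or on Hadoop exactly like the commands above, e.g.
# $ python my_word_count.py ../data/txt/2261.txt.utf-8 (-r hadoop)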
-> NGS
# Hadoop streaming: local debug run and Hadoop run
$ bash ngs/hs/hstream.sh
# MrJob NGS coverage job (a simplified sketch of the coverage idea follows below)
$ source /root/mrjob0.4.2/bin/activate
$ python ngs/mrjob/runner.py < data/ngs/input.sam 1> data/ngs/out.coverage 2> data/ngs/out.log
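# The per-base coverage idea behind the NGS commands above, written as a
# single Hadoop-streaming-style script. This is an illustrative sketch only
# (not the repo's hstream.sh or runner.py): it takes the covered span from
# the length of SEQ and ignores the CIGAR string entirely.

import sys

def mapper(stream):
    # emit "RNAME:position<TAB>1" for every base covered by an aligned read
    for line in stream:
        if line.startswith('@'):                 # skip SAM header lines
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 11:                     # malformed record
            continue
        rname, pos, seq = fields[2], int(fields[3]), fields[9]
        for i in range(len(seq)):
            print('%s:%d\t1' % (rname, pos + i))

def reducer(stream):
    # sum counts per position; the shuffle/sort phase groups identical keys
    current_key, total = None, 0
    for line in stream:
        key, value = line.rstrip('\n').split('\t', 1)
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, total))
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print('%s\t%d' % (current_key, total))

if __name__ == '__main__':
    mode = sys.argv[1] if len(sys.argv) > 1 else 'map'
    (mapper if mode == 'map' else reducer)(sys.stdin)

# Local pipeline equivalent of one map/reduce round:
# $ python coverage_stream.py map < data/ngs/input.sam | sort | python coverage_stream.py reduce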
-> Hadoop
$ hadoop jar $HADOOP_PREFIX/hadoop-examples-1.2.1.jar pi 3 10
-> Spark (a PySpark sketch of the max-temperature job follows the commands below)
$ cd spark
--------------
$ hadoop fs -put ../data/spark/1902 /spark/1902
$ spark-submit max_temp.py /spark/1902
$ hadoop fs -get /user/root/output/part-00000
--------------
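# A PySpark sketch of the max-temperature job run above. Illustrative only,
# not necessarily the repo's max_temp.py: the fixed NCDC-style field offsets,
# the 9999 missing-value sentinel and the quality codes are assumptions about
# the 1902 weather data.

from pyspark import SparkContext

sc = SparkContext(appName="MaxTemperature")

def parse(line):
    # year, air temperature and quality flag at fixed column offsets
    # (assumed NCDC record layout)
    year = line[15:19]
    temp = int(line[87:92])
    quality = line[92:93]
    return (year, temp, quality)

records = sc.textFile("/spark/1902").map(parse)
valid = records.filter(lambda r: r[1] != 9999 and r[2] in "01459")
max_temps = valid.map(lambda r: (r[0], r[1])).reduceByKey(max)

# lands in HDFS under the submitting user's home, e.g. /user/root/output
max_temps.saveAsTextFile("output")
sc.stop()
--------------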
-> Spark Shell
--------------
$ hadoop fs -put ../data/txt/divine_comedy.txt /spark/divine_comedy.txt
$ spark-shell
scala> val textFile = sc.textFile("/spark/divine_comedy.txt")
scala> textFile.count() // Number of items in this RDD
scala> textFile.first() // First item in this RDD
scala> val linesWithCanto = textFile.filter(line => line.contains("Canto"))
scala> textFile.filter(line => line.contains("Canto")).count()
scala> linesWithCanto.cache()
scala> linesWithCanto.count()
--------------
-> Hadoop 2.5.2
$ cd hadoop
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar pi 16 10
--------------
$ hdfs dfs -mkdir /data
$ hdfs dfs -put ../data/txt/2261.txt.utf-8 /data/
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar wordcount /data wordcount-output
--------------
############################################################