This repository was archived by the owner on Nov 13, 2018. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathnotes.txt
More file actions
56 lines (44 loc) · 2.47 KB
/
notes.txt
File metadata and controls
56 lines (44 loc) · 2.47 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
Motivation
um Daten zu verstehen und zu verwenden/benutzen
hilfreich für Datenmodelierung, Datenintegration, Optimierung von Anfrageausführungen und um Anomalien zu finden
Problematisch
NP-Hard[Gunopulos et al.]
schwierig, weil die Anzahl möglicher UCCs steigt exponential zur Anzahl der Eingabespalten
Aufgabenstellung
nur nicht redundante UCC
Algorithmen
BruteForce
GORDIAN
HCA (Buttom-Up)- statistical pruning
MapReduce based Unique Column Combination detection - row and column pruning
column pruning: prune column combination, = keine Kombinationen mit Uniques!
row pruning: beschleunige verifizierung von Kandidaten = Streiche Zeilen, die nur einmal vorkommen (nicht gobal sondern nur für das finden des Uniques mit dieser Spalte
- alle n-ColumnCombinationen werden gleichzeitig geprüft ({a},{b},{c}|{a,b},{a,c},{b,c}|{a,b,c})
Performance
mit Realdaten und Syntetischen Daten
Speedup ratio
expansion rate
-------------
Unterschiede Spark vs Flink
Iteration:
Flink iteration vs Spark cache, Spark hat Zwischenergebnisse
Different methods
Spark: groupByKey, reduceByKey (only for tuple2 not tuple3 there use groupBy(function)) vs Flink groupBy(0).reduce(..)
Spark: combination of map, groupByKey etc, Flink reduceGroup
Spark: which number has current tasks
Flink: instead of Cell-Object use Tuple3, Spark: use Cell object because groupBy 2 elements isn't easy / possible
Spark: flatMap and mapToPair not combinable, diff btw JavaPairRDD<Object> and JavRDD<Tuple2<Object, Object>>
Spark: for groupByKey, reduceByKey needs JavaPairRDD
Spark: counts arguments starting with 1, Flink starting with 0
How we changed our algorithm
no BitSets
save PLIs in long[] instead of List<List<Long>> or fastUtil structure
only broadcast plis of single columns
don't calculate whole PLIs try to break before
better cache (use uncache!)
-----
ncvoter 10k - 1 core
Flink:
java -Dlog4j.configuration="file:./log4j.properties" -cp "target/*:target/Uuc-0.0.1-SNAPSHOT-dependencies/*" com.dataprofiling.ucc.Ucc --input "hdfs://tenemhead2:8020/data/csv/LOD-small/PERSON_EN.csv" --parallelism 20 --executor tenemhead2:6123 --jars target/Uuc-0.0.1-SNAPSHOT-flink-fat-jar.jar --delimiter ";"
Spark:
/opt/spark/spark-1.3.0/bin/spark-submit --class com.dataprofiling.ucc.UccSpark --jars target/simpletask-0.0.1-SNAPSHOT-spark-fat-jar.jar --master spark://172.16.21.111:7077 --conf spark.cores.max=1 target/simpletask-0.0.1-SNAPSHOT.jar --input hdfs://tenemhead2:8020/data/data-profiling/ncvoter_10k.csv