data-profiling/notes.txt at master · DBDA15/data-profiling · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
Motivation
	um Daten zu verstehen und zu verwenden/benutzen
	hilfreich für Datenmodelierung, Datenintegration, Optimierung von Anfrageausführungen und um Anomalien zu finden

Problematisch
	NP-Hard[Gunopulos et al.]
	schwierig, weil die Anzahl möglicher UCCs steigt exponential zur Anzahl der Eingabespalten

Aufgabenstellung
	nur nicht redundante UCC

Algorithmen
	BruteForce
	GORDIAN
	HCA (Buttom-Up)- statistical pruning
	MapReduce based Unique Column Combination detection - row and column pruning
		column pruning: prune column combination, = keine Kombinationen mit Uniques!
		row pruning: beschleunige verifizierung von Kandidaten = Streiche Zeilen, die nur einmal vorkommen (nicht gobal sondern nur für das finden des Uniques mit dieser Spalte
		- alle n-ColumnCombinationen werden gleichzeitig geprüft ({a},{b},{c}|{a,b},{a,c},{b,c}|{a,b,c})

Performance
	mit Realdaten und Syntetischen Daten
	Speedup ratio
	expansion rate

-------------

Unterschiede Spark vs Flink

Iteration:
	Flink iteration vs Spark cache, Spark hat Zwischenergebnisse
Different methods
	Spark: groupByKey, reduceByKey (only for tuple2 not tuple3 there use groupBy(function)) vs Flink groupBy(0).reduce(..)
	Spark: combination of map, groupByKey etc, Flink reduceGroup
	Spark: which number has current tasks
	Flink: instead of Cell-Object use Tuple3, Spark: use Cell object because groupBy 2 elements isn't easy / possible
	Spark: flatMap and mapToPair not combinable, diff btw JavaPairRDD<Object> and JavRDD<Tuple2<Object, Object>>
	Spark: for groupByKey, reduceByKey needs JavaPairRDD
	Spark: counts arguments starting with 1, Flink starting with 0

How we changed our algorithm
	no BitSets
	save PLIs in long[] instead of List<List<Long>> or fastUtil structure
	only broadcast plis of single columns
	don't calculate whole PLIs try to break before
	better cache (use uncache!)

-----
ncvoter 10k - 1 core


Flink:
java -Dlog4j.configuration="file:./log4j.properties" -cp "target/*:target/Uuc-0.0.1-SNAPSHOT-dependencies/*" com.dataprofiling.ucc.Ucc --input "hdfs://tenemhead2:8020/data/csv/LOD-small/PERSON_EN.csv" --parallelism 20 --executor tenemhead2:6123 --jars target/Uuc-0.0.1-SNAPSHOT-flink-fat-jar.jar --delimiter ";"

Spark:
/opt/spark/spark-1.3.0/bin/spark-submit --class com.dataprofiling.ucc.UccSpark --jars target/simpletask-0.0.1-SNAPSHOT-spark-fat-jar.jar --master spark://172.16.21.111:7077 --conf spark.cores.max=1 target/simpletask-0.0.1-SNAPSHOT.jar --input hdfs://tenemhead2:8020/data/data-profiling/ncvoter_10k.csv