Scalable Discovery of Unique Column Combinations
Arvid Heise¹, Jorge-Arnulfo Quiané-Ruiz², Ziawasch Abedjan¹, Anja Jentzsch¹, Felix Naumann¹
¹ Hasso Plattner Institute, ² Qatar Computing Research Institute
Motivation
Large datasets at very fast rates:
- Social networks
- Scientific applications
- Transactional applications
- ...
Finding uniques is crucial for:
- Query optimization
- Anomaly detection
- Data modeling
- Indexing
- ...
Problem
- Unique column combinations often unknown in big datasets
- Exponential search space
Finding unique column combinations is an NP-Hard problem
DUCC
Results
* Work done in the context of Metanome: joint project between HPI and QCRI that provides a fresh view on data profiling and aims at providing scalability for Big Data. Website: http://www.hpi.uni-potsdam.de/naumann/projekte/metanome_data_profiling.html
Lattice with eight attributes
minimal unique
unique
maximal non-unique
non-unique
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
o Graph divided into uniques and non-uniques
o Minimal uniques summarize uniques
o Maximal non-uniques summarize non-uniques
o DUCC simultaneously detects (non-)uniques
o Completeness verified (proof in paper)
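The lattice terminology above can be illustrated with a minimal brute-force sketch in Python. It is not DUCC itself: it checks every column combination of a tiny table and then extracts the minimal uniques and maximal non-uniques. The function names and the toy table are assumptions made for this illustration only.

```python
from itertools import combinations

def is_unique(rows, cols):
    """True if no two rows agree on all columns in `cols`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True

def classify_lattice(rows, n_cols):
    """Brute force: return (minimal uniques, maximal non-uniques)."""
    all_ccs = [frozenset(c)
               for k in range(1, n_cols + 1)
               for c in combinations(range(n_cols), k)]
    uniques = {cc for cc in all_ccs if is_unique(rows, sorted(cc))}
    non_uniques = set(all_ccs) - uniques
    # A unique is minimal if no proper subset is unique.
    minimal = {u for u in uniques if not any(v < u for v in uniques)}
    # A non-unique is maximal if no proper superset is non-unique.
    maximal = {n for n in non_uniques if not any(n < m for m in non_uniques)}
    return minimal, maximal

# Hypothetical toy table with columns A=0, B=1, C=2.
rows = [(1, "x", "p"), (1, "y", "p"), (2, "x", "q"), (2, "y", "q")]
minimal, maximal = classify_lattice(rows, 3)
print(sorted(map(sorted, minimal)))  # minimal uniques: AB and BC
print(sorted(map(sorted, maximal)))  # maximal non-uniques: AC and B
```

Every other node of the lattice is implied by these two sets: any superset of AB or BC is unique, and any subset of AC or B is non-unique.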
Modeled as graph coloring problem
[Figure: the lattice over A to E, annotated with traversal steps 1 to 7]
Random walk graph traversal
o Randomly pick next CC from current CC
o Go upwards if CC is non-unique
o Go downwards if CC is unique
o Trace back if no unpruned CC is left
o Check for holes by comparing min uniques and max non-uniques
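The traversal rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published algorithm: uniqueness is checked by hashing rows rather than with PLIs, and holes are found by enumerating the whole lattice (only feasible for tiny inputs) instead of comparing minimal uniques with maximal non-uniques. All names are hypothetical.

```python
import random
from itertools import combinations

def random_walk_uccs(rows, n_cols, seed=0):
    rng = random.Random(seed)
    known_unique, known_non_unique = [], []  # explicitly checked CCs
    checks = 0

    def status(cc):
        """True/False if implied by pruning, None if still unknown."""
        if any(u <= cc for u in known_unique):
            return True    # superset of a unique is unique
        if any(cc <= n for n in known_non_unique):
            return False   # subset of a non-unique is non-unique
        return None

    def check(cc):
        nonlocal checks
        s = status(cc)
        if s is None:      # not pruned: run a real uniqueness check
            checks += 1
            cols = sorted(cc)
            s = len({tuple(r[c] for c in cols) for r in rows}) == len(rows)
            (known_unique if s else known_non_unique).append(cc)
        return s

    def neighbours(cc, upwards):
        if upwards:
            return [cc | {c} for c in range(n_cols) if c not in cc]
        return [cc - {c} for c in cc if len(cc) > 1]

    def walk(start):
        path = [start]
        while path:
            cc = path[-1]
            unique = check(cc)
            # Go upwards from non-uniques, downwards from uniques.
            options = [n for n in neighbours(cc, not unique)
                       if status(n) is None]
            if options:
                path.append(rng.choice(options))
            else:
                path.pop()  # trace back

    all_ccs = [frozenset(c) for k in range(1, n_cols + 1)
               for c in combinations(range(n_cols), k)]
    while True:
        # Simplified hole check: any CC whose status is still unknown.
        holes = [cc for cc in all_ccs if status(cc) is None]
        if not holes:
            break
        walk(rng.choice(holes))

    minimal = {u for u in known_unique
               if not any(v < u for v in known_unique)}
    maximal = {n for n in known_non_unique
               if not any(n < m for m in known_non_unique)}
    return minimal, maximal, checks
```

Because pruned combinations are never checked explicitly, the returned `checks` counter is at most the lattice size and often smaller, which is the effect the pruning is meant to achieve.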
o Worker receives next CC from traversal algorithm
o Performs fast check with PLI intersection
o Maintains pruning data structures
o Seed provider detects holes and calculates restart CC
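The PLI-based check mentioned above can be sketched in Python. A position list index (PLI) stores, per column, the clusters of row ids that share a value, with singleton clusters dropped. Intersecting two PLIs yields the PLI of the combined columns, and an empty PLI means the combination is unique. The function names here are assumptions for illustration and not the names used in DUCC.

```python
from collections import defaultdict

def build_pli(values):
    """Clusters of row ids sharing a value; singleton clusters are dropped."""
    groups = defaultdict(list)
    for row_id, value in enumerate(values):
        groups[value].append(row_id)
    return [ids for ids in groups.values() if len(ids) > 1]

def intersect_pli(pli_a, pli_b):
    """PLI of the combined columns: rows sharing a cluster in both inputs."""
    cluster_of_b = {row_id: idx
                    for idx, cluster in enumerate(pli_b)
                    for row_id in cluster}
    result = []
    for cluster in pli_a:
        parts = defaultdict(list)
        for row_id in cluster:
            if row_id in cluster_of_b:
                parts[cluster_of_b[row_id]].append(row_id)
        result.extend(ids for ids in parts.values() if len(ids) > 1)
    return result

def is_unique_pli(pli):
    """A column combination is unique when no duplicate cluster remains."""
    return not pli

# Hypothetical toy columns.
pli_a = build_pli([1, 1, 2, 2])
pli_b = build_pli(["x", "y", "x", "y"])
pli_c = build_pli(["p", "p", "q", "q"])
print(is_unique_pli(intersect_pli(pli_a, pli_b)))  # AB is unique
print(is_unique_pli(intersect_pli(pli_a, pli_c)))  # AC is not unique
```

The intersection only touches rows that are already duplicates in the first PLI, so the work shrinks as combinations get closer to unique.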
Modular architecture
o Workers exchange minimal uniques and maximal non-uniques
o Scale up with local event bus
o Scale out with distributed event bus (ZooKeeper)
o Fault-tolerance with Map-only Hadoop Job
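How workers might share findings over a local event bus can be sketched in Python as follows. This is a single-process, synchronous illustration only. The class names, event kinds, and method names are assumptions, and the distributed variant (ZooKeeper, Hadoop) is not modeled.

```python
class LocalEventBus:
    """Minimal synchronous publish/subscribe bus (hypothetical)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, sender, kind, cc):
        for handler in self._subscribers:
            handler(sender, kind, cc)


class Worker:
    """Keeps local pruning structures in sync with other workers."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.min_uniques = set()
        self.max_non_uniques = set()
        bus.subscribe(self.on_event)

    def report(self, kind, cc):
        # kind is "min_unique" or "max_non_unique" (assumed event names).
        self.bus.publish(self.name, kind, frozenset(cc))

    def on_event(self, sender, kind, cc):
        if kind == "min_unique":
            self.min_uniques.add(cc)
        else:
            self.max_non_uniques.add(cc)

    def is_pruned(self, cc):
        """True if cc is already implied by a shared finding."""
        cc = frozenset(cc)
        return (any(u <= cc for u in self.min_uniques)
                or any(cc <= n for n in self.max_non_uniques))


bus = LocalEventBus()
w1, w2 = Worker("w1", bus), Worker("w2", bus)
w1.report("min_unique", "AB")
print(w2.is_pruned("ABC"))  # True: superset of the shared minimal unique AB
```

With this exchange, a combination classified by one worker never has to be checked again by another.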
Scalable architecture
Scaling the Number of Columns on 100,000 Rows
Scaling the Number of Rows on 15 Columns
Scaling up/out
Solution space and checks
Related Work
Gordian: [Gordian: efficient and scalable discovery of composite keys. VLDB’06]
- Row-based approach
- Prefix-tree data organization
HCA: [Advancing the discovery of unique column combinations. CIKM’11]
- Column-based approach
- Histogram- and value-counting-based
Swan: [Detecting Unique Column Combinations on Dynamic Data. ICDE’14]
- Builds on top of DUCC
- Focus on dealing with incremental data
[Figure: log execution time (s) vs. number of columns (5 to 70); series: GORDIAN, HCA, DUCC]
[Figure: log execution time (s) vs. number of columns (5 to 16); series: GORDIAN, HCA, DUCC]
[Figure: log execution time (s) vs. number of rows (10,000 to 539,165); series: GORDIAN, HCA, DUCC]
[Figure: log execution time (s) vs. number of rows (10,000 to 6,001,215); series: GORDIAN, HCA, DUCC]
[Figure: execution time (s) vs. number of columns (20 to 50); series: 1, 2, 4, and 8 nodes]
[Figure: uniqueness checks vs. minimal uniques; R² = 0.9983]
Datasets: Uniprot, TPC-H