From Theory to Practice: Efficient Join Query Processing in a
Parallel Database System
Shumo Chu, Magdalena Balazinska, and Dan Suciu
Database Group, CSE, University of Washington
Motivation
In industry and science, users need to analyze large datasets.
Myria: parallel DBMS developed at UW
New class of queries:
• Knowledge base exploration
• Social network analysis: find all triangles
Two key differences:
• Multiple tables need to be joined
• Query structure may be cyclic
Traditional Parallel Join Evaluation
[Figure: evaluating A⋈B⋈C on three workers. Step 1: shuffle A, B on y; each worker joins its partitions A’ and B’. Step 2: shuffle A⋈B, C on (x, z); each worker joins A’⋈B’ with C’.]
Solution 1: Shuffle on joined attributes
• Large intermediate result
• Skew on shuffle
Solution 2: Keep the largest table, broadcast the others
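The two-step plan of Solution 1 can be sketched on a single machine. This is a minimal illustration, not Myria code: the workers are plain Python lists, Python's built-in `hash` stands in for the shuffle hash function, and the names `shuffle`, `local_join`, and `triangles_regular_shuffle` are hypothetical.

```python
from collections import defaultdict

def shuffle(tuples, key, num_workers):
    """Hash-partition `tuples` across workers on the value returned by `key`."""
    parts = [[] for _ in range(num_workers)]
    for t in tuples:
        parts[hash(key(t)) % num_workers].append(t)
    return parts

def local_join(left, right, left_key, right_key, combine):
    """Hash join on one worker: build on `left`, probe with `right`."""
    index = defaultdict(list)
    for l in left:
        index[left_key(l)].append(l)
    return [combine(l, r) for r in right for l in index.get(right_key(r), ())]

def triangles_regular_shuffle(A, B, C, num_workers=3):
    # Step 1: shuffle A(x, y) and B(y, z) on y; each worker joins its partitions.
    a_parts = shuffle(A, lambda t: t[1], num_workers)
    b_parts = shuffle(B, lambda t: t[0], num_workers)
    ab = [row
          for a_part, b_part in zip(a_parts, b_parts)
          for row in local_join(a_part, b_part, lambda a: a[1], lambda b: b[0],
                                lambda a, b: (a[0], a[1], b[1]))]
    # Step 2: shuffle the intermediate result and C(z, x) on (x, z); join again.
    ab_parts = shuffle(ab, lambda t: (t[0], t[2]), num_workers)
    c_parts = shuffle(C, lambda t: (t[1], t[0]), num_workers)
    result = [row
              for ab_part, c_part in zip(ab_parts, c_parts)
              for row in local_join(ab_part, c_part,
                                    lambda t: (t[0], t[2]), lambda c: (c[1], c[0]),
                                    lambda t, c: t)]
    return sorted(result), len(ab)

edges = [(1, 2), (2, 3), (3, 1), (3, 4)]   # toy follower graph (made-up data)
triangles, intermediate_size = triangles_regular_shuffle(edges, edges, edges)
```

On this toy graph the intermediate result A⋈B (all paths of length two) is already larger than the final answer, which is the first drawback listed above.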
Background: HyperCube (Shares) Shuffle
T(x, y, z) :- A(x, y), B(y, z), C(z, x)
[Figure: the P workers are arranged as a P^(1/3) × P^(1/3) × P^(1/3) cube with dimensions x, y, z; each worker receives a fragment of A, B, and C.]
• A(x1, y1) → (h1(x1), h2(y1), *): P^(1/3) replication
• B(y1, z1) → (*, h2(y1), h3(z1)): P^(1/3) replication
• C(z1, x1) → (h1(x1), *, h3(z1)): P^(1/3) replication
Afrati and Ullman, EDBT10
Beame et al., PODS13
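The routing rule on this slide can be sketched as follows. This is a toy, single-process illustration under stated assumptions: Python's `hash` modulo the share size stands in for h1, h2, h3, the worker grid is a dictionary keyed by cell coordinates, and the function names are hypothetical rather than part of Myria.

```python
from itertools import product

def destinations(relation, tup, shares):
    """Cells (cx, cy, cz) that must receive `tup`; `*` becomes every value."""
    px, py, pz = shares
    if relation == "A":                       # A(x, y): z is unconstrained
        x, y = tup
        return [(hash(x) % px, hash(y) % py, k) for k in range(pz)]
    if relation == "B":                       # B(y, z): x is unconstrained
        y, z = tup
        return [(i, hash(y) % py, hash(z) % pz) for i in range(px)]
    if relation == "C":                       # C(z, x): y is unconstrained
        z, x = tup
        return [(hash(x) % px, j, hash(z) % pz) for j in range(py)]
    raise ValueError(relation)

def hypercube_triangles(A, B, C, shares):
    # One shuffle: every tuple is replicated to all cells that may need it.
    workers = {cell: {"A": [], "B": [], "C": []}
               for cell in product(*(range(p) for p in shares))}
    for name, rel in (("A", A), ("B", B), ("C", C)):
        for t in rel:
            for cell in destinations(name, t, shares):
                workers[cell][name].append(t)
    # Each worker then evaluates the whole triangle query on its fragments.
    out = []
    for frag in workers.values():
        c_set = set(frag["C"])
        for (x, y) in frag["A"]:
            for (y2, z) in frag["B"]:
                if y == y2 and (z, x) in c_set:
                    out.append((x, y, z))
    return sorted(out)

edges = [(1, 2), (2, 3), (3, 1), (3, 4)]      # toy graph (made-up data)
result = hypercube_triangles(edges, edges, edges, shares=(2, 2, 2))
```

Each triangle (x, y, z) is found only at cell (h1(x), h2(y), h3(z)), so no duplicate elimination is needed and no intermediate result is shuffled.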
Single Node Multiway Join
• Join algorithms with optimality guarantees
• Leapfrog TrieJoin by Veldhuizen, 2014
• Minesweeper by Ngo et al., 2014
• Pipeline of joins → single multiway join
• Tributary Join: Leapfrog TrieJoin in Myria
• A multiway sort-merge join on steroids
• Avoids constructing tries, unlike Leapfrog TrieJoin
  A        B        C
 x  y     y  z     x  z
 2  0     0  1     0  2
 2  1     2  0     1  0
 2  3     2  3     2  4
 3  4     3  4     3  2
 4  2     4  2     4  3
 5  6     5  6     6  5
T(x, y, z) :- A(x, y), B(y, z), C(z, x)
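A rough sketch of the idea behind a multiway sort-merge join, run on the three example tables of this slide. It is not the Tributary Join implementation: it assumes integer attributes, fixes the attribute order x, y, z, stores C with columns (x, z) as in the table, and copies slices where a real implementation would keep index ranges. The helper names are hypothetical.

```python
from bisect import bisect_left

def leapfrog_intersect(lists):
    """Intersect sorted, duplicate-free lists, seeking with binary search."""
    if any(not l for l in lists):
        return []
    out, pos = [], [0] * len(lists)
    candidate = max(l[0] for l in lists)
    while True:
        agreed = True
        for i, l in enumerate(lists):
            pos[i] = bisect_left(l, candidate, pos[i])   # seek to >= candidate
            if pos[i] == len(l):
                return out
            if l[pos[i]] != candidate:                   # overshoot: new candidate
                candidate = l[pos[i]]
                agreed = False
                break
        if agreed:
            out.append(candidate)
            pos[0] += 1                                  # move past the match
            if pos[0] == len(lists[0]):
                return out
            candidate = lists[0][pos[0]]

def first_values(rel):
    return sorted({t[0] for t in rel})

def second_values(rel, v):
    """Second-column values of the sorted relation whose first column is v."""
    lo = bisect_left(rel, (v,))
    hi = bisect_left(rel, (v + 1,))
    return [t[1] for t in rel[lo:hi]]

def triangle_join(A, B, C_xz):
    A, B, C_xz = sorted(set(A)), sorted(set(B)), sorted(set(C_xz))
    b_ys = first_values(B)
    out = []
    for x in leapfrog_intersect([first_values(A), first_values(C_xz)]):
        for y in leapfrog_intersect([second_values(A, x), b_ys]):
            for z in leapfrog_intersect([second_values(B, y), second_values(C_xz, x)]):
                out.append((x, y, z))
    return out

A = [(2, 0), (2, 1), (2, 3), (3, 4), (4, 2), (5, 6)]
B = [(0, 1), (2, 0), (2, 3), (3, 4), (4, 2), (5, 6)]
C = [(0, 2), (1, 0), (2, 4), (3, 2), (4, 3), (6, 5)]   # columns (x, z)
print(triangle_join(A, B, C))   # [(2, 3, 4), (3, 4, 2), (4, 2, 3)]
```

All three relations are joined in one pass over sorted arrays, so no binary intermediate result such as A⋈B is materialized.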
Questions
Empirical study of HyperCube shuffle and Tributary join
HyperCube configuration optimization
Tributary join cost model and attribute order optimization
Empirical Study
Myria deployment with 64 workers.
Shuffle paradigms: regular shuffles, HyperCube shuffle, broadcast
Local join algorithms: symmetric hash join, Tributary join
Parallel semi-join
Evaluate 8 queries on Twitter social graph and Freebase
Triangle Query on Twitter
Query: T(x, y, z) :- A(x, y), B(y, z), C(z, x)
Dataset: Sampled Twitter social network graph with 1 million edges (follower:int, followee:int)
Triangle Query: Data Shuffling
T(x, y, z) :- A(x, y), B(y, z), C(z, x)
Regular Shuffle (54M total)
• A: 1M tuples, skew 1.35
• B: 1M tuples, skew 1.72
• A⋈B: 51M tuples, skew 20.8
• C: 1M tuples, skew 1.01
HyperCube Shuffle (12M total)
• A: 4M tuples, skew 1.06
• B: 4M tuples, skew 1.06
• C: 4M tuples, skew 1.06
Broadcast (142M, no skew)
Triangle Query: Runtime
T(x, y, z) :- A(x, y), B(y, z), C(z, x)
[Figure: query runtime (sec) for HyperCube, Broadcast, and Regular shuffles]
Shuffle paradigm: HyperCube < Broadcast < Regular
Sequential join: Tributary Join < Hash Join
Query 2: Knowledge Base Exploration Query
Query: Show the full cast members of all films starring both Joe Pesci and Robert de Niro
Dataset: Freebase RDF; data is partitioned into separate tables by predicate
CastMember(cast) :-
    ActorName(a1, “Joe Pesci”), ActorPerform(a1, p1), PerformFilm(p1, film),
    ActorName(a2, “Robert de Niro”), ActorPerform(a2, p2), PerformFilm(p2, film),
    PerformFilm(p, film), ActorPerform(p, cast)
Freebase Query: Data Shuffling
• Regular shuffles: 7M tuples
• HyperCube shuffle: 105M tuples (16x replication)
• Broadcast: 351M tuples (50x replication)
[Figure: regular-shuffle plan for the 8-way join on Freebase, a tree of binary joins over the input relations annotated with tuple counts that range from 2 to 1.10M]
Knowledge Exploration in Freebase
[Figure: query runtime (sec) for the 8-way join on Freebase]
Comparing shuffle paradigms: Regular < HyperCube < Broadcast
Comparing sequential join algorithms: Hash join < Tributary join
Empirical Study Summary
The best query plan depends on the query, data, and cluster:
• Size of intermediate result
• Replication factor of HyperCube
Large intermediate results favor HyperCube and Tributary Join:
• Small communication
• Small input
• Reducing sorting time
Optimizing HyperCube Shuffle
Optimization goal: minimize the maximum load of any single worker
Example: for Q1 with 64 workers, 4x4x4 is better than 2x4x8
What if we have 63 workers or a 7-way join?
State of the art: Linear Programming (Beame et al., PODS13)
• If |A| = |B| = |C| = N with 63 servers, the optimum is 3.98 x 3.98 x 3.98
The penalty of rounding down is non-negligible: 3x3x3 uses only 27 of the 63 servers
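The arithmetic behind this example can be checked with a few lines. The load model below is an assumption (uniform hashing, no skew, each relation split over the two shares of its attributes), and `load_per_worker` is a hypothetical helper, not something defined in the talk.

```python
def load_per_worker(n, shares):
    """Tuples received per worker for the triangle query with |A| = |B| = |C| = n,
    assuming uniform hashing: each relation is split over two of the three shares."""
    px, py, pz = shares
    return n / (px * py) + n / (py * pz) + n / (pz * px)

P, N = 63, 1_000_000
s = P ** (1 / 3)                              # fractional share per dimension
fractional = load_per_worker(N, (s, s, s))    # load with fractional shares
rounded = load_per_worker(N, (3, 3, 3))       # every share rounded down
penalty = rounded / fractional
print(round(s, 2))                            # 3.98
```

Under this model, rounding down to 3x3x3 leaves each of the 27 used workers with roughly 1.76 times the load of the fractional optimum.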
A Simple Yet Effective Algorithm for HyperCube Configuration
Algorithm:
1. Enumerate all the HyperCube configurations with number of servers ≤ P
2. Find the configuration with minimal shuffle cost
Tie-breaking heuristic: 1x16 vs 4x4
Best configuration of previous example: 3x4x5
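The two steps above can be sketched as a brute-force search. The cost function is an assumption (expected tuples per worker under uniform hashing), ties are broken by enumeration order rather than by the heuristic mentioned on the slide, and all names are hypothetical.

```python
from fractions import Fraction

def configurations(P, k):
    """Yield every k-tuple of positive integer shares whose product is <= P."""
    if k == 0:
        yield ()
        return
    for p in range(1, P + 1):
        for rest in configurations(P // p, k - 1):
            yield (p,) + rest

def worker_load(shares, relations):
    """Expected tuples per worker; `relations` is a list of (size, attribute indexes)."""
    total = Fraction(0)
    for size, attrs in relations:
        cells = 1
        for a in attrs:
            cells *= shares[a]
        total += Fraction(size, cells)
    return total

def best_configuration(P, num_attrs, relations):
    # Step 1: enumerate; step 2: keep the configuration with minimal cost.
    return min(configurations(P, num_attrs),
               key=lambda shares: worker_load(shares, relations))

N = 1_000_000
# Triangle query with attributes x=0, y=1, z=2: A(x, y), B(y, z), C(z, x).
triangle = [(N, (0, 1)), (N, (1, 2)), (N, (2, 0))]
print(best_configuration(64, 3, triangle))    # (4, 4, 4)
print(best_configuration(63, 3, triangle))    # (3, 4, 5)
```

Exact fractions avoid floating-point noise when several permutations of the same shares have equal cost. For 63 workers the search returns 3x4x5, which uses 60 servers and gives each worker N/5 tuples under this model.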
Evaluation of HyperCube Optimization
Compare different configuration algorithms:
• Our algorithm
• Rounding down
• Random (many virtual servers mapped to real servers)
Opt. Ratio: Max Load / Optimal (by LP solution)
Our algorithm outperforms rounding down and random, with an optimality ratio of at most 1.06
More in the paper
Tributary join cost model and attribute order optimization
Evaluation of more queries
Comparison with parallel semi-join plans
Open source implementation in Myria: https://github.com/uwescience/myria
Conclusions
Efficient parallel join query evaluation: bridging the gap between theory and practice
• Select the best parallel query plan: shuffle paradigm, sequential join algorithm
• Optimal HyperCube configuration
• Optimizing Tributary join attribute order
Thanks! Myria Team
Query execution profiling
PerfOpticon: the visual query profiling tool used in Myria
Cost Model Explained
Query:
Number of binary searches in first attribute:
Number of binary searches in a joined attribute:
The total cost
Why is random HyperCube cell allocation bad?
Query: A(x, y, z, p) :- S(x, y), R(y, z), T(z, p)
64 cells in an 8 x 8 hypercube, randomly allocated to 4 servers
Server 1 will receive:
• 7/8 of S (1/2 if optimal)
• 1/4 of R
• 7/8 of T (1/2 if optimal)
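A back-of-the-envelope check of these fractions, under assumptions not stated on the slide: the 8 x 8 grid is over (y, z), each S tuple is replicated to the 8 cells that share its y coordinate, and every cell is assigned to one of the 4 servers independently and uniformly at random. The function names are hypothetical.

```python
from fractions import Fraction

def expected_fraction_random(cells_per_row=8, servers=4):
    """Expected fraction of S that one server receives: it gets a tuple whenever
    it owns at least one of the cells in that tuple's row."""
    miss = Fraction(servers - 1, servers) ** cells_per_row
    return 1 - miss

def fraction_block(rows_owned=4, total_rows=8):
    """Fraction of S received when the server owns a contiguous 4 x 4 block."""
    return Fraction(rows_owned, total_rows)

random_share = expected_fraction_random()     # 58975/65536, about 0.90
block_share = fraction_block()                # 1/2
```

The result, about 0.90, is in the neighborhood of the 7/8 quoted above; the exact figure depends on how the random allocation is modeled. R is hashed on both y and z, so each R tuple lands in exactly one cell and a server holding a quarter of the cells receives 1/4 of R either way.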
Myria: new generation parallel DBMS
[Figure: MyriaX architecture. A coordinator (REST server, catalog) sends JSON query plans and other instructions to the workers of a shared-nothing cluster; the diagram shows a worker catalog, an RDBMS, and HDFS for each worker.]
Primary data store:
Can also ingest data from: