minimizing cost in distributed multiquery processing applications
Minimizing Communication Cost in Distributed Multi-query Processing
Jian Li, Amol Deshpande, Samir Khuller
Department of Computer Science, University of Maryland
Presented by: Luis Galárraga
Saarland University
July 7th, 2010
Outline
Justification and related work
Problem formulation
Proposed methods and analysis
Graph theory concepts
Tree topology
Arbitrary graph topologies
Experimental results
Conclusions
Justification
Emergence of large-scale distributed query processing in applications like:
Wireless sensor networks
Publish-subscribe systems
Distributed stream processing applications
Common need: minimize data movement!
Justification
The cost of transferring data from one node to another may be heterogeneous.
Problem formulation
Minimization of data movement for multiple queries.
Assumptions:
Non-uniform communication cost model.
No restrictions on data size.
Query plans are part of the input.
Intermediate results sizes are known.
More formally
Input:
Set of relations or data sources.
Topology Gc, an undirected weighted graph.
Assignment of relations to nodes in the topology.
If more than one source resides in a node, we can create one node per relation and link those nodes with edges of weight 0.
Replication becomes part of the query plan optimization: the query plan tree given to the algorithm should be the one of minimum weight, considering only the edge weights in the topology. The shortest paths between every pair of nodes can be precomputed (for joins of three relations we could exploit associativity); we then only need to pick the group of replicas whose members are closest to each other. This choice is only made at the leaves of the query plan.
More formally
Set of queries
Each query comes with a plan in the form of a directed tree.
[Figure: query plan tree with its parts labeled: destination node w, data sources involved Si and Sj, their join Si x Sj, and data sizes z(Si), z(Sj), z(Si x Sj).]
More formally
Given the topology graph Gc and a set of trees representing the query plans, our goal is to find a data movement plan that minimizes the total communication cost incurred while executing the queries.
Problem formulation
[Figure: example instance. Topology Gc with nodes A to F holding the data sources S1 to S6. Queries: S1 x S2 x S4 with destination C, S2 x S6 with destination D, and S2 x S5 with destination B. Sources and intermediate results are annotated with data sizes in parentheses.]
Problem formulation
If a block of data of size S is sent along an edge e with weight w(e), the communication cost is S * w(e).
For simplicity, the examples assume w(e) = 1 for all edges, but the algorithm handles arbitrary weights.
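A minimal sketch of the cost model, assuming that data sent over several hops pays S * w(e) on each edge of its path. The function name, the node names and the weights below are illustrative assumptions, not part of the original slides.

```python
def transfer_cost(size, path, weight):
    """Cost of sending `size` units of data along `path` (a list of nodes).

    `weight` maps an undirected edge, stored as a frozenset of its two
    endpoints, to w(e). Each hop contributes size * w(e).
    """
    total = 0
    for u, v in zip(path, path[1:]):
        total += size * weight[frozenset((u, v))]
    return total


# Hypothetical topology fragment: A - C with weight 1, C - D with weight 3.
weight = {frozenset(("A", "C")): 1, frozenset(("C", "D")): 3}

# Sending 10 units from A to D via C costs 10 * 1 + 10 * 3.
cost = transfer_cost(10, ["A", "C", "D"], weight)
print(cost)  # 40
```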
Problem analysis
The problem has been proved NP-hard via a reduction from the Steiner Tree problem.
Is everything lost?
If the topology graph is a tree, there is a polynomial-time algorithm.
For general topologies, approximation algorithms are known.
Steiner Tree problem
Given an undirected weighted graph G = (V, E) and a subset of vertices S (the terminals):
Find a tree of minimum weight that connects all vertices in S.
It can contain vertices not in S, known as Steiner points.
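To make the definition concrete, here is a brute-force sketch for very small graphs. It relies on the fact that an optimal Steiner tree is a minimum spanning tree of the subgraph induced by its own vertices, so it tries every set of candidate Steiner points. It runs in exponential time and is only an illustration; the graph and all names are assumptions, not taken from the slides.

```python
from itertools import combinations


def mst_weight(nodes, edges):
    """Kruskal on the subgraph induced by `nodes`.

    Returns the MST weight, or None if that subgraph is disconnected.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    total, used = 0, 0
    induced = [e for e in edges if e[0] in parent and e[1] in parent]
    for u, v, w in sorted(induced, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            used += 1
    return total if used == len(parent) - 1 else None


def steiner_tree_weight(vertices, edges, terminals):
    """Minimum weight of a tree connecting `terminals` (exhaustive search)."""
    terminals = set(terminals)
    optional = [v for v in vertices if v not in terminals]
    best = None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            w = mst_weight(terminals | set(extra), edges)
            if w is not None and (best is None or w < best):
                best = w
    return best


# Triangle a, b, c with edges of weight 2, plus a hub x at distance 1 from each.
vertices = ["a", "b", "c", "x"]
edges = [("a", "b", 2), ("b", "c", 2), ("a", "c", 2),
         ("a", "x", 1), ("b", "x", 1), ("c", "x", 1)]
print(steiner_tree_weight(vertices, edges, ["a", "b", "c"]))  # 3
```

In this example the optimum uses x as a Steiner point (weight 3) instead of two triangle edges (weight 4).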
Steiner Tree problem
[Figure: example weighted graph in which terminals and Steiner points are marked.]
The algorithm
It involves solving a series of min-cut problems on appropriately constructed hypergraphs.
Umm.. Hypergraphs?
Hypergraphs
Generalization of a graph. In normal graphs, edges can be seen as pairs of vertices.
Hyperedges can group any number of vertices.
Max-flow/Min-cut
Given a weighted, directed graph and nodes s, t known as source and sink:
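Find a flow of maximum value, where a flow is a mapping that assigns each edge an amount no larger than its capacity and conserves flow at every node other than s and t.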
Max-flow/Min-cut
[Figure: example flow network from s to t; each edge is labeled flow/capacity, for example 2/3.]
Max-flow/Min-cut
A min-cut is a set of edges with minimum weight such that if removed from the graph, there is no path from s to t.
The maximum value of an s-t flow is equal to the minimum capacity of an s-t cut.
Min-cut can be solved in polynomial time through max-flow:
Ford-Fulkerson: O(|E| * f), where f is the value of the maximum flow (integer capacities)
Edmonds-Karp: O(|V| * |E|^2)
Dinitz: O(|V|^2 * |E|)
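As an illustration of the max-flow/min-cut computation, the following is a compact sketch of Edmonds-Karp (shortest augmenting paths found by breadth-first search). The example network is an assumption for demonstration purposes and is not the one in the figure.

```python
from collections import deque


def edmonds_karp(capacity, s, t):
    """Maximum s-t flow value; `capacity` maps (u, v) to the edge capacity."""
    residual, adj = {}, {}
    for (u, v), c in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + c
        residual.setdefault((v, u), 0)  # reverse edge for undoing flow
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow  # no augmenting path left: flow is maximum

        # Collect the path edges, then push the bottleneck amount along them.
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[e] for e in path)
        for u, v in path:
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
        flow += bottleneck


capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
            ("a", "t"): 2, ("b", "t"): 3}
max_flow = edmonds_karp(capacity, "s", "t")
print(max_flow)  # 5, equal to the capacity of the cut {s->a, s->b}
```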
Max-flow/Min-cut in hypergraphs
Problem solvable in polynomial time.
What about max-flow in hypergraphs?
For every hyperedge, add two new nodes and a directed edge between them with capacity equal to the weight of the hyperedge.
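A sketch of this transformation, under one assumption that the slide does not spell out: the tail vertices of a hyperedge are connected to the first new node and the second new node is connected to the head vertices, all with infinite capacity, so that only the weighted middle edge can be cut. The tuple-based node names are hypothetical.

```python
INF = float("inf")


def expand_hyperedges(hyperedges):
    """Turn directed hyperedges into ordinary capacitated edges.

    `hyperedges` is a list of (tails, heads, weight). The result maps
    (u, v) to a capacity and can be fed to any max-flow routine.
    """
    capacity = {}
    for idx, (tails, heads, weight) in enumerate(hyperedges):
        e_in, e_out = ("in", idx), ("out", idx)  # the two new nodes
        capacity[(e_in, e_out)] = weight         # the only finite edge
        for u in tails:
            capacity[(u, e_in)] = INF            # assumed connector edges
        for v in heads:
            capacity[(e_out, v)] = INF
    return capacity


# One hyperedge of weight 4 from node 1 to nodes 2, 3, 4, 5.
capacity = expand_hyperedges([({1}, {2, 3, 4, 5}, 4)])
print(len(capacity))  # 6 edges: 1 weighted, 1 from the tail, 4 to the heads
```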
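Max-flow/Min-cut in hypergraphs
[Figure: a hyperedge of weight w connecting node 1 with nodes 2, 3, 4 and 5, and its transformation, in which two new nodes 6 and 7 are joined by an edge of capacity w.]
Max-flow/Min-cut in hypergraphs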
Weighted hypergraph partition problem.
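Instead of a source and a sink we have two disjoint subsets of nodes, Ls and Lt.
We want to find a partition (S, T) of V such that Ls lies in S, Lt lies in T, and the total weight of the hyperedges crossing the partition is minimum.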
Weighted hypergraph partition problem.
Convert it to a min-cut problem by adding artificial source and sink.
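[Figure: an artificial source s attached to the nodes in Ls and an artificial sink t attached to the nodes in Lt.]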
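The reduction can be sketched end to end as follows. This is an illustrative implementation under stated assumptions: hyperedges are treated as undirected vertex sets (a hyperedge is cut when its vertices are split), Ls and Lt are disjoint, and the artificial source and sink are attached with infinite-capacity edges. The example weights loosely follow the single-query plan S1 x S2 x S4; the node names J12, J124 and dest are hypothetical.

```python
from collections import deque

INF = float("inf")


def hypergraph_partition(hyperedges, Ls, Lt):
    """Return (cut_weight, S) with Ls inside S and Lt outside S.

    `hyperedges` is a list of (set_of_vertices, weight).
    """
    cap = {}

    def add(u, v, c):
        cap[(u, v)] = cap.get((u, v), 0) + c
        cap.setdefault((v, u), 0)  # residual reverse edge

    for idx, (members, weight) in enumerate(hyperedges):
        e_in, e_out = ("in", idx), ("out", idx)
        add(e_in, e_out, weight)
        for v in members:
            add(v, e_in, INF)
            add(e_out, v, INF)
    src, snk = "source*", "sink*"  # artificial source and sink
    for v in Ls:
        add(src, v, INF)
    for v in Lt:
        add(v, snk, INF)

    adj = {}
    for u, v in cap:
        adj.setdefault(u, []).append(v)

    def reachable():
        """BFS over edges with residual capacity; returns the parent map."""
        parent = {src: None}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    total = 0
    while True:
        parent = reachable()
        if snk not in parent:
            break
        path, v = [], snk
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= push
            cap[(v, u)] += push
        total += push

    originals = set(Ls) | set(Lt)
    for members, _ in hyperedges:
        originals |= set(members)
    S = {v for v in reachable() if v in originals}
    return total, S


plan_edges = [({"S1", "J12"}, 10), ({"S2", "J12"}, 10), ({"J12", "J124"}, 7),
              ({"S4", "J124"}, 100), ({"J124", "dest"}, 5)]
cut, S = hypergraph_partition(plan_edges, Ls={"S1", "S2", "dest"}, Lt={"S4"})
print(cut, sorted(S))  # 12 ['J12', 'S1', 'S2', 'dest']
```

Here the cheapest partition cuts the edges of weight 7 and 5: the first join stays on the Ls side and the second join moves to the Lt side.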
Tree topology Step 1
Build a directed weighted hypergraph HD by combining the query plan trees of all the queries. Edges are oriented from children to parent.
It explicitly captures all the opportunities for sharing the movement of data sources among the queries.
Tree topology Step 2
For every edge in the topology graph decide which data sources and intermediate results move across that edge by solving an instance of the weighted hypergraph partition problem.
Tree topology Step 1 Single query
[Figure: topology Gc (nodes A to F holding the data sources S1 to S6) next to the hypergraph HD for the single query S1 x S2 x S4 with destination C, annotated with data sizes.]
Here HD has the same structure as the query plan.
Tree topology Step 2 Single query
For every edge (u, v) in Gc solve an instance of the weighted hypergraph partition problem.
Removing the edge (u, v) divides the tree into two components, Gcu and Gcv.
Label the nodes in the hypergraph with u or v depending on the component in which they lie.
[Figure: the topology tree split into the components Gcu and Gcv.]
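The labeling step can be sketched as below. The tree layout used in the example is an assumption (the figure's exact edges are not recoverable from the transcript), as are the function and variable names; only the labels of the nodes with a fixed location (sources and destination) are computed here.

```python
def component_labels(tree_edges, u, v, location):
    """Label each fixed hypergraph node with u or v.

    The edge (u, v) is removed from the topology tree; a node gets label u
    if its topology node lies in the component containing u, else label v.
    `location` maps a hypergraph node to the topology node holding it.
    """
    adj = {}
    for a, b in tree_edges:
        if {a, b} == {u, v}:
            continue  # drop the edge under analysis
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    # Depth-first search collects the component that contains u.
    side_u, stack = {u}, [u]
    while stack:
        a = stack.pop()
        for b in adj.get(a, ()):
            if b not in side_u:
                side_u.add(b)
                stack.append(b)

    return {item: (u if node in side_u else v)
            for item, node in location.items()}


# Hypothetical tree over the nodes A to F.
tree_edges = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("D", "F")]
location = {"S1": "A", "S2": "B", "S4": "D", "dest": "C"}
labels = component_labels(tree_edges, "C", "D", location)
print(labels)  # {'S1': 'C', 'S2': 'C', 'S4': 'D', 'dest': 'C'}
```

The nodes labeled u and v then play the roles of Ls and Lt in the weighted hypergraph partition problem for that edge.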
Tree topology Step 2 Single query
[Figure: topology Gc and hypergraph HD for edge C-D; the nodes of HD with a fixed location are labeled C, C, D, C according to the component in which they lie.]
Tree topology Step 2 Single query
[Figure: hypergraph HD with its fixed nodes labeled C, C, D, C.]
Now solve a weighted hypergraph partition problem where Ls is the set of nodes labeled C and Lt is the set of nodes labeled D.
Tree topology Step 2 Single query
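The solution induces a labeling for the internal nodes and the data transfers across the edge: S1 x S2 moves from C to D, and S1 x S2 x S4 moves from D to C.
[Figure: the partitioned HD, with the internal nodes labeled and the transferred results S1 x S2 and S1 x S2 x S4 marked.]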
Tree topology Step 2 Multiple queries
[Figure: the three query plan trees, S1 x S2 x S4 with destination C, S2 x S6 with destination D, and S2 x S5 with destination B, annotated with data sizes.]
Sources appearing more than once generate hyperedges with weight equal to the source data size.
Tree topology Step 2 Multiple queries
[Figure: the combined hypergraph HD, in which the shared source S2 appears once and is connected to all three queries.]
Tree topology Step 2 Multiple queries
Consider again edge C-D in the topology.
[Figure: HD with every node labeled C or D for edge C-D; the annotations shown include S1 x S2, S1 x S2 x S4, S2 x S5 and S2.]
Tree topology Step 3
Combine the local solutions for all the edges of the topology into a single global data movement plan.
Local solutions may not agree on where the internal nodes should be evaluated.
Tree topology Step 3
For every internal node i in HD, construct a directed graph Ji with:
Then add the edges according to this rule (for every edge in Gc):
Tree topology Step 3
Ji encodes information about how the different local plans locate evaluation of node i and defines a path.
[Figure: the graph J for node S1 x S2 x S4 drawn over the topology nodes A to F, with one node marked "Evaluate here".]
Arbitrary graph topologies
For single queries, a dynamic programming approach gives an optimal solution in O(n^2 m + n^3), where m = number of nodes in the query tree and n = number of nodes in the topology graph.
For multiple queries, an O(lg(n)) approximation is achieved through tree metrics.
Suitable for small n!
Arbitrary graph topologies: the pairs problem
Restrictions on the query structure.
Pairs problem:
Assume each query is defined by a pair of nodes whose data should meet somewhere in the network.
A query-overlap graph H is a graph where the vertices correspond to the set of data items and each edge corresponds to a pair query.
The pairs problem
Approximation algorithm when H is a star.
For data sources of the same size: reducible to the minimum Steiner Tree problem.
Otherwise: reducible to the Connected Facility Location Problem.
Isn't this problem NP-hard?
Yes, but good approximation algorithms do exist!
Connected Facility Location Problem
Input:
Output:
Bought edges (Steiner cost)
Rented edges (connection cost)
Connected Facility Location Problem
[Figure: example instance showing the demands D, the facilities F, and the rented edges.]
The pairs problem
Steiner Tree and SROB (Single-Sink Rent-or-Buy) have approximation algorithms with ratios 1.55 and 2.92 respectively.
The final approximation ratio depends on the topology of the query-overlap graph:
If the star arboricity SN(H) of H can be computed in polynomial time, then there is a *SN(H)-approximation.
Star arboricity: the minimum number of star forests into which the edges of the graph can be partitioned.
Why?
Experimental results
Topology cases: tree and arbitrary graph
Multi-query approach compared with isolated optimization per query.
Normalized with the cost of the naive approach.
Identical data size vs tri-modal distribution
Random query overlap.
Tree topology
Arbitrary topology Pairs problem
Conclusions
Communication cost can be the main bottleneck in certain applications.
Optimizing data movement is a hard problem.
It relies on assumptions about query structure and topology.
For certain cases efficient algorithms are available.
It has to be complemented with other optimization schemes.
Thank you!!