minimizing cost in distributed multiquery processing applications

Upload: luis-galarraga

Post on 16-Apr-2017

Minimizing Communication Cost in Distributed Multi-query Processing

Jian Li, Amol Deshpande, Samir Khuller

Department of Computer Science, University of Maryland

Presented by: Luis Galárraga

Saarland University

July 7th, 2010

Outline

Justification and related work

Problem formulation

Proposed methods and analysis

Graph theory concepts

Tree topology

Arbitrary graph topologies

Experimental results

Conclusions

Justification

Emergence of large-scale distributed query processing in applications like:

Wireless sensor networks

Publish-subscribe systems

Distributed stream processing applications

Common need: minimize data movement!

Justification

Data transfer cost from one node to another may be heterogeneous.

Problem formulation

Minimization of data movement for multiple queries.

Assumptions:

Non-uniform communication cost model.

No restrictions on data size.

Query plans are part of the input.

Intermediate results sizes are known.

More formally

Input: Set of relations or data sources

Topology, undirected weighted graph

Assignment of relations to nodes in the topology.

If more than one source resides in a node, we can create one node per relation and link these nodes with edges of weight 0.

Replication becomes part of the query plan optimization. In that case the tree query plan given to the algorithm should be the minimum-weight tree, considering only the edge weights in the topology. The shortest paths between every pair of nodes can be precomputed (for joins of 3 relations we could use associativity); then we only need to pick the group of replicas whose distance is smallest. This choice is made only at the leaves of the query plan.

More formally

Set of queries

Each query comes with a plan in the form of a directed tree.

Destination node

Data sources involved

Data size

[Figure: query plan fragment in which sources Si and Sj are combined into Si x Sj; figure labels: w, z(Si), z(Sj), z(Si x Sj).]

More formally

Given the topology graph Gc and a set of trees representing the query plans, our goal is to find a data movement plan that minimizes the total communication cost incurred while executing the queries.

Problem formulation

[Figure: Queries and Topology Gc. Three query plans (S1 x S2 x S4 with destination C, S2 x S6 with destination D, S2 x S5 with destination B) next to the topology Gc with nodes A to F and sources S1 to S6. Parenthesized numbers in the figure are data sizes.]

Problem formulation

If a block of data of size S is sent along an edge e with weight w(e), then the communication cost is S * w(e).

For simplicity, the examples assume w(e) = 1 for all edges, but the algorithm works for arbitrary edge weights.
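The cost model above can be illustrated with a minimal sketch. The three-node line topology, its edge weights and the helper name `path_cost` are assumptions made for this example; they do not come from the slides.

```python
def path_cost(size, path, weight):
    """Cost of shipping `size` units of data along the consecutive nodes in `path`.

    `weight` maps an undirected edge, stored as a frozenset of its two
    endpoints, to w(e). Each traversed edge contributes size * w(e).
    """
    return sum(size * weight[frozenset((u, v))] for u, v in zip(path, path[1:]))


# Hypothetical line topology A - B - C with non-uniform edge weights.
weight = {frozenset(("A", "B")): 2, frozenset(("B", "C")): 3}

# Shipping 10 units from A to C costs 10*2 + 10*3.
cost = path_cost(10, ["A", "B", "C"], weight)
print(cost)
```

With w(e) = 1 on every edge, as in the slide examples, the cost reduces to the data size times the number of hops.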

Problem analysis

The problem has been proved NP-hard via a reduction from the Steiner tree problem.

Is everything lost? If the topology graph is a tree, there is a polynomial-time algorithm.

For general topologies, approximation algorithms are known.

Steiner Tree problem

Given an undirected weighted graph and a subset S of its vertices (the terminals):

Find a tree of minimum weight that connects all vertices in S.

It can contain vertices not in S, known as Steiner points.

Steiner Tree problem

[Figure: example weighted graph with its terminals and Steiner points marked.]
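To make the definition concrete, here is a small exhaustive-search sketch. It is not one of the approximation algorithms mentioned in the talk: it simply tries every set of optional Steiner points and keeps the cheapest spanning tree, which is only feasible for tiny graphs. The graph and all function names are assumptions made for this example.

```python
from itertools import combinations


def mst_weight(nodes, edges):
    """Kruskal on the subgraph induced by `nodes`.

    Returns the spanning tree weight, or None if that subgraph is disconnected.
    """
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    total, used = 0, 0
    candidates = sorted((w, u, v) for (u, v), w in edges.items()
                        if u in parent and v in parent)
    for w, u, v in candidates:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            total += w
            used += 1
    return total if used == len(nodes) - 1 else None


def steiner_tree_weight(vertices, edges, terminals):
    """Exact Steiner tree weight by enumerating all sets of Steiner points."""
    optional = [v for v in vertices if v not in terminals]
    best = None
    for k in range(len(optional) + 1):
        for extra in combinations(optional, k):
            w = mst_weight(set(terminals) | set(extra), edges)
            if w is not None and (best is None or w < best):
                best = w
    return best


# Hypothetical graph: three terminals a, b, c around a central vertex x.
vertices = ["a", "b", "c", "x"]
edges = {("a", "b"): 4, ("b", "c"): 4, ("a", "c"): 4,
         ("a", "x"): 1, ("b", "x"): 1, ("c", "x"): 1}
best = steiner_tree_weight(vertices, edges, {"a", "b", "c"})
print(best)
```

In this graph the cheapest tree over the terminals alone weighs 8, while routing through the Steiner point x gives a tree of weight 3.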

The algorithm

It involves solving a series of min-cut problems on appropriately constructed hypergraphs.

Umm.. Hypergraphs?

Hypergraphs

Generalization of a graph. In normal graphs, edges can be seen as pairs of vertices.

Hyperedges can group any number of vertices.

Max-flow/Min-cut

Given a weighted, directed graph and nodes s, t known as source and sink:

Find a flow or mapping of maximum value:

Max-flow/Min-cut

[Figure: example flow network from source s to sink t; every edge is labeled flow/capacity.]

Max-flow/Min-cut

A min-cut is a set of edges with minimum weight such that if removed from the graph, there is no path from s to t.

The maximum value of an s-t flow is equal to the minimum capacity of an s-t cut.

Min-cut can be solved in polynomial time. Max-flow algorithms:

Edmonds-Karp: O(|V| * |E|^2)

Ford-Fulkerson: O(|E| * f), where f is the value of the maximum flow

Dinitz: O(|V|^2 * |E|)
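The relation between maximum flow and minimum cut can be checked on a small instance. The sketch below is a plain Edmonds-Karp implementation (shortest augmenting paths found by breadth-first search); the five-edge network is an assumption made for this example and is not the network drawn on the slides.

```python
from collections import deque


def edmonds_karp(capacity, s, t):
    """Return (max flow value, nodes on the source side of a minimum cut).

    `capacity` maps directed edges (u, v) to non-negative capacities.
    """
    residual = {s: {}, t: {}}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)  # reverse edge, used to undo flow

    flow = 0
    while True:
        # Breadth-first search over edges with remaining capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, r in residual[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: the reached nodes form the source side.
            return flow, set(parent)
        # Find the bottleneck on the path, then push that much flow along it.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical network.
capacity = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
            ("a", "t"): 2, ("b", "t"): 3}
value, source_side = edmonds_karp(capacity, "s", "t")
cut = sum(c for (u, v), c in capacity.items()
          if u in source_side and v not in source_side)
print(value, cut)
```

The capacity of the edges leaving `source_side` equals the flow value, as the max-flow/min-cut theorem states.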

Max-flow/Min-cut

[Figure: example flow network from source s to sink t; every edge is labeled flow/capacity.]

Max-flow/Min-cut in hypergraphs

Problem solvable in polynomial time.

What about max-flow in hypergraphs?

For every hyperedge, add two new nodes and a directed edge between them with capacity equal to the weight of the hyperedge.

Max-flow/Min-cut in hypergraphs

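[Figure: a hyperedge of weight w over the vertices 1 to 5, and the same hyperedge after adding the two new nodes 6 and 7 joined by an edge of capacity w.]

Max-flow/Min-cut in hypergraphs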

Weighted hypergraph partition problem.

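Instead of a source and a sink we have 2 disjoint subsets of nodes, Ls and Lt.

We want to find a partition (S, T) of V, with Ls contained in S and Lt contained in T, that minimizes the total weight of the hyperedges cut by the partition.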

Weighted hypergraph partition problem.

Convert it to a min-cut problem by adding artificial source and sink.

[Figure: artificial source s attached to the nodes of Ls and artificial sink t attached to the nodes of Lt.]
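The reduction from hypergraph partition to min-cut can be sketched as follows. The slides only state that each hyperedge gets two new nodes joined by an edge of capacity w, and that an artificial source and sink are added. The infinite-capacity edges wiring the member vertices to those two nodes, and the treatment of hyperedges as undirected, are assumptions based on the standard construction. The example hypergraph is also hypothetical.

```python
from collections import deque

INF = float("inf")


def max_flow(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts capacity map; returns (value, source side)."""
    res = {s: {}, t: {}}
    for u, nbrs in cap.items():
        for v, c in nbrs.items():
            res.setdefault(u, {})[v] = c
            res.setdefault(v, {}).setdefault(u, 0)
    flow = 0
    while True:
        parent, queue = {s: None}, deque([s])
        while queue:
            u = queue.popleft()
            for v, r in res[u].items():
                if r > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)
        push, v = INF, t
        while parent[v] is not None:
            push = min(push, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= push
            res[v][u] += push
            v = u
        flow += push


def hypergraph_partition(hyperedges, Ls, Lt):
    """hyperedges: list of (vertex set, weight); Ls and Lt must be disjoint.

    Returns (weight of the cut hyperedges, vertices placed on the Ls side).
    """
    cap = {}

    def add(u, v, c):
        cap.setdefault(u, {})[v] = c

    for i, (members, w) in enumerate(hyperedges):
        e_in, e_out = ("in", i), ("out", i)
        add(e_in, e_out, w)          # the only finite edge of the gadget
        for v in members:            # assumed wiring: uncuttable edges
            add(v, e_in, INF)
            add(e_out, v, INF)
    for v in Ls:
        add("s", v, INF)             # artificial source
    for v in Lt:
        add(v, "t", INF)             # artificial sink
    value, side = max_flow(cap, "s", "t")
    vertices = set().union(*(m for m, _ in hyperedges))
    return value, {v for v in vertices if v in side}


# Hypothetical hypergraph on vertices a, b, c, d.
hyperedges = [({"a", "b", "c"}, 5), ({"c", "d"}, 2), ({"b", "d"}, 1)]
weight_cut, s_side = hypergraph_partition(hyperedges, Ls={"a"}, Lt={"d"})
print(weight_cut, sorted(s_side))
```

Here keeping b and c with a cuts only the two light hyperedges (2 + 1), which is cheaper than cutting the heavy hyperedge of weight 5.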

Tree topology - Step 1

Build a directed weighted hypergraph HD by combining the query plan trees of all the queries. Edges are oriented from children to parent.

It explicitly captures all the opportunities for sharing the movement of data sources among the queries.

Tree topology - Step 2

For every edge in the topology graph, decide which data sources and intermediate results move across that edge by solving an instance of the weighted hypergraph partition problem.

Tree topology - Step 1 - Single query

[Figure: topology Gc (nodes A to F, sources S1 to S6) and the hypergraph HD for the single query S1 x S2 x S4 with destination C.]

Here HD has the same structure as the query plan.

Tree topology - Step 2 - Single query

For every edge (u, v) in Gc, solve an instance of the weighted hypergraph partition problem.

Removing (u, v) divides the tree into 2 components, Gcu and Gcv.

Label the nodes in the hypergraph with u or v, depending on the component in which they lie.
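The labeling step can be sketched as below. The edge list of the tree and the placement of sources on nodes are assumptions made for this example (the exact layout of Gc is not recoverable from the transcript), and `split_tree` and `label_nodes` are illustrative names.

```python
from collections import deque


def split_tree(adjacency, u, v):
    """Remove edge (u, v) from a tree and return the component containing u."""
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        for y in adjacency[x]:
            if {x, y} == {u, v} or y in seen:
                continue  # skip the removed edge and already visited nodes
            seen.add(y)
            queue.append(y)
    return seen


def label_nodes(adjacency, host, u, v):
    """Label every located hypergraph node with u or v, by the side of its host."""
    side_u = split_tree(adjacency, u, v)
    return {name: (u if node in side_u else v) for name, node in host.items()}


# Hypothetical tree topology: A and B hang off C, E and F hang off D.
adjacency = {"A": ["C"], "B": ["C"], "C": ["A", "B", "D"],
             "D": ["C", "E", "F"], "E": ["D"], "F": ["D"]}
# Assumed placement of the sources and the destination of one query.
host = {"S1": "A", "S2": "B", "S4": "D", "destination": "C"}

labels = label_nodes(adjacency, host, "C", "D")
print(labels)
```

The nodes labeled u become Ls and the nodes labeled v become Lt in the partition problem solved for that edge.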

Tree topology - Step 2 - Single query

[Figure: HD for the single query next to the topology Gc. For edge C-D, four nodes of HD carry the labels C, C, D and C.]

Now solve a weighted hypergraph partition problem where Ls is the set of nodes labeled C and Lt the set of nodes labeled D.

The solution induces a labeling of the internal nodes and the data transfers across the edge: S1 x S2 from C to D, and S1 x S2 x S4 from D to C.

Tree topology - Step 2 - Multiple queries

[Figure: the three query plans (S1 x S2 x S4 with destination C, S2 x S6 with destination D, S2 x S5 with destination B), in which source S2 appears three times.]

Sources appearing more than once generate hyperedges with weight equal to the source data size.
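A minimal sketch of how shared sources could be merged into hyperedges when the plans are combined. The representation (each plan as a list of child-to-parent edges, each hyperedge as a tail with a set of heads), the node names and the data sizes are all assumptions made for this example.

```python
from collections import defaultdict


def build_hd(plans, size):
    """Combine query plans into one directed hypergraph.

    plans: list of plans, each a list of (child, parent) edges.
    size:  data size of every child node.
    Returns {tail: (frozenset of heads, weight)}; a tail with several heads
    is a hyperedge whose weight is the size of the shared data.
    """
    heads = defaultdict(set)
    for plan in plans:
        for child, parent in plan:
            heads[child].add(parent)
    return {tail: (frozenset(hs), size[tail]) for tail, hs in heads.items()}


# The three queries of the running example; sizes are hypothetical.
plans = [
    [("S1", "S1xS2"), ("S2", "S1xS2"), ("S1xS2", "S1xS2xS4"),
     ("S4", "S1xS2xS4"), ("S1xS2xS4", "dest:C")],
    [("S2", "S2xS6"), ("S6", "S2xS6"), ("S2xS6", "dest:D")],
    [("S2", "S2xS5"), ("S5", "S2xS5"), ("S2xS5", "dest:B")],
]
size = {"S1": 10, "S2": 10, "S4": 100, "S5": 8, "S6": 100,
        "S1xS2": 7, "S1xS2xS4": 5, "S2xS6": 5, "S2xS5": 6}

hd = build_hd(plans, size)
print(hd["S2"])
```

Source S2 is used by all three queries, so it yields a single hyperedge with three heads; every other node keeps an ordinary edge with one head.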

Tree topology - Step 2 - Multiple queries

[Figure: the combined hypergraph HD, in which S2 appears once and is shared by the three queries.]

Consider again edge C-D in the topology.

[Figure: HD with its nodes labeled C or D for edge C-D; the annotations S1 x S2, S1 x S2 x S4, S2 x S5 and S2 appear in the figure.]

Tree topology - Step 3

Combine the local solutions for all the edges of the topology into a single global data movement plan.

Local solutions may not agree on where the internal nodes should be evaluated.

Tree topology - Step 3

For every internal node i in HD, construct a directed graph Ji with:

Then add the edges according to this rule (for every edge in Gc):

Tree topology - Step 3

Ji encodes information about how the different local plans locate the evaluation of node i, and defines a path.

[Figure: the graph J for the internal node S1 x S2 x S4 over the topology nodes A to F, with one node marked "Evaluate here!".]

Arbitrary graph topologies

For single queries, a dynamic programming approach offers an optimal solution in O(n^2 m + n^3), where m = number of nodes in the query tree and n = number of nodes in the topology graph.

For multiple queries, an O(lg(n)) approximation is achieved through tree metrics. Suitable for small n!
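The slides do not spell out the single-query recurrence. The sketch below is one plausible dynamic program that matches the stated bound, not necessarily the authors' formulation: all-pairs shortest paths account for the n^3 term, and a table over (plan node, network node) pairs for the n^2 m term. The topology, the sizes and every name are assumptions made for this example, and the topology is assumed to be connected.

```python
def all_pairs_shortest_paths(nodes, weight):
    """Floyd-Warshall on an undirected weighted graph: the O(n^3) part."""
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), w in weight.items():
        dist[u][v] = dist[v][u] = min(dist[u][v], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def single_query_cost(nodes, weight, children, size, host, root, destination):
    """Minimum communication cost of one query plan on an arbitrary topology."""
    dist = all_pairs_shortest_paths(nodes, weight)

    def place(v):
        """place(v)[x]: cheapest cost of producing the result of v at node x."""
        if v not in children:  # leaf: the source already sits on its host
            return {x: (0 if x == host[v] else float("inf")) for x in nodes}
        tables = [(c, place(c)) for c in children[v]]
        # Each child result is produced at some y and shipped from y to x.
        return {x: sum(min(t[y] + size[c] * dist[y][x] for y in nodes)
                       for c, t in tables)
                for x in nodes}

    table = place(root)
    # Finally ship the root result to the destination node.
    return min(table[y] + size[root] * dist[y][destination] for y in nodes)


# Hypothetical line topology A - B - C with unit weights.
nodes = ["A", "B", "C"]
weight = {("A", "B"): 1, ("B", "C"): 1}
children = {"J": ["S1", "S2"]}          # J joins S1 and S2
size = {"S1": 10, "S2": 2, "J": 1}
host = {"S1": "A", "S2": "C"}

best = single_query_cost(nodes, weight, children, size, host, "J", "C")
print(best)
```

In this instance the cheapest plan ships the small source S2 to A (cost 4), joins there, and returns the tiny result to C (cost 2), even though the destination is C.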

Arbitrary graph topologies - The pairs problem

Restrictions on the query structure.

Pairs problem: assume each query is defined by a pair of nodes whose data should meet somewhere in the network.

A query-overlap graph H is a graph where the vertices correspond to the set of data items and each edge corresponds to a pair query.
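A short sketch of the query-overlap graph, with a check for the star-shaped case. The pair queries and the function names are assumptions made for this example.

```python
def overlap_graph(pair_queries):
    """Edges of H: one undirected edge per pair query (duplicates collapse)."""
    return {frozenset(q) for q in pair_queries}


def star_center(edges):
    """Return a data item shared by every edge of H, or None if H is not a star."""
    common = None
    for e in edges:
        common = set(e) if common is None else common & e
    return sorted(common)[0] if common else None


# Hypothetical pair queries: every query involves S2.
star_queries = [("S2", "S1"), ("S2", "S5"), ("S2", "S6")]
center = star_center(overlap_graph(star_queries))
print(center)

# Two unrelated queries do not form a star.
other = star_center(overlap_graph([("S1", "S2"), ("S3", "S4")]))
print(other)
```

For a graph with a single edge either endpoint can serve as the center; the sketch returns the alphabetically first one.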

The pairs problem

Approximation algorithm when H is a star.

For data sources of the same size: reducible to the minimum Steiner tree problem.

Otherwise: reducible to the Connected Facility Location problem.

Isn't this problem NP-hard? Yes, but good approximation algorithms do exist!

Connected Facility Location Problem

Input:

Output:

Bought edges (Steiner cost)

Rented edges (connection cost)

Connected Facility Location Problem

[Figure: example instance showing the demands D, the facilities F and the rented edges.]

The pairs problem

Steiner Tree and SROB have approximation algorithms with ratio 1.55 and 2.92 respectively.

The final approximation ratio depends on the topology of the query-overlap graph: if the star arboricity SN of H can be computed in polynomial time, then there is a *SN(H)-approximation.

Star arboricity: the minimum number of star forests into which the edges of the graph can be partitioned. Why?

Experimental results

Topology cases: tree and arbitrary graph

Multi-query approach compared with isolated optimization per query.

Costs are normalized by the cost of the naive approach.

Identical data size vs tri-modal distribution

Random query overlap.

Tree topology

Arbitrary topology - Pairs problem

Conclusions

Communication cost can be the main bottleneck in certain applications.

Optimizing data movement is a hard problem.

It relies on assumptions about query structure and topology.

For certain cases efficient algorithms are available.

It has to be complemented with other optimization schemes.

Thank you!!
