Probabilistic Adaptive Load Balancing for Parallel Queries

Daniel M. Yellin* Jorge Buenabad-Chávez** Norman W. Paton***

*IBM Israel Software Lab

** Centro de Investigacion y de Estudios Avanzados del IPN, Mexico

*** University of Manchester

Autonomic Computing

• Autonomic computing provides a general framework for adaptive systems: the MAPE loop (Monitor, Analyze, Plan, Execute)

• … but getting the details right is tough

• When a trend is sensed, when should the system adapt?
– Not too early, not too late

• What should it adapt to?
– Adapting in the wrong way can make things worse!

Our problem

• Given:
– A computational system S that can operate in one of several (possibly infinitely many) modes m1, m2, ...
– Each mode is optimized for a particular workload

• Goal:
– Monitor the existing workload and decide when to adapt S to a different mode, optimized for the current (or predicted) workload

• Considerations:
– Risk: there is no guarantee that the future workload will be similar to the current workload
– Cost: each time we change the mode of the system S from mi to mj, we incur a cost. Switching modes can be expensive!

[Figure: system S with modes m1 to m5; moving from one mode to another entails a cost]

Example: Pub-sub systems

Given: S = pub-sub system, including a server and a set of clients

Modes = {cache a particular data item on a client, store a particular data item on the server}

Goal: Monitor the access patterns of clients and decide when to move a data item from the server to a client, or from a client back to the server

Daniel M. Yellin: Competitive algorithms for the dynamic selection of component implementations. IBM Systems Journal 42(1): 85-97 (2003).

[Figure: a Server and two clients (Client 1, Client 2) holding data items d1, d2, d3, ...; request stream: read d3, read d3, ..., write d3, write d3, ...]

http://www.informatik.uni-trier.de/~ley/db/journals/ibmsj/ibmsj42.html#Yellin03

Example: Data type implementation

Given:

S = abstract data type with multiple implementations, each optimized for specific sorts of operations

Goal:

Monitor the operations on S and decide when to switch from one implementation to another

[Figure: a component with two implementations of the same data: impl1 uses key K1, which makes requests of type X faster; impl2 uses key K2, which makes requests of type Y faster]

Our approach

1. Monitor the existing workload and response times
2. Determine (a finite number of) modes to consider for adaptation
3. Determine the likelihood of (a finite number of) workloads in the immediate future
4. For each relevant mode, compute the expected cost (EC) of switching to that mode, based upon the probability of the different workloads and the cost of processing each workload in that mode

Note: the cost of adaptation (SwitchCost) is included in EC
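The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mode names, workload names, cost table and switch cost are hypothetical numbers, and `expected_cost` and `choose_mode` are names introduced here, not taken from the talk.

```python
def expected_cost(mode, current_mode, workload_probs, cost, switch_cost):
    """Probability-weighted processing cost of `mode`, plus the cost of switching to it."""
    processing = sum(p * cost(mode, w) for w, p in workload_probs.items())
    # Staying in the current mode incurs no switching cost.
    switching = 0.0 if mode == current_mode else switch_cost(current_mode, mode)
    return processing + switching


def choose_mode(modes, current_mode, workload_probs, cost, switch_cost):
    """Pick the candidate mode with the lowest expected cost."""
    return min(
        modes,
        key=lambda m: expected_cost(m, current_mode, workload_probs, cost, switch_cost),
    )


# Hypothetical cost table: cost of processing each workload in each mode.
COST = {
    ("m1", "read-heavy"): 1.0, ("m1", "write-heavy"): 5.0,
    ("m2", "read-heavy"): 4.0, ("m2", "write-heavy"): 1.0,
}
# Hypothetical likelihoods of the workloads in the immediate future.
PROBS = {"read-heavy": 0.3, "write-heavy": 0.7}


def table_cost(mode, workload):
    return COST[(mode, workload)]


cheap_switch = choose_mode(["m1", "m2"], "m1", PROBS, table_cost, lambda a, b: 1.5)
costly_switch = choose_mode(["m1", "m2"], "m1", PROBS, table_cost, lambda a, b: 2.5)
print(cheap_switch, costly_switch)
```

With these numbers, staying in m1 has an expected cost of 3.8. Switching to m2 costs 1.9 plus the switch cost, so the decision flips once the switch cost exceeds 1.9.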

Adaptive Query Processing

• A query optimiser, given a query and information on the data involved and the environment in which the query is to be run, proposes an execution plan for that query that is predicted to yield the best response time.

• If the information used by the optimiser is misleading (e.g. partial, incorrect, out-of-date or subject to change during query evaluation), the execution plan chosen by the optimiser may be inappropriate.

• In Adaptive Query Processing, the execution plan is modified at query runtime, on the basis of feedback received from the environment.

Adaptation for Load Balancing

• In partitioned parallelism, a task is divided into subtasks that are run in parallel on different nodes.

• For a join, A⋈B is represented as the union of the results of plan fragments Fi = Ai ⋈Bi , for i = 1..P, where P is the level of parallelism.

• The time taken to evaluate the join is max(evaluation_time(Fi )), for i = 1..P.

• As a result, any delay in completing a fragment Fi delays the completion of the operator, so it is crucial to match fragment size to node capabilities.

• Most join algorithms have state; as a result, changing the size of the fragment allocated to a machine involves replicating or relocating operator state.
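To see why fragment size must match node capability, here is a minimal sketch of the max-over-fragments timing rule. The linear cost model (tuples divided by a per-node processing rate) and all the numbers are assumptions made for illustration, not the cost model used in the experiments.

```python
def join_time(fragment_sizes, node_rates):
    """Time to evaluate the join: the slowest fragment determines completion."""
    return max(size / rate for size, rate in zip(fragment_sizes, node_rates))


def proportional_split(total, node_rates):
    """Give each node a share of the work proportional to its processing rate."""
    total_rate = sum(node_rates)
    return [total * rate / total_rate for rate in node_rates]


TOTAL_TUPLES = 300_000
# Hypothetical rates in tuples per second; the third node is slowed by external load.
RATES = [1000.0, 1000.0, 500.0]

even_split = [TOTAL_TUPLES / 3] * 3
t_even = join_time(even_split, RATES)
t_matched = join_time(proportional_split(TOTAL_TUPLES, RATES), RATES)
print(t_even, t_matched)
```

Under these assumptions the even split takes 200 s because the slow node finishes last, while the rate-proportional split lets all nodes finish together after 120 s.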

Flux*

• When load imbalance is detected:
– Halt query execution.
– Compute a new distribution policy (dp).
– Update hash tables by transferring data between nodes.
– Update dp in parent exchange nodes.
– Resume query execution.

• Many variations of this technique exist and have been compared.**

[Figure: Scan(A) with distribution policy dp feeding Join(A1,B1) and Join(A2,B2), which hold Hash table A1 and Hash table A2]

* M. Shah, J.M. Hellerstein, S. Chandrasekaran, M.J. Franklin, Flux: An Adaptive Partitioning Operator for Continuous Query Systems. ICDE 2003.

** Paton, N.W., Raman, V., Swart, G. and Narang, I., Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options, Proc. 3rd IEEE Intl. Conf. on Autonomic Computing, 2006.

Heuristics used by Flux

• Units of adaptation: the table is divided into partitions, and each node can gain or lose at most one partition during an adaptation

• Scale of adaptation: at most half the partitions can be moved at any adaptation

• Frequency of adaptation: once an adaptation has taken place and taken time s, no further adaptation occurs until a further time s has elapsed

• Timing of adaptation: applies heuristics to determine when to transfer partition from over-utilized processor to under-utilized processor

A brief review …

Our algorithm is based upon the concept of mathematical expectation.

If the probabilities of obtaining the amounts a1, a2, ..., ak are p1, p2, ..., pk, where p1 + p2 + ... + pk = 1, then the mathematical expectation is:

E = a1 * p1 + a2 * p2 + ... + ak * pk

For example, if we win $10 when a die comes up 1 or 6, and lose $5 when it comes up 2, 3, 4 or 5, our mathematical expectation is:

E = 10*(2/6) + (-5)*(4/6) = 0
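The die example can be checked with exact arithmetic. The helper name `expectation` is introduced here only for illustration.

```python
from fractions import Fraction


def expectation(outcomes):
    """Mathematical expectation of (amount, probability) pairs."""
    if sum(p for _, p in outcomes) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(amount * p for amount, p in outcomes)


# Win $10 on a 1 or 6 (2 faces out of 6), lose $5 on 2, 3, 4 or 5 (4 faces out of 6).
die_game = [(10, Fraction(2, 6)), (-5, Fraction(4, 6))]
print(expectation(die_game))  # prints 0
```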

Moving from heuristics to evidence-based decision making

Define the notion of expected cost (EC) of using a particular distribution policy dp

– The EC of dp is the cost of processing the parallel query using dp, given that in the future we will have actual workloads w1, w2, … with probabilities p1, p2, …

EC(dp) = cost(dp,w1)*p1 + cost(dp,w2)*p2 + … + SwitchCost(current,dp)

– In practice, we only consider two workloads and two distribution policies: the currently used dp and the “optimal” one obtained from monitored workloads

cost(dp,w1) is computed as how much longer it would take dp to finish processing w1 than the optimal distribution policy. See paper for details.

SwitchCost is not present if dp is the currently used distribution policy
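The following sketch evaluates EC for the two-policy case. It rests on assumptions that are not in the talk: a workload is modelled as a list of effective per-node processing rates, completion time is linear in the work assigned, the optimal policy splits work in proportion to those rates, and SwitchCost is a fixed number. The paper's actual cost model may differ.

```python
def completion_time(dp, rates, work):
    """Time for policy dp (fractions of work per node) under the given node rates."""
    return max(share * work / rate for share, rate in zip(dp, rates))


def optimal_dp(rates):
    """Policy that lets all nodes finish together under this simple linear model."""
    total = sum(rates)
    return [r / total for r in rates]


def relative_cost(dp, rates, work):
    """cost(dp, w): how much longer dp takes than the optimal policy for workload w."""
    return completion_time(dp, rates, work) - completion_time(optimal_dp(rates), rates, work)


def expected_cost(dp, current_dp, workloads, work, switch_cost):
    """EC(dp) over (rates, probability) pairs; SwitchCost applies only when dp changes."""
    ec = sum(p * relative_cost(dp, rates, work) for rates, p in workloads)
    if dp != current_dp:
        ec += switch_cost
    return ec


WORK = 3000.0                              # hypothetical tuples still to process
balanced = [100.0, 100.0, 100.0]           # w1: no external load
loaded = [50.0, 100.0, 100.0]              # w2: node 1 slowed by external jobs
workloads = [(balanced, 0.4), (loaded, 0.6)]

current_dp = optimal_dp(balanced)          # uniform distribution
preferred_dp = optimal_dp(loaded)          # "optimal" for the monitored workload

ec_stay = expected_cost(current_dp, current_dp, workloads, WORK, switch_cost=3.0)
ec_switch = expected_cost(preferred_dp, current_dp, workloads, WORK, switch_cost=3.0)
print(ec_stay, ec_switch)
```

With these hypothetical numbers, staying has an expected cost of about 4.8 s and switching about 3.8 s, so adapting pays off. Raising the switch cost above 4 s reverses the decision.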

Probabilistic Delta Algorithm

Initialize current_dp // initially distribute uniformly
TimeToSwitch = False
while (not TimeToSwitch)
    Process next portion of query
    Compute preferred_dp // “ideal” distribution
    ecNoChange = EC_NoChange(current_dp, preferred_dp, count) // does not include SwitchCost
    ecChange = EC_Change(preferred_dp, current_dp, count) // includes SwitchCost
    if ecNoChange >= ecChange
        TimeToSwitch = True
endwhile
current_dp = preferred_dp
Adapt to preferred_dp
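A runnable sketch of the loop might look as follows. Several things here are assumptions: the loop is run over the whole query so that more than one adaptation can happen, `count` is taken to be the number of portions processed so far, and the monitoring and expected-cost functions are passed in as callables because the talk does not define them. The stub functions in the demo are placeholders with made-up behaviour.

```python
def probabilistic_delta(portions, n_nodes, process, compute_preferred_dp,
                        ec_no_change, ec_change, adapt):
    """Process the query portion by portion, adapting when staying costs at least as much."""
    current_dp = [1.0 / n_nodes] * n_nodes          # initially distribute uniformly
    switch_points = []
    for count, portion in enumerate(portions, start=1):
        process(portion, current_dp)
        preferred_dp = compute_preferred_dp()        # "ideal" distribution from monitoring
        stay = ec_no_change(current_dp, preferred_dp, count)   # no SwitchCost
        move = ec_change(preferred_dp, current_dp, count)      # includes SwitchCost
        if stay >= move:
            adapt(preferred_dp)
            current_dp = preferred_dp
            switch_points.append(count)
    return current_dp, switch_points


# Demo with placeholder callables (all behaviour below is hypothetical).
PREFERRED = [0.2, 0.4, 0.4]
adapted_to = []
final_dp, switch_points = probabilistic_delta(
    portions=range(6),
    n_nodes=3,
    process=lambda portion, dp: None,
    compute_preferred_dp=lambda: PREFERRED,
    # The cost of staying grows as evidence accumulates, and is zero once adapted.
    ec_no_change=lambda cur, pref, count: 0.0 if cur == pref else 1.0 * count,
    ec_change=lambda pref, cur, count: 3.0,
    adapt=adapted_to.append,
)
print(final_dp, switch_points)
```

In this demo the cost of not changing reaches the cost of changing on the third portion, so a single adaptation takes place there and none afterwards.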

Computing probabilities of future workloads

1. Let n_c be the number of workloads in the window that are most similar to current_dp

2. Let n_p be the number of workloads in the window that are most similar to preferred_dp

3. Let n_w be the total number of time units in the window.

4. prob(preferred_dp) = n_p / n_w and prob(current_dp) = n_c / n_w

Note: more sophisticated techniques can be used; e.g., weight the workloads based on “proximity” to the current time
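The counting scheme can be sketched as below. The representation is an assumption: each time unit in the window is summarised by the ideal distribution observed in that unit, "most similar" is measured with an L1 distance, and ties are counted for current_dp. None of these choices is stated in the talk.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two distributions."""
    return sum(abs(x - y) for x, y in zip(a, b))


def estimate_probabilities(window, current_dp, preferred_dp):
    """Return (prob(current_dp), prob(preferred_dp)) from a window of observations."""
    n_w = len(window)
    if n_w == 0:
        raise ValueError("window must contain at least one observation")
    n_p = sum(
        1 for observed in window
        if l1_distance(observed, preferred_dp) < l1_distance(observed, current_dp)
    )
    n_c = n_w - n_p                      # ties are counted for current_dp
    return n_c / n_w, n_p / n_w


current_dp = [1 / 3, 1 / 3, 1 / 3]
preferred_dp = [0.2, 0.4, 0.4]
# Hypothetical window of five time units.
window = [
    [0.33, 0.33, 0.34],
    [0.20, 0.40, 0.40],
    [0.22, 0.39, 0.39],
    [0.34, 0.33, 0.33],
    [0.20, 0.40, 0.40],
]
prob_current, prob_preferred = estimate_probabilities(window, current_dp, preferred_dp)
print(prob_current, prob_preferred)
```

Three of the five observations are closer to preferred_dp, giving prob(preferred_dp) = 0.6 and prob(current_dp) = 0.4. A recency-weighted variant would replace the unit counts with weights that decay with age.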

Experiment Setup (Simulator)

• Cost model parameters: drawn from micro benchmarks

• Database from the TPC-H benchmark.

• As the number of nodes grows, the data is assumed to be striped over the available machines.

• All machines are assumed to have the same capabilities, and to be sharing the same network.

• Experiments use Q1: P⋈PS (P has 200,000 tuples, PS has 800,000 tuples).

Same as in: Autonomic Query Parallelization using Non-dedicated Computers: An Evaluation of Adaptivity Options, The Very Large Data Bases Journal, N. W. Paton, J. Buenabad-Chavez, M. Chen, V. Raman, G. Swart, I. Narang, D. M. Yellin, and A.A.A. Fernandez. To VLDB

Experiments

• Periodic imbalance: The load on one or more of the machines comes and goes during the experiment. The level, duration, and repeat duration of the external load are varied.

• Poisson imbalance: The arrival rate of jobs follows a Poisson distribution in which the average number of jobs starting per second varies.

• “Cyclic Poisson” imbalance: Like the Poisson case, except that the average workload is not constant but changes over time in a cyclic fashion (like a sine wave). This is intended to mimic more realistic workloads that change over time.
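The three kinds of external load can be sketched as small generators. This is not the simulator used in the experiments. The assumptions are: "repeat duration" is read as the gap between spikes, Poisson arrivals are produced from exponential inter-arrival times, and the cyclic case uses a sinusoidal rate sampled by thinning. All function and parameter names are introduced here.

```python
import math
import random


def periodic_load(t, level, on_s, off_s):
    """External jobs at time t: `level` jobs for on_s seconds, then none for off_s seconds."""
    return level if (t % (on_s + off_s)) < on_s else 0


def poisson_arrivals(rate, horizon, rng):
    """Job start times in [0, horizon) for a Poisson process with `rate` jobs per second."""
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t >= horizon:
            return arrivals
        arrivals.append(t)


def cyclic_poisson_arrivals(mean_rate, cycle, horizon, rng):
    """Arrivals whose rate follows mean_rate * (1 + sin(2*pi*t/cycle)), sampled by thinning."""
    max_rate = 2.0 * mean_rate
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(max_rate)            # candidate from the bounding process
        if t >= horizon:
            return arrivals
        rate_t = mean_rate * (1.0 + math.sin(2.0 * math.pi * t / cycle))
        if rng.random() < rate_t / max_rate:      # keep with probability rate(t) / max_rate
            arrivals.append(t)


rng = random.Random(42)
steady = poisson_arrivals(rate=2.0, horizon=20.0, rng=rng)
cyclic = cyclic_poisson_arrivals(mean_rate=2.0, cycle=5.0, horizon=20.0, rng=rng)
print(len(steady), len(cyclic))
```

Passing an explicit `random.Random` instance keeps runs repeatable, which helps when comparing adaptation strategies on the same load trace.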

Periodic load imbalance

Parallelism level = 3

Single node affected

Duration and repeat duration of load spike = 1s

Level of imbalance = average number of external jobs introduced

PD (Probabilistic Delta) is more conservative in deciding to adapt

current dp = 0 means an adaptation is taking place

PD adjusts only once to the periodic increased load on node 1

Expected cost of adaptation is greater than expected cost of sticking with the current distribution

For the previous experiment with level of imbalance = 6

Each node starts with 1/3 of the workload, but nodes 2 and 3 gain workload over time

Poisson load imbalance


Single node affected

Duration of load spike = 1s

“Cyclic Poisson” load imbalance

Duration of cycle = 5s

Duration of load spike = 1s

Future work

• Our approach is sensitive to window size. What is the best window size to use?

• Investigate better techniques for computing the probability of future workloads

• Use more than just two alternative distribution policies; e.g., can we infer a trend and use a “predicted distribution policy”?

• Test the algorithm on a real system, not only with a simulator

Conclusions

• We investigated replacing heuristics with a more fundamental approach for determining when to adapt the system

• Initial experiments showed that using the Probabilistic Delta (expected cost) algorithm to determine when to adapt usually improved on existing approaches, sometimes significantly

• The gain of this approach is due to inhibiting specious adaptations while still encouraging necessary adaptations

Backup slides

Adaptive Parallel Queries

A distribution policy describes how we partition work between processors
