Parallelism Analysis, and Work Distribution
BY DANIEL LIVSHEN.
BASED ON CHAPTER 16 "FUTURES, SCHEDULING AND WORK DISTRIBUTION" OF "THE ART OF MULTIPROCESSOR PROGRAMMING" BY MAURICE HERLIHY AND NIR SHAVIT ©
Art of Multiprocessor Programming - Computer Science Seminar 3 - 2014
Content
- Intro and motivation
- Analyzing Parallelism
- Work distribution
Intro
Some applications break down naturally into parallel threads.
- Web Server: creates a thread to handle each request.
- Producer-Consumer: every producer and every consumer can be represented as a thread.
Intro
But we are here to talk about the hard stuff.
We will look at applications that have inherent parallelism, but where it is not obvious how to take advantage of it. Our example will be matrix multiplication.
Recall: if a_{i,j} is the value at position (i, j) of matrix A, then the product C of two n × n matrices A and B is given by

  c_{i,j} = Σ_{k=0}^{n-1} a_{i,k} · b_{k,j}
How To Parallelize?
Intro
First Try
Put one thread in charge of computing each c_{i,j}.
class MMThread {
  double[][] a, b, c;
  int n;
  public MMThread(double[][] myA, double[][] myB) {
    n = myA.length;
    a = myA;
    b = myB;
    c = new double[n][n];
  }
  void multiply() {
    Worker[][] worker = new Worker[n][n];
    // Create one worker per matrix entry, start them all, then join them all.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col] = new Worker(row, col);
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col].start();
    try {
      for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++)
          worker[row][col].join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
  class Worker extends Thread {
    int row, col;
    Worker(int myRow, int myCol) {
      row = myRow; col = myCol;
    }
    public void run() {
      double dotProduct = 0.0;
      for (int i = 0; i < n; i++)
        dotProduct += a[row][i] * b[i][col];
      c[row][col] = dotProduct;
    }
  }
}
Intro
First Try
Put one thread in charge of computing each c_{i,j}.
This might seem like an ideal design, but...
- Poor performance for large matrices (a million threads for 1000x1000 matrices!).
- High memory consumption.
- Many short-lived threads.
Intro
Thread Pool (Second Try)
- A data structure that connects threads to tasks.
- A number of long-lived threads.
- The number of threads can be dynamic or static (fixed).
- Each thread waits until it is assigned a task.
- The thread executes the task and rejoins the pool to await its next assignment.
Benefits:
- Performance improvement due to the use of long-lived threads.
- Platform independence - from small machines to hundreds of cores.
Intro
Thread Pool: Java Terms
- A thread pool is called an executor service (interface java.util.concurrent.ExecutorService). It provides:
  - Submitting a task.
  - Waiting for a set of submitted tasks to complete.
  - Canceling uncompleted tasks.
- Tasks:
  - A task that does not return a result is usually represented as a Runnable object, whose work is performed by a run() method that takes no arguments and returns no result.
  - A task that returns a value of type T is usually represented as a Callable<T> object, where the result is returned by a call() method of type T that takes no arguments.
Intro
Thread Pool: Java Terms (cont.)
- When a Callable<T> object is submitted to an executor service, the service returns an object implementing the Future<T> interface.
- A Future<T> is a promise to deliver the result of an asynchronous computation, when it is ready.
- Future<T> provides:
  - A get() method that returns the result, blocking if necessary until the result is ready.
  - Methods for canceling uncompleted computations.
- Submitting a Runnable task also returns a future, but with a wildcard type parameter (Future<?>).
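As a minimal illustration of these java.util.concurrent terms (a sketch of my own, not code from the book): submitting a Callable<Integer> to an ExecutorService returns a Future<Integer>, and get() blocks until the result is ready.

```java
import java.util.concurrent.*;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(4);
        // A Callable<Integer>: a task that returns a value via call().
        Callable<Integer> task = () -> {
            int sum = 0;
            for (int i = 1; i <= 100; i++) sum += i;
            return sum;
        };
        // submit() returns a Future<Integer> -- a promise for the result.
        Future<Integer> future = exec.submit(task);
        // get() blocks until the asynchronous computation completes.
        System.out.println(future.get()); // prints 5050
        exec.shutdown();
    }
}
```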
Intro
Back to matrix multiplication
Create a Matrix object with get() and set() methods to access matrix elements, along with a split() method that splits an n × n matrix into four (n/2) × (n/2) sub-matrices.

public class Matrix {
  int dim;
  double[][] data;
  int rowDisplace, colDisplace;
  public Matrix(int d) {
    dim = d;
    rowDisplace = colDisplace = 0;
    data = new double[d][d];
  }
  private Matrix(double[][] matrix, int x, int y, int d) {
    data = matrix;
    rowDisplace = x;
    colDisplace = y;
    dim = d;
  }
  public double get(int row, int col) {
    return data[row + rowDisplace][col + colDisplace];
  }
  public void set(int row, int col, double value) {
    data[row + rowDisplace][col + colDisplace] = value;
  }
  public int getDim() {
    return dim;
  }
  Matrix[][] split() { // splits the matrix into 4 sub-matrices
    Matrix[][] result = new Matrix[2][2];
    int newDim = dim / 2;
    result[0][0] = new Matrix(data, rowDisplace, colDisplace, newDim);
    result[0][1] = new Matrix(data, rowDisplace, colDisplace + newDim, newDim);
    result[1][0] = new Matrix(data, rowDisplace + newDim, colDisplace, newDim);
    result[1][1] = new Matrix(data, rowDisplace + newDim, colDisplace + newDim, newDim);
    return result;
  }
}
Intro
Back to matrix multiplication (cont.)
Matrix multiplication C = A · B can be decomposed as follows:
- Split the two matrices into four sub-matrices each.
- Multiply the eight pairs of sub-matrices.
- Compute the four sums of the eight products:

  C11 = A11·B11 + A12·B21    C12 = A11·B12 + A12·B22
  C21 = A21·B11 + A22·B21    C22 = A21·B12 + A22·B22
Intro
Back to matrix multiplication (cont.)
A worked example with 2 × 2 matrices:

  | 1 2 |   | 1 2 |   | 1·1+2·3  1·2+2·4 |   |  7 10 |
  | 3 4 | X | 3 4 | = | 3·1+4·3  3·2+4·4 | = | 15 22 |

(Figure: task creation splits the matrices; parallel multiplication computes the eight products 1, 6, 2, 8, 3, 12, 6, 16; parallel addition computes the four sums 7, 10, 15, 22.)
Intro
Back to matrix multiplication (cont.)
(Code slide: a class that holds the thread pool and the multiplying task; its constructor creates two Matrices to hold the matrix product terms.)
Now we will describe the thread that performs the job.
Intro
Back to matrix multiplication (cont.)
(Code slide:) Split all the matrices. Submit the tasks to compute the eight product terms in parallel. Once they are complete, the thread submits tasks to compute the four sums in parallel and waits for them to complete.
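These steps can be sketched as follows. This is a simplified, hedged illustration of my own (the class and helper names are not the book's, and it operates on plain double[][] arrays rather than the Matrix class): the eight half-size products are submitted as futures, and the four sums are computed once the products complete.

```java
import java.util.concurrent.*;

public class ParallelMatrix {
    static ExecutorService exec = Executors.newCachedThreadPool();

    // Multiply n x n matrices (n a power of 2) via futures.
    static double[][] multiply(double[][] a, double[][] b) throws Exception {
        int n = a.length;
        double[][] c = new double[n][n];
        if (n == 1) { c[0][0] = a[0][0] * b[0][0]; return c; }
        int h = n / 2;
        // Split each matrix into four quadrants.
        double[][][][] A = split(a), B = split(b);
        // Submit the eight half-size products in parallel.
        Future<double[][]>[][][] p = new Future[2][2][2];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++) {
                    final double[][] x = A[i][k], y = B[k][j];
                    p[i][j][k] = exec.submit(() -> multiply(x, y));
                }
        // Once the products complete, compute the four sums.
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                double[][] s1 = p[i][j][0].get(), s2 = p[i][j][1].get();
                for (int r = 0; r < h; r++)
                    for (int col = 0; col < h; col++)
                        c[i*h + r][j*h + col] = s1[r][col] + s2[r][col];
            }
        return c;
    }

    static double[][][][] split(double[][] m) {
        int h = m.length / 2;
        double[][][][] q = new double[2][2][h][h];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int r = 0; r < h; r++)
                    for (int col = 0; col < h; col++)
                        q[i][j][r][col] = m[i*h + r][j*h + col];
        return q;
    }

    public static void main(String[] args) throws Exception {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] c = multiply(a, a);
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
        exec.shutdown();
    }
}
```

A cached (unbounded) pool is used so that a task blocked in get() cannot starve its own subtasks of threads; a fixed-size pool could deadlock here.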
In Conclusion
- Two attempts at the same algorithm.
- The first is inefficient because it is not smart: it just allocates thousands of threads and executes them.
- The second is a lot better: with a good design and fewer threads we achieve better performance.
- Some analysis of the parallelism can help us design better solutions for the same algorithm.
Analyzing Parallelism
Program DAG
- A multithreaded computation can be represented as a DAG (directed acyclic graph).
- Each node represents a task.
- Each directed edge links a predecessor task to a successor task, where the successor depends on the predecessor's result.
- In a node that creates futures we have 2 dependencies: the computation node, and the next successor task in the same node.

Example: the Fibonacci sequence.
Recall: the Fibonacci sequence is defined by the function

  F(n) = F(n-1) + F(n-2),  with F(0) = 0 and F(1) = 1,

where F(n) is the n-th item.
Fibonacci Example
A multithreaded Fibonacci implementation with futures:
(Code slide: a thread pool holds the tasks; every fib(n) call creates two tasks if n > 1: one for fib(n-1) and one for fib(n-2).)
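A minimal sketch of such an implementation (my own illustrative names, not the book's exact code): each call to fib(n) with n > 1 submits two futures and joins on both.

```java
import java.util.concurrent.*;

public class FibDemo {
    static ExecutorService exec = Executors.newCachedThreadPool();

    static long fib(int n) throws Exception {
        if (n <= 1) return n;            // base cases: fib(0)=0, fib(1)=1
        // Two futures: one for fib(n-1), one for fib(n-2).
        Future<Long> left  = exec.submit(() -> fib(n - 1));
        Future<Long> right = exec.submit(() -> fib(n - 2));
        return left.get() + right.get(); // join on both dependencies
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fib(10)); // prints 55
        exec.shutdown();
    }
}
```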
The Fibonacci DAG for fib(4)

  fib(4)
  ├── fib(3)
  │   ├── fib(2)
  │   │   ├── fib(1)
  │   │   └── fib(0)
  │   └── fib(1)
  └── fib(2)
      ├── fib(1)
      └── fib(0)
Analyzing Parallelism
- What is the meaning of the notion "some computations are inherently more parallel than others"?
- We want to give a precise answer to this question.
- Assume that all individual computation steps take the same amount of time.
- Let T_P be the minimum time needed to execute a multithreaded program on a system of P dedicated processors. T_P is the program's latency.
- T_1 = the time needed to execute the program on a single processor = the computation's work.
Analyzing Parallelism
- T_∞ = the time needed to execute the program on an unlimited number of processors = the critical-path length.
- T_1 / T_P = the speedup on P processors.
- The computation's parallelism is the maximum possible speedup, T_1 / T_∞. It is a good estimate of how many processors one should devote to a computation.
Example - Addition
Let A_P(n) be the number of steps (time) needed to add two n × n matrices on P processors.
- Matrix addition requires four half-size matrix additions plus a constant amount of time to split the matrices.
- The work is A_1(n) = 4·A_1(n/2) + Θ(1) = Θ(n²), the same as a doubly nested loop.
- The half-size additions can be done in parallel, so the critical-path length is given by: A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n).
The parallelism for matrix addition is given by: A_1(n) / A_∞(n) = Θ(n² / log n).
For 1000 × 1000 matrices the parallelism is approximately 10⁶ / 10 = 10⁵!
Example - Multiplication
Let M_P(n) be the number of steps (time) needed to multiply two n × n matrices on P processors.
- Matrix multiplication requires 8 half-size matrix multiplications + 4 matrix additions, as we have seen in the intro.
- The work is M_1(n) = 8·M_1(n/2) + 4·A_1(n) = Θ(n³), the same as a triply nested loop.
- The multiplications can be done in parallel, and so can the additions, so the critical-path length is given by: M_∞(n) = M_∞(n/2) + A_∞(n) = M_∞(n/2) + Θ(log n) = Θ(log² n).
The parallelism for matrix multiplication is given by: M_1(n) / M_∞(n) = Θ(n³ / log² n).
For 1000 × 1000 matrices the parallelism is approximately 10⁹ / 10² = 10⁷!
In real life...
- The multithreaded speedup we computed is not realistic; it is a highly idealized upper bound.
- In real life it is not easy to assign idle threads to idle processors.
- In some cases a program that displays less parallelism but consumes less memory may perform better, because it encounters fewer page faults.
But this kind of analysis is a good indication of which problems can be solved in parallel.
Recall: Operating Systems
- Our analysis so far has been based on the assumption that each multithreaded program has dedicated processors. This is not realistic.
- Modern operating systems provide user-level threads that encompass a program counter and a stack.
- The operating system kernel includes a scheduler that runs threads on physical processors.
- The application has no control over the mapping between threads and processors.
- We can describe the gap between user-level threads and operating-system-level processors as a three-level model:
  1. Multithreaded programs - the task level.
  2. User-level scheduler - maps tasks to a fixed number of threads. This level can be controlled by the application: the programmer can optimize it with good work distribution.
  3. Kernel - maps threads to hardware processors.
Work Distribution
Intro
- The key to achieving a good speedup is to keep user-level threads supplied with tasks.
- However, multithreaded computations create and destroy tasks dynamically, sometimes in unpredictable ways.
- We need a work distribution algorithm to assign ready tasks to idle threads as efficiently as possible.
Work Dealing
A simple approach to work distribution: an overloaded thread tries to offload tasks to other, less heavily loaded threads.
(Figure: Thread A, holding a heavy task, offloads work to Thread B.)
What if all threads are overloaded?
Work Stealing
The opposite approach: a thread that runs out of work tries to "steal" work from others.
(Figure: Thread B steals work from Thread A, which holds a heavy task.)
Is the issue fixed?
DEQueue
- A data structure used to implement the work stealing approach.
- DEQueue = Double-Ended Queue.
- Provides the methods pushBottom(), popBottom(), and popTop().
- When a thread creates a new task it calls pushBottom() to push the task onto its DEQueue.
- When a thread needs a task to work on it calls popBottom() to remove a task from its DEQueue.
- If the thread discovers that its queue is empty, it becomes a thief: it chooses a victim thread at random and calls that thread's DEQueue's popTop() to steal a task for itself.
Algorithm Review
(Code slide: the work-stealing thread holds the array of all thread queues, an internal id, and a random number generator. It pops a task from its queue (pool) and runs it; if the pool is empty, it randomly finds a victim to steal a job from. Why at random?)
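The loop just described can be sketched as follows. This is a hedged illustration with my own names: the book's DEQueue is a specialized lock-free structure, which is replaced here by java.util.concurrent.ConcurrentLinkedDeque for brevity (pollLast() plays the role of popBottom(), pollFirst() the role of popTop()).

```java
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkStealingDemo {
    static final int THREADS = 2;
    static ConcurrentLinkedDeque<Runnable>[] queue = new ConcurrentLinkedDeque[THREADS];
    static AtomicInteger done = new AtomicInteger();

    static class WorkStealer extends Thread {
        final int me;                       // internal id
        final Random random = new Random(); // for choosing victims
        WorkStealer(int id) { me = id; }
        public void run() {
            while (done.get() < 10) {
                Runnable task = queue[me].pollLast();    // "popBottom()"
                if (task != null) { task.run(); continue; }
                // Queue empty: become a thief -- pick a random victim
                // and steal from the top of its queue.
                int victim = random.nextInt(THREADS);
                task = queue[victim].pollFirst();        // "popTop()"
                if (task != null) task.run();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < THREADS; i++) queue[i] = new ConcurrentLinkedDeque<>();
        // All 10 tasks start on thread 0's queue; thread 1 must steal.
        for (int i = 0; i < 10; i++)
            queue[0].addLast(done::incrementAndGet);     // "pushBottom()"
        WorkStealer[] w = new WorkStealer[THREADS];
        for (int i = 0; i < THREADS; i++) (w[i] = new WorkStealer(i)).start();
        for (WorkStealer t : w) t.join();
        System.out.println(done.get()); // prints 10
    }
}
```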
Work Balancing
- Another, alternative work distribution approach.
- Periodically, each thread balances its workload with a randomly chosen partner.
What could be a problem?
- Solution: coin flipping!
- We ensure that lightly loaded threads are more likely to initiate rebalancing.
- Each thread periodically flips a biased coin to decide whether to balance with another.
- The thread's probability of balancing is inversely proportional to the number of tasks in the thread's queue:

  P = 1 / |TasksInQueue|

  Fewer tasks -> higher chance of initiating balancing.
Work Balancing (cont.)
- A thread rebalances by selecting a victim uniformly at random.
- If the difference between its workload and the victim's exceeds a predefined threshold, they transfer tasks until their queues contain the same number of tasks.
- The algorithm's benefits:
  - Fairness.
  - The balancing operation moves multiple tasks at each exchange.
  - If one thread has much more work than the others, it is easy to spread its work over all the threads.
- The algorithm's drawbacks?
  - We need a good threshold value for every platform.
Work Balancing Implementation
Our probability function will be P = 1/(s + 1), where s is the size of the thread's queue.
(Code slide: the thread holds a queue of tasks and a random number generator; the best threshold ultimately depends on the OS and platform. The thread always runs; with probability 1/(s + 1) a balance happens: it finds the victim and does the balancing.)
Work Balancing Implementation (2)
(Code slide: the balance method gets 2 queues and calculates the difference between their sizes; if the difference is bigger than the threshold, it moves items from the bigger queue to the smaller one.)
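A minimal sketch of this balancing step (illustrative names and a hypothetical THRESHOLD value of my own, not the book's exact code): given two queues, if their sizes differ by more than the threshold, tasks move from the bigger queue to the smaller until the queues are even.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class BalanceDemo {
    static final int THRESHOLD = 2; // platform-dependent in practice

    static void balance(Queue<Integer> q0, Queue<Integer> q1) {
        Queue<Integer> small = q0.size() <= q1.size() ? q0 : q1;
        Queue<Integer> big   = q0.size() <= q1.size() ? q1 : q0;
        int diff = big.size() - small.size();
        if (diff > THRESHOLD) {
            // Move tasks until the two queues contain the same number
            // of tasks (within one, for an odd total).
            while (big.size() > small.size())
                small.add(big.remove());
        }
    }

    public static void main(String[] args) {
        Queue<Integer> a = new ArrayDeque<>(), b = new ArrayDeque<>();
        for (int i = 0; i < 8; i++) a.add(i);  // a has 8 tasks, b has 0
        balance(a, b);
        System.out.println(a.size() + " " + b.size()); // prints 4 4
    }
}
```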
Conclusion
Parallel Programming:
- How to implement multithreaded programs with thread pools.
- How to analyze the parallelism of an algorithm with precise tools.
- How to improve thread scheduling at the user level.
- Different approaches to work distribution.
Thank You!