Parallelism Analysis, and Work Distribution

By Daniel Livshen. Based on Chapter 16, “Futures, Scheduling and Work Distribution”, of “The Art of Multiprocessor Programming” by Maurice Herlihy and Nir Shavit.


Page 1: Title

Parallelism Analysis, and Work Distribution. By Daniel Livshen. Based on Chapter 16, “Futures, Scheduling and Work Distribution”, of “The Art of Multiprocessor Programming” by Maurice Herlihy and Nir Shavit.

Page 2: Content

- Intro and motivation
- Analyzing parallelism
- Work distribution

Page 3: Intro

Some applications break down naturally into parallel threads.

Web server: creates a thread to handle each request.

Page 4: Intro (cont.)

Some applications break down naturally into parallel threads.

Producer-consumer: every producer and every consumer can be represented as a thread.

Page 5: Intro (cont.)

But we are here to talk about the hard stuff.

We will look at applications that have inherent parallelism, but where it is not obvious how to take advantage of it. Our example will be matrix multiplication.

Recall: if $a_{i,j}$ is the value at position $(i, j)$ of matrix $A$, then the product $C$ of two $n \times n$ matrices $A$ and $B$ is given by

$$c_{i,j} = \sum_{k=0}^{n-1} a_{i,k} \, b_{k,j}$$

Page 6: How To Parallelize?

Page 7: Intro - First Try

Put one thread in charge of computing each $c_{i,j}$.

    class MMThread {
      double[][] a, b, c;
      int n;
      public MMThread(double[][] myA, double[][] myB) {
        n = myA.length;
        a = myA;
        b = myB;
        c = new double[n][n];
      }
      void multiply() throws InterruptedException {
        Worker[][] worker = new Worker[n][n];
        // One Worker thread per entry of c.
        for (int row = 0; row < n; row++)
          for (int col = 0; col < n; col++)
            worker[row][col] = new Worker(row, col);
        for (int row = 0; row < n; row++)
          for (int col = 0; col < n; col++)
            worker[row][col].start();
        for (int row = 0; row < n; row++)
          for (int col = 0; col < n; col++)
            worker[row][col].join();   // join() can throw InterruptedException
      }
      class Worker extends Thread {
        int row, col;
        Worker(int myRow, int myCol) {
          row = myRow; col = myCol;
        }
        public void run() {
          // Each worker computes one dot product of a row of a and a column of b.
          double dotProduct = 0.0;
          for (int i = 0; i < n; i++)
            dotProduct += a[row][i] * b[i][col];
          c[row][col] = dotProduct;
        }
      }
    }
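A hypothetical driver for this class, to show how it would be used (the class name MMDemo and the diagonal test data are illustrative, not from the slides):

    public class MMDemo {
      public static void main(String[] args) throws InterruptedException {
        int n = 4;
        double[][] a = new double[n][n], b = new double[n][n];
        for (int i = 0; i < n; i++) { a[i][i] = 1.0; b[i][i] = 2.0; }  // a = I, b = 2I
        MMThread mm = new MMThread(a, b);
        mm.multiply();                   // spawns n*n Worker threads and joins them
        System.out.println(mm.c[0][0]);  // prints 2.0, since I * 2I = 2I
      }
    }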

Page 8: Intro - First Try (cont.)

Put one thread in charge of computing each $c_{i,j}$.

This might seem like an ideal design, but:

- Poor performance for large matrices (a million threads for 1000x1000 matrices!).
- High memory consumption.
- Many short-lived threads.

Page 9: Intro - Thread Pool (Second Try)

- A data structure that connects threads to tasks.
- A number of long-lived threads.
- The number of threads can be dynamic or static (fixed).
- Each thread waits until it is assigned a task.
- The thread executes the task and rejoins the pool to await its next assignment.

Benefits:

- Performance improvement due to the use of long-lived threads.
- Platform independence: from small machines to hundreds of cores.

Page 10: Intro - Thread Pool, Java Terms

- In Java, a thread pool is called an executor service (interface java.util.concurrent.ExecutorService). It provides the ability to:
  - Submit a task.
  - Wait for a set of submitted tasks to complete.
  - Cancel uncompleted tasks.
- Tasks:
  - A task that does not return a result is usually represented as a Runnable object, whose work is performed by a run() method that takes no arguments and returns no result.
  - A task that returns a value of type T is usually represented as a Callable<T> object, whose result is returned by a call() method that takes no arguments.

Page 11: Intro - Thread Pool, Java Terms (cont.)

- When a Callable<T> object is submitted to an executor service, the service returns an object implementing the Future<T> interface.
- A Future<T> is a promise to deliver the result of an asynchronous computation, when it is ready.
- Future<T> provides:
  - A get() method that returns the result, blocking if necessary until the result is ready.
  - Methods for canceling uncompleted computations.
- A Runnable task also returns a future, but with the wildcard type parameter (Future<?>).

A usage sketch follows.
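To make the Callable/Future workflow concrete, here is a minimal sketch using only the standard java.util.concurrent API; the task bodies and the pool size are arbitrary choices for illustration:

    import java.util.concurrent.*;

    public class FutureDemo {
      public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(4);
        // Submitting a Callable<Integer> returns a Future<Integer>.
        Future<Integer> future = exec.submit(() -> 6 * 7);
        // get() blocks until the asynchronous computation completes.
        System.out.println(future.get());   // prints 42
        // Submitting a Runnable returns a Future<?> that carries no result value,
        // but is still useful for waiting and for cancellation.
        Future<?> done = exec.submit(() -> System.out.println("side effect only"));
        done.get();
        exec.shutdown();
      }
    }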

Page 12: Intro - Back to Matrix Multiplication

Create a Matrix class with get() and set() methods to access matrix elements, along with a split() method that splits an n x n matrix into four (n/2) x (n/2) sub-matrices:

    public class Matrix {
      int dim;
      double[][] data;
      int rowDisplace, colDisplace;
      public Matrix(int d) {
        dim = d;
        rowDisplace = colDisplace = 0;
        data = new double[d][d];
      }
      private Matrix(double[][] matrix, int x, int y, int d) {
        data = matrix;
        rowDisplace = x;
        colDisplace = y;
        dim = d;
      }
      public double get(int row, int col) {
        return data[row + rowDisplace][col + colDisplace];
      }
      public void set(int row, int col, double value) {
        data[row + rowDisplace][col + colDisplace] = value;
      }
      public int getDim() {
        return dim;
      }
      // Splits the matrix into four half-size sub-matrices. The sub-matrices
      // share the backing array; only the row/column displacements differ.
      Matrix[][] split() {
        Matrix[][] result = new Matrix[2][2];
        int newDim = dim / 2;
        result[0][0] = new Matrix(data, rowDisplace, colDisplace, newDim);
        result[0][1] = new Matrix(data, rowDisplace, colDisplace + newDim, newDim);
        result[1][0] = new Matrix(data, rowDisplace + newDim, colDisplace, newDim);
        result[1][1] = new Matrix(data, rowDisplace + newDim, colDisplace + newDim, newDim);
        return result;
      }
    }

Page 13: Intro - Back to Matrix Multiplication (cont.)

Matrix multiplication C = A * B can be decomposed as follows (see the block equation below):

- Split the two matrices into four sub-matrices each.
- Multiply the eight sub-matrices.
- Compute the four sums of the eight products.
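Written out in standard block-matrix notation (reconstructing the slide's figure):

$$
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}
=
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
\end{pmatrix}
$$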

Page 14: Intro - Back to Matrix Multiplication (cont.)

Worked example with 2 x 2 matrices:

    | 1 2 |   | 1 2 |   | 1*1 + 2*3   1*2 + 2*4 |   |  7 10 |
    | 3 4 | x | 3 4 | = | 3*1 + 4*3   3*2 + 4*4 | = | 15 22 |

The slide annotates the three phases: task creation, parallel multiplication of the sub-products, and parallel addition of the partial sums.

Page 15: Intro - Back to Matrix Multiplication (cont.)

The multiplying task is a class that holds the thread pool; its constructor creates two Matrices to hold the matrix product terms.

Now we will describe the task that performs the job.

Page 16: Intro - Back to Matrix Multiplication (cont.)

The task's run() method:

- Splits all the matrices.
- Submits the tasks to compute the eight product terms in parallel.
- Once they are complete, submits tasks to compute the four sums in parallel and waits for them to complete.

A sketch of such tasks appears below.
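The transcript preserves only the captions of this code figure, so the following is a minimal sketch of what the tasks might look like; the names MulTask and AddTask and the shared exec pool are assumptions for illustration, not the book's verbatim listing. It assumes the matrix dimension is a power of two, as the Matrix.split() design implies.

    import java.util.concurrent.*;

    class MulTask implements Runnable {
      // A cached pool is used because tasks block in get() waiting for children;
      // with a small fixed pool this recursive pattern could deadlock.
      static ExecutorService exec = Executors.newCachedThreadPool();
      Matrix a, b, c, lhs, rhs;        // computes c = a * b
      MulTask(Matrix a, Matrix b, Matrix c) {
        this.a = a; this.b = b; this.c = c;
        lhs = new Matrix(a.getDim()); // holds four of the eight product terms
        rhs = new Matrix(a.getDim()); // holds the other four
      }
      public void run() {
        try {
          if (a.getDim() == 1) {
            c.set(0, 0, a.get(0, 0) * b.get(0, 0));   // base case: scalar product
            return;
          }
          // Split all the matrices.
          Matrix[][] aa = a.split(), bb = b.split(),
                     ll = lhs.split(), rr = rhs.split();
          // Submit the tasks to compute the eight product terms in parallel.
          Future<?>[][][] f = new Future<?>[2][2][2];
          for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
              f[i][j][0] = exec.submit(new MulTask(aa[i][0], bb[0][j], ll[i][j]));
              f[i][j][1] = exec.submit(new MulTask(aa[i][1], bb[1][j], rr[i][j]));
            }
          for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
              for (int k = 0; k < 2; k++)
                f[i][j][k].get();                     // wait for the products
          // Submit the sums (c = lhs + rhs) and wait for them to complete.
          exec.submit(new AddTask(lhs, rhs, c)).get();
        } catch (Exception e) { e.printStackTrace(); }
      }
    }

    class AddTask implements Runnable {
      Matrix a, b, c;                  // computes c = a + b
      AddTask(Matrix a, Matrix b, Matrix c) { this.a = a; this.b = b; this.c = c; }
      public void run() {
        try {
          if (a.getDim() == 1) {
            c.set(0, 0, a.get(0, 0) + b.get(0, 0));
            return;
          }
          Matrix[][] aa = a.split(), bb = b.split(), cc = c.split();
          Future<?>[][] f = new Future<?>[2][2];
          for (int i = 0; i < 2; i++)  // the four half-size sums, in parallel
            for (int j = 0; j < 2; j++)
              f[i][j] = MulTask.exec.submit(new AddTask(aa[i][j], bb[i][j], cc[i][j]));
          for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
              f[i][j].get();
        } catch (Exception e) { e.printStackTrace(); }
      }
    }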

Page 17: In Conclusion

- Two tries at the same algorithm.
- The first is inefficient because it is not smart: it just allocates thousands of short-lived threads and executes them.
- The second is much better: with a good design and fewer threads we achieve better performance.
- Some analysis of the parallelism can help us design better solutions for the same algorithm.

Page 18: Analyzing Parallelism

Page 19: Program DAG

- A multithreaded computation can be represented as a DAG.
- Each node represents a task.
- Each directed edge links a predecessor task to a successor task, where the successor depends on the predecessor's result.
- A node that creates a future has two outgoing dependencies: the spawned computation node, and the next successor task in the same node.

Example: the Fibonacci sequence.

Recall: the Fibonacci sequence is defined by $F(n) = F(n-1) + F(n-2)$, with $F(0) = 0$ and $F(1) = 1$, where $F(n)$ is the $n$-th item.

Page 20: Fibonacci Example

A multithreaded Fibonacci implementation with futures:

- A thread pool holds the tasks.
- Every task for fib(n) creates two sub-tasks if n >= 2: one for fib(n-1) and one for fib(n-2), as in the sketch below.
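A minimal sketch of such a futures-based task (the class name FibTask and the shared exec pool are assumptions; a cached pool avoids deadlock when parent tasks block on their children):

    import java.util.concurrent.*;

    class FibTask implements Callable<Integer> {
      static ExecutorService exec = Executors.newCachedThreadPool();
      int arg;
      FibTask(int n) { arg = n; }
      public Integer call() throws Exception {
        if (arg < 2) return arg;                         // fib(0) = 0, fib(1) = 1
        // Spawn the two sub-computations as futures...
        Future<Integer> left  = exec.submit(new FibTask(arg - 1));
        Future<Integer> right = exec.submit(new FibTask(arg - 2));
        // ...and join on both results.
        return left.get() + right.get();
      }
    }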

Page 21: The Fibonacci DAG for fib(4)

fib(4) spawns fib(3) and fib(2);
fib(3) spawns fib(2) and fib(1);
each fib(2) spawns fib(1) and fib(0).

Page 22: Analyzing Parallelism

- What is the meaning of the notion “some computations are inherently more parallel than others”? We want to give a precise answer to this question.
- Assume that all individual computation steps take the same amount of time.
- Let $T_P$ be the minimum time needed to execute a multithreaded program on a system of P dedicated processors; $T_P$ is the program's latency.
- $T_1$ = the time needed to execute the program on a single processor = the computation's work.

Page 23: Analyzing Parallelism (cont.)

- $T_\infty$ = the time needed to execute the program on an unlimited number of processors = the critical-path length.
- $T_1 / T_P$ = the speedup on P processors.
- The computation's parallelism is the maximum possible speedup, $T_1 / T_\infty$. It is a good estimate of how many processors one should devote to a computation.
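Restating these definitions as formulas (standard notation from the chapter):

$$\text{speedup on } P \text{ processors} = \frac{T_1}{T_P}, \qquad \text{parallelism} = \frac{T_1}{T_\infty}$$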

Page 24: Example - Addition

Let $A_P(n)$ be the number of steps (time) needed to add two $n \times n$ matrices on P processors.

- Matrix addition requires four half-size matrix additions, plus a constant amount of time to split the matrices.
- The work, as for the doubly nested loop, is $A_1(n) = 4\,A_1(n/2) + \Theta(1) = \Theta(n^2)$.
- The four half-size additions can be done by threads in parallel, so the critical-path length is $A_\infty(n) = A_\infty(n/2) + \Theta(1) = \Theta(\log n)$.

The parallelism for matrix addition is therefore $A_1(n) / A_\infty(n) = \Theta(n^2 / \log n)$.

For 1000x1000 matrices the parallelism is approximately $10^6 / 10 = 10^5$!

Page 25: Example - Multiplication

Let $M_P(n)$ be the number of steps (time) needed to multiply two $n \times n$ matrices on P processors.

- Matrix multiplication requires 8 half-size matrix multiplications plus 4 matrix additions, as we have seen in the intro.
- The work, as for the triply nested loop, is $M_1(n) = 8\,M_1(n/2) + 4\,A_1(n) = \Theta(n^3)$.
- The multiplications can be done in parallel, and so can the additions, so the critical-path length is $M_\infty(n) = M_\infty(n/2) + \Theta(\log n) = \Theta(\log^2 n)$.

The parallelism for matrix multiplication is therefore $M_1(n) / M_\infty(n) = \Theta(n^3 / \log^2 n)$.

For 1000x1000 matrices the parallelism is approximately $10^9 / 10^2 = 10^7$!

Page 26: In real life…

- The multithreaded speedup we computed is not realistic; it is a highly idealized upper bound.
- In real life it is not easy to assign idle threads to idle processors.
- In some cases a program that displays less parallelism but consumes less memory may perform better, because it encounters fewer page faults.

Still, this kind of analysis is a good indication of which problems can be solved in parallel.

Page 27: Recall - Operating Systems

- Our analysis so far has assumed that each multithreaded program has P dedicated processors. This is not realistic.
- Modern operating systems provide user-level threads that encompass a program counter and a stack.
- The operating system kernel includes a scheduler that runs threads on physical processors.
- The application has no control over the mapping between threads and processors.

We can describe the gap between user-level threads and operating-system-level processors as a three-level model:

1. Multithreaded programs (task level).
2. A user-level scheduler that maps tasks to a fixed number of threads. This level can be controlled by the application: the programmer can optimize it with good work distribution.
3. The kernel, which maps threads to hardware processors.

Page 28: Work Distribution

Page 29: Intro

- The key to achieving a good speedup is to keep user-level threads supplied with tasks.
- However, multithreaded computations create and destroy tasks dynamically, sometimes in unpredictable ways.
- We need a work distribution algorithm to assign ready tasks to idle threads as efficiently as possible.

Page 30: Work Dealing

A simple approach to work distribution: an overloaded thread tries to offload tasks to other, less heavily loaded threads.

(Figure: Thread A, holding a heavy task, offloads work to Thread B.)

But what if all threads are overloaded?

Page 31: Work Stealing

The opposite approach: a thread that runs out of work tries to “steal” work from others.

(Figure: Thread B steals work from Thread A, which holds a heavy task.)

Does this fix the issue?

Page 32: DEQueue

- A data structure used to implement the work stealing approach.
- DEQueue = double-ended queue.
- It provides the methods pushBottom, popBottom, and popTop:
  - When a thread creates a new task, it calls pushBottom to push the task onto its DEQueue.
  - When a thread needs a task to work on, it calls popBottom to remove a task from its DEQueue.
  - If the thread discovers that its queue is empty, it becomes a thief: it chooses a victim thread at random and calls that thread's DEQueue's popTop to steal a task for itself.

A minimal stand-in interface follows.
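For the sketches that follow, assume this minimal interface capturing the three methods (the book's real DEQueue is a carefully engineered concurrent structure; this is only a stand-in):

    interface DEQueue {
      void pushBottom(Runnable task);  // owner pushes new work onto the bottom
      Runnable popBottom();            // owner pops its own work; null if empty
      Runnable popTop();               // a thief steals from the top; null if empty
    }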

Page 33: Algorithm Review

The work-stealing thread:

- Holds the array of all thread queues, its internal id, and a random number generator.
- Pops a task from its own queue (pool) and runs it.
- If its pool is empty, it randomly picks a victim to steal a job from. Why at random? Randomization spreads the thieves' stealing attempts across victims instead of piling contention onto one queue.

A sketch of this loop follows.
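The slide's code figure is lost, so here is a minimal sketch of the loop the captions describe (the class name WorkStealingThread is an assumption; DEQueue is the stand-in interface from the previous page):

    import java.util.Random;

    public class WorkStealingThread {
      DEQueue[] queue;   // all threads' queues
      int me;            // this thread's index into queue[]
      Random random = new Random();

      WorkStealingThread(DEQueue[] queue, int me) {
        this.queue = queue;
        this.me = me;
      }
      public void run() {
        Runnable task = queue[me].popBottom();
        while (true) {
          while (task != null) {              // drain the local queue
            task.run();
            task = queue[me].popBottom();
          }
          while (task == null) {              // queue empty: become a thief
            Thread.yield();                   // give victims a chance to make progress
            int victim = random.nextInt(queue.length);
            if (victim != me)
              task = queue[victim].popTop();  // steal from the top of the victim's queue
          }
        }
      }
    }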

Page 34: Work Balancing

- Another, alternative work distribution approach: periodically, each thread balances its workload with a randomly chosen partner.
- What could be a problem?
- Solution: coin flipping! We ensure that lightly loaded threads are more likely to initiate rebalancing.
- Each thread periodically flips a biased coin to decide whether to balance with another.
- The thread's probability of balancing is inversely proportional to the number of tasks in the thread's queue:

$$P = \frac{1}{|\text{TasksInQueue}|}$$

Fewer tasks -> a higher chance of initiating a balance.

Page 35: Work Balancing (cont.)

- A thread rebalances by selecting a victim uniformly at random.
- If the difference between its workload and the victim's exceeds a predefined threshold, they transfer tasks until their queues contain the same number of tasks.
- The algorithm's benefits:
  - Fairness.
  - The balancing operation moves multiple tasks at each exchange.
  - If one thread has much more work than the others, it is easy to spread its work over all threads.
- The algorithm's drawbacks?
  - A good threshold value is needed for every platform.

Page 36: Work Balancing Implementation

Our probability function will be P = 1/(size + 1), where size is the size of the thread's queue (the +1 also handles an empty queue).

- The thread holds its queue of tasks and a random number generator.
- The best threshold eventually depends on the OS and platform.
- The main loop always runs; on each iteration, a balance happens with probability 1/(size + 1).
- On a balance, the thread finds the victim and balances the two queues (see the sketch below).
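The code listing did not survive the transcript, so here is a minimal sketch under the assumptions above (the class name WorkSharingThread and the use of java.util.Deque are illustrative choices; balance() is sketched on the next page):

    import java.util.Deque;
    import java.util.Random;

    public class WorkSharingThread {
      Deque<Runnable>[] queue;   // one task queue per thread
      int me;                    // this thread's index
      Random random = new Random();

      WorkSharingThread(Deque<Runnable>[] queue, int me) {
        this.queue = queue;
        this.me = me;
      }
      public void run() {
        while (true) {                                  // always runs
          Runnable task = queue[me].pollFirst();        // null if the queue is empty
          if (task != null) task.run();
          int size = queue[me].size();
          // Flip the biased coin: balance with probability 1/(size + 1),
          // so lightly loaded threads initiate rebalancing more often.
          if (random.nextInt(size + 1) == size) {
            int victim = random.nextInt(queue.length);  // chosen uniformly
            // Lock the two queues in a fixed order to avoid deadlock.
            int min = Math.min(victim, me), max = Math.max(victim, me);
            synchronized (queue[min]) {
              synchronized (queue[max]) {
                balance(queue[min], queue[max]);
              }
            }
          }
        }
      }
    }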

Page 37: Work Balancing Implementation (2)

The balance method:

- Gets two queues.
- Calculates the difference between the sizes of the queues.
- If the difference is bigger than the threshold, it moves items from the bigger queue to the smaller one (continued in the sketch below).
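Continuing the sketch from the previous page, the balance() method of WorkSharingThread might look like this (THRESHOLD is an assumed value; as the slides note, the best value depends on the OS and platform):

    // continuing class WorkSharingThread
    private static final int THRESHOLD = 16;  // platform-dependent tuning knob

    private void balance(Deque<Runnable> q0, Deque<Runnable> q1) {
      Deque<Runnable> qMin = (q0.size() < q1.size()) ? q0 : q1;
      Deque<Runnable> qMax = (qMin == q0) ? q1 : q0;
      int diff = qMax.size() - qMin.size();
      if (diff > THRESHOLD)
        while (qMax.size() > qMin.size())     // equalize the two queues
          qMin.addLast(qMax.pollFirst());     // move one task at a time
    }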

Page 38: Conclusion

What we covered on the road to parallel programming:

- How to implement multithreaded programs with thread pools.
- How to analyze the parallelism of an algorithm with precise tools.
- How to improve thread scheduling at the user level.
- Different approaches to work distribution.

Page 39: Thank You!