Parallelism Analysis, and Work Distribution
BY DANIEL LIVSHEN.
BASED ON CHAPTER 16 "FUTURES, SCHEDULING AND WORK DISTRIBUTION" OF "THE ART OF MULTIPROCESSOR PROGRAMMING" BY MAURICE HERLIHY AND NIR SHAVIT ©
Art of Multiprocessor Programming - Computer Science Seminar 3 - 2014
Content
- Intro and motivation
- Analyzing Parallelism
- Work distribution
Intro
Some applications break down naturally into parallel threads.
- Web Server: creates a thread to handle each request.
- Producer-Consumer: every producer and every consumer can be represented as a thread.
Intro
But we are here to talk about the hard stuff.
We will look at applications that have inherent parallelism, but where it is not obvious how to take advantage of it. Our example will be matrix multiplication.
Recall: if a_{i,j} is the value at position (i, j) of matrix A, then the product C of two n × n matrices A and B is given by

  c_{i,j} = Σ_{k=0}^{n-1} a_{i,k} · b_{k,j}
How To Parallelize?
Intro
First Try
Put one thread in charge of computing each c_{i,j}.
class MMThread {
  double[][] a, b, c;
  int n;
  public MMThread(double[][] myA, double[][] myB) {
    n = myA.length;
    a = myA;
    b = myB;
    c = new double[n][n];
  }
  void multiply() {
    Worker[][] worker = new Worker[n][n];
    // Create one worker per matrix entry, start them all, then join them all.
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col] = new Worker(row, col);
    for (int row = 0; row < n; row++)
      for (int col = 0; col < n; col++)
        worker[row][col].start();
    try {
      for (int row = 0; row < n; row++)
        for (int col = 0; col < n; col++)
          worker[row][col].join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
  class Worker extends Thread {
    int row, col;
    Worker(int myRow, int myCol) {
      row = myRow; col = myCol;
    }
    public void run() {
      double dotProduct = 0.0;
      for (int i = 0; i < n; i++)
        dotProduct += a[row][i] * b[i][col];
      c[row][col] = dotProduct;
    }
  }
}
Intro
First Try
Put one thread in charge of computing each c_{i,j}.
This might seem like an ideal design, but...
- Poor performance for large matrices (a million threads for 1000x1000 matrices!).
- High memory consumption.
- Many short-lived threads.
Intro
Thread Pool (Second Try)
- A data structure that connects threads to tasks.
- A number of long-lived threads.
- The number of threads can be dynamic or static (fixed).
- Each thread waits until it is assigned a task.
- The thread executes the task and rejoins the pool to await its next assignment.
Benefits:
- Performance improvement due to the use of long-lived threads.
- Platform independence - from small machines to hundreds of cores.
Intro
Thread Pool: Java Terms
- A thread pool is called an executor service (interface java.util.concurrent.ExecutorService). It provides:
  - Submitting a task.
  - Waiting for a set of submitted tasks to complete.
  - Canceling uncompleted tasks.
- Tasks:
  - A task that does not return a result is usually represented as a Runnable object, whose work is performed by a run() method that takes no arguments and returns no result.
  - A task that returns a value of type T is usually represented as a Callable<T> object, where the result is returned by a call() method of type T that takes no arguments.
Intro
Thread Pool: Java Terms (cont.)
- When a Callable<T> object is submitted to an executor service, the service returns an object implementing the Future<T> interface.
- A Future<T> is a promise to deliver the result of an asynchronous computation, when it is ready.
- Future<T> provides:
  - A get() method that returns the result, blocking if necessary until the result is ready.
  - Methods for canceling uncompleted computations.
- Submitting a Runnable task also returns a future, but with a wildcard type parameter (Future<?>).
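As a minimal illustration of these java.util.concurrent terms (a sketch of my own, not code from the book): submitting a Callable<Integer> to an ExecutorService returns a Future<Integer>, and get() blocks until the result is ready.

```java
import java.util.concurrent.*;

public class FutureDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(4);
        // A Callable<Integer>: a task that returns a value via call().
        Callable<Integer> task = () -> {
            int sum = 0;
            for (int i = 1; i <= 100; i++) sum += i;
            return sum;
        };
        // submit() returns a Future<Integer> -- a promise for the result.
        Future<Integer> future = exec.submit(task);
        // get() blocks until the asynchronous computation completes.
        System.out.println(future.get()); // prints 5050
        exec.shutdown();
    }
}
```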
Intro
Back to matrix multiplication
Create a Matrix object with get() and set() methods to access matrix elements, along with a split() method that splits an n × n matrix into four (n/2) × (n/2) sub-matrices.

public class Matrix {
  int dim;
  double[][] data;
  int rowDisplace, colDisplace;
  public Matrix(int d) {
    dim = d;
    rowDisplace = colDisplace = 0;
    data = new double[d][d];
  }
  private Matrix(double[][] matrix, int x, int y, int d) {
    data = matrix;
    rowDisplace = x;
    colDisplace = y;
    dim = d;
  }
  public double get(int row, int col) {
    return data[row + rowDisplace][col + colDisplace];
  }
  public void set(int row, int col, double value) {
    data[row + rowDisplace][col + colDisplace] = value;
  }
  public int getDim() {
    return dim;
  }
  Matrix[][] split() { // splits the matrix into 4 sub-matrices
    Matrix[][] result = new Matrix[2][2];
    int newDim = dim / 2;
    result[0][0] = new Matrix(data, rowDisplace, colDisplace, newDim);
    result[0][1] = new Matrix(data, rowDisplace, colDisplace + newDim, newDim);
    result[1][0] = new Matrix(data, rowDisplace + newDim, colDisplace, newDim);
    result[1][1] = new Matrix(data, rowDisplace + newDim, colDisplace + newDim, newDim);
    return result;
  }
}
Intro
Back to matrix multiplication (cont.)
Matrix multiplication C = A · B can be decomposed as follows:
- Split the two matrices into four sub-matrices each.
- Multiply the eight pairs of sub-matrices.
- Compute the four sums of the eight products:

  C11 = A11·B11 + A12·B21    C12 = A11·B12 + A12·B22
  C21 = A21·B11 + A22·B21    C22 = A21·B12 + A22·B22
Intro
Back to matrix multiplication (cont.)
A worked example with 2 × 2 matrices:

  | 1 2 |   | 1 2 |   | 1·1+2·3  1·2+2·4 |   |  7 10 |
  | 3 4 | X | 3 4 | = | 3·1+4·3  3·2+4·4 | = | 15 22 |

(Figure: task creation splits the matrices; parallel multiplication computes the eight products 1, 6, 2, 8, 3, 12, 6, 16; parallel addition computes the four sums 7, 10, 15, 22.)
Intro
Back to matrix multiplication (cont.)
(Code slide: a class that holds the thread pool and the multiplying task; its constructor creates two Matrices to hold the matrix product terms.)
Now we will describe the thread that performs the job.
Intro
Back to matrix multiplication (cont.)
(Code slide:) Split all the matrices. Submit the tasks to compute the eight product terms in parallel. Once they are complete, the thread submits tasks to compute the four sums in parallel and waits for them to complete.
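These steps can be sketched as follows. This is a simplified, hedged illustration of my own (the class and helper names are not the book's, and it operates on plain double[][] arrays rather than the Matrix class): the eight half-size products are submitted as futures, and the four sums are computed once the products complete.

```java
import java.util.concurrent.*;

public class ParallelMatrix {
    static ExecutorService exec = Executors.newCachedThreadPool();

    // Multiply n x n matrices (n a power of 2) via futures.
    static double[][] multiply(double[][] a, double[][] b) throws Exception {
        int n = a.length;
        double[][] c = new double[n][n];
        if (n == 1) { c[0][0] = a[0][0] * b[0][0]; return c; }
        int h = n / 2;
        // Split each matrix into four quadrants.
        double[][][][] A = split(a), B = split(b);
        // Submit the eight half-size products in parallel.
        Future<double[][]>[][][] p = new Future[2][2][2];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int k = 0; k < 2; k++) {
                    final double[][] x = A[i][k], y = B[k][j];
                    p[i][j][k] = exec.submit(() -> multiply(x, y));
                }
        // Once the products complete, compute the four sums.
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                double[][] s1 = p[i][j][0].get(), s2 = p[i][j][1].get();
                for (int r = 0; r < h; r++)
                    for (int col = 0; col < h; col++)
                        c[i*h + r][j*h + col] = s1[r][col] + s2[r][col];
            }
        return c;
    }

    static double[][][][] split(double[][] m) {
        int h = m.length / 2;
        double[][][][] q = new double[2][2][h][h];
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                for (int r = 0; r < h; r++)
                    for (int col = 0; col < h; col++)
                        q[i][j][r][col] = m[i*h + r][j*h + col];
        return q;
    }

    public static void main(String[] args) throws Exception {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] c = multiply(a, a);
        System.out.println(c[0][0] + " " + c[0][1] + " " + c[1][0] + " " + c[1][1]);
        exec.shutdown();
    }
}
```

A cached (unbounded) pool is used so that a task blocked in get() cannot starve its own subtasks of threads; a fixed-size pool could deadlock here.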
In Conclusion
- Two attempts at the same algorithm.
- The first is inefficient because it is not smart: it just allocates thousands of threads and executes them.
- The second is a lot better: with a good design and fewer threads we achieve better performance.
- Some analysis of the parallelism can help us design better solutions for the same algorithm.
Analyzing Parallelism
Program DAG
- A multithreaded computation can be represented as a DAG (directed acyclic graph).
- Each node represents a task.
- Each directed edge links a predecessor task to a successor task, where the successor depends on the predecessor's result.
- In a node that creates futures we have 2 dependencies: the computation node, and the next successor task in the same node.

Example: the Fibonacci sequence.
Recall: the Fibonacci sequence is defined by the function

  F(n) = F(n-1) + F(n-2),  with F(0) = 0 and F(1) = 1,

where F(n) is the n-th item.
Fibonacci Example
A multithreaded Fibonacci implementation with futures:
(Code slide: a thread pool holds the tasks; every fib(n) call creates two tasks if n > 1: one for fib(n-1) and one for fib(n-2).)
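A minimal sketch of such an implementation (my own illustrative names, not the book's exact code): each call to fib(n) with n > 1 submits two futures and joins on both.

```java
import java.util.concurrent.*;

public class FibDemo {
    static ExecutorService exec = Executors.newCachedThreadPool();

    static long fib(int n) throws Exception {
        if (n <= 1) return n;            // base cases: fib(0)=0, fib(1)=1
        // Two futures: one for fib(n-1), one for fib(n-2).
        Future<Long> left  = exec.submit(() -> fib(n - 1));
        Future<Long> right = exec.submit(() -> fib(n - 2));
        return left.get() + right.get(); // join on both dependencies
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fib(10)); // prints 55
        exec.shutdown();
    }
}
```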
The Fibonacci DAG for fib(4)

  fib(4)
  ├── fib(3)
  │   ├── fib(2)
  │   │   ├── fib(1)
  │   │   └── fib(0)
  │   └── fib(1)
  └── fib(2)
      ├── fib(1)
      └── fib(0)
Analyzing Parallelism
- What is the meaning of the notion "some computations are inherently more parallel than others"?
- We want to give a precise answer to this question.
- Assume that all individual computation steps take the same amount of time.
- Let T_P be the minimum time needed to execute a multithreaded program on a system of P dedicated processors. T_P is the program's latency.
- T_1 = the time needed to execute the program on a single processor = the computation's work.
Analyzing Parallelism
- T_∞ = the time needed to execute the program on an unlimited number of processors = the critical-path length.
- T_1 / T_P = the speedup on P processors.
- The computation's parallelism is the maximum possible speedup, T_1 / T_∞. It is a good estimate of how many processors one should devote to a computation.
Example - Addition
Let A_P(n) be the number of steps (time) needed to add two n × n matrices on P processors.
- Matrix addition requires four half-size matrix additions plus a constant amount of time to split the matrices.
- The work is A_1(n) = 4·A_1(n/2) + Θ(1) = Θ(n²), the same as a doubly nested loop.
- The half-size additions can be done in parallel, so the critical-path length is given by: A_∞(n) = A_∞(n/2) + Θ(1) = Θ(log n).
The parallelism for matrix addition is given by: A_1(n) / A_∞(n) = Θ(n² / log n).
For 1000 × 1000 matrices the parallelism is approximately 10⁶ / 10 = 10⁵!
Example - Multiplication
Let M_P(n) be the number of steps (time) needed to multiply two n × n matrices on P processors.
- Matrix multiplication requires 8 half-size matrix multiplications + 4 matrix additions, as we have seen in the intro.
- The work is M_1(n) = 8·M_1(n/2) + 4·A_1(n) = Θ(n³), the same as a triply nested loop.
- The multiplications can be done in parallel, and so can the additions, so the critical-path length is given by: M_∞(n) = M_∞(n/2) + A_∞(n) = M_∞(n/2) + Θ(log n) = Θ(log² n).
The parallelism for matrix multiplication is given by: M_1(n) / M_∞(n) = Θ(n³ / log² n).
For 1000 × 1000 matrices the parallelism is approximately 10⁹ / 10² = 10⁷!
In real life...
- The multithreaded speedup we computed is not realistic; it is a highly idealized upper bound.
- In real life it is not easy to assign idle threads to idle processors.
- In some cases a program that displays less parallelism but consumes less memory may perform better, because it encounters fewer page faults.
But this kind of analysis is a good indication of which problems can be solved in parallel.
Recall: Operating Systems
- Our analysis so far has been based on the assumption that each multithreaded program has dedicated processors. This is not realistic.
- Modern operating systems provide user-level threads that encompass a program counter and a stack.
- The operating system kernel includes a scheduler that runs threads on physical processors.
- The application has no control over the mapping between threads and processors.
- We can describe the gap between user-level threads and operating-system-level processors as a three-level model:
  1. Multithreaded programs - the task level.
  2. User-level scheduler - maps tasks to a fixed number of threads. This level can be controlled by the application: the programmer can optimize it with good work distribution.
  3. Kernel - maps threads to hardware processors.
Work Distribution
Intro
- The key to achieving a good speedup is to keep user-level threads supplied with tasks.
- However, multithreaded computations create and destroy tasks dynamically, sometimes in unpredictable ways.
- We need a work distribution algorithm to assign ready tasks to idle threads as efficiently as possible.
Work Dealing
A simple approach to work distribution: an overloaded thread tries to offload tasks to other, less heavily loaded threads.
(Figure: Thread A, holding a heavy task, offloads work to Thread B.)
What if all threads are overloaded?
Work Stealing
The opposite approach: a thread that runs out of work tries to "steal" work from others.
(Figure: Thread B steals work from Thread A, which holds a heavy task.)
Is the issue fixed?
DEQueue
- A data structure used to implement the work stealing approach.
- DEQueue = Double-Ended Queue.
- Provides the methods pushBottom(), popBottom(), and popTop().
- When a thread creates a new task it calls pushBottom() to push the task onto its DEQueue.
- When a thread needs a task to work on it calls popBottom() to remove a task from its DEQueue.
- If the thread discovers that its queue is empty, it becomes a thief: it chooses a victim thread at random and calls that thread's DEQueue's popTop() to steal a task for itself.
Algorithm Review
(Code slide: the work-stealing thread holds the array of all thread queues, an internal id, and a random number generator. It pops a task from its queue (pool) and runs it; if the pool is empty, it randomly finds a victim to steal a job from. Why at random?)
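The loop just described can be sketched as follows. This is a hedged illustration with my own names: the book's DEQueue is a specialized lock-free structure, which is replaced here by java.util.concurrent.ConcurrentLinkedDeque for brevity (pollLast() plays the role of popBottom(), pollFirst() the role of popTop()).

```java
import java.util.Random;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkStealingDemo {
    static final int THREADS = 2;
    static ConcurrentLinkedDeque<Runnable>[] queue = new ConcurrentLinkedDeque[THREADS];
    static AtomicInteger done = new AtomicInteger();

    static class WorkStealer extends Thread {
        final int me;                       // internal id
        final Random random = new Random(); // for choosing victims
        WorkStealer(int id) { me = id; }
        public void run() {
            while (done.get() < 10) {
                Runnable task = queue[me].pollLast();    // "popBottom()"
                if (task != null) { task.run(); continue; }
                // Queue empty: become a thief -- pick a random victim
                // and steal from the top of its queue.
                int victim = random.nextInt(THREADS);
                task = queue[victim].pollFirst();        // "popTop()"
                if (task != null) task.run();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < THREADS; i++) queue[i] = new ConcurrentLinkedDeque<>();
        // All 10 tasks start on thread 0's queue; thread 1 must steal.
        for (int i = 0; i < 10; i++)
            queue[0].addLast(done::incrementAndGet);     // "pushBottom()"
        WorkStealer[] w = new WorkStealer[THREADS];
        for (int i = 0; i < THREADS; i++) (w[i] = new WorkStealer(i)).start();
        for (WorkStealer t : w) t.join();
        System.out.println(done.get()); // prints 10
    }
}
```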
Work Balancing
- Another, alternative work distribution approach.
- Periodically, each thread balances its workload with a randomly chosen partner.
What could be a problem?
- Solution: coin flipping!
- We ensure that lightly loaded threads are more likely to initiate rebalancing.
- Each thread periodically flips a biased coin to decide whether to balance with another.
- The thread's probability of balancing is inversely proportional to the number of tasks in the thread's queue:

  P = 1 / |TasksInQueue|

  Fewer tasks -> higher chance of initiating balancing.
Work Balancing (cont.)
- A thread rebalances by selecting a victim uniformly at random.
- If the difference between its workload and the victim's exceeds a predefined threshold, they transfer tasks until their queues contain the same number of tasks.
- The algorithm's benefits:
  - Fairness.
  - The balancing operation moves multiple tasks at each exchange.
  - If one thread has much more work than the others, it is easy to spread its work over all the threads.
- The algorithm's drawbacks?
  - We need a good threshold value for every platform.
Work Balancing Implementation
Our probability function will be P = 1/(s + 1), where s is the size of the thread's queue.
(Code slide: the thread holds a queue of tasks and a random number generator; the best threshold ultimately depends on the OS and platform. The thread always runs; with probability 1/(s + 1) a balance happens: it finds the victim and does the balancing.)
Work Balancing Implementation (2)
(Code slide: the balance method gets 2 queues and calculates the difference between their sizes; if the difference is bigger than the threshold, it moves items from the bigger queue to the smaller one.)
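A minimal sketch of this balancing step (illustrative names and a hypothetical THRESHOLD value of my own, not the book's exact code): given two queues, if their sizes differ by more than the threshold, tasks move from the bigger queue to the smaller until the queues are even.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class BalanceDemo {
    static final int THRESHOLD = 2; // platform-dependent in practice

    static void balance(Queue<Integer> q0, Queue<Integer> q1) {
        Queue<Integer> small = q0.size() <= q1.size() ? q0 : q1;
        Queue<Integer> big   = q0.size() <= q1.size() ? q1 : q0;
        int diff = big.size() - small.size();
        if (diff > THRESHOLD) {
            // Move tasks until the two queues contain the same number
            // of tasks (within one, for an odd total).
            while (big.size() > small.size())
                small.add(big.remove());
        }
    }

    public static void main(String[] args) {
        Queue<Integer> a = new ArrayDeque<>(), b = new ArrayDeque<>();
        for (int i = 0; i < 8; i++) a.add(i);  // a has 8 tasks, b has 0
        balance(a, b);
        System.out.println(a.size() + " " + b.size()); // prints 4 4
    }
}
```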
Conclusion
Parallel Programming:
- How to implement multithreaded programs with thread pools.
- How to analyze the parallelism of an algorithm with precise tools.
- How to improve thread scheduling at the user level.
- Different approaches to work distribution.
Thank You!