Avah Banerjee - stellar.cct.lsu.edu/...multithreading/...05.17.19.pdf
TRANSCRIPT
Introduction
I In a parallel computing system tasks can run concurrently.
I It has two main components:
I 1) Logical model of the concurrency: a.k.a. parallel algorithms
I 2) Concurrency platform: software system used to carry out the computation
I It is possible to have the concurrency management built into an algorithm, but there are obvious drawbacks in such a system.
Example: Fibonacci
Figure: A program to compute the nth Fibonacci number using recursion.
∗Some figures in this presentation are taken from Cormen, T. H., Leiserson, C.
E., Rivest, R. L., & Stein, C. (2009). Introduction to algorithms. MIT press.
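The figure itself is not reproduced in this transcript; a minimal Python sketch of the serial recursive program it describes (base cases assumed to follow the usual convention Fib(0) = 0, Fib(1) = 1):

```python
def fib(n):
    # Recursively compute the nth Fibonacci number.
    if n <= 1:
        return n
    x = fib(n - 1)
    y = fib(n - 2)
    return x + y

print(fib(10))  # 55
```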
Example: Fibonacci
Figure: Logical dependency between the instructions.
Parallel operators
I We can highlight this logical dependency in an algorithm using certain parallel operators (keywords).
I Two basic operators are spawn, sync.
I We are also going to use a third operator parallel for.
Figure: A parallel version of Fib(n) with spawn and sync
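A rough Python emulation of the spawn/sync version from the figure, using one OS thread per spawn purely to illustrate the logic (a real concurrency platform would use lightweight tasks instead):

```python
import threading

def p_fib(n):
    if n <= 1:
        return n
    result = {}
    def child():
        result["x"] = p_fib(n - 1)  # the spawned (child) task
    t = threading.Thread(target=child)
    t.start()         # spawn: the child may run in parallel with the parent
    y = p_fib(n - 2)  # the parent continues concurrently
    t.join()          # sync: wait for the spawned task to complete
    return result["x"] + y

print(p_fib(10))  # 55
```

Note that spawn/sync only express permission to run in parallel; the same code is correct under any serial execution order as well.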
Parallel operators
I spawn and sync are logical constructs.
I spawn does not itself create a parallel task; it indicates that the spawned (child) task may be executed in parallel with its parent.
I sync enforces the logical constraint that execution may proceed only after all of the corresponding spawned tasks have completed.
I If a spawned task itself spawns one or more subtasks then we have nested parallelism.
I The mechanism by which these logical constructs are implemented is often referred to as the concurrency platform.
Introduction
I The dependency graph is a logical abstraction of the concurrency in the computation.
I It is a weighted DAG. The weights represent the execution times of the tasks at the out-nodes.
I Its vertices are tasks, which can represent one or more atomic instructions, but cannot contain any parallel constructs or returnable procedure calls.
I The tasks along with the dependency graph give rise to a poset.
I Linear extensions of this poset represent valid serial execution schedules for the tasks.
I More about scheduling later.
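A linear extension (i.e., a valid serial schedule) can be obtained by topologically sorting the dependency DAG; a minimal sketch on a hypothetical five-task graph (the graph below is illustrative, not from the slides):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each task maps to the tasks it depends on.
deps = {2: {1}, 3: {1}, 4: {2, 3}, 5: {4}}

# static_order() yields the tasks in an order where every
# predecessor appears before its dependents: a linear extension.
order = list(TopologicalSorter(deps).static_order())
print(order)  # one linear extension, e.g. [1, 2, 3, 4, 5]
```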
Task Dependency Graph
Figure: Tasks are circles and dependencies are given by edges.
Concurrency Platform
I Manages the scheduling, communication, execution, etc. of tasks.
I In the case of distributed systems it manages the message passing interface.
I For multi-threaded systems it manages thread creation, allocation of threads to processors, thread deletion, etc. This is also known as "dynamic multithreading".
I The concurrency platform allows for decoupling of the logical parallelism from its physical execution.
I Examples: OpenMP, MPI, HPX, pThreads, etc.
Basic Concepts
I Work: If a parallel computation were to be run on a single processor, the total execution time is known as the work of the computation.
I Span: Length of the longest path in the dependency graph.
I Tp: execution time with p processors.
I T∞: execution time with an unlimited number of processors (= span).
I T1: serial runtime; this is also the work.
Basic Concepts
I work law: Tp ≥ T1/p.
I span law: Tp ≥ T∞.
I speedup: s = T1/Tp.
I parallelism: par = T1/T∞
I slackness: par/p.
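Work and span can be computed directly from a weighted task DAG: work is the sum of the task times, span is the weight of the longest path. A sketch on a hypothetical four-task graph:

```python
from functools import lru_cache

# Hypothetical weighted DAG: task -> (execution time, set of predecessors).
tasks = {
    "a": (1, set()),
    "b": (2, {"a"}),
    "c": (3, {"a"}),
    "d": (1, {"b", "c"}),
}

work = sum(t for t, _ in tasks.values())  # T1: total time on one processor

@lru_cache(maxsize=None)
def finish(v):
    # Longest weighted path ending at v (earliest possible finish time of v).
    t, preds = tasks[v]
    return t + max((finish(u) for u in preds), default=0)

span = max(finish(v) for v in tasks)  # T∞: longest path in the DAG

# The work and span laws say any p-processor schedule takes
# at least max(work / p, span) time.
print(work, span)  # 7 5
```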
Greedy scheduling
I Assign as many tasks as possible to processors at any given time.
I This online scheduler is 2-competitive.
I In a complete step all processors are assigned some task.
I Otherwise a step is incomplete.
Theorem. If the work and span are T1 and T∞ respectively, then Tp ≤ T1/p + T∞ for the greedy scheduler.
Greedy scheduling
Assume w.l.o.g. each task takes unit time.
Theorem. If the work and span are T1 and T∞ respectively, then Tp ≤ T1/p + T∞ for the greedy scheduler.
Proof.
I We can divide the computation into complete and incomplete steps.
I # of incomplete steps ≤ T∞, since at each incomplete step every task with in-degree 0 in the remaining DAG is executed, which reduces the span of the remaining DAG by one.
I # of complete steps ≤ T1/p; otherwise the total work done during the complete steps alone would exceed T1.
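The bound can be checked by simulating a greedy scheduler on unit-time tasks; a sketch on a hypothetical diamond-shaped DAG:

```python
def greedy_schedule(deps, p):
    # deps: task -> set of predecessors; unit-time tasks, p processors.
    # Each step, greedily run up to p tasks whose predecessors are all done.
    done, steps = set(), 0
    while len(done) < len(deps):
        ready = [v for v in deps if v not in done and deps[v] <= done]
        for v in ready[:p]:   # assign as many tasks as possible
            done.add(v)
        steps += 1
    return steps

# Hypothetical DAG: 1 -> {2, 3} -> 4 -> 5
deps = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {4}}
t1 = len(deps)        # work: 5 unit tasks
t_inf = 4             # span: longest path 1 -> 2 -> 4 -> 5
tp = greedy_schedule(deps, p=2)
print(tp)             # 4
assert tp <= t1 / 2 + t_inf   # Tp <= T1/p + T∞
```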
Greedy scheduling
Theorem. The greedy scheduler is 2-competitive.
Proof.
I Tp^opt ≥ max(T∞, T1/p)
I Tp ≤ T1/p + T∞ ≤ 2 max(T1/p, T∞) ≤ 2 Tp^opt
Analysis of multi-threaded algorithms
Figure: Composition rules for the purpose of analyzing the execution time of multiple tasks.
Example: Fibonacci
I Work: T1(n) = T1(n − 1) + T1(n − 2) + O(1)
I T1(n) = O(φ(n)), where φ(n) is the nth Fibonacci number.
I Span: T∞(n) = max(T∞(n − 1), T∞(n − 2)) + O(1) = O(n)
I Parallelism: O(φ(n)/n)
Parallel Loops
I Iterations run concurrently.
I We can use spawn and sync to create parallel loops.
I For simplicity we use a direct constructor: parallel for.
Figure: Parallel matrix-vector multiplication using parallel for.
Parallel Loops
Figure: Parallel matrix-vector multiplication using spawn and sync.
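The divide-and-conquer loop from the figure can be rendered roughly in Python as follows (0-indexed, one OS thread per spawn, purely illustrative; the name MatVecMainLoop follows the figure):

```python
import threading

def mat_vec_main_loop(A, x, y, n, i, ip):
    # Recursively halve the row range [i, ip], spawning one half.
    if i == ip:
        for j in range(n):              # serial inner loop for row i
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + ip) // 2
        t = threading.Thread(target=mat_vec_main_loop,
                             args=(A, x, y, n, i, mid))
        t.start()                       # spawn: lower half of the range
        mat_vec_main_loop(A, x, y, n, mid + 1, ip)
        t.join()                        # sync

def mat_vec(A, x):
    n = len(A)
    y = [0] * n
    mat_vec_main_loop(A, x, y, n, 0, n - 1)
    return y

print(mat_vec([[1, 2], [3, 4]], [1, 1]))  # [3, 7]
```

The recursion tree over the row range is what gives the parallel for its O(log n) span overhead.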
Parallel Loops
Figure: Task dependency graph for MatVecMainLoop(A, x, y, 8, 1, 8).
Analysis of Parallel Loops
Consider the following pseudocode:
1. parallel for i1 = 1 to n1
2. parallel for i2 = 1 to n2
3. ...
4. parallel for ik = 1 to nk
5. Task(i1, ..., ik , other arguments)
I Each parallel for creates a task dependency graph with span O(log ni).
I These spans are additive (why?), since the k nested loops are entered in sequence along any path.
I Additionally, the task executed during iteration (i1, ..., ik) has a task dependency graph of its own.
I Let S∞(i1, ..., ik) be the span of Task(i1, ..., ik, other arguments).
I Total span: T∞ = O(k log n) + max over all iterations (i1, ..., ik) of S∞(i1, ..., ik)
I What are the work and span for the Mat-Vec algorithm?
Analysis of Mat-Vec(n)
I T1(n) = Θ(n²).
I T∞(n) = Θ(log n) + Θ(n) (the spans of lines 4 and 5).
I T∞(n) = Θ(n); thus parallelism = Θ(n).
I Can we achieve Θ(n²/log n) parallelism for matrix-vector multiplication with Θ(n²) work?
Race Conditions
Figure: The above program could either print 1 or 2.
Race Conditions
1. x = 0
2. x = x + 1
3. x = x + 1
4. PRINT x
Figure: Task dependency and logical dependency in the Race-Example.
If the scheduler chooses an execution order uniformly at random, what is the probability that the output is correct?
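If each x = x + 1 is really a load followed by a store, and the scheduler picks uniformly among the interleavings that respect each increment's load-before-store order, the probability can be enumerated directly (a sketch under that assumption):

```python
from itertools import permutations

def run(schedule):
    # Execute one interleaving of the two non-atomic increments of x.
    x, regs = 0, {}
    for op, tid in schedule:
        if op == "load":
            regs[tid] = x          # read x into thread tid's register
        else:
            x = regs[tid] + 1      # store the incremented register
    return x

ops = [("load", 1), ("store", 1), ("load", 2), ("store", 2)]
# Keep only orders where each thread's load precedes its store.
schedules = [s for s in set(permutations(ops))
             if s.index(("load", 1)) < s.index(("store", 1))
             and s.index(("load", 2)) < s.index(("store", 2))]
correct = sum(run(s) == 2 for s in schedules)
print(len(schedules), correct)  # 6 2
```

So only 2 of the 6 interleavings print the correct value 2.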
Parallel Matrix Multiplication
span: T∞(n) = T∞(n/2) + O(log n), which gives T∞(n) = O(log² n).
Parallel Matrix Multiplication
Can we eliminate the use of the temporary matrix T by increasing the span to O(n)?
Parallel QuickSort
I We use QuickSort as our candidate algorithm.
Figure: A parallel version of QuickSort using spawn.
I W.l.o.g. assume Partition divides the array A evenly (say q is the median of A).
I Then T1(n) = 2T1(n/2) + O(n), giving T1(n) = O(n log n).
I T∞(n) = T∞(n/2) + O(n), giving T∞(n) = O(n).
I Parallelism = O(log n) (modest).
I Where is the bottleneck?
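A minimal Python sketch of the spawned QuickSort (serial Lomuto partition, one thread per spawn, purely illustrative):

```python
import threading

def partition(a, lo, hi):
    # Serial Lomuto partition around pivot a[hi]; this Theta(n) step
    # stays on the critical path, which is why T_inf(n) = T_inf(n/2) + O(n).
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def p_quicksort(a, lo, hi):
    if lo < hi:
        q = partition(a, lo, hi)
        t = threading.Thread(target=p_quicksort, args=(a, lo, q - 1))
        t.start()                    # spawn: sort the left part in parallel
        p_quicksort(a, q + 1, hi)    # the parent sorts the right part
        t.join()                     # sync

data = [5, 3, 8, 1, 9, 2]
p_quicksort(data, 0, len(data) - 1)
print(data)  # [1, 2, 3, 5, 8, 9]
```

The serial Partition is the bottleneck the last bullet points at: each level of recursion pays its full O(n) cost before any spawning can happen.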