Avah Banerjee - stellar.cct.lsu.edu/...multithreading/...05.17.19.pdf
TRANSCRIPT
Introduction
I In a parallel computing system tasks can run concurrently.
I It has two main components:
I 1) Logical model of the concurrency: a.k.a. parallel algorithms
I 2) Concurrency platform: software system used to carry out the computation
I It is possible to have the concurrency management built into an algorithm, but there are obvious drawbacks in such a system.
Example: Fibonacci
Figure: A program to compute the nth Fibonacci number using recursion.
∗Some figures in this presentation are taken from Cormen, T. H., Leiserson, C.
E., Rivest, R. L., & Stein, C. (2009). Introduction to algorithms. MIT press.
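The figure itself is not reproduced in this transcript; a minimal Python sketch of the serial recursive program it describes (base cases assumed to follow the usual convention Fib(0) = 0, Fib(1) = 1):

```python
def fib(n):
    # Recursively compute the nth Fibonacci number.
    if n <= 1:
        return n
    x = fib(n - 1)
    y = fib(n - 2)
    return x + y

print(fib(10))  # 55
```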
Example: Fibonacci
Figure: Logical dependency between the instructions.
Parallel operators
I We can highlight this logical dependency in an algorithm using certain parallel operators (keywords).
I Two basic operators are spawn, sync.
I We are also going to use a third operator parallel for.
Figure: A parallel version of Fib(n) with spawn and sync
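A rough Python emulation of the spawn/sync version from the figure, using one OS thread per spawn purely to illustrate the logic (a real concurrency platform would use lightweight tasks instead):

```python
import threading

def p_fib(n):
    if n <= 1:
        return n
    result = {}
    def child():
        result["x"] = p_fib(n - 1)  # the spawned (child) task
    t = threading.Thread(target=child)
    t.start()         # spawn: the child may run in parallel with the parent
    y = p_fib(n - 2)  # the parent continues concurrently
    t.join()          # sync: wait for the spawned task to complete
    return result["x"] + y

print(p_fib(10))  # 55
```

Note that spawn/sync only express permission to run in parallel; the same code is correct under any serial execution order as well.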
Parallel operators
I spawn and sync are logical constructs.
I spawn does not itself create a parallel task; it indicates that the spawned (child) task may be executed in parallel with its parent.
I sync enforces the logical constraint that execution may proceed only after all of the corresponding spawned tasks have completed.
I If a spawned task itself spawns one or more subtasks then we have nested parallelism.
I The mechanism by which these logical constructs are implemented is often referred to as the concurrency platform.
Introduction
I The dependency graph is a logical abstraction of the concurrency in the computation.
I It is a weighted DAG. The weights represent the execution times of the tasks at the out-nodes.
I Its vertices are tasks, which can represent one or more atomic instructions, but cannot contain any parallel constructs or returnable procedure calls.
I The tasks along with the dependency graph give rise to a poset.
I Linear extensions of this poset represent valid serial execution schedules for the tasks.
I More about scheduling later.
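A linear extension (i.e., a valid serial schedule) can be obtained by topologically sorting the dependency DAG; a minimal sketch on a hypothetical five-task graph (the graph below is illustrative, not from the slides):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each task maps to the tasks it depends on.
deps = {2: {1}, 3: {1}, 4: {2, 3}, 5: {4}}

# static_order() yields the tasks in an order where every
# predecessor appears before its dependents: a linear extension.
order = list(TopologicalSorter(deps).static_order())
print(order)  # one linear extension, e.g. [1, 2, 3, 4, 5]
```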
Task Dependency Graph
Figure: Tasks are circles and dependencies are given by edges.
Concurrency Platform
I Manages the scheduling, communication, execution, etc. of tasks.
I In the case of distributed systems it manages the message passing interface.
I For multi-threaded systems it manages thread creation, allocation of threads to processors, thread deletion, etc. This is also known as "dynamic multithreading".
I The concurrency platform allows for decoupling of the logical parallelism from its physical execution.
I Examples: OpenMP, MPI, HPX, pThreads, etc.
Basic Concepts
I Work: If a parallel computation were to be run on a single processor, the total execution time is known as the work of the computation.
I Span: Length of the longest path in the dependency graph.
I Tp: execution time with p processors.
I T∞: execution time with an unlimited number of processors (= span).
I T1: serial runtime; this is also the work.
Basic Concepts
I work law: Tp ≥ T1/p.
I span law: Tp ≥ T∞.
I speedup: s = T1/Tp.
I parallelism: par = T1/T∞
I slackness: par/p.
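Work and span can be computed directly from a weighted task DAG: work is the sum of the task times, span is the weight of the longest path. A sketch on a hypothetical four-task graph:

```python
from functools import lru_cache

# Hypothetical weighted DAG: task -> (execution time, set of predecessors).
tasks = {
    "a": (1, set()),
    "b": (2, {"a"}),
    "c": (3, {"a"}),
    "d": (1, {"b", "c"}),
}

work = sum(t for t, _ in tasks.values())  # T1: total time on one processor

@lru_cache(maxsize=None)
def finish(v):
    # Longest weighted path ending at v (earliest possible finish time of v).
    t, preds = tasks[v]
    return t + max((finish(u) for u in preds), default=0)

span = max(finish(v) for v in tasks)  # T∞: longest path in the DAG

# The work and span laws say any p-processor schedule takes
# at least max(work / p, span) time.
print(work, span)  # 7 5
```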
Greedy scheduling
I Assign as many tasks as possible to processors at any given time.
I This online scheduler is 2-competitive.
I In a complete step all processors are assigned some task.
I Otherwise a step is incomplete.
Theorem. If the work and span are T1 and T∞ respectively, then Tp ≤ T1/p + T∞ for the greedy scheduler.
Greedy scheduling
Assume w.l.o.g. each task takes unit time.
Theorem. If the work and span are T1 and T∞ respectively, then Tp ≤ T1/p + T∞ for the greedy scheduler.
Proof.
I We can divide the computation into complete and incomplete steps.
I # of incomplete steps ≤ T∞, since at each incomplete step every task with in-degree 0 in the remaining DAG is executed, which reduces the span of the remaining DAG by one.
I # of complete steps ≤ T1/p; otherwise the total work done during the complete steps alone would exceed T1.
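The bound can be checked by simulating a greedy scheduler on unit-time tasks; a sketch on a hypothetical diamond-shaped DAG:

```python
def greedy_schedule(deps, p):
    # deps: task -> set of predecessors; unit-time tasks, p processors.
    # Each step, greedily run up to p tasks whose predecessors are all done.
    done, steps = set(), 0
    while len(done) < len(deps):
        ready = [v for v in deps if v not in done and deps[v] <= done]
        for v in ready[:p]:   # assign as many tasks as possible
            done.add(v)
        steps += 1
    return steps

# Hypothetical DAG: 1 -> {2, 3} -> 4 -> 5
deps = {1: set(), 2: {1}, 3: {1}, 4: {2, 3}, 5: {4}}
t1 = len(deps)        # work: 5 unit tasks
t_inf = 4             # span: longest path 1 -> 2 -> 4 -> 5
tp = greedy_schedule(deps, p=2)
print(tp)             # 4
assert tp <= t1 / 2 + t_inf   # Tp <= T1/p + T∞
```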
Greedy scheduling
Theorem. The greedy scheduler is 2-competitive.
Proof.
I Tp^opt ≥ max(T∞, T1/p)
I Tp ≤ T1/p + T∞ ≤ 2 max(T1/p, T∞) ≤ 2 Tp^opt
Analysis of multi-threaded algorithms
Figure: Composition rules for the purpose of analyzing the execution time of multiple tasks.
Example: Fibonacci
I Work: T1(n) = T1(n − 1) + T1(n − 2) + O(1)
I T1(n) = O(φ(n)), where φ(n) is the nth Fibonacci number.
I Span: T∞(n) = max(T∞(n − 1), T∞(n − 2)) + O(1) = O(n)
I Parallelism: O(φ(n)/n)
Parallel Loops
I Iterations run concurrently.
I We can use spawn and sync to create parallel loops.
I For simplicity we use a direct constructor: parallel for.
Figure: Parallel matrix-vector multiplication using parallel for.
Parallel Loops
Figure: Parallel matrix-vector multiplication using spawn and sync.
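The divide-and-conquer loop from the figure can be rendered roughly in Python as follows (0-indexed, one OS thread per spawn, purely illustrative; the name MatVecMainLoop follows the figure):

```python
import threading

def mat_vec_main_loop(A, x, y, n, i, ip):
    # Recursively halve the row range [i, ip], spawning one half.
    if i == ip:
        for j in range(n):              # serial inner loop for row i
            y[i] += A[i][j] * x[j]
    else:
        mid = (i + ip) // 2
        t = threading.Thread(target=mat_vec_main_loop,
                             args=(A, x, y, n, i, mid))
        t.start()                       # spawn: lower half of the range
        mat_vec_main_loop(A, x, y, n, mid + 1, ip)
        t.join()                        # sync

def mat_vec(A, x):
    n = len(A)
    y = [0] * n
    mat_vec_main_loop(A, x, y, n, 0, n - 1)
    return y

print(mat_vec([[1, 2], [3, 4]], [1, 1]))  # [3, 7]
```

The recursion tree over the row range is what gives the parallel for its O(log n) span overhead.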
Parallel Loops
Figure: Task dependency graph for MatVecMainLoop(A, x, y, 8, 1, 8).
Analysis of Parallel Loops
Consider the following pseudocode:
1. parallel for i1 = 1 to n1
2. parallel for i2 = 1 to n2
3. ...
4. parallel for ik = 1 to nk
5. Task(i1, ..., ik , other arguments)
I Each parallel for creates a task dependency graph with span O(log ni).
I These spans are additive (why?), since the k nested loops are entered in sequence along any path.
I Additionally, the task executed during iteration (i1, ..., ik) has a task dependency graph of its own.
I Let S∞(i1, ..., ik) be the span of Task(i1, ..., ik, other arguments).
I Total span: T∞ = O(k log n) + max over all iterations (i1, ..., ik) of S∞(i1, ..., ik)
I What are the work and span for the Mat-Vec algorithm?
Analysis of Mat-Vec(n)
I T1(n) = Θ(n²).
I T∞(n) = Θ(log n) + Θ(n) (the spans of lines 4 and 5).
I T∞(n) = Θ(n); thus parallelism = Θ(n).
I Can we achieve Θ(n²/log n) parallelism for matrix-vector multiplication with Θ(n²) work?
Race Conditions
Figure: The above program could either print 1 or 2.
Race Conditions
1. x = 0
2. x = x + 1
3. x = x + 1
4. PRINT x
Figure: Task dependency and logical dependency in the Race-Example.
If the scheduler chooses an execution order uniformly at random, what is the probability that the output is correct?
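If each x = x + 1 is really a load followed by a store, and the scheduler picks uniformly among the interleavings that respect each increment's load-before-store order, the probability can be enumerated directly (a sketch under that assumption):

```python
from itertools import permutations

def run(schedule):
    # Execute one interleaving of the two non-atomic increments of x.
    x, regs = 0, {}
    for op, tid in schedule:
        if op == "load":
            regs[tid] = x          # read x into thread tid's register
        else:
            x = regs[tid] + 1      # store the incremented register
    return x

ops = [("load", 1), ("store", 1), ("load", 2), ("store", 2)]
# Keep only orders where each thread's load precedes its store.
schedules = [s for s in set(permutations(ops))
             if s.index(("load", 1)) < s.index(("store", 1))
             and s.index(("load", 2)) < s.index(("store", 2))]
correct = sum(run(s) == 2 for s in schedules)
print(len(schedules), correct)  # 6 2
```

So only 2 of the 6 interleavings print the correct value 2.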
Parallel Matrix Multiplication
span: T∞(n) = T∞(n/2) + O(log n), which gives T∞(n) = O(log² n).
Parallel Matrix Multiplication
Can we eliminate the use of the temporary matrix T by increasing the span to O(n)?
Parallel QuickSort
I We use QuickSort as our candidate algorithm.
Figure: A parallel version of QuickSort using spawn.
I W.l.o.g. assume Partition divides the array A evenly (say q is the median of A).
I Then T1(n) = 2T1(n/2) + O(n), giving T1(n) = O(n log n).
I T∞(n) = T∞(n/2) + O(n), giving T∞(n) = O(n).
I Parallelism = O(log n) (modest).
I Where is the bottleneck?
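A minimal Python sketch of the spawned QuickSort (serial Lomuto partition, one thread per spawn, purely illustrative):

```python
import threading

def partition(a, lo, hi):
    # Serial Lomuto partition around pivot a[hi]; this Theta(n) step
    # stays on the critical path, which is why T_inf(n) = T_inf(n/2) + O(n).
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] <= pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def p_quicksort(a, lo, hi):
    if lo < hi:
        q = partition(a, lo, hi)
        t = threading.Thread(target=p_quicksort, args=(a, lo, q - 1))
        t.start()                    # spawn: sort the left part in parallel
        p_quicksort(a, q + 1, hi)    # the parent sorts the right part
        t.join()                     # sync

data = [5, 3, 8, 1, 9, 2]
p_quicksort(data, 0, len(data) - 1)
print(data)  # [1, 2, 3, 5, 8, 9]
```

The serial Partition is the bottleneck the last bullet points at: each level of recursion pays its full O(n) cost before any spawning can happen.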