CUDA Lecture 3: Parallel Architectures and Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


TRANSCRIPT

Page 1: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

CUDA Lecture 3: Parallel Architectures and Performance Analysis

Page 2: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Conventional Von Neumann architecture consists of a processor executing a program stored in a (main) memory:

Each main memory location is identified by its address. Addresses start at zero and extend to 2^n – 1 when there are n bits (binary digits) in the address.

Parallel Architectures and Performance Analysis – Slide 2

Topic 1: Parallel Architectures

Page 3: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel computer: multiple-processor system supporting parallel programming.

Three principal types of architecture:

- Vector computers, in particular processor arrays
- Shared memory multiprocessors: specially designed and manufactured systems
- Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations

Parallel Architectures and Performance Analysis – Slide 3

Parallel Computers

Page 4: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Vector computer: instruction set includes operations on vectors as well as scalars

Two ways to implement vector computers:

- Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements

Parallel Architectures and Performance Analysis – Slide 4

Type 1: Vector Computers

Page 5: CUDA Lecture 3 Parallel Architectures and Performance Analysis

- Historically, high cost of a control unit
- Scientific applications have data parallelism

Parallel Architectures and Performance Analysis – Slide 5

Why Processor Arrays?

Page 6: CUDA Lecture 3 Parallel Architectures and Performance Analysis

- Front-end computer (standard uniprocessor): stores the program; data manipulated sequentially
- Processor array (individual processor/memory pairs): data manipulated in parallel
- Performance depends on:
  - Speed of processing elements
  - Utilization of processing elements
  - Size of data structure

Parallel Architectures and Performance Analysis – Slide 6

Data/Instruction Storage

Page 7: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Each VLSI chip has 16 processing elements

Parallel Architectures and Performance Analysis – Slide 7

2-D Processor Interconnection Network

Page 8: CUDA Lecture 3 Parallel Architectures and Performance Analysis

- Not all problems are data parallel
- Speed drops for conditionally executed code
- Do not adapt to multiple users well
- Do not scale down well to “starter” systems
- Rely on custom VLSI for processors
- Expense of control units has dropped

Parallel Architectures and Performance Analysis – Slide 8

Processor Array Shortcomings

Page 9: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Natural way to extend the single-processor model: have multiple processors connected to multiple memory modules such that each processor can access any memory module.

The so-called shared memory configuration:

Parallel Architectures and Performance Analysis – Slide 9

Type 2: Shared Memory Multiprocessor Systems

Page 10: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 10

Ex: Quad Pentium Shared Memory Multiprocessor

Page 11: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Any memory location can be accessible by any of the processors.

A single address space exists, meaning that each memory location is given a unique address within a single range of addresses.

Generally, shared memory programming is more convenient, although it requires the programmer to control access to shared data (using critical sections, etc.).

Parallel Architectures and Performance Analysis – Slide 11

Shared Memory Multiprocessor Systems

Page 12: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Alternatively known as a tightly coupled architecture. No local memory is associated with processors.

Avoids three problems of processor arrays:
- Can be built from commodity CPUs
- Naturally supports multiple users
- Maintains efficiency in conditional code

Parallel Architectures and Performance Analysis – Slide 12

Shared Memory Multiprocessor Systems (cont.)

Page 13: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Several alternatives for programming shared memory multiprocessors:

- Using threads (pthreads, Java, …), in which the programmer decomposes the program into individual parallel sequences, each being a thread, and each able to access variables declared outside the threads (a minimal sketch follows this slide).
- Using a sequential programming language with user-level libraries to declare and access shared variables.

Parallel Architectures and Performance Analysis – Slide 13

Shared Memory Multiprocessor Systems (cont.)
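
A minimal pthreads sketch of the thread-based approach (the array, its size, and the two-thread split are illustrative, not from the lecture); each thread is one parallel sequence and can read the variables declared outside it:

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    #define NTHREADS 2

    int data[N] = {1, 2, 3, 4, 5, 6, 7, 8};   /* shared: visible to every thread */
    int partial[NTHREADS];                    /* one slot per thread, no races   */

    void *worker(void *arg) {
        long id = (long)arg;                  /* which half of the array is ours */
        int sum = 0;
        for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            sum += data[i];
        partial[id] = sum;
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long id = 0; id < NTHREADS; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < NTHREADS; id++)
            pthread_join(t[id], NULL);
        printf("sum = %d\n", partial[0] + partial[1]);   /* 36 */
        return 0;
    }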

Page 14: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Several alternatives for programming shared memory multiprocessors (cont.):

- Using a sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism.
  - Ex: OpenMP, the industry standard: an API for shared-memory systems that supports higher-performance parallel programming of symmetrical multiprocessors (a minimal sketch follows this slide).

Parallel Architectures and Performance Analysis – Slide 14

Shared Memory Multiprocessor Systems (cont.)
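
A minimal OpenMP sketch of the directive-based approach (the loop and variable names are illustrative); the pragma declares the parallelism and the compiler generates the threading code:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        /* shared-memory parallel loop: each thread accumulates into a private
           copy of sum, and the copies are combined when the loop ends */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }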

Page 15: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Several alternatives for programming shared memory multiprocessors (cont.):

- Using a parallel programming language with syntax for parallelism, in which the compiler creates the appropriate executable code for each processor.
- Using a sequential programming language and asking a parallelizing compiler to convert it into parallel executable code.

Neither of these is now common.

Parallel Architectures and Performance Analysis – Slide 15

Shared Memory Multiprocessor Systems (cont.)

Page 16: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Type 1: Centralized multiprocessor
- Straightforward extension of the uniprocessor: add CPUs to the bus
- All processors share the same primary memory
- Memory access time is the same for all CPUs
- An example of a uniform memory access (UMA) multiprocessor, also called a symmetrical multiprocessor (SMP)

Parallel Architectures and Performance Analysis – Slide 16

Fundamental Types of Shared Memory Multiprocessor

Page 17: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 17

Centralized Multiprocessor

Page 18: CUDA Lecture 3 Parallel Architectures and Performance Analysis

- Private data: items used only by a single processor
- Shared data: values used by multiple processors
- In a centralized multiprocessor, processors communicate via shared data values
- Problems associated with shared data:
  - Cache coherence: replicating data across multiple caches reduces contention, but how do we ensure different processors have the same value for the same address?
  - Synchronization: mutual exclusion and barriers (a sketch of mutual exclusion follows this slide)

Parallel Architectures and Performance Analysis – Slide 18

Private and Shared Data
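
A minimal pthreads sketch of mutual exclusion on shared data (the counter and loop count are illustrative); without the mutex, increments from the two threads could interleave and lose updates:

    #include <pthread.h>
    #include <stdio.h>

    long shared_count = 0;                              /* shared data          */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   /* protects the counter */

    void *increment(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* critical section: one thread at a time */
            shared_count++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, increment, NULL);
        pthread_create(&b, NULL, increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("count = %ld\n", shared_count);   /* always 200000 with the lock */
        return 0;
    }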

Page 19: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Making the main memory of a cluster of computers look as though it is a single memory with a single address space (via hidden message passing).

Then can use shared memory programming techniques.

Parallel Architectures and Performance Analysis – Slide 19

Distributed Shared Memory

Page 20: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Type 2: Distributed multiprocessor
- Distribute primary memory among processors
- Increases aggregate memory bandwidth and lowers average memory access time
- Allows a greater number of processors
- Also called a non-uniform memory access (NUMA) multiprocessor

Parallel Architectures and Performance Analysis – Slide 20

Fundamental Types of Shared Memory Multiprocessor

Page 21: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 21

Distributed Multiprocessor

Page 22: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Some NUMA multiprocessors do not support cache coherence in hardware:
- Only instructions and private data are cached
- Large memory access time variance

Implementations are more difficult:
- No shared memory bus to “snoop”
- A directory-based protocol is needed

Parallel Architectures and Performance Analysis – Slide 22

Cache Coherence

Page 23: CUDA Lecture 3 Parallel Architectures and Performance Analysis

A distributed directory contains information about cacheable memory blocks: one directory entry for each cache block.

Each entry has:
- Sharing status
  - Uncached: block not in any processor’s cache
  - Shared: cached by one or more processors; read only
  - Exclusive: cached by exactly one processor, which has written the block, so the copy in memory is obsolete
- Which processors have copies

Parallel Architectures and Performance Analysis – Slide 23

Directory-Based Protocol

Page 24: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Complete computers connected through an interconnection network

Parallel Architectures and Performance Analysis – Slide 24

Type 3: Message-Passing Multicomputers

Page 25: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Distributed memory, multiple-CPU computer:
- The same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters

Parallel Architectures and Performance Analysis – Slide 25

Multicomputers

Page 26: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Alternate name for message-passing multicomputer systems.

Each processor has its own memory accessible only to that processor.

A message passing interconnection network provides point-to-point connections among processors.

Memory access varies between processors.

Parallel Architectures and Performance Analysis – Slide 26

Loosely Coupled Architectures

Page 27: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 27

Asymmetrical Multicomputer

Page 28: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Advantages:
- Back-end processors dedicated to parallel computations
- Easier to understand, model, and tune performance
- Only a simple back-end operating system needed; easy for a vendor to create

Disadvantages:
- Front-end computer is a single point of failure
- Single front-end computer limits scalability of the system
- Primitive operating system in back-end processors makes debugging difficult
- Every application requires development of both front-end and back-end programs

Parallel Architectures and Performance Analysis – Slide 28

Asymmetrical Multicomputer

Page 29: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 29

Symmetrical Multicomputer

Page 30: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Advantages:
- Alleviates the performance bottleneck caused by a single front-end computer
- Better support for debugging
- Every processor executes the same program

Disadvantages:
- More difficult to maintain the illusion of a single “parallel computer”
- No simple way to balance program development workload among processors
- More difficult to achieve high performance when multiple processes run on each processor

Parallel Architectures and Performance Analysis – Slide 30

Symmetrical Multicomputer

Page 31: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 31

ParPar Cluster: A Mixed Model

Page 32: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams.

Also important are number of processors, number of programs which can be executed, and the memory structure.

Parallel Architectures and Performance Analysis – Slide 32

Alternate System: Flynn’s Taxonomy

Page 33: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Single instruction stream, single data stream (SISD) computer

In a single processor computer, a single stream of instructions is generated from the program. The instructions operate upon a single stream of data items.

The single CPU executes one instruction at a time and fetches or stores one item of data at a time.

Parallel Architectures and Performance Analysis – Slide 33

Flynn’s Taxonomy: SISD

Page 34: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 34

Flynn’s Taxonomy: SISD (cont.)

[Figure: SISD organization. A control unit issues control signals to an arithmetic processor; the instruction stream and data stream come from memory, and results are written back.]

Page 35: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Single instruction stream, multiple data stream (SIMD) computer

A specially designed computer in which a single instruction stream comes from a single program, but multiple data streams exist:

- The instructions from the program are broadcast to more than one processor.
- Each processor executes the same instruction in synchronism, but using different data.
- Developed because a number of important applications mostly operate upon arrays of data.

Parallel Architectures and Performance Analysis – Slide 35

Flynn’s Taxonomy: SIMD

Page 36: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 36

Flynn’s Taxonomy: SIMD (cont.)

[Figure: SIMD organization. A single control unit broadcasts the control signal to processing elements PE 1 through PE n, each working on its own data stream (Data Stream 1 through Data Stream n).]

Page 37: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Processing distributed over a large amount of hardware.

Operates concurrently on many different data elements.

Performs the same computation on all data elements.

Processors operate synchronously.

Examples: pipelined vector processors (e.g. the Cray-1) and processor arrays (e.g. the Connection Machine)

Parallel Architectures and Performance Analysis – Slide 37

SIMD Architectures

Page 38: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 38

SISD vs. SIMD Execution

[Figure: executing the conditional “a = 0?” over statements X1 through X4. The SISD machine tests the condition and executes only the selected branch. The SIMD machine runs both branches in sequence: for one branch the PEs satisfying a = 0 are active while the others are idle, for the other branch the PEs satisfying a ≠ 0 are active while the others are idle; all PEs execute the unconditional statements.]

Page 39: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Multiple instruction stream, single data stream (MISD) computer

MISD machines may execute several different programs on the same data item.

There are two categories:
- Distinct processing units perform distinct instructions on the same data. Currently there is no such machine.
- Pipelined architectures, where data flows through a series of processing elements.

Parallel Architectures and Performance Analysis – Slide 39

Flynn’s Taxonomy: MISD

Page 40: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 40

Flynn’s Taxonomy: MISD (cont.)

[Figure: MISD organization. Control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, all operating on a single data stream.]

Page 41: CUDA Lecture 3 Parallel Architectures and Performance Analysis

A pipeline processor works according to the principle of pipelining:
- A process can be broken down into several stages (segments).
- While one stage is executing, another stage is being loaded, and the input of one stage is the output of the previous stage.

The processor carries out many different computations concurrently.

Example: systolic array

Parallel Architectures and Performance Analysis – Slide 41

MISD Architectures

Page 42: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 42

MISD Architectures (cont.)

Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage.

Pipelined execution of the same two processes: T = 5t. In general, k processes of s stages complete in (s + k – 1)t when pipelined, versus skt when run serially.

[Figure: stage-by-stage timing diagrams (stages S1 through S4) for the serial and pipelined cases.]

Page 43: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Multiple instruction stream, multiple data stream (MIMD) computer

- General-purpose multiprocessor system.
- Multiple processors, each with a separate (different) program operating on its own data.
- One instruction stream is generated from each program for each processor.
- Each instruction operates upon different data.

Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification.

Parallel Architectures and Performance Analysis – Slide 43

Flynn’s Taxonomy: MIMD

Page 44: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 44

Flynn’s Taxonomy: MIMD (cont.)

[Figure: MIMD organization. Control units 1 through n issue instruction streams 1 through n to processing elements 1 through n, each operating on its own data stream (Data Stream 1 through Data Stream n).]

Page 45: CUDA Lecture 3 Parallel Architectures and Performance Analysis

- Processing distributed over a number of processors operating independently and concurrently.
- Resources (memory) shared among processors.
- Each processor runs its own program.

MIMD systems execute operations in a parallel, asynchronous fashion.

Parallel Architectures and Performance Analysis – Slide 45

MIMD Architectures

Page 46: CUDA Lecture 3 Parallel Architectures and Performance Analysis

MIMD systems differ with regard to:
- Interconnection networks
- Memory addressing techniques
- Synchronization
- Control structures

A high throughput can be achieved if the processing can be broken into parallel streams keeping all the processors active concurrently.

Parallel Architectures and Performance Analysis – Slide 46

MIMD Architectures (cont.)

Page 47: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Multiple Program Multiple Data (MPMD) Structure

Within the MIMD classification, which we are concerned with, each processor will have its own program to execute.

Parallel Architectures and Performance Analysis – Slide 47

Two MIMD Structures: MPMD

Page 48: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Single Program Multiple Data (SPMD) Structure

Single source program is written and each processor will execute its personal copy of this program, although independently and not in synchronism.

The source program can be constructed so that parts of the program are executed by certain computers and not others depending upon the identity of the computer.

Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware (a minimal sketch follows this slide).

Parallel Architectures and Performance Analysis – Slide 48

Two MIMD Structures: SPMD
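
The slides do not name a specific API, but MPI is a common way to run one source program in SPMD fashion; a minimal sketch (the rank-based branch below is illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this copy's identity        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many copies are running */

        /* same program everywhere; behavior branches on the identity */
        if (rank == 0)
            printf("rank 0 of %d: doing the coordinator's part\n", size);
        else
            printf("rank %d of %d: doing a worker's part\n", rank, size);

        MPI_Finalize();
        return 0;
    }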

Page 49: CUDA Lecture 3 Parallel Architectures and Performance Analysis

SIMD needs less hardware (only one control unit). In MIMD each processor has its own control unit.

SIMD needs less memory than MIMD (SIMD needs only one copy of the instructions). In MIMD the program and operating system need to be stored at each processor.

SIMD has implicit synchronization of PEs. In contrast, explicit synchronization may be required in MIMD.

Parallel Architectures and Performance Analysis – Slide 49

SIMD vs. MIMD

Page 50: CUDA Lecture 3 Parallel Architectures and Performance Analysis

MIMD allows different operations to be performed on different processing elements simultaneously (functional parallelism). SIMD is limited to data parallelism.

For MIMD it is possible to use a general-purpose microprocessor as the processing unit, which may be cheaper and more powerful.

Parallel Architectures and Performance Analysis – Slide 50

SIMD vs. MIMD (cont.)

Page 51: CUDA Lecture 3 Parallel Architectures and Performance Analysis

The time to execute a sequence of instructions whose execution time is data dependent is less for MIMD than for SIMD:
- MIMD allows each instruction to execute independently.
- In SIMD each processing element must wait until all the others have finished executing the current instruction.

Thus, writing t_i for the time a processing element needs for instruction i and taking each MAX over the processing elements:

    T(\text{MIMD}) = \max\{t_1 + t_2 + \dots + t_n\}
    T(\text{SIMD}) = \max\{t_1\} + \max\{t_2\} + \dots + \max\{t_n\}

so T(MIMD) ≤ T(SIMD).

Parallel Architectures and Performance Analysis – Slide 51

SIMD vs. MIMD (cont.)

Page 52: CUDA Lecture 3 Parallel Architectures and Performance Analysis

In MIMD each processing element can independently follow either direction path in executing if-then-else statement. This requires two phases on SIMD.

MIMD can operate in SIMD mode.

Parallel Architectures and Performance Analysis – Slide 52

SIMD vs. MIMD (cont.)

Page 53: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Architectures:
- Vector computers
- Shared memory multiprocessors: tightly coupled
  - Centralized/symmetrical multiprocessor (SMP): UMA
  - Distributed multiprocessor: NUMA
- Distributed memory/message-passing multicomputers: loosely coupled
  - Asymmetrical vs. symmetrical

Flynn’s Taxonomy:
- SISD, SIMD, MISD, MIMD (MPMD, SPMD)

Parallel Architectures and Performance Analysis – Slide 53

Topic 1 Summary

Page 54: CUDA Lecture 3 Parallel Architectures and Performance Analysis

A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input.

The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements.

Parallel Architectures and Performance Analysis – Slide 54

Topic 2: Performance Measures and Analysis

Page 55: CUDA Lecture 3 Parallel Architectures and Performance Analysis

The degree of parallelism is a measure of the number of operations that an algorithm can perform in parallel for a problem of size W, and it is independent of the parallel architecture. If P(W) is the degree of parallelism of a parallel algorithm, then for a problem of size W no more than P(W) processors can be employed effectively.

We want to be able to do two things: predict the performance of parallel programs, and understand the barriers to higher performance.

Parallel Architectures and Performance Analysis – Slide 55

Performance Measures and Analysis (cont.)

Page 56: CUDA Lecture 3 Parallel Architectures and Performance Analysis

- General speedup formula
- Amdahl’s Law: decide whether a program merits parallelization
- Gustafson-Barsis’ Law: evaluate the performance of a parallel program

Parallel Architectures and Performance Analysis – Slide 56

Topic 2 Outline

Page 57: CUDA Lecture 3 Parallel Architectures and Performance Analysis

The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel.

The speedup factor of a parallel computation utilizing p processors is defined as the following ratio:

In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time.

Parallel Architectures and Performance Analysis – Slide 57

Speedup Factor

S(p) = \frac{\text{Execution time using one processor}}{\text{Execution time using a multiprocessor with } p \text{ processors}} = \frac{T_s}{T_p}

Page 58: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Speedup factor can also be cast in terms of computational steps:

Maximum speedup is (usually) p with p processors (linear speedup).

Parallel Architectures and Performance Analysis – Slide 58

Speedup Factor (cont.)

S(p) = \frac{\text{Number of computational steps using one processor}}{\text{Number of computational steps using } p \text{ processors}}

Page 59: CUDA Lecture 3 Parallel Architectures and Performance Analysis

It is assumed that the processors used in the parallel computation are identical to the one used by the sequential algorithm.

S(p) gives the increase in speed obtained by using a multiprocessor. The underlying algorithm for the parallel implementation might be (and usually is) different.

Parallel Architectures and Performance Analysis – Slide 59

Speedup Factor (cont.)

Page 60: CUDA Lecture 3 Parallel Architectures and Performance Analysis

The sequential algorithm has to be the best algorithm known for a particular computation problem.

This means that it is fair to judge the performance of parallel computation with respect to the fastest sequential algorithm for solving the same problem in a single processor architecture.

Several issues such as synchronization and communication are involved in the parallel computation.

Parallel Architectures and Performance Analysis – Slide 60

Speedup Factor (cont.)

Page 61: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Given a problem of size n on p processors, let
- σ(n) = time spent on inherently sequential computations
- φ(n) = time spent on potentially parallelizable computations
- κ(n,p) = time spent on communication operations

Then:

Parallel Architectures and Performance Analysis – Slide 61

Execution Time Components

S(p) = \frac{T_s}{T_p} = \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)}

Page 62: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 62

[Figure: speedup plot versus number of processors, broken into computation time and communication time; the speedup curve shows the characteristic “elbowing out” as processors are added.]

Page 63: CUDA Lecture 3 Parallel Architectures and Performance Analysis

The efficiency of a parallel computation is defined as a ratio between the speedup factor and the number of processing elements in a parallel system:

Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.

Parallel Architectures and Performance Analysis – Slide 63

Efficiency

E(p) = \frac{\text{Execution time using one processor}}{p \times \text{Execution time using } p \text{ processors}} = \frac{T_s}{p\,T_p} = \frac{S(p)}{p}

Page 64: CUDA Lecture 3 Parallel Architectures and Performance Analysis

In an ideal parallel system the speedup factor is equal to p and the efficiency is equal to one.

In practice ideal behavior is not achieved, since processors cannot devote 100 percent of their time to the computation.

Every parallel program has overhead factors such as creating processes, process synchronization and communication.

In practice efficiency is between zero and one, depending on the degree of effectiveness with which processing elements are utilized.

Parallel Architectures and Performance Analysis – Slide 64

Efficiency (cont.)

Page 65: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Since E = S(p)/p, by what we did earlier:

Since all terms are positive, E > 0. Furthermore, since the denominator is larger than the numerator, E < 1.

Parallel Architectures and Performance Analysis – Slide 65

Analysis of Efficiency

E(p) = \frac{\sigma(n) + \varphi(n)}{p\,\sigma(n) + \varphi(n) + p\,\kappa(n,p)}

Page 66: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Consider the problem of adding n numbers on a p processor system.

Initial brute force approach: all tasks send values to one processor, which adds them all up.

Parallel Architectures and Performance Analysis – Slide 66

Example: Reduction

Page 67: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel algorithm: find the global sum by using a binomial tree.

Parallel Architectures and Performance Analysis – Slide 67

Example: Reduction (cont.)

[Figure: binomial-tree reduction. Partial values are combined pairwise over successive steps until the global sum S is held by a single processor.]

Page 68: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Assume it takes one unit of time for two directly connected processors to add two numbers and to communicate to each other.

Adding n/p numbers locally on each processor takes n/p –1 units of time.

The p partial sums may be added in log p steps, each consisting of one addition and one communication.

Parallel Architectures and Performance Analysis – Slide 68

Example: Reduction (cont.)
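
The lecture describes the combine phase abstractly; the sketch below is a serial C stand-in for the binomial-tree pattern (array indices play the role of processors), not the lecture's message-passing code:

    #include <stdio.h>

    #define P 8                      /* number of (simulated) processors */

    int main(void) {
        /* partial[i] holds processor i's local sum of its n/p numbers */
        double partial[P] = {3, 1, 4, 1, 5, 9, 2, 6};

        /* binomial-tree combine: ceil(log2 P) steps, one addition plus one
           "communication" (here just an array read) per step */
        for (int step = 1; step < P; step *= 2)
            for (int i = 0; i + step < P; i += 2 * step)
                partial[i] += partial[i + step];

        printf("global sum = %g\n", partial[0]);   /* 31 */
        return 0;
    }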

Page 69: CUDA Lecture 3 Parallel Architectures and Performance Analysis

The total parallel computation time T_p is n/p – 1 + 2 log p.

For large values of p and n this can be approximated by T_p = n/p + 2 log p.

The serial computation time can be approximated by T_s = n.

Parallel Architectures and Performance Analysis – Slide 69

Example: Reduction (cont.)

Page 70: CUDA Lecture 3 Parallel Architectures and Performance Analysis

The expression for speedup is

The expression for efficiency is

Speedup and efficiency can be calculated for any p and n.

Parallel Architectures and Performance Analysis – Slide 70

Example: Reduction (cont.)

S(p) = \frac{T_s}{T_p} = \frac{n}{n/p + 2\log p}

E(p) = \frac{S(p)}{p} = \frac{n}{n + 2p\log p}
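
For concreteness, the two expressions above can be evaluated directly; a minimal C sketch, assuming base-2 logarithms and the approximations T_s ≈ n and T_p ≈ n/p + 2 log p from the previous slides (compile with -lm):

    #include <math.h>
    #include <stdio.h>

    /* speedup of the reduction under the slide's timing model */
    double reduction_speedup(double n, double p) {
        return n / (n / p + 2.0 * log2(p));
    }

    /* efficiency = speedup / p */
    double reduction_efficiency(double n, double p) {
        return n / (n + 2.0 * p * log2(p));
    }

    int main(void) {
        double n = 512, p = 16;   /* any problem size and processor count */
        printf("S = %.3f  E = %.3f\n",
               reduction_speedup(n, p), reduction_efficiency(n, p));
        return 0;
    }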

Page 71: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Computational efficiency as a function of n and p.

Parallel Architectures and Performance Analysis – Slide 71

Example: Reduction (cont.)

              p = 1      2      4      8     16     32
    n =  64       1   .980   .930   .815   .623   .399
    n = 192       1   .990   .975   .930   .832   .665
    n = 320       1   .995   .985   .956   .892   .768
    n = 512       1   .995   .990   .972   .930   .841

Page 72: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 72

Example: Reduction (cont.)

[Figure: speedup versus number of processors (up to about 30) for n = 64, 192, 320, and 512.]

Page 73: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Parallel Architectures and Performance Analysis – Slide 73

Maximum Speedup: Amdahl’s Law

Page 74: CUDA Lecture 3 Parallel Architectures and Performance Analysis

As before

since the communication time κ(n,p) is non-negative.

Let f represent the inherently sequential portion of the computation; then

Parallel Architectures and Performance Analysis – Slide 74

Amdahl’s Law (cont.)

S(p) = \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}

f = \frac{\sigma(n)}{\sigma(n) + \varphi(n)}

Page 75: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Then

In short, the maximum speedup factor is given by

where f is the fraction of the computation that cannot be divided into concurrent tasks.

Parallel Architectures and Performance Analysis – Slide 75

Amdahl’s Law (cont.)

S(p) \le \frac{T_s}{f\,T_s + (1 - f)\,T_s/p} = \frac{1}{f + (1 - f)/p} = \frac{p}{1 + (p - 1)f}

Page 76: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Limitations:
- Ignores communication time
- Overestimates the speedup achievable

Amdahl Effect:
- Typically κ(n,p) has lower complexity than φ(n)/p
- So as n increases, φ(n)/p dominates κ(n,p)
- Thus as n increases, speedup increases

Parallel Architectures and Performance Analysis – Slide 76

Amdahl’s Law (cont.)

S(p) \le \frac{p}{1 + (p - 1)f}

Page 77: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Even with an infinite number of processors, the maximum speedup is limited to 1/f.

Ex: With only 5% of a computation being serial, maximum speedup is 20, irrespective of number of processors.

Parallel Architectures and Performance Analysis – Slide 77

Speedup against number of processors
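
The limit quoted above follows by letting p grow in the Amdahl bound (f = 0.05 is the slide's example):

    \lim_{p \to \infty} \frac{1}{f + (1 - f)/p} = \frac{1}{f}, \qquad f = 0.05 \;\Rightarrow\; S(p) \le \frac{1}{0.05} = 20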

Page 78: CUDA Lecture 3 Parallel Architectures and Performance Analysis

So Amdahl’s Law:
- Treats problem size as a constant
- Shows how execution time decreases as the number of processors increases

However, we often use faster computers to solve larger problem instances. Let’s treat time as a constant and allow the problem size to increase with the number of processors.

Parallel Architectures and Performance Analysis – Slide 78

Amdahl’s Law: The Last Word

Page 79: CUDA Lecture 3 Parallel Architectures and Performance Analysis

As before

Let s represent the fraction of time spent in parallel computation performing inherently sequential operations; then

Parallel Architectures and Performance Analysis – Slide 79

Gustafson-Barsis’ Law

S(p) = \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n,p)} \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}

s = \frac{\sigma(n)}{\sigma(n) + \varphi(n)/p}

Page 80: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Then

Parallel Architectures and Performance Analysis – Slide 80

Gustafson-Barsis’ Law (cont.)

S(p) \le s + (1 - s)p = p + (1 - p)s

Page 81: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Begin with parallel execution time instead of sequential time

Estimate sequential execution time to solve same problem

Problem size is an increasing function of p

Predicts scaled speedup

Parallel Architectures and Performance Analysis – Slide 81

Gustafson-Barsis’ Law (cont.)

S(p) \le p + (1 - p)s

Page 82: CUDA Lecture 3 Parallel Architectures and Performance Analysis

An application running on 10 processors spends 3% of its time in serial code.

According to Amdahl’s Law the maximum speedup is

However the scaled speedup is

Parallel Architectures and Performance Analysis – Slide 82

Example

S(p) \le \frac{10}{1 + (10 - 1)(0.03)} \approx 7.87

S(p) = 10 + (1 - 10)(0.03) = 9.73
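
As a quick check of the two numbers above, both formulas can be evaluated directly; a minimal C sketch with f = s = 0.03 and p = 10 as on this slide:

    #include <stdio.h>

    int main(void) {
        double p = 10.0;
        double f = 0.03;   /* Amdahl: serial fraction of the sequential workload  */
        double s = 0.03;   /* Gustafson: serial fraction of the parallel run time */

        double amdahl    = 1.0 / (f + (1.0 - f) / p);   /* = 7.87 */
        double gustafson = p + (1.0 - p) * s;           /* = 9.73 */

        printf("Amdahl speedup    <= %.2f\n", amdahl);
        printf("Scaled speedup     = %.2f\n", gustafson);
        return 0;
    }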

Page 83: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time

Both overestimate speedup or scaled speedup achievable

[Photos: Gene Amdahl and John L. Gustafson.]

Parallel Architectures and Performance Analysis – Slide 83

Limitations

Page 84: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Performance terms: speedup, efficiency

Model of speedup: serial, parallel, and communication components

What prevents linear speedup?
- Serial and communication operations
- Process start-up
- Imbalanced workloads
- Architectural limitations

Analyzing parallel performance:
- Amdahl’s Law
- Gustafson-Barsis’ Law

Parallel Architectures and Performance Analysis – Slide 84

Topic 2 Summary

Page 85: CUDA Lecture 3 Parallel Architectures and Performance Analysis

Based on original material from:
- The University of Akron: Tim O’Neil, Kathy Liszka
- Hiram College: Irena Lomonosov
- The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
- Oregon State University: Michael Quinn

Revision history: last updated 7/28/2011.

Parallel Architectures and Performance Analysis – Slide 85

End Credits