Parallel Architectures & Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


TRANSCRIPT

Page 1: Parallel  Architectures & Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Parallel Architectures & Performance Analysis

Page 2: Parallel  Architectures & Performance Analysis

Parallel computer: a multiple-processor system supporting parallel programming.

Three principal types of architecture:
- Vector computers, in particular processor arrays
- Shared memory multiprocessors: specially designed and manufactured systems
- Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations

Parallel Computers

Page 3: Parallel  Architectures & Performance Analysis

Vector computer: instruction set includes operations on vectors as well as scalars.

Two ways to implement vector computers:
- Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements

Type 1: Vector Computers

Page 4: Parallel  Architectures & Performance Analysis

Natural way to extend the single-processor model: have multiple processors connected to multiple memory modules such that each processor can access any memory module.

This is the so-called shared memory configuration.

Type 2: Shared Memory Multiprocessor Systems

Page 5: Parallel  Architectures & Performance Analysis


Ex: Quad Pentium Shared Memory Multiprocessor

Page 6: Parallel  Architectures & Performance Analysis

Type 2: Distributed multiprocessor
- Distribute primary memory among processors
- Increase aggregate memory bandwidth and lower average memory access time
- Allow a greater number of processors
- Also called a non-uniform memory access (NUMA) multiprocessor

Fundamental Types of Shared Memory Multiprocessor

Page 7: Parallel  Architectures & Performance Analysis


Distributed Multiprocessor

Page 8: Parallel  Architectures & Performance Analysis

Complete computers connected through an interconnection network


Type 3: Message-Passing Multicomputers

Page 9: Parallel  Architectures & Performance Analysis

Distributed memory multiple-CPU computer:
- The same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters

Multicomputers

Page 10: Parallel  Architectures & Performance Analysis


Asymmetrical Multicomputer

Page 11: Parallel  Architectures & Performance Analysis


Symmetrical Multicomputer

Page 12: Parallel  Architectures & Performance Analysis


ParPar Cluster: A Mixed Model

Page 13: Parallel  Architectures & Performance Analysis

Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams.

Also important are number of processors, number of programs which can be executed, and the memory structure.


Alternate System: Flynn’s Taxonomy

Page 14: Parallel  Architectures & Performance Analysis


Flynn’s Taxonomy: SISD (cont.)

[Diagram: SISD organization – a single control unit sends control signals to one arithmetic processor; a single instruction stream and a single data stream flow between memory and the processor, which writes results back to memory.]

Page 15: Parallel  Architectures & Performance Analysis


Flynn’s Taxonomy: SIMD (cont.)

[Diagram: SIMD organization – one control unit broadcasts a single control signal to processing elements PE 1 through PE n, each of which operates on its own data stream (Data Stream 1 through Data Stream n).]

Page 16: Parallel  Architectures & Performance Analysis


Flynn’s Taxonomy: MISD (cont.)

[Diagram: MISD organization – control units 1 through n each issue a separate instruction stream to processing elements 1 through n, all of which operate on the same single data stream.]

Page 17: Parallel  Architectures & Performance Analysis


MISD Architectures (cont.)

Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage.

Pipelined execution of the same two processes: T = 5t.

[Diagram: the two four-stage processes (S1 S2 S3 S4) executed back-to-back versus overlapped in a pipeline.]
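As a hedged aside (not on the original slide), the timing generalizes: for n processes of k stages each, with t again the time per stage,

T_serial = n k t and T_pipelined = (k + n - 1) t,

so the example above (k = 4, n = 2) gives 8t serially and (4 + 2 - 1)t = 5t pipelined.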

Page 18: Parallel  Architectures & Performance Analysis


Flynn’s Taxonomy: MIMD (cont.)

[Diagram: MIMD organization – control units 1 through n each issue a separate instruction stream to processing elements 1 through n, each of which operates on its own data stream (Data Stream 1 through Data Stream n).]

Page 19: Parallel  Architectures & Performance Analysis

Multiple Program Multiple Data (MPMD) Structure

Within the MIMD classification, which we are concerned with, each processor will have its own program to execute.


Two MIMD Structures: MPMD

Page 20: Parallel  Architectures & Performance Analysis

Single Program Multiple Data (SPMD) Structure

A single source program is written, and each processor executes its own copy of this program, although independently and not in synchronism.

The source program can be constructed so that parts of the program are executed by certain computers and not others, depending upon the identity of the computer.

Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware.

Two MIMD Structures: SPMD
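To make the SPMD idea concrete, here is a minimal sketch in C using MPI (not from the original slides); the file name, the rank-based split of work, and the printed messages are illustrative assumptions.

/* spmd_sketch.c - illustrative SPMD sketch, not from the original slides.
 * Every process runs this same program; the rank reported by MPI decides
 * which part of the work each copy performs.
 * Typical build/run: mpicc spmd_sketch.c -o spmd && mpirun -np 4 ./spmd
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* identity of this copy   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of copies  */

    if (rank == 0) {
        /* Only the copy with rank 0 executes this branch. */
        printf("Coordinator: %d copies of this one program are running\n", size);
    } else {
        /* Every other copy executes this branch, each on its own data. */
        printf("Worker %d: doing my share of the computation\n", rank);
    }

    MPI_Finalize();                        /* shut down the MPI runtime */
    return 0;
}

Each copy executes the same source, but the test on rank steers different copies through different parts of it, which is exactly the SPMD pattern described above.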

Page 21: Parallel  Architectures & Performance Analysis

Architectures
- Vector computers
- Shared memory multiprocessors: tightly coupled
  - Centralized/symmetrical multiprocessor (SMP): UMA
  - Distributed multiprocessor: NUMA
- Distributed memory/message-passing multicomputers: loosely coupled
  - Asymmetrical vs. symmetrical

Flynn’s Taxonomy
- SISD, SIMD, MISD, MIMD (MPMD, SPMD)

Topic 1 Summary

Page 22: Parallel  Architectures & Performance Analysis

A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input.

The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements.


Topic 2: Performance Measures and Analysis

Page 23: Parallel  Architectures & Performance Analysis

The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel.

The speedup factor of a parallel computation utilizing p processors is defined as the following ratio:

S(p) = (execution time using one processor) / (execution time using a multiprocessor with p processors) = T_s / T_p

In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time.

Speedup Factor

Page 24: Parallel  Architectures & Performance Analysis

Speedup factor can also be cast in terms of computational steps:

S(p) = (no. of computational steps using one processor) / (no. of computational steps using p processors)

Maximum speedup is (usually) p with p processors (linear speedup).

Speedup Factor (cont.)
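As a hedged illustration with made-up numbers: if a computation takes 10,000 steps on one processor and 2,000 steps on 8 processors, then

S(8) = 10,000 / 2,000 = 5,

well short of the linear-speedup bound of 8.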

Page 25: Parallel  Architectures & Performance Analysis

Given a problem of size n on p processors, let
- Inherently sequential computations: σ(n)
- Potentially parallel computations: φ(n)
- Communication operations: κ(n,p)

Then:

S(p) = T_s / T_p ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))

Execution Time Components
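A minimal sketch in C (not from the original slides) showing how this bound could be evaluated; the function name speedup_bound and the sample values for σ(n), φ(n) and κ(n,p) are illustrative assumptions, and κ is held fixed here even though in general it depends on p.

/* speedup_bound.c - illustrative sketch, not from the original slides.
 *   sigma : inherently sequential computation time, sigma(n)
 *   phi   : potentially parallel computation time, phi(n)
 *   kappa : communication time, kappa(n,p), held fixed here for simplicity
 */
#include <stdio.h>

static double speedup_bound(double sigma, double phi, double kappa, int p)
{
    /* S(p) <= (sigma + phi) / (sigma + phi/p + kappa) */
    return (sigma + phi) / (sigma + phi / p + kappa);
}

int main(void)
{
    double sigma = 10.0, phi = 90.0, kappa = 2.0;  /* hypothetical times in seconds */

    for (int p = 1; p <= 16; p *= 2)
        printf("p = %2d   S(p) <= %.2f\n", p, speedup_bound(sigma, phi, kappa, p));
    return 0;
}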

Page 26: Parallel  Architectures & Performance Analysis

Speedup Plot

[Figure: speedup versus number of processors – the computation-time component shrinks as p grows while the communication-time component grows, so the speedup curve eventually levels off and turns down (“elbowing out”).]

Page 27: Parallel  Architectures & Performance Analysis

The efficiency of a parallel computation is defined as a ratio between the speedup factor and the number of processing elements in a parallel system:

E(p) = S(p) / p = (execution time using one processor) / (p × execution time using p processors) = T_s / (p T_p)

Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.

Efficiency
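Continuing the made-up numbers from the speedup example above: with S(8) = 5 on p = 8 processors,

E(8) = 5 / 8 = 0.625,

i.e. on average each processing element is usefully employed about 62.5% of the time.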

Page 28: Parallel  Architectures & Performance Analysis

Since E = S(p)/p, by what we did earlier

E(p) ≤ (σ(n) + φ(n)) / (p σ(n) + φ(n) + p κ(n,p))

Since all terms are positive, E > 0. Furthermore, since the denominator is larger than the numerator, E < 1.

Analysis of Efficiency

Page 29: Parallel  Architectures & Performance Analysis


Maximum Speedup: Amdahl’s Law

Page 30: Parallel  Architectures & Performance Analysis

As before,

S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)

since the communication time must be non-trivial.

Let f represent the inherently sequential portion of the computation; then

f = σ(n) / (σ(n) + φ(n))

Amdahl’s Law (cont.)

Page 31: Parallel  Architectures & Performance Analysis

Limitations:
- Ignores communication time
- Overestimates speedup achievable

Amdahl Effect:
- Typically κ(n,p) has lower complexity than φ(n)/p
- So as n increases, φ(n)/p dominates κ(n,p)
- Thus as n increases, speedup increases

Amdahl’s Law (cont.)

S(p) ≤ 1 / (f + (1 - f)/p)
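A hedged numerical illustration (values not from the slides): with an inherently sequential fraction f = 0.1 and p = 8 processors,

S(8) ≤ 1 / (0.1 + 0.9/8) ≈ 4.7,

and no matter how many processors are used, S(p) ≤ 1/f = 10.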

Page 32: Parallel  Architectures & Performance Analysis

As before,

S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)

Let s represent the fraction of time spent in the parallel computation performing inherently sequential operations; then

s = σ(n) / (σ(n) + φ(n)/p)

Gustafson-Barsis’ Law

Page 33: Parallel  Architectures & Performance Analysis

Then

S(p) ≤ s + (1 - s)p = p + (1 - p)s

Gustafson-Barsis’ Law (cont.)

Page 34: Parallel  Architectures & Performance Analysis

Begin with parallel execution time instead of sequential time

Estimate sequential execution time to solve same problem

Problem size is an increasing function of p

Predicts scaled speedup

Gustafson-Barsis’ Law (cont.)

S(p) ≤ p + (1 - p)s
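A hedged numerical illustration (values not from the slides): with s = 0.1 and p = 8,

S(8) ≤ 8 + (1 - 8)(0.1) = 7.3,

a more optimistic prediction than Amdahl’s Law gives for a comparable sequential fraction, because the problem size is assumed to grow with p.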

Page 35: Parallel  Architectures & Performance Analysis

Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time

Both overestimate speedup or scaled speedup achievable

[Photos: Gene Amdahl and John L. Gustafson]


Limitations

Page 36: Parallel  Architectures & Performance Analysis

Performance terms: speedup, efficiency

Model of speedup: serial, parallel and communication components

What prevents linear speedup?
- Serial and communication operations
- Process start-up
- Imbalanced workloads
- Architectural limitations

Analyzing parallel performance:
- Amdahl’s Law
- Gustafson-Barsis’ Law

Topic 2 Summary

Page 37: Parallel  Architectures & Performance Analysis

Based on original material from:
- The University of Akron: Tim O’Neil, Kathy Liszka
- Hiram College: Irena Lomonosov
- The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
- Oregon State University: Michael Quinn

Revision history: last updated 7/28/2011.

End Credits