Parallel Architectures & Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.


TRANSCRIPT

Page 1: Parallel  Architectures & Performance Analysis

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Parallel Architectures & Performance Analysis

Page 2: Parallel  Architectures & Performance Analysis

Parallel computer: a multiple-processor system supporting parallel programming.

Three principal types of architecture:
- Vector computers, in particular processor arrays
- Shared memory multiprocessors: specially designed and manufactured systems
- Distributed memory multicomputers: message-passing systems readily formed from a cluster of workstations

Parallel Computers

Page 3: Parallel  Architectures & Performance Analysis

Vector computer: instruction set includes operations on vectors as well as scalars.

Two ways to implement vector computers:
- Pipelined vector processor (e.g. Cray): streams data through pipelined arithmetic units
- Processor array: many identical, synchronized arithmetic processing elements

Type 1: Vector Computers

Page 4: Parallel  Architectures & Performance Analysis

Natural way to extend the single-processor model: have multiple processors connected to multiple memory modules such that each processor can access any memory module.

This is the so-called shared memory configuration.

Type 2: Shared Memory Multiprocessor Systems

Page 5: Parallel  Architectures & Performance Analysis


Ex: Quad Pentium Shared Memory Multiprocessor

Page 6: Parallel  Architectures & Performance Analysis

Type 2: Distributed multiprocessor
- Distribute primary memory among processors
- Increase aggregate memory bandwidth and lower average memory access time
- Allow a greater number of processors
- Also called a non-uniform memory access (NUMA) multiprocessor

Fundamental Types of Shared Memory Multiprocessor

Page 7: Parallel  Architectures & Performance Analysis


Distributed Multiprocessor

Page 8: Parallel  Architectures & Performance Analysis

Complete computers connected through an interconnection network


Type 3: Message-Passing Multicomputers

Page 9: Parallel  Architectures & Performance Analysis

Distributed memory multiple-CPU computer:
- The same address on different processors refers to different physical memory locations
- Processors interact through message passing
- Commercial multicomputers
- Commodity clusters

Multicomputers

Page 10: Parallel  Architectures & Performance Analysis


Asymmetrical Multicomputer

Page 11: Parallel  Architectures & Performance Analysis


Symmetrical Multicomputer

Page 12: Parallel  Architectures & Performance Analysis


ParPar Cluster: A Mixed Model

Page 13: Parallel  Architectures & Performance Analysis

Michael Flynn (1966) created a classification for computer architectures based upon a variety of characteristics, specifically instruction streams and data streams.

Also important are number of processors, number of programs which can be executed, and the memory structure.


Alternate System: Flynn’s Taxonomy

Page 14: Parallel  Architectures & Performance Analysis


Flynn’s Taxonomy: SISD (cont.)

[Diagram: SISD organization – a single control unit sends control signals to one arithmetic processor; a single instruction stream and a single data stream flow between memory and the processor, which writes results back to memory.]

Page 15: Parallel  Architectures & Performance Analysis


Flynn’s Taxonomy: SIMD (cont.)

[Diagram: SIMD organization – one control unit broadcasts a single control signal to processing elements PE 1 through PE n, each of which operates on its own data stream (Data Stream 1 through Data Stream n).]

Page 16: Parallel  Architectures & Performance Analysis


Flynn’s Taxonomy: MISD (cont.)

[Diagram: MISD organization – control units 1 through n each issue a separate instruction stream to processing elements 1 through n, all of which operate on the same single data stream.]

Page 17: Parallel  Architectures & Performance Analysis


MISD Architectures (cont.)

Serial execution of two processes with 4 stages each: time to execute T = 8t, where t is the time to execute one stage.

Pipelined execution of the same two processes: T = 5t.

[Diagram: the two four-stage processes (S1 S2 S3 S4) executed back-to-back versus overlapped in a pipeline.]
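As a hedged aside (not on the original slide), the timing generalizes: for n processes of k stages each, with t again the time per stage,

T_serial = n k t and T_pipelined = (k + n - 1) t,

so the example above (k = 4, n = 2) gives 8t serially and (4 + 2 - 1)t = 5t pipelined.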

Page 18: Parallel  Architectures & Performance Analysis


Flynn’s Taxonomy: MIMD (cont.)

[Diagram: MIMD organization – control units 1 through n each issue a separate instruction stream to processing elements 1 through n, each of which operates on its own data stream (Data Stream 1 through Data Stream n).]

Page 19: Parallel  Architectures & Performance Analysis

Multiple Program Multiple Data (MPMD) Structure

Within the MIMD classification, which we are concerned with, each processor will have its own program to execute.


Two MIMD Structures: MPMD

Page 20: Parallel  Architectures & Performance Analysis

Single Program Multiple Data (SPMD) Structure

A single source program is written, and each processor executes its own copy of this program, although independently and not in synchronism.

The source program can be constructed so that parts of the program are executed by certain computers and not others, depending upon the identity of the computer.

Software equivalent of SIMD; can perform SIMD calculations on MIMD hardware.

Two MIMD Structures: SPMD
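To make the SPMD idea concrete, here is a minimal sketch in C using MPI (not from the original slides); the file name, the rank-based split of work, and the printed messages are illustrative assumptions.

/* spmd_sketch.c - illustrative SPMD sketch, not from the original slides.
 * Every process runs this same program; the rank reported by MPI decides
 * which part of the work each copy performs.
 * Typical build/run: mpicc spmd_sketch.c -o spmd && mpirun -np 4 ./spmd
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime   */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* identity of this copy   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of copies  */

    if (rank == 0) {
        /* Only the copy with rank 0 executes this branch. */
        printf("Coordinator: %d copies of this one program are running\n", size);
    } else {
        /* Every other copy executes this branch, each on its own data. */
        printf("Worker %d: doing my share of the computation\n", rank);
    }

    MPI_Finalize();                        /* shut down the MPI runtime */
    return 0;
}

Each copy executes the same source, but the test on rank steers different copies through different parts of it, which is exactly the SPMD pattern described above.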

Page 21: Parallel  Architectures & Performance Analysis

Architectures
- Vector computers
- Shared memory multiprocessors: tightly coupled
  - Centralized/symmetrical multiprocessor (SMP): UMA
  - Distributed multiprocessor: NUMA
- Distributed memory/message-passing multicomputers: loosely coupled
  - Asymmetrical vs. symmetrical

Flynn’s Taxonomy
- SISD, SIMD, MISD, MIMD (MPMD, SPMD)

Topic 1 Summary

Page 22: Parallel  Architectures & Performance Analysis

A sequential algorithm can be evaluated in terms of its execution time, which can be expressed as a function of the size of its input.

The execution time of a parallel algorithm depends not only on the input size of the problem but also on the architecture of a parallel computer and the number of available processing elements.


Topic 2: Performance Measures and Analysis

Page 23: Parallel  Architectures & Performance Analysis

The speedup factor is a measure that captures the relative benefit of solving a computational problem in parallel.

The speedup factor of a parallel computation utilizing p processors is defined as the following ratio:

S(p) = (execution time using one processor) / (execution time using a multiprocessor with p processors) = T_s / T_p

In other words, S(p) is defined as the ratio of the sequential processing time to the parallel processing time.

Speedup Factor

Page 24: Parallel  Architectures & Performance Analysis

Speedup factor can also be cast in terms of computational steps:

S(p) = (no. of computational steps using one processor) / (no. of computational steps using p processors)

Maximum speedup is (usually) p with p processors (linear speedup).

Speedup Factor (cont.)
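As a hedged illustration with made-up numbers: if a computation takes 10,000 steps on one processor and 2,000 steps on 8 processors, then

S(8) = 10,000 / 2,000 = 5,

well short of the linear-speedup bound of 8.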

Page 25: Parallel  Architectures & Performance Analysis

Given a problem of size n on p processors, let
- Inherently sequential computations: σ(n)
- Potentially parallel computations: φ(n)
- Communication operations: κ(n,p)

Then:

S(p) = T_s / T_p ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p))

Execution Time Components
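A minimal sketch in C (not from the original slides) showing how this bound could be evaluated; the function name speedup_bound and the sample values for σ(n), φ(n) and κ(n,p) are illustrative assumptions, and κ is held fixed here even though in general it depends on p.

/* speedup_bound.c - illustrative sketch, not from the original slides.
 *   sigma : inherently sequential computation time, sigma(n)
 *   phi   : potentially parallel computation time, phi(n)
 *   kappa : communication time, kappa(n,p), held fixed here for simplicity
 */
#include <stdio.h>

static double speedup_bound(double sigma, double phi, double kappa, int p)
{
    /* S(p) <= (sigma + phi) / (sigma + phi/p + kappa) */
    return (sigma + phi) / (sigma + phi / p + kappa);
}

int main(void)
{
    double sigma = 10.0, phi = 90.0, kappa = 2.0;  /* hypothetical times in seconds */

    for (int p = 1; p <= 16; p *= 2)
        printf("p = %2d   S(p) <= %.2f\n", p, speedup_bound(sigma, phi, kappa, p));
    return 0;
}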

Page 26: Parallel  Architectures & Performance Analysis

Speedup Plot

[Figure: speedup versus number of processors – the computation-time component shrinks as p grows while the communication-time component grows, so the speedup curve eventually levels off and turns down (“elbowing out”).]

Page 27: Parallel  Architectures & Performance Analysis

The efficiency of a parallel computation is defined as a ratio between the speedup factor and the number of processing elements in a parallel system:

E(p) = S(p) / p = (execution time using one processor) / (p × execution time using p processors) = T_s / (p T_p)

Efficiency is a measure of the fraction of time for which a processing element is usefully employed in a computation.

Efficiency
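Continuing the made-up numbers from the speedup example above: with S(8) = 5 on p = 8 processors,

E(8) = 5 / 8 = 0.625,

i.e. on average each processing element is usefully employed about 62.5% of the time.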

Page 28: Parallel  Architectures & Performance Analysis

Since E = S(p)/p, by what we did earlier

E(p) ≤ (σ(n) + φ(n)) / (p σ(n) + φ(n) + p κ(n,p))

Since all terms are positive, E > 0. Furthermore, since the denominator is larger than the numerator, E < 1.

Analysis of Efficiency

Page 29: Parallel  Architectures & Performance Analysis


Maximum Speedup: Amdahl’s Law

Page 30: Parallel  Architectures & Performance Analysis

As before,

S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)

since the communication time must be non-trivial.

Let f represent the inherently sequential portion of the computation; then

f = σ(n) / (σ(n) + φ(n))

Amdahl’s Law (cont.)

Page 31: Parallel  Architectures & Performance Analysis

Limitations:
- Ignores communication time
- Overestimates speedup achievable

Amdahl Effect:
- Typically κ(n,p) has lower complexity than φ(n)/p
- So as n increases, φ(n)/p dominates κ(n,p)
- Thus as n increases, speedup increases

Amdahl’s Law (cont.)

S(p) ≤ 1 / (f + (1 - f)/p)
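A hedged numerical illustration (values not from the slides): with an inherently sequential fraction f = 0.1 and p = 8 processors,

S(8) ≤ 1 / (0.1 + 0.9/8) ≈ 4.7,

and no matter how many processors are used, S(p) ≤ 1/f = 10.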

Page 32: Parallel  Architectures & Performance Analysis

As before,

S(p) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n,p)) ≤ (σ(n) + φ(n)) / (σ(n) + φ(n)/p)

Let s represent the fraction of time spent in the parallel computation performing inherently sequential operations; then

s = σ(n) / (σ(n) + φ(n)/p)

Gustafson-Barsis’ Law

Page 33: Parallel  Architectures & Performance Analysis

Then

S(p) ≤ s + (1 - s)p = p + (1 - p)s

Gustafson-Barsis’ Law (cont.)

Page 34: Parallel  Architectures & Performance Analysis

Begin with parallel execution time instead of sequential time

Estimate sequential execution time to solve same problem

Problem size is an increasing function of p

Predicts scaled speedup

Gustafson-Barsis’ Law (cont.)

S(p) ≤ p + (1 - p)s
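A hedged numerical illustration (values not from the slides): with s = 0.1 and p = 8,

S(8) ≤ 8 + (1 - 8)(0.1) = 7.3,

a more optimistic prediction than Amdahl’s Law gives for a comparable sequential fraction, because the problem size is assumed to grow with p.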

Page 35: Parallel  Architectures & Performance Analysis

Both Amdahl’s Law and Gustafson-Barsis’ Law ignore communication time

Both overestimate speedup or scaled speedup achievable

[Photos: Gene Amdahl and John L. Gustafson]


Limitations

Page 36: Parallel  Architectures & Performance Analysis

Performance terms: speedup, efficiency

Model of speedup: serial, parallel and communication components

What prevents linear speedup?
- Serial and communication operations
- Process start-up
- Imbalanced workloads
- Architectural limitations

Analyzing parallel performance:
- Amdahl’s Law
- Gustafson-Barsis’ Law

Topic 2 Summary

Page 37: Parallel  Architectures & Performance Analysis

Based on original material from:
- The University of Akron: Tim O’Neil, Kathy Liszka
- Hiram College: Irena Lomonosov
- The University of North Carolina at Charlotte: Barry Wilkinson, Michael Allen
- Oregon State University: Michael Quinn

Revision history: last updated 7/28/2011.

End Credits