CSE 8383 - Advanced Computer Architecture

Week 5: Week of Feb 9, 2004

engr.smu.edu/~rewini/8383

Contents

Project/Schedule
Introduction to Multiprocessors
Parallelism
Performance
PRAM Model
…

Warm Up

Parallel Numerical Integration
Parallel Matrix Multiplication

In class: Discuss with your neighbor!
Videotape: Think about it!

What kind of architecture do we need?

Explicit vs. Implicit Parallelism

Parallel Architecture

Programming Environment

Parallelizer

Sequential program

Parallel program

Motivation

One-processor systems are not capable of delivering solutions to some problems in reasonable time

Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution

Speed-up versus Quality-up

Multiprocessing

One-processor

Multiprocessor

Speed-up, Quality-up, Sharing

Physical limitations

N processors cooperate to solve a single computational task

Flynn’s Classification - revisited

SISD (single instruction stream over a single data stream)

SIMD (single instruction stream over multiple data streams)

MIMD (multiple instruction streams over multiple data streams)

MISD (multiple instruction streams and a single data stream)

SISD (single instruction stream over a single data stream)

[Figure: SISD uniprocessor architecture, showing the CU, PU, MU, and I/O connected by the IS and DS]

Captions:

CU = control unit
PU = processing unit
MU = memory unit
IS = instruction stream
DS = data stream
PE = processing element
LM = local memory

SIMD (single instruction stream over multiple data streams)

[Figure: SIMD architecture. A CU issues the IS to PE1 … PEn, each with its own local memory LM1 … LMn; the program and the data sets are loaded from the host]

MIMD (multiple instruction streams over multiple data streams)

[Figure: MIMD architecture (with shared memory). CU1 … CUn each issue an IS to PU1 … PUn; the PUs exchange DS with the shared memory; I/O is attached]

MISD (multiple instruction streams and a single data stream)

[Figure: MISD architecture (the systolic array). CU1 … CUn each issue an IS to PU1 … PUn; a single DS from memory (program and data) passes through the PUs; I/O is attached]

System Components

Three major components:

Processors

Memory Modules

Interconnection Network

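Memory Access

Shared Memory
Distributed Memory

[Figure: shared memory (processors P sharing one memory M) versus distributed memory (each processor P with its own memory M)]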

Interconnection Network Taxonomy


Static: 1-D, 2-D, HC
Dynamic:
    Bus-based: Single, Multiple
    Switch-based: SS, MS, Crossbar

MIMD Shared Memory Systems

[Figure: processors (P) connected to memory modules (M) through interconnection networks]

Shared Memory
Single address space
Communication via read & write
Synchronization via locks

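Bus-Based & Switch-Based SM Systems

[Figure: a bus-based system, in which processors (P) with caches (C) share a global memory, and a switch-based system, in which processors with caches reach memory modules (M) through a switch]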

Cache Coherent NUMA


[Figure: cache-coherent NUMA. Each node has a processor (P), a cache (C), and a memory (M)]

MIMD Distributed Memory Systems

[Figure: processor-memory pairs (P with its own M) connected through interconnection networks]

Distributed Memory
Multiple address spaces
Communication via send & receive
Synchronization via messages

SIMD Computers

[Figure: SIMD computer. A von Neumann computer controls an array of processor-memory (PM) pairs connected by some interconnection network]

SIMD (Data Parallel)

Parallel operations within a computation are partitioned spatially rather than temporally

Scalar instructions vs. array instructions

Processors are incapable of operating autonomously; they must be driven by the control unit

Past Trends in Parallel Architecture (inside the box)

Completely custom-designed components (processors, memory, interconnects, I/O)
Longer R&D time (2-3 years)
Expensive systems
Quickly becoming outdated

Bankrupt companies!!

New Trends in Parallel Architecture (outside the box)

Advances in commodity processors and network technology
A network of PCs and workstations connected via LAN or WAN forms a parallel system
Network computing
Compete favorably (cost/performance)
Utilize unused cycles of systems sitting idle

Clusters

[Figure: cluster. Each node has its own processor (P), cache (C), memory (M), I/O, and OS; middleware and a programming environment span the nodes]


Grids

Grids are geographically distributed platforms for computation.

They provide dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.

Problem

Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must be the diameter of a round chip so that it can switch $10^{9}$ times per second? What would the diameter be if the switching requirement were $10^{12}$ times per second?
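One way to approach the problem is to treat the chip diameter as the distance a signal can cover within one switching period. The sketch below works under that reading; the function name `max_diameter_m` is illustrative, and the signal speed is the one given in the problem statement.

```python
# Signal speed from the problem statement: 300,000 km/s = 3e8 m/s.
SIGNAL_SPEED_M_PER_S = 3.0e8


def max_diameter_m(switches_per_second: float) -> float:
    """Distance a signal travels during one switching period, in meters."""
    return SIGNAL_SPEED_M_PER_S / switches_per_second


for rate in (1e9, 1e12):
    diameter = max_diameter_m(rate)
    print(f"{rate:.0e} switches/s -> diameter of about {diameter * 1000:.4g} mm")
```

Under these assumptions the diameter comes out at roughly 30 cm for the first rate and roughly 0.3 mm for the second.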

Grosch’s Law (1960s)

“To sell a computer for twice as much, it must be four times as fast”
Vendors skip small speed improvements in favor of waiting for large ones
Buyers of expensive machines would wait for a twofold improvement in performance for the same price.

Moore’s Law

Gordon Moore (cofounder of Intel)
Processor performance would double every 18 months
This prediction has held for several decades
Unlikely that single-processor performance continues to increase indefinitely

Von Neumann’s bottleneck

Great mathematician of the 1940s and 1950s
Single control unit connecting a memory to a processing unit
Instructions and data are fetched one at a time from memory and fed to the processing unit
Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit.

Parallelism

Multiple CPUs
Within the CPU
    One pipeline
    Multiple pipelines

Speedup

S = Speed(new) / Speed(old)
S = [Work/time(new)] / [Work/time(old)]
S = time(old) / time(new)
S = time(before improvement) / time(after improvement)

Speedup

Time (one CPU): T(1)

Time (n CPUs): T(n)

Speedup: S

S = T(1)/T(n)

Amdahl’s Law

The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used

Example

Travel from A to B: 20 hours must be walked, and the other 200 miles can be covered by a faster mode.

Mode     Speed (miles/hour)   Total time (hours)     Speedup S
Walk     4                    50 + 20 = 70           1
Bike     10                   20 + 20 = 40           1.8
Car-1    50                   4 + 20 = 24            2.9
Car-2    120                  1.67 + 20 = 21.67      3.2
Car-3    600                  0.33 + 20 = 20.33      3.4
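The travel example can be recomputed with a few lines of code. This is a minimal sketch using the 20 walked hours and the 200-mile stretch from the example; the function and constant names are illustrative.

```python
WALK_ONLY_HOURS = 20.0   # part of the trip that must be walked
DISTANCE_MILES = 200.0   # part of the trip that can use a faster mode


def total_hours(speed_mph: float) -> float:
    """Total trip time when the 200-mile stretch is covered at speed_mph."""
    return DISTANCE_MILES / speed_mph + WALK_ONLY_HOURS


def speedup_over_walking(speed_mph: float) -> float:
    """Speedup relative to walking the whole way at 4 miles/hour."""
    return total_hours(4.0) / total_hours(speed_mph)


modes = {"Walk": 4.0, "Bike": 10.0, "Car-1": 50.0, "Car-2": 120.0, "Car-3": 600.0}
for name, mph in modes.items():
    print(f"{name:6s} {total_hours(mph):6.2f} hours  S = {speedup_over_walking(mph):.2f}")
```

However fast the vehicle, the speedup stays below 70 / 20 = 3.5, because the 20 walked hours never shrink.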

Amdahl’s Law (1967)

$\alpha$: the fraction of the program that is naturally serial

$(1-\alpha)$: the fraction of the program that is naturally parallel

$S = \frac{T(1)}{T(N)}$

$T(N) = T(1)\,\alpha + \frac{T(1)\,(1-\alpha)}{N}$

$S = \frac{1}{\alpha + \frac{1-\alpha}{N}} = \frac{N}{\alpha N + (1-\alpha)}$

Amdahl’s Law
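To get a feel for the formula, the sketch below evaluates Amdahl's speedup for a few processor counts. It writes the serial fraction as `alpha` (the symbol is an assumption here) and the processor count as `n`; the sample values are illustrative only.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Amdahl's Law: S = 1 / (alpha + (1 - alpha) / n)."""
    return 1.0 / (alpha + (1.0 - alpha) / n)


alpha = 0.1  # illustrative: 10% of the program is naturally serial
for n in (1, 10, 100, 1000):
    print(f"n = {n:4d}  S = {amdahl_speedup(alpha, n):.2f}")
print(f"upper bound as n grows: {1.0 / alpha:.1f}")
```

As `n` grows, the speedup approaches but never reaches 1 / alpha, so the serial fraction alone caps the benefit of adding processors.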

Gustafson-Barsis Law

$N$ and $\alpha$ are not independent of each other

$T(N) = 1$

$T(1) = \alpha + (1-\alpha)\,N$

$S = N - (N-1)\,\alpha$

$\alpha$: the fraction of the program that is naturally serial

Gustafson-Barsis Law

Comparison of Amdahl’s Law vs Gustafson-Barsis’ Law
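The two laws can be put side by side numerically. The sketch below assumes the usual forms, Amdahl's S = 1 / (alpha + (1 - alpha) / N) and Gustafson-Barsis' S = N - (N - 1) alpha, with `alpha` as the serial fraction; the sample value of `alpha` is illustrative.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Fixed problem size: the serial fraction limits the speedup."""
    return 1.0 / (alpha + (1.0 - alpha) / n)


def gustafson_speedup(alpha: float, n: int) -> float:
    """Scaled problem size: the parallel part grows with n."""
    return n - (n - 1) * alpha


alpha = 0.1  # illustrative serial fraction
print(f"{'N':>5} {'Amdahl':>8} {'Gustafson-Barsis':>17}")
for n in (1, 2, 8, 64, 1024):
    print(f"{n:>5} {amdahl_speedup(alpha, n):>8.2f} {gustafson_speedup(alpha, n):>17.2f}")
```

Amdahl's speedup flattens out near 1 / alpha, while the Gustafson-Barsis speedup keeps growing linearly in N, because the parallel workload is assumed to scale with the machine.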

for I = 1 to 10 do
begin
    S[I] = 0.0;
    for J = 1 to 10 do
        S[I] = S[I] + M[I, J];
    S[I] = S[I] / 10;
end

Example
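For readers who want to run the loop, here is a rough Python rendering of the pseudocode. It computes the average of each row of a 10 x 10 matrix; the sample matrix `m` is made up for illustration.

```python
def row_averages(m):
    """Return the average of each row of m, mirroring the S[I] loop."""
    s = []
    for row in m:              # outer loop over I; iterations are independent
        total = 0.0
        for value in row:      # inner loop over J; accumulates one row
            total += value
        s.append(total / len(row))
    return s


# Illustrative 10 x 10 matrix: row i holds the values 10*i .. 10*i + 9.
m = [[float(i * 10 + j) for j in range(10)] for i in range(10)]
print(row_averages(m))
```

Each outer iteration touches only its own row and its own `S[I]`, which is why the outer loop is a natural candidate for distribution across processors.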

Distributed Computing Performance

Single Program Performance

Multiple Program Performance

PRAM Model

What is a Model?

According to Webster’s Dictionary, a model is “a description or analogy used to help visualize something that cannot be directly observed.”

According to The Oxford English Dictionary, a model is “a simplified or idealized description or conception of a particular system, situation or process.”

Why Models?

In general, the purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction.

Maggs, Matheson and Tarjan (1995)

Models in Problem Solving

Computer scientists use models to help design problem-solving tools such as:

Fast algorithms
Effective programming environments
Powerful execution engines

A model is an interface separating high level properties from low level ones

An Interface

[Figure: the MODEL sits between applications and architectures; it provides operations and requires implementation]

PRAM Model

Synchronized Read-Compute-Write cycle
EREW, ERCW, CREW, CRCW
Complexity: T(n), P(n), C(n)

[Figure: PRAM. A control unit drives processors P1, P2, …, Pp, each with a private memory, all connected to a global memory]

The PRAM model and its variations (cont.)

There are different modes for read and write operations in a PRAM:

Exclusive read (ER)
Exclusive write (EW)
Concurrent read (CR)
Concurrent write (CW)
    Common
    Arbitrary
    Minimum
    Priority

Based on the different modes described above, the PRAM can be further divided into the following four subclasses:

EREW-PRAM model
CREW-PRAM model
ERCW-PRAM model
CRCW-PRAM model

Analysis of Algorithms

Sequential Algorithms
    Time complexity
    Space complexity

An algorithm whose time complexity is bounded by a polynomial is called a polynomial-time algorithm. An algorithm is considered to be efficient if it runs in polynomial time.

Analysis of Sequential Algorithms

[Figure: the relationships among P, NP, NP-complete, and NP-hard]

Analysis of parallel algorithms

The performance of a parallel algorithm is expressed in terms of how fast it is and how many resources it uses when it runs.

Run time, which is defined as the time elapsed during the execution of the algorithm

Number of processors the algorithm uses to solve a problem

The cost of the parallel algorithm, which is the product of the run time and the number of processors
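The three measures can be illustrated on a small case. The sketch below assumes a pairwise (tree) summation as the parallel algorithm, which is not taken from the slides; it counts parallel steps as the run time, takes half the input size as the number of processors, and reports the cost as their product.

```python
def tree_sum(values):
    """Simulate a pairwise tree reduction over a non-empty list.

    Returns (total, steps, processors), where steps counts parallel
    rounds and processors is the number of additions in the first round.
    """
    data = list(values)
    processors = max(1, len(data) // 2)
    steps = 0
    while len(data) > 1:
        # In one parallel round, each processor adds one adjacent pair.
        paired = [data[i] + data[i + 1] for i in range(0, len(data) - 1, 2)]
        if len(data) % 2 == 1:
            paired.append(data[-1])  # odd element carries over unchanged
        data = paired
        steps += 1
    return data[0], steps, processors


total, run_time, processors = tree_sum(range(1, 9))
cost = run_time * processors
print(f"sum = {total}, run time = {run_time} steps, "
      f"processors = {processors}, cost = {cost}")
```

For eight inputs the simulated run time is 3 steps on 4 processors, a cost of 12, compared with 7 additions for a sequential sum. The parallel version is faster but not cheaper.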

Analysis of parallel algorithms: The NC-class and P-completeness

[Figure: the relationships among P, NP, NP-complete, NP-hard, NC, and P-complete (if P ≠ NP and NC ≠ P)]

Simulating multiple accesses on an EREW PRAM

Broadcasting mechanism:

P1 reads x and makes it known to P2.
P1 and P2 make x known to P3 and P4, respectively, in parallel.
P1, P2, P3 and P4 make x known to P5, P6, P7 and P8, respectively, in parallel.
These eight processors will make x known to another eight processors, and so on.

Simulating multiple accesses on an EREW PRAM (cont.)

[Figure: simulating concurrent read on an EREW PRAM with eight processors using Algorithm Broadcast_EREW. Panels (a) to (d) show x being copied through the shared array L until P1 through P8 each hold it]

Simulating multiple accesses on an EREW PRAM (cont.)

Algorithm Broadcast_EREW

Processor P1
    y (in P1’s private memory) ← x
    L[1] ← y
for i = 0 to log p - 1 do
    forall Pj, where 2^i + 1 ≤ j ≤ 2^(i+1) do in parallel
        y (in Pj’s private memory) ← L[j - 2^i]
        L[j] ← y
    endfor
endfor
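The doubling scheme can be simulated sequentially to check that it reaches every processor in log p rounds. This is a minimal sketch that assumes p is a power of two and reads the index range as 2^i + 1 through 2^(i+1) inclusive; the list `y` stands in for the processors' private copies.

```python
def broadcast_erew(x, p):
    """Simulate Broadcast_EREW for p processors (p assumed a power of two).

    Returns (private_copies, rounds), where private_copies[j - 1] is the
    value held by processor Pj after the broadcast.
    """
    shared = [None] * (p + 1)   # shared array L, 1-indexed; slot 0 unused
    y = [None] * (p + 1)        # y[j] models the private memory of Pj
    y[1] = x                    # P1 reads x ...
    shared[1] = y[1]            # ... and writes it to L[1]
    rounds = p.bit_length() - 1  # log2(p) for a power of two
    for i in range(rounds):
        low, high = 2**i + 1, 2**(i + 1)
        sources = [j - 2**i for j in range(low, high + 1)]
        # Exclusive read: no two processors read the same cell in a round.
        assert len(set(sources)) == len(sources)
        for j in range(low, high + 1):
            y[j] = shared[j - 2**i]   # Pj reads a cell filled earlier
            shared[j] = y[j]          # Pj writes its own, distinct cell
    return y[1:], rounds


copies, rounds = broadcast_erew("x", 8)
print(copies, rounds)
```

Within a round the cells being read (indices up to 2^i) and the cells being written (indices above 2^i) never overlap, which is what lets the sequential loop stand in for the parallel step.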

Bus-based Shared Memory

Collection of wires and connectors

Only one transaction at a time

Bottleneck!! How can we solve the problem?

[Figure: processors (P) sharing a global memory over a single bus]

Single Processor caching

[Figure: a processor (P) with a cache holding a copy of x from memory]

Hit: data is in the cache

Miss: data is not in the cache

Hit rate: h

Miss rate: m = (1-h)

Writing in the cache

[Figure: writing in the cache. Before: the cache and memory both hold x. Write through: the cache and memory both hold x’. Write back: the cache holds x’ while memory still holds x]
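The difference between the two write policies can be shown with a toy model of one cached variable. The class below is a hypothetical sketch, not a description of any real cache; in particular, the `evict` step that flushes a modified line to memory is an assumption added to show when a write-back cache eventually updates memory.

```python
class CachedCell:
    """One variable with a single cache in front of memory (toy model)."""

    def __init__(self, value, policy):
        self.memory = value
        self.cache = value
        self.policy = policy   # "write-through" or "write-back"
        self.dirty = False     # True when the cache is newer than memory

    def write(self, value):
        self.cache = value
        if self.policy == "write-through":
            self.memory = value    # memory is updated on every write
        else:
            self.dirty = True      # memory is left stale for now

    def evict(self):
        """Assumed replacement step: flush a modified line to memory."""
        if self.dirty:
            self.memory = self.cache
            self.dirty = False


wt = CachedCell("x", "write-through")
wb = CachedCell("x", "write-back")
wt.write("x'")
wb.write("x'")
print("write-through:", wt.cache, wt.memory)
print("write-back:   ", wb.cache, wb.memory)
```

After the write, the write-through cell has the new value in both places, while the write-back cell keeps the old value in memory until the line is flushed.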

Using Caches

[Figure: processors P1 … Pn, each with its own cache C1 … Cn, sharing a global memory]

- Cache coherence problem
- How many processors?

Group Activity

Variables:

Number of processors (n)
Hit rate (h)
Bus bandwidth (B)
Processor speed (v)

Condition: $n(1-h)v \le B$

Maximum number of processors: $n = \frac{B}{(1-h)v}$
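The bound can be evaluated directly. The sketch below assumes that the bus bandwidth B and the processor speed v are expressed in the same unit (for example, memory requests per second), so that (1 - h) v is the bus traffic generated by one processor; the sample numbers are illustrative.

```python
import math


def max_processors(bus_bandwidth: float, hit_rate: float, processor_speed: float) -> int:
    """Largest integer n satisfying n * (1 - h) * v <= B."""
    miss_traffic_per_cpu = (1.0 - hit_rate) * processor_speed
    if miss_traffic_per_cpu <= 0:
        raise ValueError("a hit rate of 1 puts no load on the bus; n is unbounded")
    return math.floor(bus_bandwidth / miss_traffic_per_cpu)


# Illustrative values: B = 100, v = 8, in the same (assumed) unit.
for h in (0.5, 0.75, 0.875):
    print(f"h = {h}: at most {max_processors(100.0, h, 8.0)} processors")
```

Raising the hit rate shrinks the per-processor bus traffic, so the same bus can support more processors.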

Cache Coherence

[Figure: caches of P1 … Pn holding multiple copies of x, with x also in memory]

- Multiple copies of x
- What if P1 updates x?

Cache Coherence Policies

Writing to cache in the 1-processor case:
Write Through
Write Back

Writing to cache in the n-processor case:
Write Update - Write Through
Write Invalidate - Write Back

Write-invalidate

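[Figure: write-invalidate. Before: the caches and memory hold x. After P1 writes x’, the other cached copy is marked invalid (I). With write through, memory is updated to x’; with write back, memory still holds x]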

Write-Update

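[Figure: write-update. Before: the caches and memory hold x. After P1 writes x’, the other cached copy is updated to x’. With write through, memory is updated to x’; with write back, memory still holds x]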
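The two coherence policies can be compared with a small simulation of one shared variable. This is a hypothetical sketch: the `write` function, the dictionary of caches, and the `"I"` marker for an invalidated copy are all illustrative choices, and `None` means the processor holds no copy.

```python
def write(caches, memory, writer, value, policy, write_through=True):
    """Apply one write by `writer` and return the new (caches, memory).

    policy is "invalidate" or "update"; caches maps a processor name
    to its cached copy (None means no copy is held).
    """
    new_caches = dict(caches)
    for cpu, copy in caches.items():
        if cpu == writer:
            new_caches[cpu] = value
        elif copy is not None:
            # Other holders either lose their copy or receive the new value.
            new_caches[cpu] = "I" if policy == "invalidate" else value
    new_memory = value if write_through else memory
    return new_caches, new_memory


before = {"P1": "x", "P2": None, "P3": "x"}
for policy in ("invalidate", "update"):
    for through in (True, False):
        caches, memory = write(before, "x", "P1", "x'", policy, through)
        mode = "write through" if through else "write back"
        print(f"{policy:10s} {mode:13s} caches = {caches} memory = {memory}")
```

The four printed cases correspond to the combinations of invalidate or update with write through or write back.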

Synchronization

[Figure: P1, P2, and P3 each execute Lock … unlock around a critical section; while one processor holds the lock, the others wait]

Locks
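The lock and unlock pattern can be tried out with threads standing in for the processors. The sketch below uses Python's standard `threading.Lock`; the shared counter and the function names are illustrative.

```python
import threading

counter = 0
lock = threading.Lock()


def worker(iterations):
    """Increment the shared counter, one locked update at a time."""
    global counter
    for _ in range(iterations):
        with lock:          # lock ... unlock around the critical section
            counter += 1    # other threads wait here until it is released


def run(n_threads=3, iterations=10_000):
    """Start n_threads workers, wait for them, and return the final count."""
    global counter
    counter = 0
    threads = [threading.Thread(target=worker, args=(iterations,))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(run())
```

Because every update happens while the lock is held, the final count equals the number of threads times the number of iterations.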

Superscalar Parallelism

Scheduling
