
Part VII Advanced Architectures


About This Presentation

This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami

Edition: First    Released: July 2003    Revised: July 2004    Revised: July 2005


VII Advanced Architectures

Topics in This Part

Chapter 25 Road to Higher Performance

Chapter 26 Vector and Array Processing

Chapter 27 Shared-Memory Multiprocessing

Chapter 28 Distributed Multicomputing

Performance enhancement beyond what we have seen:
• What else can we do at the instruction execution level?
• Data parallelism: vector and array processing
• Control parallelism: parallel and distributed processing


25 Road to Higher Performance

Review past, current, and future architectural trends:
• General-purpose and special-purpose acceleration
• Introduction to data and control parallelism

Topics in This Chapter

25.1 Past and Current Performance Trends

25.2 Performance-Driven ISA Extensions

25.3 Instruction-Level Parallelism

25.4 Speculation and Value Prediction

25.5 Special-Purpose Hardware Accelerators

25.6 Vector, Array, and Parallel Processing


25.1 Past and Current Performance Trends

Architectural method                            Improvement factor
 1. Pipelining (and superpipelining)            3-8
 2. Cache memory, 2-3 levels                    2-5
 3. RISC and related ideas                      2-3
 4. Multiple instruction issue (superscalar)    2-3
 5. ISA extensions (e.g., for multimedia)       1-3
 6. Multithreading (super-, hyper-)             2-5
 7. Speculation and value prediction            2-3
 8. Hardware acceleration                       2-10
 9. Vector and array processing                 2-10
10. Parallel/distributed computing              2-1000s

Methods 1-5 are established methods, previously discussed; methods 6-10 are newer methods, covered in Part VII.

Available computing power ca. 2000:
  GFLOPS on a desktop
  TFLOPS in a supercomputer center
  PFLOPS on the drawing board

Computer performance grew by a factor of about 10,000 between 1980 and 2000: roughly a factor of 100 due to faster technology and a factor of 100 due to better architecture.


Peak Performance of Supercomputers

[Plot: peak performance, from GFLOPS to PFLOPS, vs. year (1980-2010), rising by a factor of 10 every 5 years; machines shown include Cray 2, Cray X-MP, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and the Earth Simulator.]

Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]


Energy Consumption is Getting out of Hand

Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs.

[Plot: performance, from kIPS to TIPS, vs. calendar year (1980-2010), showing absolute processor performance, general-purpose (GP) processor performance per watt, and DSP performance per watt.]


Energy Reduction Methods

• Clock gating: turn off clock signals that go to unused parts of a chip
• Asynchronous design: do away with clocking and the attendant energy waste
• Voltage reduction: slower, but uses less energy


Examples

• Intel XScale processor
  – Runs at 100 MHz to 1 GHz
  – 30 mW of power
• System on a chip
  – Wearable computers
  – Augmented reality


Examples

• Power = k × voltage²  (see the sketch below)
  – Use two low-voltage processors
  – Chip multiprocessors
• Cell phones: two processors
  – One low-power but computationally weak
  – A digital signal processor
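
As a rough illustration of the voltage-squared point, here is a small Python sketch; it assumes (beyond what the slide states) that dynamic power behaves as k·V²·f and that attainable clock frequency scales roughly linearly with supply voltage:

# Compare one fast core against two slower, low-voltage cores under the
# simplifying assumptions power ~ k * V^2 * f and f ~ V (illustrative only).
k = 1.0                          # arbitrary proportionality constant

def power(v, f):
    return k * v * v * f

v1, f1 = 1.0, 1.0                # one core at nominal voltage and frequency
v2, f2 = 0.6, 0.6                # two cores at 60% voltage, hence ~60% frequency each

print("single core: throughput = %.2f, power = %.2f" % (f1, power(v1, f1)))
print("dual core:   throughput = %.2f, power = %.2f" % (2 * f2, 2 * power(v2, f2)))
# In this simple model the two low-voltage cores deliver 1.2x the throughput
# at roughly 0.43x the power of the single high-voltage core.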


25.2 Performance-Driven ISA Extensions

Adding instructions that do more work per cycle

Shift-add: replace two instructions with one (e.g., multiply by 5)

Multiply-add: replace two instructions with one (x := c + a × b)
Multiply-accumulate: reduce round-off error (s := s + a × b)
Conditional copy: to avoid some branches (e.g., in if-then-else)

Subword parallelism (for multimedia applications)

Intel MMX: multimedia extension; 64-bit registers can hold multiple integer operands
Intel SSE: streaming SIMD extension; 128-bit registers can hold several floating-point operands


Multimedia Extensions

• Workload shifts from data crunching to multimedia

• New instructions

• Compatibility issues and costs lead to compromises “that prevented the full benefits of such extensions from materializing”


New Instructions

• Subword operands
  – 8-bit pixels, 16-bit audio samples
  – MMX added 57 instructions to allow parallel subword arithmetic operations


MMX

• 8 floating-point registers that “double” as the integer register file (cost)

• A 64-bit register can be viewed as holding a single 64-bit integer or a vector of 2, 4, or 8 narrower integer operands
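
The following Python sketch mimics how such a register can be treated as a vector of narrower operands; it is an illustration of the idea, not actual MMX code, and the 8-byte interpretation and wraparound behavior are assumptions made for the example:

# Treat a 64-bit value as a vector of eight unsigned bytes and add two such
# vectors element-wise, inhibiting carries at byte boundaries the way a
# parallel-add instruction would.
def split_bytes(x):
    return [(x >> (8 * i)) & 0xFF for i in range(8)]

def pack_bytes(bs):
    out = 0
    for i, b in enumerate(bs):
        out |= (b & 0xFF) << (8 * i)
    return out

def parallel_add_wrap(a, b):
    return pack_bytes([(x + y) & 0xFF for x, y in zip(split_bytes(a), split_bytes(b))])

a = 0x0102030405060708
b = 0x10FF101010101010                  # the 0xFF lane will wrap around
print(hex(parallel_add_wrap(a, b)))     # 0x1101131415161718 -- carries never cross byte boundaries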


Parallel Pack

• Unpacking (e.g., parallel unpack low on 8-vectors)
  – Merges the lower halves of two vectors: xxxxabcd, xxxxefgh -> aebfcgdh
• Packing does the reverse
• These operations are used to change precision (unpack widens elements, pack converts to narrower elements)


Saturating Arithmetic

• With ordinary (wraparound) arithmetic, overflow can produce a sum that is smaller than either operand, e.g., 0b1100 + 0b1011 = 0b0111

• This could result in a bright spot in a dark region

• Special rules to deal with “infinity”
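
A minimal Python sketch of the difference between wraparound and saturating addition on 8-bit pixel values (the unsigned 8-bit interpretation is an assumption for illustration):

# 8-bit unsigned addition: wraparound vs. saturation.
def add_wrap(a, b):
    return (a + b) & 0xFF            # ordinary modular (wraparound) addition

def add_saturate(a, b):
    return min(a + b, 0xFF)          # clamp at 255 instead of wrapping

bright1, bright2 = 200, 100
print(add_wrap(bright1, bright2))      # 44  -> a dark pixel where a bright one was expected
print(add_saturate(bright1, bright2))  # 255 -> stays maximally bright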


Intel MMX ISA Extension (Table 25.1)

Class       Instruction                   Vector         Op type          Function or results
Copy        Register copy                 32 bits                         Integer register <-> MMX register
            Parallel pack                 4, 2           Saturate         Convert to narrower elements
            Parallel unpack low           8, 4, 2                         Merge lower halves of 2 vectors
            Parallel unpack high          8, 4, 2                         Merge upper halves of 2 vectors
Arithmetic  Parallel add                  8, 4, 2        Wrap/Saturate#   Add; inhibit carry at boundaries
            Parallel subtract             8, 4, 2        Wrap/Saturate#   Subtract with carry inhibition
            Parallel multiply low         4                               Multiply, keep the 4 low halves
            Parallel multiply high        4                               Multiply, keep the 4 high halves
            Parallel multiply-add         4                               Multiply, add adjacent products*
            Parallel compare equal        8, 4, 2                         All 1s where equal, else all 0s
            Parallel compare greater      8, 4, 2                         All 1s where greater, else all 0s
Shift       Parallel left shift logical   4, 2, 1                         Shift left, respect boundaries
            Parallel right shift logical  4, 2, 1                         Shift right, respect boundaries
            Parallel right shift arith    4, 2                            Arithmetic shift within each (half)word
Logic       Parallel AND                  1                               Bitwise AND of the source operands
            Parallel ANDNOT               1                               Bitwise AND with one source complemented
            Parallel OR                   1                               Bitwise OR of the source operands
            Parallel XOR                  1                               Bitwise XOR of the source operands
Memory      Parallel load MMX reg         32 or 64 bits                   Address given in integer register
access      Parallel store MMX reg        32 or 64 bits                   Address given in integer register
Control     Empty FP tag bits                                             Required for compatibility$


MMX Multiplication and Multiply-Add

Figure 25.2 Parallel multiplication and multiply-add in MMX.

[Diagram: (a) parallel multiply low -- corresponding elements of two 4-element vectors are multiplied and the low halves of the products (s, t, u, v) are kept; (b) parallel multiply-add -- the four products are added in adjacent pairs (s + t, u + v).]


MMX Parallel Comparisons

Figure 25.3 Parallel comparisons in MMX.

[Diagram: (a) parallel compare equal and (b) parallel compare greater on small example vectors; positions that satisfy the comparison yield all-1s fields (e.g., 255 for bytes, 65,535 for halfwords), the rest yield all 0s.]


25.3 ILP Under Ideal Conditions!

Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91].

[Plots: (a) fraction of cycles (0-30%) vs. number of issuable instructions per cycle (0-8); (b) speedup attained (1-3) vs. instruction issue width (0-8).]


ILP Inefficiencies

• Diminishing returns due to
  – Slowing down of instruction issue caused by extreme circuit complexity
  – Wasting resources on instructions that will not be executed due to branching


Instruction-Level Parallelism

Figure 25.5 A computation with inherent instruction-level parallelism.


VLIW and EPIC Architectures

Figure 25.6 Hardware organization for IA-64. General and floating-point registers are 64-bit wide. Predicates are single-bit registers.

VLIW: very long instruction word architecture
EPIC: explicitly parallel instruction computing

[Block diagram: memory, 128 general registers, 128 floating-point registers, 64 single-bit predicate registers, and multiple execution units.]


A Better Approach?

• Allow the programmer or compiler to specify which operations are capable of being executed at the same time

• More reasonable than hiding the concurrency in sequential code and then requiring the hardware to rediscover it


Itanium

• Long instructions "bundle" three conventional RISC-type instructions ("syllables") of 41 bits each, plus a 5-bit "template" holding scheduling information

• The template specifies the instruction types and indicates concurrency


41-bit Instruction

• 4-bit major opcode
• 6-bit predicate specification
• 31 bits for operands and modifiers
  – 3 registers can be specified (7 bits each, 21 bits total)
• The result is committed only if the specified predicate is 1
  – Predicated instruction execution
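
A small Python sanity check of the field widths just described; the field names and layout order are illustrative assumptions, not Intel's actual encoding:

# 41-bit instruction = 4-bit major opcode + 31 bits of operands/modifiers
# + 6-bit predicate register number.
MAJOR_OPCODE_BITS = 4
OPERAND_BITS      = 31
PREDICATE_BITS    = 6
assert MAJOR_OPCODE_BITS + OPERAND_BITS + PREDICATE_BITS == 41

REG_FIELD_BITS = 7                 # 2^7 = 128 general registers
assert 3 * REG_FIELD_BITS <= OPERAND_BITS   # three 7-bit register specifiers (21 bits) fit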


Additional Features

• Control and data speculation

• Software pipelining for loops

• Threading
  – Interleaved multithreading
  – Blocked multithreading
  – Simultaneous multithreading


Interleaved Multithreading

• Instruction issue logic picks instructions from the different threads in a round-robin fashion
  – Spaces out the instructions from the same thread, avoiding bubbles due to data or control dependencies


Blocked Multithreading

• Deals with only one thread until it hits a snag, e.g., a cache miss

• Fills the long stalls of one thread with useful instructions from another thread


Simultaneous Multithreading

• Most flexible strategy

• Allows mixing and matching of instructions from different threads to make the best use of available resources


Speculation and Value Prediction

• Itanium control speculation
  – One load becomes two
    • A speculative load (moved earlier)
    • A checking instruction
  – We may execute a load instruction that would not have been encountered due to a branch or a false predicate
  – The speculative load keeps a record of any exception


Data Speculation

• Allows a load instruction to be moved ahead of a store

• When the speculative load is performed, the address is recorded and checked against subsequent addresses for write operations


Other Forms of Speculation

• Issuing instructions from both paths of a branch instruction

• Add microarchitectural registers for temporary values

• Value prediction
  – Store the operands and results of previous divide instructions in a memo table


Continued

• Instruction reuse

• Superspeculation
  – Predict the outcome of a sequence of instructions
  – Take patterns into account

• 50 – 98%

• Predict the next address in a load (r2)


25.4 Speculation and Value Prediction

Figure 25.7 Examples of software speculation in IA-64.

(a) Control speculation
    Original:     ----   ----   ----   ----   load        ----   ----
    Speculative:  spec load   ----   ----   ----   ----   check load   ----   ----

(b) Data speculation
    Original:     ----   ----   store   ----   load        ----   ----
    Speculative:  spec load   ----   ----   store   ----   check load   ----   ----


Value Prediction

Figure 25.8 Value prediction for multiplication or division via a memo table.

[Block diagram: the inputs go both to a memo table and to the mult/div unit; a mux selects the memo table's output on a hit, or the unit's output (when done) on a miss; control uses the inputs-ready, miss, done, and output-ready signals.]
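
A software analogue of the memo-table idea in Figure 25.8, written as a Python sketch (the hardware, of course, uses a lookup table and a mux rather than a dictionary):

# Cache the results of an expensive operation keyed by its operands, and reuse
# the cached value when the same operands show up again.
memo_table = {}

def predicted_divide(a, b):
    key = (a, b)
    if key in memo_table:            # "hit": result available without doing the division
        return memo_table[key]
    result = a / b                   # "miss": perform the real (slow) operation
    memo_table[key] = result
    return result

print(predicted_divide(355, 113))    # computed
print(predicted_divide(355, 113))    # served from the memo table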


25.5 Special-Purpose Hardware Accelerators

Figure 25.9 General structure of a processor with configurable hardware accelerators.

[Block diagram: a CPU with data and program memory plus a configuration memory; an FPGA-like unit on which accelerators (Accel. 1-3) can be formed via loading of configuration registers, with the remaining area left as unused resources.]


Graphic Processors, Network Processors, etc.

Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems’ network processor.

[Block diagram: an input buffer feeding a 4 x 4 array of processing elements (PE 0-15) organized in four columns, each column with its own column memory; an output buffer and a feedback path complete the structure.]


Network Processor

• Arriving traffic overwhelms a processor

• Network processors have specialized instruction sets and other capabilities
  – Instructions for bit-mapping, tree search, and cyclic redundancy check calculations
  – Instruction-level parallelism, micro-engines


25.6 Vector, Array, and Parallel Processing

Figure 25.11 The Flynn-Johnson classification of computer systems.

Flynn's categories:
                           Single data stream              Multiple data streams
Single instr stream        SISD: uniprocessors             SIMD: array or vector processors
Multiple instr streams     MISD: rarely used               MIMD: multiprocessors or multicomputers

Johnson's expansion of MIMD:
                           Shared variables                           Message passing
Global memory              GMSV: shared-memory multiprocessors        GMMP: rarely used
Distributed memory         DMSV: distributed shared memory            DMMP: distributed-memory multicomputers


SIMD Architectures

Data parallelism: executing one operation on multiple data streams

Concurrency in time – vector processing
Concurrency in space – array processing

Example to provide context:
Multiplying a coefficient vector by a data vector (e.g., in filtering): y[i] := c[i] × x[i], 0 ≤ i < n

Sources of performance improvement in vector processing (details in the first half of Chapter 26):
• One instruction is fetched and decoded for the entire operation
• The multiplications are known to be independent (no checking)
• Pipelining/concurrency in memory access as well as in arithmetic

Array processing is similar (details in the second half of Chapter 26)


MISD Architecture Example

Figure 25.12 Multiple instruction streams operating on a single data stream (MISD).

[Diagram: a single data stream enters, passes through a cascade of processing stages driven by instruction streams 1-5, and emerges as the data output.]


MIMD Architectures

Control parallelism: executing several instruction streams in parallel

GMSV: shared global memory – symmetric multiprocessors
DMSV: shared distributed memory – asymmetric multiprocessors
DMMP: message passing – multicomputers

Figure 27.1 Centralized shared memory. Figure 28.1 Distributed memory.

[Diagrams: Figure 27.1 shows p processors connected through a processor-to-memory network to m memory modules, with a processor-to-processor network and parallel I/O; Figure 28.1 shows p nodes, each pairing a memory with a processor, attached via routers to an interconnection network with parallel input/output.]


Amdahl’s Law Revisited

[Plot: speedup s (0-50) vs. enhancement factor p (0-50), for f = 0, 0.01, 0.02, 0.05, and 0.1.]

Figure 4.4 Amdahl’s law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.

s = 1 / (f + (1 – f)/p) ≤ min(p, 1/f)

f = fraction of the task that is unaffected (sequential fraction)
p = speedup of the remaining (1 – f) part, e.g., obtained with p processors
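
A quick numeric check of Amdahl's law as stated above, in Python:

# s = 1 / (f + (1 - f) / p), bounded by min(p, 1/f).
def amdahl_speedup(f, p):
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.01, 0.02, 0.05, 0.1):
    for p in (10, 50):
        print(f"f={f:.2f}  p={p:3d}  speedup={amdahl_speedup(f, p):6.2f}  bound={min(p, 1/f):6.1f}")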


26 Vector and Array Processing

Single instruction stream operating on multiple data streams:
• Data parallelism in time = vector processing
• Data parallelism in space = array processing

Topics in This Chapter

26.1 Operations on Vectors

26.2 Vector Processor Implementation

26.3 Vector Processor Performance

26.4 Shared-Control Systems

26.5 Array Processor Implementation

26.6 Array Processor Performance


26.1 Operations on Vectors

Sequential processor:

for i = 0 to 63 do
    P[i] := W[i] × D[i]
endfor

Vector processor:

load W
load D
P := W × D
store P

The following loop, by contrast, is unparallelizable because of its loop-carried dependences:

for i = 0 to 63 do
    X[i + 1] := X[i] + Z[i]
    Y[i + 1] := X[i + 1] + Y[i]
endfor
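
The Python sketch below illustrates the contrast: the first loop's iterations are independent, while the second carries a dependence from one iteration to the next (the data values are arbitrary and serve only the illustration):

import random

n = 64
W = [random.random() for _ in range(n)]
D = [random.random() for _ in range(n)]

# First loop: every P[i] depends only on W[i] and D[i], so the 64 multiplications
# are independent and can be done in any order (or all at once on vector hardware).
P = [W[i] * D[i] for i in range(n)]

# Second loop: X[i+1] depends on X[i] computed in the previous iteration, so
# iteration i+1 cannot start before iteration i finishes (a loop-carried dependence).
Z = [random.random() for _ in range(n)]
X = [0.0] * (n + 1)
Y = [0.0] * (n + 1)
for i in range(n):
    X[i + 1] = X[i] + Z[i]
    Y[i + 1] = X[i + 1] + Y[i]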


First Example

• Overlap load and store

• Pipeline chaining
  – http://www.pcc.qub.ac.uk/tec/courses/cray/stu-notes/CRAY-completeMIF_3.html


Second Example

for (i = 0; i < 64; i++) {
    A[i] = A[i] + B[i];
    B[i+1] = C[i] + D[i];
}

• Data dependency

• Exercise


Useful Vector Operations

• A = c * B

• A = B * B

• s = summation(A)


Three Key Features

• Deeply pipelined units for high throughput

• Plenty of long vector registers
• High-bandwidth data transfer paths between memory and vector registers
• Helpful: pipeline chaining
  – Forwarding result values from one unit to another


26.2 Vector Processor Implementation

Figure 26.1 Simplified generic structure of a vector processor.

[Block diagram: a vector register file connected to the memory unit through two load units and a store unit; three pipelined function units take operands from the vector register file and scalar registers, with forwarding muxes between them.]


The Memory Bandwidth Problem

• The most challenging aspect of vector processor design (textbook p. 495)
  – No cache
  – Memory is highly interleaved (Section 17.4)
• 64-256 memory banks are not unusual
• For peak performance, a memory bank must not be accessed again until all other memory banks have been accessed


Continued

• No problem for sequential, stride-1, access

• Any stride that is relatively prime with the interleaving width is no problem

• Unfortunately, …

• E.g., a 64 × 64 matrix stored in row-major order and accessed by columns (with 64 banks, the stride of 64 hits the same bank on every access)


Continued

• Skewed storage scheme (see the sketch below)
  – Elements of row i are rotated to the right by i positions
  – The effective stride for column access becomes 65
  – Adds a small overhead due to a more complex address translation
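
A small Python sketch of the two bank mappings involved, assuming 64 banks and a 64 × 64 matrix as in Figure 26.2:

BANKS, N = 64, 64

def bank_conventional(i, j):   # row-major, no skew: element (i, j) lands in bank j
    return (N * i + j) % BANKS

def bank_skewed(i, j):         # row i rotated right by i positions
    return (i + j) % BANKS

column = 0
conv = {bank_conventional(i, column) for i in range(N)}
skew = {bank_skewed(i, column) for i in range(N)}
print(len(conv), "distinct bank(s) touched without skewing")  # 1  -> every access conflicts
print(len(skew), "distinct bank(s) touched with skewing")     # 64 -> conflict-free column access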


Conflict-Free Memory Access

Figure 26.2 Skewed storage of the elements of a 64 × 64 matrix for conflict-free memory access in a 64-way interleaved memory. Elements of column 0 are highlighted in both diagrams.

[Diagrams: (a) conventional row-major order places element (i, j) in bank j, so all of column 0 falls into a single bank; (b) skewed row-major order rotates row i right by i positions, spreading column 0 across all 64 banks.]


Vector machine instructions

• Similar in spirit to MMX instructions
  – MMX vectors are short and fixed in length
  – Vector machine vectors are long and variable in length (set by a parameter)
• Typical instructions (non-arithmetic)
  – Vector load and store with an implicit or explicitly specified stride


Continued

• Vector load and store with memory addresses specified in an index vector

• Vector comparisons on an element by element basis

• Data copying among registers, including vector-length and mask registers
  – Mask registers allow selectivity


Overlapped Memory Access and Computation

Figure 26.3 Vector processing via segmented load/store of vectors in registers in a double-buffering scheme. Solid (dashed) lines show data flow in the current (next) segment.

[Diagram: six vector registers used in a double-buffering arrangement; Load X and Load Y fill one register pair from the memory unit while the pipelined adder works on the other pair, and Store Z writes completed results back.]


Vector Processor Performance

• Ideal performance is only theoretical

• Start-up time
  – Becomes negligible only when vectors are very long
  – A few to many tens of cycles
• Half-performance vector length
  – The vector length for which performance is half that of infinite vectors
  – Typically tens to a hundred elements


Continued

• Resource idle time and contention

• Example: S = V * W + X * Y
  – With one multiply unit, the adder will go unused
• Memory bank contention
  – Severe when vector elements are accessed in irregular positions (index vectors)


• Vector segmentation
  – Needed when the entire vector does not fit in a vector register
  – E.g., a vector of length 65!
• Conditional computations
  – A mask determines which elements are selected


Conditional Computation

• Example: if (A(i)) … else …
  – Say the mask is 110101, i.e., true for elements 0, 1, 3, 5
  – Two stages (see the sketch below)
    • First stage: elements 0, 1, 3, 5 execute the "if" part
    • Second stage: elements 2 and 4 execute the "else" part
  – Computation time roughly doubles
• Switch statement?
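
A Python sketch of masked execution for this example; the specific arrays and the +1/-1 operations are made up purely to show how the mask steers which results are kept:

A    = [5, 7, 0, 3, 0, 9]
mask = [bool(a) for a in A]            # 110101 -> true for elements 0, 1, 3, 5

B = [10, 20, 30, 40, 50, 60]
# Stage 1: "if" part for masked-in elements; Stage 2: "else" part for the rest.
result = [b + 1 if m else b - 1 for b, m in zip(B, mask)]
print(result)   # elements 0, 1, 3, 5 took the if path; 2 and 4 took the else path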


Multilane Vector Processors

• Example: 4 lanes
  – Lane 0 holds elements 0, 4, 8, …
  – Four pipelined adders operate on the lane registers in parallel
• Corresponds to multiple lanes on a highway


26.3 Vector Processor Performance

Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining.

[Timing diagram: without chaining, the addition (and its start-up) begins only after the multiplication completes; with pipeline chaining, the addition starts as soon as the first products emerge, overlapping the two pipelines along the time axis.]


Performance as a Function of Vector Length

Figure 26.5 The per-element execution time in a vector processor as a function of the vector length.

[Plot: clock cycles per vector element (0-5) vs. vector length (0-400).]


26.4 Shared-Control Systems

Figure 26.6 From completely shared control to totally separate controls.

[Diagram panels: (a) shared-control array processor, SIMD; (b) multiple shared controls, MSIMD; (c) separate controls, MIMD.]


Simple Shared-Control

• SIMD

• Array processors
  – Centralized control
  – Memory and processors are distributed in space


Example Array Processor

Figure 26.7 Array processor with 2D torus interprocessor communication network.

[Diagram: a control unit broadcasting to a 2D processor array with torus links; switches along the array edges connect to parallel I/O.]


Memory Bandwidth

• Ceases to be a major bottleneck

• A vector of length m can be distributed to the local memories of p processors with each one receiving m/p elements

• 64-256 very fast processors

• 64K processors


Downside

• Vector operations that require interprocessor communication
  – Summation
  – X[i] * Y[m-1-i]
    • Processor i needs information from processor m-1-i


Pattern of Connectivity

• Linear array (ring)
  – Each processor is connected to two neighbors
• Highly sophisticated networks
• Square or rectangular 2D mesh
  – A torus results from connecting the endpoints of rows and columns


Array Processor Implementation

• Assume 2D torus interconnection

• Switches allow data to flow around the row loops in the clockwise or counterclockwise direction
• The control unit resembles a scalar processor
  – Fetch, decode, broadcast control signals


Continued

• Register file

• ALU

• Data memory

• New: branching based on the state of the processor array
  – A single bit of information from each processor


Continued

• Array state register
  – Population count instruction
• Processors within the array
  – Bare-bones: no fetch, no decode
  – Called processing elements (PEs)
  – Register file, ALU, memory


Continued

• Some local autonomy
  – Say we load using r2 as a pointer
  – Different processors may have different values of r2


26.5 Array Processor Implementation

Figure 26.8 Handling of interprocessor communication via a mechanism similar to data forwarding.

[Diagram: each PE contains an ALU, register file, data memory, and a PE state flip-flop that feeds the array state register; a communication buffer with direction (CommunDir) and enable (CommunEn) controls connects the PE to its N, E, W, S neighbors and back to the register file and data memory.]


Configuration Switches

Figure 26.9 I/O switch states in the array processor of Figure 26.7.

[Diagrams: I/O switch settings for (a) torus operation, (b) clockwise I/O, and (c) counterclockwise I/O in the array processor of Figure 26.7.]


Array Processor Performance

• Signal travel time

• Embarrassingly parallel applications
  – Signal and image processing
  – Database management and data mining
  – Data fusion in command and control
• Data distribution
  – Which processor has what memory?


26.6 Array Processor Performance

Array processors perform well for the same class of problems that are suitable for vector processors.

For embarrassingly (pleasantly) parallel problems, array processors can be faster and more energy-efficient than vector processors.

A criticism of array processing: for conditional computations, a significant part of the array remains idle while the "then" part is performed; subsequently, idle and busy processors reverse roles during the "else" part.

However: considering array processors inefficient due to idle processors is like criticizing mass transportation because many seats are unoccupied most of the time.

It’s the total cost of computation that counts, not hardware utilization!


27 Shared-Memory Multiprocessing

Multiple processors sharing a memory unit seems naïve:
• Didn't we conclude that memory is the bottleneck?
• How then does it make sense to share the memory?

Topics in This Chapter

27.1 Centralized Shared Memory

27.2 Multiple Caches and Cache Coherence

27.3 Implementing Symmetric Multiprocessors

27.4 Distributed Shared Memory

27.5 Directories to Guide Data Access

27.6 Implementing Asymmetric Multiprocessors


Robert Tomasulo

• Looking forward, 1998

• Technology advances in both hardware and software have rendered most Instruction Set Architecture conflicts moot. … Computer design is focusing more on the memory bandwidth and multiprocessing interconnection problems.


Parallel Processing as a Topic of Study

Graduate course ECE 254B: Adv. Computer Architecture – Parallel Processing

An important area of study that allows us to overcome fundamental speed limits

Our treatment of the topic is quite brief (Chapters 26-27)


Memory

• It is the speed of memory, not the processor, that limits performance.

• Why does it make sense to expect a single memory unit to keep up with several processors?

• Provide a private cache for each processor. Cache consistency!


27.1 Centralized Shared Memory

Figure 27.1 Structure of a multiprocessor with centralized shared-memory.

[Diagram: p processors connected through a processor-to-memory network to m memory modules, with a separate processor-to-processor network and parallel I/O.]


Two Extremes

• A single shared bus

• A permutation network that allows any processor to read or write any memory module.

• Most networks fall between these extremes.


Processor-to-memory connection

• Most critical component
• Must support conflict-free, high-speed access to memory by several processors at once
• Bandwidth is more important than latency, because multithreading and other latency-tolerance methods allow processors to do useful work while waiting for data


Multistage Interconnection Network

• Multiple paths

• Large switches arranged in layers

• Cost/performance trade-offs

• One of the oldest and most versatile is the butterfly network

• Self-routing network


Butterfly Network

• 2^q rows and q+1 columns

• Each switch in row r has a straight connection to the switch in the same row of the next column and a cross connection to row r ± 2^c, where c is the column number
• A unique path between source i and destination j can be determined from i and j (in binary)


Example

• Say we want to go from row 2 to row 4, 010 to 100.

• XOR the two numbers getting 110.

• Read that routing tag backwards.

• This is not a permutation network.
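
A Python sketch of this self-routing rule, using the cross-connection convention from the previous slide (at column c, the cross link goes to row r ± 2^c); the q = 3 example reproduces the 010 -> 100 route:

def butterfly_route(src, dst, q):
    # Routing tag is src XOR dst; at column c the switch takes its cross link
    # exactly when bit c of the tag is 1.  Consuming the tag from its least
    # significant bit matches the slide's "read the tag backwards".
    tag = src ^ dst
    row = src
    hops = [row]
    for c in range(q):
        if (tag >> c) & 1:
            row ^= 1 << c              # cross connection: flip bit c of the row number
        hops.append(row)
    return hops

print(butterfly_route(0b010, 0b100, q=3))   # [2, 2, 0, 4] -> arrives at row 4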


Processor-to-Memory Interconnection Network

Figure 27.2 Butterfly and the related Beneš network as examples of processor-to-memory interconnection network in a multiprocessor.

[Diagrams: (a) a butterfly network with 8 rows of switches connecting processors 0-15 to memories 0-15; (b) the related Beneš network.]


Processor-to-Memory Interconnection Network

Figure 27.3 Interconnection of eight processors to 256 memory banks in Cray Y-MP, a supercomputer with multiple vector processors.

[Diagram: 8 processors connected through 1 × 8 switches and 8 × 8 and 4 × 4 crossbar stages (sections and subsections) to 256 interleaved memory banks, e.g., banks 0, 4, 8, ..., 60 in one subsection and banks 224, 228, ..., 252 in another.]


Shared-Memory Programming: Broadcasting

Copy B[0] into all B[i] so that multiple processors can read its value without memory access conflicts:

for k = 0 to ⌈log2 p⌉ – 1
    processor j, 0 ≤ j < p, do
        B[j + 2^k] := B[j]
endfor

[Diagram: a 12-element array B filled by recursive doubling.]
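
A Python simulation of the recursive-doubling broadcast above; each loop iteration models one synchronous parallel step, with all processors reading before any of them writes:

import math

def broadcast(B, p):
    for k in range(math.ceil(math.log2(p))):
        old = B[:]                       # lockstep semantics: reads see pre-step values
        for j in range(p):               # "processor j, 0 <= j < p"
            if j + 2 ** k < p:           # out-of-range copies simply do nothing
                B[j + 2 ** k] = old[j]
    return B

B = [42] + [0] * 11
print(broadcast(B, 12))                  # all twelve entries become 42 after ceil(log2(12)) = 4 steps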


Shared-Memory Programming: Summation

Sum reduction of vector X:

processor j, 0 ≤ j < p, do Z[j] := X[j]
s := 1
while s < p
    processor j, 0 ≤ j < p – s, do
        Z[j + s] := Z[j] + Z[j + s]
    s := 2 × s
endwhile

Trace for p = 10, computed by recursive doubling (each entry i:j denotes the partial sum X[i] + … + X[j]):

Index:  0    1    2    3    4    5    6    7    8    9
Init:   0:0  1:1  2:2  3:3  4:4  5:5  6:6  7:7  8:8  9:9
s = 1:  0:0  0:1  1:2  2:3  3:4  4:5  5:6  6:7  7:8  8:9
s = 2:  0:0  0:1  0:2  0:3  1:4  2:5  3:6  4:7  5:8  6:9
s = 4:  0:0  0:1  0:2  0:3  0:4  0:5  0:6  0:7  1:8  2:9
s = 8:  0:0  0:1  0:2  0:3  0:4  0:5  0:6  0:7  0:8  0:9
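
The same algorithm in Python, again simulating the synchronous parallel steps sequentially; note that it produces all prefix sums, with Z[p-1] holding the total:

def parallel_prefix_sums(X):
    p = len(X)
    Z = X[:]                              # processor j does Z[j] := X[j]
    s = 1
    while s < p:
        old = Z[:]                        # lockstep: read old values, then write
        for j in range(p - s):            # "processor j, 0 <= j < p - s"
            Z[j + s] = old[j] + old[j + s]
        s *= 2
    return Z

X = list(range(10))                       # X[i] = i
print(parallel_prefix_sums(X))            # [0, 1, 3, 6, 10, 15, 21, 28, 36, 45]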


Drawbacks

• Hot-spot contention

• Shared data that are continually being accessed by many processors

• Interconnection becomes bottleneck


Solution?

• Reduce the frequency of memory accesses to main memory by equipping each processor with a private cache

• A hit rate of 95% reduces traffic through the processor-to-memory network to 5%


27.2 Multiple Caches and Cache Coherence

Private processor caches reduce memory access traffic through the interconnection network but lead to challenging consistency problems.

[Diagram: as in Figure 27.1, but with a private cache between each processor and the processor-to-memory network.]


Status of Data Copies

Figure 27.4 Various types of cached data blocks in a parallel processor with centralized main memory and private processor caches.

[Diagram: main memory holds blocks w, x, y, z; a cached copy of a block can be in one of four situations: multiple consistent, single consistent, single inconsistent, or invalid.]


Cache Coherence Algorithms

• Designate each cache line as exclusive or shared

• A processor may only modify an exclusive cache line

• To modify a shared line, change its status to exclusive which invalidates all copies in other caches.


Continued

• This involves broadcasting the intention to modify the line

• This is called a write-invalidate algorithm


Snoopy Protocol

• Particular implementation of write-invalidate

• Requires a shared bus

• Snoop control lines monitor the bus

• Each cache line is shared, exclusive, or invalid


Continued

• Transitions
  – CPU read hits are no problem
  – CPU write hits
    • Exclusive state? No change
    • Shared state? Change the state to exclusive; a write miss is signaled to other caches, causing them to invalidate their copies


Continued

• CPU write miss
  – Exclusive state? Write the line back to memory and change its state to invalid

• A request for access to an exclusive cache line forces the owner to make sure that stale data from main memory is not used.


A Snoopy Cache Coherence Protocol

Figure 27.5 Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches.

States: Invalid, Shared (read-only), Exclusive (writable)

Transitions:
  Invalid   -> Shared     on CPU read miss (signal read miss on bus)
  Invalid   -> Exclusive  on CPU write miss (signal write miss on bus)
  Shared    -> Exclusive  on CPU write hit (signal write miss on bus)
  Shared    -> Invalid    on bus write miss
  Shared    -> Shared     on CPU read hit
  Exclusive -> Shared     on bus read miss (write back cache line)
  Exclusive -> Invalid    on bus write miss (write back cache line)
  Exclusive -> Exclusive  on CPU read or write hit

[Diagram also shows several processor/cache pairs attached to a shared bus and memory.]
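
A minimal Python sketch of the per-line state machine listed above (the event names are informal labels for the transition triggers, not part of any real protocol specification):

TRANSITIONS = {
    ("Invalid",   "cpu_read_miss"):  ("Shared",    "signal read miss on bus"),
    ("Invalid",   "cpu_write_miss"): ("Exclusive", "signal write miss on bus"),
    ("Shared",    "cpu_write_hit"):  ("Exclusive", "signal write miss on bus"),
    ("Shared",    "bus_write_miss"): ("Invalid",   None),
    ("Shared",    "cpu_read_hit"):   ("Shared",    None),
    ("Exclusive", "bus_read_miss"):  ("Shared",    "write back cache line"),
    ("Exclusive", "bus_write_miss"): ("Invalid",   "write back cache line"),
    ("Exclusive", "cpu_read_hit"):   ("Exclusive", None),
    ("Exclusive", "cpu_write_hit"):  ("Exclusive", None),
}

def step(state, event):
    new_state, action = TRANSITIONS[(state, event)]
    return new_state, action

state = "Invalid"
for ev in ("cpu_read_miss", "cpu_write_hit", "bus_read_miss"):
    state, action = step(state, ev)
    print(ev, "->", state, ("(" + action + ")") if action else "")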


27.3 Implementing Symmetric Multiprocessors

Figure 27.6 Structure of a generic bus-based symmetric multiprocessor.

[Diagram: computing nodes (typically 1-4 CPUs and caches per node) and interleaved memory modules on a very wide, high-bandwidth bus; bus adapters with standard interfaces connect I/O modules.]


Bus Bandwidth Limits Performance (Example 27.1)

Consider a shared-memory multiprocessor built around a single bus with a data bandwidth of x GB/s. Instructions and data words are 4 B wide, and each instruction requires access to an average of 1.4 memory words (including the instruction itself). The combined hit rate for caches is 98%. Compute an upper bound on the multiprocessor performance in GIPS. Address lines are separate and do not affect the bus data bandwidth.

Solution

Executing an instruction implies a bus transfer of 1.4 × 0.02 × 4 = 0.112 B. Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS. Assuming a bus width of 32 B, no bus cycle or data going to waste, and a bus clock rate of y GHz, the performance bound becomes 286y GIPS. This bound is highly optimistic: buses operate in the range 0.1 to 1 GHz, so a performance level approaching 1 TIPS (perhaps even 1/4 TIPS) is beyond reach with this type of architecture.
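
A quick Python check of the arithmetic in Example 27.1:

words_per_instr = 1.4
miss_rate       = 1 - 0.98
bytes_per_word  = 4

bus_bytes_per_instr = words_per_instr * miss_rate * bytes_per_word
print(bus_bytes_per_instr)               # ~0.112 B of bus traffic per instruction

# Upper bound in GIPS per GB/s of bus bandwidth:  1 / 0.112
print(1.0 / bus_bytes_per_instr)         # ~8.93

# With a 32-B-wide bus clocked at y GHz (32y GB/s of bandwidth):
print(32 / bus_bytes_per_instr)          # ~286 GIPS per GHz of bus clock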


Split-transaction Protocol

• The bus is released between the transmission of memory address and the return of data.

• There may be several outstanding memory requests in progress at the same time.


Snoop-based Coherence

• Assume one level of cache for each processor

• Atomic bus
  – Total order


Implementing Snoopy Caches

Figure 27.7 Main structure for a snoop-based cache coherence algorithm.

[Diagram: the cache data array has a main tags-and-state store on the processor side and a duplicate tags-and-state store on the snoop side; processor-side and snoop-side cache controllers perform tag comparisons (=?) and exchange address/command buffers between the CPU and the system bus.]


Example

• Two processors issue requests for writing into a shared cache line
  – Change its state from shared to exclusive

• The bus arbiter will allow one to go first

• The other processor must then invalidate the line and issue a different request (write miss)


Continued

• Include intermediate states in the state diagram

• Shared/exclusive intermediate state

• Outcome of the request takes you either to exclusive or invalid


Two Useful Mechanisms

• Locking of shared resources
  – Requires atomic instructions
  – Otherwise two processors may both read 0 and then both write 1
• Barrier synchronization
  – Counter (stop until it gets to p)
    • Obvious contention
  – Logical AND


Distributed Shared Memory

• Scalability problem!
• Does sharing of a global address space require that the memory be centralized?
• No: NUMA (nonuniform memory access)
• A reference to a non-local address initiates a remote access
• DMA controllers


27.4 Distributed Shared Memory

Figure 27.8 Structure of a distributed shared-memory multiprocessor.

[Diagram: p processor-plus-memory nodes connected via routers to an interconnection network with parallel input/output; one node holds z: 0, another holds x: 0 and y: 1; one processor executes "y := -1; z := 1" while another executes "while z = 0 do x := x + y endwhile".]


Sequential Consistency

• The example in the previous slide is nondeterministic.

• Everything depends on the order of execution.

• The only expectation is that y = -1 will complete before z = 1

• Any processor that sees the new value for z will see the new value for y


Continued

• Ensuring sequential consistency is not easy and comes with a performance penalty.

• It is insufficient to ensure that y changes before z changes
  – A processor with different latencies to y and z may see these events in the wrong order
  – Astronomer example


Directories

• Problem: dynamic data distribution

• Management overhead
  – Keeping track of the whereabouts of various data elements so as to be able to find them when needed


Assumptions

• Each data line has a statically assigned home location within the memory modules but may have multiple cached copies at various locations

• Associated with each data element is a directory entry that specifies which caches currently hold a copy of the data (the sharing set)


Continued

• If the number of processors is small, a bit vector is quite efficient (see the sketch below)
  – 32 processors <-> one 32-bit word
• Alternatively, provide a listing of the caches that hold a copy
  – Efficient when sharing sets are small

• Neither method is scalable
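
A Python sketch of the bit-vector representation, assuming 32 processors; the helper names are illustrative:

P = 32

def add_sharer(sharing_set, cache_id):
    # Set bit cache_id: cache cache_id now holds a copy of the line.
    return sharing_set | (1 << cache_id)

def invalidate_targets(sharing_set):
    # Return the list of caches that must receive an invalidation message.
    return [i for i in range(P) if (sharing_set >> i) & 1]

ss = 0
for c in (3, 7, 21):
    ss = add_sharer(ss, c)
print(bin(ss))                   # 0b1000000000000010001000
print(invalidate_targets(ss))    # [3, 7, 21]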


A Scalable Method

• Scalable Coherent Interface (SCI) standard

• The directory entry specifies one of the caches holding a copy, with each copy containing a pointer to the next cache copy.

• Distributed directory scheme


27.5 Directories to Guide Data Access

Figure 27.9 Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor.

[Diagram: p nodes, each with a processor and cache, a memory module, a directory, and communication and memory interfaces, connected to an interconnection network with parallel input/output.]


Triggering Events

• Messages sent by cache memories to the directory holding the requested line
• Cache c indicates a read miss
  – Shared: data sent to c, c added to the sharing set (SS)
  – Exclusive: fetch request to the owner, data to c, c added to SS
  – Uncached: data to c, sharing set becomes {c}
• Resulting directory state: shared


Continued

• Cache message indicates a write miss
  – Shared: invalidation messages sent to the sharing set, SS becomes {c}, data to c
  – Exclusive: fetch/invalidate sent to the owner, data to c, SS becomes {c}
  – Uncached: data to c, SS becomes {c}
• Resulting directory state: exclusive


Continued

• If a writeback message is sent from a cache (needs to replace a dirty line), the state becomes uncached and SS is empty


July 2004 Computer Architecture, Advanced Architectures Slide 124

Cache-only Memory Architecture

• COMA

• Data lines have no fixed homes

• The entire memory associated with each PE is organized as a cache with each line having one or more temporary copies


July 2004 Computer Architecture, Advanced Architectures Slide 125

Shared Virtual Memory

• When a page fault occurs in a node, the page fault handler obtains the page

• CPU and OS get involved in the process

• Requires little hardware support


July 2004 Computer Architecture, Advanced Architectures Slide 126

Directory-Based Cache Coherence

Figure 27.10 States and transitions for a directory entry in a directory-based cache coherence protocol (c is the requesting cache).

States: Uncached, Shared (read-only), Exclusive (writable). Transitions:
  – Uncached, read miss: return value; set sharing set to {c}; go to Shared
  – Uncached, write miss: return value; set sharing set to {c}; go to Exclusive
  – Shared, read miss: return value; include c in sharing set
  – Shared, write miss: invalidate all cached copies; set sharing set to {c}; return value; go to Exclusive
  – Exclusive, read miss: fetch data from owner; return value; include c in sharing set; go to Shared
  – Exclusive, write miss: fetch data from owner; request invalidation; return value; set sharing set to {c}
  – Exclusive, data write-back: set sharing set to { }; go to Uncached
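A compact C sketch (illustrative; message sending is reduced to comments) of the directory entry's state machine as described above, covering read miss, write miss, and data write-back. The bit-vector sharing set is an assumption carried over from the earlier sketch.

    #include <stdint.h>

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } line_state_t;

    typedef struct {
        line_state_t state;
        uint32_t     sharing_set;        /* bit vector of caches holding a copy   */
    } dir_entry_t;

    /* Read miss from cache c. */
    static void dir_read_miss(dir_entry_t *d, int c) {
        if (d->state == EXCLUSIVE) {
            /* fetch the line from the current owner */
        }
        /* return value (data) to cache c */
        d->sharing_set |= 1u << c;       /* include c in the sharing set          */
        d->state = SHARED;               /* line ends up shared in every case     */
    }

    /* Write miss from cache c. */
    static void dir_write_miss(dir_entry_t *d, int c) {
        if (d->state == SHARED) {
            /* send invalidations to every cache in the sharing set */
        } else if (d->state == EXCLUSIVE) {
            /* fetch from the owner and invalidate its copy */
        }
        /* return value (data) to cache c */
        d->sharing_set = 1u << c;        /* sharing set becomes {c}               */
        d->state = EXCLUSIVE;            /* line ends up exclusive in every case  */
    }

    /* Data write-back from the owner (e.g., it must replace a dirty line). */
    static void dir_writeback(dir_entry_t *d) {
        d->sharing_set = 0;              /* sharing set becomes empty             */
        d->state = UNCACHED;
    }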


July 2004 Computer Architecture, Advanced Architectures Slide 127

27.6 Implementing Asymmetric Multiprocessors

Figure 27.11 Structure of a ring-based distributed-memory multiprocessor.

[Figure labels: computing nodes Node 0 – Node 3 (each typically 1-4 CPUs and associated memory) attach through a Link interface to a ring network; memory and connections to I/O controllers reside within each node.]


July 2004 Computer Architecture, Advanced Architectures Slide 128

Link

• Key component

• Allows communication with all other nodes

• Enforces the coherence protocol

• Contains cache, local directory, interface controllers for both sides


July 2004 Computer Architecture, Advanced Architectures Slide 129

Continued

• Cache miss
  – Request is sent via the ring
• Coherence
  – Within the node, snooping is used to ensure coherence among the processor caches, the link’s cache, and main memory
  – Internode coherence: directory-based protocol (TBD)


July 2004 Computer Architecture, Advanced Architectures Slide 130

Scalable Coherent Interface (SCI)

Figure 27.11 Structure of a ring-based distributed-memory multiprocessor.

[Figure labels: processors and caches 0–3 with associated memories, connected to the interconnection network.]


July 2004 Computer Architecture, Advanced Architectures Slide 131

28 Distributed Multicomputing
Computer architects’ dream: connect computers like toy blocks
• Building multicomputers from loosely connected nodes
• Internode communication is done via message passing

Topics in This Chapter

28.1 Communication by Message Passing

28.2 Interconnection Networks

28.3 Message Composition and Routing

28.4 Building and Using Multicomputers

28.5 Network-Based Distributed Computing

28.6 Grid Computing and Beyond


July 2004 Computer Architecture, Advanced Architectures Slide 132

28.1 Communication by Message Passing

Figure 28.1 Structure of a distributed multicomputer.

[Figure labels: computing nodes 0 … p−1 (memories and processors) attach through routers to an interconnection network; parallel input/output also connects to the network.]


July 2004 Computer Architecture, Advanced Architectures Slide 133

How is it different?

• From a shared-memory multiprocessor:
  – Data transmission does not occur as a result of memory access requests
  – Message-passing commands are explicitly activated by the processor
  – Usually more than one connection from each node to the interconnection network


July 2004 Computer Architecture, Advanced Architectures Slide 134

Continued

• A shared-medium network is controlled by bus acquisition (arbitration)
• A direct, or point-to-point, network involves forwarding through intermediate nodes


July 2004 Computer Architecture, Advanced Architectures Slide 135

Router Design

Figure 28.2 The structure of a generic router.

[Figure labels: input channels feed link controllers (LC) and input queues (Q); a switch, governed by routing and arbitration logic, connects them to output queues and link controllers on the output channels; injection and ejection channels, with a message queue, connect the router to its local node.]


July 2004 Computer Architecture, Advanced Architectures Slide 136

Routing Decisions

• Routing tables within the router
• Routing tags carried with messages
• Packet routing
  – Store and forward
• Wormhole routing
  – A small portion (flit, or flow control digit) is forwarded with the expectation that the others will follow


July 2004 Computer Architecture, Advanced Architectures Slide 137

Switches

• May contain queues and routing logic that make them similar to routers

• Circuit-switched interconnection network
  – A path is established and data is sent over that path


July 2004 Computer Architecture, Advanced Architectures Slide 138

Crossbar Switches

• A p × p crossbar has p² crossover points connecting p inputs to p outputs (see the sketch below)
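A small illustrative C sketch (not from the book) of the p² crosspoints as a p × p boolean matrix; closing crosspoint [i][o] connects input i to output o, and at most one input may drive each output.

    #include <stdbool.h>

    #define P 8                                   /* p inputs and p outputs        */

    static bool crosspoint[P][P];                 /* p^2 = 64 crosspoints for P=8  */

    /* Try to connect input `in` to output `out`; fail if the output is busy.     */
    static bool crossbar_connect(int in, int out) {
        for (int i = 0; i < P; i++)
            if (crosspoint[i][out]) return false; /* output already driven         */
        crosspoint[in][out] = true;
        return true;
    }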


July 2004 Computer Architecture, Advanced Architectures Slide 139 

Building Networks from Switches

Switch settings: straight through, crossed connection, lower broadcast, upper broadcast

Figure 28.3 Example 2 × 2 switch with point-to-point and broadcast connection capabilities.

(a) Butterfly network (b) Beneš network

[Figure 27.2 Butterfly and Beneš networks: numbered processors on one side connected to numbered memories on the other through rows (Row 0 – Row 7) of 2 × 2 switches.]


July 2004 Computer Architecture, Advanced Architectures Slide 140 

Interprocess Communication via Messages

Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes.

[Figure: Process A executes send x; Process B executes receive x and is suspended until the message arrives (communication latency), after which it is awakened. Time runs along the two process traces.]


July 2004 Computer Architecture, Advanced Architectures Slide 141

Message-Passing Interface

• MPI library
  – Broadcasting
  – Scattering
  – Gathering
  – Complete exchange
• Provides
  – Persistent communication
  – Barrier synchronization
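A minimal MPI example in C (a sketch using standard MPI calls; error handling omitted) showing a broadcast followed by a barrier synchronization.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) value = 42;                          /* root produces the value */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcasting            */
        MPI_Barrier(MPI_COMM_WORLD);                        /* barrier synchronization */
        printf("Process %d received %d\n", rank, value);

        MPI_Finalize();
        return 0;
    }

Scattering, gathering, and complete exchange follow the same pattern with MPI_Scatter, MPI_Gather, and MPI_Alltoall.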


July 2004 Computer Architecture, Advanced Architectures Slide 142

Interconnection Networks

• Analogy of a power grid
  – Some nodes are simply relays or voltage converters
• Indirect network
  – Some nodes are noncomputing nodes: switches and routers


July 2004 Computer Architecture, Advanced Architectures Slide 143

28.2 Interconnection Networks

Figure 28.5 Examples of direct and indirect interconnection networks.

(a) Direct network (b) Indirect network



July 2004 Computer Architecture, Advanced Architectures Slide 144 

Direct Interconnection Networks

Figure 28.6 A sampling of common direct interconnection networks. Only routers are shown; a computing node is implicit for each router.

(a) 2D torus (b) 4D hypercube

(c) Chordal ring (d) Ring of rings


July 2004 Computer Architecture, Advanced Architectures Slide 145 

Indirect Interconnection Networks

Figure 28.7 Two commonly used indirect interconnection networks.

(a) Hierarchical buses (level-1, level-2, and level-3 buses) (b) Omega network


July 2004 Computer Architecture, Advanced Architectures Slide 146

Two Key Parameters

• Network diameter
  – Defined in terms of the number of hops, i.e., router-to-router transfers
  – Example: a 4 × 4 2D torus has diameter 4
• A larger diameter means higher worst-case latency


July 2004 Computer Architecture, Advanced Architectures Slide 147

Continued

• Bisection width
  – Minimum number of channels that must be cut to divide the p-node network into two parts containing ⌈p/2⌉ and ⌊p/2⌋ nodes
  – Example: a 4 × 4 2D torus has bisection width 8 (see the sketch below)
  – The greater this width, the higher the available communication bandwidth
  – Important when communication patterns are random
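A hedged C sketch evaluating the two parameters for a k × k 2D torus using the standard formulas (diameter 2⌊k/2⌋; bisection width 2k for even k); the 4 × 4 case yields the values 4 and 8 quoted above.

    #include <stdio.h>

    /* Diameter of a k x k 2D torus: floor(k/2) hops per dimension. */
    static int torus_diameter(int k)  { return 2 * (k / 2); }

    /* Bisection width of a k x k 2D torus (k even): cutting it in half severs
       two channels per row -- one internal link and one wraparound link.      */
    static int torus_bisection(int k) { return 2 * k; }

    int main(void) {
        int k = 4;                                /* the 4 x 4 torus example      */
        printf("diameter = %d, bisection width = %d\n",
               torus_diameter(k), torus_bisection(k));   /* prints 4 and 8        */
        return 0;
    }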


July 2004 Computer Architecture, Advanced Architectures Slide 148

28.3 Message Composition and Routing

Figure 28.8 Messages and their parts for message passing.

[Figure: a message is divided into packets (first packet … last packet, with padding in the last one); a transmitted packet consists of a header, the data or payload, and a trailer, and is further divided into flow control digits (flits).]
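An illustrative C sketch (field names and sizes are assumptions, not a standard) of the message hierarchy just described: a message is split into packets, each packet carries a header, payload, and trailer, and each packet is further divided into flits for transmission.

    #include <stdint.h>

    #define FLIT_BYTES    8                /* width of one flow control digit      */
    #define FLITS_PER_PKT 16

    typedef uint8_t flit_t[FLIT_BYTES];    /* smallest unit handled by the router  */

    typedef struct {                       /* one transmitted packet               */
        flit_t header;                     /* destination and routing information  */
        flit_t payload[FLITS_PER_PKT - 2]; /* data or payload (padded if short)    */
        flit_t trailer;                    /* e.g., checksum / end-of-packet mark  */
    } packet_t;

    typedef struct {                       /* a message = a sequence of packets    */
        int       num_packets;
        packet_t *packets;                 /* first packet ... last packet         */
    } message_t;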


July 2004 Computer Architecture, Advanced Architectures Slide 149 

Wormhole Switching

Figure 28.9 Concepts of wormhole switching.

[Figure: (a) two worms en route to their respective destinations, Worm 1 moving and Worm 2 blocked; (b) deadlock due to circular waiting of four blocked worms, each blocked at the point of its attempted right turn.]


July 2004 Computer Architecture, Advanced Architectures Slide 150

Routing algorithms

• Nondeterministic
  – Adaptive
  – Probabilistic or random
• Fairness
• Oblivious
  – A unique path can be determined at the source


July 2004 Computer Architecture, Advanced Architectures Slide 151

Continued

• Row-first routing on 2D mesh or torus networks
  – The message goes along the row and then down the appropriate column (see the sketch below)
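A small C sketch of row-first (dimension-order) routing on a 2D mesh: each router compares its own coordinates with the destination's, correcting the column first and then the row. Torus wraparound links are ignored here for simplicity.

    typedef enum { EAST, WEST, SOUTH, NORTH, LOCAL } port_t;

    /* Output port chosen at router (row, col) for a packet headed to
       (dest_row, dest_col).                                              */
    static port_t route_row_first(int row, int col, int dest_row, int dest_col) {
        if (dest_col > col) return EAST;     /* still moving along the row */
        if (dest_col < col) return WEST;
        if (dest_row > row) return SOUTH;    /* column matches: move along */
        if (dest_row < row) return NORTH;    /* the column                 */
        return LOCAL;                        /* arrived: eject to the node */
    }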


July 2004 Computer Architecture, Advanced Architectures Slide 152

28.4 Building and Using Multicomputers

Figure 28.10 A task system and schedules on 1, 2, and 3 computers.

[Figure: (a) a static task graph with inputs, outputs, and tasks A–H whose running times are t = 1, 2, or 3; (b) schedules of the same tasks on 1, 2, and 3 computers, plotted against time 0–15.]
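A hedged C sketch of a simple greedy scheduler for a static task graph of this kind: tasks whose predecessors have finished are assigned to the earliest-free computer. The task times and dependences below are illustrative stand-ins, not the book's exact data, and communication costs are ignored.

    #include <stdio.h>

    #define NTASKS 8
    #define NPROCS 3

    static const int run_time[NTASKS] = {2, 1, 2, 2, 2, 1, 3, 1};   /* illustrative */
    static const int preds[NTASKS][4] = {      /* predecessor indices, -1-terminated */
        {-1}, {-1}, {-1},                      /* A, B, C: inputs, no predecessors   */
        {0, 1, -1}, {1, 2, -1},                /* D after A,B;  E after B,C          */
        {3, -1}, {4, -1},                      /* F after D;    G after E            */
        {5, 6, -1}                             /* H after F,G                        */
    };

    int main(void) {
        int finish[NTASKS] = {0}, done[NTASKS] = {0}, proc_free[NPROCS] = {0};
        for (int scheduled = 0; scheduled < NTASKS; ) {
            for (int t = 0; t < NTASKS; t++) {
                if (done[t]) continue;
                int ready_at = 0, ready = 1;
                for (int i = 0; i < 4 && preds[t][i] != -1; i++) {
                    if (!done[preds[t][i]]) { ready = 0; break; }
                    if (finish[preds[t][i]] > ready_at) ready_at = finish[preds[t][i]];
                }
                if (!ready) continue;
                int p = 0;                               /* earliest-free computer  */
                for (int q = 1; q < NPROCS; q++)
                    if (proc_free[q] < proc_free[p]) p = q;
                int start = ready_at > proc_free[p] ? ready_at : proc_free[p];
                finish[t] = start + run_time[t];
                proc_free[p] = finish[t];
                done[t] = 1; scheduled++;
                printf("task %c on computer %d: start %d, finish %d\n",
                       'A' + t, p, start, finish[t]);
            }
        }
        return 0;
    }

Even this simple heuristic already shows why optimal scheduling is hard, as the next slide notes.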


July 2004 Computer Architecture, Advanced Architectures Slide 153

Optimal Schedule?

• Extremely difficult
  – Task running times are data-dependent
  – Not all tasks are known a priori
  – Communication time must be considered
  – Message passing does not occur at task boundaries
  – Can every node execute every task?


July 2004 Computer Architecture, Advanced Architectures Slide 154

Two-tiered Approach

• Assign tasks and monitor progress and communication behavior

• Reassign tasks for overloaded nodes and excessive communication

• Load balancing


July 2004 Computer Architecture, Advanced Architectures Slide 155 

Building Multicomputers from Commodity Nodes

Figure 28.11 Growing clusters using modular nodes.

[Figure: (a) current racks of modules with expansion slots, each module containing a CPU, memory, and disks; (b) futuristic toy-block construction, each module containing CPU(s), memory, and disks, with wireless connection surfaces.]


July 2004 Computer Architecture, Advanced Architectures Slide 156

Network-based Distributed Computing

• What if we drop the assumptions of uniformity and regularity?
  – Identical or similar computing nodes
  – Regular interconnections

• Mix general-purpose and special-purpose resources

• Use idle time on PCs


July 2004 Computer Architecture, Advanced Architectures Slide 157

Clusters

• LAN

• Limited span

• Also referred to as networks of workstations


July 2004 Computer Architecture, Advanced Architectures Slide 158

Metacomputer

• WAN
  – The Internet as an extreme example
• Nodes are complete computers
  – End systems or platforms
• Requires a coordination strategy
  – Client-server
  – Peer-to-peer


July 2004 Computer Architecture, Advanced Architectures Slide 159

Peer-to-peer

• Middleware
  – Helps coordinate the actions of the nodes
  – E.g., a group communication facility may be used to notify all participating processes of changes in the environment
• Sterling has argued that, per MFLOPS, a supercomputer costs about 10 times as much as a PC


July 2004 Computer Architecture, Advanced Architectures Slide 160

28.5 Network-Based Distributed Computing

Figure 28.12 Network of workstations.

[Figure: PCs attach, via their system or I/O bus, through a fast network interface (NIC) with large memory to a network built of high-speed wormhole switches.]


July 2004 Computer Architecture, Advanced Architectures Slide 161

Extensions to the OS

• Maintain status information for efficient use and recovery

• Task partitioning and scheduling

• Initial assignment to nodes and, possibly, load balancing

• Distributed file system management


July 2004 Computer Architecture, Advanced Architectures Slide 162

Continued

• Compilation of system usage data for planning and accounting

• Enforcement of intersystem privacy and security policies

• Some or all of these functions will be incorporated into standard operating systems


July 2004 Computer Architecture, Advanced Architectures Slide 163

28.6 Grid Computing and Beyond

A computational grid is analogous to the power grid

Decouples the “production” and “consumption” of computational power

Homes don’t have an electricity generator; why should they have a computer?

Advantages of computational grid:

Near continuous availability of computational and related resources

Resource requirements based on sum of averages, rather than sum of peaks

Paying for services based on actual usage rather than peak demand

Distributed data storage for higher reliability, availability, and security

Universal access to specialized and one-of-a-kind computing resources

Still to be worked out: How to charge for computation usage


July 2004 Computer Architecture, Advanced Architectures Slide 164

What will it take?

• Interaction requirements are quite different from the way in which systems interact today
  – More emphasis on communication capabilities and interfaces
  – Hardware and OS support for cluster coordination and communication with the outside world


July 2004 Computer Architecture, Advanced Architectures Slide 165

Continued

• Improved communication performance, through lighter-weight interaction much as in clusters, is needed
• Internets generally have no centralized control whatsoever. Wide geographic distribution leads to nonnegligible signal propagation times.


July 2004 Computer Architecture, Advanced Architectures Slide 166

Continued

• There are security, interorganizational, and international issues in data exchange