
Page 1:

Multi-Processor Systems

Roberto Airoldi, Department of Computer Systems, Tampere University of Technology

[email protected]

TG 307

Page 2:

Lecture outline

Why are multi-processor systems needed?

Basics of MP design strategies:
– memory hierarchy
– homogeneous or heterogeneous
– communication infrastructure
– …

A few words on the research work of our research group.

Page 3:

Uniprocessor Performance (SPECint)

[Chart: uniprocessor performance relative to the VAX-11/780, log scale from 1 to 10,000, plotted for 1978-2006; performance grows at roughly 25%/year at first, then 52%/year, and only ??%/year in the most recent years.]


From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Page 4:

Other Factors ⇒ Multiprocessors

– Growth in data-intensive applications
  • databases, file servers, …
– Growing interest in servers and server performance
– Increasing desktop performance is less important
  • outside of graphics
– Improved understanding of how to use multiprocessors effectively
  • especially servers, where there is significant natural TLP
– Advantage of leveraging design investment by replication
  • rather than a unique design

Page 5:

Flynn’s Taxonomy

Flynn classified computers by their data and control streams in 1966:
– SIMD ⇒ Data-Level Parallelism
– MIMD ⇒ Thread-Level Parallelism
– MIMD is popular because it is
  • flexible: N programs, or 1 multithreaded program
  • cost-effective: the same MPU is used in desktops and in MIMD machines


– Single Instruction, Single Data (SISD): uniprocessor
– Single Instruction, Multiple Data (SIMD): single PC; vector machines, CM-2
– Multiple Instruction, Single Data (MISD): (????)
– Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers

M. J. Flynn, "Very High-Speed Computing Systems," Proceedings of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

Page 6:

Back to Basics

"A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."

Parallel Architecture = Computer Architecture + Communication Architecture

Two classes of multiprocessors with respect to memory:
– Centralized-memory multiprocessor
  • fewer than a few dozen processor chips (and fewer than ~100 cores) in 2006
  • small enough to share a single, centralized memory
– Physically distributed-memory multiprocessor
  • larger number of chips and cores
  • bandwidth demands ⇒ memory is distributed among the processors

Page 7:

Centralized vs. Distributed Memory

[Diagram: Centralized memory – processors P1…Pn, each with a cache ($), share main memory through an interconnection network. Distributed memory – each processor P1…Pn has a cache and a local memory (Mem), and the nodes communicate over the interconnection network. The achievable scale grows from the centralized to the distributed organization.]

Page 8:

Centralized Memory Multiprocessor

– Also called symmetric multiprocessors (SMPs), because the single main memory has a symmetric relationship to all processors
– Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors
– Can scale to a few dozen processors by using a switch and many memory banks
– Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases

Page 9:

Distributed Memory Multiprocessor

– Pro: cost-effective way to scale memory bandwidth
  • if most accesses are to local memory
– Pro: reduces the latency of local memory accesses
– Con: communicating data between processors is more complex
– Con: must change software to take advantage of the increased memory BW

Page 10:

2 Models for Communication and Memory Architecture

1. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors
2. Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either
   – UMA (Uniform Memory Access time) for a shared address space with centralized memory
   – NUMA (Non-Uniform Memory Access time) for a shared address space with physically distributed memory

In the past there was confusion over whether "sharing" means sharing physical memory (symmetric MP) or sharing the address space.

Page 11:

Challenges of Parallel Processing

The first challenge is the fraction of the program that is inherently sequential. Suppose we want an 80x speedup from 100 processors. What fraction of the original program can be sequential?


Applying Amdahl's Law:

Speedup_overall = 1 / ((1 - Fraction_parallel) + Fraction_parallel / Speedup_parallel)

80 = 1 / ((1 - Fraction_parallel) + Fraction_parallel / 100)

80 × (1 - Fraction_parallel) + 0.8 × Fraction_parallel = 1
79 = 79.2 × Fraction_parallel
Fraction_parallel = 79 / 79.2 = 99.75%

So at most 0.25% of the original program can be sequential.
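For experimentation, here is a minimal Python sketch (illustrative only, not part of the original slides; the helper name is made up) that solves Amdahl's Law for the required parallel fraction:

```python
def required_parallel_fraction(target_speedup, n_processors):
    """Solve Amdahl's Law, speedup = 1 / ((1 - F) + F / n), for F."""
    return (1 - 1 / target_speedup) / (1 - 1 / n_processors)

# The example above: an 80x speedup on 100 processors
F = required_parallel_fraction(target_speedup=80, n_processors=100)
print(f"parallel fraction   = {F:.4%}")      # 99.7475%
print(f"sequential fraction = {1 - F:.4%}")  # 0.2525%
```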

Page 12:

Challenges of Parallel Processing

The second challenge is the long latency to remote memory. Suppose a 32-CPU MP, a 2 GHz clock, 200 ns remote memory accesses, all local accesses hitting in the memory hierarchy, and a base CPI of 0.5. (Remote access cost = 200 ns / 0.5 ns per cycle = 400 clock cycles.) What is the performance impact if 0.2% of instructions involve a remote access?

CPI = Base CPI + Remote request rate × Remote request cost
CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3

With no communication the machine is 1.3 / 0.5 = 2.6x faster than when 0.2% of instructions involve a remote access.
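The same calculation as a small Python sketch (illustrative only, not from the slides), convenient for trying other remote-access rates or latencies:

```python
def effective_cpi(base_cpi, remote_rate, remote_latency_ns, clock_ghz):
    """Effective CPI when a fraction of instructions stall on remote memory."""
    cycle_ns = 1.0 / clock_ghz                    # 2 GHz -> 0.5 ns per cycle
    remote_cycles = remote_latency_ns / cycle_ns  # 200 ns -> 400 cycles
    return base_cpi + remote_rate * remote_cycles

cpi = effective_cpi(base_cpi=0.5, remote_rate=0.002,
                    remote_latency_ns=200, clock_ghz=2)
print(cpi)        # 1.3
print(cpi / 0.5)  # 2.6x slower than the no-communication case
```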

Page 13:

Challenges of Parallel Processing

– Application parallelism ⇒ addressed primarily via new algorithms that have better parallel performance
– Long remote-latency impact ⇒ addressed both by the architect and by the programmer
– For example, reduce the frequency of remote accesses either by
  • caching shared data (HW)
  • restructuring the data layout to make more accesses local (SW)

Page 14:

Fundamental MP design decisions

We have already discussed:
– Shared memory versus message passing?

Other decisions:
– Homogeneous versus heterogeneous?
– Bus versus network?
– Generic versus application-specific?
– What types of parallelism to support?
– Focus on performance, power, or cost?
– Memory organization?

Page 15:

Homogeneous or Heterogeneous

Homogeneous:
– replication effect
– memory-dominated anyway
– solve realization issues once and for all
– less flexible

Heterogeneous:
– better fit to the application domain
– smaller increments

Page 16:

Hybrid approaches

– Middle-of-the-road approach
– Flexible tiles
– Fixed tile structure at the top level

Page 17:

Shared vs. Switched Media

Shared media: nodes share a single interconnection medium (e.g., Ethernet, a bus)
– Only one message sent at a time
– Inexpensive: one medium used by all processors
– Limited bandwidth: the medium becomes the bottleneck
– Needs arbitration

Switched media: allow direct communication between source and destination nodes (e.g., ATM)
– Multiple users at a time
– More expensive: the medium must be replicated
– Higher total bandwidth
– No arbitration
– Added latency to go through the switch
– Claimed to be more scalable
  • but there is router overhead

Page 18:

Which types of parallelism to support?

[Diagram: levels of granularity — kernel level, module level, program/thread level, and application level — mapped to the types of parallelism they expose (ILP, DLP, TLP).]

Page 19:

Heterogeneity

[Diagram: the degree of heterogeneity (homogeneous, heterogeneous, very heterogeneous) associated with the different levels of parallelism (kernel level, module level, application level).]

Page 20:

Example MP supporting DLP and ILP: IMAGINE Stream Processor

Page 21:

Focus

– Low cost?
– Low power?
– High performance?
– Depends on the platform class

Page 22:

Intrinsic computational efficiency of silicon

[Chart: computational efficiency (MOPS/W, log scale 10^0-10^6) versus feature size (2 µm down to 0.07 µm). One curve shows the intrinsic computational efficiency of silicon; the other shows the computational efficiency of microprocessors (i386SX, i486DX, P5, 68040, microSPARC, SuperSPARC, 601, 604, 604e, UltraSPARC, TurboSPARC, P6, 21164a, 21364, 7400, Itanium 2, XScale, C6411, C5510, Xtensa). Ref: Roza.]

Presenter note: Wait for a programmable solution, or be proactive and take action now.
Page 23:

Problem: Energy & Performance of Memories

– Bandwidth
  • slow (the memory wall)
– Energy: expensive
  • RISC: 27% of power in the instruction cache
    – StrongARM SA-110: a 160 MHz, 32-bit, 0.5 W CMOS ARM processor
  • VLIW: 30% of power in the instruction cache
    – C6x, 0.15 µm, 250 MHz, Texas Instruments Inc.
    – Lx, HP & STMicroelectronics, L. Benini et al. (ISLPED)

Page 24:

Wiring hierarchy

How far can a signal reach in one local clock cycle (at the local clock frequency)?

[Chart: line length (mm, 0-3) reachable in one local clock cycle versus technology node (100 down to 40), for local, intermediate, and global wires.]

Page 25:

INTERCONNECTION NETWORKS

Page 26:

Network design parameters

Large network design space:
– topology, degree
– routing algorithm
  • path, path control, collision resolution, network support, deadlock handling, livelock handling
– virtual layer support
– flow control
– buffering
– QoS guarantees
– error handling
– etc.

Page 27:

Network-on-Chip

Some common characteristics of a NoC:
– more than a single shared medium (like a bus): a true network
– provides a point-to-point connection between any two hosts connected to the network
  • real point-to-point via a crossbar switch (or equivalent)
  • virtual point-to-point connection
– higher aggregate bandwidth due to parallelism
  • many concurrent packet flows or parallel circuit paths
– clearly separates communication from computation
– a layered approach is used for the network implementation
– a practical implication of the above points is that the network contains storage elements for the data
  • implicit pipelining


Additional reading: J. Nurmi, "Network-on-Chip – A New Paradigm for System-on-Chip Design," in Proc. International Symposium on System-on-Chip, Nov. 2005, pp. 2-6.

Page 28:

Protocol layers

Page 29:

Circuit-switching and packet-switching

Circuit-switching:
– A connection has to be established between the communicating parties before the transfer can start
– The line is reserved for the whole time, independently of the amount of traffic (like a phone line, which is reserved whether you are silent or talking at your maximum rate)
– The connection is explicitly closed when there is no more need for the communication

Packet-switching:
– The data is packed together with information about the destination address (like a letter or parcel in the mail)
– Separate packets or sub-packets (flits or phits) can be sent along different routes (if the network allows that)
– Other senders can also inject packets into the network to travel along (partially) the same route

Page 30:

Nodes, switches, resources

A node or a switch:
– routes the packet along the network
– often includes some buffering (FIFOs)
– the name 'switch' refers to a crossbar switch connecting any of the inputs to any of the outputs
– different kinds of switch architectures can be used in a NoC
– often also connects the resources to the network

A resource:
– a processor
– a memory
– a fixed-function or reconfigurable block

Page 31:

Node characteristics – examples

3 x 3 switch:
– this size of crossbar is suitable for connecting a resource to a bidirectional ring

5 x 5 switch:
– the typical size for mesh networks: four nearest neighbours plus the local resource

16 x 16 switch:
– from this size upwards, the switches start to absorb the whole network operation into a single centralized switch
– the switch is the network!
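To make the crossbar idea concrete, here is a minimal Python sketch (illustrative, not from the slides; class and method names are made up) of an N x N switch in which each output port can be driven by at most one input port at a time:

```python
class Crossbar:
    """Toy N x N crossbar: each output port can be driven by one input port."""

    def __init__(self, n_ports):
        self.out_driver = [None] * n_ports  # output index -> connected input

    def connect(self, inp, out):
        """Try to set up a path inp -> out; fails if the output is busy."""
        if self.out_driver[out] is not None:
            return False
        self.out_driver[out] = inp
        return True

    def release(self, out):
        self.out_driver[out] = None


# A 5 x 5 switch as used in a 2D mesh: N, E, S, W ports + the local resource
xbar = Crossbar(5)
assert xbar.connect(inp=0, out=2)       # e.g. North input to South output
assert not xbar.connect(inp=1, out=2)   # contention: output 2 is already driven
```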

Page 32:

Network Topology

Switched media have a topology that indicates how nodes are connected. The topology determines:
– Degree: the number of links from a node
– Diameter: the maximum number of links crossed between nodes
– Average distance: the number of links to a random destination
– Bisection: the minimum number of links that separate the network into two halves
– Bisection bandwidth: link bandwidth × bisection

Page 33:

Common Topologies

Type        Degree    Diameter         Ave Dist        Bisection
1D mesh     2         N-1              N/3             1
2D mesh     4         2(N^(1/2)-1)     2N^(1/2)/3      N^(1/2)
3D mesh     6         3(N^(1/3)-1)     3N^(1/3)/3      N^(2/3)
nD mesh     2n        n(N^(1/n)-1)     nN^(1/n)/3      N^((n-1)/n)
Ring        2         N/2              N/4             2
2D torus    4         N^(1/2)          N^(1/2)/2       2N^(1/2)
Hypercube   log2 N    n = log2 N       n/2             N/2
2D Tree     3         2 log2 N         ~2 log2 N       1
Crossbar    N-1       1                1               N^2/2

N = number of nodes, n = dimension
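As a quick way to sanity-check the formulas in the table, a small Python sketch (illustrative only, not from the slides) that evaluates two of the rows:

```python
import math

def mesh_2d_metrics(N):
    """2D mesh of N nodes, assumed to be a perfect square (k x k grid)."""
    k = math.isqrt(N)
    return {"degree": 4, "diameter": 2 * (k - 1),
            "avg_dist": 2 * k / 3, "bisection": k}

def hypercube_metrics(N):
    """Hypercube of N = 2^n nodes."""
    n = int(math.log2(N))
    return {"degree": n, "diameter": n, "avg_dist": n / 2, "bisection": N // 2}

print(mesh_2d_metrics(64))    # 8x8 mesh: diameter 14, bisection 8
print(hypercube_metrics(64))  # 6-cube:   diameter 6,  bisection 32
```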

Page 34:

Network topologies - Mesh

Two-dimensional mesh – very intuitive, indeed

[Diagram: a 2D mesh — resources attached to switch nodes, with nodes connected to their neighbours by links.]
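To show how packets are commonly steered across such a mesh, here is a minimal sketch of deterministic dimension-order (XY) routing, a typical choice for mesh NoCs (illustrative Python code, not from the slides):

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh: fix X first, then Y.
    src and dst are (x, y) node coordinates; returns the visited nodes."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                      # travel along the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then along the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

Because every packet turns from X to Y in the same way, this scheme is deadlock-free on a mesh without needing extra virtual channels.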

Page 35:

Network topologies - Torus

Torus and folded torus extend the mesh
– a folded torus has links that are 2x the length of the mesh links!

Page 36:

Network topologies - Ring

The ring is the simplest of the networks
– unidirectional or bidirectional
– does not correspond to the physical placement of the resources!

Page 37:

Network topologies – Octagon and Spidergon

Extend the ring by introducing shortcuts:
– Octagon: 4 diagonals
– Spidergon: N diagonals

Page 38:

Routing schemes

Deterministic vs. adaptive:
– deterministic routing is fixed (by a routing table or by information in the packet)
– adaptive routing can adapt to, e.g., network congestion

Packet handling property:
– store-and-forward: the whole packet is always stored in a node before being forwarded
– virtual cut-through: forwarding may start as soon as there is enough room in the next node
– wormhole routing: each flit is routed further immediately

(A rough latency comparison of these packet-handling schemes follows below.)
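As a rough illustration of the difference between the packet-handling schemes, a back-of-the-envelope latency model in Python (illustrative assumptions: fixed link bandwidth, a fixed per-hop routing delay, no contention; function names are made up):

```python
def store_and_forward_latency(hops, packet_bits, link_bw, hop_delay):
    """The whole packet is received at each node before being forwarded."""
    serialization = packet_bits / link_bw   # cycles to push the packet over one link
    return hops * (hop_delay + serialization)

def cut_through_latency(hops, packet_bits, header_bits, link_bw, hop_delay):
    """Only the header is delayed per hop; the body is pipelined behind it."""
    header_time = header_bits / link_bw
    return hops * (hop_delay + header_time) + packet_bits / link_bw

# Example: 6 hops, 512-bit packet, 32-bit header, 32-bit links, 1-cycle routing delay
print(store_and_forward_latency(6, 512, 32, 1))   # 102 cycles
print(cut_through_latency(6, 512, 32, 32, 1))     # 28 cycles
```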

Page 39:

Network performance

[Chart: throughput and latency as a function of network load (%).]

Throughput:
– throughput per input
– aggregate throughput
– bisection throughput

Latency:
– time from data injection to readout completion

The two are inter-related, and both are a function of network loading!

Page 40:

Quality-of-Service (QoS)

Normally packet-switched networks provide best-effort throughput to all traffic
– "your mileage may vary..."

For some services, guaranteed service is needed:
– predictable delays
– guaranteed throughput

This can be handled by:
– a circuit-switched channel
– time-multiplexing (giving a certain number of time slots to the priority service)
– an approximation based on assigning packet priorities (which may lead to starvation of other services)

Page 41:

Fault and error tolerance

– More permanent manufacturing defects in large, complex circuits
– More soft errors due to lower error margins
– More dynamic transient faults due to crosstalk and other on-chip noise

For error tolerance, error detection and correction methods from telecommunications can be used to some extent:
– must not cause too much overhead in area and power
– retransmission or error-correcting codes?

Fault tolerance may likewise cause too much overhead:
– the network provides natural redundancy for fixing faults

Page 42:

Synchronous, asynchronous, GALS

– A single-clock SoC is not practical
– A single-clock NoC (the communication part) may be...
– However, synchronization between clock domains is needed anyway
– GALS (Globally Asynchronous, Locally Synchronous) is a natural mode of operation for large NoCs connecting many clock domains
  • synchronization is carried out between the network and the synchronous resource
  • a place for trade-offs: where to place the asynchronous/synchronous border?
– Asynchronous functional blocks are rare, but the network can be fully asynchronous (with synchronizing wrappers)

Page 43:

Our Research group

Prof. Jari Nurmi leads the group.

Research areas:
– Computer architecture:
  • Networks-on-Chip
  • multi-processor architecture design
  • reconfigurable hardware
  • implementation of signal processing for communications
– Navigation and positioning:
  • GNSS receivers
  • indoor positioning
