
Page 1:

Multi-Processor Systems

Roberto Airoldi, Department of Computer Systems, Tampere University of Technology

[email protected]

TG 307

Page 2:

Lecture outline

Why are multi-processor systems needed?

Basics of MP design strategies:
– memory hierarchy
– homogeneous or heterogeneous
– communication infrastructure
– …

A few words on the research work of our research group.

Page 3:

Uniprocessor Performance (SPECint)

[Chart: uniprocessor performance relative to the VAX-11/780, log scale from 1 to 10,000, plotted for 1978-2006; performance grows at roughly 25%/year at first, then 52%/year, and only ??%/year in the most recent years.]


From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

Page 4:

Other Factors ⇒ Multiprocessors

– Growth in data-intensive applications
  • databases, file servers, …
– Growing interest in servers and server performance
– Increasing desktop performance is less important
  • outside of graphics
– Improved understanding of how to use multiprocessors effectively
  • especially servers, where there is significant natural TLP
– Advantage of leveraging design investment by replication
  • rather than a unique design

Page 5:

Flynn’s Taxonomy

Flynn classified computers by their data and control streams in 1966:
– SIMD ⇒ Data-Level Parallelism
– MIMD ⇒ Thread-Level Parallelism
– MIMD is popular because it is
  • flexible: N programs, or 1 multithreaded program
  • cost-effective: the same MPU is used in desktops and in MIMD machines


– Single Instruction, Single Data (SISD): uniprocessor
– Single Instruction, Multiple Data (SIMD): single PC; vector machines, CM-2
– Multiple Instruction, Single Data (MISD): (????)
– Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers

M. J. Flynn, "Very High-Speed Computing Systems," Proceedings of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

Page 6:

Back to Basics

"A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."

Parallel Architecture = Computer Architecture + Communication Architecture

Two classes of multiprocessors with respect to memory:
– Centralized-memory multiprocessor
  • fewer than a few dozen processor chips (and fewer than ~100 cores) in 2006
  • small enough to share a single, centralized memory
– Physically distributed-memory multiprocessor
  • larger number of chips and cores
  • bandwidth demands ⇒ memory is distributed among the processors

Page 7:

Centralized vs. Distributed Memory

[Diagram: Centralized memory – processors P1…Pn, each with a cache ($), share main memory through an interconnection network. Distributed memory – each processor P1…Pn has a cache and a local memory (Mem), and the nodes communicate over the interconnection network. The achievable scale grows from the centralized to the distributed organization.]

Page 8:

Centralized Memory Multiprocessor

– Also called symmetric multiprocessors (SMPs), because the single main memory has a symmetric relationship to all processors
– Large caches ⇒ a single memory can satisfy the memory demands of a small number of processors
– Can scale to a few dozen processors by using a switch and many memory banks
– Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases

Page 9:

Distributed Memory Multiprocessor

– Pro: cost-effective way to scale memory bandwidth
  • if most accesses are to local memory
– Pro: reduces the latency of local memory accesses
– Con: communicating data between processors is more complex
– Con: must change software to take advantage of the increased memory BW

Page 10:

2 Models for Communication and Memory Architecture

1. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors
2. Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either
   – UMA (Uniform Memory Access time) for a shared address space with centralized memory
   – NUMA (Non-Uniform Memory Access time) for a shared address space with physically distributed memory

In the past there was confusion over whether "sharing" means sharing physical memory (symmetric MP) or sharing the address space.

Page 11:

Challenges of Parallel Processing

The first challenge is the fraction of the program that is inherently sequential. Suppose we want an 80x speedup from 100 processors. What fraction of the original program can be sequential?


Applying Amdahl's Law:

Speedup_overall = 1 / ((1 - Fraction_parallel) + Fraction_parallel / Speedup_parallel)

80 = 1 / ((1 - Fraction_parallel) + Fraction_parallel / 100)

80 × (1 - Fraction_parallel) + 0.8 × Fraction_parallel = 1
79 = 79.2 × Fraction_parallel
Fraction_parallel = 79 / 79.2 = 99.75%

So at most 0.25% of the original program can be sequential.
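For experimentation, here is a minimal Python sketch (illustrative only, not part of the original slides; the helper name is made up) that solves Amdahl's Law for the required parallel fraction:

```python
def required_parallel_fraction(target_speedup, n_processors):
    """Solve Amdahl's Law, speedup = 1 / ((1 - F) + F / n), for F."""
    return (1 - 1 / target_speedup) / (1 - 1 / n_processors)

# The example above: an 80x speedup on 100 processors
F = required_parallel_fraction(target_speedup=80, n_processors=100)
print(f"parallel fraction   = {F:.4%}")      # 99.7475%
print(f"sequential fraction = {1 - F:.4%}")  # 0.2525%
```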

Page 12:

Challenges of Parallel Processing

The second challenge is the long latency to remote memory. Suppose a 32-CPU MP, a 2 GHz clock, 200 ns remote memory accesses, all local accesses hitting in the memory hierarchy, and a base CPI of 0.5. (Remote access cost = 200 ns / 0.5 ns per cycle = 400 clock cycles.) What is the performance impact if 0.2% of instructions involve a remote access?

CPI = Base CPI + Remote request rate × Remote request cost
CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3

With no communication the machine is 1.3 / 0.5 = 2.6x faster than when 0.2% of instructions involve a remote access.
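The same calculation as a small Python sketch (illustrative only, not from the slides), convenient for trying other remote-access rates or latencies:

```python
def effective_cpi(base_cpi, remote_rate, remote_latency_ns, clock_ghz):
    """Effective CPI when a fraction of instructions stall on remote memory."""
    cycle_ns = 1.0 / clock_ghz                    # 2 GHz -> 0.5 ns per cycle
    remote_cycles = remote_latency_ns / cycle_ns  # 200 ns -> 400 cycles
    return base_cpi + remote_rate * remote_cycles

cpi = effective_cpi(base_cpi=0.5, remote_rate=0.002,
                    remote_latency_ns=200, clock_ghz=2)
print(cpi)        # 1.3
print(cpi / 0.5)  # 2.6x slower than the no-communication case
```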

Page 13:

Challenges of Parallel Processing

– Application parallelism ⇒ addressed primarily via new algorithms that have better parallel performance
– Long remote-latency impact ⇒ addressed both by the architect and by the programmer
– For example, reduce the frequency of remote accesses either by
  • caching shared data (HW)
  • restructuring the data layout to make more accesses local (SW)

Page 14:

Fundamental MP design decisions

We have already discussed:
– Shared memory versus message passing?

Other decisions:
– Homogeneous versus heterogeneous?
– Bus versus network?
– Generic versus application-specific?
– What types of parallelism to support?
– Focus on performance, power, or cost?
– Memory organization?

Page 15:

Homogeneous or Heterogeneous

Homogeneous:
– replication effect
– memory-dominated anyway
– solve realization issues once and for all
– less flexible

Heterogeneous:
– better fit to the application domain
– smaller increments

Page 16:

Hybrid approaches

– Middle-of-the-road approach
– Flexible tiles
– Fixed tile structure at the top level

Page 17:

Shared vs. Switched Media

Shared media: nodes share a single interconnection medium (e.g., Ethernet, a bus)
– Only one message sent at a time
– Inexpensive: one medium used by all processors
– Limited bandwidth: the medium becomes the bottleneck
– Needs arbitration

Switched media: allow direct communication between source and destination nodes (e.g., ATM)
– Multiple users at a time
– More expensive: the medium must be replicated
– Higher total bandwidth
– No arbitration
– Added latency to go through the switch
– Claimed to be more scalable
  • but there is router overhead

Page 18:

Which types of parallelism to support?

[Diagram: levels of granularity — kernel level, module level, program/thread level, and application level — mapped to the types of parallelism they expose (ILP, DLP, TLP).]

Page 19:

Heterogeneity

[Diagram: the degree of heterogeneity (homogeneous, heterogeneous, very heterogeneous) associated with the different levels of parallelism (kernel level, module level, application level).]

Page 20:

Example MP supporting DLP and ILP: IMAGINE Stream Processor

Page 21:

Focus

– Low cost?
– Low power?
– High performance?
– Depends on the platform class

Page 22:

Intrinsic computational efficiency of silicon

[Chart: computational efficiency (MOPS/W, log scale 10^0-10^6) versus feature size (2 µm down to 0.07 µm). One curve shows the intrinsic computational efficiency of silicon; the other shows the computational efficiency of microprocessors (i386SX, i486DX, P5, 68040, microSPARC, SuperSPARC, 601, 604, 604e, UltraSPARC, TurboSPARC, P6, 21164a, 21364, 7400, Itanium 2, XScale, C6411, C5510, Xtensa). Ref: Roza.]

Presenter note: Wait for a programmable solution, or be proactive and take action now.
Page 23:

Problem: Energy & Performance of Memories

– Bandwidth
  • slow (the memory wall)
– Energy: expensive
  • RISC: 27% of power in the instruction cache
    – StrongARM SA-110: a 160 MHz, 32-bit, 0.5 W CMOS ARM processor
  • VLIW: 30% of power in the instruction cache
    – C6x, 0.15 µm, 250 MHz, Texas Instruments Inc.
    – Lx, HP & STMicroelectronics, L. Benini et al. (ISLPED)

Page 24:

Wiring hierarchy

How far can a signal reach in one local clock cycle (at the local clock frequency)?

[Chart: line length (mm, 0-3) reachable in one local clock cycle versus technology node (100 down to 40), for local, intermediate, and global wires.]

Page 25:

INTERCONNECTION NETWORKS

Page 26:

Network design parameters

Large network design space:
– topology, degree
– routing algorithm
  • path, path control, collision resolution, network support, deadlock handling, livelock handling
– virtual layer support
– flow control
– buffering
– QoS guarantees
– error handling
– etc.

Page 27:

Network-on-Chip

Some common characteristics of a NoC:
– more than a single shared medium (like a bus): a true network
– provides a point-to-point connection between any two hosts connected to the network
  • real point-to-point via a crossbar switch (or equivalent)
  • virtual point-to-point connection
– higher aggregate bandwidth due to parallelism
  • many concurrent packet flows or parallel circuit paths
– clearly separates communication from computation
– a layered approach is used for the network implementation
– a practical implication of the above points is that the network contains storage elements for the data
  • implicit pipelining


Additional reading: J. Nurmi, "Network-on-Chip – A New Paradigm for System-on-Chip Design," in Proc. International Symposium on System-on-Chip, Nov. 2005, pp. 2-6.

Page 28:

Protocol layers

Page 29:

Circuit-switching and packet-switching

Circuit-switching:
– A connection has to be established between the communicating parties before the transfer can start
– The line is reserved for the whole time, independently of the amount of traffic (like a phone line, which is reserved whether you are silent or talking at your maximum rate)
– The connection is explicitly closed when there is no more need for the communication

Packet-switching:
– The data is packed together with information about the destination address (like a letter or parcel in the mail)
– Separate packets or sub-packets (flits or phits) can be sent along different routes (if the network allows that)
– Other senders can also inject packets into the network to travel along (partially) the same route

Page 30:

Nodes, switches, resources

A node or a switch:
– routes the packet along the network
– often includes some buffering (FIFOs)
– the name 'switch' refers to a crossbar switch connecting any of the inputs to any of the outputs
– different kinds of switch architectures can be used in a NoC
– often also connects the resources to the network

A resource:
– a processor
– a memory
– a fixed-function or reconfigurable block

Page 31:

Node characteristics – examples

3 x 3 switch:
– this size of crossbar is suitable for connecting a resource to a bidirectional ring

5 x 5 switch:
– the typical size for mesh networks: four nearest neighbours plus the local resource

16 x 16 switch:
– from this size upwards, the switches start to absorb the whole network operation into a single centralized switch
– the switch is the network!
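To make the crossbar idea concrete, here is a minimal Python sketch (illustrative, not from the slides; class and method names are made up) of an N x N switch in which each output port can be driven by at most one input port at a time:

```python
class Crossbar:
    """Toy N x N crossbar: each output port can be driven by one input port."""

    def __init__(self, n_ports):
        self.out_driver = [None] * n_ports  # output index -> connected input

    def connect(self, inp, out):
        """Try to set up a path inp -> out; fails if the output is busy."""
        if self.out_driver[out] is not None:
            return False
        self.out_driver[out] = inp
        return True

    def release(self, out):
        self.out_driver[out] = None


# A 5 x 5 switch as used in a 2D mesh: N, E, S, W ports + the local resource
xbar = Crossbar(5)
assert xbar.connect(inp=0, out=2)       # e.g. North input to South output
assert not xbar.connect(inp=1, out=2)   # contention: output 2 is already driven
```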

Page 32:

Network Topology

Switched media have a topology that indicates how nodes are connected. The topology determines:
– Degree: the number of links from a node
– Diameter: the maximum number of links crossed between nodes
– Average distance: the number of links to a random destination
– Bisection: the minimum number of links that separate the network into two halves
– Bisection bandwidth: link bandwidth × bisection

Page 33:

Common Topologies

Type        Degree    Diameter         Ave Dist        Bisection
1D mesh     2         N-1              N/3             1
2D mesh     4         2(N^(1/2)-1)     2N^(1/2)/3      N^(1/2)
3D mesh     6         3(N^(1/3)-1)     3N^(1/3)/3      N^(2/3)
nD mesh     2n        n(N^(1/n)-1)     nN^(1/n)/3      N^((n-1)/n)
Ring        2         N/2              N/4             2
2D torus    4         N^(1/2)          N^(1/2)/2       2N^(1/2)
Hypercube   log2 N    n = log2 N       n/2             N/2
2D Tree     3         2 log2 N         ~2 log2 N       1
Crossbar    N-1       1                1               N^2/2

N = number of nodes, n = dimension
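As a quick way to sanity-check the formulas in the table, a small Python sketch (illustrative only, not from the slides) that evaluates two of the rows:

```python
import math

def mesh_2d_metrics(N):
    """2D mesh of N nodes, assumed to be a perfect square (k x k grid)."""
    k = math.isqrt(N)
    return {"degree": 4, "diameter": 2 * (k - 1),
            "avg_dist": 2 * k / 3, "bisection": k}

def hypercube_metrics(N):
    """Hypercube of N = 2^n nodes."""
    n = int(math.log2(N))
    return {"degree": n, "diameter": n, "avg_dist": n / 2, "bisection": N // 2}

print(mesh_2d_metrics(64))    # 8x8 mesh: diameter 14, bisection 8
print(hypercube_metrics(64))  # 6-cube:   diameter 6,  bisection 32
```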

Page 34:

Network topologies - Mesh

Two-dimensional mesh – very intuitive, indeed

[Diagram: a 2D mesh — resources attached to switch nodes, with nodes connected to their neighbours by links.]
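To show how packets are commonly steered across such a mesh, here is a minimal sketch of deterministic dimension-order (XY) routing, a typical choice for mesh NoCs (illustrative Python code, not from the slides):

```python
def xy_route(src, dst):
    """Dimension-order (XY) routing on a 2D mesh: fix X first, then Y.
    src and dst are (x, y) node coordinates; returns the visited nodes."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                      # travel along the X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then along the Y dimension
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

Because every packet turns from X to Y in the same way, this scheme is deadlock-free on a mesh without needing extra virtual channels.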

Page 35:

Network topologies - Torus

Torus and folded torus extend the mesh
– a folded torus has links that are 2x the length of the mesh links!

Page 36:

Network topologies - Ring

The ring is the simplest of the networks
– unidirectional or bidirectional
– does not correspond to the physical placement of the resources!

Page 37:

Network topologies – Octagon and Spidergon

Extend the ring by introducing shortcuts:
– Octagon: 4 diagonals
– Spidergon: N diagonals

Page 38:

Routing schemes

Deterministic vs. adaptive:
– deterministic routing is fixed (by a routing table or by information in the packet)
– adaptive routing can adapt to, e.g., network congestion

Packet handling property:
– store-and-forward: the whole packet is always stored in a node before being forwarded
– virtual cut-through: forwarding may start as soon as there is enough room in the next node
– wormhole routing: each flit is routed further immediately

(A rough latency comparison of these packet-handling schemes follows below.)
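As a rough illustration of the difference between the packet-handling schemes, a back-of-the-envelope latency model in Python (illustrative assumptions: fixed link bandwidth, a fixed per-hop routing delay, no contention; function names are made up):

```python
def store_and_forward_latency(hops, packet_bits, link_bw, hop_delay):
    """The whole packet is received at each node before being forwarded."""
    serialization = packet_bits / link_bw   # cycles to push the packet over one link
    return hops * (hop_delay + serialization)

def cut_through_latency(hops, packet_bits, header_bits, link_bw, hop_delay):
    """Only the header is delayed per hop; the body is pipelined behind it."""
    header_time = header_bits / link_bw
    return hops * (hop_delay + header_time) + packet_bits / link_bw

# Example: 6 hops, 512-bit packet, 32-bit header, 32-bit links, 1-cycle routing delay
print(store_and_forward_latency(6, 512, 32, 1))   # 102 cycles
print(cut_through_latency(6, 512, 32, 32, 1))     # 28 cycles
```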

Page 39:

Network performance

[Chart: throughput and latency as a function of network load (%).]

Throughput:
– throughput per input
– aggregate throughput
– bisection throughput

Latency:
– time from data injection to readout completion

The two are inter-related, and both are a function of network loading!

Page 40:

Quality-of-Service (QoS)

Normally packet-switched networks provide best-effort throughput to all traffic
– "your mileage may vary..."

For some services, guaranteed service is needed:
– predictable delays
– guaranteed throughput

This can be handled by:
– a circuit-switched channel
– time-multiplexing (giving a certain number of time slots to the priority service)
– an approximation based on assigning packet priorities (which may lead to starvation of other services)

Page 41:

Fault and error tolerance

– More permanent manufacturing defects in large, complex circuits
– More soft errors due to lower error margins
– More dynamic transient faults due to crosstalk and other on-chip noise

For error tolerance, error detection and correction methods from telecommunications can be used to some extent:
– must not cause too much overhead in area and power
– retransmission or error-correcting codes?

Fault tolerance may likewise cause too much overhead:
– the network provides natural redundancy for fixing faults

Page 42:

Synchronous, asynchronous, GALS

– A single-clock SoC is not practical
– A single-clock NoC (the communication part) may be...
– However, synchronization between clock domains is needed anyway
– GALS (Globally Asynchronous, Locally Synchronous) is a natural mode of operation for large NoCs connecting many clock domains
  • synchronization is carried out between the network and the synchronous resource
  • a place for trade-offs: where to place the asynchronous/synchronous border?
– Asynchronous functional blocks are rare, but the network can be fully asynchronous (with synchronizing wrappers)

Page 43:

Our Research group

Prof. Jari Nurmi leads the group.

Research areas:
– Computer architecture:
  • Networks-on-Chip
  • multi-processor architecture design
  • reconfigurable hardware
  • implementation of signal processing for communications
– Navigation and positioning:
  • GNSS receivers
  • indoor positioning
