CDC/CRA CHiPs Mentoring Workshop
High Performance Interconnects

Timothy M. Pinkston
Professor, USC
July 25-27, 2009


TRANSCRIPT

Page 1: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

CDC/CRA
CHiPs Mentoring Workshop
High Performance Interconnects

Timothy M. Pinkston
Professor, USC

July 25-27, 2009

Page 2: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

My Background

• Education:
  – BSEE (minor in CS): The Ohio State Univ., ’85
  – MSEE (Computer Engineering): Stanford U., ’86
  – PhDEE (Computer Engineering, Comp Arch): Stanford U., ’93

• Experience:
  – Industry: AT&T Bell Labs, ’85-’86; IBM Intern, ’89-’90 (summers); Hughes Research Labs (HRL) Doctoral Fellow, ’90-’93
  – Academia: University of Southern California, ’93 - present
  – Government: NSF, Jan. ’06 - Dec. ’08

• Research Interests:
  – Computer systems architecture: interconnection networks, on-chip networks for multicore and multiprocessor systems

• Recent Activities:
  – "Interconnection Networks" with Jose Duato, book chapter in Computer Architecture: A Quantitative Approach, 4th edition, J. L. Hennessy and D. A. Patterson (2006)
  – Lead Program Director for Expeditions in Computing program: NSF CISE, $40M award portfolio in inaugural year (2008)

Page 3: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Interconnection Networks

• The subsystem that connects individual devices together into a community of communicating devices

• Device (End Node):
  – Component in a computer
  – A computer
  – System of computers

• Interconnection Network:
  – Interfaces and Links
  – Communication protocol
  – Routers (switches)

• Goal: Transfer maximum amount of data reliably in least amount of time (& energy, cost) so as not to bottleneck overall system performance

[Figure: end nodes, each consisting of a device with SW and HW interfaces, connected by links to routers that form the interconnection network; multiple networks joined together constitute internetworking]

Page 4: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Different Networks for Different Scales

[Figure: network classes plotted by distance in meters (roughly 5 x 10^-3 to 5 x 10^6) against number of devices interconnected (1 to >100,000):
  – On-Chip Networks (OCNs): shortest distances, fewest devices
  – System-Area Networks (SANs)
  – Local Area Networks (LANs)
  – Wide-Area Networks (WANs): longest distances, most devices]

Page 5: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Increasing Parallelism on Chips

[Figure: minimum feature size (um) vs. year of technology availability, 1980-2015, with representative chip floorplans at successive nodes (adapted from Nhon Quach, Intel): 2um (CPU, FPU, L1, plus a multi-chip module with several processors); 1um (CPU, FPU, L1, MC on one chip); 0.35um (adds on-chip L2); 0.18um (adds on-chip L3 and MC); 0.09um (two CPUs, each with L1 and L2, sharing L3 + MC: the multicore era!); 0.045um (four CPUs sharing L3 + MC: many-core chips)]

Page 6: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects


Increasing Parallelism in Systems

• Blue Gene/L 3D Torus Network

[Figure: IBM Blue Gene (www.ibm.com), a 3-dimensional (XYZ) torus interconnection network connecting 10s to 100s of thousands of devices]

Page 7: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Defects, Faults, Chip Yield and Lifetime

• Trends in chip (system) failure rate

[Figure: chip/system failure rate vs. time, showing the infant mortality period, useful lifetime period, and aging period, with opposing trend arrows for technology scaling and fault-resilient designs]

[Figure: defect data volume normalized to year 2004, plotted by year of production 2004-2016 (source: ITRS'04)]

- Intel predicts at least 5-10% of chip resources will be used for ensuring reliability (Source: "Platform 2015", www.intel.com/go/platform2015)

- Technology scaling adversely impacts chip yield and chip/system failure rate (manufacturing defects, soft and hard faults, wear-out lifetime)

- Adaptive, self-correcting, self-repairable architectures are needed to combat decreasing chip reliability with successive technology generations

Page 8: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Transporting Packets within a Network

• Goal: Transfer maximum amount of data reliably in least amount of time (& energy, cost) so as not to bottleneck overall system perf.

• Network Structure and Functions for Transporting Data Packets
  – Topology: What network paths are possible for packets?
  – Routing: Which of the possible paths are allowable for packets? (see the routing sketch after this list)
  – Flow Control & Arbitration: When are paths available for packets?
  – Switching: How are paths allocated to packets?
  – Router Microarchitecture: Implementation of router internal paths
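Dimension-order (XY) routing, used by the 2-D mesh examples later in these slides, is one concrete answer to the routing question above. A minimal sketch follows (not from the talk); the coordinate scheme and port names are illustrative assumptions.

    # Minimal sketch of dimension-order (XY) routing on a 2-D mesh.
    # Node coordinates and port names ("X+", "X-", "Y+", "Y-", "EJECT")
    # are illustrative assumptions, not taken from the slides.

    def xy_route(current, destination):
        """Return the output port a packet should take at `current` to
        reach `destination`, correcting the X dimension first, then Y."""
        cx, cy = current
        dx, dy = destination
        if cx < dx:
            return "X+"
        if cx > dx:
            return "X-"
        if cy < dy:
            return "Y+"
        if cy > dy:
            return "Y-"
        return "EJECT"  # packet has arrived at its destination node

    if __name__ == "__main__":
        # Trace the hops a packet takes from node (0, 0) to node (2, 1).
        node, dest = (0, 0), (2, 1)
        while node != dest:
            port = xy_route(node, dest)
            print(f"at {node}: forward on {port}")
            step = {"X+": (1, 0), "X-": (-1, 0), "Y+": (0, 1), "Y-": (0, -1)}[port]
            node = (node[0] + step[0], node[1] + step[1])
        print(f"at {node}: {xy_route(node, dest)}")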

Page 9: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Poor flow control can reduce link efficiency
  – "Handshaking" flow control

[Figure: a sender router and a receiver router joined by a data link and a control link. The receiver transmits a handshake when it is ready for the next packet, and the sender can transmit only after receiving the handshake signal; packets pile up in the receiver's buffer queue while the queue is not serviced.]

Page 10: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Poor flow control can reduce link efficiency
  – "Handshaking" flow control
  – simple, but low throughput and high latency (a minimal sketch follows below)

[Figure: the same sender and receiver routers with data and control links; because the sender must wait for each handshake before sending the next packet, transfers are non-pipelined, and packets continue to pile up while the receiver's buffer queue is not serviced.]
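To make the throughput cost concrete, here is a rough simulation sketch of per-packet handshaking (not from the talk); the one-time-unit link delays are illustrative assumptions.

    # Minimal sketch of per-packet "handshaking" flow control.
    # The unit link delays are illustrative assumptions; the point is that
    # the data link sits idle for a full control-link round trip between
    # packets, so transfers are never pipelined.

    def handshake_transfer(num_packets, link_delay=1):
        """Total time to move num_packets across one link when the sender
        must wait for a handshake before each transmission."""
        time = 0
        for _ in range(num_packets):
            time += link_delay  # packet crosses the data link
            time += link_delay  # handshake returns on the control link
        return time

    if __name__ == "__main__":
        for n in (1, 10, 100):
            total = handshake_transfer(n)
            print(f"{n:3d} packets take {total} time units; "
                  f"data link busy {n}/{total} of the time")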

Page 11: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Poor flow control can reduce link efficiency
  – "Stop & Go" flow control

[Figure: sender and receiver routers with data and control links, now with pipelined transfer. The sender holds a Stop/Go control bit: when the receiver's buffer queue reaches the Stop threshold, a Stop notification is signaled, and while in Stop the sender cannot inject packets; a packet is injected only if the control bit is "Go". Packets accumulate while the queue is not serviced.]

Page 12: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Poor flow control can reduce link efficiency
  – "Stop & Go" flow control
  – improved throughput and latency if large enough buffer queues (a minimal sketch follows below)

[Figure: when the receiver's buffer queue drains to the Go threshold, a "Go" notification is sent and the sender's control bit flips back to Go; a packet is injected only while the control bit is "Go", and transfers remain pipelined.]
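Here is a minimal Stop & Go sketch (not from the talk): the receiver toggles the sender's control bit at two queue thresholds. The queue size and threshold values below are illustrative assumptions.

    # Minimal sketch of "Stop & Go" flow control on one link.
    # Queue size and thresholds are illustrative assumptions; the receiver
    # signals Stop when its queue fills past stop_threshold and Go once it
    # drains back to go_threshold, toggling the sender's control bit.

    from collections import deque

    class StopGoLink:
        def __init__(self, queue_size=8, stop_threshold=6, go_threshold=2):
            self.queue = deque()
            self.queue_size = queue_size
            self.stop_threshold = stop_threshold
            self.go_threshold = go_threshold
            self.control_bit = "Go"  # sender-side control bit

        def sender_inject(self, packet):
            """Sender injects a packet only while its control bit reads Go."""
            if self.control_bit != "Go":
                return False
            self.queue.append(packet)
            if len(self.queue) >= self.stop_threshold:
                self.control_bit = "Stop"  # Stop notification signaled
            return True

        def receiver_service(self):
            """Receiver drains one packet and may signal Go."""
            if self.queue:
                self.queue.popleft()
            if self.control_bit == "Stop" and len(self.queue) <= self.go_threshold:
                self.control_bit = "Go"  # Go notification sent

    if __name__ == "__main__":
        link = StopGoLink()
        sent = sum(link.sender_inject(i) for i in range(10))
        print(f"injected {sent} packets before Stop, control bit = {link.control_bit}")
        for _ in range(5):
            link.receiver_service()
        print(f"after servicing 5 packets, control bit = {link.control_bit}")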

Page 13: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Poor flow control can reduce link efficiency
  – "Credit-based" flow control

[Figure: sender and receiver routers with data and control links and pipelined transfer. The sender keeps a credit counter (drawn counting down from 10 to 0) and sends packets whenever the credit counter is not zero; packets accumulate in the receiver's buffer queue while it is not serviced.]

Page 14: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Poor flow control can reduce link efficiency
  – "Credit-based" flow control
  – improved throughput and latency with smaller buffer queues (a minimal sketch follows below)

[Figure: once the receiver services its buffer queue, it sends credits back after they become available (e.g., +5), and the sender resumes injecting packets when its credit counter > 0.]
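A minimal credit-based sketch (not from the talk): the sender spends one credit per packet and the receiver returns credits as it drains its buffer. The initial credit count of 10 mirrors the counter drawn on the slide; everything else is an illustrative assumption.

    # Minimal sketch of credit-based flow control on one link.
    # The sender starts with one credit per receiver buffer slot (10 here)
    # and spends a credit per packet; the receiver returns credits as it
    # drains its buffer queue.

    from collections import deque

    class CreditLink:
        def __init__(self, credits=10):
            self.credit_counter = credits  # sender-side counter
            self.queue = deque()           # receiver-side buffer queue

        def sender_inject(self, packet):
            """Sender transmits whenever its credit counter is not zero."""
            if self.credit_counter == 0:
                return False
            self.credit_counter -= 1
            self.queue.append(packet)
            return True

        def receiver_service(self, count=1):
            """Receiver drains packets and returns that many credits."""
            freed = min(count, len(self.queue))
            for _ in range(freed):
                self.queue.popleft()
            self.credit_counter += freed  # credits returned to sender
            return freed

    if __name__ == "__main__":
        link = CreditLink(credits=10)
        sent = sum(link.sender_inject(i) for i in range(12))
        print(f"sent {sent} packets, credits left = {link.credit_counter}")
        link.receiver_service(count=5)  # queue serviced: +5 credits returned
        print(f"after +5 credits, counter = {link.credit_counter}; "
              f"sender can inject again: {link.sender_inject('next')}")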

Page 15: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Improving flow control with split buffer organizations
  – A router with a single buffer queue per input port

[Figure: a K x K router. Input port i has a single buffer queue holding packets destined, in arrival order, for a mix of output ports (X+, X-, Y+, Y-); the router's output ports X+, X-, Y+, and Y- are shown on the right.]

Page 16: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Improving flow control with split buffer organizations
  – Head-of-line blocking in a router with single queue per input port (see the sketch below)

[Figure: a 2-dimensional mesh network with dimension-order routing. In a router with ports X+, X-, Y+, and Y-, a blocked packet (marked X) at the head of the single input queue prevents the packets queued behind it from advancing, even over an idle physical channel.]
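A minimal head-of-line blocking sketch (not from the talk): with one FIFO per input port, a blocked head packet stalls everything behind it even when those packets' output ports are idle. The port names and the choice of blocked port are illustrative assumptions.

    # Minimal sketch of head-of-line (HoL) blocking with a single FIFO
    # per input port. Output port names and which port is blocked are
    # illustrative assumptions, not taken from the slides.

    from collections import deque

    def service_single_queue(queue, blocked_ports):
        """Forward packets from the head of one FIFO; stop at the first
        packet whose output port is blocked (everything behind it stalls)."""
        forwarded = []
        while queue and queue[0] not in blocked_ports:
            forwarded.append(queue.popleft())
        return forwarded

    if __name__ == "__main__":
        # Packets at input port i, labeled by requested output port.
        queue = deque(["X+", "Y-", "Y+", "X-"])
        # Output port X+ is busy downstream; Y-, Y+, X- are idle.
        done = service_single_queue(queue, blocked_ports={"X+"})
        print(f"forwarded: {done}")  # [] (nothing moves)
        print(f"stalled behind the head: {list(queue)}")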

Page 17: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Improving flow control with split buffer organizations
  – A router with two queues per input port → two virtual channels

[Figure: a K x K router. A DEMUX at input port i steers arriving packets into a split buffer queue of two virtual channels, separating packets bound for different output ports (X+, X-, Y+, Y-).]

Page 18: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Improving flow control with split buffer organizations
  – HoL blocking reduced in a router with two queues per input port (see the sketch below)

[Figure: a 2-dimensional mesh network with dimension-order routing and two virtual channels per physical channel; a packet blocked (marked X) at the head of virtual channel 0 no longer stalls packets that can advance on virtual channel 1.]
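Extending the previous sketch (again not from the talk): splitting the same buffering into two virtual channels lets packets behind a blocked head keep moving. The demultiplexing rule used here (X-bound packets to VC0, Y-bound packets to VC1) is an illustrative assumption.

    # Minimal sketch of two virtual channels (VCs) per input port reducing
    # HoL blocking. The VC0/VC1 demux rule (X-bound vs. Y-bound packets)
    # is an illustrative assumption.

    from collections import deque

    def demux_into_vcs(packets):
        """Split one arrival stream into two per-VC FIFOs at the input port:
        X-bound packets to VC0, Y-bound packets to VC1."""
        vc0, vc1 = deque(), deque()
        for port in packets:
            (vc0 if port.startswith("X") else vc1).append(port)
        return [vc0, vc1]

    def service_vcs(vcs, blocked_ports):
        """Each VC forwards from its own head; a blocked head in one VC
        does not stall the other VC."""
        forwarded = []
        for vc in vcs:
            while vc and vc[0] not in blocked_ports:
                forwarded.append(vc.popleft())
        return forwarded

    if __name__ == "__main__":
        vcs = demux_into_vcs(["X+", "Y-", "Y+", "X-"])
        done = service_vcs(vcs, blocked_ports={"X+"})
        print(f"forwarded despite X+ being blocked: {done}")
        print(f"still stalled: {[list(vc) for vc in vcs]}")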

Page 19: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Improving flow control with split buffer organizations
  – HoL blocking not eliminated in a router with virtual channels

[Figure: the same 2-dimensional mesh network with dimension-order routing and two virtual channels per physical channel; when no VCs are available downstream (marked X), a blocked packet at the head of a virtual channel still stalls the packets queued behind it in that channel.]

Page 20: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Improving flow control with split buffer organizations
  – A router with virtual output queuing (VOQ) requires k queues

[Figure: a K x K router. A DEMUX at input port i steers arriving packets into a split buffer queue with one queue per output port (X+, X-, Y+, Y-).]

Page 21: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Improving flow control with split buffer organizations
  – HoL blocking eliminated at router with VOQ, but not at neighbor (see the sketch below)

[Figure: a 2-dimensional mesh network with dimension-order routing (two virtual channels per physical channel); VOQ removes HoL blocking at the local router, but HoL blocking still occurs at the neighboring router!]
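A minimal virtual output queueing sketch (not from the talk): each input port keeps one FIFO per output port, k queues for a k-port router, so a blocked output stalls only the packets that actually want it. Port names are the same illustrative assumptions as above.

    # Minimal sketch of virtual output queueing (VOQ): each input port keeps
    # one FIFO per output port (k queues for a k-port router), so a blocked
    # output only stalls packets that actually want that output.

    from collections import deque

    OUTPUT_PORTS = ["X+", "X-", "Y+", "Y-"]  # illustrative 4-port router

    def build_voqs(packets):
        """Demultiplex arriving packets into one queue per output port."""
        voqs = {port: deque() for port in OUTPUT_PORTS}
        for port in packets:
            voqs[port].append(port)
        return voqs

    def service_voqs(voqs, blocked_ports):
        """Every unblocked output drains its own queue independently."""
        forwarded = []
        for port, queue in voqs.items():
            if port not in blocked_ports:
                forwarded.extend(queue)
                queue.clear()
        return forwarded

    if __name__ == "__main__":
        voqs = build_voqs(["X+", "Y-", "Y+", "X-", "Y-"])
        done = service_voqs(voqs, blocked_ports={"X+"})
        print(f"forwarded: {done}")                    # everything not bound for X+
        print(f"held only for X+: {list(voqs['X+'])}")  # no other packet is stalled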

Page 22: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Resilient Interconnection Networks

• Reduce chip-kill in the presence of permanent faults with dynamic reconfiguration of on-chip networks (see the sketch below)

[Figure: a 2-D mesh whose up*/down* spanning tree is shown with its root, the skyline region around a fault, and the new root chosen after reconfiguration]

- A 2-D mesh network with XY routing (deadlock-free)
- If a core's router & link is faulty → causes five failed links
- Network can be dynamically reconfigured to up*/down* (u*/d*) routing, remaining deadlock-free!
- Later, if the u*/d* root fails → causes four links to fail
- Only the up*/down* link directions within the skyline region are affected by the fault
- Reconfigured again to regain connectivity → no chip-kill!!

Many such fascinating problems in need of innovative solutions!
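Up*/down* routing, the reconfiguration target above, keeps routing deadlock-free by restricting every legal path to some number of "up" hops (toward the root of a spanning tree) followed by some number of "down" hops. Here is a minimal legality-check sketch (not from the talk); the hop sequences are illustrative assumptions.

    # Minimal sketch of the up*/down* routing rule: a legal path takes some
    # number of "up" hops (toward the spanning-tree root) followed by some
    # number of "down" hops, and never goes up again after going down.

    def is_legal_updown_path(hops):
        """`hops` is a sequence of 'up'/'down' link directions along a path.
        Legal up*/down* paths never take an 'up' hop after a 'down' hop."""
        gone_down = False
        for hop in hops:
            if hop == "down":
                gone_down = True
            elif gone_down:  # an "up" hop after a "down" hop is illegal
                return False
        return True

    if __name__ == "__main__":
        print(is_legal_updown_path(["up", "up", "down", "down"]))  # True
        print(is_legal_updown_path(["up", "down", "up"]))          # False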

Page 23: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

In Conclusion

• Interconnection networks are key to exploiting parallelism
  – on-chip networks between cores within a chip
  – off-chip networks between chips and boards across a system

• Many open research questions remain:
  – network topology, routing, arbitration, switching, and flow control designs that maximize throughput and minimize latency
  – innovative resource management techniques that enable adaptive, power-aware, fault-resilient, reliable interprocessor communication
  – the list goes on ...

• High performance interconnection network design is an exciting area of computer systems architecture research

• The future awaits!

Page 24: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Interconnect Media & Form Factors

[Figure: media types plotted against distance in meters (0.01 to >1,000) across OCNs, SANs, LANs, and WANs; in roughly increasing order of distance: metal layers, printed circuit boards, InfiniBand connectors, Myrinet connectors, Cat5E twisted pair (Ethernet), coaxial cables, and fiber optics.]

Page 25: CDC/CRA CHiPs  Mentoring Workshop High Performance Interconnects

Flow Control of Data Packets

• Comparison of "Stop & Go" with "Credit-based"

[Figure: two timelines of the flow control latency observed by the receiver buffer.
  Stop & Go: the sender transmits while in Go; a Stop signal is returned by the receiver; the sender stops transmission; the last packet reaches the receiver buffer; packets in the buffer get processed; a Go signal is returned to the sender; the sender resumes transmission; the first new packet reaches the buffer.
  Credit-based: the sender transmits packets and uses its last credit; the last packet reaches the receiver buffer; packets get processed and credits are returned; the returned credits reach the sender; the sender resumes transmission; the first new packet reaches the buffer.]

A small back-of-the-envelope sketch of the buffering implications follows below.
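To make the comparison concrete, here is a rough buffer-sizing sketch (not from the talk) of what each scheme needs to keep the link busy. The formulas are illustrative assumptions for one common formulation; the exact constants depend on where the Stop and Go thresholds are placed.

    # Back-of-the-envelope sketch comparing minimum buffering needed to keep
    # a link fully busy. These formulas are illustrative assumptions (they
    # depend on the Stop/Go threshold settings), not figures from the slides.

    def min_buffer_credit_based(round_trip_packets):
        """Credit-based: the buffer must cover one credit round trip so the
        sender never runs out of credits on an uncongested link."""
        return round_trip_packets

    def min_buffer_stop_and_go(round_trip_packets):
        """Stop & Go: roughly one round trip of slack above the Stop threshold
        (to absorb in-flight packets after Stop is sent) plus one round trip
        below the Go threshold (to keep draining while Go travels back)."""
        return 2 * round_trip_packets

    if __name__ == "__main__":
        for rtt in (4, 8, 16):  # round-trip time expressed in packet times
            print(f"RTT = {rtt:2d} packets: credit-based needs >= "
                  f"{min_buffer_credit_based(rtt)}, Stop & Go needs >= "
                  f"{min_buffer_stop_and_go(rtt)} buffer slots")

This is consistent with the slide's point that credit-based flow control achieves improved throughput and latency with smaller buffer queues.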