
The Kill Rule for Multicore

Anant Agarwal

MIT and Tilera Corp.

Multicore is Moving Fast

Corollary of Moore's Law: the number of cores will double every 18 months.

What must change to enable this growth?

[Chart: number of cores per chip over time, log scale from 1 to 10,000]

Multicore Drivers Suggest Three Directions

• Diminishing returns
  – Smaller structures
• Power efficiency
  – Smaller structures
  – Slower clocks, voltage scaling
• Wire delay
  – Distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores
3. How programming will evolve

How We Size Core Resources

[Diagram: start with 3 cores, each a processor plus a small cache (core IPC = 1, core area = 1), giving chip IPC = 3. Extra area can buy either a fourth small-cache core or bigger caches for the existing three.]

• 4 cores, small cache: core IPC = 1, core area = 1, chip IPC = 4
• OR 3 cores, big cache: core IPC = 1.2, core area = 1.3, chip IPC = 3.6

IPC = 1 / (1 + m × latency), where IPC is instructions per cycle, m is the cache miss rate, and latency is the miss penalty in cycles.
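To make the comparison concrete, here is a minimal sketch of the arithmetic above; the fixed area budget of 4 core-units is an assumption chosen to make the two options directly comparable, and the (area, IPC) numbers come from the slide.

```python
# Minimal sketch of the tradeoff above: given a fixed chip area budget,
# compare spending area on more small-cache cores vs. bigger per-core caches.

def chip_ipc(core_area: float, core_ipc: float, area_budget: float) -> float:
    """Chip IPC = (number of cores that fit in the budget) x per-core IPC."""
    n_cores = int(area_budget // core_area)
    return n_cores * core_ipc

AREA_BUDGET = 4.0  # assumed total area, in units of one small-cache core

small_cache = chip_ipc(core_area=1.0, core_ipc=1.0, area_budget=AREA_BUDGET)
big_cache   = chip_ipc(core_area=1.3, core_ipc=1.2, area_budget=AREA_BUDGET)

print(f"small-cache cores: chip IPC = {small_cache}")  # 4 cores -> 4.0
print(f"big-cache cores:   chip IPC = {big_cache}")    # 3 cores -> 3.6
```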

The "KILL Rule" for Multicore: Kill If Less than Linear

A resource in a core must be increased in area only if the core's performance improvement is at least proportional to the core's area increase.

Put another way: increase a resource's size only if every 1% increase in core area yields at least a 1% increase in core performance.

This leads to power-efficient multicore design.
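As a sketch, the rule can be written as a simple predicate over measured area and performance; the function name and interface below are illustrative, not from the talk.

```python
# Minimal sketch of the KILL rule as a predicate (illustrative, not from the slides):
# grow a per-core resource only if the relative performance gain is at least
# as large as the relative area gain.

def kill_rule_allows(old_area: float, new_area: float,
                     old_perf: float, new_perf: float) -> bool:
    """True if the performance improvement is at least proportional
    to the area increase, i.e. the return on area is at least linear."""
    area_increase = (new_area - old_area) / old_area
    perf_increase = (new_perf - old_perf) / old_perf
    return perf_increase >= area_increase

# Example: growing a structure by 10% area for only 5% more IPC fails the rule.
print(kill_rule_allows(1.0, 1.1, 1.00, 1.05))  # False -> kill the increase
```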

Kill Rule for Cache Size Using a Video Codec

Cache size   Core area   Core IPC   Cores per chip   Chip IPC   Area increase   Core IPC increase
512 B        1.00        0.04       100               4          -               -
2 KB         1.03        0.17        97              17         +3%             +325%
4 KB         1.07        0.25        93              23         +4%             +47%
8 KB         1.15        0.29        87              25         +7%             +16%
16 KB        1.31        0.31        76              24         +14%            +7%
32 KB        1.63        0.32        61              19         +24%            +3%

Chip IPC peaks at 8 KB: beyond that point, core IPC grows more slowly than core area, so the KILL rule kills further cache growth.
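A short sketch applying the rule to the cache-size data above, growing the cache only while each step gives at least a linear return on area; the greedy stop-at-first-failure scan is an assumption of this sketch.

```python
# Sketch (data from the video-codec slide above): walk up the cache sizes and
# apply the KILL rule at each step, growing the cache only while core IPC
# improves at least as fast as core area.

# (cache size, core area, core IPC)
points = [
    ("512B", 1.00, 0.04),
    ("2KB",  1.03, 0.17),
    ("4KB",  1.07, 0.25),
    ("8KB",  1.15, 0.29),
    ("16KB", 1.31, 0.31),
    ("32KB", 1.63, 0.32),
]

chosen = points[0]
for prev, cur in zip(points, points[1:]):
    area_gain = (cur[1] - prev[1]) / prev[1]
    ipc_gain = (cur[2] - prev[2]) / prev[2]
    if ipc_gain >= area_gain:   # KILL rule: at least linear return on area
        chosen = cur
    else:
        break

print("KILL rule stops at", chosen[0])  # 8KB, matching the chip-IPC peak of 25
```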

Well Beyond Diminishing Returns

[Die photo: Madison Itanium 2, with most of the die area devoted to the L3 cache system. Photo courtesy Intel Corp.]

Slower Clocks Suggest Even Smaller Caches

Insight: maintain constant instructions per cycle (IPC), where
IPC = 1 / (1 + m × latency in cycles)

At 4 GHz:  IPC = 1 / (1 + 0.5% × 200) = 1/2
At 1 GHz:  IPC = 1 / (1 + 2.0% × 50)  = 1/2

At the slower clock the miss penalty in cycles is 4x smaller, so the miss rate m can be 4x higher for the same IPC. With miss rate scaling as 1/√(cache size), this implies the cache can be 16x smaller!
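A quick numerical check of the argument above; the 1/√(cache size) miss-rate scaling used in the last step is the standard rule of thumb implied by the slide's 4x-to-16x conclusion.

```python
# Quick check of the constant-IPC argument above.
# ipc() follows the slide's model: IPC = 1 / (1 + m * miss penalty in cycles).

def ipc(miss_rate: float, miss_penalty_cycles: float) -> float:
    return 1.0 / (1.0 + miss_rate * miss_penalty_cycles)

print(ipc(0.005, 200))  # 4 GHz core, 0.5% miss rate, 200-cycle penalty -> 0.5
print(ipc(0.020, 50))   # 1 GHz core, 2.0% miss rate,  50-cycle penalty -> 0.5

# With miss rate ~ 1/sqrt(cache size), tolerating a 4x higher miss rate
# lets the cache shrink by 4**2 = 16x, as the slide concludes.
print(4 ** 2)  # 16
```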

Multicore Drivers Suggest Three Directions

• Diminishing returns
  – Smaller structures
• Power efficiency
  – Smaller structures
  – Slower clocks, voltage scaling
• Wire delay
  – Distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores
3. How programming will evolve

The KILL rule suggests smaller caches for multicore: if the clock is slower by x, then for constant IPC the cache can be smaller by x².

The KILL rule applies to all multicore resources: issue width (2-way is probably ideal [Simplefit, TPDS 7/2001]), cache sizes, and the number of memory hierarchy levels.
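A short derivation of the x² claim, written as a math sketch; it assumes the same square-root miss-rate rule of thumb used above.

```latex
% Why constant IPC with a clock slower by x allows a cache smaller by x^2
% (assumes the rule-of-thumb miss-rate scaling m \propto S^{-1/2}).
\[
  \mathrm{IPC} = \frac{1}{1 + m\,\ell},
  \qquad \ell = \text{miss penalty in cycles} \propto f \ (\text{clock frequency}).
\]
\[
  \text{Constant IPC} \;\Rightarrow\; m\,\ell = \text{const}
  \;\Rightarrow\; m \propto \frac{1}{f},
  \qquad
  m \propto S^{-1/2} \;\Rightarrow\; S \propto m^{-2} \propto f^{2}.
\]
\[
  f \to \frac{f}{x} \;\Rightarrow\; S \to \frac{S}{x^{2}}.
\]
```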

Interconnect Options

[Diagrams of three interconnects, each core shown with a processor (p), cache (c), and, where present, a switch (s):
 • Bus Multicore: all cores share a single bus
 • Ring Multicore: cores connected through switches in a ring
 • Mesh Multicore: cores connected through switches in a 2D mesh, with packet routing through the switches]

Bisection Bandwidth is Important

[Same three diagrams: Bus Multicore, Ring Multicore, Mesh Multicore]

Concept of Bisection Bandwidth

[Same three diagrams: Bus Multicore, Ring Multicore, Mesh Multicore, each cut in half to show the links crossing the bisection]

In the mesh, bisection bandwidth increases as we add more cores.
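To make the comparison concrete, here is a sketch of how bisection bandwidth scales with core count for the three topologies; the formulas are the standard textbook results, not numbers from the slides.

```python
# Sketch of bisection bandwidth vs. core count for the three topologies above,
# in units of a single link's bandwidth (standard results, not slide data).
import math

def bisection_links(topology: str, n_cores: int) -> float:
    if topology == "bus":
        return 1                   # one shared medium crosses any cut
    if topology == "ring":
        return 2                   # a cut through a ring crosses two links
    if topology == "mesh":
        return math.sqrt(n_cores)  # sqrt(n) x sqrt(n) mesh: sqrt(n) links cross the middle
    raise ValueError(topology)

for n in (16, 64, 256):
    print(n, [bisection_links(t, n) for t in ("bus", "ring", "mesh")])
# Only the mesh's bisection bandwidth grows as cores are added.
```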

Meshes are Power Efficient

[Chart: % energy savings of mesh vs. bus, across benchmarks and numbers of processors]

Meshes Offer Simple Layout

Example: MIT's Raw multicore (www.cag.csail.mit.edu/raw)

• 16 cores
• Demonstrated in 2002
• 0.18 micron
• 425 MHz
• IBM SA27E standard cell
• 6.8 GOPS

Multicore

• Single chip
• Multiple processing units
• Multiple, independent threads of control, or program counters (MIMD)

[Diagram: conventional multicore with processors and caches sharing a bus and an L2 cache]

Tiled Multicore

A tiled multicore satisfies one additional property: it is fully distributed, with no centralized resources.

[Diagram: a 4x4 grid of tiles, each tile containing a processor, cache, and switch]

Multicore Drivers Suggest Three Directions

• Diminishing returns
  – Smaller structures
• Power efficiency
  – Smaller structures
  – Slower clocks, voltage scaling
• Wire delay
  – Distributed structures
• Multicore programming

1. How we size core resources
2. How we connect the cores → mesh-based tiled multicore
3. How programming will evolve

Multicore Programming Challenge

• "Multicore programming is hard." Why?
  – It is new
  – It is misunderstood: some sequential programs are harder
  – Current tools are where VLSI design tools were in the mid-80s
  – Standards are needed (tools, ecosystems)

• This problem will be solved soon. Why?
  – Multicore is here to stay
  – Intel webinar: "Think parallel or perish"
  – Opportunity to create the API foundations
  – The incentives are there

Old Approaches Fall Short

• Pthreads
  – The Intel webinar likens it to the assembly language of parallel programming
  – Data races are hard to analyze
  – No encapsulation or modularity
  – But it is evolutionary, and OK in the interim

• DMA with external shared memory
  – DSP programmers favor DMA
  – Explicit copying from global shared memory to local store
  – Wastes pin bandwidth and energy
  – But it is evolutionary and simple, with modularity and a small core memory footprint

• MPI
  – The province of HPC users
  – Based on sending explicit messages between private memories
  – High overheads and a large core memory footprint

But there is a big new idea staring us in the face.

Inspiration from ASICs: Streaming

[Diagram: a pipeline of memories connected by streams of data flowing over hardware FIFOs]

• Streaming is energy efficient and fast
• The concept is familiar and well developed in hardware design and simulation languages

Streaming is Familiar – Like Sockets

• The basis of networking and internet software
• Familiar and popular
• Modular and scalable
• Conceptually simple
• Each process can use existing sequential code

[Diagram: a sender process and a receiver process connected over an interconnect by a channel between Port1 and Port2]
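To ground the sockets analogy, here is a minimal POSIX-only sketch (not from the slides) of a sender and a receiver process streaming over a socket pair, each running ordinary sequential code.

```python
# Minimal sketch of the sockets analogy: two processes stream bytes over a
# socket pair. POSIX-only (uses fork); sizes and chunk counts are arbitrary.
import os
import socket

parent_sock, child_sock = socket.socketpair()

if os.fork() == 0:
    # Receiver process: read the stream until the sender closes its end.
    parent_sock.close()
    while chunk := child_sock.recv(4096):
        print("received", len(chunk), "bytes")
    os._exit(0)
else:
    # Sender process: stream a few chunks, then close to signal end-of-stream.
    child_sock.close()
    for _ in range(3):
        parent_sock.sendall(bytes(1024))
    parent_sock.close()
    os.wait()
```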

Core-to-Core Data Transfer Cheaper than Memory Access

• Energy
  – 32-bit network transfer over a 1 mm channel: 3 pJ
  – 32 KB cache read: 50 pJ
  – External access: 200 pJ

• Latency
  – Register to register (Raw): 5 cycles
  – Cache to cache: 50 cycles
  – DRAM access: 200 cycles

Data based on a 90 nm process node.

Streaming Supports Many Models

• Pipeline
• Client-server
• Broadcast-reduce

Not great for blackboard-style shared state. But then, there is no one-size-fits-all.

Multicore Streaming Can be Way Faster than Sockets

• No fundamental overheads for
  – Unreliable communication
  – High-latency buffering
  – Hardware heterogeneity
  – OS heterogeneity

• Setup is infrequent:
  connect(<send_proc, Port1>, <receive_proc, Port2>)

• Common-case operations are fast and power efficient, with a low memory footprint:
  Put(Port1, Data) on the sender, Get(Port2, Data) on the receiver, repeated for each item

[Diagram: sender process and receiver process joined by a channel between Port1 and Port2 over the interconnect]

See MCA's CAPI standard.
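A sketch of the connect/Put/Get pattern above, using a Python multiprocessing queue as the channel; connect, Put and Get here are illustrative stand-ins, not the actual CAPI API.

```python
# Sketch of the connect/Put/Get pattern above, using a multiprocessing queue
# as the channel. Names are illustrative stand-ins, not the CAPI API.
from multiprocessing import Process, Queue

def sender(port_out: Queue) -> None:
    for i in range(5):
        port_out.put(i)          # Put(Port1, Data): cheap common-case operation
    port_out.put(None)           # end-of-stream marker (an assumption of this sketch)

def receiver(port_in: Queue) -> None:
    while (data := port_in.get()) is not None:   # Get(Port2, Data)
        print("got", data)

if __name__ == "__main__":
    channel = Queue()            # connect(<send_proc, Port1>, <receive_proc, Port2>)
    procs = [Process(target=sender, args=(channel,)),
             Process(target=receiver, args=(channel,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```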

CAPI's Stream Implementation 1

[Diagram: Process A (e.g., FIR1) on Core 1 streams to Process B (e.g., FIR2) on Core 2 of a multicore chip]

Implemented with I/O register-mapped hardware FIFOs in SoCs.

CAPI's Stream Implementation 2

[Diagram: Process A (e.g., FIR) on Core 1 streams to Process B (e.g., FIR) on Core 2, cache to cache over the on-chip interconnect]

Implemented with on-chip cache-to-cache transfers over the on-chip interconnect in general multicores.

Conclusions

• Multicore is here to stay
• Evolve the core and the interconnect
• Create multicore programming standards: users are ready
• Multicore success requires
  – A reduction in core cache size
  – Adoption of a mesh-based on-chip interconnect
  – Use of a stream-based programming API
• Successful solutions will offer an evolutionary transition path
