
Chip Multiprocessor Coherence and Interconnect System Design

Natalie Enright Jerger

PhD Defense, October 20, 2008

Key Contributions – Interconnects (1)

- Trend: proliferation of architectures requiring broadcast and multicast communication
- Challenge: broadcast and multicast are too expensive for on-chip networks
- Solution: Virtual Circuit Tree Multicasting (VCTM) improves throughput with low-overhead routing tables

Key Contributions – Interconnects (2)

- Trend: interconnect latency has a significant impact on overall performance
- Challenge: router overhead contributes substantially to total network latency
- Solution: Hybrid Circuit Switching (HCS) makes circuit switching fast without sacrificing bandwidth

Key Contributions – Cache Coherence

- Trend: the need for all cores to observe coherence requests will be rare
- Challenge: low-latency cache-to-cache transfers without high bandwidth overhead
- Solutions: leverage properties of the interconnection network
  - Virtual Tree Coherence: efficient multicast
  - Circuit-Switched Coherence: optimized pair-wise sharing

Outline

- Overview/Background: on-chip networks, cache coherence
- Virtual Circuit Tree Multicasting
- Hybrid Circuit Switching
- Virtual Tree Coherence
- Circuit-Switched Coherence
- Conclusion/Future Work

Communication System Co-Design

- Communication consists of both:
  - How data is communicated (interconnect)
  - What is communicated, to whom, and when (coherence)
- Focus on co-design:
  - The interconnect provides efficient multicasting -> design a multicast coherence protocol
  - The interconnect optimizes pair-wise communication -> integrate pair-wise sharing into the protocol

On-Chip Interconnection Networks

- Replace global wiring as the communication medium
  - Global wires: prohibitive latency, inefficient, poor scalability
- Benefits: modular network design, high bandwidth, multiple concurrent communication flows
- Challenges: low-latency communication, low power and area

Multicore Examples

[Figure: Sun Niagara, cores connected to cache banks through a crossbar (XBAR)]

[Figure: IBM Cell, cores connected via a ring interconnect]

Many-Core Examples

- Intel Terascale: 80-core prototype, 2-D mesh topology
- MIT RAW and TRIPS also use 2-D mesh topologies

[Figure: 2-D mesh of tiles]

Designing an On-Chip Network

- Routing: path from source to destination
- Switching and flow control: packet switching, virtual channels
- Router micro-architecture: logic, buffers

[Figure: tiled chip with a router per tile]

Routing Design Decisions

- Determine the path through the network
- Unicast (single source to a single destination): highly optimized
- Multicast (one source with multiple destinations): a small percentage of traffic
  - Current solution: replicate the message at the source (multiple unicasts), which is too expensive and unnecessary

[Figure: deterministic X-Y routes on a mesh; a multicast from A to B, C, and D sent as multiple unicasts]
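Since deterministic X-Y routing underlies both the unicast baseline and the multicast-as-multiple-unicasts comparison, a minimal sketch may help. This is illustrative Python only; the function name and port labels are assumptions, not the simulator's API.

```python
# A minimal sketch of dimension-order (X-Y) routing on a 2-D mesh:
# fully correct the X dimension before the Y dimension, which makes
# routes deterministic and deadlock-free.

def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[str]:
    """Return the sequence of output ports taken from src to dst."""
    (sx, sy), (dx, dy) = src, dst
    hops = []
    while sx != dx:                        # X dimension first
        hops.append("E" if dx > sx else "W")
        sx += 1 if dx > sx else -1
    while sy != dy:                        # then Y (y grows southward here)
        hops.append("S" if dy > sy else "N")
        sy += 1 if dy > sy else -1
    hops.append("Eject")                   # deliver to the local node
    return hops
```

A multicast implemented as multiple unicasts simply calls this once per destination, re-injecting the same payload each time even when the routes share a long common prefix.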

Communication Latency

- Network latency
  - Router overheads
  - Network congestion due to inefficient multicasting (multicast as multiple unicasts)
- Coherence protocol
  - Cache-to-cache transfer latency
  - Bandwidth requirements

Existing Coherence Schemes

- Broadcast
  - Pros: low latency; simple ordering (using a bus)
  - Cons: single ordering point bottleneck; low-bandwidth interconnect; high coherence traffic requirements
- Directory
  - Pros: distributed ordering point; mostly unicast coherence traffic; scalable topologies
  - Cons: indirection latency; storage overhead

Full-Chip Coherence: Broadcast

- Too expensive: too much power, too much bandwidth
- The majority of cores do not need to see each broadcast

Full-Chip Coherence: Directory

- Long latency: sharers may be closer than the directory
- Directory storage overheads

[Figure: a store miss travels to the directory, is forwarded to the owner, and the response returns]

Hierarchical Coherence for CMPs

- Limit coherence actions to a subset of cores
  - Reduces bandwidth demands and power consumption
  - A multicast reaches only the active sharers
- Fast cache-to-cache transfers: avoid directory indirection
- Limited storage overhead: leverage coarse-grain information for efficient directories

Workloads & Coherence

- Hierarchical coherence is a good match to applications
- Server consolidation
  - Emerging class of workloads for many-core chips
  - Limited global (inter-VM) communication
  - Communication and coherence dominated by intra-VM sharing
- Traditional workloads
  - Small number of sharers per cache line or region (multiple cache lines)

Outline

- Motivation
- Virtual Circuit Tree Multicasting: baseline inefficiencies, router architecture, results
- Hybrid Circuit Switching
- Virtual Tree Coherence
- Circuit-Switched Coherence
- Conclusion/Future Work

State-of-the-Art Router

- Current router architectures are poorly equipped to handle even a small amount of multicast (MC) traffic

Packet-Switched Router

- 3-stage router with virtual-channel flow control
- Based on the most aggressive recent proposals
- Even this aggressive baseline is not well matched to all types of communication
- Multicast is performed as multiple unicasts

Baseline Router Example

[Figure: per-hop packet-switched pipeline (Buffer Write; Virtual Channel/Switch Allocation; Switch Traversal; Link Traversal) repeated across routers and links. A multicast from node X to A, B, C, and D injects four copies (flits 1A-1D, 2A-2D) that contend for the same virtual channels]

Key Router Problems

- Injection bandwidth: a burst of messages at the network interface
- Redundant (wasteful) use of resources: the same payload occupies extra buffers and links
- Alternative routing: improves throughput, but wastes power
- More resources to solve this problem? More buffers, virtual channels, links?

VCTM Overview

Multiple unicasts vs. tree-based multicast

[Figure: the same destination set reached by separate unicast paths vs. a single tree rooted at A spanning B through G]

VCTM Overview (continued)

Mapping trees to the mesh (or an arbitrary topology) with dimension-order X-Y routing.

[Figure: logical tree rooted at A with children B and C and leaves D through G, embedded onto the mesh along X-Y routes]

Virtual Circuit Tree Multicasting Example

- Builds on the existing state-of-the-art router; unicast performance is not impacted
- Multicast trees are built incrementally:
  1. Multicast from node 0 to destination set <2, 4, 5>
  2. Build the tree incrementally (tree M)
  3. Send 3 unicast setup packets (1 per destination)
  4. Only 3 packets are injected into the network; the multicast then follows tree M

[Figure: the three setup packets (A -> node 2, B -> node 4, C -> node 5) install tree M's per-router output-port entries, e.g. M: <East>, M: <East, South>, M: <Eject, South>, M: <Eject>]

- Injection problem solved
- Link redundancy removed
- Fewer packets improve speculation
- Tree reuse is necessary for effectiveness: there is significant temporal destination-set reuse across all scenarios (a sketch of the per-router tree state follows)
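The per-router state behind the example can be sketched as follows. This is illustrative Python, assuming a table indexed by (source, tree id) as in the router architecture on the next slide; the class and method names are hypothetical.

```python
# A minimal sketch of VCTM forwarding state at one router. The real
# design stores this in a hardware Virtual Circuit Tree table.

PORTS = ("Eject", "N", "S", "E", "W")

class VctmRouter:
    def __init__(self):
        # (src, tree_id) -> set of output ports for this tree here
        self.tree_table: dict[tuple[int, int], set[str]] = {}

    def setup(self, src: int, tree_id: int, port: str) -> None:
        # Each unicast setup packet adds one branch, so the tree is
        # built incrementally as destinations are added.
        self.tree_table.setdefault((src, tree_id), set()).add(port)

    def forward_multicast(self, src: int, tree_id: int) -> set[str]:
        # A single multicast flit is replicated only at branch points,
        # instead of injecting one copy per destination at the source.
        return self.tree_table[(src, tree_id)]
```

In the example above, the router east of node 0 would hold {"E", "S"} for tree M, replicating the flit toward node 2 and down toward nodes 4 and 5.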

VCTM Router Architecture

[Figure: router input ports with normal VCs plus multicast VCs (MVCs); a Virtual Circuit Tree table indexed by (source, VCT number) holds an output-port vector <Id, Ej, N, S, E, W> (e.g. 0 1 0 1 1 0) feeding the routing computation and the virtual channel and switch allocators]

Implementation Details (1)

- Destination set Content Addressable Memory (CAM)
  - Hit: encode the matching tree ID into the multicast header
  - If the destination set is not present: replace a tree and perform setup
- Example: destination set <5, 4, 2> matches CAM entry 2, so tree ID 2 is encoded into the multicast header (a sketch follows)
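A minimal sketch of the destination-set lookup, in illustrative Python; the hardware is a small per-source CAM, not a dictionary, and the LRU replacement policy shown here is an assumption.

```python
# Destination-set CAM sketch: map a set of destinations to a tree ID,
# reusing an existing tree on a hit and evicting on a miss.

class DestSetCam:
    def __init__(self, num_trees: int):
        self.num_trees = num_trees
        self.entries: dict[frozenset, int] = {}   # dest set -> tree id
        self.lru: list[frozenset] = []            # replacement order

    def lookup(self, dests: set) -> tuple[int, bool]:
        key = frozenset(dests)
        if key in self.entries:                   # hit: reuse the tree
            self.lru.remove(key)
            self.lru.append(key)
            return self.entries[key], True
        if len(self.entries) == self.num_trees:   # full: evict a victim
            victim = self.lru.pop(0)
            tree_id = self.entries.pop(victim)
        else:
            tree_id = len(self.entries)
        self.entries[key] = tree_id
        self.lru.append(key)
        return tree_id, False   # miss: caller sends unicast setup packets
```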

Implementation Details (2)

- VCTs provide routing, not resources
  - Multicast arbitration is the same as unicast; VCTs do not pre-allocate resources
  - Multiple arbitration steps at a tree branch: if one desired output is blocked, the other tree branch outputs can still proceed (at the cost of longer buffer occupancy)
- VCT entries are statically partitioned equally among nodes; dynamic partitioning was also explored

Characterizing Multicasts

- Unique destination sets: the combinations of destinations appearing in multicasts
- Number of destinations per multicast

[Figure: destination-set counts and multicast sizes per scenario]

- TokenB: 1 destination set for each node
- VTC and Directory: much larger variety of destination sets
- TRIPS and Directory: small destination sets
- TokenB and Opteron: large destination sets
- VTC: wide variety of sizes

Up to 13% of traffic is multicast; VCTM is an inexpensive solution to support it.


Simulation Methodology

- Network traffic from 5 different scenarios
- Detailed network simulator with cycle-accurate modeling of router stages
- Scenarios evaluated: TokenB, VTC, AMD Opteron, SGI Origin Directory, TRIPS*

*Thanks to Niket Agarwal & Noel Eisley for the TokenB, Opteron, and TRIPS traces

Network Configuration

Topology:              4-ary 2-mesh; 5-ary 2-mesh (TRIPS)
Routing:               Dimension order (X-Y)
Channel width:         16 bytes
Packet size:           1 flit (coherence request = address + command); 5 flits (data); 3 flits (TRIPS)
Virtual channels:      4
Buffers per port:      24
Router ports:          5
Virtual circuit trees: varied from 16 to 4K (1 to 256 VCTs/core)

Power Savings

- On-chip networks consume up to ~36% of chip power [Wang, 2002]
- Links, buffers, and crossbars consume nearly 100% of network power
- Power is saved through activity reduction

Performance Results Summary

- A small number of trees captures the majority of the benefit
- Performance improvement depends on network pressure

[Figure: latency reduction vs. trees per core (1 to 256; up to 4K total trees) per scenario (VTC, TRIPS, ...); e.g. TPC-H: 68%, Art: 55%, SPECweb: 12%]

VCTM vs. Aggressive Network

- VCTM outperforms an aggressive (unrealistic) baseline network

Related Work

- Off-chip multicast routers [Sivaram, 99; Turner, 94; Stunkel, 99; Chiang, 95; Malumbres, 96]
  - Expensive: high-radix routers, large routing tables, targeting multi-stage networks
- On-chip: path-based SoC [Lu, 06]; D-NUCA cache [Jin, 07]; table-based [Rodrigo, 08]

VCTM Summary

- Key contributions
  - Characterized multicasting: support is necessary
  - VCTM improves network throughput
- Improves performance across a variety of scenarios
  - Reduces interconnect latency by up to 90%
  - Reduces switching activity by up to 53%
- A small number of trees is sufficient (8 per core)

Outline

- Motivation
- Virtual Circuit Tree Multicasting
- Hybrid Circuit Switching: brief summary (work presented at Prelim), then extended material
- Virtual Tree Coherence
- Circuit-Switched Coherence
- Conclusion/Future Work

Hybrid Circuit Switching (1)

- Latency is critical: use circuit switching for lower latency
  - A circuit connects resources across multiple hops to avoid router overhead
  - Traditional circuit switching performs poorly
- HCS contributions: a novel setup mechanism and bandwidth stealing

Hybrid Circuit Switching (2)

- Hybrid network: interleaves circuit-switched and packet-switched flits
- Optimizes setup latency
- Improves throughput over traditional circuit switching: packet-switched flits snoop for idle bandwidth
- Reduces interconnect delay by an average of 18% (up to 22%)

Outline

- Motivation
- Virtual Circuit Tree Multicasting
- Hybrid Circuit Switching
- Virtual Tree Coherence: hierarchical coherence, coarse-grain regions, ordering, results
- Circuit-Switched Coherence
- Conclusion/Future Work

Virtual Tree Coherence

- Local coherence: snoopy multicast (a sketch of the root's role follows)
  - Root information present: send the coherence request to the tree root
  - The tree root multicasts the request to the sharers using VCTM and adds the requestor to the sharing vector
  - Sharers supply data (and invalidate if the request is a store)
- Global coherence: no root information
  - Request the root from a coarse directory, OR broadcast to find the root
  - Then perform the first-level actions

Coarse Grain Region Overview

- Take a macro view of memory: better observe access patterns (1 KB regions)
- Use RegionTracker for efficient storage [Zebchuk, 07]
- Region-based optimizations
  - Coarse Grain Coherence Tracking [Cantin, 05]: eliminate unnecessary broadcasts
- Track sharers and roots at coarse granularity
  - Multicast to sharers for data
  - Efficient upgrading for non-shared regions

Selecting Root Node

- The root should be a sharer (for fast ordering)
- First touch: the first requestor becomes the root
  - Risk: a single node becomes the root for many regions
- Migration: migrate the root to a more central location as sharers are added
  - Forces the majority of roots onto center nodes
  - In-flight requests are forwarded from the old root

Global Coherence Trade-offs

- Global coarse directories: storage overhead, but unicast requests
- Global broadcast: no storage overhead; broadcast traffic is expensive but rare

Virtual Tree Coherence Ordering

- Intra-region order
  - Ordering point: the tree root orders all messages to the same address region
  - Multiple roots prevent an ordering bottleneck
- Sharers
  - A multicast must contain all current sharers
  - Too many false sharers wastes power; too few causes incorrect execution

Preserving Network Order

- On an arbitrary topology, messages can become reordered by virtual channel allocation or by adaptive routing
- Restrict virtual channel assignment to prevent reordering in the network: a tree must use the same VC throughout its network traversal (switching VCs mid-route is disallowed; a sketch follows)
- Leaves (sharers) then observe requests in the same order the root processed them
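One way to picture the restriction is a deterministic tree-to-VC mapping that every router computes identically; this sketch is an assumption about how such pinning could look, not the thesis's exact mechanism.

```python
# VC-pinning sketch: flits of a given tree always use the same virtual
# channel network-wide, so two requests the root ordered back-to-back
# compete in FIFO order within one VC and cannot leapfrog each other.

def allocate_vc(tree_id: int, num_vcs: int) -> int:
    # Deterministic mapping applied at every router along the tree.
    return tree_id % num_vcs
```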

Virtual Tree Coherence Example

Competing store requests to address A[0] from nodes 0 and 1.

[Figure: region table with region A (sharers 0, 1, 4, 5; root 4) and region B (sharers 2, 3; root 2). The root orders node 1's request ahead of node 0's, so every sharer observes the order <1, 0>: node 1 receives the data and completes its store while the other copies of A[0] are invalidated; node 0's store is then ordered and completes, leaving A[0] modified at node 0]

Inter-Region Order (1)

- Collecting acknowledgments on every store miss (Opteron-like) is safe, but given our intra-region tree order it is overly conservative
- Before a dirty block becomes visible to the system (leaves the core), fence on all previously dirtied, unfenced regions
  - A fence is issued for each such region on that region's tree
- A memory barrier requires that fences on all dirty/unfenced regions complete (PowerPC-style consistency)

Inter-Region Order (2)

- Write atomicity (see the sketch below)
  - Remote reads to dirty/unfenced regions initiate the needed fence(s) and are deferred until the fence(s) complete
  - Fence acks guarantee that all prior stores (invalidates) to that region have been sunk on all cores
- Write serialization
  - Writes within a region are serialized through tree order
  - Writes across a region boundary (e.g., St A then St B) are serialized with a fence: an eager fence is issued on the tree of the earlier store (A)

Simulation Methodology

- PHARMsim: full-system multi-core simulator with a detailed network-level model including VCTM
- Baseline: directory coherence
- Workloads
  - Commercial: SPECjbb, SPECweb, TPC-H, TPC-W
  - Scientific: Barnes-Hut, Ocean, Radiosity, Raytrace
  - Server consolidation

Simulation Configuration

Processors
  Cores:            16 or 64, in-order, general purpose

Memory System
  L1 I/D caches:    32 KB, 2-way set associative, 1 cycle
  Private L2:       1 MB, 8-way set associative, 6 cycles, 64-byte lines
  RegionTracker:    1024 sets, 8 ways, 1 KB regions (associated with each L2)

Interconnect: 4x4 2-D mesh
  Packet-switched baseline:        3 router stages; 8 virtual channels with 4 buffers each
  Virtual Circuit Tree Multicasting: 64 trees per source node; 1024 or 4096 total trees

Performance (16 cores)

- Outperforms the directory protocol by an average of 25%
- The difference between VTC-Bcast and VTC-Mcast-* will become more pronounced in larger systems

Server Consolidation Performance

- The performance gap between VTC-Bcast and VTC-Mcast-Dir grows from 6% to 16% when moving from 16 to 64 cores

Ordering Hop Counts

- Significant reduction in the hop count needed to order requests: 21% and 41% on average

Tree Root Migration: Hotspots

- Migration based on lowest ordering hop count results in severe congestion

Activity (16 cores)

- VTC-Mcast-Dir increases bandwidth by 2.1x over the directory protocol

Related Work

- Unordered interconnects
  - Logical/virtual ordering: UnCorq [Strauss, 07]
  - Token counting: Token Coherence [Martin, 03]
- Hierarchical coherence
  - Virtual Hierarchies [Marty, 07]
  - Multiprocessors built from SMP building blocks [Laudon, 97; Lenoski, 90; Lovett, 96; Hagersten, 99]
- Multicast coherence
  - Multicast Snooping, Destination Set Prediction [Bilir, 99; Martin, 03]

Virtual Tree Coherence Summary

- High-bandwidth, low-latency ordering substrate
- Fast cache-to-cache transfers
- Efficient bandwidth utilization through multicasting
- Coarse-grain tracking of sharers
- Scales well for server consolidation workloads

Outline

- Motivation
- Virtual Circuit Tree Multicasting
- Hybrid Circuit Switching
- Virtual Tree Coherence
- Circuit-Switched Coherence: brief summary (work presented at Prelim), then extended material
- Conclusion/Future Work

Circuit-Switched Coherence

- Goal: better exploit circuits through the coherence protocol
  - Observe pair-wise sharing
  - Take the directory off the critical path (the directory remains the sole ordering point)
- The performance improvement of HCS + protocol optimization exceeds either alone: the protocol optimization drives up circuit reuse, better utilizing HCS
- Performance improvement of up to 18%

Outline

- Motivation
- Virtual Circuit Tree Multicasting
- Hybrid Circuit Switching
- Virtual Tree Coherence
- Circuit-Switched Coherence
- Conclusion/Future Work

Conclusion

- Communication behavior must influence interconnect design
  - The coherence protocol dictates interconnect traffic
  - Designing networks solely for unicast traffic is problematic
- The coherence protocol should take advantage of interconnect properties
  - If the interconnect provides implicit ordering, use it

Future Work

- Network quality of service/fairness: root selection creates potential hotspots in the network
- Reduction/collection networks: reduce interconnect activity by combining acknowledgments in the network
- Techniques to improve the scalability of VCTM

Thank you


Extended Versions (HCS, CSC)


Hybrid Circuit Switching (Extended)

- Latency is critical: use circuit switching for lower latency
  - A circuit connects resources across multiple hops to avoid router overhead
  - Traditional circuit switching performs poorly
- HCS contributions: a novel setup mechanism and bandwidth stealing

Traditional Circuit Switching

- Path setup requires a configuration probe and an acknowledgement before data can be sent
- Significant latency overhead prior to data transfer
- Other requests are forced to wait for the reserved resources

[Figure: circuit from node 0 to node 5: configuration probe out, acknowledgement back, then data over the established circuit]

Novel Circuit Setup Policy

- Overlap circuit setup with the first data transfer (a sketch follows)
- Reconfigure existing circuits if no unused links are available
  - Allows the piggy-backed request to always achieve low latency
  - Multiple circuit planes prevent frequent reconfiguration

[Figure: configuration packet carrying data from node 0 to node 5, establishing the circuit on first use]
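A minimal sketch of the piggy-backed setup policy in illustrative Python; the real mechanism is a narrow hardware setup network, and the plane methods used here (`links_free`, `reserve`, `reconfigure`, `deliver`) are assumptions for the sketch.

```python
# Piggy-backed circuit setup sketch: the configuration packet carries
# the first payload, so no probe/ack round trip delays the data.

def send_with_setup(planes, path_links, payload):
    for plane in planes:
        if plane.links_free(path_links):
            plane.reserve(path_links)          # unused links: claim them
            return plane.deliver(path_links, payload)
    # No plane has the links free: reconfigure one (tearing down a
    # conflicting circuit) so the piggy-backed request still achieves
    # low latency; multiple planes keep this case rare.
    plane = planes[0]
    plane.reconfigure(path_links)
    return plane.deliver(path_links, payload)
```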

Setup Network

- Light-weight setup network
  - Narrow: circuit plane identifier (2 bits) + destination (4 bits)
  - Low load, no virtual channels, small area footprint
  - Stores the circuit configuration information
- Multiple narrow circuit planes prevent frequent reconfiguration
- Reconfiguration requests are buffered and traverse the packet-switched pipeline

Packet-Switched Bandwidth Stealing

- Remember: the problem with traditional circuit switching is poor bandwidth, and this limitation must be overcome
- Hybrid circuit-switched solution: packet-switched messages snoop incoming links
  - When there are no circuit-switched messages on the link, a waiting packet-switched message can steal the idle bandwidth (see the sketch below)

Hybrid Circuit-Switched Router Design

[Figure: hybrid router with injection, N, S, E, and W input ports feeding the allocators and crossbar]

HCS Pipeline

- Circuit-switched messages: 1 stage
- Packet-switched messages: 3 stages; aggressive speculation reduces the stage count

[Figure: packet-switched pipeline (Buffer Write; Virtual Channel/Switch Allocation; Switch Traversal; Link Traversal) vs. the circuit-switched path (Switch Traversal; Link Traversal) across routers and links]

Simulation Workloads

Commercial
  SPECjbb:    Java server workload; 24 warehouses, 200 requests
  SPECweb:    Web server; 300 requests
  TPC-W:      Web e-commerce; 40 transactions
  TPC-H:      Decision support system

Scientific
  Barnes-Hut: 8K particles, full run
  Ocean:      514x514, parallel phase
  Radiosity:  Parallel phase
  Raytrace:   Car input, parallel phase

Synthetic
  Uniform random:      destinations selected with a uniform random distribution
  Permutation traffic: each node communicates with one other node (pair-wise)

Simulation Configuration

Processors
  Cores:            16, in-order, general purpose

Memory System
  L1 I/D caches:    32 KB, 2-way set associative, 1 cycle
  Private L2:       512 KB, 4-way set associative, 6 cycles, 64-byte lines
  Shared L3:        16 MB (1 MB bank/tile), 4-way set associative, 12 cycles
  Main memory:      100-cycle latency

Interconnect: 4x4 2-D mesh
  Packet-switched baseline: optimized 1-3 router stages; 4 virtual channels with 4 buffers each
  Hybrid circuit switching: 1 router stage; 2 or 4 circuit planes

Network Results

- Communication latency is key: HCS shaves precious cycles off network latency

Flit Breakdown

- HCS reduces interconnect latency for a significant fraction of messages

Uniform Random Traffic

- HCS successfully overcomes the bandwidth limitations associated with circuit switching

HCS Conclusion

- Overcomes the bandwidth limitations of traditional circuit switching
- Fast setup technique for low latency
- Improves performance
  - Network latency reduced by an average of 18% (up to 22%)
  - Overall performance improvement of up to 7%

Circuit-Switched Coherence (Extended)

- Goal: better exploit circuits through the coherence protocol
- Modifications (a sketch follows):
  - Allow a cache to send a request directly to another cache
  - Notify the directory in parallel
  - Prediction mechanism for pair-wise sharers
  - The directory remains the sole ordering point
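A minimal sketch of the modified miss path, in illustrative Python; the object names and methods (`forward_read`, `notify`, `request`) are assumptions standing in for protocol messages, not the simulator's API.

```python
# CSC miss-path sketch: a miss with a region prediction goes straight
# to the predicted cache over the circuit, while the directory is
# notified in parallel and keeps sole ordering authority.

REGION_SIZE = 1024  # 1 KB regions

def handle_miss(addr, region_table, directory, caches, requestor):
    target = region_table.get(addr // REGION_SIZE)   # predicted sharer
    if target is not None:
        # Fast path: direct cache-to-cache request plus a parallel
        # directory notification so its state stays current.
        caches[target].forward_read(addr, requestor)
        directory.notify(addr, requestor)
    else:
        # No prediction: fall back to ordinary directory indirection.
        directory.request(addr, requestor)
```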

Circuit-Switched Coherence Optimization

[Figure: node 1 reads address A, held Exclusive by node 2. (1) Read A goes directly to node 2; (2) node 2 sends the data response for A; (3) Update A notifies the directory, which acks and changes A from Exclusive (sharer 2) to Shared (sharers 1, 2)]

Region Prediction

- Each memory region spans 1 KB
- Takes advantage of spatial and temporal sharing

[Figure: node 1 misses on A[0] with no region-table entry for A. (1) Miss A[0] goes to the directory; (2) the directory forwards Read A[0] to sharer node 2; (3) node 2 sends the data response; (4) a Region A update trains node 1's region table to map region A to node 2; (5) a later Read A[1] goes directly to node 2. Directory: A[0] becomes Shared (1, 2)]
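The region table in the figure can be sketched as a small predictor; this is an illustrative structure under the 1 KB-region assumption, with hypothetical method names.

```python
# Region predictor sketch: a per-core table mapping region -> last
# known sharer, capturing spatial and temporal pair-wise sharing.

REGION_SIZE = 1024  # bytes; each region spans 1 KB

class RegionPredictor:
    def __init__(self):
        self.table: dict[int, int] = {}   # region -> predicted sharer

    def predict(self, addr):
        return self.table.get(addr // REGION_SIZE)

    def update(self, addr, sharer):
        # Directory responses ("Region A update") train the table, so a
        # later miss to A[1] goes straight to the node that held A[0].
        self.table[addr // REGION_SIZE] = sharer
```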

Simulation Configuration

(Same configuration as the HCS evaluation above.)

HCS + Protocol Optimization

- The improvement from HCS + protocol optimization is greater than the sum of HCS alone and the protocol optimization alone
- The protocol optimization drives up circuit reuse, better utilizing HCS

Circuit-Switched Coherence Summary

- Replace the packet-switched mesh with a hybrid circuit-switched mesh
  - Interleave circuit- and packet-switched flits
  - Reconfigurable circuits
- Dedicated bandwidth for frequent pair-wise sharers
  - Low latency and low power: avoid switching/routing overhead
- Devise novel coherence mechanisms to take advantage of the benefits of circuit switching

Inter-Region Order (3)

- Option 1: require acknowledgments from multicast sharers
- Option 2: coarse-grain fences
  - Fence between stores to different regions
  - Significantly reduces acknowledgments
  - No fence is required if the previous region was held in Exclusive

Same region, no fence needed:

  P0            P1
  Store A[0]    Rd A[1]
  Store A[1]    Rd A[0]

If P1 observes the store to A[1], it has also observed A[0] because of tree order.

Different regions, fence needed:

  P0            P1
  Store A       Rd B
  Store B       Rd A

The store to A may be delayed on tree 1 while the store to B arrives quickly on tree 2; without a fence, P1 could read the new value of B and a stale value of A.

With coarse-grain fences:

  P0                 P1
  Store miss A       Rd B
  Implicit fence A   Rd A
  Store miss B
  Implicit fence B

Remote reads to B cannot be satisfied by P0 until the acks for the fences on A and B are collected.

Inter-Region Order (4): Write Atomicity

  P0       P1                 P2
  A = 1    While (A == 0);    While (B == 0);
           B = 1              Rd A

Without write atomicity, the store to A (on tree A) can reach P1 before it reaches P2, so P2 can observe B = 1 while still reading A = 0.

Inter-Region Order (5): Write Atomicity

- A remote read to a dirty/unfenced region requires a fence, guaranteeing that the store is visible to all processors

  P0                        P1                 P2
  A = 1                     While (A == 0);    While (B == 0);
  (Remote Rd A -> Fence A)
                            B = 1              Rd A

P1's read of A cannot be satisfied until Fence A has been sunk by everyone, so P2 will read A = 1.