PL-4049: Cache Coherence for GPU Architectures, by Arrvindh Shriraman and Tor Aamodt


DESCRIPTION

PL-4049, Cache Coherence for GPU Architectures, by Arvindh Shriraman and Tor Aamodt at the AMD Developer Summit (APU13) November 11-13, 2013

TRANSCRIPT

Cache Coherence for GPU Architectures

Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, Tor M. Aamodt, "Cache Coherence for GPU Architectures," in Proceedings of the 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19), 2013.

Agenda

Challenges with CPU coherence on GPUs

Temporal Coherence: rethinking coherence for GPUs

What is the cost of providing coherence?

Why provide coherence?

1. Inter-workgroup communication
2. Atomic operations
3. Task queues

Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems, ISPASS 2012

Cache Coherence

[Diagram: the programmer's view, four processors (P) sharing one memory.]

Appearance: one global copy of every location

Cache Coherence

Multicores: processors each with a private L1, over banked L2s and Memory.

GPUs: many cores each with a private L1, over a shared L2 and Memory.

Cache Coherence

Heterogeneous systems: CPU cores with private L1s and L2s, and GPU cores with private L1s and a shared L2, over one Memory.

How to provide coherence?

Challenges

Challenges with coherence

[Diagram: two L1 caches above a Shared L2; a coherence transaction proceeds in steps 1, 2, 3 between the L1s and the L2.]

Challenge 1: Traffic

[Diagram: invalidation and acknowledgment messages traveling between the L1s and the Shared L2.]

30% more traffic than current GPUs

Challenge 2: Buffer Overhead

[Diagram: a protocol buffer at the Shared L2 holding pending coherence requests.]

Coherence protocol buffers require 28% of total L2

Challenge 3: Complexity

Incoherent protocol: 4 states. Coherent protocol: 16 states.

Coherence Overhead

Coherence messages cost us:
1. Traffic from transferring messages
2. Area overhead
3. Protocol complexity

How to achieve coherence without messages?

TEMPORAL COHERENCE

Temporal Coherence

Time-based approach: trigger protocol events on timer alerts rather than on messages.

[Diagram: L1 caches and a Shared L2.]

Temporal Coherence

[Diagram: L1 caches and a Shared L2, all driven by one synchronized clock.]

On a load, the L1 receives the line together with a local timestamp LT; the copy is valid only while TIME < LT.

The L2 keeps a global timestamp GT for each line; the line may have shared copies only while TIME < GT.
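The timestamp rule above can be sketched as a toy model. This is an illustration only, with made-up class and field names, not the hardware state machine: an L1 copy is readable while the clock is below its local timestamp LT, and the L2 tracks the largest timestamp it has handed out as the line's global timestamp GT.

```python
# Toy model of Temporal Coherence timestamps (illustrative names, not the
# actual hardware implementation).

class L1Line:
    def __init__(self, data, lt):
        self.data = data
        self.lt = lt  # local timestamp: copy is valid while TIME < LT

    def is_valid(self, time):
        return time < self.lt

class L2Line:
    def __init__(self, data):
        self.data = data
        self.gt = 0  # global timestamp: line may be shared while TIME < GT

    def load(self, time, lifetime):
        # Hand out a copy valid for `lifetime` ticks and extend GT to cover it.
        lt = time + lifetime
        self.gt = max(self.gt, lt)
        return L1Line(self.data, lt)

l2 = L2Line(data=42)
copy = l2.load(time=0, lifetime=20)
assert copy.is_valid(5)        # still within its lifetime
assert not copy.is_valid(20)   # self-invalidates once TIME reaches LT
assert l2.gt == 20             # L2 knows the line may be shared until 20
```

No invalidation message is ever needed: the L2 can reason about every outstanding copy purely from GT and the current time.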

Temporal Coherence

[Timeline, TIME = 0] An L1 issues a load; the L2 returns the data with timestamp 20. The line is shared until time 20.

[Timeline, TIME = 5] A second L1 loads the same line; the L2 returns it with timestamp 25. The line is now shared until time 25.

[Timeline, TIME = 15] A third L1 issues a write while copies with timestamps 20 and 25 are still live. The write stalls at the L2.

[Timeline, TIME = 20] The copy with timestamp 20 has expired and self-invalidated; the write keeps stalling for the copy valid until 25.

[Timeline, TIME = 25] The last timestamp has expired, so every stale L1 copy has self-invalidated; the write now completes at the L2 without any invalidation messages.
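The write-stalling rule in this walkthrough reduces to one comparison: the L2 holds the write until the line's global timestamp has expired, so no stale copy can outlive the new value. A minimal sketch with a hypothetical helper, using the slide's numbers:

```python
# Sketch of a timestamp-stalled write (TC-Strong behavior): the L2 delays
# the write until the latest lifetime handed to any L1 has expired.

def stalled_write_completion(gt, write_time):
    """Return the time at which the write completes at the L2."""
    # Stall until every outstanding L1 copy has self-invalidated.
    return max(write_time, gt)

# From the slides: copies were handed out valid until 20 and 25 (GT = 25),
# so a write arriving at TIME = 15 completes only at TIME = 25.
assert stalled_write_completion(gt=25, write_time=15) == 25
# A write arriving after all copies have expired completes immediately.
assert stalled_write_completion(gt=25, write_time=30) == 30
```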

Temporal Coherence

No coherence messages

All transactions are 2-hop

Minimal protocol complexity

Supports strong and weak memory models

Enables optimized communication (ask me later...)

How to set the block lifetime?

• Longer => writes may stall
• Shorter => may not exploit temporal locality

• Lifetime predictor at the L2, updated on:
 - Load to an expired block (lengthen lifetimes to exploit temporal locality)
 - Store to an unexpired block (shorten lifetimes to reduce write stalls)
 - Eviction of an unexpired block (shorten lifetimes to reduce L2 eviction stalls)
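The three predictor events above can be sketched as simple additive nudges. The step sizes and update rule here are arbitrary illustrative choices, not the tuned predictor from the paper:

```python
# Toy per-L2 lifetime predictor, nudged by the three events from the slide.
# Step sizes are illustrative constants, not the paper's values.

class LifetimePredictor:
    def __init__(self, initial=100, step=10, floor=0):
        self.lifetime = initial
        self.step = step
        self.floor = floor

    def on_load_to_expired(self):
        # Line expired before its next use: lifetimes are too short for the
        # temporal locality present, so lengthen them.
        self.lifetime += self.step

    def on_store_to_unexpired(self):
        # A write hit a still-live line and had to wait: shorten lifetimes
        # to reduce write stalls.
        self.lifetime = max(self.floor, self.lifetime - self.step)

    def on_evict_unexpired(self):
        # Evicting a still-live line stalls the L2: shorten lifetimes.
        self.lifetime = max(self.floor, self.lifetime - self.step)

p = LifetimePredictor()
p.on_load_to_expired()      # 100 -> 110
p.on_store_to_unexpired()   # 110 -> 100
p.on_evict_unexpired()      # 100 -> 90
assert p.lifetime == 90
```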

Temporal Coherence (Weak)

[Diagram: a write stalling at the Shared L2 while copies with timestamps 20 and 25 are still unexpired.]

Stalling writes until lifetimes expire is:

Sensitive to misprediction

A source of resource stalls

Harmful to GPU applications

Goal: eliminate write stalls!

Temporal Coherence (Weak)

[Timeline, TIME = 15] Two L1s hold stale (OLD) copies with timestamps 20 and 25. A write does not stall: the L2 is updated immediately and replies with the line's global timestamp, 25. A subsequent fence must wait until that time.

[Timeline, TIME = 20] The OLD copy with timestamp 20 has expired; the fence keeps waiting on the remaining copy, valid until 25.

[Timeline, TIME = 25] All stale copies have expired, so the write is globally visible and the fence completes.
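The shift from the strong to the weak variant can be sketched the same way: the write itself never stalls; the L2 instead returns the line's global timestamp, the core remembers the largest such write-completion time, and only a fence waits for it. Names here are illustrative:

```python
# Sketch of TC-Weak: writes complete immediately at the L2 and return the
# time by which they become globally visible; only a fence waits.

class Core:
    def __init__(self):
        self.gwct = 0  # latest global write completion time seen so far

    def write(self, l2_gt, write_time):
        # The write does not stall; stale L1 copies (marked OLD) simply
        # expire on their own by time l2_gt.
        self.gwct = max(self.gwct, l2_gt)
        return write_time  # the write returns right away

    def fence(self, time):
        # Only the fence stalls, until all prior writes are visible.
        return max(time, self.gwct)

core = Core()
core.write(l2_gt=25, write_time=15)  # copies live until 20 and 25 -> GT = 25
assert core.gwct == 25
assert core.fence(time=15) == 25     # a fence issued at 15 waits until 25
assert core.fence(time=30) == 30     # nothing left to wait for
```

Moving the wait from every write to the (much rarer) fence is what lets TC-Weak use long, aggressive lifetimes without paying for mispredictions on each store.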

Temporal Coherence (Weak)

No access stalls

Efficient for GPU applications

Allows aggressive lifetime predictors

Supports weak memory models


Coherence Applications

• Lock-based programs: Barnes Hut, Cloth Physics, Place-and-Route

• Stencil: Max-Flow Min-Cut, 3D equation solver

• Load balancing: Octree Partitioning

Interconnect Traffic

GPU applications (which do not need coherence):

[Bar chart, y-axis 0 to 2, interconnect traffic normalized to the non-coherent GPU: NO.CC, MESI (2.3), GPU-VI, and TC. Annotations: ".8x Wr-Through" and ".3x No msgs".]


Speedup: Coherence Applications

[Bar chart, y-axis 0 to 1.75: speedup of MESI, GPU-VI, and TC over a baseline with no L1 caches (NO L1). Annotation: "Need a 32KB directory".]

Protocol Complexity

                     L1 Stable   L1 Transient   L2 Stable   L2 Transient
Non-Coherent             2            2             2            2
GPU-VI                   2            1             5           10
Temporal Coherence       2            1             5            3

What did we learn?

• Throughput and heterogeneous architectures require a more streamlined caching framework.

• Single-chip integration enables mechanisms that we can exploit to simplify communication protocols.

• Efficient coherence protocols let programmers deploy accelerators for a wider range of purposes.

• GPGPU-Sim with coherence support is available at http://www.ece.ubc.ca/~isingh/gpgpusim-ruby.tar.gz

Contact: ashriram@cs.sfu.ca or aamodt@ece.ubc.ca

Interconnect Energy

[Charts: normalized interconnect energy and power, broken down into Link (Dynamic), Router (Dynamic), Link (Static), and Router (Static), for NO-L1/NO-COH, MESI, GPU-VI, GPU-VIni, and TCW, on inter-workgroup and intra-workgroup applications.]

[Figure 8: Breakdown of interconnect traffic for coherent and non-coherent GPU memory systems. (a) Inter-workgroup communication (HSP, KMN, LPS, NDL, RG, SR, AVG); (b) intra-workgroup communication (BH, CC, CL, DLB, STN, VPR, AVG). Configurations: NO-L1/NO-COH, MESI, GPU-VI, GPU-VIni, TCW; traffic categories: RCL, INV, REQ, ATO, ST, LD.]

[Figure 9: Breakdown of interconnect power and energy: Link (Dynamic), Router (Dynamic), Link (Static), Router (Static), normalized, for inter-workgroup and intra-workgroup applications.]

...into a shared queue. As a result, the task-fetching and task-inserting invalidation latencies lie on the critical path for a large number of threads. TCW eliminates this critical path invalidation latency in DLB and performs up to 2x faster than the invalidation-based protocols.

Figures 8(a) and 8(b) show the breakdown of interconnect traffic between different coherence protocols. LD, ST, and ATO are the data traffic from load, store, and atomic requests. MESI performs atomic operations at the L1 cache and this traffic is included in ST. REQ refers to control traffic for all protocols. INV and RCL are invalidation and recall traffic, respectively.

MESI's write-allocate policy at the L1 significantly increases store traffic due to unnecessary refills of write-once data. On average, MESI increases interconnect traffic over the baseline non-coherent GPU by 75% across all applications. The write-through GPU-VI and GPU-VIni introduce unnecessary invalidation and recall traffic, averaging to a traffic overhead of 31% and 30% for applications without inter-workgroup communication. TCW removes all invalidations and recalls and as a result reduces interconnect traffic by 56% over MESI, 23% over GPU-VI and 23% over GPU-VIni for this set of applications.

8.2 Power

Figure 9 shows the breakdown of interconnect power and energy usage. TCW lowers the interconnect power usage by 21%, 10% and 8%, and interconnect energy usage by 36%, 13% and 8% over MESI, GPU-VI and GPU-VIni, respectively. The reductions are both in dynamic power, due to lower interconnect traffic, and static power, due to fewer virtual channel buffers in TCW.

[Figure 10: (a) Harmonic mean speedup and (b) normalized average interconnect traffic across all applications for TCS, TCW-FIXED, and TCW.]

8.3 TC-Weak vs. TC-Strong

Figures 10(a) and 10(b) compare the harmonic mean performance and average interconnect traffic, respectively, across all applications for TC-Strong and TC-Weak. TCS implements TC-Strong with the FIXED-DELTA prediction scheme proposed in LCC [34, 54], which selects a single fixed lifetime that works best across all applications. TCS uses a fixed lifetime prediction of 800 core cycles, which was found to yield the best harmonic mean performance over other lifetime values. TCW-FIXED uses TC-Weak and a fixed lifetime of 3200 core cycles, which was found to be best performing over other values. TCW implements TC-Weak with the proposed predictor, as before.

TCW-FIXED has the same predictor as TCS but outperforms it by 15% while reducing traffic by 13%. TCW achieves a 28% improvement in performance over TCS and reduces interconnect traffic by 26%. TC-Strong has a trade-off between additional write stalls with higher lifetimes and additional L1 misses with lower lifetimes. TC-Weak avoids this trade-off by not stalling writes. This permits longer lifetimes and fewer L1 misses, improving performance and reducing traffic over TC-Strong.



TC-Strong vs TC-Weak

[Charts: speedup across all applications with (left) a fixed lifetime for all applications and (right) the best lifetime for each application. Bars: TCSUO, TCS, TCSOO, TCW, TCW w/ predictor.]
