PL-4049: Cache Coherence for GPU Architectures, by Arrvindh Shriraman and Tor Aamodt


DESCRIPTION

PL-4049, Cache Coherence for GPU Architectures, by Arvindh Shriraman and Tor Aamodt at the AMD Developer Summit (APU13) November 11-13, 2013

TRANSCRIPT

Cache Coherence for GPU Architectures

Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, Tor M. Aamodt, "Cache Coherence for GPU Architectures," in Proceedings of the 19th IEEE International Symposium on High-Performance Computer Architecture (HPCA-19), 2013.

Agenda

Challenges with CPU coherence on GPUs

Temporal Coherence: rethinking coherence for GPUs

What is the cost of providing coherence?

Why provide coherence?

1. Inter-workgroup communication
2. Atomic operations
3. Task queues

Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems, ISPASS 2012

Cache Coherence

[Diagram: the programmer's view, four processors (P) sharing one memory.]

Appearance: one global copy of every location

Cache Coherence

Multicores: processors each with a private L1, over banked L2s and Memory.

GPUs: many cores each with a private L1, over a shared L2 and Memory.

Cache Coherence

Heterogeneous systems: CPU cores with private L1s and L2s, and GPU cores with private L1s and a shared L2, over one Memory.

How to provide coherence?

Challenges

Challenges with coherence

[Diagram: two L1 caches above a Shared L2; a coherence transaction proceeds in steps 1, 2, 3 between the L1s and the L2.]

Challenge 1: Traffic

[Diagram: invalidation and acknowledgment messages traveling between the L1s and the Shared L2.]

30% more traffic than current GPUs

Challenge 2: Buffer Overhead

[Diagram: a protocol buffer at the Shared L2 holding pending coherence requests.]

Coherence protocol buffers require 28% of total L2

Challenge 3: Complexity

Incoherent protocol: 4 states. Coherent protocol: 16 states.

Coherence Overhead

Coherence messages cost us:
1. Traffic from transferring messages
2. Area overhead
3. Protocol complexity

How to achieve coherence without messages?

TEMPORAL COHERENCE

Temporal Coherence

Time-based approach: trigger protocol events on timer alerts rather than on messages.

[Diagram: L1 caches and a Shared L2.]

Temporal Coherence

[Diagram: L1 caches and a Shared L2, all driven by one synchronized clock.]

On a load, the L1 receives the line together with a local timestamp LT; the copy is valid only while TIME < LT.

The L2 keeps a global timestamp GT for each line; the line may have shared copies only while TIME < GT.
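The timestamp rule above can be sketched as a toy model. This is an illustration only, with made-up class and field names, not the hardware state machine: an L1 copy is readable while the clock is below its local timestamp LT, and the L2 tracks the largest timestamp it has handed out as the line's global timestamp GT.

```python
# Toy model of Temporal Coherence timestamps (illustrative names, not the
# actual hardware implementation).

class L1Line:
    def __init__(self, data, lt):
        self.data = data
        self.lt = lt  # local timestamp: copy is valid while TIME < LT

    def is_valid(self, time):
        return time < self.lt

class L2Line:
    def __init__(self, data):
        self.data = data
        self.gt = 0  # global timestamp: line may be shared while TIME < GT

    def load(self, time, lifetime):
        # Hand out a copy valid for `lifetime` ticks and extend GT to cover it.
        lt = time + lifetime
        self.gt = max(self.gt, lt)
        return L1Line(self.data, lt)

l2 = L2Line(data=42)
copy = l2.load(time=0, lifetime=20)
assert copy.is_valid(5)        # still within its lifetime
assert not copy.is_valid(20)   # self-invalidates once TIME reaches LT
assert l2.gt == 20             # L2 knows the line may be shared until 20
```

No invalidation message is ever needed: the L2 can reason about every outstanding copy purely from GT and the current time.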

Temporal Coherence

[Timeline, TIME = 0] An L1 issues a load; the L2 returns the data with timestamp 20. The line is shared until time 20.

[Timeline, TIME = 5] A second L1 loads the same line; the L2 returns it with timestamp 25. The line is now shared until time 25.

[Timeline, TIME = 15] A third L1 issues a write while copies with timestamps 20 and 25 are still live. The write stalls at the L2.

[Timeline, TIME = 20] The copy with timestamp 20 has expired and self-invalidated; the write keeps stalling for the copy valid until 25.

[Timeline, TIME = 25] The last timestamp has expired, so every stale L1 copy has self-invalidated; the write now completes at the L2 without any invalidation messages.
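The write-stalling rule in this walkthrough reduces to one comparison: the L2 holds the write until the line's global timestamp has expired, so no stale copy can outlive the new value. A minimal sketch with a hypothetical helper, using the slide's numbers:

```python
# Sketch of a timestamp-stalled write (TC-Strong behavior): the L2 delays
# the write until the latest lifetime handed to any L1 has expired.

def stalled_write_completion(gt, write_time):
    """Return the time at which the write completes at the L2."""
    # Stall until every outstanding L1 copy has self-invalidated.
    return max(write_time, gt)

# From the slides: copies were handed out valid until 20 and 25 (GT = 25),
# so a write arriving at TIME = 15 completes only at TIME = 25.
assert stalled_write_completion(gt=25, write_time=15) == 25
# A write arriving after all copies have expired completes immediately.
assert stalled_write_completion(gt=25, write_time=30) == 30
```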

Temporal Coherence

No coherence messages

All transactions are 2-hop

Minimal protocol complexity

Supports strong and weak memory models

Enables optimized communication (ask me later...)

How to set the block lifetime?

• Longer => writes may stall
• Shorter => may not exploit temporal locality

• Lifetime predictor at the L2, updated on:
 - Load to an expired block (lengthen lifetimes to exploit temporal locality)
 - Store to an unexpired block (shorten lifetimes to reduce write stalls)
 - Eviction of an unexpired block (shorten lifetimes to reduce L2 eviction stalls)
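The three predictor events above can be sketched as simple additive nudges. The step sizes and update rule here are arbitrary illustrative choices, not the tuned predictor from the paper:

```python
# Toy per-L2 lifetime predictor, nudged by the three events from the slide.
# Step sizes are illustrative constants, not the paper's values.

class LifetimePredictor:
    def __init__(self, initial=100, step=10, floor=0):
        self.lifetime = initial
        self.step = step
        self.floor = floor

    def on_load_to_expired(self):
        # Line expired before its next use: lifetimes are too short for the
        # temporal locality present, so lengthen them.
        self.lifetime += self.step

    def on_store_to_unexpired(self):
        # A write hit a still-live line and had to wait: shorten lifetimes
        # to reduce write stalls.
        self.lifetime = max(self.floor, self.lifetime - self.step)

    def on_evict_unexpired(self):
        # Evicting a still-live line stalls the L2: shorten lifetimes.
        self.lifetime = max(self.floor, self.lifetime - self.step)

p = LifetimePredictor()
p.on_load_to_expired()      # 100 -> 110
p.on_store_to_unexpired()   # 110 -> 100
p.on_evict_unexpired()      # 100 -> 90
assert p.lifetime == 90
```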

Temporal Coherence (Weak)

[Diagram: a write stalling at the Shared L2 while copies with timestamps 20 and 25 are still unexpired.]

Stalling writes until lifetimes expire is:

Sensitive to misprediction

A source of resource stalls

Harmful to GPU applications

Goal: eliminate write stalls!

Temporal Coherence (Weak)

[Timeline, TIME = 15] Two L1s hold stale (OLD) copies with timestamps 20 and 25. A write does not stall: the L2 is updated immediately and replies with the line's global timestamp, 25. A subsequent fence must wait until that time.

[Timeline, TIME = 20] The OLD copy with timestamp 20 has expired; the fence keeps waiting on the remaining copy, valid until 25.

[Timeline, TIME = 25] All stale copies have expired, so the write is globally visible and the fence completes.
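The shift from the strong to the weak variant can be sketched the same way: the write itself never stalls; the L2 instead returns the line's global timestamp, the core remembers the largest such write-completion time, and only a fence waits for it. Names here are illustrative:

```python
# Sketch of TC-Weak: writes complete immediately at the L2 and return the
# time by which they become globally visible; only a fence waits.

class Core:
    def __init__(self):
        self.gwct = 0  # latest global write completion time seen so far

    def write(self, l2_gt, write_time):
        # The write does not stall; stale L1 copies (marked OLD) simply
        # expire on their own by time l2_gt.
        self.gwct = max(self.gwct, l2_gt)
        return write_time  # the write returns right away

    def fence(self, time):
        # Only the fence stalls, until all prior writes are visible.
        return max(time, self.gwct)

core = Core()
core.write(l2_gt=25, write_time=15)  # copies live until 20 and 25 -> GT = 25
assert core.gwct == 25
assert core.fence(time=15) == 25     # a fence issued at 15 waits until 25
assert core.fence(time=30) == 30     # nothing left to wait for
```

Moving the wait from every write to the (much rarer) fence is what lets TC-Weak use long, aggressive lifetimes without paying for mispredictions on each store.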

Temporal Coherence (Weak)

No access stalls

Efficient for GPU applications

Allows aggressive lifetime predictors

Supports weak memory models


Coherence Applications

• Lock-based programs: Barnes Hut, Cloth Physics, Place-and-Route

• Stencil: Max-Flow Min-Cut, 3D equation solver

• Load balancing: Octree Partitioning

Interconnect Traffic

GPU applications (which do not need coherence):

[Bar chart, y-axis 0 to 2, interconnect traffic normalized to the non-coherent GPU: NO.CC, MESI (2.3), GPU-VI, and TC. Annotations: ".8x Wr-Through" and ".3x No msgs".]


Speedup: Coherence Applications

[Bar chart, y-axis 0 to 1.75: speedup of MESI, GPU-VI, and TC over a baseline with no L1 caches (NO L1). Annotation: "Need a 32KB directory".]

Protocol Complexity

                     L1 Stable   L1 Transient   L2 Stable   L2 Transient
Non-Coherent             2            2             2            2
GPU-VI                   2            1             5           10
Temporal Coherence       2            1             5            3

What did we learn?

• Throughput and heterogeneous architectures require a more streamlined caching framework.

• Single-chip integration enables mechanisms that we can exploit to simplify communication protocols.

• Efficient coherence protocols let programmers deploy accelerators for a wider range of purposes.

• GPGPU-Sim with coherence support is available at http://www.ece.ubc.ca/~isingh/gpgpusim-ruby.tar.gz

Contact: ashriram@cs.sfu.ca or aamodt@ece.ubc.ca

Interconnect Energy

[Charts: normalized interconnect energy and power, broken down into Link (Dynamic), Router (Dynamic), Link (Static), and Router (Static), for NO-L1/NO-COH, MESI, GPU-VI, GPU-VIni, and TCW, on inter-workgroup and intra-workgroup applications.]

[Figure 8: Breakdown of interconnect traffic for coherent and non-coherent GPU memory systems. (a) Inter-workgroup communication (HSP, KMN, LPS, NDL, RG, SR, AVG); (b) intra-workgroup communication (BH, CC, CL, DLB, STN, VPR, AVG). Configurations: NO-L1/NO-COH, MESI, GPU-VI, GPU-VIni, TCW; traffic categories: RCL, INV, REQ, ATO, ST, LD.]

[Figure 9: Breakdown of interconnect power and energy: Link (Dynamic), Router (Dynamic), Link (Static), Router (Static), normalized, for inter-workgroup and intra-workgroup applications.]

...into a shared queue. As a result, the task-fetching and task-inserting invalidation latencies lie on the critical path for a large number of threads. TCW eliminates this critical path invalidation latency in DLB and performs up to 2x faster than the invalidation-based protocols.

Figures 8(a) and 8(b) show the breakdown of interconnect traffic between different coherence protocols. LD, ST, and ATO are the data traffic from load, store, and atomic requests. MESI performs atomic operations at the L1 cache and this traffic is included in ST. REQ refers to control traffic for all protocols. INV and RCL are invalidation and recall traffic, respectively.

MESI's write-allocate policy at the L1 significantly increases store traffic due to unnecessary refills of write-once data. On average, MESI increases interconnect traffic over the baseline non-coherent GPU by 75% across all applications. The write-through GPU-VI and GPU-VIni introduce unnecessary invalidation and recall traffic, averaging to a traffic overhead of 31% and 30% for applications without inter-workgroup communication. TCW removes all invalidations and recalls and as a result reduces interconnect traffic by 56% over MESI, 23% over GPU-VI and 23% over GPU-VIni for this set of applications.

8.2 Power

Figure 9 shows the breakdown of interconnect power and energy usage. TCW lowers the interconnect power usage by 21%, 10% and 8%, and interconnect energy usage by 36%, 13% and 8% over MESI, GPU-VI and GPU-VIni, respectively. The reductions are both in dynamic power, due to lower interconnect traffic, and static power, due to fewer virtual channel buffers in TCW.

[Figure 10: (a) Harmonic mean speedup and (b) normalized average interconnect traffic across all applications for TCS, TCW-FIXED, and TCW.]

8.3 TC-Weak vs. TC-Strong

Figures 10(a) and 10(b) compare the harmonic mean performance and average interconnect traffic, respectively, across all applications for TC-Strong and TC-Weak. TCS implements TC-Strong with the FIXED-DELTA prediction scheme proposed in LCC [34, 54], which selects a single fixed lifetime that works best across all applications. TCS uses a fixed lifetime prediction of 800 core cycles, which was found to yield the best harmonic mean performance over other lifetime values. TCW-FIXED uses TC-Weak and a fixed lifetime of 3200 core cycles, which was found to be best performing over other values. TCW implements TC-Weak with the proposed predictor, as before.

TCW-FIXED has the same predictor as TCS but outperforms it by 15% while reducing traffic by 13%. TCW achieves a 28% improvement in performance over TCS and reduces interconnect traffic by 26%. TC-Strong has a trade-off between additional write stalls with higher lifetimes and additional L1 misses with lower lifetimes. TC-Weak avoids this trade-off by not stalling writes. This permits longer lifetimes and fewer L1 misses, improving performance and reducing traffic over TC-Strong.



TC-Strong vs TC-Weak

[Charts: speedup across all applications with (left) a fixed lifetime for all applications and (right) the best lifetime for each application. Bars: TCSUO, TCS, TCSOO, TCW, TCW w/ predictor.]
