TRANSCRIPT
Managing GPU Concurrency in Heterogeneous Architectures
Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh,
Onur Mutlu, Chita R. Das
Era of Heterogeneous Architectures
Examples: Intel Haswell, AMD Fusion, NVIDIA Echelon, NVIDIA Denver
Executive Summary
- When sharing the memory hierarchy, CPU and GPU applications interfere with each other
  - GPU applications significantly affect CPU applications due to multi-threading
- Existing GPU thread-level parallelism (TLP) management techniques (MICRO 2012, PACT 2013)
  - Unaware of CPUs
  - Not effective in heterogeneous systems
- Our proposal: warp scheduling strategies that adjust GPU TLP to improve CPU and/or GPU performance
- CPU-centric strategy (CM-CPU):
  - Memory congestion degrades CPU performance
  - IF memory congestion is high, THEN reduce GPU TLP
  - Results summary: +24% CPU, -11% GPU
- CPU-GPU balanced strategy (CM-BAL):
  - Reducing GPU TLP lowers GPU latency tolerance
  - IF GPU latency tolerance is low, THEN increase GPU TLP
  - Results summary: +7% for both CPU and GPU
Outline
- Summary
- Background
- Motivation
- Analysis of TLP
- Our Proposal
- Evaluation
- Conclusions
Many-core Architecture

[Diagram: SIMT (GPU) cores, each with a CTA scheduler, warp scheduler, ALUs, and L1 caches, and CPU cores, each with a ROB and L2 caches, share an interconnect, the LLC, and DRAM]

- CPU: latency-optimized cores
- GPU: throughput-optimized cores
Application Interference

[Figure: Normalized GPU IPC of KM, MM, and PVR when co-running with mcf, omnetpp, and perlbench (relative to running with no CPU application), and normalized CPU IPC of mcf, omnetpp, and perlbench when co-running with KM, MM, and PVR (relative to running with no GPU application)]

- GPU applications are affected moderately by CPU interference (up to 20%)
- CPU applications are affected significantly by GPU interference (up to 85%)
Latency Tolerance in CPUs vs. GPUs
- High GPU TLP -> memory system congestion
- High GPU TLP -> low CPU performance
- GPU cores can tolerate long latencies due to multi-threading

[Figure: Normalized GPU and CPU IPC as GPU concurrency varies across 1, 6, and 16 warps; annotation: DYNCTA (PACT 2013) has higher performance potential at low TLP]
Problem: TLP management strategies for GPUs are not aware of the latency tolerance disparity between CPU and GPU applications
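The latency-hiding effect of multi-threading mentioned above can be illustrated with a toy analytical model. This is purely illustrative and is not a model from the talk or paper: it assumes each warp alternates a fixed number of compute cycles with a fixed-latency memory access, and that the scheduler can switch warps for free.

```python
# Toy model of multithreaded latency hiding (illustrative assumption,
# not from the paper): each warp alternates `compute_cycles` of work
# with one memory access of latency `mem_latency`. With `num_warps`
# warps scheduled round-robin, the core stays busy whenever some warp
# has compute work ready.

def core_utilization(num_warps, compute_cycles, mem_latency):
    """Steady-state utilization of a core that hides memory latency
    by switching among warps."""
    # While one warp waits mem_latency cycles, the remaining warps can
    # supply at most (num_warps - 1) * compute_cycles cycles of work.
    hidden = min(mem_latency, (num_warps - 1) * compute_cycles)
    period = compute_cycles + mem_latency
    return (compute_cycles + hidden) / period
```

With compute_cycles=4 and mem_latency=36, a single warp keeps the core only 10% busy, while enough warps drive utilization to 100% — the multi-threading effect that lets GPU cores tolerate memory latency.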
Effect of GPU Concurrency on GPU Performance
- A reduction in GPU TLP can improve or degrade GPU performance, depending on the application

Effect of GPU Concurrency on CPU Performance
- How does a reduction in GPU TLP change CPU performance?
- We track two metrics: memory congestion and network congestion
- Lower congestion translates into higher CPU performance
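The two congestion metrics could be approximated as in the sketch below. The exact hardware definitions are not given on the slide, so both formulas, and the names of all the counters, are illustrative assumptions rather than the paper's measurements.

```python
# Hypothetical proxies for the two congestion metrics (illustrative
# assumptions; the actual scheme reads hardware counters whose exact
# definitions are detailed in the paper, not on this slide).

def memory_congestion(stalled_request_cycles, epoch_cycles):
    """Average number of stalled memory requests per cycle at the
    memory controllers over one sampling epoch."""
    return stalled_request_cycles / epoch_cycles

def network_congestion(stalled_packets, injected_packets):
    """Fraction of injected packets that stalled in the interconnect
    during the sampling epoch."""
    return stalled_packets / max(injected_packets, 1)
```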
Our Approach

|                          | Existing works | CM-CPU | CM-BAL |
|--------------------------|----------------|--------|--------|
| Improved GPU performance | ✓              | ✗      | ✓      |
| Improved CPU performance | ✗              | ✓      | ✓      |

CM-BAL additionally controls the trade-off between CPU and GPU performance.
CM-CPU: CPU-centric Strategy
- Categorize memory congestion and network congestion each as low (L), medium (M), or high (H)
- If either metric is high: decrease the number of active warps
- If both metrics are low: increase the number of active warps
- Otherwise: leave the number of active warps unchanged
- Limitation: this GPU-unaware TLP management can leave GPU cores with insufficient latency tolerance
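The CM-CPU decision rule above can be sketched as follows. The threshold values (0.3 and 0.7) and the normalized metric range are placeholders; the paper's actual thresholds and sampling epochs are not shown on the slide.

```python
# Sketch of the CM-CPU decision rule. Thresholds below are
# illustrative placeholders, not values from the paper.

LOW, MEDIUM, HIGH = 0, 1, 2

def categorize(value, t_low=0.3, t_high=0.7):
    """Map a raw congestion measurement to low/medium/high."""
    if value < t_low:
        return LOW
    if value < t_high:
        return MEDIUM
    return HIGH

def cm_cpu_decision(mem_congestion, net_congestion):
    """Return the change in active warp count: +1 (increase),
    -1 (decrease), or 0 (no change)."""
    mem = categorize(mem_congestion)
    net = categorize(net_congestion)
    if mem == HIGH or net == HIGH:
        return -1   # congestion hurts CPU cores: shed GPU warps
    if mem == LOW and net == LOW:
        return +1   # headroom available: restore GPU latency tolerance
    return 0        # medium congestion: hold steady
```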
CM-BAL: CPU-GPU Balanced Strategy
- Latency tolerance of GPU cores is tracked via stall_GPU: how often the warp scheduler stalls at a GPU core
- Under high memory congestion, CM-BAL follows the same strategy as CM-CPU
- When it detects low GPU latency tolerance, it overrides CM-CPU; the override can only increase TLP
- Controlling how easily the override condition triggers controls the trade-off between CPU and GPU benefits
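The CM-BAL behavior described above can be sketched as a layer on top of the CM-CPU rule. The stall threshold `k`, the normalized `stall_gpu` value, and the congestion thresholds are all illustrative placeholders, not the paper's parameters; tightening or loosening `k` is what trades CPU benefit against GPU benefit.

```python
# Sketch of CM-BAL: follow CM-CPU, but override it (increase-only)
# when GPU latency tolerance is low. All thresholds are illustrative
# placeholders, not values from the paper.

def cm_bal_decision(mem_congestion, net_congestion, stall_gpu, k=0.5):
    """Return the change in active warp count: +1, -1, or 0.

    stall_gpu: normalized scheduler stall time at the GPU core.
    k: override threshold; a lower k triggers the override more
       easily, shifting the trade-off toward GPU performance.
    """
    if stall_gpu > k:
        return +1   # low latency tolerance: override, add warps
    # Otherwise fall back to the CM-CPU-style congestion rule.
    def high(x): return x >= 0.7
    def low(x): return x < 0.3
    if high(mem_congestion) or high(net_congestion):
        return -1   # shed GPU warps to relieve CPU-visible congestion
    if low(mem_congestion) and low(net_congestion):
        return +1
    return 0
```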
Evaluated Architecture

[Figure: Tile-based design; a 2D mesh of 28 GPU tiles, 14 CPU tiles, and 8 LLC/memory-controller (LLC/MC) tiles]
Evaluation Methodology
- Evaluated on an integrated platform combining an in-house x86 CPU simulator with GPGPU-Sim
- Baseline architecture:
  - 28 GPU cores, 14 CPU cores, 8 memory controllers, 2D mesh interconnect
  - GPU: 1400 MHz, SIMT width = 16×2, max. 1536 threads/core, GTO scheduler
  - CPU: 2000 MHz, out-of-order, 128-entry instruction window, max. 3 instructions/cycle
  - LLC: 8 MB, 128 B lines, 16-way, 700 MHz
  - DRAM: GDDR5, 800 MHz
- Workloads:
  - 13 GPU applications; 34 CPU applications grouped into 6 CPU application mixes
  - 36 diverse workloads, each pairing 1 GPU application with 1 CPU mix
GPU Performance Results

[Figure: Normalized GPU IPC over all 36 workloads; labeled changes of +7%, -11%, +2%, and -11% for the evaluated schemes]
CPU Performance Results

[Figure: Normalized CPU weighted speedup over all 36 workloads; labeled changes of +24%, +7%, +2%, and +19% for the evaluated schemes]
System Performance
- OSS = (1 − α) × WS_CPU + α × SU_GPU (metric from ISCA 2012)
- α is between 0 and 1; a higher α gives the GPU speedup more weight

[Figure: Normalized OSS as α varies from 0 to 1 (marks at 0.5 and 1.0), comparing 48 warps, DYNCTA, CM-CPU, and CM-BAL under Objective 1 and Objective 2 (Balanced)]
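The OSS formula above is a direct weighted combination and can be written out as a one-line function; `ws_cpu` and `su_gpu` stand for the CPU weighted speedup and GPU speedup defined on the slide.

```python
def overall_system_speedup(ws_cpu, su_gpu, alpha):
    """OSS = (1 - alpha) * WS_CPU + alpha * SU_GPU (ISCA 2012 metric).

    alpha in [0, 1]; higher alpha weights GPU speedup more heavily,
    so alpha = 0 scores pure CPU performance and alpha = 1 pure GPU.
    """
    assert 0.0 <= alpha <= 1.0, "alpha must lie in [0, 1]"
    return (1.0 - alpha) * ws_cpu + alpha * su_gpu
```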
More in the Paper
- Motivation: analysis of the metrics used by our algorithm
- Scheme: detailed hardware walkthrough of our scheme
- Results: analysis over time; change in GPU TLP; change in the metrics used by our algorithm; comparison against static approaches; reduction in the number of LLC accesses
Conclusions
- Sharing the memory hierarchy causes CPU and GPU applications to interfere with each other
- Existing GPU TLP management techniques are not well suited to heterogeneous architectures
- We propose two GPU TLP management techniques for heterogeneous architectures:
  - CM-CPU reduces GPU TLP to improve CPU performance
  - CM-BAL is similar to CM-CPU, but increases GPU TLP when it detects low latency tolerance in GPU cores
  - TLP can be tuned based on the user's preference for higher CPU or GPU performance
THANKS!