managinggpuconcurrencyin heterogeneous&architectures&omutlu/pub/gpu... ·...

Managing GPU Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das


Post on 19-Apr-2020


Managing GPU Concurrency in Heterogeneous Architectures

Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das


Era of Heterogeneous Architectures

• Intel Haswell
• AMD Fusion
• NVIDIA Echelon
• NVIDIA Denver


Executive Summary

• When sharing the memory hierarchy, CPU and GPU applications interfere with each other
  - GPU applications significantly affect CPU applications due to multi-threading
• Existing GPU thread-level parallelism (TLP) management techniques (MICRO 2012, PACT 2013)
  - Unaware of CPUs
  - Not effective in heterogeneous systems

Our proposal: warp scheduling strategies that adjust GPU TLP to improve CPU and/or GPU performance


Executive Summary

CPU-centric Strategy
• Higher memory congestion lowers CPU performance
• IF memory congestion is high, reduce GPU TLP
• Results summary: +24% CPU and -11% GPU performance

CPU-GPU Balanced Strategy
• Lower GPU TLP reduces GPU latency tolerance
• IF latency tolerance is low, increase GPU TLP
• Results summary: +7% for both CPU and GPU performance


Outline

• Summary
• Background
• Motivation
• Analysis of TLP
• Our Proposal
• Evaluation
• Conclusions


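Many-core Architecture

[Figure: block diagram of the heterogeneous many-core architecture]
• SIMT cores (throughput-optimized cores): warp scheduler, ALUs, L1 caches; fed by a CTA scheduler
• CPU cores (latency-optimized cores): ALUs, L1 caches, L2 caches, ROB
• Shared by both: interconnect, LLC cache, DRAM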


Application Interference

[Figure, left panel: normalized GPU IPC of KM, MM, and PVR when run with no CPU application (noCPU) and alongside mcf, omnetpp, or perlbench; y-axis 0 to 1.2]
[Figure, right panel: normalized CPU IPC of mcf, omnetpp, and perlbench when run with no GPU application (noGPU) and alongside KM, MM, or PVR; y-axis 0 to 1.2]

• GPU applications are affected moderately by CPU interference (up to 20%)
• CPU applications are affected significantly by GPU interference (up to 85%)


Latency Tolerance in CPUs vs. GPUs

• High GPU TLP -> memory system congestion
• High GPU TLP -> low CPU performance
• GPU cores can tolerate latencies due to multi-threading

[Figure: normalized GPU IPC and CPU IPC at different GPU TLP levels; x-axis labels: 1 warp, 6 warps, 16 warps; y-axis 0 to 1.6; annotations: DYNCTA (PACT 2013), higher performance potential at low TLP]

Problem: TLP management strategies for GPUs are not aware of the latency tolerance disparity between CPU and GPU applications


Effect of GPU Concurrency on GPU Performance

[Figure: GPU performance as GPU TLP is reduced]


Effect of GPU Concurrency on CPU Performance

[Figure: CPU performance as GPU TLP is reduced]


Effect of GPU Concurrency on CPU Performance

What determines the change in CPU performance? Two metrics:
• Memory congestion
• Network congestion

Higher congestion corresponds to lower CPU performance


Our Approach

Strategy                     Improved GPU performance   Improved CPU performance
Existing works               yes                        no
CPU-centric Strategy         no                         yes
CPU-GPU Balanced Strategy    yes                        yes (+ control the trade-off)


CM-CPU: CPU-centric Strategy

• Categorize congestion: low (L), medium (M), or high (H)

[Figure: decision table indexed by memory congestion (L, M, H) and network congestion (L, M, H); each cell is one of: increase # of warps, decrease # of warps, no change in # of warps]

GPU-unaware TLP management: insufficient GPU latency tolerance
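The decision table above can be sketched in code. The extracted slide text does not preserve which table cell maps to which action, so the mapping below is an assumption made for illustration only: decrease the number of warps if either metric is high, increase it if both are low, and leave it unchanged otherwise. The thresholds, the function names, and the 1 to 48 warp range are likewise hypothetical.

```python
from enum import Enum


class Level(Enum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


def categorize(value, low_threshold, high_threshold):
    """Bucket a congestion measurement into low, medium, or high.

    Both thresholds are hypothetical tuning parameters.
    """
    if value < low_threshold:
        return Level.LOW
    if value < high_threshold:
        return Level.MEDIUM
    return Level.HIGH


def cm_cpu_step(memory, network):
    """Return the change in active warps: +1, -1, or 0.

    ASSUMED mapping (not recoverable from the slide text):
    any HIGH -> decrease, both LOW -> increase, otherwise no change.
    """
    if Level.HIGH in (memory, network):
        return -1
    if memory is Level.LOW and network is Level.LOW:
        return 1
    return 0


def apply_step(warps, step, min_warps=1, max_warps=48):
    """Apply a step while keeping the warp count inside assumed limits."""
    return max(min_warps, min(max_warps, warps + step))
```

A controller built this way would call `categorize` on each congestion metric once per sampling interval and feed the result of `cm_cpu_step` into `apply_step`.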


CM-BAL: CPU-GPU Balanced Strategy

• Latency tolerance of GPU cores is measured by stallGPU: scheduler stalls at GPU cores

[Figure: stallGPU versus GPU TLP]

• High memory congestion: same strategy as CM-CPU
• Low latency tolerance: overrides CM-CPU; can only increase TLP
• Controlling when the override condition triggers = controlling the trade-off between CPU and GPU benefits
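The override described on this slide can be sketched as a thin wrapper around whatever the CPU-centric policy proposes. The slide does not give the exact trigger condition, so comparing stallGPU against a fixed threshold is an assumption, and `stall_threshold` is a hypothetical knob standing in for the "control the triggering" idea.

```python
def cm_bal_step(cm_cpu_step, stall_gpu, stall_threshold):
    """Choose the warp-count change for one interval under a CM-BAL-like policy.

    cm_cpu_step:     change proposed by the CPU-centric policy (-1, 0, or +1)
    stall_gpu:       measured scheduler stalls at the GPU cores; more stalls
                     are read here as lower latency tolerance
    stall_threshold: hypothetical trigger level for the override (assumption)
    """
    if stall_gpu > stall_threshold:
        # Low latency tolerance: override CM-CPU. The override can only
        # increase TLP, never decrease it.
        return 1
    # Otherwise fall back to the CPU-centric decision unchanged.
    return cm_cpu_step
```

Under these assumptions, lowering `stall_threshold` makes the override fire more often and shifts the benefit toward the GPU, while raising it leaves more decisions to the CPU-centric policy.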


Evaluated Architecture

[Figure: tile-based design with 28 GPU tiles, 14 CPU tiles, and 8 LLC/MC tiles]


Evaluation Methodology

• Evaluated on an integrated platform with an in-house x86 CPU simulator and GPGPU-Sim
• Baseline architecture
  - 28 GPU cores, 14 CPU cores, 8 memory controllers, 2D mesh
  - GPU: 1400 MHz, SIMT width = 16*2, max. 1536 threads/core, GTO scheduler
  - CPU: 2000 MHz, out-of-order, 128-entry instruction window, max. 3 instructions/cycle
  - LLC: 8 MB, 128 B line, 16-way, 700 MHz
  - GDDR5, 800 MHz
• Workloads
  - 13 GPU applications
  - 34 CPU applications, 6 CPU application mixes
  - 36 diverse workloads
• Each workload: 1 GPU application + 1 CPU mix
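As a quick sanity check on the baseline numbers, the per-core warp limit can be derived from the thread limit. This sketch assumes that one warp holds as many threads as the SIMT width of 16*2 = 32, which the slide does not state explicitly; the variable names are illustrative.

```python
# Values taken from the baseline architecture listed above.
GPU_CORES = 28
SIMT_WIDTH = 16 * 2            # threads per warp (ASSUMED equal to the SIMT width)
MAX_THREADS_PER_CORE = 1536

# Upper bound on concurrently resident warps per GPU core.
max_warps_per_core = MAX_THREADS_PER_CORE // SIMT_WIDTH

# Upper bound on concurrently resident GPU threads across the chip.
max_resident_threads = GPU_CORES * MAX_THREADS_PER_CORE

print(max_warps_per_core, max_resident_threads)
```

Under that assumption the limit works out to 48 warps per core, which lines up with the "48 warps" configuration named in the System Performance legend.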


GPU Performance Results

[Figure: normalized GPU IPC over all 36 workloads; y-axis 0.8 to 1.2; bar annotations: 7%, -11%, 2%, -11%]


CPU Performance Results

[Figure: normalized CPU weighted speedup (WS) over all 36 workloads; y-axis 0.8 to 1.4; x-axis ticks 1 to 6; bar annotations: 24%, 7%, 2%, 19%]


System Performance

• OSS = (1 − α) × WS_CPU + α × SU_GPU (ISCA 2012)
• α is between 0 and 1
• Higher α -> higher GPU importance

[Figure: normalized OSS versus alpha (0 to 1); y-axis 0.75 to 1.25; series: 48 warps, DYNCTA, CM-CPU, CM-BAL; annotations: Obj. 1, Obj. 2 (Balanced)]
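The OSS metric is a weighted sum, so its behavior at different values of α is easy to check numerically. In the sketch below the function name is illustrative, and the two input pairs borrow the headline averages from the Executive Summary (+24% CPU / -11% GPU, and +7% for both) purely as example inputs; treating them as normalized WS_CPU and SU_GPU values is an assumption, not a result reported on the slides.

```python
def overall_system_speedup(ws_cpu, su_gpu, alpha):
    """OSS = (1 - alpha) * WS_CPU + alpha * SU_GPU, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * ws_cpu + alpha * su_gpu


# Illustrative inputs only: (normalized WS_CPU, normalized SU_GPU).
example_inputs = {
    "cpu_favoring": (1.24, 0.89),
    "balanced": (1.07, 1.07),
}

for alpha in (0.0, 0.5, 1.0):
    for name, (ws_cpu, su_gpu) in example_inputs.items():
        oss = overall_system_speedup(ws_cpu, su_gpu, alpha)
        print(f"alpha={alpha:.1f} {name}: OSS={oss:.3f}")
```

At α = 0 the metric reduces to the CPU term and at α = 1 to the GPU term, which is why a scheme that trades GPU performance for CPU performance looks best at low α.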


More in the Paper

• Motivation
  - Analysis of the metrics used by our algorithm
• Scheme
  - Detailed hardware walkthrough of our scheme
• Results
  - Analysis over time
  - Change in GPU TLP
  - Change in the metrics used by our algorithm
  - Comparison against static approaches
  - Lower number of LLC accesses


Conclusions

• Sharing the memory hierarchy causes CPU and GPU applications to interfere with each other
• Existing GPU TLP management techniques are not well suited to heterogeneous architectures
• We propose two GPU TLP management techniques for heterogeneous architectures
  - CM-CPU reduces GPU TLP to improve CPU performance
  - CM-BAL is similar to CM-CPU, but increases GPU TLP when it detects low latency tolerance in GPU cores
  - TLP can be tuned based on the user's preference for higher CPU or GPU performance


THANKS!              
