
Page 1: Transparent Accelerator Migration in Virtualized GPU Environments

Transparent Accelerator Migration in Virtualized GPU Environments

Shucai Xiao1, Pavan Balaji2, James Dinan2, Qian Zhu3, Rajeev Thakur2, Susan Coghlan2, Heshan Lin4, Gaojin Wen5, Jue Hong5, Wu-chun Feng4

1 AMD, 2 Argonne National Laboratory, 3 Accenture Technologies, 4 Virginia Tech, 5 Shenzhen Institute of Advanced Technologies

Page 2: Transparent Accelerator Migration in Virtualized GPU Environments

Pavan Balaji, Argonne National Laboratory CCGrid 2012 (05/14/2012)

Trends in Graphics Processing Unit Performance

(Courtesy Bill Dally @ NVIDIA)

Page 3: Transparent Accelerator Migration in Virtualized GPU Environments


Graphics Processing Unit Usage in Applications

(From the NVIDIA website)

Page 4: Transparent Accelerator Migration in Virtualized GPU Environments


June 2011 Top 5 Supercomputers (from the Top500 list)

1. RIKEN Advanced Institute for Computational Science (AICS), Japan: K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect (Fujitsu)
2. National Supercomputing Center in Tianjin, China: Tianhe-1A, NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C (NUDT)
3. DOE/SC/Oak Ridge National Laboratory, United States: Jaguar, Cray XT5-HE, Opteron 6-core 2.6 GHz (Cray Inc.)
4. National Supercomputing Center in Shenzhen (NSCS), China: Nebulae, Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 GPU (Dawning)
5. GSIC Center, Tokyo Institute of Technology, Japan: TSUBAME 2.0, HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows (NEC/HP)

(From the Top500 website)

Page 5: Transparent Accelerator Migration in Virtualized GPU Environments


GPUs in Heterogeneous Environments

GPU programming environments today assume local access to GPUs
- Two commonly used programming models: CUDA and OpenCL
- Both make the same assumption that GPUs are local

Many supercomputers are not homogeneous with respect to GPUs: not every node has one

Page 6: Transparent Accelerator Migration in Virtualized GPU Environments


GPUs as a Cloud Service

Today, there is no model for providing GPUs as a cloud service:
- What if a lot of GPUs are available in a cloud?
- Can I access them remotely?
- Or do I need to buy GPUs and plug them into my local computer to access them?

Page 7: Transparent Accelerator Migration in Virtualized GPU Environments


VOCL: A Virtual Implementation of OpenCL to access and manage remote GPU adapters

GPU Virtualization
- Transparent utilization of remote GPUs
  - Remote GPUs look like local "virtual" GPUs
  - Applications can access them as if they were regular local GPUs
  - VOCL automatically moves data and computation
- Efficient GPU resource management
  - Virtual GPUs can migrate from one physical GPU to another
  - If a system administrator wants to add or remove a node, he/she can do so while the applications are running (hot-swap capability)

Page 8: Transparent Accelerator Migration in Virtualized GPU Environments


Virtual OpenCL (VOCL) Framework

[Figure: architecture comparison. Traditional model: on a compute node, the application calls the OpenCL API and the native OpenCL library drives the local physical GPU. VOCL model: the application calls the same OpenCL API, but the VOCL library forwards the calls over MPI to VOCL proxies on remote compute nodes, where the native OpenCL library drives the physical GPUs; each remote GPU appears to the application as a local virtual GPU.]

Page 9: Transparent Accelerator Migration in Virtualized GPU Environments


VOCL Infrastructure

VOCL Library
- Implementation of the OpenCL functionality
  - API compatibility: API functions in VOCL have the same interface as those in OpenCL
  - ABI compatibility with the system's native OpenCL: no recompilation is needed; relinking is needed only if the application is statically built (runtime linking covers the common case)
- The VOCL library calls MPI functions to send input data to and receive output data from remote nodes

VOCL Service Proxy
- Located on remote GPU nodes
- Application processes can dynamically connect to proxy processes to use the GPUs associated with that proxy
- Receives input from and sends output to the application process
- Calls native OpenCL functions for GPU computation
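To make the compatibility claim concrete, here is a stock OpenCL host fragment; it is an illustrative sketch, not code from the paper. Linked (or preloaded) against the VOCL library instead of the native OpenCL library, the same source, and for dynamically linked binaries the same executable, should run unchanged, with the devices it discovers backed by remote GPUs:

    /* Plain OpenCL host code; nothing here is VOCL-specific. */
    #include <CL/cl.h>
    #include <stdio.h>

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        cl_uint ndev = 0;

        clGetPlatformIDs(1, &plat, NULL);
        /* Under VOCL, the "GPUs" found here are virtual GPUs that
           may live on other nodes, reached over MPI. */
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, &ndev);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
        /* ... create buffers, build kernels, enqueue work as usual ... */
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        printf("GPU devices visible: %u\n", ndev);
        return 0;
    }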

Page 10: Transparent Accelerator Migration in Virtualized GPU Environments


VOCL Performance

[Figure: four panels comparing execution time of Native OpenCL (local) and VOCL (remote), with % slowdown on a secondary axis. Matrix transpose: matrix sizes 1K x 1K to 6K x 6K, execution time in ms, slowdown axis 0-45%. Smith-Waterman: sequence sizes 1K to 6K, slowdown axis 0-40%. N-body: 15360 to 53760 bodies, slowdown axis 0-0.25%. SGEMM: matrix sizes 1K x 1K to 6K x 6K, slowdown axis 0-8%.]

“Transparent Virtualization of Graphics Processing Units”, S. Xiao, P. Balaji, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen, J. Hong and W. Feng. International Conference on Innovative Parallel Computing (InPar), 2012

Page 11: Transparent Accelerator Migration in Virtualized GPU Environments


Speedup with Multiple Virtual GPUs

[Figure: overall speedup (log scale, 1 to 50) of N-body, matrix multiplication, Smith-Waterman, and matrix transpose as the number of virtual GPUs increases from 1 to 32.]

Page 12: Transparent Accelerator Migration in Virtualized GPU Environments


Contribution of This Paper

One of the advantages of virtual GPUs is that the physical GPU to which they are mapped is transparent to the user
- This mapping of virtual GPU to physical GPU can change dynamically

This paper: virtual GPU migration
- Maintenance
  - If a system administrator wants to take a machine down for maintenance, he/she should be able to migrate all virtual GPUs on that physical GPU to another physical GPU
  - Makes it easy to add new nodes to the system; can be done while applications are running
- Resource management
  - Load balancing: depending on usage, virtual GPUs can be remapped to different physical GPUs so the load on any given physical GPU stays low
  - Power management: scheduling multiple virtual GPUs onto the same physical GPU can allow for power savings

Page 13: Transparent Accelerator Migration in Virtualized GPU Environments


Virtual GPU Migration with VOCL: Model

[Figure: migration model. The application's VOCL library is connected over MPI to a VOCL proxy on a node that the system administrator wants to take down for maintenance; a second proxy node hosts the target physical GPU.]

Migration proceeds as annotated in the figure:
1. A message is sent to migrate the virtual GPU
2. Communication between the VOCL library and the source proxy is suspended
3. The physical GPU state is migrated to the new physical GPU
4. The virtual GPU is remapped to the new physical GPU
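A minimal code-shaped view of those steps, with hypothetical helper names (the slides do not name VOCL's internal functions); each call corresponds to one annotation in the figure:

    typedef struct vocl_vgpu  vocl_vgpu_t;   /* opaque stand-ins for  */
    typedef struct vocl_proxy vocl_proxy_t;  /* VOCL's internal state */

    /* Hypothetical helpers, declared only; one per migration step. */
    void suspend_communication(vocl_vgpu_t *v);
    void drain_posted_operations(vocl_vgpu_t *v);
    void ship_device_state(vocl_vgpu_t *v, vocl_proxy_t *target);
    void remap_virtual_gpu(vocl_vgpu_t *v, vocl_proxy_t *target);
    void resume_communication(vocl_vgpu_t *v);

    /* Triggered by the "migrate virtual GPU" message (step 1). */
    void migrate_virtual_gpu(vocl_vgpu_t *v, vocl_proxy_t *target)
    {
        suspend_communication(v);      /* step 2: stop the VOCL library */
        drain_posted_operations(v);    /*   from issuing further work   */
        ship_device_state(v, target);  /* step 3: buffers, programs     */
        remap_virtual_gpu(v, target);  /* step 4: update library-side   */
        resume_communication(v);       /*   handles, then resume        */
    }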

Page 14: Transparent Accelerator Migration in Virtualized GPU Environments


Using VOCL Migration for Resource Management

[Figure: resource-management example. Two application nodes, each holding several virtual GPUs through the VOCL library, are mapped to VOCL proxies and physical GPUs on three compute nodes; migration changes which physical GPU backs each virtual GPU.]

Page 15: Transparent Accelerator Migration in Virtualized GPU Environments


Virtual GPU Migration Details: Queuing Command Issues

When a non-blocking operation is issued to the GPU, it is queued within the GPU runtime system and executed at a later time.

Problem:
- What happens to the non-blocking operations already issued to the GPU when a migration is triggered?
- OpenCL provides no way to cancel these operations
- Waiting for them to finish is an option, but it can increase the migration overhead significantly (each kernel can take an arbitrarily long time)

Our solution: maintain an internal queue of unposted operations (sketched below)
- Restricts the number of non-blocking operations handed over to the GPU
- Improves the responsiveness of virtual GPU migration
- Can cause additional overhead, but this is controlled by the queue depth
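A sketch of that bounded issue queue, assuming illustrative names and a fixed depth (the paper tunes this "N value" experimentally; VOCL's real bookkeeping is richer):

    #include <CL/cl.h>
    #include <stddef.h>

    #define QUEUE_DEPTH 8   /* the tunable "N value" from the paper */

    /* A deferred operation: whatever is needed to post it later. */
    typedef struct deferred_op {
        struct deferred_op *next;
        cl_kernel kernel;            /* plus captured arguments ... */
    } deferred_op_t;

    static deferred_op_t *unposted = NULL;    /* host-side queue      */
    static cl_event inflight[QUEUE_DEPTH];    /* ops given to the GPU */
    static cl_uint num_inflight = 0;

    /* Intercepts a non-blocking launch: post it only if fewer than
       QUEUE_DEPTH operations are already with the GPU runtime. */
    static void vocl_issue(cl_command_queue q, deferred_op_t *op)
    {
        if (num_inflight < QUEUE_DEPTH) {
            clEnqueueTask(q, op->kernel, 0, NULL,
                          &inflight[num_inflight++]);
        } else {
            op->next = unposted;  /* hold it back on the host (a real
                                     implementation preserves order) */
            unposted = op;
        }
    }

    /* On migration, only the posted operations must be awaited; the
       unposted list is simply replayed on the new physical GPU. */
    static void vocl_quiesce(void)
    {
        if (num_inflight > 0)
            clWaitForEvents(num_inflight, inflight);
        num_inflight = 0;
    }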

Page 16: Transparent Accelerator Migration in Virtualized GPU Environments


Virtual GPU Migration Details: Atomic Transactions

Atomic transactions:
- Since a virtual GPU can migrate from one physical GPU to another, all transactions on the virtual GPU have to be atomic
  - E.g., we cannot transfer any data or kernel to a physical GPU while it is in the process of migrating
- We emulate mutex behavior with MPI RMA operations (see the sketch below)
  - The VOCL library internally obtains the mutex lock before issuing any GPU operations
  - When migration is required, the proxy obtains this mutex lock, thus blocking the VOCL library from issuing any additional calls
  - Once the migration is done, the proxy updates all the appropriate data structures at the VOCL library with the remapped physical GPU information and releases the lock
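One way to realize such a mutex over MPI RMA, sketched here with MPI-3 atomics. This is an assumption, not the paper's code: the paper predates the MPI-3 standard, so VOCL's actual mechanism likely used accumulate-based MPI-2 locking. All ranks are assumed to hold a passive-target access epoch on the window (MPI_Win_lock_all):

    #include <mpi.h>

    #define LOCK_RANK 0   /* rank hosting the single int lock word */

    /* Spin until we atomically swap the lock word from 0 to 1. */
    static void vocl_mutex_lock(MPI_Win win)
    {
        const int one = 1, unlocked = 0;
        int prev = 1;
        while (prev != 0) {
            MPI_Compare_and_swap(&one, &unlocked, &prev, MPI_INT,
                                 LOCK_RANK, 0, win);
            MPI_Win_flush(LOCK_RANK, win);
        }
    }

    /* Atomically write 0 back to release the lock. */
    static void vocl_mutex_unlock(MPI_Win win)
    {
        const int unlocked = 0;
        int prev;
        MPI_Fetch_and_op(&unlocked, &prev, MPI_INT,
                         LOCK_RANK, 0, MPI_REPLACE, win);
        MPI_Win_flush(LOCK_RANK, win);
    }

In this picture, the VOCL library would call vocl_mutex_lock before each GPU operation, and the proxy grabs the same lock to fence off the library for the duration of a migration.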

Page 17: Transparent Accelerator Migration in Virtualized GPU Environments


Virtual GPU Migration Details: Target GPU Selection

Identifying which physical GPU to migrate to:
- VOCL records all transactions to the GPU and keeps track of the number of kernels that have been issued to the GPU but have not yet completed
- Whichever GPU has the fewest pending kernels is chosen as the target GPU (see the sketch below)
- Not ideal, but it gives some indication of load
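The heuristic is simple enough to state in a few lines; the counter array and GPU table are illustrative assumptions, not VOCL's data structures:

    #include <limits.h>

    #define MAX_GPUS 64

    /* Incremented on kernel launch, decremented on completion;
       maintained from VOCL's transaction records. */
    static int pending_kernels[MAX_GPUS];

    /* Pick the physical GPU with the fewest pending kernels. */
    static int select_target_gpu(int num_gpus, int source_gpu)
    {
        int best = -1, best_load = INT_MAX;
        for (int g = 0; g < num_gpus; g++) {
            if (g == source_gpu)
                continue;              /* never migrate onto itself */
            if (pending_kernels[g] < best_load) {
                best_load = pending_kernels[g];
                best = g;
            }
        }
        return best;  /* crude, but a usable proxy for load */
    }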

Page 18: Transparent Accelerator Migration in Virtualized GPU Environments


Presentation Layout

Introduction and Motivation

VOCL: Goals, Design and Optimizations

VOCL: Virtual GPU Migration (Ongoing Work)

Performance Results and Analysis

Concluding Remarks

Other Research Areas

Page 19: Transparent Accelerator Migration in Virtualized GPU Environments


Impact of Internal Queue on Application Performance

[Figure: program execution time (seconds) of N-body, Smith-Waterman, matrix transpose, and matrix multiplication as the internal queue depth N varies over 2, 4, 8, 12, 16, 20, and infinity (no limit).]

Page 20: Transparent Accelerator Migration in Virtualized GPU Environments


Impact of Internal Queue on Migration Overhead

[Figure: time spent waiting for posted operations to complete (ms, up to 160) during migration for matrix multiplication, N-body, matrix transpose, and Smith-Waterman as the queue depth N varies over 2, 4, 8, 12, 16, and 20.]

Page 21: Transparent Accelerator Migration in Virtualized GPU Environments


Migration Overhead with Regard to Problem Size

[Figure: four panels showing program execution time with and without migration, and the overhead caused by migration on a secondary axis. Matrix multiplication and matrix transpose: matrix sizes 1K x 1K to 6K x 6K, overhead axes up to 6-7%. Smith-Waterman: sequence sizes 1K to 6K, overhead axis -10% to 60%. N-body: 15360 to 53760 bodies, overhead axis up to 2%.]

Page 22: Transparent Accelerator Migration in Virtualized GPU Environments


Impact of Load Rebalancing through Migration

[Figure: four panels showing program execution time with and without migration, and the speedup brought by migration on a secondary axis (up to about 2x). Matrix multiplication and matrix transpose: matrix sizes 1K x 1K to 6K x 6K. Smith-Waterman: sequence sizes 1K to 6K. N-body: 15360 to 53760 bodies.]

Page 23: Transparent Accelerator Migration in Virtualized GPU Environments


Presentation Layout

Introduction and Motivation

VOCL: Goals, Design and Optimizations

VOCL: Virtual GPU Migration (Ongoing Work)

Performance Results and Analysis

Concluding Remarks

Other Research Areas

Page 24: Transparent Accelerator Migration in Virtualized GPU Environments


Concluding Remarks

Current GPU programming environments do not let us use GPUs in heterogeneous environments:
- When not all nodes on the system have GPUs
- Or when GPUs are available in a cloud and customers want to use them

The VOCL framework bridges this gap, allowing users to use remote GPUs transparently as if they were local virtual GPUs
- Several optimizations to achieve the best performance
- Almost no overhead for compute-intensive applications

Some future directions:
- Fault tolerance: auto-redundancy, fault-triggered migration
- Co-scheduling and resource management: how can we use virtual environments to co-schedule cooperating applications?
- Elastic resources: allowing applications to take advantage of an increasing or decreasing number of GPUs

Page 25: Transparent Accelerator Migration in Virtualized GPU Environments


Personnel Acknowledgments

Current Students:
- Palden Lama (Ph.D.)
- Yan Li (Ph.D.)
- Ziaul Olive Haque (Ph.D.)
- Xin Zhao (Ph.D.)

Past Students:
- Li Rao (M.S.)
- Lukasz Wesolowski (Ph.D.)
- Feng Ji (Ph.D.)
- John Jenkins (Ph.D.)
- Ashwin Aji (Ph.D.)
- Shucai Xiao (Ph.D.)
- Piotr Fidkowski (Ph.D.)
- Sreeram Potluri (Ph.D.)
- James S. Dinan (Ph.D.)
- Gopalakrishnan Santhanaraman (Ph.D.)
- Ping Lai (Ph.D.)
- Rajesh Sudarsan (Ph.D.)
- Thomas Scogland (Ph.D.)
- Ganesh Narayanaswamy (M.S.)

Current Staff Members and Postdocs:
- James S. Dinan (Postdoc)
- Jiayuan Meng (Postdoc)
- Darius T. Buntinas (Assistant Computer Scientist)
- David J. Goodell (Software Developer)
- Jeff Hammond (Assistant Computational Scientist)

Past Staff Members and Postdocs:
- Qian Zhu (Postdoc)

External Collaborators:
- Wu-chun Feng, Virginia Tech
- Heshan Lin, Virginia Tech
- Laxmikant Kale, UIUC
- William Gropp, UIUC
- Xiaosong Ma, NCSU
- Nagiza Samatova, NCSU
- Howard Pritchard, Cray
- Jue Hong, SIAT, CAS, Shenzhen
- Gaojin Wen, SIAT, CAS, Shenzhen
- Satoshi Matsuoka, TiTech, Japan
- ... (many others)

Page 26: Transparent Accelerator Migration in Virtualized GPU Environments

Thank You!

Email: [email protected]

Webpage: http://www.mcs.anl.gov/~balaji

Page 27: Transparent Accelerator Migration in Virtualized GPU Environments


Data Movement Overhead

[Figure: two panels of bandwidth (GB/s) versus data block size (512K to 32768K bytes) for native OpenCL (local), VOCL (local), and VOCL (remote), with % slowdown for the local and remote VOCL cases on a secondary axis. Host memory to device memory: slowdown axis -5% to 35%. Device memory to host memory: slowdown axis 0% to 12%.]

Page 28: Transparent Accelerator Migration in Virtualized GPU Environments

Real World Application Kernels

Computation and memory-access complexities of the four applications:
- In SGEMM/DGEMM and matrix transpose, n is the number of rows and columns in the matrix
- In N-body, n is the number of bodies
- In Smith-Waterman, n is the number of letters in the input sequences

Application Kernel | Computation | Memory Access
SGEMM/DGEMM        | O(n^3)      | O(n^2)
N-body             | O(n^2)      | O(n)
Matrix Transpose   | O(n^2)      | O(n^2)
Smith-Waterman     | O(n^2)      | O(n^2)
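Read together with the earlier slowdown plots, these asymptotics suggest why SGEMM/DGEMM and N-body hide the virtualization cost while matrix transpose and Smith-Waterman do not; this interpretation is mine, the slides only list the complexities. The compute-to-data ratios are:

$$
\frac{\text{computation}}{\text{memory access}} =
\begin{cases}
O(n^3)/O(n^2) = O(n) & \text{SGEMM/DGEMM} \\
O(n^2)/O(n)   = O(n) & \text{N-body} \\
O(n^2)/O(n^2) = O(1) & \text{Matrix Transpose, Smith-Waterman}
\end{cases}
$$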

Page 29: Transparent Accelerator Migration in Virtualized GPU Environments


Percentage of Kernel Execution Time

[Figure: percentage of total execution time spent in GPU kernels (0-100%) for N-body, SGEMM, Smith-Waterman, and matrix transpose across six increasing problem sizes.]