Transparent Accelerator Migration in Virtualized GPU Environments
Shucai Xiao [1], Pavan Balaji [2], James Dinan [2], Qian Zhu [3], Rajeev Thakur [2], Susan Coghlan [2], Heshan Lin [4], Gaojin Wen [5], Jue Hong [5], Wu-chun Feng [4]
[1] AMD   [2] Argonne National Laboratory   [3] Accenture Technologies   [4] Virginia Tech   [5] Shenzhen Institute of Advanced Technologies
Pavan Balaji, Argonne National Laboratory CCGrid 2012 (05/14/2012)
Trends in Graphics Processing Unit Performance
(Courtesy Bill Dally @ NVIDIA)
Graphics Processing Unit Usage in Applications
(From the NVIDIA website)
June 2011 Top 5 Supercomputers (from the Top500 list)
1. RIKEN Advanced Institute for Computational Science (AICS), Japan: K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect (Fujitsu)
2. National Supercomputing Center in Tianjin, China: Tianhe-1A, NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C (NUDT)
3. DOE/SC/Oak Ridge National Laboratory, United States: Jaguar, Cray XT5-HE, Opteron 6-core 2.6 GHz (Cray Inc.)
4. National Supercomputing Center in Shenzhen (NSCS), China: Nebulae, Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 GPU (Dawning)
5. GSIC Center, Tokyo Institute of Technology, Japan: TSUBAME 2.0, HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows (NEC/HP)
(From the Top500 website)
GPUs in Heterogeneous Environments
GPU programming environments today assume local access to GPUs
– Two commonly used programming models: CUDA and OpenCL
– Both make the same assumption that GPUs are local to the node
Many supercomputers are not homogeneous with respect to GPUs: not every node has one
GPUs as a Cloud Service
Today, there is no model for providing GPUs as a cloud service
– What if a lot of GPUs are available in a cloud? Can I access them remotely?
– Or do I need to buy GPUs and plug them into my local computer to access them?
VOCL: A Virtual Implementation of OpenCL to access and manage remote GPU adapters
GPU Virtualization
– Transparent utilization of remote GPUs
• Remote GPUs look like local "virtual" GPUs
• Applications can access them as if they were regular local GPUs (see the sketch below)
• VOCL automatically moves data and computation
– Efficient GPU resource management
• Virtual GPUs can migrate from one physical GPU to another
• If a system administrator wants to add or remove a node, he/she can do so while applications are running (hot-swap capability)
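To make the transparency claim concrete, below is a minimal, unmodified OpenCL host snippet that simply enumerates GPU devices. Under the VOCL model, the intent is that code like this keeps working as-is, with remote GPUs appearing as local virtual GPUs; the device names and counts depend on the deployment, and the snippet is illustrative only.

```c
/* Plain OpenCL host code: enumerate GPUs and print their names.
 * Under VOCL, the same unmodified code is intended to list remote
 * GPUs as local "virtual" GPUs (illustrative sketch only). */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint ndev = 0;
    char name[128];

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
        return 1;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, devices, &ndev);

    for (cl_uint i = 0; i < ndev; i++) {
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof name, name, NULL);
        printf("GPU %u: %s\n", i, name);   /* virtual GPUs show up here too */
    }
    return 0;
}
```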
Virtual OpenCL (VOCL) Framework
(Diagram. Traditional model: on a compute node, the application calls the OpenCL API of the native OpenCL library, which drives the local physical GPU. VOCL model: the application calls the same OpenCL API, but against the VOCL library and its virtual GPUs; the VOCL library communicates over MPI with VOCL proxies running on remote compute nodes, and each proxy invokes the native OpenCL library to drive its physical GPU.)
VOCL Infrastructure

VOCL Library
– Implements the OpenCL functionality
• API compatibility: API functions in VOCL have the same interface as in OpenCL
• ABI compatibility with the system's native OpenCL: no recompilation needed; relinking is needed only if the application is statically built (runtime relinking in the common case)
– The VOCL library calls MPI functions to send input data to, and receive output data from, remote nodes (sketched below)

VOCL Service Proxy
– Located on remote GPU nodes
– Application processes can dynamically connect to proxy processes to use the GPUs associated with that proxy
– Receives input from and sends output to the application process
– Calls native OpenCL functions for GPU computation
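As a rough illustration of the last point (not VOCL's actual wire protocol), the sketch below forwards one intercepted OpenCL operation, a buffer write, to the proxy over MPI. The message layout, tag, proxy rank, and helper name are invented for this example; on the proxy side, the corresponding handler would call the native clEnqueueWriteBuffer and return the resulting error code.

```c
/* Hypothetical sketch of forwarding an OpenCL buffer write to a VOCL
 * proxy over MPI; message format, tag, and PROXY_RANK are illustrative. */
#include <mpi.h>
#include <CL/cl.h>
#include <stddef.h>

#define PROXY_RANK       1      /* rank of the proxy in this example       */
#define TAG_WRITE_BUFFER 100    /* made-up tag for this request type       */

struct write_buffer_msg {       /* request header (assumes a homogeneous   */
    cl_mem remote_buffer;       /* cluster); the handle is only meaningful */
    size_t offset;              /* on the proxy side                       */
    size_t size;
};

static cl_int forward_write_buffer(MPI_Comm proxy_comm, cl_mem remote_buffer,
                                   size_t offset, size_t size, const void *ptr)
{
    struct write_buffer_msg msg = { remote_buffer, offset, size };
    cl_int err;

    /* 1. Describe which remote buffer to write and how much data follows.  */
    MPI_Send(&msg, (int)sizeof msg, MPI_BYTE, PROXY_RANK, TAG_WRITE_BUFFER,
             proxy_comm);
    /* 2. Ship the payload; the proxy calls the native clEnqueueWriteBuffer. */
    MPI_Send((void *)ptr, (int)size, MPI_BYTE, PROXY_RANK, TAG_WRITE_BUFFER,
             proxy_comm);
    /* 3. Get back the OpenCL error code produced on the proxy.             */
    MPI_Recv(&err, (int)sizeof err, MPI_BYTE, PROXY_RANK, TAG_WRITE_BUFFER,
             proxy_comm, MPI_STATUS_IGNORE);
    return err;
}
```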
VOCL Performance

(Charts: execution time and % slowdown of VOCL (remote) relative to native OpenCL (local) for matrix transpose, Smith-Waterman, N-body, and SGEMM, across matrix sizes from 1K x 1K to 6K x 6K, sequence sizes from 1K to 6K, and 15360 to 53760 bodies.)
“Transparent Virtualization of Graphics Processing Units”, S. Xiao, P. Balaji, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen, J. Hong and W. Feng. International Conference on Innovative Parallel Computing (InPar), 2012
Speedup with Multiple Virtual GPUs
(Chart: overall speedup, on a log scale, for N-body, matrix multiplication, Smith-Waterman, and matrix transpose when using 1, 2, 4, 8, 16, and 32 virtual GPUs.)
Contribution of This Paper

One advantage of virtual GPUs is that the physical GPU to which they are mapped is transparent to the user
– This mapping of virtual GPU to physical GPU can change dynamically

This paper: virtual GPU migration
– Maintenance
• If a system administrator wants to take a machine down for maintenance, he/she can migrate all virtual GPUs on that physical GPU to another physical GPU
• Easy to add new nodes to the system; can be done while applications are running
– Resource management
• Load balancing: depending on usage, virtual GPUs can be remapped to different physical GPUs so that the load on any given physical GPU stays low
• Power management: scheduling multiple virtual GPUs onto the same physical GPU can allow for power savings
Virtual GPU Migration with VOCL: Model
(Diagram: the application's virtual GPU is served, through the VOCL library and the OpenCL API, by a VOCL proxy and its physical GPU on a remote compute node. When a system administrator wants to take that node down for maintenance, a message to migrate the virtual GPU is sent, communication is suspended, the physical GPU state is migrated to a new physical GPU on another node, and the virtual GPU is remapped to the new physical GPU.)
Using VOCL Migration for Resource Management
(Diagram: two application processes, each holding several virtual GPUs through its VOCL library, are mapped onto three remote compute nodes, each running a VOCL proxy in front of its physical GPU; migration remaps virtual GPUs across the physical GPUs to rebalance the load.)
Virtual GPU Migration Details: Queuing Command Issues
When a non-blocking operation is issued to the GPU, it is queued within the GPU runtime system and executed at a later time

Problem:
– What happens to the non-blocking operations already issued to the GPU when a migration is triggered?
– OpenCL provides no way to cancel these operations
– Waiting for them to finish is an option, but it can increase the migration overhead significantly (each kernel can take an arbitrarily long time)

Our solution: maintain an internal queue of unposted operations (see the sketch below)
– Restricts the number of non-blocking operations handed over to the GPU
– Improves the responsiveness of virtual GPU migration
– Can cause additional overhead, which can be controlled through the queue depth
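A rough sketch of this idea (not VOCL's actual code): cap the number of non-blocking kernel launches that are actually handed to the OpenCL runtime, so that at migration time at most QUEUE_DEPTH kernels ever need to drain. Here the caller simply blocks when the window is full, whereas VOCL keeps the unposted operations in its own internal queue; the constant and bookkeeping variables are hypothetical.

```c
/* Illustrative throttle: at most QUEUE_DEPTH kernel launches are
 * outstanding in the OpenCL runtime at any time (hypothetical sketch). */
#include <CL/cl.h>

#define QUEUE_DEPTH 8                    /* the tunable "N value"           */

static cl_event inflight[QUEUE_DEPTH];   /* completion events, oldest first */
static int      head = 0, ninflight = 0;

cl_int enqueue_kernel_throttled(cl_command_queue q, cl_kernel k, cl_uint dim,
                                const size_t *gsize, const size_t *lsize)
{
    cl_int err;

    if (ninflight == QUEUE_DEPTH) {
        /* Window full: wait for the oldest posted kernel to finish. */
        err = clWaitForEvents(1, &inflight[head]);
        if (err != CL_SUCCESS)
            return err;
        clReleaseEvent(inflight[head]);
        head = (head + 1) % QUEUE_DEPTH;
        ninflight--;
    }

    /* Hand the new kernel to the runtime and remember its event. */
    int tail = (head + ninflight) % QUEUE_DEPTH;
    err = clEnqueueNDRangeKernel(q, k, dim, NULL, gsize, lsize,
                                 0, NULL, &inflight[tail]);
    if (err == CL_SUCCESS)
        ninflight++;
    return err;
}
```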
Virtual GPU Migration Details: Atomic Transactions
Atomic transactions
– Since a virtual GPU can migrate from one physical GPU to another, all transactions on the virtual GPU have to be atomic
• E.g., we cannot transfer any data or kernel to a physical GPU while it is in the process of migrating
– We emulate mutex behavior with MPI RMA operations (a sketch follows)
• The VOCL library internally obtains a mutex lock before issuing any GPU operations
• When migration is required, the proxy obtains this mutex lock, thus blocking the VOCL library from issuing any additional calls
• Once the migration is done, the proxy updates the appropriate data structures in the VOCL library with the remapped physical GPU information and releases the lock
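For concreteness, one common way to emulate such a mutex with MPI one-sided operations is sketched below, using a single integer lock word (initialized to 0) exposed through an MPI window. The sketch uses MPI-3 atomic RMA calls for brevity and illustrates the technique only; the actual VOCL implementation may use a different RMA-based scheme.

```c
/* Illustrative MPI RMA mutex: a lock word (int, initially 0) lives in
 * an MPI window on MUTEX_RANK; names and layout are hypothetical. */
#include <mpi.h>

#define MUTEX_RANK 0      /* rank that hosts the lock word               */
#define MUTEX_DISP 0      /* displacement of the lock word in the window */

void mutex_lock(MPI_Win win)
{
    int unlocked = 0, locked = 1, prev;
    do {
        MPI_Win_lock(MPI_LOCK_SHARED, MUTEX_RANK, 0, win);
        /* Atomically set the lock word to 1 only if it is currently 0. */
        MPI_Compare_and_swap(&locked, &unlocked, &prev, MPI_INT,
                             MUTEX_RANK, MUTEX_DISP, win);
        MPI_Win_unlock(MUTEX_RANK, win);
    } while (prev != 0);  /* retry until we saw 0, i.e. the lock was free */
}

void mutex_unlock(MPI_Win win)
{
    int zero = 0, prev;
    MPI_Win_lock(MPI_LOCK_SHARED, MUTEX_RANK, 0, win);
    /* Atomically reset the lock word to 0, releasing the mutex. */
    MPI_Fetch_and_op(&zero, &prev, MPI_INT, MUTEX_RANK, MUTEX_DISP,
                     MPI_REPLACE, win);
    MPI_Win_unlock(MUTEX_RANK, win);
}
```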
Virtual GPU Migration Details: Target GPU Selection
Identifying which physical GPU to migrate to
– VOCL records all transactions to the GPU and keeps track of the number of kernels that have been issued to the GPU but have not yet completed
– The GPU with the fewest pending kernels is chosen as the target GPU (sketched below)
– Not ideal, but it gives some indication of load
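A sketch of that selection policy follows; the per-GPU counter array and the function name are hypothetical bookkeeping, assuming the counters are incremented when a kernel is issued and decremented when it completes.

```c
/* Pick the physical GPU with the fewest pending (issued but not yet
 * completed) kernels; pending_kernels[] is hypothetical bookkeeping. */
#define MAX_PHYS_GPUS 64

static int pending_kernels[MAX_PHYS_GPUS];   /* updated on issue/complete */

int select_target_gpu(int num_gpus, int exclude)
{
    int best = -1;
    for (int i = 0; i < num_gpus; i++) {
        if (i == exclude)        /* skip the GPU being vacated */
            continue;
        if (best < 0 || pending_kernels[i] < pending_kernels[best])
            best = i;
    }
    return best;                 /* -1 if no candidate GPU exists */
}
```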
Presentation Layout
Introduction and Motivation
VOCL: Goals, Design and Optimizations
VOCL: Virtual GPU Migration (Ongoing Work)
Performance Results and Analysis
Concluding Remarks
Other Research Areas
Impact of Internal Queue on Application Performance
(Chart: program execution time in seconds for N-body, Smith-Waterman, matrix transpose, and matrix multiplication as the internal queue depth N is varied over 2, 4, 8, 12, 16, 20, and unbounded.)
Impact of Internal Queue on Migration Overhead
(Chart: time spent waiting for already-posted operations to complete, in milliseconds, for matrix multiplication, N-body, matrix transpose, and Smith-Waterman as the internal queue depth N is varied over 2, 4, 8, 12, 16, and 20.)
Migration Overhead with Regard to Problem Size

(Charts: program execution time with and without migration, and the overhead caused by migration, for matrix multiplication, N-body, Smith-Waterman, and matrix transpose across increasing problem sizes.)
Impact of Load Rebalancing through Migration

(Charts: program execution time with and without migration, and the speedup brought by migration, for matrix multiplication, N-body, Smith-Waterman, and matrix transpose across increasing problem sizes.)
Presentation Layout
Introduction and Motivation
VOCL: Goals, Design and Optimizations
VOCL: Virtual GPU Migration (Ongoing Work)
Performance Results and Analysis
Concluding Remarks
Other Research Areas
Concluding Remarks

Current GPU programming environments do not allow us to use GPUs in heterogeneous environments
– If not all nodes in the system have GPUs
– Or if GPUs are available in a cloud and customers want to use them

The VOCL framework bridges this gap, allowing users to use remote GPUs transparently as if they were local virtual GPUs
– Several optimizations to achieve the best performance
– Almost no overhead for compute-intensive applications

Some future directions:
– Fault tolerance: auto-redundancy, fault-triggered migration
– Co-scheduling and resource management: how can we use virtual environments to co-schedule cooperating applications?
– Elastic resources: allowing applications to take advantage of an increasing/decreasing number of GPUs
Personnel Acknowledgments

Current Students
– Palden Lama (Ph.D.)
– Yan Li (Ph.D.)
– Ziaul Olive Haque (Ph.D.)
– Xin Zhao (Ph.D.)

Past Students
– Li Rao (M.S.)
– Lukasz Wesolowski (Ph.D.)
– Feng Ji (Ph.D.)
– John Jenkins (Ph.D.)
– Ashwin Aji (Ph.D.)
– Shucai Xiao (Ph.D.)
– Piotr Fidkowski (Ph.D.)
– Sreeram Potluri (Ph.D.)
– James S. Dinan (Ph.D.)
– Gopalakrishnan Santhanaraman (Ph.D.)
– Ping Lai (Ph.D.)
– Rajesh Sudarsan (Ph.D.)
– Thomas Scogland (Ph.D.)
– Ganesh Narayanaswamy (M.S.)

Current Staff Members and Postdocs
– James S. Dinan (Postdoc)
– Jiayuan Meng (Postdoc)
– Darius T. Buntinas (Assistant Computer Scientist)
– David J. Goodell (Software Developer)
– Jeff Hammond (Assistant Computational Scientist)

Past Staff Members and Postdocs
– Qian Zhu (Postdoc)

External Collaborators
– Wu-chun Feng, Virginia Tech
– Heshan Lin, Virginia Tech
– Laxmikant Kale, UIUC
– William Gropp, UIUC
– Xiaosong Ma, NCSU
– Nagiza Samatova, NCSU
– Howard Pritchard, Cray
– Jue Hong, SIAT, CAS, Shenzhen
– Gaojin Wen, SIAT, CAS, Shenzhen
– Satoshi Matsuoka, TiTech, Japan
– .... (many others)
Data Movement Overhead
(Charts: bandwidth in GB/s and % slowdown for host-memory-to-device-memory and device-memory-to-host-memory transfers, comparing native OpenCL (local), VOCL (local), and VOCL (remote) for data block sizes from 512K to 32768K bytes.)
Real World Application Kernels
Computation and memory access complexities of four applications
– In SGEMM/DGEMM and Matrix Transpose, n is the number of rows and columns in the matrix
– In N-body, n is the number of bodies
– In Smith-Waterman, n is the number of letters in the input sequences

Application Kernel    Computation    Memory Access
SGEMM/DGEMM           O(n^3)         O(n^2)
N-body                O(n^2)         O(n)
Matrix Transpose      O(n^2)         O(n^2)
Smith-Waterman        O(n^2)         O(n^2)
Percentage of Kernel Execution Time
(Chart: percentage of total program execution time spent in kernel execution for N-body, SGEMM, Smith-Waterman, and matrix transpose across increasing problem sizes.)