Transparent Accelerator Migration in Virtualized GPU Environments
Shucai Xiao [1], Pavan Balaji [2], James Dinan [2], Qian Zhu [3], Rajeev Thakur [2], Susan Coghlan [2], Heshan Lin [4], Gaojin Wen [5], Jue Hong [5], Wu-chun Feng [4]
[1] AMD   [2] Argonne National Laboratory   [3] Accenture Technologies   [4] Virginia Tech   [5] Shenzhen Institute of Advanced Technologies
Pavan Balaji, Argonne National Laboratory CCGrid 2012 (05/14/2012)
Trends in Graphics Processing Unit Performance
(Courtesy Bill Dally @ NVIDIA)
Graphics Processing Unit Usage in Applications
(From the NVIDIA website)
June 2011 Top 5 Supercomputers (from the Top500 list)
1. RIKEN Advanced Institute for Computational Science (AICS), Japan: K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect (Fujitsu)
2. National Supercomputing Center in Tianjin, China: Tianhe-1A, NUDT TH MPP, X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C (NUDT)
3. DOE/SC/Oak Ridge National Laboratory, United States: Jaguar, Cray XT5-HE, Opteron 6-core 2.6 GHz (Cray Inc.)
4. National Supercomputing Center in Shenzhen (NSCS), China: Nebulae, Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 GPU (Dawning)
5. GSIC Center, Tokyo Institute of Technology, Japan: TSUBAME 2.0, HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows (NEC/HP)
(From the Top500 website)
GPUs in Heterogeneous Environments
GPU programming environments today assume local access to GPUs
– Two commonly used programming models: CUDA and OpenCL
– Both make the same assumption that GPUs are local to the node
Many supercomputers are not homogeneous with respect to GPUs: not every node has one
GPUs as a Cloud Service
Today, there is no model for providing GPUs as a cloud service
– What if a lot of GPUs are available in a cloud? Can I access them remotely?
– Or do I need to buy GPUs and plug them into my local computer to access them?
VOCL: A Virtual Implementation of OpenCL to access and manage remote GPU adapters
GPU Virtualization
– Transparent utilization of remote GPUs
• Remote GPUs look like local "virtual" GPUs
• Applications can access them as if they were regular local GPUs (see the sketch below)
• VOCL automatically moves data and computation
– Efficient GPU resource management
• Virtual GPUs can migrate from one physical GPU to another
• If a system administrator wants to add or remove a node, he/she can do so while applications are running (hot-swap capability)
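To make the transparency claim concrete, below is a minimal, unmodified OpenCL host snippet that simply enumerates GPU devices. Under the VOCL model, the intent is that code like this keeps working as-is, with remote GPUs appearing as local virtual GPUs; the device names and counts depend on the deployment, and the snippet is illustrative only.

```c
/* Plain OpenCL host code: enumerate GPUs and print their names.
 * Under VOCL, the same unmodified code is intended to list remote
 * GPUs as local "virtual" GPUs (illustrative sketch only). */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint ndev = 0;
    char name[128];

    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
        return 1;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 16, devices, &ndev);

    for (cl_uint i = 0; i < ndev; i++) {
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof name, name, NULL);
        printf("GPU %u: %s\n", i, name);   /* virtual GPUs show up here too */
    }
    return 0;
}
```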
Virtual OpenCL (VOCL) Framework
(Diagram. Traditional model: on a compute node, the application calls the OpenCL API of the native OpenCL library, which drives the local physical GPU. VOCL model: the application calls the same OpenCL API, but against the VOCL library and its virtual GPUs; the VOCL library communicates over MPI with VOCL proxies running on remote compute nodes, and each proxy invokes the native OpenCL library to drive its physical GPU.)
VOCL Infrastructure

VOCL Library
– Implements the OpenCL functionality
• API compatibility: API functions in VOCL have the same interface as in OpenCL
• ABI compatibility with the system's native OpenCL: no recompilation needed; relinking is needed only if the application is statically built (runtime relinking in the common case)
– The VOCL library calls MPI functions to send input data to, and receive output data from, remote nodes (sketched below)

VOCL Service Proxy
– Located on remote GPU nodes
– Application processes can dynamically connect to proxy processes to use the GPUs associated with that proxy
– Receives input from and sends output to the application process
– Calls native OpenCL functions for GPU computation
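As a rough illustration of the last point (not VOCL's actual wire protocol), the sketch below forwards one intercepted OpenCL operation, a buffer write, to the proxy over MPI. The message layout, tag, proxy rank, and helper name are invented for this example; on the proxy side, the corresponding handler would call the native clEnqueueWriteBuffer and return the resulting error code.

```c
/* Hypothetical sketch of forwarding an OpenCL buffer write to a VOCL
 * proxy over MPI; message format, tag, and PROXY_RANK are illustrative. */
#include <mpi.h>
#include <CL/cl.h>
#include <stddef.h>

#define PROXY_RANK       1      /* rank of the proxy in this example       */
#define TAG_WRITE_BUFFER 100    /* made-up tag for this request type       */

struct write_buffer_msg {       /* request header (assumes a homogeneous   */
    cl_mem remote_buffer;       /* cluster); the handle is only meaningful */
    size_t offset;              /* on the proxy side                       */
    size_t size;
};

static cl_int forward_write_buffer(MPI_Comm proxy_comm, cl_mem remote_buffer,
                                   size_t offset, size_t size, const void *ptr)
{
    struct write_buffer_msg msg = { remote_buffer, offset, size };
    cl_int err;

    /* 1. Describe which remote buffer to write and how much data follows.  */
    MPI_Send(&msg, (int)sizeof msg, MPI_BYTE, PROXY_RANK, TAG_WRITE_BUFFER,
             proxy_comm);
    /* 2. Ship the payload; the proxy calls the native clEnqueueWriteBuffer. */
    MPI_Send((void *)ptr, (int)size, MPI_BYTE, PROXY_RANK, TAG_WRITE_BUFFER,
             proxy_comm);
    /* 3. Get back the OpenCL error code produced on the proxy.             */
    MPI_Recv(&err, (int)sizeof err, MPI_BYTE, PROXY_RANK, TAG_WRITE_BUFFER,
             proxy_comm, MPI_STATUS_IGNORE);
    return err;
}
```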
VOCL Performance

(Charts: execution time and % slowdown of VOCL (remote) relative to native OpenCL (local) for matrix transpose, Smith-Waterman, N-body, and SGEMM, across matrix sizes from 1K x 1K to 6K x 6K, sequence sizes from 1K to 6K, and 15360 to 53760 bodies.)
“Transparent Virtualization of Graphics Processing Units”, S. Xiao, P. Balaji, Q. Zhu, R. Thakur, S. Coghlan, H. Lin, G. Wen, J. Hong and W. Feng. International Conference on Innovative Parallel Computing (InPar), 2012
Speedup with Multiple Virtual GPUs
(Chart: overall speedup, on a log scale, for N-body, matrix multiplication, Smith-Waterman, and matrix transpose when using 1, 2, 4, 8, 16, and 32 virtual GPUs.)
Contribution of This Paper

One advantage of virtual GPUs is that the physical GPU to which they are mapped is transparent to the user
– This mapping of virtual GPU to physical GPU can change dynamically

This paper: virtual GPU migration
– Maintenance
• If a system administrator wants to take a machine down for maintenance, he/she can migrate all virtual GPUs on that physical GPU to another physical GPU
• Easy to add new nodes to the system; can be done while applications are running
– Resource management
• Load balancing: depending on usage, virtual GPUs can be remapped to different physical GPUs so that the load on any given physical GPU stays low
• Power management: scheduling multiple virtual GPUs onto the same physical GPU can allow for power savings
Virtual GPU Migration with VOCL: Model
(Diagram: the application's virtual GPU is served, through the VOCL library and the OpenCL API, by a VOCL proxy and its physical GPU on a remote compute node. When a system administrator wants to take that node down for maintenance, a message to migrate the virtual GPU is sent, communication is suspended, the physical GPU state is migrated to a new physical GPU on another node, and the virtual GPU is remapped to the new physical GPU.)
Using VOCL Migration for Resource Management
(Diagram: two application processes, each holding several virtual GPUs through its VOCL library, are mapped onto three remote compute nodes, each running a VOCL proxy in front of its physical GPU; migration remaps virtual GPUs across the physical GPUs to rebalance the load.)
Virtual GPU Migration Details: Queuing Command Issues
When a non-blocking operation is issued to the GPU, it is queued within the GPU runtime system and executed at a later time

Problem:
– What happens to the non-blocking operations already issued to the GPU when a migration is triggered?
– OpenCL provides no way to cancel these operations
– Waiting for them to finish is an option, but it can increase the migration overhead significantly (each kernel can take an arbitrarily long time)

Our solution: maintain an internal queue of unposted operations (see the sketch below)
– Restricts the number of non-blocking operations handed over to the GPU
– Improves the responsiveness of virtual GPU migration
– Can cause additional overhead, which can be controlled through the queue depth
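A rough sketch of this idea (not VOCL's actual code): cap the number of non-blocking kernel launches that are actually handed to the OpenCL runtime, so that at migration time at most QUEUE_DEPTH kernels ever need to drain. Here the caller simply blocks when the window is full, whereas VOCL keeps the unposted operations in its own internal queue; the constant and bookkeeping variables are hypothetical.

```c
/* Illustrative throttle: at most QUEUE_DEPTH kernel launches are
 * outstanding in the OpenCL runtime at any time (hypothetical sketch). */
#include <CL/cl.h>

#define QUEUE_DEPTH 8                    /* the tunable "N value"           */

static cl_event inflight[QUEUE_DEPTH];   /* completion events, oldest first */
static int      head = 0, ninflight = 0;

cl_int enqueue_kernel_throttled(cl_command_queue q, cl_kernel k, cl_uint dim,
                                const size_t *gsize, const size_t *lsize)
{
    cl_int err;

    if (ninflight == QUEUE_DEPTH) {
        /* Window full: wait for the oldest posted kernel to finish. */
        err = clWaitForEvents(1, &inflight[head]);
        if (err != CL_SUCCESS)
            return err;
        clReleaseEvent(inflight[head]);
        head = (head + 1) % QUEUE_DEPTH;
        ninflight--;
    }

    /* Hand the new kernel to the runtime and remember its event. */
    int tail = (head + ninflight) % QUEUE_DEPTH;
    err = clEnqueueNDRangeKernel(q, k, dim, NULL, gsize, lsize,
                                 0, NULL, &inflight[tail]);
    if (err == CL_SUCCESS)
        ninflight++;
    return err;
}
```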
Virtual GPU Migration Details: Atomic Transactions
Atomic transactions
– Since a virtual GPU can migrate from one physical GPU to another, all transactions on the virtual GPU have to be atomic
• E.g., we cannot transfer any data or kernel to a physical GPU while it is in the process of migrating
– We emulate mutex behavior with MPI RMA operations (a sketch follows)
• The VOCL library internally obtains a mutex lock before issuing any GPU operations
• When migration is required, the proxy obtains this mutex lock, thus blocking the VOCL library from issuing any additional calls
• Once the migration is done, the proxy updates the appropriate data structures in the VOCL library with the remapped physical GPU information and releases the lock
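For concreteness, one common way to emulate such a mutex with MPI one-sided operations is sketched below, using a single integer lock word (initialized to 0) exposed through an MPI window. The sketch uses MPI-3 atomic RMA calls for brevity and illustrates the technique only; the actual VOCL implementation may use a different RMA-based scheme.

```c
/* Illustrative MPI RMA mutex: a lock word (int, initially 0) lives in
 * an MPI window on MUTEX_RANK; names and layout are hypothetical. */
#include <mpi.h>

#define MUTEX_RANK 0      /* rank that hosts the lock word               */
#define MUTEX_DISP 0      /* displacement of the lock word in the window */

void mutex_lock(MPI_Win win)
{
    int unlocked = 0, locked = 1, prev;
    do {
        MPI_Win_lock(MPI_LOCK_SHARED, MUTEX_RANK, 0, win);
        /* Atomically set the lock word to 1 only if it is currently 0. */
        MPI_Compare_and_swap(&locked, &unlocked, &prev, MPI_INT,
                             MUTEX_RANK, MUTEX_DISP, win);
        MPI_Win_unlock(MUTEX_RANK, win);
    } while (prev != 0);  /* retry until we saw 0, i.e. the lock was free */
}

void mutex_unlock(MPI_Win win)
{
    int zero = 0, prev;
    MPI_Win_lock(MPI_LOCK_SHARED, MUTEX_RANK, 0, win);
    /* Atomically reset the lock word to 0, releasing the mutex. */
    MPI_Fetch_and_op(&zero, &prev, MPI_INT, MUTEX_RANK, MUTEX_DISP,
                     MPI_REPLACE, win);
    MPI_Win_unlock(MUTEX_RANK, win);
}
```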
Virtual GPU Migration Details: Target GPU Selection
Identifying which physical GPU to migrate to
– VOCL records all transactions to the GPU and keeps track of the number of kernels that have been issued to the GPU but have not yet completed
– The GPU with the fewest pending kernels is chosen as the target GPU (sketched below)
– Not ideal, but it gives some indication of load
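A sketch of that selection policy follows; the per-GPU counter array and the function name are hypothetical bookkeeping, assuming the counters are incremented when a kernel is issued and decremented when it completes.

```c
/* Pick the physical GPU with the fewest pending (issued but not yet
 * completed) kernels; pending_kernels[] is hypothetical bookkeeping. */
#define MAX_PHYS_GPUS 64

static int pending_kernels[MAX_PHYS_GPUS];   /* updated on issue/complete */

int select_target_gpu(int num_gpus, int exclude)
{
    int best = -1;
    for (int i = 0; i < num_gpus; i++) {
        if (i == exclude)        /* skip the GPU being vacated */
            continue;
        if (best < 0 || pending_kernels[i] < pending_kernels[best])
            best = i;
    }
    return best;                 /* -1 if no candidate GPU exists */
}
```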
Presentation Layout
Introduction and Motivation
VOCL: Goals, Design and Optimizations
VOCL: Virtual GPU Migration (Ongoing Work)
Performance Results and Analysis
Concluding Remarks
Other Research Areas
Impact of Internal Queue on Application Performance
(Chart: program execution time in seconds for N-body, Smith-Waterman, matrix transpose, and matrix multiplication as the internal queue depth N is varied over 2, 4, 8, 12, 16, 20, and unbounded.)
Impact of Internal Queue on Migration Overhead
(Chart: time spent waiting for already-posted operations to complete, in milliseconds, for matrix multiplication, N-body, matrix transpose, and Smith-Waterman as the internal queue depth N is varied over 2, 4, 8, 12, 16, and 20.)
Migration Overhead with Regard to Problem Size

(Charts: program execution time with and without migration, and the overhead caused by migration, for matrix multiplication, N-body, Smith-Waterman, and matrix transpose across increasing problem sizes.)
Impact of Load Rebalancing through Migration

(Charts: program execution time with and without migration, and the speedup brought by migration, for matrix multiplication, N-body, Smith-Waterman, and matrix transpose across increasing problem sizes.)
Presentation Layout
Introduction and Motivation
VOCL: Goals, Design and Optimizations
VOCL: Virtual GPU Migration (Ongoing Work)
Performance Results and Analysis
Concluding Remarks
Other Research Areas
Concluding Remarks

Current GPU programming environments do not allow us to use GPUs in heterogeneous environments
– If not all nodes in the system have GPUs
– Or if GPUs are available in a cloud and customers want to use them

The VOCL framework bridges this gap, allowing users to use remote GPUs transparently as if they were local virtual GPUs
– Several optimizations to achieve the best performance
– Almost no overhead for compute-intensive applications

Some future directions:
– Fault tolerance: auto-redundancy, fault-triggered migration
– Co-scheduling and resource management: how can we use virtual environments to co-schedule cooperating applications?
– Elastic resources: allowing applications to take advantage of an increasing/decreasing number of GPUs
Personnel Acknowledgments

Current Students
– Palden Lama (Ph.D.)
– Yan Li (Ph.D.)
– Ziaul Olive Haque (Ph.D.)
– Xin Zhao (Ph.D.)

Past Students
– Li Rao (M.S.)
– Lukasz Wesolowski (Ph.D.)
– Feng Ji (Ph.D.)
– John Jenkins (Ph.D.)
– Ashwin Aji (Ph.D.)
– Shucai Xiao (Ph.D.)
– Piotr Fidkowski (Ph.D.)
– Sreeram Potluri (Ph.D.)
– James S. Dinan (Ph.D.)
– Gopalakrishnan Santhanaraman (Ph.D.)
– Ping Lai (Ph.D.)
– Rajesh Sudarsan (Ph.D.)
– Thomas Scogland (Ph.D.)
– Ganesh Narayanaswamy (M.S.)

Current Staff Members and Postdocs
– James S. Dinan (Postdoc)
– Jiayuan Meng (Postdoc)
– Darius T. Buntinas (Assistant Computer Scientist)
– David J. Goodell (Software Developer)
– Jeff Hammond (Assistant Computational Scientist)

Past Staff Members and Postdocs
– Qian Zhu (Postdoc)

External Collaborators
– Wu-chun Feng, Virginia Tech
– Heshan Lin, Virginia Tech
– Laxmikant Kale, UIUC
– William Gropp, UIUC
– Xiaosong Ma, NCSU
– Nagiza Samatova, NCSU
– Howard Pritchard, Cray
– Jue Hong, SIAT, CAS, Shenzhen
– Gaojin Wen, SIAT, CAS, Shenzhen
– Satoshi Matsuoka, TiTech, Japan
– .... (many others)
Data Movement Overhead
(Charts: bandwidth in GB/s and % slowdown for host-memory-to-device-memory and device-memory-to-host-memory transfers, comparing native OpenCL (local), VOCL (local), and VOCL (remote) for data block sizes from 512K to 32768K bytes.)
Real World Application Kernels
Computation and memory access complexities of four applications
– In SGEMM/DGEMM and Matrix Transpose, n is the number of rows and columns in the matrix
– In N-body, n is the number of bodies
– In Smith-Waterman, n is the number of letters in the input sequences

Application Kernel    Computation    Memory Access
SGEMM/DGEMM           O(n^3)         O(n^2)
N-body                O(n^2)         O(n)
Matrix Transpose      O(n^2)         O(n^2)
Smith-Waterman        O(n^2)         O(n^2)
Percentage of Kernel Execution Time
(Chart: percentage of total program execution time spent in kernel execution for N-body, SGEMM, Smith-Waterman, and matrix transpose across increasing problem sizes.)