TRANSCRIPT

Increasing the throughput of your GPU-enabled cluster with rCUDA

Federico Silla, Technical University of Valencia, Spain
HPC Advisory Council Spain Conference 2014, Santander
The scope of this talk
Proposed cluster performance

• Performance numbers collected from real executions in an 8-node cluster*, each node with one NVIDIA K20 GPU
• Workload composed of a mix of the LAMMPS, GROMACS, GPU-Blast and MCUDA-MEME applications

*Node characteristics: two E5-2620 v2 sockets (6 cores at 2.1 GHz), 32 GB DDR3 RAM, and one NVIDIA Tesla K20 GPU
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler
• Some performance numbers
• Further considerations
Characteristics of GPU-based clusters

A GPU computing facility is usually a set of independent, self-contained nodes that leverage the shared-nothing approach:

• Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster; see the sketch below)
• GPUs can only be used within the node they are attached to

[Figure: six cluster nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory, joined by an interconnection network]
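To make the shared-nothing point concrete, here is a minimal MPI sketch (in C; the buffer contents are illustrative, not from the talk): the only way one node can see another node's data is through an explicit message.

```c
#include <mpi.h>
#include <stdio.h>

/* Shared-nothing in practice: rank 0 must explicitly send its
 * buffer to rank 1; no memory is shared between the two nodes. */
int main(int argc, char **argv) {
    int rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f ... %.1f\n", buf[0], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```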
First concern with accelerated clusters

• Applications can only use the GPUs located within their node
• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores

[Figure: the same cluster; a CPU-only application spreading over four of the nodes would make their GPUs unavailable for accelerated applications]
Money leakage in current clusters?

For many workloads, GPUs may be idle for significant periods of time:

• Initial acquisition costs are not amortized
• Space: GPUs reduce CPU density
• Energy: idle GPUs keep consuming power

[Figure: power (Watts) over time (s) for an idle 1-GPU node and an idle 4-GPU node; annotation: 25%]

• 1 GPU node: 2 E5-2620 v2 sockets and 32 GB DDR3 RAM. Tesla K20 GPU
• 4 GPUs node: 2 E5-2620 v2 sockets and 128 GB DDR3 RAM. 4 Tesla K20 GPUs
Second concern with accelerated clusters

• Applications can only use the GPUs located within their node
• Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if those GPUs are idle)

[Figure: the same cluster; all the GPUs outside the nodes running the MPI multi-GPU application cannot be used by it]
One more concern with accelerated clusters

• Do applications fully squeeze the GPUs present in the cluster?
• Even if all GPUs are assigned to running applications, the computational resources inside the GPUs may not be fully used:
  • the application presents a low level of parallelism
  • CPU code is being executed
  • GPU cores stall due to lack of data
  • etc.

[Figure: the same cluster, with all GPUs assigned to running applications]
Why is GPU-cluster performance lost?

In summary …
• There are scenarios where GPUs are available but cannot be used
• Accelerated applications do not make use of GPUs 100% of the time

In conclusion …
• We are losing GPU cycles, thus reducing cluster performance
We need something more in the cluster

The current model for using GPUs is too rigid. What is missing is some flexibility for using the GPUs in the cluster.
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler
• Some performance numbers
• Further considerations
What is needed for increased flexibility?

Two ingredients are required to cook a higher-throughput GPU-based cluster:

• A way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization)
• Enhanced job schedulers that take the new shared GPUs into account
The remote GPU virtualization vision

Remote GPU virtualization allows a new view of a GPU deployment, moving from the usual cluster configuration:

[Figure: the usual configuration — each node uses only its own PCIe-attached GPUs]

to the following one …
[Figure: physical configuration vs logical configuration — physically, each GPU is still installed in some node; logically, every node can use every GPU in the cluster through logical connections over the interconnection network]
Busy cores are no longer a problem

[Figure: physical vs logical configuration again — even when all the CPU cores of the GPU-equipped nodes are busy, their GPUs remain reachable from the other nodes through the logical connections]
GPU virtualization is also useful for multi-GPU applications

• Without GPU virtualization, only the GPUs in the node can be provided to the application
• With GPU virtualization, many GPUs in the cluster can be provided to the application

Multi-GPU applications get a benefit (see the sketch below).

[Figure: the same cluster without and with GPU virtualization; with it, the logical connections give a single application access to GPUs spread across many nodes]
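The application side requires no changes: a standard CUDA multi-GPU loop like this sketch (the kernel and sizes are illustrative, not from the talk) simply sees more devices under remote GPU virtualization, because the reported device count is no longer bounded by the GPUs physically installed in the node.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);  // under remote virtualization this can
                                 // exceed the GPUs physically in the node
    printf("Using %d GPUs\n", count);

    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);        // local or remote GPU: same API either way
        float *x;
        cudaMalloc(&x, 1024 * sizeof(float));
        scale<<<4, 256>>>(x, 1024);
        cudaDeviceSynchronize(); // wait for this device's kernel
        cudaFree(x);
    }
    return 0;
}
```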
Problem with remote GPU virtualization

The main drawback of remote GPU virtualization is the increased latency and reduced bandwidth to the remote GPU.

[Figure: influence of data transfers for SGEMM — percentage of time devoted to data transfers vs matrix size (up to 18000), for pinned and non-pinned host memory. Data from a matrix-matrix multiplication using a local GPU!!!]
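Since the figure singles out pinned versus non-pinned host memory, here is a minimal sketch of that difference in plain CUDA (buffer size and names are illustrative): page-locked buffers can be transferred by DMA and usually reach much higher bandwidth, which matters even more when the GPU sits at the other end of a network.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;        // 64 MiB test buffer
    float *dev, *pageable, *pinned;

    cudaMalloc(&dev, bytes);
    pageable = (float *)malloc(bytes);    // ordinary (non-pinned) memory
    cudaMallocHost(&pinned, bytes);       // page-locked (pinned) memory

    // Same API call in both cases; the pinned copy can be DMA'd
    // directly and is typically significantly faster.
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev, pinned,   bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```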
About the second ingredient

Current job schedulers, such as SLURM, know about real GPUs but cannot manage virtual GPUs. Enhancing the schedulers is required to effectively take advantage of GPU virtualization.
More about enhanced GPU scheduling

One step further: enhance the scheduling process so that GPU servers are put into low-power sleep modes as soon as their acceleration features are not required.
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler (I)
• Some performance numbers
• Further considerations
Remote GPU virtualization frameworks

Several efforts have been made regarding GPU virtualization during the last years:

Publicly available:
• rCUDA (CUDA 6.0)
• GVirtuS (CUDA 3.2)
• DS-CUDA (CUDA 4.1)

NOT publicly available:
• vCUDA (CUDA 1.1)
• GViM (CUDA 1.1)
• GridCUDA (CUDA 2.3)
• V-GPU (CUDA 4.0)
Basics of the rCUDA framework

[Figure: basic CUDA behavior — the application calls the CUDA libraries, which drive the GPU installed in the same node]

[Figure: rCUDA behavior — the application's CUDA calls are handled by the rCUDA client library and forwarded over the network to the rCUDA server, which executes them on its local GPU]

A minimal example of the (unchanged) application code follows.
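Here is a minimal program of the kind the figures describe (the kernel and names are illustrative, not from the talk). Nothing in the source is rCUDA-specific, which is the point: under rCUDA the same unmodified program runs with these calls served by a remote GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 256;
    float host[256], *dev;
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    // The canonical CUDA flow: allocate, copy in, launch, copy back.
    // Under rCUDA each call is transparently forwarded to the server.
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<1, n>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %.1f, host[255] = %.1f\n", host[0], host[255]);
    return 0;
}
```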
How to declare remote GPUs

Environment variables are initialized on the client side and used by the rCUDA client, transparently to the application. They declare:

• the amount of GPUs exposed to applications
• for each GPU, the server name/IP address and the GPU index within that server ("server:GPU")

A sketch follows.
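This is how it might look in practice, assuming the RCUDA_DEVICE_COUNT / RCUDA_DEVICE_j variable names from the rCUDA documentation (the server names are made up); the shell setup is shown as comments above an ordinary CUDA device query:

```cuda
// Client-side setup (shell), before launching the application:
//   export RCUDA_DEVICE_COUNT=2            # GPUs exposed to applications
//   export RCUDA_DEVICE_0=gpuserver1:0     # server name/IP : GPU index
//   export RCUDA_DEVICE_1=gpuserver2:0
//
// The application itself is unmodified CUDA; it simply sees two devices:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);  // reports the GPUs declared above
    printf("GPUs visible to this application: %d\n", count);
    return 0;
}
```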
rCUDA presents a modular architecture

[Figure: the rCUDA client and server share a common core, with interchangeable communication modules (e.g., TCP/IP and InfiniBand) underneath]

A conceptual sketch of the client side follows.
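As a rough illustration of what the client side of such a modular design does (a conceptual sketch, not rCUDA's actual code or wire format): the library exposes the CUDA API, packs each intercepted call into a request, and hands the buffer to whichever communication module (TCP/IP, InfiniBand, …) is configured.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Conceptual sketch of call forwarding; NOT rCUDA's real protocol.
enum ReqKind : uint32_t { REQ_MALLOC, REQ_MEMCPY_H2D, REQ_LAUNCH };

struct ReqHeader {
    ReqKind  kind;
    uint64_t dev_ptr;   // remote device pointer
    uint64_t payload;   // bytes of data following the header
};

// Pack a host-to-device copy into a message buffer. A communication
// module (TCP, InfiniBand, ...) would ship the buffer to the server,
// which replays the call on a real GPU and returns the result.
size_t pack_memcpy_h2d(uint8_t *msg, uint64_t dst,
                       const void *src, uint64_t bytes) {
    ReqHeader h{REQ_MEMCPY_H2D, dst, bytes};
    std::memcpy(msg, &h, sizeof h);
    std::memcpy(msg + sizeof h, src, bytes);
    return sizeof h + bytes;
}

int main() {
    uint8_t msg[128];
    float data[4] = {1.f, 2.f, 3.f, 4.f};
    size_t n = pack_memcpy_h2d(msg, 0x1000, data, sizeof data);
    std::printf("packed a %zu-byte request\n", n);
    return 0;
}
```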
Performance of the rCUDA framework

Test system:
• Dual Intel Xeon E5-2620 v2 (6 cores) at 2.1 GHz (Ivy Bridge)
• NVIDIA Tesla K20 GPU
• Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
• Mellanox SX6025 switch
• Cisco SLM2014 switch (1 Gbps Ethernet)
• CentOS 6.3 + Mellanox OFED 1.5.3
Single-GPU applications

CUDASW++: bioinformatics software for Smith-Waterman protein database searches.

[Figure: execution time (s) and rCUDA overhead (%) vs sequence length (144 to 5478), comparing CUDA with rCUDA over FDR InfiniBand, QDR InfiniBand and Gigabit Ethernet; callout: small overhead]
Multi-GPU applications

MonteCarlo Multi-GPU (from the NVIDIA SDK):

[Figure: two result panels, one where lower is better and one where higher is better]
Connect-IB and applications

CUDA-MEME application:
• NVIDIA Tesla K40 GPU
• Mellanox ConnectX-3 single-port (FDR) and Connect-IB adapters

[Figure: execution time, lower is better; annotation: 0.19%]
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler (II)
• Some performance numbers
• Further considerations
Integrating rCUDA with SLURM

• SLURM does not know about virtualized GPUs
• SLURM must be enhanced in order to manage the new virtualized GPUs
The basic idea about SLURM
The basic idea about SLURM + rCUDA

[Figure: the same schedule with SLURM + rCUDA — GPUs are decoupled from nodes, and all jobs are executed in less time]
Sharing remote GPUs among jobs

[Figure: GPUs are decoupled from nodes and GPU 0 is scheduled to be shared among jobs; all jobs are executed in even less time]
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler
• Some performance numbers
• Further considerations
SLURM+rCUDA test bench description

Test bench for analyzing SLURM+rCUDA performance:
• InfiniBand ConnectX-3 based cluster
• CentOS 6.4 Linux
• Dual-socket Intel Xeon E5-2620 v2 nodes:
  • 1 node without a GPU, hosting the main SLURM controller
  • 8 nodes with one NVIDIA K20 GPU each
Applications for testing SLURM+rCUDA

[Table: configuration for each of the applications; in our tests, GROMACS does not use GPUs — it is a CPU-only application]

Three different workload sizes were used:
• Small (≈ 70 jobs)
• Medium (≈ 170 jobs)
• Large (≈ 330 jobs)
Cluster performance with rCUDA+SLURM

Let's reduce the amount of GPUs in the cluster.

The time that GPUs are allocated is increased.
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler
• Some performance numbers
• Further considerations
rCUDA is the enabling technology for …

• High Throughput Computing
  • Sharing remote GPUs makes applications execute slower … BUT more throughput (jobs/time) is achieved
• Green Computing
  • GPU migration and application migration allow devoting just the required computing resources to the current load
• More flexible system upgrades
  • GPU and CPU updates become independent from each other. Attaching GPU boxes to non-GPU-enabled clusters is possible
• Datacenter administrators can choose between HPC and HTC

Get a free copy of rCUDA at http://www.rcuda.net
@rcuda_r