TRANSCRIPT

Increasing the throughput of your GPU-enabled cluster with rCUDA

Federico Silla, Technical University of Valencia, Spain
HPC Advisory Council Spain Conference 2014, Santander
The scope of this talk
Proposed cluster performance

• Performance numbers collected from real executions in an 8-node cluster*, each node with one NVIDIA K20 GPU
• Workload composed of a mix of the LAMMPS, GROMACS, GPU-Blast and MCUDA-MEME applications

*Node characteristics: two E5-2620 v2 sockets (6 cores at 2.1 GHz), 32 GB DDR3 RAM, and one NVIDIA Tesla K20 GPU
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler
• Some performance numbers
• Further considerations
Characteristics of GPU-based clusters

A GPU computing facility is usually a set of independent, self-contained nodes that leverage the shared-nothing approach:

• Nothing is directly shared among nodes (MPI is required for aggregating computing resources within the cluster; see the sketch below)
• GPUs can only be used within the node they are attached to

[Figure: six cluster nodes, each with a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory, joined by an interconnection network]
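To make the shared-nothing point concrete, here is a minimal MPI sketch (in C; the buffer contents are illustrative, not from the talk): the only way one node can see another node's data is through an explicit message.

```c
#include <mpi.h>
#include <stdio.h>

/* Shared-nothing in practice: rank 0 must explicitly send its
 * buffer to rank 1; no memory is shared between the two nodes. */
int main(int argc, char **argv) {
    int rank;
    double buf[4] = {1.0, 2.0, 3.0, 4.0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %.1f ... %.1f\n", buf[0], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```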
First concern with accelerated clusters

• Applications can only use the GPUs located within their node
• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores

[Figure: the same cluster; a CPU-only application spreading over four of the nodes would make their GPUs unavailable for accelerated applications]
Money leakage in current clusters?

For many workloads, GPUs may be idle for significant periods of time:

• Initial acquisition costs are not amortized
• Space: GPUs reduce CPU density
• Energy: idle GPUs keep consuming power

[Figure: power (Watts) over time (s) for an idle 1-GPU node and an idle 4-GPU node; annotation: 25%]

• 1 GPU node: 2 E5-2620 v2 sockets and 32 GB DDR3 RAM. Tesla K20 GPU
• 4 GPUs node: 2 E5-2620 v2 sockets and 128 GB DDR3 RAM. 4 Tesla K20 GPUs
Second concern with accelerated clusters

• Applications can only use the GPUs located within their node
• Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if those GPUs are idle)

[Figure: the same cluster; all the GPUs outside the nodes running the MPI multi-GPU application cannot be used by it]
One more concern with accelerated clusters

• Do applications fully squeeze the GPUs present in the cluster?
• Even if all GPUs are assigned to running applications, the computational resources inside the GPUs may not be fully used:
  • the application presents a low level of parallelism
  • CPU code is being executed
  • GPU cores stall due to lack of data
  • etc.

[Figure: the same cluster, with all GPUs assigned to running applications]
Why is GPU-cluster performance lost?

In summary …
• There are scenarios where GPUs are available but cannot be used
• Accelerated applications do not make use of GPUs 100% of the time

In conclusion …
• We are losing GPU cycles, thus reducing cluster performance
We need something more in the cluster

The current model for using GPUs is too rigid. What is missing is some flexibility for using the GPUs in the cluster.
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler
• Some performance numbers
• Further considerations
What is needed for increased flexibility?

Two ingredients are required to cook a higher-throughput GPU-based cluster:

• A way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization)
• Enhanced job schedulers that take the new shared GPUs into account
The remote GPU virtualization vision

Remote GPU virtualization allows a new view of a GPU deployment, moving from the usual cluster configuration:

[Figure: the usual configuration — each node uses only its own PCIe-attached GPUs]

to the following one …
[Figure: physical configuration vs logical configuration — physically, each GPU is still installed in some node; logically, every node can use every GPU in the cluster through logical connections over the interconnection network]
Busy cores are no longer a problem

[Figure: physical vs logical configuration again — even when all the CPU cores of the GPU-equipped nodes are busy, their GPUs remain reachable from the other nodes through the logical connections]
GPU virtualization is also useful for multi-GPU applications

• Without GPU virtualization, only the GPUs in the node can be provided to the application
• With GPU virtualization, many GPUs in the cluster can be provided to the application

Multi-GPU applications get a benefit (see the sketch below).

[Figure: the same cluster without and with GPU virtualization; with it, the logical connections give a single application access to GPUs spread across many nodes]
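The application side requires no changes: a standard CUDA multi-GPU loop like this sketch (the kernel and sizes are illustrative, not from the talk) simply sees more devices under remote GPU virtualization, because the reported device count is no longer bounded by the GPUs physically installed in the node.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);  // under remote virtualization this can
                                 // exceed the GPUs physically in the node
    printf("Using %d GPUs\n", count);

    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);        // local or remote GPU: same API either way
        float *x;
        cudaMalloc(&x, 1024 * sizeof(float));
        scale<<<4, 256>>>(x, 1024);
        cudaDeviceSynchronize(); // wait for this device's kernel
        cudaFree(x);
    }
    return 0;
}
```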
Problem with remote GPU virtualization

The main drawback of remote GPU virtualization is the increased latency and reduced bandwidth to the remote GPU.

[Figure: influence of data transfers for SGEMM — percentage of time devoted to data transfers vs matrix size (up to 18000), for pinned and non-pinned host memory. Data from a matrix-matrix multiplication using a local GPU!!!]
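Since the figure singles out pinned versus non-pinned host memory, here is a minimal sketch of that difference in plain CUDA (buffer size and names are illustrative): page-locked buffers can be transferred by DMA and usually reach much higher bandwidth, which matters even more when the GPU sits at the other end of a network.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 << 20;        // 64 MiB test buffer
    float *dev, *pageable, *pinned;

    cudaMalloc(&dev, bytes);
    pageable = (float *)malloc(bytes);    // ordinary (non-pinned) memory
    cudaMallocHost(&pinned, bytes);       // page-locked (pinned) memory

    // Same API call in both cases; the pinned copy can be DMA'd
    // directly and is typically significantly faster.
    cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev, pinned,   bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```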
About the second ingredient

Current job schedulers, such as SLURM, know about real GPUs but cannot manage virtual GPUs. Enhancing the schedulers is required to effectively take advantage of GPU virtualization.
More about enhanced GPU scheduling

One step further: enhance the scheduling process so that GPU servers are put into low-power sleep modes as soon as their acceleration features are not required.
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler (I)
• Some performance numbers
• Further considerations
Remote GPU virtualization frameworks

Several efforts have been made regarding GPU virtualization during the last years:

Publicly available:
• rCUDA (CUDA 6.0)
• GVirtuS (CUDA 3.2)
• DS-CUDA (CUDA 4.1)

NOT publicly available:
• vCUDA (CUDA 1.1)
• GViM (CUDA 1.1)
• GridCUDA (CUDA 2.3)
• V-GPU (CUDA 4.0)
Basics of the rCUDA framework

[Figure: basic CUDA behavior — the application calls the CUDA libraries, which drive the GPU installed in the same node]

[Figure: rCUDA behavior — the application's CUDA calls are handled by the rCUDA client library and forwarded over the network to the rCUDA server, which executes them on its local GPU]

A minimal example of the (unchanged) application code follows.
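Here is a minimal program of the kind the figures describe (the kernel and names are illustrative, not from the talk). Nothing in the source is rCUDA-specific, which is the point: under rCUDA the same unmodified program runs with these calls served by a remote GPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 256;
    float host[256], *dev;
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    // The canonical CUDA flow: allocate, copy in, launch, copy back.
    // Under rCUDA each call is transparently forwarded to the server.
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    add_one<<<1, n>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("host[0] = %.1f, host[255] = %.1f\n", host[0], host[255]);
    return 0;
}
```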
How to declare remote GPUs

Environment variables are initialized on the client side and used by the rCUDA client, transparently to the application. They declare:

• the amount of GPUs exposed to applications
• for each GPU, the server name/IP address and the GPU index within that server ("server:GPU")

A sketch follows.
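This is how it might look in practice, assuming the RCUDA_DEVICE_COUNT / RCUDA_DEVICE_j variable names from the rCUDA documentation (the server names are made up); the shell setup is shown as comments above an ordinary CUDA device query:

```cuda
// Client-side setup (shell), before launching the application:
//   export RCUDA_DEVICE_COUNT=2            # GPUs exposed to applications
//   export RCUDA_DEVICE_0=gpuserver1:0     # server name/IP : GPU index
//   export RCUDA_DEVICE_1=gpuserver2:0
//
// The application itself is unmodified CUDA; it simply sees two devices:
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);  // reports the GPUs declared above
    printf("GPUs visible to this application: %d\n", count);
    return 0;
}
```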
rCUDA presents a modular architecture

[Figure: the rCUDA client and server share a common core, with interchangeable communication modules (e.g., TCP/IP and InfiniBand) underneath]

A conceptual sketch of the client side follows.
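As a rough illustration of what the client side of such a modular design does (a conceptual sketch, not rCUDA's actual code or wire format): the library exposes the CUDA API, packs each intercepted call into a request, and hands the buffer to whichever communication module (TCP/IP, InfiniBand, …) is configured.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Conceptual sketch of call forwarding; NOT rCUDA's real protocol.
enum ReqKind : uint32_t { REQ_MALLOC, REQ_MEMCPY_H2D, REQ_LAUNCH };

struct ReqHeader {
    ReqKind  kind;
    uint64_t dev_ptr;   // remote device pointer
    uint64_t payload;   // bytes of data following the header
};

// Pack a host-to-device copy into a message buffer. A communication
// module (TCP, InfiniBand, ...) would ship the buffer to the server,
// which replays the call on a real GPU and returns the result.
size_t pack_memcpy_h2d(uint8_t *msg, uint64_t dst,
                       const void *src, uint64_t bytes) {
    ReqHeader h{REQ_MEMCPY_H2D, dst, bytes};
    std::memcpy(msg, &h, sizeof h);
    std::memcpy(msg + sizeof h, src, bytes);
    return sizeof h + bytes;
}

int main() {
    uint8_t msg[128];
    float data[4] = {1.f, 2.f, 3.f, 4.f};
    size_t n = pack_memcpy_h2d(msg, 0x1000, data, sizeof data);
    std::printf("packed a %zu-byte request\n", n);
    return 0;
}
```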
Performance of the rCUDA framework

Test system:
• Dual Intel Xeon E5-2620 v2 (6 cores) at 2.1 GHz (Ivy Bridge)
• NVIDIA Tesla K20 GPU
• Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
• Mellanox SX6025 switch
• Cisco SLM2014 switch (1 Gbps Ethernet)
• CentOS 6.3 + Mellanox OFED 1.5.3
Single-GPU applications

CUDASW++: bioinformatics software for Smith-Waterman protein database searches.

[Figure: execution time (s) and rCUDA overhead (%) vs sequence length (144 to 5478), comparing CUDA with rCUDA over FDR InfiniBand, QDR InfiniBand and Gigabit Ethernet; callout: small overhead]
Multi-GPU applications

MonteCarlo Multi-GPU (from the NVIDIA SDK):

[Figure: two result panels, one where lower is better and one where higher is better]
Connect-IB and applications

CUDA-MEME application:
• NVIDIA Tesla K40 GPU
• Mellanox ConnectX-3 single-port (FDR) and Connect-IB adapters

[Figure: execution time, lower is better; annotation: 0.19%]
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler (II)
• Some performance numbers
• Further considerations
Integrating rCUDA with SLURM

• SLURM does not know about virtualized GPUs
• SLURM must be enhanced in order to manage the new virtualized GPUs
The basic idea about SLURM
The basic idea about SLURM + rCUDA

[Figure: the same schedule with SLURM + rCUDA — GPUs are decoupled from nodes, and all jobs are executed in less time]
Sharing remote GPUs among jobs

[Figure: GPUs are decoupled from nodes and GPU 0 is scheduled to be shared among jobs; all jobs are executed in even less time]
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler
• Some performance numbers
• Further considerations
SLURM+rCUDA test bench description

Test bench for analyzing SLURM+rCUDA performance:
• InfiniBand ConnectX-3 based cluster
• CentOS 6.4 Linux
• Dual-socket Intel Xeon E5-2620 v2 nodes:
  • 1 node without a GPU, hosting the main SLURM controller
  • 8 nodes with one NVIDIA K20 GPU each
Applications for testing SLURM+rCUDA

[Table: configuration for each of the applications; in our tests, GROMACS does not use GPUs — it is a CPU-only application]

Three different workload sizes were used:
• Small (≈ 70 jobs)
• Medium (≈ 170 jobs)
• Large (≈ 330 jobs)
Cluster performance with rCUDA+SLURM

Let's reduce the amount of GPUs in the cluster.

The time that GPUs are allocated is increased.
Increasing throughput in current clusters

• The problem with current GPU-enabled clusters
• The enabler for increased cluster throughput
• Engineering the enabler
• Some performance numbers
• Further considerations
rCUDA is the enabling technology for …

• High Throughput Computing
  • Sharing remote GPUs makes applications execute slower … BUT more throughput (jobs/time) is achieved
• Green Computing
  • GPU migration and application migration allow devoting just the required computing resources to the current load
• More flexible system upgrades
  • GPU and CPU updates become independent from each other. Attaching GPU boxes to non-GPU-enabled clusters is possible
• Datacenter administrators can choose between HPC and HTC

Get a free copy of rCUDA at http://www.rcuda.net
@rcuda_r