Increasing the throughput of your GPU-enabled cluster with rCUDA
Federico Silla, Technical University of Valencia, Spain


TRANSCRIPT

Page 1: Increasing the throughput of your GPU-enabled cluster with

Increasing the throughput of your GPU-enabled cluster with rCUDA

Federico Silla
Technical University of Valencia, Spain

Page 2: Increasing the throughput of your GPU-enabled cluster with

HPC Advisory Council Spain Conference 2014, Santander 2/45

The scope of this talk

Page 3: Increasing the throughput of your GPU-enabled cluster with

Proposed cluster performance

• Performance numbers collected from real executions in an 8-node cluster*, each node with one NVIDIA K20 GPU
• Workload composed of a mix of the LAMMPS, GROMACS, GPU-Blast, and MCUDA-MEME applications

*Node characteristics: two Intel Xeon E5-2620 v2 sockets (6 cores at 2.1 GHz), 32 GB DDR3 RAM, and one NVIDIA Tesla K20 GPU.

Page 4: Increasing the throughput of your GPU-enabled cluster with


The problem with current GPU-enabled clusters

The enabler for increased cluster throughput

Engineering the enabler

Some performance numbers

Final considerations

Increasing throughput in current clusters

Page 5: Increasing the throughput of your GPU-enabled cluster with


The problem with current GPU-enabled clusters

The enabler for increased cluster throughput

Engineering the enabler

Some performance numbers

Final considerations

Increasing throughput in current clusters

Page 6: Increasing the throughput of your GPU-enabled cluster with

A GPU computing facility is usually a set of independent, self-contained nodes that leverage the shared-nothing approach:

• Nothing is directly shared among nodes; MPI is required for aggregating computing resources within the cluster (a minimal sketch follows this slide)
• GPUs can only be used within the node they are attached to

[Figure: six cluster nodes connected by an interconnection network; each node has a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory]

Characteristics of GPU-based clusters
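To make the shared-nothing point concrete, here is a minimal MPI + CUDA sketch (illustrative only, not taken from the talk): each rank can enumerate and use only the GPUs of the node it runs on, and reaching more GPUs means adding ranks on more nodes and exchanging data through MPI.

    // shared_nothing.c — minimal MPI+CUDA sketch (illustrative, not from the talk).
    // Each MPI rank only sees the GPUs physically installed in its own node.
    #include <mpi.h>
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int local_gpus = 0;
        cudaGetDeviceCount(&local_gpus);   // counts only the GPUs attached to this node
        printf("rank %d sees %d local GPU(s)\n", rank, local_gpus);

        // Using GPUs in other nodes is only possible indirectly, by exchanging
        // data with the MPI ranks running on those nodes.
        MPI_Finalize();
        return 0;
    }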

Page 7: Increasing the throughput of your GPU-enabled cluster with

• Applications can only use the GPUs located within their node:
  • Non-accelerated applications keep GPUs idle in the nodes where they use all the cores

First concern with accelerated clusters

[Figure: six-node cluster diagram; a CPU-only application occupies all the cores of four of the nodes]

A CPU-only application spreading over these four nodes would make their GPUs unavailable for accelerated applications.

Page 8: Increasing the throughput of your GPU-enabled cluster with

For many workloads, GPUs may be idle for significant periods of time:

• Initial acquisition costs are not amortized
• Space: GPUs reduce CPU density
• Energy: idle GPUs keep consuming power

Money leakage in current clusters?

[Figure: power consumption (Watts) over time (s) of an idle 1-GPU node and an idle 4-GPU node; the plot is annotated with "25%"]

• 1-GPU node: two E5-2620 v2 sockets and 32 GB DDR3 RAM, one Tesla K20 GPU
• 4-GPU node: two E5-2620 v2 sockets and 128 GB DDR3 RAM, four Tesla K20 GPUs

Page 9: Increasing the throughput of your GPU-enabled cluster with

• Applications can only use the GPUs located within their node:
  • Multi-GPU applications running on a subset of nodes cannot make use of the tremendous GPU resources available at other cluster nodes (even if they are idle)

Second concern with accelerated clusters

[Figure: six-node cluster diagram; an MPI multi-GPU application runs on a subset of the nodes and cannot reach the GPUs of the remaining nodes]

All these GPUs cannot be used by the MPI multi-GPU application in execution.

Page 10: Increasing the throughput of your GPU-enabled cluster with

• Do applications completely squeeze the GPUs present in the cluster?
• Even if all GPUs are assigned to running applications, the computational resources inside the GPUs may not be fully used:
  • the application presents a low level of parallelism
  • CPU code is being executed
  • GPU cores stall due to lack of data
  • etc.

One more concern with accelerated clusters

[Figure: six-node cluster diagram, as in the previous slides]

Page 11: Increasing the throughput of your GPU-enabled cluster with

In summary …
• There are scenarios where GPUs are available but cannot be used
• Accelerated applications do not make use of GPUs 100% of the time

In conclusion …
• We are losing GPU cycles, thus reducing cluster performance

Why is GPU-cluster performance lost?

Page 12: Increasing the throughput of your GPU-enabled cluster with

What is missing is … some flexibility for using the GPUs in the cluster

We need something more in the cluster

The current model for using GPUs is too rigid

Page 13: Increasing the throughput of your GPU-enabled cluster with


The problem with current GPU-enabled clusters

The enabler for increased cluster throughput

Engineering the enabler

Some performance numbers

Further considerations

Increasing throughput in current clusters

Page 14: Increasing the throughput of your GPU-enabled cluster with

• Two ingredients are required to cook a higher-throughput GPU-based cluster:
  • A way of seamlessly sharing GPUs across nodes in the cluster (remote GPU virtualization)
  • Enhanced job schedulers that take the new shared GPUs into account

What is needed for increased flexibility?

Page 15: Increasing the throughput of your GPU-enabled cluster with

Remote GPU virtualization allows a new vision of GPU deployment, moving from the usual cluster configuration:

The remote GPU virtualization vision

[Figure: the usual cluster configuration, with GPUs attached to every node through PCIe]

… to the following one:

Page 16: Increasing the throughput of your GPU-enabled cluster with

The remote GPU virtualization vision

[Figure: physical configuration vs. logical configuration — with remote GPU virtualization, the GPUs are logically decoupled from the nodes, and any node can reach any GPU through logical connections over the interconnection network]

Page 17: Increasing the throughput of your GPU-enabled cluster with

Busy cores are no longer a problem

[Figure: physical vs. logical configuration; because GPUs are reached through logical connections, the GPUs of nodes whose CPU cores are busy remain available to applications running elsewhere]

Page 18: Increasing the throughput of your GPU-enabled cluster with

GPU virtualization is also useful for multi-GPU applications

[Figure: without GPU virtualization, only the GPUs in the node can be provided to the application; with GPU virtualization, many GPUs in the cluster can be provided to the application]

Multi-GPU applications benefit
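From the application's point of view the difference is simply in how many devices CUDA reports. The sketch below (illustrative, not from the slides) is an unmodified CUDA program; run on top of rCUDA, the device count reflects the remote GPUs granted to the job, which can exceed the number of GPUs physically installed in the node.

    // enumerate_gpus.cu — illustrative sketch (not from the talk).
    // The same unmodified code runs with plain CUDA or on top of rCUDA; with
    // rCUDA the reported devices are the remote GPUs assigned to this job.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s, %zu MB\n", i, prop.name,
                   prop.totalGlobalMem / (1024 * 1024));
        }
        return 0;
    }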

Page 19: Increasing the throughput of your GPU-enabled cluster with

[Figure: influence of data transfers for SGEMM — percentage of time devoted to data transfers vs. matrix size, for pinned and non-pinned host memory]

The main GPU virtualization drawback is the increased latency and reduced bandwidth to the remote GPU.

Problem with remote GPU virtualization

Note: the data above come from a matrix-matrix multiplication using a local GPU!!!
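Since the plot above contrasts pinned and non-pinned host memory, the following sketch (illustrative, not from the slides) shows the only difference at the API level: allocating the host buffer with cudaMallocHost() instead of malloc(). Pinned buffers can be transferred by DMA and usually reach higher bandwidth, and transfer bandwidth is exactly the resource under pressure once the GPU is remote.

    // transfer_sketch.cu — pinned vs. pageable host buffers (illustrative).
    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void) {
        const size_t n = 4096;                       // e.g. one 4096x4096 SGEMM operand
        const size_t bytes = n * n * sizeof(float);

        float *pageable = (float *)malloc(bytes);    // ordinary (non-pinned) memory
        float *pinned   = NULL;
        cudaMallocHost((void **)&pinned, bytes);     // page-locked (pinned) memory

        float *d_buf = NULL;
        cudaMalloc((void **)&d_buf, bytes);

        // Same API call for both buffers; the pinned copy can be DMA'd directly
        // and is typically faster. With rCUDA, every such copy also crosses the
        // network, which is why transfer bandwidth is the main overhead source.
        cudaMemcpy(d_buf, pageable, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_buf, pinned,   bytes, cudaMemcpyHostToDevice);

        cudaFree(d_buf);
        cudaFreeHost(pinned);
        free(pageable);
        return 0;
    }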

Page 20: Increasing the throughput of your GPU-enabled cluster with

Current job schedulers, like SLURM, know about real GPUs, but cannot manage virtual GPUs.

Enhancing schedulers is required to effectively take advantage of GPU virtualization.

About the second ingredient

Page 21: Increasing the throughput of your GPU-enabled cluster with

One step further: enhancing the scheduling process so that GPU servers are put into low-power sleep modes as soon as their acceleration features are not required.

More about enhanced GPU scheduling

Page 22: Increasing the throughput of your GPU-enabled cluster with


The problem with current GPU-enabled clusters

The enabler for increased cluster throughput

Engineering the enabler (I)

Some performance numbers

Further considerations

Increasing throughput in current clusters

Page 23: Increasing the throughput of your GPU-enabled cluster with

Several efforts have been made regarding GPU virtualization during the last years:

• rCUDA (CUDA 6.0)
• GVirtuS (CUDA 3.2)
• DS-CUDA (CUDA 4.1)
• vCUDA (CUDA 1.1)
• GViM (CUDA 1.1)
• GridCUDA (CUDA 2.3)
• V-GPU (CUDA 4.0)

Remote GPU virtualization frameworks

Publicly available

NOT publicly available

Page 24: Increasing the throughput of your GPU-enabled cluster with


Basics of the rCUDA framework

Basic CUDA behavior

Page 25: Increasing the throughput of your GPU-enabled cluster with


Basics of the rCUDA framework
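rCUDA splits execution between a client library, which stands in for the CUDA runtime on the node running the application, and a server daemon on the node that owns the GPU; each CUDA call is forwarded over the network and executed remotely. As a hedged illustration (not code from the talk), an ordinary CUDA program such as the one below needs no source changes to run this way.

    // vector_scale.cu — an ordinary CUDA program (illustrative sketch).
    // Under rCUDA, each runtime call below is intercepted by the client
    // library and executed by the rCUDA server on the remote GPU.
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *v, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] *= factor;
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d = NULL;
        cudaMalloc((void **)&d, bytes);                      // forwarded to the server
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // data travels over the network
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);         // kernel runs on the remote GPU
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

        printf("h[0] = %f\n", h[0]);                         // prints 2.0
        cudaFree(d);
        free(h);
        return 0;
    }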

Page 26: Increasing the throughput of your GPU-enabled cluster with

How to declare remote GPUs

Environment variables are initialized on the client side and used by the rCUDA client, transparently to the application. They specify:

• the amount of GPUs exposed to applications
• for each GPU, the server name/IP address : GPU
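As a concrete sketch of the idea, the snippet below reads the kind of variables the slide refers to. The RCUDA_DEVICE_COUNT / RCUDA_DEVICE_j names follow the convention used in the rCUDA documentation, and the server names are made up, so treat the details as assumptions rather than as the exact interface.

    // check_rcuda_env.c — illustrative sketch; the RCUDA_DEVICE_* names follow
    // the rCUDA documentation convention and are an assumption here, e.g.:
    //   RCUDA_DEVICE_COUNT=2            (amount of GPUs exposed to the application)
    //   RCUDA_DEVICE_0=gpuserver1:0     (server name/IP address : GPU)
    //   RCUDA_DEVICE_1=gpuserver2:0
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const char *count = getenv("RCUDA_DEVICE_COUNT");
        int n = count ? atoi(count) : 0;
        printf("remote GPUs declared: %d\n", n);

        for (int i = 0; i < n; ++i) {
            char name[32];
            snprintf(name, sizeof(name), "RCUDA_DEVICE_%d", i);
            const char *where = getenv(name);
            printf("  %s = %s\n", name, where ? where : "(unset)");
        }
        return 0;
    }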

Page 27: Increasing the throughput of your GPU-enabled cluster with


rCUDA presents a modular architecture

Page 28: Increasing the throughput of your GPU-enabled cluster with

Test system:

• Dual Intel Xeon E5-2620 v2 (6 cores) at 2.1 GHz (Ivy Bridge)
• NVIDIA Tesla K20 GPU
• Mellanox ConnectX-3 single-port InfiniBand adapter (FDR)
• Mellanox SX6025 switch
• Cisco SLM2014 switch (1 Gbps Ethernet)
• CentOS 6.3 + Mellanox OFED 1.5.3

Performance of the rCUDA framework

Page 29: Increasing the throughput of your GPU-enabled cluster with

CUDASW++: bioinformatics software for Smith-Waterman protein database searches

[Figure: execution time (s) and rCUDA overhead (%) as a function of sequence length, comparing CUDA on a local GPU with rCUDA over FDR InfiniBand, QDR InfiniBand, and Gigabit Ethernet]

Single-GPU applications: small overhead

Page 30: Increasing the throughput of your GPU-enabled cluster with

MonteCarlo Multi-GPU (from the NVIDIA SDK)

Multi-GPU applications

[Figure: performance results for the MonteCarlo Multi-GPU benchmark; one plotted metric where lower is better and one where higher is better]

Page 31: Increasing the throughput of your GPU-enabled cluster with

CUDA-MEME application:

• NVIDIA Tesla K40 GPU
• Mellanox ConnectX-3 single-port (FDR) and Connect-IB adapters

Connect-IB and applications

[Figure: CUDA-MEME execution time (lower is better); the plot is annotated with "0.19%"]

Page 32: Increasing the throughput of your GPU-enabled cluster with


The problem with current GPU-enabled clusters

The enabler for increased cluster throughput

Engineering the enabler (II)

Some performance numbers

Further considerations

Increasing throughput in current clusters

Page 33: Increasing the throughput of your GPU-enabled cluster with


• SLURM does not know about virtualized GPUs

• SLURM must be enhanced in order to manage the new virtualized GPUs

Integrating rCUDA with SLURM

Page 34: Increasing the throughput of your GPU-enabled cluster with


The basic idea about SLURM

Page 35: Increasing the throughput of your GPU-enabled cluster with

GPUs are decoupled from nodes

The basic idea about SLURM + rCUDA

All jobs are executed in less time

Page 36: Increasing the throughput of your GPU-enabled cluster with

GPUs are decoupled from nodes

All jobs are executed in even less time

Sharing remote GPUs among jobs

GPU 0 is scheduled to be shared among jobs

Page 37: Increasing the throughput of your GPU-enabled cluster with


The problem with current GPU-enabled clusters

The enabler for increased cluster throughput

Engineering the enabler

Some performance numbers

Further considerations

Increasing throughput in current clusters

Page 38: Increasing the throughput of your GPU-enabled cluster with

Test bench for analyzing SLURM+rCUDA performance:

• InfiniBand ConnectX-3 based cluster
• CentOS 6.4 Linux
• Dual-socket Intel Xeon E5-2620 v2 nodes:
  • 1 node without GPU
  • 8 nodes with one NVIDIA K20 GPU each

SLURM+rCUDA test bench description

1 node hosting the main SLURM controller; 8 nodes with one K20 GPU each

Page 39: Increasing the throughput of your GPU-enabled cluster with

Applications for testing SLURM+rCUDA

Configuration for each of the applications:
• In our tests, GROMACS does not use GPUs; it is a CPU-only application

Three different workload sizes were used:
• Small (≈ 70 jobs)
• Medium (≈ 170 jobs)
• Large (≈ 330 jobs)

Page 40: Increasing the throughput of your GPU-enabled cluster with


Cluster performance with rCUDA+SLURM

Page 41: Increasing the throughput of your GPU-enabled cluster with

Cluster performance with rCUDA+SLURM

Let's reduce the number of GPUs in the cluster

Page 42: Increasing the throughput of your GPU-enabled cluster with


Cluster performance with rCUDA+SLURM

The time that GPUs are allocated is increased

Page 43: Increasing the throughput of your GPU-enabled cluster with


The problem with current GPU-enabled clusters

The enabler for increased cluster throughput

Engineering the enabler

Some performance numbers

Further considerations

Increasing throughput in current clusters

Page 44: Increasing the throughput of your GPU-enabled cluster with

• High Throughput Computing
  • Sharing remote GPUs makes applications execute more slowly … BUT more throughput (jobs/time) is achieved
• Green Computing
  • GPU migration and application migration make it possible to devote just the required computing resources to the current load
• More flexible system upgrades
  • GPU and CPU updates become independent from each other. Attaching GPU boxes to non-GPU-enabled clusters is possible
• Datacenter administrators can choose between HPC and HTC

rCUDA is the enabling technology for …

Page 45: Increasing the throughput of your GPU-enabled cluster with

Get a free copy of rCUDA at http://www.rcuda.net

@rcuda_r