
CUDA in the Cloud – Enabling HPC Workloads in OpenStack

John Paul Walters, Computer Scientist, USC Information Sciences Institute
[email protected]

With special thanks to Andrew Younge (Indiana Univ.) and Massimo Bernaschi (IAC-CNR)


Outline

§  Introduction
§  OpenStack Background
§  Supporting Heterogeneity
§  Supporting GPUs
§  Performance Results
§  Current Status
§  Future Work


Introduction

§  Cloud computing is traditionally seen as a resource for IT
–  Web servers, databases

§  More recently, researchers have begun to leverage the public cloud as an HPC resource
–  An AWS virtual cluster ranks 102nd on the Top500 list

§  Major difference between HPC and IT in the cloud:
–  Types of resources, heterogeneity

§  Our contribution: we're developing heterogeneous HPC extensions for the OpenStack cloud computing platform

Why A Heterogeneous Cloud?

§  Typical cloud infrastructures focus on commodity hardware
–  Choose number of CPUs, memory, and disk
–  Often not a good match for technical computing

§  Advantages of heterogeneity
–  Leverage GPUs and other accelerators
–  Power efficiency, performance
–  Process technical workloads

§  Move away from batch/grid schedulers
–  Give users access to customizable instances
–  Dedicated virtual machines

Benefits of Heterogeneity

[Figures: SGI UV100 rendering 1926 objects (CPU: 10^8 samples, GPU: 10^10 samples); Tilera vs. x86 video transcoding (136.2 seconds vs. 139.5 seconds)]

Each architecture performs well on different applications.


OpenStack Background

§  OpenStack founded by Rackspace and NASA
§  In use by Rackspace, HP, and others for their public clouds
§  Open source with hundreds of participating companies
§  In use for both public and private clouds
§  Current stable release: OpenStack Folsom
–  OpenStack Grizzly to be released in April

[Figure: Google Trends searches for common open source IaaS projects: openstack, cloudstack, opennebula, eucalyptus cloud]


OpenStack Architecture

[Figure: OpenStack Folsom architecture diagram. Image source: Ken Pepple (http://ken.pepple.info/openstack/2012/09/25/openstack-folsom-architecture/)]

Supporting Heterogeneity

•  Current cloud services have limited resource options
–  m1.large = 7.5 GB RAM, 2 virtual cores, 850 GB disk
–  x86 and x86_64 only

•  Users should be able to customize instance types to their needs (see the sketch below)
–  E.g., 16 GB RAM, 4 virtual cores, SSE4.2, 2x Kepler GPUs
–  Hypervisor type
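To make the idea concrete, here is a minimal sketch (not the project's actual code) of creating a custom instance type with python-novaclient and tagging it with hardware requirements. The credentials, endpoint, flavor name, and extra_spec keys are illustrative placeholders that an architecture-aware scheduler could match against hosts.

```python
# Sketch: create a customizable instance type (flavor) and tag it with
# hardware requirements via python-novaclient. Credentials, endpoint, flavor
# name, and extra_spec keys are illustrative placeholders.
from novaclient import client

nova = client.Client("2", "admin", "secret", "demo",
                     "http://controller:5000/v2.0")

# Custom instance type: 16 GB RAM, 4 virtual cores, 40 GB disk
flavor = nova.flavors.create(name="cg1.custom", ram=16384, vcpus=4, disk=40)

# Extra specs a scheduler could use to place the instance on a matching host
flavor.set_keys({
    "cpu_arch": "x86_64",      # target ISA
    "cpu_features": "sse4.2",  # required CPU feature
    "gpus": "2",               # number of GPUs to expose
    "gpu_arch": "kepler",      # GPU generation
})
```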


Supporting GPUs

§  We're pursuing two approaches for GPU support in OpenStack
–  LXC support for container-based VMs
•  Good for non-virtualization-friendly GPUs
•  Near-native performance
•  Can't run non-Linux guests
–  Xen support for fully virtualized guests
•  Can run Windows and other non-Linux guests
•  Some overhead, especially for PCIe transfers (see the passthrough sketch below)

§  Our OpenStack work currently supports GPU-enabled LXC containers
–  Xen support is in development; preliminary results are shown today
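For the fully virtualized path, the underlying mechanism is PCI passthrough of the GPU into the guest. Below is a minimal sketch of that general mechanism using the libvirt Python bindings; the connection URI, guest name, and PCI address are illustrative, and the project's actual Xen integration may take a different route.

```python
# Sketch of GPU PCI passthrough via the libvirt Python bindings. The URI,
# guest name, and PCI address are illustrative placeholders; this shows the
# general hostdev mechanism, not the project's exact Xen code path.
import libvirt

GPU_HOSTDEV_XML = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x0f' slot='0x00' function='0x0'/>
  </source>
</hostdev>
"""

conn = libvirt.open("xen:///")        # or "qemu:///system" for KVM
dom = conn.lookupByName("gpu-guest")  # illustrative guest name
dom.attachDevice(GPU_HOSTDEV_XML)     # GPU appears as a PCI device in the guest
conn.close()
```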


Integrating GPU Support into OpenStack

§  Our GPU work is based on OpenStack's libvirt module
–  Libvirt supports KVM/QEMU and LXC
–  Libvirt-based Xen support is experimental

§  After instances boot, the LXC containers' GPUs are whitelisted and their major/minor device IDs are passed into the VM (see the sketch below)

§  GPU instances are launched just like any other:
euca-run-instances -k <my key> -t cg1.small <machine image>
–  cg1 instance types represent GPUs
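A minimal sketch of the whitelisting step, assuming a cgroup v1 devices controller: NVIDIA GPUs use character-device major 195 (/dev/nvidia0..N, with /dev/nvidiactl at minor 255), and a container gains access when "c major:minor rwm" rules are written to its devices.allow file. The cgroup path below is an illustrative placeholder, not the path OpenStack actually uses.

```python
# Sketch: whitelist NVIDIA device nodes for an LXC container by writing
# "c <major>:<minor> rwm" rules to its devices cgroup (cgroup v1).
import os

CGROUP = "/sys/fs/cgroup/devices/lxc/instance-00000001"  # illustrative path
NVIDIA_MAJOR = 195  # major number for NVIDIA character devices

def allow_device(major, minor):
    """Grant the container read/write/mknod access to one device node."""
    with open(os.path.join(CGROUP, "devices.allow"), "w") as f:
        f.write("c %d:%d rwm" % (major, minor))

allow_device(NVIDIA_MAJOR, 255)  # /dev/nvidiactl
for gpu in range(4):             # /dev/nvidia0 .. /dev/nvidia3 (Tesla S2050)
    allow_device(NVIDIA_MAJOR, gpu)
```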


Test Results: LXC and Xen

§  3 data sets (all using CUDA 5)
–  NVIDIA CUDA samples
–  SHOC multi-GPU
–  BFS/Graph500 multi-GPU

Hardware

LXC tests
§  2x Xeon E5520
§  96 GB RAM
§  1x Tesla S2050 (4 GPUs)
§  RHEL 6.1
–  2.6.38.2 kernel (non-stock)

Xen tests
§  2x Xeon X5660
§  192 GB RAM
§  2x NVIDIA Tesla C2075
§  CentOS 6.3 (with Xen 4.1.2)
–  2.6.32-279 kernel (guest, stock)


LXC CUDA 5 Samples

[Figure: LXC CUDA samples, performance relative to base (higher is better)]


LXC Bandwidth Results

[Figures: LXC vs. base, host-to-device and device-to-host bandwidth (pinned); bandwidth in MB/s vs. data size in KB]

LXC mirrors base bandwidth
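For reference, the pinned host-to-device bandwidth compared above can be measured along the lines of NVIDIA's bandwidthTest sample. The PyCUDA sketch below is illustrative; the transfer size and iteration count are arbitrary choices, not the settings used for these results.

```python
# Minimal PyCUDA sketch of a pinned host-to-device bandwidth measurement,
# similar in spirit to NVIDIA's bandwidthTest sample.
import time
import numpy as np
import pycuda.driver as cuda

cuda.init()
ctx = cuda.Device(0).make_context()
try:
    nbytes = 32 * 1024 * 1024                           # 32 MB per transfer
    host_buf = cuda.pagelocked_empty(nbytes, np.uint8)  # pinned host memory
    dev_buf = cuda.mem_alloc(nbytes)                    # device buffer

    cuda.memcpy_htod(dev_buf, host_buf)                 # warm-up copy
    iters = 50
    start = time.time()
    for _ in range(iters):
        cuda.memcpy_htod(dev_buf, host_buf)
    ctx.synchronize()
    elapsed = time.time() - start

    print("Host-to-device (pinned): %.1f MB/s" % (iters * nbytes / elapsed / 1e6))
finally:
    ctx.pop()
```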


LXC SHOC Results

[Figure: LXC SHOC benchmarks, 1 node, 4 GPUs; performance relative to base (higher is better)]


LXC Graph500 Results

[Figure: BFS/Graph500, 2 and 4 GPUs, 1 node; median TEPS vs. problem size (16 to 22); series: LXC 2 GPUs, Base 2 GPUs, LXC 4 GPUs, Base 4 GPUs]

TEPS: Traversed Edges per Second
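As a reminder of what the metric means: a Graph500 problem of scale S has 2^S vertices and roughly 16 x 2^S edges (edge factor 16), and TEPS for one BFS is the number of edges traversed divided by the BFS time; the plotted value is the median over many BFS runs. A small sketch with made-up timings:

```python
# Sketch of the TEPS metric: traversed edges / BFS time, reported as the
# median over many BFS runs. Scale and timings below are made-up examples.
from statistics import median

def teps(edges_traversed, seconds):
    return edges_traversed / seconds

scale = 20
edges = 16 * (2 ** scale)              # Graph500 edge factor of 16
bfs_times = [0.52, 0.49, 0.61, 0.55]   # seconds per BFS (illustrative)

print("median TEPS: %.2e" % median(teps(edges, t) for t in bfs_times))
```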


Xen CUDA 5 Samples

[Figure: Xen CUDA samples, performance relative to base (higher is better)]


Xen Bandwidth Results

[Figures: Xen vs. base, host-to-device and device-to-host bandwidth (pinned); bandwidth in MB/s vs. data size in KB]


Xen SHOC Benchmarks

[Figure: Xen SHOC benchmarks, 1 node, 2 GPUs; performance relative to base (higher is better)]


Xen Graph500 Results

[Figure: BFS/Graph500, 2 GPUs, 1 node; median TEPS vs. problem size (16 to 22); series: Xen VM, Base]

TEPS: Traversed Edges per Second


Current Status

§  Source code is available now
–  https://github.com/usc-isi/nova

§  Includes support for heterogeneity
–  GPU-enabled LXC instances
–  Bare-metal provisioning
–  Architecture-aware scheduler

§  We're working towards integrating additional support into the OpenStack Grizzly and H releases


Future Work

§  Add support for high-speed networking as a resource
–  InfiniBand via SR-IOV
–  Enable features like GPUDirect

§  Add support for the Kepler architecture

§  Merge the Xen support work into upstream OpenStack