Deep Learning Workloads with NVIDIA GPUs on OpenShift. 28 October, 2019. Mayur Shetty, Senior Solutions Architect, Red Hat; Mehnaz Mahbub, Cluster Systems Engineer, Supermicro Inc.


Deep Learning Workloads with NVIDIA GPUs on OpenShift

28 October, 2019

Mayur Shetty, Senior Solutions Architect, Red Hat

Mehnaz Mahbub, Cluster Systems Engineer, Supermicro Inc.


Agenda


● ML Pipeline and Key Personas
● Why Containers & Kubernetes in Hybrid Cloud for AI/ML workloads?
● Why OpenShift and Hybrid Cloud for ML workloads
● How to use GPUs with OpenShift
● Solution building blocks
● Cluster overview / network topology
● Benchmark Suite
● Benchmark Results


ML Pipeline & Key Personas

[Figure: ML pipeline and key personas]

Inputs: Business Objectives; Data

Pipeline stages:
● Data Acquisition & Preparation
● ML Modelling (Selection, Training, Testing)
● ML Model Deployment in App. Dev. Process

Personas:
● Business Leadership
● Data Engineer
● Data Scientists
● App Developer
● IT Operations

Outcome: Intelligent applications to achieve business outcomes


Why Containers & Kubernetes in Hybrid Cloud for AI/ML workloads?


1. Agility across the ML pipeline
● Automated install and provisioning
● Autoscaling
● GPU acceleration, scaling, security, uptime

2. Portability & flexibility for ML powered apps
● Develop/deploy ML apps across data center, edge, and public clouds
● Offer ML-as-a-service

3. Red Hat products & services help solve additional challenges
● Automation, CI/CD drive collaboration
● Boost productivity
● Data access, prep, & governance
● Apps lifecycle management & operations


Why OpenShift And Hybrid Platforms for ML Workloads?


[Figure: OpenShift architecture for ML workloads]

Diagram components:
● Existing automation toolsets: SCM (Git), CI/CD
● Service layer, persistent storage, registry
● Master: API/authentication, data store, scheduler, health/scaling
● Red Hat Enterprise Linux (RHEL) nodes running containers (C)
● Infrastructure: physical, virtual, private, public, hybrid

Callouts:
● Data scientist
● ML deployed across clouds, data center, and edge
● ML services, load balanced and scaled
● ML microservices scheduled and orchestrated on shared resources
● Best of SDLC
● ML in Production


GPU as a service on OpenShift


Enablement of GPUs in an OpenShift Cluster

Components:
● CUDA driver (or container)
● K8s device plugin for GPU
● GPU node_exporter for Prometheus
● Label: GPU
● CRI-O GPU runtime plugin

● Pre-reqs - Install NVIDIA driver for RHEL on the GPU host
● Add nvidia-container-runtime-hook and create hook file
● Run cuda-vector-add container to verify operation of driver and container enablement
● Configure OpenShift - Device Plugin API is enabled by default
● Label the nodes with GPU
● Next, deploy the NVIDIA Device Plugin
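
Once the device plugin is running, a pod can request GPUs through the `nvidia.com/gpu` extended resource. The following is a minimal sketch, not the presenters' actual manifest: it builds a verification Pod as a Python dictionary and prints it as JSON. The image reference and the node label key are placeholders chosen for illustration; substitute the cuda-vector-add image and the GPU label used in your cluster.

```python
import json


def gpu_test_pod(name, image, gpu_count=1, node_label=None):
    """Build a Pod manifest that requests NVIDIA GPUs from the scheduler."""
    spec = {
        "restartPolicy": "OnFailure",
        "containers": [
            {
                "name": name,
                "image": image,
                # Extended resource advertised by the NVIDIA device plugin.
                "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
            }
        ],
    }
    if node_label:
        # Restrict scheduling to nodes that carry the GPU label.
        spec["nodeSelector"] = dict(node_label)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": spec,
    }


pod = gpu_test_pod(
    "cuda-vector-add",
    "registry.example.com/cuda-vector-add:latest",  # placeholder image (assumption)
    node_label={"example.com/gpu": "true"},  # hypothetical label key (assumption)
)
print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and submitted with `oc create -f <file>`. If the pod stays in Pending, the node is probably not advertising `nvidia.com/gpu`, or the label does not match.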


Deploying GPU Workloads onto OpenShift


Pod Deployment

Job
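
The slide's manifests did not survive extraction. As a rough illustration of the Job case, here is a sketch under stated assumptions: the job name, image, and command are hypothetical, and only the `batch/v1` Job structure and the `nvidia.com/gpu` resource name are standard. A Job suits training runs because it runs to completion rather than restarting indefinitely.

```python
import json


def gpu_job(name, image, command, gpu_count=1, backoff_limit=0):
    """Build a batch/v1 Job manifest whose single container requests GPUs."""
    container = {
        "name": name,
        "image": image,
        "command": list(command),
        # GPUs are requested as limits on the device plugin's resource.
        "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Do not retry a failed training run automatically.
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            },
        },
    }


job = gpu_job(
    "training-run",  # hypothetical job name
    "quay.io/example-org/benchmark:latest",  # placeholder image (assumption)
    ["/bin/sh", "-c", "./run_benchmark.sh"],  # hypothetical entry point
    gpu_count=8,
)
print(json.dumps(job, indent=2))
```

Requesting 8 GPUs matches the single 8-GPU node described later in the deck; a smaller count leaves the remaining GPUs free for other pods.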


Preparing OpenShift for GPU benchmark workloads


● Containerize each of the MLPerf Training v0.6 benchmarks
  ○ Create a Dockerfile for the model with the MLCC tool from Red Hat
    ■ Add statements to the Dockerfile to build NVIDIA PyTorch from source
    ■ Add commands to run each of the MLPerf Training benchmark scripts

● Create a container image for each benchmark

● Push the image to Quay.io

● Deploy the MLPerf Training benchmark, which requires a GPU
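
The build-and-push steps above can be sketched as a small helper that assembles the command lines for each benchmark image. This is an illustration only: it does not model the MLCC tool, the Quay.io organization name is a placeholder, and the choice of `podman` (rather than `docker`) is an assumption. The script prints the commands instead of running them.

```python
import shlex


def image_ref(namespace, benchmark, tag="latest", registry="quay.io"):
    """Return a full image reference such as quay.io/<namespace>/<benchmark>:<tag>."""
    return f"{registry}/{namespace}/{benchmark}:{tag}"


def build_and_push_commands(benchmark, namespace, tag="latest",
                            context_dir=".", tool="podman"):
    """Return the build and push command lines for one benchmark image."""
    ref = image_ref(namespace, benchmark, tag)
    build = [tool, "build", "-t", ref,
             "-f", f"{context_dir}/Dockerfile", context_dir]
    push = [tool, "push", ref]
    return [build, push]


# Benchmark names taken from the translation slide; "example-org" is hypothetical.
for benchmark in ("gnmt", "transformer"):
    for cmd in build_and_push_commands(benchmark, "example-org"):
        print(shlex.join(cmd))
```

Keeping one image per benchmark, as the slide describes, means each Job can reference its own image tag and be rebuilt independently.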


Deep Learning Benchmarks on Red Hat OpenShift using Supermicro SuperServers


Solution Reference Architecture


Software Stack Details


Solution Building Blocks


Hardware Setup


Ten-Node Cluster Overview
● 3 Master Nodes
● 3 Infra Nodes
● 1 Bastion/LB node
● 3 Application nodes
  - Includes a GPU node with 8 × NVIDIA® Tesla® V100 SXM2 GPUs

Network Topology


About MLPerf and Datasets


MLPerf: https://mlperf.org/
COCO: http://cocodataset.org/#home
WMT: http://www.statmt.org/wmt14/translation-task.html

Object Detection

Machine Translation


Benchmarking: Object Detection


Software
• RHEL 7.6
• OpenShift 3.11
• PyTorch 19.05
• CUDA 10.0, CUDA 9.2
• Python 3.

MLPerf Training v0.6 Results


Benchmarking: Machine Translation

Recurrent & Non-Recurrent Translation Using GNMT & Transformer


MLPerf Training v0.6 Results


OpenShift GUI from the Project


Project Outcomes & Result Evaluation


Result Validation & Significance:

● First ever MLPerf Benchmark of Red Hat OpenShift

● Deep learning workloads running on OpenShift match, if not exceed, bare-metal performance

● Hardware advantage: customers get the same training performance at a much lower cost (better performance per dollar)

➔ GitLab: https://gitlab.com/opendatahub/gpu-performance-benchmarks
➔ Whitepaper: https://www.redhat.com/en/resources/supermicro-deep-learning-openshift-reference-architecture
➔ Supermicro OpenShift Solution: https://www.supermicro.com/en/solutions/red-hat-openshift


linkedin.com/company/red-hat

youtube.com/user/RedHatVideos

facebook.com/redhatinc

twitter.com/RedHat

Thank You
