Speeding up Deep Learning Services: When GPUs meet Container Clouds
Dr. Seetharami Seelam & Dr. Yubo Li, Research Staff Members, IBM Research
GPU Tech Conference: San Jose, CA – May 10, 2017


Page 1

Dr. Seetharami Seelam & Dr. Yubo Li
Research Staff Members
IBM Research
GPU Tech Conference: San Jose, CA – May 10, 2017

Speeding up Deep Learning Services: When GPUs meet Container Clouds

Page 2

Outline

• Who are we
• Why should you listen to us
• What problems are we trying to solve
• Challenges with delivering DL on Cloud
• What have we done in Mesos and Kubernetes
• What is left to do
• How can you help

Page 3

Who are we

Dr. Seetharami Seelam is a Research Staff Member at the T. J. Watson Research Center. He is an expert at delivering hardware, middleware, and applications as-a-service using containers. He delivered Autoscaling, Business Rules, Containers on Bluemix, and multiple other services internally.

Dr. Yubo Li (李玉博) is a Research Staff Member at IBM Research, China. He is the architect of the GPU acceleration and deep learning service on SuperVessel, an open-access cloud running OpenStack on OpenPOWER machines. He is currently working on GPU support for several cloud container technologies, including Mesos, Kubernetes, Marathon, and OpenStack.

Page 4

Why should you listen to us

• We have multiple years of experience developing, optimizing, and operating container clouds

• Heterogeneous HW (POWER and x86)

• Long running and batch jobs

• OpenStack, Docker, Mesos, Kubernetes

• Container clouds with Accelerators (GPUs)

Page 5

What problems are we trying to solve

• Enable Deep Learning in the Cloud
  • Need flexible access to hardware (GPUs)
  • Training times in hours, days, weeks, months
  • Long-running inferencing services
  • Support old, new, and emerging frameworks
  • Share hardware among multiple workloads and users

[Figure: example workloads: Speech, Vision]

Page 6

DL in the Cloud: State-of-the-art

• Historically, DL is on-prem infrastructure and SW stack: a high-performance environment
  • Bare-metal GPU systems (x86 and POWER), Ethernet, IB network connectivity, GPFS
  • Spectrum LSF, MPI and RDMA support, single SW stack
• Cloud frees researchers & developers from infrastructure & SW stack
  • All infrastructure from Cloud as services: GPUs, object store, NFS, SDN, etc.
  • Job submission with APIs: Torch, Caffe, TensorFlow, Theano
  • 24/7 service, elastic and resilient
  • Appropriate visibility and control

Page 7

Challenges with DL on Cloud

• Data, data, data, data, …
• Access to different hardware and accelerators (GPU, IB, …)
• Support for different application models
• Visibility and control of infrastructure
• Dev and Ops challenges with a 24/7 stateful service

Page 8

Journey started in 2016… promised to deliver DL on Cloud

• Excellent promise: go ahead and build a DL cloud service
• Container support for GPUs: minimal or non-existent
• The idea could have died on day 1, but failure is not an option…
• We chose containers with Mesos and Kubernetes to address some of these challenges
• Developed and operated Mesos- and Kubernetes-based GPU clouds for over a year
• What follows are lessons learned from this experience

Page 9

DL on Containers: DevOps challenges

• Multiple GPUs per node → multiple containers per node: need to maintain the GPU ↔ container mapping (GPU Allocator)

• Images need NVIDIA Drivers: makes them non-portable (Volume Manager)

• Cluster quickly becomes heterogeneous (K80, M60, P100…): need to be able to pick GPU type (GPU Discovery)

• Fragmentation of GPUs is a real problem (Priority placement)

• Like everything else, GPUs fail → must identify and remove unhealthy GPUs from scheduling (Liveness check)

• Visibility, control, and sharing (to be done)

Page 10

High-level view of GPU support in container clouds

[Diagram: Master Node and Worker Node interaction. The worker node reports its resources to the master (Resource report: GPU: 2, …); the master distributes a task with a resource request (Task Id: xxx, Resource Request: GPU: 1); the job executor on the worker allocates GPU #1 and inserts the driver volume into the container.]

Page 11

GPU Allocator

• Allocator handles GPU number/device mapping
• Isolator uses cgroup (Mesos) / docker (k8s) to control GPU access permission inside the container

/dev
├── nvidia0 (data interface for GPU0)
├── nvidia1 (data interface for GPU1)
├── nvidiactl (control interface)
├── nvidia-uvm (unified virtual memory)
└── nvidia-uvm-tools (UVM tools, optional)

• Allocate/Release GPU (a minimal allocation sketch follows below)

[Diagram: allocation flow. A task requests 1 GPU; the NVIDIA GPU Allocator, tracking GPU0: in use / GPU1: idle, allocates GPU1 and marks it in use; GPU1 is exposed to the container as /dev/nvidia1 together with /dev/nvidiactl and /dev/nvidia-uvm.]
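To make the allocator/isolator idea concrete, here is a minimal sketch in Python. The class name GpuAllocator and the use of Docker's --device flags are illustrative assumptions for exposition, not the actual Mesos or Kubernetes implementation; the real isolators achieve the same effect through the devices cgroup (Mesos) or the Docker API (Kubernetes), as noted above.

```python
import threading

class GpuAllocator:
    """Minimal sketch of per-node GPU bookkeeping (illustrative only)."""

    # Control devices every GPU container needs in addition to its data device(s).
    CONTROL_DEVICES = ["/dev/nvidiactl", "/dev/nvidia-uvm"]

    def __init__(self, num_gpus):
        self._lock = threading.Lock()
        self._in_use = {i: None for i in range(num_gpus)}  # gpu index -> task id

    def allocate(self, task_id, count=1):
        """Reserve `count` idle GPUs for a task and return their /dev paths."""
        with self._lock:
            idle = [i for i, owner in self._in_use.items() if owner is None]
            if len(idle) < count:
                raise RuntimeError("not enough idle GPUs")
            chosen = idle[:count]
            for i in chosen:
                self._in_use[i] = task_id
            return [f"/dev/nvidia{i}" for i in chosen]

    def release(self, task_id):
        """Return all GPUs held by a task to the idle pool."""
        with self._lock:
            for i, owner in self._in_use.items():
                if owner == task_id:
                    self._in_use[i] = None

    def docker_device_args(self, task_id, count=1):
        """Build `docker run` --device flags exposing only the allocated GPUs."""
        devices = self.allocate(task_id, count) + self.CONTROL_DEVICES
        return [arg for d in devices for arg in ("--device", f"{d}:{d}")]


if __name__ == "__main__":
    alloc = GpuAllocator(num_gpus=2)
    # e.g. ['--device', '/dev/nvidia0:/dev/nvidia0', '--device', '/dev/nvidiactl:/dev/nvidiactl', ...]
    print(alloc.docker_device_args(task_id="task-xxx", count=1))
```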

Page 12

GPU Driver Challenges: drivers in the container are an issue

[Diagram: two containers, each packaging an application (Caffe/TF/…), CUDA libraries, and NVIDIA base libraries v1, run on a host with NVIDIA-kernel-module v1. After the host is upgraded to NVIDIA-kernel-module (v2), a container still shipping NVIDIA base libraries (v1) stops working.]

Changes in the host driver require all containers to update their drivers! (Credit to Kevin Klues, Mesosphere)

Page 13

GPU Driver Challenges: NVIDIA-Docker solves it

[Diagram: without NVIDIA-Docker, a container shipping NVIDIA base libraries (v1) breaks on a host running NVIDIA-kernel-module (v2). With NVIDIA-Docker, the matching NVIDIA base libraries (v2) live on the host and are injected into the container as a volume, so the container carries only the application (Caffe/TF/…) and CUDA libraries.]

An application will not work if the NVIDIA library and kernel module versions do not match; volume injection keeps them in sync with the host.

Page 14

GPU Volume Manager

• Mimics the functionality of nvidia-docker-plugin
• Finds all standard NVIDIA libraries / binaries on the host and consolidates them into a single place:

/var/lib/mesos/volumes
└── nvidia_XXX.XX (version number)
    ├── bin
    ├── lib
    └── lib64

• Injects the volume read-only ("ro") into the container when needed: a GPU container whose image carries the label com.nvidia.volumes.needed = nvidia_driver gets the volume mounted at /usr/local/nvidia (a sketch follows below)
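As a flavor of what such a volume manager does, here is a minimal sketch in Python using the pynvml bindings. The prefix list, copy logic, and directory layout are simplified stand-ins for what nvidia-docker-plugin actually does (which also consolidates binaries such as nvidia-smi and preserves symlink structure); they are assumptions for illustration only.

```python
import os
import shutil
import subprocess

import pynvml  # bindings to libnvidia-ml (pip package nvidia-ml-py); assumed for this sketch

VOLUME_ROOT = "/var/lib/mesos/volumes"
# Abbreviated, illustrative list of NVIDIA user-space library prefixes to collect.
NVIDIA_LIB_PREFIXES = ("libnvidia-ml", "libnvidia-fatbinaryloader", "libcuda")


def build_driver_volume():
    """Consolidate host NVIDIA user-space libraries into a versioned volume directory."""
    pynvml.nvmlInit()
    version = pynvml.nvmlSystemGetDriverVersion()
    if isinstance(version, bytes):  # older pynvml versions return bytes
        version = version.decode()
    pynvml.nvmlShutdown()

    volume = os.path.join(VOLUME_ROOT, f"nvidia_{version}")
    os.makedirs(os.path.join(volume, "lib64"), exist_ok=True)

    # `ldconfig -p` lists cached shared libraries as "libfoo.so (...) => /path/libfoo.so".
    cache = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
    for line in cache.splitlines():
        if "=>" not in line:
            continue
        name, path = line.split("=>", 1)
        if name.strip().startswith(NVIDIA_LIB_PREFIXES):
            shutil.copy2(path.strip(), os.path.join(volume, "lib64"))
    return volume


def docker_volume_arg(volume):
    """Mount the consolidated volume read-only at /usr/local/nvidia in the container."""
    return ["-v", f"{volume}:/usr/local/nvidia:ro"]
```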

Page 15

GPU Discovery

• The Mesos agent / kubelet could simply auto-detect GPU counts on its own
• Instead, we use the nvml library to detect the number of GPUs, their model, etc.

[Flowchart: the agent is a compiled binary with a dynamic link to the nvml library (NVIDIA GDK). At run time, if the nvml lib is found the agent discovers GPUs and runs in GPU mode; if the nvml lib is not found it runs in non-GPU mode. The same binary works on both GPU and non-GPU nodes. A Python sketch of this pattern follows below.]
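Here is a minimal sketch of that pattern in Python, using the pynvml bindings; the graceful ImportError/NVMLError fallback mirrors the "same binary on GPU and non-GPU nodes" behaviour, though the production agents link libnvidia-ml dynamically from their own languages rather than Python.

```python
def discover_gpus():
    """Return a list of (index, model name) tuples, or [] on a non-GPU node."""
    try:
        import pynvml  # bindings to libnvidia-ml; absent on non-GPU nodes
    except ImportError:
        return []  # nvml lib not found: run in non-GPU mode

    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # library present but no usable driver/GPU

    try:
        gpus = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml versions return bytes
                name = name.decode()
            gpus.append((i, name))
        return gpus
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print(discover_gpus() or "no GPUs found: advertising CPU/memory only")
```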

Page 16

Fragmentation problem: GPU Priority Placement

Spread (default) placement across GPU nodes #1 to #4:

[Diagram: jobs J1, J2, and J3 are spread across the four nodes, leaving free GPUs scattered over all of them.]

• Three jobs are spread on 4 nodes
• New Job 4 needs 4 GPUs on a single node: can it run?
• Although there are 10 free GPUs, no single node has 4 of them free

Jx → Job x

Page 17

Fragmentation problem: GPU Priority Placement

Bundle placement across GPU nodes #1 to #4:

[Diagram: jobs J1, J2, and J3 are bundled onto fewer nodes, and Job 4's GPUs are placed in the space that remains.]

• Solution: bundle the jobs
• New Job 4 needs 4 GPUs

Jx → Job x

Page 18

Fragmentation problem: GPU Priority Placement

Bundle placement across GPU nodes #1 to #4:

[Diagram: with jobs J1, J2, and J3 bundled, Job 4's four GPUs fit together on a single node.]

• Solution: bundle the jobs
• New Job 4 needs 4 GPUs on a single node

Jx → Job x

Page 19

Fragmentation problem: GPU Priority Placement

• A GPU priority scheduler can bundle or spread GPU tasks across the cluster (a scoring sketch follows below)
  • Bundle: reserve large idle GPU nodes for large tasks
  • Spread: distribute the GPU workload over the cluster

[Diagram: Bundle places GPU task #1 and GPU task #2 together on GPU node #1, leaving GPU node #2 idle; Spread places GPU task #1 on GPU node #1 and GPU task #2 on GPU node #2.]
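To illustrate the two policies, here is a minimal node-scoring sketch in Python; the function name and the simple "fewest/most free GPUs" heuristics are assumptions for exposition, not the scheduler described in the talk.

```python
def score_nodes(free_gpus_by_node, requested_gpus, policy="bundle"):
    """Rank nodes that can fit the request.

    free_gpus_by_node: dict of node name -> number of idle GPUs.
    policy: 'bundle' packs tasks onto the fullest feasible node (keeps big
            nodes free for big jobs); 'spread' balances load across nodes.
    """
    feasible = {n: free for n, free in free_gpus_by_node.items()
                if free >= requested_gpus}
    if not feasible:
        return []  # fragmentation: enough GPUs in total, but no single node fits
    reverse = (policy == "spread")  # spread prefers the emptiest node
    return sorted(feasible, key=feasible.get, reverse=reverse)


if __name__ == "__main__":
    cluster = {"node1": 1, "node2": 3, "node3": 4}
    print(score_nodes(cluster, 2, policy="bundle"))  # ['node2', 'node3']: pack tightly
    print(score_nodes(cluster, 2, policy="spread"))  # ['node3', 'node2']: balance load
```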

Page 20

GPU Liveness Check

• GPU errors are caused by:
  • Insufficient power supply
  • Hardware damage
  • Overheating
  • Software bugs
  • …
• GPU liveness check (a probe sketch follows below)
  • The agent probes each GPU through nvml periodically
  • If a GPU probe fails, the GPU is marked unavailable and no future applications are scheduled on it

[Figure: sample GPU failure]
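A minimal liveness probe might look like the following Python sketch, again using the pynvml bindings. The probe choice (a temperature query) and the unhealthy set are illustrative assumptions, not the exact checks the agents run.

```python
import time

import pynvml  # bindings to libnvidia-ml


def probe_gpus(unhealthy):
    """Mark GPUs whose NVML queries fail so the scheduler can skip them."""
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            if i in unhealthy:
                continue  # already removed from scheduling
            try:
                handle = pynvml.nvmlDeviceGetHandleByIndex(i)
                # Any cheap query works as a health probe; temperature is one option.
                pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            except pynvml.NVMLError as err:
                print(f"GPU {i} failed liveness check: {err}")
                unhealthy.add(i)  # stop scheduling new work on this GPU
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    unhealthy_gpus = set()
    while True:
        probe_gpus(unhealthy_gpus)
        time.sleep(60)  # probe period; tune to taste
```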

Page 21

Implementation in Mesos and Kubernetes

Apache Mesos:
• Open-source cluster manager
• Enables siloed applications to be consolidated on a shared pool of resources
• Rich framework ecosystem
• Emerging GPU support

Page 22

GPU Support on Apache Mesos

[Architecture diagram: a scheduler framework (Marathon) submits a resource definition to the Mesos Master. On the Mesos Agent, the Composing Containerizer combines the Docker Containerizer and the (unified) Mesos Containerizer behind the Containerizer API; the Isolator API covers CPU, memory, and GPU; the NVIDIA GPU Allocator and NVIDIA Volume Manager back the GPU support in the Mesos Containerizer. A Marathon GPU request sketch follows below.]
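For reference, a Marathon app definition can request GPUs with the gpus field when the app runs under the Mesos containerizer (per the community slide later in the deck, GPU support arrived after Marathon v1.3). The sketch below builds such a request as a Python dict; the app id, command, image, and sizes are made-up example values.

```python
import json

# Example payload for POST /v2/apps on Marathon (Mesos containerizer, illustrative values).
gpu_app = {
    "id": "/dl/train-job",                  # illustrative app id
    "cmd": "nvidia-smi",                    # illustrative command
    "cpus": 4,
    "mem": 8192,                            # MB
    "gpus": 1,                              # GPUs as a first-class Mesos resource
    "instances": 1,
    "container": {
        "type": "MESOS",                    # GPU support uses the Mesos containerizer
        "docker": {"image": "nvidia/cuda"}, # illustrative image
    },
}

print(json.dumps(gpu_app, indent=2))
```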

Page 23

Kubernetes

• Open-source orchestration system for Docker containers
• Handles scheduling onto nodes in a compute cluster
• Actively manages workloads to ensure that their state matches the user's declared intentions
• Emerging support for GPUs

Page 24

GPU Support on Kubernetes – Upstream

• Basic multi-GPU support in the 1.6 upstream release
  • GPU discovery
  • GPU allocator
  • GPU isolator
  • GPU resource type extension: alpha.kubernetes.io/nvidia-gpu

[Diagram: the kubelet's GPU discovery, allocator, and isolator back the alpha.kubernetes.io/nvidia-gpu: 2 resource reported on the node and requested by the pod. A pod spec sketch follows below.]
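To show how a pod asks for that resource, here is a minimal manifest sketch written as a Python dict that can be serialized to JSON for kubectl. The pod name and image are illustrative; note that on upstream 1.6 the driver libraries typically still have to be mounted into the container separately (the "No GPU driver in container" gap in the table on Page 27).

```python
import json

# Minimal pod spec requesting two NVIDIA GPUs via the 1.6 alpha resource name.
gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "gpu-test"},        # illustrative name
    "spec": {
        "restartPolicy": "Never",
        "containers": [{
            "name": "cuda",
            "image": "nvidia/cuda",           # illustrative image
            "command": ["nvidia-smi"],
            "resources": {
                "limits": {"alpha.kubernetes.io/nvidia-gpu": 2},
            },
        }],
    },
}

# e.g. kubectl create -f gpu-pod.json
with open("gpu-pod.json", "w") as f:
    json.dump(gpu_pod, f, indent=2)
```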

Page 25

GPU Support on Kubernetes – Internal DL cloud

[Architecture diagram: on the worker node, the kubelet carries the community GPU support (CPU, memory, and GPU resources enforced through the Linux devices cgroup via the Docker API / CRI) plus our extensions: an NVIDIA GPU Manager that dynamically loads the libnvidia-ml library and provides the NVIDIA GPU Allocator, NVIDIA Volume Manager, and GPU Liveness Check. On the master, the kube-apiserver exposes GPU numbers for pods/nodes to kubectl (GPU resource type extension), and the kube-scheduler is extended with the GPU priority scheduler.]

Page 26

Demo

Page 27

Status of GPU Support in Mesos and Kubernetes

Function/Feature                              | nvidia-docker  | Mesos          | k8s upstream | k8s IBM
GPUs exposed to Dockerized applications       | ✔              | ✔              | ✔            | ✔
GPU vendor                                    | NVIDIA         | NVIDIA         | NVIDIA       | NVIDIA
Support multiple GPUs per node                | ✔              | ✔              | ✔            | ✔
No GPU driver in container                    | ✔              | ✔              | Future       | ✔
Multi-node management                         | ✘              | ✔              | ✔            | ✔
GPU isolation                                 | ✔              | ✔              | ✔            | ✔
GPU auto-discovery                            | ✔              | ✔              | ✔ (no nvml)  | ✔
GPU usage metrics                             | ✔              | On-going       | Future       | On-going
Heterogeneous GPUs in a cluster               | ✘              | ✔              | Partial      | ✔
GPU sharing                                   | ✔ (no control) | ✔ (no control) | Future       | Future
GPU liveness check                            | ✘              | Future         | Future       | ✔
GPU advanced scheduling                       | ✘              | Future         | Future       | ✔
Compatible with NVIDIA official Docker image  | ✔              | ✔              | Future       | ✔

Page 28

Our DL service

• Mesos/Marathon GPU support
  • Support for NVIDIA GPU resource management
  • Developing and operating deep learning and AI Vision internal services
  • Code contributed back to the community
  • Presentations at MesosCon EU 2016 and MesosCon Asia 2016

• Kubernetes GPU support
  • Support for NVIDIA GPU resource management
  • Developing and operating deep learning and AI Vision internal services
  • GPU support in IBM Spectrum Conductor for Containers (CfC)
  • Engagement with the community to bring in several of these features

Page 29

IBM Spectrum Conductor for Containers

§ Community Edition available now! Free to download and use as you wish (optional paid support)

• Customer-managed, on-premises Kubernetes offering from IBM on x86 or Power

• Simple container based installation with integrated orchestration & resource management

• Authorization and access control (built-in user registry or LDAP)

• Private Docker registry
• Dashboard UI
• Metrics and log aggregation
• Calico networking
• Pre-populated app catalog
• GPU support in 1.1; paid support in 1.2 (June)

§ Learn more and register on our community page: http://ibm.biz/ConductorForContainers

Demo on YouTube

Page 30

What is left to do

• GPU topology-aware scheduling (on-going)
  • Support GPU topology-aware scheduling to optimize performance
• GPU live metric collection (on-going; a collection sketch follows below)
  • Collect live GPU metrics (e.g., live core/memory usage)
• Support the CRI interface (planned)
  • Kubernetes moves to CRI after release 1.6, so we will no longer depend on the Docker API
• Support libnvidia-container (under discussion)
  • Use libnvidia-container instead of the nvidia-docker logic to manage GPU containers
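As a flavor of what live metric collection can look like, here is a minimal sketch with the pynvml bindings; the metric names and output format are illustrative, and a real agent would export these to its monitoring pipeline rather than print them.

```python
import pynvml  # bindings to libnvidia-ml


def collect_gpu_metrics():
    """Return per-GPU utilization and memory usage as a list of dicts."""
    pynvml.nvmlInit()
    try:
        metrics = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # core/mem busy %
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
            metrics.append({
                "gpu": i,
                "core_util_pct": util.gpu,
                "mem_util_pct": util.memory,
                "mem_used_mb": mem.used // (1024 * 1024),
                "mem_total_mb": mem.total // (1024 * 1024),
            })
        return metrics
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    for m in collect_gpu_metrics():
        print(m)
```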

Page 31

New OpenPOWER Systems with NVLink

S822LC "Minsky": 2 POWER8 CPUs with 4 NVIDIA® Tesla® P100 GPUs hooked to the CPUs using NVIDIA's NVLink high-speed interconnect
http://www-03.ibm.com/systems/power/hardware/s822lc-hpc/index.html

Page 32

Enabling Accelerators/GPUs on OpenPOWER

[Figure: the pieces brought together on OpenPOWER: containers and images, accelerators, and NVIDIA Docker.]

Page 33

How can you help

• Requirements, requirements, requirements

• Comment on issues

• Hack it and PR

• Review PRs

Page 34

Summary and Next Steps

• Cognitive, Machine and Deep Learning workloads are everywhere

• Containers can be leveraged with accelerators for agile deployment of these new workloads

• Docker, Mesos and Kubernetes are making rapid progress to support accelerators

• OpenPOWER and this emerging cloud stack make it the preferred platform for Cognitive workloads

• Join the community, share your requirements, and contribute code

Page 35

Community Activities: Mesos

• GPU features on Mesos/Marathon are supported upstream
  • GPU support for the Mesos Containerizer added after Mesos 1.0
  • GPU support added after Marathon v1.3
  • GPU usage: http://mesos.apache.org/documentation/latest/gpu-support/

• Companies collaborating on GPU support on Mesos/Marathon

• Collaborators
  • Benjamin Mahler
  • Vikrama Ditya
  • Yong Feng
  • Kevin Klues
  • Rajat Phull
  • Guangya Liu
  • Qian Zhang
  • Yu Bo Li
  • Seetharami Seelam

Page 36

Community Activities: Kubernetes

• Multi-GPU support starting in Kubernetes 1.6

• Collaborators
  • Vish Kannan
  • David Oppenheimer
  • Christopher M Luciano
  • Felix Abecassis
  • Derek Carr
  • Yu Bo Li
  • Seetharami Seelam
  • Hui Zhi
  • …

Page 37

Thank You

Contact: [email protected]@cn.ibm.com