TRANSCRIPT
IBM Leading High Performance Computing and Deep Learning Technologies
Yubo Li (李玉博), Chief Architect, GPU on Cloud, IBM Research - China
email: [email protected]: 395238640
GTC China 2016, Sept. 13, 2016
New Generation in IBM’s Eyes
• Accelerator on Cloud: speed up DL and cognitive APIs on cloud
  - GPUs available in the data center
  - Optimized accelerator performance: NVLink
• Accelerator Virtualization: make accelerators eligible for the cloud
  - GPU passthrough (PCI passthrough)
  - Hardware virtualization (NVIDIA GRID)
  - Container sharing (nvidia-docker)
• Accelerator Management and Monitoring: optimize throughput and operation
  - Accelerator management and optimized scheduling
  - Metric collection, monitoring, and alarms
  - Software enhancement
POWER8 with NVLink
• 2.5x faster CPU-GPU data communication via NVLink (80 GB/s) on the POWER8 NVLink server, vs. PCIe (32 GB/s) on x86 servers
• No NVLink between CPU & GPU for x86 servers: PCIe bottleneck
• NVIDIA P100 Pascal GPU

[Diagram: POWER8 NVLink server with P8 CPUs connected to GPUs over NVLink at 80 GB/s, alongside an x86 server with GPUs attached over PCIe at 32 GB/s]

822LC Power System for HPC: first custom-built GPU accelerator server with NVLink
• Custom-built GPU accelerator server
• High-speed NVLink connections between CPUs & GPUs and among GPUs
• Features the novel NVIDIA P100 Pascal GPU accelerator
More Compelling for CPU and GPU Integration
Far easier to create new applications on Tesla P100 + POWER8 with NVLink

NVIDIA Page Migration Engine ensures a unified memory space:
• Unified memory: address space spans CPU and GPU, 1 TB+
• Hardware-managed transfers: eliminates explicit data transfers
• Code base stays close to parallel CPU code

POWER8 with NVLink ensures speedy data throughput:
• A 1 TB memory space requires faster CPU-GPU data movement
• The bus masks transfer times

Barriers to entry removed:
• Too large a memory space required
• Too complicated to move data
• Moves too much data
• Too much custom coding for GPU data movement
• Software UVM feature too limiting
• Requires page faulting support
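The 2.5x claim follows directly from the two bandwidth figures quoted on this slide. A minimal back-of-envelope sketch (the 1 TB figure is taken from the unified-memory point above; the calculation ignores latency and protocol overheads):

```python
# Idealized transfer-time comparison for the bandwidths quoted on the slide:
# NVLink at 80 GB/s (POWER8) vs. PCIe at 32 GB/s (typical x86 server).

NVLINK_GBPS = 80   # CPU-GPU link bandwidth on POWER8 with NVLink
PCIE_GBPS = 32     # CPU-GPU link bandwidth over PCIe

def transfer_seconds(size_gb: float, bandwidth_gbps: float) -> float:
    """Time to move size_gb over a link, assuming full sustained bandwidth."""
    return size_gb / bandwidth_gbps

size_gb = 1024  # roughly the 1 TB unified address space mentioned above
t_nvlink = transfer_seconds(size_gb, NVLINK_GBPS)
t_pcie = transfer_seconds(size_gb, PCIE_GBPS)

print(f"NVLink: {t_nvlink:.1f} s, PCIe: {t_pcie:.1f} s, "
      f"speedup: {t_pcie / t_nvlink:.1f}x")  # speedup: 2.5x
```

The 80/32 bandwidth ratio is exactly the 2.5x "faster CPU-GPU data communication" headline.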
SuperVessel: OpenStack Based Cognitive Cloud
SuperVessel provides services at multiple layers.

Platform layer services:
• Big Data Service
• Cloud Data Service
• IoT Development Service
• Accelerator DevOps Service
• Cognitive Computing Service
• VisionBrain: Deep Learning as a Service with NVIDIA GPU

Infrastructure as a Service:
• Computing service
• Data store service
• Network service
• Accelerator service

Super Marketplace (Accelerators, Images, Applications)

Try SuperVessel: https://www.ptopenlab.com/
SuperVessel Cloud Infrastructure

Services layer:
• User account & authentication management
• User dashboard and admin dashboard
• Virtual point management
• Statistics and analysis

Services for cloud administration:
• System maintenance and system monitoring
• Resource metering and system analysis
• Bare-metal management and image management

Platform Management System: OpenStack controller (HA)
• Horizon, Keystone, Nova, Neutron, Glance, Cinder, HEAT, Senlin, Ironic, Swift
• GPU scheduler and auto provision

Resource pools, each driven by Nova/Neutron/Cinder agents:
• KVM pool for POWER8 LE/BE (KVM with GPU passthrough)
• Container pool for POWER8 LE/BE (Docker with GPU sharing)
• KVM pool for x86 (KVM)
• Container pool for x86 (LxC/Docker)
• Container pool for POWER7 LPAR (LxC/Docker)
• Distributed file system / shared file system
• IBM POWER servers with GPUs, and x86 servers
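The "GPU scheduler" box above places incoming GPU requests onto pool hosts. As a hypothetical illustration of one common policy (best fit: pick the eligible host with the least leftover capacity), not SuperVessel's actual algorithm or API:

```python
# Hypothetical best-fit GPU placement, sketching the "GPU scheduler" role
# in the diagram above. Host names and the dict-based pool are illustrative.

def schedule(request_gpus, hosts):
    """hosts: dict of host name -> free GPU count.
    Reserves GPUs on the best-fitting host; returns its name, or None."""
    candidates = [(free, name) for name, free in hosts.items()
                  if free >= request_gpus]
    if not candidates:
        return None               # no host can satisfy the request
    free, name = min(candidates)  # best fit: least leftover capacity
    hosts[name] -= request_gpus   # reserve the GPUs on the chosen host
    return name

pool = {"power8-kvm-1": 4, "power8-docker-1": 2, "x86-kvm-1": 1}
print(schedule(2, pool))  # picks the host with exactly 2 free GPUs
```

Best fit keeps large contiguous GPU counts free on lightly loaded hosts, so a later 4-GPU request is not starved by earlier small ones.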
Heterogeneous Computing for Cognitive Cloud

Training (development) stage:
• Train data set → DNN net file → trained model
• Data management, data cleansing, feature engineering, modeling
• Big data platform (Hadoop, Spark)
• Deep learning platform (Caffe, Torch, Theano, TensorFlow, etc.)
• CPU + GPU cluster
• Model pool

Recognition (deployment) stage:
• Application data from user
• Deep learning platform; application servers, DB service, messaging, etc.
• CPU + GPU cluster
• Application: recognition, classification
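The two stages above are decoupled by the model pool: training publishes a model, recognition serves user data against it. A minimal sketch of that flow, where the toy threshold "model" is purely illustrative and stands in for a real DNN net file:

```python
# Two-stage flow from the slide: a training (development) stage publishes
# into a shared model pool; a recognition (deployment) stage reads from it.
# The threshold classifier is a stand-in for a real deep learning model.

model_pool = {}  # shared between the two stages, as in the diagram

def train(name, data):
    """Training stage: 'learn' a threshold and publish it to the pool."""
    threshold = sum(data) / len(data)  # stand-in for real training
    model_pool[name] = threshold

def recognize(name, sample):
    """Recognition stage: classify user data with a deployed model."""
    return "positive" if sample >= model_pool[name] else "negative"

train("demo", [1.0, 2.0, 3.0])     # development stage
print(recognize("demo", 2.5))      # deployment stage -> positive
```

Keeping the pool as the only contract between the stages is what lets training and inference run on separate CPU + GPU clusters, as the slide shows.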
Cognitive Computing on SuperVessel
Try it at: https://dashboard.ptopenlab.com/computing/
• Cognitive Infrastructure Service
• Cognitive Computing Service
• Cognitive Solution and Demo Service
GPU Service and GPU Accelerated Deep Learning
SuperVessel provides a GPU sharing service by extending OpenStack and Docker capabilities. It is the first GPU sharing service in the public cloud.
• Users can request a Docker instance on SuperVessel.
• Users can request a deep learning development environment on SuperVessel, e.g. Caffe, Torch, Theano, or TensorFlow.
• Every DL environment is automatically assigned GPU resources for acceleration.
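Unlike passthrough, where one GPU is dedicated to one VM, GPU sharing lets several containers map the same physical device. A hedged sketch of that idea; the per-GPU share cap, class names, and least-loaded policy are all illustrative assumptions, not SuperVessel's implementation:

```python
# Illustrative GPU-sharing allocator: several containers may attach to one
# physical GPU, up to an assumed share cap, unlike one-GPU-per-VM passthrough.

MAX_SHARES = 4  # assumed cap on containers sharing one physical GPU

class GpuSharer:
    def __init__(self, num_gpus):
        # gpu id -> list of containers currently mapped to that GPU
        self.users = {i: [] for i in range(num_gpus)}

    def attach(self, container):
        """Attach a container to the least-loaded GPU with a free share."""
        gpu = min(self.users, key=lambda i: len(self.users[i]))
        if len(self.users[gpu]) >= MAX_SHARES:
            return None  # every GPU is fully shared
        self.users[gpu].append(container)
        return gpu

sharer = GpuSharer(num_gpus=2)
print([sharer.attach(f"caffe-{i}") for i in range(3)])  # -> [0, 1, 0]
```

Spreading containers across GPUs before doubling up keeps per-container contention low, which matters when the shared device runs DL training jobs.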
Cognitive Innovation Services Exposed to Bluemix China
• The SuperVessel team built a new cloud site at 21Vianet
• All the highlighted services run on that cloud using SuperVessel technology
• GPUs are used to accelerate the deep learning service
GPU Enhancement on Container Cloud
• GPU support on Mesos/Marathon/Kubernetes
  - GPU scheduler
  - GPU exposure/isolation with containers
  - GPU auto-discovery
  - GPU driver volume injection
  - GPU metrics collection
• Community activities
  - Main contributor for GPU enablement on Mesos/Marathon
  - Demos and presentations at several conferences
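In Mesos, enabling GPUs means agents advertise a `gpus` count in their resource offers and frameworks accept only offers that cover a task's demand. A simplified sketch of that offer-matching step; the dict-based offer and task structures are illustrations, not the real Mesos protocol messages:

```python
# Simplified Mesos-style offer matching: agents advertise resources
# (including a "gpus" count); a framework takes the first offer that
# covers every resource the task demands.

def first_fitting_offer(task, offers):
    """Return the agent of the first offer covering the task, else None."""
    for offer in offers:
        if all(offer["resources"].get(r, 0) >= need
               for r, need in task["resources"].items()):
            return offer["agent"]
    return None

offers = [
    {"agent": "agent-1", "resources": {"cpus": 8, "mem": 32768, "gpus": 0}},
    {"agent": "agent-2", "resources": {"cpus": 4, "mem": 16384, "gpus": 2}},
]
training_task = {"resources": {"cpus": 2, "mem": 8192, "gpus": 1}}
print(first_fitting_offer(training_task, offers))  # -> agent-2
```

Treating GPUs as just another countable resource in the offer is what lets the existing scheduler machinery place GPU workloads without special-casing them.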
Container Based Cognitive Service Infrastructure

• Deep learning application: data pre-processing, DL training, DL inference
• Management/interface: user UI, inference API, data/task management, user authentication, data persistence, task status monitoring, cluster monitoring
• Resource management/orchestration: Mesos, Marathon, Docker containers, shared FS
• Infrastructure: SuperVessel IaaS
• Hardware resources: compute (CPU, GPU), memory, storage (disk)
GPU Accelerated Spark Components (Current and Future)

Spark infrastructure (1.6.1+) with a DataFrames interface, running across compute nodes that each carry GPUs.

• Machine learning: logistic regression, ADMM, recommendation/ALS, elastic net, NNMF and PCA, SVM, random forest
• Deep learning: Word2Vec, nearest neighbor/LSH, tensor, gradient descent/EAGD
• Analytics / Spark SQL: Spark SQL OLAP
• GraphX: BFS/DFS, link prediction

https://github.com/IBMSparkGPU/SparkGPU, CUDA-MLlib
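A core pattern behind GPU-accelerated Spark components is batching: gather a whole partition's rows into flat buffers so one kernel launch processes them all, rather than looping row by row in the JVM. A sketch of that pattern, where the "kernel" is a plain Python stand-in rather than a real CUDA call:

```python
# Batched partition processing, the pattern behind GPU offload in Spark:
# stage a whole partition into one buffer and make a single "kernel" call
# (here simulated in plain Python) instead of a per-row loop.

def gpu_kernel_dot(rows, weights):
    """Stand-in for a batched GPU dot-product kernel (e.g. a CUDA launch)."""
    return [sum(x * w for x, w in zip(row, weights)) for row in rows]

def map_partition(rows, weights):
    """Process a partition with one batched 'kernel' launch, mirroring
    mapPartitions-style offload rather than a per-row map."""
    batch = list(rows)                 # staging buffer for the whole partition
    return gpu_kernel_dot(batch, weights)

partition = [[1.0, 2.0], [3.0, 4.0]]
print(map_partition(partition, [0.5, 0.5]))  # -> [1.5, 3.5]
```

Batching amortizes the kernel-launch and transfer cost over the whole partition, which is where the GPU speedup for MLlib-style workloads comes from.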
Questions?
Contact me at WeChat: