Deep Learning as a Service
Susan Diamond, Senior Technical Staff Member/Manager, 10/2/2019
• Why is deep learning the future over traditional machine learning?
• Why are Watson services adopting DLaaS to power their model training?
• If you are a data scientist, are you ready to try out DLaaS (the lite plan provides free GPUs)? If you are in a leadership position in your company, are you ready to promote deep learning technology to power your business?
Agenda
• Why build Deep Learning as a Service (DLaaS)
• DLaaS Architecture
• Key Design Aspects in DLaaS
• Watson Services Build Models on DLaaS
• Getting Started with DLaaS APIs
Machine Learning and AI are everywhere
• Facial recognition unlocks your phone
• Fraud detection protects your credit
• Recommendations help you shop faster
• Speech recognition lets you go hands-free
• Chat bots route calls quicker
• Autonomous vehicles detect pedestrians
• Machine vision detects cancer early
• Spam detection unclogs your Inbox
The future is now
A human brain has:
• 200 billion neurons
• 32 trillion connections between them
An artificial neural network has:
• 25 million “neurons”
• 100 million connections (parameters)
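To make "connections (parameters)" concrete, here is a minimal sketch that counts the weights and biases of a fully connected network. The layer sizes are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # one weight per connection
        total += fan_out           # one bias per unit in the next layer
    return total


# Hypothetical small classifier: 784 inputs, 128 hidden units, 10 outputs.
print(count_parameters([784, 128, 10]))  # 101770
```

Even this toy network has roughly a hundred thousand parameters, which gives a sense of scale for the 100 million quoted above.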
Deep Learning = Training Artificial Neural Networks
Intelligence arises from system interactions
Deep learning is neural network design
Machine Learning is algorithm selection
AI is systems architecture
AI requires more…
data + compute + network complexity
[Chart: Performance (y-axis) of machine learning vs. deep learning]
NVIDIA® Tesla® GPUs available in DLaaS
[Chart comparing CPU-only, K80, and V100]
Graphics Processing Units: NVIDIA® Tesla® V100
Training Lifecycle Management
You provide code + data + training definition; DLaaS handles the rest.
[Diagram labels: source code; training run definition; Kubernetes container; training artifacts; compute cluster (NVIDIA Tesla K80, V100); Cloud Object Storage]
Training assets are managed and tracked.
Hyperparameter Optimization: Efficiently automate searching your
network’s hyperparameter space to ensure the best model performance with
the fewest training runs.
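As an illustration of what searching a hyperparameter space means, the sketch below uses plain random search. This is a stand-in technique, not necessarily what DLaaS does internally; the search space, the objective function, and all names are hypothetical, and the "loss" is a toy formula rather than a real training run.

```python
import random


def random_search(objective, space, trials, seed=0):
    """Sample `trials` configurations and return the one with the lowest score."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(trials):
        # Pick one candidate value per hyperparameter.
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical search space (not taken from the slides).
SPACE = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64, 128],
}


def toy_validation_loss(config):
    # Stand-in objective: pretends learning_rate=0.01, batch_size=64 is ideal.
    return abs(config["learning_rate"] - 0.01) + abs(config["batch_size"] - 64) / 1000


best, loss = random_search(toy_validation_loss, SPACE, trials=20)
print(best, loss)
```

In practice each call to the objective would be a full training run, which is why reducing the number of trials matters.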
Code with your favorite frameworks and tools
Don’t be constrained. Select the framework appropriate to the unique requirements of your problem domain and skills of your team.
Graphs not Log Files: Don’t stare at text logs when you can overlay accuracy and loss graphs to dive deeper into the training of your neural networks.
DLaaS: a deep learning platform to bridge innovations and optimizations across the entire stack
[Stack diagram. Layers: New hardware infrastructure (SoftLayer): CPUs, NVIDIA GPUs; Container and resource management; Frameworks; API: train/manage/watch; Data Science Experience; Watson services]
• Advances in the cloud stack
• Improvement in training techniques
• Optimizations in DL frameworks
• Innovation in neural net design
• Better user experience and tools
Cloud native architecture
Aspect | Challenge | Solution
Resilience | Observed faults in different layers (GPU, network, etc.) | Engineered to survive restarts and recover from intermittent failures
Scalability and elasticity | Scale infrastructure to match workload | Nodes/GPUs and service replicas can be added and removed live
Serviceability | User cannot log in and run diagnostic tools | Expose standard APIs for useful logs and metrics
Security | Run untrustworthy user code for DLaaS retail offering | Defense-in-depth with multiple isolation techniques (at the process, container, pod, and network level)
Performance | Multi-TB customer training data transfer in the cloud environment | Cloud object storage to store training data; high network bandwidth for data transfer; s3fs driver mounts training data into the training pod
Distributed training | Support framework-specific and optimized distribution techniques (e.g. DDL, Horovod) | Framework-independent provisioning
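One common way to "recover from intermittent failures" is to retry an operation with exponential backoff. The sketch below shows that general pattern only; it is an assumption about how such recovery might look, not DLaaS's actual implementation, and the function names and the flaky operation are hypothetical.

```python
import time


def run_with_retries(task, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call task() until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return task()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky operation that fails twice before succeeding.
calls = {"count": 0}


def flaky_status_check():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network fault")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = run_with_retries(flaky_status_check, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

Passing `sleep` as a parameter keeps the example fast and makes the backoff schedule easy to inspect.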
DLaaS training architecture
[Architecture diagram. Components: Watson Machine Learning/Visual Recognition/Watson Assistant etc.; Trainer microservice; Lifecycle manager microservice; Training data microservice; MongoDB (document store); Elasticsearch (log store); Helper; Controller; Log collector; Data broker; Job monitor; Learner ... Learner. Arrow labels: job record; job status; logs, metrics. Annotation: Create per training job]
NFS volume mount
Cloud Object Store mount
• Logs (stdout, tensorboard, etc.)
• Job state
• Training data
• Training results
• Logs
• Use GPUs (in exclusive mode)
• Block network access (running user code), except to workers in the same job
• Pluggable DL frameworks
Infrastructure: Logging (ELK stack), Docker registry, Cloud Object Storage, NFS volumes (SSD-backed), Kubernetes, Jenkins
High throughput object store access for deep learning
• Provide high throughput streaming-like access to IBM Cloud Object Storage
• Enable deep learning frameworks, e.g., TensorFlow, to run as-is
• No local storage requirement; no intermediate cache such as NFS storage
• Collaborated with HRL to develop s3fs driver as container storage with the Armada team
[Diagram labels: DL learner; Object Store; Kubernetes master; GPU enabled Docker container with /tmpfs and /s3fs mounts]
Optimized network for fast data transfer
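Because the object store is mounted as a file system, training code can read data with ordinary file calls, which is what lets frameworks run as-is. The sketch below illustrates that idea by streaming files in chunks from a directory; the function name is hypothetical and a temporary directory stands in for the mounted bucket.

```python
import os
import tempfile


def stream_records(mount_dir, chunk_size=1 << 20):
    """Yield (filename, chunk) pairs from every regular file under mount_dir."""
    for name in sorted(os.listdir(mount_dir)):
        path = os.path.join(mount_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_size)
                if not chunk:
                    break
                yield name, chunk


# Stand-in for a mounted bucket: a temporary directory with one small file.
with tempfile.TemporaryDirectory() as mount_dir:
    with open(os.path.join(mount_dir, "part-0.bin"), "wb") as f:
        f.write(b"x" * 10)
    total = sum(len(chunk) for _, chunk in stream_records(mount_dir, chunk_size=4))
    print(total)  # 10
```

Reading in fixed-size chunks means no file has to fit in memory or be copied to local disk first.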
DLaaS – Basic Flow
1. Create model in DL framework supported by DLaaS (Caffe, TensorFlow, Torch, Theano, Keras, …)
2. Store training data in object storage
3. Specify model metadata (framework, …); resource requirements (GPUs, mem, num learners, …); pointer to training data (object store, s3, …)
4. Start a training job
5. Query training status and retrieve logs
6. Get trained model
[Diagram labels: Model Spec (manifest.yml); DLaaS API: /v1/models; Object storage; Kubernetes cluster runs training jobs; Training Request; Training logs, status; Trained Model]
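The sketch below illustrates step 3: assembling a training definition from model metadata, resource requirements, and a pointer to training data. Only the `/v1/models` path and the three categories come from the slide; every field name, value, and the host name are hypothetical and do not reflect the real manifest.yml schema. JSON is used here instead of YAML only to stay within the Python standard library.

```python
import json
from urllib.parse import urljoin


def build_manifest(name, framework, gpus, memory, learners, bucket):
    """Assemble the three parts of step 3 into one training definition."""
    return {
        "name": name,
        "framework": framework,  # model metadata
        "resources": {"gpus": gpus, "memory": memory, "learners": learners},
        "training_data": {"type": "s3", "bucket": bucket},  # pointer to data
    }


def models_endpoint(base_url):
    """Return the /v1/models URL named on the slide for a given host."""
    return urljoin(base_url, "/v1/models")


# All values below are placeholders, including the host name.
manifest = build_manifest(
    name="mnist-demo",
    framework={"name": "tensorflow"},
    gpus=1,
    memory="4Gb",
    learners=1,
    bucket="my-training-data",
)
print(models_endpoint("https://dlaas.example.com"))
print(json.dumps(manifest, indent=2))
```

A real client would submit this definition together with the model code to start a training job (step 4), then poll for status and logs (step 5).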
Watson AI services build models with DLaaS
https://cloud.ibm.com/catalog?category=ai
DLaaS Summary
• Deep Learning is the future over traditional machine learning technology
• With GPU acceleration, deep learning training is fast, and given enough data and compute it achieves higher accuracy than traditional machine learning
• DLaaS is a cloud-based platform that supports major deep learning frameworks and is powered by NVIDIA GPUs
• DLaaS is designed and implemented for security, scalability, resilience, and high performance, as well as ease of use and serviceability
Q&A
• Why is deep learning the future over traditional machine learning?
• Why are Watson services adopting DLaaS to power their model training?
• If you are a data scientist, are you ready to try out DLaaS (the lite plan provides free GPUs)? If you are in a leadership position in your company, are you ready to promote deep learning technology to power your business?
Try DLaaS
• https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml_dlaas.html
Thank You!