Introduction to Kubeflow - Oliver Wyman
aronchick@
Machine Learning is a way of solving problems without explicitly knowing how to create the solution
Google DC Ops
PUE == Power Usage Effectiveness
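As a small aside, PUE is conventionally defined as total facility energy divided by the energy delivered to IT equipment, so an ideal data center would score 1.0. A minimal sketch of that ratio follows; the function name `pue` and the sample figures are hypothetical and are not taken from the talk.

```shell
#!/bin/sh
# PUE = total facility energy / IT equipment energy (both in the same unit).
# Assumes the IT equipment figure is non-zero.
pue() {
  awk -v total="$1" -v it="$2" 'BEGIN { printf "%.2f\n", total / it }'
}

# Hypothetical figures: 1200 kWh drawn by the facility, 1000 kWh by IT gear.
pue 1200 1000   # prints 1.20
```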
But...
Magical AI Goodness
Most Folks
LOTS OF PAIN
Why the Gap?
Composability
Portability
Scalability
[Diagram: ML pipeline stages: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging]
Composability
Portability
Each ML Stage is an Independent System
[Diagram: the pipeline stages (Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging) split across six separate systems, System 1 through System 6]
Portability
[Diagram, built up over several slides: the same stack repeated for each environment (Laptop, Training Rig, Cloud). Layers in each stack: Model, UX, Tooling, Framework, Runtime, Accelerator, OS, Drivers, Storage, HW]
Scalability
● Machine specific HW (GPU)
● Limited (or unlimited) compute
● Network & storage constraints
  ○ Rack, Server Locality
  ○ Bandwidth constraints
● Heterogeneous hardware
● HW & SW lifecycle management
● Scale isn’t JUST about adding new machines!
  ○ Intern vs Researcher
  ○ Scale to 1000s of experiments
You Know What’s Really Good at Composability,
Portability, and Scalability?
Containers and Kubernetes
Kubernetes
[Diagram labels, layered around Kubernetes]
NFS, Ceph
Cassandra, MySQL
Spark, Airflow
TensorFlow, Caffe
TF-Serving, Flask+Scikit
Jupyter
Quota, Namespace, RBAC, Monitoring, Logging
Operating system (Linux, Windows)
CPU, Memory, Disk, SSD, GPU, FPGA, ASIC, NIC
GCP, AWS, Azure, On-prem
Kubernetes for ML
● Supports accelerators in an extensible manner
  ○ GPUs already in progress
  ○ Support for FPGAs, high perf NICs under discussion
● Existing Controllers simplify devops challenges
  ○ K8S Jobs for Training
  ○ K8S Deployments for Serving
● Handles 1000s of nodes
● Container base images for ML workloads
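To make "K8S Jobs for Training" concrete, here is a minimal sketch that renders a plain `batch/v1` Job manifest requesting a GPU. The job name, image, and `train.py` command are hypothetical placeholders, and the `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster.

```shell
#!/bin/sh
# Sketch: render a Kubernetes Job manifest for a single training run.
# Arguments: job name, container image, number of GPUs (all illustrative).
render_training_job() {
  name="$1"; image="$2"; gpus="$3"
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ${name}
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: ${image}
        command: ["python", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: ${gpus}
EOF
}

# Print the manifest for a hypothetical MNIST training job.
render_training_job mnist-train example.com/ml/mnist-trainer:latest 1
```

To submit it rather than print it, the output could be piped to `kubectl apply -f -`.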
But Wait, There’s More!
● Kubernetes native scaling objects
  ○ Autoscaling cluster based on workload metrics
  ○ Priority eviction for removal of low priority jobs
  ○ Scaled to large number of pods (experiments)
● Passes through cluster specs for specific needs
  ○ Scheduling jobs where the data needed to run them is
  ○ Node labels for Heterogeneous HW (more in the future)
  ○ Manage SW drivers and HW health via addons
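Two of the points above, priority eviction and node labels, can be sketched with standard Kubernetes objects. The snippet below is only an illustration: the class name `low-priority-experiments`, the node label key `accelerator`, and the image are hypothetical, and it assumes a cluster that serves the `scheduling.k8s.io/v1` API.

```shell
#!/bin/sh
# Sketch: a PriorityClass for preemptible experiments.
render_low_priority_class() {
  cat <<'EOF'
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-experiments
value: 100
globalDefault: false
description: "Experiments that may be preempted by higher-priority jobs."
EOF
}

# Sketch: a Pod that uses that class and is pinned to nodes carrying a
# given accelerator label. Arguments: pod name, label value.
render_experiment_pod() {
  name="$1"; accelerator="$2"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  priorityClassName: low-priority-experiments
  nodeSelector:
    accelerator: ${accelerator}
  restartPolicy: Never
  containers:
  - name: experiment
    image: example.com/ml/experiment:latest
EOF
}

render_low_priority_class
echo "---"
render_experiment_pod exp-001 nvidia-tesla-k80
```

Nodes would first need the matching label, for example via `kubectl label nodes <node-name> accelerator=nvidia-tesla-k80`.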
But...
Oh, you want to use ML on K8s?
Before that, can you become an expert in:
● Containers
● Packaging
● Kubernetes service endpoints
● Persistent volumes
● Scaling
● Immutable deployments
● GPUs, Drivers & the GPL
● Cloud APIs
● DevOps
● ...
Kubeflow
Make it Easy for Everyone to Learn, Deploy and Manage Portable, Distributed ML on Kubernetes (Everywhere)
Kubernetes + ML = Kubeflow = Win
● Composability
  ○ Choose from existing popular tools
  ○ Uses ksonnet packaging for easy setup
● Portability
  ○ Build using cloud native, portable Kubernetes APIs
  ○ Let K8s community solve for your deployment
● Scalability
  ○ TF already supports CPU/GPU/distributed
  ○ K8s scales to 5k nodes with same stack
Portability
[Diagram, built up over several slides: Kubeflow is added to the Laptop, Training Rig, and Cloud stacks in turn. In each environment with Kubeflow, the Model, UX, Tooling, Framework, and Storage layers move into Kubeflow, leaving Drivers, OS, Accelerator, Runtime, and HW as the environment-specific layers]
What’s in the Box?
● Jupyter Hub - for collaborative & interactive training
● A TensorFlow Training Controller
● A TensorFlow Serving Deployment
● Argo for workflows
● SeldonCore for complex inference and non TF models
● Reverse Proxy (Ambassador)
● Wiring to make it work on any Kubernetes anywhere
What’s in the Box?
[Diagram: the ML pipeline stages: Data Ingestion, Data Analysis, Data Transformation, Data Validation, Data Splitting, Trainer, Building a Model, Model Validation, Training At Scale, Roll-out, Serving, Monitoring, Logging]
Using Kubeflow
# Initialize a ksonnet APP
APP_NAME=my-kubeflow
ks init ${APP_NAME}
cd ${APP_NAME}

# Install Kubeflow components
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/core
ks pkg install kubeflow/tf-serving
ks pkg install kubeflow/tf-job

# Deploy Kubeflow
NAMESPACE=kubeflow
kubectl create namespace ${NAMESPACE}
ks generate core kubeflow-core --name=kubeflow-core --namespace=${NAMESPACE}
ks apply default -c kubeflow-core
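The steps above assume that both `ks` and `kubectl` are already installed. A small preflight check along these lines could catch a missing tool before the first command fails; the helper names `missing_tools` and `preflight` are hypothetical and not part of Kubeflow or ksonnet.

```shell
#!/bin/sh
# Print the name of every listed tool that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Return non-zero (and explain why) if ks or kubectl is unavailable.
preflight() {
  missing="$(missing_tools ks kubectl)"
  if [ -n "$missing" ]; then
    # $missing is left unquoted so the names are joined on one line.
    echo "Missing required tools:" $missing >&2
    return 1
  fi
  echo "ks and kubectl found"
}

# Example use before the deployment steps:
#   preflight && ks init "${APP_NAME}"
```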
Don’t Like TensorFlow?
# Same ksonnet steps as under Using Kubeflow, plus:
ks pkg install kubeflow/sklearn-job # Soon
Don’t Like TF Serving?
# Same ksonnet steps as under Using Kubeflow, plus:
ks pkg install kubeflow/seldon-core # Soon
That’s It?
Yes…(For Now)
We’re Just Getting Started!
● Who’s helping?
  ○ Red Hat, Weave, CaiCloud, Canonical, many more
● What’s next...
  ○ Easy to use accelerator integration
  ○ Support for other popular tools like Spark ML, XGBoost, sklearn
  ○ Autoscaled TF Serving
  ○ tf.transform (programmatic data transforms)
● You tell us! (Or better yet, help!)
Kubeflow is Open
- open community
- open design
- open source
- open to ideas
https://github.com/kubeflow/kubeflow
slack: kubeflow (http://kubeflow.slack.com)
twitter: @kubeflow
@aronchick
@jeremylewi
Portability
● As a data scientist, you want to use the right HW for the job
● Every variation is an opportunity for pain
  ○ GPUs, FPGAs, ASICs, NICs
  ○ Kernel drivers, libraries, performance
● Even within an ML framework, dependencies cause chaos
  ○ Package management
  ○ ML compilation
[Diagram labels: Container, App, Library, Drivers, Kernel, GPU, FPGA, Infiniband]