
End-to-End Data Science and Machine Learning for Telcos: Telstra's Use Case — Animesh Singh — Tim Osborne — Adam Makarucha

Think 2020 / May 2020 / © 2020 IBM Corporation

Session 6123

CODAIT

Improving the Enterprise AI Lifecycle in Open Source

Center for Open Source Data & AI Technologies


CODAIT aims to make AI solutions dramatically easier to create, deploy, and manage in the enterprise.

We contribute to and advocate for the open-source technologies that are foundational to IBM’s AI offerings.

30+ open-source developers!

Enterprise Machine Learning


The Machine Learning Lifecycle


Perception vs. reality: the ML code is only a tiny part of the overall platform.

*Source: Hidden Technical Debt in Machine Learning Systems

And the ML workflow spans teams and stages:

Data ingestion, data cleansing, data analysis & transformation, data validation, data splitting, data prep

Building a model, model validation, model creation, training at scale, training optimization

Deploying, serving, monitoring & logging, explainability, fine-tuning & improvements, rollout

All of this runs across edge and cloud, on top of dataflow and workflow orchestration, a marketplace (AI Hub), data consistency (versioning) and feature engineering.

And it is much more complex than the ML code alone.

End-to-end ML on Kubernetes?

First, can you become an expert in ...

●  Containers
●  Packaging
●  Kubernetes service endpoints
●  Persistent volumes
●  Scaling
●  Immutable deployments
●  GPUs, drivers & the GPL
●  Cloud APIs
●  DevOps
●  ...

We need a platform. Enter Kubeflow.

[Kubeflow architecture diagram: libraries and CLIs focused on end users (arena, kfctl, kubectl); systems that combine multiple services (katib, pipelines, notebooks, fairing, kube-bench); low-level, single-function APIs and services (TFJob, PyTorchJob, Jupyter CR, Seldon CR). The flow runs from prepared data and an untrained model, through prepared and analyzed data and a trained model, to a deployed model.]

•  End-to-end ML platform on Kubernetes, focused on multiple aspects of the model lifecycle

•  Originated at Google, and has grown to have a large community of developers

•  Google, IBM, Cisco, RedHat, Intel, Microsoft and others contributing

•  IBM is the 2nd largest contributor in terms of overall commits, with maintainers (committers/reviewers) in Katib (HPO + training), Kubeflow Serving, Manifests, Pipelines, etc.

[Diagram, continued: further building blocks include Metadata, Orchestration (Pipelines CR, Argo), Study Job, MPI CR, Spark Job, Model DB, TFX, IAM and scheduling; some components are developed by Kubeflow and others outside it, and not all components are shown.]

Kubeflow (https://github.com/kubeflow)

Jupyter Notebooks
Workflow building: Kale, Fairing
Pipelines: KF Pipelines, TFX, Airflow, +
Tools: HP tuning, TensorBoard
Serving: KFServing, Seldon Core, TFServing, +
Training operators: TensorFlow, PyTorch, XGBoost, +
Metadata
Monitoring: Prometheus
Data management: versioning, reproducibility, secure sharing

Develop (Kubeflow Jupyter Notebooks)

Data Scientist

-  Self-service Jupyter Notebooks provide faster model experimentation
-  Simplified configuration of CPU/GPU, RAM, persistent volumes
-  Faster model creation with training operators, TFX, magics, workflow automation (Kale, Fairing)
-  Simplified access to external data sources (using stored secrets)
-  Easier protection, faster restoration & sharing of “complete” notebooks

IT Operator

-  Profile Controller, Istio and Dex enable secure RBAC to notebooks, data & resources
-  Smaller base container images for notebooks, fewer crashes, faster recovery
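Each notebook server above is just a Kubernetes custom resource, so it can also be created programmatically. A minimal sketch, assuming the kubeflow.org/v1 Notebook CRD; the namespace, image tag and resource values are illustrative placeholders:

# Minimal sketch: create a Kubeflow notebook server through the Kubernetes API.
# Namespace, image tag and resource values are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

notebook = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "Notebook",
    "metadata": {"name": "demo-notebook", "namespace": "my-profile"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "demo-notebook",
                    "image": "ibmcom/powerai:latest",  # placeholder tag
                    "resources": {
                        "requests": {"cpu": "2", "memory": "8Gi"},
                        "limits": {"nvidia.com/gpu": "1"},
                    },
                }]
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="my-profile",
    plural="notebooks", body=notebook)

In practice most users create these from the Kubeflow notebooks UI; the API route is what the UI does under the hood and is handy for automation.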

Distributed Model Training and HPO (TFJob, PyTorch Job, MPI Job, Katib, …)

Addresses One of the key goals for model builder persona: Distributed Model Training and Hyper parameter optimization for Tensorflow, PyTorch etc. Common problems in HP optimization

–  Overfitting

–  Wrong metrics

–  Too few hyperparameters

Katib: a fully open source, Kubernetes-native hyperparameter tuning service

–  Inspired by Google Vizier

–  Framework agnostic

–  Extensible algorithms

–  Simple integration with other Kubeflow components

Kubeflow also supports distributed MPI based training using Horovod
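Because a Katib experiment is itself a Kubernetes custom resource, a hyperparameter search can be declared as data and submitted with the standard client. A minimal sketch, assuming a Katib release with the v1beta1 Experiment API; the namespace, training image and script are hypothetical placeholders:

# Minimal sketch: random search over the learning rate with a Katib Experiment.
# Namespace, image and training command are placeholders.
from kubernetes import client, config

config.load_kube_config()

experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "lr-search", "namespace": "my-profile"},
    "spec": {
        "objective": {"type": "maximize", "goal": 0.95,
                      "objectiveMetricName": "accuracy"},
        "algorithm": {"algorithmName": "random"},
        "maxTrialCount": 12,
        "parallelTrialCount": 3,
        "parameters": [{
            "name": "lr",
            "parameterType": "double",
            "feasibleSpace": {"min": "0.001", "max": "0.1"},
        }],
        "trialTemplate": {
            "primaryContainerName": "training",
            "trialParameters": [{"name": "learningRate", "reference": "lr"}],
            "trialSpec": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "spec": {"template": {"spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "training",
                        "image": "my-registry/train:latest",  # placeholder
                        "command": ["python", "train.py",
                                    "--lr=${trialParameters.learningRate}"],
                    }],
                }}},
            },
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1", namespace="my-profile",
    plural="experiments", body=experiment)

Katib launches one trial Job per sampled learning rate, collects the reported accuracy, and stops when the goal or the trial budget is reached.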

KFServing

●  Founded by Google, Seldon, IBM, Bloomberg and Microsoft
●  Part of the Kubeflow project
●  Focus on the 80% of use cases: single-model rollout and update
●  KFServing 1.0 goals:
   ○  Serverless ML inference
   ○  Canary rollouts
   ○  Model explanations
   ○  Optional pre/post processing
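In practice a deployment is a single InferenceService resource, and canarying is a matter of shifting traffic between the default and canary predictors. A minimal sketch, assuming the serving.kubeflow.org/v1alpha2 API that shipped around Kubeflow 1.0; the namespace and model locations are placeholders:

# Minimal sketch: serverless TensorFlow serving with a canary via KFServing.
# Namespace and storageUri values are placeholders.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kubeflow.org/v1alpha2",
    "kind": "InferenceService",
    "metadata": {"name": "my-model", "namespace": "my-profile"},
    "spec": {
        "default": {
            "predictor": {
                "tensorflow": {"storageUri": "s3://models/my-model"}  # placeholder
            }
        },
        # Optional canary: send a small share of traffic to a newer version.
        "canaryTrafficPercent": 10,
        "canary": {
            "predictor": {
                "tensorflow": {"storageUri": "s3://models/my-model-v2"}  # placeholder
            }
        },
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kubeflow.org", version="v1alpha2", namespace="my-profile",
    plural="inferenceservices", body=inference_service)

KFServing scales the predictor down when idle and back up on demand, which is what the "serverless ML inference" goal above refers to.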

Kubeflow Pipelines

§  Containerized implementations of ML tasks
§  Pre-built components: just provide params or code snippets (e.g. training code)
§  Create your own components from code or libraries (see the sketch after this list)
§  Use any runtime, framework, data types
§  Attach k8s objects: volumes, secrets
§  Specification of the sequence of steps
§  Specified via Python DSL
§  Inferred from data dependencies on input/output
§  Input parameters
§  A “Run” = a pipeline invoked with specific parameters
§  Can be cloned with different parameters
§  Schedules
§  Invoke a single run or create a recurring scheduled pipeline
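For custom components, one lightweight option in the kfp SDK is to wrap a plain Python function. A minimal sketch; the function, base image and pipeline name are illustrative and not part of the Telstra use case:

# Minimal sketch: build a pipeline component from a Python function (kfp SDK).
from kfp import dsl
from kfp.components import func_to_container_op

def split_ratio(train_rows: int, eval_rows: int) -> float:
    """Toy task: compute the train/eval split ratio."""
    return train_rows / (train_rows + eval_rows)

# Package the function as a containerized pipeline component.
split_ratio_op = func_to_container_op(split_ratio, base_image="python:3.8")

@dsl.pipeline(name="component-demo")
def component_demo(train_rows: int = 9000, eval_rows: int = 1000):
    split_ratio_op(train_rows, eval_rows)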

Define Pipeline with Python SDK

@dsl.pipeline(name='TaxiCabClassificationPipelineExample')
def taxi_cab_classification(output_dir, project,
                            train_data='gs://bucket/train.csv',
                            evaluation_data='gs://bucket/eval.csv',
                            target='tips', learning_rate=0.1,
                            hidden_layer_size='100,50', steps=3000):
    tfdv = TfdvOp(train_data, evaluation_data, project, output_dir)
    preprocess = PreprocessOp(train_data, evaluation_data,
                              tfdv.output['schema'], project, output_dir)
    training = DnnTrainerOp(preprocess.output, tfdv.schema, learning_rate,
                            hidden_layer_size, steps, target, output_dir)
    tfma = TfmaOp(training.output, evaluation_data, tfdv.schema, project, output_dir)
    deploy = TfServingDeployerOp(training.output)

Compile and Submit Pipeline Run

dsl.compile(taxi_cab_classification, 'tfx.tar.gz')
run = client.run_pipeline('tfx_run', 'tfx.tar.gz',
                          params={'output': 'gs://dpa22', 'project': 'my-project-33'})
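The calls above reflect an early SDK; with a recent kfp SDK the equivalent compile-and-run flow looks roughly like this (the endpoint and argument values are placeholders):

# Rough equivalent with the kfp SDK; host and arguments are placeholders.
import kfp

# Compile the pipeline function to a package ...
kfp.compiler.Compiler().compile(taxi_cab_classification, 'tfx.tar.gz')

# ... or compile and launch in one step against a Pipelines endpoint.
client = kfp.Client(host='http://localhost:8080')
client.create_run_from_pipeline_func(
    taxi_cab_classification,
    arguments={'output_dir': 'gs://dpa22', 'project': 'my-project-33'})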

From Single Apps to Complete Platform

Individual applications:

•  Dec 2017: Kubeflow introduced (JupyterHub, TFJob, TFServing)
•  May 2018: Kubeflow 0.1 (Argo, Ambassador, Seldon)
•  Aug 2018: Kubeflow 0.2 (Katib for HP tuning, Kubebench, PyTorch)
•  Oct 2018: Kubeflow 0.3 (kfctl.sh, TFJob v1alpha2)
•  Jan 2019: Kubeflow 0.4 (Pipelines, JupyterHub UI refresh, TFJob & PyTorch beta)
•  April 2019: Kubeflow 0.5 (KFServing, Fairing, Jupyter web app + CR)

Connecting apps and metadata:

•  Jul 2019: Kubeflow 0.6 (Metadata, Kustomize, multi-user support)
•  Sep 2019: Contributor Summit
•  November 2019: Kubeflow 0.7 (Pipelines+, KFServing v0.2, kfctl refactor)

Productionisation & hardening:

•  March 2020: Kubeflow 1.0 (production-ready, stable components)

Telstra AI Lab - (TAIL) - Configuration

•  Kubernetes – 1.15

•  Spectrum Scale CSI Driver

•  MetalLB for Load Balancing

•  Istio 1.3.1 for ingress

•  Kubeflow – 1.0.1

•  Jupyter Notebook images are IBM’s multi-architecture PowerAI images (https://hub.docker.com/r/ibmcom/powerai/tags)

Telstra AI Lab - (TAIL)

Mixed-architecture cluster: 2x IBM Power9 AC922 nodes and 4x Cisco Intel nodes

237.6 TFLOPS of single-precision GPU performance

Telstra AI Lab - (TAIL): Compute

•  4x NVLink-attached Nvidia V100 GPUs
•  4x PCIe Nvidia V100 GPUs
•  64x Power9 cores
•  68x Intel cores

[AC922 topology diagram: two Power9 CPUs connected to the NVLink-attached GPUs over 150 GB/s links]

Telstra AI Lab - (TAIL): AC922

Large Model Support: able to train models that exceed GPU memory.

Distributed Deep Learning: linear scaling of deep learning training across multiple GPU-enabled nodes.

Supports open source DL frameworks: TensorFlow, PyTorch and Caffe are all supported and optimized.

Telstra AI Lab - (TAIL): Configuration

•  The Power nodes are tainted and targeted with a node selector, so they run only data science workloads (see the sketch below)
•  Kubeflow itself runs on x86
•  The x86 nodes can also be used to run other components, such as databases, microservices, etc.
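Steering notebooks onto the tainted Power nodes comes down to a node selector plus a matching toleration in the pod spec. A minimal sketch of the relevant fields; the taint key and value are hypothetical choices made when the nodes were tainted, while the architecture label is the standard Kubernetes one:

# Minimal sketch: pod-spec fields that pin a notebook to tainted Power9 nodes.
# The taint key/value are hypothetical; kubernetes.io/arch is a standard label.
power_placement = {
    "nodeSelector": {"kubernetes.io/arch": "ppc64le"},
    "tolerations": [{
        "key": "dedicated",        # hypothetical taint key
        "operator": "Equal",
        "value": "data-science",   # hypothetical taint value
        "effect": "NoSchedule",
    }],
}

# These fields are merged into the notebook's pod template, e.g.
# notebook["spec"]["template"]["spec"].update(power_placement)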

Telstra AI Lab - (TAIL): Challenges

•  Enterprise proxy and internal host names
   •  Running a Squid proxy that routes to the enterprise proxy to enable access to docker.io, github.com, pypi.org, etc.
   •  Configure hostAliases in notebooks (see the sketch after this list)
•  Getting data into the cluster
   •  Provisioned a MinIO object storage instance in each user namespace, accessible via the Kubeflow endpoint
•  User over-provisioning of cores / PVCs
   •  Locked defaults and created reasonable limits
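The hostAliases workaround above is plain pod-spec configuration; a minimal sketch with made-up addresses and host names:

# Minimal sketch: resolve internal host names inside notebook pods via hostAliases.
# The IPs and host names are made-up examples.
host_aliases = {
    "hostAliases": [
        {"ip": "10.0.0.10", "hostnames": ["proxy.corp.internal"]},
        {"ip": "10.0.0.20", "hostnames": ["git.corp.internal", "pypi.corp.internal"]},
    ]
}

# As with the node-placement fields, merge into the notebook's pod template:
# notebook["spec"]["template"]["spec"].update(host_aliases)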

Telstra AI Lab - (TAIL): Successes

•  Easy to select the Power platform with configuration options in the notebook server
•  Added open source code to enable node selector, tolerations, and hostAliases
•  Using Kubeflow Kale to simplify pipelining of code
   •  Significantly simplifies the adoption of pipelines and the conversion of code
   •  First code conversion took 1 day; optimisation of the code took 2 weeks
•  Significant performance improvements thanks to the available compute and software tools
   •  The first use case went from a run time of 15 hours to 2 hours

Telstra AI Lab - (TAIL): Future state

•  RedHat OpenShift 4.3
•  GPU Operator
•  Kubeflow Operator
•  Extending the compute
•  Integrate feature stores and streaming technologies
•  Integrate with CI/CD tools (Tekton Pipelines)
