machine learning model life cycle management in production

33
[email protected] [email protected] Machine Learning Model Life Cycle Management In Production - #aiops #mlops

Upload: others

Post on 01-Jan-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine Learning Model Life Cycle Management In Production

[email protected]@datatron.com

Machine Learning Model Life Cycle Management

In Production - #aiops #mlops

Page 2: Machine Learning Model Life Cycle Management In Production

datatron

Who are we?

�2

Harish DoddiCEO

• Lyft Surge Pricing Model• Twitter’s Distributed Photo

Storage - 12 PB• Architected Snapchat Stories -

scaled from 0 to billions

Jerry XuCTO

• Lyft ETA Machine Learning Model - replaced Google

• Twitter’s Manhattan - replaced Cassandra

• Founding team of Windows Azure

Team: Previously worked at places like Amazon, Microsoft, and AnacondaHeadquarters: 350 Townsend Street, Suite 204, San Francisco, CA

Page 3: Machine Learning Model Life Cycle Management In Production

datatron

Today’s Enterprises Data Science Lifecycle

�3

Development

Data Science

Initial AnalysisModel development using multiple frameworksTraining and Experimenting

Production

Engineering DevOps

DeploymentWorkflow processRoll back strategyA/B Testing

Post-Production

DevOps

Model Performance MonitoringInfrastructure MonitoringScalingModel Governance

Page 4: Machine Learning Model Life Cycle Management In Production

datatron�4

Production and Post-Production

Deployment, Monitoring & Governance

Data Aggregation

Discovery & Analysis

Training & Experimenting

Page 5: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 1

Need for single centralized production teamand platform

�5

Page 6: Machine Learning Model Life Cycle Management In Production

datatron�6

Fraud Data Science

MarketingData Science

Credit Risk Data Science

Production infrastructure

Production infrastructure

Production infrastructure

Financial Institutions Production Environment

Page 7: Machine Learning Model Life Cycle Management In Production

datatron

Hidden Technical Debt in Machine Learning Systems

�7

Google PaperHidden Technical Debt in Machine Learning Systems

Machine learning offers a fantastically powerful toolkit for building complex systems quickly. … it

is remarkably easy to incur massive ongoing maintenance costs at the system level when

applying machine learning.

Boundary erosionEntanglementHidden feedback loopsUndeclared consumersData dependenciesChanges in the external worldSystem-level anti-patterns

Page 8: Machine Learning Model Life Cycle Management In Production

datatron�8

Fraud Data Science

Horizontal DEVOPS AI team

MarketingData Science

Credit Risk Data Science

Production Model Deployment

Centralized Production AI Platform like Datatron

Page 9: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 2

Start adopting best practices and standardization early

�9

Page 10: Machine Learning Model Life Cycle Management In Production

datatron�10

Old Way

Data Science

End User

Machine Learning

Engineering DevOps

• Teams operate in silos, don’t speak the same language• Errors due to lack of communication• Engineering has to write stand-alone scripts

Production Model

Teams face cross-functional

inefficiencies

Page 11: Machine Learning Model Life Cycle Management In Production

datatron

Central devops platform for AI

models streamlines the process and

reduces the inefficiencies

�11

Data Science

End User

platformdatatron

DevOps

Production Model

New Way

Page 12: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 3

Data Science Universe shouldn’t be limited

�12

Page 13: Machine Learning Model Life Cycle Management In Production

datatron�13

Model Containerization

Data Scientist Universe is un-limited

Team

Upcoming Frameworks

New Way

Team Team

Model

TensorFlowModel

scikit-learnModel

Apache Spark

No Model Containerization

Data Scientist Universe is limited

Old Way

TeamTeam Team

Model

Page 14: Machine Learning Model Life Cycle Management In Production

datatron�14

datatron

Framework Agnostic

LibraryAgnostic

LanguageAgnostic

Infrastructure Agnostic

SageMaker

Page 15: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 4

Models may go wrong, you need to monitor them continuously

�15

Page 16: Machine Learning Model Life Cycle Management In Production

datatron�16

South Park and Alexa

Page 17: Machine Learning Model Life Cycle Management In Production

datatron�17

RecommendationModel

Features

Business is continuously losing value

Garbage

Old Way

RecommendationModel

Features

Recommendation

Minimize the continuous loss

Alert!

New Way

Alert!

Page 18: Machine Learning Model Life Cycle Management In Production

datatron�18

New Way

# o

f Fra

ud

Tra

nsa

ctio

ns

Jan Feb Mar Apr May Jun

Reality

Model Prediction

Anomalies Detected

Production Model Monitoring

Allows You To Act Preemptively

Old Way

TOOLATE

# o

f Fra

ud

Tra

nsa

ctio

ns

Post-mortem reports

Page 19: Machine Learning Model Life Cycle Management In Production

datatron

Model Performance monitoring• Confusion Matrix• Gain and Lift charts• Kolomogorov Smirnov chart• Area Under the ROC curve• Gini Coefficient• Concordant – Discordant ratio• Root Mean Squared Error (RMSE)• etcModel Timeout monitoringInfrastructure monitoringOrganization KPI monitoringAnomoly monitoring

�19

Monitoring for Machine Learning Models

Page 20: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 5

You either NEVER deploy a model, or you have to do it over and over again

�20

Page 21: Machine Learning Model Life Cycle Management In Production

datatron

ML model is a continuously optimizing process

�21

Concept driftNew concept comes up

Model Building

Model Deployment

Model Monitoring

Model Management

Page 22: Machine Learning Model Life Cycle Management In Production

datatron

Connecting Machine Learning to Software world

�22

Before Now

Software deployment once a 1 or 2 years

Software deployment every day

Future

Machine Learning models will deploy

very frequent and fast

Machine Learning models deploy

very slow

BUT

Page 23: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 6

Model Governance should be automated as much as possible

�23

Page 24: Machine Learning Model Life Cycle Management In Production

datatron�24

Log

Comments

Versions

Model Versioning

New Way

Production Model Governance

Each Model Action Is Logged

Old Way

Comments

Version Log

LogLog

Log

Time for an audit

Page 25: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 7

Be prepared, your number of model will grow

�25

Page 26: Machine Learning Model Life Cycle Management In Production

datatron�26

Deployment Learning: 1 model vs Multiple models

Page 27: Machine Learning Model Life Cycle Management In Production

datatron

Value Proposition: Cost per order of magnitude

�27

Without Model Management

# of models

With Model Management

As the number of models increases, the cost also increases

As the number of models increases, the cost significantly decreases

# of models

Cost per model

Cost per model

Page 28: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 8

Senior people are required at later stages

�28

Page 29: Machine Learning Model Life Cycle Management In Production

datatron

Software Development vs ML Model Development

�29

EvolutionTestingImplementationDesignRequirements

Monitoring and

Optimization

Deploy to Production

Training / Testing

Data Preparation

Requirements

Senior People

Senior People

Page 30: Machine Learning Model Life Cycle Management In Production

datatron

Lesson 9

Your real work starts AFTER you deploy the model to production

�30

Page 31: Machine Learning Model Life Cycle Management In Production

datatron

Enterprise AI Life Cycle

�31

Exploration Training Deploy

Page 32: Machine Learning Model Life Cycle Management In Production

datatron

Model performance Model latency Infrastructure monitoring

Feature distribution Model result

Model routing Challenger KPI based selection

Model timeout Fall back strategy Alerting

Split traffic Shadowing Multi-Armed Bandit Optimization

Blue-Green deployment Rollback Canary

Enterprise AI Life Cycle After Deployment

�32

Deploy A/B Testing SLA

Model Selection

Anomaly Detection

Monitoring

Replay End-to-end tracing Logging

Troubleshooting

Cluster management integration Security check

IT Env Integration

Versioning History Approval process

Auditing

Tensorflow, sklearn, H2O, SAS, SparkML etc.

ML Frameworks

Support