geek night - continuous delivery for machine learning

5/15/2017

Continuous Delivery Principles for Machine Learning

Rajesh Muppalla
rajesh@indix.com
@codingnirvana

About Me

● Co-Founder @ Indix
● Earlier - Distributed Systems, Big Data problems
● Currently - Machine Learning problems
● Ex-ThoughtWorks
  ○ Tech lead on Go-CD - CI/CD tool
● Previous Talks @ Geek Night
  ○ Building Distributed Crawler using Akka
  ○ Big Data Testing Challenges

Six Business Critical Indexes

● People
● Documents
● Businesses
● Places
● Products
● Connected Devices

Content plus platform capability makes them very valuable

Enabling businesses to build location-aware software.

~3.6 million websites use Google Maps

Enabling businesses to build product-aware software.

Indix catalogs over 2.1 billion product offers

Indix – the “Google Maps” of Products

Building a platform for product information

Organizing the World’s Product Information

● Sources: Brand & Retailer Websites, Brand & Retailer Feeds
● AI & Machine Learning: Structure, Refine, Organize
● 1.1B Products | 2.1B Offers | 60K Brands
● Delivery: Real-time API, Bulk API, Customizable Feeds

Data Scale @ Indix

● 2.1 Billion Product URLs
● 8 TB HTML Data Crawled Daily
● 1B Unique Products
● 5000 Categories
● 100 B Price Points
● 3000 Sites


Auto Parsers to detect and extract Product content from Web pages, using Machine Vision algorithms

Predictive Scheduler for deciding re-crawl frequency using various signals like Seasonality, Product Type, Store

Multi-label classifier that categorizes products into a hierarchical taxonomy using text information

Inferring Product vs Listing vs Other Pages using either just URL patterns or using Page Content

Adaptive Crawlers that modify the crawl rate based on dynamic characteristics like Site traffic, Number of products, Robots.txt settings

Deep learning - Categorizing products using Product images

Predicting which products are an exact match or similar products

NER based Attribute extraction algorithm that mines text like Title, Descriptions, Specifications to build structured Key:Value Attributes

Fusion/Enrichment - An algorithm that learns from the data to build a golden product record from disparate sources

Product Rank - An algorithm that uses multiple signals like product popularity, price, data quality, store popularity and brand popularity to build a dynamic relevance/rank score

Recommendation Engines that suggest Tags where Product information can be found on a web page

Deep learning - Extracting visual product attributes using Product images

NLG algorithms to generate product descriptions

Product GPS - Universal Product Identifier using machine learning algorithms and allowing Search & Discovery

Machine Learning at Indix


Machine Learning Workflow

Define Business Objective

Explore & Transform

Pull and Acquire Data

Develop Model

Model Evaluation & Validation

Meets Business Needs?

Build Production System

Deploy

Measure Metrics

Not Yet!

Human in the Loop

Machine Learning Workflow

Test Data

Training Data

Machine Learning Sandwich?*

* - https://techcrunch.com/2017/08/08/the-evolution-of-machine-learning/

● Explore & Transform
● Develop Model
● Model Evaluation & Validation
● Build Production System
● Deploy

The MEAT is not in the middle


Data Pipelines


Pain Points

● A key employee in the team had to abruptly go on leave
  ○ Unable to reproduce the performance of an existing production model
    ■ Training data missing or not known
    ■ Pre-processing scripts not available
    ■ Hyperparameters not known
● It takes 3 months to productionize a model
    ■ A lot of glue code
    ■ Custom code developed every time
    ■ Frequent model updates take a long time
● Confidence in Test Set != Confidence in Production
    ■ Confidence in model performance on a sample set is not good enough
● Heterogeneous systems for performance reasons
    ■ E.g. sharing artifacts between Python and the JVM

These are solved problems in Software Development, and they have been solved using the principles of Continuous Delivery.

Continuous Delivery is a software engineering approach that aims at building, testing and releasing software faster and more frequently.

A straightforward and repeatable process is important for continuous delivery.

What is Continuous Delivery?

● Use source and artifact repository labels for reproducibility
  ○ Data and model management (incl. versioning)
● Use containers to package and run services
  ○ Model containers for model prediction services
● A/B Testing using Canary Releases & Blue Green Deployments
  ○ Variation of BG Deployment for A/B testing of models
● Automation via CI + CD pipelines
  ○ Pipelines for Training, Evaluation and for Offline Predictions

Principles from CD in ML

Model Repository

● Organization, Versioning, Publishing and Resolving of latest version
  ○ Similar to an artifact repository like Maven, Ivy
● For a model, stores
  ○ Metadata
    ■ Training/Validation/Test Datasets (From MDA or Custom)
    ■ Hyper-parameters used
    ■ Evaluation Metrics
  ○ Data
    ■ Different formats - parquet (Spark MLlib), pickle (scikit-learn), h5 (Keras)
● Has clients for most commonly used frameworks - scikit-learn, Spark MLlib, Keras
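The repository described above can be pictured as a small versioned store. The following is only a minimal in-memory sketch, not the Indix implementation: the names `ModelVersion`, `ModelRepository`, `publish` and `resolve` are hypothetical, and a real repository would persist both the metadata and the serialized model files.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelVersion:
    """One published model version with the metadata listed on the slide."""
    name: str
    version: int
    datasets: Dict[str, str]            # labels of the train/validation/test sets
    hyperparameters: Dict[str, object]
    metrics: Dict[str, float]
    artifact_path: str                  # serialized model (pickle, h5, parquet, ...)


class ModelRepository:
    """Hypothetical in-memory stand-in for a model repository."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def publish(self, name: str, datasets: Dict[str, str],
                hyperparameters: Dict[str, object],
                metrics: Dict[str, float], artifact_path: str) -> ModelVersion:
        """Store a new version; version numbers start at 1 and increase by 1."""
        versions = self._versions.setdefault(name, [])
        published = ModelVersion(name, len(versions) + 1, datasets,
                                 hyperparameters, metrics, artifact_path)
        versions.append(published)
        return published

    def resolve(self, name: str, version: Optional[int] = None) -> ModelVersion:
        """Return a specific version, or the latest one when version is None."""
        versions = self._versions[name]
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]
```

Because every version keeps its datasets, hyper-parameters and metrics together, an existing model can be retrained or compared later without relying on one person's memory.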


Model Productionization

Model Promotion

● Tagging the “latest good” version that needs to be deployed
● Not all models need to be, or can be, promoted
  ○ Experimental models
  ○ Models that fail the test set metrics
● Easy rollback - tag the “last good” version as the latest
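Promotion and rollback amount to moving a tag over a history of versions. A minimal sketch follows, assuming metrics where higher is better and a per-metric minimum threshold; the class `PromotionTags` and its methods are hypothetical names used for illustration only.

```python
from typing import Dict, List, Optional


class PromotionTags:
    """Tracks which version of each model is tagged "latest good" (hypothetical helper)."""

    def __init__(self) -> None:
        self._history: Dict[str, List[int]] = {}

    def promote(self, model: str, version: int, metrics: Dict[str, float],
                thresholds: Dict[str, float], experimental: bool = False) -> bool:
        """Tag the version as latest good unless it is experimental or fails a threshold."""
        if experimental:
            return False
        for metric, minimum in thresholds.items():
            # A missing metric counts as a failure.
            if metrics.get(metric, float("-inf")) < minimum:
                return False
        self._history.setdefault(model, []).append(version)
        return True

    def latest_good(self, model: str) -> Optional[int]:
        """Return the currently tagged version, or None if nothing was promoted."""
        history = self._history.get(model, [])
        return history[-1] if history else None

    def rollback(self, model: str) -> int:
        """Drop the current tag so the previously promoted version becomes the latest."""
        history = self._history.get(model, [])
        if len(history) < 2:
            raise ValueError("no earlier promoted version to roll back to")
        history.pop()
        return history[-1]
```

Keeping the full promotion history is what makes rollback cheap: nothing is rebuilt, the tag simply moves back one entry.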

Model Container

● Hosts a single model to be used for predictions
● Exposes an API for prediction and is “dockerized”
● Containers can be replicated to handle scale
● Two µServices
  ○ Scala
    ■ Handles pre-processing
  ○ Python
    ■ Loads the model and exposes its predict function
    ■ Can also predict in batches for better throughput
  ○ The Scala µservice delegates the predict and predict_batch functions to the Python µservice

Model Container

Docker Host

● Scala µService
  ○ predict(input)
  ○ predict_batch(inputs)
  ○ _preprocess(input)
● Python µService (Model)
  ○ predict(input)
  ○ predict_batch(inputs)
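The Python side of the container can be sketched as a thin wrapper around a loaded model. This is an illustration under assumptions, not the actual service: `ModelService` and `ThresholdModel` are hypothetical, the wrapped model is assumed to expose a scikit-learn style `predict(rows)` method, and the HTTP layer that the Scala µservice would call is left out.

```python
import pickle
from typing import Any, List, Sequence


class ModelService:
    """Wraps one loaded model and exposes predict / predict_batch (hypothetical)."""

    def __init__(self, model: Any) -> None:
        self._model = model

    @classmethod
    def from_pickle(cls, path: str) -> "ModelService":
        """Load a pickled model, e.g. one produced by scikit-learn."""
        with open(path, "rb") as handle:
            return cls(pickle.load(handle))

    def predict_batch(self, inputs: Sequence[Sequence[float]]) -> List[Any]:
        """Score many pre-processed rows in a single model call."""
        if not inputs:
            return []
        return list(self._model.predict(list(inputs)))

    def predict(self, features: Sequence[float]) -> Any:
        """Score one row by delegating to the batch path."""
        return self.predict_batch([features])[0]


class ThresholdModel:
    """Toy stand-in for a trained model, used only to exercise the service."""

    def predict(self, rows: List[Sequence[float]]) -> List[str]:
        return ["positive" if sum(row) > 1.0 else "negative" for row in rows]


service = ModelService(ThresholdModel())
single = service.predict([0.9, 0.4])
batch = service.predict_batch([[0.1], [2.0]])
```

Routing `predict` through `predict_batch` keeps a single code path, so the batch mode used for throughput and the single-record mode cannot drift apart.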

Model Deployment

● Two Modes - Offline (Batch) and/or Online
● Offline Mode
  ○ Package model containers into an AMI (Amazon Machine Image)
  ○ Start the container as part of your Spark/Hadoop clusters on the Executors/Task Trackers
  ○ Within a job call the local Scala Service for prediction for each record
● Online Mode
  ○ Deploy the model containers into a Mesos + Marathon or a Kubernetes cluster
  ○ (Auto) Scaling is managed by the cluster
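For the offline mode, the per-record call to the local service can be grouped into batches inside each task. The sketch below is an assumption about how that could look: `predict_partition` is a hypothetical helper, and `predict_batch` is any callable that forwards a list of records to the local prediction service and returns one prediction per record.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

R = TypeVar("R")  # record type
P = TypeVar("P")  # prediction type


def predict_partition(records: Iterable[R],
                      predict_batch: Callable[[List[R]], List[P]],
                      batch_size: int = 100) -> Iterator[Tuple[R, P]]:
    """Yield (record, prediction) pairs, calling the local service once per batch."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        predictions = predict_batch(batch)
        if len(predictions) != len(batch):
            raise ValueError("service returned a different number of predictions")
        yield from zip(batch, predictions)
```

In a PySpark job a function of this shape, with the client bound in advance (for example via `functools.partial`), could be passed to `RDD.mapPartitions` so that each executor talks only to the container running on its own host.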

Model A/B Testing

● Most common approach - Multi-Armed Bandit (MAB) testing

Source - https://www.slideshare.net/turi-inc/model-managementalice


Model A/B Testing

● We don’t use MAB
  ○ Reason - Payout is not easily measurable
● Instead we use a variation of the Blue Green Deployment pattern
  ○ Send input to both the old and the new model, but serve output only from the old one
  ○ Find deltas and do spot checking
● Advantages
  ○ Zero downtime while pushing new models
  ○ Easy rollback
● For Offline, BG is not needed, only deltas + spot checking
● We have built an in-house data turking tool for spot checking
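The variation described above, where both models receive the input but only the old one is served, can be sketched as follows. This is an illustration only: `ShadowDeployment` and its methods are hypothetical names, the models are plain callables, and the sampled deltas would in practice be handed to the spot-checking tool.

```python
import random
from typing import Any, Callable, Dict, List, Optional


class ShadowDeployment:
    """Serve the old model while recording where the new model disagrees (hypothetical)."""

    def __init__(self, old_model: Callable[[Any], Any],
                 new_model: Callable[[Any], Any]) -> None:
        self._old = old_model
        self._new = new_model
        self.deltas: List[Dict[str, Any]] = []

    def predict(self, item: Any) -> Any:
        """Return the old model's output; log a delta when the new model differs or fails."""
        served = self._old(item)
        try:
            candidate = self._new(item)
        except Exception as exc:  # a failing candidate must never affect serving
            self.deltas.append({"input": item, "old": served,
                                "new": None, "error": repr(exc)})
            return served
        if candidate != served:
            self.deltas.append({"input": item, "old": served,
                                "new": candidate, "error": None})
        return served

    def sample_for_spot_check(self, k: int,
                              seed: Optional[int] = None) -> List[Dict[str, Any]]:
        """Pick up to k recorded deltas at random for human review."""
        rng = random.Random(seed)
        return rng.sample(self.deltas, min(k, len(self.deltas)))
```

Since callers only ever see the old model's output, switching to the new model (or abandoning it) is a separate, explicit step, which is where the zero-downtime and easy-rollback properties come from.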

Spot Checking Example 1

Spot Checking Example 2

ML Pipelines

● ML Pipelines could be modelled after build pipelines
● Customized Go-CD, a CI & CD tool, to automate our ML workflows
● Created plugins to help us with our ML workflows

Training Pipeline
  ○ Read Training Data (MDA Job)
  ○ Pre-process Data (Spark Job)
  ○ Build Model (Python)
  ○ Evaluate Model (Python)
  ○ Publish Model
  ○ Promote Model

Manual Stage

Publish Container
  ○ Create Docker Image (Docker)
  ○ Push to Docker Registry (Docker)
  ○ Create AMI (Shell)
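A training pipeline of this kind is an ordered list of stages, some of which wait for a person. The sketch below is not Go-CD configuration; it is a plain Python illustration with hypothetical names (`Stage`, `run_pipeline`, `TRAINING_PIPELINE`), toy stage bodies, and the assumption that Promote Model is the manual stage.

```python
from typing import Any, Callable, Dict, List, NamedTuple, Tuple

Context = Dict[str, Any]


class Stage(NamedTuple):
    name: str
    run: Callable[[Context], Context]
    manual: bool = False  # manual stages wait for an explicit approval


def run_pipeline(stages: List[Stage], context: Context,
                 approve: Callable[[str, Context], bool]) -> Tuple[Context, List[str]]:
    """Run stages in order; stop at the first manual stage that is not approved."""
    completed: List[str] = []
    for stage in stages:
        if stage.manual and not approve(stage.name, context):
            break
        context = stage.run(context)
        completed.append(stage.name)
    return context, completed


# Toy stage bodies; real stages would launch the MDA, Spark and Python jobs.
TRAINING_PIPELINE = [
    Stage("read_training_data", lambda c: {**c, "rows": [1.0, 2.0, 3.0, 4.0]}),
    Stage("preprocess_data", lambda c: {**c, "rows": [r / 4.0 for r in c["rows"]]}),
    Stage("build_model", lambda c: {**c, "model": sum(c["rows"]) / len(c["rows"])}),
    Stage("evaluate_model", lambda c: {**c, "metrics": {"mean": c["model"]}}),
    Stage("publish_model", lambda c: {**c, "published": True}),
    # Assumption: promotion is the stage that needs a human decision.
    Stage("promote_model", lambda c: {**c, "promoted": True}, manual=True),
]

context, completed = run_pipeline(TRAINING_PIPELINE, {}, lambda name, ctx: False)
```

With the approval callback returning False, the run stops after `publish_model`, so a model can be published for inspection without being promoted.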

Future Work

● Open source the template model container
● Add more plugins in Go-CD to support more of the ML workflow natively
● Model Repository visualization

Thank You

Questions
