geek night - continuous delivery for machine learning
Post on 21-Jan-2018
5/15/2017
Continuous Delivery Principles for Machine Learning
Rajesh Muppalla
rajesh@indix.com
@codingnirvana
About Me
● Co-Founder @ Indix
● Earlier - Distributed Systems, Big Data problems
● Currently - Machine Learning problems
● Ex-Thoughtworks
  ○ Tech lead on Go-CD - CI/CD tool
● Previous Talks @ Geek Night
  ○ Building Distributed Crawler using Akka
  ○ Big Data Testing Challenges
Six Business Critical Indexes
● People
● Documents
● Businesses
● Places
● Products
● Connected Devices
Content plus platform capability makes them very valuable
Enabling businesses to build location-aware software.
~3.6 million websites use Google maps
Enabling businesses to build product-aware software.
Indix catalogs over 2.1 billion product offers
Indix – the “Google Maps” of Products
Building a platform for product information
Organizing the World’s Product Information
[Diagram]
● Inputs: Brand & Retailer Websites, Brand & Retailer Feeds
● AI & Machine Learning: Structure, Refine, Organize
● 1.1B Products | 2.1B Offers | 60K Brands
● Outputs: Real-time API, Bulk API, Customizable Feeds
Data Scale @ Indix
● 2.1 Billion Product URLs
● 8 TB HTML Data Crawled Daily
● 1B Unique Products
● 5000 Categories
● 100B Price Points
● 3000 Sites
3/31/16
Auto Parsers to detect and extract Product content from Web pages, using Machine Vision algorithms
Predictive Scheduler for deciding re-crawl frequency using various signals like Seasonality, Product Type, Store
Multi-label classifier Categorizing products into a hierarchical taxonomy using text information
Inferring Product vs Listing vs Other Pages using either just URL patterns or using Page Content
Adaptive Crawlers that modify the crawl rate based on dynamic characteristics like Site traffic, Number of products, Robots.txt settings
Deep learning - Categorizing products using Product images
Predicting which products are an exact match or similar products
NER based Attribute extraction algorithm that mines text like Title, Descriptions, Specifications to build structured Key:Value Attributes
Fusion/Enrichment - An algorithm that uses the data to learn and build golden product record using disparate sources
Product Rank - algorithm that uses multiple signals like product popularity, price, data quality, store popularity, brand popularity to build dynamic relevance/rank score
Recommendation Engines that suggest Tags where Product information can be found on a web page
Deep learning - Extracting visual product attributes using Product images
NLG algorithms to generate product descriptions
Product GPS - Universal Product Identifier using machine learning algorithms and allowing Search & Discovery
Machine Learning at Indix
Machine Learning Workflow
Define Business Objective
Explore & Transform
Pull and Acquire Data
Develop Model
Model Evaluation & Validation
Meets Business Needs?
Build Production System
Deploy
Measure Metrics
Yes!
Not Yet!
Human in the Loop
Machine Learning Workflow
Test Data
Training Data
Machine Learning Sandwich?*
* - https://techcrunch.com/2017/08/08/the-evolution-of-machine-learning/
Explore & Transform
Pull and Acquire Data
Deploy
Build Production System
Develop Model
Model Evaluation & Validation
The MEAT is not in the middle
Data Pipelines
App
Model
Pain Points
● A key employee in the team had to abruptly go on leave
  ○ Unable to reproduce the performance of an existing production model
    ■ Training data missing or not known
    ■ Pre-processing scripts not available
    ■ Hyperparameters not known
● It takes 3 months to productionize a model
  ○ Lots of glue code
  ○ Custom code developed every time
  ○ Frequent updates to the model take a long time
● Confidence in Test Set != Confidence in Production
  ○ Confidence in model performance on a sample set is not good enough
● Heterogeneous systems for performance reasons
  ○ E.g. sharing stuff between Python and the JVM
These are solved problems in Software Development, and they have been solved using the principles of Continuous Delivery.
What is Continuous Delivery?
Continuous Delivery is a software engineering approach that aims at building, testing and releasing software faster and more frequently.
A straightforward and repeatable process is important for continuous delivery.
Principles from CD in ML
● Use source and artifact repository labels for reproducibility
  ○ Data and model management (incl. versioning)
● Use containers to package and run services
  ○ Model containers for model prediction services
● A/B Testing using Canary Releases & Blue Green Deployments
  ○ Variation of BG Deployment for A/B testing of models
● Automation via CI + CD pipelines
  ○ Pipelines for Training, Evaluation and for Offline Predictions
Model Repository
● Organization, Versioning, Publishing and Resolving of latest version
  ○ Similar to an artifact repository like Maven, Ivy
● For a model, stores
  ○ Metadata
    ■ Training/Validation/Test Datasets (From MDA or Custom)
    ■ Hyper-parameters used
    ■ Evaluation Metrics
  ○ Data
    ■ Different formats - parquet (Spark MLlib), pickle (scikit-learn), h5 (Keras)
● Has clients for most commonly used frameworks - scikit-learn, Spark MLlib, Keras
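The publish-and-resolve behaviour described above could look roughly like the following. This is a minimal in-memory sketch, not the Indix implementation: `ModelRecord`, `ModelRepository`, `publish` and `resolve` are hypothetical names, and a real repository would persist artifacts to durable storage.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelRecord:
    """Metadata kept next to a serialized model (hypothetical schema)."""
    name: str
    version: int
    datasets: Dict[str, str]            # e.g. {"train": "...", "test": "..."}
    hyperparameters: Dict[str, float]
    metrics: Dict[str, float]
    artifact_format: str                # e.g. "pickle", "parquet", "h5"
    artifact: bytes = b""


class ModelRepository:
    """In-memory stand-in for an artifact-repository-style model store."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelRecord]] = {}

    def publish(self, name: str, datasets: Dict[str, str],
                hyperparameters: Dict[str, float], metrics: Dict[str, float],
                artifact_format: str, artifact: bytes = b"") -> ModelRecord:
        # Versions are assigned sequentially per model name, starting at 1.
        versions = self._versions.setdefault(name, [])
        record = ModelRecord(name, len(versions) + 1, dict(datasets),
                             dict(hyperparameters), dict(metrics),
                             artifact_format, artifact)
        versions.append(record)
        return record

    def resolve(self, name: str, version: Optional[int] = None) -> ModelRecord:
        # With no version given, resolve to the most recently published one.
        versions = self._versions.get(name, [])
        if not versions:
            raise KeyError(f"no published versions for model {name!r}")
        if version is None:
            return versions[-1]
        if not 1 <= version <= len(versions):
            raise KeyError(f"model {name!r} has no version {version}")
        return versions[version - 1]
```

Storing datasets, hyper-parameters and metrics with every version is what makes a production model reproducible when its author is unavailable.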
5/15/2017 Confidential and Proprietary. Do Not Distribute
Model Productionization
Model Promotion
● Tagging the “latest good” version that needs to be deployed
● Not all models need/can be promoted
  ○ Experimental models
  ○ Models that fail the test set metrics
● Easy rollback - tag the “last good” version as the latest
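Promotion and rollback can be sketched as a small tag history per model. The class name `PromotionTags`, the single-metric threshold gate and the method names are assumptions made for illustration; the slides do not say how the test-set check is implemented.

```python
from typing import Dict, List, Optional


class PromotionTags:
    """Tracks which published version of each model is tagged 'latest good'."""

    def __init__(self) -> None:
        self._history: Dict[str, List[int]] = {}

    def promote(self, model: str, version: int,
                test_metric: float, threshold: float) -> bool:
        # Hypothetical gate: refuse models that fail the test-set metric.
        if test_metric < threshold:
            return False
        self._history.setdefault(model, []).append(version)
        return True

    def latest_good(self, model: str) -> Optional[int]:
        history = self._history.get(model, [])
        return history[-1] if history else None

    def rollback(self, model: str) -> Optional[int]:
        # Drop the current tag so the previously promoted version is latest.
        # If only one version was ever promoted, it stays in place.
        history = self._history.get(model, [])
        if len(history) > 1:
            history.pop()
        return self.latest_good(model)
```

Because deployment only follows the tag, a rollback needs no rebuild: the previous version is already published and simply becomes the “latest good” one again.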
Model Container
● Hosts a single model to be used for predictions
● Exposes an API for prediction and is “dockerized”
● Containers can be replicated to handle scale
● Two µServices
  ○ Scala
    ■ Handles pre-processing
  ○ Python
    ■ Loads the model and exposes predict on the model
    ■ Can also predict in batches for better throughput
  ○ Scala µservice delegates the predict and predict_batch functions to the Python µservice
Model Container
[Diagram: a Docker Host running two µServices]
● Scala µService: predict(input), predict_batch(inputs), _preprocess(input)
● Python µService: predict(input), predict_batch(inputs), wrapping the Model
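The delegation between the two services can be sketched in a single process. Both layers are written in Python here only to keep the example in one language; in the talk the front service is Scala and the two run as separate µservices. `FrontService`, `PythonModelService`, `KeywordModel` and the pre-processing rule are hypothetical.

```python
from typing import List, Sequence


class PythonModelService:
    """Stand-in for the Python µservice: holds one loaded model."""

    def __init__(self, model) -> None:
        # `model` is any object with a scikit-learn-style predict(list) method.
        self._model = model

    def predict(self, features: str) -> str:
        return self.predict_batch([features])[0]

    def predict_batch(self, batch: Sequence[str]) -> List[str]:
        # One model call for the whole batch, for better throughput.
        return list(self._model.predict(list(batch)))


class FrontService:
    """Stand-in for the Scala µservice: pre-processes, then delegates."""

    def __init__(self, backend: PythonModelService) -> None:
        self._backend = backend

    def _preprocess(self, raw: str) -> str:
        # Hypothetical pre-processing: trim whitespace and lower-case.
        return raw.strip().lower()

    def predict(self, raw: str) -> str:
        return self._backend.predict(self._preprocess(raw))

    def predict_batch(self, raws: Sequence[str]) -> List[str]:
        return self._backend.predict_batch([self._preprocess(r) for r in raws])


class KeywordModel:
    """Toy model used only to exercise the sketch."""

    def predict(self, texts: List[str]) -> List[str]:
        return ["shoes" if "shoe" in t else "other" for t in texts]
```

Callers only ever talk to the front service, so the model behind it can be swapped without changing the prediction API.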
Model Deployment
● Two Modes - Offline (Batch) and/or Online
● Offline Mode
  ○ Package model containers into an AMI (Amazon Machine Image)
  ○ Start the container as part of your Spark/Hadoop clusters on the Executors/Task Trackers
  ○ Within a job call the local Scala Service for prediction for each record
● Online Mode
  ○ Deploy the model containers into a Mesos + Marathon or a Kubernetes cluster
  ○ (Auto) Scaling is managed by the cluster
Model A/B Testing
● Most common approach (MAB) - Multi-Armed Bandit Testing
Source - https://www.slideshare.net/turi-inc/model-managementalice
Model A/B Testing
● We don’t use MAB
  ○ Reason - Payout is not easily measurable
● Instead we use a variation of the Blue Green Deployment pattern
  ○ Send input to both the old and the new model, but serve output only from the old one
  ○ Find deltas and do spot checking
● Advantages
  ○ Zero downtime while pushing new models
  ○ Easy rollback
● For Offline, BG is not needed, only deltas + spot checking
● We have built an in-house data turking tool for spot checking
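The Blue Green variation amounts to running the new model in the shadow of the old one. The sketch below assumes both models are plain callables and that a delta is any input where their outputs differ; `shadow_serve` and `sample_for_spot_check` are hypothetical names, and the random sampling is only one possible way to pick items for human review.

```python
import random
from typing import Any, Callable, Dict, List, Sequence, Tuple

Model = Callable[[Any], Any]


def shadow_serve(inputs: Sequence[Any], old_model: Model,
                 new_model: Model) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Feed every input to both models, serve only the old model's output."""
    served: List[Any] = []
    deltas: List[Dict[str, Any]] = []
    for item in inputs:
        old_out = old_model(item)
        new_out = new_model(item)
        served.append(old_out)            # callers only ever see the old output
        if new_out != old_out:
            deltas.append({"input": item, "old": old_out, "new": new_out})
    return served, deltas


def sample_for_spot_check(deltas: Sequence[Dict[str, Any]], k: int,
                          seed: int = 0) -> List[Dict[str, Any]]:
    """Pick up to k deltas at random for manual review."""
    rng = random.Random(seed)
    return rng.sample(list(deltas), min(k, len(deltas)))
```

Since the new model never serves traffic during the comparison, switching to it later or abandoning it involves no downtime.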
Spot Checking Example 1
Spot Checking Example 2
ML Pipelines
● ML Pipelines could be modelled after build pipelines
● Customized Go-CD, a CI & CD tool to automate our ML workflows
● Created plugins to help us with our ML workflows
[Diagram: pipeline stages; one stage is marked as a Manual Stage]
Training Pipeline
Read Training Data (MDA Job) → Pre-process Data (Spark Job) → Build Model (Python) → Evaluate Model (Python) → Publish Model → Promote Model
Publish Container
Create Docker Image (Docker) → Push to Docker Registry (Docker) → Create AMI (Shell)
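A training pipeline with a manual gate can be sketched as an ordered list of stages. This is not Go-CD configuration: `Stage`, `run_pipeline` and `training_stages` are hypothetical, the stage bodies are toy placeholders, and treating Promote Model as the manual stage is an assumption, since the slide does not say which stage is manual.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet, List, Tuple

Context = Dict[str, Any]


@dataclass
class Stage:
    name: str
    run: Callable[[Context], Context]
    manual: bool = False                 # manual stages need explicit approval


def run_pipeline(stages: List[Stage], context: Context,
                 approved: FrozenSet[str] = frozenset()
                 ) -> Tuple[List[str], Context]:
    """Run stages in order; stop before a manual stage that lacks approval."""
    completed: List[str] = []
    for stage in stages:
        if stage.manual and stage.name not in approved:
            break
        context = stage.run(context)
        completed.append(stage.name)
    return completed, context


def training_stages() -> List[Stage]:
    # Toy stage bodies; the real stages would launch MDA, Spark or Python jobs.
    return [
        Stage("read_training_data", lambda c: {**c, "rows": [1, 2, 3]}),
        Stage("preprocess_data", lambda c: {**c, "rows": [r * 2 for r in c["rows"]]}),
        Stage("build_model", lambda c: {**c, "model": sum(c["rows"])}),
        Stage("evaluate_model", lambda c: {**c, "metric": 0.9}),
        Stage("publish_model", lambda c: {**c, "published": True}),
        Stage("promote_model", lambda c: {**c, "promoted": True}, manual=True),
    ]
```

Running `run_pipeline(training_stages(), {})` stops after publishing; passing `approved=frozenset({"promote_model"})` lets the manual stage run as well.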
Future Work
● Open source the template model container
● Add more plugins in Go-CD to better support stuff natively
● Model Repository visualization
Thank You
Questions