Predictive Models at Scale using Dumbo Nikhil Ketkar

Upload: nikhil-ketkar

Post on 10-Apr-2017

TRANSCRIPT

Page 1: Predictive Models at Scale

Predictive Models at Scale using Dumbo

Nikhil Ketkar

Page 2: Predictive Models at Scale

Motivation: Problem Space @ Indix

● 40k+ Brands
● 600k+ Sellers
● 700+ Million Products
● 7k+ Categories
● 10k+ Attributes

Page 3: Predictive Models at Scale

Developing Predictive Models

Unlabelled Data → Sample → Hand Label → Model → Predict → Data with Predicted Labels

Page 4: Predictive Models at Scale

Predictive Models at Scale

[Diagram: data in HDFS feeding six parallel Statistical Model instances]

Page 5: Predictive Models at Scale

The Two Giants

PyData Ecosystem (Model)
● Native: C/C++, Fortran
● Numpy
● Scipy, Pandas, Matplotlib
● scikit-learn, scikit-image, statsmodels

Hadoop Ecosystem (Predict)
● JVM: Java/Scala
● HDFS, Hadoop MapReduce
● Cascading/Scalding

Page 6: Predictive Models at Scale

The Standard Options

● Port to Java/Scala and use as a library in the Mapper
  ○ Time consuming
  ○ Need to port parts of the PyData stack
  ○ Reduced velocity
  ○ Error prone
● Write a REST API/service for the model and call it from the Mapper
  ○ Slow due to network latency
  ○ Deployment is a nightmare
● Use Disco
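The "slow due to network latency" point can be made concrete with rough arithmetic. The sketch below uses the 0.5 billion product titles from the sample problem later in this deck; the 10 ms round trip and the 200 concurrent mappers are assumed figures for illustration only, not measurements from the talk.

```python
# Back-of-envelope cost of calling a model service once per record.
# The latency and parallelism figures are assumptions, not measured values.
NUM_TITLES = 500 * 10**6        # 0.5 billion product titles (sample problem)
ROUND_TRIP_SECONDS = 0.010      # assumed 10 ms per REST call
PARALLEL_MAPPERS = 200          # assumed number of concurrent mappers


def total_hours(records, seconds_per_call, parallelism):
    """Wall-clock hours spent waiting on the network alone."""
    return records * seconds_per_call / float(parallelism) / 3600.0


sequential = total_hours(NUM_TITLES, ROUND_TRIP_SECONDS, 1)
parallel = total_hours(NUM_TITLES, ROUND_TRIP_SECONDS, PARALLEL_MAPPERS)
print("one caller: %.0f hours; %d mappers: %.1f hours"
      % (sequential, PARALLEL_MAPPERS, parallel))
```

Under these assumptions the network wait dominates regardless of how fast the model itself is, which is the argument for running the model inside the mapper process.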

Page 7: Predictive Models at Scale

Can we do better?

● Hadoop Streaming with Typedbytes support
● Python wrappers over Hadoop Streaming
  ○ Dumbo
  ○ MRJob
  ○ Hadoopy
  ○ Pydoop

Reference: http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
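For orientation, a minimal sketch of what a Dumbo job looks like: a Python script with generator functions for map and reduce, handed to `dumbo.run`. The token-counting task is chosen here only for illustration and is not taken from the talk. The import is guarded so the functions can also be exercised locally on a machine without Dumbo.

```python
# Count how often each whitespace-separated token appears across input lines.

def mapper(key, value):
    # For plain text input, value is one line of text.
    for token in value.split():
        yield token, 1


def reducer(key, values):
    # values iterates over all counts emitted for this token.
    yield key, sum(values)


if __name__ == "__main__":
    try:
        import dumbo
    except ImportError:
        dumbo = None  # Dumbo not installed; mapper and reducer still work locally
    if dumbo is not None:
        dumbo.run(mapper, reducer, combiner=reducer)
```

Such a script is typically launched with `dumbo start`, passing `-input`, `-output` and `-hadoop` options; the exact paths depend on the cluster.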

Page 8: Predictive Models at Scale

Two Minute MapReduce Refresher

Reference: https://tarnbarford.net/journal/mapreduce-on-mongo
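The refresher can also be sketched in a few lines of plain Python: map each record to key/value pairs, group the values by key (the shuffle), then reduce each group. `run_local` is a hypothetical helper written for this illustration, not part of Hadoop or Dumbo, and the input titles are truncated versions of the examples shown later in this deck.

```python
from collections import defaultdict


def run_local(mapper, reducer, records):
    """Simulate map, shuffle and reduce in memory for (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):   # map phase
            groups[out_key].append(out_value)           # shuffle: group by key
    results = []
    for out_key in sorted(groups):                      # reduce phase, sorted keys
        results.extend(reducer(out_key, iter(groups[out_key])))
    return results


def digit_mapper(key, title):
    # Emit one pair per token, keyed by whether the token contains a digit.
    for token in title.split():
        has_digit = any(ch.isdigit() for ch in token)
        yield ("has_digit" if has_digit else "no_digit"), 1


def count_reducer(key, values):
    yield key, sum(values)


titles = [
    "Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame",
    "Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen",
    "Bosch HCFC2044B 1/4\" SDS Plus X5L",
]
print(run_local(digit_mapper, count_reducer, enumerate(titles)))
```

On a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the function signatures stay the same.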

Page 9: Predictive Models at Scale

Sample Problem: Extract MPN from Product Titles

● 0.5 Billion Product Titles
● Many contain MPNs (manufacturer part numbers)
● Humans can detect MPNs
● Can a model do the same?
● Use CRF (conditional random field) on the full title
● Use RF (random forest) on tokens

Moen CSIMC000BN Brushed Nickel Decorative Mirror Frame Corner Rosette from Mirrorscapes 000 Series Set of 4

Rohl A3608/6.5LPAPC 2 Polished Chrome Country Kitchen Low Lead Bar Faucet with Porcelain Lever Handle

Newport Brass 3 447/ORB Oil Rubbed Bronze Hand Relieved Diverter / Volume Control Handle from the Metropole Collection

Bosch HCFC2044B 1/4" SDS Plus X5L with Optimized Flute Surface Pack of 25

Sterling 7214120 Ensemble 0" x 30" Shower Receptor with Right hand Drain Pack 6

U12 23252 KUB QUATRON INDX DRILL

MPNs in Product Titles
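To make "RF on tokens" concrete, here is a sketch of turning each token of a title into a feature record that a token-level classifier could consume. The feature set is an assumption chosen for illustration; the transcript does not list the features actually used.

```python
import re


def token_features(tokens, index):
    """Feature dict for tokens[index]; the feature choice is illustrative only."""
    token = tokens[index]
    return {
        "position": index,
        "length": len(token),
        "has_digit": any(ch.isdigit() for ch in token),
        "has_alpha": any(ch.isalpha() for ch in token),
        "is_upper": token.isupper(),
        "has_punct": bool(re.search(r"[^A-Za-z0-9]", token)),
    }


title = ("Bosch HCFC2044B 1/4\" SDS Plus X5L with Optimized "
         "Flute Surface Pack of 25")
tokens = title.split()
for i in range(3):
    print(tokens[i], token_features(tokens, i))
```

Tokens that mix upper-case letters and digits, such as HCFC2044B, stand out on these features, which is the kind of signal a human uses when spotting an MPN.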

Page 10: Predictive Models at Scale

Code Walkthrough

Page 11: Predictive Models at Scale

Code Walkthrough
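The code shown on these slides is not part of the transcript. As a stand-in, the sketch below shows one plausible shape for such a job, under stated assumptions: a Dumbo mapper written as a class, so that a pickled model is loaded once per task in `__init__` and applied to every record in `__call__`. `ThresholdModel`, `featurize`, `MPNMapper` and the file name `model.pkl` are hypothetical names, and the threshold rule merely imitates the `predict` interface of a trained classifier such as a random forest.

```python
import pickle


class ThresholdModel(object):
    """Hypothetical stand-in for a trained token classifier."""

    def predict(self, rows):
        # Label a token 1 when it mixes digits and letters and has 5+ characters.
        return [1 if (r[0] and r[1] and r[2] >= 5) else 0 for r in rows]


def featurize(token):
    return [any(c.isdigit() for c in token),
            any(c.isalpha() for c in token),
            len(token)]


class MPNMapper(object):
    def __init__(self, model=None, model_path="model.pkl"):
        # Load the model once per task, not once per record.
        if model is None:
            with open(model_path, "rb") as handle:
                model = pickle.load(handle)
        self.model = model

    def __call__(self, key, title):
        tokens = title.split()
        if not tokens:
            return
        labels = self.model.predict([featurize(t) for t in tokens])
        yield title, [t for t, label in zip(tokens, labels) if label == 1]


if __name__ == "__main__":
    try:
        import dumbo
    except ImportError:
        dumbo = None  # Dumbo not installed; the mapper can still be called locally
    if dumbo is not None:
        dumbo.run(MPNMapper)  # map-only job: no reducer needed for prediction
```

Locally, `list(MPNMapper(model=ThresholdModel())(0, title))` exercises the same code path without Hadoop. On a cluster the pickled model would have to be shipped alongside the script so that every task can open it.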

Page 12: Predictive Models at Scale

Important Learnings

● Dumbo is fairly stable, mature and ready for production
● Gets the 2 giants working together!
● Found just one issue over 6 months of usage (patch submitted)
● Support for Typedbytes is critical if making predictions over binary data (images etc.)
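Why binary data needs something like Typedbytes can be sketched as follows. Line-oriented streaming treats newline bytes as record boundaries, so any payload that happens to contain them gets split. The length-prefixed framing below is a simplified illustration of the idea, not the actual typed bytes wire format, and the byte strings are made-up stand-ins for image data.

```python
import io
import struct


def write_record(stream, payload):
    """Write one record as a 4-byte big-endian length followed by the bytes."""
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)


def read_records(stream):
    """Yield payloads back until the stream is exhausted."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        (length,) = struct.unpack(">i", header)
        yield stream.read(length)


# Made-up binary payloads that contain newline and tab bytes.
images = [b"\x89PNG\r\n\x1a\n\x00\tpixels", b"\xff\xd8\n\n\xff\xd9"]

# Line-oriented framing: newline bytes inside payloads break record boundaries.
text_stream = b"\n".join(images) + b"\n"
broken = text_stream.split(b"\n")[:-1]

# Length-prefixed framing: payload bytes are never interpreted.
framed = io.BytesIO()
for image in images:
    write_record(framed, image)
framed.seek(0)
recovered = list(read_records(framed))

print(len(images), len(broken), recovered == images)
```

With line framing the two payloads come back as more than two fragments; with a length prefix they round-trip unchanged.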