mlleap, or how to productionize data science workflows using spark by mikhail semeniuk and hollin...
Post on 21-Apr-2017
2.250 Views
Preview:
TRANSCRIPT
MLeap: Release Spark ML Pipelines
Hollin Wilkins & Mikhail Semeniuk
Introduction
Studied some Math and Economics
- Predictive modeling of at risk patients @ UHG
- R/SAS batch pipelines
- - Joins TrueCar to do new car price modeling
- SQL-based batch pipelins + python for Linear Regressions in API layer
- Continues to train models in SAS/R
- SQL-based batch pipelins + python for Linear Regressions in API layer
- Massive pre-computation in Hadoop/ElasticSearch
- Portions of models still translated, now to Java
- Spark enables data processing and training of models in same environment
MLeap
Web Dev @ Cornell
Studied some General Biology
Rails Consulting for TrueCar and other companies
Implement ML model for ClearBook in Ruby, Python and Java
Erlang for highly-concurrent game servers
C/C# Game Engines
Join TrueCar for Rails/mobile development
Begin work on machine learning platform based on Spark/Hadoop
How do we build real-time APIs based on Spark Models? SparkContext prohibitive
Try several ideas, fail, learn a ton, begin collaboration with Hortonworks
Everyone wants to do better! The winning technology will be the one that enables Engineers and Data Scientists to collaborate and work across a single platform.
Action Reaction
Problem Statement: Deploying machine learning algorithms to a production environment is a lot more difficult than it has to be and is a common problem for data-driven organizations.
- Data scientists write data pipelines to construct research datasets
- Engineers re-write the data pipelines for a production-ready system
- Engineers write scalable libraries for computing features and algorithms
- Data scientists largely don’t use those libraries and maintain/re-write their own copy of the code
- Data scientists largely focus on linear/logistic regressions due to engineering constraints
- Talented engineers get largely tired of coding up linear regressions and updating coefficients
HDFS: Shared Data Spark: Shared ML APIs ?: Shared API Environment
Existing Solutions: You won’t believe how many companies are still deploying algorithms in a SQL environment! And these are billion dollar operations.
Hard-Coded Models(SQL, Java, Ruby)
PMML Emerging Solutions(yHat, DataRobot)
Enterprise Solutions(Microsoft, IBM, SAS)
MLeap
Quick to Implement
Open Sourced
Committed to Spark/Hadoop
API Server
MLeap Solution• Born out of need to deploy models to a real time API
server at TrueCar
• Leverage Hadoop/Spark ecosystem for training, get rid of Spark dependency for execution
• Easily reuse models with serialization and executing without Spark
MLeap Components• core - provides linear algebra system,
regression models, and feature builders
• runtime - provides DataFrame-like “LeapFrame” and transformers for it
• spark - provides easy conversion from Spark transformers to MLeap transformers
mleap-spark
mleap-runtime
mleap-core
Linear Algebra
Dense/Sparse Vectors
BLAS from Spark
Feature Builders
VectorAssembler
StringIndexer
Regressions
LinearRegression
RandomForest
OneHotEncoder
StandardScaler
LogisticRegression
MLeap Core Components
MLeap Runtime• Provides LeapFrame, which stores data for transformations by MLeap
transformers
• MLeap transformers user mleap-core building blocks to transform LeapFrame
• MLeap transformers correspond one-to-one with Spark transformers
• No dependencies on Spark
Feature Pipeline DiagramVectorAssembler Continuous
Feature Vector StandardScaler
StringIndexer
StringIndexer
StringIndexer
OneHotEncoder
OneHotEncoder
VectorAssembler
LinearRegression
Categorical Feature
Categorical FeatureIndex
Categorical Feature
One Hot Vector
Categorical Feature Vector
VectorAssembler
Scaled Continuous Feature Vector
Final Feature Vector
Continuous Feature
Legend
Final Feature Vector Prediction
Regression Pipeline
OneHotEncoder
Categorical Pipeline
LeapFrame LeapFrame LeapFrame
Categorical Feature StringIndexer OneHotEncoderCategorical
Feature Index
Categorical Feature One Hot Vector
StringIndexer OneHotEncoder
String Indexer Model
One Hot Encoder Model
Linear Regression Model
MLeap Spark• Train an ML pipeline with Spark then export it to MLeap
• Execute an MLeap pipeline against a Spark DataFrame
Spark Estimator Spark Model MLeap Model
MLeap Spark
Spark DataFrame Spark LeapFrame Spark LeapFrame
MLeap Spark
Spark DataFrame
MLeap Transformer
MLeap Spark
Typical MLeap Workflow
DataFrames
HDFS: Raw Data
Spark/MapReduce/Pig/Hive
Combine + Enrich Transactional Data
Compute the feature vector
Train, Test, Repeat(MLeap-Spark)
Export models to …?
Spark MLeap Runtime
Serialize to HDFS
Serialize to ElasticSearch
Model 1 Model 2
Deploy to API Servers
Scala/Java/Python/Zeppelin/Jupyter/DataBricks
Demo Usage of MLeap• Train a sample listing price model using linear
regression and random forest
• Deploy both models to an API server
• Get real-time results
• IN UNDER 5 MINUTES!
Future of MLeap• Have MLeap Core be a dependency of Spark to ensure
parity between transformers
• Expand supported transformers
• Python/R interface
THANK YOU.Github: http://github.com/truecar/mleapHollin Wilkins: hollinrwilkins@gmail.comMikhail Semeniuk: seme0021@gmail.com
top related