Multi-Model Machine Learning by Maximo Gurmendez and Beth Logan
Multi-Model Machine Learning for
Real Time Bidding over Display Ads
Beth Logan
Senior Director of Optimization
Maximo Gurmendez
Data Science Engineering Team Lead
With credit to our Spark developers:
Inés Guelfi, Juan Tejería, Martin Manasliski, Victoria Seoane
We wanted to try Spark but wondered:
• Is Spark thread safe?
• Is Spark fast enough?
• Does it use too much memory?
Agenda
1. What we do
2. How we do it
3. Why Spark?
4. Challenges Addressed
5. Main Takeaways
DataXu’s Mission
Make marketing smarter through Data Science!
What We Do
Taking Action Automatically
• Bid in real-time ad auctions on behalf of advertisers
• Machine Learning System learns from past bids
[Diagram labels: User; Browser; Request; Ad; Ad exchanges; Ad Bid Request; Ad Selection + Bid; DataXu Real time system; DataXu Machine Learning system]
DataXu ML System
[Diagram labels: Ads shown; User actions (purchase, clicks, etc); Hive database; Hadoop; Learn Models; Calibrate; Evaluate; Only high quality models; Real Time Bidding]
Why is this hard?
Huge Scale
• 2 Petabytes Processed Daily
• 1.6 Million Bid Decisions Per Second
• Runs 24 x 7 on 5 Continents
• Thousands of ML Models Trained per Day
Unattended Operation
• Model training and deployment run automatically every day
Changing Industry
• Need ability to adapt quickly to new customer requirements
Why Spark?
• Large open source Machine Learning library
– Fast turnaround of research to production
– Easy to prototype and support new customer use cases
– Built-in upgrade of algorithms
– Increased reliability
• Trains models faster than Hadoop
• Enables iterative models
• Elastic environment via cloud
Challenges Addressed
• Smart Dataset Partitioning by Campaign
• Categorical Features
• Functional Features
• Pipelines + RowTransformers
• Use of SparkSQL
• Real-time model instantiations
Partitioning the data
• Need 1 RDD per campaign
• "Fat Reducers" or "Many files" problem
• 2-pass solution
Partitioning the data: Solution
• Sample the RDD
• Construct histogram of sizes
• Use histogram to allocate more processes (pseudo-sub-partition)
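The two-pass idea above can be sketched in plain Python (not Spark). Pass one samples the records and builds a histogram of estimated per-campaign sizes; pass two uses that histogram to give large campaigns several sub-partitions, so no single worker receives one oversized campaign. The names `build_histogram`, `allocate_splits`, `partition_key` and the `target_rows` threshold are assumptions made for illustration, not DataXu's implementation.

```python
import math
import random
from collections import Counter


def build_histogram(records, sample_rate, rng):
    """Pass 1: estimate rows per campaign from a random sample."""
    sampled = Counter(r["campaign"] for r in records if rng.random() < sample_rate)
    # Scale the sampled counts back up to approximate the full sizes.
    return {campaign: n / sample_rate for campaign, n in sampled.items()}


def allocate_splits(histogram, target_rows):
    """Give each campaign enough sub-partitions to hold about target_rows each."""
    return {c: max(1, math.ceil(size / target_rows)) for c, size in histogram.items()}


def partition_key(record, splits, rng):
    """Pass 2: route a record to a (campaign, sub-partition) pair."""
    n = splits.get(record["campaign"], 1)  # campaigns missed by the sample get 1
    return (record["campaign"], rng.randrange(n))


rng = random.Random(0)
records = [{"campaign": "big"}] * 9000 + [{"campaign": "small"}] * 100
# sample_rate=1.0 keeps this demo deterministic; a real run would sample a fraction.
hist = build_histogram(records, sample_rate=1.0, rng=rng)
splits = allocate_splits(hist, target_rows=3000)
keys = Counter(partition_key(r, splits, rng) for r in records)
```

Here the hypothetical "big" campaign is spread over three sub-partitions while "small" stays in one, which is one way to avoid both a fat reducer and a flood of tiny files.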
Spark ML Pipelines
[Diagram: Raw Feature Transformation, Feature Encoding, Feature Selection (Transformers); Decision Tree Trainer (Estimator)]
Spark ML Pipelines
[Class diagram: Transformer with transform(DataFrame):DataFrame; Estimator with fit(DataFrame):Model; Model extends Transformer]
• Great for training, evaluation & experimentation
• Can we use them at bid time?
ML Pipelines: Row Transformer
Problem: At bid time there is no DataFrame!
Solution: Use row transformer
[Class diagram: Transformer with transform(DataFrame):DataFrame; RowTransformer with transform(Row):Row; the two are joined by an "Extends" relation]
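A minimal sketch of the row-transformer idea in plain Python, assuming a list of dicts stands in for a DataFrame and a single dict for a Row. The batch `transform` is derived from the per-row method, so the same logic serves training and a single bid request. All class names here (`LowercaseOs`, `AddIsMobile`, `RowPipeline`) are hypothetical and not taken from Spark or from DataXu's code.

```python
from abc import ABC, abstractmethod


class RowTransformer(ABC):
    """Transforms one row; the batch transform is derived from it."""

    @abstractmethod
    def transform_row(self, row: dict) -> dict:
        ...

    def transform(self, frame: list) -> list:
        # Batch path (training): apply the same row logic to every row.
        return [self.transform_row(row) for row in frame]


class LowercaseOs(RowTransformer):
    """Hypothetical step: normalise the 'os' field."""

    def transform_row(self, row: dict) -> dict:
        return {**row, "os": row["os"].lower()}


class AddIsMobile(RowTransformer):
    """Hypothetical step: derive a boolean feature from 'os'."""

    def transform_row(self, row: dict) -> dict:
        return {**row, "is_mobile": row["os"] in {"android", "ios"}}


class RowPipeline(RowTransformer):
    """Chains row transformers so one object serves training and bidding."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform_row(self, row: dict) -> dict:
        for stage in self.stages:
            row = stage.transform_row(row)
        return row


pipeline = RowPipeline([LowercaseOs(), AddIsMobile()])
training = pipeline.transform([{"os": "Android"}, {"os": "Windows"}])  # batch
bid_time = pipeline.transform_row({"os": "iOS"})  # one bid request, no frame
```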
Meta-Pipeline Extension
• Combines and evaluates several pipelines
• DAG with all steps and dependencies
• JSON Configurable
• Pipelines = All possible paths from root to leaves
• Use to train multiple classifiers with little overhead (training time dominated by data read and transformation)
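The "paths from root to leaves" rule can be illustrated with a short Python sketch. The JSON layout below (a `root` name plus an `edges` map) and the step names are assumptions invented for this example; the talk does not show the real configuration format.

```python
import json

# Hypothetical JSON description of the step DAG.
CONFIG = json.loads("""
{
  "root": "read_data",
  "edges": {
    "read_data": ["encode_features"],
    "encode_features": ["select_features"],
    "select_features": ["decision_tree", "random_forest"],
    "decision_tree": [],
    "random_forest": []
  }
}
""")


def pipelines(edges, node):
    """Yield every path from `node` to a leaf; each path is one pipeline."""
    children = edges.get(node, [])
    if not children:
        yield [node]
        return
    for child in children:
        for tail in pipelines(edges, child):
            yield [node] + tail


paths = list(pipelines(CONFIG["edges"], CONFIG["root"]))
for path in paths:
    print(" -> ".join(path))
```

In this example both pipelines share their first three steps, so a runner that caches intermediate results would read and transform the data once and only repeat the final training step per classifier.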
Best model varies from campaign to campaign
[Chart; axis label: AUC]
Models at bid time
• Standard Java serialization
• Models relatively light-weight and fast
[Charts: Model Size in Memory (KB) and Avg. Latency (milliseconds), comparing Current DataXu Model and Spark RandomForest; annotation: Add Preprocessing Metadata ~ 130K]
Use and abuse of SparkSQL
• Choosing features via select command
• Functional features and categorical to numerical encoding via UDFs
• Top K feature values via UDAF
• Reuse UDFs at bid time
• Imperative to declarative
• Huge savings in LOC
SparkSQL: TopK UDAF Example
For categorical encoding we first obtain the most popular nominals:
select topk(os) from training_data
Result:
{windows:1562, macos:928, linux:21}
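What such an aggregate computes can be sketched in plain Python: count values per partition, merge the partial counts, and keep the k most frequent. This is only an illustration of the aggregation logic, not the Spark UDAF itself; the function signature, the value of `k`, and the extra "beos" rows are assumptions, while the three counts are the ones shown in the result above.

```python
from collections import Counter


def topk(partitions, k):
    """Merge per-partition counts and return the k most frequent values."""
    total = Counter()
    for part in partitions:
        total.update(Counter(part))  # merge step, as an aggregate would
    return dict(total.most_common(k))


# Made-up column: the slide's three counts plus a rare value to be cut off.
os_column = ["windows"] * 1562 + ["macos"] * 928 + ["linux"] * 21 + ["beos"] * 2
partitions = [os_column[::2], os_column[1::2]]  # pretend the data is split in two
result = topk(partitions, k=3)
```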
SparkSQL: Feature Encoding
Easily encode categorical features using UDFs
Enumerate: select enumerate_encode(os), enumerate_encode(browser) from training_data
Result:
1,3
3,1
2,1
One-Hot-Encoding: select onehot(os,'macos'), onehot(os,'windows'), onehot(os,'linux') …
Result:
1,0,0
0,1,0
0,0,1
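The two encodings can be sketched as ordinary Python functions, which also hints at why such UDFs are easy to reuse on a single bid request. The numbering scheme (rank starting at 1, with 0 for values outside the popular list) and the helper `make_enumerate_encoder` are assumptions for illustration; the talk does not specify how `enumerate_encode` assigns its numbers.

```python
def make_enumerate_encoder(popular):
    """Map each popular value to 1..K by position; anything else maps to 0."""
    index = {value: rank for rank, value in enumerate(popular, start=1)}
    return lambda value: index.get(value, 0)


def onehot(value, target):
    """Return 1 when the value equals the target category, else 0."""
    return 1 if value == target else 0


popular_os = ["macos", "windows", "linux"]  # e.g. taken from a top-k query
encode_os = make_enumerate_encoder(popular_os)

rows = ["macos", "windows", "linux"]
enumerated = [encode_os(v) for v in rows]
one_hot = [[onehot(v, t) for t in popular_os] for v in rows]

# The same functions work on one value at bid time.
unseen = encode_os("solaris")
```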
Takeaways
• It works!
• Spark SQL: maintainable & declarative
• Models can bid in real time
• Automated & unattended ML at large scale
• ML Pipelines had to be extended