Andy Feng, Distinguished Architect, Yahoo at MLconf SF
DESCRIPTION
Abstract: Scalable Machine Learning at Yahoo

Yahoo scientists have developed a variety of machine learning libraries (supervised learning, unsupervised learning, deep learning) for online search, advertising and personalization. Emerging business needs require us to address two problems:
- Can we apply these libraries against massive datasets (billions of training examples and millions of features) using commodity hardware clusters?
- Can we reduce the learning time from days to minutes or seconds?

We have thus examined system architecture options (including Hadoop, Spark and Storm) and developed a fault-tolerant MPI solution that allows hundreds of machines to jointly build a model. We are collaborating with the open source community on a better system architecture for next-gen machine learning applications. Yahoo ML libraries are being revised for much better scalability and latency. In the talk, we will share the system architecture of our ML platform and its use cases.

TRANSCRIPT
![Page 1: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/1.jpg)
Scalable Machine Learning at Yahoo
Andy Feng
Nov 14, 2014
![Page 2: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/2.jpg)
My Background
§ Current
› VP Architecture, Yahoo
› Committer, Apache Storm
› Contributor, Apache Spark & Hadoop
§ Past
› NoSQL
› Online advertisement
› Personalization
› Cloud services
![Page 3: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/3.jpg)
Agenda
§ Machine Learning
› Use Cases
› Challenges
§ Scalable ML Architecture
§ Design Patterns
› Batch, real-time and hybrid
![Page 4: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/4.jpg)
Evolution of Big Data @ Yahoo
[Chart: number of servers (axis 0 to 45,000) and raw HDFS storage in PB (axis 0 to 600) by year, 2006 to 2014]
Milestones annotated on the chart:
› Yahoo! Commits to Scaling Hadoop for Production Use
› Research Workloads in Search and Advertising
› Production with machine learning & WebMap
› Revenue Systems with Security, Multi-tenancy, and SLAs
› Open Sourced with Apache
› Hortonworks Spinoff for Enterprise hardening
› Nextgen Hadoop (H 0.23)
› New Services (HBase, Hive)
› Increased User-base with partitioned namespaces
› Hadoop 2.5
› Machine Learning
![Page 5: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/5.jpg)
Personalized Homepage http://www.yahoo.com Mobile
Today Module (2012)
Content stream w/ native ads (2013)
![Page 6: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/6.jpg)
Web Search & Ads
• Web Page rank • Image/Video insertion
Ads targeting & ranking
![Page 7: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/7.jpg)
Flickr Photo Search
Flickr
2014 … Empowered by Scalable ML
2013 … User tags based
![Page 8: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/8.jpg)
Machine Learning @ Yahoo
§ Search
› Page ranking per user intention
§ Advertisement
› Ad click prediction
› Identify potential users for an ad campaign
§ Content
› Matching news articles against users
› Object detection, face recognition in photos
§ Security
› Email spam
› Fraudulent login and registration
![Page 9: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/9.jpg)
Our Challenges
§ Scale
› 1,000,000,000’s of examples
› 100,000,000’s of features
› 10,000’s of models
› 10’s of algorithms
• Batch learning
• Incremental learning
• Real-time learning
§ Speed
› Temporal nature of user interests
› Time sensitive content
• Ex., breaking news
› Naïve solutions spend days/hours in model training
• Minutes/seconds desired
![Page 10: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/10.jpg)
Our Approach: Big-Data Machine Learning
![Page 11: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/11.jpg)
Apache Hadoop http://hadoop.apache.org
§ Originally created by Yahoo
§ Popular framework for running applications on large clusters built of commodity hardware
§ Designed for very high throughput and reliability
§ YARN resource manager supports Map/Reduce, Tez and beyond
![Page 12: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/12.jpg)
Apache Storm http://storm.apache.org
§ “Hadoop for Realtime”
› distributed and high-performance realtime data processing
§ Simple API
§ Horizontal scalability
§ Fault-tolerance
§ Guaranteed data processing
![Page 13: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/13.jpg)
Apache Spark http://spark.apache.org
§ Fast and expressive cluster computing system compatible with Apache Hadoop
§ Supports general execution DAGs
› Ex. iterative programming
§ Resilient Distributed Datasets
› In-memory storage
![Page 14: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/14.jpg)
30x Speedup for GBDT
§ Gradient Boosted Decision Trees took days to train on our large datasets.
↑ High accuracy
↓ Sequential execution
§ 30X speedup enables frequent model training.
› GBDT included in data pipeline (Hadoop Oozie workflow)
![Page 15: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/15.jpg)
Auto-tag billions of Flickr photos
[Diagram: MapReduce pipeline for training photo tag classifiers]
§ Map (10,000 Mappers): deep network as feature extractor (Pixels -> features), emitting records such as
dog, 1, [.2, -.3, …]
dog, 0, [.3, -.5, …]
cat, 1, [.2, -.3, …]
cat, 0, [.3, -.5, …]
§ Shuffle
§ Reduce (1,000 Reducers): train models (Dog, …, Cat, …)
§ 8000+ classifiers
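The record layout on this slide (`tag, label, features`) suggests a one-vs-rest setup: each photo's feature vector is emitted once per tag with a 0/1 label, and the shuffle groups the records by tag so that one reducer can train one classifier. A minimal Python sketch of that idea follows. The names `map_photo`, `shuffle` and `vocabulary` are hypothetical, and the rule for choosing negatives (every vocabulary tag the photo lacks) is an assumption, not something the slide states.

```python
from collections import defaultdict


def map_photo(features, photo_tags, vocabulary):
    """Mapper sketch: emit one (tag, label, features) record per vocabulary tag.

    Assumption: a photo is a negative example for every tag it does not carry.
    """
    for tag in vocabulary:
        label = 1 if tag in photo_tags else 0
        yield tag, label, features


def shuffle(records):
    """Group records by tag, as the MapReduce shuffle would group by key."""
    groups = defaultdict(list)
    for tag, label, features in records:
        groups[tag].append((label, features))
    return dict(groups)


# Two photos whose features stand in for the output of the deep network.
photos = [
    ([0.2, -0.3], {"dog", "cat"}),
    ([0.3, -0.5], set()),
]
vocabulary = ["dog", "cat"]

records = [r for feats, tags in photos for r in map_photo(feats, tags, vocabulary)]
by_tag = shuffle(records)  # each value is the training set for one classifier
```

Each entry of `by_tag` would then be handed to a single reducer, which keeps the per-tag training sets independent of each other.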
![Page 16: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/16.jpg)
Real-time Prediction & Training User Experience
Real-time Learning of Newly Uploaded Photos
![Page 17: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/17.jpg)
Design Patterns Enabled
1. Batch ML for scale
› Parallel model training (ex. 1000 models for ad campaigns)
› Distributed model training (ex. 1 model for all homepage content)
2. Real-time ML for speed
› Up-to-the-minute models (ex. fraud detection, breaking news)
3. Lambda architecture
› Scale + speedy learning (ex. photo autotags)
› Enabled by “Parameter Server on Grid”
![Page 18: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/18.jpg)
1a. ML in Hadoop Reducers
§ Basic Requirements
› 100’s - 1000’s models
› Training data for each model fits on a single machine
§ Solution: 1 reducer per model
› hadoop jar hadoop-streaming.jar -Dmapreduce.job.reduces=$num_models -reducer "vw --passes 20 --cache_file …"
› hadoop jar lib/hadoop-streaming.jar -D mapreduce.job.reduces=$num_models -reducer "svm_train_reducer.py …"
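To make the "1 reducer per model" pattern concrete, here is a rough Python sketch of what a streaming reducer could do: read key-sorted records, group them by model id, and train one small model per group. It is not the `svm_train_reducer.py` from the slide. The tab-separated record format, the function names and the plain logistic-regression trainer are all assumptions chosen to keep the example self-contained.

```python
import math
from itertools import groupby


def parse(line):
    """Parse one record: model_id <TAB> label <TAB> comma-separated features."""
    model_id, label, feats = line.rstrip("\n").split("\t")
    return model_id, int(label), [float(x) for x in feats.split(",")]


def train_logistic(examples, passes=20, lr=0.5):
    """Train a tiny logistic regression with SGD over in-memory examples."""
    dim = len(examples[0][1])
    w = [0.0] * dim
    for _ in range(passes):
        for label, x in examples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                w[i] += lr * (label - p) * x[i]
    return w


def reduce_stream(lines):
    """Yield (model_id, weights) for each run of lines sharing a model id.

    Hadoop streaming delivers reducer input sorted by key, so groupby suffices.
    In a real job `lines` would be sys.stdin.
    """
    parsed = (parse(line) for line in lines)
    for model_id, group in groupby(parsed, key=lambda r: r[0]):
        examples = [(label, x) for _, label, x in group]
        yield model_id, train_logistic(examples)


lines = [
    "cat\t0\t1.0,0.0",
    "cat\t1\t0.0,1.0",
    "dog\t1\t1.0,0.0",
    "dog\t0\t0.0,1.0",
]
models = dict(reduce_stream(lines))
```

Because each group is trained in memory, the pattern only works under the slide's stated requirement that one model's training data fits on one machine.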
![Page 19: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/19.jpg)
1b. ML in Hadoop Mappers
§ Basic Requirements
› Small # of models to be trained
› Training data are too large to be loaded into a single machine
§ Solution: Mappers + MPI AllReduce
1. spanning_tree
2. hadoop jar hadoop-streaming.jar -Dmapreduce.job.maps=$num_mappers -input $training_data -output $model_loc -mapper "runvw.sh $model_location $span_server $num_mappers" -reducer NONE
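The AllReduce step is what lets many mappers build a single model: each worker computes a gradient on its own data shard, the gradients are summed across workers, and every worker receives the same total. The Python sketch below simulates this in one process for a least-squares model. It does not use MPI or the `spanning_tree` server; `all_reduce`, `local_gradient` and `train` are hypothetical names for illustration only.

```python
def local_gradient(w, shard):
    """Sum of squared-error gradients over one worker's shard."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return grad


def all_reduce(vectors):
    """Sum the vectors element-wise and hand every worker the same total."""
    total = [sum(col) for col in zip(*vectors)]
    return [list(total) for _ in vectors]


def train(shards, dim, steps=500, lr=0.1):
    """Run synchronous gradient descent with one weight replica per worker."""
    n = sum(len(s) for s in shards)
    weights = [[0.0] * dim for _ in shards]
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        summed = all_reduce(grads)
        for w, g in zip(weights, summed):
            for i in range(dim):
                w[i] -= lr * g[i] / n
    return weights


# Data generated from y = 2*x1 - 1*x2, split across two simulated workers.
shards = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)],
]
replicas = train(shards, dim=2)
```

Since every replica applies the same summed gradient, all workers hold identical weights after each step, so any one of them can write out the final model.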
![Page 20: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/20.jpg)
1c. Spark Native ML
§ Spark based
› Yahoo E-Commerce: 30 LOC Spark program for collaborative filtering
§ Spark’s MLlib
› Binary classification, Linear regression, Collaborative filtering, Clustering, Decision Trees etc.
§ 3rd-party ML libs
› Ex. Alpine Data Labs’ Random Forest
![Page 21: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/21.jpg)
1d. Approximate Computing
§ Observations
› A large-scale ML job uses 100’s of processes to train models for hours.
› Some learner processes will get stuck or fail due to hardware issues (ex. disk, network etc.)
› Existing ML algorithms will then hang or fail.
§ Partial Reducer
› Enables a trade-off b/w speed and accuracy
› Tolerates failures of a % of learner processes
for (i <- 1 to ITERATIONS) {
  val gradient = points.pipe(learner_cmd)
    .partialReducer(reduceFunc, 0.99, timeout)
  w -= gradient
}
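The partial-reducer idea can be sketched outside Spark as follows: wait up to a timeout for the learners, reduce whatever results arrived without error, and give up only if fewer than a required fraction finished. This Python version is an illustration under assumptions, not the `partialReducer` API from the slide. The function name, the argument order, and the reading of `0.99` as a minimum completion fraction are all guesses.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from functools import reduce


def partial_reduce(tasks, reduce_func, min_fraction, timeout):
    """Reduce the results that arrive in time; fail if too few learners did."""
    pool = ThreadPoolExecutor(max_workers=len(tasks))
    futures = [pool.submit(task) for task in tasks]
    done, _not_done = wait(futures, timeout=timeout)
    # Do not block on stragglers. Note that Python still joins worker threads
    # at interpreter exit, so truly stuck learners need process-level isolation.
    pool.shutdown(wait=False)
    results = [f.result() for f in done if f.exception() is None]
    if len(results) < min_fraction * len(tasks):
        raise RuntimeError("too few learners finished")
    return reduce(reduce_func, results)


def failing_learner():
    """Stand-in for a learner that hits a hardware fault."""
    raise IOError("disk failure")


tasks = [lambda: 1, lambda: 2, failing_learner, lambda: 3]
total = partial_reduce(tasks, lambda a, b: a + b, min_fraction=0.75, timeout=5.0)
```

The reduce function should be commutative and associative (as a gradient sum is), because the order in which finished results are combined is not fixed.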
![Page 22: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/22.jpg)
2. Realtime Training in Storm Bolts
§ Basic Requirements
› Freshness of ML model is critical
§ Sample Solution
public class TrainingBolt extends BaseBasicBolt {
  Model model;

  // Load the native learner once per bolt instance.
  public void prepare(Map conf, TopologyContext ctx) {
    System.loadLibrary("VW");
    model = VW.init(conf);
  }

  // Learn from each incoming tuple; periodically emit the updated model.
  public void execute(Tuple input, BasicOutputCollector collector) {
    Instance example = (Instance) input.getValue(0);
    model.learn(example);
    if (Time since last export) collector.emit(new Values(model));
  }
}
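The bolt above boils down to two steps per tuple: update the model with one example, then export a snapshot if enough time has passed. The Python sketch below mirrors that loop without Storm or VW. `OnlineTrainer`, its logistic-regression update and the injectable `clock` are hypothetical stand-ins, used here so the timing logic can be shown deterministically.

```python
import math
import time


class OnlineTrainer:
    """Online logistic regression that exports a snapshot at a fixed interval."""

    def __init__(self, dim, export_interval_s, lr=0.1, clock=time.monotonic):
        self.w = [0.0] * dim
        self.lr = lr
        self.export_interval_s = export_interval_s
        self.clock = clock
        self.last_export = clock()

    def learn(self, label, x):
        """Apply one SGD step for a single (label, features) example."""
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for i, xi in enumerate(x):
            self.w[i] += self.lr * (label - p) * xi

    def execute(self, label, x):
        """Learn from one example; return a model copy when an export is due."""
        self.learn(label, x)
        now = self.clock()
        if now - self.last_export >= self.export_interval_s:
            self.last_export = now
            return list(self.w)
        return None


# A fake clock makes the export timing reproducible in this example.
ticks = iter([0.0, 1.0, 5.0])
trainer = OnlineTrainer(dim=2, export_interval_s=5.0, clock=lambda: next(ticks))
first = trainer.execute(1, [1.0, 0.0])   # 1 s since start: no export yet
second = trainer.execute(0, [0.0, 1.0])  # 5 s since start: snapshot returned
```

Returning a copy of the weights keeps later updates from mutating a snapshot that downstream consumers may still be reading.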
![Page 23: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/23.jpg)
3a. Hybrid Learning
§ Basic Requirements
› Bootstrap models via batch learning from large datasets
› Update models via realtime learning from latest events
§ Sample Solution
› ML in Hadoop + Storm
› ML in Spark + Storm
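As a small illustration of the hybrid pattern, the sketch below bootstraps a model from a historical dataset in one batch pass and then keeps updating it from individual events. A count-based click-through-rate model is used only because it is the simplest model that supports both steps. It is an assumed example, not a model the talk describes, and `batch_bootstrap` and `HybridCtrModel` are hypothetical names.

```python
from collections import Counter


def batch_bootstrap(history):
    """Batch step: aggregate (clicks, views) per item from historical events."""
    clicks, views = Counter(), Counter()
    for item, clicked in history:
        views[item] += 1
        clicks[item] += int(clicked)
    return clicks, views


class HybridCtrModel:
    """CTR model seeded by a batch job and updated from a real-time stream."""

    def __init__(self, clicks, views):
        self.clicks = Counter(clicks)
        self.views = Counter(views)

    def update(self, item, clicked):
        """Real-time step: fold one fresh event into the model."""
        self.views[item] += 1
        self.clicks[item] += int(clicked)

    def ctr(self, item):
        """Estimated click-through rate, or 0.0 for an unseen item."""
        views = self.views[item]
        return self.clicks[item] / views if views else 0.0


history = [("a", True), ("a", False), ("b", False)]
model = HybridCtrModel(*batch_bootstrap(history))
before = model.ctr("a")   # estimate from the batch data alone
model.update("a", True)   # latest events arrive one at a time
model.update("a", True)
after = model.ctr("a")
```

In the deployments named on the slide, the batch step would run in Hadoop or Spark and the update step in a Storm bolt.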
![Page 24: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/24.jpg)
3b. Parameter Server on Grid
• billions of features per model
• millions of operations per second
• enables asynchronous learning
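The slide gives no design details, so the following is only a generic sketch of the parameter-server idea: a shared store of sparse weights that workers `pull` from and `push` gradient updates to, without waiting for each other. The class, its two methods and the single in-process lock are assumptions for illustration. A grid deployment would shard the keys across many server processes.

```python
import threading
from collections import defaultdict


class ParameterServer:
    """In-process stand-in for a sharded store of sparse model weights."""

    def __init__(self, lr=0.5):
        self._weights = defaultdict(float)
        self._lock = threading.Lock()
        self.lr = lr

    def pull(self, keys):
        """Return the current weights for the requested feature keys."""
        with self._lock:
            return {k: self._weights[k] for k in keys}

    def push(self, grads):
        """Apply a sparse gradient update as soon as it arrives."""
        with self._lock:
            for k, g in grads.items():
                self._weights[k] -= self.lr * g


def worker(server, examples):
    """Asynchronous learner: pull weights, compute a gradient, push it back."""
    for features, target in examples:
        w = server.pull(features.keys())
        err = sum(w[k] * v for k, v in features.items()) - target
        server.push({k: err * v for k, v in features.items()})


server = ParameterServer()
shard_a = [({"a": 1.0}, 1.0)] * 50  # this worker only touches feature "a"
shard_b = [({"b": 1.0}, 2.0)] * 50  # this worker only touches feature "b"
threads = [threading.Thread(target=worker, args=(server, s)) for s in (shard_a, shard_b)]
for t in threads:
    t.start()
for t in threads:
    t.join()
learned = server.pull(["a", "b"])
```

Workers touch only the keys present in their own examples, which is what makes it plausible to serve models with billions of features when each example is sparse.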
![Page 25: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/25.jpg)
Summary
[Diagram: platform stack, top to bottom]
§ Applications
› Search Ranking
› Photo/Video Services
› Online Ads
› Personalization
› Abuse Detection
§ Machine Learning Libraries
› Logistic Regression
› Deep Learning
› Unsupervised Learning
› Decision Trees
› …
§ Computing Engines
§ Hadoop YARN: Resource Manager
§ Hadoop Storage: File System and NoSQL
![Page 26: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/26.jpg)
Committed to Apache Open Source
8 Committers (6 PMCs) | Apache - 80
5 Committers (3 PMCs) | Apache - 18
3 Committers (2 PMCs) | Apache - 21
5 Committers (5 PMCs) | Apache - 17
3 Committers | Apache - 32
7 Committers (6 PMCs) | Apache - 33
![Page 27: Andy Feng, Distinguished Architect, Yahoo at MLconf SF](https://reader033.vdocuments.net/reader033/viewer/2022052601/559446ee1a28ab0f0d8b457f/html5/thumbnails/27.jpg)
§ Big-Data Blog … http://yahoohadoop.tumblr.com
§ Hiring … http://careers.yahoo.com
Thanks!