Making Machine Learning Scale: Single Machine and Distributed
TRANSCRIPT
Scalable Machine Learning: Single Machine to Distributed
Yucheng Low, Chief Architect
What is ML scalability?
Is this scalability?
[Bar chart: Algorithm Implementation X improves from 1600s to 800s, 400s, 200s as resources are added, while the Best Single Machine Implementation runs in 300s]
True Scalability
How long does it take to get to a predetermined accuracy?
Not about: how well you can implement Algorithm X.
Understand the tradeoffs between different algorithms.
It is not about: Scaling Up vs. Scaling Out.
It is about: going as fast as you can, on any hardware.
• Assume bounded resources
• Optimize for data scalability
The Dato Way
• Scales excellently
• Requires fewer machines to solve in the same runtime as other systems
Single Machine Scalability: Storage Hierarchy

Capacity | Throughput
0.1 TB (RAM) | ~1-10 GB/s
1 TB (SSD) | ~1 GB/s
10 TB (disk) | ~0.1 GB/s

Random access is very slow!
Good External Memory Datastructures For ML
SFrame: Scalable Tabular Data Manipulation
SGraph: Scalable Graph Manipulation
Data is usually rows… (user | movie | rating)
But data engineering is typically column transformations…

Feature engineering is columnar
Normalize the feature x:
sf['rating'] = sf['rating'] / sf['rating'].sum()

Create a new feature:
sf['rating-squared'] = sf['rating'].apply(lambda rating: rating * rating)

Create a new dataset with 2 of the features:
sf2 = sf[['rating', 'rating-squared']]
SFrame
• Rich datatypes
• Strong schema types: int, double, string, image, ...
• Weak schema types: list, dictionary (can contain arbitrary JSON)
• Columnar architecture
• Easy feature engineering + vectorized feature operations
• Lazy evaluation
• Statistics + sketches
• Type-aware compression
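The "lazy evaluation + vectorized feature operations" bullets can be sketched with a toy lazily evaluated column. This is illustrative only; `LazyColumn` is a hypothetical class, not the SFrame implementation:

```python
# Toy lazily evaluated column: operations build a pipeline; nothing
# executes until the column is actually read (materialized).
class LazyColumn:
    def __init__(self, source, ops=()):
        self.source = source  # underlying data (disk pages, in practice)
        self.ops = ops        # pending elementwise operations

    def apply(self, fn):
        # Record the operation; do no work yet.
        return LazyColumn(self.source, self.ops + (fn,))

    def materialize(self):
        # Run the whole pipeline in sequential passes (fusable).
        out = list(self.source)
        for fn in self.ops:
            out = [fn(x) for x in out]
        return out

col = LazyColumn([1, 2, 3]).apply(lambda x: x * x).apply(lambda x: x + 1)
# Nothing has been computed yet; col.materialize() runs the pipeline.
```

Deferring execution this way lets a columnar store plan sequential scans and fuse operations instead of materializing every intermediate column.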
Scalable Out-Of-Core Table Representation
Netflix Dataset: 99M rows, 3 columns, ints.
1.4GB raw; 289MB gzip compressed; 160MB as an SFrame.
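As a rough illustration of type-aware compression on integer columns, here is a toy delta + variable-length (varint) codec. This is an assumption-laden sketch of the general technique, not the actual SFrame codec:

```python
# Toy type-aware integer compression: delta-encode neighboring values,
# zigzag-map deltas to unsigned ints, then emit 7 bits per byte with a
# continuation bit. Small deltas compress to single bytes.
def delta_varint_encode(values):
    out = bytearray()
    prev = 0
    for v in values:
        d = v - prev
        prev = v
        u = (d << 1) ^ (d >> 63)      # zigzag: small |d| -> small unsigned
        while True:
            byte = u & 0x7F
            u >>= 7
            if u:
                out.append(byte | 0x80)  # continuation bit set
            else:
                out.append(byte)
                break
    return bytes(out)

def delta_varint_decode(data):
    values, prev, i = [], 0, 0
    while i < len(data):
        u, shift = 0, 0
        while True:
            b = data[i]; i += 1
            u |= (b & 0x7F) << shift
            shift += 7
            if not (b & 0x80):
                break
        prev += (u >> 1) ^ -(u & 1)   # un-zigzag, undo delta
        values.append(prev)
    return values
```

On a column of consecutive ints, every delta fits in one byte, an 8:1 reduction versus raw 64-bit storage.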
Out of Core Machine Learning
Rethink all ML algorithms:
• Random access → sequential-only access
• Sampling → sort/shuffle
Understand the statistical/convergence impacts of ML algorithm variations.
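The "Sampling → Sort/Shuffle" substitution can be sketched as: attach a random key to each row in one sequential pass, sort by the key (an external sort when out of core), then read the result back sequentially. A toy in-memory version:

```python
# Toy shuffle that avoids random access: one sequential pass to attach
# random keys, one sort (external, in practice), one sequential read.
import random

def shuffled_scan(rows, seed=0):
    rng = random.Random(seed)
    keyed = [(rng.random(), row) for row in rows]  # sequential pass
    keyed.sort()                                   # external sort out-of-core
    for _, row in keyed:                           # sequential read-back
        yield row
```

Every pass touches the data in order, so the same pattern works when the rows live on disk rather than in a Python list.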
Single Machine Scaling
GraphLab-Create (1 Node)
MLlib 1.3 (5 Node)
MLlib 1.3 (1 Node)
Scikit-Learn
[Bar chart: runtime in seconds, 0-2500]
Dataset Source: LIBLinear binary classification datasets. KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed. Task: predict student performance on math problems based on interactions with a tutoring system.
Single Machine Scaling
GraphLab-Create (1 Node)
BIDMach (1 GPU Node)
[Bar chart: runtime in seconds, 0-900]
Criteo Kaggle: Click Prediction
46M rows, 34M sparse coefficients
Not a compute-bound task
Graphs encode the relationships between people, facts, products, interests, and ideas: in social media, advertising, science, and the web.
• Big: trillions of vertices and edges and rich metadata
• Facebook (10/2012): 1B users, 144B friendships
• Twitter (2011): 15B follower edges
SGraph
1. Immutable disk-backed graph representation (append-only).
2. Vertex / edge attributes.
3. Optimized for bulk access ("get the neighborhood of 5 million vertices"), not fine-grained queries ("get the neighborhood of 1 vertex").
Standard Graph Representations

Edge List: easy to insert, difficult to query.
src  dest
1    102
132  10
48   999
129  192
998  23
392  124

Sparse Matrix / Sorted Edge List: fast to query, but difficult to insert (e.g., adding edges 102→103 or 349→13 requires random writes).
src  dest
1    10
1    99
1    102
1    105
2    5
2    10
2    120
SGraph Layout

Vertices partitioned into p = 4 Vertex SFrames.
Edges partitioned into p^2 = 16 Edge SFrames, one per (source partition, destination partition) pair: (1,1), (1,2), …, (4,4).

Vertex SFrame:
__id   Address  ZipCode
Alice  …        98105
Bob    …        98102

Vertex SFrame (another partition):
__id   Address  ZipCode
John   …        98105
Jack   …        98102

Edge SFrame:
__src_id  __dst_id  Message
Alice     Bob       "hello"
Bob       Charlie   "world"
Charlie   Alice     "moof"
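One plausible way to map an edge to its Edge SFrame is hash partitioning of the endpoints. This is an assumed scheme for illustration; the actual SGraph placement logic may differ:

```python
# Assumed hash-partitioning sketch: with p vertex partitions, an edge
# (src, dst) is stored in the Edge SFrame indexed by
# (partition of src, partition of dst) -- one of p * p files.
P = 4

def vertex_partition(vertex_id, p=P):
    # Which Vertex SFrame holds this vertex.
    return hash(vertex_id) % p

def edge_partition(src, dst, p=P):
    # Which of the p * p Edge SFrames holds this edge.
    return (vertex_partition(src, p), vertex_partition(dst, p))
```

Under a scheme like this, all out-edges of one vertex partition live in a single row of the grid, which is one reason bulk neighborhood access can become a sequential scan of a few files.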
Common Crawl Graph
3.5 billion Nodes and 128 billion Edges. Largest available public Graph.
2 TB raw → 200GB (compression factor 10:1, 12.5 bits per edge)
Benefit From SFrame Compression Methods
Common Crawl Graph
1x r3.8xlarge using 1x SSD.
3.5 billion Nodes and 128 billion Edges
PageRank: 9 min per iteration. Connected Components: ~1 hr.
There isn’t any general purpose library out there capable of this.
SFrame & SGraph
BSD License (August)
Distributed
Train on bigger datasets
Train Faster
Speedup relative to the best single machine implementation.

Extending Single Machine to Distributed
[Diagram: one machine scanning the data, time for 1 pass = 100s; two machines each scanning half the data from parallel disks, time for 1 pass = 50s]
Good External Memory Datastructures For ML Still Help
Distributed Optimization
Newton, L-BFGS, FISTA, etc.
Repeat:
1. Parallel sweep over the data (make sure this is embarrassingly parallel).
2. Synchronize parameters (talk quickly).
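The sweep/synchronize loop can be sketched for one gradient step on a squared-error objective 0.5 * ||Xw - y||^2. This is a minimal sketch: the shards stand in for per-machine data, and the list comprehension stands in for the parallel sweep:

```python
# Minimal sketch of "parallel sweep + synchronize" for one gradient step.
import numpy as np

def partial_gradient(w, X_shard, y_shard):
    # Gradient contribution of one shard (embarrassingly parallel).
    return X_shard.T @ (X_shard @ w - y_shard)

def one_pass(w, shards, lr=0.01):
    # "Parallel" sweep over the data (sequential here for clarity).
    grads = [partial_gradient(w, X, y) for X, y in shards]
    # Synchronize parameters: combine partial gradients, update once.
    return w - lr * sum(grads)
```

Because the per-shard gradients simply sum, the sweep is embarrassingly parallel and the synchronization is a single allreduce-style combine of one vector per machine.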
Distributed Optimization
1. Data begins on HDFS.
2. Every machine takes part of the data to local disk/SSD.
3. Inter-machine communication by fast supercomputer-style primitives.
Criteo Terabyte Click Logs
Click Prediction Task: Whether visitor clicked on a link or not.
Criteo Terabyte Click Prediction
4.4 billion rows, 13 features
½ TB of data
[Chart: runtime (s) vs. #machines, 0-16, near-linear speedup: 3630s on 1 machine down to 225s on 16 machines]
3630s
Distributed Graphs
Graph Partitioning: Minimizing Communication
Communication is linear in the number of machines each vertex spans.
Vertex-Cut: placing edges on machines, and letting vertices span machines.
Graph Partitioning
Tradeoff: time to compute a partition vs. quality of the partition.
Since large natural graphs are difficult to partition anyway: how good a partition quality can we get while doing almost no work at all?
Random Partitioning
Randomly assign edges to machines.
[Diagram: vertices replicated across Machine 1, Machine 2, Machine 3]
But this is probably the worst partition you can construct. Can we do better?
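The quantity being minimized here, the average number of machines each vertex spans (the replication factor), can be measured for a random edge partition with a small sketch (`replication_factor` is a hypothetical helper, for illustration):

```python
# Toy measurement of the replication factor of a random edge partition:
# the average number of machines each vertex spans, which is exactly
# the quantity a vertex-cut tries to keep small (communication is
# linear in it).
import random
from collections import defaultdict

def replication_factor(edges, num_machines, seed=0):
    rng = random.Random(seed)
    spans = defaultdict(set)             # vertex -> set of machines
    for src, dst in edges:
        m = rng.randrange(num_machines)  # random edge placement
        spans[src].add(m)
        spans[dst].add(m)
    return sum(len(s) for s in spans.values()) / len(spans)
```

With random placement a high-degree vertex tends to span nearly every machine, which is why smarter edge placement can do much better.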
SGraph Partitioning
[Diagram: the 4 × 4 grid of Edge SFrames (1,1)…(4,4), grouped into blocks and assigned to machines]
Slides from a couple of years ago
Distributed Graphs
New graph partitioning ideas. Mixed in-core / out-of-core computation.
Common Crawl Graph
[Chart: runtime (s) per iteration vs. #machines, 0-16]
16 machines (c3.8xlarge, 512 vCPUs): 45 sec per iteration, ~3B edges per second.
3.5 billion Nodes and 128 billion Edges.
In Search of Performance
Understand the memory access patterns of algorithms, single machine and distributed: Sequential? Random?
Optimize datastructures for access patterns.
It is not merely about speed, or scaling. It is about doing more with what you already have.
Excess Slides
Our Tools Are Easy To Use
import graphlab as gl
train_data = gl.SFrame.read_csv(traindata_path)
train_data['1grams'] = gl.text_analytics.count_ngrams(train_data['text'], 1)
train_data['2grams'] = gl.text_analytics.count_ngrams(train_data['text'], 2)
cls = gl.classifier.create(train_data, target='sentiment')

5-line sentiment analysis
But: you have preexisting code in Numpy, Scipy, Scikit-learn.
Automatic Numpy Scaling
Automatic in-memory, type aware compression using SFrame type compression technology.
import graphlab.numpy
(prints "Scalable numpy activation successful")
Scales all numeric numpy arrays to datasets much larger than memory. Works with scipy, sklearn.
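The disk-backed-array half of this idea can be sketched with numpy's built-in `memmap`. graphlab.numpy additionally layers type-aware compression on top; this sketch shows only the larger-than-memory part:

```python
# Sketch of a disk-backed numpy array using numpy's own memmap.
# Reads and writes stream to/from the file, so the array can be far
# larger than RAM; sequential scans (e.g. column sums) stay fast.
import numpy as np
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "big.dat")
arr = np.memmap(path, dtype="float64", mode="w+", shape=(1000, 8))
arr[:] = 1.0                  # backed by the file, not resident RAM
col_sum = arr.sum(axis=0)     # one sequential pass over the file
```

Code that only needs numpy's array interface (including much of scipy and sklearn) can often run on such an array unchanged, which is the hook that makes transparent scaling possible.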
Demo
[Chart: Airline Delay Dataset, runtime (s) vs. millions of rows (0-400), scikit-learn SGD linear classifier: Numpy vs. GraphLab + numpy]
Caveats apply:
- Sequential access highly preferred.
- Scales most memory-bound sklearn algorithms by at least 2x, some by more.
Deep Learning Throughput (GPU)
[Bar chart: images per second, 0-30000, comparing H2O (4 node), H2O (16 node), H2O (63 node), and GraphLab Create GPU]
Dataset Source: MNIST, 60K examples, 784 dimensions. Source(s): H2O Deep Learning Benchmarks using a 4-layer architecture.