Making Machine Learning Scale: Single Machine and Distributed


TRANSCRIPT

Page 1: Making Machine Learning Scale: Single Machine and Distributed

Scalable Machine Learning: Single Machine to Distributed

Yucheng Low, Chief Architect

Page 2: Making Machine Learning Scale: Single Machine and Distributed

What is ML scalability?

Page 3: Making Machine Learning Scale: Single Machine and Distributed

Is this scalability?

[Bar chart: Algorithm Implementation X, runtime dropping from 1600s to 800s, 400s, 200s as machines are added. Best Single Machine Implementation: 300s.]

Page 4: Making Machine Learning Scale: Single Machine and Distributed

True Scalability

How long does it take to get to a predetermined accuracy?

Not about: how well you can implement Algorithm X.
About: understanding the tradeoffs between different algorithms.

Page 5: Making Machine Learning Scale: Single Machine and Distributed

It is not about:

Scaling Up vs. Scaling Out

Page 6: Making Machine Learning Scale: Single Machine and Distributed

It is about:

Going as fast as you can, on any hardware, whether scaling up or scaling out.

Page 7: Making Machine Learning Scale: Single Machine and Distributed

The Dato Way

• Assume bounded resources
• Optimize for data scalability
• Scales excellently
• Requires fewer machines to solve in the same runtime as other systems

Page 8: Making Machine Learning Scale: Single Machine and Distributed

Single Machine Scalability: Storage Hierarchy

Capacity   Throughput
~0.1 TB    ~1-10 GB/s   (memory)
~1 TB      ~1 GB/s      (SSD)
~10 TB     ~0.1 GB/s    (disk)

Random access is very slow!

Good External Memory Datastructures For ML
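To make the sequential-vs-random gap concrete, here is a minimal timing sketch (stdlib only; the file path is hypothetical and absolute numbers vary by hardware):

import os, random, time

PATH = "data.bin"   # hypothetical large file, ideally bigger than RAM
BLOCK = 4096        # read in 4 KB blocks

size = os.path.getsize(PATH)
nblocks = size // BLOCK
nreads = min(nblocks, 100000)   # cap the read count; random reads are slow

t0 = time.time()
with open(PATH, "rb") as f:     # sequential: read consecutive blocks
    for _ in range(nreads):
        f.read(BLOCK)
seq = time.time() - t0

t0 = time.time()
with open(PATH, "rb") as f:     # random: same read count, random offsets
    for _ in range(nreads):
        f.seek(random.randrange(nblocks) * BLOCK)
        f.read(BLOCK)
rnd = time.time() - t0

print("sequential: %.2fs   random: %.2fs" % (seq, rnd))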

Page 9: Making Machine Learning Scale: Single Machine and Distributed

SFrame: Scalable Tabular Data Manipulation

SGraph: Scalable Graph Manipulation

Page 10: Making Machine Learning Scale: Single Machine and Distributed

Data is usually stored as rows (user, movie, rating)…

But data engineering is typically column transformations…

Page 11: Making Machine Learning Scale: Single Machine and Distributed

Feature engineering is columnar

Normalize the feature:
sf['rating'] = sf['rating'] / sf['rating'].sum()

Create a new feature:
sf['rating-squared'] = sf['rating'].apply(lambda rating: rating * rating)

Create a new dataset with 2 of the features:
sf2 = sf[['rating', 'rating-squared']]
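Put together, a minimal end-to-end sketch of the operations above (GraphLab Create API as shown on the slide; the toy data is made up):

import graphlab as gl

# Toy data; in practice the SFrame is loaded from disk and can exceed RAM.
sf = gl.SFrame({'user':   ['u1', 'u2', 'u3'],
                'movie':  ['m1', 'm2', 'm1'],
                'rating': [4.0, 2.0, 5.0]})

sf['rating'] = sf['rating'] / sf['rating'].sum()            # normalize (columnar op)
sf['rating-squared'] = sf['rating'].apply(lambda r: r * r)  # derived feature
sf2 = sf[['rating', 'rating-squared']]                      # project two columns
print(sf2)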

Page 12: Making Machine Learning Scale: Single Machine and Distributed

SFrame

• Rich datatypes
• Strong schema types: int, double, string, image, ...
• Weak schema types: list, dictionary (can contain arbitrary JSON)
• Columnar architecture
• Easy feature engineering + vectorized feature operations
• Lazy evaluation
• Statistics + sketches
• Type-aware compression

Scalable out-of-core table representation.

Netflix dataset (99M rows, 3 columns, ints): 1.4GB raw; 289MB gzip compressed; 160MB as an SFrame.

Page 13: Making Machine Learning Scale: Single Machine and Distributed

Out of Core Machine Learning

Rethink all ML algorithms:

Random access → sequential only
Sampling? → sort/shuffle

Understand the statistical/convergence impacts of ML algorithm variations.
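As one example of such a rethink: a uniform shuffle can be done with only sequential I/O. A minimal sketch (hypothetical file names; assumes each of the K buckets fits in RAM):

import random

K = 16  # number of temporary buckets; each bucket must fit in memory

# Pass 1 (sequential read, sequential appends): scatter rows randomly.
buckets = [open("bucket_%d.tmp" % i, "w") for i in range(K)]
with open("data.txt") as f:
    for line in f:
        random.choice(buckets).write(line)
for b in buckets:
    b.close()

# Pass 2 (sequential write): shuffle each bucket in memory, concatenate.
with open("shuffled.txt", "w") as out:
    for i in range(K):
        with open("bucket_%d.tmp" % i) as b:
            lines = b.readlines()
        random.shuffle(lines)
        out.writelines(lines)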

Page 14: Making Machine Learning Scale: Single Machine and Distributed

Single Machine Scaling

[Bar chart, runtime in seconds (0-2500): GraphLab-Create (1 node), MLlib 1.3 (5 nodes), MLlib 1.3 (1 node), Scikit-Learn.]

Dataset source: LIBLINEAR binary classification datasets. KDD Cup data: 8.4M data points, 20M features, 2.4GB compressed. Task: predict student performance on math problems based on interactions with a tutoring system.

Page 15: Making Machine Learning Scale: Single Machine and Distributed

Single Machine Scaling

[Bar chart, runtime in seconds (0-900): GraphLab-Create (1 node) vs. BIDMach (1 GPU node).]

Criteo Kaggle: click prediction. 46M rows, 34M sparse coefficients. Not a compute-bound task.

Page 16: Making Machine Learning Scale: Single Machine and Distributed

Graphs encode the relationships between people, facts, products, interests, and ideas, across social media, advertising, science, and the web:

• Big: trillions of vertices and edges and rich metadata
• Facebook (10/2012): 1B users, 144B friendships
• Twitter (2011): 15B follower edges

Page 17: Making Machine Learning Scale: Single Machine and Distributed

SGraph

1. Immutable disk-backed graph representation (append only).
2. Vertex / edge attributes.
3. Optimized for bulk access, not fine-grained queries.

Get neighborhood of [5 million vertices]: good fit.
Get neighborhood of 1 vertex: poor fit.
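A sketch of what the bulk-vs-fine-grained contrast looks like in code (API names as in GraphLab Create; the saved-graph path is hypothetical):

import graphlab as gl

g = gl.load_sgraph("my_graph")          # hypothetical path to a saved SGraph

ids = list(g.vertices['__id'])

# Bulk access (what SGraph is optimized for): one query, millions of ids.
sub = g.get_neighborhood(ids=ids)

# Fine-grained access (slow by design): one query per vertex.
one = g.get_neighborhood(ids=[ids[0]])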

Page 18: Making Machine Learning Scale: Single Machine and Distributed

Standard Graph Representations

Edge list: easy to insert, difficult to query.

src   dest
1     102
132   10
48    999
129   192
998   23
392   124

Sparse matrix / sorted edge list: difficult to insert (random writes), fast to query.

src   dest
1     10
1     99
1     102
1     105
2     5
2     10
2     120
102   103
349   13
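To see why the sorted layout queries fast while the plain edge list does not, a small stdlib sketch:

from bisect import bisect_left, bisect_right

unsorted_edges = [(1, 102), (132, 10), (48, 999), (129, 192), (998, 23), (392, 124)]
sorted_edges = sorted(unsorted_edges)        # sorted by src, then dest
srcs = [s for s, _ in sorted_edges]          # stored src column (precomputed)

def out_edges_scan(edges, v):
    # Unsorted edge list: every query scans all edges, O(|E|).
    return [e for e in edges if e[0] == v]

def out_edges_sorted(edges, src_col, v):
    # Sorted edge list: binary search, O(log |E| + degree).
    lo, hi = bisect_left(src_col, v), bisect_right(src_col, v)
    return edges[lo:hi]

print(out_edges_scan(unsorted_edges, 132))
print(out_edges_sorted(sorted_edges, srcs, 132))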

Page 19: Making Machine Learning Scale: Single Machine and Distributed

SGraph Layout

Vertices partitioned into p = 4 vertex SFrames.

__id    Address   ZipCode
Alice   …         98105
Bob     …         98102

Page 20: Making Machine Learning Scale: Single Machine and Distributed

SGraph Layout

Edges partitioned into p^2 = 16 edge SFrames: cell (i, j) of the 4x4 grid holds edges whose source is in vertex partition i and whose destination is in vertex partition j.

Vertex SFrame:

__id   Address   ZipCode
John   …         98105
Jack   …         98102

Edge SFrame:

__src_id   __dst_id   Message
Alice      Bob        "hello"
Bob        Charlie    "world"
Charlie    Alice      "moof"

[Diagram: the 4 vertex SFrames alongside the 4x4 grid of edge SFrames (1,1) … (4,4).]
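A minimal sketch of such a bucketing rule (the hash-based assignment is an illustration, not necessarily the actual SGraph placement function):

P = 4  # number of vertex partitions

def vertex_partition(vid, p=P):
    # Which of the p vertex SFrames holds vertex vid.
    # Note: Python salts string hashes per process; fine for illustration.
    return hash(vid) % p

def edge_partition(src, dst, p=P):
    # Cell (i, j) of the p x p grid holds edges whose source lives in
    # vertex partition i and whose destination lives in partition j.
    return (vertex_partition(src, p), vertex_partition(dst, p))

print(edge_partition("Alice", "Bob"))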


Page 23: Making Machine Learning Scale: Single Machine and Distributed

Common Crawl Graph

3.5 billion nodes and 128 billion edges. The largest available public graph.

2 TB raw; 200GB as an SGraph. Compression factor 10:1 (12.5 bits per edge).

Benefits from SFrame compression methods.


Page 25: Making Machine Learning Scale: Single Machine and Distributed

Common Crawl Graph

3.5 billion nodes and 128 billion edges, on 1x r3.8xlarge using 1x SSD.

PageRank: 9 min per iteration. Connected components: ~1 hr.

There isn't any general purpose library out there capable of this.
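For scale, a sketch of what running PageRank looks like with the GraphLab Create API (the saved-graph path is hypothetical):

import graphlab as gl

g = gl.load_sgraph("common_crawl")             # hypothetical path to a saved SGraph
pr = gl.pagerank.create(g, max_iterations=10)  # ~9 min per iteration at this scale
print(pr['pagerank'].topk('pagerank', k=10))   # highest-ranked vertices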

Page 26: Making Machine Learning Scale: Single Machine and Distributed

SFrame & SGraph

BSD License (August)

Page 27: Making Machine Learning Scale: Single Machine and Distributed

Distributed

Page 28: Making Machine Learning Scale: Single Machine and Distributed

Train on bigger datasets

Train Faster

Speedup Relative to Best Single Machine Implementation

Page 29: Making Machine Learning Scale: Single Machine and Distributed

Extending Single Machine to Distributed

[Diagram: one machine streaming data (X, Y) from a single disk.]

Time for 1 pass = 100s.

Page 30: Making Machine Learning Scale: Single Machine and Distributed

Extending Single Machine to Distributed

[Diagram: two machines, each streaming half the data (X, Y) from its own disk.]

Parallel disks: time for 1 pass = 50s.

Good external memory datastructures for ML still help.

Page 31: Making Machine Learning Scale: Single Machine and Distributed

Distributed Optimization

Newton, LBFGS, FISTA, etc. Repeat:

1. Parallel sweep over the data (make sure this is embarrassingly parallel).
2. Synchronize parameters (talk quickly).

Page 32: Making Machine Learning Scale: Single Machine and Distributed

Distributed Optimization

1. Data begins on HDFS.

2. Every machine takes part of the data to local disk/SSD.

3. Inter-machine communication by fast supercomputer-style primitives.
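The slide does not name the primitives, but an allreduce is the canonical one. A minimal sketch with mpi4py (the local gradient is a placeholder for the parallel sweep):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Placeholder: each machine computes a gradient over its local data shard.
local_grad = np.random.rand(100)

# Supercomputer-style primitive: every machine ends up with the global sum.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)

# All machines now hold the same summed gradient and can take the same
# parameter step (e.g. an LBFGS update), keeping parameters in sync.

Run with, e.g.: mpiexec -n 4 python sweep.py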

Page 33: Making Machine Learning Scale: Single Machine and Distributed

Criteo Terabyte Click Logs

Click prediction task: whether a visitor clicked on a link or not.

Page 34: Making Machine Learning Scale: Single Machine and Distributed

Criteo Terabyte Click Prediction

4.4 billion rows, 13 features; ~1/2 TB of data.

[Chart: runtime vs. #machines (1-16), near-linear speedup: 3630s on 1 machine down to 225s on 16 machines.]

Page 35: Making Machine Learning Scale: Single Machine and Distributed

Distributed Graphs

Page 36: Making Machine Learning Scale: Single Machine and Distributed

Graph Partitioning: Minimizing Communication

Communication is linear in the number of machines each vertex spans. [Diagram: vertex Y spanning three machines.]

Vertex-cut: place edges on machines, and let vertices span machines.
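For intuition, a minimal simulation (toy random graph) of how many machines each vertex spans under a purely random vertex-cut:

import random
from collections import defaultdict

P = 8                      # number of machines
V, E = 1000, 20000         # toy graph size

edges = [(random.randrange(V), random.randrange(V)) for _ in range(E)]

machines = defaultdict(set)     # vertex -> set of machines it spans
for src, dst in edges:
    m = random.randrange(P)     # random vertex-cut: place each edge anywhere
    machines[src].add(m)
    machines[dst].add(m)

rep = sum(len(s) for s in machines.values()) / float(len(machines))
print("average machines spanned per vertex: %.2f" % rep)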

Page 37: Making Machine Learning Scale: Single Machine and Distributed

Graph Partitioning

Communication minimization is a tradeoff: time to compute a partition vs. quality of the partition.

Page 38: Making Machine Learning Scale: Single Machine and Distributed

Graph Partitioning

Since large natural graphs are difficult to partition anyway: how good a partition quality can we get while doing almost no work at all?

Page 39: Making Machine Learning Scale: Single Machine and Distributed

Random Partitioning

Randomly assign edges to machines. [Diagram: three machines, with vertices Y and Z replicated across all of them.]

But this is probably the worst partition you can construct. Can we do better?

Page 40: Making Machine Learning Scale: Single Machine and Distributed

SGraph Partitioning

[Diagram: the 4x4 grid of edge SFrames (1,1) … (4,4), grouped into 2x2 blocks for assignment to machines.]

Page 41: Making Machine Learning Scale: Single Machine and Distributed

Slides from a couple of years ago

Page 42: Making Machine Learning Scale: Single Machine and Distributed

Distributed Graphs

New graph partitioning ideas. Mixed in-core / out-of-core computation.

Page 43: Making Machine Learning Scale: Single Machine and Distributed

Common Crawl Graph

[Chart: PageRank runtime per iteration vs. #machines (1-16).]

16 machines (c3.8xlarge, 512 vCPUs): 45 sec per iteration, i.e. ~3B edges processed per second.

3.5 billion nodes and 128 billion edges.

Page 44: Making Machine Learning Scale: Single Machine and Distributed

In Search of Performance

Understand the memory access patterns of algorithms, single machine and distributed: sequential? random?

Optimize datastructures for the access patterns.

Page 45: Making Machine Learning Scale: Single Machine and Distributed

It is not merely about speed, or scaling

Doing more with what you already have

Page 47: Making Machine Learning Scale: Single Machine and Distributed

Excess Slides

Page 48: Making Machine Learning Scale: Single Machine and Distributed

Our Tools Are Easy To Use

import graphlab as gl

train_data = gl.SFrame.read_csv(traindata_path)

train_data['1grams'] = gl.text_analytics.count_ngrams(train_data['text'], 1)
train_data['2grams'] = gl.text_analytics.count_ngrams(train_data['text'], 2)

cls = gl.classifier.create(train_data, target='sentiment')

5-line sentiment analysis.

But: you have preexisting code in NumPy, SciPy, and scikit-learn.

Page 49: Making Machine Learning Scale: Single Machine and Distributed

Automatic Numpy Scaling

Automatic in-memory, type-aware compression using SFrame type compression technology.

import graphlab.numpy
(prints: Scalable numpy activation successful)

Scales all numeric numpy arrays to datasets much larger than memory. Works with scipy and sklearn.

Demo

Page 50: Making Machine Learning Scale: Single Machine and Distributed

Scikit-learn SGDClassifier

[Chart: Airline Delay Dataset, runtime in seconds (0-4000) vs. millions of rows (0-400), comparing NumPy vs. GraphLab + NumPy.]

Page 51: Making Machine Learning Scale: Single Machine and Distributed

Automatic Numpy Scaling

Automatic in-memory, type-aware compression using SFrame type compression technology. Scales all numeric numpy arrays to datasets much larger than memory; works with scipy and sklearn.

Caveats apply:

- Sequential access highly preferred.
- Scales most memory-bound sklearn algorithms by at least 2x, some by more.
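A sketch of the intended workflow (the import line is from the slide; whether arrays loaded afterwards are transparently wrapped is an assumption, and the file names are hypothetical):

import graphlab.numpy        # slide output: "Scalable numpy activation successful"
import numpy as np
from sklearn.linear_model import SGDClassifier

# Assumption: after activation, the numeric arrays below are transparently
# backed by compressed, out-of-core storage (hypothetical files).
X = np.load("features.npy")
y = np.load("labels.npy")

clf = SGDClassifier()
clf.fit(X, y)   # a memory-bound sklearn algorithm, now on data larger than RAM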

Page 52: Making Machine Learning Scale: Single Machine and Distributed

Deep Learning Throughput (GPU)

[Bar chart, images per second (0-30000): H2O (4 nodes), H2O (16 nodes), H2O (63 nodes), GraphLab Create (GPU).]

Dataset source: MNIST, 60K examples, 784 dimensions. Source(s): H2O Deep Learning benchmarks using a 4-layer architecture.