
Page 1

Data-Intensive Distributed Computing

Part 6: Data Mining (1/4)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

CS 451/651 431/631 (Winter 2018)

Jimmy Lin
David R. Cheriton School of Computer Science

University of Waterloo

February 27, 2018

These slides are available at http://lintool.github.io/bigdata-2018w/

Page 2

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text
Analyzing Graphs
Analyzing Relational Data
Data Mining

Page 3

Descriptive vs. Predictive Analytics

Page 4

[Diagram: several applications, each with a frontend, a backend, and its own OLTP database, serving users and external APIs; ETL (Extract, Transform, and Load) moves data from the OLTP databases into a Data Warehouse / "Data Lake", which data scientists query with "traditional" BI tools, SQL on Hadoop, and other tools.]

Page 5

Supervised Machine Learning

The generic problem of function induction given sample instances of input and output

Classification: output draws from finite discrete labels

Regression: output is a continuous value

This is not meant to be an exhaustive treatment of machine learning!
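To make the distinction concrete, here is a minimal Scala sketch with toy decision rules (not from the slides): classification returns a discrete label, regression a continuous value.

type Features = Array[Double]

// Classification: the induced function maps features to one of a finite set of labels.
val classify: Features => String =
  x => if (x.sum >= 0.0) "spam" else "not spam"

// Regression: the induced function maps features to a continuous value.
val predict: Features => Double =
  x => 0.5 * x.sum + 1.0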

Page 6

Source: Wikipedia (Sorting)

Classification

Page 7

Applications

Spam detection

Sentiment analysis

Content (e.g., topic) classification

Link prediction

Document ranking

Object recognition

And much much more!

Fraud detection

Page 8

Supervised Machine Learning

[Diagram: at training time, labeled examples are fed to a Machine Learning Algorithm, which produces a Model; at testing/deployment time, the Model is applied to new examples to predict their unknown labels ("?").]

Page 9

Feature Representations

Objects are represented in terms of features:
“Dense” features: sender IP, timestamp, # of recipients, length of message, etc.
“Sparse” features: contains the term “viagra” in message, contains “URGENT” in subject, etc.
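A minimal Scala sketch of the two representations, using hypothetical feature names from the spam example (illustrative only):

// "Dense" features: every message has a value for every slot.
case class DenseFeatures(
  senderIp: String,     // sender IP
  timestamp: Long,      // arrival time
  numRecipients: Int,   // # of recipients
  messageLength: Int    // length of message
)

// "Sparse" features: store only the indicators that fire.
type SparseFeatures = Map[String, Double]

val suspicious: SparseFeatures = Map(
  "body:viagra" -> 1.0,    // contains the term "viagra" in the message
  "subject:URGENT" -> 1.0  // contains "URGENT" in the subject
)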

Page 10

Applications

Spam detection

Sentiment analysis

Content (e.g., genre) classification

Link prediction

Document ranking

Object recognition

And much much more!

Fraud detection

Page 11

Components of a ML Solution

Data
Features
Model
Optimization

Page 12

(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)

No data like more data!

Page 13

Limits of Supervised Classification?

Why is this a big data problem?
Isn't gathering labels a serious bottleneck?

Solutions:
Crowdsourcing
Bootstrapping, semi-supervised techniques
Exploiting user behavior logs

The virtuous cycle of data-driven products

Page 14

Virtuous Product Cycle

[Cycle diagram: a useful service → analyze user behavior to extract insights (data science) → transform insights into action (data products) → $ (hopefully) → back into the service. Google. Facebook. Twitter. Amazon. Uber.]

Page 15

What’s the deal with neural networks?

Data
Features
Model
Optimization

Page 16

Supervised Binary Classification

Restrict the output label to be binary:
Yes/No
1/0

Binary classifiers form primitive building blocks for multi-class problems…

Page 17

Binary Classifiers as Building Blocks

Example: four-way classification

One vs. rest classifiers:
A or not? B or not? C or not? D or not? (applied in parallel)

Classifier cascades:
A or not? B or not? C or not? D or not? (applied in sequence)
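A sketch of how the two combinations might look in Scala, assuming each binary classifier exposes a confidence score (the scorers themselves are hypothetical):

type Features = Array[Double]

// One binary classifier per class, each answering "X or not?" with a score.
case class BinaryScorer(label: String, score: Features => Double)

// One vs. rest: run all classifiers and take the most confident label.
def oneVsRest(scorers: Seq[BinaryScorer], x: Features): String =
  scorers.maxBy(_.score(x)).label

// Classifier cascade: apply classifiers in order, stopping at the first that fires.
def cascade(scorers: Seq[BinaryScorer], x: Features, threshold: Double): Option[String] =
  scorers.collectFirst { case s if s.score(x) >= threshold => s.label }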

Page 18

The Task

Given: $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{x}_i = [x_1, x_2, x_3, \ldots, x_d]$ is a (sparse) feature vector and $y_i \in \{0, 1\}$ is the label.

Induce: $f : X \to Y$ such that the loss is minimized:

$\frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i), y_i)$

where $\ell$ is the loss function.

Typically, we consider functions of a parametric form:

$\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), y_i)$

where $\theta$ denotes the model parameters.
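As a sketch of the objective, here is the average loss for one concrete and purely illustrative choice of model and loss function, a linear $f$ and squared loss:

type Features = Array[Double]

// A parametric model f(x; theta): here, a simple linear function.
def f(x: Features, theta: Array[Double]): Double =
  x.zip(theta).map { case (xi, ti) => xi * ti }.sum

// One possible loss function: squared loss.
def loss(prediction: Double, y: Double): Double =
  (prediction - y) * (prediction - y)

// The quantity the argmin ranges over: average loss on the training set.
def averageLoss(data: Seq[(Features, Double)], theta: Array[Double]): Double =
  data.map { case (x, y) => loss(f(x, theta), y) }.sum / data.size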

Page 19

Key insight: machine learning as an optimization problem! (Closed-form solutions are generally not possible.)

Page 20

Gradient Descent: Preliminaries

Rewrite:
$\arg\min_{\theta} L(\theta) = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i; \theta), y_i)$

Compute the gradient, which "points" in the fastest increasing "direction":
$\nabla L(\theta) = \left[ \frac{\partial L(\theta)}{\partial w_0}, \frac{\partial L(\theta)}{\partial w_1}, \ldots, \frac{\partial L(\theta)}{\partial w_d} \right]$

So, at any point:*
$b = a - \gamma \nabla L(a)$
$L(a) \geq L(b)$

* caveats apply

Page 21

Gradient Descent: Iterative Update

Start at an arbitrary point, iteratively update:
$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla L(\theta^{(t)})$

We have:
$L(\theta^{(0)}) \geq L(\theta^{(1)}) \geq L(\theta^{(2)}) \geq \ldots$

Page 22

Intuition behind the math…

$\underbrace{\theta^{(t+1)}}_{\text{new weights}} \leftarrow \underbrace{\theta^{(t)}}_{\text{old weights}} - \underbrace{\gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(\mathbf{x}_i; \theta^{(t)}), y_i)}_{\text{update based on gradient}}$

[Figure: a one-dimensional loss curve $\ell(x)$ with its gradient $\nabla\ell = \frac{d\ell}{dx}$, illustrating how the gradient determines the direction of the update.]

Page 23

Gradient Descent: Iterative Update

Start at an arbitrary point, iteratively update:
$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla L(\theta^{(t)})$

We have:
$L(\theta^{(0)}) \geq L(\theta^{(1)}) \geq L(\theta^{(2)}) \geq \ldots$

Lots of details:
Figuring out the step size
Getting stuck in local minima
Convergence rate

Page 24

Gradient Descent

Repeat until convergence:
$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(\mathbf{x}_i; \theta^{(t)}), y_i)$

Note: sometimes formulated as ascent, but entirely equivalent.
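A compact Scala sketch of this loop, using the same illustrative linear model and squared loss as before and a fixed step size (a sketch under those assumptions, not a production implementation):

type Features = Array[Double]

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Gradient of the squared loss (theta . x - y)^2 with respect to theta.
def gradLoss(x: Features, y: Double, theta: Array[Double]): Array[Double] =
  x.map(_ * 2.0 * (dot(theta, x) - y))

def gradientDescent(data: Seq[(Features, Double)],
                    init: Array[Double],
                    stepSize: Double,
                    iterations: Int): Array[Double] = {
  var theta = init
  for (_ <- 1 to iterations) {
    // Average the per-example gradients over the whole training set.
    val grad = data
      .map { case (x, y) => gradLoss(x, y, theta) }
      .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      .map(_ / data.size)
    // theta(t+1) <- theta(t) - stepSize * averaged gradient
    theta = theta.zip(grad).map { case (t, g) => t - stepSize * g }
  }
  theta
}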

Page 25

Gradient Descent

Source: Wikipedia (Hills)

$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(\mathbf{x}_i; \theta^{(t)}), y_i)$

Page 26

Even More Details…

Gradient descent is a "first order" optimization technique
Often, slow convergence

Newton and quasi-Newton methods:
Intuition: Taylor expansion
$f(x + \Delta x) = f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2$
Requires the Hessian (the square matrix of second-order partial derivatives): impractical to fully compute
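For intuition, a one-dimensional Scala sketch of the Newton step the expansion suggests, on a toy function whose derivatives we can write down; in higher dimensions the division by $f''(x)$ becomes multiplication by the inverse Hessian, which is what becomes impractical.

// Toy objective f(x) = (x - 3)^2 + 1, with known first and second derivatives.
def fPrime(x: Double): Double = 2.0 * (x - 3.0)
def fSecond(x: Double): Double = 2.0

// Newton update: x <- x - f'(x) / f''(x).
def newton(x0: Double, iterations: Int): Double =
  (1 to iterations).foldLeft(x0)((x, _) => x - fPrime(x) / fSecond(x))

// For this quadratic, a single step lands on the minimizer x = 3.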

Page 27

Source: Wikipedia (Hammer)

Logistic Regression

Page 28

Logistic Regression: Preliminaries

Given:
$D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, $\mathbf{x}_i = [x_1, x_2, x_3, \ldots, x_d]$, $y \in \{0, 1\}$

Define:
$f(\mathbf{x}; \mathbf{w}) : \mathbb{R}^d \to \{0, 1\}$
$f(\mathbf{x}; \mathbf{w}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} \geq t \\ 0 & \text{if } \mathbf{w} \cdot \mathbf{x} < t \end{cases}$

Interpretation:
$\ln \left[ \dfrac{\Pr(y = 1 \mid \mathbf{x})}{\Pr(y = 0 \mid \mathbf{x})} \right] = \mathbf{w} \cdot \mathbf{x}$
$\ln \left[ \dfrac{\Pr(y = 1 \mid \mathbf{x})}{1 - \Pr(y = 1 \mid \mathbf{x})} \right] = \mathbf{w} \cdot \mathbf{x}$

Page 29

Relation to the Logistic Function

After some algebra:

$\Pr(y = 1 \mid \mathbf{x}) = \dfrac{e^{\mathbf{w} \cdot \mathbf{x}}}{1 + e^{\mathbf{w} \cdot \mathbf{x}}}$

$\Pr(y = 0 \mid \mathbf{x}) = \dfrac{1}{1 + e^{\mathbf{w} \cdot \mathbf{x}}}$

The logistic function:
$f(z) = \dfrac{e^z}{e^z + 1}$

[Plot: logistic(z) versus z for z from -8 to 8; the curve rises smoothly from 0 to 1.]
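In code, the logistic function and the class-1 probability it induces (a small Scala sketch; `dot` is a helper defined here, not from the slides):

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// logistic(z) = e^z / (e^z + 1), equivalently 1 / (1 + e^{-z}).
def logistic(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Pr(y = 1 | x) = logistic(w . x)
def probY1(w: Array[Double], x: Array[Double]): Double = logistic(dot(w, x))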

Page 30

Training an LR Classifier

Maximize the conditional likelihood:
$\arg\max_{\mathbf{w}} \prod_{i=1}^{n} \Pr(y_i \mid \mathbf{x}_i, \mathbf{w})$

Define the objective in terms of conditional log likelihood:
$L(\mathbf{w}) = \sum_{i=1}^{n} \ln \Pr(y_i \mid \mathbf{x}_i, \mathbf{w})$

We know: $y \in \{0, 1\}$

So: $\Pr(y \mid \mathbf{x}, \mathbf{w}) = \Pr(y = 1 \mid \mathbf{x}, \mathbf{w})^{y} \, \Pr(y = 0 \mid \mathbf{x}, \mathbf{w})^{(1-y)}$

Substituting:
$L(\mathbf{w}) = \sum_{i=1}^{n} \left( y_i \ln \Pr(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) + (1 - y_i) \ln \Pr(y_i = 0 \mid \mathbf{x}_i, \mathbf{w}) \right)$

Page 31

LR Classifier Update Rule

Take the derivative:
$L(\mathbf{w}) = \sum_{i=1}^{n} \left( y_i \ln \Pr(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) + (1 - y_i) \ln \Pr(y_i = 0 \mid \mathbf{x}_i, \mathbf{w}) \right)$
$\frac{\partial}{\partial \mathbf{w}} L(\mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - \Pr(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) \right)$

General form of update rule:
$\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} + \gamma^{(t)} \nabla_{\mathbf{w}} L(\mathbf{w}^{(t)})$
where $\nabla L(\mathbf{w}) = \left[ \frac{\partial L(\mathbf{w})}{\partial w_0}, \frac{\partial L(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial L(\mathbf{w})}{\partial w_d} \right]$

Final update rule:
$w_i^{(t+1)} \leftarrow w_i^{(t)} + \gamma^{(t)} \sum_{j=1}^{n} x_{j,i} \left( y_j - \Pr(y_j = 1 \mid \mathbf{x}_j, \mathbf{w}^{(t)}) \right)$
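A sketch of one batch update in Scala (gradient ascent on the log likelihood with a fixed step size; `dot` and `logistic` as in the earlier sketch):

def dot(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

def logistic(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// One iteration of w_i <- w_i + gamma * sum_j x_{j,i} * (y_j - Pr(y_j = 1 | x_j, w)).
def lrUpdate(data: Seq[(Array[Double], Double)],  // (feature vector, label in {0, 1})
             w: Array[Double],
             gamma: Double): Array[Double] = {
  val gradient = data
    .map { case (x, y) => x.map(_ * (y - logistic(dot(w, x)))) }
    .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  w.zip(gradient).map { case (wi, gi) => wi + gamma * gi }
}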

Page 32

Lots more details…

Regularization
Different loss functions

Want more details? Take a real machine-learning course!

Page 33

MapReduce Implementation

$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(\mathbf{x}_i; \theta^{(t)}), y_i)$

[Diagram: mappers each compute a partial gradient over their split of the data; a single reducer sums the partial gradients and updates the model; iterate until convergence.]
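A sketch of the division of labor the diagram implies, written as plain Scala functions rather than actual Hadoop mapper/reducer classes (the signatures are illustrative, not the Hadoop API):

// Each mapper scans its split and emits one partial gradient
// (all under a single key, so everything flows to the single reducer).
def mapPartition(split: Seq[(Array[Double], Double)],
                 theta: Array[Double],
                 gradLoss: (Array[Double], Double, Array[Double]) => Array[Double]): Array[Double] =
  split
    .map { case (x, y) => gradLoss(x, y, theta) }
    .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })

// The single reducer sums the partial gradients and updates the model;
// the driver then launches another job with the new theta, until convergence.
def reducePartials(partials: Seq[Array[Double]],
                   theta: Array[Double],
                   gamma: Double,
                   n: Int): Array[Double] = {
  val total = partials.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  theta.zip(total.map(_ / n)).map { case (t, g) => t - gamma * g }
}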

Page 34

Shortcomings

Hadoop is bad at iterative algorithms
High job startup costs
Awkward to retain state across iterations

High sensitivity to skew
Iteration speed bounded by slowest task

Potentially poor cluster utilization
Must shuffle all data to a single reducer

Some possible tradeoffs
Number of iterations vs. complexity of computation per iteration
E.g., L-BFGS: faster convergence, but more to compute

Page 35

// Load and cache the training points; parsePoint turns each input line into a
// point with feature vector p.x and label p.y.
val points = spark.textFile(...).map(parsePoint).persist()

var w = // random initial vector

for (i <- 1 to ITERATIONS) {
  // Each task computes per-point gradient contributions; reduce sums them.
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  // Update the model on the driver; the cached points are reused each iteration.
  w -= gradient
}

[Diagram: mappers compute partial gradients; a reducer aggregates them and updates the model.]

Spark Implementation

Page 36

Source: Wikipedia (Japanese rock garden)

Questions?