
CSE 446: Machine Learning
Linear classifiers: Scaling up learning via SGD
Emily Fox, University of Washington
January 27, 2017
©2017 Emily Fox

Stochastic gradient descent: Learning, one data point at a time

Stochastic gradient ascent

[Figure: data pipeline. Compute the gradient using only small subsets of the data, then update the coefficients: w(t) → w(t+1) → w(t+2) → w(t+3) → w(t+4) → …]

Many updates for each pass over the data.

Stochastic gradient ascent for logistic regression

init w(1) = 0, t = 1
until converged
  for i = 1,…,N            # each time, pick a different data point i
    for j = 0,…,D
      partial[j] = h_j(x_i) ( 1[y_i = +1] - P(y = +1 | x_i, w(t)) )
      w_j(t+1) ← w_j(t) + η partial[j]
    t ← t + 1

Unlike the batch version, which sums these contributions over all data points, each update here uses the single data point i.
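To make the loop concrete, here is a minimal NumPy sketch of the same update (an illustration, not the course's reference code). It assumes the features h(x) are given as a matrix X whose first column is the constant feature h_0(x) = 1, labels y in {+1, -1}, and a fixed step size eta; shuffling each pass is one common way to "pick a different data point each time."

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, eta=0.1, n_passes=10):
    # X: (N, D+1) feature matrix, first column all ones; y: labels in {+1, -1}.
    N, D1 = X.shape
    w = np.zeros(D1)                            # init w(1) = 0
    for _ in range(n_passes):                   # "until converged" (fixed passes here)
        for i in np.random.permutation(N):      # pick a different data point each time
            p = sigmoid(X[i] @ w)               # P(y = +1 | x_i, w(t))
            partial = X[i] * ((y[i] == 1) - p)  # contribution of point i to the gradient
            w = w + eta * partial               # w(t+1) <- w(t) + eta * partial
    return w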

Why would stochastic gradient ever work???

Gradient is direction of steepest ascent

Gradient is the "best" direction, but any direction that goes "up" would be useful.

In ML, steepest direction is sum of "little directions" from each data point

The full gradient is a sum over data points. For most data points, the contribution points "up."
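In symbols: writing \(\ell_i(\mathbf{w})\) for the log-likelihood contribution of data point i,

\[
\ell(\mathbf{w}) = \sum_{i=1}^{N} \ell_i(\mathbf{w})
\quad\Longrightarrow\quad
\nabla \ell(\mathbf{w}) = \sum_{i=1}^{N} \nabla \ell_i(\mathbf{w}),
\qquad
\mathbb{E}_i\!\left[ N\, \nabla \ell_i(\mathbf{w}) \right] = \nabla \ell(\mathbf{w})
\ \text{ for } i \sim \mathrm{Uniform}\{1,\dots,N\},
\]

so a single data point's direction is, on average (up to the factor N), the full gradient.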

Stochastic gradient: Pick a data point and move in its direction

Most of the time, the total likelihood will increase.

Stochastic gradient ascent: Most iterations increase the likelihood, but some decrease it. On average, it makes progress.

until converged
  for i = 1,…,N
    for j = 0,…,D
      w_j(t+1) ← w_j(t) + η partial[j]    (partial[j] from the single data point i, as above)
    t ← t + 1

Convergence paths

[Figure: likelihood contours comparing the convergence paths of gradient (a smooth path) and stochastic gradient (a noisy path).]

Stochastic gradient convergence is "noisy"

[Figure: avg. log likelihood vs. # passes over data ("better" = higher).]

Gradient usually increases the likelihood smoothly. Stochastic gradient makes "noisy" progress. Total time is proportional to # passes over data: stochastic gradient achieves higher likelihood sooner, but it's noisier.

Eventually, gradient catches up

[Figure: avg. log likelihood vs. # passes over data ("better" = higher); the smooth gradient curve eventually overtakes the noisy stochastic gradient curve.]

Note: should only trust the "average" quality of stochastic gradient.

The last coefficients may be really good or really bad!!

E.g., w(1000) was bad, while w(1005) was good. How do we minimize the risk of picking bad coefficients? Stochastic gradient will eventually oscillate around a solution. To minimize the noise, don't return the last learned coefficients; output the average:

  ŵ = (1/T) Σ_{t=1}^{T} w(t)
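A minimal sketch of that averaging on top of the earlier SGD loop (same assumptions: constant-feature column in X, labels in {+1, -1}):

import numpy as np

def sgd_logistic_averaged(X, y, eta=0.1, n_passes=10):
    # Same stochastic updates as before, but return the average of all iterates.
    N, D1 = X.shape
    w = np.zeros(D1)
    w_sum = np.zeros(D1)                        # running sum of the iterates w(t)
    T = 0
    for _ in range(n_passes):
        for i in np.random.permutation(N):
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w)))
            w = w + eta * X[i] * ((y[i] == 1) - p)
            w_sum += w                          # accumulate iterate w(t)
            T += 1
    return w_sum / T                            # w_hat = (1/T) * sum_t w(t)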

Summary of why stochastic gradient works

• Gradient finds the direction of steepest ascent
• Gradient is a sum of contributions from each data point
• Stochastic gradient uses the direction from a single data point
• On average, it increases the likelihood, though it sometimes decreases it
• Stochastic gradient has "noisy" convergence

Online learning: Fitting models from streaming data

Batch vs online learning

Batch learning
• All data is available at start of training time
  [Figure: Data → ML algorithm → ŵ]

Online learning
• Data arrives (streams in) over time
  - Must train model as data arrives!
  [Figure: at each time t = 1, 2, 3, 4, …, newly arrived data feeds into the ML algorithm, which outputs updated coefficients ŵ(1), ŵ(2), ŵ(3), ŵ(4), …]

Online learning example: Ad targeting

[Figure: a website sends input x_t (user info, page text) to the ML algorithm, whose current coefficients are ŵ(t). The algorithm outputs ŷ = suggested ads (Ad1, Ad2, Ad3). The user clicks on Ad2, so y_t = Ad2, and the coefficients are updated to ŵ(t+1).]

Online learning problem

• Data arrives over each time step t:
  - Observe input x_t
    • Info of user, text of webpage
  - Make a prediction ŷ_t
    • Which ad to show
  - Observe true output y_t
    • Which ad the user clicked on

Need an ML algorithm to update the coefficients each time step!

Stochastic gradient ascent can be used for online learning!!!

• init w(1) = 0, t = 1
• Each time step t:
  - Observe input x_t
  - Make a prediction ŷ_t
  - Observe true output y_t
  - Update coefficients:
      for j = 0,…,D
        w_j(t+1) ← w_j(t) + η partial[j]    (partial[j] computed from the new pair (x_t, y_t))
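A sketch of this loop for a streaming yes/no signal (e.g., clicked vs. not clicked); the generator interface is illustrative, and in a real system y_t only arrives after the prediction is served:

import numpy as np

def online_logistic(stream, dim, eta=0.1):
    # stream yields (x_t, y_t) pairs: x_t an array of length dim, y_t in {+1, -1}.
    w = np.zeros(dim)                            # init w(1) = 0
    for x_t, y_t in stream:
        p = 1.0 / (1.0 + np.exp(-(x_t @ w)))     # P(y = +1 | x_t, w(t))
        y_hat = +1 if p >= 0.5 else -1           # make the prediction y_hat_t
        # ... then observe the true output y_t and update the coefficients:
        w = w + eta * x_t * ((y_t == 1) - p)     # w(t+1)
        yield y_hat, w.copy()

# usage: for y_hat, w in online_logistic(my_stream, dim=3): ...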

Summary of online learning

• Data arrives over time
• Must make a prediction every time a new data point arrives
• Observe the true class after the prediction is made
• Want to update the parameters immediately

Summary of stochastic gradient descent

What you can do now…

• Significantly speed up a learning algorithm using stochastic gradient
• Describe the intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning

Decision Trees

Emily Fox, University of Washington
January 27, 2017

Predicting potential loan defaults

What makes a loan risky?

I want to buy a new house!

Loan Application
• Credit History ★★★★
• Income ★★★
• Term ★★★★★
• Personal Info ★★★

Credit history explained

Credit History: Did I pay previous loans on time?
Example: excellent, good, or fair

Income

Income: What's my income?
Example: $80K per year

Loan terms

Term: How soon do I need to pay the loan?
Example: 3 years, 5 years, …

Personal information

Personal Info: Age, reason for the loan, marital status, …
Example: Home loan for a married couple

Intelligent application

[Figure: loan applications flow into an intelligent loan application review system, which labels each application Safe ✓ or Risky ✘.]

Classifier review

Input x_i (loan application) → Classifier MODEL → Output ŷ (predicted class):
• Safe: ŷ_i = +1
• Risky: ŷ_i = -1

This module ... decision trees

Start
└─ Credit?
   ├─ excellent → Safe
   ├─ fair → Term?
   │    ├─ 3 years → Risky
   │    └─ 5 years → Safe
   └─ poor → Income?
        ├─ low → Risky
        └─ high → Term?
             ├─ 3 years → Risky
             └─ 5 years → Safe

Scoring a loan application

x_i = (Credit = poor, Income = high, Term = 5 years)

Traverse the tree above from Start: Credit = poor → Income?; Income = high → Term?; Term = 5 years → Safe.

ŷ_i = Safe
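One illustrative way to encode this tree and traverse it in code (the nested-dict representation is an assumption of this sketch, not the course's):

# Internal nodes map a feature name to {value: subtree}; leaves are class labels.
tree = {"Credit": {
    "excellent": "Safe",
    "fair": {"Term": {"3 years": "Risky", "5 years": "Safe"}},
    "poor": {"Income": {"low": "Risky",
                        "high": {"Term": {"3 years": "Risky",
                                          "5 years": "Safe"}}}},
}}

def score(tree, x):
    # Traverse from the root until a leaf (a plain label) is reached.
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))   # the feature this node splits on
        node = node[feature][x[feature]]
    return node

x_i = {"Credit": "poor", "Income": "high", "Term": "5 years"}
print(score(tree, x_i))              # -> Safe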

Decision tree learning task

Decision tree learning problem

Optimize a quality metric on training data.
Training data: N observations (x_i, y_i). The learned tree is T(X).

Credit     Term    Income   y
excellent  3 yrs   high     safe
fair       5 yrs   low      risky
fair       3 yrs   high     safe
poor       5 yrs   high     risky
excellent  3 yrs   low      risky
fair       5 yrs   low      safe
poor       3 yrs   high     risky
poor       5 yrs   low      safe
fair       3 yrs   high     safe

Quality metric: Classification error

• Error measures the fraction of mistakes
  - Best possible value: 0.0
  - Worst possible value: 1.0

Error = (# incorrect predictions) / (# examples)
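Equivalently, with 1[·] the indicator function and ŷ_i = T(x_i):

\[
\text{Error}(T) \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big[\, \hat{y}_i \neq y_i \,\big]
\]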

How do we find the best tree?

The exponentially large number of possible trees makes decision tree learning hard! [Figure: six example trees T1(X) … T6(X).]

Learning the smallest decision tree is an NP-hard problem [Hyafil & Rivest '76].

Greedy decision tree learning

Our training data table

Assume N = 40, 3 features.

Credit     Term    Income   y
excellent  3 yrs   high     safe
fair       5 yrs   low      risky
fair       3 yrs   high     safe
poor       5 yrs   high     risky
excellent  3 yrs   low      risky
fair       5 yrs   low      safe
poor       3 yrs   high     risky
poor       5 yrs   low      safe
fair       3 yrs   high     safe

Start with all the data

(all data) Loan status: Safe / Risky. N = 40 examples:
# of Safe loans: 22
# of Risky loans: 18

Compact visual notation: Root node

Root (22, 18): N = 40 examples; the two numbers are (# of Safe loans, # of Risky loans). The same pair notation is used for every node below.

Decision stump: Single level tree

Split on Credit:

Root (22, 18)
└─ Credit?
   ├─ excellent (9, 0)   subset of data with Credit = excellent
   ├─ fair (9, 4)        subset of data with Credit = fair
   └─ poor (4, 14)       subset of data with Credit = poor

Visual notation: Intermediate nodes

The nodes excellent (9, 0), fair (9, 4), and poor (4, 14) under the Credit? split are called intermediate nodes.

Making predictions with a decision stump

For each intermediate node, set ŷ = majority value:
• excellent (9, 0) → Safe
• fair (9, 4) → Safe
• poor (4, 14) → Risky

Selecting best feature to split on

How do we learn a decision stump?

Find the "best" feature to split on! (The Credit? stump above split Root (22, 18) into excellent (9, 0), fair (9, 4), poor (4, 14).)

How do we select the best feature?

Choice 1: Split on Credit
Root (22, 18) → Credit?: excellent (9, 0), fair (9, 4), poor (4, 14)

Choice 2: Split on Term
Root (22, 18) → Term?: 3 years (16, 4), 5 years (6, 14)

How do we measure effectiveness of a split?

Error = (# mistakes) / (# data points)

Idea: calculate the classification error of this decision stump (e.g., the Credit? split).

Calculating classification error

• Step 1: ŷ = class of majority of data in node
• Step 2: Calculate classification error of predicting ŷ for this data

At the Root (22, 18), ŷ = majority class = Safe: 22 correct, 18 mistakes.

Error = 18/40 = 0.45

Tree    Classification error
(root)  0.45

Choice 1: Split on Credit history?

Does a split on Credit reduce the classification error below 0.45?

Root (22, 18) → Credit?: excellent (9, 0), fair (9, 4), poor (4, 14)

Split on Credit: Classification error

Root (22, 18) → Credit?
• excellent (9, 0) → Safe: 0 mistakes
• fair (9, 4) → Safe: 4 mistakes
• poor (4, 14) → Risky: 4 mistakes

Error = (0 + 4 + 4)/40 = 0.2

Tree             Classification error
(root)           0.45
Split on credit  0.2

Choice 2: Split on Term?

Root (22, 18) → Term?: 3 years (16, 4) → Safe, 5 years (6, 14) → Risky

Evaluating the split on Term

Root (22, 18) → Term?
• 3 years (16, 4) → Safe: 4 mistakes
• 5 years (6, 14) → Risky: 6 mistakes

Error = (4 + 6)/40 = 0.25

Tree             Classification error
(root)           0.45
Split on credit  0.2
Split on term    0.25

Choice 1 vs Choice 2: Comparing split on Credit vs Term

Choice 1: Split on Credit: Root (22, 18) → excellent (9, 0), fair (9, 4), poor (4, 14)
Choice 2: Split on Term: Root (22, 18) → 3 years (16, 4), 5 years (6, 14)

Tree                Classification error
(root)              0.45
split on credit     0.2
split on loan term  0.25

Splitting on Credit wins: it gives the lowest classification error.

Feature split selection algorithm

• Given a subset of data M (a node in a tree)
• For each feature h_i(x):
  1. Split data of M according to feature h_i(x)
  2. Compute classification error of the split
• Choose the feature h*(x) with the lowest classification error (see the sketch below)
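A minimal sketch of this selection, assuming the data is a list of (x, y) pairs with x a dict of categorical features (the function names are mine):

from collections import Counter

def node_mistakes(labels):
    # Mistakes made by predicting the majority class in a node.
    return len(labels) - Counter(labels).most_common(1)[0][1]

def split_error(data, feature):
    # Classification error of a decision stump that splits on `feature`.
    groups = {}
    for x, y in data:
        groups.setdefault(x[feature], []).append(y)
    return sum(node_mistakes(ys) for ys in groups.values()) / len(data)

def best_feature(data, features):
    # Choose the feature h*(x) with the lowest classification error.
    return min(features, key=lambda f: split_error(data, f))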

Recursion & Stopping conditions

We've learned a decision stump, what next?

In the Credit? stump (excellent (9, 0), fair (9, 4), poor (4, 14)), all data points in the excellent node are Safe: nothing else to do with this subset of data, so it becomes a leaf node predicting Safe.

Tree learning = Recursive stump learning

The excellent (9, 0) node becomes a Safe leaf. Build a decision stump with the subset of data where Credit = fair, and another with the subset of data where Credit = poor.

Second level

• Credit = fair (9, 4): split on Term?, giving 3 years (0, 4) → Risky and 5 years (9, 0) → Safe.
• Credit = poor (4, 14): split on Income?, giving low (0, 9) → Risky and high (4, 5); build another stump for these data points.

Final decision tree

Root (22, 18)
└─ Credit?
   ├─ excellent (9, 0) → Safe
   ├─ fair (9, 4) → Term?
   │    ├─ 3 years (0, 4) → Risky
   │    └─ 5 years (9, 0) → Safe
   └─ poor (4, 14) → Income?
        ├─ low (0, 9) → Risky
        └─ high (4, 5) → Term?
             ├─ 3 years (0, 2) → Risky
             └─ 5 years (4, 3) → Safe

Simple greedy decision tree learning

• Pick best feature to split on
• Learn decision stump with this split
• For each leaf of the decision stump, recurse

When do we stop???

Stopping condition 1: All data agrees on y

In the final tree, the nodes excellent (9, 0), 3 years (0, 4), 5 years (9, 0), low (0, 9), and 3 years (0, 2) each contain data with the same y value. Nothing to do: they become leaves.

Stopping condition 2: Already split on all features

The node 5 years (4, 3) (reached via Credit = poor, Income = high, Term = 5 years) has already split on all possible features. Nothing to do: it becomes a leaf predicting the majority class, Safe.

Greedy decision tree learning: Recursion & stopping conditions 1 & 2

• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data (pick the feature split leading to the lowest classification error)
• For each split of the tree:
  - Step 3: If nothing more to do, make predictions
  - Step 4: Otherwise, go to Step 2 & continue (recurse) on this split (see the sketch below)
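Putting it together, a sketch of the recursion with stopping conditions 1 and 2, reusing best_feature from the earlier sketch (tie-breaking and representation are my choices, not the course's); it produces trees the score function above can traverse:

from collections import Counter

def learn_tree(data, features):
    # data: list of (x, y) pairs; features: feature names not yet split on.
    labels = [y for _, y in data]
    if len(set(labels)) == 1:        # stopping condition 1: all data agrees on y
        return labels[0]
    if not features:                 # stopping condition 2: already split on all features
        return Counter(labels).most_common(1)[0][0]   # predict the majority class
    f = best_feature(data, features)                  # lowest-error split (Step 2)
    subsets = {}
    for x, y in data:
        subsets.setdefault(x[f], []).append((x, y))
    rest = [g for g in features if g != f]
    return {f: {v: learn_tree(subset, rest)           # recurse on each split (Step 4)
                for v, subset in subsets.items()}}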

Is this a good idea?

Proposed stopping condition 3: Stop if no split reduces the classification error.

Stopping condition 3: Don't stop if error doesn't decrease???

Consider y = x[1] xor x[2]:

x[1]   x[2]   y
False  False  False
False  True   True
True   False  True
True   True   False

y values (True, False) at the Root: (2, 2), so predicting the majority class gives

Error = 2/4 = 0.5

Tree    Classification error
(root)  0.5

Consider split on x[1]

Root (2, 2) → x[1]?: True (1, 1), False (1, 1)

Error = (1 + 1)/(2 + 2) = 0.5

Tree           Classification error
(root)         0.5
Split on x[1]  0.5

Consider split on x[2]

Root (2, 2) → x[2]?: True (1, 1), False (1, 1)

Error = (1 + 1)/(2 + 2) = 0.5

Tree           Classification error
(root)         0.5
Split on x[1]  0.5
Split on x[2]  0.5

Neither feature improves the training error… Stop now???

Final tree with stopping condition 3

With stopping condition 3 we stop at the Root (2, 2) and predict True.

Tree                       Classification error
with stopping condition 3  0.5

Without stopping condition 3

Root (2, 2)
└─ x[1]?
   ├─ True (1, 1) → x[2]?
   │    ├─ True (0, 1) → False
   │    └─ False (1, 0) → True
   └─ False (1, 1) → x[2]?
        ├─ True (1, 0) → True
        └─ False (0, 1) → False

Every leaf is pure, so the tree classifies all four points correctly:

Tree                          Classification error
with stopping condition 3     0.5
without stopping condition 3  0.0

Condition 3 (stopping when training error doesn't improve) is not recommended!
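As a quick check using the learn_tree and score sketches from above (learn_tree has no stopping condition 3, so it keeps splitting even though neither first split reduces the error):

data = [({"x1": False, "x2": False}, False),
        ({"x1": False, "x2": True},  True),
        ({"x1": True,  "x2": False}, True),
        ({"x1": True,  "x2": True},  False)]   # y = x[1] xor x[2]

tree = learn_tree(data, ["x1", "x2"])
print(all(score(tree, x) == y for x, y in data))   # True: training error 0.0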

Decision tree learning: Real valued features

How do we use real-valued inputs?

Income   Credit     Term    y
$105K    excellent  3 yrs   Safe
$112K    good       5 yrs   Risky
$73K     fair       3 yrs   Safe
$69K     excellent  5 yrs   Safe
$217K    excellent  3 yrs   Risky
$120K    good       5 yrs   Safe
$64K     fair       3 yrs   Risky
$340K    excellent  5 yrs   Safe
$60K     good       3 yrs   Risky

Threshold split

Split on the feature Income:

Root (22, 18)
└─ Income?
   ├─ < $60K (8, 13)
   └─ >= $60K (14, 5)   subset of data with Income >= $60K

Finding the best threshold split

[Figure: Risky and Safe points on an Income number line from $10K to $120K, with a candidate threshold Income = t* separating Income < t* from Income >= t*.]

There are infinitely many possible values of t.

Consider a threshold between points

[Figure: the same Income number line, with consecutive data values vA and vB highlighted.]

Any threshold split between vA and vB gives the same classification error.

Only need to consider mid-points

[Figure: the same Income number line with the midpoints between consecutive data values marked.]

This leaves a finite number of splits to consider.

Threshold split selection algorithm

• Step 1: Sort the values of a feature h_j(x):
  let {v_1, v_2, v_3, …, v_N} denote the sorted values
• Step 2:
  - For i = 1, …, N-1:
    • Consider the split t_i = (v_i + v_{i+1}) / 2
    • Compute the classification error for the threshold split h_j(x) >= t_i
  - Choose the t* with the lowest classification error (see the sketch below)
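A direct sketch of this procedure (a single sorted pass would be faster, but this keeps the two steps visible; function and variable names are mine):

from collections import Counter

def best_threshold(values, labels):
    # values: the feature h_j(x_i) per data point; labels: the matching classes.
    pairs = sorted(zip(values, labels), key=lambda p: p[0])   # Step 1: sort values
    N = len(pairs)
    best_t, best_err = None, float("inf")
    for i in range(N - 1):                       # Step 2: consider midpoints
        v, v_next = pairs[i][0], pairs[i + 1][0]
        if v == v_next:
            continue                             # no boundary between equal values
        t = (v + v_next) / 2                     # t_i = (v_i + v_{i+1}) / 2
        left = [y for u, y in pairs if u < t]    # error of the split h_j(x) >= t
        right = [y for u, y in pairs if u >= t]
        mistakes = sum(len(side) - Counter(side).most_common(1)[0][1]
                       for side in (left, right))
        if mistakes / N < best_err:
            best_t, best_err = t, mistakes / N
    return best_t, best_err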

Visualizing the threshold split

[Figure: scatter plot of the data in the (Age, Income) plane, Age on the horizontal axis (0, 10, 20, 30, 40, …), Income on the vertical axis ($0K, $40K, $80K). The threshold split is the vertical line Age = 38.]

Split on Age >= 38

[Figure: the same scatter plot partitioned by the line Age = 38: predict Safe for age >= 38 and Risky for age < 38.]

Depth 2: Split on Income >= $60K

[Figure: within the Age >= 38 region, the second threshold split is the horizontal line Income = $60K.]

Each split partitions the 2-D space

[Figure: the (Age, Income) plane divided into three regions: Age < 38; Age >= 38 & Income >= $60K; Age >= 38 & Income < $60K.]

Summary of decision trees

What you can do now

• Define a decision tree classifier
• Interpret the output of a decision tree
• Learn a decision tree classifier using a greedy algorithm
• Traverse a decision tree to make predictions
  - Majority class predictions
  - Probability predictions
  - Multiclass classification