Machine Learning & Data Mining CS/CNS/EE 155 — Lecture 6: Boosting & Ensemble Selection


Page 1

Machine Learning & Data Mining CS/CNS/EE 155

Lecture 6: Boosting & Ensemble Selection

1

Page 2

Today

•  High Level Overview of Ensemble Methods

•  Boosting – Ensemble Method for Reducing Bias

•  Ensemble Selection

2

Page 3

Recall: Test Error

•  "True" distribution: P(x,y) – Unknown to us

•  Train: hS(x) = y – Using training data S = {(x_i, y_i)}_{i=1}^N – Sampled from P(x,y)

•  Test Error:  L_P(h_S) = E_{(x,y)~P(x,y)}[ L(y, h_S(x)) ]

•  Overfitting: Test Error >> Training Error

3

Page 4

Training Set S:

Person    Age   Male?   Height > 55"
Alice     14    0       1
Bob       10    1       1
Carol     13    0       1
Dave      8     1       0
Erin      11    0       0
Frank     9     1       1
Gena      8     0       0

True Distribution P(x,y):

Person    Age   Male?   Height > 55"
James     11    1       1
Jessica   14    0       1
Alice     14    0       1
Amy       12    0       1
Bob       10    1       1
Xavier    9     1       0
Cathy     9     0       1
Carol     13    0       1
Eugene    13    1       0
Rafael    12    1       1
Dave      8     1       0
Peter     9     1       0
Henry     13    1       0
Erin      11    0       0
Rose      7     0       0
Iain      8     1       1
Paulo     12    1       0
Margaret  10    0       1
Frank     9     1       1
Jill      13    0       0
Leon      10    1       0
Sarah     12    0       0
Gena      8     0       0
Patrick   5     1       1
…

Test Error:  L(h) = E_{(x,y)~P(x,y)}[ L(h(x), y) ]

4

Page 5

Recall: Test Error

•  Test Error:  L_P(h) = E_{(x,y)~P(x,y)}[ L(y, h(x)) ]

•  Treat hS as a random variable:  h_S = argmin_h Σ_{(x_i,y_i)∈S} L(y_i, h(x_i))

•  Expected Test Error:  E_S[ L_P(h_S) ] = E_S[ E_{(x,y)~P(x,y)}[ L(y, h_S(x)) ] ]   (aka test error of model class)

5

Page 6

Recall: Bias-Variance Decomposition

•  For squared error:

E_S[ L_P(h_S) ] = E_S[ E_{(x,y)~P(x,y)}[ L(y, h_S(x)) ] ]

E_S[ L_P(h_S) ] = E_{(x,y)~P(x,y)}[ E_S[ (h_S(x) − H(x))^2 ] + (H(x) − y)^2 ]
                        (Variance Term)               (Bias Term)

H(x) = E_S[ h_S(x) ]   "Average prediction on x"

6
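The decomposition above can be checked numerically by resampling training sets. The following is a minimal Python sketch (assumptions: a toy sine-plus-noise distribution for P(x,y), depth-1 regression stumps as the model class, and helper names like fit_stump are illustrative, not from the lecture); it estimates H(x) and the variance and bias terms by Monte Carlo, using the noise-free target in the bias term for simplicity.

import numpy as np

def sample_training_set(n, rng):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)   # noisy draw from a toy P(x,y)
    return x, y

def fit_stump(x, y):
    # Depth-1 regression stump: pick the split minimizing squared error.
    best_err, best = np.inf, None
    for s in np.linspace(0.05, 0.95, 19):
        left, right = y[x <= s], y[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_err, best = err, (s, left.mean(), right.mean())
    s, lo, hi = best
    return lambda q: np.where(q <= s, lo, hi)

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200)
preds = np.array([fit_stump(*sample_training_set(30, rng))(x_test) for _ in range(500)])

H = preds.mean(axis=0)                                    # H(x) = E_S[h_S(x)]
variance = ((preds - H) ** 2).mean()                      # E_S[(h_S(x) - H(x))^2], averaged over x
bias_sq = ((H - np.sin(2 * np.pi * x_test)) ** 2).mean()  # (H(x) - y)^2 with the noise-free target
print(f"variance ≈ {variance:.3f}, bias^2 ≈ {bias_sq:.3f}")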

Page 7

Recall: Bias-Variance Decomposition

[Figure: three example models, each panel showing its Variance and its Bias]

7

Page 8

Recall: Bias-Variance Decomposition

[Figure: the same three example models, each panel showing its Variance and its Bias]

8

Some models experience high test error due to high bias. (Model class too simple to make accurate predictions.)

Some models experience high test error due to high variance. (Model class unstable due to insufficient training data.)

Page 9

General Concept: Ensemble Methods

•  Combine multiple learning algorithms or models – Previous Lecture: Bagging & Random Forests – Today: Boosting & Ensemble Selection

•  "Meta Learning" approach – Does not innovate on the base learning algorithm/model (Decision Trees, SVMs, etc.) – Innovates at a higher level of abstraction:

•  creating training data and combining the resulting base models • Bagging creates new training sets via bootstrapping, and then combines by averaging predictions

9

Page 10

Intuition: Why Ensemble Methods Work

•  Bias-Variance Tradeoff! • Bagging reduces variance of low-bias models – Low-bias models are "complex" and unstable. – Bagging averages them together to create stability

•  Boosting reduces bias of low-variance models – Low-variance models are simple with high bias – Boosting trains a sequence of simple models – Sum of simple models is complex/accurate

10

Page 11

Boosting: "The Strength of Weak Classifiers"*

11   * http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf

Page 12

Terminology: Shallow Decision Trees

•  Decision Trees with only a few nodes

•  Very high bias & low variance – Different training sets lead to very similar trees – Error is high (barely better than a static baseline)

•  Extreme case: "Decision Stumps" – Trees with exactly 1 split

12

Page 13

Stability of Shallow Trees

•  Tends to learn more-or-less the same model. • hS(x) has low variance – Over the randomness of training set S

13

Page 14

Terminology: Weak Learning

•  Error rate:  ε_{h,P} = E_{P(x,y)}[ 1_{[h(x) ≠ y]} ]

•  Weak Classifier:  ε_{h,P} slightly better than 0.5 – Slightly better than random guessing

•  Weak Learner: can learn a weak classifier

14

Page 15

Terminology: Weak Learning

•  Error rate:  ε_{h,P} = E_{P(x,y)}[ 1_{[h(x) ≠ y]} ]

•  Weak Classifier:  ε_{h,P} slightly better than 0.5 – Slightly better than random guessing

•  Weak Learner: can learn a weak classifier

15

Shallow Decision Trees are Weak Classifiers!

Weak Learners are Low Variance & High Bias!

Page 16

How to "Boost" Performance of Weak Models?

•  Weak Models are High Bias & Low Variance • Bagging would not work – Reduces variance, not bias

16

Expected Test Error over randomness of S (Squared Loss):

E_S[ L_P(h_S) ] = E_{(x,y)~P(x,y)}[ E_S[ (h_S(x) − H(x))^2 ] + (H(x) − y)^2 ]
                        (Variance Term)               (Bias Term)

H(x) = E_S[ h_S(x) ]   "Average prediction on x"

Page 17

First Try (for Regression) • 1-dimensional regression • Learn Decision Stump – (single split, predict mean of two partitions)

17

Training set S:

x    y
0    0
1    1
2    4
3    9
4    16
5    25
6    36

Page 18

First Try (for Regression) • 1-dimensional regression • Learn Decision Stump – (single split, predict mean of two partitions)

18

Training set S:

x    y
0    0
1    1
2    4
3    9
4    16
5    25
6    36

y1    h1(x)
0     6
1     6
4     6
9     6
16    6
25    30.5
36    30.5

Page 19

First Try (for Regression) • 1-dimensional regression • Learn Decision Stump – (single split, predict mean of two partitions)

19

Training set S:

x    y
0    0
1    1
2    4
3    9
4    16
5    25
6    36

y1    h1(x)   y2
0     6       -6
1     6       -5
4     6       -2
9     6       -3
16    6       10
25    30.5    -5.5
36    30.5    5.5

yt = y − h1:t-1(x)   "residual"

Page 20

First Try (for Regression) • 1-dimensional regression • Learn Decision Stump – (single split, predict mean of two partitions)

20

Training set S:

x    y
0    0
1    1
2    4
3    9
4    16
5    25
6    36

y1    h1(x)   y2     h2(x)
0     6       -6     -5.5
1     6       -5     -5.5
4     6       -2     2.2
9     6       -3     2.2
16    6       10     2.2
25    30.5    -5.5   2.2
36    30.5    5.5    2.2

h1:t(x) = h1(x) + … + ht(x)

yt = y − h1:t-1(x)   "residual"

Page 21

First Try (for Regression) • 1-dimensional regression • Learn Decision Stump – (single split, predict mean of two partitions)

21

Training set S:

x    y
0    0
1    1
2    4
3    9
4    16
5    25
6    36

y1    h1(x)   y2     h2(x)   h1:2(x)
0     6       -6     -5.5    0.5
1     6       -5     -5.5    0.5
4     6       -2     2.2     8.2
9     6       -3     2.2     8.2
16    6       10     2.2     8.2
25    30.5    -5.5   2.2     32.7
36    30.5    5.5    2.2     32.7

h1:t(x) = h1(x) + … + ht(x)

yt = y − h1:t-1(x)   "residual"

Page 22

First Try (for Regression) • 1-dimensional regression • Learn Decision Stump – (single split, predict mean of two partitions)

22

Training set S:

x    y
0    0
1    1
2    4
3    9
4    16
5    25
6    36

y1    h1(x)   y2     h2(x)   h1:2(x)   y3     h3(x)   h1:3(x)
0     6       -6     -5.5    0.5       -0.5   -0.55   -0.05
1     6       -5     -5.5    0.5       0.5    -0.55   -0.05
4     6       -2     2.2     8.2       -4.2   -0.55   7.65
9     6       -3     2.2     8.2       0.8    -0.55   7.65
16    6       10     2.2     8.2       7.8    -0.55   7.65
25    30.5    -5.5   2.2     32.7      -7.7   -0.55   32.15
36    30.5    5.5    2.2     32.7      3.3    3.3     36

h1:t(x) = h1(x) + … + ht(x)

yt = y − h1:t-1(x)   "residual"

Page 23

23

First Try (for Regression)

[Figure: for t = 1, 2, 3, 4, plots of the residual targets yt, the fitted stump ht, and the running sum h1:t]

h1:t(x) = h1(x) + … + ht(x)    yt = y − h1:t-1(x)

Page 24

Gradient Boosting (Simple Version)    (For Regression Only)

http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

S = {(x_i, y_i)}_{i=1}^N

S1 = {(x_i, y_i)}_{i=1}^N  →  h1(x)
S2 = {(x_i, y_i − h1(x_i))}_{i=1}^N  →  h2(x)
…
Sn = {(x_i, y_i − h1:n-1(x_i))}_{i=1}^N  →  hn(x)

h(x) = h1(x) + h2(x) + … + hn(x)

(Why is it called "gradient"?)  (Answer next slides.)

24
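As a concrete illustration of the procedure on this slide, here is a minimal Python sketch (assumptions: regression stumps as the weak learner, squared loss; helper names such as fit_stump and gradient_boost are illustrative, not from the slides). Each round fits a stump to the current residuals and adds it to the running sum.

import numpy as np

def fit_stump(x, residual):
    # Depth-1 regression tree: one split, predict the mean of each side.
    best_err, best = np.inf, None
    for s in np.unique(x)[:-1]:              # exclude the largest value so the right side is non-empty
        left, right = residual[x <= s], residual[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_err, best = err, (s, left.mean(), right.mean())
    s, lo, hi = best
    return lambda q: np.where(q <= s, lo, hi)

def gradient_boost(x, y, n_rounds=4):
    models = []
    residual = y.astype(float)
    for _ in range(n_rounds):
        h = fit_stump(x, residual)           # fit to S_t = {(x_i, y_i - h_{1:t-1}(x_i))}
        models.append(h)
        residual = residual - h(x)           # residuals for the next round
    return lambda q: sum(h(q) for h in models)   # h(x) = h_1(x) + ... + h_n(x)

# Toy data from the slides: y = x^2 on x = 0..6
x = np.arange(7)
y = x ** 2
h = gradient_boost(x, y, n_rounds=4)
print(h(x))   # each added stump drives the training residuals closer to 0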

Page 25

Axis Aligned Gradient Descent    (For Linear Model)

•  Linear Model: h(x) = w^T x • Squared Loss: L(y, y') = (y − y')^2

•  Similar to Gradient Descent – But only allow axis-aligned update directions

– Updates are of the form:  w = w − η g_d e_d

e_d = (0, …, 0, 1, 0, …, 0)^T   Unit vector along the d-th dimension

g = Σ_i ∇_w L(y_i, w^T x_i)

g_d = projection of the gradient along the d-th dimension. Update along the axis with the greatest projection.

Training Set:  S = {(x_i, y_i)}_{i=1}^N

25
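A minimal Python sketch of this update rule (the data, the step size η, and the function name are illustrative assumptions): the full gradient is computed, but only the single coordinate with the largest-magnitude component is updated each step.

import numpy as np

def axis_aligned_gd(X, y, n_steps=1000, eta=0.002):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        g = 2 * X.T @ (X @ w - y)      # gradient of sum_i (w^T x_i - y_i)^2
        d = np.argmax(np.abs(g))       # axis with the greatest projection
        w[d] -= eta * g[d]             # w = w - eta * g_d * e_d
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=100)
print(axis_aligned_gd(X, y))           # moves coordinate by coordinate toward w_true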

Page 26

Axis Aligned Gradient Descent

26

Update along the axis with the largest projection.

This concept will become useful in ~5 slides.

Page 27

Function Space & Ensemble Methods

•  Linear model = one coefficient per feature – Linear over the input feature space

•  Ensemble methods = one coefficient per model – Linear over a function space – E.g., h = h1 + h2 + … + hn

27

[Figure: grid of "Functional Gradient Descent" slide thumbnails (http://statweb.stanford.edu/~jhf/ftp/trebst.pdf), each depicting the same schematic:
S' = {(x, y)}  →  h1(x);   S' = {(x, y − h1(x))}  →  h2(x);   …;   S' = {(x, y − h1(x) − … − hn-1(x))}  →  hn(x);   h(x) = h1(x) + h2(x) + … + hn(x)]

"Function Space" (Convex combination of all possible shallow trees) (Potentially infinite)

(Most coefficients are 0)

Coefficient = 1 for models used; Coefficient = 0 for other models

Page 28

Properties of Function Space

•  Generalization of a Vector Space

•  Closed under Addition – Sum of two functions is a function

•  Closed under Scalar Multiplication – Multiplying a function by a scalar is a function

•  Gradient descent: adding a scaled function to an existing function

28

Page 29

Function Space of Weak Models

•  Every "axis" in the space is a weak model – Potentially infinite axes/dimensions

•  Complex models are linear combinations of weak models – h = η1h1 + η2h2 + … + ηnhn – Equivalent to a point in function space

•  Defined by coefficients η

29

Page 30

Recall: Axis Aligned Gradient Descent

30

Project to the closest axis & update (smallest squared distance)

Imagine each axis is a weak model.

Every point is a linear combination of weak models.

Page 31

Functional Gradient Descent    (Gradient Descent in Function Space)    (Derivation for Squared Loss)

S = {(x_i, y_i)}_{i=1}^N

•  Init h(x) = 0
•  Loop n = 1, 2, 3, 4, …

   h ← h − argmax_{hn} [ project_{hn}( ∇_h Σ_i L(y_i, h(x_i)) ) ]
     = h + argmin_{hn} Σ_i ( y_i − h(x_i) − hn(x_i) )^2

Project the functional gradient onto the best function.

Equivalent to finding the hn that minimizes the residual loss.

31
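To spell out the equivalence claimed above (a short derivation sketch for squared loss, written with a 1/2 factor for convenience): the loss at a training point is L(y_i, h(x_i)) = ½ (y_i − h(x_i))^2, so the functional gradient component at that point is ∂L/∂h(x_i) = −(y_i − h(x_i)), i.e. the negative residual. Stepping against the gradient therefore means adding a function whose values approximate the residuals, and projecting that step onto the class of weak models gives argmin_{hn} Σ_i ( (y_i − h(x_i)) − hn(x_i) )^2 — exactly training the next weak model on residual targets.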

Page 32

Reduction to Vector Space

•  Function space = axis-aligned unit vectors – Weak model = axis-aligned unit vector:  e_d = (0, …, 0, 1, 0, …, 0)^T

•  Linear model w has the same functional form: – w = η1e1 + η2e2 + … + ηDeD – Point in a space of D "axis-aligned functions"

•  Axis-Aligned Gradient Descent = Functional Gradient Descent on the space of axis-aligned unit vector weak models.

32

Page 33

Gradient Boosting (Full Version)    (Instance of Functional Gradient Descent)    (For Regression Only)

http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

S = {(x_i, y_i)}_{i=1}^N

S1 = {(x_i, y_i)}_{i=1}^N  →  h1(x)
S2 = {(x_i, y_i − h1(x_i))}_{i=1}^N  →  h2(x)
…
Sn = {(x_i, y_i − h1:n-1(x_i))}_{i=1}^N  →  hn(x)

h1:n(x) = h1(x) + η2h2(x) + … + ηnhn(x)

33   See reference for how to set η

Page 34

Recap: Basic Boosting

•  Ensemble of many weak classifiers. – h(x) = η1h1(x) + η2h2(x) + … + ηnhn(x)

•  Goal: reduce bias using low-variance models

•  Derivation: via Gradient Descent in Function Space – Space of weak classifiers

•  We've only seen the regression case so far…

34

Page 35

AdaBoost: Adaptive Boosting for Classification

35   http://www.yisongyue.com/courses/cs155/lectures/msri.pdf

Page 36

Boosting for Classification

•  Gradient Boosting was designed for regression

•  Can we design one for classification?

•  AdaBoost – Adaptive Boosting

36

Page 37

AdaBoost = Functional Gradient Descent

•  AdaBoost is also an instance of functional gradient descent: – h(x) = sign( a1h1(x) + a2h2(x) + … + anhn(x) )

•  E.g., weak models hi(x) are classification trees – Always predict −1 or +1 – (Gradient Boosting used regression trees)

37

Page 38

Combining Multiple Classifiers

38

Aggregate Scoring Function:  f(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)

Aggregate Classifier:  h(x) = sign(f(x))

Data Point   h1(x)   h2(x)   h3(x)   h4(x)   f(x)                              h(x)
x1           +1      +1      +1      -1      0.1 + 1.5 + 0.4 - 1.1 = 0.9       +1
x2           +1      +1      +1      +1      0.1 + 1.5 + 0.4 + 1.1 = 3.1       +1
x3           -1      +1      -1      -1      -0.1 + 1.5 - 0.4 - 1.1 = -0.1     -1
x4           -1      -1      +1      -1      -0.1 - 1.5 + 0.3 - 1.1 = -2.4     -1
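A tiny Python sketch of this aggregation (the array names are illustrative): weight each weak classifier's ±1 vote, sum, and take the sign.

import numpy as np

a = np.array([0.1, 1.5, 0.4, 1.1])          # weights a_1..a_4
votes = np.array([[+1, +1, +1, -1],         # h_1(x)..h_4(x) for x1
                  [+1, +1, +1, +1],         # for x2
                  [-1, +1, -1, -1],         # for x3
                  [-1, -1, +1, -1]])        # for x4
f = votes @ a                                # aggregate scores f(x)
h = np.sign(f)                               # aggregate classifier h(x) = sign(f(x))
print(f, h)                                  # prints the scores and the ±1 predictions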

Page 39

Also Creates New Training Sets

•  Gradients in Function Space – Weak model that outputs the residual of the loss function

•  Squared loss: residual = y − h(x)   (For Regression) – Algorithmically equivalent to training the weak model on a modified training set • Gradient Boosting = train on (xi, yi − h(xi))

•  What about AdaBoost? – Classification problem.

39

Page 40

Reweighting Training Data

•  Define a weighting D over S = {(x_i, y_i)}_{i=1}^N: – Sums to 1:  Σ_i D(i) = 1

•  Examples:

Data Point   D(i)        Data Point   D(i)        Data Point   D(i)
(x1,y1)      1/3         (x1,y1)      0           (x1,y1)      1/6
(x2,y2)      1/3         (x2,y2)      1/2         (x2,y2)      1/3
(x3,y3)      1/3         (x3,y3)      1/2         (x3,y3)      1/2

•  Weighted loss function:  L_D(h) = Σ_i D(i) L(y_i, h(x_i))

40

Page 41

Training Decision Trees with Weighted Training Data

•  Slight modification of the splitting criterion.

•  Example: Bernoulli Variance:  L(S') = |S'| p_{S'} (1 − p_{S'}) = (#pos * #neg) / |S'|

•  Estimate the fraction of positives as:  p_{S'} = Σ_{(x_i,y_i)∈S'} D(i) 1_{[y_i = 1]} / |S'|,   where |S'| ≡ Σ_{(x_i,y_i)∈S'} D(i)

41
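A small Python sketch of this weighted splitting criterion (the labels, weights, and names like weighted_split_cost are illustrative): a candidate child's "size" and positive fraction are computed from the example weights D(i) rather than raw counts.

import numpy as np

def weighted_bernoulli_variance(y, D):
    # Impurity L(S') for labels y in {0,1}, with example weights D(i).
    size = D.sum()                                    # |S'| = sum of weights in S'
    if size == 0:
        return 0.0
    p = (D * (y == 1)).sum() / size                   # weighted fraction of positives
    return size * p * (1 - p)                         # |S'| * p_{S'} * (1 - p_{S'})

def weighted_split_cost(x, y, D, threshold):
    left = x <= threshold
    return (weighted_bernoulli_variance(y[left], D[left]) +
            weighted_bernoulli_variance(y[~left], D[~left]))

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([1, 1, 0, 0, 1, 1])
D = np.array([0.05, 0.05, 0.3, 0.3, 0.15, 0.15])      # sums to 1
print(min(x[:-1], key=lambda t: weighted_split_cost(x, y, D, t)))
# with these weights the lowest-impurity threshold is t = 4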

Page 42

AdaBoost Outline

S = {(x_i, y_i)}_{i=1}^N,   y_i ∈ {−1, +1}

(S, D1 = Uniform)  →  h1(x)
(S, D2)  →  h2(x)
…
(S, Dn)  →  hn(x)

h(x) = sign( a1h1(x) + a2h2(x) + … + anhn(x) )

Dt – weighting on data points;   at – weight in the linear combination

Stop when validation performance plateaus (will discuss later)

42   http://www.yisongyue.com/courses/cs155/lectures/msri.pdf

Page 43

Intuition

43

Aggregate Scoring Function:  f(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)

Aggregate Classifier:  h(x) = sign(f(x))

Data Point   Label    f(x)    h(x)
x1           y1=+1    0.9     +1     (Somewhat close to Decision Boundary)
x2           y2=+1    3.1     +1     (Safely Far from Decision Boundary)
x3           y3=+1    -0.1    -1     (Violates Decision Boundary)
x4           y4=-1    -2.4    -1     (Safely Far from Decision Boundary)

Page 44

Intuition

44

Aggregate Scoring Function:  f(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)

Aggregate Classifier:  h(x) = sign(f(x))

Data Point   Label    f(x)    h(x)
x1           y1=+1    0.9     +1     (Somewhat close to Decision Boundary)
x2           y2=+1    3.1     +1     (Safely Far from Decision Boundary)
x3           y3=+1    -0.1    -1     (Violates Decision Boundary)
x4           y4=-1    -2.4    -1     (Safely Far from Decision Boundary)

Thought Experiment: When we train a new h5(x) to add to f(x)…

… what happens when h5 mispredicts on everything?

Page 45

Intuition

45

Aggregate Scoring Function:  f1:5(x) = f1:4(x) + 0.5*h5(x)

Aggregate Classifier:  h1:5(x) = sign(f1:5(x))

Suppose a5 = 0.5, and h5(x) mispredicts on everything:

Data Point   Label    f1:4(x)   h1:4(x)   Worst case h5(x)   Worst case f1:5(x)   Impact of h5(x)
x1           y1=+1    0.9       +1        -1                 0.4                  Kind of Bad
x2           y2=+1    3.1       +1        -1                 2.6                  Irrelevant
x3           y3=+1    -0.1      -1        -1                 -0.6                 Very Bad
x4           y4=-1    -2.4      -1        +1                 -1.9                 Irrelevant

Page 46

Intuition

46

Aggregate Scoring Function:  f1:5(x) = f1:4(x) + 0.5*h5(x)

Aggregate Classifier:  h1:5(x) = sign(f1:5(x))

Suppose a5 = 0.5, and h5(x) mispredicts on everything:

Data Point   Label    f1:4(x)   h1:4(x)   Worst case h5(x)   Worst case f1:5(x)   Impact of h5(x)
x1           y1=+1    0.9       +1        -1                 0.4                  Kind of Bad
x2           y2=+1    3.1       +1        -1                 2.6                  Irrelevant
x3           y3=+1    -0.1      -1        -1                 -0.6                 Very Bad
x4           y4=-1    -2.4      -1        +1                 -1.9                 Irrelevant

h5(x) should definitely classify (x3,y3) correctly!  h5(x) should probably classify (x1,y1) correctly.

Don't care about (x2,y2) & (x4,y4).  Implies a weighting over training examples.

Page 47

Intuition

47

Aggregate Scoring Function:  f1:4(x) = 0.1*h1(x) + 1.5*h2(x) + 0.4*h3(x) + 1.1*h4(x)

Aggregate Classifier:  h1:4(x) = sign(f1:4(x))

Data Point   Label    f1:4(x)   h1:4(x)   Desired D5
x1           y1=+1    0.9       +1        Medium
x2           y2=+1    3.1       +1        Low
x3           y3=+1    -0.1      -1        High
x4           y4=-1    -2.4      -1        Low

Page 48

AdaBoost

S = {(x_i, y_i)}_{i=1}^N,   y_i ∈ {−1, +1}

•  Init D1(i) = 1/N
•  Loop t = 1…n:
   – Train classifier ht(x) using (S, Dt)   (E.g., best decision stump)
   – Compute error on (S, Dt):  ε_t ≡ L_{Dt}(h_t) = Σ_i Dt(i) L(y_i, h_t(x_i))
   – Define step size at:  a_t = ½ log( (1 − ε_t) / ε_t )
   – Update weighting:  D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t,   where Z_t is a normalization factor s.t. D_{t+1} sums to 1.
•  Return: h(x) = sign( a1h1(x) + … + anhn(x) )

48   http://www.yisongyue.com/courses/cs155/lectures/msri.pdf
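The loop above translates almost line-for-line into code. Below is a hedged Python sketch (assumptions: decision stumps as the weak learner, 0/1 loss inside ε_t, helper names like fit_weighted_stump are illustrative, and each weak learner achieves 0 < ε_t < 1).

import numpy as np

def fit_weighted_stump(X, y, D):
    # Best single-feature threshold classifier under weights D; y in {-1,+1}.
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= t, sign, -sign)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, t, sign)
    j, t, sign = best
    return lambda Z: np.where(Z[:, j] <= t, sign, -sign)

def adaboost(X, y, n_rounds=10):
    N = len(y)
    D = np.full(N, 1.0 / N)                    # D_1(i) = 1/N
    models, alphas = [], []
    for _ in range(n_rounds):
        h = fit_weighted_stump(X, y, D)        # train h_t on (S, D_t)
        pred = h(X)
        eps = D[pred != y].sum()               # eps_t = weighted 0/1 error
        a = 0.5 * np.log((1 - eps) / eps)      # a_t = 1/2 log((1 - eps_t) / eps_t)
        D = D * np.exp(-a * y * pred)          # up-weight mistakes, down-weight correct points
        D = D / D.sum()                        # divide by Z_t so D_{t+1} sums to 1
        models.append(h)
        alphas.append(a)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, models)))

The number of rounds plays the role of n on the slide; as noted on the AdaBoost Outline slide, in practice weak classifiers are added until validation performance plateaus.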

Page 49

Example

49

Data Point   Label    D1
x1           y1=+1    0.01
x2           y2=+1    0.01
x3           y3=+1    0.01
x4           y4=-1    0.01
…

ε_t ≡ L_{Dt}(h_t) = Σ_i Dt(i) L(y_i, h_t(x_i))

a_t = ½ log( (1 − ε_t) / ε_t )

D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t      (Z_t: normalization factor s.t. D_{t+1} sums to 1)

y_i h_t(x_i) = −1 or +1

Page 50

Example

50

Data Point   Label    D1      h1(x)
x1           y1=+1    0.01    +1
x2           y2=+1    0.01    -1
x3           y3=+1    0.01    -1
x4           y4=-1    0.01    -1
…

ε1 = 0.4    a1 = 0.2

ε_t ≡ L_{Dt}(h_t) = Σ_i Dt(i) L(y_i, h_t(x_i))

a_t = ½ log( (1 − ε_t) / ε_t )

D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t      (Z_t: normalization factor s.t. D_{t+1} sums to 1)

y_i h_t(x_i) = −1 or +1

Page 51

Example

51

Data Point   Label    D1      h1(x)   D2
x1           y1=+1    0.01    +1      0.008
x2           y2=+1    0.01    -1      0.012
x3           y3=+1    0.01    -1      0.012
x4           y4=-1    0.01    -1      0.008
…

ε1 = 0.4    a1 = 0.2

ε_t ≡ L_{Dt}(h_t) = Σ_i Dt(i) L(y_i, h_t(x_i))

a_t = ½ log( (1 − ε_t) / ε_t )

D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t      (Z_t: normalization factor s.t. D_{t+1} sums to 1)

y_i h_t(x_i) = −1 or +1
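A quick worked check of the D2 column using the update rule above (rounded the way the slide rounds): with ε1 = 0.4, a1 = ½ log(0.6 / 0.4) ≈ 0.2. A point that h1 classified correctly has y_i h_1(x_i) = +1, so its unnormalized weight becomes 0.01·e^{−0.2} ≈ 0.0082; a misclassified point has y_i h_1(x_i) = −1, giving 0.01·e^{+0.2} ≈ 0.0122. Dividing by Z_1 = 0.6·e^{−0.2} + 0.4·e^{+0.2} ≈ 0.98 yields ≈ 0.0084 and ≈ 0.0125, i.e. the ≈ 0.008 and ≈ 0.012 shown for D2.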

Page 52

Example

52

Data Point   Label    D1      h1(x)   D2      h2(x)
x1           y1=+1    0.01    +1      0.008   +1
x2           y2=+1    0.01    -1      0.012   +1
x3           y3=+1    0.01    -1      0.012   -1
x4           y4=-1    0.01    -1      0.008   +1
…

ε1 = 0.4    a1 = 0.2
ε2 = 0.45   a2 = 0.1

ε_t ≡ L_{Dt}(h_t) = Σ_i Dt(i) L(y_i, h_t(x_i))

a_t = ½ log( (1 − ε_t) / ε_t )

D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t      (Z_t: normalization factor s.t. D_{t+1} sums to 1)

y_i h_t(x_i) = −1 or +1

Page 53

Example

53

Data Point   Label    D1      h1(x)   D2      h2(x)   D3
x1           y1=+1    0.01    +1      0.008   +1      0.007
x2           y2=+1    0.01    -1      0.012   +1      0.011
x3           y3=+1    0.01    -1      0.012   -1      0.013
x4           y4=-1    0.01    -1      0.008   +1      0.009
…

ε1 = 0.4    a1 = 0.2
ε2 = 0.45   a2 = 0.1

ε_t ≡ L_{Dt}(h_t) = Σ_i Dt(i) L(y_i, h_t(x_i))

a_t = ½ log( (1 − ε_t) / ε_t )

D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t      (Z_t: normalization factor s.t. D_{t+1} sums to 1)

y_i h_t(x_i) = −1 or +1

Page 54

Example

54

Data Point   Label    D1      h1(x)   D2      h2(x)   D3      h3(x)
x1           y1=+1    0.01    +1      0.008   +1      0.007   -1
x2           y2=+1    0.01    -1      0.012   +1      0.011   +1
x3           y3=+1    0.01    -1      0.012   -1      0.013   +1
x4           y4=-1    0.01    -1      0.008   +1      0.009   -1
…

ε1 = 0.4    a1 = 0.2
ε2 = 0.45   a2 = 0.1
ε3 = 0.35   a3 = 0.31

ε_t ≡ L_{Dt}(h_t) = Σ_i Dt(i) L(y_i, h_t(x_i))

a_t = ½ log( (1 − ε_t) / ε_t )

D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t      (Z_t: normalization factor s.t. D_{t+1} sums to 1)

y_i h_t(x_i) = −1 or +1

What happens if ε = 0.5?

Page 55

Exponential Loss

55

L(y, f(x)) = exp{ −y f(x) }

[Figure: the Exp Loss as a function of f(x) for target y — the Exp Loss upper bounds the 0/1 Loss!]

Can prove that AdaBoost minimizes Exp Loss (Homework Question)
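For intuition, a small worked comparison with y = +1: f(x) = 2 gives exp loss e^{−2} ≈ 0.14 while the 0/1 loss is 0; f(x) = 0.5 gives e^{−0.5} ≈ 0.61 vs 0; f(x) = −0.1 gives e^{0.1} ≈ 1.11 vs 1; f(x) = −2 gives e^{2} ≈ 7.39 vs 1. The exponential loss is never below the 0/1 loss, and it penalizes confident mistakes especially heavily.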

Page 56

Decomposing Exp Loss

56

L(y, f(x)) = exp{ −y f(x) }
           = exp{ −y Σ_{t=1}^n a_t h_t(x) }
           = Π_{t=1}^n exp{ −y a_t h_t(x) }

Distribution Update Rule!

http://www.yisongyue.com/courses/cs155/lectures/msri.pdf

Page 57

Intuition

•  Exp Loss operates in exponent space

•  Additive update to f(x) = multiplicative update to Exp Loss of f(x)

•  Reweighting Scheme in AdaBoost can be derived via residual Exp Loss

57

L(y, f(x)) = exp{ −y Σ_{t=1}^n a_t h_t(x) } = Π_{t=1}^n exp{ −y a_t h_t(x) }

http://www.yisongyue.com/courses/cs155/lectures/msri.pdf

Page 58

AdaBoost = Minimizing Exp Loss

S = {(x_i, y_i)}_{i=1}^N,   y_i ∈ {−1, +1}

•  Init D1(i) = 1/N
•  Loop t = 1…n:
   – Train classifier ht(x) using (S, Dt)
   – Compute error on (S, Dt):  ε_t ≡ L_{Dt}(h_t) = Σ_i Dt(i) L(y_i, h_t(x_i))
   – Define step size at:  a_t = ½ log( (1 − ε_t) / ε_t )
   – Update weighting:  D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t,   where Z_t is a normalization factor s.t. D_{t+1} sums to 1.
•  Return: h(x) = sign( a1h1(x) + … + anhn(x) )

Data points reweighted according to Exp Loss!

58   http://www.yisongyue.com/courses/cs155/lectures/msri.pdf

Page 59

Story So Far: AdaBoost

•  AdaBoost iteratively finds a weak classifier to minimize residual Exp Loss – Trains the weak classifier on reweighted data (S, Dt).

•  Homework: Rigorously prove it!

1.  Formally prove Exp Loss ≥ 0/1 Loss

2.  Relate Exp Loss to Zt:  D_{t+1}(i) = Dt(i) exp{ −a_t y_i h_t(x_i) } / Z_t

3.  Justify choice of at:  a_t = ½ log( (1 − ε_t) / ε_t )  • Gives largest decrease in Zt

The proof is in earlier slides.

59   http://www.yisongyue.com/courses/cs155/lectures/msri.pdf

Page 60

Recap: AdaBoost

•  Gradient Descent in Function Space – Space of weak classifiers

•  Final model = linear combination of weak classifiers – h(x) = sign(a1h1(x) + … + anhn(x)) – I.e., a point in Function Space

•  Iteratively creates new training sets via reweighting – Trains the weak classifier on the reweighted training set – Derived via minimizing residual Exp Loss

60

Page 61

Ensemble Selection

61

Page 62

Recall: Bias-Variance Decomposition

•  For squared error:

E_S[ L_P(h_S) ] = E_S[ E_{(x,y)~P(x,y)}[ L(y, h_S(x)) ] ]

E_S[ L_P(h_S) ] = E_{(x,y)~P(x,y)}[ E_S[ (h_S(x) − H(x))^2 ] + (H(x) − y)^2 ]
                        (Variance Term)               (Bias Term)

H(x) = E_S[ h_S(x) ]   "Average prediction on x"

62

Page 63

Ensemble Methods

•  Combine base models to improve performance

•  Bagging: averages high-variance, low-bias models – Reduces variance – Indirectly deals with bias via low-bias base models

•  Boosting: carefully combines simple models – Reduces bias – Indirectly deals with variance via low-variance base models

•  Can we get the best of both worlds?

63

Page 64

Insight: Use Validation Set

•  Evaluate error on validation set V:  L_V(h_S) = E_{(x,y)~V}[ L(y, h_S(x)) ]

•  Proxy for test error:  E_V[ L_V(h_S) ] = L_P(h_S)
   (Expected Validation Error)   (Test Error)

64

Page 65

Ensemble Selection

"Ensemble Selection from Libraries of Models" Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004


[Figure: the data set S is partitioned into a training set S' and a validation set V']

H = {2000 models trained using S'}

Maintain the ensemble model as a combination of H:  h(x) = h1(x) + h2(x) + … + hn(x)

Add the model from H that maximizes performance on V':  + hn+1(x)   (Denote it as hn+1)

Repeat

Models are trained on S'.  The ensemble is built to optimize V'.
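A hedged Python sketch of the greedy loop described above (the function and variable names are illustrative, and using simple unweighted averaging with selection-with-replacement is an assumption, not necessarily the paper's exact recipe): from a library of models already trained on S', repeatedly add whichever model most improves the ensemble's averaged predictions on V'.

import numpy as np

def ensemble_selection(library_preds, y_val, n_rounds=20):
    # library_preds: dict  model name -> predictions on the validation set V'
    # y_val: targets for V'; validation error is the proxy for test error.
    ensemble = []
    total = np.zeros_like(y_val, dtype=float)
    for _ in range(n_rounds):
        def val_error(name):
            avg = (total + library_preds[name]) / (len(ensemble) + 1)
            return ((avg - y_val) ** 2).mean()
        best = min(library_preds, key=val_error)   # model that helps V' the most
        ensemble.append(best)                      # the same model may be picked again
        total = total + library_preds[best]
    return ensemble, total / len(ensemble)

The key point, matching the slide, is that every greedy step is scored on V' rather than on the data the library models were trained on.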

Page 66

Reduces Both Bias & Variance

•  Expected Test Error = Bias + Variance

•  Bagging: reduce variance of low-bias models

•  Boosting: reduce bias of low-variance models

•  Ensemble Selection: who cares! – Use validation error to approximate test error – Directly minimize validation error – Don't worry about the bias/variance decomposition

66

Page 67

What's the Catch?

•  Relies heavily on the validation set – Bagging & Boosting: use the training set to select the next model – Ensemble Selection: uses the validation set to select the next model

•  Requires the validation set to be sufficiently large

•  In practice: implies smaller training sets – Training & validation = partitioning of finite data

•  Often works very well in practice

67

Page 68

"Ensemble Selection from Libraries of Models" Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004

Ensemble Selection often outperforms more homogeneous sets of models. It reduces overfitting by building the model using the validation set.

Ensemble Selection won KDD Cup 2009: http://www.niculescu-mizil.org/papers/KDDCup09.pdf

Page 69

References & Further Reading

"An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants" Bauer & Kohavi, Machine Learning, 36, 105–139 (1999)

"Bagging Predictors" Leo Breiman, Tech Report #421, UC Berkeley, 1994, http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf

"An Empirical Comparison of Supervised Learning Algorithms" Caruana & Niculescu-Mizil, ICML 2006

"An Empirical Evaluation of Supervised Learning in High Dimensions" Caruana, Karampatziakis & Yessenalina, ICML 2008

"Ensemble Methods in Machine Learning" Thomas Dietterich, Multiple Classifier Systems, 2000

"Ensemble Selection from Libraries of Models" Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004

"Getting the Most Out of Ensemble Selection" Caruana, Munson & Niculescu-Mizil, ICDM 2006

"Explaining AdaBoost" Rob Schapire, https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf

"Greedy Function Approximation: A Gradient Boosting Machine" Jerome Friedman, 2001, http://statweb.stanford.edu/~jhf/ftp/trebst.pdf

"Random Forests – Random Features" Leo Breiman, Tech Report #567, UC Berkeley, 1999

"Structured Random Forests for Fast Edge Detection" Dollár & Zitnick, ICCV 2013

"ABC-Boost: Adaptive Base Class Boost for Multi-class Classification" Ping Li, ICML 2009

"Additive Groves of Regression Trees" Sorokina, Caruana & Riedewald, ECML 2007, http://additivegroves.net/

"Winning the KDD Cup Orange Challenge with Ensemble Selection" Niculescu-Mizil et al., KDD 2009

"Lessons from the Netflix Prize Challenge" Bell & Koren, SIGKDD Explorations 9(2), 75–79, 2007

Page 70

Next Week

•  Probabilistic Models – Naïve Bayes – Basic Graphical Models

•  Hidden Markov Models

70