
Page 1: Gradient Boosting

Gradient Boosting

Nghia Bui, Nov 2016

Jerome H. Friedman (born 1939), Department of Statistics

Stanford University

Image source: https://statistics.stanford.edu/people/jerome-h-friedman

Page 2: Gradient Boosting


Gradient Boosting = Boosting with Gradient

• Boosting: a machine learning technique which boosts weak learners to strong ones

• Gradient boosting: boosting which makes use of gradient (explained later)

• Developed by Prof. Jerome H. Friedman in 1999

• What is a learner in our current context?

Page 3: Gradient Boosting


A learner estimates a target function from training data

• A learner is given training data, which is a list of pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

• Its task is to estimate a target function $F$ so that the list $F(x_1), F(x_2), \ldots, F(x_n)$ best fits the list $y_1, y_2, \ldots, y_n$

• The “fit” is measured by a loss function, denoted as $L(\{y_i\}, \{F(x_i)\})$ or $L(F(x_1), \ldots, F(x_n))$ or simply $L$ – note: $L$ involves the $y_i$ as constants and the $F(x_i)$ as variables – since $F$ can be varied

• Less loss = more fit! A typical loss function is mean squared error:

$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - F(x_i))^2$
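To make this concrete, here is a tiny Python sketch (my own addition, not from the slides) that evaluates the mean squared error for a list of predictions $F(x_i)$; the function name mse_loss is illustrative.

import numpy as np

def mse_loss(y, F_x):
    """Mean squared error between targets y_i and predictions F(x_i)."""
    y, F_x = np.asarray(y, dtype=float), np.asarray(F_x, dtype=float)
    return np.mean((y - F_x) ** 2)

# The closer the predictions fit the targets, the smaller the loss.
y = [3.0, -0.5, 2.0]
print(mse_loss(y, [2.5, 0.0, 2.0]))   # 0.1666...
print(mse_loss(y, [3.0, -0.5, 2.0]))  # 0.0 (perfect fit)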

Page 4: Gradient Boosting


Input of Gradient Boosting (GB)

• Training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

• A differentiable loss function $L$

• A list of base/weak learners (a list containing just one weak learner is not uncommon)

• A number of iterations $M$

A differentiable function is a function whose derivative exists at each point in its domain.
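For concreteness, the four inputs could be instantiated in Python as below (an illustrative sketch; the variable names and the choice of a shallow scikit-learn decision tree as the weak learner are mine, not from the slides).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])                  # training inputs x_i
y = np.array([0.0, 0.8, 0.9, 0.1])                          # training targets y_i
loss = lambda y_true, F_x: np.mean((y_true - F_x) ** 2)     # differentiable loss (MSE)
base_learner = lambda: DecisionTreeRegressor(max_depth=1)   # weak learner factory
M = 50                                                      # number of iterations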

Page 5: Gradient Boosting


How GB estimates target function

• GB assumes that the target function $F$ is a linear combination of base functions $h_m$:

$F(x) = \gamma_0 + \gamma_1 h_1(x) + \gamma_2 h_2(x) + \ldots + \gamma_M h_M(x)$

• The GB algorithm can be initially described as:
– Init $F_0(x) = \gamma_0$ (an acceptable constant)
– For $m = 1$ to $M$:
  • Estimate $\gamma_m$ and $h_m$ so that:

  $L(\{y_i\}, \{F_{m-1}(x_i) + \gamma_m h_m(x_i)\}) < L(\{y_i\}, \{F_{m-1}(x_i)\})$

  • Update: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$
– Result is $F(x) = F_M(x)$

Page 6: Gradient Boosting


Estimate $\gamma_m$ and $h_m$

• The previous inequality can be rewritten as:

$L(F_{m-1}(x_1) + \gamma_m h_m(x_1), \ldots, F_{m-1}(x_n) + \gamma_m h_m(x_n)) < L(F_{m-1}(x_1), \ldots, F_{m-1}(x_n))$

• Thus, if we are able to find a list of real values in the form $\gamma_m h_m(x_1), \gamma_m h_m(x_2), \ldots, \gamma_m h_m(x_n)$ such that the inequality above holds,

• Then we can estimate $h_m$ by feeding a base learner the training data $(x_1, h_m(x_1)), (x_2, h_m(x_2)), \ldots, (x_n, h_m(x_n))$
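As an illustration (my own, not from the slides): feeding a base learner a list of pairs is just an ordinary regression fit. Here scikit-learn's DecisionTreeRegressor stands in for the base learner, and t_i stands in for the desired per-example adjustment values.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # inputs x_i
t = np.array([0.4, 0.1, -0.2, -0.3])         # desired adjustment value for each x_i

h = DecisionTreeRegressor(max_depth=1).fit(X, t)  # the base learner estimates h_m
print(h.predict(X))  # the weak learner's approximation of t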

Page 7: Gradient Boosting


Finding $\gamma_m$ and $h_m$

• Consider the loss of $F_{m-1}$:

$L(F_{m-1}(x_1), F_{m-1}(x_2), \ldots, F_{m-1}(x_n))$

• Remember, $L$ is a function of $n$ variables; $F_{m-1}(x_1), \ldots, F_{m-1}(x_n)$ are just particular values of those variables – we call them the current point – and $L(F_{m-1}(x_1), \ldots, F_{m-1}(x_n))$ is just a particular loss corresponding to this point.

• Imagine we are standing at the current point: in which direction, and how far, should we move in order to land at a point with smaller loss?

• That “direction” is actually the vector $(h_m(x_1), h_m(x_2), \ldots, h_m(x_n))$, and that “how far” is reflected by the real value $\gamma_m$.

• Let’s find that “direction” and that “how far”!

Page 8: Gradient Boosting


Gradient!

• Consider $L$ as a function of a single variable $F(x_i)$, with the other $n - 1$ variables fixed at the current point. By calculating the derivative of $L$ at that point ($L$ is differentiable), we have:
– if the derivative is positive/negative, then we will decrease/increase the value of the variable, in order to make the loss smaller (*)

• Applying the calculation above for all $i = 1, \ldots, n$, we get $n$ derivative values forming a vector, the so-called gradient of the $n$-variable function $L$ at the current point:

$\nabla L = \left( \frac{\partial L}{\partial F(x_1)}, \frac{\partial L}{\partial F(x_2)}, \ldots, \frac{\partial L}{\partial F(x_n)} \right)$
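A small numerical sketch (my own, assuming the MSE loss from earlier): for MSE, the partial derivative of $L$ with respect to $F(x_i)$ is $-2 (y_i - F(x_i)) / n$. The code checks this against finite differences and demonstrates rule (*).

import numpy as np

y = np.array([3.0, -0.5, 2.0])
F = np.array([2.5, 0.0, 2.0])   # current point: the values F(x_i)

def L(F_x):
    # MSE loss, viewed as a function of the n variables F(x_i)
    return np.mean((y - F_x) ** 2)

# Analytic partial derivatives dL/dF(x_i) = -2 * (y_i - F(x_i)) / n
grad = -2.0 * (y - F) / len(y)

# Finite-difference check of each component
eps = 1e-6
num_grad = np.array([(L(F + eps * np.eye(3)[i]) - L(F - eps * np.eye(3)[i])) / (2 * eps)
                     for i in range(3)])
print(grad, num_grad)  # the two vectors agree

# Rule (*): moving each variable against the sign of its derivative lowers the loss
print(L(F), ">", L(F - 0.1 * grad))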

Page 9: Gradient Boosting


Negative gradient!

• By taking the negative of the gradient, the rule (*) can be rewritten more “nicely”:
– if the $i$-th component of the negative gradient is positive/negative, then we will increase/decrease the value of the $i$-th variable, in order to make the loss smaller (**)

• We denote the negative gradient as:

$r = -\nabla L = (r_1, r_2, \ldots, r_n), \quad r_i = -\frac{\partial L}{\partial F(x_i)}$

Page 10: Gradient Boosting


The best direction is the negative gradient

• In general, to make the loss smaller, the value of each variable could be adjusted independently, as long as the rule (**) is applied.

• However, if one component of the negative gradient is, say, $k$ times larger in magnitude than another, we tend to prefer adjusting the corresponding variable $k$ times more, in order to make the loss smaller “quickly”.

• Thus, the best direction to move is exactly the negative gradient; we have:

$(h_m(x_1), h_m(x_2), \ldots, h_m(x_n)) \approx (r_1, r_2, \ldots, r_n)$
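A supporting step (my own addition, not on the slide), written out for the current point $v = (F_{m-1}(x_1), \ldots, F_{m-1}(x_n))$: a first-order Taylor expansion gives

$L(v + \epsilon d) \approx L(v) + \epsilon \, \nabla L(v) \cdot d$, for a small step $\epsilon > 0$ along a unit direction $d$.

By the Cauchy–Schwarz inequality, $\nabla L(v) \cdot d \geq -\lVert \nabla L(v) \rVert$, with equality exactly when $d = -\nabla L(v) / \lVert \nabla L(v) \rVert$; so, to first order, the negative gradient direction decreases the loss fastest.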

Page 11: Gradient Boosting


Estimate $h_m$ and $\gamma_m$

• With the negative gradient $(r_1, r_2, \ldots, r_n)$ in hand, as mentioned, we choose a base learner (this choice is not specified by GB) and feed it the training data $(x_1, r_1), (x_2, r_2), \ldots, (x_n, r_n)$.

• $h_m$ will be the target function estimated by this learner.

• Finally, the “how far” $\gamma_m$ is estimated using the line search strategy:

$\gamma_m = \arg\min_{\gamma} L(F_{m-1}(x_1) + \gamma h_m(x_1), \ldots, F_{m-1}(x_n) + \gamma h_m(x_n))$

• Officially, $\gamma_m$ is called the multiplier; the line search strategy is also applied for the initial constant $F_0$.
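A minimal sketch of the line search (my own, assuming the MSE loss and SciPy's one-dimensional minimizer; any 1-D minimization over gamma would do).

import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([3.0, -0.5, 2.0])
F_prev = np.array([2.5, 0.0, 2.0])    # F_{m-1}(x_i), the current point
h_pred = np.array([0.5, -0.4, 0.1])   # h_m(x_i), the base learner's output

def loss_along_gamma(gamma):
    # The loss evaluated at F_{m-1}(x_i) + gamma * h_m(x_i)
    return np.mean((y - (F_prev + gamma * h_pred)) ** 2)

gamma_m = minimize_scalar(loss_along_gamma).x
print(gamma_m, loss_along_gamma(gamma_m))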

Page 12: Gradient Boosting


The revised GB algorithm

• Init $F_0(x) = \arg\min_{\gamma} L(\gamma, \gamma, \ldots, \gamma)$

• For $m = 1$ to $M$:

– Compute the negative gradient at the current point: $r_i = -\frac{\partial L}{\partial F(x_i)}$ evaluated at $F_{m-1}(x_1), \ldots, F_{m-1}(x_n)$, for $i = 1, \ldots, n$

– Feed a base learner the training data $(x_1, r_1), (x_2, r_2), \ldots, (x_n, r_n)$ to get a base function $h_m$

– Compute the multiplier with the line search strategy: $\gamma_m = \arg\min_{\gamma} L(F_{m-1}(x_1) + \gamma h_m(x_1), \ldots, F_{m-1}(x_n) + \gamma h_m(x_n))$

– Update: $F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)$

• Result is $F(x) = F_M(x)$
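Putting the pieces together, below is a minimal Python sketch of the revised algorithm (my own, not from the slides), assuming mean squared error as the loss and scikit-learn's DecisionTreeRegressor as the base learner. For MSE, the negative gradient is proportional to the ordinary residual $y_i - F(x_i)$, and the line search has a closed form; the names gradient_boost and predict_boosted are illustrative.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_iter=100, max_depth=1):
    """Fit F(x) = F_0 + sum_m gamma_m * h_m(x) under MSE loss."""
    # Init: for MSE, the best constant F_0 is the mean of y
    f0 = np.mean(y)
    F = np.full(len(y), f0)
    trees, gammas = [], []
    for m in range(n_iter):
        # Negative gradient of MSE w.r.t. F(x_i), up to a constant factor: the residual
        r = y - F
        # Feed a base learner the pairs (x_i, r_i) to get h_m
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        h_pred = h.predict(X)
        # Line search for the multiplier gamma_m (closed form for MSE)
        denom = np.dot(h_pred, h_pred)
        gamma = np.dot(r, h_pred) / denom if denom > 0 else 0.0
        # Update F_m = F_{m-1} + gamma_m * h_m
        F = F + gamma * h_pred
        trees.append(h)
        gammas.append(gamma)
    return f0, trees, gammas

def predict_boosted(model, X):
    f0, trees, gammas = model
    F = np.full(X.shape[0], f0)
    for h, gamma in zip(trees, gammas):
        F = F + gamma * h.predict(X)
    return F

Usage would be, for example, model = gradient_boost(X_train, y_train, n_iter=200) followed by y_hat = predict_boosted(model, X_test); keeping max_depth small keeps each h_m a weak learner.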

Page 13: Gradient Boosting


Thank you!

• Contact: [email protected]