TRANSCRIPT
Gradient Boosting
Nghia Bui, Nov 2016
Jerome H. Friedman (born 1939)
Department of Statistics, Stanford University
Image source: https://statistics.stanford.edu/people/jerome-h-friedman
Gradient Boosting = Boosting with Gradient
• Boosting: a machine learning technique which boosts weak learners into strong ones
• Gradient boosting: boosting which makes use of the gradient (explained later)
• Developed by Prof. Jerome H. Friedman in 1999
• What is a learner in our current context?
A learner estimates a target function from training data
• A learner is given training data, which is a list of pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
• Its task is to estimate a target function $F$ so that the list $F(x_1), \dots, F(x_n)$ best fits the list $y_1, \dots, y_n$
• The "fit" is measured by a loss function, denoted as $L$ or $L(F)$ or $L(F(x_1), \dots, F(x_n))$
– note: $L$ involves $y_1, \dots, y_n$ as constants and $F(x_1), \dots, F(x_n)$ as variables, since $F$ can be varied
• Less loss = more fit! A typical loss function is mean squared error (a code sketch follows):
$L = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - F(x_i)\big)^2$
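As a quick illustration, a minimal Python sketch of the MSE loss (the name mse_loss is my own, not from the slides):

import numpy as np

def mse_loss(y, F_x):
    """Mean squared error between targets y_i and predictions F(x_i)."""
    y, F_x = np.asarray(y, dtype=float), np.asarray(F_x, dtype=float)
    return np.mean((y - F_x) ** 2)

# Predictions closer to y give a smaller loss ("less loss = more fit").
y = [3.0, -0.5, 2.0]
print(mse_loss(y, [2.5, 0.0, 2.0]))   # ~0.1667
print(mse_loss(y, [3.0, -0.5, 2.0]))  # 0.0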
Input of Gradient Boosting (GB)
• Training data $(x_1, y_1), \dots, (x_n, y_n)$
• A differentiable loss function $L$
• A list of $K$ base/weak learners ($K = 1$ is not uncommon)
• A number of iterations $M$
(a differentiable function is a function whose derivative exists at each point in its domain; a code sketch follows)
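For concreteness, a differentiable loss can be supplied as a pair of Python functions, the value and its derivative with respect to the prediction (a sketch with my own naming; the squared loss is just one admissible choice):

def squared_loss(y_i, f_i):
    """Loss contribution of one pair: (y_i - f_i)^2."""
    return (y_i - f_i) ** 2

def squared_loss_deriv(y_i, f_i):
    """Derivative with respect to the prediction f_i: -2 * (y_i - f_i)."""
    return -2.0 * (y_i - f_i)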
How GB estimates the target function
• GB assumes that the target function is a linear combination of base functions $h_m$:
$F(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)$
• At a high level, the GB algorithm can be described as (a Python sketch follows this slide):
– Init $F_0(x) = c$ (an acceptable constant)
– For $m = 1$ to $M$:
• Estimate $\gamma_m$ and $h_m$ so that: $L(F_{m-1} + \gamma_m h_m) < L(F_{m-1})$
• Update: $F_m = F_{m-1} + \gamma_m h_m$
– Result is $F_M$
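A minimal Python sketch of this high-level loop; estimate_gamma_and_h is a hypothetical placeholder for the step derived on the next slides:

def gradient_boost_outline(X, y, M, c):
    """High-level GB loop: start from a constant and add M correction terms."""
    F = lambda x: c  # F_0(x) = c, an acceptable constant
    for m in range(1, M + 1):
        # Hypothetical helper, made concrete on the following slides.
        gamma_m, h_m = estimate_gamma_and_h(X, y, F)
        F_prev = F
        # F_m = F_{m-1} + gamma_m * h_m (defaults bind the current values)
        F = lambda x, F_prev=F_prev, g=gamma_m, h=h_m: F_prev(x) + g * h(x)
    return F  # F_M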
Estimating $\gamma_m$ and $h_m$
• The previous inequality can be rewritten as:
$L\big(F_{m-1}(x_1) + \gamma_m h_m(x_1), \dots, F_{m-1}(x_n) + \gamma_m h_m(x_n)\big) < L\big(F_{m-1}(x_1), \dots, F_{m-1}(x_n)\big)$
• Thus, if we are able to find a list of real values $r_1, \dots, r_n$ in the form $r_i = \gamma_m h_m(x_i)$ such that:
$L\big(F_{m-1}(x_1) + r_1, \dots, F_{m-1}(x_n) + r_n\big) < L\big(F_{m-1}(x_1), \dots, F_{m-1}(x_n)\big)$
• Then we can estimate $h_m$ by feeding a base learner the training data $(x_1, r_1), \dots, (x_n, r_n)$ (sketched below)
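This fitting step is ordinary supervised learning on the pairs $(x_i, r_i)$. A sketch assuming scikit-learn's DecisionTreeRegressor as the base learner (the slides do not prescribe any particular one):

from sklearn.tree import DecisionTreeRegressor

def fit_base_learner(X, r):
    """Estimate h_m by regressing the values r_1..r_n on x_1..x_n."""
    h = DecisionTreeRegressor(max_depth=3)  # shallow tree = weak learner
    h.fit(X, r)
    return h.predict  # h_m as a callable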
Finding $\gamma_m$ and $h_m$
• Consider the loss of $F_{m-1}$:
$L\big(F_{m-1}(x_1), F_{m-1}(x_2), \dots, F_{m-1}(x_n)\big)$
• Remember, $L$ is a function of $n$ variables; $F_{m-1}(x_1), \dots, F_{m-1}(x_n)$ are just particular values of those variables – we call them the current point – and the loss above is just the particular loss corresponding to this point.
• Imagine we are standing at the current point: in which direction, and how far, should we move in order to land at a point with smaller loss?
• That "direction" is actually the vector $\big(h_m(x_1), \dots, h_m(x_n)\big)$, and that "how far" is reflected by the real value $\gamma_m$
• Let's find that "direction" and that "how far"! (a tiny numeric example follows)
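To make the picture concrete, a tiny worked example with MSE and $n = 2$ (the numbers are my own): take $y = (1, 2)$ and current point $\big(F_{m-1}(x_1), F_{m-1}(x_2)\big) = (0, 0)$, so the loss is $\frac{1}{2}\big((1-0)^2 + (2-0)^2\big) = 2.5$. Moving halfway toward $y$, to the point $(0.5, 1)$, lowers it to $\frac{1}{2}\big(0.5^2 + 1^2\big) = 0.625$.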
Gradient!
• Consider $L$ as a function of a single variable $F(x_i)$, with the other variables fixed at the current point. By calculating the derivative of $L$ at that point (it exists, since $L$ is differentiable), we have:
– if the derivative is positive/negative, then we will decrease/increase the value of the variable, in order to make the loss smaller (*)
• Applying the calculation above for all $i = 1, \dots, n$, we get $n$ derivative values forming a vector: the so-called gradient of an $n$-variable function at the current point (a worked example follows).
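For example (my own worked step), with the MSE loss the $i$-th derivative at the current point is
$\frac{\partial L}{\partial F(x_i)} = \frac{\partial}{\partial F(x_i)} \, \frac{1}{n} \sum_{j=1}^{n} \big(y_j - F(x_j)\big)^2 = -\frac{2}{n} \big(y_i - F(x_i)\big)$
so the derivative is negative exactly when $F(x_i)$ is below $y_i$, and rule (*) says to increase that prediction.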
Negative gradient!
• By taking the negative of the gradient, the rule (*) can be rewritten more "nicely":
– if the $i$-th component of the negative gradient is positive/negative, then we will increase/decrease the value of the $i$-th variable, in order to make the loss smaller (**)
• We denote the negative gradient as $\big(r_{1m}, \dots, r_{nm}\big)$, where
$r_{im} = -\left[\frac{\partial L}{\partial F(x_i)}\right]_{F = F_{m-1}}$
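Continuing the MSE example: the negative gradient components are $r_{im} = \frac{2}{n}\big(y_i - F_{m-1}(x_i)\big)$, i.e. (up to a constant factor) simply the residuals – which is why gradient boosting with squared loss is often described as "fitting the residuals".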
The best direction is the negative gradient
• In general, to make the loss smaller, the value of each variable could be adjusted independently, as long as rule (**) is followed.
• However, when $|r_{im}|$ is $k$ times larger than $|r_{jm}|$, we tend to prefer adjusting the $i$-th variable $k$ times more than the $j$-th, in order to make the loss smaller "quickly".
• Thus, the best direction to move is exactly the negative gradient, and we set:
$(r_1, \dots, r_n) = (r_{1m}, \dots, r_{nm})$
Estimating $\gamma_m$ and $h_m$
• With $r_{1m}, \dots, r_{nm}$ in hand, as mentioned, we choose a base learner (this choice is not specified by GB) and feed it the training data $(x_1, r_{1m}), \dots, (x_n, r_{nm})$
• $h_m$ will be the target function estimated by this learner.
• Finally, the "how far" is estimated using the line search strategy (sketched below):
$\gamma_m = \arg\min_{\gamma} L\big(F_{m-1}(x_1) + \gamma\, h_m(x_1), \dots, F_{m-1}(x_n) + \gamma\, h_m(x_n)\big)$
• Officially, $\gamma_m$ is called the multiplier; the line search strategy is also applied for $F_0$:
$F_0(x) = \arg\min_{\gamma} L(\gamma, \gamma, \dots, \gamma)$
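The line search is a one-dimensional minimization, so any scalar optimizer will do. A sketch using SciPy's minimize_scalar, with MSE standing in for $L$ (the name line_search is my own):

import numpy as np
from scipy.optimize import minimize_scalar

def line_search(y, F_prev_x, h_x):
    """Find gamma minimizing L(F_{m-1}(x_i) + gamma * h_m(x_i))."""
    def loss(gamma):
        return np.mean((np.asarray(y) - (F_prev_x + gamma * h_x)) ** 2)
    return minimize_scalar(loss).x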
The revised GB algorithm
• Init $F_0(x) = \arg\min_{\gamma} L(\gamma, \gamma, \dots, \gamma)$
• For $m = 1$ to $M$:
– Compute the negative gradient:
$r_{im} = -\left[\frac{\partial L}{\partial F(x_i)}\right]_{F = F_{m-1}}, \quad i = 1, \dots, n$
– Feed a base learner the training data $(x_1, r_{1m}), \dots, (x_n, r_{nm})$ to get a base function $h_m$
– Compute the multiplier $\gamma_m$ with the line search strategy:
$\gamma_m = \arg\min_{\gamma} L\big(F_{m-1}(x_1) + \gamma\, h_m(x_1), \dots, F_{m-1}(x_n) + \gamma\, h_m(x_n)\big)$
– Update: $F_m = F_{m-1} + \gamma_m h_m$
• Result is $F_M$ (a complete end-to-end sketch follows)
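Putting the pieces together, a minimal end-to-end sketch under two assumptions the slides leave open: MSE as the loss and shallow scikit-learn trees as the base learner (all function names are my own):

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50):
    """The revised GB algorithm, specialized to MSE loss."""
    y = np.asarray(y, dtype=float)
    # Init: for MSE, argmin_gamma L(gamma, ..., gamma) is the mean of y.
    F0 = y.mean()
    F_x = np.full_like(y, F0)   # current predictions F_{m-1}(x_i)
    stages = []                 # the (gamma_m, h_m) pairs
    for m in range(M):
        # Negative gradient of MSE at the current point: the residuals
        # (up to the constant factor 2/n, which the tree fit absorbs).
        r = y - F_x
        # Feed the base learner the data (x_i, r_im) to get h_m.
        h = DecisionTreeRegressor(max_depth=3).fit(X, r)
        h_x = h.predict(X)
        # Line search for the multiplier gamma_m.
        gamma = minimize_scalar(lambda g: np.mean((y - (F_x + g * h_x)) ** 2)).x
        stages.append((gamma, h))
        # Update: F_m = F_{m-1} + gamma_m * h_m.
        F_x = F_x + gamma * h_x
    def F_M(X_new):
        out = np.full(len(X_new), F0)
        for gamma, h in stages:
            out = out + gamma * h.predict(X_new)
        return out
    return F_M

# Toy usage: learn y = x^2 on a one-dimensional grid.
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2
F = gradient_boost(X, y, M=20)
print(np.mean((y - F(X)) ** 2))  # small training MSE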