TRANSCRIPT
Gradient Boosting
Nghia Bui, Nov 2016
Jerome H. Friedman (born 1939)
Department of Statistics, Stanford University
Image source: https://statistics.stanford.edu/people/jerome-h-friedman
Gradient Boosting = Boosting with Gradient
• Boosting: a machine learning technique which boosts weak learners into strong ones
• Gradient boosting: boosting which makes use of the gradient (explained later)
• Developed by Prof. Jerome H. Friedman in 1999
• What is a learner in our current context?
A learner estimates a target function from training data
• A learner is given training data, which is a list of pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
• Its task is to estimate a target function $F$ so that the list $F(x_1), \dots, F(x_n)$ best fits the list $y_1, \dots, y_n$
• The "fit" is measured by a loss function, denoted as $L$ or $L(F)$ or $L(F(x_1), \dots, F(x_n))$
– note: $L$ involves $y_1, \dots, y_n$ as constants and $F(x_1), \dots, F(x_n)$ as variables, since $F$ can be varied
• Less loss = more fit! A typical loss function is mean squared error (a code sketch follows):
$L = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - F(x_i)\big)^2$
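As a quick illustration, a minimal Python sketch of the MSE loss (the name mse_loss is my own, not from the slides):

import numpy as np

def mse_loss(y, F_x):
    """Mean squared error between targets y_i and predictions F(x_i)."""
    y, F_x = np.asarray(y, dtype=float), np.asarray(F_x, dtype=float)
    return np.mean((y - F_x) ** 2)

# Predictions closer to y give a smaller loss ("less loss = more fit").
y = [3.0, -0.5, 2.0]
print(mse_loss(y, [2.5, 0.0, 2.0]))   # ~0.1667
print(mse_loss(y, [3.0, -0.5, 2.0]))  # 0.0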
Input of Gradient Boosting (GB)
• Training data $(x_1, y_1), \dots, (x_n, y_n)$
• A differentiable loss function $L$
• A list of $K$ base/weak learners ($K = 1$ is not uncommon)
• A number of iterations $M$
(a differentiable function is a function whose derivative exists at each point in its domain; a code sketch follows)
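For concreteness, a differentiable loss can be supplied as a pair of Python functions, the value and its derivative with respect to the prediction (a sketch with my own naming; the squared loss is just one admissible choice):

def squared_loss(y_i, f_i):
    """Loss contribution of one pair: (y_i - f_i)^2."""
    return (y_i - f_i) ** 2

def squared_loss_deriv(y_i, f_i):
    """Derivative with respect to the prediction f_i: -2 * (y_i - f_i)."""
    return -2.0 * (y_i - f_i)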
How GB estimates the target function
• GB assumes that the target function is a linear combination of base functions $h_m$:
$F(x) = F_0(x) + \sum_{m=1}^{M} \gamma_m h_m(x)$
• At a high level, the GB algorithm can be described as (a Python sketch follows this slide):
– Init $F_0(x) = c$ (an acceptable constant)
– For $m = 1$ to $M$:
• Estimate $\gamma_m$ and $h_m$ so that: $L(F_{m-1} + \gamma_m h_m) < L(F_{m-1})$
• Update: $F_m = F_{m-1} + \gamma_m h_m$
– Result is $F_M$
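A minimal Python sketch of this high-level loop; estimate_gamma_and_h is a hypothetical placeholder for the step derived on the next slides:

def gradient_boost_outline(X, y, M, c):
    """High-level GB loop: start from a constant and add M correction terms."""
    F = lambda x: c  # F_0(x) = c, an acceptable constant
    for m in range(1, M + 1):
        # Hypothetical helper, made concrete on the following slides.
        gamma_m, h_m = estimate_gamma_and_h(X, y, F)
        F_prev = F
        # F_m = F_{m-1} + gamma_m * h_m (defaults bind the current values)
        F = lambda x, F_prev=F_prev, g=gamma_m, h=h_m: F_prev(x) + g * h(x)
    return F  # F_M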
Estimating $\gamma_m$ and $h_m$
• The previous inequality can be rewritten as:
$L\big(F_{m-1}(x_1) + \gamma_m h_m(x_1), \dots, F_{m-1}(x_n) + \gamma_m h_m(x_n)\big) < L\big(F_{m-1}(x_1), \dots, F_{m-1}(x_n)\big)$
• Thus, if we are able to find a list of real values $r_1, \dots, r_n$ in the form $r_i = \gamma_m h_m(x_i)$ such that:
$L\big(F_{m-1}(x_1) + r_1, \dots, F_{m-1}(x_n) + r_n\big) < L\big(F_{m-1}(x_1), \dots, F_{m-1}(x_n)\big)$
• Then we can estimate $h_m$ by feeding a base learner the training data $(x_1, r_1), \dots, (x_n, r_n)$ (sketched below)
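This fitting step is ordinary supervised learning on the pairs $(x_i, r_i)$. A sketch assuming scikit-learn's DecisionTreeRegressor as the base learner (the slides do not prescribe any particular one):

from sklearn.tree import DecisionTreeRegressor

def fit_base_learner(X, r):
    """Estimate h_m by regressing the values r_1..r_n on x_1..x_n."""
    h = DecisionTreeRegressor(max_depth=3)  # shallow tree = weak learner
    h.fit(X, r)
    return h.predict  # h_m as a callable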
Finding $\gamma_m$ and $h_m$
• Consider the loss of $F_{m-1}$:
$L\big(F_{m-1}(x_1), F_{m-1}(x_2), \dots, F_{m-1}(x_n)\big)$
• Remember, $L$ is a function of $n$ variables; $F_{m-1}(x_1), \dots, F_{m-1}(x_n)$ are just particular values of those variables – we call them the current point – and the loss above is just the particular loss corresponding to this point.
• Imagine we are standing at the current point: in which direction, and how far, should we move in order to land at a point with smaller loss?
• That "direction" is actually the vector $\big(h_m(x_1), \dots, h_m(x_n)\big)$, and that "how far" is reflected by the real value $\gamma_m$
• Let's find that "direction" and that "how far"! (a tiny numeric example follows)
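To make the picture concrete, a tiny worked example with MSE and $n = 2$ (the numbers are my own): take $y = (1, 2)$ and current point $\big(F_{m-1}(x_1), F_{m-1}(x_2)\big) = (0, 0)$, so the loss is $\frac{1}{2}\big((1-0)^2 + (2-0)^2\big) = 2.5$. Moving halfway toward $y$, to the point $(0.5, 1)$, lowers it to $\frac{1}{2}\big(0.5^2 + 1^2\big) = 0.625$.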
Gradient!
• Consider $L$ as a function of a single variable $F(x_i)$, with the other variables fixed at the current point. By calculating the derivative of $L$ at that point (it exists, since $L$ is differentiable), we have:
– if the derivative is positive/negative, then we will decrease/increase the value of the variable, in order to make the loss smaller (*)
• Applying the calculation above for all $i = 1, \dots, n$, we get $n$ derivative values forming a vector: the so-called gradient of an $n$-variable function at the current point (a worked example follows).
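For example (my own worked step), with the MSE loss the $i$-th derivative at the current point is
$\frac{\partial L}{\partial F(x_i)} = \frac{\partial}{\partial F(x_i)} \, \frac{1}{n} \sum_{j=1}^{n} \big(y_j - F(x_j)\big)^2 = -\frac{2}{n} \big(y_i - F(x_i)\big)$
so the derivative is negative exactly when $F(x_i)$ is below $y_i$, and rule (*) says to increase that prediction.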
Negative gradient!
• By taking the negative of the gradient, the rule (*) can be rewritten more "nicely":
– if the $i$-th component of the negative gradient is positive/negative, then we will increase/decrease the value of the $i$-th variable, in order to make the loss smaller (**)
• We denote the negative gradient as $\big(r_{1m}, \dots, r_{nm}\big)$, where
$r_{im} = -\left[\frac{\partial L}{\partial F(x_i)}\right]_{F = F_{m-1}}$
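Continuing the MSE example: the negative gradient components are $r_{im} = \frac{2}{n}\big(y_i - F_{m-1}(x_i)\big)$, i.e. (up to a constant factor) simply the residuals – which is why gradient boosting with squared loss is often described as "fitting the residuals".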
The best direction is the negative gradient
• In general, to make the loss smaller, the value of each variable could be adjusted independently, as long as rule (**) is followed.
• However, when $|r_{im}|$ is $k$ times larger than $|r_{jm}|$, we tend to prefer adjusting the $i$-th variable $k$ times more than the $j$-th, in order to make the loss smaller "quickly".
• Thus, the best direction to move is exactly the negative gradient, and we set:
$(r_1, \dots, r_n) = (r_{1m}, \dots, r_{nm})$
Estimating $\gamma_m$ and $h_m$
• With $r_{1m}, \dots, r_{nm}$ in hand, as mentioned, we choose a base learner (this choice is not specified by GB) and feed it the training data $(x_1, r_{1m}), \dots, (x_n, r_{nm})$
• $h_m$ will be the target function estimated by this learner.
• Finally, the "how far" is estimated using the line search strategy (sketched below):
$\gamma_m = \arg\min_{\gamma} L\big(F_{m-1}(x_1) + \gamma\, h_m(x_1), \dots, F_{m-1}(x_n) + \gamma\, h_m(x_n)\big)$
• Officially, $\gamma_m$ is called the multiplier; the line search strategy is also applied for $F_0$:
$F_0(x) = \arg\min_{\gamma} L(\gamma, \gamma, \dots, \gamma)$
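The line search is a one-dimensional minimization, so any scalar optimizer will do. A sketch using SciPy's minimize_scalar, with MSE standing in for $L$ (the name line_search is my own):

import numpy as np
from scipy.optimize import minimize_scalar

def line_search(y, F_prev_x, h_x):
    """Find gamma minimizing L(F_{m-1}(x_i) + gamma * h_m(x_i))."""
    def loss(gamma):
        return np.mean((np.asarray(y) - (F_prev_x + gamma * h_x)) ** 2)
    return minimize_scalar(loss).x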
The revised GB algorithm
• Init $F_0(x) = \arg\min_{\gamma} L(\gamma, \gamma, \dots, \gamma)$
• For $m = 1$ to $M$:
– Compute the negative gradient:
$r_{im} = -\left[\frac{\partial L}{\partial F(x_i)}\right]_{F = F_{m-1}}, \quad i = 1, \dots, n$
– Feed a base learner the training data $(x_1, r_{1m}), \dots, (x_n, r_{nm})$ to get a base function $h_m$
– Compute the multiplier $\gamma_m$ with the line search strategy:
$\gamma_m = \arg\min_{\gamma} L\big(F_{m-1}(x_1) + \gamma\, h_m(x_1), \dots, F_{m-1}(x_n) + \gamma\, h_m(x_n)\big)$
– Update: $F_m = F_{m-1} + \gamma_m h_m$
• Result is $F_M$ (a complete end-to-end sketch follows)
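Putting the pieces together, a minimal end-to-end sketch under two assumptions the slides leave open: MSE as the loss and shallow scikit-learn trees as the base learner (all function names are my own):

import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=50):
    """The revised GB algorithm, specialized to MSE loss."""
    y = np.asarray(y, dtype=float)
    # Init: for MSE, argmin_gamma L(gamma, ..., gamma) is the mean of y.
    F0 = y.mean()
    F_x = np.full_like(y, F0)   # current predictions F_{m-1}(x_i)
    stages = []                 # the (gamma_m, h_m) pairs
    for m in range(M):
        # Negative gradient of MSE at the current point: the residuals
        # (up to the constant factor 2/n, which the tree fit absorbs).
        r = y - F_x
        # Feed the base learner the data (x_i, r_im) to get h_m.
        h = DecisionTreeRegressor(max_depth=3).fit(X, r)
        h_x = h.predict(X)
        # Line search for the multiplier gamma_m.
        gamma = minimize_scalar(lambda g: np.mean((y - (F_x + g * h_x)) ** 2)).x
        stages.append((gamma, h))
        # Update: F_m = F_{m-1} + gamma_m * h_m.
        F_x = F_x + gamma * h_x
    def F_M(X_new):
        out = np.full(len(X_new), F0)
        for gamma, h in stages:
            out = out + gamma * h.predict(X_new)
        return out
    return F_M

# Toy usage: learn y = x^2 on a one-dimensional grid.
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X.ravel() ** 2
F = gradient_boost(X, y, M=20)
print(np.mean((y - F(X)) ** 2))  # small training MSE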