Classification - Regression · 2016. 10. 17. · Linear regression: Solution – Gradient descent, Solution – Normal equation
Linear regression Solution – Gradient descent Solution - Normal equation R code More
Classification - Regression
Huiping Cao
Huiping Cao, Slide 1/40
Outline
1 Linear regression
2 Solution – Gradient descent
3 Solution - Normal equation
4 R code
5 More
Regression problem
Regression problem: predict the value of one or more continuous target variables y given the value of a d-dimensional vector x.
The output is a real number, which is why the method is called regression.
Linear regression is one of the most basic and important techniques for predicting the value of an attribute (y).
It is used to fit values in a forecasting or predictive model.
Linear regression – Example
A collection of observations of the Old Faithful geyser in Yellowstone National Park in the USA.
> head(faithful)
eruptions waiting
1 3.600 79
2 1.800 54
3 3.333 74
4 2.283 62
5 4.533 85
6 2.883 55
There are two observation variables in the data set.
eruptions: the duration of the geyser eruptions.
waiting: the length of the waiting period until the next eruption.
Predict eruption time given waiting time.
Linear regression – representation
Training set: {x1, x2, · · · , xn}
n is the number of training data points.
xi is a d-dimensional data point (or vector).
y = (y1, · · · , yn) is the output.
(x, y): one training example; (xi, yi): the i-th training example.
Hypothesis: h(·). The learning algorithm learns the hypothesis h. Using h, we can predict y for any given x.
Linear regression – How do we represent h
Linear regression with one variable (univariate linearregression).
hθ(x) = θ0 + θ1x
Or, h(x) = θ0 + θ1x
[Scatter plot of faithful$eruptions (1.5–5.0) against faithful$waiting (50–90).]
Linear regression – cost function (1)
Hypothesis hθ(x) = θ0 + θ1x .
Parameters: θi s.
How to choose θi s?
[Scatter plot of the training points (x, y), x ∈ [50, 90], y ∈ [1.5, 4.5].]
Linear regression – cost function (2)
Hypothesis: hθ(xi ) = θ0 + θ1xi .
Idea: Choose θ0 and θ1 such that hθ(x) is close to y for our trainingexamples.
Intuitively,

minimize_{θ0,θ1} Σ_{i=1}^n (hθ(xi) − yi)²
Linear regression – cost function (3)
Cost function:

J(θ0, θ1) = 1/(2n) Σ_{i=1}^n (hθ(xi) − yi)²

The factor 1/n averages the cost, and the factor 1/2 makes the analysis easier (it cancels when differentiating).
J is called the squared error function, or cost function, or squared error cost function.
There are other cost functions, but the squared error cost function works well.
Goal: minimize_{θ0,θ1} J(θ0, θ1)
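The cost function can be checked numerically. A minimal NumPy sketch (the slides use R; the toy data here is made up for illustration):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared error cost J(theta0, theta1) = 1/(2n) * sum((h(x_i) - y_i)^2)."""
    n = len(x)
    h = theta0 + theta1 * x          # hypothesis h_theta(x_i)
    return np.sum((h - y) ** 2) / (2 * n)

# Toy data: points on the line y = 2x, so theta0 = 0, theta1 = 2 gives zero cost.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
print(cost(0.0, 2.0, x, y))  # 0.0
```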
Linear regression – cost function – simplified version
Parameters: let θ0 = 0.
The hypothesis becomes: hθ(xi) = θ1 xi.
Goal: minimize_{θ1} J(θ1) = minimize_{θ1} 1/(2n) Σ_{i=1}^n (hθ(xi) − yi)²
Hypothesis function hθ(x): for fixed θ1, it is a function of x.
Cost function J(θ1): a function of θ1.
Linear regression – cost function – simplified version J(θ1)
Training data: three points (1, 1), (2, 2), (3, 3).
Assume θ1 = 1. Then J(1) = 0, because hθ(xi) = θ1 xi = xi:

J(θ1) = 1/(2n) Σ_{i=1}^n (hθ(xi) − yi)² = 1/(2n) Σ_{i=1}^n (xi − yi)² = 1/6 (0² + 0² + 0²) = 0

Assume θ1 = 0.5. Then J(0.5) ≈ 0.58:

J(0.5) = 1/(2n) ((0.5 − 1)² + (1 − 2)² + (1.5 − 3)²) = 1/6 (0.25 + 1 + 2.25) ≈ 0.58

Assume θ1 = 0. Then J(0) = 1/6 (1² + 2² + 3²) = 14/6 ≈ 2.33
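The three values above can be verified directly; a short NumPy sketch of the simplified cost on the same three training points:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
n = len(x)

def J(theta1):
    # Simplified cost with theta0 fixed at 0: J(theta1) = 1/(2n) * sum((theta1*x_i - y_i)^2)
    return np.sum((theta1 * x - y) ** 2) / (2 * n)

print(J(1.0))   # 0.0
print(J(0.5))   # (0.25 + 1 + 2.25)/6 ≈ 0.58
print(J(0.0))   # 14/6 ≈ 2.33
```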
Linear regression – Simplified version J(θ1)
For each value of θ1, we can get a corresponding J(θ1).
Remember that our objective is minimize_{θ1} J(θ1), which is equivalent to finding a straight line that fits the data well.
The result is θ1 = 1, because J(1) = 0 is minimal.
Multivariate Linear Regression - motivation and notation
Example
y: house price
x·1: house square feet
x·2: number of bedrooms
x·3: age of home
etc.
Notation
d: number of features
xi: the i-th input example
xi,j: the value of the j-th feature of the i-th training example.
Multivariate Linear Regression - Hypothesis
hθ(xi) = θ0 + θ1 xi,1 + · · · + θd xi,d
E.g., hθ(xi) = 80 + 0.1 xi,1 + 0.01 xi,2 + 3 xi,3 − 2 xi,4
For convenience, denote xi,0 = 1 to define a 0th feature.
We can define two vectors x and θ:

xi = (xi,0, xi,1, xi,2, · · · , xi,d)ᵀ ∈ R^{d+1}, θ = (θ0, θ1, θ2, · · · , θd)ᵀ ∈ R^{d+1}

Then hθ(x) = θᵀx.
Multivariate Linear Regression - Representation
Hypothesis: hθ(xi) = θᵀxi, where xi,0 = 1
Parameters: θ0, θ1, θ2, · · · , θd (or, θ ∈ R^{d+1})
Cost function:

J(θ0, θ1, · · · , θd) = 1/(2n) Σ_{i=1}^n (hθ(xi) − yi)²

Or

J(θ) = 1/(2n) Σ_{i=1}^n (hθ(xi) − yi)²

Linear regression models: linear functions of the adjustable parameters.
The simplest linear regression models are also linear functions of the input variables.
Linear regression model
Formally,
Given (1) a training data set {x1, x2, · · · , xn}, where xi = (xi,1, xi,2, · · · , xi,d)ᵀ, together with (2) the corresponding target variables y = (y1, y2, · · · , yn)ᵀ.
Goal: predict the value of y for a new observation x.
Linear model: hθ(xi) = θᵀxi, or hθ(xi) = θ0 + θ1 xi,1 + · · · + θd xi,d
Decide d+1 parameters (a (d+1)-vector) θ = (θ0, θ1, · · · , θd) s.t. 1/(2n) Σ_{i=1}^n (h(xi) − yi)² is minimal.
Solution: (1) gradient descent, (2) normal equation (or analytic pseudo-inverse algorithm)
Gradient descent – intuition
Gradient descent to minimize J(θ0, θ1)
Outline:
Start with some θ0, θ1 values (say θ0 = 0, θ1 = 0).
Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum.
In the general case, minimize_{θ0,θ1,··· ,θd} J(θ0, θ1, · · · , θd)
Gradient descent algorithm
Repeat until convergence:
θj := θj − α ∂/∂θj J(θ0, θ1), for j = 0 and j = 1
Notation: partial derivative ∂/∂θ0; derivative d/dθ1.
α: learning rate; a large α indicates aggressive learning.
θ0 and θ1 need to be updated simultaneously.
CORRECT:
temp0 := θ0 − α ∂/∂θ0 J(θ0, θ1)
temp1 := θ1 − α ∂/∂θ1 J(θ0, θ1)
θ0 := temp0
θ1 := temp1
INCORRECT:
temp0 := θ0 − α ∂/∂θ0 J(θ0, θ1)
θ0 := temp0
temp1 := θ1 − α ∂/∂θ1 J(θ0, θ1)
θ1 := temp1
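The difference between the two update orders can be made concrete. A Python sketch on made-up data: in the incorrect version, θ1's gradient is evaluated after θ0 has already been changed, so the two orders produce different values of θ1 after one step:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
n = len(x)
alpha = 0.1

def grad(theta0, theta1):
    # Partials of J(theta0, theta1) = 1/(2n) * sum((theta0 + theta1*x_i - y_i)^2)
    err = theta0 + theta1 * x - y
    return np.sum(err) / n, np.sum(err * x) / n

# CORRECT: compute both gradients from the same (theta0, theta1), then assign.
theta0, theta1 = 0.0, 0.0
g0, g1 = grad(theta0, theta1)
temp0 = theta0 - alpha * g0
temp1 = theta1 - alpha * g1
theta0, theta1 = temp0, temp1

# INCORRECT: theta1's gradient already sees the updated theta0.
t0, t1 = 0.0, 0.0
g0, _ = grad(t0, t1)
t0 = t0 - alpha * g0
_, g1 = grad(t0, t1)          # gradient now mixes old t1 with new t0
t1 = t1 - alpha * g1

print(theta1, t1)  # the two orders give different theta1 values
```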
Gradient descent algorithm - derivative
Derivative: the slope of the tangent line.
Case 1: the slope is positive; θ1 := θ1 − α × (positive number), so θ1 moves to a smaller value.
Case 2: the slope is negative; θ1 := θ1 − α × (negative number), so θ1 moves to a bigger value.
[Plots of the curve (x − 0.5)² on [0, 1]: (a) positive θ1 (b) negative θ1]
Gradient descent algorithm - learning rate
If α is too small, gradient descent can converge very slowly.
If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
[Plots of the curve (x − 0.5)² on [0, 1]: (a) small α (b) big α]
Gradient descent algorithm - local optimal
What if θ1 is already at a local optimum?
The slope is 0 (i.e., derivative = 0), so θ1 := θ1 − α · derivative = θ1 − α · 0 = θ1.
Gradient descent can converge to a local minimum, even with a fixed learning rate α.
As we approach a local minimum, gradient descent will automatically take smaller steps, because d/dθ1 J(θ1) becomes smaller.
[Plot of the curve (x − 0.5)² on [0, 1].]
Gradient descent algorithm for linear regression - one variable
Linear regression model (one variable). Cost function:

J(θ0, θ1) = 1/(2n) Σ_{i=1}^n (hθ(xi) − yi)² = 1/(2n) Σ_{i=1}^n (θ0 + θ1 xi,1 − yi)²

Goal: min_{θ0,θ1} J(θ0, θ1)

Derivatives:

∂/∂θ0 J(θ0, θ1) = 1/n Σ_{i=1}^n (hθ(xi) − yi)
∂/∂θ1 J(θ0, θ1) = 1/n Σ_{i=1}^n (hθ(xi) − yi) · xi,1
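The two partial derivatives can be sanity-checked against central finite differences. A NumPy sketch on made-up data (the point (t0, t1) = (0.5, 1.5) is arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
n = len(x)

def J(t0, t1):
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * n)

def grads(t0, t1):
    # Analytic partials from the slide.
    err = t0 + t1 * x - y
    return np.sum(err) / n, np.sum(err * x) / n

t0, t1, eps = 0.5, 1.5, 1e-6
num0 = (J(t0 + eps, t1) - J(t0 - eps, t1)) / (2 * eps)
num1 = (J(t0, t1 + eps) - J(t0, t1 - eps)) / (2 * eps)
print(grads(t0, t1), (num0, num1))  # analytic and numerical partials agree
```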
Gradient descent algorithm for multivariate linear regression
Hypothesis: hθ(xi) = θᵀxi, where xi,0 = 1
Parameters: θ ∈ R^{d+1}
Cost function:

J(θ) = 1/(2n) Σ_{i=1}^n (hθ(xi) − yi)²

Gradient descent:
Repeat until convergence {
θj := θj − α ∂/∂θj J(θ), for j = 0, · · · , d
}
Linear regression - Gradient descent
Only one variable (feature), i.e., d = 1.
Repeat until convergence {
θ0 := θ0 − α ∂/∂θ0 J(θ) = θ0 − α (1/n) Σ_{i=1}^n (hθ(xi) − yi)
θ1 := θ1 − α ∂/∂θ1 J(θ) = θ1 − α (1/n) Σ_{i=1}^n (hθ(xi) − yi) · xi,1
(simultaneously update θ0 and θ1)
}
Multiple variables (features), i.e., d ≥ 1.
Repeat until convergence {
θj := θj − α ∂/∂θj J(θ) = θj − α (1/n) Σ_{i=1}^n (hθ(xi) − yi) · xi,j
(simultaneously update θj for j = 0, · · · , d)
}
Example:
θ0 := θ0 − α (1/n) Σ_{i=1}^n (hθ(xi) − yi) · xi,0
θ1 := θ1 − α (1/n) Σ_{i=1}^n (hθ(xi) − yi) · xi,1
θ2 := θ2 − α (1/n) Σ_{i=1}^n (hθ(xi) − yi) · xi,2
The update equation of θ0 is the same as that for d = 1 because xi,0 = 1.
The update equation of θ1 is the same as that for d = 1 because xi,1 = xi.
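The general update rule above can be written in vectorized form as θ := θ − α (1/n) Xᵀ(Xθ − y). A minimal NumPy sketch of batch gradient descent on made-up data (α and the iteration count are arbitrary choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=2000):
    """Batch gradient descent for linear regression.

    X: n x (d+1) design matrix whose first column is all ones (x_{i,0} = 1).
    Each step updates every theta_j simultaneously via the vectorized rule
    theta := theta - alpha * (1/n) * X^T (X theta - y).
    """
    n, d1 = X.shape
    theta = np.zeros(d1)
    for _ in range(iters):
        theta = theta - alpha * X.T @ (X @ theta - y) / n
    return theta

# Toy data generated from y = 1 + 2*x1, so theta should approach (1, 2).
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x1), x1])
y = 1.0 + 2.0 * x1
theta = gradient_descent(X, y)
print(theta)  # ≈ [1. 2.]
```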
Gradient descent algorithm for linear regression
Linear regression has a bowl-shaped, convex cost function, i.e., the function does not have multiple local optima. Gradient descent always converges to the global optimum.
Batch gradient descent: each step of the algorithm uses all the training examples in the J(·) function.
There are other algorithms that use only part of the training examples.
Multivariate Linear Regression - Feature Scaling (1)
Idea: make sure that features are on a similar scale if the dataset has multiple features. Gradient descent can then converge more quickly.
For example, suppose a problem's dataset has two features xi,1 and xi,2:
xi,1 = size (0–2000 feet²)
xi,2 = number of bedrooms (1–5)
Gradient descent takes a long time to converge.
One strategy is to scale the feature values:
xi,1 = size (feet²) / 2000
xi,2 = number of bedrooms / 5
Then the contour plot of J(θ) is much less skewed (circle-shaped).
Multivariate Linear Regression - Feature Scaling (2)
Idea: bring every feature into approximately the range −1 ≤ xi,j ≤ 1.
xi,0 = 1 fits in the range.
Generally, as long as the ranges are similar, it is OK.
E.g., 0 ≤ xi,1 ≤ 3 and −1 ≤ xi,2 ≤ 1.5 are fine.
Adding two more features with −200 ≤ xi,3 ≤ 200 and −0.001 ≤ xi,4 ≤ 0.001 makes the ranges very different. This is not desirable.
Multivariate Linear Regression - Feature Scaling (3)
Mean normalization: another way of feature scaling.
Replace xi,j with xi,j − µj to make features have approximately zero mean (do not apply this to xi,0 = 1).
E.g., xi,1 = (size − 1000)/2000, xi,2 = (#bedrooms − 2)/5. Then −0.5 ≤ xi,1 ≤ 0.5 and −0.5 ≤ xi,2 ≤ 0.5.
Generally, replace xi,j with (xi,j − µj)/σj, where σj can be the standard deviation or max(xi,j) − min(xi,j).
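A NumPy sketch of mean normalization with the standard deviation as the scale (the house values below are made up for illustration):

```python
import numpy as np

def mean_normalize(X):
    """Replace each feature x_{i,j} with (x_{i,j} - mu_j) / sigma_j.

    X holds one column per feature; the all-ones column x_{i,0} is excluded,
    since normalizing a constant column would divide by zero.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)        # max - min would also work as a scale
    return (X - mu) / sigma, mu, sigma

# Hypothetical house data: size in square feet and number of bedrooms.
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
Xs, mu, sigma = mean_normalize(X)
print(Xs.mean(axis=0))  # ≈ [0, 0]: both features now have zero mean
```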
Multivariate Linear Regression - Normal Equation
Normal equation: a method to solve for θ analytically.
The normal equation method has some advantages and disadvantages, which will be discussed later.
Multivariate Linear Regression - Normal Equation
General case: n examples (x1, y1), (x2, y2), · · · , (xn, yn), where each xi has d features:

xi = (xi,0, xi,1, xi,2, · · · , xi,d)ᵀ ∈ R^{d+1}

Construct an n × (d + 1) matrix X from the training examples, adding the extra feature xi,0 with value 1:

X = (xᵀ1; xᵀ2; · · · ; xᵀn) ∈ R^{n×(d+1)}, where row i is xᵀi.

Construct an n-dimensional vector y = (y1, y2, · · · , yn)ᵀ
Multivariate Linear Regression - Normal Equation Example
Example 1: house prices

X =
1 2104 5 1 45
1 1416 3 2 40
1 1534 3 2 40
1  852 2 1 36

y = (460, 232, 315, 178)ᵀ

Example 2: one variable. If xi = (1, xi,1)ᵀ, then X is the n × 2 matrix with rows (1, x1,1), (1, x2,1), · · · , (1, xn,1).
Hypothesis: hθ(xi) = θᵀxi, or h(X) = Xθ
Solution: θ = (XᵀX)⁻¹Xᵀy
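The closed-form solution is a one-liner in NumPy. A sketch on made-up one-variable data (solving the linear system XᵀXθ = Xᵀy is preferred over forming the explicit inverse, for numerical stability):

```python
import numpy as np

# Design matrix and targets from a one-variable toy example: y = 1 + 2*x.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x1), x1])   # first column is x_{i,0} = 1
y = 1.0 + 2.0 * x1

# theta = (X^T X)^{-1} X^T y, computed by solving X^T X theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [1. 2.]
```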
Multivariate Linear Regression - Solve for θ - Analysis (1)
hθ(xi) = θᵀxi, J(θ) = 1/(2n) Σ_{i=1}^n (θᵀxi − yi)² = 1/(2n) ||Xθ − y||²

Analysis:
By the definition of the Euclidean norm of a vector: for x = (x1, · · · , xn), ‖x‖ = √(x1² + · · · + xn²). Thus ‖x‖² = x1² + · · · + xn² = Σ_{i=1}^n xi².

Let tempi = θᵀxi − yi; then J(θ) = 1/(2n) Σ_{i=1}^n (tempi)².

Xθ = (xᵀ1θ, xᵀ2θ, · · · , xᵀnθ)ᵀ = (θᵀx1, θᵀx2, · · · , θᵀxn)ᵀ

Xθ − y = (θᵀx1 − y1, θᵀx2 − y2, · · · , θᵀxn − yn)ᵀ = (temp1, temp2, · · · , tempn)ᵀ

J(θ) = 1/(2n) Σ_{i=1}^n (tempi)² = 1/(2n) ||Xθ − y||²
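The identity between the sum-of-squares form and the vector-norm form of J(θ) can be checked numerically; a NumPy sketch on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d + 1))
theta = rng.normal(size=d + 1)
y = rng.normal(size=n)

# Sum-of-squares form: 1/(2n) * sum_i (theta^T x_i - y_i)^2
lhs = sum((theta @ X[i] - y[i]) ** 2 for i in range(n)) / (2 * n)
# Vector-norm form: 1/(2n) * ||X theta - y||^2
rhs = np.linalg.norm(X @ theta - y) ** 2 / (2 * n)
print(np.isclose(lhs, rhs))  # True
```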
Multivariate Linear Regression - Solve for θ - Analysis (2)
Rewriting J(θ) further:

J(θ) = 1/(2n) ||Xθ − y||²
= 1/(2n) (Xθ − y)ᵀ(Xθ − y)
= 1/(2n) ((Xθ)ᵀ − yᵀ)(Xθ − y)
= 1/(2n) ((Xθ)ᵀ(Xθ) − yᵀ(Xθ) − (Xθ)ᵀy + yᵀy)
= 1/(2n) (θᵀXᵀXθ − 2(Xθ)ᵀy + yᵀy)
(because yᵀ(Xθ) = (Xθ)ᵀy)

Why yᵀ(Xθ) = (Xθ)ᵀy? Details on the next slide.
Multivariate Linear Regression - Solve for θ - Analysis (3)
Prove yᵀ(Xθ) = (Xθ)ᵀy.

(Xθ)ᵀy = θᵀXᵀy
= θᵀ(x1, x2, · · · , xn)(y1, y2, · · · , yn)ᵀ
= (θᵀx1, θᵀx2, · · · , θᵀxn)(y1, y2, · · · , yn)ᵀ
= θᵀx1y1 + θᵀx2y2 + · · · + θᵀxnyn = Σ_{i=1}^n θᵀxiyi, which is a scalar value.

(Xθ)ᵀy is a scalar value, and a scalar equals its own transpose, so yᵀ(Xθ) = ((Xθ)ᵀy)ᵀ = (Xθ)ᵀy.
Multivariate Linear Regression - Solve for θ - Analysis (4)
J(θ) is rewritten as 1/(2n) (θᵀXᵀXθ − 2(Xθ)ᵀy + yᵀy).
Solve for θ: set the partial derivative ∂/∂θj J(θ) = 0 for every j to solve for θ0, θ1, · · · , θd. In vector form,

∂J/∂θ = 1/(2n) (2XᵀXθ − 2Xᵀy + 0) = (2/(2n)) Xᵀ(Xθ − y)

Let ∂J/∂θ = 0, i.e., Xᵀ(Xθ − y) = 0. Then θ = (XᵀX)⁻¹Xᵀy.
Solution: θ = (XᵀX)⁻¹Xᵀy
Let X† = (XᵀX)⁻¹Xᵀ; then θ = X†y. X† is called the pseudo-inverse of X.
Reason: X†X = (XᵀX)⁻¹XᵀX = I, so X† acts as a left inverse of X.
Multivariate Linear Regression - Normal equation, Non-invertibility
θ = (XᵀX)⁻¹Xᵀy
It is very rare that XᵀX is non-invertible. What if XᵀX is non-invertible (singular/degenerate)? Different software packages have different functions to handle this.
A non-invertible XᵀX generally has two causes:
The dataset has redundant (linearly dependent) features. E.g., x·1 = size in feet², x·2 = size in m².
Too many features (d > n).
Remedy: delete some features, or use regularization.
Feature scaling is not necessary if we use the normal equation method to solve for θ.
Multivariate Linear Regression - Method comparison
Gradient descent: need to choose α; needs many iterations; works well even when the number of features d is large.
Normal equation: no need to choose α; no iterations; needs to compute (XᵀX)⁻¹ ∈ R^{(d+1)×(d+1)} (cost roughly O(d³)); slow if d is very large.
R - linear regression
reg <- lm(eruptions ~ waiting, data = faithful)
summary(reg)
Linear basis function model
Linear regression can be extended to a more complicated class of models: linear combinations of fixed nonlinear functions of the input variables:

h(xi) = θ0 + Σ_{j=1}^{M−1} θj φj(xi)

The φj(·) are basis functions. The total number of parameters is M.
θ0: bias parameter, accounting for any fixed offset in the data.
Review
Given the value of a d-dimensional vector x and corresponding y values:
Regression problem: predict the value of one or more continuous target variables y.
h(x) = θᵀx
Classification problem: the output is binary.
h(x) = sign(θᵀx)
Define a cost function.
Solution: decide θ such that the cost function is minimal.