
  • Classification - Regression

    Linear regression · Solution – Gradient descent · Solution – Normal equation · R code · More

    Huiping Cao

  • Outline

    1 Linear regression

    2 Solution – Gradient descent

    3 Solution - Normal equation

    4 R code

    5 More

  • Regression problem

    Regression problem: predict the value of one or more continuous target variables y given the value of a d-dimensional vector x.

    The output is a real number, which is why the method is called regression.

    Linear regression is one of the most basic and important techniques for predicting the value of an attribute (y).

    It is used to fit values in a forecasting or predictive model.

  • Linear regression – Example

    A collection of observations of the Old Faithful geyser in Yellowstone National Park, USA.

    > head(faithful)
      eruptions waiting
    1     3.600      79
    2     1.800      54
    3     3.333      74
    4     2.283      62
    5     4.533      85
    6     2.883      55

    There are two observation variables in the data set.

    eruptions: the duration of the geyser eruptions.
    waiting: the length of the waiting period until the next eruption.

    Predict eruption time given waiting time.

  • Linear regression – representation

    Training set: {x1, x2, · · · , xn}
    n is the number of training data points
    xi is a d-dimensional data point (or vector)
    y = (y1, · · · , yn) is the output
    (x, y): one training example
    (xi, yi): the i-th training example

    Hypothesis: h(·). The learning algorithm learns the hypothesis h. Using h, we can predict y for any given x.

  • Linear regression – How do we represent h

    Linear regression with one variable (univariate linear regression).

    hθ(x) = θ0 + θ1x

    Or, h(x) = θ0 + θ1x

    [Scatter plot of faithful$waiting (x-axis, 50–90) vs. faithful$eruptions (y-axis, 1.5–5.0)]

  • Linear regression – cost function (1)

    Hypothesis: hθ(x) = θ0 + θ1x.

    Parameters: the θi's.

    How do we choose the θi's?

    [Scatter plot of the training data, x (50–90) vs. y (1.5–4.5)]

  • Linear regression – cost function (2)

    Hypothesis: hθ(xi) = θ0 + θ1xi.

    Idea: choose θ0 and θ1 such that hθ(x) is close to y for our training examples.

    Intuitively,

    minimize_{θ0,θ1} ∑_{i=1}^{n} (hθ(xi) − yi)²

  • Linear regression – cost function (2)

    Cost function:

    J(θ0, θ1) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    The factor 1/n averages the cost, and the factor 1/2 makes the later analysis easier.

    J is called the squared error function, the cost function, or the squared error cost function.

    There are other cost functions, but the squared error cost function works well.

    Goal: minimize_{θ0,θ1} J(θ0, θ1)
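    As an added illustration (not part of the original slides), this cost function can be written as a small R helper; the function name cost and the example θ values below are hypothetical:

    cost <- function(theta0, theta1, x, y) {
      n <- length(y)
      h <- theta0 + theta1 * x      # hypothesis h_theta(x_i)
      sum((h - y)^2) / (2 * n)      # squared error cost J(theta0, theta1)
    }

    # Example: cost of a rough guess on the faithful data from the earlier slide
    cost(-1.8, 0.075, faithful$waiting, faithful$eruptions)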

  • Linear regression – cost function – simplified version

    Parameters: let θ0 = 0.

    The hypothesis becomes hθ(xi) = θ1xi.

    Goal: minimize_{θ1} J(θ1), i.e., minimize_{θ1} (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    Hypothesis function hθ(x): for fixed θ1, it is a function of x.

    Cost function J(θ1): a function of θ1.

  • Linear regression – cost function – simplified version J(θ1)

    Training data: three points (1, 1), (2, 2), (3, 3).

    Assume θ1 = 1. Then hθ(xi) = θ1xi = xi, so

    J(1) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)² = (1/(2n)) ∑_{i=1}^{n} (xi − yi)²
         = (1/6)(0² + 0² + 0²) = 0

    Assume θ1 = 0.5. Then

    J(0.5) = (1/(2n))((0.5 − 1)² + (1 − 2)² + (1.5 − 3)²)
           = (1/6)(0.25 + 1 + 2.25) ≈ 0.58

    Assume θ1 = 0. Then J(0) = (1/6)(1² + 2² + 3²) = 14/6 ≈ 2.33
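    These values can be checked numerically in R (an added sketch, not from the original slides; J is a hypothetical helper name):

    x <- c(1, 2, 3); y <- c(1, 2, 3)
    J <- function(theta1) sum((theta1 * x - y)^2) / (2 * length(x))
    J(1)    # 0
    J(0.5)  # 0.5833...
    J(0)    # 2.3333...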

  • Linear regression – Simplified version J(θ1)

    For each value of θ1, we can get a corresponding J(θ1).

    Remember that our objective is to minimize_{θ1} J(θ1), which is equivalent to finding a straight line that fits the training points well.

    The result is θ1 = 1 because J(1) = 0 is minimal.

  • Multivariate Linear Regression - motivation and notation

    Example
    y: house prices
    x·1: house square feet
    x·2: number of bedrooms
    x·3: age of home
    etc.

    Notation

    d: number of features
    xi: the i-th input example
    xi,j: the value of the j-th feature of the i-th training example

  • Multivariate Linear Regression - Hypothesis

    hθ(xi) = θ0 + θ1xi,1 + · · · + θd xi,d

    E.g., hθ(xi) = 80 + 0.1xi,1 + 0.01xi,2 + 3xi,3 − 2xi,4

    For convenience, denote xi,0 = 1 to define a 0th feature.

    We can define two vectors xi and θ:

    xi = (xi,0, xi,1, xi,2, · · · , xi,d)ᵀ ∈ R^{d+1},  θ = (θ0, θ1, θ2, · · · , θd)ᵀ ∈ R^{d+1}

    hθ(x) = θᵀx
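    As a small added illustration (not from the original slides), the hypothesis for one example is just an inner product; h, theta, and x below are hypothetical names, with the feature values borrowed from the house-price example later in the deck:

    h <- function(theta, x) sum(theta * c(1, x))   # prepend x_0 = 1, then compute theta^T x

    theta <- c(80, 0.1, 0.01, 3, -2)               # the example coefficients above
    x <- c(2104, 5, 1, 45)                         # feature values of one training example
    h(theta, x)                                    # predicted value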

  • Multivariate Linear Regression - Representation

    Hypothesis: hθ(xi) = θᵀxi where xi,0 = 1

    Parameters: θ0, θ1, θ2, · · · , θd (or θ ∈ R^{d+1})

    Cost function:

    J(θ0, θ1, · · · , θd) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    or

    J(θ) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    Linear regression models are linear functions of the adjustable parameters.

    The simplest form of linear regression model is also a linear function of the input variables.

  • Linear regression model

    Formally,

    Given (1) a training data set {x1, x2, · · · , xn}, where xi = (xi,1, xi,2, · · · , xi,d)ᵀ, together with (2) the corresponding target variables y = (y1, y2, · · · , yn)ᵀ.

    Goal: predict the value of y for a new observation x.

    Linear model: hθ(xi) = θᵀxi, or hθ(xi) = θ0 + θ1xi,1 + · · · + θd xi,d

    Decide d+1 parameters (a (d+1)-vector) θ = (θ0, θ1, · · · , θd) such that (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)² is minimal.

    Solution: (1) gradient descent, (2) normal equation (the analytic pseudo-inverse algorithm)

  • Gradient descent – intuition

    Gradient descent to minimize J(θ0, θ1)

    Outline:
    Start with some θ0, θ1 values (say θ0 = 0, θ1 = 0)
    Keep changing θ0, θ1 to reduce J(θ0, θ1)
    until we hopefully end up at a minimum

    In the general case, minimize_{θ0,θ1,··· ,θd} J(θ0, θ1, · · · , θd)

  • Gradient descent algorithm

    Repeat until convergence

    θj = θj − α (∂/∂θj) J(θ0, θ1), for j = 0 and j = 1

    Notation: partial derivative ∂/∂θ0; derivative d/dθ1

    α: learning rate; a large α indicates aggressive learning.

    θ0 and θ1 need to be updated simultaneously.

    CORRECT (simultaneous update):
    temp0 = θ0 − α (∂/∂θ0) J(θ0, θ1)
    temp1 = θ1 − α (∂/∂θ1) J(θ0, θ1)
    θ0 = temp0
    θ1 = temp1

    INCORRECT:
    temp0 = θ0 − α (∂/∂θ0) J(θ0, θ1)
    θ0 = temp0
    temp1 = θ1 − α (∂/∂θ1) J(θ0, θ1)
    θ1 = temp1
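    A small R sketch of one simultaneous update step (an added illustration; the gradient expressions are the one-variable derivatives given later in the deck, and alpha is an arbitrary choice):

    x <- faithful$waiting; y <- faithful$eruptions
    alpha <- 1e-4
    theta0 <- 0; theta1 <- 0

    h <- theta0 + theta1 * x                     # current predictions
    temp0 <- theta0 - alpha * mean(h - y)        # (1/n) * sum(h - y)
    temp1 <- theta1 - alpha * mean((h - y) * x)  # (1/n) * sum((h - y) * x_i)
    theta0 <- temp0                              # assign only after both temps are computed
    theta1 <- temp1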

  • Gradient descent algorithm - derivative

    Derivative: the slope of the tangent line.
    Case 1: the slope is positive. Then θ1 = θ1 − α × (positive number), so θ1 moves to a smaller value.
    Case 2: the slope is negative. Then θ1 = θ1 − α × (negative number), so θ1 moves to a bigger value.

    [Two plots of the curve (x − 0.5)²: (a) positive θ1, (b) negative θ1]

  • Gradient descent algorithm - learning rate

    If α is too small, gradient descent can converge very slowly.

    If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.

    [Two plots of the curve (x − 0.5)² showing gradient descent steps: (a) small α, (b) big α]

  • Gradient descent algorithm - local optimal

    What if θ1 is already at a local optimum?
    The slope is 0 (i.e., the derivative = 0), so θ1 = θ1 − α · derivative = θ1 − α · 0 = θ1.
    Gradient descent can converge to a local minimum even with a fixed learning rate α.
    As we approach a local minimum, gradient descent automatically takes smaller steps because d/dθ1 J(θ1) becomes smaller.

    [Plot of the curve (x − 0.5)²]

  • Gradient descent algorithm for linear regression - one variable

    Linear regression model (one variable)

    Cost function:

    J(θ0, θ1) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)² = (1/(2n)) ∑_{i=1}^{n} (θ0 + θ1xi,1 − yi)²

    Goal: min_{θ0,θ1} J(θ0, θ1)

    Derivatives:

    ∂/∂θ0 J(θ0, θ1) = (1/n) ∑_{i=1}^{n} (hθ(xi) − yi)

    ∂/∂θ1 J(θ0, θ1) = (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,1

  • Gradient descent algorithm for multivariate linear regression

    Hypothesis: hθ(xi) = θᵀxi where xi,0 = 1

    Parameters: θ ∈ R^{d+1}

    Cost function:

    J(θ) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    Gradient descent:

    Repeat until convergence {
      θj = θj − α (∂/∂θj) J(θ), for j = 0, · · · , d
    }

  • Linear regression - Gradient descent

    Only one variable (feature), i.e., d = 1:

    Repeat until convergence {
      θ0 = θ0 − α (∂/∂θ0) J(θ) = θ0 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi)
      θ1 = θ1 − α (∂/∂θ1) J(θ) = θ1 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,1
      (simultaneously update θ0 and θ1)
    }

    Multiple variables (features), i.e., d ≥ 1:

    Repeat until convergence {
      θj = θj − α (∂/∂θj) J(θ) = θj − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,j
      (simultaneously update θj for j = 0, · · · , d)
    }

    Example:
    θ0 = θ0 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,0
    θ1 = θ1 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,1
    θ2 = θ2 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,2

    The update equation for θ0 is the same as in the d = 1 case because xi,0 = 1.

    The update equation for θ1 is the same as in the d = 1 case because xi,1 = xi.
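    To make the loop concrete, here is a short R sketch of batch gradient descent on the univariate faithful example (an added illustration, not code from the original slides; the learning rate and iteration count are arbitrary choices):

    x <- scale(faithful$waiting)[, 1]    # scale the feature first (see the feature-scaling slides) so the loop converges quickly
    y <- faithful$eruptions
    theta <- c(0, 0)                     # (theta0, theta1)
    alpha <- 0.1
    for (iter in 1:1000) {
      h <- theta[1] + theta[2] * x                  # predictions h_theta(x_i)
      grad <- c(mean(h - y), mean((h - y) * x))     # (1/n) sum(h - y), (1/n) sum((h - y) * x_i)
      theta <- theta - alpha * grad                 # simultaneous update of theta0 and theta1
    }
    theta    # intercept and slope with respect to the scaled feature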

  • Gradient descent algorithm for linear regression

    Linear regression's cost function is bowl-shaped (convex), i.e., it does not have multiple local optima; gradient descent always converges to the global optimum.

    Batch gradient descent: each step of the algorithm uses all the training examples in the J(·) function.

    There are other algorithms that use only part of the training examples.

  • Multivariate Linear Regression - Feature Scaling (1)

    Idea: make sure that the features are on a similar scale if the dataset has multiple features. Gradient descent can then converge more quickly.

    For example, suppose a problem's dataset has two features xi,1 and xi,2:
    xi,1 = size (0–2000 feet²)
    xi,2 = number of bedrooms (1–5)

    It takes a long time to converge.

    A common strategy is to scale the feature values:

    xi,1 = size (feet²) / 2000
    xi,2 = (number of bedrooms) / 5

    Then the contour plot of J(θ) is much less skewed (closer to circular).

  • Multivariate Linear Regression - Feature Scaling (2)

    Idea: bring every feature into approximately the range −1 ≤ xi,j ≤ 1.

    xi,0 = 1 fits in the range.

    Generally, as long as the ranges are similar, it is OK.

    E.g., 0 ≤ xi,1 ≤ 3 and −1 ≤ xi,2 ≤ 1.5 are fine.

    Adding two more features (−200 ≤ xi,3 ≤ 200 and −0.001 ≤ xi,4 ≤ 0.001) makes the ranges very different. This is not desirable.

  • Multivariate Linear Regression - Feature Scaling (3)

    Mean normalization: another way of feature scaling.

    Replace xi,j with xi,j − µj to make the features have approximately zero mean (do not apply this to xi,0 = 1).

    E.g., xi,1 = (size − 1000)/2000, xi,2 = (#bedrooms − 2)/5. Then −0.5 ≤ xi,1 ≤ 0.5 and −0.5 ≤ xi,2 ≤ 0.5.

    Generally, replace xi,j by (xi,j − µj)/σj, where σj can be the standard deviation or max(xi,j) − min(xi,j).

    E.g., 0 ≤ xi,1 ≤ 3 and −1 ≤ xi,2 ≤ 1.5 are fine.

    Adding two more features (−200 ≤ xi,3 ≤ 200 and −0.001 ≤ xi,4 ≤ 0.001) makes the ranges very different. This is not desirable.
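    In R, mean normalization can be done by hand or with the built-in scale() function; a brief added sketch, using the house-price feature values from the example later in the deck:

    X <- data.frame(size = c(2104, 1416, 1534, 852), bedrooms = c(5, 3, 3, 2))

    # Mean normalization by hand: (x - mean) / (max - min)
    X_norm <- sweep(X, 2, colMeans(X))                                   # subtract column means
    X_norm <- sweep(X_norm, 2, sapply(X, function(v) max(v) - min(v)), "/")

    # Or standardize using the standard deviation instead
    X_std <- scale(X)    # (x - mean) / sd, column by column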

  • Multivariate Linear Regression - Normal Equation

    Normal equation: a method to solve for θ analytically.

    The normal equation method has some advantages and disadvantages, which will be discussed later.

  • Multivariate Linear Regression - Normal Equation

    General case: n examples (x1, y1), (x2, y2), · · · , (xn, yn), and each xi has d features.

    xi = (xi,0, xi,1, xi,2, · · · , xi,d)ᵀ ∈ R^{d+1}

    Construct an n × (d+1) matrix X from the training examples, adding the extra feature xi,0 with value 1:

    X = [ xᵀ1 ; xᵀ2 ; · · · ; xᵀn ] ∈ R^{n×(d+1)}    (row i of X is xᵀi)

    Construct an n-dimensional vector y:

    y = (y1, y2, · · · , yn)ᵀ

  • Multivariate Linear Regression - Normal Equation Example

    Example 1: house prices

    X =
    1 2104 5 1 45
    1 1416 3 2 40
    1 1534 3 2 40
    1  852 2 1 36

    y = (460, 232, 315, 178)ᵀ

    Example 2: one variable. If xi = (1, xi,1)ᵀ, then

    X =
    1 x1,1
    1 x2,1
    · · ·
    1 xn,1

    Hypothesis: hθ(xi) = θᵀxi, or h(X) = Xθ

    Solution: θ = (XᵀX)⁻¹Xᵀy
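    A short R sketch of this solution on the one-variable faithful example from earlier (an added illustration; solve(crossprod(X), crossprod(X, y)) or lm() would be the more numerically stable route):

    X <- cbind(1, faithful$waiting)              # add the x_0 = 1 column
    y <- faithful$eruptions

    theta <- solve(t(X) %*% X) %*% t(X) %*% y    # theta = (X^T X)^{-1} X^T y
    theta                                        # matches coef(lm(eruptions ~ waiting, data = faithful))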

  • Multivariate Linear Regression - Solve for θ - Analysis (1)

    hθ(xi) = θᵀxi,   J(θ) = (1/(2n)) ∑_{i=1}^{n} (θᵀxi − yi)² = (1/(2n)) ||Xθ − y||²

    Analysis: use the definition of the Euclidean norm of a vector.

    Let x = (x1, · · · , xn). Then ‖x‖ = √(x1² + · · · + xn²), so ‖x‖² = x1² + · · · + xn² = ∑_{i=1}^{n} xi².

    Let tempi = θᵀxi − yi. Then J(θ) = (1/(2n)) ∑_{i=1}^{n} (tempi)².

    Xθ = [ xᵀ1 ; xᵀ2 ; · · · ; xᵀn ] θ = (xᵀ1θ, xᵀ2θ, · · · , xᵀnθ)ᵀ = (θᵀx1, θᵀx2, · · · , θᵀxn)ᵀ

    Xθ − y = (θᵀx1 − y1, θᵀx2 − y2, · · · , θᵀxn − yn)ᵀ = (temp1, temp2, · · · , tempn)ᵀ

    Therefore J(θ) = (1/(2n)) ∑_{i=1}^{n} (tempi)² = (1/(2n)) ||Xθ − y||²

  • Multivariate Linear Regression - Solve for θ - Analysis (2)

    Rewriting J(θ) further:

    J(θ) = (1/(2n)) ||Xθ − y||²
         = (1/(2n)) (Xθ − y)ᵀ(Xθ − y)
         = (1/(2n)) ((Xθ)ᵀ − yᵀ)(Xθ − y)
         = (1/(2n)) ((Xθ)ᵀ(Xθ) − yᵀ(Xθ) − (Xθ)ᵀy + yᵀy)
         = (1/(2n)) (θᵀXᵀXθ − 2(Xθ)ᵀy + yᵀy)    (because yᵀ(Xθ) = (Xθ)ᵀy)

    Why is yᵀ(Xθ) = (Xθ)ᵀy? See the next slide for details.

  • Multivariate Linear Regression - Solve for θ - Analysis (3)

    Prove that yᵀ(Xθ) = (Xθ)ᵀy.

    (Xθ)ᵀy = θᵀXᵀy
           = θᵀ(x1, x2, · · · , xn)(y1, y2, · · · , yn)ᵀ
           = (θᵀx1, θᵀx2, · · · , θᵀxn)(y1, y2, · · · , yn)ᵀ
           = θᵀx1y1 + θᵀx2y2 + · · · + θᵀxnyn
           = ∑_{i=1}^{n} θᵀxi yi, which is a scalar value.

    Since (Xθ)ᵀy is a scalar, its transpose ((Xθ)ᵀy)ᵀ = yᵀ(Xθ) equals itself.

  • Multivariate Linear Regression - Solve for θ - Analysis (4)

    J(θ) is rewritten as (1/(2n)) (θᵀXᵀXθ − 2(Xθ)ᵀy + yᵀy)

    Solve for θ.

    Set the partial derivative ∂/∂θj J(θ) = 0 for every j to solve for θ0, θ1, · · · , θd:

    ∂J/∂θ = (1/(2n)) (2XᵀXθ − 2Xᵀy + 0) = (2/(2n)) Xᵀ(Xθ − y)

    Let ∂J/∂θ = 0, i.e., Xᵀ(Xθ − y) = 0. Then θ = (XᵀX)⁻¹Xᵀy.

    Solution: θ = (XᵀX)⁻¹Xᵀy

    Let X† = (XᵀX)⁻¹Xᵀ, so θ = X†y. X† is called the pseudo-inverse of X.

    Reason: X†X = (XᵀX)⁻¹XᵀX = I, so X† acts as a (left) inverse of X.

  • Multivariate Linear Regression - Normal equation, Noninvertibility

    θ = (XᵀX)⁻¹Xᵀy

    It is very rare that XᵀX is non-invertible. What if XᵀX is non-invertible (singular/degenerate)? Different software packages have different functions to handle this.

    A non-invertible XᵀX is generally caused by two reasons:

    The dataset has redundant (linearly dependent) features, e.g., x·1 = size in feet², x·2 = size in m².

    Too many features (d > n).

    Delete some features or use regularization.

    Feature scaling is not necessary if we use the normal equation method to solve for θ.
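    A small R illustration of the redundant-feature case (an added sketch; MASS::ginv computes a pseudo-inverse, while lm() would simply drop the aliased column):

    sqft <- c(2104, 1416, 1534, 852)
    sqm <- sqft * 0.0929                     # redundant feature: the same size in m^2
    X <- cbind(1, sqft, sqm)
    y <- c(460, 232, 315, 178)

    # t(X) %*% X is singular here, so solve() fails;
    # a pseudo-inverse (or deleting a feature, or regularization) still gives a fit
    theta <- MASS::ginv(t(X) %*% X) %*% t(X) %*% y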

  • Multivariate Linear Regression - Method comparison

    Gradient descent                     Normal equation
    Need to choose α                     No need to choose α
    Need many iterations                 No need to iterate
    Works well even when d is large      Need to compute (XᵀX)⁻¹ ∈ R^{(d+1)×(d+1)} (cost roughly O(d³));
                                         slow if d is very large (d ≥ 1000)

  • R - linear regression

    reg <- lm(eruptions ~ waiting, data = faithful)
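    A brief sketch of typical usage around this fit (an added illustration using standard lm, summary, predict, and plot calls, not necessarily the original slide's exact code):

    summary(reg)        # coefficients, residuals, R-squared
    coef(reg)           # intercept and slope

    # Predict the eruption time for a waiting time of 80 minutes
    predict(reg, newdata = data.frame(waiting = 80))

    plot(faithful$waiting, faithful$eruptions)
    abline(reg)         # add the fitted line to the scatter plot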

  • Linear basis function model

    Linear regression can be extended to a more complicated class of models: linear combinations of fixed nonlinear functions of the input variables.

    h(xi) = θ0 + ∑_{j=1}^{M−1} θj φj(xi)

    The φj(·) are basis functions. The total number of parameters is M.

    θ0 is the bias parameter, which allows for any fixed offset in the data.
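    For example, a polynomial basis φj(x) = x^j keeps the model linear in θ; a brief added R sketch (not from the original slides):

    # Cubic polynomial basis: phi_1(x) = x, phi_2(x) = x^2, phi_3(x) = x^3, so M = 4 parameters
    fit <- lm(eruptions ~ poly(waiting, 3, raw = TRUE), data = faithful)
    coef(fit)    # theta_0 (intercept) plus one coefficient per basis function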

  • Review

    Given the value of a d-dimensional vector x and the corresponding y values:

    Regression problem: predict the value of one or more continuous target variables y.

    h(x) = θᵀx

    Classification problem: the output is binary.

    h(x) = sign(θᵀx)

    Define a cost function.

    Solution: decide θ such that the cost function is minimal.
