
  • Classification - Regression

    Linear regression · Solution – Gradient descent · Solution – Normal equation · R code · More

    Huiping Cao

  • Outline

    1 Linear regression

    2 Solution – Gradient descent

    3 Solution - Normal equation

    4 R code

    5 More

  • Regression problem

    Regression problem: predict the value of one or more continuous target variables y given the value of a d-dimensional vector x.

    The output is a real number, which is why the method is called regression.

    Linear regression is one of the most basic and important techniques for predicting the value of an attribute (y).

    It is used to fit values in a forecasting or predictive model.

  • Linear regression – Example

    A collection of observations of the Old Faithful geyser in Yellowstone National Park, USA.

    > head(faithful)
      eruptions waiting
    1     3.600      79
    2     1.800      54
    3     3.333      74
    4     2.283      62
    5     4.533      85
    6     2.883      55

    There are two observation variables in the data set.

    eruptions: the duration of the geyser eruptions.
    waiting: the length of the waiting period until the next eruption.

    Predict eruption time given waiting time.

  • Linear regression – representation

    Training set: {x1, x2, · · · , xn}
    n is the number of training data points
    xi is a d-dimensional data point (or vector)
    y = (y1, · · · , yn) is the output
    (x, y): one training example
    (xi, yi): the i-th training example

    Hypothesis: h(·). The learning algorithm learns the hypothesis h. Using h, we can predict y for any given x.

  • Linear regression – How do we represent h

    Linear regression with one variable (univariate linear regression).

    hθ(x) = θ0 + θ1x

    Or, h(x) = θ0 + θ1x

    [Scatter plot of faithful$waiting (x-axis, 50–90) vs. faithful$eruptions (y-axis, 1.5–5.0)]

  • Linear regression – cost function (1)

    Hypothesis: hθ(x) = θ0 + θ1x.

    Parameters: the θi's.

    How do we choose the θi's?

    [Scatter plot of the training data, x (50–90) vs. y (1.5–4.5)]

  • Linear regression – cost function (2)

    Hypothesis: hθ(xi) = θ0 + θ1xi.

    Idea: choose θ0 and θ1 such that hθ(x) is close to y for our training examples.

    Intuitively,

    minimize_{θ0,θ1} ∑_{i=1}^{n} (hθ(xi) − yi)²

  • Linear regression – cost function (2)

    Cost function:

    J(θ0, θ1) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    The factor 1/n averages the cost, and the factor 1/2 makes the later analysis easier.

    J is called the squared error function, the cost function, or the squared error cost function.

    There are other cost functions, but the squared error cost function works well.

    Goal: minimize_{θ0,θ1} J(θ0, θ1)
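    As an added illustration (not part of the original slides), this cost function can be written as a small R helper; the function name cost and the example θ values below are hypothetical:

    cost <- function(theta0, theta1, x, y) {
      n <- length(y)
      h <- theta0 + theta1 * x      # hypothesis h_theta(x_i)
      sum((h - y)^2) / (2 * n)      # squared error cost J(theta0, theta1)
    }

    # Example: cost of a rough guess on the faithful data from the earlier slide
    cost(-1.8, 0.075, faithful$waiting, faithful$eruptions)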

  • Linear regression – cost function – simplified version

    Parameters: let θ0 = 0.

    The hypothesis becomes hθ(xi) = θ1xi.

    Goal: minimize_{θ1} J(θ1), i.e., minimize_{θ1} (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    Hypothesis function hθ(x): for fixed θ1, it is a function of x.

    Cost function J(θ1): a function of θ1.

  • Linear regression – cost function – simplified version J(θ1)

    Training data: three points (1, 1), (2, 2), (3, 3).

    Assume θ1 = 1. Then hθ(xi) = θ1xi = xi, so

    J(1) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)² = (1/(2n)) ∑_{i=1}^{n} (xi − yi)²
         = (1/6)(0² + 0² + 0²) = 0

    Assume θ1 = 0.5. Then

    J(0.5) = (1/(2n))((0.5 − 1)² + (1 − 2)² + (1.5 − 3)²)
           = (1/6)(0.25 + 1 + 2.25) ≈ 0.58

    Assume θ1 = 0. Then J(0) = (1/6)(1² + 2² + 3²) = 14/6 ≈ 2.33
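    These values can be checked numerically in R (an added sketch, not from the original slides; J is a hypothetical helper name):

    x <- c(1, 2, 3); y <- c(1, 2, 3)
    J <- function(theta1) sum((theta1 * x - y)^2) / (2 * length(x))
    J(1)    # 0
    J(0.5)  # 0.5833...
    J(0)    # 2.3333...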

  • Linear regression – Simplified version J(θ1)

    For each value of θ1, we can get a corresponding J(θ1).

    Remember that our objective is to minimize_{θ1} J(θ1), which is equivalent to finding a straight line that fits the training points well.

    The result is θ1 = 1 because J(1) = 0 is minimal.

  • Multivariate Linear Regression - motivation and notation

    Example
    y: house prices
    x·1: house square feet
    x·2: number of bedrooms
    x·3: age of home
    etc.

    Notation

    d: number of features
    xi: the i-th input example
    xi,j: the value of the j-th feature of the i-th training example

  • Multivariate Linear Regression - Hypothesis

    hθ(xi) = θ0 + θ1xi,1 + · · · + θd xi,d

    E.g., hθ(xi) = 80 + 0.1xi,1 + 0.01xi,2 + 3xi,3 − 2xi,4

    For convenience, denote xi,0 = 1 to define a 0th feature.

    We can define two vectors xi and θ:

    xi = (xi,0, xi,1, xi,2, · · · , xi,d)ᵀ ∈ R^{d+1},  θ = (θ0, θ1, θ2, · · · , θd)ᵀ ∈ R^{d+1}

    hθ(x) = θᵀx
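    As a small added illustration (not from the original slides), the hypothesis for one example is just an inner product; h, theta, and x below are hypothetical names, with the feature values borrowed from the house-price example later in the deck:

    h <- function(theta, x) sum(theta * c(1, x))   # prepend x_0 = 1, then compute theta^T x

    theta <- c(80, 0.1, 0.01, 3, -2)               # the example coefficients above
    x <- c(2104, 5, 1, 45)                         # feature values of one training example
    h(theta, x)                                    # predicted value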

  • Multivariate Linear Regression - Representation

    Hypothesis: hθ(xi) = θᵀxi where xi,0 = 1

    Parameters: θ0, θ1, θ2, · · · , θd (or θ ∈ R^{d+1})

    Cost function:

    J(θ0, θ1, · · · , θd) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    or

    J(θ) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    Linear regression models are linear functions of the adjustable parameters.

    The simplest form of linear regression model is also a linear function of the input variables.

  • Linear regression model

    Formally,

    Given (1) a training data set {x1, x2, · · · , xn}, where xi = (xi,1, xi,2, · · · , xi,d)ᵀ, together with (2) the corresponding target variables y = (y1, y2, · · · , yn)ᵀ.

    Goal: predict the value of y for a new observation x.

    Linear model: hθ(xi) = θᵀxi, or hθ(xi) = θ0 + θ1xi,1 + · · · + θd xi,d

    Decide d+1 parameters (a (d+1)-vector) θ = (θ0, θ1, · · · , θd) such that (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)² is minimal.

    Solution: (1) gradient descent, (2) normal equation (the analytic pseudo-inverse algorithm)

  • Gradient descent – intuition

    Gradient descent to minimize J(θ0, θ1)

    Outline:
    Start with some θ0, θ1 values (say θ0 = 0, θ1 = 0)
    Keep changing θ0, θ1 to reduce J(θ0, θ1)
    until we hopefully end up at a minimum

    In the general case, minimize_{θ0,θ1,··· ,θd} J(θ0, θ1, · · · , θd)

  • Gradient descent algorithm

    Repeat until convergence

    θj = θj − α (∂/∂θj) J(θ0, θ1), for j = 0 and j = 1

    Notation: partial derivative ∂/∂θ0; derivative d/dθ1

    α: learning rate; a large α indicates aggressive learning.

    θ0 and θ1 need to be updated simultaneously.

    CORRECT (simultaneous update):
    temp0 = θ0 − α (∂/∂θ0) J(θ0, θ1)
    temp1 = θ1 − α (∂/∂θ1) J(θ0, θ1)
    θ0 = temp0
    θ1 = temp1

    INCORRECT:
    temp0 = θ0 − α (∂/∂θ0) J(θ0, θ1)
    θ0 = temp0
    temp1 = θ1 − α (∂/∂θ1) J(θ0, θ1)
    θ1 = temp1
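    A small R sketch of one simultaneous update step (an added illustration; the gradient expressions are the one-variable derivatives given later in the deck, and alpha is an arbitrary choice):

    x <- faithful$waiting; y <- faithful$eruptions
    alpha <- 1e-4
    theta0 <- 0; theta1 <- 0

    h <- theta0 + theta1 * x                     # current predictions
    temp0 <- theta0 - alpha * mean(h - y)        # (1/n) * sum(h - y)
    temp1 <- theta1 - alpha * mean((h - y) * x)  # (1/n) * sum((h - y) * x_i)
    theta0 <- temp0                              # assign only after both temps are computed
    theta1 <- temp1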

  • Gradient descent algorithm - derivative

    Derivative: the slope of the tangent line.
    Case 1: the slope is positive. Then θ1 = θ1 − α × (positive number), so θ1 moves to a smaller value.
    Case 2: the slope is negative. Then θ1 = θ1 − α × (negative number), so θ1 moves to a bigger value.

    [Two plots of the curve (x − 0.5)²: (a) positive θ1, (b) negative θ1]

  • Gradient descent algorithm - learning rate

    If α is too small, gradient descent can converge very slowly.

    If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.

    [Two plots of the curve (x − 0.5)² showing gradient descent steps: (a) small α, (b) big α]

  • Gradient descent algorithm - local optimal

    What if θ1 is already at a local optimum?
    The slope is 0 (i.e., the derivative = 0), so θ1 = θ1 − α · derivative = θ1 − α · 0 = θ1.
    Gradient descent can converge to a local minimum even with a fixed learning rate α.
    As we approach a local minimum, gradient descent automatically takes smaller steps because d/dθ1 J(θ1) becomes smaller.

    [Plot of the curve (x − 0.5)²]

  • Gradient descent algorithm for linear regression - one variable

    Linear regression model (one variable)

    Cost function:

    J(θ0, θ1) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)² = (1/(2n)) ∑_{i=1}^{n} (θ0 + θ1xi,1 − yi)²

    Goal: min_{θ0,θ1} J(θ0, θ1)

    Derivatives:

    ∂/∂θ0 J(θ0, θ1) = (1/n) ∑_{i=1}^{n} (hθ(xi) − yi)

    ∂/∂θ1 J(θ0, θ1) = (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,1

  • Gradient descent algorithm for multivariate linear regression

    Hypothesis: hθ(xi) = θᵀxi where xi,0 = 1

    Parameters: θ ∈ R^{d+1}

    Cost function:

    J(θ) = (1/(2n)) ∑_{i=1}^{n} (hθ(xi) − yi)²

    Gradient descent:

    Repeat until convergence {
      θj = θj − α (∂/∂θj) J(θ), for j = 0, · · · , d
    }

  • Linear regression - Gradient descent

    Only one variable (feature), i.e., d = 1:

    Repeat until convergence {
      θ0 = θ0 − α (∂/∂θ0) J(θ) = θ0 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi)
      θ1 = θ1 − α (∂/∂θ1) J(θ) = θ1 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,1
      (simultaneously update θ0 and θ1)
    }

    Multiple variables (features), i.e., d ≥ 1:

    Repeat until convergence {
      θj = θj − α (∂/∂θj) J(θ) = θj − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,j
      (simultaneously update θj for j = 0, · · · , d)
    }

    Example:
    θ0 = θ0 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,0
    θ1 = θ1 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,1
    θ2 = θ2 − α (1/n) ∑_{i=1}^{n} (hθ(xi) − yi) · xi,2

    The update equation for θ0 is the same as in the d = 1 case because xi,0 = 1.

    The update equation for θ1 is the same as in the d = 1 case because xi,1 = xi.
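    To make the loop concrete, here is a short R sketch of batch gradient descent on the univariate faithful example (an added illustration, not code from the original slides; the learning rate and iteration count are arbitrary choices):

    x <- scale(faithful$waiting)[, 1]    # scale the feature first (see the feature-scaling slides) so the loop converges quickly
    y <- faithful$eruptions
    theta <- c(0, 0)                     # (theta0, theta1)
    alpha <- 0.1
    for (iter in 1:1000) {
      h <- theta[1] + theta[2] * x                  # predictions h_theta(x_i)
      grad <- c(mean(h - y), mean((h - y) * x))     # (1/n) sum(h - y), (1/n) sum((h - y) * x_i)
      theta <- theta - alpha * grad                 # simultaneous update of theta0 and theta1
    }
    theta    # intercept and slope with respect to the scaled feature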

  • Gradient descent algorithm for linear regression

    Linear regression's cost function is bowl-shaped (convex), i.e., it does not have multiple local optima; gradient descent always converges to the global optimum.

    Batch gradient descent: each step of the algorithm uses all the training examples in the J(·) function.

    There are other algorithms that use only part of the training examples.

  • Multivariate Linear Regression - Feature Scaling (1)

    Idea: make sure that the features are on a similar scale if the dataset has multiple features. Gradient descent can then converge more quickly.

    For example, suppose a problem's dataset has two features xi,1 and xi,2:
    xi,1 = size (0–2000 feet²)
    xi,2 = number of bedrooms (1–5)

    It takes a long time to converge.

    A common strategy is to scale the feature values:

    xi,1 = size (feet²) / 2000
    xi,2 = (number of bedrooms) / 5

    Then the contour plot of J(θ) is much less skewed (closer to circular).

  • Multivariate Linear Regression - Feature Scaling (2)

    Idea: bring every feature into approximately the range −1 ≤ xi,j ≤ 1.

    xi,0 = 1 fits in the range.

    Generally, as long as the ranges are similar, it is OK.

    E.g., 0 ≤ xi,1 ≤ 3 and −1 ≤ xi,2 ≤ 1.5 are fine.

    Adding two more features (−200 ≤ xi,3 ≤ 200 and −0.001 ≤ xi,4 ≤ 0.001) makes the ranges very different. This is not desirable.

  • Multivariate Linear Regression - Feature Scaling (3)

    Mean normalization: another way of feature scaling.

    Replace xi,j with xi,j − µj to make the features have approximately zero mean (do not apply this to xi,0 = 1).

    E.g., xi,1 = (size − 1000)/2000, xi,2 = (#bedrooms − 2)/5. Then −0.5 ≤ xi,1 ≤ 0.5 and −0.5 ≤ xi,2 ≤ 0.5.

    Generally, replace xi,j by (xi,j − µj)/σj, where σj can be the standard deviation or max(xi,j) − min(xi,j).

    E.g., 0 ≤ xi,1 ≤ 3 and −1 ≤ xi,2 ≤ 1.5 are fine.

    Adding two more features (−200 ≤ xi,3 ≤ 200 and −0.001 ≤ xi,4 ≤ 0.001) makes the ranges very different. This is not desirable.
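    In R, mean normalization can be done by hand or with the built-in scale() function; a brief added sketch, using the house-price feature values from the example later in the deck:

    X <- data.frame(size = c(2104, 1416, 1534, 852), bedrooms = c(5, 3, 3, 2))

    # Mean normalization by hand: (x - mean) / (max - min)
    X_norm <- sweep(X, 2, colMeans(X))                                   # subtract column means
    X_norm <- sweep(X_norm, 2, sapply(X, function(v) max(v) - min(v)), "/")

    # Or standardize using the standard deviation instead
    X_std <- scale(X)    # (x - mean) / sd, column by column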

  • Multivariate Linear Regression - Normal Equation

    Normal equation: a method to solve for θ analytically.

    The normal equation method has some advantages and disadvantages, which will be discussed later.

  • Multivariate Linear Regression - Normal Equation

    General case: n examples (x1, y1), (x2, y2), · · · , (xn, yn), and each xi has d features.

    xi = (xi,0, xi,1, xi,2, · · · , xi,d)ᵀ ∈ R^{d+1}

    Construct an n × (d+1) matrix X from the training examples, adding the extra feature xi,0 with value 1:

    X = [ xᵀ1 ; xᵀ2 ; · · · ; xᵀn ] ∈ R^{n×(d+1)}    (row i of X is xᵀi)

    Construct an n-dimensional vector y:

    y = (y1, y2, · · · , yn)ᵀ

  • Multivariate Linear Regression - Normal Equation Example

    Example 1: house prices

    X =
    1 2104 5 1 45
    1 1416 3 2 40
    1 1534 3 2 40
    1  852 2 1 36

    y = (460, 232, 315, 178)ᵀ

    Example 2: one variable. If xi = (1, xi,1)ᵀ, then

    X =
    1 x1,1
    1 x2,1
    · · ·
    1 xn,1

    Hypothesis: hθ(xi) = θᵀxi, or h(X) = Xθ

    Solution: θ = (XᵀX)⁻¹Xᵀy
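    A short R sketch of this solution on the one-variable faithful example from earlier (an added illustration; solve(crossprod(X), crossprod(X, y)) or lm() would be the more numerically stable route):

    X <- cbind(1, faithful$waiting)              # add the x_0 = 1 column
    y <- faithful$eruptions

    theta <- solve(t(X) %*% X) %*% t(X) %*% y    # theta = (X^T X)^{-1} X^T y
    theta                                        # matches coef(lm(eruptions ~ waiting, data = faithful))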

  • Multivariate Linear Regression - Solve for θ - Analysis (1)

    hθ(xi) = θᵀxi,   J(θ) = (1/(2n)) ∑_{i=1}^{n} (θᵀxi − yi)² = (1/(2n)) ||Xθ − y||²

    Analysis: use the definition of the Euclidean norm of a vector.

    Let x = (x1, · · · , xn). Then ‖x‖ = √(x1² + · · · + xn²), so ‖x‖² = x1² + · · · + xn² = ∑_{i=1}^{n} xi².

    Let tempi = θᵀxi − yi. Then J(θ) = (1/(2n)) ∑_{i=1}^{n} (tempi)².

    Xθ = [ xᵀ1 ; xᵀ2 ; · · · ; xᵀn ] θ = (xᵀ1θ, xᵀ2θ, · · · , xᵀnθ)ᵀ = (θᵀx1, θᵀx2, · · · , θᵀxn)ᵀ

    Xθ − y = (θᵀx1 − y1, θᵀx2 − y2, · · · , θᵀxn − yn)ᵀ = (temp1, temp2, · · · , tempn)ᵀ

    Therefore J(θ) = (1/(2n)) ∑_{i=1}^{n} (tempi)² = (1/(2n)) ||Xθ − y||²

  • Multivariate Linear Regression - Solve for θ - Analysis (2)

    Rewriting J(θ) further:

    J(θ) = (1/(2n)) ||Xθ − y||²
         = (1/(2n)) (Xθ − y)ᵀ(Xθ − y)
         = (1/(2n)) ((Xθ)ᵀ − yᵀ)(Xθ − y)
         = (1/(2n)) ((Xθ)ᵀ(Xθ) − yᵀ(Xθ) − (Xθ)ᵀy + yᵀy)
         = (1/(2n)) (θᵀXᵀXθ − 2(Xθ)ᵀy + yᵀy)    (because yᵀ(Xθ) = (Xθ)ᵀy)

    Why is yᵀ(Xθ) = (Xθ)ᵀy? See the next slide for details.

  • Multivariate Linear Regression - Solve for θ - Analysis (3)

    Prove that yᵀ(Xθ) = (Xθ)ᵀy.

    (Xθ)ᵀy = θᵀXᵀy
           = θᵀ(x1, x2, · · · , xn)(y1, y2, · · · , yn)ᵀ
           = (θᵀx1, θᵀx2, · · · , θᵀxn)(y1, y2, · · · , yn)ᵀ
           = θᵀx1y1 + θᵀx2y2 + · · · + θᵀxnyn
           = ∑_{i=1}^{n} θᵀxi yi, which is a scalar value.

    Since (Xθ)ᵀy is a scalar, its transpose ((Xθ)ᵀy)ᵀ = yᵀ(Xθ) equals itself.

  • Multivariate Linear Regression - Solve for θ - Analysis (4)

    J(θ) is rewritten as (1/(2n)) (θᵀXᵀXθ − 2(Xθ)ᵀy + yᵀy)

    Solve for θ.

    Set the partial derivative ∂/∂θj J(θ) = 0 for every j to solve for θ0, θ1, · · · , θd:

    ∂J/∂θ = (1/(2n)) (2XᵀXθ − 2Xᵀy + 0) = (2/(2n)) Xᵀ(Xθ − y)

    Let ∂J/∂θ = 0, i.e., Xᵀ(Xθ − y) = 0. Then θ = (XᵀX)⁻¹Xᵀy.

    Solution: θ = (XᵀX)⁻¹Xᵀy

    Let X† = (XᵀX)⁻¹Xᵀ, so θ = X†y. X† is called the pseudo-inverse of X.

    Reason: X†X = (XᵀX)⁻¹XᵀX = I, so X† acts as a (left) inverse of X.

  • Multivariate Linear Regression - Normal equation, Noninvertibility

    θ = (XᵀX)⁻¹Xᵀy

    It is very rare that XᵀX is non-invertible. What if XᵀX is non-invertible (singular/degenerate)? Different software packages have different functions to handle this.

    A non-invertible XᵀX is generally caused by two reasons:

    The dataset has redundant (linearly dependent) features, e.g., x·1 = size in feet², x·2 = size in m².

    Too many features (d > n).

    Delete some features or use regularization.

    Feature scaling is not necessary if we use the normal equation method to solve for θ.
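    A small R illustration of the redundant-feature case (an added sketch; MASS::ginv computes a pseudo-inverse, while lm() would simply drop the aliased column):

    sqft <- c(2104, 1416, 1534, 852)
    sqm <- sqft * 0.0929                     # redundant feature: the same size in m^2
    X <- cbind(1, sqft, sqm)
    y <- c(460, 232, 315, 178)

    # t(X) %*% X is singular here, so solve() fails;
    # a pseudo-inverse (or deleting a feature, or regularization) still gives a fit
    theta <- MASS::ginv(t(X) %*% X) %*% t(X) %*% y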

  • Multivariate Linear Regression - Method comparison

    Gradient descent                     Normal equation
    Need to choose α                     No need to choose α
    Need many iterations                 No need to iterate
    Works well even when d is large      Need to compute (XᵀX)⁻¹ ∈ R^{(d+1)×(d+1)} (cost roughly O(d³));
                                         slow if d is very large (d ≥ 1000)

  • R - linear regression

    reg <- lm(eruptions ~ waiting, data = faithful)
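    A brief sketch of typical usage around this fit (an added illustration using standard lm, summary, predict, and plot calls, not necessarily the original slide's exact code):

    summary(reg)        # coefficients, residuals, R-squared
    coef(reg)           # intercept and slope

    # Predict the eruption time for a waiting time of 80 minutes
    predict(reg, newdata = data.frame(waiting = 80))

    plot(faithful$waiting, faithful$eruptions)
    abline(reg)         # add the fitted line to the scatter plot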

  • Linear basis function model

    Linear regression can be extended to a more complicated class of models: linear combinations of fixed nonlinear functions of the input variables.

    h(xi) = θ0 + ∑_{j=1}^{M−1} θj φj(xi)

    The φj(·) are basis functions. The total number of parameters is M.

    θ0 is the bias parameter, which allows for any fixed offset in the data.
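    For example, a polynomial basis φj(x) = x^j keeps the model linear in θ; a brief added R sketch (not from the original slides):

    # Cubic polynomial basis: phi_1(x) = x, phi_2(x) = x^2, phi_3(x) = x^3, so M = 4 parameters
    fit <- lm(eruptions ~ poly(waiting, 3, raw = TRUE), data = faithful)
    coef(fit)    # theta_0 (intercept) plus one coefficient per basis function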

  • Review

    Given the value of a d-dimensional vector x and the corresponding y values:

    Regression problem: predict the value of one or more continuous target variables y.

    h(x) = θᵀx

    Classification problem: the output is binary.

    h(x) = sign(θᵀx)

    Define a cost function.

    Solution: decide θ such that the cost function is minimal.
