
Page 1

STA141C: Big Data & High Performance Statistical Computing

Lecture 7: Linear Regression, Linear System Solvers

Cho-Jui Hsieh, UC Davis

May 9, 2017

Page 2

Linear Regression

Page 3

Regression

Input: training data $x_1, x_2, \dots, x_n \in \mathbb{R}^d$ and corresponding outputs $y_1, y_2, \dots, y_n \in \mathbb{R}$

Training: compute a function $f$ such that $f(x_i) \approx y_i$ for all $i$

Prediction: given a testing sample $\tilde{x}$, predict the output as $f(\tilde{x})$

Examples:

Income, number of children ⇒ consumer spending

Processes, memory ⇒ power consumption

Financial reports ⇒ risk

Atmospheric conditions ⇒ precipitation

Page 4

Linear Regression

Assume $f(\cdot)$ is a linear function parameterized by $w \in \mathbb{R}^d$:

$$f(x) = w^T x$$

Training: compute the model $w \in \mathbb{R}^d$ such that $w^T x_i \approx y_i$ for all $i$

Prediction: given a testing sample $\tilde{x}$, the prediction value is $w^T \tilde{x}$

How to find $w$?

$$w^* = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^T x_i - y_i)^2$$

Page 5

Linear Regression: probability interpretation

Assume the data is generated from the probability model

$$y_i = w^T x_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, 1)$$

Maximum likelihood estimator:

$$
\begin{aligned}
w^* &= \arg\max_{w} \log P(y_1, \dots, y_n \mid x_1, \dots, x_n, w) \\
    &= \arg\max_{w} \sum_{i=1}^n \log P(y_i \mid x_i, w) \\
    &= \arg\max_{w} \sum_{i=1}^n \log\!\Big(\frac{1}{\sqrt{2\pi}}\, e^{-(w^T x_i - y_i)^2/2}\Big) \\
    &= \arg\max_{w} \sum_{i=1}^n -\frac{1}{2}(w^T x_i - y_i)^2 + \text{constant} \\
    &= \arg\min_{w} \sum_{i=1}^n (w^T x_i - y_i)^2
\end{aligned}
$$
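A quick numerical check of this equivalence (a minimal sketch assuming NumPy; all names are illustrative): simulate the generative model and confirm that the least-squares fit recovers $w$:

import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 3
w_true = np.array([1.0, -2.0, 0.5])

X = rng.standard_normal((n, d))
y = X @ w_true + rng.standard_normal(n)   # y_i = w^T x_i + N(0, 1) noise

# Under Gaussian noise, the MLE is exactly the least-squares solution
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(w_hat)   # close to w_true for large n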

Page 6

Linear Regression: written as a matrix form

Linear regression: $w^* = \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n (w^T x_i - y_i)^2$

Matrix form: let $X \in \mathbb{R}^{n \times d}$ be the matrix whose $i$-th row is $x_i$, and $y = [y_1, \dots, y_n]^T$; then linear regression can be written as

$$w^* = \arg\min_{w \in \mathbb{R}^d} \|Xw - y\|_2^2$$
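As a quick sanity check (a sketch assuming NumPy), the summation form and the matrix form compute the same objective:

import numpy as np

rng = np.random.default_rng(0)
X, y, w = rng.random((6, 3)), rng.random(6), rng.random(3)

sum_form = sum((w @ X[i] - y[i]) ** 2 for i in range(6))
matrix_form = np.linalg.norm(X @ w - y) ** 2
print(np.isclose(sum_form, matrix_form))   # True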

Page 7

Solving Linear Regression

Minimize the sum of squared errors $J(w)$:

$$
\begin{aligned}
J(w) &= \frac{1}{2}\|Xw - y\|^2 \\
     &= \frac{1}{2}(Xw - y)^T (Xw - y) \\
     &= \frac{1}{2} w^T X^T X w - y^T X w + \frac{1}{2} y^T y
\end{aligned}
$$

Derivative: $\frac{\partial}{\partial w} J(w) = X^T X w - X^T y$

Setting the derivative equal to zero gives the normal equation

$$X^T X w^* = X^T y$$

Therefore, $w^* = (X^T X)^{-1} X^T y$,

but $X^T X$ may be non-invertible . . .
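When $X^T X$ is invertible, solving the normal equation directly matches the least-squares solver; a minimal sketch assuming NumPy:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 4))   # n > d, so X^T X is (almost surely) invertible
y = rng.random(50)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)    # normal equation
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # SVD-based least squares
print(np.allclose(w_normal, w_lstsq))           # True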


Page 9

Linear System Solver

Linear system: given $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, find an $x \in \mathbb{R}^n$ such that

$$Ax = b$$

Three cases:

$A$ is invertible ($m = n$ and full rank) ⇒ unique solution $x = A^{-1}b$

Under-determined system: $\operatorname{rank}(A) = m$ but $n > m$ ⇒ multiple solutions; usually we want the "least-norm" solution

$$x^* = \arg\min_x \|x\|_2^2 \quad \text{s.t.} \quad Ax = b$$

Over-determined system: $\operatorname{rank}(A) < m$ ⇒ (usually) no solution ⇒ output $x^* = \arg\min_x \|Ax - b\|^2$

Do not compute the inverse of a matrix! It is numerically problematic and time-consuming.
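A small illustration of that warning (a sketch assuming NumPy): `solve` factorizes $A$ and back-substitutes, while forming the inverse does strictly more work:

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((500, 500))
b = rng.random(500)

x_solve = np.linalg.solve(A, b)   # LU factorization + triangular solves
x_inv = np.linalg.inv(A) @ b      # forms the full inverse first: slower

print(np.linalg.norm(A @ x_solve - b))  # residual of the direct solve
print(np.linalg.norm(A @ x_inv - b))    # typically no better, often worse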


Page 13

Linear System Solver

First case: $A \in \mathbb{R}^{m \times m}$, $A$ is invertible (full rank)

Use "Gaussian elimination" or, equivalently, "LU factorization"

Call "numpy.linalg.solve":

>>> import numpy as np
>>> a = np.array([[3, 1], [1, 2]])
>>> b = np.array([9, 8])
>>> x = np.linalg.solve(a, b)
>>> x
array([ 2.,  3.])
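When the same $A$ must be solved against several right-hand sides, the LU factors can be computed once and reused; a sketch assuming SciPy is available:

import numpy as np
from scipy.linalg import lu_factor, lu_solve

a = np.array([[3.0, 1.0], [1.0, 2.0]])
lu, piv = lu_factor(a)   # one LU factorization of a

x1 = lu_solve((lu, piv), np.array([9.0, 8.0]))  # cheap triangular solves
x2 = lu_solve((lu, piv), np.array([1.0, 0.0]))  # reuse the same factors
print(x1)   # [2. 3.]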

Page 14

Linear System Solver

Under-determined and over-determined systems can be solved by SVD:

$$Ax = b \;\Rightarrow\; U \Sigma V^T x = b \;\Rightarrow\; x = V \Sigma^{\dagger} U^T b \quad \text{(pseudo-inverse)}$$

$$\Sigma^{\dagger} = \operatorname{diag}(\sigma_1^{-1}, \sigma_2^{-1}, \dots, \sigma_k^{-1}, 0, \dots, 0)$$

(assuming $A$ has $k$ nonzero singular values)

Call “numpy.linalg.lstsq”
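The pseudo-inverse route can also be written out explicitly; a sketch assuming NumPy, for a full-rank under-determined $A$ so that every singular value is nonzero:

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 10))   # under-determined: 5 equations, 10 unknowns
b = rng.random(5)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)   # V * Sigma^dagger * U^T * b

x_pinv = np.linalg.pinv(A) @ b   # built-in pseudo-inverse, same answer
print(np.allclose(x_svd, x_pinv))      # True
print(np.linalg.norm(A @ x_svd - b))   # ~0: an exact solution exists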

Page 15

Linear System Solver

>>> import numpy as np
>>> a = np.random.rand(5, 10)   # under-determined: exact solution exists
>>> b = np.random.rand(5, 1)
>>> x = np.linalg.lstsq(a, b, rcond=None)
>>> np.linalg.norm(a.dot(x[0]) - b)
7.4320704251928296e-16
>>> a = np.random.rand(5, 3)    # over-determined: least-squares residual
>>> b = np.random.rand(5, 1)
>>> x = np.linalg.lstsq(a, b, rcond=None)
>>> np.linalg.norm(a.dot(x[0]) - b)
0.35374284817556079

Page 16

Solve multiple linear systems

Many times we need to solve

$$A x_i = b_i \quad \text{for all } i = 1, \dots, N$$

They can all be solved together, so only one SVD (or other decomposition) is needed:

>>> import numpy as np
>>> a = np.random.rand(5, 3)
>>> b = np.random.rand(5, 4)   # 4 right-hand sides, one per column
>>> x = np.linalg.lstsq(a, b, rcond=None)
>>> x[0]   # solutions for the 4 linear systems, one per column
array([[ 0.15914526,  0.44365737,  0.31351924,  0.3476335 ],
       [ 0.30223114,  0.54325633,  0.22719821,  1.05852352],
       [ 0.07991735,  0.09856708,  0.08663738, -0.29111466]])
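The "one decomposition, many right-hand sides" idea made explicit; a sketch assuming NumPy:

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))
B = rng.random((5, 4))   # 4 right-hand sides, one per column

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # factor A once
X = Vt.T @ ((U.T @ B) / s[:, None])   # least-squares solution per column

X_ref = np.linalg.lstsq(A, B, rcond=None)[0]
print(np.allclose(X, X_ref))   # True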

Page 17

Solving Linear Regression

Normal equation: $X^T X w^* = X^T y$

If $X^T X$ is invertible (typically when # samples > # features):

$$w^* = (X^T X)^{-1} X^T y$$

If $X^T X$ is singular (typically when # features > # samples): an infinite number of solutions

In general, just use "numpy.linalg.lstsq"
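A sketch (assuming NumPy) of the rank-deficient case: lstsq still returns a solution, the minimum-norm one, even though $(X^T X)^{-1}$ does not exist:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 10))   # more features than samples: X^T X is singular
y = rng.random(5)

w = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm solution via SVD
print(np.linalg.norm(X @ w - y))   # ~0: the training data is interpolated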

Page 18

Regularized Linear Regression

Page 19

Overfitting

Overfitting: the model has low training error but high prediction error.

Using too many features can lead to overfitting

Page 20

Regularization to Avoid Overfitting

Enforce the solution to have a low L2-norm:

$$\arg\min_{w} \sum_{i=1}^n (w^T x_i - y_i)^2 \quad \text{s.t.} \quad \|w\|^2 \le K$$

This is equivalent to the following problem for some $\lambda$:

$$\arg\min_{w} \sum_{i=1}^n (w^T x_i - y_i)^2 + \lambda \|w\|^2$$

Page 21

Regularized Linear Regression

Regularized linear regression:

$$\arg\min_{w} \|Xw - y\|^2 + R(w)$$

$R(w)$: regularization term

Ridge regression ($\ell_2$ regularization):

$$\arg\min_{w} \|Xw - y\|^2 + \lambda \|w\|^2$$

Lasso ($\ell_1$ regularization):

$$\arg\min_{w} \|Xw - y\|^2 + \lambda \|w\|_1$$

Note that $\|w\|_1 = \sum_{i=1}^d |w_i|$

Page 22

Regularization

Lasso: the solution is sparse, but there is no closed-form solution
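With no closed form, lasso is solved iteratively (e.g., by coordinate descent). A minimal sketch assuming scikit-learn is installed; its `alpha` parameter plays the role of $\lambda$:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.01 * rng.standard_normal(100)

ridge = Ridge(alpha=0.1).fit(X, y)    # l2 penalty: coefficients shrink
lasso = Lasso(alpha=0.01).fit(X, y)   # l1 penalty: coordinate descent inside
print(np.sum(lasso.coef_ != 0))       # l1 sets some coefficients exactly to zero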

Page 23

Ridge Regression

Ridge regression:

$$\arg\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2}_{J(w)}$$

Closed-form solution: the optimal solution $w^*$ satisfies $\nabla J(w^*) = 0$:

$$X^T X w^* - X^T y + \lambda w^* = 0$$

$$(X^T X + \lambda I) w^* = X^T y$$

Optimal solution: $w^* = (X^T X + \lambda I)^{-1} X^T y$

The inverse always exists because $X^T X + \lambda I$ is positive definite

What's the computational complexity?
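In code, the closed form is one linear solve, not an explicit inverse; a minimal sketch assuming NumPy, with `lam` standing in for $\lambda$:

import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y with a linear solver."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = rng.random(100)
w = ridge_fit(X, y, lam=0.1)   # well-defined even if X^T X is singular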

Page 24

Time Complexity

When $X$ is dense:

The closed-form solution requires $O(nd^2 + d^3)$ time

Efficient if $d$ is very small

Runs forever when $d > 100{,}000$

Typical case for big-data applications:

$X \in \mathbb{R}^{n \times d}$ is sparse, with large $n$ and large $d$

How can we solve the problem?

Page 25

Coming up

Optimization

Questions?