Least Squares Approximation/Optimization


Page 1: Least Squares Approximation/Optimization

Computational Methods
Least Squares Approximation/Optimization
© Manfred Huber 2011

Page 2: Least Squares

- Least squares methods are aimed at finding approximate solutions when no precise solution exists
  - Find the solution that minimizes the residual error of the system
- Least squares can be used to fit a model to noisy data points or to fit a simpler model to complex data
  - Amounts to projecting higher-dimensional data onto a lower-dimensional space

Page 3: Linear Least Squares

- Linear least squares attempts to find a least squares solution for an overdetermined linear system (i.e. a linear system described by an m x n matrix A with more equations than parameters):

    A x \approx b

- Least squares minimizes the squared Euclidean norm of the residual:

    \min_x \| b - A x \|_2^2 = \min_x \| r \|_2^2

- For data fitting on m data points using a linear combination of n basis functions this corresponds to

    \min_\alpha \sum_{i=1}^{m} \Big( y_i - \sum_{j=1}^{n} \alpha_j \phi_j(x_i) \Big)^2

Page 4: Existence and Uniqueness

- The linear least squares problem always has a solution
  - The solution is unique if and only if A has full rank, i.e. rank(A) = n
  - If rank(A) < n then A is rank-deficient and the solution of the least squares problem is not unique
- If the solution is unique, the squared residual norm can be expressed through A:

    \| r \|_2^2 = r^T r = (b - A x)^T (b - A x) = b^T b - 2 x^T A^T b + x^T A^T A x

- This is minimized where its derivative with respect to x is 0:

    -2 A^T b + 2 A^T A x = 0

Page 5: Normal Equations

- The optimization reduces to an n x n system of (linear) normal equations:

    A^T A x = A^T b

  - The linear least squares solution can be found by solving this system of linear equations
- The solution can also be found through the pseudoinverse:

    A x \approx b \;\Rightarrow\; x = A^+ b, \qquad A^+ = (A^T A)^{-1} A^T

- The condition number with respect to A can be expressed as in the case of solving linear equations:

    \mathrm{cond}(A) = \| A \|_2 \, \| A^+ \|_2
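A minimal NumPy sketch (not part of the original slides) comparing the normal-equations route, the pseudoinverse, and the library least squares solver; the matrix A and vector b are made-up example data.

    import numpy as np

    # Overdetermined example system A x ~= b (m = 4 equations, n = 2 unknowns)
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0],
                  [1.0, 3.0],
                  [1.0, 4.0]])
    b = np.array([1.1, 1.9, 3.2, 3.9])

    # Normal equations: solve (A^T A) x = A^T b
    x_normal = np.linalg.solve(A.T @ A, A.T @ b)

    # Pseudoinverse: x = A^+ b
    x_pinv = np.linalg.pinv(A) @ b

    # Built-in least squares solver (numerically preferable for ill-conditioned A)
    x_lstsq, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)

    print(x_normal, x_pinv, x_lstsq)   # all three agree for a well-conditioned A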

Page 6: Data Fitting

- A common use of the linear least squares solution is to fit a given type of function to noisy data points
- nth order polynomial fit using the monomial basis:

    A \alpha =
    \begin{pmatrix}
      1 & x_1 & \cdots & x_1^n \\
      1 & x_2 & \cdots & x_2^n \\
      \vdots & \vdots & \ddots & \vdots \\
      1 & x_m & \cdots & x_m^n
    \end{pmatrix} \alpha \approx
    \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}

- Solving the system of normal equations provides the best fit in terms of the Euclidean norm:

    A^T A \alpha = A^T y
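A short sketch of this kind of fit in NumPy; the function name polyfit_lstsq and the noisy sample data are illustrative assumptions, not from the slides.

    import numpy as np

    def polyfit_lstsq(x, y, degree):
        """Fit a polynomial of the given degree by linear least squares."""
        # Monomial (Vandermonde) design matrix: columns 1, x, x^2, ..., x^degree
        A = np.vander(x, degree + 1, increasing=True)
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coeffs

    # Hypothetical noisy samples of a quadratic
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 20)
    y = 0.1 + 0.4 * x + 1.5 * x**2 + 0.05 * rng.standard_normal(x.size)
    print(polyfit_lstsq(x, y, 2))   # roughly recovers (0.1, 0.4, 1.5)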

Page 7: Condition Number and Sensitivity

- Sensitivity also depends on b
- The influence of b can be expressed in terms of the angle between b and y = Ax:

    \cos(\theta) = \frac{\| y \|_2}{\| b \|_2} = \frac{\| A x \|_2}{\| b \|_2}

- A bound on the error in the solution due to a perturbation in b can be expressed as

    \frac{\| \Delta x \|_2}{\| x \|_2} \le \mathrm{cond}(A) \, \frac{1}{\cos(\theta)} \, \frac{\| \Delta b \|_2}{\| b \|_2}

Page 8: Condition Number and Sensitivity

- A bound on the error in the solution due to a perturbation in A can be expressed as

    \frac{\| \Delta x \|_2}{\| x \|_2} \le \left( (\mathrm{cond}(A))^2 \tan(\theta) + \mathrm{cond}(A) \right) \frac{\| \Delta A \|_2}{\| A \|_2}

- For small residuals the condition number for least squares is approximately cond(A)
- For large residuals the condition number can be the square of cond(A) or worse

Page 9: Condition Number and Sensitivity

- The conditioning of the normal equations solution is

    \mathrm{cond}(A^T A) = (\mathrm{cond}(A))^2

- For large systems of equations the condition number of the formulation using the normal equations (or the pseudoinverse) increases rapidly
  - Much of the increased sensitivity is due to the need to form A^T A in order to be able to apply a solution algorithm for the system of equations
  - The conditioning of the normal equations is potentially significantly worse than the conditioning of the original system
- The algorithm is not very stable for large numbers of equations/data points
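A small numerical illustration of this squaring effect (an assumed example, not from the slides), using a mildly ill-conditioned Vandermonde design matrix.

    import numpy as np

    x = np.linspace(0.0, 1.0, 50)
    A = np.vander(x, 6, increasing=True)   # columns 1, x, ..., x^5

    kA = np.linalg.cond(A)
    kAtA = np.linalg.cond(A.T @ A)
    print(kA, kAtA, kA**2)   # cond(A^T A) matches cond(A)^2 up to rounding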

Page 10: Augmented System Method

- The augmented system method transforms the least squares problem into an equation solving problem by adding equations, and can be used to improve the conditioning
  - Increase the matrix size to a square (m+n) x (m+n) matrix by including the residual equations:

    r + A x = b, \qquad A^T r = 0

    \begin{pmatrix} I & A \\ A^T & 0 \end{pmatrix} \begin{pmatrix} r \\ x \end{pmatrix} = \begin{pmatrix} b \\ 0 \end{pmatrix}

  - Greater freedom to choose pivots and maintain stability
  - Substantially higher complexity for systems with m >> n due to the need to solve an (n+m) x (n+m) system
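A sketch of the augmented system approach, assuming a dense solve of the (m+n) x (m+n) system is acceptable; lstsq_augmented and the small example data are illustrative only.

    import numpy as np

    def lstsq_augmented(A, b):
        """Solve min ||b - A x||_2 via the augmented (m+n) x (m+n) system."""
        m, n = A.shape
        K = np.block([[np.eye(m), A],
                      [A.T, np.zeros((n, n))]])
        rhs = np.concatenate([b, np.zeros(n)])
        sol = np.linalg.solve(K, rhs)
        r, x = sol[:m], sol[m:]
        return x, r

    A = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
    b = np.array([1.0, 0.2, 0.9])
    x, r = lstsq_augmented(A, b)
    print(x, np.allclose(A.T @ r, 0.0))   # residual is orthogonal to range(A)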

Page 11: QR Factorization

- The augmented system method addresses stability but adds very high cost since it expands the matrix
- QR factorization changes the system matrix into solvable form without computing A^T A and without expanding the matrix:

    A = Q
    \begin{pmatrix}
      R_{1,1} & R_{1,2} & \cdots & R_{1,n} \\
      0 & R_{2,2} & \cdots & R_{2,n} \\
      \vdots & & \ddots & \vdots \\
      0 & \cdots & 0 & R_{n,n} \\
      0 & 0 & \cdots & 0 \\
      \vdots & & \ddots & \vdots \\
      0 & 0 & \cdots & 0
    \end{pmatrix}
    = Q \begin{pmatrix} R \\ 0 \end{pmatrix}, \qquad Q^T A x = Q^T b

Page 12: QR Factorization

- The important part of Q for the solution consists of its first n columns (the first n rows of Q^T), since the remaining rows of the transformed system are multiplied by the zero block below R
- QR factorization factors the system matrix A into Q and R, where R is upper triangular
  - As in LU factorization, Q^T represents a sequence of solution-preserving transformations
    - In LU factorization only the identity of the solution has to be preserved, which can be done using elimination matrices
    - In QR factorization the least squares characteristic has to be preserved, which requires the use of orthogonal transformations
  - R is an upper triangular matrix that can be used for backward substitution to compute the solution:

    Q = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix} \;\Rightarrow\; Q_1^T A x = R x = Q_1^T b
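A minimal sketch of least squares via QR with explicit backward substitution; the function names are illustrative, and np.linalg.qr stands in for the Householder-based factorization discussed on the following slides.

    import numpy as np

    def back_substitution(R, c):
        """Solve R x = c for an upper triangular R."""
        n = R.shape[0]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (c[i] - R[i, i+1:] @ x[i+1:]) / R[i, i]
        return x

    def lstsq_qr(A, b):
        """Least squares via the reduced QR factorization A = Q1 R."""
        Q1, R = np.linalg.qr(A, mode='reduced')   # Q1: m x n, R: n x n upper triangular
        return back_substitution(R, Q1.T @ b)

    A = np.array([[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]])
    b = np.array([1.0, 0.2, 0.9])
    print(lstsq_qr(A, b))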

Page 13: Orthogonal Transformations

- Orthogonal transformations preserve the least squares solution:

    Q Q^T = I

    \| Q v \|_2^2 = (Q v)^T Q v = v^T Q^T Q v = v^T v = \| v \|_2^2

- For QR factorization we need a set of orthogonal transformations that can transform A into an upper triangular matrix R and that are numerically stable:
  - Householder transformations
  - Givens rotations
  - Gram-Schmidt orthogonalization

Page 14: QR Factorization with Householder Transformations

- Householder transformations allow zeroing all entries below a chosen position in a vector a
  - Applying this consecutively for every column of the matrix A, choosing the diagonal element as the position below which everything is to be zeroed out, we can construct R
- The Householder transformation can be computed from a given vector (the vector of column values), a, setting all the values that are not to be changed (entries above the diagonal) to 0:

    Q = H = I - 2 \frac{v v^T}{v^T v}, \qquad v = a - \beta e_i, \qquad \beta = \pm \| a \|_2

  - The sign of β can be chosen to avoid cancellation (loss of significance) in the computation of v
- H is orthogonal and symmetric:

    H = H^T = H^{-1}
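A small sketch of this construction; householder_vector and apply_householder are illustrative names, the sign choice follows the cancellation-avoidance rule above, and H is never formed explicitly.

    import numpy as np

    def householder_vector(a):
        """Householder vector v that reflects a onto a multiple of e_1."""
        beta = -np.copysign(np.linalg.norm(a), a[0])   # sign chosen to avoid cancellation
        v = a.copy()
        v[0] -= beta
        return v, beta

    def apply_householder(v, B):
        """Apply H = I - 2 v v^T / (v^T v) to B without forming H explicitly."""
        return B - 2.0 * np.outer(v, v @ B) / (v @ v)

    a = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
    v, beta = householder_vector(a)
    print(beta, apply_householder(v, a.reshape(-1, 1)).ravel())   # ~ (beta, 0, 0, 0, 0)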

Page 15: QR Factorization with Householder Transformations

- In QR factorization with Householder transformations, successive columns are brought into upper triangular form by zeroing all entries below the diagonal element
  - The Householder transformation does not affect the columns to the left or the rows above the currently chosen diagonal element, since v contains only 0s above the chosen diagonal position and the columns to the left already contain only 0s at and below it
  - Applying the Householder transformation only to the columns to the right saves significant processing time
  - Applying it to individual columns also eliminates the need to compute the full matrix H; only the vector v is needed:

    H a_j = \left( I - 2 \frac{v v^T}{v^T v} \right) a_j = a_j - 2 \frac{v v^T}{v^T v} a_j = a_j - 2 v \frac{v^T a_j}{v^T v} = a_j - 2 \frac{v^T a_j}{v^T v} v

    v = a_j - \beta e_j

Page 16: QR Factorization Example

- Quadratic polynomial fit to 5 data points:

    Data: (-1, 1), (-0.5, 0.5), (0, 0), (0.5, 0.5), (1, 2)

- Design the linear system for data fitting:

    A \alpha =
    \begin{pmatrix}
      1 & x_1 & x_1^2 \\
      1 & x_2 & x_2^2 \\
      1 & x_3 & x_3^2 \\
      1 & x_4 & x_4^2 \\
      1 & x_5 & x_5^2
    \end{pmatrix} \alpha \approx
    \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix}
    \;\Rightarrow\;
    \begin{pmatrix}
      1 & -1 & 1 \\
      1 & -0.5 & 0.25 \\
      1 & 0 & 0 \\
      1 & 0.5 & 0.25 \\
      1 & 1 & 1
    \end{pmatrix} \alpha \approx
    \begin{pmatrix} 1 \\ 0.5 \\ 0 \\ 0.5 \\ 2 \end{pmatrix}

- Starting with the first column, eliminate the entries below the diagonal using appropriate Householder transformations, choosing the sign of β to avoid cancellation

Page 17: QR Factorization Example

- Householder elimination by column
- 1st column:

    v_1 = a_1 - \beta_1 e_1 =
    \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} + \sqrt{5}
    \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} =
    \begin{pmatrix} 3.236 \\ 1 \\ 1 \\ 1 \\ 1 \end{pmatrix}

  - Negative sign on β because the potentially cancelling term is +1

    H_1 A =
    \begin{pmatrix}
      -2.236 & 0 & -1.118 \\
      0 & -0.191 & -0.405 \\
      0 & 0.309 & -0.655 \\
      0 & 0.809 & -0.405 \\
      0 & 1.309 & 0.345
    \end{pmatrix}, \qquad
    H_1 y =
    \begin{pmatrix} -1.789 \\ -0.362 \\ -0.862 \\ -0.362 \\ 1.138 \end{pmatrix}

Page 18: QR Factorization Example

- Householder elimination by column
- 2nd column:

    v_2 = a_2 - \beta_2 e_2 =
    \begin{pmatrix} 0 \\ -0.191 \\ 0.309 \\ 0.809 \\ 1.309 \end{pmatrix} - \sqrt{2.5}
    \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \end{pmatrix} =
    \begin{pmatrix} 0 \\ -1.772 \\ 0.309 \\ 0.809 \\ 1.309 \end{pmatrix}

  - Positive sign on β because the potentially cancelling term is -0.191

    H_2 H_1 A =
    \begin{pmatrix}
      -2.236 & 0 & -1.118 \\
      0 & 1.581 & 0 \\
      0 & 0 & -0.725 \\
      0 & 0 & -0.589 \\
      0 & 0 & 0.047
    \end{pmatrix}, \qquad
    H_2 H_1 y =
    \begin{pmatrix} -1.789 \\ 0.632 \\ -1.035 \\ -0.816 \\ 0.404 \end{pmatrix}

Page 19: QR Factorization Example

- Householder elimination by column
- 3rd column:

    v_3 = a_3 - \beta_3 e_3 =
    \begin{pmatrix} 0 \\ 0 \\ -0.725 \\ -0.589 \\ 0.047 \end{pmatrix} - \sqrt{0.875}
    \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{pmatrix} =
    \begin{pmatrix} 0 \\ 0 \\ -1.66 \\ -0.589 \\ 0.047 \end{pmatrix}

  - Positive sign on β because the potentially cancelling term is -0.725

    H_3 H_2 H_1 A =
    \begin{pmatrix}
      -2.236 & 0 & -1.118 \\
      0 & 1.581 & 0 \\
      0 & 0 & 0.935 \\
      0 & 0 & 0 \\
      0 & 0 & 0
    \end{pmatrix}, \qquad
    H_3 H_2 H_1 y =
    \begin{pmatrix} -1.789 \\ 0.632 \\ 1.336 \\ 0.026 \\ 0.337 \end{pmatrix}

Page 20: QR Factorization Example

[Figure: resulting fitted curve and the original data points; example from Heath, Scientific Computing]

- Backward substitution with the upper triangular matrix yields the parameters and the least squares fit second order polynomial:

    \alpha = \begin{pmatrix} 0.086 \\ 0.4 \\ 1.429 \end{pmatrix}, \qquad p(x) = 0.086 + 0.4 x + 1.429 x^2
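The example can be reproduced with a few lines of NumPy; this is a sketch that uses the library QR factorization rather than the hand computation above.

    import numpy as np

    # The slides' example: quadratic fit to five points via QR least squares
    x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
    y = np.array([ 1.0,  0.5, 0.0, 0.5, 2.0])

    A = np.vander(x, 3, increasing=True)   # columns: 1, x, x^2
    Q1, R = np.linalg.qr(A, mode='reduced')
    alpha = np.linalg.solve(R, Q1.T @ y)   # back substitution on the triangular R
    print(alpha)                           # ~ [0.086, 0.4, 1.429]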

Page 21: Other Orthogonal Transformations

- Other transformations can be used for QR factorization:
  - Givens rotations
  - Gram-Schmidt orthogonalization
- Householder transformations generally achieve the best performance and stability tradeoff
  - The complexity of QR factorization with Householder transformations is approximately mn^2 - n^3/3 multiplications
    - Depending on the size of m (data points / equations) this is between the same and two times the work of the normal equations
  - The conditioning of QR factorization with Householder transformations is optimal; the sensitivity of the least squares solution is governed by

    \mathrm{cond}(A) + \| r \|_2 \, (\mathrm{cond}(A))^2

Page 22: QR Factorization

- The transformations used for QR factorization are numerically more stable than the elimination steps in LU factorization
  - The choice of sign in the Householder transformations makes it possible to avoid cancellation, and thus instabilities, in the individual transforms
  - Row pivoting is usually not necessary
- This stability leads to QR factorization also frequently being used instead of LU factorization to solve nonsingular systems of linear equations
  - The increased complexity is traded off against the stability (and thus precision) of the solution

Page 23: Nonlinear Least Squares

- As for equation solving, finding solutions for general nonlinear systems requires iterative methods
- The goal is to find the approximate solution with the smallest square residual, ρ, for the system of functions f(x):

    r_i(x) = b_i - f_i(x), \qquad \rho(x) = \frac{1}{2} r^T(x) r(x)

- At the minimum, the gradient of the square residual function is 0:

    \nabla \rho(x) = J_r^T(x) r(x) = 0

- One could use the multivariate Newton method to find the root of this gradient, which would require the second derivative, i.e. the Hessian of the square error:

    H_\rho(x) = J_r^T(x) J_r(x) + \sum_{i=1}^{m} r_i(x) H_{r_i}(x)

Page 24: Gauss-Newton Method

- The multivariate Newton method would require solving the following linear system in each iteration:

    \left( J_r^T(x) J_r(x) + \sum_{i=1}^{m} r_i(x) H_{r_i}(x) \right) s = -J_r^T(x) r(x)

  - This requires frequent computation of the Hessian, which is expensive and reduces stability
- Gauss-Newton avoids this by dropping the second order term:

    J_r^T(x) J_r(x) s = -J_r^T(x) r(x)

  - This is best solved by converting it into the corresponding linear least squares problem and using QR factorization:

    J_r(x) s \approx -r(x)

- Once the step is computed, Gauss-Newton operates like the multivariate Newton method:

    x_{t+1} = x_t + s
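A compact sketch of the Gauss-Newton loop just described; residual and jacobian are assumed user-supplied callables, and a fixed iteration count stands in for a proper convergence test.

    import numpy as np

    def gauss_newton(residual, jacobian, x0, iterations=10):
        """Gauss-Newton: repeatedly solve the linear least squares problem J s ~ -r."""
        x = np.asarray(x0, dtype=float)
        for _ in range(iterations):
            r = residual(x)
            J = jacobian(x)
            s, *_ = np.linalg.lstsq(J, -r, rcond=None)   # QR/SVD-based solve
            x = x + s
        return x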

Page 25: Gauss-Newton Method

- The Gauss-Newton method replaces the nonlinear least squares problem with a sequence of linear least squares problems whose solutions converge to the solution of the nonlinear system
  - Converges if the residual at the solution is not too large
    - A large residual at the solution can lead to large values in the Hessian, potentially leading to slow convergence or, in extreme cases, non-convergence
- If it does not converge (large residual at the solution) other methods have to be used:
  - The Levenberg-Marquardt method, which uses an additional scalar parameter (and a separate, function-specific strategy to choose it) to modify the step size:

    \left( J_r^T(x) J_r(x) + \mu I \right) s = -J_r^T(x) r(x)
    \;\Leftrightarrow\;
    \begin{pmatrix} J_r(x) \\ \sqrt{\mu}\, I \end{pmatrix} s \approx \begin{pmatrix} -r(x) \\ 0 \end{pmatrix}

  - General optimization using the complete Hessian
    - Significant increase in computational complexity
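A sketch of a single damped step using the stacked least squares form shown above; lm_step is an illustrative name, and the strategy for choosing and updating μ is deliberately left out.

    import numpy as np

    def lm_step(J, r, mu):
        """One damped (Levenberg-Marquardt style) step via the stacked system."""
        n = J.shape[1]
        J_aug = np.vstack([J, np.sqrt(mu) * np.eye(n)])
        r_aug = np.concatenate([-r, np.zeros(n)])
        s, *_ = np.linalg.lstsq(J_aug, r_aug, rcond=None)
        return s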

Page 26: Gauss-Newton Example

- Fit an exponential function to data:

    Data: (0, 2), (1, 0.7), (2, 0.3), (3, 0.1)

    f(\alpha, x) = \alpha_1 e^{\alpha_2 x}

- The residual function for data fitting is given by

    r(\alpha) = \begin{pmatrix} 2 - f(\alpha, 0) \\ 0.7 - f(\alpha, 1) \\ 0.3 - f(\alpha, 2) \\ 0.1 - f(\alpha, 3) \end{pmatrix}

- The resulting Jacobian is

    J_r(\alpha) = \begin{pmatrix}
      -1 & 0 \\
      -e^{\alpha_2} & -\alpha_1 e^{\alpha_2} \\
      -e^{2\alpha_2} & -2\alpha_1 e^{2\alpha_2} \\
      -e^{3\alpha_2} & -3\alpha_1 e^{3\alpha_2}
    \end{pmatrix}
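The residual and Jacobian of this example expressed as NumPy functions (a sketch; the variable names are illustrative).

    import numpy as np

    # The slides' example data and model f(alpha, x) = alpha_1 * exp(alpha_2 * x)
    x_data = np.array([0.0, 1.0, 2.0, 3.0])
    y_data = np.array([2.0, 0.7, 0.3, 0.1])

    def residual(alpha):
        return y_data - alpha[0] * np.exp(alpha[1] * x_data)

    def jacobian(alpha):
        e = np.exp(alpha[1] * x_data)
        return np.column_stack([-e, -alpha[0] * x_data * e])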

Page 27: Gauss-Newton Example

- First iteration starting at α^(0) = (1, 0)^T
- Initial square residual:

    \left\| r\!\begin{pmatrix} 1 \\ 0 \end{pmatrix} \right\|_2^2 = \sum_{i=1}^{4} \left( y_i - 1 \cdot e^0 \right)^2 = 2.39

- Solve for the step:

    J_r\!\begin{pmatrix} 1 \\ 0 \end{pmatrix} s =
    \begin{pmatrix} -1 & 0 \\ -1 & -1 \\ -1 & -2 \\ -1 & -3 \end{pmatrix} s \approx
    \begin{pmatrix} -1 \\ 0.3 \\ 0.7 \\ 0.9 \end{pmatrix}
    \;\Rightarrow\; s = \begin{pmatrix} 0.69 \\ -0.61 \end{pmatrix}

- Update the parameter vector for the next iteration:

    \alpha^{(1)} = \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} 0.69 \\ -0.61 \end{pmatrix} = \begin{pmatrix} 1.69 \\ -0.61 \end{pmatrix}

Page 28: Gauss-Newton Example

- Second iteration
- Square residual:

    \left\| r\!\begin{pmatrix} 1.69 \\ -0.61 \end{pmatrix} \right\|_2^2 = \sum_{i=1}^{4} \left( y_i - 1.69\, e^{-0.61 x_i} \right)^2 = 0.212

- Solve for the next step:

    J_r\!\begin{pmatrix} 1.69 \\ -0.61 \end{pmatrix} s =
    \begin{pmatrix} -1 & 0 \\ -0.543 & -0.918 \\ -0.295 & -0.998 \\ -0.16 & -0.813 \end{pmatrix} s \approx
    \begin{pmatrix} -0.31 \\ 0.218 \\ 0.199 \\ 0.171 \end{pmatrix}
    \;\Rightarrow\; s = \begin{pmatrix} 0.285 \\ -0.32 \end{pmatrix}

- Update the parameter vector for the next iteration:

    \alpha^{(2)} = \begin{pmatrix} 1.69 \\ -0.61 \end{pmatrix} + \begin{pmatrix} 0.285 \\ -0.32 \end{pmatrix} = \begin{pmatrix} 1.975 \\ -0.93 \end{pmatrix}

Page 29: Gauss-Newton Example

- Third iteration
- Square residual:

    \left\| r\!\begin{pmatrix} 1.975 \\ -0.93 \end{pmatrix} \right\|_2^2 = \sum_{i=1}^{4} \left( y_i - 1.975\, e^{-0.93 x_i} \right)^2 = 0.007

- Solve for the next step:

    J_r\!\begin{pmatrix} 1.975 \\ -0.93 \end{pmatrix} s =
    \begin{pmatrix} -1 & 0 \\ -0.395 & -0.779 \\ -0.156 & -0.615 \\ -0.061 & -0.364 \end{pmatrix} s \approx
    \begin{pmatrix} -0.025 \\ 0.079 \\ 0.007 \\ 0.02 \end{pmatrix}
    \;\Rightarrow\; s = \begin{pmatrix} 0.019 \\ -0.074 \end{pmatrix}

- Update the parameter vector for the next iteration:

    \alpha^{(3)} = \begin{pmatrix} 1.975 \\ -0.93 \end{pmatrix} + \begin{pmatrix} 0.019 \\ -0.074 \end{pmatrix} = \begin{pmatrix} 1.994 \\ -1.004 \end{pmatrix}
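Running three Gauss-Newton iterations in NumPy reproduces the iterates above; this is a self-contained sketch of the same computation.

    import numpy as np

    x_data = np.array([0.0, 1.0, 2.0, 3.0])
    y_data = np.array([2.0, 0.7, 0.3, 0.1])

    alpha = np.array([1.0, 0.0])              # starting guess from the slides
    for k in range(3):                        # three Gauss-Newton iterations
        r = y_data - alpha[0] * np.exp(alpha[1] * x_data)
        e = np.exp(alpha[1] * x_data)
        J = np.column_stack([-e, -alpha[0] * x_data * e])
        s, *_ = np.linalg.lstsq(J, -r, rcond=None)
        alpha = alpha + s
        print(k + 1, alpha, r @ r)            # new iterate and previous squared residual
    # final alpha ~ (1.994, -1.004), matching the slides' third iterate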

Page 30: Least Squares Approximation

- Least squares approximation is used to determine approximate solutions for a system of equations or to fit an approximate function to a set of data points
  - As opposed to interpolation, the data points are not matched exactly
    - Applicable to noisy data points
    - Allows a less complex function to be fitted to the data points
- Least squares for linear functions has direct solution methods
  - The normal equations method is not very stable for large numbers of equations
  - QR factorization provides a more stable alternative that can also be used for equation solving instead of LU factorization
- Least squares for nonlinear systems requires iterative solution methods
  - Gauss-Newton provides a robust solution for problems with a small residual
  - Otherwise general, more expensive optimization methods have to be used