TRANSCRIPT
The Conjugate Gradient Method
Jason E. Hicken
Aerospace Design Lab
Department of Aeronautics & Astronautics
Stanford University
14 July 2011
Lecture Objectives
• describe when CG can be used to solve Ax = b
• relate CG to the method of conjugate directions
• describe what CG does geometrically
• explain each line in the CG algorithm
We are interested in solving the linear system

Ax = b

where x, b ∈ ℝⁿ and A ∈ ℝⁿˣⁿ, and the matrix A is symmetric positive-definite (SPD):

Aᵀ = A (symmetric)
xᵀAx > 0 for all x ≠ 0 (positive-definite)

Such systems arise in, e.g.,
• discretization of elliptic PDEs
• optimization of quadratic functionals
• nonlinear optimization problems
When A is SPD, solving the linear system is the same as minimizing the quadratic form

f(x) = (1/2) xᵀAx − bᵀx.

Why? If x⋆ is the minimizing point, then

∇f(x⋆) = Ax⋆ − b = 0

and, for x ≠ x⋆,

f(x) − f(x⋆) > 0. (homework)
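This equivalence is easy to spot-check numerically. A minimal pure-Python sketch, using the 2×2 SPD matrix from the model problem that appears later in the lecture (the helper functions `matvec` and `dot` are ours, not part of the slides):

```python
# Spot-check: for SPD A, the gradient of f vanishes at the solution of Ax = b,
# and f is strictly larger at any other point.

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def f(A, b, x):
    # quadratic form f(x) = (1/2) x^T A x - b^T x
    return 0.5 * dot(x, matvec(A, x)) - dot(b, x)

A = [[5.0, -3.0], [-3.0, 5.0]]   # SPD model matrix used later in the lecture
b = [4.0, 4.0]
x_star = [2.0, 2.0]              # exact solution of A x = b

grad = [axi - bi for axi, bi in zip(matvec(A, x_star), b)]   # grad f = A x - b
print(grad)   # [0.0, 0.0]

# f(x) - f(x_star) > 0 for x != x_star (spot-check a few points)
for x in ([0.0, 0.0], [1.0, 3.0], [2.5, 1.5]):
    assert f(A, b, x) > f(A, b, x_star)
```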
Definitions
Let xᵢ be the approximate solution to Ax = b at iteration i.

error: eᵢ ≡ xᵢ − x
residual: rᵢ ≡ b − Axᵢ

The following identities for the residual will be useful later.

rᵢ = −Aeᵢ
rᵢ = −∇f(xᵢ)
Model problem

[  5  −3 ] (x₁)   (4)
[ −3   5 ] (x₂) = (4)

[Figure: contours of the quadratic form, with the lines 5x₁ − 3x₂ = 4 and −3x₁ + 5x₂ = 4 intersecting at the solution x = (2, 2)ᵀ]
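For reference, a 2×2 system this small can be solved directly; a quick Cramer's-rule sketch confirming the solution x = (2, 2)ᵀ shown in the figure:

```python
# Solve the 2x2 model problem [5 -3; -3 5] x = (4, 4)^T by Cramer's rule.
A = [[5.0, -3.0], [-3.0, 5.0]]
b = [4.0, 4.0]

det = A[0][0] * A[1][1] - A[0][1] * A[1][0]      # 25 - 9 = 16
x1 = (b[0] * A[1][1] - A[0][1] * b[1]) / det     # (20 + 12) / 16 = 2
x2 = (A[0][0] * b[1] - b[0] * A[1][0]) / det     # (20 + 12) / 16 = 2
print([x1, x2])   # [2.0, 2.0]
```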
Review: Steepest Descent Method
Qualitatively, how will steepest descent proceed on our model problem, starting at x₀ = (1/3, 1)ᵀ?

[Figure: contours of the quadratic form, with the zig-zagging steepest-descent iterates starting from x₀]
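The zig-zag behaviour can be reproduced with a short steepest-descent sketch (pure Python; the tolerance and 1000-iteration cap are arbitrary choices of ours): even for this 2×2 system, many more than two iterations are needed.

```python
# Steepest descent with exact line search on the model problem,
# starting from x0 = (1/3, 1)^T as on the slide.

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

A = [[5.0, -3.0], [-3.0, 5.0]]
b = [4.0, 4.0]
x = [1.0 / 3.0, 1.0]

iters = 0
r = [bi - axi for bi, axi in zip(b, matvec(A, x))]   # r = -grad f(x)
while dot(r, r) > 1e-20 and iters < 1000:
    alpha = dot(r, r) / dot(r, matvec(A, r))         # exact line search
    x = [xi + alpha * ri for xi, ri in zip(x, r)]
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    iters += 1

print(iters, x)   # converges to ~(2, 2), but in far more than 2 iterations
```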
How can we eliminate this zig-zag behaviour?

To find the answer, we begin by considering the easier problem

[ 2  0 ] (x₁)   (4√2)
[ 0  8 ] (x₂) = ( 0 ),

f(x) = x₁² + 4x₂² − 4√2 x₁.

Here, the equations are decoupled, so we can minimize in each direction independently. What do the contours of the corresponding quadratic form look like?
Simplified problem

[ 2  0 ] (x₁)   (4√2)
[ 0  8 ] (x₂) = ( 0 )

[Figure: elliptical contours aligned with the coordinate axes; the error e₀ is decomposed along the two axes and eliminated one axis at a time]
Method of Orthogonal Directions

Idea: Express the error as a sum of n orthogonal search directions

e ≡ x₀ − x = ∑_{i=0}^{n-1} αᵢdᵢ.

At iteration i + 1, eliminate component αᵢdᵢ.
• never need to search along dᵢ again
• converge in n iterations!

How would we apply the method of orthogonal directions to a non-diagonal matrix?
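On the decoupled problem the idea is immediate; a sketch (the starting point (0.5, 1)ᵀ is an arbitrary choice of ours), converging in exactly n = 2 line minimizations along the coordinate axes:

```python
import math

# Method of orthogonal directions on the diagonal problem
# [2 0; 0 8] x = (4*sqrt(2), 0)^T, searching along the coordinate axes.

A = [[2.0, 0.0], [0.0, 8.0]]
b = [4.0 * math.sqrt(2.0), 0.0]
x = [0.5, 1.0]                  # arbitrary starting point

for i in range(2):
    # exact line minimization along axis i: alpha = (b - A x)_i / A_ii
    r_i = b[i] - sum(A[i][j] * x[j] for j in range(2))
    x[i] += r_i / A[i][i]       # eliminates the i-th error component exactly

print(x)   # [2*sqrt(2), 0.0]: converged in n = 2 iterations
```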
Review of Inner Products
The search directions in the method of orthogonal directions are orthogonal with respect to the dot product.

The dot product is an example of an inner product.

Inner Product
For x, y, z ∈ ℝⁿ and α ∈ ℝ, an inner product (·,·) : ℝⁿ × ℝⁿ → ℝ satisfies
• symmetry: (x, y) = (y, x)
• linearity: (αx + y, z) = α(x, z) + (y, z)
• positive-definiteness: (x, x) > 0 ⇔ x ≠ 0
Fact: (x, y)_A ≡ xᵀAy is an inner product.

A-orthogonality (conjugacy)
We say two vectors x, y ∈ ℝⁿ are A-orthogonal, or conjugate, if

(x, y)_A = xᵀAy = 0.

What happens if we use A-orthogonality rather than standard orthogonality in the method of orthogonal directions?
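A quick numerical illustration (the pair below is a hypothetical conjugate pair we constructed for the model matrix): two vectors can be A-orthogonal without being orthogonal in the usual dot-product sense.

```python
# Check A-orthogonality (conjugacy) for the model matrix.

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def a_inner(A, x, y):
    # the inner product (x, y)_A = x^T A y
    return dot(x, matvec(A, y))

A = [[5.0, -3.0], [-3.0, 5.0]]
x = [1.0, 0.0]
y = [3.0, 5.0]            # chosen so that x^T A y = 5*3 - 3*5 = 0

print(a_inner(A, x, y))   # 0.0: x and y are conjugate
print(dot(x, y))          # 3.0: but not orthogonal in the usual sense
```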
Let {p₀, p₁, . . . , pₙ₋₁} be a set of n linearly independent vectors that are A-orthogonal. If pᵢ is the i-th column of P, then

PᵀAP = Σ

where Σ is a diagonal matrix.

Substitute x = Py into the quadratic form:

f(Py) = (1/2) yᵀΣy − (Pᵀb)ᵀy.

We can apply the method of orthogonal directions in y-space.
New Problem: how do we get the set {pᵢ} of conjugate vectors?

Gram-Schmidt Conjugation
Let {d₀, d₁, . . . , dₙ₋₁} be a set of linearly independent vectors, e.g., the coordinate axes.
• set p₀ = d₀
• for i > 0,

pᵢ = dᵢ − ∑_{j=0}^{i-1} βᵢⱼpⱼ

where βᵢⱼ = (dᵢ, pⱼ)_A / (pⱼ, pⱼ)_A.
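The procedure above transcribes directly into code; a minimal pure-Python sketch, conjugating the coordinate axes with respect to the model matrix:

```python
# Gram-Schmidt conjugation of the coordinate axes with respect to
# the model matrix A, producing A-orthogonal directions p0, p1.

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def a_inner(A, x, y):
    return dot(x, matvec(A, y))

A = [[5.0, -3.0], [-3.0, 5.0]]
d = [[1.0, 0.0], [0.0, 1.0]]    # linearly independent seed vectors

p = [list(d[0])]                # p0 = d0
for i in range(1, len(d)):
    pi = list(d[i])
    for j in range(i):
        # beta_ij = (d_i, p_j)_A / (p_j, p_j)_A
        beta = a_inner(A, d[i], p[j]) / a_inner(A, p[j], p[j])
        pi = [pik - beta * pjk for pik, pjk in zip(pi, p[j])]
    p.append(pi)

print(p)                        # [[1.0, 0.0], [0.6, 1.0]]
print(a_inner(A, p[0], p[1]))   # 0.0: the directions are conjugate
```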
The Method of Conjugate Directions
Force the error at iteration i + 1 to be conjugate to the search direction pᵢ:

pᵢᵀAeᵢ₊₁ = pᵢᵀA(eᵢ + αᵢpᵢ) = 0

⇒ αᵢ = −(pᵢᵀAeᵢ)/(pᵢᵀApᵢ) = (pᵢᵀrᵢ)/(pᵢᵀApᵢ)

• never need to search along pᵢ again
• converge in n iterations!
The Method of Conjugate Directions

[  5  −3 ] (x₁)   (4)
[ −3   5 ] (x₂) = (4)

[Figure sequence: the coordinate axes d₀, d₁ are conjugated into the A-orthogonal directions p₀, p₁; starting from x₀, the iterates reach the solution x = (2, 2)ᵀ in two steps]
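The two-step convergence in the figure can be reproduced with a short sketch, using the conjugate pair p₀ = (1, 0)ᵀ, p₁ = (3/5, 1)ᵀ obtained by Gram-Schmidt conjugation of the coordinate axes, and the same starting point x₀ = (1/3, 1)ᵀ used earlier:

```python
# Method of conjugate directions on the model problem; with n = 2
# A-orthogonal directions it reaches the exact solution in 2 steps.

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

A = [[5.0, -3.0], [-3.0, 5.0]]
b = [4.0, 4.0]
p = [[1.0, 0.0], [0.6, 1.0]]    # A-orthogonal (conjugate) search directions

x = [1.0 / 3.0, 1.0]
for pi in p:
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]   # current residual
    alpha = dot(pi, r) / dot(pi, matvec(A, pi))          # alpha_i = p_i^T r_i / p_i^T A p_i
    x = [xi + alpha * pik for xi, pik in zip(x, pi)]

print(x)   # ~ [2.0, 2.0] after exactly two iterations
```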
The Method of Conjugate Directions is well defined, and avoids the “zig-zagging” of Steepest Descent.

What about computational expense?
• If we choose the dᵢ in Gram-Schmidt conjugation to be the coordinate axes, the Method of Conjugate Directions is equivalent to Gaussian elimination.
• Keeping all the pᵢ is the same as storing a dense matrix!

Can we find a smarter choice for dᵢ?
Error Decomposition Using pᵢ

[Figure: the initial error e₀ decomposed along the conjugate directions as α₀p₀ + α₁p₁]
The error at iteration i can be expressed as

eᵢ = ∑_{k=i}^{n-1} αₖpₖ,

so the error must be conjugate to pⱼ for j < i:

pⱼᵀAeᵢ = 0 ⇒ pⱼᵀrᵢ = 0,

but from Gram-Schmidt conjugation we have

pⱼ = dⱼ − ∑_{k=0}^{j-1} βⱼₖpₖ.

Therefore

pⱼᵀrᵢ = dⱼᵀrᵢ − ∑_{k=0}^{j-1} βⱼₖpₖᵀrᵢ
0 = dⱼᵀrᵢ, j < i.
Thus, the residual at iteration i is orthogonal to the vectors dⱼ used in the previous iterations:

dⱼᵀrᵢ = 0, j < i

Idea: what happens if we choose dᵢ = rᵢ?
• residuals become mutually orthogonal
• rᵢ is orthogonal to pⱼ, for j < i ⋆
• rᵢ₊₁ becomes conjugate to pⱼ, for j < i

This last point is not immediately obvious, so we will prove it. This result has significant implications for Gram-Schmidt conjugation.

⋆ we showed this is true for any choice of dᵢ
The solution is updated according to

xⱼ₊₁ = xⱼ + αⱼpⱼ
⇒ rⱼ₊₁ = rⱼ − αⱼApⱼ
⇒ Apⱼ = (1/αⱼ)(rⱼ − rⱼ₊₁).

Next, take the dot product of both sides with an arbitrary residual rᵢ:

rᵢᵀApⱼ =  (rᵢᵀrᵢ)/αᵢ,     i = j
        = −(rᵢᵀrᵢ)/αᵢ₋₁,  i = j + 1
        = 0,               otherwise.
We can show that the first case (i = j) contains no new information (homework). Divide the remaining cases by pⱼᵀApⱼ and insert the definition of αᵢ₋₁:

(rᵢᵀApⱼ)/(pⱼᵀApⱼ) = βᵢⱼ = −(rᵢᵀrᵢ)/(rᵢ₋₁ᵀrᵢ₋₁),  i = j + 1
                        = 0,                      otherwise.

We recognize the L.H.S. as the coefficients in Gram-Schmidt conjugation
• only one coefficient is nonzero!
The Conjugate Gradient Method
Set p₀ = r₀ = b − Ax₀ and i = 0

αᵢ = (pᵢᵀrᵢ)/(pᵢᵀApᵢ)            (step length)
xᵢ₊₁ = xᵢ + αᵢpᵢ                 (sol. update)
rᵢ₊₁ = rᵢ − αᵢApᵢ                (resid. update)
βᵢ₊₁,ᵢ = −(rᵢ₊₁ᵀrᵢ₊₁)/(rᵢᵀrᵢ)    (G.S. coeff.)
pᵢ₊₁ = rᵢ₊₁ − βᵢ₊₁,ᵢ pᵢ          (Gram-Schmidt)
i := i + 1
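The algorithm transcribes line by line into code; a pure-Python sketch (the tolerance and iteration cap are our additions), keeping the slide's sign convention in which βᵢ₊₁,ᵢ carries a minus sign and is then subtracted in the direction update:

```python
# Conjugate Gradient, line by line from the slide, for SPD A.

def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def conjugate_gradient(A, b, x0, tol=1e-12, max_iter=100):
    x = list(x0)
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]   # r0 = b - A x0
    p = list(r)                                          # p0 = r0
    for _ in range(max_iter):
        rr = dot(r, r)
        if rr <= tol * tol:
            break
        Ap = matvec(A, p)
        alpha = dot(p, r) / dot(p, Ap)                   # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]    # sol. update
        r = [ri - alpha * api for ri, api in zip(r, Ap)] # resid. update
        beta = -dot(r, r) / rr                           # G.S. coeff. (note the minus sign)
        p = [ri - beta * pi for ri, pi in zip(r, p)]     # Gram-Schmidt update
    return x

A = [[5.0, -3.0], [-3.0, 5.0]]
b = [4.0, 4.0]
print(conjugate_gradient(A, b, [1.0 / 3.0, 1.0]))   # ~ [2.0, 2.0]
```

On this 2×2 SPD system, CG terminates in two iterations, matching the n-step convergence of the method of conjugate directions.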
The Conjugate Gradient Method

[  5  −3 ] (x₁)   (4)
[ −3   5 ] (x₂) = (4)

[Figure sequence: starting from x₀, the CG iterates reach the solution x = (2, 2)ᵀ in two steps]
Lecture Objectives
• describe when CG can be used to solve Ax = b
  A must be symmetric positive-definite
• relate CG to the method of conjugate directions
  CG is a method of conjugate directions with the choice dᵢ = rᵢ, which simplifies Gram-Schmidt conjugation
• describe what CG does geometrically
  Performs the method of orthogonal directions in a transformed space where the contours of the quadratic form are aligned with the coordinate axes
• explain each line in the CG algorithm
References
• Saad, Y., “Iterative Methods for Sparse Linear Systems”, second edition
• Shewchuk, J. R., “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain”