C25 Optimization – University of Oxford (az/lectures/opt/lect1.pdf, 13.1.2013)
TRANSCRIPT
-
C25 Optimization
8 Lectures, Hilary Term 2013
2 Tutorial Sheets
Prof. A. Zisserman
Overview
• Lecture 1 (AZ): Review of B1 unconstrained multivariate optimization, Levenberg-Marquardt algorithm.
• Lecture 2 (AZ): Discrete optimization, dynamic programming.
• Lectures 3-6 (BK): Classical optimization. Constrained optimization – equality constraints, Lagrange multipliers, inequality constraints. Nonlinear programming – search methods, approximation methods, axial iteration, pattern search, descent methods, quasi-Newton methods. Least squares and singular values.
• Lecture 7 (AZ): Discrete optimization on graphs
• Lecture 8 (AZ): Graph-cut, max-flow, and applications
Background reading and web resources
• Numerical Recipes in C (or C++): The Art of Scientific Computing. William H. Press, Brian P. Flannery, Saul A. Teukolsky, William T. Vetterling. CUP 1992/2002.
• Good chapter on optimization
• Available online at http://www.nrbook.com/a/bookcpdf.php
-
Background reading and web resources
• Convex Optimization. Stephen Boyd and Lieven Vandenberghe. CUP 2004.
• Available online at http://www.stanford.edu/~boyd/cvxbook/
• Further reading, web resources, and the lecture notes are on http://www.robots.ox.ac.uk/~az/lectures/opt
The Optimization Landscape
• continuous optimization: variables can take any values (subject to constraints), e.g. x > 0
• discrete optimization: variables are restricted to a set of values (often finite), e.g. binary, [0, 1, …, 255]
We will see that discrete optimization offers very efficient algorithms for cost functions with a particular structure.
-
Example: image processing – noise removal
Measured noisy data:
z_i = x_i + w_i
where x_i ∈ {0, 1} and w_i ∼ N(0, σ²).
Estimate the true (no-noise) signal by solving an optimization problem:
min_x f(x) = Σ_i [ (z_i − x_i)² + d φ(x_i, x_{i−1}) ]
where φ(x_i, x_j) = 0 if x_i = x_j, and 1 if x_i ≠ x_j.
[Figure: binary signal x_i and noisy measurements z_i]
• n is the number of pixels in the image, e.g. n = 20k
Now in 2D for an image …
f(x) = Σ_{i=1}^n [ (z_i − x_i)² + d Σ_{j∈N(i)} φ(x_i, x_j) ]
where N(i) is the neighbourhood of node i.
[Figure: original image; original plus noise]
Task of minimization:
• continuous optimization over IR^n: many local minima
• discrete optimization, restricting each x_i to 0 or 1: a search space of size O(2^n)
-
[Figure: original; original plus noise; min_x computed by the max-flow algorithm with d = 10]
[Figure: original; original plus noise; min_x computed by the max-flow algorithm with d = 60; n ∼ 20 K]
-
Lecture outline
B1 revision – unconstrained continuous optimization:
• Convexity
• Iterative optimization algorithms
• Gradient descent
• Newton's method
• Gauss-Newton method
New topics:
• Axial iteration
• Levenberg-Marquardt algorithm
• Application
Introduction: Problem specification

Suppose we have a cost function (or objective function) f(x) : IR^n → IR.
Our aim is to find the value of the parameters x that minimizes this function,
x* = argmin_x f(x)
subject to the following constraints:
• equality: c_i(x) = 0, i = 1, …, m_e
• inequality: c_i(x) ≥ 0, i = m_e + 1, …, m
We will start by focussing on unconstrained problems.
-
Unconstrained optimization

min_x f(x)
• down-hill search (gradient descent) algorithms can find local minima
• which of the minima is found depends on the starting point
• such minima often occur in real applications
[Figure: function of one variable f(x) with a local minimum and the global minimum]
Reminder: convexity
-
Class of functions …
[Figure: convex; not convex]
• Convexity provides a test for a single extremum
• A non-negative sum of convex functions is convex

Class of functions continued …
[Figure: single extremum – convex; single extremum – non-convex; multiple extrema – non-convex; noisy (not convex); horrible]
-
Optimization algorithm – key ideas
• Find δx such that f(x + δx) < f(x)
• This leads to an iterative update x_{n+1} = x_n + δx
• Reduce the problem to a series of 1D line searches: δx = αp
Choosing the direction 1: axial iteration
Alternate minimization over x and y
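Axial iteration on a quadratic bowl can be sketched with exact coordinate-wise minimizations; the function below is a hypothetical convex quadratic, not one from the lecture:

```python
def f(x, y):
    # A hypothetical convex quadratic with coupled variables (Hessian [[2,1],[1,4]]).
    return x * x + 2 * y * y + x * y - x - 3 * y

x, y = 10.0, 10.0
for _ in range(50):
    x = (1.0 - y) / 2.0   # exact minimizer over x with y fixed (df/dx = 2x + y - 1 = 0)
    y = (3.0 - x) / 4.0   # exact minimizer over y with x fixed (df/dy = x + 4y - 3 = 0)
```

Each sweep contracts the error by a constant factor, so the iterates converge to the unique minimum (here (1/7, 5/7)); progress can be slow when the variables are strongly coupled.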
-
Choosing the direction 2: steepest descent
Move in the direction of the gradient ∇f(xn)
• The gradient is everywhere perpendicular to the contour lines.
• After each line minimization the new gradient is always orthogonal to the previous step direction (true of any line minimization).
• Consequently, the iterates tend to zig-zag down the valley in a very inefficient manner.
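The orthogonality of successive steps is easy to verify numerically. A minimal sketch of steepest descent with exact line search on a quadratic f(x) = ½ xᵀAx − bᵀx (the matrix and starting point are illustrative choices); on a quadratic the exact step length along the gradient g is α = gᵀg / gᵀAg:

```python
def grad(p, A, b):
    """Gradient of f(x) = 0.5 x^T A x - b^T x, i.e. g = Ax - b."""
    x, y = p
    return [A[0][0] * x + A[0][1] * y - b[0],
            A[1][0] * x + A[1][1] * y - b[1]]

def steepest_descent(A, b, p, iters=100):
    """Steepest descent with exact line search; records the gradient at each step."""
    grads = []
    for _ in range(iters):
        g = grad(p, A, b)
        gg = g[0] * g[0] + g[1] * g[1]
        if gg < 1e-20:
            break
        Ag = [A[0][0] * g[0] + A[0][1] * g[1],
              A[1][0] * g[0] + A[1][1] * g[1]]
        alpha = gg / (g[0] * Ag[0] + g[1] * Ag[1])  # exact minimizer along -g
        grads.append(g)
        p = [p[0] - alpha * g[0], p[1] - alpha * g[1]]
    return p, grads

# A hypothetical elongated quadratic valley (condition number 10);
# the minimum is at the solution of Ax = b, here the origin.
A = [[1.0, 0.0], [0.0, 10.0]]
b = [0.0, 0.0]
p, grads = steepest_descent(A, b, [10.0, 1.0])
```

Successive gradients come out orthogonal, and the iterates zig-zag, shrinking the error by only a constant factor per step.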
-
A harder case: Rosenbrock’s function
f(x, y) = 100(y − x2)2 + (1− x)2
[Figure: contour plot of the Rosenbrock function]
Minimum is at [1,1]
[Figure: steepest descent on the Rosenbrock function, zoomed and full views]
Steepest descent on Rosenbrock function
• The zig-zag behaviour is clear in the zoomed view (100 iterations)
• The algorithm crawls down the valley
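The crawl can be reproduced in a few lines of Python: steepest descent with a crude backtracking line search (the halving schedule is an illustrative choice, not the line search used for the lecture's figures):

```python
def rosenbrock(x, y):
    return 100.0 * (y - x * x) ** 2 + (1.0 - x) ** 2

def rosen_grad(x, y):
    dfdx = -400.0 * x * (y - x * x) - 2.0 * (1.0 - x)
    dfdy = 200.0 * (y - x * x)
    return dfdx, dfdy

def descend(x, y, iters):
    """Steepest descent; halve the step until the cost strictly decreases."""
    for _ in range(iters):
        gx, gy = rosen_grad(x, y)
        alpha = 1.0
        while rosenbrock(x - alpha * gx, y - alpha * gy) >= rosenbrock(x, y):
            alpha *= 0.5
            if alpha < 1e-16:
                return x, y   # no further progress possible
        x, y = x - alpha * gx, y - alpha * gy
    return x, y

x, y = descend(-1.0, 1.0, 1000)   # 1000 iterations from (-1, 1)
```

The cost decreases monotonically, but even after many hundreds of iterations the iterates typically remain well short of the minimum at [1, 1].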
-
Choosing the direction 3: conjugate gradients

Conjugate gradients – sketch only
The method of conjugate gradients chooses successive descent directions p_n such that it is guaranteed to reach the minimum in a finite number of steps. Again, it uses first derivatives only, but avoids "undoing" previous work.
• Each p_n is chosen to be conjugate to all previous search directions with respect to the Hessian H:
p_n^T H p_j = 0, 0 ≤ j < n
• The resulting search directions are mutually linearly independent.
• Remarkably, p_n can be chosen using only knowledge of p_{n−1}, ∇f(x_{n−1}) and ∇f(x_n) (see Numerical Recipes):
p_n = −∇f_n + ( ∇f_n^T ∇f_n / ∇f_{n−1}^T ∇f_{n−1} ) p_{n−1}
• An N-dimensional quadratic form can be minimized in at most N conjugate descent steps.
• [Figure: 3 different starting points; the minimum is reached in exactly 2 steps]
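For a quadratic form, the recurrence above with exact line searches is linear conjugate gradients; a minimal sketch with a hypothetical 2×2 positive-definite matrix, reaching the minimum (the solution of Ax = b) in exactly 2 steps:

```python
def matvec(A, v):
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def conjugate_gradient(A, b, x):
    """Linear CG for f(x) = 0.5 x^T A x - b^T x; exact in at most 2 steps in 2D."""
    Ax = matvec(A, x)
    r = [b[0] - Ax[0], b[1] - Ax[1]]   # residual r = -grad f
    p = r[:]                           # first direction: steepest descent
    for _ in range(2):
        rr = dot(r, r)
        if rr == 0.0:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)        # exact line search along p
        x = [x[0] + alpha * p[0], x[1] + alpha * p[1]]
        r = [r[0] - alpha * Ap[0], r[1] - alpha * Ap[1]]
        beta = dot(r, r) / rr          # Fletcher-Reeves ratio |g_n|^2 / |g_{n-1}|^2
        p = [r[0] + beta * p[0], r[1] + beta * p[1]]  # next conjugate direction
    return x

# A hypothetical symmetric positive-definite matrix and an arbitrary start.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x = conjugate_gradient(A, b, [10.0, -7.0])
```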
-
Choosing the direction 4: Newton's method

Start from the Taylor expansion in 2D. A function may be approximated locally by its Taylor series expansion about a point x_0:

f(x + δx) ≈ f(x) + (∂f/∂x, ∂f/∂y) (δx, δy)^T + (1/2) (δx, δy) [ ∂²f/∂x², ∂²f/∂x∂y ; ∂²f/∂x∂y, ∂²f/∂y² ] (δx, δy)^T

The expansion to second order is a quadratic function:

f(x + δx) = a + g^T δx + (1/2) δx^T H δx

Now minimize this expansion over δx:

min_{δx} f(x + δx) = a + g^T δx + (1/2) δx^T H δx

For a minimum we require that ∇f(x + δx) = 0, and so

∇f(x + δx) = g + H δx = 0, with solution δx = −H^{-1} g (Matlab: δx = −H\g).

This gives the iterative update

x_{n+1} = x_n − H_n^{-1} g_n
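Since a quadratic function equals its own second-order expansion, a single Newton step from any starting point lands exactly on the minimum. A minimal sketch with a hypothetical quadratic, solving the 2×2 system by Cramer's rule:

```python
def f(x, y):
    # A hypothetical convex quadratic (Hessian [[6, 2], [2, 4]] is positive definite).
    return 3 * x * x + 2 * y * y + 2 * x * y - 4 * x - 6 * y

def grad(x, y):
    return 6 * x + 2 * y - 4, 2 * x + 4 * y - 6

H = ((6.0, 2.0), (2.0, 4.0))  # constant Hessian of the quadratic

def newton_step(x, y):
    """One update x_{n+1} = x_n - H^{-1} g, with the 2x2 solve by Cramer's rule."""
    gx, gy = grad(x, y)
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    dx = -(H[1][1] * gx - H[0][1] * gy) / det
    dy = -(-H[1][0] * gx + H[0][0] * gy) / det
    return x + dx, y + dy

x, y = newton_step(10.0, -5.0)  # a single step from an arbitrary start
gx, gy = grad(x, y)
```

After one step the gradient vanishes (up to rounding), illustrating the one-step behaviour on quadratics.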
-
• If f(x) is quadratic, then the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The solution δx = −H_n^{-1} g_n is guaranteed to be a downhill direction provided that H is positive definite.
• Rather than jump straight to the predicted solution at x_n − H_n^{-1} g_n, it is better to perform a line search:
x_{n+1} = x_n − α_n H_n^{-1} g_n
• If H = I then this reduces to steepest descent: x_{n+1} = x_n − g_n
Newton’s method - example
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2-1
-0.5
0
0.5
1
1.5
2
2.5
3Newton method with line search
gradient < 1e-3 after 15 iterations-2 -1 0 1 2-1
-0.5
0
0.5
1
1.5
2
2.5
3Newton method with line search
gradient < 1e-3 after 15 iterations
ellipses show successive quadratic approximations
• The algorithm converges in only 15 iterations – far superior to steepest descent
• However, the method requires computing the Hessian matrix at each iteration – this is not always feasible
-
Performance issues for optimization algorithms
1. Number of iterations required
2. Cost per iteration
3. Memory footprint
4. Region of convergence
Special structure for cost function – non-linear least squares
• It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals:
f(x) = Σ_{i=1}^M r_i(x)²
• If each residual r_i(x) depends non-linearly on the parameters x, then the minimization of f(x) is a non-linear least squares problem.
• We also assume that the residuals r_i are: (i) small at the optimum, and (ii) zero-mean.
-
Non-linear least squares

f(x) = Σ_{i=1}^M r_i²

Gradient:
∇f(x) = 2 Σ_i r_i(x) ∇r_i(x)

Hessian:
H = ∇∇^T f(x) = 2 Σ_i ∇( r_i(x) ∇^T r_i(x) )
  = 2 Σ_i [ ∇r_i(x) ∇^T r_i(x) + r_i(x) ∇∇^T r_i(x) ]

which is approximated as

H_GN = 2 Σ_i ∇r_i(x) ∇^T r_i(x)

This is the Gauss-Newton approximation.
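The Gauss-Newton step can be sketched on the Rosenbrock function written as a two-residual least-squares problem, r1 = 10(y − x²) and r2 = 1 − x (so f = r1² + r2²); the halving line search and iteration limits below are illustrative choices:

```python
def residuals(x, y):
    # Rosenbrock as least squares: f = r1^2 + r2^2
    return 10.0 * (y - x * x), 1.0 - x

def f(x, y):
    r1, r2 = residuals(x, y)
    return r1 * r1 + r2 * r2

def gauss_newton(x, y, max_iter=200, tol=1e-10):
    for _ in range(max_iter):
        r1, r2 = residuals(x, y)
        # Jacobian rows are the residual gradients: grad r1 = (-20x, 10), grad r2 = (-1, 0)
        a = 400.0 * x * x + 1.0      # [J^T J]_xx
        c = -200.0 * x               # [J^T J]_xy
        d = 100.0                    # [J^T J]_yy
        gx = -20.0 * x * r1 - r2     # J^T r (the common factor of 2 cancels in the step)
        gy = 10.0 * r1
        if gx * gx + gy * gy < tol:
            break
        det = a * d - c * c
        dx = -(d * gx - c * gy) / det     # delta = -(J^T J)^{-1} J^T r
        dy = -(-c * gx + a * gy) / det
        alpha = 1.0                       # crude backtracking line search
        while f(x + alpha * dx, y + alpha * dy) >= f(x, y) and alpha > 1e-12:
            alpha *= 0.5
        x, y = x + alpha * dx, y + alpha * dy
    return x, y

x, y = gauss_newton(-1.0, 1.0)
```

Only first derivatives of the residuals are needed; here H_GN is always positive definite, so the Gauss-Newton step is always a descent direction.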
[Figure: Gauss-Newton method with line search on the Rosenbrock function, zoomed and full views; gradient < 1e-3 after 14 iterations]
• Minimization with the Gauss-Newton approximation and line search takes only 14 iterations:
x_{n+1} = x_n − α_n H_n^{-1} g_n with H_n = H_GN(x_n)
-
Comparison
[Figure: Gauss-Newton with line search (gradient < 1e-3 after 14 iterations) vs Newton with line search (gradient < 1e-3 after 15 iterations)]
Newton:
• requires computing the Hessian
• exact solution if quadratic
Gauss-Newton:
• approximates the Hessian by the product of gradients of the residuals
• requires only first derivatives
Summary of minimization methods
Update: x_{n+1} = x_n + δx
1. Newton: H δx = −g
2. Gauss-Newton: H_GN δx = −g
3. Gradient descent: λ δx = −g
-
Levenberg-Marquardt algorithm

[Figure: 1D cost function comparing a gradient-descent step and a Newton step]
• Away from the minimum, in regions of negative curvature, the Gauss-Newton approximation is not very good.
• In such regions, a simple steepest-descent step is probably the best plan.
• The Levenberg-Marquardt method is a mechanism for varying between steepest-descent and Gauss-Newton steps depending on how good the H_GN approximation is locally.
• The method uses the modified Hessian
H(x, λ) = H_GN + λI
• When λ is small, H approximates the Gauss-Newton Hessian.
• When λ is large, H is close to the identity, causing steepest-descent steps to be taken.
-
LM Algorithm

H(x, λ) = H_GN(x) + λI
1. Set λ = 0.001 (say)
2. Solve δx = −H(x, λ)^{-1} g
3. If f(x_n + δx) > f(x_n), increase λ (×10 say) and go to 2.
4. Otherwise, decrease λ (×0.1 say), let x_{n+1} = x_n + δx, and go to 2.
Note: this algorithm does not require explicit line searches.
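The four steps above translate almost directly into code; a minimal sketch on the Rosenbrock function in least-squares form, r1 = 10(y − x²), r2 = 1 − x (the starting point and iteration limit are illustrative choices):

```python
def residuals(x, y):
    return 10.0 * (y - x * x), 1.0 - x   # Rosenbrock as least squares

def f(x, y):
    r1, r2 = residuals(x, y)
    return r1 * r1 + r2 * r2

def levenberg_marquardt(x, y, lam=1e-3, max_iter=500, tol=1e-12):
    for _ in range(max_iter):
        r1, r2 = residuals(x, y)
        a = 400.0 * x * x + 1.0          # J^T J entries for J = [[-20x, 10], [-1, 0]]
        c = -200.0 * x
        d = 100.0
        gx = -20.0 * x * r1 - r2         # g = J^T r
        gy = 10.0 * r1
        if gx * gx + gy * gy < tol:
            break
        # Modified Hessian H = J^T J + lambda*I; 2x2 solve by Cramer's rule.
        det = (a + lam) * (d + lam) - c * c
        dx = -((d + lam) * gx - c * gy) / det
        dy = -(-c * gx + (a + lam) * gy) / det
        if f(x + dx, y + dy) > f(x, y):
            lam *= 10.0                  # step rejected: behave more like gradient descent
        else:
            lam *= 0.1                   # step accepted: behave more like Gauss-Newton
            x, y = x + dx, y + dy
    return x, y

x, y = levenberg_marquardt(-1.0, 1.0)
```

No line search appears anywhere: λ alone controls both the step direction and its effective length.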
Example
[Figure: Levenberg-Marquardt method on the Rosenbrock function, zoomed and full views; gradient < 1e-3 after 31 iterations]
• Minimization using Levenberg-Marquardt (no line search) takes 31 iterations.
Matlab: lsqnonlin
-
Comparison
[Figure: Levenberg-Marquardt (gradient < 1e-3 after 31 iterations) vs Gauss-Newton with line search (gradient < 1e-3 after 14 iterations)]
Levenberg-Marquardt:
• more iterations than Gauss-Newton, but
• no line search required,
• and more frequently converges
Case study – Bundle Adjustment (non-examinable)

Notation:
• A 3D point X_j is imaged in the i-th view as x_ij = P^i X_j
• P: 3 × 4 matrix; X: 4-vector; x: 3-vector

Problem statement:
• Given: n matching image points x_ij over m views
• Find: the cameras P^i and the 3D points X_j such that x_ij = P^i X_j
-
Number of parameters
• for each camera there are 6 parameters
• for each 3D point there are 3 parameters
• a total of 6m + 3n parameters must be estimated, e.g. 50 frames, 1000 points: 3300 unknowns

For efficient optimization:
• exploit the Gauss-Newton approximation (squared-residual cost function)
• exploit additional sparsity, which means the Hessian has blocks of zeros
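The parameter count for the slide's example is a one-line computation (a trivial sketch):

```python
def bundle_adjustment_unknowns(m_cameras, n_points):
    # 6 pose parameters per camera (3 for rotation, 3 for translation),
    # 3 coordinates per 3D point.
    return 6 * m_cameras + 3 * n_points

total = bundle_adjustment_unknowns(50, 1000)  # slide example: 50 frames, 1000 points
```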
Example
[Figure: image sequence; recovered cameras and points]
-
Application: Augmented reality
[Figure: original sequence; augmentation]