Chapter 2: Optimization
Chapter 2-OPTIMIZATION
G.Anuradha
Contents
• Derivative-based Optimization
  – Descent Methods
  – The Method of Steepest Descent
  – Classical Newton's Method
  – Step Size Determination
• Derivative-free Optimization
  – Genetic Algorithms
  – Simulated Annealing
  – Random Search
  – Downhill Simplex Search
What is Optimization?
• Choosing the best element from some set of available alternatives
• Solving problems in which one seeks to minimize or maximize a real function
Notation of Optimization

Optimize y = f(x1, x2, …, xn) ……(1)
subject to gj(x1, x2, …, xn) ≤ / ≥ / = bj, j = 1, 2, …, n ……(2)

Eqn. (1) is the objective function; Eqn. (2) is the set of constraints imposed on the solution. x1, x2, …, xn are the decision variables.
Note:- the problem is either to maximize or to minimize the value of the objective function.
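To make the notation concrete, here is one small illustrative instance (invented, not from the slides):

```latex
\min_{x_1,x_2}\; f(x_1,x_2) = (x_1-3)^2 + (x_2-2)^2
\quad\text{subject to}\quad
g_1(x_1,x_2)=x_1+x_2 \le 4,\qquad
g_2(x_1,x_2)=x_1 \ge 0 .
```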
Complicating factors in optimization
1. Existence of multiple decision variables
2. Complex nature of the relationships between the decision variables and the associated income
3. Existence of one or more complex constraints on the decision variables
Types of optimization
• Constrained:- the objective function is maximized or minimized subject to constraints on the decision variables
• Unconstrained:- no constraints are imposed on the decision variables, and differential calculus can be used to analyze the problem
Least Square Methods for System Identification
• System identification:- determining a mathematical model for an unknown system by observing its input-output data pairs
• System identification is required
  – to predict a system's behavior
  – to explain the interactions and relationships between inputs and outputs
  – to design a controller
• System identification involves
  – structure identification
  – parameter identification
Structure identification
• Apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is conducted
• y = f(u; θ), where y is the model's output, u is the input vector, and θ is the parameter vector
Parameter Identification
• The structure of the model is known, and optimization techniques are applied to determine the parameter vector θ = θ̂
Block diagram of parameter identification
Parameter identification
• An input ui is applied to both the system and the model
• The difference between the target system's output yi and the model's output ŷi is used to update the parameter vector θ so as to minimize the difference
• System identification is not a one-pass process; both structure and parameter identification need to be done repeatedly
Classification of Optimization algorithms
• Derivative-based algorithms
• Derivative-free algorithms
Characteristics of derivative-free algorithms
1. Derivative freeness:- rely on repeated evaluations of the objective function rather than on its derivatives
2. Intuitive guidelines:- concepts are based on nature's wisdom, such as evolution and thermodynamics
3. Slower than derivative-based methods
4. Flexibility
5. Randomness:- they are global optimizers
6. Analytic opacity:- knowledge about them is based mostly on empirical studies
7. Iterative nature
Characteristics of derivative-free algorithms (contd.)
• Stopping condition of the iteration:- let k denote the iteration count and fk the best objective function value obtained at count k. The stopping condition may depend on
  – computation time
  – the optimization goal
  – minimal improvement
  – minimal relative improvement
Basics of Matrix Manipulation and Calculus
Gradient of a Scalar Function
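The slide's formula was lost with the image; the standard definition it presents is the column vector of first partial derivatives of a scalar function F(x1, …, xn):

```latex
\nabla F(\mathbf{x}) =
\begin{bmatrix}
\dfrac{\partial F}{\partial x_1} &
\dfrac{\partial F}{\partial x_2} & \cdots &
\dfrac{\partial F}{\partial x_n}
\end{bmatrix}^{T}
```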
Jacobian of a Vector Function
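Likewise, for a vector function f : ℝⁿ → ℝᵐ, the standard definition is the m × n matrix of partial derivatives:

```latex
\mathbf{J}(\mathbf{x}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
```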
Least Square Estimator
• The method of least squares is a standard approach to the approximate solution of overdetermined systems (more equations than unknowns).
• Least squares:- the overall solution minimizes the sum of the squares of the errors made in solving every single equation.
• Application: data fitting.
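A minimal Octave sketch of least-squares data fitting (the five data points are invented for illustration):

```octave
% Fit y = c1 + c2*x to noisy points: an overdetermined system A*c = y
x = [0; 1; 2; 3; 4];
y = [1.1; 1.9; 3.2; 3.9; 5.1];      % invented sample data

A = [ones(size(x)) x];              % one row [1, x_i] per equation
c = A \ y;                          % backslash computes the least-squares solution
printf('intercept = %.3f, slope = %.3f\n', c(1), c(2));
```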
Types of Least Squares
• Linear:- the model is a linear combination of the parameters. It may represent a straight line, a parabola, or any other linear combination of basis functions.
• Non-linear:- the parameters appear inside functions, such as β², e^(βx).
If the derivatives (with respect to the parameters) are either constant or depend only on the values of the independent variable, the model is linear; otherwise it is non-linear.
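A hedged illustration of that test, with one model of each kind:

```latex
\text{linear: } y = \beta_0 + \beta_1 x + \beta_2 x^2
\;\Rightarrow\; \partial y/\partial \beta_2 = x^2 \text{ (depends only on } x\text{)};
\qquad
\text{non-linear: } y = \beta_1 e^{\beta_2 x}
\;\Rightarrow\; \partial y/\partial \beta_1 = e^{\beta_2 x} \text{ (depends on } \beta_2\text{)} .
```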
Differences between Linear and Non-Linear Least Squares

| Linear | Non-Linear |
|---|---|
| Algorithms do not require initial values | Algorithms require initial values |
| Globally convex; non-convergence is not an issue | Non-convergence is a common issue |
| Normally solved using direct methods | Usually an iterative process |
| Solution is unique | Multiple minima in the sum of squares |
| Yields unbiased estimates when errors are uncorrelated with predictor values | Yields biased estimates |
Linear regression with one variable
Model representation
Machine Learning
Housing Prices (Portland, OR)
[Scatter plot: Size (feet²) on the x-axis vs. Price (in 1000s of dollars) on the y-axis]

Supervised Learning: given the “right answer” for each example in the data.
Regression Problem: predict real-valued output.
Notation:
m = number of training examples
x's = “input” variable / features
y's = “output” variable / “target” variable

Training set of housing prices (Portland, OR):

| Size in feet² (x) | Price ($) in 1000's (y) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |
Training Set → Learning Algorithm → h
Size of house → h → Estimated price

How do we represent h? For linear regression with one variable (univariate linear regression):
hθ(x) = θ0 + θ1x
Cost function
Machine Learning
Linear regression with one variable
How to choose the θ's?

Training set:

| Size in feet² (x) | Price ($) in 1000's (y) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |

Hypothesis: hθ(x) = θ0 + θ1x
θ0, θ1: parameters
[Scatter plot of training examples (x, y) with a candidate line hθ(x)]
Idea: choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y)
Cost function intuition I
Machine Learning
Linear regression with one variable
Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost Function: J(θ0, θ1) = 1/(2m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ0, θ1) over θ0, θ1

Simplified (set θ0 = 0): hθ(x) = θ1x; minimize J(θ1)
[Slides 36–38: left, hθ(x) plotted against x for a fixed θ1 (a function of x); right, J(θ1) plotted as a function of the parameter θ1. Each choice of θ1 gives one line on the left and one point on the J(θ1) curve on the right.]
Cost function intuition II
Machine Learning
Linear regression with one variable
Hypothesis: hθ(x) = θ0 + θ1x
Parameters: θ0, θ1
Cost Function: J(θ0, θ1) = 1/(2m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
Goal: minimize J(θ0, θ1) over θ0, θ1
[Slides 42–47: left, the hypothesis hθ(x) over the housing data (Size in feet² vs. Price ($) in 1000's), a function of x for fixed θ0, θ1; right, a contour plot of J(θ0, θ1) as a function of the parameters. Each (θ0, θ1) pair picks one line on the left and one point on the contour plot on the right.]
Gradient descent
Machine Learning
Linear regression with one variable
Have some function J(θ0, θ1)
Want min over θ0, θ1 of J(θ0, θ1)

Outline:
• Start with some θ0, θ1
• Keep changing θ0, θ1 to reduce J(θ0, θ1), until we hopefully end up at a minimum
[Slides 51–52: 3-D surface plots of J(θ0, θ1); starting gradient descent from different initial points can lead to different local minima.]
Gradient descent algorithm:

repeat until convergence {
  θj := θj − α · ∂/∂θj J(θ0, θ1)   (for j = 0 and j = 1)
}

Correct: simultaneous update
temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)
θ0 := temp0
θ1 := temp1

Incorrect: sequential update
temp0 := θ0 − α · ∂/∂θ0 J(θ0, θ1)
θ0 := temp0
temp1 := θ1 − α · ∂/∂θ1 J(θ0, θ1)   (already uses the new θ0)
θ1 := temp1
Gradient descent intuition
Machine Learning
Linear regression with one variable
Gradient descent algorithm
If α is too small, gradient descent can be slow.
If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
At a local optimum the derivative is zero, so the update θ1 := θ1 − α · 0 leaves the current value of θ1 unchanged.
Gradient descent can converge to a local minimum, even with the learning rate α fixed.
As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.
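One line of justification, consistent with the update rule above: the step length is proportional to the local slope, which shrinks to zero as a minimum is approached.

```latex
\Delta\theta_1 = -\,\alpha\,\frac{d}{d\theta_1}J(\theta_1),
\qquad
\frac{d}{d\theta_1}J(\theta_1)\to 0 \;\text{ as }\; \theta_1\to\theta_1^{*}
\;\Rightarrow\; |\Delta\theta_1|\to 0 .
```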
Gradient descent for linear regression
Machine Learning
Linear regression with one variable
Gradient descent algorithm applied to the linear regression model:
θj := θj − α · ∂/∂θj J(θ0, θ1), with hθ(x) = θ0 + θ1x and J(θ0, θ1) = 1/(2m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))².
Gradient descent algorithm (with the derivatives worked out):

repeat until convergence {
  θ0 := θ0 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
  θ1 := θ1 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
}
(update θ0 and θ1 simultaneously)
[Slides 66–67: surface and contour plots of J(θ0, θ1) for linear regression; the cost is a convex bowl with a single global minimum.]
[Slides 68–76: successive iterations of gradient descent. Left: the current fitted line hθ(x) over the housing data (a function of x for fixed θ0, θ1); right: the trajectory of (θ0, θ1) on the contour plot of J(θ0, θ1) (a function of the parameters), stepping toward the minimum.]
“Batch” Gradient Descent
“Batch”: Each step of gradient descent uses all the training examples.
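A minimal Octave sketch of batch gradient descent for univariate linear regression; the data values, α, and the iteration count are invented for illustration:

```octave
% Batch gradient descent for h(x) = theta0 + theta1*x
x = [2.104; 1.416; 1.534; 0.852];   % invented: size in 1000s of ft^2
y = [460; 232; 315; 178];           % price in $1000s
m = length(y);

alpha = 0.1;                        % learning rate
theta0 = 0;  theta1 = 0;            % initial parameters

for iter = 1:1500
  h = theta0 + theta1 * x;          % uses ALL m training examples ("batch")
  grad0 = (1/m) * sum(h - y);       % compute both gradients first ...
  grad1 = (1/m) * sum((h - y) .* x);
  theta0 = theta0 - alpha * grad0;  % ... then update simultaneously
  theta1 = theta1 - alpha * grad1;
end
printf('theta0 = %.2f, theta1 = %.2f\n', theta0, theta1);
```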
Linear Regression with multiple variables
Multiple features
Machine Learning
Multiple features (variables). Previously, a single feature:

| Size (feet²) | Price ($1000) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |
Now, multiple features (variables):

| Size (feet²) x1 | Number of bedrooms x2 | Number of floors x3 | Age of home (years) x4 | Price ($1000) y |
|---|---|---|---|---|
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |
| … | … | … | … | … |

Notation:
n = number of features
x^(i) = input (features) of the i-th training example
x_j^(i) = value of feature j in the i-th training example
Hypothesis:
Previously: hθ(x) = θ0 + θ1x
Now: hθ(x) = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn

For convenience of notation, define x0 = 1, so that
hθ(x) = θ0x0 + θ1x1 + ⋯ + θnxn = θᵀx.

Multivariate linear regression.
Linear Regression with multiple variables
Gradient descent for multiple variables
Machine Learning
Hypothesis: hθ(x) = θᵀx = θ0x0 + θ1x1 + ⋯ + θnxn
Parameters: θ0, θ1, …, θn
Cost function: J(θ) = 1/(2m) · Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

Gradient descent:
Repeat {
  θj := θj − α · ∂/∂θj J(θ)
} (simultaneously update for every j = 0, …, n)
Gradient Descent

Previously (n = 1):
Repeat {
  θ0 := θ0 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
  θ1 := θ1 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
} (simultaneously update θ0, θ1)

New algorithm (n ≥ 1):
Repeat {
  θj := θj − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x_j^(i)
} (simultaneously update θj for j = 0, …, n)
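The same update, vectorized, as an Octave sketch (the function name, and the convention that X already carries the x0 = 1 column, are assumptions for illustration):

```octave
% X: m x (n+1) design matrix (first column all ones), y: m x 1, theta: (n+1) x 1
function theta = gradient_descent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    grad  = (1/m) * X' * (X*theta - y);  % all n+1 partial derivatives at once
    theta = theta - alpha * grad;        % simultaneous update of every theta_j
  end
end
```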
Linear Regression with multiple variables
Gradient descent in practice I: Feature Scaling
Machine Learning
Feature Scaling
Idea: make sure features are on a similar scale.
E.g. x1 = size (0–2000 feet²), x2 = number of bedrooms (1–5).
[Contour plots of J over (θ1, θ2): with raw size (feet²) and number of bedrooms the contours are long, thin ellipses and gradient descent zig-zags; after scaling each feature by its range the contours are rounder and descent converges faster.]
Feature Scaling
Get every feature into approximately a −1 ≤ xi ≤ 1 range.
Mean normalization
Replace xi with xi − μi to make features have approximately zero mean (do not apply to x0 = 1); more generally, xi ← (xi − μi)/si, where μi is the mean of feature i and si its range (max − min) or standard deviation.
E.g. x1 = (size − average size)/(range of sizes).
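An Octave sketch of mean normalization for a whole feature matrix (the function name is an assumption; apply it before adding the x0 = 1 column):

```octave
% Scale every column of X to roughly zero mean and unit spread
function [X_norm, mu, sigma] = feature_normalize(X)
  mu     = mean(X);             % per-feature means (row vector)
  sigma  = std(X);              % per-feature standard deviations
  X_norm = (X - mu) ./ sigma;   % broadcast across rows
end
```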
Linear Regression with multiple variables
Gradient descent in practice II: Learning rate
Machine Learning
Gradient descent
- “Debugging”: how to make sure gradient descent is working correctly.
- How to choose the learning rate α.
Making sure gradient descent is working correctly:
[Plot J(θ) against the number of iterations; the curve should decrease on every iteration and flatten out as gradient descent converges.]
Example automatic convergence test: declare convergence if J(θ) decreases by less than some small ε (e.g. 10⁻³) in one iteration.
Making sure gradient descent is working correctly (contd.):
[If the plot of J(θ) vs. the number of iterations is increasing, or repeatedly dips and rises, gradient descent is not working: use a smaller α.]
- For sufficiently small α, J(θ) should decrease on every iteration.
- But if α is too small, gradient descent can be slow to converge.
Summary:
- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.
To choose α, try a sequence of values spaced roughly 3× apart, e.g. …, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
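A hedged Octave sketch of that recipe: run a few iterations for each candidate α and plot the cost history (assumes X, y, and m from the earlier snippets are in scope):

```octave
alphas = [0.001 0.003 0.01 0.03 0.1 0.3 1];
num_iters = 50;
hold on;
for a = alphas
  theta  = zeros(columns(X), 1);
  J_hist = zeros(num_iters, 1);
  for iter = 1:num_iters
    theta = theta - a * (1/m) * X' * (X*theta - y);       % one GD step
    J_hist(iter) = (1/(2*m)) * sum((X*theta - y) .^ 2);   % record J(theta)
  end
  plot(1:num_iters, J_hist);   % a good alpha gives a smooth, fast decrease
end
xlabel('No. of iterations'); ylabel('J(\theta)');
```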
Linear Regression with multiple variables
Features and polynomial regression
Machine Learning
Housing prices prediction
Polynomial regression
[Housing data: Price (y) vs. Size (x), with a curved fit.] E.g. a cubic model
hθ(x) = θ0 + θ1x + θ2x² + θ3x³, treating x, x², x³ as three features.
Choice of features
[Same data: Price (y) vs. Size (x).] Instead of a cubic, a square-root feature may fit flattening prices better:
hθ(x) = θ0 + θ1x + θ2√x.
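An Octave sketch of building such polynomial features from one raw size column x (a column vector assumed in scope); since x, x², x³ have very different ranges, feature scaling matters here:

```octave
X_poly = [ones(size(x)) x x.^2 x.^3];           % features x0 = 1, x, x^2, x^3
X_poly(:, 2:end) = (X_poly(:, 2:end) - mean(X_poly(:, 2:end))) ...
                   ./ std(X_poly(:, 2:end));    % mean-normalize all but x0
```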
Linear Regression with multiple variables
Normal equation
Machine Learning
Gradient descent iterates toward the minimum; the normal equation is a method to solve for θ analytically in one step.
Intuition: if θ were a scalar (1-D) and J(θ) = aθ² + bθ + c, set dJ/dθ = 0 and solve for θ.
For θ ∈ ℝ^(n+1): set ∂J/∂θj = 0 (for every j) and solve for θ0, θ1, …, θn.
Examples: m = 4.

| Size (feet²) x1 | Number of bedrooms x2 | Number of floors x3 | Age of home (years) x4 | Price ($1000) y |
|---|---|---|---|---|
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |

Adding the x0 = 1 column gives the design matrix X and target vector y:

X = [1 2104 5 1 45
     1 1416 3 2 40
     1 1534 3 2 30
     1  852 2 1 36],    y = [460; 232; 315; 178]
m examples; n features. Each example x^(i) ∈ ℝ^(n+1) contributes one row (x^(i))ᵀ to the m × (n+1) design matrix X, and y ∈ ℝ^m stacks the targets. E.g. if n = 1, each row of X is [1, x^(i)].
θ = (XᵀX)⁻¹Xᵀy, where (XᵀX)⁻¹ is the inverse of the matrix XᵀX.
Octave: pinv(X'*X)*X'*y
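Putting this together on the four-example table above (a sketch; the new house in the last line is invented):

```octave
X = [1 2104 5 1 45;
     1 1416 3 2 40;
     1 1534 3 2 30;
     1  852 2 1 36];
y = [460; 232; 315; 178];

theta = pinv(X' * X) * X' * y;    % normal equation (pinv tolerates a singular X'X)
price = [1 2000 4 1 30] * theta;  % predicted price of an invented new house
```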
With m training examples and n features:

| Gradient Descent | Normal Equation |
|---|---|
| Need to choose α | No need to choose α |
| Needs many iterations | Don't need to iterate |
| Works well even when n is large | Need to compute (XᵀX)⁻¹, an (n+1) × (n+1) inverse |
| | Slow if n is very large |
Linear Regression with multiple variables
Normal equation and non-invertibility (optional)
Machine Learning
Normal equation: θ = (XᵀX)⁻¹Xᵀy
- What if XᵀX is non-invertible (singular/degenerate)?
- Octave: pinv(X'*X)*X'*y still works, because pinv computes the pseudo-inverse.
What if XᵀX is non-invertible?
• Redundant features (linearly dependent), e.g. x1 = size in feet² and x2 = size in m² (then x1 = (3.28)² · x2).
• Too many features (e.g. m ≤ n).
- Delete some features, or use regularization.
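A quick hedged check of the redundant-feature case in Octave (the columns and data reuse the running example):

```octave
X = [1 2104; 1 1416; 1 1534; 1 852];
X = [X, X(:,2) / 3.28^2];       % append the same size expressed in m^2
y = [460; 232; 315; 178];

% X'*X is singular (the third column is a multiple of the second),
% yet pinv still returns a usable minimum-norm solution:
theta = pinv(X' * X) * X' * y;
resid = norm(X * theta - y)     % the least-squares residual is still attained
```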
Linear model
Regression function: y = θ1 f1(u) + θ2 f2(u) + ⋯ + θn fn(u), where u is the input vector, f1, …, fn are known functions of u, and θ1, …, θn are the unknown parameters to be estimated.
Linear model contd…
Substituting each of the m training pairs (ui; yi) into the regression function gives m linear equations. Using matrix notation: Aθ = y, where A is an m×n matrix (its i-th row is [f1(ui), …, fn(ui)]), θ is the n×1 parameter vector, and y is the m×1 output vector.
Due to noise, a small amount of error is added to each equation: y = Aθ + e, where e is the m×1 error vector.
Least Square Estimator
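The estimator itself was lost with the slide image; the standard result it states is:

```latex
\hat{\theta} \;=\; \arg\min_{\theta}\,\|y - A\theta\|^{2} \;=\; (A^{T}A)^{-1}A^{T}y,
```

valid when AᵀA is nonsingular (the same normal equation as above, with A in place of X).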
Problem on Least Square Estimator
Derivative Based Optimization
• Deals with gradient-based optimization techniques, capable of determining search directions according to an objective function’s derivative information
• Used in optimizing non-linear neuro-fuzzy models, e.g.
  – Steepest descent
  – Conjugate gradient
First-Order Optimality Condition

Expand F(x) about a candidate minimum point x*, with Δx = x − x*:

F(x* + Δx) = F(x*) + ∇F(x)ᵀ|_{x=x*} Δx + (1/2) Δxᵀ ∇²F(x)|_{x=x*} Δx + ⋯

For small Δx:

F(x* + Δx) ≈ F(x*) + ∇F(x)ᵀ|_{x=x*} Δx

If x* is a minimum, this implies ∇F(x)ᵀ|_{x=x*} Δx ≥ 0 for every Δx.
If instead ∇F(x)ᵀ|_{x=x*} Δx > 0 for some Δx, then

F(x* − Δx) ≈ F(x*) − ∇F(x)ᵀ|_{x=x*} Δx < F(x*)

But this would imply that x* is not a minimum. Therefore ∇F(x)ᵀ|_{x=x*} Δx = 0. Since this must be true for every Δx,

∇F(x)|_{x=x*} = 0

i.e. the gradient must be zero at a minimum.
Second-Order Condition

If the first-order condition is satisfied (zero gradient), the expansion becomes

F(x* + Δx) = F(x*) + (1/2) Δxᵀ ∇²F(x)|_{x=x*} Δx + ⋯

A strong minimum will exist at x* if Δxᵀ ∇²F(x)|_{x=x*} Δx > 0 for any Δx ≠ 0. Therefore the Hessian matrix must be positive definite. A matrix A is positive definite if

zᵀAz > 0 for any z ≠ 0.

Together with the zero-gradient condition, this is a sufficient condition for optimality.

A necessary condition is that the Hessian matrix be positive semidefinite. A matrix A is positive semidefinite if

zᵀAz ≥ 0 for any z.
Basic Optimization Algorithm

x_{k+1} = x_k + α_k p_k    or equivalently    Δx_k = x_{k+1} − x_k = α_k p_k

where
p_k is the search direction
α_k is the learning rate (step size)
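A minimal Octave sketch of this iteration, using steepest descent (p_k = −∇F(x_k)) on an invented quadratic objective:

```octave
% Iterate x_{k+1} = x_k + alpha_k * p_k with p_k = -grad F(x_k)
F     = @(x) 0.5 * x' * [2 0; 0 50] * x;   % invented quadratic objective
gradF = @(x) [2 0; 0 50] * x;              % its gradient

x = [1; 1];        % starting point
alpha = 0.02;      % fixed learning rate (alpha_k = alpha for all k)
for k = 1:200
  p = -gradF(x);   % steepest-descent search direction
  x = x + alpha * p;
end
disp(x')           % approaches the minimum at (0, 0)
```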