Lecture 7 – Learning for Control (Nonlinear/Nonparametric Methods)
RLSC - Prof. Sethu Vijayakumar

Contents:
- Locally weighted and lazy learning techniques
- Nonparametric methods
Nonparametric Methods
Working definition: the name "nonparametric" indicates that the data to be modeled stem from very large families of distributions which cannot be indexed by a finite-dimensional parameter vector in a natural way.

Remarks:
- This does not mean that nonparametric methods have no parameters!
- Nonparametric methods avoid making assumptions about the parametric form of the underlying distributions (except for some smoothness properties).
- Nonparametric methods are often memory-based (but not necessarily), and are sometimes called "lazy learning".
They can be applied to density estimation, classification, and regression.
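As a minimal illustration of the memory-based ("lazy") flavour mentioned above, here is a sketch of nearest-neighbour regression in NumPy: training simply stores the data, and all computation happens at query time. The data, the value of k, and the `knn_predict` helper are invented for this example and are not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(100, 1))                 # stored training inputs
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(100)

def knn_predict(x_query, X, y, k=5):
    """Predict by averaging the targets of the k nearest stored points."""
    dists = np.linalg.norm(X - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y[nearest].mean()

print(knn_predict(np.array([0.5]), X_train, y_train))
```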
Locally Weighted Regression (LWR)
- Locally fit low-order polynomials, e.g., first-order (linear) polynomials.
- Approximate nonlinear functions with a mixture of k piecewise linear models.
LWR: Formalization

Minimize the weighted squared error:

$$J = \sum_{n=1}^{N} w_n\,(t_n - y_n)^T (t_n - y_n), \qquad \text{where } y_n = x_n^T \beta$$

Solution: weighted least squares,

$$\beta = (X^T W X)^{-1} X^T W Y, \qquad \text{where } W = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_N \end{pmatrix}$$

The weight can be calculated from any weighting kernel, e.g., a Gaussian:

$$w_n = \exp\!\left(-\tfrac{1}{2}\,(x_n - c)^T D\,(x_n - c)\right)$$
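A minimal sketch of this formalization in NumPy, for a 1-D input: at a query point c (taken as the kernel centre), compute Gaussian weights, solve the weighted least squares problem for a local linear model in centred coordinates, and read off the prediction at c as the offset term. The data, the value of D, and the `lwr_predict` helper are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=200)                            # invented 1-D inputs
t = np.sin(2 * x) + 0.1 * rng.standard_normal(200)          # invented noisy targets

def lwr_predict(c, x, t, D=50.0):
    """LWR prediction at the query/centre point c for 1-D input x."""
    X = np.column_stack([x - c, np.ones_like(x)])           # local linear model: slope and offset
    w = np.exp(-0.5 * D * (x - c) ** 2)                     # Gaussian kernel weights
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ t)        # beta = (X' W X)^{-1} X' W t
    return beta[1]                                          # offset = model value at x = c

print(lwr_predict(0.3, x, t))
```

Centring the inputs on c is a convenience so that the prediction at the query point is simply the offset coefficient; the uncentred form of the slides is equivalent.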
LWR: Examples (cont’d)
[Figure: LWR fits of the same data (x from -3 to 3) for four settings of the distance metric: D = 10.00, 50.00, 100.00, and 200.00.]
The fits exhibit the bias-variance tradeoff with respect to the parameter D.
How to Optimize the Distance Metric?
Two possible options:
- Global optimization: find the most suitable D for the entire data set.
- Local optimization: find the most suitable D as a function of the kernel location c.
Global Distance Metric Optimization
Leave-one-out cross validation: compute the prediction for every data point in the training set by centering the kernel at that data point but excluding the point at the center from the training set.

Find the distance metric that minimizes the cross validation error; depending on how many parameters are allowed in the distance metric, this is a multidimensional optimization problem.

NOTE: leave-one-out cross validation is very cheap in nonparametric methods (compared to most parametric methods), since there is no iterative training of the learning system.
$$J = \sum_{n=1}^{N} (t_n - y_{n,-n})^T (t_n - y_{n,-n})$$

where $y_{n,-n}$ denotes the prediction for $x_n$ with the $n$-th point excluded from the fit.
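A rough sketch of this global optimization as a grid search: for each candidate D, compute the leave-one-out cross validation error by centring a kernel at every data point while excluding that point from its own fit, then keep the best D. The 1-D data, the candidate grid, and the `loo_cv_error` helper are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=100)                            # invented 1-D data
t = np.sin(2 * x) + 0.1 * rng.standard_normal(100)

def loo_cv_error(D, x, t):
    """Global leave-one-out CV error for a candidate distance metric D."""
    err = 0.0
    for n in range(len(x)):
        mask = np.arange(len(x)) != n                       # exclude point n from its own fit
        c, xs, ts = x[n], x[mask], t[mask]
        X = np.column_stack([xs - c, np.ones_like(xs)])
        w = np.exp(-0.5 * D * (xs - c) ** 2)
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * ts))
        err += (t[n] - beta[1]) ** 2                        # prediction at c is the offset term
    return err

D_grid = [1.0, 5.0, 10.0, 50.0, 100.0, 200.0]
errors = [loo_cv_error(D, x, t) for D in D_grid]
print("best D:", D_grid[int(np.argmin(errors))])
```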
Why Cross Validation?
[Figure: two panels plotting the data y against X with the kernel weighting w overlaid.]
Without cross validation, the kernel could simply shrink to zero and focus on a single data point; cross validation avoids such degenerate solutions for D.
Global Optimization of Distance Metric: Example

[Figure: left panel: the locally weighted cost function plotted against D (D from 0 to 300, cost roughly 0.24 to 0.28); right panel: the resultant fit from global optimization, with D = 100.00.]
Local Distance Metric Optimization
Why not optimize the distance metric as a function of the location of the kernel center?
Local Cross Validation Criterion
Something exceptionally cool: for linear local models, the local leave-one-out cross validation error can be computed analytically, WITHOUT an n-fold recalculation of the prediction!
$$J_c = \frac{1}{\sum_{n=1}^{N} w_n}\sum_{n=1}^{N} w_n\,(t_n - y_{n,-n})^T (t_n - y_{n,-n})
      = \frac{1}{\sum_{n=1}^{N} w_n}\sum_{n=1}^{N} \frac{w_n\,(t_n - y_n)^T (t_n - y_n)}{\left(1 - w_n\, x_n^T P\, x_n\right)^2}$$

where $\beta = (X^T W X)^{-1} X^T W Y = P\,X^T W Y$; the right-hand side is a.k.a. the PRESS residual error.
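The analytic PRESS formula above can be checked numerically: fit a weighted linear model once, form the PRESS residuals from P = (X'WX)^{-1}, and compare them against explicitly refitting with each point removed. The 1-D data, kernel centre, and metric below are invented; this is a sketch, not the lecture's code.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50
x = rng.uniform(-3, 3, size=N)                              # invented data
t = np.sin(2 * x) + 0.1 * rng.standard_normal(N)

c, D = 0.0, 10.0                                            # hypothetical kernel centre and metric
X = np.column_stack([x, np.ones(N)])                        # linear model with offset
w = np.exp(-0.5 * D * (x - c) ** 2)                         # Gaussian weights

# Fit the weighted least squares model once.
P = np.linalg.inv(X.T @ (w[:, None] * X))                   # P = (X' W X)^{-1}
beta = P @ (X.T @ (w * t))
resid = t - X @ beta

# Analytic PRESS (leave-one-out) residuals: e_n / (1 - w_n x_n' P x_n).
h = w * np.einsum('ni,ij,nj->n', X, P, X)
press = resid / (1.0 - h)

# Brute-force check: refit with each data point actually removed.
for n in range(N):
    m = np.arange(N) != n
    Pm = np.linalg.inv(X[m].T @ (w[m][:, None] * X[m]))
    bm = Pm @ (X[m].T @ (w[m] * t[m]))
    assert np.isclose(t[n] - X[n] @ bm, press[n])
print("analytic PRESS residuals match explicit leave-one-out")
```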
A Nonparametric Regression Algorithm
Receptive Field Weighted Regression (RFWR). Ideas:
- Create new (Gaussian) kernels as needed, i.e., when no existing kernel in the network is activated sufficiently (a constructive network).
- Update the linear model in each kernel by weighted recursive least squares.
- Adjust the distance metric by gradient descent on the local cross validation criterion.
- The weighted output of all kernels is the prediction (see the combination sketch below).
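As referenced in the last idea above, the network prediction is the weight-averaged output of all local models. A minimal 1-D sketch of that combination step; the centres, metrics, and local parameters are hypothetical placeholders, and `rfwr_predict` is a name chosen for this example.

```python
import numpy as np

def rfwr_predict(x, centers, metrics, betas):
    """Blend local linear models: y = sum_k w_k y_k / sum_k w_k (1-D sketch).

    centers[k]: kernel centre c_k; metrics[k]: distance metric D_k;
    betas[k]: (slope, offset) of local model k in centred coordinates.
    """
    num, den = 0.0, 0.0
    for c, D, (slope, offset) in zip(centers, metrics, betas):
        w = np.exp(-0.5 * D * (x - c) ** 2)                 # receptive field activation
        y_k = slope * (x - c) + offset                      # local linear prediction
        num += w * y_k
        den += w
    return num / den if den > 1e-12 else 0.0

centers = [-1.0, 0.0, 1.0]
metrics = [25.0, 25.0, 25.0]
betas = [(0.5, -0.8), (2.0, 0.0), (0.5, 0.8)]
print(rfwr_predict(0.3, centers, metrics, betas))
```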
Nonparametric Regression Algorithm
Elements of each local linear model:

Regression slope:
$$y = \beta^T x + \beta_0 = \tilde{\beta}^T \tilde{x}, \qquad \text{where } \tilde{x} = \left[\,x^T \;\; 1\,\right]^T$$

Receptive field:
$$w = \exp\!\left(-\tfrac{1}{2}\,(x - c)^T D\,(x - c)\right), \qquad \text{where } D = M^T M$$

Penalized local cross validation error:
$$J = \frac{1}{\sum_{i=1}^{p} w_i}\sum_{i=1}^{p} w_i \left\| y_i - \hat{y}_{i,-i} \right\|^2 + \gamma \sum_{i=1,\,j=1}^{n} D_{ij}^2$$

(the last term, weighted by a small factor $\gamma$, penalizes overly narrow receptive fields).
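A small sketch of the receptive field parametrization above: storing the factor M and forming D = M'M keeps the distance metric positive semi-definite while M is adapted freely. The centre, the entries of M, and the query point are made-up values.

```python
import numpy as np

def receptive_field_activation(x, c, M):
    """w = exp(-0.5 (x - c)' D (x - c)) with D = M' M, so D is positive semi-definite."""
    D = M.T @ M
    d = x - c
    return float(np.exp(-0.5 * d @ D @ d))

c = np.array([0.0, 0.0])                                    # hypothetical kernel centre
M = np.array([[3.0, 0.5],                                   # hypothetical upper-triangular factor
              [0.0, 2.0]])
print(receptive_field_activation(np.array([0.2, -0.1]), c, M))
```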
Nonparametric Regression Algorithm
Update of the parameters.

Slope of the local model (weighted recursive least squares):
$$\beta^{n+1} = \beta^{n} + w\, P^{n+1} \tilde{x}\, e_{cv}^T$$
$$\text{where } P^{n+1} = \frac{1}{\lambda}\left(P^{n} - \frac{P^{n}\, \tilde{x}\, \tilde{x}^T P^{n}}{\frac{\lambda}{w} + \tilde{x}^T P^{n}\, \tilde{x}}\right) \quad \text{and} \quad e_{cv} = y - {\beta^{n}}^T \tilde{x}$$
($\lambda$ is a forgetting factor.)

Distance metric adaptation (gradient descent with learning rate $\alpha$):
$$M^{n+1} = M^{n} - \alpha\, \frac{\partial J}{\partial M}$$

Another very cool thing: for linear systems, leave-one-out cross validation can be approximated INCREMENTALLY! Thus, no data has to be kept in memory!
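A rough sketch of one incremental update for a single receptive field, following the equations above: a weighted recursive least squares step for the slope and for P (with an assumed forgetting factor lambda), plus a gradient step on M. The gradient of J with respect to M is left as a placeholder because its full form is not reproduced on this slide; all data, initial values, and constants below are invented.

```python
import numpy as np

def rfwr_update(state, x, t, c, lam=0.999, alpha=1e-4):
    """One incremental update of a single receptive field (sketch).

    state: dict with 'beta' (d+1,), 'P' (d+1, d+1), 'M' (d, d).
    """
    beta, P, M = state['beta'], state['P'], state['M']
    D = M.T @ M
    d = x - c
    w = np.exp(-0.5 * d @ D @ d)                            # receptive field activation
    xt = np.append(d, 1.0)                                  # x_tilde = [(x - c)', 1]'

    e_cv = t - beta @ xt                                    # residual before the update
    # P^{n+1} = (1/lam) * (P^n - P^n xt xt' P^n / (lam/w + xt' P^n xt))
    Pxt = P @ xt
    P = (P - np.outer(Pxt, Pxt) / (lam / w + xt @ Pxt)) / lam
    beta = beta + w * (P @ xt) * e_cv                       # beta^{n+1} = beta^n + w P^{n+1} xt e_cv

    # Distance metric adaptation: M^{n+1} = M^n - alpha * dJ/dM.
    # The gradient dJ/dM (from the incremental cross validation statistics)
    # is not given on this slide, so it is left as a placeholder here.
    grad_M = np.zeros_like(M)
    M = M - alpha * grad_M

    return {'beta': beta, 'P': P, 'M': M}

state = {'beta': np.zeros(2), 'P': 1e3 * np.eye(2), 'M': np.array([[5.0]])}
state = rfwr_update(state, np.array([0.3]), 0.7, c=np.array([0.0]))
print(state['beta'])
```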
Example of learning
[Figure: top panel: the learned fit, y against x over roughly -2.5 to 2.5; bottom panel: the receptive field activations w (0 to 1) against x.]
A 2D Example
[Figure: two surface plots over the unit square in (x, y), z from about -0.5 to 1.5: left, the actual data-generating function; right, the sampled data with noise.]
Why is this a tough function to learn?
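The exact function behind the plots is not recoverable from the text; a benchmark commonly used with RFWR/LWPR, and assumed here to be the one intended, is the 2-D "cross" function, which is hard to learn because it is highly nonlinear along some directions while being flat along others, so a single global model or a single fixed kernel size fits it poorly. A hedged sketch of how such noisy training data could be generated:

```python
import numpy as np

def cross_function(x, y):
    """The 2-D 'cross' benchmark from the RFWR/LWPR literature (assumed here)."""
    return np.maximum.reduce([
        np.exp(-10.0 * x ** 2),
        np.exp(-50.0 * y ** 2),
        1.25 * np.exp(-5.0 * (x ** 2 + y ** 2)),
    ])

rng = np.random.default_rng(3)
xy = rng.uniform(-1, 1, size=(500, 2))                      # sampled inputs on the unit square
z = cross_function(xy[:, 0], xy[:, 1]) + 0.05 * rng.standard_normal(500)   # noisy targets
print(xy.shape, z.shape)
```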
A 2D Example (cont'd)

[Figure: the reconstructed surfaces and the receptive field centres (marked ⊕) in the (x, y) input space, after one epoch of training (left) and after 50 epochs of training (right).]
Nonparametric Regression Algorithm
Summary: the LWR scheme we developed (RFWR)
- can incrementally deal with the bias-variance dilemma,
- grows with the data (constructive),
- learns very fast,
- is similar to a mixture of experts, but does not need a pre-specified number of experts (no competitive learning),
- is similar to committee networks (by averaging the outputs over many independently trained networks),
- ... but, like all spatially localized algorithms, still has problems with the curse of dimensionality.
Handling the Curse of Dimensionality
We developed a method that makes local learning methods like RFWR scale to high-dimensional inputs; the resultant system is called Locally Weighted Projection Regression (LWPR).

[Figure: schematic of an LWPR module.]

We will learn more about this and other dimensionality reduction techniques next.
http://wcms.inf.ed.ac.uk/ipab/slmc/research/software-lwpr