TRANSCRIPT
IROS2018 Tutorial Mo-TUT-2
From Least Squares Regression to High-dimensional Motion Primitives
Freek Stulp, Sylvain Calinon, Gerhard Neumann
Generation vs. Recall

159 + 72 = ???

One strategy is to generate the answer step by step:
159 + 72 = (159+2) + (72−2) = 161 + 70 = 231   “generation”
159 + 72 = (9+2) + (150+70) = 11 + 220 = 231   “generation”

Another is to simply retrieve it from memory:
159 + 72 = 231   “recall”

The distinction between these two strategies is important in cognitive science, artificial intelligence, robotics, and teaching.

Motion Generation vs. Motion Recall
Motion primitives in nature

Giszter, S.; Mussa-Ivaldi, F. and Bizzi, E. (1993). Convergent force fields organized in the frog’s spinal cord. Journal of Neuroscience.
Flash, T. and Hochner, B. (2005). Motor Primitives in Vertebrates and Invertebrates. Current Opinion in Neurobiology.

Motion primitives for robots? Motion primitives for robots!
• Couple degrees of freedom to deal with high-dimensional systems
• Sequencing and superposition of MPs for more complex tasks
• Low-dimensional parameterization of MPs enables learning
• MPs can be bootstrapped with demonstrations
• Direct mappings between task parameters and MP parameters

Ijspeert, A. J.; Nakanishi, J. and Schaal, S. (2002). Movement imitation with nonlinear dynamical systems in humanoid robots. ICRA.
Schedule
9:00 – 9:15    Introduction
9:15 – 10:45   Regression Tutorial           Freek Stulp
10:45 – 11:00  Motion Primitives 1           Sylvain Calinon
11:00 – 11:30  Coffee Break
11:30 – 12:15  Motion Primitives 1 (cont.)   Sylvain Calinon
12:15 – 13:15  Motion Primitives 2           Gerhard Neumann
13:15 – 13:30  Wrap up
Regression Tutorial
IROS’18 Tutorial

Freek Stulp
Institute of Robotics and Mechatronics, German Aerospace Center (DLR)
Autonomous Systems and Robotics, ENSTA-ParisTech
01.10.2018
What Is Regression?

Estimating a relationship between input variables and continuous output variables from data.

Application: Dynamic parameter estimation
An, C.; Atkeson, C. and Hollerbach, J. (1985). Estimation of inertial parameters of rigid body links of manipulators. IEEE Conference on Decision and Control.

Application: Programming by demonstration
Rozo, L.; Calinon, S.; Caldwell, D. G.; Jimenez, P. and Torras, C. (2016). Learning Physical Collaborative Robot Behaviors from Human Demonstrations. IEEE Transactions on Robotics.
Calinon, S.; Guenter, F. and Billard, A. (2007). On Learning, Representing and Generalizing a Task in a Humanoid Robot. IEEE Transactions on Systems, Man and Cybernetics.

Application: Biosignal processing
Gijsberts, A.; Bohra, R.; Sierra González, D.; Werner, A.; Nowak, M.; Caputo, B.; Roa, M. and Castellini, C. (2014). Stable myoelectric control of a hand prosthesis using non-linear incremental learning. Frontiers in Neurorobotics.
What Is Not Regression?

Training data: {(x_n, y_n)}_{n=1}^N, where each x_n ∈ X is an input and each y_n ∈ Y is a target.

Supervised learning      targets available
  Regression             targets available, Y ⊆ R^M
  Classification         targets available, Y ⊆ {1, . . . , K}
Reinforcement learning   no targets, only rewards r_n ∈ R
Unsupervised learning    no targets at all
Regression – Assumptions about the Function

linear | locally linear | smooth | none

→ Linear Least Squares

A.M. Legendre (1805). Nouvelles méthodes pour la détermination des orbites des comètes. Firmin Didot.
C.F. Gauss (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum.
Linear Least Squares

f(x) = aᵀx

X = | x_{1,1}  x_{1,2}  · · ·  x_{1,D} |        | y_1 |
    |   ⋮        ⋮       ⋱      ⋮     | ,  y = |  ⋮  |
    | x_{N,1}  x_{N,2}  · · ·  x_{N,D} |        | y_N |

• X is the N × D “design matrix”
• Each row is a D-dimensional data point
Linear Least Squares
• Which line fits the data best?
  1 Define fitting criterion
  2 Optimize a w.r.t. criterion

1 Define fitting criterion: the sum of squared residuals

S(a) = Σ_{n=1}^N r_n²                 (1)
     = Σ_{n=1}^N (y_n − f(x_n))²      (2)

Applied to a linear model f(x) = aᵀx:

S(a) = Σ_{n=1}^N (y_n − aᵀx_n)²       (3)
     = (y − Xa)ᵀ(y − Xa)              (4)

2 Optimize a w.r.t. criterion: minimize the sum of squared residuals S(a)

a* = arg min_a S(a)
   = arg min_a (y − Xa)ᵀ(y − Xa)

The cost is quadratic in a, so we ask: when is its derivative zero?

S′(a) = 2(XᵀXa − Xᵀy) = 0
a*    = (XᵀX)⁻¹Xᵀy

Nice! Closed-form solution to find a*, giving the fitted model f(x) = a*ᵀx.

Offset trick
f(x) = aᵀx + b = [a; b]ᵀ [x; 1]

Append a column of ones to the design matrix:

X = | x_{1,1}  x_{1,2}  · · ·  x_{1,D}  1 |
    |   ⋮        ⋮       ⋱      ⋮      ⋮ |
    | x_{N,1}  x_{N,2}  · · ·  x_{N,D}  1 |
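The closed-form solution and the offset trick fit in a few lines of NumPy. A minimal sketch; the synthetic data, its true slope 2 and offset 1, and the noise level are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: y = 2*x + 1 + noise (made-up ground truth)
x = rng.uniform(0.0, 1.0, size=(50, 1))
y = 2.0 * x[:, 0] + 1.0 + 0.01 * rng.standard_normal(50)

# Offset trick: append a column of ones, so the offset b becomes part of a
X = np.hstack([x, np.ones((x.shape[0], 1))])  # N x (D+1) design matrix

# Closed-form least squares: a* = (X^T X)^{-1} X^T y
# (np.linalg.solve avoids forming the inverse explicitly)
a_star = np.linalg.solve(X.T @ X, X.T @ y)

print(a_star)  # approximately [2.0, 1.0]
```

In practice, `np.linalg.lstsq` is often preferred over the normal equations for numerical stability, but the normal-equation form above mirrors the derivation on the slide.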
Outline
• Linear Least Squares: “Fit a line to some data points”
• Weighted Least Squares: “Some data points more important to fit”
• Locally Weighted Regression + friends: “Multiple weighted least squares in input space”
• Radial Basis Function Network + friends: “Project data into feature space. Do least squares in this space.”

Next up: Weighted Least Squares
Weighted Linear Least Squares

Idea: it is more important to fit some points than others.
• Importance ≡ weight w_n
• Example weightings:
  • manual
  • boxcar function
  • Gaussian function:
    w_n = φ(x_n, θ) = exp(−½ (x_n − c)ᵀ Σ⁻¹ (x_n − c)), with θ = (c, Σ)

1 Define fitting criterion: weighted sum of squared residuals

S(a) = Σ_{n=1}^N w_n (y_n − aᵀx_n)²   (5)
     = (y − Xa)ᵀ W (y − Xa)           (6)

where W is the diagonal matrix with W_{nn} = w_n.

2 Optimize a w.r.t. criterion

a* = (XᵀWX)⁻¹XᵀWy.                    (7)
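As a sketch of Eq. (7): below, a line is fitted to a globally non-linear function, with Gaussian weights concentrating the fit around a center c. The target function, center, and width are illustrative choices:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 100)
y = np.sin(3.0 * x)  # globally non-linear target

# Design matrix with the offset trick
X = np.stack([x, np.ones_like(x)], axis=1)

# Gaussian weighting function centered at c: nearby points matter most
c, sigma = 0.0, 0.2
w = np.exp(-0.5 * ((x - c) / sigma) ** 2)
W = np.diag(w)

# a* = (X^T W X)^{-1} X^T W y
a = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(a)  # a local line around c = 0: positive slope, offset near 0
```

The fitted line is a good local model near c even though a single line fits the whole of sin(3x) poorly.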
Locally Weighted Regressions

In robotics, functions are usually non-linear, but often locally linear!
Idea: do multiple, independent, locally weighted least squares regressions.

William S. Cleveland and Susan J. Devlin (1988). Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association.
Atkeson, C. G.; Moore, A. W. and Schaal, S. (1997). Locally Weighted Learning for Control. Artificial Intelligence Review.
Locally Weighted Regressions
• Idea: multiple, independent, locally weighted least squares regressions
• Locally: radial weighting function with different centers (“receptive fields”)

for e = 1 . . . E
    for n = 1 . . . N
        W_{nn,e} = g(x_n, c_e, Σ)
    a_e = (XᵀW_eX)⁻¹XᵀW_ey.   (8)

Resulting model

f(x) = Σ_{e=1}^E φ(x, θ_e) · (a_eᵀx).   (9)

(φ must be normalized)
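The loop above can be sketched compactly in NumPy; the target function, the hand-picked centers c_e, and the shared width are illustrative assumptions (RFWR/LWPR would determine them automatically):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x) + 0.05 * rng.standard_normal(200)

X = np.stack([x, np.ones_like(x)], axis=1)  # offset trick
centers = np.linspace(0.0, 1.0, 15)         # c_e, chosen by hand
sigma = 0.05                                # shared width

def phi(xq, c):
    # Radial (Gaussian) weighting function: the "receptive field"
    return np.exp(-0.5 * ((xq - c) / sigma) ** 2)

# One independent weighted least squares fit per receptive field
A = np.stack([
    np.linalg.solve(X.T @ np.diag(phi(x, c)) @ X,
                    X.T @ np.diag(phi(x, c)) @ y)
    for c in centers
])  # E x 2: local (slope, offset) per receptive field

def predict(xq):
    Xq = np.stack([xq, np.ones_like(xq)], axis=1)
    Phi = np.stack([phi(xq, c) for c in centers], axis=1)
    Phi = Phi / Phi.sum(axis=1, keepdims=True)  # phi must be normalized
    return np.sum(Phi * (Xq @ A.T), axis=1)     # weighted sum of local lines

print(np.max(np.abs(predict(x) - np.sin(2.0 * np.pi * x))))
```

Each local fit only sees the data its receptive field weighs highly, so the fits can run independently (and in parallel).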
Variations of Locally Weighted Regressions

Receptive Field Weighted Regression
• Incremental, not batch
• E, centers c_{1...E} and widths Σ_{1...E} determined automatically
• Disadvantage: many open parameters

Schaal, S. and Atkeson, C. G. (1997). Receptive Field Weighted Regression. Technical Report TR-H-209, ATR Human Information Processing Laboratories.

Locally Weighted Projection Regression
• As RFWR, but also performs dimensionality reduction within each receptive field

Vijayakumar, S. and Schaal, S. (2000). Locally Weighted Projection Regression. International Conference on Machine Learning.
Outline
• Linear Least Squares: “Fit a line to some data points”
• Weighted Least Squares: “Some data points more important to fit”
• Locally Weighted Regression + friends: “Multiple weighted least squares in input space”
• Radial Basis Function Network + friends: “Project data into feature space. Do least squares in this space.”
• Regularization: “Avoid large parameter vectors.”

Next up: Regularization
Regularization
• Idea: penalize large parameter vectors to
  • avoid overfitting
  • achieve sparse parameter vectors

a* = arg min_a ( ½ ‖y − Xa‖²  +  (λ/2) ‖a‖² )   (10)
                  (fit data)      (small parameters)

L2-norm for ‖a‖

‖a‖₂ = ( Σ_{d=1}^D |a_d|² )^{1/2}   (Euclidean distance)

a* = (λI + XᵀX)⁻¹Xᵀy.

“Tikhonov regularization”, “ridge regression”

L1-norm for ‖a‖

‖a‖₁ = Σ_{d=1}^D |a_d|   (Manhattan distance)

No closed-form solution . . .

“LASSO regularization”

Use a combination of L1 and L2: “elastic nets”

Michael Littman and Charles Isbell feat. Infinite Harmony, “Overfitting A Cappella”
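Ridge regression keeps the closed form of Eq. (10) with the L2 penalty. A minimal sketch; the synthetic data and the two λ values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

N, D = 20, 5
X = rng.standard_normal((N, D))
a_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])  # made-up ground truth
y = X @ a_true + 0.1 * rng.standard_normal(N)

def ridge(X, y, lam):
    # a* = (lambda*I + X^T X)^{-1} X^T y  ("Tikhonov regularization")
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

a_light = ridge(X, y, 0.1)
a_heavy = ridge(X, y, 100.0)

# A larger lambda shrinks the parameter vector towards zero
print(np.linalg.norm(a_light), np.linalg.norm(a_heavy))
```

Note that λ > 0 also makes (λI + XᵀX) invertible even when XᵀX is singular, which is one practical reason to regularize.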
Beyond squares

a* = arg min_a ( ½ ‖y − Xa‖²  +  (λ/2) ‖a‖² )   (11)
                  (fit data)      (small parameters)

• Penalty on residuals r_n (fit data)
• Penalty on parameters a (regularization)

Either penalty can be replaced with a loss other than the square.

Linear Support Vector Regression
• No closed-form solution, but efficient optimizers exist
Outline
• Linear Least Squares: “Fit a line to some data points”
• Weighted Least Squares: “Some data points more important to fit”
• Locally Weighted Regression + friends: “Multiple weighted least squares in input space”
• Regularization: “Avoid large parameter vectors.”
• Radial Basis Function Network + friends: “Project data into feature space. Do least squares in this space.”

Next up: Radial Basis Function Network + friends
Radial Basis Function Network

f(x) = Σ_{e=1}^E w_e · φ(x, θ_e).   (13)

Feature matrix (analogous to design matrix X)

Θ = | φ(x_1, c_1)  φ(x_1, c_2)  · · ·  φ(x_1, c_E) |
    |     ⋮            ⋮         ⋱        ⋮       |   (14)
    | φ(x_N, c_1)  φ(x_N, c_2)  · · ·  φ(x_N, c_E) |

Least squares solution

w* = (ΘᵀΘ)⁻¹Θᵀy.   (15)
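Eq. (15) is ordinary least squares applied in feature space. A sketch with hand-chosen Gaussian basis functions; the centers, width, and target function are illustrative:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100)
y = np.sin(2.0 * np.pi * x)

centers = np.linspace(0.0, 1.0, 12)  # c_e, chosen by hand
sigma = 0.1

def features(xq):
    # Feature matrix Theta: one Gaussian activation per (sample, center)
    return np.exp(-0.5 * ((xq[:, None] - centers[None, :]) / sigma) ** 2)

Theta = features(x)  # N x E

# Least squares in feature space: w* = (Theta^T Theta)^{-1} Theta^T y
w = np.linalg.solve(Theta.T @ Theta, Theta.T @ y)

y_hat = Theta @ w
print(np.max(np.abs(y_hat - y)))  # small approximation error
```

Once the features are fixed, everything after `features(x)` is exactly the linear least squares machinery from the start of the tutorial, with Θ playing the role of X.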
Kernel Ridge Regression
• Like an RBFN, but every data point is the center of a basis function
• Uses L2 regularization

f(x) = Σ_{n=1}^N w_n · k(x, x_n).   (16)

“Gram matrix” (analogous to design matrix X)

K(X,X) = | k(x_1, x_1)  k(x_1, x_2)  · · ·  k(x_1, x_N) |
         |     ⋮            ⋮         ⋱        ⋮       |   (17)
         | k(x_N, x_1)  k(x_N, x_2)  · · ·  k(x_N, x_N) |

w* = (KᵀK)⁻¹Kᵀy   (18)
   = K⁻¹y,        (19)

w* = (λI + K)⁻¹y with L2 regularization   (20)
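A sketch of Eq. (20) with a Gaussian kernel; the kernel width, λ, and the noisy target are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + 0.05 * rng.standard_normal(50)

def kernel(a, b, sigma=0.1):
    # Gaussian kernel; every training point becomes a basis-function center
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / sigma) ** 2)

K = kernel(x, x)                                   # N x N Gram matrix
lam = 1e-3
w = np.linalg.solve(lam * np.eye(len(x)) + K, y)   # w* = (lam*I + K)^{-1} y

# Predict at new inputs: f(x) = sum_n w_n k(x, x_n)
xq = np.linspace(0.0, 1.0, 200)
y_hat = kernel(xq, x) @ w
print(np.max(np.abs(y_hat - np.sin(2.0 * np.pi * xq))))
```

The λI term also regularizes the matrix inversion numerically, which matters because K itself is often badly conditioned.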
(Radial) Basis Function Networks

Beyond radial basis functions
• Cosines: Ridge Regression with Random Fourier Features
• Sigmoids: Extreme Learning Machines (feedforward network with one hidden layer)
• Boxcars: model trees (as decision trees, but for regression)
• Kernels: every data point is the center of a radial basis function

Since least squares is at the heart of all of these:
• incremental versions ← recursive least squares
• apply L2 regularization (still closed form)
Freek, aren’t you being a bit shallow?
• Deep learning is great when you
  • do not know the features
  • know the features to be hierarchically organized

John Smart

Rajeswaran, A.; Lowrey, K.; Todorov, E. and Kakade, S. (2017). Towards Generalization and Simplicity in Continuous Control. Neural Information Processing Systems (NIPS).
A neural network perspective

All these models can be considered (degenerate) neural networks!
Backpropagation can be used in all these models!
Linear model
Figure: Network representation of a linear model, with inputs x_1 . . . x_D, weights a_1 . . . a_D, and a summation output node. Activation is . . . linear!

RBFN
f(x) = Σ_{e=1}^E b_e φ(x, θ_e)
Figure: The RBFN model as a network: the inputs feed E basis-function nodes φ_1 . . . φ_E, whose outputs are summed with weights b_1 . . . b_E. φ_e is an abbreviation of φ(x, θ_e).

RRRFF
Figure: The RRRFF model, with the same network structure as the RBFN.

SVR
Figure: The SVR model, with the same network structure as the RBFN.

Regression trees
Figure: The regression trees model, with the same network structure as the RBFN.

Extreme learning machine
Figure: The extreme learning machine model, with the same network structure as the RBFN.
Extreme Learning Machine vs. (Deep) Neural Networks
• ELM: sigmoid activation function, one hidden layer with random (untrained) features
• ANN: sigmoid activation function, hidden layers with learned features
KRR and GPR
f(x) = Σ_{n=1}^N w_n φ_n(x)
Figure: The function model used in KRR and GPR, as a network: one basis-function node φ_n per training point, summed with weights w_1 . . . w_N.
Locally weighted regression
Figure: Function model in Locally Weighted Regressions, represented as a feedforward neural network. The functions φ_e(x) generate the weights w_e from the hidden nodes – which contain linear sub-models (a_eᵀx + b_e) – to the output node. Here, φ_e is an abbreviation of φ(x, θ_e).
Conclusion
• Linear Least Squares: “Fit a line to some data points”
• Weighted Least Squares: “Some data points more important to fit”
• Locally Weighted Regression + friends: “Multiple weighted least squares in input space” → weighted sum of linear models
• Radial Basis Function Network + friends: “Project data into feature space. Do least squares in this space.” → weighted sum of basis functions
Conclusion: Generic batch regression flow-chart
• Algorithm. least squares: a* = (λI + XᵀX)⁻¹Xᵀy
• Model. linear model: f(x) = aᵀx
• Meta-parameters. regularization: λ
• Model parameters. slopes: a

Training: the regression algorithm takes training data (inputs/targets) and meta-parameters, and produces the model parameters.
Evaluation: the model, with its learned parameters, maps a novel input to an output (prediction).
ConclusionLinear Least
Squares“Fit a line to
some data points”
WeightedLeast Squares
“Some datapoints more
important to fit”
LocallyWeighted
Regression+ friends
“Multiple weightedleast squares
in input space”
Conclusion

Linear Least Squares: "Fit a line to some data points"
Weighted Least Squares: "Some data points are more important to fit"
Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."

Unified Model

f(x) = Σ_{e=1}^{E} φ(x, θ_e) · (b_e + a_eᵀx)    Weighted sum of linear models (21)

f(x) = Σ_{e=1}^{E} φ(x, θ_e) · w_e              Weighted sum of basis functions (22)

(22) is a special case of (21) with a_e = 0 and b_e ≡ w_e

Freek Stulp and Olivier Sigaud (2015). Many regression algorithms, one unified model - A review. Neural Networks.
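The unified model can be sketched in a few lines of Python; the centers, slopes, and offsets below are made up for illustration, with Gaussian basis functions as in the slides:

```python
import math

# Hypothetical 1-D example with E = 3 Gaussian basis functions.
CENTERS = [0.0, 0.5, 1.0]   # theta_e: basis-function centers (made up)
WIDTH = 0.3                  # shared basis-function width (made up)

def phi(x, c):
    # Gaussian basis function centered at c
    return math.exp(-0.5 * ((x - c) / WIDTH) ** 2)

def f_unified(x, a, b):
    # Eq. (21): weighted sum of linear models b_e + a_e * x
    return sum(phi(x, c) * (b_e + a_e * x)
               for c, a_e, b_e in zip(CENTERS, a, b))

def f_bfs(x, w):
    # Eq. (22): weighted sum of basis functions
    return sum(phi(x, c) * w_e for c, w_e in zip(CENTERS, w))

# (22) is the special case of (21) with a_e = 0 and b_e = w_e:
w = [0.3, 0.9, 0.1]
assert abs(f_unified(0.4, [0.0, 0.0, 0.0], w) - f_bfs(0.4, w)) < 1e-12
```

With nonzero slopes a_e, the same code covers LWR-style mixtures of local linear models; with a_e = 0 it reduces to an RBFN.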
Conclusion

Figure: Classification of regression algorithms, based only on the model used to represent the underlying function.

• Linear model, f(x) = aᵀx + b (E = 1):
  • LLS (Linear Least Squares), Weighted Linear Least Squares
• Mixture of linear models (unified model), f(x) = Σ_{e=1}^{E} φ(x, θ_e) · (a_eᵀx + b_e), with sub-models a_eᵀx + b_e and weights φ(x, θ_e):
  • Gaussian BFs: GMR (Gaussian Mixture R.), LWR (Locally Weighted R.), RFWR (Receptive Field Weighted R.), LWPR (Locally Weighted Projection R.)
  • Any BFs: XCSF
  • Box-car BFs: M5 (Model Trees)
• Weighted sum of basis functions (a = 0), f(x) = Σ_{e=1}^{E} φ(x, θ_e) · b_e, with sub-models φ(x, θ_e) and weights b_e:
  • Radial φ(||x − c||): RBFN (Radial Basis Function Network)
  • Kernel φ(x, x′): KRR (Kernel Ridge R.), GPR (Gaussian Process R.)
  • Cosine BFs: iRFRLS, I-SSGPR
  • Box-car BFs: CART (Regression Trees)
  • Sigmoid φ(⟨x, w⟩): ELM (Extreme Learning Machine), Backpropagation
Many toolkits available

• Python
  • scikit-learn: http://scikit-learn.org
  • StatsModels: http://www.statsmodels.org/
  • PbDlib: http://calinon.ch/codes.htm
  • dmpbbo: https://github.com/stulp/dmpbbo
• Matlab
  • curvefit: https://www.mathworks.com/help/curvefit/linear-and-nonlinear-regression.html
  • PbDlib: http://calinon.ch/codes.htm
• C++
  • PbDlib: http://calinon.ch/codes.htm
  • dmpbbo: https://github.com/stulp/dmpbbo
Personal Favourites

Gaussian process regression
+ Very few assumptions
+ Meta-parameters estimated from the data itself
+ Also estimates variance
+ Works in high dimensions
- Training/query times increase with the amount of data
- Not easy to make incremental

Gaussian mixture regression
+ Also estimates variance
+ Algorithm is inherently incremental
+ Some meta-parameters, but easy to tune
+ Fast training times
- Training only stable for low input dimensions

Locally weighted regression
+ Fast query times, fast training
+ Few meta-parameters, and easy to set
+ Stable learning results (batch)
- Not incremental
- No variance estimate

Deep learning
+ Automatic extraction of (hierarchical) features
Conclusion

• Don't think of these regression algorithms as unique: they are similar algorithms that use different subsets of algorithmic features
• All these models are essentially shallow neural networks with different basis functions

Thank you for your attention!
Appendix
Regression Tutorial – Freek Stulp
Gaussian Process Regression

"Given a Gaussian process on some topological space T, with a continuous covariance kernel C(·, ·) : T × T → R, we can associate a Hilbert space, which is the reproducing kernel Hilbert space of real-valued functions on T, with C as kernel function."

Instead of screaming, let's talk about what it means to be smooth.
Gaussian Process Regression
• Points that are close in the input space should be close in theoutput space.• Cities that are close geographically have similar temperatures (on
average)• Taller people have larger shoe sizes (on average)
• Shoe size covaries with height
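This smoothness assumption is what a covariance function encodes. As a sketch (the slides do not name a specific kernel, so the squared-exponential kernel and the 300 km length scale below are assumptions), nearby inputs get high covariance and distant ones get almost none:

```python
import math

def k(x, xp, length_scale=300.0):
    # Squared-exponential covariance: k(x, x') = exp(-(x - x')^2 / (2 l^2)).
    # length_scale is in km and is a made-up value for illustration.
    return math.exp(-((x - xp) ** 2) / (2 * length_scale ** 2))

# Rough inter-city distances; the resulting values only loosely resemble
# the Gram matrix in the slides.
print(round(k(0.0, 65.0), 2))    # Augsburg-Munich, ~65 km apart -> 0.98
print(round(k(0.0, 2200.0), 2))  # Augsburg-Moscow, ~2200 km apart -> 0.0
```

Any covariance of a point with itself is 1.0 here, matching the diagonal of a Gram matrix built from this kernel.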
Gaussian Process Regression – Covariance Function

covariance function → covariance matrix (Gram matrix)

K(X,X) =
         xAug  xMuc  xWar  xMin  xMos
  xAug   1.00  0.96  0.42  0.02  0.00
  xMuc   0.96  1.00  0.59  0.04  0.00
  xWar   0.42  0.59  1.00  0.32  0.10
  xMin   0.02  0.04  0.32  1.00  0.80
  xMos   0.00  0.00  0.10  0.80  1.00

• Remarks
  • The basis function has a very specific interpretation: covariance
  • No temperature measurements y have been made yet
  • Prior: assume the temperature is 0°C

Question: What is the expected temperature in Munich, given 9°C in Augsburg?
(condition on T_Aug = 9, i.e. E[T_Muc | T_Aug = 9])
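A quick numerical check of this question: with the zero-mean prior and the Gram-matrix entries above, conditioning a Gaussian gives E[T_Muc | T_Aug = t] = k(x_Muc, x_Aug) / k(x_Aug, x_Aug) · t:

```python
# Values taken from the Gram matrix above; prior mean is 0 degrees C.
k_muc_aug = 0.96   # covariance between Munich and Augsburg
k_aug_aug = 1.00   # Augsburg's variance (diagonal entry)
t_aug = 9.0        # measured temperature in Augsburg

t_muc = k_muc_aug / k_aug_aug * t_aug
print(round(t_muc, 2))  # 8.64
```

The prediction is simply the measurement scaled down by the covariance: strong covariance (0.96) keeps Munich close to Augsburg's 9°C.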
Gaussian Process Regression – Example

k(x_Muc, x_Aug) = 0.96    k(x_War, x_Aug) = 0.42    k(x_Mos, x_Aug) = 0.00

What are the plane slopes?

y_q = k(x_q, X) K(X,X)⁻¹ y    (23)

where k(x_q, X) is the covariance vector above, and K(X,X)⁻¹ y is least squares!
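With a single measurement (Augsburg at 9°C), equation (23) reduces to scaling the measurement by each query's covariance with Augsburg; a minimal sketch using the values above:

```python
# Eq. (23): y_q = k(x_q, X) K(X,X)^{-1} y, with X = [x_Aug] and y = [9.0].
# K(X,X) is 1x1 here, so its inverse is just 1 / k(x_Aug, x_Aug).
K_inv = 1.0 / 1.00
y = 9.0

k_q = {"Munich": 0.96, "Warsaw": 0.42, "Moscow": 0.00}  # k(x_q, x_Aug)
predictions = {city: kq * K_inv * y for city, kq in k_q.items()}

for city, pred in predictions.items():
    print(city, round(pred, 2))  # Munich 8.64, Warsaw 3.78, Moscow 0.0
```

Cities with zero covariance to the measurement (Moscow) simply fall back to the 0°C prior.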
Gaussian Process Regression – Example

k(x_Muc, [x_Aug x_Min]) = [0.96 0.04]
k(x_War, [x_Aug x_Min]) = [0.42 0.32]
k(x_Mos, [x_Aug x_Min]) = [0.00 0.80]

K(X,X) =
         xAug  xMin
  xAug   1.00  0.02
  xMin   0.02  1.00

What are the plane slopes?

y_q = k(x_q, X) K(X,X)⁻¹ y    (23)
Kernel Regression

f(x) = Σ_{n=1}^{N} w_n · k(x, x_n)

w* = K(X,X)⁻¹ y
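A sketch of kernel regression with the two measurement cities from the slides; the Minsk temperature (2°C) is a made-up value, since the slides only specify Augsburg's 9°C:

```python
# Kernel regression: first solve K w = y for w* = K^{-1} y, then predict
# f(x_q) = sum_n w_n * k(x_q, x_n), which is exactly eq. (23).
K = [[1.00, 0.02],
     [0.02, 1.00]]   # K(X,X) for X = [x_Aug, x_Min], from the slides
y = [9.0, 2.0]       # 9 degrees C in Augsburg; 2 degrees C in Minsk (hypothetical)

# Solve the 2x2 system via Cramer's rule
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
w = [(K[1][1] * y[0] - K[0][1] * y[1]) / det,
     (K[0][0] * y[1] - K[1][0] * y[0]) / det]

def predict(k_q):
    # Weighted sum of basis functions with learned weights w
    return sum(wn * kn for wn, kn in zip(w, k_q))

print(round(predict([0.96, 0.04]), 2))  # Munich: 8.68
print(round(predict([0.42, 0.32]), 2))  # Warsaw: 4.35
print(round(predict([0.00, 0.80]), 2))  # Moscow: 1.46
```

Because the two measurements are nearly uncorrelated (0.02), the weights stay close to the raw measurements, and each query blends them according to its covariance vector.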
Gaussian Process Regression – Example

The more measurements become available, the more certain we become.

y_q = k(x_q, X) K(X,X)⁻¹ y    (23)
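This growing certainty can be made concrete with the GP predictive variance, σ²(x_q) = k(x_q, x_q) − k(x_q, X) K(X,X)⁻¹ k(X, x_q); a sketch for Munich using the Gram-matrix values from the slides:

```python
# Predictive variance for x_q = Munich as measurements accumulate.
k_qq = 1.00                       # k(x_Muc, x_Muc)

# No measurements: prior variance
var0 = k_qq

# One measurement (Augsburg): K(X,X) is the 1x1 matrix [1.00]
k_q1 = 0.96                       # k(x_Muc, x_Aug)
var1 = k_qq - k_q1 * (1.0 / 1.00) * k_q1

# Two measurements (Augsburg, Minsk): 2x2 K(X,X) from the slides
K = [[1.00, 0.02], [0.02, 1.00]]
kq = [0.96, 0.04]                 # k(x_Muc, [x_Aug, x_Min])
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
Kinv = [[ K[1][1] / det, -K[0][1] / det],
        [-K[1][0] / det,  K[0][0] / det]]
tmp = [sum(Kinv[i][j] * kq[j] for j in range(2)) for i in range(2)]
var2 = k_qq - sum(kq[i] * tmp[i] for i in range(2))

assert var2 < var1 < var0         # each measurement reduces uncertainty
```

Note the variance does not depend on the measured values y at all, only on where the measurements were taken.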