
Page 1

IROS2018 Tutorial Mo-TUT-2

From Least Squares Regression to High-dimensional Motion Primitives

Freek Stulp, Sylvain Calinon, Gerhard Neumann

Pages 2–4

Generation vs. Recall

159 + 72 = ???

159 + 72 = (159+2) + (72−2) = 161 + 70 = 231   ("generation")

Motion Generation / Motion Recall

Pages 5–6

Generation vs. Recall

159 + 72 = (9+2) + (150+70) = 11 + 220 = 231   ("generation")

Motion Generation / Motion Recall

Pages 7–8

Generation vs. Recall

159 + 72 = 231   ("recall")

Motion Generation / Motion Recall

Pages 9–10

Generation vs. Recall

159 + 72 = 231   ("recall")

The distinction between these two strategies is important in cognitive science, artificial intelligence, robotics, and teaching.

Motion Generation / Motion Recall

Pages 11–13

Motion primitives in nature

Giszter, S.; Mussa-Ivaldi, F. & Bizzi, E. (1993). Convergent force fields organized in the frog's spinal cord. Journal of Neuroscience.

Flash, T. & Hochner, B. (2005). Motor Primitives in Vertebrates and Invertebrates. Current Opinion in Neurobiology.

Motion primitives for robots!

• Couple degrees of freedom to deal with high-dimensional systems
• Sequencing and superposition of MPs for more complex tasks
• Low-dimensional parameterization of MPs enables learning
• MPs can be bootstrapped with demonstrations
• Direct mappings between task parameters and MP parameters

Ijspeert, A. J.; Nakanishi, J. & Schaal, S. (2002). Movement imitation with nonlinear dynamical systems in humanoid robots. ICRA.

Page 14

Schedule
9:00 – 9:15    Introduction
9:15 – 10:45   Regression Tutorial            Freek Stulp
10:45 – 11:00  Motion Primitives 1            Sylvain Calinon
11:00 – 11:30  Coffee Break
11:30 – 12:15  Motion Primitives 1 (cont.)    Sylvain Calinon
12:15 – 13:15  Motion Primitives 2            Gerhard Neumann
13:15 – 13:30  Wrap up

Page 15

Regression Tutorial
IROS'18 Tutorial

Freek Stulp
Institute of Robotics and Mechatronics, German Aerospace Center (DLR)
Autonomous Systems and Robotics, ENSTA-ParisTech

01.10.2018

Pages 16–18

What Is Regression?

Estimating a relationship between input variables and continuous output variables from data.

Page 19

What Is Regression?

Estimating a relationship between input variables and continuous output variables from data.

Application: Dynamic parameter estimation

An, C.; Atkeson, C. and Hollerbach, J. (1985). Estimation of inertial parameters of rigid body links of manipulators [404]. IEEE Conference on Decision and Control.

Page 20

What Is Regression?

Estimating a relationship between input variables and continuous output variables from data.

Application: Programming by demonstration

Rozo, L.; Calinon, S.; Caldwell, D. G.; Jimenez, P. and Torras, C. (2016). Learning Physical Collaborative Robot Behaviors from Human Demonstrations. IEEE Trans. on Robotics.

Calinon, S.; Guenter, F. and Billard, A. (2007). On Learning, Representing and Generalizing a Task in a Humanoid Robot [725]. IEEE Transactions on Systems, Man and Cybernetics.

Page 21

What Is Regression?

Estimating a relationship between input variables and continuous output variables from data.

Application: Biosignal processing

Gijsberts, A.; Bohra, R.; Sierra González, D.; Werner, A.; Nowak, M.; Caputo, B.; Roa, M. and Castellini, C. (2014). Stable myoelectric control of a hand prosthesis using non-linear incremental learning. Frontiers in Neurorobotics.

Page 22

What Is Not Regression?

Training data: $\{(x_n, y_n)\}_{n=1}^{N}$ with inputs $x_n \in \mathcal{X}$ and targets $y_n \in \mathcal{Y}$.

Supervised learning: targets available
  • Regression: targets available, $\mathcal{Y} \subseteq \mathbb{R}^M$
  • Classification: targets available, $\mathcal{Y} \subseteq \{1, \dots, K\}$
Reinforcement learning: no targets, only rewards $r_n \in \mathbb{R}$
Unsupervised learning: no targets at all

Pages 23–24

Regression – Assumptions about the Function

linear · locally linear · smooth · none

→ Linear Least Squares

A. M. Legendre (1805). Nouvelles méthodes pour la détermination des orbites des comètes [519]. Firmin Didot.

C. F. Gauss (1809). Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum [943].

Pages 25–26

Linear Least Squares

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,D} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,D} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N,1} & x_{N,2} & \cdots & x_{N,D} \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$

• X is the N × D "design matrix"
• Each row is a D-dimensional data point

Linear model: $f(x) = a^\top x$

Page 27

Linear Least Squares

• Which line fits the data best?
  1. Define fitting criterion
  2. Optimize a w.r.t. criterion

Linear model: $f(x) = a^\top x$

Offset trick:
$$f(x) = a^\top x + b = \begin{bmatrix} a \\ b \end{bmatrix}^\top \begin{bmatrix} x \\ 1 \end{bmatrix}$$
i.e. append a column of ones to the design matrix:
$$X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,D} & 1 \\ \vdots & \ddots & \vdots & \vdots \\ x_{N,1} & \cdots & x_{N,D} & 1 \end{bmatrix}$$

Nice! Closed-form solution to find $a^*$.

Pages 28–29

Linear Least Squares

1. Define fitting criterion: sum of squared residuals

$$S(a) = \sum_{n=1}^{N} r_n^2 = \sum_{n=1}^{N} \big(y_n - f(x_n)\big)^2 \qquad (1,2)$$

Applied to a linear model, $f(x) = a^\top x$:

$$S(a) = \sum_{n=1}^{N} \big(y_n - a^\top x_n\big)^2 = (y - Xa)^\top (y - Xa) \qquad (3,4)$$

Pages 30–33

Linear Least Squares

2. Optimize a w.r.t. criterion: minimize the sum of squared residuals S(a)

$$a^* = \arg\min_a S(a) = \arg\min_a (y - Xa)^\top (y - Xa) \qquad (1,2)$$

Quadratic cost: where is its derivative zero?

$$S'(a) = 2\big(X^\top X a - X^\top y\big) = 0 \quad\Rightarrow\quad a^* = (X^\top X)^{-1} X^\top y \qquad (3,4)$$

Nice! Closed-form solution to find $a^*$; the fitted line is $f(x) = a^{*\top} x$.
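To make the closed-form solution concrete, here is a minimal NumPy sketch (not from the slides; the data and variable names are illustrative) that fits a line with the offset trick and evaluates it at a novel input:

```python
import numpy as np

# Illustrative data: noisy samples of y = 2x + 1 (assumed, not from the slides)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 5.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=50)

# Offset trick: append a column of ones to the design matrix X
X = np.column_stack([x, np.ones_like(x)])

# Closed-form least squares: a* = (X^T X)^{-1} X^T y
# (np.linalg.solve is used instead of an explicit inverse for numerical stability)
a_star = np.linalg.solve(X.T @ X, X.T @ y)   # [slope, offset]

# Evaluate f(x) = a*^T [x, 1] at a novel input
x_new = 3.0
print(a_star, a_star @ np.array([x_new, 1.0]))
```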

Pages 34–36

Outline

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."

Pages 37–42

Weighted Linear Least Squares

Idea: it is more important to fit some points than others.

• Importance ≡ weight $w_n$
• Example weightings: manual, boxcar function, Gaussian function
$$w_n = \phi(x_n, \theta) = \exp\!\big(-\tfrac{1}{2}(x_n - c)^\top \Sigma^{-1} (x_n - c)\big), \quad \text{with } \theta = (c, \Sigma)$$

1. Define fitting criterion: weighted residuals

$$S(a) = \sum_{n=1}^{N} w_n \big(y_n - a^\top x_n\big)^2 = (y - Xa)^\top W (y - Xa) \qquad (5,6)$$

2. Optimize a w.r.t. criterion

$$a^* = (X^\top W X)^{-1} X^\top W y \qquad (7)$$
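As a minimal sketch of eqs. (5)–(7) (variable names and the Gaussian weighting parameters are illustrative, not the tutorial's code), only the normal equations change: a diagonal weight matrix W enters on both sides.

```python
import numpy as np

def gaussian_weights(X, c, Sigma):
    """Gaussian weighting: w_n = exp(-0.5 (x_n - c)^T Sigma^{-1} (x_n - c))."""
    diff = X - c
    return np.exp(-0.5 * np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1))

def weighted_least_squares(X, y, w):
    """Closed form a* = (X^T W X)^{-1} X^T W y, with W = diag(w)."""
    XtW = X.T * w                # equivalent to X.T @ np.diag(w)
    return np.linalg.solve(XtW @ X, XtW @ y)

# Illustrative usage: 1-D input plus offset column, weights centered at x = 2
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 5.0, size=100)
y = np.sin(x) + rng.normal(scale=0.1, size=100)
X = np.column_stack([x, np.ones_like(x)])
w = gaussian_weights(x[:, None], c=np.array([2.0]), Sigma=np.array([[0.25]]))
a_local = weighted_least_squares(X, y, w)    # local linear fit around x = 2
print(a_local)
```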

Pages 43–45

Locally Weighted Regressions

In robotics, functions are usually non-linear, but often locally linear!

Idea: do multiple, independent, locally weighted least squares regressions.

William S. Cleveland; Susan J. Devlin (1988). Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting [4074]. Journal of the American Statistical Association.

Atkeson, C. G.; Moore, A. W. and Schaal, S. (1997). Locally Weighted Learning for Control [2160]. Artificial Intelligence Review.

Pages 46–53

Locally Weighted Regressions

• Idea: multiple, independent, locally weighted least squares regressions
• Locally: radial weighting function with different centers ("receptive fields")

For each receptive field e = 1 … E, weight every data point n = 1 … N and solve a weighted least squares problem:

$$(W_e)_{nn} = g(x_n, c_e, \Sigma), \qquad a_e = (X^\top W_e X)^{-1} X^\top W_e y \qquad (8)$$

Resulting model:

$$f(x) = \sum_{e=1}^{E} \phi(x, \theta_e) \cdot (a_e^\top x) \qquad (9)$$

(the weighting functions φ must be normalized)
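The sketch below (illustrative: fixed, evenly spaced centers and one shared width, rather than any particular scheme from the tutorial) fits E local linear models with Gaussian receptive fields and blends their predictions with normalized weights, i.e. eqs. (8) and (9):

```python
import numpy as np

def lwr_fit(x, y, centers, width):
    """Fit one weighted least squares line per receptive field (1-D input, offset trick)."""
    X = np.column_stack([x, np.ones_like(x)])
    models = []
    for c in centers:
        w = np.exp(-0.5 * ((x - c) / width) ** 2)          # diagonal of W_e
        XtW = X.T * w
        models.append(np.linalg.solve(XtW @ X, XtW @ y))   # a_e = (X^T W_e X)^{-1} X^T W_e y
    return np.array(models)                                # shape (E, 2)

def lwr_predict(x_query, models, centers, width):
    """f(x) = sum_e phi_e(x) * (a_e^T [x, 1]), with normalized weights phi_e."""
    phi = np.exp(-0.5 * ((x_query[:, None] - centers[None, :]) / width) ** 2)
    phi /= phi.sum(axis=1, keepdims=True)                  # normalize over receptive fields
    Xq = np.column_stack([x_query, np.ones_like(x_query)])
    local_preds = Xq @ models.T                            # (n_query, E) local predictions
    return np.sum(phi * local_preds, axis=1)

# Illustrative usage on a non-linear function
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)
centers = np.linspace(0, 2 * np.pi, 10)
models = lwr_fit(x, y, centers, width=0.5)
print(lwr_predict(np.array([1.0, 3.0, 5.0]), models, centers, width=0.5))
```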

Pages 54–55

Variations of Locally Weighted Regressions

Receptive Field Weighted Regression (RFWR)
• Incremental, not batch
• E, centers c_1…E, and widths Σ_1…E determined automatically
• Disadvantage: many open parameters

Schaal, S. and Atkeson, C. G. (1997). Receptive Field Weighted Regression [34]. Technical Report TR-H-209, ATR Human Information Processing Laboratories.

Locally Weighted Projection Regression (LWPR)
• As RFWR, but also performs dimensionality reduction within each receptive field

Vijayakumar, S. and Schaal, S. (2000). Locally Weighted Projection Regression [208]. International Conference on Machine Learning.

Pages 56–58

Outline

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space"
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space."
• Regularization: "Avoid large parameter vectors."

Pages 59–64

Regularization

• Idea: penalize large parameter vectors to avoid overfitting / achieve sparse parameter vectors

$$a^* = \arg\min_a \Big( \underbrace{\tfrac{1}{2}\,\|y - Xa\|^2}_{\text{fit data}} + \underbrace{\tfrac{\lambda}{2}\,\|a\|^2}_{\text{small parameters}} \Big) \qquad (10)$$

L2-norm for ‖a‖ (Euclidean distance):
$$\|a\|_2 = \Big(\sum_{d=1}^{D} |a_d|^2\Big)^{1/2}, \qquad a^* = (\lambda I + X^\top X)^{-1} X^\top y$$
"Tikhonov Regularization" / "Ridge Regression"

L1-norm for ‖a‖ (Manhattan distance):
$$\|a\|_1 = \Big(\sum_{d=1}^{D} |a_d|^1\Big)^{1/1} = \sum_{d=1}^{D} |a_d|$$
No closed-form solution … "LASSO Regularization"

Use a combination of L1 and L2: "Elastic Nets"

Michael Littman and Charles Isbell feat. Infinite Harmony: "Overfitting A Cappella"
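A minimal NumPy sketch of the ridge (L2) case, which keeps the closed form; λ and the data are illustrative, and the L1 and elastic-net variants would need an iterative solver instead:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: a* = (lambda*I + X^T X)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ y)

# Illustrative usage: many noisy features, only three of which matter
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 20))
true_a = np.zeros(20)
true_a[:3] = [1.5, -2.0, 0.5]
y = X @ true_a + rng.normal(scale=0.1, size=50)

a_ridge = ridge_fit(X, y, lam=1.0)   # shrinks all coefficients toward zero
a_ls    = ridge_fit(X, y, lam=0.0)   # plain least squares for comparison
print(np.linalg.norm(a_ridge), np.linalg.norm(a_ls))
```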

Pages 65–66

Beyond squares

$$a^* = \arg\min_a \Big( \underbrace{\tfrac{1}{2}\,\|y - Xa\|^2}_{\text{fit data}} + \underbrace{\tfrac{\lambda}{2}\,\|a\|^2}_{\text{small parameters}} \Big) \qquad (11)$$

Penalty on residuals r_n (fit data); penalty on parameters a (regularization).

Page 67

Linear Support Vector Regression

• No closed-form solution, but efficient optimizers exist


Pages 71–76

Radial Basis Function Network

$$f(x) = \sum_{e=1}^{E} w_e \cdot \phi(x, c_e) \qquad (12)$$

Pages 77–80

Radial Basis Function Network

$$f(x) = \sum_{e=1}^{E} w_e \cdot \phi(x, \theta_e) \qquad (13)$$

Feature matrix (analogous to the design matrix X):

$$\Theta = \begin{bmatrix} \phi(x_1, c_1) & \phi(x_1, c_2) & \cdots & \phi(x_1, c_E) \\ \phi(x_2, c_1) & \phi(x_2, c_2) & \cdots & \phi(x_2, c_E) \\ \vdots & \vdots & \ddots & \vdots \\ \phi(x_N, c_1) & \phi(x_N, c_2) & \cdots & \phi(x_N, c_E) \end{bmatrix} \qquad (14)$$

Least squares solution:

$$w^* = (\Theta^\top \Theta)^{-1} \Theta^\top y \qquad (15)$$
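A minimal sketch of the RBFN fit with eq. (15) (centers, width, and data are illustrative): project the data onto Gaussian basis functions and run ordinary least squares in that feature space.

```python
import numpy as np

def rbf_features(x, centers, width):
    """Feature matrix Theta: one Gaussian basis function per center (1-D input)."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

# Illustrative data and basis functions
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.1, size=200)
centers = np.linspace(0, 2 * np.pi, 12)

Theta = rbf_features(x, centers, width=0.4)
# Least squares in feature space: w* = (Theta^T Theta)^{-1} Theta^T y
w_star = np.linalg.solve(Theta.T @ Theta, Theta.T @ y)

# Prediction: f(x) = sum_e w_e * phi(x, c_e)
x_query = np.array([1.0, 2.5, 4.0])
print(rbf_features(x_query, centers, width=0.4) @ w_star)
```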

Pages 81–82

Kernel Ridge Regression

• Like an RBFN, but every data point is the center of a basis function
• Uses L2 regularization

$$f(x) = \sum_{n=1}^{N} w_n \cdot k(x, x_n) \qquad (16)$$

"Gram matrix" (analogous to the design matrix X):

$$K(X,X) = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{bmatrix} \qquad (17)$$

$$w^* = (K^\top K)^{-1} K^\top y = K^{-1} y \qquad (18,19)$$
$$w^* = (\lambda I + K)^{-1} y \quad \text{with L2 regularization} \qquad (20)$$
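A corresponding sketch for kernel ridge regression, eq. (20) (Gaussian kernel, λ, and data chosen for illustration): a regularized linear solve against the Gram matrix, followed by prediction as a weighted sum of kernels.

```python
import numpy as np

def gaussian_kernel(A, B, width):
    """k(x, x') = exp(-(x - x')^2 / (2 width^2)) for 1-D inputs."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / width) ** 2)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 2 * np.pi, 100))
y = np.sin(x) + rng.normal(scale=0.1, size=100)

K = gaussian_kernel(x, x, width=0.5)
lam = 1e-2
# Every data point is a basis-function center: w* = (lambda*I + K)^{-1} y
w_star = np.linalg.solve(lam * np.eye(len(x)) + K, y)

# Prediction at novel inputs: f(x) = sum_n w_n k(x, x_n)
x_query = np.array([1.0, 3.0, 5.0])
print(gaussian_kernel(x_query, x, width=0.5) @ w_star)
```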

Pages 83–84

(Radial) Basis Function Networks

Beyond radial basis functions:
• Cosines: Ridge Regression with Random Fourier Features
• Sigmoids: Extreme Learning Machines (MLFF with 1 hidden layer)
• Boxcars: model trees (like decision trees, but for regression)
• Kernels: every data point is the center of a radial basis function

Since least squares is at the heart of all of these:
• incremental versions ← recursive least squares
• L2 regularization can be applied (still closed form)

Pages 85–86

Freek, aren't you being a bit shallow?

• Deep learning is great when you
  • do not know the features
  • know the features to be hierarchically organized

Rajeswaran, A.; Lowrey, K.; Todorov, E. and Kakade, S. (2017). Towards generalization and simplicity in continuous control. Neural Information Processing Systems (NIPS).

Pages 87–88

A neural network perspective

All these models can be considered (degenerate) neural networks!

Backpropagation can be used in all these models!

Page 89

Linear model

Figure: Network representation of a linear model: input nodes x_1 … x_D feed a single summation output node y through weights a_1 … a_D. Activation is… linear!

Page 90

RBFN

$$f(x) = \sum_{e=1}^{E} b_e\, \phi(x, \theta_e)$$

Figure: The RBFN model as a network: input nodes x_1 … x_D feed basis-function nodes φ_1 … φ_E, whose outputs are summed with weights b_1 … b_E. φ_e is an abbreviation of φ(x, θ_e).

Pages 91–94

RRRFF, SVR, regression trees, and the extreme learning machine

These models share the same network structure as the RBFN, differing only in the choice of basis functions:

$$f(x) = \sum_{e=1}^{E} b_e\, \phi(x, \theta_e)$$

Figure: The same network representation, shown in turn for the RRRFF, SVR, regression-tree, and extreme learning machine models. φ_e is an abbreviation of φ(x, θ_e).

Page 95

Extreme Learning Machine vs. (Deep) Neural Networks

• ELM: sigmoid act. function, no hidden layer, random features
• ANN: sigmoid act. function, hidden layers, learned features
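To illustrate the ELM idea sketched above (random, untrained sigmoid features followed by a least squares readout; all names and sizes are illustrative assumptions):

```python
import numpy as np

def elm_fit(X, y, n_hidden, rng):
    """Extreme learning machine: random sigmoid features + least squares readout."""
    W = rng.normal(size=(X.shape[1], n_hidden))     # random projection, never trained
    b = rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))          # sigmoid feature matrix
    beta = np.linalg.lstsq(H, y, rcond=None)[0]     # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Illustrative usage
rng = np.random.default_rng(6)
X = np.sort(rng.uniform(0, 2 * np.pi, 200))[:, None]
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)
W, b, beta = elm_fit(X, y, n_hidden=50, rng=rng)
print(elm_predict(np.array([[1.0], [3.0], [5.0]]), W, b, beta))
```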

Page 96

KRR and GPR

$$f(x) = \sum_{n=1}^{N} w_n\, \phi_n(x)$$

Figure: The function model used in KRR and GPR, as a network with one basis function φ_n (and weight w_n) per data point, n = 1 … N.

Page 97

Locally weighted regression

Figure: Function model in Locally Weighted Regressions, represented as a feedforward neural network. The functions φ_e(x) generate the weights w_e from the hidden nodes – which contain linear sub-models (a_e^⊤ x + b_e) – to the output node. Here, φ_e is an abbreviation of φ(x, θ_e).

Page 98

Conclusion

• Linear Least Squares: "Fit a line to some data points"
• Weighted Least Squares: "Some data points more important to fit"
• Locally Weighted Regression + friends: "Multiple weighted least squares in input space" (a weighted sum of linear models)
• Radial Basis Function Network + friends: "Project data into feature space. Do least squares in this space." (a weighted sum of basis functions)

Pages 99–103

Conclusion: Generic batch regression flow-chart

• Algorithm (least squares): $a^* = (\lambda I + X^\top X)^{-1} X^\top y$
• Model (linear model): $f(x) = a^\top x$
• Meta-parameters (regularization): λ
• Model parameters (slopes): a

Flow: the training data (inputs/targets) and the meta-parameters go into the regression algorithm, which outputs the model parameters; the resulting model is then evaluated on a novel input to produce an output (prediction).

Pages 104–107

Conclusion: Unified Model

$$f(x) = \sum_{e=1}^{E} \phi(x, \theta_e) \cdot \big(b_e + a_e^\top x\big) \qquad \text{Weighted sum of linear models} \quad (21)$$
$$f(x) = \sum_{e=1}^{E} \phi(x, \theta_e) \cdot w_e \qquad \text{Weighted sum of basis functions} \quad (22)$$

(22) is a special case of (21) with $a_e = 0$ and $b_e \equiv w_e$.

Freek Stulp and Olivier Sigaud (2015). Many regression algorithms, one unified model: A review. Neural Networks.
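A small sketch of evaluating the unified model, eq. (21), for illustrative basis-function parameters and local linear models; setting all a_e to zero recovers the basis-function form, eq. (22):

```python
import numpy as np

def unified_model(x, centers, width, a, b):
    """f(x) = sum_e phi_e(x) * (b_e + a_e * x), normalized Gaussian phi_e, 1-D input."""
    phi = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    phi /= phi.sum(axis=1, keepdims=True)                 # normalized weighting functions
    sub_models = b[None, :] + x[:, None] * a[None, :]     # (N, E) local linear predictions
    return np.sum(phi * sub_models, axis=1)

# Illustrative parameters
centers = np.linspace(0.0, 1.0, 5)
a = np.array([1.0, 0.5, 0.0, -0.5, -1.0])   # slopes of the local models
b = np.array([0.0, 0.4, 0.6, 0.4, 0.0])     # offsets of the local models
x = np.linspace(0.0, 1.0, 7)

print(unified_model(x, centers, width=0.2, a=a, b=b))            # eq. (21)
print(unified_model(x, centers, width=0.2, a=np.zeros(5), b=b))  # eq. (22): a_e = 0, b_e = w_e
```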

Page 108

Conclusion

Figure: Classification of regression algorithms, based only on the model used to represent the underlying function.

• Mixture of linear models (unified model): f(x) = Σ_e φ(x, θ_e) · (a_e^⊤ x + b_e), with sub-models a_e^⊤ x + b_e and weights φ(x, θ_e)
  • Gaussian BFs: GMR (Gaussian Mixture R.), LWR (Locally Weighted R.), RFWR (Receptive Field Weighted R.), LWPR (Locally Weighted Projection R.)
  • Any BFs: XCSF
  • Box-car BFs: M5 (Model Trees)
  • E = 1: linear model f(x) = a^⊤ x + b, i.e. LLS (Linear Least Squares) and Weighted Linear Least Squares
• Weighted sum of basis functions (a = 0): f(x) = Σ_e φ(x, θ_e) · b_e, with sub-models φ(x, θ_e) and weights b_e
  • Radial φ(‖x − c‖): RBFN (Radial Basis Function Network)
  • Kernel φ(x, x′): KRR (Kernel Ridge R.), GPR (Gaussian Process R.)
  • Cosine BFs: iRFRLS, I-SSGPR
  • Box-car BFs: CART (Regression Trees)
  • Sigmoid φ(⟨x, w⟩): ELM (Extreme Learning Machine), Backpropagation

Page 109

Many toolkits available

• Python
  • scikit-learn: http://scikit-learn.org
  • StatsModels: http://www.statsmodels.org/
  • PbDlib: http://calinon.ch/codes.htm
  • dmpbbo: https://github.com/stulp/dmpbbo
• Matlab
  • curvefit: https://www.mathworks.com/help/curvefit/linear-and-nonlinear-regression.html
  • PbDlib: http://calinon.ch/codes.htm
• C++
  • PbDlib: http://calinon.ch/codes.htm
  • dmpbbo: https://github.com/stulp/dmpbbo
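For example, a ridge fit with scikit-learn (listed above) takes only a few lines; the data and the alpha value are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
X = rng.uniform(0, 5, size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.1, size=100)

model = Ridge(alpha=1.0)        # alpha is the regularization strength (lambda)
model.fit(X, y)
print(model.coef_, model.intercept_)
print(model.predict([[1.0, 2.0]]))
```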

Page 110

Personal Favourites

Gaussian process regression
+ Very few assumptions
+ Meta-parameters estimated from the data itself
+ Also estimates variance
+ Works in high dimensions
- Training/query times increase with the amount of data
- Not easy to make incremental

Gaussian mixture regression
+ Also estimates variance
+ Algorithm is inherently incremental
+ Some meta-parameters, but easy to tune
+ Fast training times
- Training only stable for low input dimensions

Locally Weighted Regressions
+ Fast query times, fast training
+ Few meta-parameters, and easy to set
+ Stable learning results (batch)
- Not incremental
- No variance estimate

Deep Learning
+ Automatic extraction of (hierarchical) features

Page 111

Conclusion

• Don’t think of these regression algorithms as unique, unrelated methods
• They are similar algorithms that use different subsets of algorithmic features
• All these models are essentially shallow neural networks with different basis functions

Thank you for your attention!

Page 113


Appendix

Regression Tutorial – Freek Stulp

Page 114

Gaussian Process Regression

“Given a Gaussian process on some topological space T, with a continuous covariance kernel C(·, ·) : T × T → R, we can associate a Hilbert space, which is the reproducing kernel Hilbert space of real-valued functions on T, with C as kernel function.”

Instead of screaming, let’s talk about what it means to be smooth.

Page 117

Gaussian Process Regression

• Points that are close in the input space should be close in the output space.
  • Cities that are close geographically have similar temperatures (on average)
  • Taller people have larger shoe sizes (on average)
  • Shoe size covaries with height

Page 118

Gaussian Process Regression – Covariance Function

covariance function → covariance matrix (Gram matrix):

K(X,X) =
         xAug  xMuc  xWar  xMin  xMos
  xAug   1.00  0.96  0.42  0.02  0.00
  xMuc   0.96  1.00  0.59  0.04  0.00
  xWar   0.42  0.59  1.00  0.32  0.10
  xMin   0.02  0.04  0.32  1.00  0.80
  xMos   0.00  0.00  0.10  0.80  1.00

• Remarks
  • Basis function has very specific interpretation: covariance
  • No temperature measurements y have been made yet
  • Prior: assume temperature is 0 °C

Question
Expected temperature in Munich, given 9 °C in Augsburg?
(condition on TAug = 9, i.e. E[TMuc | TAug = 9])
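The question above can be answered by conditioning the joint Gaussian prior on the Augsburg measurement; the NumPy sketch below does this with the Augsburg/Munich block of the Gram matrix, assuming (as on the slide) a 0 °C prior mean and the unit prior variances from the matrix.

    import numpy as np

    # Augsburg/Munich block of the Gram matrix from the slide.
    K = np.array([[1.00, 0.96],   # rows/cols: [Augsburg, Munich]
                  [0.96, 1.00]])

    # Zero-mean Gaussian prior over the two temperatures, covariance K.
    # Conditioning on the Augsburg measurement T_Aug = 9:
    #   E[T_Muc | T_Aug]   = K[Muc,Aug] / K[Aug,Aug] * T_Aug
    #   Var[T_Muc | T_Aug] = K[Muc,Muc] - K[Muc,Aug]^2 / K[Aug,Aug]
    t_aug = 9.0
    mean_muc = K[1, 0] / K[0, 0] * t_aug
    var_muc = K[1, 1] - K[1, 0] ** 2 / K[0, 0]
    print(mean_muc)   # 8.64 (expected temperature in Munich, in °C)
    print(var_muc)    # about 0.078 (posterior variance)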

Page 124

Gaussian Process Regression – Example

k(xMuc, xAug) = 0.96    k(xWar, xAug) = 0.42    k(xMos, xAug) = 0.00

What are the plane slopes?

y_q = k(x_q, X) · K(X,X)⁻¹ y    (23)

(k(x_q, X): the kernel values above; K(X,X)⁻¹ y: least squares!)

Page 142

Gaussian Process Regression – Example

k(xMuc, [xAug xMin]) = [0.96 0.04]
k(xWar, [xAug xMin]) = [0.42 0.32]
k(xMos, [xAug xMin]) = [0.00 0.80]

What are the plane slopes?

y_q = k(x_q, X) · K(X,X)⁻¹ y    (23)

K(X,X) =
         xAug  xMin
  xAug   1.00  0.02
  xMin   0.02  1.00

Page 144

Gaussian Process Regression – Example

k(xMuc, [xAug xMin]) = [0.96 0.04]
k(xWar, [xAug xMin]) = [0.42 0.32]
k(xMos, [xAug xMin]) = [0.00 0.80]

What are the plane slopes?

y_q = k(x_q, X) · K(X,X)⁻¹ y    (23)

Kernel Regression

f(x) = ∑_{n=1}^{N} w_n · k(x, x_n)

w* = K(X,X)⁻¹ y
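A minimal NumPy sketch of equation (23) as kernel regression, reusing the 2×2 Gram matrix over Augsburg and Minsk and the kernel vectors from the slides; the two observed temperatures are made-up values so the example runs end to end.

    import numpy as np

    # Gram matrix over the measurement locations X = [Augsburg, Minsk] (from the slide).
    K = np.array([[1.00, 0.02],
                  [0.02, 1.00]])

    # Observed temperatures at Augsburg and Minsk (assumed values, for illustration only).
    y = np.array([9.0, -5.0])

    # Kernel regression weights, the "least squares" factor of Eq. (23): w* = K(X,X)^-1 y
    w = np.linalg.solve(K, y)

    # Kernel values k(x_q, X) for three query cities (from the slides).
    k_query = {
        "Munich": np.array([0.96, 0.04]),
        "Warsaw": np.array([0.42, 0.32]),
        "Moscow": np.array([0.00, 0.80]),
    }

    # Prediction, Eq. (23): y_q = k(x_q, X) . w*
    for city, k_q in k_query.items():
        print(city, k_q @ w)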

Page 147

Gaussian Process Regression – Example

The more measurements become available, the more certain we become.

What are the plane slopes?

y_q = k(x_q, X) · K(X,X)⁻¹ y    (23)
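To make this concrete, the sketch below evaluates the standard GP predictive variance σ²(x_q) = k(x_q, x_q) − k(x_q, X) K(X,X)⁻¹ k(X, x_q) for Warsaw, once with only the Augsburg measurement and once with Augsburg and Minsk; the kernel values come from the slides, while the predictive-variance formula itself is the standard GPR expression, not shown explicitly on these pages. The shrinking variance illustrates the statement above.

    import numpy as np

    def predictive_variance(k_qq, k_qX, K):
        # Standard GP predictive variance: k(x_q,x_q) - k(x_q,X) K(X,X)^-1 k(X,x_q)
        k_qX = np.atleast_1d(np.asarray(k_qX, dtype=float))
        K = np.atleast_2d(np.asarray(K, dtype=float))
        return k_qq - k_qX @ np.linalg.solve(K, k_qX)

    # Query point: Warsaw. Kernel values taken from the slides.
    k_war_war = 1.00
    k_war_aug = 0.42
    k_war_min = 0.32

    # One measurement (Augsburg only):
    var_one = predictive_variance(k_war_war, [k_war_aug], [[1.00]])

    # Two measurements (Augsburg and Minsk):
    K_two = [[1.00, 0.02],
             [0.02, 1.00]]
    var_two = predictive_variance(k_war_war, [k_war_aug, k_war_min], K_two)

    print(var_one)  # about 0.82
    print(var_two)  # about 0.73 -- smaller: more measurements, more certainty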
