
Page 1:

Deep Learning, Neural Networks and Kernel Machines

Johan Suykens

KU Leuven, ESAT-STADIUS
Kasteelpark Arenberg 10, B-3001 Leuven, Belgium

Email: [email protected]
http://www.esat.kuleuven.be/stadius/

DeepLearn 2019, Warsaw, Poland, July 2019

Page 2:

Part II: RBMs, kernel machines and deep learning

• Restricted Boltzmann Machines (RBM)

• Deep Boltzmann Machines (Deep BM)

• Restricted Kernel Machines (RKM)

• Deep RKM (see Part III)

• Generative RKM

Page 3:

Generative models: RBM, GAN and deep learning

Page 4:

Restricted Boltzmann Machines (RBM)

• Markov random field, bipartite graph, stochastic binary units: a layer of visible units v and a layer of hidden units h, with no hidden-to-hidden connections

• Energy:

E(v, h; θ) = −v^T W h − b^T v − a^T h, with θ = {W, b, a}

Joint distribution:

P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ))

with partition function Z(θ) = ∑_{v,h} exp(−E(v, h; θ))

[Hinton, Osindero, Teh, Neural Computation 2006]

Page 5:

RBM and deep learning

[Figure: RBM with joint distribution p(v, h) versus deep model with joint distribution p(v, h1, h2, h3, ...)]

[Hinton et al., 2006; Salakhutdinov, 2015]

Page 6:

Convolutional Deep Belief Networks

Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks [Lee et al., 2011]

Page 7:

Energy function

• RBM:

E = −v^T W h

• Deep Boltzmann machine (two layers):

E = −v^T W^1 h^1 − (h^1)^T W^2 h^2

• Deep Boltzmann machine (three layers):

E = −v^T W^1 h^1 − (h^1)^T W^2 h^2 − (h^2)^T W^3 h^3

Page 8:

RBM: example on MNIST

MNIST training data:

Generating new images:

source: https://www.kaggle.com/nicw102168/restricted-boltzmann-machine-rbm-on-mnist

Page 9:

RBM training (1)

Thanks to the special bipartite structure, explicit marginalization is possible:

P(v; θ) = (1/Z(θ)) ∑_h exp(−E(v, h; θ)) = (1/Z(θ)) exp(b^T v) ∏_j (1 + exp(a_j + ∑_i W_ij v_i))

with v_i ∈ {0, 1}, h_j ∈ {0, 1}.

Conditional distributions:

P(h|v; θ) = ∏_j p(h_j|v) with p(h_j = 1|v) = σ(∑_i W_ij v_i + a_j)

and

P(v|h; θ) = ∏_i p(v_i|h) with p(v_i = 1|h) = σ(∑_j W_ij h_j + b_i)

with σ the sigmoid activation.
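As an illustrative aside (not on the original slide), these two factorized conditionals map directly onto a few lines of NumPy; the weight matrix W of shape (n_visible, n_hidden), the bias vectors a, b and the helper names are assumptions for this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_h_given_v(v, W, a, rng):
    """Sample h from p(h_j = 1 | v) = sigma(sum_i W_ij v_i + a_j)."""
    p = sigmoid(v @ W + a)                      # shape (n_hidden,)
    return (rng.random(p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b, rng):
    """Sample v from p(v_i = 1 | h) = sigma(sum_j W_ij h_j + b_i)."""
    p = sigmoid(h @ W.T + b)                    # shape (n_visible,)
    return (rng.random(p.shape) < p).astype(float), p
```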

Page 10:

RBM training (2)

Given observations {v_n}_{n=1}^N, the derivatives of the log-likelihood are

(1/N) ∑_n ∂ log P(v_n; θ) / ∂W_ij = E_{Pdata}[v_i h_j] − E_{Pmodel}[v_i h_j]

(1/N) ∑_n ∂ log P(v_n; θ) / ∂a_j = E_{Pdata}[h_j] − E_{Pmodel}[h_j]

(1/N) ∑_n ∂ log P(v_n; θ) / ∂b_i = E_{Pdata}[v_i] − E_{Pmodel}[v_i]

with

• Data-dependent expectation E_{Pdata}[·] (a form of Hebbian learning): an expectation with respect to the data distribution Pdata(h, v; θ) = P(h|v; θ) Pdata(v), with Pdata(v) = (1/N) ∑_n δ(v − v_n) the empirical distribution.

• Model's expectation E_{Pmodel}[·] (unlearning): an expectation with respect to the distribution defined by the model, P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ)).

Page 11:

RBM training (3)

Exact maximum likelihood learning is intractable (due to the computation of E_{Pmodel}[·]). In practice, the Contrastive Divergence (CD) algorithm is used [Hinton, 2002]:

ΔW = α (E_{Pdata}[v h^T] − E_{P_T}[v h^T])

with α the learning rate and P_T a distribution defined by running a Gibbs chain, initialized at the data, for T full steps (often T = 1 in practice, i.e. CD-1).

CD-1 scheme:

1. Start the Gibbs sampler at v^(1) := v_n and generate h^(1) ∼ P(h|v^(1))

2. After obtaining h^(1), generate v^(2) ∼ P(v|h^(1)) (called fantasy data)

3. After obtaining v^(2), generate h^(2) ∼ P(h|v^(2))

with ΔW ∝ v_n (h^(1))^T − v^(2) (h^(2))^T
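A minimal NumPy sketch of one CD-1 update following the scheme above, assuming binary units; the learning rate alpha and the common practice of using the probabilities p(h^(2)|v^(2)) rather than samples in the negative statistics are illustrative choices, not prescriptions from the slide:

```python
import numpy as np

def cd1_update(v_data, W, a, b, alpha, rng):
    """One CD-1 step for a single binary visible vector v_data.

    Positive phase uses the data; negative phase uses one full Gibbs
    step (the 'fantasy' sample). Returns updated (W, a, b).
    """
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    # positive phase: h^(1) ~ P(h | v_data)
    p_h1 = sig(v_data @ W + a)
    h1 = (rng.random(p_h1.shape) < p_h1).astype(float)
    # negative phase: v^(2) ~ P(v | h^(1)), then h^(2) probabilities
    p_v2 = sig(h1 @ W.T + b)
    v2 = (rng.random(p_v2.shape) < p_v2).astype(float)
    p_h2 = sig(v2 @ W + a)
    # gradient estimate: <v h^T>_data - <v h^T>_model
    W = W + alpha * (np.outer(v_data, p_h1) - np.outer(v2, p_h2))
    a = a + alpha * (p_h1 - p_h2)
    b = b + alpha * (v_data - v2)
    return W, a, b
```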

Page 12:

Deep Boltzmann machine training (1)

Consider a 3-layer Deep BM with energy function [Salakhutdinov, 2015]:

E(v, h^1, h^2, h^3; θ) = −v^T W^1 h^1 − (h^1)^T W^2 h^2 − (h^2)^T W^3 h^3

with unknown model parameters θ = {W^1, W^2, W^3}.

The model assigns the following probability to a visible vector v:

P(v; θ) = (1/Z(θ)) ∑_{h^1,h^2,h^3} exp(−E(v, h^1, h^2, h^3; θ))

Page 13:

Deep Boltzmann machine training (2)

For training:

∂ log P(v; θ) / ∂W^1 = E_{Pdata}[v (h^1)^T] − E_{Pmodel}[v (h^1)^T]

∂ log P(v; θ) / ∂W^2 = E_{Pdata}[h^1 (h^2)^T] − E_{Pmodel}[h^1 (h^2)^T]

∂ log P(v; θ) / ∂W^3 = E_{Pdata}[h^2 (h^3)^T] − E_{Pmodel}[h^2 (h^3)^T]

Problem: the conditional distribution over the states of the hidden variables, conditioned on the data, is no longer factorial. For simplicity and speed, one can assume and impose a fully factorized distribution, corresponding to a naive mean-field approximation [Salakhutdinov, 2015].

Page 14:

Multimodal Deep Boltzmann Machine

From [Srivastava & Salakhutdinov 2014]

Page 15:

Generative Adversarial Network (GAN)

Generative Adversarial Network (GAN) [Goodfellow et al., 2014]
Training of two competing models in a zero-sum game:

(Generator) generates fake output examples from random noise
(Discriminator) discriminates between fake examples and real examples.

source: https://deeplearning4j.org/generative-adversarial-network

Page 16:

GAN: example on MNIST

MNIST training data:

GAN generated examples:

source: https://www.kdnuggets.com/2016/07/mnist-generative-adversarial-model-keras.html

Page 17:

Kernel methods and deep learning

Page 18:

Kernel machines & deep learning

previous approaches:

• kernels for deep learning [Cho & Saul, 2009]

• mathematics of the neural response [Smale et al., 2010]

• deep Gaussian processes [Damianou & Lawrence, 2013]

• convolutional kernel networks [Mairal et al., 2014]

• multi-layer support vector machines [Wiering & Schomaker, 2014]

• other

Page 20:

Kernel machines & deep learning: New Challenges

• new synergies and new foundations between support vector machines & kernel methods and deep learning architectures?

• possible to extend primal and dual model representations (as occurring in SVM and LS-SVM models) from shallow to deep architectures?

• possible to handle deep feedforward neural networks and deep kernel machines within a common setting?

→ new framework:

"Deep Restricted Kernel Machines" [Suykens, Neural Computation, 2017]
https://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_00984

Page 21:

Restricted Kernel Machines

Page 22:

Restricted Kernel Machines (RKM)

Main characteristics:

• Kernel machine interpretations in terms of visible and hidden units (similar to Restricted Boltzmann Machines (RBM))

• Restricted Kernel Machine (RKM) representations for

– LS-SVM regression/classification
– Kernel PCA
– Matrix SVD
– Parzen-type models
– other

• Based on the principle of conjugate feature duality (with hidden features corresponding to dual variables)

Page 23:

LS-SVM regression model: classical approach

LS-SVM regression model, given input & output data x_i ∈ R^d, y_i ∈ R:

min_{w,b,e_i} (1/2) w^T w + (γ/2) ∑_{i=1}^N e_i^2

subject to y_i = w^T ϕ(x_i) + b + e_i, i = 1, ..., N.

Solution in the Lagrange multipliers α_i:

[K + I/γ, 1_N; 1_N^T, 0] [α; b] = [y_{1:N}; 0]

with K_ij = K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j), y_{1:N} = [y_1; ...; y_N],

and y(x) = ∑_i α_i K(x, x_i) + b.

→ How to achieve a representation with visible and hidden units?
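As an illustrative aside (not on the slide), fitting and evaluating this model is a single linear solve; the RBF kernel choice and the values of γ and σ² below are assumptions:

```python
import numpy as np

def rbf_kernel(X, Z, sigma2=1.0):
    """K[i, j] = exp(-||x_i - z_j||^2 / (2 sigma2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma2))

def lssvm_fit(X, y, gamma=10.0, sigma2=1.0):
    """Solve [[K + I/gamma, 1], [1^T, 0]] [alpha; b] = [y; 0]."""
    N = X.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = rbf_kernel(X, X, sigma2) + np.eye(N) / gamma
    A[:N, N] = 1.0
    A[N, :N] = 1.0
    sol = np.linalg.solve(A, np.concatenate([y, [0.0]]))
    return sol[:N], sol[N]                      # alpha, b

def lssvm_predict(X_train, alpha, b, X_new, sigma2=1.0):
    """y(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + b
```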

Page 25:

Conjugate feature duality

Property. For λ > 0, the following quadratic form property holds:

(1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h,  ∀ e, h ∈ R^p

Proof: This is verified by writing the quadratic form as

(1/2) [e^T, h^T] [(1/λ)I, I; I, λI] [e; h] ≥ 0,  ∀ e, h ∈ R^p.

It is known that

Q = [A, B; B^T, C] ≥ 0

if and only if A > 0 and the Schur complement C − B^T A^{−1} B ≥ 0. This results in the condition (1/2)(λI − I(λI)I) ≥ 0, which holds (it equals zero).

Note. One has (1/(2λ)) e^T e = max_h (e^T h − (λ/2) h^T h), with the maximum attained at h = e/λ.
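A quick numeric sanity check of this property (a sketch; the dimension p and the value of λ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p = 0.7, 5
e, h = rng.standard_normal(p), rng.standard_normal(p)

lhs = e @ e / (2 * lam)                  # (1/(2*lam)) e^T e
rhs = e @ h - (lam / 2) * (h @ h)        # e^T h - (lam/2) h^T h
assert lhs >= rhs                        # the inequality holds

h_star = e / lam                         # maximizer of the right-hand side
rhs_max = e @ h_star - (lam / 2) * (h_star @ h_star)
assert np.isclose(lhs, rhs_max)          # equality is attained at h = e/lam
```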

Page 28:

Model: living in two worlds ...

Original model:

ŷ = W^T x + b,  e = y − ŷ

objective J = regularization term Tr(W^T W) + (1/(2λ)) error term ∑_i e_i^T e_i

↓  (1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h

New representation:

ŷ = ∑_j h_j x_j^T x + b

obtain J ≥ J(h_i, W, b); solution from the stationary points of J:

∂J/∂h_i = 0,  ∂J/∂W = 0,  ∂J/∂b = 0

Page 31:

Model: living in two worlds ...

Original model:

ŷ = W^T ϕ(x) + b,  e = y − ŷ

objective J = regularization term Tr(W^T W) + (1/(2λ)) error term ∑_i e_i^T e_i

↓  (1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h

New representation:

ŷ = ∑_j h_j K(x_j, x) + b

obtain J ≥ J(h_i, W, b); solution from the stationary points of J:

∂J/∂h_i = 0,  ∂J/∂W = 0,  ∂J/∂b = 0

Page 32:

Simplest example: line fitting

Given data: {(x_i, y_i)}_{i=1}^N, x_i, y_i ∈ R

[Figure: scatter plot of the data points in the (x, y) plane]

Linear model:
ŷ = wx + b,  e = y − ŷ

RKM representation:

ŷ = ∑_i h_i x_i x + b

3 visible units: v = [x; 1; −y]
1 hidden unit: h ∈ R

[Figure: RKM diagram connecting the visible units v to the hidden unit h]

Page 33:

From LS-SVM to the RKM representation

Multi-output model ŷ = W^T x + b,  e = y − ŷ

Objective in LS-SVM regression (linear case):

J = (η/2) Tr(W^T W) + (1/(2λ)) ∑_{i=1}^N e_i^T e_i   s.t. e_i = y_i − W^T x_i − b, ∀i

  ≥ ∑_{i=1}^N e_i^T h_i − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)   s.t. e_i = y_i − W^T x_i − b, ∀i

  = ∑_{i=1}^N (y_i^T − x_i^T W − b^T) h_i − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)  ≜ J(h_i, W, b)

  = R_RKM^train − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)

Page 36:

Connection between RKM and RBM

• RKM & RBM: interpretation in terms of visible and hidden units

• RKM: energy form as in RBM:

R_RKM^train = ∑_{i=1}^N R_RKM(v_i, h_i) = −∑_{i=1}^N (x_i^T W h_i + b^T h_i − y_i^T h_i) = ∑_{i=1}^N e_i^T h_i

with R_RKM(v, h) = −v^T W h = −(x^T W h + b^T h − y^T h) = e^T h.

• Conjugate feature duality: the hidden features h_i are conjugated to the e_i and serve as dual variables.

Page 37:

From LS-SVM to RKM representation (2)

• Stationary points of J(h_i, W, b) (nonlinear case, feature map ϕ(·)):

∂J/∂h_i = 0 ⇒ y_i = W^T ϕ(x_i) + b + λ h_i, ∀i

∂J/∂W = 0 ⇒ W = (1/η) ∑_i ϕ(x_i) h_i^T

∂J/∂b = 0 ⇒ ∑_i h_i = 0.

• Solution in h_i and b, with positive definite kernel K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j):

[(1/η)K + λI_N, 1_N; 1_N^T, 0] [H^T; b^T] = [Y^T; 0]

with K = [K(x_i, x_j)], H = [h_1 ... h_N], Y = [y_1 ... y_N].
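A brief NumPy sketch of solving this system for multi-output data (the matrix layout and the values of η, λ are illustrative assumptions); new points are then evaluated through the dual form ŷ(x) = (1/η) ∑_j h_j K(x_j, x) + b:

```python
import numpy as np

def rkm_lssvm_fit(K, Y, eta=1.0, lam=0.1):
    """Solve [[(1/eta)K + lam*I, 1_N], [1_N^T, 0]] [H^T; b^T] = [Y^T; 0].

    K: (N, N) kernel matrix; Y: (p, N) targets, one column per sample.
    Returns the hidden features H (p, N) and the bias b (p,).
    """
    N, p = K.shape[0], Y.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = K / eta + lam * np.eye(N)
    A[:N, N] = 1.0
    A[N, :N] = 1.0
    sol = np.linalg.solve(A, np.vstack([Y.T, np.zeros((1, p))]))
    return sol[:N].T, sol[N]                    # H, b
```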

Page 38:

From LS-SVM to RKM representation (3)

[Figure: RKM diagram with visible units [ϕ(x); 1; −y], error e and hidden features h]

Note: ϕ(x) can be multi-layered; visible units: [ϕ(x); 1; −y]

Conjugate feature duality: the model M admits primal and dual representations:

(P)_RKM : ŷ = W^T ϕ(x) + b

(D)_RKM : ŷ = (1/η) ∑_j h_j K(x_j, x) + b.

(large N, small d) versus (large d, small N)

Page 39:

Kernel principal component analysis (KPCA)

[Figure: toy data with components found by linear PCA (left) versus kernel PCA with RBF kernel (right)]

Kernel PCA [Scholkopf et al., 1998]: take the eigenvalue decomposition of the kernel matrix [K(x_i, x_j)], i, j = 1, ..., N

(applications in dimensionality reduction and denoising)

Page 40:

Kernel PCA: classical LS-SVM approach

• Primal problem [Suykens et al., 2002]:

min_{w,b,e} (1/2) w^T w − (γ/2) ∑_{i=1}^N e_i^2   s.t. e_i = w^T ϕ(x_i) + b, i = 1, ..., N.

• Dual problem corresponds to kernel PCA:

Ω^(c) α = λ α  with λ = 1/γ

with Ω^(c)_ij = (ϕ(x_i) − μ_ϕ)^T (ϕ(x_j) − μ_ϕ) the centered kernel matrix and μ_ϕ = (1/N) ∑_{i=1}^N ϕ(x_i).

• Interpretation:
1. pool of candidate components (objective function equals zero)
2. select relevant components

Page 41:

From KPCA to RKM representation

Model:

e = W^T ϕ(x)

objective J = regularization term Tr(W^T W) − (1/(2λ)) variance term ∑_i e_i^T e_i

↓  −(1/(2λ)) e^T e ≤ −e^T h + (λ/2) h^T h

RKM representation:

e = ∑_j h_j K(x_j, x)

obtain J ≤ J(h_i, W); solution from the stationary points of J:

∂J/∂h_i = 0,  ∂J/∂W = 0

Page 42:

From KPCA to RKM representation (2)

• Objective:

J = (η/2) Tr(W^T W) − (1/(2λ)) ∑_{i=1}^N e_i^T e_i   s.t. e_i = W^T ϕ(x_i), ∀i

  ≤ −∑_{i=1}^N e_i^T h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)   s.t. e_i = W^T ϕ(x_i), ∀i

  = −∑_{i=1}^N ϕ(x_i)^T W h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)  ≜ J

• Stationary points of J(h_i, W):

∂J/∂h_i = 0 ⇒ W^T ϕ(x_i) = λ h_i, ∀i

∂J/∂W = 0 ⇒ W = (1/η) ∑_i ϕ(x_i) h_i^T

Page 43:

From KPCA to RKM representation (3)

• Elimination of W gives the eigenvalue decomposition

(1/η) K H^T = H^T Λ

where H = [h_1 ... h_N] ∈ R^{s×N} and Λ = diag{λ_1, ..., λ_s} with s ≤ N.

• Primal and dual model representations:

(P)_RKM : e = W^T ϕ(x)

(D)_RKM : e = (1/η) ∑_j h_j K(x_j, x).
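A small sketch of this eigendecomposition step (η and the number of components s are illustrative parameters):

```python
import numpy as np

def rkm_kpca(K, s, eta=1.0):
    """Solve (1/eta) K H^T = H^T Lambda for the top-s components.

    K: (N, N) symmetric positive (semi)definite kernel matrix.
    Returns H (s, N) and the eigenvalues lambda_1, ..., lambda_s.
    """
    evals, evecs = np.linalg.eigh(K / eta)      # ascending order
    idx = np.argsort(evals)[::-1][:s]           # top-s eigenvalues
    return evecs[:, idx].T, evals[idx]          # rows of H are components
```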

Page 44:

Singular value decomposition

• Objective: given x_i, z_j row and column data of a (non-square) matrix:

J = −(η/2) Tr(V^T W) + (1/(2λ)) ∑_{i=1}^N e_i^T e_i + (1/(2λ)) ∑_{j=1}^M r_j^T r_j   s.t. e_i = W^T ϕ(x_i), ∀i and r_j = V^T ψ(z_j), ∀j

• primal and dual representations (relates to non-symmetric kernels):

(P)_RKM : e = W^T ϕ(x),  r = V^T ψ(z)

(D)_RKM : e = (1/η) ∑_j h_j^r ψ(z_j)^T ϕ(x),  r = (1/η) ∑_i h_i^e ϕ(x_i)^T ψ(z)

Page 45:

Kernel probability mass function estimation

• Objective:

J = ∑_{i=1}^N (p_i − ϕ(x_i)^T w) h_i − ∑_{i=1}^N p_i + (η/2) w^T w

• primal and dual representations:

(P)_RKM : p_i = w^T ϕ(x_i)

(D)_RKM : p_i = (1/η) ∑_j K(x_j, x_i)

Page 46:

Deep Restricted Kernel Machines

Page 47:

Deep RKM: example

[Figure: deep RKM chain with input x, feature maps ϕ1(x), ϕ2(h^(1)), ϕ3(h^(2)), errors e^(1), e^(2), e^(3), hidden features h^(1), h^(2), h^(3), and output y]

Deep RKM: KPCA + KPCA + LS-SVM

Coupling of RKMs by taking the sum of the objectives:

J_deep = J_1 + J_2 + J_3

Page 48:

Generative kernel PCA

Page 49:

RKM objective for training and generating (1)

• RBM energy function

E(v, h; θ) = −v^T W h − c^T v − a^T h

with model parameters θ = {W, c, a}

• RKM objective function

J(v, h, W) = −v^T W h + (λ/2) h^T h + (1/2) v^T v + (η/2) Tr(W^T W)

Training: clamp v → J_train(h, W)
Generating: clamp h, W → J_gen(v)

[Schreurs & Suykens, ESANN 2018]

Page 50:

RKM objective for training and generating (2)

• Training: (clamp v)

J_train(h_i, W) = −∑_{i=1}^N v_i^T W h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)

Stationary points:

∂J_train/∂h_i = 0 ⇒ W^T v_i = λ h_i, ∀i

∂J_train/∂W = 0 ⇒ W = (1/η) ∑_{i=1}^N v_i h_i^T

Elimination of W:

(1/η) K H^T = H^T Δ,

where H = [h_1, ..., h_N] ∈ R^{s×N}, Δ = diag{λ_1, ..., λ_s} with s ≤ N the number of selected components, and K_ij = v_i^T v_j the kernel matrix elements.

Page 51:

RKM objective for training and generating (3)

• Generating: (clamp h, W)

Estimate the distribution p(h) from the h_i, i = 1, ..., N (or assume it normal). Obtain a new value h⋆ and generate v⋆ from

J_gen(v⋆) = −(v⋆)^T W h⋆ + (1/2) (v⋆)^T v⋆

Stationary points:

∂J_gen/∂v⋆ = 0

This gives v⋆ = W h⋆
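Putting the two previous slides together, a sketch of the full train-then-generate loop for the linear-kernel case; fitting a normal distribution on the hidden features follows the slide's "assumed normal" option, and all parameter values are illustrative:

```python
import numpy as np

def train_gen_rkm(V, s, eta=1.0, n_samples=5, seed=0):
    """V: (d, N) training data as columns. Returns (d, n_samples) generated samples."""
    rng = np.random.default_rng(seed)
    K = V.T @ V                                  # linear kernel, K_ij = v_i^T v_j
    evals, evecs = np.linalg.eigh(K / eta)
    idx = np.argsort(evals)[::-1][:s]
    H = evecs[:, idx].T                          # hidden features, (s, N)
    W = (V @ H.T) / eta                          # W = (1/eta) sum_i v_i h_i^T
    # fit a normal distribution on the hidden features, sample h*
    mu, cov = H.mean(axis=1), np.atleast_2d(np.cov(H))
    H_star = rng.multivariate_normal(mu, cov, size=n_samples).T
    return W @ H_star                            # v* = W h*
```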

Page 52:

Dimensionality reduction and denoising: linear case

• Given training data v_i = x_i with X ∈ R^{d×N}, obtain hidden features H ∈ R^{s×N}:

X̂ = W H = ((1/η) ∑_{i=1}^N x_i h_i^T) H = (1/η) X H^T H

• Reconstruction error: ‖X − X̂‖²

[Figure: encoder-decoder view x_i → G(·) → h_i → F(·) → x̂_i]

Page 53:

Dimensionality reduction and denoising: nonlinear case (1)

• A new datapoint x⋆ is generated from h⋆ by

ϕ(x⋆) = W h⋆ = ((1/η) ∑_{i=1}^N ϕ(x_i) h_i^T) h⋆

• Multiplying both sides by ϕ(x_j)^T gives:

K(x_j, x⋆) = (1/η) (∑_{i=1}^N K(x_j, x_i) h_i^T) h⋆

On training data:

Ω = (1/η) Ω H^T H

with H ∈ R^{s×N}, Ω_ij = K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j).

Page 54:

Dimensionality reduction and denoising: nonlinear case (2)

• Estimated value x̂ for x⋆ by the kernel smoother:

x̂ = (∑_{j=1}^S K(x_j, x⋆) x_j) / (∑_{j=1}^S K(x_j, x⋆))

with K(x_j, x⋆) (e.g. RBF kernel) the scaled similarity between 0 and 1, and a design parameter S ≤ N (the S closest points based on the similarity K(x_j, x⋆)).

[Schreurs & Suykens, ESANN 2018]
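A sketch of this kernel-smoother step (the RBF similarities are assumed precomputed; S is the design parameter from the slide):

```python
import numpy as np

def kernel_smoother_preimage(X, k_star, S=10):
    """X: (d, N) training points; k_star: (N,) similarities K(x_j, x*) in [0, 1].

    Returns the weighted average of the S most similar training points.
    """
    idx = np.argsort(k_star)[::-1][:S]          # S closest points by similarity
    w = k_star[idx]
    return (X[:, idx] @ w) / w.sum()            # x_hat = sum w_j x_j / sum w_j
```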

Page 55:

Example: denoising

Synthetic data sets:

[Figure: two noisy 2D synthetic data sets together with their denoised reconstructions]

X ∈ R^{2×500} (d = 2, N = 500)

Kernel PCA using RBF kernel with σ² = 1 (left: s = 2; right: s = 8)
Kernel smoother: S = 100 closest points, σ² = 0.2

Page 56:

Example: generating new data

From MNIST data:

Training data: 50 images per digit; Kernel PCA (left: s = 20; right: s = 50)
Normal distribution fitted on the h_i, used to generate h⋆

Kernel smoother: (left) S = 10 (digits 0); (right) S = 100 (digits 0,8)

[Schreurs & Suykens, ESANN 2018]

Page 57:

Towards explainable AI

Understanding the role of the hidden units:

[Figure: hidden-feature space (H1, H2) for digits 0, 1 and 6, with images generated by varying one hidden unit at a time, h(1,1), h(1,2) and h(1,3), over a range of roughly −0.12 to 0.12]

[figures by Joachim Schreurs]

Page 58:

Tensor-based RKM for Multi-view KPCA

min ⟨W, W⟩ − ∑_{i=1}^N ⟨Φ(i), W⟩ h_i + λ ∑_{i=1}^N h_i²   with Φ(i) = ϕ^[1](x_i^[1]) ⊗ ... ⊗ ϕ^[V](x_i^[V])

[Houthuys & Suykens, ICANN 2018]

Page 59:

Generative RKM (1)

The objective

J_train(h_i, V, U) = ∑_{i=1}^N (−ϕ1(x_i)^T V h_i − ϕ2(y_i)^T U h_i + (λ/2) h_i^T h_i) + (η1/2) Tr(V^T V) + (η2/2) Tr(U^T U)

results for training in the eigenvalue problem

((1/η1) K1 + (1/η2) K2) H^T = H^T Λ

with H = [h_1 ... h_N] and kernel matrices K1, K2 related to ϕ1, ϕ2.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

Page 60:

Generative RKM (2)

Generating data is based on a newly generated h⋆ and the objective

J_generate(ϕ1(x⋆), ϕ2(y⋆)) = −ϕ1(x⋆)^T V h⋆ − ϕ2(y⋆)^T U h⋆ + (1/2) ϕ1(x⋆)^T ϕ1(x⋆) + (1/2) ϕ2(y⋆)^T ϕ2(y⋆)

giving

ϕ1(x⋆) = (1/η1) ∑_{i=1}^N ϕ1(x_i) h_i^T h⋆,  ϕ2(y⋆) = (1/η2) ∑_{i=1}^N ϕ2(y_i) h_i^T h⋆.

For generating x, y one can either work with the kernel smoother or work with an explicit feature map using a feedforward neural network.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

Page 61:

Generative RKM (3)

Train: [schematic of the training phase]

Generate: [schematic of the generation phase]

Page 62:

Generative RKM (4)

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

Page 63:

Generative RKM (5)

Figure: Image generation using neural networks as feature map:

(left) MNIST; (right) Small-NORB

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

Page 64:

Generative RKM (6)

Figure: Targeted image generation through corresponding latent variable.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]

Page 65:

Conclusions

• From RBM to deep BM

• From RKM to deep RKM

• RKM and RBM representation: visible and hidden units

• RKM representation for LS-SVM, KPCA, SVD and others

• RKM representation obtained by conjugate feature duality

• Generative RKM

Page 66:

Acknowledgements (1)

• Current and former co-workers at ESAT-STADIUS:

C. Alzate, Y. Chen, J. De Brabanter, K. De Brabanter, L. De Lathauwer, H. De Meulemeester, B. De Moor, H. De Plaen, Ph. Dreesen, M. Espinoza, T. Falck, M. Fanuel, Y. Feng, B. Gauthier, X. Huang, L. Houthuys, V. Jumutc, Z. Karevan, R. Langone, R. Mall, S. Mehrkanoon, G. Nisol, M. Orchel, A. Pandey, K. Pelckmans, S. RoyChowdhury, S. Salzo, J. Schreurs, M. Signoretto, Q. Tao, J. Vandewalle, T. Van Gestel, S. Van Huffel, C. Varon, Y. Yang, and others

• Many other people for joint work, discussions, invitations, organizations

• Support from ERC AdG E-DUALITY, ERC AdG A-DATADRIVE-B, KU Leuven, OPTEC, IUAP DYSCO, FWO projects, IWT, iMinds, BIL, COST

Page 67:

Acknowledgements (2)

Page 68:

Acknowledgements (3)

NEW: ERC Advanced Grant E-DUALITY
Exploring duality for future data-driven modelling

Page 69:

Thank you
