Deep Learning, Neural Networks and Kernel
Machines
Johan Suykens
KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Email: [email protected]
https://www.esat.kuleuven.be/stadius/
Deeplearn 2019, Warsaw Poland, July 2019
Part II: RBMs, kernel machines and deep learning
• Restricted Boltzmann Machines (RBM)
• Deep Boltzmann Machines (Deep BM)
• Restricted Kernel Machines (RKM)
• Deep RKM (see Part III)
• Generative RKM
1
Generative models: RBM, GAN and deep learning
1
Restricted Boltzmann Machines (RBM)
• Markov random field, bipartite graph, stochastic binary units
• Layer of visible units v and layer of hidden units h
• No hidden-to-hidden connections
• Energy:
E(v, h; θ) = −v^T W h − b^T v − a^T h with θ = {W, b, a}
• Joint distribution:
P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ))
with partition function Z(θ) = ∑_v ∑_h exp(−E(v, h; θ))
[Hinton, Osindero, Teh, Neural Computation 2006]
2
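As a small illustration (not part of the slides), the energy and the unnormalized joint probability can be coded directly; the sizes and parameter values below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 4                      # number of visible / hidden units (arbitrary)
W = 0.1 * rng.standard_normal((d, m))
b = np.zeros(d)                  # visible biases
a = np.zeros(m)                  # hidden biases

def energy(v, h):
    """E(v, h) = -v^T W h - b^T v - a^T h."""
    return -v @ W @ h - b @ v - a @ h

def unnormalized_joint(v, h):
    """exp(-E(v, h)); dividing by Z(theta) would give P(v, h)."""
    return np.exp(-energy(v, h))

v = rng.integers(0, 2, size=d)   # a binary visible configuration
h = rng.integers(0, 2, size=m)   # a binary hidden configuration
print(energy(v, h), unnormalized_joint(v, h))
```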
RBM and deep learning
RBM
p(v, h) (single RBM) versus p(v, h^1, h^2, h^3, ...) (RBMs stacked into a deep model)
[Hinton et al., 2006; Salakhutdinov, 2015]
3
Convolutional Deep Belief Networks
Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks [Lee et al., 2011]
4
Energy function
• RBM:
E = −v^T W h
• Deep Boltzmann machine (two layers):
E = −v^T W^1 h^1 − (h^1)^T W^2 h^2
• Deep Boltzmann machine (three layers):
E = −v^T W^1 h^1 − (h^1)^T W^2 h^2 − (h^2)^T W^3 h^3
5
RBM: example on MNIST
MNIST training data:
Generating new images:
source: https://www.kaggle.com/nicw102168/restricted-boltzmann-machine-rbm-on-mnist
6
RBM training (1)
Thanks to the special bipartite structure, explicit marginalization is possible:
P(v; θ) = (1/Z(θ)) ∑_h exp(−E(v, h; θ)) = (1/Z(θ)) exp(b^T v) ∏_j (1 + exp(a_j + ∑_i W_ij v_i))
with v_i ∈ {0, 1}, h_j ∈ {0, 1}.
Conditional distributions:
P(h|v; θ) = ∏_j p(h_j|v) with p(h_j = 1|v) = σ(∑_i W_ij v_i + a_j)
and
P(v|h; θ) = ∏_i p(v_i|h) with p(v_i = 1|h) = σ(∑_j W_ij h_j + b_i)
with σ the sigmoid activation function.
7
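A minimal NumPy sketch of these factorized conditionals and the corresponding sampling steps; the sizes and random weights are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 4
W = 0.1 * rng.standard_normal((d, m))
a = np.zeros(m)   # hidden biases
b = np.zeros(d)   # visible biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    """p(h_j = 1 | v) = sigma(sum_i W_ij v_i + a_j), each unit sampled independently."""
    p = sigmoid(v @ W + a)
    return (rng.random(m) < p).astype(float), p

def sample_v_given_h(h):
    """p(v_i = 1 | h) = sigma(sum_j W_ij h_j + b_i), each unit sampled independently."""
    p = sigmoid(W @ h + b)
    return (rng.random(d) < p).astype(float), p

v = rng.integers(0, 2, size=d).astype(float)
h, _ = sample_h_given_v(v)
v_new, _ = sample_v_given_h(h)
```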
RBM training (2)
Given observations {v_n}_{n=1}^N, the derivatives of the log-likelihood are
(1/N) ∑_n ∂ log P(v_n; θ)/∂W_ij = E_{P_data}[v_i h_j] − E_{P_model}[v_i h_j]
(1/N) ∑_n ∂ log P(v_n; θ)/∂a_j = E_{P_data}[h_j] − E_{P_model}[h_j]
(1/N) ∑_n ∂ log P(v_n; θ)/∂b_i = E_{P_data}[v_i] − E_{P_model}[v_i]
with
• Data-dependent expectation E_{P_data}[·] (a form of Hebbian learning):
an expectation with respect to the data distribution P_data(h, v; θ) = P(h|v; θ) P_data(v), with P_data(v) = (1/N) ∑_n δ(v − v_n) the empirical distribution.
• Model's expectation E_{P_model}[·] (unlearning):
an expectation with respect to the distribution defined by the model, P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ)).
8
RBM training (3)
Exact maximum likelihood learning is intractable (due to the computation of E_{P_model}[·]). In practice, the Contrastive Divergence (CD) algorithm [Hinton, 2002] is used:
ΔW = α (E_{P_data}[v h^T] − E_{P_T}[v h^T])
with α the learning rate and P_T the distribution defined by running a Gibbs chain, initialized at the data, for T full steps (often T = 1 in practice, i.e. CD1).
CD1 scheme:
1. Start the Gibbs sampler at v^(1) := v_n and generate h^(1) ∼ P(h|v^(1))
2. After obtaining h^(1), generate v^(2) ∼ P(v|h^(1)) (called fantasy data)
3. After obtaining v^(2), generate h^(2) ∼ P(h|v^(2))
with ΔW ∝ (v_n h^(1)T − v^(2) h^(2)T)
9
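A hedged sketch of one CD1 parameter update on a batch of binary data, following the three sampling steps above; the placeholder data, sizes, learning rate and the (analogous) bias updates are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, alpha = 100, 20, 10, 0.05            # batch size, visible/hidden units, learning rate
V = (rng.random((N, d)) < 0.3).astype(float)  # placeholder binary "data"
W = 0.01 * rng.standard_normal((d, m))
a, b = np.zeros(m), np.zeros(d)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, a, b):
    # 1. positive phase: h(1) ~ P(h | v(1) = data)
    h1 = (rng.random((N, m)) < sigmoid(V @ W + a)).astype(float)
    # 2. fantasy data: v(2) ~ P(v | h(1))
    v2 = (rng.random((N, d)) < sigmoid(h1 @ W.T + b)).astype(float)
    # 3. h(2) ~ P(h | v(2))
    h2 = (rng.random((N, m)) < sigmoid(v2 @ W + a)).astype(float)
    # Delta W proportional to <v h^T>_data - <v h^T>_reconstruction
    W = W + alpha * (V.T @ h1 - v2.T @ h2) / N
    a = a + alpha * (h1 - h2).mean(axis=0)    # bias updates assumed analogous
    b = b + alpha * (V - v2).mean(axis=0)
    return W, a, b

for _ in range(10):
    W, a, b = cd1_step(V, W, a, b)
```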
Deep Boltzmann machine training (1)
Consider a 3-layer Deep BM with energy function [Salakhutdinov, 2015]:
E(v, h^1, h^2, h^3; θ) = −v^T W^1 h^1 − (h^1)^T W^2 h^2 − (h^2)^T W^3 h^3
with unknown model parameters θ = {W^1, W^2, W^3}.
The model assigns the following probability to a visible vector v:
P(v; θ) = (1/Z(θ)) ∑_{h^1,h^2,h^3} exp(−E(v, h^1, h^2, h^3; θ))
10
Deep Boltzmann machine training (2)
For training:
∂ log P(v; θ)/∂W^1 = E_{P_data}[v (h^1)^T] − E_{P_model}[v (h^1)^T]
∂ log P(v; θ)/∂W^2 = E_{P_data}[h^1 (h^2)^T] − E_{P_model}[h^1 (h^2)^T]
∂ log P(v; θ)/∂W^3 = E_{P_data}[h^2 (h^3)^T] − E_{P_model}[h^2 (h^3)^T]
Problem: the conditional distribution over the states of the hidden variables, conditioned on the data, is no longer factorial. For simplicity and speed one can assume and impose a fully factorized distribution, corresponding to a naive mean-field approximation [Salakhutdinov 2015] (a sketch follows after this slide).
11
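A minimal sketch of such a naive mean-field fixed-point iteration for the three-layer energy above; the random weights and layer sizes are placeholders and no learning step is included.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m1, m2, m3 = 8, 6, 4, 3
W1 = 0.1 * rng.standard_normal((d, m1))
W2 = 0.1 * rng.standard_normal((m1, m2))
W3 = 0.1 * rng.standard_normal((m2, m3))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def mean_field(v, n_iter=50):
    """Fully factorized q(h1, h2, h3) with means mu1, mu2, mu3 (naive mean field)."""
    mu1, mu2, mu3 = 0.5 * np.ones(m1), 0.5 * np.ones(m2), 0.5 * np.ones(m3)
    for _ in range(n_iter):
        mu1 = sigmoid(v @ W1 + W2 @ mu2)      # layer 1 sees the data and layer 2
        mu2 = sigmoid(mu1 @ W2 + W3 @ mu3)    # layer 2 sees layers 1 and 3
        mu3 = sigmoid(mu2 @ W3)               # top layer sees layer 2 only
    return mu1, mu2, mu3

v = (rng.random(d) < 0.5).astype(float)
mu1, mu2, mu3 = mean_field(v)
```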
Multimodal Deep Boltzmann Machine
From [Srivastava & Salakhutdinov 2014]
12
Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN) [Goodfellow et al., 2014]
Training of two competing models in a zero-sum game:
(Generator) generate fake output examples from random noise
(Discriminator) discriminate between fake examples and real examples.
source: https://deeplearning4j.org/generative-adversarial-network
13
GAN: example on MNIST
MNIST training data:
GAN generated examples:
source: https://www.kdnuggets.com/2016/07/mnist-generative-adversarial-model-keras.html
14
Kernel methods and deep learning
14
Kernel machines & deep learning
previous approaches:
• kernels for deep learning [Cho & Saul, 2009]
• mathematics of the neural response [Smale et al., 2010]
• deep Gaussian processes [Damianou & Lawrence, 2013]
• convolutional kernel networks [Mairal et al., 2014]
• multi-layer support vector machines [Wiering & Schomaker, 2014]
• other
15
Kernel machines & deep learning: New Challenges
• new synergies and new foundations between support vector machines & kernel methods and deep learning architectures?
• possible to extend primal and dual model representations (as occurring in SVM and LS-SVM models) from shallow to deep architectures?
• possible to handle deep feedforward neural networks and deep kernel machines within a common setting?
→ new framework:
"Deep Restricted Kernel Machines" [Suykens, Neural Computation, 2017]
https://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_00984
16
Restricted Kernel Machines
16
Restricted Kernel Machines (RKM)
Main characteristics:
• Kernel machine interpretations in terms of visible and hidden units (similar to Restricted Boltzmann Machines (RBM))
• Restricted Kernel Machine (RKM) representations for
– LS-SVM regression/classification
– Kernel PCA
– Matrix SVD
– Parzen-type models
– other
• Based on the principle of conjugate feature duality (with hidden features corresponding to dual variables)
17
LS-SVM regression model: classical approach
LS-SVM regression model, given input & output data x_i ∈ R^d, y_i ∈ R:
min_{w,b,e_i} (1/2) w^T w + (γ/2) ∑_{i=1}^N e_i^2
subject to y_i = w^T ϕ(x_i) + b + e_i, i = 1, ..., N.
Solution in the Lagrange multipliers α_i:
[ K + I/γ   1_N ] [ α ]   [ y_{1:N} ]
[ 1_N^T      0  ] [ b ] = [    0    ]
with K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j), y_{1:N} = [y_1; ...; y_N]
and ŷ(x) = ∑_i α_i K(x, x_i) + b.
→ How to achieve a representation with visible and hidden units?
18
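A hedged NumPy sketch of solving this dual linear system with an RBF kernel; the toy data, γ and the kernel bandwidth are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, gamma, sigma2 = 50, 10.0, 0.5
x = np.sort(rng.uniform(-3, 3, N))
y = np.sinc(x) + 0.1 * rng.standard_normal(N)     # toy regression data

def rbf(a, b, sigma2):
    """RBF kernel matrix K(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2))."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma2))

K = rbf(x, x, sigma2)
ones = np.ones(N)

# [[K + I/gamma, 1_N], [1_N^T, 0]] [alpha; b] = [y; 0]
A = np.block([[K + np.eye(N) / gamma, ones[:, None]],
              [ones[None, :], np.zeros((1, 1))]])
sol = np.linalg.solve(A, np.concatenate([y, [0.0]]))
alpha, b = sol[:N], sol[N]

# prediction: y_hat(x) = sum_i alpha_i K(x, x_i) + b
x_test = np.linspace(-3, 3, 200)
y_hat = rbf(x_test, x, sigma2) @ alpha + b
```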
Conjugate feature duality
Property. For λ > 0, the following quadratic form property holds:
(1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h, ∀ e, h ∈ R^p
Proof: This is verified by writing the quadratic form as
(1/2) [e^T h^T] [ (1/λ)I  −I ; −I  λI ] [e; h] ≥ 0, ∀ e, h ∈ R^p.
It is known that
Q = [ A  B ; B^T  C ] ≥ 0
if and only if A > 0 and the Schur complement C − B^T A^{−1} B ≥ 0. This results in the condition λI − I(λI)I ≥ 0, which holds (with equality).
Note. One has (1/(2λ)) e^T e = max_h (e^T h − (λ/2) h^T h)
19
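The property is easy to check numerically; the sketch below (with an arbitrary λ, dimension and random vectors) verifies the inequality and that the maximum over h, attained at h = e/λ, recovers the left-hand side.

```python
import numpy as np

rng = np.random.default_rng(0)
p, lam = 5, 0.7
e = rng.standard_normal(p)
h = rng.standard_normal(p)

lhs = e @ e / (2 * lam)
rhs = e @ h - 0.5 * lam * h @ h
assert lhs >= rhs                          # 1/(2 lam) e'e >= e'h - lam/2 h'h

h_star = e / lam                           # maximizer of e'h - lam/2 h'h
assert np.isclose(lhs, e @ h_star - 0.5 * lam * h_star @ h_star)
```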
Model: living in two worlds ...
Original model:
ŷ = W^T x + b, e = y − ŷ
objective J = regularization term Tr(W^T W) + (1/λ) error term ∑_i e_i^T e_i
↓ (1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h
New representation:
ŷ = ∑_j h_j x_j^T x + b
obtain J ≥ J(h_i, W, b)
solution from the stationary points of J:
∂J/∂h_i = 0, ∂J/∂W = 0, ∂J/∂b = 0
20
Model: living in two worlds ...
Original model:
ŷ = W^T ϕ(x) + b, e = y − ŷ
objective J = regularization term Tr(W^T W) + (1/λ) error term ∑_i e_i^T e_i
↓ (1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h
New representation:
ŷ = ∑_j h_j K(x_j, x) + b
obtain J ≥ J(h_i, W, b)
solution from the stationary points of J:
∂J/∂h_i = 0, ∂J/∂W = 0, ∂J/∂b = 0
20
Simplest example: line fitting
Given data: {(x_i, y_i)}_{i=1}^N, x_i, y_i ∈ R
[figure: scatter plot of the data points in the (x, y) plane]
Linear model:
ŷ = wx + b, e = y − ŷ
RKM representation:
ŷ = ∑_i h_i x_i x + b
3 visible units: v = [x; 1; −y]
1 hidden unit: h ∈ R
[figure: RKM diagram linking the visible units v to the hidden unit h]
21
From LS-SVM to the RKM representation
Multi-output model ŷ = W^T x + b, e = y − ŷ
Objective in LS-SVM regression (linear case):
J = (η/2) Tr(W^T W) + (1/(2λ)) ∑_{i=1}^N e_i^T e_i   s.t. e_i = y_i − W^T x_i − b, ∀i
  ≥ ∑_{i=1}^N e_i^T h_i − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)   s.t. e_i = y_i − W^T x_i − b, ∀i
  = ∑_{i=1}^N (y_i^T − x_i^T W − b^T) h_i − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W) ≜ J(h_i, W, b)
  = R^train_RKM − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)
22
Connection between RKM and RBM
• RKM & RBM: interpretation in terms of visible and hidden units
• RKM: energy form as in RBM:
R^train_RKM = ∑_{i=1}^N R_RKM(v_i, h_i) = −∑_{i=1}^N (x_i^T W h_i + b^T h_i − y_i^T h_i) = ∑_{i=1}^N e_i^T h_i
with R_RKM(v, h) = −v^T W h = −(x^T W h + b^T h − y^T h) = e^T h.
• Conjugate feature duality: the hidden features h_i are conjugated to the e_i and serve as dual variables.
23
From LS-SVM to RKM representation (2)
• Stationary points of J(h_i, W, b) (nonlinear case, feature map ϕ(·)):
∂J/∂h_i = 0 ⇒ y_i = W^T ϕ(x_i) + b + λ h_i, ∀i
∂J/∂W = 0 ⇒ W = (1/η) ∑_i ϕ(x_i) h_i^T
∂J/∂b = 0 ⇒ ∑_i h_i = 0.
• Solution in h_i and b with positive definite kernel K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j):
[ (1/η)K + λ I_N   1_N ] [ H^T ]   [ Y^T ]
[ 1_N^T             0  ] [ b^T ] = [  0  ]
with K = [K(x_i, x_j)], H = [h_1 ... h_N], Y = [y_1 ... y_N].
24
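A minimal sketch of this linear system for a single-output toy problem, together with the dual predictive representation; the data, η, λ and the RBF bandwidth are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, eta, lam, sigma2 = 40, 1.0, 0.01, 0.5
x = np.sort(rng.uniform(-3, 3, N))
Y = (np.sinc(x) + 0.05 * rng.standard_normal(N))[:, None]   # N x 1 outputs (Y^T stacked row-wise)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma2))  # kernel matrix
ones = np.ones((N, 1))

# [[(1/eta) K + lam I_N, 1_N], [1_N^T, 0]] [H^T; b^T] = [Y^T; 0]
A = np.block([[K / eta + lam * np.eye(N), ones],
              [ones.T, np.zeros((1, 1))]])
rhs = np.vstack([Y, np.zeros((1, 1))])
sol = np.linalg.solve(A, rhs)
Ht, b = sol[:N], sol[N:]                 # hidden features h_i (dual variables) and bias

# dual representation: y_hat(x) = (1/eta) sum_j h_j K(x_j, x) + b
x_test = np.linspace(-3, 3, 200)
K_test = np.exp(-(x_test[:, None] - x[None, :]) ** 2 / (2 * sigma2))
y_hat = K_test @ Ht / eta + b
```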
From LS-SVM to RKM representation (3)
[figure: RKM diagram with input x, feature map ϕ(x), visible units, hidden units h, error e and output y]
Note: ϕ(x) can be multi-layered; visible units: [ϕ(x); 1; −y]
Conjugate feature duality: primal and dual model representations of the model M:
(P)_RKM : ŷ = W^T ϕ(x) + b
(D)_RKM : ŷ = (1/η) ∑_j h_j K(x_j, x) + b.
(large N, small d) versus (large d, small N)
25
Kernel principal component analysis (KPCA)
[figures: toy data projected with linear PCA (left) and kernel PCA with RBF kernel (right)]
Kernel PCA [Scholkopf et al., 1998]: take the eigenvalue decomposition of the kernel matrix
[ K(x_1, x_1) ... K(x_1, x_N) ]
[     ...             ...     ]
[ K(x_N, x_1) ... K(x_N, x_N) ]
(applications in dimensionality reduction and denoising)
26
Kernel PCA: classical LS-SVM approach
• Primal problem [Suykens et al., 2002]:
min_{w,b,e} (1/2) w^T w − (1/(2γ)) ∑_{i=1}^N e_i^2   s.t. e_i = w^T ϕ(x_i) + b, i = 1, ..., N.
• Dual problem corresponds to kernel PCA:
Ω^(c) α = λ α with λ = 1/γ
with Ω^(c)_ij = (ϕ(x_i) − μ_ϕ)^T (ϕ(x_j) − μ_ϕ) the centered kernel matrix and μ_ϕ = (1/N) ∑_{i=1}^N ϕ(x_i).
• Interpretation:
1. pool of candidate components (objective function equals zero)
2. select relevant components
27
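A short sketch of this dual kernel PCA step (centering the kernel matrix and taking its eigenvalue decomposition); the toy data and RBF bandwidth are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, s = 200, 0.5, 2
X = rng.standard_normal((N, 2))
X[:, 1] += X[:, 0] ** 2                     # toy nonlinear structure

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-sq / (2 * sigma2))          # kernel matrix

C = np.eye(N) - np.ones((N, N)) / N
Omega_c = C @ Omega @ C                     # centered kernel matrix Omega^(c)

evals, evecs = np.linalg.eigh(Omega_c)      # eigenvalue decomposition Omega^(c) alpha = lambda alpha
idx = np.argsort(evals)[::-1][:s]           # keep the s leading components
alphas, lambdas = evecs[:, idx], evals[idx]

# projection (score) of training point i onto component k: Omega_c[i, :] @ alphas[:, k]
scores = Omega_c @ alphas
```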
From KPCA to RKM representation
Model:
e = W^T ϕ(x)
objective J = regularization term Tr(W^T W) − (1/λ) variance term ∑_i e_i^T e_i
↓ −(1/(2λ)) e^T e ≤ −e^T h + (λ/2) h^T h
RKM representation:
e = ∑_j h_j K(x_j, x)
obtain J ≤ J(h_i, W)
solution from the stationary points of J:
∂J/∂h_i = 0, ∂J/∂W = 0
28
From KPCA to RKM representation (2)
• Objective
J = (η/2) Tr(W^T W) − (1/(2λ)) ∑_{i=1}^N e_i^T e_i   s.t. e_i = W^T ϕ(x_i), ∀i
  ≤ −∑_{i=1}^N e_i^T h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)   s.t. e_i = W^T ϕ(x_i), ∀i
  = −∑_{i=1}^N ϕ(x_i)^T W h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W) ≜ J
• Stationary points of J(h_i, W):
∂J/∂h_i = 0 ⇒ W^T ϕ(x_i) = λ h_i, ∀i
∂J/∂W = 0 ⇒ W = (1/η) ∑_i ϕ(x_i) h_i^T
29
From KPCA to RKM representation (3)
• Elimination of W gives the eigenvalue decomposition:
(1/η) K H^T = H^T Λ
where H = [h_1 ... h_N] ∈ R^{s×N} and Λ = diag{λ_1, ..., λ_s} with s ≤ N
• Primal and dual model representations:
(P)_RKM : e = W^T ϕ(x)
(D)_RKM : e = (1/η) ∑_j h_j K(x_j, x).
30
Singular value decomposition
• Objective: given x_i, z_j row and column data of a (non-square) matrix
J = −(η/2) Tr(V^T W) + (1/(2λ)) ∑_{i=1}^N e_i^T e_i + (1/(2λ)) ∑_{j=1}^M r_j^T r_j   s.t. e_i = W^T ϕ(x_i), ∀i and r_j = V^T ψ(z_j), ∀j
• Primal and dual representations (relates to non-symmetric kernels):
(P)_RKM : e = W^T ϕ(x), r = V^T ψ(z)
(D)_RKM : e = (1/η) ∑_j h^r_j ψ(z_j)^T ϕ(x), r = (1/η) ∑_i h^e_i ϕ(x_i)^T ψ(z)
31
Kernel probability mass function estimation
• Objective:
J = ∑_{i=1}^N (p_i − ϕ(x_i)^T w) h_i − ∑_{i=1}^N p_i + (η/2) w^T w
• Primal and dual representations:
(P)_RKM : p_i = w^T ϕ(x_i)
(D)_RKM : p_i = (1/η) ∑_j K(x_j, x_i)
32
Deep Restricted Kernel Machines
32
Deep RKM: example
[figure: deep RKM chain x → ϕ_1(x) with (e^(1), h^(1)) → ϕ_2(h^(1)) with (e^(2), h^(2)) → ϕ_3(h^(2)) with (e^(3), h^(3)) → y]
Deep RKM: KPCA + KPCA + LSSVM
Coupling of the RKMs by taking the sum of the objectives:
J_deep = J_1 + J_2 + J_3
33
Generative kernel PCA
33
RKM objective for training and generating (1)
• RBM energy function
E(v, h; θ) = −v^T W h − c^T v − a^T h
with model parameters θ = {W, c, a}
• RKM objective function
J(v, h, W) = −v^T W h + (λ/2) h^T h + (1/2) v^T v + (η/2) Tr(W^T W)
Training: clamp v → J_train(h, W)
Generating: clamp h, W → J_gen(v)
[Schreurs & Suykens, ESANN 2018]
34
RKM objective for training and generating (2)
• Training: (clamp v)
J_train(h_i, W) = −∑_{i=1}^N v_i^T W h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)
Stationary points:
∂J_train/∂h_i = 0 ⇒ W^T v_i = λ h_i, ∀i
∂J_train/∂W = 0 ⇒ W = (1/η) ∑_{i=1}^N v_i h_i^T
Elimination of W:
(1/η) K H^T = H^T Λ
where H = [h_1, ..., h_N] ∈ R^{s×N}, Λ = diag{λ_1, ..., λ_s} with s ≤ N the number of selected components and K_ij = v_i^T v_j the kernel matrix elements.
35
RKM objective for training and generating (3)
• Generating: (clamp h, W)
Estimate the distribution p(h) from h_i, i = 1, ..., N (or assume it normal). Obtain a new value h⋆. Generate in this way v⋆ from
J_gen(v⋆) = −v⋆^T W h⋆ + (1/2) v⋆^T v⋆
Stationary points:
∂J_gen/∂v⋆ = 0
This gives
v⋆ = W h⋆
36
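A hedged end-to-end sketch of this train/generate scheme in the linear case v_i = x_i: the hidden features come from the eigendecomposition of K_ij = v_i^T v_j, a normal distribution is fitted on the h_i, and a new sample is formed as v⋆ = W h⋆ (with the reconstruction of the next slide added as a check). The data, η and the number of components s are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, s, eta = 300, 5, 2, 1.0
X = rng.standard_normal((N, 2)) @ rng.standard_normal((2, d))   # toy data on a 2-dim subspace
X -= X.mean(axis=0)

# training: (1/eta) K H^T = H^T Lambda with K_ij = v_i^T v_j
K = X @ X.T
evals, evecs = np.linalg.eigh(K / eta)
idx = np.argsort(evals)[::-1][:s]
Ht = evecs[:, idx]                    # H^T in R^{N x s}, rows are h_i^T (unit-norm eigenvectors)
W = X.T @ Ht / eta                    # W = (1/eta) sum_i v_i h_i^T, shape d x s

# generating: fit a normal distribution on the h_i, sample h*, set v* = W h*
mu, Cov = Ht.mean(axis=0), np.cov(Ht.T)
h_star = rng.multivariate_normal(mu, Cov)
v_star = W @ h_star

# reconstruction of the training data: x_hat_i = W h_i, i.e. X_hat = (1/eta) X H^T H
X_hat = Ht @ W.T
rec_err = np.linalg.norm(X - X_hat) ** 2
```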
Dimensionality reduction and denoising: linear case
• Given training data v_i = x_i with X ∈ R^{d×N}, obtain the hidden features H ∈ R^{s×N}:
X̂ = W H = ((1/η) ∑_{i=1}^N x_i h_i^T) H = (1/η) X H^T H
• Reconstruction error: ‖X − X̂‖^2
[diagram: x_i → G(·) → h_i → F(·) → x̂_i]
37
Dimensionality reduction and denoising: nonlinear case (1)
• A new data point x⋆ is generated from h⋆ by
ϕ(x⋆) = W h⋆ = ((1/η) ∑_{i=1}^N ϕ(x_i) h_i^T) h⋆
• Multiplying both sides by ϕ(x_j)^T gives:
K(x_j, x⋆) = (1/η) (∑_{i=1}^N K(x_j, x_i) h_i^T) h⋆
On the training data:
Ω = (1/η) Ω H^T H
with H ∈ R^{s×N}, Ω_ij = K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j).
38
Dimensionality reduction and denoising: nonlinear case (2)
• Estimated value x̂ for x⋆ by the kernel smoother:
x̂ = (∑_{j=1}^S K(x_j, x⋆) x_j) / (∑_{j=1}^S K(x_j, x⋆))
with K(x_j, x⋆) (e.g. RBF kernel) the scaled similarity between 0 and 1, and a design parameter S ≤ N (the S closest points based on the similarity K(x_j, x⋆)).
[Schreurs & Suykens, ESANN 2018]
39
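A short sketch of this kernel-smoother step; the RBF kernel, bandwidth, S and the way h⋆ is chosen are assumptions, and the similarities K(x_j, x⋆) are obtained from h⋆ as on the previous slide and rescaled to [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, s, eta, sigma2, S = 200, 2, 4, 1.0, 1.0, 20
X = rng.standard_normal((N, d))

# kernel PCA training step (as on the earlier slides): hidden features H^T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-sq / (2 * sigma2))                 # RBF kernel matrix
evals, evecs = np.linalg.eigh(Omega / eta)
Ht = evecs[:, np.argsort(evals)[::-1][:s]]         # N x s

# a new hidden state h* (here simply a slightly perturbed training feature vector)
h_star = Ht[0] + 0.01 * rng.standard_normal(s)

# implied similarities K(x_j, x*) = (1/eta) (sum_i K(x_j, x_i) h_i^T) h*
k = Omega @ Ht @ h_star / eta
k = (k - k.min()) / (k.max() - k.min())            # scale to [0, 1] as on the slide

# kernel smoother over the S most similar training points
top = np.argsort(k)[::-1][:S]
x_hat = (k[top, None] * X[top]).sum(axis=0) / k[top].sum()
```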
Example: denoising
Synthetic data sets:
[figures: two synthetic 2D data sets, axes X1 and X2]
X ∈ R2×500 (d = 2, N = 500)
Kernel PCA using RBF kernel with σ² = 1 (left: s = 2; right: s = 8)
Kernel smoother: S = 100 closest points, σ² = 0.2
40
Example: generating new data
From MNIST data:
Training data: 50 images per digit; Kernel PCA (left: s = 20; right: s = 50)
Normal distribution fitted on the h_i, used to generate h⋆
Kernel smoother: (left) S = 10 (digit 0); (right) S = 100 (digits 0, 8)
[Schreurs & Suykens, ESANN 2018]
41
Towards explainable AI
Understanding the role of the hidden units:
[figure: hidden features (H1, H2) of digits 0, 1 and 6, together with images generated while varying the hidden units h(1,1), h(1,2) and h(1,3) over their range]
[figures by Joachim Schreurs]
42
Tensor-based RKM for Multi-view KPCA
min ⟨W, W⟩ − ∑_{i=1}^N ⟨Φ^(i), W⟩ h_i + λ ∑_{i=1}^N h_i^2 with Φ^(i) = ϕ^[1](x_i^[1]) ⊗ ... ⊗ ϕ^[V](x_i^[V])
[Houthuys & Suykens, ICANN 2018]
43
Generative RKM (1)
The objective
J_train(h_i, V, U) = ∑_{i=1}^N (−ϕ_1(x_i)^T V h_i − ϕ_2(y_i)^T U h_i + (λ_i/2) h_i^T h_i) + (η_1/2) Tr(V^T V) + (η_2/2) Tr(U^T U)
results for training in the eigenvalue problem
((1/η_1) K_1 + (1/η_2) K_2) H^T = H^T Λ
with H = [h_1 ... h_N] and kernel matrices K_1, K_2 related to ϕ_1, ϕ_2.
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
44
Generative RKM (2)
Generating data is based on a newly generated h⋆ and the objective
J_generate(ϕ_1(x⋆), ϕ_2(y⋆)) = −ϕ_1(x⋆)^T V h⋆ − ϕ_2(y⋆)^T U h⋆ + (1/2) ϕ_1(x⋆)^T ϕ_1(x⋆) + (1/2) ϕ_2(y⋆)^T ϕ_2(y⋆)
giving
ϕ_1(x⋆) = (1/η_1) ∑_{i=1}^N ϕ_1(x_i) h_i^T h⋆,   ϕ_2(y⋆) = (1/η_2) ∑_{i=1}^N ϕ_2(y_i) h_i^T h⋆.
For generating x, y one can either work with the kernel smoother or with an explicit feature map using a feedforward neural network.
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
45
Generative RKM (3)
Train:
Generate:
46
Generative RKM (4)
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
47
Generative RKM (5)
Figure: Image generation using neural networks as feature map:
(left) MNIST; (right) Small-NORB
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
48
Generative RKM (6)
Figure: Targeted image generation through corresponding latent variable.
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
49
Conclusions
• From RBM to deep BM
• From RKM to deep RKM
• RKM and RBM representation: visible and hidden units
• RKM representation for LS-SVM, KPCA, SVD and others
• RKM representation obtained by conjugate feature duality
• Generative RKM
50
Acknowledgements (1)
• Current and former co-workers at ESAT-STADIUS:
C. Alzate, Y. Chen, J. De Brabanter, K. De Brabanter, L. De Lathauwer, H. De Meulemeester, B. De Moor, H. De Plaen, Ph. Dreesen, M. Espinoza, T. Falck, M. Fanuel, Y. Feng, B. Gauthier, X. Huang, L. Houthuys, V. Jumutc, Z. Karevan, R. Langone, R. Mall, S. Mehrkanoon, G. Nisol, M. Orchel, A. Pandey, K. Pelckmans, S. RoyChowdhury, S. Salzo, J. Schreurs, M. Signoretto, Q. Tao, J. Vandewalle, T. Van Gestel, S. Van Huffel, C. Varon, Y. Yang, and others
• Many other people for joint work, discussions, invitations, organizations
• Support from ERC AdG E-DUALITY, ERC AdG A-DATADRIVE-B, KU Leuven, OPTEC, IUAP DYSCO, FWO projects, IWT, iMinds, BIL, COST
51
Acknowledgements (2)
52
Acknowledgements (3)
NEW: ERC Advanced Grant E-DUALITY
Exploring duality for future data-driven modelling
53
Thank you
54