Deep Learning, Neural Networks and Kernel
Machines
Johan Suykens
KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
Email: [email protected]
https://www.esat.kuleuven.be/stadius/
Deeplearn 2019, Warsaw Poland, July 2019
Part II: RBMs, kernel machines and deep learning
• Restricted Boltzmann Machines (RBM)
• Deep Boltzmann Machines (Deep BM)
• Restricted Kernel Machines (RKM)
• Deep RKM (see Part III)
• Generative RKM
1
Generative models: RBM, GAN and deep learning
1
Restricted Boltzmann Machines (RBM)
• Markov random field, bipartite graph, stochastic binary units
• Layer of visible units v and layer of hidden units h
• No hidden-to-hidden connections
• Energy:
E(v, h; θ) = −v^T W h − b^T v − a^T h with θ = {W, b, a}
• Joint distribution:
P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ))
with partition function Z(θ) = ∑_v ∑_h exp(−E(v, h; θ))
[Hinton, Osindero, Teh, Neural Computation 2006]
2
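As a small illustration (not part of the slides), the energy and the unnormalized joint probability can be coded directly; the sizes and parameter values below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 4                      # number of visible / hidden units (arbitrary)
W = 0.1 * rng.standard_normal((d, m))
b = np.zeros(d)                  # visible biases
a = np.zeros(m)                  # hidden biases

def energy(v, h):
    """E(v, h) = -v^T W h - b^T v - a^T h."""
    return -v @ W @ h - b @ v - a @ h

def unnormalized_joint(v, h):
    """exp(-E(v, h)); dividing by Z(theta) would give P(v, h)."""
    return np.exp(-energy(v, h))

v = rng.integers(0, 2, size=d)   # a binary visible configuration
h = rng.integers(0, 2, size=m)   # a binary hidden configuration
print(energy(v, h), unnormalized_joint(v, h))
```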
RBM and deep learning
RBM
p(v, h) (single RBM) versus p(v, h^1, h^2, h^3, ...) (RBMs stacked into a deep model)
[Hinton et al., 2006; Salakhutdinov, 2015]
3
Convolutional Deep Belief Networks
Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks [Lee et al., 2011]
4
Energy function
• RBM:
E = −v^T W h
• Deep Boltzmann machine (two layers):
E = −v^T W^1 h^1 − (h^1)^T W^2 h^2
• Deep Boltzmann machine (three layers):
E = −v^T W^1 h^1 − (h^1)^T W^2 h^2 − (h^2)^T W^3 h^3
5
RBM: example on MNIST
MNIST training data:
Generating new images:
source: https://www.kaggle.com/nicw102168/restricted-boltzmann-machine-rbm-on-mnist
6
RBM training (1)
Thanks to the special bipartite structure, explicit marginalization is possible:
P(v; θ) = (1/Z(θ)) ∑_h exp(−E(v, h; θ)) = (1/Z(θ)) exp(b^T v) ∏_j (1 + exp(a_j + ∑_i W_ij v_i))
with v_i ∈ {0, 1}, h_j ∈ {0, 1}.
Conditional distributions:
P(h|v; θ) = ∏_j p(h_j|v) with p(h_j = 1|v) = σ(∑_i W_ij v_i + a_j)
and
P(v|h; θ) = ∏_i p(v_i|h) with p(v_i = 1|h) = σ(∑_j W_ij h_j + b_i)
with σ the sigmoid activation function.
7
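A minimal NumPy sketch of these factorized conditionals and the corresponding sampling steps; the sizes and random weights are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 4
W = 0.1 * rng.standard_normal((d, m))
a = np.zeros(m)   # hidden biases
b = np.zeros(d)   # visible biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    """p(h_j = 1 | v) = sigma(sum_i W_ij v_i + a_j), each unit sampled independently."""
    p = sigmoid(v @ W + a)
    return (rng.random(m) < p).astype(float), p

def sample_v_given_h(h):
    """p(v_i = 1 | h) = sigma(sum_j W_ij h_j + b_i), each unit sampled independently."""
    p = sigmoid(W @ h + b)
    return (rng.random(d) < p).astype(float), p

v = rng.integers(0, 2, size=d).astype(float)
h, _ = sample_h_given_v(v)
v_new, _ = sample_v_given_h(h)
```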
RBM training (2)
Given observations {v_n}_{n=1}^N, the derivatives of the log-likelihood are
(1/N) ∑_n ∂ log P(v_n; θ)/∂W_ij = E_{P_data}[v_i h_j] − E_{P_model}[v_i h_j]
(1/N) ∑_n ∂ log P(v_n; θ)/∂a_j = E_{P_data}[h_j] − E_{P_model}[h_j]
(1/N) ∑_n ∂ log P(v_n; θ)/∂b_i = E_{P_data}[v_i] − E_{P_model}[v_i]
with
• Data-dependent expectation E_{P_data}[·] (a form of Hebbian learning):
an expectation with respect to the data distribution P_data(h, v; θ) = P(h|v; θ) P_data(v), with P_data(v) = (1/N) ∑_n δ(v − v_n) the empirical distribution.
• Model's expectation E_{P_model}[·] (unlearning):
an expectation with respect to the distribution defined by the model, P(v, h; θ) = (1/Z(θ)) exp(−E(v, h; θ)).
8
RBM training (3)
Exact maximum likelihood learning is intractable (due to the computation of E_{P_model}[·]). In practice, the Contrastive Divergence (CD) algorithm [Hinton, 2002] is used:
ΔW = α (E_{P_data}[v h^T] − E_{P_T}[v h^T])
with α the learning rate and P_T the distribution defined by running a Gibbs chain, initialized at the data, for T full steps (often T = 1 in practice, i.e. CD1).
CD1 scheme:
1. Start the Gibbs sampler at v^(1) := v_n and generate h^(1) ∼ P(h|v^(1))
2. After obtaining h^(1), generate v^(2) ∼ P(v|h^(1)) (called fantasy data)
3. After obtaining v^(2), generate h^(2) ∼ P(h|v^(2))
with ΔW ∝ (v_n h^(1)T − v^(2) h^(2)T)
9
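A hedged sketch of one CD1 parameter update on a batch of binary data, following the three sampling steps above; the placeholder data, sizes, learning rate and the (analogous) bias updates are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, alpha = 100, 20, 10, 0.05            # batch size, visible/hidden units, learning rate
V = (rng.random((N, d)) < 0.3).astype(float)  # placeholder binary "data"
W = 0.01 * rng.standard_normal((d, m))
a, b = np.zeros(m), np.zeros(d)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, a, b):
    # 1. positive phase: h(1) ~ P(h | v(1) = data)
    h1 = (rng.random((N, m)) < sigmoid(V @ W + a)).astype(float)
    # 2. fantasy data: v(2) ~ P(v | h(1))
    v2 = (rng.random((N, d)) < sigmoid(h1 @ W.T + b)).astype(float)
    # 3. h(2) ~ P(h | v(2))
    h2 = (rng.random((N, m)) < sigmoid(v2 @ W + a)).astype(float)
    # Delta W proportional to <v h^T>_data - <v h^T>_reconstruction
    W = W + alpha * (V.T @ h1 - v2.T @ h2) / N
    a = a + alpha * (h1 - h2).mean(axis=0)    # bias updates assumed analogous
    b = b + alpha * (V - v2).mean(axis=0)
    return W, a, b

for _ in range(10):
    W, a, b = cd1_step(V, W, a, b)
```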
Deep Boltzmann machine training (1)
Consider a 3-layer Deep BM with energy function [Salakhutdinov, 2015]:
E(v, h^1, h^2, h^3; θ) = −v^T W^1 h^1 − (h^1)^T W^2 h^2 − (h^2)^T W^3 h^3
with unknown model parameters θ = {W^1, W^2, W^3}.
The model assigns the following probability to a visible vector v:
P(v; θ) = (1/Z(θ)) ∑_{h^1,h^2,h^3} exp(−E(v, h^1, h^2, h^3; θ))
10
Deep Boltzmann machine training (2)
For training:
∂ log P(v; θ)/∂W^1 = E_{P_data}[v (h^1)^T] − E_{P_model}[v (h^1)^T]
∂ log P(v; θ)/∂W^2 = E_{P_data}[h^1 (h^2)^T] − E_{P_model}[h^1 (h^2)^T]
∂ log P(v; θ)/∂W^3 = E_{P_data}[h^2 (h^3)^T] − E_{P_model}[h^2 (h^3)^T]
Problem: the conditional distribution over the states of the hidden variables, conditioned on the data, is no longer factorial. For simplicity and speed one can assume and impose a fully factorized distribution, corresponding to a naive mean-field approximation [Salakhutdinov 2015] (a sketch follows after this slide).
11
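A minimal sketch of such a naive mean-field fixed-point iteration for the three-layer energy above; the random weights and layer sizes are placeholders and no learning step is included.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m1, m2, m3 = 8, 6, 4, 3
W1 = 0.1 * rng.standard_normal((d, m1))
W2 = 0.1 * rng.standard_normal((m1, m2))
W3 = 0.1 * rng.standard_normal((m2, m3))
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def mean_field(v, n_iter=50):
    """Fully factorized q(h1, h2, h3) with means mu1, mu2, mu3 (naive mean field)."""
    mu1, mu2, mu3 = 0.5 * np.ones(m1), 0.5 * np.ones(m2), 0.5 * np.ones(m3)
    for _ in range(n_iter):
        mu1 = sigmoid(v @ W1 + W2 @ mu2)      # layer 1 sees the data and layer 2
        mu2 = sigmoid(mu1 @ W2 + W3 @ mu3)    # layer 2 sees layers 1 and 3
        mu3 = sigmoid(mu2 @ W3)               # top layer sees layer 2 only
    return mu1, mu2, mu3

v = (rng.random(d) < 0.5).astype(float)
mu1, mu2, mu3 = mean_field(v)
```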
Multimodal Deep Boltzmann Machine
From [Srivastava & Salakhutdinov 2014]
12
Generative Adversarial Network (GAN)
Generative Adversarial Network (GAN) [Goodfellow et al., 2014]
Training of two competing models in a zero-sum game:
(Generator) generate fake output examples from random noise
(Discriminator) discriminate between fake examples and real examples.
source: https://deeplearning4j.org/generative-adversarial-network
13
GAN: example on MNIST
MNIST training data:
GAN generated examples:
source: https://www.kdnuggets.com/2016/07/mnist-generative-adversarial-model-keras.html
14
Kernel methods and deep learning
14
Kernel machines & deep learning
previous approaches:
• kernels for deep learning [Cho & Saul, 2009]
• mathematics of the neural response [Smale et al., 2010]
• deep Gaussian processes [Damianou & Lawrence, 2013]
• convolutional kernel networks [Mairal et al., 2014]
• multi-layer support vector machines [Wiering & Schomaker, 2014]
• other
15
Kernel machines & deep learning: New Challenges
• new synergies and new foundations between support vector machines & kernel methods and deep learning architectures?
• possible to extend primal and dual model representations (as occurring in SVM and LS-SVM models) from shallow to deep architectures?
• possible to handle deep feedforward neural networks and deep kernel machines within a common setting?
→ new framework:
"Deep Restricted Kernel Machines" [Suykens, Neural Computation, 2017]
https://www.mitpressjournals.org/doi/pdf/10.1162/neco_a_00984
16
Restricted Kernel Machines
16
Restricted Kernel Machines (RKM)
Main characteristics:
• Kernel machine interpretations in terms of visible and hidden units (similar to Restricted Boltzmann Machines (RBM))
• Restricted Kernel Machine (RKM) representations for
– LS-SVM regression/classification
– Kernel PCA
– Matrix SVD
– Parzen-type models
– other
• Based on the principle of conjugate feature duality (with hidden features corresponding to dual variables)
17
LS-SVM regression model: classical approach
LS-SVM regression model, given input & output data x_i ∈ R^d, y_i ∈ R:
min_{w,b,e_i} (1/2) w^T w + (γ/2) ∑_{i=1}^N e_i^2
subject to y_i = w^T ϕ(x_i) + b + e_i, i = 1, ..., N.
Solution in the Lagrange multipliers α_i:
[ K + I/γ   1_N ] [ α ]   [ y_{1:N} ]
[ 1_N^T      0  ] [ b ] = [    0    ]
with K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j), y_{1:N} = [y_1; ...; y_N]
and ŷ(x) = ∑_i α_i K(x, x_i) + b.
→ How to achieve a representation with visible and hidden units?
18
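A hedged NumPy sketch of solving this dual linear system with an RBF kernel; the toy data, γ and the kernel bandwidth are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, gamma, sigma2 = 50, 10.0, 0.5
x = np.sort(rng.uniform(-3, 3, N))
y = np.sinc(x) + 0.1 * rng.standard_normal(N)     # toy regression data

def rbf(a, b, sigma2):
    """RBF kernel matrix K(a_i, b_j) = exp(-(a_i - b_j)^2 / (2 sigma^2))."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma2))

K = rbf(x, x, sigma2)
ones = np.ones(N)

# [[K + I/gamma, 1_N], [1_N^T, 0]] [alpha; b] = [y; 0]
A = np.block([[K + np.eye(N) / gamma, ones[:, None]],
              [ones[None, :], np.zeros((1, 1))]])
sol = np.linalg.solve(A, np.concatenate([y, [0.0]]))
alpha, b = sol[:N], sol[N]

# prediction: y_hat(x) = sum_i alpha_i K(x, x_i) + b
x_test = np.linspace(-3, 3, 200)
y_hat = rbf(x_test, x, sigma2) @ alpha + b
```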
Conjugate feature duality
Property. For λ > 0, the following quadratic form property holds:
(1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h, ∀ e, h ∈ R^p
Proof: This is verified by writing the quadratic form as
(1/2) [e^T h^T] [ (1/λ)I  −I ; −I  λI ] [e; h] ≥ 0, ∀ e, h ∈ R^p.
It is known that
Q = [ A  B ; B^T  C ] ≥ 0
if and only if A > 0 and the Schur complement C − B^T A^{−1} B ≥ 0. This results in the condition λI − I(λI)I ≥ 0, which holds (with equality).
Note. One has (1/(2λ)) e^T e = max_h (e^T h − (λ/2) h^T h)
19
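The property is easy to check numerically; the sketch below (with an arbitrary λ, dimension and random vectors) verifies the inequality and that the maximum over h, attained at h = e/λ, recovers the left-hand side.

```python
import numpy as np

rng = np.random.default_rng(0)
p, lam = 5, 0.7
e = rng.standard_normal(p)
h = rng.standard_normal(p)

lhs = e @ e / (2 * lam)
rhs = e @ h - 0.5 * lam * h @ h
assert lhs >= rhs                          # 1/(2 lam) e'e >= e'h - lam/2 h'h

h_star = e / lam                           # maximizer of e'h - lam/2 h'h
assert np.isclose(lhs, e @ h_star - 0.5 * lam * h_star @ h_star)
```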
Model: living in two worlds ...
Original model:
ŷ = W^T x + b, e = y − ŷ
objective J = regularization term Tr(W^T W) + (1/λ) error term ∑_i e_i^T e_i
↓ (1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h
New representation:
ŷ = ∑_j h_j x_j^T x + b
obtain J ≥ J(h_i, W, b)
solution from the stationary points of J:
∂J/∂h_i = 0, ∂J/∂W = 0, ∂J/∂b = 0
20
Model: living in two worlds ...
Original model:
ŷ = W^T ϕ(x) + b, e = y − ŷ
objective J = regularization term Tr(W^T W) + (1/λ) error term ∑_i e_i^T e_i
↓ (1/(2λ)) e^T e ≥ e^T h − (λ/2) h^T h
New representation:
ŷ = ∑_j h_j K(x_j, x) + b
obtain J ≥ J(h_i, W, b)
solution from the stationary points of J:
∂J/∂h_i = 0, ∂J/∂W = 0, ∂J/∂b = 0
20
Simplest example: line fitting
Given data: {(x_i, y_i)}_{i=1}^N, x_i, y_i ∈ R
[figure: scatter plot of the data points in the (x, y) plane]
Linear model:
ŷ = wx + b, e = y − ŷ
RKM representation:
ŷ = ∑_i h_i x_i x + b
3 visible units: v = [x; 1; −y]
1 hidden unit: h ∈ R
[figure: RKM diagram linking the visible units v to the hidden unit h]
21
From LS-SVM to the RKM representation
Multi-output model ŷ = W^T x + b, e = y − ŷ
Objective in LS-SVM regression (linear case):
J = (η/2) Tr(W^T W) + (1/(2λ)) ∑_{i=1}^N e_i^T e_i   s.t. e_i = y_i − W^T x_i − b, ∀i
  ≥ ∑_{i=1}^N e_i^T h_i − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)   s.t. e_i = y_i − W^T x_i − b, ∀i
  = ∑_{i=1}^N (y_i^T − x_i^T W − b^T) h_i − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W) ≜ J(h_i, W, b)
  = R^train_RKM − (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)
22
Connection between RKM and RBM
• RKM & RBM: interpretation in terms of visible and hidden units
• RKM: energy form as in RBM:
R^train_RKM = ∑_{i=1}^N R_RKM(v_i, h_i) = −∑_{i=1}^N (x_i^T W h_i + b^T h_i − y_i^T h_i) = ∑_{i=1}^N e_i^T h_i
with R_RKM(v, h) = −v^T W h = −(x^T W h + b^T h − y^T h) = e^T h.
• Conjugate feature duality: the hidden features h_i are conjugated to the e_i and serve as dual variables.
23
From LS-SVM to RKM representation (2)
• Stationary points of J(h_i, W, b) (nonlinear case, feature map ϕ(·)):
∂J/∂h_i = 0 ⇒ y_i = W^T ϕ(x_i) + b + λ h_i, ∀i
∂J/∂W = 0 ⇒ W = (1/η) ∑_i ϕ(x_i) h_i^T
∂J/∂b = 0 ⇒ ∑_i h_i = 0.
• Solution in h_i and b with positive definite kernel K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j):
[ (1/η)K + λ I_N   1_N ] [ H^T ]   [ Y^T ]
[ 1_N^T             0  ] [ b^T ] = [  0  ]
with K = [K(x_i, x_j)], H = [h_1 ... h_N], Y = [y_1 ... y_N].
24
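A minimal sketch of this linear system for a single-output toy problem, together with the dual predictive representation; the data, η, λ and the RBF bandwidth are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, eta, lam, sigma2 = 40, 1.0, 0.01, 0.5
x = np.sort(rng.uniform(-3, 3, N))
Y = (np.sinc(x) + 0.05 * rng.standard_normal(N))[:, None]   # N x 1 outputs (Y^T stacked row-wise)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma2))  # kernel matrix
ones = np.ones((N, 1))

# [[(1/eta) K + lam I_N, 1_N], [1_N^T, 0]] [H^T; b^T] = [Y^T; 0]
A = np.block([[K / eta + lam * np.eye(N), ones],
              [ones.T, np.zeros((1, 1))]])
rhs = np.vstack([Y, np.zeros((1, 1))])
sol = np.linalg.solve(A, rhs)
Ht, b = sol[:N], sol[N:]                 # hidden features h_i (dual variables) and bias

# dual representation: y_hat(x) = (1/eta) sum_j h_j K(x_j, x) + b
x_test = np.linspace(-3, 3, 200)
K_test = np.exp(-(x_test[:, None] - x[None, :]) ** 2 / (2 * sigma2))
y_hat = K_test @ Ht / eta + b
```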
From LS-SVM to RKM representation (3)
[figure: RKM diagram with input x, feature map ϕ(x), visible units, hidden units h, error e and output y]
Note: ϕ(x) can be multi-layered; visible units: [ϕ(x); 1; −y]
Conjugate feature duality: primal and dual model representations of the model M:
(P)_RKM : ŷ = W^T ϕ(x) + b
(D)_RKM : ŷ = (1/η) ∑_j h_j K(x_j, x) + b.
(large N, small d) versus (large d, small N)
25
Kernel principal component analysis (KPCA)
[figures: toy data projected with linear PCA (left) and kernel PCA with RBF kernel (right)]
Kernel PCA [Scholkopf et al., 1998]: take the eigenvalue decomposition of the kernel matrix
[ K(x_1, x_1) ... K(x_1, x_N) ]
[     ...             ...     ]
[ K(x_N, x_1) ... K(x_N, x_N) ]
(applications in dimensionality reduction and denoising)
26
Kernel PCA: classical LS-SVM approach
• Primal problem [Suykens et al., 2002]:
min_{w,b,e} (1/2) w^T w − (1/(2γ)) ∑_{i=1}^N e_i^2   s.t. e_i = w^T ϕ(x_i) + b, i = 1, ..., N.
• Dual problem corresponds to kernel PCA:
Ω^(c) α = λ α with λ = 1/γ
with Ω^(c)_ij = (ϕ(x_i) − μ_ϕ)^T (ϕ(x_j) − μ_ϕ) the centered kernel matrix and μ_ϕ = (1/N) ∑_{i=1}^N ϕ(x_i).
• Interpretation:
1. pool of candidate components (objective function equals zero)
2. select relevant components
27
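A short sketch of this dual kernel PCA step (centering the kernel matrix and taking its eigenvalue decomposition); the toy data and RBF bandwidth are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma2, s = 200, 0.5, 2
X = rng.standard_normal((N, 2))
X[:, 1] += X[:, 0] ** 2                     # toy nonlinear structure

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-sq / (2 * sigma2))          # kernel matrix

C = np.eye(N) - np.ones((N, N)) / N
Omega_c = C @ Omega @ C                     # centered kernel matrix Omega^(c)

evals, evecs = np.linalg.eigh(Omega_c)      # eigenvalue decomposition Omega^(c) alpha = lambda alpha
idx = np.argsort(evals)[::-1][:s]           # keep the s leading components
alphas, lambdas = evecs[:, idx], evals[idx]

# projection (score) of training point i onto component k: Omega_c[i, :] @ alphas[:, k]
scores = Omega_c @ alphas
```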
From KPCA to RKM representation
Model:
e = W^T ϕ(x)
objective J = regularization term Tr(W^T W) − (1/λ) variance term ∑_i e_i^T e_i
↓ −(1/(2λ)) e^T e ≤ −e^T h + (λ/2) h^T h
RKM representation:
e = ∑_j h_j K(x_j, x)
obtain J ≤ J(h_i, W)
solution from the stationary points of J:
∂J/∂h_i = 0, ∂J/∂W = 0
28
From KPCA to RKM representation (2)
• Objective
J = (η/2) Tr(W^T W) − (1/(2λ)) ∑_{i=1}^N e_i^T e_i   s.t. e_i = W^T ϕ(x_i), ∀i
  ≤ −∑_{i=1}^N e_i^T h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)   s.t. e_i = W^T ϕ(x_i), ∀i
  = −∑_{i=1}^N ϕ(x_i)^T W h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W) ≜ J
• Stationary points of J(h_i, W):
∂J/∂h_i = 0 ⇒ W^T ϕ(x_i) = λ h_i, ∀i
∂J/∂W = 0 ⇒ W = (1/η) ∑_i ϕ(x_i) h_i^T
29
From KPCA to RKM representation (3)
• Elimination of W gives the eigenvalue decomposition:
(1/η) K H^T = H^T Λ
where H = [h_1 ... h_N] ∈ R^{s×N} and Λ = diag{λ_1, ..., λ_s} with s ≤ N
• Primal and dual model representations:
(P)_RKM : e = W^T ϕ(x)
(D)_RKM : e = (1/η) ∑_j h_j K(x_j, x).
30
Singular value decomposition
• Objective: given x_i, z_j row and column data of a (non-square) matrix
J = −(η/2) Tr(V^T W) + (1/(2λ)) ∑_{i=1}^N e_i^T e_i + (1/(2λ)) ∑_{j=1}^M r_j^T r_j   s.t. e_i = W^T ϕ(x_i), ∀i and r_j = V^T ψ(z_j), ∀j
• Primal and dual representations (relates to non-symmetric kernels):
(P)_RKM : e = W^T ϕ(x), r = V^T ψ(z)
(D)_RKM : e = (1/η) ∑_j h^r_j ψ(z_j)^T ϕ(x), r = (1/η) ∑_i h^e_i ϕ(x_i)^T ψ(z)
31
Kernel probability mass function estimation
• Objective:
J = ∑_{i=1}^N (p_i − ϕ(x_i)^T w) h_i − ∑_{i=1}^N p_i + (η/2) w^T w
• Primal and dual representations:
(P)_RKM : p_i = w^T ϕ(x_i)
(D)_RKM : p_i = (1/η) ∑_j K(x_j, x_i)
32
Deep Restricted Kernel Machines
32
Deep RKM: example
[figure: deep RKM chain x → ϕ_1(x) with (e^(1), h^(1)) → ϕ_2(h^(1)) with (e^(2), h^(2)) → ϕ_3(h^(2)) with (e^(3), h^(3)) → y]
Deep RKM: KPCA + KPCA + LSSVM
Coupling of the RKMs by taking the sum of the objectives:
J_deep = J_1 + J_2 + J_3
33
Generative kernel PCA
33
RKM objective for training and generating (1)
• RBM energy function
E(v, h; θ) = −v^T W h − c^T v − a^T h
with model parameters θ = {W, c, a}
• RKM objective function
J(v, h, W) = −v^T W h + (λ/2) h^T h + (1/2) v^T v + (η/2) Tr(W^T W)
Training: clamp v → J_train(h, W)
Generating: clamp h, W → J_gen(v)
[Schreurs & Suykens, ESANN 2018]
34
RKM objective for training and generating (2)
• Training: (clamp v)
J_train(h_i, W) = −∑_{i=1}^N v_i^T W h_i + (λ/2) ∑_{i=1}^N h_i^T h_i + (η/2) Tr(W^T W)
Stationary points:
∂J_train/∂h_i = 0 ⇒ W^T v_i = λ h_i, ∀i
∂J_train/∂W = 0 ⇒ W = (1/η) ∑_{i=1}^N v_i h_i^T
Elimination of W:
(1/η) K H^T = H^T Λ
where H = [h_1, ..., h_N] ∈ R^{s×N}, Λ = diag{λ_1, ..., λ_s} with s ≤ N the number of selected components and K_ij = v_i^T v_j the kernel matrix elements.
35
RKM objective for training and generating (3)
• Generating: (clamp h, W)
Estimate the distribution p(h) from h_i, i = 1, ..., N (or assume it normal). Obtain a new value h⋆. Generate in this way v⋆ from
J_gen(v⋆) = −v⋆^T W h⋆ + (1/2) v⋆^T v⋆
Stationary points:
∂J_gen/∂v⋆ = 0
This gives
v⋆ = W h⋆
36
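A hedged end-to-end sketch of this train/generate scheme in the linear case v_i = x_i: the hidden features come from the eigendecomposition of K_ij = v_i^T v_j, a normal distribution is fitted on the h_i, and a new sample is formed as v⋆ = W h⋆ (with the reconstruction of the next slide added as a check). The data, η and the number of components s are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, s, eta = 300, 5, 2, 1.0
X = rng.standard_normal((N, 2)) @ rng.standard_normal((2, d))   # toy data on a 2-dim subspace
X -= X.mean(axis=0)

# training: (1/eta) K H^T = H^T Lambda with K_ij = v_i^T v_j
K = X @ X.T
evals, evecs = np.linalg.eigh(K / eta)
idx = np.argsort(evals)[::-1][:s]
Ht = evecs[:, idx]                    # H^T in R^{N x s}, rows are h_i^T (unit-norm eigenvectors)
W = X.T @ Ht / eta                    # W = (1/eta) sum_i v_i h_i^T, shape d x s

# generating: fit a normal distribution on the h_i, sample h*, set v* = W h*
mu, Cov = Ht.mean(axis=0), np.cov(Ht.T)
h_star = rng.multivariate_normal(mu, Cov)
v_star = W @ h_star

# reconstruction of the training data: x_hat_i = W h_i, i.e. X_hat = (1/eta) X H^T H
X_hat = Ht @ W.T
rec_err = np.linalg.norm(X - X_hat) ** 2
```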
Dimensionality reduction and denoising: linear case
• Given training data v_i = x_i with X ∈ R^{d×N}, obtain the hidden features H ∈ R^{s×N}:
X̂ = W H = ((1/η) ∑_{i=1}^N x_i h_i^T) H = (1/η) X H^T H
• Reconstruction error: ‖X − X̂‖^2
[diagram: x_i → G(·) → h_i → F(·) → x̂_i]
37
Dimensionality reduction and denoising: nonlinear case (1)
• A new data point x⋆ is generated from h⋆ by
ϕ(x⋆) = W h⋆ = ((1/η) ∑_{i=1}^N ϕ(x_i) h_i^T) h⋆
• Multiplying both sides by ϕ(x_j)^T gives:
K(x_j, x⋆) = (1/η) (∑_{i=1}^N K(x_j, x_i) h_i^T) h⋆
On the training data:
Ω = (1/η) Ω H^T H
with H ∈ R^{s×N}, Ω_ij = K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j).
38
Dimensionality reduction and denoising: nonlinear case (2)
• Estimated value x̂ for x⋆ by the kernel smoother:
x̂ = (∑_{j=1}^S K(x_j, x⋆) x_j) / (∑_{j=1}^S K(x_j, x⋆))
with K(x_j, x⋆) (e.g. RBF kernel) the scaled similarity between 0 and 1, and a design parameter S ≤ N (the S closest points based on the similarity K(x_j, x⋆)).
[Schreurs & Suykens, ESANN 2018]
39
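A short sketch of this kernel-smoother step; the RBF kernel, bandwidth, S and the way h⋆ is chosen are assumptions, and the similarities K(x_j, x⋆) are obtained from h⋆ as on the previous slide and rescaled to [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, s, eta, sigma2, S = 200, 2, 4, 1.0, 1.0, 20
X = rng.standard_normal((N, d))

# kernel PCA training step (as on the earlier slides): hidden features H^T
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-sq / (2 * sigma2))                 # RBF kernel matrix
evals, evecs = np.linalg.eigh(Omega / eta)
Ht = evecs[:, np.argsort(evals)[::-1][:s]]         # N x s

# a new hidden state h* (here simply a slightly perturbed training feature vector)
h_star = Ht[0] + 0.01 * rng.standard_normal(s)

# implied similarities K(x_j, x*) = (1/eta) (sum_i K(x_j, x_i) h_i^T) h*
k = Omega @ Ht @ h_star / eta
k = (k - k.min()) / (k.max() - k.min())            # scale to [0, 1] as on the slide

# kernel smoother over the S most similar training points
top = np.argsort(k)[::-1][:S]
x_hat = (k[top, None] * X[top]).sum(axis=0) / k[top].sum()
```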
Example: denoising
Synthetic data sets:
[figures: two synthetic 2D data sets, axes X1 and X2]
X ∈ R2×500 (d = 2, N = 500)
Kernel PCA using RBF kernel with σ² = 1 (left: s = 2; right: s = 8)
Kernel smoother: S = 100 closest points, σ² = 0.2
40
Example: generating new data
From MNIST data:
Training data: 50 images per digit; Kernel PCA (left: s = 20; right: s = 50)
Normal distribution fitted on the h_i, used to generate h⋆
Kernel smoother: (left) S = 10 (digit 0); (right) S = 100 (digits 0, 8)
[Schreurs & Suykens, ESANN 2018]
41
Towards explainable AI
Understanding the role of the hidden units:
[figure: hidden features (H1, H2) of digits 0, 1 and 6, together with images generated while varying the hidden units h(1,1), h(1,2) and h(1,3) over their range]
[figures by Joachim Schreurs]
42
Tensor-based RKM for Multi-view KPCA
min ⟨W, W⟩ − ∑_{i=1}^N ⟨Φ^(i), W⟩ h_i + λ ∑_{i=1}^N h_i^2 with Φ^(i) = ϕ^[1](x_i^[1]) ⊗ ... ⊗ ϕ^[V](x_i^[V])
[Houthuys & Suykens, ICANN 2018]
43
Generative RKM (1)
The objective
J_train(h_i, V, U) = ∑_{i=1}^N (−ϕ_1(x_i)^T V h_i − ϕ_2(y_i)^T U h_i + (λ_i/2) h_i^T h_i) + (η_1/2) Tr(V^T V) + (η_2/2) Tr(U^T U)
results for training in the eigenvalue problem
((1/η_1) K_1 + (1/η_2) K_2) H^T = H^T Λ
with H = [h_1 ... h_N] and kernel matrices K_1, K_2 related to ϕ_1, ϕ_2.
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
44
Generative RKM (2)
Generating data is based on a newly generated h⋆ and the objective
J_generate(ϕ_1(x⋆), ϕ_2(y⋆)) = −ϕ_1(x⋆)^T V h⋆ − ϕ_2(y⋆)^T U h⋆ + (1/2) ϕ_1(x⋆)^T ϕ_1(x⋆) + (1/2) ϕ_2(y⋆)^T ϕ_2(y⋆)
giving
ϕ_1(x⋆) = (1/η_1) ∑_{i=1}^N ϕ_1(x_i) h_i^T h⋆,   ϕ_2(y⋆) = (1/η_2) ∑_{i=1}^N ϕ_2(y_i) h_i^T h⋆.
For generating x, y one can either work with the kernel smoother or with an explicit feature map using a feedforward neural network.
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
45
Generative RKM (3)
Train:
Generate:
46
Generative RKM (4)
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
47
Generative RKM (5)
Figure: Image generation using neural networks as feature map:
(left) MNIST; (right) Small-NORB
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
48
Generative RKM (6)
Figure: Targeted image generation through corresponding latent variable.
[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
49
Conclusions
• From RBM to deep BM
• From RKM to deep RKM
• RKM and RBM representation: visible and hidden units
• RKM representation for LS-SVM, KPCA, SVD and others
• RKM representation obtained by conjugate feature duality
• Generative RKM
50
Acknowledgements (1)
• Current and former co-workers at ESAT-STADIUS:
C. Alzate, Y. Chen, J. De Brabanter, K. De Brabanter, L. De Lathauwer, H. De Meulemeester, B. De Moor, H. De Plaen, Ph. Dreesen, M. Espinoza, T. Falck, M. Fanuel, Y. Feng, B. Gauthier, X. Huang, L. Houthuys, V. Jumutc, Z. Karevan, R. Langone, R. Mall, S. Mehrkanoon, G. Nisol, M. Orchel, A. Pandey, K. Pelckmans, S. RoyChowdhury, S. Salzo, J. Schreurs, M. Signoretto, Q. Tao, J. Vandewalle, T. Van Gestel, S. Van Huffel, C. Varon, Y. Yang, and others
• Many other people for joint work, discussions, invitations, organizations
• Support from ERC AdG E-DUALITY, ERC AdG A-DATADRIVE-B, KU Leuven, OPTEC, IUAP DYSCO, FWO projects, IWT, iMinds, BIL, COST
51
Acknowledgements (2)
52
Acknowledgements (3)
NEW: ERC Advanced Grant E-DUALITY
Exploring duality for future data-driven modelling
53
Thank you
54