
Deep Learning, Neural Networks and Kernel Machines: towards a unifying framework

Johan Suykens

KU Leuven, ESAT-STADIUS
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Email: [email protected]
http://www.esat.kuleuven.be/stadius/

AI Seminar at BeCentral Brussels, Oct 2019


Outline

• Introduction

• Function estimation, model representations, duality

• Neural networks and kernel machines

• Application examples, large scale methods

• Robustness

• Generative models: GAN, RBM, Deep BM

• Restricted kernel machines (RKM), Gen-RKM, and deep learning

• Explainability

• Recent developments


Introduction


Self-driving cars and neural networks

in the early days of neural networks:

ALVINN (Autonomous Land Vehicle In a Neural Network)

[Pomerleau, Neural Computation 1991]


Self-driving cars and deep learning

(27 million connections)

from: [selfdrivingcars.mit.edu (Lex Fridman et al.), 2017]


Convolutional neural networks

[LeCun et al., Proc. IEEE 1998]

Further advanced architectures:

AlexNet (2012): 5 convolutional layers, 3 fully connected
VGGNet (2014): 19 layers
GoogLeNet (2014): 22 layers
ResNet (2015): 152 layers


Historical context

1943 McCulloch & Pitts: mathematical model for the neuron
1958 Rosenblatt: perceptron learning
1960 Widrow & Hoff: Adaline and the LMS learning rule
1969 Minsky & Papert: limitations of the perceptron

1986 Rumelhart et al.: error backpropagation for neural networks
→ booming of neural network universal approximators

1992 Vapnik et al.: support vector machine classifiers
→ convex optimization, kernel machines

1998 LeCun et al.: convolutional neural networks
2006 Hinton et al.: deep belief networks
2010 Bengio et al.: stacked autoencoders

→ booming of deep neural networks

(figure: timeline, with growing computing power)


Different paradigms

• Deep Learning
• Neural Networks
• SVM, LS-SVM & Kernel methods

→ new synergies?


Towards a unifying picture

(diagram) Model, with a primal representation and a dual representation, linked by a duality principle:

• Primal representation: parametric (linear, polynomial, finite or infinite dictionary, (deep) neural network, other)
• Dual representation: kernel-based (positive definite kernel, tensor kernel, indefinite kernel, symmetric or non-symmetric kernel, other)
• Duality principle: Lagrange duality, Legendre-Fenchel duality, conjugate feature duality, other

[Suykens 2017]


(diagram) multi-scale, multiplex, data fusion, ensemble, deep, multi-task

[Suykens 2017]


Function estimation and model representations


Linear function estimation (1)

• Given $(x_i, y_i)_{i=1}^N$ with $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$, consider $\hat{y} = f(x)$ where $f$ is parametrized as

$$\hat{y} = w^T x + b$$

with $\hat{y}$ the estimated output of the linear model.

• Consider estimating $w, b$ by

$$\min_{w,b} \; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} (y_i - w^T x_i - b)^2$$

→ one can directly solve in $w, b$


Linear function estimation (2)

• ... or write it as a constrained optimization problem:

$$\min_{w,b,e} \; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_i e_i^2 \quad \text{subject to} \quad e_i = y_i - w^T x_i - b, \; i = 1, ..., N$$

Lagrangian: $\mathcal{L}(w, b, e_i, \alpha_i) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_i e_i^2 - \sum_i \alpha_i (e_i - y_i + w^T x_i + b)$

• From the optimality conditions:

$$\hat{y} = \sum_i \alpha_i x_i^T x + b$$

where $\alpha, b$ follow from solving the linear system

$$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}$$

with $\Omega_{ij} = x_i^T x_j$ for $i, j = 1, ..., N$ and $y = [y_1; ...; y_N]$.
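To make the primal/dual equivalence concrete, here is a minimal numpy sketch (not from the slides; the data, $\gamma$ and dimensions are illustrative assumptions) that solves the linear model directly in $(w, b)$ and through the dual linear system in $(\alpha, b)$, and checks that both give the same prediction.

```python
# Minimal numpy sketch (illustrative data and gamma, not from the slides):
# solve the linear model in the primal (directly in w, b) and in the dual
# (via the linear system in alpha, b) and check that predictions coincide.
import numpy as np

rng = np.random.default_rng(0)
N, d, gamma = 50, 3, 10.0
X = rng.normal(size=(N, d))                                   # rows x_i
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=N)

# Primal: minimize (1/2) w'w + (gamma/2) sum_i (y_i - w'x_i - b)^2
Xb = np.hstack([X, np.ones((N, 1))])                          # [X, 1] for (w, b)
A = gamma * Xb.T @ Xb + np.diag(np.r_[np.ones(d), 0.0])       # no penalty on b
w_b = np.linalg.solve(A, gamma * Xb.T @ y)
w, b = w_b[:d], w_b[d]

# Dual: [0, 1_N'; 1_N, Omega + I/gamma] [b; alpha] = [0; y], Omega_ij = x_i'x_j
Omega = X @ X.T
M = np.block([[np.zeros((1, 1)), np.ones((1, N))],
              [np.ones((N, 1)),  Omega + np.eye(N) / gamma]])
sol = np.linalg.solve(M, np.r_[0.0, y])
b_dual, alpha = sol[0], sol[1:]

x_new = rng.normal(size=d)
print(w @ x_new + b, alpha @ (X @ x_new) + b_dual)            # should agree
```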


Linear model: solving in primal or dual?

inputs $x \in \mathbb{R}^d$, output $y \in \mathbb{R}$, training set $(x_i, y_i)_{i=1}^N$

$$\text{Model} \;
\begin{cases}
(P): \; \hat{y} = w^T x + b, & w \in \mathbb{R}^d \\
(D): \; \hat{y} = \sum_i \alpha_i x_i^T x + b, & \alpha \in \mathbb{R}^N
\end{cases}$$


Linear model: solving in primal or dual?

few inputs, many data points: d≪ N

primal : w ∈ Rd

dual: α ∈ RN (large kernel matrix: N ×N)


Linear model: solving in primal or dual?

many inputs, few data points: d≫ N

primal: w ∈ Rd

dual : α ∈ RN (small kernel matrix: N ×N)


Feature map and kernel

From linear to nonlinear model:

$$\text{Model} \;
\begin{cases}
(P): \; \hat{y} = w^T \varphi(x) + b \\
(D): \; \hat{y} = \sum_i \alpha_i K(x_i, x) + b
\end{cases}$$

Mercer theorem: $K(x, z) = \varphi(x)^T \varphi(z)$

Feature map $\varphi$, kernel function $K(x, z)$ (e.g. linear, polynomial, RBF, ...)

• SVMs: feature map and positive definite kernel [Cortes & Vapnik, 1995]

• Explicit or implicit choice of the feature map

• Neural networks: consider hidden layer as feature map [Suykens & Vandewalle, 1999]

• Least squares support vector machines [Suykens et al., 2002]: L2 loss and regularization
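As a small illustration of the dual model representation, the sketch below fits an LS-SVM regression with an RBF kernel by solving the same kind of linear system as on the previous slides, now with $\Omega_{ij} = K(x_i, x_j)$; the data and the values of $\gamma$ and $\sigma^2$ are illustrative assumptions.

```python
# Hedged sketch: LS-SVM regression in the dual with an RBF kernel.
# Data and the values of gamma and sigma2 are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, sigma2):
    # K(x, z) = exp(-||x - z||^2 / sigma2) for all pairs of rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

rng = np.random.default_rng(1)
N = 60
X = np.sort(rng.uniform(-3, 3, size=(N, 1)), axis=0)
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=N)

gamma, sigma2 = 100.0, 0.5
Omega = rbf_kernel(X, X, sigma2)
A = np.block([[np.zeros((1, 1)), np.ones((1, N))],
              [np.ones((N, 1)),  Omega + np.eye(N) / gamma]])
sol = np.linalg.solve(A, np.r_[0.0, y])
b, alpha = sol[0], sol[1:]

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
y_hat = rbf_kernel(X_test, X, sigma2) @ alpha + b    # y(x) = sum_i alpha_i K(x_i, x) + b
print(y_hat)
```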


Least Squares Support Vector Machines: “core models”

• Regression

$$\min_{w,b,e} \; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$

• Classification

$$\min_{w,b,e} \; w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i (w^T \varphi(x_i) + b) = 1 - e_i, \; \forall i$$

• Kernel PCA ($V = I$), kernel spectral clustering ($V = D^{-1}$)

$$\min_{w,b,e} \; -w^T w + \gamma \sum_i v_i e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; \forall i$$

• Kernel canonical correlation analysis / partial least squares

$$\min_{w,v,b,d,e,r} \; w^T w + v^T v + \nu \sum_i (e_i - r_i)^2 \quad \text{s.t.} \quad
\begin{cases} e_i = w^T \varphi^{(1)}(x_i) + b \\ r_i = v^T \varphi^{(2)}(y_i) + d \end{cases}$$

[Suykens & Vandewalle, 1999; Suykens et al., 2002; Alzate & Suykens, 2010]


Sparsity: through regularization or loss function

• through regularization: model $\hat{y} = w^T x + b$

$$\min \; \sum_j |w_j| + \gamma \sum_i e_i^2 \quad \Rightarrow \; \text{sparse } w \text{ (e.g. Lasso)}$$

• through the loss function: model $\hat{y} = \sum_i \alpha_i K(x, x_i) + b$

$$\min \; w^T w + \gamma \sum_i L(e_i) \quad \Rightarrow \; \text{sparse } \alpha \text{ (e.g. SVM)}$$

(figure: $\epsilon$-insensitive loss, zero on $[-\epsilon, +\epsilon]$)


SVMs and neural networks

(figure: input space with two classes, mapped by $\varphi(x)$ to a feature space)

Primal space (parametric): $\hat{y} = \text{sign}[w^T \varphi(x) + b]$
(network with hidden-layer units $\varphi_1(x), ..., \varphi_{n_h}(x)$ and weights $w_1, ..., w_{n_h}$)

Dual space (non-parametric): $\hat{y} = \text{sign}[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b]$
(network with units $K(x, x_1), ..., K(x, x_{\#sv})$ and weights $\alpha_1, ..., \alpha_{\#sv}$)

with $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ (Mercer, "kernel trick")

[Suykens et al., 2002]


Wider use of the “kernel trick”

• Angle between vectors (e.g. correlation analysis):

Input space: $\cos\theta_{xz} = \dfrac{x^T z}{\|x\|_2 \|z\|_2}$

Feature space: $\cos\theta_{\varphi(x),\varphi(z)} = \dfrac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \dfrac{K(x,z)}{\sqrt{K(x,x)}\sqrt{K(z,z)}}$

• Distance between vectors (e.g. for "kernelized" clustering methods):

Input space: $\|x - z\|_2^2 = (x - z)^T (x - z) = x^T x + z^T z - 2 x^T z$

Feature space: $\|\varphi(x) - \varphi(z)\|_2^2 = K(x,x) + K(z,z) - 2 K(x,z)$
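A short numpy check of the above (the RBF kernel and the two points are arbitrary, illustrative choices): angles and squared distances in feature space are obtained purely from kernel evaluations.

```python
# Illustrative sketch (assumed RBF kernel, arbitrary points): feature-space
# cosine and squared distance computed only through kernel evaluations.
import numpy as np

def K(x, z, sigma2=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma2)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
cos_feature_space = K(x, z) / (np.sqrt(K(x, x)) * np.sqrt(K(z, z)))
dist2_feature_space = K(x, x) + K(z, z) - 2 * K(x, z)
print(cos_feature_space, dist2_feature_space)
```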


Interpretation of kernel-based models

Decision making: classification problem (e.g. apples versus tomatoes). Input data $x_i \in \mathbb{R}^d$ and class labels $y_i \in \{-1, +1\}$, with $N$ training data.

SVM or LS-SVM classifier: given a new $x$, obtain

$$\hat{y} = \text{sign}\Big[\sum_i \alpha_i y_i K(x, x_i) + b\Big]$$

with the training data $x_i$, $i = 1, ..., N$.

Here $K(x, x_i)$ characterizes the similarity between $x$ and $x_i$. The bias term $b$ can be related to prior class probabilities.


Function estimation in RKHS

• Find a function $f$ such that [Wahba, 1990; Evgeniou et al., 2000]

$$\min_{f \in \mathcal{H}_K} \; \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda \|f\|_K^2$$

with $L(\cdot,\cdot)$ the loss function and $\|f\|_K$ the norm in the RKHS $\mathcal{H}_K$ defined by $K$.

• Representer theorem: for a convex loss function, the solution is of the form

$$f(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i)$$

Reproducing property: $f(x) = \langle f, K_x \rangle_K$ with $K_x(\cdot) = K(x, \cdot)$

• Sparse representation by the hinge and $\epsilon$-insensitive loss [Vapnik, 1998]


Kernels

Wide range of positive definite kernel functions possible:

- linear $K(x, z) = x^T z$
- polynomial $K(x, z) = (\eta + x^T z)^d$
- radial basis function $K(x, z) = \exp(-\|x - z\|_2^2 / \sigma^2)$
- splines
- wavelets
- string kernel
- kernels from graphical models
- kernels for dynamical systems
- Fisher kernels
- graph kernels
- data fusion kernels
- additive kernels (good for explainability)
- other

[Scholkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004; Jebara et al., 2004; other]


Krein spaces: indefinite kernels

• LS-SVM for the indefinite kernel case:

$$\min_{w_+, w_-, b, e} \; \frac{1}{2}(w_+^T w_+ - w_-^T w_-) + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad y_i = w_+^T \varphi_+(x_i) + w_-^T \varphi_-(x_i) + b + e_i, \; \forall i$$

and indefinite kernel $K(x_i, x_j) = K_+(x_i, x_j) - K_-(x_i, x_j)$ with positive definite kernels $K_+, K_-$:

$$K_+(x_i, x_j) = \varphi_+(x_i)^T \varphi_+(x_j) \quad \text{and} \quad K_-(x_i, x_j) = \varphi_-(x_i)^T \varphi_-(x_j)$$

• also: KPCA with an indefinite kernel [X. Huang et al., 2017], KSC and semi-supervised learning [Mehrkanoon et al., 2018]

[X. Huang, Maier, Hornegger, Suykens, ACHA 2017]
[Mehrkanoon, X. Huang, Suykens, Pattern Recognition, 2018]

Related work on RKKS: [Ong et al., 2004; Haasdonk, 2005; Luss, 2008; Loosli et al., 2015]


Banach spaces: tensor kernels

• Regression problem:

$$\min_{(w,b,e) \in \ell_r(K) \times \mathbb{R} \times \mathbb{R}^N} \; \rho(\|w\|_r) + \frac{\gamma}{N} \sum_{i=1}^{N} L(e_i) \quad \text{subject to} \quad y_i = \langle w, \varphi(x_i) \rangle + b + e_i, \; \forall i = 1, ..., N$$

with $r = \frac{m}{m-1}$ for even $m \geq 2$, and $\rho$ convex and even.

For $m$ large this approaches $\ell_1$ regularization.

• Tensor-kernel representation

$$\hat{y} = \langle w, \varphi(x) \rangle_{r,r^*} + b = \frac{1}{N^{m-1}} \sum_{i_1, ..., i_{m-1}=1}^{N} u_{i_1} \cdots u_{i_{m-1}} K(x_{i_1}, ..., x_{i_{m-1}}, x) + b$$

[Salzo & Suykens, arXiv:1603.05876; Salzo, Suykens, Rosasco, AISTATS 2018]

related: RKBS [Zhang, 2013; Fasshauer et al., 2015]


Generalization, deep learning and kernel methods

Recently one has observed in deep learning that over-parametrized neural networks, which would be expected to "overfit", may still perform well on test data. This phenomenon is currently not yet fully understood. A number of researchers have argued that understanding kernel methods in this context is important for understanding the generalization performance.

Related references:

• Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals, Understanding deep learning requires rethinking generalization, 2016, arXiv:1611.03530

• Amit Daniely, SGD Learns the Conjugate Kernel Class of the Network, 2017, arXiv:1702.08503

• Arthur Jacot, Franck Gabriel, Clement Hongler, Neural Tangent Kernel: Convergence and Generalization in Neural Networks, 2018, arXiv:1806.07572

• Tengyuan Liang, Alexander Rakhlin, Just Interpolate: Kernel "Ridgeless" Regression Can Generalize, 2018, arXiv:1808.00387

• Mikhail Belkin, Siyuan Ma, Soumik Mandal, To understand deep learning we need to understand kernel learning, 2018, arXiv:1802.01396


Generalization and deep learning - Double U curve

Figure: Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal, Reconciling modern machine learning and the bias-variance trade-off, 2018, arXiv:1812.11118


Example: Black-box weather forecasting (1)

Weather data: 350 stations located in the US

Features: Tmax, Tmin, precipitation, wind speed, wind direction, ...

Black-box forecasting of multiple weather stations simultaneously

[Signoretto, Frandi, Karevan, Suykens, IEEE-SCCI, 2014]


Black-box weather forecasting

• Black-box weather forecasting: prediction of the temperature in Brussels

• Multi-view learning:
- Multi-view LS-SVM regression [Houthuys, Karevan, Suykens, IJCNN 2017]
- Multi-view deep neural networks [Karevan, Houthuys, Suykens, ICANN 2018]


Multi-view learning: kernel-based (1)

• Primal problem:

$$\min_{w^{[v]}, e^{[v]}} \; \frac{1}{2} \sum_{v=1}^{V} w^{[v]T} w^{[v]} + \frac{1}{2} \sum_{v=1}^{V} \gamma^{[v]} e^{[v]T} e^{[v]} + \rho \sum_{v,u=1; v \neq u}^{V} e^{[v]T} e^{[u]}$$

$$\text{subject to} \quad y = \Phi^{[v]} w^{[v]} + b^{[v]} 1_N + e^{[v]}, \quad v = 1, ..., V$$

• Dual:

$$\begin{bmatrix} 0_{V \times V} & 1_M^T \Gamma_M \\ 1_M + \rho\, I_M 1_M & \Gamma_M \Omega_M + I_{NV} + \rho\, I_M \Omega_M \end{bmatrix} \begin{bmatrix} b_M \\ \alpha_M \end{bmatrix} = \begin{bmatrix} 0_V \\ \Gamma_M y_M + (V-1)\rho\, y_M \end{bmatrix}$$

• Prediction:

$$\hat{y}(x) = \sum_{v=1}^{V} \beta_v \left[ \sum_{k=1}^{N} \alpha_k^{[v]} K^{[v]}(x^{[v]}, x_k^{[v]}) + b^{[v]} \right]$$

[Houthuys et al., 2017]


Multi-view learning: kernel-based (2)

• Data set:

– Real measurements for weather elements such as temperature, humidity, etc.

– From 2007 until mid 2014

– Two test sets:

- mid-November 2013 until mid-December 2013

- from mid-April 2014 to mid-May 2014

• Goal: forecasting minimum and maximum temperature for one to six days ahead in Brussels, Belgium

• Views: Brussels together with 9 neighboring cities

• Tuning parameters:

- kernel parameters for each view

- regularization parameters γ[v]

- coupling parameter ρ


Multi-view learning: kernel-based (3)

(figure: results for the Apr/May and Nov/Dec test sets)


Multi-view learning: deep neural network (1)

• Primal formulation of multi-view LS-SVM:

$$\min_{w^{[v]}, e^{[v]}, b^{[v]}} \; \frac{1}{2} \sum_{v=1}^{V} w^{[v]T} w^{[v]} + \frac{1}{2} \sum_{v=1}^{V} \gamma^{[v]} e^{[v]T} e^{[v]} + \rho \sum_{v,u=1; v \neq u}^{V} e^{[v]T} e^{[u]}$$

$$\text{subject to} \quad y = \Phi^{[v]} w^{[v]} + b^{[v]} 1_N + e^{[v]} \quad \text{for } v = 1, ..., V$$

• Weighted multi-view approach:

$$\min_{w^{[v]}, e^{[v]}} \; \frac{1}{2} \sum_{v=1}^{V} s^{[v]} \left( w^{[v]T} w^{[v]} + \gamma^{[v]} e^{[v]T} e^{[v]} \right) + \sum_{v,u=1; v \neq u}^{V} \rho^{[v,u]} \sqrt{s^{[v]}} \sqrt{s^{[u]}} \, e^{[v]T} e^{[u]}$$

– $s^{[v]}$: weight of view $v$ (can be manually determined by an expert, or calculated during a pre-processing step)
– $\rho^{[v,u]}$: coupling parameter for the pairwise combination of views
– $0 \leq \rho^{[v,u]} \leq \min\{\gamma^{[v]}, \gamma^{[u]}\}$

[Karevan et al., 2018]


Multi-view learning: deep neural network (2)

• Weather forecasting is a time series prediction problem → consider each delay as a view

• Consider 5 views (i.e. the delay is considered to be 5)

• Tuning parameters: regularization parameter $\gamma^{[v]}$ and number of neurons for each view, and coupling parameter $\rho^{[v,u]}$ for each pair of views

• The weight of each view is defined based on its error on the validation set:

$$s^{[v]} = \exp(-\text{mse}_{\text{val}}^{[v]})$$

• Forecasting minimum and maximum temperature for one to six days ahead in Brussels, Belgium


Fixed-size kernel methods for large scale data


Nystrom method

• "big" kernel matrix: $\Omega_{(N,N)} \in \mathbb{R}^{N \times N}$
  "small" kernel matrix: $\Omega_{(M,M)} \in \mathbb{R}^{M \times M}$ (on a subset)

• Eigenvalue decompositions: $\Omega_{(N,N)} U = U \Lambda$ and $\Omega_{(M,M)} \tilde{U} = \tilde{U} \tilde{\Lambda}$

• Relation to the eigenvalues and eigenfunctions of the integral equation

$$\int K(x, x') \phi_i(x) p(x) \, dx = \lambda_i \phi_i(x')$$

with

$$\lambda_i = \frac{1}{M} \tilde{\lambda}_i, \quad \phi_i(x_k) = \sqrt{M}\, \tilde{u}_{ki}, \quad \phi_i(x') = \frac{\sqrt{M}}{\tilde{\lambda}_i} \sum_{k=1}^{M} \tilde{u}_{ki} K(x_k, x')$$

[Williams & Seeger, 2001] (Nystrom method in GP)


Fixed-size method: estimation in primal

• For the feature map $\varphi(\cdot): \mathbb{R}^d \to \mathbb{R}^h$, obtain an approximation

$$\hat{\varphi}(\cdot): \mathbb{R}^d \to \mathbb{R}^M$$

based on the eigenvalue decomposition of the kernel matrix, with $\hat{\varphi}_i(x') = \sqrt{\tilde{\lambda}_i}\, \hat{\phi}_i(x')$ (on a subset of size $M \ll N$).

• Estimate in the primal:

$$\min_{\tilde{w}, b} \; \frac{1}{2} \tilde{w}^T \tilde{w} + \gamma \frac{1}{2} \sum_{i=1}^{N} (y_i - \tilde{w}^T \hat{\varphi}(x_i) - b)^2$$

A sparse representation is obtained: $\tilde{w} \in \mathbb{R}^M$ with $M \ll N$ and $M \ll h$.

[Suykens et al., 2002; De Brabanter et al., CSDA 2010]
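Below is a hedged numpy sketch of these two steps: build an approximate (Nystrom) feature map from the eigendecomposition of a small M x M kernel matrix on a subset, then estimate the model in the primal. The random subset selection, the RBF kernel and all parameter values are illustrative assumptions, not the authors' settings.

```python
# Hedged sketch of the fixed-size idea: Nystrom feature map from a small
# M x M kernel matrix on a subset, then ridge-type estimation in the primal.
import numpy as np

def rbf(A, B, sigma2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

rng = np.random.default_rng(2)
N, M, d, gamma, sigma2 = 500, 30, 2, 10.0, 1.0
X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)

sub = rng.choice(N, size=M, replace=False)            # here: a random subset
lam, U = np.linalg.eigh(rbf(X[sub], X[sub], sigma2))  # small eigendecomposition
lam, U = lam[::-1], U[:, ::-1]                        # largest eigenvalues first

def phi_hat(Z):
    # Nystrom features: phi_hat(z) = Lambda^{-1/2} U' [K(x_1, z), ..., K(x_M, z)]'
    return rbf(Z, X[sub], sigma2) @ U / np.sqrt(np.clip(lam, 1e-12, None))

Phi = np.hstack([phi_hat(X), np.ones((N, 1))])        # append a column for b
A = gamma * Phi.T @ Phi + np.diag(np.r_[np.ones(M), 0.0])
w_b = np.linalg.solve(A, gamma * Phi.T @ y)           # w in R^M, plus bias b
print(w_b.shape)
```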


Random Fourier Features

• Proposed by [Rahimi & Recht, 2007].

• It requires a positive definite shift-invariant kernel $K(x, y) = K(x - y)$. One obtains a randomized feature map $z(x): \mathbb{R}^d \to \mathbb{R}^{2D}$ so that

$$z(x)^T z(y) \simeq K(x - y).$$

• Compute the Fourier transform $p$ of the kernel $K$:

$$p(\omega) = \frac{1}{2\pi} \int \exp(-j \omega^T \Delta) K(\Delta) \, d\Delta$$

Draw $D$ iid samples $\omega_1, ..., \omega_D \in \mathbb{R}^d$ from $p$.

Obtain $z(x) = \sqrt{\tfrac{1}{D}} \, [\cos(\omega_1^T x) \, ... \, \cos(\omega_D^T x) \; \sin(\omega_1^T x) \, ... \, \sin(\omega_D^T x)]^T$.
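A small numpy sketch of this recipe for the RBF kernel $K(x - y) = \exp(-\|x - y\|^2 / (2 s^2))$, whose spectral density is Gaussian; the values of $D$, $s$ and the test points are illustrative assumptions.

```python
# Hedged sketch of Random Fourier Features for the RBF kernel
# K(x - y) = exp(-||x - y||^2 / (2 s^2)); p(omega) is then Gaussian with
# standard deviation 1/s per component, and z(x)'z(y) approximates K(x - y).
import numpy as np

rng = np.random.default_rng(3)
d, D, s = 5, 2000, 1.0
omega = rng.normal(scale=1.0 / s, size=(D, d))       # D iid samples from p(omega)

def z(x):
    proj = omega @ x
    return np.sqrt(1.0 / D) * np.r_[np.cos(proj), np.sin(proj)]

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-np.sum((x - y) ** 2) / (2 * s ** 2))
print(exact, z(x) @ z(y))                            # close for large D
```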


Deep neural-kernel networks using random Fourier features

Use of Random Fourier Features [Rahimi & Recht, NIPS 2007] to obtain an approximation to the feature map in a deep architecture

[Mehrkanoon & Suykens, Neurocomputing 2018]


Example: electricity load forecasting

(figure: actual normalized load versus hour, panels (a)-(d), comparing the Fixed-size LS-SVM and a linear ARX model for 1-hour ahead and 24-hours ahead prediction)

[Espinoza, Suykens, Belmans, De Moor, IEEE CSM 2007]


Robustness


Outliers and robustness

(figure: regression data with outliers; does the estimate break down?)

Robust statistics: bounded derivative of the loss function, bounded kernel

[Huber, 1981; Hampel et al., 1986; Rousseeuw & Leroy, 1987]


Weighted versions and robustness

(diagram: a convex cost function, via convex optimization, gives the SVM solution; a weighted version with a modified cost function, motivated by robust statistics, gives the weighted LS-SVM solution)

• Weighted LS-SVM:

$$\min_{w,b,e} \; \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^{N} v_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$

with $v_i$ determined from $\{e_i\}_{i=1}^N$ of the unweighted LS-SVM [Suykens et al., 2002].

Robustness and stability [Debruyne et al., JMLR 2008, 2010].

• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]
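A hedged numpy sketch of this two-stage idea: solve the unweighted LS-SVM, derive weights $v_i$ from the residuals $e_i = \alpha_i / \gamma$, and re-solve the weighted system. The particular weighting rule below (Huber-type clipping on MAD-standardized residuals) and all data and parameters are illustrative assumptions.

```python
# Hedged sketch of weighted LS-SVM: unweighted fit, then down-weight points
# with large residuals and re-fit. The weighting rule below is one common
# robust choice (Huber-type), used here as an illustrative assumption.
import numpy as np

def rbf(A, B, sigma2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

def lssvm_dual(Omega, y, gamma, v=None):
    # Solve [0, 1'; 1, Omega + diag(1/(gamma*v_i))] [b; alpha] = [0; y]
    N = len(y)
    v = np.ones(N) if v is None else v
    A = np.block([[np.zeros((1, 1)), np.ones((1, N))],
                  [np.ones((N, 1)),  Omega + np.diag(1.0 / (gamma * v))]])
    sol = np.linalg.solve(A, np.r_[0.0, y])
    return sol[0], sol[1:]

rng = np.random.default_rng(4)
X = np.linspace(-5, 5, 80).reshape(-1, 1)
y = np.sinc(X[:, 0]) + 0.1 * rng.normal(size=80)
y[::10] += 3.0                                      # inject some outliers

gamma, sigma2 = 50.0, 1.0
Omega = rbf(X, X, sigma2)
b0, alpha0 = lssvm_dual(Omega, y, gamma)            # unweighted LS-SVM
e = alpha0 / gamma                                  # residuals: e_i = alpha_i / gamma
s = 1.483 * np.median(np.abs(e - np.median(e)))     # robust scale estimate (MAD)
v = np.clip(2.5 / (np.abs(e / s) + 1e-12), None, 1.0)   # weights in (0, 1]
b_w, alpha_w = lssvm_dual(Omega, y, gamma, v)       # weighted re-fit
```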


Example: robust regression using weighted LS-SVM

(figure: function estimation using LS-SVM with RBF kernel on data containing outliers, for two hyperparameter settings: γ = 0.14185, σ² = 0.047615 and γ = 95025.4538, σ² = 0.66686; each panel shows the LS-SVM estimate, the data, and the true function)

using LS-SVMlab v1.8, http://www.esat.kuleuven.be/sista/lssvmlab/


Generative models


Generative Adversarial Network (GAN)

Generative Adversarial Network (GAN) [Goodfellow et al., 2014]

Training of two competing models in a zero-sum game:
(Generator) generate fake output examples from random noise
(Discriminator) discriminate between fake examples and real examples

source: https://deeplearning4j.org/generative-adversarial-network


GAN: example on MNIST

MNIST training data:

GAN generated examples:

source: https://www.kdnuggets.com/2016/07/mnist-generative-adversarial-model-keras.html


Restricted Boltzmann Machines (RBM)

• Markov random field, bipartite graph, stochastic binary units
Layer of visible units $v$ and layer of hidden units $h$
No hidden-to-hidden connections

• Energy:

$$E(v, h; \theta) = -v^T W h - c^T v - a^T h \quad \text{with} \quad \theta = \{W, c, a\}$$

Joint distribution:

$$P(v, h; \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta))$$

with partition function $Z(\theta) = \sum_v \sum_h \exp(-E(v, h; \theta))$

[Hinton, Osindero, Teh, Neural Computation 2006]


RBM and deep learning

RBM: $p(v, h)$ → deep architecture: $p(v, h^1, h^2, h^3, ...)$

[Hinton et al., 2006; Salakhutdinov, 2015]


in other words ...

"sandwich": $E = -v^T W h$

"deep sandwich": $E = -v^T W^1 h^1 - h^{1T} W^2 h^2 - h^{2T} W^3 h^3$


RBM: example on MNIST

MNIST training data:

Generating new images:

source: https://www.kaggle.com/nicw102168/restricted-boltzmann-machine-rbm-on-mnist


Convolutional Deep Belief Networks

Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks [Lee et al., 2011]


Restricted kernel machines


Restricted Kernel Machines (RKM)

• Kernel machine interpretations in terms of visible and hidden units (similar to Restricted Boltzmann Machines (RBM))

• Restricted Kernel Machine (RKM) representations for
– LS-SVM regression/classification
– Kernel PCA
– Matrix SVD
– Parzen-type models
– other

• Based on the principle of conjugate feature duality (with hidden features corresponding to dual variables)

• Deep Restricted Kernel Machines (Deep RKM)

[Suykens, Neural Computation, 2017]


Kernel principal component analysis (KPCA)

(figure: toy data set, linear PCA vs. kernel PCA with RBF kernel)

Kernel PCA [Scholkopf et al., 1998]: take the eigenvalue decomposition of the kernel matrix

$$\begin{bmatrix} K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \cdots & K(x_N, x_N) \end{bmatrix}$$

(applications in dimensionality reduction and denoising)


Kernel PCA: classical LS-SVM approach

• Primal problem [Suykens et al., 2002]: model-based approach

$$\min_{w,b,e} \; \frac{1}{2} w^T w - \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, ..., N.$$

• The dual problem corresponds to kernel PCA:

$$\Omega^{(c)} \alpha = \lambda \alpha \quad \text{with} \quad \lambda = 1/\gamma$$

with $\Omega^{(c)}_{ij} = (\varphi(x_i) - \hat{\mu}_\varphi)^T (\varphi(x_j) - \hat{\mu}_\varphi)$ the centered kernel matrix and $\hat{\mu}_\varphi = (1/N) \sum_{i=1}^{N} \varphi(x_i)$.

• Interpretation:
1. pool of candidate components (objective function equals zero)
2. select relevant components
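A minimal numpy sketch of this dual step (illustrative ring-shaped data and RBF kernel): center the kernel matrix, take its eigendecomposition, and project onto the leading components.

```python
# Minimal sketch (illustrative data and kernel) of kernel PCA as on this
# slide: eigendecomposition of the centered kernel matrix.
import numpy as np

def rbf(A, B, sigma2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

rng = np.random.default_rng(5)
N = 100
t = rng.uniform(0, 2 * np.pi, N)
X = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(N, 2))   # ring data

Omega = rbf(X, X, sigma2=0.5)
C = np.eye(N) - np.ones((N, N)) / N                 # centering matrix
Omega_c = C @ Omega @ C                             # (phi(x_i)-mu)'(phi(x_j)-mu)
lam, alpha = np.linalg.eigh(Omega_c)
lam, alpha = lam[::-1], alpha[:, ::-1]              # sort components by eigenvalue

# projections of the training points onto the first two kernel PCs
scores = Omega_c @ alpha[:, :2] / np.sqrt(np.clip(lam[:2], 1e-12, None))
print(scores.shape)
```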


From KPCA to RKM representation (1)

Model: $e = W^T \varphi(x)$
objective $J$ = regularization term $\mathrm{Tr}(W^T W)$ $-\,(1/\lambda)\,$ variance term $\sum_i e_i^T e_i$

↓ use the property $e^T h \leq \frac{1}{2\lambda} e^T e + \frac{\lambda}{2} h^T h$

RKM representation: $e = \sum_j h_j K(x_j, x)$

obtain $J \leq \bar{J}(h_i, W)$
solution from the stationary points of $\bar{J}$: $\frac{\partial \bar{J}}{\partial h_i} = 0$, $\frac{\partial \bar{J}}{\partial W} = 0$


From KPCA to RKM representation (2)

• Objective

$$\begin{aligned}
J &= \frac{\eta}{2} \mathrm{Tr}(W^T W) - \frac{1}{2\lambda} \sum_{i=1}^{N} e_i^T e_i \quad \text{s.t.} \quad e_i = W^T \varphi(x_i), \; \forall i \\
&\leq -\sum_{i=1}^{N} e_i^T h_i + \frac{\lambda}{2} \sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2} \mathrm{Tr}(W^T W) \quad \text{s.t.} \quad e_i = W^T \varphi(x_i), \; \forall i \\
&= -\sum_{i=1}^{N} \varphi(x_i)^T W h_i + \frac{\lambda}{2} \sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2} \mathrm{Tr}(W^T W) \; \triangleq \; \bar{J}
\end{aligned}$$

• Stationary points of $\bar{J}(h_i, W)$:

$$\frac{\partial \bar{J}}{\partial h_i} = 0 \;\Rightarrow\; W^T \varphi(x_i) = \lambda h_i, \; \forall i$$

$$\frac{\partial \bar{J}}{\partial W} = 0 \;\Rightarrow\; W = \frac{1}{\eta} \sum_i \varphi(x_i) h_i^T$$


From KPCA to RKM representation (3)

• Elimination of $W$ gives the eigenvalue decomposition

$$\frac{1}{\eta} K H^T = H^T \Lambda$$

where $H = [h_1 ... h_N] \in \mathbb{R}^{s \times N}$ and $\Lambda = \mathrm{diag}\{\lambda_1, ..., \lambda_s\}$ with $s \leq N$

• Primal and dual model representations

$$(P)_{\mathrm{RKM}}: \; e = W^T \varphi(x)
\qquad
(D)_{\mathrm{RKM}}: \; e = \frac{1}{\eta} \sum_j h_j K(x_j, x).$$


Deep Restricted Kernel Machines


Deep RKM: example

(diagram: input $x$ through $\varphi_1(x)$ to $(e^{(1)}, h^{(1)})$, through $\varphi_2(h^{(1)})$ to $(e^{(2)}, h^{(2)})$, through $\varphi_3(h^{(2)})$ to $(e^{(3)}, h^{(3)})$ and output $y$)

Deep RKM: KPCA + KPCA + LS-SVM [Suykens, 2017]

Coupling of RKMs by taking the sum of the objectives

$$J_{\mathrm{deep}} = J_1 + J_2 + J_3$$

Multiple levels and multiple layers per level.


in more detail ...

(same diagram as on the previous slide)

$$\begin{aligned}
J_{\mathrm{deep}} = &-\sum_{i=1}^{N} \varphi_1(x_i)^T W_1 h_i^{(1)} + \frac{\lambda_1}{2} \sum_{i=1}^{N} h_i^{(1)T} h_i^{(1)} + \frac{\eta_1}{2} \mathrm{Tr}(W_1^T W_1) \\
&-\sum_{i=1}^{N} \varphi_2(h_i^{(1)})^T W_2 h_i^{(2)} + \frac{\lambda_2}{2} \sum_{i=1}^{N} h_i^{(2)T} h_i^{(2)} + \frac{\eta_2}{2} \mathrm{Tr}(W_2^T W_2) \\
&+\sum_{i=1}^{N} (y_i^T - \varphi_3(h_i^{(2)})^T W_3 - b^T) h_i^{(3)} - \frac{\lambda_3}{2} \sum_{i=1}^{N} h_i^{(3)T} h_i^{(3)} + \frac{\eta_3}{2} \mathrm{Tr}(W_3^T W_3)
\end{aligned}$$


Primal and dual model representations

$$(P)_{\mathrm{DeepRKM}}: \quad
\begin{cases}
e^{(1)} = W_1^T \varphi_1(x) \\
e^{(2)} = W_2^T \varphi_2(\Lambda_1^{-1} e^{(1)}) \\
\hat{y} = W_3^T \varphi_3(\Lambda_2^{-1} e^{(2)}) + b
\end{cases}$$

$$(D)_{\mathrm{DeepRKM}}: \quad
\begin{cases}
e^{(1)} = \frac{1}{\eta_1} \sum_j h_j^{(1)} K_1(x_j, x) \\
e^{(2)} = \frac{1}{\eta_2} \sum_j h_j^{(2)} K_2(h_j^{(1)}, \Lambda_1^{-1} e^{(1)}) \\
\hat{y} = \frac{1}{\eta_3} \sum_j h_j^{(3)} K_3(h_j^{(2)}, \Lambda_2^{-1} e^{(2)}) + b
\end{cases}$$

The framework can be used for training deep feedforward neural networks and deep kernel machines [Suykens, 2017].

(Other approaches: e.g. kernels for deep learning [Cho & Saul, 2009], mathematics of the neural response [Smale et al., 2010], deep Gaussian processes [Damianou & Lawrence, 2013], convolutional kernel networks [Mairal et al., 2014], multi-layer support vector machines [Wiering & Schomaker, 2014])


Training process

(figure: $J_{\mathrm{deep}}$ versus iteration step)

Objective function (logarithmic scale) during training on the ion data set:

• black color: level 3 objective only
• $J_{\mathrm{deep}}$ for $c_{\mathrm{stab}} = 1, 10, 100$ (blue, red, magenta color) in the stabilization term


Generative RKM


RKM objective for training and generating (1)

• RBM energy function

$$E(v, h; \theta) = -v^T W h - c^T v - a^T h$$

with model parameters $\theta = \{W, c, a\}$

• RKM "super-objective" function (for training and for generating)

$$J(v, h, W) = -v^T W h + \frac{\lambda}{2} h^T h + \frac{1}{2} v^T v + \frac{\eta}{2} \mathrm{Tr}(W^T W)$$

Training: clamp $v$ → $J_{\mathrm{train}}(h, W)$
Generating: clamp $h, W$ → $J_{\mathrm{gen}}(v)$

[Schreurs & Suykens, ESANN 2018]


RKM objective for training and generating (2)

• Training (clamp $v$):

$$J_{\mathrm{train}}(h_i, W) = -\sum_{i=1}^{N} v_i^T W h_i + \frac{\lambda}{2} \sum_{i=1}^{N} h_i^T h_i + \frac{\eta}{2} \mathrm{Tr}(W^T W)$$

Stationary points:

$$\frac{\partial J_{\mathrm{train}}}{\partial h_i} = 0 \;\Rightarrow\; W^T v_i = \lambda h_i, \; \forall i
\qquad
\frac{\partial J_{\mathrm{train}}}{\partial W} = 0 \;\Rightarrow\; W = \frac{1}{\eta} \sum_{i=1}^{N} v_i h_i^T$$

Elimination of $W$:

$$\frac{1}{\eta} K H^T = H^T \Delta,$$

where $H = [h_1, ..., h_N] \in \mathbb{R}^{s \times N}$, $\Delta = \mathrm{diag}\{\lambda_1, ..., \lambda_s\}$ with $s \leq N$ the number of selected components, and $K_{ij} = v_i^T v_j$ the kernel matrix elements.


RKM objective for training and generating (3)

• Generating (clamp $h, W$):

Estimate a distribution $p(h)$ from $\{h_i\}_{i=1}^N$ (or assume it normal). Obtain a new value $h^\star$. Generate in this way $v^\star$ from

$$J_{\mathrm{gen}}(v^\star) = -v^{\star T} W h^\star + \frac{1}{2} v^{\star T} v^\star$$

Stationary points: $\frac{\partial J_{\mathrm{gen}}}{\partial v^\star} = 0$, which gives

$$v^\star = W h^\star$$
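A minimal numpy sketch of these two steps for the simplest case of a linear kernel $K_{ij} = v_i^T v_j$ (the data, $\eta$, the number of components $s$, and the Gaussian fit on the latent codes are illustrative assumptions).

```python
# Minimal sketch (linear kernel, illustrative data): RKM training via
# (1/eta) K H' = H' Lambda, then generation via v_star = W h_star with a
# Gaussian fit on the latent codes h_i (an assumption, as on the slide).
import numpy as np

rng = np.random.default_rng(6)
N, d, s, eta = 200, 5, 2, 1.0
V = rng.normal(size=(N, d)) * np.array([3.0, 2.0, 1.0, 0.3, 0.1])   # rows v_i

K = V @ V.T                                         # K_ij = v_i' v_j
lam, U = np.linalg.eigh(K / eta)
H = U[:, ::-1][:, :s].T                             # H = [h_1 ... h_N], s x N

W = (V.T @ H.T) / eta                               # W = (1/eta) sum_i v_i h_i'

mu, cov = H.T.mean(axis=0), np.cov(H)               # fit p(h) on the codes
h_star = rng.multivariate_normal(mu, cov)           # sample a new h_star
v_star = W @ h_star                                 # generated point
print(v_star)
```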


Dimensionality reduction and denoising: linear case

• Given training data $v_i = x_i$ with $X \in \mathbb{R}^{d \times N}$, obtain hidden features $H \in \mathbb{R}^{s \times N}$:

$$\hat{X} = W H = \Big( \frac{1}{\eta} \sum_{i=1}^{N} x_i h_i^T \Big) H = \frac{1}{\eta} X H^T H$$

• Reconstruction error: $\|X - \hat{X}\|^2$

(diagram: $x_i \to G(\cdot) \to h_i \to F(\cdot) \to \hat{x}_i$)


Dimensionality reduction and denoising: nonlinear case (1)

• A new datapoint $x^\star$ is generated from $h^\star$ by

$$\varphi(x^\star) = W h^\star = \Big( \frac{1}{\eta} \sum_{i=1}^{N} \varphi(x_i) h_i^T \Big) h^\star$$

• Multiplying both sides by $\varphi(x_j)^T$ gives:

$$K(x_j, x^\star) = \frac{1}{\eta} \Big( \sum_{i=1}^{N} K(x_j, x_i) h_i^T \Big) h^\star$$

On the training data:

$$\Omega = \frac{1}{\eta} \Omega H^T H$$

with $H \in \mathbb{R}^{s \times N}$ and $\Omega_{ij} = K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$.


Dimensionality reduction and denoising: nonlinear case (2)

• Estimated value $\hat{x}$ for $x^\star$ by the kernel smoother:

$$\hat{x} = \frac{\sum_{j=1}^{S} \tilde{K}(x_j, x^\star)\, x_j}{\sum_{j=1}^{S} \tilde{K}(x_j, x^\star)}$$

with $\tilde{K}(x_j, x^\star)$ (e.g. RBF kernel) the scaled similarity between 0 and 1, and a design parameter $S \leq N$ (the $S$ closest points based on the similarity $\tilde{K}(x_j, x^\star)$).

[Schreurs & Suykens, ESANN 2018]
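A hedged numpy sketch of this kernel-smoother step: given similarity values for the training points (however they were obtained), average the $S$ most similar points. The RBF similarity, the value of $S$ and the data are illustrative assumptions.

```python
# Hedged sketch of the kernel-smoother pre-image: weighted average of the
# S most similar training points. Kernel, S and data are illustrative.
import numpy as np

def similarities(Xtr, x_star, sigma2=1.0):
    # scaled RBF similarities in (0, 1] between x_star and the training points
    return np.exp(-((Xtr - x_star) ** 2).sum(axis=1) / sigma2)

def kernel_smoother(Xtr, k_star, S=10):
    idx = np.argsort(k_star)[-S:]                    # S closest points
    w = k_star[idx]
    return (w[:, None] * Xtr[idx]).sum(axis=0) / w.sum()

rng = np.random.default_rng(7)
Xtr = rng.normal(size=(100, 2))
k_star = similarities(Xtr, np.array([0.2, -0.1]))    # stand-in for K(x_j, x_star)
print(kernel_smoother(Xtr, k_star, S=10))
```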


Explainable AI: latent space exploration (1)

hidden units: exploring the whole continuum

(figure: latent space $(H_1, H_2)$ for the digits 0, 1 and 6, with generated images shown for hidden-unit values ranging roughly from -0.12 to 0.12 along each component)

[figures by Joachim Schreurs]


Explainable AI: latent space exploration (2)

(figure: latent space with regions A, B, C, D)

Yale Face database - generated faces from different regions A, B, C, D

[Winant, Schreurs, Suykens, BNAIC 2019]


Tensor-based RKM for Multi-view KPCA

$$\min \; \langle \mathcal{W}, \mathcal{W} \rangle - \sum_{i=1}^{N} \langle \Phi^{(i)}, \mathcal{W} \rangle \, h_i + \lambda \sum_{i=1}^{N} h_i^2
\quad \text{with} \quad \Phi^{(i)} = \varphi^{[1]}(x_i^{[1]}) \otimes ... \otimes \varphi^{[V]}(x_i^{[V]})$$

[Houthuys & Suykens, ICANN 2018]


Generative RKM (Gen-RKM) (1)

Train:

Generate:

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM (2)

The objective

$$J_{\mathrm{train}}(h_i, U, V) = \sum_{i=1}^{N} \Big( -\phi_1(x_i)^T U h_i - \phi_2(y_i)^T V h_i + \frac{\lambda}{2} h_i^T h_i \Big) + \frac{\eta_1}{2} \mathrm{Tr}(U^T U) + \frac{\eta_2}{2} \mathrm{Tr}(V^T V)$$

results for training in the eigenvalue problem

$$\Big( \frac{1}{\eta_1} K_1 + \frac{1}{\eta_2} K_2 \Big) H^T = H^T \Lambda$$

with $H = [h_1 ... h_N]$ and kernel matrices $K_1, K_2$ related to $\phi_1, \phi_2$.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]
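For the implicit (kernel) case this training step reduces to one eigendecomposition; below is a small numpy sketch under illustrative assumptions (toy data for the two views, RBF kernels, $\eta_1 = \eta_2 = 1$).

```python
# Small sketch of the Gen-RKM training step with two views and implicit
# feature maps: latent codes H from the eigendecomposition of
# (1/eta1) K1 + (1/eta2) K2. Views, kernels and parameters are illustrative.
import numpy as np

def rbf(A, B, sigma2):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

rng = np.random.default_rng(8)
N, s, eta1, eta2 = 150, 3, 1.0, 1.0
X = rng.normal(size=(N, 4))                          # view 1 (e.g. images)
Y = np.c_[X[:, :2] ** 2, rng.normal(size=(N, 1))]    # view 2 (e.g. attributes)

K1, K2 = rbf(X, X, 2.0), rbf(Y, Y, 2.0)
lam, U = np.linalg.eigh(K1 / eta1 + K2 / eta2)
H = U[:, ::-1][:, :s].T                              # H = [h_1 ... h_N], s x N
Lam = lam[::-1][:s]                                  # leading eigenvalues
print(H.shape, Lam)
```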


Gen-RKM (3)

Generating data is based on a newly generated $h^\star$ and the objective

$$J_{\mathrm{gen}}(\phi_1(x^\star), \phi_2(y^\star)) = -\phi_1(x^\star)^T U h^\star - \phi_2(y^\star)^T V h^\star + \frac{1}{2} \phi_1(x^\star)^T \phi_1(x^\star) + \frac{1}{2} \phi_2(y^\star)^T \phi_2(y^\star)$$

giving

$$\phi_1(x^\star) = \frac{1}{\eta_1} \sum_{i=1}^{N} \phi_1(x_i) h_i^T h^\star, \qquad \phi_2(y^\star) = \frac{1}{\eta_2} \sum_{i=1}^{N} \phi_2(y_i) h_i^T h^\star.$$

For generating $x, y$ one can either work with the kernel smoother or work with an explicit feature map using a (deep) neural network or CNN.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM (4)

(schematic: $X \to \phi_1(\cdot) \to \mathcal{F}_x \to U^\top \to \mathcal{H} \leftarrow V^\top \leftarrow \mathcal{F}_y \leftarrow \phi_2(\cdot) \leftarrow Y$, with pre-image maps $\psi_1(\cdot), \psi_2(\cdot)$ back to $X$ and $Y$)

Gen-RKM schematic representation modeling a common subspace $\mathcal{H}$ between two data sources $X$ and $Y$. Here $\phi_1, \phi_2$ are the feature maps ($\mathcal{F}_x$ and $\mathcal{F}_y$ represent the feature spaces) corresponding to the two data sources, while $\psi_1, \psi_2$ represent the pre-image maps. The interconnection matrices $U, V$ model dependencies between the latent variables and the mapped data sources.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM: implicit feature map

Obtain

$$k_{x^\star} = \frac{1}{\eta_1} K_1 H^\top h^\star, \qquad k_{y^\star} = \frac{1}{\eta_2} K_2 H^\top h^\star,$$

with $k_{x^\star} = [k(x_1, x^\star), ..., k(x_N, x^\star)]^\top$.

Using the kernel smoother:

$$\hat{x} = \psi_1(\phi_1(x^\star)) = \frac{\sum_{j=1}^{n_r} k_1(x_j, x^\star)\, x_j}{\sum_{j=1}^{n_r} k_1(x_j, x^\star)}, \qquad
\hat{y} = \psi_2(\phi_2(y^\star)) = \frac{\sum_{j=1}^{n_r} k_2(y_j, y^\star)\, y_j}{\sum_{j=1}^{n_r} k_2(y_j, y^\star)},$$

with $k_1(x_i, x^\star)$ and $k_2(y_i, y^\star)$ the scaled similarities between 0 and 1; $n_r$ is the number of closest points based on the similarity defined by the kernels $k_1$ and $k_2$.


Gen-RKM: explicit feature map

Parametrized feature maps $\phi_\theta(\cdot)$, $\psi_\zeta(\cdot)$ (e.g. CNN and transposed CNN).

Overall objective function, using a stabilization mechanism [Suykens, 2017]:

$$\min_{\theta_1, \theta_2, \zeta_1, \zeta_2} \; \mathcal{J}_c = J_{\mathrm{train}} + \frac{c_{\mathrm{stab}}}{2} J_{\mathrm{train}}^2
+ \frac{c_{\mathrm{acc}}}{2N} \Big( \sum_{i=1}^{N} \big[ \mathcal{L}_1(x_i^\star, \psi_{1\zeta_1}(\phi_{1\theta_1}(x_i^\star))) + \mathcal{L}_2(y_i^\star, \psi_{2\zeta_2}(\phi_{2\theta_2}(y_i^\star))) \big] \Big)$$

with reconstruction errors

$$\mathcal{L}_1(x_i^\star, \psi_{1\zeta_1}(\phi_{1\theta_1}(x_i^\star))) = \frac{1}{N} \|x_i^\star - \psi_{1\zeta_1}(\phi_{1\theta_1}(x_i^\star))\|_2^2, \qquad
\mathcal{L}_2(y_i^\star, \psi_{2\zeta_2}(\phi_{2\theta_2}(y_i^\star))) = \frac{1}{N} \|y_i^\star - \psi_{2\zeta_2}(\phi_{2\theta_2}(y_i^\star))\|_2^2$$

and with $\Phi_x = [\phi_1(x_1), ..., \phi_1(x_N)]$, $\Phi_y = [\phi_2(y_1), ..., \phi_2(y_N)]$, and $U, V$ from

$$\begin{bmatrix} \frac{1}{\eta_1} \Phi_x \Phi_x^\top & \frac{1}{\eta_1} \Phi_x \Phi_y^\top \\ \frac{1}{\eta_2} \Phi_y \Phi_x^\top & \frac{1}{\eta_2} \Phi_y \Phi_y^\top \end{bmatrix}
\begin{bmatrix} U \\ V \end{bmatrix} = \begin{bmatrix} U \\ V \end{bmatrix} \Lambda.$$

Hence, joint feature learning and subspace learning.


Gen-RKM: examples (1)

MNIST Fashion-MNIST

Generated samples from the model using CNN as explicit feature map in the kernel function.

The yellow boxes show training examples and the adjacent boxes show the reconstructed

samples. The other images (columns 3-6) are generated by random sampling from the

fitted distribution over the learned latent variables.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM: examples (2)

CIFAR-10 CelebA

Generated samples from the model using CNN as explicit feature map in the kernel function.

The yellow boxes show training examples and the adjacent boxes show the reconstructed

samples. The other images (columns 3-6) are generated by random sampling from the

fitted distribution over the learned latent variables.

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM: multi-view generation (1)

CelebA

Multi-view generation on CelebA dataset showing images and attributes

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM: multi-view generation (2)

MNIST: Implicit feature maps with Gaussian kernel + generation by kernel-smoother

MNIST: Explicit feature maps using Convolutional Neural Networks

CIFAR-10: Explicit feature maps using CNNs + Transposed CNNs

Multi-view Generation (images and labels) using implicit and explicit feature maps

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM: latent space exploration (1)

Exploring the learned uncorrelated-features by traversing along the eigenvectors

Explainability: changing one single neuron’s hidden feature changes the hair color while

preserving face structure! [Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM: latent space exploration (2)

MNIST reconstructed images by bilinear-interpolation in latent space

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Gen-RKM: latent space exploration (3)

CelebA reconstructed images by bilinear-interpolation in latent space

[Pandey, Schreurs & Suykens, 2019, arXiv:1906.08144]


Future challenges

• efficient algorithms and implementations for large data

• extension to other loss functions and regularization schemes

• multimodal data, tensor models, coupling schemes

• models for deep clustering and semi-supervised learning

• choice of kernel functions, invariances and symmetry properties

• deep generative models

• optimal transport

• synergies between neural networks, deep learning and kernel machines


Conclusions

• function estimation: parametric versus kernel-based

• primal and dual model representations

• neural network interpretations in primal and dual

• RKM: new connections between RBM, kernel PCA and LS-SVM

• deep kernel machines

• generative models


Acknowledgements (1)

• Current and former co-workers at ESAT-STADIUS:

C. Alzate, Y. Chen, J. De Brabanter, K. De Brabanter, B. De Cooman, L. De Lathauwer, H. De Meulemeester, B. De Moor, H. De Plaen, Ph. Dreesen, M. Espinoza, T. Falck, M. Fanuel, Y. Feng, B. Gauthier, X. Huang, L. Houthuys, V. Jumutc, Z. Karevan, R. Langone, F. Liu, R. Mall, S. Mehrkanoon, G. Nisol, M. Orchel, A. Pandey, P. Patrinos, K. Pelckmans, S. RoyChowdhury, S. Salzo, J. Schreurs, M. Signoretto, Q. Tao, F. Tonin, J. Vandewalle, T. Van Gestel, S. Van Huffel, C. Varon, Y. Yang, and others

• Many other people for joint work, discussions, invitations, organizations

• Support from ERC AdG E-DUALITY, ERC AdG A-DATADRIVE-B, KU Leuven, OPTEC, IUAP DYSCO, FWO projects, IWT, iMinds, BIL, COST


Acknowledgements (2)


Acknowledgements (3)

NEW: ERC Advanced Grant E-DUALITY
Exploring duality for future data-driven modelling


Thank you
