
Kernels, Random Embeddings and Deep Learning

Vikas Sindhwani

IBM Research, NY

October 28, 2014


Acknowledgements

At IBM: Haim Avron, Tara Sainath, B. Ramabhadran, Q. Fan

Summer Interns: Jiyan Yang (Stanford), Po-sen Huang (UIUC); Michael Mahoney (UC Berkeley), Ha Quang Minh (IIT Genova)

IBM DARPA XDATA project led by Ken Clarkson (IBM Almaden)



Setting

Given labeled data in the form of input-output pairs,

$\{x_i, y_i\}_{i=1}^{n}, \quad x_i \in \mathcal{X} \subset \mathbb{R}^d, \quad y_i \in \mathcal{Y} \subset \mathbb{R}^m,$

estimate the unknown dependency f : X → Y.

Regularized Risk Minimization in a suitable hypothesis space H:

$\arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} V(f(x_i), y_i) + \Omega(f)$

Large n =⇒ big models: H "rich"/non-parametric/nonlinear.

Two great ML traditions around choosing H:
– Deep Neural Networks: $f(x) = s_n(\ldots s_2(W_2\, s_1(W_1 x)) \ldots)$
– Kernel Methods: a general nonlinear function space generated by a kernel function k(x, z) on X × X.

This talk: a thrust towards scalable kernel methods, motivated by the recent successes of deep learning.


Outline

Motivation and Background

Scalable Kernel Methods
– Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
– libSkylark: an open-source software stack
– Quasi-Monte Carlo Embeddings (ICML 2014)

Synergies?


Deep Learning is "Supercharging" Machine Learning

Krizhevsky et al. won the 2012 ImageNet challenge (ILSVRC-2012) with a top-5 error rate of 15.3%, compared to 26.2% for the second-best entry.

Many statistical and computational ingredients:
– Large datasets (ILSVRC since 2010)
– Large statistical capacity (1.2M images, 60M params)
– Distributed computation
– Depth, invariant feature learning (transferable to other tasks)
– Engineering: Dropout, ReLU, . . .

Very active area in Speech and Natural Language Processing.


Machine Learning in the 1990s

Convolutional Neural Networks (Fukushima, 1980; LeCun et al., 1989)
– 3 days to train on USPS (n = 7291; digit recognition) on a Sun SPARCstation 1 (33 MHz clock speed, 64 MB RAM)

Personal history:
– 1998: First ML experiment: train a DNN on the UCI Wine dataset.
– 1999: Introduced to kernel methods, by DNN researchers!
– 2003-4: NN paper rejected at JMLR; accepted in IEEE Trans. Neural Nets with a kernel methods section!

Why Kernel Methods?
– Free of local minima: a stronger role for convex optimization.
– Theoretically appealing.
– Handle non-vectorial data; high-dimensional data.
– Easier model selection via continuous optimization.
– Matched NNs in many cases, although they didn't scale as well with respect to n.

So what changed?
– More data, parallel algorithms, hardware? Better DNN training? . . .


Kernel Methods and Neural Networks

Geoff Hinton facts meme maintained at http://yann.lecun.com/ex/fun/

All kernels that ever dared approaching Geoff Hinton woke up convolved.

The only kernel Geoff Hinton has ever used is a kernel of truth.

If you defy Geoff Hinton, he will maximize your entropy in no time. Your free energy will be gone even before you reach equilibrium.

Are there synergies between these fields towards the design of even better (faster and more accurate) algorithms?


The Mathematical Naturalness of Kernel Methods

Data: $\mathcal{X} \subset \mathbb{R}^d$. Models: $f \in \mathcal{H}$, $f : \mathcal{X} \to \mathbb{R}$.

Geometry in H: inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, norm $\|\cdot\|_{\mathcal{H}}$ (Hilbert spaces).

Theorem: All nice Hilbert spaces are generated by a symmetric positive definite function (the kernel) k(x, z) on X × X.
– If $f, g \in \mathcal{H}$ are close, i.e. $\|f - g\|_{\mathcal{H}}$ is small, then f(x), g(x) are close $\forall x \in \mathcal{X}$.
– Reproducing Kernel Hilbert Spaces (RKHSs)

Functional analysis (Aronszajn, Bergman, 1950s); statistics (Parzen, 1960s); PDEs; numerical analysis . . .
– ML: nonlinear classification, regression, clustering, time-series analysis, dynamical systems, hypothesis testing, causal modeling, . . .

In principle, it is possible to compose deep learning pipelines using more general nonlinear functions drawn from RKHSs.


Outline

Motivation and Background

Scalable Kernel Methods
– Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
– libSkylark: an open-source software stack
– Quasi-Monte Carlo Embeddings (ICML 2014)

Synergies?


Scalability Challenges for Kernel Methods

$f^\star = \arg\min_{f \in \mathcal{H}_k} \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}_k}^2, \qquad x_i \in \mathbb{R}^d$

Representer Theorem: $f^\star(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)$

Regularized Least Squares: $(K + \lambda I)\alpha = Y$
– O(n²) storage
– O(n³ + n²d) training
– O(nd) test speed

Hard to parallelize when working directly with $K_{ij} = k(x_i, x_j)$.
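To make the bottleneck concrete, a minimal MATLAB sketch of the exact Gaussian-kernel solve is shown below (not from the talk; X (n-by-d), Y (n-by-1), Xtest, and the hyperparameters sigma and lambda are assumed given; pdist2 is from the Statistics Toolbox):

    % Exact kernel regularized least squares: the dense n-by-n kernel matrix
    % is the O(n^2) storage / O(n^3) training bottleneck referred to above.
    K = exp(-pdist2(X, X).^2 / (2*sigma^2));          % n-by-n kernel matrix
    alpha = (K + lambda*eye(size(K,1))) \ Y;          % O(n^3) dense solve
    Ktest = exp(-pdist2(Xtest, X).^2 / (2*sigma^2));  % kernel values at test points
    Ypred = Ktest * alpha;                            % O(nd) work per test point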


Randomized Algorithms

Explicit approximate feature map $\Psi : \mathbb{R}^d \to \mathbb{C}^s$ such that

$k(x, z) \approx \langle \Psi(x), \Psi(z) \rangle_{\mathbb{C}^s}$

$\Rightarrow \left( Z(X)^T Z(X) + \lambda I \right) w = Z(X)^T Y$
– O(ns) storage
– O(ns²) training
– O(s) test speed

Interested in data-oblivious maps that depend only on the kernel function, and not on the data.
– Should be very cheap to apply and parallelizable.


Random Fourier Features (Rahimi & Recht, 2007)

Theorem [Bochner, 1930/33]: One-to-one Fourier-pair correspondence between any (normalized) shift-invariant kernel k and a density p such that

$k(x, z) = \psi(x - z) = \int_{\mathbb{R}^d} e^{-i (x-z)^T w}\, p(w)\, dw$

– Gaussian kernel: $k(x, z) = e^{-\frac{\|x - z\|_2^2}{2\sigma^2}} \Longleftrightarrow p = \mathcal{N}(0, \sigma^{-2} I_d)$
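As a quick numerical check of the theorem (a sketch only; it assumes the Gaussian kernel with bandwidth sigma and two d-by-1 inputs x and z), averaging $e^{-i(x-z)^T w}$ over frequencies drawn from $\mathcal{N}(0, \sigma^{-2} I_d)$ recovers the kernel value:

    % Monte Carlo approximation of the Gaussian kernel via Bochner's theorem.
    d = numel(x); s = 5000;                            % s random frequencies
    W = randn(d, s) / sigma;                           % columns w_j ~ N(0, sigma^{-2} I_d)
    k_exact  = exp(-norm(x - z)^2 / (2*sigma^2));
    k_approx = real(mean(exp(-1i * (x - z)' * W)));    % (1/s) sum_j e^{-i(x-z)'w_j}
    % k_approx converges to k_exact at the Monte Carlo rate O(s^{-1/2}).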


DNNs vs Kernel Methods on TIMIT (Speech)


Joint work with the IBM Speech Group and P. Huang: Can "shallow", convex, randomized kernel methods "match" DNNs?

(Predicting HMM states given a short window of coefficients representing the acoustic input.)

    G = randn(size(X,2), s);                   % random frequencies (d-by-s)
    Z = exp(1i*X*G);                           % n-by-s random Fourier feature matrix
    I = eye(size(Z,2));                        % s-by-s identity
    C = Z'*Z;
    alpha = (C + lambda*I) \ (Z'*y(:));        % regularized least squares
    ztest = exp(1i*xtest*G)*alpha;             % predictions on test inputs


DNNs vs Kernel Methods on TIMIT (Speech)


Kernel Methods Match DNNs on TIMIT, ICASSP 2014, with P. Huang and the IBM Speech group.

High-Performance Kernel Machines with Implicit Distributed Optimization and Randomization, JSM 2014, with H. Avron.

– ∼2 hours on 256 IBM BlueGene/Q nodes.
– Phone error rate of 21.3%: the best reported for kernel methods.
– Competitive with HMM/DNN systems.
– New record: 16.7% with CNNs (ICASSP 2014).
– Only two hyperparameters: σ and s (early stopping as the regularizer).
– Z ≈ 6.4 TB, C ≈ 1.2 TB: materialized in blocks, used, and discarded on the fly, in parallel.


Distributed Block-splitting ADMM

https://github.com/xdata-skylark/libskylark/tree/master/ml

[Figure: data blocks (Y_i, X_i) live on MPI ranks (nodes 1-3); random-feature blocks Z_ij are generated on the fly from X_i via transforms T[X_i, j] using T cores/OpenMP threads per rank; local blocks W_ij are updated with a loss proximal step (prox_l) and a projection onto the graph of Z_ij (proj Z_ij); the global model W is formed by a reduce to node 0, a regularizer proximal step (prox_r), and a broadcast back to all ranks.]

Scalability


Graph Projection

$U = \underbrace{\left[ Z_{ij}^T Z_{ij} + \lambda I \right]^{-1}}_{\text{cached}} \underbrace{\left( X + Z_{ij}^T Y \right)}_{\text{gemm}}, \qquad V = \underbrace{Z_{ij}\, U}_{\text{gemm}}$

High-performance implementation that can handle large column splits $C = \kappa\, s/d$ by reorganizing updates to exploit shared-memory access and the structure of the graph projection.
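A minimal sketch of this step (following the generic graph-projection update of block-splitting ADMM; the function and variable names below are illustrative and are not libSkylark's API):

    % Graph projection for one block Z: given right-hand sides c and d,
    % return (W, V) on the graph V = Z*W.
    % The factor U = inv(Z'*Z + lambda*eye(size(Z,2))) is computed once and cached.
    function [W, V] = graph_project(Z, c, d, U)
    W = U * (c + Z' * d);   % gemm against the cached factor
    V = Z * W;              % second gemm
    end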

Per-node costs (R MPI ranks, T threads per rank, κ column splits, m outputs):
– Memory: $\frac{nm}{R}(T + 5) + \frac{nd}{R} + 5sm + \frac{1}{\kappa}\left( \frac{Tnd}{R} + Tmd + sd \right) + \kappa \frac{nmsd}{R}$
– Computation: $O\!\left(\frac{nd^2}{TR\kappa}\right)$ [transform] $+\; O\!\left(\frac{smd}{\kappa T}\right)$ [graph-proj] $+\; O\!\left(\frac{nsm}{TR}\right)$ [gemm]
– Communication: $O(sm \log R)$ (model reduce/broadcast)

Stochastic and asynchronous versions may be possible to develop.


Randomized Kernel Methods on Thousands of Cores

[Figure: MNIST strong-scaling and weak-scaling experiments on Triloka and BG/Q. The panels plot speedup (against ideal), training time (secs), and classification accuracy (%) versus the number of MPI processes (t = 6 threads/process on Triloka, t = 64 threads/process on BG/Q).]

Triloka: 20 nodes, 16 cores per node; BG/Q: 1024 nodes, 16 (×4) cores per node.

s = 100K, C = 200; strong scaling (n = 250K), weak scaling (n = 250K per node).


Comparisons on MNIST


See "no distortion" results at http://yann.lecun.com/exdb/mnist/

Test error (%) on MNIST:
– Gaussian kernel: 1.4
– Poly (4): 1.1
– 3-layer NN, 500+300 HU: 1.53 (Hinton, 2005)
– LeNet-5: 0.8 (LeCun et al., 1998)
– Large CNN (pretrained): 0.60 (Ranzato et al., NIPS 2006)
– Large CNN (pretrained): 0.53 (Jarrett et al., ICCV 2009)
– Scattering transform + Gaussian kernel: 0.4 (Bruna and Mallat, 2012)
– 20000 random features: 0.45 (my experiments)
– 5000 random features: 0.52 (my experiments)
– Committee of 35 conv. nets [elastic distortions]: 0.23 (Ciresan et al., 2012)

When similar prior knowledge is enforced (invariant learning), performance gaps vanish.

RKHS mappings can, in principle, be used to implement CNNs.


Efficiency of Random Embeddings


TIMIT: 58.8M parameters (s = 400K, m = 147) vs. 19.9M parameters for the DNN.

Draw $S = [w_1 \ldots w_s] \sim p$ and approximate the integral:

$k(x, z) = \int_{\mathbb{R}^d} e^{-i (x-z)^T w}\, p(w)\, dw \approx \frac{1}{|S|} \sum_{w \in S} e^{-i (x-z)^T w} \qquad (6)$

Integration error:

$\epsilon_{p,S}[f] = \int_{[0,1]^d} f(x)\, p(x)\, dx - \frac{1}{s} \sum_{w \in S} f(w)$

Monte Carlo convergence rate: $E\left[\epsilon_{p,S}[f]^2\right]^{\frac{1}{2}} \propto s^{-\frac{1}{2}}$
– A 4-fold increase in s will only cut the error in half.

Can we do better with a different sequence S?


Quasi-Monte Carlo Sequences: Intuition


Weyl, 1916; Koksma, 1942; Dick et al., 2013; Caflisch, 1998

Consider approximating $\int_{[0,1]^2} f(x)\, dx$ with $\frac{1}{s} \sum_{w \in S} f(w)$.

[Figure: s points in [0,1]²: uniform (Monte Carlo) sampling vs. a Halton sequence.]

Deterministic, correlated QMC sampling avoids the clustering and clumping effects seen in MC point sets.

Hierarchical structure: coarse-to-fine sampling as s increases.


Star Discrepancy


Integration error depends on the variation of f and the uniformity of S.

Theorem (Koksma-Hlawka inequality, 1941/1961):

$\epsilon_S[f] \le D^\star(S)\, V_{HK}[f], \quad \text{where} \quad D^\star(S) = \sup_{x \in [0,1]^d} \left| \mathrm{vol}(J_x) - \frac{|\{i : w_i \in J_x\}|}{s} \right|$

(here $J_x = [0, x)$ is the anchored box at x).


Quasi-Monte Carlo Sequences


We seek sequences with low discrepancy.

Theorem (Roth, 1954): $D^\star(S) \ge c_d \frac{(\log s)^{\frac{d-1}{2}}}{s}$.

For the regular grid on $[0, 1)^d$ with $s = m^d$ points, $D^\star \ge s^{-\frac{1}{d}}$ $\Longrightarrow$ only optimal for d = 1.

Theorem (Doerr, 2013): Let S with s points be chosen uniformly at random from $[0, 1)^d$. Then $E[D^\star(S)] \ge \frac{\sqrt{d}}{\sqrt{s}}$, the Monte Carlo rate.

There exist low-discrepancy sequences that achieve a $C_d \frac{(\log s)^d}{s}$ bound, conjectured to be optimal.
– Matlab QMC generators: haltonset, sobolset, . . .

With d fixed, this bound actually grows until s ∼ e^d, up to (d/e)^d!
– "On the unreasonable effectiveness of QMC" in high dimensions.
– RKHSs and kernels show up in modern QMC analysis!


How do standard QMC sequences perform?


Compare $K \approx Z(X) Z(X)^T$, where $Z(X) = e^{-iXG}$ and G is drawn from a QMC sequence generator instead.

[Figure: relative error on $\|K\|_2$ vs. number of random features for MC, Halton, Sobol', digital net, and lattice sequences, on USPST (n = 1506) and CPU (n = 6554).]

QMC sequences are consistently better.

Why are some QMC sequences better, e.g., Halton over Sobol'?

Can we learn sequences even better adapted to our problem class?
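A minimal sketch of this comparison (assuming a Gaussian kernel with bandwidth sigma and a small n-by-d data matrix X; haltonset, scramble, net, and norminv are from MATLAB's Statistics Toolbox):

    % Halton-based QMC random Fourier features vs. plain Monte Carlo features.
    [n, d] = size(X); s = 400;
    K = exp(-pdist2(X, X).^2 / (2*sigma^2));          % exact Gaussian kernel
    P = scramble(haltonset(d, 'Skip', 1000), 'RR2');  % QMC point set in [0,1)^d
    Gq  = norminv(net(P, s))' / sigma;                % d-by-s QMC frequencies
    Gmc = randn(d, s) / sigma;                        % d-by-s MC frequencies
    Zq  = exp(-1i * X * Gq ) / sqrt(s);               % n-by-s feature matrices
    Zmc = exp(-1i * X * Gmc) / sqrt(s);
    err_qmc = norm(K - real(Zq *Zq' ), 2) / norm(K, 2);   % relative error on ||K||_2
    err_mc  = norm(K - real(Zmc*Zmc'), 2) / norm(K, 2);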


RKHSs in QMC Theory

“Nice” integrands f ∈ Hh, where h(·, ·) is a kernel function.


By the reproducing property and the Cauchy-Schwarz inequality:

$\left| \int_{\mathbb{R}^d} f(x)\, p(x)\, dx - \frac{1}{s} \sum_{w \in S} f(w) \right| \le \|f\|_{\mathcal{H}_h}\, D_{h,p}(S) \qquad (7)$

where $D_{h,p}$ is a discrepancy measure:

$D_{h,p}(S)^2 = \left\| m_p(\cdot) - \frac{1}{s} \sum_{w \in S} h(w, \cdot) \right\|_{\mathcal{H}_h}^2,$

with mean embedding $m_p = \int_{\mathbb{R}^d} h(x, \cdot)\, p(x)\, dx$.


Box Discrepancy

Assume that the data (shifted) lives in a box:

$\Box_b = \{ -b \le x - z \le b,\; x, z \in \mathcal{X} \}$

Class of functions we want to integrate:

$\mathcal{F}_b = \{ f(w) = e^{-i \Delta^T w},\; \Delta \in \Box_b \}$

Theorem: The integration error is proportional to the "box discrepancy":

$E_{f \sim U(\mathcal{F}_b)}\left[ \epsilon_{S,p}[f]^2 \right]^{\frac{1}{2}} \propto D^{\Box}_{p}(S) \qquad (9)$

where $D^{\Box}_{p}(S)$ is the discrepancy associated with the sinc kernel:

$\mathrm{sinc}_b(u, v) = \pi^{-d} \prod_{j=1}^{d} \frac{\sin\!\left(b_j (u_j - v_j)\right)}{u_j - v_j}$

Can be evaluated in closed form for the Gaussian density.

Does Box discrepancy explain the behaviour of QMC sequences?


[Figure: $D^{\Box}(S)^2$ vs. number of samples on the CPU dataset (d = 21), for digital net, MC (expected), Halton, Sobol', and lattice sequences.]

Learning Adaptive QMC Sequences

Unlike the star discrepancy, the box discrepancy admits numerical optimization:

$S^\star = \arg\min_{S = (w_1 \ldots w_s) \in \mathbb{R}^{d \times s}} D^{\Box}(S), \qquad S^{(0)} = \text{Halton}. \qquad (10)$

[Figure: CPU dataset, s = 100: normalized $D^{\Box}(S)^2$, maximum squared error, mean squared error, and $\|\tilde{K} - K\|_2 / \|K\|_2$ vs. optimization iteration.]

However, the full impact on large-scale problems is an open research topic.
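A hedged sketch of the optimization in Eq. (10) (box_discrepancy is a hypothetical stand-in for a closed-form evaluation of $D^{\Box}_p$; fminunc is from the Optimization Toolbox; d, s, sigma, and the box half-widths b are assumed given):

    % Refine a Halton initialization by unconstrained minimization of the
    % (assumed available) box-discrepancy objective over all s*d frequencies.
    S0   = norminv(net(haltonset(d, 'Skip', 1000), s))' / sigma;  % d-by-s start
    obj  = @(v) box_discrepancy(reshape(v, d, s), b);             % hypothetical objective
    opts = optimoptions('fminunc', 'Algorithm', 'quasi-newton');
    Sstar = reshape(fminunc(obj, S0(:), opts), d, s);             % adaptive sequence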

Outline


Motivation and Background

Scalable Kernel Methods
– Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
– libSkylark: an open-source software stack
– Quasi-Monte Carlo Embeddings (ICML 2014)

Synergies?


Randomization-vs-Optimization

[Figure: $k(x, z)$: randomization (via the density p) ⇔ optimization (via eigenfunctions eig1(x), . . . , eig5(x)).]

Jarrett et al., 2009, What is the Best Multi-Stage Architecture for Object Recognition?: "The most astonishing result is that systems with random filters and no filter learning whatsoever achieve decent performance."

On Random Weights and Unsupervised Feature Learning, Saxe et al., ICML 2011: "surprising fraction of performance can be attributed to architecture alone."

Deep Learning with Kernels?

Maps across layers can be parameterized using more general nonlinearities (kernels).

[Figure: an image-pixel representation x1 is mapped through f1(x1), f1 ∈ H_{k1}, over a patch Ω1 to a new representation x2, which is mapped through f2(x2), f2 ∈ H_{k2}, over a patch Ω2, and so on.]

Mathematics of Neural Response, Smale et al., FCM (2010).

Convolutional Kernel Networks, Mairal et al., NIPS 2014.

SimNets: A Generalization of Conv. Nets, Cohen and Shashua, 2014.

– learns networks 1/8 the size of comparable ConvNets.

Figure adapted from Mairal et al., NIPS 2014.

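As a toy illustration of composing RKHS-style maps (a sketch only, not the architecture of any of the papers above; X, y, sigma1, sigma2, and lambda are assumed given), one can stack two random Fourier feature layers and fit a linear model on top:

    % Two stacked random-feature "layers" followed by regularized least squares.
    s1 = 2000; s2 = 2000;
    G1 = randn(size(X,2), s1) / sigma1;  b1 = 2*pi*rand(1, s1);
    H1 = sqrt(2/s1) * cos(bsxfun(@plus, X*G1, b1));    % layer-1 features
    G2 = randn(s1, s2) / sigma2;         b2 = 2*pi*rand(1, s2);
    H2 = sqrt(2/s2) * cos(bsxfun(@plus, H1*G2, b2));   % layer-2 features
    w  = (H2'*H2 + lambda*eye(s2)) \ (H2'*y(:));       % linear read-out on top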

Conclusions


Some empirical evidence suggests that once kernel methods are scaled up and embody similar statistical principles, they are competitive with Deep Neural Networks.
– Randomization and distributed computation are both required.
– Ideas from QMC integration techniques are promising.

Opportunities for designing new algorithms that combine insights from deep learning with the generality and mathematical richness of kernel methods.