Kernels, Random Embeddings and Deep Learning

Vikas Sindhwani, IBM Research, NY

October 28, 2014


Acknowledgements

- At IBM: Haim Avron, Tara Sainath, B. Ramabhadran, Q. Fan
- Summer interns: Jiyan Yang (Stanford), Po-sen Huang (UIUC)
- Michael Mahoney (UC Berkeley), Ha Quang Minh (IIT Genova)
- IBM DARPA XDATA project led by Ken Clarkson (IBM Almaden)

Setting

- Given labeled data in the form of input-output pairs

  {x_i, y_i}_{i=1}^n,   x_i ∈ X ⊂ R^d,   y_i ∈ Y ⊂ R^m,

  estimate the unknown dependency f : X → Y.

- Regularized risk minimization in a suitable hypothesis space H:

  arg min_{f ∈ H}  ∑_{i=1}^n V(f(x_i), y_i) + Ω(f)

- Large n ⟹ big models: H is "rich"/non-parametric/nonlinear.

- Two great ML traditions around choosing H:
  – Deep neural networks: f(x) = s_n(... s_2(W_2 s_1(W_1 x)) ...)
  – Kernel methods: a general nonlinear function space generated by a kernel function k(x, z) on X × X.

- This talk: a thrust towards scalable kernel methods, motivated by the recent successes of deep learning.


Outline

Motivation and Background

Scalable Kernel Methods
  Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
  libSkylark: an open-source software stack
  Quasi-Monte Carlo Embeddings (ICML 2014)

Synergies?

Deep Learning is "Supercharging" Machine Learning

Krizhevsky et al. won the 2012 ImageNet challenge (ILSVRC-2012) with a top-5 error rate of 15.3%, compared to 26.2% for the second-best entry.

- Many statistical and computational ingredients:
  – Large datasets (ILSVRC since 2010)
  – Large statistical capacity (1.2M images, 60M params)
  – Distributed computation
  – Depth, invariant feature learning (transferable to other tasks)
  – Engineering: Dropout, ReLU, ...

- Very active area in speech and natural language processing.


Machine Learning in the 1990s

- Convolutional neural networks (Fukushima 1980; LeCun et al. 1989)
  – 3 days to train on USPS (n = 7291; digit recognition) on a Sun SPARCstation 1 (33 MHz clock, 64 MB RAM)

- Personal history:
  – 1998: First ML experiment - training a DNN on the UCI Wine dataset.
  – 1999: Introduced to kernel methods - by DNN researchers!
  – 2003-04: NN paper rejected at JMLR; accepted in IEEE Trans. Neural Networks with a kernel-methods section!

- Why kernel methods?
  – Free of local minima - a stronger role for convex optimization.
  – Theoretically appealing.
  – Handle non-vectorial and high-dimensional data.
  – Easier model selection via continuous optimization.
  – Matched NNs in many cases, although they didn't scale as well with n.

- So what changed?
  – More data, parallel algorithms, hardware? Better DNN training? ...


Kernel Methods and Neural Networks (Pre-Google)

Kernel Methods and Neural Networks

Geoff Hinton facts meme, maintained at http://yann.lecun.com/ex/fun/

- All kernels that ever dared approaching Geoff Hinton woke up convolved.
- The only kernel Geoff Hinton has ever used is a kernel of truth.
- If you defy Geoff Hinton, he will maximize your entropy in no time. Your free energy will be gone even before you reach equilibrium.

Are there synergies between these fields towards the design of even better (faster and more accurate) algorithms?


The Mathematical Naturalness of Kernel Methods

- Data x ∈ X ⊂ R^d; models f ∈ H : X → R.
- Geometry in H: inner product ⟨·, ·⟩_H, norm ‖·‖_H (Hilbert spaces).
- Theorem: All nice Hilbert spaces are generated by a symmetric positive definite function (the kernel) k(x, x′) on X × X.
  – If f, g ∈ H are close, i.e. ‖f − g‖_H is small, then f(x) and g(x) are close for all x ∈ X.
  – Reproducing Kernel Hilbert Spaces (RKHSs).
- Functional analysis (Aronszajn, Bergman, 1950s); statistics (Parzen, 1960s); PDEs; numerical analysis ...
  – ML: nonlinear classification, regression, clustering, time-series analysis, dynamical systems, hypothesis testing, causal modeling, ...
- In principle, it is possible to compose deep learning pipelines using more general nonlinear functions drawn from RKHSs.


Scalable Kernel Methods

Scalability Challenges for Kernel Methods

  f* = arg min_{f ∈ H_k}  (1/n) ∑_{i=1}^n V(y_i, f(x_i)) + λ‖f‖²_{H_k},   x_i ∈ R^d

- Representer theorem: f*(x) = ∑_{i=1}^n α_i k(x, x_i)
- Regularized least squares (see the sketch below): (K + λI)α = Y
  – O(n²) storage, O(n³ + n²d) training, O(nd) test speed
- Hard to parallelize when working directly with K_ij = k(x_i, x_j)
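To make these costs concrete, here is a minimal MATLAB sketch of exact kernel ridge regression with a Gaussian kernel; it is an illustration, not the system described later. X (n-by-d), y (n-by-1), Xtest, sigma and lambda are assumed to exist, and pdist2 comes from the Statistics and Machine Learning Toolbox. The n-by-n kernel matrix is the O(n²) storage bottleneck and the dense solve is the O(n³) step.

  K = exp(-pdist2(X, X).^2 / (2*sigma^2));      % n-by-n Gaussian kernel matrix: O(n^2) storage, O(n^2 d) time
  alpha = (K + lambda*eye(size(K,1))) \ y;      % dense linear solve: the O(n^3) training cost
  Ktest = exp(-pdist2(Xtest, X).^2 / (2*sigma^2));
  ypred = Ktest * alpha;                        % prediction touches all n training points: O(nd) per test point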

Randomized Algorithms

- Explicit approximate feature map Ψ : R^d → C^s such that

  k(x, z) ≈ ⟨Ψ(x), Ψ(z)⟩_{C^s}   ⟹   (Z(X)ᵀ Z(X) + λI) w = Z(X)ᵀ Y

  – O(ns) storage, O(ns²) training, O(s) test speed

- Interested in data-oblivious maps that depend only on the kernel function, not on the data.
  – Should be very cheap to apply and parallelizable.

Random Fourier Features (Rahimi & Recht, 2007)

- Theorem [Bochner 1930, 1933]: There is a one-to-one Fourier-pair correspondence between any (normalized) shift-invariant kernel k and a density p such that

  k(x, z) = ψ(x − z) = ∫_{R^d} e^{−i(x−z)ᵀw} p(w) dw

  – Gaussian kernel: k(x, z) = e^{−‖x−z‖₂² / (2σ²)}  ⟺  p = N(0, σ⁻²I_d)

- Monte Carlo approximation to the integral representation (see the sketch below):

  k(x, z) ≈ (1/s) ∑_{j=1}^s e^{−i(x−z)ᵀw_j} = ⟨Ψ_S(x), Ψ_S(z)⟩_{C^s}

  Ψ_S(x) = (1/√s) [e^{−ixᵀw_1}, ..., e^{−ixᵀw_s}] ∈ C^s,   S = [w_1 ... w_s] ∼ p

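A minimal MATLAB sketch of this construction for the Gaussian kernel on toy data (all sizes and variable names here are illustrative, and pdist2 again assumes the Statistics Toolbox): draw the frequencies from N(0, σ⁻²I_d), build the complex feature matrix, and compare Z Z* against the exact kernel matrix.

  n = 500; d = 20; s = 2000; sigma = 1;         % toy sizes
  X = randn(n, d);
  W = randn(d, s) / sigma;                      % columns w_j ~ N(0, sigma^-2 I_d)
  Z = exp(-1i * X * W) / sqrt(s);               % row i holds Psi_S(x_i); Z is n-by-s and complex
  Kapprox = real(Z * Z');                       % <Psi_S(x), Psi_S(z)> approximates k(x, z)
  Kexact  = exp(-pdist2(X, X).^2 / (2*sigma^2));
  norm(Kapprox - Kexact) / norm(Kexact)         % relative error in the spectral norm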

DNNs vs Kernel Methods on TIMIT (Speech)

Joint work with the IBM Speech Group and P. Huang: can "shallow", convex, randomized kernel methods "match" DNNs? (Predicting HMM states given a short window of coefficients representing the acoustic input.)

  G = randn(size(X,2), s);                   % random Gaussian frequencies, one column per feature
  Z = exp(1i*X*G);                           % random Fourier features of the training data X (n-by-d)
  C = Z'*Z;                                  % s-by-s Gram matrix of the features
  alpha = (C + lambda*eye(s)) \ (Z'*y(:));   % regularized least-squares weights
  ztest = exp(1i*xtest*G)*alpha;             % predictions on test inputs

- Z(X): 1.2 TB
- Stream over blocks: C += Z_Bᵀ Z_B
- But C is also big (47 GB).
- Need: distributed solvers that handle big n and s, with Z(X) formed implicitly.

[Figure: classification error (%) vs. number of random features s (×10⁴) on TIMIT (n = 2M, d = 440, k = 147), comparing a DNN (440-4k-4k-147), random Fourier features, and an exact kernel on a 100k subset (75 GB).]


DNNs vs Kernel Methods on TIMIT (Speech)

Kernel methods match DNNs on TIMIT, ICASSP 2014, with P. Huang and the IBM Speech Group.
High-performance Kernel Machines with Implicit Distributed Optimization and Randomization, JSM 2014, with H. Avron.

- ~2 hours on 256 IBM BlueGene/Q nodes.
- Phone error rate of 21.3% - the best reported for kernel methods.
  – Competitive with HMM/DNN systems.
  – New record: 16.7% with CNNs (ICASSP 2014).
- Only two hyperparameters: σ, s (early-stopping regularizer).
- Z ≈ 6.4 TB, C ≈ 1.2 TB.
- Materialized in blocks, used, and discarded on the fly, in parallel.

[Figure: classification error (%) vs. number of random features s (×10⁴) on TIMIT (n = 2M, d = 440, k = 147); PER 21.3% < 22.3% (DNN); curves for the DNN (440-4k-4k-147), random Fourier features, and the exact kernel (n = 100k, 75 GB).]


Distributed Convex Optimization

- Alternating Direction Method of Multipliers (1950s; Boyd et al., 2013):

  arg min_{x ∈ R^n, z ∈ R^m}  f(x) + g(z)   subject to   Ax + Bz = c

- Row/column splitting; block splitting (Parikh & Boyd, 2013):

  arg min_{x ∈ R^d}  ∑_{i=1}^R f_i(x) + g(x)   ⟹   ∑_{i=1}^R f_i(x_i) + g(z)   s.t.  x_i = z        (1)

  x_i^{k+1} = arg min_x  f_i(x) + (ρ/2)‖x − z^k + ν_i^k‖₂²                                          (2)
  z^{k+1}   = prox_{g/(Rρ)}[ x̄^{k+1} + ν̄^k ]   (communication)                                      (3)
  ν_i^{k+1} = ν_i^k + x_i^{k+1} − z^{k+1}                                                            (4)

  where prox_f[x] = arg min_y  (1/2)‖x − y‖₂² + f(y)

- Note: extra consensus and dual variables need to be managed.
- Closed-form updates, extensibility, code reuse, parallelism (see the sketch below).
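A minimal serial MATLAB sketch of the consensus updates (2)-(4), using ridge regression as an illustrative choice of f_i (local least squares) and g (an L2 regularizer); the loop over i stands in for the per-node work that is distributed in practice, and all names and sizes are made up for the example.

  R = 4; d = 50; ni = 200; rho = 1; lambda = 0.1;
  A = cell(R,1); b = cell(R,1);
  for i = 1:R, A{i} = randn(ni, d); b{i} = randn(ni, 1); end   % R local data blocks
  x = zeros(d, R); nu = zeros(d, R); z = zeros(d, 1);
  for k = 1:100
      for i = 1:R                                              % update (2): local, embarrassingly parallel
          x(:,i) = (A{i}'*A{i} + rho*eye(d)) \ (A{i}'*b{i} + rho*(z - nu(:,i)));
      end
      v = mean(x + nu, 2);                                     % the only communication (an average/reduce)
      z = (R*rho/(R*rho + lambda)) * v;                        % update (3): prox of g(z) = (lambda/2)*||z||^2
      nu = nu + x - repmat(z, 1, R);                           % update (4): one dual variable per block
  end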

Distributed Block-splitting ADMM

https://github.com/xdata-skylark/libskylark/tree/master/ml

[Diagram: data blocks (Y_i, X_i) are partitioned by rows across MPI ranks (nodes 1-3); the random feature matrix Z is tiled into blocks Z_ij processed by T cores/OpenMP threads per rank; local prox_l and graph-projection (proj_{Z_ij}) steps produce blocks W_ij of the model W, which are combined via reduce/broadcast through node 0, where prox_r is applied.]

  U = [Z_ijᵀ Z_ij + λI]⁻¹ (X + Z_ijᵀ Y),   V = Z_ij U        (5)

  (the inverse factor is cached across iterations; the remaining products are gemm calls - a standalone sketch of this step follows below)

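The per-block graph-projection step (5) amounts to one cached factorization plus two dense multiplies; a small standalone MATLAB sketch with made-up block sizes and names:

  sb = 300; nb = 1000; m = 10; lambda = 0.1;    % illustrative block dimensions
  Zij = randn(nb, sb); X0 = randn(sb, m); Y0 = randn(nb, m);
  Rf = chol(Zij'*Zij + lambda*eye(sb));         % factor once, reuse across ADMM iterations
  U = Rf \ (Rf' \ (X0 + Zij'*Y0));              % U = [Z'Z + lambda*I]^{-1} (X + Z'Y)
  V = Zij * U;                                  % the remaining gemm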

Scalability

- Graph projection:

  U = [Z_ijᵀ Z_ij + λI]⁻¹ (X + Z_ijᵀ Y),   V = Z_ij U
  (inverse cached; products are gemm calls)

- High-performance implementation that can handle large column splits C = κ s/d by reorganizing updates to exploit shared-memory access and the structure of the graph projection.

  Memory:         (nm/R)(T + 5) + nd/R + 5sm + (1/κ)(Tnd/R + Tmd + sd) + κnms/(dR)
  Computation:    O(nd²/(TRκ)) [transform] + O(smd/(κT)) [graph-proj] + O(nsm/(TR)) [gemm]
  Communication:  O(sm log R)  (model reduce/broadcast)

- Stochastic and asynchronous versions may be possible to develop.

Randomized kernel methods on thousands of cores

- Triloka: 20 nodes, 16 cores per node; BG/Q: 1024 nodes, 16 (×4) cores per node.
- s = 100K, C = 200; strong scaling (n = 250k), weak scaling (n = 250k per node).

[Figure: MNIST strong scaling (speedup vs. number of MPI processes; t = 6 threads/process on Triloka, t = 64 on BG/Q) and weak scaling (time in seconds vs. number of MPI processes), each plotted together with classification accuracy (%).]

Comparisons on MNIST

See the "no distortion" results at http://yann.lecun.com/exdb/mnist/

  Method                                             Error (%)   Reference
  Gaussian kernel                                    1.4
  Poly (4)                                           1.1
  3-layer NN, 500+300 HU                             1.53        Hinton, 2005
  LeNet5                                             0.8         LeCun et al., 1998
  Large CNN (pretrained)                             0.60        Ranzato et al., NIPS 2006
  Large CNN (pretrained)                             0.53        Jarrett et al., ICCV 2009
  Scattering transform + Gaussian kernel             0.4         Bruna and Mallat, 2012
  20000 random features                              0.45        my experiments
  5000 random features                               0.52        my experiments
  Committee of 35 conv. nets [elastic distortions]   0.23        Ciresan et al., 2012

- When similar prior knowledge is enforced (invariant learning), the performance gaps vanish.
- RKHS mappings can, in principle, be used to implement CNNs.


Aside: libSkylark

http://xdata-skylark.github.io/libskylark/docs/sphinx/

- C/C++/Python library; MPI; Elemental/CombBLAS containers.
- Distributed sketching operators:  ‖Ax − b‖₂  ⟹  ‖S(Ax − b)‖₂
- Randomized least squares, SVD.
- Randomized kernel methods:
  – Modularity via prox operators
  – Sqr, hinge, L1, mult. logreg. losses
  – Regularizers: L1, L2

  Kernel                    Embedding z(x)                    Time
  Shift-invariant (RFT)     e^{−iGx}                          O(sd), O(s·nnz(x))
  Shift-invariant (FRFT)    e^{iSGHPHBx}                      O(s log d)
  Semigroup (RLT)           e^{−Gx}                           O(sd), O(s·nnz(x))
  Polynomial, deg q (PPT)   F⁻¹(F(C₁x) .* ... .* F(C_qx))     O(q(nnz(x) + s log s))

Rahimi & Recht, 2007; Pham & Pagh, 2013; Le, Sarlos and Smola, 2013
Random Laplace Feature Maps for Semigroup Kernels on Histograms, CVPR 2014, J. Yang, V.S., M. Mahoney, H. Avron, Q. Fan.

Efficiency of Random Embeddings

- TIMIT: 58.8M parameters (s = 400k, m = 147) vs. 19.9M parameters for the DNN.
- Draw S = [w_1 ... w_s] ∼ p and approximate the integral:

  k(x, z) = ∫_{R^d} e^{−i(x−z)ᵀw} p(w) dw ≈ (1/|S|) ∑_{w∈S} e^{−i(x−z)ᵀw}        (6)

- Integration error:

  ε_{p,S}[f] = | ∫_{[0,1]^d} f(x) p(x) dx − (1/s) ∑_{w∈S} f(w) |

- Monte Carlo convergence rate (see the sketch below): E[ε_{p,S}[f]²]^{1/2} ∝ s^{−1/2}
  – A 4-fold increase in s only cuts the error in half.
- Can we do better with a different sequence S?

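A quick MATLAB check of this rate on the Gaussian-kernel approximation (toy sizes; pdist2 from the Statistics Toolbox): each 4-fold increase in s should roughly halve the relative error.

  n = 200; d = 10; sigma = 1;
  X = randn(n, d);
  Kexact = exp(-pdist2(X, X).^2 / (2*sigma^2));
  for s = [500 2000 8000]                        % successive 4-fold increases in s
      Z = exp(-1i * X * randn(d, s) / sigma) / sqrt(s);
      fprintf('s = %5d   relative error = %.4f\n', s, norm(real(Z*Z') - Kexact)/norm(Kexact));
  end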

Quasi-Monte Carlo Sequences: Intuition

- Weyl 1916; Koksma 1942; Dick et al., 2013; Caflisch, 1998
- Consider approximating ∫_{[0,1]²} f(x) dx with (1/s) ∑_{w∈S} f(w) (see the sketch below).

[Figure: 2D scatter plots of s uniform (Monte Carlo) points vs. s Halton points on [0,1]².]

- Deterministic, correlated QMC sampling avoids the clustering and clumping effects seen in MC point sets.
- Hierarchical structure: coarse-to-fine sampling as s increases.

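A short MATLAB sketch of the two point sets compared in the figure above (haltonset is in the Statistics and Machine Learning Toolbox):

  s = 256;
  U = rand(s, 2);                                % i.i.d. uniform (Monte Carlo) points
  H = net(haltonset(2), s);                      % first s points of the 2-D Halton sequence
  subplot(1,2,1); plot(U(:,1), U(:,2), '.'); axis square; title('Uniform');
  subplot(1,2,2); plot(H(:,1), H(:,2), '.'); axis square; title('Halton');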

Star Discrepancy

- Integration error depends on the variation of f and the uniformity of S.
- Theorem (Koksma-Hlawka inequality, 1941/1961):

  ε_S[f] ≤ D*(S) · V_HK[f],   where   D*(S) = sup_{x ∈ [0,1]^d} | vol(J_x) − |{i : w_i ∈ J_x}| / s |

  and J_x = [0, x_1) × ... × [0, x_d) is the anchored box at x.


Quasi-Monte Carlo Sequences

- We seek sequences with low discrepancy.
- Theorem (Roth, 1954): D*(S) ≥ c_d (log s)^{(d−1)/2} / s.
- For the regular grid on [0,1)^d with s = m^d points, D* ≥ s^{−1/d}  ⟹  only optimal for d = 1.
- Theorem (Doerr, 2013): Let S with s points be chosen uniformly at random from [0,1)^d. Then E[D*(S)] ≥ √d/√s - the Monte Carlo rate.
- There exist low-discrepancy sequences that achieve a C_d (log s)^d / s bound, conjectured to be optimal.
  – MATLAB QMC generators: haltonset, sobolset, ...
- With d fixed, this bound actually grows until s ∼ e^d, reaching (d/e)^d!
  – "On the unreasonable effectiveness of QMC" in high dimensions.
  – RKHSs and kernels show up in modern QMC analysis!


Page 70: Kernels, Random Embeddings and Deep Learning · PDF fileKernels, Random Embeddings and Deep Learning Vikas Sindhwani IBM Research, NY October 28, 2014. ... IIBM DARPA XDATA project

How do standard QMC sequences perform?

I Compare K ≈ Z(X)Z(X)T where Z(X) = e−iXG where G isdrawn from a QMC sequence generator instead.

[Figure: relative error on ‖K‖₂ versus number of random features, comparing MC, Halton, Sobol’, Digital net, and Lattice sequences; left panel: USPST (n = 1506), right panel: CPU (n = 6554).]

I QMC sequences consistently better.

I Why are some QMC sequences better than others, e.g., Halton over Sobol’?

I Can we learn sequences even better adapted to our problem class?
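A hedged numerical sketch of this comparison (not the USPST/CPU experiment; data, bandwidth, and feature count below are made up): build Gaussian random Fourier features with i.i.d. Gaussian frequencies versus Halton points mapped through the Gaussian inverse CDF, and measure the relative spectral-norm error of the kernel approximation.

# Sketch: MC vs. QMC (Halton) random Fourier features for the Gaussian kernel.
# Assumptions: synthetic data, sigma = 1, s = 512 features; not the slide's setup.
import numpy as np
from scipy.stats import qmc, norm
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, d, s, sigma = 500, 10, 512, 1.0
X = rng.standard_normal((n, d))

# Exact Gaussian kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
K = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))

def fourier_features(X, G):
    """Real-valued random Fourier features for frequency matrix G (s x d)."""
    P = X @ G.T / sigma                                  # n x s projections; frequencies are G / sigma
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(G.shape[0])

def rel_error(G):
    Z = fourier_features(X, G)
    return np.linalg.norm(K - Z @ Z.T, 2) / np.linalg.norm(K, 2)

G_mc = rng.standard_normal((s, d))                       # i.i.d. Gaussian frequencies
U = qmc.Halton(d=d, scramble=True, seed=0).random(s)     # Halton points in [0, 1)^d
G_qmc = norm.ppf(U)                                      # map to Gaussian via the inverse CDF

print("MC  relative error:", rel_error(G_mc))
print("QMC relative error:", rel_error(G_qmc))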


RKHSs in QMC Theory

I “Nice” integrands f ∈ H_h, where h(·, ·) is a kernel function.

I By the reproducing property and the Cauchy–Schwarz inequality:

    | ∫_{R^d} f(x) p(x) dx − (1/s) ∑_{w ∈ S} f(w) |  ≤  ‖f‖_{H_h} D_{h,p}(S)        (7)

  where D²_{h,p} is a discrepancy measure:

    D_{h,p}(S)² = ‖ m_p(·) − (1/s) ∑_{w ∈ S} h(w, ·) ‖²_{H_h},  with mean embedding  m_p = ∫_{R^d} h(x, ·) p(x) dx

                = const. − (2/s) ∑_{l=1}^{s} ∫_{R^d} h(w_l, ω) p(ω) dω   [alignment with p, w_l ≈ ω]
                  + (1/s²) ∑_{l=1}^{s} ∑_{j=1}^{s} h(w_l, w_j)           [pairwise similarity in S]        (8)
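A concrete illustration of the expansion in Eq. (8): the sketch below is my own example with a Gaussian quality kernel h and p = N(0, I_d), a case where the constant and the alignment term have elementary closed forms, rather than the sinc kernel used on the next slide.

# Sketch (assumptions above): evaluate D_{h,p}(S)^2 of Eq. (8) with
#   h(u, v) = exp(-||u - v||^2 / (2 gamma^2)),   p = N(0, I_d).
import numpy as np
from scipy.stats import qmc, norm
from scipy.spatial.distance import cdist

def gaussian_discrepancy_sq(S, gamma=1.0):
    s, d = S.shape
    # const = E_{x,y ~ p} h(x, y)
    const = (gamma**2 / (gamma**2 + 2.0)) ** (d / 2.0)
    # alignment term a(w) = E_{omega ~ p} h(w, omega)
    align = (gamma**2 / (gamma**2 + 1.0)) ** (d / 2.0) * \
            np.exp(-np.sum(S**2, axis=1) / (2.0 * (gamma**2 + 1.0)))
    # pairwise similarity term
    H = np.exp(-cdist(S, S, "sqeuclidean") / (2.0 * gamma**2))
    return const - 2.0 * align.mean() + H.sum() / s**2

d, s = 5, 256
mc_pts = np.random.default_rng(0).standard_normal((s, d))
qmc_pts = norm.ppf(qmc.Halton(d=d, scramble=True, seed=0).random(s))   # Halton -> Gaussian

print("MC  D^2:", gaussian_discrepancy_sq(mc_pts))
print("QMC D^2:", gaussian_discrepancy_sq(qmc_pts))

With these closed forms, the only O(s²) work is the pairwise similarity term.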


Box Discrepancy

I Assume that the (shifted) data lives in a box:

    □_b = { x − z : −b ≤ x − z ≤ b,  x, z ∈ X }

I Class of functions we want to integrate:

    F_b = { f(w) = e^{−i ∆^T w} : ∆ ∈ □_b }

I Theorem: Integration error is proportional to the “box discrepancy”:

    E_{f ∼ U(F_b)} [ ε_{S,p}[f]² ]^{1/2} ∝ D□_p(S)        (9)

  where D□_p(S) is the discrepancy associated with the sinc kernel:

    sinc_b(u, v) = π^{−d} ∏_{j=1}^{d} sin(b_j (u_j − v_j)) / (u_j − v_j)

  It can be evaluated in closed form for the Gaussian density (a numerical sketch follows below).
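A minimal sketch of D□_p(S)², assuming p = N(0, I_d) and reusing the expansion of Eq. (8) with h = sinc_b. The Gaussian closed form mentioned above is not reproduced on the slide, so the constant and alignment terms are estimated by Monte Carlo here; only the pairwise sinc term is exact.

# Sketch (assumptions above): sinc_b kernel and a Monte Carlo estimate of D_p(S)^2.
import numpy as np

def sinc_b(U, V, b):
    """Gram matrix of sinc_b(u, v) = pi^{-d} prod_j sin(b_j (u_j - v_j)) / (u_j - v_j)."""
    D = U[:, None, :] - V[None, :, :]
    Dsafe = np.where(np.abs(D) < 1e-12, 1.0, D)
    R = np.where(np.abs(D) < 1e-12, b, np.sin(b * Dsafe) / Dsafe)   # limit is b_j as u_j -> v_j
    return R.prod(axis=2) / np.pi ** U.shape[1]

def box_discrepancy_sq(S, b, n_mc=2000, seed=0):
    """D_p(S)^2 via the h = sinc_b expansion; const/alignment terms estimated under p = N(0, I)."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((n_mc, S.shape[1]))
    const = sinc_b(omega[: n_mc // 2], omega[n_mc // 2:], b).mean()   # approx. E_{x,y~p} sinc_b(x, y)
    align = sinc_b(S, omega, b).mean(axis=1)                          # approx. E_{omega~p} sinc_b(w_l, omega)
    pair = sinc_b(S, S, b)
    return const - 2.0 * align.mean() + pair.mean()

d, s = 3, 64
b = 0.5 * np.ones(d)
S = np.random.default_rng(1).standard_normal((s, d))
print("estimated D^2:", box_discrepancy_sq(S, b))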


Does Box discrepancy explain behaviour of QMC sequences?

[Figure: D□(S)² versus number of samples on the CPU dataset (d = 21), comparing Digital net, MC (expected), Halton, Sobol’, and Lattice sequences.]


Learning Adaptive QMC Sequences

Unlike the star discrepancy, the box discrepancy admits numerical optimization:

    S∗ = arg min_{S = (w_1, . . . , w_s) ∈ R^{ds}} D□(S),   S^(0) = Halton.        (10)

[Figure: CPU dataset, s = 100; normalized D□(S)², maximum squared error, mean squared error, and ‖K̃ − K‖₂/‖K‖₂ plotted against the optimization iteration.]

However, the full impact on large-scale problems is an open research topic (a sketch of the optimization follows below).
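A toy sketch of the optimization in Eq. (10), my own rather than the paper's: it reuses the Gaussian-kernel discrepancy from the earlier sketch instead of the box discrepancy itself, and scipy.optimize's L-BFGS-B with finite-difference gradients, starting from Gaussian-mapped Halton points.

# Sketch (assumptions above): descend on the Eq. (8) discrepancy over the point set S.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist
from scipy.stats import qmc, norm

d, s, gamma = 3, 32, 1.0

def disc_sq(flat_S):
    S = flat_S.reshape(s, d)
    const = (gamma**2 / (gamma**2 + 2.0)) ** (d / 2.0)
    align = (gamma**2 / (gamma**2 + 1.0)) ** (d / 2.0) * \
            np.exp(-np.sum(S**2, axis=1) / (2.0 * (gamma**2 + 1.0)))
    H = np.exp(-cdist(S, S, "sqeuclidean") / (2.0 * gamma**2))
    return const - 2.0 * align.mean() + H.sum() / s**2

S0 = norm.ppf(qmc.Halton(d=d, scramble=True, seed=0).random(s))    # S^(0) = Halton
res = minimize(disc_sq, S0.ravel(), method="L-BFGS-B", options={"maxiter": 200})

print("D^2 at Halton init  :", disc_sq(S0.ravel()))
print("D^2 after optimizing:", res.fun)

By construction the optimizer can only improve on the initialization's D² value; whether that gain carries over to large-scale kernel approximation is, as the slide notes, open.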


Outline

Motivation and Background

Scalable Kernel Methods
  Random Embeddings + Distributed Computation (ICASSP, JSM 2014)
  libSkylark: An open source software stack
  Quasi-Monte Carlo Embeddings (ICML 2014)

Synergies?


Randomization-vs-Optimization

[Diagram: a kernel k(x, z) expanded into features e^{i g′_1 x}, . . . , e^{i g′_5 x}, with the frequencies obtained either by Randomization (p ⇔ k) or by Optimization.]

I Jarrett et al., 2009, What is the Best Multi-Stage Architecture for Object Recognition?: “The most astonishing result is that systems with random filters and no filter learning whatsoever achieve decent performance.”

I On Random Weights and Unsupervised Feature Learning, Saxe et al., ICML 2011: a “surprising fraction of performance can be attributed to architecture alone.”


Deep Learning with Kernels?

Maps across layers can be parameterized using more general nonlinearities (kernels).

[Diagram: image pixel x1, patch Ω1, map f1(x1) with f1 ∈ H_{k1}; then x2, patch Ω2, map f2(x2) with f2 ∈ H_{k2}.]

I Mathematics of the Neural Response, Smale et al., FCM (2010).
I Convolutional Kernel Networks, Mairal et al., NIPS 2014.

I SimNets: A Generalization of Convolutional Networks, Cohen and Shashua, 2014.

– learns networks 1/8 the size of comparable ConvNets.

Figure adapted from Mairal et al, NIPS 2014
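A toy sketch of this idea, mine rather than the CKN construction of Mairal et al.: compose two approximate Gaussian-kernel feature maps, each realized with random Fourier features, so that every layer is a (randomized) RKHS feature map instead of a learned weight matrix followed by a fixed nonlinearity.

# Toy sketch (assumptions above): a two-layer "kernel network" built from random Fourier features.
import numpy as np

rng = np.random.default_rng(0)

def rff_layer(dim_in, dim_out, sigma, rng):
    """Return a map x -> z(x) approximating the feature map of a Gaussian kernel."""
    W = rng.standard_normal((dim_in, dim_out)) / sigma   # frequencies ~ N(0, I / sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, dim_out)           # random phases
    return lambda X: np.sqrt(2.0 / dim_out) * np.cos(X @ W + b)

layer1 = rff_layer(dim_in=64, dim_out=256, sigma=1.0, rng=rng)
layer2 = rff_layer(dim_in=256, dim_out=256, sigma=1.0, rng=rng)

X = rng.standard_normal((32, 64))      # a batch of 32 inputs (illustrative)
Z = layer2(layer1(X))                  # composed representation; shape (32, 256)
print(Z.shape)                         # a linear model trained on Z would complete the pipeline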


Conclusions

I Some empirical evidence suggests that once Kernel methods are scaled up and embody similar statistical principles, they are competitive with Deep Neural Networks.

– Randomization and Distributed Computation both required.
– Ideas from QMC Integration techniques are promising.

I Opportunities for designing new algorithms combining insights from deep learning with the generality and mathematical richness of kernel methods.
