
Page 1: overview of deep learning final - GitHub Pages

Joseph E. Gonzalez, [email protected]

Deep Learning Overview

Borrowed heavily from excellent talks by:
- Adam Coates: http://ai.stanford.edu/~acoates/coates_dltutorial_2013.pptx
- Fei-Fei Li and Andrej Karpathy: http://cs231n.stanford.edu/syllabus.html

Page 2: overview of deep learning final - GitHub Pages

Machine Learning → Function Approximation

Object Recognition: image → Label: Cat
Speech Recognition: audio → "The cat in the hat"
Robotic Control
Machine Translation: "The cat in the hat." → "El gato en el sombrero"

Page 3: overview of deep learning final - GitHub Pages

Function Approximation Pipeline

Object Recognition: image → Label: Cat
Speech Recognition: audio → "The cat in the hat"

Page 4: overview of deep learning final - GitHub Pages

Function Approximation Pipeline

Object Recognition: image → SVM → Label: Cat
Speech Recognition: audio → HMM → "The cat in the hat"

Page 5: overview of deep learning final - GitHub Pages

Function Approximation Pipeline

Often build multiple layers of features to abstract the input:
image → features → … → SVM → Label: Cat

Deep learning tries to automate this process.

Page 6: overview of deep learning final - GitHub Pages

Function Approximation Pipeline

Deep Learning: automatically learn a deep hierarchy of abstract features along with the classifier.
- Typically using neural networks
- Composable general function approximators

image → Low-level Features → Mid-level Features → High-level Features → SVM → Label: Cat
(learn a progressively more abstract representation at each layer)

Page 7: overview of deep learning final - GitHub Pages

Why is Deep Learning so Successful?
- Feature engineering is essential to many applications
  - Hand-engineering "layers" of representation is expensive
  - Deep learning automates the process of feature engineering
- Previous attempts were limited by data and computation
  - We now have access to substantial amounts of data and computation
- Deep learning techniques are inherently compositional
  - Easy to extend and combine → rapid development

Page 8: overview of deep learning final - GitHub Pages

Crash Course in Neural Networks

Page 9: overview of deep learning final - GitHub Pages

Supervised Learning
- Predict: is this a picture of a cat?

x = (input image)

Page 10: overview of deep learning final - GitHub Pages

Supervised Learning
- Predict: is this a picture of a cat?
  $x =$ (input image) $\in \mathbb{R}^d$, with label $y \in \{0, 1\}$
- and training data: $\mathcal{D}_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{n}$
- Learn a function: $f_w(x) = y$
- by estimating the parameters $w$

Page 11: overview of deep learning final - GitHub Pages

Logistic Regression for Binary Classification
- Consider the simple function family:
  $$f_w(x) = \sigma\left(w^T x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x)$$
- With non-linearity:
  $$\sigma(u) = \frac{1}{1 + \exp(-u)}$$
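To make this concrete, here is a minimal NumPy sketch (not from the slides) of this prediction function; the weight vector `w` and input `x` are made-up toy values:

```python
import numpy as np

def sigmoid(u):
    # Logistic non-linearity: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-u))

def predict_proba(w, x):
    # f_w(x) = sigma(w^T x) = P(y = 1 | x)
    return sigmoid(np.dot(w, x))

# Toy example with d = 3 features (values are illustrative only).
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.2, 2.0])
print(predict_proba(w, x))   # probability that y = 1
```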

Page 12: overview of deep learning final - GitHub Pages

Logistic Regression as a "Neuron"
- Consider the simple function family:
  $$f_w(x) = \sigma\left(w^T x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x), \qquad \sigma(u) = \frac{1}{1 + \exp(-u)}$$
- The neuron "fires" if the weighted sum of its inputs is greater than zero ("perceptron").

(Figure: inputs $x_j$ feed the neuron through weighted edges $w_j x_j$, e.g. $w_{52} x_{52}$, plus a bias term $w_0$.)

Page 13: overview of deep learning final - GitHub Pages

Learning the logistic regression model
- Consider the simple function family:
  $$f_w(x) = \sigma\left(w^T x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x), \qquad \sigma(u) = \frac{1}{1 + \exp(-u)}$$
- Goal: find $w$ that minimizes the loss on the training data:
  $$L(w) = \sum_{i=1}^{n} \ell\left(f_w(x_i), y_i\right),$$
  where each per-example term comes from the likelihood $f_w(x_i)^{y_i}\left(1 - f_w(x_i)\right)^{1 - y_i}$ (i.e., $\ell$ is the negative log-likelihood).
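A minimal NumPy sketch (not from the slides) of this negative log-likelihood loss; the toy data values are made up for illustration:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nll_loss(w, X, y):
    # Negative log-likelihood of the labels under the logistic model:
    #   -sum_i [ y_i * log f_w(x_i) + (1 - y_i) * log(1 - f_w(x_i)) ]
    p = sigmoid(X @ w)          # f_w(x_i) for every example
    eps = 1e-12                 # avoid log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Toy training set: 4 examples, 3 features (values are illustrative only).
X = np.array([[1.0, 0.2, 2.0],
              [0.5, 1.0, 0.1],
              [2.0, 0.3, 0.7],
              [0.1, 1.5, 0.4]])
y = np.array([1, 0, 1, 0])
w = np.zeros(3)
print(nll_loss(w, X, y))        # loss at the all-zero weight vector
```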

Page 14: overview of deep learning final - GitHub Pages

Numerical Optimization
- Gradient Descent:
  $$w^{(t+1)} = w^{(t)} - \eta_t\, \nabla_w L\!\left(w^{(t)}\right)$$
- Convex → guaranteed to find the optimal $w$, but each step is slow because the full gradient sums over all $n$ examples:
  $$\nabla_w L(w) = \sum_{i=1}^{n} \nabla_w\, \ell\left(f_w(x_i), y_i\right)$$
- Stochastic gradient descent: approximate the full gradient using a single randomly sampled example:
  $$\nabla_w L(w) \approx n\, \nabla_w\, \ell\left(f_w(x_i), y_i\right) \quad \text{for } (x_i, y_i) \sim \mathcal{D}$$

(Figure: loss surface $L(w)$ with the descent direction $-\nabla_w L(w)$.)
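A minimal sketch (not from the slides) of the stochastic update rule applied to the logistic regression loss above; the step size, epoch count, and toy data are illustrative:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def grad_example(w, x_i, y_i):
    # Gradient of the per-example negative log-likelihood for logistic
    # regression: (f_w(x_i) - y_i) * x_i
    return (sigmoid(np.dot(w, x_i)) - y_i) * x_i

def sgd(X, y, eta=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        i = rng.integers(n)                       # sample one training example
        w -= eta * grad_example(w, X[i], y[i])    # noisy but cheap update
    return w

# Full-batch gradient descent would instead sum grad_example over all i
# before each update: smoother convergence, but each step costs O(n).
X = np.array([[1.0, 0.2], [0.5, 1.0], [2.0, 0.3], [0.1, 1.5]])
y = np.array([1, 0, 1, 0])
print(sgd(X, y))
```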

Page 15: overview of deep learning final - GitHub Pages

Logistic Regression: Strengths and Limitations
- Widely used machine learning technique
  - Convex → efficient to learn
  - Easy to interpret model weights
  - Works well given good features
- Limitations:
  - Restricted to linear relationships → sensitive to choice of features

(Figure: in raw pixel space, Pixel 1 vs. Pixel 2, the two classes $y = 0$ and $y = 1$ are not separable by a single linear boundary.)

Page 16: overview of deep learning final - GitHub Pages

Feature Engineering

- Rather than using raw pixels, build/train feature functions:
  $$f_w(x) \quad \longrightarrow \quad f_w(g(x)), \qquad g(x) = (\text{Paws}, \text{Furry})$$

(Figure: the classes overlap in raw pixel space, Pixel 1 vs. Pixel 2, but become linearly separable in the feature space Paws vs. Furry.)
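A minimal sketch (not from the slides) of this composition; the detectors inside `g` are made-up placeholders standing in for hand-engineered or trained feature functions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def g(image):
    # Hypothetical feature function: maps raw pixels to two semantic
    # features, e.g. (has_paws, is_furry).  Stubbed out for illustration.
    has_paws = float(image.mean() > 0.5)   # placeholder detector
    is_furry = float(image.std() > 0.1)    # placeholder detector
    return np.array([has_paws, is_furry])

def f(w, features):
    # Linear classifier applied on top of the engineered features.
    return sigmoid(np.dot(w, features))

image = np.random.rand(32, 32)       # toy "image"
w = np.array([1.5, 2.0])             # weights over (Paws, Furry)
print(f(w, g(image)))                # f_w(g(x))
```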

Page 17: overview of deep learning final - GitHub Pages

Composing Linear Models and Nonlinearities

Input layer (pixels): $x_1, x_2, \ldots, x_d$. Each unit $k$ of the first hidden layer computes a weighted sum of the inputs and passes it through the non-linearity:
$$z^1_k = \sum_{j=1}^{d} W^0_{kj}\, x_j, \qquad h^1_k = \sigma\!\left(z^1_k\right), \qquad k = 1, \ldots, k_1$$
producing the hidden layer $h^1_1, h^1_2, \ldots, h^1_{k_1}$.

Page 18: overview of deep learning final - GitHub Pages

Composing Linear Models and Nonlinearities

The second layer is built the same way on top of the first layer's outputs $h^1_1, \ldots, h^1_{k_1}$:
$$z^2_k = \sum_{j=1}^{k_1} W^1_{kj}\, h^1_j, \qquad h^2_k = \sigma\!\left(z^2_k\right)$$

Page 19: overview of deep learning final - GitHub Pages

Neural Networks
- Composing "perceptrons":
  $$y = f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\, \sigma\!\left(W^1\, \sigma\!\left(W^0 x\right)\right)\right)$$
  $$x \;\rightarrow\; \sigma(W^0 x) \;\rightarrow\; h^1 \;\rightarrow\; \sigma(W^1 h^1) \;\rightarrow\; h^2 \;\rightarrow\; \sigma(W^2 h^2) \;\rightarrow\; f$$

(Figure: input layer $x_1, \ldots, x_d$, hidden layers $h^1_1, \ldots, h^1_{k_1}$ and $h^2_1, \ldots, h^2_{k_2}$, and output $f$, connected by the weight matrices $W^0$, $W^1$, $W^2$.)
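A minimal NumPy sketch (not from the slides) of this forward pass; the layer sizes and random weights are illustrative only:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W0, W1, W2):
    # f_{W0,W1,W2}(x) = sigma(W2 sigma(W1 sigma(W0 x)))
    h1 = sigmoid(W0 @ x)      # first hidden layer
    h2 = sigmoid(W1 @ h1)     # second hidden layer
    return sigmoid(W2 @ h2)   # output

# Toy dimensions: d = 4 inputs, k1 = 5 and k2 = 3 hidden units.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(5, 4))
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(1, 3))
x = rng.normal(size=4)
print(forward(x, W0, W1, W2))
```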

Page 20: overview of deep learning final - GitHub Pages

Logistic Regression: input → output (a single layer).

Deep Learning → many hidden layers …

(Figure: the full GoogLeNet architecture, captioned "GoogLeNet network with all the bells and whistles": a long stack of Conv, MaxPool, LocalRespNorm, DepthConcat, AveragePool, FC, and Softmax layers.)

Page 21: overview of deep learning final - GitHub Pages

Neural Networks
- Composing non-linear models (e.g., logistic regression):
  $$y = f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\, \sigma\!\left(W^1\, \sigma\!\left(W^0 x\right)\right)\right)$$
  $$x \;\rightarrow\; \sigma(W^0 x) \;\rightarrow\; h^1 \;\rightarrow\; \sigma(W^1 h^1) \;\rightarrow\; h^2 \;\rightarrow\; \sigma(W^2 h^2) \;\rightarrow\; f$$
- Learn $W^0$, $W^1$, and $W^2$ using backpropagation + SGD.

Page 22: overview of deep learning final - GitHub Pages

Backpropagation in Neural Networks
- Need to compute the gradient of the loss with respect to $W^0$, $W^1$, and $W^2$:
  $$\nabla_{W^0, W^1, W^2}\, L\!\left(y,\, f_{W^0, W^1, W^2}(x)\right), \qquad f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\, \sigma\!\left(W^1\, \sigma\!\left(W^0 x\right)\right)\right)$$
- Use the chain rule to push gradients back through the dataflow graph:
  $$x \;\rightarrow\; \sigma(W^0 x) \;\rightarrow\; h^1 \;\rightarrow\; \sigma(W^1 h^1) \;\rightarrow\; h^2 \;\rightarrow\; \sigma(W^2 h^2) \;\rightarrow\; f$$

Page 23: overview of deep learning final - GitHub Pages

Backpropagation in Neural Networks
- Define a general operator $g(W, h)$ that takes an input $h$ and parameters $W$ and produces an output $g$, e.g. each stage of
  $$x \;\rightarrow\; \sigma(W^0 x) \;\rightarrow\; h^1 \;\rightarrow\; \sigma(W^1 h^1) \;\rightarrow\; h^2 \;\rightarrow\; \sigma(W^2 h^2) \;\rightarrow\; f$$
- Each operator only needs local derivatives to pass the gradient to its own parameters and back to its input:
  $$\frac{\partial L(W)}{\partial W_{kl}} = \sum_{j} \frac{\partial L(W)}{\partial g_j}\, \frac{\partial g_j}{\partial W_{kl}}, \qquad \frac{\partial L(W)}{\partial h_k} = \sum_{j} \frac{\partial L(W)}{\partial g_j}\, \frac{\partial g_j}{\partial h_k}$$
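A minimal sketch (not from the slides) of such an operator for one dense-plus-sigmoid stage, implementing exactly these two chain-rule expressions:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def dense_sigmoid_forward(W, h):
    # g(W, h) = sigma(W h); cache what the backward pass will need.
    g = sigmoid(W @ h)
    return g, (W, h, g)

def dense_sigmoid_backward(dL_dg, cache):
    # Given dL/dg from the layer above, apply the chain rule locally:
    #   dL/dW_kl = dL/dg_k * g_k (1 - g_k) * h_l
    #   dL/dh_l  = sum_k dL/dg_k * g_k (1 - g_k) * W_kl
    W, h, g = cache
    dL_dz = dL_dg * g * (1.0 - g)   # through the sigmoid
    dL_dW = np.outer(dL_dz, h)      # gradient for this layer's weights
    dL_dh = W.T @ dL_dz             # gradient passed to the layer below
    return dL_dW, dL_dh

W = np.array([[0.1, -0.2], [0.3, 0.4]])
h = np.array([1.0, 2.0])
g, cache = dense_sigmoid_forward(W, h)
print(dense_sigmoid_backward(np.ones_like(g), cache))
```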

Page 25: overview of deep learning final - GitHub Pages

Simple Example:
$$f(x, y, z) = (x + y) \cdot y \cdot z$$

Forward pass, with $x = 2$, $y = 3$, $z = 4$:
$$h_1 = x + y = 5, \qquad h_2 = h_1 \cdot y = 15, \qquad f = h_2 \cdot z = 60$$

Backward pass (chain rule, starting from $\partial f / \partial f = 1$):
$$\frac{\partial f}{\partial z} = h_2 = 15, \quad \frac{\partial f}{\partial h_2} = z = 4, \quad \frac{\partial f}{\partial h_1} = \frac{\partial f}{\partial h_2}\, y = 4 \cdot 3 = 12, \quad \frac{\partial f}{\partial x} = 12, \quad \frac{\partial f}{\partial y} = \underbrace{4 \cdot 5}_{\text{via } h_2} + \underbrace{12}_{\text{via } h_1} = 32$$
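The same computation written out as straight-line code (mirroring the slide's numbers, not an autodiff library):

```python
# Forward and backward pass for f(x, y, z) = (x + y) * y * z with x=2, y=3, z=4.
x, y, z = 2.0, 3.0, 4.0

# Forward pass
h1 = x + y          # 5
h2 = h1 * y         # 15
f = h2 * z          # 60

# Backward pass (chain rule, starting from df/df = 1)
df_dh2 = z                          # 4
df_dz = h2                          # 15
df_dh1 = y * df_dh2                 # 12 = 4 * 3
df_dy = h1 * df_dh2 + 1.0 * df_dh1  # 20 + 12 = 32 (two paths reach y)
df_dx = 1.0 * df_dh1                # 12

print(f, df_dx, df_dy, df_dz)       # 60.0 12.0 32.0 15.0
```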

Page 26: overview of deep learning final - GitHub Pages

Backpropagation
- Requires all operators to have well-defined sub-gradients, e.g. tanh, logistic (sigmoid), ReLU = max(0, x), abs (see the sketch below).
- Enables Automatic Differentiation!
  - User defines the forward flow → system derives an efficient training algorithm.
  - Easy to explore compositions of new modules.
- Enables Efficient Gradient Computation
  - Cache forward calculations to accelerate gradients.
  - Compile optimized gradient computation.
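A minimal NumPy sketch (not from the slides) of these non-linearities and their (sub-)gradients:

```python
import numpy as np

# Common non-linearities and their (sub-)gradients; each operator only
# needs these local derivatives for backpropagation to work.
def tanh(u):      return np.tanh(u)
def dtanh(u):     return 1.0 - np.tanh(u) ** 2

def logistic(u):  return 1.0 / (1.0 + np.exp(-u))
def dlogistic(u): return logistic(u) * (1.0 - logistic(u))

def relu(u):      return np.maximum(0.0, u)
def drelu(u):     return (u > 0).astype(float)   # sub-gradient: 0 at u = 0

def absval(u):    return np.abs(u)
def dabsval(u):   return np.sign(u)              # sub-gradient: 0 at u = 0

u = np.linspace(-3.0, 3.0, 7)
print(relu(u), drelu(u))
```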

Page 27: overview of deep learning final - GitHub Pages

General Purpose Systems for DNNs
- Distributed Parameter Servers
  - TensorFlow (DistBelief)
  - Microsoft Adam
- GPU Systems
  - TensorFlow
  - Caffe
  - Theano

Demo of TensorFlow
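The live demo is not reproduced here; as a rough stand-in, this is a minimal sketch of the two-hidden-layer classifier from the earlier slides in the tf.keras API (TensorFlow 2.x, which postdates these slides); the layer sizes and synthetic data are made up:

```python
import numpy as np
import tensorflow as tf

# Two hidden layers of sigmoid units and a sigmoid output, trained with SGD.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(5, activation="sigmoid"),
    tf.keras.layers.Dense(3, activation="sigmoid"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="binary_crossentropy", metrics=["accuracy"])

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")   # made-up labels
model.fit(X, y, epochs=10, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```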

Page 28: overview of deep learning final - GitHub Pages

Challenges of Deep Neural Networks
- Non-convex → (stochastic) gradient descent is not guaranteed to converge to the optimum
  - Soln: there appear to be many good local optima
- High-dimensional → gradient descent converges slowly
  - Soln: hardware acceleration, improved algorithms with momentum, … (see the sketch below)
- Rich function class → overfitting
  - Soln: more data, early stopping, dropout, parameter sharing
- Saturation of the sigmoid → decaying gradients
  - Soln: other forms of non-linearity, e.g. ReLU = max(0, x) instead of the logistic (sigmoid)
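As one example of an "improved algorithm with momentum", a minimal sketch (not from the slides) of the classical momentum update, assuming a `grad` function that returns a (possibly stochastic) gradient:

```python
import numpy as np

def sgd_momentum(grad, w0, eta=0.01, mu=0.9, steps=1000):
    # Classical momentum: keep a running "velocity" that smooths noisy
    # stochastic gradients and accelerates progress along shallow directions.
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = mu * v - eta * grad(w)   # blend previous direction with new gradient
        w = w + v
    return w

# Example on a simple quadratic: the gradient of 0.5 * ||w||^2 is w itself.
print(sgd_momentum(lambda w: w, w0=[5.0, -3.0]))
```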

Page 29: overview of deep learning final - GitHub Pages

Convolutional Neural Networks: Exploiting Spatial Sparsity

Page 30: overview of deep learning final - GitHub Pages

Example: AlexNet (Krizhevsky et al., NIPS 2012)

- Introduced in 2012, significantly outperformed the state of the art (top-5 error of 16% compared to the runner-up with 26% error)
- Covered in reading …
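For background on the convolutional layers such networks are built from (a generic sketch, not AlexNet's actual implementation), a single-channel 2D convolution can be written as:

```python
import numpy as np

def conv2d(image, kernel):
    # Naive single-channel "valid" convolution: slide the kernel over the
    # image and take a weighted sum at each position.  Real CNN layers add
    # many channels, strides, padding, and a non-linearity, and run on GPUs.
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(8, 8)
edge_kernel = np.array([[1.0, -1.0]])    # toy horizontal-edge detector
print(conv2d(image, edge_kernel).shape)  # (8, 7)
```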

Page 31: overview of deep learning final - GitHub Pages

Growth in Model Complexity

LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition", 1989:
- 9K parameters, 66K ops

Szegedy et al., "Going Deeper with Convolutions", 2014 (Figure: the full GoogLeNet architecture, "GoogLeNet network with all the bells and whistles"):
- Winner of the ImageNet Large-Scale Visual Recognition Challenge 2014
- GoogLeNet (7.89% error)
  - 22 layers
  - 6.8M parameters
  - 1.5B flops
  - Ensemble of 7 models
- Current best: ResNet (3.57% error)
  - 152 layers
  - 2.3M parameters
  - 11.3B flops
  - Ensemble of 6 models

Page 32: overview of deep learning final - GitHub Pages

Cost of Computation (from Prediction Serving Lecture)

- 100's of millions of parameters + convolutions & unrolling
- Requires hardware acceleration


Table 2 shows the results for the Titan X GPU and the Xeon E5-2698 v3 server-class processor.

Network: AlexNet

| Batch Size                   | Metric                | Titan X (FP32) | Xeon E5-2698 v3 (FP32) |
|------------------------------|-----------------------|----------------|------------------------|
| 1                            | Inference Performance | 405 img/sec    | 76 img/sec             |
| 1                            | Power                 | 164.0 W        | 111.7 W                |
| 1                            | Performance/Watt      | 2.5 img/sec/W  | 0.7 img/sec/W          |
| 128 (Titan X) / 48 (Xeon E5) | Inference Performance | 3216 img/sec   | 476 img/sec            |
| 128 (Titan X) / 48 (Xeon E5) | Power                 | 227.0 W        | 149.0 W                |
| 128 (Titan X) / 48 (Xeon E5) | Performance/Watt      | 14.2 img/sec/W | 3.2 img/sec/W          |

Table 2: Inference performance, power, and energy efficiency on Titan X and Xeon E5-2698 v3.

The comparison between Titan X and Xeon E5 reinforces the same conclusion as the comparison between Tegra X1 and Core i7: GPUs appear to be capable of significantly higher energy efficiency for deep learning inference on AlexNet. In the case of Titan X, the GPU not only provides much better energy efficiency than the CPU, but it also achieves substantially higher performance at over 3000 images/second in the large-batch case compared to less than 500 images/second on the CPU. While larger batch sizes are more efficient to process, the comparison between Titan X and Xeon E5 with no batching proves that the GPU’s efficiency advantage is present even for smaller batch sizes. In comparison with Tegra X1, the Titan X manages to achieve competitive Performance/Watt despite its much bigger GPU, as the large 12 GB framebuffer allows it to run more efficient but memory-capacity-intensive FFT-based convolutions.

Finally, Table 3 presents inference results on GoogLeNet. As mentioned before, IDLF provides no support for GoogLeNet, and alternative deep learning frameworks have never been optimized for CPU performance. Therefore, we omit CPU results here and focus entirely on the GPUs. As GoogLeNet is a much more demanding network than AlexNet, Tegra X1 cannot run batch size 128 inference due to insufficient total memory capacity (4GB on a Jetson™ TX1 board). The massive framebuffer on the Titan X is sufficient to allow inference with batch size 128.

Network: GoogLeNet

| Batch Size                    | Metric                | Titan X (FP32) | Tegra X1 (FP32) | Tegra X1 (FP16) |
|-------------------------------|-----------------------|----------------|-----------------|-----------------|
| 1                             | Inference Performance | 138 img/sec    | 33 img/sec      | 33 img/sec      |
| 1                             | Power                 | 119.0 W        | 5.0 W           | 4.0 W           |
| 1                             | Performance/Watt      | 1.2 img/sec/W  | 6.5 img/sec/W   | 8.3 img/sec/W   |
| 128 (Titan X) / 64 (Tegra X1) | Inference Performance | 863 img/sec    | 52 img/sec      | 75 img/sec      |
| 128 (Titan X) / 64 (Tegra X1) | Power                 | 225.0 W        | 5.9 W           | 5.8 W           |
| 128 (Titan X) / 64 (Tegra X1) | Performance/Watt      | 3.8 img/sec/W  | 8.8 img/sec/W   | 12.8 img/sec/W  |

Table 3: GoogLeNet inference results on Tegra X1 and Titan X. Tegra X1's total memory capacity is not sufficient to run batch size 128 inference.

Compared to AlexNet, the results show significantly lower absolute performance values, indicating how much more computationally demanding GoogLeNet is. However, even on GoogLeNet, all GPUs are capable of achieving real-time performance on a 30 fps camera feed.

http://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf

Page 33: overview of deep learning final - GitHub Pages

Improvement on the ImageNet Benchmark

Page 34: overview of deep learning final - GitHub Pages

Recurrent Neural Networks: Modeling Sequence Structure
- State of the art in speech recognition and machine translation
- Require LSTM and GRU cells to address long-range dependencies (a plain RNN cell is sketched below)
- Similar to the HMM from classical Bayesian ML

Example (translation): "The cat in the hat." → "El gato en el sombrero"
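A minimal NumPy sketch (not from the slides) of a plain recurrent cell; the dimensions and weights are illustrative, and LSTM/GRU cells replace the single tanh update with gated updates to better preserve long-range dependencies:

```python
import numpy as np

def rnn_cell(x_t, h_prev, Wxh, Whh, b):
    # Plain (vanilla) recurrent cell: the hidden state summarizes the
    # sequence seen so far.
    return np.tanh(Wxh @ x_t + Whh @ h_prev + b)

# Toy dimensions: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
Wxh = rng.normal(size=(3, 4))
Whh = rng.normal(size=(3, 3))
b = np.zeros(3)

h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):   # a sequence of 5 input vectors
    h = rnn_cell(x_t, h, Wxh, Whh, b)
print(h)                              # final hidden state
```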

Page 35: overview of deep learning final - GitHub Pages

Improvements in Machine Translation & Automatic Speech Recognition