Overview of Deep Learning
TRANSCRIPT
Joseph E. [email protected]
Deep Learning Overview
Borrowed heavily from excellent talks by:
• Adam Coates: http://ai.stanford.edu/~acoates/coates_dltutorial_2013.pptx
• Fei-Fei Li and Andrej Karpathy: http://cs231n.stanford.edu/syllabus.html
Machine Learning → Function Approximation
[Figure: example tasks cast as function approximation: Object Recognition (image → label "Cat"), Speech Recognition (audio → "The cat in the hat"), Machine Translation ("The cat in the hat." → "El gato en el sombrero"), Robotic Control]
Function Approximation Pipeline
[Figure: pipeline from raw input through hand-engineered features to a classifier, e.g., an SVM for object recognition (→ label "Cat") and an HMM for speech recognition (→ "The cat in the hat")]
Function Approximation Pipeline
Often build multiple layers of features to abstract the input
[Figure: the same pipeline with multiple hand-built feature layers stacked before the SVM, again producing Label: Cat]
Deep learning tries to automate this process.
Function Approximation Pipeline
Deep Learning: automatically learn a deep hierarchy of abstract features along with the classifier.
• Typically using neural networks
  • composable, general function approximators
[Figure: input → Low-level Features → Mid-level Features → High-level Features → SVM → Label: Cat; each layer learns a more abstract representation]
Why is Deep Learning so Successful?
• Feature engineering is essential to many applications
  • Expensive hand-engineering of "layers" of representation
  • Deep learning automates the process of feature engineering
• Previous attempts were limited by data and computation
  • We now have access to substantial amounts of data and computation
• Deep learning techniques are inherently compositional
  • Easy to extend and combine → rapid development
Crash Course in Neural Networks
Supervised Learning
• Predict: is this a picture of a cat?
[Figure: an example input image x (a cat)]
Supervised Learning
• Predict: is this a picture of a cat?
• Given an input $x \in \mathbb{R}^d$ and a label $y \in \{0, 1\}$
• and training data: $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{n}$
• Learn a function: $f_w(x) = y$
• By estimating the parameters $w$
Logistic Regression for Binary Classification
• Consider the simple function family:
$$f_w(x) = \sigma\!\left(w^\top x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x)$$
• With non-linearity:
$$\sigma(u) = \frac{1}{1 + \exp(-u)}$$
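To make this function family concrete, here is a minimal NumPy sketch; the weight vector and input below are made-up illustrative values, not anything from the slides.

```python
import numpy as np

def sigmoid(u):
    """Logistic non-linearity: sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def f_w(w, x):
    """Logistic regression: P(y = 1 | x) = sigma(w^T x)."""
    return sigmoid(np.dot(w, x))

# Toy example with d = 3 features (illustrative values only).
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.2, 2.0])
print(f_w(w, x))  # a probability in (0, 1)
```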
Logistic Regression as a “Neuron”
• Consider the simple function family:
$$f_w(x) = \sigma\!\left(w^\top x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x), \qquad \sigma(u) = \frac{1}{1 + \exp(-u)}$$
[Figure: neuron diagram in which each input $x_j$ contributes a weighted edge $w_j x_j$ (e.g., $w_{52} x_{52}$), plus a bias term $w_0$]
Neuron “fires” if the weighted sum of its inputs is greater than zero.
“perceptron”
Learning the logistic regression model
• Consider the simple function family:
$$f_w(x) = \sigma\!\left(w^\top x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x), \qquad \sigma(u) = \frac{1}{1 + \exp(-u)}$$
• Goal: find $w$ that minimizes the loss on the training data (the negative log-likelihood):
$$L(w) = \sum_{i=1}^{n} L\!\left(f_w(x_i), y_i\right) = -\sum_{i=1}^{n} \log\!\left[ f_w(x_i)^{y_i}\,\big(1 - f_w(x_i)\big)^{1 - y_i} \right]$$
Numerical Optimization
• Gradient Descent: $w^{(t+1)} = w^{(t)} - \eta_t \nabla_w L(w)$
[Figure: the loss $L(w)$ plotted over $w$, with $-\nabla_w L(w)$ pointing downhill]
• Convex → guaranteed to find the optimal $w$
• Stochastic gradient descent:
  • Full gradient (slow): $\nabla_w L(w) = \sum_{i=1}^{n} \nabla_w L\!\left(f_w(x_i), y_i\right)$
  • Approximate: $\nabla_w L(w) \approx n\, \nabla_w L\!\left(f_w(x_i), y_i\right)$ for $(x_i, y_i) \sim D$
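A minimal stochastic-gradient-descent sketch for this logistic regression setup, on synthetic data with a fixed step size; this is purely illustrative, and the factor of n in the slide's estimator is folded into the step size here.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (sigmoid(X @ w_true) > rng.uniform(size=n)).astype(float)

w = np.zeros(d)
eta = 0.1                     # step size eta_t (held constant here)
for t in range(10000):
    i = rng.integers(n)       # sample (x_i, y_i) ~ D
    p = sigmoid(X[i] @ w)
    grad = (p - y[i]) * X[i]  # gradient of the per-example negative log-likelihood
    w -= eta * grad           # w^(t+1) = w^(t) - eta * grad
print(w)
print(w_true)                 # the learned w should roughly track w_true
```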
Logistic Regression: Strengths and Limitations
• Widely used machine learning technique
  • convex → efficient to learn
  • easy to interpret model weights
  • works well given good features
• Limitations:
  • Restricted to linear relationships → sensitive to choice of features
[Figure: $y = 0$ and $y = 1$ examples plotted in the (Pixel 1, Pixel 2) plane; the classes are not linearly separable in raw pixel space]
Feature Engineering
• Rather than use raw pixels, build/train feature functions:
[Figure: in raw pixel space (Pixel 1, Pixel 2) the classifier $f_w(x)$ cannot separate the classes; after mapping to engineered features $g = (\text{Paws}, \text{Furry})$, the classifier $f_w(g(x))$ separates them with a linear boundary]
Composing Linear Models and Nonlinearities
[Figure: the input layer $x = (x_1, \ldots, x_d)$ (pixels) is multiplied by a $k \times d$ weight matrix $W^0$ to give $z^1 = W^0 x$; the non-linearity then gives the hidden layer $h^1 = \sigma(z^1) = (h_1, \ldots, h_k)$]
Composing Linear Models and Nonlinearities
[Figure: the hidden layer $h^1$ (length $k_1$) is multiplied by a $k_2 \times k_1$ weight matrix $W^1$ to give $z^2 = W^1 h^1$; the next hidden layer is $h^2 = \sigma(z^2)$]
Neural Networks
• Composing “perceptrons”:
$$y = f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\,\sigma\!\left(W^1\,\sigma\!\left(W^0 x\right)\right)\right)$$
[Figure: network diagram with input layer $(x_1, \ldots, x_d)$, hidden layers $h^1 = (h^1_1, \ldots, h^1_{k_1})$ and $h^2 = (h^2_1, \ldots, h^2_{k_2})$, output $f$, and weight matrices $W^0$, $W^1$, $W^2$]
Dataflow: $x \rightarrow \sigma(W^0 x) \rightarrow h^1 \rightarrow \sigma(W^1 h^1) \rightarrow h^2 \rightarrow \sigma(W^2 h^2) \rightarrow f$
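A minimal NumPy sketch of this two-hidden-layer forward pass; the layer widths are arbitrary and biases are omitted to match the slide's notation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

d, k1, k2 = 10, 8, 6                 # input and hidden-layer widths (arbitrary)
rng = np.random.default_rng(0)
W0 = rng.normal(size=(k1, d))
W1 = rng.normal(size=(k2, k1))
W2 = rng.normal(size=(1, k2))

def forward(x):
    h1 = sigmoid(W0 @ x)             # x  -> sigma(W0 x)  -> h1
    h2 = sigmoid(W1 @ h1)            # h1 -> sigma(W1 h1) -> h2
    return sigmoid(W2 @ h2)          # h2 -> sigma(W2 h2) -> f

print(forward(rng.normal(size=d)))
```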
Deep Learning → Many hidden layers …
• Logistic Regression: input → output (a single layer)
[Figure 3 of Szegedy et al.: GoogLeNet network with all the bells and whistles; a deep stack of Conv, MaxPool, LocalRespNorm, DepthConcat, AveragePool, FC, and Softmax layers from input to output]
Neural Networks
• Composing non-linear models (e.g., Logistic Regression):
$$y = f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\,\sigma\!\left(W^1\,\sigma\!\left(W^0 x\right)\right)\right)$$
• Learn $W^0$, $W^1$, and $W^2$ using Backpropagation + SGD
[Figure: the same network diagram: input layer, hidden layers $h^1$ and $h^2$, output $f$, weights $W^0$, $W^1$, $W^2$]
Dataflow: $x \rightarrow \sigma(W^0 x) \rightarrow h^1 \rightarrow \sigma(W^1 h^1) \rightarrow h^2 \rightarrow \sigma(W^2 h^2) \rightarrow f$
Backpropagation in Neural Networks
• Need to compute the gradient of the loss w.r.t. $W^0$, $W^1$, and $W^2$:
$$\nabla_{W^0, W^1, W^2}\, L\!\left(y,\; f_{W^0, W^1, W^2}(x)\right), \qquad f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\,\sigma\!\left(W^1\,\sigma\!\left(W^0 x\right)\right)\right)$$
• Use the chain rule to push gradients back through the dataflow graph:
$x \rightarrow \sigma(W^0 x) \rightarrow h^1 \rightarrow \sigma(W^1 h^1) \rightarrow h^2 \rightarrow \sigma(W^2 h^2) \rightarrow f$, with parameters $W^0$, $W^1$, $W^2$
Backpropagation in Neural Networks
• Define a general operator $g(w, h)$ with inputs $h$, parameters $W_{kl}$, and output $g$. Given the upstream gradient $\partial L(W) / \partial g$, the chain rule gives:
$$\frac{\partial L(W)}{\partial W_{kl}} = \sum_j \frac{\partial L(W)}{\partial g_j}\,\frac{\partial g_j}{\partial W_{kl}}, \qquad \frac{\partial L(W)}{\partial h_k} = \sum_j \frac{\partial L(W)}{\partial g_j}\,\frac{\partial g_j}{\partial h_k}$$
$x \rightarrow \sigma(W^0 x) \rightarrow h^1 \rightarrow \sigma(W^1 h^1) \rightarrow h^2 \rightarrow \sigma(W^2 h^2) \rightarrow f$, with parameters $W^0$, $W^1$, $W^2$
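One way to read this slide is as the interface each operator in the dataflow graph must implement. Below is a hedged Python sketch of such an operator for g(W, h) = W h; the class name and the forward/backward split are illustrative, not any particular framework's API.

```python
import numpy as np

class Linear:
    """Operator g(W, h) = W h, with a backward pass implementing the chain rule."""
    def __init__(self, W):
        self.W = W

    def forward(self, h):
        self.h = h                              # cache the input for the backward pass
        return self.W @ h

    def backward(self, dL_dg):
        # dL/dW_{kl} = sum_j dL/dg_j * dg_j/dW_{kl} = dL/dg_k * h_l
        self.dL_dW = np.outer(dL_dg, self.h)
        # dL/dh_k = sum_j dL/dg_j * dg_j/dh_k = (W^T dL/dg)_k
        return self.W.T @ dL_dg
```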
Simple Example:
$$f(x, y, z) = (x + y) \cdot y \cdot z$$
[Figure: computation graph with $x = 2$, $y = 3$, $z = 4$. Forward pass: $h_1 = x + y = 5$, $h_2 = h_1 \cdot y = 15$, $f = h_2 \cdot z = 60$. Backward pass: $\partial f / \partial f = 1$, $\partial f / \partial z = h_2 = 15$, $\partial f / \partial h_2 = z = 4$, $\partial f / \partial h_1 = 4 \cdot 3 = 12$, $\partial f / \partial y = 12 + 20 = 32$ (two paths, through $h_1$ and $h_2$), $\partial f / \partial x = 12$]
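The same forward and backward passes, written out in plain Python for the values on the slide (x = 2, y = 3, z = 4):

```python
x, y, z = 2.0, 3.0, 4.0

# Forward pass through the graph f(x, y, z) = (x + y) * y * z
h1 = x + y          # 5
h2 = h1 * y         # 15
f  = h2 * z         # 60

# Backward pass (reverse order), pushing gradients with the chain rule
df_df  = 1.0
df_dz  = df_df * h2            # 15
df_dh2 = df_df * z             # 4
df_dh1 = df_dh2 * y            # 12 = 4 * 3
df_dy  = df_dh2 * h1 + df_dh1  # 20 + 12 = 32 (two paths into y)
df_dx  = df_dh1                # 12

print(f, df_dx, df_dy, df_dz)  # 60.0 12.0 32.0 15.0
```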
Backpropagation
• Requires all operators to have well-defined sub-gradients: tanh, logit, ReLU = max(0, x), abs
• Enables Automatic Differentiation!
  • User defines forward flow → system derives an efficient training algorithm
  • Easy to explore composition of new modules
• Enables Efficient Gradient Computation
  • Cache forward calculation to accelerate gradients
  • Compile optimized gradient computation
General Purpose Systems For DNNs
• Distributed Parameter Servers
  • TensorFlow (DistBelief)
  • Microsoft Adam
• GPU Systems
  • TensorFlow
  • Caffe
  • Theano
Demo of TensorFlow
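The demo itself is not part of the transcript. The following is only a rough sketch of the kind of thing such a demo might show: a tiny logistic-regression model trained with TensorFlow's automatic differentiation. It is written against the TensorFlow 2 GradientTape API, which postdates these slides, and the data is synthetic.

```python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5)).astype("float32")
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5], dtype="float32") > 0).astype("float32")

w = tf.Variable(tf.zeros([5]))
b = tf.Variable(0.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.5)

for step in range(200):
    with tf.GradientTape() as tape:
        logits = tf.linalg.matvec(X, w) + b
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
    grads = tape.gradient(loss, [w, b])     # backprop, derived automatically
    opt.apply_gradients(zip(grads, [w, b]))
print(float(loss))                          # training loss after 200 SGD steps
```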
Challenges of Deep Neural Networks
• Non-convex → (stochastic) gradient descent not guaranteed to converge to the optimum
  • Soln: there appear to be many good local optima
• High-dimensional → gradient descent converges slowly
  • Soln: hardware acceleration, improved algorithms with momentum, …
• Rich function class → overfitting
  • Soln: more data, early stopping, drop-out, parameter sharing
• Saturation of the sigmoid → decaying gradients
  • Soln: other forms of non-linearity (e.g., ReLU = max(0, x) instead of the logit/sigmoid)
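A quick numerical illustration of the saturation point above (the inputs are arbitrary): the sigmoid's gradient σ′(u) = σ(u)(1 − σ(u)) vanishes for large |u|, while the ReLU's gradient stays at 1 for any positive input.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

for u in [0.0, 2.0, 5.0, 10.0]:
    sig_grad = sigmoid(u) * (1.0 - sigmoid(u))   # derivative of the sigmoid
    relu_grad = 1.0 if u > 0 else 0.0            # derivative of max(0, u)
    print(f"u={u:5.1f}  sigmoid'={sig_grad:.5f}  relu'={relu_grad}")
# sigmoid' drops from 0.25 toward ~0.00005 as u grows; relu' stays 1.
```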
Convolutional Neural Networks: Exploiting Spatial Sparsity
Example: AlexNet (Krizhevsky et al., NIPS 2012)
• Introduced in 2012, significantly outperformed the state of the art (top-5 error of 16% compared to the runner-up with 26% error)
• Covered in reading …
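To make the "spatial sparsity" idea concrete, here is a minimal NumPy sketch of a single-channel 2D convolution (valid padding, stride 1): each output value depends only on a small local patch, and the same small filter is reused at every location. This is a toy illustration, not AlexNet's actual layers.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in most DL frameworks)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # the same weights shared everywhere
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
kernel = np.array([[1.0, 0.0, -1.0]] * 3)       # simple edge-like 3x3 filter
print(conv2d(image, kernel).shape)              # (6, 6)
```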
Growth in Model Complexity
• LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition", 1989: 9K parameters, 66K ops
• Szegedy et al., "Going Deeper with Convolutions", 2014
[Figure 3 of Szegedy et al.: GoogLeNet network with all the bells and whistles; a deep stack of Conv, MaxPool, LocalRespNorm, DepthConcat, AveragePool, FC, and Softmax layers]
• Winner of the ImageNet Large-Scale Visual Recognition Challenge 2014
• GoogLeNet (7.89% error)
  • 22 layers
  • 6.8M parameters
  • 1.5B flops
  • Ensemble of 7 models
• Current Best: ResNet (3.57% error)
  • 152 layers
  • 2.3M parameters
  • 11.3B flops
  • Ensemble of 6 models
Cost of Computation (from Prediction Serving Lecture)
• 100s of millions of parameters + convolutions & unrolling
• Requires hardware acceleration
Table 2 shows the results for the Titan X GPU and the Xeon E5-2698 v3 server-class processor.

Network: AlexNet      | Batch Size                     | Titan X (FP32)   | Xeon E5-2698 v3 (FP32)
Inference Performance | 1                              | 405 img/sec      | 76 img/sec
Power                 | 1                              | 164.0 W          | 111.7 W
Performance/Watt      | 1                              | 2.5 img/sec/W    | 0.7 img/sec/W
Inference Performance | 128 (Titan X) / 48 (Xeon E5)   | 3216 img/sec     | 476 img/sec
Power                 | 128 (Titan X) / 48 (Xeon E5)   | 227.0 W          | 149.0 W
Performance/Watt      | 128 (Titan X) / 48 (Xeon E5)   | 14.2 img/sec/W   | 3.2 img/sec/W

Table 2: Inference performance, power, and energy efficiency on Titan X and Xeon E5-2698 v3.
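For reference, the Performance/Watt rows are simply throughput divided by power; for example, 3216 img/sec ÷ 227.0 W ≈ 14.2 img/sec/W for the Titan X in the large-batch case.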
The comparison between Titan X and Xeon E5 reinforces the same conclusion as the comparison between Tegra X1 and Core i7: GPUs appear to be capable of significantly higher energy efficiency for deep learning inference on AlexNet. In the case of Titan X, the GPU not only provides much better energy efficiency than the CPU, but it also achieves substantially higher performance at over 3000 images/second in the large-batch case compared to less than 500 images/second on the CPU. While larger batch sizes are more efficient to process, the comparison between Titan X and Xeon E5 with no batching proves that the GPU’s efficiency advantage is present even for smaller batch sizes. In comparison with Tegra X1, the Titan X manages to achieve competitive Performance/Watt despite its much bigger GPU, as the large 12 GB framebuffer allows it to run more efficient but memory-capacity-intensive FFT-based convolutions.
Finally, Table 3 presents inference results on GoogLeNet. As mentioned before, IDLF provides no support for GoogLeNet, and alternative deep learning frameworks have never been optimized for CPU performance. Therefore, we omit CPU results here and focus entirely on the GPUs. As GoogLeNet is a much more demanding network than AlexNet, Tegra X1 cannot run batch size 128 inference due to insufficient total memory capacity (4GB on a Jetson™ TX1 board). The massive framebuffer on the Titan X is sufficient to allow inference with batch size 128.
Network: GoogLeNet    | Batch Size                      | Titan X (FP32)  | Tegra X1 (FP32) | Tegra X1 (FP16)
Inference Performance | 1                               | 138 img/sec     | 33 img/sec      | 33 img/sec
Power                 | 1                               | 119.0 W         | 5.0 W           | 4.0 W
Performance/Watt      | 1                               | 1.2 img/sec/W   | 6.5 img/sec/W   | 8.3 img/sec/W
Inference Performance | 128 (Titan X) / 64 (Tegra X1)   | 863 img/sec     | 52 img/sec      | 75 img/sec
Power                 | 128 (Titan X) / 64 (Tegra X1)   | 225.0 W         | 5.9 W           | 5.8 W
Performance/Watt      | 128 (Titan X) / 64 (Tegra X1)   | 3.8 img/sec/W   | 8.8 img/sec/W   | 12.8 img/sec/W

Table 3: GoogLeNet inference results on Tegra X1 and Titan X. Tegra X1's total memory capacity is not sufficient to run batch size 128 inference.
Compared to AlexNet, the results show significantly lower absolute performance values, indicating how much more computationally demanding GoogLeNet is. However, even on GoogLeNet, all GPUs are capable of achieving real-time performance on a 30 fps camera feed.
http://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf
Improvement on ImageNet Benchmark
Recurrent Neural Networks: Modeling Sequence Structure
• State of the art in speech recognition and machine translation
• Required LSTM and GRU cells to address long-range dependencies
• Similar to the HMM from classical Bayesian ML
[Figure: translation example "The cat in the hat." → "El gato en el sombrero"]
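As a rough illustration of the recurrence these sequence models share, here is a minimal vanilla-RNN cell in NumPy; it is not an LSTM or GRU, and the sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 8                       # arbitrary sizes
W_xh = rng.normal(size=(d_hidden, d_in))
W_hh = rng.normal(size=(d_hidden, d_hidden))

def rnn_forward(xs):
    """Run a vanilla RNN over a sequence: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h = np.zeros(d_hidden)
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)  # the same weights are applied at every step
    return h

sequence = [rng.normal(size=d_in) for _ in range(5)]
print(rnn_forward(sequence))                # final hidden state summarizing the sequence
```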
Improvements in Machine Translation &Automatic Speech Recognition