Overview of Deep Learning
TRANSCRIPT
Joseph E. [email protected]
Deep Learning Overview
Borrowed heavily from excellent talks by:
• Adam Coates: http://ai.stanford.edu/~acoates/coates_dltutorial_2013.pptx
• Fei-Fei Li and Andrej Karpathy: http://cs231n.stanford.edu/syllabus.html
Machine Learning → Function Approximation
[Figure: example tasks cast as function approximation: Object Recognition (image → label "Cat"), Speech Recognition (audio → "The cat in the hat"), Machine Translation ("The cat in the hat." → "El gato en el sombrero"), Robotic Control]
Function Approximation Pipeline
[Figure: pipeline from raw input through hand-engineered features to a classifier, e.g., an SVM for object recognition (→ label "Cat") and an HMM for speech recognition (→ "The cat in the hat")]
Function Approximation Pipeline
Often build multiple layers of features to abstract the input
[Figure: the same pipeline with multiple hand-built feature layers stacked before the SVM, again producing Label: Cat]
Deep learning tries to automate this process.
Function Approximation Pipeline
Deep Learning: automatically learn a deep hierarchy of abstract features along with the classifier.
• Typically using neural networks
  • composable, general function approximators
[Figure: input → Low-level Features → Mid-level Features → High-level Features → SVM → Label: Cat; each layer learns a more abstract representation]
Why is Deep Learning so Successful?
• Feature engineering is essential to many applications
  • Expensive hand-engineering of "layers" of representation
  • Deep learning automates the process of feature engineering
• Previous attempts were limited by data and computation
  • We now have access to substantial amounts of data and computation
• Deep learning techniques are inherently compositional
  • Easy to extend and combine → rapid development
Crash Course in Neural Networks
Supervised Learning
• Predict: is this a picture of a cat?
[Figure: an example input image x (a cat)]
Supervised Learning
• Predict: is this a picture of a cat?
• Given an input $x \in \mathbb{R}^d$ and a label $y \in \{0, 1\}$
• and training data: $D_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{n}$
• Learn a function: $f_w(x) = y$
• By estimating the parameters $w$
Logistic Regression for Binary Classification
• Consider the simple function family:
$$f_w(x) = \sigma\!\left(w^\top x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x)$$
• With non-linearity:
$$\sigma(u) = \frac{1}{1 + \exp(-u)}$$
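To make this function family concrete, here is a minimal NumPy sketch; the weight vector and input below are made-up illustrative values, not anything from the slides.

```python
import numpy as np

def sigmoid(u):
    """Logistic non-linearity: sigma(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def f_w(w, x):
    """Logistic regression: P(y = 1 | x) = sigma(w^T x)."""
    return sigmoid(np.dot(w, x))

# Toy example with d = 3 features (illustrative values only).
w = np.array([0.5, -1.2, 0.3])
x = np.array([1.0, 0.2, 2.0])
print(f_w(w, x))  # a probability in (0, 1)
```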
Logistic Regression as a “Neuron”
• Consider the simple function family:
$$f_w(x) = \sigma\!\left(w^\top x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x), \qquad \sigma(u) = \frac{1}{1 + \exp(-u)}$$
[Figure: neuron diagram in which each input $x_j$ contributes a weighted edge $w_j x_j$ (e.g., $w_{52} x_{52}$), plus a bias term $w_0$]
Neuron “fires” if the weighted sum of its inputs is greater than zero.
“perceptron”
Learning the logistic regression model
• Consider the simple function family:
$$f_w(x) = \sigma\!\left(w^\top x\right) = \sigma\!\left(\sum_{j=1}^{d} w_j x_j\right) = P(y = 1 \mid x), \qquad \sigma(u) = \frac{1}{1 + \exp(-u)}$$
• Goal: find $w$ that minimizes the loss on the training data (the negative log-likelihood):
$$L(w) = \sum_{i=1}^{n} L\!\left(f_w(x_i), y_i\right) = -\sum_{i=1}^{n} \log\!\left[ f_w(x_i)^{y_i}\,\big(1 - f_w(x_i)\big)^{1 - y_i} \right]$$
Numerical Optimization
• Gradient Descent: $w^{(t+1)} = w^{(t)} - \eta_t \nabla_w L(w)$
[Figure: the loss $L(w)$ plotted over $w$, with $-\nabla_w L(w)$ pointing downhill]
• Convex → guaranteed to find the optimal $w$
• Stochastic gradient descent:
  • Full gradient (slow): $\nabla_w L(w) = \sum_{i=1}^{n} \nabla_w L\!\left(f_w(x_i), y_i\right)$
  • Approximate: $\nabla_w L(w) \approx n\, \nabla_w L\!\left(f_w(x_i), y_i\right)$ for $(x_i, y_i) \sim D$
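A minimal stochastic-gradient-descent sketch for this logistic regression setup, on synthetic data with a fixed step size; this is purely illustrative, and the factor of n in the slide's estimator is folded into the step size here.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (sigmoid(X @ w_true) > rng.uniform(size=n)).astype(float)

w = np.zeros(d)
eta = 0.1                     # step size eta_t (held constant here)
for t in range(10000):
    i = rng.integers(n)       # sample (x_i, y_i) ~ D
    p = sigmoid(X[i] @ w)
    grad = (p - y[i]) * X[i]  # gradient of the per-example negative log-likelihood
    w -= eta * grad           # w^(t+1) = w^(t) - eta * grad
print(w)
print(w_true)                 # the learned w should roughly track w_true
```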
Logistic Regression: Strengths and Limitations
• Widely used machine learning technique
  • convex → efficient to learn
  • easy to interpret model weights
  • works well given good features
• Limitations:
  • Restricted to linear relationships → sensitive to choice of features
[Figure: $y = 0$ and $y = 1$ examples plotted in the (Pixel 1, Pixel 2) plane; the classes are not linearly separable in raw pixel space]
Feature Engineering
• Rather than use raw pixels, build/train feature functions:
[Figure: in raw pixel space (Pixel 1, Pixel 2) the classifier $f_w(x)$ cannot separate the classes; after mapping to engineered features $g = (\text{Paws}, \text{Furry})$, the classifier $f_w(g(x))$ separates them with a linear boundary]
Composing Linear Models and Nonlinearities
[Figure: the input layer $x = (x_1, \ldots, x_d)$ (pixels) is multiplied by a $k \times d$ weight matrix $W^0$ to give $z^1 = W^0 x$; the non-linearity then gives the hidden layer $h^1 = \sigma(z^1) = (h_1, \ldots, h_k)$]
Composing Linear Models and Nonlinearities
[Figure: the hidden layer $h^1$ (length $k_1$) is multiplied by a $k_2 \times k_1$ weight matrix $W^1$ to give $z^2 = W^1 h^1$; the next hidden layer is $h^2 = \sigma(z^2)$]
Neural Networks
• Composing “perceptrons”:
$$y = f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\,\sigma\!\left(W^1\,\sigma\!\left(W^0 x\right)\right)\right)$$
[Figure: network diagram with input layer $(x_1, \ldots, x_d)$, hidden layers $h^1 = (h^1_1, \ldots, h^1_{k_1})$ and $h^2 = (h^2_1, \ldots, h^2_{k_2})$, output $f$, and weight matrices $W^0$, $W^1$, $W^2$]
Dataflow: $x \rightarrow \sigma(W^0 x) \rightarrow h^1 \rightarrow \sigma(W^1 h^1) \rightarrow h^2 \rightarrow \sigma(W^2 h^2) \rightarrow f$
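A minimal NumPy sketch of this two-hidden-layer forward pass; the layer widths are arbitrary and biases are omitted to match the slide's notation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

d, k1, k2 = 10, 8, 6                 # input and hidden-layer widths (arbitrary)
rng = np.random.default_rng(0)
W0 = rng.normal(size=(k1, d))
W1 = rng.normal(size=(k2, k1))
W2 = rng.normal(size=(1, k2))

def forward(x):
    h1 = sigmoid(W0 @ x)             # x  -> sigma(W0 x)  -> h1
    h2 = sigmoid(W1 @ h1)            # h1 -> sigma(W1 h1) -> h2
    return sigmoid(W2 @ h2)          # h2 -> sigma(W2 h2) -> f

print(forward(rng.normal(size=d)))
```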
Deep Learning → Many hidden layers …
• Logistic Regression: input → output (a single layer)
[Figure 3 of Szegedy et al.: GoogLeNet network with all the bells and whistles; a deep stack of Conv, MaxPool, LocalRespNorm, DepthConcat, AveragePool, FC, and Softmax layers from input to output]
Neural Networks
• Composing non-linear models (e.g., Logistic Regression):
$$y = f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\,\sigma\!\left(W^1\,\sigma\!\left(W^0 x\right)\right)\right)$$
• Learn $W^0$, $W^1$, and $W^2$ using Backpropagation + SGD
[Figure: the same network diagram: input layer, hidden layers $h^1$ and $h^2$, output $f$, weights $W^0$, $W^1$, $W^2$]
Dataflow: $x \rightarrow \sigma(W^0 x) \rightarrow h^1 \rightarrow \sigma(W^1 h^1) \rightarrow h^2 \rightarrow \sigma(W^2 h^2) \rightarrow f$
Backpropagation in Neural Networks
• Need to compute the gradient of the loss w.r.t. $W^0$, $W^1$, and $W^2$:
$$\nabla_{W^0, W^1, W^2}\, L\!\left(y,\; f_{W^0, W^1, W^2}(x)\right), \qquad f_{W^0, W^1, W^2}(x) = \sigma\!\left(W^2\,\sigma\!\left(W^1\,\sigma\!\left(W^0 x\right)\right)\right)$$
• Use the chain rule to push gradients back through the dataflow graph:
$x \rightarrow \sigma(W^0 x) \rightarrow h^1 \rightarrow \sigma(W^1 h^1) \rightarrow h^2 \rightarrow \sigma(W^2 h^2) \rightarrow f$, with parameters $W^0$, $W^1$, $W^2$
Backpropagation in Neural Networks
• Define a general operator $g(w, h)$ with inputs $h$, parameters $W_{kl}$, and output $g$. Given the upstream gradient $\partial L(W) / \partial g$, the chain rule gives:
$$\frac{\partial L(W)}{\partial W_{kl}} = \sum_j \frac{\partial L(W)}{\partial g_j}\,\frac{\partial g_j}{\partial W_{kl}}, \qquad \frac{\partial L(W)}{\partial h_k} = \sum_j \frac{\partial L(W)}{\partial g_j}\,\frac{\partial g_j}{\partial h_k}$$
$x \rightarrow \sigma(W^0 x) \rightarrow h^1 \rightarrow \sigma(W^1 h^1) \rightarrow h^2 \rightarrow \sigma(W^2 h^2) \rightarrow f$, with parameters $W^0$, $W^1$, $W^2$
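One way to read this slide is as the interface each operator in the dataflow graph must implement. Below is a hedged Python sketch of such an operator for g(W, h) = W h; the class name and the forward/backward split are illustrative, not any particular framework's API.

```python
import numpy as np

class Linear:
    """Operator g(W, h) = W h, with a backward pass implementing the chain rule."""
    def __init__(self, W):
        self.W = W

    def forward(self, h):
        self.h = h                              # cache the input for the backward pass
        return self.W @ h

    def backward(self, dL_dg):
        # dL/dW_{kl} = sum_j dL/dg_j * dg_j/dW_{kl} = dL/dg_k * h_l
        self.dL_dW = np.outer(dL_dg, self.h)
        # dL/dh_k = sum_j dL/dg_j * dg_j/dh_k = (W^T dL/dg)_k
        return self.W.T @ dL_dg
```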
Simple Example:
$$f(x, y, z) = (x + y) \cdot y \cdot z$$
[Figure: computation graph with $x = 2$, $y = 3$, $z = 4$. Forward pass: $h_1 = x + y = 5$, $h_2 = h_1 \cdot y = 15$, $f = h_2 \cdot z = 60$. Backward pass: $\partial f / \partial f = 1$, $\partial f / \partial z = h_2 = 15$, $\partial f / \partial h_2 = z = 4$, $\partial f / \partial h_1 = 4 \cdot 3 = 12$, $\partial f / \partial y = 12 + 20 = 32$ (two paths, through $h_1$ and $h_2$), $\partial f / \partial x = 12$]
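The same forward and backward passes, written out in plain Python for the values on the slide (x = 2, y = 3, z = 4):

```python
x, y, z = 2.0, 3.0, 4.0

# Forward pass through the graph f(x, y, z) = (x + y) * y * z
h1 = x + y          # 5
h2 = h1 * y         # 15
f  = h2 * z         # 60

# Backward pass (reverse order), pushing gradients with the chain rule
df_df  = 1.0
df_dz  = df_df * h2            # 15
df_dh2 = df_df * z             # 4
df_dh1 = df_dh2 * y            # 12 = 4 * 3
df_dy  = df_dh2 * h1 + df_dh1  # 20 + 12 = 32 (two paths into y)
df_dx  = df_dh1                # 12

print(f, df_dx, df_dy, df_dz)  # 60.0 12.0 32.0 15.0
```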
Backpropagation
• Requires all operators to have well-defined sub-gradients: tanh, logit, ReLU = max(0, x), abs
• Enables Automatic Differentiation!
  • User defines forward flow → system derives an efficient training algorithm
  • Easy to explore composition of new modules
• Enables Efficient Gradient Computation
  • Cache forward calculation to accelerate gradients
  • Compile optimized gradient computation
General Purpose Systems For DNNs
• Distributed Parameter Servers
  • TensorFlow (DistBelief)
  • Microsoft Adam
• GPU Systems
  • TensorFlow
  • Caffe
  • Theano
Demo of TensorFlow
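The demo itself is not part of the transcript. The following is only a rough sketch of the kind of thing such a demo might show: a tiny logistic-regression model trained with TensorFlow's automatic differentiation. It is written against the TensorFlow 2 GradientTape API, which postdates these slides, and the data is synthetic.

```python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5)).astype("float32")
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 1.5], dtype="float32") > 0).astype("float32")

w = tf.Variable(tf.zeros([5]))
b = tf.Variable(0.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.5)

for step in range(200):
    with tf.GradientTape() as tape:
        logits = tf.linalg.matvec(X, w) + b
        loss = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
    grads = tape.gradient(loss, [w, b])     # backprop, derived automatically
    opt.apply_gradients(zip(grads, [w, b]))
print(float(loss))                          # training loss after 200 SGD steps
```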
Challenges of Deep Neural Networks
• Non-convex → (stochastic) gradient descent not guaranteed to converge to the optimum
  • Soln: there appear to be many good local optima
• High-dimensional → gradient descent converges slowly
  • Soln: hardware acceleration, improved algorithms with momentum, …
• Rich function class → overfitting
  • Soln: more data, early stopping, drop-out, parameter sharing
• Saturation of the sigmoid → decaying gradients
  • Soln: other forms of non-linearity (e.g., ReLU = max(0, x) instead of the logit/sigmoid)
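A quick numerical illustration of the saturation point above (the inputs are arbitrary): the sigmoid's gradient σ′(u) = σ(u)(1 − σ(u)) vanishes for large |u|, while the ReLU's gradient stays at 1 for any positive input.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

for u in [0.0, 2.0, 5.0, 10.0]:
    sig_grad = sigmoid(u) * (1.0 - sigmoid(u))   # derivative of the sigmoid
    relu_grad = 1.0 if u > 0 else 0.0            # derivative of max(0, u)
    print(f"u={u:5.1f}  sigmoid'={sig_grad:.5f}  relu'={relu_grad}")
# sigmoid' drops from 0.25 toward ~0.00005 as u grows; relu' stays 1.
```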
Convolutional Neural Networks: Exploiting Spatial Sparsity
Example: AlexNet (Krizhevsky et al., NIPS 2012)
• Introduced in 2012, significantly outperformed the state of the art (top-5 error of 16% compared to the runner-up with 26% error)
• Covered in reading …
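To make the "spatial sparsity" idea concrete, here is a minimal NumPy sketch of a single-channel 2D convolution (valid padding, stride 1): each output value depends only on a small local patch, and the same small filter is reused at every location. This is a toy illustration, not AlexNet's actual layers.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (really cross-correlation, as in most DL frameworks)."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]   # local receptive field
            out[i, j] = np.sum(patch * kernel)  # the same weights shared everywhere
    return out

image = np.random.default_rng(0).normal(size=(8, 8))
kernel = np.array([[1.0, 0.0, -1.0]] * 3)       # simple edge-like 3x3 filter
print(conv2d(image, kernel).shape)              # (6, 6)
```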
Growth in Model Complexity
• LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition", 1989: 9K parameters, 66K ops
• Szegedy et al., "Going Deeper with Convolutions", 2014
[Figure 3 of Szegedy et al.: GoogLeNet network with all the bells and whistles; a deep stack of Conv, MaxPool, LocalRespNorm, DepthConcat, AveragePool, FC, and Softmax layers]
• Winner of the ImageNet Large-Scale Visual Recognition Challenge 2014
• GoogLeNet (7.89% error)
  • 22 layers
  • 6.8M parameters
  • 1.5B flops
  • Ensemble of 7 models
• Current Best: ResNet (3.57% error)
  • 152 layers
  • 2.3M parameters
  • 11.3B flops
  • Ensemble of 6 models
Cost of Computation (from Prediction Serving Lecture)
• 100s of millions of parameters + convolutions & unrolling
• Requires hardware acceleration
Table 2 shows the results for the Titan X GPU and the Xeon E5-2698 v3 server-class processor.

Network: AlexNet      | Batch Size                     | Titan X (FP32)   | Xeon E5-2698 v3 (FP32)
Inference Performance | 1                              | 405 img/sec      | 76 img/sec
Power                 | 1                              | 164.0 W          | 111.7 W
Performance/Watt      | 1                              | 2.5 img/sec/W    | 0.7 img/sec/W
Inference Performance | 128 (Titan X) / 48 (Xeon E5)   | 3216 img/sec     | 476 img/sec
Power                 | 128 (Titan X) / 48 (Xeon E5)   | 227.0 W          | 149.0 W
Performance/Watt      | 128 (Titan X) / 48 (Xeon E5)   | 14.2 img/sec/W   | 3.2 img/sec/W

Table 2: Inference performance, power, and energy efficiency on Titan X and Xeon E5-2698 v3.
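For reference, the Performance/Watt rows are simply throughput divided by power; for example, 3216 img/sec ÷ 227.0 W ≈ 14.2 img/sec/W for the Titan X in the large-batch case.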
The comparison between Titan X and Xeon E5 reinforces the same conclusion as the comparison between Tegra X1 and Core i7: GPUs appear to be capable of significantly higher energy efficiency for deep learning inference on AlexNet. In the case of Titan X, the GPU not only provides much better energy efficiency than the CPU, but it also achieves substantially higher performance at over 3000 images/second in the large-batch case compared to less than 500 images/second on the CPU. While larger batch sizes are more efficient to process, the comparison between Titan X and Xeon E5 with no batching proves that the GPU’s efficiency advantage is present even for smaller batch sizes. In comparison with Tegra X1, the Titan X manages to achieve competitive Performance/Watt despite its much bigger GPU, as the large 12 GB framebuffer allows it to run more efficient but memory-capacity-intensive FFT-based convolutions.
Finally, Table 3 presents inference results on GoogLeNet. As mentioned before, IDLF provides no support for GoogLeNet, and alternative deep learning frameworks have never been optimized for CPU performance. Therefore, we omit CPU results here and focus entirely on the GPUs. As GoogLeNet is a much more demanding network than AlexNet, Tegra X1 cannot run batch size 128 inference due to insufficient total memory capacity (4GB on a Jetson™ TX1 board). The massive framebuffer on the Titan X is sufficient to allow inference with batch size 128.
Network: GoogLeNet    | Batch Size                      | Titan X (FP32)  | Tegra X1 (FP32) | Tegra X1 (FP16)
Inference Performance | 1                               | 138 img/sec     | 33 img/sec      | 33 img/sec
Power                 | 1                               | 119.0 W         | 5.0 W           | 4.0 W
Performance/Watt      | 1                               | 1.2 img/sec/W   | 6.5 img/sec/W   | 8.3 img/sec/W
Inference Performance | 128 (Titan X) / 64 (Tegra X1)   | 863 img/sec     | 52 img/sec      | 75 img/sec
Power                 | 128 (Titan X) / 64 (Tegra X1)   | 225.0 W         | 5.9 W           | 5.8 W
Performance/Watt      | 128 (Titan X) / 64 (Tegra X1)   | 3.8 img/sec/W   | 8.8 img/sec/W   | 12.8 img/sec/W

Table 3: GoogLeNet inference results on Tegra X1 and Titan X. Tegra X1's total memory capacity is not sufficient to run batch size 128 inference.
Compared to AlexNet, the results show significantly lower absolute performance values, indicating how much more computationally demanding GoogLeNet is. However, even on GoogLeNet, all GPUs are capable of achieving real-time performance on a 30 fps camera feed.
http://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf
Improvement on ImageNet Benchmark
Recurrent Neural Networks: Modeling Sequence Structure
• State of the art in speech recognition and machine translation
• Required LSTM and GRU cells to address long-range dependencies
• Similar to the HMM from classical Bayesian ML
[Figure: translation example "The cat in the hat." → "El gato en el sombrero"]
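As a rough illustration of the recurrence these sequence models share, here is a minimal vanilla-RNN cell in NumPy; it is not an LSTM or GRU, and the sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 8                       # arbitrary sizes
W_xh = rng.normal(size=(d_hidden, d_in))
W_hh = rng.normal(size=(d_hidden, d_hidden))

def rnn_forward(xs):
    """Run a vanilla RNN over a sequence: h_t = tanh(W_hh h_{t-1} + W_xh x_t)."""
    h = np.zeros(d_hidden)
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)  # the same weights are applied at every step
    return h

sequence = [rng.normal(size=d_in) for _ in range(5)]
print(rnn_forward(sequence))                # final hidden state summarizing the sequence
```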
Improvements in Machine Translation &Automatic Speech Recognition