Neural networks: Basics Convolutional neural networks
Deep Learning II: Neural Networks
Hinrich Schutze
Center for Information and Language Processing, LMU Munich
2017-07-21
Schutze (LMU Munich): Neural networks 1 / 33
Overview
1 Neural networks: Basics
2 Convolutional neural networks
Outline
1 Neural networks: Basics
2 Convolutional neural networks
Inverted Classroom
Andrew Ng: “Machine Learning”
http://coursera.org
Neural networks: Andrew Ng videos
Model representation I
Model representation II
A single neuron
Input nodes: x1, x2, x3
Parameters/weights: lines connecting nodes
Raw input to neuron: weighted sum $\Theta^T \vec{x} = \sum_{i=1}^{3} \theta_i x_i$
Nonlinear activation function (e.g., sigmoid): $g(\Theta^T \vec{x}) = 1/(1 + \exp(-\Theta^T \vec{x}))$
Output of neuron: $g(\Theta^T \vec{x})$
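A minimal sketch of the single neuron described on this slide, assuming NumPy. The weight and input values are hypothetical and chosen only so the arithmetic is easy to follow; `neuron_output` is an illustrative name, not something from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(theta, x):
    """Output of a single neuron: g(theta^T x)."""
    return sigmoid(np.dot(theta, x))

# Hypothetical weights and inputs (not taken from the slides).
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

raw_input = np.dot(theta, x)      # weighted sum: 0.5 - 2.0 + 1.0 = -0.5
output = neuron_output(theta, x)  # sigmoid squashes the sum into (0, 1)
print(raw_input, output)
```

The sigmoid maps any weighted sum into the interval (0, 1), which is why the output can be read as a degree of activation.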
A neuron
Inputs: x1, x2, x3
Parameters (= weights = lines): θ1, θ2, θ3
Activation function (e.g., sigmoid / logistic)
Hypothesis: $h_\Theta(\vec{x}) = g(\Theta^T \vec{x})$
A neural network
Input layer (same as before): xi
Hidden layer, here: three neurons
Output layer, here: single neuron
Activations $a^{(k)}_i$, $k$ = layer
Full connectivity
Same or different activation functions
A neural network
Another neural network architecture
Another neural network architecture
Forward propagation of activity
$a^{(i)}_j = g_i(\Theta_{ij}^T \vec{a}^{(i-1)}) = 1/(1 + \exp(-\Theta_{ij}^T \vec{a}^{(i-1)}))$
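The forward propagation rule can be sketched as follows, assuming NumPy, sigmoid activations in every layer, and no bias terms. The weights are random stand-ins for a 3-3-1 network like the one on the earlier slide; row j of each weight matrix plays the role of $\Theta_{ij}$. The function name `forward` is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Propagate activations layer by layer: a^(i) = g(Theta^(i) a^(i-1))."""
    activations = [x]  # a^(0) is the input itself
    for theta in weights:
        activations.append(sigmoid(theta @ activations[-1]))
    return activations

rng = np.random.default_rng(0)
# Hypothetical weights: 3 inputs -> 3 hidden neurons -> 1 output neuron.
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
x = np.array([1.0, 0.0, -1.0])

activations = forward(x, weights)
print([a.shape for a in activations])  # one activation vector per layer
```

Each layer only needs the activations of the layer directly below it, so the whole network is evaluated in a single pass from input to output.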
Learning/Training: Backpropagation
As before: cost function
As before: objective (find parameters that minimize cost)
As before: gradient descent
That is: compute the gradient and move in the direction of the negative gradient
What’s new: We use backpropagation to compute the gradient.
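A minimal sketch of gradient descent with backpropagation for a 3-3-1 sigmoid network, assuming NumPy. The slides do not specify the cost function or the data, so the cross-entropy cost, the toy data, the learning rate, and the absence of bias terms are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradients(theta1, theta2, X, y):
    """Cross-entropy cost and its gradients, computed by backpropagation.

    X: (n, 3) inputs, y: (n,) labels in {0, 1}.
    theta1: (3, 3) hidden-layer weights, theta2: (1, 3) output weights.
    """
    n = X.shape[0]
    # Forward pass: compute activations layer by layer.
    a1 = sigmoid(X @ theta1.T)           # hidden activations, (n, 3)
    a2 = sigmoid(a1 @ theta2.T)[:, 0]    # network output, (n,)
    cost = -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))
    # Backward pass: propagate the error from the output to the hidden layer.
    delta2 = (a2 - y)[:, None]                   # output error, (n, 1)
    grad2 = delta2.T @ a1 / n                    # gradient w.r.t. theta2
    delta1 = (delta2 @ theta2) * a1 * (1 - a1)   # hidden error, (n, 3)
    grad1 = delta1.T @ X / n                     # gradient w.r.t. theta1
    return cost, grad1, grad2

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # hypothetical toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # hypothetical toy labels
theta1 = rng.normal(scale=0.5, size=(3, 3))
theta2 = rng.normal(scale=0.5, size=(1, 3))
learning_rate = 0.5                          # assumed step size

costs = []
for step in range(200):
    cost, grad1, grad2 = cost_and_gradients(theta1, theta2, X, y)
    costs.append(cost)
    theta1 -= learning_rate * grad1  # step against the gradient
    theta2 -= learning_rate * grad2

print(costs[0], costs[-1])
```

Backpropagation only supplies the gradient; the update rule itself is the same gradient descent step used for simpler models.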
Gradient descent
Neurons can be trained to detect features.
Deep learning:
Each layer learns more powerful/abstract features.
Increasingly abstract features in vision
Number of weights/parameters
$|L_1| \cdot |L_2| + |L_2| \cdot |L_3| = 3 \cdot 3 + 3 \cdot 1 = 12$
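The count above generalizes to any fully connected network: multiply the sizes of each pair of adjacent layers and sum. A small sketch, assuming (as the slide's formula does) that bias terms are not counted; `count_weights` is an illustrative name.

```python
def count_weights(layer_sizes):
    """Number of weights in a fully connected network, ignoring bias terms."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# The network from the slide: 3 inputs, 3 hidden neurons, 1 output neuron.
print(count_weights([3, 3, 1]))  # 3*3 + 3*1 = 12
```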
Exercise: Number of weights/parameters
Task: Does this sentence mention an officer leaving?
Given: A sentence
Workforce Solutions Alamo fired CEO John Hathaway yesterday.
Binary classification task
Class “yes”: This sentence contains information about an officer leaving a company (so a financial analyst should look at it).
Class “no”: This sentence does not contain information about an officer leaving a company (so nobody has to look at it).
Correct class in this case?
Class “yes”: This sentence contains information about an officer leaving a company.
Class “no”: This sentence does not contain information about an officer leaving a company.
Task: Does this sentence mention an officer leaving?
Given: A sentence
CEO John Hathaway fired his gardener yesterday.
Correct class in this case?
Class “yes”: This sentence contains information about an officer leaving a company.
Class “no”: This sentence does not contain information about an officer leaving a company.
Task: Does this sentence mention an officer leaving?
Given: A sentence
This picture shows parting CEO Cook talking with ex-CFO Dyer.
Correct class in this case?
Class “yes”: This sentence contains information about an officer leaving a company.
Class “no”: This sentence does not contain information about an officer leaving a company.
Simple architecture for detecting leaving events
Hypothesis? Parameters? Cost? Objective?
Simplest architecture: Fixed-length input
→ Padding
1 2 3 4 5 6 7 8 9
The board forced him to resign PAD PAD PAD
A majority of the board forced him to quit
Eventually the escalation led him to resign PAD PAD
It’s legal threats that compelled him to leave PAD
Key idea of convolution: learn a filter for the pattern “[force] [pronoun] to [leave]”
Filter = feature detector
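The padding step can be sketched as below, using the four example sentences and the fixed length of 9 from this slide. Whitespace tokenization, right-padding, and truncation of overlong sentences are assumptions; `pad_sentence` is an illustrative name.

```python
PAD = "PAD"

def pad_sentence(tokens, length, pad_token=PAD):
    """Right-pad (or truncate) a token list to exactly `length` tokens."""
    return (tokens + [pad_token] * length)[:length]

sentences = [
    "The board forced him to resign",
    "A majority of the board forced him to quit",
    "Eventually the escalation led him to resign",
    "It's legal threats that compelled him to leave",
]

# Fixed input length of 9 positions, as on the slide.
padded = [pad_sentence(s.split(), 9) for s in sentences]
for row in padded:
    print(" ".join(row))
```

After padding, every sentence occupies the same nine input positions, so it can be fed to a network with a fixed-size input layer.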
Exercise
If you use this architecture, why is it hard to learn the filter “[force] [pronoun] to [leave]”?
Outline
1 Neural networks: Basics
2 Convolutional neural networks
Use convolution & pooling architecture
[Figure: input layer, convolution layer (filter size 3), max pooling layer]
Input layer: PAD PAD This picture shows parting CEO Cook talking with ex-CFO Dyer PAD PAD
Convolution layer computes features: $a = g(H \odot X)$, applied at every position
Convolution outputs: 0.0 0.0 0.0 0.1 0.8 0.6 0.2 0.0 0.7 0.9 0.9 0.1
Max pooling layer selects max feature: $a = \max_i c_i$ = 0.9
Max pooling ⇒ Sentence describes “a leaving event”.
convolution & pooling = compute & select features
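The numbers in this figure can be checked with a short sketch, assuming NumPy. The token list and the twelve convolution outputs are copied from the slide; the code only confirms that a filter of size 3 over 14 padded tokens yields 12 windows, and that max pooling picks the largest value.

```python
import numpy as np

# Input layer from the slide: the sentence with two PAD tokens on each side.
tokens = ["PAD", "PAD", "This", "picture", "shows", "parting", "CEO", "Cook",
          "talking", "with", "ex-CFO", "Dyer", "PAD", "PAD"]
k = 3  # filter size

# Every contiguous window of k tokens is one input to the convolutional filter.
windows = [tokens[i:i + k] for i in range(len(tokens) - k + 1)]

# Convolution outputs as shown on the slide, one value per window.
conv_outputs = np.array([0.0, 0.0, 0.0, 0.1, 0.8, 0.6, 0.2, 0.0, 0.7, 0.9, 0.9, 0.1])

pooled = conv_outputs.max()  # max pooling keeps only the strongest feature
print(len(windows), pooled)
```

Because pooling keeps a single value regardless of sentence length, the layers above it always receive an input of the same size.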
Convolution & pooling
Widely used in vision
Recent development: widely used in NLP
Best example of successful transfer from vision to NLP
Convolution and max pooling in vision
Exercise
Try to find a good example of a typical NLP task for which max pooling (i.e., detecting whether or not a particular type of thing occurs in a sentence) is the wrong approach.
(Alternatively, try to find a good example of a typical vision task for which max pooling (i.e., detecting whether or not a particular type of thing occurs in a scene) is the wrong approach.)
Convolutional filter H
$a = g(H \odot X)$
Kernel size $k$: length of subsequence
$H$ is applied to every subsequence of length $k$.
$X$ is the representation of the subsequence, of dimensionality $D \times k$.
$D$ is the dimensionality of the embeddings.
$H$ also has dimensionality $D \times k$.
$\odot$ is the (Frobenius) inner product: $H \odot X = \sum_{(i,j)} H_{ij} X_{ij}$
$g$: nonlinearity (e.g., sigmoid)
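A minimal sketch of applying one convolutional filter as defined on this slide, assuming NumPy and a sigmoid nonlinearity. The sizes and the random values for the embeddings and the filter are hypothetical stand-ins; `apply_filter` is an illustrative name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def apply_filter(H, E):
    """Apply filter H (D x k) to every length-k subsequence of E (D x W).

    Returns one activation a = g(H ⊙ X) per subsequence, W - k + 1 in total.
    """
    D, k = H.shape
    W = E.shape[1]
    # H * X is elementwise; summing all entries gives the Frobenius inner product.
    return np.array([sigmoid(np.sum(H * E[:, i:i + k])) for i in range(W - k + 1)])

rng = np.random.default_rng(0)
D, k, W = 4, 3, 14            # hypothetical embedding size, kernel size, sentence length
E = rng.normal(size=(D, W))   # stand-in for the embeddings of a padded sentence
H = rng.normal(size=(D, k))   # stand-in for a learned filter

features = apply_filter(H, E)  # convolution: one feature per subsequence
pooled = features.max()        # max pooling: keep the strongest feature
print(features.shape, pooled)
```

The same filter weights are reused at every position, which lets the filter detect its pattern wherever it occurs in the sentence.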
Notation
V: vocabulary size
D: embedding dimensionality
C: number of classes
$C_i$: number of input channels
$C_o$: number of output channels
Ks: kernel sizes
N: minibatch size
W: padded sequence length