Deep Feedforward Networks
Lecture slides for Chapter 6 of Deep Learning (www.deeplearningbook.org)
Ian Goodfellow, last updated 2016-10-04
Adapted by m.n. for CMPS 392
(Goodfellow 2016)
Multilayer perceptrons (MLP)
• Approximate some function f*
• A feedforward network defines a mapping y = f(x; θ)
• No feedback connections
• Functions are composed in a chain: f(x) = f^(3)(f^(2)(f^(1)(x)))
  ◦ f^(1) is the first layer, f^(2) is the second layer, and so on …
• The training examples specify what the output layer must do
  ◦ The other layers are called hidden layers
(Goodfellow 2016)
What's the idea?
• The strategy of deep learning is to learn a new representation of the data φ(x; θ). Think of this as the output of a hidden layer.
• Parameters w map the output of the hidden layers to the desired output: y = φ(x; θ)^⊤ w
• We give up on the convexity of the training problem
• φ(x; θ) can be very generic; at the same time its design can be guided by human experts
(Goodfellow 2016)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2016)
XOR is not linearly separable
(Goodfellow 2016)
Linear model?
• Cost function: J(θ) = (1/4) Σ_x (f*(x) − f(x; θ))²
• Linear model: f(x; w, b) = x^⊤ w + b
• Normal equations?
  ◦ Solving them gives w = [0, 0]^⊤ and b = 0.5, i.e., the model outputs 0.5 everywhere
(Goodfellow 2016)
Network Diagrams
Figure 6.2
(Goodfellow 2016)
Rectified Linear Activation
Figure 6.3
(Goodfellow 2016)
Solving XOR
• f(x; W, c, w, b) = w^⊤ max{0, W^⊤ x + c} + b
• With W = [[1, 1], [1, 1]], c = [0, −1]^⊤, w = [1, −2]^⊤, b = 0
(Goodfellow 2016)
Solving XOR
Figure 6.1
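A minimal NumPy check of the hand-constructed solution above (the parameter values are the ones from the book's XOR example); this is an illustrative sketch, not code from the slides:

```python
import numpy as np

# Hand-constructed ReLU network for XOR: f(x) = w^T max(0, W^T x + c) + b
W = np.array([[1., 1.],
              [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # all four XOR inputs
h = np.maximum(0., X @ W + c)   # hidden representation after the rectifier
y_hat = h @ w + b
print(y_hat)                     # [0. 1. 1. 0.] -- matches XOR exactly
```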
(Goodfellow 2016)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2016)
Gradient-Based Learning
• Specify
  ◦ Model
  ◦ Cost
• Design the model and cost so the cost is smooth
• Minimize the cost using gradient descent or related techniques
  ◦ Rather than the linear equation solvers used in linear regression
  ◦ Or the convex optimization algorithms used in SVMs and logistic regression
(Goodfellow 2016)
Conditional Distributions and Cross-Entropy
• Most modern neural networks are trained by maximum likelihood: the cost is the cross-entropy J(θ) = −E_{x,y∼p̂_data}[log p_model(y | x)]
(Goodfellow 2016)
Stochastic gradient descent
• Computing the gradient over all m training examples is O(m)
• Sample a mini-batch instead
(Goodfellow 2016)
Stochastic gradient descent
• No convergence guarantees
• Sensitive to the values of the initial parameters
• Recipe:
  ◦ Initialize all weights to small random numbers
  ◦ Initialize biases to zero or to small positive values
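A minimal NumPy sketch of this recipe on a toy linear-regression problem; the dataset, learning rate, and batch size are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                  # toy dataset (hypothetical)
y = X @ np.array([1., -2., 0.5, 0., 3.]) + 0.1 * rng.normal(size=1000)

w = 0.01 * rng.normal(size=5)    # small random weights, as in the recipe
b = 0.0                          # bias initialized to zero
lr, batch_size = 0.1, 32         # illustrative hyperparameters

for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)              # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    err = Xb @ w + b - yb                                       # residuals on the mini-batch
    grad_w = 2. / batch_size * Xb.T @ err                       # O(batch) instead of O(m)
    grad_b = 2. / batch_size * err.sum()
    w -= lr * grad_w
    b -= lr * grad_b
```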
(Goodfellow 2016)
Linear output units
• Suppose the network provides a set of hidden features h = f(x; θ)
• Predict a vector of continuous variables y
• Linear output unit: ŷ = W^⊤ h + b
• It corresponds to producing the mean of a conditional Gaussian distribution:
  ◦ p(y | x) = N(y; ŷ, I)
  ◦ Minimizing cross-entropy is equivalent to minimizing the mean squared error
• Linear units do not saturate
(Goodfellow 2016)
Sigmoid output units
• Predict a binary variable y (binary classification problem)
• ŷ = P(y = 1 | x), based on h = f(x; θ)
• Bad idea: threshold a linear unit, ŷ = max{0, min{1, w^⊤ h + b}}; the gradient is 0 whenever w^⊤ h + b leaves [0, 1]
• Sigmoid output unit:
  ◦ z = w^⊤ h + b is called the logit
  ◦ ŷ = σ(z)
• The log in the cost function undoes the exp of the sigmoid
  ◦ No gradient saturation
(Goodfellow 2016)
Saturation
• J(θ) = −log σ(z) if y = 1
• J(θ) = −log(1 − σ(z)) = −log σ(−z) if y = 0
• One equation: J(θ) = −log σ((2y − 1) z) = ζ((1 − 2y) z)
• Saturates only when (1 − 2y) z is very negative
  ◦ i.e., when we have converged to the solution
• In the limit of extremely incorrect z, the softplus function does not shrink the gradient at all
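A small NumPy sketch of this softplus form of the loss (the helper names are mine): computing J directly from the logit z as ζ((1 − 2y) z) keeps the gradient large for confidently wrong predictions, instead of first computing σ(z) and letting it saturate:

```python
import numpy as np

def softplus(a):
    # numerically stable softplus: log(1 + exp(a))
    return np.maximum(a, 0.) + np.log1p(np.exp(-np.abs(a)))

def bce_from_logit(z, y):
    # J(theta) = softplus((1 - 2y) z): saturates only when (1 - 2y) z is very
    # negative, i.e. when the prediction is already correct and confident.
    return softplus((1. - 2. * y) * z)

z = np.array([-20., 20.])
print(bce_from_logit(z, y=1.))   # [~20, ~0]: a very wrong z keeps a large loss/gradient
```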
(Goodfellow 2016)
Softmax output units
• A generalization of the sigmoid to represent a distribution over n values
  ◦ multi-class (multinoulli) classification
  ◦ In fact it is a soft arg-max, winner-take-all:
    ▪ one of the outputs is nearly 1 and the others are nearly 0
• ŷ_i = P(y = i | x) = softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
• z = W^⊤ h + b (we can think of z as unnormalized log-probabilities given by the network)
• Cost function: −log softmax(z)_i = −z_i + log Σ_j exp(z_j)
  ◦ the −z_i term cannot saturate, so learning can proceed
• We can impose a requirement that one element of z be fixed
  ◦ But it is simpler to implement the overparametrized version
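A small NumPy sketch (my own helper, not from the slides) of the cost −z_i + log Σ_j exp(z_j), using the usual max-subtraction trick so the log-sum-exp does not overflow:

```python
import numpy as np

def cross_entropy_from_logits(z, i):
    # -log softmax(z)_i = -z_i + log sum_j exp(z_j)
    # Subtracting max(z) keeps exp() from overflowing; the -z_i term keeps
    # the loss from saturating when z_i is the correct, maximal logit.
    m = z.max()
    return -z[i] + m + np.log(np.sum(np.exp(z - m)))

z = np.array([1000., 1., -2.])          # extreme logits; naive exp(z) would overflow
print(cross_entropy_from_logits(z, 0))  # ~0: correct and confident
print(cross_entropy_from_logits(z, 2))  # ~1002: strong learning signal
```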
(Goodfellow 2016)
Saturation
• If z_i is extremely positive, the corresponding output saturates to 1
• If z_i is extremely negative, it saturates to 0
  ◦ An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_j z_j) and z_i is much greater than all the other inputs
  ◦ The output softmax(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater
• The MSE loss function will perform poorly with softmax outputs because of these vanishing-gradient problems
(Goodfellow 2016)
Output Types
Output Type    Output Distribution    Output Layer    Cost Function
Binary         Bernoulli              Sigmoid         Binary cross-entropy
Discrete       Multinoulli            Softmax         Discrete cross-entropy
Continuous     Gaussian               Linear          Gaussian cross-entropy (MSE)
(Goodfellow 2016)
Don't mix and match
• Sigmoid output with a target of 1: use the cross-entropy loss, not MSE
(Goodfellow 2016)
Other output units
• Design a neural network to estimate the variance of a Gaussian distribution representing p(y | x)
• Loss function:
  ◦ −log N(y; μ, β^(−1)) ∝ −(1/2) log β + (1/2) β (y − μ)²
• Check notebook
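A minimal sketch of such a loss, assuming the network emits two outputs per example (a mean μ and a raw precision value); the softplus used to keep the precision β positive is my own assumption, not something prescribed by the slide:

```python
import numpy as np

def gaussian_nll(y, mu, raw_beta):
    # beta is the precision (inverse variance); softplus keeps it positive
    # (a common modeling choice, not dictated by the slides).
    beta = np.log1p(np.exp(raw_beta))
    # -log N(y; mu, beta^{-1}) up to the constant 0.5 * log(2*pi)
    return -0.5 * np.log(beta) + 0.5 * beta * (y - mu) ** 2

# mu and raw_beta would be two outputs of the network for each example
print(gaussian_nll(y=1.0, mu=0.8, raw_beta=0.0))
```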
(Goodfellow 2016)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2016)
Hidden units
• Most hidden units can be described as:
  ◦ accepting a vector of inputs x,
  ◦ computing an affine transformation z = W^⊤ x + b,
  ◦ and applying an element-wise nonlinear function g(z).
• Use ReLUs, 90% of the time
• For some research projects, get creative
• Many hidden units perform comparably to ReLUs. New hidden units that perform comparably are rarely interesting.
(Goodfellow 2016)
ReLU is not differentiable at 0!
• The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives.
  ◦ In the case of g(z) = max{0, z},
    ▪ the left derivative at z = 0 is 0
    ▪ and the right derivative is 1.
  ◦ Software implementations of neural network training usually return one of the one-sided derivatives
    ▪ rather than reporting that the derivative is undefined or raising an error.
(Goodfellow 2016)
When ReLU is active
• When the input is positive, we say that the rectifier is active:
  ◦ the gradient is 1
  ◦ the second derivative is 0, so there are no second-order effects.
• When initializing the parameters of W^⊤ x + b,
  ◦ set all elements of b to a small positive value, such as 0.1
  ◦ This makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allows the derivatives to pass through.
(Goodfellow 2016)
ReLU generalizations
• h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i)
  ◦ Absolute value rectification: α_i = −1
  ◦ Leaky ReLU: α_i = 0.01
  ◦ Parametric ReLU: α_i is learnable
• Rectified linear units and their generalizations are based on the principle that models are easier to optimize if their behavior is closer to linear.
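A small sketch of this family of units; generalized_relu is my own helper name:

```python
import numpy as np

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i)
    return np.maximum(0., z) + alpha * np.minimum(0., z)

z = np.array([-2., -0.5, 0., 3.])
print(generalized_relu(z, alpha=0.0))    # plain ReLU
print(generalized_relu(z, alpha=-1.0))   # absolute value rectification: |z|
print(generalized_relu(z, alpha=0.01))   # leaky ReLU
# parametric ReLU: alpha is a parameter vector learned by gradient descent
```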
(Goodfellow 2016)
Other hidden units
• Sigmoid
  ◦ g(z) = σ(z)
  ◦ They can saturate
    ▪ which makes learning difficult
• Hyperbolic tangent
  ◦ g(z) = tanh(z)
  ◦ Related to the sigmoid: tanh(z) = 2σ(2z) − 1
  ◦ It typically performs better
    ▪ since it resembles the identity function near 0
    ▪ so long as the activations of the network can be kept small.
(Goodfellow 2016)
Other hidden units
• The authors tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1%
• Linear hidden units offer an effective way of reducing the number of parameters in a network
  ◦ Assume W^⊤ is p × n
  ◦ Factor W^⊤ = UV, where U is p × q and V is q × n
  ◦ In total we have q(n + p) trainable parameters
    ▪ Much less than the pn of the unfactored layer, for small q
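A quick NumPy illustration of the parameter-count argument; the sizes n, p, q are made up for the example:

```python
import numpy as np

n, p, q = 1000, 1000, 50          # illustrative sizes
full_params = p * n               # dense p x n weight matrix: 1,000,000 parameters
factored_params = q * (n + p)     # W^T = U V with U (p x q), V (q x n): 100,000 parameters

rng = np.random.default_rng(0)
U = rng.normal(size=(p, q))
V = rng.normal(size=(q, n))
x = rng.normal(size=n)
h = U @ (V @ x)                   # same output dimension, far fewer parameters
print(full_params, factored_params, h.shape)
```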
(Goodfellow 2016)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2016)
Architecture Basics
(diagram: depth is the number of layers, width the number of units per layer)
• h^(1) = g^(1)(W^(1)⊤ x + b^(1))
• h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2))
• …
(Goodfellow 2016)
Universal Approximation Theorem
• One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy
• So why go deeper?
  ◦ Shallow nets may need (exponentially) more width
  ◦ Shallow nets may overfit more
(Goodfellow 2016)
Bad idea: one hidden layer
• Consider binary functions over v ∈ {0, 1}^n
  ◦ The image of any of the 2^n possible inputs can be 0 or 1
    ▪ In total, there are 2^(2^n) possible functions
  ◦ Selecting one such function requires O(2^n) degrees of freedom
• Working with a single layer is sufficient to represent "any" function
  ◦ Yet the width of the layer can be exponential, and the model is prone to overfitting
(Goodfellow 2016)
Montufar et al. (2014)
• Example: absolute value rectification
  ◦ has the same output for every pair of mirror points in its input
  ◦ the mirror axis of symmetry is given by Wh + b
  ◦ a function computed on top of that unit is simpler
  ◦ if we fold again, we get an even simpler function, and so on …
• A deep rectifier network can represent functions with a number of regions that is exponential in the depth of the network.
(Goodfellow 2016)
Exponential Representation Advantage of Depth
• We obtain an exponentially large number of piecewise linear regions, which can capture all kinds of repeating patterns.
(Goodfellow 2016)
Why deep?
• Choosing a deep model encodes a very general belief (or prior) that the function we want to learn should involve composition of several simpler functions
• Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output
• Empirically, greater depth does seem to result in better generalization for a wide variety of tasks
(Goodfellow 2016)
Better Generalization with Greater Depth
(Figure 6.6: test accuracy increases with the number of layers)
• Task: transcribe multi-digit numbers from photographs of addresses
(Goodfellow 2016)
Large, Shallow Models Overfit More
Figure 6.7
(Goodfellow 2016)
Other architectural considerations
• Other architectures: CNNs, RNNs
• Skip connections: going from layer i to layer i + 2 or higher
  ◦ They make it easier for the gradient to flow from the output layers to layers nearer the input (see the sketch below).
Design some neural networks for fun: https://playground.tensorflow.org/
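A minimal NumPy sketch of a skip connection around two layers, assuming the block's output has the same shape as its input so the two can be added; the function names are mine:

```python
import numpy as np

def relu(z):
    return np.maximum(0., z)

def block_with_skip(h, W1, b1, W2, b2):
    # Two affine + ReLU layers; the input h is added back on the output
    # ("skips" over them), so W2 @ ... must match the shape of h.
    # Gradients then have a direct path from the block's output back to h.
    return relu(W2 @ relu(W1 @ h + b1) + b2) + h

rng = np.random.default_rng(0)
h = rng.normal(size=8)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(8, 16)), np.zeros(8)
print(block_with_skip(h, W1, b1, W2, b2).shape)   # (8,), same as the input
```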
(Goodfellow 2016)
Roadmap
• Example: Learning XOR
• Gradient-Based Learning
• Hidden Units
• Architecture Design
• Back-Propagation
(Goodfellow 2016)
Compute the gradient
• Forward propagation: x → ŷ → J(θ)
• Backprop: ∂J(θ)/∂θ = ?
  ◦ Computing derivatives by propagating information backward through the network
  ◦ Simple and inexpensive
  ◦ Based on the chain rule:
    ▪ z = Wx + b
    ▪ h = g(z)
    ▪ ŷ = w^⊤ h + b
    ▪ J(ŷ) = (y − ŷ)²
    ▪ θ = {W, b, w, b}; ∂J/∂θ = ?
• ∂J/∂b_i = (∂J/∂ŷ)(∂ŷ/∂h_i)(∂h_i/∂z_i)(∂z_i/∂b_i)
• ∂J/∂W_{i,j} = (∂J/∂ŷ)(∂ŷ/∂h_i)(∂h_i/∂z_i)(∂z_i/∂W_{i,j})
(Goodfellow 2016)
Chain Rule
• More generally
  ◦ x ∈ R^m, y ∈ R^n, z ∈ R
  ◦ y = g(x), z = f(y) = f(g(x))
  ◦ ∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i)
  ◦ Vector notation:
    ▪ ∇_x z = (∂y/∂x)^⊤ ∇_y z, where ∂y/∂x is the n × m Jacobian matrix of g
    ▪ Or: ∇_x z = Σ_j (∂z/∂y_j) ∇_x y_j
• Backprop consists of performing a Jacobian-gradient product for each operation in the graph.
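A small NumPy sketch of one Jacobian-gradient product, for y = Ax followed by z = Σ_j tanh(y_j), checked against a finite difference; the specific functions are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))          # y = g(x) = A x, so dy/dx is the 3 x 4 Jacobian A
x = rng.normal(size=4)

y = A @ x
z = np.sum(np.tanh(y))               # scalar z = f(y)

grad_y = 1. - np.tanh(y) ** 2        # nabla_y z
grad_x = A.T @ grad_y                # Jacobian-gradient product: (dy/dx)^T nabla_y z

# finite-difference check of one component
eps = 1e-6
e0 = np.zeros(4); e0[0] = eps
numeric = (np.sum(np.tanh(A @ (x + e0))) - z) / eps
print(grad_x[0], numeric)            # the two values should agree closely
```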
(Goodfellow 2016)
Back-Propagation
• Back-propagation is "just the chain rule" of calculus
• But it is a particular implementation of the chain rule
  ◦ Uses dynamic programming (table filling)
  ◦ Avoids recomputing repeated subexpressions
  ◦ Speed vs. memory tradeoff
(Goodfellow 2016)
Simple Back-Prop Example
(diagram: forward prop computes the activations and the loss; back-prop computes the derivatives)
(Goodfellow 2016)
General implementation: Computation Graphs
(Figure 6.8: example graphs for multiplication, a ReLU layer, logistic regression ŷ = σ(x^⊤ w + b), and linear regression with weight decay on a mini-batch)
(Goodfellow 2016)
Repeated Subexpressions
Figure 6.9: back-prop avoids computing this twice
(Goodfellow 2016)
Symbol-to-Symbol Differentiation
Figure 6.10
(Goodfellow 2016)
Implementing the chain rule
• Given a computation graph with nodes u^(1), …, u^(n), we wish to compute ∂u^(n)/∂u^(i) for i = 1, …, n_i
  ◦ u^(n) can be the cost function
  ◦ u^(i), i ≤ n_i, are the trainable parameters of the network
  ◦ each u^(i) is associated with an operation f:
    ▪ u^(i) = f(A^(i)), where A^(i) is the set of parents of u^(i)
• ∂u^(n)/∂u^(i) = Σ_{j : i ∈ Pa(u^(j))} (∂u^(n)/∂u^(j)) (∂u^(j)/∂u^(i))
(Goodfellow 2016)
Forward propagation
(Goodfellow 2016)
Backward propagation
• Initialize the gradient table with ∂u^(n)/∂u^(n) = 1, then fill it in by visiting the nodes in reverse order
(Goodfellow 2016)
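A toy sketch of the table-filling scheme from the previous slides, on a five-node graph; the graph and the parents/local_grad bookkeeping structures are my own illustrative choices, not a library API:

```python
import numpy as np

# Tiny computation graph: u1 = x, u2 = w, u3 = u1*u2, u4 = tanh(u3), u5 = u4**2.
x, w = 0.5, -1.3
u = {1: x, 2: w}
u[3] = u[1] * u[2]
u[4] = np.tanh(u[3])
u[5] = u[4] ** 2                     # the cost u^(n)

# 'parents' lists each node's inputs; local_grad[(j, i)] returns du_j/du_i
# evaluated at the stored forward values.
parents = {3: [1, 2], 4: [3], 5: [4]}
local_grad = {
    (3, 1): lambda: u[2],            # d(u1*u2)/du1
    (3, 2): lambda: u[1],            # d(u1*u2)/du2
    (4, 3): lambda: 1. - np.tanh(u[3]) ** 2,
    (5, 4): lambda: 2. * u[4],
}

grad_table = {5: 1.0}                # d u^(n) / d u^(n) = 1
for i in sorted(parents, reverse=True):        # reverse topological order: 5, 4, 3
    for p in parents[i]:
        # accumulate grad_table[p] += grad_table[i] * du_i/du_p
        grad_table[p] = grad_table.get(p, 0.) + grad_table[i] * local_grad[(i, p)]()

print(grad_table[1], grad_table[2])  # dJ/dx and dJ/dw, each subexpression computed once
```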
Efficiency
• The amount of computation required for back-propagation scales linearly with the number of edges in the graph G:
  ◦ the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents)
  ◦ as well as performing one multiplication and one addition.
• Naïve approach: expanding the chain rule as a sum over all paths from a parameter to the cost can require an exponential number of repeated subexpression evaluations.
(Goodfellow 2016)
Two implementation approaches
• Take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach "symbol-to-number" differentiation. This is the approach used by PyTorch.
• Another approach is to take a computational graph and add additional nodes to the graph that provide a symbolic description of the desired derivatives. This is the approach taken by TensorFlow.
  ◦ Because the derivatives are just another computational graph, it is possible to run back-propagation again on them to obtain higher-order derivatives.
(Goodfellow 2016)
Forward propagation for a neural network
(Goodfellow 2016)
Gradients of biases and weights
• We have a^(k) = b^(k) + W^(k) h^(k−1) and J = L + λΩ
  ◦ ∂J/∂b_i^(k) = (∂L/∂a_i^(k)) · 1 + λ ∂Ω/∂b_i^(k)
  ◦ Vectorized: ∇_{b^(k)} J = ∇_{a^(k)} L + λ ∇_{b^(k)} Ω
  ◦ ∂J/∂W_{i,j}^(k) = (∂L/∂a_i^(k)) · h_j^(k−1) + λ ∂Ω/∂W_{i,j}^(k)
  ◦ Vectorized: ∇_{W^(k)} J = (∇_{a^(k)} L) h^{(k−1)⊤} + λ ∇_{W^(k)} Ω   (outer product)
(Goodfellow 2016)
Gradient of representations
• We have a^(k) = b^(k) + W^(k) h^(k−1)
  ◦ ∇_{h^(k−1)} L = (∂a^(k)/∂h^(k−1))^⊤ ∇_{a^(k)} L = W^{(k)⊤} ∇_{a^(k)} L
• h^(k) = f(a^(k))
  ◦ ∇_{a^(k)} L = ∇_{h^(k)} L ⊙ f′(a^(k))
(Goodfellow 2016)
Backward propagation for a neural network
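A minimal sketch of the per-layer equations from the two previous slides, for one fully connected layer; it omits the regularization term Ω, assumes a ReLU nonlinearity for f′(a), and the function name is mine:

```python
import numpy as np

def layer_backward(grad_h, a, h_prev, W):
    """Backward pass through one layer with a = b + W h_prev, h = relu(a)."""
    grad_a = grad_h * (a > 0)           # nabla_a L = nabla_h L (elementwise) f'(a); ReLU assumed
    grad_b = grad_a                     # nabla_b L = nabla_a L
    grad_W = np.outer(grad_a, h_prev)   # nabla_W L = (nabla_a L) h_prev^T  (outer product)
    grad_h_prev = W.T @ grad_a          # nabla_{h^(k-1)} L = W^T nabla_a L
    return grad_W, grad_b, grad_h_prev

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
h_prev = rng.normal(size=3)
a = W @ h_prev                          # bias omitted for brevity
grad_W, grad_b, grad_h_prev = layer_backward(rng.normal(size=4), a, h_prev, W)
print(grad_W.shape, grad_b.shape, grad_h_prev.shape)   # (4, 3) (4,) (3,)
```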
(Goodfellow 2016)
Gradients for matrix multiplication
• If C = AB and G = ∇_C J:
  ◦ ∇_A J = G B^⊤
  ◦ ∇_B J = A^⊤ G
• Useful to backprop over a batch of samples
• A layer computing z = Wh receives G = ∇_z J from the layer above:
  ◦ G h^⊤ is the gradient (update) for W
  ◦ W^⊤ G is back-propagated to h
(Goodfellow 2016)
Backprop for a batch of samples
• Cost of forward propagation: O(w) multiply-adds
• Cost of backward propagation: O(w) multiply-adds
• Memory cost: O(m n_h) (m examples, n_h hidden units)
• Example: two-layer network H = max{0, XW^(1)}, U^(2) = HW^(2), with weight decay
  ◦ J = J_MLE + λ (Σ_{i,j} (W^(1)_{i,j})² + Σ_{i,j} (W^(2)_{i,j})²)
  ◦ G = ∇_{U^(2)} J_MLE
  ◦ ∇_H J_MLE = G W^{(2)⊤}
  ◦ ∇_{W^(2)} J = H^⊤ G + 2λ W^(2)
  ◦ G′ = (∇_H J_MLE) ⊙ 1[U^(1) > 0]   (zero out where the ReLU is inactive)
  ◦ ∇_{U^(1)} J_MLE = G′
  ◦ ∇_{W^(1)} J = X^⊤ G′ + 2λ W^(1)
(Goodfellow 2016)
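A NumPy sketch of the batched forward/backward pass described on the previous slide; for brevity it uses a squared-error fit in place of the cross-entropy J_MLE, omits biases, and all shapes and data are illustrative:

```python
import numpy as np

def relu(z):
    return np.maximum(0., z)

rng = np.random.default_rng(0)
m, n_in, n_h, n_out = 32, 10, 20, 5
X = rng.normal(size=(m, n_in))
Y = rng.normal(size=(m, n_out))
W1 = 0.01 * rng.normal(size=(n_in, n_h))
W2 = 0.01 * rng.normal(size=(n_h, n_out))
lam = 1e-4

# forward: O(w) multiply-adds; memory O(m * n_h) to keep H for the backward pass
U1 = X @ W1
H = relu(U1)
U2 = H @ W2
J = np.mean((U2 - Y) ** 2) + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

# backward: same O(w) cost, one matrix product per edge of the graph
G = 2. / (m * n_out) * (U2 - Y)        # gradient w.r.t. U2 of the squared-error term
grad_W2 = H.T @ G + 2. * lam * W2      # A^T G rule for C = A B, plus weight decay
G_H = G @ W2.T                         # G B^T rule, propagated back to H
G_prime = G_H * (U1 > 0)               # gate through the ReLU
grad_W1 = X.T @ G_prime + 2. * lam * W1
print(float(J), grad_W1.shape, grad_W2.shape)
```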
Backprop in deep learning vs. the general field of automatic differentiation
• Reverse mode accumulation
• Forward mode accumulation
• Simplifying algebraic expressions (e.g., Theano):
  ◦ ŷ_i = exp(z_i) / Σ_j exp(z_j)
  ◦ J = −Σ_i y_i log ŷ_i
  ◦ ∂J/∂z_i = ŷ_i − y_i
• Specialized libraries vs. automatic generation
  ◦ Data structures with a bprop() method
• Backprop is not the only way or the optimal way of computing the gradient, but it is practical
(Goodfellow 2016)
High-order derivatives
• Properties of the Hessian may guide the learning
• J(θ) is a function R^n → R
  ◦ The Hessian of J(θ) is in R^{n×n}
• In typical deep learning, n can be in the billions
  ◦ Computing the Hessian is not practical
• Krylov methods are a set of iterative techniques for performing various operations like:
  ◦ approximately inverting a matrix
  ◦ or finding eigenvectors or eigenvalues,
  ◦ without using any operation other than matrix-vector products.
(Goodfellow 2016)
Hessian-vector Products
• Compute He^(i) for all i = 1, …, n
• Where e^(i) is the one-hot vector with:
  ◦ e^(i)_i = 1
  ◦ e^(i)_j = 0 for j ≠ i
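A small NumPy sketch of a Hessian-vector product that never forms H explicitly, using a finite difference of the gradient; the quadratic cost (for which the exact Hessian is known) is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))
H = A @ A.T + np.eye(n)               # exact Hessian of J(theta) = 0.5 * theta^T H theta

def grad_J(theta):
    return H @ theta                   # gradient of the quadratic cost

theta = rng.normal(size=n)
v = rng.normal(size=n)

# Hessian-vector product via a finite difference of the gradient:
# H v ~= (grad J(theta + eps*v) - grad J(theta)) / eps
eps = 1e-5
hvp = (grad_J(theta + eps * v) - grad_J(theta)) / eps
print(np.allclose(hvp, H @ v, atol=1e-4))   # True: matches the exact product

# Applying this to the one-hot vectors e^(i) recovers the Hessian column by column.
```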
(Goodfellow 2016)
Questions
• Watch Ian's lecture:
  ◦ https://drive.google.com/file/d/0B64011x02sIkRExCY0FDVXFCOHM/view