CS60010: Deep Learning
Sudeshna Sarkar Spring 2018
16 Jan 2018
FFN
• Goal: Approximate some unknown ideal function f : X → Y
• Ideal classifier: y = f*(x), with input x and category y
• Feedforward Network: Define a parametric mapping y = f(x; θ)
• Learn parameters θ to get a good approximation to f* from the available samples
• The function f is a composition of many different functions
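A minimal sketch (not from the slides; sizes and weights are made up) of a feedforward network as a composition of simpler functions, y = f(x; θ) = f2(f1(x)):

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # first layer: R^3 -> R^4
    W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer: R^4 -> R^2

    def f1(x):
        return np.maximum(0.0, W1 @ x + b1)         # affine map followed by a nonlinearity

    def f2(h):
        return W2 @ h + b2                          # affine output layer

    def f(x):
        return f2(f1(x))                            # y = f(x; theta), a composition of functions

    print(f(rng.normal(size=3)))                    # 2-dimensional output for a random input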
Gradient Descent
[Figure: loss surface over weights W_1, W_2, showing the original W and a step in the negative gradient direction]
• The effects of step size (or "learning rate")
Slides based on cs231n by Fei-Fei Li & Andrej Karpathy & Justin Johnson

Stochastic Gradient Descent
[Figure: true gradients in blue, minibatch gradients in red, over weights W_1, W_2]
• Gradients are noisy but still make good progress on average
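A minimal SGD sketch on a toy least-squares problem (illustrative only; the data, learning rate, and batch size are made up). Each step follows a noisy minibatch gradient, yet the iterates still approach the true weights:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 5))
    true_w = rng.normal(size=5)
    y = X @ true_w + 0.1 * rng.normal(size=1000)

    w = np.zeros(5)
    lr = 0.1                                          # step size ("learning rate")
    batch = 32

    for step in range(500):
        idx = rng.integers(0, len(X), size=batch)     # sample a minibatch
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / batch       # noisy minibatch gradient
        w -= lr * grad                                # move in the negative gradient direction

    print(np.linalg.norm(w - true_w))                 # small: noisy steps make good progress on average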
Cost functions:
• In most cases, our parametric model defines a distribution p(y | x; θ)
• Use the principle of maximum likelihood
• The cost function is often the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution:
  J(θ) = −E_{x,y ~ p̂_data} log p_model(y | x)
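A minimal sketch of this cost as an empirical average of −log p_model(y | x), for a made-up categorical model (the probabilities and labels below are purely illustrative):

    import numpy as np

    def negative_log_likelihood(probs, labels):
        # J(theta) = -E_{x,y ~ p_data} log p_model(y | x), estimated on the training sample
        return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

    # probs[i, k] = model probability of class k for example i (made-up numbers)
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1]])
    labels = np.array([0, 1])
    print(negative_log_likelihood(probs, labels))   # cross-entropy between data and model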
Conditional Distributions and Cross-Entropy
J(θ) = −E_{x,y ~ p̂_data} log p_model(y | x)
• The specific form of the cost function changes from model to model, depending on the specific form of log p_model
• For example, if p_model(y | x) = N(y; f(x; θ), I) then we recover the mean squared error cost (a numerical check follows after this slide),
  J(θ) = ½ E_{x,y ~ p̂_data} ||y − f(x; θ)||² + const
• This equivalence between maximum likelihood estimation with a Gaussian output distribution and minimization of mean squared error holds for any f(x; θ) used to predict the mean of the Gaussian
• Specifying a model p(y | x) automatically determines a cost function −log p(y | x)
• The gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm; output units that saturate make the cost flat
• The negative log-likelihood helps to avoid this problem for many models
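A quick numerical check of the Gaussian / MSE correspondence (a sketch assuming p_model(y | x) = N(y; f(x; θ), I), with made-up values): the Gaussian negative log-likelihood and ½||y − f(x; θ)||² differ only by a constant that does not depend on θ.

    import numpy as np

    def gaussian_nll(y, f_x):
        # -log N(y; f_x, I) for a d-dimensional y
        d = y.shape[-1]
        return 0.5 * np.sum((y - f_x) ** 2) + 0.5 * d * np.log(2 * np.pi)

    y, f_x = np.array([1.0, -2.0]), np.array([0.5, -1.0])
    half_mse = 0.5 * np.sum((y - f_x) ** 2)
    print(gaussian_nll(y, f_x) - half_mse)   # (d/2) log(2*pi): a constant, independent of theta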
Learning Conditional Statistics
• Sometimes we merely want to predict some statistic of y conditioned on x, using specialized loss functions
• For example, we may have a predictor f(x; θ) that we wish to use to predict the mean of y
• With a sufficiently powerful neural network, we can think of the NN as being able to represent any function f from a wide class of functions
• View the cost function as a functional rather than just a function; a functional is a mapping from functions to real numbers
• We can thus think of learning as choosing a function rather than a set of parameters
• We can design our cost functional to have its minimum occur at some specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps x to the expected value of y given x.
Learning Conditional Statistics
• Results derived using calculus of variations:
  1. Solving the optimization problem
       f* = argmin_f E_{x,y ~ p_data} ||y − f(x)||²
     yields f*(x) = E_{y ~ p_data(y|x)}[y], i.e. the mean of y for each x.
  2. Solving
       f* = argmin_f E_{x,y ~ p_data} ||y − f(x)||_1
     yields a function that predicts the median value of y for each x: Mean Absolute Error (MAE). (A quick numerical check of both results follows after this slide.)
• MSE and MAE often lead to poor results when used with gradient-based optimization: some output units saturate and produce very small gradients when combined with these cost functions.
• Thus the use of cross-entropy is popular even when it is not necessary to estimate an entire distribution p(y | x)
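A quick numerical check of both results (a sketch with made-up samples of y for a single fixed x): among constant predictions, the squared error is minimized by the sample mean and the absolute error by the sample median.

    import numpy as np

    y = np.array([0.0, 0.0, 1.0, 2.0, 10.0])     # samples of y for some fixed x
    c = np.linspace(-5, 15, 20001)               # candidate constant predictions f(x) = c

    mse = ((y[None, :] - c[:, None]) ** 2).mean(axis=1)
    mae = np.abs(y[None, :] - c[:, None]).mean(axis=1)

    print(c[mse.argmin()], y.mean())             # both 2.6  -> MSE recovers the mean
    print(c[mae.argmin()], np.median(y))         # both 1.0  -> MAE recovers the median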
Output Units
1. Linear units for Gaussian output distributions. Linear output layers are often used to produce the mean of a conditional Gaussian distribution p(y | x) = N(y; ŷ, I).
   • Maximizing the log-likelihood is then equivalent to minimizing the mean squared error
   • Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms
2. Sigmoid units for Bernoulli output distributions. Two-class classification problem; we need to predict p(y = 1 | x):
   ŷ = σ(wᵀh + b)
3. Softmax units for Multinoulli output distributions. Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes.
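A minimal sketch of the three output layers applied to the same hidden representation h (all weights below are made-up illustrations, not a prescribed architecture):

    import numpy as np

    rng = np.random.default_rng(2)
    h = rng.normal(size=4)                          # last hidden layer

    # 1. Linear unit: mean of a conditional Gaussian
    W_lin, b_lin = rng.normal(size=(1, 4)), np.zeros(1)
    y_hat = W_lin @ h + b_lin

    # 2. Sigmoid unit: P(y = 1 | x) for a Bernoulli output, y_hat = sigma(w^T h + b)
    w, b = rng.normal(size=4), 0.0
    p_y1 = 1.0 / (1.0 + np.exp(-(w @ h + b)))

    # 3. Softmax unit: distribution over n = 3 discrete classes
    W, c = rng.normal(size=(3, 4)), np.zeros(3)
    z = W @ h + c                                   # unnormalized log probabilities
    p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

    print(y_hat, p_y1, p, p.sum())                  # p is a proper distribution (sums to 1)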
Softmax output
• For a discrete variable with n values, produce a vector ŷ with ŷ_i = P(y = i | x)
• A linear layer predicts unnormalized log probabilities:
  z = Wᵀh + b,  where z_i = log P̂(y = i | x)
• softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
• When training the softmax to output a target value y using maximum log-likelihood, maximize
  log P(y = i; z) = log softmax(z)_i = z_i − log Σ_j exp(z_j)
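A sketch of the log-softmax identity above, with the standard max-subtraction trick for numerical stability (an implementation detail not on the slide, but needed in practice):

    import numpy as np

    def log_softmax(z):
        # log softmax(z)_i = z_i - log sum_j exp(z_j); subtracting max(z) leaves the
        # result unchanged but prevents overflow in exp.
        z_shift = z - z.max()
        return z_shift - np.log(np.sum(np.exp(z_shift)))

    z = np.array([1000.0, 1001.0, 1002.0])          # naive exp(z) would overflow
    print(log_softmax(z))                            # finite log-probabilities
    print(np.exp(log_softmax(z)).sum())              # 1.0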
Output Types

Output Type | Output Distribution  | Output Layer    | Cost Function
Binary      | Bernoulli            | Sigmoid         | Binary cross-entropy
Discrete    | Multinoulli          | Softmax         | Discrete cross-entropy
Continuous  | Gaussian             | Linear          | Gaussian cross-entropy (MSE)
Continuous  | Mixture of Gaussians | Mixture Density | Cross-entropy
Continuous  | Arbitrary            | GAN, VAE, FVBN  | Various
Sigmoid output with target of 1
[Figure: cross-entropy loss vs. MSE loss as a function of z for a sigmoid output σ(z), z ranging from −3 to 3]
• Bad idea to use MSE loss with a sigmoid unit.
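A sketch of why this is a bad idea (target y = 1, so the loss is a function of z as in the figure): when the sigmoid saturates at a confidently wrong answer (z very negative), the gradient of the MSE loss with respect to z vanishes, while the cross-entropy gradient stays close to −1.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])     # very wrong ... very right, target y = 1
    s = sigmoid(z)

    grad_mse = 2 * (s - 1) * s * (1 - s)            # d/dz of (sigma(z) - 1)^2
    grad_ce = s - 1                                 # d/dz of -log sigma(z)

    print(grad_mse)   # ~0 at z = -10: almost no learning signal despite a badly wrong prediction
    print(grad_ce)    # ~-1 at z = -10: a strong, predictable gradient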
Hidden Units
• Rectified linear units are an excellent default choice of hidden unit
• Use ReLUs 90% of the time
• Many hidden units perform comparably to ReLUs; new hidden units that perform only comparably are rarely interesting
Rectified Linear Activation (ReLU)
g(z) = max{0, z}
[Figure: plot of g(z) against z: zero for z < 0, linear with slope 1 for z ≥ 0]
• Positives:
  • Gives large and consistent gradients (does not saturate) when active
  • Efficient to optimize; converges much faster than sigmoid or tanh
• Negatives:
  • Output is not zero-centered
  • Units can "die": when inactive they will never update
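A sketch of the ReLU and its gradient, illustrating both points above: the gradient is a constant 1 whenever the unit is active (no saturation), and exactly 0 when it is inactive (a unit that never activates never updates, i.e. it "dies").

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)

    def relu_grad(z):
        # 1 where the unit is active, 0 where it is inactive
        return (z > 0).astype(float)

    z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(relu(z))        # [0.  0.  0.  0.5 3. ]
    print(relu_grad(z))   # [0. 0. 0. 1. 1.]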
Architecture Basics
• Depth
• Width
Universal Approximation Theorem
• One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy
• So why go deeper?
  • A shallow net may need (exponentially) more width
  • A shallow net may overfit more
• http://mcneela.github.io/machine_learning/2017/03/21/Universal-Approximation-Theorem.html
• https://blog.goodaudience.com/neural-networks-part-1-a-simple-proof-of-the-universal-approximation-theorem-b7864964dbd3
• http://neuralnetworksanddeeplearning.com/chap4.html
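A sketch of the constructive idea behind the theorem, in the spirit of the neuralnetworksanddeeplearning.com chapter linked above (all numbers made up): two steep sigmoid hidden units form a "bump", and a single wide hidden layer of such bumps can approximate a 1-D function to arbitrary accuracy, at the cost of width.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def bump(x, left, right, height, steepness=1000.0):
        # One "bump" built from two steep sigmoid hidden units with opposite signs
        return height * (sigmoid(steepness * (x - left)) - sigmoid(steepness * (x - right)))

    # Target: f*(x) = sin(2*pi*x) on [0, 1], approximated by 50 bumps (100 hidden units)
    x = np.linspace(0, 1, 1000)
    edges = np.linspace(0, 1, 51)
    approx = sum(bump(x, a, b, np.sin(2 * np.pi * (a + b) / 2))
                 for a, b in zip(edges[:-1], edges[1:]))

    print(np.max(np.abs(approx - np.sin(2 * np.pi * x))))   # max error; shrinks as width grows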