Intro to Deep Learning
Mandela Patrick
August 17, 2020
What is Deep Learning?
Lecture Schedule
Lectures, Q&A, Coding Exercises
Monday August 17: Intro to Deep Learning (coding exercise: PyTorch Basics & MNIST classification using MLP)
Tuesday August 18: Deep Sequence Modeling (coding exercise: Sentiment Analysis using RNN and BERT)
Wednesday August 19: Deep Computer Vision (coding exercise: Image Classification and Detection using CNNs)
Thursday August 20: Deep Generative Modeling (coding exercise: Image Generation using GANs)
Friday August 21: Deep Learning Applications (Timothy Hospadales)
Class Support
• Slack:
  • Useful for discussion with fellow classmates
• Course Website: http://bpttbrightminds.com
  • Lecture Schedule
  • Guest Q&As
  • Software Labs
  • Zoom Links
Course Staff
Timothy Hospadales Kwame Johnson Mark Moyou Patrick Hosein
Mandela Patrick
Thanks to Sponsors!
Why Deep Learning and Why Now?
Why Deep Learning?
Hand-engineered features are time consuming, brittle, and not scalable in practice
Can we learn underlying features directly from data?
Low level features: Edges
Mid level features: Eyes, ears, nose
High level features: Facial structure
Why Now?
Neural Networks date back decades, so why the resurgence?
1. Big Data
• Large Datasets
• Easier Collection and Storage
2. Hardware
• Graphics Processing Units (GPUs)
• Easier Collection and Storage
3. Software
• Improved Techniques
• Newer Models
• Toolboxes
1952: Stochastic Gradient Descent
1958: Perceptron • Learnable weights
1986: Backpropagation • Multi-layer Perceptron
1995: Deep Convolutional NN • Digit Recognition
The Perceptron
The Perceptron: Forward Propagation
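$y = g\left(\sum_{i=1}^{m} x_i w_i\right)$
where $y$ is the output, $g$ is a non-linear activation function, and the sum is a linear combination of the inputs.
[Diagram: Inputs (x1, x2, x3) → Weights (w1, w2, w3) → Weighted Sum (∑) → Non-Linearity (σ) → Output (y)]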
The Perceptron: Forward Propagation
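$y = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right)$
where $y$ is the output, $g$ is a non-linear activation function, $w_0$ is the bias, and the sum is a linear combination of the inputs.
[Diagram: Inputs (x1, x2, x3) → Weights (w1, w2, w3) → Sum (∑) → Non-Linearity (σ) → Output (y), with an extra constant input 1 weighted by the bias w0]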
The Perceptron: Forward Propagation
$y = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right)$
$y = g\left(w_0 + X^T W\right)$
where: $X = [x_1 \dots x_m]^T$ and $W = [w_1 \dots w_m]^T$
[Diagram: Inputs (x1, x2, x3, and a constant 1) → Weights (w1, w2, w3, w0) → Sum (∑) → Non-Linearity (σ) → Output (y)]
The Perceptron: Forward Propagation
[Diagram: Inputs (x1, x2, x3, and a constant 1) → Weights (w1, w2, w3, w0) → Sum (∑) → Non-Linearity (σ) → Output (y)]
$y = g\left(w_0 + X^T W\right)$
Activation Functions
Example: sigmoid function
$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$
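The forward pass of a single perceptron can be sketched in a few lines of plain Python. This is a minimal illustration, not course code: the function names and the example inputs and weights are hypothetical, and the sigmoid is used as the activation g.

```python
import math

def sigmoid(z):
    """Logistic sigmoid g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_forward(x, w, w0):
    """Return y = g(w0 + sum_i x_i * w_i) for an input list x and weight list w."""
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    return sigmoid(z)

# Hypothetical inputs and weights, chosen only for illustration.
y = perceptron_forward(x=[1.0, 2.0, 3.0], w=[0.5, -0.25, 0.1], w0=0.0)
print(y)
```

Whatever the weights are, the output always lies between 0 and 1, because the sigmoid squashes the weighted sum.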
Common Activation Functions
Sigmoid Function: $g(z) = \frac{1}{1 + e^{-z}}$, $g'(z) = g(z)(1 - g(z))$
Hyperbolic Tangent: $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, $g'(z) = 1 - g(z)^2$
Rectified Linear Unit (ReLU): $g(z) = \max(0, z)$, $g'(z) = 1$ if $z > 0$, $0$ otherwise
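As a sanity check on the three activation functions and their derivatives, the sketch below implements each pair and compares the analytic derivative with a central finite difference. The helper `numeric_derivative` and the test point 0.7 are illustrative choices, not part of the lecture.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_sigmoid(z):
    g = sigmoid(z)
    return g * (1.0 - g)          # g'(z) = g(z)(1 - g(z))

def tanh(z):
    return math.tanh(z)

def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2  # g'(z) = 1 - g(z)^2

def relu(z):
    return max(0.0, z)

def d_relu(z):
    return 1.0 if z > 0 else 0.0

def numeric_derivative(f, z, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

z = 0.7  # arbitrary test point, away from ReLU's kink at 0
for name, f, df in [("sigmoid", sigmoid, d_sigmoid),
                    ("tanh", tanh, d_tanh),
                    ("relu", relu, d_relu)]:
    print(name, df(z), numeric_derivative(f, z))
```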
Importance of Activation Functions
The purpose of activation functions is to introduce non-linearities in the network.
What if we wanted to build a neural network to distinguish red and green points?
Linear activation functions produce linear decisions no matter the network size
Non-linearities allow us to approximate arbitrarily complex functions
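A quick numeric check of why linear activations give linear decisions regardless of depth: two stacked linear layers compute exactly the same function as one linear layer with combined weights. The weights below are hypothetical and the helper `linear_layer` is an illustrative name.

```python
def linear_layer(x, W, b):
    """Dense layer with a linear (identity) activation: returns W x + b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output.
W1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.1, -0.3]
W2, b2 = [[2.0, -1.0]], [0.4]

# The same map written as ONE linear layer: W = W2 W1, b = W2 b1 + b2.
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
     for i in range(1)]
b = [sum(W2[i][k] * b1[k] for k in range(2)) + b2[i] for i in range(1)]

x = [0.7, -1.2]
stacked = linear_layer(linear_layer(x, W1, b1), W2, b2)
collapsed = linear_layer(x, W, b)
print(stacked, collapsed)  # the two agree up to floating-point rounding
```

Placing a non-linear g between the two layers breaks this collapse, which is what lets deeper networks represent more than a straight-line decision boundary.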
The Perceptron: Example
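[Diagram: inputs x1 and x2 with weights 3 and −2, plus a constant input 1 with bias weight 1, feed the sum ∑ and the non-linearity σ to give the output y]
We have: $w_0 = 1$ and $W = [3, -2]^T$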
The Perceptron: Example
$y = g(1 + 3x_1 - 2x_2)$
[Plot: the line $1 + 3x_1 - 2x_2 = 0$ drawn in the plane with axes x1 and x2]
The Perceptron: Example
$y = g(1 + 3x_1 - 2x_2)$
Assume we have input: $X = [-1, 2]^T$
$y = g(1 + (3 \cdot -1) - (2 \cdot 2)) = g(-6) \approx 0.002$
[Plot: the line $1 + 3x_1 - 2x_2 = 0$ in the plane with axes x1 and x2, with the input point marked]
The Perceptron: Example
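$y = g(1 + 3x_1 - 2x_2)$
[Plot: the line $1 + 3x_1 - 2x_2 = 0$ splits the plane with axes x1 and x2 into two regions]
On one side of the line: $z < 0$, $y < 0.5$
On the other side: $z > 0$, $y > 0.5$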
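The worked example can be reproduced directly. This sketch assumes g is the sigmoid, as on the earlier slides, and uses the weights from the example; the function names are illustrative.

```python
import math

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

def z_of(x1, x2):
    """Pre-activation of the example perceptron: z = 1 + 3*x1 - 2*x2."""
    return 1 + 3 * x1 - 2 * x2

y = g(z_of(-1, 2))       # input X = [-1, 2] gives z = -6
print(round(y, 4))       # 0.0025, i.e. y < 0.5 because z < 0

print(g(z_of(1, 2)))     # a point on the line z = 0 gives exactly 0.5
print(g(z_of(0, 0)) > 0.5)  # z = 1 > 0, so y > 0.5
```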
Building Neural Networks with Perceptrons
The Perceptron: Simplified
[Diagram: Inputs (x1, x2, x3, and a constant 1) → Weights (w1, w2, w3, w0) → Sum (∑) → Non-Linearity (σ) → Output (y)]
$y = g\left(w_0 + X^T W\right)$
The Perceptron: Simplified
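$z = w_0 + \sum_{j=1}^{m} x_j w_j$
$y = g(z)$
[Diagram: inputs x1, x2, …, xm connect through weights w1, w2, …, wm to a single unit z, which outputs y]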
Multi-Output Perceptron
Because all inputs are densely connected to all outputs, these layers are called Dense Layers.
$z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i}$
$y_1 = g(z_1)$
$y_2 = g(z_2)$
[Diagram: inputs x1, x2, …, xm each connect to both units z1 and z2]
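A dense layer is just the perceptron formula applied once per output unit. The sketch below follows the indexing on the slide, with `W[j][i]` as the weight from input j to output i; the example numbers are hypothetical and the sigmoid is assumed for g.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, w0, g=sigmoid):
    """Compute y_i = g(w0[i] + sum_j x[j] * W[j][i]) for every output unit i."""
    z = [w0[i] + sum(x[j] * W[j][i] for j in range(len(x)))
         for i in range(len(w0))]
    return [g(zi) for zi in z]

# Hypothetical layer with 2 inputs and 2 outputs.
x = [1.0, 2.0]
W = [[0.5, -1.0],
     [0.25, 1.0]]
w0 = [0.0, 0.5]
y = dense(x, W, w0)   # z = [1.0, 1.5] before the activation
print(y)
```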
Single Layer Neural Network
[Diagram: Inputs x1, x2, …, xm → Hidden units z1, z2, z3, …, z_{d1} (weights W(1)) → Final Output y1, y2 (weights W(2))]
Multi Output Perceptron
[Diagram: Inputs x1, x2, …, xm → Hidden units z1, z2, z3, …, zn → Final Output y1, y2]
Deep Neural Network
[Diagram: Inputs x1, x2, …, xm → … → Hidden layer k with units z_{k,1}, z_{k,2}, z_{k,3}, …, z_{k,n_k} → … → Final Output y1, y2]
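Stacking dense layers gives the deep network in the diagram. The sketch below is a minimal forward pass under stated assumptions: sigmoid activations on every layer, weights drawn from N(0, 1), zero initial biases, and hypothetical layer sizes. None of these choices come from the slides.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, W, w0):
    """One dense layer with sigmoid activation; W[j][i] links input j to unit i."""
    return [sigmoid(w0[i] + sum(x[j] * W[j][i] for j in range(len(x))))
            for i in range(len(w0))]

def init_layer(n_in, n_out, rng):
    """Weights drawn from N(0, 1); biases start at 0 (an illustrative choice)."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]
    return W, [0.0] * n_out

def forward(x, layers):
    """Feed x through each layer in turn; the last layer's output is returned."""
    h = x
    for W, w0 in layers:
        h = dense(h, W, w0)
    return h

rng = random.Random(0)
sizes = [2, 3, 3, 1]  # 2 inputs, two hidden layers of 3 units, 1 output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
y = forward([4.0, 5.0], layers)
print(y)  # one value between 0 and 1; untrained, so not yet meaningful
```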
Applying Neural Networks
Example problem: Will I pass this class?
Let’s start with a simple two-feature model:
x1 = Number of lectures you attend
x2 = Hours spent on the final project
Example problem: Will I pass this class?
[Plot: past students in the (x1, x2) plane, where x1 = number of lectures you attend and x2 = hours spent on the final project. Legend: Pass, Fail]
[The same plot with the point [4, 5] marked]
Example problem: Will I pass this class?
[Diagram: inputs x1, x2 → hidden units z1, z2, z3 → output y1]
$x^{(1)} = [4, 5]$
Predicted: 0.1
Actual: 1
Quantifying Loss
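$\mathcal{L}\left(f(x^{(i)}; W), y^{(i)}\right)$
where $f(x^{(i)}; W)$ is the predicted value and $y^{(i)}$ is the actual value.
[Diagram: inputs x1, x2 → hidden units z1, z2, z3 → output y1]
$x^{(1)} = [4, 5]$, Predicted: 0.1, Actual: 1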
Empirical Loss
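The empirical loss measures the average loss over the entire dataset.
$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W), y^{(i)}\right)$
where $f(x^{(i)}; W)$ is the predicted value and $y^{(i)}$ is the actual value.
Also known as: • Objective Function • Cost Function • Empirical Risk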
Binary Cross Entropy Loss
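Cross entropy loss can be used with models that output a probability between 0 and 1.
$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log\left(f(x^{(i)}; W)\right) + (1 - y^{(i)}) \log\left(1 - f(x^{(i)}; W)\right) \right]$
where $y^{(i)}$ is the actual label and $f(x^{(i)}; W)$ is the predicted probability.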
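A minimal sketch of binary cross entropy in plain Python. It uses the usual convention of negating the average log-likelihood so that the loss is non-negative and smaller is better. The clamping constant `eps` is an implementation detail added here to avoid log(0); it is not part of the slide.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# The running example: predicted 0.1 when the actual label is 1.
print(binary_cross_entropy([1], [0.1]))   # -log(0.1), about 2.30
# A confident, correct prediction is penalised far less.
print(binary_cross_entropy([1], [0.9]))   # -log(0.9), about 0.105
```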
Mean Squared Error Loss
Mean squared error loss can be used with regression models that output continuous real numbers.
$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)}; W)\right)^2$
where $y^{(i)}$ is the actual value and $f(x^{(i)}; W)$ is the predicted value.
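Mean squared error is short enough to write out directly. The example values below are hypothetical (think of final grades as the regression target).

```python
def mean_squared_error(y_true, y_pred):
    """Average of (actual - predicted)^2 over all examples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual vs. predicted grades.
actual = [90.0, 20.0, 95.0]
predicted = [30.0, 80.0, 85.0]
print(mean_squared_error(actual, predicted))  # (3600 + 3600 + 100) / 3
```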
Training Neural Networks
Loss Optimization
We want to find the network weights that achieve the lowest loss.
$W^* = \arg\min_W \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W), y^{(i)}\right)$
$W^* = \arg\min_W J(W)$
where $W = \{W^{(0)}, W^{(1)}, \dots\}$
Loss Optimization
$W^* = \arg\min_W J(W)$
Our loss is a function of the network weights.
1. Randomly pick an initial $(w_0, w_1)$
2. Compute the gradient, $\frac{\partial J(W)}{\partial W}$
3. Take a small step in the opposite direction of the gradient
4. Repeat until convergence
Gradient Descent
Algorithm:
1. Initialize weights randomly ~ $\mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
5. Return weights
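The algorithm can be sketched on a toy problem. Here the loss is a hypothetical two-weight bowl, J(w) = (w0 − 3)² + (w1 + 1)², chosen because its minimum at (3, −1) is known. The stopping rule (gradient norm below `tol`) is one possible reading of "until convergence", not something the slides specify.

```python
import random

def grad_J(w):
    """Gradient of the toy loss J(w) = (w0 - 3)^2 + (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(grad, w, eta=0.1, tol=1e-8, max_steps=10_000):
    """Repeat W <- W - eta * dJ/dW until the gradient is (nearly) zero."""
    for _ in range(max_steps):
        g = grad(w)
        if sum(gi * gi for gi in g) < tol ** 2:   # converged
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

rng = random.Random(0)
w_init = [rng.gauss(0.0, 1.0) for _ in range(2)]  # weights ~ N(0, 1)
w_star = gradient_descent(grad_J, w_init)
print(w_star)  # close to [3.0, -1.0]
```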
Computing Gradients: Backpropagation
[Diagram: x →(w1)→ z1 →(w2)→ y → J(W)]
How does a small change in one weight (e.g. $w_2$) affect the final loss $J(W)$?
Computing Gradients: Backpropagation
[Diagram: x →(w1)→ z1 →(w2)→ y → J(W)]
Let’s use the chain rule:
$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial w_2}$
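Computing Gradients: Backpropagation
[Diagram: x →(w1)→ z1 →(w2)→ y → J(W)]
$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial w_1}$
Apply the chain rule again to the last factor:
$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$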
Repeat this for every weight in the network using gradients from later layers.
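The chain-rule products can be checked numerically on the two-weight chain x → z1 → y → J. The concrete functions below are assumptions made for illustration: z1 = sigmoid(w1·x), y = w2·z1, and J = ½(y − target)². The slides do not fix these choices.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w1, w2, target):
    z1 = sigmoid(w1 * x)            # hidden unit
    y = w2 * z1                     # output
    J = 0.5 * (y - target) ** 2     # loss
    return z1, y, J

def backward(x, w1, w2, target):
    """Return (dJ/dw1, dJ/dw2) using the chain rule, reusing dJ/dy for both."""
    z1, y, _ = forward(x, w1, w2, target)
    dJ_dy = y - target
    dy_dw2 = z1
    dy_dz1 = w2
    dz1_dw1 = z1 * (1.0 - z1) * x   # sigmoid derivative times inner derivative
    return dJ_dy * dy_dz1 * dz1_dw1, dJ_dy * dy_dw2

def numeric_grads(x, w1, w2, target, h=1e-6):
    """Finite-difference gradients, used only to check the chain rule."""
    J = lambda a, b: forward(x, a, b, target)[2]
    return ((J(w1 + h, w2) - J(w1 - h, w2)) / (2 * h),
            (J(w1, w2 + h) - J(w1, w2 - h)) / (2 * h))

args = (1.5, 0.8, -0.5, 1.0)        # hypothetical x, w1, w2, target
print(backward(*args))
print(numeric_grads(*args))         # should match to several decimal places
```

Note how `dJ_dy` is computed once and reused for both weights: gradients from later layers feed the computation for earlier ones.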
Neural Networks in Practice: Optimization
Training Neural Networks is Difficult
Loss Functions Can be Difficult to Optimise
Optimization through gradient descent:
$W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
How can we set the learning rate?
Setting the Learning Rate
• Small learning rates converge slowly and get stuck in false local minima.
• Large learning rates overshoot, become unstable, and diverge.
• Stable learning rates converge smoothly and avoid local minima.
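The three regimes are easy to see on a toy loss. The sketch below minimises J(w) = w², whose gradient is 2w, for 20 steps from w = 1. The three learning rates are hypothetical values picked to show each behaviour. Because this loss is convex it has no false local minima, so only the speed and stability effects are visible here.

```python
def run_gradient_descent(eta, w=1.0, steps=20):
    """Minimise J(w) = w^2 (gradient 2w) and return the final w."""
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

for eta in (0.001, 0.3, 1.1):
    print(eta, run_gradient_descent(eta))
# eta = 0.001: w barely moves from 1 (too slow)
# eta = 0.3:   w is essentially 0 (converges smoothly)
# eta = 1.1:   |w| grows every step (overshoots and diverges)
```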
How to deal with this?
Idea 1:
Try lots of different learning rates and see what works “just right”
Idea 2:
Design an adaptive learning rate that adapts to the landscape.
Adaptive Learning Rates
• Learning rates are no longer fixed
• Can be made smaller and larger depending on:
  • how large the gradient is
  • how fast learning is happening
  • size of particular weights
  • etc.
Gradient Descent Algorithms
• SGD
• Adam
• Adadelta
• Adagrad
• RMSProp
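As one concrete example from this list, here is a minimal sketch of the Adagrad update, which scales each weight's step by the history of its own squared gradients. The toy loss J(w) = w0² + 100·w1², the step size, and the step count are all hypothetical choices for illustration. In practice a library optimizer would be used rather than hand-written code.

```python
import math

def grad_J(w):
    """Gradient of a badly scaled toy loss J(w) = w0^2 + 100 * w1^2."""
    return [2.0 * w[0], 200.0 * w[1]]

def adagrad(grad, w, eta=0.5, eps=1e-8, steps=500):
    """Adagrad: divide each step by the root of the summed squared gradients."""
    G = [0.0] * len(w)                       # running sum of squared gradients
    for _ in range(steps):
        g = grad(w)
        G = [Gi + gi * gi for Gi, gi in zip(G, g)]
        w = [wi - eta * gi / (math.sqrt(Gi) + eps)
             for wi, gi, Gi in zip(w, g, G)]
    return w

w = adagrad(grad_J, [1.0, 1.0])
print(w)  # both weights end up very close to 0 despite the 100x scale gap
```

The steep direction (w1) automatically receives a smaller effective learning rate than the shallow one (w0), which is the "adapts to the landscape" idea in miniature.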
Neural Networks in Practice: Mini-Batches
Gradient Descent
Algorithm:
1. Initialize weights randomly ~ $\mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
5. Return weights
Computing the full gradient $\frac{\partial J(W)}{\partial W}$ over the entire dataset is computationally intensive.
Stochastic Gradient Descent
Algorithm:
1. Initialize weights randomly ~ $\mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick single data point $i$
4.     Compute gradient, $\frac{\partial J_i(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J_i(W)}{\partial W}$
6. Return weights
The single-point gradient $\frac{\partial J_i(W)}{\partial W}$ is easy to compute, but noisy.
Mini-Batch Gradient Descent
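Algorithm:
1. Initialize weights randomly ~ $\mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick a batch of $B$ data points
4.     Compute gradient, $\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
6. Return weights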
The mini-batch gradient is fast to compute and gives a better estimate of the true gradient than a single data point.
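Mini-batch gradient descent can be sketched on a small regression problem. Everything concrete here is an assumption made for illustration: the synthetic data y = 2x + 1, the model w·x + b with a squared-error loss, the batch size B = 8, the learning rate, and the fixed number of steps used in place of a convergence test.

```python
import random

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
data = [(x, 2.0 * x + 1.0) for x in xs]      # synthetic, noise-free targets

def minibatch_gradient(w, b, batch):
    """Average gradient of (w*x + b - y)^2 over the B points in the batch."""
    B = len(batch)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in batch) / B
    db = sum(2.0 * (w * x + b - y) for x, y in batch) / B
    return dw, db

w, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # random initial weights
eta, B = 0.05, 8
for step in range(2000):
    batch = rng.sample(data, B)                  # pick a batch of B data points
    dw, db = minibatch_gradient(w, b, batch)
    w, b = w - eta * dw, b - eta * db            # update weights

print(w, b)  # close to the true values 2.0 and 1.0
```

Setting B = 1 turns this into stochastic gradient descent, and setting B = len(data) recovers full-batch gradient descent.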
Mini Batches While Training
• More accurate estimation of gradient
  • Smoother convergence
  • Allows for larger learning rates
• Mini-batches lead to faster training
  • Can parallelise computation and achieve significant speedups on GPUs
Neural Networks in Practice: Overfitting
The Problem of Overfitting
Underfitting: Model does not have capacity to fully learn the data
Ideal Fit
Overfitting: Too complex, extra parameters, does not generalise well
Regularization
What is it?
Technique that constrains our optimisation problem to discourage complex models
Why do we need it?
Improve generalisation of our model on unseen data
Regularization 1: Dropout
• During training, randomly set some activations to 0
  • Typically drop 50% of activations in a layer
  • Forces the network not to rely on any one node
[Diagram: Inputs x1, x2, …, xm → hidden units z1, z2, z3, …, z_{d1}, some of them dropped → outputs y1, y2]
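A minimal sketch of dropout on a list of activations. One detail here is an assumption not stated on the slide: surviving activations are divided by the keep probability ("inverted dropout"), a common convention that keeps the expected activation unchanged so nothing needs rescaling at test time.

```python
import random

def dropout(activations, p_drop=0.5, training=True, rng=random):
    """Zero each activation with probability p_drop during training only."""
    if not training or p_drop == 0.0:
        return list(activations)          # no dropout at test time
    keep = 1.0 - p_drop
    # Survivors are scaled by 1/keep (inverted dropout, an assumed convention).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.2, 0.9, 0.5, 0.7]             # hypothetical hidden activations
print(dropout(hidden, p_drop=0.5, rng=rng))        # some entries zeroed
print(dropout(hidden, p_drop=0.5, training=False)) # unchanged
```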
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit.
[Figure annotation: "Stop training here"]
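One common way to decide where to stop is to watch the loss on held-out validation data and stop once it has not improved for a few epochs. The sketch below assumes that rule; the `patience` parameter and the loss values are hypothetical, and the slide itself does not specify the criterion.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss seen before stopping."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss   # new best: keep these weights
        elif epoch - best_epoch >= patience:
            break                                 # no improvement for `patience` epochs
    return best_epoch

# Hypothetical validation losses: improving at first, then rising (overfitting).
val = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(val))  # 3
```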
Deep Learning Review
The Perceptron
• Structural Building Blocks
• Non-Linear Activation Functions
Neural Networks
• Stacking Perceptrons to form neural networks
• Optimization through backpropagation
Training in Practice
• Adaptive Learning
• Batching
• Regularisation