
Page 1

Intro to Deep Learning
Mandela Patrick
August 17, 2020

Page 2

What is Deep Learning?

Page 3

What is Deep Learning?

Page 4

Lecture Schedule

Lectures, Q&A, and coding exercises:

Monday August 17: Intro to Deep Learning (coding exercise: PyTorch Basics & MNIST classification using an MLP)
Tuesday August 18: Deep Sequence Modeling (coding exercise: Sentiment Analysis using RNN and BERT)
Wednesday August 19: Deep Computer Vision (coding exercise: Image Classification and Detection using CNNs)
Thursday August 20: Deep Generative Modeling (coding exercise: Image Generation using GANs)
Friday August 21: Deep Learning Applications (Timothy Hospadales)

Page 5

Class Support

• Slack:

• Useful for discussion with fellow classmates

• Course Website: http://bpttbrightminds.com

• Lecture Schedule

• Guest Q&As

• Software Labs

• Zoom Links

Page 6

Course Staff

Timothy Hospadales, Kwame Johnson, Mark Moyou, Patrick Hosein, Mandela Patrick

Page 7

Thanks to Sponsors!

Page 8

Why Deep Learning and Why Now?

Page 9

Why Deep Learning?

Hand-engineered features are time-consuming, brittle, and not scalable in practice.

Can we learn the underlying features directly from data?

Low-level features: edges. Mid-level features: eyes, ears, nose. High-level features: facial structure.

Page 10

Why Now?

Neural Networks date back decades, so why the resurgence?

1. Big Data
• Large Datasets
• Easier Collection and Storage

2. Hardware
• Graphics Processing Units (GPUs)

3. Software
• Improved Techniques
• Newer Models
• Toolboxes

Timeline:
1952: Stochastic Gradient Descent
1958: Perceptron (learnable weights)
1986: Backpropagation (multi-layer perceptron)
1995: Deep Convolutional NN (digit recognition)

Page 11

The Perceptron

Page 12

The Perceptron: Forward Propagation

$y = g\left(\sum_{i=1}^{m} x_i w_i\right)$

The output $y$ is a non-linear activation function $g$ applied to a linear combination (weighted sum) of the inputs.

[Diagram: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ feed a weighted sum $\Sigma$, followed by a non-linearity $\sigma$, producing the output $y$. Stages: Inputs, Weights, Sum, Non-Linearity, Output.]

Page 13

The Perceptron: Forward Propagation

$y = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right)$

A bias $w_0$ is added to the linear combination of inputs before the non-linear activation function $g$ is applied.

[Diagram: as before, with an additional constant input 1 weighted by the bias $w_0$.]

Page 14

The Perceptron: Forward Propagation

$y = g\left(w_0 + \sum_{i=1}^{m} x_i w_i\right) = g\left(w_0 + X^T W\right)$

where $X = [x_1 \; \ldots \; x_m]$ and $W = [w_1 \; \ldots \; w_m]$.

[Diagram: the same perceptron, with the weighted sum written in vector form.]

Page 15

The Perceptron: Forward Propagation

$y = g\left(w_0 + X^T W\right)$

Activation Functions

Example: the sigmoid function,

$g(z) = \phi(z) = \frac{1}{1 + e^{-z}}$

[Diagram: the same perceptron, highlighting the non-linearity $\sigma$.]

Page 16

Common Activation Functions

Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$, with derivative $g'(z) = g(z)\left(1 - g(z)\right)$

Hyperbolic Tangent: $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$, with derivative $g'(z) = 1 - g(z)^2$

Rectified Linear Unit (ReLU): $g(z) = \max(0, z)$, with derivative $g'(z) = \begin{cases} 1 & z > 0 \\ 0 & \text{otherwise} \end{cases}$
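These three activations and their derivatives can be checked numerically. A small NumPy sketch, for illustration only:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    g = sigmoid(z)
    return g * (1.0 - g)              # g'(z) = g(z)(1 - g(z))

def tanh(z):
    return np.tanh(z)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2      # g'(z) = 1 - g(z)^2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)      # 1 where z > 0, 0 otherwise

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z), d_sigmoid(z))
print(tanh(z), d_tanh(z))
print(relu(z), d_relu(z))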

Page 17

Importance of Activation Functions

The purpose of activation functions is to introduce non-linearities in the network.

What if we wanted to build a neural network to distinguish red and green points?

Page 18

Importance of Activation Functions

The purpose of activation functions is to introduce non-linearities in the network.

Linear activation functions produce linear decisions no matter the network size.

Non-linearities allow us to approximate arbitrarily complex functions.

Page 19

Importance of Activation Functions

The purpose of activation functions is to introduce non-linearities in the network.

Linear activation functions produce linear decisions no matter the network size.

Non-linearities allow us to approximate arbitrarily complex functions.

Page 20

The Perceptron: Example

We have: $w_0 = 1$ and $W = \begin{bmatrix} 3 \\ -2 \end{bmatrix}$

[Diagram: inputs $x_1$ and $x_2$ with weights $3$ and $-2$, a constant input $1$ with bias weight $w_0 = 1$, a weighted sum $\Sigma$, a non-linearity $\sigma$, and output $y$.]

Page 21

The Perceptron: Example

$y = g(1 + 3x_1 - 2x_2)$

The decision boundary is the line $1 + 3x_1 - 2x_2 = 0$ in the $(x_1, x_2)$ plane.

[Diagram: the same perceptron, alongside a plot of the line $1 + 3x_1 - 2x_2 = 0$.]

Page 22

The Perceptron: Example

$y = g(1 + 3x_1 - 2x_2)$

Assume we have the input $X = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$. Then:

$y = g(1 + (3 \cdot -1) - (2 \cdot 2)) = g(-6) \approx 0.002$

[Diagram: the point $(-1, 2)$ plotted against the line $1 + 3x_1 - 2x_2 = 0$.]
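A quick check of this arithmetic, assuming $g$ is the sigmoid defined earlier:

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2 = -1.0, 2.0
z = 1 + 3 * x1 - 2 * x2      # = -6
print(z, sigmoid(z))         # -6.0 and roughly 0.002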

Page 23

The Perceptron: Example

$y = g(1 + 3x_1 - 2x_2)$

On one side of the line $1 + 3x_1 - 2x_2 = 0$ we have $z < 0$ and $y < 0.5$; on the other side, $z > 0$ and $y > 0.5$.

[Diagram: the two half-planes on either side of the decision boundary.]

Page 24

Building Neural Networks with Perceptrons

Page 25

The Perceptron: Simplified

$y = g\left(w_0 + X^T W\right)$

[Diagram: inputs, weights, bias, sum, non-linearity, and output, as on the forward-propagation slides.]

Page 26

The Perceptron: Simplified

$z = w_0 + \sum_{j=1}^{m} x_j w_j, \qquad y = g(z)$

[Diagram: inputs $x_1, x_2, \ldots, x_m$ and their weights feed a single node $z$; the output is $y = g(z)$.]

Page 27

Multi-Output Perceptron

$z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i}$

Because all inputs are densely connected to all outputs, these layers are called Dense Layers.

[Diagram: inputs $x_1, x_2, \ldots, x_m$ feed two nodes $z_1$ and $z_2$, with outputs $y_1 = g(z_1)$ and $y_2 = g(z_2)$.]
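Since the coding exercises use PyTorch, a dense layer like this can be written with nn.Linear. A minimal sketch, assuming three inputs and two outputs chosen only for illustration:

import torch
import torch.nn as nn

m = 3                      # number of inputs (illustrative)
dense = nn.Linear(m, 2)    # dense layer: z_i = w_{0,i} + sum_j x_j * w_{j,i}

x = torch.randn(1, m)      # a batch containing a single example
z = dense(x)               # pre-activations z1, z2
y = torch.sigmoid(z)       # outputs y_i = g(z_i)
print(y)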

Page 28

Single Layer Neural Network

[Diagram: inputs $x_1, x_2, \ldots, x_m$, a hidden layer $z_1, z_2, z_3, \ldots, z_{d_1}$ connected by weights $W^{(1)}$, and final outputs $y_1, y_2$ connected by weights $W^{(2)}$. Stages: Inputs, Hidden, Final Output.]

Page 29

Single Layer Neural Network

[Diagram: the same single-layer network, with weight matrices $W^{(1)}$ and $W^{(2)}$.]

Page 30

Multi Output Perceptron

[Diagram: inputs $x_1, x_2, \ldots, x_m$, hidden nodes $z_1, z_2, z_3, \ldots, z_n$, and final outputs $y_1, y_2$. Stages: Inputs, Hidden, Final Output.]

Page 31

Deep Neural Network

[Diagram: inputs $x_1, x_2, \ldots, x_m$, a stack of hidden layers (the $k$-th layer has nodes $z_{k,1}, z_{k,2}, z_{k,3}, \ldots, z_{k,n_k}$), and final outputs $y_1, y_2$. Stages: Inputs, Hidden, Final Output.]
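One way to stack such dense layers into a deep network in PyTorch; the layer sizes below are arbitrary placeholders, not values from the lecture:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 16),   # inputs -> first hidden layer
    nn.ReLU(),
    nn.Linear(16, 16),  # hidden -> hidden
    nn.ReLU(),
    nn.Linear(16, 2),   # hidden -> final outputs y1, y2
)

x = torch.randn(4, 3)   # a batch of 4 examples with 3 features each
print(model(x).shape)   # torch.Size([4, 2])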

Page 32

Applying Neural Networks

Page 33

Example problem

Will I pass this class?

Let's start with a simple two-feature model:

$x_1$ = number of lectures you attend
$x_2$ = hours spent on the final project

Page 34

Example problem: Will I pass this class?

[Plot: $x_1$ = number of lectures you attend versus $x_2$ = hours spent on the final project. Legend: Pass, Fail.]

Page 35

Example problem: Will I pass this class?

[Plot: the same data, with a new point at $[4, 5]$: 4 lectures attended and 5 hours spent on the final project. Legend: Pass, Fail.]

Page 36

Example problem: Will I pass this class?

$x^{(1)} = [4, 5]$, Predicted: 0.1

[Diagram: the inputs $x_1, x_2$ feed a hidden layer $z_1, z_2, z_3$ and a single output $y_1$.]

Page 37

Example problem: Will I pass this class?

$x^{(1)} = [4, 5]$, Predicted: 0.1, Actual: 1

[Diagram: the same network.]

Page 38

Quantifying Loss

$\mathcal{L}\left(f(x^{(i)}; W), y^{(i)}\right)$

where $f(x^{(i)}; W)$ is the predicted output and $y^{(i)}$ is the actual label.

$x^{(1)} = [4, 5]$, Predicted: 0.1, Actual: 1

[Diagram: the same network.]

Page 39

Empirical Loss

$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W), y^{(i)}\right)$

where $f(x^{(i)}; W)$ is the predicted output and $y^{(i)}$ is the actual label.

The empirical loss measures the total loss over the entire dataset.

Also known as:
• Objective Function
• Cost Function
• Empirical Risk

[Diagram: the same network.]

Page 40

Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1:

$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log\left(f(x^{(i)}; W)\right) + \left(1 - y^{(i)}\right) \log\left(1 - f(x^{(i)}; W)\right) \right]$

where $y^{(i)}$ is the actual label and $f(x^{(i)}; W)$ is the predicted probability.

[Diagram: the same network.]
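A hedged sketch of this loss in PyTorch for the running example (predicted 0.1, actual 1); nn.BCELoss computes the same quantity as the formula above:

import torch
import torch.nn as nn

predicted = torch.tensor([0.1])   # f(x^(1); W), the predicted probability
actual = torch.tensor([1.0])      # y^(1), the actual label

loss = nn.BCELoss()(predicted, actual)
manual = -(actual * torch.log(predicted) + (1 - actual) * torch.log(1 - predicted))
print(loss.item(), manual.item())  # both roughly 2.30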

Page 41

Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers:

$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)}; W)\right)^2$

where $y^{(i)}$ is the actual value and $f(x^{(i)}; W)$ is the predicted value.

[Diagram: the same network.]
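The same pattern for mean squared error, with made-up regression values:

import torch
import torch.nn as nn

predicted = torch.tensor([2.5, 0.0, 1.8])
actual = torch.tensor([3.0, -0.5, 2.0])

print(nn.MSELoss()(predicted, actual).item())     # mean of the squared differences
print(((actual - predicted) ** 2).mean().item())  # the same value, computed by hand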

Page 42

Training Neural Networks

Page 43

Loss Optimization

We want to find the network weights that achieve the lowest loss:

$W^* = \underset{W}{\arg\min} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W), y^{(i)}\right) = \underset{W}{\arg\min} \; J(W)$

Page 44

Loss Optimization

We want to find the network weights that achieve the lowest loss:

$W^* = \underset{W}{\arg\min} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f(x^{(i)}; W), y^{(i)}\right) = \underset{W}{\arg\min} \; J(W)$

where $W = \{W^{(0)}, W^{(1)}, \ldots\}$ is the set of all weights in the network.

Page 45

Loss Optimization

Our loss is a function of the network weights:

$W^* = \underset{W}{\arg\min} \; J(W)$

Page 46

Loss Optimization

Randomly pick an initial (w0, w1)

Page 47

Loss Optimization

Compute gradient, $\frac{\partial J(W)}{\partial W}$

Page 48

Loss Optimization

Take small step in opposite direction of gradient

Page 49

Loss Optimization

Repeat until convergence

Page 50

Gradient Descent

Algorithm:

1. Initialize weights randomly, $W \sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
5. Return weights
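A minimal sketch of this loop in plain Python on a toy one-weight loss; the quadratic $J(w) = (w - 3)^2$ and the learning rate are chosen only for illustration:

import numpy as np

def J(w):
    return (w - 3.0) ** 2          # toy loss with its minimum at w = 3

def dJ_dw(w):
    return 2.0 * (w - 3.0)         # gradient of the toy loss

w = np.random.normal(0.0, 1.0)     # 1. initialize weights randomly ~ N(0, sigma^2)
eta = 0.1                          # learning rate
for _ in range(100):               # 2. loop (here: a fixed number of steps)
    grad = dJ_dw(w)                # 3. compute gradient
    w = w - eta * grad             # 4. update weights: W <- W - eta * dJ/dW
print(w)                           # 5. return weights: close to 3.0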

Page 51

Computing Gradients: Backpropagation

[Diagram: a chain $x \rightarrow z_1 \rightarrow y \rightarrow J(W)$, with weight $w_1$ between $x$ and $z_1$ and weight $w_2$ between $z_1$ and $y$.]

How does a small change in one weight (e.g. $w_2$) affect the final loss $J(W)$?

Page 52

Computing Gradients: Backpropagation

$\frac{\partial J(W)}{\partial w_2} = \; ?$

Let's use the chain rule!

Page 53

Computing Gradients: Backpropagation

$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial w_2}$

Page 54

Computing Gradients: Backpropagation

Apply the chain rule again, now for $w_1$:

$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial w_1}$

Page 55

Computing Gradients: Backpropagation

Apply the chain rule once more, expanding $\frac{\partial y}{\partial w_1}$ through $z_1$:

$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$

Page 56

Computing Gradients: Backpropagation

$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$

Repeat this for every weight in the network, using gradients from later layers.
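In practice, PyTorch's autograd applies this chain rule automatically. A small sketch of the $x \rightarrow z_1 \rightarrow y \rightarrow J(W)$ chain, with arbitrarily chosen functions and values:

import torch

x = torch.tensor(2.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)

z1 = w1 * x                  # first stage (an illustrative linear unit)
y = torch.sigmoid(w2 * z1)   # second stage
J = (y - 1.0) ** 2           # a simple loss

J.backward()                 # backpropagation: chain rule through y and z1
print(w1.grad, w2.grad)      # dJ/dw1 and dJ/dw2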

Page 57

Neural Networks in Practice: Optimization

Page 58

Training Neural Networks is Difficult

Page 59

Loss Functions Can be Difficult to Optimise

Optimization through gradient descent:

$W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$

Page 60

Loss Functions Can be Difficult to Optimise

Optimization through gradient descent:

$W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$

How can we set the learning rate $\eta$?

Page 61

Setting the Learning Rate

Small learning rates converge slowly and get stuck in false local minima.

Page 62

Setting the Learning Rate

Large learning rates overshoot, become unstable and diverge.

Page 63

Setting the Learning Rate

Stable learning rates converge smoothly, and avoid local minima.

Page 64

How to deal with this?

Idea 1:

Try lots of different learning rates and see what works “just right”

Page 65

How to deal with this?

Idea 1:

Try lots of different learning rates and see what works “just right”

Idea 2:

Design an adaptive learning rate that adapts to the loss landscape.

Page 66

Adaptive Learning Rates

• Learning rates are no longer fixed

• Can be made smaller or larger depending on:

• how large the gradient is

• how fast learning is happening

• the size of particular weights

• etc.

Page 67

Gradient Descent Algorithms

• SGD

• Adam

• Adadelta

• Adagrad

• RMSProp
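In PyTorch these algorithms live in torch.optim, and switching between them is a one-line change. A minimal sketch with a placeholder linear model and random data:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)                                    # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # or SGD, Adadelta, Adagrad, RMSprop

x, y = torch.randn(8, 3), torch.randn(8, 1)
loss = nn.MSELoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                           # one adaptive update of the weights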

Page 68

Neural Networks in Practice: Mini-Batches

Page 69

Gradient Descent

Algorithm:

1. Initialize weights randomly, $W \sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
5. Return weights

Page 70

Gradient Descent

Algorithm:

1. Initialize weights randomly, $W \sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Compute gradient, $\frac{\partial J(W)}{\partial W}$
4.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
5. Return weights

The gradient in step 3 is computationally intensive to compute, since it is taken over the entire dataset.

Page 71

Stochastic Gradient Descent

Algorithm:

1. Initialize weights randomly, $W \sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick a single data point $i$
4.     Compute gradient, $\frac{\partial J_i(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J_i(W)}{\partial W}$
6. Return weights

Page 72

Stochastic Gradient Descent

Algorithm:

1. Initialize weights randomly, $W \sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick a single data point $i$
4.     Compute gradient, $\frac{\partial J_i(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J_i(W)}{\partial W}$
6. Return weights

The single-point gradient is easy to compute, but noisy.

Page 73

Mini-Batch Gradient Descent

Algorithm:

1. Initialize weights randomly, $W \sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick a batch of $B$ data points
4.     Compute gradient, $\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
6. Return weights

Page 74

Mini-Batch Gradient Descent

Algorithm:

1. Initialize weights randomly, $W \sim \mathcal{N}(0, \sigma^2)$
2. Loop until convergence:
3.     Pick a batch of $B$ data points
4.     Compute gradient, $\frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}$
5.     Update weights, $W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}$
6. Return weights

The batch gradient is fast to compute and a better estimate of the true gradient than a single point.

Page 75

Mini-Batches While Training

More accurate estimation of the gradient
Smoother convergence
Allows for larger learning rates

Page 76

Mini-Batches While Training

More accurate estimation of the gradient
Smoother convergence
Allows for larger learning rates

Mini-batches lead to faster training: the computation can be parallelised across the batch to achieve significant speedups on GPUs.
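A hedged sketch of a mini-batch training loop in PyTorch, using a DataLoader to pick the batches; the dataset, batch size, and model are made up for illustration:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(256, 3)                 # toy dataset
y = torch.randn(256, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:               # pick a batch B of data points
        loss = loss_fn(model(xb), yb)   # loss averaged over the batch
        optimizer.zero_grad()
        loss.backward()                 # gradient estimated from the batch
        optimizer.step()                # W <- W - eta * dJ/dW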

Page 77

Neural Networks in Practice: Overfitting

Page 78

The Problem of Overfitting

Underfitting: the model does not have the capacity to fully learn the data.
Ideal fit.
Overfitting: too complex, extra parameters, does not generalise well.

Page 79

Regularization

What is it?
A technique that constrains our optimisation problem to discourage complex models.

Page 80

Regularization

What is it?
A technique that constrains our optimisation problem to discourage complex models.

Why do we need it?
To improve the generalisation of our model on unseen data.

Page 81

Regularization 1: Dropout

• During training, randomly set some activations to 0.

[Diagram: a network with inputs $x_1, x_2, \ldots, x_m$, hidden nodes $z_1, z_2, z_3, \ldots, z_{d_1}$, and outputs $y_1, y_2$.]

Page 82

Regularization 1: Dropout

• During training, randomly set some activations to 0
• Typically drop 50% of activations in a layer
• Forces the network to not rely on any one node

[Diagram: the same network, with some hidden nodes dropped.]

Page 83

Regularization 1: Dropout

• During training, randomly set some activations to 0
• Typically drop 50% of activations in a layer
• Forces the network to not rely on any one node

[Diagram: the same network, with a different set of hidden nodes dropped.]
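In PyTorch, dropout is just another layer; p=0.5 matches the 50% rule of thumb above. A minimal sketch with illustrative layer sizes:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(3, 16),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # during training, randomly zero 50% of the activations
    nn.Linear(16, 2),
)

model.train()            # dropout is active during training
print(model(torch.randn(1, 3)))
model.eval()             # dropout is disabled at evaluation time
print(model(torch.randn(1, 3)))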

Page 84

Regularization 2: Early Stopping

• Stop training before we have a chance to overfit.

[Plot: loss over the course of training, with the stopping point marked "Stop training here".]
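A minimal sketch of early stopping against a held-out validation set; the toy data, patience value, and improvement threshold are assumptions made for illustration:

import torch
import torch.nn as nn

X_train, y_train = torch.randn(200, 3), torch.randn(200, 1)   # toy training split
X_val, y_val = torch.randn(50, 3), torch.randn(50, 1)         # toy validation split

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    loss = loss_fn(model(X_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        val = loss_fn(model(X_val), y_val).item()   # monitor loss on unseen data
    if val < best_val - 1e-4:
        best_val, bad_epochs = val, 0               # still improving
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                  # stop training here
            break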

Page 85

Deep Learning Review

The Perceptron
• Structural Building Blocks
• Non-Linear Activation Functions

Neural Networks
• Stacking Perceptrons to form neural networks
• Optimization through backpropagation

Training in Practice
• Adaptive Learning
• Batching
• Regularisation

[Diagrams: a single perceptron, a multi-output network, and a deep network, recapping the building blocks from the lecture.]