
Intro to Deep Learning
Mandela Patrick, August 17, 2020

What is Deep Learning?

Lecture Schedule

Each day combines lectures, guest Q&As, and coding exercises:

Monday, August 17: Intro to Deep Learning · Coding exercise: PyTorch Basics & MNIST classification using MLP
Tuesday, August 18: Deep Sequence Modeling · Coding exercise: Sentiment Analysis using RNN and BERT
Wednesday, August 19: Deep Computer Vision · Coding exercise: Image Classification and Detection using CNNs
Thursday, August 20: Deep Generative Modeling · Coding exercise: Image Generation using GANs
Friday, August 21: Deep Learning Applications (Timothy Hospedales)

Class Support

• Slack: useful for discussion with fellow classmates
• Course Website: http://bpttbrightminds.com
  • Lecture Schedule
  • Guest Q&As
  • Software Labs
  • Zoom Links

Course Staff

Timothy Hospedales, Kwame Johnson, Mark Moyou, Patrick Hosein, Mandela Patrick

Thanks to Sponsors!

Why Deep Learning and Why Now?

Why Deep Learning?

Hand-engineered features are time-consuming, brittle, and not scalable in practice.

Can we learn underlying features directly from data?

Low-level features (edges) → Mid-level features (eyes, ears, nose) → High-level features (facial structure)

Why Now?

Neural Networks date back decades, so why the resurgence?

1. Big Data
   • Large datasets
   • Easier collection and storage

2. Hardware
   • Graphics Processing Units (GPUs)
   • Massively parallelisable

3. Software
   • Improved techniques
   • Newer models
   • Toolboxes

1952: Stochastic Gradient Descent
1958: Perceptron (learnable weights)
1986: Backpropagation (multi-layer perceptron)
1995: Deep Convolutional NN (digit recognition)

The Perceptron

The Perceptron: Forward Propagation

y = g\left( \sum_{i=1}^{m} x_i w_i \right)

The output is a non-linear activation function g applied to a linear combination (weighted sum) of the inputs.

[Diagram: inputs x_1, x_2, x_3 weighted by w_1, w_2, w_3 feed a weighted sum ∑, followed by a non-linearity σ, producing output y]

The Perceptron: Forward Propagation

Adding a bias term w_0:

y = g\left( w_0 + \sum_{i=1}^{m} x_i w_i \right)

[Diagram: as before, with an extra constant input 1 weighted by the bias w_0]

The Perceptron: Forward Propagation

In vector form:

y = g(w_0 + X^T W), where X = [x_1 \ldots x_m] and W = [w_1 \ldots w_m]

The Perceptron: Forward Propagation

y = g(w_0 + X^T W)

Activation functions: g can be any non-linear function. Example: the sigmoid function

g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}
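As a concrete companion to the equations above (a minimal sketch, not from the slides; the input, weight, and bias values are made up for illustration), the forward pass is a few lines of PyTorch, which the course labs use:

```python
import torch

def perceptron_forward(X, W, w0):
    """Forward pass: y = g(w0 + X^T W), with g taken to be the sigmoid."""
    z = w0 + X @ W           # bias plus linear combination of inputs
    return torch.sigmoid(z)  # non-linear activation g

X = torch.tensor([1.0, 2.0, 3.0])    # inputs x1, x2, x3 (illustrative)
W = torch.tensor([0.5, -0.5, 0.25])  # weights w1, w2, w3 (illustrative)
w0 = 0.1                             # bias (illustrative)
print(perceptron_forward(X, W, w0))
```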

Common Activation Functions

Sigmoid: g(z) = \frac{1}{1 + e^{-z}},  with g'(z) = g(z)(1 - g(z))

Hyperbolic tangent: g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}},  with g'(z) = 1 - g(z)^2

Rectified Linear Unit (ReLU): g(z) = \max(0, z),  with g'(z) = 1 if z > 0, 0 otherwise
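The three functions and their derivatives can be written out directly; this hypothetical sketch simply mirrors the table above:

```python
import torch

def sigmoid(z):   return 1 / (1 + torch.exp(-z))
def d_sigmoid(z): s = sigmoid(z); return s * (1 - s)  # g'(z) = g(z)(1 - g(z))

def tanh(z):      return (torch.exp(z) - torch.exp(-z)) / (torch.exp(z) + torch.exp(-z))
def d_tanh(z):    return 1 - tanh(z) ** 2             # g'(z) = 1 - g(z)^2

def relu(z):      return torch.clamp(z, min=0)        # g(z) = max(0, z)
def d_relu(z):    return (z > 0).float()              # 1 if z > 0, else 0

z = torch.tensor([-2.0, 0.0, 2.0])
for g, dg in [(sigmoid, d_sigmoid), (tanh, d_tanh), (relu, d_relu)]:
    print(g(z), dg(z))
```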

Importance of Activation Functions

The purpose of activation functions is to introduce non-linearities into the network. What if we wanted to build a neural network to distinguish red and green points?

Linear activation functions produce linear decision boundaries no matter the network size. Non-linearities allow us to approximate arbitrarily complex functions.

The Perceptron: Example

We have w_0 = 1 and W = [3, -2], so:

y = g(1 + 3x_1 - 2x_2)

The line 1 + 3x_1 - 2x_2 = 0 is the decision boundary in the (x_1, x_2) plane.

Assume we have input X = [-1, 2]:

y = g(1 + (3 \times -1) - (2 \times 2)) = g(-6) \approx 0.002

In general, z < 0 gives y < 0.5 and z > 0 gives y > 0.5: the sigmoid output tells us which side of the decision boundary an input falls on.
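A quick sanity check of this arithmetic (an illustrative snippet, not part of the slides):

```python
import torch

w0 = 1.0
W = torch.tensor([3.0, -2.0])
x = torch.tensor([-1.0, 2.0])

z = w0 + x @ W             # 1 + 3*(-1) - 2*2 = -6
y = torch.sigmoid(z)
print(z.item(), y.item())  # -6.0, ~0.00247 (≈ 0.002)
```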

Building Neural Networks with Perceptrons

The Perceptron: Simplified

y = g(w_0 + X^T W)

Writing the pre-activation as z:

z = w_0 + \sum_{j=1}^{m} x_j w_j,   y = g(z)

[Diagram: inputs x_1 ... x_m with weights feed a single node z; the output is y = g(z)]

Multi-Output Perceptron

z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i}

Because all inputs are densely connected to all outputs, these layers are called Dense Layers.

[Diagram: inputs x_1 ... x_m densely connected to outputs y_1 = g(z_1) and y_2 = g(z_2)]
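In PyTorch such a dense layer is nn.Linear, which stores the weight matrix and the biases w_{0,i} internally (a minimal sketch; the layer sizes are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn

m, num_outputs = 3, 2              # m inputs, 2 outputs (illustrative)
dense = nn.Linear(m, num_outputs)  # z_i = w_{0,i} + sum_j x_j * w_{j,i}

x = torch.randn(m)
z = dense(x)          # pre-activations z_1, z_2
y = torch.sigmoid(z)  # outputs y_i = g(z_i)
print(z, y)
```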

Single Layer Neural Network

[Diagram: inputs x_1 ... x_m connect to a hidden layer z_1 ... z_{d_1} via weights W^{(1)}, which connects to the final outputs y_1, y_2 via weights W^{(2)}]


Deep Neural Network

[Diagram: inputs x_1 ... x_m feed a stack of hidden layers; hidden layer k has units z_{k,1} ... z_{k,n_k}; the final layer produces outputs y_1, y_2]
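Stacking dense layers with non-linearities between them gives a deep network. Here is a minimal sketch with made-up layer sizes:

```python
import torch
import torch.nn as nn

# Deep network: inputs -> hidden layer 1 -> hidden layer 2 -> outputs.
# All layer sizes here are illustrative choices.
model = nn.Sequential(
    nn.Linear(2, 4),  # inputs x1, x2 -> first hidden layer
    nn.ReLU(),
    nn.Linear(4, 4),  # first hidden layer -> second hidden layer
    nn.ReLU(),
    nn.Linear(4, 2),  # second hidden layer -> outputs y1, y2
)
print(model(torch.randn(2)))
```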

Applying Neural Networks

Example problem: Will I pass this class?

Let's start with a simple two-feature model:

x_1 = number of lectures you attend
x_2 = hours spent on the final project

Example problem: Will I pass this class?

[Plot: x_1 (lectures attended) vs. x_2 (hours on final project); legend: Pass vs. Fail; a new test point at (4, 5)]

Example problem: Will I pass this class?

Passing x^{(1)} = [4, 5] through the network gives Predicted: 0.1, while the Actual label is 1.

Quantifying Loss

The loss measures how far the network's prediction is from the actual label:

L(f(x^{(i)}; W), y^{(i)})

where f(x^{(i)}; W) is the predicted value and y^{(i)} is the actual label.

Empirical Loss

The empirical loss measures the total loss over the entire dataset:

J(W) = \frac{1}{n} \sum_{i=1}^{n} L(f(x^{(i)}; W), y^{(i)})

Also known as: objective function, cost function, empirical risk.

Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1:

J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f(x^{(i)}; W) + (1 - y^{(i)}) \log(1 - f(x^{(i)}; W)) \right]

Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers:

J(W) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; W) \right)^2
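Both losses are one call in PyTorch (a sketch; the prediction and label tensors are made-up examples):

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([0.1, 0.8, 0.6])    # f(x^(i); W), probabilities in (0, 1)
actual = torch.tensor([1.0, 1.0, 0.0])  # y^(i)

# Binary cross entropy: -(1/n) * sum[y log f + (1 - y) log(1 - f)]
print(F.binary_cross_entropy(pred, actual))

# Mean squared error: (1/n) * sum (y - f)^2
print(F.mse_loss(pred, actual))
```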

Training Neural Networks

Loss Optimization

We want to find the network weights that achieve the lowest loss:

W^* = \arg\min_W \frac{1}{n} \sum_{i=1}^{n} L(f(x^{(i)}; W), y^{(i)}) = \arg\min_W J(W)

where W = \{W^{(0)}, W^{(1)}, \ldots\}.

Loss Optimization

Our loss is a function of the network weights:

1. Randomly pick an initial (w_0, w_1)
2. Compute the gradient, \frac{\partial J(W)}{\partial W}
3. Take a small step in the opposite direction of the gradient
4. Repeat until convergence

Gradient Descent

Algorithm:
1. Initialize weights randomly, W \sim N(0, \sigma^2)
2. Loop until convergence:
3.     Compute gradient, \frac{\partial J(W)}{\partial W}
4.     Update weights, W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}
5. Return weights
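The loop above in PyTorch, on a toy one-weight loss so it stays self-contained (the learning rate, the loss J, and the iteration count are illustrative choices):

```python
import torch

eta = 0.1                               # learning rate (illustrative)
w = torch.randn(1, requires_grad=True)  # 1. initialize weights randomly

for _ in range(100):                    # 2. loop (fixed count stands in for convergence)
    J = (w - 3.0) ** 2                  #    toy loss with minimum at w = 3
    J.backward()                        # 3. compute gradient dJ/dw
    with torch.no_grad():
        w -= eta * w.grad               # 4. update: w <- w - eta * dJ/dw
        w.grad.zero_()

print(w.item())                         # 5. return weights (≈ 3.0)
```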

Computing Gradients: Backpropagation

[Diagram: x → z_1 → y → J(W), with weight w_1 between x and z_1 and weight w_2 between z_1 and y]

How does a small change in one weight (e.g., w_2) affect the final loss J(W)? Let's use the chain rule:

\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial w_2}

For w_1, apply the chain rule again, going one step further back through the network:

\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial y} \cdot \frac{\partial y}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}

Repeat this for every weight in the network using gradients from later layers.
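PyTorch's autograd applies exactly this chain rule. Below is a hypothetical two-weight network (sigmoid activations and a squared-error loss are assumed for concreteness) where the manual chain-rule product matches the gradient PyTorch computes:

```python
import torch

# Tiny network from the diagram: x -> z1 -> y -> J(W)
x, target = torch.tensor(0.5), torch.tensor(1.0)
w1 = torch.tensor(0.3, requires_grad=True)
w2 = torch.tensor(-0.2, requires_grad=True)

z1 = torch.sigmoid(w1 * x)  # hidden activation (assumed sigmoid)
y = torch.sigmoid(w2 * z1)  # network output
J = (target - y) ** 2       # assumed squared-error loss
J.backward()                # backprop: applies the chain rule for us

# Manual chain rule: dJ/dw2 = dJ/dy * dy/dw2
with torch.no_grad():
    dJ_dy = -2 * (target - y)
    dy_dw2 = y * (1 - y) * z1  # sigmoid derivative times z1
    print(w2.grad.item(), (dJ_dy * dy_dw2).item())  # the two values match
```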

Neural Networks in Practice: Optimization

Training Neural Networks is Difficult

Loss Functions Can Be Difficult to Optimise

Optimization through gradient descent:

W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}

How can we set the learning rate \eta?

Setting the Learning Rate

Small learning rates converge slowly and get stuck in false local minima.

Large learning rates overshoot, become unstable, and diverge.

Stable learning rates converge smoothly and avoid local minima.

How to deal with this?

Idea 1: Try lots of different learning rates and see what works "just right".

Idea 2: Design an adaptive learning rate that adapts to the landscape.

Adaptive Learning Rates

• Learning rates are no longer fixed
• Can be made smaller or larger depending on:
  • how large the gradient is
  • how fast learning is happening
  • the size of particular weights
  • etc.

Gradient Descent Algorithms

• SGD

• Adam

• Adadelta

• Adagrad

• RMSProp
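Each of these is a one-line choice in PyTorch's torch.optim (a sketch; the wrapped model and learning-rate values are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 1)  # any model's parameters can be handed to an optimizer

sgd      = torch.optim.SGD(model.parameters(), lr=0.01)
adam     = torch.optim.Adam(model.parameters(), lr=0.001)
adadelta = torch.optim.Adadelta(model.parameters())
adagrad  = torch.optim.Adagrad(model.parameters(), lr=0.01)
rmsprop  = torch.optim.RMSprop(model.parameters(), lr=0.01)
```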

Neural Networks in Practice: Mini-Batches

Gradient Descent

Algorithm:
1. Initialize weights randomly, W \sim N(0, \sigma^2)
2. Loop until convergence:
3.     Compute gradient, \frac{\partial J(W)}{\partial W}
4.     Update weights, W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}
5. Return weights

The gradient in step 3 is taken over the entire dataset, which is computationally intensive.

Stochastic Gradient Descent

Algorithm:
1. Initialize weights randomly, W \sim N(0, \sigma^2)
2. Loop until convergence:
3.     Pick a single data point i
4.     Compute gradient, \frac{\partial J_i(W)}{\partial W}
5.     Update weights, W \leftarrow W - \eta \frac{\partial J_i(W)}{\partial W}
6. Return weights

The single-point gradient is easy to compute, but noisy.

Mini-Batch Gradient Descent

Algorithm:
1. Initialize weights randomly, W \sim N(0, \sigma^2)
2. Loop until convergence:
3.     Pick a batch of B data points
4.     Compute gradient, \frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}
5.     Update weights, W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}
6. Return weights

The batch gradient is fast to compute and a better estimate of the true gradient than a single point.
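A minimal mini-batch training loop in PyTorch (the data, model, batch size B = 32, and learning rate are all made-up illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(256, 2), torch.randn(256, 1)  # toy dataset
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(5):     # loop (fixed count stands in for convergence)
    for xb, yb in loader:  # pick a batch of B data points
        loss = F.mse_loss(model(xb), yb)
        opt.zero_grad()
        loss.backward()    # gradient averaged over the batch
        opt.step()         # W <- W - eta * dJ/dW
```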

Mini-Batches While Training

More accurate estimation of the gradient: smoother convergence, which allows for larger learning rates.

Mini-batches lead to faster training: computation can be parallelised, achieving significant speedups on GPUs.

Neural Networks in Practice: Overfitting

The Problem of Overfitting

[Figure: three fits to the same data]

Underfitting: the model does not have the capacity to fully learn the data.
Ideal fit.
Overfitting: too complex, extra parameters, does not generalise well.

Regularization

What is it? A technique that constrains our optimisation problem to discourage complex models.

Why do we need it? To improve the generalisation of our model on unseen data.

Regularization 1: Dropout

• During training, randomly set some activations to 0
• Typically drop 50% of the activations in a layer
• Forces the network to not rely on any one node

[Diagram: a network with some hidden units randomly dropped on each training step]
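In PyTorch, dropout is a layer inserted between dense layers (a sketch; the architecture is illustrative). Note that it is only active in training mode:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 8),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # randomly zeroes ~50% of activations during training
    nn.Linear(8, 1),
)

x = torch.randn(2)
model.train()            # dropout active: repeated calls give different outputs
print(model(x), model(x))
model.eval()             # dropout disabled at evaluation time
print(model(x))
```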

Regularization 2: Early Stopping

• Stop training before we have a chance to overfit.

[Plot: training and testing loss over training iterations; stop training where the testing loss begins to rise]
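A sketch of the early-stopping rule (the loss values and patience are made up; in practice the losses would come from a held-out validation set each epoch):

```python
# Stop when the testing/validation loss hasn't improved for `patience` epochs.
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60]  # illustrative values

best, patience, bad_epochs = float("inf"), 2, 0
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, bad_epochs = loss, 0  # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stop training at epoch {epoch}")
            break
```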

Deep Learning Review

The Perceptron
• Structural building blocks
• Non-linear activation functions

Neural Networks
• Stacking perceptrons to form neural networks
• Optimization through backpropagation

Training in Practice
• Adaptive learning
• Batching
• Regularisation

