On Deep Learning: An Overview
João Pereira
PDEng Data Science Talks, October 2019
Outline – Part 1
• Introduction to deep learning
• Types of learning
• Historical perspective
• Neural networks
• Training neural networks
• Optimization challenges
• Regularization strategies
• Generalization
Outline – Part 2
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
  • Vanilla RNNs
  • LSTMs and GRUs
• Sequence to Sequence Models (Seq2Seq)
• Attention Mechanisms
• Applications
Part 1
Big Picture
[Diagram: nested fields: Deep Learning ⊂ Neural Networks ⊂ Machine Learning ⊂ Artificial Intelligence]
Learning from data
Inputs and Outputs
• Inputs
  • Words, sentences
  • Images, videos
  • Observations, time series
  • Voice
• Outputs
  • Translation
  • Class label
What is learning?
• Learning or training refers to estimating the parameters of a model.
[Diagram: dataset split into Training, Validation, and Test sets]
Machine learning is about generalizing to unseen data.
What data is out there?
• Unlabeled data: abundant and accessible at low cost.
• Labeled data: scarce and expensive.
Data in the PDEng Data Science
Types of learning
• Supervised learning
• Unsupervised learning
• Reinforcement learning
LeCun's Cake Analogy
If intelligence is a cake, unsupervised learning is the bulk of the cake, supervised learning is the icing, and reinforcement learning is the cherry on top (LeCun, NIPS 2016).
Supervised Learning
• Given a dataset D of inputs x and labeled targets y, learn to predict y from x.
• Most successful paradigm in deep learning.
[Figure: a model classifying images of dogs and cats as Dog / Cat]
Unsupervised Learning
• Given only the inputs x (no labeled targets), learn the underlying structure of the data: patterns, structure, clusters, generation.
Why Did Deep Learning Become So Popular?
Important breakthroughs in AI-class problems:
• Speech recognition
• Image classification
• Machine translation
• Chatbots
• Self-driving cars
• Playing games
Why Now?
Among other reasons:
• Fifty years of AI research in machine learning
• Explosion of convenient compute power (e.g., GPUs, TPUs)
• Lots of (labeled) data
• New software and tools
• Strong investments and interest from large organizations
Deep Learning Heroes (Turing Award winners)
• Yann LeCun, New York University / Facebook AI Research
• Geoffrey Hinton, University of Toronto / Google Brain
• Yoshua Bengio, Université de Montréal
Why Deep Learning?
Traditional Machine Learning: Input → Feature Extraction → Classifier → Output (Duck / Not a Duck)
Deep Learning: Input → Deep Neural Network (Feature Extraction + Classification) → Output (Duck / Not a Duck)
Why Deep Learning?
• Hand-crafting features is time-consuming and does not scale.
• We can learn the underlying features from data, automatically.
[Figure: feature visualizations of a deep neural network, layer by layer. Zeiler & Fergus, 2013; example by Yann LeCun.]
Deep learning is about representation learning: learning good features for downstream tasks (regression, classification, …) instead of hand-crafting them.
Historical Perspective
• 1957: Rosenblatt's Perceptron
• 1969: Minsky & Papert; the AI Winter begins
• 1986: Backpropagation
• 1989: CNNs
• 1995: SVMs
• 1997: LSTMs
• 2006: AEs
• 2012: ImageNet contest
• 2014: VAEs, GANs
Perceptron: the structural building block of deep learning
The Perceptron
[Diagram: Inputs → Weights → Sum → Non-Linearity → Output]
\hat{y} = g\left(\sum_i w_i x_i + b\right), where \hat{y} is the output prediction, g the non-linearity, b the bias, x_i the inputs, and w_i the weights.
Rosenblatt, 1958
The Perceptron
In matrix form: \hat{y} = g(\mathbf{w}^\top \mathbf{x} + b)
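A minimal NumPy sketch of this forward pass (assuming a sigmoid non-linearity g; the numbers are arbitrary):

import numpy as np

def perceptron(x, w, b):
    # Weighted sum of the inputs plus bias, passed through the non-linearity g
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))  # g = sigmoid

x = np.array([0.5, -1.2, 3.0])  # inputs
w = np.array([0.1, 0.4, -0.2])  # weights
b = 0.3                         # bias
y_hat = perceptron(x, w, b)     # output prediction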
Activation Functions
• Sigmoid
• Hyperbolic Tangent
• Rectified Linear Unit
• Softmax
Activation Functions
Recap: in the perceptron, the non-linearity g is the activation function.
Activation Functions
• The goal of activation functions is to introduce non-linearities into the network.
• There are many such functions; they are all non-linear.
Activation Functions
• Sigmoid: \sigma(x) = 1 / (1 + e^{-x}); tf.keras.activations.sigmoid(x)
• Hyperbolic Tangent: \tanh(x); tf.keras.activations.tanh(x)
• Rectified Linear Unit (ReLU): \max(0, x); tf.keras.activations.relu(x)
Activation Functions: Softmax
The softmax turns the network's output into a probability distribution over the target classes (e.g., Cat vs. Dog):
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
tf.keras.activations.softmax(x)
Building Neural Networks with Perceptrons
Single-Layer Neural Network
[Diagram: Input Layer → Hidden Layer → Output Layer]
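A minimal Keras sketch of such a single-hidden-layer network (the layer sizes and input dimension are arbitrary choices for illustration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),                   # output layer
])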
Properties
Neural networks are universal function approximators
Universal Approximation Theorem
A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of \mathbb{R}^n, under mild assumptions on the activation function.
Cybenko, 1989
In a nutshell: they can approximate many functions up to any precision.
Training Neural Networks
• Loss functions
• Gradient descent
• Stochastic gradient descent
Example: Wine Quality
[Scatter plot: wines plotted by volatile acidity vs. alcohol, labeled Good / Not Good]
Example: Wine Quality
• How to quantify prediction error?
[Figure: model predictions vs. targets, Good (1) / Not Good (0)]
Loss
The loss measures the cost due to wrong predictions.
Loss
The empirical loss averages the per-example losses over the dataset:
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta),\, y_i\big)
Also known as: objective function, cost function, empirical risk.
Binary Cross Entropy Loss
• Used in classification problems with models that output a probability between 0 and 1.
\ell(\hat{y}, y) = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]
tf.keras.losses.BinaryCrossentropy()
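A small usage sketch (the targets and predicted probabilities are arbitrary):

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
loss = bce([0.0, 1.0, 1.0], [0.1, 0.8, 0.6])  # (y_true, y_pred), returns the mean loss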
Mean Squared Error Loss
• Used in regression problems with models that output continuous real numbers.
\ell(\hat{y}, y) = (y - \hat{y})^2, averaged over the dataset
tf.keras.losses.MeanSquaredError()
[Plot: predicted vs. target wine quality [%]]
Unsupervised Deep Learning
Autoencoders
• The most widely used model for representation learning.
[Diagram: input → Encoder → latent space → Decoder → reconstruction]
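A minimal Keras autoencoder sketch (the 784-dimensional input and 32-dimensional latent space are arbitrary assumptions):

import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
z = tf.keras.layers.Dense(32, activation="relu")(inputs)        # encoder: compress to the latent space
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(z)   # decoder: reconstruct the input
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")               # trained to reconstruct its own input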
Autoencoders
Reducing the Dimensionality of Data with Neural Networks, Hinton & Salakhutdinov, 2006
An autoencoder with linear activations does principal component analysis!
Unsupervised Representation Learning
Tip #1: Always “look” at your data before designing your model!
Based on Marc'Aurelio Ranzato's (Facebook AI Research) tutorial on Unsupervised Deep Learning, NeurIPS'18.
• Mean & covariance analysis
• PCA (check the eigenvalue decay)
• t-SNE visualization
Learned features may be useful for downstream tasks (e.g., classification, regression, …): this is representation learning using unsupervised learning.
Unsupervised Representation Learning
Tip #2: PCA and K-Means are very often a strong baseline.
Coffee Break: 16:14 + 15 min
Optimizing the Loss
We want to find the neural network weights W that achieve the lowest loss over the training examples:
W^* = \arg\min_W \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; W),\, y_i\big)
or simply W^* = \arg\min_W J(W).
Based on the first lecture of MIT 6.S191 by Alexander Amini.
[Figure sequence: the loss landscape J(W) over the weights]
1. Random initialization.
2. Compute the gradient \partial J(W)/\partial W.
3. Take a small step in the opposite direction of the gradient.
4. Repeat.
"It's like walking in the mountains in a fog and following the direction of steepest descent to reach the village in the valley." (Yann LeCun)
Gradient Descent
This procedure is called gradient descent:
1. Initialize the weights randomly.
2. Repeat until convergence:
   2.1. Compute the gradient \partial J(W)/\partial W.
   2.2. Update the weights: W \leftarrow W - \eta\, \partial J(W)/\partial W.
3. Return the weights.
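A minimal NumPy sketch of this loop (grad_J is a hypothetical function returning the gradient of the loss at W; a fixed number of steps stands in for the convergence check):

import numpy as np

def gradient_descent(grad_J, W0, lr=0.01, n_steps=1000):
    W = W0.copy()
    for _ in range(n_steps):       # "repeat until convergence"
        W = W - lr * grad_J(W)     # step in the opposite direction of the gradient
    return W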
Stochastic Gradient Descent
1. Initialize the weights randomly.
2. Repeat until convergence:
   2.1. Pick a single datapoint at random.
   2.2. Compute the gradient of the loss on that datapoint.
   2.3. Update the weights.
3. Return the weights.
tf.keras.optimizers.SGD(learning_rate=0.01)
Stochastic Gradient Descent
1. Initialize the weights randomly.
2. Repeat until convergence:
   2.1. Choose a mini-batch of M samples at random.
   2.2. Compute the gradient averaged over the mini-batch.
   2.3. Update the weights.
3. Return the weights.
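In Keras, mini-batch SGD is what model.fit runs under the hood; a sketch, assuming a Keras model and training arrays x_train and y_train (both hypothetical):

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.BinaryCrossentropy())
model.fit(x_train, y_train, batch_size=32, epochs=10)  # mini-batches of M = 32 samples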
Stochastic Gradient Descent
Many variants of SGD:
• Adam: tf.keras.optimizers.Adam()
• AdaGrad: tf.keras.optimizers.Adagrad()
• AdaDelta: tf.keras.optimizers.Adadelta()
• RMSProp: tf.keras.optimizers.RMSprop()
Mini-batch training is faster (parallel computation on GPUs).
Backpropagation
How do we compute the gradients \partial J(W)/\partial W?
• Backpropagation employs the chain rule to compute them.
• There are tools that compute these gradients automatically (automatic differentiation).
• Check Hinton's course for details.
• A very good interactive explanation of backpropagation:
https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/
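In TensorFlow, automatic differentiation is exposed through tf.GradientTape; a toy sketch:

import tensorflow as tf

w = tf.Variable(2.0)
with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2          # toy loss J(w) = (w - 1)^2
grad = tape.gradient(loss, w)      # dJ/dw = 2(w - 1) = 2.0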
In real-life deep learning...
Training neural networks is hard!
Learning Rate
How to choose the learning rate \eta?
• Small learning rates make learning slow and can get stuck in local minima.
• Large learning rates overshoot, become unstable, and may diverge.
Tips:
• Try many learning rates and choose the best.
• Use an adaptive learning rate that adapts to how large the gradient is and how fast learning occurs.
Optimization Challenges
• Overfitting and underfitting
• Bias-variance trade-off
Bias-Variance Trade-off
[Plot: error vs. model complexity; bias² decreases and variance increases with complexity; the test error is U-shaped (underfitting on the left, overfitting on the right) while the train error keeps decreasing.]
Regularization
• Dropout
• Early stopping
Why do we need regularization?
• To make our models generalize better to unseen data.
• We are going to look at two strategies:
  • Dropout
  • Early stopping
Dropout
Randomly set activations to zero during training, typically around 50% of them.
tf.keras.layers.Dropout(rate)
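For example, inside a model (a sketch; the 0.5 rate corresponds to the "around 50%" above):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                    # randomly zero half of the activations (training only)
    tf.keras.layers.Dense(1, activation="sigmoid"),
])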
Early Stopping
• Stop training early, before the model starts to overfit.
[Plot: train and test loss vs. training steps; the test loss starts rising (overfitting) while the train loss keeps falling; stop at the minimum of the test loss.]
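In Keras this is available as a callback; a sketch, assuming the model and data from before (monitoring a validation split stands in for the test loss in the plot):

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])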
Generalization in Deep Learning
Understanding Deep Learning Requires Rethinking Generalization, Zhang et al., 2017
Large models overfit very little ???
Reconciling modern machine-learning practice and the classical bias–variance trade-off, Belkin et al., 2019
The bias-variance trade-off was just the first episode of the series!
Bias-Variance Trade-off
[Plot: training and test error vs. number of parameters; past the classical overfitting regime, the test error decreases again for very large models (Belkin et al., 2019).]
Part 2
Convolutional Neural Networks
Recurrent Neural Networks
Motivation
• In feed-forward neural networks we assume data is independent and identically distributed in space or in time.
• In many applications, data has spatial or temporal dependencies.
• Spatial dependencies → Convolutional Neural Networks
• Temporal dependencies → Recurrent Neural Networks
Convolutional Neural Networks
Images
• An image is a matrix of numbers.
[Figure: a grayscale image as a 200 × 200 matrix of pixel intensities, with values between 0 and 255; 200 × 200 = 40k pixels.]
[Figure: a color image as three stacked matrices, one per channel (R, G, B); 200 × 200 × 3 = 120k values.]
Tasks in Computer Vision
• Regression: the output is a continuous value.
• Classification: the output is a class.
Applications: classification, segmentation, localization, detection.
Examples
Slide from Stanford CS231n, 2017, Lecture 11.
Why do we need convolutional neural networks?
Fully connected layer: connecting each of 40k hidden units to all 200 × 200 = 40k pixels requires about 1.6 billion parameters (40k × 40k).
Locally connected layer: connecting each hidden unit only to a 10 × 10 patch of the input requires about 4 million parameters (40k × 100).
Convolutional Layer Idea
• Spatially share the same parameters across different locations.
• Use multiple filters to extract different kinds of features.
• Intuition: important features can happen anywhere in the image.
What is a convolution?
Input image (6 × 6):
2 4 3 5 6 8
3 8 5 7 5 1
1 2 1 5 6 7
2 4 6 2 3 4
3 5 7 2 5 6
6 8 9 7 9 8
⊙ filter (kernel / feature detector, 3 × 3):
1 0 1
0 0 1
1 0 1
= feature map (activation map / convolved feature)
Sliding the filter over the top-left 3 × 3 patch of the image:
2·1 + 4·0 + 3·1 + 3·0 + 8·0 + 5·1 + 1·1 + 2·0 + 1·1 = 12
so the first entry of the feature map is 12. Sliding the filter across the whole image fills in the rest of the feature map.
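A minimal NumPy sketch of this operation (a valid cross-correlation, which is what CNNs compute), reproducing the example above:

import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of the filter with the current patch, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[2, 4, 3, 5, 6, 8],
                  [3, 8, 5, 7, 5, 1],
                  [1, 2, 1, 5, 6, 7],
                  [2, 4, 6, 2, 3, 4],
                  [3, 5, 7, 2, 5, 6],
                  [6, 8, 9, 7, 9, 8]])
kernel = np.array([[1, 0, 1], [0, 0, 1], [1, 0, 1]])
print(conv2d(image, kernel)[0, 0])  # 12.0, as computed above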
Convolutional Layer
• The weights of the filters are learned during training.
• We need to define:
  • Number of filters
  • Filter dimensions
  • Padding
  • Stride
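In Keras these choices map directly onto the layer's arguments (the values here are arbitrary):

tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3),
                       padding="same", strides=1, activation="relu")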
Padding
• Used to avoid losing pixels around the border of the image.
Example, padding the input with zeros before convolving:
0 0 0 0 0
0 0 1 2 0
0 3 4 5 0
0 6 7 8 0
0 0 0 0 0
⊙
0 1
2 3
=
0 3 8 4
9 19 25 10
21 37 43 16
6 7 8 0
Stride
• The stride is the number of rows and columns traversed per step of the filter.
Example, vertical stride 3 and horizontal stride 2 on the same padded input:
0 0 0 0 0
0 0 1 2 0
0 3 4 5 0
0 6 7 8 0
0 0 0 0 0
⊙
0 1
2 3
=
0 8
6 8
In a nutshell...
Feed-forward neural networks applied to images:
• Scale quadratically with the input size.
Solution: use a convolutional layer.
• Connect each hidden unit to a small patch of the input.
• Share weights across space.
A network built with this structure is called a Convolutional Neural Network (CNN).
LeCun et al., 1998, Gradient-based learning applied to document recognition. Based on Marc'Aurelio Ranzato's slides, Facebook AI Research, Image Classification with Deep Learning.
Non-Linearity
• Introduce non-linearities after convolutional layers.
• The rectified linear unit (ReLU) works very well: it replaces negative values by zero.
tf.keras.activations.relu(x)
Pooling Layer
• Pooling promotes robustness to the specific location of the features.
Example: max pooling (other options: average pooling, L2 pooling, …)
Input (6 × 6):
2 4 3 5 6 3
3 8 5 7 5 1
1 2 1 5 6 7
2 4 6 2 3 4
3 5 2 2 5 6
6 4 5 7 9 8
Max pooling with 3 × 3 filters and stride 3:
8 7
5 9
The dimension is reduced from 6 × 6 to 2 × 2.
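The Keras counterpart of this example (pool size and stride as above):

tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=3)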
Representation Learning with CNNs
[Figure: Conv. Layer 1 learns low-level features, Conv. Layer 2 medium-level features, Conv. Layer 3 high-level features.]
Progress in CNN Architectures
• AlexNet, 2012 (Krizhevsky, Sutskever, and Hinton)
• VGG, 2014 (Simonyan and Zisserman)
• GoogLeNet, 2014 (Szegedy et al.)
• ResNet, 2015 (He et al.)
Importance of Depth
(Link)
Deep learning is in production
Coffee Break: 16:16 + 15 min
Recurrent Neural Networks
Why do we need recurrent neural networks?
In many applications, data has a sequential nature:
• Sensor time series
• Natural speech
• Words in sentences
• DNA
• Videos
What if data is not i.i.d. in time?
Example: "My name is João"
Recap
• In feed-forward neural networks, the output \hat{y} = f(x) depends only on the current input: there is no memory of previous inputs.
Recurrent Neural Network
[Diagram: a feed-forward neural network vs. a recurrent neural network with a feedback connection]
Bengio et al., 1994
Recurrent Neural Network
RNNs capture the temporal dependencies of the data.
• Real-valued hidden state h_t
• Feedback connection
• Parameters shared across timesteps
Update: h_t = g(W_{hh} h_{t-1} + W_{xh} x_t + b), where g is typically a sigmoid or tanh.
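A minimal NumPy sketch of one step of this update (with g = tanh; all dimensions are arbitrary):

import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, b):
    # h_t = g(W_hh h_{t-1} + W_xh x_t + b)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)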
RNN Unrolled
Bengio et al., 1994
Notes
• The final hidden state is a vector representation of the entire sequence.
• RNNs summarize sequences into fixed-length vectors.
• Input sequences can have arbitrary length.
• The final hidden state can be used as a feature vector for classification or regression tasks.
Long Short-Term Memory Networks
Hochreiter & Schmidhuber, 1997
• Proposed to solve the vanishing gradient problem.
• Adds a memory cell and three gates (input, forget, and output).
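In Keras (a sketch; 64 units is an arbitrary choice):

tf.keras.layers.LSTM(64)   # and tf.keras.layers.GRU(64) for the GRU variant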
Sequence to Sequence Models (Seq2Seq)
Sequence to Sequence Models
• In some applications, mapping sequences to sequences is useful. Example: machine translation.
• A sequence to sequence model is an encoder-decoder model.
"My name is João" → "Mijn naam is João"
[Diagram: the Encoder reads "My name is João" into a learned representation; the Decoder generates "Mijn naam is João" from it.]
Attention
Why do we need attention?
Problem: the fixed-size representation z is the bottleneck of seq2seq.
Question: Is there a better way to pass the information from the encoder to the decoder?
Solution: the attention mechanism.
Bahdanau, Cho, and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate.
Luong, Pham, and Manning, Effective Approaches to Attention-based Neural Machine Translation.
Attention Mechanism
[Diagram: attention weights over the encoder states of "My name is João" are combined into a context vector.]
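In one common formulation (Bahdanau-style; the notation here is an assumption, not from the slides), the context vector c_t at decoding step t is a weighted combination of the encoder hidden states h_i:

c_t = \sum_i \alpha_{t,i} h_i, \quad \alpha_{t,i} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_j \exp(\mathrm{score}(s_{t-1}, h_j))}

where s_{t-1} is the previous decoder state and score is a learned alignment function.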
Sequence to Sequence + Attention
[Diagram: the encoder processes "My name is João"; at each decoding step a context vector, computed with attention over the encoder states, feeds the decoder, which generates "Mijn naam is João".]
Application Examples
Google's Neural Machine Translation System
Question Answering
What was missing?
• Probabilistic deep learning models
• Deep generative models: variational autoencoders and generative adversarial networks
References – Part 1
Papers/Books
• A Few Useful Things to Know About Machine Learning, Domingos, 2012 (Link)
• Deep Learning, Ian Goodfellow et al. (Link)
Tutorials
• TensorFlow and Deep Learning Without a PhD (Link)
• Unsupervised Deep Learning, NeurIPS'18 (Link)
References – Part 1
Courses
• Neural Networks for Machine Learning, Geoffrey Hinton (Link)
• EE-559 Deep Learning, EPFL (Link)
• 6.S191 Deep Learning, MIT (Link)
• CS230 Deep Learning, Stanford, Andrew Ng (Link)
• Deep Learning, NYU, Yann LeCun (Link)
CNN References – Part 2
Courses/Tutorials
• Deep Learning for Computer Vision, MIT 6.S191 (Link)
• Convolutional Neural Networks Lecture, Stanford (Link)
• Image Classification with Deep Learning (Link)
• Deep Learning and the Future of AI, Yann LeCun (Link)
• Deep dive into deep learning (Link)
Video
• Deep Learning and the Future of Artificial Intelligence (Link)
• How does the brain learn so much so quickly? (Link)
RNN References – Part 2
Courses/Tutorials
• LxMLS slides, Chris Dyer, DeepMind (Link)
• Deep Sequence Modeling, MIT 6.S191 (Link)
Video
• DLRLSS, Recurrent Neural Networks, Yoshua Bengio (Link)
Thank you for your attention!