On Deep Learning: An Overview
João Pereira
PDEng Data Science Talks, October 2019
Outline – Part 1
• Introduction to deep learning
• Types of learning
• Historical perspective
• Neural networks
• Training neural networks
• Optimization challenges
• Regularization strategies
• Generalization
Outline – Part 2
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
  • Vanilla RNNs
  • LSTMs and GRUs
• Sequence to Sequence Models (Seq2Seq)
• Attention Mechanisms
• Applications
Part 1
Big Picture
[Diagram: nested fields: Deep Learning ⊂ Neural Networks ⊂ Machine Learning ⊂ Artificial Intelligence]
Learning from data
Inputs and Outputs
• Inputs
  • Words, sentences
  • Images, videos
  • Observations, time series
  • Voice
• Outputs
  • Translation
  • Class label
What is learning?
• Learning or training refers to estimating the parameters of a model.
[Diagram: dataset split into Training, Validation, and Test sets]
Machine learning is about generalizing to unseen data.
What data is out there?
• Unlabeled data: abundant and accessible at low cost.
• Labeled data: scarce and expensive.
Data in the PDEng Data Science
Types of learning
• Supervised learning
• Unsupervised learning
• Reinforcement learning
LeCun's Cake Analogy
If intelligence is a cake, unsupervised learning is the bulk of the cake, supervised learning is the icing, and reinforcement learning is the cherry on top (LeCun, NIPS 2016).
Supervised Learning
• Given a dataset D of inputs x and labeled targets y, learn to predict y from x.
• Most successful paradigm in deep learning.
[Figure: a model classifying images of dogs and cats as Dog / Cat]
Unsupervised Learning
• Given only the inputs x (no labeled targets), learn the underlying structure of the data: patterns, structure, clusters, generation.
Why Did Deep Learning Become So Popular?
Important breakthroughs in AI-class problems:
• Speech recognition
• Image classification
• Machine translation
• Chatbots
• Self-driving cars
• Playing games
Why Now?
Among other reasons:
• Fifty years of AI research in machine learning
• Explosion of convenient compute power (e.g., GPUs, TPUs)
• Lots of (labeled) data
• New software and tools
• Strong investments and interest from large organizations
Deep Learning Heroes (Turing Award winners)
• Yann LeCun, New York University / Facebook AI Research
• Geoffrey Hinton, University of Toronto / Google Brain
• Yoshua Bengio, Université de Montréal
Why Deep Learning?
Traditional Machine Learning: Input → Feature Extraction → Classifier → Output (Duck / Not a Duck)
Deep Learning: Input → Deep Neural Network (Feature Extraction + Classification) → Output (Duck / Not a Duck)
Why Deep Learning?
• Hand-crafting features is time-consuming and does not scale.
• We can learn the underlying features from data, automatically.
[Figure: feature visualizations of a deep neural network, layer by layer. Zeiler & Fergus, 2013; example by Yann LeCun.]
Deep learning is about representation learning: learning good features for downstream tasks (regression, classification, …) instead of hand-crafting them.
Historical Perspective
• 1957: Rosenblatt's Perceptron
• 1969: Minsky & Papert; the AI Winter begins
• 1986: Backpropagation
• 1989: CNNs
• 1995: SVMs
• 1997: LSTMs
• 2006: AEs
• 2012: ImageNet contest
• 2014: VAEs, GANs
Perceptron: the structural building block of deep learning
The Perceptron
[Diagram: Inputs → Weights → Sum → Non-Linearity → Output]
\hat{y} = g\left(\sum_i w_i x_i + b\right), where \hat{y} is the output prediction, g the non-linearity, b the bias, x_i the inputs, and w_i the weights.
Rosenblatt, 1958
The Perceptron
In matrix form: \hat{y} = g(\mathbf{w}^\top \mathbf{x} + b)
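A minimal NumPy sketch of this forward pass (assuming a sigmoid non-linearity g; the numbers are arbitrary):

import numpy as np

def perceptron(x, w, b):
    # Weighted sum of the inputs plus bias, passed through the non-linearity g
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))  # g = sigmoid

x = np.array([0.5, -1.2, 3.0])  # inputs
w = np.array([0.1, 0.4, -0.2])  # weights
b = 0.3                         # bias
y_hat = perceptron(x, w, b)     # output prediction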
Activation Functions
• Sigmoid
• Hyperbolic Tangent
• Rectified Linear Unit
• Softmax
Activation Functions
Recap: in the perceptron, the non-linearity g is the activation function.
Activation Functions
• The goal of activation functions is to introduce non-linearities into the network.
• There are many such functions; they are all non-linear.
Activation Functions
• Sigmoid: \sigma(x) = 1 / (1 + e^{-x}); tf.keras.activations.sigmoid(x)
• Hyperbolic Tangent: \tanh(x); tf.keras.activations.tanh(x)
• Rectified Linear Unit (ReLU): \max(0, x); tf.keras.activations.relu(x)
Activation Functions: Softmax
The softmax turns the network's output into a probability distribution over the target classes (e.g., Cat vs. Dog):
\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
tf.keras.activations.softmax(x)
Building Neural Networks with Perceptrons
Single-Layer Neural Network
[Diagram: Input Layer → Hidden Layer → Output Layer]
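A minimal Keras sketch of such a single-hidden-layer network (the layer sizes and input dimension are arbitrary choices for illustration):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),  # hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),                   # output layer
])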
Properties
Neural networks are universal function approximators
Universal Approximation Theorem
A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of \mathbb{R}^n, under mild assumptions on the activation function.
Cybenko, 1989
In a nutshell: they can approximate many functions up to any precision.
Training Neural Networks
• Loss functions
• Gradient descent
• Stochastic gradient descent
Example: Wine Quality
[Scatter plot: wines plotted by volatile acidity vs. alcohol, labeled Good / Not Good]
Example: Wine Quality
• How to quantify prediction error?
[Figure: model predictions vs. targets, Good (1) / Not Good (0)]
Loss
The loss measures the cost due to wrong predictions.
Loss
The empirical loss averages the per-example losses over the dataset:
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; \theta),\, y_i\big)
Also known as: objective function, cost function, empirical risk.
Binary Cross Entropy Loss
• Used in classification problems with models that output a probability between 0 and 1.
\ell(\hat{y}, y) = -\big[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\big]
tf.keras.losses.BinaryCrossentropy()
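A small usage sketch (the targets and predicted probabilities are arbitrary):

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
loss = bce([0.0, 1.0, 1.0], [0.1, 0.8, 0.6])  # (y_true, y_pred), returns the mean loss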
Mean Squared Error Loss
• Used in regression problems with models that output continuous real numbers.
\ell(\hat{y}, y) = (y - \hat{y})^2, averaged over the dataset
tf.keras.losses.MeanSquaredError()
[Plot: predicted vs. target wine quality [%]]
Unsupervised Deep Learning
Autoencoders
• The most widely used model for representation learning.
[Diagram: input → Encoder → latent space → Decoder → reconstruction]
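A minimal Keras autoencoder sketch (the 784-dimensional input and 32-dimensional latent space are arbitrary assumptions):

import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
z = tf.keras.layers.Dense(32, activation="relu")(inputs)        # encoder: compress to the latent space
outputs = tf.keras.layers.Dense(784, activation="sigmoid")(z)   # decoder: reconstruct the input
autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")               # trained to reconstruct its own input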
Autoencoders
Reducing the Dimensionality of Data with Neural Networks, Hinton & Salakhutdinov, 2006
An autoencoder with linear activations does principal component analysis!
Unsupervised Representation Learning
Tip #1: Always “look” at your data before designing your model!
Based on Marc'Aurelio Ranzato's (Facebook AI Research) tutorial on Unsupervised Deep Learning, NeurIPS'18.
• Mean & covariance analysis
• PCA (check the eigenvalue decay)
• t-SNE visualization
Learned features may be useful for downstream tasks (e.g., classification, regression, …): this is representation learning using unsupervised learning.
Unsupervised Representation Learning
Tip #2: PCA and K-Means are very often a strong baseline.
Coffee Break: 16:14 + 15 min
Optimizing the Loss
We want to find the neural network weights W that achieve the lowest loss over the training examples:
W^* = \arg\min_W \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i; W),\, y_i\big)
or simply W^* = \arg\min_W J(W).
Based on the first lecture of MIT 6.S191 by Alexander Amini.
[Figure sequence: the loss landscape J(W) over the weights]
1. Random initialization.
2. Compute the gradient \partial J(W)/\partial W.
3. Take a small step in the opposite direction of the gradient.
4. Repeat.
"It's like walking in the mountains in a fog and following the direction of steepest descent to reach the village in the valley." (Yann LeCun)
Gradient Descent
This procedure is called gradient descent:
1. Initialize the weights randomly.
2. Repeat until convergence:
   2.1. Compute the gradient \partial J(W)/\partial W.
   2.2. Update the weights: W \leftarrow W - \eta\, \partial J(W)/\partial W.
3. Return the weights.
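A minimal NumPy sketch of this loop (grad_J is a hypothetical function returning the gradient of the loss at W; a fixed number of steps stands in for the convergence check):

import numpy as np

def gradient_descent(grad_J, W0, lr=0.01, n_steps=1000):
    W = W0.copy()
    for _ in range(n_steps):       # "repeat until convergence"
        W = W - lr * grad_J(W)     # step in the opposite direction of the gradient
    return W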
Stochastic Gradient Descent
1. Initialize the weights randomly.
2. Repeat until convergence:
   2.1. Pick a single datapoint at random.
   2.2. Compute the gradient of the loss on that datapoint.
   2.3. Update the weights.
3. Return the weights.
tf.keras.optimizers.SGD(learning_rate=0.01)
Stochastic Gradient Descent
1. Initialize the weights randomly.
2. Repeat until convergence:
   2.1. Choose a mini-batch of M samples at random.
   2.2. Compute the gradient averaged over the mini-batch.
   2.3. Update the weights.
3. Return the weights.
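In Keras, mini-batch SGD is what model.fit runs under the hood; a sketch, assuming a Keras model and training arrays x_train and y_train (both hypothetical):

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss=tf.keras.losses.BinaryCrossentropy())
model.fit(x_train, y_train, batch_size=32, epochs=10)  # mini-batches of M = 32 samples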
Stochastic Gradient Descent
Many variants of SGD:
• Adam: tf.keras.optimizers.Adam()
• AdaGrad: tf.keras.optimizers.Adagrad()
• AdaDelta: tf.keras.optimizers.Adadelta()
• RMSProp: tf.keras.optimizers.RMSprop()
Mini-batch training is faster (parallel computation on GPUs).
Backpropagation
How do we compute the gradients \partial J(W)/\partial W?
• Backpropagation employs the chain rule to compute them.
• There are tools that compute these gradients automatically (automatic differentiation).
• Check Hinton's course for details.
• A very good interactive explanation of backpropagation:
https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/
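In TensorFlow, automatic differentiation is exposed through tf.GradientTape; a toy sketch:

import tensorflow as tf

w = tf.Variable(2.0)
with tf.GradientTape() as tape:
    loss = (w - 1.0) ** 2          # toy loss J(w) = (w - 1)^2
grad = tape.gradient(loss, w)      # dJ/dw = 2(w - 1) = 2.0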
In real-life deep learning...
Training neural networks is hard!
Learning Rate
How to choose the learning rate \eta?
• Small learning rates make learning slow and can get stuck in local minima.
• Large learning rates overshoot, become unstable, and may diverge.
Tips:
• Try many learning rates and choose the best.
• Use an adaptive learning rate that adapts to how large the gradient is and how fast learning occurs.
Optimization Challenges
• Overfitting and underfitting
• Bias-variance trade-off
Bias-Variance Trade-off
[Plot: error vs. model complexity; bias² decreases and variance increases with complexity; the test error is U-shaped (underfitting on the left, overfitting on the right) while the train error keeps decreasing.]
Regularization
• Dropout
• Early stopping
Why do we need regularization?
• To make our models generalize better to unseen data.
• We are going to look at two strategies:
  • Dropout
  • Early stopping
Dropout
Randomly set activations to zero during training, typically around 50% of them.
tf.keras.layers.Dropout(rate)
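For example, inside a model (a sketch; the 0.5 rate corresponds to the "around 50%" above):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),                    # randomly zero half of the activations (training only)
    tf.keras.layers.Dense(1, activation="sigmoid"),
])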
Early Stopping
• Stop training early, before the model starts to overfit.
[Plot: train and test loss vs. training steps; the test loss starts rising (overfitting) while the train loss keeps falling; stop at the minimum of the test loss.]
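In Keras this is available as a callback; a sketch, assuming the model and data from before (monitoring a validation split stands in for the test loss in the plot):

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])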
Generalization in Deep Learning
Understanding Deep Learning Requires Rethinking Generalization, Zhang et al., 2017
Large models overfit very little ???
Reconciling modern machine-learning practice and the classical bias–variance trade-off, Belkin et al., 2019
The bias-variance trade-off was just the first episode of the series!
Bias-Variance Trade-off
[Plot: training and test error vs. number of parameters; past the classical overfitting regime, the test error decreases again for very large models (Belkin et al., 2019).]
Part 2
Convolutional Neural Networks
Recurrent Neural Networks
Motivation
• In feed-forward neural networks we assume data is independent and identically distributed in space or in time.
• In many applications, data has spatial or temporal dependencies.
• Spatial dependencies → Convolutional Neural Networks
• Temporal dependencies → Recurrent Neural Networks
Convolutional Neural Networks
Images
• An image is a matrix of numbers.
[Figure: a grayscale image as a 200 × 200 matrix of pixel intensities, with values between 0 and 255; 200 × 200 = 40k pixels.]
[Figure: a color image as three stacked matrices, one per channel (R, G, B); 200 × 200 × 3 = 120k values.]
Tasks in Computer Vision
• Regression: the output is a continuous value.
• Classification: the output is a class.
Applications: classification, segmentation, localization, detection.
Examples
Slide from Stanford CS231n, 2017, Lecture 11.
Why do we need convolutional neural networks?
Fully connected layer: connecting each of 40k hidden units to all 200 × 200 = 40k pixels requires about 1.6 billion parameters (40k × 40k).
Locally connected layer: connecting each hidden unit only to a 10 × 10 patch of the input requires about 4 million parameters (40k × 100).
Convolutional Layer Idea
• Spatially share the same parameters across different locations.
• Use multiple filters to extract different kinds of features.
• Intuition: important features can happen anywhere in the image.
What is a convolution?
Input image (6 × 6):
2 4 3 5 6 8
3 8 5 7 5 1
1 2 1 5 6 7
2 4 6 2 3 4
3 5 7 2 5 6
6 8 9 7 9 8
⊙ filter (kernel / feature detector, 3 × 3):
1 0 1
0 0 1
1 0 1
= feature map (activation map / convolved feature)
Sliding the filter over the top-left 3 × 3 patch of the image:
2·1 + 4·0 + 3·1 + 3·0 + 8·0 + 5·1 + 1·1 + 2·0 + 1·1 = 12
so the first entry of the feature map is 12. Sliding the filter across the whole image fills in the rest of the feature map.
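A minimal NumPy sketch of this operation (a valid cross-correlation, which is what CNNs compute), reproducing the example above:

import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise product of the filter with the current patch, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.array([[2, 4, 3, 5, 6, 8],
                  [3, 8, 5, 7, 5, 1],
                  [1, 2, 1, 5, 6, 7],
                  [2, 4, 6, 2, 3, 4],
                  [3, 5, 7, 2, 5, 6],
                  [6, 8, 9, 7, 9, 8]])
kernel = np.array([[1, 0, 1], [0, 0, 1], [1, 0, 1]])
print(conv2d(image, kernel)[0, 0])  # 12.0, as computed above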
Convolutional Layer
• The weights of the filters are learned during training.
• We need to define:
  • Number of filters
  • Filter dimensions
  • Padding
  • Stride
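In Keras these choices map directly onto the layer's arguments (the values here are arbitrary):

tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3),
                       padding="same", strides=1, activation="relu")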
Padding
• Used to avoid losing pixels around the border of the image.
Example, padding the input with zeros before convolving:
0 0 0 0 0
0 0 1 2 0
0 3 4 5 0
0 6 7 8 0
0 0 0 0 0
⊙
0 1
2 3
=
0 3 8 4
9 19 25 10
21 37 43 16
6 7 8 0
Stride
• The stride is the number of rows and columns traversed per step of the filter.
Example, vertical stride 3 and horizontal stride 2 on the same padded input:
0 0 0 0 0
0 0 1 2 0
0 3 4 5 0
0 6 7 8 0
0 0 0 0 0
⊙
0 1
2 3
=
0 8
6 8
In a nutshell...
Feed-forward neural networks applied to images:
• Scale quadratically with the input size.
Solution: use a convolutional layer.
• Connect each hidden unit to a small patch of the input.
• Share weights across space.
A network built with this structure is called a Convolutional Neural Network (CNN).
LeCun et al., 1998, Gradient-based learning applied to document recognition. Based on Marc'Aurelio Ranzato's slides, Facebook AI Research, Image Classification with Deep Learning.
Non-Linearity
• Introduce non-linearities after convolutional layers.
• The rectified linear unit (ReLU) works very well: it replaces negative values by zero.
tf.keras.activations.relu(x)
Pooling Layer
• Pooling promotes robustness to the specific location of the features.
Example: max pooling (other options: average pooling, L2 pooling, …)
Input (6 × 6):
2 4 3 5 6 3
3 8 5 7 5 1
1 2 1 5 6 7
2 4 6 2 3 4
3 5 2 2 5 6
6 4 5 7 9 8
Max pooling with 3 × 3 filters and stride 3:
8 7
5 9
The dimension is reduced from 6 × 6 to 2 × 2.
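The Keras counterpart of this example (pool size and stride as above):

tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=3)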
Representation Learning with CNNs
[Figure: Conv. Layer 1 learns low-level features, Conv. Layer 2 medium-level features, Conv. Layer 3 high-level features.]
Progress in CNN Architectures
• AlexNet, 2012 (Krizhevsky, Sutskever, and Hinton)
• VGG, 2014 (Simonyan and Zisserman)
• GoogLeNet, 2014 (Szegedy et al.)
• ResNet, 2015 (He et al.)
Importance of Depth
(Link)
Deep learning is in production
Coffee Break: 16:16 + 15 min
Recurrent Neural Networks
Why do we need recurrent neural networks?
In many applications, data has a sequential nature:
• Sensor time series
• Natural speech
• Words in sentences
• DNA
• Videos
What if data is not i.i.d. in time?
Example: "My name is João"
Recap
• In feed-forward neural networks, the output \hat{y} = f(x) depends only on the current input: there is no memory of previous inputs.
Recurrent Neural Network
[Diagram: a feed-forward neural network vs. a recurrent neural network with a feedback connection]
Bengio et al., 1994
Recurrent Neural Network
RNNs capture the temporal dependencies of the data.
• Real-valued hidden state h_t
• Feedback connection
• Parameters shared across timesteps
Update: h_t = g(W_{hh} h_{t-1} + W_{xh} x_t + b), where g is typically a sigmoid or tanh.
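A minimal NumPy sketch of one step of this update (with g = tanh; all dimensions are arbitrary):

import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, b):
    # h_t = g(W_hh h_{t-1} + W_xh x_t + b)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)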
RNN Unrolled
Bengio et al., 1994
Notes
• The final hidden state is a vector representation of the entire sequence.
• RNNs summarize sequences into fixed-length vectors.
• Input sequences can have arbitrary length.
• The final hidden state can be used as a feature vector for classification or regression tasks.
Long Short-Term Memory Networks
Hochreiter & Schmidhuber, 1997
• Proposed to solve the vanishing gradient problem.
• Adds a memory cell and three gates (input, forget, and output).
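In Keras (a sketch; 64 units is an arbitrary choice):

tf.keras.layers.LSTM(64)   # and tf.keras.layers.GRU(64) for the GRU variant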
Sequence to Sequence Models (Seq2Seq)
Sequence to Sequence Models
• In some applications, mapping sequences to sequences is useful. Example: machine translation.
• A sequence to sequence model is an encoder-decoder model.
"My name is João" → "Mijn naam is João"
[Diagram: the Encoder reads "My name is João" into a learned representation; the Decoder generates "Mijn naam is João" from it.]
Attention
Why do we need attention?
Problem: the fixed-size representation z is the bottleneck of seq2seq.
Question: Is there a better way to pass the information from the encoder to the decoder?
Solution: the attention mechanism.
Bahdanau, Cho, and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate.
Luong, Pham, and Manning, Effective Approaches to Attention-based Neural Machine Translation.
Attention Mechanism
[Diagram: attention weights over the encoder states of "My name is João" are combined into a context vector.]
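In one common formulation (Bahdanau-style; the notation here is an assumption, not from the slides), the context vector c_t at decoding step t is a weighted combination of the encoder hidden states h_i:

c_t = \sum_i \alpha_{t,i} h_i, \quad \alpha_{t,i} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_j \exp(\mathrm{score}(s_{t-1}, h_j))}

where s_{t-1} is the previous decoder state and score is a learned alignment function.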
Sequence to Sequence + Attention
[Diagram: the encoder processes "My name is João"; at each decoding step a context vector, computed with attention over the encoder states, feeds the decoder, which generates "Mijn naam is João".]
Application Examples
Google's Neural Machine Translation System
Question Answering
What was missing?
• Probabilistic deep learning models
• Deep generative models: variational autoencoders and generative adversarial networks
References – Part 1
Papers/Books
• A Few Useful Things to Know About Machine Learning, Domingos, 2012 (Link)
• Deep Learning, Ian Goodfellow et al. (Link)
Tutorials
• TensorFlow and Deep Learning Without a PhD (Link)
• Unsupervised Deep Learning, NeurIPS'18 (Link)
References – Part 1
Courses
• Neural Networks for Machine Learning, Geoffrey Hinton (Link)
• EE-559 Deep Learning, EPFL (Link)
• 6.S191 Deep Learning, MIT (Link)
• CS230 Deep Learning, Stanford, Andrew Ng (Link)
• Deep Learning, NYU, Yann LeCun (Link)
CNN References – Part 2
Courses/Tutorials
• Deep Learning for Computer Vision, MIT 6.S191 (Link)
• Convolutional Neural Networks Lecture, Stanford (Link)
• Image Classification with Deep Learning (Link)
• Deep Learning and the Future of AI, Yann LeCun (Link)
• Deep dive into deep learning (Link)
Video
• Deep Learning and the Future of Artificial Intelligence (Link)
• How does the brain learn so much so quickly? (Link)
RNN References – Part 2
Courses/Tutorials
• LxMLS slides, Chris Dyer, DeepMind (Link)
• Deep Sequence Modeling, MIT 6.S191 (Link)
Video
• DLRLSS, Recurrent Neural Networks, Yoshua Bengio (Link)
Thank you for your attention!