ECE 5984: Introduction to Machine Learning
Dhruv Batra, Virginia Tech
Topics: Neural Networks – Backprop
Readings: Murphy 16.5


Page 1:

ECE 5984: Introduction to Machine Learning

Dhruv Batra Virginia Tech

Topics:
–  Neural Networks
–  Backprop

Readings: Murphy 16.5

Page 2:

Administrativia
•  HW3
–  Due: in 2 weeks
–  You will implement primal & dual SVMs
–  Kaggle competition: Higgs Boson Signal vs Background classification
–  https://inclass.kaggle.com/c/2015-Spring-vt-ece-machine-learning-hw3
–  https://www.kaggle.com/c/higgs-boson

(C) Dhruv Batra 2

Page 3:

Administrativia
•  Project Mid-Sem Spotlight Presentations
–  Friday: 5-7pm, 3-5pm, Whittemore 654
–  5 slides (recommended)
–  4 minute time (STRICT) + 1-2 min Q&A
–  Tell the class what you're working on
–  Any results yet?
–  Problems faced?
–  Upload slides on Scholar

(C) Dhruv Batra 3

Page 4:

Recap of Last Time

(C) Dhruv Batra 4

Page 5:

Not linearly separable data
•  Some datasets are not linearly separable!

–  http://www.eee.metu.edu.tr/~alatan/Courses/Demo/AppletSVM.html

Page 6:

Addressing non-linearly separable data – Option 1, non-linear features

Slide Credit: Carlos Guestrin (C) Dhruv Batra 6

•  Choose non-linear features, e.g.,
–  Typical linear features: w0 + ∑i wi xi
–  Example of non-linear features:
•  Degree 2 polynomials: w0 + ∑i wi xi + ∑ij wij xi xj
•  Classifier hw(x) still linear in parameters w
–  As easy to learn
–  Data is linearly separable in higher-dimensional spaces
–  Express via kernels
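The degree-2 expansion can be sketched in code; `poly2_features` is a hypothetical helper name and the feature ordering below is an arbitrary choice of mine:

```python
# Sketch: expanding inputs with degree-2 polynomial features, so a linear
# classifier in the expanded space is non-linear in the original x.

def poly2_features(x):
    """Map x = (x_1..x_D) to (1, x_i ..., x_i*x_j for i <= j)."""
    feats = [1.0]                      # bias term (weight w0)
    feats.extend(x)                    # linear terms w_i x_i
    D = len(x)
    for i in range(D):
        for j in range(i, D):
            feats.append(x[i] * x[j])  # quadratic terms w_ij x_i x_j
    return feats

# XOR-style data is not linearly separable in (x1, x2), but the cross
# term x1*x2 makes it separable in the expanded space.
print(poly2_features([1.0, -1.0]))  # [1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
```

Note the feature count grows as O(D^2) with input dimension, which is one reason the kernel trick mentioned above is attractive.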

Page 7:

Addressing non-linearly separable data – Option 2, non-linear classifier

Slide Credit: Carlos Guestrin (C) Dhruv Batra 7

•  Choose a classifier hw(x) that is non-linear in parameters w, e.g.,
–  Decision trees, neural networks, …
•  More general than linear classifiers
•  But can often be harder to learn (non-convex optimization required)
•  Often very useful (outperforms linear classifiers)
•  In a way, both ideas are related

Page 8:

Biological Neuron

(C) Dhruv Batra 8

Page 9:

Recall: The Neuron Metaphor
•  Neurons
–  accept information from multiple inputs,
–  transmit information to other neurons.
•  Multiply inputs by weights along edges
•  Apply some function to the set of inputs at each node

9 Slide Credit: HKUST

Page 10:

Types of Neurons
[Figure: three neuron diagrams; each unit takes inputs x1…xD with weights θ1…θD plus a bias θ0 on a constant input 1, and outputs f(x⃗, θ)]
•  Linear Neuron
•  Logistic Neuron
•  Perceptron
•  Potentially more. Require a convex loss function for gradient descent training.
10 Slide Credit: HKUST
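A minimal sketch of the three units in the figure, assuming the standard definitions (a weighted sum θ0 + ∑i θi xi passed through identity, sigmoid, or a hard threshold):

```python
import math

# Each neuron type computes the same pre-activation theta0 + sum_i theta_i x_i,
# then applies a different output function.

def pre_activation(x, theta):
    """theta = (theta0, theta1, ..., thetaD); x = (x1, ..., xD)."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def linear_neuron(x, theta):            # identity output
    return pre_activation(x, theta)

def logistic_neuron(x, theta):          # smooth squashing to (0, 1)
    return 1.0 / (1.0 + math.exp(-pre_activation(x, theta)))

def perceptron(x, theta):               # hard threshold
    return 1 if pre_activation(x, theta) >= 0 else 0

theta = [-1.0, 2.0]                   # theta0 = -1 (bias), theta1 = 2
print(linear_neuron([1.0], theta))    # 1.0
print(logistic_neuron([0.5], theta))  # sigmoid(0) = 0.5
print(perceptron([0.4], theta))       # pre-activation -0.2 < 0 -> 0
```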

Page 11:

Limitation •  A single “neuron” is still a linear decision boundary

•  What to do?

•  Idea: Stack a bunch of them together!

(C) Dhruv Batra 11

Page 12:

Multilayer Networks

•  Cascade neurons together
•  The output from one layer is the input to the next
•  Each layer has its own set of weights
[Figure: network with inputs x0…xP; layer-0 weight vectors θ⃗0,0, θ⃗0,1, θ⃗0,2, layer-1 weight vectors θ⃗1,0, θ⃗1,1, θ⃗1,2, and output weights θ2,0, θ2,1, θ2,2, computing f(x, θ⃗)]
12 Slide Credit: HKUST

Page 13:

Universal Function Approximators •  Theorem

–  3-layer network with linear outputs can uniformly approximate any continuous function to arbitrary accuracy, given enough hidden units [Funahashi ’89]

(C) Dhruv Batra 13

Page 14:

Plan for Today •  Neural Networks

–  Parameter learning –  Backpropagation

(C) Dhruv Batra 14

Page 15:

Forward Propagation •  On board

(C) Dhruv Batra 15
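Since the forward pass is worked out on the board, here is a hedged sketch of what it computes, assuming sigmoid units in every layer (the transcript does not fix the activation) and made-up weights:

```python
import math

# Minimal forward propagation: each layer computes sigmoid(theta . [1, inputs])
# per unit; the output of one layer is the input to the next.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """layers: list of weight matrices; row j = [bias, w_j1, ..., w_jP]."""
    a = x
    for W in layers:
        a = [sigmoid(row[0] + sum(w * ai for w, ai in zip(row[1:], a)))
             for row in W]
    return a

# 2 inputs -> 2 hidden units -> 1 output (illustrative weights)
layers = [
    [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]],  # hidden layer
    [[-1.0, 2.0, 2.0]],                    # output layer
]
print(forward([1.0, 0.0], layers))
```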

Page 16:

Feed-Forward Networks

•  Predictions are fed forward through the network to classify

[Figure: the same network; activations flow forward from inputs x0…xP through weights θ⃗0,j, θ⃗1,j and θ2,j to the output]
16 Slide Credit: HKUST


Page 22:

Gradient Computation •  First let’s try:

–  Single Neuron for Linear Regression
–  Single Neuron for Logistic Regression

(C) Dhruv Batra 22
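The two single-neuron cases can be sketched as follows, assuming squared error for the linear neuron and the MLE objective for the logistic one (choices not fixed in the transcript); with those choices both stochastic updates take the same `(y - prediction) * x` form:

```python
import math

# Hedged sketch of the single-neuron gradient steps derived on the board.
# For a linear neuron with squared error, and a logistic neuron with the
# conditional log-likelihood (MLE), the stochastic update has the same shape:
#   theta_i <- theta_i + eta * (y - prediction) * x_i

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(theta, x, y, eta, logistic):
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    pred = sigmoid(z) if logistic else z
    err = y - pred
    theta[0] += eta * err                  # bias sees a constant input 1
    for i, xi in enumerate(x):
        theta[i + 1] += eta * err * xi
    return theta

theta = [0.0, 0.0]
for _ in range(200):                       # fit y = 1 at x = 1
    sgd_step(theta, [1.0], 1.0, 0.5, logistic=True)
```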

Page 23:

Logistic regression •  Learning rule – MLE:

Slide Credit: Carlos Guestrin (C) Dhruv Batra 23
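The equation on this slide was an image and is not in the transcript; the standard MLE gradient-ascent update it refers to (conditional log-likelihood, step size η, training examples indexed by j) is:

```latex
w_i \leftarrow w_i + \eta \sum_j x_i^{(j)}\left[\, y^{(j)} - P\!\left(y=1 \mid x^{(j)}, w\right) \right]
```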

Page 24:

Gradient Computation •  First let’s try:

–  Single Neuron for Linear Regression
–  Single Neuron for Logistic Regression

•  Now let’s try the general case

•  Backpropagation! –  Really efficient

(C) Dhruv Batra 24
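A toy illustration of why backprop is efficient: the backward pass reuses the cached activations and the upstream derivative instead of recomputing them. One hidden sigmoid unit, one sigmoid output, squared-error loss; all names and numbers are illustrative, not the slide's general derivation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(w1, w2, x, y, eta):
    # forward pass (cache the intermediate activations)
    h = sigmoid(w1 * x)
    out = sigmoid(w2 * h)
    # backward pass: chain rule, outermost derivative first
    d_out = (out - y) * out * (1 - out)   # dLoss/d(pre-activation of out)
    d_w2 = d_out * h
    d_h = d_out * w2                      # reuse d_out rather than recompute
    d_w1 = d_h * h * (1 - h) * x
    return w1 - eta * d_w1, w2 - eta * d_w2

w1, w2 = 0.5, -0.5
for _ in range(2000):                     # fit y = 1 at x = 1
    w1, w2 = backprop_step(w1, w2, 1.0, 1.0, 1.0)
```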

Page 25:

Neural Nets •  Best performers on OCR

–  http://yann.lecun.com/exdb/lenet/index.html

•  NetTalk –  Text to Speech system from 1987 –  http://youtu.be/tXMaFhO6dIY?t=45m15s

•  Rick Rashid speaks Mandarin –  http://youtu.be/Nu-nlQqFCKg?t=7m30s

(C) Dhruv Batra 25

Page 26:

Neural Networks •  Demo

–  http://neuron.eng.wayne.edu/bpFunctionApprox/bpFunctionApprox.html

(C) Dhruv Batra 26

Page 27:

Historical Perspective

(C) Dhruv Batra 27

Page 28:

Convergence of backprop •  Perceptron leads to convex optimization

–  Gradient descent reaches global minima

•  Multilayer neural nets not convex
–  Gradient descent gets stuck in local minima
–  Hard to set learning rate
–  Selecting number of hidden units and layers = fuzzy process
–  NNs had fallen out of fashion in 90s, early 2000s
–  Back with a new name and significantly improved performance!!!!
•  Deep networks
–  Dropout and trained on much larger corpus

Slide Credit: Carlos Guestrin (C) Dhruv Batra 28

Page 29:

Overfitting
•  Many many many parameters
•  Avoiding overfitting?
–  More training data
–  Regularization
–  Early stopping
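The last bullet can be sketched as a loop; `train_step` and `val_error` below are hypothetical stand-ins for whatever model is being trained:

```python
# Minimal early stopping: keep training while validation error improves;
# stop once it has not improved for `patience` checks, and report the best
# epoch seen (whose parameters you would keep).

def train_with_early_stopping(train_step, val_error, patience=3, max_epochs=100):
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step(epoch)
        err = val_error()
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_epoch, best_err

# toy "validation curve" that bottoms out at epoch 4, then overfits
errors = [1.0, 0.6, 0.4, 0.3, 0.25, 0.3, 0.4, 0.5, 0.6]
epoch_seen = {"i": -1}
def train_step(epoch): epoch_seen["i"] = epoch
def val_error(): return errors[epoch_seen["i"]]
print(train_with_early_stopping(train_step, val_error))  # (4, 0.25)
```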

(C) Dhruv Batra 29

Page 30:

A quick note

(C) Dhruv Batra 30 Image Credit: LeCun et al. ‘98

Page 31:

Rectified Linear Units (ReLU)

(C) Dhruv Batra 31
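This slide is figure-only in the transcript; the standard ReLU definition and the (sub)gradient backprop needs are:

```python
# ReLU: max(0, z). Unlike the sigmoid, its derivative does not vanish for
# large positive inputs, which helps gradients flow through deep networks.

def relu(z):
    return z if z > 0 else 0.0

def relu_grad(z):
    # derivative is 1 for z > 0, 0 for z < 0 (any value in [0, 1] at z = 0)
    return 1.0 if z > 0 else 0.0

print([relu(z) for z in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 3.5]
```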

Page 32:

Convolutional Nets •  Basic Idea

–  On board
–  Assumptions:
•  Local Receptive Fields
•  Weight Sharing / Translational Invariance / Stationarity
–  Each layer is just a convolution!

(C) Dhruv Batra 32

[Figure: input image → convolutional layer → sub-sampling layer]
Image Credit: Chris Bishop
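The "each layer is just a convolution" idea can be sketched directly; this is a plain 'valid' cross-correlation (what conv nets actually compute, with no kernel flipping), not the slide's exact board derivation:

```python
# A single shared filter slides over the image, so every output unit reuses
# the same weights (weight sharing) and looks only at a small patch
# (local receptive field).

def conv2d(img, kernel):
    H, W = len(img), len(img[0])
    kH, kW = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kH + 1):
        row = []
        for j in range(W - kW + 1):
            row.append(sum(img[i + di][j + dj] * kernel[di][dj]
                           for di in range(kH) for dj in range(kW)))
        out.append(row)
    return out

edge = [[1.0, -1.0]]                  # horizontal edge-detecting filter
img = [[0.0, 0.0, 1.0, 1.0]] * 2      # dark-to-bright step image
print(conv2d(img, edge))              # response only at the brightness jump
```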

Pages 33-41: [image-only slides] Slide Credit: Marc'Aurelio Ranzato

Page 42:

Convolutional Nets •  Example:

–  http://yann.lecun.com/exdb/lenet/index.html

(C) Dhruv Batra 42

[Figure: LeNet-5. INPUT 32x32 → (convolutions) C1: feature maps 6@28x28 → (subsampling) S2: feature maps 6@14x14 → (convolutions) C3: feature maps 16@10x10 → (subsampling) S4: feature maps 16@5x5 → (full connection) C5: layer 120 → (full connection) F6: layer 84 → (Gaussian connections) OUTPUT 10]
Image Credit: Yann LeCun, Kevin Murphy
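The feature-map sizes in the figure follow from two standard formulas (valid convolution with a k×k filter: n → n − k + 1; 2×2 subsampling: n → n/2), which we can check; the 5×5 filter size is LeNet's:

```python
# Reproduce the LeNet-5 spatial sizes: 32 -> 28 -> 14 -> 10 -> 5.

def conv(n, k=5):
    return n - k + 1        # 'valid' convolution output size

def subsample(n):
    return n // 2           # 2x2 pooling halves each side

n = 32                      # INPUT 32x32
n = conv(n)                 # C1: 28x28
n = subsample(n)            # S2: 14x14
n = conv(n)                 # C3: 10x10
n = subsample(n)            # S4: 5x5
print(n)                    # 5, matching "16@5x5" in the figure
```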

Pages 43-45: [image-only slides] Slide Credit: Marc'Aurelio Ranzato

Page 46:

Visualizing Learned Filters

(C) Dhruv Batra 46 Figure Credit: [Zeiler & Fergus ECCV14]

Page 47:

Visualizing Learned Filters

(C) Dhruv Batra 47 Figure Credit: [Zeiler & Fergus ECCV14]

Page 48:

Visualizing Learned Filters

(C) Dhruv Batra 48 Figure Credit: [Zeiler & Fergus ECCV14]

Page 49:

Autoencoders •  Goal

–  Compression: Output tries to predict input

(C) Dhruv Batra 49 Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders
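A minimal sketch of the goal: a 1-unit linear bottleneck on 2-D inputs, with the trained weights replaced by a hand-picked direction to keep the example tiny (everything here is illustrative):

```python
# Autoencoder idea: encode the input to fewer dimensions, decode back,
# and train so the output predicts the input (reconstruction error).

def encode(x, w):                  # 2-D input -> 1-D code
    return w[0] * x[0] + w[1] * x[1]

def decode(h, w):                  # 1-D code -> 2-D reconstruction
    return [w[0] * h, w[1] * h]

def reconstruction_error(x, w):
    xr = decode(encode(x, w), w)
    return sum((a - b) ** 2 for a, b in zip(x, xr))

# Data lying along the direction (1, 1) is reconstructed exactly by the
# matching unit-norm weights: that direction is the learned "basis".
w = [0.5 ** 0.5, 0.5 ** 0.5]
print(reconstruction_error([2.0, 2.0], w))   # ~0.0 (on the basis direction)
print(reconstruction_error([1.0, -1.0], w))  # large (orthogonal direction)
```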

Page 50:

Autoencoders •  Goal

–  Learns a low-dimensional “basis” for the data

(C) Dhruv Batra 50 Image Credit: Andrew Ng

Page 51:

Stacked Autoencoders •  How about we compress the low-dim features more?

(C) Dhruv Batra 51 Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

Page 52:

(C) Dhruv Batra 52

Sparse DBNs

[Lee et al. ICML ‘09]

Figure courtesy: Quoc Le

Page 53:

Stacked Autoencoders
•  Finally, perform classification with these low-dim features.

(C) Dhruv Batra 53 Image Credit: http://ufldl.stanford.edu/wiki/index.php/Stacked_Autoencoders

Page 54:

What you need to know about neural networks

•  Perceptron:
–  Representation
–  Derivation
•  Multilayer neural nets
–  Representation
–  Derivation of backprop
–  Learning rule
–  Expressive power