COS 429: Computer Vision Lecture 19 Intro to Deep Learning COS429 : 29.11.16 : Andras Ferencz Thanks: most of these slides shamelessly adapted from Stanford CS231n: Convolutional Neural Networks for Visual Recognition Fei-Fei Li, Andrej Karpathy, Justin Johnson http://cs231n.stanford.edu/


Page 1:

COS 429: Computer Vision

Lecture 19: Intro to Deep Learning

COS429 : 29.11.16 : Andras Ferencz

Thanks: most of these slides shamelessly adapted from Stanford CS231n: Convolutional Neural Networks for Visual Recognition

Fei-Fei Li, Andrej Karpathy, Justin Johnson http://cs231n.stanford.edu/

Page 2:

Preview

LeCun

Page 3:

Network is a stack of components

Page 4:

Components of a Convolutional Net

[Diagram: input image -> network (weights) -> loss]

Page 5:

two more layers to go: POOL/FC

Page 6:

http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

[ConvNetJS demo: training on CIFAR-10]

Page 7:

Fully Connected layer

Previous Layer

FC Layer

B_j = Σ_i (W_ij * A_i) + b_j

[Diagram: previous-layer activations A1..A4 fully connected to FC-layer outputs B1..B4]
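The formula above, B_j = Σ_i (W_ij * A_i) + b_j, is just a matrix-vector product plus a bias. A quick numpy sketch (the layer sizes and values are illustrative, not from the slides):

```python
import numpy as np

# Fully connected layer: B_j = sum_i (W_ij * A_i) + b_j,
# i.e. a matrix-vector product plus a bias vector.
def fc_forward(A, W, b):
    """A: (n_in,) input activations; W: (n_out, n_in) weights; b: (n_out,) biases."""
    return W @ A + b

A = np.array([1.0, 2.0, 3.0, 4.0])   # previous-layer activations A1..A4
W = np.ones((4, 4)) * 0.5            # every output B_j sees every input A_i
b = np.zeros(4)
print(fc_forward(A, W, b))           # [5. 5. 5. 5.]  (0.5 * (1+2+3+4) = 5)
```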

Page 8:

Convolution Layer

A 32x32x3 image is convolved with a 5x5x3 filter: the filter slides over all spatial locations, producing a 28x28x1 activation map. Consider a second, green filter: it produces a second 28x28 activation map.
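A naive numpy sketch of this operation, written as an explicit loop for clarity (function name is ours; real implementations are vectorized):

```python
import numpy as np

# Slide one 5x5x3 filter over a 32x32x3 image (stride 1, no padding):
# at each location, elementwise-multiply filter and patch and sum to one number.
def conv_single_filter(image, filt):
    H, W, _ = image.shape
    f = filt.shape[0]
    out = np.zeros((H - f + 1, W - f + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y+f, x:x+f, :]
            out[y, x] = np.sum(patch * filt)   # dot product of filter and patch
    return out

image = np.random.randn(32, 32, 3)
filt = np.random.randn(5, 5, 3)
amap = conv_single_filter(image, filt)
print(amap.shape)   # (28, 28): one filter -> one activation map
```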

Page 9:

In practice: Common to zero pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?

7x7 output! In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (this preserves the spatial size), e.g.
F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
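The output-size arithmetic above fits in one helper (a sketch; the function name is ours):

```python
# Spatial output size for input width W, filter size F, zero-padding P, stride S.
def conv_output_size(W, F, P, S):
    return (W - F + 2 * P) // S + 1

# 7x7 input, 3x3 filter, pad 1, stride 1 -> 7x7 output (size preserved)
print(conv_output_size(7, 3, 1, 1))   # 7
# pad (F-1)/2 preserves size for any odd F:
print(conv_output_size(7, 5, 2, 1))   # 7
print(conv_output_size(7, 7, 3, 1))   # 7
# no padding shrinks the map, e.g. the 32x32 image with a 5x5 filter:
print(conv_output_size(32, 5, 0, 1))  # 28
```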

Page 10:

example 5x5 filters (32 total)

We call the layer convolutional because it is related to convolution of two signals:

elementwise multiplication and sum of a filter and the signal (image)

one filter => one activation map

Page 11:

(btw, 1x1 convolution layers make perfect sense)

A 56x56x64 volume, passed through a 1x1 CONV with 32 filters, becomes a 56x56x32 volume (each filter has size 1x1x64, and performs a 64-dimensional dot product).

Page 12:

Activation Layer

Sigmoid

tanh tanh(x)

ReLU max(0,x)

Maxout

ELU

Leaky ReLU max(0.1x, x)

Page 13:

TLDR: In practice:

- Use ReLU. Be careful with your learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but don't expect much
- Don't use sigmoid
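The functions being compared, sketched elementwise in numpy:

```python
import numpy as np

# Common activation functions from the slide, applied elementwise.
def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def relu(x):       return np.maximum(0.0, x)        # max(0, x)
def leaky_relu(x): return np.maximum(0.1 * x, x)    # max(0.1x, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))         # [0. 0. 3.]
print(leaky_relu(x))   # [-0.2  0.   3. ]  (small slope for negative inputs)
print(np.tanh(x))
print(sigmoid(x))
```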

Page 14:

Pooling layer:
- makes the representations smaller and more manageable
- operates over each activation map independently

Page 15:

MAX POOLING

Single depth slice (x right, y down):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2:

6 8
3 4
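A numpy sketch reproducing the example above (the reshape trick assumes even dimensions):

```python
import numpy as np

# 2x2 max pooling with stride 2: group the array into 2x2 blocks
# and take the max of each block.
def max_pool_2x2(x):
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))   # [[6 8]
                         #  [3 4]]
```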

Page 16:

Suppose: 3 training examples, 3 classes, and their scores:

        ex. 1   ex. 2   ex. 3
cat      3.2     1.3     2.2
car      5.1     4.9     2.5
frog    -1.7     2.0    -3.1

Multiclass Hinge loss:

Given an example (x_i, y_i), where x_i is the image and y_i is the (integer) label, and using the shorthand s = f(x_i, W) for the scores vector, the SVM loss has the form:

L_i = Σ_{j ≠ y_i} max(0, s_j - s_{y_i} + 1)
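A numpy sketch of this loss on the scores above. The arrangement (columns are the three examples; the correct classes are cat, car, frog respectively) is assumed from the CS231n original these slides adapt:

```python
import numpy as np

# Multiclass SVM (hinge) loss: L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + 1)
def svm_loss(scores, y):
    margins = np.maximum(0.0, scores - scores[y] + 1.0)
    margins[y] = 0.0   # the correct class contributes no margin
    return margins.sum()

# Rows: cat, car, frog. Columns: the three training examples.
S = np.array([[ 3.2, 1.3,  2.2],
              [ 5.1, 4.9,  2.5],
              [-1.7, 2.0, -3.1]])
correct = [0, 1, 2]   # assumed correct class per example: cat, car, frog
losses = [svm_loss(S[:, i], y) for i, y in enumerate(correct)]
print(losses)   # [2.9, 0.0, 12.9]
```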

Page 17:

Softmax Loss (Multinomial Logistic)

               cat     car    frog
scores         3.2     5.1   -1.7    (unnormalized log probabilities)
exp           24.5   164.0    0.18   (unnormalized probabilities)
normalize     0.13    0.87    0.00   (probabilities)

L_i = -log(0.13) = 0.89
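The same pipeline in numpy. Note the slide's 0.89 corresponds to a base-10 log of 0.13; the conventional natural-log cross-entropy of the same probability is about 2.04:

```python
import numpy as np

# Softmax: exponentiate the scores, normalize to probabilities,
# then take -log of the correct class's probability.
scores = np.array([3.2, 5.1, -1.7])   # cat (correct class), car, frog
unnorm = np.exp(scores)               # unnormalized probabilities
probs = unnorm / unnorm.sum()         # [0.13, 0.87, 0.00]
loss = -np.log(probs[0])              # natural-log cross-entropy
print(np.round(probs, 2), round(float(loss), 2))
```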

Page 18:

Softmax vs. SVM

assume scores: [10, -2, 3], [10, 9, 9], [10, -100, -100], and that the first class is the correct one in each case

Q: Suppose I take a datapoint and I jiggle it a bit (changing its score slightly). What happens to the loss in both cases?

Page 19:

Page 20:

(image credits to Alec Radford)

How to optimize?

Page 21:

The decision boundary perspective…

Initial random weights

David Corne

Page 22:

The decision boundary perspective…

Present a training instance / adjust the weights


Pages 23-25: (the same step repeated: present a training instance / adjust the weights)

Page 26:

The decision boundary perspective…

Eventually ….


Page 27:

A dataset

Fields          class
1.4 2.7 1.9     0
3.8 3.4 3.2     0
6.4 2.8 1.7     1
4.1 0.1 0.2     0
etc …

Stochastic Gradient Descent


Page 28:

Page 29:

Initialise with random weights

Page 30:

Present a training pattern: 1.4 2.7 1.9

Page 31:

Feed it through to get output: inputs 1.4 2.7 1.9 -> output 0.8

Page 32:

Compare with target output: output 0.8, target 0 -> error 0.8

Page 33:

Adjust weights based on error (error 0.8)

Page 34:

Present a training pattern: 6.4 2.8 1.7

Page 35:

Feed it through to get output: inputs 6.4 2.8 1.7 -> output 0.9

Page 36:

Compare with target output: output 0.9, target 1 -> error -0.1

Page 37:

Adjust weights based on error (error -0.1)

Page 38:

And so on …

Page 39:

Mini-batch SGD

Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient
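The four steps of the loop, sketched on a toy least-squares problem (the closed-form gradient here stands in for a real network's forward/backward pass; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])           # synthetic targets

loss = lambda W: np.mean((X @ W - y) ** 2)   # full-dataset loss, for monitoring

W = np.zeros(3)
lr = 1e-2
for _ in range(500):
    idx = rng.choice(100, size=32, replace=False)    # 1. sample a batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ W - yb) / len(yb)        # 2-3. forward + gradient
    W = W - lr * grad                                # 4. update the parameters

print(loss(np.zeros(3)), loss(W))   # loss drops well below its initial value
```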

Page 40:

Follow the slope

In 1-dimension, the derivative of a function:

df(x)/dx = lim_{h->0} (f(x + h) - f(x)) / h

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.
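The limit definition suggests a simple numerical approximation, applied per coordinate (a sketch; the function name is ours, and this is how gradient checking is typically done):

```python
import numpy as np

# Finite-difference approximation of the gradient:
# grad_i ~ (f(x + h*e_i) - f(x)) / h for each coordinate i.
def numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        xp = x.copy()
        xp[i] += h
        grad[i] = (f(xp) - f(x)) / h
    return grad

f = lambda x: np.sum(x ** 2)   # analytic gradient is 2x
print(numerical_gradient(f, np.array([1.0, -2.0, 3.0])))   # ~[ 2. -4.  6.]
```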

Page 41:

This is silly (evaluating the gradient numerically). The loss is just a function of W:

L = (1/N) Σ_i L_i(f(x_i, W), y_i) + λ R(W)

want: ∇_W L

Page 42:

Computational Graph

[Graph: x and W feed a multiply gate (*) producing s (scores); s feeds the hinge loss, and a regularization term R(W) is added to give the total loss L]

Pages 43-54:

Worked example: f(x, y, z) = (x + y) z, e.g. x = -2, y = 5, z = -4.

Forward pass: q = x + y = 3, then f = q z = -12.

Want: ∂f/∂x, ∂f/∂y, ∂f/∂z.

Backward pass: ∂f/∂z = q = 3, and ∂f/∂q = z = -4.

Chain rule: ∂f/∂x = (∂f/∂q)(∂q/∂x) = (-4)(1) = -4, and likewise ∂f/∂y = (∂f/∂q)(∂q/∂y) = -4.
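The frames above step through CS231n's running example. Assuming the function is f(x, y, z) = (x + y) * z (the standard CS231n example matching the values x = -2, y = 5, z = -4), the whole forward/backward pass fits in a few lines:

```python
# Backprop through f(x, y, z) = (x + y) * z for x = -2, y = 5, z = -4.
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y                 # q = 3
f = q * z                 # f = -12

# backward pass (chain rule, applied gate by gate)
df_dz = q                 # 3
df_dq = z                 # -4
dq_dx = dq_dy = 1.0       # local gradients of the add gate
df_dx = df_dq * dq_dx     # -4
df_dy = df_dq * dq_dy     # -4
print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```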

Pages 55-60:

Each gate f in the graph receives activations on the forward pass and computes its output. On the backward pass, it receives the gradient of the loss with respect to its output, multiplies it by its "local gradient" (the derivative of its output with respect to each input), and passes the resulting gradients back to its inputs: the chain rule, applied gate by gate.

Page 61:

Case Study: AlexNet [Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4
=> Q: what is the output volume size? Hint: (227-11)/4+1 = 55

Page 62:

Case Study: AlexNet [Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4
=> Output volume [55x55x96]

Q: What is the total number of parameters in this layer?

Page 63:

Case Study: AlexNet [Krizhevsky et al. 2012]

Input: 227x227x3 images

First layer (CONV1): 96 11x11 filters applied at stride 4
=> Output volume [55x55x96]
Parameters: (11*11*3)*96 = 35K

Page 64:

Case Study: AlexNet [Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27

Page 65:

Case Study: AlexNet [Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2
Output volume: 27x27x96

Q: what is the number of parameters in this layer?

Page 66:

Case Study: AlexNet [Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96

Second layer (POOL1): 3x3 filters applied at stride 2
Output volume: 27x27x96
Parameters: 0!

Page 67:

Case Study: AlexNet [Krizhevsky et al. 2012]

Input: 227x227x3 images
After CONV1: 55x55x96
After POOL1: 27x27x96
...

Page 68:

Case Study: AlexNet [Krizhevsky et al. 2012]

Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)

Page 69:

Case Study: AlexNet [Krizhevsky et al. 2012]

(Full architecture as on the previous page.)

Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%

Page 70:

Case Study: ZFNet [Zeiler and Fergus, 2013]

AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512

ImageNet top 5 error: 15.4% -> 14.8%

Page 71:

Case Study: VGGNet [Simonyan and Zisserman, 2014]

Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2

11.2% top 5 error in ILSVRC 2013 -> 7.3% top 5 error (best model)

Page 72:

INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000

(not counting biases)

TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
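Summing the per-layer counts above is plain arithmetic and confirms the totals:

```python
# Per-layer VGG parameter counts from the table above (no biases).
conv_params = [1728, 36864, 73728, 147456, 294912, 589824, 589824,
               1179648, 2359296, 2359296, 2359296, 2359296, 2359296]
fc_params = [102760448, 16777216, 4096000]

total = sum(conv_params) + sum(fc_params)
print(total)                    # 138,344,128 ~ the quoted 138M parameters
print(sum(fc_params) / total)   # ~0.89: most parameters sit in the FC layers
```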

Page 73:


Note:
- Most memory is in the early CONV layers
- Most params are in the late FC layers

Page 74:

Case Study: GoogLeNet [Szegedy et al., 2014]

Inception module

ILSVRC 2014 winner (6.7% top 5 error)

Page 75

Case Study: GoogLeNet

Fun features:

- Only 5 million params! (removes FC layers completely)

Compared to AlexNet:
- 12x fewer params
- 2x more compute
- 6.67% top-5 error (vs. 16.4%)

Page 76

Slide from Kaiming He’s recent presentation https://www.youtube.com/watch?v=1PGLj-uKT1w

Case Study: ResNet [He et al., 2015]

ILSVRC 2015 winner (3.6% top 5 error)

Page 77

(slide from Kaiming He’s recent presentation)

Page 78

Page 79

Case Study: ResNet [He et al., 2015]

ILSVRC 2015 winner (3.6% top 5 error)

(slide from Kaiming He’s recent presentation)

2-3 weeks of training on an 8-GPU machine

at runtime: faster than a VGGNet! (even though it has 8x more layers)
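The ingredient that makes this depth trainable is the residual connection: each block computes y = x + F(x), so the layers only have to learn a residual on top of the identity shortcut. A minimal numpy sketch with toy fully connected shapes (not the paper's actual conv block):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """y = relu(x + W2 @ relu(W1 @ x)): the weights learn a residual F(x)
    on top of the identity shortcut, so gradients also flow through the skip."""
    return relu(x + W2 @ relu(W1 @ x))

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
W1 = rng.standard_normal((64, 64)) * 0.01
W2 = rng.standard_normal((64, 64)) * 0.01
y = residual_block(x, W1, W2)
print(y.shape)  # (64,)
```

With small weights, F(x) starts near zero and the block is close to the identity, which is why stacking 100+ of them does not hurt optimization the way plain deep stacks do.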

Page 80

Case Study: ResNet

[He et al., 2015]

224x224x3

spatial dimension only 56x56!
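The drop from 224x224 to 56x56 follows from the standard conv output-size formula (W - F + 2P)/S + 1 applied to ResNet's stem, assuming the usual 7x7/stride-2 conv (pad 3) followed by a 3x3/stride-2 max pool (pad 1):

```python
def conv_out(size, kernel, stride, pad):
    """Standard convolution/pooling output-size formula: (W - F + 2P) / S + 1."""
    return (size - kernel + 2 * pad) // stride + 1

s = conv_out(224, kernel=7, stride=2, pad=3)  # 7x7 conv, stride 2 -> 112
s = conv_out(s, kernel=3, stride=2, pad=1)    # 3x3 max pool, stride 2 -> 56
print(s)  # 56
```

So after only two layers, every remaining residual block operates on feature maps no larger than 56x56, which keeps the deep stack cheap.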

Page 81

Case Study: ResNet [He et al., 2015]

Page 82

Case Study: ResNet [He et al., 2015]

- Batch Normalization after every CONV layer
- Xavier/2 initialization from He et al.
- SGD + Momentum (0.9)
- Learning rate: 0.1, divided by 10 when validation error plateaus
- Mini-batch size 256
- Weight decay of 1e-5
- No dropout used
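The optimizer part of this recipe is easy to sketch in numpy (a toy quadratic loss stands in for the network; hyperparameters taken from the list above):

```python
import numpy as np

def sgd_momentum_step(w, dw, v, lr, mu=0.9, weight_decay=1e-5):
    """One SGD+momentum update (momentum 0.9, as in the recipe);
    weight decay is folded into the gradient as an L2 penalty."""
    v = mu * v - lr * (dw + weight_decay * w)
    return w + v, v

w = np.array([1.0, -2.0])
v = np.zeros_like(w)
lr = 0.1                     # initial learning rate from the recipe
for step in range(3):
    dw = 2 * w               # gradient of the toy loss ||w||^2
    w, v = sgd_momentum_step(w, dw, v, lr)
lr /= 10                     # "divided by 10 when validation error plateaus"
print(w, lr)
```

In the real training run the division by 10 is triggered by a plateau in validation error rather than a fixed step count; the step-count trigger here is just for illustration.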

Page 83

Case Study: ResNet [He et al., 2015]

Page 84

Case Study: ResNet [He et al., 2015]

(this trick is also used in GoogLeNet)
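The trick in question is the 1x1 "bottleneck": reduce the channel count before the expensive 3x3 conv, then expand back. With illustrative sizes (256 channels reduced to 64, as in ResNet's bottleneck blocks), the weight savings are easy to check:

```python
# Parameter cost of a 3x3 conv on 256 channels, with and without
# a 1x1 bottleneck that first reduces to 64 channels (weights only).

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

direct = conv_params(3, 256, 256)              # plain 3x3:  589,824
bottleneck = (conv_params(1, 256, 64)          # reduce:      16,384
              + conv_params(3, 64, 64)         # 3x3:         36,864
              + conv_params(1, 64, 256))       # expand:      16,384
print(direct, bottleneck)  # 589824 69632 -> ~8.5x fewer weights
```

The same arithmetic explains why GoogLeNet's Inception modules put 1x1 convs in front of their 3x3 and 5x5 branches.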

Page 85

Case Study: ResNet [He et al., 2015]

Page 86

Case Study Bonus: DeepMind’s AlphaGo

Page 87

policy network:
[19x19x48] Input
CONV1: 192 5x5 filters, stride 1, pad 2 => [19x19x192]
CONV2..12: 192 3x3 filters, stride 1, pad 1 => [19x19x192]
CONV: 1 1x1 filter, stride 1, pad 0 => [19x19] (probability map of promising moves)
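A quick back-of-the-envelope parameter count for this policy network as specified (weights only, no biases; same counting convention as the VGG table earlier):

```python
# Weight count for the AlphaGo policy network described above.

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

total = (conv_params(5, 48, 192)           # CONV1: 5x5 on 48 input planes
         + 11 * conv_params(3, 192, 192)   # CONV2..12: eleven 3x3 layers
         + conv_params(1, 192, 1))         # final 1x1 -> 19x19 move map
print(total)  # 3880128
```

Only a few million weights: the network stays fully convolutional at 19x19 throughout, with no pooling and no FC layers, so its output is directly a per-intersection move map.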