
Page 1:

CSE 802, Spring 2017
Deep Learning
Inci M. Baytas
Michigan State University
February 13-15, 2017

Page 2:

Deep Learning in Computer Vision

Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014

Page 3:

Deep Learning in Computer Vision

Microsoft Deep Learning Semantic Image Segmentation

Page 4:

Deep Learning in Computer Vision

NeuralTalk and Walk: recognition and text description of the image while walking.

Page 5:

Deep Learning in Robotics

Self-driving cars

Page 6:

Deep Learning in Robotics

Deep sensorimotor learning

Page 7:

Other Applications of Deep Learning
● Natural Language Processing (NLP)
● Speech recognition and machine translation

Why Should We Be Impressed?
● Automated vision (e.g., object recognition) is challenging: different viewpoints, scales, occlusions, illumination, ...
● Robotics (e.g., autonomous driving) in real-life environments (constantly changing, new tasks without guidance, unexpected factors) is challenging.
● NLP (e.g., understanding human conversations) is an extremely complex task: noise, context, partial sentences, different accents, ...

Page 8:

Why Is Deep Learning So Popular Now?
• Better hardware
• Bigger data
• Regularization methods (dropout)
• Variety of optimization methods
  • SGD
  • Adagrad
  • Adadelta
  • ADAM
  • RMSProp

Page 9:

Criticism and Limitations of Deep Networks

• Large amount of data required for training
• High-performance computing a necessity
• Non-optimal method
• Task specific
• Lack of theoretical understanding

Page 10:

Common Deep Network Types
● Feedforward networks
● Convolutional neural networks
● Recurrent neural networks

Page 11:

Components of Deep Learning

Loss functions (NumPy sketch below)
● Squared loss: (y − f(x))²
● Logistic loss: log(1 + e^(−y·f(x)))
● Hinge loss: max(0, 1 − y·f(x))
● Squared hinge loss: max(0, 1 − y·f(x))²

Non-linear activation functions
● Linear
● Tanh
● Sigmoid
● Softmax
● ReLU
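These losses and activations map directly to a few lines of code. The following NumPy sketch is added for illustration and is not part of the original slides; it assumes labels y ∈ {−1, +1} and a model score f = f(x), and the function names are illustrative only.

    import numpy as np

    # Losses for a single example with label y in {-1, +1} and score f = f(x).
    def squared_loss(y, f):        return (y - f) ** 2
    def logistic_loss(y, f):       return np.log1p(np.exp(-y * f))
    def hinge_loss(y, f):          return max(0.0, 1.0 - y * f)
    def squared_hinge_loss(y, f):  return max(0.0, 1.0 - y * f) ** 2

    # Common activations, applied element-wise to a pre-activation vector z.
    def linear(z):  return z
    def tanh(z):    return np.tanh(z)
    def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
    def relu(z):    return np.maximum(0.0, z)
    def softmax(z):
        e = np.exp(z - np.max(z))   # shift by the max for numerical stability
        return e / e.sum()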

Page 12:

(figure-only slide)

Page 13:

Components of Deep Learning

Optimizers
● Gradient Descent
● Adagrad (Adaptive Gradient Algorithm)
● Adadelta (An Adaptive Learning Rate Method)
● ADAM (Adaptive Moment Estimation)
● RMSProp

Regularization Methods
● L2 norm
● L1 norm
● Dataset augmentation
● Noise robustness
● Early stopping
● Dropout [12]
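As a concrete example of one optimizer from the list above, here is a minimal NumPy sketch of the Adagrad update (illustrative only, not from the slides): each parameter gets its own effective step size, which shrinks as its squared gradients accumulate.

    import numpy as np

    def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
        # Accumulate squared gradients, then scale the step per parameter.
        accum = accum + grad ** 2
        theta = theta - lr * grad / (np.sqrt(accum) + eps)
        return theta, accum

    # Usage: start with accum = np.zeros_like(theta) and call once per gradient step.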

Page 14:

Components of Deep Learning

Number of iterations
● Too few iterations: may underfit
● More iterations: use a stopping criterion

Step size
● Very large step size: may overshoot the optimum
● Very small step size: takes longer to converge

Parameter Initialization
● Initializing with zeros
● Random initialization
● Xavier initialization (sketch below)
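Of these, Xavier initialization is the least self-explanatory. A minimal NumPy sketch of the uniform variant (added for illustration, not taken from the slides):

    import numpy as np

    def xavier_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
        # Glorot/Xavier uniform: the scale is chosen from the layer's fan-in and
        # fan-out so activation and gradient variances stay roughly constant.
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return rng.uniform(-limit, limit, size=(fan_in, fan_out))

    W = xavier_uniform(784, 256)   # e.g., a 784 -> 256 hidden layer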

Page 15:

Components of Deep Learning

Batch size
● Bigger batch size: might require fewer iterations (see the arithmetic sketch after this list)
● Smaller batch size: will need more iterations

Number of layers
● More layers (more depth): more non-linearity, more complexity, more parameters
● Too many layers might cause overfitting.

Number of hidden units
● More hidden units per layer: more model complexity, can approximate a more complex classifier
● Too many parameters: overfitting, increased training time
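To make the batch-size trade-off concrete, a tiny illustrative calculation (the dataset size is taken from the CASIA slide later in the deck; the batch sizes are arbitrary):

    import math

    n_examples = 494414                      # CASIA-WebFace training images
    for batch_size in (64, 256, 1024):
        iters_per_epoch = math.ceil(n_examples / batch_size)
        print(batch_size, iters_per_epoch)   # fewer updates per epoch as batches grow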

Page 16:

Convolutional Neural Networks

• Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers [1].

Convolution:
• A linear operator
• Cross-correlation with a flipped kernel (sketch below)
• Convolution in the spatial domain corresponds to multiplication in the frequency domain.
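A minimal NumPy sketch (not from the slides) showing that convolution is cross-correlation with the kernel flipped in both axes; this is why deep learning libraries can implement their "convolution" layers as plain cross-correlation.

    import numpy as np

    def cross_correlate2d(image, kernel):
        # Valid-mode 2D cross-correlation: slide the kernel without flipping it.
        kh, kw = kernel.shape
        out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.arange(25.0).reshape(5, 5)
    kernel = np.array([[1.0, 0.0, -1.0],
                       [2.0, 0.0, -2.0],
                       [1.0, 0.0, -1.0]])

    # True convolution = cross-correlation with the kernel flipped in both axes.
    convolved = cross_correlate2d(image, np.flip(kernel))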

Page 17:

Convolutional Neural Networks (CNNs)

• Feedforward networks that can extract topological features from images.
• Can provide invariance to geometric distortions such as translation, scaling, and rotation.
• Hierarchical and robust feature extraction existed before CNNs; CNNs are data-driven.
• Filter parameters are learned from the data instead of being predefined.
• At each iteration, the parameters are updated to minimize the loss.

Page 18:

Convolution Layer
• Local (sparse) connectivity
  • Reduces memory requirements
  • Fewer operations
• Parameter sharing
  • Same kernel used at every position of the input (parameter-count sketch below)
• How to choose the filter size?
  • Receptive field
• Equivariance property
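To see why sparse connectivity and parameter sharing matter, a back-of-the-envelope comparison (illustrative sizes, not from the slides):

    # Mapping a 100x100 single-channel image to a 100x100 feature map:
    dense_params = (100 * 100) * (100 * 100)   # fully connected: 100,000,000 weights
    conv_params = 5 * 5                        # one shared 5x5 kernel: 25 weights (plus a bias)
    print(dense_params, conv_params)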

Page 19:

Pooling Layer (Subsampling)

• Convolution stage: several convolutions in parallel produce a set of linear activations
• Followed by a non-linear activation
• Then the pooling layer:
  • Invariance to small translations (sketch below)
  • Dealing with variable-size inputs
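A tiny NumPy sketch of max pooling (illustrative, not from the slides): each feature is shifted by one position but stays inside its pooling window, so the pooled output is unchanged; this is the small-translation invariance mentioned above.

    import numpy as np

    def max_pool_1d(x, width=2):
        # Non-overlapping max pooling over windows of the given width.
        return x.reshape(-1, width).max(axis=1)

    a = np.array([0.0, 9.0, 0.0, 0.0, 5.0, 0.0])
    b = np.array([9.0, 0.0, 0.0, 0.0, 0.0, 5.0])   # same activations, each shifted by one position

    print(max_pool_1d(a))   # [9. 0. 5.]
    print(max_pool_1d(b))   # [9. 0. 5.] -> identical after pooling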

Page 20:

Fully-Connected Layer

• Maps the latent representation of the input to the output
• Output:
  • One-hot representation of the class label
  • Predicted response
• Appropriate activation function, e.g., softmax for classification (sketch below)
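A minimal NumPy sketch of a fully-connected layer with a softmax output (illustrative names and sizes, not from the slides):

    import numpy as np

    def fully_connected_softmax(x, W, b):
        # Affine map from the latent feature vector to class scores, then softmax.
        z = W @ x + b
        e = np.exp(z - z.max())          # subtract the max for numerical stability
        return e / e.sum()

    rng = np.random.default_rng(0)
    x = rng.normal(size=128)             # features from the last conv/pooling block
    W, b = rng.normal(size=(10, 128)), np.zeros(10)
    probs = fully_connected_softmax(x, W, b)   # 10 class probabilities summing to 1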

Page 21:

Feature Extraction with CNNs

Page 22:

Some Example CNN Architectures

LeNet-5 [2]

Page 23:

Some Example CNN Architectures

AlexNet [4] (5 convolutional layers)

Page 24:

Some Example CNN Architectures

VGG-16 [3]

Page 25:

GoogLeNet (22 layers)

Page 26:

Tricks to Improve CNN Performance

• Data augmentation (sketch below)
  • Flipping (commonly used for faces)
  • Translation
  • Rotation
  • Stretching
• Normalizing, whitening (less redundancy)
• Cropping and alignment (especially for faces)
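A rough NumPy sketch of two of these augmentations, flipping and a small translation (illustrative only; np.roll wraps pixels around the edge, so a real pipeline would pad or crop instead):

    import numpy as np

    rng = np.random.default_rng(0)

    def augment(image):
        # Label-preserving augmentations for a (H, W) grayscale image array.
        out = image
        if rng.random() < 0.5:
            out = np.fliplr(out)              # horizontal flip
        shift = int(rng.integers(-2, 3))
        out = np.roll(out, shift, axis=1)     # small horizontal translation (wraps around)
        return out

    augmented = augment(rng.random((112, 112)))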

Page 27:

Project
• You will implement the 11-layer CNN architecture proposed in [6] to extract features.

Page 28:

Project
• You can use a deep learning library to implement the network.
• The library will take care of convolution, pooling, dropout, and backpropagation.
• You need to define the cost function and the activation functions.
• The activation function of the output layer is softmax, since it is a classification problem.
• You can use TensorFlow (a hedged sketch follows).
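As a starting point, here is a minimal TensorFlow/Keras sketch with the ingredients listed above: convolution, pooling, dropout, a softmax output layer, and a cross-entropy cost. The layer sizes, input shape, and optimizer are placeholders, not the 11-layer architecture of [6], which the project must follow.

    import tensorflow as tf

    num_classes = 10575   # placeholder: one class per CASIA-WebFace subject

    # Placeholder architecture -- NOT the 11-layer network of [6], only the ingredients.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(100, 100, 1)),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation="softmax"),   # softmax output layer
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",            # cross-entropy cost
                  metrics=["accuracy"])
    # model.fit(train_images, train_labels, batch_size=128, epochs=10)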

Page 29:

HPCC
• Data and evaluation protocol are on HPCC:
  /mnt/research/CSE_802_SPR_17
• To connect to HPCC: ssh [email protected] and your MSU email password.
• To run small examples, use developer mode: ssh dev-intel14
• Try to log in to HPCC and check the course research space.
• Try to use a Python IDE (e.g., PyCharm). Debug your code and understand how TensorFlow works (if you are not familiar with a deep learning library).

Page 30:

CASIA Dataset (Cropped Images)
• The database contains 494,414 images.
• 10,575 subjects in total.
• We provide cropped and original images under /mnt/research/CSE_802_SPR_17

Page 31:

Test Data and Evaluation Protocol

● Final evaluation on Labeled Faces in the Wild (LFW) database [7] with 13,233 images, 5,749 subjects.

● Evaluation protocol:
  ○ BLUFR protocol [8]; find it under /mnt/research/CSE_802_SPR_17

Page 32:

References
1. http://www.deeplearningbook.org/
2. http://yann.lecun.com/exdb/lenet/
3. https://www.cs.toronto.edu/~frossard/post/vgg16/
4. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada.
5. http://pubs.sciepub.com/ajme/2/7/9/
6. Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li, "Learning Face Representation from Scratch", arXiv:1411.7923v1 [cs.CV], 2014.
7. http://vis-www.cs.umass.edu/lfw/
8. http://www.cbsr.ia.ac.cn/users/scliao/projects/blufr/
9. http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html
10. https://www.nist.gov/programs-projects/face-recognition-grand-challenge-frgc
11. Shengcai Liao, Zhen Lei, Dong Yi, and Stan Z. Li, "A Benchmark Study of Large-scale Unconstrained Face Recognition", IAPR/IEEE International Joint Conference on Biometrics, Sep. 29 - Oct. 2, Clearwater, Florida, USA, 2014.
12. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (2014) 1929-1958.