Paper overview: "Deep Residual Learning for Image Recognition"


TRANSCRIPT

Page 1: Paper overview: "Deep Residual Learning for Image Recognition"

Article overview by Ilya Kuzovkin

K. He, X. Zhang, S. Ren and J. Sun, Microsoft Research

Computational Neuroscience Seminar, University of Tartu, 2016

Deep Residual Learning for Image Recognition: winner of ILSVRC 2015 and MS COCO 2015

Page 2: Paper overview: "Deep Residual Learning for Image Recognition"

THE IDEA

Page 3: Paper overview: "Deep Residual Learning for Image Recognition"

The ImageNet classification task: 1000 classes

Pages 4-8: Paper overview: "Deep Residual Learning for Image Recognition"

2012: 8 layers, 15.31% error
2013: 9 layers, 2x params, 11.74% error
2014: 19 layers, 7.41% error
2015: ?

Is learning better networks as easy as stacking more layers?

Pages 9-11: Paper overview: "Deep Residual Learning for Image Recognition"

Is learning better networks as easy as stacking more layers?

The first obstacle, vanishing / exploding gradients, is largely addressed by normalized initialization & intermediate normalization.

What remains is the degradation problem.

Pages 12-13: Paper overview: "Deep Residual Learning for Image Recognition"

Degradation problem: "with the network depth increasing, accuracy gets saturated"

It is not caused by overfitting: the deeper network shows higher training error as well, not just higher test error.

Pages 14-21: Paper overview: "Deep Residual Learning for Image Recognition"

[Diagram: a thought experiment in three steps. (1) A shallow stack of Conv layers is trained and tested: accuracy X%. (2) A deeper network made of those same trained Conv layers followed by identity layers is tested: same performance. (3) A network of the same increased depth, built entirely of Conv layers, is trained from scratch and tested: worse!]

"Our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time)"

"Solvers might have difficulties in approximating identity mappings by multiple nonlinear layers"
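The constructed solution from the thought experiment can be stated directly in code. Below is a minimal sketch, assuming PyTorch; the layer shapes are illustrative stand-ins, not the paper's architecture. Appending identity layers to a trained shallow network yields a deeper network with exactly the same outputs, so in principle a deeper network should never do worse.

```python
import torch
import torch.nn as nn

# a "trained" shallow stack (random weights stand in for trained ones)
shallow = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
)

# the constructed solution: the same layers followed by identity layers
deeper = nn.Sequential(shallow, nn.Identity(), nn.Identity())

x = torch.randn(1, 3, 32, 32)
assert torch.equal(shallow(x), deeper(x))  # identical outputs by construction
```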

Pages 22-24: Paper overview: "Deep Residual Learning for Image Recognition"

Add explicit identity connections and "solvers may simply drive the weights of the multiple nonlinear layers toward zero"

H(x) is the true function we want to learn.

Let's pretend we want to learn F(x) = H(x) - x instead.

The original function is then H(x) = F(x) + x.
Page 25: Paper overview: "Deep Residual Learning for Image Recognition"
Pages 26-27: Paper overview: "Deep Residual Learning for Image Recognition"

Network can decide how deep it needs to be…

"The identity connections introduce neither extra parameter nor computation complexity"

Pages 28-29: Paper overview: "Deep Residual Learning for Image Recognition"

2012: 8 layers, 15.31% error
2013: 9 layers, 2x params, 11.74% error
2014: 19 layers, 7.41% error
2015: 152 layers, 3.57% error
Page 30: Paper overview: "Deep Residual Learning for Image Recognition"

EXPERIMENTS AND DETAILS

Pages 31-33: Paper overview: "Deep Residual Learning for Image Recognition"

• Lots of convolutional 3x3 layers
• VGG complexity is 19.6 billion FLOPs; the 34-layer ResNet is 3.6 billion FLOPs
• Batch normalization
• SGD with batch size 256
• (up to) 600,000 iterations
• LR 0.1 (divided by 10 when error plateaus; see the sketch below)
• Momentum 0.9
• No dropout
• Weight decay 0.0001
• 1.28 million training images, 50,000 validation, 100,000 test
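The optimizer settings above map onto standard tooling. A minimal sketch, assuming PyTorch; `model` is a hypothetical stand-in for the 34-layer ResNet, and the plateau-based schedule approximates the manual "divide by 10" rule.

```python
import torch
import torch.nn as nn

# hypothetical stand-in for the 34-layer ResNet
model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # initial LR 0.1
    momentum=0.9,       # momentum 0.9
    weight_decay=1e-4,  # weight decay 0.0001
)
# divide the LR by 10 when the validation error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
# inside the training loop: call scheduler.step(validation_error) after each evaluation
```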

Pages 34-36: Paper overview: "Deep Residual Learning for Image Recognition"

• The 34-layer ResNet has lower training error than the 18-layer one. This indicates that the degradation problem is well addressed and accuracy gains are obtained from increased depth.
• The 34-layer ResNet reduces the top-1 error by 3.5% compared to its plain counterpart.
• The 18-layer ResNet converges faster: ResNet eases the optimization by providing faster convergence at the early stage.

Page 37: Paper overview: "Deep Residual Learning for Image Recognition"

GOING DEEPER

Page 38: Paper overview: "Deep Residual Learning for Image Recognition"

Due to time complexity, the usual building block is replaced by the bottleneck block (sketched below).

50- / 101- / 152-layer ResNets are built from these blocks.
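A minimal sketch of such a bottleneck block, assuming PyTorch; the 256 -> 64 -> 256 channel sizes follow the paper's first bottleneck stage, and the shortcut here assumes matching input and output channels.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 expand, wrapped by an identity shortcut."""
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),        # 1x1: reduce dims
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),  # 3x3 on the smaller volume
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),        # 1x1: restore dims
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)  # same F(x) + x form, much cheaper F
```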

Pages 39-41: Paper overview: "Deep Residual Learning for Image Recognition"

ANALYSIS ON CIFAR-10

Pages 42-44: Paper overview: "Deep Residual Learning for Image Recognition"

ImageNet Classification 2015: 1st place, 3.57% error
ImageNet Object Detection 2015: 1st place, 194 / 200 categories
ImageNet Object Localization 2015: 1st place, 9.02% error
COCO Detection 2015: 1st place, 37.3%
COCO Segmentation 2015: 1st place, 28.2%

http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf