deep residual learning - cs.kangwon.ac.krcs.kangwon.ac.kr/~leeck/ai2/deep_residual_learning.pdf ·...

Deep Residual LearningMSRA @ ILSVRC & COCO 2015 competitions

Kaiming Hewith Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, & Jian Sun

Microsoft Research Asia (MSRA)

MSRA @ ILSVRC & COCO 2015 Competitions

• 1st places in all five main tracks• ImageNet Classification: “Ultra-deep” (quote Yann) 152-layer nets • ImageNet Detection: 16% better than 2nd• ImageNet Localization: 27% better than 2nd• COCO Detection: 11% better than 2nd• COCO Segmentation: 12% better than 2nd

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.

*improvements are relative numbers

Revolution of Depth

6.7 7.3

25.828.2

ILSVRC'15ResNet

ILSVRC'14GoogleNet

ILSVRC'14VGG

ILSVRC'13 ILSVRC'12AlexNet

ILSVRC'11 ILSVRC'10

ImageNet Classification top-5 error (%)

shallow8 layers

19 layers22 layers

152 layers

8 layers

Revolution of Depth

HOG, DPM AlexNet(RCNN)

VGG(RCNN)

ResNet(Faster RCNN)*

PASCAL VOC 2007 Object Detection mAP (%)

shallow8 layers

16 layers

101 layers

*w/ other improvements & more data

Engines ofvisual recognition

Revolution of Depth11x11 conv, 96, /4, pool/2

5x5 conv, 256, pool/2

3x3 conv, 384

fc, 4096

fc, 1000

AlexNet, 8 layers(ILSVRC 2012)

Revolution of Depth11x11 conv, 96, /4, pool/2

3x3 conv, 384

fc, 4096

fc, 1000

3x3 conv, 64

3x3 conv, 128

3x3 conv, 256

3x3 conv, 512

fc, 4096

fc, 1000

VGG, 19 layers(ILSVRC 2014)

Conv7x7+ 2(S)

MaxPool 3x3+ 2(S)

LocalRespNorm

Conv1x1+ 1(V)

Conv3x3+ 1(S)

LocalRespNorm

MaxPool 3x3+ 2(S)

Conv Conv Conv Conv1x1+ 1(S) 3x3+ 1(S) 5x5+ 1(S) 1x1+ 1(S)

Conv Conv MaxPool 1x1+ 1(S) 1x1+ 1(S) 3x3+ 1(S)

Dept hConcat

MaxPool 3x3+ 2(S)

Dept hConcat

AveragePool 5x5+ 3(V)

Dept hConcat

MaxPool 3x3+ 2(S)

Dept hConcat

Conv1x1+ 1(S)

Soft maxAct ivat ion

soft max0

Conv1x1+ 1(S)

soft max1

soft max2

GoogleNet, 22 layers(ILSVRC 2014)

1x1 conv, 64

3x3 conv, 64

1x1 conv, 256

1x1 conv, 64

3x3 conv, 64

1x1 conv, 256

1x1 conv, 64

3x3 conv, 64

1x1 conv, 256

1x1 conv, 128, /2

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 256, /2

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 512, /2

3x3 conv, 512

1x1 conv, 2048

1x1 conv, 512

3x3 conv, 512

1x1 conv, 2048

1x1 conv, 512

3x3 conv, 512

1x1 conv, 2048

ave pool, fc 1000

7x7 conv, 64, /2, pool/2

Revolution of DepthResNet, 152 layers

(ILSVRC 2015)

3x3 conv, 64

3x3 conv, 128

3x3 conv, 256

3x3 conv, 512

fc, 4096

fc, 1000

11x11 conv, 96, /4, pool/2

3x3 conv, 384

fc, 4096

fc, 1000

VGG, 19 layers(ILSVRC 2014)

1x1 conv, 64

3x3 conv, 64

1x1 conv, 256

1x1 conv, 64

3x3 conv, 64

1x1 conv, 256

1x1 conv, 64

3x3 conv, 64

1x1 conv, 256

1x1 conv, 128, /2

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 256, /2

3x3 conv, 256

7x7 conv, 64, /2, pool/2

(there was an animation here)

x conv, 5

x conv, 8

3x3 conv, 8

x conv, 5

x conv, 8

3x3 conv, 8

1x1 conv, 512

1x1 conv, 128

3x3 conv, 128

1x1 conv, 512

1x1 conv, 256, /2

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

x conv, 56

3x3 conv, 56

x conv, 0 4

x conv, 56

3x3 conv, 56

x conv, 0 4

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

3x3 conv, 56

x conv, 0 4

x conv, 56

3x3 conv, 56

x conv, 0 4

x conv, 56

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 256

3x3 conv, 256

1x1 conv, 1024

1x1 conv, 512, /2

3x3 conv, 512

1x1 conv, 2048

1x1 conv, 512

3x3 conv, 512

1x1 conv, 2048

1x1 conv, 512

3x3 conv, 512

1x1 conv, 2048

ave pool, fc 1000

Is learning better networksas simple as stacking more layers?

Simply stacking layers?

0 1 2 3 4 5 60

iter. (1e4)

train error (%)

0 1 2 3 4 5 60

iter. (1e4)

test error (%)CIFAR-10

56-layer

20-layer

56-layer

20-layer

• Plain nets: stacking 3x3 conv layers…• 56-layer net has higher training error and test error than 20-layer net

Simply stacking layers?

0 1 2 3 4 5 60

iter. (1e4)

plain-20plain-32plain-44plain-56

CIFAR-10

20-layer32-layer44-layer56-layer

0 10 20 30 40 5020

iter. (1e4)

plain-18plain-34

ImageNet-1000

34-layer

18-layer

• “Overly deep” plain nets have higher training error• A general phenomenon, observed in many datasets

solid: test/valdashed: train

7x7 conv, 64, /2

3x3 conv, 64

3x3 conv, 128, /2

3x3 conv, 128

3x3 conv, 256, /2

3x3 conv, 256

3x3 conv, 512, /2

3x3 conv, 512

fc 1000

a shallowermodel

(18 layers)

a deepercounterpart(34 layers)

7x7 conv, 64, /2

3x3 conv, 64

3x3 conv, 128, /2

3x3 conv, 128

3x3 conv, 256, /2

3x3 conv, 256

3x3 conv, 512, /2

3x3 conv, 512

fc 1000

“extra” layers

• A deeper model should not have higher training error

• A solution by construction:• original layers: copied from a

learned shallower model• extra layers: set as identity• at least the same training error

• Optimization difficulties: solvers cannot find the solution when going deeper…

Deep Residual Learning

• Plaint net

any twostacked layers

𝑥𝑥

𝐻𝐻(𝑥𝑥)

weight layer

𝐻𝐻 𝑥𝑥 is any desired mapping,

hope the 2 weight layers fit 𝐻𝐻(𝑥𝑥)

• Residual net

𝐻𝐻 𝑥𝑥 is any desired mapping,

hope the 2 weight layers fit 𝐻𝐻(𝑥𝑥)

hope the 2 weight layers fit 𝐹𝐹(𝑥𝑥)

let 𝐻𝐻 𝑥𝑥 = 𝐹𝐹 𝑥𝑥 + 𝑥𝑥weight layer

weight layer

𝑥𝑥

𝐻𝐻 𝑥𝑥 = 𝐹𝐹 𝑥𝑥 + 𝑥𝑥

identity𝑥𝑥

𝐹𝐹(𝑥𝑥)

• 𝐹𝐹 𝑥𝑥 is a residual mapping w.r.t. identity

• If identity were optimal,easy to set weights as 0

• If optimal mapping is closer to identity,easier to find small fluctuations

weight layer

𝑥𝑥

𝐻𝐻 𝑥𝑥 = 𝐹𝐹 𝑥𝑥 + 𝑥𝑥

identity𝑥𝑥

𝐹𝐹(𝑥𝑥)

7x7 conv, 64, /2

pool, /2

3x3 conv, 64

3x3 conv, 128, /2

3x3 conv, 128

3x3 conv, 256, /2

3x3 conv, 256

3x3 conv, 512, /2

3x3 conv, 512

avg pool

fc 1000

7x7 conv, 64, /2

pool, /2

3x3 conv, 64

3x3 conv, 128, /2

3x3 conv, 128

3x3 conv, 256, /2

3x3 conv, 256

3x3 conv, 512, /2

3x3 conv, 512

avg pool

fc 1000

Network “Design”

• Keep it simple

• Our basic design (VGG-style)• all 3x3 conv (almost)

• spatial size /2 => # filters x2• Simple design; just deep!

• Other remarks:• no max pooling (almost)

• no hidden fc• no dropout

plain net ResNet

Training

• All plain/residual nets are trained from scratch

• All plain/residual nets use Batch Normalization

• Standard hyper-parameters & augmentation

CIFAR-10 experiments

0 1 2 3 4 5 60

iter. (1e4)

plain-20plain-32plain-44plain-56

CIFAR-10 plain nets

0 1 2 3 4 5 60

iter. (1e4)

ResNet-20ResNet-32ResNet-44ResNet-56ResNet-110

CIFAR-10 ResNets

110-layer

• Deep ResNets can be trained without difficulties• Deeper ResNets have lower training error, and also lower test error

solid: testdashed: train

ImageNet experiments

0 10 20 30 40 5020

iter. (1e4)

ResNet-18ResNet-34

0 10 20 30 40 5020

iter. (1e4)

plain-18plain-34

ImageNet plain nets ImageNet ResNets

solid: testdashed: train

34-layer

18-layer

34-layer

• Deep ResNets can be trained without difficulties• Deeper ResNets have lower training error, and also lower test error

• A practical design of going deeper

3x3, 64

1x1, 64relu

1x1, 256relu

all-3x3

bottleneck(for ResNet-50/101/152)

similar complexity

6.15.7

ResNet-34ResNet-50ResNet-101ResNet-15210-crop testing, top-5 val error (%)

this model haslower time complexity

than VGG-16/19

• Deeper ResNets have lower error

6.7 7.3

25.828.2

ILSVRC'15ResNet

ILSVRC'14GoogleNet

ILSVRC'14VGG

ILSVRC'13 ILSVRC'12AlexNet

ILSVRC'11 ILSVRC'10

ImageNet Classification top-5 error (%)

shallow8 layers

19 layers22 layers

152 layers

8 layers

deep residual learning - cs.kangwon.ac.krcs.kangwon.ac.kr/~leeck/ai2/deep_residual_learning.pdf ·...

Documents

adaptive deep residual network for single image...

development of in-house fully residual deep convolutional

deep residual learning for portfolio optimization:...

deep residual learning for compressed sensing ct ...deep...

deep residual neural networks resolve quartet molecular...

paper overview: "deep residual learning for image...

optimization properties of deep residual networks

deep residual learning microsoft research for image...

enhanced deep residual networks for single image super...

beyond deep residual learning for image...

deep residual networks - icml · deep residual networks...

identity mappings in deep residual networks arxiv:1603 ......

layer-parallel training of deep residual neural networks

short-term load forecasting with deep residual networks

deep cross residual learning for multitask visual...

deep residual learning for image recognition

deep residual learning for image recognition ·...

msa-mil: a deep residual multiple instance learning model

learning strict identity mappings in deep residual...

residual distillation: towards portable deep neural networks...