TRANSCRIPT
Justin Johnson, September 24, 2019

Lecture 7: Convolutional Networks
Reminder: A2
Due Monday, September 30, 11:59pm (even if you enrolled late!)
Your submission must pass the validation script.
Slight schedule change
Content originally planned for today got split into two lectures. This pushes the schedule back a bit:
- A4 due date: Friday 11/1 -> Friday 11/8
- A5 due date: Friday 11/15 -> Friday 11/22
- A6 due date: still Friday 12/6
Last Time: Backpropagation
[Figure: computational graph for a linear classifier: input x and weight matrix W produce scores s, which feed a hinge loss; a regularization term R(W) is added to give the total loss L]
Represent complex expressions as computational graphs.
Forward pass computes outputs.
Backward pass computes gradients.
[Figure: a node f receives an upstream gradient and has local gradients for each input]
During the backward pass, each node in the graph receives upstream gradients and multiplies them by local gradients to compute downstream gradients.
[Figure: a 2x2 input image with pixel values 56, 231, 24, 2 is stretched into a column vector of shape (4,). For a 32x32x3 image: input 3072, hidden layer 100, output 10, with f(x, W) = Wx]
Problem: So far our classifiers don't respect the spatial structure of images!
Solution: Define new computational nodes that operate on images!
Components of a Fully-Connected Network
Fully-Connected Layers; Activation Function
Components of a Convolutional Network
Convolution Layers; Pooling Layers; Fully-Connected Layers; Activation Function; Normalization
Fully-Connected Layer
32x32x3 image -> stretch to 3072x1
Input: 3072x1; weights: 10x3072; output: 10x1
Each output is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
Convolution Layer
3x32x32 image: preserve spatial structure (3 channels/depth, height 32, width 32)
3x5x5 filter
Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".
Filters always extend the full depth of the input volume.
At each position we get 1 number: the result of taking a dot product between the filter and a small 3x5x5 chunk of the image (i.e. a 3*5*5 = 75-dimensional dot product + bias).
Convolve (slide) over all spatial locations to get a 1x28x28 activation map.
Consider repeating with a second (green) filter: now we get two 1x28x28 activation maps.
Consider 6 filters, each 3x5x5: we get 6 activation maps, each 1x28x28. Stack activations to get a 6x28x28 output image!
There is also a 6-dim bias vector: one bias per filter.
Equivalently, the output is a 28x28 grid, with a 6-dim vector at each point.
For a batch of images: a 2x3x32x32 batch of inputs gives a 2x6x28x28 batch of outputs.
In general: an N x Cin x H x W batch of images, convolved with Cout x Cin x Kh x Kw filters plus a Cout-dim bias vector, gives an N x Cout x H' x W' batch of outputs.
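To make the shapes concrete, here is a minimal PyTorch sketch (assuming PyTorch is available; the sizes mirror the 6-filter example above):

```python
import torch
import torch.nn as nn

# Batch of 2 images, 3 channels, 32x32 -- as in the example above
x = torch.randn(2, 3, 32, 32)

# 6 filters, each 3x5x5 (Cout=6, Cin=3, K=5), stride 1, no padding
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)

y = conv(x)
print(y.shape)            # torch.Size([2, 6, 28, 28])
print(conv.weight.shape)  # torch.Size([6, 3, 5, 5])
print(conv.bias.shape)    # torch.Size([6])
```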
Stacking Convolutions
Input: N x 3 x 32 x 32
Conv W1: 6x3x5x5, b1: 6 -> first hidden layer: N x 6 x 28 x 28
Conv W2: 10x6x3x3, b2: 10 -> second hidden layer: N x 10 x 26 x 26
Conv W3: 12x10x3x3, b3: 12 -> ...
Q: What happens if we stack two convolution layers? A: We get another convolution! (Recall y = W2 W1 x is a linear classifier.)
So we insert an activation function between convolutions: Conv, ReLU, Conv, ReLU, Conv, ReLU.
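As a sketch, that three-layer stack might look like this in PyTorch (layer sizes taken from the slide above):

```python
import torch
import torch.nn as nn

# Conv -> ReLU -> Conv -> ReLU -> Conv -> ReLU
model = nn.Sequential(
    nn.Conv2d(3, 6, kernel_size=5),    # W1: 6x3x5x5, b1: 6
    nn.ReLU(),
    nn.Conv2d(6, 10, kernel_size=3),   # W2: 10x6x3x3, b2: 10
    nn.ReLU(),
    nn.Conv2d(10, 12, kernel_size=3),  # W3: 12x10x3x3, b3: 12
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)
print(model(x).shape)  # torch.Size([1, 12, 24, 24])
```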
What do convolutional filters learn?
Linear classifier: one template per class.
MLP: bank of whole-image templates.
First-layer conv filters: local image templates (they often learn oriented edges and opposing colors). Example: AlexNet's first layer has 64 filters, each 3x11x11.
A closer look at spatial dimensions
Input: 7x7; Filter: 3x3; Output: 5x5
In general: Input: W; Filter: K; Output: W - K + 1
Problem: Feature maps "shrink" with each layer!
Solution: padding. Add zeros around the input.
[Figure: the 7x7 input surrounded by a border of zeros, giving a 9x9 padded input]
In general: Input: W; Filter: K; Padding: P; Output: W - K + 1 + 2P
Very common: set P = (K - 1) / 2 to make the output have the same size as the input!
Receptive Fields
For convolution with kernel size K, each element in the output depends on a K x K receptive field in the input.
Each successive convolution adds K - 1 to the receptive field size: with L layers the receptive field size is 1 + L * (K - 1).
Be careful: "receptive field in the input" vs "receptive field in the previous layer". Hopefully clear from context!
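A quick sanity check of the formula in plain Python (the helper name is illustrative, not from the slides):

```python
import math

# Receptive field size after stacking num_layers convs with kernel size k
def receptive_field(num_layers, k):
    return 1 + num_layers * (k - 1)

print(receptive_field(6, 3))  # 13

# How many 3x3 conv layers until the receptive field covers a 224-pixel image?
print(math.ceil((224 - 1) / (3 - 1)))  # 112 layers!
```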
Problem: For large images we need many layers for each output to "see" the whole image.
Solution: Downsample inside the network.
Strided Convolution
Input: 7x7; Filter: 3x3; Stride: 2; Output: 3x3
In general: Input: W; Filter: K; Padding: P; Stride: S; Output: (W - K + 2P) / S + 1
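The output-size formula translates directly into a one-line helper (a sketch; the function name is illustrative, and integer division assumes the sizes divide evenly):

```python
def conv_out_size(w, k, p=0, s=1):
    """Spatial output size for input W, kernel K, padding P, stride S."""
    return (w - k + 2 * p) // s + 1

print(conv_out_size(7, 3))       # 5  (no padding, stride 1)
print(conv_out_size(7, 3, p=1))  # 7  ("same" padding)
print(conv_out_size(7, 3, s=2))  # 3  (strided convolution, as above)
```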
Convolution Example
Input volume: 3 x 32 x 32; 10 5x5 filters with stride 1, pad 2.
Output volume size: (32 + 2*2 - 5) / 1 + 1 = 32 spatially, so 10 x 32 x 32.
Number of learnable parameters: 760. Parameters per filter: 3*5*5 + 1 (for bias) = 76; 10 filters, so the total is 10 * 76 = 760.
Number of multiply-add operations: 768,000. There are 10*32*32 = 10,240 outputs, and each output is the inner product of two 3x5x5 tensors (75 elements), so the total is 75 * 10,240 = 768K.
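These counts are easy to verify in PyTorch (a sketch; `numel` sums the elements of the weight and bias tensors):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5,
                 stride=1, padding=2)

# Learnable parameters: 10*3*5*5 weights + 10 biases = 760
num_params = sum(p.numel() for p in conv.parameters())
print(num_params)  # 760

# Multiply-adds: a 75-dim inner product per output element
num_outputs = 10 * 32 * 32
print(num_outputs * 3 * 5 * 5)  # 768000
```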
Example: 1x1 Convolution
A 64 x 56 x 56 input passed through a 1x1 conv with 32 filters gives a 32 x 56 x 56 output (each filter has size 1x1x64, and performs a 64-dimensional dot product).
Stacking 1x1 conv layers gives an MLP operating on each input position.
Lin et al, "Network in Network", ICLR 2014
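A minimal sketch of the example above; note that a 1x1 conv acts like a per-position linear layer across channels:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)              # 64 x 56 x 56 input
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)  # 32 filters, each 1x1x64
print(conv1x1(x).shape)  # torch.Size([1, 32, 56, 56])

# Stacking 1x1 convs (with ReLU) = an MLP applied at every spatial position
mlp_per_position = nn.Sequential(
    nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=1),
)
```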
Convolution Summary
Input: Cin x H x W
Hyperparameters:
- Kernel size: KH x KW
- Number of filters: Cout
- Padding: P
- Stride: S
Weight matrix: Cout x Cin x KH x KW, giving Cout filters of size Cin x KH x KW
Bias vector: Cout
Output size: Cout x H' x W', where:
- H' = (H - K + 2P) / S + 1
- W' = (W - K + 2P) / S + 1
Common settings:
- KH = KW (small square filters)
- P = (K - 1) / 2 ("same" padding)
- Cin, Cout = 32, 64, 128, 256 (powers of 2)
- K = 3, P = 1, S = 1 (3x3 conv)
- K = 5, P = 2, S = 1 (5x5 conv)
- K = 1, P = 0, S = 1 (1x1 conv)
- K = 3, P = 1, S = 2 (downsample by 2)
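These common settings map directly onto `nn.Conv2d` arguments (a sketch; the channel counts here are arbitrary examples):

```python
import torch.nn as nn

conv3x3 = nn.Conv2d(64, 64, kernel_size=3, padding=1, stride=1)  # 3x3 conv
conv5x5 = nn.Conv2d(64, 64, kernel_size=5, padding=2, stride=1)  # 5x5 conv
conv1x1 = nn.Conv2d(64, 64, kernel_size=1, padding=0, stride=1)  # 1x1 conv
downsample = nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2)  # /2
```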
Other Types of Convolution
So far: 2D convolution. Input: Cin x H x W; weights: Cout x Cin x K x K.
1D convolution: Input: Cin x W; weights: Cout x Cin x K.
3D convolution: Input: Cin x H x W x D (a Cin-dim vector at each point in the volume); weights: Cout x Cin x K x K x K.
PyTorch Convolution Layers
[Slides show the torch.nn documentation for PyTorch's convolution layers]
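A sketch of the corresponding PyTorch layers (Conv1d / Conv2d / Conv3d), matching the input shapes from the previous slide:

```python
import torch
import torch.nn as nn

conv1d = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=3)
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3)

print(conv1d(torch.randn(1, 3, 32)).shape)          # [1, 8, 30]
print(conv2d(torch.randn(1, 3, 32, 32)).shape)      # [1, 8, 30, 30]
print(conv3d(torch.randn(1, 3, 32, 32, 32)).shape)  # [1, 8, 30, 30, 30]
```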
Pooling Layers: Another way to downsample
Hyperparameters: kernel size, stride, pooling function
Max Pooling
Single depth slice of the input (4x4):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
Max pooling with 2x2 kernel size and stride 2 gives (2x2):
6 8
3 4
Introduces invariance to small spatial shifts. No learnable parameters!
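The worked example above can be reproduced directly (a sketch using `nn.MaxPool2d`):

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).reshape(2, 2))
# tensor([[6., 8.],
#         [3., 4.]])
```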
Pooling Summary
Input: C x H x W
Hyperparameters:
- Kernel size: K
- Stride: S
- Pooling function (max, avg)
Output: C x H' x W', where:
- H' = (H - K) / S + 1
- W' = (W - K) / S + 1
Learnable parameters: none!
Common settings: max, K = 2, S = 2; max, K = 3, S = 2 (AlexNet)
Convolutional Networks
Classic architecture: [Conv, ReLU, Pool] x N, flatten, [FC, ReLU] x N, FC
Example: LeNet-5 (LeCun et al, "Gradient-based learning applied to document recognition", 1998)
Layer                           | Output Size  | Weight Size
Input                           | 1 x 28 x 28  |
Conv (Cout=20, K=5, P=2, S=1)   | 20 x 28 x 28 | 20 x 1 x 5 x 5
ReLU                            | 20 x 28 x 28 |
MaxPool (K=2, S=2)              | 20 x 14 x 14 |
Conv (Cout=50, K=5, P=2, S=1)   | 50 x 14 x 14 | 50 x 20 x 5 x 5
ReLU                            | 50 x 14 x 14 |
MaxPool (K=2, S=2)              | 50 x 7 x 7   |
Flatten                         | 2450         |
Linear (2450 -> 500)            | 500          | 2450 x 500
ReLU                            | 500          |
Linear (500 -> 10)              | 10           | 500 x 10
As we go through the network:
- Spatial size decreases (using pooling or strided conv)
- Number of channels increases (total "volume" is preserved!)
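A sketch of this LeNet-5 variant in PyTorch, with layer sizes taken from the table above (this mirrors the structure shown here, not the original 1998 implementation):

```python
import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, padding=2), nn.ReLU(),   # 20 x 28 x 28
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 20 x 14 x 14
    nn.Conv2d(20, 50, kernel_size=5, padding=2), nn.ReLU(),  # 50 x 14 x 14
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 50 x 7 x 7
    nn.Flatten(),                                            # 2450
    nn.Linear(2450, 500), nn.ReLU(),
    nn.Linear(500, 10),
)

x = torch.randn(1, 1, 28, 28)  # one MNIST-sized grayscale image
print(lenet5(x).shape)  # torch.Size([1, 10])
```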
Problem: Deep networks are very hard to train!
Batch Normalization
Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015
Idea: "normalize" the outputs of a layer so they have zero mean and unit variance.
Why? It helps reduce "internal covariate shift" and improves optimization.
We can normalize a batch of activations like this: x_hat = (x - mean) / sqrt(var + eps).
This is a differentiable function, so we can use it as an operator in our networks and backprop through it!
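A sketch of that computation on an N x D batch of activations (eps is the small constant from the paper, added for numerical stability):

```python
import torch

x = torch.randn(32, 100)  # batch of N=32 vectors, D=100
eps = 1e-5

mean = x.mean(dim=0)                # per-channel mean, shape (D,)
var = x.var(dim=0, unbiased=False)  # per-channel variance, shape (D,)
x_hat = (x - mean) / torch.sqrt(var + eps)  # zero mean, unit variance

print(x_hat.mean(dim=0).abs().max())  # ~0
print(x_hat.std(dim=0).mean())        # ~1
```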
Input x: shape N x D.
Compute the per-channel mean (shape D) and per-channel std (shape D), and use them to produce the normalized x (shape N x D).
Problem: What if zero-mean, unit variance is too hard of a constraint?
Add learnable scale and shift parameters γ and β (each of shape D); the output is y = γ * x_hat + β, with shape N x D.
Learning γ = σ, β = μ will recover the identity function!
Batch Normalization: Test-Time
Problem: The estimates of μ and σ depend on the minibatch, so we can't compute them the same way at test-time!
Solution: at test time, replace the minibatch μ and σ with (running) averages of the values seen during training.
During testing, batchnorm becomes a linear operator! It can be fused with the previous fully-connected or conv layer.
Batch Normalization for ConvNets
Batch normalization for fully-connected networks: x: N × D; μ, σ: 1 × D; γ, β: 1 × D; y = γ(x - μ)/σ + β (normalize over the batch dimension N)
Batch normalization for convolutional networks (spatial batchnorm, BatchNorm2d): x: N × C × H × W; μ, σ: 1 × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x - μ)/σ + β (normalize over N, H, W)
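In PyTorch these correspond to `nn.BatchNorm1d` (N x D inputs) and `nn.BatchNorm2d` (N x C x H x W inputs); a sketch:

```python
import torch
import torch.nn as nn

bn_fc = nn.BatchNorm1d(100)   # gamma, beta, running stats all have shape (100,)
bn_conv = nn.BatchNorm2d(64)  # one mean/std/gamma/beta per channel

print(bn_fc(torch.randn(32, 100)).shape)          # [32, 100]
print(bn_conv(torch.randn(8, 64, 56, 56)).shape)  # [8, 64, 56, 56]

# Train vs test behavior differs: eval() switches to the running averages
bn_conv.eval()
```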
Batch Normalization
Usually inserted after fully-connected or convolutional layers, and before the nonlinearity: FC -> BN -> tanh -> FC -> BN -> tanh -> ...
Ioffe and Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML 2015
- Makes deep networks much easier to train!
- Allows higher learning rates, faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test-time: can be fused with conv!
[Figure: ImageNet accuracy vs training iterations, showing faster convergence with batch normalization]
However:
- Not well understood theoretically (yet)
- Behaves differently during training and testing: this is a very common source of bugs!
Layer Normalization
Batch normalization for fully-connected networks: x: N × D; μ, σ: 1 × D; γ, β: 1 × D; y = γ(x - μ)/σ + β
Layer normalization for fully-connected networks: x: N × D; μ, σ: N × 1; γ, β: 1 × D; y = γ(x - μ)/σ + β
Same behavior at train and test! Used in RNNs, Transformers.
Ba, Kiros, and Hinton, "Layer Normalization", arXiv 2016
Instance Normalization
Batch normalization for convolutional networks: x: N × C × H × W; μ, σ: 1 × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x - μ)/σ + β
Instance normalization for convolutional networks: x: N × C × H × W; μ, σ: N × C × 1 × 1; γ, β: 1 × C × 1 × 1; y = γ(x - μ)/σ + β
Same behavior at train/test!
Ulyanov et al, "Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis", CVPR 2017
Comparison of Normalization Layers
[Figure from Wu and He, "Group Normalization", ECCV 2018: batch norm, layer norm, instance norm, and group norm each normalize over a different slice of the N × C × (H, W) activation tensor]
Group Normalization
Like layer normalization, but normalizes over groups of channels rather than all channels.
Wu and He, "Group Normalization", ECCV 2018
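A sketch of the four normalization layers in PyTorch, applied to the same N x C x H x W tensor (the group count of 8 is an arbitrary example):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 56, 56)  # N x C x H x W

batch_norm = nn.BatchNorm2d(64)           # normalize over N, H, W per channel
layer_norm = nn.LayerNorm([64, 56, 56])   # normalize over C, H, W per sample
instance_norm = nn.InstanceNorm2d(64)     # normalize over H, W per sample+channel
group_norm = nn.GroupNorm(num_groups=8, num_channels=64)  # channel groups

for norm in (batch_norm, layer_norm, instance_norm, group_norm):
    print(norm(x).shape)  # all keep [8, 64, 56, 56]
```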
Summary: Components of a Convolutional Network
Convolution Layers; Pooling Layers; Fully-Connected Layers; Activation Function; Normalization
The convolution layers are the most computationally expensive!
Problem: What is the right way to combine all these components?
Next time: CNN Architectures