Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - May 14, 2019
Slides: cs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdf
Administrative
- Project Milestone due tomorrow 5/15
- Fill out project registration form by tomorrow even if using late days: https://tinyurl.com/cs231nproject
- Midterm grades will be out tomorrow
Last Time: Generative Models
Autoregressive models: PixelRNN, PixelCNN
van den Oord et al, “Conditional image generation with PixelCNN decoders”, NIPS 2016
Variational Autoencoders
Kingma and Welling, “Auto-encoding variational Bayes”, ICLR 2014
Generative Adversarial Networks (GANs)
Goodfellow et al, “Generative Adversarial Nets”, NIPS 2014
Last Time: GAN Images
Progressive GAN, Karras 2018. Brock et al., 2019
So far: Image Classification
Vector: 4096
Fully-Connected: 4096 to 1000
Class Scores: Cat: 0.9, Dog: 0.05, Car: 0.01, ...
Today: Segmentation, Detection
Computer Vision Tasks
- Classification: CAT (no spatial extent)
- Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
- Object Detection: DOG, DOG, CAT (multiple objects)
- Instance Segmentation: DOG, DOG, CAT (multiple objects)
Semantic Segmentation
Label each pixel in the image with a category label
Don’t differentiate instances, only care about pixels
[Figure: two example images with per-pixel labels (Sky, Trees, Cat, Grass) and (Sky, Trees, Cow, Grass)]
Semantic Segmentation Idea: Sliding Window
Extract a patch from the full image, then classify its center pixel with a CNN (example patch labels: Cow, Cow, Grass)
Problem: Very inefficient! Not reusing shared features between overlapping patches
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014
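To make the cost of the sliding-window idea concrete, here is a minimal NumPy sketch. The function names, the zero padding at the border, and `toy_classifier` (a hypothetical stand-in for the CNN) are assumptions for illustration, not part of the lecture. The point is that the classifier runs once per pixel, so an H x W image needs H * W forward passes.

```python
import numpy as np

def sliding_window_segment(image, classify_patch, patch_size=3):
    """Label every pixel of a 2D image by classifying the patch centered on it."""
    h, w = image.shape
    r = patch_size // 2
    # Zero-pad the border (an assumption) so every pixel has a full patch.
    padded = np.pad(image, r, mode="constant")
    labels = np.empty((h, w), dtype=np.int64)
    calls = 0
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + patch_size, j:j + patch_size]
            labels[i, j] = classify_patch(patch)  # one "forward pass" per pixel
            calls += 1
    return labels, calls

def toy_classifier(patch):
    # Hypothetical stand-in for a CNN: class 1 if the center pixel is bright.
    c = patch.shape[0] // 2
    return int(patch[c, c] > 0.5)

image = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
labels, calls = sliding_window_segment(image, toy_classifier, patch_size=3)
```

Neighboring patches overlap almost entirely, yet nothing computed for one patch is reused for the next, which is the inefficiency the slide points out.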
Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Input: 3 x H x W
Convolutions (Conv, Conv, Conv, Conv): D x H x W
Scores: C x H x W
argmax
Predictions: H x W
Problem: convolutions at original image resolution will be very expensive ...
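The last step of the fully convolutional pipeline (scores of shape C x H x W reduced to predictions of shape H x W) can be sketched as below. The random scores stand in for the output of the convolutional stack, and the name `predict_labels` is an assumption for illustration.

```python
import numpy as np

def predict_labels(scores):
    """Turn per-pixel class scores (C x H x W) into a label map (H x W)."""
    # argmax over the class axis picks the highest-scoring class at each pixel.
    return scores.argmax(axis=0)

rng = np.random.default_rng(0)
C, H, W = 4, 3, 5
scores = rng.standard_normal((C, H, W))  # stand-in for the network output
predictions = predict_labels(scores)
```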
Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Input: 3 x H x W
High-res: D1 x H/2 x W/2
Med-res: D2 x H/4 x W/4
Low-res: D3 x H/4 x W/4
Med-res: D2 x H/4 x W/4
High-res: D1 x H/2 x W/2
Predictions: H x W
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
Downsampling: Pooling, strided convolution
Upsampling: ???
In-Network upsampling: “Unpooling”
Nearest Neighbor
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 1 2 2
1 1 2 2
3 3 4 4
3 3 4 4
“Bed of Nails”
Input: 2 x 2
1 2
3 4
Output: 4 x 4
1 0 2 0
0 0 0 0
3 0 4 0
0 0 0 0
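Both fixed unpooling schemes are a few lines of NumPy. This is a minimal sketch for a single 2D feature map; the function names are assumptions for illustration.

```python
import numpy as np

def nearest_neighbor_unpool(x, factor=2):
    """Copy each input value into a factor x factor block of the output."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def bed_of_nails_unpool(x, factor=2):
    """Place each input value in the top-left corner of its block; zeros elsewhere."""
    h, w = x.shape
    out = np.zeros((h * factor, w * factor), dtype=x.dtype)
    out[::factor, ::factor] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
nn_out = nearest_neighbor_unpool(x)
nails_out = bed_of_nails_unpool(x)
```

Neither operation has learnable parameters; they only decide where the input values land in the larger output.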
In-Network upsampling: “Max Unpooling”
Max Pooling: Remember which element was max!
Input: 4 x 4
1 2 6 3
3 5 2 1
1 2 2 1
7 3 4 8
Output: 2 x 2
5 6
7 8
… Rest of the network
Max Unpooling: Use positions from pooling layer
Input: 2 x 2
1 2
3 4
Output: 4 x 4
0 0 2 0
0 1 0 0
0 0 0 0
3 0 0 4
Corresponding pairs of downsampling and upsampling layers
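The pairing of max pooling with max unpooling can be sketched as follows, using the numbers from the slide. The function names and the explicit index array are assumptions for illustration; the pooling layer records where each maximum came from, and the matching unpooling layer writes its inputs back to those positions.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max pooling that also records the position of each maximum."""
    h, w = x.shape
    out = np.zeros((h // k, w // k), dtype=x.dtype)
    idx = np.zeros((h // k, w // k, 2), dtype=np.int64)
    for i in range(h // k):
        for j in range(w // k):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            r, c = np.unravel_index(np.argmax(window), window.shape)
            out[i, j] = window[r, c]
            idx[i, j] = (i * k + r, j * k + c)  # position in the full input
    return out, idx

def max_unpool(y, idx, out_shape):
    """Write each value of y to the position remembered by the pooling layer."""
    out = np.zeros(out_shape, dtype=y.dtype)
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            r, c = idx[i, j]
            out[r, c] = y[i, j]
    return out

x = np.array([[1, 2, 6, 3],
              [3, 5, 2, 1],
              [1, 2, 2, 1],
              [7, 3, 4, 8]])
pooled, idx = max_pool_with_indices(x)

# Later in the network, a 2 x 2 map is unpooled using the saved positions.
y = np.array([[1, 2],
              [3, 4]])
unpooled = max_unpool(y, idx, x.shape)
```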
Learnable Upsampling: Transpose Convolution
Recall: Normal 3 x 3 convolution, stride 1 pad 1
Input: 4 x 4, Output: 4 x 4
Dot product between filter and input
Recall: Normal 3 x 3 convolution, stride 2 pad 1
Input: 4 x 4, Output: 2 x 2
Dot product between filter and input
Filter moves 2 pixels in the input for every one pixel in the output
Stride gives ratio between movement in input and output
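As a reminder of how stride changes the output size of a normal convolution, here is a naive single-channel sketch. The function names and the all-ones filter are assumptions for illustration; the size formula is the standard (N + 2P - K) / S + 1, rounded down.

```python
import numpy as np

def conv_output_size(n, k, stride, pad):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - k) // stride + 1

def conv2d(x, w, stride=1, pad=0):
    """Naive single-channel 2D convolution (cross-correlation, as in CNNs)."""
    k = w.shape[0]
    xp = np.pad(x, pad, mode="constant")  # zero padding on every side
    out_h = conv_output_size(x.shape[0], k, stride, pad)
    out_w = conv_output_size(x.shape[1], k, stride, pad)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The filter moves `stride` input pixels per output pixel.
            patch = xp[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * w)  # dot product between filter and input
    return out

x = np.ones((4, 4))
w = np.ones((3, 3))
same_size = conv2d(x, w, stride=1, pad=1)    # 4 x 4 output
downsampled = conv2d(x, w, stride=2, pad=1)  # 2 x 2 output
```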
3 x 3 transpose convolution, stride 2 pad 1
Input: 2 x 2, Output: 4 x 4
Input gives weight for filter
Sum where output overlaps
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
Other names:
- Deconvolution (bad)
- Upconvolution
- Fractionally strided convolution
- Backward strided convolution
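A minimal single-channel sketch of the transpose convolution described above: each input value scales a copy of the filter, copies are placed `stride` pixels apart in the output, and overlaps are summed. The function name, the all-ones filter, and the choice to crop the first row and column to reach 4 x 4 are assumptions for illustration; in a real network the filter weights would be learned.

```python
import numpy as np

def conv_transpose2d_full(x, w, stride=2):
    """Uncropped transpose convolution of a 2D input with a k x k filter."""
    k = w.shape[0]
    h, wd = x.shape
    out = np.zeros(((h - 1) * stride + k, (wd - 1) * stride + k))
    for i in range(h):
        for j in range(wd):
            # Input gives weight for filter; sum where copies overlap.
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.ones((3, 3))
full = conv_transpose2d_full(x, w, stride=2)  # 5 x 5 before cropping
upsampled = full[1:, 1:]                      # crop one row and column: 4 x 4
```

The uncropped output is 5 x 5, so one row and one column have to be dropped to get an output exactly twice the input size.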
Learnable Upsampling: 1D Example
Input: a, b
Filter: x, y, z
Output: ax, ay, az + bx, by, bz
Output contains copies of the filter weighted by the input, summing where the copies overlap in the output
Need to crop one pixel from output to make output exactly 2x input
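The 1D example can be checked numerically. In this sketch the values a=1, b=2 and x=10, y=20, z=30 are made up for illustration, and the function name is an assumption; stride 2 is what makes only the third output element (az + bx) receive two contributions.

```python
import numpy as np

def conv_transpose1d(x, w, stride=2):
    """1D transpose convolution: sum of filter copies weighted by the input."""
    k = len(w)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, xi in enumerate(x):
        out[i * stride:i * stride + k] += xi * w
    return out

inp = np.array([1.0, 2.0])           # (a, b)
filt = np.array([10.0, 20.0, 30.0])  # (x, y, z)
out = conv_transpose1d(inp, filt)    # (ax, ay, az + bx, by, bz)
cropped = out[:4]                    # drop one element: exactly 2x the input length
```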
Convolution as Matrix Multiplication (1D Example)
We can express convolution in terms of a matrix multiplication
Example: 1D conv, kernel size=3, stride=1, padding=1
Convolution transpose multiplies by the transpose of the same matrix
When stride=1, convolution transpose is just a regular convolution (with different padding rules)
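The matrices from the slide did not survive in this transcript, so here is a small NumPy sketch of the stride-1 case. The kernel (x, y, z) = (1, 2, 3), the signal (a, b, c, d) = (4, 5, 6, 7), and the helper `conv_matrix` are assumptions for illustration. Each row of the matrix holds the kernel shifted by one position.

```python
import numpy as np

def conv_matrix(kernel, n_padded, stride=1):
    """Matrix X such that X @ padded_signal equals the 1D convolution."""
    k = len(kernel)
    n_out = (n_padded - k) // stride + 1
    X = np.zeros((n_out, n_padded))
    for i in range(n_out):
        X[i, i * stride:i * stride + k] = kernel
    return X

kernel = np.array([1.0, 2.0, 3.0])       # (x, y, z)
signal = np.array([4.0, 5.0, 6.0, 7.0])  # (a, b, c, d)
padded = np.pad(signal, 1)               # (0, a, b, c, d, 0)

X = conv_matrix(kernel, len(padded), stride=1)  # shape (4, 6)
conv_out = X @ padded  # (ay + bz, ax + by + cz, bx + cy + dz, cx + dy)

# Transpose convolution: multiply by the transpose of the same matrix.
transposed_out = X.T @ signal  # length 6
```

With stride 1, `X.T @ signal` matches `np.convolve(signal, kernel)`, an ordinary full convolution with the kernel applied in flipped order, which is what the slide means by a regular convolution with different padding rules.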
Example: 1D conv, kernel size=3, stride=2, padding=1
Convolution transpose multiplies by the transpose of the same matrix
When stride>1, convolution transpose is no longer a normal convolution!
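For stride 2 the rows of the matrix shift by two columns instead of one. This sketch reuses made-up numbers (kernel (x, y, z) = (1, 2, 3), signal (a, b, c, d) = (4, 5, 6, 7)) and a hypothetical helper `conv_matrix`; none of these come from the lecture.

```python
import numpy as np

def conv_matrix(kernel, n_padded, stride=1):
    """Matrix X such that X @ padded_signal equals the strided 1D convolution."""
    k = len(kernel)
    n_out = (n_padded - k) // stride + 1
    X = np.zeros((n_out, n_padded))
    for i in range(n_out):
        X[i, i * stride:i * stride + k] = kernel
    return X

kernel = np.array([1.0, 2.0, 3.0])       # (x, y, z)
signal = np.array([4.0, 5.0, 6.0, 7.0])  # (a, b, c, d)
padded = np.pad(signal, 1)               # (0, a, b, c, d, 0)

X = conv_matrix(kernel, len(padded), stride=2)  # shape (2, 6)
conv_out = X @ padded  # (ay + bz, bx + cy + dz): downsampled by 2

# Transpose: a length-2 input (a, b) = (10, 20) is upsampled to length 6.
small = np.array([10.0, 20.0])
upsampled = X.T @ small  # (ax, ay, az + bx, by, bz, 0)
```

The result has the same pattern as the 1D example: shifted copies of the kernel weighted by the input and summed where they overlap. Because the copies are two positions apart, sliding one fixed filter over the input can no longer reproduce it.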
Semantic Segmentation Idea: Fully Convolutional
Downsampling: Pooling, strided convolution
Upsampling: Unpooling or strided transpose convolution
Object Detection
Object Detection: Impact of Deep Learning
Figure copyright Ross Girshick, 2015. Reproduced with permission.
Object Detection: Single Object (Classification + Localization)
Treat localization as a regression problem!
Vector: 4096 (network often pretrained on ImageNet; transfer learning)
Fully Connected: 4096 to 1000, giving Class Scores (Cat: 0.9, Dog: 0.05, Car: 0.01, ...)
Softmax Loss against the correct label: Cat
Fully Connected: 4096 to 4, giving Box Coordinates (x, y, w, h)
L2 Loss against the correct box: (x’, y’, w’, h’)
Multitask Loss: Loss = Softmax Loss + L2 Loss
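The multitask loss can be sketched for a single image as below. The function names and the `box_weight` hyperparameter are assumptions for illustration; the slides simply add the two losses, which corresponds to `box_weight=1.0`.

```python
import numpy as np

def softmax_cross_entropy(scores, label):
    """Softmax loss for one example given raw class scores."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])

def l2_loss(box_pred, box_true):
    """Squared L2 distance between predicted and correct (x, y, w, h)."""
    return float(np.sum((box_pred - box_true) ** 2))

def multitask_loss(scores, label, box_pred, box_true, box_weight=1.0):
    """Classification loss plus (optionally weighted) box regression loss."""
    return softmax_cross_entropy(scores, label) + box_weight * l2_loss(box_pred, box_true)

scores = np.zeros(3)                        # uniform scores over 3 classes
box_true = np.array([0.5, 0.5, 0.2, 0.3])   # correct box (x', y', w', h')
box_pred = box_true + np.array([1.0, 0.0, 0.0, 0.0])
loss = multitask_loss(scores, 0, box_pred, box_true)
```

Both heads share the same 4096-dimensional feature vector, so the gradient of this summed loss trains the shared network for both tasks at once.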
Object Detection: Multiple Objects
CAT: (x, y, w, h): 4 numbers
DOG: (x, y, w, h), DOG: (x, y, w, h), CAT: (x, y, w, h): 12 numbers
DUCK: (x, y, w, h), DUCK: (x, y, w, h), …: Many numbers!
Each image needs a different number of outputs!
Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Example crops:
- Dog? NO, Cat? NO, Background? YES
- Dog? YES, Cat? NO, Background? NO
- Dog? NO, Cat? YES, Background? NO
Problem: Need to apply CNN to huge number of locations, scales, and aspect ratios, very computationally expensive!
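A rough way to see why exhaustive cropping is too expensive is to count every axis-aligned box in an image. The sketch below is an illustration under the assumption that every integer position and size is a candidate; the 800 x 600 image size and the function names are made up for the example.

```python
def count_boxes(height, width):
    """Number of axis-aligned boxes with integer corners inside an image."""
    # A box of height h fits in (height - h + 1) vertical positions, and
    # summing over h gives height * (height + 1) / 2; the same holds for width.
    return (height * (height + 1) // 2) * (width * (width + 1) // 2)

def count_boxes_brute_force(height, width):
    """Same count by explicit enumeration of box sizes (for checking)."""
    total = 0
    for h in range(1, height + 1):
        for w in range(1, width + 1):
            total += (height - h + 1) * (width - w + 1)
    return total

n_boxes = count_boxes(600, 800)  # tens of billions of candidate crops
```

Even a fast CNN cannot be run on that many crops per image, which is why a small set of candidate regions is needed.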
Region Proposals: Selective Search
● Find “blobby” image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU
Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 50: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/50.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - May 14, 201950
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 51: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/51.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - May 14, 201951
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
![Page 52: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/52.jpg)
![Page 53: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/53.jpg)
![Page 54: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/54.jpg)
![Page 55: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/55.jpg)
R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Problem: Very slow! Need to do ~2k independent forward passes for each image, one per region proposal!
![Page 56: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/56.jpg)
“Slow” R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
Problem: Very slow! Need to do ~2k independent forward passes for each image, one per region proposal!
Idea: Process the whole image with the CNN before cropping! Swap convolution and cropping!
![Page 57: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/57.jpg)
Fast R-CNN
Girshick, “Fast R-CNN”, ICCV 2015. Figure copyright Ross Girshick, 2015; source. Reproduced with permission.
“Slow” R-CNN
![Page 58: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/58.jpg)
![Page 59: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/59.jpg)
![Page 60: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/60.jpg)
![Page 61: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/61.jpg)
![Page 62: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/62.jpg)
![Page 63: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/63.jpg)
![Page 64: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/64.jpg)
Cropping Features: RoI Pool
Girshick, “Fast R-CNN”, ICCV 2015.
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
![Page 65: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/65.jpg)
Cropping Features: RoI Pool
Girshick, “Fast R-CNN”, ICCV 2015.
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
![Page 66: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/66.jpg)
Cropping Features: RoI Pool
Girshick, “Fast R-CNN”, ICCV 2015.
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
“Snap” to grid cells
![Page 67: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/67.jpg)
Cropping Features: RoI Pool
Girshick, “Fast R-CNN”, ICCV 2015.
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
“Snap” to grid cells
Divide into 2x2 grid of (roughly) equal subregions
![Page 68: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/68.jpg)
Cropping Features: RoI Pool
Girshick, “Fast R-CNN”, ICCV 2015.
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
“Snap” to grid cells
Divide into 2x2 grid of (roughly) equal subregions
Max-pool within each subregion
Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)
Region features are always the same size, even if input regions have different sizes!
![Page 69: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/69.jpg)
Cropping Features: RoI Pool
Girshick, “Fast R-CNN”, ICCV 2015.
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
“Snap” to grid cells
Divide into 2x2 grid of (roughly) equal subregions
Max-pool within each subregion
Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)
Region features are always the same size, even if input regions have different sizes!
Problem: Region features slightly misaligned
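The RoI Pool steps listed on these slides (project, snap, divide, max-pool) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the Fast R-CNN implementation: the function name `roi_pool`, the (C, H, W) feature layout with x indexing the last axis, the stride of 32 (inferred from 640 x 480 → 20 x 15), and the exact floor/ceil rounding are all choices made for this example.

```python
import numpy as np

def roi_pool(features, box, stride, out_size=2):
    """features: (C, H, W) array; box: (x0, y0, x1, y1) in input-image pixels."""
    C, H, W = features.shape
    x0, y0, x1, y1 = box
    # Project the proposal onto the feature map and "snap" to whole grid cells.
    fx0 = min(max(int(np.floor(x0 / stride)), 0), W - 1)
    fy0 = min(max(int(np.floor(y0 / stride)), 0), H - 1)
    fx1 = min(max(int(np.ceil(x1 / stride)), fx0 + 1), W)
    fy1 = min(max(int(np.ceil(y1 / stride)), fy0 + 1), H)
    # Divide the snapped region into out_size x out_size roughly equal subregions.
    xs = np.linspace(fx0, fx1, out_size + 1)
    ys = np.linspace(fy0, fy1, out_size + 1)
    out = np.empty((C, out_size, out_size), dtype=features.dtype)
    for i in range(out_size):
        ya = int(np.floor(ys[i]))
        yb = max(int(np.ceil(ys[i + 1])), ya + 1)
        for j in range(out_size):
            xa = int(np.floor(xs[j]))
            xb = max(int(np.ceil(xs[j + 1])), xa + 1)
            # Max-pool within each subregion, independently per channel.
            out[:, i, j] = features[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out

# Hypothetical example: two proposals of different sizes give same-sized outputs.
feats = np.random.default_rng(0).standard_normal((512, 20, 15))
big = roi_pool(feats, (100, 200, 300, 500), stride=32)
small = roi_pool(feats, (0, 0, 64, 96), stride=32)
print(big.shape, small.shape)
```

The floor and ceil calls are where the misalignment noted on the slide comes from: the pooled region can differ from the true proposal by up to one feature cell on each side.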
![Page 70: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/70.jpg)
Cropping Features: RoI Align
He et al, “Mask R-CNN”, ICCV 2017
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
No “snapping”!
![Page 71: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/71.jpg)
Cropping Features: RoI Align
He et al, “Mask R-CNN”, ICCV 2017
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
No “snapping”!
Sample at regular points in each subregion using bilinear interpolation
![Page 72: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/72.jpg)
Cropping Features: RoI Align
He et al, “Mask R-CNN”, ICCV 2017
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
No “snapping”!
Sample at regular points in each subregion using bilinear interpolation
Feature fxy for point (x, y) is a linear combination of features at its four neighboring grid cells.
![Page 73: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/73.jpg)
Cropping Features: RoI Align
He et al, “Mask R-CNN”, ICCV 2017
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
No “snapping”!
Sample at regular points in each subregion using bilinear interpolation
Sample point (x, y) has four neighboring grid cells: f11 ∈ R^512 at (x1, y1), f12 ∈ R^512 at (x1, y2), f21 ∈ R^512 at (x2, y1), f22 ∈ R^512 at (x2, y2)
Feature fxy for point (x, y) is a linear combination of features at its four neighboring grid cells:
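The formula image did not survive in this transcript. The linear combination described here is standard bilinear interpolation; written with the slide's symbols, and assuming neighboring grid cells are one unit apart (x2 = x1 + 1, y2 = y1 + 1), it reads:

```latex
f_{xy} = \sum_{i=1}^{2} \sum_{j=1}^{2} f_{ij}\,
         \max\bigl(0,\, 1 - |x - x_i|\bigr)\,
         \max\bigl(0,\, 1 - |y - y_j|\bigr)
```

Each weight is 1 when the sample point sits exactly on a grid cell and falls linearly to 0 one cell away, so the four weights always sum to 1.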
![Page 74: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/74.jpg)
Cropping Features: RoI Align
He et al, “Mask R-CNN”, ICCV 2017
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Project proposal onto features
No “snapping”!
Sample at regular points in each subregion using bilinear interpolation
Max-pool within each subregion
Region features (here 512 x 2 x 2; in practice e.g. 512 x 7 x 7)
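A rough NumPy sketch of RoI Align as described on these slides: no snapping, bilinear sampling at regular points, then max-pooling per subregion. The names `bilinear_sample` and `roi_align`, the (C, H, W) layout, the choice of 2 x 2 sample points per subregion, and the convention that feature cell k sits at coordinate k are assumptions for illustration; library implementations differ in these details.

```python
import numpy as np

def bilinear_sample(features, x, y):
    """Sample a (C, H, W) feature map at continuous feature-map coordinates (x, y)."""
    C, H, W = features.shape
    x = min(max(x, 0.0), W - 1.0)
    y = min(max(y, 0.0), H - 1.0)
    x1, y1 = int(np.floor(x)), int(np.floor(y))
    x2, y2 = min(x1 + 1, W - 1), min(y1 + 1, H - 1)
    wx2, wy2 = x - x1, y - y1          # weights of the far neighbors
    wx1, wy1 = 1.0 - wx2, 1.0 - wy2    # weights of the near neighbors
    return (wx1 * wy1 * features[:, y1, x1] + wx1 * wy2 * features[:, y2, x1]
            + wx2 * wy1 * features[:, y1, x2] + wx2 * wy2 * features[:, y2, x2])

def roi_align(features, box, stride, out_size=2, samples=2):
    """box: (x0, y0, x1, y1) in input-image pixels; returns (C, out_size, out_size)."""
    C = features.shape[0]
    # Project onto the feature map, keeping fractional coordinates (no snapping).
    x0, y0, x1, y1 = (v / stride for v in box)
    bin_w = (x1 - x0) / out_size
    bin_h = (y1 - y0) / out_size
    out = np.empty((C, out_size, out_size), dtype=features.dtype)
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for a in range(samples):
                for b in range(samples):
                    # Regularly spaced sample points inside subregion (i, j).
                    sy = y0 + (i + (a + 0.5) / samples) * bin_h
                    sx = x0 + (j + (b + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(features, sx, sy))
            # Max-pool over the sampled points, as on the slide.
            out[:, i, j] = np.max(vals, axis=0)
    return out

# Hypothetical check: on a feature map that equals the x coordinate,
# bilinear sampling returns the fractional x itself.
ramp = np.tile(np.arange(15, dtype=float), (1, 20, 1))
print(bilinear_sample(ramp, 3.25, 7.5))
```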
![Page 75: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/75.jpg)
R-CNN vs Fast R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014.
Girshick, “Fast R-CNN”, ICCV 2015.
![Page 76: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/76.jpg)
R-CNN vs Fast R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014.
Girshick, “Fast R-CNN”, ICCV 2015.
Problem: Runtime dominated by region proposals!
![Page 77: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/77.jpg)
Faster R-CNN: Make CNN do proposals!
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015. Figure copyright 2015, Ross Girshick; reproduced with permission.
Insert Region Proposal Network (RPN) to predict proposals from features
Otherwise same as Fast R-CNN: Crop features for each proposal, classify each one
![Page 78: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/78.jpg)
Region Proposal Network
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
![Page 79: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/79.jpg)
Region Proposal Network
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Imagine an anchor box of fixed size at each point in the feature map
![Page 80: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/80.jpg)
Region Proposal Network
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Imagine an anchor box of fixed size at each point in the feature map
Conv → Anchor is an object? (1 x 20 x 15)
At each point, predict whether the corresponding anchor contains an object (per-pixel logistic regression)
![Page 81: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/81.jpg)
Region Proposal Network
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
Imagine an anchor box of fixed size at each point in the feature map
Conv → Anchor is an object? (1 x 20 x 15)
Conv → Box transforms (4 x 20 x 15)
For positive boxes, also predict a transformation from the anchor to the ground-truth box (regress 4 numbers per pixel)
![Page 82: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/82.jpg)
Region Proposal Network
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
In practice use K different anchor boxes of different size / scale at each point
Conv → Anchor is an object? (K x 20 x 15)
Conv → Box transforms (4K x 20 x 15)
![Page 83: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/83.jpg)
Region Proposal Network
Input image (e.g. 3 x 640 x 480) → CNN → image features (e.g. 512 x 20 x 15)
In practice use K different anchor boxes of different size / scale at each point
Conv → Anchor is an object? (K x 20 x 15)
Conv → Box transforms (4K x 20 x 15)
Sort the K*20*15 boxes by their “object” score, take top ~300 as our proposals
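The proposal-selection step can be sketched as follows. The slides do not specify how the 4 transform numbers are parameterized, so this example assumes the commonly used (tx, ty, tw, th) form (center shift scaled by anchor size, log-scale width and height); the function name `rpn_proposals`, the anchor sizes, the (K, 4) channel ordering, and the anchor centers at cell midpoints are likewise hypothetical.

```python
import numpy as np

def rpn_proposals(scores, transforms, anchor_sizes, stride=32, top_n=300):
    """scores: (K, H, W); transforms: (4K, H, W); anchor_sizes: K (width, height) pairs."""
    K, H, W = scores.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Anchor centers in input-image pixels, one per feature-map point.
    cx = (xs + 0.5) * stride
    cy = (ys + 0.5) * stride
    t = transforms.reshape(K, 4, H, W)  # assumed channel ordering
    boxes = np.empty((K, H, W, 4))
    for k, (aw, ah) in enumerate(anchor_sizes):
        tx, ty, tw, th = t[k]
        # Assumed parameterization: shift the center, rescale width and height.
        pcx = cx + tx * aw
        pcy = cy + ty * ah
        pw = aw * np.exp(tw)
        ph = ah * np.exp(th)
        boxes[k] = np.stack(
            [pcx - pw / 2, pcy - ph / 2, pcx + pw / 2, pcy + ph / 2], axis=-1)
    # Sort all K*H*W boxes by objectness score and keep the top_n.
    order = np.argsort(-scores.ravel())[:top_n]
    return boxes.reshape(-1, 4)[order], scores.ravel()[order]

# Hypothetical example with K = 3 anchors per point on a 20 x 15 feature map.
rng = np.random.default_rng(0)
scores = rng.standard_normal((3, 20, 15))
anchor_sizes = [(64, 64), (128, 64), (64, 128)]
boxes, top = rpn_proposals(scores, np.zeros((12, 20, 15)), anchor_sizes)
print(boxes.shape, top.shape)
```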
![Page 84: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/84.jpg)
Faster R-CNN: Make CNN do proposals!
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015. Figure copyright 2015, Ross Girshick; reproduced with permission.
Jointly train with 4 losses:
1. RPN classify object / not object
2. RPN regress box coordinates
3. Final classification score (object classes)
4. Final box coordinates
![Page 85: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/85.jpg)
Faster R-CNN: Make CNN do proposals!
![Page 86: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/86.jpg)
Faster R-CNN: Make CNN do proposals!
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015. Figure copyright 2015, Ross Girshick; reproduced with permission.
Glossing over many details:
- Ignore overlapping proposals with non-max suppression
- How to determine whether a proposal is positive or negative?
- How many positives / negatives to send to second stage?
- How to parameterize bounding box regression?
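Of the details glossed over here, non-max suppression is easy to illustrate. The sketch below is the usual greedy variant: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The function names and the IoU threshold of 0.5 are assumptions for this example, not values taken from the lecture.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all as (x0, y0, x1, y1)."""
    ix0 = np.maximum(box[0], boxes[:, 0])
    iy0 = np.maximum(box[1], boxes[:, 1])
    ix1 = np.minimum(box[2], boxes[:, 2])
    iy1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix1 - ix0, 0, None) * np.clip(iy1 - iy0, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression; returns indices of kept boxes, best first."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Discard remaining boxes that overlap the kept box too much.
        order = rest[iou(boxes[i], boxes[rest]) <= iou_threshold]
    return keep

# Hypothetical example: the second box overlaps the first heavily and is dropped.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)
```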
![Page 87: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/87.jpg)
Faster R-CNN: Make CNN do proposals!
Faster R-CNN is a two-stage object detector
First stage: Run once per image
- Backbone network
- Region proposal network
Second stage: Run once per region
- Crop features: RoI pool / align
- Predict object class
- Predict bbox offset
![Page 88: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/88.jpg)
Faster R-CNN: Make CNN do proposals!
Faster R-CNN is a two-stage object detector
First stage: Run once per image
- Backbone network
- Region proposal network
Second stage: Run once per region
- Crop features: RoI pool / align
- Predict object class
- Predict bbox offset
Do we really need the second stage?
![Page 89: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/89.jpg)
Single-Stage Object Detectors: YOLO / SSD / RetinaNet
Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016
Lin et al, “Focal Loss for Dense Object Detection”, ICCV 2017
Input image: 3 x H x W
Divide image into grid: 7 x 7
Imagine a set of base boxes centered at each grid cell (here B = 3)
Within each grid cell:
- Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
- Predict scores for each of C classes (including background as a class)
- Looks a lot like RPN, but category-specific!
Output: 7 x 7 x (5 * B + C)
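To make the output size concrete, the snippet below splits a 7 x 7 x (5 * B + C) tensor into its box and class parts. B = 3 follows the slide; C = 21 (20 object classes plus background) and the channel ordering (all box numbers first, then class scores) are assumptions chosen only for illustration, as is the function name `split_predictions`.

```python
import numpy as np

def split_predictions(output, B, C):
    """output: (S, S, 5*B + C) -> boxes (S, S, B, 5) and class scores (S, S, C)."""
    S = output.shape[0]
    assert output.shape == (S, S, 5 * B + C)
    # Each base box contributes (dx, dy, dh, dw, confidence).
    boxes = output[..., :5 * B].reshape(S, S, B, 5)
    # One score per class for each grid cell.
    class_scores = output[..., 5 * B:]
    return boxes, class_scores

S, B, C = 7, 3, 21  # C is a hypothetical value
output = np.zeros((S, S, 5 * B + C))
boxes, class_scores = split_predictions(output, B, C)
print(output.shape, boxes.shape, class_scores.shape)
```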
![Page 90: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/90.jpg)
Object Detection: Lots of variables ...
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Backbone Network: VGG16, ResNet-101, Inception V2, Inception V3, Inception ResNet, MobileNet
“Meta-Architecture”: Two-stage: Faster R-CNN; Single-stage: YOLO / SSD; Hybrid: R-FCN
Image Size, # Region Proposals, …
Takeaways:
- Faster R-CNN is slower but more accurate
- SSD is much faster but not as accurate
- Bigger / Deeper backbones work better
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017
![Page 91: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/91.jpg)
Object Detection: Lots of variables ...
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
Zou et al, “Object Detection in 20 Years: A Survey”, arXiv 2019 (today!)
![Page 92: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/92.jpg)
Instance Segmentation
Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
![Page 93: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/93.jpg)
Object Detection: Faster R-CNN
![Page 94: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/94.jpg)
Instance Segmentation: Mask R-CNN
He et al, “Mask R-CNN”, ICCV 2017
Mask Prediction
Add a small mask network that operates on each RoI and predicts a 28x28 binary mask
![Page 95: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/95.jpg)
Mask R-CNN
He et al, “Mask R-CNN”, arXiv 2017
CNN + RPN → RoI Align → 256 x 14 x 14 → Conv → 256 x 14 x 14 → Conv → C x 28 x 28
Classification scores: C
Box coordinates (per class): 4 * C
Predict a mask for each of C classes: C x 28 x 28
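Since the mask head outputs one 28 x 28 mask per class, a single mask has to be picked for each RoI at test time. One plausible way, sketched below, is to take the mask of the highest-scoring class and binarize it. The per-pixel sigmoid, the 0.5 threshold, and the name `select_mask` are assumptions for this example and are not stated on the slide.

```python
import numpy as np

def select_mask(mask_logits, class_scores, threshold=0.5):
    """mask_logits: (C, 28, 28); class_scores: (C,). Returns (class index, binary mask)."""
    c = int(np.argmax(class_scores))
    # Per-pixel sigmoid on the chosen class's mask, then binarize (assumed threshold).
    probs = 1.0 / (1.0 + np.exp(-mask_logits[c]))
    return c, probs > threshold

# Hypothetical example with C = 3 classes: only class 1 has a confident mask.
logits = np.full((3, 28, 28), -5.0)
logits[1] = 5.0
cls, mask = select_mask(logits, np.array([0.1, 0.8, 0.1]))
print(cls, mask.shape, mask.all())
```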
![Page 96: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/96.jpg)
Mask R-CNN: Example Mask Training Targets
![Page 97: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/97.jpg)
Mask R-CNN: Example Mask Training Targets
![Page 98: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/98.jpg)
Mask R-CNN: Example Mask Training Targets
![Page 99: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/99.jpg)
Mask R-CNN: Example Mask Training Targets
![Page 100: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/100.jpg)
Mask R-CNN: Very Good Results!
He et al, “Mask R-CNN”, ICCV 2017
![Page 101: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/101.jpg)
Mask R-CNN: Also does pose
He et al, “Mask R-CNN”, ICCV 2017
![Page 102: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/102.jpg)
Fei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - May 14, 2019
Open Source Frameworks
Lots of good implementations on GitHub!
TensorFlow Detection API: https://github.com/tensorflow/models/tree/master/research/object_detection
Models: Faster R-CNN, SSD, R-FCN, Mask R-CNN
Caffe2 Detectron: https://github.com/facebookresearch/Detectron
Models: Mask R-CNN, RetinaNet, Faster R-CNN, RPN, Fast R-CNN, R-FCN
Finetune on your own dataset with pre-trained models
![Page 103: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/103.jpg)
Computer Vision Tasks
Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
This image is CC0 public domain
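The four tasks differ mainly in what the output looks like. The sketch below starts from a hypothetical per-pixel instance map (the richest of the four outputs) and derives a semantic label map and detection boxes from it. The toy 4 x 6 map, the class names, and the helper names `semantic_map` and `boxes` are assumptions for illustration only.

```python
# Hypothetical 4 x 6 instance map: 0 = background, 1..K = object ids.
instance_map = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 0, 0, 0],
]
instance_class = {1: "DOG", 2: "CAT"}


def semantic_map(instance_map, instance_class, background="GRASS"):
    """Semantic segmentation output: one class label per pixel, no object ids."""
    return [[instance_class.get(inst, background) for inst in row]
            for row in instance_map]


def boxes(instance_map, instance_class):
    """Detection output: [(label, (x0, y0, x1, y1))], inclusive pixel coords."""
    extents = {}
    for y, row in enumerate(instance_map):
        for x, inst in enumerate(row):
            if inst == 0:
                continue  # background pixels belong to no object
            x0, y0, x1, y1 = extents.get(inst, (x, y, x, y))
            extents[inst] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return [(instance_class[i], extents[i]) for i in sorted(extents)]


semantic = semantic_map(instance_map, instance_class)
detections = boxes(instance_map, instance_class)
print(semantic[1])
print(detections)
```

Going the other way is not possible without more information: a semantic map alone cannot separate two touching objects of the same class, and a box alone does not say which pixels belong to the object.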
![Page 104: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/104.jpg)
Beyond 2D Object Detection...
![Page 105: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/105.jpg)
Object Detection + Captioning = Dense Captioning
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.
![Page 106: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/106.jpg)
Aside: Object Detection + Captioning = Dense Captioning
Johnson, Karpathy, and Fei-Fei, “DenseCap: Fully Convolutional Localization Networks for Dense Captioning”, CVPR 2016
Figure copyright IEEE, 2016. Reproduced for educational purposes.
![Page 107: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/107.jpg)
![Page 108: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/108.jpg)
Objects + Relationships = Scene Graphs
Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen et al. "Visual genome: Connecting language and vision using crowdsourced dense image annotations." International Journal of Computer Vision 123, no. 1 (2017): 32-73.
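One way to picture a scene graph as data: objects are nodes and each relationship is a directed, labelled edge between two nodes. The sketch below is only an illustration; the `Relationship` tuple, the example objects, and the `describe` helper are hypothetical and do not reflect the Visual Genome file format.

```python
from collections import namedtuple

# A directed, labelled edge: subject and object are node ids.
Relationship = namedtuple("Relationship", ["subject", "predicate", "object"])

# Nodes: object id -> category name (a detector would also attach a box).
objects = {0: "man", 1: "horse", 2: "hat"}

# Edges: relationships between pairs of detected objects.
relationships = [
    Relationship(0, "riding", 1),
    Relationship(0, "wearing", 2),
]


def describe(objects, relationships):
    """Render each edge as a 'subject predicate object' phrase."""
    return [f"{objects[r.subject]} {r.predicate} {objects[r.object]}"
            for r in relationships]


phrases = describe(objects, relationships)
print(phrases)
```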
![Page 109: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/109.jpg)
Scene Graph Prediction
Xu, Zhu, Choy, and Fei-Fei, “Scene Graph Generation by Iterative Message Passing”, CVPR 2017
Figure copyright IEEE, 2018. Reproduced for educational purposes.
![Page 110: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/110.jpg)
3D Object Detection
This image is CC0 public domain
![Page 111: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/111.jpg)
3D Object Detection: Simple Camera Model
Figure labels: camera, image plane, camera viewing frustum, 2D point, 3D ray
Image source: https://www.pcmag.com/encyclopedia_images/_FRUSTUM.GIF
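The 2D point / 3D ray relationship can be sketched with a pinhole camera model: a pixel back-projects to a ray through the camera centre, and every 3D point along that ray projects back to the same pixel. The intrinsics (`fx`, `fy`, `cx`, `cy`) and the function names below are assumptions chosen for illustration, not values from the lecture.

```python
import math


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Unit direction of the 3D ray through pixel (u, v), camera at the origin."""
    x = (u - cx) / fx
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (X, Y, Z), Z > 0, to pixel coordinates."""
    X, Y, Z = point
    return (fx * X / Z + cx, fy * Y / Z + cy)


# Hypothetical intrinsics for a 640 x 480 image.
fx = fy = 500.0
cx, cy = 320.0, 240.0

pixel = (420.0, 240.0)
ray = pixel_to_ray(pixel[0], pixel[1], fx, fy, cx, cy)

# Points at different distances along the ray all land on the same pixel.
reprojected = []
for t in (1.0, 5.0, 20.0):
    point = tuple(t * d for d in ray)
    reprojected.append(project(point, fx, fy, cx, cy))
print(reprojected)
```

This is why a single image cannot fix the distance of an object: depth along the ray is lost in projection, so a 2D detection only constrains the object to lie somewhere inside the viewing frustum.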
![Page 112: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/112.jpg)
3D Object Detection: Monocular Camera
Chen, Xiaozhi, Kaustav Kundu, Ziyu Zhang, Huimin Ma, Sanja Fidler, and Raquel Urtasun. "Monocular 3d object detection for autonomous driving." CVPR 2016.
Faster R-CNN
![Page 113: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/113.jpg)
3D Shape Prediction
Pointcloud: V x 3 float
Voxel: D x D x D binary
Mesh: V x 3 float, F x 3 int
Choy et al, “3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction”, ECCV 2016
Fan et al, “A Point Set Generation Network for 3D Object Reconstruction from a Single Image”, CVPR 2017
Wang et al, “Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images”, ECCV 2018
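The three shape representations listed above can be written out as plain data to make their shapes concrete. This is a toy sketch: the tetrahedron, the grid size `D = 2`, and the `voxelize` helper are assumptions for illustration, and the voxelization only marks cells that contain a vertex rather than filling the solid.

```python
# Mesh: V x 3 float vertices and F x 3 int faces (indices into the vertices).
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]  # a tetrahedron

# Point cloud: V x 3 float, just positions with no connectivity.
point_cloud = list(vertices)


def voxelize(points, D):
    """Voxel grid: D x D x D binary occupancy for points inside [0, 1]^3."""
    grid = [[[0] * D for _ in range(D)] for _ in range(D)]
    for x, y, z in points:
        # Clamp so that a coordinate of exactly 1.0 falls in the last cell.
        i = min(D - 1, int(x * D))
        j = min(D - 1, int(y * D))
        k = min(D - 1, int(z * D))
        grid[i][j][k] = 1
    return grid


D = 2
voxels = voxelize(point_cloud, D)
occupied = sum(v for plane in voxels for row in plane for v in row)
print(len(point_cloud), "points;", len(faces), "faces;", occupied, "of", D ** 3, "voxels occupied")
```

The memory cost differs sharply: the voxel grid grows as D cubed regardless of the shape, while the point cloud and mesh grow with the number of vertices and faces.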
![Page 114: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/114.jpg)
Recap: Lots of computer vision tasks!
Classification: CAT (no spatial extent)
Semantic Segmentation: GRASS, CAT, TREE, SKY (no objects, just pixels)
Object Detection: DOG, DOG, CAT (multiple objects)
Instance Segmentation: DOG, DOG, CAT (multiple objects)
This image is CC0 public domain
![Page 115: Fei-Fei Li & Justin Johnson & Serena Yeungcs231n.stanford.edu/slides/2019/cs231n_2019_lecture12.pdfFei-Fei Li & Justin Johnson & Serena Yeung Lecture 12 - 2 May 14, 2019 Administrative](https://reader030.vdocuments.net/reader030/viewer/2022040115/5e6f59ae40c8355c14164899/html5/thumbnails/115.jpg)
Next time:
Visualizing CNN features
DeepDream + Style Transfer