lecture 11: detection and localizationyboykov/courses/cs484... · cat:0.9 dog: 0.05 car: 0.01......

65
Transfer Learning + Localization and Detection Most slides are from Fei-Fei Li, Justin Johnson, Andrej Karpathy, SerenaYeung, Jia-Bin Huang CS484/684 Topic 12

Upload: others

Post on 01-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Transfer Learning

+

Localization and Detection

Most slides are from Fei-Fei Li, Justin Johnson, Andrej Karpathy, Serena Yeung, Jia-Bin Huang

CS484/684

Topic 12

Page 2: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Transfer Learning

▪ Improve learning in a new task through transfer of knowledge from a related task that has already been learned

▪ Often used when your own dataset is not large enough▪ cannot train large CNN with little data

▪ Two major strategies▪ Use trained CNN as a fixed feature extractor▪ Fine tune trained CNN for your task

▪ References▪ Donahue et al, “DeCAF: A Deep Convolutional Activation Feature for Generic Visual

Recognition”, ICML 2014

▪ Razavian et al, “CNN Features Off-the-Shelf: An Astounding Baseline for Recognition”, CVPR

Workshops 2014

▪ Oquab et al, “Learning and Transferring Mid-level Image Representations Using Convolutional

Neural Networks”, CVPR 2014

Page 3: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Transfer Learning Classification: Fixed Feature Extraction

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

MaxPool

Conv-512

Conv-512

FC-1000

FC-4096

FC-4096

1. Train on Imagenet

Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo

TensorFlow: https://github.com/tensorflow/models

PyTorch: https://github.com/pytorch/vision

or load off internet

VGG

Page 4: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Transfer Learning Classification: Fixed Feature Extraction

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

MaxPool

Conv-512

Conv-512

FC-1000

FC-4096

FC-4096

1. CNN pretrained on Imagenet

1000 classes

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

MaxPool

Conv-512

Conv-512

2. Your dataset (small)

C classes

Freeze these

Reinitialize

this and train

FC-CFC-4096

FC-4096

4096 features

extracted from

pretrained CNN

Interpretation:

Linear classifier is

trained on top of

fixed features

Page 5: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Transfer Learning Classification: Fine Tuning

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

Conv-512

Conv-512

MaxPool

FC-CFC-4096

FC-4096

Reinitialize

and train

Freeze these

3. Your dataset (medium)

C classes

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

MaxPool

Conv-512

Conv-512

FC-1000

FC-4096

FC-4096

1. Train on Imagenet

1000 classes

“Finetune” these

• low learning rate

• e.g. 1/10 of

original LR

Page 6: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Transfer Learning Classification: Fine Tuning

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

Conv-512

Conv-512

MaxPool

FC-CFC-4096

FC-4096

Reinitialize

and train

Freeze these

3. Your dataset (medium large)

C classes

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

MaxPool

Conv-512

Conv-512

FC-1000

FC-4096

FC-4096

1. Train on Imagenet

1000 classes

“Finetune” these

• low learning rate

• e.g. 1/10 of

original LR

Page 7: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

MaxPool

Conv-512

Conv-512

FC-1000

FC-4096

FC-4096

more dataset

specific features

more generic

features

very similar

dataset

very different

dataset

very little data

quite a lot of

data

Transfer Learning Classification Overview

use linear

classifier on

top layer

finetune a

few layers

finetune a

larger number

of layers

most difficult

case, try

linear classifier

from earlier

stages

Page 8: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Transfer Learning for Semantic Segmentation

input

Page 9: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Transfer Learning for Semantic Segmentation

Image

MaxPool

Conv-64

Conv-64

MaxPool

Conv-128

Conv-128

MaxPool

Conv-256

Conv-256

MaxPool

Conv-512

Conv-512

MaxPool

Conv-512

Conv-512

FC-1000

FC-4096

FC-4096

Pretrained CNN

22464

112

128

64

256 resize

resize

resize

3xWxH

WH

256

WH128

WH64

concat

448

W H

conv 1x1, 64

64W

H

concat

input

Page 10: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Classification vs. Localization

Classification + Localization

CAT

Classification

CAT

Page 11: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Class

Scores

Cat: 0.9

Dog: 0.05

Car: 0.01

...

Classification + Localization

This image is CC0 publicdomainVector

4096

Fully

Connected

4096 to 1000

Box

Coords

(x, y, w, h)

Fully

Connected

4096 to 4

Treat localization as a regression problem!

Page 12: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Class

ScoresCat: 0.9

Dog: 0.05

Car: 0.01

...

Classification + Localization

Vector

4096

Fully

Connected

4096 to 1000

Box

Coords

Fully

Connected

4096 to 4

This image is CC0 publicdomain

(x, y, w, h)

L2

Loss

Correct box

(x’, y’, w’, h’)Treat localization as a regression problem!

Softmax Loss

Correct label

Cat

Page 13: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Class

ScoresCat: 0.9

Dog: 0.05

Car: 0.01

...

Classification + Localization

Vector

4096

Fully

Connected:

4096 to 1000

Box

Coords

Fully

Connected:

4096 to 4

Softmax Loss

L2

Loss

Correct label

Cat

This image is CC0 publicdomain

(x, y, w, h)

Correct box

(x’, y’, w’, h’)Treat localization as a regression problem!

Multitask Loss Loss+

Page 14: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Class

Scores

Cat: 0.9

Dog: 0.05

Car: 0.01

...

Classification + Localization

Vector

4096

Fully

Connected:

4096 to 1000

Box

Coords

Fully

Connected:

4096 to 4

Softmax Loss

L2

Loss

Correct label:

Cat

This image is CC0 publicdomain

(x, y, w, h)

Correct box:

(x’, y’, w’, h’)Treat localization as a regression problem!

Multitask Loss Loss+

Often pretrained on ImageNet

(Transfer learning)

Page 15: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Classification + Localization: More FC Layers

Class scores

Softmax loss

Fully-Connectedlayers

Fully-Connectedlayers

Box coords

L2 loss

Page 16: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Classification + Localization:

Class agnostic vs per class regression

Class scores

C numbers

Softmax loss

FC Layers

FC Layers

Box cords,

class agnostic,

4 numbers

L2 loss

Page 17: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Classification + Localization: Class agnostic vs

per class regression

Class scores

C numbers

Softmax loss

FC Layers

FC Layers

Box cords,

class specific,

Cx4 numbers

L2 loss

Page 18: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Localization vs. Object Detection

Detection

DOG, DOG, CAT

Object categories +

2D bounding boxes

Localization

Page 19: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Object Detection as Regression?

CAT: (x, y, w, h)

DOG: (x, y, w, h)

DOG: (x, y, w, h)

CAT: (x, y, w, h)

4 numbers

12 numbers

DUCK: (x, y, w, h)

DUCK: (x, y, w, h)

….

Each image needs a different number of outputs!

Many

numbers

Page 20: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Object Detection as Classification: Sliding Window

Dog? NO

Cat? NO

Background? YES

▪ Apply a CNN to many different crops of the image

▪ Add an additional “background” class

▪ CNN classifies each crop as object or background

Page 21: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Object Detection as Classification: Sliding Window

Dog? YES

Cat? NO

Background? NO

▪ Apply a CNN to many different crops of the image

▪ Add an additional “background” class

▪ CNN classifies each crop as object or background

Page 22: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Object Detection as Classification: Sliding Window

Dog? YES

Cat? NO

Background? NO

▪ Apply a CNN to many different crops of the image

▪ Add an additional “background” class

▪ CNN classifies each crop as object or background

Page 23: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Object Detection as Classification: Sliding Window

Dog? NO

Cat? YES

Background? NO

▪ Apply a CNN to many different crops of the image

▪ Add an additional “background” class

▪ CNN classifies each crop as object or background

Page 24: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Object Detection as Classification: Sliding Window

Dog? NO

Cat? YES

Background? NO

▪ Problem: Need to apply CNN to huge number of

▪ locations, scales, and aspect ratios

▪ Very computationally expensive!

Page 25: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Region Proposals / Selective Search

▪ Find “blobby” image regions that are likely to contain objects▪ Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012

▪ Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013

▪ Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014

▪ Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014

▪ Relatively fast to run▪ e.g. Selective Search gives 2000 region proposals in a few seconds on CPU

Page 26: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 8 - 1 Feb 2016

Region Proposals: Selective Search

Bottom-up segmentation, merging regions at multiple scales

Lecture 8 - 50

Convert

regions

to boxes

Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013

Page 27: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and segmentation”, CVPR'14

Page 28: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and segmentation”, CVPR'14

Regions of Interest (ROI) from proposal method (~2k)

Page 29: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and segmentation”, CVPR'14

Regions of Interest (ROI) from proposal method (~2k)

Warp ROI to the same size

Page 30: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and segmentation”, CVPR'14

Regions of Interest (ROI) from proposal method (~2k)

Warp ROI to the same size

Forward each ROI through ConvNet and extract features

Sidenote: ConvNetpretrained on large classification dataset, then fine-tuned on proposal windows

Page 31: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and segmentation”, CVPR'14

Regions of Interest (ROI) from proposal method (~2k)

Warp ROI to the same size

Forward each ROI through ConvNet and extract features

Classify ROIs with SVM (type of linear classifier)

Sidenote: In original paper, separate SVM classifier worked better than softmaxlayer used for finetuning

Page 32: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

R-CNN

Girshick et al, “Rich feature hierarchies for accurate object detection and segmentation”, CVPR'14

Regions of Interest (ROI) from proposal method (~2k)

Warp ROI to the same size

Forward each ROI through ConvNet and extract features

Classify ROIs with SVM (type of linear classifier)

Regression for bounding box offset

Page 33: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

R-CNN: Problems

▪ Proposals are not learned, brittle

▪ Training is multi-stage pipeline

▪ with each stage has its own loss function and is trained

separately

▪ i.e., box regression does not help classification,

classification does not help regression

▪ Training is slow (84h), takes a lot of disk space

▪ Inference (detection) is slow▪ 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]

▪ Need to feed-forward 2000 proposals through ConvNet

Girshick et al, “Rich feature hierarchies for accurate object detection and segmentation”, CVPR'14

Page 34: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015

Main idea: speedup via shared CNN computation

Page 35: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015

Forward the whole image through ConvNet

Page 36: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015

Forward the whole image through ConvNet

“conv5” feature map of the image

Page 37: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015

Page 38: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015

Page 39: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN: RoI Pooling

Input image

3 x 640 x 480

Projected region

proposal is e.g.

512 x 18 x 8

Fully-connected

layers

RoI conv features:

512 x 7 x 7

for region proposal

First FC layer expects

features of size

512 x 7 x 7

CNN

CNN for features

Girshick, “Fast R-CNN”, ICCV 2015

Conv features:

512 x 20 x 15

divide projected

proposal into 7x7 grid

maxpool inside

each grid

Page 40: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015

Page 41: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

Girshick, “Fast R-CNN”, ICCV 2015

Page 42: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

(Training)

Girshick, “Fast R-CNN”, ICCV 2015

Multi-task loss

Page 43: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN

(Training)

Girshick, “Fast R-CNN”, ICCV 2015

Multi-task loss▪ single loss function▪ end-to-end training▪ box regression and

classification can influence each other

Page 44: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 8 - 1 Feb 2016

Fast R-CNN Results

Using VGG-16 CNN on Pascal VOC 2007 dataset

Lecture 8 - 75

R-CNN Fast R-CNN

Training Time: 84 hours 9.5 hours

(Speedup) 1x 8.8xFaster!

Page 45: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 8 - 1 Feb 2016

Fast R-CNN Results

Using VGG-16 CNN on Pascal VOC 2007 dataset

Lecture 8 - 76

R-CNN Fast R-CNN

Training Time: 84 hours 9.5 hours

(Speedup) 1x 8.8x

Test time per image 47 seconds 0.32 seconds

(Speedup) 1x 146x

Faster!

FASTER!

Page 46: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fast R-CNN Results

Using VGG-16 CNN on Pascal VOC 2007 dataset

R-CNN Fast R-CNN

Training Time: 84 hours 9.5 hours

(Speedup) 1x 8.8x

Test time per image 47 seconds 0.32 seconds

(Speedup) 1x 146x

mAP (VOC 2007) 66.0 66.9

Faster!

FASTER!

Better!

Page 47: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 8 - 1 Feb 2016

Fast R-CNN Problem

R-CNN Fast R-CNN

Test time per image 47 seconds 0.32 seconds

(Speedup) 1x 146x

Test time per image

with Selective Search 50 seconds 2 seconds

(Speedup) 1x 25x

Test-time speeds don’t include region proposals

Page 48: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Faster R-CNN

▪ Make CNN learn the proposals

▪ Insert Region Proposal

Network (RPN) to predict

proposals from features

▪ After RPN, as before, use

▪ RoI Pooling

▪ Classifier

▪ Box regressor

Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015

Page 49: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

▪ Consider a fixed set of rectangles as possible proposals

▪ Choices in paper:▪ height and width ratios 1:1, 1:2, 2:1, at ▪ 3 scales, 128x128, 256x256, 512x512▪ stride 16▪ Gives 17,901 boxes for 600x800 image

Faster R-CNN: Region Proposal Network

▪ Classify each box as proposal/not proposal▪ or object/no object

▪ How to classify efficiently?

Page 50: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

▪ Take layer “conv5” for Region Proposal Network

Faster R-CNN: Region Proposal Network

Conv Net

ConvNet

‘conv5’

39

51

512

▪ Moving by 1 “spatial pixel” in this layer corresponds to moving by 16 pixels in the original image

Page 51: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

▪ Take layer “conv5” for Region Proposal Network

Faster R-CNN: Region Proposal Network

39

51

512

▪ Moving by 1 “spatial pixel” in this layer corresponds to moving by 16 pixels in the original image

Page 52: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

▪ Take layer “conv5” for Region Proposal Network

Faster R-CNN: Region Proposal Network

39

51

512

move by 16 pixels

▪ Moving by 1 “spatial pixel” in this layer corresponds to moving by 16 pixels in the original image

▪ We have 9 possible proposal windows every 16 pixels

Page 53: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Faster R-CNN: Region Proposal Network

▪ Make each “spatial pixel” responsible for classifying 9 proposal rectangles

39

51

512

Page 54: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Faster R-CNN: Region Proposal Network

39

51

512

▪ Already have 512 features for learning, but not enough▪ Take 3 by 3 spatial window around green ‘pixel’ for learning

move by 16 pixels

Page 55: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Faster R-CNN: Region Proposal Network

39

51

512

▪ Already have 512 features for learning, but not enough▪ Take 3 by 3 spatial window around green ‘pixel’ for learning▪ Implemented as 3x3 convolutional layer for all pixels

move by 16 pixels

▪ with 256 filters, to get 256 new features per ‘pixel’

Page 56: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Faster R-CNN: Region Proposal Network

39

51

512

3x3 convolution with 256 filters

39

51256▪ Now need to classify each “pixel”

as object or not object ▪ for 9 different proposal boxes

1x1 convolution with 2x9 filters

1x1 convolution with 4x9 filters

▪ And get 4 box coordinates▪ for 9 different proposal boxes

softmax loss regression loss

Page 57: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 8 - 1 Feb 2016

Faster R-CNN: Training

▪ In the paper: Ugly pipeline▪ Since publication: Joint training▪ One network, four losses▪ RPN classification (anchor good / bad)▪ RPN regression (anchor -> proposal)▪ Fast R-CNN classification (over classes)▪ Fast R-CNN regression (proposal -> box)

Slide credit: Ross GirschickLecture 8 - 83

Page 58: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Faster R-CNN: Results

R-CNN Fast R-CNN Faster R-CNN

Test time per

image

(with proposals)

50 seconds 2 seconds 0.2 seconds

(Speedup) 1x 25x 250x

mAP (VOC 2007) 66.0 66.9 66.9

Page 59: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Mask R-CNN

He et al, “Mask R-CNN”, arXiv 2017

RoI AlignConv

Classification Scores: CBox coordinates (per class): 4 * C

CNN Conv

Predict a mask for

each of C classes

C x 14 x 14

256 x 14 x 14 256 x 14 x 14

▪ Adds a branch for predicting segmentationmasks on each RoI

▪ Also changes how RoI pooling implemented▪ Computational overhead is small

Page 60: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Mask R-CNN: Very Good Results!

He et al, “Mask R-CNN”, arXiv 2017

Page 61: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Mask R-CNN

Also does pose

He et al, “Mask R-CNN”, arXiv 2017

Page 62: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

YOLO- You Only Look Once

Redmon et al. CVPR 2016.https://arxiv.org/abs/1506.02640

Idea: No boundingbox proposals.Predict a class and a box for every locationin a grid.

Page 63: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

YOLO- You Only Look Once

Redmon et al. CVPR 2016.https://arxiv.org/abs/1506.02640

Divide the image into 7x7 cells. Each cell trains a detector.The detector needs to predict the object’s class distributions.The detector has 2 bounding-box predictors to predict bounding-boxes and confidence scores.

Page 64: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

SSD: Single Shot Detector

Liu et al. ECCV 2016.

Idea: Similar to YOLO, but denser grid map, multiscale grid maps. + Data augmentation + Hard negative mining + Other design choices in the network.

Page 65: Lecture 11: Detection and Localizationyboykov/Courses/cs484... · Cat:0.9 Dog: 0.05 Car: 0.01... Classification + Localization Vector 4096 Fully Connected 4096 to 1000 Box Coords

Object Detection: Impact of Deep Learning