Object Detection
R. Q. FEITOSA
Content
1. Computer Vision Tasks
2. Data Sets for Object Detection
3. Performance Metrics
4. YOLO
5. Mask R-CNN
Computer Vision Tasks
[Figure: overview of the five tasks: Image Classification, Object Localization, Semantic Segmentation, Object Detection, Instance Segmentation]
Computer Vision Tasks
Image Classification: assigns a single label to the whole image (e.g., CAT).
Computer Vision Tasks
Object Localization: locates the single object in the image and indicates its location with a bounding box (e.g., CAT plus its box).
Computer Vision Tasks
Object Detection: locates multiple (if any) instances of object classes with bounding boxes.
Computer Vision Tasks
Semantic Segmentation: assigns a class label to individual pixels, without separating instances (e.g., CAT, ROAD, VEGETATION).
Computer Vision Tasks
Instance Segmentation: identifies each instance of each object class at the pixel level.
Comparison:
• Semantic segmentation is pixel-based: it recognizes classes but ignores instances.
• Instance segmentation is region-based: it recognizes classes and instances.
Content
1. Computer Vision Tasks
2. Data Sets for Object Detection
3. Performance Metrics
4. YOLO
5. Mask R-CNN
Datasets
• PASCAL Visual Object Classes (PASCAL VOC)
• 8 different challenges between 2005 and 2012
• Around 10,000 images of 20 categories
Source: Everingham et al., 2012, Visual Object Classes Challenge 2012 Dataset
[sample images of PASCAL VOC]
Datasets
• Common Objects in COntext (COCO), T.-Y. Lin et al. (2015)
• Multiple challenges
• Around 160,000 images of 80 categories
Source: Arthur Ouaknine, Review of Deep Learning Algorithms for Object Detection (2018)
[sample images of COCO]
Content
1. Computer Vision Tasks
2. Data Sets for Object Detection
3. Performance Metrics
4. YOLO
5. Mask R-CNN
Performance Metrics

Precision = true positives / (true positives + false positives)

Recall = true positives / (true positives + false negatives)

Intersection over Union (IoU) = area of intersection / area of union

Average Precision (AP) = area under the precision-recall curve

[Figure: predicted (outcome) vs. reference bounding boxes, showing the false positive, false negative and true positive regions]
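The definitions above can be sketched in a few lines of Python. Boxes are given as (x1, y1, x2, y2) corners; the function names are mine, not from the slides:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy  # area of intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positive and false negative counts."""
    return tp / (tp + fp), tp / (tp + fn)
```

In detection, a prediction counts as a true positive only when its IoU with a reference box exceeds a threshold (commonly 0.5), which ties the three formulas together.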
Performance Metrics
Most competitions use mean average precision (mAP) as the principal metric to evaluate object detectors, though definitions and implementations vary.
Performance Metrics
Precision-recall curve
The confidence score is the probability that an anchor box contains an object. It is usually predicted by a classifier.
As the threshold on the confidence score decreases,
• recall increases monotonically;
• precision can go up and down, but the general tendency is to decrease.
Source: Nickzeng (2018)
Performance Metrics
Precision-recall curve: Average Precision
The average precision (AP) is the precision averaged across all unique recall levels.
It is based on the interpolated precision-recall curve.
The interpolated precision p_interp at a certain recall level r is defined as the highest precision found for any recall level r' >= r:
p_interp(r) = max over r' >= r of p(r')
Source: Nickzeng (2018)
[Figure: original vs. interpolated precision-recall curves]
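The interpolation step can be sketched as follows. This is an all-point interpolation (area under the interpolated step curve); some benchmarks instead sample 11 fixed recall points, so treat it as one common variant rather than the definition used by every competition:

```python
import numpy as np

def interpolated_ap(recalls, precisions):
    """AP from a precision-recall curve with all-point interpolation:
    p_interp(r) = max over r' >= r of p(r'), integrated over recall.
    `recalls` must be sorted in increasing order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # enforce p_interp: running maximum from right to left
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # sum the rectangles where recall actually changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```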
Performance Metrics
Mean average precision (mAP)
The calculation of AP involves one class.
However, in object detection there are usually K > 1 classes.
Mean average precision (mAP) is defined as the mean of AP across all K classes:
mAP = (1/K) * sum over k = 1..K of AP_k
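Given the per-class AP values, the mAP formula above is a plain average (the function name and dict layout are mine):

```python
def mean_average_precision(ap_per_class):
    """mAP = (1/K) * sum over the K classes of AP_k.
    `ap_per_class` maps class name -> AP value."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```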
Content
1. Computer Vision Tasks
2. Data Sets for Object Detection
3. Performance Metrics
4. YOLO
5. Mask R-CNN
YOLO History
Date            Event
May 2016        Joseph Redmon's paper "You Only Look Once: Unified, Real-Time Object Detection" introduced YOLO
December 2017   Redmon introduced YOLOv2 in the paper "YOLO9000: Better, Faster, Stronger"
April 2018      Redmon published "YOLOv3: An Incremental Improvement"
February 2020   Redmon announced he was stepping away from computer vision
April 23, 2020  Alexey Bochkovskiy published YOLOv4
May 29, 2020    Glenn Jocher created a repository called YOLOv5
[Photo: Joseph Redmon]
Source: Redmon & Farhadi, 2018, "YOLOv3: An Incremental Improvement"
YOLOv3
Source: Sik-Ho Tsang, Review: YOLOv3 - You Only Look Once (Object Detection)
Example of Object Detection
Training set: images x with binary labels y (y = 1 if the object is present, 0 otherwise).
A CNN maps each image x to a prediction in {0, 1}.
Sliding Window Detection
A window slides over the image and, at each position, a CNN classifies the crop (object / no object).
Problem: high computational cost!
Accurate Bounding Boxes
YOLO Basics
Source: Redmon, et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

Labels for training, for each grid cell:
y = [pc, bx, by, bh, bw, c1, c2, c3]
• empty cell:        y = [0, ?, ?, ?, ?, ?, ?, ?]
• cell with object:  y = [1, bx, by, bh, bw, 0, 1, 0]

A single CNN (conv and max-pool layers) maps the n x n x 3 input image to a 3 x 3 x 8 output.

Remarks:
• It indicates if there is an object in each grid cell (pc), and from which class (c1, c2, c3).
• It outputs the bounding box coordinates (bx, by, bh, bw) for each grid cell.
• With finer grids you reduce the chance of multiple objects in the same grid cell.
• The object is assigned to just one grid cell, the one that contains its midpoint, no matter if the object extends beyond the cell limits.
• A single CNN accounts for all objects in the image, which makes it efficient.
Parametrization of box coordinates
Source: Redmon, et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

y = [1, bx, by, bh, bw, 0, 1, 0]

Coordinates are measured relative to the grid cell, with (0,0) at its top-left and (1,1) at its bottom-right corner:
• bx, by: midpoint of the box inside the cell, always between 0 and 1 (e.g., 0.2 and 0.4);
• bh, bw: height and width relative to the cell size, can be > 1 (e.g., 0.9 and 1.5).

Remark: actually, there are other parametrizations, but this is enough for the time being.
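A small sketch of how a box in normalized image coordinates could be encoded into these grid-cell labels (the function and its argument layout are my own illustration, not code from the paper):

```python
def encode_box(cx, cy, w, h, grid_size):
    """Encode a box (midpoint cx, cy and size w, h, all normalized to [0, 1]
    image coordinates) as YOLO-style grid-cell labels.
    bx, by are midpoint offsets inside the owning cell (between 0 and 1);
    bh, bw are sizes relative to the cell and may exceed 1."""
    col = int(cx * grid_size)          # grid cell that owns the midpoint
    row = int(cy * grid_size)
    bx = cx * grid_size - col          # fractional offset inside the cell
    by = cy * grid_size - row
    bw = w * grid_size                 # size in cell units
    bh = h * grid_size
    return row, col, bx, by, bh, bw
```

For instance, a box with midpoint (0.4, 0.4) and size 0.3 x 0.5 on a 3 x 3 grid lands in cell (1, 1) with offsets near 0.2 and sizes 0.9 and 1.5, matching the example values on the slide.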
Overlapping boxes
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Non-maximum suppression
Often, some of the generated boxes overlap (e.g., boxes with scores 0.9, 0.8, 0.7, 0.6 around the same object).
Algorithm:
Discard all boxes with pc <= 0.6.
While there are boxes:
• Pick the box with the largest pc and output it as a prediction.
• Discard any remaining box with IoU >= 0.5 with the box output in the previous step.
The algorithm must run separately for each class.
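The algorithm above translates almost line by line into Python. This is a minimal single-class sketch; boxes are (x1, y1, x2, y2) corners and the helper names are mine:

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Greedy NMS following the slide's algorithm; returns kept indices."""
    # discard all boxes with confidence <= threshold
    remaining = [i for i, s in enumerate(scores) if s > score_thresh]
    picked = []
    while remaining:
        # pick the box with the largest confidence and output it
        best = max(remaining, key=lambda i: scores[i])
        picked.append(best)
        # discard any remaining box overlapping it with IoU >= threshold
        remaining = [i for i in remaining
                     if i != best and iou(boxes[i], boxes[best]) < iou_thresh]
    return picked
```

For multiple classes, the same routine is simply run once per class on that class's boxes.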
Dealing with Overlapping Objects
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

With two anchor boxes, the label for each grid cell stacks one prediction per anchor:
y = [pc, bx, by, bh, bw, c1, c2, c3,   (anchor box 1)
     pc, bx, by, bh, bw, c1, c2, c3]   (anchor box 2)
For a cell containing only a car (matched to anchor box 2):
y = [0, ?, ?, ?, ?, ?, ?, ?,  1, bx, by, bh, bw, 0, 1, 0]

Anchor box algorithm
Previously: each object in a training image was assigned to the grid cell that contains that object's midpoint. In our example so far, output y: 3 x 3 x 8.
With two anchor boxes: each object is assigned to the grid cell that contains the object's midpoint and to the anchor box with the highest IoU with the object. In our example, output y: 3 x 3 x 16.
Selecting Anchor Boxes
1. Choose by hand, guided by the frequency of shapes in your application.
2. Use a clustering algorithm to find representative shapes.
Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left image shows the average IoU we get with various choices for k. We find that k = 5 gives a good tradeoff for recall vs. complexity of the model. The right image shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes, while COCO has greater variation in size than VOC.
Source: Redmon, J. & Farhadi, A., 2016, YOLO9000: Better, Faster, Stronger
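Option 2 can be sketched as k-means over (width, height) pairs, using 1 - IoU as the distance so that large and small boxes are treated fairly. This is a simplified illustration of the YOLO9000 idea, not the authors' code; it assumes boxes are compared as if sharing a corner:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100, seed=0):
    """k-means on box (width, height) pairs with 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    wh = np.asarray(wh, dtype=float)
    centroids = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every centroid, corners aligned
        inter = (np.minimum(wh[:, None, 0], centroids[None, :, 0]) *
                 np.minimum(wh[:, None, 1], centroids[None, :, 1]))
        union = ((wh[:, 0] * wh[:, 1])[:, None] +
                 (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
        assign = np.argmax(inter / union, axis=1)  # nearest = highest IoU
        new = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```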
YOLO wrap-up: Training
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Classes: 1. Pedestrian, 2. Car, 3. Motorcycle.
Target for each grid cell, two anchors stacked:
y = [pc, bx, by, bh, bw, c1, c2, c3,  pc, bx, by, bh, bw, c1, c2, c3]
e.g., anchor box 1 empty and anchor box 2 holding a car:
y = [0, ?, ?, ?, ?, ?, ?, ?,  1, bx, by, bh, bw, 0, 1, 0]
In our example, y: 3 x 3 x 2 x 8, that is, grid cells x #anchors x (5 + #classes).
The CNN maps the n x n x 3 image to this output, one 1 x 1 x 16 vector per grid cell.
YOLO wrap-up: Prediction
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

The CNN outputs one 1 x 1 x 16 vector per grid cell: for an empty cell both pc values should be near 0; for a cell with a car, the matching anchor should predict pc near 1 with the car class active.
YOLO wrap-up: Prediction
Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg
• For each grid cell, get 2 predicted bounding boxes.
• Get rid of low-probability predictions.
• For each class (pedestrian, car, motorcycle), use non-max suppression to generate the final predictions.
YOLOv3 Network Architecture
Source: Ayoosh Kathuria, What's new in YOLO V3?
• Deeper than YOLO9000: from 53 to 106 layers.
• Input image: 320, 416 or 608.
• Detection at three scales (13x13, 26x26, 52x52): better at detecting smaller objects.
• Skip connections.
YOLOv3 Anchors
For each cell there are B anchors.
For each anchor the prediction is a vector of dimension (5 + C).
Example: for COCO, B = 3 anchors and C = 80 classes per grid cell.
The number of anchor predictions across the three scales is
(13 x 13 + 26 x 26 + 52 x 52) x B = 10,647.
This is 10 times YOLO9000!
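The count is easy to verify; the helper below just evaluates the formula on the slide:

```python
def num_predictions(grid_sizes, num_anchors):
    """Total anchor predictions across YOLOv3's detection scales."""
    return sum(s * s for s in grid_sizes) * num_anchors

# For a 416x416 input: 13, 26 and 52 cell grids, B = 3 anchors per cell
total = num_predictions([13, 26, 52], 3)
print(total)  # 10647
```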
YOLOv3 Bounding Box Prediction
Source: Redmon, J., Farhadi, A., (2018), YOLOv3: An Incremental Improvement
Same as YOLO9000: the network predicts tx, ty, tw, th for each anchor box.
The box predictions bx, by, bw, bh are given by
bx = sigma(tx) + cx
by = sigma(ty) + cy
bw = pw * e^tw
bh = ph * e^th
where
• (cx, cy) is the offset of the grid cell from the top-left corner of the image,
• pw and ph are the anchor width and height, and
• sigma is the sigmoid function.
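The four equations decode directly into a small helper (names and argument order are mine):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv3 box decoding: sigmoid keeps the midpoint inside its cell,
    the exponential scales the anchor (pw, ph) to the predicted size."""
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

With all raw predictions at zero, the box sits at the center of its cell with exactly the anchor's size, which is why zero-initialized outputs start from the anchor priors.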
YOLOv3 Loss Function

Prediction vector per anchor: y = [t0, tx, ty, th, tw, s1, ..., sC]

Loss = sum_{i=0}^{S^2} sum_{j=0}^{B} 1_ij^obj [ -log sigma(t0) + sum_{k=1}^{C} BCE(y_k_hat, sigma(s_k)) ]
       (for non-empty anchors; BCE per class instead of softmax, as in YOLOv2)
     + lambda_coord sum_{i=0}^{S^2} sum_{j=0}^{B} 1_ij^obj [ (tx - tx_hat)^2 + (ty - ty_hat)^2 + (tw - tw_hat)^2 + (th - th_hat)^2 ]
       (for non-empty anchors)
     + lambda_noobj sum_{i=0}^{S^2} sum_{j=0}^{B} 1_ij^noobj [ -log(1 - sigma(t0)) ]
       (for empty anchors)

where 1_ij^obj / 1_ij^noobj is the indicator function, lambda_coord and lambda_noobj are weights, y / y_hat is the prediction / ground truth, sigma is the sigmoid, and BCE is the binary cross entropy.
Performance on COCO
Object Detector Architectures
They are typically composed of several components:
• Backbone, the network that extracts the features: VGG16, ResNet-50, Darknet-53, ResNeXt-50, ...
• Neck, which enhances feature discriminability and robustness: ASPP, PAN, RFB, ...
• Head, which handles the prediction: RPN, YOLO, SSD, ... (dense, one-stage) or Faster R-CNN, R-CNN, ... (sparse, two-stage)
Source: Bochkovskiy, et al., 2020
Bag of Freebies
Methods that only change the training strategy or only increase the training cost:
• "bag" refers to a set of strategies, and
• "freebies" means that we get additional performance for free.
The authors investigated a large "freebies" arsenal (too much to cover in depth here), in particular data augmentation strategies.
Bag of Specials
Set of modules that only increase the inference cost by a small amount but significantly improve the accuracy, such as
• attention mechanisms,
• special activation functions,
etc., both for the backbone and the head.
Source: Bochkovskiy, et al., 2020
Additional Improvements
• A new data augmentation method: Mosaic + Self-Adversarial Training (SAT).
• Optimal hyper-parameters found by applying genetic algorithms.
• Modifications to some existing methods for efficient training and detection.
YOLOv4 Architecture
YOLOv4 consists of:
• Backbone: CSPDarknet53
• Neck: SPP, PAN
• Head: YOLOv3
Recall DenseNet
DenseNet increases model capacity (complexity) and improves information flow (training) with highly interconnected layers.
Source: Huang et al., 2018
Source: Huang & Wang, 2018
Recall DenseNet
Formed by composing multiple DenseBlocks with a transition layer in between.
Source: Wang et al., 2020
Cross-Stage Partial Connection (CSP)
CSPNet separates the input feature maps of the DenseBlock into two parts:
• The first part x0' bypasses the DenseBlock and becomes part of the input to the next transition layer.
• The second part x0'' goes through the DenseBlock as below.
It reduces the computational complexity by separating the input into two parts, with only one going through the DenseBlock.
Source: Wang et al., 2020
Cross-Stage Partial Connection (CSP)
CSP separates the input feature maps of the DenseBlock into two parts.
The first part bypasses the DenseBlock and becomes part of the input to the next transition layer.
Source: Wang et al., 2020
CSPDarknet-53
"YOLOv4 utilizes the CSP connections with the Darknet-53 (on the right) as the backbone in feature extraction."¹
It reduces the computational complexity by separating the input into two parts, with only one going through the DenseBlock.
[Figure: Darknet-53 architecture]
¹Jonathan Hui, 2020
YOLOv4 Architecture
YOLOv4 consists of:
• Backbone: CSPDarknet53
• Neck: SPP, PAN
• Head: YOLOv3
Improved Spatial Pyramid Pooling (SPP)
Feature maps are max pooled at three scales with stride 1.
The three feature maps are concatenated.
Source: Huang & Wang, 2018
[Figures: original SPP vs. enhanced SPP]

CSP and SPP integrated in YOLO
Path Aggregation Network (PAN)
YOLOv3 adopts an approach similar to FPN, making object detection predictions at different scale levels.
It upsamples (2x) the previous top-down stream and adds it to the neighboring layer of the bottom-up stream. The result is passed through a 3x3 convolution filter to reduce upsampling artifacts and create the feature map P4 below for the head.
Source: Lin et al., 2019
YOLO v3 vs YOLO v4 - Demo
Useful links
• Jonathan Hui, 2020, YOLOv4
• Deval Shah, 2020, YOLOv4 - Version 3: Proposed Workflow
• Shreejai Trivedi, YOLOv4 - Version 4: Final Verdict
YOLOv5
• Glenn Jocher, founder and CEO of Ultralytics, released his open-source PyTorch implementation of YOLOv5 on GitHub
• No paper published, just a blog
• Authors claim:
  • 140 frames per second (Tesla P100) vs. 50 FPS for YOLOv4
  • Small: about 90% smaller than YOLOv4
  • Fast
YOLOv5 Performance
Source: Jacob Solawetz, 2020
Other OD approaches
• R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation (2014)
• Fast R-CNN (2015)
• Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2016)
• R-FCN: Object Detection via Region-based Fully Convolutional Networks (2016)
• SSD: Single Shot MultiBox Detector (2016)
• FPN: Feature Pyramid Networks for Object Detection (2017)
• RetinaNet: Focal Loss for Dense Object Detection (2018)
• Path Aggregation Network for Instance Segmentation (2018)
• EfficientDet: Scalable and Efficient Object Detection (2020)
Content
1. Computer Vision Tasks
2. Data Sets for Object Detection
3. Performance Metrics
4. YOLO
5. Mask R-CNN
Mask-RCNN
CV tasks
Figure source: Tiba Razmi, (2019) Mask R-CNN
Instance Segmentation combines
• object detection, and
• semantic segmentation.
Instance Segmentation
[Figures: object detection vs. semantic segmentation examples]
Mask R-CNN
It is a two-stage framework (meta-algorithm):
a) generates proposals (areas likely to contain an object), and
b) segments the proposals.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Mask R-CNN
Proposed by Kaiming He and co-authors in 2018.
It is built upon Faster R-CNN.
It involves three steps:
a) the Region Proposal Network (RPN), which generates region proposals,
b) the Region of Interest Alignment (ROIAlign), which reshapes each region proposal to a standard shape/size, and
c) Mask Generation itself, which segments the image inside the bounding boxes with objects.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Region Proposal Network (RPN)
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
The input image is resized. Sizes in the figure are just illustrative.
[Figure: a backbone (VGG, ResNet C4, FPN, ...) maps the resized image to a feature map]
Region Proposal Network (RPN)
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
A backbone network (VGG, ResNet C4, FPN, ...) produces the feature maps.
Region Proposal Network (RPN)
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
The network has to learn
a) whether an object is present in a location, and
b) to estimate its size/shape.
This is done by placing a set of anchors on the input image for each location on the output feature map.
Region Proposal Network (RPN)
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
[Figure: example of 9 anchors per image location]
Region Proposal Network (RPN)
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
First a 3x3 convolution with 512 filters is applied to the feature map.
This is followed by two layers:
a) a 1x1 convolution with 36 filters (box regression: 9 anchors x 4 coordinates), and
b) a 1x1 convolution with 18 filters (objectness: 9 anchors x 2 scores).
Region Proposal Network (RPN)
The multi-task loss function for the RPN is given by

L({p_i}, {t_i} | {p_i_hat}, {t_ij_hat}) = (1/N_cls) sum_i BCE(p_i, p_i_hat)
                                        + (lambda/N_reg) sum_i p_i_hat sum_{j in {x,y,w,h}} s(t_ij - t_ij_hat)

(classification loss + bounding box loss; the bounding box loss is activated only for positive anchors, p_i_hat = 1)

where
• p_i is the output score for anchor i,
• p_i_hat is the ground truth for anchor i,
• s(x) is the smooth L1 function: s(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise,
• t_ix = (x_i - x_ia) / w_ia;  t_iy = (y_i - y_ia) / h_ia;  t_iw = log(w_i / w_ia);  t_ih = log(h_i / h_ia),
• t_ix_hat = (x_i_hat - x_ia) / w_ia;  t_iy_hat = (y_i_hat - y_ia) / h_ia;  t_iw_hat = log(w_i_hat / w_ia);  t_ih_hat = log(h_i_hat / h_ia),
• x_i, y_i, w_i and h_i denote the center coordinates, width and height of box i,
• x_i, x_ia and x_i_hat refer to the predicted, anchor and ground-truth boxes (likewise for y, w and h),
• BCE stands for binary cross entropy,
• N_cls is the mini-batch size, and
• N_reg is the number of anchor locations.
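The smooth L1 term used in the bounding box loss is a one-liner; this sketch just implements the piecewise definition above:

```python
def smooth_l1(x):
    """Smooth L1: 0.5 x^2 if |x| < 1, else |x| - 0.5.
    Quadratic near zero (gentle gradients), linear for large errors
    (robust to outlier boxes)."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1 else ax - 0.5
```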
Non-Maximum Suppression
The regions delivered by the RPN are filtered as follows:
Discard all boxes with objectness p_c <= threshold.
While there are boxes:
• Pick the box with the largest p_c and output it as a prediction.
• Discard any remaining box with IoU >= 0.5 with the box output in the previous step.
The algorithm must run separately for each class.
Mask R-CNN
Proposed by Kaiming He and co-authors in 2018.
It is built upon Faster R-CNN.
It involves three steps:
a) the Region Proposal Network (RPN), which generates region proposals,
b) the Region of Interest Alignment (ROIAlign), which reshapes each region proposal to a standard shape/size, and
c) Mask Generation itself, which segments the image inside the bounding boxes with objects.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
ROI pooling
The next step (Mask Generation) requires inputs of a fixed shape/size. Thus, proposals (anchors) must be resized.
Example: the anchor is 4x6 and the Mask Generator input is 2x2, so each 2x3 block is max pooled into one value.
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
ROI pooling
Often, the anchor size is not an integer multiple of the FCN input size!
Solution: round to the nearest integer.
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
ROI Alignment
Instead of rounding to the nearest integer, the values are bilinearly interpolated.
ROIAlign "improves mask accuracy by 10% to 50%".
Source: Shipla Ananth, (2019) Faster R-CNN for object detection
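The key operation is sampling the feature map at fractional coordinates. A minimal single-point sketch (the helper name and the H x W array layout are my assumptions; real ROIAlign samples several such points per output bin and averages them):

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Sample feature map `fmap` (H x W) at fractional location (x, y)
    by bilinear interpolation, as ROIAlign does instead of rounding."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    dx, dy = x - x0, y - y0
    # blend the four surrounding values by their distances
    top = fmap[y0, x0] * (1 - dx) + fmap[y0, x1] * dx
    bot = fmap[y1, x0] * (1 - dx) + fmap[y1, x1] * dx
    return top * (1 - dy) + bot * dy
```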
Mask R-CNN
Proposed by Kaiming He and co-authors in 2018.
It is built upon Faster R-CNN.
It involves three steps:
a) the Region Proposal Network (RPN), which generates region proposals,
b) the Region of Interest Alignment (ROIAlign), which reshapes each region proposal to a standard shape/size, and
c) Mask Generation itself, which segments the image inside the bounding boxes with objects.
Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/
Steps of Mask Generation
• The fixed-size feature map from ROIAlign goes into an R-CNN head (FC layers), resulting in bounding box predictions (regression coefficients) and classification scores.
• The fixed-size feature map from ROIAlign also goes into an FCN to get the segmentation mask.
• The mask prediction is resampled to recover the original resolution and produce the final mask.
Loss for the Mask Generation
The multi-task loss function for the Mask Generator is given by

L({p_i}, {t_i} | {p_i_hat}, {t_ij_hat}) = sum_i [ -log p_i
                                        + sum_{j in {x,y,w,h}} s(t_ij - t_ij_hat)
                                        + sum_{1 <= k <= K} BCE(m_ik, m_ik_hat) ]

(classification loss + bounding box loss + mask loss)

where
• p_i is the output score/probability of the true class for box i,
• s(x) is the smooth L1 function: s(x) = 0.5 x^2 if |x| < 1, and |x| - 0.5 otherwise,
• t_ix = (x_i - x_ia) / w_ia;  t_iy = (y_i - y_ia) / h_ia;  t_iw = log(w_i / w_ia);  t_ih = log(h_i / h_ia),
• t_ix_hat = (x_i_hat - x_ia) / w_ia;  t_iy_hat = (y_i_hat - y_ia) / h_ia;  t_iw_hat = log(w_i_hat / w_ia);  t_ih_hat = log(h_i_hat / h_ia),
• x_i, y_i, w_i and h_i denote the center coordinates, width and height of box i,
• x_i, x_ia and x_i_hat refer to the predicted, anchor and ground-truth boxes (likewise for y, w and h),
• m_ik / m_ik_hat is the predicted / ground-truth mask for class k,
• BCE stands for binary cross entropy, and
• K denotes the number of classes.
OD vs IS vs SS - Demo
Main Sources
Original paper: He, K., Gkioxari, G., Dollár, P., Girshick, R., (2018), Mask R-CNN
Presentation at ICCV17 by Kaiming He, the first author
Object Detection
Thanks for your attention
R. Q. FEITOSA