
Page 1: PROJECT LIVENESS REPORT 01

Object Detection

R. Q. FEITOSA

Page 2: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN


Page 3: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

[Figure: the five computer vision tasks: image classification, object localization, object detection, semantic segmentation, instance segmentation]

Page 4: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Image Classification: assigns a label to an image.

[Figure: example image classified as CAT]

Page 5: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Object Localization: locates the single object in the image and indicates its location with a bounding box.

[Figure: the cat localized with a single bounding box, labeled CAT]

Page 6: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Object Detection: locates multiple (if any) instances of object classes with bounding boxes.


Page 7: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Semantic Segmentation: links a class label to individual pixels with no concern to separate instances.

[Figure: pixels labeled CAT, ROAD, and VEGETATION]

Page 8: PROJECT LIVENESS REPORT 01

Computer Vision Tasks

Instance Segmentation: identifies each instance of each object class at the pixel level.

[Figure: region-based vs. pixel-based tasks: semantic segmentation recognizes classes but ignores instances, while instance segmentation recognizes both classes and instances]

Page 9: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN


Page 10: PROJECT LIVENESS REPORT 01

Datasets

• PASCAL Visual Object Classification (PASCAL VOC)
• 8 different challenges from 2005 to 2012
• Around 10,000 images of 20 categories

Source: Everingham et al., 2012, Visual Object Classes Challenge 2012 Dataset

[Figure: sample images of PASCAL VOC]

Page 11: PROJECT LIVENESS REPORT 01

Datasets

• Common Objects in COntext (COCO), T.-Y. Lin et al. (2015)
• Multiple challenges
• Around 160,000 images of 80 categories

Source: Arthur Ouaknine, Review of Deep Learning Algorithms for Object Detection (2018)

[Figure: sample images of COCO]

Page 12: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN


Page 13: PROJECT LIVENESS REPORT 01

Performance Metrics

$$\text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$$

$$\text{Recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$$

$$\text{Intersection over Union (IoU)} = \frac{\text{area of overlap between predicted and reference boxes}}{\text{area of their union}}$$

Average Precision ($AP$): precision averaged over recall levels (defined in detail on a later slide).

[Figure: predicted box vs. reference box, showing true positive, false positive, and false negative regions]

Page 14: PROJECT LIVENESS REPORT 01

Performance Metrics

Most competitions use mean average precision ($mAP$) as the principal metric to evaluate object detectors, though definitions and implementations vary.

Page 15: PROJECT LIVENESS REPORT 01

Performance Metrics: Precision-recall curve

The confidence score is the probability that an anchor box contains an object. It is usually predicted by a classifier.

As the threshold on the confidence score decreases:
• recall increases monotonically;
• precision can go up and down, but the general tendency is to decrease.

Source: Nickzeng (2018)

Page 16: PROJECT LIVENESS REPORT 01

Performance Metrics: Precision-recall curve – Average Precision

The average precision ($AP$) is the precision averaged across all unique recall levels.

It is based on the interpolated precision-recall curve. The interpolated precision $p_{\text{interp}}$ at a certain recall level $r$ is defined as the highest precision found for any recall level $r' \ge r$:

$$p_{\text{interp}}(r) = \max_{r' \ge r} p(r')$$

Source: Nickzeng (2018)

[Figure: original vs. interpolated precision-recall curves]

Page 17: PROJECT LIVENESS REPORT 01

Performance Metrics: Mean average precision – $mAP$

The calculation of $AP$ involves one class. However, in object detection there are usually $K > 1$ classes. Mean average precision ($mAP$) is defined as the mean of $AP$ across all $K$ classes:

$$mAP = \frac{\sum_{i=1}^{K} AP_i}{K}$$

Page 18: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN


Page 19: PROJECT LIVENESS REPORT 01

YOLO History

May 2016: Joseph Redmon's paper "You Only Look Once: Unified, Real-Time Object Detection" introduced YOLO.
December 2016: Redmon introduced YOLOv2 in the paper "YOLO9000: Better, Faster, Stronger."
April 2018: Redmon published "YOLOv3: An Incremental Improvement."
February 2020: Redmon announced he was stepping away from computer vision.
April 23, 2020: Alexey Bochkovskiy published YOLOv4.
May 29, 2020: Glenn Jocher created a repository called YOLOv5.

[Photo: Joseph Redmon]

Page 20: PROJECT LIVENESS REPORT 01

YOLOv3

Source: Redmon & Farhadi, 2018, "YOLOv3: An Incremental Improvement"

Page 21: PROJECT LIVENESS REPORT 01

YOLOv3

Source: Sik-Ho Tsang, Review: YOLOv3 – You Only Look Once (Object Detection)

Page 22: PROJECT LIVENESS REPORT 01

Example of Object Detection

Training set: images $x$ with binary labels $y$ ($y = 1$ if the image contains the object, $y = 0$ otherwise).

A CNN is trained to map the image $x$ to the prediction $y \in \{0, 1\}$.

Page 23: PROJECT LIVENESS REPORT 01

Sliding Window Detection

[Figure: a window slides across the image and a CNN classifies each crop; output 0 here, no object]

Problem: High computational cost!

Page 24: PROJECT LIVENESS REPORT 01

Accurate Bounding Boxes


Page 25: PROJECT LIVENESS REPORT 01

YOLO Basics

Source: Redmon et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

The image is divided into a grid (3×3 in the example). For each grid cell, the label vector is

$$y = (p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$$

Labels for training, for each grid cell:
• cell with no object: $y = (0, ?, ?, ?, ?, ?, ?, ?)$
• cell containing a car's midpoint: $y = (1, b_x, b_y, b_h, b_w, 0, 1, 0)$

A single CNN (convolution and max-pooling layers) maps the $n \times n \times 3$ input $x$ to the $3 \times 3 \times 8$ output $y$.

Remarks:
• It indicates whether there is an object in each grid cell ($p_c$), and from which class ($c_1, c_2, c_3$).
• It outputs the bounding box coordinates ($b_x, b_y, b_h, b_w$) for each grid cell.
• With finer grids you reduce the chance of multiple objects in the same grid cell.
• The object is assigned to just one grid cell, the one that contains its midpoint, no matter if the object extends beyond the cell limits.
• A single CNN accounts for all objects in the image → efficient.
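To make the label layout concrete, here is a minimal sketch of how these per-cell targets could be built (the 3×3 grid and the pedestrian/car/motorcycle class order follow the deck's running example; the encode_labels helper itself is hypothetical):

```python
import numpy as np

S, C = 3, 3   # grid size; classes: 1. pedestrian, 2. car, 3. motorcycle

def encode_labels(objects):
    """Build the S x S x (5 + C) target tensor y.

    Each object is (x, y, h, w, class_id) with coordinates relative to the
    full image in [0, 1]. An object goes to the cell holding its midpoint.
    """
    y = np.zeros((S, S, 5 + C), dtype=np.float32)
    for (x, yc, h, w, cls) in objects:
        col, row = int(x * S), int(yc * S)        # cell containing the midpoint
        bx, by = x * S - col, yc * S - row        # midpoint relative to the cell
        bh, bw = h * S, w * S                     # size in cell units (can be > 1)
        y[row, col, :5] = (1.0, bx, by, bh, bw)   # p_c = 1 plus box coordinates
        y[row, col, 5 + cls] = 1.0                # one-hot class
    return y

# One car (class index 1) whose midpoint falls in the middle cell:
target = encode_labels([(0.55, 0.45, 0.3, 0.5, 1)])
print(target.shape)   # (3, 3, 8)
print(target[1, 1])   # p_c=1, box=(0.65, 0.35, 0.9, 1.5), classes=(0, 1, 0)
```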

Page 26: PROJECT LIVENESS REPORT 01

Parametrization of box coordinates

Source: Redmon et al., 2016, You Only Look Once: Unified, Real-Time Object Detection

$$y = (1, b_x, b_y, b_h, b_w, 0, 1, 0)$$

Box coordinates are measured relative to the grid cell, whose top-left corner is $(0,0)$ and bottom-right corner is $(1,1)$:
• the midpoint $(b_x, b_y)$ (e.g., 0.2, 0.4) lies between 0 and 1;
• the height and width $(b_h, b_w)$ (e.g., 0.9, 1.5) can be > 1, since a box may extend beyond its cell.

Remark: actually, there are other parametrizations, but this is enough for the time being.

Page 27: PROJECT LIVENESS REPORT 01

Overlapping boxes

27Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 28: PROJECT LIVENESS REPORT 01

Non-maximum suppression

[Figure: several overlapping detections of the same cars, with confidence scores 0.9, 0.8, 0.7, 0.6, 0.6]

Often, some of the generated boxes overlap.

Algorithm (see the sketch below):
Discard all boxes with $p_c \le 0.6$.
While there are boxes left:
• Pick the box with the largest $p_c$ and output it as a prediction.
• Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.

The algorithm must run separately for each class.
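A direct transcription of this algorithm, as a per-class Python sketch (boxes as (x1, y1, x2, y2) tuples; the iou helper repeats the earlier metrics sketch):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes, as in the earlier metrics sketch."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    return inter / ((a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter)

def non_max_suppression(boxes, scores, score_thresh=0.6, iou_thresh=0.5):
    """Greedy NMS for a single class."""
    # Discard all boxes with p_c <= 0.6.
    kept = [(s, b) for s, b in zip(scores, boxes) if s > score_thresh]
    predictions = []
    while kept:
        # Pick the box with the largest p_c and output it as a prediction.
        best_score, best_box = max(kept, key=lambda sb: sb[0])
        predictions.append((best_score, best_box))
        # Discard remaining boxes with IoU >= 0.5 with the box just output.
        kept = [(s, b) for s, b in kept
                if b is not best_box and iou(b, best_box) < iou_thresh]
    return predictions
```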

Page 29: PROJECT LIVENESS REPORT 01

Dealing with Overlapping Objects

Two objects (here a pedestrian and a car) can have their midpoints in the same grid cell. With two anchor boxes, the label vector is doubled:

$$y = (\underbrace{p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3}_{\text{anchor box 1}}, \; \underbrace{p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3}_{\text{anchor box 2}})$$

In the example, the tall anchor box 1 encodes the pedestrian, $(1, b_x, b_y, b_h, b_w, 1, 0, 0)$, and the wide anchor box 2 encodes the car, $(1, b_x, b_y, b_h, b_w, 0, 1, 0)$. If the cell contains the car only, the anchor box 1 half is $(0, ?, ?, ?, ?, ?, ?, ?)$.

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 30: PROJECT LIVENESS REPORT 01

Anchor box algorithm

Previously: each object in a training image was assigned to the grid cell that contains that object's midpoint. In our example so far, output $y$: 3×3×8.

With two anchor boxes: each object in a training image is assigned to the grid cell that contains the object's midpoint and to the anchor box with the highest IoU with the object. In our example so far, output $y$: 3×3×16.

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 31: PROJECT LIVENESS REPORT 01

Selecting Anchor Boxes

1. Choose by hand, guided by the frequency of shapes in your application.
2. Use a clustering algorithm to find representative shapes (see the sketch after the figure caption below).


Figure 2: Clustering box dimensions on VOC and COCO. We run k-means clustering on the dimensions of bounding boxes to get good priors for our model. The left image shows the average IOU we get with various choices for k. We find that k = 5 gives a good tradeoff for recall vs. complexity of the model. The right image shows the relative centroids for VOC and COCO. Both sets of priors favor thinner, taller boxes while COCO has greater variation in size than VOC.

Source: Redmon, J. & Farhadi, A., 2016, YOLO9000: Better, Faster, Stronger
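A sketch of option 2 in the spirit of the figure above: k-means on box dimensions with the YOLO9000 distance d(box, centroid) = 1 - IoU(box, centroid), computing IoU as if the boxes shared a common center (an implementation detail assumed here):

```python
import numpy as np

def iou_wh(wh, centroids):
    """IoU between a box and each centroid given only (w, h), as if co-centered."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    return inter / (wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter)

def kmeans_anchors(boxes_wh, k=5, iters=100, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU to pick k anchor shapes."""
    boxes_wh = np.asarray(boxes_wh, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to the closest centroid (highest IoU).
        assign = np.array([np.argmax(iou_wh(wh, centroids)) for wh in boxes_wh])
        # Move each centroid to the mean shape of its cluster.
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = boxes_wh[assign == j].mean(axis=0)
    return centroids

# boxes_wh would be an (N, 2) array of ground-truth box widths and heights.
```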

Page 32: PROJECT LIVENESS REPORT 01

YOLO – wrap up: Training

Classes: 1. pedestrian, 2. car, 3. motorcycle.

For each grid cell and each of the two anchor boxes, the target is $(p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$:
• cell with no object: both anchor halves are $(0, ?, ?, ?, ?, ?, ?, ?)$;
• cell containing the car's midpoint: anchor box 2 gets $(1, b_x, b_y, b_h, b_w, 0, 1, 0)$ while anchor box 1 stays $(0, ?, ?, ?, ?, ?, ?, ?)$.

In our example, the output $y$ is 3×3×2×8: grid cells × #anchors × (5 + #classes).

The CNN maps the $n \times n \times 3$ input to this output; the outcome for each grid cell is a $1 \times 1 \times 16$ vector.

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 33: PROJECT LIVENESS REPORT 01

YOLO – wrap up: Prediction

For each grid cell, the CNN outputs a $1 \times 1 \times 16$ vector: one $(p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3)$ tuple for each of the two anchor boxes. For cells with no object, the predicted $p_c$ should be low and the remaining entries are ignored.

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 34: PROJECT LIVENESS REPORT 01

YOLO – wrap up: Prediction

• For each grid cell, get 2 predicted bounding boxes.
• Get rid of low-probability predictions.
• For each class (pedestrian, car, motorcycle), use non-max suppression to generate the final predictions.

Source: Andrew Ng. https://www.youtube.com/watch?v=RTlwl2bv0Tg

Page 35: PROJECT LIVENESS REPORT 01

YOLOv3 Network Architecture

Source: Ayoosh Kathuria, What's new in YOLO V3?

• Deeper than YOLO9000: from 53 to 106 layers.
• Input image size: 320, 416 or 608.
• Detection at three scales (13×13, 26×26 and 52×52 grids): better at detecting smaller objects.
• Skip connections.

Page 36: PROJECT LIVENESS REPORT 01

YOLOv3 – Anchors

For each cell there are $B$ anchors, and for each anchor the prediction is a vector of dimension $(5 + C)$. Example: for COCO, $B = 3$ and $C = 80$, i.e., $3 \times 85 = 255$ predicted values per grid cell.

The number of anchor predictions is

$$(13 \times 13 + 26 \times 26 + 52 \times 52) \times B = 3549 \times 3 = 10{,}647$$

This is 10 times YOLO9000!
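A quick check of this arithmetic, together with the per-scale output tensor shapes it implies (a sketch following the $S \times S \times B(5 + C)$ layout above):

```python
B, C = 3, 80                 # anchors per cell and number of classes for COCO
scales = [13, 26, 52]        # the three YOLOv3 detection scales

print(sum(s * s for s in scales) * B)    # 10647 anchor predictions
for s in scales:
    print(f"{s} x {s} x {B * (5 + C)}")  # per-scale output: S x S x 255
```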

Page 37: PROJECT LIVENESS REPORT 01

YOLOv3 – Bounding Box Prediction

Source: Redmon, J., Farhadi, A., (2018), YOLOv3: An Incremental Improvement.

Same as YOLO9000. The network predicts $t_x, t_y, t_w, t_h$ for each anchor box. The box predictions $b_x, b_y, b_w, b_h$ are given by

$$b_x = \sigma(t_x) + c_x \qquad b_y = \sigma(t_y) + c_y \qquad b_w = p_w e^{t_w} \qquad b_h = p_h e^{t_h}$$

where
• $(c_x, c_y)$ is the offset of the grid cell from the top-left corner of the image,
• $p_w$ and $p_h$ are the anchor width and height, and
• $\sigma$ is the sigmoid function.

Page 38: PROJECT LIVENESS REPORT 01

YOLOv3 – Loss Function

For each anchor, the prediction is $y = (t_0, t_x, t_y, t_h, t_w, s_1, \ldots, s_C)$. The loss is

$$
\begin{aligned}
\mathcal{L}_{YOLO} ={}& \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{i,j}^{obj}\left[(t_x-\hat{t}_x)^2+(t_y-\hat{t}_y)^2+(t_w-\hat{t}_w)^2+(t_h-\hat{t}_h)^2\right] &&\text{(bounding box loss, non-empty anchors)}\\
+{}& \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{i,j}^{obj}\left[-\log\sigma(t_0)+\sum_{k=1}^{C} BCE(\hat{y}_k, \sigma(s_k))\right] &&\text{(objectness and class loss, non-empty anchors)}\\
+{}& \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{i,j}^{noobj}\left[-\log(1-\sigma(t_0))\right] &&\text{(objectness loss, empty anchors)}
\end{aligned}
$$

where $\mathbb{1}_{i,j}^{obj}$ / $\mathbb{1}_{i,j}^{noobj}$ is the indicator function for non-empty / empty anchors, $\lambda_{noobj}$ and $\lambda_{coord}$ are weights, $y$ / $\hat{y}$ is the prediction / ground truth, $\sigma$ is the sigmoid, and $BCE$ is the binary cross-entropy, used for the classes instead of the softmax of YOLOv2.
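A compact NumPy sketch of this loss for one scale (the tensor layout, the masks, and the weight values are assumptions made for the illustration; a real implementation would also derive the targets from ground-truth boxes and anchors):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bce(y_true, y_pred, eps=1e-9):
    """Elementwise binary cross-entropy."""
    return -(y_true * np.log(y_pred + eps) + (1 - y_true) * np.log(1 - y_pred + eps))

def yolo_loss(pred, target, obj_mask, lam_coord=5.0, lam_noobj=0.5):
    """pred, target: (S, S, B, 5 + C) arrays laid out as (t0, tx, ty, th, tw, s1..sC);
    obj_mask: (S, S, B) boolean, True where an anchor is responsible for an object."""
    noobj_mask = ~obj_mask
    # Bounding box loss: squared error on the t-coordinates, non-empty anchors only.
    box = np.sum(((pred[..., 1:5] - target[..., 1:5]) ** 2)[obj_mask])
    # Objectness: -log sigma(t0) for objects, -log(1 - sigma(t0)) for the rest.
    obj = np.sum(-np.log(sigmoid(pred[..., 0][obj_mask]) + 1e-9))
    noobj = np.sum(-np.log(1.0 - sigmoid(pred[..., 0][noobj_mask]) + 1e-9))
    # Class loss: independent binary cross-entropies (not softmax), objects only.
    cls = np.sum(bce(target[..., 5:], sigmoid(pred[..., 5:]))[obj_mask])
    return lam_coord * box + obj + cls + lam_noobj * noobj
```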

Page 39: PROJECT LIVENESS REPORT 01

Performance on COCO

[Figure: YOLOv3 performance on COCO]

Page 40: PROJECT LIVENESS REPORT 01

YOLOv4

Source: https://arxiv.org/pdf/2004.10934.pdf

Page 41: PROJECT LIVENESS REPORT 01

Object Detector Architectures

Typically composed of several components:

• Backbone, the network that extracts the features: VGG16, ResNet-50, Darknet53, ResNeXt50, …
• Neck, which enhances feature discriminability and robustness: ASPP, PAN, RFB, …
• Head, which handles the prediction. Dense prediction (one-stage): RPN, YOLO, SSD, …; sparse prediction (two-stage): Faster R-CNN, R-CNN, …

Source: Bochkovskiy, et al., 2020

Page 42: PROJECT LIVENESS REPORT 01

Bag of Freebies

Methods that only change the training strategy or only increase the training cost.
• "bag" refers to a set of strategies, and
• "freebies" means that we get additional performance for free.

The authors investigated a large "freebies" arsenal (too many to cover in depth here), in particular data augmentation strategies.

Page 43: PROJECT LIVENESS REPORT 01

Bag of Specials

A set of modules that increase the inference cost only by a small amount but significantly improve the accuracy, such as
• attention mechanisms,
• special activation functions,
etc., both for the backbone and the head.

Source: Bochkovskiy, et al., 2020

Page 44: PROJECT LIVENESS REPORT 01

Additional Improvements

• A new data augmentation method: Mosaic + Self-Adversarial Training (SAT).
• Optimal hyper-parameters selected with genetic algorithms.
• Modifications to some existing methods for efficient training and detection.

Page 45: PROJECT LIVENESS REPORT 01

YOLOv4 Architecture

YOLOv4 consists of:

โ€ข Backbone: CSPDarknet53

โ€ข Neck: SPP, PAN

โ€ข Head: YOLOv3


Page 46: PROJECT LIVENESS REPORT 01

Recall DenseNet

DenseNet increases model capacity (complexity) and improves information flow (training) with highly interconnected layers.

Sources: Huang et al., 2018; Huang & Wang, 2018

Page 47: PROJECT LIVENESS REPORT 01

Recall DenseNet

Formed by composing multiple DenseBlocks with a transition layer in between.

Source: Wang et al., 2020

Page 48: PROJECT LIVENESS REPORT 01

Cross-Stage Partial Connection (CSP)

CSPNet separates the input feature maps of the DenseBlock into two parts:
• The first part x₀′ bypasses the DenseBlock and becomes part of the input to the next transition layer.
• The second part x₀′′ goes through the DenseBlock as below.

This reduces the computational complexity by separating the input into two parts, with only one going through the DenseBlock.

Source: Wang et al., 2020

Page 49: PROJECT LIVENESS REPORT 01

Cross-Stage Partial Connection (CSP)

CSP separates the input feature maps of the DenseBlock into two parts. The first part bypasses the DenseBlock and becomes part of the input to the next transition layer.

Source: Wang et al., 2020

Page 50: PROJECT LIVENESS REPORT 01

CSPDarknet-53

"YOLOv4 utilizes the CSP connections with the Darknet-53 (on the right) as the backbone in feature extraction."¹

It reduces the computational complexity by separating the input into two parts, with only one going through the DenseBlock.

[Figure: the Darknet-53 architecture]

¹Jonathan Hui, 2020

Page 51: PROJECT LIVENESS REPORT 01

YOLOv4 Architecture

YOLOv4 consists of:

โ€ข Backbone: CSPDarknet53

โ€ข Neck: SPP, PAN

โ€ข Head: YOLOv3


Page 52: PROJECT LIVENESS REPORT 01

Improved Spatial Pyramid Pooling (SPP)

Feature maps are max pooled at three scales with stride 1, and the three resulting feature maps are concatenated.

[Figure: original SPP vs. enhanced SPP]

Source: Huang & Wang, 2018

Page 53: PROJECT LIVENESS REPORT 01

CSP and SPP integrated in YOLO


Page 54: PROJECT LIVENESS REPORT 01

Path Aggregation Network (PAN)

YOLOv3 adopts an approach similar to FPN, making object detection predictions at different scale levels.

It upsamples (2×) the previous top-down stream and adds it to the neighboring layer of the bottom-up stream. The result is passed through a 3×3 convolution filter to reduce upsampling artifacts and create the feature map P4 below for the head.

Source: Lin et al., 2019

Page 55: PROJECT LIVENESS REPORT 01

Performance on COCO

Source: Bochkovskiy et al., 2020

Page 56: PROJECT LIVENESS REPORT 01

YOLO v3 vs YOLO v4 - Demo


click here to watch the demo

Page 57: PROJECT LIVENESS REPORT 01

Useful links

• Jonathan Hui, 2020, YOLOv4
• Deval Shah, 2020, YOLOv4 – Version 3: Proposed Workflow
• Shreejal Trivedi, YOLOv4 – Version 4: Final Verdict

Page 58: PROJECT LIVENESS REPORT 01

YOLOv5

• Glenn Jocher, founder and CEO of Ultralytics, released his open-source PyTorch implementation of YOLOv5 on GitHub.
• No paper published, just a blog.
• The authors claim:
• 140 frames per second (Tesla P100) vs. 50 FPS for YOLOv4;
• small (roughly 90% smaller than YOLOv4);
• fast.

Page 59: PROJECT LIVENESS REPORT 01

YOLOv5 Performance


Source: Jacob Solawetz, 2020

Page 60: PROJECT LIVENESS REPORT 01

Other OD approaches

• R-CNN: Rich feature hierarchies for accurate object detection and semantic segmentation (2014)

โ€ข Fast R-CNN (2015)

โ€ข Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, (2016)

โ€ข R-FCN: Object Detection via Region-based Fully Convolutional Networks (2016)

• SSD: Single Shot MultiBox Detector (2016)

โ€ข FPN: Feature Pyramid Networks for Object Detection (2017)

โ€ข RetinaNet: Focal Loss for Dense Object Detection (2018)

โ€ข Path Aggregation Network for Instance Segmentation (2018)

โ€ข EfficientDet: Scalable and Efficient Object Detection (2020)


Page 61: PROJECT LIVENESS REPORT 01

Content

1. Computer Vision Tasks

2. Data Sets for Object Detection

3. Performance Metrics

4. YOLO

5. Mask R-CNN


Page 62: PROJECT LIVENESS REPORT 01

Mask R-CNN

Page 63: PROJECT LIVENESS REPORT 01

CV tasks

Figure source: Tiba Razmi, (2019), Mask R-CNN

Page 64: PROJECT LIVENESS REPORT 01

Instance Segmentation

Instance segmentation combines
• object detection, and
• semantic segmentation.

[Figure: object detection example and semantic segmentation example]

Page 65: PROJECT LIVENESS REPORT 01

Mask R-CNN

It is a two-stage framework (meta-algorithm):
a) it generates proposals (areas likely to contain an object), and
b) it segments the proposals.

Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/

Page 66: PROJECT LIVENESS REPORT 01

Mask R-CNN

Proposed by Kaiming He and co-authors in 2018. It is built upon Faster R-CNN. It involves three steps:
a) the Region Proposal Network (RPN), which generates region proposals,
b) Region of Interest Alignment (RoIAlign), which reshapes each region proposal to a standard shape/size, and
c) mask generation, which segments the image inside the bounding boxes that contain objects.

Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/

Page 67: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

[Figure: a backbone network (VGG, ResNet C4, FPN, …) maps the resized input image to a feature map. Sizes in the figure are just illustrative.]

Source: Shilpa Ananth, (2019), Faster R-CNN for object detection

Page 68: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

A backbone network produces the feature maps.

Source: Shilpa Ananth, (2019), Faster R-CNN for object detection

Page 69: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

The network has to learn
a) whether an object is present in a location, and
b) to estimate its size/shape.

This is done by placing a set of anchors on the input image for each location on the output feature map.

Source: Shilpa Ananth, (2019), Faster R-CNN for object detection

Page 70: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

The network has to learn
a) whether an object is present in a location, and
b) to estimate its size/shape.

This is done by placing a set of anchors on the input image for each location on the output feature map.

[Figure: example of 9 anchors per image location]

Source: Shilpa Ananth, (2019), Faster R-CNN for object detection

Page 71: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

First a 3×3 convolution with 512 filters is applied to the feature map. This is followed by two parallel 1×1 convolutions:
a) one with 36 filters (4 box-regression coordinates × 9 anchors), and
b) one with 18 filters (2 objectness scores × 9 anchors).

A sketch of this head appears below.

Source: Shilpa Ananth, (2019), Faster R-CNN for object detection
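In code, this head is just three convolutions. A minimal PyTorch sketch (the 512 input channels and 9 anchors follow the slide; the module is an illustration, not the reference implementation):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """A 3x3 conv with 512 filters followed by two parallel 1x1 convs."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # 4 box-regression coordinates per anchor -> 36 filters for 9 anchors.
        self.reg = nn.Conv2d(512, 4 * num_anchors, kernel_size=1)
        # 2 objectness scores per anchor -> 18 filters for 9 anchors.
        self.cls = nn.Conv2d(512, 2 * num_anchors, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.reg(x), self.cls(x)

# For a 1 x 512 x 38 x 50 feature map: reg is (1, 36, 38, 50), cls is (1, 18, 38, 50).
reg, cls = RPNHead()(torch.zeros(1, 512, 38, 50))
```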

Page 72: PROJECT LIVENESS REPORT 01

Region Proposal Network (RPN)

The multi-task loss function for the RPN is given by

$$\mathcal{L}(p_i, t_i \mid \hat{p}_i, \hat{t}_{ij}) = \underbrace{\frac{1}{N_{cls}} \sum_i BCE(p_i, \hat{p}_i)}_{\text{classification loss}} + \underbrace{\frac{\lambda}{N_{reg}} \sum_i \hat{p}_i \sum_{j \in \{x,y,w,h\}} R(t_{ij} - \hat{t}_{ij})}_{\text{bounding box loss, activated only for positive anchors } (\hat{p}_i = 1)}$$

where
• $p_i$ is the output score for anchor $i$,
• $\hat{p}_i$ is the ground truth for anchor $i$,
• $R(x)$ is the smooth L1 function: $R(x) = 0.5x^2$ if $|x| < 1$, and $|x| - 0.5$ otherwise,
• $t_{ix} = (x_i - x_{ia})/w_{ia}$; $t_{iy} = (y_i - y_{ia})/h_{ia}$; $t_{iw} = \log(w_i/w_{ia})$; $t_{ih} = \log(h_i/h_{ia})$,
• $\hat{t}_{ix} = (\hat{x}_i - x_{ia})/w_{ia}$; $\hat{t}_{iy} = (\hat{y}_i - y_{ia})/h_{ia}$; $\hat{t}_{iw} = \log(\hat{w}_i/w_{ia})$; $\hat{t}_{ih} = \log(\hat{h}_i/h_{ia})$,
• $x_i$, $y_i$, $w_i$, and $h_i$ denote the center coordinates, width and height of box $i$,
• $x_i$, $x_{ia}$, and $\hat{x}_i$ refer to the predicted, anchor, and ground-truth boxes (likewise for $y_i$, $w_i$, and $h_i$),
• $BCE$ stands for binary cross-entropy,
• $N_{cls}$ is the mini-batch size, and
• $N_{reg}$ is the number of anchor locations.

[Plot: the smooth L1 function $R(x)$]
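A sketch of the two ingredients used above, the smooth L1 function $R$ and the $t$ parametrization (boxes given as center x, y, width, height; all numbers are made up):

```python
import numpy as np

def smooth_l1(x):
    """R(x): quadratic for |x| < 1, linear beyond."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def box_to_t(box, anchor):
    """Parametrize an (x, y, w, h) box relative to an (x, y, w, h) anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

# Regression term for one positive anchor:
anchor = (50.0, 50.0, 100.0, 100.0)
t = box_to_t((52.0, 48.0, 110.0, 95.0), anchor)       # prediction
t_hat = box_to_t((55.0, 47.0, 120.0, 90.0), anchor)   # ground truth
print(np.sum(smooth_l1(t - t_hat)))
```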

Page 73: PROJECT LIVENESS REPORT 01

Non-Maximum Suppression

The regions delivered by the RPN are validated as follows:

Discard all boxes with objectness $p_i \le threshold$.
While there are boxes left:
• Pick the box with the largest $p_i$ and output it as a prediction.
• Discard any remaining box with IoU ≥ 0.5 with the box output in the previous step.

The algorithm must run separately for each class.


Page 74: PROJECT LIVENESS REPORT 01

Mask R-CNN

Proposed by Kaiming He and co-authors in 2018. It is built upon Faster R-CNN. It involves three steps:
a) the Region Proposal Network (RPN), which generates region proposals,
b) Region of Interest Alignment (RoIAlign), which reshapes each region proposal to a standard shape/size, and
c) mask generation, which segments the image inside the bounding boxes that contain objects.

Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/

Page 75: PROJECT LIVENESS REPORT 01

ROI pooling

The next step (mask generation) requires inputs of a fixed shape/size. Thus, proposals (anchors) must be resized.

Example: the anchor is 4×6 and the mask generator input is 2×2, so each 2×3 block of the anchor is max pooled (yielding, in the figure, the values 0.8, 0.8, 0.6 and 0.7).

Source: Shilpa Ananth, (2019), Faster R-CNN for object detection

Page 76: PROJECT LIVENESS REPORT 01

ROI pooling

Often, the anchor size is not an integer multiple of the FCN input size! Solution: round to the nearest integer.

Source: Shilpa Ananth, (2019), Faster R-CNN for object detection

Page 77: PROJECT LIVENESS REPORT 01

ROI Alignment

Instead of rounding to the nearest integer, the values are bilinearly interpolated.

RoIAlign "improves mask accuracy by 10% to 50%".

Source: Shilpa Ananth, (2019), Faster R-CNN for object detection
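The core operation behind RoIAlign is sampling the feature map at fractional coordinates. A sketch of that bilinear interpolation for a single sample point (RoIAlign averages several such samples per output bin):

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Sample a 2D feature map at a fractional location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, fmap.shape[1] - 1), min(y0 + 1, fmap.shape[0] - 1)
    dx, dy = x - x0, y - y0
    # Weighted average of the four surrounding cells.
    return (fmap[y0, x0] * (1 - dx) * (1 - dy) +
            fmap[y0, x1] * dx * (1 - dy) +
            fmap[y1, x0] * (1 - dx) * dy +
            fmap[y1, x1] * dx * dy)

fmap = np.arange(16.0).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 2.25))  # 10.5, between rows 2-3 and cols 1-2
```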

Page 78: PROJECT LIVENESS REPORT 01

Mask R-CNN

Proposed by Kaiming He and co-authors in 2018. It is built upon Faster R-CNN. It involves three steps:
a) the Region Proposal Network (RPN), which generates region proposals,
b) Region of Interest Alignment (RoIAlign), which reshapes each region proposal to a standard shape/size, and
c) mask generation, which segments the image inside the bounding boxes that contain objects.

Figure source: https://www.pyimagesearch.com/2018/11/19/mask-r-cnn-with-opencv/

Page 79: PROJECT LIVENESS REPORT 01

Steps of Mask Generation

• The fixed-size feature map from RoIAlign goes into the R-CNN head (FC layers), resulting in bounding box predictions (regression coefficients) and classification scores.
• The fixed-size feature map from RoIAlign also goes into an FCN to get the segmentation mask.
• The mask prediction is resampled to recover the original resolution, yielding the final mask.

[Figure: feature map → RoIAlign → FC layers → classification scores and regression coefficients → bounding box; RoIAlign → FCN → mask prediction → resample → final mask]

Page 80: PROJECT LIVENESS REPORT 01

Loss for the Mask Generation

The multi-task loss function for the mask generator is given by

$$\mathcal{L}(p_i, t_i \mid \hat{p}_i, \hat{t}_{ij}) = \sum_i \left[ \underbrace{-\log p_i}_{\text{classification loss}} + \underbrace{\sum_{j \in \{x,y,w,h\}} R(t_{ij} - \hat{t}_{ij})}_{\text{bounding box loss}} + \underbrace{\sum_{1 \le k \le K} BCE(p_{ik}, \hat{p}_{ik})}_{\text{mask loss}} \right]$$

where
• $p_i$ is the output score/probability of the true class for box $i$,
• $R(x)$ is the smooth L1 function: $R(x) = 0.5x^2$ if $|x| < 1$, and $|x| - 0.5$ otherwise,
• $t_{ix} = (x_i - x_{ia})/w_{ia}$; $t_{iy} = (y_i - y_{ia})/h_{ia}$; $t_{iw} = \log(w_i/w_{ia})$; $t_{ih} = \log(h_i/h_{ia})$, and likewise for the ground-truth $\hat{t}_{ij}$,
• $x_i$, $y_i$, $w_i$, and $h_i$ denote the center coordinates, width and height of box $i$,
• $x_i$, $x_{ia}$, and $\hat{x}_i$ refer to the predicted, anchor, and ground-truth boxes (likewise for $y_i$, $w_i$, and $h_i$),
• $BCE$ stands for binary cross-entropy, and
• $K$ denotes the number of classes.

[Plot: the smooth L1 function $R(x)$]
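A sketch of the mask term alone. Note that the formula above sums the BCE over all $K$ masks, while the Mask R-CNN paper evaluates the per-pixel BCE only on the mask of the true class; the sketch follows the paper's variant:

```python
import numpy as np

def mask_loss(pred_masks, gt_mask, true_class, eps=1e-9):
    """pred_masks: (K, m, m) per-class mask probabilities; gt_mask: (m, m) binary."""
    p = pred_masks[true_class]          # only the true class contributes
    bce = -(gt_mask * np.log(p + eps) + (1 - gt_mask) * np.log(1 - p + eps))
    return bce.mean()                   # per-pixel average over the m x m mask

K, m = 3, 28                            # e.g. 3 classes, 28 x 28 mask resolution
pred = np.full((K, m, m), 0.5)          # uninformative prediction
gt = np.zeros((m, m))
gt[8:20, 8:20] = 1.0                    # square ground-truth mask
print(mask_loss(pred, gt, true_class=1))  # log(2) ≈ 0.693
```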

Page 81: PROJECT LIVENESS REPORT 01

OD vs IS vs SS - Demo


click here to watch the demo

Page 82: PROJECT LIVENESS REPORT 01

Main Sources

Original paper: He, K., Gkioxari, G., Dollár, P., Girshick, R., (2018), Mask R-CNN

Presentation at ICCV17 by Kaiming He, the first author

Page 83: PROJECT LIVENESS REPORT 01

Object Detection

Thanks for your attention!

R. Q. FEITOSA