introduction to object detection - github pages...2019/04/03 · one stage detector: yolo...

Introduction to Object DetectionPart 1

Gang Yu

Megvii Researcher

[email protected]

Visual Recognition

A fundamental task in computer vision

• Classification

• Object Detection

• Semantic Segmentation

• Instance Segmentation

• Key point Detection

• VQA

…

Category-level Recognition

Category-level Recognition Instance-level Recognition

Representation

• Bounding-box• Face Detection, Human Detection, Vehicle Detection, Text Detection,

general Object Detection• Point

• Semantic segmentation (will be discussed in next week)• Keypoint

• Face landmark• Human Keypoint

Outline

• Detection• Human Keypoint• Conclusion

Detection - Evaluation Criteria

Average Precision (AP) and mAP

Figures are from wikipedia

Detection - Evaluation Criteria

mmAP

Figures are from http://cocodataset.org

How to perform a detection?

• Sliding window: enumerate all the windows (up to millions of windows)• VJ detector: cascade chain

• Fully Convolutional network• shared computation

Robust Real-time Object Detection; Viola, Jones; IJCV 2001

http://www.vision.caltech.edu/html-files/EE148-2005-Spring/pprs/viola04ijcv.pdf

General Detection Before Deep Learning

• Feature + classifier• Feature

• Haar Feature• HOG (Histogram of Gradient)• LBP (Local Binary Pattern)• ACF (Aggregated Channel Feature)• …

• Classifier• SVM• Bootsing• Random Forest

Traditional Hand-crafted Feature: HoG

General Detection Before Deep Learning

Traditional Methods

• Pros• Efficient to compute (e.g., HAAR, ACF) on CPU• Easy to debug, analyze the bad cases• reasonable performance on limited training data

• Cons• Limited performance on large dataset• Hard to be accelerated by GPU

Deep Learning for Object Detection

Based on the whether following the “proposal and refine”

• One Stage• Example: Densebox, YOLO (YOLO v2), SSD, Retina Net• Keyword: Anchor, Divide and conquer, loss sampling

• Two Stage• Example: RCNN (Fast RCNN, Faster RCNN), RFCN, FPN, MaskRCNN• Keyword: speed, performance

A bit of History

ImageFeature

Extractor

classification

localization

(bbox)

One stage detector

Densebox (2015) UnitBox (2016) EAST (2017)

YOLO (2015) Anchor Free

Anchor importedYOLOv2 (2016)

SSD (2015)

RON(2017)

RetinaNet(2017)

DSSD (2017)

two stages detector

ImageFeature

Extractor

classification

localization

(bbox)

Proposal

classification

localization

(bbox)

Refine

RCNN (2014) Fast RCNN(2015)Faster RCNN (2015)

RFCN (2016)

MultiBox(2014)

RFCN++ (2017)

FPN (2017)

Mask RCNN (2017)

OverFeat(2013)

One Stage Detector: Densebox

DenseBox: Unifying Landmark Localization with End to End Object Detection, Huang etc, 2015

https://arxiv.org/abs/1509.04874


• No Anchor: GT Assignment• A sub-circle in the GT is labeled as positive

• fail when two GT highly overlaps• the size of the sub-circle matters• more attention (loss) will be placed to large faces

• Loss sampling• All pos/negative positions will be used to compute the cls loss


Problems

• L2 loss is not robust to scale variation (UnitBox)• learnt features are not robust

• GT assignment issue (SSD)• Fail to handle the crowd case

• relatively large localization error (Two stages detector)• more false positive (FP) (Two stages detector)

• does not obviously kill the fp

One Stage Detector: Densebox -> UnitBox

UnitBox: An Advanced Object Detection Network, Yu etc, 2016

http://cn.arxiv.org/pdf/1608.01471.pdf

One Stage Detector: Densebox -> UnitBox->EAST

EAST: An Efficient and Accurate Scene Text Detector, Zhou etc, CVPR 2017


One Stage Detector: YOLO

You Only Look Once: Unified, Real-Time Object Detection, Redmon etc, CVPR 2016



• No Anchor • GT assignment is based on the cells (7x7)

• Loss sampling• all pos/neg predictions are evaluated (but more sparse than densebox)




Discussion

• fc reshape (4096-> 7x7x30)• more context• but not fully convolutional

• One cell can output up to two boxes in one category • fail to work on the crowd case

• Fast speed• small imagenet base model• small input size (448x448)




Experiments on general detection



Method VOC 2007 test VOC 2012 test COCO time

YOLO 57.9/NA 52.7/63.4 NA fps: 45/155

One Stage Detector: YOLO -> YOLOv2

YOLO9000: Better, Faster, Stronger Redmon etc, CVPR 2016



Experiments:



Method VOC 2007 test VOC 2012 test COCO time

YOLO 52.7/63.4 57.9/NA NA fps: 45/155

YOLOv2 78.6 73.4 21.6 fps: 40


Video demo: https://pjreddie.com/darknet/yolo/



One Stage Detector: SSD

SSD: Single Shot MultiBox Detector, Liu etc

https://arxiv.org/pdf/1512.02325.pdf


SSD: Single Shot MultiBox Detector, Liu etc, ECCV 2016


One Stage Detector: SSD• Anchor

• GT-anchor assignment• GT is predicted by one best matched (IOU) anchor or matched with an

anchor with IOU > 0.5• better recall

• dense or sparse anchor?• Divide and Conquer

• Different layers handle the objects with different scales• Assume small objects can be predicted in earlier layers (not very strong

semantics)• Loss sampling

• OHEM: negative positions are sampled (not balanced pos/neg ratio)• negative:pos is at most 3:1




Discussion:

• Assume small objects can be predicted in earlier layers (not very strong semantics) (DSSD, RON, RetinaNet)

• strong data augmentation• VGG model (Replace by resnet in DSSD)

• cannot be easily adapted to other models• a lot of hacks

• A long tail (Large computation)




Experiments



Method VOC 2007 test VOC 2012 test COCO time (fps)

YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

One Stage Detector: SSD -> DSSD

DSSD : Deconvolutional Single Shot Detector, Fu etc 2017,


One Stage Detector: DSSD

Experiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

DSSD : Deconvolutional Single Shot Detector, Fu etc 2017,


One Stage Detector: SSD -> RON

RON: Reverse Connection with Objectness Prior Networks for Object Detection, Kong etc, CVPR 2017


One Stage Detector: RON

• Anchor• Divide and conquer

• Reverse Connect (similar to FPN)• Loss Sampling

• Objectness prior• pos/neg unbalanced issue• split to 1) binary cls 2) multi-class cls



One Stage Detector: RON

Experiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

RON 81.3 80.7 27.4 15



One Stage Detector: SSD -> RetinaNet

Focal Loss for Dense Object Detection, Lin etc, ICCV 2017


One Stage Detector: RetinaNet

• Anchor• Divide and Conquer

• FPN• Loss Sampling

• Focal loss• pos/neg unbalanced issue• new setting (e.g., more anchor)



One Stage Detector: RetinaNet

Experiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

RON 81.3 80.7 27.4 15

RetinaNet NA N 39.1 5



One Stage Detector: Summary

• Anchor• No anchor: YOLO, densebox/unitbox/east • Anchor: YOLOv2, SSD, DSSD, RON, RetinaNet

• Divide and conquer• SSD, DSSD, RON, RetinaNet

• loss sample• all sample: densebox• OHEM: SSD• focal loss: RetinaNet

One Stage Detector: DiscussionAnchor (YOLO v2, SSD, RetinaNet) or Without Anchor (Densebox, YOLO)

• Model Complexity• Difference on the extremely small model (< 30M flops on 224x224 input)

• Sampling • Application

• No Anchor: Face• With Anchor: Human, General Detection

• Problem for one stage detector• Unbalanced pos/neg data• Pool localization precision

Two Stages Detector: RCNN

Rich feature hierarchies for accurate object detection and semantic segmentation, Girshirk etc, CVPR 2014



Discussion

• Extremely slow speed• selective search proposal (CPU)/warp

• not end-to-end optimized• Good for small objects




Experiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

RON 81.3 80.7 27.4 15


RCNN 66 NA NA 47s



Two Stages Detector: RCNN -> Fast RCNN

Fast R-CNN, Girshick etc, ICCV 2015


Two Stages Detector: Fast RCNN

Discussion

• slow speed• selective search proposal (CPU)

• not end-to-end optimized• ROI pooling

• alignment issue• sampling• aspect ratio changes



Two Stages Detector: Fast RCNN

Experiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

RON 81.3 80.7 27.4 15


RCNN 66 NA NA 47s

Fast RCNN 77.0 82.3 (wth coco data) NA 0.5s



Two Stages Detector: RCNN -> Fast RCNN -> FasterRCNN

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Ren etc, CVPR 2016


Two Stages Detector: Faster RCNN

Discussion

• speed• selective search proposal (CPU) -> RPN

• alternative optimization/end-to-end optimization• Recall issue due to two stages detector



Two Stages Detector: Faster RCNNExperiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

RON 81.3 80.7 27.4 15


RCNN 66 NA NA 47s


Faster RCNN 73.2 70.4 NA 5



Two Stages Detector: RCNN -> Fast RCNN -> FasterRCNN -> RFCN

R-FCN: Object Detection via Region-based Fully Convolutional Networks, Dai etc, NIPS 2016,


Two Stages Detector: RFCN

Discussion

• Share convolution• fasterRCNN: shared Res1-4 (RPN), not shared Res5 (RCNN)• RFCN: shared Res1-5 (both RPN and RCNN)

• PSPooling• a large number of channels:(7x7xC)xWxH• Problems in ROIPooling also exist

• Fully connected vs Convolution• fc: global context• conv: can be shared but the context is relative small• trade-off: large kernel



Two Stages Detector: RFCNExperiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

RON 81.3 80.7 27.4 15


RCNN 66 NA NA 47s


Faster RCNN 73.2 70.4 NA 200ms

RFCN 79.5 77.6 29.9 170ms



Two Stages Detector: RFCN -> Deformable Convolutional Networks

Deformable Convolutional Networks, Dai etc, ICCV 2017


Two Stages Detector: RFCN -> Deformable Convolutional Networks

Discussion

• Deformable pool is similar to ROIAlign (in Mask RCNN)• Deformable conv

• flexible to learn the non-rigid objects

Deformable Convolutional Networks, Dai etc, ICCV 2017


Two Stages Detector: RCNN -> Fast RCNN -> FasterRCNN -> FPN

Feature Pyramid Networks for Object Detection, Lin etc, CVPR 2017


Two Stages Detector: FPN

Discussion

• FasterRCNN reproduced (setting)• Deeply supervised (better feature)



Two Stages Detector: FPNExperiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

RON 81.3 80.7 27.4 15


RCNN 66 NA NA 47s



RFCN 79.5 77.6 29.9 170ms

FPN NA NA 36.2 6



Two Stages Detector: RCNN -> Fast RCNN -> FasterRCNN -> FPN -> MaskRCNN

Mask R-CNN, He etc, ICCV 2017


Two Stages Detector: Mask RCNN

Discussion

• Alignment issue in ROIPooling -> ROIAlign• Multi-task learning: detection & mask



Two Stages Detector: Mask RCNNExperiments


YOLO 52.7/63.4 57.9/NA NA 45/155

YOLOv2 78.6 73.4 21.6 40

SSD 77.2/79.8 75.8/78.5 25.1/28.8 46/19

DSSD 81.5 80.0 33.2 5.5

RON 81.3 80.7 27.4 15


RCNN 66 NA NA 47s



RFCN 79.5 77.6 29.9 170ms

FPN NA NA 36.2 6

Mask RCNN NA NA 38.2 2.5



Two Stages Detector: Summary

• Speed• RCNN -> Fast RCNN -> Faster RCNN -> RFCN

• performance• Divide and conquer

• FPN• Deformable Pool/ROIAlign• Deformable Conv• Multi-task learning

Two Stages Detector: Discussion

FasterRCNN vs RFCN

One stage vs two Stage

MegDetection

Introduction & Demo Video

Open Problem in Detection

• FP• NMS (detection in crowd)• GT assignment issue• Detection in video

• detect & track in a network

Outline


Human Keypoint Task

• Single Person Skeleton• Cropped RGB image -> 2d key points / 3d key points• Keyword: inter-middle loss, large receptive field, context

• Multiple-Person Skeleton• RGB image -> human localization & human Keypoint for each person

Single Person Skeleton: CPM

Convolutional Pose Machines, Wei etc, CVPR 2016


Single Person Skeleton: Hourglass

Stacked Hourglass Networks for Human Pose Estimation, Newell etc, ECCV 2016


Multiple-Person Skeleton

• Top Down• Detect -> Single person skeleton

• Bottom Up• Deep/Deeper Cut• OpenPose• Associative Embedding

Multiple-Person Skeleton: OpenPose

CPM + PAF

Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, Cao etc, CVPR 2017


Multiple-Person Skeleton: OpenPose

Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, Cao etc, CVPR 2017


https://github.com/CMU-Perceptual-Computing-Lab/openpose

Multiple-Person Skeleton: Associative Embedding

Hourglass + AE

Associative Embedding: End-to-End Learning for Joint Detection and Grouping, Newell etc, NIPS 2017


Multiple-Person Skeleton: Associative Embedding

Associative Embedding: End-to-End Learning for Joint Detection and Grouping, Newell etc, NIPS 2017


Multiple-Person Skeleton: Discussion

• Top Down:• Depends on the detector

• Fail in the crowd case• Fail with partial observation• can detect the small-scale human

• More computation• Better localization when the input-size of single person skeleton is large

• Bottom up:• Fast computational speed• good at localizing the human with partial observation• Hard to assemble human

Challenges in Skeleton

• combine top-down approaches with bottom-up approaches

• perform pose track

• handle the crowd case

MegSkeleton

Introduction and demo Video

Outline


Conclusion

• Detection• One stage: Densebox, YOLO, SSD, RetinaNet• Two Stage: RCNN, Fast RCNN, FasterRCNN, RFCN, FPN, Mask RCNN

• Skeleton• Single Person Skeleton: CPM, Hourglass• Multi-person Skeleton

• Top Down• Bottom up: Openpose, Associative Embedding

Introduction to Object DetectionPart 2

Outline

• Modern Object detectors• One Stage detector vs Two-stage detector

• Challenges• Backbone• Head• Scale• Batch Size• Crowd

• Conclusion

What is object detection?

Modern Object detectors

Backbone Head

• Modern object detectors• RetinaNet

• f1-f7 for backbone, f3-f7 with 4 convs for head• FPN with ROIAlign

• f1-f6 for backbone, two fcs for head• Recall vs localization

• One stage detector: Recall is high but compromising the localization ability• Two stage detector: Strong localization ability

Postprocess

NMS

One Stage detector: RetinaNet

• FPN Structure

• Focal loss

Focal Loss for Dense Object Detection， Lin etc, ICCV 2017 Best student paper

Two-Stage detector: FPN/Mask R-CNN

• FPN Structure

• ROIAlign

Mask R-CNN， He etc, ICCV 2017 Best paper

What is next for object detection?

• The pipeline seems to be mature

• There still exists a large gap between existing state-of-arts and product requirements

• The devil is in the detail

Challenges Overview

• Backbone• Head• Scale• Batch Size• Crowd

Backbone Head Postprocess

NMS

Challenges - Backbone

• Backbone network is designed for classification task but not for localization task

• Receptive Field vs Spatial resolution

• Only f1-f5 is pretrained but randomly initializing f6 and f7 (if applicable)

Backbone - DetNet

• DetNet: A Backbone network for Object Detection, Li etc, 2018, https://arxiv.org/pdf/1804.06215.pdf


Backbone - DetNet

Challenges - Head

• Speed is significantly improved for the two-stage detector• RCNN - > Fast RCNN -> Faster RCNN - > RFCN

• How to obtain efficient speed as one stage detector like YOLO, SSD?• Small Backbone

• Light Head

Head – Light head RCNN• Light-Head R-CNN: In Defense of Two-Stage Object Detector, 2017,

https://arxiv.org/pdf/1711.07264.pdfCode: https://github.com/zengarden/light_head_rcnn


Head – Light head RCNN

• Backbone• L: Resnet101

• S: Xception145

• Thin Feature map• L:C_{mid} = 256

• S: C_{mid} =64

• C_{out} = 10 * 7 * 7

• R-CNN subnet• A fc layer is connected to the PS ROI pool/Align


• Mobile Version• A technique report will be released soon

• The first two-stage detector on mobile devices

Challenges - Scale

• Scale variations is extremely large for object detection

Challenges - Scale

• Scale variations is extremely large for object detection

• Previous works• Divide and Conquer: SSD, DSSD, RON, FPN, …

• Limited Scale variation

• Scale Normalization for Image Pyramids, Singh etc, CVPR2018• Slow inference speed

• How to address extremely large scale variation without compromising inference speed?

Scale - SFace

• SFace: An Efficient Network for Face Detection in Large Scale Variations, 2018, http://cn.arxiv.org/pdf/1804.06559.pdf• Anchor-based:

• Good localization for the scales which are covered by anchors

• Difficult to address all the scale ranges of faces

• Anchor-free:• Able to cover various face scales

• Not good for the localization ability

http://cn.arxiv.org/pdf/1804.06559.pdf

Scale - SFace

Scale - SFace

• Summary:• Integrate anchor-based and anchor-free for the scale issue

• A new benchmark for face detection with large scale variations: 4K Face

Challenges - Batchsize

• Small mini-batchsize for general object detection• 2 for R-CNN, Faster RCNN

• 16 for RetinaNet, Mask RCNN

• Problem with small mini-batchsize• Long training time

• Insufficient BN statistics

• Inbalanced pos/neg ratio

Batchsize – MegDet

• MegDet: A Large Mini-Batch Object Detector, CVPR2018, https://arxiv.org/pdf/1711.07240.pdf


Batchsize – MegDet

• Techniques• Learning rate warmup

• Cross-GPU Batch Normalization

Challenges - Crowd

• NMS is a post-processing step to eliminate multiple responses on one object instance• Reasonable for mild crowdness like COCO and VOC

• Will Fail in the case when the objects are in a crowd

Challenges - Crowd

• A few works have been devoted to this topic• Softnms, Bodla etc, ICCV 2017, http://www.cs.umd.edu/~bharat/snms.pdf

• Relation Networks, Hu etc, CVPR 2018, https://arxiv.org/pdf/1711.11575.pdf

• Lacking a good benchmark for evaluation in the literature

http://www.cs.umd.edu/~bharat/snms.pdf


Crowd - CrowdHuman

• CrowdHuman: A Benchmark for Detecting Human in a Crowd, 2018, https://arxiv.org/pdf/1805.00123.pdf, http://www.crowdhuman.org/• A benchmark with Head, Visible Human, Full body bounding-box

• Generalization ability for other head/pedestrian datasets

• Crowdness


http://www.crowdhuman.org/

Crowd - CrowdHuman

Crowd-CrowdHuman

Crowd-CrowdHuman

• Generalization• Head

• Pedestrian

• COCO

Conclusion

• The task of object detection is still far from solved

• Details are important to further improve the performance• Backbone

• Head

• Scale

• Batchsize

• Crowd

• The improvement of object detection will be a significantly boost for the computer vision industry

Introduction to Face++ Detection Team• Category-level Recognition

• Detection• Face Detection:

• FAN: https://arxiv.org/pdf/1711.07246.pdf

• Sface: https://arxiv.org/pdf/1804.06559.pdf

• Human Detection:

• Repulsion loss: https://arxiv.org/abs/1711.07752

• CrowdHuman: https://arxiv.org/pdf/1805.00123.pdf

• General Object Detection:

• Light Head: https://arxiv.org/pdf/1711.07264.pdfhttps://github.com/zengarden/light_head_rcnn

• MegDet: https://arxiv.org/pdf/1711.07240.pdf

• DetNet: https://arxiv.org/pdf/1804.06215.pdf

• Segmentation• Large Kernel Matters: https://arxiv.org/pdf/1703.02719.pdf

• DFN: https://arxiv.org/pdf/1804.09337.pdf

• BiseNet: https://arxiv.org/pdf/1808.00897.pdf

• Skeleton:• CPN: https://arxiv.org/pdf/1711.07319.pdf

• https://github.com/chenyilun95/tf-cpn






https://github.com/zengarden/light_head_rcnn







Introduction to Face++ Detection Team

• COCO2017 Detection/Skeleton Winner (ICCV 2017)

• ActivityNet AVA Winner (CVPR 2018)

• WAD Instance Segmentation Winner (CVPR 2018)

• WiderFace Challenge Winner (ECCV 2018)

• COCO 2018 Detection/Panoptic Seg/Skeleton Winner (ECCV 2018)

• Mapillary Panoptic Seg Winner (ECCV 2018)

广告部分

• Looking for Researchers/RSDEs/Interns

• 做“实用”的视觉技术

Email: [email protected]

Thanks

introduction to object detection - github pages...2019/04/03 · one stage detector: yolo...

Documents