RepPoints: Point Set Representation for Object Detection

Ze Yang1†∗, Shaohui Liu2,3†∗, Han Hu3, Liwei Wang1, Stephen Lin3

1Peking University, 2Tsinghua University, 3Microsoft Research

[email protected], [email protected], [email protected], {hanhu, stevelin}@microsoft.com

Abstract

Modern object detectors rely heavily on rectangular bounding boxes, such as anchors, proposals and the final predictions, to represent objects at various recognition stages. The bounding box is convenient to use but provides only a coarse localization of objects and leads to a correspondingly coarse extraction of object features. In this paper, we present RepPoints (representative points), a new finer representation of objects as a set of sample points useful for both localization and recognition. Given ground truth localization and recognition targets for training, RepPoints learn to automatically arrange themselves in a manner that bounds the spatial extent of an object and indicates semantically significant local areas. They furthermore do not require the use of anchors to sample a space of bounding boxes. We show that an anchor-free object detector based on RepPoints, implemented without multi-scale training and testing, can be as effective as state-of-the-art anchor-based detection methods, with 42.8 AP and 65.0 AP50 on the COCO test-dev detection benchmark.

1. Introduction

Object detection aims to localize objects in an image and provide their class labels. As one of the most fundamental tasks in computer vision, it serves as a key component for many vision applications, including instance segmentation [34], human pose analysis [43], and visual reasoning [47]. The significance of the object detection problem together with the rapid development of deep neural networks has led to substantial progress in recent years [7, 12, 11, 38, 15, 3].

∗Equal contribution. †The work is done when Ze Yang and Shaohui Liu are interns at Microsoft Research Asia.

[Figure 1 diagram labels: localization supervision; recognition supervision; convert; feature extraction + classification]

Figure 1. RepPoints are a new representation for object detection that consists of a set of points which indicate the spatial extent of an object and semantically significant local areas. This representation is learned via weak localization supervision from rectangular ground-truth boxes and implicit recognition feedback. Based on the richer RepPoints representation, we develop an anchor-free object detector that yields improved performance compared to using bounding boxes.

In the object detection pipeline, bounding boxes, which encompass rectangular areas of an image, serve as the basic element for processing. They describe target locations of objects throughout the stages of an object detector, from anchors and proposals to final predictions. Based on these bounding boxes, features are extracted and used for purposes such as object classification and location refinement. The prevalence of the bounding box representation can partly be attributed to common metrics for object detection performance, which account for the overlap between estimated and ground truth bounding boxes of objects. Another reason lies in its convenience for feature extraction in deep networks, because of its regular shape and the ease of subdividing a rectangular window into a matrix of pooled cells.
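To make the overlap-based metrics mentioned above concrete, here is a minimal Python sketch of intersection-over-union (IoU) between two axis-aligned boxes, the overlap measure commonly used when scoring detections. The function name `box_iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for this illustration and are not taken from the paper.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Each box is (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    This corner format is an assumption made for this sketch.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0
```

For example, `box_iou((0, 0, 2, 2), (1, 1, 3, 3))` has an intersection of area 1 and a union of area 7, so the overlap is 1/7. Because the score depends only on the two rectangles, it says nothing about how well either box follows the object's actual shape.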

Though bounding boxes facilitate computation, they provide only a coarse localization of objects that does not conform to an object’s shape and pose. Features extracted from the regular cells of a bounding box may thus be heavily influenced by background content or uninformative foreground areas that contain little semantic information. This may result in lower feature quality that degrades classification performance in object detection.

In this paper, we propose a new representation, called RepPoints, that provides more fine-grained localization and

arXiv:1904.11490v1 [cs.CV] 25 Apr 2019




