
Human Detection by Searching in 3D Space Using Camera and Scene Knowledge

Yuan Li, Bo Wu and Ram Nevatia
University of Southern California, Institute for Robotics and Intelligent Systems
Los Angeles, CA 90089-0273
{yli8|bowu|nevatia}@usc.edu

Abstract

Many existing human detection systems are based on sub-window classification; that is, detection is done by enumerating rectangular sub-images in the 2D image space. The detection rate of such approaches may be affected by perspective distortion and tilted orientation of the human in images. To overcome this problem without re-training the classifier, we develop a 3D search method. A search grid is defined in the 3D scene. At each grid point a rectified sub-image is generated to approximate the orthogonal projection of the target, so that the distortion due to the camera setting is reduced. In addition, the 3D target position can be estimated from single-camera data. Experiments on challenging data from the PETS 2007 and CAVIAR INRIA datasets show significantly improved detection performance of our approach compared with 2D search-based methods.

1 Introduction

As an important problem in visual surveillance, human detection aims at finding all the humans in an image. Among the large variety of detection methods, detection based on sub-window classification [10][6] is an important category with many high-performance representative human detection systems [11][3][8].

A sub-window-classification-based human detector represents the target appearance by a rectangular image patch with a pre-defined aspect ratio. Detection is done by enumerating all such possible sub-windows in the 2D image space. This works well when the camera is distant, perspective distortion of the target is not strong, and the target's orientation is upright in images. However, different camera settings may affect detection performance. Figure 1(a) shows an example: humans in the top-left region become undetectable simply because of the view angle and perspective effect.

Handling this problem at the classifier level is difficult and inefficient. We approach it by developing a new search strategy. Assuming that the camera settings can be estimated (which should be the case for most surveillance situations), object search is performed in the 3D world space instead of the 2D image space. A 3D scanning grid is created to cover all possible positions of objects in the scene. At each grid point, a rectified sub-image is generated to approximate the object appearance under orthogonal projection, so as to reduce the distortion caused by camera projection. This is done by approximating the object with an imaginary planar surface facing the camera and computing the homography between the input image coordinates and the rectified image coordinates. The classifier is then applied to the rectified sub-image. Figures 1(b) and (c) show detection results using our method.

Figure 1. Comparison of pedestrian detection results in a scene with strong perspective effect. (a) Input image from the PETS 2007 dataset [2]. (b) Our approach. (c) Image synthesized with the detected 3D positions of objects.

Compared with the conventional 2D search method, the benefits of our approach include: 1) the range of viewpoint variation that the detector can deal with is enlarged, i.e., 3D search extends the detection ability beyond the training data of the classifier; 2) contextual knowledge such as pedestrian height and the ground plane is naturally integrated to constrain the search space and rule out false alarms; 3) detection results are interpreted as 3D world coordinates, which can be used for tracking and further processing; 4) it can easily be combined with any sub-window-classification-based detector.

2 Related work

In the recent decade, intense research interest in classification-based object detection has brought forward numerous detection systems [10][6][11][3]. Since our emphasis is not on building a classifier, we focus on the design of search strategies and post-processing.

To search for objects in images, the most widely adopted way is to enumerate all possible sub-windows in the 2D image space. To further eliminate semantically meaningless false detections, post-processes are designed to utilize scene knowledge. Assuming that objects are on a ground plane, the 2D height of an object can be modeled as a function of its position in the image to reduce false alarms [5]. [4] uses ground plane estimation with surface orientation classification to refine detection results. To reduce the search space during detection, [7] computes a homography between the head-top points and foot points and combines it with background subtraction to obtain a subset of foot positions for detection. A similar approach is adopted in [9]. But the basic assumption is still that pedestrians appear upright in the image and are viewed by a distant camera. All these methods can reduce the false alarm rate by applying constraints on object size and position, but they cannot improve the detection rate because they do not attempt to adjust the input image to compensate for target appearance variations caused by camera settings.

978-1-4244-2175-6/08/$25.00 ©2008 IEEE

Figure 2. Approach overview. (Pipeline: radial undistortion → 3D searching grid → sub-image rectification → detection → result interpretation.)

3 Object detection in 3D search space

Figure 2 shows the block diagram of our approach. In the following, we describe each step in detail.

In the first step, we eliminate the radial distortion of the input image to obtain image I, so that from then on we operate only on the undistorted image I. Three coordinate systems are involved: the 2D coordinates of the radially undistorted image frame I, the 2D coordinates of the rectified sub-image A (on which detection is performed), and the 3D coordinates of the world frame W. We use homogeneous coordinates in I, A and W, and denote points in them by p = (I_u, I_v, 1)^T, q = (A_u, A_v, 1)^T and P = (x, y, z, 1)^T respectively. Let the mapping from W to I obtained from the camera parameters be

p = MP.    (1)

The search space is defined as a set of points in 3D world coordinates. For humans, we assume that they stand on a ground plane (other possible positions can also be included according to the scene). Search is performed on a discrete planar grid in 3D (Figure 3). The 3D search range can be determined automatically by constraining visibility and the minimal size of objects in images, or it can be refined by adding scene knowledge.

Figure 3. 3D searching grid of a scene from the PETS 2007 dataset. Left to right: original image, automatically generated searching grid, searching grid with scene knowledge. Each yellow line approximates a human standing at a grid point.
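The grid construction and the projection of Equation (1) can be sketched in a few lines of NumPy; the projection matrix M and the grid extents below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical 3x4 camera projection matrix M (world -> image);
# in practice M is obtained from camera calibration.
M = np.array([[800.0,   0.0, 320.0,   0.0],
              [  0.0, 800.0, 240.0, 800.0],
              [  0.0,   0.0,   1.0,   2.0]])

def project(M, P):
    """Map a homogeneous world point P = (x, y, z, 1)^T to image
    coordinates p = (I_u, I_v, 1)^T via p = MP (Equation 1)."""
    p = M @ P
    return p / p[2]  # normalize the homogeneous coordinate

# Discrete planar search grid on the ground plane z = 0, on which
# humans are assumed to stand; spacing and extent are illustrative.
xs, ys = np.meshgrid(np.arange(0.0, 10.0, 0.5),
                     np.arange(2.0, 12.0, 0.5))
grid = [np.array([x, y, 0.0, 1.0]) for x, y in zip(xs.ravel(), ys.ravel())]

# Image location of each grid point, e.g. to discard points that
# project outside the frame or would yield too small an object.
pixels = np.array([project(M, P)[:2] for P in grid])
```

In a full system the grid would then be pruned by the visibility and minimal-size constraints described above.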

3.1 Generating rectified sub-images

Human detectors based on sub-window classification are commonly trained on fixed-size images of humans viewed from a distant camera. To get the best detection result, the sub-images input to the classifier should also be obtained from a similar viewpoint as the training images. Since camera settings may vary from site to site, images need to be rectified before being input for classification.

Given the camera position Pc, at each point Po of the 3D searching grid we generate a rectified sub-image A so that if a human stands at Po, its appearance in A approximates its appearance under an orthogonal projection. To achieve this, we define the rectified image plane A as parallel to the z axis and perpendicular to the projection of PcPo on the x-y plane. Figure 4 shows an overview of the relationship among these geometric entities: (a) shows the scene in 3D, and image I is shown in more detail in (b) and (c). The human is drawn as a cylinder for the purpose of illustration; the visible part of the cylinder's side surface in I is marked in blue.

Figure 4. Relationship among image I, rectified sub-image A and an object in world frame W.

When the camera is not very close, the orthogonal view of the cylinder can be obtained by warping the blue region in I. To avoid non-linear warping, we simplify the model by assuming that the human is a rectangular surface H (P1P2P3P4) parallel to A with its bottom center located at Po. The projection of P1P2P3P4 in image I (p1p2p3p4 in Figure 4(c)) is a reasonable approximation of the blue region in Figure 4(b) if the angle between PcPo and the x-y plane is not very large; if the camera is directly over the top of the object, it would be impossible to recover the object's frontal orthogonal view.

Under this imaginary-object-plane approximation, the object's projection in image I can be transformed to the approximate orthogonal view in the rectified image A as follows. The angle between A and the x-z plane in the world frame can be computed by

\theta = \frac{\pi}{2} - \arccos\frac{(x_c, y_c)(x_o, y_o)^T}{|(x_c, y_c)|\,|(x_o, y_o)|}, \qquad (2)

where Pc = (x_c, y_c, z_c, 1)^T is the camera position (estimated from M if not known directly) and Po = (x_o, y_o, 0, 1)^T is the search grid point.

The mapping between the imaginary object plane P1P2P3P4 and the rectified image A can be written as

P = Tq = \begin{pmatrix} \cos\theta & 0 & -A_{uo}\cos\theta + x_o/\alpha \\ \sin\theta & 0 & -A_{uo}\sin\theta + y_o/\alpha \\ 0 & -1 & A_{vo} \\ 0 & 0 & 1/\alpha \end{pmatrix} q, \qquad (3)

where q_o = (A_{uo}, A_{vo}, 1)^T is the desired projected position of Po in A, and α is the ratio between the real-world object size and the image patch size on which the detector operates (e.g., for a 1.8-meter-tall human normalized to an image patch 60 pixels in height, α = 1.8/60). Image A can therefore be generated from image I by the homography

p = MP = MTq. \qquad (4)
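Equations (2)–(4) combine into a single 3×3 homography per grid point. The sketch below assumes illustrative values for M, the anchor position q_o and the scale α; the final warp of image I into the rectified sub-image A is only indicated in the closing comment:

```python
import numpy as np

def rectification_homography(M, Pc, Po, qo=(30.0, 120.0), alpha=1.8/120):
    """Build the homography p = M T q (Equation 4) that maps rectified
    sub-image coordinates q = (A_u, A_v, 1)^T into input image I, for a
    search grid point Po viewed from camera center Pc. qo is the desired
    bottom-center position of the person in A and alpha the ratio of
    real-world size to patch size; both defaults are illustrative."""
    xc, yc = Pc[0], Pc[1]
    xo, yo = Po[0], Po[1]
    # Equation (2): orientation of the imaginary object plane.
    cos_ang = (xc * xo + yc * yo) / (np.hypot(xc, yc) * np.hypot(xo, yo))
    theta = np.pi / 2 - np.arccos(np.clip(cos_ang, -1.0, 1.0))
    Auo, Avo = qo
    # Equation (3): mapping T from the rectified image A to the
    # imaginary object plane H in world coordinates.
    T = np.array([
        [np.cos(theta), 0.0, -Auo * np.cos(theta) + xo / alpha],
        [np.sin(theta), 0.0, -Auo * np.sin(theta) + yo / alpha],
        [0.0,          -1.0, Avo],
        [0.0,           0.0, 1.0 / alpha],
    ])
    # Equation (4): compose with the camera projection; the result is a
    # 3x3 homography from A to I.
    return M @ T

# Illustrative camera matrix and positions (not values from the paper).
M = np.array([[800.0,   0.0, 320.0,   0.0],
              [  0.0, 800.0, 240.0, 800.0],
              [  0.0,   0.0,   1.0,   2.0]])
H = rectification_homography(M, Pc=np.array([10.0, 0.0, 3.0]),
                             Po=np.array([5.0, 0.0, 0.0]))
# A could then be produced by sampling I at p = Hq for every rectified
# pixel q, e.g. cv2.warpPerspective(I, H, size, flags=cv2.WARP_INVERSE_MAP).
```

A quick consistency check: the bottom-center pixel q_o of the rectified image should map back to the image projection of the grid point Po itself.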

Figure 5. Orientation of the imaginary plane of the pedestrian at each grid point (left: viewed from the camera of the scene; right: another view).

Figure 6. The choice of the orientation of the rectified-image projection plane A (or the imaginary object plane H) affects the appearance of the rectified sub-image. The sub-images shown are computed using the orientation calculated in our approach (θ) and alternatives offset by 30° and 60°.

Figure 5 shows an example of the imaginary object plane at each search grid point in a given scene from the PETS 2007 data. Figure 6 shows how different choices of θ result in different rectified images. We can see that our approach approximates the appearance of the object under an orthogonal view and at a standard size, which is suitable as input to a sub-window classifier.

3.2 Result interpretation and post-processing

If an object is detected in a rectified image A at position q_o, we can use Equation 3 to estimate its 3D position Po. Multiple detection responses may be present around one object, so agglomerative clustering is performed based on the 3D distances between detection responses. Clustering using 3D distance is better for crowded scenes because in such cases the detection responses of two objects may largely overlap in 2D, causing clustering in 2D to merge them.
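The 3D clustering step can be sketched as a simple single-link agglomerative grouping; the merge threshold below is an illustrative assumption, not a value from the paper:

```python
import numpy as np

def cluster_detections_3d(positions, merge_dist=0.5):
    """Group detection responses by single-link agglomerative clustering
    on their 3D distance: responses closer than merge_dist (metres,
    illustrative) end up in the same cluster. Clustering in 3D avoids
    wrongly merging two people whose boxes overlap heavily in 2D."""
    pts = np.asarray(positions, dtype=float)
    n = len(pts)
    parent = list(range(n))

    def find(i):
        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pts[i] - pts[j]) < merge_dist:
                parent[find(i)] = find(j)  # merge the two clusters

    roots = [find(i) for i in range(n)]
    return [pts[[k for k, r in enumerate(roots) if r == root]]
            for root in sorted(set(roots))]

# Two responses around one person plus one distant response.
clusters = cluster_detections_3d([(2.0, 5.0, 0.0),
                                  (2.1, 5.1, 0.0),
                                  (8.0, 3.0, 0.0)])
```

Each cluster would then be reduced to a single object hypothesis, e.g. by averaging its member positions.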

4 Experiments

Experiments are done on subsets of the PETS 2007 and CAVIAR INRIA data¹, with a quantitative comparison between our approach and detection by conventional 2D search.

¹The images and ground truth are available at http://iris.usc.edu/~yli8/data/PETS07 subset.zip and CAVIAR INRIA subset.zip.

4.1 Pedestrian detection in PETS 2007 data

We integrate our approach with a Cluster Boosted Tree detector trained using the method in [11]. The training samples are collected from scenes in which pedestrians are upright and the camera position is distant. No modification to the classifier itself is necessary to integrate it with our approach.

For the PETS 2007 data, we test on the third view, which has an obvious perspective effect that impairs the detection performance of the conventional 2D search method. Since ground truth is not available, we randomly select 540 frames (1794 humans) to perform a quantitative comparison.

Figure 7. Sample results on the PETS 2007 dataset (all the false alarms of 3D search are marked in red).

Figure 8. Typical missed detections in the PETS 2007 dataset.

The detection rate and false alarm count are computed for the 3D search and the conventional 2D search using the same detector. Results are given in Table 1, and Figure 7 shows some sample results. From the results before clustering we can see that the detection response of 3D search is much stronger than that of 2D search. While the 3D search method is capable of detecting humans with different orientations and under perspective distortion, humans correctly detected using 2D search are mostly around the upper-left part of the images, where the perspective effect is not strong. Also, clustering using 3D distance gives better object inference in crowded scenes, as discussed in Section 3.2.

The overall detection rate is not high, mainly due to occlusion, low contrast and pose variation. Figure 8 shows some failure modes: profile view with walking pose (marked by yellow circles), top-down view (blue), occlusion in crowds (white), and poses like crouching and bending (orange). The latter two are common difficulties for human detection. For the walking pose, the rectified sub-image of a walking person is less accurate in the lower body because the stretched leg is far from our assumed planar surface of the human. Also, in the region where the camera points down over the head (the lower-right image region in the PETS examples), the detection rate is low because it is hard to recover the frontal orthogonal view from a near-overhead camera view.

Figure 9. Sample results on the CAVIAR INRIA dataset (all the false alarms of 3D search are marked in red).

4.2 Pedestrian detection in CAVIAR INRIA data

For the CAVIAR INRIA data [1], we calibrate the camera using a semi-automatic calibration tool by labeling parallel lines from the buildings. The result indicates that our method is quite robust to inaccuracy in camera parameters.

Quantitative comparison on the CAVIAR INRIA dataset is done on 706 randomly selected images (947 humans). The result of conventional 2D search is very poor ("2D" in Table 2). For a more meaningful comparison, we improve the 2D search by rotating the image in 2D according to the tilt angle of the human's upright direction at each scanning point. The result shows that the detection rate of 3D search is still more than twice that of 2D search plus in-plane rotation, with comparable false alarm rates ("2D+rotation" in Table 2), meaning that adjusting the tilt angle in 2D alone is not sufficient. Some sample results can be found in Figure 9.

5 Conclusion

We have proposed a novel 3D search strategy for object detection, which has the following advantages: image rectification to reduce the adverse effect of camera view on detection performance, integration of scene knowledge to reduce false alarms, flexibility to combine with any patch-based detector, and the ability to estimate object position in 3D. The 3D object positions output by the method can improve post-detection processes and should integrate naturally with existing multi-view tracking algorithms.

Table 1. Comparison on the PETS 2007 dataset. (H) indicates a higher threshold of detection confidence, (L) a lower threshold.

                       3D(L)   3D(H)   2D(L)   2D(H)
Detection rate         65.6%   48.4%   46.6%   25.7%
False alarms / frame   0.28    0.07    0.63    0.09

Table 2. Comparison on the CAVIAR INRIA dataset.

                       3D      2D+rotation   2D
Detection rate         87.2%   38.3%         17.5%
False alarms / frame   1.23    1.54          1.75

6 Acknowledgments

This research is supported, in part, by the U.S. Government VACE program. Yuan Li is funded, in part, by a Provost's Fellowship from USC.

References

[1] CAVIAR dataset. http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/.
[2] PETS 2007 dataset. http://www.cvg.rdg.ac.uk/PETS2007/.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.
[5] B. Leibe, K. Schindler, and L. V. Gool. Coupled detection and trajectory estimation for multi-object tracking. In ICCV, 2007.
[6] S. Z. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In ECCV, 2002.
[7] Z. Lin, L. S. Davis, D. Doermann, and D. DeMenthon. Hierarchical part-template matching for human detection and segmentation. In ICCV, 2007.
[8] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In CVPR, 2007.
[9] P. Tu, N. Krahnstoever, and J. Rittscher. View adaptive detection and distributed site wide tracking. In AVSS, 2007.
[10] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
[11] B. Wu and R. Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. In ICCV, 2007.
