
Detecting Multiple Moving Objects in Crowded Environments with Coherent Motion Regions

Anil M. Cheriyadat (1,2), Budhendra L. Bhaduri (1), and Richard J. Radke (2)

(1) Computational Sciences and Engineering, Oak Ridge National Laboratory, Oak Ridge, TN 37831 USA

(2) Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180 ∗

Abstract

We propose an object detection system that uses the locations of tracked low-level feature points as input, and produces a set of independent coherent motion regions as output. As an object moves, tracked feature points on it span a coherent 3D region in the space-time volume defined by the video. In the case of multi-object motion, many possible coherent motion regions can be constructed around the set of all feature point tracks. Our approach is to identify all possible coherent motion regions, and extract the subset that maximizes an overall likelihood function while assigning each point track to at most one motion region. We solve the problem of finding the best set of coherent motion regions with a simple greedy algorithm, and show that our approach produces semantically correct detections and counts of similar objects moving through crowded scenes.

1 Introduction

The inherent ability of our visual system to perceive coherent motion patterns in crowded environments is remarkable. Johansson [7] described experiments supporting this splendid visual perception capability, demonstrating the innate ability of humans to distinguish activities and count independent motions simply from 2D projections of a sparse set of feature points manually identified on human joints. When we conducted similar experiments with the video segments used in this paper, reducing each to a swarm of moving bright dots (automatically extracted features) against a dark background, human observers were easily able to detect and classify the moving objects. This motivated us to develop an automated system that can detect and count independently moving objects based on feature point trajectories alone.

We propose an object detection system that uses the locations of tracked low-level feature points as input, and produces a set of independent coherent motion regions as output. We define a coherent motion region as a spatiotemporal subvolume fully containing a group of point tracks; ideally, a single moving object corresponds to a single coherent motion region. However, in the case of many similar moving objects that may overlap from the camera's perspective, many possible coherent motion regions exist. Therefore, we pose the multi-object detection problem as one of choosing a good set of disjoint coherent motion regions that represents the individual moving objects. This decision is based on a track similarity measure that evaluates the likelihood that all the trajectories within a coherent motion region arise from a single object, and is made using a greedy algorithm. Figure 1 shows sample video frames with our algorithm's results overlaid.

∗ This work was supported in part by the US National Science Foundation, under award IIS-0237516.

Figure 1: Sample multi-object detection results from our algorithm. The red number indicates the aggregate count of independently moving objects detected in the video up to the given frame.

With the exception of requiring a single spatial bounding box that specifies the expected object size, the proposed approach can be considered a general-purpose algorithm for detecting and counting large numbers of similar moving objects. Our algorithm neither uses object-specific shape models nor relies on detecting and assembling object-specific components. Camera calibration is not required. The proposed algorithm is computationally efficient, and well-suited for camera network applications that require sensors to distill video into compact, salient descriptions. We demonstrate that our system can localize and track multiple moving objects with a high detection rate.

The remainder of the paper is organized as follows. In Section 2, we review recent work on multi-object detection in video. In Section 3, we describe our framework for estimating a good set of disjoint coherent motion regions. Results obtained from several real video sequences are presented in Section 4. Section 5 concludes the paper with ideas for future work.

2 Related Work

Previous approaches to detecting multiple moving objects, often humans in particular, include methods based on complex shape models [6, 22], generative shape models based on low-level spatial features [18, 14], bag-of-features [20], low-level motion pattern models [17], body-part assembly models [21, 17, 13, 11], and low-level feature track clustering [4, 12, 5].

Gavrila [6] proposed a shape exemplar-based object detection system. A shape exemplar database is constructed from a set of hundreds of training shapes and organized into a tree structure. The problem of matching a new image region against the exemplars at the tree nodes is posed in a Bayesian framework. The proposed object detection system addressed automated pedestrian detection from a moving vehicle. Similarly, Leibe et al. [8] proposed an approach for detecting pedestrians in crowded scenes that used a set of hundreds of ground-truth segmentations of pedestrians. These are used to build a bag-of-features type codebook for objects (based on squares centered at detected interest points). These local detectors are combined with global shape cues learned from training silhouettes into an MDL-type method to assign foreground pixels to human support regions using mean shift estimation. Viola et al. [20] proposed another bag-of-features approach to detect pedestrians in a video sequence. A detector is scanned over two frames of an image sequence to extract features that relate to variations in intensity gradients within this spatiotemporal volume. The features thus extracted are used to train an AdaBoost classifier to detect walking people.

Song et al. [17] addressed the problem of detecting humans based on motion models. A perceptual model of human motion is constructed from training sequences. Labeled features on the human body are tracked, and the joint probability density function of these features' positions and velocities is used to model human motion. During detection, features are tracked over two consecutive frames, and a person is detected if the features' positions and velocities maximize the posterior probability of the motion model.

Zhao and Nevatia [22] presented a key advance in finding people in crowds. People are modeled with shape parameters represented by a multi-ellipsoidal model, appearance parameters are represented by color histograms, and a modified Gaussian distribution is used to model the background. Candidates for head locations identified from the foreground blobs direct a Markov chain Monte Carlo (MCMC) algorithm to create new human hypotheses.

Another interesting approach [13, 11, 21] is the simultaneous detection and tracking of humans based on body part assembly. Body part detectors rely on spatial features and simple structural models to generate body part candidates. Body part candidates are assembled in a probabilistic framework, allowing the system to make hypotheses about probable human locations even in the presence of inter- and intra-object occlusions. Leordeanu and Collins [9] took an unsupervised learning approach based on the pairwise co-occurrences of parts to identify parts belonging to the same object. Each object part is a collection of scale-invariant feature points extracted from frames and grouped based on certain similarity conditions. Moving objects are identified by matching parts and estimating their pairwise co-occurrences.

Brostow and Cipolla [4] discussed scenarios in which crowds were so dense that background subtraction or model-based detection approaches would fail, since the crowd takes up most of the frame and there are few meaningful boundaries between entities. They proposed an unsupervised Bayesian algorithm for clustering tracked low-level interest points based entirely on motion, not on appearance. Using a spatial prior and a likelihood model for coherent motion, they obtained qualitatively encouraging results on crowded videos of people, bees, ants, and so on. Rabaud and Belongie [12] addressed similar scenarios for counting moving objects. Under crowded conditions, low-level feature extraction and tracking can result in fragmented and noisy feature-point trajectories. They proposed a trajectory conditioning strategy that propagates a spatial window along the temporal direction of each trajectory. New spatial coordinates for fragmented trajectories are obtained by averaging other trajectory coordinates inside this spatial window. Cheriyadat and Radke [5] proposed an algorithm for clustering feature point tracks in crowded scenes into dominant motions using a distance measure based on longest common subsequences. Vidal and Hartley [19] posed point trajectory clustering as a problem of finding linear subspaces representing independent motions. Point trajectories are projected to a five-dimensional space using the PowerFactorization method, and moving objects are segmented by fitting linear subspaces to the projected points under a generalized principal component analysis framework.

Tu and Rittscher [18, 14] took a different approach to crowd segmentation by arranging spatial features to form cliques. They posed the multi-object detection problem as one of finding a set of maximal cliques in a graph. The spatial features form the graph vertices, and each edge weight corresponds to the probability that the features arise from the same individual. Their earlier work [18] dealt with overhead views, and the spatial feature similarity measure was based on the assumption that the vertices lie on the circular contour of a human seen from above. In the later work [14], they used a variant of the expectation-maximization algorithm to estimate shape parameters from image observations via hidden assignment vectors of features to cliques. The features are extracted from the bounding contours of foreground blob silhouettes, and each clique (representing an individual) is parameterized as a simple bounding box. Our work is most similar to these approaches; the coherent motion regions we propose are similar to maximal cliques in a graph. However, our work differs in the important sense that the coherent motion regions extend in time as well as space, enforcing consistency in detected objects over long time periods and making the algorithm robust to noisy or short point tracks. As a result of enforcing the constraint that selected coherent motion regions contain disjoint sets of tracks, our algorithm cannot be viewed as taking place on a static graph. If we were to pose our algorithm in a graph framework, the edges and their weights would change at every iteration of the greedy algorithm, as described further below.

We see our approach as a trade-off between algorithms that require object models and/or a moderate amount of prior information [6, 22, 21, 14, 13] and algorithms that possess no model for objects at all [4, 12, 5, 19]. Our algorithm operates directly on raw, unconditioned low-level feature point tracks, and optimizes a global measure over the coherent motion regions rather than approaching the problem with bottom-up clustering.

3 Algorithm Overview

3.1 Feature Point Tracks

Our algorithm begins with a set of low-level spatial feature points tracked over time through a video sequence. We define the ith feature point track by X_i:

$$X_i = \{(x^i_t, y^i_t),\; t = T^i_{\mathrm{init}}, \ldots, T^i_{\mathrm{final}}\}, \qquad i = 1, \ldots, Z. \quad (1)$$

Here, Z represents the total number of point tracks. The lengths of the tracks vary depending on the durations for which the corresponding feature points are successfully tracked.

In our implementation, we first identify low-level features in the initial frame using the standard Shi-Tomasi-Kanade detector [16] as well as the Rosten-Drummond detector [15], a fast algorithm for finding corners. The low-level features are tracked over time using a hierarchical implementation [3] of the Kanade-Lucas-Tomasi optical flow algorithm [10]. New features detected in subsequent frames are tracked along with the existing point tracks to form a larger trajectory set. For trajectories that have initially stationary segments, we retain only the remaining part of the trajectory that shows significant temporal variations.
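For concreteness, the following sketch shows how such a track set could be produced with OpenCV's Shi-Tomasi detector and pyramidal Lucas-Kanade tracker. It is a minimal illustration under our own assumptions, not the authors' implementation: the file name and parameter values are placeholders, and it omits the Rosten-Drummond detector and the periodic re-detection of new features mentioned above.

```python
import cv2

# Minimal KLT tracking loop: Shi-Tomasi corners tracked frame-to-frame
# with pyramidal Lucas-Kanade. "video.avi" and parameters are illustrative.
cap = cv2.VideoCapture("video.avi")
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                              qualityLevel=0.01, minDistance=5)
tracks = {i: [tuple(p.ravel())] for i, p in enumerate(pts)}  # id -> [(x, y), ...]
ids = list(tracks)

while True:
    ok, frame = cap.read()
    if not ok or len(ids) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Track existing points into the new frame (hierarchical/pyramidal LK).
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    alive = status.ravel() == 1
    for tid, live, p in zip(ids, alive, new_pts):
        if live:
            tracks[tid].append(tuple(p.ravel()))
    pts = new_pts[alive].reshape(-1, 1, 2)
    ids = [tid for tid, live in zip(ids, alive) if live]
    prev_gray = gray

# `tracks` now holds point tracks in the spirit of Eq. 1
# (frame indices are implicit in the list positions).
```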

The user is also required to sketch a single rectangle that matches the rough dimensions of the objects to be detected in the sequence. Let the dimensions of this rectangle be w × h.

3.2 Trajectory Similarity

We require a measure of similarity between two feature point tracks. If both X_i and X_j exist at time t, we define

$$d^x_t(i,j) = (x^i_t - x^j_t)\left(1 + \max\left(0,\; \frac{|x^i_t - x^j_t| - w}{w}\right)\right)$$

$$d^y_t(i,j) = (y^i_t - y^j_t)\left(1 + \max\left(0,\; \frac{|y^i_t - y^j_t| - h}{h}\right)\right)$$

$$D_t(i,j) = \sqrt{d^x_t(i,j)^2 + d^y_t(i,j)^2}$$

That is, if the features do not fit within a w × h rectangle, the distance between them is nonlinearly increased. Our expectation is that feature point tracks from the same underlying object are likely to have a low maximum D_t as well as a low variance in D_t over the region of overlap. Hence, we compute an overall trajectory similarity as

$$S(i,j) = \exp\left\{-\alpha\left(\max_t D_t(i,j) + \operatorname{var}_t D_t(i,j)\right)\right\}, \quad (2)$$

where the maximum and variance are taken over the temporal region where both trajectories exist. For those trajectories with no temporal overlap, the similarity value is set to zero. The pairwise similarities are collected into a Z × Z matrix S. In our experiments below, we set α = 0.025.
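A direct transcription of the distance and similarity formulas into code might look as follows. The dictionary-based track representation is our own assumption; α, w, and h are taken from the text.

```python
import numpy as np

def pairwise_similarity(tracks, w, h, alpha=0.025):
    """Build the Z x Z similarity matrix S of Eq. 2.
    `tracks` maps track id -> {frame t: (x, y)} (assumed representation)."""
    ids = sorted(tracks)
    Z = len(ids)
    S = np.zeros((Z, Z))
    for a in range(Z):
        for b in range(a + 1, Z):
            ta, tb = tracks[ids[a]], tracks[ids[b]]
            common = sorted(set(ta) & set(tb))   # temporal overlap
            if not common:
                continue                          # no overlap -> similarity 0
            D = np.empty(len(common))
            for k, t in enumerate(common):
                dx = ta[t][0] - tb[t][0]
                dy = ta[t][1] - tb[t][1]
                # Nonlinear penalty once the pair no longer fits in a w x h box
                dx *= 1 + max(0.0, (abs(dx) - w) / w)
                dy *= 1 + max(0.0, (abs(dy) - h) / h)
                D[k] = np.hypot(dx, dy)
            S[a, b] = S[b, a] = np.exp(-alpha * (D.max() + D.var()))
    return S
```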

3.3 Coherent Motion Regions

A key component of our algorithm is what we term a coherent motion region. Mathematically, a coherent motion region is a spatiotemporal subvolume that fully contains a set of associated feature point tracks. In our case, each coherent motion region is a contiguous chunk of (x, y, t) space that completely spans the point tracks associated with it. Note that all the feature point tracks inside this region might not have complete temporal overlap. The set of all coherent motion regions can be represented by a binary Z × M matrix A indicating which point tracks are associated with each coherent motion region.

Figure 2: (a) Illustrating candidate coherent motion regions for a given set of point tracks. (b,c) The coherent motion regions are updated by sliding a spatial window over each frame and adding/deleting columns of the matrix A. (d) The matrix A after time instant t = t3.

In practice, we generate the coherent motion region matrix A by looping over the video frames, sliding a w × h spatial window over each frame. A column is added to A if a spatial window is found that contains a new subset of feature points not already represented as a column of A. If a spatial window is found that is a proper superset of an existing coherent motion region, the column of A corresponding to that region is deleted. In this way, only "maximal" coherent motion regions are recorded in the A matrix.
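One way to realize this windowing procedure is sketched below; the per-frame point layout, the window stride, and the set-based bookkeeping for "maximal" regions are our assumptions rather than details given in the paper.

```python
import numpy as np

def build_regions(points_per_frame, w, h, stride=4):
    """Sketch of constructing the columns of A (Sec. 3.3).
    `points_per_frame` maps frame t -> {track id: (x, y)} (assumed input).
    Each returned frozenset of track ids corresponds to one column of A."""
    regions = []
    for t, pts in points_per_frame.items():
        ids = np.array(list(pts))
        xy = np.array([pts[i] for i in ids], dtype=float)
        for x0 in np.arange(xy[:, 0].min(), xy[:, 0].max() + 1, stride):
            for y0 in np.arange(xy[:, 1].min(), xy[:, 1].max() + 1, stride):
                inside = ((xy[:, 0] >= x0) & (xy[:, 0] < x0 + w) &
                          (xy[:, 1] >= y0) & (xy[:, 1] < y0 + h))
                group = frozenset(ids[inside].tolist())
                if not group or group in regions:
                    continue
                # Keep only "maximal" regions: drop proper subsets of the new
                # group, and skip the group if a superset is already recorded.
                regions = [r for r in regions if not r < group]
                if not any(group < r for r in regions):
                    regions.append(group)
    return regions
```

Stacking the indicator vectors of these track-id sets over the Z tracks then yields the binary Z × M matrix A.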

Figure 2 illustrates several example coherent motion regions for a set of point tracks. The dots in the figure denote the spatial locations of the feature points at a particular time instant. At time instant t = t1, the point tracks form two coherent motion regions, shown as m1 and m2 in Figure 2a. At a later time instant t = t2, the spatial location of point track X5 triggers an update of coherent motion region m1, because the newly formed group is a superset of the previous one. Track X6 triggers the generation of a new coherent motion region m3. Figure 2d shows the status of the matrix A after the time instant t = t3 shown in Figure 2c. For the video sequences used in this paper, we found M to be in the range 3000-5000. Figure 3 shows an example video frame in which coherent motion regions are identified as red rectangles.

Figure 3: Coherent motion regions identified by the algorithm for an example frame.

We also associate an M-vector L with the set of coherent motion regions, indicating a "strength" related to the overall likelihood that each coherent motion region corresponds to a single object, given by

$$L(j) = A(j)^T S A(j) \quad (3)$$

where A(j) is the jth column of A.
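In matrix form, Eq. (3) is just a quadratic form; assuming A and S are stored as NumPy arrays, all M strengths can be computed at once, e.g.:

```python
import numpy as np

# L(j) = A(j)^T S A(j) for every column j of A, computed in one shot.
# A: binary Z x M track/region incidence matrix; S: Z x Z similarity matrix.
def region_strengths(A, S):
    return np.einsum('ij,ik,kj->j', A, S, A)
```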

3.4 Selecting the Coherent Motion Region Subset

Conceptually, we would like to select a subset V of coherent motion regions that maximizes the sum of the strengths in V, subject to the constraint that a point track can belong to at most one selected coherent motion region. We use a greedy algorithm to estimate a good subset V as follows:

1. Set V = ∅.

2. Determine j* ∉ V with maximal L(j).

3. Add j* to V and set A(i, j) = 0 for j ≠ j* and all i such that A(i, j*) = 1.

4. Recompute L(j) for the remaining j ∉ V and repeat steps 2-4 until L(j*) < β · max_{j∈V} L(j). In our implementation, we used β's in the range 0.01-0.1. A minimal code sketch of this selection loop appears below.
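The sketch below assumes the NumPy representations introduced earlier; the recomputation strategy and the default β are our simplifications for illustration.

```python
import numpy as np

def greedy_select(A, S, beta=0.05):
    """Greedy selection of disjoint coherent motion regions (Sec. 3.4 sketch)."""
    A = A.astype(float).copy()
    Z, M = A.shape
    L = np.einsum('ij,ik,kj->j', A, S, A)      # strengths, Eq. 3
    V = []
    while len(V) < M:
        remaining = [j for j in range(M) if j not in V]
        j_star = max(remaining, key=lambda j: L[j])
        if V and L[j_star] < beta * max(L[j] for j in V):
            break                               # stopping criterion of step 4
        V.append(j_star)
        # Step 3: zero the selected tracks out of all other columns,
        # enforcing that selected regions contain disjoint track sets.
        rows = A[:, j_star] > 0
        cols = np.ones(M, dtype=bool)
        cols[j_star] = False
        A[np.ix_(rows, cols)] = 0.0
        L = np.einsum('ij,ik,kj->j', A, S, A)   # step 4: recompute strengths
    return V
```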

This approach differs from a soft-assign approach of assigning tracks to cliques (e.g., extending [18]), since the greedy algorithm explicitly enforces the constraint that selected coherent motion regions be disjoint at each iteration. Our approach could be viewed as starting with a highly connected graph in which edges are progressively removed and the sum of edge weights (the likelihood) is updated.

We observed that the best coherent motion regions are typically selected around objects that have been visible for some time. Since new feature point tracks are detected and tracked in each frame, a longer-duration object has a greater opportunity to gather feature tracks. Although some of an object's tracks may be lost or "picked up" by other objects, the remaining consistent tracks will yield a coherent motion region with a high likelihood value. Selecting a high-likelihood coherent motion region in the greedy algorithm lowers the likelihood values of other coherent motion regions containing tracks that are part of the selected region.

4 Results

We illustrate the performance of our algorithm on seven different video sequences typically used in multi-object tracking, featuring various types of objects including moving people, ants, and vehicles. Although features related to object appearance and shape could be extracted from these sequences, we show that good performance can be obtained from feature point tracks alone.

The first video sequence, termed the campus sequence, shows a busy campus walkway (Figure 4a-d). The second video sequence, termed the GE sequence, shows a group of nine people walking together with significant inter-object occlusion during most of the sequence (Figure 4e-f). The third sequence, termed the metro sequence, was obtained from the PETS database [2] and shows people walking across a rail platform (Figure 4g-h). The fourth and fifth sequences, termed highway-1 and highway-2, show fast-moving vehicles on a highway (Figure 4i-j). During most of the highway-2 sequence, the camera is swaying heavily in the wind, which is typical of such outdoor camera installations. The sixth video sequence, termed the overhead sequence and part of the CAVIAR database [1], shows four people walking through a corridor (Figure 4k). The seventh video sequence, termed the ant sequence, shows many ants moving randomly on a glass plate (Figure 4l-m).

The feature point trajectory generation is fast, running at a rate comparable to the frame rate of the video sequence (i.e., ≈ 15 frames/sec). The multi-object detection algorithm is implemented in non-optimized Matlab code that takes less than 3 minutes of running time for each test video sequence reported here. Figure 5 shows comparisons between object counts made by a human observer and those determined by the algorithm for each sequence. For the campus, metro, highway-1 and highway-2 sequences, both counts were made for the spatial region identified by the red rectangle indicated in Figure 4a, g, i and j. The average false negative and false positive detection rates across all the sequences are 12% and 4%, respectively. We observed that the undercounting in Sequences 4 and 5 was due to the failure of the KLT tracker to generate feature points on several of the fast-moving vehicles.

As illustrated in Figure 6, the proposed general-purpose object detection system yields results comparable to those of the object-specific trackers reported by Rittscher et al. [14] and Zhao and Nevatia [22] for person detection and tracking, as well as to the general-purpose tracker of Brostow and Cipolla [4]. Our algorithm is able to overcome some of the difficulties in detecting individual objects that move in unison. To get a better appreciation of the detection results, we refer the reader to the following web link for the result videos: http://www.ecse.rpi.edu/~rjradke/pocv/.

Figure 5: Object count determined by the algorithm compared with the manual count for seven video sequences. The number in parentheses indicates the number of frames in each sequence.

5 Conclusions and Future Work

In this paper we introduce a simple approach based on coherent motion region detection for counting and locating objects in the presence of high object density and inter-object occlusions. We exploit the information generated by tracking low-level features to construct all possible coherent motion regions, and choose a good disjoint set of coherent motion regions representing individual objects using a greedy algorithm. The proposed algorithm requires no complex shape or appearance models for objects.


Figure 4: Sample results for seven different video sequences. The red number indicates the aggregate count of independently moving objects detected in the video up to the given frame. (a)-(d) Campus sequence. (e)-(f) GE sequence. Both sequences include substantial inter-object overlap and occlusion. (g)-(h) Metro sequence. (i) Highway-1 sequence. (j) Highway-2 sequence. (k) Overhead sequence. (l) Ant sequence. Note that several stationary ants are not detected by our algorithm, which is designed to find moving objects. The large red rectangle overlaid on subfigures (a), (g), (i) and (j) indicates the region within the frame where moving objects are detected and counted.


Figure 6: (a),(d) illustrate example results from Zhao and Nevatia's model-based tracker [22], (b),(e) illustrate example results from Brostow and Cipolla's independent motion detector [4], and (c),(f) illustrate our algorithm's results for corresponding frames from the campus sequence. (i) illustrates an example result from Rittscher et al.'s object-specific tracker [14], (j) illustrates an example result from Brostow and Cipolla's independent motion detector, and (k) illustrates our algorithm's result for one frame of the GE sequence.

False positives in our algorithm would occur in the presence of heterogeneous object classes. For example, a person pulling a large cart might be counted as two moving persons, or a large truck may be counted as two cars. False negatives occur when very few feature points are identified on an object, resulting in very low likelihood values and hence undetected objects.

Possible extensions of this work include the incorporation of other cues such as appearance, which could allow the system to maintain class-specific counts (e.g., cars vs. pedestrians or adults vs. children).

References

[1] CAVIAR database. http://homepages.inf.ed.ac.uk/rbf/CAVIAR/, 2004.

[2] PETS benchmark data. In Proc. of IEEE Workshop on Performance Evaluation of Tracking and Surveillance, 2007.

[3] G. Bradski. OpenCV: Examples of use and new applications in stereo, recognition and tracking. In Proc. of International Conference on Vision Interface, 2002.

[4] G. J. Brostow and R. Cipolla. Unsupervised Bayesian detection of independent motion in crowds. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 594–601, 2006.

[5] A. Cheriyadat and R. Radke. Automatically determining dominant motions in crowded scenes by clustering partial feature trajectories. In Proc. of the First ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC-07), September 2007.

[6] D. Gavrila. Pedestrian detection from a moving vehicle. In Proc. of the 6th European Conference on Computer Vision, pages 37–49, 2000.

[7] G. Johansson. Visual motion perception. Scientific American, 14:76–78, 1975.

[8] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2005.

[9] M. Leordeanu and R. Collins. Unsupervised learning of object features from video sequences. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2005.

[10] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of 7th International Joint Conference on Artificial Intelligence, pages 674–679, 1981.

[11] K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. In Proc. of European Conference on Computer Vision, pages 69–82, 2004.

[12] V. Rabaud and S. Belongie. Counting crowded moving objects. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 705–711, 2006.

[13] D. Ramanan and D. A. Forsyth. Finding and tracking people from the bottom up. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 18–20, 2003.

[14] J. Rittscher, P. Tu, and N. Krahnstoever. Simultaneous estimation of segmentation and shape. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 486–493, 2005.

[15] E. Rosten and T. Drummond. Machine learning for high speed corner detection. In Proc. of European Conference on Computer Vision, 2006.

[16] J. Shi and C. Tomasi. Good features to track. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 593–600, 1994.

[17] Y. Song, X. Feng, and P. Perona. Towards detection of human motion. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 13–15, 2000.

[18] P. Tu and J. Rittscher. Crowd segmentation through emergent labeling. In Proc. of ECCV Workshop on Statistical Methods in Video Processing, 2004.

[19] R. Vidal and R. Hartley. Motion segmentation with missing data using power factorization and GPCA. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2004.

[20] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In Proc. of IEEE International Conference on Computer Vision, pages 734–741, 2003.

[21] B. Wu and R. Nevatia. Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors. International Journal of Computer Vision, 2007.

[22] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In Proc. of IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413, 2004.