
An Information Fusion Approach for Multiview Feature Tracking

Esra Ataer Cansizoglu and Margrit Betke Image and Video Computing Group, Computer Science Department, Boston University

{ataer, betke}@cs.bu.edu

Abstract

We propose an information fusion approach to tracking objects from different viewpoints that can detect and recover from tracking failures. We introduce a reliability measure that is a combination of terms associated with correlation-based template matching and the epipolar geometry of the cameras. The measure is computed to evaluate the performance of 2D trackers in each camera view and detect tracking failures. The 3D object trajectory is constructed using stereoscopy and evaluated to predict the next 3D position of the object. In case of track loss in one camera view, the projection of the predicted 3D position onto the image plane of this view is used to reinitialize the lost 2D tracker. We conducted experiments with 34 subjects to evaluate our proposed system on videos of facial feature movements during human-computer interaction. The system successfully detected feature loss and gave promising results on accurate re-initialization of the feature.

1. Introduction

Object tracking is an important task within the field of computer vision. Detection and frame-to-frame tracking of moving objects is, for example, required for human-computer interaction [1], surveillance, and video indexing. Robust tracking systems are needed that automatically detect the event of a tracking failure and recover from it [2].

In this study, we propose a solution for the problem of detecting a tracking failure by approaching it as a multi-camera information fusion problem. We treat the output of two-dimensional (2D) trackers in each view as separate sensor data. The main contribution of our study is the proposed Reliability Measure (RM). We developed a stereo-camera tracking system that computes this measure for each view and uses the results to identify and recover from the event of feature loss.

The experimental validation of our reliability measure was carried out on videos where facial features were tracked. There exist many works on facial feature tracking, e.g., [3-12]. Existing systems often require some manual initialization, and few systems are able to recover from the loss of track, for example, due to occlusion [2,8,9,13,15]. When occlusion does not occur frequently, a single camera may be sufficient for robust tracking. We focus on the scenarios where the tracked object is often partially or fully occluded in one camera view but may be visible in another. With a multi-camera system, using the information from all available cameras and the relation between them often makes it possible or easier to recover from a feature loss. Mittal and Davis [16] presented a system for tracking multiple people in cluttered scenes that combined the information from multiple cameras and considered the probability of occlusion. Snidaro and Foresti [17] used data fusion for multiview tracking by considering the quality of object segmentation in each view. Lien and Huang [18] modeled object motion and occlusion with two hidden Markov processes in each view. Du and Piater [19] proposed an approach to accomplish cooperation between particle filters in each view via belief propagation.

2. Multiview Tracking

An important task in evaluating multi-sensor data is to determine how much each sensor can be trusted. In a 2-camera tracking scenario, there are three cases: (i) accurate tracking is taking place in both views, (ii) the feature is lost in one of the views, and (iii) tracking failures occur in both views. In addition to detecting the event of a tracking failure, a successful stereo-camera tracking system should report in which view (or views) the feature was lost and provide a mechanism to recover from feature loss.

For this study, we used optical-flow-based 2D trackers [1] to track features in the left and right camera views. For each tracker estimate, our system computes the proposed Reliability Measure and uses it to detect tracking failures automatically. The Reliability Measure combines terms associated with template matching with terms derived from the epipolar geometry of the cameras. Our system constructs 3D trajectories by stereo vision and computes predictions of object positions in 3D space. The projection of a 3D position estimate into one camera view becomes an alternative to the estimate computed by the 2D tracker of that view. The RM measures the level of trust in the estimate by the 2D tracker against the level of trust in the projection of the 3D position. If the RM is low, the system determines that the 2D tracker lost the feature and recovers its 2D location by substituting the projection of the 3D position estimate.
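As a minimal sketch, one standard way to obtain a 3D position from a pair of 2D estimates and calibrated 3×4 projection matrices is linear (DLT) triangulation; the matrix names and the helper below are illustrative assumptions rather than the exact reconstruction routine of the system described here.

```python
import numpy as np

def triangulate_dlt(p_left, p_right, P_left, P_right):
    """Linear (DLT) triangulation of a 3D point from one 2D point per view
    and 3x4 camera projection matrices (assumed to be available from a
    prior stereo calibration)."""
    u, v = p_left
    u2, v2 = p_right
    A = np.vstack([
        u  * P_left[2]  - P_left[0],
        v  * P_left[2]  - P_left[1],
        u2 * P_right[2] - P_right[0],
        v2 * P_right[2] - P_right[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X_h = Vt[-1]
    return X_h[:3] / X_h[3]  # dehomogenize to a 3D point
```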

2.1. Reliability Measure

Given the coordinates $x_t$ and $x'_t$ of corresponding 2D points in the two views and the positions $z_t$ and $z'_t$ predicted by the optical-flow-based trackers in each view at time $t$, the 3D position $X_t$ of the object can be computed by stereoscopic reconstruction from the 2D predictions $z_t$ and $z'_t$. Alternatively, the 3D position $X_t$ can be estimated by 3D tracking under the assumption that the object moves with constant velocity: given the previous two 3D positions $X_{t-1}$ and $X_{t-2}$, the estimate of the current 3D position is $\hat{X}_t = 2X_{t-1} - X_{t-2}$ (Fig. 1). Estimates $y_t$ and $y'_t$ are the corresponding 2D projections of $\hat{X}_t$ in each view. We define the reliability measures RM and RM' of the two views to be (after dropping the subscript $t$)

$$\mathrm{RM} = \lambda_1\, U(\mathrm{NCC}(z)) + \lambda_2\, U(\mathrm{NCC}(y)) + \lambda_3\, U\!\left(1 - \frac{\|z - y\|}{d}\right) + \lambda_4\, U\!\left(\frac{\mathrm{EPD}(y, z')}{\mathrm{EPD}(z, z')}\right) \qquad (1)$$

and

$$\mathrm{RM}' = \lambda_1\, U(\mathrm{NCC}(z')) + \lambda_2\, U(\mathrm{NCC}(y')) + \lambda_3\, U\!\left(1 - \frac{\|z' - y'\|}{d}\right) + \lambda_4\, U\!\left(\frac{\mathrm{EPD}(z, y')}{\mathrm{EPD}(z, z')}\right) \qquad (2)$$

where $U$ is the normalization function

$$U(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 & \text{if } x > 1, \\ x & \text{otherwise,} \end{cases}$$

$\sum_{i=1}^{4} \lambda_i = 1$, $d = 10$, $\mathrm{NCC}(y)$ is the normalized correlation coefficient between the patch positioned at $y$ and the initially selected patch, and $\mathrm{EPD}(x, x')$ is the shortest distance between $x$ and the corresponding epipolar line defined by $x'$ of the other view. In the last term of the equations, we compute the epipolar distance by considering $y$ and $z'$ as the corresponding 2D positions in the two views. We divide this distance by the epipolar distance computed assuming that $z$ and $z'$ are a pair of matching points. The more the estimates by the 2D trackers can be trusted, the larger this ratio becomes. The normalization factor $d$ scales the distance between the two position estimates and is set to 10 pixels throughout the experiments.
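As an illustrative sketch (the system itself was implemented in C++), the quantities in Eq. (1) could be computed as follows; the fundamental matrix F, the NCC callable, and the argument conventions are assumptions of the sketch, and the default weights are the values selected in Section 2.2.

```python
import numpy as np

def U(x):
    """Normalization function of Eqs. (1)-(2): clamp x to [0, 1]."""
    return float(min(max(x, 0.0), 1.0))

def epd(p, q, F):
    """EPD(p, q): distance from point p to the epipolar line in its view
    induced by point q of the other view; F is a fundamental matrix whose
    direction convention is an assumption of this sketch."""
    a, b, c = F @ np.array([q[0], q[1], 1.0])   # epipolar line ax + by + c = 0
    return abs(a * p[0] + b * p[1] + c) / np.hypot(a, b)

def reliability(z, y, z_prime, ncc, F, weights=(0.5, 0.0, 0.25, 0.25), d=10.0):
    """Left-view RM of Eq. (1); z = 2D tracker estimate, y = projection of
    the predicted 3D position, z_prime = tracker estimate in the other view,
    ncc = callable returning the NCC score at a 2D position."""
    l1, l2, l3, l4 = weights
    term1 = U(ncc(z))
    term2 = U(ncc(y))
    term3 = U(1.0 - np.linalg.norm(np.asarray(z) - np.asarray(y)) / d)
    term4 = U(epd(y, z_prime, F) / max(epd(z, z_prime, F), 1e-6))
    return l1 * term1 + l2 * term2 + l3 * term3 + l4 * term4
```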

RM and RM' measure the level of trust that we assign to the 2D optical-flow-based trackers in the left and right views, respectively. If RM is low, the system disregards the estimate $z$ provided by the left-view 2D tracker and instead uses the projection $y$ of the 3D position estimate $\hat{X}$ as the current left-view 2D feature position. Similarly, if RM' is low, $y'$ is used as the right-view feature estimate.
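A minimal per-frame sketch of this fusion and recovery step is shown below, reusing reliability() from the previous sketch; the triangulation and projection helpers, the camera models behind them, and the threshold tau are illustrative assumptions, since only a "low" RM is specified above.

```python
import numpy as np

def fuse_and_recover(z, z_prime, X_prev, X_prev2, ncc_left, ncc_right,
                     F, triangulate, project_left, project_right, tau=0.5):
    """One fusion step. z, z_prime: 2D tracker estimates in the left/right
    view; X_prev, X_prev2: previous two 3D positions; triangulate and
    project_*: caller-supplied stereo helpers (assumptions); tau: assumed
    RM threshold below which a tracker is declared lost."""
    # Constant-velocity prediction of the current 3D position.
    X_hat = 2.0 * X_prev - X_prev2
    # Project the prediction onto each image plane.
    y = project_left(X_hat)
    y_prime = project_right(X_hat)
    # Reliability of each 2D tracker (reliability() from the sketch above;
    # for the right view the roles of the two views are swapped).
    rm = reliability(z, y, z_prime, ncc_left, F)
    rm_prime = reliability(z_prime, y_prime, z, ncc_right, F.T)
    # Substitute the projected 3D prediction for a lost 2D tracker.
    left = z if rm >= tau else y
    right = z_prime if rm_prime >= tau else y_prime
    # Reconstruct the current 3D position from the accepted 2D estimates.
    X_t = triangulate(left, right)
    return left, right, X_t
```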

Figure 1: 2D and 3D tracks and point estimates.

2.2. Adjusting the RM Weights

To determine how to weigh the different terms of the reliability measure so that the combination best represents the desired measure of trust in the tracked position, we collected a training dataset of videos of 8 subjects, who rotated their heads from the center to the right, then left, up, and down, and we analyzed the weights $\lambda_1, \ldots, \lambda_4$ of the RM during this motion. Our goal was to choose weights that reflected our trust in the left-view tracker when the subject moved to the left. The videos contained 450 frames per subject on average.

First, for each subject, we analyzed the RM of each view with only one term in the measure, i.e., $\lambda_i = 1$ and $\lambda_j = 0$ for all $j \neq i$. For example, when subject A moved to the right until frame 60 (Fig. 2), the right-view RM' increased more than the left-view RM. When she moved to the left until frame 190, the values improved for the left view. The 3rd and 4th terms had a different scale and needed further attention. In addition, the RM values for the cases $\lambda_1 = 1$ and $\lambda_2 = 1$ correlated highly (0.95), and thus we dropped the 2nd term ($\lambda_2 = 0$) in our later experiments.

Figure 2: Values of the four terms of the left-view RM (top) and right-view RM’ (bottom) for the video of subject A and direction of subject’s movements.

Second, we computed the RMs using pairs and triplets of equally-weighted terms so that all two-element and three-element combinations of three terms were considered (Fig. 3).

Figure 3: Values of pairs and triplets of RM terms with weights set equally.

We wanted the RM to provide a clear distinction between left and right movements. For example, the RMs with $\lambda_1 = 0$ do not have distinct peaks (Fig. 3, blue). We therefore set $\lambda_1 = 0.5$ and adjusted the remaining weights in such a way that we paid equal attention to the 3rd and 4th terms. After inspecting the graphs of the RMs for all 8 subjects, we selected the final weights $\lambda_1 = 0.5$, $\lambda_2 = 0$, and $\lambda_3 = \lambda_4 = 0.25$ (Fig. 4).
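With these final weights, the left-view measure of Eq. (1) reduces to

$$\mathrm{RM} = 0.5\, U(\mathrm{NCC}(z)) + 0.25\, U\!\left(1 - \frac{\|z - y\|}{10}\right) + 0.25\, U\!\left(\frac{\mathrm{EPD}(y, z')}{\mathrm{EPD}(z, z')}\right),$$

and Eq. (2) reduces analogously.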

Figure 4: RMs with final weights for subject A.

3. Experiments and Results

Our tracking and recovery system was implemented in C++ and ran in real time on a computer with an Intel dual 2.20 GHz CPU. The training and test videos were recorded using two Logitech Quickcam Pro400 webcams, placed about 50 cm apart, with an angle of approximately 120° between the optical axes. The spatial resolution of the videos was low (320 × 240 pixels per frame), and the illumination conditions differed across sequences. The user manually initialized the feature to be tracked in both views. To ensure that homologous points were tracked in both views, the system warned the user if the initialization was inaccurate and the selected pair of points did not satisfy the epipolar constraint.
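Such an initialization check can be phrased as a point-to-epipolar-line test, as in the minimal sketch below; the fundamental matrix F is assumed to come from a prior stereo calibration, and the 3-pixel tolerance is an illustrative value.

```python
import numpy as np

def point_to_epiline_distance(p, q, F):
    """Distance from point p to the epipolar line induced by point q of the
    other view (F is a fundamental matrix from an assumed prior calibration)."""
    a, b, c = F @ np.array([q[0], q[1], 1.0])
    return abs(a * p[0] + b * p[1] + c) / np.hypot(a, b)

def initialization_ok(x_left, x_right, F, tol_pixels=3.0):
    """Accept a manually selected point pair only if each point lies within
    tol_pixels of its partner's epipolar line (tol_pixels is an assumption)."""
    return (point_to_epiline_distance(x_left, x_right, F) <= tol_pixels and
            point_to_epiline_distance(x_right, x_left, F.T) <= tol_pixels)
```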

The test dataset involved 26 subjects who were different from the subjects in the training dataset and were recorded in two test sessions with an average length of 1200 frames. We recorded the videos from the right and left cameras while the subjects used a camera-based mouse replacement interface [14], which was controlled by a third frontal-view camera.

We selected the nose tip as the feature to track. It was lost 304 times in one of the views, and the system detected the event of track loss in the correct view in 254 cases. Hence, the true positive rate of loss detection was 83.5%. In 25 cases, the lost feature was not detected as lost, and in the remaining 25 cases, it was lost in one view but declared as lost in the other view. The feature was lost in both views 9 times but was declared as lost in only one of the views. There were 53 false alarms; however, in all of these cases the feature was reinitialized to a location at most 3 pixels from its actual location, i.e., the system automatically re-detected the nose tip, so the effect of the false alarms was negligible. For the 254 correctly detected tracking failures, the system was able to recover 181 times (71.3%).

4. Discussion and Conclusions

We proposed an information fusion approach to multiview tracking and introduced a reliability measure that evaluates the performance of independent 2D trackers in each camera view. By computing this measure, our system can automatically detect track failures and automatically reinitialize the lost tracker. Our system uses stereoscopy to reconstruct the 3D trajectory of the tracked feature. The trajectory is used to estimate the next 3D position of the object. In case of a track failure, the projection of this estimate onto the image plane is selected as an alternative to the estimate by the failed 2D tracker.

We point out that our proposed reliability measure is inexpensive to compute. This is an advantage over the method proposed by Connor et al. [2], which involves a multi-phase process to search for a lost feature.

We used a simple recursive 3D tracker that assumed that the object moved with constant velocity. Instead, a Kalman or particle filter may be used to potentially improve the accuracy of prediction of the 3D object position.
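As an illustration of that suggestion, the sketch below gives a generic constant-velocity Kalman filter over the 3D state; it is a textbook formulation with placeholder noise covariances, not part of the system described here.

```python
import numpy as np

class ConstantVelocityKalman3D:
    """Generic constant-velocity Kalman filter for a 3D point
    (state = [x, y, z, vx, vy, vz]); noise levels are placeholders."""
    def __init__(self, q=1e-2, r=1.0):
        self.F = np.eye(6)
        self.F[:3, 3:] = np.eye(3)          # position += velocity per frame
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
        self.Q = q * np.eye(6)              # process noise (assumed)
        self.R = r * np.eye(3)              # measurement noise (assumed)
        self.x = np.zeros(6)
        self.P = np.eye(6)

    def predict(self):
        """Propagate the state one frame; return the predicted 3D position."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, X_measured):
        """Correct the state with a triangulated 3D position measurement."""
        y = np.asarray(X_measured) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```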

Adjusting the weights of the reliability measure was difficult, since we combined terms that expressed two different quantities – correlation values between image patches of intensities (terms 1 and 2) and ratios of pixel distances (terms 3 and 4). To ease the process, we normalized each term to yield unit-less quantities. By adjusting the weights, we addressed the relative variations of each term. We observed the strength of the correlation-based template matching technique and assigned the largest weight to its corresponding term in the reliability measure. A more extensive training dataset may prove valuable in further adjusting the weights. The proposed reliability measure may also be extended by including additional terms, e.g., geometrical constraints about the 3D motion of the object that are specific to the task at hand.

Acknowledgements

The authors thank the test subjects for their time and gratefully acknowledge NSF funding (grant 0713229).

References

[1] C. Fagiani, M. Betke, and J. Gips, “Evaluation of tracking methods for human-computer interaction,” IEEE Workshop on Applications in Computer Vision (WACV 2002), pp. 121-126, Orlando, USA, Dec. 2002.

[2] C. Connor, E. Yu, J. Magee, E. Cansizoglu, S. Epstein, and M. Betke, "Movement and Recovery Analysis of a Mouse-Replacement Interface for Users with Severe Disabilities," 13th Int. Conference on Human-Computer Interaction, 10 pp., San Diego, USA, July 2009.

[3] I. Matthews and S. Baker, “Active Appearance Models Revisited,” Int. J. Comp. Vision, 60(2):135-164, 2004.

[4] J. Xiao, S. Baker, I. Matthews, and T. Kanade, "Real-Time Combined 2D+3D Active Appearance Models," IEEE Conference on Computer Vision and Pattern Recognition, pp. 535–542, Washington DC, June 2004.

[5] K. Ramnath, S. C. Koterba, J. Xiao, C. Hu, I. Matthews, S. Baker, J. Cohn, and T. Kanade, "Multi-view AAM fitting and construction," Int. J. Comput. Vision, 76(2):183-204, February 2008.

[6] J. Wang, W. Gao, S. Shan, and X. Hu, “Facial feature tracking combining model-based and model-free methods,” IEEE Int. Conf. on Multimedia and Expo, pp. III-125 - III-128, Baltimore, USA, July 2003.

[7] K. He, G. Wang, Y. Yang, “Optical flow-based facial feature tracking using prior measurement,” IEEE International Conference on Cognitive Informatics, pp. 324-331, Stanford, USA, August 2008.

[8] T. J. Castelli, M. Betke, and C. Neidle, “Facial Feature Tracking and Occlusion Recovery in American Sign Language,” 6th Int. Workshop on Pattern Recognition in Information Systems, pp. 81-90, Cyprus, May 2006.

[9] J. Chen and B. Tiddeman, “A Robust Facial Feature Tracking System”, IEEE Int. Conf. Image Processing, pp. 2829-2832, Atlanta, USA, October 2006.

[10] F. Dornaika and F. Davoine, “Online Appearance-based Face and Facial Feature Tracking,” Int. Conf. Pattern Recognition, pp. 814-817, Cambridge, U.K., 2004.

[11] D. Comaniciu and V. Ramesh, “Robust Detection and Tracking of Human Faces with an Active Camera,” IEEE International Workshop on Visual Surveillance, pp. 11-18, Dublin, Ireland, July 2000.

[12] Y. Tong, Y. Wang, Z. Zhu, and Q. Ji, “Robust Facial Feature Tracking under Varying Face Pose and Facial Expression,” Pattern Recogn. 40(11):3195-3208, 2007.

[13] J. Ström. "Reinitialization of a Model-Based Face Tracker," EuroImage - Int. Conf. on Augmented, Virtual Environments and Three-Dimensional Imaging, 4 pp., Mykonos, Greece, May 2001.

[14] Camera Mouse, http://www.cameramouse.org/, accessed January 2010.

[15] T. Tommasini, A. Fusiello, E. Trucco, and V. Roberto, "Making good features track better," Conf. on Computer Vision and Pattern Recognition, pp. 178-183, Santa Barbara, USA, June 1998.

[16] A. Mittal and L. S. Davis, “M2 Tracker: A Multi-view Approach to Segmenting and Tracking People in a Cluttered Scene,” Int. J. Comput. Vision 51(3): 189-203, February 2003.

[17] L. Snidaro and G. L. Foresti, “Sensor Quality Evaluation in a Multi-Camera System”, 7th Int. Conf. on Information Fusion, 387-393, Philadelphia, July 2005.

[18] K. Lien and C. Huang, “Multi-view-based Cooperative Tracking of Multiple Human Objects in Cluttered Scenes,” EURASIP Journal on Image and Video Processing, Vol. 2008, 13 pp., 2008.

[19] W. Du and J. Piater, “Data Fusion by Belief Propagation for Multi-Camera Tracking,” 9th Int. Conf. on Information Fusion, 8 pp., Florence, Italy, July 2006.
