

FACE DETECTION, TRACKING, AND RECOGNITION FOR BROADCAST VIDEO

Duy-Dinh Le, Xiaomeng Wu, and Shin’ichi Satoh National Institute of Informatics, Tokyo, Japan

Synonyms: Face localization, face grouping, face identification, face matching, person information analysis.

Definition: Face detection and tracking are techniques for locating faces in images and video sequences; face recognition is a technique for identifying or verifying unknown people using a stored database of known faces.

1. Introduction

Human face processing techniques for broadcast video, including face detection, tracking, and recognition, have attracted considerable research interest because of their value in applications such as video structuring, indexing, retrieval, and summarization. The main reason is that the human face provides rich information for spotting the appearance of people of interest, such as government leaders in news video, the pitcher in a baseball video, or the hero in a movie, and is thus a basis for interpreting facts. Face processing techniques have several applications. For example, Name-It [Satoh99MM] associates names with faces appearing in news video for the purpose of person annotation. [DLe07MDDM] developed a system that finds important people who appear frequently in large video archives. A celebrity search engine developed by Viewdle¹ can find video segments in which a queried person (query by name) appears.

Localizing faces and recognizing their identities are challenging problems: facial appearance varies greatly because of intrinsic factors, such as aging, facial expressions, and make-up styles, and extrinsic factors, such as pose changes, lighting conditions, and partial occlusion. These factors make it difficult to construct good face models. Many efforts have been made in the fields of computer vision and pattern recognition, but good results have been limited to restricted settings.

This article describes state-of-the-art techniques for face detection, tracking, and recognition, with applications to broadcast video. For each technique, we first describe the challenges to be overcome, then present several modern approaches, and finally give a discussion.

¹ http://www.viewdle.com

2. Face Detection

Face detection, which is the task of localizing faces in an input image, is a fundamental part of any face processing system. The extracted faces can then be used for initializing face tracking or automatic face recognition. An ideal face detector should possess the following characteristics:

- Robustness: it should be capable of handling appearance variations in pose, size, illumination, occlusion, complex backgrounds, facial expressions, and resolution.

- Speed: it should be fast enough for real-time processing, which is an important factor when handling large video archives.

- Simplicity: the training process should be simple. For example, the training time should be short, the number of parameters should be small, and it should be possible to collect training samples cheaply.

2.1. Real-time Face Detection Using Cascaded Classifiers

There are many approaches to building fast and robust face detectors [YangH02PAMI]. Among them, those using advanced learning methods, such as neural networks, support vector machines, and boosting, perform best. As shown in Figure 1, detecting the faces in an image typically involves the following steps:

- Window scanning: in order to detect faces at multiple locations and sizes, a fixed window size (e.g. 24 x 24 pixels) is used to extract image patterns at every location and scale. The number of patterns extracted from a 320 x 240 frame image is large, approximately 160,000, and only a small number of these patterns contain a face.

- Feature extraction: the features are extracted from the given image pattern. The most popular feature type is the Haar wavelet because it is very fast to compute using the integral image [Viola01CVPR]. Other feature types include pixel intensity [Rowley98PAMI], local binary patterns [Hadid04CVPR], and edge orientation histogram [Levi04CVPR].

- Classification: the extracted features are passed through a classifier that has been previously trained to classify the input pattern associated with these features as a face or a non-face.

- Merging overlapping detections: since the classifier is insensitive to small changes in translation and scale, there may be multiple detections around each face. To return a single final detection per face, the overlapping detections must be combined into one. To this end, the set of detections is partitioned into disjoint subsets so that each subset consists of the nearby detections at a particular location and scale. The corners of all detections in each subset are averaged to give the corners of the single face region that the subset returns.

Figure 1: A typical face detection system, in which a fixed-size window scans every location and scale to extract image patterns, which are then passed through a classifier to check for the existence of a face.
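The window-scanning and integral-image machinery described in the steps above can be sketched in plain Python (a minimal illustration; a real detector works on grayscale frames and rescans at a pyramid of scales):

```python
def integral_image(img):
    # Summed-area table with a zero border: ii[y][x] holds the sum of all
    # pixels above and to the left of (x, y), so any rectangle sum costs
    # only four lookups -- the reason Haar-like features are fast.
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    return ii

def rect_sum(ii, x, y, w, h):
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def haar_two_rect(ii, x, y, w, h):
    # Two-rectangle Haar-like feature: left half minus right half, which
    # responds to vertical contrast such as the edge between cheek and hair.
    return rect_sum(ii, x, y, w // 2, h) - rect_sum(ii, x + w // 2, y, w // 2, h)

def scan_windows(height, width, win=24, step=4):
    # Enumerate candidate window positions at one scale; the detector then
    # classifies the features extracted at each position.
    return [(x, y) for y in range(0, height - win + 1, step)
                   for x in range(0, width - win + 1, step)]
```

The key point is that once the integral image is built, every Haar-like feature at every window position costs a constant number of lookups, independent of the rectangle size.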

Figure 2: A cascaded structure for fast face detection, in which easy patterns are rejected by simple classifiers in earlier stages while more difficult patterns are processed by more complicated classifiers in later stages.

Since the vast majority of processed patterns are non-face, single-classifier systems, such as the neural network [Rowley98PAMI] and support vector machine [Hadid04CVPR] detectors, are usually slow. To overcome this problem, a simple-to-complex combination of classifiers was proposed in [Viola01CVPR], leading to the first real-time robust face detector. In this structure, fast and simple classifiers are used as filters in the early stages of detection to quickly reject a large number of non-face patterns, whereas slower but more accurate classifiers are used in the later stages to classify face-like patterns. In this way, the complexity of the classifiers is adapted to the increasing difficulty of the input patterns. Figure 2 shows an example of this structure. Training such classifiers usually consists of the following steps:

- Training set preparation: supervised learning methods require a large number of training samples to obtain accurate classifiers. The training samples are patterns that must be labeled in advance as face (positive samples) or non-face (negative samples). Face patterns are manually collected from images containing faces; they are then scaled to the same size and normalized to a canonical pose in which the eyes, mouth, and nose are aligned. These face patterns can be used to generate additional artificial faces, enlarging the set of positive samples, by randomly rotating the images (about their center points) by up to 10 degrees, scaling them between 90 and 110%, translating them by up to half a pixel, and mirroring them [Rowley98PAMI]. The collection of non-face patterns is usually done automatically by scanning through images that contain no faces. The accurate classifier described in [Viola01CVPR] requires about five thousand original face patterns and hundreds of millions of non-face patterns extracted from 9,500 non-face images. In [Levi04CVPR], an edge orientation histogram allows a robust face detector to be built from a smaller number of training samples.

- Learning method selection: in an ideal situation with proper settings, advanced learning methods, such as the neural network, support vector machine, and AdaBoost, perform similarly. In practice, however, it is difficult to find proper settings. A neural network entails designing layers, nodes, etc., which is a complicated task. It is therefore often preferable to use support vector machines, because only two parameters are necessary if an RBF kernel is used and many tools are available. AdaBoost (and its variants) is another popular learning method used in many object detection systems; its advantage is that it can be used both for selecting features and for learning the classifier.
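AdaBoost's dual role of selecting features and weighting weak classifiers can be seen in a compact sketch using one-feature decision stumps (a textbook discrete-AdaBoost toy on generic feature vectors, not the actual Viola-Jones training code):

```python
import math

def train_adaboost(X, y, n_rounds):
    # X: list of feature vectors, y: labels in {-1, +1}. Each weak learner
    # is a decision stump on a single feature, so boosting doubles as
    # feature selection: each round picks the one most discriminative
    # feature under the current sample weights.
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        best = None
        for j in range(d):
            for thr in sorted({x[j] for x in X}):
                for sign in (1, -1):
                    err = sum(wi for wi, x, yi in zip(w, X, y)
                              if sign * (1 if x[j] >= thr else -1) != yi)
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        err = max(err, 1e-10)                      # avoid log(0) on perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this weak learner
        model.append((alpha, j, thr, sign))
        # Re-weight samples: misclassified ones gain weight for the next round.
        w = [wi * math.exp(-alpha * yi * sign * (1 if x[j] >= thr else -1))
             for wi, x, yi in zip(w, X, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return model

def predict(model, x):
    score = sum(alpha * sign * (1 if x[j] >= thr else -1)
                for alpha, j, thr, sign in model)
    return 1 if score >= 0 else -1
```

In a face detector, each "feature" would be one Haar-like response, and the stumps chosen in the first rounds are exactly the features kept for the early cascade stages.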

Current frontal-view face detection systems work in real time and with high accuracy. However, developing a face detection system that handles arbitrary views of faces is still a challenging job. A simple approach is to divide the entire face space into subspaces that correspond to specific views, such as frontal, full-profile, and half-profile, and to build several face detectors so that each detector handles one view. Figure 3 shows the tree structure used in the multi-view face detection system described in [HuangC05ICCV].

Figure 3: The tree structure used in the multi-view face detection system proposed by Huang et al. [HuangC05ICCV]. Each node in the tree is a detector which handles a specific range of poses. (Courtesy of C. Huang, H. Ai, Y. Li and S. Lao.)
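Whether the detectors are arranged in a single simple-to-complex cascade or in a view-partitioned tree, the control flow at each node is the same early-rejection loop. A toy sketch (the `stages` list pairs a hypothetical per-stage scoring function with its acceptance threshold, ordered cheapest first):

```python
def cascade_classify(window, stages):
    # A window must pass every stage to be reported as a face; because the
    # early stages are cheap and reject most non-face windows immediately,
    # the expensive later stages run on only a tiny fraction of windows.
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False  # early rejection: no further work on this window
    return True
```

With, say, `stages = [(cheap_fn, t1), (expensive_fn, t2)]`, a window scoring below `t1` never touches `expensive_fn`, which is where the real-time speed comes from.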


Encyclopedia of Multimedia 5

2.2. Discussion

The face detection techniques presented above are designed mainly for still images rather than video. However, by treating each video frame as a still image, these techniques can be applied to video. Although frame-based face detection has been demonstrated on real images, its ability to detect faces in video is still primitive. The performance of the detector may decrease for various reasons, including occlusion and changes in lighting conditions and face poses. Without additional information, weak detector responses can easily be rejected, even if they indicate the presence of a face. To provide more complete video segments in which to track the person of interest, it is therefore important to incorporate the temporal information in a video sequence.

3. Face Tracking

Face tracking is the process of locating one or more moving faces over a period of time in a video sequence, as illustrated in Figure 4. A given face is first initialized manually or by a face detector. The face tracker then analyzes the subsequent video frames and outputs the location of the initialized face within these frames by estimating the motion parameters of the moving face. This differs from face detection, the outcome of which is the position and scale of a single face in a single frame: face tracking acquires information on a face across consecutive video frames and, more importantly, all of these face instances share the same identity.

Figure 4. Overview of face tracking

3.1. Benefits of Face Tracking

One of the main applications of face tracking is person retrieval from broadcast video, for example, "intelligent fast-forward", where the video jumps to the next scene containing a certain person or actor, or retrieval of the TV segments, interviews, shows, etc., featuring a given person in a video or a large collection of videos. [Sivic05CIVR] proposed a straightforward face tracking approach for person retrieval from feature-length movie video. At run time, the user outlines a face in a video frame, and the face tracks within the movie are then ranked according to their similarity to the outlined query face, much as a search engine ranks its results. Since one face track corresponds to one identity, the workload of intra-shot face matching is greatly reduced compared with frame-based face detection. In addition, face tracking provides multiple examples of the same character's appearance to help with inter-shot face matching.

Face tracking is also used for face-name association, the objective of which is to label television or movie footage with the identity of the person present in each frame of the video. Everingham et al. [Everingham06BMVC] proposed an automatic face-name association system. It uses a face tracker similar to the one in [Sivic05CIVR] that can extract a few hundred tracks of each particular character. Based on the temporal information obtained from the face tracker, textual information for TV and movie footage, including subtitles and transcripts, is employed to assign the character's name to each face track. Shots containing a particular person can then be retrieved by inputting a keyword such as "Bush" or "Julia Roberts" instead of an outlined query face as in [Sivic05CIVR]. Besides broadcast video, face tracking also has important applications in humanoid robotics, visual surveillance, human-computer interaction (HCI), video conferencing, face-based biometric person authentication, etc.

3.2. Selection Criteria of Face Tracking Methods

Choosing a face tracker can be difficult because of the variety of face trackers currently available. The application provider must decide which face tracker is best suited to his or her needs and, of course, to the type of video to be used as the target. Generally speaking, the important issues are the tracker's speed, robustness, and accuracy.

Can the system run in real time? As with many processing tools for broadcast video, speed is not the most critical issue, because offline processing is permitted in most video structuring and indexing activities. However, a real-time face tracker is necessary if the target archive involves a very large quantity of video, e.g., 24 hours of continuous recording that needs daily structuring. Moreover, the speed of the tracker is critical in most non-broadcast video applications, e.g., HCI. Note that there is always a tradeoff between speed and performance-related issues such as robustness and accuracy.

Can the system cope with varying illuminations, facial expressions, scales, poses, camerawork, occlusion, and large head motions? A number of illumination factors, e.g., light sources, background colors, luminance levels, and media, greatly affect the appearance of a moving face, for instance, when tracking a person who is moving from an indoor to an outdoor environment. Face tracking also tends to fail when there are large deformations of the eyes, nose, mouth, etc., due to changes in facial expression. Unlike in non-broadcast video, e.g., video used for HCI, faces appearing in broadcast video vary from large in close-ups to small in long shots. A smaller face scale leads to a lower resolution, and most face trackers designed by computer vision researchers will reject such faces. Pose variations, i.e., head rotations including pitch, roll, and yaw, can cause parts of faces to disappear. In some cases, scale and pose variations might be caused by changes in camerawork. Occlusion by other objects will also partially obscure faces, and other motions onscreen may interfere with the acquisition of motion information. Moreover, the task becomes even more difficult when the head moves fast relative to the frame rate, so that the tracker fails to "arrive in time".

How accurate is the tracking? When initializing the tracker with a face detector, the first factor that affects accuracy is false face detections. This problem is difficult to solve because the face detector has a fixed threshold: lowering the threshold reduces the number of false rejections but increases the number of false detections. Drift, or the long-sequence motion problem, also affects accuracy. These problems usually result from imperfect motion estimation: a tracker might accumulate motion errors and eventually lose track of a face, for instance, as it turns from a frontal view to a profile.

3.3. Workflow of Face Tracking

Face tracking can be viewed as an algorithm that analyzes video frames and outputs the location of moving faces within each frame. For each tracked face, three steps are involved, i.e., initialization, tracking, and stopping, as illustrated in Figure 5.

Figure 5. Face tracking flowchart

Most methods use a face detector to initialize their tracking processes. An often-ignored difficulty in this step is how to control false face detections, as described above. Another problem is handling new non-frontal faces. Although there have been studies on profile and intermediate-pose face detectors, they all suffer from the false-detection problem far more than a frontal face detector does. To alleviate these problems, Choudhury et al. [Choudhury03PAMI] used two face probability maps instead of a fixed threshold to initialize the face tracker, one for frontal views and one for profiles. All local maxima in these maps are chosen as face candidates, whose face probabilities are propagated throughout the temporal sequence. Candidates whose probabilities either go to zero or remain low over time are determined to be non-faces and are eliminated. The information from the two face probability maps is combined to represent an intermediate head pose. Their experiments showed that the proposed probabilistic detector was more accurate than a traditional face detector and could handle head movements covering ±90 degrees of out-of-plane rotation (yaw).

After initialization, one should choose the features to track. Color is one of the more common choices because it is invariant to facial expressions, scale, and pose changes [Boccignone05ICIAP, LiY06HCIW]. However, color-based face trackers often depend on a learning set dedicated to a certain type of processed video and might not work on unknown videos with varying illumination conditions or on faces of people of different races. Moreover, the color cue is susceptible to occlusion by other head-like objects. Two other choices that are more robust to varying illumination and occlusion are key points [Sivic05CIVR, Everingham06BMVC] and facial features [Arnaud05ICIP, ZhuZ05CVIU, TongY07PR], e.g., eyes, nose, and mouth. Although the generality of key points allows tracking of many kinds of objects, without any face-specific knowledge this method's power to discriminate between the target and clutter might not be enough to deal with background noise or other adverse conditions. Facial features enable tracking of high-level facial information, but they are of little use when the video is of low quality. Most facial-feature-based face trackers [ZhuZ05CVIU, TongY07PR] have been tested only on non-broadcast video, e.g., webcam video, and their applicability to broadcast video is questionable. Note that the different cues described above may be combined.

An appearance-based, or featureless, tracker matches an observation model of the entire facial appearance with the input image, instead of choosing only a few features to track. One example is the appearance-based face tracker of [Choudhury03PAMI] mentioned above.
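The candidate pruning step in the probabilistic initialization above can be caricatured in a few lines (a deliberately simplified scalar sketch, not the actual formulation of [Choudhury03PAMI]; the decay and reinforcement constants are invented for illustration):

```python
def prune_candidates(tracks, decay=0.8, boost=0.3, floor=0.3):
    # Each track is the per-frame detector confidence of one face candidate.
    # A candidate's probability decays every frame and is reinforced when
    # the detector fires again; candidates whose probability ends up below
    # `floor` are treated as false detections and eliminated.
    survivors = []
    for confidences in tracks:
        p = confidences[0]
        for c in confidences[1:]:
            p = min(1.0, decay * p + boost * c)
        if p >= floor:
            survivors.append(confidences)
    return survivors
```

A candidate supported by repeated detections keeps a high probability, while a one-frame false alarm decays away, which is the essence of propagating detection probabilities over time.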
Another example, in [LiY06HCIW], uses a multi-view face detector to detect and track faces in different poses. Besides the face-based observation model, a head model is also included to represent the back of the head. This design reflects the idea that the head itself can be the object of interest, because the face is not always trackable. An extended particle filter is used to fuse these two sets of information and handle occlusions due to out-of-plane head rotations (yaw) exceeding ±90 degrees.

During the tracking procedure, face tracking systems usually use a motion model that describes how the image of the target might change under different possible face motions. Examples of simple motion models are as follows. Assuming the face to be a planar object, the motion model can be a 2D transformation, e.g., an affine transformation or homography, of a facial image, e.g., the initial frame [Arnaud05ICIP, ZhuZ05CVIU]. Some research treats the face as a rigid 3D object; the resulting motion model defines its appearance as a function of 3D position and orientation [TongY07PR]. However, a face is actually both 3D and deformable. Some systems model faces in this sense, covering the face image with a mesh, i.e., a sophisticated geometry and texture face model [Dornaika04TSMC, Dornaika06CSVT]; the motion of the face is then defined by the positions of the mesh nodes. If the quality of the video is high, a more sophisticated motion model will give more accurate results. For instance, a sophisticated geometry and texture model might be less susceptible to false face detections and drifting than a simple 2D transformation model. However, most 3D-based and mesh-based face trackers require a relatively clear appearance, high resolution, and limited pose variation, e.g., out-of-plane head rotations (roll and yaw) far less than ±90 degrees. These requirements cannot be satisfied in the case of broadcast video. Therefore, most 3D-based and mesh-based face trackers have only been tested on non-broadcast video, e.g., webcam video [Dornaika04TSMC, Dornaika06CSVT, TongY07PR].

Finally, the stopping procedure is rarely discussed. This is a major deficiency: face tracking algorithms are generally unable to stop a face track in the case of tracking errors, i.e., drifting. [Arnaud05ICIP] proposed an approach that uses a general object tracker for face tracking and a stopping criterion based on an additional eye tracker to alleviate drifting. The two positions of the tracked eyes are compared with the tracked face position; if neither eye is in the face region, drifting is determined to be occurring and the tracking process stops. In addition, most mesh-based or top-down trackers are assumed to be able to avoid drifting.
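The eye-based stopping criterion of [Arnaud05ICIP] reduces to a simple geometric test; a minimal sketch (the box and point formats are assumptions):

```python
def should_stop(face_box, left_eye, right_eye):
    # face_box is (x, y, w, h); each eye is an (x, y) point from a separate
    # eye tracker. If neither eye lies inside the tracked face region, the
    # face track is assumed to have drifted and is terminated.
    x, y, w, h = face_box

    def inside(px, py):
        return x <= px <= x + w and y <= py <= y + h

    return not (inside(*left_eye) or inside(*right_eye))
```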

3.4. Discussion

Face tracking has attracted much attention from researchers in multimedia content analysis, computer vision, etc. However, while most face trackers in computer vision have targeted high-quality video, only a limited number have been designed for broadcast video. This is because current face trackers still require a relatively clear appearance, high resolution, and limited pose variations, which cannot be guaranteed in broadcast video. Moreover, face trackers are still evaluated with different types of video and different criteria; a general evaluation criterion, in terms of speed, robustness, and accuracy, is needed for comparing the performance of face trackers with different purposes.

4. Face Recognition

Face recognition is the process of identifying or verifying one or more persons appearing in a scene by using a stored database of faces [ZhaoW03ACS]. The applications of face recognition in video are as follows.

- Face retrieval: retrieve shots containing a person’s appearances using one or several face images as a query [Sivic05CIVR, Arandjelovic05CVPRa].

- Face matching: match face sequences of unknown people against annotated face sequences in the database for annotation or identification [Satoh00FG].

- Face grouping: organize detected face sequences into clusters for auto-cast listing [Arandjelovic06CVPR].

- Name-face association: associate names and faces in video by multi-modal analysis for annotation and retrieval [Satoh99MM, YangJ04ACMM, Everingham06BMVC].

As with face detection and face tracking, face recognition must handle variations in resolution, face size, pose, illumination, occlusion, and facial expression. In addition, a robust face recognition system must handle both inter-person variation, i.e., variation among individuals, and intra-person variation, i.e., variation in the appearance of each individual.

4.1. Pre-processing Techniques

The detected faces usually vary widely and are not directly reliable for matching. Therefore, a normalization step is required to eliminate the effects of complex backgrounds and of different illuminations, poses, and sizes. A simple technique for handling different illuminations (Figure 6) is to subtract the best-fit brightness plane and then apply histogram equalization. To handle pose changes and different sizes, facial features, such as the eyes, nose, and mouth, are detected and used to rectify all faces to a canonical pose and scale them to the same size. Elliptical masks or other background subtraction techniques can be used to remove background clutter. [Arandjelovic05CVPRa] proposes a sophisticated face normalization technique that involves a series of transformations, each aimed at removing the effect of a particular variation.

Figure 6: Faces before and after the normalization process.
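The brightness-plane subtraction and histogram equalization mentioned above can be sketched with NumPy (an illustrative recipe; the quantization to 256 levels and the least-squares plane fit are choices made here, not taken from a cited system):

```python
import numpy as np

def normalize_illumination(face):
    # face: 2D grayscale array. First fit the best plane a*x + b*y + c to
    # the pixel intensities by least squares and subtract it, removing any
    # linear lighting gradient across the face.
    h, w = face.shape
    ys, xs = np.mgrid[0:h, 0:w]
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    coeffs, *_ = np.linalg.lstsq(A, face.ravel().astype(float), rcond=None)
    residual = face - (A @ coeffs).reshape(h, w)

    # Then histogram-equalize the residual (quantized to 256 levels) so the
    # remaining contrast is spread over the full intensity range.
    shifted = residual - residual.min()
    levels = np.minimum((shifted / max(shifted.max(), 1e-9) * 255).astype(int), 255)
    cdf = np.bincount(levels.ravel(), minlength=256).cumsum()
    lut = (cdf - cdf[0]) / max(cdf[-1] - cdf[0], 1) * 255.0
    return lut[levels]
```

A face lit strongly from one side has a roughly linear brightness ramp; the plane fit removes the ramp, and equalization then restores the contrast of the underlying facial texture.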

4.2. Face Recognition Techniques

In general, a single face image can be viewed as a point in a Euclidean image space. The dimensionality, D, of this space equals the number of pixels in the input face image. Usually D is large, leading to the curse of dimensionality. However, the surfaces of faces are mostly smooth and regularly textured, making their appearance quite constrained. As a result, face images can be expected to be confined to a face space, a manifold of lower dimension d << D embedded in the image space [Arandjelovic05CVPRb]. The eigen-face method is a popular technique for dimensionality reduction when a set of training face images is given [Turk91CVPR]. A face is represented as a point in a lower-dimensional eigen-face space, which is computed from a set of training faces using principal component analysis (PCA). Figure 7 shows an example of a set of eigen-faces computed from 3,816 faces [DLe07MDDM].

Figure 7: Eigen faces
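The eigen-face computation reduces to a singular value decomposition of the centered training matrix; a compact NumPy sketch:

```python
import numpy as np

def eigenfaces(faces, d):
    # faces: (n, D) matrix with one vectorized training face per row.
    # PCA keeps the d directions of largest variance; the rows of Vt from
    # the SVD of the centered data are exactly the eigen-faces.
    mean = faces.mean(axis=0)
    _, _, Vt = np.linalg.svd(faces - mean, full_matrices=False)
    return mean, Vt[:d]

def project(face, mean, basis):
    # A face is then described by d coefficients instead of D raw pixels.
    return basis @ (face - mean)

def reconstruct(coeffs, mean, basis):
    return mean + basis.T @ coeffs
```

Projecting and reconstructing makes the dimensionality reduction explicit: faces that truly lie near the face subspace reconstruct with small error, while non-face patterns do not.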


After each face is represented in a face space of fewer dimensions than the image space, face sequences can be represented and matched in one of the following ways:

- Nearest Neighbor (NN): This technique is based on matching the closest pair of faces [Satoh00FG]. It presumes that when two face sequences correspond to the same person, the closest pair of them corresponds to faces having a similar pose, facial expression, etc. An example of this method is shown in Figure 8.

- Mutual Subspace Method (MSM): A set of faces of each individual is represented in a subspace. To compare the input subspace with the reference subspace, the similarity is defined as the minimum angle between these two subspaces [Yamaguchi98FG].

- Manifold Density Model (MDM): Each set of faces of each individual is used to estimate a probability density function from which the faces are drawn. As described in [Arandjelovic05CVPRb], the densities are modeled as Gaussian mixture models (GMMs) and the recognition is formulated in terms of minimizing the Kullback-Leibler divergence between these densities.
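The first two matching schemes can be sketched as follows (the NN distance implements the closest-pair rule directly; the MSM score computes the cosine of the smallest principal angle between the two fitted subspaces via SVD, with `d` an assumed subspace dimension):

```python
import numpy as np

def nn_distance(seq_a, seq_b):
    # Nearest-neighbor rule: the distance between two face sequences is the
    # distance between their single closest pair of faces.
    return min(np.linalg.norm(a - b) for a in seq_a for b in seq_b)

def msm_similarity(seq_a, seq_b, d=3):
    # Mutual Subspace Method: fit a d-dimensional linear subspace to each
    # face set; the similarity is the cosine of the smallest principal
    # angle between the two subspaces (1.0 for identical subspaces).
    def subspace(X):
        _, _, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
        return Vt[:d]

    s = np.linalg.svd(subspace(seq_a) @ subspace(seq_b).T, compute_uv=False)
    return s[0]
```

In practice, both functions would operate on the low-dimensional eigen-face coefficients of each face rather than on raw pixels.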

4.3. Discussion

The number of individuals in broadcast videos is unknown in advance. Therefore, face retrieval and face matching applications are more popular than face recognition ones. The NN-based method is a simple but efficient face matching technique. The methods based on MSM and MDM require a large number of face samples for the estimation of subspaces and densities.

5. Software and Tools

The face detector implemented in OpenCV² is used in many face-based applications. The basic techniques for face normalization, face subspace estimation, and face recognition can be found in the CSU Face Identification Evaluation System³.

² http://opencvlibrary.sourceforge.net/

³ http://www.cs.colostate.edu/evalfacerec/


Figure 8: Face sequence matching using closest pairs of faces.

Conclusion

Face processing techniques, including face detection, tracking, and recognition, are important for many video indexing and retrieval applications. Although most of these techniques focus on static images, there has been some success in applying them to video. However, integrating these techniques to account for the distinct features of video, such as temporal information, motion, and the large number of observations, remains a challenging issue.

References

[Arandjelovic05CVPRa] O. Arandjelovic and A. Zisserman, “Automatic face recognition for film character retrieval in feature-length films,” in Proc. Intl. Conf. on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 860–867.

[Arandjelovic05CVPRb] O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell, “Face recognition with image sets using manifold density divergence,” in Proc. Intl. Conf. on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 581–588.

[Arandjelovic06CVPR] O. Arandjelovic and R. Cipolla, “Automatic cast listing in feature-length films with anisotropic manifold space,” in Proc. Intl. Conf. on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 1513–1520.

[Arnaud05ICIP] E. Arnaud, B. Fauvet, É. Mémin, and P. Bouthemy, “A robust and automatic face tracker dedicated to broadcast videos,” in Proc. Int. Conf. on Image Processing, 2005, pp. 429–432.

[Boccignone05ICIAP] G. Boccignone, V. Caggiano, G. D. Fiore, and A. Marcelli, “Probabilistic detection and tracking of faces in video,” in Proc. Int. Conf. on Image Analysis and Processing, 2005, pp. 687–694.

[Choudhury03PAMI] R. Choudhury, C. Schmid, and K. Mikolajczyk, “Face detection and tracking in a video by propagating detection probabilities,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1215–1228, 2003.

[DLe07MDDM] D.-D. Le, S. Satoh, M. E. Houle, and D. P. Nguyen, “Finding important people in large news video databases using multimodal and clustering analysis,” in Proc. 2nd IEEE Int. Workshop on Multimedia Databases and Data Management, 2007, pp. 127–136.

[Dornaika04TSMC] F. Dornaika and J. Ahlberg, “Fast and reliable active appearance model search for 3-D face tracking,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 34, no. 4, pp. 1838–1853, 2004.

[Dornaika06CSVT] F. Dornaika and F. Davoine, “On appearance based face and facial action tracking,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 9, pp. 1107–1124, 2006.

[Everingham06BMVC] M. Everingham, J. Sivic, and A. Zisserman, “‘Hello! My name is... Buffy’ – automatic naming of characters in TV video,” in Proc. British Machine Vision Conf., 2006.

[Hadid04CVPR] A. Hadid, M. Pietikainen, and T. Ahonen, “A discriminative feature space for detecting and recognizing faces,” in Proc. Intl. Conf. on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 797–804.

[HuangC05ICCV] C. Huang, H. Ai, Y. Li, and S. Lao, “Vector boosting for rotation invariant multi-view face detection,” in Proc. Intl. Conf. on Computer Vision, vol. 1, 2005, pp. 446–453.

[Levi04CVPR] K. Levi and Y. Weiss, “Learning object detection from a small number of examples: The importance of good features,” in Proc. Intl. Conf. on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 53–60.

[LiY06HCIW] Y. Li, H. Ai, C. Huang, and S. Lao, “Robust head tracking with particles based on multiple cues fusion,” in Proc. ECCV Workshop on HCI, 2006, pp. 29–39.

[Rowley98PAMI] H. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23–38, Jan 1998.

[Satoh00FG] S. Satoh, “Comparative evaluation of face sequence matching for content-based video access,” in Proc. Intl. Conf. on Automatic Face and Gesture Recognition, 2000, pp. 163–168.

[Satoh99MM] S. Satoh, Y. Nakamura, and T. Kanade, “Name-It: Naming and detecting faces in news videos,” IEEE MultiMedia, vol. 6, no. 1, pp. 22–35, January–March 1999.

[Sivic05CIVR] J. Sivic, M. Everingham, and A. Zisserman, “Person spotting: Video shot retrieval for face sets,” in Proc. Int. Conf. on Image and Video Retrieval, 2005, pp. 226–236.



[TongY07PR] Y. Tong, Y. Wang, Z. Zhu, and Q. Ji, “Robust facial feature tracking under varying face pose and facial expression,” Pattern Recognition, vol. 40, no. 11, pp. 3195–3208, Nov 2007.

[Turk91CVPR] M. Turk and A. Pentland, “Face recognition using Eigenfaces,” in Proc. Intl. Conf. on Computer Vision and Pattern Recognition, 1991, pp. 586–591.

[Viola01CVPR] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. Intl. Conf. on Computer Vision and Pattern Recognition, vol. 1, 2001, pp. 511–518.

[Yamaguchi98FG] O. Yamaguchi, K. Fukui, and K. Maeda, “Face recognition using temporal image sequence,” in Proc. Intl. Conf. on Automatic Face and Gesture Recognition, 1998, pp. 318–323.

[YangH02PAMI] M.-H. Yang, D. Kriegman, and N. Ahuja, “Detecting faces in images: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, pp. 34–58, Jan 2002.

[YangJ04ACMM] J. Yang and A. G. Hauptmann, “Naming every individual in news video monologues,” in Proc. ACM Int. Conf. on Multimedia, 2004, pp. 580–587.

[ZhaoW03ACS] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: A literature survey,” ACM Computing Surveys, vol. 35, no. 4, pp. 399–458, 2003.

[ZhuZ05CVIU] Z. Zhu and Q. Ji, “Robust real-time eye detection and tracking under variable lighting conditions and various face orientations,” Computer Vision and Image Understanding, vol. 98, no. 1, pp. 124–154, April 2005.