CHAPTER 2
LITERATURE SURVEY
2.1 INTRODUCTION
This chapter presents a detailed literature survey on facial tracking
using lip movement, skin color and mouth movement in a video sequence.
Automatic facial feature extraction, 3D model shaping, and robust
segmentation algorithms for various facial parts designed by various
authors are discussed.
2.2 FACIAL TRACKING USING LIP READING
Yuille et al., 1992, develop an automatic facial feature extraction
system, which is able to identify the detailed shape of eyes, eyebrows and
mouth from facial images. The developed system not only extracts the
location information of the features, but also estimates the parameters
pertaining to the contours and parts of the features using a parametric
deformable template approach. In order to extract facial features,
deformable models for the eye, eyebrow, and mouth are developed. The
development steps of the geometry, imaging model and matching
algorithms, and energy functions for each of these templates are presented
in detail, along with the important implementation issues. An eigenface
based multi-scale face detection algorithm which incorporates standard facial
proportions is implemented, so that when a face is detected, the rough
search regions for the facial features are readily available. The developed
system is tested on the JAFFE (Japanese Female Facial Expression),
Yale Faces, and ORL (Olivetti Research Laboratory) face image databases.
The performance of each deformable template and the face detection
algorithm are discussed separately.
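The deformable-template idea above can be sketched numerically: a parametric shape is adjusted to minimize an energy computed from the image. The circle template, synthetic image, and greedy search below are illustrative assumptions, not the paper's actual eye and mouth templates.

```python
import numpy as np

def make_image(cx=20, cy=24, r=6, size=48):
    """Synthetic image: dark circular blob on a bright background."""
    yy, xx = np.mgrid[0:size, 0:size]
    img = np.ones((size, size))
    img[(xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2] = 0.0
    return img

def energy(img, cx, cy, r):
    """Energy: mean intensity inside the circle (dark interior = low)."""
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    inside = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
    return img[inside].mean() if inside.any() else np.inf

def fit_template(img, cx, cy, r, iters=60):
    """Greedy local search over the template parameters (cx, cy, r)."""
    for _ in range(iters):
        best = (energy(img, cx, cy, r), cx, cy, r)
        for dcx, dcy, dr in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                             (0, -1, 0), (0, 0, 1), (0, 0, -1)]:
            cand = (energy(img, cx + dcx, cy + dcy, r + dr),
                    cx + dcx, cy + dcy, r + dr)
            if cand[0] < best[0]:
                best = cand
        if best[1:] == (cx, cy, r):
            break                        # local minimum reached
        _, cx, cy, r = best
    return cx, cy, r

img = make_image(cx=20, cy=24, r=6)
cx, cy, r = fit_template(img, cx=26, cy=18, r=4)   # start off-target
```

The template slides downhill in the energy landscape until its interior covers the dark blob, mirroring the steepest-descent fitting of the paper's far richer eye and mouth templates.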
Rabiner, 1993, states that although the face detection algorithm is
designed for frontal face, the same mechanism can also be applied to track
non-frontal faces with online adapted face models. Due to the nature of
template matching, the algorithm is capable of comparing the similarity
among different faces, which makes it suitable for tracking the same face
that occurs at disjoint temporal locations in a video. While the proposed face
detection method provides accuracy comparable to that of the neural
network-based approach, it is much faster.
Terzopoulos et al., 1993, present a new approach to the analysis of
dynamic facial images for the purposes of estimating and resynthesizing
dynamic facial expressions. The approach exploits a sophisticated generative
model of the human face originally developed for realistic facial animation.
The face model, which may be simulated and rendered at interactive rates
on a graphics workstation, incorporates a physics-based synthetic facial
tissue and a set of anatomically motivated facial muscle actuators. They
consider the estimation of dynamic facial muscle contractions from video
sequences of expressive human faces. They develop an estimation technique
that uses deformable contour models (snakes) to track the non-rigid motions
of facial features in video images.
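The snake idea above can be sketched as a greedy discrete active contour: each contour point moves to a nearby pixel that lowers an energy combining smoothness with (negative) image gradient magnitude. The synthetic disk image, weights, and update rule are illustrative assumptions, not the authors' physics-based formulation.

```python
import numpy as np

def disk_image(size=64, cx=32, cy=32, r=12):
    yy, xx = np.mgrid[0:size, 0:size]
    return ((xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2).astype(float)

def gradient_magnitude(img):
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def snake_step(pts, grad_mag, alpha=0.05):
    """One greedy pass: each point tests its 3x3 pixel neighborhood."""
    new_pts = pts.copy()
    n = len(pts)
    for i in range(n):
        prev_pt, next_pt = new_pts[(i - 1) % n], pts[(i + 1) % n]
        best, best_e = None, np.inf
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                y, x = pts[i, 0] + dy, pts[i, 1] + dx
                internal = (np.sum((prev_pt - (y, x)) ** 2)
                            + np.sum((next_pt - (y, x)) ** 2))
                e = alpha * internal - grad_mag[y, x]  # seek strong edges
                if e < best_e:
                    best_e, best = e, (y, x)
        new_pts[i] = best
    return new_pts

img = disk_image()
gmag = gradient_magnitude(img)
theta = np.linspace(0, 2 * np.pi, 24, endpoint=False)
pts = np.stack([32 + 18 * np.sin(theta),
                32 + 18 * np.cos(theta)], axis=1).astype(int)
for _ in range(15):
    pts = snake_step(pts, gmag)
radii = np.hypot(pts[:, 0] - 32, pts[:, 1] - 32)
```

Started on a circle of radius 18, the internal term shrinks the contour until the gradient term locks it onto the disk edge at radius 12, which is the basic mechanism used to track non-rigid facial features frame to frame.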
Lanitis et al., 1994, present flexible shape and flexible grey-level
models for representing variations in the appearance of human faces. These
models are controlled by a small number of parameters which can be used
for coding and reconstructing a face image.
Jacquin Arnaud et al., 1995, address the issue of automatically
tracking the faces and facial features of persons in head-and-shoulders video
sequences. They propose two totally automatic algorithms which
respectively perform the detection of head outlines and identify rectangular
eyes-nose-mouth regions, both from downsampled binary thresholded edge
images. Unlike methods proposed recently, the a priori assumptions
regarding the nature and content of the sequences to code are minimal for
their techniques, and the algorithms operate accurately and robustly, even in
cases of significant head rotation or partial occlusion by moving objects.
Gavrila and Davis, 1996, present a vision system for the 3-D model-
based tracking of unconstrained human movement. Using image sequences
acquired simultaneously from multiple views, they recover the 3D body pose
at each time instant without the use of markers. The pose recovery problem
is formulated as a search problem and entails finding the pose parameters of
a graphical human model whose synthesized appearance is most similar to
the actual appearance of the real human in the multi-view images. The
models used for this purpose are acquired from the images. They use a
decomposition approach and a best-first technique to search through the
high dimensional pose parameter space. A robust variant of chamfer
matching is used as a fast similarity measure between synthesized and real
edge images. They present initial tracking results from a large new
human-in-action database containing more than 2500 frames in each of four
orthogonal views. The four image streams are synchronized. They contain
subjects involved in a variety of activities of various degrees of
complexity, ranging from simple one-person hand waving to the challenging
close interaction of two persons in the Argentine Tango.
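Chamfer matching, the similarity measure used above, can be sketched directly: a distance transform of the real edge map is computed once, and a synthesized edge template is scored by averaging the distances under its edge pixels. The 3-4 chamfer masks are a standard approximation; the toy edge maps are illustrative assumptions.

```python
import numpy as np

def chamfer_distance_transform(edges):
    """Two-pass 3-4 chamfer distance transform of a binary edge map."""
    big = 10 ** 6
    h, w = edges.shape
    d = np.where(edges > 0, 0.0, float(big))
    for y in range(h):                      # forward pass
        for x in range(w):
            for dy, dx, c in [(-1, -1, 4), (-1, 0, 3), (-1, 1, 4), (0, -1, 3)]:
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    d[y, x] = min(d[y, x], d[yy, xx] + c)
    for y in range(h - 1, -1, -1):          # backward pass
        for x in range(w - 1, -1, -1):
            for dy, dx, c in [(1, 1, 4), (1, 0, 3), (1, -1, 4), (0, 1, 3)]:
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:
                    d[y, x] = min(d[y, x], d[yy, xx] + c)
    return d / 3.0                          # roughly in pixel units

def chamfer_score(dist, template_edges):
    """Mean distance under the template's edge pixels (lower = better)."""
    ys, xs = np.nonzero(template_edges)
    return dist[ys, xs].mean()

real = np.zeros((20, 20)); real[5, 5:15] = 1      # real edge image
good = np.zeros((20, 20)); good[5, 5:15] = 1      # aligned template
bad = np.zeros((20, 20)); bad[12, 5:15] = 1       # shifted template
dist = chamfer_distance_transform(real)
```

Because the distance transform is computed once per frame, many candidate poses can be scored cheaply, which is what makes chamfer matching fast enough for searching a high-dimensional pose space.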
McKenna et al., 1996, describe a dynamic face tracking system
based on an integrated motion-based object tracking and model-based face
detection framework. The motion-based tracker focuses attention for the
face detector whilst the latter aids the tracking process. The system
produces segmented face sequences from complex scenes with poor viewing
conditions in surveillance applications. They also investigate a Gabor wavelet
transform as a representation scheme for capturing head rotations in depth.
Principal components analysis was used to visualize the manifolds described
by pose change.
Heinzmann and Zelinsky, 1997, state that people naturally
express themselves through facial gestures. They have implemented an
interface that tracks a person's facial features robustly in real time (30Hz)
and does not require artificial artifacts such as special illumination or facial
makeup. Even if features become occluded, the system is capable of
recovering tracking in a couple of frames after the features reappear in the
image. Based on this fault tolerant face tracker they have implemented real
time gesture recognition capable of distinguishing 12 different gestures
ranging from "yes", "no" and "maybe" to winks, blinks and "asleep".
Sanchez et al., 1997, present a method for lip tracking intended to
support personal verification. Lip contours are represented by means of
quadratic B-splines. The lips are automatically localized in the original image
and an elliptic B-spline is generated to start up tracking. Lip localization
exploits grey-level gradient projections as well as chromaticity models to
find the lips in an automatically segmented region corresponding to the face
area. Tracking proceeds by estimating new lip contour positions according to
a statistical chromaticity model for the lips. The current tracker
implementation follows a deterministic second order model for the spline
motion based on a Lagrangian formulation of contour dynamics. The method
has been tested on the M2VTS database. Lips were accurately tracked on
sequences consisting of more than a hundred frames.
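The quadratic B-spline representation used above can be sketched by evaluating a closed uniform quadratic B-spline over a ring of control points; the elliptic control polygon standing in for the start-up lip outline is an illustrative assumption.

```python
import numpy as np

def quadratic_bspline_closed(ctrl, samples_per_span=10):
    """Evaluate a closed uniform quadratic B-spline curve."""
    ctrl = np.asarray(ctrl, dtype=float)
    n = len(ctrl)
    pts = []
    for i in range(n):
        p0, p1, p2 = ctrl[i], ctrl[(i + 1) % n], ctrl[(i + 2) % n]
        for t in np.linspace(0.0, 1.0, samples_per_span, endpoint=False):
            # Uniform quadratic B-spline basis functions on one span.
            b0 = 0.5 * (1 - t) ** 2
            b1 = 0.5 * (-2 * t ** 2 + 2 * t + 1)
            b2 = 0.5 * t ** 2
            pts.append(b0 * p0 + b1 * p1 + b2 * p2)
    return np.array(pts)

# An elliptic ring of control points, as used to start up tracking.
theta = np.linspace(0, 2 * np.pi, 8, endpoint=False)
ctrl = np.stack([40 * np.cos(theta), 20 * np.sin(theta)], axis=1)
curve = quadratic_bspline_closed(ctrl)
```

The basis weights sum to one on every span, so the curve stays inside the convex hull of its control points; a tracker therefore only needs to move a handful of control points per frame to deform the whole contour smoothly.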
Basu et al., 1998, address the problem of tracking and reconstructing
3D human lip motions from a 2D view. They build a physically-based 3D
model of lips and train it to cover only the subspace of lip motions. They
then track this model in video by finding the shape within the subspace that
maximizes the posterior probability of the model given the observed
features. The features are the likelihoods of the lip and non-lip color classes:
they iteratively derive forces from these values to apply to the physical
model and converge to the final solution. Because of the full 3D nature of
the model, this framework allows the lips to be tracked from any head pose. In
addition, because of the constraints imposed by the learned subspace of the
model, they are able to accurately estimate the full 3D lip shape from the 2D
view.
Edwards et al., 1998, address the problem of robust face identification
in the presence of pose, lighting, and expression variation. Previous
approaches to the problem have assumed similar models of variation for
each individual, estimated from pooled training data. They describe a
method of updating a first-order global estimate of identity by learning the
class-specific correlation between the estimate and the residual variation
during a sequence. This is integrated with an optimal tracking scheme, in
which identity variation is decoupled from pose, lighting and expression
variation. The method results in robust tracking and a more stable estimate
of facial identity under changing conditions.
Schödl Arno et al., 1998, describe the use of a three-dimensional
textured model of the human head under perspective projection to track a
person’s face. The system is hand-initialized by projecting an image of the
face onto a polygonal head model. Tracking is achieved by finding the six
translation and rotation parameters to register the rendered images of the
textured model with the video images. They find the parameters by mapping
the derivative of the error with respect to the parameters to intensity
gradients in the image. They use a robust estimator to pool the information
and do gradient descent to find an error minimum.
Stan Birchfield, 1998, presents an algorithm for tracking a person’s
head. The head’s projection onto the image plane is modeled as an ellipse
whose position and size are continually updated by a local search combining
the output of a module concentrating on the intensity gradient around the
ellipse’s perimeter with that of another module focusing on the color
histogram of the ellipse’s interior. Since these two modules have roughly
orthogonal failure modes, they serve to complement one another. The result
is a robust, real-time system that is able to track a person’s head with
enough accuracy to automatically control the camera’s pan, tilt, and zoom in
order to keep the person centered in the field of view at a desired size.
Extensive experimentation shows the algorithm’s robustness with respect to
full 360-degree out-of-plane rotation, up to 90-degree tilting, severe but
brief occlusion, arbitrary camera movement, and multiple moving people in
the background.
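One of the two complementary modules above scores candidate ellipses by comparing the color histogram of the ellipse interior against a stored model histogram via histogram intersection. The sketch below shows that score in isolation; the grayscale "interior" samples and bin count are illustrative assumptions.

```python
import numpy as np

def histogram(pixels, bins=8):
    """Normalized intensity histogram over [0, 1]."""
    h, _ = np.histogram(pixels, bins=bins, range=(0.0, 1.0))
    return h / max(h.sum(), 1)

def intersection(h_model, h_candidate):
    """Histogram intersection: 1.0 means identical distributions."""
    return np.minimum(h_model, h_candidate).sum()

rng = np.random.default_rng(0)
model_interior = rng.uniform(0.6, 0.9, 500)    # "skin-like" intensities
good_candidate = rng.uniform(0.6, 0.9, 500)    # similar region
bad_candidate = rng.uniform(0.0, 0.3, 500)     # background region

h_model = histogram(model_interior)
score_good = intersection(h_model, histogram(good_candidate))
score_bad = intersection(h_model, histogram(bad_candidate))
```

In the full tracker this interior score is combined with the gradient score around the ellipse perimeter, and the candidate maximizing the combined score becomes the new head state.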
Toyama 1998, real-time 3D face tracking is a task with applications to
animation, video teleconferencing, speech reading, and accessibility. In spite
of advances in hardware and efficient vision algorithms, robust face tracking
remains elusive for all of the reasons which make computer vision difficult:
Variations in illumination, pose, expression, and visibility complicate the
tracking process, especially under real-time constraints. They note that
robust systems tend to possess some state-based architecture comprising
heterogeneous algorithms, and that robust recovery from tracking failure
requires several other facial image analysis tasks.
Cascia et al., 2000, propose an improved technique for 3D head
tracking under varying illumination conditions. The head is modeled as a
texture mapped cylinder. Tracking is formulated as an image registration
problem in the cylinder's texture map image. The resulting dynamic texture
map provides a stabilized view of the face that can be used as input to many
existing 2D techniques for face recognition, facial expressions analysis, lip
reading, and eye tracking.
Lievin and Luthon, 2000, propose an algorithm for speaker's lip
segmentation and features extraction. A color video sequence of speaker's
face is acquired, under natural lighting conditions and without any particular
make-up. A logarithmic color transform is performed from the RGB to HI
(hue, intensity) color space. A statistical approach using Markov random
field modeling determines the red hue prevailing region and motion in a
spatiotemporal neighborhood. Finally, the label field is used to extract the
ROI (region of interest) and geometrical features.
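The move from RGB to a hue/intensity (HI) space with a logarithmic transform is meant to make the red hue prevailing in lips stand out. The exact transform below is an illustrative assumption, not the paper's: a log-ratio "red hue" channel plus a log intensity channel.

```python
import numpy as np

def rgb_to_hi(rgb):
    """rgb: float array (..., 3) in (0, 1]; returns (hue, intensity)."""
    rgb = np.clip(np.asarray(rgb, dtype=float), 1e-6, 1.0)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    hue = np.log(r) - np.log(g)          # > 0 where red dominates green
    intensity = np.log(r + g + b)
    return hue, intensity

lip_pixel = np.array([0.7, 0.3, 0.35])    # reddish lip color
skin_pixel = np.array([0.8, 0.6, 0.5])    # less red-dominant skin color
hue_lip, _ = rgb_to_hi(lip_pixel)
hue_skin, _ = rgb_to_hi(skin_pixel)
```

A log-ratio is attractive here because it cancels multiplicative illumination changes: scaling R and G by the same lighting factor leaves the hue channel unchanged.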
Tian et al., 2000, propose a dual state model based system of tracking
eye features that uses convergent tracking techniques and show how it can
be used to detect whether the eyes are open or closed, and to recover the
parameters of the eye model.
Jian et al., 2001, develop a real-time lip tracker whose output can be
used to implement and control a virtual lip. The use of soft computing to
represent the real time lip parameters enables them to have a more robust
and flexible system which can compensate for the potential errors of lip
tracking.
Chan et al., 2002, state that contour model-based tracking is more
robust if an accurate reference shape model of the underlying object is
available. As lip shapes vary, the ability to automatically extract user-
dependent lip models from input images is desirable. They present an
unsupervised segmentation method to hierarchically locate the user's face
and lips. Techniques employed include modeling in the hue / saturation color
space using Gaussian mixture models and the use of geometric constraints.
With the region of interest automatically located, the model extraction
problem is formulated as a regularized model-fitting problem. The use of a
generic shape as prior information improves the accuracy of the extracted lip
model which is based on a cubic B-spline representation. They describe a
method to compute automatically an optimal linear color space transform
needed to obtain raw estimates of the lip boundary locations, as required by
the fitting procedure.
Delman and Lievin, 2002, present an algorithm for speaker's lip
segmentation and features extraction. A color video sequence of speaker's
face is acquired, under natural lighting conditions and without any particular
make-up. A logarithmic color transform is performed from RGB to HI (hue,
intensity) color space. A statistical approach using Markov random field
modeling determines the lip prevailing region and motion in spatiotemporal
neighborhoods.
Eveno et al., 2002, propose an accurate and robust lip segmentation
algorithm. Characteristic points are found by using hybrid edges, which
combine color and intensity information, and a priori knowledge about the lip
structure. Corner position, which is crucial, is provided by a coarse-to-fine
process. A model is then fitted to the lips. Unlike most model-oriented
methods, they consider the lip boundary to be composed of several
independent cubic polynomial curves. This gives the global model enough
flexibility to
reproduce the specificity of very different lip shapes. Compared to existing
models, it brings a significant accuracy improvement. It ensures a robust
convergence towards the edges.
Liew et al., 2002, note that the use of visual information from lip
movements can improve the accuracy and robustness of a speech
recognition system. A region-based lip contour extraction algorithm based
on a deformable model is proposed. The algorithm employs a stochastic cost
function to partition a color lip image into lip and non-lip regions such that
the joint probability of the two regions is maximized. Given a discrete
probability map generated by spatial fuzzy clustering, they show how the
optimization of the cost function can be done in the continuous setting. The
region-based approach makes the algorithm more tolerant to noise and
artifacts in the image. It also allows larger region of attraction, thus making
the algorithm less sensitive to initial parameter settings. The algorithm
works on unadorned lips, and accurate extraction of the lip contour is possible.
Mark Barnard et al., 2002, propose a robust and adaptable lip tracking
method that uses a combination of snakes and a 2D template matching
technique. The snake, an energy minimizing spline, is driven by 2D template
matching techniques to find the expected lip contour of a specific speaker.
Their experiments show that the technique can track the unadorned lips in
various colors and shapes of speakers, including the lips of a bearded
speaker.
Morency et al., 2002, present a robust implementation of stereo-based
head tracking designed for interactive environments with uncontrolled
lighting. They integrate fast face detection and drift reduction algorithms
with a gradient-based stereo rigid motion tracking technique. Their system
can automatically segment and track a user’s head under large rotation and
illumination variations. Precision and usability of their approach are
compared with previous tracking methods for cursor control and target
selection in both desktop and interactive room environments.
Yang et al., 2002, state that images containing faces are essential to
intelligent vision-based human computer interaction, and research efforts in
face processing include face recognition, face tracking, pose estimation, and
expression recognition. Given a single image, the goal of face detection is to
identify all image regions which contain a face regardless of its three-
dimensional position, orientation, and lighting conditions. Such a problem is
challenging because faces are not rigid and have a high degree of variability
in size, shape, color, and texture. Numerous techniques have been
developed to detect faces in a single image.
Blanz Volker and Vetter, 2003, present a method for face recognition
across variations in pose, ranging from frontal to profile views, and across a
wide range of illuminations, including cast shadows and specular reflections.
To account for these variations, the algorithm simulates the process of
image formation in 3D space, using computer graphics, and it estimates 3D
shape and texture of faces from single images. The estimate is achieved by
fitting a statistical, morphable model of 3D faces to images. The model is
learned from a set of textured 3D scans of heads. They describe the
construction of the morphable model, an algorithm to fit the model to
images, and a framework for face identification. In this framework, faces are
represented by model parameters for 3D shape and texture.
Liew, 2003, describes the application of a novel spatial fuzzy clustering
algorithm to the lip segmentation problem. The proposed spatial fuzzy
clustering algorithm is able to take into account both the distributions of
data in feature space and the spatial interactions between neighboring pixels
during clustering. By appropriate pre- and post-processing utilizing the color
and shape properties of the lip region, successful segmentation of most lip
images is possible. A comparative study with some existing lip
segmentation algorithms, such as the hue filtering algorithm and the fuzzy
entropy histogram thresholding algorithm, has demonstrated the superior
performance of their method.
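The clustering at the core of this family of methods can be sketched with plain fuzzy c-means; the spatial interaction term that distinguishes the paper's algorithm is omitted for brevity, and the 1D two-cluster toy data are an illustrative assumption.

```python
import numpy as np

def fcm(x, c=2, m=2.0, iters=50, seed=0):
    """Fuzzy c-means on 1D data: returns (sorted centroids, memberships)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    u = rng.uniform(size=(c, n))
    u /= u.sum(axis=0)                    # memberships sum to 1 per point
    for _ in range(iters):
        w = u ** m
        centroids = (w @ x) / w.sum(axis=1)        # fuzzy-weighted means
        d = np.abs(x[None, :] - centroids[:, None]) + 1e-12
        u = 1.0 / (d ** (2 / (m - 1)))             # membership update
        u /= u.sum(axis=0)
    return np.sort(centroids), u

# Two tight "color" clusters standing in for lip and non-lip pixels.
x = np.concatenate([np.full(50, 0.2), np.full(50, 0.8)])
centroids, u = fcm(x)
```

Unlike hard k-means, each pixel keeps a graded membership in both clusters, which is what lets the spatial variants penalize pixels whose memberships disagree with their neighbors.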
Suandi et al., 2003, introduce an extended technique in template
matching to track eyes and mouth in real-time. The technique makes use of
a set of ‘n’ correlation candidates from template matching. They first list all
the candidates from each face model region, and select the best candidates
based on two selective functions. These functions are for right-left eyes pair
and eyes-mouth pair selection, respectively. They also introduce a novel
technique in tracking framework, called feature selective (FS), where the
system selects the features automatically so that it is feasible for multiple
face types and conditions.
Wu et al., 2003, state that occlusion is a difficult problem for
appearance-based target tracking, especially when it needs to track multiple
targets simultaneously and maintain the target identities during tracking.
They propose a dynamic Bayesian network which accommodates an extra
hidden process for occlusion and stipulates the conditions on which the
image observation likelihood is calculated. The statistical inference of such a
hidden process can reveal the occlusion relations among different targets,
which makes the tracker more robust against partial or even complete
occlusions. In addition, considering the fact that target appearances change
with views, another generative model for multiple view representation is
proposed by adding a switching variable to select from different view
templates. The integration of the occlusion model and multiple view model
results in a complex dynamic Bayesian network, where extra hidden
processes describe the switch of targets’ templates, dynamics, and the
occlusions among different targets. The tracking and inference algorithms
are implemented by the sampling-based sequential Monte Carlo strategies.
Their experiments show the effectiveness of the proposed probabilistic models
and the algorithms.
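The sampling-based sequential Monte Carlo strategy mentioned above can be reduced to a minimal bootstrap particle filter tracking a 1D position from noisy observations. The motion model, observation model, and noise levels below are illustrative assumptions, not the paper's.

```python
import numpy as np

def particle_filter(observations, n_particles=500, motion_std=0.5,
                    obs_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 5.0, n_particles)   # diffuse prior
    estimates = []
    for z in observations:
        # Predict: propagate particles through the random-walk motion model.
        particles = particles + rng.normal(0.0, motion_std, n_particles)
        # Weight: observation likelihood under Gaussian noise.
        w = np.exp(-0.5 * ((z - particles) / obs_std) ** 2)
        w /= w.sum()
        estimates.append(np.sum(w * particles))     # posterior mean
        # Resample: multinomial resampling by weight.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return np.array(estimates)

true_pos = np.linspace(0.0, 10.0, 40)               # target drifts right
rng = np.random.default_rng(1)
obs = true_pos + rng.normal(0.0, 1.0, 40)
est = particle_filter(obs)
err = np.abs(est - true_pos)
```

The paper's extra hidden processes for occlusion and view switching enlarge the state each particle carries, but the predict-weight-resample loop is the same.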
Eveno Nicolas et al., 2004, propose an accurate and robust
quasi-automatic lip segmentation algorithm. The upper mouth boundary and
several characteristic points are detected in the first frame by using a new
kind of active contour: the “jumping snake”. Unlike classic snakes, it can be
initialized far from the final edge and the adjustment of its parameters is
easy and intuitive. Then, to achieve the segmentation they propose a
parametric model composed of several cubic curves. Its high flexibility
enables accurate lip contour extraction even in the challenging case of a
very asymmetric mouth. It brings a significant accuracy and realism
improvement. The segmentation in the following frames is achieved by using
an inter-frame tracking of the key points and the model parameters. The
key points' positions become unreliable after a few frames. They propose an
adjustment process that enables accurate tracking even after hundreds of
frames; the mean key point tracking errors of their algorithm are
comparable to manual point selection errors.
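Fitting one of the independent cubic polynomial curves that make up this lip model is an ordinary linear least-squares problem. The sample boundary points below (exactly cubic here) are an illustrative assumption.

```python
import numpy as np

def fit_cubic(xs, ys):
    """Least-squares fit of y = a*x^3 + b*x^2 + c*x + d."""
    A = np.vander(xs, 4)                  # columns: x^3, x^2, x, 1
    coeffs, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coeffs

def eval_cubic(coeffs, xs):
    return np.polyval(coeffs, xs)

# Detected boundary points along one lip segment (toy data).
xs = np.linspace(-1.0, 1.0, 20)
ys = 0.5 * xs ** 3 - 0.2 * xs + 0.1
coeffs = fit_cubic(xs, ys)
```

Because each segment is fitted independently, the global model can reproduce very different and asymmetric lip shapes while each individual fit stays a cheap linear solve.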
Leung Shu-Hung et al., 2004, present a new fuzzy clustering
method for lip image segmentation. This clustering method takes both the
color information and the spatial distance into account while most of the
current clustering methods only deal with the former. A new dissimilarity
measure, which integrates the color dissimilarity and the spatial distance in
terms of an elliptic shape function, is introduced. Because of the presence of
the elliptic shape function, the new measure is able to differentiate the pixels
having similar color information but located in different regions. A new
iterative algorithm for the determination of the membership and centroid for
each class is derived, which is shown to provide good differentiation between
the lip region and the non-lip region.
Wang et al., 2004, state that visual information from lip shapes and movements
helps improve the accuracy and robustness of a speech recognition system.
A new region-based lip contour extraction algorithm that combines the
merits of the point-based model and the parametric model is presented.
Their algorithm uses a 16-point lip model to describe the lip contour. Given a
robust probability map of the color lip image generated by the FCMS (fuzzy
clustering method incorporating shape function) algorithm, a region-based
cost function that maximizes the joint probability of the lip and non-lip
region can be established. Then an iterative point-driven optimization
procedure has been developed to fit the lip model to the probability map. In
each iteration, the adjustment of the 16 lip points is governed by three
pieces of quadratic curves that constrain the points to form a physical lip
shape.
Narayanan et al., 2006, present a lip contour tracking algorithm
using attractor guided particle filtering. It is difficult to robustly track the lip
contour because the lip contour is highly deformable and the contrast
between skin and lip colors is very low. This often makes traditional blind
segmentation-based algorithms fail to produce robust and realistic results.
The lip contour is constrained by the facial muscles; the tracking
configuration space can then be represented by a lower dimensional
manifold. They take some representative lip shapes as the attractors in the
lower dimensional manifold. To resolve the low contrast problem, they adopt
a color feature selection algorithm to maximize the discrimination between
skin and lip colors. Then they integrate the shape priors and the
discriminative feature
into the attractor-guided particle filtering framework to track the lip contour.
Nguyen et al., 2008, propose and evaluate a novel method for
enhancing the performance of lip contour tracking, based on the concept of
active shape models (ASM) and multiple features. On the first
image of the video sequence, the lip region is detected using Bayes' rule,
in which lip color information is modeled by using the Gaussian Mixture
Model (GMM) and the GMM is trained by Expectation-Maximization (EM)
algorithm. The lip region is then used to initialize the lip shape model. A
single feature-based ASM presents good performance only in particular
conditions but gets stuck in local minima under noisy conditions. To
enhance convergence, they propose to use two features, normal profiles and
grey-level patches, and combine them by using a voting approach. The standard ASM
is not able to take into account temporal information from previous frames;
therefore the lip contours are tracked by replacing the standard ASM with a
hybrid active shape model (HASM), which is capable of taking advantage of the
temporal information.
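The lip-color model above can be sketched in one dimension: a two-component Gaussian mixture fitted with expectation-maximization, whose component densities then feed a Bayes-style lip / non-lip decision. The synthetic "hue" samples and deterministic min/max initialization are illustrative assumptions.

```python
import numpy as np

def em_gmm_1d(x, iters=60):
    """EM for a 2-component 1D Gaussian mixture model."""
    mu = np.array([x.min(), x.max()])     # deterministic, well-separated init
    sigma = np.array([x.std(), x.std()]) + 1e-3
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of each component for each sample.
        d = (x[None, :] - mu[:, None]) / sigma[:, None]
        pdf = np.exp(-0.5 * d ** 2) / (sigma[:, None] * np.sqrt(2 * np.pi))
        r = pi[:, None] * pdf
        r /= r.sum(axis=0)
        # M-step: re-estimate weights, means, and standard deviations.
        nk = r.sum(axis=1)
        pi = nk / nk.sum()
        mu = (r @ x) / nk
        sigma = np.sqrt((r * (x[None, :] - mu[:, None]) ** 2).sum(axis=1)
                        / nk) + 1e-6
    return pi, np.sort(mu), sigma         # means sorted for readability

rng = np.random.default_rng(2)
lip_hue = rng.normal(0.8, 0.05, 300)      # reddish lip samples
skin_hue = rng.normal(0.4, 0.05, 300)     # skin samples
pi, mu, sigma = em_gmm_1d(np.concatenate([lip_hue, skin_hue]))
```

In the full system the mixture is fitted over multi-dimensional color vectors, and pixels whose lip-component posterior exceeds the non-lip posterior are labeled as lip region to initialize the shape model.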
Ong Eng-Jon and Bowden, 2008, propose a learnt data-driven
approach to the accurate, real-time tracking of lip shapes using only
intensity information. This has the advantage that constraints such as
a priori shape models or temporal models for dynamics are not required or
used. Tracking the lip shape is simply the independent tracking of a set of
points that lie on the lip's contour. This allows them to cope with different lip
shapes that were not present in the training data and performs as well as
other approaches that have pre-learnt shape models such as the AAM.
Tracking is achieved via linear predictors, where each linear predictor
essentially linearly maps sparse template difference vectors to tracked
feature position displacements. Multiple linear predictors are grouped into a
rigid flock to obtain increased robustness. To achieve accurate tracking, two
approaches are proposed for selecting relevant sets of LPs within each flock.
Analysis of the selection results shows that the LPs selected for tracking a
feature point choose areas that are strongly correlated with the tracked
target, and that these areas are not necessarily the region around the
feature point, as is commonly assumed in LK-based approaches.
In related work, effective fusion of acoustic and visual modalities in speech recognition has
been an important issue in human computer interfaces, warranting further
improvements in intelligibility and robustness. Speaker lip motion stands out
as the most linguistically relevant visual feature for speech recognition. They
present a new hybrid approach to improve lip localization and tracking,
aimed at improving speech recognition in noisy environments. It begins with
a new color space transformation for enhancing lip segmentation. In the
color space transformation, a PCA method is employed to derive a new one
dimensional color space which maximizes discrimination between lip and
non-lip colors. Intensity information is also incorporated in the process to
improve contrast of upper and corner lip segments. In the subsequent step,
a constrained deformable lip model with high flexibility is constructed to
accurately capture and track lip shapes. The model requires only six degrees
of freedom, yet provides a precise description of lip shapes using a simple
least square fitting method. Experimental results indicate that the proposed
hybrid approach delivers reliable and accurate localization and tracking of lip
motions under various measurement conditions.
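The one-dimensional discriminative color space above can be sketched with plain PCA: pool lip and non-lip color samples, take the principal axis, and project. With two well-separated color clusters the first principal component aligns with the direction separating them. The sample colors below are illustrative assumptions.

```python
import numpy as np

def pca_axis(samples):
    """First principal component of (n, 3) color samples."""
    centered = samples - samples.mean(axis=0)
    cov = centered.T @ centered / len(samples)
    vals, vecs = np.linalg.eigh(cov)
    return vecs[:, -1]            # eigenvector of the largest eigenvalue

rng = np.random.default_rng(3)
lip = rng.normal([0.7, 0.3, 0.35], 0.03, size=(200, 3))    # reddish
skin = rng.normal([0.8, 0.65, 0.55], 0.03, size=(200, 3))  # skin tones
axis = pca_axis(np.vstack([lip, skin]))

lip_proj = lip @ axis
skin_proj = skin @ axis
gap = abs(lip_proj.mean() - skin_proj.mean())
spread = lip_proj.std() + skin_proj.std()
```

When the between-class variance dominates the within-class variance, as here, the single projected value separates lip from non-lip pixels and a simple threshold suffices for segmentation.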
Rohani et al., 2008, note that lip feature extraction is one of the most
challenging tasks affecting lip reading systems' performance. They propose a
new approach for lip contour extraction based on fuzzy clustering. The
algorithm employs a stochastic cost function to partition a color image into
lip and non-lip regions such that the joint probability of the two regions is
maximized. The mouth location is determined and then, lip region is
preprocessed using pseudo hue transformation. Fuzzy c-means clustering is
applied to each transformed image along with b components of CIELAB color
space. To delete the clustered pixels around lip, an ellipse and a Gaussian
mask were used. In order to show the performance of the proposed method,
the pseudo hue segmentation and fuzzy c-means clustering without
preprocessing are compared. The compared methods were applied to the
VidTIMIT and M2VTS databases and the results show the superiority of the
proposed method in comparison with other methods.
Chin Siew Wen et al., 2009, present an automatic lip detection and
tracking system based on a watershed segmentation approach. In some lip
detection systems, skin / non-skin detection is a prerequisite step to
localize the face region, followed by detection of the lip region. A direct
lip detection technique using watershed segmentation without needing
preliminary face localization is proposed. The watershed algorithm segments
the input image into regions. A cubic spline interpolant lip color model
and symmetry detection are used to detect the lip region from the
segmented regions. The position of the segmented lips is passed to the
tracking system to predict the location of the lips in the succeeding video
frame.
Hoai Bac et al., 2010, address a narrower problem, lip tracking,
which is an essential step in providing visual lip data for a lip-reading
system. Inspired by the idea of AVCSR, which combines visual features with
audio features to increase accuracy in noisy environments, they use the
AdaBoost algorithm and a Kalman filter for the face and lip detectors.
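The Kalman filter used for tracking here can be reduced to a 1D constant-velocity model: the state holds position and velocity, and only noisy position measurements are observed. The noise levels and trajectory below are illustrative assumptions.

```python
import numpy as np

def kalman_track(zs, q=0.01, r=1.0):
    F = np.array([[1.0, 1.0], [0.0, 1.0]])     # constant-velocity motion
    H = np.array([[1.0, 0.0]])                 # we observe position only
    Q = q * np.eye(2)                          # process noise covariance
    R = np.array([[r]])                        # measurement noise covariance
    x = np.zeros(2)
    P = 10.0 * np.eye(2)                       # uncertain initial state
    out = []
    for z in zs:
        x = F @ x                              # predict state
        P = F @ P @ F.T + Q
        S = H @ P @ H.T + R                    # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
        x = x + (K @ (np.array([z]) - H @ x)).ravel()
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)

true_pos = 0.5 * np.arange(60)                 # lip corner moving steadily
rng = np.random.default_rng(4)
zs = true_pos + rng.normal(0.0, 1.0, 60)
est = kalman_track(zs)
err = np.abs(est - true_pos)
```

In a tracker, the predicted position also narrows the search window for the next frame's detector, which is how the detection and filtering stages support each other.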
2.3 FACIAL TRACKING USING SPEECH
Leymarie and Levine, 1993, propose segmenting a noisy intensity
image and tracking a non-rigid object within it. A technique based on an active
contour model commonly called a snake is examined. The technique is
applied to cell locomotion and tracking studies. The snake permits both the
segmentation and tracking problems to be simultaneously solved in
constrained cases. A detailed analysis of the snake model, emphasizing its
limitations and shortcomings, is presented, and improvements to the original
description of the model are proposed. Problems of convergence of the
optimization scheme are considered. In particular, an improved terminating
criterion for the optimization scheme that is based on topographic features
of the graph of the intensity image is proposed. Hierarchical filtering
methods, as well as a continuation method based on a discrete scale-space
representation, are discussed.
Luettin Juergen et al., 1996, describe a robust method for extracting
visual speech information from the shape of the lips for use in automatic
speech reading (lip reading) systems. Lip deformation is modelled by a
statistically based deformable contour model which learns typical lip
deformation from a training set. The main difficulty in locating and tracking
lips consists of finding dominant image features for representing the lip
contours. They describe the use of a statistical profile model which learns
dominant image features from a training set. The model captures global
intensity variation due to different illumination and different skin reflectance
as well as intensity changes at the inner lip contour due to mouth opening
and visibility of teeth and tongue. The method is validated for locating and
tracking lip movements on a database of a broad variety of speakers.
Kaucic and Blake, 1998, note that human speech is inherently
multi-modal, consisting of both audio and visual components. Recently
researchers have
shown that the incorporation of information about the position of the lips
into an acoustic speech recognizer enables robust recognition of noisy speech.
In the case of Hidden Markov model recognition, they show that this
happens because the visual signal stabilizes the alignment of states. It is
also shown that unadorned lips, both the inner and outer contours, can be
robustly tracked in real time on general purpose workstations. To accomplish
this, efficient algorithms are employed which contain three key components,
shape models, motion models, and focused color feature detectors all of
which are learnt from examples.
Lei et al., 2004, present a robust hierarchical lip tracking
approach (RoHiLTA) for lip-reading and audio-visual speech recognition
(AVSR) applications. Lip regions of interest are first detected by motion
and facial structure information. Improvements are made on active shape
models (ASMs) for extracting lip contours more accurately and efficiently
from video sequences of a speaker's talking face in natural lighting
conditions and without particular make-up. Local and global ASM search
algorithms are both improved by introducing color information, 2D mouth
corner match, and robust estimation. For noise-free features, localization
errors are automatically corrected by an interpolating scheme. A fast
implementation of the hierarchical approach is also proposed. Extensive
experiments show that the improved ASM can effectively reduce the lip
locating errors. The fast implementation of RoHiLTA can consistently achieve
superior performance to conventional ASMs in lip tracking tasks, and then
can be effectively integrated in lip-reading and AVSR systems.
2.3 FACIAL TRACKING USING SKIN AND COLOR
Sobottka Karin and Pitas Ioannis, 1996, present a new approach for
automatically segmenting and tracking faces in color images. The
segmentation of faces is done based on color and shape information. By
searching for facial features, face hypotheses are verified. Afterwards
tracking is performed by using an active contour model. This ensures fast
processing and an increase in robustness for the face recognition process.
The exterior forces of the active contour are defined based on color features.
Results for tracking are shown for an image sequence consisting of 150
frames.
Yang and Waibel, 1996, present a real-time face tracker. The system
has achieved a rate of 30+ frames / second using an HP-9000 workstation
with a frame grabber and a Canon VC-C1 camera. It can track a person’s
face while the person moves freely (e.g., walks, jumps, sits down and stands
up) in a room. Three types of models have been employed in developing the
system. They present a stochastic model to characterize skin-color
distributions of human faces. The information provided by the model is
sufficient for tracking a human face in various poses and views. This model
is adaptable to different people and different lighting conditions in real-time.
A motion model is used to estimate image motion and to predict search
window. A camera model is used to predict and to compensate for camera
motion. The system can be applied to teleconferencing and many HCI
applications including lip-reading and gaze tracking. The principle in
developing this system can be extended to other tracking problems such as
tracking the human hand.
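The adaptive skin-chromaticity model described above can be sketched as follows; the Gaussian parameters and the forgetting factor below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def rg_chromaticity(rgb):
    """Project RGB pixels onto normalized (r, g) chromaticity,
    discarding brightness as chromaticity-based skin models do."""
    rgb = np.asarray(rgb, dtype=float)
    s = rgb.sum(axis=-1, keepdims=True) + 1e-9
    return (rgb / s)[..., :2]

def skin_probability(rgb, mean, cov):
    """Evaluate a single 2D Gaussian skin model at each pixel."""
    x = rg_chromaticity(rgb) - mean
    inv = np.linalg.inv(cov)
    m = np.einsum('...i,ij,...j->...', x, inv, x)   # squared Mahalanobis
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * m)

def update_model(mean, cov, samples, rho=0.1):
    """Adapt the model toward newly observed skin samples with
    exponential forgetting, in the spirit of online adaptation."""
    c = rg_chromaticity(samples).reshape(-1, 2)
    new_mean = (1 - rho) * mean + rho * c.mean(axis=0)
    new_cov = (1 - rho) * cov + rho * np.cov(c.T) if len(c) > 1 else cov
    return new_mean, new_cov
```

Thresholding `skin_probability` gives a skin mask; calling `update_model` on pixels inside the tracked face region lets the model follow lighting changes.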
Jebara et al., 1997, describe automatic detecting, modeling and
tracking faces in 3D. A closed loop approach is proposed which utilizes
structure from motion to generate a 3D model of a face and then feedback
the estimated structure to constrain feature tracking in the next frame. The
system initializes by using skin classification, symmetry operations, 3D
warping and eigenfaces to find a face. Feature trajectories are then
computed by SSD or correlation-based tracking. The trajectories are
simultaneously processed by an extended Kalman filter to stably recover 3D
structure, camera geometry and facial pose. Adaptively weighted estimation
is used in this filter by modeling the noise characteristics of the 2D image
patch tracking technique. The structural estimate is constrained by using
parameterized models of facial structure (eigen-heads). The Kalman filter's
estimate of the 3D state and motion of the face predicts the trajectory of the
features which constrains the search space for the next frame in the video
sequence. The feature tracking and Kalman filtering closed loop system
operates at 30Hz.
Bradski Gary, 1998, presents a first step towards a perceptual user
interface. A computer vision color tracking algorithm is developed and
applied towards tracking human faces. The algorithm is based on a robust
nonparametric technique for climbing density gradients to find the mode of
probability distributions called the mean shift algorithm. The mean shift
algorithm is modified to deal with dynamically changing color probability
distributions derived from video frame sequences. The modified algorithm is
called the continuously adaptive mean shift (CAMSHIFT) algorithm.
CAMSHIFT’s tracking accuracy is compared against a Polhemus tracker.
Bradski, 1998, develops computer vision algorithms that are intended
to form part of a perceptual user interface. They must be able to track in
real time yet not absorb a major share of computational resources: other
tasks must be able to run while the visual interface is being used. The new
algorithm developed is based on a robust nonparametric technique for
climbing density gradients to find the mode (peak) of probability
distributions called the mean shift algorithm. They want to find the mode of
a color distribution within a video scene. The mean shift algorithm is
modified to deal with dynamically changing color probability distributions
derived from video frame sequences. The modified algorithm is called the
Continuously Adaptive Mean Shift (CAMSHIFT) algorithm. CAMSHIFT’s
tracking accuracy is compared against a Polhemus tracker. Tolerance to
noise, distracters and performance is studied. CAMSHIFT is then used as a
computer interface for controlling commercial computer games and for
exploring immersive 3D graphic worlds.
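The core mean-shift iteration underlying CAMSHIFT, shifting a search window to the centroid of a color back-projection image, can be sketched as below (the window-size adaptation from the zeroth moment, which CAMSHIFT adds on top, is omitted for brevity):

```python
import numpy as np

def mean_shift_window(prob, cx, cy, w, h, n_iter=10):
    """Shift a (w x h) window over a probability (back-projection) image
    to the centroid of the probability mass inside it, repeatedly."""
    H, W = prob.shape
    for _ in range(n_iter):
        x0, x1 = max(0, int(cx - w // 2)), min(W, int(cx + w // 2) + 1)
        y0, y1 = max(0, int(cy - h // 2)), min(H, int(cy + h // 2) + 1)
        roi = prob[y0:y1, x0:x1]
        m00 = roi.sum()                      # zeroth moment
        if m00 <= 0:
            break
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ncx = (roi * xs).sum() / m00         # first moments -> centroid
        ncy = (roi * ys).sum() / m00
        if abs(ncx - cx) < 0.5 and abs(ncy - cy) < 0.5:
            cx, cy = ncx, ncy
            break
        cx, cy = ncx, ncy
    return cx, cy
```

In CAMSHIFT the same moments also resize and reorient the window each frame, which is what makes the algorithm "continuously adaptive".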
Raja Yogesh et al., 1998, describe a method for robust
detection and tracking of people in relatively unconstrained dynamic scenes.
Gaussian mixture models were used to estimate probability densities of color
for skin, clothing and background. These models were used to detect, track
and segment people, faces and hands. A technique for dynamically updating
the models to accommodate changes in apparent color due to varying
lighting conditions was used. Two applications are highlighted: (1) actor
segmentation for virtual studios and (2) focus of attention for face and
gesture recognition systems.
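A Gaussian mixture density of the kind Raja et al. fit to skin, clothing and background colors can be evaluated with a few lines of NumPy; this is a generic sketch with hypothetical parameters, not the authors' fitted models:

```python
import numpy as np

def gmm_pdf(x, means, covs, weights):
    """Evaluate a Gaussian mixture density at points x of shape (n, d),
    summing the weighted component Gaussians."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    total = np.zeros(len(x))
    for m, c, w in zip(means, covs, weights):
        d = x - m
        inv, det = np.linalg.inv(c), np.linalg.det(c)
        e = np.einsum('ni,ij,nj->n', d, inv, d)     # squared Mahalanobis
        total += w * np.exp(-0.5 * e) / np.sqrt((2 * np.pi) ** x.shape[1] * det)
    return total
```

Classifying a pixel then amounts to comparing `gmm_pdf` values under the skin, clothing and background mixtures; re-estimating the component parameters over time gives the dynamic update the paper describes.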
Yang et al., 1998, state that a human face provides a variety of
different communicative functions. They present approaches for real-time
face / facial feature tracking and their applications. They present techniques
of tracking human faces. It is revealed that human skin color can be used as
a major feature for tracking human faces. An adaptive stochastic model has
been developed to characterize the skin-color distributions. Based on the
maximum likelihood method, the model parameters can be adapted for
different people and different lighting conditions. The feasibility of the model
has been demonstrated by the development of a real-time face tracker. They
then present a top-down approach for tracking facial features such as eyes,
nostrils, and lip corners. These real-time tracking techniques have been
successfully applied to many applications such as eye-gaze monitoring, head
pose tracking, and lip-reading.
Jordao et al., 1999, describe a method for the detection and tracking
of human face and facial features. Skin segmentation is learnt from samples
of an image. After detecting a moving object, the corresponding area is
searched for clusters of pixels with a known distribution. This process is
quite insensitive to illumination changes. The face localization procedure
looks for areas in the segmented area which resemble a head. Using simple
heuristics, the located head is searched and its centroid is fed back to a
camera motion control algorithm which tries to keep the face centered in the
image using a pan-tilt camera unit. The system is capable of tracking, in
every frame, the three main features of a human face. Since precise eye
location is computationally intensive, an eye and mouth locator using fast
morphological and linear filters is developed. This allows for frame-by-frame
checking, which reduces the probability of tracking a spurious feature,
yielding a higher success ratio. Velocity and robustness are the main
advantages of this fast facial feature detector.
Lihin, 2000, proposes an algorithm for a speaker’s lip contour extraction.
A color video sequence of speaker’s face is acquired, under natural lighting
conditions and without any particular make-up. A logarithmic color transform
is performed from RGB to HI (hue, intensity) color space. A Bayesian
approach segments the mouth area using Markov random field modeling.
Motion is combined with red hue lip information into a spatiotemporal
neighborhood. Simultaneously, a region of interest and relevant boundaries
points are automatically extracted. An active contour using spatially varying
coefficients is initialized with the results of the preprocessing stage. An
accurate lip shape with inner and outer borders is obtained with good quality
results in this challenging situation.
Schwerdt and Crowley, 2000, discuss a robust tracking technique
applied to histograms of intensity normalized color. This technique supports
a video codec based on orthonormal basis coding. Orthonormal basis coding
can be very efficient when the images to be coded have been normalized in
size and position. However, an imprecise tracking procedure can have a
negative impact on the efficiency and the quality of reconstruction of this
technique, since it may increase the size of the required basis space. The
method has greater stability, higher precision and less jitter, over
conventional tracking techniques using color histograms.
Zhang and Mersereau, 2000, state that the use of color information
can significantly improve the efficiency and robustness of lip feature
extraction capability over purely grayscale-based methods. Edge information
provides another useful tool in characterizing lip boundaries. They present a
method of integrating both types of information to address the problem of lip
feature extraction for the purpose of speech reading. They first examine
various color models and view hue as an effective descriptor to characterize
the lips due to its invariance to luminance and human skin color, and its
discriminative properties. They use prominent red hue as an indicator to
locate the position of the lips. Based on the identified lip area, they further
refine the interior and exterior lip boundary using both color and spatial edge
information, where those two are combined within a Markov random field
(MRF) framework.
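A minimal hue-based lip mask in this spirit can be written as follows; the hue band and saturation threshold are illustrative values for the "prominent red hue" idea, not the authors' parameters:

```python
import numpy as np
import colorsys

def red_hue_mask(rgb, hue_lo=0.92, hue_hi=0.06, sat_min=0.25):
    """Mark pixels whose hue falls in the red band wrapping around 0,
    a common first step for hue-based lip localization."""
    mask = np.zeros(rgb.shape[:2], dtype=bool)
    for i in range(rgb.shape[0]):
        for j in range(rgb.shape[1]):
            r, g, b = (c / 255.0 for c in rgb[i, j])
            h, s, _v = colorsys.rgb_to_hsv(r, g, b)
            # hue is invariant to luminance, so the test survives shading
            if s >= sat_min and (h >= hue_lo or h <= hue_hi):
                mask[i, j] = True
    return mask
```

In the paper this coarse localization is then refined with spatial edge information inside an MRF framework; the mask above only supplies the initial lip region.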
Spors et al., 2001, present a face localization and tracking algorithm
based upon skin color detection and principal component analysis
(PCA) based eye localization. Skin color segmentation is performed using
statistical models for human skin color. The skin color segmentation task
results in a mask marking the skin color regions in the actual frame, which is
further used to compute the position and size of the dominant facial region
utilizing a robust statistics-based localization method. To improve the results
of skin color segmentation a foreground / background segmentation and an
adaptive background update scheme were added. The derived face position
is tracked with a Kalman filter.
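A constant-velocity Kalman filter of the kind commonly used for such face-position tracking can be sketched as below; the state layout and noise levels are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def make_cv_kalman(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity model for a 2D face position:
    state [x, y, vx, vy], measurement [x, y]."""
    F = np.eye(4); F[0, 2] = F[1, 3] = dt      # position += velocity * dt
    H = np.zeros((2, 4)); H[0, 0] = H[1, 1] = 1.0
    Q = q * np.eye(4)                          # process noise
    R = r * np.eye(2)                          # measurement noise
    return F, H, Q, R

def kalman_step(x, P, z, F, H, Q, R):
    """One predict + update cycle for a measurement z = [x, y]."""
    x = F @ x                                  # predict state
    P = F @ P @ F.T + Q                        # predict covariance
    S = H @ P @ H.T + R                        # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
    x = x + K @ (z - H @ x)                    # correct with measurement
    P = (np.eye(4) - K @ H) @ P
    return x, P
```

Feeding the filter the face position delivered by the skin-color segmentation each frame smooths the track and predicts the next search region.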
Gargesha, 2002, states that existing techniques for facial feature point detection
from color images include template matching, facial geometry and symmetry
analysis, mathematical morphology, luminance and chrominance analysis,
and PCA. These techniques are plagued by poor performance in the presence
of scale variations. A hybrid technique is proposed that employs a
combination of the above approaches along with curvature analysis of the
intensity surface of the face image in order to provide a superior
performance with reduced computational complexity, even in the presence
of scale variations.
Perez et al., 2002, propose color-based trackers for drastically shape
varying objects. The method relies on the deterministic search of a window
whose color content matches a reference histogram color model. Relying on
the same principle of color histogram distance, but within a probabilistic
framework, they introduce a Monte Carlo tracking technique. The use of a
particle filter allows them to better handle color clutter in the background, as
well as complete occlusion of the tracked entities over few frames. The
probabilistic approach is very flexible and can be extended in a number of
useful ways. In particular, they introduce the following ingredients: multi-
part color modeling to capture a rough spatial layout ignored by global
histograms, incorporation of a background color model when relevant, and
extension to multiple objects.
Andreas et al., 2003, present a hierarchical realization of an enhanced
active shape model for color video tracking and study the performance of
both hierarchical and nonhierarchical implementations in the RGB, YUV, and
HSI color spaces. Active shape models can be applied to tracking non-rigid
objects in video image sequences.
Huang and Trivedi, 2004, state that human face analysis has been
recognized as a crucial part of intelligent systems. There are several
challenges before robust and reliable face analysis systems can be deployed
in real-world environments. One of the main difficulties is associated with
the detection of faces with variations in illumination conditions and viewing
perspectives. They present the development of a computational framework
for robust detection, tracking and pose estimation of faces captured by video
arrays. They discuss development of a multi primitive skin-tone and edge-
based detection module integrated with a tracking module for efficient and
robust detection and tracking. A multi-state continuous density Hidden
Markov Model based pose estimation module is developed for providing an
accurate estimate of the orientation of the face.
Varona et al., 2005, present a robust real-time 3D tracking
system for human hands and faces. This system can be used as a perceptual
interface for virtual reality activities in a workbench environment. The user
in front of the virtual reality device does not need any marker or special suit.
The system includes a real time color segmentation module to detect in real-
time the skin-color pixels present in the images. The results of this skin-
color segmentation are skin-color blobs that are the inputs of a data
association module that labels the blob pixels using a set of object state
hypothesis from previous frames. The 2D tracking results are used for the
3D reconstruction of hands and face in order to obtain the 3D positions of
these limbs. They present several results using the H-ANIM standard to
show the system output performance.
Stasiak and Vicente-Garcia, 2010, develop a system for parallel face
detection, tracking and recognition in real-time video sequences.
Particle filtering is utilized for the purpose of combined and effective
detection, tracking and recognition. Temporal information contained in
videos is utilized. Fast, skin color-based face extraction and normalization
technique is applied. Consequently, real-time processing is achieved.
Implementation of face recognition mechanisms within the tracking
framework is used for the purpose of identity recognition, and to improve
the tracking robustness in multi-person tracking scenarios. Face-to-track
assignment conflicts can often be resolved with motion modeling, but
motion-based conflict resolution can be erroneous. An identity cue can then
be used to improve tracking quality: they describe the concept of face
tracking corrections using an identity recognition mechanism, implemented
within a compact particle filtering-based framework for face detection,
tracking and recognition.
Shi and Tomasi, 1994, state that no feature-based vision system can
work unless good features can be identified and tracked from frame to
frame. Although tracking itself is by and large a solved problem, selecting
features that can be tracked well and correspond to physical points in the
world is still hard. They propose a feature selection criterion that is optimal
by construction because it is based on how the tracker works, and a feature
monitoring method that can detect occlusions, disocclusions, and features
that do not correspond to points in the world. These methods are based on a
new tracking algorithm that extends previous Newton-Raphson style search
methods to work under affine image transformations. They test performance
with several simulations and experiments.
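The Shi-Tomasi selection criterion itself, the smaller eigenvalue of the windowed gradient structure matrix, can be computed with a plain NumPy sketch (window size and thresholding are left to the caller):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def shi_tomasi_response(img, win=3):
    """Minimum eigenvalue of the 2x2 gradient structure matrix per pixel:
    both eigenvalues large marks a corner that can be tracked reliably."""
    img = np.asarray(img, dtype=float)
    gy, gx = np.gradient(img)
    Ixx, Iyy, Ixy = gx * gx, gy * gy, gx * gy

    def box(a):
        # average the products over a win x win window (edge-padded)
        p = win // 2
        ap = np.pad(a, p, mode='edge')
        return sliding_window_view(ap, (win, win)).mean(axis=(2, 3))

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    tr, det = Sxx + Syy, Sxx * Syy - Sxy ** 2
    disc = np.sqrt(np.maximum(tr ** 2 / 4 - det, 0.0))
    return tr / 2 - disc   # smaller eigenvalue of [[Sxx, Sxy], [Sxy, Syy]]
```

Points whose response exceeds a threshold are "good features to track"; flat regions score near zero and straight edges score low because one eigenvalue vanishes along the edge direction.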
Black et al., 1995, explore the use of local parameterized models of
image motion for recovering and recognizing the non-rigid and articulated
motion of human faces. Parametric models are popular for estimating motion
in rigid scenes. They observe that within local regions in space and time,
such models not only accurately model non-rigid facial motions but also
provide a concise description of the motion in terms of a small number of
parameters. These parameters are intuitively related to the motion of facial
features during facial expressions and show how expressions can be
recognized from the local parametric motions in the presence of significant
head motion. The motion tracking and expression recognition approach
performs with high accuracy on movie sequences.
MacCormick and Blake, 1995, state that tracking multiple targets is a challenging
problem, especially when the targets are identical, in the sense that the
same model is used to describe each target. They present an observation
density for tracking, which solves the problem by exhibiting a probabilistic
exclusion principle. Exclusion arises naturally from a systematic derivation of
the observation density, without relying on heuristics. They present
partitioned sampling, a new sampling method for multiple object tracking.
Partitioned sampling avoids the high computational load associated with fully
coupled trackers, while retaining the desirable properties of coupling.
Basu Sumit et al., 1996, describe a method for the robust tracking of
rigid head motion from video. This method uses a 3D ellipsoidal model of the
head and interprets the optical flow in terms of the possible rigid motions of
the model. This method is robust to large angular and translational motions
of the head and is not subject to the singularities of a 2D model. The method
has been successfully applied to heads with a variety of shapes and hair styles.
This method has the advantage of accurately capturing the 3D motion
parameters of the head. The accuracy is shown through comparison with a
ground truth synthetic sequence. The ellipsoidal model is robust to small
variations in the initial fit, enabling the automation of the model
initialization.
Darrell et al., 1996, demonstrate real-time face tracking and pose
estimation in an unconstrained office environment with a camera. Using
vision routines previously implemented for an interactive environment they
determine the spatial location of a user’s head and guide an active camera
to obtain images of the face. Faces are analyzed using a set of eigenspaces
indexed over both pose and world location. Closed loop feedback from the
estimated facial location is used to guide the camera when a face is present
in the frontal view.
Crowley James et al., 1997, describe a system which uses multiple
visual processes to detect and track faces for video compression and
transmission. The system is based on an architecture in which a supervisor
selects and activates visual processes in cyclic manner. Control of visual
processes is made possible by a confidence factor which accompanies each
observation. Fusion of results into a unified estimation for tracking is made
possible by estimating a covariance matrix with each observation. Visual
processes for face tracking are described using blink detection, normalized
color histogram matching, and cross correlation (SSD and NCC). Ensembles
of visual processes are organized into processing states so as to provide
robust tracking. Transition between states is determined by events detected
by processes. The result of face detection is fed into recursive estimator. The
output from the estimator drives a PD controller for a pan / tilt / zoom
camera.
Fieguth et al., 1997, develop a simple and very fast method for object
tracking based exclusively on color information in digitized video images.
Running on a Silicon Graphics R4600 Indy system with an IndyCam, the
algorithm is capable of simultaneously tracking objects at full frame size
(640 x 480) pixels and video frame rate 50fps. Robustness with respect to
occlusion is achieved via an explicit hypothesis tree model of the occlusion
process. They demonstrate the efficacy of their techniques in the challenging
task of tracking people, especially tracking human head and hands.
Oliver Nuria and Pentland, 1997, describe an active-camera real-time
system for tracking, shape description, and classification of the human face
and mouth using only an SGI Indy computer. The system is based on use of
2-D blob features, which are spatially-compact clusters of pixels that are
similar in terms of low-level image properties. Patterns of behavior, such as
facial expressions and head movements, can be classified in real-time using Hidden
Markov Model (HMM) methods. The system has been tested on hundreds of
users and has demonstrated extremely reliable and accurate performance.
Birchfield, 1998, presents an algorithm for tracking a person’s head. The head’s
projection onto the image plane is modeled as an ellipse whose position and
size are continually updated by a local search combining the output of a
module concentrating on the intensity gradient around the ellipse’s
perimeter with that of another module focusing on the color histogram of the
ellipse’s interior. These two modules have roughly orthogonal failure modes;
they serve to complement one another. The result is a robust, real-time
system that is able to track a person’s head with enough accuracy to
automatically control the camera’s pan, tilt, and zoom in order to keep the
person centered in the field of view at a desired size.
Hager Gregory and Belhumeur, 1998, develop an efficient, general
framework for object tracking which addresses different complications. They
first develop a computationally efficient method for handling the geometric
distortions produced by changes in pose. Then combine geometry and
illumination into an algorithm that tracks large image regions using no more
computation than would be required to track with no accommodation for
illumination changes. They augment these methods with techniques from
robust statistics and treat occluded regions on the object as statistical
outliers. Throughout, they present experimental results performed on live
video sequences demonstrating the effectiveness and efficiency of their
methods.
Hager Gregory and Toyama, 1998, describe X Vision, a small
set of image-level tracking primitives, and a framework for combining
tracking primitives to form complex tracking systems. Efficiency and
robustness are achieved by propagating geometric and temporal constraints
to the feature detection level, where image warping and specialized image
processing are combined to perform feature detection quickly and robustly.
They present some of these applications as an illustration of how useful,
robust tracking systems can be constructed by simple combinations of a few
basic primitives combined with the appropriate task-specific constraints.
Colmenarez, 1999, aims to extract information from video, keep track of
people, recognize their facial expressions and gestures, and complement
other forms of human-computer interfaces. A learning technique based on
information-theoretic discrimination is used to construct face and facial
feature detectors. A real-time system for face and facial feature detection
and tracking in continuous video is done. A probabilistic framework for
embedded face and facial expression recognition from image sequences is
obtained.
Wang Hualu and Chang, 1999, present FaceTrack, a system
that detects, tracks, and groups faces from compressed video data. They
introduce the face tracking framework based on the Kalman filter and
multiple hypothesis techniques. They compare and discuss the effects of
various motion models on tracking performance. They investigate constant-
velocity, constant-acceleration, correlated-acceleration, and variable-
dimension-filter models. They find that constant-velocity and correlated-
acceleration models work more effectively for commercial videos sampled at
high frame rates. They also develop novel approaches based on multiple
hypothesis techniques to resolving ambiguity issues. Simulation results show
the effectiveness of the proposed algorithms on tracking faces in real
applications.
Vieux et al., 1999, use a face tracking system developed in the robotics
area to normalize a video sequence into centered images of the face. The
face tracking allowed them to implement a compression scheme based on
Principal Component Analysis (PCA), which they call Orthonormal Basis
Coding (OBC).
Comaniciu et al., 2000, propose a new method for real-time tracking
of non-rigid objects seen from a moving camera. The central computational
module is based on the mean shift iterations and finds the most probable
target position in the current frame. The dissimilarity between the target
model and the target candidates is expressed by a metric derived from the
Bhattacharyya coefficient. The theoretical analysis of the approach shows
that it relates to the Bayesian framework while providing a practical, fast
and efficient solution. The capability of the tracker to handle in real-time
partial occlusions, significant clutter, and target scale variations is
demonstrated for several image sequences.
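The Bhattacharyya coefficient and the distance derived from it are simple to state; the sketch below shows the similarity measure only, not the full mean-shift tracker built on top of it:

```python
import numpy as np

def bhattacharyya_coeff(p, q):
    """Similarity between two normalized histograms: 1 for identical
    distributions, 0 for distributions with disjoint support."""
    p = np.asarray(p, dtype=float); q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(np.sqrt(p * q)))

def bhattacharyya_distance(p, q):
    """Metric derived from the coefficient, as used to compare the
    target model with target candidates."""
    return float(np.sqrt(max(0.0, 1.0 - bhattacharyya_coeff(p, q))))
```

The tracker's mean-shift iterations move the candidate window so as to maximize the coefficient (equivalently, minimize the distance) between the model histogram and the candidate histogram.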
Feris Rogério Schmidt et al., 2000, present a real time system for
detection and tracking of facial features in video sequences. Such system
may be used in visual communication applications, such as teleconferencing,
virtual reality, intelligent interfaces, human machine interaction, and
surveillance. They have used a statistical skin-color model to segment face-
candidate regions in the image. The presence or absence of a face in each
region is verified by means of an eye detector, based on an efficient
template matching scheme. Once a face is detected, the pupils, nostrils and
lip corners are located and these facial features are tracked in the image
sequence, performing real time processing.
Liu Zhu and Wang, 2000, propose a new approach for combined face
detection and tracking in video. The face detection algorithm is a fast
template matching procedure using iterative dynamic programming (DP).
Schneiderman and Kanade, 2000, describe a statistical method for 3D
object detection. They represent the statistics of both object appearance and
“non-object” appearance using a product of histograms. Each histogram
represents the joint statistics of a subset of wavelet coefficients and their
position on the object. Their approach is to use many such histograms
representing a wide variety of visual attributes. Using this method, they
have developed the first algorithm that can reliably detect human faces with
out-of-plane rotation and the first algorithm that can reliably detect
passenger cars over a wide range of viewpoints.
Shan et al., 2001, present model-based bundle adjustment algorithm
to recover the 3D model of a scene / object from a sequence of images with
unknown motions. Instead of representing scene / object by a collection of
isolated 3D features (usually points), their algorithm uses a surface
controlled by a small set of parameters. Compared with previous model
based approaches, their approach has the following advantages. Instead of
using the model space as a regularizer, they directly use it as their search
space, thus resulting in a more elegant formulation with fewer unknowns
and fewer equations. Their algorithm automatically associates tracked points
with their correct locations on the surfaces, thereby eliminating the need for
a prior 2D-to-3D association. Regarding face modeling, they use a very
small set of face metrics to parameterize the face geometry, resulting in a
smaller search space and a better posed system.
Toyama and Blake, 2001, present a probabilistic paradigm for visual
tracking. Probabilistic mechanisms are attractive because they handle fusion
of information, especially temporal fusion, in a principled manner. Exemplars
are selected as representatives of raw training data. They represent
probabilistic mixture distributions of object configurations. Their use avoids
tedious hand-construction of object models, and problems with changes of
topology. Using exemplars in place of a parameterized model poses several
challenges. It uses a noise model that is learned from training data. It
eliminates any need for an assumption of probabilistic pixel wise
independence.
Arulampalam et al., 2002, review both optimal and suboptimal
Bayesian algorithms for nonlinear / non-Gaussian tracking problems, with a
focus on particle filters. Particle filters are sequential Monte Carlo methods
based on point mass representations of probability densities, which can be
applied to any state-space model and that generalize the traditional Kalman
filtering methods. Several variants of the particle filter such as SIR, ASIR,
and RPF are introduced within a generic framework of the sequential
importance sampling (SIS) algorithm and compared with the standard EKF.
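A bare-bones SIR (sampling importance resampling) step illustrates the propagate/weight/resample cycle for a 1D state; the random-walk motion model and Gaussian likelihood here are assumptions chosen for illustration:

```python
import numpy as np

def sir_step(particles, weights, z, rng, sigma_q=1.0, sigma_r=1.0):
    """One SIR iteration for a 1D random-walk state observed through a
    Gaussian measurement z: propagate, reweight, then resample."""
    # propagate each particle through the (assumed) motion model
    particles = particles + rng.normal(0.0, sigma_q, size=particles.shape)
    # reweight by the Gaussian measurement likelihood
    weights = weights * np.exp(-0.5 * ((z - particles) / sigma_r) ** 2)
    weights = weights / weights.sum()
    # systematic resampling to fight weight degeneracy
    n = len(particles)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    idx = np.minimum(idx, n - 1)
    particles = particles[idx]
    weights = np.full(n, 1.0 / n)
    return particles, weights
```

The SIS algorithm the survey refers to omits the resampling step or applies it only when the effective sample size drops, which is one of the variants (SIR, ASIR, RPF) the paper compares.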
Chiang et al., 2003, present a real-time face detection algorithm for
locating faces in images and videos. This algorithm finds not only the face
regions, but also the precise locations of the facial components such as eyes
and lips. The algorithm starts from the extraction of skin pixels based upon
rules derived from a simple quadratic polynomial model. With a minor
modification, this polynomial model is also applicable to the extraction of
lips. The benefits of applying these two similar polynomial models are
twofold. First, much computation time is saved. Second, both extraction
processes can be performed simultaneously in one scan of the image or
video frame. The eye components are then extracted after the extraction of
skin pixels and lips. The algorithm removes the falsely extracted components
by verifying with rules derived from the spatial and geometrical relationships
of facial components. The precise face regions are determined accordingly.
According to the experimental results, the proposed algorithm exhibits
satisfactory performance in terms of both accuracy and speed for detecting
faces with wide variations in size.
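A rule-based skin-pixel test in this general spirit can be written in a few lines; the thresholds below are common illustrative RGB rules, not the paper's quadratic polynomial coefficients, which are not reproduced here:

```python
def rgb_skin_rule(r, g, b):
    """Illustrative rule-based skin test on 8-bit RGB values.
    A pixel is accepted when it is bright enough, sufficiently
    saturated, and dominated by the red channel."""
    return (r > 95 and g > 40 and b > 20 and
            max(r, g, b) - min(r, g, b) > 15 and
            abs(r - g) > 15 and r > g and r > b)
```

Because such per-pixel rules are cheap, skin and lip extraction can indeed run in a single scan of the frame, which is the efficiency argument the paragraph above makes.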
Verma et al., 2003, present a probabilistic method for detecting and
tracking multiple faces in a video sequence. The proposed method integrates
the information of face probabilities provided by the detector and the
temporal information provided by the tracker to produce a method superior
to the available detection and tracking methods. They claim 1) Accumulation
of probabilities of detection over a sequence. This leads to coherent
detection over time and improves detection results. 2) Prediction of the
detection parameters which are position, scale, and pose. This guarantees
the accuracy of accumulation as well as a continuous detection. 3) The
representation of pose is based on the combination of two detectors, one for
frontal views and one for profiles.
Zhou et al., 2003, propose a time series state space model to
fuse temporal information in a probe video, which simultaneously
characterizes the kinematics and identity using a motion vector and an
identity variable, respectively. The joint posterior distribution of the motion
vector and the identity variable is estimated at each time instant and then
propagated to the next time instant. Marginalization over the motion vector
yields a robust estimate of the posterior distribution of the identity variable.
A computationally efficient sequential importance sampling (SIS) algorithm
is developed to estimate the posterior distribution. Because the identity
variable is propagated over time, its posterior probability degenerates
toward the true identity, giving improved recognition. The gallery is generalized
to videos in order to realize video-to-video recognition. An exemplar-based
learning strategy is adopted to automatically select video representatives
from the gallery, serving as mixture centers in an updated likelihood
measure. The SIS algorithm is applied to approximate the posterior
distribution of the motion vector, the identity variable, and the exemplar
index, whose marginal distribution of the identity variable produces the
recognition result. The model formulation is very general and it allows a
variety of image representations and transformations.
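One step of a sequential importance sampling (SIS) filter of the kind used above can be sketched generically as follows; the motion model and likelihood are illustrative stand-ins for the paper's motion vector and identity variable machinery.

```python
def sis_step(particles, weights, transition, likelihood):
    """One SIS step: propagate each particle through the motion model,
    reweight by the observation likelihood, and renormalize."""
    moved = [transition(x) for x in particles]
    new_w = [w * likelihood(x) for w, x in zip(weights, moved)]
    total = sum(new_w)
    return moved, [w / total for w in new_w]
```

Particles close to the observation receive larger normalized weights, so the weighted particle set approximates the posterior distribution at each instant.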
Okuma et al., 2004, introduce a vision system that is capable of
learning, detecting and tracking the objects of interest. The system is
demonstrated in the context of tracking hockey players using video
sequences. Their approach combines the strengths of two successful
algorithms: mixture particle filters and Adaboost. The mixture particle filter
is ideally suited to multi-target tracking as it assigns a mixture component to
each player. The crucial design issues in mixture particle filters are the
choice of the proposal distribution and the treatment of objects leaving and
entering the scene. They construct the proposal distribution using a mixture
model that incorporates information from the dynamic models of each player
and the detection hypotheses generated by Adaboost. The learned Adaboost
proposal distribution allows the tracker to quickly detect players entering the
scene, while the filtering process keeps track of the individual players.
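The mixture proposal combining the dynamic model with Adaboost detections can be sketched as below; the mixing weight and noise scales are illustrative assumptions.

```python
import random

def propose(predicted, detections, alpha=0.75, rng=random):
    """Draw one particle from a two-component mixture: the dynamic model's
    prediction (probability alpha) or an Adaboost-style detection hypothesis
    (probability 1 - alpha). Noise scales are illustrative placeholders."""
    if detections and rng.random() > alpha:
        return rng.choice(detections) + rng.gauss(0.0, 1.0)
    return predicted + rng.gauss(0.0, 2.0)
```

Sampling a fraction of particles near fresh detections is what lets newly entering players be picked up without waiting for diffusion from the old state.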
Perez, 2004, notes that the effectiveness of probabilistic tracking of objects
in image sequences has been revolutionized by the development of particle
filtering. Whereas Kalman filters are restricted to Gaussian distributions,
particle filters can propagate more general distributions, albeit only approximately.
This is of particular benefit in visual tracking because of the inherent
ambiguity of the visual world that stems from its richness and complexity.
One important advantage of the particle filtering framework is that it allows
the information from different measurement sources to be fused in a
principled manner. They introduce generic importance sampling mechanisms
for data fusion and discuss them for fusing color with either stereo sound,
for teleconferencing, or with motion, for surveillance with a still camera.
They show how each of the three cues can be modeled by an appropriate
data likelihood function, and how the intermittent cues (sound or motion)
are best handled by generating proposal distributions from their likelihood
functions. The effective fusion of the cues by particle filtering is
demonstrated on real teleconference and surveillance data.
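In the particle weighting step, the principled fusion of measurement sources reduces to multiplying the per-cue likelihoods; below is a minimal sketch with hypothetical color and sound cues.

```python
def fused_weight(state, cues):
    """Particle weight from multiple independent cues: the data-fusion rule
    is simply the product of the per-cue likelihood functions."""
    w = 1.0
    for likelihood in cues:
        w *= likelihood(state)
    return w
```

A state supported by all cues ends up with a much larger weight than one supported by a single cue, which is how the fused filter resolves visual ambiguity.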
Vacchetti et al., 2004, propose an efficient real-time solution for
tracking rigid objects in 3D using a single camera that can handle large
camera displacements, drastic aspect changes, and partial occlusions. While
commercial products are already available for offline camera registration,
robust online tracking remains an open issue because many real-time
algorithms described in the literature still lack robustness and are prone to
drift and jitter. To address these problems, they have formulated the
tracking problem in terms of local bundle adjustment and have developed a
method for establishing image correspondences that can equally well handle
short and wide baseline matching. They then can merge the information
from preceding frames with that provided by a very limited number of key
frames created during a training stage, which results in a real-time tracker
that does not jitter or drift and can deal with significant aspect changes.
Jeong et al., 2005, propose a robust real-time head tracking
algorithm using a pan-tilt-zoom camera. They assume the shape of a head is
an ellipse and a model color histogram is acquired in advance. In the first
frame, the appropriate position and scale of the head is determined based
on the user input. In the subsequent frames, the initial position is selected
at the same position of the ellipse as in the previous frame. The mean shift
procedure is applied to make the ellipse position converge to the target
center where the color histogram similarity to both the model histogram and
the previous histogram is maximized; the latter is a color histogram adaptively
extracted from the tracking result of the previous frame. The position-adjusted ellipse
is refined by using color and shape information. Large background motion
often prohibits the initial position from converging to the target position.
They estimate a robust initial position by compensating the background
motion. They use vertical and horizontal 1-D projection datasets. Extensive
experiments prove that a head is well tracked even when the person moves
fast and the scale of the head changes drastically.
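The mean shift procedure at the core of this tracker can be illustrated in one dimension; here the sample weights stand in for the color-histogram similarity scores.

```python
def mean_shift_1d(samples, weights, start, bandwidth=5.0, iters=20):
    """Move a window center to the weighted mean of the samples inside the
    window, repeating until convergence; in the head tracker the weights
    would come from histogram similarity rather than being given directly."""
    center = start
    for _ in range(iters):
        window = [(x, w) for x, w in zip(samples, weights)
                  if abs(x - center) <= bandwidth]
        total = sum(w for _, w in window)
        if total == 0:
            break                          # no support inside the window
        center = sum(x * w for x, w in window) / total
    return center
```

Starting near a weighted cluster, the center climbs to the cluster's weighted mean and ignores samples outside the bandwidth, mirroring how the ellipse converges to the target center.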
Fidaleo et al., 2005, provide an extensive analysis of a state-of-the-art
key frame based tracker, quantitatively demonstrating the
dependence of tracking performance on underlying mesh accuracy, number
and coverage of reliably matched feature points, and initial key frame
alignment. 3D tracking of faces in video streams is a difficult problem that
can be assisted with the use of a priori knowledge of the structure and
appearance of the subject’s face at predefined poses (key frames). Tracking
with a generic face mesh can introduce an erroneous bias that leads to
degraded tracking performance when the subject’s out-of-plane motion is far
from the set of key frames. To reduce this bias, they show how online
refinement of a rough estimate of face geometry may be used to re-estimate
the 3D key frame features, thereby mitigating sensitivities to initial key
frame inaccuracies in pose and geometry. An in-depth analysis is performed
on sequences of faces with synthesized rigid head motion. Subsequent trials
on real video sequences demonstrate that tracking performance is more
sensitive to initial model alignment and geometry errors when fewer feature
points are matched and/or do not adequately span the face. The analysis
suggests several indications for most effective 3D tracking of faces in real
environments.
Hampapur et al., 2005, state that situation awareness is the key to security,
and that awareness requires information spanning multiple scales of space
and time. Smart video surveillance systems are capable of enhancing
situational awareness across these scales; at present, however, the
component technologies are evolving in isolation. To provide comprehensive,
nonintrusive situation awareness, it is imperative to address the challenge of
multi scale, spatiotemporal tracking. This article explores the concepts of
multi scale spatiotemporal tracking through the use of real-time video
analysis, active cameras, multiple object models, and long-term pattern
analysis to provide comprehensive situation awareness.
Koterba et al., 2005, study the relationship between multi view
Active Appearance Model (AAM) fitting and camera calibration. They propose
to calibrate the relative orientation of a set of N > 1 cameras by fitting an
AAM to sets of N images. They use the human face as a (non-rigid)
calibration grid. The algorithm calibrates a set of 2 × 3 weak perspective camera
projection matrices, projections of the world coordinate system origin into
the images, depths of the world coordinate system origin, and focal lengths.
Roy-Chowdhury et al., 2005, present two algorithms for 3D face
modeling from a monocular video sequence. The first method is based on
Structure from Motion (SFM), while the second one relies on contour
adaptation over time. The SFM based method incorporates statistical
measures of quality of the 3D estimate into the reconstruction algorithm.
The initial multi-frame SFM estimate is smoothed using a generic face model
in an energy function minimization framework. Such a strategy avoids
excessively biasing the final 3D estimate towards the generic model. The
second method relies on matching a generic 3D face model to the outer
contours of a face in the input video sequence, and integrating this strategy
over all the frames in the sequence. It consists of an edge-based head pose
estimation step, followed by global and local deformations of the generic
face model in order to adapt it to the actual 3D face. This contour adaptation
approach is able to separate the geometric subtleties of the human head
from the variations in shading and texture and it does not rely on finding
accurate point correspondences across frames.
Adam et al., 2006, present an algorithm for tracking an object in a
video sequence. The template object is represented by multiple image
fragments or patches. The patches are arbitrary and are not based on an
object model. Every patch votes on the possible positions and scales of the
object in the current frame, by comparing its histogram with the
corresponding image patch histogram.
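The fragment voting scheme can be sketched as follows, using histogram intersection as an illustrative similarity measure (the paper's actual comparison measure may differ).

```python
def patch_vote(template_hist, candidate_hists):
    """A fragment votes for the candidate position whose local histogram is
    most similar to the template patch; histogram intersection is used here
    as a simple illustrative score."""
    def intersection(h1, h2):
        return sum(min(a, b) for a, b in zip(h1, h2))
    scores = [intersection(template_hist, h) for h in candidate_hists]
    return scores.index(max(scores))       # index of the winning candidate
```

Because each patch votes independently, occlusion of some fragments leaves the remaining votes intact, which is the robustness argument behind fragment-based tracking.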
Dedeoglu et al., 2006, describe active appearance models (AAM) as
compact representations of the shape and appearance of objects. Fitting
AAMs to images is a difficult, non-linear optimization task. Traditional
approaches minimize the L2 norm error between the model instance and the
input image warped onto the model coordinate frame. While this works well
for high resolution data, the fitting accuracy degrades quickly at lower
resolutions. They show a careful design of the fitting criterion can overcome
many of the low resolution challenges. In resolution aware formulation
(RAF), they explicitly account for the finite size sensing elements of digital
cameras, and simultaneously model the processes of object appearance
variation, geometric deformation, and image formation. Gauss-Newton
gradient descent algorithm not only synthesizes model instances as a
function of estimated parameters, simulates the formation of low resolution
images in a digital camera. They compare the RAF algorithm against a state-
of-the-art tracker across a variety of resolution and model complexity levels.
Fonseca et al., 2006, state that a compressed domain generic object
tracking algorithm offers, in combination with a face detection algorithm, a
low computational cost solution to the problem of detecting and locating
faces in frames of compressed video sequences (such as MPEG-1 or
MPEG-2). Objects such as faces can thus be tracked through a
compressed video stream using motion information provided by existing
forward and backward motion vectors. The described solution requires only
low computational resources on CE devices while at the same time achieving
sufficiently good location rates.
Lu and Dai, 2006, present a hybrid sampling solution that combines
RANSAC and particle filtering. RANSAC provides proposal particles
that, with high probability, represent the observation likelihood. Both
conditionally independent RANSAC sampling and boosting-like conditionally
dependent RANSAC sampling are explored. They show that the use of
RANSAC-guided sampling reduces the necessary number of particles to
dozens for a full 3D tracking problem. The algorithm has been applied to the
problem of 3D face pose tracking with changing expression. They
demonstrate the validity of the approach with several video sequences acquired
in an unstructured environment.
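RANSAC-guided proposal generation can be illustrated in a deliberately simplified one-dimensional setting, where each "model" is a single location scored by its inlier count.

```python
import random

def ransac_proposals(points, n_models=20, tol=1.0, seed=0):
    """Generate candidate model hypotheses by repeated minimal sampling
    (here the 'model' is just a 1-D location) and score each by its inlier
    count; top-scoring hypotheses then serve as high-likelihood proposal
    particles instead of blind diffusion from the previous state."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_models):
        center = rng.choice(points)        # minimal sample: one point
        inliers = sum(1 for p in points if abs(p - center) <= tol)
        scored.append((center, inliers))
    scored.sort(key=lambda t: -t[1])       # best-supported hypothesis first
    return scored
```

Because the surviving proposals already sit near the observation likelihood's modes, far fewer particles are needed than with a purely diffusion-based proposal, which is the paper's central point.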
Xu and Roy-Chowdhury, 2007, present a theory for combining the
effects of motion, illumination, 3D structure, and camera parameters in
a sequence of images obtained by a perspective camera. The set of all
Lambertian reflectance functions of a moving object, at any position,
illuminated by arbitrarily distant light sources, lies “close” to a bilinear
subspace consisting of nine illumination variables and six motion variables.
This result implies that, given an arbitrary video sequence, it is possible to
recover the 3D structure, motion and illumination conditions simultaneously
using the bilinear subspace formulation. The derivation builds upon existing
work on linear subspace representations of reflectance by generalizing it to
moving objects. Lighting can change slowly or suddenly, locally or globally,
and can originate from a combination of point and extended sources. They
experimentally compare the results of their theory with ground truth data
and also provide results on real data by using video sequences of a 3D face
and the entire human body with various combinations of motion and
illumination directions. They show results of their theory in estimating 3D
motion and illumination model parameters from a video sequence.
Yu et al., 2007, propose a method to incrementally super-resolve
3D facial texture by integrating information frame by frame from a video
captured under changing poses and illuminations. They recover illumination,
3D motion and shape parameters from their tracking algorithm. This
information is then used to super-resolve the 3D texture using the Iterative
Back-Projection (IBP) method. The super-resolved texture is fed back to the
tracking part to improve the estimation of illumination and motion
parameters. This closed-loop process continues to refine the texture as new
frames come in. They also propose a local-region based scheme to handle
non-rigidity of the human face.
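One IBP refinement step can be sketched generically: simulate the low-resolution observation from the current estimate and back-project the residual. The 1-D downsample/upsample operators in the usage below are illustrative stand-ins for the camera and texture-warping models.

```python
def ibp_step(hr, lr_observed, downsample, upsample, step=1.0):
    """One iterative back-projection update: simulate the low-resolution
    observation from the current high-resolution estimate, then back-project
    the residual error onto that estimate."""
    simulated = downsample(hr)
    residual = [o - s for o, s in zip(lr_observed, simulated)]
    correction = upsample(residual)
    return [h + step * c for h, c in zip(hr, correction)]
```

Repeating the step drives the simulated low-resolution image toward the observation, so each incoming frame further refines the texture, matching the closed-loop behavior described above.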
Stasiak and Pacut, 2008, develop a system for parallel face detection,
tracking and recognition in real-time video sequences, and describe its face
detection and tracking modules. The solution is based on particle filtering in
the conditional density propagation framework of Isard and Blake and
utilizes color information at different levels of detail.
The use of color makes processing computationally cheap and robust in
finding candidates for further processing.
Suandi et al., 2008, describe a technique to estimate human
face pose from color video sequence using Dynamic Bayesian Network
(DBN). As face and facial features trackers usually track eyes, pupils, mouth
corners and skin region (face), their proposed method utilizes merely three of
these features – pupils, mouth center and skin region – to compute the
evidence for DBN inference. No additional image processing algorithm is
required; thus, it is simple and operates in real time. The evidence values,
which are called the horizontal ratio and the vertical ratio, are determined
using a model-based technique and are designed to simultaneously solve
two problems in the tracking task: scale factor and noise influence.
Valenti and Gevers, 2008, observe that the ubiquitous application of eye
tracking is precluded by the requirement of dedicated and expensive
hardware, such as infrared high definition cameras. Systems based solely on
appearance have therefore been proposed in the literature; although these
systems are able to successfully locate eyes, their accuracy is significantly
lower than that of commercial eye tracking devices. Their aim is to perform
very accurate eye center location and tracking using a simple web cam. By
means of a novel relevance mechanism, the proposed method makes use of
isophote properties to gain invariance to linear lighting changes, to achieve
rotational invariance and to keep computational costs low. They test their approach for accurate eye
location and robustness to changes in illumination and pose, using the BioID
and the Yale Face B databases. They demonstrate that their system can
achieve a considerable improvement in accuracy over state of the art
techniques.
Yung et al., 2011, review the state-of-the-art progress in visual
tracking methods, classify them into different categories, and identify
future trends. Visual tracking is a fundamental task in many computer vision
applications and has been well studied in the last decades, yet robust visual
tracking remains a huge challenge. Difficulties in visual tracking can arise
due to abrupt object motion, appearance pattern change, non-rigid object
structures, occlusion and camera motion. They first analyze the state-of-the-
art feature descriptors which are used to represent the appearance of
tracked objects. Then, they categorize the tracking approaches into three
groups, provide detailed descriptions of representative methods in each
group, examine their positive and negative aspects, and discuss the future
trends for visual tracking research.
2.5 SUMMARY
This chapter has presented the various methods used for face tracking
in a continuous video. Face tracking methods based on local features such as
eyebrows, lips and mouth, as well as on skin color, were presented. Chapter 3
presents the feature extraction.