

Segmentation of Moving Objects in Image Sequence: A Review

Dengsheng Zhang and Guojun Lu

Gippsland School of Computing and Information Technology Monash University, Churchill, Vic 3842, Australia

{dengsheng.zhang, guojun.lu}@infotech.monash.edu.au

Abstract

Segmentation of objects in image sequences is very important in many multimedia applications. In second-generation image/video coding, images are segmented into objects to achieve efficient compression by coding the contour and texture separately. Since the purpose is to achieve high compression performance, the segmented objects may not be semantically meaningful to human observers. More recent applications, such as content-based image/video retrieval and image/video composition, require that the segmented objects be semantically meaningful. Indeed, the recent multimedia standard MPEG-4 specifies that a video is composed of meaningful video objects. Although many segmentation techniques have been proposed in the literature, fully automatic segmentation tools for general applications are currently not achievable. This paper provides a review of this important and challenging area of segmentation of moving objects. We describe common approaches, including temporal segmentation, spatial segmentation and combined temporal-spatial segmentation. As an example, a complete segmentation scheme, which is an informative part of MPEG-4, is summarized.

Keywords: Image/video segmentation, optical flow, motion estimation, multimedia

1. Introduction

The ideal goal of segmentation is to identify the semantically meaningful components of an image and to group the pixels belonging to such components. While it is impossible to segment static objects in an image at the present stage, it is more practical to segment moving objects from a dynamic scene with the aid of the motion information contained in it. Segmentation of moving objects in image sequences plays an important role in image sequence processing and analysis. Once the moving objects are detected or extracted, they can serve a variety of purposes. The development of techniques for the segmentation of moving objects has mostly been driven by so-called second-generation coding [KIK85, TKP96]. Second-generation coding techniques use image representations based on the human vision system (HVS) rather than the conventional canonical form, in which pixels or blocks of pixels are the basic entities that are coded. As a result of including the human visual system, natural images are treated as a composition of objects defined not by a set of pixels regularly spaced in all dimensions but by their shape and color. With second-generation coding techniques, the original image is broken down into regions of homogeneous characteristics, or "objects" of arbitrary shape; these "objects" are then contour and texture encoded. Since compression efficiency is the primary goal of coding, content-based functionalities, such as object identification and retrieval, are not addressed, although much work has been done on segmentation-based coding in the second generation to achieve very low bit rate video streams (MORPHCO [SBCP96], SESAME [Salembier et al 97]). The "objects" in second-generation coding are different from the semantic objects corresponding to real-world objects. Second-generation coding decomposes images into regions of homogeneous characteristics, which can be intensity, color, motion, directional components or other predefined visual patterns. In the real world, objects rarely appear homogeneous. As a result, say, a human body with different moving parts is likely to be segmented into different parts to achieve more prediction gain [Diehl91]. This is in contrast with segmentation for content-based functionalities, which aims at identifying meaningful objects corresponding to real-world objects.

Due to the rapid progress in micro-electronics and computer technology, together with the creation of networks operating at various channel capacities, the last decade has seen the emergence of new multimedia applications such as Internet multimedia, Video on Demand (VOD), interpersonal communications (videoconference, videophone) and digital libraries. The importance of visual communications has increased tremendously. A new world standard, MPEG-4, has recently come into being to address issues associated with multimedia applications. As a part of the MPEG-4 standard, MPEG-4 video differs from other video standards in two main aspects: very low bitrate and content-based functionalities. The content-based functionalities in MPEG-4 video represent a revolution in the representation of digital video and will have a tremendous influence on the future of the visual world. With content-based functionalities, a video bit stream can be manipulated to achieve personalized video. In MPEG-4 video, the bitstream is composed of Video Object Planes (VOPs), which can be used to assemble a real-world scene [N2172]. Each VOP is coded independently of other objects by its texture, motion and shape. VOPs form the basic elements of MPEG-4 video bitstreams. VOP extraction is a key issue in efficiently applying the MPEG-4 coding scheme. Although the MPEG-4 standard does not specify how to obtain VOPs, it is apparent and recognised that segmentation-based techniques are essential to, and will therefore dominate, VOP generation. This is because most visual information, existing or being generated, is in the form of frames or images. To achieve content-based functionalities, these frames and images have to be decomposed into individual objects before being fed into an MPEG-4 video encoder.

Although it is premature to seek an automatic solution for general segmentation purposes at present [Meier98], many video segmentation approaches have been proposed in the literature. They can be classified into motion-based segmentation (based on motion information only) and spatio-temporal segmentation (a combination of temporal and spatial segmentation). Motion-based techniques suffer from boundary inaccuracy. For content-based purposes, the spatio-temporal approach appears most appropriate. This paper provides a review of some of the most important segmentation techniques.

The rest of the paper is organized as follows. In Section 2, we give an overview of segmentation of moving objects. Section 3 reviews 2D methods. In Section 4, a number of 3D methods are discussed. Section 5 is devoted to spatio-temporal approaches. In Section 6, we summarize the discussions. In Section 7, we describe a complete segmentation scheme which combines temporal and spatial segmentation. Section 8 concludes the paper.

2. Segmentation of moving objects: an overview

Classifications of motion segmentation vary significantly in the literature, and no consistent classification can be found. Most classifications are either ambiguous or incomplete. For example, Meier [Meier98] classified motion segmentation into four categories: 3-D segmentation, segmentation based on motion information only, spatio-temporal segmentation, and joint motion estimation and segmentation, where 3-D segmentation and joint motion estimation and segmentation should fall under motion-based segmentation. Torr [Torr95] also groups motion segmentation into four categories: methods for a stationary camera, methods based on image properties of projected motion, methods that require knowledge of the camera motion, and methods founded on the constraints imposed in the image by Euclidean motions in the world. This classification is only suitable for structure-from-motion methods. Tekalp's classification of motion segmentation [Tekalp95] is based on motion estimation: direct methods (change detection), optical flow segmentation, and simultaneous estimation and segmentation. As can be seen, this classification only includes motion-based approaches. In this paper, segmentation of moving objects is broken into two groups: motion-based versus spatio-temporal. Among motion-based segmentation techniques, there are two subgroups, the 2D approach and the 3D approach, based on the dimension of the motion models employed in the segmentation. Within 3D approaches, there are the structure from motion (SFM) method, which mostly deals with rigid object motion and 3D scenes, and the parametric method, which deals with piecewise rigid motion and 2D scenes.

Motion-based segmentation algorithms generally involve three main issues. The first issue is the data primitives or region of support [SK99]; the data primitives can be individual pixels, corners, lines, blocks or regions. The second issue is the motion model or motion representation, which can be 2D optical flow or 3D motion parameters; this issue involves parameter estimation or motion estimation. The third issue is
segmentation criteria, which can be maximum a posteriori (MAP) estimation, the Hough transform or expectation-maximization (EM). Therefore, a typical motion segmentation algorithm consists of three steps corresponding to these three issues. However, due to noise and the motion complexity of the scene, real motion segmentation/clustering schemes are usually much more complex than this, in that the motion estimation in the motion representation stage and the segmentation are usually recursive processes. A simplified motion-based segmentation is given in Figure 1(a). Motion-based segmentation algorithms can be classified either by their motion representations or by their clustering criteria. Motion representation plays such a crucial role in motion segmentation that motion segmentation techniques generally focus on the design of motion estimation algorithms. Therefore, motion segmentation methods are best identified and distinguished by the motion representation they adopt. Within each subgroup identified by its motion representation, the methods are distinguished by their clustering criteria. Traditional motion-based segmentation methods, which employ motion information only, usually deal with scenes with rigid motion or piecewise rigid motion. The comparatively new spatio-temporal segmentation techniques, which employ both the spatial and the temporal information embedded in the sequence and directly target the emerging multimedia applications and generic situations, are often neglected in the classification categories in the literature. By combining both motion and spatial information, these techniques intend to overcome the over-segmentation problem in image segmentation and the noise-sensitivity and inaccuracy problems in motion-based segmentation. Spatio-temporal segmentation is classified under motion segmentation because it employs the same motion estimation techniques as motion-based segmentation, and temporal segmentation is usually used to guide the overall segmentation results. However, this group of segmentation algorithms differs from motion-based segmentation in that it makes use of spatial information to rectify and improve the temporal segmentation results. In this way, not only can it overcome the above-mentioned problems in motion-based segmentation, but it can also be applied to non-rigid motion and, therefore, more generic scenes. A simplified spatio-temporal segmentation is shown in Figure 1(b). Based on the above discussions, this paper adopts the classification of segmentation of moving objects shown in Figure 2.

Figure 1. (a) Simplified motion-based segmentation: input data, motion representation, segmentation. (b) Simplified spatio-temporal segmentation: input data, motion-based segmentation and spatial segmentation, combined segmentation. In the following sections, we discuss each of the segmentation techniques shown in Figure 2.

3. 2D motion-based segmentation methods

2D motion-based segmentation methods can be divided into segmentation based on optical flow discontinuities and segmentation based on change detection.

3.1 Segmentation based on optical flow discontinuities

This group of methods performs segmentation based on the displacement or optical flow of image pixels. The displacement or optical flow of a pixel is a motion vector representing the motion between a pixel in one frame and its corresponding pixel in the following frame. Optical flow is itself a very active research topic. For the varieties of optical flow estimation methods, readers are referred to [An88, BB95, Tekalp95, SK99].

Early work on segmentation tried to segment images into moving objects using local measurements. Potter [Potter75] uses velocity as a cue to segmentation. The work is based on the assumption that all parts of an object have the same velocity when they are moving. Potter's approach to motion extraction was based on the measurement of the movement of edges. He assumed that, since the pictures of a scene were taken very close in time, the edges could be correlated between the pictures by their spatial positions alone. A motion measurement for a given reference point of a superimposed grid was obtained by determining the differences (between pictures) of the displacements of edges from the point. The point was classified into one of three classes: body, shadow and background. Points were grouped within classes on the basis of identical motion measurement values. In his later work [Potter77], Potter determines the approximate velocity of a pixel by using template matching; the templates he chose are Cross, T- and L-templates. It is shown in that work that the template matching process provides more accurate velocity information from more complex scenes. The main advantage of using templates for velocity extraction is that they are object independent, but since the template features do not appear everywhere in the picture, the resulting velocity field is sparse. Without a spatial analysis, it is impossible to segment out the whole object.

Spoerri and Ullman [SU87] recognize that the computation of motion and the detection of motion boundaries is a "chicken and egg" dilemma. In order to break this dilemma, they try to detect motion boundaries before a full flow field is found, using local flow discontinuity tests. The input to the bimodality tests is a local normal-flow histogram constructed from a circular neighborhood at each image point. The bimodality test detects motion boundaries by computing the degree of bimodality, i.e. the presence of two peaks of equal strength, in the local histogram. Hildreth [Hildreth84] makes use of the fact that if two adjacent objects undergo different motions v1 and v2, then normal flow components whose orientations lie between the directions of v1 and v2 will change in sign and/or magnitude across the boundary. She therefore uses zero-crossings of normal flow to detect motion boundaries. These methods are suitable for situations where the estimation of motion is difficult but motion boundaries can still be perceived. Only very simple images are used for testing. The advantage of using local flow tests is that boundaries can be detected locally without knowing the motion of the remaining parts of the moving object; this is helpful in moving object segmentation where it is not possible to use a test criterion, such as the bimodality test, to divide the whole flow field. This can happen when the flow field of the moving object is not uniform. However, the choice of input data primitives is a challenge for these algorithms, since intensity values are sensitive to noise and changes in illumination, while edge features tend to be sparse and their density is non-uniform. A similar approach has also been adopted by Nagel et al. [NSKO94].
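To make the local-histogram idea concrete, the following is a minimal sketch of a bimodality score in the spirit of Spoerri and Ullman's test; the histogram bin count and the peak-ratio scoring rule are illustrative assumptions, not the authors' exact statistic.

import numpy as np

def bimodality_score(normal_flow_patch, bins=16):
    """Score how strongly a local normal-flow histogram splits into two
    peaks of comparable strength (values near 1.0 suggest a motion
    boundary). The scoring rule here is an illustrative assumption."""
    hist, _ = np.histogram(normal_flow_patch, bins=bins)
    # Interior local maxima of the histogram.
    peaks = [i for i in range(1, bins - 1)
             if hist[i] >= hist[i - 1] and hist[i] >= hist[i + 1] and hist[i] > 0]
    if len(peaks) < 2:
        return 0.0                      # unimodal: no boundary evidence
    heights = sorted((hist[i] for i in peaks), reverse=True)
    return heights[1] / heights[0]      # ratio of the two strongest peaks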

Overington [Overington87] also utilizes the normal components of flow computed at edges to find discontinuities in the normal flow. The discontinuities are used to detect moving objects in a scene taken from a static camera.

Thompson et al [TMB85] apply motion edge detection to the image flow field to find object boundaries, a natural extension of classical intensity-based edge detection. For this purpose, the Marr-Hildreth edge detector is used to detect moving objects' boundaries based on the Laplacian-of-Gaussian (LoG) smoothed flow field. Clocksin [Clocksin80] proposed the use of the Laplacian operator for detecting sharp changes in the velocity field generated when an observer translates in a static environment. He shows that, in such circumstances, discontinuities in the magnitude of flow can be detected with a Laplacian operator; in particular, singularities in the Laplacian occur at discontinuities in the flow. A similar approach has been taken by Schunck [Schunck89], which applies motion edge detection to a surface-based smoothed optical flow field resulting from the clustering of optical flow constraint lines. These algorithms suffer from the same over-segmentation drawback as the approaches based on intensity-based gradient edge detection. A solution is to combine temporal information with spatial
information such as color and texture, so that the over-segmentation in the motion field can be overcome by a spatial segmentation.
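As a rough illustration of motion edge detection on the flow field in the spirit of [TMB85] and [Clocksin80], the sketch below applies a Laplacian-of-Gaussian filter to the flow magnitude and marks zero-crossings; the smoothing scale and the zero-crossing test are illustrative choices, not parameters from those papers.

import numpy as np
from scipy.ndimage import gaussian_laplace

def motion_edges(u, v, sigma=2.0, eps=1e-3):
    """Mark motion boundaries as zero-crossings of the LoG-filtered flow
    magnitude. u, v: (H, W) flow components; sigma, eps are assumptions."""
    mag = np.hypot(u, v)                      # flow magnitude per pixel
    log = gaussian_laplace(mag, sigma=sigma)  # Gaussian smoothing + Laplacian
    sign = np.sign(log)
    edges = np.zeros(mag.shape, dtype=bool)
    # Zero-crossing against the right and lower neighbour with contrast > eps.
    edges[:, :-1] |= (sign[:, :-1] != sign[:, 1:]) & (np.abs(log[:, :-1] - log[:, 1:]) > eps)
    edges[:-1, :] |= (sign[:-1, :] != sign[1:, :]) & (np.abs(log[:-1, :] - log[1:, :]) > eps)
    return edges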

While segmentation based on finding flow discontinuities is straightforward, it is unlikely to achieve the expected results. In essence, the optical flow field has the same statistical characteristics as the intensity field of an image. Based on experience from image segmentation, it is not difficult to recognize that high-level information and rules are needed to aid the analysis.

3.2 Segmentation based on change detection

In the above methods, optical flow or velocity is usually computed at every image point in the frame. Since the percentage of points having zero motion or a simple global motion is usually large, it is more efficient and economical to locate and focus the analysis on areas that are changing. More importantly, segmentation of moving objects is usually a multi-stage or iterative process, and the elimination of a large number of potential noise points can prevent many errors from propagating further. Since the background is usually stationary or has a simple global motion, it is possible to remove it by simple differencing or motion-compensated differencing. This method is dominantly used in segmentation for object-based coding and segmentation for content-based functionalities [HT88, Diehl91, MW98]. Segmentation using change detection avoids the computation of differential gradients in the estimation of optical flow, which is unreliable. These algorithms begin with change detection to distinguish between temporally changed and unchanged regions of two successive images k-1 and k; the moving object is then separated from the changed regions. The decision whether a spatial position x = (x, y) belongs to a changed or an unchanged image part is based on the evaluation of the frame difference (FD):

FD(x) = S_k(x) − S_{k−1}(x) (3.2.1)

The FD is usually applied over a measurement window rather than at a single pixel. In order to distinguish between relevant changes, due to motion of objects or brightness changes, and irrelevant temporal changes due to noise, the frame difference has to be compared to a threshold T_ch. A reliable decision that a spatial position x belongs to a changed region is only possible if the frame difference exceeds this threshold. The binary change detection mask C(x), indicating changed (C(x) = 1) and unchanged (C(x) = 0) regions, is provided by the change detection algorithm at each spatial position x. Hence, the performance of a change detector essentially depends on two parameters. The first is the choice of the threshold separating changed from unchanged luminance picture elements, and the second is a reasonable criterion for eliminating small regions, e.g. small unchanged regions within large changed regions [TB89].
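A minimal sketch of this decision rule, assuming the mean absolute frame difference is evaluated over a square measurement window (the window size and threshold value are illustrative):

import numpy as np
from scipy.ndimage import uniform_filter

def change_mask(frame_k_minus_1, frame_k, t_ch=10.0, window=5):
    """Binary change detection mask per (3.2.1): C(x) = 1 where the
    windowed mean absolute frame difference exceeds t_ch."""
    fd = np.abs(frame_k.astype(float) - frame_k_minus_1.astype(float))
    fd_windowed = uniform_filter(fd, size=window)   # measurement window
    return fd_windowed > t_ch                       # C(x)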

Jain et al have made intensive studies of change detection using the accumulative difference picture (ADP) technique [JN79, JMA79, Jain81, JJ83, Jain84a, Jain84b]. An accumulative difference picture is formed by comparing every frame of an image sequence to a reference frame and increasing the entry in the accumulative difference picture by 1 whenever the difference for the pixel exceeds a threshold; see Figure 3. Thus an accumulative difference picture ADP_k is computed over k frames by comparing each frame with the reference frame [JKS95]:

ADP_0(x, y) = 0, ADP_k(x, y) = ADP_{k−1}(x, y) + DP_k(x, y) (3.2.2)

where DP_k(x, y) = 1 if the difference between frame k and the reference frame exceeds the threshold at (x, y), and DP_k(x, y) = 0 otherwise.

Figure 3. Accumulative difference picture: per-pixel counters accumulate as the moving object moves right one pixel per frame [JN79].


ADP_k(x, y) works as a counter for each pixel in the accumulative image. Segmentation can be carried out by finding high values in the counter, which are likely to correspond to actual moving objects.
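A small sketch of the accumulation loop of (3.2.2); the threshold value is an illustrative assumption:

import numpy as np

def accumulate_adp(frames, threshold=15.0):
    """Accumulative difference picture: count, per pixel, how often a
    frame differs from the reference frame (frames[0]) by more than the
    threshold. High counts indicate candidate moving-object pixels."""
    reference = frames[0].astype(float)
    adp = np.zeros_like(reference)
    for frame in frames[1:]:
        adp += np.abs(frame.astype(float) - reference) > threshold  # DP_k
    return adp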

In [JN79, Jain81], Jain and Nagel try to extract rigid moving objects in a scene, taken from a stationary camera, by analyzing accumulative difference pictures. The key idea in their approach is to reconstruct a stationary scene component, or the whole background, from the image sequence. Once this background is reconstructed, moving objects can be detected by comparing the following frames with this stationary scene. The difference pictures are created by simple thresholding of the change between two frames, where the threshold is a likelihood ratio constructed from the mean grey value and its variance for sample areas (measurement windows) from the two frames. In the next stage, the first frame is selected as the reference frame and the difference picture for the reference frame is initialized to zero for all its elements; successive difference pictures are then accumulated onto the first difference picture. When the object has been completely displaced from its original location, a reference frame comprising only stationary components can be formed, and moving objects in following frames can be segmented out by comparison with this reference frame. It is difficult to apply the algorithm to generic scenes due to some strong assumptions, such as a stationary background, rigid motion and monotonicity of object motion. Furthermore, the reconstruction may not be possible because an occluding object may come into the scene before the object under analysis has moved away from its original location; the method also needs a long sequence of frames for the motion analysis.

Difference accumulation can only be applied in limited applications due to the strong assumptions. However, if these assumptions are satisfied, difference accumulation can be a very convenient way to recover moving objects. An alternative to difference accumulation is the accumulation of similarity, which can be used to recover stationary components and for scene mosaicing, as exploited by Wang and Adelson [WA94]. They try to recover different layers from a panoramic scene after each layer is identified by its motion characteristics. In [JJ83], Jayaramamurthy and Jain proposed an approach to segment dynamic scenes containing textured objects by combining pixel velocity and difference pictures. This multistage approach first uses differencing to obtain active regions in the frame which contain moving objects. The threshold chosen to detect the active regions is a preset value of 10% of the peak intensity value found in the frames. In the next stage, a Hough transform on an optical flow field is used to determine the motion parameters associated with each active region. Finally, the intensity changes and the motion parameters are combined to obtain the masks of the moving objects. This approach combines the strengths of local and global methods. It is known that applying a global method such as the Hough transform alone to a scene containing several moving objects cannot yield useful results because of the potential interference, in the parameter space, of the individual peaks contributed by different moving objects. In their work, they resolve this difficulty by successive refinement using local confidence tests. Like their other algorithms, the algorithm is based on strong assumptions, such as rigid and translational motion and a stationary camera, which limit its application. The pixel-to-pixel matching method exploited in the pixel velocity estimation makes the algorithm especially impractical in most situations. Change detection through simple thresholding can lead to significant errors and inaccuracy in general situations. For this reason, change detection is usually embedded into a hierarchical or a relaxation algorithm. In this case, the initial change detection is refined by a motion-compensated prediction/update or a threshold evaluation-update process (see Figure 4 for an example).

Figure 4. Block diagram of the change detector: threshold calculation, threshold operation, median filtering and elimination of small regions. (FD: frame difference; C, Ci: change detection masks)


Figure 5. Separation of the change mask into moving object, covered background and uncovered background between frames t and t+1. The moving object is represented by the motion vectors whose tails and heads both fall within the changed region.

Thoma and Bierling [TB89] combine change detection with optical flow to carry out object segmentation, and incorporate a median filter to eliminate small elements in the change mask. The change-detection-based segmentation algorithm is adopted from [HT88] and is an iterative three-step process (Figure 4). In the first step, the threshold operation is performed over a measurement window using the mean absolute frame difference; the threshold is initially chosen to be a fixed value of 3/256. Then, a two-dimensional median filter is used to smooth the boundaries of the changed regions. The last step eliminates small isolated regions of the change detection mask. After these three steps, the initial threshold is re-evaluated: it is adapted to the standard deviation of the noise estimated from the available unchanged regions. The process repeats with the new threshold until the system is stable, resulting in a change detection mask. The segmentation of the moving object is achieved by separating the covered and uncovered background from the change detection mask based on the motion field previously estimated using a hierarchical block matching technique (Figure 5). A problem can arise here: some spatial positions in the uncovered background of the current frame are not addressed by any motion vector and hence cannot be identified, which affects boundary accuracy in most situations.
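The following is a minimal sketch of that iterative three-step loop (Figure 4), assuming the mean absolute frame difference fd has already been computed per pixel; the minimum region area, iteration count and the 3-sigma threshold re-estimation rule are assumptions for illustration, not values from [TB89].

import numpy as np
from scipy.ndimage import median_filter, label

def iterative_change_detection(fd, min_area=20, n_iter=5):
    """Threshold, median-filter, remove small regions, then re-estimate
    the threshold from the noise in the unchanged area and repeat."""
    t = (3.0 / 256) * fd.max()        # initial threshold (cf. the fixed 3/256)
    mask = fd > t
    for _ in range(n_iter):
        mask = fd > t                                   # step 1: threshold
        mask = median_filter(mask.astype(np.uint8), size=3).astype(bool)  # step 2
        regions, n = label(mask)                        # step 3: small regions
        for i in range(1, n + 1):
            if np.count_nonzero(regions == i) < min_area:
                mask[regions == i] = False
        noise = fd[~mask]             # unchanged area gives the noise statistics
        if noise.size == 0:
            break
        t = 3.0 * noise.std()         # adapt the threshold to the noise
    return mask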

Aach et al [AKM93] propose a change detection technique using MAP estimation and relaxation. The algorithm starts by computing the grey level difference image D. An initial change detection mask is then computed by a threshold operation on the squared normalized difference image. The thresholding is carried out by performing a significance test on the noise hypothesis for the luminance difference image D, which is modeled as Gaussian camera noise with variance σ². Since global thresholding can result in small isolated regions and irregular boundaries, an optimization mechanism using MAP is designed to modify or update the change mask, with the aim of eliminating small elements. The MAP criterion is then put into a deterministic relaxation to refine the object boundary. While the algorithm overcomes the 'corona' effect (a blurring effect which reduces the spatial resolution of some filters), the choice of the significance level for the significance test is still arbitrary, which can make the algorithm image dependent. The resulting object mask is too scattered due to the large number of small areas within the object area. A mechanism for eliminating small regions is apparently necessary; Mech and Wollborn [MW98] overcome this shortcoming with a morphological closing operation.

4. 3D motion-based methods

The 2D approaches described above analyze apparent motion on the 2D image plane, and the analysis is performed only according to the information available in the frames, without taking into account the structure and the "real" motion of the moving objects in space. 2D motion models are simple, but less realistic. A
practical and robust motion segmentation scheme must take into account objects' structure and motion in space. As a result, 3D motion segmentation is employed in most practical segmentation systems. Within 3D methods, two groups of segmentation can be distinguished among the varieties of algorithms: the structure from motion (SFM) approach and the parametric approach. SFM usually deals with 3D scenes, which contain significant depth information, while significant depth is not assumed in the parametric approach. Another important difference between the two approaches is the rigid-motion assumption in SFM, while the parametric approach only assumes piecewise motion rigidity of the scene.

4.1 Structure from motion

The structure from motion problem refers to recovering 3D geometry in space from 2D motion on the image plane. The idea of structure from motion is inspired by the phenomenon that human beings act in a three-dimensional world while they only sense 2D projections of it. Early structure from motion efforts were focused on structure from stereo, which deals with binocular scenes. In this paper we only discuss SFM with monocular scenes, which is effectively equivalent to stereo with a single camera. For structure from stereo, readers are referred to [Scharstein97, Fusiello98, Tekalp95]. SFM is preferred in applications where recovery of the "real" motion in the environment is essential, such as robotic navigation and object tracking. SFM also finds application in some other areas such as animation, active vision, 3D coding and mosaicing. Recovering 3D structure from 2D motion is a difficult problem to solve, since the observation data have one fewer dimension than the unknown environment to be estimated. Several simplifying assumptions are usually made to formulate the SFM task from the general problem of recovering 3D models from 2D imagery. One key assumption is that objects in the scene move rigidly or, equivalently, that only the camera is allowed to move. An additional simplification is that feature points have been located and correspondences have been established between feature points in the two frames. SFM techniques differ in their camera (geometry) model, linearity (in parameter estimation), number of features, and restrictions on camera motion and scene structure. Most SFM techniques are based on exploiting geometric or algebraic properties that are invariant under projection to multiple images, from which camera motion information is easily extracted. For example, the essential matrix, the fundamental matrix and the factorization method all exploit various invariants of perspective or parallel projection to recover the relative extrinsic camera parameters from two or three views. SFM methods are then distinguished into linear approaches and non-linear approaches based on the optimization methods employed to estimate the geometric parameters. Linear approaches try to solve the SFM problem linearly by casting it as a least-squares optimization. However, with the exception of the factorization method, these techniques are rarely scalable to multiple images, which limits the extent to which the solution can be made robust. Though the factorization method is scalable to a number of images, it is not recursive, and it assumes an orthographic model, which severely limits camera motion and scene structure. Therefore, these linear methods fail when the geometric constraints degenerate, for example, when the motion between images is small.
To address this, non-linear approaches using the extended Kalman filter (EKF) [JAP99] and projection-error refinement [Bestor98] have been proposed. Robust structure and motion are recovered by minimizing a non-linear cost function over the image sequence. A full review of SFM methods is beyond the scope of this paper; for the varieties of SFM algorithms, readers are referred to [JAP99, Bestor98, Tekalp95]. In the following, we discuss two different 3D segmentation schemes using the SFM method. Torr [Torr95] developed a stochastic 3D motion segmentation scheme that makes use of the epipolar lines generated from the fundamental matrix. In the two-view case, according to epipolar geometry, a point p1 in one frame is associated with its corresponding point p2 in the other frame by a fundamental matrix F (which encapsulates all the information on camera motion, including camera translation) in the form:

p2^T F p1 = 0 (4.1.1)

and the corresponding point p2 can only fall on the constraint epipolar line F p1 (Figure 6).
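In numpy, the constraint and the point-to-epipolar-line distance used later for clustering can be sketched as follows (points are homogeneous 3-vectors with last component 1; this is an illustration, not Torr's implementation):

import numpy as np

def epipolar_residual(F, p1, p2):
    """Distance from p2 to the epipolar line F @ p1 implied by (4.1.1).
    F: 3x3 fundamental matrix; p1, p2: homogeneous points (x, y, 1)."""
    a, b, c = F @ p1                 # epipolar line in the second image
    return abs(a * p2[0] + b * p2[1] + c) / np.hypot(a, b)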


Figure 6. The perspective projection of object point P onto the two image planes formulates the epipolar geometry.

Figure 7. Block diagram of (a) the segmentation algorithm (feature matching, model generation, cluster pruning, multiple hypothesis test) and (b) the model generation block (sample 7 points, estimate F, motion models 1..n, hypothesis test) in [Torr95].

The algorithm involves four steps: feature matching, model generation, cluster pruning and a multiple hypothesis test to determine the correct segmentation. The whole algorithm and the model generation block are shown in Figure 7.

In the first step, feature matching and geometry estimation form a recursive process: once the epipolar geometry has been estimated, it is used to aid feature matching; all the corners are rematched after the estimation of the epipolar geometry, and the epipolar geometry may then be further refined. When there is more than one moving object in the scene, it is assumed that the motion of the camera relative to the i-th object is specified by relative motion parameters with an associated fundamental matrix F. Since the rigid 3D feature point sets (corners) of the two views are linked by the fundamental matrix F, the segmentation problem is transformed into that of clustering the features in the image consistently with distinct fundamental matrices, each of which constrains an epipolar line for each feature point. So, in the second step, clustering is carried out in a model generation module (Figure 7(b)). It starts by randomly selecting 7 pairs and calculating F. More pairs are then added with an eye to consistency: each new pair is tested to see if it meets a maximum likelihood (ML) criterion of being at least 95% likely to belong to the cluster. The ML cost function is created by modeling the distance of a feature to its estimated epipolar line constrained by F as white Gaussian noise with zero mean, and the a priori probability as a geometric distribution. In the third step, small clusters may be pruned depending on the likelihood that they were randomly generated, and like clusters are merged. A special cluster, which makes use of a uniform probability density function, is used to capture data points that are not well modeled by the other clusters. The final step is a multiple hypothesis test to determine which particular combination of the many feasible clusters is most likely to represent the actual motions.
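A stripped-down sketch of the model-generation block is given below. estimate_fundamental stands for a hypothetical 7-point solver (not specified here), epipolar_residual is the helper sketched above, and the fixed residual threshold is a crude stand-in for the 95% ML test:

import numpy as np

def generate_motion_model(pts1, pts2, estimate_fundamental, threshold=1.0):
    """Seed a motion hypothesis from 7 random correspondences, then
    absorb all pairs consistent with the resulting F. pts1, pts2:
    (N, 3) arrays of homogeneous points."""
    seed = np.random.choice(len(pts1), 7, replace=False)
    F = estimate_fundamental(pts1[seed], pts2[seed])   # hypothetical solver
    members = [i for i in range(len(pts1))
               if epipolar_residual(F, pts1[i], pts2[i]) < threshold]
    return F, members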


Due to the sparse features exploited in the segmentation, object boundaries cannot be detected. Furthermore, the estimation of the fundamental matrix requires a larger displacement between frames (views).

MacLean [MacLean96] proposed a scheme based on the expectation-maximization (EM) algorithm. The input data to the segmentation algorithm are linear constraints on 3D translational motion. The linear constraints are generated from the bilinear constraints on 3D translation and rotation using the subspace methods:

T^T (x × u(x)) + (T × x)^T (x × Ω) = 0 (4.1.2)

where Ω and T are the rotational velocity and the translational velocity of the object (V = Ω × X + T), and u(x) = (u(x, y), v(x, y)) is the 2D optical flow. The linear constraint cancels the rotational part of the motion, resulting in a purely translational constraint from which the depth information of the scene can be recovered, which is favored in the application.

The algorithm adopts a top-down multi-process approach. Starting with an initial guess that there is a single translational motion in the scene, the translational parameters are estimated by non-linear optimization. In the second step, the EM algorithm is used to partition constraints between the single translational process and an outlier population (rejection). The EM algorithm is an iterative two-step method: (1) the expectation step assigns features to the motion parameters they are most consistent with; (2) the maximization step estimates the parameters of the models from the features found consistent with them. The third step examines the outlier population for evidence of other translational processes according to an ownership probability criterion. The fourth step checks the new processes and merges small processes with the larger ones, or discards them if they are too small. The algorithm repeats from step 2 until no new process emerges. The problem with the algorithm is the initialization: since methods for optimizing non-linear equations seldom guarantee a global minimum, the initial guess is critical, and a poor guess can lead to undesired results. Torr [Torr95, TM93] has shown that the EM algorithm may improve a segmentation if the segmentation is already good; otherwise the algorithm's convergence properties are poor, e.g. if two models are nearly the same then each will grab elements of the other's set.
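The assign/re-estimate alternation at the heart of this and similar schemes can be sketched generically as below; fit_model and residual are hypothetical callables for the chosen motion model, and hard assignment is a simplification of the soft ownership probabilities used in EM proper:

import numpy as np

def em_motion_clustering(data, fit_model, residual, n_models=2, n_iter=10):
    """Skeleton of EM-style multi-motion clustering. data: (N, d) array
    of motion constraints or flow samples."""
    labels = np.random.randint(n_models, size=len(data))   # crude init
    models = [fit_model(data[labels == k]) if np.any(labels == k)
              else fit_model(data) for k in range(n_models)]
    for _ in range(n_iter):
        # E-step: assign each sample to the model it is most consistent with.
        res = np.stack([residual(m, data) for m in models])
        labels = res.argmin(axis=0)
        # M-step: refit each model from its current members.
        for k in range(n_models):
            if np.any(labels == k):
                models[k] = fit_model(data[labels == k])
    return models, labels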

Both methods adopt a joint motion estimation and segmentation approach. However, there are several differences between the second algorithm and the first one. First, MacLean deals with a 2D dense optical flow as input data, which allows object boundaries to be detected. Second, the second segmentation is performed based on constraints that are related solely to the translational direction, and mixture models are used to model multiple motion processes. Third, the top-down approach of the second algorithm also differs from the first one.

4.2 Parametric methods

4.2.1 Parametric models

Parametric methods relax the rigidity assumption of the SFM method into piecewise rigidity. Parametric models are built by making use of object motion in space and explicitly assume a physical structure in the scene. The three-dimensional motion of the object is usually modeled as a 3D affine motion model described by a rotation matrix R and a translation vector T: X' = RX + T, where X and X' are object points at time t and t+1 respectively. The two physical structure models assumed in parametric methods are usually the planar surface and the parabolic surface, which are acceptable approximations to real objects' structure in natural scenes. By combining the structure and motion constraints with one of the two geometry models, parallel or perspective projection, the following parametric models are generated [Diehl91, Tekalp95] (a least-squares fitting sketch for the affine model follows the list):

(1) 6 parameter (affine) model, corresponding to a planar surface under parallel projection:

x' = a1 x + a2 y + a3, y' = a4 x + a5 y + a6 (4.2.1.1)

(2) 8 parameter model, corresponding to a planar surface under perspective projection:

x' = (a1 x + a2 y + a3) / (a7 x + a8 y + 1), y' = (a4 x + a5 y + a6) / (a7 x + a8 y + 1) (4.2.1.2)


(3) 12 parameter model, corresponding to a parabolic surface under parallel projection:

x' = a1 x^2 + a2 y^2 + a3 xy + a4 x + a5 y + a6, y' = b1 x^2 + b2 y^2 + b3 xy + b4 x + b5 y + b6 (4.2.1.3)

where x = (x, y) and x' = (x', y') are corresponding points at time t and t+1 respectively, assumed to be established by a 2D optical flow vector. Another 8 parameter model, called quadratic flow, is generated by combining the 3D object velocity model V = Ω × X + T, a planar surface and the perspective projection geometry constraints:

x' = a1 + a2 x + a3 y + a7 x^2 + a8 xy, y' = a4 + a5 x + a6 y + a7 xy + a8 y^2 (4.2.1.4)
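As promised above, here is a minimal sketch of fitting the 6-parameter affine model (4.2.1.1) to corresponding points in the least-squares sense; the point pairs would typically come from an optical flow field (x' = x + u(x)):

import numpy as np

def fit_affine(points, points_next):
    """Least-squares estimate of (a1..a6) in (4.2.1.1).
    points, points_next: (N, 2) arrays of positions at t and t+1."""
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    a123, *_ = np.linalg.lstsq(A, points_next[:, 0], rcond=None)  # a1, a2, a3
    a456, *_ = np.linalg.lstsq(A, points_next[:, 1], rcond=None)  # a4, a5, a6
    return np.concatenate([a123, a456])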

4.2.2 Segmentation of scene into planar patches

Adiv [Adiv85] first proposed the segmentation of the scene into planar patches, an idea later adopted by many researchers. The segmentation is a hierarchically structured three-stage algorithm in which objects in the scene are decomposed into planar patches that move rigidly. In the first stage of the segmentation, optical flow vectors are grouped into components using a multi-pass modified Hough transform on the parameter space, where a component is a connected set of vectors that support the same affine transformation (4.2.1.1). In the modified Hough transform, each flow vector u(x) = (v1(x), v2(x)) votes for the set of quantized parameters which minimizes:

δ(x) = (δ_x^2(x) + δ_y^2(x))^(1/2) (4.2.2.1)

where δ_x(x) = v1(x) − (a1 x + a2 y + a3) and δ_y(x) = v2(x) − (a4 x + a5 y + a6). The parameter sets that receive the most votes are likely to represent candidate motions. Three techniques are used to alleviate the computation involved in the high-dimensional Hough transform: i) multi-resolution, where at each resolution level the parameter space is quantized around the estimates obtained at the previous level; ii) decomposition of the parameter space into the two subspaces {a1, a2, a3} and {a4, a5, a6}; and iii) multi-pass, where the flow vectors most consistent with the candidate parameters are grouped first.
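A much-simplified single-pass version of the voting step can be sketched as follows; candidates stands for a pre-quantized grid of affine parameter sets, and the residual tolerance is an assumption (the actual algorithm is multi-resolution, multi-pass and splits the parameter space):

import numpy as np

def hough_votes(points, flow, candidates, tol=0.5):
    """Count, for each candidate (a1..a6), how many flow vectors it
    explains within tol per (4.2.2.1). points: (N, 2); flow: (N, 2);
    candidates: (M, 6). Peaks in the counts mark candidate motions."""
    x, y = points[:, 0], points[:, 1]
    votes = np.zeros(len(candidates), dtype=int)
    for m, (a1, a2, a3, a4, a5, a6) in enumerate(candidates):
        dx = flow[:, 0] - (a1 * x + a2 * y + a3)    # delta_x(x)
        dy = flow[:, 1] - (a4 * x + a5 * y + a6)    # delta_y(x)
        votes[m] = np.count_nonzero(np.hypot(dx, dy) < tol)
    return votes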

The second stage is a merging of components. Adjacent components created in the first stage are merged into segments if they obey the same eight-parameter quadratic flow model (4.2.1.4), called the Ψ transform. In the last stage, flow vectors that are not consistent with any Ψ transform are merged into neighbouring segments if they are consistent with the corresponding Ψ transform, resulting in the final segmentation. Many other researchers also use this idea to segment the world into planar facets [MB87, MW86, TM93, IRP92]. These methods can be expected to be useful in many man-made environments, where planar surfaces occur in abundance. When combined with other methods, they can also be applied to many other situations, as can be seen in the integrated methods of Section 4.2.4.

4.2.3 Bayesian Segmentation

The Bayesian method is among the most widely used segmentation techniques. Its objective is to maximize the posterior probability of the unknown label field (segmentation) X given the observed motion field D. In order to maximize the a posteriori (MAP) probability P(X | D), two probability distributions must be specified according to Bayes' theorem: the conditional probability P(D | X) and the a priori likelihood P(X). To determine P(X), the label field X is usually modeled as a Markov random field (MRF), and P(D | X) is modeled as white Gaussian noise with zero mean and variance σ². Due to the equivalence between MRFs and Gibbs distributions [GG84], the MAP problem reduces to minimizing a global cost function which consists of two terms, a close-to-data term and a smoothness term:

E = Σ_{x_i} [ (1/(2σ²)) ||u(x_i) − ũ(x_i)||² + Σ_{x_j ∈ N_{x_i}} V_C(X(x_i), X(x_j)) ] (4.2.3.1)

where u and ũ are the observed flow and the synthesized (estimated) flow at each pixel, N_{x_i} denotes the neighborhood system for the label field, and the potential function is:


V_C(X(x_i), X(x_j)) = −β if X(x_i) = X(x_j), and +β otherwise, with β > 0 (4.2.3.2)
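For concreteness, the cost (4.2.3.1) with the Potts-type potential (4.2.3.2) over horizontal and vertical two-site cliques can be evaluated as below; sigma and beta are illustrative values:

import numpy as np

def gibbs_energy(u, u_synth, labels, sigma=1.0, beta=0.5):
    """Evaluate (4.2.3.1) for a candidate label field.
    u, u_synth: (H, W, 2) observed and model-synthesized flow fields;
    labels: (H, W) integer segmentation."""
    data_term = np.sum((u - u_synth) ** 2) / (2.0 * sigma ** 2)
    smoothness = 0.0
    for differs in (labels[:, 1:] != labels[:, :-1],   # horizontal cliques
                    labels[1:, :] != labels[:-1, :]):  # vertical cliques
        smoothness += beta * differs.sum() - beta * (~differs).sum()
    return data_term + smoothness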

The varieties of Bayesian segmentation techniques then differ in the choice of cost function. It is well known that motion estimation and segmentation are interdependent: motion estimation requires knowledge of the motion boundaries, while segmentation needs the estimated motion field to identify motion boundaries. Joint motion estimation and segmentation algorithms are ways to break this cycle. For that, the MAP estimation is usually put into a recursive process to iteratively optimize the motion parameters and the segmentation, as illustrated in Figure 8(a).

Figure 8. (a) Joint motion estimation and segmentation (initialisation, then iterated motion estimation and segmentation); (b) the ICM scheme used in [BF93] (initialisation, motion estimation, segmentation, motion re-estimation and prediction, repeated until the motion is compensated).

Murray and Buxton [MB87] first proposed a MAP segmentation algorithm where the parametric model is the quadratic flow model (4.2.1.4) and the segmentation field is modeled by a Gibbs distribution. In order to compute the observation probability P(D | X), the eight parameters of the quadratic flow model (4.2.1.4) are calculated for each region by linear regression on a normal flow field. The interpretation of this flow field, i.e. the label field X, is modelled as an MRF. The cost function of the corresponding Gibbs distribution has the two components of (4.2.3.1): (1) a close-to-data term, measuring how well the estimated motion model approximates the observed optical flow field, using as criterion the sum of squared normalized differences between the original flow vectors and the estimated motion vectors; (2) a spatial smoothness term, represented by the sum of potential functions (4.2.3.2) and line-process potentials to allow for motion discontinuities across motion boundaries. The potential function is constructed on two-site cliques C, and the line-process potential is a 0-1 function reflecting the cost of various line configurations, allowing neighboring sites to have different interpretations. The line process is a mechanism that compensates for the boundary-blurring side effect introduced by the smoothness term. The cost function is minimized iteratively using simulated annealing (SA). The algorithm starts with an initial labeling X, calculates the mapping parameters for each region by least squares, and sets the initial temperature T.

In the second step, the pixel sites are scanned; at each site, the label Xi = X(xi) is perturbed randomly, and the perturbation is accepted or rejected according to the cost function. The third step re-estimates the mapping parameters for each region in the least-squares sense, once all pixel sites have been visited, based on the new segmentation label configuration. In the last step, the temperature T is lowered and the process repeats from step 2 until a stopping criterion is satisfied. The drawback of this algorithm is the prohibitive computation involved in the SA. Chang et al [CTS94] proposed a Bayesian approach based on a representation of the motion field as the sum of a parametric field and a residual field. The parameters of the eight-parameter model (4.2.1.2) are obtained for each region in the least-squares sense from the dense field. The cost function to be minimized under the MAP criterion consists of four terms, each derived from an MRF. The first term U1 is the temporal continuity term, measuring how good the prediction is; it is minimized when both the synthesized and the dense motion field minimize the displaced frame difference (DFD):

U1 = α Σ_x [I_k(x) − I_{k+1}(x + ũ(x))]² (4.2.3.3)

where I_k and I_{k+1} are the two frames under analysis and α is a normalization factor. The second term U2 is the close-to-data term, the same as the first term of (4.2.3.1); it is minimized if the parametric representation is consistent with the dense flow field. The third term U3 is a piecewise smoothness term, intended to replace the line process of [MB87] by enforcing spatial smoothness only on flow vectors generated by a single object:

U3 = β Σ_{x_i} Σ_{x_j ∈ N_{x_i}} ||u(x_i) − u(x_j)||² δ(X(x_i) − X(x_j)) (4.2.3.4)

The fourth term U4 is a standard spatial smoothness term, the same as the second term of (4.2.3.1), represented by a potential function that enforces a smooth label field. Since the number of unknowns is three times higher when the motion field has to be estimated as well, the computational complexity is significantly larger. Chang et al decompose the cost function into two terms and alternate between estimating the motion field and the segmentation labels using highest confidence first (HCF) and iterated conditional modes (ICM), respectively. Compared with the approach of [MB87], the cost function has two more terms: the piecewise smoothness term and the temporal continuity term. The use of ICM alleviates the prohibitive computation of SA. Bouthemy and Francois [BF93] also embed MAP estimation in ICM. Their algorithm is a top-down approach, starting from an initial guess to obtain an initial segmentation. Two initializations are tested: one starts with region growing based on a likelihood ratio test on each motion block; the other starts by treating the entire image as a single region. In the second step, the motion parameters, represented by the affine model (4.2.1.1), are estimated using the initial segmentation. The next step labels each pixel with one of the estimated affine models according to a MAP cost function similar to that in [MB87]. In the fourth step, the motion parameters are re-estimated based on the new segmentation field. The fifth step performs motion prediction between the two frames under analysis to check consistency with the observed frames. If the motion between the two frames is not compensated well, the process repeats from step three. The block diagram of the algorithm is illustrated in Figure 8(b). Both [BF93] and [CTS94] take temporal continuity into account, but Bouthemy and Francois adopt a different approach: instead of putting the temporal continuity criterion into the cost function, it is put into a separate step of the algorithm. Although a deterministic relaxation algorithm such as ICM is computationally less expensive than SA, it does not guarantee a global minimum; the minimization is likely to be trapped in a local minimum near the initial state. A good initialization is essential to the overall performance of the segmentation.
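The deterministic update that distinguishes ICM from SA can be sketched as a raster scan that, at every site, greedily picks the label minimizing the local version of (4.2.3.1); representing each motion model by a pre-synthesized flow field is an assumption made for illustration:

import numpy as np

def icm_sweep(u, synth_flows, labels, sigma=1.0, beta=0.5):
    """One ICM sweep. u: (H, W, 2) observed flow; synth_flows:
    (K, H, W, 2) flow synthesized from each of K motion models;
    labels: (H, W) current segmentation, updated in place."""
    K, H, W, _ = synth_flows.shape
    # Data cost of assigning each model at each pixel.
    data = np.sum((synth_flows - u[None]) ** 2, axis=-1) / (2.0 * sigma ** 2)
    for i in range(H):
        for j in range(W):
            best_k, best_e = labels[i, j], np.inf
            for k in range(K):
                e = data[k, i, j]
                for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if 0 <= ni < H and 0 <= nj < W:
                        e += -beta if labels[ni, nj] == k else beta
                if e < best_e:
                    best_k, best_e = k, e
            labels[i, j] = best_k
    return labels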

Bayesian motion estimation and segmentation algorithms are criticized for their computational expense. To address this problem, the Bayesian approach is often conducted in a multi-resolution way.

Stiller [Stiller97] employs a deterministic relaxation technique over a multi-scale pyramid. The relaxation technique is similar to ICM. The cost function for the MAP criterion consists of all the terms of the cost function in [CTS94] except the second term.

While Bayesian segmentation algorithms solve the noise problem and work well in finding homogeneous motion regions, they have no mechanism for solving the over-segmentation problem. They can, however, be integrated into other segmentation techniques to overcome this limitation. In segmentation for content-based functionalities, Mech and Wollborn [MW98] use this method for the creation of the change detection mask.


4.2.4 Integrated methods: Segmentation for object-based video coding

In the above segmentation techniques, scenes are segmented into planar patches representing different motions, or more precisely, disconnected motions. These disconnected motions are not necessarily distinct; regions with the same motion may be scattered across the resulting segmentation. In segmentation-based coding, however, it is preferable to group these scattered regions of the same motion into a single entity, called a "layer". More gain can be obtained by representing a motion layer with a single set of motion parameters than by representing each region with an individual set of parameters. By grouping the motions in the scene into different motion layers, more efficient coding can be achieved in a hierarchical manner. Besides, segmentation for coding purposes needs more accurate motion boundaries to avoid prediction artifacts. To achieve this, the algorithms for this purpose focus on finding global rather than local motion homogeneity, and integrated methods combining 2D and 3D methods as well as layered segmentation methods have been proposed.

Figure 9. Block diagram of segmentation algorithm in [HT88] and [Diehl91] (blocks: change detection; separation into changed and unchanged regions; object definition and parameter estimation; memory; prediction; consistency check DFD < T?; intra-frame segmentation)

Hötter and Thoma [HT88] proposed a hierarchically structured segmentation algorithm aimed at object-based coding, where each segmented motion region is described by one set of motion parameters. The algorithm is a three-step approach, see Figure 9. (1) In the first step, change detection is applied to two input frames It and It+1 to segment the two fields into changed and unchanged regions, resulting in the change detection mask. The change detection mechanism is an iterative three-step process discussed in Section 3.2 (Figure 4). (2) In the second step, each connected changed region is provisionally treated as one moving object. Then an eight-parameter motion model (4.2.1.2) is estimated for each region using the direct method proposed by Tsai and Huang [TH81]. Using the displacement field derived from the estimated parameters, the change detection mask is verified, resulting in a segmentation mask in which the moving objects are separated from uncovered background or background about to be covered. The separation mechanism is discussed in Section 3.2 (Figure 5). (3) In the next step of the hierarchically structured segmentation, the mapping parameters are used to perform a motion compensating prediction in the changed regions using a displaced frame difference (DFD) criterion. The image regions not correctly described by the mapping parameters are detected by the change detector evaluating the prediction result. These detected regions are then treated like the changed regions of the initialization step. This



procedure is hierarchically repeated until all separately moving objects are described by their mapping parameters.
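A compact sketch of this hierarchical loop may help; every helper below is a hypothetical, caller-supplied function rather than the procedure of [HT88] itself:

```python
def hierarchical_segmentation(I_t, I_t1, change_detect, components,
                              estimate_params, predict_error, threshold):
    """Sketch of the hierarchical loop of [HT88]; the helper interfaces are
    assumptions of this sketch:
      change_detect(I_t, I_t1)                -> binary change mask
      components(mask)                        -> list of region masks
      estimate_params(I_t, I_t1, region)      -> mapping parameters
      predict_error(I_t, I_t1, params, region) -> per-pixel DFD map
    In practice an iteration cap would be added to guarantee termination."""
    regions = components(change_detect(I_t, I_t1))
    objects = []
    while regions:
        pending = []
        for region in regions:
            params = estimate_params(I_t, I_t1, region)
            err = predict_error(I_t, I_t1, params, region)
            if (err[region] < threshold).all():
                objects.append((region, params))   # region well described
            else:
                # badly predicted pixels become new "changed" regions
                pending.extend(components((err >= threshold) & region))
        regions = pending
    return objects
```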

Diehl [Diehl91] extends Hötter and Thoma's method into a spatio-temporal algorithm in which both contour and texture information from single images and information from successive images are used to split a scene into objects. The overall procedure is similar to Hötter and Thoma's, but at every stage the segmentation result is refined by an intra-frame counterpart based on single images. The inter-frame segmentation controls this combination, since motion is used as the main segmentation criterion. Using the inter-frame segmentation, each of the contour regions is assigned to a moving region. Adjacent regions of the same type are then merged to obtain the final objects.

The differences between Hötter and Thoma's method and Diehl's method lie in three aspects. Firstly, Diehl's method is a spatio-temporal one while Hötter and Thoma's is a purely motion-based segmentation. Secondly, the two methods differ in the parametric models employed and in the estimation of the model parameters: Hötter and Thoma employ the parametric model of (4.2.1.2), while Diehl employs the parametric model of (4.2.1.3), which is a more accurate approximation to the object surface; for the parameter estimation, Hötter and Thoma use the direct method, while Diehl optimizes the estimation using a more complex modified Newton algorithm. Thirdly, Hötter and Thoma use segmentation history to support object tracking, which is useful for recovering meaningful moving objects from the scene, as will be seen later in segmentation for content-based functionalities. The block diagram of the two algorithms is plotted in Figure 9: the central part, marked in bold, is shared by both methods; Diehl's intra-frame segmentation is plotted on the right with dotted lines, and the two additional blocks of Hötter and Thoma's algorithm are plotted on the left.

A similar approach to the above two methods has also been taken by Musmann et al [MHO89]. The segmentation techniques described in these algorithms can naturally be exploited for background-foreground separation, where all the moving parts in the foreground are regarded as a single moving object. Because of its feasibility for obtaining an integrated moving foreground object, the main idea of these algorithms was adopted by Mech and Wollborn in their proposal to MPEG-4 [m1949]. The idea of representing motions in the scene at different levels was later adopted by Wang and Adelson [WA94] and Borshukov et al [BBAT97] to represent the scene as layers.

It is worth pointing out here that the concept of object used in the above segmentation algorithms is different from its meaning in segmentation for content-based functionalities, where objects represent meaningful, real-world objects. In segmentation for object-based coding, or more properly segmentation-based coding [DM95], the final segmented objects are regions of homogeneous motion which can be described by a single set of motion parameters. These regions of uniform motion are often called objects in these algorithms. Clearly, the final results of segmentation for object-based coding are not meaningful objects in the sense used by the content-based functionality methods, because the motion of real objects is rarely uniform.

4.2.5 Segmentation of scene into layers (scene mosaicing)

There are cases where the scene can be separated into layers. Wang and Adelson [WA94] proposed a segmentation scheme to separate panoramic scenes into layers. The idea underlying this algorithm is to align the scene by compensating out the global motion, then accumulate the aligned frames to recover the layer(s) of interest. They assume that regions undergoing a common affine motion are part of the same physical object in the scene. The objective is to derive a single representative image for each layer. The algorithm starts by estimating an optical flow field and then subdivides the frame into square blocks. The affine motion parameters are computed for each block by linear regression to get an initial set of motion models, or hypotheses. The pixels are then grouped using an iterative adaptive K-means clustering algorithm: pixel x is assigned to hypothesis or layer i if the difference between the optical flow at x and the flow vector synthesized from the affine parameters of layer i is smaller than for any other hypothesis. Obviously, this does not enforce spatial continuity of the label field. To construct the layers, information from a longer sequence is necessary because of the accumulation process. The frames are warped according to the affine motion of the layers so that coherently moving objects are aligned. A temporal median filter is then applied to the aligned sequence to enhance the image. By accumulating all the aligned frames, the layer can be recovered. A similar approach has been proposed by Torres et al [TGM97] and was later improved by Borshukov et al into a multi-stage affine classification algorithm [BBAT97]. Hsu et al


[HAP94] also adopted this idea to segment the scene into layers of coherent motion, with coding as the objective.
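For illustration, a minimal numpy sketch of the hypothesis-assignment and refitting steps of such a layered scheme (the array shapes and the affine parameterization are assumptions of this sketch, not the cited implementation) could read:

```python
import numpy as np

def assign_layers(flow, hypotheses):
    # Assign each pixel to the affine hypothesis whose synthesized flow is
    # closest to the measured optical flow (no spatial continuity enforced).
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    residuals = []
    for a in hypotheses:   # a = (a0..a5): u = a0+a1*x+a2*y, v = a3+a4*x+a5*y
        u = a[0] + a[1] * xs + a[2] * ys
        v = a[3] + a[4] * xs + a[5] * ys
        residuals.append((flow[..., 0] - u) ** 2 + (flow[..., 1] - v) ** 2)
    return np.argmin(np.stack(residuals), axis=0)

def refit_hypotheses(flow, labels, k):
    # Re-estimate each layer's affine parameters by linear regression.
    h, w, _ = flow.shape
    ys, xs = np.mgrid[0:h, 0:w]
    hyps = []
    for i in range(k):
        m = labels == i
        if m.sum() < 3:                      # degenerate or empty layer
            hyps.append(np.zeros(6))
            continue
        A = np.stack([np.ones(m.sum()), xs[m], ys[m]], axis=1)
        au, *_ = np.linalg.lstsq(A, flow[..., 0][m], rcond=None)
        av, *_ = np.linalg.lstsq(A, flow[..., 1][m], rcond=None)
        hyps.append(np.concatenate([au, av]))
    return hyps
```

Alternating these two steps, and pruning near-duplicate hypotheses, yields the iterative adaptive K-means loop described above.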

Some researchers regard these algorithms as VOP segmentation approaches, but the situations in which they apply are rare. In essence, these algorithms for representing a scene as different layers are methods of representing the scene at different levels of motion, with an idea similar to that of Hötter and Thoma's [HT88]. In terms of separating scenes into different layers of meaningful objects, they can be applied only in very limited situations, such as panoramic scenes, due to the strong conditions needed to recover layers. Such conditions include rigid, monotonously incremental motion, significant depth variations and a long sequence of frames. Therefore, these approaches are more suitable for coding than for content-based functionalities, as is demonstrated by the results in [BBAT97, HAP94].

5. Spatial-temporal segmentation

Recently, due to emerging multimedia applications such as MPEG-4 and MPEG-7, there is a need for segmentation of scenes into meaningful objects, or meaningful moving objects, to facilitate the so-called content-based functionalities. For example, the new world standard MPEG-4 defines a video scene as consisting of video object planes (VOPs) to support content-based functionalities such as object-based spatial and temporal scalability, user interaction with scene content, etc. Although many motion segmentation techniques are available, techniques for segmenting meaningful objects from generic scenes do not exist. From the preceding discussion, we note that the integrated methods employed in segmentation for object-based coding are especially suitable for this purpose; for example, Diehl's spatio-temporal approach plus Hötter and Thoma's memory mechanism is a possible route to segmentation of meaningful objects from a scene. Several schemes have attempted to segment moving objects from the background using a spatio-temporal approach. In the following, we discuss these algorithms by analyzing the temporal and spatial parts separately.

5.1 Temporal segmentation

Mech and Wollborn [MW98, m1949] propose a segmentation scheme based on Hötter and Thoma's algorithm [HT88] and Diehl's algorithm [Diehl91]. The algorithm is implemented in four steps, which are illustrated in the left part of Figure 10.

(1) In the first step, any apparent camera motion is estimated and compensated using the eight-parameter motion model of (4.2.1.2). (2) In the second step, an apparent scene cut or strong camera pan is detected by evaluating the mean squared error (MSE) between the two successive frames, considering only background regions of the previous frame. In the case of a scene cut the algorithm is reset. (3) The third step is a change detection mask (CDM) module, see the leftmost part of Figure 9. First, an initial CDM, called CDMi, between two successive frames is generated by a relaxation technique (see the discussion of [AKM93] in Section 3.2), using local thresholds which consider the state of neighboring pixels. In order to get temporally stable object regions, a memory of change detection masks is then applied to make use of the previous segmentation results. The updated CDM (CDMu) is then simplified using a morphological closing operator to generate the final CDM for object detection. (4) In the fourth step, the technique of Thoma and Bierling [TB89] is used to obtain an initial moving object mask (OMi) from the CDM (see Figure 5 in Section 3.2). It is then adapted to luminance edges of the corresponding frame, resulting in the final object region. The key idea in this algorithm is to obtain an initial object mask from the change detection, which can be used to improve and track the object of interest at a later stage. To create the initial object mask, instead of first performing change detection followed by global motion estimation as is done in [HT88], it first applies a global motion compensation to eliminate motion caused by the camera, after which change detection is applied. A similar method has also been used by Dufaux et al [DML95]. The algorithm drops the assumption of a stationary background made in [HT88] and [Diehl91] by allowing camera panning and zooming. In the global motion estimation, pixels whose distance from the left or right border is less than 10 pixels are used as observation points, on the assumption that there is no moving object near the left and right image borders. This assumption is easily violated in most natural scenes. To overcome the limitations of the motion estimation, a more effective motion estimation method has to be found.
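As a rough illustration of the change detection memory in step (3), the following sketch replaces the local relaxation of [AKM93] with a simple global threshold; the noise estimate, gain k and memory depth are hypothetical parameters of this sketch:

```python
import numpy as np

def change_detection(prev_comp, curr, cdm_memory, noise_std, k=3.0, depth=3):
    # prev_comp: previous frame warped by the estimated camera motion.
    # A global threshold stands in for the local relaxation of [AKM93].
    diff = np.abs(curr.astype(float) - prev_comp.astype(float))
    cdm = diff > k * noise_std
    cdm_memory.append(cdm)
    if len(cdm_memory) > depth:      # bounded temporal memory
        cdm_memory.pop(0)
    # A pixel stays "changed" if it changed in any recent frame: this is
    # the memory step that stabilizes object regions over time.
    return np.logical_or.reduce(cdm_memory)
```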


Figure 10. Block diagram of segmentation algorithm in [MW98]

Meier [Meier98, m2238] proposes a VOP segmentation scheme with a procedure similar to [MW98]. The

algorithm starts with the estimation of a dense motion field using block matching. The global motion is then modeled by the six-parameter affine transformation of (4.2.1.1), with the parameters obtained using a robust least median of squares algorithm. However, without a prior motion segmentation of the dense motion field, the result of global motion estimation can be far from the correct global motion, because pixels in the moving object region are selected as observation points on equal terms in the affine parameter estimation, whereas the observation points should be restricted to those in the background. The initial object model, or initial object mask, is obtained by combining the motion segmentation result from a complex morphological motion filtering with a spatial segmentation using the Canny operator. Instead of the memory tracking exploited in [MW98], object tracking is done using the Hausdorff distance and a model update process. The VOP extraction is applied as a post-processing step. Since some key parameters in the motion segmentation process need input from the user, it is not a fully automatic VOP segmentation algorithm.
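The robustness of least median of squares can be illustrated with a small sketch (not Meier's implementation; the observation points and flow samples are assumed inputs):

```python
import numpy as np

def lmeds_affine(pts, flow_at_pts, trials=200, sample=3, rng=None):
    # Robust global-motion fit: repeatedly fit the 6-parameter affine model
    # to minimal random samples and keep the fit with the least *median*
    # squared residual, so up to roughly half of the points (moving objects
    # acting as outliers) cannot spoil the estimate.
    rng = rng if rng is not None else np.random.default_rng()
    A = np.hstack([np.ones((len(pts), 1)), pts])   # rows: [1, x, y]
    best, best_med = None, np.inf
    for _ in range(trials):
        idx = rng.choice(len(pts), sample, replace=False)
        try:
            au = np.linalg.solve(A[idx], flow_at_pts[idx, 0])
            av = np.linalg.solve(A[idx], flow_at_pts[idx, 1])
        except np.linalg.LinAlgError:
            continue                               # degenerate sample
        r = (A @ au - flow_at_pts[:, 0]) ** 2 + (A @ av - flow_at_pts[:, 1]) ** 2
        med = np.median(r)
        if med < best_med:
            best, best_med = (au, av), med
    return best
```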

Neri et al [NCRT98, m2365] proposed a segmentation algorithm based on higher order statistics (HOS). The algorithm produces the segmentation map of each frame fk of the sequence by processing a group of frames fk-i, i=0,...,n. The number of frames n varies on the basis of the estimated object velocity. Any global motion component is removed by a pre-processing stage, aligning fk-j to fk, j=1,...,n. For each frame fk, the algorithm proceeds in three steps, as illustrated in Figure 11.

(1) In the first step, the frame differences dk-j(x,y) = fk-j(x,y) - fk-n(x,y), j=0,...,n-1, of each frame of the group with respect to the first frame fk-n are evaluated in order to detect the changed areas due to object motion, uncovered background and noise. In order to reject luminance variations due to noise, a higher order statistics test is performed. Namely, for each pixel (x,y) the fourth-order central moment



$\tilde{m}^{(4)}_{d_{k-j}}(x,y)$ of each inter-frame difference d(x,y) is estimated on a 3×3 window; it is compared with a threshold adaptively set on the basis of the estimated background activity, and set to zero if it is below the threshold. On the sequence of thresholded fourth-order moment maps, a motion detection procedure is performed. (2) This step aims at distinguishing changed areas representing uncovered background (which stands still in the HOS maps) from moving objects (which move in the HOS maps). At the j-th iteration, the pair of thresholded HOS maps $\tilde{m}^{(4)}_{d_{k-j}}(x,y)$ and $\tilde{m}^{(4)}_{d_{k-j-1}}(x,y)$ is examined. For each pixel (x,y) its displacement

is evaluated on a 3×3 window, adopting a SAD (sum of absolute differences) criterion, and if the displacement is not null the pixel is classified as moving. Then the lag j is increased (i.e. the pair of maps slides) and the motion analysis is repeated, until j=n-2. Pixels presenting null displacements on all the observed pairs are classified as still. Note that, from a computational point of view, at each iteration only pixels that were not already classified as moving are examined. Moreover, matching is unnecessary for pixels which are zero and are included in a 3×3 zero window; they are assumed to be still. Thus, at the expense of some comparisons, the matching is performed on few pixels, and at the expense of examining more than one pair of differences, the search window is kept small. (3) Still regions internal to moving regions are re-assigned to the foreground. The regularization algorithm then refines the segmentation by imposing a priori local connectivity constraints on both the background and the foreground. Namely, topological constraints on the size of object irregularities such as holes, isthmi, gulfs and isles are imposed by means of morphological filtering. Five morphological operations, with a circular structuring element, are applied. From a topological point of view, the regularization supports multiple moving regions whose size exceeds the size of the structuring element, corresponding to different moving objects. Finally, a post-processing operation refines the results on the basis of spatial edges.
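A minimal sketch of the fourth-order moment test of step (1), assuming scipy is available and using a uniform 3×3 window:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def hos_map(frame_a, frame_b, threshold):
    # Fourth-order central moment of the inter-frame difference, estimated
    # on a 3x3 window; values below the (background-activity-based)
    # threshold are zeroed, suppressing Gaussian-like camera noise.
    d = frame_a.astype(float) - frame_b.astype(float)
    mean = uniform_filter(d, size=3)
    m4 = uniform_filter((d - mean) ** 4, size=3)
    m4[m4 < threshold] = 0.0
    return m4
```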

Figure 11. Block diagram of segmentation algorithm in [NCRT98]

The temporal segmentation in Choi et al's algorithm [CLK97, m2091] also results from camera

motion compensated change detection. The change detection mask, resulting from a Neyman-Pearson test based on the statistical characteristics of the observation window, is overlaid on the current segmented regions resulting from the morphological segmentation based on spatial information. When the majority of a segmented region belongs to the changed region in the change detection mask, the whole area of the segmented region is declared foreground, otherwise background.

Up to now, the algorithms discussed in this section all ignore depth information in the scene. This is because in applications such as coding and content-based functionalities, the scenes under analysis are usually assumed to be 2D scenes: the scene is viewed from a camera at such a distance that it can be approximated by a flat 2D surface, while the camera undergoes rotations and zooms. In these situations camera translation, or ego-motion, is not significant, so depth variations are not significant and are thus often ignored. For generic scenes, however, the camera can be close to the moving objects or can undergo translational motion, which induces significant depth variations in the images. Depth information can then be employed to help analyze motion events such as occlusion.

Pardás [Pardás97] proposed a segmentation scheme that uses motion coherence information together with depth level information. The scheme is a two-level bottom-up approach. The bottom level is based on grey level information, while the top level uses the relative depth of the regions in order to merge regions from the previous level. Temporal continuity is achieved by means of a region tracking procedure implicit in the grey level segmentation and by filtering the depth estimation. The bottom level uses a time-recursive segmentation scheme relying on the grey level information. In the top level, the regions obtained in the previous level are classified on the basis of an estimate of the relative depth between them; neighbouring regions found to be at the same depth level are considered a single region at this segmentation level. The relative depth of the regions of the grey level segmentation is estimated by considering the occlusions between regions and the motion coherence between neighbouring regions. This estimation procedure is performed in four steps: motion estimation,



motion parameter comparison, overlap computation and depth level assignment. Only image sequences with simple scenes are tested in the work.

The segmentation algorithms described in this section are all automatic approaches that try to segment the scene into meaningful objects. More precisely, they separate the scene into background and moving foreground objects. The segmented foreground is not processed further, although it may still be separable. These algorithms directly target the content-based functionalities addressed by MPEG-4. Most of them base the segmentation of moving objects on change detection rather than on the motion field, because the frame difference field is more reliable than the motion field, as explained at the beginning of Section 3.2.

Although these approaches are far from the ultimate goal of segmentation, separating the scene into semantically meaningful objects, they are promising.

5.2 Spatial segmentation

Segmentation based on motion information is unlikely to achieve an accurate result without the help of spatial information. Therefore, many of the effective segmentation schemes proposed are spatio-temporal schemes. While motion information is usually used as the main criterion to guide the segmentation process in these approaches, spatial segmentation also plays an important role. Two types of spatial segmentation methods are usually exploited: contour based and region based.

In [MW98, m1949], spatial segmentation is applied twice in the process. A morphological closing operator is used to eliminate potential small regions in the change detection mask (CDM) resulting from the change detection process of the previous stage. In order to avoid mistakenly merging small regions into the main body, a special ternary mask is created to record false change points; within this ternary region, small regions with size below a certain threshold are eliminated. After this simplification process, an object mask (OM) is created by eliminating the uncovered background from the segmented CDM. This mask may include undesired background points or pieces around its boundaries due to noise. For this reason, edge information extracted from the current frame using the Sobel operator is exploited to improve the boundaries of the object mask: motion boundaries within a certain radius of the local edges are adapted to the edges. This results in the final moving object.

While edges play a post-processing role in [MW98], edge information plays a key role in the segmentation process proposed by Meier [Meier98, m2238]. Instead of using an intensity map for the object mask, Meier uses an edge map to represent the object mask, on the grounds that grey level representations are not reliable due to their sensitivity to changes in illumination. This representation leads to expensive distance comparisons in the tracking stage and to a close-and-fill operation in the final object extraction stage. The change detection mask (CDM) resulting from the change detection process of the previous stage is adapted to the edge map extracted from the current frame using the Canny operator. The binary object mask (OM) of a moving object is then created by selecting all edge points that belong to the CDM. The subsequent operations of tracking, updating and video object plane (VOP) extraction are all based on this binary model. The binary OM is tracked through the sequence using the Hausdorff distance. A model update that accommodates both rigid and non-rigid moving parts of an object (referred to as slowly changing and rapidly changing components, respectively) follows. The VOP extraction is a close-and-fill process, where the closing is an erosion of the OM boundaries; after this, the VOP can be created by simply filling in the closed OM. The closing is realized by examining a 3×3 neighborhood using an adjacency code (AC) combined with a look-up table. Boundary gaps are dealt with using Dijkstra's shortest path algorithm. However, there are still some gaps in the final model that cannot be connected, as can be seen from the results presented in the work; this can cause problems in the filling-in process, and the author fails to point out how it is dealt with. Besides, the empirical setting of the distance values for different types of points in the gap filling stage is ad hoc.
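For reference, the symmetric Hausdorff distance between two edge-point sets can be computed directly, though [Meier98] necessarily uses a more efficient formulation in practice; this brute-force sketch is quadratic in memory:

```python
import numpy as np

def hausdorff(A, B):
    # A, B: (n, 2) and (m, 2) arrays of edge-pixel coordinates.
    # Symmetric Hausdorff distance: the larger of the two directed
    # distances max_a min_b ||a - b|| and max_b min_a ||a - b||.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```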

In Choi et al's segmentation scheme [CLK97, m2091], spatial segmentation constitutes a core part of the algorithm. The spatial segmentation is a morphological segmentation which uses morphological filters and the watershed algorithm as basic tools. In the first step, images (or motion compensated images if there is global motion) are simplified by morphological open-close by reconstruction filters. These filters remove regions that are smaller than a given size but preserve the contours of the remaining objects in the image. In the second step, the spatial gradient of the simplified image is approximated using the morphological gradient operator. In order to increase robustness, color information is also incorporated into the gradient computation, and the estimated gradient is


thresholded. The spatial gradient is used as input to the watershed algorithm to partition the image into homogeneous intensity regions. In the third step, the boundary decision is made using the watershed algorithm. The watershed algorithm is a region growing algorithm that assigns pixels in the uncertainty area to the most similar region according to some segmentation criterion, such as the difference of intensity values. The result of the watershed algorithm is usually an over-segmented tessellation of the input image; to overcome this problem, region merging follows. In the fourth step, small regions are merged in order to yield larger and meaningful regions which are homogeneous and differ from their neighbors. For this purpose, a joint similarity measure T is chosen to compare a region R under consideration with its neighbors. T is the weighted sum of the average sum of absolute differences (ASAD) of two corresponding regions between two frames and the average intensity of the region in the current frame:

T = \beta \frac{1}{|R|} \sum_{\mathbf{x} \in R} I(\mathbf{x}) + (1 - \beta)\, \mathrm{ASAD}(R, R')    (5.2.1)

where β is a weight factor. The region under consideration is then merged with the neighboring region for which the difference between the two similarity measures is smallest. A drawback of this algorithm is the need for preset values for the thresholds and parameters employed in the gradient approximation and region merging stages. Morphological operations and the watershed algorithm have proved to be very powerful segmentation tools, and much research has been carried out on making them efficient to implement [SP94, VS91].
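A condensed sketch of this morphological pipeline, written against scikit-image (an assumption of this sketch; the cited works predate the library, and the structuring-element radius and threshold are illustrative):

```python
import numpy as np
from scipy import ndimage as ndi
from skimage import morphology
from skimage.segmentation import watershed

def morphological_segmentation(image, se_radius=2, grad_thresh=10.0):
    image = image.astype(float)
    se = morphology.disk(se_radius)
    # (i) simplification: opening/closing by reconstruction removes small
    # details while preserving the contours of the surviving structures.
    opened = morphology.reconstruction(
        morphology.erosion(image, se), image, method='dilation')
    simplified = morphology.reconstruction(
        morphology.dilation(opened, se), opened, method='erosion')
    # (ii) morphological gradient, thresholded to suppress noisy gradients.
    grad = morphology.dilation(simplified, se) - morphology.erosion(simplified, se)
    grad[grad < grad_thresh] = 0.0
    # (iii) watershed on the gradient; markers are the gradient minima.
    markers, _ = ndi.label(morphology.local_minima(grad))
    return watershed(grad, markers)   # (iv) region merging would follow
```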

Wardhani and Gonzalez [WG99] proposed a split-and-merge image sequence segmentation scheme for content-based retrieval. The algorithm first splits the frame into regions of homogeneous color. It then merges them by applying the Gestalt laws: proximity, similarity, good continuation, closure, surroundedness, relative size, symmetry and common fate. In their algorithm, size, color and texture are used as proximity and similarity criteria; lines and edges are used as high level information for the good continuation criterion. Surroundedness grouping can eliminate regions of holes, isthmi, gulfs and isles of small size. The symmetry grouping is effective for symmetrical objects. Finally, based on the motion information obtained by a weighted comparison of colour, size and location of the segmented regions between frames, an inter-frame grouping is applied as the common fate criterion. This algorithm is mainly based on image segmentation techniques; since region growing techniques have over-segmentation problems, if part of the object of interest in the scene does not move, the algorithm can still have problems extracting an integrated, meaningful object.

Meier et al [MNG85] proposed a Bayesian segmentation based on highest confidence first (HCF), which is an improvement of Pappas' method [Pappas92]. In this work, the cost function is composed of three parts. The first term is the close-to-data term, which originates from the conditional probability P(O|X) and is modeled as the squared difference between observation points and the mean of the region. The second term consists of two parts: the sum of two-point clique potential functions, and the sum of three-point clique potential functions based on the edge image. The algorithm starts from seed points, or initial regions, selected on a grid with spacing d. Based on these seeds, a modified HCF technique labels pixels in order of confidence. Regions missed by the initialization grid remain uncommitted after this stage; for these, a new label is created so that in the second stage all pixels can be assigned to a region by HCF, resulting in the final partition. The approach improves on and differs from Pappas' method in two respects. Firstly, different regions carry different labels, ensuring that only pixels belonging to the same region are included in the calculation of a region's mean grey level. Secondly, there is no need to preset the labels or the number of classes K, and no need for an initial estimate of the segmentation X.

As mentioned at the beginning of this section, image segmentation algorithms are based either on contour finding or on region growing. For contour based segmentation, edge detection techniques are the most widely adopted; the Canny and Sobel operators are the commonly used edge detectors due to their resilient performance. For region growing, three typical segmentation methods have been presented: the classic hybrid approach [WG99], the method based on morphological filters and the watershed algorithm [m2091], and the Bayesian method [MNG85]. Arguably, Bayesian segmentation is the most widely used segmentation method, but due to its usual need for a relaxation minimization process such as simulated annealing, it is computationally expensive. In recent research, region growing


methods based on morphological filters and the watershed algorithm are dominantly used. For a complete review of image segmentation, readers are referred to [PP93, HS85, RMB97, GJJ96, Dougherty92].

6. Performance comparison of different techniques

In this section, we summarize the methods discussed above with a performance comparison. While it is not a detailed comparison, it reflects to some extent the pros and cons of those methods. We recognize that the advantages, drawbacks and computational complexity are the major concerns when a user chooses an algorithm for a specific purpose. The summary is given in Table 1.

Different techniques | Strength | Weakness | Computational complexity

Motion-based approaches:
  2D methods | Simple to implement; applicable to non-rigid motion | Not robust; over-segmentation | Low
  3D methods:
    SFM methods | Scene structure reconstruction | Strong assumptions; applicable to limited applications | Linear: medium; non-linear: high
    Segmentation into planar patches | Robust; approximates non-rigid motion | Not applicable to non-rigid motion; over-segmentation | Medium
    Bayesian segmentation | Robust; applicable to non-rigid motion | Over-segmentation; hard initialization | ICM: medium; SA: high
    Integrated methods | Robust; applicable to non-rigid motion | Static background; non-meaningful objects | High
    Layered representation | Robust; scene reconstruction and manipulation | Strong assumptions; applicable to limited situations | Medium
Spatio-temporal approaches | Robust; applicable to non-rigid motion; allows moving background; meaningful objects | Meaningful moving object only means moving foreground | High

Table 1. A comparison between different segmentation techniques

It can be seen from the table that robust segmentation algorithms are achieved at the cost of high computational complexity. It can also be seen that the spatio-temporal approaches perform comparatively better than the other methods. In the following we present a complete segmentation scheme reflecting the latest spatial and temporal segmentation techniques.

7. A Complete Segmentation Scheme

To show a complete segmentation framework, we present in this section a complete object segmentation scheme for content-based functionalities. The scheme is described in the annex of the MPEG-4 Visual Working Draft [N2553] and combines techniques from FUB (Fondazione Ugo Bordoni, Italy), UH (University of Hannover, Germany) and ETRI (Electronics and Telecommunications Research Institute, Korea). It consists of four main steps. In the first step, camera motion is estimated and compensated. The second step detects scene cuts; this step is required because temporal segmentation is not carried out between frames of different shots due to the large content change across a cut. In the third step, temporal segmentation and spatial segmentation are carried out independently. In the fourth step,


the temporal and spatial segmentation results are combined to obtain moving object boundaries. The block diagram is illustrated in Fig. 12. In the following, we summarize each of these steps.

Figure 12. Block diagram of the combined temporal and spatial segmentation framework (input: video sequence; output: object mask sequence)

7.1 Camera motion compensation

Camera motion, or global motion, is modeled by the eight-parameter motion model of (4.2.1.2); the parameters are estimated by regression considering only pixels within background regions of the previous image. After the estimation of the motion parameters, the motion vector of every pixel is known. A post-processing step then accounts for model failures in background regions due to the rigid-plane assumption: the estimated motion vector of every background pixel is improved by performing a full search within a square area of limited size. The frame is then motion compensated according to the estimated motion.
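Assuming the common perspective form of the eight-parameter model (the exact form of (4.2.1.2) is not repeated here, so this form is an assumption of the sketch), the regression can be linearized by multiplying through by the denominator:

```python
import numpy as np

def fit_eight_param(src, dst):
    # Least-squares fit of an eight-parameter (perspective) motion model
    #   x' = (a1*x + a2*y + a3) / (a7*x + a8*y + 1)
    #   y' = (a4*x + a5*y + a6) / (a7*x + a8*y + 1)
    # linearized by multiplying through by the denominator; src and dst
    # are (N, 2) arrays of corresponding background points.
    x, y = src[:, 0], src[:, 1]
    xp, yp = dst[:, 0], dst[:, 1]
    z, o = np.zeros_like(x), np.ones_like(x)
    A = np.vstack([
        np.stack([x, y, o, z, z, z, -x * xp, -y * xp], axis=1),
        np.stack([z, z, z, x, y, o, -x * yp, -y * yp], axis=1),
    ])
    b = np.concatenate([xp, yp])
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params   # a1..a8
```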

7.2 Scene cut detection

There are normally dramatic content changes between shots. Thus temporal segmentation should only be used between frames within the same shot. The scene cut detector evaluates whether the difference between the current original image and the camera motion compensated previous image exceeds a given threshold. The evaluation is performed only within the background regions of the previous frame.
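The detector itself reduces to a masked MSE test; a minimal sketch (the threshold and masks are assumed inputs):

```python
import numpy as np

def scene_cut(curr, prev_comp, background, threshold):
    # MSE between the current frame and the camera-motion-compensated
    # previous frame, evaluated over the previous frame's background only.
    err = (curr.astype(float) - prev_comp.astype(float)) ** 2
    return err[background].mean() > threshold
```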

7.3 Temporal segmentation

The temporal segmentation algorithm, which is mainly based on the change detector of [MW98] and has been discussed in Section 5.1, can be summarized in the following four steps, assuming that any camera motion has already been compensated.

(i) An initial change detection mask (CDMi) between two successive frames is generated by thresholding the difference image.

(ii) Boundaries of the CDMi are smoothed by relaxation on a MAP detector [AKM93], using local thresholds which consider the state of neighboring pels. This results in a change detection mask (CDM). The CDM is simplified by use of a morphological closing operator and elimination of small regions. In order to get temporally stable object regions, a memory



for the CDM is applied, denoted as MEM. The temporal depth of MEM adapts automatically to the sequence.

(iii) An initial moving object mask (OMi) is estimated by eliminating the uncovered background from the CDM [HT88]. Displacement information for pels within the CDM is used: the displacement of each pel within the CDM is calculated by hierarchical block matching, and motion vectors whose 'head' and 'foot' are both within the CDM are marked as motion vectors corresponding to object pels (a minimal sketch of this test is given after this list). This results in the OMi. The remaining pels are treated as covered/uncovered background.

(iv) In the last step, the OMi is adapted to the luminance edges of the corresponding frame, resulting in the final object mask (OM).
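The head-and-foot test of step (iii) is simple to state in code; a minimal sketch with nearest-neighbour rounding of the displacements:

```python
import numpy as np

def initial_object_mask(cdm, disp):
    # A pel belongs to the moving object if both the 'foot' (its own
    # position) and the 'head' (position + displacement) of its motion
    # vector lie inside the change detection mask.
    h, w = cdm.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xh = np.clip(np.rint(xs + disp[..., 0]).astype(int), 0, w - 1)
    yh = np.clip(np.rint(ys + disp[..., 1]).astype(int), 0, h - 1)
    return cdm & cdm[yh, xh]   # remaining CDM pels: covered/uncovered background
```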

7.4 Spatial segmentation

Spatial segmentation splits the entire image into regions that are homogeneous in terms of intensity; the different homogeneous regions are distinguished by their encompassing boundaries. The spatial segmentation algorithm is implemented in four steps.

(i) The input images (or motion compensated images if there is global motion) are simplified by morphological open-close by reconstruction filters. These filters remove regions that are smaller than a given size but preserve the contours of the remaining objects in the image.

(ii) The spatial gradient of the simplified image is approximated using the morphological gradient operator. In order to increase robustness, color information is also incorporated into the gradient computation, and the estimated gradient is thresholded to remove noisy gradients. The spatial gradient is used as input to the watershed algorithm to partition the image into homogeneous intensity regions.

(iii) The boundary decision is made using the watershed algorithm, a region growing algorithm that assigns pixels in the uncertainty area to the most similar region according to some segmentation criterion, such as the difference of intensity values. The watershed algorithm is highly sensitive to gradient noise, which yields many catchment basins, so the final result is usually an over-segmented tessellation of the input image. To overcome this problem, region merging follows.

(iv) In the region merging step, small regions are merged in order to yield larger and meaningful regions which are homogeneous and differ from their neighbors. For this purpose, a similarity measure T, as in (5.2.1), is chosen to compare a region under consideration with its neighbors: T is the weighted sum of the average sum of absolute differences (ASAD) of two corresponding regions between two frames and the average intensity of the region in the current frame. A region under consideration is then merged with the neighboring region for which the difference between the two similarity measures is smallest.

7.5 Combination of temporal and spatial results

In the last step, OMs, or moving object boundaries, are obtained by combining the spatially segmented regions with the object mask (OM) obtained from the temporal segmentation. First, the OM is overlaid on top of the spatial segmentation mask; when the majority of a spatially segmented region belongs to the OM, the whole area of the segmented region is declared part of the OM, otherwise background. Next, the spatially segmented regions are also overlaid on the previous OM in the MEM; if the majority of a region under consideration belongs to the previous OM, the region is also declared to be part of the OM.
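The majority-overlap combination can be sketched directly (the 50% majority criterion is an assumption of this sketch):

```python
import numpy as np

def combine(spatial_labels, om, prev_om, majority=0.5):
    # A spatially segmented region is declared object if the majority of
    # its pixels lie inside the temporal object mask, or inside the
    # previous OM held in the memory (MEM).
    out = np.zeros_like(om, dtype=bool)
    for r in np.unique(spatial_labels):
        m = spatial_labels == r
        if om[m].mean() > majority or prev_om[m].mean() > majority:
            out |= m
    return out
```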

7.6 Discussion

This framework represents the current state of the art of automatic segmentation for MPEG-4 video applications. It is a spatio-temporal approach in which motion is used as the main segmentation criterion for the final decision on the object definition. The resulting foreground (OMs) and background are not treated further. Due to the frequent use of ad hoc thresholds, preset parameters and system values that need adjustment for different scenes, the framework can be very unstable. The final results are promising, but they are not completely semantically meaningful, especially when part of the object is not


moving. Since spatial segmentation, such as morphological operations and edge information, has already been incorporated into the temporal segmentation to refine the object mask (OM) and its boundaries, it is not clear whether another spatial segmentation process is necessary and can do better; no results demonstrating this necessity are given. Furthermore, there are several underlying assumptions in the algorithm. Firstly, it is assumed that there is no relatively abrupt motion or significant camera translation between two consecutive frames (at least between the first two frames); otherwise the global motion cannot be compensated by the adopted global motion model. Secondly, it is assumed that the background is composed of rigid planar surfaces; if there are non-rigid moving elements such as clouds, smoke or flowing water, they may be included in the foreground objects. Thirdly, it is assumed that all parts of the objects of interest are in motion (not necessarily uniform motion, though); if part of the object has not moved, it is missed in the final segmentation. Finally, there is a further assumption in the global motion estimation that the moving object is at least 10 pixels away from three of the frame's boundaries (upper, left and right). These assumptions can affect performance when the scheme is applied to generic scenes. In recognition of these limitations of current automatic techniques, two semi-automatic methods that require human intervention are also included in the annex of MPEG-4 [N2553].

8. Conclusion

In this paper we have reviewed current techniques for the segmentation of moving objects in image sequences. Emphasis has been given to techniques for segmentation for content-based functionalities. The most promising approach combines temporal and spatial segmentation.

It is clear from the review that, although great advances have been made in image/video segmentation techniques, there are still challenges in achieving fully automatic segmentation/extraction of semantically meaningful objects from generic scenes.

References:

1. [Adiv85] Gilad Adiv. Determining three-dimensional motion and structure from optical flow generated by several moving objects. IEEE PAMI 7(4):384-401, July 1985.

2. [AKM93] Til Aach, Andre Kaup and Rudolf Mester. Statistical Model-based Change Detection in Moving Video. Signal Processing 31(2):165-180, 1993.

3. [AN88] J. K. Aggarwal and N. Nandhakumar. On the Computation of Motion from Sequences of Images -- A Review. Proc. of the IEEE, 76(8):917-935, 1988.

4. [BB95] S. S. Beauchemin and J. L. Barron. The Computation of Optical Flow. ACM Computing Surveys, 27(3), Sept. 1995.

5. [BBAT97] Georgi D. Borshukov, Gozde Bozdagi, Yucel Altunbasak and A. Murat Tekalp. Motion Segmentation by Multi-stage Affine Classification. IEEE Trans. on Image Processing 6(11):1591-1594, Nov. 1997.

6. [BF93] Patrick Bouthemy and Edouard Francois. Motion Segmentation and Qualitative Dynamic Scene Analysis from an Image Sequence. International Journal of Computer Vision, 10(2):157-182, 1993.

7. [Bestor98] Gareth S. Bestor. Recovering Feature and Observer Position By Projected Error Refinement. Ph.D thesis, University of Wisconsin- Madison, 1998.

8. [CLK97] Jae Gark Choi, Si-Woong Lee and Seong-Dae Kim. Spatio-Temporal Video Segmentation Using a Joint Similarity Measure. IEEE Trans. on Circuits and Systems for Video Technology vol.7 No.2 April 1997, pp.279-286.

9. [Clocksin80] W. F. Clocksin. Perception of Surface Slant and Edge Labels from Optical Flow: A Computational Approach, Perception 9:253-269, 1980.

10. [CTS94] Michael M. Chang, A. Murat Tekalp and M. Ibrahim Sezan. An Algorithm for Simultaneous Motion Estimation and Segmentation. IEEE Int. Conf. On Acoustics, Speech and Signal Processing, ICASSP’94, Adelaide, Australia, April 1994, vol. V pp.221-224.

11. [Diehl91] Norbert Diehl. Object-Oriented Motion Estimation and Segmentation in Image Sequences. Signal Processing: Image Communication 3:23-56, 1991.


12. [DM95] F. Dufaux, F. Moscheni. Segmentation-based motion estimation for second generation video coding techniques. In Video Coding: the Second Generation Approach, L. Torres and M. Kunt Eds., Kluwer Academic Publishers, pp. 219-263, 1995.

13. [DML95] Frederic Dufaux, Fabrice Moscheni and Andrew Lippman. Spatio-Temporal Segmentation Based On Motion And Static Segmentation. Proc. Int. Conf. On Image Processing vol. I Oct. 1995 Washington D.C. pp.306-309.

14. [Dougherty92] E. R. Dougherty. An introduction to morphological image processing. SPIE, Bellingham, Washington, 1992.

15. [Fusiello98] Andrea Fusiello. Three-Dimensional Vision For Structure and Motion Estimation. Ph.D thesis, University of Udine, Italy, November 1998.

16. [GG84] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-6(6):721-741, 1984.

17. [GJJ96] E. Gose, R. Johnsonbaugh and Steve Jost. Pattern Recognition and Image Analysis. Prentice Hall PTR, Upper Saddle River, NJ, 1996.

18. [HAP94] S. Hsu, P. Anandan, and S. Peleg. Accurate computation of optical flow by using layered motion representation. International Conference on Pattern Recognition, Jerusalem, Oct. 1994.

19. [Heldreth84] E. Hildreth. The Measurement of Visual Motion. MIT Press, 1984.

20. [HS81] Berthold K. P. Horn and Brian G. Schunck. Determining Optical Flow. Artificial Intelligence 17(1-3):185-203, 1981.

21. [HS85] Robert M. Haralick and Linda G. Shapiro. Survey: Image Segmentation Techniques. Computer Vision, Graphics and Image Processing 29:100-132, 1985.

22. [HT88] Michael Hötter and Robert Thoma. Image Segmentation Based on Object Oriented Mapping Parameter Estimation. Signal Processing 15:315-334, 1988.

23. [IRP92] M. Irani, B. Rousso and S. Peleg. Detecting and tracking multiple moving objects using temporal integration. In G. Sandini, editor, Proc. 2nd European Conference on Computer Vision, LNCS 588, pp. 282-287, Springer-Verlag, 1992.

24. [Jain81] Ramesh Jain. Dynamic Scene Analysis Using Pixel-Based Processes. Computer vol.14 No.8 1981 pp. 12-18

25. [Jain84a] Ramesh Jain. Difference and accumulative difference pictures in dynamic scene analysis. Image and Vision Computing 2(2): 98-108 1984.

26. [Jain84b] Ramesh C. Jain. Segmentation of Frame Sequences Obtained by a Moving Observer. IEEE Trans. on PAMI vol. PAMI-6 No.5 Sep.1984 pp. 624-629.

27. [JAP99] Tony Jebara, Ali Azarbayejani and Alex Pentland. 3D Structure from 2D Motion. IEEE Signal Processing Magazine, 16(3):66-84, May 1999.

28. [JJ83] S. N. Jayaramamurthy and Ramesh Jain. An Approach to the Segmentation of Textured Dynamic Scenes. Computer Vision, Graphics and Image Processing 21, 239-261, 1983.

29. [JKS95] R. Jain, R. Kasturi and B. G. Schunck. Machine Vision. McGraw-Hill, Inc., 1995.

30. [JMA79] Ramesh Jain, W. N. Martin and J. K. Aggarwal. Segmentation through the Detection of Changes Due to Motion. Computer Graphics and Image Processing 11(1):13-34, 1979.

31. [JN79] Ramesh Jain and H.-H. Nagel. On the Analysis of Accumulative Difference Pictures from Image Sequences of Real World Scenes. IEEE Trans. on PAMI PAMI-1(2):206-214, April 1979.

32. [KD92] J. Konrad and E. Dubois. Bayesian estimation of motion vector field. IEEE Trans. on Pattern Analysis and Machine Intelligence 14:910-927, 1992.

33. [KIK85] M. Kunt, A. Ikonomopoulos and M. Kocher. Second generation image coding techniques. Proceedings of the IEEE 73(4):549-575, 1985.

34. [m571] S. Colonnese, A. Neri, G. Russo and P. Talone (FUB). Moving objects versus still background classification: a spatial temporal segmentation tool for MPEG-4. ISO/IEC JTC1/SC29/WG11 MPEG96/m571 Munich, January 1996

35. [m1831] R. Mech, and P. Gerken (UH). Automatic segmentation of moving objects (Core Experiment N2). ISO/IEC JTC1/SC29/WG11 MPEG97/m1831 Sevilla, ES February 1997.

36. [m1949] R. Mech, and P. Gerken (UH). Automatic segmentation of moving objects (Partial results of core experiment N2). ISO/IEC JTC1/SC29/WG11 MPEG97.


37. [m2091] J.G. Choi, M. Kim, M. H. Lee and C. Ahn (ETRI). Automatic segmentation based on spatio-temporal information. ISO/IEC JTC1/SC29/WG11 MPEG97/m2091, Bristol, GB April 1997.

38. [m2238] T. Meier and King N. Ngan (UWA). Automatic Segmentation Based on Hausdorff Object Tracking. ISO/IEC JTC1/SC29/WG11 MPEG97/m2238, Stockholm, July 1997.

39. [m2365] S. Colonnese, A. Neri, G. Russo and P. Talone (FUB). Core Experiment N2: Preliminary FUB results on combination of automatic segmentation techniques. ISO/IEC JTC1/SC29/WG11 MPEG97/m2365, July 1997.

40. [m2383] J.G. Choi, M. Kim, M. H. Lee and C. Ahn (ETRI) S. Colonnese, U. Mascia, G. Russo and P. Talone (FUB). Merging of temporal and spatial segmentation. ISO/IEC JTC1/SC29/WG11 MPEG97/m2383, July 1997.

41. [m2641] J.G. Choi, M. Kim, M. H. Lee and C. Ahn (ETRI). New ETRI results on core experiment N2 on automatic segmentation techniques. ISO/IEC JTC1/SC29/WG11 MPEG97/m2641, October 1997.

42. [m2803] Munchurl Kim, Jae Gark Choi, Myoung Ho Lee and Chieteuk Ahn. User-assisted segmentation for moving objects of interest. ISO/IEC JTC1/SC29/WG11 MPEG97/m2803, Oct. 1997.

43. [m3093] S. Colonnese, G. Russo (FUB). Segmentation techniques: towards a semi-automatic approach. ISO/IEC JTC1/SC29/WG11 MPEG98/m3093, February 1998.

44. [m3320] S. Colonnese, G. Russo (FUB). User interactions modes in semi-automatic segmentation: development of a flexible graphical user interface in Java. ISO/IEC JTC1/SC29/WG11 MPEG98/m3320, March 1998.

45. [m3349] Munchurl Kim, Jae Gark Choi, Myoung Ho Lee and Chieteuk Ahn. User-assisted Video Object Segmentation by Multiple Object Tracking with a Graphical User Interface. ISO/IEC JTC1/SC29/WG11 MPEG98/m3349, March 1998.

46. [m3935] Munchurl Kim, Jae Gark Choi, Myoung Ho Lee and Chieteuk Ahn. User’s guide for a user-assisted video object segmentation tool. ISO/IEC JTC1/SC29/WG11 MPEG98/m3935, Oct. 1998.

47. [m4047] G. Russo (FUB). Results of FUB user assisted segmentation environment. ISO/IEC JTC1/SC29/WG11 MPEG98/m4047, Oct. 1998.

48. [MB87] David Murray and Bernard Buxton. Scene segmentation from visual motion using global optimization. IEEE PAMI 9(2), 1987.

49. [MacLean96] Wallace James MacLean. Recovery of Egomotion and Segmentation of Independent Object Motion Using The EM-Algorithm. Ph.D thesis, University of Toronto, 1996.

50. [Meier98] Thomas Meier. Segmentation for Video Object Plane Extraction and Reduction of Coding Artifacts. Ph.D thesis, Dept. Of Electrical and Electronic Engineering, The University of Western Australia, 1998.

51. [MHO89] Hans Georg Musmann, Michael Hötter and Jörn Ostermann. Object-Oriented Analysis-Synthesis Coding of Moving Images. Signal Processing: Image Communication 1:117-138, 1989.

52. [MNG85] T. Meier, K. N. Ngan and G. Crebbin. A robust Markovian segmentation based on highest confidence first (HCF). In IEEE Int. Conf. On Image Processing, ICIP’97, Santa Barbara, CA, USA, Oct. 1997, vol. I pp.216-219.

53. [MW86] D. W. Murray and N. S. Williams. Detecting the image boundaries between optical flow fields from several moving planar facets. Pattern Recognition Letters 4:87-92, 1986.

54. [MW98] Roland Mech and Michael Wollborn. A noise robust method for 2D shape estimation of moving objects in video sequences considering a moving camera. Signal Processing 66(2):203-217, 1998.

55. [N2172] ISO/IEC JTC1/SC29/WG11 MPEG98/N2172. MPEG-4 Video Verification Model. Version 11.0, Tokyo, March 1998.

56. [N2553] ISO/IEC JTC1/SC29/WG11 MPEG98/N2553. Version 2 Visual Working Draft Revision 6.0. Rome, 1998.

57. [N2995] ISO/IEC JTC1/SC29/WG11 MPEG99/N2995. MPEG-4 Overview. Melbourne Oct. 1999.

58. [NCRT98] A. Neri, S. Colonnese, G. Russo and P. Talone. Automatic Moving Object and Background Separation. Signal Processing 66:219-232, 1998.


59. [NSKO94] H.-H. Nagel, G. Socher, H. Kollnig and M. Otte. Motion Boundary detection in image sequences by local stochastic test. In Proc. 3rd European Conf. On Computer Vision, LNCS 800/801,Vol. II, pp.305-314, May 1994.

60. [Overington87] I. Overington. Gradient-based flow segmentation and location of the focus of expansion. In Proc. 3rd Alvey Vision Conference pp. 169-177, 1987.

61. [Pappas92] Thrasyvoulos N. Pappas. An Adaptive Clustering Algorithm for Image Segmentation. IEEE Trans. on Signal Processing 40(4): 901-914, 1992.

62. [Pardás97] M. Pardás. Relative depth estimation and segmentation in monocular schemes. Picture Coding Symposium, PCS 97, Berlin, Germany, Sept. 1997, pp.367-372.

63. [Potter75] Jerry L. Potter. Velocity as a Cue to Segmentation. IEEE Trans. on Systems, Man and Cybernetics May 1975 pp. 390-394.

64. [Potter77] Jerry L. Potter. Scene Segmentation Using Motion Information. Computer Graphics and Image Processing 6, 558-581 (1977).

65. [PP93] Nikhil R. Pal and Sankar K. Pal. A Review on Image Segmentation Techniques. Pattern Recognition 26(9):1277-1294, 1993.

66. [RMB97] M. M. Reid, R. J. Millar and N. D. Black. Second Generation Image Coding: An Overview. ACM Computing Surveys 29(1):3-29, 1997.

67. [Salembier et al 97] P. Salembier, F. Marqués, M. Pardàs, R. Morros, I. Corset, S. Jeannin, L. Bouchard, F. Meyer, and B. Marcotegui. Segmentation-based video coding system allowing the manipulation of objects. IEEE Trans. on Circuits and Systems for Video Technology, 7(1):60-74, February 1997.

68. [SBCP96] P. Salembier, P. Brigger, J.R. Casas, and M. Pardàs. Morphological operators for image and video compression. IEEE Transactions on Image Processing, 5(6):881-898, June 1996.

69. [Scharstein97] Daniel Scharstein. View Synthesis Using Stereo Vision. Ph.D thesis, Cornell University, 1997.

70. [Schunck89] Brian G. Schunck. Image Flow Segmentation and Estimation by Constraint Line Clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence 11(10):1010-1027, 1989.

71. [SK99] C. Stiller and J. Konrad. Estimating motion in image sequences: A tutorial on modeling and computation of 2D motion. IEEE Signal Process. Magazine 16:70-91, July 1999.

72. [SP94] Philippe Salembier and Montse Pardas. Hierarchical Morphological Segmentation for Image Sequence Coding. IEEE Transactions on Image Processing 3(5): 639-651, 1994.

73. [Stiller97] Christoph Stiller. Object-Based Estimation of Dense Motion Fields. IEEE Transactions on Image Processing vol.6 No.2 Feb. 1997 pp.234-250.

74. [SU87] A. Spoerri and S. Ullman. The early detection of motion boundaries. Proc. 1st International Conference on Computer Vision pp. 209-218, 1987.

75. [TB89] Robert Thoma and Matthias Bierling. Motion Compensating Interpolation Considering Covered and Uncovered Background. Signal Processing: Image Communication 1 (1989) 191-212.

76. [Tekalp95] A. Murat Tekalp. Digital Video Processing. Prentice Hall PTR, 1995.

77. [TGM97] L. Torres, D. García and A. Mates. On the Use of Layers for Video Coding and Object Manipulation. 2nd Erlangen Symposium, Advances in Digital Image Communication, Erlangen, Germany, pages 65-73, April 25, 1997.

78. [TH81] Roger Y. Tsai and Thomas S. Huang. Estimating Three-Dimensional Motion Parameters of a Rigid Planar Patch. IEEE Trans. on Acoustics, Speech and Signal Processing ASSP-29(6):1147-1152, 1981.

79. [TKP96] L.Torres, M. Kunt and F. Pereira. Second Generation Video Coding Schemes and their Role in MPEG-4. European Conference on Multimedia Applications, Services and Techniques, pages 799 - 824, Louvain-la-Neuve, Belgium, May 28-30, 1996.

80. [TM93] P. H. S. Torr and D. W. Murray. Statistical detection of independent movement from a moving camera. Image and Vision Computing 11(4):180-187, May 1993.

81. [TMB85] William B. Thompson, Kathleen M. Mutch and Valdis A. Berzins. Dynamic Occlusion Analysis in Optical Flow Fields. IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-7(4):374-383, 1985.

82. [Torr95] P. H. S. Torr. Motion Segmentation and Outlier Detection. Ph.D thesis, University of Oxford, 1995.


83. [VS91] Luc Vincent and Pierre Soille. Watersheds in Digital Spaces: An Efficient Algorithm Based on Immersion Simulations. IEEE Trans. on Pattern Analysis and Machine Intelligence 13(6):583-598, 1991.

84. [WA94] John Y. A. Wang and Edward H. Adelson. Representing Moving Images with Layers. IEEE Trans. on Image Processing vol.3, no. 5, pp.625-638, Sept. 1994.

85. [WG99] Aster Wardhani and Ruben Gonzalez. Image Structure Analysis for CBIR. Proc. Digital Image Computing: Techniques and Applications, DICTA'99, Dec. 1999, Perth, Australia, pp.166-168.