
M.I.T. Media Laboratory Perceptual Computing Section Technical Report No. 279 (replaces TR-228). Appears in the IEEE Transactions on Image Processing, Special Issue: Image Sequence Compression, vol. 3, no. 5, pp. 625-638, September 1994. Revised: May 1994.

Representing Moving Images with Layers

John Y. A. Wang and Edward H. Adelson

Abstract

We describe a system for representing moving images with sets of overlapping layers. Each layer contains an intensity map that defines the additive values of each pixel, along with an alpha map that serves as a mask indicating the transparency. The layers are ordered in depth and they occlude each other in accord with the rules of compositing. Velocity maps define how the layers are to be warped over time. The layered representation is more flexible than standard image transforms and can capture many important properties of natural image sequences. We describe some methods for decomposing image sequences into layers using motion analysis, and we discuss how the representation may be used for image coding and other applications.

Keywords: Image coding, motion analysis, image segmentation, image representation, robust estimation.

1 Introduction

An image coding system involves three parts: the encoder, the representation, and the decoder. The representation is the central determinant of the overall structure of the coding system. The most popular image coding systems today are based on "low-level" image processing concepts such as DCTs, subbands, etc. It may ultimately be possible to encode images using "high-level" machine vision concepts such as 3-D object recognition, but it will be many years before such techniques can be applied to arbitrary images. We believe that a fruitful domain for new image coding lies in "mid-level" techniques, which involve concepts such as segmentation, surfaces, depth, occlusion, and coherent motion. We describe one such representation based on "layers" and show how it may be applied to the coding of video sequences.

Edward H. Adelson is affiliated with the Department of Brain and Cognitive Sciences. John Y. A. Wang is affiliated with the Department of Electrical Engineering and Computer Sciences. Both are at the MIT Media Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139. This research was supported in part by contracts with the Television of Tomorrow Program, SECOM Co., and Goldstar Co. We thank Ujjaval Desai for compressing the layer images.

Figure 1: The objects involved in a hypothetical scene of a moving hand. (a) The hand, which undergoes a simple motion. (b) The background, which is translating down and to the left. (c) The observed image sequence.

Consider the language used by a traditional cel animator. First a background is painted, and then a series of images are painted on sheets of clear celluloid (the "cels"). As a cel is moved over the background it occludes and reveals different background regions. A similar representation is used in computer graphics, where a set of images can be composited with the aid of "alpha channels" that serve as masks to indicate the transparency and opacity of the overlying layers.

In traditional cel animation one is restricted to rigid motions of the backgrounds and the cels. In a digital system, however, it is easy to use more complex motions such as affine transformations, which include all combinations of translation, rotation, dilation, and shear. Such representations will be successful insofar as the image model offers an adequate description of the motions found in the original sequence.

Figure 1 illustrates the concept with an image sequence of a waving hand (figure 1a) moving against a moving background (figure 1b). Suppose that both the hand and the background execute simple motions as shown. The resultant image sequence is shown in figure 1c.

Given this sequence we wish to invert the process by which it was generated. Thus we would like to decompose the sequence into a set of layers which can be composited so as to generate the original sequence. Since the world consists of stable objects undergoing smooth motions, our decomposition should also contain stable objects undergoing smooth motions.

An earlier description of layered formats is found in [1].


Figure 2: The desired decomposition of the hand sequence into layers. (a) The background layer. The intensity map corresponds to the checkerboard pattern; the alpha map is unity everywhere (since the background is assumed to be opaque); the velocity map is a constant. (b) The hand layer. The alpha map is unity where the hand is present and zero where the hand is absent; the velocity map is smoothly varying. (c) The resynthesized image sequence based on the layers.

In the representation that we use, following Adelson [1], each layer contains three different maps: (1) the intensity map (often called a "texture map" in computer graphics); (2) the alpha map, which defines the opacity or transparency of the layer at each point; and (3) the velocity map, which describes how the map should be warped over time. In addition, the layers are assumed to be ordered in depth.

Figure 2 shows the layered decomposition of the hand sequence. The hand and background layers are shown in figures 2a and b, respectively. The resynthesized sequence is shown in figure 2c.

Let us note that traditional motion analysis methods fall short of what is needed. Optic flow techniques typically model the world as a 2-D rubber sheet that is distorted over time. But when one object moves in front of another, as moving objects generally do, the rubber sheet model fails. Image information appears and disappears at the occlusion boundaries, and the optic flow algorithm has no way to represent this fact. Therefore such algorithms tend to give extremely bad motion estimates near boundaries.

In many image coding systems the motion model is even more primitive. A fixed array of blocks is assumed to translate rigidly from frame to frame. The model cannot account for motions other than translations, nor can it deal with occlusion boundaries.

Figure 3: A flow chart for compositing a series of layers. The box labeled "cmp" generates the complement of alpha, (1 - α).

The layered representation that we propose is also an imperfect model of the world, but it is able to cope with a wider variety of phenomena than the traditional representations. The approach may be categorized with object-based methods [10, 14].

2 The layered representation

As discussed above, a layer contains a set of maps specifying its intensity, opacity, and motion. Other maps can be defined as well, but these are crucial.

In order to deal with transparency, motion blur, optical blur, shadows, etc., it is useful to allow the alpha channel to take on any value between 0 and 1. A flow chart for the compositing process is shown in figure 3. Each layer occludes the one beneath it, according to the equation:

I_1(x,y) = E_0(x,y)\,(1 - \alpha_1(x,y)) + E_1(x,y)\,\alpha_1(x,y) \qquad (1)

where α_1 is the alpha channel of layer E_1 and E_0 is the background layer. Any number of stages can be cascaded, allowing for any number of layers.
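As an illustration, here is a minimal NumPy sketch of the cascaded compositing rule of equation 1; it is not the authors' C implementation, and the layer-list convention is an assumption.

```python
import numpy as np

def composite(layers):
    """Back-to-front compositing of equation 1.

    layers: list of (E, alpha) pairs ordered from the deepest layer
    E0 (whose alpha is assumed to be 1 everywhere) to the front-most
    layer; E is an intensity map, alpha an opacity map in [0, 1].
    """
    image = layers[0][0].astype(float)
    for E, alpha in layers[1:]:
        # one "cmp" stage: I_new = I_prev * (1 - alpha) + E * alpha
        image = image * (1.0 - alpha) + E * alpha
    return image
```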

For dealing with image sequences we allow a velocity map to operate on the layers over time. It is as if the cel animator were allowed to apply simple distortions to his cels. The intensity map and the alpha map are warped together so that they stay registered. Since the layered model may not adequately capture all of the image change in each layer, we also allow for a "delta map," which serves as an error signal to update the intensity map over time. The resulting system is shown in figure 4. Only one full stage is shown, but an unlimited series of such stages may be cascaded.

Once we are given a description in terms of layers, it is straightforward to generate the image sequence. The difficult part is determining the layered representation given the input sequence. In other words, synthesis is easy but analysis is hard. The representation is non-unique: there will be many different descriptions that lead to the same synthesized image sequence.


Figure 4: A flow chart for compositing that incorporates velocity maps, V, and delta maps, D.

Indeed, it is always possible to represent an image sequence with a single layer, where the delta map does all the work of explaining the change over time. As our expertise in mid-level vision improves, we can achieve better representations that capture more of the underlying structure of the scene. Thus, a layered representation can serve to specify the decoding process, while the encoding process remains open to further improvement by individual users.

We will describe some analysis techniques that we have applied to a set of standard video sequences. In our present implementations we have simplified the representation in several ways. The alpha channels are binary, i.e. objects are completely transparent or completely opaque. The velocity maps are restricted to affine transformations. There are no delta maps used. In spite of these simplifications the representation is able to capture much of the desired image information. Earlier discussions of this work are contained in [20, 21].

3 Analysis into layers

Let us begin by considering the first layer: the background. Sometimes the background is stationary, but often it is undergoing a smooth motion due, for example, to a camera pan. Background information appears and disappears at the edges of the screen in the course of the pan. It is, of course, possible to send updating information on each frame, adding and subtracting the image data at the edges. However, with layers it is preferable to represent the background as it really is: an extended image that is larger than the viewing frame. Having built up this extended background, one can simply move the viewing frame over it, thereby requiring a minimal amount of data.

Figure 5a shows a sequence of positions that the viewing frame might take during a pan over the background. Figure 5b shows the image information that can be accumulated from that sequence. At any given moment the camera is selecting one part of the extended background [9, 18].


Figure 5: (a) The frames in the sequence are taken from an original scene that is larger than any individual frame. (b) The information from all the frames may be accumulated into a single large layer. Each frame is then a glimpse into this layer as viewed through a window.

To build up the extended background, we may consider the background to be a continuous function of unlimited spatial extent [11, 13]. Each image in a sequence captures a set of discrete samples from that function within the bounds specified by the viewing frame. The viewing parameters may change over time; for example, the background image may undergo affine or higher-order distortions. If we estimate these parameters we can continue to map the image data into our extended background image.

It is also possible that foreground objects will obscure the background from time to time. If there are background regions that are always obscured we cannot know their true content. On the other hand, since these regions are never seen, we do not need to know them in order to resynthesize the sequence. Any regions that are revealed can be accumulated into the background if we correctly identify them.

To illustrate how an extended view of a layer can be obtained, consider the MPEG flower garden sequence, three frames of which are shown in figure 6. Because the camera is translating laterally, the flower bed image undergoes a shearing motion, with the nearer regions translating faster than the farther regions. The distortion can be approximated by an affine transformation. Figure 7 shows the sequence after the motion of the flower bed has been cancelled with the inverse affine transformation. When the inverse warp is applied, the flower bed is stabilized. The remainder of the image appears to move; for example, the tree seems to bend.

We can generate the entire warped sequence and accumulate the pixels with stable values into a single image. This image will contain pixels that belong to the flower bed, including those that were sometimes hidden by the tree and those that were sometimes hidden by the right or left boundary of the viewing frame. Such an accumulated image is shown in figure 8.
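As a sketch of the stabilization step, the following Python fragment warps one frame with the inverse of an estimated affine motion; the coordinate conventions and the function name are illustrative assumptions, not the paper's code.

```python
import numpy as np
from scipy.ndimage import affine_transform

def stabilize(frame, A, b):
    """Cancel an estimated affine motion so the tracked region
    (e.g. the flower bed) appears stationary.

    The model maps a reference-frame point p = (row, col) to
    p + A @ p + b in `frame`; sampling `frame` at those points
    applies the inverse warp.
    """
    M = np.eye(2) + A            # out[p] = frame[M @ p + offset]
    return affine_transform(frame, M, offset=b, order=3)  # cubic interp
```

Accumulating the stabilized frames into a single extended image (figure 8) is then done with the temporal median operation sketched in Section 6.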


Figure 6: Frames 0, 15, and 30 of the MPEG flower garden sequence are shown in (a-c), respectively.


Figure 7: (a) Frame 1 warped with an affine transformation to align the flowerbed region with that of frame 15. (b) Original frame 15, used as reference. (c) Frame 30 warped with an affine transformation to align the flowerbed region with that of frame 15.

Figure 8: Accumulation of the flowerbed. Image intensities are obtained from a temporal median operation on the motion compensated images. Only regions belonging to the flowerbed layer are accumulated in this image. Note also that occluded regions are correctly recovered by accumulating data over many frames.


In the ideal case, it will be possible to account for all the visible flower bed pixels in every image in the sequence by warping and sampling this single accumulated image. And in practice we find the process can be fairly successful in cases where the region is undergoing a simple motion.

The same process can be used for any region whose motion has been identified. Thus we may build up separate images for several regions. These become the layers that we will composite in resynthesizing the scene. Figure 12 shows the other layers that we extract: one for the house region and one for the tree. We will now describe how we use motion analysis to perform the layered decomposition.

4 Motion analysis

The flower garden sequence contains a number of regions that are each moving in a coherent fashion. We can use the motion to segment the image sequence. In traditional image processing systems, segmentation means that each pixel is assigned to a region and the image is cut apart like a jigsaw puzzle without overlap. We do perform such a segmentation initially, but our ultimate goal is a layered representation whereby the segmented regions are tracked and processed over many frames to form layers of surface intensity that have overlapping regions ordered in depth. Thus, our analysis of an image sequence into layers consists of two stages: 1) robust motion segmentation, and 2) synthesis of the layered representation.

We consider that regions undergoing common affine motion are likely to arise from the same surface in the world, and so we seek out such regions and group them together. Affine motion is defined by the equations:

V_x(x,y) = a_{x0} + a_{xx}\,x + a_{xy}\,y \qquad (2)

V_y(x,y) = a_{y0} + a_{yx}\,x + a_{yy}\,y \qquad (3)

where V_x and V_y are the x and y components of velocity, and the a's are the parameters of the transformation. If these components are plotted as a function of position, then affine motions become planes. When we use affine models for analyzing motion, we are proposing that the optic flow in an image can be described as a set of planar patches in velocity space.
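In code, the model of equations 2 and 3 is just a pair of planes in velocity space; a small Python helper (hypothetical names, for illustration only) makes this concrete:

```python
import numpy as np

def affine_flow(params, x, y):
    """Evaluate the affine motion model of equations 2-3.

    params: (ax0, axx, axy, ay0, ayx, ayy); x, y: coordinate arrays
    (e.g. from np.meshgrid). Returns the velocity (Vx, Vy) at each
    point; plotted against position, each component is a plane.
    """
    ax0, axx, axy, ay0, ayx, ayy = params
    Vx = ax0 + axx * x + axy * y
    Vy = ay0 + ayx * x + ayy * y
    return Vx, Vy
```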

The concept is easier to visualize in one dimension, as shown in figure 9a. Suppose that we have a background undergoing one affine motion and a foreground object undergoing another affine motion. In the illustration the foreground object contains two parts, both of which are undergoing a common motion. The velocity field will consist of five different patches: three corresponding to the background, and two corresponding to the foreground.

In a standard optic flow algorithm the velocity field will be smoothed. Such algorithms generally impose some smoothing on the velocity estimates in order to deal with noise, the aperture problem, etc. The result is a smooth function as shown in figure 9b. Regions along the boundaries are assigned velocities that are intermediate between the foreground and background velocities. This is plainly incorrect. It arises from the fact that the optic flow algorithm is implicitly using something like a rubber-sheet model of the world. The velocity field is single-valued and smooth, and so the mixing of the two motions is inevitable.

The analysis can be improved by allowing for sharp breaks in the velocity field, as is often done with regularization [7, 15].


Figure 9: (a) Velocity estimates appear as samples in space (each panel plots velocity against position). (b) Standard optic flow algorithms impose some smoothing on the velocity estimates in order to deal with noise. (c) Regularization algorithms model the velocity field as a set of smooth regions while allowing for sharp breaks at model boundaries. (d) The representation that we wish to attain. The velocity samples are explained by two affine models (straight lines) and discontinuities are explained by the occlusion of one object by the other.

The velocity field is now modeled as a set of smooth regions interrupted by discontinuities. The result looks like that shown in figure 9c. The representation still has problems. It does not allow for multiple motion estimates to coexist at a point, which is needed for dealing with transparency. And it does not tell us anything about the way the regions should be grouped; it does not know, for example, that all of the background segments belong together.

Figure 9d shows the representation that we wish to attain. Since we are modeling the motions with affine transformations, we are seeking to explain the data with a set of straight lines. In this case we need two straight lines, one for the background and one for the foreground. There are no motion discontinuities as such: the lines of motion continue smoothly across space. The discontinuities are to be explained by the occlusions of one object by another; that is, there are discontinuities in visibility. The motions themselves are considered to extend continuously whether they are seen or not. In this representation the background pixels all share a common motion, so they are assigned to a single segment, in spite of being spatially non-contiguous. This represents the first phase of our motion analysis.

5 Robust motion segmentation

Our robust segmentation consists of two stages: 1) local motion estimation, and 2) segmentation by affine model fitting. The critical processing steps are motion estimation and segmentation.

A number of authors have described methods for achieving multiple affine motion decompositions [3, 5, 9]. Our method is based on robust estimation and k-means clustering in affine parameter space.


In many multiple motion estimation techniques, a recursive algorithm is used to detect multiple motion regions in the scene [3, 9]. At each iteration, these algorithms assume that a dominant motion region can be detected. Once the dominant region is identified and the motion within the region is estimated, it is eliminated and the next dominant motion is estimated from the remaining portion of the image. Such methods can have difficulties with scenes containing several strong motions, since the estimated parameters reflect a mixture of sources.

To avoid the problems resulting from estimating a single global motion, we use a gradual migration from a local representation to a global representation. We begin with a conventional optic flow representation of motion. Because optic flow estimation is carried out over a local neighborhood, we can minimize the problems of mixing multiple motions within a given analysis region. From optic flow, we determine a set of affine motions that are likely to be observed. Segmentation is obtained by classifying regions to the motion model that provides the best description of the motion within the region.

The segmentation framework is outlined in figure 10. At each iteration, our multiple model framework identifies multiple coherent motion regions simultaneously. Iteratively, the motion model parameters are calculated within these coherent regions and the segmentation is refined.

Several authors have presented robust techniques [4, 5, 6] for multiple motion estimation. Black and Anandan described a multiple motion estimation technique that applies image constraints via influence functions that regulate the effects of outliers in a simulated annealing framework. Darrell and Pentland described a multi-model regularization network whereby image constraints are applied separately to each model. Depommier and Dubois employed line models to detect motion discontinuities. These techniques, however, require intensive computation.

We use a simpler technique for imposing image constraints and rejecting outliers. In our algorithm, motion smoothness is imposed only within the layer, and by having multiple layers we can describe the discontinuities in motion. In addition, we impose constraints on coherent region size and local connectivity by applying simple thresholds to reject outliers at different stages in the algorithm, thus providing stability and robustness.

5.1 Optic flow estimation

Our local motion estimate is obtained with a multi-scale coarse-to-fine algorithm based on a gradient approach described in [2, 12, 16]. For two consecutive frames, the motion at each point in the image can be described by equation 4, and the linear least-squares solution for motion by equation 5:

I_t(x - V_x(x,y),\; y - V_y(x,y)) = I_{t+1}(x,y) \qquad (4)

\begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix} \begin{bmatrix} V_x \\ V_y \end{bmatrix} = \begin{bmatrix} -\sum I_x I_t \\ -\sum I_y I_t \end{bmatrix} \qquad (5)

where I_x, I_y, and I_t are the partial derivatives of the image intensity at position (x, y) with respect to x, y, and t, respectively. The summation is taken over a small neighborhood around the point (x, y). The multi-scale implementation allows for estimation of large motions. When analyzing scenes exhibiting transparent phenomena, the motion estimation technique described by Shizawa and Mase [17] may be suitable. However, in most natural scenes, the simple optic flow model provides a good starting point for our segmentation algorithm.
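The following single-scale Python sketch solves the 2x2 system of equation 5 at every pixel; the paper's implementation wraps a coarse-to-fine pyramid around this step, and the window size here is an arbitrary assumption.

```python
import numpy as np
from scipy.ndimage import convolve

def optic_flow(I0, I1, win=7):
    """Gradient-based flow: closed-form solution of equation 5
    over a (win x win) neighborhood at each pixel."""
    Iy, Ix = np.gradient(I0.astype(float))    # spatial derivatives
    It = I1.astype(float) - I0                # temporal derivative
    k = np.ones((win, win))
    # windowed sums of the normal-equation entries
    Sxx = convolve(Ix * Ix, k); Sxy = convolve(Ix * Iy, k)
    Syy = convolve(Iy * Iy, k)
    Sxt = convolve(Ix * It, k); Syt = convolve(Iy * It, k)
    det = Sxx * Syy - Sxy ** 2
    det[np.abs(det) < 1e-9] = np.inf          # textureless: zero flow
    Vx = (-Syy * Sxt + Sxy * Syt) / det       # invert the 2x2 matrix
    Vy = ( Sxy * Sxt - Sxx * Syt) / det
    return Vx, Vy
```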

5.2 Motion segmentation

Given the optic flow field, the task of segmentation is to identify the coherent motion regions. When a set of motion models is known, classification based on motion can directly follow to identify the corresponding regions. However, the corresponding motion models for the optic flow field are unknown initially. One simple solution is to generate a set of models that can describe all the motions that might generally be encountered in sequences. This results in a large hypothesis set, which is undesirable because classification with a large number of models will be computationally expensive and unstable.

In our framework, we sample the motion data to derive a set of motion hypotheses that are likely to be observed in the image. Referring to figure 10, the region generator initially divides the image into several regions. Within each region, the model estimator calculates the model parameters, producing one motion hypothesis for each region. An ideal configuration for these regions is one that corresponds to the actual coherent regions. However, this configuration is unknown and is ultimately what we want to obtain in segmentation.

We use an array of non-overlapping square regions to derive an initial set of motion models. The size of the initial square regions is kept to a minimum to localize the estimation and to avoid estimating motion across object boundaries. Larger regions would provide better parameter estimation under noisy conditions.

The affine parameters within these regions are estimated by standard linear regression techniques. This estimation can be seen as a plane-fitting algorithm in velocity space, since the affine model is a linear model of local motion. The regression is applied separately to each velocity component, because the x affine parameters depend only on the x component of velocity and the y parameters depend only on the y component of velocity. If we let a_i^T = [a_{x0i} a_{xxi} a_{xyi} a_{y0i} a_{yxi} a_{yyi}] be the i-th hypothesis vector in the 6-dimensional affine parameter space, with a_{xi}^T = [a_{x0i} a_{xxi} a_{xyi}] and a_{yi}^T = [a_{y0i} a_{yxi} a_{yyi}] corresponding to the x and y components, and φ^T = [1 x y] be the regressor, then the motion field equations 2 and 3 can be simply written as:

V_x(x,y) = \phi^T a_{xi} \qquad (6)

V_y(x,y) = \phi^T a_{yi} \qquad (7)

and a linear least-squares estimate of a_i for a given motion field is as follows:

[a_{xi}\; a_{yi}] = \Big[\sum_{P_i} \phi\,\phi^T\Big]^{-1} \sum_{P_i} \big(\phi\,[V_x(x,y)\; V_y(x,y)]\big) \qquad (8)

where each region is indexed with the variable i, and the summation is applied within each region.
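A direct NumPy transcription of equation 8 (the helper name and masking convention are illustrative assumptions):

```python
import numpy as np

def fit_affine(Vx, Vy, mask):
    """Least-squares affine fit (equation 8) to the flow inside one
    region P_i. Vx, Vy: HxW flow fields; mask: boolean support map.
    Returns (a_x, a_y), [a0, ax, ay]-style parameter triples."""
    ys, xs = np.nonzero(mask)
    Phi = np.stack([np.ones(len(xs)), xs, ys], axis=1)  # regressor [1 x y]
    a_x, *_ = np.linalg.lstsq(Phi, Vx[ys, xs], rcond=None)
    a_y, *_ = np.linalg.lstsq(Phi, Vy[ys, xs], rcond=None)
    return a_x, a_y
```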


[Figure 10 block diagram; components: image sequence, optic flow estimator, region generator, model estimator, model merger, region classifier, region splitter, region filter, final segmentation.]
Figure 10: This figure shows the motion segmentation algorithm. Motion models are estimated within regions specified by the region generator. Similar models are merged by the model merger, and segmentation is obtained by motion classification. The region splitter and region filter enforce local connectivity and provide robustness to the system.

Regardless of the initial region size, many of the affine hypotheses will be incorrect because the initial square regions may contain object boundaries. The reliability of a hypothesis is indicated by its residual error, σ_i², which can be calculated as follows:

\sigma_i^2 = \frac{1}{N_i} \sum_{x,y} \big(V(x,y) - V_{a_i}(x,y)\big)^2 \qquad (9)

where N_i is the number of pixels in the analysis region. Hypotheses with a residual greater than a prescribed threshold are eliminated because they do not provide a good description of motion within the analysis region.
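Continuing the sketch above, the residual of equation 9 and the threshold test might look like this (threshold handling assumed):

```python
import numpy as np

def residual(Vx, Vy, a_x, a_y, mask):
    """Residual error of one affine hypothesis (equation 9); a large
    value flags regions that straddle object boundaries."""
    ys, xs = np.nonzero(mask)
    Phi = np.stack([np.ones(len(xs)), xs, ys], axis=1)
    err = (Vx[ys, xs] - Phi @ a_x) ** 2 + (Vy[ys, xs] - Phi @ a_y) ** 2
    return err.mean()   # hypotheses with residual > threshold are dropped
```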

Motion models from regions that cover the same object will have similar parameters. These are grouped in the affine motion parameter space with a k-means clustering algorithm [19], which is modified to allow k to be determined adaptively. In the clustering process, we derive a representative model for each group of similar models. This model clustering produces a set of likely affine motion models that are exhibited by objects in the scene. In addition, it provides more stability in the segmentation by representing a single coherent motion region with a single model instead of with various similar models.

A scaled distance, D_m(a_1, a_2), is used in the parameter clustering process in order to scale the distance of the different components in the parameter space. Scale factors are chosen so that a unit distance along any component in the parameter space corresponds to roughly a unit displacement at the boundaries of the image:

D_m(a_1, a_2) = \big[(a_1 - a_2)^T M (a_1 - a_2)\big]^{1/2} \qquad (10)

M = \mathrm{diag}(1,\; r^2,\; r^2,\; 1,\; r^2,\; r^2) \qquad (11)

where r is roughly the dimension of the image.

In our adaptive k-means algorithm, a set of cluster centers that are separated by a prescribed distance is initially selected. Each of the affine hypotheses is then assigned to the nearest center. Following the assignment, the centers are updated with the mean position of the cluster. Iteratively, the centers are modified until cluster membership, or equivalently, the cluster centers are unchanged. During these iterations, some centers may approach other centers. When the distance between any two centers is less than a prescribed distance, the two clusters are merged into a single cluster, reducing the two centers to a single center. In this way, the adaptive k-means clustering strives to describe the data with a small number of centers while minimizing distortion. By selecting the clusters with the largest membership, we can expect to represent a large portion of the image with only a few motion models.
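A compact Python sketch of this adaptive k-means step follows; the initialization and tie-breaking details are assumptions (the paper does not specify them), and the scaled distance of equations 10-11 is folded in by pre-scaling the parameter vectors.

```python
import numpy as np

def adaptive_kmeans(hyps, r, d_min, n_iter=50):
    """Cluster affine hypotheses (Nx6 array) in parameter space,
    merging any centers that come within d_min of each other.
    r: roughly the image dimension (equation 11)."""
    w = np.sqrt(np.array([1, r*r, r*r, 1, r*r, r*r], float))
    Z = hyps * w                                   # Euclidean dist = D_m
    centers = Z[:: max(1, len(Z) // 10)].copy()    # coarse initial centers
    for _ in range(n_iter):
        lab = np.linalg.norm(Z[:, None] - centers[None], axis=2).argmin(1)
        centers = np.stack([Z[lab == k].mean(0) if np.any(lab == k)
                            else centers[k] for k in range(len(centers))])
        keep = []                                  # merge nearby centers
        for k in range(len(centers)):
            if all(np.linalg.norm(centers[k] - centers[j]) >= d_min
                   for j in keep):
                keep.append(k)
        centers = centers[keep]
    lab = np.linalg.norm(Z[:, None] - centers[None], axis=2).argmin(1)
    return centers / w, lab                        # unscaled centers, labels
```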

5.3 Region assignment by hypothesis testing

We use the derived motion models in a hypothesis testing framework to identify the coherent regions. In the hypothesis testing, we assign the regions within the image in a way that minimizes the motion distortion. We use the distortion function, G(i(x,y)), described by:

G(i(x,y)) = \sum_{x,y} \big(V(x,y) - V_{a_i}(x,y)\big)^2 \qquad (12)

where i(x,y) indicates the model that location (x,y) is assigned to, V(x,y) is the estimated local motion field, and V_{a_i}(x,y) is the affine motion field corresponding to the i-th affine motion hypothesis. From equation 12, we see that G(i(x,y)) reaches a minimum when the affine models exactly describe the motion within the regions. However, this is often not the case when dealing with real sequences, so instead we settle for an assignment that minimizes the total distortion.

Since each pixel location is assigned to only one hypothesis, we obtain the minimum total distortion by minimizing the distortion at each location. This is achieved when each pixel location is assigned to the model that best describes the motion at that location. Coherent motion regions are identified by processing each location in this manner. We summarize the assignment in the following equation:

i_0(x,y) = \arg\min_i \big[V(x,y) - V_{a_i}(x,y)\big]^2 \qquad (13)

where i_0(x,y) is the minimum distortion assignment.

However, not all the pixels receive an assignment, because some of the motion vectors produced by optic flow estimation may not correctly describe the image motion. In regions where the assumptions used by the optic flow estimation are violated, usually at object boundaries, the motion estimates are typically difficult to describe using affine motion models. Any region where the error between the expected and observed motion is greater than a prescribed threshold remains unassigned, thus preventing inaccurate data from corrupting the analysis. We find by experiment that a value of 1 pixel of motion provides a reasonable threshold.


We now define the binary support maps that describe the regions for each of the affine hypotheses as:

P_i(x,y) = \begin{cases} 1 & \text{if } i_0(x,y) = i \\ 0 & \text{otherwise} \end{cases} \qquad (14)

Thus, for each model, its corresponding region support map takes on the value of 1 in regions where it best describes the motion, and a value of 0 elsewhere. These maps allow us to identify the model support regions and to refine our affine motion estimates in the subsequent iterations.
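Equations 12-14 amount to a per-pixel arg-min plus a rejection threshold; a sketch (function and variable names assumed):

```python
import numpy as np

def classify(Vx, Vy, models, thresh=1.0):
    """Assign each pixel to the minimum-distortion affine model
    (equation 13) and reject pixels whose best model still errs by
    more than `thresh` (about 1 pixel, as in the paper); the support
    maps of equation 14 are then simply (i0 == i)."""
    H, W = Vx.shape
    ys, xs = np.mgrid[0:H, 0:W]
    err = []
    for a_x, a_y in models:                      # affine flow, eqs 2-3
        Vxi = a_x[0] + a_x[1] * xs + a_x[2] * ys
        Vyi = a_y[0] + a_y[1] * xs + a_y[2] * ys
        err.append((Vx - Vxi) ** 2 + (Vy - Vyi) ** 2)
    err = np.stack(err)                          # K x H x W distortion
    i0 = err.argmin(axis=0)
    i0[err.min(axis=0) > thresh ** 2] = -1       # unassigned pixels
    return i0
```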

5.4 Iterative algorithm

In the initial segmentation step, we use an array of square regions. These will not produce the best set of affine models, as discussed above. However, after the first iteration, the algorithm produces a set of affine models and the corresponding regions undergoing affine motion. By estimating the affine parameters within the estimated regions, we obtain parameters that more accurately describe the motion of the region. At each iteration, the segmentation becomes more accurate because the parameter estimation of an affine motion model is performed within a single coherent motion region.

Additional stability and robustness in the segmentation is obtained by applying additional constraints on the coherent regions. These constraints are applied to the segmentation before the models are re-estimated. Because the affine motion classification is performed at every point in the image with models that span globally, a model may coincidentally support points in regions undergoing a different coherent motion, or it may support points that have inaccurate motion estimates. Typically, this results in a model supporting sporadic points that are disjoint from each other.

The region splitter and filter shown in figure 10 identify these points and eliminate them. When two disjoint regions are supported by a single model, the region splitter separates them into two independent regions so that a model can be produced for each region. Following the splitter is the region filter, which eliminates small regions because model estimation in small regions will be unstable. Subsequently, the model estimator produces a motion model for each reliably connected region. Thus the region splitter in conjunction with the region filter enforces the local spatial connectivity of motion regions.

Sometimes the splitting process produces two regions even though they can be described by a single model. This, however, is not a problem, because the parameters estimated for each of these regions will be similar. The models corresponding to these regions will be clustered into a single model in the model merging step.

Convergence is obtained when only a few points are reassigned or when the number of iterations reaches the maximum allowed. This is typically less than 20 iterations. Regions that are unassigned at the end of the motion segmentation algorithm are re-assigned in a refinement step of the segmentation. In this step, we assign these regions by warping the images according to the affine motion models and selecting the model that minimizes the intensity error between the pair of images.

The analysis maintains temporal coherence and stability of segmentation by using the current motion segmentation results to initialize the segmentation for the next pair of frames. The affine model parameters and the segmentation of a successive frame will be similar, because an object's shape and motion change slowly from frame to frame. In addition to providing temporal stability, analysis that is initialized with models from previous segmentation results requires fewer iterations for convergence. When the motion segmentation on the entire sequence is completed, each affine motion region will be identified and tracked through the sequence with corresponding support maps and affine motion parameters.

Typically, processing on the subsequent frames requires only two iterations for stability, and the parameter clustering step becomes trivial. Thus, most of the computational complexity is in the initial segmentation, which is required only once per sequence.

6 Layer synthesis

The robust segmentation technique described above provides us with a set of non-overlapping regions, which fit together like pieces of a jigsaw puzzle. Regions that are spatially separate may be assigned a common label since they belong to the same motion model. However, robust segmentation does not generate a layered representation in and of itself. The output of a segmenter does not provide any information about depth and occlusion: the segments all lie in a single plane.

In order to generate a true layered representation we must take a second step. The information from a longer sequence must be combined over time, so that the stable information within a layer can be accumulated. Moreover, the depth ordering and occlusion relationships between layers must be established. This combined approach, robust segmentation followed by layer accumulation, is central to the present work. Previous robust segmenters have taken the first step but not the second. (Note that Darrell and Pentland use a "multi-layer" neural network in their robust segmenter, but their neural layers contain no information about depth ordering or occlusion, nor do they contain overlapping regions.)

In deriving the intensity and alpha maps for each layer, we observe that the images of the corresponding regions in the different frames differ only by an affine transformation. By applying these transformations to each of the frames in the original sequence, corresponding regions in different frames can be motion compensated with an inverse warp. We use bicubic interpolation in motion compensation.

Thus, when the motion parameters are accurately estimated for each layer, the corresponding regions will appear stationary in the motion compensated sequence. The layer intensity maps and alpha maps are derived from these motion compensated sequences.

Some of the images in the compensated sequence, however, may not contain a complete image of the object because of occlusions. Additionally, an image may have small intensity variations due to different lighting conditions. In order to recover the complete representative image and boundary of the object, we collect the data available at each point in the layer and apply a median operation on the data. This operation can be seen as a temporal median filtering operation on the motion compensated sequence in regions defined by the support maps.


Earlier studies have shown that motion-compensated median filtering can enhance noisy images and preserve edge information better than a temporal averaging filter [8].

\mathrm{Layer}_i(x,y) = \mathrm{median}\{M_{i,1}(x,y),\; M_{i,2}(x,y),\; \ldots,\; M_{i,n}(x,y)\} \qquad (15)

where Layer_i is the i-th layer, and M_{i,k}(x,y) is the motion compensated image obtained by warping frame k of the original sequence by the estimated affine transform for layer i. Note that the median operation ignores regions that are not supported by the layer. These regions are indicated by a 0 in the support maps described in equation 14.
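A sketch of this masked temporal median in NumPy (names assumed; unsupported samples are excluded via NaNs):

```python
import numpy as np

def accumulate_layer(comp_frames, supports):
    """Temporal median accumulation (equation 15). comp_frames:
    list of motion-compensated images M_{i,k}; supports: matching
    boolean support maps from equation 14."""
    stack = np.stack([np.where(s, f, np.nan)
                      for f, s in zip(comp_frames, supports)])
    layer = np.nanmedian(stack, axis=0)     # median over valid samples only
    count = np.stack(supports).sum(axis=0)  # how many frames saw each point
    return layer, count                     # never-seen points come out NaN
```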

Finally, we determine occlusion relationships. For each layer, we generate a map corresponding to the number of points available for constructing the layer intensity map. A point in the intensity map generated from more data is visible in more frames, and its derived intensity in the layer representation is more reliable. These maps and the corresponding layer intensity maps are warped to their respective positions in the original sequence according to the inverse affine transformation. By verification of intensities and of the reliability maps, the layers are assigned a depth ordering. A layer that is derived from more points occludes one that is derived from fewer points, since an occluded region necessarily has fewer corresponding data points in the motion compensation stage. Thus, the statistics from the motion segmentation and temporal median filtering provide the necessary description of the object motion, texture pattern, opacity, and occlusion relationships.
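In the spirit of that rule, a simplified global ordering can be derived from the reliability (count) maps returned above; the paper compares reliabilities where layers overlap, so this whole-layer version is a coarse assumption:

```python
import numpy as np

def depth_order(counts):
    """Order layers front to back: a layer supported by more data
    points occludes one supported by fewer (section 6)."""
    totals = [int(np.asarray(c).sum()) for c in counts]
    return sorted(range(len(counts)), key=lambda i: -totals[i])
```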

7 Experimental results

We have implemented the image analysis technique in C on a Hewlett-Packard 9000 series 700 workstation. We illustrate the analysis, the representation, and the synthesis with the first 30 frames of the MPEG flower garden sequence, of which frames 1, 15, and 30 are shown in figure 6. The dimensions of the image are 720 x 480. In this sequence, the tree, flower bed, and row of houses move towards the left but at different velocities. Regions of the flower bed closer to the camera move faster than the regions near the row of houses, which are in the distance.

Optic flow obtained with a multi-scale coarse-to-fine gradient method on a pair of frames is shown in figure 11a. Notice the incorrect motion estimates along the occlusion boundaries of the tree, as shown by the different lengths of the arrows and the arrows that point upwards.

The segmentation map for the first frame is shown in figure 11b. Each affine motion region is depicted by a different gray level. The darkest regions along the edges of the tree correspond to regions where the motion could not be easily described by affine models. Region assignment based on warping the images and minimizing intensity error reassigns these regions, as shown in figure 11c.

We used an initial segmentation map consisting of 75 non-overlapping square regions to derive a set of affine motion models. However, we found that the exact number of initial regions is not critical in the segmentation. The ten models with the lowest residual errors were selected from the k-means affine parameter clustering for model testing in the assignment stage. After about five iterations, segmentation remained stable with four affine models. We find that the number of models initially selected can vary from 5 to 15 with similar results. This initial segmentation required about two minutes of computation time. Total computation time for motion analysis on the 30 frames, including the median operation, is about 40 minutes.

We used frame 15 as the reference frame for the image alignment. With the motion segmentation, we are able to remove the tree from the flower bed and house layers and recover the occluded regions. The sky layer is not shown. Regions with no texture, such as the sky, cannot be readily assigned to a layer since they contain no motion information. We assign these regions to a single layer that describes stationary textureless objects. Note that the overlap regions in the house and flower bed layers are undergoing similar motion, and therefore these regions can be supported by either layer without introducing annoying artifacts in the synthesis.

We can recreate an approximation of the entire image sequence from the layer maps of figure 12, along with the occlusion information, the affine parameters that describe the object motion, and the stationary layer. Figure 13 shows three synthesized images corresponding to the three images in figure 6. The objects are placed in their respective positions and the occlusion of the background by the tree is correctly described by the layers.

Figures 15, 16, and 17 show results of layer analysis on the MPEG calendar sequence. In this sequence, a toy train pushes a ball along the track, while the camera pans right and zooms out. A small rotating toy in the lower left corner of the image did not produce a layer because its motion could not be estimated. Also the calendar, which is moving downward with respect to the wallpaper, is combined into a single layer with the wallpaper. These regions are represented with a single layer because the motion in these regions, as estimated from two consecutive frames, can be described with a single affine motion within the specified motion tolerances. To handle this problem, we need to consider motion analysis and segmentation over different temporal scales.

Currently, our motion analysis technique performs well when motion regions in the image can be easily described by the affine motion model. Because our analysis is based on motion, regions must be sufficiently textured and large in size for stability in the segmentation and layer extraction. Therefore, scenes with a few foreground objects undergoing affine motion are suitable for our analysis. Sequences with complicated motions that are not easily described by the layered model require special treatment.

8 Compression

The layered decomposition can be used for image compression. In the case of the flower garden sequence, we are able to synthesize a full 30 frame sequence starting with only a single still image of each of the layers. The layers can be compressed using conventional still image compression schemes. While we have not yet done extensive studies of compression, we can describe some preliminary results [22].

The data that must be sent to reconstruct the flower garden sequence includes the intensity maps, the alpha maps, and the motion parameters for each layer. To compress the intensity map we used a JPEG coder. Some blocks in the map are empty, being outside the valid region of support for the layer; these are coded with a symbol indicating an empty block.


Figure 11: (a) The optic flow from the multi-scale gradient method. (b) Segmentation obtained by clustering optic flow into affine motion regions. (c) Segmentation from consistency checking by image warping.


Figure 12: The layers corresponding to the tree, the flower bed, and the house, shown in (a-c), respectively. The affine flow field for each layer is superimposed.


Figure 13: Frames 0, 15, and 30 as reconstructed from the layered representation, shown in (a-c), respectively.


Figure 14: The sequence reconstructed without the tree layer, shown in (a-c), respectively.


Figure 15: Frames 0, 15, and 30 of the MPEG calendar sequence, shown in (a-c), respectively.


Figure 16: The layers corresponding to the ball, the train, and the background, shown in (a-c), respectively.


Figure 17: Synthesized frames 1, 15, and 30 of the MPEG calendar sequence, shown in (a-c), respectively. These images are generated from the layer images in figure 16.


Other blocks are partially empty, and these were filled in smoothly so that JPEG did not devote bits to coding the sharp (spurious) edge. To compress the alpha map we used a chain code, which is possible because we are representing alpha as a binary image. The motion parameters can be sent without compression because they amount to only 6 numbers per layer per frame.
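As an illustration of the alpha-map coding, the fragment below emits a 4-neighbor chain code for an already-traced region boundary; the exact chain-code variant and the boundary-tracing step are not specified in the paper, so both are assumptions here.

```python
# 2-bit symbols for the four moves: right, down, left, up
DIRS = {(0, 1): 0, (1, 0): 1, (0, -1): 2, (-1, 0): 3}

def chain_code(boundary):
    """Losslessly encode an ordered, 4-connected boundary (a list of
    (row, col) points around the support of a binary alpha map) as a
    start point plus one 2-bit direction symbol per boundary step."""
    moves = [DIRS[(r2 - r1, c2 - c1)]
             for (r1, c1), (r2, c2) in zip(boundary, boundary[1:])]
    return boundary[0], moves
```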

quence of the ower garden using 300 kbits for a resolutionof 360 x 240, and 700 kbits for a resolution of 720 x 480.Thus, the data rates are 300 and 700 kbits/sec respectively.The a�ne motion parameters for the sequence required40 kbits. For the full resolution sequence, the alpha mapswere encoded losslessly with chain codes at 60 kbits andthe color intensity maps encoded with JPEG at 600 kbits.The only artifacts added by the data compression are

those associated with the JPEG coder, and these are �xedartifacts that are rigidly attached to the moving layers.Thus they do not \twinkle" or \shimmer" the way tempo-ral artifacts often do with traditional coders. At the datarates noted above, the JPEG artifacts are only slightly vis-ible with an SNR of 30 dB.The layered coding process itself can add artifacts re-

gardless of data compression. For example, since the lay-ered segmentation is imperfect there is a bit of the skyattached to the edge of the tree, and this becomes visiblewhen the tree moves over the houses. Also, the a�ne ap-proximation of the motion is imperfect so that the synthe-sized motions of the layers do not perfectly match the ac-tual motions in the original scene. And the temporal accu-mulation process can introduce blurring into regions wherethe a�ne warp is incorrect. We believe that all of theseartifacts can be reduced by improvements in the analysisprocess. But in any case the reconstructed sequence willnot generally be a perfect replica of the original sequenceeven if the layers are transmitted without data compres-sion. Error images could be sent in order to correct forthe imperfections of the reconstructed images at a cost ofadditional data.It is interesting to note that some of the artifacts asso-

ciated with the layered coding can be severe in the MSEsense and yet be invisible to the human viewer. For ex-ample, the naive observer does not know that the owerbed's motion is not truly a�ne. The reconstructed motionis slightly wrong { and in the squared error sense it is quitewrong { but it looks perfectly acceptable.

9 Other applications

Because the layered representation breaks the sequence into parts that are meaningfully related to the scene, it becomes possible to achieve some interesting special effects. Figure 14 shows the flower garden sequence synthesized without the tree layer. This shows what the scene would have looked like had the tree not been present; it is a synthetic sequence that has never actually existed. Occluded regions are correctly recovered because our representation maintains a description of motion in these regions. Note that an ordinary background memory could not achieve this effect, because the various regions of the scene are undergoing different motions.

The layered representation also provides frame-rate independence. Once a sequence has been represented as layers, it is straightforward to synthesize the images corresponding to any instant in time. Slow-motion and frame-rate conversion can be conveniently done by using the layered format.
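One simple realization of this, assuming linear interpolation of each layer's per-frame affine parameters (the paper states only that arbitrary instants can be synthesized, not how to interpolate):

```python
import numpy as np

def params_at(t, keyframes):
    """Affine parameters for a layer at an arbitrary time t, by
    linear interpolation between the two nearest keyframes.
    keyframes: dict mapping frame time -> 6-vector of parameters.
    Assumes min(times) <= t <= max(times)."""
    times = sorted(keyframes)
    t0 = max(u for u in times if u <= t)
    t1 = min(u for u in times if u >= t)
    if t0 == t1:
        return np.asarray(keyframes[t0], float)
    w = (t - t0) / (t1 - t0)
    return ((1 - w) * np.asarray(keyframes[t0], float)
            + w * np.asarray(keyframes[t1], float))
```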

10 Conclusions

We employ a layered image representation that provides a useful description of scenes containing moving objects. The image sequence is decomposed into a set of layers, each layer describing a region's motion, intensity, shape, and opacity. Occlusion boundaries are represented as discontinuities in a layer's alpha map (opacity), and there is no need to represent explicit discontinuities in velocity.

To achieve the layered description, we use a robust motion segmentation algorithm that produces stable image segmentation and accurate affine motion estimation over time. We deal with the many problems in motion segmentation by appropriately applying the image constraints at each step of our algorithm. We initially estimate the local motion within the image, then iteratively refine the estimates of each object's shape and motion. A set of likely affine motion models exhibited by objects in the scene is calculated from the local motion data and used in a hypothesis testing framework. Layers are accumulated over a sequence of frames.

The layered representation can be used for image coding. One can represent a 30 frame sequence of the MPEG flower garden sequence using four layers, along with affine motion parameters for each layer; each layer is represented by a single still image. The layers can be coded using traditional still frame coding techniques.

The synthesized sequence provides an approximate reconstruction of the original sequence. One can achieve substantial data compression in sequences that lend themselves to layered coding. The method is more successful for sequences that are easily represented by the layered model, i.e., sequences containing a few regions undergoing simple motions. We are currently investigating extensions to more complex sequences.

The layered decomposition also provides useful tools in image analysis, frame-rate conversion, and special effects.


References

[1] E. H. Adelson. Layered representation for image coding. Technical Report 181, The MIT Media Lab, 1991.

[2] J. Bergen, P. Anandan, K. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In Proc. Second European Conf. on Comput. Vision, pages 237-252, Santa Margherita Ligure, Italy, May 1992.

[3] J. Bergen, P. Burt, R. Hingorani, and S. Peleg. Computing two motions from three frames. In Proc. Third Int'l Conf. Comput. Vision, pages 27-32, Osaka, Japan, December 1990.

[4] M. J. Black and P. Anandan. Robust dynamic motion estimation over time. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 296-302, Maui, Hawaii, June 1991.

[5] T. Darrell and A. Pentland. Robust estimation of a multi-layered motion representation. In Proc. IEEE Workshop on Visual Motion, pages 173-178, Princeton, New Jersey, October 1991.

[6] R. Depommier and E. Dubois. Motion estimation with detection of occlusion areas. In Proc. IEEE ICASSP, volume 3, pages 269-273, San Francisco, California, April 1992.

[7] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6(6):721-741, November 1984.

[8] T. S. Huang and Y. P. Hsu. Image sequence enhancement. In T. S. Huang, editor, Image Sequence Analysis, pages 289-309. Springer-Verlag, Berlin, 1981.

[9] M. Irani and S. Peleg. Image sequence enhancement using multiple motions analysis. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 216-221, Champaign, Illinois, June 1992.

[10] R. Lenz and A. Gerhard. Image sequence coding using scene analysis and spatio-temporal interpolation. In T. S. Huang, editor, Image Sequence Processing and Dynamic Scene Analysis, volume F2 of NATO ASI Series, pages 264-274. Springer-Verlag, Berlin, 1983.

[11] A. Lippman and R. Kermode. Generalized predictive coding of movies. In Picture Encoding Symposium, pages 17-19, Lausanne, Switzerland, 1993.

[12] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Image Understanding Workshop, pages 121-130, 1981.

[13] P. C. McLean. Structured video coding. Master's thesis, The Media Lab, Massachusetts Institute of Technology, Cambridge, MA, June 1991.

[14] H. G. Musmann, M. Hotter, and J. Ostermann. Object-oriented analysis-synthesis coding of moving images. Signal Processing: Image Communication, 1:117-138, 1989.

[15] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317:314-319, 1985.

[16] L. H. Quam. Hierarchical warp stereo. In Proc. DARPA Image Understanding Workshop, pages 149-155, New Orleans, Louisiana, 1984.

[17] M. Shizawa and K. Mase. A unified computational theory for motion transparency and motion boundaries based on eigenenergy analysis. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 289-295, Maui, Hawaii, June 1991.

[18] L. Teodosio and W. Bender. Salient video stills. In Proc. of the ACM Multimedia Conference, Anaheim, CA, August 1993.

[19] C. W. Therrien. Decision Estimation and Classification. John Wiley and Sons, New York, 1989.

[20] J. Y. A. Wang and E. H. Adelson. Layered representation for image sequence coding. In Proc. IEEE ICASSP, volume 5, pages 221-224, Minneapolis, Minnesota, April 1993.

[21] J. Y. A. Wang and E. H. Adelson. Layered representation for motion analysis. In Proc. IEEE Conf. Comput. Vision Pattern Recog., pages 361-366, New York, June 1993.

[22] J. Y. A. Wang, E. H. Adelson, and U. Y. Desai. Applying mid-level vision techniques for video data compression and manipulation. In Proc. SPIE on Digital Video Compression on Personal Computers: Algorithms and Technologies, volume 2187, pages 116-127, San Jose, California, February 1994.
