
Layered Representation for Motion Analysis

John Y. A. Wang
The MIT Media Laboratory
Dept. of Elec. Eng. and Comp. Sci.
Massachusetts Institute of Technology
Cambridge, MA 02139

Edward H. Adelson
The MIT Media Laboratory
Dept. of Brain and Cognitive Sciences
Massachusetts Institute of Technology
Cambridge, MA 02139

Abstract

Standard approaches to motion analysis assume that the optic flow is smooth; such techniques have trouble dealing with occlusion boundaries. The most popular solution is to allow discontinuities in the flow field, imposing the smoothness constraint in a piecewise fashion. But there is a sense in which the discontinuities in flow are artifactual, resulting from the attempt to capture the motion of multiple overlapping objects in a single flow field. Instead we can decompose the image sequence into a set of overlapping layers, where each layer's motion is described by a smooth flow field. The discontinuities in the description are then attributed to object opacities rather than to the flow itself, mirroring the structure of the scene. We have devised a set of techniques for segmenting images into coherently moving regions using affine motion analysis and clustering techniques. We are able to decompose an image into a set of layers along with information about occlusion and depth ordering. We have applied the techniques to the flower garden sequence. We can analyze the scene into four layers, and then represent the entire 30-frame sequence with a single image of each layer, along with associated motion parameters.

1 Introduction

Occlusions represent one of the difficult problems in motion analysis. Smoothing is necessary in order to derive reliable flow fields, but when smoothing occurs across boundaries the result is a flow field that is simply incorrect. Various techniques have been devised to allow for motion discontinuities, but none is entirely satisfactory. In addition, transparency due to various sources (including motion blur) can make it meaningless to assign a single motion vector to a single point. It is helpful to reconsider this problem from a different point of view.

Consider an image that is formed by one opaque object moving in front of a background. In Figure 1, this is illustrated with a moving hand in front of a stationary checkerboard. The first row shows the objects that compose the scene; the second row shows the image sequence that will result. An animation system, whether traditional cel animation or modern digital compositing, can generate this sequence by starting with an image of the background, an image of the hand, an opacity map (known as a "matte" or an "alpha channel") for the hand, motion fields for the hand and the background, and finally the rules of image formation.

The resulting image sequence will pose challenges for standard motion analysis because of the occlusion boundaries. But in principle we should be able to retrieve the same simple description of the sequence that the animator used in generating it: an opaque hand moving smoothly in front of a background. The desired description for the hand is shown in the third row of Figure 1; it involves an intensity map, an opacity map, and a warp map. The background (not shown) would also be extracted. Having accomplished this decomposition, we could transmit the information very efficiently and could then resynthesize the original sequence, as shown in the bottom row. In addition, the description could be an important step on the way to a meaningful object-based description of the scene, rather than a mere description of a flow field.

Adelson [1] has described a general framework for layered image representation, in which image sequences are decomposed into a set of layers ordered in depth, along with associated maps defining their motions, opacities, and intensities. Given such a description, it is straightforward to synthesize the image sequence using standard techniques of warping and compositing. The challenge is to achieve the description starting with an image sequence from a natural scene. In other words: rendering is easy, but vision


Figure 1: This figure shows the decomposition of an image sequence consisting of a hand moving in front of a checkerboard background. The conventional method of representing motion is by a dense motion field with motion discontinuities at object boundaries. The layered representation describes these objects with smooth motions and discontinuities in opacity. The apparent motion discontinuities result when the layers are composited according to the occlusion relationship between objects.

is difficult, as usual. In this paper we describe some techniques that are helpful in accomplishing the vision side of the procedure.

2 Image analysis

Analysis of the scene into the layered representation requires grouping the points in the image into multiple regions, where each region undergoes a smooth motion. However, multiple motion estimation and segmentation is a difficult problem that involves simultaneous estimation of the object boundaries and the motions. Without knowledge of the object boundaries, motion estimation will incorrectly apply the image constraints across multiple objects. Likewise, object boundaries are difficult to determine without some estimate of the motion.

Recent work [7, 2, 9] has shown that the affine motion model provides a good approximation of the image motion of 3-D moving objects. Since the motion model used in the analysis will determine the descriptiveness of the representation, we use the affine motion model in our layered representation to describe a wide range of motions commonly encountered in image sequences. These motions include translation, rotation, zoom, and shear. Affine motion is parameterized by six parameters as follows:

$$V_x(x, y) = a_{x0} + a_{x1} x + a_{x2} y \qquad (1)$$

$$V_y(x, y) = a_{y0} + a_{y1} x + a_{y2} y \qquad (2)$$

where at each point $(x, y)$, $V_x(x, y)$ and $V_y(x, y)$ are the $x$ and $y$ components of velocity, respectively, and the $a$'s are the affine motion parameters.
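To make the parameterization concrete, here is a minimal NumPy sketch that evaluates the six-parameter model over a pixel grid; the function name affine_flow and the sample parameter values are illustrative, not from the paper.

```python
import numpy as np

def affine_flow(params, height, width):
    """Evaluate the six-parameter affine motion model of Eqs. 1-2
    at every pixel of a (height x width) grid.

    params: (ax0, ax1, ax2, ay0, ay1, ay2)
    Returns vx, vy arrays of shape (height, width).
    """
    ax0, ax1, ax2, ay0, ay1, ay2 = params
    y, x = np.mgrid[0:height, 0:width]   # pixel coordinates
    vx = ax0 + ax1 * x + ax2 * y         # Eq. (1)
    vy = ay0 + ay1 * x + ay2 * y         # Eq. (2)
    return vx, vy

# Example: horizontal translation of 2 pixels plus a slight zoom.
vx, vy = affine_flow((2.0, 0.01, 0.0, 0.0, 0.0, 0.01), 240, 352)
```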

3 Implementation

Typical methods for multiple affine motion estimation use an iterative motion estimation technique to detect multiple affine motion regions in the scene. At each iteration, these methods assume that a dominant motion region can be detected and eliminated from subsequent analysis. Estimation of these regions involves global estimation with a single motion model, and thus often results in accumulating data from multiple objects.

Our implementation of multiple motion estimation is similar to the robust techniques presented by [3, 4, 5]. We use a gradual migration from a local motion representation to a global object motion representation. By performing optic flow estimation followed by affine estimation, instead of a direct global affine motion estimation, we can minimize the problems of multiple objects within our analysis region. The layer images and opacity maps are obtained by integrating the motions and regions over time. Our analysis of an image sequence into layers consists of three stages: 1) local motion estimation; 2) motion-based segmentation; and 3) object image recovery.

3.1 Motion segmentation

Our motion segmentation algorithm is illustrated in Figure 2. The segmentation algorithm is divided into two primary steps: 1) local motion estimation and 2) affine motion segmentation. Multiple affine motions are estimated within subregions of the image, and coherent motion regions are determined based on the estimated affine models. By iteratively updating the affine models and the regions, this architecture


Figure 2: This figure shows the technique used in motion segmentation. Affine motion models are determined by regression on the dense motion fields, and the regions are assigned so as to minimize the error between the motion predicted by the models and the estimated dense motion.

minimizes the problem of integrating data across object boundaries.

Our local motion estimation is obtained with a multi-scale coarse-to-fine algorithm based on a gradient approach described by [8]. Since only one motion is visible at any point when dealing with opaque objects, the single motion model assumed in the optic flow estimation is acceptable. The multi-scale implementation allows for estimation of large motions. When analyzing scenes exhibiting transparent phenomena, the motion estimation technique described by Shizawa and Mase [10] may be suitable.
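The paper obtains local flow with a multi-scale version of the gradient method of [8] (Lucas and Kanade). The following is a rough single-level sketch of that approach, assuming float grayscale frames and SciPy; the coarse-to-fine pyramid wrapper that handles large motions is omitted.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def lucas_kanade(im1, im2, window=15):
    """Dense single-level Lucas-Kanade flow (gradient method).

    im1, im2: consecutive float grayscale frames of equal shape.
    Solves the 2x2 brightness-constancy normal equations over a
    local window at every pixel and returns (vx, vy).
    """
    iy, ix = np.gradient(im1)        # spatial gradients (d/dy, d/dx)
    it = im2 - im1                   # temporal gradient
    w = lambda a: uniform_filter(a, window)   # local window averages
    ixx, iyy, ixy = w(ix * ix), w(iy * iy), w(ix * iy)
    ixt, iyt = w(ix * it), w(iy * it)
    det = ixx * iyy - ixy ** 2
    det = np.where(np.abs(det) < 1e-9, np.inf, det)  # untextured: flow -> 0
    vx = -(iyy * ixt - ixy * iyt) / det
    vy = -(ixx * iyt - ixy * ixt) / det
    return vx, vy
```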

Motion segmentation is obtained by iteratively refining the estimates of the affine motions and the corresponding regions. We estimate the affine parameters within each subregion of the image by standard regression techniques on the local motion field. This estimation can be seen as a plane-fitting algorithm in velocity space, since the affine model is a linear model of local motion. The regression is applied separately to each velocity component, since the components are independent. If we let $H_i = [H_{xi}\ H_{yi}]$ be the $i$th hypothesis vector in the affine parameter space, with components $H_{xi}^T = [a_{x0i}\ a_{x1i}\ a_{x2i}]$ and $H_{yi}^T = [a_{y0i}\ a_{y1i}\ a_{y2i}]$ corresponding to the $x$ and $y$ components, and $\phi = [1\ x\ y]$ be the regressor, then the affine equations 1 and 2 become:

$$V_x(x, y) = \phi H_{xi}, \qquad (3)$$

$$V_y(x, y) = \phi H_{yi}, \qquad (4)$$

and a linear least-squares estimate of $H_i$ for a given local motion field is:

$$[H_{xi}\ H_{yi}] = \Big[\sum_{P_i} \phi^T \phi\Big]^{-1} \sum_{P_i} \phi^T\, [V_x(x, y)\ V_y(x, y)]. \qquad (5)$$

The summation is taken over $P_i$, the $i$th subregion in the image.
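In code, the plane fit of Equation 5 reduces to ordinary least squares on each velocity component separately. A minimal NumPy sketch, where fit_affine and the boolean region mask are illustrative names:

```python
import numpy as np

def fit_affine(vx, vy, mask):
    """Least-squares affine fit (Eq. 5) over one subregion.

    vx, vy: dense local flow components, shape (H, W).
    mask:   boolean support region P_i, shape (H, W).
    Returns hx = (ax0, ax1, ax2) and hy = (ay0, ay1, ay2).
    """
    ys, xs = np.nonzero(mask)
    # Regressor phi = [1 x y] evaluated at every supported pixel.
    phi = np.column_stack([np.ones(xs.size), xs, ys]).astype(float)
    hx, *_ = np.linalg.lstsq(phi, vx[ys, xs], rcond=None)
    hy, *_ = np.linalg.lstsq(phi, vy[ys, xs], rcond=None)
    return hx, hy
```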

We avoid estimating motion across object boundaries by initially using small arbitrary subregions within the image to obtain a set of hypotheses of the likely affine motions exhibited in the image. Many of these hypotheses will be incorrect because the initial subregions may contain object boundaries. We identify these hypotheses by their large residual error and eliminate them from our analysis.

However, motion estimates from patches that cover the same object will have similar parameters. These are grouped in the affine motion parameter space with a k-means clustering algorithm described in [11]. In the clustering process, we derive a representative model for each group of similar models. The model clustering produces a set of likely affine motion models that are exhibited by objects in the scene.
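A sketch of this grouping step follows. The paper uses the k-means algorithm described in [11]; scikit-learn's KMeans is used here as a stand-in, and n_models is an illustrative choice rather than a value from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_hypotheses(hypotheses, n_models=4):
    """Cluster 6-D affine hypothesis vectors; each cluster centre
    becomes a representative motion model for one candidate object."""
    H = np.asarray(hypotheses, dtype=float)   # shape (n_hypotheses, 6)
    km = KMeans(n_clusters=n_models, n_init=10).fit(H)
    return km.cluster_centers_                # representative affine models
```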

Next, we use hypothesis testing with the motion models to reassign the regions. We use a simple cost function, $C(i(x, y))$, that minimizes the velocity error between the local motion estimates and the expected motion described by the affine models. This cost function is summarized as follows:

$$C(i(x, y)) = \sum_{x, y} \left[ V(x, y) - V_{H_i}(x, y) \right]^2, \qquad (6)$$

where $i(x, y)$ indicates the model that location $(x, y)$ is assigned to, $V(x, y)$ is the estimated local motion field, and $V_{H_i}(x, y)$ is the affine motion field corresponding to the $i$th hypothesis. Since each location is assigned to only one of the hypotheses, we obtain the minimum total cost by minimizing the cost at each location. We summarize the assignment as follows:

$$i_0(x, y) = \arg\min_i \left[ V(x, y) - V_{H_i}(x, y) \right]^2, \qquad (7)$$

where $i_0(x, y)$ is the minimum-cost assignment. Regions that are not easily described by any of the models are left unassigned. These regions usually occur at object boundaries, because the assumptions used by the optic flow estimation are violated.
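A minimal NumPy sketch of the per-pixel assignment of Equations 6 and 7; the residual threshold max_err that leaves outlier pixels unassigned is an illustrative parameter, not a value from the paper.

```python
import numpy as np

def assign_regions(vx, vy, models, max_err=1.0):
    """Assign each pixel to the affine hypothesis whose predicted flow
    is closest to the estimated local flow (Eqs. 6 and 7). Pixels whose
    best squared error exceeds max_err remain unassigned (-1)."""
    h, w = vx.shape
    y, x = np.mgrid[0:h, 0:w]
    errs = []
    for ax0, ax1, ax2, ay0, ay1, ay2 in models:
        mvx = ax0 + ax1 * x + ax2 * y        # flow predicted by hypothesis i
        mvy = ay0 + ay1 * x + ay2 * y
        errs.append((vx - mvx) ** 2 + (vy - mvy) ** 2)
    errs = np.stack(errs)                    # (n_models, h, w)
    labels = np.argmin(errs, axis=0)         # Eq. (7): minimum-cost model
    labels[errs.min(axis=0) > max_err] = -1  # outliers left unassigned
    return labels
```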


We assign these regions by warping the images according to the affine motion models and selecting the model that minimizes the intensity error between the pair of images.

We now define the binary region masks that describe the support regions for each of the affine hypotheses as:

$$M_i(x, y) = \begin{cases} 1 & \text{if } i_0(x, y) = i, \\ 0 & \text{otherwise.} \end{cases} \qquad (8)$$

These region masks allow us to identify the object regions and to refine our affine motion estimates in the subsequent iterations according to Equation 5.

As we perform more iterations, we obtain a more accurate motion segmentation because the affine motion estimation is performed within single-motion regions. Convergence is obtained when only a few points are reassigned or when the number of iterations reaches the maximum allowed. Models that have small support regions are eliminated because their affine parameters will be inaccurate in these small regions.

We maintain the temporal coherence and stability of the segmentation by using the current motion segmentation results as initial conditions for segmentation on the next pair of frames. Since an object's shape and motion change slowly from frame to frame, the segmentation results between consecutive frames are similar and require fewer iterations for convergence. When the motion segmentation on the entire sequence is completed, each object will have a region mask and an affine motion description for each frame of the sequence.

3.2 Analysis of layers

The images of the corresponding regions in the different frames differ only by an affine transformation. By applying these transformations to all the frames, we align the corresponding regions across the different frames. When the motion parameters are accurately estimated, objects will appear stationary in the motion-compensated sequence. The layer images and opacity maps are derived from these motion-compensated sequences.

However, some of the images in the compensated sequence may not contain a complete image of the object because of occlusions. Additionally, an image may have small intensity variations due to different lighting conditions. In order to recover the complete representative image and boundary of the object, we collect the data available at each point in the layer and apply a median operation on the data. This operation can be seen as a temporal median filtering operation on the motion-compensated sequence, in regions defined by the region masks. Earlier studies have shown that motion-compensated median filtering can enhance noisy images and preserve edge information better than a temporal averaging filter [6].

Finally, we determine the occlusion relationships. For each location of each layer, we tabulate the number of corresponding points used in the median filtering operation. These images are warped to their respective positions in the original sequence according to the estimated affine motions, and the values are compared at each location. A layer that is derived from more points occludes a layer that is derived from fewer points, since an occluded region necessarily has fewer corresponding points in the recovery stage. Thus the statistics from the motion segmentation and temporal median filtering provide the necessary description of the object motion, texture pattern, opacity, and occlusion relationships.
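A sketch of this recovery step, assuming the frames and support masks for one layer have already been warped into the reference frame's coordinates; the function and argument names are illustrative.

```python
import numpy as np

def recover_layer(aligned_frames, aligned_masks):
    """Motion-compensated temporal median for one layer.

    aligned_frames: (T, H, W) float frames, warped so the layer
                    appears stationary in reference coordinates.
    aligned_masks:  (T, H, W) boolean support masks for the layer.
    Returns the layer image, its opacity, and the per-pixel support
    count used to decide occlusion ordering.
    """
    stack = np.where(aligned_masks, aligned_frames, np.nan)
    layer = np.nanmedian(stack, axis=0)   # robust to occlusion and noise
    counts = aligned_masks.sum(axis=0)    # observations at each pixel
    opacity = counts > 0                  # never-observed pixels stay transparent
    return layer, opacity, counts
```

Comparing the per-pixel support counts of two overlapping layers then yields the depth ordering described above: the layer backed by more observations is placed in front.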

Our modular approach also allows us to easily incorporate other motion estimation and segmentation algorithms into a single robust framework.

4 Experimental results

We implemented the image analysis technique on a SUN workstation and used the first 30 frames of the MPEG flower garden sequence to illustrate the analysis, the representation, and the synthesis. Three frames of the sequence, frames 0, 15, and 30, are shown in Figure 3. In this sequence, the tree, flower bed, and row of houses move towards the left, but at different velocities. Regions of the flower bed closer to the camera move faster than the regions near the row of houses in the distance.

Optic flow obtained with a multi-scale coarse-to-fine gradient method on a pair of frames is shown on the left in Figure 4. The initial regions used for the segmentation consisted of 215 square regions. Notice the poor motion estimates along the occlusion boundaries of the tree, as shown by the different lengths of the arrows and the arrows pointing upwards. In the same figure, the result of the affine motion segmentation is shown in the middle. The affine motion regions are depicted by different gray levels; the darkest regions along the edges of the tree correspond to regions where the local motion could not be accurately described by any of the affine models. Region assignment based on warping the images and minimizing the intensity error reassigns these regions, and is shown on the right.


Our analysis decomposed the image into four primary regions: tree, house, flower bed, and sky. Affine parameters and the support regions were obtained for the entire sequence, and the layer images for the four objects, obtained by motion-compensated temporal median filtering, are shown in Figure 5. We use frame 15 as the reference frame for the image alignment. The occluding tree has been removed, and the occluded regions have been recovered in the flower-bed layer and the house layer. The sky layer is not shown. Regions with no texture, such as the sky, cannot be readily assigned to a layer since they contain no motion information. We assign these regions to a single layer that describes stationary textureless objects.

We can recreate the entire image sequence from the layer images of Figure 5, along with the occlusion information, the affine parameters that describe the object motion, and the stationary layer. Figure 6 shows three synthesized images corresponding to the three images in Figure 3. The objects are placed in their respective positions, and the occlusion of the background by the tree is correctly described by the layers. Figure 7 shows the corresponding frames synthesized without the tree layer. Uncovered regions are correctly recovered because our layered representation maintains a description of motion in these regions.

5 Conclusions

We employ a layered image motion representation that provides an accurate description of motion discontinuities and motion occlusion. Each occluding and occluded object is explicitly represented by a layer that describes the object's motion, texture pattern, shape, and opacity. In this representation, we describe motion discontinuities as discontinuities in object surface opacity rather than discontinuities in the actual object motion.

To achieve the layered description, we use a robust motion segmentation algorithm that produces stable image segmentation and accurate affine motion estimation over time. We deal with the many problems in motion segmentation by appropriately applying the image constraints at each step of our algorithm. We initially estimate the local motion within the image, then iteratively refine the estimates of object shape and motion. A set of likely affine motion models exhibited by objects in the scene is calculated from the local motion data and used in a hypothesis testing framework to determine the coherent motion regions. Finally, the temporal coherence of object shape and texture pattern allows us to produce a description of the object image, boundary, and occlusion relationships. Our approach provides useful tools in image understanding and object tracking, and has potential as an efficient model for image sequence coding.

Acknowledgements

This research was supported in part by contracts with SECOM Co. and Goldstar Co., Ltd.

References

[1] E. H. Adelson, Layered representation for image coding, Technical Report No. 181, Vision and Modeling Group, The MIT Media Lab, December 1991.

[2] J. R. Bergen, P. J. Burt, R. Hingorani, and S. Peleg, Computing two motions from three frames, International Conference on Computer Vision, 1990.

[3] M. J. Black and P. Anandan, Robust dynamic motion estimation over time, Proc. IEEE Computer Vision and Pattern Recognition '91, pp. 296-302, 1991.

[4] T. Darrell and A. Pentland, Robust estimation of a multi-layered motion representation, IEEE Workshop on Visual Motion, pp. 173-178, Princeton, 1991.

[5] R. Depommier and E. Dubois, Motion estimation with detection of occlusion areas, Proc. IEEE ICASSP '92, Vol. 3, pp. 269-273, San Francisco, March 1992.

[6] T. S. Huang and Y. P. Hsu, Image sequence enhancement, in Image Sequence Analysis, T. S. Huang, Ed., pp. 289-309, Springer-Verlag, 1981.

[7] M. Irani and S. Peleg, Image sequence enhancement using multiple motions analysis, Proc. IEEE Computer Vision and Pattern Recognition '92, pp. 216-221, Champaign, June 1992.

[8] B. Lucas and T. Kanade, An iterative image registration technique with an application to stereo vision, Image Understanding Workshop, pp. 121-130, April 1981.

[9] S. Negahdaripour and S. Lee, Motion recovery from image sequences using first-order optical flow information, Proc. IEEE Workshop on Visual Motion '91, pp. 132-139, Princeton, 1991.

[10] M. Shizawa and K. Mase, A unified computational theory for motion transparency and motion boundaries based on eigenenergy analysis, Proc. IEEE Computer Vision and Pattern Recognition '91, pp. 296-302, 1991.

[11] C. W. Therrien, Decision, Estimation and Classification, John Wiley and Sons, New York, 1989.


Figure 3: Frames 0, 15, and 30 of the MPEG flower garden sequence.

    Figure 4: Affine motion segmentation of optic flow.

    Figure 5: Images of the flower bed, houses, and tree. Affine motion fields are also shown here.

Figure 6: Corresponding frames of Figure 3 synthesized from the layer images in Figure 5.

    Figure 7: Corresponding frames of Figure 3 synthesized without the tree layer.
