
Video Retargeting: Automating Pan and Scan

Feng Liu
Department of Computer Sciences, University of Wisconsin, Madison
1210 West Dayton Street, Madison, Wisconsin, 53706
[email protected]

Michael Gleicher
Department of Computer Sciences, University of Wisconsin, Madison
1210 West Dayton Street, Madison, Wisconsin, 53706
[email protected]

ABSTRACT
When a video is displayed on a smaller display than originally intended, some of the information in the video is necessarily lost. In this paper, we introduce Video Retargeting, which adapts video to better suit the target display, minimizing the important information lost. We define a framework that measures the preservation of the source material, and methods for estimating the important information in the video. Video retargeting crops each frame and scales it to fit the target display. An optimization process minimizes information loss by balancing the loss of detail due to scaling with the loss of content and composition due to cropping. The cropping window can be moved during a shot to introduce virtual pans and cuts, subject to constraints that ensure cinematic plausibility. We demonstrate results of adapting a variety of source videos to small display sizes.

Categories and Subject Descriptors
I.4.9 [Image Processing and Computer Vision]: Applications

General Terms
Algorithms, Human Factors

Keywords
Video retargeting, Video editing, Mobile multimedia, Importance estimation

1. INTRODUCTION
Viewing video on small screens is becoming increasingly common as portable devices become more capable and popular. Unfortunately, most source material is originally intended for larger displays, such as televisions and theater screens. If such video is presented naïvely, by simply scaling it to fit the small screen, important parts of the image become too small to see. To make matters worse, small displays often have different aspect ratios than larger ones, requiring either an anisotropic "squish" or padding the video to fill the display. Small displays can show less content than larger ones; our goal is to enable effective small-display viewing by retaining what is important.

This paper considers the problem of video retargeting, that is, adapting a video so that it is better suited for viewing on a display different from the one originally intended. Video retargeting applies two operations to each frame of a video: cropping, which discards information outside of a window and disturbs the composition of the image; and scaling, which loses details of the image, especially as objects become too small to recognize, and distorts the image if the scaling is anisotropic.

In this paper, we introduce an approach for automatically retargeting video to displays of different sizes and aspect ratios. This intelligent retargeting solution uses the video content to determine how best to combine cropping and scaling: unimportant aspects of the frame are cropped away so that more important content appears at a larger scale. We cast retargeting as an optimization problem: what new video least damages the content of the original video? By moving the cropping window, video retargeting can create virtual pans and cuts to better portray dynamic shots. While our focus is on adapting edited films and videos for small displays, the methods are also applicable for automatically adapting wide-format videos (such as feature films) to other aspect ratios (such as standard television).

Cropping discards considerable information. Not only is the content of the cropped portion lost, but we also lose the intended composition of the original frame. Composition is important in video, as filmmakers use it in subtle ways to convey emotion and story. However, for small devices the alternative, downsampling the image to a tiny size where objects are potentially too small to be recognized, is often worse. In essence, we choose to selectively lose some information from cropping in the hope of avoiding losing all information from scaling. Examples are shown in Figure 1.

In video, the motion on screen has significance beyond individual frames. Not only do objects move, but filmmakers also move the camera to achieve their desired goals. These latter effects are often subtle, yet significant: a zoom-in to create a feeling of connection, or the timing of cuts to establish pacing. Therefore, video retargeting must not only consider how each frame is cropped, but also how this cropping affects motion. In computing the cropping for each frame, we must be careful not to introduce new motions that will be obvious artifacts or significantly destroy important existing motions. In particular, the cropping window may be moved either smoothly to introduce a virtual pan or discontinuously to introduce a virtual cut. However, the introduced virtual camera motions must be cinematically plausible: they must look as if they could have been part of the original film.

Figure 1: An example of retargeting a frame of a feature film from a widescreen (2.4:1) DVD to a variety of other devices, both naïvely and using our approach. (a) Original DVD frame, 480 × 200 pixels; (b) retargeting by scaling; (c) retargeting by letterboxing; (d) retargeting by our approach. Targets are 200 × 150, 160 × 120, and 120 × 120 pixels respectively. Note that while both the left and center targets have the same 4:3 aspect ratio, the different sizes lead to different retargeting solutions. Source © 2004 Warner Home Video.

Implementing video retargeting involves two components: a system must determine what is important in the video, then choose a retargeting that preserves it. In Section 3 we define a framework where heuristic penalties measure the amount of information lost due to a retargeting operation. Video retargeting attempts to find the cropping window for each frame that minimizes this penalty over the entire video. In Section 4 we describe how we find solutions to this minimization that enforce constraints of cinematic plausibility. The final sections of the paper present examples and discuss results.

Automation is important in video retargeting because there is an ever-expanding set of potential target displays. Manual retargeting may provide better results for a specific target, but our automated video retargeting allows video to be adapted to a wide variety of devices. For each different display size and aspect ratio, the system can determine the best way to retarget the video.

The main contribution of this paper is an approach for video retargeting that works for edited source material such as feature films. To do this, we also contribute a novel importance metric that accounts for object and camera motion, a problem formulation that penalizes various types of loss, and methods for minimizing these penalties subject to the constraints of cinematic plausibility.

1.1 Overview and Example
To motivate and explain our methodology, we consider a feature film in the common 2.4:1 aspect ratio¹. To display the entire frame of the film on a standard television screen (4:3 aspect ratio), one would either need to "squeeze" the image considerably (Figure 1(b)), or display the frame as a narrow band across the screen, padding the frame with blank space (known as "letterboxing," see Figure 1(c)). As an alternative, the sides of the film image might be cropped off, creating an image with 4:3 aspect ratio. When this cropped image is shown on the television display it utilizes the full height of the screen and is commonly called a fullscreen presentation.

¹Modern feature films are presented in a variety of aspect ratios; almost all are wider than the 16:9 aspect of "widescreen" television.

Fullscreen presentations of movies are created by a process called pan and scan, where an operator chooses how to crop the image to the required aspect ratio. To our knowledge, this process is done manually, and typically only involves cropping the left and right edges. The pan-and-scan operator chooses what is important in each frame and moves the cropping window accordingly. Our work automates this process, and extends it to crop from all edges of the frame.

Pan-and-scan is problematic as it discards a portion of the original frame, eliminating parts of the image and destroying its composition. Consider, however, a face that is 1/10th of the vertical size of the frame (and hence of a fullscreen presentation): it would be only 1/18th of a screen tall in a letterboxed presentation. On a large display, a viewer may be willing to sacrifice this size, but on a small display even 1/10th of the image may make the face too small to be recognizable, so a viewer may be more willing to sacrifice the edges of the image such that the face is recognizable.

We can be more aggressive in our cropping than traditional pan and scan by also cropping vertically. While we may lose more of the original image content and composition, such aggressive cropping may be justified for a small display, as the important face may not be recognizable otherwise. By appropriately cropping a video, we can retarget it to a smaller display. Cropping may be useful even if the source and target video have the same aspect ratio.

Our automatic video retargeting process considers each shot of the video independently (§4). First, it determines the important parts of each frame (§3). Then for each shot it determines how to crop the shot so that it best preserves the content of the original video. Choosing the cropping requires balancing a number of competing goals, such as showing as much of the image as possible yet not shrinking the important aspects of the image too much. To perform this balancing, our approach assigns penalties for many types of information loss, and chooses the cropping such that the weighted sum of the penalties is minimized (§4).

In some cases, a single cropping rectangle is not the best choice for an entire shot. For example, if an important object moves across the source frame, the cropping window can move with the object so that it is always in view at an appropriate size. Moving the cropping window creates effects similar to moving the camera during filming. Our approach (§4.2) limits the movements it considers to ones that are cinematically plausible, that is, ones that appear as if they may have been part of the original film. In general, we are conservative in not violating the "rules" taught to beginning filmmakers: an experienced filmmaker may violate these rules to create a dramatic effect, but the retargeting process should not introduce dramatic effects not present in the original film. Similarly, our approach considers discontinuous movements of the cropping rectangle to introduce cuts; however, these cuts are applied cautiously to ensure their cinematic plausibility.

Our implementation of video retargeting first pre-processes the source video to identify shot boundaries and the important aspects of each frame (§3). Once pre-processing is complete, different target video sizes can be produced. Given a target video size, the system considers each shot, finding the best static cropping and virtual camera motions, and selects the one that minimizes information loss. Given a pre-processed source video, the system can generate videos retargeted for a variety of different target sizes. For example, we retarget DVDs of feature films to sizes appropriate for small media players, PDAs, and cellular phones.

2. RELATED WORK
Feature films are retargeted for television through pan and scan. To our knowledge, this is done manually by an operator who decides what to crop. Historically, pan and scan has cropped only the left and right edges of the frame, using a fixed scaling. Now that pan and scan is performed digitally, operators can apply more general zooming if they want to. Historically, automation has been a minor issue, as pan and scan has only been important in converting films to television, which needs to be done only once and for a single target. The range of portable device sizes and aspect ratios makes automatic video retargeting important.

A few authors have considered adapting videos to small devices. Fan et al. [7] and Wang et al. [30] present systems that retarget video to small devices. Both use an importance model to determine important aspects of the frame and apply cropping to remove less important content. Neither importance model accounts for global motion, nor do they consider cinematic plausibility in either their importance model or cropping rules. Because of this, these prior works are most likely inappropriate for our intended application: automatic retargeting of edited video such as feature films.

Automatic retargeting has been successful for still images. Several authors, including Chen et al. [5] and Suh et al. [27], analyze image content to determine important aspects and crop away less important image regions so that the important regions appear at a larger scale. While work on automatic still image retargeting inspired and informs our approach, the video retargeting problem is more challenging for a number of reasons, including: determining the important aspects of a video is difficult (§3), what is important changes dynamically, and video retargeting has the opportunity to introduce virtual camera motions.

The image retargeting approach of Santella et al. [25] relies on eye-tracking to determine the important region. It introduces a number of heuristics for preserving image composition in addition to information content, similar to our efforts to preserve aspects of the video. However, we must also consider camera motions and other aspects of filmmaking.

An automatic retargeting system requires some "knowledge" of the art of filmmaking so that the alterations it introduces appear natural rather than as obvious artifacts of the retargeting process. While books for beginning filmmakers, such as Cantine et al. [4], provide some basic rules on filmmaking, more advanced texts (such as Katz [14] or Arijon [1]) show both the limited nature of such rules, as well as how good filmmakers may break them to achieve desired effects. This has made the art of filmmaking difficult to codify. There have been two domains where it has been discussed: creating 3D camera movements (including cuts) for cinematography of virtual worlds (such as He et al. [9] and Bares et al. [2]), and automatic video editing.

Systems for automatic video editing have been presented in a few focused domains. For personal video collections, the focus is on selecting clips from a library (Girgensohn et al. [8]). Automatic editing of classroom videos is a similar problem to video retargeting, as it transforms a video stream (potentially from multiple cameras) into a video of the same length by selecting parts of it. Successful systems, such as Rui et al. [24], Bianchi [3], and Heck et al. [10], all use heuristics to determine what is most important to show, and then crop away less important aspects (either by zooming a camera in real time or by cropping a pre-recorded frame). Unfortunately, the methods used in automatic classroom video editing systems do not generalize easily, as they rely on the known structure of classroom lectures and on unedited source material. Our automatic retargeting work applies to a wide range of video, including edited source material such as films and television shows.

An alternative to cropping and scaling for retargeting is to apply non-linear distortions. This has been considered for still images in Liu et al. [15] and Setlur et al. [26]. Applying such approaches to video is challenging and may introduce odd visual effects.

3. IMPORTANCE
The basic idea of video retargeting is to preserve what is important in a video by removing what is not. To automate retargeting, we must have some method of identifying what is important in the video. Our approach is to define penalties for various forms of important information loss and to find retargeting solutions that minimize these penalties. This approach provides a flexible framework: we can easily add new types of penalties as we devise better mechanisms for measuring the important content lost in a video.

Automatically determining what is important in a video is difficult and subjective. To truly determine what is most important to show on screen requires an understanding of the story of a film: a seemingly unimportant element might be critical later on. To make matters worse, different viewers may have different opinions on what is most important.

Automatic image retargeting must also determine what is important in an image without semantic understanding. Recent results in still image retargeting [5, 15, 26, 27] suggest two mechanisms work well: specifically recognizable objects (such as faces) are usually important, and regions of the image that are most likely to attract the low-level visual system are likely to be important.

Determining importance in video is more difficult than determining importance in its constituent still images for two reasons: first, there is motion in the frame that carries meaning; and second, the image is part of a larger context of the overall video². In our current work, we consider the first issue, but not the latter.

²Including the audio; consider a narration that says "look at the left side of the image."

The heuristics we use for video retargeting without semantic understanding fall into two categories that extend what has been applied to still image retargeting. First, we consider the temporally local information of a given frame, that is, the content of the image and its motion. If a part of the frame is likely to attract a viewer's attention, we consider it important to preserve. The methods from still images, salience and object identification, are extended with motion features in Section 3.1. Second, we consider more general types of loss due to retargeting. Retargeting should avoid operations that always lose information, or that create results a filmmaker would be unlikely to produce (or would only produce to call attention to something). In Section 3.2, we describe the specific heuristics used in our prototype. Our extensible framework allows new penalties to be added as we develop a richer set of importance-determining mechanisms.

The user can add hints to inform the system of what is important. This is valuable as the user may have semantic or contextual information that is unavailable to a purely automatic system, and this effort can be amortized over a set of retargeting targets. None of the examples in this paper employ hints.

3.1 Local Importance
Our first heuristic considers each frame without regard to its context. For each image, we compute an importance map that measures, for each pixel, how important that pixel is likely to be. Our approach is similar to what is used in still image retargeting: we combine image salience and specific object detection. For video, we must extend the salience model to account for motion: certain types of motion are very likely to draw a viewer's attention.

The importance map is temporally local: it considers the content of the frame and a bounded number of neighboring frames (for motion estimation). It is created as a weighted linear combination of three components: the image saliency ($S_I$), motion saliency ($S_M$), and detected object saliency ($S_F$). The first two components are discussed in subsequent sections. An example of an importance map is shown in Figure 2.

Figure 2: Salience map for a frame. (a) original frame, (b) image salience $S_I$ without center weighting, (c) motion salience $S_M$, (d) resulting map $S = S_I + S_M + S_F$. Note the center weighting and that the small faces are weighted less than the large one. Not all small faces are detected. Source video © 2001 MGM.

In our current implementation, the only detected objects are faces. Faces are almost always important, are quickly recognized by viewers, and usually attract attention. They are also easily detected automatically. Our implementation uses the method of [29]. The object salience map $S_F$ has value zero for pixels that are not part of a face, and a value proportional to the area of the face for pixels that are part of one. This implements the heuristic that larger faces are likely to be more important.
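As a concrete sketch, $S_F$ can be built from detected face rectangles as below. We use OpenCV's Haar-cascade detector, which implements the method of [29], as a stand-in; the cascade file, detection parameters, and the normalization of the area-proportional weight are our assumptions rather than details given in the paper.

```python
import cv2
import numpy as np

# Stand-in for the Viola-Jones detector of [29]: OpenCV ships a
# Haar-cascade implementation of the same method.
_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_salience(frame_bgr):
    """Build S_F: zero outside faces, proportional to face area inside."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = _cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    s_f = np.zeros(gray.shape, dtype=np.float64)
    frame_area = float(gray.shape[0] * gray.shape[1])
    for (x, y, w, h) in faces:
        # Heuristic from the paper: larger faces get larger salience.
        # (Normalizing by frame area is our choice.)
        s_f[y:y + h, x:x + w] = (w * h) / frame_area
    return s_f
```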

The importance map is center weighted to add the heuristic that things in the center of the image are more likely to be important. Therefore, the importance map $S$ is computed as

$$S = \omega_I S_I + \omega_M S_M + \omega_F S_F, \qquad \omega_I + \omega_M + \omega_F = 1,$$

where the $\omega$ are the weights for each salience component.
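A minimal sketch of this combination, assuming the three maps are same-sized arrays. The Gaussian form of the center weighting and the weight values are illustrative placeholders; the paper sets the weights empirically and does not specify the center-weighting function:

```python
import numpy as np

def importance_map(s_i, s_m, s_f, w_i=0.4, w_m=0.3, w_f=0.3, sigma=0.5):
    """S = w_I*S_I + w_M*S_M + w_F*S_F with a center-weighting term."""
    assert abs(w_i + w_m + w_f - 1.0) < 1e-9   # weights sum to one
    s = w_i * s_i + w_m * s_m + w_f * s_f
    # Hypothetical center weighting: Gaussian falloff from the frame center.
    h, w = s.shape
    ys = (np.arange(h) - h / 2.0) / (h / 2.0)
    xs = (np.arange(w) - w / 2.0) / (w / 2.0)
    r2 = ys[:, None] ** 2 + xs[None, :] ** 2
    return s * np.exp(-r2 / (2.0 * sigma ** 2))
```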

3.1.1 Image Salience
Research in vision science, such as [19, 20], indicates that saliency can be measured by low-level feature contrast. Based on this, several methods have been proposed to detect saliency in an image. Itti et al. [13, 12] detect local spatial feature discontinuities in a static image pyramid with a fixed number of scales as feature contrast maps, and combine them into a single saliency map. Ma et al. [18] provide a simpler saliency metric based on heuristic observations of how people perceive images. Hu et al. [11] transform image features into a 2D space through a polar transformation, and identify regions by estimating the subspaces. They consider both a region's feature contrast and its geometric properties to determine saliency. Our implementation uses Ma's contrast-based method [18], weighting the center of the image more as described in [15]. Basically, the method applies a band-pass filter in a perceptually uniform color space.
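A rough sketch in the spirit of this contrast-based approach: band-pass the image in a perceptually uniform color space and take the contrast magnitude. The difference-of-Gaussians filter and the kernel sizes are our simplifications, not the exact method of [18]:

```python
import cv2
import numpy as np

def image_salience(frame_bgr, fine=5, coarse=21):
    """Approximate band-pass contrast in a perceptually uniform space."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    # Difference of Gaussians per channel acts as a band-pass filter.
    band = (cv2.GaussianBlur(lab, (fine, fine), 0)
            - cv2.GaussianBlur(lab, (coarse, coarse), 0))
    s_i = np.sqrt((band ** 2).sum(axis=2))
    return s_i / (s_i.max() + 1e-9)
```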

3.1.2 Motion Salience
Moving objects usually attract attention [23]. Therefore, given the observed motion within an image (the optical flow that assigns a motion vector to each pixel), we would expect that regions with motion are likely to be salient. However, the simple approach of only considering the amount of motion at each pixel, such as in [17], is insufficient. People are good at factoring out global motion, such as that induced by head or camera movements; therefore motion contrast, not magnitude, is a better predictor of salience [23].

For video retargeting, we use a new motion salience scheme that is based on motion contrast. Given the optical flow associated with an image, we first use dominant motion estimation to segment the image into a global background³ and foreground regions. The salience for each pixel is then proportional to the magnitude of the difference between its motion and the background motion. For pixels that are in the background, there would be no difference; to reduce noise from the fitting of a global motion model, we explicitly set these values to zero. So given the optical flow $M_V$ and the global motion model $M_G$, both of which provide 2D vectors for each pixel, we compute the motion salience $S_M$ at foreground pixel $(i, j)$ as:

$$S_M(i, j) = \frac{\| M_V(i, j) - M_G(i, j) \|}{d_{MV}^{max}},$$

where $d_{MV}^{max}$ is the maximal difference between a motion vector and the global motion vector.

³The dominant motion may not semantically be the background.
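Given a dense flow field and a fitted global motion field, the motion salience computation is direct. This sketch assumes `flow` and `global_flow` are H × W × 2 arrays and `background` is a boolean mask produced by the segmentation described next:

```python
import numpy as np

def motion_salience(flow, global_flow, background):
    """S_M: contrast of per-pixel motion against the dominant motion."""
    diff = np.linalg.norm(flow - global_flow, axis=2)
    diff[background] = 0.0   # suppress fitting noise in the background
    d_max = diff.max()
    return diff / d_max if d_max > 0 else diff
```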

Our implementation uses the Lucas and Kanade method [16] to calculate optical flow in each frame. We model the dominant global motion using a restricted affine model with four parameters (zoom, rotate, pan, tilt):

$$M_G(x, y) = \begin{bmatrix} zoom & rotate \\ -rotate & zoom \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} pan \\ tilt \end{bmatrix}.$$

In practice, the model seems to fit well enough for most camera motions observed in regular videos and is easy to estimate. Because we are not performing registration, more detailed models are not necessary.

Our implementation uses the iterative method of [21] to simultaneously perform segmentation and parameter estimation. A variant of the expectation maximization (EM) algorithm, the method alternates between estimating the assignment of pixels to foreground and background and determining the parameters of the global motion of the background model. The model is initialized by assigning all pixels to the background. The following steps are then alternated until the model sufficiently fits the motion of the assigned pixels, or too few pixels are left in the background:

1. Estimate the background motion parameters given the current set of pixels assigned to the background. To fit the model to the motion vectors, a linear least squares problem is formed and solved using singular value decomposition (SVD).

2. For each pixel $(i, j)$, compare its motion $M_V(i, j)$ to the motion predicted by the global model $M_G(i, j)$. If the magnitude of the difference is small, assign the pixel to the background set; otherwise assign it to the foreground set.
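A minimal numpy sketch of this alternation; the residual threshold, iteration limit, and stopping fraction are our own illustrative values, as the paper does not specify them:

```python
import numpy as np

def fit_global_motion(flow, thresh=1.0, max_iters=10, min_bg_frac=0.1):
    """Alternate least-squares fitting of the 4-parameter model and
    reassignment of pixels, starting with all pixels in the background."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = xs.ravel().astype(float), ys.ravel().astype(float)
    u, v = flow[..., 0].ravel(), flow[..., 1].ravel()
    bg = np.ones(x.size, dtype=bool)            # initialize: all background
    params = np.zeros(4)                        # (zoom, rotate, pan, tilt)
    for _ in range(max_iters):
        # Step 1: linear least squares for (zoom, rotate, pan, tilt):
        #   u = zoom*x + rotate*y + pan
        #   v = zoom*y - rotate*x + tilt
        n = bg.sum()
        ones, zeros = np.ones(n), np.zeros(n)
        a = np.vstack([np.stack([x[bg],  y[bg], ones,  zeros], axis=1),
                       np.stack([y[bg], -x[bg], zeros, ones],  axis=1)])
        b = np.concatenate([u[bg], v[bg]])
        params, *_ = np.linalg.lstsq(a, b, rcond=None)   # SVD-based solve
        # Step 2: reassign pixels by comparing observed and predicted motion.
        pu = params[0] * x + params[1] * y + params[2]
        pv = params[0] * y - params[1] * x + params[3]
        new_bg = np.hypot(u - pu, v - pv) < thresh
        if np.array_equal(new_bg, bg) or new_bg.mean() < min_bg_frac:
            bg = new_bg
            break
        bg = new_bg
    return params, bg.reshape(h, w)
```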

Because of errors in motion estimation, segmentation, and background model fitting, the computed motion salience map $S_M$ is often noisy. Band-pass filtering to reduce noise seems appropriate, as a pixel's salience should be similar to that of pixels on the same object, since objects move coherently. However, a simple filter disregards the fact that neighboring pixels may be on different objects, so filtering reduces contrast between groups. Therefore, we apply a Bilateral Filter [28] that takes into account similarity in value as well as proximity. We apply the Bilateral Filter in space and time to the motion salience map. Similar to a 3D Gaussian filter, the Bilateral Filter takes the weighted average of each pixel's neighbors (in space and time), but computes the weights as a Gaussian function of distance in space, time, and value.
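A naive space-time bilateral filter over a T × H × W salience volume might look as follows; the window radius and sigmas are assumptions, and the wrap-around edge handling (via `np.roll`) is a simplification a real implementation would replace:

```python
import numpy as np

def bilateral_3d(sal, radius=2, sigma_s=2.0, sigma_t=1.0, sigma_v=0.1):
    """Naive space-time bilateral filter of a T x H x W salience volume."""
    num = np.zeros_like(sal)
    den = np.zeros_like(sal)
    offs = range(-radius, radius + 1)
    for dt in offs:
        for dy in offs:
            for dx in offs:
                shifted = np.roll(sal, shift=(dt, dy, dx), axis=(0, 1, 2))
                # Weight = Gaussian in space, time, and value difference.
                g = np.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2)
                           - (dt * dt) / (2 * sigma_t ** 2)
                           - (shifted - sal) ** 2 / (2 * sigma_v ** 2))
                num += g * shifted
                den += g
    return num / den   # den >= 1 (the center tap has weight 1)
```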

Figure 3: Cut faces appear unnatural (left), so our system adds a penalty to avoid them. This often causes the faces to appear at the edge of the frame (center), so the system penalizes this as well, leading to more natural framings (right). Source video © 2003 20th Century Fox.

3.2 Penalties for Information Loss
Given a source video of dimensions $w_{source} \times h_{source}$ and a target video dimension $w_{targ} \times h_{targ}$, retargeting chooses a subwindow of the source $W$ that is subsequently scaled to fit the target. For convenience, rather than parameterizing $W$ by the corner position (we place the origin at the upper left), width, and height, we choose to parameterize it by corner position, scale factor $s$, and anisotropy of scale $s_a$:

$$W = (x, y, w, h) = (x, y,\ w_{targ} \cdot s,\ h_{targ} \cdot s \cdot s_a).$$

For a unity scale factor ($s = s_a = 1$), the source pixels are copied directly to the target. We do not consider enlargement ($s < 1$).

Given a window, we can compute a penalty value for the amount of "information" that is destroyed on a particular frame. The total penalty is computed as a weighted sum of a number of individual penalties. Each penalty $p_i$ has an associated weight $\omega_i$. At present, our system considers the following penalties:

Information cropped: the sum of the importance of the pixels of the source video that do not appear in the resulting image,
$$p_x(W) = \frac{\sum_{(x,y) \notin W} S(x, y)}{S_{total}},$$
where $S(x, y)$ is the importance value at pixel $(x, y)$ and $S_{total}$ is the total importance in the shot.

Scaling: as the scale factor increases, information is lost due to downsampling. We penalize larger scale factors:
$$p_s(W) = |s - 1.0|^3,$$
where $s$ is the scale factor. We choose a cubic cost function to avoid drastic downsampling.

Pixel Aspect Ratio: while the system can squeeze the image with an anisotropic scale, such distortions look odd and are penalized:
$$p_a(W) = |s_a - 1.0|^3,$$
where $s_a$ is the aspect ratio change factor. We choose a cubic cost function again to avoid drastic distortion.

Face cut cost: cutting off part of a face is unattractive, and generally not done by filmmakers [1]. If the window intersects a located face, we create an image that is clearly an artifact of the retargeting process, losing cinematic plausibility (Figure 3). For each detected face, we penalize the system with a constant cost if part of the face is cut off outside the window.

Edge crowding cost: by itself, the face cut penalty tends to cause faces to appear adjacent to the edge of the window, which appears unnatural (Figure 3). We penalize the system for putting a face too close to the edge of the frame:
$$p_e(W) = \begin{cases} 1.0 - d/d_{max}, & d \le d_{max} \\ 0, & \text{otherwise,} \end{cases}$$
where $d$ is the smallest distance between the face edge and the window edge, and $d_{max}$ is a constant, set empirically in our system to 15 pixels.

Pan and Cut costs: motions of the window between frames can cause the loss of motion information, so they are penalized. This is discussed in more detail in Sections 4.2 and 4.3.

User hint costs: if the user specifies that a particular position in the frame is important, a window is penalized if it does not contain that point. Note: none of the examples in this paper include user-specified hints.

There is a weight $\omega_i$ for each penalty and for each salience component. This set of weights serves to tune the importance determination process to balance all of the factors. In principle, these weights could be tuned to account for viewer preferences. In practice, we have determined a set of weights empirically and used them for all of our experiments.
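As a sketch, evaluating one frame's total penalty could be organized as below. The argument names and the weight values are placeholders for the quantities defined above, not the paper's actual code or coefficients:

```python
def total_penalty(s, sa, cropped, face_cut, crowding, weights):
    """Weighted sum of the per-frame penalty terms of Section 3.2.

    `cropped`, `face_cut`, and `crowding` are the values of p_x, the
    face cut cost, and p_e computed elsewhere for this window.
    """
    terms = {
        "cropped": cropped,              # p_x: importance lost to cropping
        "scale":   abs(s - 1.0) ** 3,    # p_s: cubic downsampling penalty
        "aspect":  abs(sa - 1.0) ** 3,   # p_a: cubic anisotropy penalty
        "facecut": face_cut,             # constant cost per cut face
        "crowd":   crowding,             # p_e: face too near the edge
    }
    return sum(weights[name] * value for name, value in terms.items())

# Illustrative usage with placeholder weights:
weights = {"cropped": 1.0, "scale": 0.5, "aspect": 0.5,
           "facecut": 2.0, "crowd": 1.0}
print(total_penalty(s=1.21, sa=1.1, cropped=0.12,
                    face_cut=0.0, crowding=0.3, weights=weights))
```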

One feature of our approach is that the set of penalties and salience terms is extensible. New ones can be added easily, potentially improving the quality of the results.

4. OPTIMIZING RETARGETING
The video retargeting problem is to choose a cropping window $(x, y, s, s_a)$ for each video frame. As described, the optimal retargeting would be to choose the window such that the penalty is minimized. However, performing this optimization on each frame independently may cause the cropping window to move around, inducing motion that appears like camera motion. Therefore, we must limit the motions introduced by retargeting. Induced motions must be cinematically plausible: they must not be something that a filmmaker would be unlikely to use, as this would yield results that appear as obvious artifacts. We also should consider the preservation of the motion in the original video, as it is part of the video's content that should be preserved.

Our approach considers each shot (a duration of the video taken from a continuous viewpoint) independently. Between shots there is already a discontinuity, so we need not worry about providing continuity across shot boundaries (ramifications of this are considered in Section 6). Our implementation automatically breaks a long video into a sequence of shots using the algorithm of [22]. This simple algorithm is efficient and robust against object and camera motions. First, a color histogram is computed for each frame; color is quantized to improve performance. A shot boundary is detected whenever the histogram intersection between two neighboring frames is below a threshold.
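A sketch of this histogram-intersection test; the bin count and the threshold are illustrative values rather than those of [22]:

```python
import cv2
import numpy as np

def shot_boundaries(frames, bins=16, thresh=0.6):
    """Detect shot boundaries by quantized color-histogram intersection.

    `frames` is an iterable of BGR images.
    """
    prev = None
    cuts = []
    for i, frame in enumerate(frames):
        # Coarse 3D color histogram (quantization improves speed/robustness).
        hist = cv2.calcHist([frame], [0, 1, 2], None,
                            [bins] * 3, [0, 256] * 3)
        hist = hist / hist.sum()
        if prev is not None and np.minimum(hist, prev).sum() < thresh:
            cuts.append(i)   # boundary between frame i-1 and frame i
        prev = hist
    return cuts
```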

To ensure cinematic plausibility, video retargeting considers each shot as a unit. Rather than solving a large constrained-optimization problem to determine the window for each frame in the shot, we restrict the motion to be one of three shot types that are most common in film: a crop (the window remains constant over the shot); a pan (the window moves smoothly over the duration of the shot); and a cut (a discontinuity is introduced and the shot is divided into two independent shots). Each shot type has a small number of parameters (for example, a crop's parameters are $(x, y, s, s_a)$). These shot types are detailed in subsequent sections.

For each shot, video retargeting chooses the cropping window for each frame such that the sum of the penalties over all of the frames is minimized, subject to the constraint that the motion of the cropping window is one of the three shot types. To implement this constrained optimization for a given shot, we determine, for each of the three shot types, the parameter values that give the lowest penalty. Then, from these three options, we select the one with the lowest overall penalty. For each shot type, we use a different algorithm to find the parameter values that minimize the penalty.

4.1 Cropping
The simplest method for performing retargeting is to select a single cropping window for each shot. By not having the cropping window move within the shot, retargeting does not introduce any visual motion. Motion in the source video, both object and camera movements, is preserved.

Choosing the optimal cropping for a shot requires minimizing the penalties across the entire shot. The set of potential cropping windows forms a 4-dimensional space to search: the corner position of the rectangle, $(x, y)$, the scale factor, $s$, and the anisotropy, $s_a$ (a scaling factor that allows the aspect ratio of the cropping window to vary from the aspect ratio of the target). In performing cropping, we minimize the weighted sum of the penalties (§3.2). Because of the discrete and non-linear nature of this objective, we have chosen to optimize it by a brute-force search through the parameter space. Specifically, we step the cropping window position $(x, y)$ pixel by pixel. We iterate the anisotropy factor $s_a$ from 0.8 to 1.2 with step size 0.1, and the scale factor $s$ from 1.0, multiplying by 1.1 until the cropping window outgrows the source frame.

To efficiently compute the information-loss cost for all of the different windows, we first sum the importance over each shot to create a single map. For each window, the information lost can then be computed in constant time using a summed-area table [6].
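A sketch of this constant-time evaluation: build the summed-area table once per shot, then score any window in O(1). Function names are ours:

```python
import numpy as np

def summed_area_table(importance):
    """Integral image of the shot's summed importance map [6]."""
    return importance.cumsum(axis=0).cumsum(axis=1)

def window_importance(sat, x, y, w, h):
    """Importance *inside* the window, in O(1); the cropped-information
    penalty is then (total - inside) / total. Assumes the window lies
    fully inside the frame."""
    x2, y2 = x + w - 1, y + h - 1
    total = sat[y2, x2]
    if x > 0:
        total -= sat[y2, x - 1]
    if y > 0:
        total -= sat[y - 1, x2]
    if x > 0 and y > 0:
        total += sat[y - 1, x - 1]
    return total
```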

4.2 Virtual Pans
Introducing camera motion must be done carefully to avoid introducing artifacts or changing the existing cinematography. For example, we avoid introducing zooms (changing the scales $s$ and $s_a$ within a shot), as they have strong effects on the viewer. In practice, we limit ourselves to the most common camera motion: a horizontal pan. Real cameras create pans by turning on their tripod. For us, a horizontal pan means that the horizontal position of the window $x$ changes over time. We restrict each shot to contain a single pan to avoid a "ping-pong" effect. Examples of pans are shown in Figure 4 and Figure 8.

Figure 4: Three frames from a shot where our system uses a pan to better show moving objects. With a static shot (a), the system must either show the characters at a small scale, or crop them. With a pan (b) the system can show the characters across the entire shot. Source video © 2004 Warner Home Video.

Figure 8: A virtual pan applied. The original shot is a complex camera motion with different characters moving at different times. Our motion salience method (§3.1.2) can distinguish object and camera motion. The system uses a virtual pan (§4.2) to better include the active characters. The range of the cropping window is indicated on the source frames. Source video © 2004 Warner Home Video.

Figure 5: Left: the interpolation function has zero initial and final derivatives, constant acceleration and deceleration, and constant velocity in the middle. Right: the optimal window is determined for each frame and a smooth curve is fit.

Real cameras have mass and do not accelerate instantly. Rapid accelerations are so noticeable and disconcerting to viewers that real tripods for film and video are almost always damped to prevent them. Instant accelerations of the window would therefore be an obvious artifact of retargeting and must be avoided. To do this, we define the horizontal position of the window to be a function of an appropriate form. The function depends on 4 parameters: $x_1$ the initial value, $x_2$ the ending value, $t_1$ the time that the motion begins, and $t_2$ the time that the motion ends. Specifically, the camera does not move until $t_1$, then accelerates, holds a constant velocity, and then decelerates until $t_2$, after which it holds constant:

$$x(t) = \begin{cases} x_1, & t < t_1 \\ \left(1 - d\left(\frac{t - t_1}{t_2 - t_1}\right)\right) x_1 + d\left(\frac{t - t_1}{t_2 - t_1}\right) x_2, & t_1 \le t \le t_2 \\ x_2, & t > t_2, \end{cases}$$

where $d(u)$ is the interpolation function (depicted in Figure 5):

$$d(u) = \begin{cases} \frac{u^2}{2a - 2a^2}, & u < a \\ \frac{a}{2 - 2a} + \frac{u - a}{1 - a}, & a \le u \le 1 - a \\ 1 - \frac{(1 - u)^2}{2a - 2a^2}, & u > 1 - a, \end{cases}$$

where $a$ is a constant. Note that for a given $t$, $x(t)$ is linear in all parameters except $t_1$ and $t_2$.
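A direct transcription of $d(u)$ and $x(t)$, with an assumed acceleration fraction $a = 0.2$ (the paper states only that $a$ is a constant):

```python
import numpy as np

def d(u, a=0.2):
    """Ease-in / constant-velocity / ease-out interpolation of Section 4.2."""
    u = np.asarray(u, dtype=float)
    lo = u < a
    hi = u > 1 - a
    mid = ~(lo | hi)
    out = np.empty_like(u)
    out[lo] = u[lo] ** 2 / (2 * a - 2 * a * a)
    out[mid] = a / (2 - 2 * a) + (u[mid] - a) / (1 - a)
    out[hi] = 1 - (1 - u[hi]) ** 2 / (2 * a - 2 * a * a)
    return out

def pan_x(t, x1, x2, t1, t2, a=0.2):
    """Horizontal window position x(t): constant, smooth pan, constant.
    Assumes t2 > t1; clipping u to [0, 1] covers the outer cases."""
    t = np.asarray(t, dtype=float)
    u = np.clip((t - t1) / (t2 - t1), 0.0, 1.0)
    return (1 - d(u, a)) * x1 + d(u, a) * x2
```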

To find the optimal horizontal pan, we need to search the 7-parameter space $(x_1, x_2, y, s, s_a, t_1, t_2)$ to find the values that minimize the penalty over the shot. The brute-force search used in the previous section is intractable here. Instead, we approximate the ideal solution by first finding the optimal window on each frame individually (using the methods of the previous section) and then fitting the interpolation curve to this data, as schematized in Figure 5.

Given the optimal window for each frame, we perform the fitting with the goal that the top, left, bottom, and right edges of the window on each frame are as close to their ideal counterparts as possible. We fit the curve in a least squares sense. For a given $t_1$ and $t_2$, this is a linear least squares problem and can be solved analytically. We exhaustively search the valid range of $t_1$ and $t_2$, finding the best values of the other 5 parameters for each, and choose the pair with the lowest residual.
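For fixed $t_1$ and $t_2$, the fit can be set up as a linear system, as sketched below. To keep it linear we solve for $p = s$ and $q = s \cdot s_a$ and recover $s_a = q/p$ afterwards; this substitution is our assumption about how the problem is linearized, and it follows the window parameterization of §3.2 (width $w_{targ} \cdot s$, height $h_{targ} \cdot s \cdot s_a$):

```python
import numpy as np

def fit_pan(ideal, d_vals, w_targ, h_targ):
    """Least-squares fit of (x1, x2, y, s, s_a) for fixed t1, t2,
    which are folded into d_vals[t] = d((t - t1) / (t2 - t1)).

    `ideal` is an (n, 4) array of per-frame (left, top, right, bottom)
    edges of the per-frame optimal windows.
    """
    rows, rhs = [], []
    for d_t, (left, top, right, bottom) in zip(d_vals, ideal):
        # Unknowns z = (x1, x2, y, p, q) with p = s, q = s * s_a.
        rows += [[1 - d_t, d_t, 0, 0,      0],       # left edge  = x(t)
                 [1 - d_t, d_t, 0, w_targ, 0],       # right edge = x(t) + w
                 [0,       0,   1, 0,      0],       # top edge   = y
                 [0,       0,   1, 0,      h_targ]]  # bottom     = y + h
        rhs += [left, right, top, bottom]
    a, b = np.asarray(rows, float), np.asarray(rhs, float)
    z, residual, *_ = np.linalg.lstsq(a, b, rcond=None)
    x1, x2, y, p, q = z
    return (x1, x2, y, p, q / p), residual
```

The outer search then loops $t_1$ and $t_2$ over their valid range, precomputes the $d$ values for each pair, and keeps the fit with the smallest residual.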

Once the parameters of the optimal panning motion are computed, the penalty for the retargeted shot using this movement can be computed by evaluating the penalty terms (§3.2) and adding an additional "pan" penalty that discourages the addition of camera motion:

$$p_p = w_p \cdot (t_2 - t_1),$$

where $w_p$ is a pan penalty coefficient.

4.3 Virtual Cuts
Retargeting can introduce a cut to break a shot into two. Such cuts must be introduced carefully: discontinuous camera motions can disorient the viewer if they are not done correctly, and even good cuts affect the pacing of the video.

In principle, we could treat each new subshot as a shot and perform the entire retargeting optimization recursively, potentially dividing each subshot further. In practice, we restrict ourselves to performing a single cut on each shot, since finer divisions would have too drastic an impact on the pacing of the result. We also restrict each subshot to be retargeted with a crop (the window does not move over the subshot). This facilitates determining whether the cut is good, as we need not worry about motion matching issues.

In introducing a cut we must avoid creating one that stands out as obtrusive to the viewer. We restrict introduced cuts such that the resulting subshots are of sufficient length (63 frames), because shots shorter than this have a noticeable effect on pacing and are used only very intentionally by filmmakers. Secondly, we avoid making a "jump cut," the phenomenon where a cut is very jarring to the viewer because the two subshots it divides are not sufficiently different [4].

A cut applied to a shot has nine parameters: the four parameters of the window for each subshot, and the time of the cut $t_c$. To find the optimal retargeting we use an approximation that incorporates some of the constraints that avoid a jump cut. We note that to be sufficiently different, one of the subshots must come from the left part of the frame, while the other must come from the right. We then use the following three steps:

1. For each frame, we compute a penalty for the best window for the left and right sides, $L(t)$ and $R(t)$.

2. We determine $t_c$ and which side comes first by exhaustively trying all reasonable values of $t_c$ and both orderings to see which leads to the minimum. For example, the left-first penalty for a given $t_c$ is
$$P(t_c, \mathrm{left}) = \sum_{i=1}^{t_c} L(i) + \sum_{i=t_c}^{n} R(i).$$

3. We then use the method of Section 4.1 on each subshot.

An example where our system introduces a cut is shown in Figure 6.

Figure 6: A shot where our system inserts a cut. Top: DVD frame. Bottom: 2 frames from each of the subshots. The horse (left) only moves (and therefore becomes salient) in the later part of the shot. Source video © 2004 Warner Home Video.
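Steps 1 and 2 can be sketched with prefix sums so each candidate cut time is scored in constant time; the function and variable names are ours:

```python
import numpy as np

def best_cut(left_pen, right_pen, min_len=63):
    """Exhaustive search for the cut time t_c and subshot ordering.

    left_pen[t] / right_pen[t] are the per-frame penalties of the best
    left-side and right-side windows (step 1); min_len enforces the
    63-frame minimum subshot length.
    """
    n = len(left_pen)
    # Prefix sums let each candidate t_c be scored in O(1).
    lc = np.concatenate([[0.0], np.cumsum(left_pen)])
    rc = np.concatenate([[0.0], np.cumsum(right_pen)])
    best = (np.inf, None, None)
    for tc in range(min_len, n - min_len + 1):
        left_first = lc[tc] + (rc[n] - rc[tc])    # left subshot, then right
        right_first = rc[tc] + (lc[n] - lc[tc])   # right subshot, then left
        if left_first < best[0]:
            best = (left_first, tc, "left-first")
        if right_first < best[0]:
            best = (right_first, tc, "right-first")
    return best
```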

Introducing a cut disturbs the continuity of the original shot. Also, our simple method of determining the cut does not account for other preconditions for quality cuts, such as cutting on action [1, 4, 14]. Therefore, we add an additional penalty $p_c$ to the penalty computed for the optimal cut shot. To further discourage jump cuts, this penalty is proportional to the image similarity between the two subshots, as measured by color histogram intersection [22]. In computing the total penalty for applying a cut to a shot, we discount the information-cropped penalties for the two subshots to account for the fact that some of the cropped information will appear in the other subshot.

5. EXAMPLES
We have implemented the video retargeting methods described in the previous sections on Windows-based PCs. Analysis of the video to determine the local importance is time consuming, as we use inefficient open source implementations of face finding and optical flow computation. We perform it as an offline pre-process. Once the importance information is pre-computed, retargeting requires roughly 3–5 times real time with the unoptimized prototype.

We have used our system to retarget various content from DVDs to a variety of sizes and aspect ratios. For all of our tests, the source comes from "widescreen" DVDs that letterbox the original theatrical presentation. The aspect ratios of the source material range from 1.85:1 to 2.4:1 (these DVDs appear letterboxed even on 16:9 TVs). The system performs as we would expect: choosing static crops, pans, and cut insertion where appropriate.

target size    crops    pans    cuts    failures
120 × 120      141      66      7       22
160 × 120      147      63      4       12
200 × 150      147      67      0       9

Table 1: Shot choices and failure rates on the test set.

Despite its simple importance models and heuristics, our system often chooses good retargeting solutions. In films, many shots have a clear focus of attention (as directors want to facilitate the audience's comprehension), and our system can appropriately choose tighter croppings. In other cases, such as an action scene, importance information is spread over more of the screen, and our system shows more of the frame.

On some shots, our system fails to produce a retargeting solution that conveys the original intent of the shot. For some of these failures, the system cannot be blamed: the source material might be such that there is no way to preserve its content at a smaller size. Even a manual pan and scan operator would need to make a difficult decision to discard something important. Other failures can be directly traced to system components, e.g. the face detector failing to identify a poorly lit face, or our heuristics being insufficient.

To assess our system's results, we selected 3 movies that were known to be difficult for fullscreen presentation, based on web critiques of the fullscreen versions. For efficiency, and to test the system's ability to handle different source sizes, we first downsampled the movies to various sizes. From these films we used 214 shots for our tests, and retargeted each to three different targets: two different sizes of 4:3 displays (200 × 150 pixels and 160 × 120 pixels), and a square display like a PDA's (120 × 120 pixels). These target sizes were chosen because their relationship to the source material is similar to the relationship of current devices to the most common source material.

The shot choices made by our system are listed in Table 1. Our system chooses static crops the majority of the time. As the target grows larger, the frequency with which cuts are used decreases. For all of the target sizes, the average anisotropy used was 1.1, or a 10 percent "squish."

Result quality was assessed subjectively by the authors. We consider a retargeting solution for a shot to be a failure if we prefer a naïve scaling to that solution. On the test set, our system failed less than 10% of the time. Failures have two forms: cropping of an important part of the image, and introduction of a cinematically implausible artifact. Because the baseline for comparison is a maximal scale, we cannot fail by scaling too much.

Camera motions are the largest cause of failures due to cinematic implausibility. Despite our efforts to prevent jump cuts, almost one third of the cuts made by our system were distracting, suggesting that we must be even more conservative about applying cuts. In rare cases pans create cinematic implausibility by changing the direction of the apparent motion of a moving object. The most common form of failure, however, is when the system removes an important element of the image. This happens when either the system fails to identify something as important, or incorrectly decides to focus on some other part of the image.

Figure 7: A clip from a digital camera is retargeted. Despite no identifiable faces and jerky handheld motion, the system still correctly applies a pan. Top: range of pan shown on source frames. Bottom: frames from retargeted result. Source video © 2006 M. Gleicher.

5.1 Home Videos
The approach described in this paper addresses the retargeting of edited videos such as films and television. Our approach makes some assumptions that the video is "good." For example, we assume that the video is broken into shots and that within each shot the camera work is intentional and should be preserved. Amateur "home" videos often fail to meet these criteria. Digital still cameras and video camera cell phones usually take short clips of video. Retargeting these clips is important as they are often timely, so users want to transmit them. Retargeting not only makes the resulting videos more appropriate for portable displays, it can also make them smaller.

Preliminary experiments show that our retargeting system can be applied to short clips from small digital cameras. Despite the jerky, hand-held camera movements these clips tend to exhibit, the importance finding algorithm provides reasonable results. An example of a retargeted clip where a virtual pan is used is shown in Figure 7.

We expect that a real solution for amateur video clip retargeting will require new methods that do not make assumptions about the quality of the source cinematography. However, the importance model from this paper should still be applicable, as should the general approach of seeking to create cinematically plausible virtual camera work.

6. DISCUSSION
Video retargeting is fundamentally limited: we necessarily must throw away information when presenting a video on a smaller device. Our ability to do this well is fundamentally limited by the fact that what is important to preserve depends not only on the low-level visual information that may be available to an automatic system, but also on high-level aspects of the underlying story.

At present, our system relies on only basic, easy-to-obtain information about the video's content. We have chosen to use only salience, face detection, and dominant motion direction, as this information can be obtained with simple, robust algorithms. In a sense, our prototype is an experiment in how well we can do video retargeting based on this limited information.

Incorporating more information poses a number of challenges. First, we need a method of obtaining the new information; second, we need mechanisms for evaluating how much information is lost in a retargeted presentation; and third, we need mechanisms for incorporating these new metrics into our optimization. For example, at present, our system identifies faces in the image, but does not distinguish between them. Computer vision techniques could find facial motion to identify which face is speaking, allowing the heuristic that the speaker is more important. Aside from the obvious caveat that the heuristic is imperfect (sometimes another character's reaction is more important), this new type of information suggests not only new penalty terms, but also new optimization methods that will encourage "cut on speaker change."

There is one category of information that we feel is particularly important for improving the quality of retargeting: coverage. At present, our approach treats each salient pixel and face as independent. There is no notion of something being the same across frames. If our system were to identify when important image features are common across multiple frames, we could introduce metrics that measure the coverage of a retargeting, that is, give importance to ensuring that the objects in a shot are each seen at some time. For example, in a static scene, the system might introduce a virtual pan so that the entire frame is seen more closely.

Presently, our system treats shots independently. Performing analysis across multiple shots may be useful for a number of reasons. For example, information given in one shot may be less important to repeat in a later one; or, since a filmmaker may use a repeated shot composition to convey continuity or repetition or to allow the viewer to witness change, cropping these similar shots differently breaks this rhythm. Therefore, performing video retargeting "globally" over all shots poses a number of technical challenges, both to scale the optimization process and to develop schemes for using the global information.

Better evaluation is necessary for assessing video retargeting. Assessing the quality of the results of our system is difficult. Displaying a video at a small size necessarily discards information, and the choice in how to do this is subjective. Manual pan-and-scan is sufficiently criticized that letterbox presentation is preferred by most movie fans, although we feel that the tradeoffs make aggressive retargeting using cropping important for small devices.

7. ACKNOWLEDGMENTS
This work was supported in part by NSF grants IIS-0097456 and IIS-0416284. Figures 1, 4, 6, and 8 are from Harry Potter and the Prisoner of Azkaban © 2004 Warner Home Video. Figure 3 is from Ever After © 2003 20th Century Fox. Figure 2 is from The Princess Bride © 2001 MGM.

References
[1] D. Arijon. Grammar of the Film Language. Silman-James Press, reprint edition, 1991.
[2] W. Bares and J. Lester. Intelligent multi-shot visualization interfaces for dynamic 3D worlds. In Proceedings of the 1999 International Conference on Intelligent User Interfaces, pages 119–126, 1999.
[3] M. Bianchi. Automatic video production of lectures using an intelligent and aware environment. In MUM '04: Proceedings of the 3rd International Conference on Mobile and Ubiquitous Multimedia, pages 117–123, New York, NY, USA, 2004. ACM Press.
[4] J. Cantine, B. Lewis, and S. Howard. Shot by Shot: A Practical Guide to Filmmaking. Pittsburgh Filmmakers, 1995.
[5] L.-Q. Chen, X. Xie, X. Fan, W.-Y. Ma, H.-J. Zhang, and H.-Q. Zhou. A visual attention model for adapting images on small displays. ACM Multimedia Systems Journal, 9(4):353–364, November 2003.
[6] F. C. Crow. Summed-area tables for texture mapping. In Proc. of SIGGRAPH 84, pages 207–212, 1984.
[7] X. Fan, X. Xie, H.-Q. Zhou, and W.-Y. Ma. Looking into video frames on small displays. In Proc. of ACM Multimedia, pages 247–250, 2003.
[8] A. Girgensohn, J. Boreczky, P. Chiu, J. Doherty, J. Foote, G. Golovchinsky, S. Uchihashi, and L. Wilcox. A semi-automatic approach to home video editing. In Proceedings of UIST, pages 81–89, 2000.
[9] L. He, M. Cohen, and D. Salesin. The virtual cinematographer: A paradigm for automatic real-time camera control and directing. Proceedings of SIGGRAPH 96, pages 217–224, August 1996.
[10] R. Heck, M. Wallick, and M. Gleicher. Virtual videography. ACM Transactions on Multimedia Computing, Communications and Applications, 3(1), 2007. To appear.
[11] Y. Hu, D. Rajan, and L.-T. Chia. Robust subspace analysis for detecting visual attention regions in images. In Proc. of ACM Multimedia, pages 716–724, 2005.
[12] L. Itti and C. Koch. Computational modeling of visual attention. Nature Reviews Neuroscience, 2(3):194–203, March 2001.
[13] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell., 20(11):1254–1259, November 1998.
[14] S. Katz. Film Directing: Shot by Shot: Visualizing from Concept to Screen. Michael Wiese Productions, 1991.
[15] F. Liu and M. Gleicher. Automatic image retargeting with fisheye-view warping. In Proc. of the 18th Annual ACM Symposium on User Interface Software and Technology, pages 153–162, 2005.
[16] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. of the International Joint Conference on Artificial Intelligence, pages 674–679, 1981.
[17] Y.-F. Ma and H.-J. Zhang. A model of motion attention for video skimming. In Proc. of IEEE ICIP, pages 129–132, 2002.
[18] Y.-F. Ma and H.-J. Zhang. Contrast-based image attention analysis by using fuzzy growing. In Proc. of ACM Multimedia, pages 374–381, 2003.
[19] H. Nothdurft. Salience from feature contrast: additivity across dimensions. Vision Research, 40(11-12):1183–1201, 2000.
[20] H. Nothdurft. Salience from feature contrast: variations with texture density. Vision Research, 40(23):3181–3200, 2000.
[21] R. Wang and T. Huang. Fast camera motion analysis in MPEG domain. In Proc. of IEEE ICIP, pages 691–694, 1999.
[22] Z. Rasheed and M. Shah. Scene detection in Hollywood movies and TV shows. In Proc. of IEEE CVPR, pages 343–348, 2003.
[23] R. Rosenholtz. A simple saliency model predicts a number of motion popout phenomena. Vision Research, 39(19):3157–3163, 1999.
[24] Y. Rui, L. He, A. Gupta, and Q. Liu. Building an intelligent camera management system. In Proceedings of the ACM Multimedia Conference, 2001.
[25] A. Santella, M. Agrawala, D. DeCarlo, D. Salesin, and M. Cohen. Gaze-based interaction for semi-automatic photo cropping. In Proc. of CHI, 2006.
[26] V. Setlur, S. Takagi, R. Raskar, M. Gleicher, and B. Gooch. Automatic image retargeting. In Proc. of the International Conference on Mobile and Ubiquitous Multimedia, 2005.
[27] B. Suh, H. Ling, B. B. Bederson, and D. W. Jacobs. Automatic thumbnail cropping and its effectiveness. In Proc. of the 16th Annual ACM Symposium on User Interface Software and Technology, pages 95–104, 2003.
[28] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In Proc. of IEEE ICCV 98, pages 839–846, 1998.
[29] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. of IEEE CVPR, pages 511–518, 2001.
[30] J. Wang, M. Reinders, R. Lagendijk, J. Lindenberg, and M. Kankanhalli. Video content presentation on tiny devices. In Proc. of IEEE Conference on Multimedia and Expo, pages 1711–1714, 2004.