
Multimed Tools Appl (2012) 59:307–340. DOI 10.1007/s11042-010-0710-0

Semantic Mastering: content adaptation in the creative drama production workflow

Dieter Van Rijsselbergen · Chris Poppe · Maarten Verwaest · Erik Mannens · Rik Van de Walle

Published online: 8 January 2011
© Springer Science+Business Media, LLC 2011

Abstract In order to provide audiences with a proper universal multimedia experience, all classes of media consumption devices, from high definition displays to mobile media players, must receive a product that is not only adapted to their capabilities and usage environments, but also conveys the semantics and cinematography behind the narrative in an optimal way. This paper introduces a semantic video adaptation system that incorporates the media adaptation process at the center of the drama production process. Producers, directors and other creative staff instruct the semantic adaptation system using common cinematographic terminology and vocabulary, thereby seamlessly extending the drama production process into the realm of content adaptation. The multitude of production metadata obtained from various steps in the production process provides a valuable context of narrative semantics that is exploited by the adaptation process. As such, high definition imagery can be intelligently adapted to smaller resolutions while optimally fulfilling the filmmaker's dramatic intentions with respect to the original narrative and obeying various rules of cinematographic grammar.

Keywords Semantic adaptation · UMA · Universal multimedia experiences · Cinematography · Drama production

D. Van Rijsselbergen (B) · C. Poppe · E. Mannens · R. Van de Walle
Department of Electronics and Information Systems (ELIS)—Multimedia Lab,
Ghent University—IBBT, Gaston Crommenlaan 8/201, 9050 Ghent, Belgium
e-mail: [email protected]

M. Verwaest
VRT-medialab, Gaston Crommenlaan 10/101, 9050 Ghent, Belgium

1 Introduction

Viewers today have access to a multitude of platforms for the consumption of an ever increasing supply of 'broadcast' audiovisual media: from movie theaters and high definition-enabled or standard definition television screens at home, to smart cellular phones on the road, and online using the web. These platforms vary widely in capabilities and specifications. Content producers and providers are challenged to provide this entire range of different devices with proper access to large quantities of audiovisual products. In practice, this implies that content must be altered to meet the limitations of a user's terminal and network, essentially realizing the promise of Universal Multimedia Access (UMA) [37].

Considering the increasing prevalence of high definition presentation devices, we expect content creators to start actively focusing more on wide, cinematic-like framing. However, this presents issues for mobile device viewers. Due to significantly smaller display surfaces, they cannot reasonably be provided with identical, but simply down-scaled, versions of material originally framed with high definition presentation in mind. Studies have shown that in order to provide a comfortable user viewing experience, adapted images should be constrained to contain the image regions that are most meaningful in terms of the content they represent [17]. Adaptations should be guided by notions of the semantics and narrative that the content being adapted conveys. Such semantic adaptations would, for example, crop the picture to include only a particular story character; or would, more elaborately, pan and scan from one character to another within the imagery recorded for a single original wide camera shot. Hence, an additional effort is required to present users with a worthwhile multimedia experience, despite possibly limited terminal capabilities, thereby turning plain UMA into true Universal Multimedia Experiences [28].

Unfortunately, we have found that the content adaptation systems described in the literature today perform adaptation almost as an afterthought, after regular media production has finished. Many algorithms have been developed that analyze video signals in an attempt to automatically infer possible semantically interesting regions, some of which can be manually annotated by human operators. While many of these systems can operate in a context-agnostic fashion, none of the original production information concerning the semantics associated with the source material is reused. A rich set of production metadata, whether on paper, in electronic form, or implicit in the heads of production people, is left unconsidered. Having this production metadata available can help us reduce the semantic gap between audiovisual signals and their original narrative and aesthetic intentions. This reduces the need for computational aesthetics algorithms [14] that can produce incorrect conclusions, and provides a semantically rich context in which content adaptation can be performed.

One particular class of media production where semantics significantly influence the structure and imagery of the final product is drama production, which includes soap operas, prime time quality fiction and motion pictures. In fact, the semantics of the story drive the entire production process. Narrative and creative decisions are taken to emphasize specific aspects of this story. We have built an adaptation system that includes notions of the adaptation process from early on in the production process. This allows directors, producers and other creative staff to decide which interesting objects in the video frames should be retained for smaller displays. The creative staff is in a better position to make these decisions than automated systems or operators outside the production chain would be. After all, cinematography can be considered a work of art and must be handled carefully when being adapted for various output channels. Because drama production involves many creative planning decisions anyway [40], we let the drama crew define the adaptation parameters them-


selves. However, we are aware that the additional burden placed on the production crew should be limited, and as such, we have balanced their required efforts against the amount of automated algorithms used to drive the actual adaptation. Essentially, we have constructed a system that is interactive and provides the essential elements of intelligence.

In the following section, we define the functionality of our semantic adaptation process and describe how it can be incorporated into the existing drama media production workflow. An overview of the related work concerning spatial video adaptation is presented in Section 3. Section 4 explains the concepts behind our semantic adaptation system and how it is shaped by cinematographic vocabulary and grammar. In Section 5 we explain how we implemented our semantic adaptation system, after which we provide an evaluation in Section 6. We also list a number of suggestions for future research, and conclude this paper with Section 7.

2 Semantic Mastering and the drama production process

We have seamlessly integrated the semantic adaptation process into an existing drama production workflow. In this section, we provide an overview of this extensive drama production workflow and explain how semantic adaptation was included. The workflow we describe in this paper represents the typical production process for various drama productions, as we have observed from research in the field, as well as in the literature [40]. Our assumption about the implementation of this workflow in a file-based production facility is quickly becoming a reality, as most broadcasters and production houses are transitioning away from legacy tape-based systems [19]. The tight integration of all components and processes, connected by extensive electronic metadata streams, is not yet realistic in practice, although proof-of-concept systems do exist, one of which our adaptation system is based on [12]. The processes and the workflow metadata that flow between them are mapped out in Fig. 1.

Fig. 1 The drama production workflow with semantic adaptation processes

2.1 Script writing and 3-D previsualization

In conventional drama production workflows, the production process typically starts with the definition of a story synopsis, which is later extended into a complete script or screenplay during the script writing process. Once approved, the screenplay is then elaborated by the director into a shooting script. This document defines which aural and visual points of view of the scene must be realized and serves as the template according to which cast and crew performances will be coordinated. In some cases, the functionality of these processes can be combined into a single previsualization step where scenes are set up in a virtual 3-D environment, in which characters are placed, dialogue is written and virtual cameras are parameterized and animated [6].

Right from the beginning of the production workflow, objects of semantic interest (OoI) are defined. The narrative described by screenplay documents involves a number of characters that actively participate in the story (e.g., as protagonist or antagonist), and prop objects which serve a more illustrative function but can be semantically relevant nonetheless. A scene's narrative progresses by means of a series of events that impact the scene's OoIs. Typically, this encompasses dialogue exchanged between characters, but it can also be actions performed by characters or other OoIs.

2.2 Acquisition

The Acquisition process, where real-life audiovisual essence is recorded on sound stages or on location, is typically the most prominent drama production process. During acquisition, takes or realizations are recorded for the duration of a single performance—usually a single scene—from the perspective of one or more acquisition devices such as cameras and microphones. Because this process is driven by the director's shooting script, the acquired file-based essence can be directly associated with the semantics of the screenplay [40].

2.3 Analysis and quality assurance

After acquisition, a combination of Quality Assurance and Analysis steps determines the fitness of the acquired audiovisual material. Quality Assurance and Analysis do not directly contribute to the production of audiovisual media assets, but they are important processes that help boost the efficiency of the production process further along the workflow. Some material will be deemed unusable for further editorial operations, while other material will be kept only as a last resort or for purposes other than ending up in the final product (e.g., as part of a gag reel). During Quality Assurance, logging and continuity information gathered during acquisition is collected, conformed, and streamlined to accompany audiovisual media assets down the production pipeline.

The generic Analysis label represents any combination of processes that generates new information based on the audiovisual characteristics of the recorded material. This includes color information and feature extraction, shot-cut detection and the selection of representative key frames. In our system, it is primarily focused on the detection of OoIs in audiovisual media assets, such that they can be reasoned with by the semantic adaptation processes.

2.4 Semantic Mastering and adaptation

In professional audio production, mastering denotes the process where sound operators prepare and optimize the dynamic range of a sound track for a given output medium. Similarly, we reuse this term for our implementation of a semantic adaptation system. The process where filmmakers decide, expressed using semantic elements, how the video repurposing should be performed is called Semantic Mastering. We also define a counterpart process, Semantic Adaptation, that performs the actual essence transformations using the mastering specifications as input.

In order to let production people feel quickly at home with the Semantic Mastering process, it should naturally extend existing drama production practices. This is accomplished by exposing mastering and adaptation to users using proper cinematic terminology—which includes shot types such as close-ups and long shots and verbs such as panning, tilting, zooming—and by incorporating notions of the semantic elements of the production. In particular, the scene OoIs are actively used by directors to determine picture framing, e.g., a close-up of a character. The crew can take creative decisions on how a virtual camera, implemented within the imagery of the original footage, should behave for different mastering outputs. Our Semantic Mastering concept is depicted in Fig. 2. Two soap opera characters, Jenny and Rosa, were originally shot at either side of the frame. For display area-constrained products, Jenny was deemed more relevant, so the focus is shifted there in the adapted image (indicated by a dashed rectangle). The Semantic Mastering parameters specified are: "show a waist-shot of Jenny at the left of frame". Additional adaptation parameters could be used to shift the image back and forth between characters as the dialogue progresses. Instructions are given to the adaptation system in a similar fashion as they are conveyed to the camera crew on the set. By implementing the Semantic Mastering process at this point in the workflow, the adaptation parameters can be defined for use with the originally acquired media assets or 'dailies', without them having been edited, cut and intertwined with other takes or camera perspectives. This simplifies the mastering process as it does not have to work around irreversible shot cut boundaries introduced during editing operations. As a result, mastering parameters are specified over the length of an entire take of continuously shot footage, reducing the amount of work required during mastering, and increasing the continuity and fluidity of the executed semantic adaptations.

Fig. 2 An example of Semantic Mastering

2.5 Editing

In the meantime, craft editing is performed on the acquired high definition essence. Editing decisions are imported into the production workflow so that they can be re-applied for other output profiles. Although temporal adaptations have not been a topic of research in this work, different edits can be made explicitly for different output profiles. Although, in principle, the Semantic Mastering and Editing processes are executed independently of one another, any available results from the editorial processes can be previewed during Semantic Mastering in order to tweak the adapted media assets for better consistency within the final edit. The Semantic Adaptation process will include functionality to render edit decision lists (EDLs) such that edit decisions can be reapplied to adapted media assets and a coherent visual narrative is constructed. However, the specifications for both processes are still modified independently.

3 Related work in video adaptation

We give an overview of related work in the area of video content adaptation whose purpose is to create better-suited media assets for playback on display size-constrained devices. Other work related to specific functionality in our system will be discussed in the respective sections. Many kinds of customization operations can be performed on audiovisual material to make it better suit a given playback platform [24]. Our work concerns semantic-driven spatial customization, guided by the narrative 'behind' the adapted images.

When considering related work whose primary purpose is the spatial transformation of media assets for smaller resolutions, spatially-optimized and generic approaches can be distinguished. The first class of systems attempts to identify an optimized adaptation scheme based on the visual and spatial characteristics of the original media asset. This category can be further divided into approaches that strive for complete automation, and those that allow or require some form of user interaction (possibly in addition to some level of automation), the latter of which our work belongs to.

3.1 Spatially-optimized automated adaptation systems

Setlur et al. [34] have used image segmentation and correlated pixel clustering techniques based on saliency maps and face detection to obtain regions of interest (ROIs) in images. Depending on the original composition, the image is simply cropped or a recomposition operation is executed, where the unimportant background parts of the image are extracted, in-painted and resized, after which the ROIs are recomposited with the new background. A similar approach is used by Cheng et al. [9], but their composition operation attempts to obey media aesthetics when recompositing ROIs onto the in-painted background plane. In particular, relative sizes and depth relations between ROIs are maintained, as is the size of ROIs across a number of video frames. Liu and Gleicher [21] do not recompose the final image, but rather perform non-uniform fish-eye warping of the background regions that surround the image ROIs. While a non-homogeneous resizing technique is also used by Wolf et al. [43], a face detection step was added here to improve the accuracy of detected ROIs and to boost the relevance of ROIs representing human characters. Deselaers et al. describe how the saliency and optical flow detection used in many of these works can be improved by incorporating a parameter training mechanism [13]. Avidan and Shamir [2] present seam carving as a significantly different image size reduction technique. Detected continuous low-energy seams can be removed from images, usually with minimal visual impact. This technique is especially effective for images with few and dispersed ROIs. Rubenstein et al. [30] extend upon this work by implementing seam carving for video sequences instead of individual frames, removing two-dimensional seam manifolds from the space-time volume that video sequences represent. In order to maintain visual appeal, seams are only removed if they are consistently present in a number of consecutive frames. Kopf et al. [18] in turn improve on this work by providing a number of optimizations. Visual image stability is improved by compensating for camera movement. Computational complexity is also reduced, enabling seam carving to be applied to high definition video sequences, a feat practically impossible in [30] due to the size of the generated intermediary graph data structures.

Clearly, a wide variety of video size reduction techniques has been developed. However, the work listed above fails to create a clear separation of concerns between the detection and definition of ROIs on the one hand, and the assignment of semantic importance to these ROIs on the other hand. Due to the generic nature of the listed approaches, the importance of a ROI is assumed to correspond to the perceived importance of its objective visual features. Especially in the context of drama productions, misinterpretations of ROI importance are then likely to occur. In fact, Liu and Gleicher [22] note the difficulty of deriving proper relevancy information for ROIs with automated and objective detection techniques, given the complex motivations cinematographers might employ for emphasizing, or not emphasizing, objects of interest. As such, they suggest the use of user annotations that can provide additional hints to the adaptation system. They also describe their need for shot cut detection to separate unrelated camera viewpoints, which is a step we can avoid by adapting material before it is temporally edited. A notion of virtual pans and virtual cuts is employed as an abstraction of the bare cropping and scaling operations used, similar to our model introduced in Section 4. Penalties can be associated with retargeting operations that violate the rules defined for the proper implementation of such pans and cuts.

None of the image size reduction techniques listed above offers the single best solution in all situations and for all video content. A study conducted by Rubenstein et al. [31] has investigated this issue and proposes combinations of various operations to produce better results. In order to avoid distortions of the original imagery, which is often the result of careful planning by a cinematography team, we have currently only employed the conservative scaling and cropping operators. However, depending on the preferences of filmmakers, future revisions of our system could employ multi-operator image operations to allow more flexible video retargeting.

Bertini et al. [5] and Seo et al. [33] provide interesting examples of how domain-specific knowledge—in both cases concerning sports—can be employed to significantly enhance objective detection results and infer semantically relevant information. Advance knowledge about a sport teaches us the importance of the ball, which should hence be located within the adapted image whenever possible [33]. Similarly, the playing field can be detected from the white lines visible in the video images. Match highlights can also be detected with finite state machines that model the flow of events in a typical match [5]. Instead of attempting to re-detect domain-specific semantic knowledge, we will extract this information where possible from the production workflow in which our adaptation process is embedded. Admittedly, the particular nature of drama production does provide a richer set of such semantic metadata than would ever be available for other formats like sports or news events.

3.2 Spatially-optimized user-assisted adaptation systems

Little related work seeks a balance between automated approaches and an active involvement of users in guiding the adaptation process. Due to the lack of semantic information and context concerning the adapted material, this is often quite understandable, and user intervention would be too cumbersome to be practical. As we mentioned, Liu and Gleicher [22] did hint at the possibility of including some user annotations to aid the retargeting process. In addition, two other examples are of particular interest. Chen and De Vleeschouwer [8] present a system for the automated production of personalized basketball match videos from video streams recorded from various viewpoints around the basketball field. Viewer preferences concerning viewpoints and individual salient objects are incorporated into the final decision of selecting a particular video stream and of zooming in onto specific players. Explicit attention is also paid to smoothing these decisions in order to limit disturbing fluctuations. Unfortunately, no provisions were made to let users decide exactly how salient objects must be framed in the output image for an improved visual impact.

The work most closely related to our own is that presented by Krähenbühl et al. [20], who acknowledge the importance of involving production people in the video retargeting process and have actively sought to consider the retargeting process in the entirety of production and post-production workflows. Automated saliency and motion detection is combined with a number of user tools for guiding the image retargeting and warping operations. Polygonal envelopes of interesting regions can be explicitly drawn and linearly interpolated between, avoiding the need for users to annotate the interesting region in each and every frame. Additionally, users can assign location preferences such that, in addition to size and uniformity, the position of a given region is also maintained without being warped in the retargeted image. Unlike the approach presented there, where regions are assigned specific coordinate preferences, we wish to make a stronger abstraction of the targeting process by reusing the filming vocabulary and grammars that would be actively used in the traditional acquisition of audiovisual essence for dramatic performances. This also allows for a stronger separation between the physical projection of objects of interest onto regions in the images, and the significance of these objects for the narrative at a given moment in time.

An earlier version of our adaptation system [41] suffered from a number of shortcomings that we address here. While also operating within the rich context of production metadata, users were asked to manually define and tweak desired OoI position and size parameters, which was too cumbersome to be practically usable. In this paper, we provide a number of abstractions that hide rudimentary OoI framing parameters and allow users to communicate adaptation intentions at a higher level.

3.3 Generic adaptation systems

Unlike spatially-optimized systems, generic adaptation systems attempt to leverage content-agnostic mechanisms for the adaptation of media assets. These systems are geared toward ease of extensibility with respect to new coding formats and container formats without the need for complex format-specific code; a few such systems are described in [16, 23, 35, 36, 38]. Such systems often take advantage of media coding specifications that are inherently scalable and, as such, can be adapted using simple content-agnostic extraction operations. While this makes these systems very efficient, it significantly reduces the granularity and flexibility of the adaptations that can be performed. Since they address the adaptation issue from a different perspective, we consider generic adaptation systems to be complementary to our work. In fact, systems of both approaches can be tiered together such that the spatially-optimized system provides hints to the generic adaptation system and an adaptation is procured that is optimal within the limited solution space of generic adaptation functionality.

4 Basics of Semantic Mastering

The Semantic Mastering instructions discussed in Section 2 instruct a virtual camera, implemented within the imagery of the original footage, how to behave and which objects of interest to frame in its output image.

The way in which photographic imagery is obtained is illustrated in Fig. 3. The original camera is pointed at a scene. The lens of the camera receives light that has hit objects in the scene and projects these light rays onto a photosensitive device (whether by means of digital electronics or celluloid film) where the projected image of the scene becomes an audiovisual media asset. The image obtained is equivalent to the image projected from the world through the center of the lens onto plane α shown in Fig. 3. Our aim is to define spatial adaptation as cropping and scaling operations that extract a part of the original image. By approximation, this extracted image region (region β in Fig. 3) corresponds to the image that would have been obtained by a different camera (the Mastering Camera) with a longer focal length (i.e., larger zoom factor), positioned facing the center of the retained image region but still orthogonal to plane α. While, in fact, some perspective skew would occur due to the asymmetry of the extracted image region with respect to the original image center, this would only slightly change the image. The analogy is sufficiently accurate to determine the cinematographic restrictions and degrees of freedom of the virtual camera.

Fig. 3 The basics of Semantic Mastering: the Mastering Camera records a part of the original imagery

The parameters of the virtual mastering camera are restricted. The virtual camera can only be positioned in a plane defined by the position and orientation of the original acquisition device. This fixes both camera panning and tilting angles and its coordinates along the z-axis. Furthermore, the x- and y-coordinates are constrained such that the virtual camera projection fits within the area of the original image. Because the spatial resolution of the original imagery will be limited, it will not be possible to extract an infinitesimally small region of the image, because it would lack sufficient image detail when applied to the actual source material. The focal length of the virtual camera will hence need to be limited such that no up-sampling operations are required to fit the extracted region to the output device, thereby maximizing the image quality.
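
To make the no-up-sampling restriction concrete, the following minimal sketch (in Python; the function and the example resolutions are illustrative assumptions, not taken from the paper) computes the range of admissible crop widths for a given source frame and target output: the crop must be at least as wide as the output raster, yet small enough that a rectangle with the output aspect ratio still fits inside the source frame.

def crop_width_range(src_w, src_h, out_w, out_h):
    # Valid widths for a crop rectangle that has the output aspect ratio:
    # at least out_w wide (so no up-sampling is needed) and small enough that
    # the implied height still fits inside the source frame.
    aspect = out_w / out_h
    max_w = min(src_w, src_h * aspect)   # largest crop that fits the source
    min_w = out_w                        # smallest crop needing no up-sampling
    if min_w > max_w:
        raise ValueError("source too small for this output without up-sampling")
    return min_w, max_w

# Example: a 1920x1080 source adapted for a 480x360 (4:3) mobile output.
print(crop_width_range(1920, 1080, 480, 360))   # -> (480, 1440.0)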

4.1 Framing idioms and the virtual mastering canvas

Users will not control the virtual mastering camera directly. Mastering instructions convey framing preferences by referring to framing idioms, which are stereotypical ways of capturing the events in a scene using a combination of one or more shots or camera points of view. Over the history of the cinematographic arts, a vocabulary of idioms has been defined that is known to work effectively [1, 25]. Examples of such idioms include the close-up, medium shot, panning shot, and over-the-shoulder composition shot. The close-up is illustrated in Fig. 4a; the subject's head serves as the reference for the retargeted image. Based on the literature [1], a close-up shot will show just a little more than the entire head of the subject. Additionally, the subject's head is often shown centered in the image, in both the horizontal and vertical direction. Assuming that the retargeted image will eventually conform to some image aspect ratio of an output device, we can omit an explicit constraint for the other dimension and provide only a single minimum value: the width of the character's head. Similarly, a medium shot is defined using the properties depicted in Fig. 4b. Medium shots show about half of a character, typically from the head to the waist, plus some additional space above the character, to avoid an artificial feeling of composition. A centered horizontal orientation is preferred, while the ideal vertical position is explicitly defined.

Fig. 4 Properties of the close-up (a) and medium shot (b) framing idioms. Close-up: out.height relative to subject:head.height, out.x ~ centers(subject:head), out.y ~ centers(subject:head), out.width ≥ subject:head.width. Medium shot: out.height relative to subject.height, out.y = subject.y − subject.height, out.x ~ centers(subject), out.width ≥ subject.width.

The presented virtual camera model serves us well in gaining an understanding of the limitations and possibilities of the mastering camera. At this point, however, the actual definition of the mastering camera state is most easily conveyed using basic rectangular coordinates relative to the original imagery, referred to as out. This avoids redundant back-and-forth conversions between projected and world coordinates. As illustrated in the properties listed in Fig. 4, idioms can define output parameters explicitly, or they can provide symbolic values which will be filled in and optimized by the adaptation engine. For example, the close-up requests that the subject should be placed at the center of the output image, without providing an absolute value.
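
As a concrete illustration of how such symbolic idiom properties translate into an actual out rectangle, the sketch below derives a close-up crop from a head bounding box. It is a simplified stand-in for what the adaptation engine does, not the system's implementation: the headroom factor, the 4:3 aspect and the example coordinates are assumptions for illustration, not values taken from Fig. 4.

def close_up_rect(head_box, out_aspect=4 / 3, headroom=1.2):
    # head_box = (x, y, w, h) of the subject's head, in source pixels.
    # The crop is slightly taller than the head ("just a little more than the
    # entire head"), centred on it, and widened to the output aspect ratio.
    hx, hy, hw, hh = head_box
    crop_h = hh * headroom
    crop_w = max(hw, crop_h * out_aspect)
    cx, cy = hx + hw / 2, hy + hh / 2          # centre the crop on the head
    return (cx - crop_w / 2, cy - crop_h / 2, crop_w, crop_h)

# A head at (900, 200) of size 150x200 yields a 320x240 close-up crop:
print(close_up_rect((900, 200, 150, 200)))     # -> (815.0, 180.0, 320.0, 240.0)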

4.2 Film grammar and the virtual mastering canvas

In addition to the constraints laid out by the framing idiom vocabulary, cinematography is also subject to a number of common grammatical rules that, in order to produce visually and narratively attractive scenes, place additional constraints on the overall camera placement and combinations of idioms. In particular, it is important to provide viewers with various forms of scene continuity to avoid confusion or fatigue [1, 25]. The primary measures are the following.

– All viewpoints and cuts between shots should be kept to one side of a common line of interest or action axis that divides the scene in two. This rule holds unless transitions from one side of the line to the other are explicitly shown, in which case the original line also shifts.

– Spatial scene consistency should be preserved by consistently showing characters on the same side of the frame when cutting back and forth between shots. This holds, again, unless a transition is explicitly shown.

– Jump cuts between shots that differ only slightly in terms of framing should be avoided. Consecutively displayed shots should differ significantly in setup and composition.

Any media asset that is generated by our adaptation system should also abide by these grammatical rules. And because these cinematographic rules fall beyond the direct control of the production crew, who will only be defining mastering instructions, their application must be observed by the adaptation system. The specific nature of the mastering camera does influence the way that these grammatical rules must or can be enforced. In particular, maintaining the line of interest's side is no longer applicable, since the virtual mastering camera cannot venture beyond the imagery of the original camera. Adaptation can only comply with this rule if the media asset that is being adapted already does. Jump cuts and spatial object location consistency, on the other hand, must be observed during adaptation.

5 Implementation of Semantic Mastering

This section discusses how the Semantic Mastering processes were implemented. Figure 5 shows the functional components of the Semantic Mastering system and the information they exchange.

The Semantic Mastering system stores all production metadata and media assets in a central AV Media Production repository. In addition, a Knowledge Base (KB) stores, infers and provides information about OoIs. In particular, the KB contains information about how OoIs have been projected onto audiovisual essence, such that they can be reasoned upon. The Analysis processes populate the KB with this projection information. The KB is elaborated upon in Section 5.1.

Drama production documents that define screenplays and shooting scripts can be extended with the mastering instructions introduced in Section 2. The textual instructions are parsed into framing preferences and processed to be placed along a timeline that runs parallel to the original media assets and allows the virtual mastering camera to be animated such that correctly adapted images are obtained. This functionality is further discussed in Section 5.2.

Section 5.3 describes how the Framing Idiom Processor (cf. Fig. 5) combines framing preferences, properties defined by the various framing idioms and OoI projection information into a uniform set of framing constraints. The Composition Engine, finally, contains the actual intelligent components that compute properly retargeted video sequences from the given mastering instructions. It composes the different OoIs into a single rectangular output image that fulfills as many framing constraints as possible. The Composition Engine also attempts to maintain scene consistency wherever possible. Section 5.4 discusses how the Composition Engine operates.

Fig. 5 Component architecture and information overview of the semantic adaptation system

For the Mastering process, an application named Kameleon (cf. Fig. 6) was developed that allows the creative staff to interactively define mastering instructions and to preview the executed scaling and cropping adaptations. Kameleon imports the existing screenplay and shooting script documents that were used to produce the original audiovisual material and uses them as a template for the mastering process. This way, all narratively important information is at the disposal of the Semantic Mastering operator. The functionality provided by the Semantic Adaptation Engine, which includes the Composition Engine, the Instruction Parser & Processor and the Framing Idiom Processor, is employed in Kameleon to provide previews of adaptation operations.

Fig. 6 The Kameleon application provides the GUI front-end of the semantic adaptation system

5.1 The Knowledge Base

The Knowledge Base stores information about the scene OoIs that are referred to during Semantic Mastering. The KB accepts asserted knowledge from the Analysis processes, and also supports information inferencing for the deduction of new knowledge from existing facts. We will not discuss the inner workings of the KB in detail, as this would unnecessarily complicate and lengthen this paper; instead, we give a short overview of the knowledge stored and inferred by the KB.

First and foremost, the KB stores information that describes how scene OoIs have been projected onto the 2-D plane of an audiovisual media asset. This projection information, in the form of spatial coordinates, enables the Composition Engine and Framing Idiom Processor to reason about OoIs and their positions in the original audiovisual material, and in adapted video streams. The KB also defines notions of the way OoIs are constructed from different physical parts. In particular, the KB contains an ontology that defines the anatomical structure of the human body. This functionality supports framing idioms that use specific character parts as a basis for their reframing constraints. One example is the close-up, which uses a character's head as the reference for the position of the output image.


In cases where projection information is lacking for specific subject parts, an inferencing mechanism is used to derive an estimation of this information from the facts that are available. For example, given average body proportions, a reasonable estimate can be made about the location of various body parts, such as heads and shoulders. The union of inferred and asserted knowledge is exposed to the other components of the Semantic Mastering system.
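
A minimal sketch of such an inference, assuming nothing more than the common artistic rule of thumb that a standing adult is roughly eight head-heights tall; the KB itself uses an anatomical ontology and inference rules rather than a fixed formula, so the proportions and function name below are purely illustrative.

def estimate_head_box(body_box):
    # body_box = (x, y, w, h): full-body bounding box with (x, y) the top-left
    # corner, in source pixels. Returns an estimated head bounding box.
    x, y, w, h = body_box
    head_h = h / 8.0                     # assumed body-to-head proportion
    head_w = min(w, head_h * 0.75)       # heads are narrower than they are tall
    head_x = x + (w - head_w) / 2        # assume the head is horizontally centred
    return (head_x, y, head_w, head_h)

print(estimate_head_box((400, 100, 220, 640)))   # -> (480.0, 100, 60.0, 80.0)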

5.2 Integration and processing of mastering instructions

The script writing and previsualization applications in our system use the Movie Script Markup Language (MSML) [42] as the storage and exchange format for scene specification documents, including screenplays and descriptions of 3-D previsualized scenes with animation and shooting scripts. For uniformity and ease of integration, the Semantic Mastering and adaptation processes will also use MSML as the mastering description format. MSML comprises an object-oriented model with a Scene object that contains Events, Components and Entities, illustrated in Fig. 7. We give a brief overview of the objects applicable to our adaptation system; an extensive discussion of all MSML objects is given in [42]. The OoIs that were defined in Section 2 can also be found here: Characters and Props, which are subclasses of Entity. Additionally, the MSML model defines an Event object and three relevant subtypes: Action, Dialogue and Instruction. The latter conveys instructions and guidelines concerning the acquisition of the scene. Instruction objects that provide guidelines to the production crew are typically written in a free-text fashion. However, by defining a formal instruction grammar and assigning them a "mastering" type, we can use Instructions to define Semantic Mastering operations within the context of the actual screenplay and shooting script document from which the original audiovisual assets were produced.

With the mastering instruction mechanism properly in place within the rest of the MSML specification, mastering instructions can be processed and applied to the audiovisual timeline of the media asset that must be adapted. The sequence of operations is shown in Fig. 8: an Instruction (Fig. 8:1) is parsed into a number of framing expressions (Fig. 8:2), which are then placed on a timeline that will be matched to play back along with the audiovisual media asset for adaptation (Fig. 8:3). Where applicable, framing expressions will be made to overlap in order to provide gradual transitions from one expression to the next.

Fig. 7 Main objects of the Movie Script Markup Language


Fig. 8 Mastering instructions are processed into framing idiom references with subject and orientation preferences and are then placed onto a timeline with proper transitions

Mastering instructions are parsed using an ANTLR-generated [26] lexer/parser pair and have been formally defined using the grammar in Listing 1. The grammar permits the use of sufficiently natural language instructions in most cases; note, for example, the instruction in Fig. 8:1. However, there are some cases where this simple grammar and its lack of proper punctuation limit the fluidity of instruction sentences. One such example is "...end with gradually to a close-up". This, however, does not hinder comprehension, but could be remedied by using a more potent language processor, for instance PENG (Processable English) [32], which is beyond the scope of this work. The leaves and branches of the Abstract Syntax Tree (AST) generated by the parser are mapped to new FramingExpression data structures (cf. Fig. 8:2). A single FramingExpression associates a framing idiom identifier with a list of OoI subjects, each of which can be assigned a vertical and horizontal orientation preference.

Instructions are anchored to the scene timeline by means of synchronization markers (also shown in Fig. 8). These markers are a provision of MSML and allow events to be synchronized in relation to other events or to fixed points in time. By embedding markers at either end of an instruction, its begin and end times can be set. Markers are set by the Semantic Mastering operator by dragging instruction end points onto other events using the Kameleon application GUI. Markers can also be placed within instruction content, between two framing expressions, to assign a specific time to the enclosed transition. As such, transitions can be made to coincide with other events in the scene. Three durations can be chosen for transitions between two framing expressions:

– Instant transitions have no overlap and can be used to introduce cuts between significantly different camera perspectives.

– Gradual transitions use half the duration of either framing expression as the overlapping period.

– Quick transitions (the default selection) have a much shorter overlap period (about 200 ms, but configurable) and are meant to simulate a typical human reaction time for the transition between instructions, which therefore occurs less suddenly than an instantaneous cut. A minimal sketch of how these overlap durations could be computed follows the list.
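
The sketch below turns the three transition types into overlap durations. It is only one interpretation of the rules above (in particular, "half the duration of either framing expression" is read here as half of the shorter adjoining expression); the function and parameter names are illustrative, not part of the system.

def overlap_duration(kind, prev_len, next_len, quick_s=0.2):
    # prev_len and next_len are the durations (in seconds) of the two adjoining
    # framing expressions; the return value is their overlap in seconds.
    if kind == "instant":
        return 0.0                              # hard cut, no overlap
    if kind == "gradual":
        return min(prev_len, next_len) / 2.0    # half of the shorter expression
    if kind == "quick":
        return quick_s                          # ~200 ms human reaction time
    raise ValueError("unknown transition type: " + kind)

print(overlap_duration("gradual", 4.0, 6.0))    # -> 2.0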


Listing 1 Semantic Mastering instruction grammar expressed in Augmented BNF

instruction     = ‘overall’ framing_expr
                / ‘begin with’ framing_expr *(‘, then’ framing_expr) ‘, end with’ framing_expr
framing_expr    = [transition_type] framing_type subj_expr / ‘original’
framing_type    = (‘a’ / ‘an’) IDIOM_IDENTIFIER
transition_type = ‘gradually to’ / ‘instantly to’
subj_expr       = ‘of’ entity_spec *(‘and’ entity_spec)
entity_spec     = SUBJ_IDENTIFIER [‘at’ orientation *(‘+’ orientation)]
orientation     = ‘left’ / ‘right’ / ‘top’ / ‘bottom’ / ‘center’
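
The structures below sketch, in Python, what the parser could produce from such an instruction. The field names follow the FramingExpression description in Section 5.2; the dataclass layout and the second framing expression in the example are assumptions made for illustration (only "a close-up of Jenny at left" is taken from Fig. 8).

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SubjectRef:
    name: str                                              # OoI identifier, e.g. "Jenny"
    orientations: List[str] = field(default_factory=list)  # e.g. ["left"]

@dataclass
class FramingExpression:
    idiom: str                                             # IDIOM_IDENTIFIER, e.g. "close-up"
    subjects: List[SubjectRef] = field(default_factory=list)
    transition: Optional[str] = None                       # "gradually" / "instantly" / None (quick)

# Hand-built parse result for an instruction such as
#   "begin with a close-up of Jenny at left,
#    end with gradually to a medium shot of Jenny and Rosa"
# (the second framing expression is hypothetical).
parsed = [
    FramingExpression("close-up", [SubjectRef("Jenny", ["left"])]),
    FramingExpression("medium shot",
                      [SubjectRef("Jenny"), SubjectRef("Rosa")],
                      transition="gradually"),
]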

5.3 The definition of framing idioms and the composition tree

The Framing Idiom Processor combines framing expressions conveyed by mastering instructions with the properties of a framing idiom (cf. the examples given in Fig. 4) and translates these into a declarative data structure, the ConstraintSet, which we will define in the next subsection. This is the only data structure that will be used to convey framing preferences to the Composition Engine. The Framing Idiom Processor is queried for a ConstraintSet at each relevant time point in the adaptation timeline (e.g., each video frame), to allow for framing preferences that vary over time. Of course, depending on the framing idiom definition, the Framing Idiom Processor can successively output identical ConstraintSets. Before passing ConstraintSets on to the Composition Engine, the Framing Idiom Processor will also query the KB to resolve and translate symbolic references to subjects and their parts into projected screen-space coordinates.

5.3.1 ConstraintSets

The ConstraintSet, depicted in Fig. 9a, defines a number of fields that directly express a framing intent of an OoI with respect to the mastering camera's reframed image defined in Section 4. Two orientation fields (hOrientation and vOrientation) specify in which direction from the center of the output image the OoI should be located. Symbolic values are LEFT, RIGHT, CENTER (in both directions), TOP and BOTTOM. Directions given in mastering instructions are likely to be passed along directly into these fields, e.g., "... at left+top". An optional offset field represents the extent by which the OoI is located away from the center of the image, in the direction of the requested orientations. The definition of the framing idiom determines the value of this offset field. These fields are completed by a set of constraint expressions that define idiom-specific requirements. Two types of such expressions can be used:

– Membership expressions that relate an output variable to its ideal value, and impose limits on the range of this variable, i.e., they define an interval with an optimum: [minimum, optimum, maximum].

– Equality expressions that relate an output variable to its ideal value. These are equivalent to membership expressions that specify no minimum and maximum.

While constraint expressions can override the orientation and offset values, they are primarily used to define the size of the reframed image with respect to the OoI being framed. The levelOfInterest field of a ConstraintSet is used for various purposes and expresses the relative importance of the preference in the ConstraintSet (and hence, of the OoI it represents). Examples of how this field is used are given further on in the text.

Figure 9b shows the ConstraintSet for the first framing expression from Fig. 8, "a close-up of Jenny at left", which is a translation of the properties shown in Fig. 4a. The vertical orientation has been set to CENTER. The horizontal orientation, while by default also CENTER, has been overridden as LEFT by the instruction. The constraints reflect the optimum and minimum height of the reframed image in relation to Jenny's head (close-ups ignore subjects other than the first), in the form of membership expressions. A minimum width is also provided, while the question marks represent undefined values that can be freely chosen during composition. Finally, an offset vector of no length (i.e., {0, 0}) has been defined for close-ups, since the default LEFT positioning (as defined in Section 5.4.1) was deemed adequate.

Fig. 9 The ConstraintSet data structure (a) and a ConstraintSet that applies to the instruction "a close-up of Jenny at left" (b)
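
Rendered as code, a ConstraintSet might look as follows. The field names mirror those in the text and Fig. 9; the Interval convention (None for an unspecified bound, matching the question marks above), the dataclass layout and the concrete pixel values for Jenny's close-up are assumptions made for illustration, not the numbers of Fig. 9b.

from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Interval = Tuple[Optional[float], Optional[float], Optional[float]]  # (min, optimum, max)

@dataclass
class ConstraintSet:
    subject: Optional[str] = None            # OoI this set frames (None for idiom-level nodes)
    op: str = "INCLUDE"                      # INCLUDE / EXCLUDE / INTERPOLATE (Section 5.3.2)
    hOrientation: str = "CENTER"
    vOrientation: str = "CENTER"
    offset: Tuple[float, float] = (0.0, 0.0)
    levelOfInterest: float = 1.0
    constraints: List[Tuple[str, Interval]] = field(default_factory=list)
    children: List["ConstraintSet"] = field(default_factory=list)   # composition tree

# "a close-up of Jenny at left": horizontal orientation overridden to LEFT,
# height constrained relative to Jenny's head (placeholder pixel values).
jenny_closeup = ConstraintSet(
    subject="Jenny",
    hOrientation="LEFT",
    constraints=[
        ("out.height", (200.0, 260.0, None)),   # minimum, optimum, no maximum
        ("out.width",  (150.0, None, None)),    # at least the head width
    ],
)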

5.3.2 The composition tree

Framing idioms output one ConstraintSet per OoI that takes part in the composition. When multiple OoI subjects must be considered, ConstraintSets can be collected into a tree structure such that each individual ConstraintSet defines how its subject will fit into the overall composition. In this case, an additional ConstraintSet allocated at the idiom level combines the different subject ConstraintSets as its children. The operation (op) field of the ConstraintSet determines how the subject should be composed into the final image. We have currently defined three such operations.

– INCLUDE: The subject must be included in the final image.
– EXCLUDE: The reframed image must be chosen such that the subject is not included in the final image.
– INTERPOLATE: The constraints associated with the subject will influence the framing of the final image to the extent of the levelOfInterest field.

Trees of ConstraintSet nodes are also used for combining framing preferences of multiple idioms, for example, during the overlap of an idiom transition. As such, a single composition tree represents all composition preferences of all idioms that are active at any given moment in time.

Two complete example composition tree instances are depicted in Fig. 10. The tree for frame#1 applies a transition between two idioms, for which an interpolation operation will be used. Deeper in the tree, the Establishing Shot idiom combines each of its subjects with an INCLUDE operation. The constraints defined for each of the leaf ConstraintSets will be specified in such a way that the subject is enveloped. The tree generated by a Panning Shot idiom for frame#2 illustrates how Composition Engine functionality can be exercised at each level of the composition tree. Over the course of its instruction duration, the Panning Shot idiom will set up its subjects in an INTERPOLATION relationship and vary the levelOfInterest field in order to pan from the first subject to the second.

Fig. 10 Examples of composition trees of ConstraintSets

The use of a single data structure for the description of all framing operations greatly reduces the amount of functionality required of a single framing idiom and increases its declarative nature, as opposed to requiring error-prone imperative code. It also allows the adaptation process to postpone framing decisions until all applicable information has been collected, such that composition can take place from a global perspective that is not limited to the view of individual framing idioms.

5.4 The Composition Engine

As discussed in the previous section, Framing Idiom instantiations and composition preferences from mastering instructions are processed into a number of ConstraintSet composition trees, using information from the scene KB to relate symbolic references of subjects to actual projected coordinates in audiovisual media assets. These composition trees are combined into a single tree and handed over to the Composition Engine to produce a reframing rectangle which defines the semantically mastered image in the original footage, as illustrated in Fig. 11. This final subsection will discuss how the Composition Engine works.

Figure 11 depicts how ConstraintSet trees are processed by the consecutive but optional execution of three functional blocks: Composition, Smoothing and Interpolation. Whether a block is executed depends on the attributes of the ConstraintSet node currently being processed. This sequence of blocks is iterated over recursively until the entire tree has been traversed (in depth-first order).


Fig. 11 Overview of components and information flows in and around the Semantic Mastering Composition Engine

The Interpolation block combines one or more ConstraintSet output results obtained from Smoothing, or directly from Composition, into a single result, where each input result is weighted linearly by the ratio of its levelOfInterest field. As explained, Interpolation allows for the implementation of panning effects and gradual transitions between framing idioms.
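
A minimal sketch of this weighted blend, assuming reframing rectangles are represented as (x, y, w, h) tuples; this is, for example, how the Panning Shot idiom of Fig. 10 can move the virtual camera from one subject to the other by shifting the levelOfInterest weights over time. The rectangle values below are placeholders.

def interpolate_rects(results):
    # results: list of ((x, y, w, h), level_of_interest) pairs, one per input
    # ConstraintSet result. Returns their linear combination, weighted by the
    # normalized levelOfInterest of each input.
    total = sum(weight for _, weight in results)
    blended = [0.0, 0.0, 0.0, 0.0]
    for rect, weight in results:
        for i in range(4):
            blended[i] += rect[i] * weight / total
    return tuple(blended)

# Halfway through a pan from Jenny's framing to Rosa's:
print(interpolate_rects([((100, 80, 320, 240), 0.5),
                         ((700, 90, 320, 240), 0.5)]))   # -> (400.0, 85.0, 320.0, 240.0)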

Smoothing can be applied to the result of a previous operation to reduce sudden fluctuations in reframing rectangles in order to obtain a smoother virtual camera image. Depending on the framing idiom being realized, more or less smoothing can be applied. For instance, close-up shots that frame a single subject require less smoothing such that sudden subject movements can be followed more closely. In the case of a long shot, however, the scenery should be presented statically, without sudden position jumps as a result of character movement. The Smoothing functionality allows the adaptation system to avoid inserting jump cuts, since small movements in the virtual camera setup are eliminated. We have used the smoothing technique employed in [8].
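
Purely as an illustration of the idea (the system adopts the smoothing technique of [8], which is not reproduced here), a simple exponential filter over per-frame rectangles already captures the trade-off: a small smoothing factor yields the static framing wanted for long shots, a larger one the responsiveness wanted for close-ups. The factor value is an assumption.

def smooth_rects(rects, alpha=0.2):
    # rects: per-frame reframing rectangles as (x, y, w, h) tuples.
    # alpha: assumed smoothing strength; lower = smoother virtual camera,
    # higher = more responsive to subject movement.
    smoothed, state = [], None
    for rect in rects:
        if state is None:
            state = list(rect)                  # initialise with the first frame
        else:
            state = [(1 - alpha) * s + alpha * r for s, r in zip(state, rect)]
        smoothed.append(tuple(state))
    return smoothed

print(smooth_rects([(100, 80, 320, 240), (130, 80, 320, 240), (131, 80, 320, 240)])[-1])
# -> approximately (111.0, 80.0, 320.0, 240.0)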

The Composition block is employed to calculate the reframing rectangle (out) that provides the ideal image composition given the constraints associated with the ConstraintSet node being processed. Many algorithms and systems have been devised for the calculation of narratively sensible viewpoints within computer-generated 3-D virtual scenes. Considering the similarity with our problem of framing a virtual camera given a set of constraints and preferences, we have used many concepts employed in this previous work as inspiration for our Composition Engine. We list the ones most important with respect to our system and refer to [11] for an extensive overview of work in this area.

The composition operation is as follows. An algorithmic breakdown of this operation is shown in Fig. 14.

1. Determine the minimum size of the reframing rectangle. When framing a single subject, this is the size of the subject. When framing multiple objects, this is the smallest rectangle enveloping all included subjects. ConstraintSets have no explicit subject size field; the subject size is derived from the lower bounds of the provided constraints.

2. Calculate an estimated range of output width and height values that satisfy all given constraints for all subjects by means of constraint propagation [39]. If such a range is found, it is used as a reduced search space for finding the best reframing solution. If no such range is found, an unrestricted search space must be considered. Some reduction of the search space will always be in effect: the size of the reframing rectangle cannot be smaller than the minimum determined in the previous step. Also, in many cases, upscaling operations will not be allowed, which places an additional lower bound on the rectangle size. A similar optimization was used in [4], where the search space is limited early on in the solving process by applying a number of domain-specific axioms related to camera geometry and 3-D viewport intersections. Figure 12 provides a visual cue of the reduced search space in two concrete cases, indicated by the shaded areas. In Fig. 12a, the reduced search range is shown for a close-up where an output aspect ratio constraint (4:3 in this case) is taken into account. The smallest possible solution in the search space is determined by combining this constraint with the minimum subject rectangle calculated in step 1. Because no explicit maximum output values for a close-up have been defined (we rely on the Composition Engine to favor solutions that are close to the ideal value defined in Fig. 4 and shown as the dashed rectangle in Fig. 12a), the search space is limited only by the height of the image. The width of possible solutions is smaller than the original image width because it is limited by the aspect ratio constraint. Figure 12b depicts the reduced search space for an enveloping shot where one OoI must be framed on the left, and the other on the right. The minimum solution size is the union of both OoI-enveloping rectangles. The lack of vertical preferences retains the entire height of the original image for solutions that satisfy all framing constraints. The width of potential solutions is, however, strongly limited by the framing constraints. Enlarging the search space beyond the width shown in Fig. 12b would invalidate the requested orientation of at least one of the included OoIs (according to the orientation definitions introduced later in Section 5.4.1).

Fig. 12 Reduced search spaces for a close-up (a) and an enveloping shot (b)

Fig. 12 Reduced search spaces for a close-up (a) and an enveloping shot (b)

3. Using a (possibly significantly reduced) search space, build score tables that represent a global view of the solution space. Figure 13 illustrates how this is done. For each reframing rectangle width (out.w) in the search space, all possible out.x position values are mapped to a score. The score represents how well that solution satisfies the constraints and preferences of one or more objects that need framing. The scores are sorted and, as a measure of optimization, only the out.x position with the highest score is retained and placed in the table of top-score tuples. This table contains the best positions for each out.w value in the search space. A top-scores table is also constructed for out.h and out.y values, the calculation of which is done similarly. Because the top scores contain values from all possible solutions in the search space, the top value is guaranteed to represent the global maximum.
Scoring systems with cost or satisfaction functions or other numerical heuristics are used by a number of systems to pick the optimal solution for a given problem from multiple candidates that satisfy the provided constraints [4, 10].

Fig. 13 Construction of the solution score tables used for domain variable selection

4. Assign a limited set of constraints (among others, for aspect ratio preservation, minimum output image size, and a number of basic axioms) to a generic Constraint Programming (CP) solver [39] and use a top-scores-optimized variable domain value selection to find a solution quickly, instead of using a naive linear domain iteration and slow cost optimization solving. For each variable in the system (i.e., all fields of the out rectangle), the solver will use the top scores to obtain decreasingly optimal candidate values. In fact, the first combination of reframing rectangle variable values that satisfies the solver constraints will be the optimal solution. We have used the Choco solver [29] as our generic integer CP solver. A simplified sketch of these steps is given below.
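For illustration only, the following sketch combines steps 1–3 with a simplified stand-in for step 4, restricted to the horizontal axis. All names (Subject, score_x, compose_horizontal) are hypothetical, and the linear scan replaces the actual score-guided CP solving; it is not the system's implementation.

```python
# Illustrative sketch of the composition steps for the horizontal axis only.
# The real system delegates step 4 to a generic CP solver (Choco) with
# score-based variable value selection; here a plain scan stands in for it.

from dataclasses import dataclass

@dataclass
class Subject:
    x: int            # left edge of the OoI in the original image
    width: int        # width of the OoI
    ideal_pos: float  # ideal value of dl / (dl + dr), e.g. 1/8 for LEFT

def score_x(out_x: int, out_w: int, subj: Subject) -> float:
    """Grade an out.x candidate: highest near the subject's ideal relative position."""
    dl = subj.x - out_x
    dr = (out_x + out_w) - (subj.x + subj.width)
    if dl < 0 or dr < 0:                       # subject not fully contained -> invalid
        return float("-inf")
    rel = dl / (dl + dr) if dl + dr > 0 else 0.5
    return 1.0 / (1.0 + abs(rel - subj.ideal_pos))   # simple stand-in for the Gaussian

def compose_horizontal(image_w: int, subjects: list[Subject], min_w: int):
    """Steps 1-4 (horizontal only): reduce the search space, build a top-score
    entry per candidate width, and return the best (out.x, out.w) found."""
    # Step 1/2: the rectangle must at least envelop all subjects (reduced search space).
    left = min(s.x for s in subjects)
    right = max(s.x + s.width for s in subjects)
    min_w = max(min_w, right - left)           # assumes min_w <= image_w

    best, best_score = None, float("-inf")
    for out_w in range(min_w, image_w + 1):    # step 3: keep one top score per width
        top_x, top = None, float("-inf")
        for out_x in range(0, image_w - out_w + 1):
            s = sum(score_x(out_x, out_w, subj) for subj in subjects)
            if s > top:
                top_x, top = out_x, s
        # Step 4 (simplified): retain the globally best feasible candidate.
        if top > best_score:
            best, best_score = (top_x, out_w), top
    return best
```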

Given the highly constrained nature of the mastering problem, chances are that no solution exists that satisfies all given constraints, whether concerning reframing rectangle size or requested OoI orientations. The general consensus is to attempt to provide a solution nonetheless, albeit a suboptimal one that satisfies the given constraints as well as possible [11, 15]. In our case, every possible solution can be assigned a score. Hence, a solution can always be graded with a score that is constructed using all constraints, whether or not it satisfies them all. In case the initial reduction of the search space yields no solutions, the search space is first expanded and a new solving iteration is started with a new set of top scores. In any case, the solution with the best score will be selected, after which the system can perform an evaluation of the solution with respect to the original constraints and preferences.

When one of the axiomatic constraints (such as image aspect ratio preservation) prevents the solver from finding a solution, we employ the failure strategy used in systems that implement soft or fuzzy constraint solving [39]. Axiomatic constraints are subsequently discarded according to their priority such that a degraded solution can be found [3]. Users are notified of such occurrences, since in this case there is little chance that a properly adapted and visually pleasing reframed composition exists (Fig. 14).
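A minimal sketch of this priority-based fallback, assuming a hypothetical solve() callback and a constraint list ordered from least to most important:

```python
# Hypothetical sketch of the soft-constraint fallback: axiomatic constraints
# are dropped in order of increasing priority until a solution is found.

def solve_with_relaxation(solve, constraints):
    """'solve' is assumed to return a solution or None; 'constraints' are
    ordered from lowest to highest priority."""
    active = list(constraints)
    while True:
        solution = solve(active)
        if solution is not None:
            return solution, active
        if not active:
            return None, active   # nothing left to relax; the user is notified
        active.pop(0)             # discard the lowest-priority constraint
```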



Fig. 14 The algorithm for the composition operation

5.4.1 Orientation constraints and score functions

Framing idioms convey an OoI's preferred location within the reframed output rectangle by means of the horizontal and vertical orientation fields of the ConstraintSet. Figure 15 illustrates the visual meaning of the three horizontal orientations (LEFT, CENTER and RIGHT). The vertical orientations are defined similarly. The formal definition of each orientation is given below in (1–3), with dl resp. dr being the distance from the left resp. right side of the output rectangle to the left resp. right side of the OoI. The ideal value for each orientation is also given. The offset field of a ConstraintSet can influence this default ideal value. Higher offsets will place the ideal value closer to the edge of the image, i.e., more left or more right. The offset field is ignored for CENTER orientations.

LEFT:    range: dl / (dl + dr) ∈ [0, 1/4];      ideal: dl / (dl + dr) = 1/8    (1)

CENTER:  range: dl / (dl + dr) ∈ ]1/4, 3/4[;    ideal: dl / (dl + dr) = 1/2    (2)

RIGHT:   range: dl / (dl + dr) ∈ [3/4, 1];      ideal: dl / (dl + dr) = 7/8    (3)


Fig. 15 A visual indication of the horizontal orientation values


Given the definition of each orientation, a function can be chosen that assigns a score to each reframing rectangle position, given a subject location and a given width of this rectangle. This function is based on (1–3) and is tuned such that its output score is highest around the ideal value defined for each orientation. Figure 16a depicts the score function for the LEFT orientation, plotted over the validity range of this orientation. The function is a Gaussian distribution centered around the ideal value, multiplied by a small linear bias function such that positions toward the (left) edge of the frame are graded slightly higher. The CENTER and RIGHT score functions are defined similarly, but without an edge bias for the CENTER orientation.
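As an illustration, a possible shape of such a score function for the LEFT orientation is sketched below; the Gaussian width and the strength of the linear edge bias are assumed values, since they are not specified here.

```python
import math

def left_orientation_score(rel_pos: float, ideal: float = 1.0 / 8, sigma: float = 0.05) -> float:
    """Score for the LEFT orientation: a Gaussian centred on the ideal value of
    dl/(dl+dr), multiplied by a small linear bias that favours positions closer
    to the left edge. Parameter values are illustrative, not from the paper."""
    if not (0.0 <= rel_pos <= 0.25):                 # outside the LEFT validity range
        return 0.0
    gaussian = math.exp(-((rel_pos - ideal) ** 2) / (2 * sigma ** 2))
    edge_bias = 1.0 + 0.1 * (0.25 - rel_pos) / 0.25  # mild preference toward the edge
    return gaussian * edge_bias
```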

Scoring functions are also used for the calculation of the ideal size of the reframing rectangle, in addition to its location within the original imagery. In Fig. 16b, the score function employed to obtain the best out.h solution for the close-up example (from Fig. 12a) is plotted for the reduced search space. Just as (1–3) determine the parameters of the Gaussian scoring function for reframing rectangle orientations, the framing properties of the close-up idiom (cf. Figs. 4a and 9b) define the scoring function in this case. As such, the optimal value requested by the idiom is also the one that yields the highest score. This optimal reframing solution corresponds to the dashed rectangle in Fig. 12a (note that out.w is derived directly from out.h because of the output aspect ratio constraint).

Fig. 16 Score functions for the LEFT orientation and the out.x variable, given a subject rectangle and an out.w value (a), for the out.h variable in the close-up shot example (b), and for both the out.x and out.w variables in the envelope shot example (c)



When framing constraints for multiple subjects influence the composition result, the pattern of scores produced can become more complex. Figure 16c shows a 3-D plot of the scores obtained for the composition example introduced in Fig. 12b. The presented example was mapped onto an equivalent high definition image of 1920 × 1080 pixels such that actual numerical values could be shown in the plot.

Changes in both the potential out.x and out.w values produce a different score with respect to the framing preferences of both subjects. A large part of the plot consists of two adjacent lobes that each represent the satisfaction of only one subject's preferences. For example, lower out.x values clearly satisfy the placement of the right subject in the right of the output frame, but they are less beneficial for placing the left subject on the left. The inverse becomes true as out.x increases. A similar observation can be made about out.w. Only in a limited part of the search space do the scores for both subjects accumulate into significantly higher scores, including a single top score that represents the ideal solution for the reframing rectangle. This is also the solution indicated by the dashed rectangle in Fig. 12b.

The levelOfInterest field of a ConstraintSet can also be used to influence the score for a given OoI. The impact of a set of framing preferences on the final composition can be made greater by varying this parameter, which will be set by the Framing Idiom Processor based on framing idiom definitions.

5.4.2 Composition state

Throughout the adaptation process, explicit composition state is kept, as illustrated in Fig. 11. This state is updated during each completed operation of the Composition Engine and reflects information about the adapted scene at that time. Keeping state lets the Composition Engine remember earlier framing decisions and is used to realize spatial scene consistency (cf. Section 4.2). At the end of a frame's final composition iteration, the positions of all OoIs relative to the output image are recorded and will serve as defaults for future composition operations. Unless overridden by explicit preferences, OoIs will hence be placed at consistent positions.
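A minimal sketch of such state keeping, with hypothetical names and (x, y, w, h) rectangles, could look as follows:

```python
# Minimal sketch of composition state used for spatial scene consistency.
# After a frame's final composition iteration, each OoI's position relative to
# the output rectangle is recorded and reused as a default in later operations.

class CompositionState:
    def __init__(self):
        self.default_positions = {}   # OoI name -> relative (x, y) in the output image

    def record(self, ooi: str, ooi_rect, out_rect):
        """Store the OoI's position relative to the reframing rectangle (x, y, w, h)."""
        rel_x = (ooi_rect[0] - out_rect[0]) / out_rect[2]
        rel_y = (ooi_rect[1] - out_rect[1]) / out_rect[3]
        self.default_positions[ooi] = (rel_x, rel_y)

    def default_for(self, ooi: str):
        """Default placement, used unless explicit preferences override it."""
        return self.default_positions.get(ooi)
```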

This completes the discussion of the implementation of the semantic adaptation system. In the following section, we provide an evaluation and discussion of our system, including directions for future work.

6 System evaluation and discussion

6.1 Visual results

We start the evaluation of our system by providing a number of visual examples that illustrate the kinds of adapted images our Semantic Mastering system produces. We provide the mastering instructions used to obtain each result and discuss their peculiarities with respect to the composition mechanism. The results are taken directly from the adaptation system and demonstrate that visually convincing results were obtained using simple instructions that require little additional effort from the drama production crew to define. We also compare our results with related work presented in Section 3. The locations of the characters and applicable parts were annotated manually by setting a limited number of key points and performing Catmull-Rom [7] interpolation in order to obtain a continuous projection.
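The following sketch illustrates how such sparse key points could be turned into a continuous track with uniform Catmull-Rom interpolation; the data layout and function names are illustrative, not taken from the actual annotation tool.

```python
# Sketch: turning sparse manual key points into a continuous OoI projection
# with uniform Catmull-Rom interpolation [7]. Key points are (frame, value)
# pairs for one coordinate of the projection rectangle.

def catmull_rom(p0: float, p1: float, p2: float, p3: float, t: float) -> float:
    """Uniform Catmull-Rom spline evaluated between p1 and p2 for t in [0, 1]."""
    return 0.5 * (
        2 * p1
        + (-p0 + p2) * t
        + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
        + (-p0 + 3 * p1 - 3 * p2 + p3) * t * t * t
    )

def interpolate_track(keys: list[tuple[int, float]], frame: int) -> float:
    """Evaluate the annotated track at an arbitrary frame (keys sorted by frame)."""
    frames = [f for f, _ in keys]
    values = [v for _, v in keys]
    if frame <= frames[0]:
        return values[0]
    if frame >= frames[-1]:
        return values[-1]
    i = max(j for j, f in enumerate(frames) if f <= frame)   # segment index
    t = (frame - frames[i]) / (frames[i + 1] - frames[i])
    p0 = values[max(i - 1, 0)]
    p1, p2 = values[i], values[i + 1]
    p3 = values[min(i + 2, len(values) - 1)]
    return catmull_rom(p0, p1, p2, p3, t)
```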


Fig. 17 Illustration of the close-up idiom with various horizontal orientations


In Fig. 17, the ubiquitous close-up is demonstrated. Figure 17a is the original source footage. Figure 17c was produced with "a close-up of Barbara." The subject's head is isolated and centered within the output image. Left and right orientations were also tried using "a close-up of Barbara at left" and "a close-up of Barbara at right," depicted resp. in Fig. 17b and d.

This adaptation result can also be obtained by many of the spatially-optimized automated adaptation systems described in Section 3.1. Adaptation of close-ups is especially suited for systems that employ face detection, such that the importance of facial regions is increased and this region is likely to be focused on. Systems that lack face detection also have a good chance of presenting a properly reframed image, except when the character in close-up is surrounded by a lot of motion, in which case this motion could mistakenly be considered more salient than the character itself. These automated systems, however, do not offer functionality to specify additional preferences such that Barbara can be placed in a particular part of the reframed image.

A conversation between two on-screen characters shown in Fig. 18 can be semantically mastered to consecutive close-ups, e.g., using "begin with a close-up of JamesBond, end with a close-up of M." In order to follow a conversation, transitions can be anchored to dialogue events in the scene (cf. Section 5.2).

Fig. 18 Combinations of idioms for multiple on-screen characters

Figure 19 illustrates how high definition images can support extensive scenery shot with wide-angle lenses. When the attention shifts from one character to another in the scene, the mastering operator can use a panning operation between characters to explicitly guide the viewer's attention. Figure 19b–e and f–i show two slightly different approaches. The first instruction, "a panning-shot of JamesBond and M," makes the Composition Engine linearly interpolate between centered positions of both characters. This immediately indicates the primary character at either point. In order to provide viewers with a better sense of the original scene and the relative positions of each character, another instruction can be used: "a panning-shot of JamesBond at left and M at right." Note that, as a side effect of the right orientation of the M character, its larger height compared to James Bond, and the overall preservation of the image aspect ratio, James Bond also remains prominently in the picture.

Fig. 19 Panning between characters using different desired positions

The impact of orientation preferences is also demonstrated in Fig. 20. Figure 20a is the result of "an envelope-shot of JamesBond at left and M at right." In this case, the envelope shot frames the heads of OoIs using the smallest enveloping rectangle that fits all OoIs. The requested orientations match the original setup of the scene, which results in a closely fitting envelope. Enforcing an output image aspect ratio produces the image in Fig. 20b. When the mastering instruction is changed to "an envelope-shot of JamesBond at left and M at center," the envelope is stretched significantly to obey the requested orientations. If the output aspect ratio were to be preserved in this case, the output image would no longer be significantly smaller than the original imagery. Users are free to define any orientation preferences they like; however, these might not lead to satisfying results on surface-constrained devices. If required, additional constraints can be employed by the system to enforce a maximum output image size.

Fig. 20 The impact of orientation preferences on the semantically mastered image size

These more complex adaptation examples present issues for spatially-optimized automated adaptation systems. Firstly, even when the saliency of individual objects of interest is detected correctly, there is no way to convey the subtle composition preferences required for dramatic cinematography. Secondly, the proper detection of salient regions is not a trivial task. In Fig. 19, for example, James Bond is sitting down, barely moving and almost blending into the background, while M is walking around, which would lead to only M being detected as a region of interest. Additionally, because none of the characters is facing the camera in this scene, the application of face detection is less likely to improve these initial detection results.

Unlike automated adaptation systems, the work presented by Krähenbühl et al. [20] can produce reframing results similar to the ones displayed in this section. Whenever the automated content analysis used in this work produces unsatisfactory results, manual tweaks can be applied. Unfortunately, this involves the explicit definition of a region's coordinates and desired relocated position, which is not the modus operandi known to filmmakers, and directors in particular. In contrast, our system separates the detection of OoIs from the adaptation process such that instructions about OoIs can be conveyed at a higher level by means of well-known cinematographic idioms.

We mentioned in Section 4 that the virtual mastering camera is constrained in terms of its maximum 'zoom' factor. This is to avoid the need for upscaling operations that would reduce image quality when trying to fill the display of an output device. As a final example, Fig. 21a shows how the Llewelyn Moss character is walking across the plains of Texas. In order to convey the vastness of these plains, an extreme wide shot was originally used. Enveloping the character for mastering ("a full-shot of LlewelynMoss") leads to upscaling in Fig. 21b. The Composition Engine can be configured to avoid upscaling such that the better result in Fig. 21c is obtained, which provides a proper trade-off between 'zooming' and image quality.

Fig. 21 No-upscaling constraints limit the minimum output image size



We are currently in the process of planning field trials of our system, such that it can be used by production people on real-world productions. Candidates are soap operas and prime-time fiction produced by the Flemish public broadcaster V.R.T.

6.2 Computational performance

We have considered a realistic professional drama production scenario for testing the computational performance of our adaptation system. We have semantically mastered images of 1920 × 1080 pixels with a frame rate of 25 frames per second, which is a common format for high definition acquisition. The target resolution was 480 × 270, an image size suited for newer-generation mobile devices and portable media players. We have used an Intel Core 2 Q6600, paired with 4 GB of RAM, as our test machine. Few code optimizations have been performed at this point and our measurements only give a rough estimate of computational performance. For example, the adaptation process is implemented in a single-threaded fashion, utilizing only one core of our CPU.

We have only taken into account the time required for the determination of the output rectangle. This does not include video stream processing and transformation operations such as decoding, encoding, cropping, and scaling. The decoding and encoding steps would require most of the computational effort here, and because these are highly dependent on the chosen video coding specifications, we have not included them in our evaluation.

For performance reasons, the OoI projection information is cached from the KB at the beginning of a scene's adaptation process. Discrete annotations are extracted and interpolated between using Catmull-Rom curves. The start-up preparations also include the parsing of instructions and their placement on the timeline, which occurs only once for the duration of an entire scene. When amortized over the entire adaptation process, its impact is very small. As an indication, setting up the adaptation engine for three OoIs with 69 projection points (and a duration of 100 s) took 290 ms. The construction of the composition tree, including lookups and interpolations of projections (less than 5 ms), is also negligible compared to the time required for the actual composition operation.

The amount of computation time required for the composition operation depends on the image resolution of the original imagery, since this determines the size of the search space of output solutions and hence the number of scores to calculate. The computational complexity of composition operations also depends on the structure and depth of the composition tree. ConstraintSets that are combined using interpolation each require an individual composition operation. Adding multiple subjects using INCLUDE operations is less expensive, since score tables only need to be built once and the solver aggregates the constraints of all subjects when searching for a solution. In general, a complete composition operation was found to take from 300 to 1200 ms to compute, which translates to 0.8 to 3 frames per second when each complete composition operation corresponds to a single output frame. For each composition operation, we have found that less than 3% of computation time is spent on the search space reduction and the actual CP solving process, leaving the majority of time to the construction of score tables, which is executed by code that has not yet been thoroughly optimized.


Both the interpolation and smoothing operations of the Composition Engine perform only a few operations on a limited set of rectangle data structures, which makes their complexity minimal with respect to the composition process.

In order to make our adaptation system truly interactive, giving immediate feedback to users that modify instructions, a number of computation speed improvements remain to be implemented. In addition to performing code optimizations, we could introduce a number of granularity reductions. Instead of calculating a solution for each output frame, a solution can be calculated for only one in every few frames. An additional measure could reduce the spatial granularity of the solution space. This way, the solution space for which scores are calculated is severely reduced. Interpolation can be used in both cases to estimate skipped values.
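As an illustration of the proposed temporal granularity reduction, the sketch below computes a full composition only at key frames and linearly interpolates the reframing rectangle in between; compose() is a hypothetical stand-in for the expensive composition operation.

```python
# Sketch of temporal granularity reduction: solve the (expensive) composition
# only every 'step' frames and linearly interpolate (x, y, w, h) in between.

def compose_sparse(compose, num_frames: int, step: int = 5):
    if num_frames <= 0:
        return []
    keyframes = list(range(0, num_frames, step))
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)
    solved = {f: compose(f) for f in keyframes}      # expensive calls only here

    rects = []
    for f in range(num_frames):
        if f in solved:
            rects.append(solved[f])
            continue
        prev = max(k for k in keyframes if k < f)
        nxt = min(k for k in keyframes if k > f)
        t = (f - prev) / (nxt - prev)
        rects.append(tuple(                          # linear interpolation per field
            round(a + t * (b - a)) for a, b in zip(solved[prev], solved[nxt])
        ))
    return rects
```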

6.3 Current limitations and future work

We finish the evaluation section of this paper by identifying a number of limitations, and we provide some ideas for future research tracks related to our adaptation system.

Ideally, we wish to automate the Analysis processes as much as possible, as they involve tedious manual labor. The mapping of OoIs to their projection is currently annotated manually, albeit aided by the inferencing rules of the KB. Future research must focus on the integration of computer vision detection and OoI recognition algorithms. In fact, we expect the accuracy and computational requirements of current state-of-the-art detection techniques to be aided significantly by the rich semantic context provided by the extensive drama production metadata that is available in our system. For example, the advance knowledge that a scene is likely to feature only two characters at either side of the frame should ease the detection process significantly.

Our work somewhat neglects the issue of sound associated with the semantically mastered video material, and whether or not a semantic sound adaptation technique could be implemented. The adaptation of video material is likely to require modifications of the audio tracks as well, such that they match the spatial or temporal adaptations of the video stream. Advanced sound modifications could even optimize the original sound mix for various devices, possibly emphasizing aural semantic objects of interest. Regarding possible adaptation requirements concerning on-screen graphics and overlays that could be lost or require significant resizing, we refer to complementary work [27]. In any case, the fusion of semantic adaptation techniques for the visual, aural and temporal adaptation axes, and how these techniques influence one another, is a challenging research topic.

As a final point, we wish to mention the possibility of introducing notions of the Semantic Mastering process even earlier in the production process than we have implemented in our system now. Already during the 3-D previsualization process, the production crew could define semantic adaptation operations that are to be applied on material that is first rendered by a computer and later shot on set. This way, directors can actively plan Semantic Mastering even during the preparation of the shooting script, possibly previsioning camera perspectives that are specifically chosen to serve mobile devices. This would raise the issue of how previsualized mastering instructions can be applied to live-action media assets and how much of this translation can then be accurately automated while, again, taking advantage of the rich semantic context of the workflow metadata that is available during the drama production process.

7 Conclusions

In this paper, we have presented a system that places semantic adaptation of audiovisual media assets in the hands of the drama production crew. Instead of performing spatial adaptation after all regular media production has been completed, we have seamlessly integrated this process into the existing production workflow. The Semantic Mastering process combines cinematographic instructions from the production crew with semantic knowledge that is extracted and inferred from the various sources of production workflow metadata such that optimal video retargeting is obtained. We have shown how this approach offers significant usability advantages over other adaptation techniques. In particular, by means of simple instructions our adaptation engine is able to compile visually convincing results. We have explained how mastering instructions are translated into declarative composition trees of framing constraints and how the Composition Engine calculates properly adapted images that fulfill the requests of the production crew while enforcing a number of cinematographic consistency rules. However, some work is still required in terms of computation speed optimizations, the integration of automated object of interest detection techniques and the inclusion of the Semantic Mastering process even earlier in the production workflow, e.g., during 3-D previsualization.

Acknowledgements The research activities that have been described in this paper were funded by Ghent University, VRT, IBBT, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.

The image in Fig. 2 was taken from "Thuis", © 1995–2010 Vlaamse Radio & Televisie (V.R.T.); the images in Fig. 21 were taken from "No Country for Old Men", © 2007 Miramax Film Corp. and Paramount Vantage, A Division of Paramount Pictures Corporation; the images in Figs. 18–20 were taken from "Casino Royale", © 2006 Danjaq LLC, United Artists Corporation and Columbia Pictures Industries Inc.

References

1. Arijon D (1976) Grammar of the film language. Silman-James Press, Los Angeles, CA
2. Avidan S, Shamir A (2007) Seam carving for content-aware image resizing. ACM Trans Graph 26(3):1–9
3. Bares WH, Lester JC (1999) Intelligent multi-shot visualization interfaces for dynamic 3D worlds. In: Proceedings of the 4th international conference on intelligent user interfaces, pp 119–126
4. Bares W, McDermott S, Boudreaux C, Thainimit S (2000) Virtual 3D camera composition from frame constraints. In: Proceedings of the eighth ACM international conference on multimedia, pp 177–186
5. Bertini M, Cucchiara R, Del Bimbo A, Prati A (2006) Semantic adaptation of sport videos with user-centred performance analysis. IEEE Trans Multimedia 8(3):433–443
6. Cardinaels M, Frederix K, Nulens J, Van Rijsselbergen D, Verwaest M, Bekaert P (2008) A multi-touch 3D set modeler for drama production. In: Proc. of the IBC conf., pp 330–335
7. Catmull E, Rom R (1974) A class of local interpolating splines. In: Barnhill RE, Reisenfeld RF (eds) Computer aided geometric design. Academic Press, New York, pp 317–326
8. Chen F, De Vleeschouwer C (2009) Autonomous production of basketball videos from multi-sensored data with personalized viewpoints. In: International workshop on image analysis for multimedia interactive services, pp 81–84
9. Cheng W-H, Wang C-W, Wu J-L (2007) Video adaptation for small display based on content recomposition. IEEE Trans Circuits Syst Video Technol 17(1):43–58
10. Christianson DB, Anderson SE, He L, Salesin DH, Weld DS, Cohen MF (1996) Declarative camera control for automatic cinematography. In: Proc. of the AAAI-96, pp 148–155
11. Christie M, Machap R, Normand J-M, Olivier P, Pickering J (2005) Virtual camera planning: a survey. In: Proceedings of smart graphics. Springer, pp 40–52
12. De Geyter M, Oorts N, Overmeire L (2008) Integration demands on MAM systems: a proof of concept solution. SMPTE Motion Imaging Journal 118(8):38–46
13. Deselaers T, Dreuw P, Ney H (2008) Pan, zoom, scan—time-coherent, trained automatic video cropping. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp 1–8
14. Dorai C, Venkatesh S (2001) Computational media aesthetics: finding meaning beautiful. IEEE Multimed 8(4):10–12
15. Halper N, Helbing R, Strothotte T (2001) A camera engine for computer games: managing the trade-off between constraint satisfaction and frame coherence. Comput Graph Forum 20(3):174–183
16. Jannach D, Leopold K (2007) Knowledge-based multimedia adaptation for ubiquitous multimedia consumption. J Netw Comput Appl 30(3):958–982
17. Knoche H, Sasse A (2009) The big picture on small screens delivering acceptable video quality in mobile TV. ACM Trans Multimedia Comput Commun Appl 5(3):1–27
18. Kopf S, Kiess J, Lemelson H, Effelsberg W (2009) FSCAV: fast seam carving for size adaptation of videos. In: ACM MM '09: proceedings of the 17th ACM international conference on multimedia, pp 321–330
19. Kovalick A (2006) Video systems in an IT environment. Elsevier Inc, Oxford
20. Krähenbühl P, Lang M, Hornung A, Gross M (2009) A system for retargeting of streaming video. In: ACM SIGGRAPH Asia '09: international conference on computer graphics and interactive techniques
21. Liu F, Gleicher M (2005) Automatic image retargeting with fisheye-view warping. In: UIST '05: proceedings of the 18th annual ACM symposium on user interface software and technology, pp 153–162
22. Liu F, Gleicher M (2006) Video retargeting: automating pan and scan. In: Proceedings of the ACM multimedia conference, pp 241–250
23. López F, Martínez JM, García N (2009) CAIN-21: an extensible and metadata-driven multimedia adaptation engine in the MPEG-21 framework. In: Proceedings of the 4th international conference on semantic and digital media technologies, pp 114–125
24. Magalhães J, Pereira F (2004) Using MPEG standards for multimedia customization. Signal Process Image Commun 19(5):437–456
25. Mascelli JV (1998) The five C's of cinematography: motion picture filming techniques. Silman-James Press, Los Angeles, CA
26. Parr T (2008) The reuse of grammars with embedded semantic actions. In: Proceedings of the 16th IEEE international conference on program comprehension, pp 5–10
27. Pellan B, Concolato C (2009) Summarization of scalable multimedia documents. In: Proc. of the intl. workshop on image analysis for multimedia interactive services 2009, pp 304–307
28. Pereira F, Burnett I (2003) Universal multimedia experiences for tomorrow. IEEE Signal Process Mag 20(2):63–73
29. Rochart G (2008) The CHOCO constraint programming solver. In: Proceedings of the fifth international conference on integration of AI and OR techniques in constraint programming for combinatorial optimization problems
30. Rubinstein M, Shamir A, Avidan S (2008) Improved seam carving for video retargeting. ACM Trans Graph 27(3):1–9
31. Rubinstein M, Shamir A, Avidan S (2009) Multi-operator media retargeting. ACM Trans Graph 28(3):1–11
32. Schwitter R (2002) English as a formal specification language. In: Proceedings of the 13th international workshop on database and expert systems applications, pp 228–232
33. Seo K, Ko J, Ahn I, Kim C (2007) An intelligent display scheme of soccer video on mobile devices. IEEE Trans Circuits Syst Video Technol 17(10):1395–1401
34. Setlur V, Takagi S, Raskar R, Gleicher M, Gooch B (2005) Automatic image retargeting. In: Proceedings of the 4th international conference on mobile and ubiquitous multimedia, MUM '05, pp 59–68
35. Sofokleous AA, Angelides MC (2008) DCAF: an MPEG-21 dynamic content adaptation framework. Multimedia Tools & Applications 40(2):151–182
36. Tseng BL, Lin C-Y, Smith JR (2004) Using MPEG-7 and MPEG-21 for personalizing video. IEEE Multimed 11(1):42–53
37. van Beek P, Smith JR, Ebrahimi T, Suzuki T, Askelof J (2003) Metadata-driven multimedia access. IEEE Signal Process Mag 20(2):40–52
38. Van Deursen D, Van Lancker W, De Neve W, Paridaens T, Mannens E, Van de Walle R (2010) NinSuna: a fully integrated platform for format-independent multimedia content adaptation and delivery using Semantic Web technologies. Multimedia Tools & Applications 46(2–3):371–398
39. van Harmelen F, Lifschitz V, Porter B (eds) (2008) Handbook of knowledge representation. Elsevier B.V., Amsterdam
40. Van Rijsselbergen D, Van De Keer B, Van de Walle R (2008) The canonical expression of the drama product manufacturing processes. Multimedia Syst 14(6):395–403
41. Van Rijsselbergen D, Van De Keer B, Verwaest M, Mannens E, Van de Walle R (2009) On the implementation of semantic content adaptation in the drama manufacturing process. In: Proceedings of the 2009 IEEE international conference on multimedia and expo, pp 822–825
42. Van Rijsselbergen D, Van De Keer B, Verwaest M, Mannens E, Van de Walle R (2009) Movie script markup language. In: Proceedings of the 9th ACM symposium on document engineering, pp 161–170
43. Wolf L, Guttmann M, Cohen-Or D (2007) Non-homogeneous content-driven video-retargeting. In: IEEE International Conference on Computer Vision, pp 1–6

Dieter Van Rijsselbergen is a researcher and Ph.D. candidate in computer science and engineering at the Multimedia Lab at the Department of Electronics and Information Systems (ELIS) of Ghent University in Belgium. He received his Master's degree in Computer Science from Ghent University in 2005. His research interests and areas of publication include IT-based broadcasting and drama media production, large-scale video and metadata processing architectures, standardization, video coding, and GPU-driven signal processing.


Chris Poppe received the Master's degree in Industrial Sciences from KaHo Sint-Lieven, Belgium, in 2002 and received his Master's degree in Computer Science from Ghent University, Belgium, in 2004. He joined the Multimedia Lab of Ghent University - IBBT where he obtained the Ph.D. degree in 2009. His research interests include video coding technologies, video analysis, and multimedia metadata extraction, processing and representation, with a strong focus on standardization processes.

Maarten Verwaest learned Information Technology on the job at Framatome Connectors, where he helped introduce tools for model-driven development while in charge of the mechanics department. At the time of writing, Maarten was a technology expert at VRT and as such was responsible for the design and implementation of several large-scale projects, including an integrated 'Digital Newsroom Production System', media asset management and web content management systems, and a large-scale Content Delivery Network serving broadband applications. Between 2005 and 2009, he was responsible for research and development related to 'virtual modeling' and search technology for audiovisual media production. Recently, Maarten founded Limecraft, a provider of cloud-based and model-driven media production services.


Erik Mannens received his Master's degree in engineering (1992) at KAHO Ghent and his Master's degree in Computer Science (1995) at K.U. Leuven University. His major expertise is centered on broadcasting, iDTV and web development. He is involved in several projects as a senior researcher, is co-chair of W3C's Media Fragments Working Group, and actively participates in other W3C semantic web standardization activities. He is also a member of the technical committee of ACM MultiMedia, SAMT and MAReSO.

Rik Van de Walle received his M.Sc. and Ph.D. degrees in Engineering from Ghent University, Belgium, in 1994 and 1998 respectively. After a visiting scholarship at the University of Arizona (Tucson, USA), he returned to Ghent University, where he became professor of multimedia systems and applications, and head of the Multimedia Lab. His current research interests include multimedia content delivery, presentation and archiving, coding and description of multimedia data, content adaptation, and interactive (mobile) multimedia applications.