

Human Attention Model for Action Movie Analysis

Anan Liu 1,2, Jintao Li 2, Yongdong Zhang 2, Sheng Tang 2, Yan Song 2, Zhaoxuan Yang 1

1 School of Electronic Engineering, Tianjin University, Tianjin, China
2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
{liuanan, jtli, zhyd, ts, songyan}@ict.ac.cn

Abstract

Nowadays, automatic detection of highlights in movies is indispensable for video management and browsing. In this paper, we present the formulation of a human attention model and its application to action movie analysis. Based on the relationship between stimuli in the visual and audio modalities and the change of human attention, we construct visual and audio attention sub-models respectively. By integrating both sub-models, the human attention model is formulated and an attention flow plot is derived to simulate the change of human attention in the time domain. Based on video structuralization and human attention analysis, we realize action concept annotation for interest-oriented navigation by viewers. Large-scale experiments demonstrate the effectiveness and generality of the human attention model for action movie analysis.

Keywords: action, attention model, visual, audio.

1. Introduction

With the increasing amount of movie data, there is a great need for ways to manage and browse movies conveniently and to retrieve video clips containing specific semantic concepts.

The most representative work on movie content analysis is presented in [1,2] by Brett Adams et al. In these papers, for the first time, they bring in film grammar and present an original computational approach for extracting movie tempo to derive story sections and events that convey the high-level semantics of the stories portrayed in video. Due to the limitation of using only a single modality, Lei Chen et al. in [3] incorporate audio and visual cues for dialog and action scene extraction. A method for automatically selecting shots for action movie trailers is presented in [4]. Moreover, Hsuan-Wei Chen et al. in [5] use audio energy features to modify the tempo proposed by Brett Adams et al. With the improved tempo, they present a novel method for movie segmentation and summarization.

Although more and more researchers are engaged in this field, most attention is paid to the editing principles of film grammar, while few researchers focus on the role of human perception in semantic event detection. As a result, researchers analyze movies on the basis of editing techniques and the characteristics of the movie itself, and neglect the response of viewers to the video content. Motivated by [6,7], we bring in a human attention model to simulate the human response to stimuli, including continuously changing visual and audio signals.

Computational attention methodologies have been studied for many years. However, researchers mainly focus on the construction of topographical saliency maps, and the application of the attention model is restricted to the processing of single frames. Yu-Fei Ma et al. are among the few researchers applying a human attention model to video analysis. In [8], they present the attention model in detail together with its application to video summarization. However, they do not deeply analyze the relationship between user attention and the external stimuli. In this paper, depending on the characteristics of action movies, we focus on the construction of a human attention model that reflects the stimuli of various modalities on human perception from the viewpoint of viewers. Moreover, with the importance score calculated for each frame, we obtain an attention flow plot to simulate the change of human attention across different sections of a movie. With video structuralization and attention flow plot analysis, we realize action concept annotation for action movie analysis.

The rest of the paper is organized as follows. In Section 2, we illustrate the human attention model based on multi-modal fusion. The application of the model to action concept annotation is introduced in Section 3. In Section 4, the experimental results are presented. Finally, the conclusions and future work are stated in Section 5.

2. Human Attention Model

In human perception, people usually focus on conspicuous changes in both the audio and visual modalities. For example, the sudden shock of an explosion in the auditory modality or dramatic object motion in the visual modality can attract much of a viewer's attention. Therefore, to effectively analyze movie content, we also need to mine human attention descriptors from the viewers' standpoint.

The human attention model presented in this paper is comprised of two sub-models, namely the visual attention sub-model and the audio attention sub-model. In the visual modality, people tend to pay much attention to scenes with drastic object movements, while in the auditory modality, sound with high volume or quick pace may attract much of a viewer's attention. On the basis of this analysis, we extract the following features, namely Motion Energy, Motion Dispersion, Audio Energy and Audio Pace, to represent the different stimuli for human attention. By fusing these multi-modal factors, the human attention model is founded. The details of the modeling methods are elucidated in the following sections.

2.1 Visual Attention Sub-model

From the viewpoint of psychology, human visual attention can be attracted by the change of image content in the video sequence and by the complexity of a single picture itself. These factors can be represented by the motion information between adjacent frames. In this paper, we use Motion Energy and Motion Dispersion respectively to model these two factors for constructing the visual attention sub-model.

2.1.1 Motion Energy

We use the motion activity descriptor defined in MPEG-7 to represent motion energy. The motion activity descriptor values can be calculated from the motion vectors of 16x16 macroblocks. The method of extracting motion energy is presented as follows.

(a) The motion vectors for each frame are extracted from the MPEG-1 compressed film video.

(b) For each P frame (for lower computational complexity, we only process the P frames of each video), the motion energy of the (i,j)th macroblock, E_MV(i,j), is defined as:

$E_{MV}(i,j) = \sqrt{x_{i,j}^2 + y_{i,j}^2}$    (1)

where (x_{i,j}, y_{i,j}) is the motion vector of the (i,j)th macroblock.

(c) A mask is made for each frame to assign different weights to the motion vectors.

Although in the former work [8] motion vectors are used for the calculation of motion energy, the importance of the individual motion vectors within each frame is not deeply analyzed. For an action movie, viewers are always attracted by the fighting between characters, namely the motion of the foreground. Therefore, the motion energy of the foreground can represent the intensity of attention more exactly, and a higher weight should be assigned to it. We therefore make the mask for each frame as follows.

Figure 1. One frame from the movie "Fearless". (Each grid represents a macroblock; the red line represents the motion vector, and its length indicates the value of the motion energy.)

Because the highlight is usually in the centre of a frame, we use the field F shown in blue shadow in Figure 1 to represent the background. The average motion energy of the background in the kth P frame, defined as BE_MV(P_k), is calculated by:

$BE_{MV}(P_k) = \frac{1}{L} \sum_{(x_{i,j}, y_{i,j}) \in F} E_{MV}(i,j)$    (2)

where L is the total number of macroblocks in the field F (the field in blue shadow). For all the P frames in the video, the mean μ and standard deviation σ of the motion energy of the background are calculated. Then the adaptive threshold, Th, is determined by:

$Th = \mu + a \cdot \sigma$    (3)

where a is an empirical parameter obtained by experiments (in our experiments, it is set to 3).

With Th, the weight weight(x_{i,j}, y_{i,j}) of the (i,j)th macroblock is determined by:

$weight(x_{i,j}, y_{i,j}) = 2$ if $E_{MV}(i,j) > Th$;  $weight(x_{i,j}, y_{i,j}) = 1$ if $E_{MV}(i,j) \le Th$.    (4)

(d) The average motion energy of the kth P frame is calculated by:

$E(P_k) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} E_{MV}(i,j) \cdot weight(x_{i,j}, y_{i,j})$    (5)

where M and N are the numbers of macroblocks in the horizontal and vertical directions of each frame, respectively. The average motion energy of each shot, E_AVE, is calculated by averaging over all the P frames in the shot.
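To make steps (a)-(d) concrete, the following Python sketch computes the weighted average motion energy of one P frame from its macroblock motion vectors. It assumes the motion vectors have already been extracted as arrays; the function names, the boolean mask for the background field F, and the default a = 3 are our own illustration of the description above, not code from the paper.

```python
import numpy as np

def motion_energy(mv_x, mv_y):
    """Per-macroblock motion energy E_MV(i,j) = sqrt(x^2 + y^2) (Eq. 1)."""
    return np.sqrt(mv_x ** 2 + mv_y ** 2)

def background_energy(mv_x, mv_y, bg_mask):
    """Average motion energy of the background field F in one P frame (Eq. 2)."""
    e_mv = motion_energy(mv_x, mv_y)
    return e_mv[bg_mask].mean()

def adaptive_threshold(bg_energies, a=3.0):
    """Th = mu + a * sigma over the background energies of all P frames (Eq. 3)."""
    return np.mean(bg_energies) + a * np.std(bg_energies)

def frame_motion_energy(mv_x, mv_y, th):
    """Weighted average motion energy of one P frame (Eqs. 4-5)."""
    e_mv = motion_energy(mv_x, mv_y)
    weight = np.where(e_mv > th, 2.0, 1.0)   # foreground macroblocks get weight 2
    return np.mean(e_mv * weight)            # average over the M x N macroblocks
```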

2.1.2 Motion Dispersion

In Figure 2, there are three pictures selected from the movie "Fearless". Picture (a) is from the scene in which the character, Yuanjia Huo, observes the condition of the arena; according to the story content, it is a static picture without foreground or background motion. Picture (b) is from the scene in which Huo wins the competition and is welcomed by the people in the restaurant; because the picture only depicts people walking in a single direction, the orientations of the motion vectors are almost the same. Picture (c) is from the scene in which Huo fights with another fighter in the arena; because the motion of the persons is complex, the orientations of the motion vectors are diverse. It is perceivable that the static picture (a), with tedious content, surely cannot attract much attention from viewers. Comparing pictures (b) and (c), although there is more dramatic foreground motion in (b), the single orientation degrades the complexity of the content in the picture. Conversely, the diversity of orientations in (c) makes viewers focus on it, although its motion energy is a little less than that in (b).

Figure 2. Three P frames from the movie "Fearless". (Each grid represents a macroblock; the red line represents the motion vector, and its length indicates the value of the motion energy.) (a) A static picture without foreground and background motion; (b) a picture with single-orientation motion of foreground and background; (c) a picture with complex motion of foreground and background.

With the analysis above, we define Motion Dispersion to describe the complexity of the content in each frame. It is calculated as follows.

(a) Extract motion vectors from the MPEG-1 compressed film video and calculate the mask for each frame in the way presented in Section 2.1.1 (c).

(b) Calculate the orientation θ of each motion vector (x_{i,j}, y_{i,j}) with weight "2" in the mask as follows (θ ∈ [-90, 270)):

$\theta = \arctan(y_{i,j}/x_{i,j})$, if $x_{i,j} > 0$;
$\theta = \arctan(y_{i,j}/x_{i,j}) + 180$, if $x_{i,j} < 0$;
$\theta = 90$, if $x_{i,j} = 0$ and $y_{i,j} > 0$;
$\theta = -90$, if $x_{i,j} = 0$ and $y_{i,j} < 0$;
$\theta$ is omitted, if $x_{i,j} = 0$ and $y_{i,j} = 0$.    (6)

(c) The 2-D orientation space is divided into four sections, namely [-90,0), [0,90), [90,180) and [180,270). An Orientation Histogram is defined to represent motion dispersion; it contains four bins, each corresponding to one section of the 2-D space. The orientations θ calculated in (b) fall into the different bins. H[i] (i=1,2,3,4) is defined as the ratio of the number of θ in the ith bin to the total number of θ in one frame, so the sum of H[i] (i=1,2,3,4) is 1. We define Motion Dispersion (MD) as:

$MD = \prod_{i=1}^{4} H[i]$    (7)

Mathematically, the mean of H[i], μ, is a constant, 0.25, while their standard deviation, σ, is variable. Moreover, because the sum of H[i] (i=1,2,3,4) is fixed, it can be shown that the smaller σ is, the larger MD is. In other words, the more evenly the orientations are distributed, the larger Motion Dispersion becomes, which indicates that the more complex the foreground motion is, the more intense an impact the content has on human perception.

(d) The average Motion Dispersion of each shot, MD_AVE, is calculated by averaging the MD of all P frames in the shot.
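A corresponding Python sketch for Motion Dispersion, under the same assumptions as before: only the weight-2 (foreground) motion vectors are used, and np.arctan2 followed by a shift into [-90, 270) stands in for the case analysis of Eq. (6).

```python
import numpy as np

def motion_dispersion(mv_x, mv_y, weight):
    """Motion Dispersion (MD) of one P frame (Eqs. 6-7).

    mv_x, mv_y : (M, N) motion-vector components per macroblock
    weight     : (M, N) mask from Section 2.1.1 (c); only weight-2 vectors are used
    """
    fg = (weight == 2) & ~((mv_x == 0) & (mv_y == 0))   # foreground, zero vectors omitted
    if not np.any(fg):
        return 0.0
    theta = np.degrees(np.arctan2(mv_y[fg], mv_x[fg]))  # angles in (-180, 180]
    theta = np.where(theta < -90, theta + 360, theta)   # map into [-90, 270)
    hist, _ = np.histogram(theta, bins=[-90, 0, 90, 180, 270])
    h = hist / hist.sum()                               # H[i], i = 1..4, sums to 1
    return float(np.prod(h))                            # MD = H[1]*H[2]*H[3]*H[4]
```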

2.1.3 Visual Attention Sub-model

To formulate the visual attention sub-model, we use Equation (8) to fuse both elements mentioned above. The visual attention sub-model is defined as:

$VM(n) = \alpha \cdot \frac{ME(n) - \mu_{ME}}{\sigma_{ME}} + \beta \cdot \frac{MD(n) - \mu_{MD}}{\sigma_{MD}}$    (8)

where:
ME: Motion Energy;
MD: Motion Dispersion;
n: shot number;
μ_ME and μ_MD: means of motion energy and motion dispersion, respectively;
σ_ME and σ_MD: standard deviations of motion energy and motion dispersion, respectively.

The weights α and β are both set to 0.5, effectively assuming that both elements contribute equally to the visual attention sub-model.

2.2 Audio Attention Sub-model

From the viewpoint of psychology, human auditory attention can be attracted by sound with high volume or quick pace. In this paper, we use Audio Energy and Audio Pace respectively to model these two factors for constructing the audio attention sub-model.

2.2.1 Audio Energy


Referring to [9], we calculate the frame-based short-time energy to represent Audio Energy. The short-time energy is then averaged over each shot to represent the importance of the accompanying sound.
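As a rough illustration of how Audio Energy could be obtained (the exact framing in [9] is not restated here, so the 20 ms frame length and plain rectangular framing below are assumptions), the frame-based short-time energy and its per-shot average might be computed as:

```python
import numpy as np

def short_time_energy(samples, sr, frame_ms=20):
    """Frame-based short-time energy of a mono audio signal.

    samples  : 1-D array of audio samples
    sr       : sampling rate in Hz (22 kHz in our experiments)
    frame_ms : frame length in milliseconds (assumed value)
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames.astype(np.float64) ** 2, axis=1)   # energy per audio frame

def shot_audio_energy(frame_energies):
    """Average short-time energy over one shot, used as Audio Energy (AE)."""
    return float(np.mean(frame_energies))
```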

2.2.2 Audio Pace

After the Audio Energy is calculated, we use N_AP to denote the number of audio energy peaks that are greater than a threshold Th_AP determined in the experiments. N_AP is further normalized by the total number of audio frames in the shot, so Audio Pace can be formulated as:

$AP = \frac{N_{AP}}{\text{Number of audio frames in one shot}}$    (9)

Audio Pace (AP) directly represents the change frequency of the audio energy within a shot. It is perceivable that a higher Audio Pace indicates a tenser atmosphere or a more splendid scene. Therefore, viewers may pay more attention to shots with a high Audio Pace.
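Continuing the sketch above, Audio Pace for one shot could be computed as follows; treating a peak as a local maximum of the short-time energy above Th_AP is our interpretation, since the peak-picking rule is not spelled out in the text.

```python
import numpy as np

def audio_pace(frame_energies, th_ap):
    """Audio Pace (AP) of one shot (Eq. 9).

    frame_energies : short-time energy of each audio frame in the shot
    th_ap          : peak threshold Th_AP determined in the experiments
    """
    e = np.asarray(frame_energies, dtype=np.float64)
    # A peak is a local maximum whose energy exceeds Th_AP (assumed definition).
    is_peak = (e[1:-1] > e[:-2]) & (e[1:-1] >= e[2:]) & (e[1:-1] > th_ap)
    n_ap = int(np.count_nonzero(is_peak))
    return n_ap / len(e)          # normalized by the number of audio frames
```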

2.2.3 Audio Attention Sub-model

To formulate the audio attention sub-model, we use Equation (10) to fuse both audio features. The audio attention sub-model is defined as:

$AM(n) = \gamma \cdot \frac{AE(n) - \mu_{AE}}{\sigma_{AE}} + \lambda \cdot \frac{AP(n) - \mu_{AP}}{\sigma_{AP}}$    (10)

where:
AE: Audio Energy;
AP: Audio Pace;
n: shot number;
μ_AE and μ_AP: means of audio energy and audio pace, respectively;
σ_AE and σ_AP: standard deviations of audio energy and audio pace, respectively.

Here, the weights γ and λ are also both set to 0.5, assuming that both elements contribute equally to the audio attention sub-model. Other weighting schemes will be studied in future work.

2.3 Human Attention Model

By integrating both sub-models, we formulate the human attention model as follows:

$HAV(n) = \theta \cdot VM(n) + \psi \cdot AM(n)$    (11)

where HAV(n) denotes the human attention value of the nth shot. The weights θ and ψ are also initially set to 0.5, which means that the visual and audio modalities have the same influence on human attention.

As n varies in the temporal domain, we obtain the attention flow plot depicting the change of human attention, shown in Figure 3 (a). Moreover, the attention flow plot is smoothed with a Gaussian filter with a window of size 9 (σ = 1.5), shown in Figure 3 (b), to depict the gradual change caused by the continuous stimuli.
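Putting the pieces together, the following sketch z-score normalizes the per-shot features over all shots of a movie (our reading of the mean and standard-deviation terms in Eqs. (8) and (10)), fuses them with the equal weights of Eqs. (8), (10) and (11), and smooths the resulting attention flow with a size-9 Gaussian window (sigma = 1.5). The per-shot feature arrays are assumed to be computed as in the previous sections.

```python
import numpy as np

def zscore(x):
    """Normalize a per-shot feature to zero mean and unit variance."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / x.std()

def attention_flow(me, md, ae, ap, alpha=0.5, beta=0.5, gamma=0.5, lam=0.5,
                   theta=0.5, psi=0.5):
    """Human attention value HAV(n) for every shot n (Eqs. 8, 10, 11)."""
    vm = alpha * zscore(me) + beta * zscore(md)     # visual attention sub-model
    am = gamma * zscore(ae) + lam * zscore(ap)      # audio attention sub-model
    return theta * vm + psi * am                    # fused human attention model

def smooth_attention(hav, size=9, sigma=1.5):
    """Smooth the attention flow plot with a Gaussian window of the given size."""
    t = np.arange(size) - (size - 1) / 2
    kernel = np.exp(-t ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    return np.convolve(hav, kernel, mode='same')
```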


Figure 3. (a) Attention flow plot for one video clip of the movie "Fearless"; (b) the corresponding attention flow plot smoothed with a Gaussian filter. (Different colors denote different scenes.)

3. Action Scene Detection

With the human attention model and video structuralization based on our previous work in [10], we can detect action scenes with the following rule:

(a) Count the number NP of peaks within one scene whose values are higher than Threshold1;

(b) If NP is larger than Threshold2, the scene is considered an action scene.

Here, Threshold1 and Threshold2 are parameters determined in the experiments. From Figure 3 (b), it is perceivable that because Scenes 4 and 5 are action scenes in the movie, human attention changes more drastically there. With appropriate Threshold1 and Threshold2, action scenes can be detected.
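A minimal sketch of this detection rule, assuming the smoothed per-shot attention values have already been grouped by scene; the peak test (a local maximum above Threshold1) is our interpretation, and Threshold1 and Threshold2 remain the experimentally tuned parameters mentioned above.

```python
import numpy as np

def is_action_scene(scene_hav, threshold1, threshold2):
    """Apply the two-threshold rule to one scene.

    scene_hav  : smoothed human attention values of the shots in the scene
    threshold1 : minimum peak height (Threshold1)
    threshold2 : minimum number of qualifying peaks (Threshold2)
    """
    h = np.asarray(scene_hav, dtype=np.float64)
    # Count local maxima whose attention value exceeds Threshold1.
    peaks = (h[1:-1] > h[:-2]) & (h[1:-1] >= h[2:]) & (h[1:-1] > threshold1)
    np_peaks = int(np.count_nonzero(peaks))
    return np_peaks > threshold2

def detect_action_scenes(scenes, threshold1, threshold2):
    """Return the indices of scenes labeled as action scenes."""
    return [i for i, s in enumerate(scenes)
            if is_action_scene(s, threshold1, threshold2)]
```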

4. Experimental Results

To demonstrate the effectiveness and generality of action scene detection with the human attention model, we select four action movies for testing. The details of each movie are presented in Table 1. By human judgment, we established the ground truth for experimental result analysis. Precision and Recall, which are well-known measures in the information retrieval field, are used to evaluate the results of action scene detection. The detailed experimental results are shown in Table 2.

From Table 2, we can see that the recall value is satisfactory, which means that the human attention model can well depict the characteristics of action scenes. Comparatively, the precision value is a little lower. The main reasons are as follows:

(a) Some scenes, for example a congested crowd walking in the street, have most of the characteristics of an action scene and thereby give false positives.

(b) Although some scenes, for example a scene with loud shouting or noise, have only one or two features of an action scene, their attention value is much larger. The parameters mentioned in Section 2 may not represent the best combination of all the elements, which leads to false positives.

Table 1. Detailed information about each movie

Movie Title        Crouching Tiger,   Fist of Legend   Windtalkers    Pearl Harbor
                   Hidden Dragon
Runtime (min)      120                103              134            183
Movie Genre        Action + Drama     Action + Drama   Action + War   Action + War
File Format        MPEG-1 (all movies)
Audio Format       16 bits/sample, mono, 22 kHz (all movies)
Frame Rate (f/s)   30                 25               25             30

Table 2. Experimental results

                   Crouching Tiger,   Fist of Legend   Windtalkers    Pearl Harbor
                   Hidden Dragon
Ground truth       8                  8                9              12
Detected           10                 8                10             14
False accepted     2                  1                2              2
False rejected     0                  1                1              0
Precision (%)      80                 87.5             80             85.7
Recall (%)         100                87.5             88.9           100

Average Precision (%): 83.3
Average Recall (%): 94.1

5. Conclusions and Future Work

By integrating the visual and audio attention sub-models, we formulate the human attention model to simulate the change of human attention in the time domain. Moreover, on the basis of video structuralization and action scene detection, we design a system for hierarchical browsing and movie editing with action scene annotation.

Large-scale experiments demonstrate the effectiveness and generality of the human attention model for action movie analysis. However, the parameters for model computation mentioned in Section 2 should be further analyzed to better reflect the relationship and importance of these elements. In addition, we will advance our research on analyzing other genres of movies with the human attention model, to build a general system for movie content analysis in the future.

Acknowledgements

This work was supported by the Beijing Science and Technology Planning Program of China (D0106008040291), the Key Project of the Beijing Natural Science Foundation (4051004), and the Key Project of International Science and Technology Cooperation (2005DFA11060).

References

[1] Brett Adams, Chitra Dorai, Svetha Venkatesh. A Novel Approach to Determining Tempo and Dramatic Story Sections in Motion Pictures. IEEE ICIP, 2000.
[2] Brett Adams, Chitra Dorai, Svetha Venkatesh. Toward Automatic Extraction of Expressive Elements from Motion Pictures: Tempo. IEEE Transactions on Multimedia, 2002.
[3] Lei Chen, Sariq J. Rizvi, M. Tamer Ozsu. Incorporating Audio Cues into Dialog and Action Scene Extraction. Proc. of SPIE Storage and Retrieval for Media Databases, 2003.
[4] Alan F. Smeaton, Bart Lehane, et al. Automatically Selecting Shots for Action Movie Trailers. Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, 2006.
[5] Hsuan-Wei Chen, Jin-Hau Kuo, Wei-Ta Chu, et al. Action Movies Segmentation and Summarization Based on Tempo Analysis. Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, 2004.
[6] L. Itti, C. Koch, E. Niebur. "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis." IEEE Transactions on Pattern Analysis and Machine Intelligence, 1998.
[7] L. Itti, C. Koch, E. Niebur. "Computational Modelling of Visual Attention." Nature Reviews Neuroscience, 2001.
[8] Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, et al. "A User Attention Model for Video Summarization." Proc. of the Tenth ACM International Conference on Multimedia, 2002.
[9] Bai Liang, Hu Yaali. Feature Analysis and Extraction for Audio Automatic Classification. Proc. of IEEE International Conference on Systems, Man and Cybernetics, 2005.
[10] Sheng Tang, Yong-Dong Zhang, Jin-Tao Li, et al. TRECVID 2006 Rushes Exploitation by CAS MCG. In Proc. TRECVID Workshop, Gaithersburg, USA, Nov. 2006. http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html