

Automatic detection of regions of interest in complex video sequences
Wilfried Osberger* and Ann Marie Rohaly

Video Business Unit, Tektronix, Inc., Beaverton OR

ABSTRACT
Studies of visual attention and eye movements have shown that people generally attend to only a few areas in typical scenes. These areas are commonly referred to as regions of interest (ROIs). When scenes are viewed with the same context and motivation (e.g., typical entertainment scenario), these ROIs are often highly correlated amongst different people, motivating the development of computational models of visual attention. This paper describes a novel model of visual attention designed to provide an accurate and robust prediction of a viewer's locus of attention across a wide range of typical video content. The model has been calibrated and verified using data gathered in an experiment in which the eye movements of 24 viewers were recorded while viewing material from a large database of still (130 images) and video (~13 minutes) scenes. Certain characteristics of the scene content, such as moving objects, people, foreground and centrally-located objects, were found to exert a strong influence on viewers’ attention. The results of comparing model predictions to experimental data demonstrate a strong correlation between the predicted ROIs and viewers’ fixations.

Keywords: Visual attention, regions of interest, ROI, eye movements, human visual system.

1. INTRODUCTION
In order to efficiently process the mass of information presented to it, the resolution of the human retina varies across its spatial extent. High acuity is only available in the fovea, which is approximately 2 deg in diameter. Knowledge of a scene is obtained through regular eye movements which reposition the area under foveal view. These eye movements are by no means random; they are controlled by visual attention mechanisms, which direct the fovea to regions of interest (ROIs) within the scene. The factors that influence visual attention are considered to be either top-down (i.e., task or context driven) or bottom-up (i.e., stimulus driven). A number of studies have shown that when scenes are viewed with the same context and motivation (e.g., typical entertainment scenario), ROIs are often highly correlated amongst different people.1,2,3,4 As a result, it is possible to develop computational models of visual attention that can analyze a picture and accurately estimate the location of viewers’ ROIs. Many different applications can make use of such a model,5 including image and video compression, picture quality analysis, image and video databases and advertising.

A number of different models of visual attention have been proposed in the literature (see refs. 5 and 6 for a review). Some of these have been designed for use with simple, artificial scenes (like those used in visual search experiments) and consequently do not perform well on complex scenes. Others require top-down input. Because such information is not available in a typical entertainment video application, these types of models are not considered here.

Koch and his colleagues have proposed a number of models for detecting visual saliency. In their most recent work,7 a multi-scale decomposition is performed on the input image and three features – contrast, color and orientation – are extracted. These features are weighted equally and combined to produce a saliency map. The ordering of viewers’ fixations is then estimated using an inhibition-of-return model that suppresses recently fixated locations. Results of this model have so far only been demonstrated for still images.

More recently, object-based models of attention have been proposed.5,8,9,10 This approach is supported by evidence that attention is directed towards objects rather than locations (see ref. 5 for a discussion). The general framework for this class of models is quite similar. The scene is first segmented into homogeneous objects. Factors that are known to influence visual attention are then calculated for each object in the scene, weighted and combined to produce a map that indicates the likelihood that observers would focus their attention on a particular region. We refer to these maps as Importance Maps (IMs).5

Although the results reported for these models have been promising, a number of issues prevent their use with complex video sequences:

* Further author information: Email: [email protected]


• Most of the models were designed for use with still images and cannot be directly used to predict attention in video sequences.

• The number of features used in the models is often too small to obtain high accuracy predictions across a wide range of complex scene types.

• The relative weightings of the different features are unknown or modeled in an ad hoc manner.

The model described in this paper specifically addresses the above issues. It is based on the framework developed in our previous work5,8,11 with a number of significant changes that improve its robustness and accuracy across a wide range of typical video content.

This paper is organized as follows: Section 2 discusses human visual attention, focusing in particular on the different features that have been found to influence attention. In Section 3, the details of the attention model are presented, while the subjective eyetracking experiments used to calibrate and verify the model’s operation are contained in Section 4. Results of the model’s performance across a wide range of complex video inputs are summarized in Section 5.

2. FACTORS THAT INFLUENCE VISUAL ATTENTION
Visual search experiments, eye movement studies and other psychophysical and psychological tests have resulted in the identification of a number of factors that influence visual attention and eye movements. These are often categorized as being either top-down (task or context driven) or bottom-up (stimulus driven), although for some factors the distinction may not be so clear. A general observation is that an area or object that stands out from its surroundings with respect to a particular factor is likely to attract attention. Some of the factors that have been found to exert the strongest influence on attention are listed below:

• Motion. Motion has been found to be one of the strongest influences on visual attention.12 Peripheral vision is highly tuned to detect changes in motion, the result being that attention is involuntarily drawn to peripheral areas exhibiting motion which is distinct from surrounding areas. Areas undergoing smooth, steady motion can be tracked by the eye, and humans cannot tolerate any more distortion in these regions than in stationary regions.13

• Contrast. The human visual system converts luminance into contrast at an early stage of processing. Region contrast is consequently a very strong bottom-up visual attractor.14 Regions that have a high contrast with their surrounds attract attention and are likely to be of greater visual importance.

• Size. Findlay15 has shown that region size also has an important effect on attention. Larger regions are more likely to attract attention than smaller ones. A saturation point exists, however, after which the importance of size levels off.

• Shape. Regions whose shape is long and thin (edge-like) or that have many corners and angles have been found to be visual attractors.3 They are more likely to attract attention than smooth, homogeneous regions of predictable shape.

• Color. Color has also been found to be important in attracting attention.16 A strong influence occurs when the color of a region is distinct from the color of its background. Certain specific colors (e.g., red) have been shown to attract attention more than others.

• Location. Eyetracking experiments have shown that viewers’ fixations are directed at the center 25% of a frame for a majority of viewing material.17,18

• Foreground / Background. Viewers are more likely to be attracted to objects in the foreground than those in the background.4

• People. Many studies1,19 have shown that viewers are drawn to people in a scene, in particular their faces, eyes, mouths and hands.

• Context. Viewers’ eye movements can be dramatically changed, depending on the instructions they are given prior to or during the observation of an image.1,14

Other bottom-up factors that have been found to influence attention include brightness, orientation, edges and line ends. Although many factors that influence visual attention have been identified, little quantitative data exists regarding the exact weighting of the different factors and their inter-relationships. Some factors are clearly of very high importance (e.g., motion) but it is difficult to determine exactly the relative importance of one factor vs. another.


To answer this question, we performed an eye movement test with a large number of viewers and a wide range of stimulus material (see Section 4). The individual factors used in the visual attention model were correlated to viewers’ fixations in order to determine the relative influence of each factor on eye movements. This provided a set of factor weightings which were then used to calibrate the model. Details of this process are contained in Section 4 of the paper.

3. MODEL DESCRIPTION
An overall block diagram showing the operation of the attention model is shown in Figure 1. While the general structure is similar to that reported previously,5,8,11 the algorithms and techniques used within each of the model components have been changed significantly, resulting in considerable improvements in accuracy and robustness. In addition, new features such as camera motion estimation, color and skin detection have been added.

Figure 1: Block diagram of the visual attention model. Figure 2: Size importance factor.

3.1 Segmentation
The spatial part of the model is represented by the upper branch in Figure 1. The original frame is first segmented into homogeneous regions. A recursive split-and-merge technique is used to perform the segmentation. Both the graylevel (gl) and the color (in L*u*v* coordinates) are used in determining the split and merge conditions. The condition for the recursive split is:

If: $\mathrm{var}(R_i(gl)) > th_{splitlum} \;\;\&\;\; \mathrm{var}(R_i(col)) > th_{splitcol}$,

Then: split $R_i$ into 4 quadrants, and recursively split each,

where var = variance and $\mathrm{var}(R_i(col)) = \mathrm{var}(R_i(u^*))^2 + \mathrm{var}(R_i(v^*))^2$. Values of the thresholds that have been found to provide good results are thsplitlum = 250 and thsplitcol = 120.

For the region merging, both the mean and the variance of the region’s graylevel and color are used to determine whether two regions should be merged. The merge condition for testing whether two regions R1 and R2 should be merged is:

If: $\mathrm{var}(R_{12}(gl)) < th_{mergelum} \;\;\&\;\; \mathrm{var}(R_{12}(col)) < th_{mergecol} \;\;\&\;\; \big((\Delta lum < th_{meanmergelum} \;\&\; \Delta col < th_{meanmergecol}) \;\mathrm{OR}\; (\Delta lum < th_{lowlum} \;\&\; \Delta col < th_{lowcol})\big)$,

Then: combine the two regions into one,
Else: keep the regions separate,



where $\Delta lum = |R_1(gl) - R_2(gl)|$ and $\Delta col = (R_1(u^*) - R_2(u^*))^2 + (R_1(v^*) - R_2(v^*))^2$. The thresholds thmergelum and thmergecol are adaptive and increase as the size of the regions being merged increases. The thresholds thmeanmergelum and thmeanmergecol are also adaptive and depend upon a region’s luminance and color. Following the split-and-merge procedure, regions that have a size less than 64 pixels (for a 512 x 512 image) are merged with the neighboring region having the closest luminance. This process removes the large number of small regions that may be present after the split-and-merge.
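The sketch below illustrates the split half of the split-and-merge procedure under the reconstructed conditions above. It assumes the frame is available as numpy arrays gl, u and v (graylevel and the L*u*v* chroma channels); the quadtree bookkeeping, the min_size stopping rule and the helper names are illustrative, not part of the paper.

```python
import numpy as np

TH_SPLIT_LUM = 250.0   # thsplitlum from the text
TH_SPLIT_COL = 120.0   # thsplitcol from the text

def color_variance(u_reg: np.ndarray, v_reg: np.ndarray) -> float:
    # var(R(col)) = var(R(u*))^2 + var(R(v*))^2, as reconstructed above
    return float(np.var(u_reg) ** 2 + np.var(v_reg) ** 2)

def should_split(gl_reg, u_reg, v_reg) -> bool:
    return np.var(gl_reg) > TH_SPLIT_LUM and color_variance(u_reg, v_reg) > TH_SPLIT_COL

def split_recursive(gl, u, v, y0=0, x0=0, h=None, w=None, min_size=8, regions=None):
    """Recursively split a block into quadrants while the split condition holds."""
    if h is None or w is None:
        h, w = gl.shape
    if regions is None:
        regions = []
    blk = (slice(y0, y0 + h), slice(x0, x0 + w))
    if h > min_size and w > min_size and should_split(gl[blk], u[blk], v[blk]):
        h2, w2 = h // 2, w // 2
        quads = [(y0, x0, h2, w2), (y0, x0 + w2, h2, w - w2),
                 (y0 + h2, x0, h - h2, w2), (y0 + h2, x0 + w2, h - h2, w - w2)]
        for qy, qx, qh, qw in quads:
            split_recursive(gl, u, v, qy, qx, qh, qw, min_size, regions)
    else:
        regions.append((y0, x0, h, w))
    return regions
```

The complementary merge pass would then compare the variance, Δlum and Δcol of adjacent regions against the adaptive thresholds described above.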

3.2 Spatial factors
The segmented frame is then analyzed by a number of different factors that are known to influence attention (see Section 2), resulting in an importance map for each factor. Seven different attentional factors are used in the current implementation of the spatial model.

• Contrast of region with its surround. Regions that have a high luminance contrast with their local surrounds are known to attract visual attention. The contrast importance Icont of a region Ri is calculated as:

$$I_{cont}(R_i) = \frac{\sum_{j=1}^{J} |R_i(gl) - R_j(gl)| \cdot \min(B_{ij},\; k_{border} \cdot size(R_j))}{\sum_{j=1}^{J} \min(B_{ij},\; k_{border} \cdot size(R_j))},$$

where j = 1..J are the regions that share a 4-connected border with Ri, kborder is a constant to limit the extent of influence of neighbors (set to 10) and Bij is the number of pixels in Rj that share a 4-connected border with Ri. Improved results can be achieved when the contrast is scaled to account for Weber and deVries-Rose conditions. Icont is then scaled to the range 0-1. This is done in an adaptive manner so the contrast importance for a region of a particular contrast is reduced in frames that have many high contrast regions and increased in frames where the highest contrast is low.
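As a concrete illustration, here is a hedged sketch of the contrast importance above, prior to the adaptive 0-1 scaling. The containers region_gl (mean graylevel per region), region_size (pixel count per region) and shared_border (B_ij per neighbour pair) are illustrative data structures, not part of the paper.

```python
K_BORDER = 10   # k_border from the text

def contrast_importance(i, region_gl, region_size, shared_border):
    """Unscaled I_cont(R_i); shared_border[i] maps neighbour id j -> B_ij."""
    num = den = 0.0
    for j, b_ij in shared_border[i].items():
        weight = min(b_ij, K_BORDER * region_size[j])   # limit influence of large neighbours
        num += abs(region_gl[i] - region_gl[j]) * weight
        den += weight
    return num / den if den > 0 else 0.0
```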

• Size of region. Large objects have been found to be visual attractors. A saturation point exists, however, after which further size increases no longer increase the likelihood of the object attracting attention. The effect of this is illustrated in Figure 2, which maps the size importance Isize as a function of size(Ri)/size(frame). Parameter values that have been found to work well are thsize1 = 0, thsize2 = 0.01, thsize3 = 0.05 and thsize4 = 0.50.
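A minimal sketch of this mapping follows, assuming the trapezoidal shape suggested by Figure 2 (zero at thsize1, rising to 1 at thsize2, flat to thsize3, then falling off by thsize4); the exact shape of the falling segment is an assumption.

```python
import numpy as np

TH_SIZE = (0.0, 0.01, 0.05, 0.50)   # thsize1..thsize4 from the text

def size_importance(region_pixels: int, frame_pixels: int) -> float:
    ratio = region_pixels / frame_pixels
    t1, t2, t3, t4 = TH_SIZE
    # piecewise-linear trapezoid (assumed) over the size ratio
    return float(np.interp(ratio, [t1, t2, t3, t4], [0.0, 1.0, 1.0, 0.0]))
```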

• Shape of region. Areas with an unusual shape or areas with a long and thin shape have been identified as attractors of attention. Importance due to region shape is calculated as:

$$I_{shape}(R_i) = k_{shape} \cdot \frac{bp(R_i)^{pow_{shape}}}{size(R_i)},$$

where bp(Ri) is the number of pixels in Ri that share a 4-connected border with another region, powshape is used to increase the size-invariance of Ishape (set to 1.75) and kshape is an adaptive scaling constant that reduces the shape importance for regions with many neighbors. Ishape is then scaled to the range 0-1. As with Icont, this is done in an adaptive manner so the shape importance of a particular region is reduced in frames that have many regions of high shape importance and increased in frames where the highest shape importance is low.
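A hedged sketch of the shape formula above, before the adaptive 0-1 scaling; kshape is adaptive in the paper, so the fixed default used here is purely a placeholder.

```python
POW_SHAPE = 1.75   # powshape from the text

def shape_importance(border_pixels: int, region_pixels: int, k_shape: float = 1.0) -> float:
    # I_shape(R_i) = k_shape * bp(R_i)^powshape / size(R_i); k_shape = 1.0 is a placeholder
    return k_shape * (border_pixels ** POW_SHAPE) / region_pixels
```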

• Location of region. Several studies have shown a viewer preference for centrally-located objects. As a result, the location importance is calculated using four zones within the frame whose importance gradually decreases with distance from the center. This is shown graphically in Figure 3. The location importance for each region is calculated as:

$$I_{location}(R_i) = \frac{\sum_{z=1}^{4} w_z \cdot numpix(R_{iz})}{size(R_i)},$$

where numpix(Riz) is the number of pixels in region Ri that are located in zone z, and wz are the zone weightings (values given in Figure 3).
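The following sketch assumes a label image zones assigning each pixel a zone index 0-3 (centre outwards) and uses the weights listed for Figure 3; region_mask is a boolean mask for Ri. The array layout is illustrative.

```python
import numpy as np

ZONE_WEIGHTS = np.array([1.0, 0.7, 0.4, 0.0])   # w1..w4 from Figure 3

def location_importance(region_mask: np.ndarray, zones: np.ndarray) -> float:
    counts = np.bincount(zones[region_mask], minlength=4)[:4]   # numpix(R_iz) per zone
    return float(np.dot(ZONE_WEIGHTS, counts) / region_mask.sum())
```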


Figure 3: Weightings and zones for location IM. Figure 4: Calculation of skin importance.

• Foreground/Background region. Foreground objects have been found to attract more attention than background objects. It is difficult to detect foreground and background objects in a still scene since no motion information is present. However, a general assumption can be made that foreground objects will not be located on the border of the scene. A region can then be assigned to the foreground or background on the basis of the number of pixels which lie on the frame border. Regions with a high number of frame border pixels are classified as belonging to the background and have a low Foreground/Background importance as given by:

$$I_{FGBG}(R_i) = 1 - \min\!\left(\frac{borderpix(R_i)}{0.3 \cdot \min(frameborderpix,\; perimeterpix(R_i))},\; 1.0\right),$$

where borderpix(R) is the number of pixels in region R that also border on the edge of the frame and perimeterpix(R) is the number of pixels in region R that share a 4-connected border with another region.
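A minimal sketch of the foreground/background importance above, assuming frameborderpix denotes the total number of pixels on the frame border (an interpretation, since the text does not define it explicitly).

```python
def fgbg_importance(borderpix: int, perimeterpix: int, frameborderpix: int) -> float:
    # I_FGBG(R_i) = 1 - min(borderpix / (0.3 * min(frameborderpix, perimeterpix)), 1.0)
    denom = 0.3 * min(frameborderpix, perimeterpix)
    if denom == 0:
        return 1.0   # degenerate region; treated as foreground here (assumption)
    return 1.0 - min(borderpix / denom, 1.0)
```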

• Color contrast of region. The color importance is calculated in a manner very similar to that of contrast importance. In effect, the two features are performing a similar operation – one calculates the luminance contrast of a region with respect to its background while the other calculates the color contrast of a region with respect to its background. The calculation of color importance begins by calculating color contrasts separately for u* and v*, by substituting these values for gl in the formula used to compute contrast importance. The two color importance values are then combined:

$$I_{col}(R_i) = I_{u^*}(R_i)^2 + I_{v^*}(R_i)^2 .$$

Icol is then scaled to the range 0-1. This is done in an adaptive manner so the color importance for a region of a particular color contrast is reduced in frames that have many regions of high color contrast and increased in frames where the highest color contrast is low.

• Skin. People, in particular their faces and hands, are very strong attractors of attention. Areas of skin can be detected by analyzing their color, since all human skin tones, regardless of race, fall within a restricted area of color space. The hue-saturation-value (HSV) color space is commonly used since human skin tones are strongly clustered into a narrow range of HSV values. After converting to HSV, an algorithm similar to that proposed by Herodotou20 is used on each pixel to determine whether or not the pixel has the same color as skin. The skin importance for the region is then calculated as illustrated in Figure 4.
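An illustrative sketch of the skin factor: pixels are classified as skin by thresholding in HSV, and the skin fraction of the region is mapped to Iskin with the ramp shown in Figure 4 (0 below a proportion of 0.25, 1 above 0.75). The HSV bounds below are rough placeholders and are not the values used by Herodotou's algorithm.

```python
import numpy as np

def skin_mask(hsv: np.ndarray) -> np.ndarray:
    """Boolean mask of skin-coloured pixels; hsv is HxWx3 with channels in [0, 1]."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    return (h < 0.1) & (s > 0.2) & (s < 0.7) & (v > 0.3)   # placeholder bounds

def skin_importance(region_mask: np.ndarray, hsv: np.ndarray) -> float:
    proportion = skin_mask(hsv)[region_mask].mean()        # fraction of skin pixels in R_i
    return float(np.clip((proportion - 0.25) / 0.5, 0.0, 1.0))
```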

3.3 Combining spatial factors
The seven spatial factors each produce a spatial IM. An example of this is shown in Figure 5 for the scene rapids. The seven spatial factors need to be combined to produce an overall spatial IM.

[Figure 3 zone weightings: w1 = 1.0, w2 = 0.7, w3 = 0.4, w4 = 0.0, decreasing from the center of the video frame outwards. Figure 4 mapping: Iskin(Ri) rises from 0 to 1 as the proportion of skin pixels in Ri increases from 0.25 to 0.75.]



Figure 5: Spatial factor IMs produced for the image rapids. (a) original image, (b) segmented image, (c) location, (d) shape, (e) size, (f) foreground/background, (g) contrast, (h) color and (i) skin. In (c)-(i), lighter shading represents highest importance.

The literature provides little quantitative indication of how the different factors should be weighted, although it is known that each factor exerts a different influence on visual attention. In order to quantify this relationship, the individual factor maps were correlated with the eye movements of viewers collected in the experiment described in Section 4. This provided an indication of the relative influence of the different factors on viewers’ eye movements. Using this information, the following weighting was used to calculate the overall spatial IM:

$$I_{spatial}(R_i) = \sum_{f=1}^{7} w_f^{pow_w} \cdot \left(I_f(R_i)\right)^{pow_f},$$

where wf is the feature weight (obtained from eyetracking experiments), poww is the feature weighting exponent (to control the relative impact of wf) and powf is the IM weighting exponent. The values of wf are given in Section 4. The spatial IM was then scaled so that the region of highest importance had a value of 1.0. To expand the ROIs, block processing was performed on the resultant spatial IM. The block-processed IM is simply the maximum of the spatial IM within each local n x n block. Values of n = 16 and n = 32 have been shown to provide good results.
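A sketch of this combination and of the block processing step follows. factor_maps is assumed to be a (7, H, W) array of per-factor IMs and weights the wf values from Section 4; the values of poww and powf are not given in the text, so the defaults below are placeholders.

```python
import numpy as np

def combine_spatial(factor_maps: np.ndarray, weights: np.ndarray,
                    pow_w: float = 1.0, pow_f: float = 1.0) -> np.ndarray:
    # I_spatial = sum_f w_f^pow_w * I_f^pow_f, rescaled so the maximum is 1.0
    im = np.tensordot(weights ** pow_w, factor_maps ** pow_f, axes=1)
    return im / im.max()

def block_max(im: np.ndarray, n: int = 16) -> np.ndarray:
    """Replace each n x n block by its maximum (ROI expansion)."""
    h, w = (im.shape[0] // n) * n, (im.shape[1] // n) * n
    blocks = im[:h, :w].reshape(h // n, n, w // n, n)
    return blocks.max(axis=(1, 3)).repeat(n, axis=0).repeat(n, axis=1)
```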

The resultant spatial IMs can be quite noisy from frame to frame. In order to reduce this noisiness and improve the temporal consistency of the IMs, a temporal smoothing operation can be performed.


3.4 Temporal IM
A temporal IM model that was found to work well on a subset of video material has been reported previously.5,11 Two major problems, however, prevented this model’s use with all video content:

• The model could not distinguish camera motion from true object motion. Hence, it failed when there was any camera movement (e.g., pan, tilt, zoom, rotate) while the video was being shot.

• Fixed thresholds were used when assigning importance to a particular motion. Since the amount of motion (and consequently the motion’s influence on attention) varies greatly across different video scenes, these thresholds need to adapt to the motion in the video.

A block diagram showing the improved temporal attention model is shown in Figure 6. Features have been added to solve the problems noted above, allowing the model to work reliably over a wide range of video content.

Figure 6: Temporal attention model. Figure 7: Mapping from object motion to temporal importance.

As in ref. 5, the current and previous frames are used in a hierarchical block matching process to calculate the motion vectors. These motion vectors are used by a novel camera motion estimation algorithm to determine four parameters regarding the camera’s motion: pan, tilt, zoom and rotation. These four parameters are used to compensate the motion vectors (MVs) calculated in the first part of the model (MVorig) so that the true object motion in the scene can be captured:

$$MV_{comp} = MV_{orig} - MV_{camera}.$$

Since the motion vectors in texturally flat areas are not reliable, the compensated motion vectors in these areas are set to 0.

In the final block in Figure 6, the compensated MVs are converted into a measure of temporal importance. This involves scene cut detection, temporal smoothing, flat area removal and adaptive thresholding. The adaptive thresholding process is shown graphically in Figure 7. The thresholds thtempx are calculated adaptively, depending on the amount of object motion in the scene. Scenes with few moving objects and with slow moving objects should have lower thresholds than those scenes with many fast moving objects since human motion sensitivity will not be masked by numerous fast moving objects. An estimate of the amount of motion in the scene is obtained by taking the mth percentile of the camera motion compensated MV map (termed motionm). The current model obtains good results using m = 98. The thresholds are then calculated as:

$$th_{temp1} = 0,$$
$$th_{temp2} = \max(\min(k_{th2} \cdot motion_m,\; k_{th2max}),\; k_{th2min}),$$
$$th_{temp3} = \max(\min(k_{th3} \cdot motion_m,\; k_{th3max}),\; k_{th3min}),$$
$$th_{temp4} = k_{th4} \cdot th_{temp3}.$$

Parameter values that provide good results are kth2 = 0.5, kth2min = 1.0 deg/sec, kth2max = 10.0 deg/sec, kth3 = 1.5, kth3min = 5.0 deg/sec, kth3max = 20.0 deg/sec and kth4 = 2.0. Since the motion is measured in deg/sec, it is necessary to know the monitor’s resolution, pixel spacing and the viewing distance. The current model assumes a pixel spacing of 0.25 mm and a viewing distance of five screen heights, which is typical for SDTV viewing.
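A sketch of the adaptive thresholds and the Figure 7 mapping follows, assuming mv_comp is a map of camera-compensated motion magnitudes in deg/sec; the trapezoidal ramp between the four thresholds follows the shape suggested by Figure 7 and is otherwise an assumption.

```python
import numpy as np

K_TH2, K_TH2_MIN, K_TH2_MAX = 0.5, 1.0, 10.0    # kth2 and its clamp limits (deg/sec)
K_TH3, K_TH3_MIN, K_TH3_MAX = 1.5, 5.0, 20.0    # kth3 and its clamp limits (deg/sec)
K_TH4 = 2.0
M_PERCENTILE = 98                                # m from the text

def temporal_importance(mv_comp: np.ndarray) -> np.ndarray:
    motion_m = np.percentile(mv_comp, M_PERCENTILE)
    th1 = 0.0
    th2 = max(min(K_TH2 * motion_m, K_TH2_MAX), K_TH2_MIN)
    th3 = max(min(K_TH3 * motion_m, K_TH3_MAX), K_TH3_MIN)
    th4 = K_TH4 * th3
    # I_temp rises from 0 to 1 between th1 and th2, stays at 1 until th3,
    # then falls back towards 0 by th4 (assumed shape from Figure 7).
    return np.interp(mv_comp, [th1, th2, th3, th4], [0.0, 1.0, 1.0, 0.0])
```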

[Figure 6 blocks: Current Frame and Previous Frame feed Calculate Motion Vectors, followed by Camera Motion Estimation, Compensate for Camera Motion, and Smoothing & Adaptive Thresholding, producing the Temporal IM. Figure 7: Itemp plotted against object motion (deg/sec), with breakpoints thtemp1, thtemp2, thtemp3 and thtemp4.]


In scenes where a fast moving object is being tracked by a fast pan or tilt movement, the object’s motion may be greater than thtemp3; hence, its temporal importance will be reduced to a value less than 1.0. To prevent this from occurring, objects being tracked by fast pan or tilt are detected and their temporal importance is set to 1.0. To increase the spatial extent of the ROIs, block processing at the 16 x 16 pixel level is performed, in the same way as for the spatial IM.

3.5 Combining spatial and temporal IMs
To provide an overall IM, a linear weighting of the spatial and temporal IMs is performed:

$$I_{total} = k_{comb} \cdot I_{spat} + (1 - k_{comb}) \cdot I_{temp}.$$

An appropriate value of kcomb that has been determined from the eyetracker experiments is 0.6.
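A minimal sketch of this final step, using the calibrated kcomb = 0.6:

```python
K_COMB = 0.6   # calibrated weighting from Section 4

def combine_ims(i_spatial, i_temporal, k_comb: float = K_COMB):
    # I_total = k_comb * I_spat + (1 - k_comb) * I_temp
    return k_comb * i_spatial + (1.0 - k_comb) * i_temporal
```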

4. MODEL CALIBRATION AND VALIDATION
As discussed previously, there is limited information available in the literature regarding the relative influence of the different attentional factors and their interactions. For this reason, an eye movement study was performed using a wide range of viewing material. The individual factors were then correlated with the viewers’ fixations in order to determine each factor’s relative influence. The correlation of the overall IMs could then be calculated to determine how accurately the model predicts viewers’ fixations. In this section, the eyetracker experiment is described and results given, along with details of how these results were used to calibrate the attention model.

4.1 Eyetracker experiment
The viewing room and monitor (Sony BVM-20F1U) were calibrated in accordance with ITU-R Rec. BT.500.21 Viewing distance was five screen heights. Twenty-four viewers (12 men and 12 women) from a range of technical and non-technical backgrounds and with normal or corrected-to-normal vision participated in the experiment. Their ages ranged from 27-57 years with a mean of 40.2 years. The viewers had their eye movements recorded non-obtrusively by an ASL model #504 eyetracker during the experiment. The accuracy of the eyetracker, as reported by the manufacturer, is within ±1.0 deg. Viewers were asked to watch the material as they would if they were watching television at home (i.e., for entertainment purposes).

The stimulus set consisted of 130 still images (displayed for 5 seconds each) and 46 video clips derived from standard test scenes and from satellite channels. The total duration of the video clips was approximately 13 minutes. The material was selected to cover a broad cross-section of features such as saturated/unsaturated color, high/low motion and varying spatial complexity. The still images and video clips were presented in separate blocks and the ordering of the images and clips was pseudorandom within each block. None of the material was processed to introduce defects. Some of the material had previously undergone high bit-rate JPEG or MPEG compression but the effects were not perceptible.

To ensure the accuracy of the recorded eye movements, calibration checks were performed every 1-2 minutes during the test. Re-calibration was performed if the accuracy of the eyetracker was found to have drifted. Post-processing of the data was also performed, in order to correct for any calibration offsets.

4.2 Experimental results
The fixations of all viewers were corrected, aggregated and superimposed on the original stimuli. For the still images, this resulted in an average of approximately 250 fixations per scene while for the video clips, there were approximately 20 fixations per frame. Some examples of the aggregated images are shown in Figure 8. Each white square represents a single viewer fixation, with the size of the square being proportional to fixation duration. The smallest squares (1 x 1 pixel) show very short fixations (< 200 msec) while the largest squares (9 x 9 pixels) correspond to very long fixations (> 800 msec).

4.3 Correlation of factors with eye movements
To determine how well each factor predicts viewers’ eye movements, a hit ratio was computed for each factor. The hit ratio was defined as the percentage of all fixations falling within the most important regions of the scene, as identified by the attention model IM. Hit ratios were calculated over those regions whose combined area represented 10%, 20%, 30% and 40% of the scene’s total area. To calibrate the spatial model and compute the values of wf, only the data from the 130 still images were used. These scenes were split into two equal-sized sets. One set was used for training, and the sequestered set was later used to verify the calibration parameters and to ensure that the model was not tuned to a particular set of data.
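A hedged sketch of the hit-ratio metric described above: the fraction of fixations that land inside the top a% of the frame area when pixels are ranked by IM value. The fixation representation (a list of (row, column) positions) is illustrative.

```python
import numpy as np

def hit_ratio(im: np.ndarray, fixations, area_fraction: float = 0.30) -> float:
    # Threshold chosen so that `area_fraction` of the pixels are marked most important.
    thresh = np.quantile(im, 1.0 - area_fraction)
    important = im >= thresh
    hits = sum(1 for (y, x) in fixations if important[y, x])
    return hits / len(fixations)
```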




Figure 8: Aggregate fixations for two test images, (a) rapids and (b) liberty.

The hit ratios for the four area levels for each of the seven spatial factors are given in Table 1. These values show that the IMs for three of the factors – location, foreground/background and skin – correlated very strongly with the viewers’ fixations. Three other factors – shape, color and contrast – had lower but still strong correlation with viewers’ fixations while the size factor reported the lowest hit ratios. Note that the hit ratios for all of the factors were greater than that expected if areas were chosen randomly (termed the baseline level), confirming that all factors exert some influence on viewers’ eye movements.

Weights for each of the factors wf were then calculated as follows:

$$w_f = \frac{h_{f,10} + h_{f,20} + h_{f,30} + h_{f,40}}{\sum_{f=1}^{7} \sum_{a \in \{10,20,30,40\}} h_{f,a}},$$

where hf,a is the hit ratio for feature f at area level a. This resulted in wf = [0.193, 0.176, 0.172, 0.130, 0.121, 0.114, 0.094] for [location, foreground/background, skin, shape, contrast, color, size] respectively. Table 1 shows that when these unequal values of wf were used in the spatial model, the hit ratios increased considerably. The hit ratios for the full spatial model for both the test and sequestered set were very similar (within 2%), demonstrating that the weightings are not tuned to a specific set of scenes.
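For illustration, the calibration step above can be reproduced from the Still-image rows of Table 1: each factor weight is its summed hit ratio across the four area levels, normalised by the sum over all seven factors. Because the paper computed wf on the training half of the still-image set, the values obtained here are close to, but not identical with, the reported wf.

```python
hit_ratios = {                 # hit ratios (%) at the 10%, 20%, 30% and 40% area levels
    "location":  [34.8, 54.7, 70.1, 76.8],
    "fg_bg":     [31.8, 46.9, 63.2, 75.2],
    "skin":      [25.2, 48.3, 70.8, 79.1],
    "shape":     [20.8, 36.4, 51.4, 60.8],
    "contrast":  [18.4, 34.1, 46.6, 59.3],
    "color":     [17.8, 32.1, 44.8, 54.8],
    "size":      [14.6, 27.0, 35.9, 46.0],
}

grand_total = sum(sum(h) for h in hit_ratios.values())
weights = {f: sum(h) / grand_total for f, h in hit_ratios.items()}
print(weights)   # same rank order as the reported wf values
```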

Stimulus Set   Factor or Model Tested         10% area   20% area   30% area   40% area
Still          foreground / background            31.8       46.9       63.2       75.2
Still          center                             34.8       54.7       70.1       76.8
Still          color                              17.8       32.1       44.8       54.8
Still          contrast                           18.4       34.1       46.6       59.3
Still          shape                              20.8       36.4       51.4       60.8
Still          size                               14.6       27.0       35.9       46.0
Still          skin                               25.2       48.3       70.8       79.1
Still          Spatial model (equal wf)           29.3       49.2       63.6       74.8
Still          Spatial model (unequal wf)         32.9       54.5       68.2       78.8
Video          Spatial model                      46.6       67.2       75.8       81.9
Video          Temporal model                     32.4       50.4       62.9       72.3
Video          Combined model                     41.0       63.8       75.0       82.6
               baseline                           10.0       20.0       30.0       40.0

Table 1: Hit ratios (%) for different factors and models at each area level.


5. RESULTS
Figure 9 shows the resulting IMs for the sequence football. The original scene is of high temporal and moderate spatial complexity, with large camera panning, some camera zoom and a variety of different objects in the scene. The superimposed fixations show that all of the viewers fixated on or near the player with the ball. The spatial IM identified the players as the primary ROIs and the background people as secondary ROIs. (The IMs in the figure have lighter shading in regions of high importance.) Note that the spatial extent of the ROIs was increased by the temporal smoothing process. The temporal model correctly compensated for camera pan and zoom and identified only those areas of true object motion. The combined IM identified the player with the ball as the most important object while the other two players were also identified as being of high importance. The background people were classified as secondary ROIs while the playing field was assigned the lowest importance. The model’s predictions correspond well with viewers’ fixations.

Another example of the model’s performance, this time with a scene of high spatial and moderate temporal complexity (mobile), is shown in Figure 10. Spatially, this scene is very cluttered and there are a number of objects capable of attracting attention. Since there are numerous potential spatial attractors, the spatial IM is quite flat with a general weighting towards the center. The temporal model compensated for the slow camera pan and correctly identified the mobile, train and calendar as the moving objects in the scene. Although they are located near the boundary of the scene, these moving objects still received a high weighting in the overall IM. This corresponds well with the experimental data, as most fixations were located around the moving objects in the lower part of the scene.

The model has been tested on a wide range of video sequences and found to operate in a robust manner. In order to quantify the model’s accuracy and robustness, hit ratios were calculated across the full set of video sequences used in the eyetracker experiment. The results are shown in Table 1 (video stimulus set). The hit ratios achieved by the combined model are high – for example, 75% of all fixations occur in the 30% area classified by the model as most important. The hit ratios are above the baseline level for each individual clip, indicating that the model does not fail catastrophically on any of the 46 sequences. Visual inspection of IMs and fixations showed that many of the fixations that were not hits had narrowly missed the model’s ROIs and were often within the ±1 deg accuracy of the eyetracker. Hence, if the ROIs were expanded spatially or if the eyetracker had higher accuracy, the hit ratio may increase. The hit ratios of the spatial model were all considerably higher than those of the temporal model and were often slightly higher than those of the combined model. This may be caused in part by the fact that in scenes with no motion or extremely high and unpredictable motion, the correlation between motion and eye movements is low. Several of these types of sequences are contained in the test set and the resultant low hit ratio for these scenes with the temporal model has reduced the overall hit ratio for the temporal model. Nevertheless, for most of the test sequences, the spatial model had a higher impact on predicting viewers’ attention than the temporal model. As a result, it was given a slightly higher weighting when combining the two models.

6. DISCUSSION
This paper has described a computational model of visual attention which automatically predicts ROIs in complex video sequences. The model was based on a number of factors known to influence visual attention. These factors were calibrated using eye movement data gathered from an experiment using a large database of still and video scenes. A comparison of viewers’ fixations and model predictions showed that, across a wide range of material, 75% of viewers’ fixations occurred in the 30% area estimated by the model as being the most important. This verifies the accuracy of the model, in light of the facts that the eyetracking device used exhibits some inaccuracy and people’s fixations naturally exhibit some degree of drift.

There are a number of different applications where a computational model of visual attention can be readily utilized. These include areas as diverse as image and video compression, objective picture quality evaluation, image and video databases, machine vision and advertising. Any application requiring variable-resolution processing of scenes may also benefit from the use of a visual attention model such as the one described here.




Figure 9: IMs for a frame of the football sequence. (a) original frame, (b) spatial IM, (c) temporal IM and (d) combined IM.


Figure 10: IMs for a frame of the mobile sequence. (a) original frame, (b) spatial IM, (c) temporal IM and (d) combined IM.


REFERENCES
1. A.L. Yarbus, Eye Movements and Vision, Plenum Press, New York, 1967.
2. L. Stelmach, W.J. Tam and P.J. Hearty, “Static and dynamic spatial resolution in image coding: An investigation of eye movements,” Proceedings SPIE, Vol. 1453, pp. 147-152, San Jose, 1992.
3. N.H. Mackworth and A.J. Morandi, “The gaze selects informative details within pictures,” Perception & Psychophysics, 2(11), pp. 547-552, 1967.
4. G.T. Buswell, How people look at pictures, The University of Chicago Press, Chicago, 1935.
5. W. Osberger, Perceptual Vision Models for Picture Quality Assessment and Compression Applications, PhD Thesis, Queensland University of Technology, Brisbane, Australia, 2000. http://www.scsn.bee.qut.edu.au/~wosberg/thesis.htm
6. J.M. Wolfe, “Visual search: A review,” in H. Pashler (ed), Attention, University College London Press, London, 1998.
7. L. Itti, C. Koch and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Trans. PAMI, 20(11), pp. 1254-1259, 1998.
8. W. Osberger and A.J. Maeder, “Automatic identification of perceptually important regions in an image,” 14th ICPR, pp. 701-704, Brisbane, Australia, August 1998.
9. X. Marichal, T. Delmot, C. De Vleeschouwer, V. Warscotte and B. Macq, “Automatic detection of interest areas of an image or a sequence of images,” ICIP-96, pp. 371-374, Lausanne, 1996.
10. J. Zhao, Y. Shimazu, K. Ohta, R. Hayasaka and Y. Matsushita, “An outstandingness oriented image segmentation and its application,” ISSPA, pp. 45-48, Gold Coast, Australia, 1996.
11. W. Osberger, A.J. Maeder and N. Bergmann, “A perceptually based quantisation technique for MPEG encoding,” Proceedings SPIE, Vol. 3299, pp. 148-159, San Jose, 1998.
12. R.B. Ivry, “Asymmetry in visual search for targets defined by differences in movement speed,” J. of Exp. Psych.: Human Perc. & Perf., 18(4), pp. 1045-1057, 1992.
13. M.P. Eckert and G. Buchsbaum, “The significance of eye movements and image acceleration for coding television image sequences,” in A.B. Watson (ed), Digital Images and Human Vision, pp. 89-98, MIT Press, Cambridge MA, 1993.
14. L. Stark, I. Yamashita, G. Tharp and H.X. Ngo, “Search patterns and search paths in human visual search,” in D. Brogan, A. Gale and K. Carr (eds), Visual Search 2, pp. 37-58, Taylor and Francis, London, 1993.
15. J.M. Findlay, “The visual stimulus for saccadic eye movements in human observers,” Perception, 9, pp. 7-21, 1980.
16. M. D’Zmura, “Color in visual search,” Vision Research, 31(6), pp. 951-966, 1991.
17. J. Enoch, “Effect of the size of a complex display on visual search,” JOSA, 49(3), pp. 208-286, 1959.
18. J. Wise, “Eye movements while viewing commercial NTSC format television,” SMPTE psychophysics committee white paper, 1984.
19. G. Walker-Smith, A. Gale and J. Findlay, “Eye movement strategies involved in face perception,” Perception, 6, pp. 313-326, 1977.
20. N. Herodotou, K. Plataniotis and A. Venetsanopoulos, “Automatic location and tracking of the facial region in color video sequences,” Signal Processing: Image Communication, 14, pp. 359-388, 1999.
21. ITU-R Rec. BT.500, “Methodology for the subjective assessment of the quality of television pictures.”