
Video Shot Boundary Detection and Condensed Representation: A Review

Costas Cotsaces, Student Member, IEEE, Nikos Nikolaidis, Member, IEEE, and Ioannis Pitas, Senior Member, IEEE

Abstract— In this paper, a review of certain basic information extraction operations that can be performed on video is presented. These operations are those that are particularly useful as a foundation for further semantic processing. Specifically, the review focuses on shot boundary detection and condensed video representation (also called summarization and abstraction). Shot boundary detection is the complete segmentation of a video into continuously imaged temporal video segments. Condensed video representation is the extraction of video frames or short clips that are semantically representative of the corresponding video. Both tasks are very significant for the organization of video data into more manageable forms. An overview of the fundamental issues in each task is provided, and recent work on the subject is described and critically reviewed.

Index Terms— temporal video segmentation, shot change detection, shot cut detection, condensed representation, video summarization, highlighting

I. INTRODUCTION

The advances in digital video technology and the ever increasing availability of computing resources have resulted in the last few years in an explosion of digital video data, especially on the Internet. However, the increasing availability of digital video has not been accompanied by an increase in its accessibility. This is due to the nature of video data, which is unsuitable for traditional forms of data access, indexing, search and retrieval, which are either text-based or based on the query-by-example paradigm. Therefore, techniques have been sought that organize video data into more compact forms or extract semantically meaningful information [1]. Such operations can serve as a first step for a number of different data access tasks, such as browsing, retrieval, genre classification and event detection. Here we focus not on the high level video analysis tasks themselves

The authors are with the Department of Informatics, University of Thessaloniki, Thessaloniki 54124, Greece (email: [email protected]).

This work has been supported by the FP6 European Union Network of Excellence MUSCLE “Multimedia Understanding Through Semantic Computation and Learning” (FP6-507752).

but on the common basic techniques that have been developed to facilitate them. These basic tasks are shot boundary detection and condensed video representation.

Shot boundary detection is the most basic temporal video segmentation task, as it is intrinsically and inextricably linked to the way that video is produced. It is a natural choice for segmenting a video into more manageable parts, and thus it is very often the first step in algorithms that accomplish other video analysis tasks, one of them being condensed video representation, described below. In the case of video retrieval, a video index is much smaller and thus easier to construct and use if it references whole video shots rather than every video frame. Since scene changes almost always happen on a shot change, shot boundary detection is indispensable as a first step for scene boundary detection. Finally, shot transitions provide convenient jump points for video browsing.

Condensed representation is the extraction of a characteristic set of either independent frames or short sequences from a video. This can be used as a substitute for the whole video for the purposes of indexing, comparison and categorization. It is also especially useful for video browsing. The paper “Semantic Retrieval of Video” by Xiong et al. in this issue provides a detailed account of condensed video representation applications. The results of both shot boundary detection and condensed video representation do not need to be immediately directed to the above applications; they may instead be stored as metadata and used when they are needed. This can be achieved by the use of standards such as MPEG-7 [2], which contain appropriate specifications for the description of shots, scenes and various types of condensed representation.

Several reviews on shot boundary detection have been published in the last decade [3], [4], [5], [6], though none more recently than 2001. Little review work on the subject of condensed video representation has been published in refereed journals.

II. SHOT BOUNDARY DETECTION

The concept of temporal image sequence (video) segmentation is not a new one, as it dates back to the first days of motion pictures, well before the introduction of computers.


Fig. 1. Schematic sketch of scenes and shots as defined by the video production process: a) the actual objects (or people) that are imaged, which are spatio-temporally coherent and thus comprise a scene; b) continuous segments imaged by different cameras (takes); c) the editing process that creates shots from takes.

Motion picture specialists perceptually segment their works into a hierarchy of partitions. A video (or film) is completely and disjointly segmented into a sequence of scenes, which are subsequently segmented into a sequence of shots. Scenes (also called story units) are a concept that is much older than motion pictures, ultimately originating in the theater. Traditionally, a scene is a continuous sequence that is temporally and spatially cohesive in the real world, but not necessarily cohesive in the projection of the real world on film. On the other hand, shots originate with the invention of motion cameras and are defined as the longest continuous sequence that originates from a single camera take, which is what the camera images in an uninterrupted run, as shown in Figure 1.

In general, the automatic segmentation of a video into scenes ranges from very difficult to intractable. On the other hand, video segmentation into shots is both exactly defined and also characterized by distinctive features of the video stream itself. This is because video content within a shot tends to be continuous, due to the continuity of both the physical scene and the parameters (motion, zoom, focus) of the camera that images it.

Therefore, in principle, detecting a shot change between two adjacent frames simply requires computing an appropriate continuity or similarity metric. However, this simple concept has three major complications.

The first, and most obvious one, is defining a continuity metric for the video in such a way that it is insensitive to gradual changes in camera parameters, lighting and physical scene content, easy to compute, and discriminant enough to be useful. The simplest way to do that is to extract one or more scalar or vector features from each frame and to define distance functions on the feature domain. Alternatively, the features themselves can be used either for clustering the frames into shots, or for detecting shot transition patterns.

The second complication is deciding which values of the continuity metric correspond to a shot change and which do not. This is not trivial, since the feature variation within certain shots can exceed the respective variation across shots. Decision methods for shot boundary detection include fixed thresholds, adaptive thresholds and statistical detection methods.

The third complication, and the most difficult to handle, is the fact that not all shot changes are abrupt. Using motion picture terminology, changes between shots can belong to the following categories, some of which are illustrated in Figure 2:

1. Cut. This is the classic abrupt change case, where one frame belongs to the disappearing shot and the next one to the appearing shot.

2. Dissolve. In this case, the last few frames of the disappearing shot temporally overlap with the first few frames of the appearing shot. During the overlap, the intensity of the disappearing shot decreases from normal to zero (fade out), while that of the appearing shot increases from zero to normal (fade in); see the sketch after this list.

3. Fade. Here, first the disappearing shot fades out into a blank frame, and then the blank frame fades in to the appearing shot.

4. Wipe. This is actually a set of shot change techniques, where the appearing and disappearing shots coexist in different spatial regions of the intermediate video frames, and the region occupied by the former grows until it entirely replaces the latter.

5. Other transition types. There is a multitude of inventive special effects techniques used in motion pictures. These are in general very rare and difficult to detect.
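To make the dissolve case concrete, the overlap can be synthesized by linearly cross-fading two shots. The following is a minimal illustrative sketch (not taken from any of the surveyed works), assuming frames are NumPy uint8 arrays and the overlap length is chosen by the caller:

```python
import numpy as np

def synthesize_dissolve(shot_a, shot_b, overlap):
    """Cross-fade the last `overlap` frames of shot_a into the
    first `overlap` frames of shot_b (frames: HxWx3 uint8 arrays)."""
    frames = list(shot_a[:-overlap])
    for t in range(overlap):
        alpha = (t + 1) / (overlap + 1)  # weight of the appearing shot
        mixed = (1 - alpha) * shot_a[len(shot_a) - overlap + t].astype(np.float32) \
                + alpha * shot_b[t].astype(np.float32)
        frames.append(mixed.astype(np.uint8))
    frames.extend(shot_b[overlap:])
    return frames
```

A fade corresponds to the same construction with a blank (all-zero) frame standing in for one of the two shots.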

A. Components of Shot Boundary Detection Algorithms

As was already mentioned, shot boundary detection algorithms work by extracting one or more features from a video frame or a subset of it, called a region of interest (ROI).


Fig. 2. Examples of gradual transitions from the TRECVID 2003 corpus: a) a dissolve between shots with similar color content; b) a dissolve between shots with dissimilar color content; c) a fade; d) a wipe of the “door” variety; e) a wipe of the “grill” variety; f) a computer-generated special effect transition between shots.

An algorithm can then use different methods to detect shot change from these features. Since there are many different ways the above components can be combined, we have chosen not to provide a hierarchical decomposition of different classes of algorithms. Instead we present below the different choices that can be made for each component, along with their advantages and disadvantages. These can then be combined more or less freely to design a shot detection algorithm.

1) Features Used: Almost all shot change detection algorithms reduce the large dimensionality of the video domain by extracting a small number of features from one or more regions of interest in each video frame. Such features include:

1. Luminance/color: The simplest feature that can be used to characterize a ROI is its average grayscale luminance. This, however, is susceptible to illumination changes. A more robust choice is to use one or more statistics (e.g. averages) of the values in a suitable color space [7], [8], [9], like HSV.

2. Luminance/color histogram: A richer feature for a ROI is the grayscale or color histogram. Its advantage is that it is quite discriminant, easy to compute and mostly insensitive to translational, rotational and zooming camera motion. For the above reasons it is widely used [10], [11].

3. Image edges: An obvious choice for characterizing a ROI is its edge information [12], [9]. Their advantage is that they are sufficiently invariant to illumination changes and several types of motion and that they are related to the human visual perception of a scene. Their main disadvantage is their computational cost, their noise sensitivity and, when not post-processed, their high dimensionality.

4. Transform coefficients (Discrete Fourier Transform, Discrete Cosine Transform, wavelet): These are a classic way to describe the image information in a ROI. The Discrete Cosine Transform (DCT) coefficients also have the advantage of being already present in MPEG encoded video streams or files.

5. Motion: This is sometimes used as a feature for detecting shot transitions, but it is usually coupled with others, since by itself it can be highly discontinuous within a shot (when motion changes abruptly) and is obviously useless when there is no motion in the video.
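As an illustration of the first two feature types, the following minimal sketch (our own, not from the surveyed works) extracts the mean luminance, mean HSV values and a coarse color histogram from a single frame; it assumes OpenCV and NumPy, and the bin count is an arbitrary choice:

```python
import cv2
import numpy as np

def frame_features(frame_bgr, bins=16):
    """Extract simple per-frame features: mean luminance, mean HSV,
    and a coarse, normalized RGB color histogram."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mean_luma = gray.mean()
    mean_hsv = hsv.reshape(-1, 3).mean(axis=0)
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], None,
                        [bins] * 3, [0, 256] * 3).flatten()
    return mean_luma, mean_hsv, hist / hist.sum()
```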

2) Spatial Feature Domain: The size of the region from which individual features are extracted plays an important role in the overall performance of shot change detection. A small region tends to reduce detection invariance with respect to motion, while a large region might lead to missed transitions between similar shots. In the following we will describe various possible choices:

1. Single pixel: Some algorithms derive a feature for each pixel. This feature can be luminance, edge strength [9] or other. However, such an approach results in a very large feature vector and is very sensitive to motion, unless motion compensation is subsequently performed.

2. Rectangular block: Another method is to segment each frame into equal-sized blocks, and to extract a set of features (e.g. average color or orientation, color histogram) per such block [8], [7]. This approach has the advantage of being invariant to small camera and object motion as well as being adequately discriminant for shot boundary detection.

3. Arbitrarily shaped region: Feature extraction can also be applied to arbitrarily shaped and sized frame regions, derived by spatial segmentation algorithms. This enables the derivation of features based on the most homogeneous regions, thus facilitating a better detection of temporal discontinuities. The main disadvantage is the usually high computational complexity and instability of region segmentation.

4. Whole frame: The algorithms that extract features (e.g. histograms) from the whole frame [13], [11], [14] have the advantage of being very robust with respect to motion within a shot, but tend to have poor performance at detecting the change between two similar shots.
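A rectangular-block decomposition such as the one in item 2 above can be sketched as follows (an illustrative helper of our own; the 4x4 grid is arbitrary):

```python
import numpy as np

def block_features(frame, rows=4, cols=4):
    """Split a frame (HxWxC array) into a rows x cols grid and return
    the average color of each block as a (rows*cols, C) feature matrix."""
    h, w = frame.shape[:2]
    feats = []
    for r in range(rows):
        for c in range(cols):
            block = frame[r * h // rows:(r + 1) * h // rows,
                          c * w // cols:(c + 1) * w // cols]
            feats.append(block.reshape(-1, frame.shape[2]).mean(axis=0))
    return np.array(feats)
```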

3) Feature Similarity Metric: In order to evaluate discontinuity between frames based on the selected features, an appropriate similarity/dissimilarity metric needs to be chosen. Assuming that a frame is characterized by $K$ scalar features $F(k)$, $k = 1..K$, the most traditional choice is to use an $L_n$ norm:

$$D_{L_n}(i, j) = \sqrt[n]{\sum_{k=1}^{K} |F_i(k) - F_j(k)|^n}$$

Examples of the above are the histogram difference, where $F(k)$ are the bins of the histogram and $n = 1$, and the image difference, where $F(k)$ are the pixels of the (usually subsampled) image and $n = 2$. Another example of a commonly used metric, especially in the case of histograms, is the chi-square ($\chi^2$):

$$D_{\chi^2}(i, j) = \sum_{k=1}^{K} \frac{(F_i(k) - F_j(k))^2}{F_i(k)}$$
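Both metrics translate directly into code. The sketch below (illustrative only) implements them with NumPy; the small epsilon in the chi-square is our own addition to guard against empty histogram bins:

```python
import numpy as np

def ln_distance(f_i, f_j, n=1):
    """L_n norm between two feature vectors; n=1 on histograms gives
    the classic histogram difference."""
    return np.sum(np.abs(f_i - f_j) ** n) ** (1.0 / n)

def chi_square_distance(f_i, f_j, eps=1e-10):
    """Chi-square distance, normalizing by the first frame's bins
    as in the formula above."""
    return np.sum((f_i - f_j) ** 2 / (f_i + eps))
```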

4) Temporal Domain of Continuity Metric: Another important aspect of shot boundary detection algorithms is the temporal window which is used to perform shot change detection. In general, the objective is to select a temporal window that contains a representative amount of video activity. Options include:

1. Two frames: The simplest way to detect discontinuity between frames is to look for a high value of the discontinuity metric between two successive frames [13], [8], [15], [16]. However, such an approach can fail to discriminate between shot transitions and changes within the shot when there is significant variation in activity among different parts of the video, or when certain shots contain events that cause brief discontinuities (e.g. photographic flashes). It also has difficulty in detecting gradual transitions.

2. N-frame window: The most common technique for alleviating the above problems is to detect the discontinuity by using the features of all frames within a suitable temporal window, which is centered on the location of the potential discontinuity [14], [8], [17], [9].

3. Interval since last shot change: Perhaps the most obvious method for detecting a shot end is to compute one or more statistics from the last detected shot change up to the current point, and to check if the next frame is consistent with them, as in [7], [11]. The problem with such approaches is that there is often great variability within shots, meaning that statistics computed for an entire shot may not be representative of its end.

5) Shot Change Detection Method: Having defined a feature (or a set of features) computed on one or more ROIs for each frame (and optionally a similarity metric), a shot change detection algorithm needs to detect where these exhibit discontinuity. This can be done in the following ways:

1. Static Thresholding: This is the most basic decision method, which entails comparing a metric expressing the similarity or dissimilarity of the features computed on adjacent frames against a fixed threshold [11]. This only performs well if video content exhibits similar characteristics over time, and only if the threshold is manually adjusted for each video.

2. Adaptive Thresholding: The obvious solution to the problems of static thresholding is to vary the threshold depending on a statistic (e.g. average) of the feature difference metrics within a temporal window, as in [13], [17]; a sketch of this idea follows the list.

3. Probabilistic Detection: Perhaps the most rigorous way to detect shot changes is to model the pattern of specific types of shot transitions and, presupposing specific probability distributions for the feature difference metrics in each shot, perform optimal a posteriori shot change estimation. This is demonstrated in [7], [8].

4. Trained Classifier: A radically different method for detecting shot changes is to formulate the problem as a classification task where frames are classified (through their corresponding features) into two classes, namely “shot change” and “no shot change”, and to train a classifier (e.g. a neural network) to distinguish between the two classes [14].
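As referenced in item 2 above, a minimal adaptive-thresholding sketch is given below; the window size and multiplier k are illustrative tuning parameters, not values taken from the cited works:

```python
import numpy as np

def detect_cuts_adaptive(distances, window=10, k=3.0):
    """Flag frame i as a cut when its inter-frame distance exceeds
    k times the mean distance inside a sliding window centered on i
    (a common adaptive-threshold heuristic)."""
    d = np.asarray(distances, dtype=np.float64)
    cuts = []
    for i in range(len(d)):
        lo, hi = max(0, i - window), min(len(d), i + window + 1)
        neighbours = np.concatenate([d[lo:i], d[i + 1:hi]])
        if neighbours.size and d[i] > k * neighbours.mean():
            cuts.append(i)
    return cuts
```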

B. Performance Evaluation

The basic measures in detection and retrieval problems in general are recall, precision and accuracy. Recall quantifies what proportion of the correct entities (shot changes in our case) are detected, while precision quantifies what proportion of the detected entities are correct. Accuracy reflects the temporal correctness of the detected results, and corresponds to the distance between the detected location of a transition and its true location.

Fig. 3. A typical shot detection algorithm: a) samples of the video content; b) histogram of the distribution of the intensity values in the displayed frames; c) histogram difference between adjacent frames; d) a shot change is detected when the histogram difference exceeds a threshold.

Therefore, if we denote by D the shot transitions correctly detected by the algorithm, by D_M the number of missed detections, that is the transitions that should have been detected but were not, and by D_F the number of false detections, that is the transitions that should not have been detected but were, we have that:

$$\mathrm{Recall} = \frac{D}{D + D_M}, \qquad \mathrm{Precision} = \frac{D}{D + D_F}$$
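A hedged sketch of this evaluation is given below; it matches detected boundary frames to ground-truth boundaries within a frame tolerance (the tolerance parameter loosely reflects the accuracy notion above, and the function is our own illustration):

```python
def evaluate_detection(detected, ground_truth, tolerance=0):
    """Match detected boundary frames to ground-truth boundaries within
    `tolerance` frames, then report recall and precision."""
    detected, ground_truth = sorted(detected), sorted(ground_truth)
    matched = set()
    hits = 0
    for g in ground_truth:
        for d in detected:
            if d not in matched and abs(d - g) <= tolerance:
                matched.add(d)  # each detection may match only once
                hits += 1
                break
    recall = hits / len(ground_truth) if ground_truth else 1.0
    precision = hits / len(detected) if detected else 1.0
    return recall, precision
```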

Because of the richness and variety of digital video content, the effective quantitative (and even qualitative) comparison of the results of different video analysis algorithms is meaningless if they are derived from different video corpora. To alleviate this problem, the TREC Video Retrieval Evaluation [18] (TRECVID) was organized. The goal of TRECVID is not only to provide a common corpus of video data as a testbed for different algorithms but also to standardize and oversee their evaluation and to provide a forum for the comparison of the results.

C. Specific Shot Boundary Detection Algorithms

In the previous section we have analyzed the features, metrics and decision methods that can be used to design a shot boundary detection algorithm. We illustrate the design of such an algorithm in Figure 3. Specifically, the intensity histogram of the whole frame has been chosen as a feature, the histogram difference was chosen as a discontinuity metric, and a constant threshold as the shot change detection method. In the following we will also give some state-of-the-art examples of complete algorithms.
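The Figure 3 pipeline can be sketched end to end as follows. This is an illustrative implementation of the generic scheme, not of any specific published algorithm; OpenCV video decoding is assumed, and the bin count and threshold are arbitrary:

```python
import cv2
import numpy as np

def detect_shot_changes(video_path, bins=64, threshold=0.4):
    """Figure 3 pipeline sketch: whole-frame intensity histogram as
    the feature, L1 histogram difference as the discontinuity metric,
    and a static threshold as the decision method."""
    cap = cv2.VideoCapture(video_path)
    prev_hist, boundaries, index = None, [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [bins], [0, 256]).ravel()
        hist /= hist.sum()  # normalize so the metric is size-independent
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            boundaries.append(index)  # cut between frames index-1 and index
        prev_hist, index = hist, index + 1
    cap.release()
    return boundaries
```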

The shot boundary detection problem is thoroughly analyzed by Hanjalic [8], and a probabilistically based algorithm is proposed. Each video frame is completely segmented into nonoverlapping blocks, and the average YUV color component values are extracted from each block. Blocks are matched between each pair of adjacent frames according to the average difference between the color components of blocks, and the discontinuity value is then defined as the above average difference for matching blocks. Adjacent frames are compared for detecting abrupt transitions, while for gradual ones, frames that are apart by the minimum shot length are compared. The a priori likelihood functions of the discontinuity metric are obtained by experiments on manually labelled data. Their choice is largely heuristic. The a priori probability of shot boundary detection, conditional on the time that elapsed since the last shot boundary detection, was modelled as a Poisson function, which was observed to produce good results. The shot boundary detection probability is refined with a heuristic derived from the discontinuity value pattern (for cuts and fades) and from in-frame variances (for dissolves) in a temporal window around the current frame. Effectively, a different detector is implemented for each transition type. In conclusion, although the proposed method claims to offer a rigorous solution to the shot boundary detection problem, it is somewhat heuristic. Yet the probabilistic analysis of the problem is valid and thorough. Additionally, experimental results are very satisfactory, being perfect for cuts and having good recall and precision for dissolves.

The approach developed by Lienhart [14] detects dissolves (and only dissolves) by a trained classifier (i.e. a neural network), operating on either YUV color histograms, magnitude of directional gradients, or edge-based contrast. The classifier detects possible dissolves at multiple temporal scales and merges the results using a winner-take-all strategy. The interesting part is that the classifier is trained using a dissolve synthesizer which creates artificial dissolves from any available set of video sequences. Although the videos used for experimental verification are non-standard, performance is shown to be superior when compared to simple edge-based dissolve detection methods.

Cernekova et al. [11] propose performing singular value decomposition (SVD) on the RGB color histograms of each frame to reduce feature vector dimensionality, in order to make the feature space more discriminant. Initially, video segmentation is performed by comparing the angle between the feature vector of each frame and the average of the feature vectors of the current segment. If their difference is higher than a static threshold, a new segment is started. Segments whose feature vectors exhibit large dispersion are considered to depict a gradual transition between two shots, whereas segments with small dispersion are characterized as shots. Experiments were run on a database of four real TV sequences. The main problem with this approach is the static threshold which is applied on the angle between vectors to detect shot change, which may be problematic in the case of large intra-shot content variation and small inter-shot content variation.

Boccignone et al. [17] approach the shot boundary detection problem from a completely new angle, using the attentional paradigm for human vision. Fundamentally, their algorithm computes for every frame a set (called a trace) of points of focus of attention (FOAs) in decreasing order of saliency, and compares nearby frames by evaluating the consistency of their traces. Shot boundaries are found when the above similarity is below a dynamic threshold. Specifically, FOAs are detected by first computing a saliency map from the normalized sum of multiscale intensity contrast, color contrast (blue-yellow and red-green) and orientation contrast. Then the most salient points in this map are found using a winner-take-all artificial neural network, which also gives a measure of their saliency. The FOAs are accompanied by the color and shape information in their neighborhoods. The consistency of two traces is computed by first finding homologous FOAs between the two traces based on three different types of local consistency: spatial, temporal (i.e. referring to order of saliency) and visual (i.e. referring to neighborhood information). A cut is then detected by comparing the inter-frame consistency with a probabilistically-derived adaptive threshold, while dissolves are detected in a subsequent phase. The results, given for a subset of TRECVID01 and some news and movie segments, are overall satisfactory but not above the state of the art.

Lelescu and Schonfeld [7] present an interesting statistical approach which additionally benefits from operating on MPEG-compressed videos. They extract the average luminance and chrominance for each block in every I and P frame, and then perform principal component analysis (PCA) on the resulting feature vectors. It should be noted that the eigenvectors are computed based only on the M first frames of each shot. The resulting projected vectors are modelled by a Gaussian distribution whose mean and covariance matrix are estimated from the M first frames of each shot. The originality of the approach is that a change statistic is estimated for each new frame using a maximum likelihood methodology (the Generalized Likelihood Ratio) and, if it exceeds an experimentally determined threshold, a new shot is started. The algorithm has been tested on a number of videos where gradual transitions are common. A characteristic of the test videos used is the particularly short duration of their shots. This highlights the greatest problem of this algorithm, namely that the extraction of the shot characteristics (eigenspace, mean and covariance) happens at its very beginning (first M frames). This means that in the case of long and non-homogeneous shots, like the ones common in motion pictures, the values calculated at the start of the shot might not be characteristic of its end.

The Joint Probability Image (JPI) of two frames is defined by Li et al. [15] as a matrix whose [i, j] element contains the probability that a pixel with color i in the first image has color j in the second image. They also derive the Joint Probability Projection Vector (JPPV) as a 1-dimensional projection of the JPI, and the Joint Probability Projection Centroid (JPPC) as a scalar measure of the JPI's dispersion. Specific types of transition (dissolve, fade, dither) are observed to have specific JPI patterns. To detect shot changes, possible locations are first detected by using the JPPC, and then they are refined and validated, and classified into transition types by matching the JPPV with specific transition patterns. Results are given for a relatively small number of abrupt and gradual transitions. The authors admit the obvious problems this method has with large object and/or camera motion.
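For intuition, the JPI construction can be sketched as a normalized 2-D co-occurrence histogram. The grayscale version below is our own simplification (Li et al. define the JPI over colors), with an arbitrary quantization level:

```python
import numpy as np

def joint_probability_image(frame_a, frame_b, levels=64):
    """Sketch of the JPI idea: JPI[i, j] estimates the probability that
    a pixel with (quantized) intensity i in frame_a has intensity j in
    frame_b. Both frames are HxW grayscale arrays of equal size."""
    qa = (frame_a.astype(np.int32) * levels // 256).ravel()
    qb = (frame_b.astype(np.int32) * levels // 256).ravel()
    jpi = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(jpi, (qa, qb), 1.0)  # accumulate 2-D co-occurrence counts
    return jpi / jpi.sum()
```

Within a shot, mass concentrates near the diagonal; a cut scatters it, and gradual transitions trace the characteristic patterns the authors exploit.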

Another way to attack the most difficult problem in shot boundary detection, which is the detection of gradual transitions, is to use different algorithms for each type of transition. This is the path followed by Nam and Tewfik [9], who designed two algorithms, one for dissolves and fades, and one for wipes. The fade/dissolve detector starts by approximating, within a temporal window, the temporal evolution of the intensity of each pixel in the frame with a B-spline. It then computes a change metric as the negative of the normalized sum of the errors of the above approximations. This metric is then discretized by comparison with a locally adaptive threshold, processed by a morphological filter, and used as an indicator of the existence of dissolves and fades. Fades are distinguished from dissolves by additionally checking whether the inter-frame standard deviation is close to zero. The wipe detector functions by first calculating the difference between one frame and the next, and then multiplying this difference with parts of the frame's 2-D wavelet transform in order to enhance directionality. Then the Radon transform of each frame is taken for the four cardinal directions (horizontal, vertical and the two diagonals) and the same is done for the half-frames (but only for the horizontal and vertical directions). For each frame, the location of the maximum is taken from each of the above projections, representing the location of the strongest edge in each direction. Tracking these maxima over time results in six temporal curves. Wipe candidates are then detected when these curves exhibit a linear increase or decrease, corresponding to the regular movement of the wipe's edge. Finally, the candidates are considered to be verified if they can be well approximated with a B-spline. The results of these methods are close to the state of the art for the sequences tested, but the approach has problems in detecting other types of transitions than the ones it was designed for.

III. CONDENSED VIDEO REPRESENTATION

An important functionality when retrieving information in general is the availability of a condensed representation of a larger information unit. This can be the summary of a book, the theme or ground of a musical piece, the thumbnail of an image, or the abstract of a paper. Condensed representation allows us to assess the relevance or value of the information before committing time, effort or computational and communication resources to process the entire information unit. It can also allow us to extract high level information when we are not interested in the whole unit, especially during manual organization, classification and annotation tasks. While for most types of information there are one or more standard forms of condensed representation, this is not the case for video.

In the literature, the term highlighting or storyboarding is used for video representation resulting in distinct images, while skimming is sometimes used for a representation resulting in a shorter video. These classes of methods are illustrated schematically in Figure 4. In addition, we will use the term condensed representation to describe the process of summarization, highlighting and skimming in general.

A. Highlighting

Highlighting is the field that has attracted by far the most attention, since it is the simplest and most intuitive. Its purpose is the selection of the video highlights, that is, those frames that contain the most, or most relevant, information for a given video. Highlighting is sometimes also referred to as keyframe extraction.

An advanced, object-based method has been developed by Kim and Hwang [19]. They use the method described in [20] to extract objects from each frame, then use these objects to extract video keyframes. Object detection and tracking is performed by Canny edge detection on image differences, and then finding areas enclosed by edges and tracking them. The objects are labelled either with the magnitude of the first eight Fourier descriptors (i.e. Fourier transform coefficients of the object contour) or with Hu's seven region moments. The distance between objects is defined as the L1 distance between their descriptors, weighted by experimentally derived constants. Objects in the two frames are paired by spatial proximity of their centers, and the distance between frames is the maximum of the distances between objects in such pairs. When the number of objects in a frame is different from that of the previous frame, the frame is automatically labelled as a keyframe. Otherwise, a frame is labelled as a keyframe if its distance from the previous keyframe is greater than a predefined threshold. This object-based method has the advantage of getting close to the physical semantics of the video. However, like other object-based approaches, it has the disadvantage that it is unable to perform well in complex scenes or in scenes with significant background motion. It is not accidental that all the experimental videos used in this method have a static background and a small number of well-defined objects.

Li et al. [21] approach the highlighting problem from the viewpoint of video compression, and attempt to minimize the loss of visual information that occurs during highlighting. First they define the summarization rate as the ratio of the size of the condensed video to the size of the original one, and the distortion as the maximum distance between any frame in the highlighting and any frame in the original video. Then they formulate highlighting as one of two problems: either the minimization of distortion with the summarization rate as a constraint (minimum distortion optimal summarization - MDOS) or the minimization of the summarization rate with distortion as a constraint (minimum rate optimal summarization - MROS). They then present a solution for the MROS problem by means of dynamic programming, and a solution for the MDOS problem by efficiently searching through relevant solutions of the MROS problem. The only drawback of their approach is that it is too quantitative, and does not address the difference between visual and semantic information.

B. Hierarchical Highlighting

A somewhat different task from highlighting is hierarchical highlighting, where the goal is to extract a hierarchy or a tree of highlight frames, so that the user can then interactively request more detail on the part of the video that interests him.

A semi-automatic method for the extraction of a hierarchy of highlights is proposed by Zhu et al. [22]. First, shots are found and a keyframe is arbitrarily chosen for each of them.

Fig. 4. Illustration of the different types of condensed representation: a) original video; b) highlighting; c) skimming; d) hierarchical highlighting; e) hierarchical skimming. The “film” segments denote continuous video, while images denote video frames.

Similarity between shots is defined as the global color histogram and texture similarity between their keyframes. Then shot groups are detected by heuristics based on similarity with previous and following frames at their borders. Shot groups are described as either spatially coherent (i.e. composed of consecutive similar shots) or temporally coherent (e.g. shots forming a dialog), through clustering using a similarity threshold. Videos, groups, shots and frames then have to be annotated manually with labels selected from a set of ontologies. The similarity between two groups is defined as the average similarity between shots in one group and the correspondingly most similar shots in the other group, using both the above visual features and the semantic labelling. Scenes are detected heuristically by merging groups according to their similarity. The summary is created hierarchically at four levels (video, scene, group, shot) by semantically clustering the immediately lower level (e.g. clustering shots in order to create groups, etc). Finally, the highlighting is constructed by selecting those keyframes that correspond to the shots and/or groups that contain the most annotation information. The system described is an advanced and complex one, but it suffers from the need for manual annotation and from the arbitrary selection of the keyframe of each shot.

The extraction of the hierarchical highlights is performed in two stages by Ferman and Tekalp [23]. Their algorithm receives as input a video that has already been segmented into shots. Each shot is characterized by its α-trimmed average luminance histogram (although theoretically other features could be used as well), and each frame is characterized by the normalized distance between its histogram and that of the shot to which it belongs. Then fuzzy c-means clustering is performed on the frames in each shot, based on the above feature. The clustering is refined by two stages of cluster hardening coupled with the pruning of uncertain data, with a recomputation of the fuzzy c-means performed between them. The clustering is repeated with increasing c, until decreasing clustering validity is detected. For each cluster the most representative frame is selected as a keyframe, based either on maximal proximity to the cluster's centroid or on maximal fuzzy cluster membership. In a second stage, temporally adjacent clusters with high similarity are merged in order to reduce the number of keyframes generated according to user preference. The same merging approach can also be used to extract a hierarchy of highlights. Clustering is a typical method for highlight extraction, but the technical interest of the work lies in the refinement procedure of the clustering and the automatic determination of the optimal number of clusters.
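The clustering-based keyframe selection underlying such methods can be sketched as follows. Note this is a simplified stand-in that uses hard k-means (scikit-learn assumed) rather than the fuzzy c-means with cluster-validity analysis of [23]:

```python
import numpy as np
from sklearn.cluster import KMeans

def keyframes_by_clustering(features, n_clusters=5):
    """Cluster per-frame feature vectors (e.g. histograms) and return
    the index of the frame closest to each cluster centroid."""
    X = np.asarray(features)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    keyframes = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(members[np.argmin(dists)]))
    return sorted(keyframes)
```

Choosing the number of clusters automatically, as [23] does via clustering validity, is the part this sketch deliberately omits.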

C. Skimming

Video skimming is more difficult to perform effectively than video highlighting, because it is more difficult to extract appropriate features from its atomic elements (video segments instead of images), since they contain a much greater amount of information.

Gong and Liu [24] partly solve this problem by constraining the skimming to consist of whole shots of the video. First they down-sample the video temporally as well as spatially. Each resulting frame is characterized by its color histogram, and SVD is performed on the resulting feature matrix $A$ to reduce its dimensionality. For the SVD-reduced frame features $\psi_i = [\upsilon_{i1}\ \upsilon_{i2}\ \ldots\ \upsilon_{in}]^T$ the authors define a new metric:

$$\|\psi_i\| = \sqrt{\sum_{j=1}^{\mathrm{rank}(A)} \upsilon_{ij}^2}$$

which is inversely proportional to the cardinality of the feature within the matrix, and thus to how typical the frame is within the video. Similarly, for a video segment $S_i$ the metric $\mathrm{CON}(S_i) = \sum_{\psi_i \in S_i} \|\psi_i\|$ is defined as a measure of the visual content contained in this segment. The next step is clustering the reduced feature vectors. The most common frame is used to seed the first cluster, which is grown until a heuristic limit is reached. The other clusters are grown in such a way that they have $\mathrm{CON}(S_i)$ similar to that of the first cluster. Then the longest shot is extracted from each cluster, and an appropriate number of these shots is selected to form the summary, according to user preferences.
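The $\|\psi_i\|$ metric follows directly from the SVD of the frame-feature matrix. The sketch below assumes, as one plausible reading of the text, that the columns of $A$ are per-frame histograms and that $\psi_i$ is the $i$-th column of $V^T$; the exact scaling used in [24] may differ:

```python
import numpy as np

def svd_frame_norms(A, rank=None):
    """Columns of A are per-frame histogram features. Project frames
    onto the right singular vectors and return ||psi_i|| per frame."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = rank or int(np.sum(s > 1e-10))  # numerical rank of A
    psi = Vt[:r, :]                     # column i is psi_i (reduced frame i)
    return np.sqrt(np.sum(psi ** 2, axis=0))
```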

D. Hierarchical Skimming

The creation of a hierarchical video skimming is approached by Shipman et al. [25] as the creation of a specific type of video on demand, and as part of their Hyper-Hitchcock hypervideo authoring tool. Their work, which they call detail-on-demand video, is based on video segmentation into “takes”, which are not real takes but are scenes in the case of commercial videos and shots in the case of home videos, and “clips”, which are shots in the case of commercial videos and cohesive subsegments of shots in the case of home videos. This way, the authors attempt to sidestep the semantic ambiguity of the scene concept. It should be noted that the segmentation is a prerequisite for their work. The hierarchical skimming is constructed by selecting, for each level, a number of “clips” and connecting all or some of them to “clips” in the lower levels. The number of levels in the hierarchy is dependent solely on the duration of the video. The “clips” that make up each level are chosen by three methods: either by temporal homogeneity, or by a set of heuristics that ensures as much as possible that the selected clips belong to different takes, or by an importance metric which is either pre-computed or manually assigned.

A different approach is presented by Ngo et al. [26], who create their hierarchical skimming based mostly on traditional concepts such as scenes and shots. Their first step is to segment the video into shots and sub-shots. Then the shots are organized as a graph, with edge weights given by the similarity of shot keyframes, and grouped into clusters using the normalized cut algorithm. The temporal ordering of shots is used to build another, directed graph with the clusters as nodes. This is then used to group clusters into continuous scenes. The attention value of a frame is computed from the motion of the MPEG macroblocks in it, with compensation for camera motion. Then the attention value for shots, clusters and scenes is calculated by averaging the attention value in the immediately lower level of the hierarchy. The hierarchical skimming is produced by first discarding scenes and clusters that have a low number of shots and/or low attention value, and then for each cluster discarding subshots from those shots that have a low attention value. This procedure is continued until the required skimming length is reached.

E. Performance Evaluation

Because the objective of the above methods is to extract a semantically informative video representation, their objective performance evaluation is a very difficult affair. In practice, performance evaluation is only feasible by visual inspection of the results. In fact, when most authors need to do experimental evaluation, they use panels of experts that quantify whether a specific video representation contains all the salient information of the original video. Of course there are other measures of the characteristics of each algorithm, such as the size of the resulting representation, but these provide no information on the quality and completeness of the results.

IV. CONCLUSIONS

It is clear that the field of video shot detection has matured, and is capable of detecting not only simple cuts but also gradual transitions with considerable recall and precision. This maturity is also evidenced by the establishment of a standard performance evaluation and comparison benchmark in the form of the TREC Video shot detection task. The field has also advanced from simple and heuristic feature comparison methods to rigorous probabilistic methods and the use of complex models of the shot transition formation process. However, there is still room for algorithm improvement. It is clear that improved solutions for this problem will involve firstly an increasingly rigorous inclusion of the a priori information about the actual physical process of shot formation, and secondly a thorough theoretical analysis of the editing effects, or a simulation of their generation as in [14].

On the other hand, condensed video representation remains a largely open issue. In part, this is firstly due to its significantly higher complexity, secondly due to the greater variety of tasks that fall under its heading, and thirdly because the evaluation of its results is almost entirely subjective. Furthermore, the desired condensed video representation may vary with the task at hand: a different representation might be needed for archival purposes than for promotional or educational purposes. However, the main reason for the relative failure of most automatic methods for the extraction of representative or significant frames or video segments is that the concept of being “representative” or “significant” is heavily dependent on the semantics of the video, and specifically on the semantic content of each video frame. So it is reasonable to state that the full solution of the automatic highlighting and skimming tasks depends on the extraction of semantics at the basic (frame and object) level. Still, a significant amount of success has been shown in applications dealing with specific types of video content, and when tasks of limited complexity are attempted. It should also be noted that the lack of a commonly accepted experimental corpus hinders the evaluation of success in the general case.

ACKNOWLEDGMENT

The C-SPAN video used in this work is provided for research purposes by C-SPAN through the TREC Information-Retrieval Research Collection. C-SPAN video is copyrighted.

Costas Cotsaces received the Diploma degree in Electrical and Computer Engineering from the Aristotle University of Thessaloniki. He is currently a researcher in the Artificial Intelligence and Information Analysis Laboratory in the Department of Informatics at the same university, where he is also working towards a PhD degree. His research interests lie in the areas of computer vision, image and video processing, pattern recognition and artificial intelligence. He is a member of the IEEE and the Technical Chamber of Greece.

Nikos Nikolaidis received the Diploma of Electrical Engineering in 1991 and the PhD degree in Electrical Engineering in 1997, both from the Aristotle University of Thessaloniki, Greece. From 1992 to 1996 he served as teaching assistant in the Departments of Electrical Engineering and Informatics at the same University. From 1998 to 2002 he was a postdoctoral researcher and teaching assistant at the Department of Informatics, Aristotle University of Thessaloniki. He is currently a Lecturer in the same Department. Dr. Nikolaidis is the co-author of the book “3-D Image Processing Algorithms” (J. Wiley, 2000). He has co-authored 5 book chapters, 19 journal papers and 56 conference papers. His research interests include computer graphics, image and video processing and analysis, copyright protection of multimedia and 3-D image processing. Dr. Nikolaidis has been a scholar of the State Scholarship Foundation of Greece.

Ioannis Pitas received the Diploma of Electrical Engineering in 1980 and the PhD degree in Electrical Engineering in 1985, both from the Aristotle University of Thessaloniki, Greece. Since 1994, he has been a Professor at the Department of Informatics, Aristotle University of Thessaloniki. From 1980 to 1993 he served as Scientific Assistant, Lecturer, Assistant Professor, and Associate Professor in the Department of Electrical and Computer Engineering at the same University. He served as a Visiting Research Associate or Visiting Assistant Professor at several Universities. He has published 140 journal papers, 350 conference papers and contributed to 18 books in his areas of interest, and edited or co-authored another 5. He has also been an invited speaker and/or member of the program committee of several scientific conferences and workshops. In the past he served as Associate Editor or co-Editor of four international journals and General or Technical Chair of three international conferences. His current interests are in the areas of digital image and video processing and analysis, multidimensional signal processing, watermarking and computer vision.

REFERENCES

[1] N. Dimitrova, H.-J. Zhang, B. Shahraray, I. Sezan, T. Huang, and A. Zakhor, “Applications of video-content analysis and retrieval,” IEEE Multimedia Magazine, vol. 9, no. 3, pp. 42-55, July 2002.

[2] P. Salembier and J. Smith, “MPEG-7 multimedia description schemes,” IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, pp. 748-759, June 2001.

[3] R. Lienhart, “Reliable transition detection in videos: A survey and practitioner's guide,” International Journal of Image and Graphics, vol. 1, no. 3, pp. 469-486, Sept. 2001.

[4] I. Koprinska and S. Carrato, “Temporal video segmentation: A survey,” Signal Processing: Image Communication, vol. 16, pp. 477-500, Jan. 2001.

[5] R. M. Ford, C. Robson, D. Temple, and M. Gerlach, “Metrics for shot boundary detection in digital video sequences,” ACM Multimedia Systems, vol. 8, no. 1, pp. 37-46, Jan. 2000.

[6] U. Gargi, R. Kasturi, and S. H. Strayer, “Performance characterization of video-shot-change detection methods,” IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 1, pp. 1-13, Feb. 2000.

[7] D. Lelescu and D. Schonfeld, “Statistical sequential analysis for real-time video scene change detection on compressed multimedia bitstream,” IEEE Trans. Multimedia, vol. 5, no. 1, pp. 106-117, Mar. 2003.

[8] A. Hanjalic, “Shot-boundary detection: Unraveled and resolved?” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 2, pp. 90-105, Feb. 2002.

[9] J. Nam and A. Tewfik, “Detection of gradual transitions in video sequences using B-spline interpolation,” IEEE Trans. Multimedia, vol. 7, no. 4, pp. 667-679, Aug. 2005.

[10] H. Zhang and S. S. A. Kankanhalli, “Automatic partitioning of full-motion video,” ACM Multimedia Systems, vol. 1, no. 1, pp. 10-28, Jan. 1993.

[11] Z. Cernekova, C. Kotropoulos, and I. Pitas, “Video shot segmentation using singular value decomposition,” in Proc. 2003 IEEE International Conference on Multimedia and Expo, vol. II, Baltimore, Maryland, USA, July 2003, pp. 301-302.

[12] R. Zabih, J. Miller, and K. Mai, “A feature-based algorithm for detecting and classifying production effects,” ACM Multimedia Systems, vol. 7, no. 1, pp. 119-128, Jan. 1999.

[13] J. Yu and M. D. Srinath, “An efficient method for scene cut detection,” Pattern Recognition Letters, vol. 22, no. 13, pp. 1379-1391, Nov. 2001.

[14] R. Lienhart, “Reliable dissolve detection,” in Storage and Retrieval for Media Databases 2001, ser. Proceedings of SPIE, vol. 4315, Jan. 2001, pp. 219-230.

[15] Z.-N. Li, X. Zhong, and M. S. Drew, “Spatial-temporal joint probability images for video segmentation,” Pattern Recognition, vol. 35, no. 9, pp. 1847-1867, Sept. 2002.

[16] W. K. Li and S. H. Lai, “Integrated video shot segmentation algorithm,” in Storage and Retrieval for Media Databases, ser. Proceedings of SPIE, vol. 5021, Jan. 2003, pp. 264-271.

[17] G. Boccignone, A. Chianese, V. Moscato, and A. Picariello, “Foveated shot detection for video segmentation,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 3, pp. 365-377, Mar. 2005.

[18] “TREC video retrieval evaluation.” [Online]. Available: http://www-nlpir.nist.gov/projects/trecvid/

[19] C. Kim and J.-N. Hwang, “Object-based video abstraction for video surveillance systems,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 12, pp. 1128-1138, Dec. 2002.

[20] ——, “Fast and automatic video object segmentation and tracking for content-based applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 2, pp. 122-129, Feb. 2002.

[21] Z. Li, G. M. Schuster, and A. K. Katsaggelos, “Minmax optimal video summarization,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 10, pp. 1245-1256, Oct. 2005.

[22] X. Zhu, J. Fan, A. K. Elmagarmid, and X. Wu, “Hierarchical video summarization and content description joint semantic and visual similarity,” ACM Multimedia Systems, vol. 9, no. 1, July 2003.

[23] A. Ferman and A. Tekalp, “Two-stage hierarchical video summary extraction to match low-level user browsing preferences,” IEEE Trans. Multimedia, vol. 5, no. 3, pp. 244-256, June 2003.

[24] Y. Gong, “Summarizing audio-visual contents of a video program,” EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 160-169, Feb. 2003.

[25] F. Shipman, A. Girgensohn, and L. Wilcox, “Creating navigable multi-level video summaries,” in Proc. 2003 IEEE International Conference on Multimedia and Expo, vol. II, Baltimore, Maryland, USA, July 2003, pp. 753-756.

[26] C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang, “Video summarization and scene detection by graph modeling,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 2, pp. 296-305, Feb. 2005.