


IBM Research TREC-2002 Video Retrieval System

Bill Adams∗, Arnon Amir†, Chitra Dorai‡, Sugata Ghosal§, Giridharan Iyengar∗, Alejandro Jaimes‡, Christian Lang‡, Ching-yung Lin‡, Apostol Natsev‡, Milind Naphade‡, Chalapathy Neti∗, Harriet J. Nock∗, Haim H. Permuter†, Raghavendra Singh§, John R. Smith‡, Savitha Srinivasan†, Belle L. Tseng‡, Ashwin T. V.§, Dongqing Zhang¶

Abstract

In this paper, we describe the IBM Research system for analysis, indexing, and retrieval of video, which was applied to the TREC-2002 video retrieval benchmark. The system explores methods for fully-automatic content analysis, shot boundary detection, multi-modal feature extraction, statistical modeling for semantic concept detection, and speech recognition and indexing. The system supports querying based on automatically extracted features, models, and speech information. Additional interactive methods for querying include multiple-example and relevance feedback searching; cluster, concept, and storyboard browsing; and iterative fusion based on user-selected aggregation and combination functions. The system was applied to all four of the tasks of the video retrieval benchmark: shot boundary detection, concept detection, concept exchange, and search. In this paper, we describe the approaches for each of the tasks and discuss some of the results.

1 Introduction

The growing amount of digital video is driving the need for more effective methods for indexing, searching, and retrieving video based on its content. Recent advances in content analysis, feature extraction, and classification are improving capabilities for effectively searching and filtering digital video content. Furthermore, the recent MPEG-7 standard promises to enable interoperable content-based retrieval by providing a rich set of standardized tools for describing features of multimedia content [SS01]. However, the extraction and use of MPEG-7 descriptions and the creation of usable fully-automatic video indexing and retrieval systems remain a significant research challenge.

∗ IBM T. J. Watson Research Center, 1101 Kitchawan Rd., Yorktown Heights, NY 10598
† IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120
‡ IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY 10532
§ IBM India Research Lab
¶ Dept. of E.E., Columbia University, New York, NY 10027

The TREC video retrieval benchmark is enabling advances in the area of content-based retrieval by fixing a video corpus and a set of query and detection tasks. This allows experimentation with new video indexing techniques, with consistent measures for assessing progress. This year, we participated in the TREC video retrieval benchmark and submitted results for four tasks: (1) shot boundary detection, (2) concept detection, (3) concept exchange, and (4) search. We explored several diverse methods for video analysis, indexing, and retrieval, including automatic descriptor extraction, statistical modeling, and multi-modal fusion. We conducted experiments that individually explored the audio-visual and speech modalities as well as their combination in manual and interactive querying. In this paper, we describe the video indexing and retrieval system and discuss the results on the video retrieval benchmark.

1.1 Outline

In this paper, we describe the video indexing and retrieval system and examine the approaches and results on the TREC-2002 video retrieval benchmark. The outline is as follows: in Section 2, we describe the process for video indexing, including video content indexing and speech indexing. In Section 3, we describe the video retrieval system, including methods for content-based search, model-based search, speech-based search, and other methods for interactive searching and browsing. In Section 4, we discuss the approaches for each of the benchmark tasks and examine some of the results.

2 Video indexing system

The video indexing system analyzes the video in an off-line process that involves video content indexing and speech indexing. The video content indexing process consists of shot boundary detection, key-frame extraction, feature extraction, region extraction, concept detection, and clustering, as shown in Figure 1. The basic unit of indexing and retrieval is a video shot.



Figure 1: Summary of video content indexing process.

2.1 Shot boundary detection (SBD)

Shot boundary detection (SBD) is performed using the real-time IBM CueVideo system [Cue], which automatically detects shots and extracts key-frames. This year, we explored several methods for making SBD more robust to poor video quality, including localized edge-gradient histograms and comparing pairs of frames at greater temporal distances. Overall, the 2002 system reduced SBD errors by more than 30% compared to the 2001 SBD system [SSA+02].

The baseline CueVideo SBD system uses sampled, three-dimensional color histograms in RGB color space to compare pairs of frames. Histograms of recent frames are stored in a buffer to allow comparison between multiple image pairs up to seven frames apart. Statistics of frame differences are computed in a moving window around the processed frame and are used to compute adaptive thresholds, shown in Figure 2 as a line above the difference measures (Diff1, Diff3, and Edge1). A state machine is used to detect the different events (states). The SBD system does not require any sensitivity-tuning parameters. More details about the baseline system can be found in [SPA+99, SSA+02].

Several changes were incorporated into the baseline SBD algorithm to accommodate lower video quality, such as the dated videos in the TREC-2002 data set. Localized edge-gradient histograms were added to overcome color errors. The 512-bin edge-gradient histogram counts the number of pixels in each of eight image regions having similar I_x, I_y derivatives (each derivative is quantized into three bits); it is therefore invariant to global light and color changes across the entire image. Rank filtering was added over time, space, and histogram bins at various points along the processing to handle the new types and higher levels of noise. More comparisons of pairs of frames, at distances up to thirteen frames apart, were added to overcome the high MPEG-1 compression noise. Several new states were added to the state machine to detect certain types of video errors and to detect very short dissolves, 2-3 frames long. These changes were evaluated using systematic precision-recall measurements of many different variations of the system. Evaluation was carried out using two data sets: one consisting of about half of the TREC-01 test set with NIST ground truth data, and the other of twelve video segments, each one to two minutes long, taken from the TREC-02 training set, for which ground truth was generated manually.
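As an illustration of the localized edge-gradient descriptor described above, the following minimal Python sketch counts pixels per region with quantized I_x, I_y derivatives. The horizontal-strip region layout and the quantization range are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def edge_gradient_histogram(gray, n_regions=8, bits=3):
    """Sketch of a localized edge-gradient histogram.

    The frame is split into n_regions horizontal strips (assumed layout);
    the Ix and Iy derivatives are each quantized into 2**bits levels,
    giving n_regions * 2**bits * 2**bits = 512 bins for the defaults.
    """
    gray = gray.astype(np.float32)
    ix = np.gradient(gray, axis=1)                 # horizontal derivative
    iy = np.gradient(gray, axis=0)                 # vertical derivative
    levels = 2 ** bits
    # Quantize each derivative into `levels` bins over an assumed range.
    qx = np.clip(((ix + 255) / 510 * levels).astype(int), 0, levels - 1)
    qy = np.clip(((iy + 255) / 510 * levels).astype(int), 0, levels - 1)
    hist = np.zeros((n_regions, levels, levels), dtype=np.int64)
    h = gray.shape[0]
    for r in range(n_regions):
        rows = slice(r * h // n_regions, (r + 1) * h // n_regions)
        np.add.at(hist[r], (qx[rows].ravel(), qy[rows].ravel()), 1)
    return hist.ravel()                            # 512-dimensional descriptor

# Compare two frames by the L1 distance between their histograms.
f1 = np.random.randint(0, 256, (240, 352))
f2 = np.random.randint(0, 256, (240, 352))
d = np.abs(edge_gradient_histogram(f1) - edge_gradient_histogram(f2)).sum()
```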

Figure 2: Plot of frame-to-frame processing of the SBD algorithm. Notice the ground truth (GT) and system output (Sys) plots for this segment of video, which has six dissolves (one missed) and twelve cuts.

2.2 Feature extraction

The system extracts a number of descriptors for each video shot. As indicated below, some of the descriptors are extracted in multiple ways from each key-frame image using different normalization strategies (see [SN02]): (1) globally, (2) based on a 4x4 grid, (3) based on a 5-region layout, and (4) based on automatically extracted regions. The following descriptors were extracted (a minimal extraction sketch follows the list):

• Color histogram (global per key-frame, 4x4 grid, 5-region layout, segmentation regions): one based on a 166-bin HSV color space [Smi01] and another based on a 512-bin RGB color space,

• Edge orientation histogram (global per key-frame, 4x4 grid, 5-region layout): based on the Sobel-filtered image and quantization to 8 angles and 8 magnitudes [SN02],

• Wavelet texture (global per key-frame, 4x4 grid, 5-region layout): based on wavelet spatial-frequency energy of 12 bands using quadrature mirror filters [Smi01],

• Color correlogram (global per key-frame, 4x4 grid, 5-region layout): based on single-banded auto-correlogram coefficients extracted for 8 radii depths in a 166-color HSV color space [HKM+99],

• Co-occurrence texture (global per key-frame, 4x4 grid, 5-region layout): based on entropy, energy, contrast, and homogeneity features extracted from gray-level co-occurrence matrices at 24 orientations [JKS95],

• Motion vector histogram (global per shot, segmentation regions): based on 8×8 motion estimation blocks in the MPEG-1 decoded I and P frames. A six-bin histogram is generated based on the motion vector magnitudes,

• Visual perception texture (global per key-frame, segmentation regions): three values representing the coarseness, contrast, and directionality, respectively [TMY78],

• Mel-Frequency Cepstral Coefficients (MFCC): transformation of the uncompressed PCM signal into 24 MFCC features including the energy coefficient [RJ93].
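The following minimal Python sketch illustrates the flavor of the grid-based descriptor extraction; the 8-bins-per-channel RGB quantization and L1 cell normalization here are illustrative assumptions rather than the paper's 166-bin HSV or 512-bin RGB quantizations.

```python
import numpy as np

def grid_color_histograms(rgb, grid=(4, 4), bins_per_channel=8):
    """Sketch of grid-based color histogram extraction: one histogram per
    grid cell, each L1-normalized so cells of different sizes are comparable."""
    h, w, _ = rgb.shape
    q = (rgb // (256 // bins_per_channel)).astype(int)     # quantize channels
    codes = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    n_bins = bins_per_channel ** 3
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            cell = codes[gy * h // grid[0]:(gy + 1) * h // grid[0],
                         gx * w // grid[1]:(gx + 1) * w // grid[1]]
            hist = np.bincount(cell.ravel(), minlength=n_bins).astype(float)
            feats.append(hist / max(hist.sum(), 1.0))       # L1 normalization
    return np.concatenate(feats)

key_frame = np.random.randint(0, 256, (240, 352, 3), dtype=np.uint8)
descriptor = grid_color_histograms(key_frame)               # 16 cells x 512 bins
```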

2.3 Region extraction

In order to better extract local features and detect concepts, we developed a video region segmentation system that automatically extracts foreground and background regions from video, as shown in Figure 3. The system runs in real time, extracting foreground and background regions from I-frames and P-frames in MPEG-1 video.


Figure 3: Example region segmentation results (bounding boxes of extracted regions): (a) background segmentation, (b) foreground object segmentation.

The segmentation of the background scene regions uses a block-based region-growing method based on color histograms, edge histograms, and directionality. The segmentation of the foreground regions uses a spiral searching technique to calculate the motion vectors of I- and P-frames. The motion features are used in region growing in the spatial domain, with additional tracking constraints in the time domain. Although we tested MPEG-1 compressed-domain motion vectors, we found them to be too noisy. We also found that combining motion vectors, color, edge, and texture information for extraction of foreground objects did not give significantly better results than using only motion information.

2.4 Clustering

We used the extracted visual descriptors (see Section 2.2) to cluster the video shots into perceptually similar groups. We used a k-means clustering algorithm to generate 20 clusters. We found color correlograms to achieve an excellent balance between color and texture features. The clusters were later used to facilitate browsing and navigation for interactive retrieval (as described in Section 3.5).
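A minimal sketch of the shot clustering step, assuming scikit-learn's KMeans and random placeholder descriptors standing in for the actual correlogram features:

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder descriptors: one feature vector per shot.
shot_descriptors = np.random.rand(5000, 166)

# Partition the shots into 20 perceptually similar groups.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(shot_descriptors)

# Group shot indices by cluster for storyboard-style browsing.
clusters = {c: np.where(cluster_ids == c)[0] for c in range(20)}
```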

2.5 Concept detection

The concept detection system uses the labeled training video content to classify unknown video content (in our case, the feature test and search test data). We have investigated several different types of static models, including Support Vector Machines (SVMs) and Gaussian Mixture Models (GMMs).

2.5.1 Lexicon design

The first step in designing a semantic concept detection system is the construction of a concept lexicon [NBS+02]. We viewed the training set video, identified the most salient, frequently occurring concepts, and fixed a lexicon of 106 concepts, which included the 10 concepts belonging to the TREC concept detection task (denoted as primary concepts). Overall, we generated training and validation data and modeled the 10 primary concepts (Outdoors, Indoors, Cityscape, Landscape, Face, People, Text Overlay, Music, Speech, and Monologue) and 39 secondary generic concepts as follows:

• Objects: Person, Road, Building, Bridge, Car, Train, Transportation, Cow, Pig, Dog, Penguin, Fish, Horse, Animal, Tree, Flower, Flag, Cloud,

• Scenes: Man Made Scenes, Beach, Mountain, Greenery, Sky, Water, Household Setting, Factory Setting, Office Setting, Land, Farm, Farm House, Farm Field, Snow, Desert, Forest, Canyon,

• Events: Parade, Explosion, Picnic, Wedding.


Figure 4: VideoAnnEx MPEG-7 video annotation tool. The system enables semi-automatic annotation of video shots and editing of the lexicon.

2.5.2 Annotation

In order to generate training and validation data, we manually annotated the video content using two annotation tools¹ – one produced the visual annotations and the other produced the audio annotations. The IBM MPEG-7 Video Annotation Tool (a.k.a. VideoAnnEx), shown in Figure 4, allows the shots in the video to be annotated using terms from an imported lexicon. The tool is compatible with MPEG-7 in that the lexicons can be imported as MPEG-7 classification schemes, and it generates MPEG-7 descriptions of the video based on the detected shots and annotations. The tool also allows users to directly edit the lexicons.

The IBM Multimodal Annotation Tool provides three modes of annotation: video, audio with video, or audio without video. The audio annotation is based upon audio segments: the user manually delimits each segment within the audio upon listening and selects from the lexicon the terms that describe the audio content. Multimodal concepts (e.g., Monologue) are annotated using the audio-with-video mode of annotation.

2.5.3 Training and validation

Training and validation of the models was done using the NIST feature training data set. We randomly partitioned the NIST feature training data set into a 19-hour Feature Training (FTR) collection and a 5-hour Feature Validation (FV) collection. We used the FTR collection to construct the models and the FV collection to select parameters and evaluate the concept detection performance. The validation process was beneficial in helping to avoid overfitting to the FTR collection.

¹ Available at http://alphaworks.ibm.com

2.5.4 Concept modeling

We achieve semantic concept detection using a classification methodology (as investigated in [NKFH98, NKH02, NBS+02]). The system learns the parameters of the classifiers for each concept from training data using statistical methods. We considered two approaches: one decision-theoretic and the other based on structural risk minimization.

Decision-theoretic approach In this approach, the descriptors are assumed to be independent, identically distributed random variables drawn from known probability distributions with unknown deterministic parameters. For the purpose of classification, we assume that the unknown parameters are distinct under the different hypotheses and can be estimated. In particular, each semantic concept is represented by a binary random variable. The two hypotheses associated with each such variable are denoted by H_i, i ∈ {0, 1}, where 0 denotes absence and 1 denotes presence of the concept. Under each hypothesis, we assume that the descriptor values are generated by the conditional probability density function P_i(X), i ∈ {0, 1}. The class-conditional distributions over the descriptors under the two hypotheses – concept present (H_1) and concept absent (H_0) – are assumed to be generated by distinct mixtures of diagonal Gaussians in the case of static concepts. For dynamic concepts or events, we assume that the class-conditional densities are generated by Hidden Markov Models (HMMs) [Rab89] with GMMs for the observation densities [NKFH98]. The parameters of the GMMs (means, covariances, and mixture weights) are estimated using the Expectation Maximization (EM) algorithm [DLR77]. A ranked list of confidence measures can then be generated using the likelihood ratio test [Poo99].
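The following sketch illustrates the decision-theoretic detector for a single static concept, assuming scikit-learn's diagonal-covariance GaussianMixture and placeholder descriptor matrices; it is not the exact model configuration used in the system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder descriptor matrices for training and test shots.
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, (200, 24))     # shots labeled concept-present
X_neg = rng.normal(0.0, 1.0, (800, 24))     # shots labeled concept-absent
X_test = rng.normal(0.5, 1.0, (100, 24))    # unlabeled test shots

# One diagonal-covariance GMM per hypothesis, fit with EM.
gmm_h1 = GaussianMixture(n_components=4, covariance_type='diag',
                         random_state=0).fit(X_pos)   # concept present (H1)
gmm_h0 = GaussianMixture(n_components=4, covariance_type='diag',
                         random_state=0).fit(X_neg)   # concept absent (H0)

# Log-likelihood ratio per shot; higher means more likely to contain the concept.
llr = gmm_h1.score_samples(X_test) - gmm_h0.score_samples(X_test)
ranked_shots = np.argsort(-llr)
```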

Structural risk minimization Unlike the decision-theoretic approach, the discriminant approach focuses only on those characteristics of the feature set that discriminate between the two hypotheses of interest. The idea of constructing learning algorithms based on the structural risk minimization inductive principle was proposed in [Vap95]. Support vector machines (SVMs) achieve discrimination by mapping the feature vectors into a higher-dimensional space through a nonlinear function and constructing the optimal separating hyperplane. Consider a set of patterns {x_1, ..., x_n} with a corresponding set of labels {y_1, ..., y_n}, where y ∈ {−1, 1}. The idea is to use a nonlinear transformation Φ(x) and a kernel K(x_i, x_j) such that the kernel K can be used in place of an inner product defined on the transformed feature vectors, <Φ(x_i), Φ(x_j)>. The optimal hyperplane for classification in the nonlinearly transformed space is then computed by converting this constrained optimization problem into its dual problem using Lagrange multipliers and solving the dual problem. If the nonlinear transformation Φ(·) is chosen carefully, such that the kernel can be used to replace the inner product in the transformed space, then the operations can be performed using the kernel in the original feature space.

Figure 5: Three-stage normalized classifier fusion, which includes normalization of confidence scores, combination, and selection.
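A minimal sketch of the discriminative detector: an RBF-kernel SVM whose signed distance from the separating hyperplane is used as a ranking confidence. The kernel parameters and the synthetic data are illustrative assumptions, not the values selected on the validation set.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder labeled training descriptors and unlabeled validation descriptors.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 64))
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=1000) > 0).astype(int)
X_val = rng.normal(size=(200, 64))

# RBF-kernel SVM; class_weight='balanced' compensates for rare positives.
svm = SVC(kernel='rbf', gamma=0.1, C=10.0, class_weight='balanced')
svm.fit(X_train, y_train)

confidence = svm.decision_function(X_val)   # distance from the hyperplane
ranked = np.argsort(-confidence)            # most confident detections first
```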

2.5.5 Fusion

Since no single descriptor is powerful enough to encompass all aspects of video content and separate the concept hypotheses, fusion is needed at several levels in the concept modeling and detection processes. We experimented with two distinct approaches: early fusion and late fusion. For early fusion, we experimented with fusing descriptors prior to classification. For late fusion, we experimented with retaining soft decisions and fusing classifiers. In addition, we explored various combining methods and aggregation functions for late fusion of search results, as described in Section 3.6.

Feature fusion The objective of feature fusion is to combine multiple features at an early stage to construct a single model. This approach is most suitable for concepts that have a sufficiently large number of training examples and whose descriptors are believed to be correlated and dependent. We experimented with feature fusion by simply concatenating descriptors. Different combinations of descriptors were used to construct models, and we used the validation set to choose the best combination.

Classifier fusion In an ideal situation, early fusion should work for all concepts, since there is no loss of information. However, practical considerations, such as a limited number of training examples, limited computational resources, and the risk of overfitting the data, necessitate an alternate strategy. If the features are fairly independent, then the loss of information is not a concern. In such situations, we use late fusion based on soft decisions that emanate from models constructed independently for individual feature types or modalities. We used a separate model (SVM or GMM) for each descriptor, which results in multiple classifications and associated confidences for each shot, depending on the descriptor. While the classifiers can be combined in many ways, we explored normalized ensemble fusion to improve overall classification performance, as shown in Figure 5.

The normalized ensemble fusion process consists of three major steps, as shown in Figure 5. The first step is score normalization of the resulting confidences from each classifier. The second step is the aggregation of the normalized confidence scores. The third step is the optimization over multiple score normalization and fusion functions to attain optimal performance against the validation set. In the weighted-average case, we considered several general weighting mechanisms, such as (1) inverse entropy, (2) inverse variance, and (3) model selection. For model selection, we restricted the weights to be 0 or 1 only, essentially identifying high-performing and complementary subsets of classifiers. Subsequently, an optimal selection of the best-performing normalized ensemble fusion is obtained by evaluating the performance measure against the ground truth of the validation set. During the optimization stage, we favor fusion methods that generalize well in order to avoid overfitting the validation set.
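The following sketch illustrates the three-stage fusion on a validation set, assuming range normalization, 0/1 model-selection weights, and average precision as the performance measure; the actual set of normalizations and weighting schemes explored was broader.

```python
import numpy as np

def range_normalize(scores):
    """Map confidence scores into [0, 1] (one possible normalization)."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores)

def average_precision(scores, labels):
    """Non-interpolated average precision of a ranked list."""
    order = np.argsort(-scores)
    rel = labels[order]
    hits = np.cumsum(rel)
    return float(np.sum(hits / np.arange(1, len(rel) + 1) * rel) / max(rel.sum(), 1))

# Placeholder per-classifier confidences and ground truth on the validation set.
rng = np.random.default_rng(0)
val_labels = (rng.random(500) < 0.1).astype(float)
classifier_scores = [val_labels + rng.normal(0, s, 500) for s in (0.5, 0.8, 1.2)]

# Step 1: normalize each classifier's scores.
normalized = np.stack([range_normalize(s) for s in classifier_scores])

# Steps 2-3: enumerate 0/1 selection weights and keep the best combination
# according to validation-set average precision.
best = None
for mask in range(1, 2 ** len(normalized)):
    w = np.array([(mask >> i) & 1 for i in range(len(normalized))], dtype=float)
    fused = (w[:, None] * normalized).sum(axis=0) / w.sum()
    ap = average_precision(fused, val_labels)
    if best is None or ap > best[0]:
        best = (ap, w)
```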

2.5.6 Specialized detectors

Although we used the generic approaches above for detection of most of the concepts, for two concepts (monologue and text overlay) we explored specialized approaches.

Monologue detection For monologue detection, we first performed speech and face detection on each shot. Then, for shots containing both speech and a face, we further evaluated the synchrony between the face and the speech using mutual information and used the combined score thus generated to rank all shots in the corpus. Based on experimental results with a variety of synchrony detection techniques [NIN02], we chose a scheme that models audio and video features as locally Gaussian distributions. Given a video shot, we extracted all the video frames corresponding to it and the associated audio. The audio features were segmented into locally Gaussian segments using a model-selection-based segmentation scheme [CG98b]. For each such locally Gaussian audio segment, we evaluated the mutual information between every pixel in the video frames and the audio features. To get a synchrony score from such a mutual information image, we computed the ratio between the mutual information of the face region and the average mutual information across the entire image. Intuitively, the higher this score, the greater the synchrony between audio and video.

Text overlay detection We explored two algorithms for extracting overlay text in video and fused the results of the classifiers to produce the final concept labeling. The first method (see [SDB98]) works by extracting and analyzing regions in a video frame. The processing stages in this system are: (1) isolating regions that may contain text characters, (2) separating each character region from its surroundings, and (3) verifying the presence of text by consistency analysis across multiple text blocks. A confidence measure is computed as a function of the number of characters in text objects in the frame. The second method uses macroblock-based texture and motion energy. These features are used to generate candidate text overlay regions. For each candidate region, color layering in HSV color space is used to generate several hypothetical binary images; grouping is then performed on each binary image to connect and extract character blocks using connected component analysis. Finally, layout analysis is used to verify the layout of these character blocks. A text region is identified if the character blocks can be aligned to form sentences or words.

2.6 Speech recognition and indexing

As in TREC-2001, we constructed a speech-based system for video retrieval. Significant improvements were made to both the automatic speech recognition (ASR) performance and the search engine performance relative to our TREC-2001 submission.

2.6.1 Automatic speech recognition (ASR)

A series of increasingly accurate speech transcriptions for the entire corpus was produced in the period leading up to the evaluation. The first set of transcriptions was produced using an IBM real-time transcription system tuned for Broadcast News; this is the same transcription system as was used in TREC-2001 [SSA+02]. Later transcriptions were produced using an off-line, multiple-pass transcription system comprising the following stages:

• Remove silent videos using the output of GMM-based speech, music, and silence detectors;

• Run per-video BIC-based segmentation to divide each video into shorter segments suitable for decoding [CG98a];

• Run music detection and silence detection over all BIC segments and also automatically transcribe all BIC segments using an IBM 10× real-time Broadcast News transcription system; remove pure music and silence segments, and adjust boundaries of retained speech segments based on silence locations in the transcript;

• Supervised MLLR adaptation of speaker-independent HUB4 models to TREC-2002 data using a set of eight (word-level transcribed) videos from Feature Train [Leg95];

• Decode BIC "speech-only" segments using an interpolated trigram LM;

• Unsupervised clustering of BIC "speech-only" segments into "speaker- and environment-similar" clusters [CG98a];

• Unsupervised MLLR adaptation of the TREC-2002-adapted HUB4 models to each cluster using single global MLLR mean and precision transforms [Leg95].

The word error rate (WER) of the final transcripts is estimated at 38.7% on a held-out set of eight videos from Search Test and Feature Test which were manually transcribed². This compares favorably to 42.7% for the best of the publicly released transcriptions on the same set and represents a 41% improvement over the transcriptions used as the basis for IBM's TREC-2001 SDR system.

2.6.2 Speech indexing

Indexes were constructed for SDR from the final, most accurate speech transcriptions. Three types of indexes were generated: document-based indexes, an inverse word index, and a phonetic index. No attempt was made to index the set of silent videos.

Document-level indexes: the document-level indexes support retrieval at the document level, where a document is defined to span a temporal segment containing at most 100 words³. Consecutive documents overlap by 50 words in order to address boundary truncation effects. Once documents are defined and their associated time boundaries are recorded, the documents are preprocessed using (1) tokenization to detect sentence/phrase boundaries; (2) (noisy) part-of-speech tagging, such as noun phrase, plural noun, etc.; (3) morphological analysis, which uses the part-of-speech tag and a morph dictionary to reduce each word to its morph, e.g., the verbs [lands], [landing], and [land] reduce to /land/; and (4) removal of "stop" words using standard stop-word lists. After preprocessing, indexes are constructed and statistics (such as word and word-pair term- and inverse-document frequencies) are recorded for use during retrieval.

² Note this set does not overlap with the set used in supervised acoustic model adaptation.

³ Minor differences in document definition were used in constructing the different indexes, such as whether or not document boundaries are defined at long stretches of silence or music; experiments suggest these differences do not make a significant contribution to the differences in MAP across systems.

Inverse word index: the inverse word index supports Boolean search by providing the (video_i, time_i) locations of all the occurrences of a query term in the videos. Preprocessing of the transcripts is similar to that above.

Phonetic index: the phonetic index supports search for out-of-vocabulary words. The (imperfect) speech transcript is converted to a string of phones [AES01]. The phonetic index can be searched for sound-alike phone sequences corresponding to out-of-vocabulary query terms, such as some acronyms, names of people, places, and so forth⁴.

3 Video retrieval system

The video retrieval system provides a number of facilities for searching, which include content-based retrieval (CBR), model-based retrieval (MBR), speech-based search or spoken document retrieval (SDR), and other interactive methods.

3.1 Content-based retrieval (CBR)

Content-based retrieval (CBR) is an important technique for indexing video content. While CBR is not a robust surrogate for indexing based on the semantics of image content (scenes, objects, events, and so forth), CBR has an important role in searching. For one, CBR complements traditional querying by allowing "looks like" searches, which can be useful for pruning or re-ordering result sets based on visual appearance. Furthermore, in many cases, multi-example search and relevance feedback techniques are able to learn semantics from features through training that results from interactions with the user.

The objective of CBR is to match example query content to target video content using the extracted descriptors (see Section 2.2), as shown in Figure 6. The degree of match is determined on the basis of feature similarity.

⁴ For this year's queries we found the phonetic index to be of limited use: only two queries involved out-of-vocabulary words, which were names.

Figure 6: CBR matches example query content to target video shots and returns scored results in rank order.

We measure feature similarity using Minkowski-form metrics with r = 1 (Manhattan distance) and r = 2 (Euclidean distance) as follows: given descriptors represented as multi-dimensional feature vectors, with v_q and v_t the query and target vectors, respectively,

    d^r_{q,t} = \sum_{m=0}^{M-1} |v_q[m] - v_t[m]|^r.    (1)

Furthermore, in the case of histogram descriptors (e.g., color histograms, edge histograms – see Section 2.2), we use the following class of normalizations, where r = 1, 2, and h corresponds to an M-dimensional histogram:

    v_r = \frac{h}{\left( \sum_{m=0}^{M-1} |h[m]|^r \right)^{1/r}},    (2)

which allows the comparison of images of different sizes.
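A minimal sketch of Eqs. (1) and (2) applied to histogram descriptors (placeholder vectors; r = 1 shown):

```python
import numpy as np

def normalize_histogram(h, r=1):
    """Eq. (2): scale an M-bin histogram by its L_r norm so that images
    of different sizes become comparable."""
    norm = np.power(np.sum(np.abs(h) ** r), 1.0 / r)
    return h / norm if norm > 0 else h

def minkowski_distance(v_q, v_t, r=1):
    """Eq. (1): r = 1 gives the Manhattan distance, r = 2 the squared
    Euclidean distance, as written in the paper."""
    return float(np.sum(np.abs(v_q - v_t) ** r))

# Rank target shots against a query key-frame histogram (random placeholders).
rng = np.random.default_rng(0)
query = normalize_histogram(rng.random(166))
targets = [normalize_histogram(rng.random(166)) for _ in range(1000)]
ranked = sorted(range(len(targets)),
                key=lambda i: minkowski_distance(query, targets[i], r=1))
```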

3.2 Model-based retrieval (MBR)

Model-based search allows the user to retrieve video shots based on the concept labels produced by the models (see Section 2.5). In MBR, the user enters the query by typing label text, or the user selects from the label lexicon. Since a confidence score is associated with each automatically assigned label, MBR ranks the shots using a distance D derived from the confidence C as D = 1 − C.

3.3 Speech-based search (SDR)

Speech-based search allows the user to retrieve video shots based on the speech transcript associated with the shots. We used multiple SDR systems independently and combined the results to produce the final SDR results for TREC-2002; we refer to the three systems as OKAPI-SYSTEM-1, OKAPI-SYSTEM-2, and BOOLEAN-SYSTEM-1. To evaluate different design decisions, a limited ground truth was created for the combined FTR and FV collections by pooling the results and performing relevance assessment.


Query development and preprocessing: All SDR systems operate using a textual statement of information need. Query strings are preprocessed in a similar manner to the documents: tokenization, tagging, and morphing give the final query-term sequence for use in retrieval.

Video segment retrieval: Given a query, the three SDR systems rank documents or video segments as follows:

• OKAPI-SYSTEM-1, OKAPI-SYSTEM-2: a single-pass approach is used to compute a relevancy score for each document. Each document is ranked against a query, where the relevancy score is given by the Okapi formula [RWSJ+95]. The total relevancy score for the query string is the combined score of each of the query terms. The scoring function takes into account the number of times each query term occurs in the document and how rare that query term is across the entire corpus, with normalization based upon the length of the document to remove the bias towards longer documents, since longer documents are more likely to contain more instances of any given word. (A generic sketch of this style of scoring follows the list.)

• BOOLEAN-SYSTEM-1: a Boolean search was applied to Boolean queries. This search also supported phonetic search of out-of-vocabulary words using the phonetic index, in conjunction with in-vocabulary words, which can be located in the inverse word index.
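The following sketch shows a standard Okapi BM25 scorer of the kind described for the OKAPI systems; the exact variant and parameter settings used in OKAPI-SYSTEM-1/2 are not specified in the paper, so the values of k1 and b here are conventional defaults.

```python
import math
from collections import Counter

def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Sketch of Okapi-style scoring: `documents` is a list of preprocessed
    term lists; returns one relevancy score per document."""
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    # Document frequency of each query term (how rare it is in the corpus).
    df = {t: sum(1 for d in documents if t in d) for t in query_terms}
    scores = []
    for doc in documents:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(doc) / avg_len)  # length normalization
            s += idf * num / den
        scores.append(s)
    return scores

docs = [["beach", "sand", "water"], ["city", "traffic"], ["beach", "holiday"]]
print(bm25_scores(["beach", "water"], docs))
```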

Many SDR systems use the results of first-pass retrieval as the basis for an automatic query expansion scheme prior to running a second pass of retrieval. Experiments showed little gain from using an LCA-based scheme [XC00] on FTR+FV, since the number of relevant documents retrieved per query in the first pass is quite low, so the approach was not investigated further.

Video segment-to-shot mapping: NIST evaluates video retrieval performance at the level of shots, rather than at the level of documents or video segments, which span one or more shots. Thus we must somehow use the scores assigned to documents or video segments by SDR to assign scores at the level of shots⁵. The mappings used in the three component systems are:

• OKAPI-SYSTEM-1: the score assigned to a document is assigned to the longest shot overlapping that document;

• OKAPI-SYSTEM-2: the score assigned to a document is assigned to all the overlapping shots. A slightly higher score is given to the later shots than to the first ones;

• BOOLEAN-SYSTEM-1: first, the boundaries of the video segment are determined by the coverage of the relevant words. Then the overlapping shots are scored in the same way as with OKAPI-SYSTEM-2.

⁵ Whilst this procedure might be simplified by defining documents in a fashion more closely related to shot boundaries, our results to date have found this to be less successful than the approaches discussed above.

The video segment-to-shot mapping is critical to overall SDR performance. Post-evaluation experiments show the schemes above were not optimal choices; for example, since multiple relevant shots often overlap a single document, OKAPI-SYSTEM-1 performance can be improved simply by assigning the document score to all overlapping shots. Our current research is investigating more sophisticated schemes.

Fusion of multiple SDR systems: Analysis of the results from the different systems shows that they are often complementary on FTR+FV: no system consistently outperforms the others. Thus we hypothesized that fusion of scores might lead to improved overall performance. Whilst various fusion schemes are possible, for TREC-2002 we use a simple additive weighted scheme to combine shot-level, zero-to-one range-normalised scores from each of our basic SDR systems. Weights can be optimised on FTR+FV prior to the final run on (held-out) search test data. This combined system is termed "SDR-FUSION-SYSTEM".
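A minimal sketch of the additive weighted fusion of shot-level SDR scores, with illustrative weights standing in for those optimised on FTR+FV:

```python
import numpy as np

def fuse_sdr_scores(system_scores, weights):
    """Range-normalize each system's shot-level scores to [0, 1] and
    combine them with an additive weighted sum."""
    fused = np.zeros_like(system_scores[0], dtype=float)
    for scores, w in zip(system_scores, weights):
        lo, hi = scores.min(), scores.max()
        norm = (scores - lo) / (hi - lo) if hi > lo else np.zeros_like(scores, dtype=float)
        fused += w * norm
    return fused

# Placeholder shot-level scores from the three component systems.
okapi1 = np.array([3.1, 0.0, 1.7, 5.2])
okapi2 = np.array([2.0, 0.4, 2.6, 4.1])
boolean1 = np.array([1.0, 0.0, 0.0, 1.0])
print(fuse_sdr_scores([okapi1, okapi2, boolean1], weights=[0.4, 0.4, 0.2]))
```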

3.4 Term vector search

We used term vectors constructed from the ASR text to allow similarity search based on textual content. Given the entire collection of shots, we obtained a list of all of the distinct terms that appear in the ASR for the collection. The order of this list was fixed to give a one-to-one mapping between distinct terms and dimensions of the vector space. Each shot was then represented by an n-dimensional vector, where the value at each dimension represented the frequency of the corresponding term in that shot. This allows the comparison of two shots based on term frequency. We constructed several term vector representations based on the ASR text.
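A minimal sketch of the term-vector construction from ASR text (toy transcripts; raw term frequencies, as described above):

```python
import numpy as np

# Toy ASR transcripts keyed by shot id.
shot_asr = {0: "people walking on the beach",
            1: "city traffic and people",
            2: "waves on the beach"}

# Fix a global term order to define the dimensions of the vector space.
vocab = sorted({t for text in shot_asr.values() for t in text.split()})
index = {t: i for i, t in enumerate(vocab)}

def term_vector(text):
    v = np.zeros(len(vocab))
    for t in text.split():
        v[index[t]] += 1                 # term frequency
    return v

vectors = {shot: term_vector(text) for shot, text in shot_asr.items()}

# Compare two shots, e.g. by cosine similarity of their term vectors.
a, b = vectors[0], vectors[2]
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```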

3.5 Browsing and navigation

The system provides several methods for browsing and navigation. For each video, a storyboard overview image was generated that allowed its content to be viewed at a glance. The system also generated these overview images for each cluster (see Section 2.4) and each model (see Section 2.5).

3.6 Iterative fusion

The iterative fusion methods provide a way of combining and rescoring results lists through successive search operations using different combination methods and aggregation functions, defined as follows:


Combination methods Consider the results list R_k for query k and the results list Q_r for the current user-issued search; the combination function R_i = F_r(R_{i-1}, Q_r) combines the results lists by performing set operations on list membership. We explored the following combination methods:

• Intersection: retains only those shots present in both results lists.

    R_i = R_{i-1} \cap Q_r    (3)

• Union: retains shots present in either results list.

    R_i = R_{i-1} \cup Q_r    (4)

Aggregation functions Consider the scored results list R_k for query k, where D_k(n) gives the score of the shot with id = n, and Q_d(n) the scored result for each shot n in the current user-issued search; the aggregation function rescores the shots using D_i(n) = F_d(D_{i-1}(n), Q_d(n)). We explored the following aggregation functions (a sketch combining these steps follows the list):

• Average: takes the average of the scores of the prior results list and the current user-issued search. Provides "and" semantics. This can be useful for searches such as "retrieve shots that are indoors and contain faces."

    D_i(n) = \frac{1}{2}\left(D_{i-1}(n) + Q_d(n)\right)    (5)

• Minimum: retains the lowest score from the prior results list and the current user-issued search. Provides "or" semantics. This can be useful in searches such as "retrieve shots that are outdoors or have music."

    D_i(n) = \min\left(D_{i-1}(n), Q_d(n)\right)    (6)

• Maximum: retains the highest score from the prior results list and the current user-issued search.

    D_i(n) = \max\left(D_{i-1}(n), Q_d(n)\right)    (7)

• Sum: takes the sum of the scores of the prior results list and the current user-issued search. Provides "and" semantics.

    D_i(n) = D_{i-1}(n) + Q_d(n)    (8)

• Product: takes the product of the scores of the prior results list and the current user-issued search. Provides "and" semantics and better favors those shots that have low scores compared to "average".

    D_i(n) = D_{i-1}(n) \times Q_d(n)    (9)

• A: retains the scores from the prior results list. This can be useful in conjunction with "intersection" to prune a results list, as in searches such as "retrieve shots of beach scenes but retain only those showing faces."

    D_i(n) = D_{i-1}(n)    (10)

• B: retains the scores from the current user-issued search. This can be useful in searches similar to those above but exchanges the arguments.

    D_i(n) = Q_d(n)    (11)
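A minimal sketch of one iterative-fusion step combining Eqs. (3)-(11); the treatment of shots missing from one list under "union" (a neutral score of 0) is an assumption, since the paper does not specify that case.

```python
# Results lists are dictionaries mapping shot id -> score (lower = better match).
def iterate(prev, current, combine="intersection", aggregate="product"):
    if combine == "intersection":
        ids = prev.keys() & current.keys()      # Eq. (3)
    else:                                       # "union", Eq. (4)
        ids = prev.keys() | current.keys()

    def agg(a, b):
        if aggregate == "average":
            return 0.5 * (a + b)                # Eq. (5)
        if aggregate == "minimum":
            return min(a, b)                    # Eq. (6)
        if aggregate == "maximum":
            return max(a, b)                    # Eq. (7)
        if aggregate == "sum":
            return a + b                        # Eq. (8)
        if aggregate == "product":
            return a * b                        # Eq. (9)
        if aggregate == "A":
            return a                            # Eq. (10)
        return b                                # "B", Eq. (11)

    # Shots absent from one list contribute a neutral score of 0 (assumption).
    return {n: agg(prev.get(n, 0.0), current.get(n, 0.0)) for n in ids}

beach_color = {1: 0.2, 2: 0.5, 3: 0.9}
sky_model = {2: 0.3, 3: 0.4, 4: 0.1}
step = iterate(beach_color, sky_model, "intersection", "product")
```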

3.7 Normalization

The normalization methods provide the user with controls to manipulate the scores of a results list. Given a score D_k(n) for each shot with id = n in results set k, the normalization methods produce the score D_i(n) = F_l(D_{i-1}(n)) for each shot n as follows (a short sketch follows the list):

• Invert: re-ranks the results list from bottom to top. Provides "not" semantics. This can be useful for searches such as "retrieve shots that are not cityscapes."

    D_{i+1}(n) = 1 - D_i(n)    (12)

• Studentize: normalizes the scores around the mean and standard deviation. This can be useful before combining results lists.

    D_{i+1}(n) = \frac{D_i(n) - \mu_i}{\sigma_i},    (13)

where \mu_i and \sigma_i are the mean and standard deviation, respectively, over the scores D_i(n) of results list i.

• Range normalize: normalizes the scores to the range 0...1.

    D_{i+1}(n) = \frac{D_i(n) - \min(D_i(n))}{\max(D_i(n)) - \min(D_i(n))}    (14)
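A minimal sketch of the three normalization methods (Eqs. 12-14), assuming a nonconstant score list:

```python
import numpy as np

def invert(scores):                      # Eq. (12): "not" semantics
    return 1.0 - scores

def studentize(scores):                  # Eq. (13): assumes nonzero std
    return (scores - scores.mean()) / scores.std()

def range_normalize(scores):             # Eq. (14): assumes non-identical scores
    return (scores - scores.min()) / (scores.max() - scores.min())

# Placeholder scores for a results list; e.g. range-normalize before combining
# with another list, or invert to express "not cityscape".
d = np.array([0.10, 0.35, 0.60, 0.90])
print(invert(d), studentize(d), range_normalize(d))
```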

3.8 Shot expansion

The shot expansion methods allow the user to expand a results list to include, for each shot, its temporally adjacent neighbors. This can be useful for growing the matched shots to include a larger context surrounding the shots, as in searches such as "retrieve shots that surround those specific shots that depict beach scenes."

3.9 Multi-example search

The multi-example search method allows the user to select multiple positive example shots from a results list and issue a query that is executed as a sequence of independent searches using each of the selected shots. The user can also select a descriptor for matching (such as those in Section 2.2) and an aggregation function for combining and rescoring the results from the multiple searches. Consider for each search k of K independent searches the scored result S_k(n) for each shot n; the final scored result Q_d(n) for each shot with id = n is then obtained using one of the following fusion functions:

• Average: provides "and" semantics. This can be useful in searches such as "retrieve shots that are similar to shot A and shot B."

    Q_d(n) = \frac{1}{K} \sum_k S_k(n)    (15)

• Minimum: provides "or" semantics. This can be useful in searches such as "retrieve shots that are similar to shot A or shot B."

    Q_d(n) = \min_k S_k(n)    (16)

• Maximum:

    Q_d(n) = \max_k S_k(n)    (17)

• Sum: provides "and" semantics.

    Q_d(n) = \sum_k S_k(n)    (18)

• Product: provides "and" semantics and better favors those shots that have low-scoring matches compared to "average".

    Q_d(n) = \prod_k S_k(n)    (19)

3.10 Relevance feedback search

Relevance feedback based search techniques enhance interactive search and browsing: user feedback on a set of shots is used to refine the search and retrieve the desired matches in a minimum number of iterations. The user implicitly provides information about the matches being sought, or query concept, by marking whether shots are relevant or non-relevant in relation to his/her desired search output. The system utilizes this feedback to learn and refine an approximation to the user's query concept and retrieve more relevant video clips in the next iteration.

We use a robust relevance feedback algorithm [AGG02] that utilizes non-relevant video clips to optimally delineate the relevant region from the non-relevant one, thereby ensuring that the relevant region does not contain any non-relevant video clips. A similarity metric estimated using the relevant video clips is then used to rank and retrieve database video clips in the relevant region. The partitioning of the feature space is achieved by using a piecewise linear decision surface that separates the relevant and non-relevant video clips. Each of the hyperplanes constituting the decision surface is normal to the minimum-distance vector from a non-relevant point to the convex hull of the relevant points. For query concepts that can reasonably be captured by an ellipsoid in the feature space, the algorithm gives a significant improvement in precision as compared to MARS [RHM97] and SVM-based relevance feedback algorithms. The relevance feedback technique is computationally robust with respect to the size of the feedback and the distribution of relevant and non-relevant video clips. The relevant region is obtained as the intersection of half-spaces and hence forms a convex subset of the feature space. This ensures that we can employ a quadratic distance metric to rank and retrieve video clips inside the relevant region. Relevance feedback based search was performed using low-level color histogram and edge orientation histogram features extracted from key-frame images of the video clips.

3.11 Interface operations

The graphical user interface is shot/key-frame oriented, in that the unit of query, retrieval, and display is a video shot. Figure 7 shows the key-frame display area (right side) and the system function buttons (left side). Some of the operations that can be conducted on each shot, such as CBR, are initiated using small buttons depicted below each key-frame image.

Figure 7: Screen image of the video retrieval system.

3.12 Example searches

The video retrieval system allows CBR, MBR, SDR, and other methods of searching and browsing. The following pedagogical queries show how the facilities of the video retrieval system can be used for querying for "beach" scenes in the case of manual, interactive, and browsing-oriented search.

Manual search Consider a user looking for video shots showing a beach scene. For manual search operations, the user can issue a query with the following sequence of searches, which corresponds to the query statement (((beach color AND sky) AND water) OR beach speech):

1. Search for images with color similar to example query images of beach scenes,

2. Combine results with model = "sky" using the "average" aggregation function (Eq 5),

3. Combine with model = "water" using the "product" aggregation function (Eq 9),

4. Combine with ASR text = "beach" using the "minimum" aggregation function (Eq 6).

Interactive search On the other hand, for interactive search operations the user can issue the following sequence of searches, in which the user views the results at each stage, corresponding to the query statement RFSEARCH(((EXPAND(beach speech) AND beach color) AND sky) AND water):

1. Search with ASR text = "beach",

2. Expand the results list to include adjacent shots using the "expand" operation of Section 3.8,

3. Select those shots that best depict beach scenes, search for target shots with similar color, and combine with the previous results using the "product" aggregation function (Eq 9),

4. Combine with model = "sky" using the "average" aggregation function (Eq 5),

5. Combine with model = "water" using the "product" aggregation function (Eq 9),

6. Select positive and negative examples of beach scenes and search for target shots using relevance feedback based on color and edges, using the method of Section 3.10.

For interactive searching, a more browsing-oriented approach can be taken as follows:

1. Browse the video overview storyboards or clusters and select the ones that best depict beach scenes,

2. Select positive and negative examples of beach scenes and search for target shots using relevance feedback,

3. Repeat.

4 Tasks and results

We participated in four tasks: shot boundary detection (SBD), concept detection, concept exchange, and search.

4.1 Shot boundary detection (SBD) results

For the shot boundary detection task, the results of five systems were submitted, one of which was last year's SBD system as a baseline. A large difference in performance relative to last year was anticipated due to the degraded video quality. The other four were different versions of the improved system, mainly applying different logic to the fusion of the color histogram and the localized edge histogram information. Three of them performed well and yielded very similar results, while the fourth one did not perform as well. Table 1 summarizes the evaluation of the baseline system, alm1, and the best new system, sys47, on last year's and this year's TREC video data test sets. The results on the TREC-01 data set were computed by us, while the results for the TREC-02 data set are taken from the official NIST TREC 2002 evaluation of those systems. Two additional rows are provided on the TREC-02 benchmark that compare our results to the best and average systems, respectively, among the 54 SBD runs submitted by TREC participants.

                 All         Cuts        Gradual     Frame
Sys.    Data     Rc    Pr    Rc    Pr    Rc    Pr    Rc    Pr
alm1    TR-01   .95   .88   .98   .97   .87   .68   .59   .93
sys47   TR-01   .96   .92   .99   .98   .89   .79   .66   .90
alm1    TR-02   .86   .77   .93   .80   .69   .71   .48   .94
sys47   TR-02   .88   .83   .93   .87   .76   .72   .57   .89
S-5     TR-02   .84   .89   .91   .94   .76   .78   .62   .90
mean    TR-02   .76   .79   .86   .84   .53   .60   .55   .71

Table 1: Shot boundary detection results, comparing the new system with last year's system on both the TREC-01 and TREC-02 video data test sets. If all participating systems are ranked by Pr(All) + Rc(All), system S-5 would be found the best one; it is provided here for comparison. System "mean" reflects the average of all 54 submitted systems.

As anticipated, the SBD performance on the TREC-02 data was lower than on the TREC-01 data set. This was very noticeable in other participating systems as well. Nevertheless, the error rates of the new system sys47 were 20-36% lower than those of the baseline system alm1 in almost all measures on both data sets. In order to find the main causes of errors, a manual verification of all insertion errors reported by the NIST evaluation was done for two of the nineteen videos (36553.mpg and 08024.mpg). The breakdown of the sixty insertion errors is summarized in Table 2.

Cause of insertion error           Freq.      %
Fast motion                          12     20%
Fast zoom                             5      8%
Illumination changes                  4      7%
Long grad reported as two             4      7%
MPEG video error                      1      2%
Others                                7     12%
Short/long grad mismatch             12     20%
FOI reported as two (FO+FI)           7     12%
Missing cuts in ground truth          5      8%
Fade in at video start                2      3%
Fade out at video end                 1      2%
Total                                60    100%

Table 2: Shot boundary detection (SBD) insertion errors, broken down by cause.

These errors can be divided into three groups. The first group is objective errors (56%), which are due to system faults. The second group is subjective errors (32%), where there was a (correct, in a sense) detection of an existing boundary, but it was not reported in the way defined by the evaluation criteria. For example, an FOI reported as one FO and one FI counts one of them as an insertion error, and a long dissolve reported as a short one is evaluated as one insertion error and one deletion error (due to the five-frame-length threshold). The last group is ground truth errors (13%), mainly the five unlisted cuts and some fades at the beginning or end of videos.

4.2 Concept detection results

Overall, concept detection results were submitted for ten concept classes. The evaluation results are plotted in Figure 8, which shows Average Precision measured at a fixed number of documents (1000 for the feature test set). The "Avg." bars correspond to the performance averaged across all participants. The "Best" bars correspond to the system returning the highest Average Precision. The "IBM" bars correspond to IBM's submitted concept detection run (priority=1). The IBM system performed relatively well on the concept detection task, giving the highest Average Precision on 6 of the 10 concepts⁶.

Detection experiments We experimented with different kernels for nonlinear projection into higher-dimensional feature spaces, using radial basis kernels as well as polynomial kernels. For each feature type or combination of feature types, we constructed independent models for each kernel (radial basis, polynomial).

⁶ The top score is indicated for only five concepts. In our original submission to NIST, we mistakenly submitted the speech detection twice, overwriting our instrument detection result. However, the actual Average Precision of our instrument sound detector was 0.686, which was reported through later communication with NIST.

Figure 8: Comparison of concept detection performance using average precision.

For each kernel, we built models for different values of the parameters, constrained within a reasonable range. For example, we varied the σ value of the radial basis kernel from 0.01 to 1 in different increments and evaluated the model at C values of 1, 10, and 100. We also varied the penalty for missing positive examples as against negative examples. For concepts with a small number of positive examples, we also experimented with downsizing the negative example set for computational efficiency and to balance the ratio of positive to negative examples.

Each model was run on the validation set. The resultinglist was then sorted based on the distance of each validationset example from the separating hyperplane. We therebyused the SVM classifier to rank the validation set examplessuch that if the example was far from the decision bound-ary it was assumed to have been detected with greater confi-dence. For each concept and each feature type/combination,we then used the validation set to choose that parametriccombination which resulted in the highest non-interpolatedAverage Precision. LetR be the number of true relevantshots in a set of sizeS. Let L be the ranked list of returnedshots. At any given indexi let Ri be the number of relevantdocuments in the topi documents. LetIj be an indicatorfunction withIj = 1 if the jth document is relevant and0otherwise. AssumingR < S, the non-interpolated AveragePrecision is then defined as

\[
\mathrm{AP} = \frac{1}{R} \sum_{j=1}^{S} \frac{R_j}{j}\, I_j \tag{20}
\]

We also analyzed the precision–recall curve before selecting the final model.
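For concreteness, here is a small sketch of the non-interpolated Average Precision computation of Eq. (20), applied to scores such as the signed hyperplane distances used to rank the validation set; the score and label arrays below are illustrative placeholders:

```python
import numpy as np

def non_interpolated_ap(labels_ranked, num_relevant):
    """Eq. (20): AP = (1/R) * sum_j (R_j / j) * I_j over the ranked list."""
    hits, ap = 0, 0.0
    for j, relevant in enumerate(labels_ranked, start=1):
        if relevant:
            hits += 1          # R_j: relevant documents in the top j
            ap += hits / j     # precision at rank j, counted only where I_j = 1
    return ap / num_relevant

# Placeholder validation scores (e.g., signed distances from the SVM hyperplane)
scores = np.array([2.1, -0.3, 1.4, 0.2, -1.0])
labels = np.array([1, 0, 1, 0, 1])              # 1 = relevant shot
order = np.argsort(-scores)                      # rank by decreasing confidence
print(non_interpolated_ap(labels[order], labels.sum()))
```

On this small placeholder example the computation yields an AP of about 0.867.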

Having selected a set of parameters for a feature type or combination, for each concept we then performed late decision fusion using normalized confidence scores. Again, the final model was selected on the basis of its validation-set performance. For the auditory concepts (speech and music), HMMs [INN02] were used to prototype the feature distributions under both hypotheses. An HMM was used to model each audio concept; each state in a given HMM has the same observation distribution, namely the GMM trained in the previous scheme. This can be viewed as the imposition of a minimum-duration constraint on the temporal extent of the atomic labels. Given a set of HMMs, one for each audio concept, we used them to generate an N-best list at each audio frame and then averaged these scores over the duration of the shot. We normalized these scores by dividing each concept score by the sum of all the concept scores in a particular shot. The scores are indicative of the relative strengths of the different hypotheses in a given shot rather than their absolute values.
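A minimal sketch of the per-shot score averaging and normalization described above, assuming per-frame concept scores are already available; the array values and concept names are placeholders, and the simple frame averaging stands in for the actual N-best scoring:

```python
import numpy as np

# Illustrative per-frame concept scores for one shot: rows = audio frames,
# columns = audio concepts (e.g., speech, music, other). The values are
# placeholders for scores derived from the per-concept HMM N-best lists.
frame_scores = np.array([
    [0.9, 0.2, 0.1],
    [0.8, 0.3, 0.2],
    [0.7, 0.4, 0.1],
])

shot_scores = frame_scores.mean(axis=0)          # average over the shot's frames
normalized = shot_scores / shot_scores.sum()     # relative strength of each hypothesis
print(dict(zip(["speech", "music", "other"], normalized.round(3))))
```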

4.3 Concept exchange results

Apart from running the primary and secondary detectors on the search test set to assist the search task, we participated in the concept exchange task by submitting the results of eight primary detectors on the search test set. We generated shot-based MPEG-7 descriptions for this exercise, thus permitting easy exchange of detection results between participants.

4.4 Search results

The search task required retrieving video shots from the search test collection for a given set of query topics. We investigated both manual and interactive methods of searching. We submitted four runs over all 25 query topics, using the content-based, model-based, speech-based, and interactive search methods described above. Table 3 summarizes the results for the four search runs.

System    Type         Code        MAP
CBR       Manual       M B M 1     0.006
SDR       Manual       M B M-2 2   0.136
CBR+SDR   Manual       M B M-3 3   0.093
CBR+SDR   Interactive  I B M-4 4   0.244

Table 3: Summary of search results for four submitted runs.

4.5 Manual CBR

The manual CBR run consisted of mapping the query topics into one or more content-based or model-based queries and fusing the results in a predetermined fashion. As described in Section 2.2, CBR was based on a variety of descriptors. The manual CBR run was generated by allowing the following operations to answer each query topic:

1. Issue a content-based search by selecting one or more query examples, a feature type, and a fusion method, as necessary;

2. Issue a model-based search by selecting one or more concept models, a fusion method, and model weights, as necessary;

3. Fuse result lists from one or more content-based or model-based searches by selecting a fusion method.

For example, the following sequence of operations was executed for Query 79, “People spending leisurely time at the beach” (a small illustrative sketch of this fusion sequence follows the list):

1. Pick examples 0, 1, 4, 8, 12, 23, 29 from the query content set

2. Perform a CBR search with edge histogram layout using “minimum” fusion (Eq. 16)

3. Combine with the “Landscape” model using the “intersection” combining method (Eq. 3) and the “product” aggregation function (Eq. 9).
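The sketch below illustrates the shape of this fusion sequence. The “minimum”, “intersection”, and “product” operators are simplified stand-ins for Eqs. 16, 3, and 9 (defined earlier in the paper), and the shot IDs and similarity scores are made-up placeholders:

```python
def min_fusion(score_lists):
    """'Minimum' fusion across per-example CBR results (stand-in for Eq. 16):
    a shot keeps the smallest similarity it achieves over the query examples."""
    shots = set().union(*score_lists)
    return {s: min(scores.get(s, 0.0) for scores in score_lists) for s in shots}

def intersect_product(cbr_scores, model_scores):
    """'Intersection' combining (cf. Eq. 3) with 'product' aggregation (cf. Eq. 9):
    keep shots present in both lists and multiply their scores."""
    common = cbr_scores.keys() & model_scores.keys()
    return {s: cbr_scores[s] * model_scores[s] for s in common}

# Placeholder similarity scores (higher = better) for a handful of shot IDs.
example_results = [
    {"shot_12": 0.70, "shot_47": 0.55, "shot_90": 0.30},
    {"shot_12": 0.45, "shot_47": 0.60, "shot_63": 0.65},
]
landscape_model = {"shot_12": 0.8, "shot_47": 0.6, "shot_99": 0.9}

fused_cbr = min_fusion(example_results)
final = intersect_product(fused_cbr, landscape_model)
print(sorted(final, key=final.get, reverse=True))   # ranked result list
```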

The exact mapping of query topics into a fixed sequence of the above operations was performed manually by visually optimizing performance over the FTR and/or FV collections, without knowledge of the search test collection. Once a query topic was mapped to system operations, the operations were applied to the search collection by a designated person who did not participate in the mapping process or have prior knowledge of the search test collection. Figure 9 and Figure 10 show results of manual CBR searches. Figure 9 shows the results for topic 76, which looks for shots depicting “James Chandler.” As shown, some matches are found in the results list; however, many shots of “James Chandler” are not retrieved using CBR. Figure 10 shows the results for topic 80, which looks for shots of musicians. As shown, many of the matches correspond to musicians. This query used the instrument concept detector in addition to other models.

With respect to performance, it was our experience that the TREC-2002 query topics were at a higher semantic level than what CBR can handle. While CBR and semantic modeling are generally able to capture low- to mid-level semantics, they are fairly limited when only a few query examples are available or when mid- to high-level semantics are required. We found that pure CBR worked best for refining candidate lists generated from semantically rich sources, such as speech, or from explicit semantic models that closely match the query need. For example, refining the face model by cross-comparison with example images of “James Chandler” did produce a few relevant hits near the top (see Figure 9). Model-based retrieval, on the other hand, worked well when the query topic closely matched an existing model built with sufficient training data, such as the “musician” topic (see Figure 10).


Figure 9: Results for topic 76: James Chandler.

Figure 10: Results for topic 80: Musicians.

However, in the case of limited example content, such as the query topic looking for “butterflies”, or given a lack of closely related explicit semantic models, CBR and MBR techniques alone are not sufficient. In addition, some of the query topics were so general (e.g., the beach query) or so specific (e.g., the Price Tower query) that it is doubtful whether any reasonable discrimination can be done using low-level features alone.

4.6 Manual SDR

Manual searching using spoken document retrieval (SDR) was based on the indexed speech information. We explored multiple methods of SDR and their fusion, where the SDR queries were developed through interaction with the Feature Training collection.

Query strings were created manually for each query. Queries derived from the audio and textual statement of information need supplied by NIST were expanded by hand in an ad-hoc fashion, based on retrieval on the FTR+FV sets.7 More complicated query strings were used in the Boolean system, since it was hypothesized that Boolean retrieval would be less susceptible to the effects of query overtuning on FTR+FV.

The query terms used in the submitted multiple-SDR fusion system for topic 90 (“Find shots with one or more snow-covered mountain peaks or ridges. Some sky must be visible behind them”) were “ice snow covered mountain peaks valley vista”. Twenty relevant items were retrieved in the top 100, with Average Precision 0.12. For topic 84 (“Find shots of Price Tower, designed by Frank Lloyd Wright and built in Bartlesville, Oklahoma”) the query terms were “Price Tower Frank Lloyd Wright Bartlesville Oklahoma”; the top three items recalled were relevant and the Average Precision was 0.75.

Weights for the SDR-FUSION-SYSTEM were optimised using the limited ground truth that was compiled for FTR+FV. As expected, this scheme led to Mean Average Precision (MAP) improvements on FTR+FV; more importantly, fusion gave a performance improvement (35%) over our best single SDR system on the unseen search test data (as shown in Table 4). Note that simple post-evaluation changes in the video segment-to-shot mapping scheme improved the performance of the individual OKAPI systems (e.g., OKAPI-SYSTEM-1 increased to MAP 0.114), and the fusion system performance might be expected to improve further as the component systems improve. The results overall are a significant improvement over those for IBM’s speech-only retrieval submission to TREC-2001.
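A minimal sketch of this kind of weighted late fusion over component SDR systems; the max-normalization, weight values, and per-shot scores below are illustrative assumptions rather than the settings used in the submitted SDR-FUSION-SYSTEM:

```python
# Illustrative weighted fusion of per-shot relevance scores from several
# SDR systems. The weights would be tuned on FTR+FV-style ground truth.

def fuse_sdr(system_scores, weights):
    """Weighted sum of per-system scores, max-normalized per system."""
    fused = {}
    for name, scores in system_scores.items():
        max_score = max(scores.values()) or 1.0
        for shot, score in scores.items():
            fused[shot] = fused.get(shot, 0.0) + weights[name] * score / max_score
    return fused

system_scores = {
    "OKAPI-SYSTEM-1":   {"shot_3": 4.1, "shot_7": 2.2},
    "OKAPI-SYSTEM-2":   {"shot_3": 3.0, "shot_9": 2.8},
    "BOOLEAN-SYSTEM-1": {"shot_7": 1.0, "shot_9": 1.0},
}
weights = {"OKAPI-SYSTEM-1": 0.4, "OKAPI-SYSTEM-2": 0.4, "BOOLEAN-SYSTEM-1": 0.2}

ranked = sorted(fuse_sdr(system_scores, weights).items(),
                key=lambda kv: kv[1], reverse=True)
print(ranked)
```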

System               MAP
OKAPI-SYSTEM-1       0.073
OKAPI-SYSTEM-2       0.093
BOOLEAN-SYSTEM-1     0.101
SDR-FUSION-SYSTEM    0.137

Table 4: Search test performance of component and fusion SDR systems.

The OKAPI-SYSTEM-1 was used to investigate the effect of increasing speech transcript accuracy, by indexing transcripts of varying accuracy for the FTR+FV set, for which we have compiled limited ground truth.

7 Later experiments showed that, at least in OKAPI-SYSTEM-1, the gains due to the manual query expansion were negligible.


Table 5 shows the results: MAP increases consistently with transcript % Correct, which justifies our greater investment of effort in ASR this year. However, this result also suggests that we may not have reached an error rate at which retrieval is comparable to retrieval from ground-truth transcriptions. We hope to compile a complete set of ground-truth speech transcripts for the TREC-2002 videos that will allow us to answer this question.

Transcript % Correct    MAP
52.8%                   0.09
59.7%                   0.13
67.9%                   0.17
72.8%                   0.21

Table 5: Relationship between transcript accuracy and retrieval performance.

4.7 Manual CBR and SDR

The combination of CBR and SDR was explored for manual searching, where queries were developed through interaction with the Feature Training collection. An example of (successful) SDR and CBR integration is query topic 86 (“find overhead views of cities - downtown and suburbs; the viewpoint should be higher than the highest building visible”). In the following, we assume that the SDR results and CBR results have been found independently prior to the integrated query (a small sketch of the expansion and combination steps follows the list):

1. Retrieve results for the SDR query “view panorama overhead downtown suburbs city town urban”

2. Expand the results list to include adjacent shots (repeat two times) using the expand operation (see Section 3.8)

3. Combine with the CBR results using the “union” combination method (Eq. 4) and the “product” aggregation function (Eq. 9).
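The following sketch illustrates these steps. Shot adjacency is modeled with integer shot IDs, and the “union”/“product” operators (with a small score floor for shots missing from one list) are simplified stand-ins for Eqs. 4 and 9; all scores and IDs are placeholders:

```python
# Illustrative sketch of the topic-86 expansion and combination steps.

def expand_adjacent(scores, times=2):
    """Add neighbouring shots of every retrieved shot, inheriting its score."""
    for _ in range(times):
        expanded = dict(scores)
        for shot, score in scores.items():
            for neighbour in (shot - 1, shot + 1):
                expanded.setdefault(neighbour, score)
        scores = expanded
    return scores

def union_product(sdr_scores, cbr_scores, floor=0.1):
    """'Union' combination (cf. Eq. 4) with 'product' aggregation (cf. Eq. 9):
    keep shots from either list; missing scores fall back to a small floor."""
    shots = sdr_scores.keys() | cbr_scores.keys()
    return {s: sdr_scores.get(s, floor) * cbr_scores.get(s, floor) for s in shots}

sdr = expand_adjacent({102: 0.9, 240: 0.6})      # placeholder SDR hits
cbr = {101: 0.7, 103: 0.5, 400: 0.4}             # placeholder CBR hits
final = union_product(sdr, cbr)
print(sorted(final, key=final.get, reverse=True))
```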

The final Average Precision improved from 0.0 (CBR) and 0.039 (SDR) to 0.057 (CBR+SDR). A similar approach was used for the other queries, with minor differences such as the number of shot expansions and the choice of the combination method and aggregation function, for example, using “intersection” rather than “union” and “sum” rather than “product”. However, this approach was not always successful; for example, when the same scheme was used for topic 84 (“Price Tower”), the SDR+CBR performance was degraded below that obtained using SDR alone. Overall, this approach to SDR and CBR integration improved four queries beyond the performance attained with SDR alone.

4.8 Interactive search

We explored interactive search using CBR and SDR, in which the user interacted with the search test collection at query time. The user chose various combinations of these methods and selected among different methods for fusion, multiple-example search, relevance feedback, and browsing. The wall-clock time was measured to gauge the user effort for each interactive query. The following describes the interactive search operations for query topic 89, “Butterflies”, which took just over seven minutes of user time:

1. Search for shots of butterflies using SDR with terms such as “monarch”, “butterfly”, “wings”, “flower”.

2. View a grouping of the results by video (clusters shots according to source video) to get an idea of which videos contribute which shots.

3. Remove two irrelevant shots at the top of the results list.

4. Expand all shots to adjacent shots.

5. The results show 5 hits at the top; stop.

5 Summary

We presented the IBM Research video indexing system. The system explores fully-automatic content analysis methods for shot detection, multi-modal feature extraction, statistical modeling for semantic concept detection, and speech recognition and indexing. The system supports manual methods of querying based on automatically extracted features, models, and speech information. In this paper we described the system and the experimental runs that were part of the TREC-2002 video retrieval benchmarking effort. The results show good performance on tasks such as shot boundary detection, concept detection, and search.

Acknowledgements: We thank Prof. Chiou-Ting Hsu, National Tsing-Hua University, Hsinchu, Taiwan, and his students for their assistance in annotating the feature training data sets.

References

[AES01] Arnon Amir, Alon Efrat, and Savitha Srinivasan. Advances in phonetic word spotting. In Proc. of the 2001 ACM CIKM Int. Conference on Information and Knowledge Management, pages 580–582. ACM, November 2001.

[AGG02] T. V. Ashwin, R. Gupta, and S. Ghosal. Leveraging non-relevant images to enhance image retrieval performance. In Proc. ACM Intern. Conf. Multimedia (ACMMM), Juan Les Pins, France, December 2002.


[CG98a] S. S. Chen and P. S. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. In Proc. DARPA Broadcast News Transcription and Understanding Workshop, Virginia, USA, 1998.

[CG98b] Scott S. Chen and P. S. Gopalakrishnan. Speaker, environment and channel change detection and clustering via the Bayesian information criterion. IEEE Proc. Int. Conf. Acoust., Speech, Signal Processing (ICASSP), 1998.

[Cue] IBM CueVideo Toolkit Version 2.1, http://www.almaden.ibm.com/cs/cuevideo/.

[DLR77] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B(39):1–38, 1977.

[HKM+99] J. Huang, S. Kumar, M. Mitra, W. Zhu, and R. Zabih. Spatial color indexing and applications. International Journal of Computer Vision, 35(3):245–268, December 1999.

[INN02] G. Iyengar, H. J. Nock, and C. Neti. Semantic indexing of multimedia using audio, text and visual cues. In Proc. ICME, 2002.

[JKS95] R. Jain, R. Kasturi, and B. Schunck. Machine Vision. MIT Press and McGraw-Hill, New York, 1995.

[Leg95] C. J. Leggetter. Improved Acoustic Modelling for HMMs using Linear Transformations. PhD thesis, University of Cambridge, 1995.

[NBS+02] M. Naphade, S. Basu, J. Smith, C. Lin, and B. Tseng. Modeling semantic concepts to support query by keywords in video. In IEEE International Conference on Image Processing, Rochester, NY, Sep 2002.

[NIN02] H. J. Nock, G. Iyengar, and C. Neti. Assessing face and speech consistency for monologue detection in video. In Proc. ACM Multimedia, 2002.

[NKFH98] M. Naphade, T. Kristjansson, B. Frey, and T. S. Huang. Probabilistic multimedia objects (multijects): A novel approach to indexing and retrieval in multimedia systems. In Proceedings of IEEE International Conference on Image Processing, volume 3, pages 536–540, Chicago, IL, Oct. 1998.

[NKH02] M. R. Naphade, I. Kozintsev, and T. S. Huang. A factor graph framework for semantic video indexing. IEEE Transactions on Circuits and Systems for Video Technology, 12(1):40–52, Jan 2002.

[Poo99] H. V. Poor. An Introduction to Signal Detection and Estimation. Springer-Verlag, New York, 2nd edition, 1999.

[Rab89] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, Feb. 1989.

[RHM97] Y. Rui, T. S. Huang, and S. Mehrotra. Content-based image retrieval with relevance feedback in MARS. In IEEE Proc. Int. Conf. Image Processing (ICIP), 1997.

[RJ93] L. R. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, Englewood Cliffs, NJ, USA, first edition, 1993.

[RWSJ+95] S. E. Robertson, A. Walker, K. Sparck-Jones, M. M. Hancock-Beaulieu, and M. Gatford. OKAPI at TREC-3. In Proc. Third Text Retrieval Conference, 1995.

[SDB98] J.-C. Shim, C. Dorai, and R. Bolle. Automatic text extraction from video for content-based annotation and retrieval. In IEEE Conference on Pattern Recognition, volume 1, pages 618–620, Brisbane, Australia, August 1998.

[Smi01] J. R. Smith. Content-based access of image and video libraries. In A. Kent, editor, Encyclopedia of Library and Information Science. Marcel Dekker, Inc., 2001.

[SN02] J. R. Smith and A. Natsev. Feature and spatial normalization for content-based retrieval. In IEEE Conference on Multimedia and Expo, Lausanne, Switzerland, August 2002.

[SPA+99] S. Srinivasan, D. Ponceleon, A. Amir, and D. Petkovic. What is in that video anyway? In search of better browsing. In Proc. of IEEE Intl. Conf. on Multimedia Computing and Systems, pages 388–392, Florence, Italy, 1999.

[SS01] P. Salembier and J. R. Smith. MPEG-7 multimedia description schemes. IEEE Trans. Circuits Syst. for Video Technol., August 2001.

[SSA+02] J. R. Smith, S. Srinivasan, A. Amir, S. Basu, G. Iyengar, C.-Y. Lin, M. Naphade, D. Ponceleon, and B. Tseng. Integrating features, models, and semantics for TREC video retrieval. In E. M. Voorhees and D. K. Harman, editors, Proc. Text Retrieval Conference (TREC), pages 240–249, Gaithersburg, MD, 2002. NIST.

[TMY78] H. Tamura, S. Mori, and T. Yamawaki. Textural features corresponding to visual perception. IEEE Trans. Syst., Man, Cybern., SMC-8(6):460–473, 1978.

[Vap95] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

[XC00] J. Xu and B. Croft. Improving the effectiveness of information retrieval with local context analysis. ACM Transactions on Information Systems, 2000.
