
Multimed Tools Appl. DOI 10.1007/s11042-013-1404-1

Multi-modal fusion for associated news story retrieval

Ehsan Younessian · Deepu Rajan

© Springer Science+Business Media New York 2013

Abstract In this paper, we investigate multi-modal approaches to retrieve associated news stories sharing the same main topic. In the visual domain, we employ a near-duplicate keyframe/scene detection method using local signatures to identify stories with mutual visual cues. Further, to improve the effectiveness of the visual representation, we develop a semantic signature that captures pre-defined semantic visual concepts in a news story. We propose a visual concept weighting scheme to combine local and semantic signature similarities to obtain the enhanced visual content similarity. In the textual domain, we utilize Automatic Speech Recognition (ASR) and refined Optical Character Recognition (OCR) transcripts and determine the enhanced textual similarity using the proposed semantic similarity measure. To fuse the textual and visual modalities, we investigate different early and late fusion approaches. In the proposed early fusion approach, we employ two methods to retrieve the visual semantics using textual information. Next, using a late fusion approach, we integrate the uni-modal similarity scores and the determined early fusion similarity score to boost the final retrieval performance. Experimental results show the usefulness of the enhanced visual content similarity and the early fusion approach, and the superiority of our late fusion approach.

Keywords Semantic signature · Scene signature · Visual concept signature · News story retrieval

1 Introduction

News programs are among the most popular and most viewed programs on television. Since the launch of 24-hour news channels, which broadcast many news stories each day, there has been an explosion of television news content, from all over the world and in numerous languages. These stories could be in the form of an interview, breaking news, pre-recorded segments, or a live broadcast of an event. Nowadays, media houses are moving from traditional newspapers and posters towards electronic means of news dissemination. Most news agencies have their own websites to broadcast news products in different formats like text, audio or video articles. They have also established active accounts on multimedia sharing websites like YouTube and use their popularity to attract more audiences surfing the World Wide Web.

E. Younessian (B) · D. Rajan
Center for Multimedia and Network Technology, School of Computer Engineering, Nanyang Technological University, Nanyang 639798, Singapore
e-mail: [email protected]

Fig. 1 Associated news story categories

Among these huge volumes of news stories from various channels, there exists a great deal of overlap in the semantic sense, i.e., they address the same main topic. We refer to such news stories as associated news stories. Examples of associated news stories are shown in Fig. 1. For instance, in Fig. 1a, there are three news stories from the ABC, CCTV and CNN channels discussing the same topic of "Bush press conference". The objective of this research is to detect associated news stories from different channels in daily broadcast news videos using visual and textual cues. This task can serve as a prior stage for other tasks like event-based information organization, topic detection and tracking (TDT), news story summarization, news story clustering, novelty and redundancy detection in news stories, event threading and auto-documentary. In all of the mentioned tasks, an effective video representation, containing low/mid/high-level multi-modal features, and a proper similarity measure are the foundations for designing a multimedia system. In this paper, we investigate these two issues in the context of associated news story retrieval.

In order to clarify the meaning of association between news stories and to clearly state the focal point of this study, we briefly explain, in the following, the roles of different modalities in retrieving different types of news stories. We then give an overview of the proposed framework.


Broadcast news stories include rich auditory, textual and visual cues which can be utilized for news story retrieval. For example, in a reader, which is a type of news story read without accompanying video or sound [3], the spoken words carry the major part of the semantics. Accordingly, applying ASR and retrieving the spoken words is essential for the story retrieval task. Further, most news stories also come with some textual information, which mainly appears as the caption at the bottom of the screen indicating the headline of the news story and some briefs, called a kicker, to grab the reader's attention, or as the cutline [3] to name and describe pictures or the person(s) speaking. These valuable textual cues, which are highly related to the news story, can be extracted through Optical Character Recognition (OCR) and used for news story retrieval.

In addition to the ASR and OCR transcripts, visual elements play a critical role, especially since humans receive much of their information about the world through their sense of vision. However, finding the semantic similarity of visual objects in associated news stories can often be a challenging task. To elaborate on this issue, we visually categorize the associated news stories into two major groups. (1) Associated news stories covering the same news object, usually in the same venue and within the same context. They may come from the same or different video footage, as shown in Fig. 1a. Finding mutual visual cues across associated news stories from this category can become problematic due to the high degree of variation in camera angles, camera lens settings, etc. (2) Associated news stories addressing an event which is not visually related to specific objects and does not occur at a specific venue, like news stories addressing the "Katrina storm" or "Fire in Oklahoma" shown in Fig. 1b. Visually, the storm or fire will look similar irrespective of where it happens. In these cases, modalities like textual information or high-level visual annotations can be more informative and discriminative than routine visual representations such as global and local signatures.

1.1 The overview of the proposed framework

In this paper, we propose a multi-modal approach to retrieve associated news stories using visual and textual modalities. Figure 2 shows an overview of the proposed approach, which includes different uni- and multi-modal modules that are explained later. The proposed approach was originally introduced in [22], where we explore reference news stories and retrieve associated news stories for a given query story. We extract textual information from keyframes using an Optical Character Recognition (OCR) engine (http://jocr.sourceforge.net). A novel post-OCR processing step is deployed to refine the OCR outputs using a local dictionary built from the ASR transcript of each story. Next, the refined OCR output and the ASR transcript are used together to obtain an enhanced textual representation in the form of a weighted TFIDF (wtfidf) feature vector for each news story, based on which we determine the enhanced textual content similarity between news stories. TFIDF, term frequency-inverse document frequency, is a numerical statistic which indicates the importance of a word to a document in a corpus. In the visual domain, a single keyframe or multiple keyframes are extracted per shot, depending on the length of the shot. Next, we introduce the semantic signature, which indicates the probabilities of the presence of the pre-defined visual concepts in the keyframes of a news story. We also incorporate the keyframe set similarity, which is based on the number of near-duplicate keyframes between news stories. The semantic signature and keyframe set similarities are combined using an effective weighting scheme to determine the enhanced visual content similarity.

Fig. 2 The proposed framework for associated news story retrieval

Unlike [22], in this paper we use the scene signature, determined at the scene level [23], instead of keyframe-level local signatures. Using the scene signature, we obtain better performance, particularly for the first category of associated news stories, and we can retrieve news stories almost 7 times faster than when we use keyframe-level local signatures.

Here, in addition to the above-mentioned modules, we also incorporate new components, as listed below:

• An effective semantic similarity measure is developed to capture word relatedness between news stories and then to determine the enhanced textual similarity. In [22], a simple vector space model is used, where only words with an identical root could get matched.

• A novel early fusion method is proposed to retrieve visual semantics using textual information. It is especially useful when a query story has noisy and unreliable visual information but contains a significant amount of textual information. In this case, the visual semantics of stories in the dataset can be retrieved using the textual information of the query story, as shown in Fig. 2. To make textual and visual features comparable, we propose a Semantic Similarity Mapping (SSM) to directly project textual features onto the visual concept space. In addition, we deploy Canonical Correlation Analysis to project both textual and visual features to another feature space where they are comparable and the corresponding features are close to each other.

• Through a late fusion approach using a Support Vector Machine (SVM), we combine the proposed early fusion similarity score with the enhanced visual and textual similarity scores and determine the final similarity score between two stories, as shown in Fig. 2. In our case, an SVM [4] performs classification by constructing a hyperplane in the three-dimensional score space that optimally determines the probability of a pair of news stories being associated.


The rest of this paper is organized as follows. Section 2 describes related work. In Section 3, we propose the enhanced visual similarity measure. Section 4 explains how to determine the enhanced textual similarity using semantic relatedness between words. In Section 5, the proposed early/late fusion methods are discussed. In Section 6, we assess the enhanced textual and visual content similarity measures as well as the proposed early and late fusion approaches for associated news story retrieval.

2 Related work

We briefly review uni-modal approaches and multi-modal fusion approaches addressing the video retrieval task. The former refers to methods which use visual or textual cues to measure the similarity/relatedness between videos. We tackle this problem using our proposed keyframe/scene set similarity and the enriched textual content similarity, introduced in Section 1.1. On the other hand, the multi-modal fusion approaches study the integration of multiple modalities, including different visual and textual representations or similarity scores, to retrieve videos. As shown in Fig. 2, we address this problem by proposing (1) the enriched visual content similarity to fuse the semantic signature similarity and the keyframe/scene set similarity, (2) the early fusion process to retrieve visual cues using textual information, and (3) the SVM-based late fusion to fuse all calculated similarity scores.

2.1 Uni-modal approaches

In the video retrieval literature, a wide range of approaches has been proposed for the duplicate/near-duplicate video retrieval task using visual cues, with the main focus on the accuracy and/or efficiency of the proposed approach. For instance, Wu et al. [18, 20] use a bag of local features to represent each keyframe in a video. Using a keyframe to represent a shot is a well-known method in the area of video retrieval [18]. Although these methods are among the most robust against a wide range of transformations such as lighting changes, object occlusion, cropping and viewpoint changes, their stability is unsatisfactory when dealing with unconstrained video content, since the detection rate of near-duplicates in keyframe-based approaches depends, to a certain extent, on the result of keyframe selection. Addressing the multimedia event detection task, Zhong Lan et al. [25] use a set of local visual features such as SIFT and aggregate the bag-of-words representations of all keyframes to represent a video. However, this approach seems problematic for news stories, since it is possible that only a fraction of a news story is visually relevant to the event of interest, and considering all keyframes together might degrade the performance. In our approach, we propose a weighted fusion method to combine our local feature similarity and the semantic similarity. In Section 6.2, we compare our proposed approach to the SIFT-BOW approach, where the whole news story is represented as a bag of words using SIFT features extracted from all keyframes. A more comprehensive review of video retrieval approaches using visual modalities is available in [23].

In the textual domain, scholars have explored a wide range of methods using different available textual information, such as spoken words, words displayed on screen, and metadata, to address video retrieval/search tasks. For instance, the authors in [9] tackle the broadcast video retrieval problem using OCR and speech transcripts. They compare n-gram analysis and dictionary look-up techniques to correct OCR errors for text-based video retrieval. The former generates a new set of n-gram strings to match the unedited OCR outputs. These n-gram strings include strings with an edit distance of 1 character and all sub-strings with at least three characters. The second method uses a global dictionary to correct spelling errors. They use simple cosine similarity to calculate the textual similarity between the TFIDF features of videos. In contrast to [9], we use the ASR transcript to refine the erroneous OCR output and propose a semantic similarity measure to calculate the enhanced textual similarity between news stories. In Section 6.1, we compare our proposed approach to [9]. In the context of the MediaEval benchmark, the authors in [10] study the integration of different audio, visual and text-based descriptors for automatic video genre classification. Experimental evaluation is conducted on 26 video genres specific to web media platforms. The highest distinguishing power is obtained by the text information, and particularly by metadata, which outperforms the ASR transcripts. However, the discriminative power of metadata diminishes when mixed with audio-visual information. Similarly, our proposed textual module performs as the best uni-modal module for associated news story retrieval. However, unlike [10], the retrieval performance is significantly improved by incorporating visual information.

2.2 Multi-modal fusion approaches

Multi-modal fusion research has attracted much attention due to the benefit it provides for various multimedia analysis tasks. A comprehensive survey on multi-modal fusion is available in [1]. We can categorize multi-modal strategies into early and late fusion methods. The integration of multiple media features is referred to as early fusion, while the integration of the intermediate decisions is referred to as late fusion [1]. Neither of these fusion methods is perfect [15]. We briefly review related work for each category in the following.

Early fusion combines different features into a long feature vector through which it can implicitly model the correlations between them. For instance, in a multi-modal early fusion approach, the authors in [14] study audio-visual fusion using Canonical Correlation Analysis (CCA). They use the co-occurrence of auditory and visual features in the training data to determine the projection functions through which the visual and auditory features can be mapped to another feature space wherein they are comparable and the corresponding features are close to each other. As another application of early fusion, the authors in [2] exploit visual concepts using a text query for the image retrieval task. They optionally use WordNet [16] to match detected visual concept names and the given text query. By mapping the text query of interest onto the visual concept list, they can retrieve images with the related visual concepts. In this research, as a part of our proposed early fusion process, we also adopt the CCA method to retrieve visual cues using textual information. In addition, we directly retrieve the proposed visual semantic signatures using textual information and incorporate this score to improve the early fusion similarity score.

However, early fusion does not perform well if the features of the different modalities are too heterogeneous, with skewed length distributions and significantly different scales. In contrast, in late fusion this is not a concern, since the features from each modality are not compared with each other before the final fusion step. In addition, we can employ various detection techniques and classifiers according to the specific feature types in the late fusion framework. Moreover, late fusion methods are usually less computationally expensive than early fusion approaches. Therefore, late fusion techniques have become more popular and more extensively studied than early fusion techniques in the literature [15].

The late fusion strategies can be categorized into two major groups: (1) rule-based fusion (e.g., Average, MIN, MAX, Ranked List, query-(in)dependent weighting, etc.) and (2) classification-based fusion (e.g., SVM, Bayesian inference, etc.), as comprehensively discussed in [1].

As an example of the rule-based approach in the context of video retrieval, the authors in [7] adopt a linear weighted late fusion technique to integrate the normalized scores and ranks of the retrieval results. The normalization is conducted using the max-min method. The video shots are retrieved using different modalities such as text and different visual features (color, edge and texture). The authors obtain the best performance in TRECVID-type searches by combining the scores determined by the textual and visual units with different weights. Combining scores and ranks with equal weights, they obtain the best performance for a single query image. In [19], the authors explore novelty and redundancy detection with visual duplicates and speech transcripts for cross-lingual news stories. Similar to documents that are treated as a bag of words, they treat a news story as composed of a bag of keyframes in the visual track. The keyframes are further classified as Near-Duplicate Keyframes (NDK) when they appear multiple times in the corpus and as non-near-duplicate keyframes (non-NDK) when they appear only once. Finally, a weighted linear fusion is used to combine the similarity scores from the speech transcripts and visual duplicates.
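To make the rule-based strategy above concrete, the following Python sketch applies max-min normalization to per-modality score lists and combines them with fixed weights. It is a generic illustration of this family of methods, not the exact system of [7]; the weights and example scores are made up for the example.

```python
import numpy as np

def min_max_normalize(scores):
    """Rescale a vector of retrieval scores to [0, 1] (max-min normalization)."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def weighted_linear_fusion(score_lists, weights):
    """Combine normalized scores from several modalities with fixed weights."""
    normalized = [min_max_normalize(s) for s in score_lists]
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * s for w, s in zip(weights, normalized))

# Illustrative usage: text scores dominate, colour and texture contribute less.
text = [0.9, 0.2, 0.4]
colour = [0.1, 0.8, 0.3]
texture = [0.5, 0.4, 0.9]
fused = weighted_linear_fusion([text, colour, texture], weights=[0.6, 0.2, 0.2])
```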

Query-dependent or query-class weighting can be considered an evolution of query-independent weighting, since the former tackles many of the latter's failures. The focal point of this approach is that, given training references and an appropriate set of training queries, query clusters (i.e., query classes) can be found such that queries within each cluster share some similar properties which differentiate them from other queries in the collection. The properties may be artifacts such as semantic similarity, performance similarity, distance, etc. By partitioning a set of training queries into discrete query classes, it is then possible to optimize a different set of weights for the local decisions of each query class. The query-class concept was introduced by Yan et al. [21] for content-based video retrieval, where four classes of Named person, Named object, General object, and Scene were defined, based on which they assign different weights to different low-level classifiers. Similarly, in this research we calculate the enriched visual content similarity by fusing the semantic similarity and the local feature similarity using weights calculated from the likelihood of different visual concepts in a video, as explained in Section 3.2.

Among various classification-based late fusion approaches, we focus on the SVM-based late fusion approach, since it has been extensively used in many existing multimedia analysis and retrieval methods [1]. In the context of multi-modal fusion, an SVM [4] is employed to build a separating hyperplane using the scores given by the individual classifiers. Using the kernel concept, the basic SVM method can be extended to create a non-linear classifier, where every dot product in the basic SVM formulation is replaced with a non-linear kernel function. In [25], the authors introduce a fusion scheme, called double fusion, which integrates early fusion and late fusion together to address the Multimedia Event Detection (MED) task. For early fusion, they take the average of the distance matrices of different modalities. They use multiple visual and textual representations and treat the early fusion result as an individual classifier. Then a simple average late fusion scheme is adopted using all developed uni-modal classifiers together with the early fusion classifier. Promising results are reported on the TRECVID MED 2010 and 2011 datasets. Similarly, we also treat the early fusion result as an individual classifier, which is fused with the outputs of the textual and visual modules, as shown in Fig. 2. However, unlike [25], we early fuse the textual and visual features in different ways, as explained in Section 6.3, and use an SVM-based late fusion to calculate the final similarity score between news stories. Note that in [25], there are tens of positive samples for each event, using which a model is learned for each event, while here the task is different. We aim to retrieve associated news stories for a given query news story, and we use a set of associated news stories to learn an SVM model to fuse the similarity scores calculated between the query and the reference news stories.

3 Enhanced visual content similarity

As shown in Fig. 2, we determine the enhanced visual content similarity between stories using local and semantic signatures. Local signature refers to the keyframe- or scene-level local feature representation. Here, we use a Bag-of-SIFT to represent each keyframe within a news story and determine the between-story similarity as the keyframe set similarity

KF_Sim(S_i, S_j) = |S_i ∩ S_j| · (1/|S_i| + 1/|S_j|),    (1)

where S_i refers to the set of keyframes contained in the ith story and |S_i ∩ S_j| indicates the number of NDKs between S_i and S_j, based on the number of matching keypoints. The matching keypoints are keypoints with the most similar SIFT descriptions, determined using a symmetric nearest neighbour search [24]. We filter out less reliable matches where the similarity between a keypoint and its nearest neighbour is not significantly greater than its similarity to its second nearest neighbour. Although the local signature is robust to low- to mid-level object displacement, edits and camera setting variations, it does not perform well when we deal with associated news stories with significant object/camera movements, or with stories with conceptual connections, where the visual similarity between stories can be implicitly retrieved based on common visual concepts that they share, such as fire, fire-fighter, jungle, etc., as for stories S1 and S2 in Fig. 1b. This observation motivates us to incorporate a visual semantic representation for a news story, called the semantic signature.
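A minimal Python sketch of Eq. (1) and the keypoint matching step follows, assuming precomputed SIFT descriptors per keyframe. The ratio value, the NDK threshold, and the exact way NDK pairs are counted are illustrative assumptions rather than the paper's tuned settings.

```python
import numpy as np

def match_keypoints(desc_a, desc_b, ratio=0.8):
    """Count reliable matches between two sets of SIFT descriptors: the nearest
    neighbour must be clearly closer than the second nearest (ratio filter),
    and the match must hold in both directions (symmetric check)."""
    def one_way(x, y):
        matches = {}
        if len(y) < 2:
            return matches
        for i, d in enumerate(x):
            dists = np.linalg.norm(y - d, axis=1)
            first, second = np.argsort(dists)[:2]
            if dists[first] < ratio * dists[second]:
                matches[i] = first
        return matches
    ab, ba = one_way(desc_a, desc_b), one_way(desc_b, desc_a)
    return sum(1 for i, j in ab.items() if ba.get(j) == i)

def keyframe_set_similarity(story_i, story_j, ndk_threshold=20):
    """Eq. (1): KF_Sim = |S_i ∩ S_j| . (1/|S_i| + 1/|S_j|), where the
    intersection counts keyframe pairs with enough matching keypoints (NDKs)."""
    ndk_pairs = sum(
        1
        for kf_a in story_i      # each keyframe: (num_keypoints x 128) SIFT array
        for kf_b in story_j
        if match_keypoints(kf_a, kf_b) >= ndk_threshold
    )
    return ndk_pairs * (1.0 / len(story_i) + 1.0 / len(story_j))
```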

3.1 Semantic signature

The semantic signature is essentially a 374-dimensional vector representing the probabilities of the presence of 374 pre-defined visual concepts [11] in a news story, and it is determined as

SemSig(S_i) = Σ_{j=1}^{n} SIN(KF_i^j),    (2)


Fig. 3 NDK detectability for various visual concepts. The figure shows three sample NDKs with (a) static and (b) dynamic visual concepts. The more static visual concepts a keyframe contains, the more likely the NDK can be detected

where n denotes the number of keyframes within story S_i and SIN(KF_i^j) refers to the Semantic INdexing (SIN) of the jth keyframe in S_i, which is a 374-dimensional vector indicating the probabilities of the presence of the pre-defined visual concepts in the jth keyframe. In [11], the authors compute these probabilities by training an SVM classifier. The similarity between semantic signatures can be simply determined using the cosine similarity measure. In practice, the semantic signature is not discriminative enough, due to the limited number of visual concepts (i.e., 374) and the relatively low precision of the visual concept detectors. Specifically, among the 374 visual concepts there are only a few that describe different actions, and these also have very poor precision. This leads to unsatisfactory discrimination between stories with a similar subject/object/context but different actions. For instance, the semantic signature fails to robustly capture the difference between stories showing a group of people fighting or cheering in a street, or the difference between stories showing a group of people sitting in a courtroom or in a classroom. However, we aim to combine the semantic signature similarity with the local signature similarity to explore their possible complementary roles.
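The aggregation in Eq. (2) and the cosine comparison it enables can be sketched as follows, assuming the per-keyframe SIN probability vectors from [11] are already available; variable names are illustrative.

```python
import numpy as np

def semantic_signature(keyframe_sin_vectors):
    """Eq. (2): sum the 374-dimensional SIN concept-probability vectors of all
    keyframes in a story to obtain the story-level semantic signature."""
    return np.sum(np.asarray(keyframe_sin_vectors, dtype=float), axis=0)

def cosine_similarity(u, v):
    """Cosine similarity used to compare two semantic signatures."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Hypothetical usage with one 374-d SIN vector per keyframe:
# sig_i = semantic_signature(sin_vectors_story_i)
# sig_j = semantic_signature(sin_vectors_story_j)
# sem_sim = cosine_similarity(sig_i, sig_j)
```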

3.2 Visual concept weighting

To combine the local and semantic signature similarities, we first investigate when the local signature works well and when it fails, so that the semantic signature similarity can be weighted accordingly. The local signature does not work well when the keypoint matching scheme does not perform well. Generally speaking, keypoint matching methods suffer from significant object/camera movements occurring in scenes with concepts that are dynamic. For instance, Fig. 3b shows three NDK pairs depicting a basketball game, dancing at a party, and an explosion, respectively. In all of these NDKs, the keypoint matching scheme does not work properly due to the existence of dynamic visual concepts. On the other hand, the keypoint matching scheme performs well in scenes with mostly static concepts. For instance, Fig. 3a shows three NDKs depicting a speaker, a building and mountains, in all of which the keypoint matching scheme works well.

Fig. 4 t-scores of the 374 visual concepts on the TRECVID 2006 dataset [17]

This observation motivates us to study the relation between the visual concepts present in a scene and the ability of the local signature to capture the visual similarity across scenes. We use the TRECVID 2006 dataset [17], which includes around 160 h of news video from seven different channels. There are around 21,000 shots, each of which contains several keyframes that are NDKs of each other. We determine the matching keypoints between the NDKs within each shot using their SIFT descriptions. To do so, we use a symmetric nearest neighbour search to match keypoints with the most similar descriptions. We filter out less reliable matches where the similarity between a keypoint and its nearest neighbour is not significantly greater than the similarity to its second nearest neighbour. Next, we categorize the shots into a detectable and a non-detectable group, according to whether the number of matching keypoints between the NDKs in a shot exceeds a specific threshold. For each visual concept in each group, we calculate the mean (μ^i) and the variance (σ^i) of its probability over all keyframes. Then, for the ith visual concept, we employ the t-test [13] and determine t-score(i) as

t-score(i) = |μ_1^i − μ_2^i| / √(σ_1^i/n_1 + σ_2^i/n_2),    (3)

where μ_1^i and σ_1^i, and μ_2^i and σ_2^i, denote the mean and variance of the ith visual concept probability in the detectable and non-detectable groups, respectively, and n_1 and n_2 refer to the numbers of shots in the detectable and non-detectable groups. A significant t-score(i) means that the presence of the ith visual concept has a strong influence on whether NDK detection succeeds or fails within the shots. Figure 4 shows the t-scores of all the visual concepts. Concepts like sitting, U.S. flag, furniture, windows and address or speech have a high t-score, which implies that the NDK detection algorithm is generally capable of finding scenes having these static concepts. On the other hand, concepts such as shooting, dancing, ruins and natural disaster have a low t-score, which means that it is difficult to detect NDKs containing these concepts. Accordingly, we compute the Detectability Score (DS) for each concept as

DS(i) = 1 + 1/(1 + t-score(i)),    i = 1, 2, ..., 374.    (4)

Hence, we assign a higher DS to a concept with a lower t-score. We modify the semantic signature similarity between stories by weighting the visual concepts as

Sim_Sem(S_i, S_j) = (DS ◦ SemSig(S_i))^T (DS ◦ SemSig(S_j)),    (5)

where ◦ denotes the element-wise product between two vectors. Note that a high t-score means that we deal with relatively static visual concepts in the story and the local signature similarity can probably capture the visual similarity. Accordingly, we assign low weights to these semantic signature concepts, which decreases Sim_Sem. On the other hand, a low t-score means that the local signature similarity is not reliable for measuring the visual similarity of the given news story. Thus, we assign higher weights to these semantic signature concepts in (5), which increases Sim_Sem.
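A minimal sketch of Eqs. (3)-(5) is given below, assuming the keyframe-level concept probabilities have already been split into the detectable and non-detectable groups. The small epsilon in the denominator is an added numerical guard, and function names are illustrative.

```python
import numpy as np

def concept_t_scores(probs_detectable, probs_nondetectable):
    """Eq. (3): per-concept t-score between the detectable and non-detectable
    groups. Inputs are (group_size x 374) matrices of concept probabilities."""
    mu1, mu2 = probs_detectable.mean(axis=0), probs_nondetectable.mean(axis=0)
    var1, var2 = probs_detectable.var(axis=0), probs_nondetectable.var(axis=0)
    n1, n2 = probs_detectable.shape[0], probs_nondetectable.shape[0]
    return np.abs(mu1 - mu2) / np.sqrt(var1 / n1 + var2 / n2 + 1e-12)

def detectability_scores(t_scores):
    """Eq. (4): DS(i) = 1 + 1 / (1 + t-score(i)); a lower t-score gives a higher weight."""
    return 1.0 + 1.0 / (1.0 + t_scores)

def weighted_semantic_similarity(sem_sig_i, sem_sig_j, ds):
    """Eq. (5): inner product of the element-wise DS-weighted semantic signatures."""
    return float(np.dot(ds * sem_sig_i, ds * sem_sig_j))
```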

3.3 Fusion of local and semantic signature similarities

Finally, through a leave-one-out training process, we train an SVM classifier using the local signature similarity and the weighted semantic similarity as its two inputs, and use the trained model to determine the enhanced visual content similarity score between a pair of stories.

Note that we can alternatively use scene signatures [23] instead of the Bag-of-SIFT of keyframes as the local signature. A scene signature is essentially a scene-level Bag-of-SIFT representation. To generate the scene signatures of a news story, we first cluster the NDKs within the news story and group their keypoints into matching and non-matching keypoints. Next, we represent the matching keypoints by the one with the largest scale and filter out less informative non-matching keypoints. After three refinement steps, a scene signature is obtained for each NDK cluster, which is more compact and effective than a keyframe-level Bag-of-SIFT representation. Moreover, the scene signature is not sensitive to the choice of keyframes extracted from a news story, while the keyframe set similarity, explained in (1), can be highly affected by the choice of keyframes [23]. The scene set similarity can be determined similarly to (1), where |S_i ∩ S_j| denotes the number of similar scene signatures between the ith and the jth story, i.e., scene signatures connected by a sufficient number of matching keypoints.

4 Enhanced textual content similarity

We propose a measure for the text-based semantic similarity between news stories. The textual information from the OCR and ASR transcripts is represented using the weighted tfidf as

wtfidf(S_j, term_i) = (tf_{A_j}(term_i) + tf_{O_j}(term_i)) / LDF(term_i, S_j),    (6)

LDF(term_i, S̃) = log(|{s ∈ S̃ : term_i ∈ s}| / |S̃|),    (7)


where tf_{A_j} is the term frequency in the ASR transcript of story S_j and tf_{O_j} is the term frequency in the refined OCR transcript of story S_j. |S̃| is the number of documents within the temporal window (say, a day), and |{s ∈ S̃ : term_i ∈ s}| is the number of those documents that include term_i. Note that instead of the conventional document frequency (df), we utilize a local document frequency (LDF) within a temporal window. Specifically, the LDF score is determined for each term based on the temporal proximity of news stories, as given in (7). Note that we convert all terms to their root form using a stemmer before the wtfidf calculation.
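The sketch below builds the wtfidf vector of a story from its stemmed ASR and refined OCR terms, weighting each term by a document-frequency statistic computed only over the stories in the temporal window. The log(N/df)-style weighting used here is an assumption chosen for readability; the paper's exact LDF normalization is the one given in Eq. (7), and all names are illustrative.

```python
import math
from collections import Counter

def wtfidf_vector(asr_terms, ocr_terms, window_stories, vocabulary):
    """Sketch of the weighted TFIDF feature of Eq. (6): ASR and OCR term
    frequencies combined, then weighted by a local document-frequency factor
    computed over the temporal window (e.g., one day)."""
    tf_asr, tf_ocr = Counter(asr_terms), Counter(ocr_terms)
    n_window = len(window_stories)        # number of stories in the window
    vector = []
    for term in vocabulary:
        df = sum(1 for story_terms in window_stories if term in story_terms)
        ldf_weight = math.log(n_window / df) if df else 0.0   # assumed form
        vector.append((tf_asr[term] + tf_ocr[term]) * ldf_weight)
    return vector

# Hypothetical usage: terms are already stemmed and stop-word filtered.
# vec = wtfidf_vector(asr_terms, ocr_terms,
#                     window_stories=[set_of_terms_per_story, ...],
#                     vocabulary=sorted(dictionary_terms))
```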

This textual representation enables the computation of the similarity between two news stories by simply calculating the cosine of the angle between their corresponding wtfidf vectors. However, there is a significant number of incorrectly recognized words in the extracted ASR/OCR transcripts. Also, it often happens that different news agencies use different but related words to report an identical event, e.g., "taxi" and "cab", "mom" and "mother", "Baghdad" and "Iraq", "ticket" and "flight", etc. Particularly for non-English news, where machine translation is used to generate the English version of the extracted ASR transcript, using semantic similarity can play a critical role in measuring the textual similarity between stories more effectively. To do so, we first calculate the semantic similarity between all terms in the dataset dictionary (i.e., t_i, i = 1, 2, 3, ..., 8827) as

Semantic_Similarity(t_1, t_2) = { WNSim(t_1, t_2) + WikiSim(t_1, t_2)  if WNSim(t_1, t_2) > τ_w;  0  otherwise },    (8)

where WNSim(t_1, t_2) and WikiSim(t_1, t_2) denote the WordNet-based and Wikipedia-based semantic similarities between t_1 and t_2, respectively. We calculate WNSim(t_1, t_2) based on the relative positions of t_1 and t_2 in the WordNet hierarchy [6]. WikiSim(t_1, t_2) is calculated based on the co-occurrence of t_1 and t_2 in the Wikipedia dataset [12]. In preliminary experiments, we found that the WordNet-based similarity is more robust and reliable than the Wikipedia-based similarity on the given news dataset. Hence, we filter out unreliable relatedness if the WNSim between two words is lower than a particular value (τ_w = 0.3), as shown in (8).

Considering all words that are semantically related according to (8), we observe many noisy connections between words, which degrades the retrieval performance. There are 8,827 unique words in the ASR/OCR transcripts, among which general words such as "go", "make", and "take" can get connected to over a hundred words. On the other hand, specific words such as "rain", "damper", and "bomber" have far fewer related words in the dataset and generally form more meaningful and useful connections. The noisy connections can negatively affect the role of the other, more meaningful connections and undesirably result in a significant similarity score between two news stories which are not semantically related.

To suppress the problem of terms having noisy connections, we adopt a soft-mapping scheme that considers only the top-k semantic similarities for each word in the dataset. We determine k through a 10-fold cross-validation process: we split the development data into 10 folds and use 9 of them for training, where we employ a grid search (k = 1, 2, ..., 15) to find the k that yields the best retrieval performance on the remaining fold. Next, we calculate the proximity matrix SS_TT = [s_ij]_{n×n}, where s_ij refers to the semantic similarity between the ith and the jth terms in the dataset, as determined in (8). In practice, we calculate a large proximity matrix through an offline process using around 100,000 of the most commonly used English words and use it to construct SS_TT. Considering the determined k for soft-mapping, the ith column of SS_TT has non-zero values only for the top-k terms in the dataset that are most semantically similar to the ith term. Note that n denotes the number of terms in the dictionary, which is the compilation of all distinct terms in the ASR and OCR transcripts after stop word removal and stemming. We can calculate the semantic similarity between stories as

Score_t(S_1, S_2) = wtfidf^T(S_1) × SS_TT × wtfidf(S_2),    (9)

where wtfidf(S_1) and wtfidf(S_2) refer to the weighted tfidf features of S_1 and S_2, as calculated in (6). Note that if we set SS_TT to I_{n×n}, (9) reduces to the inner product of the wtfidf features.
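The soft-mapping and the bilinear score of Eq. (9) reduce to a few lines of NumPy, assuming the n x n term-to-term semantic similarities of Eq. (8) are precomputed; function names are illustrative.

```python
import numpy as np

def topk_proximity_matrix(pairwise_term_sim, k):
    """Build SS_TT with soft-mapping: for each term (column), keep only its
    top-k most semantically similar terms and zero out the rest."""
    sst = np.zeros_like(pairwise_term_sim)
    for j in range(pairwise_term_sim.shape[1]):
        top = np.argsort(pairwise_term_sim[:, j])[-k:]
        sst[top, j] = pairwise_term_sim[top, j]
    return sst

def textual_semantic_score(wtfidf_1, wtfidf_2, sst):
    """Eq. (9): Score_t(S1, S2) = wtfidf(S1)^T . SS_TT . wtfidf(S2)."""
    return float(wtfidf_1 @ sst @ wtfidf_2)
```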

5 Fusion of visual and textual information

Although most news stories contain a significant amount of textual information extracted via ASR and OCR, it often happens that some stories do not have representative visual cues which can be extracted through the local or the semantic signature. For instance, in a reader, which is a type of news story read without accompanying video or sound, there is no visual cue relevant to the topic of interest. This leads to poor retrieval results when using a visual representation. To improve the retrieval performance for these stories, we can integrate both visual and textual modalities through early or late fusion approaches. In early fusion, we combine these two modalities at the feature level, while in late fusion we integrate the similarity scores at the decision level.

In this section, we propose an early fusion scheme in which we retrieve visual content using textual content. In the visual domain, we represent each video using the semantic signature proposed in (2), and in the textual domain, we describe a video using the wtfidf representation explained in (6). Since the textual representation, wtfidf, and the semantic signature, SemSig, come from different sources of knowledge and consequently different feature spaces, they are considered heterogeneous features and are not directly comparable. Hence, the proposed framework uses two scores to retrieve the semantic signatures of reference stories for a given query story, as shown in Fig. 5. Reference stories refer to all stories in the dataset other than the query story. First, we directly map the wtfidf of the query story to the visual semantic feature space using Semantic Similarity Mapping (SSM) and generate a visual concept signature, as explained in the next section. Then we determine its cosine similarity to the semantic signatures of the reference stories to obtain Score_SSM. The second score is obtained by mapping these heterogeneous features to a third feature space where they are comparable. We use Canonical Correlation Analysis (CCA) to learn the required projection functions, as explained in Section 5.2. Using the projection functions, we map the wtfidf of the query story and the semantic signatures of the reference stories onto a third space where their cosine similarity is computed as Score_CCA. Finally, the similarity between two stories is obtained as the sum of Score_SSM and Score_CCA, as shown in Fig. 5.

Fig. 5 The proposed early fusion approach, using CCA and the direct Semantic Similarity Mapping (SSM) of textual information onto the visual concept list

Note that although both the SSM and CCA methods are employed to retrieve visual content using textual information, they use different tools to do so. In the SSM method, the semantic relatedness between terms in the ASR/OCR transcripts and the names of the visual concepts is used. On the other hand, the CCA approach is a data-driven approach which uses the co-occurrence of terms and the detected visual concepts in the news story corpus, and it does not directly take their semantic relatedness into account. In Section 6.3, we show experimentally that these two methods play a complementary role and that combining them boosts the retrieval performance.

5.1 Semantic similarity mapping of textual information onto visual space

We project the wtfidf of the query story onto the list of 374 visual concept names. We use the semantic similarity described in (8) to determine the textual semantic similarity between every term in the wtfidf dictionary and the visual concept names. We build SS_TV = [ss_ij]_{m×n}, where m and n refer to the number of visual concepts and the number of terms in the wtfidf dictionary, respectively; m = 374 and n = 8,827. The ith row of SS_TV has non-zero values, determined by the textual semantic similarity measure of (8), for the top-k words in the dictionary that are most semantically similar to the ith visual concept. Accordingly, we determine the Visual Concept Signature (VCS) for story S_i as

VCS(S_i) = SS_TV × wtfidf(S_i),    (10)

where wtfidf(S_i) is the n-by-1 vector indicating the textual representation of S_i in (6), n denotes the number of words in the dictionary, and VCS(S_i) is the m-dimensional vector representing the visual concept signature generated from the textual information. Figure 6a shows an example story about 'Fire in Oklahoma' broadcast by the NBC channel. Figure 6b shows the ASR transcript of the story of interest. We bold some words in the extracted ASR transcript that contribute to the semantic similarity mapping in the SS_TV matrix calculation, e.g., "fire", "burn", "danger", "fire-fighter". Figure 6c shows the determined visual concept signature (VCS). We obtain relatively high scores for associated concepts like "Natural Disaster", "Fire-fighter", "Outdoor", "Smoke", etc. However, some completely irrelevant concepts, such as "Glass", "Lawyer" and "Snow", also receive significant scores.
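As a small sketch of Eq. (10) and the resulting Score_SSM of Fig. 5, the following assumes the m x n matrix SS_TV and a story's wtfidf vector are already built; function names are illustrative.

```python
import numpy as np

def visual_concept_signature(wtfidf_vec, ss_tv):
    """Eq. (10): project the n-dimensional textual wtfidf vector onto the
    374 visual concepts through the m x n term-to-concept matrix SS_TV."""
    return ss_tv @ np.asarray(wtfidf_vec, dtype=float)

def score_ssm(vcs_query, sem_sig_reference):
    """Score_SSM in Fig. 5: cosine similarity between the query's generated VCS
    and the semantic signature (Eq. (2)) of a reference story."""
    num = float(np.dot(vcs_query, sem_sig_reference))
    den = np.linalg.norm(vcs_query) * np.linalg.norm(sem_sig_reference) + 1e-12
    return num / den
```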

Fig. 6 An example of a generated visual concept signature using the extracted ASR transcript. (a) Keyframes, (b) ASR transcript, and (c) the generated visual concept signature

To retrieve associated news stories for a given query story, we use cosine similarity to measure the similarity between the generated visual concept signature (VCS) of the query story, as determined in (10), and the visual semantic signatures of the reference stories, calculated in (2). We assess the retrieval performance in the experimental section.

5.2 Canonical correlation analysis

In this section, instead of directly mapping the textual information onto the visual feature space, we map both the textual and visual semantic features to a third space where they are comparable. To determine the projection functions, we assume that the mapped textual and semantic signatures of a news story are close together in the projected feature space. We use Canonical Correlation Analysis (CCA) to learn the co-occurrence of the textual information and visual concepts. In statistics, CCA is an approach to study cross-covariance matrices. If two sets of variables, a_1, a_2, ..., a_n and b_1, b_2, ..., b_m, are correlated, then we can find linear combinations of the a_i and of the b_i that have maximum correlation. In our case, we use the textual feature (T) and the visual semantic indexing (V) as a and b and consider their co-occurrence in a news story to determine the correlation between them. T is essentially the n-dimensional wtfidf feature vector determined for each story as explained in Section 4, where n is the number of words in the dictionary. V is the m-dimensional semantic signature determined for each video as in (2), where m is the number of visual concepts. We learn the projection functions, w_t and w_v, from the V and T extracted from training videos.


w_t and w_v are sets of basis vectors for the textual and visual features, respectively, chosen to maximize the following correlation function [8]:

ρ = E[w_t^T T V^T w_v] / √(E[w_t^T T T^T w_t] E[w_v^T V V^T w_v]).    (11)

The maximum of ρ with respect to w_t and w_v is the maximum canonical correlation. We can rewrite the above equation as

ρ = E[w_t^T C_tv w_v] / √(E[w_t^T C_tt w_t] E[w_v^T C_vv w_v]),    (12)

where C_tt and C_vv are the within-set covariance matrices of T and V, respectively, and C_tv is the between-sets covariance matrix. The canonical correlations between T and V can be found by solving the eigenvalue equations [8]:

C_tt^{-1} C_tv C_vv^{-1} C_vt w_t = ρ² w_t,    C_vv^{-1} C_vt C_tt^{-1} C_tv w_v = ρ² w_v,    (13)

where the eigenvalues ρ² are the squared canonical correlations and the eigenvectors w_t and w_v are the normalized canonical correlation basis vectors. More details about CCA and possible solutions for (11) can be found in [8].

Next, the projection of the textual feature of the query story, wtfidf_q, onto w_t is given by wtfidf_q^T · w_t. Similarly, the semantic signature of a reference story, SemSig_r, is projected onto w_v to obtain SemSig_r^T · w_v. Note that we normalize both wtfidf_q and SemSig_r to have zero mean in each feature dimension before we calculate their projections. Finally, we retrieve similar reference news stories based on the cosine similarity between their projections and the query story projection, as shown in Fig. 5. We evaluate the effectiveness of the CCA approach compared to the Semantic Similarity Mapping approach of Section 5.1 in the experimental section.
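One possible implementation of this step uses scikit-learn's CCA, which fits the paired projections iteratively instead of explicitly solving the eigenvalue problem in (13) and centers the data internally; the number of components and function names are assumptions of this sketch, not settings from the paper.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def fit_cca(T_train, V_train, n_components=20):
    """Learn paired projections of wtfidf features (T) and semantic signatures
    (V) into a common space where corresponding training stories are maximally
    correlated."""
    cca = CCA(n_components=n_components)
    cca.fit(T_train, V_train)   # rows: training stories
    return cca

def score_cca(cca, wtfidf_query, sem_sig_reference):
    """Score_CCA in Fig. 5: project the query's textual feature and a reference
    story's semantic signature, then compare them with cosine similarity."""
    t_proj, v_proj = cca.transform(wtfidf_query.reshape(1, -1),
                                   sem_sig_reference.reshape(1, -1))
    t_proj, v_proj = t_proj.ravel(), v_proj.ravel()
    denom = np.linalg.norm(t_proj) * np.linalg.norm(v_proj) + 1e-12
    return float(np.dot(t_proj, v_proj) / denom)
```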

As shown in Fig. 5, the final similarity score in the proposed early fusion framework is determined by adding the similarity scores of the SSM and CCA methods.

5.3 Late fusion of textual and visual modalities

We explore the effect of the late fusion of textual and visual modalities by combining the enhanced visual similarity (Section 3), the enhanced textual similarity (Section 4), and the early fusion similarity (Section 5). Similar to [25], we also use the early fusion approach as an individual classifier, in addition to the other two classifiers, in the proposed late fusion scheme, as shown in Fig. 2. We use the SVM-based and the Ranked List late fusion methods to fuse the above-mentioned similarity scores. In the SVM-based fusion approach, we employ a leave-one-out training framework and treat the similarity scores between pairs of stories in the training dataset as inputs to the SVM classifier. In our case, we have three inputs: the enhanced textual similarity, the enhanced visual similarity, and the early fusion similarity score. We utilize an RBF kernel and find its optimal parameters (C and γ) through a coarse-to-fine grid search using the training data [5]. We use the trained SVM model to calculate the similarity between a given query video and all reference videos and rank them accordingly. In the Ranked List method [7], first we retrieve and rank the reference stories for a given query story using every single similarity score. Next, we determine the final rank of a reference story as the minimum of the ranks determined using the different similarity scores. In the next section, we evaluate the retrieval results using different sets of features and late fusion methods.
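A sketch of both late fusion strategies is given below. It uses scikit-learn's SVC and a plain 5-fold grid search for brevity, whereas the paper uses leave-one-out training and a coarse-to-fine search; the grid ranges and function names are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_late_fusion_svm(train_score_triples, train_labels):
    """Fit an RBF-kernel SVM on [textual, visual, early-fusion] similarity
    triples; C and gamma are selected by grid search."""
    grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-3, 2, 6)}
    model = GridSearchCV(SVC(kernel="rbf", probability=True), grid, cv=5)
    model.fit(np.asarray(train_score_triples), np.asarray(train_labels))
    return model

def fused_similarity(model, score_triples):
    """Use the probability of the 'associated' class as the final similarity."""
    return model.predict_proba(np.asarray(score_triples))[:, 1]

def ranked_list_fusion(rank_lists):
    """Ranked List fusion [7]: the final rank of each reference story is the
    minimum of its ranks under the individual similarity scores."""
    return np.min(np.vstack(rank_lists), axis=0)
```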

6 Experimental results

We use the TRECVID 2006 dataset containing worldwide news videos broadcast in December 2005 on seven different channels. It includes 830 news stories, of which 296 pairs of associated news stories are labeled. The length of the stories varies between 30 s and 5 min. We use the ASR transcripts provided by [17]. We manually segment the news stories as groups of keyframes and label each story based on its main topic. Around 46 % of the associated news stories belong to the first category and the rest belong to the second category. It should be mentioned that sometimes it is not obvious which category a news story belongs to. We consider each associated news story as a query and measure the similarity between the query and all the other stories. The retrieval performance is quantified by the probability of retrieving the associated news stories in the top-k positions of the ranked list, given as P(k) = Z_c/Z, where Z_c is the number of queries that rank their associated news stories within the top-k positions and Z is the total number of queries, which is 296.
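The evaluation measure P(k) can be computed directly from the ranked lists, as in the short sketch below; the data structures and any tie handling are assumptions for illustration.

```python
def retrieval_probability(ranked_lists, associated, k):
    """P(k) = Z_c / Z: the fraction of queries whose associated story appears
    within the top-k positions of that query's ranked list."""
    hits = sum(
        1
        for query, ranking in ranked_lists.items()
        if any(ref in associated[query] for ref in ranking[:k])
    )
    return hits / len(ranked_lists)

# Hypothetical usage:
# ranked_lists = {"q1": ["s7", "s3", "s9"], "q2": ["s2", "s5", "s1"]}
# associated   = {"q1": {"s3"}, "q2": {"s8"}}
# print(retrieval_probability(ranked_lists, associated, k=2))   # -> 0.5
```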

6.1 Enhanced textual similarity evaluation

Table 1 shows the average of the top-5 retrieval results using different sources of textual information and similarity measures. SS refers to the semantic similarity measure introduced in (9). Textual information from the different sources is represented as a TFIDF feature and the similarity is calculated using cosine similarity. As expected, ASR performs better than OCR, since almost all news stories have an ASR transcript while only a small group of them have a useful OCR transcript. Aggregating the ASR transcript and the modified OCR outputs using the n-gram or the global dictionary method [9], the retrieval performance improves by around 1 %. Using the wtfidf feature and the proposed semantic similarity measure, the retrieval performance improves by around 5 %. This result confirms the effectiveness of the wtfidf representation and the usefulness of the proposed semantic similarity measure in capturing the semantic relatedness between news stories.

6.2 Enhanced visual similarity evaluation

Figure 7 shows the retrieval performance using various visual similarities: the keyframe set similarity in (1), the semantic signature similarity (Section 3.1), their Ranked List and SVM-based fusion (Section 3.3), and the SVM-based fusion of the keyframe set similarity and the weighted semantic signature similarity explained in (5). The semantic signature similarity performs worse than the keyframe set similarity; however, it is better than random retrieval. This is because the semantic signature is not discriminative enough, due to the limited number of visual concepts (i.e., 374) and the relatively low precision of the visual concept detectors. The poor retrieval performance of the keyframe set similarity can be explained by the fact that the majority of the associated news stories belong to the second category discussed in Section 1, for which this visual representation is not discriminative enough. The SVM-based fusion of the keyframe set similarity and the semantic signature similarity outperforms the Ranked List fusion method for the top-1 through the top-17 retrieval results. This superiority comes from the fact that through the SVM-based fusion we learn how to fuse the similarity scores using training data, as explained in Section 3.3, while the Ranked List fusion is an unsupervised method where no training is involved. The best result is obtained by the SVM-based fusion of the keyframe set similarity and the weighted semantic signature similarity. This superiority illustrates the key role of our proposed Detectability Score, described in (4), in weighting the visual concepts and facilitating an effective combination of the keyframe set and semantic signature similarities.

Table 1 The average of the top-5 retrieval results using different sources of textual information and similarity measures

  OCR                       0.1912
  ASR                       0.3689
  ASR+OCR (n-gram)          0.3808
  ASR+OCR (global dict.)    0.3752
  ASR+OCR (wtfidf)          0.3913
  ASR+OCR (wtfidf)+SS       0.4478

Fig. 7 The top-k retrieval results using visual modalities. The best result belongs to the SVM-based fusion of the keyframe set similarity and the weighted semantic signature similarity

We also show the top-k retrieval performance using the SIFT-BOW method, where the whole news story is described as a bag of SIFT features and the similarity between stories is calculated using the chi-square distance [25]. The performance is slightly worse than that of the keyframe set similarity method and its SVM-based fusions. We can conclude that aggregating the bag-of-SIFT representations of all keyframes degrades the retrieval performance.

6.3 Early fusion evaluation

In Fig. 8, we compare the retrieval results using the early fusion of textual and visual information, as explained in Sections 5.1 and 5.2. It is clear that none of the SSM, CCA, or their linear fusion performs well; however, all perform better than random retrieval. The linear fusion and CCA perform better than the SSM method. A possible explanation for these relatively poor retrieval results is that in the SSM and CCA methods we retrieve semantic signatures that are not discriminative enough, due to the limited number of visual concepts and the relatively low precision of the visual concept detectors. Moreover, noisy ASR and OCR transcripts can also degrade the quality of the generated visual concept signatures, which affects the SSM retrieval performance.

Fig. 8 The top-k retrieval results using the proposed early fusion methods. The best result is obtained using the linear fusion of the CCA and SSM outputs

6.4 Late fusion evaluation

In Fig. 9, we compare the retrieval results using different multi-modal late fusion strategies. First, comparing the different single modalities, we find that the enhanced textual similarity, described in (9), outperforms the enhanced visual content similarity, explained in Section 3.3. A possible explanation for this superiority is that almost all news stories have ASR/OCR transcripts, while only some of them contain salient and informative visual content. Further, the accuracy of the ASR transcript is relatively higher than that of visual features such as visual concepts, since a news story is mainly recorded in a studio, which provides a high-quality audio channel from which an accurate ASR transcript can be extracted.

The second best result is obtained by the SVM-based fusion of the enhanced textual and visual similarities and the early fusion, which outperforms the Ranked List fusion. Since in the SVM-based fusion we combine the different similarities through the learning process explained in Section 5.3, we obtain a better performance compared to the Ranked List fusion, which is an unsupervised method.

The best result is obtained by the SVM-based fusion of the enhanced textual similarity, the early fusion similarity, and the enhanced visual similarity using the scene set similarity instead of the keyframe set similarity, as explained in Section 3.3. As shown in Fig. 9, we capture around 43.5 % of the associated news stories in the top-1 retrieval result. This result confirms the effectiveness of our enhanced textual and visual representations and similarities and their complementary role in capturing the semantics shared by associated news stories.

Fig. 9 The top-k retrieval results using different modalities with different fusion strategies. The best result is obtained by integrating our enhanced textual similarity score, the scene signature and weighted semantic signature similarity scores, and the early fusion similarity score through the SVM-based late fusion

6.5 Discussion

To elaborate on the contribution of each modality and the proposed refinement steps in the case of the best performance, we show the trend of performance improvement in Fig. 10. The vertical axis denotes the average of the top-5 retrieval results and the horizontal axis shows the accumulative set of modalities and refinement steps. wtfidf, explained in (6), is the best single feature, using which we retrieve 39 % of the associated news stories in the top-1 through the top-5 results on average. Incorporating the textual semantic similarity (Section 4), the performance improves by around 5 %. Incorporating the scene set similarity, the performance improves by around 9 %. Integrating the weighted semantic signature similarity, we boost the performance by around 5 %. To investigate the origin of this improvement, we track the retrieval results and observe that by incorporating the semantic signature similarity, we improve the retrieval performance for a few clusters of news stories with specific topics such as 'New York subway strike' and 'Fire in Texas and Oklahoma'. This observation shows the effectiveness of the semantic signature in capturing conceptual relationships between news stories within these clusters. On the other hand, there are clusters of stories with specific topics such as 'Saddam Hussein court' or 'Fassir war tour' where the calculated scene set similarity is significantly greater than the semantic similarity, since these clusters of stories belong to the first category of associated news stories and they share the same or a very similar picture, object and/or venue.

Fig. 10 The contribution of the different modalities and refinement steps to the best performance

As shown in Fig. 10, the early fusion similarity does not contribute significantly to the final performance, improving it by less than 2 %, for the reasons mentioned in Section 6.3.

7 Conclusion

In this paper, we introduce a semantic signature that represents a news story using pre-defined visual concepts. We determine an enhanced visual content similarity by integrating local signature similarity (such as the scene signature similarity) and semantic signature similarity. We learn from the failure cases of the local signature similarity to tune the contribution of the semantic signature similarity by assigning a proper weight to each visual concept. In the textual domain, we determine an enhanced textual similarity between stories using the semantic relatedness between terms extracted by the ASR and OCR engines. Next, we fuse these heterogeneous sources of knowledge, i.e. the enhanced textual and visual representations/similarities, through different early/late fusion strategies. In the proposed early fusion scheme, textual information is used to retrieve the visual semantic signatures. We study different late fusion schemes to combine the decisions of the textual, visual and early fusion modules. The best result is obtained by fusing the proposed w_tfidf, scene set and weighted semantic signature similarities, and the early fusion similarity through the SVM-based late fusion.




Ehsan Younessian received the B.Sc. in Electrical Engineering, in the division of Control, from the University of Tehran, Iran, in 2007. Since August 2007, he has been pursuing his PhD degree in the School of Computer Engineering at Nanyang Technological University on the topic of multi-modal semantic extraction from multimedia content. He is currently a visiting research scholar at the Informedia lab at CMU. His research interests are in the general area of content-based video indexing and retrieval, video understanding, and multi-modal integration for multimedia analysis.

Deepu Rajan is an Associate Professor in the School of Computer Engineering at Nanyang Technological University. He received his Bachelor of Engineering degree in Electronics and Communication Engineering from the Birla Institute of Technology, Ranchi (India), his M.S. in Electrical Engineering from Clemson University, USA, and his Ph.D. degree from the Indian Institute of Technology, Bombay (India). From 1992 to 2002, he was a Lecturer in the Department of Electronics at Cochin University of Science and Technology, India. His research interests include image processing, computer vision and multimedia signal processing.