
2007 International Workshop on Content-Based Multimedia Indexing (CBMI 2007), Talence, France, 25-27 June 2007

MULTIMODAL SEMANTIC-ASSOCIATIVE COLLATERAL LABELLING AND INDEXING OF STILL IMAGES

Meng Zhu, Atta Badii

University of Reading, IMSS Research Laboratory, Department of Computer Science, Reading.

{meng.zhu; atta.badii}@reading.ac.uk

ABSTRACT

A novel framework for multimodal semantic-associative collateral image labelling, aiming at associating image regions with textual keywords, is described. Both the primary image and the collateral textual modalities are exploited in a cooperative and complementary fashion. Collateral content and context based knowledge is used to bias the mapping from low-level region-based visual primitives to high-level visual concepts defined in a visual vocabulary. We introduce the notion of collateral context, which is represented as a co-occurrence matrix of the visual keywords. A collaborative mapping scheme is devised using statistical methods, such as Gaussian distribution or Euclidean distance, together with a collateral content and context driven inference mechanism. Finally, we use Self-Organising Maps to examine the classification and retrieval effectiveness of the proposed high-level image feature vector model, which is constructed based on the image labelling results.

1. INTRODUCTION

Digitised information nowadays is typically represented in multiple modalities and distributed through various channels. Tremendous volumes of multimedia data are generated constantly every day because of advances in digital media technologies. Efficient access to such amounts of multimedia information depends on effective and intelligent multimodal indexing and retrieval techniques. The notion of multimodality implies the use of at least two human sensory or perceptual experiences for receiving different representations of the same information [1]. Depending on the information need, a distinction can always be drawn between the primary and collateral information modalities. For instance, the primary modality of an image retrieval system is of course the images, while all the other modalities that are explicitly or implicitly related to the image content, such as collateral texts, can be considered as the collateral modality.

Google and Yahoo image search engines can easily gather more than one billion images from the WWW. It is worth mentioning that those images are normally accompanied by certain types of collateral text, such as captions, titles, news stories, URLs, etc. It is this collateral textual information that is utilised for image indexing and retrieval, which is currently the dominant commercial solution. However, the performance of such image search engines is limited by the subjectivity of the accompanying descriptive texts. Moreover, the text-based retrieval approach also suffers from inherent shortcomings such as synonymy and word-sense ambiguity. Content-Based Image Retrieval (CBIR) has become the dominant paradigm for image indexing and retrieval in research, development and application since the 1990s [22]. Well-known CBIR systems that have been widely used include QBIC by IBM [10], Virage [12], VisualSEEk [24], Blobworld [6] [7], and the MARS system [21]. Each of these systems allows the user to query it using low-level visual features, through either a weighted combination of visual features (e.g. colour, shape, texture, etc.) or an exemplar query image. However, it has been argued that low-level visual primitives are not sufficient for depicting the semantic (object-level) meaning of images. Hence, a "semantic gap" still exists due to the lack of coincidence between the low-level visual primitives and the high-level semantics conveyed by the same image content [23]. In addition, researchers who investigated users' needs found that: i) users search both by the types and the identities of the entities in images, and ii) users tend to request images both by innate visual features and by the concepts conveyed by the picture [9] [15] [19] [17]. From this, it is clear that using text or image features alone cannot satisfy users' requests.

We propose to use both the primary image and the collateral textual modalities in a complementary fashion to perform semantic-based image labelling, indexing and retrieval. Accordingly, a framework for semantic labelling of image content using multimodal cues was developed, aiming at automatically tagging image regions with textual keywords.

We have exploited the collateral knowledge to bias the mapping from low-level visual primitives to high-level visual concepts defined in a visual vocabulary. We have introduced a novel notion of "collateral context", represented as a matrix of the co-occurrences of the visual keywords. The mapping procedure is a collaborative scheme using statistical methods, such as Gaussian distribution or Euclidean distance, together with a collateral content and context-driven inference mechanism. By automatic labelling of image regions, the semantics is localised within the given image. A semantic-level image feature vector model was constructed based on the labelling results. The high-level features extracted from around 3000 Corel images have been used to evaluate our proposed method by training Self-Organising Maps (SOMs) to examine the classification and retrieval effectiveness.

In Section 2, we introduce recent work related to this research. Section 3 provides the details of our proposed method for multimodal semantic-based collateral image labelling. Section 4 reports the experiments conducted and discusses the experimental results. Finally, Section 5 concludes this research and points out future working directions.

2. RELATED RESEARCH

One of the earliest attempts in this area was to extract high-level visual semantics from low-level image content. Typical examples include the discrimination between 'indoor' and 'outdoor' scenes [25] [20], 'city' vs. 'landscape' [11], and 'natural' vs. 'man-made' [4]. These research results are considered to be limited regarding the granularity of the visual semantics, since only the generic theme of the images can be identified. More recently, other researchers have started to develop methods for automatically annotating images at the object level. Mori et al [18] proposed a co-occurrence model which formulated the co-occurrence relationships between keywords and sub-blocks of images. Duygulu et al [8] improved Mori et al's co-occurrence model with Brown et al's machine translation model [5], which assumes that image annotation can be considered as a task of translating blobs into a vocabulary of keywords. Barnard et al [3] [2] also tried to tackle the problem using machine learning techniques. They proposed a statistical model for organising image collections into a hierarchical tree structure which integrates the semantic information provided by associated text and the perceptual information provided by image features. The nodes of the tree structure generate image segments by Gaussian distribution and words by multinomial distribution. The integration of, and the correspondence learnt between, textual and visual features at a blob/keyword level within the model allows the two modalities to mutually support each other in performing challenging tasks such as automatic image annotation and text illustration. More recently, Li et al [16] have created a system for automatic linguistic indexing of pictures using a statistical modelling approach. A 2D multi-resolution Hidden Markov Model is used for profiling categories of images, each corresponding to a concept. They built up a dictionary of concepts, which is subsequently used as the linguistic indexing source. The system automatically indexes images by first extracting multi-resolution block-based features, then selecting the top k categories with the highest log-likelihood of the given image against each category, and finally determining a small subset of key terms from the vocabulary of those selected categories stored in the concept dictionary as the indexing terms.

3. A MULTIMODAL METHOD FOR SEMANTIC COLLATERAL IMAGE LABELLING

We propose a novel method of multimodal semantic-based collateral image labelling, aiming at automatically assigning textual keywords to image regions. A visual vocabulary comprising 236 visual concepts was constructed based on 3748 manually annotated image segments. The collateral knowledge extracted from the collateral textual modality of the image content is exploited as a bias on the mapping from the low-level region-based visual features to the high-level visual concepts defined in the visual vocabulary. Two different types of collateral knowledge can be identified, namely collateral content and collateral context. A conditional probabilistic co-occurrence matrix, representing the conditional co-occurrences of the visual keywords, was created to represent the collateral contextual knowledge. We devised a collaborative mapping scheme using statistical methods, such as Gaussian distribution or Euclidean distance, together with a collateral content and context-driven inference mechanism.

3.1. Overview of the proposed framework

The proposed system architecture for collateral image labelling aims at assigning textual keywords to regions of interest in a given image. The typical system output would be image regions associated with a set of keywords that depict the semantics of the visual content to which they relate. We believe that this application would be very useful for bridging the semantic gap. We also believe that, by localising the semantics to image regions, more accurate and sensible visual information retrieval will be enabled.


Figure 1. An overview of the proposed system architecture for collateral image labelling (image segmentation and feature extraction of colour, texture and shape; the collateral knowledge base holding collateral content and the collateral context co-occurrence matrix; the visual vocabulary of low-level feature clusters; and the biased mapping from low-level feature vectors to visual keywords)

Figure 1 illustrates the architecture of the proposed framework and the typical system workflow. Four key components can be identified, namely image segmentation and region-based low-level visual feature extraction, the collateral knowledge base, a visual vocabulary, and a collaborative mapping procedure. The raw image data are first segmented into a number of regions, and then low-level visual features, such as colour, edge, shape and texture, are extracted from each segment. Using the region-based feature vectors, we map these low-level visual features to the visual concepts defined in the visual vocabulary, with the help of the external knowledge extracted from the collateral content and context. In the following sections, we introduce each individual component in detail.

3.2. Region-based visual feature extraction

We used the normalised cut [14] method to segment images into a number of regions. The reason for choosing this method for image segmentation is that it outputs salient image segments with no overlaps. It is worth mentioning that the segmentation is based on grouping pixels by visual similarity, without any concern for the semantics of the objects. Hence, for example, a semantic visual object may be segmented into different parts due to the perceptual variety of the object's surface.

A 54-dimensional feature vector, comprising colour, edge, shape and texture features, is created for each segment. 21 colour features are extracted using the intensity histogram. 19 edge features are extracted by applying an edge filter and the water-filling algorithm [26] to the binary edge image. A statistical texture analysis method proposed by Haralick et al [13], the Grey Level Co-occurrence Matrix, was used to extract 7 texture features from every segment. Finally, 7 shape-related features were extracted using Matlab-based functions for computing the statistics of image regions. We then constructed the final feature vector by sequentially assembling all four kinds of features together.
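As an illustrative sketch of this 54-dimensional descriptor (not the authors' code), the fragment below assembles a 21-bin intensity histogram, a simple edge-orientation histogram standing in for the water-filling edge features, GLCM texture statistics [13], and region-shape statistics. The exact binning, the scikit-image substitutes for the original Matlab routines, and the helper name region_features_54d are our assumptions.

    # A minimal sketch (our assumptions, not the authors' code) of the
    # 54-D region descriptor: 21 colour + 19 edge + 7 texture + 7 shape.
    import numpy as np
    from skimage.feature import graycomatrix, graycoprops, canny
    from skimage.measure import regionprops

    def region_features_54d(gray, mask):
        """gray: 2-D uint8 image; mask: boolean array marking one segment."""
        pixels = gray[mask]
        # 21 colour features: a 21-bin intensity histogram (assumed binning).
        colour = np.histogram(pixels, bins=21, range=(0, 255), density=True)[0]
        # 19 edge features: a gradient-orientation histogram of edge pixels,
        # used here as a stand-in for the water-filling algorithm [26].
        edges = canny(gray) & mask
        gy, gx = np.gradient(gray.astype(float))
        angles = np.arctan2(gy, gx)[edges]
        edge = (np.histogram(angles, bins=19, range=(-np.pi, np.pi),
                             density=True)[0] if angles.size else np.zeros(19))
        # 7 texture features from the grey-level co-occurrence matrix;
        # graycoprops exposes 6 Haralick-style statistics, padded to 7 here.
        region = np.where(mask, gray, 0).astype(np.uint8)
        glcm = graycomatrix(region, [1], [0], levels=256, normed=True)
        props = ['contrast', 'dissimilarity', 'homogeneity', 'energy',
                 'correlation', 'ASM']
        texture = np.array([graycoprops(glcm, p)[0, 0] for p in props]
                           + [glcm.var()])
        # 7 shape features from region statistics (mirroring Matlab regionprops).
        r = regionprops(mask.astype(int))[0]
        shape = np.array([r.area, r.perimeter, r.eccentricity, r.solidity,
                          r.extent, r.major_axis_length, r.minor_axis_length])
        return np.concatenate([colour, edge, texture, shape])  # 54 values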

3.3. Constructing a visual vocabulary

The visual vocabulary developed in our system is a set of clusters of region-based visual feature vectors labelled by textual keywords (see Figure 2).

Figure 2. The visual vocabulary consists of 236 visual concepts, each a keyword-labelled cluster of low-level feature vectors (colour, edge, shape, texture)

The construction of the visual vocabulary is based on the manually annotated dataset provided by Barnard, which consists of 3748 manually annotated image segments extracted from the Corel images. The image segments were annotated by Duygulu et al with a controlled vocabulary of visual concepts that had been used for annotating the Corel images. An image segment may be annotated with single or multiple keywords.

There are two basic approaches to clustering the region-based visual feature vectors. One is to use statistical clustering methods, such as K-means or hierarchical clustering, to classify the visual feature vectors into a number of clusters, and then to manually label those clusters with keywords. The drawback of this method is that it sometimes groups together segments which are visually very similar but conceptually very different. In other words, this method does not overcome the limitation of content-based image feature classification. Since we had this dataset of manually labelled image segments, we decided to use another approach to cluster the segments: a simple process of grouping together the segments labelled by the same keyword. Finally, 236 visual concepts were extracted to constitute the vocabulary. However, since the segments may be labelled by multiple keywords, there are overlaps among the clusters.
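Because the vocabulary is built by simply grouping the labelled segments by keyword, it reduces to a few lines. The sketch below is our own illustration with assumed variable names; it also records the per-cluster mean vector and per-dimension standard deviation that the Gaussian mapping of Section 3.5 needs.

    # Sketch of vocabulary construction: group labelled segment features by
    # keyword; a segment with multiple keywords joins several clusters.
    from collections import defaultdict
    import numpy as np

    def build_vocabulary(segment_features, segment_keywords):
        """segment_features: list of 54-D vectors; segment_keywords: list of
        keyword lists, one per segment (the manual annotations)."""
        clusters = defaultdict(list)
        for feats, keywords in zip(segment_features, segment_keywords):
            for kw in keywords:
                clusters[kw].append(feats)
        vocabulary = {}
        for kw, members in clusters.items():      # 236 concepts in the paper
            members = np.asarray(members)
            vocabulary[kw] = {'mean': members.mean(axis=0),
                              'std': members.std(axis=0) + 1e-9}  # avoid /0
        return vocabulary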

3.4. Collateral knowledge base

There are two kinds of collateral knowledge within the knowledge base, i.e. collateral content and collateral context.


3.4.1. Collateral content

Collateral content refers to the knowledge that can be extracted directly from the collateral modality; in our case, the collateral keywords accompanying the images. As mentioned above, such collateral textual information is becoming ever easier to acquire due to the multimodal nature of modern digitised information dissemination, and there are many different sources from which the collateral keywords can be extracted, such as captions, titles, URLs, etc. Such keywords are expected to depict the subjects or concepts conveyed by the image content.

3.4.2. Collateral context

Collateral context is another new concept that we introduce in this research. It refers to the contextual knowledge representing the relationships among the visual concepts. We use a conditional probabilistic co-occurrence matrix, which represents the co-occurrence relationships among the visual concepts defined in the visual vocabulary. A 236 x 236 matrix is created based on the annotations of the image segments (see Table 1).

        K1          K2          K3          ...   Kn
K1      -           P(K2|K1)    P(K3|K1)    ...   P(Kn|K1)
K2      P(K1|K2)    -           P(K3|K2)    ...   P(Kn|K2)
K3      P(K1|K3)    P(K2|K3)    -           ...   P(Kn|K3)
...
Kn      P(K1|Kn)    P(K2|Kn)    P(K3|Kn)    ...   -

Table 1. Conditional probabilistic co-occurrence matrix of the visual keywords

Each element of the matrix can be formally defined as follows:

    P(Kj | Ki) = P(Kj, Ki) / P(Ki)    (i ≠ j)    (1)

The relationships can be considered bidirectional, because the value of each element is calculated from the conditional probability of the co-occurrence of two visual keywords. The relationships appear reasonable in many cases (see Figure 3). For instance, the relationship between sky and clouds shows that clouds has a higher co-occurrence probability against sky than the other way round. This is reasonable, because clouds must appear in the sky, while many other things besides clouds may appear in the sky.

Figure 3. A graph representation of the visual keyword co-occurrence matrix

By exploiting this co-occurrence matrix, we believe we have a novel way of bridging the gap between computer-generated image segmentation and semantic visual objects. Image segmentation by itself is still an open issue: so far, even the most state-of-the-art image segmentation techniques cannot generate perfect segments which separate whole objects from each other. Normally, image segments contain either a part of an object or several parts of different objects. For instance, it is very likely to obtain segments containing the wing of an airplane against a background of sky. With our co-occurrence matrix, if such a segment can first be recognised as a wing, the system will probably provide other labels, such as plane, jet, sky, etc., according to the conditional probabilities between the visual keyword 'wing' and the others.
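Equation (1) can be estimated directly from annotation counts. The sketch below is our illustration; in particular, counting co-occurrence over each annotated image's set of segment labels is an assumption about the exact counting unit, which the paper does not spell out.

    # Sketch of the conditional co-occurrence matrix of equation (1):
    # P(Kj | Ki) = P(Kj, Ki) / P(Ki), estimated from annotation counts.
    import numpy as np

    def cooccurrence_matrix(annotations, keywords):
        """annotations: list of keyword sets, one per annotated image
        (our assumed counting unit); keywords: the visual vocabulary
        (236 entries in the paper)."""
        index = {k: i for i, k in enumerate(keywords)}
        n = len(keywords)
        single = np.zeros(n)        # count(Ki)
        joint = np.zeros((n, n))    # count(Ki and Kj occurring together)
        for ann in annotations:
            ids = [index[k] for k in ann if k in index]
            for i in ids:
                single[i] += 1
                for j in ids:
                    if i != j:
                        joint[i, j] += 1
        # Row i holds P(Kj | Ki); the diagonal is left at zero.
        return joint / np.maximum(single, 1)[:, None]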

3.5. Collaborative mapping towards visual concepts

Having constructed the visual vocabulary and the collateral knowledge base, we can now map the low-level visual feature vectors of the image segments to the visual keywords. We used two methods to perform the mapping, namely Euclidean distance and Gaussian distribution.

For the Euclidean distance method, we calculate the distance between the segment's features and the centroid of each cluster within the visual vocabulary (see equation 2):

    D = sqrt( Σi (xi - ci)^2 )    (2)

where x is the segment's feature vector and c is the mean vector of the visual feature cluster. For the Gaussian distribution method, we calculate the probability of the segment's feature vector against each visual feature cluster, which can be defined as follows:

    P = exp( -(x - μ)^2 / (2σ^2) )    (3)

where μ and σ are the mean vector and the standard deviation of the cluster, respectively.

However, relying only on the shortest Euclidean distance or the highest Gaussian probability cannot always provide accurate labels.
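A sketch of the two plain matching rules of equations (2) and (3), reusing the vocabulary structure assumed above. Applying the per-dimension Gaussian and multiplying across dimensions is our reading of equation (3), not something the paper states explicitly.

    # Sketch of the two basic mapping rules (equations 2 and 3).
    import numpy as np

    def euclidean_distance(f_seg, cluster):
        # Equation (2): distance to the cluster centroid.
        return np.sqrt(np.sum((f_seg - cluster['mean']) ** 2))

    def gaussian_probability(f_seg, cluster):
        # Equation (3): exp(-(x - mu)^2 / (2 sigma^2)). Taking the product
        # over the 54 dimensions is our assumption; a log-space sum would
        # avoid underflow in practice.
        z = (f_seg - cluster['mean']) / cluster['std']
        return float(np.prod(np.exp(-0.5 * z ** 2)))

    def best_visual_keyword(f_seg, vocabulary, method='gaussian'):
        if method == 'euclidean':
            return min(vocabulary,
                       key=lambda k: euclidean_distance(f_seg, vocabulary[k]))
        return max(vocabulary,
                   key=lambda k: gaussian_probability(f_seg, vocabulary[k]))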


This is the reason why we introduce the collateral knowledge to bias the mapping procedure. Again, the two different kinds of collateral knowledge, collateral content and collateral context, are used as the bias.

For collateral content based labelling, the process is quite straightforward. Instead of finding the shortest Euclidean distance or the highest Gaussian probability against all the clusters within the visual vocabulary, we simply find the best matching cluster among the collateral keywords; see equations 4 and 5:

    Euclidean distance:      min( Euclidean_dis(f_seg, μ_kn) )    (4)

    Gaussian distribution:   max( Gaussian_P(f_seg, Cluster_kn) )    (5)

where kn ∈ {k : k ∈ col_kw} ∩ {k : k ∈ vis_voc}.

For collateral context based labelling, we use a thresholding mechanism to combine the visual similarity between the segment's features and the visual keywords with the co-occurrence probability between the collateral content based label and the rest of the visual keywords within the visual vocabulary; see equations 6 and 7:

    Euclidean distance:      Eucl_dis(f_seg, μ_km) × P(Km | Kn) ≥ T    (6)

    Gaussian distribution:   Gau_P(f_seg, C_km) × P(Km | Kn) ≥ T    (7)

where kn ∈ {k : k ∈ col_kw} ∩ {k : k ∈ vis_voc} and km ∈ {k : k ∈ vis_voc}.
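Putting equations (4) to (7) together, label selection under the collateral bias might look like the following sketch, reusing the helpers from the previous sketches. This is our illustration: the Gaussian form (equations 5 and 7) is the less ambiguous one in the source, so only it is shown, and the threshold value T is an arbitrary placeholder.

    # Sketch of collateral content and context based labelling (eqs. 4-7),
    # Gaussian variant; T is an assumed placeholder value.
    def label_segment(f_seg, vocabulary, cooc, index, collateral_kw, T=0.05):
        # Equations (4)/(5): restrict the search to the collateral keywords
        # that also appear in the visual vocabulary, take the best match.
        candidates = [k for k in collateral_kw if k in vocabulary]
        if not candidates:
            return []
        content_label = max(
            candidates, key=lambda k: gaussian_probability(f_seg, vocabulary[k]))
        labels = [content_label]
        # Equations (6)/(7): add any vocabulary keyword Km whose visual match,
        # weighted by its co-occurrence with the content label Kn, passes T.
        kn = index[content_label]
        for km_kw, km in index.items():
            if km_kw == content_label:
                continue
            score = gaussian_probability(f_seg, vocabulary[km_kw]) * cooc[kn, km]
            if score >= T:
                labels.append(km_kw)
        return labels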

4. EXPERIMENT AND EVALUATION

All the experiments were carried out on the Corel dataset, which contains more than 600 CDs, each holding 100 photos grouped under one theme. Another highlight of the collection is that each image is accompanied by a couple of keywords which depict the key objects it contains. In our experiments, we take advantage of the given keywords as the collateral textual cues for tagging the images. These keywords can be considered as having been extracted from the collateral textual information of the images using natural language processing or keyword extraction techniques. We selected 30 categories of Corel images as our experimental dataset (see Table 2).

1   Sunsets & Sunrises       16  Greek Isles
2   Wild Animals             17  Coins & Currency
3   The Arctic               18  English Country Gardens
4   Arizona Desert           19  Bald Eagles
5   Bridges                  20  Cougars
6   Aviation Photography     21  Divers & Diving
7   Bears                    22  Grapes & Wine
8   Fiji                     23  Land of the Pyramids
9   North American Deer      24  Nesting Birds
10  Lions                    25  Helicopters
11  Wildlife of Antarctica   26  Models
12  Elephants                27  Virgin Islands
13  Foxes & Coyotes          28  Tulips
14  Rhinos & Hippos          29  Wild Cats
15  Arabian Horses           30  Beaches

Table 2. The 30 categories of Corel images

4.1. Region-based semantic image labelling

The 3000 Corel images were first segmented into 27953 regions of interest. The segments of each image were sorted in descending order of size (area) and, if the image was segmented into more than 10 regions, we selected the top 10 segments as the most significant visual objects in the image to label. We carried out two experiments to label the 3000 Corel images using the two different mapping methods, namely Euclidean distance and Gaussian distribution. However, as there is no benchmarking dataset for evaluating this kind of application, we could not perform any direct evaluation of the performance of the region-based image labelling. Figure 4 shows an example output of the labelling system using the Gaussian distribution method. The keywords in bold are the labels assigned to each region based on the collateral content, namely the Corel keywords, and the keywords in italics are the labels given on the basis of the visual keyword co-occurrence matrix, following equation 7.

Figure 4. Image labelling results using the Gaussian distribution method (example region labels include tiger, animal, cat, branches, tree, ground, sand)


4.2. Semantic-based image classification, indexing and retrieval

As there is no way to examine the performance of the proposed region-based image labelling model directly, the only way to evaluate the method is to encode the labelling results into feature vectors and to examine the effectiveness of classification and retrieval on those feature vectors.

We construct a novel homogeneous semantic-based image feature vector model which combines both the visual features of the image content and the textual features of its collateral text, based on the image labelling results. We calculate the weight of each label based on the frequency with which it appears among all the labels of the image, with the expectation that the weights should indicate the proportion of the visual object within the whole scene. We divide the keywords into 4 groups by applying an exponential function f(x) = 3^x, where x ∈ {0, 1, 2, 3}, according to the tf×idf values of each keyword of the category. We create a 120-dimensional (4 × 30) feature vector for each image, where the value of each element is the quotient of the label weight divided by the number of terms in the corresponding exponential division. Hence, the more significant and representative keywords for an image category are located in the first partitions and gain more weight, while the less important keywords are kept in the larger partitions towards the end of the feature vector. Because we used two different mapping methods, i.e. Euclidean distance and Gaussian distribution, we finally generated two sets of feature vectors for all 3000 Corel images, namely Euclidean 120D and Gaussian 120D.
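Our reading of this construction, as a sketch: each category contributes four elements, one per exponential partition of its tf×idf-ranked keywords (sizes 3^0 to 3^3), and each element is the summed label weight in that partition divided by the partition size. The summation over partition members and the exact partition sizes are assumptions where the text is ambiguous.

    # Sketch of the 120-D (4 x 30) semantic feature vector construction.
    from collections import Counter
    import numpy as np

    def semantic_vector(image_labels, category_keywords):
        """image_labels: all region labels assigned to one image;
        category_keywords: per category, keywords sorted by descending tf-idf."""
        counts = Counter(image_labels)
        total = sum(counts.values())
        weights = {k: c / total for k, c in counts.items()}   # label weight
        vec = []
        for keywords in category_keywords:                    # 30 categories
            start = 0
            for x in range(4):                                # partitions of 3^x
                size = 3 ** x
                part = keywords[start:start + size]
                start += size
                # Element = summed label weight in the partition / its size
                # (summing over partition members is our assumption).
                vec.append(sum(weights.get(k, 0.0) for k in part) / size)
        return np.asarray(vec)                                # 4 * 30 = 120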

In order to compare the performance of the proposed feature vector model with traditional feature extraction methods, we also extracted 3000 image feature vectors based on the visual content of the images, and text feature vectors based on the Corel keywords. For the visual feature vector, we constructed a 46-dimensional feature vector consisting of colour, edge and texture features for each image. For the textual feature vector, we selected the top 15 keywords with the highest tf×idf values for each category, merged the repeated keywords, and finally built a 333-dimensional text feature vector model, in which the value of each element is the occurrence count of the corresponding keyword.
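The Textual 333D baseline can be sketched as follows (our illustration; the per-category tf×idf scores are assumed to be precomputed, and the function names are hypothetical).

    # Sketch of the Textual 333D baseline: top-15 tf-idf keywords per
    # category, duplicates merged, images encoded by occurrence counts.
    import numpy as np

    def build_text_vocab(tfidf_by_category, top_n=15):
        vocab = []
        for scores in tfidf_by_category:   # dict keyword -> tf-idf, per category
            top = sorted(scores, key=scores.get, reverse=True)[:top_n]
            vocab.extend(k for k in top if k not in vocab)  # merge repeats
        return vocab                        # 333 terms in the paper

    def text_vector(corel_keywords, vocab):
        return np.array([corel_keywords.count(k) for k in vocab])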

We created four 30 × 30 SOMs to learn to classify the four kinds of feature vectors, i.e. Euclidean 120D, Gaussian 120D, Visual 46D, and Textual 333D. We divided the 3000 Corel images into training and testing sets with a ratio of 9:1: 10 images were randomly selected from each category, making a total of 300 testing images, with the remaining 2700 images used as the training set. Each of the four SOMs was trained for 100 epochs with the 2700 feature vectors extracted from the training images using the corresponding method. We then tested the trained networks using the 300 testing feature vectors, and four confusion matrices were created from the testing results. Figure 6 shows the SOM classification accuracy using the four different kinds of feature vectors. Euclidean 120D shows the best average accuracy at 71%, followed by Gaussian 120D at 70%. The performance of Euclidean 120D and Gaussian 120D also appears to be more stable than that of Textual 333D and Visual 46D.

The trained SOMs were also used to perform image retrieval on the four different kinds of feature vectors. The SOM-based retrieval is based on the principle of retrieving the training items according to their activation values with respect to the Best Matching Unit (BMU) of the testing input. Euclidean 120D and Gaussian 120D outperform both the textual and visual feature vectors for the top 5 retrieved items, and significantly increase the precision compared with purely content-based image feature vectors (see Figure 5).

Figure 5. Precision of the SOM-based image retrieval using the four different feature vector sets

However, statistical measures alone are sometimes not enough for the evaluation. We believe that one of the advantages of our proposed semantic-based high-level image feature vector models is that they combine both the visual and the conceptual similarity of the image. Take the retrieval results shown in Figure 7 as an example. Although Textual 333D achieved the best statistical results, it cannot meet the user's perceptual needs regarding the objects of interest. Moreover, the word 'animal', for instance, misleads the system into retrieving any image whose annotation contains this word. By contrast, our proposed high-level image feature vector models can successfully identify the objects of interest, i.e. bear and water in this case, and retrieve images that are both visually and conceptually similar to the query image, even when the retrieved images were assigned by the annotator to classes different from that of the query image. It is also for this reason that the statistical measures, e.g. precision, sometimes appear lower for the proposed features than for the textual features; there are many examples like this. We also believe that in real scenarios, where the collateral text may contain a lot of noise, the advantage of our high-level image feature vector model will become more obvious, due to its ability to visually reconfirm that the concept does exist in the images.
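A sketch of the classification protocol with an off-the-shelf SOM library (minisom, our substitution; the paper does not name an implementation): one 30 × 30 map per feature set, trained and then queried through the best matching unit. Labelling map nodes by majority vote of the training items they win is our assumption about how the confusion matrices were obtained.

    # Sketch of the SOM classification protocol (pip install minisom).
    from minisom import MiniSom

    def train_and_classify(train_x, train_y, test_x, dim):
        som = MiniSom(30, 30, dim, sigma=1.0, learning_rate=0.5, random_seed=1)
        som.train_random(train_x, 100 * len(train_x))   # roughly 100 epochs
        # Label each map node by majority vote of the training items it wins
        # (our assumption; the paper only reports the confusion matrices).
        node_labels = {}
        for x, y in zip(train_x, train_y):
            node_labels.setdefault(som.winner(x), []).append(y)
        predictions = []
        for x in test_x:
            votes = node_labels.get(som.winner(x), [])
            predictions.append(max(set(votes), key=votes.count) if votes else None)
        return predictions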


5. CONCLUSIONS AND FUTURE WORK

In this paper, we have proposed a novel framework for semantically labelling image regions by using multimodal information cues in a cooperative and complementary fashion. We have introduced the use of collateral content and context based knowledge in image labelling applications. Accordingly, a collaborative mapping scheme has been devised to combine statistical methods with a collateral knowledge based inference mechanism. We have validated our method by examining the classification and retrieval effectiveness of high-level semantic-based image feature vectors developed from the labelling results. The experimental results have indicated that our high-level image features outperform both traditional text-based and content-based image indexing methods in terms of their capability for combining both the perceptual and the conceptual similarity of the image content. Future directions for this research include the exploitation of more sophisticated knowledge representation schemes, such as ontologies, and a move towards more realistic scenarios.

6. ACKNOWLEDGEMENT

The authors wish to thank Kobus Barnard and P. Duygulu for providing us with their manually labelled Corel image segments. We would also like to thank the UK Engineering and Physical Sciences Research Council for its support (DREAM Project, Grant DTIEO06140/1).

Figure 6. SOM classification accuracy across the 30 categories using the four kinds of feature vectors (Euclidean 120D, Gaussian 120D, Textual 333D, Visual 46D)

Figure 7. An example of retrieval results using the four different kinds of feature vectors (query image with Corel keywords "animal bear fish water"; retrieved sets shown per feature vector model, including Textual 333D and Visual 46D)

7. REFERENCES

[1]-[2]. … Postgraduate Workshop, University of Sussex, 26-27, …


[3]. Barnard, K. and Forsyth, D., "Learning the Semantics of Words and Pictures", Proc. of the Int. Conf. on Computer Vision, pp. II:408-415, 2001.
[4]. Bradshaw, B., "Semantic based image retrieval: a probabilistic approach", Proc. of the Eighth ACM Int. Conf. on Multimedia, pp. 167-176, 2000.
[5]. Brown, P., Pietra, S.D., Pietra, V.D. and Mercer, R., "The mathematics of statistical machine translation: Parameter estimation", Computational Linguistics, 19(2):263-311, 1993.
[6]. Carson, C., Belongie, S., Greenspan, H. and Malik, J., "Blobworld: Image Segmentation Using Expectation-Maximization and Its Application to Image Querying", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 24, No. 8, pp. 1026-1038, 2002.
[7]. Carson, C., Thomas, M., Belongie, S., Hellerstein, J.M. and Malik, J., "Blobworld: A System for Region-Based Image Indexing and Retrieval", Proc. of the Third Int. Conf. on Visual Information Systems, pp. 509-516, 1999.
[8]. Duygulu, P., Barnard, K., Freitas, N. and Forsyth, D., "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary", 7th European Conf. on Computer Vision, pp. 97-112, 2002.
[9]. Enser, P.G., "Query analysis in a visual information retrieval context", Journal of Document and Text Management, Vol. 1, pp. 25-52, 1993.
[10]. Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D. and Yanker, P., "Query by image and video content: The QBIC system", IEEE Computer, 1995.
[11]. Gorkani, M.M. and Picard, R.W., "Texture orientation for sorting photos 'at a glance'", Proc. of the IEEE Int. Conf. on Pattern Recognition, October 1994.
[12]. Gupta, A. and Jain, R., "Visual information retrieval", Communications of the ACM, 40(5):71-79, May 1997.
[13]. Haralick, R.M., Shanmugam, K. and Dinstein, I., "Texture features for image classification", IEEE Trans. on Systems, Man, and Cybernetics, SMC-3(6):610-621, 1973.
[14]. Shi, J. and Malik, J., "Normalized Cuts and Image Segmentation", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, Issue 8, Aug. 2000.
[15]. Keister, L.H., "User types and queries: impact on image access systems", in Challenges in Indexing Electronic Text and Images, Learned Information, 1994.
[16]. Li, J. and Wang, J.Z., "Automatic linguistic indexing of pictures by a statistical modelling approach", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 25, No. 9, pp. 1075-1088, 2003.
[17]. Markkula, M. and Sormunen, E., "End-user searching challenges indexing practices in the digital newspaper photo archive", Information Retrieval, Vol. 1, pp. 259-285, 2000.
[18]. Mori, Y., Takahashi, H. and Oka, R., "Image-to-word transformation based on dividing and vector quantizing images with words", MISRM'99 First Int. Workshop on Multimedia Intelligent Storage and Retrieval Management, 1999.
[19]. Ornager, S., "View a picture: Theoretical image analysis and empirical user studies on indexing and retrieval", Swedish Library Research, Vol. 2, No. 3, pp. 31-41, 1996.
[20]. Paek, S., Sable, C.L., Hatzivassiloglou, V., Jaimes, A., Schiffman, B.H., Chang, S.-F. and McKeown, K.R., "Integration of visual and text based approaches for the content labelling and classification of photographs", ACM SIGIR'99 Workshop on Multimedia Indexing and Retrieval, Berkeley, CA, Aug. 19, 1999.
[21]. Rui, Y., Huang, T., Mehrotra, S. and Ortega, M., "A relevance feedback architecture in content-based multimedia information retrieval systems", Proc. of the IEEE Workshop on Content-based Access of Image and Video Libraries, 1997.
[22]. Rui, Y., Huang, T.S. and Chang, S.F., "Image Retrieval: current techniques, promising directions and open issues", Dept. of ECE & Beckman Institute, University of Illinois at Urbana-Champaign; Dept. of EE & New Media Technology Center, Columbia University, New York, 1998.
[23]. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A. and Jain, R., "Content-Based Image Retrieval at the End of the Early Years", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, No. 12, Dec. 2000.
[24]. Smith, J. and Chang, S., "Query by colour regions using the VisualSEEk content-based visual query system", in Intelligent Multimedia Information Retrieval, pp. 23-41, AAAI Press, 1997.
[25]. Szummer, M. and Picard, R.W., "Indoor-outdoor image classification", IEEE Int. Workshop on Content-based Access of Image and Video Databases, 1998.
[26]. Zhou, X.S. and Huang, T.S., "Image Retrieval: Feature Primitives, Feature Representation, and Relevance Feedback", IEEE Workshop on Content-based Access of Image and Video Libraries (CBAIVL-2000), in conjunction with IEEE CVPR-2000, pp. 10-13, 2000.
