
Image retrieval using hierarchical self-organizing feature maps

I.K. Sethi *,1, I. Coman

Vision and Neural Networks Laboratory, Department of Computer Science, Wayne State University, Detroit, MI 48202, USA

Abstract

This paper presents a scheme for image retrieval that lets a user retrieve images either by exploring summary views of the image collection at different levels or by similarity retrieval using query images. The proposed scheme is based on image clustering through a hierarchy of self-organizing feature maps. While the suggested scheme can work with any kind of low-level feature representation of images, our implementation and description of the system is centered on the use of image color information. Experimental results using a database of 2100 images are presented to show the efficacy of the suggested scheme. © 1999 Published by Elsevier Science B.V. All rights reserved.

Keywords: Exploration-based retrieval; Image databases; Image retrieval; Self-organizing feature maps

1. Introduction

The widespread availability of images and videos in digital form has created a growing interest in methods that can search image and video archives and retrieve images and videos of desired content. The current methods for providing content-based access to images and videos follow one of two approaches: (1) keyword-based retrieval (KBR) and (2) similarity-based retrieval (SBR). The KBR approach relies on manual cataloging to generate a set of descriptive keywords for each image or video. The keywords selected for an image are generally based on the most direct description of the objects present in the image. It is the approach most widely used by large on-line stock photography archives. Although simple and straightforward, the KBR approach has two main limitations. First, the descriptive keywords of an image do not provide any clue about its compositional aspects, which are important in many applications such as advertising. Second, users in different contexts or with different backgrounds tend to describe the same object using different descriptive terms, causing difficulties in image retrieval. Additionally, manual cataloging is prone to subjectivity and other cataloging errors.

The SBR approach follows the dictum that the best representation of an image is the image itself. Instead of assigning descriptive keywords to each image, a feature vector representation for each image is extracted at the time of image cataloging. The access to images in the SBR approach is provided by searching for images that exhibit feature vectors similar to the feature vector of the query image. Thus, the SBR approach lets a user search for images in an image database by presenting a query in visual form, making it more suitable to search images based on their compositional aspects. The SBR approach is also well suited for computerized indexing. Therefore, the SBR approach has received considerable attention in image processing, pattern recognition and database communities in the last few years. Several prototypical image retrieval systems have been built in recent years and some of these systems, e.g., QBIC (Niblack et al., 1993) and Virage (Bach et al., 1996), have been commercialized.

Pattern Recognition Letters 20 (1999) 1337–1345. www.elsevier.nl/locate/patrec

Electronic Annexes available. See www.elsevier.nl/locate/patrec.

* Corresponding author.

1 Present address: Department of Computer Science and Engineering, Oakland University, Rochester, MI 48309-4478, USA.

PII: S0167-8655(99)00103-8

A deficiency of the existing image-similarity retrieval systems is that they do not provide their users with a summary view of the images in the database. The availability of a summary view is important in situations where a user has no specific query image at the beginning of the search process and wants to explore the image collection to locate images of interest. The only way a user can obtain a feel for the images in a collection in existing systems is through random browsing of the thumbnail images. Such browsing is not guaranteed to take the user through the entire image collection. Furthermore, random browsing requires more time.

The goal of the present paper is to describe a scheme for image retrieval that lets a user retrieve images either by exploring summary views of the image collection at different levels, or by similarity retrieval using query images. The proposed scheme is based on image clustering through a hierarchy of self-organizing feature maps (Kohonen, 1995). While the suggested scheme can work with any kind of low-level feature representation of images, our implementation and description of the system is centered on the use of image color information.

The organization of the paper is as follows. Section 2 presents a method for encoding color composition information of images that is used to organize images for exploration and retrieval using hierarchical, self-organizing feature maps. Section 3 provides a brief exposition of hierarchical, self-organizing feature maps. Section 4 describes the proposed scheme for image retrieval by exploration and similarity search. The performance of the system is described in Section 5. Finally, a summary of the work and conclusions are presented in Section 6.

2. Color composition representation

The image exploration and retrieval scheme described here uses image color information to build a feature vector representation for images. The motivation for choosing color-based representation lies in the fact that color is an easily recognizable element of an image and the human visual system is capable of differentiating between an infinitely large number of colors. The use of color for similarity retrieval requires two main considerations: (1) the selection of the color space, i.e. the color-coordinate system and (2) a scheme for representing the color composition of an image. There is no consensus on the choice of color space; RGB, HSV, HSI and YUV have all been used in different systems. Histogramming is the most commonly used scheme to capture the color composition of an image. For 24-bit images, the number of bins in the color histogram is $2^{24}$. Since such a high resolution is not needed for image similarity retrieval, it is common to quantize the color space by reducing either the color resolution or the color depth (Wan and Jay Kuo, 1996). The global color histogram, whether quantized or not, suffers from one drawback: it is not able to capture the spatial component of the color composition of an image. This has led to many variations of histogramming. For example, a local histogramming approach is suggested by Gong et al. (1995) where an image is divided into nine equal partitions and each partition has its own local color histogram. A multi-level histogramming approach based on a quad-tree structure is used by Lu et al. (1994) to incorporate spatial components of the color composition of an image. Although these variations of color histogramming are able to capture the spatial distribution of color information, they do not provide an efficient representation scheme. The color information of each image is represented in a very high-dimensional space because of the many local histograms. This leads to high storage demands and inefficient searches during similarity retrieval.

Our image representation scheme is guided primarily by three major factors. First, the representation must be closely related to human visual perception, since a user determines whether a retrieval operation in response to an example query image is successful or not. Second, the representation must encode the spatial distribution of color in an image. Third, the representation should be as compact as possible to minimize storage and computation efforts. Following these considerations, we use the HSV (hue, saturation, value) color coordinate system, which correlates well with human color perception and is commonly used by artists. Since digital images are normally available in the RGB space, we use the conversion program given in (Foley et al., 1994) to obtain HSV values in the range $[0, 1]$.
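As an illustration of the RGB-to-HSV step, the following minimal sketch uses Python's standard `colorsys` module as a stand-in for the conversion program of Foley et al. (1994); that substitution, the 8-bit input range and the helper name `rgb8_to_hsv` are assumptions made here for illustration and are not taken from the original system.

```python
import colorsys


def rgb8_to_hsv(r, g, b):
    """Convert 8-bit RGB components to an (h, s, v) triplet in [0, 1].

    Assumption: colorsys stands in for the conversion routine cited in the
    text; both produce hue, saturation and value scaled to the unit range.
    """
    return colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)


if __name__ == "__main__":
    for name, rgb in [("red", (255, 0, 0)), ("blue", (0, 0, 255))]:
        print(name, rgb8_to_hsv(*rgb))
```

Under this convention pure red has hue 0 and pure blue has hue 2/3, with saturation and value both equal to 1.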

In order to represent the spatial distribution of color in an image, we rely on a fixed image-partitioning scheme. This is in contrast with several proposals in the literature (Smith and Chang, 1996) suggesting color-based segmentation to characterize the spatial distribution of color information. Although the color-based segmentation approach provides a more flexible representation and hence more powerful queries, we believe that these advantages are outweighed by the simplicity of the fixed partitioning approach. In the fixed partitioning scheme, each image is divided into $M \times N$ overlapping blocks as shown in Fig. 1. The overlapping blocks allow a certain amount of 'fuzziness' to be incorporated in the spatial distribution of color information, which helps in obtaining a better performance. To provide for partial-image queries, a masking bit is associated with each block. The default value of this bit for every block is one. Some of the mask bits are set to zero only during partial-image queries.
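The fixed partitioning with overlap can be sketched as follows. The text does not state how much the blocks overlap, so the `overlap` fraction (each block extended by 25% of its size on every side), the clipping at the image border and the function name `block_bounds` are assumptions chosen only for illustration.

```python
def block_bounds(height, width, m, n, overlap=0.25):
    """Return (top, bottom, left, right) pixel bounds for m x n overlapping blocks.

    Each nominal block of size (height / m) x (width / n) is extended on every
    side by `overlap` times its size and clipped to the image. The overlap
    amount is a hypothetical parameter, not a value given in the paper.
    """
    bh, bw = height / m, width / n
    pad_y, pad_x = overlap * bh, overlap * bw
    bounds = []
    for i in range(m):
        for j in range(n):
            top = max(0, int(round(i * bh - pad_y)))
            bottom = min(height, int(round((i + 1) * bh + pad_y)))
            left = max(0, int(round(j * bw - pad_x)))
            right = min(width, int(round((j + 1) * bw + pad_x)))
            bounds.append((top, bottom, left, right))
    return bounds


if __name__ == "__main__":
    blocks = block_bounds(64, 64, 8, 8)
    mask = [1] * len(blocks)  # default: every block takes part in a query
    mask[0] = 0               # a partial-image query clears selected bits
    print(len(blocks), blocks[0], blocks[1], sum(mask))
```

Keeping one masking bit per block alongside the block list makes a partial-image query a matter of clearing bits, without changing the representation itself.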

Three separate local histograms (hue, saturation, value) for each block are computed. Although these local histograms can be used to encode the spatial distribution of color information, the resulting representation is not compact enough. To obtain a compact representation, we extract from each local histogram the location of its area-peak. This is done by placing a fixed-sized window on the histogram at every possible location. At each location, the histogram area falling within the window is calculated. The location of the window yielding the highest area determines the histogram area-peak. This value then represents the corresponding histogram. Thus, each image is reduced to $3 \times M \times N$ numbers, three per block to account for the hue, saturation and value histograms. To demonstrate that our representation scheme is able to retain essential color information, we show in Fig. 2 two example images and their respective approximation using area-peak representation.
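The area-peak extraction described above can be sketched with a sliding-window sum. The number of histogram bins, the window width, the use of the window center as the reported location and the tie-breaking rule (first maximum wins) are not specified in the text and are assumptions of this sketch; the function names are likewise hypothetical.

```python
def histogram(values, bins):
    """Count values in [0, 1] into `bins` equal-width bins."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v * bins), bins - 1)] += 1
    return counts


def area_peak(hist, window):
    """Return the start bin of the fixed-size window with the largest area."""
    if not 1 <= window <= len(hist):
        raise ValueError("window must fit inside the histogram")
    area = sum(hist[:window])
    best_area, best_start = area, 0
    for start in range(1, len(hist) - window + 1):
        # Slide the window one bin to the right and update the area in O(1).
        area += hist[start + window - 1] - hist[start - 1]
        if area > best_area:
            best_area, best_start = area, start
    return best_start


def area_peak_value(hist, window):
    """Map the best window to a value in [0, 1] (assumed: the window center)."""
    return (area_peak(hist, window) + window / 2.0) / len(hist)


if __name__ == "__main__":
    hist = [0, 1, 5, 6, 1, 0, 0, 2]
    print(area_peak(hist, 2), area_peak_value(hist, 2))
```

Applying `area_peak_value` to the hue, saturation and value histograms of every block would yield the three numbers per block mentioned in the text.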

3. Hierarchical self-organizing feature maps

The self-organizing feature map (SOFM) is a neural network-based method for unsupervised clustering that maps high-dimensional data on a two-dimensional grid of neurons in such a way that similar high-dimensional data points are mapped to the same or neighboring neurons (Kohonen, 1995). While some distortion is inevitable, the mapping generally preserves the neighborhood relationships. Grids of other dimensions are also possible; however, two-dimensional rectangular or hexagonal grids are most common.

Fig. 1. The fixed partitioning scheme with overlapping blocks.

Fig. 2. Two examples of original and approximated images.

The SOFM learning process is a generalization of competitive learning. To construct a map, each neuron in the grid is initialized with small random weights. The neighborhood for each neuron, which shrinks with learning, is also initialized. The initialized weights then adapt through an iterative learning process consisting of the following steps:

1. randomly select an input vector and apply it to all neurons;
2. determine the winning neuron, i.e. the neuron whose weights most closely resemble the input vector;
3. bring the weights of the winning neuron closer to the input;
4. also bring the weights of the neurons in the neighborhood of the winning neuron closer to the input vector.

The learning process terminates when the weight adjustments become negligibly small.
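The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the SOM_PAK procedure: it uses a rectangular grid, a square ("bubble") neighborhood whose radius shrinks linearly, a linearly decaying learning rate and a fixed iteration count in place of a convergence test. All parameter names and default values are hypothetical.

```python
import random


def best_matching_unit(weights, x):
    """Step 2: return the grid position whose weight vector is closest to x."""
    best, best_d = None, float("inf")
    for pos, w in weights.items():
        d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if d < best_d:
            best, best_d = pos, d
    return best


def train_sofm(data, rows, cols, iterations=2000, lr0=0.5, seed=0):
    """Train a rows x cols map on `data` (a list of equal-length vectors)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Small random initial weights, one vector per grid position.
    weights = {(r, c): [rng.uniform(0.0, 0.1) for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for t in range(iterations):
        frac = 1.0 - t / iterations
        lr, radius = lr0 * frac, radius0 * frac   # both shrink with learning
        x = rng.choice(data)                      # step 1
        wr, wc = best_matching_unit(weights, x)   # step 2
        for (r, c), w in weights.items():
            # Steps 3 and 4: the winner (distance 0) and its grid neighbors
            # within the current radius move toward the input.
            if max(abs(r - wr), abs(c - wc)) <= radius:
                for k in range(dim):
                    w[k] += lr * (x[k] - w[k])
    return weights


if __name__ == "__main__":
    data = [[0.0, 0.0], [1.0, 1.0]]
    som = train_sofm(data, 2, 2)
    print(best_matching_unit(som, data[0]), best_matching_unit(som, data[1]))
```

Because every update is a convex combination of the current weight and an input, the weights remain inside the range spanned by the initial values and the data.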

A hierarchical self-organizing map (HSOFM) is formed by arranging several layers of two-dimensional maps in a hierarchy. For each map unit in one layer of the hierarchy, a two-dimensional map is added to the next layer. The learning in an HSOFM is done in a sequential fashion; the map at the first layer, the highest level of the hierarchy, is trained first. While the first layer map is trained with all the example vectors, the successive layer maps are trained only with those example vectors that are won by their respective parent map unit. In many instances, the maps at a lower level are trained with truncated example vectors by omitting those vector components that are equal in the original training vectors.

Compared to an SOFM, an HSOFM may be viewed as organizing the information at several levels, making finer and finer distinctions. This particular property of HSOFMs has been exploited by many researchers in the information retrieval area for organizing text documents and providing an exploratory search mode. For example, Merkl (1997) has used HSOFMs to generate a taxonomy of software manuals. Other applications include organization of full-text documents from Usenet newsgroups (Kohonen et al., 1996; Kaski et al., 1996), an analysis of AI literature (Lin et al., 1991) and context learning in natural language processing (Scholtes, 1991). It should be noted that the organization of information learned by an HSOFM is not independent of the HSOFM architecture; it depends, in addition to the training data, on the number of layers and the size of each layer.

4. Image retrieval by exploration and similarity search

Our scheme for image retrieval by exploration and similarity search uses an HSOFM of three layers as shown in Fig. 3. The first layer is called the global view layer and the corresponding map is called the global view map; it consists of a lattice of $r_1 \times c_1$ neurons. The function of this layer is to provide an overall summary view, in the form of a mosaic image, of the entire image collection. The second layer, called the regional layer, has $r_1 \times c_1$ maps. These maps are called regional view maps. Each map, consisting of a lattice of $r_2 \times c_2$ neurons, corresponds to a neuron in the global view layer and provides a finer summary of the associated images in the form of a mosaic image. The final layer in the hierarchy of the self-organizing maps consists of $r_2 \times c_2$ maps for each regional view map, one per regional neuron, with each map having $r_3 \times c_3$ neurons. The maps in the final layer are known as local view maps and the layer is called the local view layer. The local view maps provide yet another level of detailed summary and each element of a local view map points to a group of images, which are directly accessible to that map element. These images constitute another layer that is called the image layer.

Fig. 3. HSOFM architecture for image retrieval.

The system can operate in two modes: exploration mode and similarity search mode. In the exploration mode, a user simply browses through the image collection by traversing up and down the hierarchy and sideways at each level of the hierarchy to view a small set of images of chosen color composition. In the similarity search mode, a user accesses images via a query image. In this mode, both full and partial query modes are possible. In full query mode, the color composition of the full query image is used in the similarity search process. In partial query mode, a user has the option to specify which part of the query image should be used in the similarity search process. To provide for partial query mode, the system uses masking bits associated with each image block. The default setting of all masking bits is 1. Some of these bits are cleared in the partial query mode and the information from the corresponding blocks is not used in similarity computation. To perform similarity search, the color composition of the query image is first matched at the global view level to determine the most appropriate regional view that should be searched further. The matching is then repeated at the regional view level to locate the best matching local view map. A further search at the local view level brings out a set of images that may be most similar to the query image. These images are then individually compared with the query image to retrieve them in a ranked order.
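The coarse-to-fine matching in similarity search mode can be sketched as a three-step descent through the hierarchy. The text does not say which distance is used when matching a query against map neurons; the masked squared Euclidean distance below, the nested-dictionary layout and all names are assumptions made purely for illustration.

```python
def masked_sq_dist(w, x, mask):
    """Squared distance over the components whose mask entry is 1."""
    return sum(m * (wi - xi) ** 2 for wi, xi, m in zip(w, x, mask))


def nearest(weights, query, mask):
    """Return the grid position of the neuron closest to the query."""
    return min(weights, key=lambda pos: masked_sq_dist(weights[pos], query, mask))


def candidate_images(global_map, regional_maps, local_maps, image_lists,
                     query, mask):
    """Descend global -> regional -> local and return the images found there.

    regional_maps is keyed by global neuron, local_maps by (global, regional)
    neuron pair, image_lists by (global, regional, local) neuron triple.
    """
    g = nearest(global_map, query, mask)
    r = nearest(regional_maps[g], query, mask)
    l = nearest(local_maps[(g, r)], query, mask)
    return image_lists.get((g, r, l), [])


if __name__ == "__main__":
    # Toy hierarchy with one-component feature vectors.
    global_map = {(0, 0): [0.1], (0, 1): [0.9]}
    regional_maps = {(0, 1): {(0, 0): [0.8], (0, 1): [0.95]}}
    local_maps = {((0, 1), (0, 1)): {(0, 0): [0.93], (0, 1): [0.99]}}
    image_lists = {((0, 1), (0, 1), (0, 1)): ["img_17", "img_42"]}
    print(candidate_images(global_map, regional_maps, local_maps,
                           image_lists, [0.97], [1]))
```

The returned candidates would then be ranked individually against the query image.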

The rank ordering is calculated by block-by-block matching of the dominant HSV triplets of the query image and the target image. Let $q_i$ and $t_i$ represent the block number $i$ in a query ($Q$) and a target ($T$) image, respectively. Let $(h_{q_i}, s_{q_i}, v_{q_i})$ be the dominant hue–saturation–value triplet for the block $i$ of the query image. Let $(h_{t_i}, s_{t_i}, v_{t_i})$ represent the same in the target image. The block similarity is then defined by the following relationship:

$$S(q_i, t_i) = \frac{1}{1 + a(h_{q_i} - h_{t_i})^2 + b(s_{q_i} - s_{t_i})^2 + c(v_{q_i} - v_{t_i})^2},$$

where $a$, $b$ and $c$ are constants that are selected to define the relative importance of hue, saturation and value in similarity calculation. Using the similarities between the corresponding pairs of blocks from the query and target images, the similarity measure between a query–target image pair is computed by the following expression:

$$S(Q, T) = \frac{\sum_{i=1}^{M \times N} b_i \, S(q_i, t_i)}{\sum_{i=1}^{M \times N} b_i},$$

where $b_i$ stands for the masking bit for block $i$ and $M \times N$ is the number of blocks.
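The two similarity expressions translate directly into code. In the sketch below each image is a list of per-block $(h, s, v)$ triplets; the default weights 2.5, 0.5 and 3.0 are the values reported in the performance section, while the function names and the error raised for an all-zero mask are assumptions of this illustration.

```python
def block_similarity(q, t, a=2.5, b=0.5, c=3.0):
    """S(q_i, t_i) for two (h, s, v) triplets."""
    dh, ds, dv = q[0] - t[0], q[1] - t[1], q[2] - t[2]
    return 1.0 / (1.0 + a * dh * dh + b * ds * ds + c * dv * dv)


def image_similarity(query, target, mask=None, a=2.5, b=0.5, c=3.0):
    """S(Q, T): masked average of the block similarities."""
    if mask is None:
        mask = [1] * len(query)  # full query: every block is used
    total = sum(mask)
    if total == 0:
        raise ValueError("at least one masking bit must be set")
    weighted = sum(m * block_similarity(q, t, a, b, c)
                   for q, t, m in zip(query, target, mask))
    return weighted / total


if __name__ == "__main__":
    query = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
    target = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
    print(image_similarity(query, target))               # full query
    print(image_similarity(query, target, mask=[0, 1]))  # partial query
```

Identical blocks score 1, and the score decreases toward 0 as the weighted squared differences grow; clearing a masking bit removes that block from both the numerator and the denominator.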

Before the system can be used, maps for different layers must be trained using the images that the system is expected to handle. The training is performed using the self-organizing feature map learning briefly described earlier. The global view layer is trained using all the images. The subsequent layers are trained with the respective image subsets; for example, a regional map corresponding to the neuron $(i, j)$ of the global view is trained with an image subset consisting of only those images from the entire collection that are 'won' by the neuron $(i, j)$ of the global view at the end of training.
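Forming the training subsets for the next layer amounts to grouping the feature vectors by the parent neuron that wins them. The sketch below assumes squared Euclidean distance for deciding the winner and uses hypothetical names; it covers only the grouping step, not the training itself.

```python
from collections import defaultdict


def winner(weights, x):
    """Return the grid position of the neuron whose weights are closest to x."""
    return min(weights,
               key=lambda pos: sum((w - xi) ** 2 for w, xi in zip(weights[pos], x)))


def subsets_by_winner(weights, vectors):
    """Map each neuron position to the indices of the vectors it wins."""
    subsets = defaultdict(list)
    for index, x in enumerate(vectors):
        subsets[winner(weights, x)].append(index)
    return dict(subsets)


if __name__ == "__main__":
    trained = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
    vectors = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.0]]
    print(subsets_by_winner(trained, vectors))
```

Each resulting subset would then be used to train the child map attached to the corresponding neuron.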

Once the maps at different levels are fixed, the changes to the image database, for example addition of new images or deletion of some existing images, are made only at the image layer level. This is done to avoid retraining; however, as more and more modifications to the image layer are made, the summary maps start having inaccurate information. In such a situation, a retraining of the system is performed to generate a new set of summary maps.

5. Performance

In this section, we present some results to show the performance of the system. These results are based on an implementation using a database of 2100 images. The implemented system has a global layer of $6 \times 6$ neurons. The regional layer consists of 36 maps of size $4 \times 4$. The number of maps in the local layer is 576; each map has $3 \times 3$ neurons. The main criterion for the selected architecture was to have a moderate number of images associated with each neuron at the lowest level of the hierarchy. All layers use a hexagonal lattice structure, which was found to yield relatively even cluster sizes at different levels of hierarchy.
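The map counts quoted above follow from the layer sizes, since every neuron of one layer owns one map in the next layer. The short check below reproduces that arithmetic; the variable names are illustrative only.

```python
# Lattice sizes of the implemented system, as given in the text.
r1, c1 = 6, 6   # global view map
r2, c2 = 4, 4   # each regional view map
r3, c3 = 3, 3   # each local view map

regional_maps = r1 * c1                    # one regional map per global neuron
local_maps = regional_maps * r2 * c2       # one local map per regional neuron
local_neurons = local_maps * r3 * c3       # neurons that point to image groups

print(regional_maps, local_maps, local_neurons)
```

The total number of local view neurons, 5184, is a derived figure that does not appear in the text.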

To train the HSOFM, each image was partitioned into $8 \times 8$ overlapping blocks as described earlier and was represented by a 192-component vector consisting of 64 elements each of hue, saturation and value. The training was done using the SOM_PAK software (Kohonen et al., 1995). All layers were trained in two stages consisting of 10,000 iterations each. The first stage is the ordering phase during which the reference vectors of the map units are ordered. During the second stage, the values of the reference vectors are fine-tuned. At the end of the training for each layer, the weight vectors for each neuron were mapped into an image and mosaic images for the global view, various regional views and numerous local views were constructed. A small modification was made to the training procedure described earlier. The modification involves defining the training subset of images for maps at regional and local layers. The image subset for each map at these layers was constituted by pooling images won by the parent neuron as well as the images won by its three closest neighbors. This modification was made in consideration of the size of the image database used in our experiment.
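The pooling modification can be sketched as follows. The text does not say whether the "three closest neighbors" are measured on the lattice or in weight space; this sketch assumes weight-space distance, and the function and argument names are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two weight vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pooled_subset(weights, subsets, parent, k=3):
    """Pool the images won by `parent` with those won by its k closest neurons.

    weights maps grid positions to weight vectors; subsets maps grid
    positions to lists of image indices won by that neuron.
    """
    others = sorted((p for p in weights if p != parent),
                    key=lambda p: sq_dist(weights[p], weights[parent]))
    pooled = set(subsets.get(parent, []))
    for p in others[:k]:
        pooled.update(subsets.get(p, []))
    return sorted(pooled)


if __name__ == "__main__":
    weights = {(0, 0): [0.0], (0, 1): [0.1], (1, 0): [0.2],
               (1, 1): [0.3], (2, 0): [0.9]}
    subsets = {(0, 0): [0], (0, 1): [1, 2], (1, 0): [], (1, 1): [3], (2, 0): [4]}
    print(pooled_subset(weights, subsets, (0, 0)))
```

Pooling enlarges each child map's training set, which matters when the collection is small relative to the number of maps.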

Fig. 4 shows the global view of the entire image collection. The different color compositions that are present in the image database are clearly seen in this global view. It is evident from the global view that not many images with dominating red hue are present. Furthermore, the global view mosaic shows a gradual change in color composition as we move in any direction in the global view map. This is due to the topology preserving property of the self-organizing maps.

Fig. 4. The global view of the experimental database of 2100 images.

Fig. 5. Two regional views of the image database. The view on the left corresponds to the third tile of the first row of the global view map. The view on the right is for the last tile of the fourth row of the global view map.


Fig. 5 presents two regional views which show the next level of views corresponding to two different areas of the global view. Similarly, Fig. 6 shows two local views. These views from different layers indicate that images have been organized at several levels according to their color content. It is easy to see that the hierarchical organization of images provides a convenient method for image exploration. For example, the map areas in the top left corner of the global image correspond to images with predominance of blue. A user searching for sky images could explore this region of the map at regional and local levels to see finer distinctions and finally to retrieve a set of images with dominating blue in the upper part of the images. One such result of image retrieval by exploration is shown in Fig. 7.

Figs. 8 and 9 show two sets of retrieval results for the similarity search mode. The first image in each set was used as a query image. The numerical value above each thumbnail image represents the similarity value between the thumbnail and the query image. The three constants, $a$, $b$ and $c$, of the similarity measure defined earlier were taken as 2.5, 0.5 and 3.0, respectively. These values were chosen empirically. The keyword above each thumbnail comes from the image CD. It is evident that these keywords are too general to convey the image content.

Fig. 6. Two local views of the image database. The view on the left corresponds to the first tile of the first regional view of Fig. 5. The view on the right is for the first tile of the second regional view of Fig. 5.

Fig. 7. Image retrieval by exploration. These images are retrieved when a user arrives at the first local view map of Fig. 6 and clicks on the middle image tile in the last row.


Fig. 8. An example of image retrieval via similarity search mode. The top left image was used as the query image.

Fig. 9. Another example of image retrieval.


6. Summary and conclusions

A scheme for image retrieval using color composition information was presented. The salient feature of the scheme is its ability to provide an image exploration mode in addition to similarity search mode. The exploration mode is useful as it provides a summary view of the image collection at different levels of detail. The summary views are made possible due to the use of hierarchical, self-organizing feature maps. These maps provide a trainable method of organizing images into different clusters based on color composition information. While the present scheme is meant for image color information, similar schemes are possible using other image features such as shape and texture. We are currently investigating such implementations.

For further reading, see (Sethi et al., 1998).

Discussion

Kamel: I would like to hear your comments about another application, in which the similarity in the images is not expressed by the color or by the shape, but rather in the semantics. What is your comment on how to handle this?

Sethi: I can give you a few ideas on that. Most people ignore that issue. My opinion is to move a little bit up in terms of the features. Instead of using the low-level features, one may try to extract mid-level features. Each of these mid-level features can be associated with a set of semantic concepts. Through relaxation, or some other similar scheme, one can then narrow down the semantic concepts associated with a collection of mid-level features detected in an image. (Note of the editors: at this point, recording of the discussion was interrupted by a power failure).

References

Bach, J.R. et al., 1996. Virage image search engine: an open framework for image management. Proc. SPIE: Storage and Retrieval for Image and Video Databases 2670, 76–87.

Foley, J.D., van Dam, A., Feiner, S.K., Hughes, J.F., Phillips, R.L., 1994. Introduction to Computer Graphics. Addison-Wesley, Reading, MA.

Gong, Y., Chua, H., Guo, X., 1995. Image indexing and retrieval based on color histogram. In: Proc. 2nd Internat. Conf. Multimedia Modeling, Singapore, pp. 115–126.

Kaski, S., Honkela, T., Lagus, K., Kohonen, T., 1996. Creating an order in digital libraries with self-organizing maps. In: Proc. World Congress on Neural Networks, pp. 814–817.

Kohonen, T., 1995. Self-Organizing Maps. Springer, Berlin.

Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., 1995. SOM_PAK: The self-organizing map program package, Helsinki University of Technology, Finland.

Kohonen, T., Kaski, S., Lagus, K., Honkela, T., 1996. Very large two-level SOM for the browsing of newsgroup. In: Proc. Internat. Conf. Artificial Neural Networks, Bochum, Germany.

Lin, X., Soergel, D., Marchionini, G., 1991. A self-organizing semantic map for information retrieval. In: Proc. 14th Annual Internat. ACM SIGIR Conf., Chicago, IL, pp. 192–201.

Lu, H., Ooi, B., Tan, K., 1994. Efficient image retrieval by color contents. In: Proc. Internat. Conf. Applications of Databases, Vadstena, Sweden, pp. 95–108.

Merkl, D., 1997. Exploration of text collections with hierarchical feature maps. In: Proc. 20th Annual Internat. ACM SIGIR Conf., Philadelphia, PA, pp. 186–195.

Niblack, W. et al., 1993. The QBIC project: querying images by content using color, texture and shape. Proc. SPIE: Storage and Retrieval for Image and Video Databases 1908, 173–187.

Scholtes, J.C., 1991. Unsupervised learning and the information retrieval problem. In: Proc. Internat. Joint Conf. Neural Networks, Seattle, Washington, pp. 95–100.

Sethi, I.K. et al., 1998. Color-Wise: A system for image similarity retrieval using color. Proc. SPIE: Storage and Retrieval for Image and Video Databases 3312, 140–149.

Smith, J.R., Chang, S.-F., 1996. Tools and techniques for color image retrieval. Proc. SPIE: Storage and Retrieval for Image and Video Databases IV 2670, 426–437.

Wan, X., Jay Kuo, C.-C., 1996. Color distribution analysis and quantization for image retrieval. Proc. SPIE: Storage and Retrieval for Image and Video Databases IV 2670, 8–16.

