
Journal of Machine Learning Research 5 (2004) 913–939    Submitted 7/03; Revised 11/03; Published 8/04

Image Categorization by Learning and Reasoning with Regions

Yixin Chen    yixin@cs.uno.edu

Department of Computer Science
University of New Orleans
New Orleans, LA 70148, USA

James Z. Wang    jwang@ist.psu.edu

School of Information Sciences and Technology
The Pennsylvania State University
University Park, PA 16802, USA

Editor: Donald Geman

Abstract

Designing computer programs to automatically categorize images using low-level features is a challenging research topic in computer vision. In this paper, we present a new learning technique, which extends Multiple-Instance Learning (MIL), and its application to the problem of region-based image categorization. Images are viewed as bags, each of which contains a number of instances corresponding to regions obtained from image segmentation. The standard MIL problem assumes that a bag is labeled positive if at least one of its instances is positive; otherwise, the bag is negative. In the proposed MIL framework, DD-SVM, a bag label is determined by some number of instances satisfying various properties. DD-SVM first learns a collection of instance prototypes according to a Diverse Density (DD) function. Each instance prototype represents a class of instances that is more likely to appear in bags with the specific label than in the other bags. A nonlinear mapping is then defined using the instance prototypes and maps every bag to a point in a new feature space, named the bag feature space. Finally, standard support vector machines are trained in the bag feature space. We provide experimental results on an image categorization problem and a drug activity prediction problem.

Keywords: image categorization, multiple-instance learning, support vector machines, image classification, image segmentation

1. Introduction

The term image categorization refers to the labeling of images into one of a number of predefined categories. Although this is usually not a very difficult task for humans, it has proved to be an extremely difficult problem for machines (or computer programs). Major sources of difficulties include variable and sometimes uncontrolled imaging conditions, complex and hard-to-describe objects in an image, objects occluding other objects, and the gap between arrays of numbers representing physical images and conceptual information perceived by humans. In this paper, an object in the physical world, which we live in, refers to anything that is visible or tangible and is relatively stable in form. An object in an image is defined as a region, not necessarily connected, which is a projection of an object in the physical world. Designing automatic image categorization algorithms has been an important research field for decades. Potential applications include digital libraries, space science, Web searching, geographic information systems, biomedicine, surveillance and sensor systems, commerce, and education.

©2004 Yixin Chen and James Z. Wang.

Figure 1: Sample images (a)–(g), each belonging to one of the categories Mountains and glaciers, Skiing, and Beach.

1.1 Overview of Our Approach

Although color and texture are fundamental aspects of visual perception, human discernment of certain visual contents could be potentially associated with interesting classes of objects or the semantic meaning of objects in the image. For example, if we are asked to decide which images in Figure 1 are images about Mountains and glaciers, Skiing, and Beach, at a single glance we may come up with the following answers together with supporting arguments:

• Images (a) and (b) are images about mountains and glaciers since we see mountains in them;

• Images (c), (d) and (e) are skiing images since there are snow, people, and perhaps a steep slope or mountain in them;

• Images (f) and (g) are beach images since we see either people playing in water or people on sand.

This seems to be effortless for humans because prior knowledge of similar images and objects may provide powerful assistance for us in recognition. Given a set of labeled images, can a computer program learn such knowledge or semantic concepts from implicit information of objects contained in images? This is the question we attempt to address in this work.

In terms of image representation, our approach is a region-based method. Images are segmented into regions such that each region is roughly homogeneous in color and texture. Each region is characterized by one feature vector describing color, texture, and shape attributes. Consequently, an image is represented by a collection of feature vectors. If segmentation is ideal, regions will correspond to objects. But, in general, semantically accurate image segmentation by a computer program is still an ambitious long-term goal for computer vision researchers (see Shi and Malik, 2000; Wang et al., 2001a; Zhu and Yuille, 1996). Here, semantically accurate image segmentation refers to building a one-to-one mapping between regions generated by an image segmentation algorithm and objects in the image. Nevertheless, we argue that region-based image representation can provide some useful information about objects even though segmentation may not be perfect. Moreover, empirical results in Section 4.3 demonstrate that the proposed method has low sensitivity to inaccurate image segmentation.

From the perspective of learning, our approach is a generalization of supervised learning, in which labels are associated with images instead of individual regions. This is in essence identical to the Multiple-Instance Learning (MIL) setting (Dietterich et al., 1997; Blum and Kalai, 1998; Maron and Lozano-Perez, 1998; Zhang and Goldman, 2002), where images and regions are respectively called bags and instances. In this paper, a “bag” refers to an “image,” and an “instance” refers to a “region.” MIL assumes that every instance possesses an unknown label that is indirectly accessible through labels attached to bags.

1.2 Related Work in Multiple-Instance Learning

Several researchers have applied MIL to image classification and retrieval (Andrews et al., 2003; Maron and Ratan, 1998; Zhang et al., 2002; Yang and Lozano-Perez, 2000). Key assumptions of their formulation of MIL are that bags and instances share the same set of labels (or categories or classes or topics), and that a bag receives a particular label if at least one of the instances in the bag possesses the label. For binary classification, this implies that a bag is “positive” if at least one of its instances is a positive example; otherwise, the bag is “negative.” Therefore, learning focuses on finding “actual” positive instances in positive bags. The formulations of MIL in image classification and retrieval fall into two categories: the Diverse Density approach (Maron and Ratan, 1998; Zhang et al., 2002) and the Support Vector Machine (SVM) approach (Andrews et al., 2003).

• In the Diverse Density approach, an objective function, called the Diverse Density (DD) function (Maron and Lozano-Perez, 1998), is defined over the instance feature space, in which instances can be viewed as points. The DD function measures a co-occurrence of similar instances from different bags with the same label. A feature point with large Diverse Density indicates that it is close to at least one instance from every positive bag and far away from every negative instance. The DD approach searches the instance feature space for points with high Diverse Density. Once a point with the maximum DD is found, a new bag is classified according to the distances between instances in the bag and the maximum DD point: if the smallest distance is less than a certain fixed threshold, the bag is classified as positive; otherwise, the bag is classified as negative. The major difference between Maron’s method and Zhang’s method lies in the way to search for a maximum DD point. Zhang’s method is relatively insensitive to the dimension of the instance feature space and scales up well to the average bag size, i.e., the average number of instances in a bag (Zhang and Goldman, 2002). Empirical studies demonstrate that DD-based MIL can learn certain simple concepts of natural scenes, such as waterfalls, sunsets, and mountains, using features of subimages or regions (Maron and Ratan, 1998; Zhang et al., 2002).

• Andrews et al. (2003) use SVMs (Burges, 1998) to solve the MIL problem. In particular, MIL is formulated as a mixed integer quadratic program. In their formulation, integer variables are selector variables that select which instance in a positive bag is the positive instance. Their algorithm, which is called MI-SVM, has an outer loop and an inner loop. The outer loop sets the values of these selector variables. The inner loop then trains a standard SVM in which the selected positive instances replace the positive bags. The outer loop stops if none of the selector variables changes value in two consecutive iterations. Andrews et al. (2003) show that MI-SVM outperforms the DD approach described in Zhang and Goldman (2002) on a set of images belonging to three different categories (“elephant,” “fox,” and “tiger”). The difference between MI-SVM and the DD approach can also be viewed from the shape of the corresponding classifier’s decision boundary in the instance feature space. The decision boundary of a DD classifier is an ellipsoidal sphere because classification is based exclusively on the distance to the maximum DD point.1 For MI-SVM, depending on the kernel used, the decision boundary can be a hyperplane in the instance feature space or a hyperplane in the kernel-induced feature space, which may correspond to very complex boundaries in the instance feature space.

1. The maximum DD algorithms described in Maron and Lozano-Perez (1998) and Zhang and Goldman (2002) produce a point in the instance feature space together with scaling factors for each feature dimension. Therefore, the decision boundary is an ellipsoidal sphere instead of a sphere.

1.3 A New Formulation of Multiple-Instance Learning

In the above MIL formulations, a bag is essentially summarized by one of its instances, i.e., an instance with the maximal label (considering binary classification with 1 and −1 representing the positive and negative classes, respectively). However, these formulations have a drawback for image categorization tasks: in general, a concept about images may not be captured by a single region (instance) even if image segmentation and object recognition are assumed to be ideal (inaccurate segmentation and recognition will only worsen the situation). As a simple example, consider categorizing Mountains and glaciers versus Skiing images in Figure 1. To classify a scene as involving skiing, it is helpful to identify snow, people, and perhaps a mountain. If an image is viewed as a bag of regions, then the standard MIL formulation cannot realize this, because a bag is labeled positive if any one region in the bag is positive. In addition, a class might also be disjunctive. As shown by Figure 1 (f) and (g), a Beach scene might involve either people playing in water or people on sand. Thus we argue that the correct categorization of an image depends on identifying multiple aspects of the image. This motivates our extension of MIL, in which a bag must contain a number of instances satisfying various properties (e.g., people, snow, etc.).

In our approach, MIL is formulated as a maximum margin problem in a new feature space defined by the DD function. The new approach, named DD-SVM, proceeds in two steps. First, in the instance feature space, a collection of feature vectors, each of which is called an instance prototype, is determined according to DD. Each instance prototype is chosen to be a local maximizer of the DD function. Since DD measures the co-occurrence of similar instances from different bags with the same label, loosely speaking, an instance prototype represents a class of instances (or regions) that is more likely to appear in bags (or images) with the specific label than in the other bags (or images). Second, a nonlinear mapping is defined using the learned instance prototypes and maps every bag to a point in a new feature space, which is named the bag feature space. In the bag feature space, the original MIL problem becomes an ordinary supervised learning problem. Standard SVMs are then trained in the bag feature space.

DD-SVM is similar to MI-SVM in the sense that both approaches apply SVMs to solve the MIL problem. However, in DD-SVM, several features are defined for each bag. Each bag feature could be defined by a separate instance within the bag (i.e., the instance that is most similar to an instance prototype). Hence, the bag features summarize the bag along several dimensions defined by instance prototypes. This is in stark contrast to MI-SVM, in which one instance is selected to represent the whole positive bag.

1.4 Related Work in Image Categorization

In the areas of image processing, computer vision, and pattern recognition, there has been an abundance of prior work on detecting, recognizing, and classifying a relatively small set of objects or concepts in specific domains of application (Forsyth and Ponce, 2002; Marr, 1983; Strat, 1992). We only review work most relevant to what we propose, which is by no means a comprehensive list of work in the cited areas.

As one of the simplest representations of digital images, histograms have been widely used for various image categorization problems. Szummer and Picard (1998) use a k-nearest neighbor classifier on color histograms to discriminate between indoor and outdoor images. In the work of Vailaya et al. (2001), Bayesian classifiers using color histograms and edge direction histograms are implemented to organize sunset/forest/mountain images and city/landscape images, respectively. Chapelle et al. (1999) apply SVMs, which are built on color histogram features, to classify images containing a generic set of objects. Although histograms can usually be computed with little cost and are effective for certain classification tasks, an important drawback of a global histogram representation is that information about spatial configuration is ignored. Many approaches have been proposed to tackle this drawback. In the method of Huang et al. (1998), a classification tree is constructed using color correlograms. A color correlogram captures the spatial correlation of colors in an image. Gdalyahu and Weinshall (1999) apply local curve matching for shape silhouette classification, in which objects in images are represented by their outlines.

A number of subimage-based methods have been proposed to exploit local and spatial properties by dividing an image into rectangular blocks. In the method introduced by Gorkani and Picard (1994), an image is first divided into 16 non-overlapping equal-sized blocks. Dominant orientations are computed for each block. The image is then classified as a city or suburb scene as determined by the majority orientations of blocks. Wang et al. (2001b) develop a graph/photograph classification algorithm.2 The classifier partitions an image into blocks and classifies every block into one of two categories based on wavelet coefficients in high frequency bands. If the percentage of blocks classified as photograph is higher than a threshold, the image is marked as a photograph; otherwise, the image is marked as a graph. Yu and Wolf (1995) present a one-dimensional Hidden Markov Model (HMM) for indoor/outdoor scene classification. The model is trained on vector quantized color histograms of image blocks. In the recent ALIP system (Li and Wang, 2003), a concept corresponding to a particular category of images is captured by a two-dimensional multiresolution HMM trained on color and texture features of image blocks. Murphy et al. (2004) propose four graphical models that relate features of image blocks to objects and perform joint scene and object recognition.

2. As defined by Wang et al. (2001b), a graph image is an image containing mainly text, graph, and overlays; a photograph is a continuous-tone image.

Although a rigid partition of an image into rectangular blocks preserves certain spatial information, it often breaks an object into several blocks or puts different objects into a single block. Thus visual information about objects, which could be beneficial to image categorization, may be destroyed by a rigid partition. The ALIP system (Li and Wang, 2003) uses a small block size (4 × 4) for feature extraction to avoid this problem. Image segmentation is one way to extract object information. It decomposes an image into a collection of regions, which correspond to objects if decomposition is ideal. Segmentation-based algorithms can take into consideration the shape information, which is in general not available without segmentation.

Image segmentation has been successfully used in content-based image and video analysis (e.g., Carson et al., 2002; Chen and Wang, 2002; Ma and Manjunath, 1997; Modestino and Zhang, 1992; Smith and Li, 1999; Vasconcelos and Lippman, 1998; Wang et al., 2001b). Modestino and Zhang (1992) apply a Markov random field model to capture spatial relationships between regions. Image interpretation is then given by a maximum a posteriori rule. The SIMPLIcity system (Wang et al., 2001b) classifies images into textured or nontextured classes based upon how evenly a region scatters in an image. Mathematically, this is described by the goodness of match, measured by the χ² statistic, between the distribution of the region and a uniform distribution. Smith and Li (1999) propose a method for classifying images by spatial orderings of regions. Their system decomposes an image into regions, with the attribute of interest of each region represented by a symbol that corresponds to an entry in a finite pattern library. The region string is converted to a composite region template descriptor matrix that enables classification using spatial information. Vasconcelos and Lippman (1998) model image retrieval as a classification problem based on the principle of Bayesian inference. The information of the regions identified as human skin is used in the inference. Very interesting results have been achieved in associating words to images based on regions (Barnard and Forsyth, 2001) or relating words to image regions (Barnard et al., 2003). In their method, an image is modeled as a sequence of regions and a sequence of words generated by a hierarchical statistical model. The method demonstrates the potential for searching images. But as noted by Barnard and Forsyth (2001), the method relies on semantically meaningful segmentation, which, as mentioned earlier, is still an open problem in computer vision.

1.5 Outline of the Paper

The remainder of the paper is organized as follows. Section 2 describes image segmentation and feature representation. Section 3 presents DD-SVM, an extension of MIL. Section 4 describes the extensive experiments we have performed and provides the results. Finally, we conclude in Section 5, together with a discussion of future work.

2. Image Segmentation and Representation

In this section we describe a simple image segmentation procedure based on color and spatial variation features using a k-means algorithm (Hartigan and Wong, 1979). For general-purpose images such as the images in a photo library or images on the World Wide Web, precise object segmentation is nearly as difficult as natural language semantics understanding. However, semantically precise segmentation is not crucial to our system. As we will demonstrate in Section 4, our image categorization method has low sensitivity to inaccurate segmentation. Image segmentation is a well-studied topic (e.g., Shi and Malik, 2000; Wang et al., 2001a; Zhu and Yuille, 1996). The focus of this paper is not to achieve superior segmentation results but good categorization performance. The major advantage of the proposed image segmentation is its low computational cost.

To segment an image, the system first partitions the image into non-overlapping blocks of size 4 × 4 pixels. A feature vector is then extracted for each block. The block size is chosen to compromise between texture effectiveness and computation time. Smaller block size may preserve more texture details but increase the computation time as well. Conversely, increasing the block size can reduce the computation time but lose texture information and increase the segmentation coarseness. Each feature vector consists of six features. Three of them are the average color components in a block. We use the well-known LUV color space, where L encodes luminance and U and V encode color information (chrominance). The other three represent the square root of energy in the high-frequency bands of the wavelet transforms (Daubechies, 1992), that is, the square root of the second order moment of wavelet coefficients in high-frequency bands.


Figure 2: Segmentation results given by the k-means clustering algorithm; the segmented images contain 2, 3, 7, 2, 3, and 4 regions, respectively. First row: original images. Second row: regions shown in their representative colors.

To obtain these moments, a Daubechies-4 wavelet transform is applied to the L component of the image. After a one-level wavelet transform, a 4 × 4 block is decomposed into four frequency bands: the LL, LH, HL, and HH bands. Each band contains 2 × 2 coefficients. Without loss of generality, we suppose the coefficients in the HL band are $\{c_{k,l}, c_{k,l+1}, c_{k+1,l}, c_{k+1,l+1}\}$. One feature is

$$ f = \left( \frac{1}{4} \sum_{i=0}^{1} \sum_{j=0}^{1} c_{k+i,\,l+j}^{2} \right)^{\frac{1}{2}}. $$

The other two features are computed in the same way from the LH and HH bands. Unser (1995) shows that moments of wavelet coefficients in various frequency bands are effective for representing texture. For example, the HL band shows activities in the horizontal direction. An image with vertical strips thus has high energy in the HL band and low energy in the LH band.
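As an illustration, the six block features described above can be computed roughly as follows. This is a minimal sketch using NumPy and PyWavelets (both assumed available); the `block_features` name, the array layout, the 'periodization' boundary mode, and the mapping of PyWavelets' detail bands onto the HL/LH/HH naming are assumptions rather than details specified in the paper.

```python
import numpy as np
import pywt  # PyWavelets, assumed available


def block_features(block_luv):
    """Six features for one 4x4 block: mean L, U, V plus the square root of the
    average squared wavelet coefficient in each high-frequency band.
    `block_luv` is a 4x4x3 array of LUV values (illustrative layout)."""
    mean_luv = block_luv.reshape(-1, 3).mean(axis=0)      # average color components
    L = block_luv[:, :, 0]
    # One-level Daubechies-4 transform; 'periodization' keeps each band 2x2.
    _, details = pywt.dwt2(L, 'db4', mode='periodization')
    # f = (1/4 * sum of squared coefficients)^(1/2) for each detail band
    texture = [np.sqrt(np.mean(c ** 2)) for c in details]
    return np.concatenate([mean_luv, texture])
```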

The k-means algorithm is used to cluster the feature vectors into several classes, with every class corresponding to one “region” in the segmented image. No information about the spatial layout of the image is used in defining the regions, so they are not necessarily spatially contiguous. The algorithm does not specify the number of clusters, N, to choose. We adaptively select N by gradually increasing N until a stopping criterion is met. The number of clusters in an image changes in accordance with the adjustment of the stopping criteria. A detailed description of the stopping criteria can be found in Wang et al. (2001b). Examples of segmentation results are shown in Figure 2. Segmented regions are shown in their representative colors. It takes less than one second on average to segment a 384 × 256 image on a Pentium III 700MHz PC running the Linux operating system. Since it is almost impossible to find a stopping criterion that is best suited for a large collection of images, images sometimes may be under-segmented or over-segmented. However, our categorization method has low sensitivity to inaccurate segmentation.
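A rough sketch of this adaptive clustering step is shown below, using scikit-learn's KMeans. The stopping rule used here (a small relative drop in distortion) is only an illustrative stand-in for the criterion of Wang et al. (2001b), which is not reproduced in this paper.

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed available


def segment_blocks(block_feats, max_regions=16, improvement_tol=0.05):
    """Cluster 6-d block features into regions, increasing the number of clusters
    until the relative drop in distortion becomes small (illustrative rule only)."""
    labels, prev_inertia = None, None
    for n in range(2, max_regions + 1):
        km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(block_feats)
        if prev_inertia is not None and \
                (prev_inertia - km.inertia_) / prev_inertia < improvement_tol:
            break                       # adding another cluster no longer helps much
        labels, prev_inertia = km.labels_, km.inertia_
    return labels                       # one region label per 4x4 block
```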

After segmentation, the mean of the set of feature vectors corresponding to each region $R_j$ (a subset of $\mathbb{Z}^2$) is computed and denoted as $\mathbf{f}_j$. Three extra features are also calculated for each region to describe shape properties. They are the normalized inertia (Gersho, 1979) of order 1, 2, and 3. For a region $R_j$ in the image plane, the normalized inertia of order $\gamma$ is given as

$$ I(R_j, \gamma) = \frac{\sum_{r \in R_j} \| r - \bar{r} \|^{\gamma}}{V_j^{\,1 + \gamma/2}}, $$

where $\bar{r}$ is the centroid of $R_j$ and $V_j$ is the number of pixels in region $R_j$. The normalized inertia is invariant to scaling and rotation. The minimum normalized inertia on a 2-dimensional plane is achieved by circles. Denote the $\gamma$-th order normalized inertia of circles as $I_\gamma$. We define the shape features of region $R_j$ as

$$ \mathbf{s}_j = \left[ \frac{I(R_j,1)}{I_1}, \; \frac{I(R_j,2)}{I_2}, \; \frac{I(R_j,3)}{I_3} \right]^{T}. $$

Finally, an image $B_i$, which is segmented into $N_i$ regions $\{R_j : j = 1, \dots, N_i\}$, is represented by a collection of feature vectors $\{\mathbf{x}_{ij} : j = 1, \dots, N_i\}$. Each $\mathbf{x}_{ij}$ is a 9-dimensional feature vector, corresponding to region $R_j$, defined as

$$ \mathbf{x}_{ij} = \left[ \mathbf{f}_j^{T}, \, \mathbf{s}_j^{T} \right]^{T}. $$
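For concreteness, the 9-dimensional region descriptor can be assembled as in the sketch below (NumPy only). The function names are illustrative, and the normalized inertia of a circle for orders 1-3 is treated as a precomputed constant; how it is obtained is not spelled out in this section.

```python
import numpy as np


def normalized_inertia(coords, gamma):
    """I(R, gamma) = sum_r ||r - centroid||^gamma / V^(1 + gamma/2),
    where `coords` is a (V, 2) array of pixel coordinates of the region."""
    centroid = coords.mean(axis=0)
    dist = np.linalg.norm(coords - centroid, axis=1)
    V = len(coords)
    return np.sum(dist ** gamma) / V ** (1.0 + gamma / 2.0)


def region_feature(block_feats_in_region, coords, circle_inertia):
    """x_ij = [f_j, s_j]: 6-d mean block feature plus 3-d shape feature.
    `circle_inertia[g-1]` holds I_g for a circle (assumed precomputed)."""
    f_j = block_feats_in_region.mean(axis=0)
    s_j = np.array([normalized_inertia(coords, g) / circle_inertia[g - 1]
                    for g in (1, 2, 3)])
    return np.concatenate([f_j, s_j])
```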

3. An Extension of Multiple-Instance Learning

In this section, we first introduce DD-SVM, a maximum margin formulation of MIL in a bag feature space. We then describe one way to construct a bag feature space using Diverse Density. Finally, we compare DD-SVM with another SVM-based MIL formulation, MI-SVM, proposed by Andrews et al. (2003).

3.1 Maximum Margin Formulation of MIL in a Bag Feature Space

We start with some notations in MIL. Let $\mathcal{D}$ be the labeled data set, which consists of $l$ bag/label pairs, i.e., $\mathcal{D} = \{(B_1, y_1), \dots, (B_l, y_l)\}$. Each bag $B_i \subset \mathbb{R}^m$ is a collection of instances, with $\mathbf{x}_{ij} \in \mathbb{R}^m$ denoting the j-th instance in the bag. Different bags may have different numbers of instances. Labels $y_i$ take binary values 1 or −1. A bag is called a positive bag if its label is 1; otherwise, it is called a negative bag. Note that a label is attached to each bag and not to every instance. In the context of images, a bag is a collection of region feature vectors; an instance is a region feature vector; a positive (negative) label represents that an image belongs (does not belong) to a particular category.
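In code, this notation amounts to a very simple data layout: a bag is a variable-length list of m-dimensional instance vectors, and the label is attached to the list as a whole. The toy data below is purely illustrative.

```python
import numpy as np

# A bag is a variable-length collection of m-dimensional instances (m = 9 for
# the region features of Section 2); the label belongs to the bag, not to any
# individual instance.  Toy data for illustration only:
bags = [
    [np.random.rand(9) for _ in range(4)],   # B_1: an image with 4 regions
    [np.random.rand(9) for _ in range(7)],   # B_2: an image with 7 regions
]
labels = np.array([1, -1])                   # y_1 = 1 (positive), y_2 = -1 (negative)
```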

The basic idea of the new MIL framework is to map every bag to a point in a new feature space, named the bag feature space, and to train SVMs in the bag feature space. For an introduction to SVMs, we refer interested readers to tutorials and books on this topic (Burges, 1998; Cristianini and Shawe-Taylor, 2000). The maximum margin formulation of MIL in a bag feature space is given as the following quadratic optimization problem:

$$ \text{(DD-SVM)} \qquad \alpha^{*} = \arg\max_{\alpha_i} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(\phi(B_i), \phi(B_j)) \tag{1} $$

subject to

$$ \sum_{i=1}^{l} y_i \alpha_i = 0, \qquad C \ge \alpha_i \ge 0, \quad i = 1, \dots, l. $$


The bag feature space is defined by $\phi : \mathcal{B} \rightarrow \mathbb{R}^n$, where $\mathcal{B}$ is a subset of $\mathcal{P}(\mathbb{R}^m)$ (the power set of $\mathbb{R}^m$). In practice, we can assume that the elements of $\mathcal{B}$ are finite sets since the number of instances in a bag is finite. $K : \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}$ is a kernel function. The parameter $C$ controls the trade-off between accuracy and regularization. The bag classifier is then defined by $\alpha^{*}$ as

$$ \mathrm{label}(B) = \mathrm{sign}\left( \sum_{i=1}^{l} y_i \alpha_i^{*} K(\phi(B_i), \phi(B)) + b^{*} \right) \tag{2} $$

where $b^{*}$ is chosen so that

$$ y_j \left( \sum_{i=1}^{l} y_i \alpha_i^{*} K(\phi(B_i), \phi(B_j)) + b^{*} \right) = 1 $$

for any $j$ with $C > \alpha_j^{*} > 0$. The optimization problem (1) assumes that the bag feature space (i.e., $\phi$) is given. Next, we introduce a way of constructing $\phi$ from a set of labeled bags.
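Once the mapping $\phi$ of Section 3.2.4 is available, (1)-(2) describe an ordinary kernel SVM in the bag feature space, so any standard solver applies. The sketch below uses scikit-learn's SVC with a Gaussian kernel as a stand-in for the SVMLight solver actually used in the experiments; function names and parameter values are placeholders.

```python
import numpy as np
from sklearn.svm import SVC  # stand-in for the SVM^Light solver used in the paper


def train_bag_classifier(bag_feature_matrix, labels, C=10.0, s=1.0):
    """Solve the dual (1) for one binary task.  Rows of `bag_feature_matrix`
    are phi(B_i); `labels` take values in {-1, +1}.  The RBF kernel
    exp(-s ||u - v||^2) matches the Gaussian kernel used in Section 4.1."""
    return SVC(kernel='rbf', gamma=s, C=C).fit(bag_feature_matrix, labels)


def label_bag(svm, phi_B):
    """Classifier (2): sign of the learned decision function at phi(B)."""
    return int(np.sign(svm.decision_function(phi_B.reshape(1, -1))[0]))
```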

3.2 Constructing a Bag Feature Space

Given a set of labeled bags, finding what is in common among the positive bags and does not appear in the negative bags may provide inductive clues for classifier design. In our approach, such clues are captured by instance prototypes computed from the DD function. A bag feature space is then constructed using the instance prototypes, each of which defines one dimension of the bag feature space.

3.2.1 DIVERSE DENSITY

In the ideal scenario, the intersection of the positive bags minus the union of the negative bags gives the instances that appear in all the positive bags but none of the negative bags. However, in practice strict set operations of intersection, union, and difference may not be useful because most real world problems involve noisy information. Features of instances might be corrupted by noise. Some bags might be mistakenly labeled. Strict intersection of positive bags might generate the empty set. Diverse Density implements soft versions of the intersection, union, and difference operations by thinking of the instances and bags as generated by some probability distribution. It is a function defined over the instance feature space. The DD value at a point in the feature space is indicative of the probability that the point agrees with the underlying distribution of positive and negative bags.

Next, we introduce one definition of DD from Maron and Lozano-Perez (1998). Interested readers are referred to Maron and Lozano-Perez (1998) for detailed derivations based on a probabilistic framework. Given a labeled data set $\mathcal{D}$, the DD function is defined as

$$ DD_{\mathcal{D}}(\mathbf{x}, \mathbf{w}) = \prod_{i=1}^{l} \left[ \frac{1 + y_i}{2} - y_i \prod_{j=1}^{N_i} \left( 1 - e^{-\| \mathbf{x}_{ij} - \mathbf{x} \|_{\mathbf{w}}^{2}} \right) \right]. \tag{3} $$

Here, $\mathbf{x}$ is a point in the instance feature space; $\mathbf{w}$ is a weight vector defining which features are considered important and which are considered unimportant; $N_i$ is the number of instances in the i-th bag; and $\| \cdot \|_{\mathbf{w}}$ denotes a weighted norm defined by

$$ \| \mathbf{x} \|_{\mathbf{w}} = \left[ \mathbf{x}^{T} \mathrm{Diag}(\mathbf{w})^{2} \mathbf{x} \right]^{\frac{1}{2}} \tag{4} $$


where $\mathrm{Diag}(\mathbf{w})$ is a diagonal matrix whose $(i,i)$-th entry is the i-th component of $\mathbf{w}$. It is not difficult to observe that values of DD are always between 0 and 1. For fixed weights $\mathbf{w}$, if a point $\mathbf{x}$ is close to an instance from a positive bag $B_i$, then

$$ \frac{1 + y_i}{2} - y_i \prod_{j=1}^{N_i} \left( 1 - e^{-\| \mathbf{x}_{ij} - \mathbf{x} \|_{\mathbf{w}}^{2}} \right) \tag{5} $$

will be close to 1; if $\mathbf{x}$ is close to an instance from a negative bag $B_i$, then (5) will be close to 0. The above definition indicates that $DD(\mathbf{x}, \mathbf{w})$ will be close to 1 if $\mathbf{x}$ is close to instances from different positive bags and, at the same time, far away from instances in all negative bags. Thus it measures a co-occurrence of instances from different (diverse) positive bags.
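Definitions (3)-(4) translate directly into code. The sketch below assumes bags are lists of NumPy vectors and labels take values in {-1, +1}, following the notation of Section 3.1; function names are illustrative.

```python
import numpy as np


def weighted_sq_dist(x_ij, x, w):
    """||x_ij - x||_w^2 with the weighted norm of (4)."""
    d = x_ij - x
    return float(np.sum((w ** 2) * (d ** 2)))


def diverse_density(x, w, bags, labels):
    """DD(x, w) of definition (3): close to 1 when x is near some instance of
    every positive bag and far from every negative instance."""
    dd = 1.0
    for B_i, y_i in zip(bags, labels):
        prod = np.prod([1.0 - np.exp(-weighted_sq_dist(x_ij, x, w)) for x_ij in B_i])
        dd *= (1.0 + y_i) / 2.0 - y_i * prod
    return dd
```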

3.2.2 LEARNING INSTANCE PROTOTYPES

The DD function defined in (3) is a continuous and highly nonlinear function with multiple peaks and valleys (local maxima and minima). A larger value of DD at a point indicates a higher probability that the point fits better with the instances from positive bags than with those from negative bags. This motivates us to choose local maximizers of DD as instance prototypes. Loosely speaking, an instance prototype represents a class of instances that is more likely to appear in positive bags than in negative bags. Note that the MIL formulation in Maron and Lozano-Perez (1998) computes the global maximizer of DD, which corresponds to one instance prototype in our notation.

Learning instance prototypes then becomes an optimization problem: finding local maximizers of the DD function in a high-dimensional space. For our application the dimension of the optimization problem is 18, because the dimension of the region features is 9 and the dimension of the weights is also 9. Since the DD functions are smooth, we apply gradient based methods to find local maximizers. Now the question is: how do we find all the local maximizers? In general, we do not know the number of local maximizers a DD function has. However, according to the definition of DD, a local maximizer is close to instances from positive bags (Maron and Lozano-Perez, 1998). Thus starting a gradient based optimization from one of those instances will likely lead to a local maximum. Therefore, a simple heuristic is applied to search for multiple maximizers: we start an optimization at every instance in every positive bag with uniform weights, and record all the resulting distinct maximizers (feature vector and corresponding weights).

Instance prototypes are selected from those maximizers with two additional constraints: (a) they need to be distinct from each other; and (b) they need to have large DD values. The first constraint addresses the precision issue of numerical optimization. Due to numerical precision, different starting points may lead to different versions of the same maximizer. Hence we need to remove some of the maximizers that are essentially repetitions of each other. The second constraint limits instance prototypes to those that are most informative in terms of co-occurrence in different positive bags. In our algorithm, this is achieved by picking maximizers with DD values greater than a certain threshold.

Following the above descriptions, one can find instance prototypes representing classes of instances that are more likely to appear in positive bags than in negative bags. One could argue that instance prototypes with the exactly reversed property (more likely to appear in negative bags than in positive bags) may be of equal importance. Such instance prototypes can be computed in exactly the same fashion after negating the labels of positive and negative bags. Our empirical study shows that including such instance prototypes (for negative bags) improves classification accuracy by an average amount of 2.2% for the 10-class image categorization experiment described in Section 4.2.


3.2.3 AN ALGORITHMIC VIEW

Next, we summarize the above discussion in pseudo code. The input is a set of labeled bags $\mathcal{D}$. The following pseudo code learns a collection of instance prototypes, each of which is represented as a pair of vectors $(\mathbf{x}_i^{*}, \mathbf{w}_i^{*})$. The optimization problem involved is solved by the quasi-Newton search dfpmin in Press et al. (1992).

Algorithm 3.1 Learning Instance Prototypes

MainLearnIPs(D)
1  Ip = LearnIPs(D)    [learn instance prototypes for positive bags]
2  negate labels of all bags in D
3  In = LearnIPs(D)    [learn instance prototypes for negative bags]
4  OUTPUT (the set union of Ip and In)

LearnIPs(D)
1  set P be the set of instances from all positive bags in D
2  initialize M to be the empty set
3  FOR (every instance in P as starting point for x)
4    set the starting point for w to be all 1's
5    find a maximizer (p, q) of the log(DD) function by quasi-Newton search
6    add (p, q) to M
7  END
8  set i = 1, T = [max_{(p,q)∈M} log(DD_D(p,q)) + min_{(p,q)∈M} log(DD_D(p,q))] / 2
9  REPEAT
10   set (x*_i, w*_i) = argmax_{(p,q)∈M} log(DD_D(p,q))
11   remove from M all elements (p, q) satisfying
       ||p ⊗ abs(q) − x*_i ⊗ abs(w*_i)|| < β ||x*_i ⊗ abs(w*_i)||  OR  log(DD_D(p,q)) < T
12   set i = i + 1
13 WHILE (M is not empty)
14 OUTPUT ({(x*_1, w*_1), ..., (x*_{i−1}, w*_{i−1})})

In the above pseudo code for LearnIPs, lines 1–7 find a collection of local maximizers for the DD function by starting an optimization at every instance in every positive bag with uniform weights. For better numerical stability, the optimization is performed on the log(DD) function, instead of the DD function itself. In line 5, we implement the EM-DD algorithm (Zhang and Goldman, 2002), which scales up well to large bag sizes in running time. Lines 8–13 describe an iterative process to pick a collection of “distinct” local maximizers as instance prototypes. In each iteration, an element of M (a local maximizer) with the maximal log(DD) value (or, equivalently, the DD value) is selected as an instance prototype (line 10). Then elements of M that are close to the selected instance prototype or that have DD values lower than a threshold are removed from M (line 11). A new iteration starts if M is not empty. The abs(w) in line 11 computes component-wise absolute values of w. This is because the signs in w have no effect on the definition (4) of the weighted norm. The ⊗ in line 11 denotes component-wise product.

The number of instance prototypes selected from M is determined by two parameters, β and T. In our implementation, β is set to be 0.05, and T is the average of the maximal and minimal log(DD) values for all local maximizers found (line 8). These two parameters may need to be adjusted for other applications. However, empirical study shows that the performance of the classifier is not sensitive to β and T. Experimental analysis of the conditions under which the algorithm will find good instance prototypes is given in Section 4.5.
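A compressed Python sketch of LearnIPs is given below for concreteness. It reuses a DD implementation such as the diverse_density sketch in Section 3.2.1 (passed in as dd_fn), replaces dfpmin/EM-DD with scipy.optimize.minimize (BFGS), and keeps only the pruning logic of lines 8–13; the EM-DD refinements and other numerical details are omitted, so this is an assumption-laden illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize  # quasi-Newton stand-in for dfpmin / EM-DD


def learn_instance_prototypes(bags, labels, dd_fn, beta=0.05):
    """Sketch of LearnIPs: start a search at every instance of every positive bag
    with uniform weights, then keep distinct, high-DD maximizers (lines 8-13)."""
    m = bags[0][0].shape[0]

    def neg_log_dd(z):
        # z packs the point x (first m entries) and the weights w (last m entries)
        return -np.log(dd_fn(z[:m], z[m:], bags, labels) + 1e-300)

    maximizers = []                                   # lines 1-7: multi-start search
    for B_i, y_i in zip(bags, labels):
        if y_i != 1:
            continue
        for x_start in B_i:
            z0 = np.concatenate([x_start, np.ones(m)])
            res = minimize(neg_log_dd, z0, method='BFGS')
            maximizers.append((res.x[:m], res.x[m:], -res.fun))   # (x, w, log DD)

    log_dds = [v for _, _, v in maximizers]
    T = 0.5 * (max(log_dds) + min(log_dds))           # line 8
    prototypes = []
    while maximizers:
        x_s, w_s, _ = max(maximizers, key=lambda t: t[2])         # line 10
        prototypes.append((x_s, w_s))
        ref = x_s * np.abs(w_s)
        maximizers = [(p, q, v) for p, q, v in maximizers         # line 11
                      if np.linalg.norm(p * np.abs(q) - ref) >= beta * np.linalg.norm(ref)
                      and v >= T]
    return prototypes
```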

3.2.4 COMPUTING BAG FEATURES

Let $\{(\mathbf{x}_k^{*}, \mathbf{w}_k^{*}) : k = 1, \dots, n\}$ be the collection of instance prototypes given by Algorithm 3.1. We define bag features, $\phi(B_i)$, for a bag $B_i = \{\mathbf{x}_{ij} : j = 1, \dots, N_i\}$, as

$$ \phi(B_i) = \begin{bmatrix} \min_{j=1,\dots,N_i} \| \mathbf{x}_{ij} - \mathbf{x}_1^{*} \|_{\mathbf{w}_1^{*}} \\ \min_{j=1,\dots,N_i} \| \mathbf{x}_{ij} - \mathbf{x}_2^{*} \|_{\mathbf{w}_2^{*}} \\ \vdots \\ \min_{j=1,\dots,N_i} \| \mathbf{x}_{ij} - \mathbf{x}_n^{*} \|_{\mathbf{w}_n^{*}} \end{bmatrix}. \tag{6} $$

In the definition (6), each bag feature is defined by one instance prototype and one instance from the bag, i.e., the instance that is “closest” to the instance prototype. A bag feature gives the smallest distance (or highest similarity score) between any instance in the bag and the corresponding instance prototype. Hence, it can also be viewed as a measure of the degree that an instance prototype shows up in the bag.
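In code, the mapping (6) is a few lines on top of the weighted norm (4). The sketch below assumes `prototypes` is the list of (x*, w*) pairs returned by Algorithm 3.1; function names are illustrative.

```python
import numpy as np


def weighted_dist(x_ij, x_star, w_star):
    """||x_ij - x*||_{w*}, the weighted norm of (4)."""
    d = x_ij - x_star
    return float(np.sqrt(np.sum((w_star ** 2) * (d ** 2))))


def bag_features(bag, prototypes):
    """phi(B) of (6): the k-th coordinate is the smallest weighted distance
    between any instance in the bag and the k-th instance prototype."""
    return np.array([min(weighted_dist(x_ij, x_s, w_s) for x_ij in bag)
                     for (x_s, w_s) in prototypes])
```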

3.3 Comparing DD-SVM with MI-SVM

The following pseudo code summarizes the learning process of DD-SVM. The input is $\mathcal{D}$, a collection of bags with binary labels. The output is an SVM classifier defined by (2).

Algorithm 3.2 Learning DD-SVM

DD-SVM(D)
1  let S be the empty set
2  IP = MainLearnIPs(D)
3  FOR (every bag B in D)
4    define bag features φ(B) according to (6)
5    add (φ(B), y) to S, where y is the label of B
6  END
7  train a standard SVM using S
8  OUTPUT (the SVM)

MI-SVM, proposed by Andrews et al. (2003), is also an SVM-based MIL method. In Section 4, we experimentally compare DD-SVM against MI-SVM. An algorithmic description of MI-SVM is given below. The input is a collection of labeled bags $\mathcal{D}$. The output is a classifier of the form

$$ \mathrm{label}(B_i) = \mathrm{sign}\left( \max_{j=1,\dots,N_i} f(\mathbf{x}_{ij}) \right) \tag{7} $$

where $\mathbf{x}_{ij}$, $j = 1, \dots, N_i$, are instances of $B_i$, and $f$ is a function given by SVM learning.


Algorithm 3.3 Learning MI-SVM

MI-SVM(D)
1  let P be the empty set
2  FOR (every positive bag B in D)
3    set x* be the average of instances in B
4    add (x*, 1) to P
5  END
6  let N be the empty set
7  FOR (every negative bag B in D)
8    FOR (every instance x in B)
9      add (x, −1) to N
10   END
11 END
12 REPEAT
13   set P′ = P
14   set S = P′ ∪ N
15   train a standard SVM, label(x) = sign(f(x)), using S
16   let P be the empty set
17   FOR (every positive bag B in D)
18     set x* = argmax_{x∈B} f(x)
19     add (x*, 1) to P
20   END
21 WHILE (P ≠ P′)
22 OUTPUT (the classifier defined by (7))

In the above pseudo code for MI-SVM, the key steps are the loop given by lines 12–21. During each iteration, a standard SVM classifier, label(x) = sign(f(x)), is trained in the instance space. The training set is the union of negative instances and positive instances. Negative instances are those from every negative bag. Each positive instance represents a positive bag. It is chosen to be the instance, in a positive bag, with the maximum f value from the previous iteration. In the first iteration, each positive instance is initialized to be the average of the feature vectors in the bag. The loop terminates if the set of positive instances selected for the next iteration is identical to that of the current iteration.
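For completeness, a rough Python rendering of Algorithm 3.3 is shown below, again with scikit-learn's SVC standing in for SVMLight. The max_iter cap is an added safeguard (the paper's loop simply runs until the witness set stops changing), and parameter values are placeholders.

```python
import numpy as np
from sklearn.svm import SVC  # stand-in for the SVM^Light solver


def train_mi_svm(bags, labels, C=10.0, s=1.0, max_iter=50):
    """Sketch of MI-SVM: alternately select one witness instance per positive bag
    and retrain a standard SVM on the witnesses plus all negative instances."""
    neg = np.vstack([x for B, y in zip(bags, labels) if y == -1 for x in B])
    witnesses = np.vstack([np.mean(B, axis=0)                       # lines 2-5
                           for B, y in zip(bags, labels) if y == 1])
    for _ in range(max_iter):
        X = np.vstack([witnesses, neg])
        Y = np.concatenate([np.ones(len(witnesses)), -np.ones(len(neg))])
        svm = SVC(kernel='rbf', gamma=s, C=C).fit(X, Y)             # line 15
        new_w = np.vstack([B[int(np.argmax(svm.decision_function(np.vstack(B))))]
                           for B, y in zip(bags, labels) if y == 1])  # lines 17-20
        if np.allclose(new_w, witnesses):                           # line 21: P == P'
            break
        witnesses = new_w
    return svm


def mi_svm_label(svm, bag):
    """Classifier (7): the sign of the maximal instance score in the bag."""
    return int(np.sign(np.max(svm.decision_function(np.vstack(bag)))))
```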

The crucial difference between DD-SVM and MI-SVM lies in the underlying assumption. The MI-SVM method, as well as other standard MIL methods (such as the DD approach proposed by Maron and Lozano-Perez, 1998), assumes that if a bag is labeled negative then all instances in that bag are negative, and if a bag is labeled positive, then at least one of the instances in that bag is a positive instance. In MI-SVM, one instance is selected to represent the whole positive bag. An SVM is trained in the instance feature space using all negative instances and the selected positive instances. Our DD-SVM method assumes that a positive bag must contain some number of instances satisfying various properties, which are captured by bag features. Each bag feature is defined by an instance in the bag and an instance prototype derived from the DD function. Hence, the bag features summarize the bag along several dimensions. An SVM is then trained in the bag feature space.


4. Experiments

We present systematic evaluations of DD-SVM based on a collection of images from the COREL and the MUSK data sets. The data sets and the source code of DD-SVM can be downloaded at http://www.cs.uno.edu/~yixin/ddsvm.html. Section 4.1 describes the experimental setup for image categorization, including the image data set, the implementation details, and the selection of parameters. Section 4.2 compares DD-SVM with MI-SVM and color histogram-based SVM using COREL data. The effect of inaccurate image segmentation on classification accuracies is demonstrated in Section 4.3. Section 4.4 illustrates the performance variations when the number of image categories increases. Analysis of the effects of training sample size and diversity of images is given in Section 4.5. Results on the MUSK data sets are presented in Section 4.6. Computational issues are discussed in Section 4.7.

4.1 Experimental Setup for Image Categorization

The image data set employed in our empirical study consists of 2,000 images taken from 20 CD-ROMs published by COREL Corporation. Each COREL CD-ROM of 100 images represents one distinct topic of interest. Therefore, the data set has 20 thematically diverse image categories, each containing 100 images. All the images are in JPEG format with size 384 × 256 or 256 × 384. We assigned a keyword (or keywords) to describe each image category. The category names and some randomly selected sample images from each category are shown in Figure 3.

Images within each category are randomly divided into a training set and a test set, each with 50 images. We repeat each experiment for 5 random splits, and report the average of the results obtained over 5 different test sets together with the 95% confidence interval. The SVMLight (Joachims, 1999) software is used to train the SVMs. The classification problem here is clearly a multi-class problem. We use the one-against-the-rest approach: (a) for each category, an SVM is trained to separate that category from all the other categories; (b) the final predicted class label is decided by the winner of all SVMs, i.e., the one with the maximum value inside the sign(·) function in (2).
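The winner-take-all step can be sketched as follows. Note that each binary DD-SVM learns its own instance prototypes, so a test image is mapped into a different bag feature space for each category; the dictionary layout below is an illustrative choice, not the authors' code.

```python
def predict_category(binary_svms, phi_by_category):
    """One-against-the-rest prediction: `binary_svms[c]` separates category c from
    the rest, and `phi_by_category[c]` is the test bag mapped into the bag feature
    space of that classifier.  The winner is the SVM with the largest value inside
    the sign(.) of (2)."""
    scores = {c: float(svm.decision_function(phi_by_category[c].reshape(1, -1))[0])
              for c, svm in binary_svms.items()}
    return max(scores, key=scores.get)
```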

Two other image classification methods are implemented for comparison. One is a histogram-based SVM classification approach proposed by Chapelle et al. (1999). We denote it by Hist-SVM. Each image is represented by a color histogram in the LUV color space. The dimension of each histogram is 125. The other is MI-SVM (Andrews et al., 2003). MI-SVM uses the same set of region features as our approach (it is implemented according to the pseudo code in Algorithm 3.3). The learning problems in Hist-SVM and MI-SVM are solved by SVMLight. The Gaussian kernel, $K(\mathbf{x}, \mathbf{z}) = e^{-s \| \mathbf{x} - \mathbf{z} \|^{2}}$, is used in all three methods.

Several parameters need to be specified for SVMLight.3 The most significant ones are $s$ and $C$ (the constant in (1) controlling the trade-off between training error and regularization). We apply the following strategy to select these two parameters: we allow each of the two parameters to be chosen from its own set of 10 predetermined values. For every pair of values of the two parameters (there are 100 pairs in total), a twofold cross-validation error on the training set is recorded. The pair that gives the minimum twofold cross-validation error is selected as the “optimal” parameters. Note that the above procedure is applied only once for each method. Once the parameters are determined, they are used in all subsequent image categorization experiments.

3. SVMLight software and detailed descriptions of all its parameters are available at http://svmlight.joachims.org.
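This parameter search amounts to a small grid search with twofold cross-validation, which can be sketched with scikit-learn as below. The candidate grids are placeholders; the paper does not list the 10 predetermined values for s and C.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


def select_s_and_C(train_features, train_labels):
    """Pick (s, C) by minimizing the twofold cross-validation error on the
    training set over a 10 x 10 grid (candidate values are placeholders)."""
    param_grid = {'gamma': np.logspace(-4, 5, 10),   # candidate values of s
                  'C': np.logspace(-2, 7, 10)}       # candidate values of C
    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=2, scoring='accuracy')
    search.fit(train_features, train_labels)
    return search.best_params_['gamma'], search.best_params_['C']
```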


Figure 3: Sample images taken from the 20 image categories. Category 0: African people and villages; Category 1: Beach; Category 2: Historical buildings; Category 3: Buses; Category 4: Dinosaurs; Category 5: Elephants; Category 6: Flowers; Category 7: Horses; Category 8: Mountains and glaciers; Category 9: Food; Category 10: Dogs; Category 11: Lizards; Category 12: Fashion; Category 13: Sunsets; Category 14: Cars; Category 15: Waterfalls; Category 16: Antiques; Category 17: Battle ships; Category 18: Skiing; Category 19: Desserts.

4.2 Categorization Results

The classification results provided in Table 1 are based on images in Category 0 to Category 9, i.e., 1,000 images. Results for the whole data set will be given in Section 4.4. DD-SVM performs much better than Hist-SVM, with a 14.8% difference in average classification accuracy. Compared with MI-SVM, the average accuracy of DD-SVM is 6.8% higher. As we will see in Section 4.4, the difference becomes even greater as the number of categories increases. This suggests that the proposed method is more effective than MI-SVM in learning concepts of image categories under the same image representation. The MIL formulation of our method may be better suited for region-based image classification than that of MI-SVM.

    Method      Average Accuracy : [95% confidence interval]
    DD-SVM      81.5% : [78.5%, 84.5%]
    Hist-SVM    66.7% : [64.5%, 68.9%]
    MI-SVM      74.7% : [74.1%, 75.3%]

Table 1: Image categorization performance of DD-SVM, Hist-SVM, and MI-SVM. The numbers listed are the average classification accuracies over 5 random test sets and the corresponding 95% confidence intervals. The images belong to Category 0 to Category 9. Training and test sets are of equal size.

             Cat. 0  Cat. 1  Cat. 2  Cat. 3  Cat. 4  Cat. 5  Cat. 6  Cat. 7  Cat. 8  Cat. 9
    Cat. 0    67.7%    3.7%    5.7%    0.0%    0.3%    8.7%    5.0%    1.3%    0.3%    7.3%
    Cat. 1     1.0%   68.4%    4.3%    4.3%    0.0%    3.0%    1.3%    1.0%   15.0%    1.7%
    Cat. 2     5.7%    5.0%   74.3%    2.0%    0.0%    3.3%    0.7%    0.0%    6.7%    2.3%
    Cat. 3     0.3%    3.7%    1.7%   90.3%    0.0%    0.0%    0.0%    0.0%    1.3%    2.7%
    Cat. 4     0.0%    0.0%    0.0%    0.0%   99.7%    0.0%    0.0%    0.0%    0.0%    0.3%
    Cat. 5     5.7%    3.3%    6.3%    0.3%    0.0%   76.0%    0.7%    4.7%    2.3%    0.7%
    Cat. 6     3.3%    0.0%    0.0%    0.0%    0.0%    1.7%   88.3%    2.3%    0.7%    3.7%
    Cat. 7     2.3%    0.3%    0.0%    0.0%    0.0%    2.0%    1.0%   93.4%    0.7%    0.3%
    Cat. 8     0.3%   15.7%    5.0%    1.0%    0.0%    4.3%    1.0%    0.7%   70.3%    1.7%
    Cat. 9     3.3%    1.0%    0.0%    3.0%    0.7%    1.3%    1.0%    2.7%    0.0%   87.0%

Table 2: The confusion matrix of image categorization experiments (over 5 randomly generated test sets). Each row lists the average percentage of test images in one category classified to each of the 10 categories by DD-SVM. Numbers on the diagonal show the classification accuracy for each category.

Next, we make a closer analysis of the performance by looking at the classification results on every category in terms of the confusion matrix. The results are listed in Table 2. Each row lists the average percentage of images in one category classified to each of the 10 categories by DD-SVM. The numbers on the diagonal show the classification accuracy for each category, and off-diagonal entries indicate classification errors. Ideally, one would expect the diagonal terms to be all 1's, and the off-diagonal terms to be all 0's. A detailed examination of the confusion matrix shows that two of the largest errors (the 15.0% and 15.7% entries in Table 2) are errors between Category 1 (Beach) and Category 8 (Mountains and glaciers): 15.0% of “Beach” images are misclassified as “Mountains and glaciers,” and 15.7% of “Mountains and glaciers” images are misclassified as “Beach.” Figure 4 presents 12 misclassified images (in at least one experiment) from both categories. All “Beach” images in Figure 4 contain mountains or mountain-like regions, while all the “Mountains and glaciers” images have regions corresponding to rivers, lakes, or even the ocean. In other words, although these two image categories do not share annotation words, they are semantically related and visually similar. This may be the reason for the classification errors.

Figure 4: Some sample images taken from two categories, “Beach” and “Mountains and glaciers.” All the listed “Beach” images are misclassified as “Mountains and glaciers,” while the listed “Mountains and glaciers” images are misclassified as “Beach.”

4.3 Sensitivity to Image Segmentation

Because image segmentation cannot be perfect, being robust to segmentation-related uncertainties becomes a critical performance index for a region-based image classification method. Figure 5 shows two images, “African people” and “Horses,” and the segmentation results with different numbers of regions (the results are obtained by varying the stopping criteria of the k-means segmentation algorithm presented in Section 2). Regions are shown in their representative colors. We can see from Figure 5 that, under some stopping criteria, objects totally different in semantics may be clustered into the same region (under-segmented), while under some other stopping criteria, one object may be divided into several regions (over-segmented).

In this section, we compare the performance of DD-SVM with MI-SVM when the coarseness of image segmentation varies. To give a fair comparison, we control the coarseness of image segmentation by adjusting the stopping criteria of the k-means segmentation algorithm. We pick 5 different stopping criteria. The corresponding average numbers of regions per image (computed over 1,000 images from Category 0 to Category 9) are 4.31, 6.32, 8.64, 11.62, and 12.25. The average classification accuracies (over 5 randomly generated test sets) under each coarseness level and the corresponding 95% confidence intervals are presented in Figure 6.

The results in Figure 6 indicate that DD-SVM outperforms MI-SVM on all 5 coarseness levels. In addition, for DD-SVM, there are no significant changes in the average classification accuracy for different coarseness levels, while the performance of MI-SVM degrades as the average number of regions per image increases. The differences in average classification accuracies between the two methods are 6.8%, 9.5%, 11.7%, 13.8%, and 27.4% as the average number of regions per image increases. This appears to support the claim that DD-SVM has low sensitivity to image segmentation.


Figure 5: Segmentation results given by the k-means clustering algorithm with 5 different stopping criteria. Original images, which are taken from the “African people” and “Horses” categories, are in the first column; the remaining columns show segmentations into 3, 5, 7, 9, and 11 regions. Segmented regions are shown in their representative colors.

Figure 6: Comparing DD-SVM with MI-SVM on the robustness to image segmentation (average classification accuracy with 95% confidence interval versus the average number of regions per image). The experiment is performed on 1,000 images in Category 0 to Category 9 (training and test sets are of equal size). The average classification accuracies and the corresponding 95% confidence intervals are computed over 5 randomly generated test sets. The average numbers of regions per image are 4.31, 6.32, 8.64, 11.62, and 12.25.

4.4 Sensitivity to the Number of Categories in a Data Set

Although the experimental results in Sections 4.2 and 4.3 demonstrate the good performance of DD-SVM using 1,000 images in Category 0 to Category 9, the scalability of the method remains a question: how does the performance scale as the number of categories in a data set increases? We attempt to empirically answer this question by performing image categorization experiments over data sets with different numbers of categories.


[Figure 7 plot: average classification accuracy with 95% confidence interval (y-axis) versus number of categories in the data set (x-axis), for DD-SVM and MI-SVM.]

Figure 7: Comparing DD-SVM with MI-SVM on the robustness to the number of categories in a data set. The experiment is performed on 11 different data sets. The number of categories in a data set varies from 10 to 20. A data set with i categories contains 100 × i images from Category 0 to Category i − 1 (training and test sets are of equal size). The average classification accuracies and the corresponding 95% confidence intervals are computed over 5 randomly generated test sets.

A total of 11 data sets are used in the experiments. The number of categories in a data set varies from 10 to 20. A data set with i categories contains 100 × i images from Category 0 to Category i − 1. The average classification accuracies over 5 randomly generated test sets and the corresponding 95% confidence intervals are presented in Figure 7. We also include the results of MI-SVM for comparison.

We observe a decrease in average classification accuracy as the number of categories increases. When the number of categories doubles (from 10 to 20), the average classification accuracy of DD-SVM drops from 81.5% to 67.5%. However, DD-SVM appears to be less sensitive to the number of categories in a data set than MI-SVM. This is indicated in Figure 8 by the difference in average classification accuracies between the two methods as the number of categories increases. Our method outperforms MI-SVM consistently, and the performance gap widens with the number of categories. For the 1,000-image data set with 10 categories, the difference is 6.8%; this number nearly doubles (to 12.9%) when the number of categories reaches 20. In other words, the performance of DD-SVM degrades more slowly than that of MI-SVM as the number of categories increases.

4.5 Sensitivity to the Size and Diversity of Training Images

We test the sensitivity of DD-SVM to the size of the training set using 1,000 images from Category 0 to Category 9, with training sets of size 100, 200, 300, 400, and 500 (the number of images from each category is one-tenth of the training set size). The corresponding numbers of test images are 900, 800, 700, 600, and 500, respectively.


[Figure 8 plot: difference in average classification accuracies, DD-SVM over MI-SVM (y-axis), versus number of categories (x-axis).]

Figure 8: Difference in average classification accuracies between DD-SVM and MI-SVM as the number of categories varies. A positive number indicates that DD-SVM has higher average classification accuracy.

[Figure 9 plot: average classification accuracy with 95% confidence interval (y-axis) versus number of training images (x-axis), for DD-SVM and MI-SVM.]

Figure 9: Comparing DD-SVM with MI-SVM as the number of training images varies from 100 to 500. The experiment is performed on 1,000 images in Category 0 to Category 9. The average classification accuracies and the corresponding 95% confidence intervals are computed over 5 randomly generated test sets.

As indicated in Figure 9, when the number of training images decreases, the average classification accuracy of DD-SVM degrades as expected.


[Figure 10 plot: average classification accuracy with 95% confidence interval (y-axis) versus percentage of training images with labels negated (x-axis), for DD-SVM and MI-SVM.]

Figure 10: Comparing DD-SVM with MI-SVM as the diversity of training images varies. The experiment is performed on 200 images in Category 2 (Historical buildings) and Category 7 (Horses). The average classification accuracies and the corresponding 95% confidence intervals are computed over 5 randomly generated test sets. Training and test sets have equal size.

Figure 9 also shows that the performance of DD-SVM degrades at roughly the same rate as that of MI-SVM: the differences in average classification accuracy between DD-SVM and MI-SVM are 8.7%, 7.9%, 8.0%, 7.1%, and 6.8% as the training sample size varies from 100 to 500.

To test the performance of DD-SVM as the diversity of training images varies, we need to define a measure of diversity. For binary classification, we define the diversity of a training set in terms of the number of positive images that are “similar” to negative images and the number of negative images that are “similar” to positive images. In this experiment, training sets with different diversities are generated as follows. We first randomly pick d% of the positive images and d% of the negative images from a training set. We then negate the labels of the selected images, i.e., positive (negative) images become negative (positive) images. Finally, we put these relabeled images back into the training set. The new training set thus has d% of its images with negated labels; d = 0 and d = 50 correspond to the lowest and highest diversities, respectively.
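The label-negation procedure just described can be sketched as follows. The +1/−1 label encoding and the random seed are illustrative assumptions.

# Minimal sketch of the diversity-manipulation procedure: negate the labels of
# d% of the positive and d% of the negative training images.
import numpy as np

def negate_labels(labels, d, seed=0):
    """Return a copy of `labels` (+1 / -1) with d% of each class flipped."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    for cls in (+1, -1):
        idx = np.flatnonzero(labels == cls)
        n_flip = int(round(len(idx) * d / 100.0))
        flip = rng.choice(idx, size=n_flip, replace=False)
        labels[flip] = -cls                        # positive becomes negative and vice versa
    return labels

# Hypothetical training set with 100 positive and 100 negative images, d = 2:
# noisy = negate_labels(np.array([+1] * 100 + [-1] * 100), d=2)   # flips 2 + 2 labels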

We compare DD-SVM with MI-SVM for d = 0, 2, 4, 6, 8, and 10 based on 200 images from Category 2 (Historical buildings) and Category 7 (Horses). The training and test sets have equal size. The average classification accuracies (over 5 randomly generated test sets) and the corresponding 95% confidence intervals are presented in Figure 10. We observe that the average classification accuracy of DD-SVM is about 4% higher than that of MI-SVM when d = 0, and this difference is statistically significant. However, if we randomly negate the labels of one positive image and one negative image in the training set (i.e., d = 2 in this experimental setup), the performance of DD-SVM becomes roughly the same as that of MI-SVM.


[Figure 11 plots: 10-fold cross-validation classification accuracy versus the kernel parameter s, for MUSK1 (left) and MUSK2 (right).]

Figure 11: Average accuracy of 10-fold cross-validation on the MUSK data sets using DD-SVM. The parameters are C = 1,000 and s taking 19 values evenly distributed in [0.001, 0.01].

Although DD-SVM still leads MI-SVM by about 2% in average classification accuracy, the difference is statistically indistinguishable. As d increases further, DD-SVM and MI-SVM produce roughly the same performance. This suggests that DD-SVM is more sensitive to the diversity of training images than MI-SVM. We attempt to explain this observation as follows. The DD function (3) used in Algorithm 3.1 is very sensitive to instances in negative bags: it is not difficult to derive from (3) that the DD value at a point is substantially reduced if there is a single instance from a negative bag close to the point. Therefore, negating the labels of one positive and one negative image can significantly modify the DD function and, consequently, the instance prototypes learned by Algorithm 3.1.
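This sensitivity can be illustrated numerically with the noisy-or Diverse Density of Maron and Lozano-Perez (1998). The sketch below uses that standard form, which may differ in detail from the DD function (3) of this paper; the one-dimensional instances and bag contents are purely illustrative.

# Minimal numerical sketch of the noisy-or Diverse Density: one negative
# instance close to a candidate point collapses the multiplicative DD value.
import numpy as np

def dd_value(t, positive_bags, negative_bags):
    """Diverse Density of a candidate point t (instances as rows of each bag)."""
    t = np.asarray(t, dtype=float)
    dd = 1.0
    for bag in positive_bags:
        d2 = np.sum((np.asarray(bag, dtype=float) - t) ** 2, axis=-1)
        dd *= 1.0 - np.prod(1.0 - np.exp(-d2))     # at least one instance near t
    for bag in negative_bags:
        d2 = np.sum((np.asarray(bag, dtype=float) - t) ** 2, axis=-1)
        dd *= np.prod(1.0 - np.exp(-d2))           # no negative instance near t
    return dd

t = np.array([0.0])
pos = [np.array([[0.1], [3.0]]), np.array([[-0.2], [5.0]])]
neg_far  = [np.array([[4.0], [6.0]])]
neg_near = [np.array([[0.05], [6.0]])]             # one negative instance near t
print(dd_value(t, pos, neg_far))                   # relatively large DD value
print(dd_value(t, pos, neg_near))                  # collapses toward zero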

4.6 MUSK Data Sets

The MUSK data sets, MUSK1 and MUSK2 (Blake and Merz, 1998), are benchmark data sets for MIL. Both data sets consist of descriptions of molecules. Specifically, a bag represents a molecule. Instances in a bag represent low-energy conformations of the molecule. Each instance (or conformation) is defined by a 166-dimensional feature vector describing the surface of a low-energy conformation. The data were preprocessed by dividing each feature value by 100. This was done so that the learning of instance prototypes would not begin in a flat area of the instance space. MUSK1 has 92 molecules (bags), of which 47 are labeled positive, with an average of 5.17 conformations (instances) per molecule. MUSK2 has 102 molecules, of which 39 are positive, with an average of 64.69 conformations per molecule.
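A minimal loader for this preprocessing step might look as follows. It assumes a comma-separated layout with a molecule name, a conformation name, 166 feature values, and a class label per row; the exact file format is an assumption, not something stated here.

# Minimal sketch: group conformations into bags and divide each feature by 100.
import csv
from collections import defaultdict
import numpy as np

def load_musk(path):
    bags, labels = defaultdict(list), {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            mol = row[0]                                          # bag (molecule) identifier
            feats = np.array(row[2:168], dtype=float) / 100.0     # scale the 166 features
            bags[mol].append(feats)
            labels[mol] = int(float(row[168]))                    # bag label
    return [np.vstack(v) for v in bags.values()], [labels[k] for k in bags]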

Figure 11 shows the average accuracy of 10-fold cross-validation using DD-SVM with C = 1,000 and the s parameter of the Gaussian kernel taking the following values: 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.0035, 0.004, 0.0045, 0.005, 0.0055, 0.006, 0.0065, 0.007, 0.0075, 0.008, 0.0085, 0.009, 0.0095, and 0.01. As a function of s, the average 10-fold cross-validation accuracy of DD-SVM varies within [84.9%, 86.9%] (MUSK1) and [90.2%, 92.2%] (MUSK2). For both data sets, the median of the average accuracy, which is robust over a range of parameter values, is reported in Table 3.
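The evaluation protocol can be sketched as follows; cross_val_accuracy is a hypothetical stand-in for training and testing DD-SVM on a 10-fold split, not part of the implementation described in this paper.

# Minimal sketch of the reporting protocol: mean 10-fold accuracy for each s,
# then the median of those means over the s grid.
import numpy as np

def report_accuracy(bags, labels, cross_val_accuracy):
    s_grid = np.linspace(0.001, 0.01, 19)           # 19 values evenly spaced in [0.001, 0.01]
    mean_acc = [cross_val_accuracy(bags, labels, C=1000.0, s=s, n_folds=10)
                for s in s_grid]
    return float(np.median(mean_acc))               # robust to the choice of s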


                Average Accuracy
            MUSK1      MUSK2
DD-SVM      85.8%      91.3%
IAPR        92.4%      89.2%
DD          88.9%      82.5%
EM-DD       84.8%      84.9%
MI-SVM      77.9%      84.3%
mi-SVM      87.4%      83.6%
MI-NN       88.0%      82.0%
Multinst    76.7%      84.0%

Table 3: Comparison of averaged 10-fold cross-validation accuracies on MUSK data sets.

Table 3 also summarizes the performance of seven MIL algorithms from the literature: IAPR (Dietterich et al., 1997), DD (Maron and Lozano-Perez, 1998), EM-DD (Zhang and Goldman, 2002),4

MI-SVM and mi-SVM (Andrews et al., 2003), MI-NN (Ramon and De Raedt, 2000), and Multinst (Auer, 1997). Although DD-SVM is outperformed by IAPR, DD, MI-NN, and mi-SVM on MUSK1, it gives the best performance on MUSK2. Overall, DD-SVM achieves very competitive accuracy values.

4.7 Speed

On average, learning each binary classifier using a training set of 500 images (4.31 regions per image) takes around 40 minutes of CPU time on a Pentium III 700MHz PC running the Linux operating system. Algorithm 3.1 is implemented in Matlab with the quasi-Newton search procedure written in the C programming language. The majority of this time is spent on learning instance prototypes, in particular the FOR loop of LearnIPs(D) in Algorithm 3.1, because the quasi-Newton search must be run with every instance in every positive bag as a starting point (each optimization takes only a few seconds). However, since these optimizations are independent of each other, they can be fully parallelized, which could reduce the training time significantly.
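A minimal sketch of this parallelization is given below; maximize_dd_from is a hypothetical stand-in for one quasi-Newton run (for example, scipy.optimize.minimize on the negative log-DD) and is not the implementation used in this paper.

# Minimal sketch: distribute the independent quasi-Newton searches over worker
# processes; maximize_dd_from must be a picklable, top-level function.
from multiprocessing import Pool

def learn_candidates(positive_bags, maximize_dd_from, n_workers=8):
    starts = [inst for bag in positive_bags for inst in bag]    # all starting points
    with Pool(n_workers) as pool:
        results = pool.map(maximize_dd_from, starts)            # one search per starting point
    return results                                              # e.g. (point, DD value) pairs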

5. Conclusions and Future Work

In this paper, we proposed a region-based image categorization method using an extension of Multiple-Instance Learning, DD-SVM. Each image is represented as a collection of regions obtained from image segmentation using the k-means algorithm. In DD-SVM, each image is mapped to a point in a bag feature space, which is defined by a set of instance prototypes learned with the Diverse Density function. SVM-based image classifiers are then trained in the bag feature space. We demonstrate that DD-SVM outperforms two other methods in classifying images from 20 distinct semantic classes. In addition, DD-SVM generates highly competitive results on the MUSK data sets, which are benchmark data sets for MIL.
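To summarize the pipeline in code, the sketch below maps each bag to the bag feature space using the minimum Euclidean distance from each prototype to the bag's instances and then trains a standard SVM on the resulting vectors. The unweighted distance and the kernel parameters are simplifying assumptions rather than the exact mapping of DD-SVM.

# Minimal sketch of the bag-feature construction and SVM training (simplified).
import numpy as np
from sklearn.svm import SVC

def bag_features(bags, prototypes):
    """Map each bag (array of instances) to a point in the bag feature space."""
    feats = []
    for bag in bags:
        bag = np.asarray(bag, dtype=float)
        dists = np.linalg.norm(bag[:, None, :] - prototypes[None, :, :], axis=2)
        feats.append(dists.min(axis=0))            # closest instance per prototype
    return np.vstack(feats)

def train_bag_svm(bags, labels, prototypes, C=1000.0, gamma=0.005):
    X = bag_features(bags, prototypes)
    return SVC(C=C, kernel="rbf", gamma=gamma).fit(X, labels)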

4. The EM-DD results reported in Zhang and Goldman (2002) were obtained by selecting the optimal solution using the test data. The EM-DD result cited in this paper is provided by Andrews et al. (2003) using the correct algorithm.


The proposed image categorization method has several limitations:

• The semantic meaning of an instance prototype is usually unknown because the learning algorithm in Section 3 does not associate a linguistic label with each instance prototype. As a result, “region naming” (Barnard et al., 2003) is not supported by DD-SVM.

• It may not be possible to learn certain concepts with the method. For example, texture images can be designed using a simple object (or region), such as a T-shaped object. By varying the orientation, frequency of appearance, and alignment of the object, one can obtain texture images that are visually different. In other words, the concept of texture depends not only on the individual object but also on the spatial relationships among objects (or instances), and this spatial information is not exploited by the current work. As pointed out by one reviewer of the initial draft, a possible way to tackle this problem is to use Markov random field models (Modestino and Zhang, 1992).

The performance of image categorization may be improved in the following ways:

• The image segmentation algorithm may be improved. The current k-means algorithm is relatively simple and efficient, but over-segmentation and under-segmentation may happen frequently for a fixed stopping criterion. Although the empirical results in Section 4.3 show that the proposed method has low sensitivity to image segmentation, a semantically more accurate segmentation algorithm may improve the overall classification accuracy.

• The definition of the DD function may be improved. The current DD function, which is a multiplicative model, is very sensitive to instances in negative bags. It can be easily observed from (3) that the DD value at a point is significantly reduced if there is a single instance from a negative bag close to the point. This property may be desirable for some applications, such as drug discovery (Maron and Lozano-Perez, 1998), where the goal is to learn a single point in the instance feature space with the maximum DD value from an almost “noise free” data set. But this is not a typical problem setting for region-based image categorization, where data usually contain noise. Thus a more robust definition of DD, such as an additive model, may enhance the performance.

As pointed out by a reviewer of the initial draft, a scene category can be a vector. For example, a scene can be {mountain, beach} in one dimension, but also {winter, summer} in the other dimension. Under this scenario, our current work can be applied in two ways: (a) design a multi-class classifier for each dimension, i.e., a mountain/beach classifier for one dimension and a winter/summer classifier for the other; or (b) design one multi-class classifier taking all scene categories into consideration, i.e., mountain-winter, mountain-summer, beach-winter, and beach-summer categories.
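The two options can be sketched with a standard SVM on precomputed bag features; X, scene_labels, and season_labels are hypothetical inputs.

# Minimal sketch of the two labeling strategies described above.
from sklearn.svm import SVC

def option_a(X, scene_labels, season_labels):
    """One classifier per dimension, e.g. mountain/beach and winter/summer."""
    return SVC().fit(X, scene_labels), SVC().fit(X, season_labels)

def option_b(X, scene_labels, season_labels):
    """A single multi-class classifier over combined labels such as 'beach-winter'."""
    joint = [f"{s}-{t}" for s, t in zip(scene_labels, season_labels)]
    return SVC().fit(X, joint)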

In our experimental evaluations, image semantic categories are assumed to be well-defined. As pointed out by one of the reviewers, image semantics is inherently linguistic and can therefore only be defined loosely. Thus a methodologically well-defined evaluation technique should take into account scenarios with differing amounts of knowledge about the image semantics. Until this issue is fully investigated, our image categorization results should be interpreted cautiously.

As continuations of this work, several directions may be pursued. The proposed method can potentially be applied to automatically index images using linguistic descriptions.


It can also be integrated into content-based image retrieval systems to group images into semantically meaningful categories so that semantically-adaptive searching methods applicable to each category can be applied. The current instance prototype learning scheme may be improved by boosting techniques. Art and biomedical images would be interesting application domains.

Acknowledgments

The material is based upon work supported by the National Science Foundation under Grant No. IIS-0219272 and CNS-0202007, The Pennsylvania State University, University of New Orleans, The Research Institute for Children, the PNC Foundation, SUN Microsystems under Grant EDUD-7824-010456-US, and NASA/EPSCoR DART Grant NCC5-573. The authors would like to thank Jia Li for making many suggestions on the initial manuscript. We thank the reviewers for valuable suggestions. We would also like to thank Jinbo Bi, Seth Pincus, Andres Castano, C. Lee Giles, Donald Richards, and John Yen for helpful discussions.

References

S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, pages 561–568. Cambridge, MA: MIT Press, 2003.

P. Auer. On learning from multi-instance examples: empirical evaluation of a theoretical approach. In Proc. 14th Int'l Conf. on Machine Learning, pages 21–29, 1997.

K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.

K. Barnard and D. Forsyth. Learning the semantics of words and pictures. In Proc. 8th Int'l Conf. on Computer Vision, pages II:408–415, 2001.

C. L. Blake and C. J. Merz. UCI Repository of machine learning databases, 1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

A. Blum and A. Kalai. A note on learning from multiple-instance examples. Machine Learning, 30(1):23–29, 1998.

C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8):1026–1038, 2002.

O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5):1055–1064, 1999.

Y. Chen and J. Z. Wang. A region-based fuzzy feature matching approach to content-based image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1252–1267, 2002.


N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

I. Daubechies. Ten Lectures on Wavelets. Capital City Press, 1992.

T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.

D. A. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, 2002.

Y. Gdalyahu and D. Weinshall. Flexible syntactic matching of curves and its application to automatic hierarchical classification of silhouettes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12):1312–1328, 1999.

A. Gersho. Asymptotically optimum block quantization. IEEE Transactions on Information Theory, 25(4):373–380, 1979.

M. M. Gorkani and R. W. Picard. Texture orientation for sorting photos ‘at a glance’. In Proc. 12th Int'l Conf. on Pattern Recognition, pages I:459–464, 1994.

J. A. Hartigan and M. A. Wong. Algorithm AS136: A k-means clustering algorithm. Applied Statistics, 28:100–108, 1979.

J. Huang, S. R. Kumar, and R. Zabih. An automatic hierarchical image classification scheme. In Proc. 6th ACM Int'l Conf. on Multimedia, pages 219–228, 1998.

T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning, pages 169–184. Edited by B. Scholkopf, C. J. C. Burges, and A. J. Smola, Cambridge, MA: MIT Press, 1999.

J. Li and J. Z. Wang. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1075–1088, 2003.

W. Y. Ma and B. Manjunath. NeTra: A toolbox for navigating large image databases. In Proc. IEEE Int'l Conf. on Image Processing, pages 568–571, 1997.

O. Maron and T. Lozano-Perez. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems 10, pages 570–576. Cambridge, MA: MIT Press, 1998.

O. Maron and A. L. Ratan. Multiple-instance learning for natural scene classification. In Proc. 15th Int'l Conf. on Machine Learning, pages 341–349, 1998.

D. Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman & Co., 1983.

J. W. Modestino and J. Zhang. A Markov random field model-based approach to image interpretation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(6):606–615, 1992.

K. Murphy, A. Torralba, and W. Freeman. Using the forest to see the trees: a graphical model relating features, objects, and scenes. In Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press, 2004.


W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing. Second edition, Cambridge University Press, New York, 1992.

J. Ramon and L. De Raedt. Multi instance neural networks. In Proc. ICML-2000 Workshop on Attribute-Value and Relational Learning, 2000.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

J. R. Smith and C.-S. Li. Image classification and querying using composite region templates. Int'l J. Computer Vision and Image Understanding, 75(1/2):165–174, 1999.

T. M. Strat. Natural Object Recognition. Berlin: Springer-Verlag, 1992.

M. Szummer and R. W. Picard. Indoor-outdoor image classification. In Proc. IEEE Int'l Workshop on Content-Based Access of Image and Video Databases, pages 42–51, 1998.

M. Unser. Texture classification and segmentation using wavelet frames. IEEE Transactions on Image Processing, 4(11):1549–1560, 1995.

A. Vailaya, M. A. T. Figueiredo, A. K. Jain, and H.-J. Zhang. Image classification for content-based indexing. IEEE Transactions on Image Processing, 10(1):117–130, 2001.

N. Vasconcelos and A. Lippman. A Bayesian framework for semantic content characterization. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pages 566–571, 1998.

J. Z. Wang, J. Li, R. M. Gray, and G. Wiederhold. Unsupervised multiresolution segmentation for images with low depth of field. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(1):85–91, 2001a.

J. Z. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963, 2001b.

C. Yang and T. Lozano-Perez. Image database retrieval with multiple-instance learning techniques. In Proc. IEEE Int'l Conf. on Data Engineering, pages 233–243, 2000.

H. Yu and W. Wolf. Scenic classification methods for image and video databases. In Proc. SPIE Int'l Conf. on Digital Image Storage and Archiving Systems, pages 2606:363–371, 1995.

Q. Zhang and S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In Advances in Neural Information Processing Systems 14, pages 1073–1080. Cambridge, MA: MIT Press, 2002.

Q. Zhang, S. A. Goldman, W. Yu, and J. Fritts. Content-based image retrieval using multiple-instance learning. In Proc. 19th Int'l Conf. on Machine Learning, pages 682–689, 2002.

S. C. Zhu and A. Yuille. Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):884–900, 1996.
