


1057-7149 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2016.2612825, IEEE Transactions on Image Processing


LEGO-MM: LEarning structured model by probabilistic loGic Ontology tree for MultiMedia

Jinhui Tang, Shiyu Chang, Guo-Jun Qi, Qi Tian, Fellow, IEEE, Yong Rui, Fellow, IEEE, and Thomas S. Huang, Life Fellow, IEEE

Abstract—Recent advances in multimedia ontology have resulted in a number of concept models, e.g., LSCOM and Mediamill 101, which are public and accessible to other researchers. However, most current research effort still focuses on building new concepts from scratch, and very little work explores appropriate methods to construct new concepts upon the existing models already in the warehouse. To address this issue, we propose a new framework in this paper, termed LEGO1-MM, which can seamlessly integrate both the new target training examples and the existing primitive concept models to infer the more complex concept models. LEGO-MM treats the primitive concept models as lego toy pieces from which a potentially unlimited vocabulary of new concepts can be constructed. Specifically, we first formulate the logic operations as the lego connectors that combine existing concept models hierarchically in probabilistic logic ontology trees. We then incorporate new target training information simultaneously to efficiently disambiguate the underlying logic tree and correct error propagation. Extensive experiments are conducted on a large vehicle-domain data set from ImageNet. The results demonstrate that LEGO-MM has significantly superior performance over existing state-of-the-art methods, which build new concept models from scratch.

Index Terms—LEGO-MM, concept recycling, model warehouse, probabilistic logic ontology tree, logical operations.

I. INTRODUCTION

Effectively modeling structured concepts has become an essential ingredient for visual recognition, retrieval, and search on the Web. In the prior literature, many sophisticated models have been proposed to recognize a wide range of visual concepts, from our daily life to specific domains such as news video broadcasts, surveillance videos, etc. While most researchers continue to build new

J. Tang is with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: [email protected]).

S. Chang and T. S. Huang are with the Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA (e-mails: [email protected], [email protected]).

G.-J. Qi is with the Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, Florida 32816, USA (e-mail: [email protected]).

Q. Tian is with the Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249-1604, USA (e-mail: [email protected]).

Y. Rui is with Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]).

1Lego is a popular line of construction toys. Lego consists of colorful interlocking plastic bricks and an accompanying array of gears, minifigures, and various other parts. Lego bricks can be assembled and connected in many ways to construct objects such as vehicles, buildings, and even working robots. Anything constructed can then be taken apart again, and the pieces used to make other objects. (http://www.lego.com/)

Fig. 1. An example of using a logical hierarchical semantic ontology to model concepts.

models from scratch leveraging machine learning techniques [1][2][3][4][5][6][7], we concentrate on exploring the utilization of the powerful knowledge bases in the warehouse, such as the Large-Scale Concept Ontology for Multimedia (LSCOM) [8] and the 101 semantic concepts in Mediamill 101 [9]. In this paper, we propose an effective approach which can seamlessly integrate the existing models with new target training samples to construct new complex models.

Conventional approaches to visual classification and recognition are usually sensitive to the number of samples involved in the models. Generally, a large number of samples can provide good generalization ability. However, in many cases it is very difficult to obtain massive training data due to the expense of manual labeling. Moreover, computational power is another constraint on recognizing a wide range of visual concepts. To address such challenges, the proposed method aims to recycle the existing semantics for new tasks.

The inspiration for our general idea comes from constructing complex toys with lego. By analogy to lego toys, each existing model in the warehouse can be viewed as an interlocking plastic brick, upon which more complex concepts can be constructed. In this way, such a multimedia "lego" model can provide semantic-rich building blocks for constructing new concepts, rather than starting with zero knowledge. This idea brings a new perspective for efficiently leveraging a large number of existing primitive concepts to construct potentially unlimited vocabularies of visual concepts.

To construct new concepts from existing components, we


need to explore the appropriate array of gears to connect the existing lego pieces together. As in the toy lego, the array of gears plays the essential role of coherently connecting all the components into a whole. First, let us investigate how humans perceive a new concept in the real world. Usually, children are taught to learn concrete concepts that can be directly recognized by their natural attributes, such as shape, color, and material. As they grow up, they start to learn how to use logic to connect these primitive concepts into more complex ones. For instance, "beach" is a fairly abstract concept built from the concepts "people", "sand", "boat", "sea", etc. As illustrated in Figure 1, "sand", "sea", "people", and "boat" are parts of "beach". Therefore, "beach" can be represented by "(people AND sea)" OR "(sea AND sand)" OR "(sea AND boat)" OR "(sea AND people AND sand)", which exploits the possible combinations of parts of "beach" through an AND-OR relationship. This demonstrates that a hierarchical semantic ontology provides a reasonable way to model concepts, ordered from the primitive concepts in the lower levels to the complex ones in the upper levels, by utilizing various logical operations. In other words, given a collection of primitive concepts, many other complex concepts can be built upon them by connecting them with logic.
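The AND-OR composition of "beach" described above can be sketched as a plain Boolean predicate. This is only an illustration: the inputs stand for hypothetical primitive concept detections, not the probabilistic models introduced later in the paper.

```python
# Sketch of the AND-OR composition of "beach" from the text; the inputs
# stand for hypothetical primitive concept detections, not real detectors.

def beach(people: bool, sand: bool, boat: bool, sea: bool) -> bool:
    """(people AND sea) OR (sea AND sand) OR (sea AND boat) OR (sea AND people AND sand)."""
    return ((people and sea) or (sea and sand) or
            (sea and boat) or (sea and people and sand))

# A scene with sea and sand but no people or boats still qualifies as beach;
# sand alone, without sea, does not.
print(beach(people=False, sand=True, boat=False, sea=True))   # True
print(beach(people=False, sand=True, boat=False, sea=False))  # False
```

Note that every disjunct shares "sea", which already hints at the semantic ambiguity discussed later: the logic can confirm a beach, but its negative side cannot enumerate everything a beach is not.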

Based on the above observations, in this paper we propose a novel approach, LEGO-MM (LEarning structured model by probabilistic loGical Ontology tree for MultiMedia), which constructs structured concepts upon a set of primitive models. The key contributions of this paper can be summarized as follows.

• Different from many existing concept modeling techniques, LEGO-MM integrates logical and statistical inference in a unified framework, where the existing primitive concepts are connected into a potentially unlimited vocabulary of high-level concepts by basic logical operations. In contrast, most existing modeling algorithms either learn only a flat correlative concept structure [10][11], or a simple hierarchical structure without logical connections [12][13][14].

• An efficient statistical learning algorithm is proposed to model the complex concepts in the upper levels of the hierarchy upon logically connected primitive concepts. This statistical learning approach is much more flexible: each concept in the hierarchy can be modeled from heterogeneous feature spaces with the most suitable feature descriptors (e.g., visual features such as color and shape for scenery concepts, and textual features such as TF-IDF for named entities), or can be obtained from different semantic warehouses. This means that when building the target model, we can select LEGO-MM pieces made from different "materials" and still connect these heterogeneous pieces of lego together.

• LEGO-MM can simultaneously incorporate both new target training information and lego building blocks. This setup allows LEGO-MM to efficiently disambiguate the underlying logic tree and correct the error

propagation while leveraging only a few training samples, which is especially valuable when a large number of concepts need to be categorized and labeled data is deficient. It demonstrates significant superiority compared to SVM-type training algorithms, which build new concepts from scratch.

To sum up, the general idea is that the primitive models, as pieces of multimedia lego, can be viewed as building blocks, by analogy to training examples in a conventional classification problem, because both provide basic semantic information for inferring new concept models. In prior work, a large number of models already exist in warehouses such as LSCOM 374 and Mediamill 101. These models can provide many more semantic resources to mine than raw training examples, because they are learned from example images and video key-frames, and thus rich discriminative information about the primitive concepts has been captured by them. Therefore, it is much more efficient to mine these models directly rather than going back to tediously collecting labeled examples and retraining models again. In this way, plenty of resources and effort can be saved in the multimedia community by leveraging the existing multimedia lego models.

Compared to our preliminary work [15], in this paper we present more detailed theoretical derivations of the proposed approach and conduct more extensive experiments and analysis. The paper is organized as follows. We review related work in Section 2. In Section 3, the probabilistic logic ontology tree is formally proposed in subsection 3.1, and the corresponding learning algorithm for the complex concepts in the probabilistic logic ontology tree is explained in subsection 3.2. The probabilistic logic ontology tree is then applied to the hierarchical concept classification problem in Section 4. In Section 5, we present extensive experimental results on a large vehicle-domain data set from ImageNet, demonstrating significantly superior performance over existing SVM-type approaches which build new concept models from scratch. Finally, we conclude and propose possible future directions in Section 6.

II. RELATED WORK

Hierarchical concept classification has attracted much attention [16][12][17]. As opposed to traditional flat concept classification, it attempts to classify a testing sample via a hierarchical concept tree. The hierarchical structure is a natural form of concept organization that is consistent with natural language. For example, in WordNet [18], concepts are organized in a hierarchical structure with hyponym (i.e., Y is a hyponym of X if every Y is a kind of X) or meronym (i.e., Y is a meronym of X if Y is a part of X) relations. While the contributions discussed in [19][20][21] focused on using hierarchical structures to enhance classification efficiency, several other works proposed to learn visual categories hierarchically [20][22][23] in an unsupervised or semi-supervised fashion. Other studies illustrated methods for organizing low-level


object representations hierarchically so that descriptiveness and discrimination performance are enhanced [24][25].

Another important research direction is how to handle limited labeled data in visual classification and recognition. Usually, the generalization ability of supervised models is determined by the number of labeled samples involved in the training stage. However, in many applications a large amount of labeled training data is hard to obtain because of the expense of labeling. One approach to alleviate this problem is to incorporate unlabeled data into model learning (co-training) using manifolds [26][27][28]. Another approach is based on transfer learning; for example, the method proposed in [29] aims at transferring knowledge between heterogeneous domains so that more prior knowledge can be obtained.

III. PROBABILISTIC LOGICAL ONTOLOGY TREE

To present how we apply LEGO-MM to construct complex concepts, in this section we define the concrete learning structure: the Probabilistic Logical Ontology Tree (PLOT).

A. Prior Probabilistic Models in PLOT

We start with the definition of PLOT via an example. As illustrated in figure 2, PLOT is a logical tree

$$T = \{(C, f, L),\ (C^6, f^6, L^6),\ (C^7, f^7, L^7),\ (C^8, f^8, L^8),\ (C^1, f^1, P^1),\ \ldots,\ (C^5, f^5, P^5)\} \qquad (1)$$

where C and C^i (i = 1, 2, ..., 5) are concept nodes, and f and f^i are the feature descriptors attached to C and C^i (each node can also be seen as a sub-tree of a hierarchical ontology). Each upper-level concept C^i, other than the leaf nodes, can be expanded into a set of children concepts by a logical operation L^i chosen from OR, AND, or NOT. Each node also has an attached model P^i(y|f^i(x)) predicting the probability of the label being positive (y = 1) or negative (y = 0) given the feature f^i(x) of a sample x. The associated model can take an arbitrary flexible mathematical form, such as a logistic regression model, an exponential model, or even a support vector machine or boosting model (normalized into probabilistic form first).
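A PLOT node, as described above, can be sketched as a small data structure: a concept name, an optional feature descriptor and probabilistic model for leaves, and a logical operation with children for internal nodes. All names and the toy models here are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Minimal sketch of a PLOT node, following Eq. (1): a concept C^i carries a
# feature descriptor f^i and a probabilistic model P^i(y=1 | f^i(x)); an
# internal node instead carries a logical operation over its children.

@dataclass
class PlotNode:
    concept: str                                        # concept name C^i
    feature: Optional[Callable] = None                  # feature descriptor f^i
    model: Optional[Callable[[object], float]] = None   # x -> P^i(y=1 | f^i(x))
    logic: Optional[str] = None                         # "AND", "OR", or "NOT"
    children: List["PlotNode"] = field(default_factory=list)

# Leaf nodes come from the model warehouse; internal nodes are lego connectors.
c1 = PlotNode("C1", feature=lambda x: x, model=lambda x: 0.9)  # toy constant model
c2 = PlotNode("C2", feature=lambda x: x, model=lambda x: 0.2)
c6 = PlotNode("C6", logic="OR", children=[c1, c2])             # C6 = C1 OR C2
```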

In PLOT, each complex concept in the upper levels can be represented via the leaf concepts through the logical relations. Taking the PLOT in figure 2 as an example, C^6 = C^1 OR C^2, C^7 = C^3 AND C^4, and C = (C^1 OR C^2) OR (C^3 AND C^4) OR (NOT C^5). Such classical Boolean logic gives two exclusive results: a sample is either positive or negative for the target concept. To formulate a corresponding prior probability model for learning and inference, the Boolean logic is converted into fuzzy logic, which replaces AND, OR, and NOT by continuous probabilistic operations. Many different fuzzy logical operations can perform such a conversion; here we enumerate two of them. Min/max/complement: In this case, AND is replaced by "min", OR by "max", and

NOT by 1 − P(y = 1|f(x)), where P is the model attached to the child node of the logical NOT. Taking C in figure 2 as an example, the prior model P_prior(y|f(x)) for C is

$$P(y=1|f(x)) = \max\Big\{\max\{P^1(y=1|f^1(x)),\, P^2(y=1|f^2(x))\},\ \min\{P^3(y=1|f^3(x)),\, P^4(y=1|f^4(x))\},\ 1 - P^5(y=1|f^5(x))\Big\}.$$

Probabilistic product/sum: In this case, $C^a = C^1$ AND $C^2$ is replaced by

$$P^a(y=1|f(x)) = T_{prod}\big(P^1(y=1|f(x)),\, P^2(y=1|f(x))\big) = P^1(y=1|f(x))\,P^2(y=1|f(x)),$$

and $C^b = C^1$ OR $C^2$ is replaced by

$$P^b(y=1|f(x)) = \perp_{sum}\big(P^1(y=1|f(x)),\, P^2(y=1|f(x))\big) = P^1(y=1|f(x)) + P^2(y=1|f(x)) - P^1(y=1|f(x))\,P^2(y=1|f(x)).$$

Again, NOT is converted by the complement operation as in the min/max/complement case. The probabilistic product and sum are often called the T-norm and T-conorm, respectively.
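The two conversions above can be sketched in a few lines and evaluated on the example tree C = (C1 OR C2) OR (C3 AND C4) OR (NOT C5); the probabilities p1..p5 stand for the primitive models P^i(y=1|f^i(x)) and are made-up numbers.

```python
# Sketch of the two fuzzy-logic conversions, evaluated on the example
# C = (C1 OR C2) OR (C3 AND C4) OR (NOT C5). Inputs are made-up
# primitive probabilities P^i(y=1 | f^i(x)).

def eval_minmax(p1, p2, p3, p4, p5):
    """Min/max/complement: AND -> min, OR -> max, NOT -> 1 - P."""
    return max(max(p1, p2), min(p3, p4), 1.0 - p5)

def t_prod(pa, pb):
    """T-norm: probabilistic product for AND."""
    return pa * pb

def t_conorm(pa, pb):
    """T-conorm: probabilistic sum for OR."""
    return pa + pb - pa * pb

def eval_prob(p1, p2, p3, p4, p5):
    """Probabilistic product/sum, with the complement for NOT."""
    return t_conorm(t_conorm(t_conorm(p1, p2), t_prod(p3, p4)), 1.0 - p5)

print(eval_minmax(0.9, 0.2, 0.6, 0.7, 0.4))  # 0.9
print(eval_prob(0.9, 0.2, 0.6, 0.7, 0.4))    # close to 1: several weak OR branches
```

Both conversions keep the outputs in [0, 1], so the result can serve directly as the prior model P_prior(y = 1|f(x)).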

There are many other fuzzy logical operations that perform a similar conversion from classical Boolean logic to probabilistic form, such as Łukasiewicz logic, nilpotent logic, and Hamacher logic. Interested readers can find more in [30].

To learn a satisfactory model for the upper-level concepts via PLOT, we still need to overcome the following two problems: semantic ambiguity and error propagation.

• Semantic ambiguity. A concept can be represented by hierarchical logical structures over several levels, and different structures over different sub-concepts can represent the same root concept. Consequently, a sample that is not covered by one hierarchical logical structure of a concept may still belong to that concept under another structure, which results in semantic ambiguity. We use an example to explain this problem, as illustrated in figure 3. Figure 3 shows a vehicle PLOT with three children nodes combined by logical OR. Since bus, car, and truck are kinds of vehicle, the samples on the positive sides of these three objects, i.e., the region associated with the logic "bus OR car OR truck", must also be vehicles. However, the negative sides of these objects, i.e., the ambiguity region, cannot exclude the possibility that some samples in it are vehicles. For example, a sample corresponding to ship is located in the ambiguity region, but it is also a vehicle. In other words, it is impossible and unnecessary to enumerate all possible subclasses of vehicle (such as ship) in a PLOT. Thus, the logical results derived from a PLOT cannot resolve the ambiguity on the negative sides of the children nodes, and some extra information is needed to clarify it. A similar semantic ambiguity problem also exists for the other logic operations.

• Error propagation. The probabilistic models in the nodes at the lower levels cannot be perfectly constructed, due to incomplete semantic information and


Fig. 2. An example of a probabilistic logical ontology tree. The target concept C is represented by a hierarchical logical structure in three levels. From this PLOT, C is finally described by five primitive concepts in the model warehouse as C = (C1 OR C2) OR (C3 AND C4) OR (NOT C5). It is worth noting that for each concept C^i, different features f^i can be used as low-level descriptors.

the limitations of the models; thus, the errors contained in these models may be propagated into the higher-level concepts. Therefore, some relevance feedback scheme [31] is required to correct these errors when learning the higher-level concepts.

Summarizing the above two problems: to model high-level concepts, using only the information from the models associated with the lower-level nodes in a PLOT is not enough. Some extra content-based examples are required to update the prior models purely obtained by the logical relations, so as to clarify semantic ambiguity and correct the errors from the lower-level models. In other words, two criteria are proposed for modeling the complex concepts in the upper levels.

• Criterion 1: The model P(y|f(x)) for the upper-level concepts should preserve as much information as possible from the prior model P_prior(y|f(x)), which combines the information of the primitive models at the lower-level nodes in the PLOT.

• Criterion 2: Given the new extra training examples, the model P(y|f(x)) for the upper-level concepts must reflect the information contained in these examples.

Based on the above two criteria, we formulate the proposed probabilistic algorithms to model the high-level concepts on a PLOT.

B. Learning and Inference on PLOT

Here we formally formulate our problem mathematically. Given a set of models P^m(y|f^m(x)) of low-level concepts C^m (where 1 ≤ m ≤ M, and M = 5 in figure 2), as well as some extra training examples {x_l, y_l | 1 ≤ l ≤ N}, our goal is to learn a model P(y|f(x)) for the target concept C based on a given PLOT.

First, according to the PLOT, the target concept C can be expanded into the C^m by the logical relations uncovered by the PLOT. Accordingly, we can obtain a prior model P_prior(y|[f^m(x)]_{m=1}^M), just as in the example shown in figure 2.

Fig. 3. An example of the semantic ambiguity problem.

Then the new model P(y|f(x)) should reflect the two criteria mentioned at the end of the last subsection. Therefore, we formulate the following optimization problem to solve it,

$$\begin{aligned}
\min_{P(y|f(x))}\ & \frac{1}{N}\sum_{l=1}^{N} \mathcal{D} \\
\text{s.t.}\ & \frac{1}{N}\sum_{l=1}^{N} \mathbb{E}_{P(y|f(x_l))}\big[y f_d(x_l)\big] = \frac{1}{N}\sum_{l=1}^{N} y_l f_d(x_l) + \theta_d, \quad 1 \le d \le D, \\
& \frac{1}{N}\sum_{l=1}^{N} \mathbb{E}_{P(y|f(x_l))}\big[y\big] = \frac{1}{N}\sum_{l=1}^{N} y_l + \eta, \\
& \sum_{y\in\{0,1\}} P(y|f(x_l)) = 1, \\
& \sum_{d=1}^{D} \frac{\theta_d^2}{2\sigma_\theta^2/N} + \frac{\eta^2}{2\sigma_\eta^2/N} \le T
\end{aligned} \qquad (2)$$


where $\mathcal{D} = D_{KL}\big(P(y|f(x_l))\,\|\,P_{prior}(y|[f^m(x_l)]_{m=1}^M)\big)$ is the Kullback-Leibler divergence, θ_d and η are the estimation errors, σ_θ and σ_η are two given parameters, and T is a constant. By minimizing the divergence between P(y|f(x)) and P_prior(y|[f^m(x)]_{m=1}^M), the information in the prior model is preserved as much as possible, in accordance with Criterion 1. E_{P(y|f(x_l))}[·] is the expectation with respect to the distribution P(y|f(x_l)), and f_d(x_l) is the d-th element of the low-level feature vector f(x_l). The first two constraints in the above formulation require that the first two order statistics of the new model P(y|f(x)) comply with the training set, up to the estimation errors θ_d and η. The third constraint normalizes the model so that it satisfies the probabilistic property. Finally, the fourth constraint assumes that the joint magnitude of the estimation errors is reasonably upper bounded by T [17].

Inference on PLOT: First, given equation (2), we show how to infer the model P(y|f(x)) for the target concept. From (2), the Lagrangian function is:

$$\begin{aligned}
L\big(P(y|f(x_l)), \theta, \eta, b, w, \gamma, \xi\big) =\ & \mathcal{D} + \sum_{d=1}^{D} w_d \left\{\frac{1}{N}\sum_{l=1}^{N} y_l f_d(x_l) + \theta_d - E_1\right\} + b \left\{\frac{1}{N}\sum_{l=1}^{N} y_l + \eta - E_2\right\} \\
& + \gamma \left\{\sum_{d=1}^{D} \frac{\theta_d^2}{2\sigma_\theta^2/N} + \frac{\eta^2}{2\sigma_\eta^2/N} - T\right\} + \sum_{x} \xi(x)\left(1 - \sum_{y\in\{0,1\}} P(y|f(x))\right), \qquad (3)
\end{aligned}$$

where $E_1 = \frac{1}{N}\sum_{l=1}^{N} \mathbb{E}_{P(y|f(x_l))}\big[y f_d(x_l)\big]$ and $E_2 = \frac{1}{N}\sum_{l=1}^{N} \mathbb{E}_{P(y|f(x_l))}\big[y\big]$. By solving Eq. (3), we can obtain the solution of the target model P(y|f(x_l)) as follows,

$$P(y|f(x_l)) = \frac{1}{Z(x_l)}\, P_{prior}\big(y\,\big|\,[f^m(x_l)]_{m=1}^M\big)\cdot \mathcal{X}, \qquad (4)$$

where

$$Z(x_l) = \sum_{y\in\{0,1\}} P_{prior}\big(y\,\big|\,[f^m(x_l)]_{m=1}^M\big)\cdot \mathcal{X}$$

is the partition function for normalization, and $\mathcal{X} = \exp\{y(w^T x_l + b)\}$. The details of the solution and its derivation can be found in Part A of the Appendix.

Learning on PLOT: Now we show how to learn P(y|x), i.e., how to compute its model parameters w and b. From (11) in the Appendix we obtain

$$\eta = -\frac{b}{N\gamma}\,\sigma_\eta^2, \qquad \theta_d = -\frac{w_d}{N\gamma}\,\sigma_\theta^2. \qquad (5)$$

Algorithm 1 PLOT
Input: a set of models P^m(y|f^m(x)); training examples {x_l, y_l | 1 ≤ l ≤ N}; parameters θ_d, η, σ_θ, σ_η, and T.
Output: model P(y|f(x)).
1: Solve P(y|f(x_l)) with Eq. (4).
2: Solve b* and w* with Eq. (6) by the conjugate gradient method.
3: P(y = 1|f(x)) = 1/(1 + e^{w^T x + b}).
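The per-sample inference of Eq. (4) can be sketched numerically as follows: the prior is re-weighted by the factor exp{y(w^T x + b)} and normalized over y ∈ {0, 1}. The prior value, w, and b below are hypothetical numbers, not learned parameters.

```python
import numpy as np

# Sketch of Eq. (4): re-weight the prior P_prior(y | [f^m(x)]) by the
# factor exp{y (w^T x + b)} and normalize over y in {0, 1}.
# The prior probability, w, and b are hypothetical numbers.

def target_posterior(x, prior_pos, w, b):
    """Return P(y = 1 | f(x)) per Eq. (4)."""
    prior = np.array([1.0 - prior_pos, prior_pos])   # P_prior over y = 0, 1
    factor = np.exp(np.array([0.0, w @ x + b]))      # exp{y (w^T x + b)}
    unnorm = prior * factor
    return unnorm[1] / unnorm.sum()                  # division by Z(x)

x = np.array([0.5, -1.0])
p = target_posterior(x, prior_pos=0.7, w=np.array([1.0, 0.2]), b=0.1)
# With w = 0 and b = 0 the factor is 1 and the prior is returned unchanged,
# i.e., the extra training information leaves the logical prior untouched.
```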

Substituting (4) and (5) into the Lagrangian function (3), we can formulate the dual optimization problem as

$$\begin{aligned}
b^*, w^* &= \arg\max_{b,w}\ L\big(P(y|f(x_l)), \theta, \eta, b, w, \gamma, \xi\big) \\
&= \arg\max_{b,w}\ \frac{1}{N}\sum_{l=1}^{N}\Big( y_l\big(w^T f(x_l) + b\big) + \log P_{prior}\big(y\,\big|\,[f^m(x_l)]_{m=1}^M\big) - \log Z(x_l)\Big) - \frac{\lambda_w}{2N}\|w\|_2^2 - \frac{\lambda_b}{2N}\, b^2 \qquad (6)
\end{aligned}$$

where $\lambda_w = \sigma_\theta^2/\gamma$ and $\lambda_b = \sigma_\eta^2/\gamma$ are the balance parameters. The optimization of Eq. (6) is described in Part B of the Appendix. Algorithm 1 summarizes the algorithm steps.
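Step 2 of Algorithm 1 can be sketched with a generic conjugate-gradient solver on the negated objective of Eq. (6); here scipy's "CG" method stands in for the paper's solver, and the toy data, prior values, and lambda values are all made up.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of step 2 of Algorithm 1: maximize the dual objective of Eq. (6)
# over (w, b) by minimizing its negation with conjugate gradients.
# Toy data; the priors and regularization weights are hypothetical.

rng = np.random.default_rng(1)
N, D = 40, 4
F = rng.normal(size=(N, D))                        # feature vectors f(x_l)
y = rng.integers(0, 2, N).astype(float)            # labels y_l
prior_pos = np.clip(rng.random(N), 0.05, 0.95)     # P_prior(y=1 | [f^m(x_l)])
lam_w, lam_b = 1.0, 1.0

def neg_objective(params):
    w, b = params[:D], params[D]
    s = F @ w + b                                  # w^T f(x_l) + b
    # log Z(x_l) = log((1 - prior) + prior * exp(s)), the sum over y in {0, 1}
    logZ = np.logaddexp(np.log(1.0 - prior_pos), np.log(prior_pos) + s)
    loglik = np.mean(y * s + np.log(np.where(y == 1, prior_pos, 1 - prior_pos)) - logZ)
    return -(loglik - lam_w / (2 * N) * (w @ w) - lam_b / (2 * N) * params[D] ** 2)

res = minimize(neg_objective, np.zeros(D + 1), method="CG")
w_star, b_star = res.x[:D], res.x[D]
```

Note that the log P_prior(y_l|·) term is constant in (w, b), so it does not affect the optimum; it is kept here to mirror the form of Eq. (6).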

C. Efficient Online Modeling for Large-Scale Problems

As more and more visual data booms on the Internet, from image and video sharing web sites to various kinds of social communities, efficient modeling and recognition algorithms are required to handle these growing data. Among them, online modeling techniques are very useful for handling large-scale data sets where the samples are processed one by one. Once new data arrives, there is no need to re-train a brand new model with all the collected data; instead, only the new samples are required to update the existing model. Assume we currently have a model $P(y|f(x)) = \frac{1}{Z(x)} P_{prior}\big(y\,\big|\,[f^m(x)]_{m=1}^M\big)\cdot \mathcal{X}$ in hand, as in equation (4); our goal is to obtain a new model $\tilde{P}(y|f(x))$ from some new examples $\{\tilde{x}_l, \tilde{y}_l\}_{l=1}^{\tilde{N}}$. Following an idea similar to formulation (2), the new model should preserve as much information as possible from $P(y|f(x))$, as well as reflect the new information in $\{\tilde{x}_l, \tilde{y}_l\}_{l=1}^{\tilde{N}}$. By substituting $\frac{1}{\tilde{N}}\sum_{l=1}^{\tilde{N}} D_{KL}\big(\tilde{P}(y|f(\tilde{x}_l))\,\|\,P(y|f(\tilde{x}_l))\big)$ into the objective function in (2), and substituting the new examples $\{\tilde{x}_l, \tilde{y}_l\}_{l=1}^{\tilde{N}}$ for those in $\{x_l, y_l\}_{l=1}^{N}$, we obtain the new model as

$$\tilde{P}(y|f(x)) = \frac{1}{\tilde{Z}(x)}\, P_{prior}\big(y\,\big|\,[f^m(x)]_{m=1}^M\big)\cdot \exp\Big\{y\big((w+\tilde{w})^T x + (b+\tilde{b})\big)\Big\} \qquad (7)$$

and

$$\tilde{Z}(x) = \sum_{y\in\{0,1\}} P_{prior}\big(y\,\big|\,[f^m(x)]_{m=1}^M\big)\cdot \exp\Big\{y\big((w+\tilde{w})^T x + (b+\tilde{b})\big)\Big\}, \qquad (8)$$


where $\tilde{w}$ and $\tilde{b}$ can be computed from
$$
\tilde{b}^*, \tilde{w}^* = \arg\max_{\tilde{b},\tilde{w}}\; \frac{1}{\tilde{N}}\sum_{l=1}^{\tilde{N}}\Big\{ \tilde{y}_l\big((w+\tilde{w})^T f(\tilde{x}_l) + (b+\tilde{b})\big) + \log P_{prior}\big(y\,\big|\,[f_m(\tilde{x}_l)]_{m=1}^{M}\big) - \log \tilde{Z}(\tilde{x}_l) \Big\} - \frac{\lambda_w}{2\tilde{N}}\|\tilde{w}\|_2^2 - \frac{\lambda_b}{2\tilde{N}}\tilde{b}^2 \quad (9)
$$
Since only the new samples are involved in the above optimization problem, the model can be updated much more efficiently.
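A minimal sketch of this online update, assuming a uniform prior so the model reduces to a logistic form: the old parameters are frozen and only the correction terms are fit on the new batch, as in Eqs. (7)-(9). The data, the stand-in gradient-ascent solver, and all names (`w_old`, `dw`, `X_new`) are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4
w_old, b_old = rng.normal(size=D), 0.0           # existing model, Eq. (4)

X_new = rng.normal(size=(50, D))                  # new examples {x~_l, y~_l}
y_new = (X_new @ rng.normal(size=D) > 0).astype(float)

def prob_pos(X, w, b):
    # P(y=1 | f(x)) under a uniform prior: a plain sigmoid.
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

# Fit only the correction terms (dw, db) of Eq. (9); (w_old, b_old) are frozen,
# so no old training data is needed.
dw, db = np.zeros(D), 0.0
for _ in range(200):
    p = prob_pos(X_new, w_old + dw, b_old + db)
    dw += 0.5 * (X_new.T @ (y_new - p)) / len(y_new)
    db += 0.5 * (y_new - p).mean()

# Combined model of Eq. (7): (w + w~, b + b~).
w_updated, b_updated = w_old + dw, b_old + db
acc_new = ((prob_pos(X_new, w_updated, b_updated) > 0.5) == (y_new > 0.5)).mean()
```

The key point the sketch illustrates is that the optimization loop touches only the new batch, which is what makes the update cheap.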

D. Multimodal Feature Descriptors

Note that in the learning algorithm in Subsection 3.2, different feature descriptors (i.e., $f_m$ and $f$) can be used to represent the concepts associated with the nodes in PLOT. This increases the flexibility of the feature representation, so that the most suitable features can be used for each concept. Moreover, with such a heterogeneous feature structure, both content-based features (i.e., visual/audio features extracted from multimedia content) and context-based features (i.e., the surrounding text, GPS location data, user tags, etc. from the multimedia context) can be adopted in PLOT. Some concepts are better modeled by content features, such as objects (e.g., car, rocket, and horse), scenery (e.g., beach, mountain) and events (e.g., human actions). On the other hand, some concepts are better modeled by contextual features, such as landmark places of interest (e.g., the Great Wall, the Eiffel Tower and the White House). With the learning algorithm proposed above, different concepts in PLOT can select their most suitable feature descriptors for modeling, integrated in a unifying framework. However, since the major contribution of this work is not the characterization of different features, this aspect is worth developing in more depth in the future.

IV. EXPERIMENTS

In this section we present experiments comparing the proposed LEGO-MM approach with other state-of-the-art algorithms. We demonstrate how the proposed algorithm effectively models structured concepts using the model warehouse, as well as improves classification results on the target concepts.

A. Dataset

To demonstrate the robustness and effectiveness of the proposed LEGO-MM approach in hierarchical visual recognition using a model warehouse, we conduct experiments on the ImageNet dataset [32] in the "vehicle" domain. ImageNet is a realistic image database organized by the WordNet [33][18] hierarchy. Each node in the hierarchy represents a concept and is associated with a set of images. "Vehicle" is a relatively complex root concept in the ontology, covering a large number of different sub-genre categories. This "vehicle"-specific ImageNet subset was first used and released in [34]. Here we adopt the same

Fig. 4. Some example images in the dataset.

dataset to show the competitive results of the proposedLEGO-MM algorithm.

The dataset is drawn from the concept "vehicle" in the ImageNet database and contains 26,210 images, including 13,889 positive "vehicle" samples and 12,321 negative samples. There are 20 concepts associated with the root "vehicle", forming a four-level ontological structure with 13 leaf nodes. Each leaf concept contains around 1,000 positive images. The negative samples include concepts such as "formation", "structure" and "sports", which contain substantial low-level visual ambiguities. Some sample images are shown in Figure 4. The "parent-child" relationship indicates an "is-a" relationship (also seen as an OR relation in PLOT) in the WordNet taxonomy. For example, "plane" in the ontology shown in Figure 5 has an "OR" relationship to its children, and "NOT" relationships to the other nodes at the same level.

B. Feature Extraction and Selection

We extract the Hierarchical Gaussianization (HG) feature [35] to represent each image in our experiments. Basically, HG features jointly model the appearance and spatial structure of each image by fitting a Bayesian hierarchical framework using a mixture of Gaussians. Specifically, we use standard HG features by first extracting 128-dimensional SIFT descriptors within 20x20 patches over a grid with five-pixel spacing. Then Principal Component Analysis (PCA) is applied to each SIFT vector to reduce its dimensionality to 80. Moreover, each image is characterized by 512 Gaussian mixture components, each of which is vectorized into an 80-dimensional vector. In total, an 80 · 512 = 40,960-dimensional feature vector is constructed for each image. For computational simplicity and to accelerate the learning process, we further reduce the dimensionality of the final feature representation to 1,000 using PCA.
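The PCA reduction used twice in this pipeline (128-d SIFT descriptors to 80-d, and the final 40,960-d HG vector to 1,000-d) can be sketched with a plain SVD-based projection; the toy data and dimensions below are illustrative only, not the actual features.

```python
import numpy as np

def pca_reduce(X, k):
    # Center the rows, then project onto the top-k principal directions,
    # obtained from the SVD of the centered data matrix.
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(300, 128))   # stand-in for SIFT descriptors
reduced = pca_reduce(descriptors, 80)        # 128-d -> 80-d, as in the paper
```

In practice one would fit the projection on training data only and reuse it at test time; this sketch omits that bookkeeping.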



Fig. 5. Vehicle hierarchy from ImageNet. The red-colored nodes indicate the leaf-node concepts, which can be obtained from the model warehouse.

C. Real World Problem Simulations

In our experiments, we simulate the real-world scenario described in the introduction. We randomly split all images into three disjoint sets of 40%, 15%, and 45%. Basic logistic regression models are then trained on the 40% split for the leaf nodes of the given hierarchical structure. After training, we keep only the weight vectors and use them as the "existing models" in the model warehouse. In this way, we obtain a pool of rich semantic models for images. This is analogous to LSCOM and Mediamill 101, where the samples used to generate the models in the pool are no longer accessible. In fact, in real life, a large number of labeled training samples is not easy to acquire, especially when the domain of concepts expands rapidly. To take this fact into account, we randomly sample 15% of the data as training images for LEGO-MM as well as for the compared methods, which cannot acquire information from pre-trained models. The remaining 45% of the data is used as testing samples to evaluate the recognition performance of the different approaches. All experimental results are reported as averages over ten random runs.
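The split protocol above can be sketched as follows; the index list is a hypothetical stand-in for the actual image ids, and the seed is arbitrary.

```python
import random

random.seed(42)
ids = list(range(26210))          # dataset size reported in Section IV-A
random.shuffle(ids)

n_ware = int(0.40 * len(ids))     # 40%: train the leaf-node "existing models"
n_adapt = int(0.15 * len(ids))    # 15%: small adaptation set for LEGO-MM

warehouse_set = ids[:n_ware]                      # used once, then discarded
adapt_set = ids[n_ware:n_ware + n_adapt]          # the only data LEGO-MM sees
test_set = ids[n_ware + n_adapt:]                 # 45%: held-out evaluation
```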

D. Logic Operations and Error Propagation Analysis

In Section III-A, we mentioned two different ways to represent Boolean logic probabilistically: one uses Min/Max functions, the other uses the T-norm/T-conorm. We also illustrated how important even a small number of training samples is for preventing error propagation from the lower levels to the higher ones. Here we compare the classification results of the proposed LEGO-MM approach against a purely logical version that does not use any updating samples, called P-LEGO-MM. The results are shown in Table I.
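The two relaxations can be written down directly as plain functions; the child probabilities used at the end are illustrative numbers, not results from the paper.

```python
def or_max(probs):
    # Max relaxation of OR over child probabilities.
    return max(probs)

def or_tconorm(probs):
    # Probabilistic-sum T-conorm: P(A or B or ...) = 1 - prod(1 - P_i).
    out = 1.0
    for p in probs:
        out *= (1.0 - p)
    return 1.0 - out

def and_min(probs):
    # Min relaxation of AND.
    return min(probs)

def and_tnorm(probs):
    # Product T-norm: P(A and B and ...) = prod(P_i).
    out = 1.0
    for p in probs:
        out *= p
    return out

def logic_not(p):
    # NOT: complement probability.
    return 1.0 - p

# e.g. the prior of a parent concept ("plane") as the OR of its
# children's model outputs, under each relaxation:
children = [0.2, 0.7, 0.4]
prior_max = or_max(children)       # Max relaxation
prior_tc = or_tconorm(children)    # T-conorm relaxation
```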

Two main observations can be made from Table I. First, both probabilistic logic operations provide comparable results on all concepts, regardless of how far these concepts are from the "existing models". In general, "T-conorm" performs slightly better than "Max" in both the LEGO-MM and P-LEGO-MM methods on all eight target concepts. Second, P-LEGO-MM's accuracy decreases dramatically as the "target concept" moves higher in the ontology hierarchy, whereas the proposed LEGO-MM algorithm maintains reliable performance. The

main reason is that the general procedure of applying logical operations on the hierarchical structure relies on lower-level concepts to provide the information for learning "target concepts" at higher levels. Once the prior is obtained, the model adaptation process starts; meanwhile, the prior probabilities for the parent nodes are computed in the same manner. Without a model-updating scheme, however, the higher-level models cannot be reliably refined, and errors can accumulate through the entire hierarchical structure. A similar observation can be made from Figure 6: the two ROC curves in the first row illustrate the classification accuracy at the 3rd level, and the two ROC curves at the bottom correspond to the 2nd-level concepts.

E. Performance Comparison

In the previous section we showed that merely combining primitive concepts is insufficient due to error propagation. To clearly demonstrate the advantage of the proposed LEGO-MM method for ontological categorization, we compare our experimental results with other state-of-the-art approaches. The comparison experiments strictly follow the protocol in Section IV-C, and the results are shown in Table II. In [34] and [35], the best classifier using the features of [35] was obtained by first applying Within Class Covariance Normalization (WCCN), and then using Nearest Neighbor or Nearest Centroid (WCCN+NN and WCCN+NC). In our setup, both methods yield results comparable to a flat multi-class SVM [2]. However, there is no straightforward way to integrate these approaches into the given ontology hierarchy. Therefore, we also conduct experiments with a tree-loss-based hierarchical SVM proposed in [36][37]. In addition, we compare against deep convolutional neural networks, the state-of-the-art classification method in recent years. Among all methods, our proposed LEGO-MM algorithm provides a significant gain over the typical methods, especially at the 3rd level of classification. By using CNN features as the image descriptors (denoted Ours + CNN features), the performance of our proposed LEGO-MM is further improved over DCNN. This shows that the "existing models" in the model warehouse contribute greatly to distinguishing the more abstract categories at higher levels.


TABLE I
HIERARCHICAL CLASSIFICATION RESULTS ON THE "TARGET CONCEPTS". THE BEST PERFORMANCE FOR EACH CATEGORY IS HIGHLIGHTED IN BOLD.

Level  Category  P-LEGO-MM (Max)  P-LEGO-MM (T-conorm)  LEGO-MM (Max)  LEGO-MM (T-conorm)
3rd    plane     94.18            94.23                 94.30          94.33
3rd    car       95.33            95.41                 95.99          96.04
3rd    cycle     95.25            95.31                 95.46          95.51
3rd    ship      95.57            95.58                 95.53          95.54
2nd    aerial    91.36            91.50                 92.32          92.37
2nd    ground    89.45            89.86                 93.00          93.11
2nd    marine    93.80            93.90                 94.31          94.36
1st    vehicle   87.21            89.39                 96.47          96.94

TABLE II
COMPARISON RESULTS AGAINST OTHER STATE-OF-THE-ART METHODS. THE BEST PERFORMANCE FOR EACH CATEGORY IS HIGHLIGHTED IN BOLD.

Method                   2nd level  3rd level
Flat SVM [2]             84.41      76.55
WCCN+NN [35]             84.50      78.98
WCCN+NC [35]             85.74      64.36
H-SVM [36][37]           86.26      78.10
DCNN [38]                93.15      91.47
Ours                     89.03      87.82
Ours + CNN Features      94.86      93.40

F. Robustness and Online Efficiency

To test the robustness of the proposed algorithm, we first examine how a limited number of primitive concepts affects performance. Second, we study the sensitivity of our approach to the learning rate. Last but not least, we report a comparison between online and batch updating.

1) Incomplete Model Warehouse: In practice, when we model a structured concept hierarchy, even a rich semantic pool may be insufficient, because not all possible sub-genres of the "target concepts" are covered in the warehouse. To illustrate this more challenging scenario of an incomplete model warehouse, we remove some leaf-node concepts from the primitive pool: instead of having all leaf-node "existing models" accessible, we are only allowed to access some of them. Figure 7 demonstrates how the LEGO-MM algorithm handles an incomplete model warehouse. Three experiments are performed: the leftmost histogram illustrates the classification accuracy for the concept "plane" at the 3rd level and "aerial" at the 2nd level without the prior model "warplane". Similarly, the middle and right histograms are each missing one leaf-node concept, "bicycle" and "sailboat", respectively. We report the recognition performance on the parent and grandparent nodes. Since less prior information is available to LEGO-MM, a large semantic ambiguity is introduced, as illustrated in Figure 3. Consequently, the classification performance at each node may drop. However, LEGO-MM can adjust its models with a small number of model-updating samples. Therefore, compared with Table I, the accuracy at each node drops only slightly, and remains much more reliable than the SVM-type approach.

2) Learning Rate Trade-off: In Equation (6) we introduced the learning rate λ as a balance parameter between


Fig. 7. Model robustness: incomplete model warehouse.

criteria 1 and 2 mentioned in Section III-A. Figure 8 illustrates how λ affects the recognition performance. The seven curves in Figure 8 show how the accuracy on each intermediate node changes as the learning rate increases from 1 to 800. The recognition performance tends to stabilize once λ exceeds 150. Thus, we set the learning rate to 200 to obtain reliable results in the experiments.


Fig. 8. Model robustness: learning rate.

3) Online Step Size: Another parameter affecting the online modeling of the proposed LEGO-MM algorithm is the step size, which trades off modeling performance against the number of updating samples involved in each step. Figure 9 demonstrates the relationship between the



Fig. 6. ROC curve for concept “plane” and “ship” at the 3rd level, and “aerial” and “marine” at the 2nd level.

step size (horizontal axis) and the classification performance (vertical axis) with a fixed learning rate. Online updating with a very small step size (e.g., one) performs the worst, because with far fewer samples per step there is a risk of introducing large variance into the model. In contrast, when the step size reaches about 200, the performance becomes stable; further increasing the step size does not affect the overall performance much. Overall, the proposed LEGO-MM achieves robust performance with the online updating strategy.

G. Extension: Multimodal Features and Single Features

To illustrate the advantage of multimodal feature descriptors, we further conduct experiments with the proposed method using single-modal and multi-modal features. The dataset is the concept "skating" in the ImageNet database, which contains 9,806 images. As shown in Figure 10, there are 7 concepts associated with the root "skating", forming a three-level ontological structure with 4 leaf nodes. Each leaf concept contains 500 positive images.


Fig. 9. Model robustness: learning step size.

The negative samples include concepts such as "formation", "structure" and "vehicle", which contain substantial


skating

skateboarding roller skating speed skating ice skating

roller blading figure skating

Fig. 10. Skating hierarchy from ImageNet. The red-colored nodes indicate the leaf-node concepts.

low-level visual ambiguities. The comparison experiments strictly follow the protocol in Section IV-C. The multi-modal features are SIFT features and CNN features [38][39]; the former is one of the most classical features, while the latter is the state-of-the-art image feature descriptor. The results are shown in Table III. We can see that the proposed method with multi-modal features achieves better performance than with single-modal features.

TABLE III
COMPARISON RESULTS OF THE PROPOSED METHOD WITH SINGLE-MODAL FEATURES AND MULTI-MODAL FEATURES.

Feature                          2nd level  3rd level
Ours with SIFT features          82.18      82.36
Ours with CNN features           89.47      90.20
Ours with SIFT + CNN features    90.35      91.08

V. CONCLUSION

In this paper we developed a novel framework, LEGO-MM, to seamlessly integrate both new target training examples and existing primitive concept models. LEGO-MM treats the primitive concept models as "lego" bricks to potentially construct an unlimited vocabulary of new concepts. We proposed a flexible learning algorithm to efficiently combine the obtained probabilistic model with new information, clarifying semantic ambiguity as well as correcting errors propagated from the lower-level nodes. Extensive experiments on a real-world dataset demonstrated that: 1) using logical operations to combine individual concepts in the existing model warehouse provides rich semantic resources that improve performance on the more abstract concepts at higher levels of the ontology hierarchy; 2) evolving the higher-level models with a small number of examples can clarify semantic ambiguities, and in particular our proposed "T-conorm" LEGO-MM approach has significant advantages over other state-of-the-art algorithms; and 3) LEGO-MM is robust to incomplete concepts in the existing model warehouse and to varying learning rates and online step sizes.

ACKNOWLEDGMENTS

This work was partially supported by the 973 Program of China (Project No. 2014CB347600), the National Natural Science Foundation of China (Grant No. 61522203 and 61402228), and the National Ten Thousand Talent Program of China (Young Top-Notch Talent).

VI. APPENDIX

A. Part A

Taking the derivative of Eq. (3) with respect to $P(y|f(x_l))$ and setting the result to zero, we have:
$$
\frac{\partial L}{\partial P(y|f(x_l))} = \frac{1}{N}\Big\{\log P(y|f(x_l)) + 1 - \log P_{prior}\big(y\,\big|\,[f_m(x_l)]_{m=1}^{M}\big) - y(w^T x_l + b)\Big\} - \xi(x_l) = 0, \quad (10)
$$
and

$$
\frac{\partial L}{\partial \eta} = b + \frac{N\gamma\eta}{\sigma_\eta^2} = 0, \qquad
\frac{\partial L}{\partial \theta_d} = w_d + \frac{N\gamma\theta_d}{\sigma_\theta^2} = 0. \quad (11)
$$

From (10) we have:
$$
P(y|f(x_l)) \propto P_{prior}\big(y\,\big|\,[f_m(x_l)]_{m=1}^{M}\big)\exp\big\{y(w^T x_l + b)\big\}. \quad (12)
$$

Now, considering the normalization constraint in (2), this ought to be
$$
P(y|f(x_l)) = \frac{1}{Z(x_l)} P_{prior}\big(y\,\big|\,[f_m(x_l)]_{m=1}^{M}\big)\exp\big\{y(w^T x_l + b)\big\}, \quad (13)
$$
where
$$
Z(x_l) = \sum_{y\in\{0,1\}} P_{prior}\big(y\,\big|\,[f_m(x_l)]_{m=1}^{M}\big)\exp\big\{y(w^T x_l + b)\big\}
$$
is the partition function for normalization. Thus we have obtained the target model shown in Eq. (13), from which the corresponding concept C can be inferred.

B. Part B

The maximization problem in Eq. (6) is unconstrained and concave with respect to b and w, so a global


maximum exists. Taking the derivatives with respect to b and w, we have
$$
\frac{\partial L}{\partial b} = \frac{1}{N}\sum_{l=1}^{N} y_l - \frac{1}{N}\sum_{l=1}^{N} E_{P(y|f(x_l))}[y] - \frac{\lambda_b b}{N}
$$
$$
\frac{\partial L}{\partial w_d} = \frac{1}{N}\sum_{l=1}^{N} y_l f_d(x_l) - \frac{1}{N}\sum_{l=1}^{N}\Big(E_{P(y|f(x_l))}[y]\, f_d(x_l)\Big) - \frac{\lambda_w w_d}{N} \quad (14)
$$
Then (6) can be maximized by a conjugate gradient method based on (6) and its derivatives in Eq. (14).
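As a sanity check, the derivatives in Eq. (14) can be verified numerically against a finite-difference approximation of the objective of Eq. (6), here in the uniform-prior case where $\log Z(x_l)$ reduces to $\log(1 + e^{w^T x_l + b})$. The data and regularization weights below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 40, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, 2, size=N).astype(float)
lam_w, lam_b = 2.0, 2.0

def objective(w, b):
    # (1/N) sum_l [ y_l (w^T x_l + b) - log Z(x_l) ] - regularizers,
    # i.e. Eq. (6) with a uniform prior folded into Z.
    logits = X @ w + b
    logZ = np.log(1.0 + np.exp(logits))
    return (y * logits - logZ).mean() - lam_w / (2 * N) * (w @ w) - lam_b / (2 * N) * b * b

def grad(w, b):
    # Analytic gradients of Eq. (14): empirical minus expected statistics.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))       # E_{P(y|f(x_l))}[y]
    gw = (X.T @ (y - p)) / N - lam_w * w / N      # per-dimension w_d component
    gb = (y - p).mean() - lam_b * b / N           # b component
    return gw, gb

w0, b0 = rng.normal(size=D), 0.3
gw, gb = grad(w0, b0)
eps = 1e-6
gb_num = (objective(w0, b0 + eps) - objective(w0, b0 - eps)) / (2 * eps)
```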

REFERENCES

[1] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.
[2] K. Crammer and Y. Singer, "On the algorithmic implementation of multiclass kernel-based vector machines," Journal of Machine Learning Research, vol. 2, pp. 265-292, 2001.
[3] F. Sun, J. Tang, H. Li, G. Qi, and T. S. Huang, "Multi-label image categorization with sparse factor representation," IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1028-1037, 2014.
[4] W. Hu, N. Xie, R. Hu, H. Ling, Q. Chen, S. Yan, and S. J. Maybank, "Bin ratio-based histogram distances and their application to image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 12, pp. 2338-2352, 2014.
[5] J. Tang, R. Hong, S. Yan, T. Chua, G. Qi, and R. Jain, "Image annotation by knn-sparse graph-based label propagation over noisily tagged web images," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 2, p. 14, 2011.
[6] Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and S. Yan, "Contextualizing object detection and classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 13-27, 2015.
[7] J. Tang, X. Shu, G.-J. Qi, Z. Li, M. Wang, S. Yan, and R. Jain, "Tri-clustered tensor completion for social-aware image tag refinement," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.
[8] M. R. Naphade, J. R. Smith, J. Tesic, S.-F. Chang, W. H. Hsu, L. S. Kennedy, A. G. Hauptmann, and J. Curtis, "Large-scale concept ontology for multimedia," IEEE MultiMedia, vol. 13, no. 3, pp. 86-91, 2006.
[9] C. Snoek, M. Worring, J. van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders, "The challenge problem for automated detection of 101 semantic concepts in multimedia," in ACM Multimedia, 2006, pp. 421-430.
[10] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang, "Correlative multi-label video annotation," in ACM Multimedia, 2007, pp. 17-26.
[11] A. Natsev, A. Haubold, J. Tesic, L. Xie, and R. Yan, "Semantic concept-based query expansion and re-ranking for multimedia retrieval," in ACM Multimedia, 2007, pp. 991-1000.
[12] J. Fan, Y. Gao, and H. Luo, "Hierarchical classification for automatic image annotation," in ACM SIGIR Conference, 2007, pp. 111-118.
[13] M. Marszalek and C. Schmid, "Semantic hierarchies for visual object recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[14] N. Verma, D. Mahajan, S. Sellamanickam, and V. Nair, "Learning hierarchical similarity metrics," in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2280-2287.
[15] S. Chang, G. Qi, J. Tang, Q. Tian, Y. Rui, and T. S. Huang, "Multimedia LEGO: learning structured model by probabilistic logic ontology tree," in IEEE International Conference on Data Mining, Dallas, TX, USA, December 7-10, 2013, pp. 979-984.
[16] Y. Wu, B. L. Tseng, and J. R. Smith, "Ontology-based multi-classification learning for video concept detection," in IEEE International Conference on Multimedia and Expo, 2004, pp. 1003-1006.
[17] S. F. Chen and R. Rosenfeld, "A gaussian prior for smoothing maximum entropy models," School of Computer Science, Carnegie Mellon University, Technical Report CMU-CS-98-108, 1999.
[18] G. A. Miller, "Wordnet: A lexical database for english," Commun. ACM, vol. 38, no. 11, pp. 39-41, 1995.
[19] M. Marszalek and C. Schmid, "Semantic hierarchies for visual object recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[20] G. Griffin and P. Perona, "Learning and using taxonomies for fast visual categorization," in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[21] A. Zweig and D. Weinshall, "Exploiting object hierarchy: Combining models from different category levels," in Proc. International Conference on Computer Vision, 2007.
[22] X. He and R. S. Zemel, "Latent topic random fields: Learning using a taxonomy of labels," in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[23] E. Bart, I. Porteous, P. Perona, and M. Welling, "Unsupervised learning of visual taxonomies," in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[24] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2161-2168.
[25] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[26] C. C. Aggarwal, "Towards systematic design of distance functions for data mining applications," 2003.
[27] M. Belkin and P. Niyogi, "Using manifold structure for partially labelled classification," in Neural Information Processing Systems, 2002.
[28] K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell, "Text classification from labeled and unlabeled documents using em," Machine Learning, vol. 39, no. 2/3, pp. 103-134, 2000.
[29] G. Qi, C. C. Aggarwal, and T. S. Huang, "Towards semantic knowledge propagation from text corpus to web images," in ACM Conference on World Wide Web, 2011, pp. 297-306.
[30] R. Cignoli, I. M. L. D'Ottaviano, and D. Mundici, Eds., Algebraic Foundations of Many-valued Reasoning. Dordrecht: Kluwer, 2000.
[31] Y. Rui, T. S. Huang, M. Ortega, and S. Mehrotra, "Relevance feedback: a power tool for interactive content-based image retrieval," IEEE Trans. Circuits Syst. Video Techn., vol. 8, no. 5, pp. 644-655, 1998.
[32] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and F.-F. Li, "Imagenet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255.
[33] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database. Cambridge, MA: The MIT Press, 1998.
[34] M.-H. Tsai, S.-F. Tsai, and T. S. Huang, "Hierarchical image feature extraction and classification," in ACM Multimedia, 2010, pp. 1007-1010.
[35] X. Zhou, N. Cui, Z. Li, F. Liang, and T. S. Huang, "Hierarchical gaussianization for image classification," in Proc. International Conference on Computer Vision, 2009, pp. 1971-1977.
[36] S. Bengio, J. Weston, and D. Grangier, "Label embedding trees for large multi-class tasks," in Neural Information Processing Systems, 2010, pp. 163-171.
[37] L. Cai and T. Hofmann, "Hierarchical document categorization with support vector machines," in ACM International Conference on Information and Knowledge Management, 2004, pp. 78-87.
[38] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint arXiv:1408.5093, 2014.


Jinhui Tang is a Professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. He received his B.E. and Ph.D. degrees in July 2003 and July 2008, respectively, both from the University of Science and Technology of China. From 2008 to 2010, he worked as a research fellow in the School of Computing, National University of Singapore. His current research interests include large-scale multimedia search. He has authored over 100 journal and conference papers in these areas.

Prof. Tang is a recipient of the ACM China Rising Star Award, a co-recipient of the Best Student Paper Award at MMM 2016, and a co-recipient of the Best Paper Awards at ACM MM 2007, PCM 2011 and ICIMCS 2011.

Shiyu Chang received the B.S. and M.S. degrees from the University of Illinois at Urbana-Champaign in 2011 and 2014, respectively, where he is currently pursuing the Ph.D. degree under the supervision of Prof. T. S. Huang. He has a wide range of research interests in data exploration and analytics. His current direction lies in building high-performance and reliable systems with the help of large-scale multimodal information to solve complex real-world computational tasks. He received the Best Student Paper Award at the IEEE ICDM 2014.

Guo-Jun Qi is an assistant professor in the Department of Computer Science at the University of Central Florida. He received the Ph.D. degree in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign. His research interests include pattern recognition, machine learning, computer vision and multimedia. He was the co-recipient of the best student paper award at the IEEE Conference on Data Mining (2014), and the recipient of the best paper award (2007) and the best paper runner-up (2015) at the ACM International Conference on Multimedia. He has served or will serve as a program co-chair of MMM 2016, an area chair of ACM Multimedia (2015, 2016), a senior program committee member of ACM CIKM 2015 and ACM SIGKDD 2016, and a program committee member or reviewer for conferences and journals in the fields of computer vision, pattern recognition, machine learning, and data mining. Dr. Qi has published over 60 academic papers in these areas. He also (co-)edited two special issues of the IEEE Transactions on Multimedia and the IEEE Transactions on Big Data.

Qi Tian received the B.E. degree in electronic engineering from Tsinghua University, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois at Urbana-Champaign in 2002. He is currently a professor in the Department of Computer Science, University of Texas at San Antonio. He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008-2009. His research interests include multimedia information retrieval and computer vision. He has published more than 280 refereed journal and conference papers. He received the Best Paper Award at PCM 2013, MMM 2013, and ICIMCS 2012, the Top 10% Paper Award at MMSP 2011, and the Student Contest Paper Award at ICASSP 2006. He is on the editorial board of the IEEE Transactions on Multimedia, the IEEE Transactions on Circuits and Systems for Video Technology, Multimedia Systems, the Journal of Multimedia, and Machine Vision and Applications.

Yong Rui received the B.S. degree from Southeast University, the M.S. degree from Tsinghua University, and the Ph.D. degree from the University of Illinois at Urbana-Champaign. He is currently the Deputy Managing Director of Microsoft Research Asia (MSRA), leading research groups in multimedia search and mining and big data analysis, and engineering groups in multimedia processing, data mining, and software/hardware systems. He has authored 16 books and book chapters, and more than 100 refereed journal and conference papers. His publications have received 15,000+ citations, and his h-index is 52. He holds 60 issued U.S. and international patents. He is a Fellow of IAPR and SPIE, a Distinguished Scientist of ACM, and a Distinguished Lecturer of both ACM and IEEE. He is an Executive Member of ACM SIGMM and the Founding Chair of its China Chapter. He is recognized as a leading expert in his research areas. He is the Editor-in-Chief of IEEE MultiMedia Magazine, an Associate Editor of the ACM Transactions on Multimedia Computing, Communications, and Applications, a Founding Editor of the International Journal of Multimedia Information Retrieval, and a Founding Associate Editor of IEEE Access. He was an Associate Editor of the IEEE Transactions on Multimedia (2004-2008), the IEEE Transactions on Circuits and Systems for Video Technology (2006-2010), the ACM/Springer Multimedia Systems Journal (2004-2006), and the International Journal of Multimedia Tools and Applications (2004-2006). He also serves on the Advisory Board of the IEEE Transactions on Automation Science and Engineering.

Thomas S. Huang received his Sc.D. from MIT in 1963. He is a full-time faculty member with the Beckman Institute at the University of Illinois at Urbana-Champaign. He was the William L. Everitt Distinguished Professor in the Department of Electrical and Computer Engineering and the Coordinated Science Lab (CSL), and a full-time faculty member in the Beckman Institute Image Formation and Processing and Artificial Intelligence groups. His professional interests are computer vision, image compression and enhancement, pattern recognition, and multimodal signal processing.