A semantic similarity measure based on information distance for ontology alignment



Information Sciences xxx (2014) xxx–xxx

INS 10699 No. of Pages 12, Model 3G

22 March 2014


Contents lists available at ScienceDirect

Information Sciences

journal homepage: www.elsevier.com/locate/ins

A semantic similarity measure based on information distance for ontology alignment

http://dx.doi.org/10.1016/j.ins.2014.03.021
0020-0255/© 2014 Published by Elsevier Inc.

⇑ Corresponding author. Tel.: +86 18038153089. E-mail address: [email protected] (H.-T. Zheng).

Please cite this article in press as: Y. Jiang et al., A semantic similarity measure based on information distance for ontology alignment, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.03.021

Yong Jiang, Xinmin Wang, Hai-Tao Zheng ⇑
Tsinghua-Southampton Web Science Laboratory at Shenzhen, Graduate School at Shenzhen, Tsinghua University, Shenzhen City 518055, PR China


Article info

Article history:
Received 19 January 2012
Received in revised form 6 May 2013
Accepted 3 March 2014
Available online xxxx

Keywords:
Ontology alignment
Semantic measure
Link weight
Information distance
Normalized Google distance


Abstract

Ontology alignment is the key point to reach interoperability over ontologies. In the semantic web environment, ontologies are usually distributed and heterogeneous, and thus it is necessary to find the alignment between them before processing across them. Many efforts have been conducted to automate the alignment by discovering the correspondences between entities of ontologies. However, some problems are still obvious, and the most crucial one is that it is almost impossible to extract the semantic meaning of a lexical label that denotes an entity by traditional methods. In this paper, ontology alignment is formalized as a problem of information distance metric. In this way, discovery of the optimal alignment is cast as finding the correspondences with minimal information distance. We demonstrate a novel measure named link weight that uses the semantic characteristics of two entities and Google page counts to calculate an information distance similarity between them. The experimental results show that our method is able to create alignments between different lexical entities that denote the same ones. These results outperform typical ontology alignment methods such as PROMPT (Noy and Musen, 2000) [38], QOM (Ehrig and Staab, 2004) [12], and APFEL (Ehrig et al., 2005) [13] in terms of semantic precision and recall.

© 2014 Published by Elsevier Inc.


1. Introduction

With the development of the knowledge society, information interoperability encounters the bottleneck of traditional information representation technologies, as a result of different information systems employing different individual schemas to represent data [9]. One of the prominent approaches is to construct a common, formal language that machines can somehow understand. Semantic Web languages, including RDF and OWL, emerged to realize this common, formal language by adding explicit semantic relationships and logical constraints in the form of ontologies. An ontology is defined as "explicit formal specifications of the terms in the domain and relations among them" [23]. With the development of ontologies on the World Wide Web, however, the available ontologies themselves can introduce heterogeneity: the same entity may be given a different name in one ontology or simply be defined in a different way in another, even though both ontologies express the same knowledge in different languages [25]. Obviously this goes against the goal of developing ontologies, namely defining a set of data and their structure for sharing and reuse.

Semantic interoperability can be grounded in ontology reconciliation: finding correspondences between entities of different ontologies. This process is called ontology alignment, and can be described as follows: given two ontologies with




each describing a set of discrete entities, find the relationships that hold between these entities. Alignment results can further support displaying correspondences, transforming one source into another, or creating a set of bridge axioms between the ontologies. An overview of ontology alignment is presented in Section 2.

The foundation of ontology alignment is the similarity of entities. Existing similarity measures can be divided into two general groups, namely lexical measures and structural measures. The main idea behind lexical measures is the fact that similar entities usually have similar names or descriptions across different ontologies. The main idea behind structural measures, on the other hand, is to consider the kinship of the components and structures residing in the ontology graphs.

However, traditional alignment methods such as PROMPT and QOM, mainly based on lexical measures and structural measures, have their limitations in the scenario where the same concepts are expressed by far different names. Lexical measures rely on an entity's name (label) to compute the similarity between a pair of entities; they align entities by analyzing them in isolation, ignoring their relationships with other entities. The process can be described briefly as follows: (i) string normalization (e.g., case folding, use of a standardized encoding, blank normalization, etc.); (ii) syntactical comparison (e.g., prefix/suffix comparison, edit distance, soundex index and n-grams, etc.). The comparison may be either exact, i.e., entities' names are aligned only when the strings are equal, or approximate, where a confidence value is computed using a series of similarity metrics. Although this allows for the successful alignment of some entities, a purely lexical method has obvious limitations. For example, equivalent entities described by different names (synonyms) cannot be detected, while different entities described by equal names (homonyms) will mistakenly be detected as a complete alignment. It is often difficult to determine the true semantics of an entity without considering context. Structural measures differ from lexical methods in not only considering a single concept at a time, but in utilizing the ontology structure information. The fact that ontologies can be treated as graphs indicates that an entity of an ontology (a node in a graph) inherits its parents' semantics, and also transmits its own semantics to its children. In other words, similar nodes in distinct ontologies indicate the similarity of their neighbors [44]. These methods are based on the following idea. First, the ontologies to be aligned are converted into directed labeled graphs. Then, similarities between nodes in one graph and nodes in the second graph propagate along the graph structure by iterative fixpoint computation [35]. The similarity computations rely on the intuition that entities of two distinct ontologies are similar when their adjacent entities are similar, i.e., a part of the similarity of two entities propagates to their respective neighbors. Although ontology structure has been used by many alignment systems, they all suffer from some drawbacks. A main drawback is that the initial similarity of entities, which initiates the iterative fixpoint computation, is usually obtained using simple lexical measures, e.g., string comparison of entity labels.
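The iterative fixpoint idea can be sketched as follows. This is a simplified, hypothetical illustration and not the scheme of any particular system: the damping factor `alpha`, the fixed round count, and the uniform averaging over neighbor pairs are our assumptions.

```python
# Simplified sketch of iterative similarity propagation between two
# ontology graphs. Hypothetical illustration only: real systems such as
# Similarity Flooding use weighted propagation graphs and normalize and
# check convergence on every round.

def propagate_similarity(sim, neighbors1, neighbors2, alpha=0.5, rounds=3):
    """sim: dict mapping (entity1, entity2) -> initial lexical similarity.
    neighbors1 / neighbors2: adjacency dicts of the two ontologies."""
    for _ in range(rounds):
        new_sim = {}
        for (a, b), s in sim.items():
            # A share of the similarity of neighboring pairs flows to (a, b).
            contrib = [sim.get((na, nb), 0.0)
                       for na in neighbors1.get(a, [])
                       for nb in neighbors2.get(b, [])]
            boost = sum(contrib) / len(contrib) if contrib else 0.0
            new_sim[(a, b)] = (1 - alpha) * s + alpha * boost
        sim = new_sim
    return sim

# Toy example: one candidate pair shares a matching neighbor, the other
# does not, so propagation favors the first pair.
initial = {("Proceedings", "Transaction"): 0.1,
           ("Proceedings", "Journal"): 0.1,
           ("Book", "Book"): 1.0}
final = propagate_similarity(initial,
                             {"Proceedings": ["Book"]},
                             {"Transaction": ["Book"], "Journal": []})
```

After a few rounds, the pair whose neighbors already match ends up with a noticeably higher score than the pair whose neighbors do not, even though both start from the same weak lexical similarity.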

Fig. 1 shows an alignment snippet between two pairs of ontologies that describe bibliographic references. In this alignment example, entity O1:Proceedings and entity O2:Transaction have different labels and are seemingly not very similar semantically if only the lexical characteristics are considered. But taking into account the structure of the ontologies they belong to, the neighbors of O1:Proceedings (O1:Book, O1:Monograph, O1:Collection) are very similar to the neighbors of O2:Transaction (O2:Book, O2:Monograph, O2:Collection). Consequently, O1:Proceedings and O2:Transaction are very likely to be aligned together, and the correspondence is correct. However, if entity O1:Proceedings and entity O3:Journal are aligned in the same way, the correspondence is incorrect, because people have reached a consensus that Proceedings and Transaction are usually equivalent (both mean records of business conducted at a meeting), while a Journal is another kind of book (a periodical presenting articles on a particular subject).

In this paper, by referring to information distance theory, we propose a semantic measure that is designed for the relationship between different concepts. Our measure combines a comprehensive set of lexical-level and structure-level measures of similarity with a technique that uses Google page counts to calculate the information distance between entities.

Fig. 1. An example of different entities with the same and different meanings, respectively.




It has been proven in [6,7] that Google page counts can be used to extract the semantic meaning of words and to quantify the semantic relation between two words, as the World Wide Web can be used as a very large corpus consisting of latent semantic information.

The rest of the paper is organized as follows. First, related work is discussed in the next section. Second, we present a brief definition of the problem and a general description of the proposed method, followed by the details of the similarity measure. Section 4 shows the experimental results and discussion. Finally, conclusions and future work are given.

2. Related work

Ontology alignment is a relatively new but active field of current research, with a vigorous community proposing numerous solutions. Euzenat et al. [1] presented a comprehensive overview of the field, in which the internals of known alignment methods are classified into local methods and global methods according to the comparison level. The local methods only assess the correspondence between two entities of different ontologies, whereas the global methods establish alignments between ontologies on the basis of the results of local comparison. Granitzer et al. [22] gave a brief outline of the most common alignment methods, along with references to a subset of systems applying particular methods. The latest developments in ontology alignment can be found in [20].

Most work on ontology alignment has focused on two general groups of similarity measures, namely lexical measures and structural measures. Early work on ontology alignment and mapping focused mainly on the string distances between entity labels and the overall taxonomic structure of the ontologies. However, it has become increasingly clear that any two ontologies constructed for the same domain by different experts can be vastly dissimilar in terms of taxonomy and lexical features. After recognizing this, systems such as FCA-Merge [41] and T-Tree [16] analyze subclass and superclass relationships for each entity as well as the lexical correspondences, and additionally require that the ontologies have instances to improve comparison. COMA [32] uses parallel composition of multiple element-level and structure-level matchers. PROMPT consists of an interactive ontology merging tool [38] and a graph-based mapping method dubbed Anchor-PROMPT [37]. It uses linguistic "anchors" as a starting point and analyzes these anchors in terms of the structure of the ontologies. GLUE [8] discovers mappings through multiple learners that analyze the taxonomy and the information within concept instances of ontologies. Corpus-based matching [30] uses domain-specific knowledge in the form of an external corpus of mappings which evolves over time. RiMOM [42] discovers similarities within entity descriptions, analyzes instances, entity names, entity descriptions, taxonomy structure, and constraints using Bayesian decision theory in order to generate an alignment between ontologies, and additionally accepts user input to improve the mappings. Falcon-AO [24] uses a linguistic matcher combined with a technique that represents the structure of the ontologies to construct a bipartite graph. IF-Map [27] matches two ontologies by first examining their instances to see whether they can be assigned to concepts in a reference ontology, and then using formal concept analysis to derive an alignment. Similarity flooding [35] uses a technique of propagating similarities along the property relationships between classes. Rodríguez and Egenhofer [39] developed a similarity function that determines similar entity classes by using a matching process over synonym sets, semantic neighborhoods, and distinguishing features that are classified into parts, functions, and attributes.

Recently, Ngo and Bellahsene [36] presented the ontology matching tool YAM++, which uses a machine learning approach, and showed that the tool is able to deal with the multi-lingual ontology matching problem. Bock and Hettenhausen [3] treated the ontology alignment problem as an optimization problem and designed a discrete particle swarm optimization algorithm. Their approach brought several benefits for solving the ontology alignment problem, such as inherent parallelization, anytime behavior, and flexibility according to the characteristics of particular ontologies. Spiliopoulos and Vouros [40] addressed the problem of synthesizing ontology alignment methods by maximizing the social welfare within a group of interacting agents. They used an extension of the max-sum algorithm to generate near-optimal solutions to the alignment of ontologies through local decentralized message passing. Jan et al. [26] presented a rough-set based approach to ontology alignment to deal with uncertain entities for which the employed ontology alignment measures produce conflicting results on the similarity of the mapped entities.

Semantic techniques have also been used to verify, rather than derive, correspondences between entities. An approach proposed by Meilicke et al. [34] uses model-theoretic semantics to identify inconsistencies and automatically remove the offending correspondences from a proposed alignment. The same authors have extended this work to define mapping stability as a criterion for alignment extraction [33]. A limitation, however, is that the model can only identify those correspondences that are provably inconsistent according to a description logics formulation.

Since information distance theory is employed to develop a similarity measure in this study, we introduce the related work using information distance herein. Bennett, Vitányi, Li, et al. [2,28,43] first presented and discussed information distance theory based on Kolmogorov complexity. In [7,6], Cilibrasi and Vitányi presented a method to extract semantic similarity using Google page counts, derived from information distance and Kolmogorov complexity. Chen and Lin [5] employed the Google similarity distance to measure keywords to implement automatic keyword prediction. In [31], a taxonomy extraction algorithm was proposed based on Google similarity and mutual information to generate a taxonomy tree automatically. In [21], Gligorov et al. demonstrated a definition for approximate mappings between concepts and proposed a method that makes use of the Google similarity to weight approximate ontology matches. Compared to the existing methods, the proposed method introduces a semantic measure based on information distance and formalizes ontology alignment as a problem of measuring the information distance between entities of distinct ontologies. To the best of our knowledge, the feasibility of the information distance for ontology alignment has not been exploited and evaluated until now.

3. Methodology

To better understand the mechanism of the study, we first present a framework for processing ontology alignment. As illustrated in Fig. 2, there are five main stages for ontology alignment in our study. In the framework, the input is two ontologies, and eventually, the output is an alignment mapping between the two ontologies.

1. Feature Extraction: Extracting small excerpts of the overall ontology definition to describe a specific entity (e.g., selecting labels to describe the concept O1:CellPhone).
2. Selection: Choosing two entities from the two ontologies for comparison (e.g., O1:CellPhone and O2:MobilePhone).
3. Similarity Computation and Similarity Aggregation: Computing the similarity for two given entities with some similarity measures (e.g., sim(O1:CellPhone, O2:MobilePhone) = 0.8). Then, we aggregate multiple similarity assessments for one pair of entities into a single measure value.
4. Interpretation: Using all aggregated values, some thresholds and some interpretation strategies to propose the equality of the selected entity pairs.
5. Iteration: As the similarity of one entity pair influences the similarity of neighboring entity pairs, the equality is propagated through the ontologies.
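As a rough sketch, the five stages above might be wired together as follows. This is a hypothetical skeleton of our own, not the paper's implementation: the function names, the mean aggregation, and the fixed round count are illustrative assumptions, with the real similarity measures stubbed out.

```python
# Skeleton of the five-stage alignment loop (hypothetical illustration;
# the measures developed later in this section would replace the stubs).

def align(ontology1, ontology2, sim_measures, threshold=0.5, rounds=2):
    """ontology1 / ontology2: dicts of entity name -> extracted description
    (stage 1, feature extraction). sim_measures: functions mapping two
    (name, description) pairs to a similarity in [0, 1]."""
    alignment = {}
    for _ in range(rounds):                      # stage 5: iteration
        for e1, d1 in ontology1.items():         # stage 2: selection
            for e2, d2 in ontology2.items():
                # Stage 3: compute each measure and aggregate by averaging.
                scores = [m((e1, d1), (e2, d2)) for m in sim_measures]
                aggregated = sum(scores) / len(scores)
                if aggregated >= threshold:      # stage 4: interpretation
                    alignment[e1] = e2
    return alignment

# Usage with a trivial, case-insensitive name-equality measure:
name_eq = lambda a, b: 1.0 if a[0].lower() == b[0].lower() else 0.0
result = align({"CellPhone": "a portable phone"},
               {"cellphone": "a mobile telephone"}, [name_eq])
```

In the real framework, the interesting work happens inside the measures passed as `sim_measures`; the loop itself only organizes selection, aggregation, thresholding, and iteration.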

In the first stage, we traverse the input ontologies and find the related information, such as descriptions or labels, for a specific entity. Then, a hashtable is employed to store each entity and its corresponding description as a key and value. Using the hashtable, it is efficient to obtain any entity and its description. In the second stage, any two entities in the ontologies are selected for comparison. If the two entities are similar, they will be aligned. The third stage is to compute the similarities between entities and aggregate the similarities. This is the most important stage and we will discuss it in detail later. In the fourth stage, we define some thresholds and some interpretation strategies to filter incorrect results from the aggregated values. Consequently, we suggest the equality of the selected entity pairs. In the final stage, since the similarity of one entity pair influences the similarity of neighboring entity pairs, we propagate the equality of the selected entity pairs through the ontologies. This process helps us not only to improve the precision of similarities between other selected entity pairs, but also to calculate the structural similarities between unselected entity pairs later.

Fig. 2. A framework for ontology alignment. (The figure shows the pipeline Input → Feature Extraction → Selection → Similarity Computation → Similarity Aggregation → Interpretation → Iteration → Output.)

The similarity computation and similarity aggregation stage is a critical challenge in the process of ontology alignment. There have been many different approaches to measuring similarity, as described in the Related Work section. The fact is that the traditional measures may not measure similarity precisely and semantically. In addition, two entities may be equivalent in semantics but totally different in syntactical description (e.g., O1:College and O2:University). In this study, we develop a semantic similarity measure based on Google page counts, considering the syntactical features of the entities.

The proposed method for ontology alignment is based on mutual information theory. As we regard the Web as a tremendous and omniscient knowledge repository, the Google page count can be used to measure the semantic information distance between any two words. The Normalized Google Distance (NGD) is a parameter-free, universal similarity distance. Its theoretical precursor, the Normalized Information Distance (NID), has been proven to be an optimal universal measure. The NID is defined as follows [28]:


\[ NID(x,y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}} \tag{1} \]

where K(x | y) is the conditional Kolmogorov complexity of string x given string y, and K(x) is equivalent to K(x | ε), ε being the empty string. The Kolmogorov complexity K(x) of the string x is the length of the shortest binary program with no input that outputs x. Further, the Kolmogorov complexity of x given y is the length of the shortest binary program that on input y outputs x. According to the additive property of Kolmogorov complexity [29], K(x | y) can be rewritten as:

\[ K(x \mid y) = K(x,y) - K(y) \tag{2} \]

Accordingly, NID can be rewritten as:

\[ NID(x,y) = \frac{\max\{K(x,y)-K(y),\, K(x,y)-K(x)\}}{\max\{K(x),\, K(y)\}} = \frac{K(x,y) - \min\{K(x),\, K(y)\}}{\max\{K(x),\, K(y)\}} \tag{3} \]

where K(x, y) represents the length of the shortest binary program for the pair (x, y). The NID is shown to be universal in the sense that it accounts for the dominant discriminating feature between two objects. However, since Kolmogorov complexity is uncomputable in the Turing sense, the NID cannot be computed directly. To approximate the Kolmogorov complexity, knowledge distributions such as the Google distribution have been employed [6,7]. This yields the NGD as an approximation of the NID. The NGD is defined as follows [6,7]:

\[ NGD(x,y) = \frac{\max\{G(x \mid y),\, G(y \mid x)\}}{\max\{G(x),\, G(y)\}} = \frac{G(x,y) - \min\{G(x),\, G(y)\}}{\max\{G(x),\, G(y)\}} \tag{4} \]

where G(x) denotes the Google code-word length of the search term x, and G(x, y) denotes the joint Google code-word length of the joint search terms x AND y. Accordingly, G(x | y) is the conditional Google complexity of search term x given search term y. The NGD is actually an approximation to the NID in Eq. (3). Regarding the background knowledge on the Web as conditional information, the Google distribution determines a Google code-word length (a sort of prefix code) that approximates the length of the Kolmogorov code.

Billions of web pages are indexed by the Google search engine, and each web page can be viewed as a set of index words. If w represents a web page and x a search term, we write x ∈ w when the web page w is returned by Google for the search term x. Hence, we can view an event x as the collection of web pages indexed by Google that contain one or more occurrences of the search term x. We define the event x as x = {w : x ∈ w}. In a probabilistic framework, we can write g(x) = |x| / N, where N represents the overall number of web pages indexed by Google, to denote the probability that event x occurs in the Google distribution. In other words, g(x) = |x| / N is the number of web pages containing the search term x divided by the overall number N of web pages indexed by Google. Then, the joint probability g(x, y) = |{w : x, y ∈ w}| / N is defined as the number of web pages in the joint event x ∩ y divided by the overall number N of web pages indexed by Google, where the joint event x ∩ y = {w : x, y ∈ w} represents the set of web pages returned by Google that contain both search terms x and y. Consequently, we can also define the probability of the conditional event x | y = (x ∩ y) / y according to the probabilistic framework, i.e., g(x | y) = g(x, y) / g(y). If we define the frequency f(x) of the search term x as the number of web pages returned by Google, then f(x) = N · g(x) and f(x, y) = N · g(x, y). As a result, g(x | y) = f(x, y) / f(y).
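With invented page counts, the quantities above work out as follows (the numbers are purely illustrative; real values come from Google searches):

```python
# Toy page counts illustrating the Google distribution quantities.
# All counts are invented for illustration.
N = 1_000_000      # total pages indexed (assumed)
f_x = 40_000       # pages containing search term x
f_y = 20_000       # pages containing search term y
f_xy = 10_000      # pages containing both x and y

g_x = f_x / N                    # g(x): probability of event x
g_xy = f_xy / N                  # g(x, y): joint probability
g_x_given_y = g_xy / (f_y / N)   # g(x | y) = g(x, y) / g(y)

# The shortcut derived above: g(x | y) = f(x, y) / f(y)
assert abs(g_x_given_y - f_xy / f_y) < 1e-12
```

The final assertion checks the identity derived in the text: dividing the two N-normalized probabilities cancels N, so the conditional probability can be computed from raw page counts alone.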

Given the Google distribution, Cilibrasi and Vitányi [6] introduced the Google code-word length G(x) to approximate the NID. G(x) and G(x, y) are defined as follows:

\[ G(x) = G(x,x); \qquad G(x,y) = -\log g(x,y) \tag{5} \]

where g(x, y) = f(x, y) / N, and N denotes the total number of pages indexed by Google. According to the additive property inherited from Kolmogorov complexity in Eq. (2), G(x | y) can be deduced as follows:

\[ G(x \mid y) = G(x,y) - G(y) = -\log g(x,y) + \log g(y) = \log \frac{g(y)}{g(x,y)} = -\log g(x \mid y) \tag{6} \]
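Combining Eqs. (4)-(6), the NGD reduces to a formula in raw page counts, since G(x) = −log g(x) = log N − log f(x). A minimal sketch, with page counts invented for illustration:

```python
from math import log

def ngd(f_x, f_y, f_xy, n):
    """NGD from page counts, following Eq. (4) with
    G(x) = log n - log f_x and G(x, y) = log n - log f_xy.
    f_x, f_y: counts for the single-term queries; f_xy: count for the
    joint query; n: total indexed pages (an assumed constant)."""
    numerator = max(log(f_x), log(f_y)) - log(f_xy)
    denominator = log(n) - min(log(f_x), log(f_y))
    return numerator / denominator

# A term is at distance 0 from itself: every page containing x also
# contains x, so the joint count equals the single count.
distance = ngd(40_000, 40_000, 40_000, 1_000_000)
```

Note that the log N terms cancel in the numerator but not in the denominator, which is why the total index size n must still be estimated in practice.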




However, experimenting with the NGD directly gives poor results. One reason is that the NGD is a symmetric similarity measure, i.e., NGD(x, y) = NGD(y, x). Considering that an ontology is actually a set of terms organized in a Directed Acyclic Graph (DAG), with terms represented as nodes connected by relations represented as edges, in the practice of ontology alignment the probability of term A being aligned to B should not be equivalent to the probability of term B being aligned to A. For example, consider O1:Red and O2:Color: in the semantic sense, we cannot simply use a single distance value to represent the similarity between "Color" and "Red". We need a pair of asymmetric dependency measures to identify the directed link between the terms "Color" and "Red". Therefore, although the NGD efficiently quantifies the similarity link between two terms, it is inadequate for determining the asymmetric dependencies of the terms in the link. We thus develop a pair of asymmetric dependency measures, which satisfy the following constraints:

• They can efficiently quantify the overlap information between x and y.
• They can distinguish the asymmetric dependencies from x to y and from y to x.

where x and y are the names of two entities from the ontologies to be aligned.
We note that in the definition of the NGD in Eq. (4), max{G(x | y), G(y | x)} is normalized by dividing by max{G(x), G(y)}.

Based on Eq. (6), G(x | y) is actually the negative logarithm of g(x | y), and likewise for G(y | x) and g(y | x). Both values of G(x | y) and G(y | x) are greater than one, which means g(x | y) and g(y | x) are smaller than one. g(x | y) and g(y | x) actually quantify the overlap information between x and y. In addition, g(x | y) and g(y | x) are able to distinguish the asymmetric dependencies from x to y and from y to x. To satisfy the two constraints, we use g(x | y) and g(y | x) as a pair of asymmetric measures.

This pair of asymmetric measures is named the link weight, which can be used to evaluate the contribution of each word in the link. The link weight is defined by


ω(x, y) = g(x|y) = f(x, y) / f(y),    ω(y, x) = g(y|x) = f(x, y) / f(x)    (7)

where f(x) denotes the number of web pages containing the name of entity x, and f(x, y) denotes the number of web pages containing the names of both entity x and entity y, as returned by a Google search. In practice, for an entity x we use the page count returned by Google when searching for the entity name x, and for entities x and y we use the page count returned when searching for both entity names together. Based on these asymmetric measures, we can evaluate the following statement in the interpretation stage:
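As a sketch, the link weight of Eq. (7) reduces to two ratios of page counts. The snippet below computes both directions from counts assumed to have been fetched beforehand (the Google querying itself is omitted); the numbers reused here are the Proceedings/Transaction page counts reported later in Table 1.

```python
def link_weight(f_xy: int, f_cond: int) -> float:
    """omega(x, y) = g(x|y) = f(x, y) / f(y).

    Pass f(y) as f_cond to obtain omega(x, y), or f(x) to obtain the
    reverse direction omega(y, x)."""
    return f_xy / f_cond

# Page counts from Table 1 (Proceedings / Transaction):
f_x = 349_000_000   # f(Proceedings)
f_y = 785_000_000   # f(Transaction)
f_xy = 65_900_000   # f(Proceedings, Transaction)

w_xy = link_weight(f_xy, f_y)   # omega(x, y), about 0.084
w_yx = link_weight(f_xy, f_x)   # omega(y, x), about 0.189
```

The asymmetry is visible immediately: the same joint count divided by two different marginals gives two different directed weights.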

Statement 1. ω(x, y) = ω(y, x) > 0: the words x and y are semantically correlated.

Considering noisy words (e.g., misspellings), it rarely happens that ω(x, y) exactly equals ω(y, x). Therefore, in practice, the statement is revised as follows:

Statement 2. 0 < |ω(x, y) − ω(y, x)| < δ: the words x and y are semantically correlated.

Based on Eq. (7), we employ three kinds of similarity measurement to compute the overall similarity [1], which are described as follows:

• Terminological similarity: computes similarities based on the features of entities extracted in stage 1 (e.g., the names of classes); denoted sim_t.
• Structural internal similarity: computes the similarity between two classes from a portion of the similarity computed for the properties of the two classes; denoted sim_si.
• Structural external similarity: computes the similarity between two classes from a portion of the similarity computed for the superclasses of the two classes; denoted sim_se.

The similarities computed by the above measurements are aggregated to produce an overall similarity, which is described as follows:

sim(x, y) = λ_t × sim_t(x, y) + λ_si × sim_si(x, y) + λ_se × sim_se(x, y)    (8)

where λ_t + λ_si + λ_se = 1, and λ_si = λ_se because the structural internal and external similarities should have the same weight in the aggregation. In this study, we use ω(x, y) to measure the terminological similarity between entities x and y. To compute sim_si(x, y), we take the maximum similarity between properties p_i and p_j, where p_i ∈ P(x) and p_j ∈ P(y); P(x) denotes the property set of entity x and P(y) denotes the property set of entity y. To compute sim_se(x, y), we take the maximum similarity between superclasses s_i and s_j, where s_i ∈ S(x) and s_j ∈ S(y); S(x) denotes the superclass set of entity x and S(y) denotes the superclass set of entity y. As a result, Eq. (8) is expressed as follows:

sim(x, y) = λ × sim_t(x, y) + (1 − λ)/2 × sim_si(x, y) + (1 − λ)/2 × sim_se(x, y)
          = λ × ω(x, y) + (1 − λ)/2 × ( Σ_{p_i ∈ P(x)} max_{p_j ∈ P(y)} ω(p_i, p_j) + Σ_{s_i ∈ S(x)} max_{s_j ∈ S(y)} ω(s_i, s_j) )    (9)
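A minimal sketch of the aggregation in Eq. (9), assuming the link weights ω are already available. The entity names, properties, superclasses, and weight values below are hypothetical, chosen only to exercise the formula:

```python
def aggregate_sim(omega, x, y, props, supers, lam=0.6):
    """sim(x, y) per Eq. (9): lam * omega(x, y) plus (1 - lam)/2 times the
    sums, over x's properties and superclasses, of the best-matching
    link weight against y's properties and superclasses."""
    sim_si = sum(max((omega(pi, pj) for pj in props[y]), default=0.0)
                 for pi in props[x])
    sim_se = sum(max((omega(si, sj) for sj in supers[y]), default=0.0)
                 for si in supers[x])
    return lam * omega(x, y) + (1 - lam) / 2 * (sim_si + sim_se)

# Hypothetical precomputed link weights for illustration:
w = {("a", "b"): 0.3, ("p1", "q1"): 0.8, ("s1", "t1"): 0.9}
omega = lambda u, v: w.get((u, v), 0.0)
props = {"a": ["p1"], "b": ["q1"]}
supers = {"a": ["s1"], "b": ["t1"]}

sim_ab = aggregate_sim(omega, "a", "b", props, supers)
# 0.6 * 0.3 + 0.2 * (0.8 + 0.9) = 0.52
```

Note that because ω is directed, aggregate_sim(omega, "b", "a", ...) generally differs, which is exactly what the Statement 3 test below exploits.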

Please cite this article in press as: Y. Jiang et al., A semantic similarity measure based on information distance for ontology alignment, Inform. Sci. (2014), http://dx.doi.org/10.1016/j.ins.2014.03.021


where 0 ≤ λ ≤ 1; λ indicates the importance of sim_t(x, y) in sim(x, y), i.e., λ is a trade-off between the terminological similarity and the structural similarities.

As an extended version of the link weight, the statement for semantic equivalence can be formulated as follows:

Statement 3. 0 < |sim(x, y) − sim(y, x)| < δ: the words x and y are semantically equivalent.

In this study, δ is a threshold on the difference between sim(x, y) and sim(y, x). Generally, δ should be set relatively small, reflecting that sim(x, y) and sim(y, x) differ little when the words x and y are semantically equivalent. However, if δ is set too small, many semantically equivalent terms will be missed. To obtain a suitable δ, we conducted experiments and found that the semantic equivalence test performs well when δ is set to 0.1. Based on the proposed method, Table 1 illustrates the computation process for the similarity between O1:Proceedings and O2:Transaction and the similarity between O1:Proceedings and O3:Journal in the example shown in Fig. 1.

According to the extended link weight statement, |sim(Proceedings, Transaction) − sim(Transaction, Proceedings)| indicates that O1:Proceedings and O2:Transaction are semantically equivalent and should be aligned accordingly, whereas |sim(Proceedings, Journal) − sim(Journal, Proceedings)| indicates that O1:Proceedings and O3:Journal are not semantically equivalent and should not be aligned.

4. Evaluation

To evaluate the effectiveness of the proposed ontology alignment method, we confront it with several ontology datasets. To evaluate the quality of the method, we adopted two quality measures specially designed for ontology alignment [18], i.e., semantic precision and semantic recall, both of which take the semantics into account. The reason we abandon the well-known and well-understood precision and recall measures originating from information retrieval is that these measures have the drawback of being of the all-or-nothing kind [10]: one alignment may be very close to the reference alignment and another quite remote from it, yet both alignments share the same precision and recall values. This is due to the restricted set-theoretic foundation of these measures, which is based only on the syntax of the alignments.

Given a reference alignment R, the semantic precision and recall of an alignment A are given as follows:

Table 1
The similarity computation process of an example.

Computation for O1:Proceedings and O2:Transaction:
(1) f(Proceedings) = 349,000,000
    f(Transaction) = 785,000,000
    f(Proceedings, Transaction) = 65,900,000
(2) sim_t(Proceedings, Transaction) = 65,900,000 / 785,000,000 = 0.084
    sim_t(Transaction, Proceedings) = 65,900,000 / 349,000,000 = 0.189
    sim_si(Proceedings, Transaction) = 0.0
    sim_si(Transaction, Proceedings) = 0.0
    sim_se(Proceedings, Transaction) = ω(Book, Book) = 1.0
    sim_se(Transaction, Proceedings) = ω(Book, Book) = 1.0
(3) λ = 0.6, δ = 0.1
    sim(Proceedings, Transaction) = 0.6 × 0.084 + 0.2 × 1.0 = 0.250
    sim(Transaction, Proceedings) = 0.6 × 0.189 + 0.2 × 1.0 = 0.313
    |sim(Proceedings, Transaction) − sim(Transaction, Proceedings)| = |0.250 − 0.313| = 0.063 < δ

Computation for O1:Proceedings and O3:Journal:
(1) f(Proceedings) = 349,000,000
    f(Journal) = 2,210,000,000
    f(Proceedings, Journal) = 125,000,000
(2) sim_t(Proceedings, Journal) = 125,000,000 / 2,210,000,000 = 0.057
    sim_t(Journal, Proceedings) = 125,000,000 / 349,000,000 = 0.358
    sim_si(Proceedings, Journal) = 0.0
    sim_si(Journal, Proceedings) = 0.0
    sim_se(Proceedings, Journal) = ω(Book, Book) = 1.0
    sim_se(Journal, Proceedings) = ω(Book, Book) = 1.0
(3) λ = 0.6, δ = 0.1
    sim(Proceedings, Journal) = 0.6 × 0.057 + 0.2 × 1.0 = 0.234
    sim(Journal, Proceedings) = 0.6 × 0.358 + 0.2 × 1.0 = 0.415
    |sim(Proceedings, Journal) − sim(Journal, Proceedings)| = |0.234 − 0.415| = 0.181 > δ
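The arithmetic of Table 1 can be replayed directly. The check below is a sketch using only the table's page counts, with λ = 0.6 and δ = 0.1; it confirms that Proceedings/Transaction passes the Statement 3 test while Proceedings/Journal fails it:

```python
LAM, DELTA = 0.6, 0.1

def sim(f_xy, f_cond, sim_se):
    # lam * sim_t + (1 - lam)/2 * (sim_si + sim_se); sim_si = 0.0 in
    # this example, as in Table 1.
    return LAM * (f_xy / f_cond) + (1 - LAM) / 2 * sim_se

# Proceedings vs. Transaction (sim_se = omega(Book, Book) = 1.0 both ways):
d1 = abs(sim(65_900_000, 785_000_000, 1.0) - sim(65_900_000, 349_000_000, 1.0))
# Proceedings vs. Journal:
d2 = abs(sim(125_000_000, 2_210_000_000, 1.0) - sim(125_000_000, 349_000_000, 1.0))

print(d1 < DELTA, d2 < DELTA)  # True False: only the first pair is aligned
```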




P_sem(A, R) = |A ∩ Cn(R)| / |A|    (10)

R_sem(A, R) = |Cn(A) ∩ R| / |R|    (11)

where Cn(A) is defined as the set of α-consequences [18,19]. The semantic F-measure combines the semantic precision and semantic recall, as follows:

F_sem = 2 × P_sem × R_sem / (P_sem + R_sem)    (12)
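As a sketch, Eqs. (10)-(12) operate on sets of correspondences, with the consequence operator Cn passed in as a function. The identity closure used below is a simplification that reduces the measures to classical precision/recall; a real implementation would compute the α-consequences with a reasoner, per [18,19]. The toy alignments A and R are hypothetical.

```python
def semantic_prf(A, R, Cn=lambda s: set(s)):
    """Semantic precision (Eq. 10), recall (Eq. 11), F-measure (Eq. 12)."""
    p = len(A & Cn(R)) / len(A)
    r = len(Cn(A) & R) / len(R)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy alignments as sets of (source entity, target entity) pairs:
A = {("Proceedings", "Transaction"), ("Book", "Book"), ("Red", "Color")}
R = {("Proceedings", "Transaction"), ("Book", "Book")}
p, r, f = semantic_prf(A, R)  # p = 2/3, r = 1.0, f = 0.8
```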

5. Results and discussion

In order to perform the alignment and evaluation experiments, we employed the API for ontology alignment introduced by Euzenat [17]. The experiments were conducted on J2SE 6.0 under Windows 7, on an Intel Core 2 Duo at 2.53 GHz with 2 GB RAM. Our datasets are made up of four ontology pairs,1 which were employed by Ehrig et al. [14] for evaluation. These four ontologies, or slight adaptations thereof, have already been used at the ontology alignment contests I3CON (the Information Interpretation and Integration Conference) and OAEI (the Ontology Alignment Evaluation Initiative). The Russia and Animals datasets have been used at I3CON,2 and the Bibliography dataset has been used at OAEI.3 The datasets are described as follows:

Russia 1:
This dataset contains two ontologies describing Russia, created from the content of two independent travel websites about Russia. Each ontology has approximately 400 entities, including concepts, relations, and instances. The total number of possible mappings is 160, as assigned in the reference alignment.

Russia 2:
This second dataset also covers Russia, but the two ontologies are more difficult to align: after their creation, they were altered by removing entities and changing labels at random. Each ontology has 300 entities, with 215 possible mappings counted during generation.

Animals:
The two animal ontologies are different versions of one another. They contain about 40 entities each, thus being rather small examples, which may make them even more difficult to align. There are 25 alignments provided in the reference alignment.

Bibliography:
These bibliographic ontologies are provided by INRIA, with 180 entities. About 40 alignments are identified in the reference alignment.

To illustrate the quality of our proposed method, we adopted three popular alignment methods for comparison. The first is PROMPT [38], a well-known ontology alignment algorithm that is rather simple and fast. The second is QOM [12], which employs a dynamic programming approach to achieve low complexity, and therefore good efficiency, in ontology alignment. The third is APFEL [13], which uses a machine learning algorithm to obtain optimal parameters for similarity aggregation. These three methods were chosen because they are typical ontology alignment methods that focus not only on efficiency but also on effectiveness. Brief descriptions of these methods follow:

1 http://weblab.sz.tsinghua.edu.cn/a-semantic-similarity-measure-based-on-information-distance-for-ontology-alignment/
2 http://www.atl.external.lmco.com/projects/ontology/i3con.html
3 http://oaei.ontologymatching.org/2004/Contest/

PROMPT:
As the PROMPT algorithm is rather simple and fast [38], we adopted it as a baseline against which to evaluate the proposed method. Briefly, PROMPT first creates an initial list of matches based on class names. Then the following loop happens: (i) the user triggers an operation by either selecting one of PROMPT's suggestions from the list or using an ontology-editing environment to specify the desired operation directly; and (ii) PROMPT performs the operation, automatically executes additional changes based on the type of the operation, generates a list of suggestions for the user based on the structure of the ontology around the arguments of the last operation, determines conflicts that the last operation introduced into the ontology, and finds possible solutions for those conflicts.

QOM:
Quick Ontology Mapping (QOM) is an approach focusing on efficiency [12]. In order to lower run-time complexity, QOM employs a dynamic programming approach [4]. The approach proceeds as follows: first, it makes use of the ontology structure and initializes the data structures with candidate alignments, allowing only those that have very similar identifiers (or labels) or that are close neighbors of other existing alignments. Then, the candidate alignments are classified into promising and less promising pairs, and some of them are discarded entirely to gain efficiency.

APFEL:
APFEL is an approach in which both the similarity aggregation and the interpretation are left to a machine learning algorithm [13]. In the similarity aggregation stage it is difficult to set aggregation weights for each feature similarity, and in the interpretation stage it is not easy to select a threshold and an interpretation strategy. Methods called "Parameterizable Alignment Methods" (PAM) have been proposed to solve these problems by using parameters. APFEL, short for "Alignment Process Feature Estimation and Learning", acquires these parameters by using machine learning techniques. The approach proceeds as follows: first, initial alignments are generated and validated to serve as training data. Second, the input ontologies are processed by an arbitrary alignment method such as PAM to generate feature/similarity hypotheses, which are the basis of feature/similarity combinations. Then, a learned decision tree is used for the feature/similarity weighting scheme and threshold fixing. Finally, an optimized final alignment method is set up. The process is explained in detail in [15,13].

To determine the parameter value, we conducted experiments on the four datasets. As shown in Fig. 3, the alignment performs well when λ = 0.6. As λ increases up to 0.6, the F-measure improves; this is because the link weight ω(x, y) plays an important role in measuring the similarity between terms, so the greater its importance, the more precisely we compute that similarity. However, when λ > 0.6, the F-measure degrades. We attribute this to the fact that some terms that have different meanings but are highly correlated in terms of link weight may be mistakenly considered similar, because the importance of the structural similarity is underestimated. Therefore, we use λ = 0.6 in the proposed method when comparing against the other alignment methods.

As shown in Fig. 4 and Table 2, the proposed method performs better than the other methods in terms of semantic precision and semantic recall. In Table 2, on dataset Russia 2, the semantic precision/recall values of the proposed method are 0.98/0.87, while those of PROMPT, QOM, and APFEL are 0.927/0.035, 0.753/0.326, and 0.826/0.802, respectively. The semantic precision of our proposed method is thus 0.053, 0.227, and 0.154 higher than that of PROMPT, QOM, and APFEL, respectively, and the semantic recall is 0.835, 0.544, and 0.068 higher. These high values of semantic precision and recall show that the proposed method aligns almost all the corresponding entities correctly, which we attribute to the fact that the link weight measures the similarities between entities relatively precisely and semantically.

In Fig. 5, we can immediately see that the proposed method performs better than the other methods in terms of semantic F-measure. On dataset Russia 1, the proposed method performs slightly better than the other methods, while on dataset Russia 2, as the alignment difficulty increases, the proposed method performs best, followed by APFEL and QOM. PROMPT performs

Fig. 3. F-measure values of our proposed method on the four datasets for different values of the parameter λ.


Fig. 4. Semantic precision vs. semantic recall values.

Table 2
Comparison of the proposed method and other methods in terms of semantic precision/recall.

                     Russia 1     Russia 2     Animals      Bibliography
PROMPT               1.0/0.687    0.927/0.035  1.0/0.7      0.81/0.427
QOM                  0.89/0.72    0.753/0.326  1.0/0.667    0.721/0.443
APFEL                0.928/0.765  0.826/0.802  1.0/0.72     0.785/0.507
Our proposed method  1.0/0.812    0.98/0.87    1.0/0.796    0.86/0.568

Fig. 5. F-measure results.


worst, since it considers only lexical similarity, which cannot handle the semantic alignment difficulty. On dataset Animals, although PROMPT achieves a good F-measure value, the proposed method achieves the highest. On dataset Bibliography, all of the methods perform worse, but the proposed method still performs best among the four.

The reason the proposed method outperforms the three traditional methods is that it employs more semantic information and structure. In addition, when processing the alignment problem, we found that different ontologies may use a pair of synonyms to represent the same entity. For example, in the dataset Animals, to represent the species described as "an animal with a hairy coat, that lives wild or is kept on farms for its meat or milk, and is classified in the subfamily Caprinae in biological classification", ontology O1 uses the label "sheep" while ontology O2 uses the label "goat". It is almost impossible for traditional methods to align such labels correctly without the benefit of a dictionary. Undeniably, constructing a dictionary is labor-intensive and time-consuming, and a dictionary-based approach loses robustness for the problem of ontology alignment. Therefore, it is hard for the traditional methods to process such semantic correspondences using a dictionary.

The proposed method's higher performance indicates that it is able to identify the correspondences effectively. Since the traditional methods treat the synonyms (i.e., O1:sheep and O2:goat) as independent words without considering their


Table 3
An example of computing sim(O1:sheep, O2:goat) in the experiment.

ω(O1:sheep, O2:goat) = 0.31
Σ_{p_i ∈ P(O1:sheep)} max{ω(p_i, p_j) | p_j ∈ P(O2:goat)} = 1.18
Σ_{s_i ∈ S(O1:sheep)} max{ω(s_i, s_j) | s_j ∈ S(O2:goat)} = 0.48
sim(O1:sheep, O2:goat) = 0.31 × 0.6 + (1 − 0.6)/2 × (1.18 + 0.48) ≈ 0.52

Table 4
An example of computing sim(O2:goat, O1:sheep) in the experiment.

ω(O2:goat, O1:sheep) = 0.19
Σ_{p_i ∈ P(O2:goat)} max{ω(p_i, p_j) | p_j ∈ P(O1:sheep)} = 1.21
Σ_{s_i ∈ S(O2:goat)} max{ω(s_i, s_j) | s_j ∈ S(O1:sheep)} = 0.51
sim(O2:goat, O1:sheep) = 0.19 × 0.6 + (1 − 0.6)/2 × (1.21 + 0.51) ≈ 0.46
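Plugging the figures of Tables 3 and 4 back into Eq. (9) with λ = 0.6 reproduces the reported values and shows the pair passing the δ = 0.1 equivalence test of Statement 3:

```python
# sim(O1:sheep, O2:goat), from Table 3:
sim_sg = 0.31 * 0.6 + (1 - 0.6) / 2 * (1.18 + 0.48)
# sim(O2:goat, O1:sheep), from Table 4:
sim_gs = 0.19 * 0.6 + (1 - 0.6) / 2 * (1.21 + 0.51)

print(round(sim_sg, 2), round(sim_gs, 2))  # 0.52 0.46
print(abs(sim_sg - sim_gs) < 0.1)          # True: semantically equivalent
```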


semantic meanings, they cannot align the correspondences correctly, which degrades their performance. Without an additional thesaurus, it may be impossible for traditional alignment methods to overcome this problem.

In the proposed method, we utilize the link weight to measure the similarities between entities directly. Although two labels that represent the same entity may have different expressions, we found that the Google page counts of these label terms indicate positive correlations over an underlying semantic structural distribution. For example, sim(O1:sheep, O2:goat) = 0.52 and sim(O2:goat, O1:sheep) = 0.46. For the experiment with parameter λ = 0.6, the computation processes for sim(O1:sheep, O2:goat) and sim(O2:goat, O1:sheep) are listed in Tables 3 and 4, respectively.

With the link weight measure, both similarities are greater than 0 and their absolute difference is less than δ = 0.1, showing that these two entities have high semantic similarity although their label terms are completely different. The link weight method is therefore resistant to the noise caused by synonymous labels.

Traditionally, it is very difficult to correctly measure the similarity of terms with different expressions without a dictionary. However, since we use the World Wide Web as an enormous corpus and develop the link weight to measure similarity based on information distance theory, it is possible to quantify the relationships between terms over this huge knowledge base. Even though there may be some noise in the knowledge base, we can set the threshold and refine the similarity calculation process to overcome the problem. For example, to measure the similarity between "Humor" and "Humour", we find that the page counts returned by Google for "Humor" and "Humour" are very similar; using the proposed method, we are able to recognize that they are similar terms to be aligned.

For terms that are highly correlated but have different meanings within the same domain, we define the similarity function of Eq. (8) to measure the two terms. Eq. (8) considers not only the terminological similarity sim_t(x, y) between two terms, but also the structural internal similarity sim_si(x, y) and the structural external similarity sim_se(x, y). For two terms with different meanings, such as "football" and "basket", that are highly correlated in terms of terminological similarity, the proposed method also computes sim_si(x, y) and sim_se(x, y). Since the two terms differ greatly in terms of structural similarity, we are able to differentiate them and will not align them in the process.

6. Conclusion

In this paper, we propose an ontology alignment method based on information distance that aligns ontologies semantically, making it possible to achieve higher performance and scalability than previous alignment methods when handling the semantic challenges of ontology alignment. To achieve these goals, our method takes advantage of the notion of link weight. Specifically, owing to the link weight, our method concentrates on aligning generally conceptualized ontologies by considering both terminological similarity and structural similarity. Our method has an advantage over traditional alignment methods in that it finds semantically aligned pairs without using dictionaries. We have compared the proposed method with three well-known existing methods, i.e., PROMPT, QOM, and APFEL, on four datasets. The experimental results show that our method achieves higher semantic precision, recall, and F-measure. The main reason is that the link weight is able to extract the meaning of words and phrases from the World Wide Web using Google page counts, and so it measures the similarities between entities in different ontologies with relatively high precision. Therefore, we believe our method will play an important role in constructing semantics-based ontology alignment methods.

One limitation of our work is that it depends on the entities of the ontologies to be aligned. If an entity is not a common concept (such as a highly specific custom entity), the proposed method may perform poorly. In particular, if the similarity measure does not satisfy the two asymmetric constraints, it will also perform poorly. The link weight measure defined in this study is suitable, but it also performs poorly when the value of f(x) is too large, which may be thought to correspond to the idea of negative correlation in probability theory. Smoothing approaches can be introduced to overcome this drawback in future research.


Acknowledgements

We thank the anonymous reviewers for their valuable comments. This research is supported by the National Natural Science Foundation of China (Grant Nos. 61003100 and 60972011) and the Research Fund for the Doctoral Program of Higher Education of China (Grant Nos. 20100002120018 and 20100002110033).

References

[1] T. Bach, J. Barrasa, P. Bouquet, R. Inria, M. Hauswirth, M. Vu, S. Vuamsterdam, P. Trento, T. Bolzano, S. Acker, et al., State of the art on ontology alignment, Bioinformatics (2004).
[2] C. Bennett, P. Gács, M. Li, P. Vitanyi, W. Zurek, Information distance, IEEE Trans. Inform. Theor. 44 (1998) 1407–1423.
[3] J. Bock, J. Hettenhausen, Discrete particle swarm optimisation for ontology alignment, Inform. Sci. 192 (2012) 152–173.
[4] M. Boddy, Anytime problem solving using dynamic programming, in: Proceedings of the Ninth National Conference on Artificial Intelligence, vol. 2, pp. 738–743.
[5] P. Chen, S. Lin, Automatic keyword prediction using Google similarity distance, Expert Syst. Appl. 37 (2010) 1928–1938.
[6] R. Cilibrasi, P. Vitanyi, Automatic Meaning Discovery Using Google, Manuscript, CWI, 2004.
[7] R. Cilibrasi, P. Vitanyi, The Google similarity distance, IEEE Trans. Knowl. Data Eng. 19 (2007) 370–383.
[8] A. Doan, J. Madhavan, P. Domingos, A. Halevy, Ontology matching: a machine learning approach, in: Handbook on Ontologies, 2004, pp. 385–404.
[9] M. Ehrig, Ontology Alignment: Bridging the Semantic Gap, vol. 4, Springer-Verlag New York Inc., 2007.
[10] M. Ehrig, J. Euzenat, Relaxed precision and recall for ontology matching, in: Proc. K-Cap 2005 Workshop on Integrating Ontology, Banff (CA), pp. 25–32.
[11] M. Ehrig, J. Euzenat, A. Hess, W. van Hage, G. Stoilos, G. Stamou, U. Straccia, V. Svatek, R. Troncy, P. Valtchev, et al., D2.2.4: Alignment Implementation and Benchmarking Results, 2006.
[12] M. Ehrig, S. Staab, QOM: quick ontology mapping, in: The Semantic Web, ISWC 2004, 2004, pp. 683–697.
[13] M. Ehrig, S. Staab, Y. Sure, Bootstrapping ontology alignment methods with APFEL, in: The Semantic Web, ISWC 2005, 2005, pp. 186–200.
[14] M. Ehrig, S. Staab, Y. Sure, Framework for Ontology Alignment and Mapping, 2005.
[15] M. Ehrig, Y. Sure, S. Staab, Supervised learning of an ontology alignment process, Prof. Knowl. Manage. (2005) 508–517.
[16] J. Euzenat, Brief overview of T-tree: the Tropes taxonomy building tool, in: Proc. 4th ASIS SIG/CR Workshop on Classification Research, Columbus (OH, US), pp. 69–87.
[17] J. Euzenat, An API for ontology alignment, in: The Semantic Web, ISWC 2004, 2004, pp. 698–712.
[18] J. Euzenat, Semantic precision and recall for ontology alignment evaluation, in: Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI), pp. 348–353.
[19] D. Fleischhacker, H. Stuckenschmidt, Implementing semantic precision and recall, in: Proc. 4th International Workshop on Ontology Matching (OM-2009), collocated with ISWC-2009, Chantilly (USA).
[20] A. Gal, P. Shvaiko, Advances in ontology matching, in: Advances in Web Semantics I, 2009, pp. 176–198.
[21] R. Gligorov, W. ten Kate, Z. Aleksovski, F. van Harmelen, Using Google distance to weight approximate ontology matches, in: Proceedings of the 16th International Conference on World Wide Web, ACM, pp. 767–776.
[22] M. Granitzer, V. Sabol, K. Onn, D. Lukose, K. Tochtermann, Ontology alignment: a survey with focus on visually supported semi-automatic techniques, Future Internet 2 (2010) 238–258.
[23] T. Gruber et al., A translation approach to portable ontology specifications, Knowl. Acquis. 5 (1993) 199–220.
[24] W. Hu, G. Cheng, D. Zheng, X. Zhong, Y. Qu, The results of Falcon-AO in the OAEI 2006 campaign, Ontol. Matching (2006) 124.
[25] T. Hughes, The Semantics of Ontology Alignment, Technical Report, DTIC Document, 2004.
[26] S. Jan, M. Li, H. Al-Raweshidy, A. Mousavi, M. Qi, Dealing with uncertain entities in ontology alignment using rough sets, IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 42 (2012) 1600–1612.
[27] Y. Kalfoglou, M. Schorlemmer, IF-Map: an ontology-mapping method based on information-flow theory, J. Data Semantics I (2003) 98–127.
[28] M. Li, X. Chen, X. Li, B. Ma, P. Vitanyi, The similarity metric, IEEE Trans. Inform. Theor. 50 (2004) 3250–3264.
[29] M. Li, P.M. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, third ed., Springer Publishing Company, Incorporated, 2008.
[30] J. Madhavan, P. Bernstein, A. Doan, A. Halevy, Corpus-based schema matching, in: Proceedings of the 21st International Conference on Data Engineering (ICDE 2005), IEEE, 2005, pp. 57–68.
[31] M. Makrehchi, M. Kamel, Automatic taxonomy extraction using Google and term dependency, in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, IEEE Computer Society, pp. 321–325.
[32] S. Massmann, D. Engmann, E. Rahm, COMA++: results for the ontology alignment contest OAEI 2006, in: Proc. 1st ISWC International Workshop on Ontology Matching (OM), 2006, pp. 107–114.
[33] C. Meilicke, H. Stuckenschmidt, Analyzing mapping extraction approaches, in: Proceedings of the Workshop on Ontology Matching.
[34] C. Meilicke, H. Stuckenschmidt, A. Tamilin, Repairing ontology mappings, in: Proceedings of the National Conference on Artificial Intelligence, vol. 22, AAAI Press; MIT Press, Menlo Park, CA; Cambridge, MA; London, 1999, p. 1408.
[35] S. Melnik, H. Garcia-Molina, E. Rahm, Similarity flooding: a versatile graph matching algorithm and its application to schema matching, in: Proceedings of the 18th International Conference on Data Engineering, IEEE, 2002, pp. 117–128.
[36] D. Ngo, Z. Bellahsene, YAM++: a multi-strategy based approach for ontology matching task, in: Proceedings of the 18th International Conference on Knowledge Engineering and Knowledge Management, EKAW'12, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 421–425.
[37] N. Noy, M. Musen, Anchor-PROMPT: using non-local context for semantic matching, in: Proceedings of the Workshop on Ontologies and Information Sharing at the International Joint Conference on Artificial Intelligence (IJCAI), pp. 63–70.
[38] N.F. Noy, M.A. Musen, PROMPT: algorithm and tool for automated ontology merging and alignment, in: Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, AAAI Press, 2000, pp. 450–455.
[39] M.A. Rodríguez, M.J. Egenhofer, Determining semantic similarity among entity classes from different ontologies, IEEE Trans. Knowl. Data Eng. 15 (2003) 442–456.
[40] V. Spiliopoulos, G. Vouros, Synthesizing ontology alignment methods using the max-sum algorithm, IEEE Trans. Knowl. Data Eng. 24 (2012) 940–951.
[41] G. Stumme, A. Maedche, FCA-Merge: bottom-up merging of ontologies, in: International Joint Conference on Artificial Intelligence, vol. 17, Lawrence Erlbaum Associates Ltd., pp. 225–234.
[42] J. Tang, J. Li, B. Liang, X. Huang, Y. Li, K. Wang, Using Bayesian decision for ontology mapping, Web Semantics: Science, Services and Agents on the World Wide Web 4 (2006) 243–262.
[43] P. Vitanyi, F. Balbach, R. Cilibrasi, M. Li, Normalized information distance, Inform. Theor. Stat. Learn. (2009) 45–82.
[44] Y. Wang, W. Liu, D. Bell, A structure-based similarity spreading approach for ontology matching, in: Scalable Uncertainty Management, 2010, pp. 361–374.
