context-aware similarity assessment within semantic space formed in linked data

18
ORIGINAL RESEARCH Context-aware similarity assessment within semantic space formed in linked data Parisa D. Hossein Zadeh Marek Z. Reformat Received: 19 December 2011 / Accepted: 7 July 2012 / Published online: 9 August 2012 Ó Springer-Verlag 2012 Abstract The Web is a constantly growing repository of information. Amount of data that becomes available exceeds our abilities to search and examine this data in a reasonable time and with a practical effort. The data is stored in forms of documents, texts and web pages, which are not suitable for comprehensive analysis and search. In order to make the data stored on the Internet more acces- sible, a new model of data representation has been intro- duced—linked data. Linked data provides an open platform for representing and storing structured data as well as metadata. In this paper, we propose a novel approach for calculating the degree of similarity between two entities in the web of linked data. The idea is based on the fact that entities are submerged in the linked data and their semantics is defined via their connections to other entities. Therefore, similarity between two entities is determined by comparing connections of two entities to other entities. Firstly, the approach is introduced to determine semantic similarity in a context-free manner. This method does not select specific types of connections but takes into consid- eration all of them. Secondly, a context-aware approach is presented as a modification of the original method. In this case, a context is defined by a set of connection types— only connections of specific types are considered for sim- ilarity determination. The proposed approach uses concepts of possibility theory to determine lower and upper bounds of similarity intervals. We evaluate the proposed similarity assessment process by applying it to real-world datasets, and we compare it to other related methods. Keywords Semantic similarity Feature-based similarity Context-aware similarity RDF triples Linked data Semantic space 1 Introduction An ultimate contribution of the Semantic Web (Lee et al. 2001) is utilization of ontology as the knowledge represen- tation form. Resource description framework (RDF) (Lassila and Swick 1999) is introduced as an underlying framework in order to use ontology in the web environment. Resource description framework data model treats each piece of information as a triple: subject-property-object (Lassila and Swick 1999). In the last few years, the application of RDF for data representation has become a very popular way of rep- resenting data on the web (Shadbolt et al. 2006). Over time, more attention has been paid to it, and the term linked data (LD) has been used to describe the network of data sources based on RDF triples for information representation (Bizer et al. 2009). The power of LD, in contrary to hypertext web, is that entities from different sources/locations are linked to other related entities on the web. This enables one to view the web as a single global data space (Bizer et al. 2009). In other words, hypertext web connects documents in a naive way— links always point to documents. However, in the web of LD single information items are connected—links point to other pieces of information stored at different physical locations. As a result, LD allows for better representation of structured data and even its underlying semantics. In order to publish data on the web of LD some prin- ciples have to be followed (Berners-Lee and Hendler P. D. Hossein Zadeh (&) M. Z. Reformat Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada e-mail: [email protected] M. Z. Reformat e-mail: [email protected] 123 J Ambient Intell Human Comput (2013) 4:515–532 DOI 10.1007/s12652-012-0154-7

Upload: marek-z-reformat

Post on 08-Dec-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

ORIGINAL RESEARCH

Context-aware similarity assessment within semantic spaceformed in linked data

Parisa D. Hossein Zadeh • Marek Z. Reformat

Received: 19 December 2011 / Accepted: 7 July 2012 / Published online: 9 August 2012

� Springer-Verlag 2012

Abstract The Web is a constantly growing repository of

information. Amount of data that becomes available

exceeds our abilities to search and examine this data in a

reasonable time and with a practical effort. The data is

stored in forms of documents, texts and web pages, which

are not suitable for comprehensive analysis and search. In

order to make the data stored on the Internet more acces-

sible, a new model of data representation has been intro-

duced—linked data. Linked data provides an open platform

for representing and storing structured data as well as

metadata. In this paper, we propose a novel approach for

calculating the degree of similarity between two entities in

the web of linked data. The idea is based on the fact that

entities are submerged in the linked data and their

semantics is defined via their connections to other entities.

Therefore, similarity between two entities is determined by

comparing connections of two entities to other entities.

Firstly, the approach is introduced to determine semantic

similarity in a context-free manner. This method does not

select specific types of connections but takes into consid-

eration all of them. Secondly, a context-aware approach is

presented as a modification of the original method. In this

case, a context is defined by a set of connection types—

only connections of specific types are considered for sim-

ilarity determination. The proposed approach uses concepts

of possibility theory to determine lower and upper bounds

of similarity intervals. We evaluate the proposed similarity

assessment process by applying it to real-world datasets,

and we compare it to other related methods.

Keywords Semantic similarity � Feature-based similarity �Context-aware similarity � RDF triples � Linked data �Semantic space

1 Introduction

An ultimate contribution of the Semantic Web (Lee et al.

2001) is utilization of ontology as the knowledge represen-

tation form. Resource description framework (RDF) (Lassila

and Swick 1999) is introduced as an underlying framework

in order to use ontology in the web environment. Resource

description framework data model treats each piece of

information as a triple: subject-property-object (Lassila and

Swick 1999). In the last few years, the application of RDF for

data representation has become a very popular way of rep-

resenting data on the web (Shadbolt et al. 2006). Over time,

more attention has been paid to it, and the term linked data

(LD) has been used to describe the network of data sources

based on RDF triples for information representation (Bizer

et al. 2009). The power of LD, in contrary to hypertext web,

is that entities from different sources/locations are linked to

other related entities on the web. This enables one to view the

web as a single global data space (Bizer et al. 2009). In other

words, hypertext web connects documents in a naive way—

links always point to documents. However, in the web of LD

single information items are connected—links point to other

pieces of information stored at different physical locations.

As a result, LD allows for better representation of structured

data and even its underlying semantics.

In order to publish data on the web of LD some prin-

ciples have to be followed (Berners-Lee and Hendler

P. D. Hossein Zadeh (&) � M. Z. Reformat

Department of Electrical and Computer Engineering,

University of Alberta, Edmonton, AB, Canada

e-mail: [email protected]

M. Z. Reformat

e-mail: [email protected]

123

J Ambient Intell Human Comput (2013) 4:515–532

DOI 10.1007/s12652-012-0154-7

2001). One fundamental rule is the use of uniform resource

identifiers (URIs) to identify each piece of information

(Shadbolt et al. 2006). Uniform resource identifiers aim to

universally define entities in the web of data so that users

and machines can use the URIs to obtain information about

the data. This means that every entity has a global identifier

that a person or machine can use to look it up, refer to it,

and find its description. Another rule of publishing data in

LD is that the created URIs should be obtainable via HTTP

on the web.

As stated, LD is expressed in RDFs, i.e., triples: subject,

property, and object, where each one of these is represented

by an URI. This way finding a specific piece of information

in the web of data is facilitated with the help of inter-

pretable URIs. For example, the entity ‘‘University of

California’’ can be referred to in different ways, such as

‘‘University of Berkeley’’, ‘‘UCB’’, and ‘‘UC Berkeley’’ by

different data sources. However, assigning a unique URI in

different datasets helps to avoid any confusion.

A collection of Semantic Web Technologies and

Applications supports manifestation of LD in reality. These

include protocols, strategies and tools for querying the

RDF datasets (SPARQL, SPARQL?), transforming cur-

rent application-specific formats of resources to the RDF

format (STEAMY, Cvs2rdf, Cypher), reasoning and dis-

covering new relationships using RDF data in order to

manage the information on the web (Pellet, Jena,

FaCT??), and extracting RDF triples (CumulusRDF,

3Store, Pubby). Linked data is potentially beneficial to

various semantic applications such as web search engines,

web browsers, information retrieval systems, and reasoning

engines. As a result of the interconnected data, navigation

and query using semantic-enabled browsers over the LD

can be facilitated to a great extent. However, LD as an

integration of the interlinking datasets poses challenges

regarding processing and analysis of data (D. Hossein

Zadeh and Reformat 2012a). One of them is finding simi-

larity between pieces of information.

In this paper, we propose a semantic similarity assess-

ment method between the entities in LD based on the

interconnections between entities and applying elements of

possibility theory. The proposed approach includes a

number of important features:

• Semantic-oriented—the approach is fully dependent on

the interconnections between entities that define seman-

tics of data; therefore, the method is an attempt to

determine the similarity based on semantics of entities.

• Context-aware—the approach is sensitive to the types

of interconnections between entities; this makes it

suitable to assess similarity in the context of specific

features or utilization of entities that can be defined by

different types of interconnections.

• Dynamic/adaptive—similarities are determined based

on the current state of knowledge as embedded in LD;

any addition and modification of this data will be

momentarily reflected in similarity evaluations.

The remainder of this paper is organized as follows:

Sect. 2 provides some background information related to

LD, principle of similarity, and possibility theory. In Sect.

3, we present our context-free semantic similarity

approach. The context-aware similarity method and its

underlying motivations are described in Sect. 4. Sections 5

and 6 evaluate our approaches using real-world datasets

and compare it with other methods. Section 7 reviews

related work in the field of semantic similarity measure-

ment. Finally, Sect. 8 concludes the paper and presents

future work.

2 Background

2.1 Linked data

Linked data resembles a decentralized partial mesh net-

work in which entities from different resources are con-

nected to other related entities directly or indirectly. As

previously mentioned, the foundation of LD consists of two

elements: RDF and URI. The basic atomic component of

LD is a RDF triple. In fact, all pieces of information in LD

are expressed using triples. The generic format of a RDF

triple is: subject, property, and object. For example, the

statement:

‘‘The Matrix (movie) is distributed by Warner Bros.’’

can be expressed as the following triple:

The Matrix(subject)-distributed by(property)-

Warner Bros.(object)

components of this triple can be parts of another triple(s):

The Matrix(subject)-directed by(property)

-The Wachowskis.(object)

New_Line_Cinema(subject)-parent of(property)

-Warner Bros.(object)

Graphically, the above RDF triples constituted in a so

called RDF graph, see Fig. 1.

As stated, the second fundamental aspect of LD is URI.

Uniform resource identifiers provide unique identification

for every piece of data on the web. Thus, dereferencing the

URI associated with every entity enables a user/machine to

find all information related to that entity, which includes its

associative RDF fragments in the web of data. This shows

that LD is a huge network of interconnections from dif-

ferent datasets without a single centre of knowledge. The

516 P. D. Hossein Zadeh, M. Z. Reformat

123

RDF triples: ‘‘The Matrix’’ is distributed by ‘‘Warner

Bros.’’, ‘‘The Matrix’’ is directed by ‘‘The Wachowskis’’,

and ‘‘New_Line_Cinema’’ is parent of ‘‘Warner Bros.’’ can

be expressed in the following way, Fig. 2.

In the example above, dereferencing the URI:

http://dbpedia.org/resource/The_Matrix shows that the

subject ‘‘The Matrix’’ belongs to the dataset dbpedia.org.

Also, dereferencing the URI http://dbpedia.org/resource/

Warner_Bros. implies that this object is defined in the

dataset dbpedia.org. Similar to the subject and object,

properties are uniquely represented by their assigned URIs.

Properties, Pi, can be categorized as sets of relations that

are associated with every particular class of things. This

means that every class of entities including person, vehicle,

event, movie, etc. has some specific properties, while each

property is distinguished by a unique URI. For example,

the class ‘‘movie’’ can have properties such as runtime,

budget, cinematography, director, distributor, producer,

language, etc.

There exist several important data collections that have

published their contents in the format of LD, such as

DBPedia,1 DBLP,2 Geonames,3 Freebase,4 New York

Times,5 BBC programmes,6 and FOAF.7 Among these

datasets, DBPedia transformed contents of Wikipeida into

the LD format with more than 103 million RDF triples.

Geonames is another dataset that provides information

about over eight million geographical locations in the

world. Freebase presents data related to approximately

20 million different entities in an open linked data graph.

New York Times linked open data has published news,

vocabularies and subject headings as LD. Information

about people, their activities and social media are pub-

lished in LD format in FOAF project. Information about

books in the form of LD is available on RDF Book

Mashup.8 It is worth mentioning that all the mentioned data

sources allow connections to and from other data sources.

Figure 3 is generated using Gephi9 and depicts a snap-

shot of DBPedia dataset containing RDF triples of four

different movies. Vertices represent resources (subjects and

objects), and properties are shown by edges between the

resources. The four vertices with the highest number of

edges connected to them are the four selected movies. As it

can be observed, these movies have connections to unique

features (specific to that particular movie) and also con-

nections to features that are shared between different

movies. One of the most intriguing observations regarding

LD is its contribution to semantic definition of entities. A

set of relations between an entity and other resources can

be conceived as resource’s features defining its semantics.

2.2 Concept of similarity

The task of determining similarity between two entities is a

fundamental process for many applications related to

analysis and processing of data including artificial intelli-

gence, biomedicine, knowledge representation, data min-

ing, natural language processing, and information retrieval.

A variety of approaches for similarity assessment have

been proposed in the literature while some leverage lexi-

cographic, syntactic, structural information, as well as

representation of information about entities to measure the

The matrix New LineCinema

WarnorsBros.

distributed by parent of

directed by

TheWachowskis

Fig. 1 A simple RDF graph

containing three triples

Subject: http://dbpedia.org/resource/The_Matrix Predicate: http://dbpedia.org/ontology/distributor Object: http://dbpedia.org/resource/Warner_Bros.

Subject: http://dbpedia.org/resource/The_Matrix Predicate: http://dbpedia.org/property/director Object: http://dbpedia.org/resource/The_Wachwskis

Subject: http://dbpedia.org/resource/New_Line_Cinema Predicate: http://dbpedia.org/property/parent Object: http://dbpedia.org/resource/Warner_Bros.

Fig. 2 Examples of RDF triples

1 http://dbpedia.org/About.2 http://www.informatik.uni-trier.de/*ley/db/.3 http://www.geonames.org/.4 http://www.freebase.com/.5 http://data.nytimes.com/.6 http://www.bbc.co.uk/programmes.7 http://www.foaf-project.org/.

8 http://www4.wiwiss.fu-berlin.de/bizer/bookmashup/.9 http://gephi.org/.

Context-aware similarity assessment 517

123

similarity. The most popular techniques are based on entities’

feature matching (Tversky 1977; Nosofsky 1991) as well as

combination approaches (Johannesson 1997). Many approa-

ches depend on representation of information in the form of

ontology, while a few methods investigate the problem of

similarity assessment in LD. A detailed review of the

aforementioned techniques is provided in Sect. 7.

Intuitively, a degree of similarity between two items

depends on how many features of these two items are the

same. Before we investigate this idea further, let us take a

closer look at the definition of identity. The original defi-

nition of identity comes from the Leibniz’s law (Leibniz

1975) of identity of indiscernible, which states that the two

entities i and j are identical if they share common prop-

erties Pi and Pj:

8i8j½8PðPi $ PjÞ ! i ¼ j� ð1Þ

It can be inferred that unique features of each entity con-

tribute to the dissimilarity measure between the two

entities.

Another important aspect of such understood (dis)simi-

larity is related to its symmetry. If a number of features of

an entity are different than another entity then (dis)simi-

larity is not symmetrical. Thus, if we assume that

(dis)similarity is a ratio between mutual features and all

features connected to an entity, then—of course—this ratio

will depend on a number of features (connections). Since

the number of connections for each entity in LD may be

different the similarity is asymmetric—this complies with

the work conducted by Tversky (1977). According to

Tversky (1977), similarity between two entities can be

determined by comparing features defining these resources.

We also believe that similarity can be determined when

only some specific features are considered while others are

meant to be ignored. Thus, an appropriate selection of

features allows for determining similarity in a context

defined by these selected features.

The above-mentioned principle sheds light on how to

assess the degree of similarity between two entities in LD.

The fact that all pieces of data in LD are interconnected

creates a natural implication to think about similarity as the

degree of interconnection between the pieces. Generally,

connections represent reasonable amount of information

about the entities in LD. Detailed analysis of these inter-

connections enables one to extract features related to every

entity in the web of data.

2.3 Possibility theory

Introduced by Zadeh (1999) and fully developed by Dubois

and Prade (2003) possibility theory is a suitable vehicle to

handle incomplete information. Even if similar to the

probability theory it differs in using two sets of functions—

possibility and necessity measures—instead of just one

measure as in the probability theory. Here, we present the

basic definitions and concepts that are used in our approach

for similarity evaluation. For more information on possi-

bility theory see (DuBois and Prade 1980; Dubois et al.

1988; Klir and Folger 1988).

Let us assume a finite set of states, S. A possibility

distribution function is:

pðsÞ : S! h0; 1i ð2Þ

that s represents a current state of knowledge. Possibility

theory appraises what elements of S are plausible and what

elements are not, what is ‘‘normal’’ and what is not. The

state s is expressed to be impossible as:

pðsÞ ¼ 0

or totally possible (plausible):

pðsÞ ¼ 1

This allows for expressing complete knowledge, when

some state s0 is possible pðs0Þ ¼ 1, and other states s are

impossible pðsÞ ¼ 0. Total ignorance is expressed as

pðsÞ ¼ 1 for all s from S. Therefore, degrees of

possibility and necessity for a given subset of states Ssub

can be computed as below: possibility:

PðSsubÞ ¼ sups2Ssub

pðsÞ ð3Þ

necessity:

NðSsubÞ ¼ infs62Ssub

1� pðsÞ ð4Þ

Fig. 3 A snapshot of dbpedia.org representing four movies

518 P. D. Hossein Zadeh, M. Z. Reformat

123

The duality of possibility-necessity is:

NðSsubÞ ¼ 1�PðS0subÞ ð5Þ

where S0

represents the complement of S. Possibility

measures satisfy the basic property of:

PðSbsub [ Sa

subÞ ¼ maxðPðSasubÞ;PðSb

subÞÞ ð6Þ

while necessity measures satisfies the dual property:

PðSbsub \ Sa

subÞ ¼ minðPðSasubÞ;PðSb

subÞÞ ð7Þ

Equations (3, 4, 5) are applied in the proposed approach for

similarity assessment.

3 Context-free similarity assessment

3.1 Approach overview

Linked data (LD) represents each entity (resource) via fea-

tures associated with it (Sect. 2.1). This representation creates

a captivating form of defining a single resource and com-

paring it with other resources. In a nutshell, the approach

proposed identifies resources that are certainly shared and

possibly shared between two entities, and uses elements of

possibility theory to assess similarity between these entities.

3.2 Approach description

As mentioned before, LD is a mesh of interconnected resour-

ces, which can be represented as a set of triples\resource-as-

subject, property, resource-as-object[. Formally:

LD ¼ hri; pq; rmi: ri; rm 2 R; pq 2 P� �

ð8Þ

where R is a set of resources, and P is a set of properties. In

this mesh, a single resource ri is defined via its connections

to other resources. Each of these resources can be

considered as a feature of ri. A set of all resources

(features) connected to ri can be treated as its semantic

definition. The connections between the resource ri and

other resources are labeled with properties that have ri as

their subject. Therefore, for an entity ri we can write:

ni ¼ jfhri; pq; rmi: rm 2 Rnfrig; pq 2 Pgj ð9Þ

where the symbol | | stands for cardinality of a set, and ni

represents the number of connections between ri and other

resources in LD. In other words, ni represents the number

of resources—features—of ri.

Indeed, we can say that LD is a powerful infrastructure

providing entities with semantic. As a result, LD can be

treated as a large semantic space containing multiple defini-

tions—semantic is formed through connections between

resources. Therefore, we can use LD to determine a semantic-

based similarity using connections as indicators of relatedness

between resources. For two different resources ri and rj, we

identify their associated features in order to appraise the

degree of relatedness between them. There exist four different

scenarios that can be encountered during similarity assess-

ment of two resources ri and rj. For instance, some resources

(features) are not shared between the two resources ri and rj,

while some properties can be of the same or different type(s).

Overall, there are four possible scenarios as shown in Table 1.

Each of the scenarios represents evidence towards simi-

larity or dissimilarity. Let us explain each specific scenario:

• S1: this scenario intuitively contributes to the similarity of

two entities. The same property existing in both entities is

used as the connections to the same resources.

• S2: this scenario introduces ambiguity: the same property

connects the two entities to different resources. Note that

this scenario may contribute to similarity or dissimilarity

depending on the relatedness of the attached resources at

the other end of properties.

Table 1 Possible scenarios of

connections between two

resourcesScenario label Properties (connections) Resources (features)

S1 Same type

S2 Same type

Connecting

Different resources

S3 Different types

S4 Different types Different resources

Same (shared) resource

Same (shared) resource

Context-aware similarity assessment 519

123

• S3: this scenario introduces ambiguity: the same (shared)

resource is connected via different properties. Thus,

depending on the relatedness of the properties it is possible

that this scenario contributes to similarity or dissimilarity.

• S4: this scenario contributes directly to dissimilarity.

Each of the entities is connected to different resources

via different properties.

Let the sets Pi and Pj represent properties of the

resources ri and rj, respectively. Ri and Rj, on the other

hand, represent sets of features (connected resources) of ri

and rj. Additionally, we define the following sets:

• a set of resources connected to both resources ri and rj,

and a set of properties shared by both of them:

Ri;j ¼ Ri \ Rj Pi;j ¼ Pi \ Pj ð10Þ

• a set of resources describing exclusively the resource

ri (rj):

Rexci ¼ RinRj Rexc

j ¼ RjnRi ð11Þ

• likewise, a set of properties exclusive for ri (rj):

Pexci ¼ PinPj Pexc

j ¼ PjnPi: ð12Þ

Based on the definitions above, the scenarios for the

resource ri with respect to any resource rj can be presented

in the following way. Number of resources describing ri

that belongs to the scenario S1 is:

niS1¼ jfðhri;pq;rmi;hrj;pq;rmiÞ : rm 2Ri;j;pq 2Pi;jgj ð13Þ

scenario S2:

niS2 ¼ jfhri; pq; rmi : rm 2 Rexci ; pq 2 Pi;jgj ð14Þ

scenario S3:

niS3 ¼ jfðhri;pq; rmi; hrj;ps; rmiÞ : rm 2 Ri;j; ðpq 6¼ psÞ 2 Pgjð15Þ

scenario S4:

niS4 ¼ jfhri; pq; rmi : rm 2 Rexci ; pq 2 Pexc

i gj: ð16Þ

Remark 1 intuitively, S1 contributes to similarity and S4

contributes to dissimilarity of two resources. The scenarios

S2 and S3 require in depth evaluation to determine their

contributions to similarity/dissimilarity. In this case, we

need to assess relatedness of resources (S2) or properties

(S3) that are different, and this would give us an indication

whether additional evidence exists towards similarity or

dissimilarity. For now, we focus on S1 and S4 as

computationally simpler scenarios. Scenarios S2 and S3

will be discussed later in this section.

Using the descriptions given to the different scenarios, the

similarity and dissimilarity between ri and rj can be expressed

in the following way: the similarity is solely based on scenario

S1 and thus the necessity of similarity can be determined

according to the possibility theory (Sect. 2.3):

Nðsim½ri; rj�Þ ¼niS1

nið17Þ

This represents the similarity of ri to rj; therefore, the

denominator contains the value of connections (features) of

ri. This leads to an asymmetric nature of the proposed

approach, as explained in Sect. 2.3. The necessity of

dissimilarity of ri to rj is determined based on scenario S4

using the equation:

Nðdissim½ri; rj�Þ ¼niS4

nið18Þ

As discussed before, scenarios S2 and S3 contribute to

ambiguity; thus, they are involved in determining

possibility of dissimilarity:

Pðdissim½ri; rj�Þ ¼niS2þ niS3þ niS4

nið19Þ

with the understanding that necessity of dissimilarity (niS4)

counts in possibility of dissimilarity as it certainty implies

possibility. This is according to the principle of minimal

specificity in possibility theory in which the unknown to be

impossible scenarios count towards possibility (Zadeh

1999). This leads to the following formulas for necessity

of similarity and possibility of similarity:

Nðsim½ri; rj�Þ ¼ minððNðsim½ri; rj�Þ; 1�Pðdissim½ri; rj�ÞÞð20Þ

Pðsim½ri; rj�Þ ¼ 1� Nðdissim½ri; rj�Þ ð21Þ

Therefore, similarity between resources can be expressed

as an interval with N (sim) as its lower limit, and P (sim) as

its upper limit.

Remark 2 in order to remedy the computational com-

plexity of P (dissim [ri, rj]) in Eq. (19) as already explained

in Remark 1, we may take advantage of the fact that:

ni ¼ niS1þ niS2þ niS3þ niS4 ð22Þ

with a simple arithmetic, it can be seen that the problems in

Eqs. (20) and (21) can now be solved directly through S1

and S4. Consequently, we can derive final values of

necessity of similarity and possibility of similarity as:

NFðsim½ri; rj�Þ ¼ Nðsim½ri; rj�Þ ð23Þ

PFðsim½ri; rj�Þ ¼ 1� Nðdissim½ri; rj�Þ: ð24Þ

The simple manipulation presented above allows us to

consider only the scenarios S1 and S4—using these scenarios

we can determine the lower-bound (necessity) and the upper-

bound (possibility) of a similarity interval. In the light of that,

we can say that the measures of the scenarios S2 and S3

520 P. D. Hossein Zadeh, M. Z. Reformat

123

represent the width of the interval. The process of determining

the scenario S2, different resources and the same property,

involves identifying similarity between these different

resources. In order to do this we need to travel farther from

the original two resources under evaluation and analyze other

connections. An algorithm is developed for this purpose in

Sect. 4. Nevertheless, it increases the complexity of the

similarity assessment. For the scenario S3, the same resource

and different properties, the process is more complex than the

one for S2. It requires an investigation of semantics of

properties, which is external to LD knowledge sources. If such

processes are performed the formulas for possibility of

similarity and necessity of dissimilarity should be adapted

accordingly as below:

Pðsim½ri; rj�Þ ¼niS1þ niS2C þ niS3C

nið25Þ

Nðdissim½ri; rj�Þ ¼niS4þ ðniS2� niS2CÞ þ ðniS3� niS3CÞ

ni

ð26Þ

where niS2C and niS3C represent numbers of resources and

properties indirectly related to each other. In the work

presented here, we consider S2 in context-aware similarity

assessment (Sect. 4). However, we do not investigate sce-

nario S3 that would lead to a semantic analysis of prop-

erties—a topic beyond the scope of this work.

Let us take a look at a simple example illustrated in

Fig. 4. The figure represents two resources ri and rj and

their associated features (other resources) connected via

properties. For simplicity, different types of properties are

shown with different line widths and styles.

For scenarios S1 and S3 the numbers of resources for

each scenario are the same for both resources. In Fig. 4, we

have nS1 = 4 {re, rg, rh, rk} and nS3 = 1 {rf}. S2 and S4

quantities have to be determined for each resource sepa-

rately. For ri, niS2 = 1 {rc}, niS4 = 3 {ra, rb, rd}. For rj,

njS2 = 2 {ro, rp} and njS4 = 1 {rq}.

Based on Eqs. (23) and (24), we can obtain the fol-

lowing similarity values for ri as:

Nðsim½ri; rj�Þ ¼niS1

ni¼ 4

9) NFðsim½ri; rj�Þ

Nðdissim½ri; rj�Þ ¼niS4

ni¼ 3

9) PFðsim½ri; rj�Þ ¼ 1� 3

9¼ 6

9

Hence, the upper and lower bounds of similarity of ri to rj

are:

sim½ri; rj�Þ 24

9;6

9

��

Likewise, for rj:

Nðsim½rj; ri�Þ ¼n jS1

n j¼ 4

8) NFðsim½rj; ri�Þ

Nðdissim½rj; ri�Þ ¼n jS4

n j¼ 1

8) PFðsim½rj; ri�Þ ¼ 1� 1

8

¼ 7

8

Thus,

sim½rj; ri� 24

8;7

8

��:

4 Context-aware similarity assessment

4.1 Motivation

The approach presented in Sect. 3 for determining similarity

between two entities takes into account all properties of

entities. In some cases, not all of these connections are

necessary to be considered for similarity assessment—some

of them could be very unique for a single entity, while some

could be meaningless. More importantly, in many real-life

situations the user might be interested in similarity between

two entities only in the aspect of some specific properties.

This means that only those specific types of connections

should be used for similarity determination. We refer to this

situation as context-aware similarity assessment.

The context, i.e., a set of properties to be considered for

similarity assessment can be provided explicitly by the

user, domain expert, or application. It can also be identified

based on any given information from these sources relevant

to the topic of the task in hand. In such a case, related

properties to the context can be extracted from the ontology

of properties associated with a dataset.

4.2 Approach description

For context-aware similarity we take into consideration

only a set of specific properties that are in line with the

context. For this purpose, the number of connections of

resource ri only counts the ones which belong to the

desired context:Fig. 4 Snippet of LD, definitions of resources ri and rj. Types of

connections are indicated by thickness/types of lines

Context-aware similarity assessment 521

123

niðPcntxÞ ¼ jfhri; pq; rmi : rm 2 Rnfrig; pq 2 Pcntxgj ð27Þ

where Pcntx [ P and Pcntx only contains properties related to

the particular context. Now that the context is defined,

properties associated to that context are easily recognized.

Therefore, in assessing the similarity of the resources ri and

rj, the two resources are compared against the same set of

properties (the ones related to the context). For example,

similarity of two movies in a context of directorship makes

their comparison limited to specialized properties such as

director. In another example, similarity of two books in the

context of appearance is evaluated according to a set of

properties such as color, cover type, and number of pages.

Once again, let us take look at all possible scenarios

presented in Table 1.

S1: this scenario contributes to the similarity of two

resources: the same properties existing in both entities are

connecting to the same features (other resources). For the

scenario S1, we applied the same approach as before in

Sect. 3. The only difference is the constraint on the prop-

erties that is used to determine the shared resources:

niS1ðPcntxÞ ¼ jfðhri; pq; rmi; hrj; pq; rmiÞ : rm 2 Ri;j; pq

2 Pi;j; pq 2 Pcntxgj ð28Þ

In this case, we have a set of properties Pcntx that contains

properties of interest—defining context of interests.

S2: this scenario can contribute to similarity as well as

dissimilarity as discussed in Sect. 3. In this scenario, the

same properties connect to different resources; if there is

some relatedness between these resources then S2 contrib-

utes to similarity, and if not—to dissimilarity. Relatedness

between different resources can be determined by connec-

tions these resources have to other resources. In scenario S2,

we need to investigate possible relatedness between different

resources connected by the same property to find out if they

are connected via other resources. In other words, we

investigate if there is a path between these resources. If it

happens that a path exists, a value of 1/ni(Pcntx) is added to

the possibly of similarity measure, otherwise to the necessity

of dissimilarity. The procedure needs to be performed for

each pair of resources found in S2 (one resource from Ri and

the other from Rj). Recall that Ri and Rj represent sets of

features (connected resources) of ri and rj, respectively.

Before presenting the algorithm for determining the number

of remotely connected resources, we need to define a few

quantities (similar to the ones defined in Sect. 3.2

• a set of resources connected to both resources ri and rj

in the context Pcntx:

Ri;jðPcntxÞ ¼ RiðPcntxÞ \ RjðPcntxÞ: ð29Þ

• a set of resources describing only the resource ri (not

describing rj) in context Pcntx. We also call it a set of

resources exclusive for ri:

Rexci ðPcntxÞ ¼ RiðPcntxÞnRi;jðPcntxÞ: ð30Þ

• likewise, a set of resources exclusive for rj:

Rexcj ðPcntxÞ ¼ RjðPcntxÞnRi;jðPcntxÞ: ð31Þ

• also, let niS2CðPcntxÞdenotes the number of resources in

Ri (in scenario S2) that are connected indirectly to

resources in Rj through some other external resources.

S3: this is yet another scenario that can contribute to sim-

ilarity or dissimilarity. The process of determining its contri-

bution to the similarity depends on identifying relatedness of

properties that are part of the context, which connect the two

entities to the same resource. This process requires additional

sources of knowledge (outside of LD). Let niS3CðPcntxÞdenotes the number of properties in Pi (in scenario S3) that are

indirectly related to different properties in Pj.

S4: for this scenario we have different resources con-

nected by different properties. We assume that such dif-

ferences do not contribute to (possibility) similarity.

Table 2 Pseudo-code for discovering remotely connected resources

522 P. D. Hossein Zadeh, M. Z. Reformat

123

Now, the pseudo-code for the algorithm can be pre-

sented as shown in Table 2. Finally, the necessity of sim-

ilarity is expressed as:

NFcntxðsim½ri; rj�Þ ¼

niS1ðPcntxÞniðPcntxÞ

ð32Þ

and, the possibility of similarity as:

PFcntxðsim½ri; rj�Þ ¼

niS1ðPcntxÞþ niS2CðPcntxÞþ niS3CðPcntxÞniðPcntxÞ

ð33ÞHowever, as mentioned before, we do not consider the

scenario S3 that would require investigation of properties’

relatedness (so niS3CðPcntxÞ ¼ 0). One of the possibilities here

would be developing a method based on application of tax-

onomy-based measures deployed on ontology of properties.

Let us take a look at an example to explain the approach

described above. Figure 5 depicts a snippet of LD with two

resources ri and rj, and a number of resources {ra, rb, rc, rd,

re, rf, rg, rh, rk, ro} defining them within the context Pcntx.

Here, different properties are shown with different line styles.

This means that the three resources {rp, rq, rs} are connected

with properties potentially different than the properties for

the resources {ra, rb, rc, rd, re, rf, rg, rh, rk, ro}.

The sets of common and exclusive resources describing

ri and rj are:

Ri;jðPcntxÞ ¼ fre; rf ; rggRexc

i ðPcntxÞ ¼ fra; rb; rc; rdgRexc

j ðPcntxÞ ¼ frh; rk; rog

Based on the algorithm in Table 2, we obtain {ra, rb} as

remotely connected resources between ri and rj:

niS2CðPcntxÞ ¼ 2

Using Eqs. (32) and (33), the context-aware similarity

values for ri in Fig. 5 are obtained as:

NFcntxðsim½ri; rj�Þ ¼

niS1ðPcntxÞniðPcntxÞ

¼ 3

7

PFcntxðsim½ri; rj�Þ ¼

niS1ðPcntxÞ þ niS2CðPcntxÞ þ niS3ðPcntxÞniðPcntxÞ

¼ 3þ 2þ 0

7¼ 5

7

Therefore, the final values are:

sim½ri; rj� 2 hNFcntxðsim½ri; rj�Þ;PF

cntxðsim½ri; rj�Þi

sim½ri; rj� 23

7;5

7

� �

and these values for rj are:

NFcntxðsim½rj; ri�Þ ¼

n jS1ðPcntxÞn jðPcntxÞ

¼ 3

6

PFcntxðsim½rj; ri�Þ ¼

n jS1ðPcntxÞ þ n jS2CðPcntxÞ þ n jS3ðPcntxÞn jðPcntxÞ

¼ 3þ 2þ 0

6¼ 5

6

Thus:

sim½rj; ri� 2 hNFcntxðsim½rj; ri�Þ;PF

cntxðsim½rj; ri�Þi

sim½rj; ri� 23

6;5

6

� �

5 Case study

In order to illustrate functioning of the proposed semantic

similarity measures and to evaluate them, a case study is

presented using real-world datasets. We use DBPedia as the

data source of RDFs of the following resources. Four movies:

Matrix,10 Matrix_Reloaded,11 Hangover,12 and Blade_Run-

ner,13 one soundtrack album: Matrix-music14 (movie Matrix

soundtrack) and one car brand: Toyota15 are selected. The

evaluation is performed on the selected ordered pairs of the

resources. Context-free and context-aware similarity mea-

sures are obtained according to the methods presented in

Sects. 3 and 4, respectively. First, all RDF triples associated

with each resource are extracted from DBPedia. A graphical

visualization of the selected resources are depicted in Fig. 6

using a Java-based graph visualization software, Gephi.16

Fig. 5 A sample of LD representing two resources ri and rj defined

with a number of resources connected via the properties (dotted lines

represent different connection types) within the context Pcntx

10 http://dbpedia.org/page/The_Matrix.11 http://dbpedia.org/page/The_Matrix_Reloaded.12 http://dbpedia.org/page/The_Hangover_(film).13 http://dbpedia.org/page/Blade_Runner.14 http://dbpedia.org/page/The_Matrix:_Music_from_the_Motion_

Picture.15 http://dbpedia.org/page/Toyota.16 http://gephi.org/.

Context-aware similarity assessment 523

123

The computed similarity measures are shown as intervals

with necessity of similarity and possibility of similarity as

lower-bound and upper-bound values, respectively. In par-

ticular, necessity of similarity represents the known and cer-

tain similarity between the pair of resources {ri, rj}. On the

other hand, possibility of similarity denotes the maximum

possible similarity between the two resources. All possible

situations that might satisfy the similarity of ri and rj are

counted towards the possibility of similarity.

5.1 Context-free similarity

The evaluation is performed on the ordered pairs of

resources including {Matrix, Matrix-Reloaded}, {Matrix,

Blade-Runner}, {Matrix, Hangover}, {Matrix, Matrix-

music} and {Matrix, Toyota}. The results are shown in

Table 3.

The similarity interval for the pair {Matrix, Matrix-

Reloaded} is calculated based on Eq. (23) by comparing

their RDF triples according to scenarios S1 and S4.

Therefore, the lower-bound is given by:

NFðsim½Matrix; Matrix� reloaded�Þ ¼ niS1

ni¼ 61

176¼ 0:35

and upper-bound based on Eq. (24) is:

NFðdissim½Matrix; Matrix� reloaded�Þ ¼ niS4

ni¼ 9

176¼ 0:05

Here, the niS1 is a number of shared resources describing

Matrix and Matrix-Reloaded and connected via the same

property, while niS4 is a number of resources unique for

Matrix and connected via unique properties to Matrix. The

value of ni represents number of all triples that include

Matrix as their subject. Finally, interval-valued similarity is

obtained as:

sim½Matrix; Matrix� reloaded� 2 h0:35; 0:95i:

There are two important observations that can be made

at this point. Firstly, the necessity of similarity gives an

unquestionable similarity value, 0.35, of the movie Matrix

to Matrix-Reloaded. This value is obtained based on the

facts and certain information, and thus it is referred to as a

lower-bound of similarity. This can be described as the

pessimistic view of similarity. The possibility of similarity,

(optimistic view of similarity) on the other hand, is

determined based on the necessity of dissimilarity

obtained based on definite facts that are different for both

resources. Indeed, the value of necessity refers to the

degree of certainty and the value of possibility to the

degree of plausibility when it comes to the similarity

between Matrix and Matrix-Reloaded (with the reference

to Matrix). Secondly, the interval, in this case 0.60, is a

range of possible values of similarity between Matrix and

Matrix-Reloaded. Its existence is related to the inability to

determine for certain which RDF triples (features) of

Matrix and Matrix-Reloaded are the same (belong to S1) or

definitely different (belong to S4). This interval is an

indication of uncertainty regarding the similarity between

Matrix and Matrix-Reloaded. Further investigation of the

scenarios S2 and S3 would lead to some modifications of

the interval, and in particular its upper-bound.

It can be observed that the similarity intervals for the

two movie pairs {Matrix, Hangover} and {Matrix, Blade-

Fig. 6 Graphical visualization of the resources described in DBPe-

dia dataset

Table 3 Context-free

similarity valuesOrdered pairs {ri, rj} Necessity of similarity

NF (sim[ri, rj])

Possibility of similarity

PF (sim[ri, rj])

Similarity interval

(sim[ri, rj])

{Matrix, Matrix-reloaded} 0.35 0.95 h0.35, 0.95i{Matrix, Blade-runner} 0.09 0.88 h0.09, 0.88i{Matrix, Hangover} 0.09 0.94 h0.09, 0.94i{Matrix, Matrix-music} 0.01 0.73 h0.01, 0.73i{Matrix, Toyota} 0.00 0.71 h0.00, 0.71i

524 P. D. Hossein Zadeh, M. Z. Reformat

123

Runner} have small necessity values and large possibility

values when compared with the pair {Matrix, Matrix-

Reloaded}. Such large interval is due to a relatively large

value of the necessity of similarity for the {Matrix, Matrix-

Reloaded} (0.35) when compared with the necessity of

similarity for the other two pairs (0.09). The reason is that

the science fiction movie Matrix-Reloaded is a sequel to the

movie Matrix. Therefore, both movies have related stories,

actors, studios, directors, etc. On the other hand, the movie

Hangover is quite different in many areas such as producers,

directors, casts, studios, etc. A similar situation exists for the

pair {Matrix, Blade-Runner}. The movies Blade-Runner and

Matrix are different in several areas regardless of both

belonging to a science fiction movies category. This is due to

the measurement of context-free similarity, which takes into

account all features of a resource without considering any

particular area of focus. The pair {Matrix, Blade-runner} has

only one common triple, which ‘‘disappears’’ when matched

against all 176 triples of Matrix.

In the ordered pair {Matrix, Matrix-music}, the similarity of

the movie Matrix is measured to the soundtrack album Matrix-

music of the movie Matrix. The number of matching RDFs

between these two resources is negligible, thus leads to a very

small value of similarity. Similarly, the pair {Matrix, Toyota}

has necessity of similarity equal to zero, as there are zero

matching RDFs found between the entities. Instead, the values

of possibility of similarity for the pairs {Matrix, Matrix-music}

and {Matrix, Toyota} are well below 1.0. This indicates the

existence of a higher number of RFD triples that are uniquely

different for both pairs (large values of S4). This also means

that the range of similarity values is smaller compared to the

previous case studies.

Overall, it can be inferred that context-free similarity

provides an unbiased measure of similarity between two

resources based on all the available information about the

resources without taking into account any consideration.

Among RDFs of all selected resources (movies, sound-

track, and a car brand) there are many triples which are not

very informative, and could be considered as a noise.

5.2 Context-aware similarity

For context-aware similarity, the desired context imposes a

constraint on the scenarios. The same set of pairs intro-

duced in Sect. 5.1 is considered for the evaluation. The

experimental results are shown in Table 4. The added

column ‘‘Context’’ indicates the desired context in the

similarity assessment.

Similarity for the ordered pair {Matrix, Blade-Runner}

is computed within the context of the property subject. It is

worth noting that subject is the property that links the

resources Matrix and Blade-Runner to other resources

representing categories that Matrix and Blade-Runner

belong to. Therefore, all other properties of the two

resources {Matrix, Blade-Runner} are discarded while the

only property that is taken into account is subject. For

example, subject links the movie Matrix to the following

resources: 1990s_action_films, American_action_films,

Cyberpunk_films, Silver_Pictures_films, Films_set_in_the_

22nd_ century, etc. See Fig. 7. For more information on

definition of context in this paper, see Sect. 4.

Thus, the context-based similarity for {Matrix, Blade-

Runner} is calculated based on Eqs. (32) and (33) as

follows:

NFcntxðsim½Matrix; Blade� runner�Þ

¼ niS1ðPcntx ¼ subjectÞniðPcntxÞ

¼ 10

27

Table 4 Context-aware similarity values

Ordered pairs {ri, rj} Context (property) Necessity of similarity

NF(sim[ri, rj])

Possibility of similarity

PF (sim[ri, rj])

Similarity interval

sim[ri, rj]

{Matrix, Matrix-reloaded} Starring 0.80 0.80 h0.80, 0.80i{Matrix, Blade-runner} Subject 0.37 0.48 h0.37, 0.48i{Matrix, Hangover} Distributor 1.00 1.00 h1.00, 1.00i{Matrix, Matrix-music} Type 0.08 0.13 h0.08, 0.13i{Matrix, Toyota} Label 0.00 0.00 h0.00,0.00i

PFcntxðsim½Matrix;Blade� runner�Þ ¼ niS1ðPcntx ¼ subjectÞ þ niS2CðPcntx ¼ subjectÞ þ niS3ðPcntx ¼ subjectÞ

niðPcntxÞ¼ 10þ 3

27

¼ 13

27

Context-aware similarity assessment 525

123

sim½ri; rj� 2 h0:37; 0:48i:

The result shows that although the movies {Matrix, Blade-

Runner} have a low context-free similarity, their similarity in

the context of subject is high with a small uncertainty interval.

This is because their features in the context of subject are

similar as both are American science fiction movies.

Next, the ordered pair {Matrix, Matrix-Reloaded} is

evaluated within the context starring, which defines the

casting crew of a movie. Based on the result, the calculated

similarity interval is h0.80, 0.85i. This means that there is a

very high certainty about their similarity value in the

context of starring.

The pair {Matrix, Hangover} is compared in the context of

distributor. This property contains name of the distributors of

each movie. Due to the same distributor of the two movies,

Warner-Brothers, this results in the similarity interval of h1.0,

1.0i. It is worth noting that the context-free similarity interval

for the pair {Matrix, Hangover} was obtained as h0.09, 0.94i.This again explains the importance of the context and its effect

on the similarity assessment.

The considered context in the similarity evaluation of

the pair {Matrix, Matrix-music} is type. Type is the prop-

erty that connects the resources to all of their subclasses

through RDF triples. For example, information about the

type in the resource Matrix includes: CreativeWork, Movie,

KungFuFilms, Film, Work, etc. Similarly, for Matrix-music

the property type connects Matrix-music to: Album, Mu-

sicalWork, MusicalComposition, etc. This information is

evaluated in both resources Matrix and Matrix-music in

order to obtain the context-aware measure.

Lastly, the property label is considered as the context of

similarity for the pair {Matrix, Toyota}. The label property

provides all the vocabularies used to describe this partic-

ular resource. Zero values in the similarity interval of

{Matrix, Toyota} show no similar or related information

between the two resources in this context.

The above examples show the influence of the context in

the similarity measure. As it can be seen, the uncertainty

intervals in context-aware similarities are narrower than in

context-free measures, which means higher confidence in

context-aware similarity measures. This type of similarity

is more often used in real-life scenarios, especially in situ-

ations involving human judgment.

6 Validation and comparative experiments

6.1 Overview

The process of determining similarity between two entities

is not uniform while numerous methods and techniques

proposed different definitions and interpretations of the

similarity concept (see Sect. 7). Two main factors influ-

encing similarity estimation techniques are formats of

information representation and levels of abstraction. The

method proposed here treats LD as a vast network of

interconnected entities that define semantics of its elements

(entities) via their ‘‘involvement’’ in connections, and the

proposed technique uses these connections for similarity

assessment.

Thus, comparison of the proposed method with other

similarity measures imposes some challenges and requires

Fig. 7 Values of the property

Subject for the movie Matrix in

DBPedia

526 P. D. Hossein Zadeh, M. Z. Reformat

123

some modifications and adjustments. Our proposed method

does not take into account any structure existing between

resources under evaluation. One of the advantages of the

approach is its ability to determine similarity between any

two resources existing in the RDF-based repository—

would that be the web, a local database, or a distributed

network of information. On top of that, obtained results are

not single values but intervals with minimum (pessimistic)

and maximum (optimistic) limits. These characteristics of

the proposed method make the comparison process quite

challenging. Thus, we need to adjust our method or the

ones in the literature to make such comparison meaningful.

The comparative experiments with other measures have

been grouped into three parts: first, we target the compar-

ison with measures that are applicable to information rep-

resented by RDF triples. Second, we focus on measures

that are particularly designed for hierarchically organized

data. Third, we illustrate the ability of our proposed method

to deal with RDF-based data created by different users or

tools, and stored at different data sources.

6.2 RDF-triples as data representation

For the first set of experiments, numbers of feature-based

methods are selected. These approaches include: Tversky

(1977), Dice (Frakes and Baeza-Yates 1992) as well as the

concept-based similarity method by Boros (Boros et al.

1996). Boros’ approach is adapted here such that it eval-

uates the substitutions, deletions and additions of RDF

triples not words or concepts as originally described in

(Boros et al. 1996). In addition to the above measures,

latent semantic analysis (LSA) (Landauer and Dumais

1997) is chosen from natural language processing tech-

niques since it represents an interesting approach to

determine similarity levels between words or blocks of

text. Latent semantic analysis extracts coherence of words

by statistical computations of a large corpus of text, for

more information about LSA, see Deerwester et al. 1990;

Landauer et al. 1998]. To compare our method with LSA,

the datasets of the entities are extracted from Wikipedia

and are transformed into LSA semantic spaces, high

dimensional vectors representing each entity, using Text to

Matrix Generator (TMG).17 TMG is a MATLAB toolbox

that can be used for various data mining and information

retrieval tasks. It must be mentioned that the potential

range of LSA similarity measure is [-1, ?1].

For comparison purposes, we have selected 12 pairs of

real-world entities extracted from DBpedia as shown in

Table 5. The pairs from #1 to #3, #4 to #8, and #9 to #12

are selected as very similar, relatively similar, and dis-

similar entities, respectively. For the comparison to be fair

and meaningful, we use our context-free approach since

none of the other approaches consider context in their

similarity formulation.

Usefulness of our method can be explained when simi-

larity values of different measures are compared with our

method. The obtained intervals in our approach include the

values computed in Tversky’s and Dice’s methods. This

can be acknowledged knowing the underlying principle of

these two methods: they are simply ratios of common to

total features. The fact that in most cases they are slightly

higher than the necessity values (lower bounds) comes

from the fact they do not evaluate the similarity of prop-

erties. In other words, only features are taken into account

in these methods, while the similarities of properties are

ignored. This could lead to potentially unreliable values of

similarity in LD. In our method, the benefit of the similarity

interval is that the lower bound represents certain value of

similarity—we are sure about it, while the upper bound

indicates the maximum possible value of the similarity—

and we are sure about it too.

An important observation can be made for the values

obtained from Boros’ approach (Boros et al. 1996). As it

can be seen, Boros’ values are the same as lower-bounds of

the intervals in our method. In other words, the Boros

approach, as adopted here, is similar to our scenario S1.

Based on the results from LSA approach, we can state

that LSA derives reasonably high similarity values for very

similar pairs, i.e., pairs from #1 to #3. However, results

confirm that LSA cannot correctly distinguish the dissim-

ilarity between the taxonomically different pairs #4, #10,

and #12, while our method performed well. In general,

LSA similarity measures are quite high when applied to

less similar and dissimilar pairs comparing to our intu-

itions. This observation conforms to the extensive studies

presented in (Simmons and Estes 2006). While LSA is a

good model for some applications, its requirement for

generating semantic spaces, SVD computations, and

dimensional reduction methods are computationally

expensive for large datasets. Thus, methods based on LSA

have not been used with enormous corpora such as LD.

The triple-based nature of LD evidently induces the

similarity assessment approaches to be capable of explor-

ing interconnected RDF triples as well as the semantics

behind resources and properties in metadata. The similarity

intervals found by our analysis of the semantic space,

formed by RDF triples, show a somewhat different picture

of similarity assessment in LD environment.

6.3 Taxonomy as data representation

The second set of experiments is an attempt to compare our

proposed method with taxonomy-based measures. Two

taxonomy-based measures are selected: Wu and Palmer17 http://scgroup20.ceid.upatras.gr:8000/tmg/.

Context-aware similarity assessment 527

123

(1994), and Leacock and Chodorow (1998). Both of these

measures determine similarity based on the location of

entities in the taxonomy. For this reason, we use DBpedia

ontology,18 which consists of more than 320 classes, where

these classes are organized in the hierarchy with maximum

depth of seven. The selected pairs (Table 5) belong to such

classes as: film (depth three), video game (depth four), and

automobile (depth three).

The obtained results in Table 6 suggest the inadequacy

of these measures for the pairs #1 to #3, #5 to #9, and #11.

All obtained values are the same—0.67 for Wu and P

(maximum is 1.0), and—1.1 for LCH (maximum is plus

infinity). Hence, the distinction between the pairs (#1 to #3,

#5 to #9, and #11) is not reflected in the selected taxon-

omy-based methods. This can be explained as these mea-

sures limit their attention to is-a links and are sensitive to

taxonomic and not semantic relations between the entities.

In other words their values fully depend on the structure of

ontology. These methods are not able to distinguish

between different individuals of one class. The obtained

values for the pairs #4, #10 and #12 are different since

entities in each of these pairs belong to different classes.

These values seem to be more reasonable since the amount

of information embedded in the taxonomy for these pairs is

higher.

6.4 Multiple-source RDF-triples as data representation

In order to evaluate the applicability of the proposed

method with RDF triples created by different individuals or

tools, we conducted an experiment in which we determine

the similarity of entities that belong to two different in-

terlinking datasets, besides DBpedia, in LD cloud. One of

them is the Linked Movie DataBase19 (LinkedMDB),

which has more than six million triples with over 162,000

triples that are interlinked with other datasets such as

DBpedia and YAGO.20 YAGO is another dataset contain-

ing more than two million entities about people, objects,

cities, etc. Five pairs have been selected among three

datasets presented above, shown in Table 7. Movies in

every pair belong to different datasets.

Pair #1, award-winnings psychological thriller movies

directed by Alfred Hitchcock, is selected as very similar

movies. In pair #2, the movie Woman-in-Green is put

together with a movie Sherlock Holmes—both movies are

about Holmes and his investigations. Each movie is defined

by RDFs that belong to different datasets. Movies in pair

#3 are thrillers and directed by the same director. Pair #5

contains movies with the same actor. Pair #4 is chosen as

dissimilar movies. Due to compatibility of our method with

Table 5 Comparison of our approach to other related methods

# Ordered pairs Similarity models

Corpus-based Feature-based Concept-based LD-based

LSA (Landauer

and Dumais 1997)

Tversky

(1977)

Dice (Frakes and

Baeza-Yates 1992)

Boros et al.

(1996)

Our method

1 {Matrix, Matrix-reloaded} 0.96 0.40 0.38 0.35 h0.35, 0.95i2 {Good-fellas, God-fattier} 0.92 0.29 0.37 0.20 h0.20, 0.97i3 (Jaws, Jurassic-Park} 0.87 0.54 0.68 0.20 h0.20, 0.90i4 {Matrix, Matrix-music} 0.90 0.25 0.20 0.01 h0.01, 0.73i5 {Star-wars, Star-trek} 0.79 0.42 0.36 0.30 h0.30, 0.56i6 {Jurassic-Park, Godzilla} 0.66 0.02 0.03 0.02 h0.02, 0.24i7 {Spider-man, I-robot} 0.75 0.25 0.30 0.10 h0.10, 0.55i8 {Matrix, Blade-runner} 0.85 0.55 0.4 0.09 h0.09, 0.88i9 {Matrix, Hangover} 0.87 0.10 0.15 0.09 h0.09, 0.94i10 {Matrix, Toyota} 0.58 0.20 0.32 0.00 h0.00, 0.71i11 {Pulp-fiction, Wall-E} 0.89 0.09 0.09 0.08 h0.08, 0.15i12 {Angry-birds, Titanic} 0.63 0.04 0.05 0.02 h0.02, 0.10i

Table 6 Similarity values of taxonomy-based methods (pair# is

taken from Table 5)

Pair# Wu and Palmer

(1994)

Leacock and

Chodorow (1998)

#1, #2, #3, #5, #6,

#7, #8, #9, #11

0.67 1.1

#4 0.34 0.48

#10 0.30 0.42

#12 0.34 0.48

18 http://wiki.dbpedia.org/Ontology.

19 http://www.linkedmdb.org/.20 http://www.mpi-inf.mpg.de/yago-naga/yago/.

528 P. D. Hossein Zadeh, M. Z. Reformat

123

the triple-based nature of any dataset in LD, similarities

of the pairs yield sensible results. Finally, we must

acknowledge that future research could improve the results

by further decreasing the ambiguity in similarity intervals

as previously explained in Sects. 3 and 4.

7 Related work

There are several methods on assessing the semantic sim-

ilarity among entities to improve the traditional similarity

measurement methods. It is pivotal that the calculated

similarity by these methods corresponds closely to the

human judgment of similarity. Many approaches in the

literature (Taylor 2010; Pedersen et al. 2007) are based on

ontology, taxonomy of terms that describe a certain area of

knowledge. The most popular definition of ontology says

that ‘‘an ontology is a specification of a conceptualization’’

(Gruber 1993). Ontology compared to LD is an engineered

structured model of information. Surprisingly, there are

few proposals in the literature related to similarity assess-

ment in LD. In this section, number of approaches related

to both areas is investigated. The majority of today’s

methods are classified based on two underlying approa-

ches: context-free and context-aware.

7.1 Context-free models

Similarity between entities in (Rada et al. 1989) is calcu-

lated based on the number of links separating the two

entities in a semantic net, which is known as the edge-

counting method. Indeed, the more number of links sepa-

rating the entities the less similar they are. In general, links

between entities convey a meaning denoting the semantic

and context that affect the similarity between entities. In

(Rada et al. 1989), only is-a link types are considered in the

calculation of similarity, which both semantics and context

are not taken into the consideration. To date, due to the

simplicity and ease of implementation of this method

several approaches (Wu and Palmer 1994; Leacock and

Chodorow 1998; Li et al. 2003) leverage this technique by

combining it into a more efficient and semantic-oriented

approaches. However, the focus of edge-counting

approaches is mainly on the number of links in a hierarchy

of entities without considering other parameters that affect

semantics and context.

Resnik (1995), Lin (1998) and Giuseppe (2009) propose

methods that are based on entities’ information that can be

measured in different ways. Resnik (1995) argues that the

similarity of two entities depends on the shared amount of

information between them. In (Resnik 1995), the common

information is quantified using the least common subsumer

of entities in the hierarchy of concepts. Taking a similar

approach, (Lin 1998) presents an information theoretic

measure related to the amount of information required to

describe each entity, as well as information related to the

commonality of the two entities. According to Lin (1998),

similarity between two entities is the weighted average of

the computed similarity from different perspectives. The

drawback of information theoretic approaches is their need

for external information, such as a probabilistic model of

the domain in order to determine the relatedness of entities.

In particular, the problem with application of information

theoretic approaches in LD is that the information in LD is

not represented in a form of ontology; thus, evaluating the

required criteria in these approaches such as locating a least

common subsumer of two entities and frequency count of a

concept in a hierarchy is not applicable.

Number of approaches (Leacock and Chodorow 1998;

Oliva et al. 2011) leverage lexical information found in

external knowledge repositories such as WordNet—an

English lexical database of concepts and relations—to

quantify the semantic similarity of concepts. Among these,

(Oliva et al. 2011) captures and combines lexical and

syntactic information based on WordNet and a deep pars-

ing process respectively to assess similarity between short

sentences. In (Oliva et al. 2011), lexical information is

obtained using the hierarchical structure of WordNet and

different glossaries related to each entity. In syntactic

measurement, (Oliva et al. 2011) emphasize on the

importance of different syntactic roles of words in human’s

calculation of similarity. For this purpose, they assign

different weights to different word syntactic roles such as

verb, subject, object and adverb. In (Navigli and Velardi

2005), a semantic graph based on the extracted lexical

information of entities from a variety of lexical knowledge

Table 7 Similarity assessment using our approach of entities belonging to different datasets

# Ordered pairs LinkedMDB DBpedia YAGO Similarity measure

1 {Vertigo, Psycho} Vertigo Psycho – h0.42, 0.66i2 {The Woman in Green, Sherlock Holmes} – Sherlock Holmes The Woman -in-Green h0.74, 0.96i3 {Panic Room, Zodiac) Panic Room – Zodiac h0.51, 0.89i4 {King Arthur, Ghost] Ghost – King Arthur h0.06, 0.45i5 {Men in Black, Bad Boys} Men in Black Bad Boys – h0.25, 0.78i

Context-aware similarity assessment 529

123

repositories is created. Lexical information in (Navigli and

Velardi 2005) includes different types of semantic relation

such as synsets, glosses, hyperonymy–hyponymy, meron-

ymy–holonymy that can be found in a lexical resource.

In a similar approach, (Giunchiglia et al. 2007) performs

a graph similarity matching by analyzing the semantic

relations and the meanings of entities according to the

lexical knowledge and structure of schemas, respectively.

In (Giunchiglia et al. 2007), semantic relations include

equivalence, more general, less general and disjointness.

The matching problem is then converted into a proposi-

tional validity problem using a predefined set of rules to

verify the semantic relations between entities.

The approach in (Hliaoutakis et al. 2006) is a combi-

nation method based on lexical information using WordNet

and MeSH as well as the information content found in

document corpus. In (Hliaoutakis et al. 2006), it is shown

that lexical difference of two entities does not necessarily

mean that the entities are semantically different. For a

complete comparative evaluation of state-of-the-art simi-

larity methods, see (Oliva et al. 2011; Hliaoutakis et al.

2006). Authors of (Hliaoutakis et al. 2006) discuss that

lexical term matching techniques such as vector space and

probabilistic models are not semantic-oriented; thus, not

efficient to be used over semantic web.

7.2 Context-aware models

There exist similarity assessment methods (Volz et al.

2009; Han et al. 2006; D. Hossein Zadeh and Reformat

2012b, c) that focus on entities’ features and context. In

LD, every resource is defined with a number of properties,

thus allowing the entities to differ not only in label, but also

in machine-understandable information.

Han et al. (2006) presents an approach to determine the

similarity between entities from different ontologies. Their

approach takes into consideration the number of features

associated with each entity as well as positions of the

entities in the structure. In their similarity model, weights

are assigned to the entities regarding their position in the

ontology, e.g., entities at a lower level are considered more

meaningful than the entities at higher levels; thus accred-

ited with higher weights. In (Han et al. 2006), contextual

similarity between entities is defined according to the

combined similarity between two entities and their adjacent

entities across neighborhoods.

As stated before, number of approaches in the literature

addressing the similarity measurement in LD is quite low.

In the line of approaches with LD, Sheng et al. (2010)

proposes a semantic similarity method using a combination

of lexical, structural, corpus-statistics information. A

variety of semantic relations from WordNet is used as a

lexical knowledge and is combined with taxonomy

information such as depth of an entity and the distance

between two entities in the schema. In (Sheng et al. 2010),

corpus-statistic is a measure of co-occurrence frequency of

two entities and their synonyms obtained from WordNet.

Semantic is explored in depth by extending the similarity

measure to features of each entity, where the similarity is

calculated between the corresponding features of two

entities. Furthermore, context is taken into account by

assigning different weights to every feature according to

their importance in the domain, which are assigned man-

ually based on the empirical experience of the domain

expert. The approach in Sheng et al. (2010) is only appli-

cable to entities described in one dataset and does not scale

up to the web of interlinking complex data sources.

Authors in Albertoni et al. (2008) discuss the problems

associated with context similarity in the web of data. The

investigated problems are mainly due to heterogeneity

nature of LD, such as URI inconsistency and overlapping

metadata vocabularies. In Albertoni et al. (2008), URIs of

resources are dereferences to exploit the properties of each

entity through the extracted RDF elements. The collected

properties are then used to measure the context-enabled

ontology-based similarity method in Albertoni and De

Martino (2010). According to Albertoni et al. (2008),

limiting the process of URI dereferencing to a specific

context reduces computation complexity and improves the

efficiency of the similarity result.

Majority of the methods above are not proper to be

applied as a similarity measurement technique in LD given

that they are designed to be used in ontology. Our work is

distinguished from the ontology-based approaches in

dealing with LD in which data sources do not follow any

consistent schema or taxonomy. In fact, information con-

tent and taxonomy of different datasets are not directly

comparable; therefore, information theoretic and taxon-

omy-based methods do not perform very well when two

entities belong to different data sources. Furthermore,

semantic and context in edge-counting methods is poorly

evaluated. These imply that these methods are inefficient to

be used in LD that is composed of a large composition of

various datasets connected together. On the other hand,

feature-based methods that measure the similarity based on

common features of entities, are best suited when entities

belong to different datasets.

8 Conclusion and future work

Linked data is a new method of publishing information on

the web. Any piece of information is represented as a triple.

All triples can be interconnected—each component of one

triple can be a component of an unlimited number of other

triples. As the result, a vast network of information is

530 P. D. Hossein Zadeh, M. Z. Reformat

123

formed. This allows us to treat LD as a space containing

semantic definitions of its elements. The introduction of

LD on one side, and required semantic-based analysis of

relatedness on the other trigger our interest in the research

activity addressing those issues.

In this paper, we have proposed a novel semantic sim-

ilarity measurement method that is fully dependent on

connections in LD. The method depends on the relatedness

of resources’ RDF-triples. It accommodates concepts of

necessity and possibility, while mechanisms of their

manipulation are taken from the possibility theory. The

obtained similarity measure is advantageous since it pro-

vides realistic lower and upper similarity bounds between

any pair of entities in LD. An important aspect of the

proposed method is its ability to determine similarity in

context-aware situations. Utilizing types of connections

creates a neat way of looking at different facets of simi-

larity between entities.

The asymmetric nature of the proposed similarity measure

could be seen as a thought-provoking feature of the method. It

is in line with the work of Tversky (1977) and Nosofsky

(1991), who claimed that the direction of asymmetries could

be associated with prominence of compared entities. Tversky

pointed out that prominence is related to salience, familiarity,

and goodness in form and informational content. It can be

stated that less prominent entities are more similar to more

prominent entities than the other way around. In the light of

that, the proposed method could be applied to determine the

level of prominence of compared objects. It is very likely that

adding relative prominence to the similarity assessment would

increase its descriptive power. This idea alone will be inves-

tigated more as an extension of the work presented here.

Another important aspect of the proposed method is the

fact that it provides not a single similarity value but a

similarity interval. The interval can be interpreted as the

lower-bound, which is the similarity with a high confidence

in it; and the upper-bound indicating the maximum possi-

ble similarity when all ambiguities are ‘‘resolved posi-

tively’’. At the same time the width of this interval can be

interpreted as the level of uncertainty in the similarity

assessment—larger width indicates higher uncertainty,

while a narrower interval means higher confidence in the

provided assessment. This interpretation is fully compati-

ble with our interpretations of the scenario S1 as the lower

bound, and S4 as the upper bound. Further investigations of

scenarios S2 and S3 lead towards decreasing upper-bounds

and this means increased confidence in the obtained simi-

larity assessment. However, activities related to ‘‘solving’’

scenarios S2 and S3 are computationally expensive, and

they can be performed once a crude assessment is done.

Overall, the proposed method constitutes an effective

way of determining similarity between entities in LD.

Using principles of possibility theory, the method allows

for assessing the similarity interval by identifying identical

features (scenario S1) and entirely different features (sce-

nario S4). When compared with other measures it pos-

sesses number of advantages: direct applicability in the LD

environment, simplicity and easiness of computations,

ability to provide pessimistic (lower-bound) and optimistic

(upper-bound) estimations of similarity, capability to deal

with specific facets of similarity in form of different con-

texts, and potentiality being further expanded to take full

advantage of RDF triples as data representation format.

References

Albertoni R, De Martino M (2010) Semantic similarity and selection

of resources published according to linked data best practice. In:

OTM 2010 workshops on the move to meaningful internet

systems, pp 378–383

Albertoni R, Camossi E, De Martino M, Giannini F, Monti M (2008)

Context enabled semantic granularity. In: Knowledge-based

intelligent information and engineering systems, pp 682–688

Berners-Lee T, Hendler J (2001) Scientific publishing on the semantic

web. Nature 410:1023–1024

Bizer C, Heath T, Berners-Lee T (2009) Linked data-the story so far.

Int J Semant Web Inf Syst 4:1–22

Boros M, Eckert W, Gallwitz F, Gorz G, Hanrieder G, Niemann H

(1996) Towards understanding spontaneous speech: word accuracy

vs. concept accuracy in spoken language. Proceedings of fourth

international conference on ICSLP 96, vol 2, pp 1009–1012

D. Hossein Zadeh P, Reformat MZ (2012a) Assimilation of informa-

tion in linked data based knowledge base. In: 14th international

conference on information processing and management of uncer-

tainty in knowledge-based systems, Catania, 9–13 July 2012

D. Hossein Zadeh P, Reformat MZ (2012b) Feature-based similarity

assessment in ontology using fuzzy set theory. In: IEEE 2010

international conference on fuzzy systems (FUZZ), pp 1462–1468

D. Hossein Zadeh P, Reformat MZ (2012c) Ontology as knowledge

base for determining asymmetric and context-dependent simi-

larity. J Inf Sci (submitted)

Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R

(1990) Indexing by latent semantic analysis. J Am Soc Inf Sci

41:391–407

DuBois D, Prade HM (1980) Fuzzy sets and systems: theory and

applications. Academic Press, New York

Dubois D, Prade H (2003) Possibility theory and its applications: a

retrospective and prospective view. In: FUZZ’03 the 12th IEEE

international conference on fuzzy systems, p 5

Dubois D, Prade H, Harding E (1988) Possibility theory: an approach

to computerized processing of uncertainty. Plenum press, New

York

Frakes WB, Baeza-Yates R (1992) Information retrieval: data

structures and algorithms, vol 7632. PTR Prentice-Hall Inc,

Eaglewood Cliffs

Giunchiglia F, Yatskevich M, Shvaiko P (2007) Semantic matching:

algorithms and implementation. J Data Semant IX, University of

Trento, Trento, pp 1–38

Giuseppe P (2009) A semantic similarity metric combining features

and intrinsic information content. Data Knowl Eng 68:1289–

1308

Gruber TR (1993) A translation approach to portable ontology

specifications. Knowl Acquis 5:199–220

Context-aware similarity assessment 531

123

Han L, Sun L, Chen G, Xie L (2006) ADSS: an approach to

determining semantic similarity. Adv Eng Softw 37:129–132

Hliaoutakis A, Varelas G, Voutsakis E, Petrakis EGM, Milios E

(2006) Information retrieval by semantic similarity. Int J Semant

Web Inf Syst 2:55–73

Johannesson M (1997) Modelling asymmetric similarity with prom-

inence. Lund University Cognitive Studies, Lund

Klir GJ, Folger TA (1988) Fuzzy sets, uncertainty, and information.

Prentice Hall, Englewood Cliffs

Landauer TK, Dumais ST (1997) A solution to Plato’s problem: the

latent semantic analysis theory of acquisition, induction, and

representation of knowledge. Psychol Rev 104:211

Landauer TK, Foltz PW, Laham D (1998) An introduction to latent

semantic analysis. Discourse Process 25:259–284

Lassila O, Swick R (1999) Resource description framework (RDF)

model and syntax specification. World wide web consortium

technical reports and publications. http://www.w3.org/TR/1999/

REC-rdf-syntax-19990222

Leacock C, Chodorow M (1998) Combining local context with

WordNet similarity for word sense identification. In: WordNet: a

lexical reference system and its application, MIT Press, Cam-

bridge, pp 265–283

Lee TB, Hendler J, Lassila O (2001) The semantic web. Sci Am

284:34–43

Leibniz GW (1975) Philosophical papers and letters. Kluwer

Academic Publishers, Dordrecht

Li Y, Bandar ZA, McLean D (2003) An approach for measuring

semantic similarity between words using multiple information

sources. Knowl Data Eng IEEE Trans 15:871–882

Lin D (1998) An information-theoretic definition of similarity. In:

Proceedings of the fifteenth international conference on machine

learning, Madison, pp 296–304

Navigli R, Velardi P (2005) Structural semantic interconnections: a

knowledge-based approach to word sense disambiguation. IEEE

Transactions on Pattern Analysis and Machine Intelligence, vol

27, pp 1075–1086

Nosofsky RM (1991) Stimulus bias, asymmetric similarity, and

classification. Cognit Psychol 23:94–140

Oliva J, Serrano JI, Del Castillo MD, Iglesias A (2011) SyMSS: a

syntax-based measure for short-text semantic similarity. Data

Knowl Eng 70(4):390–405

Pedersen T, Pakhomov SVS, Patwardhan S, Chute CG (2007)

Measures of semantic similarity and relatedness in the biomed-

ical domain. J Biomed Inform 40:288–299

Rada R, Mili H, Bicknell E (1989) Development and application of a

metricon semantic nets. IEEE Trans Syst Man Cybern 19:17–30

Resnik P (1995) Using information content to evaluate semantic

similarity in a taxonomy. IJCAI 448–453

Shadbolt N, Hall W, Berners-Lee T (2006) The semantic web

revisited. IEEE Intell Syst 21:96–101

Sheng H, Chen H, Yu T, Feng Y (2010) Linked data based semantic

similarity and data mining. In: IEEE 2010 international confer-

ence on Information reuse and integration (IRI), pp 104–108

Simmons S, Estes Z (2006) Using latent semantic analysis to estimate

similarity. In: Proceedings of the Cognitive Science Society,

pp 2169–2173

Taylor JM (2010) Ontology-based view of natural language meaning:

the case of humor detection. J Ambient Intell Humaniz Comput

1:221–234

Tversky A (1977) Features of similarity. Psychol Rev 84:327–352

Volz J, Bizer C, Gaedke M, Kobilarov G (2009) Silk—a link

discovery framework for the web of data. In: Proceedings of the

second linked data on the Web workshop, Madrid

Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In:

Proceedings of the 32nd annual meeting on association for

computational linguistics, Las Cruces, pp 133–138

Zadeh LA (1999) Fuzzy sets as a basis for a theory of possibility.

Fuzzy Sets Syst 100:9–34

532 P. D. Hossein Zadeh, M. Z. Reformat

123