ORIGINAL RESEARCH
Context-aware similarity assessment within semantic space formed in linked data
Parisa D. Hossein Zadeh • Marek Z. Reformat
Received: 19 December 2011 / Accepted: 7 July 2012 / Published online: 9 August 2012
© Springer-Verlag 2012
Abstract The Web is a constantly growing repository of
information. The amount of data that becomes available
exceeds our ability to search and examine it in a
reasonable time and with practical effort. The data is
stored in forms of documents, texts and web pages, which
are not suitable for comprehensive analysis and search. In
order to make the data stored on the Internet more acces-
sible, a new model of data representation has been intro-
duced—linked data. Linked data provides an open platform
for representing and storing structured data as well as
metadata. In this paper, we propose a novel approach for
calculating the degree of similarity between two entities in
the web of linked data. The idea is based on the fact that
entities are submerged in the linked data and their
semantics is defined via their connections to other entities.
Therefore, similarity between two entities is determined by
comparing connections of two entities to other entities.
Firstly, the approach is introduced to determine semantic
similarity in a context-free manner. This method does not
select specific types of connections but takes into consid-
eration all of them. Secondly, a context-aware approach is
presented as a modification of the original method. In this
case, a context is defined by a set of connection types—
only connections of specific types are considered for sim-
ilarity determination. The proposed approach uses concepts
of possibility theory to determine lower and upper bounds
of similarity intervals. We evaluate the proposed similarity
assessment process by applying it to real-world datasets,
and we compare it to other related methods.
Keywords Semantic similarity · Feature-based similarity · Context-aware similarity · RDF triples · Linked data · Semantic space
1 Introduction
An ultimate contribution of the Semantic Web (Lee et al.
2001) is utilization of ontology as the knowledge represen-
tation form. Resource description framework (RDF) (Lassila
and Swick 1999) is introduced as an underlying framework
in order to use ontology in the web environment. Resource
description framework data model treats each piece of
information as a triple: subject-property-object (Lassila and
Swick 1999). In the last few years, the application of RDF for
data representation has become a very popular way of rep-
resenting data on the web (Shadbolt et al. 2006). Over time,
more attention has been paid to it, and the term linked data
(LD) has been used to describe the network of data sources
based on RDF triples for information representation (Bizer
et al. 2009). The power of LD, in contrast to the hypertext web,
is that entities from different sources/locations are linked to
other related entities on the web. This enables one to view the
web as a single global data space (Bizer et al. 2009). In other
words, hypertext web connects documents in a naive way—
links always point to documents. However, in the web of LD
single information items are connected—links point to other
pieces of information stored at different physical locations.
As a result, LD allows for better representation of structured
data and even its underlying semantics.
In order to publish data on the web of LD some prin-
ciples have to be followed (Berners-Lee and Hendler
P. D. Hossein Zadeh (✉) · M. Z. Reformat
Department of Electrical and Computer Engineering,
University of Alberta, Edmonton, AB, Canada
e-mail: [email protected]
M. Z. Reformat
e-mail: [email protected]
J Ambient Intell Human Comput (2013) 4:515–532
DOI 10.1007/s12652-012-0154-7
2001). One fundamental rule is the use of uniform resource
identifiers (URIs) to identify each piece of information
(Shadbolt et al. 2006). Uniform resource identifiers aim to
universally define entities in the web of data so that users
and machines can use the URIs to obtain information about
the data. This means that every entity has a global identifier
that a person or machine can use to look it up, refer to it,
and find its description. Another rule of publishing data in
LD is that the created URIs should be obtainable via HTTP
on the web.
As stated, LD is expressed in RDFs, i.e., triples: subject,
property, and object, where each one of these is represented
by a URI. This way, finding a specific piece of information
in the web of data is facilitated with the help of inter-
pretable URIs. For example, the entity ‘‘University of
California’’ can be referred to in different ways, such as
‘‘University of Berkeley’’, ‘‘UCB’’, and ‘‘UC Berkeley’’ by
different data sources. However, assigning a unique URI in
different datasets helps to avoid any confusion.
A collection of Semantic Web Technologies and
Applications supports manifestation of LD in reality. These
include protocols, strategies and tools for querying the
RDF datasets (SPARQL, SPARQL+), transforming current
application-specific formats of resources to the RDF
format (STEAMY, Cvs2rdf, Cypher), reasoning and discovering
new relationships using RDF data in order to
manage the information on the web (Pellet, Jena,
FaCT++), and extracting RDF triples (CumulusRDF,
3Store, Pubby). Linked data is potentially beneficial to
various semantic applications such as web search engines,
web browsers, information retrieval systems, and reasoning
engines. As a result of the interconnected data, navigation
and query using semantic-enabled browsers over the LD
can be facilitated to a great extent. However, LD as an
integration of the interlinking datasets poses challenges
regarding processing and analysis of data (D. Hossein
Zadeh and Reformat 2012a). One of them is finding simi-
larity between pieces of information.
In this paper, we propose a semantic similarity assess-
ment method between the entities in LD based on the
interconnections between entities and applying elements of
possibility theory. The proposed approach includes a
number of important features:
• Semantic-oriented—the approach is fully dependent on
the interconnections between entities that define seman-
tics of data; therefore, the method is an attempt to
determine the similarity based on semantics of entities.
• Context-aware—the approach is sensitive to the types
of interconnections between entities; this makes it
suitable to assess similarity in the context of specific
features or utilization of entities that can be defined by
different types of interconnections.
• Dynamic/adaptive—similarities are determined based
on the current state of knowledge as embedded in LD;
any addition to or modification of this data will be
immediately reflected in similarity evaluations.
The remainder of this paper is organized as follows:
Sect. 2 provides some background information related to
LD, principle of similarity, and possibility theory. In Sect.
3, we present our context-free semantic similarity
approach. The context-aware similarity method and its
underlying motivations are described in Sect. 4. Sections 5
and 6 evaluate our approaches using real-world datasets
and compare them with other methods. Section 7 reviews
related work in the field of semantic similarity measure-
ment. Finally, Sect. 8 concludes the paper and presents
future work.
2 Background
2.1 Linked data
Linked data resembles a decentralized partial mesh net-
work in which entities from different resources are con-
nected to other related entities directly or indirectly. As
previously mentioned, the foundation of LD consists of two
elements: RDF and URI. The basic atomic component of
LD is an RDF triple. In fact, all pieces of information in LD
are expressed using triples. The generic format of an RDF
triple is: subject, property, and object. For example, the
statement:
‘‘The Matrix (movie) is distributed by Warner Bros.’’
can be expressed as the following triple:
The Matrix (subject) - distributed by (property) - Warner Bros. (object)
Components of this triple can be parts of other triples:
The Matrix (subject) - directed by (property) - The Wachowskis (object)
New_Line_Cinema (subject) - parent of (property) - Warner Bros. (object)
Graphically, the above RDF triples constitute a so-called RDF graph, see Fig. 1.
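The three example triples can be held in a very small data structure. The following is a minimal sketch, assuming plain Python tuples of the form (subject, property, object); the identifier spellings and the helper name `build_graph` are illustrative and not part of any RDF library.

```python
from collections import defaultdict

# The three example triples from the text, written as (subject, property, object).
triples = [
    ("The_Matrix", "distributed_by", "Warner_Bros."),
    ("The_Matrix", "directed_by", "The_Wachowskis"),
    ("New_Line_Cinema", "parent_of", "Warner_Bros."),
]


def build_graph(triples):
    """Return an adjacency map: subject -> list of (property, object) edges."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
    return dict(graph)


graph = build_graph(triples)
# Vertices are all resources that appear as a subject or as an object.
nodes = {s for s, _, _ in triples} | {o for _, _, o in triples}
```

Here `graph["The_Matrix"]` lists the two outgoing edges of the movie, and `nodes` contains the four vertices of the graph.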
As stated, the second fundamental aspect of LD is URI.
Uniform resource identifiers provide unique identification
for every piece of data on the web. Thus, dereferencing the
URI associated with every entity enables a user/machine to
find all information related to that entity, which includes its
associative RDF fragments in the web of data. This shows
that LD is a huge network of interconnections from dif-
ferent datasets without a single centre of knowledge. The
RDF triples: ‘‘The Matrix’’ is distributed by ‘‘Warner
Bros.’’, ‘‘The Matrix’’ is directed by ‘‘The Wachowskis’’,
and ‘‘New_Line_Cinema’’ is parent of ‘‘Warner Bros.’’ can
be expressed in the following way, Fig. 2.
In the example above, dereferencing the URI:
http://dbpedia.org/resource/The_Matrix shows that the
subject ‘‘The Matrix’’ belongs to the dataset dbpedia.org.
Also, dereferencing the URI http://dbpedia.org/resource/
Warner_Bros. implies that this object is defined in the
dataset dbpedia.org. Similar to the subject and object,
properties are uniquely represented by their assigned URIs.
Properties, Pi, can be categorized as sets of relations that
are associated with every particular class of things. This
means that every class of entities including person, vehicle,
event, movie, etc. has some specific properties, while each
property is distinguished by a unique URI. For example,
the class ‘‘movie’’ can have properties such as runtime,
budget, cinematography, director, distributor, producer,
language, etc.
There exist several important data collections that have
published their contents in the format of LD, such as
DBPedia,1 DBLP,2 Geonames,3 Freebase,4 New York
Times,5 BBC programmes,6 and FOAF.7 Among these
datasets, DBPedia transformed the contents of Wikipedia into
the LD format with more than 103 million RDF triples.
Geonames is another dataset that provides information
about over eight million geographical locations in the
world. Freebase presents data related to approximately
20 million different entities in an open linked data graph.
New York Times linked open data has published news,
vocabularies and subject headings as LD. Information
about people, their activities and social media are pub-
lished in LD format in FOAF project. Information about
books in the form of LD is available on RDF Book
Mashup.8 It is worth mentioning that all the mentioned data
sources allow connections to and from other data sources.
Figure 3 is generated using Gephi9 and depicts a snap-
shot of DBPedia dataset containing RDF triples of four
different movies. Vertices represent resources (subjects and
objects), and properties are shown by edges between the
resources. The four vertices with the highest number of
edges connected to them are the four selected movies. As it
can be observed, these movies have connections to unique
features (specific to that particular movie) and also con-
nections to features that are shared between different
movies. One of the most intriguing observations regarding
LD is its contribution to semantic definition of entities. A
set of relations between an entity and other resources can
be conceived as resource’s features defining its semantics.
2.2 Concept of similarity
The task of determining similarity between two entities is a
fundamental process for many applications related to
analysis and processing of data including artificial intelli-
gence, biomedicine, knowledge representation, data min-
ing, natural language processing, and information retrieval.
A variety of approaches for similarity assessment have
been proposed in the literature while some leverage lexi-
cographic, syntactic, structural information, as well as
representation of information about entities to measure the
Fig. 1 A simple RDF graph containing three triples
Subject: http://dbpedia.org/resource/The_Matrix Predicate: http://dbpedia.org/ontology/distributor Object: http://dbpedia.org/resource/Warner_Bros.
Subject: http://dbpedia.org/resource/The_Matrix Predicate: http://dbpedia.org/property/director Object: http://dbpedia.org/resource/The_Wachwskis
Subject: http://dbpedia.org/resource/New_Line_Cinema Predicate: http://dbpedia.org/property/parent Object: http://dbpedia.org/resource/Warner_Bros.
Fig. 2 Examples of RDF triples
1 http://dbpedia.org/About
2 http://www.informatik.uni-trier.de/~ley/db/
3 http://www.geonames.org/
4 http://www.freebase.com/
5 http://data.nytimes.com/
6 http://www.bbc.co.uk/programmes
7 http://www.foaf-project.org/
8 http://www4.wiwiss.fu-berlin.de/bizer/bookmashup/
9 http://gephi.org/
similarity. The most popular techniques are based on entities’
feature matching (Tversky 1977; Nosofsky 1991) as well as
combination approaches (Johannesson 1997). Many approa-
ches depend on representation of information in the form of
ontology, while a few methods investigate the problem of
similarity assessment in LD. A detailed review of the
aforementioned techniques is provided in Sect. 7.
Intuitively, a degree of similarity between two items
depends on how many features of these two items are the
same. Before we investigate this idea further, let us take a
closer look at the definition of identity. The original definition
of identity comes from Leibniz's law (Leibniz 1975) of the identity of
indiscernibles, which states that two entities i and j are identical
if every property P that holds for one also holds for the other:
$$ \forall i\, \forall j\, [\forall P\, (Pi \leftrightarrow Pj) \rightarrow i = j] \tag{1} $$
It can be inferred that unique features of each entity con-
tribute to the dissimilarity measure between the two
entities.
Another important aspect of (dis)similarity understood in this way
is its symmetry. If the number of features of one entity differs from
that of another entity, then (dis)similarity is not symmetrical.
If we assume that (dis)similarity is a ratio between mutual features
and all features connected to an entity, then this ratio depends on
the number of features (connections) of that entity. Since
the number of connections for each entity in LD may be
different, the similarity is asymmetric, which complies with
the work conducted by Tversky (1977). According to
Tversky (1977), similarity between two entities can be
determined by comparing features defining these resources.
We also believe that similarity can be determined when
only some specific features are considered while others are
meant to be ignored. Thus, an appropriate selection of
features allows for determining similarity in a context
defined by these selected features.
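The asymmetry of such a ratio can be seen in a few lines. This is a minimal sketch, assuming each entity is described by a plain set of features; the feature names and the function name `shared_ratio` are hypothetical and chosen only for illustration.

```python
def shared_ratio(features_a, features_b):
    """Fraction of a's features that are also features of b."""
    if not features_a:
        raise ValueError("entity has no features")
    return len(features_a & features_b) / len(features_a)


# Hypothetical feature sets of two movies.
movie_a = {"Warner_Bros.", "The_Wachowskis", "English", "Keanu_Reeves"}
movie_b = {"Warner_Bros.", "English"}

a_to_b = shared_ratio(movie_a, movie_b)  # 2 shared features out of 4
b_to_a = shared_ratio(movie_b, movie_a)  # 2 shared features out of 2
```

Both directions share the same two features, but the ratio is taken against a different total, so `a_to_b` and `b_to_a` differ.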
The above-mentioned principle sheds light on how to
assess the degree of similarity between two entities in LD.
The fact that all pieces of data in LD are interconnected
creates a natural implication to think about similarity as the
degree of interconnection between the pieces. Generally,
connections represent reasonable amount of information
about the entities in LD. Detailed analysis of these inter-
connections enables one to extract features related to every
entity in the web of data.
2.3 Possibility theory
Introduced by Zadeh (1999) and fully developed by Dubois
and Prade (2003), possibility theory is a suitable vehicle to
handle incomplete information. Although similar to
probability theory, it differs in using two set functions,
the possibility and necessity measures, instead of just one
measure as in probability theory. Here, we present the
basic definitions and concepts that are used in our approach
for similarity evaluation. For more information on possi-
bility theory see (DuBois and Prade 1980; Dubois et al.
1988; Klir and Folger 1988).
Let us assume a finite set of states, S. A possibility
distribution is a function:
$$ \pi : S \rightarrow [0, 1] \tag{2} $$
that represents the current state of knowledge. Possibility
theory appraises what elements of S are plausible and what
elements are not, what is "normal" and what is not. A
state s is impossible when:
$$ \pi(s) = 0 $$
and totally possible (plausible) when:
$$ \pi(s) = 1 $$
This allows for expressing complete knowledge, when
some state $s_0$ is possible, $\pi(s_0) = 1$, and all other states s are
impossible, $\pi(s) = 0$. Total ignorance is expressed as
$\pi(s) = 1$ for all s from S. The degrees of
possibility and necessity for a given subset of states $S_{sub}$
can be computed as below. Possibility:
$$ \Pi(S_{sub}) = \sup_{s \in S_{sub}} \pi(s) \tag{3} $$
Necessity:
$$ N(S_{sub}) = \inf_{s \notin S_{sub}} \bigl(1 - \pi(s)\bigr) \tag{4} $$
Fig. 3 A snapshot of dbpedia.org representing four movies
The duality of possibility-necessity is:
$$ N(S_{sub}) = 1 - \Pi(S'_{sub}) \tag{5} $$
where $S'_{sub}$ represents the complement of $S_{sub}$. Possibility
measures satisfy the basic property:
$$ \Pi(S^a_{sub} \cup S^b_{sub}) = \max\bigl(\Pi(S^a_{sub}), \Pi(S^b_{sub})\bigr) \tag{6} $$
while necessity measures satisfy the dual property:
$$ N(S^a_{sub} \cap S^b_{sub}) = \min\bigl(N(S^a_{sub}), N(S^b_{sub})\bigr) \tag{7} $$
Equations (3, 4, 5) are applied in the proposed approach for
similarity assessment.
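The possibility and necessity measures of Eqs. (3) and (4) can be illustrated on a finite set of states. This is a minimal sketch, assuming the possibility distribution is given as a Python dictionary from state to degree; the state names and the degrees are hypothetical.

```python
def possibility(pi, subset):
    """Eq. (3): supremum of pi(s) over the states in the subset."""
    return max((pi[s] for s in subset), default=0.0)


def necessity(pi, subset):
    """Eq. (4): infimum of 1 - pi(s) over the states outside the subset."""
    outside = set(pi) - set(subset)
    return min((1.0 - pi[s] for s in outside), default=1.0)


# Hypothetical possibility distribution over three states.
pi = {"s1": 1.0, "s2": 0.5, "s3": 0.0}

subset = {"s1", "s2"}
complement = set(pi) - subset

poss = possibility(pi, subset)
nec = necessity(pi, subset)
# Duality of Eq. (5): N(A) = 1 - Pi(complement of A).
dual = 1.0 - possibility(pi, complement)
```

For this distribution the only state outside the subset is impossible, so the subset is both fully possible and fully necessary, and `nec` equals `dual` as Eq. (5) requires.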
3 Context-free similarity assessment
3.1 Approach overview
Linked data (LD) represents each entity (resource) via fea-
tures associated with it (Sect. 2.1). This representation creates
a captivating form of defining a single resource and com-
paring it with other resources. In a nutshell, the approach
proposed identifies resources that are certainly shared and
possibly shared between two entities, and uses elements of
possibility theory to assess similarity between these entities.
3.2 Approach description
As mentioned before, LD is a mesh of interconnected resources,
which can be represented as a set of triples <resource-as-subject,
property, resource-as-object>. Formally:
$$ LD = \{ \langle r_i, p_q, r_m \rangle : r_i, r_m \in R,\ p_q \in P \} \tag{8} $$
where R is a set of resources, and P is a set of properties. In
this mesh, a single resource ri is defined via its connections
to other resources. Each of these resources can be
considered as a feature of ri. A set of all resources
(features) connected to ri can be treated as its semantic
definition. The connections between the resource ri and
other resources are labeled with properties that have ri as
their subject. Therefore, for an entity ri we can write:
$$ n_i = |\{ \langle r_i, p_q, r_m \rangle : r_m \in R \setminus \{r_i\},\ p_q \in P \}| \tag{9} $$
where the symbol | | stands for cardinality of a set, and ni
represents the number of connections between ri and other
resources in LD. In other words, ni represents the number
of resources—features—of ri.
Indeed, we can say that LD is a powerful infrastructure
providing entities with semantics. As a result, LD can be
treated as a large semantic space containing multiple definitions,
in which semantics is formed through connections between
resources. Therefore, we can use LD to determine a semantic-based
similarity using connections as indicators of relatedness
between resources. For two different resources ri and rj, we
identify their associated features in order to appraise the
degree of relatedness between them. Four different scenarios
can be encountered during similarity assessment of two
resources ri and rj, depending on whether a connected resource
(feature) is shared between ri and rj and whether the
connecting properties are of the same or of different types.
The four scenarios are shown in Table 1.
Each of the scenarios represents evidence towards simi-
larity or dissimilarity. Let us explain each specific scenario:
• S1: this scenario intuitively contributes to the similarity of
two entities. The same property connects both entities
to the same resource.
• S2: this scenario introduces ambiguity: the same property
connects the two entities to different resources. Note that
this scenario may contribute to similarity or dissimilarity
depending on the relatedness of the attached resources at
the other end of properties.
Table 1 Possible scenarios of connections between two resources
Scenario label   Properties (connections)   Resources (features)
S1               Same type                  Same (shared) resource
S2               Same type                  Different resources
S3               Different types            Same (shared) resource
S4               Different types            Different resources
• S3: this scenario introduces ambiguity: the same (shared)
resource is connected via different properties. Thus,
depending on the relatedness of the properties it is possible
that this scenario contributes to similarity or dissimilarity.
• S4: this scenario contributes directly to dissimilarity.
Each of the entities is connected to different resources
via different properties.
Let the sets Pi and Pj represent properties of the
resources ri and rj, respectively. Ri and Rj, on the other
hand, represent sets of features (connected resources) of ri
and rj. Additionally, we define the following sets:
• a set of resources connected to both resources ri and rj,
and a set of properties shared by both of them:
$$ R_{i,j} = R_i \cap R_j \qquad P_{i,j} = P_i \cap P_j \tag{10} $$
• a set of resources describing exclusively the resource
ri (rj):
$$ R_i^{exc} = R_i \setminus R_j \qquad R_j^{exc} = R_j \setminus R_i \tag{11} $$
• likewise, a set of properties exclusive for ri (rj):
$$ P_i^{exc} = P_i \setminus P_j \qquad P_j^{exc} = P_j \setminus P_i \tag{12} $$
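Based on the definitions above, the scenarios for the
resource ri with respect to any resource rj can be presented
in the following way. The number of resources describing ri
that belong to the scenario S1 is:
$$ n_i^{S1} = |\{ (\langle r_i, p_q, r_m \rangle, \langle r_j, p_q, r_m \rangle) : r_m \in R_{i,j},\ p_q \in P_{i,j} \}| \tag{13} $$
scenario S2:
$$ n_i^{S2} = |\{ \langle r_i, p_q, r_m \rangle : r_m \in R_i^{exc},\ p_q \in P_{i,j} \}| \tag{14} $$
scenario S3:
$$ n_i^{S3} = |\{ (\langle r_i, p_q, r_m \rangle, \langle r_j, p_s, r_m \rangle) : r_m \in R_{i,j},\ p_q, p_s \in P,\ p_q \neq p_s \}| \tag{15} $$
scenario S4:
$$ n_i^{S4} = |\{ \langle r_i, p_q, r_m \rangle : r_m \in R_i^{exc},\ p_q \in P_i^{exc} \}| \tag{16} $$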
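The four counts can be obtained by classifying every connection of ri. This is a minimal sketch under stated assumptions: triples are (subject, property, object) tuples, every connection of ri is assigned to exactly one scenario, and the resource and property names are hypothetical. It illustrates the definitions above and is not the authors' implementation.

```python
def scenario_counts(triples, r_i, r_j):
    """Count the connections of r_i that fall into scenarios S1-S4 relative to r_j."""
    edges_i = {(p, r) for s, p, r in triples if s == r_i and r != r_i}
    edges_j = {(p, r) for s, p, r in triples if s == r_j and r != r_j}
    shared_resources = {r for _, r in edges_i} & {r for _, r in edges_j}    # R_{i,j}
    shared_properties = {p for p, _ in edges_i} & {p for p, _ in edges_j}   # P_{i,j}

    counts = {"S1": 0, "S2": 0, "S3": 0, "S4": 0}
    for p, r in edges_i:
        if r in shared_resources:
            # Shared resource: same property gives S1, a different property gives S3.
            counts["S1" if (p, r) in edges_j else "S3"] += 1
        else:
            # Exclusive resource: shared property gives S2, exclusive property gives S4.
            counts["S2" if p in shared_properties else "S4"] += 1
    return counts


# Hypothetical data: two movies m1 and m2.
triples = [
    ("m1", "distributor", "Warner"), ("m2", "distributor", "Warner"),
    ("m1", "director", "DirA"), ("m2", "director", "DirB"),
    ("m1", "producer", "PersonX"), ("m2", "writer", "PersonX"),
    ("m1", "budget", "63M"),
]

counts_m1 = scenario_counts(triples, "m1", "m2")
counts_m2 = scenario_counts(triples, "m2", "m1")
```

Because each connection lands in exactly one scenario, the four counts always add up to the total number of connections of the resource.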
Remark 1 Intuitively, S1 contributes to similarity and S4
contributes to dissimilarity of two resources. The scenarios
S2 and S3 require in-depth evaluation to determine their
contributions to similarity/dissimilarity. In this case, we
need to assess relatedness of resources (S2) or properties
(S3) that are different, and this would give us an indication
whether additional evidence exists towards similarity or
dissimilarity. For now, we focus on S1 and S4 as
computationally simpler scenarios. Scenarios S2 and S3
will be discussed later in this section.
Using the descriptions given to the different scenarios, the
similarity and dissimilarity between ri and rj can be expressed
in the following way: the similarity is solely based on scenario
S1 and thus the necessity of similarity can be determined
according to the possibility theory (Sect. 2.3):
$$ N(\mathrm{sim}[r_i, r_j]) = \frac{n_i^{S1}}{n_i} \tag{17} $$
This represents the similarity of ri to rj; therefore, the
denominator contains the number of connections (features) of
ri. This leads to the asymmetric nature of the proposed
approach, as explained in Sect. 2.2. The necessity of
dissimilarity of ri to rj is determined based on scenario S4
using the equation:
$$ N(\mathrm{dissim}[r_i, r_j]) = \frac{n_i^{S4}}{n_i} \tag{18} $$
As discussed before, scenarios S2 and S3 contribute to
ambiguity; thus, they are involved in determining
possibility of dissimilarity:
$$ \Pi(\mathrm{dissim}[r_i, r_j]) = \frac{n_i^{S2} + n_i^{S3} + n_i^{S4}}{n_i} \tag{19} $$
with the understanding that the necessity of dissimilarity ($n_i^{S4}$)
counts towards the possibility of dissimilarity, since certainty implies
possibility. This follows the principle of minimal
specificity in possibility theory, according to which scenarios not
known to be impossible count towards possibility (Zadeh
1999). This leads to the following formulas for necessity
of similarity and possibility of similarity:
$$ N(\mathrm{sim}[r_i, r_j]) = \min\bigl(N(\mathrm{sim}[r_i, r_j]),\ 1 - \Pi(\mathrm{dissim}[r_i, r_j])\bigr) \tag{20} $$
$$ \Pi(\mathrm{sim}[r_i, r_j]) = 1 - N(\mathrm{dissim}[r_i, r_j]) \tag{21} $$
Therefore, similarity between resources can be expressed
as an interval with N (sim) as its lower limit, and P (sim) as
its upper limit.
Remark 2 In order to remedy the computational complexity
of $\Pi(\mathrm{dissim}[r_i, r_j])$ in Eq. (19), as already explained
in Remark 1, we may take advantage of the fact that:
$$ n_i = n_i^{S1} + n_i^{S2} + n_i^{S3} + n_i^{S4} \tag{22} $$
With simple arithmetic, it can be seen that
Eqs. (20) and (21) can now be evaluated directly through S1
and S4. Consequently, we can derive final values of
necessity of similarity and possibility of similarity as:
$$ N^F(\mathrm{sim}[r_i, r_j]) = N(\mathrm{sim}[r_i, r_j]) \tag{23} $$
$$ \Pi^F(\mathrm{sim}[r_i, r_j]) = 1 - N(\mathrm{dissim}[r_i, r_j]) \tag{24} $$
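The arithmetic behind Remark 2 can be spelled out in one line. The following is a short derivation from Eqs. (17), (19) and (22), writing the scenario counts of ri as $n_i^{S1}, \dots, n_i^{S4}$:

```latex
\begin{aligned}
1 - \Pi(\mathrm{dissim}[r_i, r_j])
  &= 1 - \frac{n_i^{S2} + n_i^{S3} + n_i^{S4}}{n_i}
   = \frac{n_i - n_i^{S2} - n_i^{S3} - n_i^{S4}}{n_i} \\
  &= \frac{n_i^{S1}}{n_i}
   = N(\mathrm{sim}[r_i, r_j])
\end{aligned}
```

Both arguments of the minimum in Eq. (20) are therefore equal, so the minimum reduces to $n_i^{S1}/n_i$ and the counts of S2 and S3 are not needed to obtain the lower bound.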
The simple manipulation presented above allows us to
consider only the scenarios S1 and S4—using these scenarios
we can determine the lower-bound (necessity) and the upper-
bound (possibility) of a similarity interval. In the light of that,
we can say that the measures of the scenarios S2 and S3
represent the width of the interval. The process of determining
the scenario S2, different resources and the same property,
involves identifying similarity between these different
resources. In order to do this we need to travel farther from
the original two resources under evaluation and analyze other
connections. An algorithm is developed for this purpose in
Sect. 4. Nevertheless, it increases the complexity of the
similarity assessment. For the scenario S3, the same resource
and different properties, the process is more complex than the
one for S2. It requires an investigation of semantics of
properties, which is external to LD knowledge sources. If such
processes are performed the formulas for possibility of
similarity and necessity of dissimilarity should be adapted
accordingly as below:
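$$ \Pi(\mathrm{sim}[r_i, r_j]) = \frac{n_i^{S1} + n_i^{S2C} + n_i^{S3C}}{n_i} \tag{25} $$
$$ N(\mathrm{dissim}[r_i, r_j]) = \frac{n_i^{S4} + (n_i^{S2} - n_i^{S2C}) + (n_i^{S3} - n_i^{S3C})}{n_i} \tag{26} $$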
where niS2C and niS3C represent numbers of resources and
properties indirectly related to each other. In the work
presented here, we consider S2 in context-aware similarity
assessment (Sect. 4). However, we do not investigate sce-
nario S3 that would lead to a semantic analysis of prop-
erties—a topic beyond the scope of this work.
Let us take a look at a simple example illustrated in
Fig. 4. The figure represents two resources ri and rj and
their associated features (other resources) connected via
properties. For simplicity, different types of properties are
shown with different line widths and styles.
For scenarios S1 and S3 the numbers of resources for
each scenario are the same for both resources. In Fig. 4, we
have nS1 = 4 {re, rg, rh, rk} and nS3 = 1 {rf}. S2 and S4
quantities have to be determined for each resource sepa-
rately. For ri, niS2 = 1 {rc}, niS4 = 3 {ra, rb, rd}. For rj,
njS2 = 2 {ro, rp} and njS4 = 1 {rq}.
Based on Eqs. (23) and (24), we can obtain the fol-
lowing similarity values for ri as:
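$$ N(\mathrm{sim}[r_i, r_j]) = \frac{n_i^{S1}}{n_i} = \frac{4}{9} \;\Rightarrow\; N^F(\mathrm{sim}[r_i, r_j]) = \frac{4}{9} $$
$$ N(\mathrm{dissim}[r_i, r_j]) = \frac{n_i^{S4}}{n_i} = \frac{3}{9} \;\Rightarrow\; \Pi^F(\mathrm{sim}[r_i, r_j]) = 1 - \frac{3}{9} = \frac{6}{9} $$
Hence, the lower and upper bounds of similarity of ri to rj
are:
$$ \mathrm{sim}[r_i, r_j] \in \left[ \frac{4}{9}, \frac{6}{9} \right] $$
Likewise, for rj:
$$ N(\mathrm{sim}[r_j, r_i]) = \frac{n_j^{S1}}{n_j} = \frac{4}{8} \;\Rightarrow\; N^F(\mathrm{sim}[r_j, r_i]) = \frac{4}{8} $$
$$ N(\mathrm{dissim}[r_j, r_i]) = \frac{n_j^{S4}}{n_j} = \frac{1}{8} \;\Rightarrow\; \Pi^F(\mathrm{sim}[r_j, r_i]) = 1 - \frac{1}{8} = \frac{7}{8} $$
Thus,
$$ \mathrm{sim}[r_j, r_i] \in \left[ \frac{4}{8}, \frac{7}{8} \right] $$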
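The numbers of this example can be checked mechanically. This is a minimal sketch of Eqs. (22) to (24), assuming the four scenario counts are already known; the function name `similarity_interval` is illustrative, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def similarity_interval(n_s1, n_s2, n_s3, n_s4):
    """Return (lower, upper) bounds of similarity from the four scenario counts."""
    n = n_s1 + n_s2 + n_s3 + n_s4      # total number of connections, Eq. (22)
    if n == 0:
        raise ValueError("resource has no connections")
    lower = Fraction(n_s1, n)          # necessity of similarity, Eq. (23)
    upper = 1 - Fraction(n_s4, n)      # one minus necessity of dissimilarity, Eq. (24)
    return lower, upper


# Scenario counts read off Fig. 4 in the text.
interval_i = similarity_interval(4, 1, 1, 3)   # r_i compared to r_j
interval_j = similarity_interval(4, 2, 1, 1)   # r_j compared to r_i
```

`Fraction` reduces its results, so 6/9 is reported as 2/3 and 4/8 as 1/2; the intervals are otherwise the ones derived above, and their difference shows the asymmetry of the measure.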
4 Context-aware similarity assessment
4.1 Motivation
The approach presented in Sect. 3 for determining similarity
between two entities takes into account all properties of
entities. In some cases, not all of these connections are
necessary to be considered for similarity assessment—some
of them could be very unique for a single entity, while some
could be meaningless. More importantly, in many real-life
situations the user might be interested in similarity between
two entities only with respect to some specific properties.
This means that only those specific types of connections
should be used for similarity determination. We refer to this
situation as context-aware similarity assessment.
The context, i.e., a set of properties to be considered for
similarity assessment can be provided explicitly by the
user, domain expert, or application. It can also be identified
based on any given information from these sources relevant
to the topic of the task at hand. In such a case, properties
related to the context can be extracted from the ontology
of properties associated with a dataset.
4.2 Approach description
Fig. 4 Snippet of LD, definitions of resources ri and rj. Types of connections are indicated by thickness/types of lines
For context-aware similarity we take into consideration
only a set of specific properties that are in line with the
context. For this purpose, the number of connections of
resource ri only counts the ones which belong to the
desired context:
$$ n_i(P_{cntx}) = |\{ \langle r_i, p_q, r_m \rangle : r_m \in R \setminus \{r_i\},\ p_q \in P_{cntx} \}| \tag{27} $$
where $P_{cntx} \subseteq P$ contains only properties related to
the particular context. Now that the context is defined,
properties associated with that context are easily recognized.
Therefore, in assessing the similarity of the resources ri and
rj, the two resources are compared against the same set of
properties (the ones related to the context). For example,
similarity of two movies in a context of directorship makes
their comparison limited to specialized properties such as
director. In another example, similarity of two books in the
context of appearance is evaluated according to a set of
properties such as color, cover type, and number of pages.
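Restricting the count of connections to a context amounts to a filter on the property of each triple. This is a minimal sketch of Eq. (27), assuming triples are (subject, property, object) tuples; the book data and the property names are hypothetical.

```python
def context_connections(triples, resource, context):
    """Connections of `resource` whose property belongs to the context set."""
    return [
        (p, r)
        for s, p, r in triples
        if s == resource and r != resource and p in context
    ]


# Hypothetical description of one book.
triples = [
    ("book1", "color", "red"),
    ("book1", "coverType", "hardcover"),
    ("book1", "author", "author_x"),
    ("book1", "numberOfPages", "320"),
]

# Context "appearance", defined as a set of property names.
appearance = {"color", "coverType", "numberOfPages"}

n_ctx = len(context_connections(triples, "book1", appearance))
```

The `author` connection is ignored because it lies outside the context, so only three of the four connections are counted.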
Once again, let us take a look at all possible scenarios
presented in Table 1.
S1: this scenario contributes to the similarity of two
resources: the same properties existing in both entities are
connecting to the same features (other resources). For the
scenario S1, we apply the same approach as before in
Sect. 3. The only difference is the constraint on the properties
that are used to determine the shared resources:
$$ n_i^{S1}(P_{cntx}) = |\{ (\langle r_i, p_q, r_m \rangle, \langle r_j, p_q, r_m \rangle) : r_m \in R_{i,j},\ p_q \in P_{i,j},\ p_q \in P_{cntx} \}| \tag{28} $$
In this case, the set Pcntx contains the properties of interest,
i.e., those defining the context.
S2: this scenario can contribute to similarity as well as
dissimilarity as discussed in Sect. 3. In this scenario, the
same properties connect to different resources; if there is
some relatedness between these resources then S2 contrib-
utes to similarity, and if not—to dissimilarity. Relatedness
between different resources can be determined by connec-
tions these resources have to other resources. In scenario S2,
we need to investigate possible relatedness between different
resources connected by the same property to find out if they
are connected via other resources. In other words, we
investigate if there is a path between these resources. If
a path exists, a value of 1/ni(Pcntx) is added to
the possibility of similarity measure; otherwise, it is added to the
necessity of dissimilarity. The procedure needs to be performed for
each pair of resources found in S2 (one resource from Ri and
the other from Rj). Recall that Ri and Rj represent sets of
features (connected resources) of ri and rj, respectively.
Before presenting the algorithm for determining the number
of remotely connected resources, we need to define a few
quantities (similar to the ones defined in Sect. 3.2):
• a set of resources connected to both resources ri and rj
in the context Pcntx:
$$ R_{i,j}(P_{cntx}) = R_i(P_{cntx}) \cap R_j(P_{cntx}) \tag{29} $$
• a set of resources describing only the resource ri (not
describing rj) in context Pcntx. We also call it a set of
resources exclusive for ri:
$$ R_i^{exc}(P_{cntx}) = R_i(P_{cntx}) \setminus R_{i,j}(P_{cntx}) \tag{30} $$
• likewise, a set of resources exclusive for rj:
$$ R_j^{exc}(P_{cntx}) = R_j(P_{cntx}) \setminus R_{i,j}(P_{cntx}) \tag{31} $$
• also, let $n_i^{S2C}(P_{cntx})$ denote the number of resources in
Ri (in scenario S2) that are connected indirectly to
resources in Rj through some other external resources.
S3: this is yet another scenario that can contribute to
similarity or dissimilarity. Its contribution depends on
identifying the relatedness of the properties in the context
that connect the two entities to the same resource. This
process requires additional sources of knowledge (outside of
LD). Let $n^i_{S3C}(P_{cntx})$ denote the number of properties
in Pi (in scenario S3) that are indirectly related to different
properties in Pj.
S4: for this scenario we have different resources connected
by different properties. We assume that such differences
do not contribute to the possibility of similarity.
Table 2 Pseudo-code for discovering remotely connected resources
Now, the pseudo-code for the algorithm can be presented
as shown in Table 2. Finally, the necessity of similarity
is expressed as:
$$N_F^{cntx}(sim[r_i, r_j]) = \frac{n^i_{S1}(P_{cntx})}{n^i(P_{cntx})} \tag{32}$$
and the possibility of similarity as:
$$P_F^{cntx}(sim[r_i, r_j]) = \frac{n^i_{S1}(P_{cntx}) + n^i_{S2C}(P_{cntx}) + n^i_{S3C}(P_{cntx})}{n^i(P_{cntx})} \tag{33}$$
However, as mentioned before, we do not consider the
scenario S3 that would require investigation of properties’
relatedness (so $n^i_{S3C}(P_{cntx}) = 0$). One possibility here
would be to develop a method based on taxonomy-based
measures applied to an ontology of properties.
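Since the pseudo-code of Table 2 is not reproduced in the text, the following Python sketch shows only one plausible reading of the procedure: a breadth-first search decides whether an exclusive resource of ri reaches any exclusive resource of rj, and the resulting count feeds Eqs. (32) and (33). The function names, the hop limit `max_hops`, and the toy graph are assumptions made for illustration.

```python
from collections import deque
from fractions import Fraction


def has_path(graph, start, goal, max_hops=3):
    """Breadth-first search: is `goal` reachable from `start` within `max_hops` edges?"""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return True
        if dist == max_hops:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return False


def similarity_interval(n_s1, n_i, exc_i, exc_j, graph, n_s3c=0):
    """Return (necessity, possibility) of similarity per Eqs. (32) and (33)."""
    # n_S2C: exclusive resources of r_i that reach some exclusive resource of r_j.
    n_s2c = sum(1 for a in exc_i if any(has_path(graph, a, b) for b in exc_j))
    necessity = Fraction(n_s1, n_i)
    possibility = Fraction(n_s1 + n_s2c + n_s3c, n_i)
    return necessity, possibility


# Hypothetical adjacency lists: ra reaches rh through an external resource x.
toy_graph = {"ra": ["x"], "x": ["rh"], "rb": []}
necessity, possibility = similarity_interval(
    n_s1=1, n_i=3, exc_i={"ra", "rb"}, exc_j={"rh"}, graph=toy_graph)
```

The graph is followed in the direction given by the adjacency lists; to treat links as undirected, each edge would have to be listed in both directions.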
Let us take a look at an example to explain the approach
described above. Figure 5 depicts a snippet of LD with two
resources ri and rj, and a number of resources {ra, rb, rc, rd,
re, rf, rg, rh, rk, ro} defining them within the context Pcntx.
Here, different properties are shown with different line styles.
This means that the three resources {rp, rq, rs} are connected
with properties potentially different than the properties for
the resources {ra, rb, rc, rd, re, rf, rg, rh, rk, ro}.
The sets of common and exclusive resources describing
ri and rj are:
$$R_{i,j}(P_{cntx}) = \{r_e, r_f, r_g\}$$
$$R_i^{exc}(P_{cntx}) = \{r_a, r_b, r_c, r_d\}$$
$$R_j^{exc}(P_{cntx}) = \{r_h, r_k, r_o\}$$
Based on the algorithm in Table 2, we obtain {ra, rb} as
remotely connected resources between ri and rj:
$$n^i_{S2C}(P_{cntx}) = 2$$
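Using Eqs. (32) and (33), the context-aware similarity
values for ri in Fig. 5 are obtained as:
$$N_F^{cntx}(sim[r_i, r_j]) = \frac{n^i_{S1}(P_{cntx})}{n^i(P_{cntx})} = \frac{3}{7}$$
$$P_F^{cntx}(sim[r_i, r_j]) = \frac{n^i_{S1}(P_{cntx}) + n^i_{S2C}(P_{cntx}) + n^i_{S3C}(P_{cntx})}{n^i(P_{cntx})} = \frac{3 + 2 + 0}{7} = \frac{5}{7}$$
Therefore, the final values are:
$$sim[r_i, r_j] \in \langle N_F^{cntx}(sim[r_i, r_j]), P_F^{cntx}(sim[r_i, r_j]) \rangle$$
$$sim[r_i, r_j] \in \left\langle \frac{3}{7}, \frac{5}{7} \right\rangle$$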
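and these values for rj are:
$$N_F^{cntx}(sim[r_j, r_i]) = \frac{n^j_{S1}(P_{cntx})}{n^j(P_{cntx})} = \frac{3}{6}$$
$$P_F^{cntx}(sim[r_j, r_i]) = \frac{n^j_{S1}(P_{cntx}) + n^j_{S2C}(P_{cntx}) + n^j_{S3C}(P_{cntx})}{n^j(P_{cntx})} = \frac{3 + 2 + 0}{6} = \frac{5}{6}$$
Thus:
$$sim[r_j, r_i] \in \langle N_F^{cntx}(sim[r_j, r_i]), P_F^{cntx}(sim[r_j, r_i]) \rangle$$
$$sim[r_j, r_i] \in \left\langle \frac{3}{6}, \frac{5}{6} \right\rangle$$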
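The arithmetic of this example can be checked with a short Python sketch. The sets and counts are those reported in the text for Fig. 5; the helper `interval` is a hypothetical name, and the total number of features in the context is assumed to be the number of common plus exclusive resources, which matches the denominators 7 and 6 used above.

```python
from fractions import Fraction

common = {"re", "rf", "rg"}
exc_i = {"ra", "rb", "rc", "rd"}
exc_j = {"rh", "rk", "ro"}
n_s2c_i = 2  # remotely connected resources of r_i: {ra, rb}
n_s2c_j = 2  # value used in the text for r_j
n_s3c = 0    # scenario S3 is not considered


def interval(n_s1, n_s2c, n_s3c, n_total):
    """(necessity, possibility) of similarity per Eqs. (32) and (33)."""
    return Fraction(n_s1, n_total), Fraction(n_s1 + n_s2c + n_s3c, n_total)


interval_i = interval(len(common), n_s2c_i, n_s3c, len(common) + len(exc_i))
interval_j = interval(len(common), n_s2c_j, n_s3c, len(common) + len(exc_j))
```

`Fraction` reduces 3/6 to 1/2, so the interval for rj is reported as (1/2, 5/6), which is the same value as in the text. The two intervals differ because the measure is taken with reference to the first resource of the ordered pair.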
5 Case study
In order to illustrate functioning of the proposed semantic
similarity measures and to evaluate them, a case study is
presented using real-world datasets. We use DBPedia as the
data source of RDFs of the following resources. Four movies:
Matrix,10 Matrix_Reloaded,11 Hangover,12 and Blade_Run-
ner,13 one soundtrack album: Matrix-music14 (movie Matrix
soundtrack) and one car brand: Toyota15 are selected. The
evaluation is performed on the selected ordered pairs of the
resources. Context-free and context-aware similarity mea-
sures are obtained according to the methods presented in
Sects. 3 and 4, respectively. First, all RDF triples associated
with each resource are extracted from DBPedia. A graphical
visualization of the selected resources is depicted in Fig. 6,
produced with Gephi,16 a Java-based graph visualization tool.
Fig. 5 A sample of LD representing two resources ri and rj defined
with a number of resources connected via the properties (dotted lines
represent different connection types) within the context Pcntx
10 http://dbpedia.org/page/The_Matrix.
11 http://dbpedia.org/page/The_Matrix_Reloaded.
12 http://dbpedia.org/page/The_Hangover_(film).
13 http://dbpedia.org/page/Blade_Runner.
14 http://dbpedia.org/page/The_Matrix:_Music_from_the_Motion_Picture.
15 http://dbpedia.org/page/Toyota.
16 http://gephi.org/.
The computed similarity measures are shown as intervals
with necessity of similarity and possibility of similarity as
lower-bound and upper-bound values, respectively. In par-
ticular, necessity of similarity represents the known and cer-
tain similarity between the pair of resources {ri, rj}. On the
other hand, possibility of similarity denotes the maximum
possible similarity between the two resources. All possible
situations that might satisfy the similarity of ri and rj are
counted towards the possibility of similarity.
5.1 Context-free similarity
The evaluation is performed on the ordered pairs of
resources including {Matrix, Matrix-Reloaded}, {Matrix,
Blade-Runner}, {Matrix, Hangover}, {Matrix, Matrix-
music} and {Matrix, Toyota}. The results are shown in
Table 3.
The similarity interval for the pair {Matrix, Matrix-
Reloaded} is calculated based on Eq. (23) by comparing
their RDF triples according to scenarios S1 and S4.
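Therefore, the lower-bound is given by:
$$N_F(sim[Matrix, Matrix\text{-}reloaded]) = \frac{n^i_{S1}}{n^i} = \frac{61}{176} = 0.35$$
and the necessity of dissimilarity, from which the upper-bound
is derived, is based on Eq. (24):
$$N_F(dissim[Matrix, Matrix\text{-}reloaded]) = \frac{n^i_{S4}}{n^i} = \frac{9}{176} = 0.05$$
The upper-bound (possibility of similarity) is then 1 − 0.05 = 0.95.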
Here, $n^i_{S1}$ is the number of shared resources describing
Matrix and Matrix-Reloaded and connected via the same
property, while $n^i_{S4}$ is the number of resources unique to
Matrix and connected via unique properties to Matrix. The
value of $n^i$ represents the number of all triples that include
Matrix as their subject. Finally, interval-valued similarity is
obtained as:
$$sim[Matrix, Matrix\text{-}reloaded] \in \langle 0.35, 0.95 \rangle$$
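The computation of this interval can be sketched in Python. The sketch assumes that the upper bound is obtained as one minus the necessity of dissimilarity, which is consistent with the reported values; the function name `context_free_interval` is hypothetical.

```python
def context_free_interval(n_total, n_s1, n_s4):
    """Return (necessity, possibility) of similarity for the context-free case."""
    necessity_sim = n_s1 / n_total       # certain similarity (scenario S1)
    necessity_dissim = n_s4 / n_total    # certain dissimilarity (scenario S4)
    possibility_sim = 1.0 - necessity_dissim
    return necessity_sim, possibility_sim


# Counts reported for the ordered pair {Matrix, Matrix-Reloaded}.
lower, upper = context_free_interval(n_total=176, n_s1=61, n_s4=9)
print(f"<{lower:.2f}, {upper:.2f}>")  # prints <0.35, 0.95>
```

The remaining 176 − 61 − 9 = 106 triples fall into scenarios S2 and S3, which is why the interval is wide.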
There are two important observations that can be made
at this point. Firstly, the necessity of similarity gives an
unquestionable similarity value, 0.35, of the movie Matrix
to Matrix-Reloaded. This value is obtained based on the
facts and certain information, and thus it is referred to as a
lower-bound of similarity. This can be described as the
pessimistic view of similarity. The possibility of similarity
(the optimistic view of similarity), on the other hand, is
determined from the necessity of dissimilarity, which is
obtained from definite facts that are different for the two
resources. Indeed, the value of necessity refers to the
degree of certainty and the value of possibility to the
degree of plausibility when it comes to the similarity
between Matrix and Matrix-Reloaded (with the reference
to Matrix). Secondly, the interval, whose width in this case is
0.60, is a range of possible values of similarity between Matrix
and Matrix-Reloaded. Its existence is related to the inability to
determine for certain which RDF triples (features) of
Matrix and Matrix-Reloaded are the same (belong to S1) or
definitely different (belong to S4). This interval is an
indication of uncertainty regarding the similarity between
Matrix and Matrix-Reloaded. Further investigation of the
scenarios S2 and S3 would lead to some modifications of
the interval, and in particular its upper-bound.
It can be observed that the similarity intervals for the
two movie pairs {Matrix, Hangover} and {Matrix, Blade-
Fig. 6 Graphical visualization of the resources described in DBPe-
dia dataset
Table 3 Context-free similarity values
Ordered pairs {ri, rj}      Necessity of similarity   Possibility of similarity   Similarity interval
                            NF(sim[ri, rj])           PF(sim[ri, rj])             sim[ri, rj]
{Matrix, Matrix-reloaded}   0.35                      0.95                        ⟨0.35, 0.95⟩
{Matrix, Blade-runner}      0.09                      0.88                        ⟨0.09, 0.88⟩
{Matrix, Hangover}          0.09                      0.94                        ⟨0.09, 0.94⟩
{Matrix, Matrix-music}      0.01                      0.73                        ⟨0.01, 0.73⟩
{Matrix, Toyota}            0.00                      0.71                        ⟨0.00, 0.71⟩
Runner} have small necessity values and large possibility
values when compared with the pair {Matrix, Matrix-
Reloaded}. Their intervals are wider because their necessity
of similarity (0.09) is much smaller than the necessity of
similarity for {Matrix, Matrix-Reloaded} (0.35). The reason is that
the science fiction movie Matrix-Reloaded is a sequel to the
movie Matrix. Therefore, both movies have related stories,
actors, studios, directors, etc. On the other hand, the movie
Hangover is quite different in many areas such as producers,
directors, casts, studios, etc. A similar situation exists for the
pair {Matrix, Blade-Runner}. The movies Blade-Runner and
Matrix are different in several areas regardless of both
belonging to a science fiction movies category. This is due to
the measurement of context-free similarity, which takes into
account all features of a resource without considering any
particular area of focus. The pair {Matrix, Blade-runner} has
only one common triple, which ‘‘disappears’’ when matched
against all 176 triples of Matrix.
In the ordered pair {Matrix, Matrix-music}, the similarity of
the movie Matrix is measured to the soundtrack album Matrix-
music of the movie Matrix. The number of matching RDF triples
between these two resources is negligible, which leads to a very
small value of similarity. Similarly, the pair {Matrix, Toyota}
has necessity of similarity equal to zero, as no matching RDF
triples are found between the entities. In contrast, the values
of possibility of similarity for the pairs {Matrix, Matrix-music}
and {Matrix, Toyota} are well below 1.0. This indicates the
existence of a higher number of RDF triples that are definitely
different for both pairs (large values of S4). This also means
that the range of similarity values is smaller compared to the
previous pairs.
Overall, it can be inferred that context-free similarity
provides an unbiased measure of similarity between two
resources based on all the available information about the
resources, without restricting attention to any particular context.
Among the RDF triples of all selected resources (movies, a
soundtrack, and a car brand) there are many triples which are
not very informative and could be considered noise.
5.2 Context-aware similarity
For context-aware similarity, the desired context imposes a
constraint on the scenarios. The same set of pairs intro-
duced in Sect. 5.1 is considered for the evaluation. The
experimental results are shown in Table 4. The added
column ‘‘Context’’ indicates the desired context in the
similarity assessment.
Similarity for the ordered pair {Matrix, Blade-Runner}
is computed within the context of the property subject. It is
worth noting that subject is the property that links the
resources Matrix and Blade-Runner to other resources
representing categories that Matrix and Blade-Runner
belong to. Therefore, all other properties of the two
resources {Matrix, Blade-Runner} are discarded while the
only property that is taken into account is subject. For
example, subject links the movie Matrix to the following
resources: 1990s_action_films, American_action_films,
Cyberpunk_films, Silver_Pictures_films,
Films_set_in_the_22nd_century, etc. See Fig. 7. For more information on
definition of context in this paper, see Sect. 4.
Thus, the context-based similarity for {Matrix, Blade-
Runner} is calculated based on Eqs. (32) and (33) as
follows:
$$N_F^{cntx}(sim[Matrix, Blade\text{-}runner]) = \frac{n^i_{S1}(P_{cntx} = subject)}{n^i(P_{cntx})} = \frac{10}{27}$$
Table 4 Context-aware similarity values
Ordered pairs {ri, rj}      Context (property)   Necessity of similarity   Possibility of similarity   Similarity interval
                                                 NF(sim[ri, rj])           PF(sim[ri, rj])             sim[ri, rj]
{Matrix, Matrix-reloaded}   Starring             0.80                      0.80                        ⟨0.80, 0.80⟩
{Matrix, Blade-runner}      Subject              0.37                      0.48                        ⟨0.37, 0.48⟩
{Matrix, Hangover}          Distributor          1.00                      1.00                        ⟨1.00, 1.00⟩
{Matrix, Matrix-music}      Type                 0.08                      0.13                        ⟨0.08, 0.13⟩
{Matrix, Toyota}            Label                0.00                      0.00                        ⟨0.00, 0.00⟩
$$P_F^{cntx}(sim[Matrix, Blade\text{-}runner]) = \frac{n^i_{S1}(P_{cntx} = subject) + n^i_{S2C}(P_{cntx} = subject) + n^i_{S3C}(P_{cntx} = subject)}{n^i(P_{cntx})} = \frac{10 + 3}{27} = \frac{13}{27}$$
$$sim[r_i, r_j] \in \langle 0.37, 0.48 \rangle$$
The result shows that although the movies {Matrix, Blade-
Runner} have a low context-free necessity of similarity (0.09),
their similarity in the context of subject is considerably higher,
with a much narrower uncertainty interval. This is because their
features in the context of subject are similar, as both are
American science fiction movies.
Next, the ordered pair {Matrix, Matrix-Reloaded} is
evaluated within the context starring, which lists the
cast of a movie. Based on the result, the calculated
similarity interval is ⟨0.80, 0.85⟩. This means that there is
very high certainty about their similarity value in the
context of starring.
The pair {Matrix, Hangover} is compared in the context of
distributor. This property contains the names of the distributors
of each movie. Because the two movies share the same distributor,
Warner-Brothers, the similarity interval is ⟨1.0, 1.0⟩. It is
worth noting that the context-free similarity interval for the
pair {Matrix, Hangover} was obtained as ⟨0.09, 0.94⟩. This again
demonstrates the importance of the context and its effect
on the similarity assessment.
The context considered in the similarity evaluation of
the pair {Matrix, Matrix-music} is type. Type is the property
that connects a resource to the classes it belongs to
through RDF triples. For example, information about the
type in the resource Matrix includes: CreativeWork, Movie,
KungFuFilms, Film, Work, etc. Similarly, for Matrix-music
the property type connects Matrix-music to: Album,
MusicalWork, MusicalComposition, etc. This information is
evaluated in both resources Matrix and Matrix-music in
order to obtain the context-aware measure.
Lastly, the property label is considered as the context of
similarity for the pair {Matrix, Toyota}. The label property
provides the names used to describe a particular
resource. Zero values in the similarity interval of
{Matrix, Toyota} show no similar or related information
between the two resources in this context.
The above examples show the influence of the context on
the similarity measure. As can be seen, the uncertainty
intervals in context-aware similarities are narrower than in
context-free measures, which means higher confidence in
context-aware similarity measures. This type of similarity
is more often used in real-life scenarios, especially in
situations involving human judgment.
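The claim about narrower intervals can be checked directly against Tables 3 and 4. The Python sketch below only restates the tabulated bounds and computes interval widths; note that each pair in Table 4 is evaluated in a different context, so the comparison is indicative rather than a controlled experiment.

```python
# Interval bounds copied from Table 3 (context-free) and Table 4 (context-aware).
context_free = {
    ("Matrix", "Matrix-reloaded"): (0.35, 0.95),
    ("Matrix", "Blade-runner"): (0.09, 0.88),
    ("Matrix", "Hangover"): (0.09, 0.94),
    ("Matrix", "Matrix-music"): (0.01, 0.73),
    ("Matrix", "Toyota"): (0.00, 0.71),
}
context_aware = {
    ("Matrix", "Matrix-reloaded"): (0.80, 0.80),
    ("Matrix", "Blade-runner"): (0.37, 0.48),
    ("Matrix", "Hangover"): (1.00, 1.00),
    ("Matrix", "Matrix-music"): (0.08, 0.13),
    ("Matrix", "Toyota"): (0.00, 0.00),
}


def width(bounds):
    """Width of a similarity interval, rounded to two decimals."""
    return round(bounds[1] - bounds[0], 2)


# Map each pair to (context-free width, context-aware width).
widths = {pair: (width(context_free[pair]), width(context_aware[pair]))
          for pair in context_free}
all_narrower = all(aware < free for free, aware in widths.values())
```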
6 Validation and comparative experiments
6.1 Overview
The process of determining similarity between two entities
is not uniform: numerous methods and techniques have
proposed different definitions and interpretations of the
similarity concept (see Sect. 7). Two main factors influencing
similarity estimation techniques are formats of
information representation and levels of abstraction. The
method proposed here treats LD as a vast network of
interconnected entities that define semantics of its elements
(entities) via their ‘‘involvement’’ in connections, and the
proposed technique uses these connections for similarity
assessment.
Thus, comparison of the proposed method with other
similarity measures imposes some challenges and requires
Fig. 7 Values of the property
Subject for the movie Matrix in
DBPedia
some modifications and adjustments. Our proposed method
does not take into account any structure existing between
resources under evaluation. One of the advantages of the
approach is its ability to determine similarity between any
two resources existing in an RDF-based repository,
be it the web, a local database, or a distributed
network of information. On top of that, obtained results are
not single values but intervals with minimum (pessimistic)
and maximum (optimistic) limits. These characteristics of
the proposed method make the comparison process quite
challenging. Thus, we need to adjust our method or the
ones in the literature to make such comparison meaningful.
The comparative experiments with other measures have
been grouped into three parts: first, we target the compar-
ison with measures that are applicable to information rep-
resented by RDF triples. Second, we focus on measures
that are particularly designed for hierarchically organized
data. Third, we illustrate the ability of our proposed method
to deal with RDF-based data created by different users or
tools, and stored at different data sources.
6.2 RDF-triples as data representation
For the first set of experiments, a number of feature-based
methods are selected. These approaches include: Tversky
(1977), Dice (Frakes and Baeza-Yates 1992) as well as the
concept-based similarity method by Boros (Boros et al.
1996). Boros’ approach is adapted here such that it eval-
uates the substitutions, deletions and additions of RDF
triples not words or concepts as originally described in
(Boros et al. 1996). In addition to the above measures,
latent semantic analysis (LSA) (Landauer and Dumais
1997) is chosen from natural language processing tech-
niques since it represents an interesting approach to
determine similarity levels between words or blocks of
text. Latent semantic analysis extracts coherence of words
by statistical computations over a large corpus of text; for
more information about LSA, see (Deerwester et al. 1990;
Landauer et al. 1998). To compare our method with LSA,
the datasets of the entities are extracted from Wikipedia
and are transformed into LSA semantic spaces, high
dimensional vectors representing each entity, using Text to
Matrix Generator (TMG).17 TMG is a MATLAB toolbox
that can be used for various data mining and information
retrieval tasks. It must be mentioned that the potential
range of the LSA similarity measure is [-1, +1].
For comparison purposes, we have selected 12 pairs of
real-world entities extracted from DBpedia as shown in
Table 5. The pairs from #1 to #3, #4 to #8, and #9 to #12
are selected as very similar, relatively similar, and dis-
similar entities, respectively. For the comparison to be fair
and meaningful, we use our context-free approach since
none of the other approaches consider context in their
similarity formulation.
Usefulness of our method can be explained when simi-
larity values of different measures are compared with our
method. The obtained intervals in our approach include the
values computed in Tversky’s and Dice’s methods. This
can be acknowledged knowing the underlying principle of
these two methods: they are simply ratios of common to
total features. The fact that in most cases they are slightly
higher than the necessity values (lower bounds) comes
from the fact they do not evaluate the similarity of prop-
erties. In other words, only features are taken into account
in these methods, while the similarities of properties are
ignored. This could lead to potentially unreliable values of
similarity in LD. In our method, the benefit of the similarity
interval is that both bounds are certain: the lower bound
represents the similarity that is definitely present, while the
upper bound indicates the maximum possible value of the
similarity.
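For reference, the two feature-based ratios discussed above can be sketched in Python using their standard set-based definitions. The parameterization used in the experiments is not stated in the text, so the defaults `alpha = beta = 1.0` are an assumption, and the toy feature sets are hypothetical.

```python
def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))


def tversky(a, b, alpha=1.0, beta=1.0):
    """Tversky ratio model: |A ∩ B| / (|A ∩ B| + alpha|A - B| + beta|B - A|)."""
    common = len(a & b)
    return common / (common + alpha * len(a - b) + beta * len(b - a))


# Hypothetical feature sets: only the connected resources are compared,
# the properties that link them are ignored.
features_a = {"Keanu_Reeves", "Warner_Bros", "Science_fiction", "1999"}
features_b = {"Keanu_Reeves", "Warner_Bros", "2003"}

dice_value = dice(features_a, features_b)        # 2*2 / (4+3)
tversky_value = tversky(features_a, features_b)  # 2 / (2 + 2 + 1)
```

With `alpha = beta = 0.5` the Tversky ratio coincides with the Dice coefficient, so the differing Tversky and Dice columns in Table 5 imply that other weights were used. Both functions assume non-empty feature sets.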
An important observation can be made for the values
obtained from Boros’ approach (Boros et al. 1996). As
can be seen, Boros’ values are the same as the lower bounds of
the intervals in our method. In other words, the Boros
approach, as adapted here, is similar to our scenario S1.
Based on the results from LSA approach, we can state
that LSA derives reasonably high similarity values for very
similar pairs, i.e., pairs from #1 to #3. However, results
confirm that LSA cannot correctly distinguish the dissim-
ilarity between the taxonomically different pairs #4, #10,
and #12, while our method performed well. In general,
LSA similarity measures are quite high when applied to
less similar and dissimilar pairs compared with our intuition.
This observation conforms to the extensive studies
presented in (Simmons and Estes 2006). While LSA is a
good model for some applications, its requirement for
generating semantic spaces, SVD computations, and
dimensional reduction methods are computationally
expensive for large datasets. Thus, methods based on LSA
have not been used with enormous corpora such as LD.
The triple-based nature of LD evidently induces the
similarity assessment approaches to be capable of explor-
ing interconnected RDF triples as well as the semantics
behind resources and properties in metadata. The similarity
intervals found by our analysis of the semantic space,
formed by RDF triples, show a somewhat different picture
of similarity assessment in LD environment.
17 http://scgroup20.ceid.upatras.gr:8000/tmg/.
6.3 Taxonomy as data representation
The second set of experiments is an attempt to compare our
proposed method with taxonomy-based measures. Two
taxonomy-based measures are selected: Wu and Palmer
(1994), and Leacock and Chodorow (1998). Both of these
measures determine similarity based on the location of
entities in the taxonomy. For this reason, we use DBpedia
ontology,18 which consists of more than 320 classes, where
these classes are organized in the hierarchy with maximum
depth of seven. The selected pairs (Table 5) belong to such
classes as: film (depth three), video game (depth four), and
automobile (depth three).
The obtained results in Table 6 suggest the inadequacy
of these measures for the pairs #1 to #3, #5 to #9, and #11.
All obtained values are the same: 0.67 for Wu and Palmer
(maximum is 1.0), and 1.1 for Leacock and Chodorow, LCH
(maximum is plus infinity). Hence, the distinction between the
pairs (#1 to #3, #5 to #9, and #11) is not reflected in the
selected taxonomy-based methods. This can be explained by the
fact that these measures limit their attention to is-a links and
are sensitive to
taxonomic and not semantic relations between the entities.
In other words their values fully depend on the structure of
ontology. These methods are not able to distinguish
between different individuals of one class. The obtained
values for the pairs #4, #10 and #12 are different since
entities in each of these pairs belong to different classes.
These values seem to be more reasonable since the amount
of information embedded in the taxonomy for these pairs is
higher.
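The insensitivity of taxonomy-based measures to individuals of the same class can be illustrated with a small Python sketch of the Wu and Palmer measure, 2·depth(lcs) / (depth(a) + depth(b)). The class hierarchy below is a hypothetical fragment (root at depth one, film at depth three, video game at depth four), and the entity-to-class mapping is assumed; the sketch does not attempt to reproduce the values in Table 6, which depend on the depth conventions used in the experiments.

```python
# Hypothetical fragment of a class hierarchy: child -> parent.
PARENT = {
    "Thing": None,
    "Work": "Thing",
    "Film": "Work",
    "Software": "Work",
    "VideoGame": "Software",
}
# Assumed mapping from entities to their classes.
CLASS_OF = {
    "Matrix": "Film",
    "Hangover": "Film",
    "Blade-runner": "Film",
    "Angry-birds": "VideoGame",
}


def path_to_root(cls):
    """Return [cls, parent, ..., root]."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = PARENT[cls]
    return path


def depth(cls):
    """Depth of a class, counting the root as depth 1."""
    return len(path_to_root(cls))


def lowest_common_subsumer(a, b):
    ancestors_b = set(path_to_root(b))
    for cls in path_to_root(a):
        if cls in ancestors_b:
            return cls
    raise ValueError("classes share no ancestor")


def wu_palmer(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def entity_similarity(x, y):
    """Similarity of two entities, determined only by their classes."""
    return wu_palmer(CLASS_OF[x], CLASS_OF[y])
```

Because only the classes enter the formula, `entity_similarity("Matrix", "Hangover")` and `entity_similarity("Matrix", "Blade-runner")` are necessarily identical, which is the effect observed for the same-class pairs in Table 6.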
6.4 Multiple-source RDF-triples as data representation
In order to evaluate the applicability of the proposed
method with RDF triples created by different individuals or
tools, we conducted an experiment in which we determine
the similarity of entities that belong to two different
interlinked datasets, besides DBpedia, in the LD cloud. One of
them is the Linked Movie DataBase19 (LinkedMDB),
which has more than six million triples with over 162,000
triples that are interlinked with other datasets such as
DBpedia and YAGO.20 YAGO is another dataset contain-
ing more than two million entities about people, objects,
cities, etc. Five pairs have been selected among three
datasets presented above, shown in Table 7. Movies in
every pair belong to different datasets.
Pair #1, two award-winning psychological thrillers
directed by Alfred Hitchcock, is selected as a pair of very
similar movies. In pair #2, the movie Woman-in-Green is
paired with the movie Sherlock Holmes; both movies are
about Holmes and his investigations. Each movie is defined
by RDF triples that belong to a different dataset. The movies
in pair #3 are thrillers directed by the same director. Pair #5
contains movies with the same actor. Pair #4 is chosen as
a pair of dissimilar movies. Due to compatibility of our method with
Table 5 Comparison of our approach to other related methods
#    Ordered pairs               Corpus-based         Feature-based   Feature-based            Concept-based   LD-based
                                 LSA (Landauer and    Tversky         Dice (Frakes and         Boros et al.    Our method
                                 Dumais 1997)         (1977)          Baeza-Yates 1992)        (1996)
1    {Matrix, Matrix-reloaded}   0.96                 0.40            0.38                     0.35            ⟨0.35, 0.95⟩
2    {Good-fellas, God-father}   0.92                 0.29            0.37                     0.20            ⟨0.20, 0.97⟩
3    {Jaws, Jurassic-Park}       0.87                 0.54            0.68                     0.20            ⟨0.20, 0.90⟩
4    {Matrix, Matrix-music}      0.90                 0.25            0.20                     0.01            ⟨0.01, 0.73⟩
5    {Star-wars, Star-trek}      0.79                 0.42            0.36                     0.30            ⟨0.30, 0.56⟩
6    {Jurassic-Park, Godzilla}   0.66                 0.02            0.03                     0.02            ⟨0.02, 0.24⟩
7    {Spider-man, I-robot}       0.75                 0.25            0.30                     0.10            ⟨0.10, 0.55⟩
8    {Matrix, Blade-runner}      0.85                 0.55            0.40                     0.09            ⟨0.09, 0.88⟩
9    {Matrix, Hangover}          0.87                 0.10            0.15                     0.09            ⟨0.09, 0.94⟩
10   {Matrix, Toyota}            0.58                 0.20            0.32                     0.00            ⟨0.00, 0.71⟩
11   {Pulp-fiction, Wall-E}      0.89                 0.09            0.09                     0.08            ⟨0.08, 0.15⟩
12   {Angry-birds, Titanic}      0.63                 0.04            0.05                     0.02            ⟨0.02, 0.10⟩
Table 6 Similarity values of taxonomy-based methods (pair# is taken from Table 5)
Pair#                                  Wu and Palmer (1994)   Leacock and Chodorow (1998)
#1, #2, #3, #5, #6, #7, #8, #9, #11    0.67                   1.1
#4                                     0.34                   0.48
#10                                    0.30                   0.42
#12                                    0.34                   0.48
18 http://wiki.dbpedia.org/Ontology.
19 http://www.linkedmdb.org/.
20 http://www.mpi-inf.mpg.de/yago-naga/yago/.
the triple-based nature of any dataset in LD, similarities
of the pairs yield sensible results. Finally, we must
acknowledge that future research could improve the results
by further decreasing the ambiguity in similarity intervals
as previously explained in Sects. 3 and 4.
7 Related work
There are several methods for assessing the semantic
similarity among entities that aim to improve on traditional
similarity measurement methods. It is pivotal that the calculated
similarity by these methods corresponds closely to the
human judgment of similarity. Many approaches in the
literature (Taylor 2010; Pedersen et al. 2007) are based on
ontology, taxonomy of terms that describe a certain area of
knowledge. The most popular definition of ontology says
that ‘‘an ontology is a specification of a conceptualization’’
(Gruber 1993). Ontology compared to LD is an engineered
structured model of information. Surprisingly, there are
few proposals in the literature related to similarity
assessment in LD. In this section, a number of approaches related
to both areas are investigated. The majority of today’s
methods are classified based on two underlying approa-
ches: context-free and context-aware.
7.1 Context-free models
Similarity between entities in (Rada et al. 1989) is calcu-
lated based on the number of links separating the two
entities in a semantic net, which is known as the edge-
counting method. Indeed, the more links separate the
entities, the less similar they are. In general, links
between entities convey a meaning denoting the semantics
and context that affect the similarity between entities. In
(Rada et al. 1989), only is-a link types are considered in the
calculation of similarity, so neither semantics nor context
is taken into consideration. To date, due to the
simplicity and ease of implementation of this method,
several approaches (Wu and Palmer 1994; Leacock and
Chodorow 1998; Li et al. 2003) leverage this technique by
combining it into more efficient and semantics-oriented
approaches. However, the focus of edge-counting
approaches is mainly on the number of links in a hierarchy
of entities without considering other parameters that affect
semantics and context.
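The edge-counting idea can be sketched in Python as a shortest-path search over is-a links. The hierarchy fragment and the function names are hypothetical and serve only to show what such a measure does and does not see.

```python
from collections import deque

# Hypothetical is-a links: (child, parent).
IS_A = [
    ("Film", "Work"),
    ("Software", "Work"),
    ("VideoGame", "Software"),
    ("Work", "Thing"),
    ("Automobile", "MeanOfTransportation"),
    ("MeanOfTransportation", "Thing"),
]


def build_graph(edges):
    """Undirected adjacency sets built from (child, parent) pairs."""
    graph = {}
    for child, parent in edges:
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def edge_count(graph, a, b):
    """Number of is-a links on the shortest path between a and b, or None."""
    if a not in graph or b not in graph:
        return None
    dist = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return dist[node]
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


graph = build_graph(IS_A)
film_to_game = edge_count(graph, "Film", "VideoGame")
film_to_car = edge_count(graph, "Film", "Automobile")
```

A larger count means lower similarity. Nothing in the computation refers to properties other than is-a, which is the limitation noted above.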
Resnik (1995), Lin (1998) and Giuseppe (2009) propose
methods that are based on entities’ information that can be
measured in different ways. Resnik (1995) argues that the
similarity of two entities depends on the shared amount of
information between them. In (Resnik 1995), the common
information is quantified using the least common subsumer
of entities in the hierarchy of concepts. Taking a similar
approach, (Lin 1998) presents an information theoretic
measure related to the amount of information required to
describe each entity, as well as information related to the
commonality of the two entities. According to Lin (1998),
similarity between two entities is the weighted average of
the computed similarity from different perspectives. The
drawback of information theoretic approaches is their need
for external information, such as a probabilistic model of
the domain in order to determine the relatedness of entities.
In particular, the problem with application of information
theoretic approaches in LD is that the information in LD is
not represented in a form of ontology; thus, evaluating the
required criteria in these approaches such as locating a least
common subsumer of two entities and frequency count of a
concept in a hierarchy is not applicable.
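To make the dependence on external information concrete, the following Python sketch computes the standard information-content measures of Resnik, IC(lcs), and Lin, 2·IC(lcs) / (IC(a) + IC(b)), with IC(c) = −log p(c). Both the hierarchy and the concept probabilities are hypothetical inputs; they are exactly the ingredients that LD does not provide directly.

```python
import math

# Hypothetical hierarchy (child -> parent) and concept probabilities.
PARENT = {
    "Thing": None,
    "Work": "Thing",
    "Film": "Work",
    "Software": "Work",
    "VideoGame": "Software",
}
PROB = {"Thing": 1.0, "Work": 0.5, "Film": 0.2, "Software": 0.1, "VideoGame": 0.05}


def path_to_root(concept):
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def least_common_subsumer(a, b):
    ancestors_b = set(path_to_root(b))
    for concept in path_to_root(a):
        if concept in ancestors_b:
            return concept
    raise ValueError("concepts share no ancestor")


def information_content(concept):
    """IC(c) = -log p(c); rarer concepts carry more information."""
    return -math.log(PROB[concept])


def resnik(a, b):
    return information_content(least_common_subsumer(a, b))


def lin(a, b):
    shared = information_content(least_common_subsumer(a, b))
    return 2 * shared / (information_content(a) + information_content(b))


resnik_value = resnik("Film", "VideoGame")  # IC("Work") = log 2
lin_value = lin("Film", "VideoGame")
```

Without a concept hierarchy to locate the least common subsumer and a corpus to estimate the probabilities, neither function can be evaluated.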
A number of approaches (Leacock and Chodorow 1998;
Oliva et al. 2011) leverage lexical information found in
external knowledge repositories such as WordNet—an
English lexical database of concepts and relations—to
quantify the semantic similarity of concepts. Among these,
(Oliva et al. 2011) captures and combines lexical and
syntactic information based on WordNet and a deep pars-
ing process respectively to assess similarity between short
sentences. In (Oliva et al. 2011), lexical information is
obtained using the hierarchical structure of WordNet and
different glossaries related to each entity. In syntactic
measurement, (Oliva et al. 2011) emphasize the
importance of different syntactic roles of words in human
judgment of similarity. For this purpose, they assign
different weights to different word syntactic roles such as
verb, subject, object and adverb. In (Navigli and Velardi
2005), a semantic graph based on the extracted lexical
information of entities from a variety of lexical knowledge
Table 7 Similarity assessment using our approach of entities belonging to different datasets
#   Ordered pairs                            LinkedMDB      DBpedia           YAGO                 Similarity measure
1   {Vertigo, Psycho}                        Vertigo        Psycho            –                    ⟨0.42, 0.66⟩
2   {The Woman in Green, Sherlock Holmes}    –              Sherlock Holmes   The Woman in Green   ⟨0.74, 0.96⟩
3   {Panic Room, Zodiac}                     Panic Room     –                 Zodiac               ⟨0.51, 0.89⟩
4   {King Arthur, Ghost}                     Ghost          –                 King Arthur          ⟨0.06, 0.45⟩
5   {Men in Black, Bad Boys}                 Men in Black   Bad Boys          –                    ⟨0.25, 0.78⟩
repositories is created. Lexical information in (Navigli and
Velardi 2005) includes different types of semantic relation
such as synsets, glosses, hyperonymy–hyponymy, meron-
ymy–holonymy that can be found in a lexical resource.
In a similar approach, (Giunchiglia et al. 2007) performs
a graph similarity matching by analyzing the semantic
relations and the meanings of entities according to the
lexical knowledge and structure of schemas, respectively.
In (Giunchiglia et al. 2007), semantic relations include
equivalence, more general, less general and disjointness.
The matching problem is then converted into a proposi-
tional validity problem using a predefined set of rules to
verify the semantic relations between entities.
The approach in (Hliaoutakis et al. 2006) is a combi-
nation method based on lexical information using WordNet
and MeSH as well as the information content found in
document corpus. In (Hliaoutakis et al. 2006), it is shown
that lexical difference of two entities does not necessarily
mean that the entities are semantically different. For a
complete comparative evaluation of state-of-the-art simi-
larity methods, see (Oliva et al. 2011; Hliaoutakis et al.
2006). Authors of (Hliaoutakis et al. 2006) argue that
lexical term matching techniques such as vector space and
probabilistic models are not semantics-oriented and are thus
not well suited for use on the semantic web.
7.2 Context-aware models
There exist similarity assessment methods (Volz et al.
2009; Han et al. 2006; D. Hossein Zadeh and Reformat
2012b, c) that focus on entities’ features and context. In
LD, every resource is defined with a number of properties,
thus allowing the entities to differ not only in label, but also
in machine-understandable information.
Han et al. (2006) presents an approach to determine the
similarity between entities from different ontologies. Their
approach takes into consideration the number of features
associated with each entity as well as positions of the
entities in the structure. In their similarity model, weights
are assigned to the entities regarding their position in the
ontology, e.g., entities at a lower level are considered more
meaningful than the entities at higher levels; thus accred-
ited with higher weights. In (Han et al. 2006), contextual
similarity between entities is defined according to the
combined similarity between two entities and their adjacent
entities across neighborhoods.
As stated before, the number of approaches in the literature
addressing similarity measurement in LD is quite low.
Among the approaches dealing with LD, Sheng et al. (2010)
propose a semantic similarity method using a combination
of lexical, structural, and corpus-statistics information. A
variety of semantic relations from WordNet is used as a
lexical knowledge and is combined with taxonomy
information such as depth of an entity and the distance
between two entities in the schema. In (Sheng et al. 2010),
corpus-statistic is a measure of co-occurrence frequency of
two entities and their synonyms obtained from WordNet.
Semantic is explored in depth by extending the similarity
measure to features of each entity, where the similarity is
calculated between the corresponding features of two
entities. Furthermore, context is taken into account by
assigning different weights to every feature according to
their importance in the domain, which are assigned man-
ually based on the empirical experience of the domain
expert. The approach in Sheng et al. (2010) is only appli-
cable to entities described in one dataset and does not scale
up to the web of interlinking complex data sources.
The authors of Albertoni et al. (2008) discuss the problems
associated with context similarity in the web of data. The
investigated problems are mainly due to the heterogeneous
nature of LD, such as URI inconsistency and overlapping
metadata vocabularies. In Albertoni et al. (2008), URIs of
resources are dereferenced to exploit the properties of each
entity through the extracted RDF elements. The collected
properties are then used as input to the context-enabled
ontology-based similarity method in Albertoni and De
Martino (2010). According to Albertoni et al. (2008),
limiting the process of URI dereferencing to a specific
context reduces computational complexity and improves the
efficiency of the similarity assessment.
The majority of the methods above are not suitable as
similarity measurement techniques in LD, given
that they are designed to be used with ontologies. Our work is
distinguished from the ontology-based approaches in that it
deals with LD, in which data sources do not follow any
consistent schema or taxonomy. In fact, the information
content and taxonomy of different datasets are not directly
comparable; therefore, information-theoretic and
taxonomy-based methods do not perform well when two
entities belong to different data sources. Furthermore,
semantics and context are poorly captured in edge-counting
methods. This implies that these methods are ill-suited to
LD, which is composed of a large number of
various datasets connected together. On the other hand,
feature-based methods, which measure similarity based on
common features of entities, are best suited when entities
belong to different datasets.
8 Conclusion and future work
Linked data is a new method of publishing information on
the web. Any piece of information is represented as a triple.
All triples can be interconnected—each component of one
triple can be a component of an unlimited number of other
triples. As a result, a vast network of information is
formed. This allows us to treat LD as a space containing
semantic definitions of its elements. The introduction of
LD on the one hand, and the need for semantics-based analysis
of relatedness on the other, triggered our interest in
research addressing those issues.
In this paper, we have proposed a novel semantic sim-
ilarity measurement method that is fully dependent on
connections in LD. The method depends on the relatedness
of resources’ RDF-triples. It accommodates concepts of
necessity and possibility, while mechanisms of their
manipulation are taken from the possibility theory. The
obtained similarity measure is advantageous since it pro-
vides realistic lower and upper similarity bounds between
any pair of entities in LD. An important aspect of the
proposed method is its ability to determine similarity in
context-aware situations. Utilizing types of connections
creates a neat way of looking at different facets of simi-
larity between entities.
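The role of a context as a set of connection types can be illustrated with a minimal sketch. It is not the method of this paper: the triples, the `ex:` names, and the helper functions `features` and `shared_fraction` are hypothetical, and a plain overlap fraction stands in for the possibility-based measure.

```python
# Hypothetical triples (subject, property, object); all names are illustrative.
TRIPLES = [
    ("ex:BookA", "ex:author", "ex:Smith"),
    ("ex:BookA", "ex:genre", "ex:Fantasy"),
    ("ex:BookA", "ex:publisher", "ex:Acme"),
    ("ex:BookB", "ex:author", "ex:Smith"),
    ("ex:BookB", "ex:genre", "ex:Crime"),
    ("ex:BookB", "ex:publisher", "ex:Acme"),
]


def features(entity, triples, context=None):
    """Return the (property, object) pairs of an entity.

    If a context (a set of property names) is given, only connections
    of those types are kept.
    """
    return {
        (p, o)
        for s, p, o in triples
        if s == entity and (context is None or p in context)
    }


def shared_fraction(a, b, triples, context=None):
    """Fraction of a's features (within the context) that b also has.

    This simple ratio is an assumption made for illustration only.
    """
    fa = features(a, triples, context)
    fb = features(b, triples, context)
    return len(fa & fb) / len(fa) if fa else 0.0


# Context-free: all connection types are considered.
context_free = shared_fraction("ex:BookA", "ex:BookB", TRIPLES)
# Context-aware: only selected connection types are considered.
by_genre = shared_fraction("ex:BookA", "ex:BookB", TRIPLES, context={"ex:genre"})
by_origin = shared_fraction(
    "ex:BookA", "ex:BookB", TRIPLES, context={"ex:author", "ex:publisher"}
)
```

In this toy data the two books look different in the genre context and identical in the author and publisher context, which is the kind of facet-specific view that a context is meant to provide.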
The asymmetric nature of the proposed similarity measure
could be seen as a thought-provoking feature of the method. It
is in line with the work of Tversky (1977) and Nosofsky
(1991), who claimed that the direction of asymmetries could
be associated with prominence of compared entities. Tversky
pointed out that prominence is related to salience, familiarity,
and goodness in form and informational content. It can be
stated that less prominent entities are more similar to more
prominent entities than the other way around. In the light of
that, the proposed method could be applied to determine the
level of prominence of compared objects. It is very likely that
adding relative prominence to the similarity assessment would
increase its descriptive power. This idea alone will be inves-
tigated more as an extension of the work presented here.
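The link between asymmetry and prominence can be made concrete with the ratio model of Tversky (1977), S(a, b) = f(A ∩ B) / (f(A ∩ B) + α f(A − B) + β f(B − A)). The sketch below uses that model, not the measure proposed in this paper. Taking f as set cardinality, choosing α = 0.8 and β = 0.2, and the feature sets themselves are assumptions made for illustration.

```python
def tversky_ratio(a, b, alpha=0.8, beta=0.2):
    """Tversky's ratio model with f taken as set cardinality.

    alpha weights features unique to a (the subject of the comparison),
    beta weights features unique to b (the referent). With alpha > beta
    the measure is asymmetric.
    """
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Hypothetical feature sets: the "prominent" entity has more known features.
less_prominent = {("ex:type", "ex:City"), ("ex:country", "ex:Canada")}
more_prominent = less_prominent | {
    ("ex:hasAirport", "ex:Airport1"),
    ("ex:hasUniversity", "ex:University1"),
}

s_small_to_large = tversky_ratio(less_prominent, more_prominent)
s_large_to_small = tversky_ratio(more_prominent, less_prominent)
```

With these assumed weights, the less prominent entity is judged more similar to the more prominent one (about 0.83) than the other way around (about 0.56).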
Another important aspect of the proposed method is the
fact that it provides not a single similarity value but a
similarity interval. The lower bound of the interval can be
interpreted as the similarity in which we have high
confidence, and the upper bound as the maximum possible
similarity when all ambiguities are “resolved
positively”. At the same time, the width of this interval can be
interpreted as the level of uncertainty in the similarity
assessment—larger width indicates higher uncertainty,
while a narrower interval means higher confidence in the
provided assessment. This interpretation is fully compati-
ble with our interpretations of the scenario S1 as the lower
bound, and S4 as the upper bound. Further investigation of
scenarios S2 and S3 lowers the upper bound, which
means increased confidence in the obtained similarity
assessment. However, activities related to “solving”
scenarios S2 and S3 are computationally expensive, so
they can be performed after a crude assessment is done.
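The interval interpretation above can be sketched in a few lines. This is a simplified, count-based illustration and does not reproduce the possibility-theory formulas of the paper. The reading of S2 as "same property, different object" and of S3 as "different property, same object", the tie-breaking order, and the helper names `classify` and `similarity_interval` are all assumptions.

```python
from collections import Counter


def classify(feature, target):
    """Assign one feature of the source entity to a scenario S1..S4.

    S1: identical feature in the target (as stated in the text).
    S2: same property, different object (assumed reading).
    S3: different property, same object (assumed reading).
    S4: entirely different feature (as stated in the text).
    """
    prop, obj = feature
    target_props = {p for p, _ in target}
    target_objs = {o for _, o in target}
    if feature in target:
        return "S1"
    if prop in target_props:
        return "S2"
    if obj in target_objs:
        return "S3"
    return "S4"


def similarity_interval(source, target):
    """Return (lower, upper) bounds of similarity of source to target.

    lower counts only identical features (S1); upper excludes only
    entirely different features (S4), i.e. S2 and S3 are treated as
    if they were resolved positively.
    """
    n = len(source)
    if n == 0:
        return (0.0, 0.0)
    counts = Counter(classify(f, target) for f in source)
    lower = counts["S1"] / n
    upper = 1.0 - counts["S4"] / n
    return (lower, upper)


# Hypothetical (property, object) features of two entities.
a = {
    ("ex:author", "ex:Smith"),    # S1: identical feature in b
    ("ex:genre", "ex:Fantasy"),   # S2: b has ex:genre with another object
    ("ex:editor", "ex:Jones"),    # S3: b links to ex:Jones via another property
    ("ex:award", "ex:Prize1"),    # S4: nothing comparable in b
}
b = {
    ("ex:author", "ex:Smith"),
    ("ex:genre", "ex:Crime"),
    ("ex:translator", "ex:Jones"),
}

interval = similarity_interval(a, b)
width = interval[1] - interval[0]
```

For this toy data the interval is (0.25, 0.75). The width of 0.5 comes entirely from the two features in S2 and S3, so resolving them would narrow the interval.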
Overall, the proposed method constitutes an effective
way of determining similarity between entities in LD.
Using principles of possibility theory, the method allows
for assessing the similarity interval by identifying identical
features (scenario S1) and entirely different features
(scenario S4). Compared with other measures, it possesses
a number of advantages: direct applicability in the LD
environment, simple computations, the
ability to provide pessimistic (lower-bound) and optimistic
(upper-bound) estimates of similarity, the capability to deal
with specific facets of similarity in the form of different
contexts, and the potential to be further expanded to take full
advantage of RDF triples as a data representation format.
References
Albertoni R, De Martino M (2010) Semantic similarity and selection
of resources published according to linked data best practice. In:
OTM 2010 workshops on the move to meaningful internet
systems, pp 378–383
Albertoni R, Camossi E, De Martino M, Giannini F, Monti M (2008)
Context enabled semantic granularity. In: Knowledge-based
intelligent information and engineering systems, pp 682–688
Berners-Lee T, Hendler J (2001) Scientific publishing on the semantic
web. Nature 410:1023–1024
Bizer C, Heath T, Berners-Lee T (2009) Linked data-the story so far.
Int J Semant Web Inf Syst 4:1–22
Boros M, Eckert W, Gallwitz F, Gorz G, Hanrieder G, Niemann H
(1996) Towards understanding spontaneous speech: word accuracy
vs. concept accuracy in spoken language. Proceedings of fourth
international conference on ICSLP 96, vol 2, pp 1009–1012
D. Hossein Zadeh P, Reformat MZ (2012a) Assimilation of informa-
tion in linked data based knowledge base. In: 14th international
conference on information processing and management of uncer-
tainty in knowledge-based systems, Catania, 9–13 July 2012
D. Hossein Zadeh P, Reformat MZ (2012b) Feature-based similarity
assessment in ontology using fuzzy set theory. In: IEEE 2010
international conference on fuzzy systems (FUZZ), pp 1462–1468
D. Hossein Zadeh P, Reformat MZ (2012c) Ontology as knowledge
base for determining asymmetric and context-dependent simi-
larity. J Inf Sci (submitted)
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R
(1990) Indexing by latent semantic analysis. J Am Soc Inf Sci
41:391–407
DuBois D, Prade HM (1980) Fuzzy sets and systems: theory and
applications. Academic Press, New York
Dubois D, Prade H (2003) Possibility theory and its applications: a
retrospective and prospective view. In: FUZZ’03 the 12th IEEE
international conference on fuzzy systems, p 5
Dubois D, Prade H, Harding E (1988) Possibility theory: an approach
to computerized processing of uncertainty. Plenum press, New
York
Frakes WB, Baeza-Yates R (1992) Information retrieval: data
structures and algorithms, vol 7632. PTR Prentice-Hall Inc,
Englewood Cliffs
Giunchiglia F, Yatskevich M, Shvaiko P (2007) Semantic matching:
algorithms and implementation. J Data Semant IX, University of
Trento, Trento, pp 1–38
Giuseppe P (2009) A semantic similarity metric combining features
and intrinsic information content. Data Knowl Eng 68:1289–
1308
Gruber TR (1993) A translation approach to portable ontology
specifications. Knowl Acquis 5:199–220
Han L, Sun L, Chen G, Xie L (2006) ADSS: an approach to
determining semantic similarity. Adv Eng Softw 37:129–132
Hliaoutakis A, Varelas G, Voutsakis E, Petrakis EGM, Milios E
(2006) Information retrieval by semantic similarity. Int J Semant
Web Inf Syst 2:55–73
Johannesson M (1997) Modelling asymmetric similarity with prom-
inence. Lund University Cognitive Studies, Lund
Klir GJ, Folger TA (1988) Fuzzy sets, uncertainty, and information.
Prentice Hall, Englewood Cliffs
Landauer TK, Dumais ST (1997) A solution to Plato’s problem: the
latent semantic analysis theory of acquisition, induction, and
representation of knowledge. Psychol Rev 104:211
Landauer TK, Foltz PW, Laham D (1998) An introduction to latent
semantic analysis. Discourse Process 25:259–284
Lassila O, Swick R (1999) Resource description framework (RDF)
model and syntax specification. World wide web consortium
technical reports and publications. http://www.w3.org/TR/1999/
REC-rdf-syntax-19990222
Leacock C, Chodorow M (1998) Combining local context with
WordNet similarity for word sense identification. In: WordNet: a
lexical reference system and its application, MIT Press, Cam-
bridge, pp 265–283
Lee TB, Hendler J, Lassila O (2001) The semantic web. Sci Am
284:34–43
Leibniz GW (1975) Philosophical papers and letters. Kluwer
Academic Publishers, Dordrecht
Li Y, Bandar ZA, McLean D (2003) An approach for measuring
semantic similarity between words using multiple information
sources. Knowl Data Eng IEEE Trans 15:871–882
Lin D (1998) An information-theoretic definition of similarity. In:
Proceedings of the fifteenth international conference on machine
learning, Madison, pp 296–304
Navigli R, Velardi P (2005) Structural semantic interconnections: a
knowledge-based approach to word sense disambiguation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol
27, pp 1075–1086
Nosofsky RM (1991) Stimulus bias, asymmetric similarity, and
classification. Cognit Psychol 23:94–140
Oliva J, Serrano JI, Del Castillo MD, Iglesias A (2011) SyMSS: a
syntax-based measure for short-text semantic similarity. Data
Knowl Eng 70(4):390–405
Pedersen T, Pakhomov SVS, Patwardhan S, Chute CG (2007)
Measures of semantic similarity and relatedness in the biomed-
ical domain. J Biomed Inform 40:288–299
Rada R, Mili H, Bicknell E (1989) Development and application of a
metric on semantic nets. IEEE Trans Syst Man Cybern 19:17–30
Resnik P (1995) Using information content to evaluate semantic
similarity in a taxonomy. IJCAI 448–453
Shadbolt N, Hall W, Berners-Lee T (2006) The semantic web
revisited. IEEE Intell Syst 21:96–101
Sheng H, Chen H, Yu T, Feng Y (2010) Linked data based semantic
similarity and data mining. In: IEEE 2010 international confer-
ence on Information reuse and integration (IRI), pp 104–108
Simmons S, Estes Z (2006) Using latent semantic analysis to estimate
similarity. In: Proceedings of the Cognitive Science Society,
pp 2169–2173
Taylor JM (2010) Ontology-based view of natural language meaning:
the case of humor detection. J Ambient Intell Humaniz Comput
1:221–234
Tversky A (1977) Features of similarity. Psychol Rev 84:327–352
Volz J, Bizer C, Gaedke M, Kobilarov G (2009) Silk—a link
discovery framework for the web of data. In: Proceedings of the
second linked data on the Web workshop, Madrid
Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In:
Proceedings of the 32nd annual meeting on association for
computational linguistics, Las Cruces, pp 133–138
Zadeh LA (1999) Fuzzy sets as a basis for a theory of possibility.
Fuzzy Sets Syst 100:9–34