measuring semantic distance for linked open data … · measuring semantic distance for linked open...

6
Measuring Semantic Distance for Linked Open Data-enabled Recommender Systems Guangyuan Piao Insight Centre for Data Analytics National University of Ireland Galway IDA Business Park, Lower Dangan, Galway, Ireland [email protected] John G. Breslin Insight Centre for Data Analytics National University of Ireland Galway IDA Business Park, Lower Dangan, Galway, Ireland [email protected] ABSTRACT The Linked Open Data (LOD) initiative has been quite successful in terms of publishing and interlinking data on the Web. On top of the huge amount of interconnected data, measuring relatedness between resources and identify- ing their relatedness could be used for various applications such as LOD-enabled recommender systems. In this paper, we propose various distance measures, on top of the basic concept of Linked Data Semantic Distance (LDSD ), for cal- culating Linked Data semantic distance between resources that can be used in a LOD-enabled recommender system. We evaluated the distance measures in the context of a recommender system that provides the top-N recommen- dations with baseline methods such as LDSD. Results show that the performance is significantly improved by our pro- posed distance measures incorporating normalizations that use both of the resources and global appearances of paths in a graph. CCS Concepts Information systems Similarity measures; Rec- ommender systems; Keywords Linked Data; Semantic similarity; Recommender system 1. INTRODUCTION In recent years, the Linked Open Data (LOD) cloud 1 has been increasing in popularity. As a result of the success of the LOD, many semantic datasets are freely available on the Web in machine-understandable format (primarily RDF [1]) 1 http://lod-cloud.net/ Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC 2016,April 04-08, 2016, Pisa, Italy Copyright 2016 ACM 978-1-4503-3739-7/16/04. . . $15.00 http://dx.doi.org/10.1145/2851613.2851839 related to different domains. The LOD has been adopted in recommender systems in order to improve the performance of such systems as well as reduce the cold-start problem in- herent to recommender systems [2, 8, 12–14]. Also, the need for a semantic representation of data and user profiles has been identified as one of the next challenges in the field of recommender systems [11]. The use of semantic data in rec- ommendation tasks not only can provide better knowledge (e.g., a richer representation of data), but also can facili- tate the easy adoption of the same approach to other do- mains [3]. Moreover, problems related to a keyword-based approach such as synonymy and polysemy are resolved since resources in LOD datasets are identified by unique URIs and semantically interlinked to each other [11]. All of the information within the LOD can be exploited and used in an LOD-enabled, content-based recommender system where the domain knowledge plays a fundamental role. In this regard, measuring the distance between resources and iden- tifying their relatedness plays a significant role as it can be adopted to recommender systems for providing recommen- dations. To this end, various approaches to measure the distance/similarity between two resources in LOD datasets such as DBpedia 2 have been proposed [6, 9, 15, 19]. Linked Data Semantic Distance (LDSD ) [15] is one of the most pop- ular approaches to measure the semantic distance between two resources, which has been adopted to a recommender system in the music domain [14]. Di Noia et. al [3, 4] pro- posed applying one of the most popular models in infor- mation retrieval: the Vector Space Model (VSM) [17] in a LOD-based setting and representing the whole RDF graph as a matrix. Items and user profiles can be represented in terms of the VSM and the similarity between items and users can be identified by using the cosine similarity measure. In this paper, based on the basic concept of LDSD, we present various semantic distance measures by incorporat- ing the number of connected resources via a link, different normalization strategies and a statistical approach for cal- culating the semantic distance between two resources. We then evaluate these distance measures in the context of a recommender system that provides the top-N recommen- dations. It is noteworthy that although we performed our experiments using the DBpedia dataset, the distance mea- sures we proposed are not tied to this particular dataset and 2 http://dbpedia.org/

Upload: vandieu

Post on 02-May-2018

219 views

Category:

Documents


5 download

TRANSCRIPT

Page 1: Measuring Semantic Distance for Linked Open Data … · Measuring Semantic Distance for Linked Open Data-enabled Recommender Systems ... successful in terms of publishing and interlinking

Measuring Semantic Distance for Linked OpenData-enabled Recommender Systems

Guangyuan PiaoInsight Centre for Data Analytics

National University of Ireland GalwayIDA Business Park, Lower Dangan, Galway,

[email protected]

John G. BreslinInsight Centre for Data Analytics

National University of Ireland GalwayIDA Business Park, Lower Dangan, Galway,

[email protected]

ABSTRACTThe Linked Open Data (LOD) initiative has been quitesuccessful in terms of publishing and interlinking data onthe Web. On top of the huge amount of interconnecteddata, measuring relatedness between resources and identify-ing their relatedness could be used for various applicationssuch as LOD-enabled recommender systems. In this paper,we propose various distance measures, on top of the basicconcept of Linked Data Semantic Distance (LDSD), for cal-culating Linked Data semantic distance between resourcesthat can be used in a LOD-enabled recommender system.We evaluated the distance measures in the context of arecommender system that provides the top-N recommen-dations with baseline methods such as LDSD. Results showthat the performance is significantly improved by our pro-posed distance measures incorporating normalizations thatuse both of the resources and global appearances of paths ina graph.

CCS Concepts•Information systems → Similarity measures; Rec-ommender systems;

KeywordsLinked Data; Semantic similarity; Recommender system

1. INTRODUCTIONIn recent years, the Linked Open Data (LOD) cloud1 hasbeen increasing in popularity. As a result of the success ofthe LOD, many semantic datasets are freely available on theWeb in machine-understandable format (primarily RDF [1])

1http://lod-cloud.net/

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copiesare not made or distributed for profit or commercial advantage and thatcopies bear this notice and the full citation on the first page. To copyotherwise, or republish, to post on servers or to redistribute to lists,requires prior specific permission and/or a fee.SAC 2016,April 04-08, 2016, Pisa, ItalyCopyright 2016 ACM 978-1-4503-3739-7/16/04. . . $15.00http://dx.doi.org/10.1145/2851613.2851839

related to different domains. The LOD has been adopted inrecommender systems in order to improve the performanceof such systems as well as reduce the cold-start problem in-herent to recommender systems [2,8,12–14]. Also, the needfor a semantic representation of data and user profiles hasbeen identified as one of the next challenges in the field ofrecommender systems [11]. The use of semantic data in rec-ommendation tasks not only can provide better knowledge(e.g., a richer representation of data), but also can facili-tate the easy adoption of the same approach to other do-mains [3]. Moreover, problems related to a keyword-basedapproach such as synonymy and polysemy are resolved sinceresources in LOD datasets are identified by unique URIsand semantically interlinked to each other [11]. All of theinformation within the LOD can be exploited and used inan LOD-enabled, content-based recommender system wherethe domain knowledge plays a fundamental role. In thisregard, measuring the distance between resources and iden-tifying their relatedness plays a significant role as it can beadopted to recommender systems for providing recommen-dations. To this end, various approaches to measure thedistance/similarity between two resources in LOD datasetssuch as DBpedia2 have been proposed [6, 9, 15, 19]. LinkedData Semantic Distance (LDSD) [15] is one of the most pop-ular approaches to measure the semantic distance betweentwo resources, which has been adopted to a recommendersystem in the music domain [14]. Di Noia et. al [3, 4] pro-posed applying one of the most popular models in infor-mation retrieval: the Vector Space Model (VSM) [17] in aLOD-based setting and representing the whole RDF graphas a matrix. Items and user profiles can be represented interms of the VSM and the similarity between items and userscan be identified by using the cosine similarity measure.

In this paper, based on the basic concept of LDSD, wepresent various semantic distance measures by incorporat-ing the number of connected resources via a link, differentnormalization strategies and a statistical approach for cal-culating the semantic distance between two resources. Wethen evaluate these distance measures in the context of arecommender system that provides the top-N recommen-dations. It is noteworthy that although we performed ourexperiments using the DBpedia dataset, the distance mea-sures we proposed are not tied to this particular dataset and

2http://dbpedia.org/

Page 2: Measuring Semantic Distance for Linked Open Data … · Measuring Semantic Distance for Linked Open Data-enabled Recommender Systems ... successful in terms of publishing and interlinking

can be adopted to any similar LOD datasets as well.

The rest of the paper is organized as follows: Section 2 givessome related work. In Section 3 we propose various seman-tic distance measures for calculating the distance betweentwo resources. Section 4 is dedicated to the evaluation andSection 5 presents the results. Finally, Section 6 concludesthe paper and gives some ideas for future work.

2. RELATED WORKMaedche et. al [10] defined a set of similarity measures forcomparing ontology-based metadata by considering differentaspects of an ontology separately. They propose differenti-ating across three dimensions for comparing two resources:taxonomic, relational and attribute similarities. However,the similarity measures depend on some strong assumptionsabout the model such as “Ontologies are strictly hierarchi-cal such that each concept is subsumed by only one con-cept”, which is not the case in terms of DBpedia. Pas-sant [15] proposed a measure named LDSD to calculatesemantic distance on Linked Data. The distance measureconsiders direct links from resource A to resource B andvice versa. In addition, it also considers the same incom-ing and outgoing nodes via the same properties of resourcesA and B in a graph. However, several aspects of related-ness that mentioned in this paper have been ignored. Lealet al. [9] presents an approach for computing the seman-tic relatedness of resources in DBpedia. In the paper, theyproposed a similarity measure based on a notion of proxim-ity, which measures how connected two resources are, ratherthan how distant they are. This means that the similar-ity measure considers both distance and paths between twonodes. However, they do not consider incoming nodes (re-sources) and properties of the resources as LDSD did. Basedon the Shakti measure, Strobin et al. [19] propose a methodto find the weights automatically by using Genetic Algo-rithms (GA) [18] based on a training dataset from Last.fm3.This method is quite efficient at learning the weights auto-matically. However, it needs a gold standard dataset (e.g.,Last.fm dataset for the music domain) to learn the weightsof properties which is not always available in other domains.There are also some supervised learning approaches based onVSM model for LOD-enabled recommender systems. Tom-maso et. al [3] adapt the VSM model to a LOD-based settingand represent the whole RDF graph as a matrix. On top ofthe VSM representation, they use the Support Vector Ma-chine as a classifer to predict if a user would like an item ornot. Using the same representation, they also propose to as-sign a weight to each property that represents its worth withrespect to the user profile [4]. In this regard, they use GA tolearn the weights of properties that minimize the misclassifi-cation errors. Our study is different from these model-basedapproaches since we focused on the top-N recommendationtask while they focused on the prediction task. In addi-

3http://last.fm

tion, our proposed distance measures can be easily adoptedin different domains since they do not require any learningprocess with a gold standard dataset. However, adaptivealgorithms such as GA can also be applied to the semanticdistance measures to learn the different weights of variouspaths (e.g., direct or indirect paths with different lengths).This, although interesting, is beyond the scope of this paperand we aim to explore it in future work.

3. LINKED DATA SEMANTIC DISTANCEMEASURES

LDSD was one of the first approaches for measuring thesemantic distance between two resources on LOD datasetssuch as DBpedia and used for recommender systems [15].The distance measure (equation (1)) considers direct linksfrom resource ra to resource rb and vice versa. In addition,it also considers the same incoming and outgoing nodes viathe same properties of resource ra and rb. The distancemeasure has a scale from 0 to 1, where a larger value denotesless similarity between two resources. We use the definitionof a dataset following the Linked Data principles outlinedin [14] and the definition of a path as below:

Definition 1. A dataset following the Linked Data prin-ciples is a graph G such as G = (R,L) in which R ={r1, r2, ...rn} is a set of resources identified by their URI,L = {l1, l2, ...ln} is a set of typed links identified by theirURI. A path is a sequence of resources and links betweentwo resources, such as pi =

[. . . , l(i)j , rn, l(o)j , . . .

]. The

direction of a link lj in terms of the first resource can berepresented by l(i)j (incoming) or l(o)j (outgoing).

For example, in the example graph (Fig. 1), we havepaths such as

[l(o)associatedMusicArtist

],[l(i)musicalguests,

rList of The Tonight Show with Jay Leno episodes (2013−14),

l(o)musicalguests]

from the resource rAriana Grande torSelena Gomez.

LDSD consists of two Cd functions with Cii(li, ra, rb) andCio(li, ra, rb). Cd is a function that computes the numberof direct and distinct links between resources in a graph G.Cd(li, ra, rb) equals 1 if there is a link li from resource ra toresource rb. Otherwise, if there is no link from resource rato resource rb, Cd(li, ra, rb) equals to 0. By extension Cdcan be the total number of nodes via li from ra (Cd(li, ra)).For example, in the example graph (Fig. 1), we have:

Cd(linfluences, rAriana Grande, rSelena Gomez) = 1Cd(linfluences, rAriana Grande) = 1Cd(lmusicalguests,rList of The Tonight Show with Jay Leno episodes (2013−14)) = 2

Cii and Cio are functions that compute the number ofindirect and distinct links, both incoming and outgoing,between resources in a graph G. Cii(li, ra, rb) equals1 if there is a resource rn linked to both ra and rb

LDSD(ra, rb) =1

1 +∑i

Cd(li,ra,rb)1+log(Cd(li,ra))

+∑i

Cd(li,rb,ra)1+log(Cd(li,rb))

+∑i

Cii(li,ra,rb)1+log(Cii(li,ra))

+∑i

Cio(li,ra,rb)1+log(Cio(li,ra))

(1)

Page 3: Measuring Semantic Distance for Linked Open Data … · Measuring Semantic Distance for Linked Open Data-enabled Recommender Systems ... successful in terms of publishing and interlinking

List_of_The_Tonight_Show_with_Jay_Leno_episodes_(2013–14)

Category:21st-century_American_singers

Ariana_Grande

Selena_Gomez

musicalguests musicalguests

subject subject

associatedMusicArtist

influences

Figure 1: Example of relationships of two resources in DBpedia

via an incoming property li, and 0 if not. Similarly,Cio(li, ra, rb) equals 1 if there is a resource rn linked toboth ra and rb via an outgoing property li, and 0 if not.By extension Cii and Cio can be used to compute thetotal number of resources linked indirectly to ra via li(Cii(li, ra) and Cio(li, ra)). In the example (Fig. 1), wehave Cii(lmusicalguests, rAriana Grande, rSelena Gomez) =1 (via incoming property fromrList of The Tonight Show with Jay Leno episodes (2013−14)) andCio(lsubject, rAriana Grande, rSelena Gomez) = 1 (via outgoingproperty to rCategory:21st−century American singers).

On top of the basic concept of LDSD, we propose four dif-ferent distance measures incorporating various aspects of re-latedness between two resources (Section 3.2-3.5).

3.1 Linked Data Semantic Distance incorpo-rating the number of linked resources viaa link

In this section, we introduce LDSDα (equation (2)) thatincorporates the number of resources linked to ra and rbvia li. C′ii (C′io) of LDSDα, is equal to the number ofresources linked to ra and rb via an incoming (outgoing)property li while Cii (Cio) of LDSD equals 1 if there isa resource rn linked to ra and rb via an incoming (out-going) property li. The intuition behind this is that tworesources are more similar if there are a greater number oflinked resources via a property li. For instance, if two mu-sic artists have 10 dbpedia:MusicalArtist(s) in commonvia the property dbpedia-owl:associatedMusicalArtist,then they are more similar than two other musicartists that have 1 dbpedia:MusicalArtist in com-mon via the same property. The prefix dbpedia isused for the namespace http://dbpedia.org/resource/

and the prefix dbpedia-owl is used for the namespace

http://dbpedia.org/ontology/.

3.2 Linked Data Semantic Distance with nor-malizations considering both resources

In our previous study [16], we showed that LDSD with nor-malizations considering both resources can produce symmet-ric results for ra and rb and the reversed order of them, whichis an important property as a distance measure [7]. On topof LDSDα, we further modify the normalizations of C′ii andC′io by considering both resources ra and rb (see equation(3)). That is, the normalization of C′ii is carried out by theaverage of Cii(li, ra) and Cii(li, rb) while for Cii in LDSD itis carried out by considering the first resource ra only. Sim-ilarly, the normalization of C′io is carried out by the averageof Cio(li, ra) and Cio(li, rb). We use LDSDβ to refer to thedistance measure in the rest of the paper.

3.3 Linked Data Semantic Distance withglobal normalizations

These aforementioned distance measures use local normal-izations, i.e., normalizations that are carried out in the localcontext of ra and rb. Instead of using local normalizations,we use global normalizations of a path to investigate the im-pact on calculating the distance between two resources. Thedistance measure can be defined as equation (4) and we useLDSDγ to refer to the distance measure in the rest of thepaper.

This measure penalizes the importance of a path betweentwo resources according to the global appearances of thepath in the whole graph. In LDSD, the normalizations ofCd(li, ra, rb) and Cd(li, rb, ra) are carried out using Cd(li, ra)that computes the number of resources rn from ra via li(see Fig. 2). In contrast, in LDSDγ , the normalizationsof Cd functions are carried out using Cdp(li) that computes

LDSDα(ra, rb) =1

1 +∑i

Cd(li,ra,rb)1+log(Cd(li,ra))

+∑i

Cd(li,rb,ra)1+log(Cd(li,rb))

+∑i

C′ii(li,ra,rb)

1+log(Cii(li,ra))+∑i

C′io(li,ra,rb)

1+log(Cio(li,ra))

(2)

LDSDβ(ra, rb) =

1

1 +∑i

Cd(li,ra,rb)1+log(Cd(li,ra))

+∑i

Cd(li,rb,ra)1+log(Cd(li,rb))

+∑i

C′ii(li,ra,rb)

1+log(Cii(li,ra)+Cii(li,rb)

2)

+∑i

C′io(li,ra,rb)

1+log(Cio(li,ra)+Cio(li,rb)

2)

(3)

Page 4: Measuring Semantic Distance for Linked Open Data … · Measuring Semantic Distance for Linked Open Data-enabled Recommender Systems ... successful in terms of publishing and interlinking

LDSDγ(ra, rb) =

1

1 +∑i

Cd(li,ra,rb)1+log(Cdp(li))

+∑i

Cd(li,rb,ra)1+log(Cdp(li))

+∑i

∑j

Cii(li,rj ,ra,rb)

1+log(Ciip(li,rj))+∑i

∑j

Cio(li,rj ,ra,rb)

1+log(Ciop(li,rj))

(4)

ra rn

li

Figure 2: Local normalization of Cd function in equations(1), (3) and (5): the number of resources from ra to rn viali

rx ry

li

Figure 3: Global normalization of Cd function in equation(7): the number of appearances from rx to ry via li in agraph

ra ry rx

li li

Figure 4: Local normalization of Cii function in equations(1), (3) and (5): the number of resources linked to a resourcevia incoming property li as ra

rx ry rj

li li

Figure 5: Global normalization of Cii function in equation(7): the number of appearances of the path from rx to ryvia the path

[l(i)i, rj , l(o)i

]the global appearances of the path

[li]

between any two re-sources in DBpedia (see Fig. 3). Furthermore, for indirectpaths between two resources, LDSDγ normalizes each indi-rect path by the number of global appearances of it. Takingincoming indirect paths for example, Cii(li, rj , ra, rb) equals1 if there is a path

[l(i)i, rj , l(o)i

]from ra to rb, and 0 if not.

The normalization of Cii(li, rj , ra, rb) is then carried out us-ing Ciip(li, rj) that computes the global appearances of thepath

[l(i)i, rj , l(o)i

]between any two resources in DBpedia

(see Fig. 5).

3.4 Linked Data Semantic Distance by log-likelihood

In this section, we adopt a statistical approach, log-likelihoodratio [5] for measuring the semantic distance between two re-sources. The approach is adopted and implemented as a sim-ilarity metric named log-likelihood similarity in the ApacheMahout4, which is a widely known open source project tobuild an environment for creating machine learning appli-cations with scalable performance as well as collaborative

4http://mahout.apache.org/

filtering based recommender systems. We adopt the imple-mentation of the similarity in Mahout for our experiment.

The log-likelihood similarity is good at handling rare events.In our approach, we treat a resource and direct/indirectpaths of it appearing in a graph as an event. The dis-tance measure can be defined as equation (5) and we useLDSDLLR to refer to this measure in the rest of the paper.Ld(li, ra, rb) denotes the log-likelihood ratio of two events [5]:an event for the resource ra with the path [l(o)i] and theother event for the resource rb with the path [l(i)i]. Simi-larly, Lii(li, rj , ra, rb) denotes the log-likelihood ratio of twoevents: the resource ra with the incoming link li from a re-source rj , and the resource rb with the incoming link li fromthe resource rj .

4. EVALUATION SETUPWe evaluate the proposed semantic distance measures interms of the performance of recommender systems when us-ing them as recommendation algorithms. To this end, we usea subset of the dataset from the second Linked Open Data-enabled recommender systems challenge5. The dataset wascollected from Facebook6 profiles about personal preferences(“likes”) in the music domain, which consists of 52,072 usersand 21 liked items on average. The items available in thedataset have been mapped to their corresponding DBpediaURIs. We randomly select 500 users with 10,590 preferencerecords for the experiment. The main details of the datasetfrom the challenge and its subset for our experiment arepresented in Table 1.

For each user, 5 liked items were blinded out to constructa candidate list for recommendations while the rest of theitems were used to construct preference profiles of users. Inthe end, the candidate list consists of 1,132 items of typedbpedia:MusicalArtist or dbpedia:MusicalBand.

A user profile can be represented as a set of resources thatthey liked before (equation (6)). The similarity between auser ui and an item can be measured using equation (7)where dist(ra, rb) denotes the distance measure deployed inthe recommender system.

Profile(ui) = {r1, r2, ..., rm} (6)

sim(ui, ra) =

∑rb∈Profile(ui)

(1− dist(ra, rb))

|Profile(ui)|(7)

5http://2015.eswc-conferences.org/important-dates/call-RecSys6https://www.facebook.com/

LDSDLLR(ra, rb) =1

1 +∑i Ld(li, ra, rb) +

∑i Ld(li, rb, ra) +

∑i

∑j Lii(li, rj , ra, rb) +

∑i

∑j Lio(li, rj , ra, rb)

(5)

Page 5: Measuring Semantic Distance for Linked Open Data … · Measuring Semantic Distance for Linked Open Data-enabled Recommender Systems ... successful in terms of publishing and interlinking

Table 1: Descriptive statistics of the dataset

Dataset # of users# of liked items

min. max. avg. std.Total 52,072 15 37 21 6.20Subset 500 15 37 21 6.18

The recommender system then provides the top-N recom-mendations based on the similarity of a user profile and anitem from the candidate list. We evaluate the performance ofthe recommender system using our proposed distance mea-sures and taking LDSD as a baseline. Additionally, a model-based approach [3] for RDF data with VSM [17] is adoptedas another baseline. The model considers all properties ofitems (resources) that a user liked, where each resource isrepresented as a unique vector of weights and each weightindicates the degree of association between the item and theresource with respect to a property. In this regard, bothitems and users can be represented in terms of the VSM.We then use the cosine similarity measure for measuringthe similarities between a user and an item. Finally, we addan item-based collaborative filtering (ITEMCF ) approachwith log-likelihood similarity in Mahout as another baselinewhich does not exploit LOD.

The performance of the recommender system was evaluatedby standard evaluation methods for top-N (N = 1, 5, 10, 20)recommendation tasks: recall at N (R@N ), precision at N(P@N ) and Mean Reciprocal Rank (MRR). R@N is the frac-tion of items that are relevant to the users that are success-fully retrieved in the top-N recommendations, and P@N isthe fraction of the top-N recommended items that are rel-evant to the user. MRR indicates at which rank the firstitem relevant to the user occurs on average.

DBpedia provides a large set of properties for each item.Hence, selecting a subset of domain-dependent propertiesis necessary [4, 12]. Referring to the music domain, inwhich we performed the evaluation, we selected the 15properties (see Table 2) for calculating the semantic dis-tance between resources. dct:subject relates a resourceto its categories. In addition, we decided to leveragethe properties belonging to the DBpedia Ontology sincethey represent high-quality, clean and well-structured in-formation [13]. The prefix dct is used for the namespacehttp://purl.org/dc/terms/subject.

5. RESULTSFigure 6(a) and 6(b) show the R@N and P@N results of therecommender system with the proposed distance measures(solid lines) and baseline methods (dashed lines). Figure6(c) shows the MRR results of the recommender system.The paired t-test is used for testing the significance where

Table 2: Properties Selected for the Music Domain

dct:subject, dbpedia-owl:genre, dbpedia-owl:associatedBand,

dbpedia-owl:associatedMusicalArtist, dbpedia-owl:instrument,

dbpedia-owl:formerBandMember, dbpedia-owl:currentMember,

dbpedia-owl:influencedBy, dbpedia-owl:pastMember,

dbpedia-owl:associatedAct, dbpedia-owl:influenced,

dbpedia-owl:recordLabel, dbpedia-owl:occupation,

dbpedia-owl:hometown, dbpedia-owl:bandMember

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

N=1 N=5 N=10 N=20

Aver

age

Rec

all

LDSDα LDSDβ LDSDγ LDSDLLR LDSD VSMCOSINE ITEMCF

(a) Recall

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

N=1 N=5 N=10 N=20

Aver

age

Prec

isio

n

LDSDα LDSDβ LDSDγ LDSDLLR LDSD VSMCOSINE ITEMCF

(b) Precision

0.135 0.137 0.136

0.125 0.126

0.064

0.133

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

LDSDα LDSDβ LDSDγ LDSDLLR LDSD VSMCOSINE ITEMCF

Mea

n R

ecip

roca

l Ran

k

(c) MRRFigure 6: Results of the recommender system using differentdistance measures

the significance level was set to 0.05 unless otherwise noted.

Overall, LDSDγ performs best, which considers global ap-pearances of paths within the whole graph, followed byLDSDβ and LDSDα. More specifically, the R@5, 10 haveimproved significantly (p < 0.01), as well as R@20 usingLDSDγ compared to using LDSD. Also, the R@10 has im-proved significantly by using LDSDβ compared to usingLDSD. Similar observations can be found in the precisionresults. LDSDLLR performs worst among these proposeddistance measures, and worse than LDSD. All of the pro-

Page 6: Measuring Semantic Distance for Linked Open Data … · Measuring Semantic Distance for Linked Open Data-enabled Recommender Systems ... successful in terms of publishing and interlinking

posed distance measures perform significantly better thanthe VSM model with cosine similarity measure. In addi-tion, we can observe the improvements by exploiting LODbased on the recall and precision results compared to usingthe ITEMCF approach.

In terms of MRR (see Figure 6(c)), LDSDα, LDSDβ andLDSDγ have similar performance and the performance issignificantly better than the VSM model with cosine simi-larity measure and slightly better than LDSD and ITEMCF(but not significant).

The results show that incorporating the number of linkedresources and adopting different normalization strategiessuch as local normalizations by considering both resources(LDSDβ), and the global normalizations of paths (LDSDγ)can improve the performance of the recommender system.In addition, the best performance achieved by LDSDγ indi-cates that global normalizations of paths represent the im-portance of paths better than local ones of them.

6. CONCLUSION AND FUTURE WORKIn this paper, we proposed and investigated various seman-tic distance measures for calculating the distance betweenresources. These measures were adopted into a LOD-basedrecommender system to recommend the top-N items to usersbased on the preference of a user. We found that the perfor-mance of the recommender system is significantly improvedby incorporating the number of resources via indirect paths,local normalizations considering both resources (LDSDβ)and global appearances of paths (LDSDγ) compared tobaseline methods.

In the future, we plan to extend the algorithms by incor-porating longer paths, in order to recommend items thatare more than one node away from the seed one. Moreover,paths with different lengths would contribute to the seman-tic distance in a different way and the weights of them canbe optimized by using GA as we mentioned in Section 2.

7. ACKNOWLEDGMENTSThis publication has emanated from research conductedwith the financial support of Science Foundation Ireland(SFI) under Grant Number SFI/12/RC/2289 (Insight Cen-tre for Data Analytics).

8. REFERENCES[1] D. Brickley and R. V. Guha. {RDF vocabulary

description language 1.0: RDF schema}. 2004.

[2] T. Di Noia, I. Cantador, and V. C. Ostuni. Linkedopen data-enabled recommender systems: ESWC 2014challenge on book recommendation. In Semantic WebEvaluation Challenge, pages 129–143. Springer, 2014.

[3] T. Di Noia, R. Mirizzi, V. C. Ostuni, and D. Romito.Exploiting the web of data in model-basedrecommender systems. In Proceedings of the sixthACM conference on Recommender systems, pages253–256. ACM, 2012.

[4] T. Di Noia, R. Mirizzi, V. C. Ostuni, D. Romito, andM. Zanker. Linked Open Data to Support

Content-based Recommender Systems. In Proceedingsof the 8th International Conference on SemanticSystems, I-SEMANTICS ’12, pages 1–8, New York,NY, USA, 2012. ACM.

[5] T. Dunning. Accurate methods for the statistics ofsurprise and coincidence. Computational linguistics,19(1):61–74, 1993.

[6] V. Groues, Y. Naudet, and O. Kao. Adaptation andevaluation of a semantic similarity measure fordbpedia: A first experiment. In Semantic and SocialMedia Adaptation and Personalization (SMAP), 2012Seventh International Workshop on, pages 87–91.IEEE, 2012.

[7] S. Harispe, S. Ranwez, S. Janaqi, and J. Montmain.Semantic Measures for the Comparison of Units ofLanguage, Concepts or Instances from Text andKnowledge Base Analysis. arXiv preprintarXiv:1310.1285, 2013.

[8] B. Heitmann and C. Hayes. Using Linked Data toBuild Open, Collaborative Recommender Systems. InAAAI spring symposium: linked data meets artificialintelligence, pages 76–81, 2010.

[9] J. P. Leal, V. Rodrigues, and R. Queiros. Computingsemantic relatedness using dbpedia. 2012.

[10] A. Maedche and V. Zacharias. Clusteringontology-based metadata in the semantic web. InPrinciples of Data Mining and Knowledge Discovery,pages 348–360. Springer, 2002.

[11] S. E. Middleton, D. De Roure, and N. R. Shadbolt.Ontology-based recommender systems. In Handbookon ontologies, pages 779–796. Springer, 2009.

[12] C. Musto, P. Basile, P. Lops, M. de Gemmis, andG. Semeraro. Linked Open Data-enabled Strategies forTop-N Recommendations. CBRecSys 2014, page 49,2014.

[13] V. C. Ostuni, T. Di Noia, E. Di Sciascio, andR. Mirizzi. Top-n recommendations from implicitfeedback leveraging linked open data. In Proceedingsof the 7th ACM conference on Recommender systems,pages 85–92. ACM, 2013.

[14] A. Passant. dbrec: Music Recommendations UsingDBpedia. In ISWC 2010 SE - 14, pages 209–224, 2010.

[15] A. Passant. Measuring Semantic Distance on LinkingData and Using it for Resources Recommendations. InAAAI Spring Symposium: Linked Data MeetsArtificial Intelligence, volume 77, page 123, 2010.

[16] G. Piao, S. showkat Ara, and J. G. Breslin.Computing the Semantic Similarity of Resources inDBpedia for Recommendation Purposes. In SemanticTechnology. Springer International Publishing, 2015.

[17] G. Salton, A. Wong, and C.-S. Yang. A vector spacemodel for automatic indexing. Communications of theACM, 18(11):613–620, 1975.

[18] M. Srinivas and L. M. Patnaik. Genetic algorithms: Asurvey. Computer, 27(6):17–26, 1994.

[19] L. Strobin and A. Niewiadomski. Evaluating semanticsimilarity with a new method of path analysis in RDFusing genetic algorithms. COMPUTER SCIENCE,21(2):137–152, 2013.