
Disambiguating Identity Web References using Web 2.0 Data and Semantics

Matthew Rowe a,*, Fabio Ciravegna a

a The OAK Group, Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP, UK

Abstract

As web users disseminate more of their personal information on the web, the possibility of these users becoming victims of lateral surveillance and identity theft increases. Therefore web resources containing this personal information, which we refer to as identity web references, must be found and disambiguated to produce a unary set of web resources which refer to a given person. Such is the scale of the web that forcing web users to monitor their identity web references is not feasible, therefore automated approaches are required. However automated approaches require background knowledge about the person whose identity web references are to be disambiguated. Within this paper we present a detailed approach to monitor the web presence of a given individual by obtaining background knowledge from Web 2.0 platforms to support automated disambiguation processes. We present a methodology for generating this background knowledge by exporting data from multiple Web 2.0 platforms as RDF data models and combining these models together for use as seed data. We present two disambiguation techniques: the first uses a semi-supervised machine learning technique known as Self-training and the second uses a graph-based technique known as Random Walks; we explain how the semantics of data supports the intrinsic functionalities of these techniques. We compare the performance of our presented disambiguation techniques against several baseline measures including human processing of the same data. We achieve an average precision level of 0.935 for Self-training and an average f-measure level of 0.705 for Random Walks, in both cases outperforming several baseline measures.

Key words: Semantic Web, Web 2.0, Identity Disambiguation, Machine Learning, Graphs

1. Introduction

The en masse uptake of Web 2.0 sites has altered the role of web users from information consumers to content contributors: sites such as Wikipedia 1 and Yahoo Answers 2 have provided the functionality to allow web users to produce content, and organisations

* Corresponding author. Tel: +44 (0) 114 222 1800 Fax: +44 (0) 114 222 1810

Email addresses: [email protected] (Matthew Rowe), [email protected] (Fabio Ciravegna).
1 http://www.wikipedia.org
2 http://answers.yahoo.com/

such as the BBC now encourage the public to share information and interact as part of its News service. Blogging platforms have allowed those web users who lack the technical expertise to construct web pages to publish their thoughts and musings on their own personal blog. Other Web 2.0 sites, such as social networking platforms, have given web users a base from which they are able to deploy, maintain and enrich their online web presence. In short, Web 2.0 sites have provided a variety of rich functionalities and services which are inviting and easy to use, and have allowed web users to become involved in producing content on the web, thereby elevating

Preprint submitted to Elsevier 1 April 2010


their presence. This elevation has also produced several notable

by-products: through standard web search engines such as Google 3 and Yahoo 4, it is now possible to find information about the majority of people online, whether those people wish to be visible in such a way or not. This has led to several distressing web practices becoming increasingly popular. One such practice is lateral surveillance [1], the action of finding and surveying a person without their knowledge. This has become a widespread practice among employers who are keen to vet potential employees prior to offering them a job and also among people vetting prospective dates. Another web practice on the rise is online identity theft, which currently costs the UK economy £1.2 billion per annum 5. As web users share more of their personal information within the public domain, the risk of that information being stolen and reused increases.

To assess whether a person is at risk from identity theft or susceptible to lateral surveillance, web resources (web pages, data feeds, etc.) containing the person's personal information must be found and the information contained within those resources assessed for risk. For example, the UK government guidelines stipulate that the visibility of the person's name, their address and email address could lead to identity theft 6. The step of finding the correct web resources (hereafter identity web references) is what this paper focuses on. Finding includes the actual retrieval of web references pointing to a specific person name (e.g. by retrieving web pages via a standard search engine) and the decision whether the resource actually refers to the person we are concentrating on or to a person with an identical or similar name. We refer to this decision as disambiguation. Due to the immense scale of the web and the large amount of references that can be returned for a specific name (e.g. Google returns 4,750,000 results for John Smith alone), human disambiguation is largely unfeasible, therefore automated approaches are required. Such automated approaches must rely on background knowledge (seed data) about the person whose information is to be found. Producing this seed data manually for one person (let alone for multiple people) can be costly and time consuming, and limited by access to resources.

3 http://www.google.com
4 http://www.yahoo.com
5 http://www.identitytheft.org.uk/faqs.asp
6 http://www.getsafeonline.org/nqcontent.cfm?a id=1132

The central contribution in this paper is an approach to identify and disambiguate identity web references for a specified (set of) person(s). The method is completely automatic and does not require the manual construction of any seed data, as the seed data used is automatically extracted and integrated from Web 2.0 resources (e.g. social sites). In particular our paper proposes:
– Techniques to extract data from Web 2.0 platforms,

turn this data into RDF data models and combine them into a single Linked Data [21] social graph which can then be used as seed data for automated disambiguation;

– Two novel techniques to automatically disambiguate identity web references, each supported by Semantic Web technologies and machine learning;

– A detailed evaluation of said techniques using a large-scale dataset, comparing the resulting accuracy to several baseline measures (including human processing of the same data); we show how our techniques largely outperform both humans and state of the art methods in terms of accuracy.
Moreover, we show empirical evidence that data

from Web 2.0 platforms is synonymous with real identity information. We present a detailed methodology for extracting a given person's data from multiple Web 2.0 platforms as RDF data models, thereby attributing machine-readable semantics to social data.

Our first disambiguation technique, Self-training, employs a semi-supervised machine learning strategy to iteratively train and improve a machine learning classifier. We model machine learning instances as RDF models and the features of those instances as RDF instances from within the models. This permits the novel variation of the feature similarity measures used by the classifier between different RDF graph similarity techniques. Our second technique employs a graph-based disambiguation technique known as Random Walks. We construct a graph space by utilising the graph structure of RDF data models, given that our seed data and the set of web resources to be disambiguated are represented using RDF. We are able to weight edges within the graph space according to their semantic constraints, given that edges within the graph space are created from predicates within triples.

We have structured this paper as follows: Section 2 presents current state of the art approaches for monitoring personal information online and identity tracking. Section 3 details the requirements gleaned from limitations in the current state of the art, which


an approach to monitor personal information must fulfil. Section 4 presents the proposed approach to discover personal information on the web aided with the use of Web 2.0 data. Sections 5, 6 and 7 present the various stages in the approach: Section 5 presents the techniques used to compile seed data. Section 6 describes how we gather web resources to be disambiguated and generate metadata models describing the underlying knowledge structure of those resources. Section 7 presents two complementary automated disambiguation techniques. Section 8 investigates the performance of these techniques using a large-scale evaluation with a dataset compiled from real-world data. Section 9 presents the conclusions drawn from our work and plans for future work.

2. Related Works

Approaches designed to monitor personal information online fall into two distinct categories: unsupervised approaches, which collect web resources, observe common features between those resources and group web resources by their common features; and semi-supervised approaches, which obtain background knowledge about a given person and use this information to aid the decision process. We now present current work which contains examples of such approaches.

2.1. Unsupervised Monitoring Approaches

The unsupervised approach presented in [32] searches for web pages using a person's name, and then clusters web pages based on their common features, which in this case is the presence of person names. Another unsupervised approach from [17] gathers a set of web pages and creates features for web pages by pairing a person's name within the page together with a relevant concept derived from the page (e.g. Matthew Rowe/PhD Student). Using these pairings a maximum entropy model is learnt which predicts the likelihood that two pairings refer to the same person. Web pages are then clustered based on the likelihood measures of pairings within separate web pages.

Comparable work in [44] provides a similar approach by aligning topics with person names within a web page. Generative models are used to provide the topic to be paired with the person name and web pages are clustered by common name and topic pairs, similar to [17]. Features within web pages are represented by named entities in the approach described

in [27]. Using these named entities a term frequency-inverse document frequency (TF-IDF) score for each named entity within the web page is calculated. Web pages are then clustered together by common named entities and TF-IDF scores for those entities above a certain threshold.

An unsupervised monitoring approach described in [49] extracts features from web pages including lexical information, linguistic information and personal information. A basic agglomerative (bottom-up) clustering strategy is then applied to a set of gathered web pages, resulting in several clusters where each cluster contains web pages citing a given person. By varying the combinations of features (lexical, lexical+linguistic, lexical+linguistic+personal) [49] demonstrates the variation in clustering accuracy when applied to the same set of web pages. One of the major limitations of unsupervised approaches is their unfocussed nature and the need to discover how many different namesakes (different people sharing the same name) are referred to in the set of resources. Moreover, if the task is to find information about a specific person then large amounts of irrelevant information are processed and compared.

2.2. Semi-supervised Monitoring Approaches

Background knowledge, or seed data, is supplied to a semi-supervised monitoring approach described in [3] as a list of names representing the social network of a person. Link structures of web pages are then used as features to group web pages into two clusters: relevant, which cite the given person, and non-relevant, which do not. The intuition behind using link structures is that web pages which cite a specific person are likely to be connected, maybe not directly but via transitive paths. Monitoring personal information involves finding and disambiguating those web resources that contain personal information of a given person from those that do not. This disambiguation process is a binary classification problem where web resources must be labelled with the relevant class label. Classifying web resources in such a manner has been investigated in work by [51] which gathers web pages and then labels those pages as belonging to a given class based on supplied seed data, where the seed data is provided as labelled training examples from which a machine learning classifier can be learnt and applied to unlabelled web pages. A similar approach is presented


in [18] which, by using labelled seed data, is able to learn a classifier and apply this to a corpus of text documents.

Seed data is essential for semi-supervised approaches in order to focus the monitoring process, however obtaining such data is a difficult process, limited by access to resources, as highlighted in [3]. In several approaches [51] [50] [18] seed data is previously known and supplied with no explanation of how it was obtained and from what sources. One of the limitations of semi-supervised approaches to monitor personal information is this reliance on seed data. If the seed data does not cover the web resources to be analysed then the accuracy of the approach may be limited. Therefore semi-supervised approaches must cope with this lack of coverage, and use a suitable strategy to overcome this limitation.

The growing need to monitor personal information is reflected in the wide range of commercial services now available, created to tackle identity theft by tracking personal information of individuals on the World Wide Web. For example Garlik's Data Patrol 7 service searches across various data sources for personal information about a given person by using background information provided by the person when they signed up to the service as seed data (includes biographical information such as the name and address of the person). The Trackur service 8

monitors social media on the web for references and citations with the intended goal of monitoring the reputation of an individual. Similar to Data Patrol, Trackur requires background knowledge about the person whose reputation is to be monitored. Identity Guard 9 is one of the most widely used personal information monitoring services. It functions in a similar vein to Data Patrol and Trackur by using personal information disclosed by the web user to monitor that person's personal information online.

3. Requirements

From the analysis of current state of the art approaches for monitoring personal information on the web we hypothesise that a semi-supervised strategy outperforms an unsupervised approach when finding identity web references of a given person. However, implementing a semi-supervised strategy places emphasis

7 http://www.garlik.com/dpindividuals.php
8 http://www.trackur.com/
9 http://www.identityguard.com

on the need for reliable seed data and for the approach to cope with correctly finding web resources which contain features not present within the seed data. To address these issues the process of monitoring personal information must fulfil the following requirements:

(i) Full Automation with minimal supervision: Our decision process within the approach must be able to disambiguate web resources effectively with no intervention following the provision of seed data. In doing so the approach must be fully automated and require no human involvement.

(ii) Achieve disambiguation accuracy comparable to human processing of data: Automated approaches are required to alleviate the need for humans to monitor their personal information. Therefore our approach must achieve accuracy levels comparable to human processing.

(iii) Disambiguate identity web references for individuals with varying levels of web presence: Our disambiguation techniques must achieve consistent accuracy levels regardless of the web presence of the person whose identity web references are to be disambiguated.

(iv) Outperform unsupervised disambiguation techniques: Our approach must outperform unsupervised techniques that do not use seed data, thereby demonstrating the benefits of using seed data to aid disambiguation.

(v) Cope with web resources which do not contain any features present within the seed data: Although seed data provides background knowledge for disambiguating identity web references for a given person, it may not provide complete knowledge coverage. In this case disambiguation approaches must be able to cope with this lack of coverage and learn from unseen features to improve their later performance.

Our semi-supervised approach relies on seed data to support the process of monitoring personal information online, therefore the seed data we use must fulfil the following requirements:

(i) Produce seed data with minimal cost: Manually constructing seed data is a time consuming and costly process. Therefore techniques are needed which overcome these issues.

(ii) Generate reliable seed data: The seed data used by the approach must contain accurate information about the person whose personal information is to be found. Moreover, inaccuracies


in this data will affect the overall accuracy of the approach.

4. Overview of Approach to Disambiguation

The proposed approach is shown in Fig. 1; it is divided into three distinct stages: the first stage generates RDF models from distributed Web 2.0 platforms describing a fragment of a given person's digital identity contained within a specific platform. By reconciling these digital identity fragments from multiple platforms we combine distinct RDF models into a single Linked Data social graph to be used as seed data. The second stage of the approach gathers web resources by searching the Semantic Web and the World Wide Web, querying a given person's name. We create metadata models for these resources by generating an RDF model describing the knowledge structure of the resource. The third stage of our approach uses the seed data, modelled as a Linked Data social graph, and analyses the web resources to disambiguate the identity web references of a given person. In the next three sections we describe the three stages of the approach in detail.

5. Generating Seed Data

In this section we describe our approach for the automatic collection of seed data with no user intervention. We claim that Web 2.0 data constitutes an excellent starting point for the generation of seed data for the disambiguation of identity web references. This is because a) it is plentiful and b) it is representative of the real identity of people (and hence it provides an excellent basis to discriminate identities of homonyms). Sociological studies which analyse the intrinsic nature of Web 2.0 platforms describe users of such platforms as maintaining an identity which is comparable to their real identity. For instance work by [23] states that Web 2.0 platforms are primarily used to maintain offline relationships and similar work by [16] supports this claim by stating that web users use such platforms to reinforce established relationships.

We performed further investigation of this claim by studying 50 users of Facebook 10 from the University of Sheffield 11. Each participant listed their real world (offline) social network consisting of those

10 http://www.facebook.com
11 http://www.shef.ac.uk

Fig. 2. Results from user study measuring the extent to which offline social networks are replicated in an online environment

individuals that they socialised with the most. This network was then compared against their online social network in order to analyse the extent to which the networks were similar. The findings from the study indicated that between 50% and 100% of participants' offline social networks were replicated within the Web 2.0 platform (as shown in Fig. 2), producing an average of 77%. This indicates a strong correlation between the digital identity that is portrayed on a given Web 2.0 platform and the real identities of such web users. To assess the effectiveness of Web 2.0 data for disambiguation purposes, we compiled a dataset of web resources by querying several search engines (Google, Yahoo and Sindice 12) using distinct person names. By manually analysing the query results we found that on average 5 distinct people appeared within each web resource and that in the majority of cases these people knew one another (i.e. worked together, or were part of the same sports club). These findings suggest that automated disambiguation techniques would benefit from seed data which describes the people that a given person knows. Therefore Web 2.0 platforms provide an ideal source for such data given their social nature and, as we have demonstrated, the social networks maintained on such platforms.

However, a person's identity is often fragmented and distributed over multiple platforms, where each profile describes a different facet of the person's online identity. It is therefore necessary to aggregate this fragmented information in order to get a more comprehensive set of data for each individual. In the rest of this section we present our approach to generate seed data (shown in Fig. 3) using a two stage process. Firstly we export individual social

12 http://www.sindice.com


Fig. 1. Overview of our proposed approach to disambiguate identity web references using Web 2.0 data

graphs, as RDF models, from distributed Web 2.0 platforms, and secondly we interlink these individual RDF models together.

Fig. 3. Generating machine-readable seed data from Web 2.0 platforms

5.1. Exporting Individual Social Graphs

Social graphs are exported from individual Web 2.0 platforms as RDF models by using a specific exporter for each platform. Such platforms allow information to be accessed via their API, however this information is often returned in a proprietary format such as XML based on the platform's XML schema. Therefore this response must be lifted to RDF using consistent ontologies to enable social graphs from multiple platforms to be combined together. We now present several example exporters which generate RDF models from Web 2.0 platforms and describe the internal knowledge structure of the generated models.

5.1.1. Twitter Exporter
Our first exporter 13 produces a social graph from the microblogging platform Twitter 14. The exporter requests information from Twitter via the service's API and receives an XML response. We lift this response to RDF by mapping elements in the response to concepts from the FOAF [6] and GeoNames 15 ontologies. The XML contains biographical information and social network information of the Twitter user. We create an instance of foaf:Person for the person that the social graph belongs to, which we refer to as the social graph owner, and add information to the instance using the foaf:name property for the name of the person and foaf:homepage for the person's web site. We also attribute a geographical location to the person by creating an instance of geo:Feature and relating this to the person using foaf:based_near. We assign each instance of foaf:Person a URI in the RDF model using a hash URI [40] derived from the public display name used on Twitter; this provides a reference to the person instance within the exported social graph. For each Twitter user that the person has a relationship with and follows on Twitter we also create an instance of foaf:Person and assign the same information as mentioned above to the instance. We use the foaf:knows relation to form relationships within the social graph between the owner and each person he/she follows on Twitter.
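For illustration, the following Python sketch (not the exporter referenced above) shows how a Twitter-style profile response could be lifted to FOAF and GeoNames RDF with rdflib; the input dictionaries and the hash URI base are illustrative assumptions.

from rdflib import Graph, Namespace, URIRef, Literal, RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
GEO = Namespace("http://www.geonames.org/ontology#")

def build_social_graph(owner, followees, base="http://example.org/twitter.rdf"):
    g = Graph()
    g.bind("foaf", FOAF)
    g.bind("geo", GEO)

    def add_person(profile):
        # hash URI derived from the public display name (assumed field layout)
        person = URIRef(base + "#" + profile["screen_name"])
        g.add((person, RDF.type, FOAF.Person))
        g.add((person, FOAF.name, Literal(profile["name"])))
        if profile.get("url"):
            g.add((person, FOAF.homepage, URIRef(profile["url"])))
        if profile.get("location"):
            place = URIRef(base + "#loc-" + profile["screen_name"])
            g.add((place, RDF.type, GEO.Feature))
            g.add((place, GEO.name, Literal(profile["location"])))
            g.add((person, FOAF.based_near, place))
        return person

    owner_node = add_person(owner)
    for friend in followees:                  # people the owner follows
        g.add((owner_node, FOAF.knows, add_person(friend)))
    return g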

5.1.2. Facebook Exporter
Facebook is a walled garden social networking platform, which restricts information access to only

13 http://ext.dcs.shef.ac.uk/~u0057/SocialGraphAggregator/twitter.php
14 http://www.twitter.com
15 http://www.geonames.org/ontology/


users of the platform. Our exporter 16 works by handling authenticated queries to the platform, first logging in the person whose social graph we wish to export. Once successful we then query the platform's API and handle the XML response in a similar manner to the Twitter exporter. We use the same process as described above by creating an instance of foaf:Person for the social graph owner and assigning the relevant biographical information to that instance, along with the social network of the person on Facebook. We assign hash URIs to foaf:Person instances using the user identification of the person on Facebook, thereby providing referenceable resources within the generated social graph.

5.1.3. OpenSocial Exporter
Several social web platforms and services implement a set of common standards specified within the OpenSocial API specification 17. By developing an exportation technique conforming to data structures within the OpenSocial API specification, homogeneous exporters can be created for platforms which adhere to the same specification (Bebo 18, Orkut 19, Hi5 20). We implemented an exporter for the social networking site MySpace 21 - which models data access and structures according to the OpenSocial API - using a similar technique to the Twitter and Facebook exporters. Authentication is performed using OAuth 22 prior to granting data access; after querying the API the response is returned as XML which we convert to RDF, describing biographical and social network information using the FOAF ontology and location information using the GeoNames ontology.

5.1.4. Graph Enrichment
Social graphs exported as RDF models from distributed Web 2.0 platforms contain anonymous resources within the models. Therefore we perform graph enrichment to assign URIs to these anonymous resources. We created instantiations of two classes within exported RDF models: foaf:Person

16 http://ext.dcs.shef.ac.uk/~u0057/FoafGenerator
17 http://www.google.com/opensocial
18 http://www.bebo.com
19 http://www.orkut.com
20 http://hi5.com
21 http://www.myspace.com
22 http://www.oauth.net

and geo:Feature. Instances of the former have an assigned URI, whereas the latter do not. Therefore we query the GeoNames Web Service 23 (which accesses the GeoNames dataset) using the geo:name and geo:inCountry properties assigned to the geo:Feature instance to obtain a URI for the location, and add this to the instance. Fig. 4 shows a portion of an example social graph exported and enriched from Twitter.
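As an illustration of this enrichment step, the sketch below queries the GeoNames JSON search endpoint for a place name and country and returns the corresponding GeoNames URI; the endpoint, parameters and URI pattern are assumptions based on the public GeoNames API rather than the exact service call used in our implementation.

import requests

def geonames_uri(place_name, country_code, username="demo"):
    # search the GeoNames dataset for the named place within the given country
    resp = requests.get(
        "http://api.geonames.org/searchJSON",
        params={"q": place_name, "country": country_code,
                "maxRows": 1, "username": username},
        timeout=10,
    )
    results = resp.json().get("geonames", [])
    if not results:
        return None
    # GeoNames publishes a linked-data URI for every feature id
    return "http://sws.geonames.org/%d/" % results[0]["geonameId"]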

5.2. Integrating Distributed Social Graphs

Using our exporters we are able to generate RDF models describing the digital identity of a given person fragmented across Web 2.0 platforms. We must now combine these social graphs together to produce seed data for disambiguation techniques. Our process of combining the graphs matches foaf:Person instances in separate RDF models that refer to the same person, and provides an owl:sameAs link between these instances.

Existing work within the field of reference reconciliation has explored the use of training a classifier to detect object matches [13]. References are reconciled in [14] by using contextual information and past reconciliation decisions in the form of a dependency graph. This allows reconciliation of references belonging to different classes. Techniques for the reconciliation of references are presented in [39] which combine both numerical and logical (based on the semantics of distinct data items) approaches. In such work, and in our work also, matching instances using only the foaf:name assigned to an instance of foaf:Person is unreliable due to person name ambiguity. Therefore we use a strategy called graph reasoning to detect equivalent instances in distinct social graphs which exploits contextual information of the instances.

5.2.1. Graph Reasoning
Our graph reasoning technique uses lightweight inference rules to detect equivalent foaf:Person instances. Given two social graphs within which to detect equivalent instances (G and H), we iteratively move through the foaf:Person instances within G and compare each instance with each foaf:Person instance in H. We first verify that the names of the two instances are similar using the Levenshtein [8] string similarity measure to allow for

23 http://www.geonames.org/export/

Fig. 4. Portion of an exported Twitter Social Graph

slight variations. If the names are sufficiently similar (i.e. the similarity exceeds a predefined threshold) then we compare additional information within the instance descriptions as follows:

5.2.1.1. Unique Identifiers We compare unique identifiers associated with each instance. Given the use of FOAF to describe the semantics of the information within each social graph, we compare properties typed as owl:InverseFunctionalProperty, such as foaf:homepage and foaf:mbox, from either instance [42]. Such is the strength of these properties that should the information associated with both foaf:Person instances match, then we class the instances as referring to the same real world person. We define an inference rule to detect such a match, using Jena rule based syntax 24, where the antecedent of the rule contains statements which must be matched within the instance knowledge structure for the consequent, which is also a statement, to be inferred:

(?x rdf:type foaf:Person), (?y rdf:type foaf:Person), (?x foaf:mbox ?email), (?y foaf:mbox ?email) -> (?x owl:sameAs ?y)
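The sketch below illustrates this check over two rdflib graphs; difflib's SequenceMatcher is substituted here for the Levenshtein measure used above, and the 0.9 name threshold is an illustrative assumption.

from difflib import SequenceMatcher
from rdflib import Namespace, RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
IFPS = [FOAF.mbox, FOAF.homepage]   # owl:InverseFunctionalProperty in FOAF

def equivalent_people(g, h, name_threshold=0.9):
    matches = []
    for x in g.subjects(RDF.type, FOAF.Person):
        for y in h.subjects(RDF.type, FOAF.Person):
            nx, ny = g.value(x, FOAF.name), h.value(y, FOAF.name)
            if not nx or not ny:
                continue
            if SequenceMatcher(None, str(nx), str(ny)).ratio() < name_threshold:
                continue
            # names are similar enough: look for a shared inverse functional value
            for p in IFPS:
                if set(g.objects(x, p)) & set(h.objects(y, p)):
                    matches.append((x, y))   # candidates for owl:sameAs
                    break
    return matches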

5.2.1.2. Geographical Reasoning If unique identifiers are unavailable in either subgraph then additional information is analysed. Exported social graphs include location data within foaf:Person instance descriptions. First we compare the URIs of location resources associated with the foaf:Person instances:

24 http://jena.sourceforge.net/inference/#rules

(?x rdf:type foaf:Person), (?y rdf:type foaf:Person), (?x foaf:based_near ?geo), (?y foaf:based_near ?geo) -> (?x owl:sameAs ?y)

If the URIs are distinct then we assess the semantics of the relation between the locations. In order to derive this relation we utilise open linked datasets containing structured machine-readable data which explicitly defines relations between locations and their semantics. In this instance we query DBPedia 25 through an available SPARQL endpoint 26 using the following query:

SELECT ?relation WHERE {
  <http://dbpedia.org/resource/location1>
    ?relation
  <http://dbpedia.org/resource/location2> .
}

The query returns a relation from DBPedia depicting the association between the locations should one exist, and the semantics of this association. For instance when we use the query to ask for the relation between Crookes and Sheffield the relation dbprop:metropolitanBorough is returned. If a relation exists between the locations then we class the foaf:Person instances as being equivalent. This technique also covers abbreviations of location names (e.g. NYC) as the relation returned by DBPedia in that case is dbprop:redirect, thereby defining the two locations as being equivalent.
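For illustration, the query above can be issued programmatically with SPARQLWrapper; the two location URIs in the usage comment are examples only.

from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_relation(loc1_uri, loc2_uri):
    # ask the public DBPedia endpoint for any predicate linking the two locations
    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("SELECT ?relation WHERE { <%s> ?relation <%s> }" % (loc1_uri, loc2_uri))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["relation"]["value"] for b in results["results"]["bindings"]]

# e.g. dbpedia_relation("http://dbpedia.org/resource/Crookes",
#                       "http://dbpedia.org/resource/Sheffield")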

25 http://dbpedia.org
26 http://dbpedia.org/sparql


5.2.2. Producing Linked Data
Our graph reasoning method produces a set of matched foaf:Person instances. We construct a new Linked Data social graph by creating a new instance of foaf:Person for every instance in each of the compared social graphs. For those that have been matched in separate graphs we create one instance. We maintain the format of the models exported from each of the Web 2.0 platforms by creating one instance of foaf:Person for the social graph owner and then linking each person in the graph owner's social network using the foaf:knows relation. For each foaf:Person instance in the Linked Data social graph we provide a HTTP URI for additional instance data by providing a reference to the resource using the owl:sameAs predicate. An instance of foaf:Person which resides in two social graphs is defined as follows:

<foaf:Person rdf:about="#samchapman">
  <foaf:name>Sam Chapman</foaf:name>
  <owl:sameAs rdf:resource="http://namespace.com/fb.rdf#617555567"/>
  <owl:sameAs rdf:resource="http://namespace.com/twitter.rdf#sam"/>
</foaf:Person>

Only the foaf:name property and a URI are included in the instance description. Hash values are used for identifiers according to guidelines described in [40] due to the relatively small size of the datasets being considered for linkage. Where possible the identifiers are reused from the available social graphs, as such identifiers, if user selected, commonly resemble the handle of the user. This social graph now provides the seed data to be used for automated disambiguation approaches.

6. Generating Metadata Models from Web Resources

6.1. Web Resource Retrieval

We define a web resource as any document on the web which is accessible by a URI, such as web pages or data feeds. We wish to find all the web resources residing on the web which are identity web references of a given person. First we must collect a set of web resources which might refer to the person so that we can then use disambiguation techniques to find the resources within the set that are identity web references. Therefore we query the World Wide

Web search engines Google and Yahoo and the Semantic Web search engines Swoogle 27, Watson 28 and Sindice using the name of the person whose personal information we wish to find. We collect the set of resources returned by those queries.

6.2. Generating RDF Models from Web Resources

To enable automated processing of information within the retrieved resources we generate metadata models representing the resources' underlying knowledge using common semantics. Web resources such as ontologies and RDF documents already contain machine-readable content, therefore no further processing is required. Instead we must turn our attention to handling XHTML documents which contain implicit knowledge structures and HTML documents which contain no metadata descriptions. In each case we are interested in observing information within the resources which is consistent with our seed data, and therefore describes people.

6.2.1. Generating RDF Models from XHTML Pages
Methodologies exist for generating RDF models from XHTML pages containing lowercase semantics. Technologies such as Gleaning Resource Descriptions from Dialects of Languages (GRDDL) [48] allow RDF models to be gleaned from web pages using an XSL transformation [28] specified within the header of the page. This transformation lifts the XHTML structure to an RDF model by providing references to concepts from ontologies specified within the transformation. Therefore when a given web resource contains Microformats [29] or RDFa [20] we use the XSL transformation specified within the header of the resource together with GRDDL to glean an RDF model describing the knowledge structure of the resource. If the resource does not contain an XSL transformation declared within the header then we apply external transformations for the hCard Microformat 29, the XFN Microformat 30 and RDFa 31 to glean an RDF model from the resource.

27 http://swoogle.umbc.edu/
28 http://watson.kmi.open.ac.uk/WatsonWUI/
29 http://www.w3.org/2006/vcard/hcard2rdf.xsl
30 http://www.w3.org/2003/12/rdf-in-xhtml-xslts/grokXFN.xsl
31 http://ns.inria.fr/grddl/rdfa/2008/09/03/RDFa2RDFXML.xsl


6.2.2. Generating RDF Models from HTML Pages
The majority of resources on the web are encoded using semi-structured HTML, which contains information that automated processes are unable to interpret. Therefore we must construct an RDF model from scratch using information that we observe within the page: first we construct a simple gazetteer based Named Entity Recogniser from a list of person names, which matches the occurrence of a person's name within the web page. When a match occurs an instance of foaf:Person is created and the matched person name is attributed to the instance using the foaf:name property. It is common for a person's name to appear on a web page with additional information about that person appearing in the vicinity of the name.

For instance a web page containing a list of researchers or employees often has the name of a person together with their contact details. Therefore when a person name is found, we use a window function surrounding the name to find contextual information about the person, using regular expressions for the email and homepage of the person, and a geographical place name gazetteer for the location of the person. We then add this data to the instance description using the relevant semantics.
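A minimal sketch of this gazetteer-and-window heuristic is given below; the window size and the regular expressions are illustrative assumptions rather than the exact patterns used in our implementation.

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_people(text, person_names, window=200):
    """Return one record per matched person name with nearby contact details."""
    records = []
    for name in person_names:
        for match in re.finditer(re.escape(name), text):
            # look at a fixed window of characters around the matched name
            start = max(0, match.start() - window)
            context = text[start:match.end() + window]
            records.append({
                "foaf:name": name,
                "foaf:mbox": EMAIL_RE.findall(context),
                "foaf:homepage": URL_RE.findall(context),
            })
    return records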

6.3. Normalising RDF Models to Consistent Semantics

Ontologies and gleaned RDF models suffer from the problem of heterogeneous semantics describing homogeneous data. Due to the variety of digital identity ontologies it is common for equivalent data in disparate locations to be described using distinct ontologies. For instance an XHTML page may contain personal information described semantically using the hCard microformat whereas a published RDF document may contain the same information described using the FOAF ontology. In order for our disambiguation techniques to function, the RDF models of the web resources to be disambiguated must have semantics which are consistent with the seed data. Given that our seed data uses the FOAF and GeoNames ontologies to describe the semantics of the data obtained from Web 2.0 platforms, we must normalise the RDF models to use the same ontologies, thus achieving consistent semantics.

We normalise RDF models to consistent semantics using transformation rules automatically generated from the Social Identity Schema Mapping

(SISM) vocabulary [37]. SISM contains mappings between six commonly used digital identity ontologies: FOAF, the GeoNames ontology, the ontology for VCards [22], the XFN ontology [11], and the Nepomuk ontologies: the Personal Information Model Ontology [41] and the Nepomuk Contact Ontology.

SISM contains mapping constructs from the Web Ontology Language (OWL) [43] and the Simple Knowledge Organisation System (SKOS) [26] to cover the range of mappings needed to relate concepts from disparate digital identity ontologies. Mapped concepts within SISM are expressed as instances of skos:Concept; this allows the use of 5 mapping predicates between these concepts: owl:equivalentClass, owl:equivalentProperty, skos:narrower, skos:broader and skos:related. In order to normalise a given RDF model to a consistent set of semantics we consult SISM and automatically generate transformation rules according to the semantics of the mappings.

If the RDF model contains concepts which are mapped to the required ontology using owl:equivalentProperty, owl:equivalentClass or skos:broader then we are able to directly alter the triples containing those concepts. This is due to the semantics of the mappings facilitating a basic transformation. For instance if an instance of pimo:Person exists within a given RDF model and we wish to transform the knowledge structure to use the FOAF ontology, we consult SISM and find the mapping triple: (pimo:Person owl:equivalentClass foaf:Person). Based on the semantics of the mapping (owl:equivalentClass) we are able to infer the following rule which is directly applied to the RDF model to transform the class definition and remove the previous triple (denoted by remove(i) where i is the triple index in the antecedent, based on syntax from [35]):

(?x rdf:type pimo:Person) -> remove(0), (?x rdf:type foaf:Person)

In certain cases however concepts are mapped together using skos:related, thereby indicating an association between the concepts which is weaker than equivalence. Transforming between these concepts may result in information loss. For instance consider the example in Fig. 5 where an RDF model contains an instance of vcard:VCard describing a person's contact details using the VCard ontology. We wish to convert the gleaned model to use the FOAF and GeoNames ontologies in order to provide

Fig. 5. Transforming an instance within a gleaned RDF model from using the VCard ontology to use the FOAF and GeoNames ontologies

consistent semantics with the Linked Data social graph. To convert the vcard:VCard instance we consult the mapping vocabulary (SISM) and find that the concept is mapped to foaf:Person using skos:related. The model is checked to ensure that a transformation between the concepts would not result in information loss - this is done by checking the predicates used within the instance description and their mappings to the required ontology; a detailed explanation of this procedure falls outside of the scope of this paper. Once the model has been checked and the transformation has been approved then a rule can be applied to alter the concept definition.

An example mapping within SISM using skos:related associates the concept foaf:based_near with the concept vcard:adr. In this instance we have defined the two concepts as being similar but not completely equivalent. Therefore prior to applying a generated rule from this mapping we verify that no information is lost when transforming the RDF model. The generated rule from this mapping is as follows:

(?x vcard:adr ?geo) -> remove(0), (?x foaf:based_near ?geo)

By applying the rules we are able to transform any RDF model containing digital identity information to use the FOAF and GeoNames ontologies. This allows instances within the models to be normalised to contain the same semantics as the seed data, thus aiding information interpretation in the following disambiguation techniques.
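As an illustration, the sketch below applies the vcard:adr rule above to an rdflib graph; SISM is not consulted here, the rule is hard-coded, and the VCard namespace URI is an assumption.

from rdflib import Graph, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")

def apply_vcard_adr_rule(model: Graph) -> Graph:
    # (?x vcard:adr ?geo) -> remove(0), (?x foaf:based_near ?geo)
    for x, geo in list(model.subject_objects(VCARD.adr)):
        model.remove((x, VCARD.adr, geo))
        model.add((x, FOAF.based_near, geo))
    return model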

7. Identity Disambiguation

At this stage in our approach we have gathered seed data in the form of a Linked Data social graph describing a given person, and a set of web resources, each represented as an RDF model. Our disambiguation task is a binary classification problem (i.e. does this web page contain information about person X or not?), therefore we must use the seed data to partition the set of web resources into positive and negative sets, thereby disambiguating identity web references. Two disambiguation techniques are now presented.

7.1. Disambiguating using a Semi-supervised Machine Learning Technique

Semi-supervised machine learning techniques utilise training sets to initially construct a naive classifier, and then support the learning process of retraining the classifier. Several techniques exist within the literature including expectation-maximization [34], which re-estimates parameters of a classification model and then tunes the model based on these expected parameters; co-training [5], which trains multiple classifiers simultaneously and updates the training sets of each classifier according to a separate classifier's predictions; and self-training [18], which learns an initial classifier from a limited set of seed data and iteratively retrains the classifier using the strongest classification instances. Given our limited amount of seed data, only a single Linked Data social graph, our first disambiguation technique employs a self-training strategy to iteratively learn and improve a machine learning classifier.


Fig. 6. Self-training a machine learning classifier to disambiguate identity web references

By using self-training we have been able to investigate

the effects on disambiguation accuracy of two variations: a) different machine learning classifiers, and b) different feature similarity measures. Machine learning classifiers must be given labelled instances from which they are able to learn. These instances are represented as a binary feature vector x, where the classifier learns weights for each feature. By varying the feature similarity measure and the classifier we are able to observe the effects on performance of different permutations of classifier and feature similarity measure. Our self-training technique is shown in Fig. 6 and consists of two steps: the first step gathers binary seed data, both positive and negative, to be used for learning a classifier, given that the initial seed data is only positive.

The second step of the approach uses the positive and negative sets as training data. Once the classifier has been trained, we apply the classifier to the set of web resources (we refer to this as the universal set) in order to derive labels. We then enlarge the seed data with the labelled instances exhibiting the strongest confidence scores. We repeat this process until the universal set is empty, by retraining the classifier using the new seed data, classifying instances from the universal set and enlarging the training sets. As Fig. 7 shows, the resulting positive and negative sets are enlarged with all instances from the universal set, thereby generating classifications. We deal with three data sets within this technique: the positive set, the negative set and the universal set, which we will now refer to as P, N and U respectively.

7.1.1. Generating Binary Seed Data
A learned classifier must be able to label a web resource as citing a given person or not. Therefore

Fig. 7. Initial training sets iteratively enlarged using resources with the strongest confidence scores

we must populate N to accompany data within P. We initially populate P with the Linked Data social graph. Using the semantics of information within this graph, any resources related to the foaf:Person instance of the graph owner using foaf:homepage, foaf:workplaceHomepage or foaf:weblog are assumed to contain personal information describing the person. Therefore we generate an RDF model from each linked resource and add this to P, thereby adding additional instances to P from which a classifier can learn.

We add instances to N using a similar strategy to the technique from [51] by compiling a list of positive features from instances in P. All instances from U that contain any positive features from the list are then filtered out. The intuition behind this technique, according to [51], is that the resulting instances will contain no positive features and are therefore more likely to be negative. Due to the relatively small size of our positive data this strategy initially resulted in false negatives being added to N; additionally the relatively small size of our positive training data meant that statistical distribution methods such as those described in [18] were not feasible. Instead we place constraints on the size of the negative set by forcing its size to be approximately the same as the positive set, both in the number of instances and the number of features within those instances. The intuition is that the hypothesis of the trained classifier will correct itself and improve


over time, thereby overcoming the bias of false negatives (we observe this empirically when evaluating this method). If N is too large initially then this could unfairly bias the initial classifier, from which it will never recover.
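The following sketch illustrates this construction of N, with each instance reduced to a plain set of features; exact feature equality is a simplifying assumption, since in our technique features are RDF subgraphs compared with the similarity measures described below.

def build_negative_set(positive, universal):
    # positive, universal: lists of instances, each instance a set of features
    positive_features = set().union(*positive) if positive else set()
    # candidate negatives share no feature with the positive set
    candidates = [inst for inst in universal if not (inst & positive_features)]
    # cap N at roughly the size of P, in instances and in total features
    max_instances = len(positive)
    max_features = sum(len(inst) for inst in positive)
    negative, used = [], 0
    for inst in candidates:
        if len(negative) >= max_instances or used + len(inst) > max_features:
            break
        negative.append(inst)
        used += len(inst)
    return negative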

7.1.2. Learning to Classify Identity Web References
Instances are supplied to machine learning classifiers as RDF models; the classifier uses features from those instances to construct a classification model. Therefore we now explain how features are extracted from the RDF models, what different feature similarity measures are used, and how we map the RDF models into a machine learning dataset:

7.1.2.1. Feature Extraction We use seed data extracted from Web 2.0 platforms, given that such data contains social network information, under the intuition that a person will appear within a web resource with individuals that he/she knows. Therefore we use RDF instances as features from the RDF models within P, N and U. We generate these features by extracting subgraphs from the RDF model using a recursive algorithm called "Concise Bounded Description" [19]. This algorithm functions by "extracting a subgraph around an RDF resource", thus returning the immediate properties of a resource as its description and therefore the feature to use. This not only produces instances of foaf:Person, but also other types of instances that are present within the RDF models (e.g. geo:Feature).
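A sketch of this extraction with rdflib is shown below; handling of reified statements, which the full Concise Bounded Description definition also covers, is omitted for brevity.

from rdflib import Graph, BNode

def concise_bounded_description(model: Graph, resource) -> Graph:
    cbd = Graph()
    seen = set()

    def describe(node):
        if node in seen:
            return
        seen.add(node)
        for p, o in model.predicate_objects(node):
            cbd.add((node, p, o))
            if isinstance(o, BNode):      # recurse into anonymous resources
                describe(o)

    describe(resource)
    return cbd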

7.1.2.2. Feature Similarity Measures As previously mentioned, we are able to vary the feature similarity measure used by a machine learning classifier. Features are represented by foaf:Person instances within RDF models of web resources. This permits varying the feature similarity measure between the following three RDF graph comparison techniques. It is worth noting that we use the Levenshtein string similarity metric when literals are compared - to overcome misspellings - as part of the following RDF graph comparison techniques.

Jaccard Similarity: The Jaccard similarity of graphs as described in [7] looks at how similar two graphs are based on the number of edits required to transform one graph into another. Jaccard similarity requires the learning of a given threshold which, when exceeded, classes two graphs as equivalent, thus declaring two RDF instances as representing the same person. The measure is defined as follows:

let Vg denote the vertices in graph g corresponding to resources and literals in the RDF instance description, and Eg denote the set of edges in graph g corresponding to properties and relations. Graphs g and h are equivalent if jaccsim(g, h) > θ, where θ is the learned threshold and the Jaccard similarity is defined as:

jaccsim(g, h) = 2(|Vg ∩ Vh| + |Eg ∩ Eh|) / (|Vg| + |Vh| + |Eg| + |Eh|)
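Read directly, the measure can be computed from the vertex and edge sets as follows; the threshold value shown is an illustrative assumption, since in practice it is learned.

def jaccsim(Vg, Eg, Vh, Eh):
    # Vg, Eg, Vh, Eh: sets of vertices and edges of the two RDF instance graphs
    denom = len(Vg) + len(Vh) + len(Eg) + len(Eh)
    if denom == 0:
        return 0.0
    return 2.0 * (len(Vg & Vh) + len(Eg & Eh)) / denom

def equivalent(Vg, Eg, Vh, Eh, threshold=0.8):
    return jaccsim(Vg, Eg, Vh, Eh) > threshold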

Inverse Functional Property Matching: One of the key advantages of modelling information semantically is the expressivity of certain properties and relations, such as the use of inverse functional properties. According to the FOAF ontology, properties such as foaf:mbox and foaf:homepage are typed as owl:InverseFunctionalProperty and depict a one-to-one relation between a foaf:Person instance and the attributed resource. Therefore we use a similar technique to our graph reasoning method when interlinking social graphs, by defining foaf:Person instances as being equivalent when inverse functional properties match from both instances; this is also known as "Data Smushing" [42].

RDF Entailment: It is common for RDF instances to vary in size depending on available knowledge. Imagine there are two RDF instances g and h describing the same person: g contains geographical information along with the person's name, email address and web site, whereas h contains the same data without geographical information. One can infer that both g and h refer to the same person, and therefore the tripleset of h is a subset of the tripleset of g. According to Pat Hayes' Interpolation Lemma 32 and work in [30], one can say that g entails h (g |= h), since every model of g is also a model of h. Therefore we define two RDF graphs as equivalent if one graph entails the other.
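For ground graphs (those without blank nodes) this reduces to a subset test over triples, as the sketch below illustrates; the blank node mapping required for full simple entailment is omitted.

from rdflib import Graph

def entails(g: Graph, h: Graph) -> bool:
    # the larger graph entails the smaller one when every triple of h appears in g
    return set(h) <= set(g)

def equivalent_by_entailment(g: Graph, h: Graph) -> bool:
    return entails(g, h) or entails(h, g)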

7.1.2.3. Mapping RDF models into a Machine Learning Dataset In order to use RDF models as instances from which machine learning classifiers can learn, we must represent these models as binary feature vectors where each feature represents a distinct instance of foaf:Person. Therefore we map the RDF model for each web resource into a dataset for learning using the following three steps:

(i) One RDF subgraph (instance) per person is extracted from the RDF model of the web resource using the previously described technique.

32 http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#entail

(ii) Each extracted RDF subgraph is flattened, representing it as a set of RDF triples, to enable comparison using the feature similarity measures. The metrics work at the statement level by comparing the triplesets of each subgraph and deriving a match.

(iii) Machine learning instances are constructed as binary feature vectors using a feature indexer (i.e. a binary vector x where index i denotes a given feature). To compile a new feature vector representation of a machine learning instance, indexes are derived for the features of the given web resource by comparing those features with the existing features within the indexer. If a feature is matched then the relevant index is returned, otherwise a new index is created. Given that the features are represented as RDF subgraphs, the feature matching technique can be varied, thereby affecting the indexes within the feature vectors and the resultant dataset (a sketch of this indexer follows the list).
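The indexer mentioned in step (iii) can be sketched as follows; the class name and the pluggable `match` function are illustrative, and any of the three similarity measures above could be passed in.

```python
from typing import Callable, List
from rdflib import Graph

class FeatureIndexer:
    """Maps RDF subgraphs (features) to integer indexes, reusing an index
    whenever the supplied match function deems two subgraphs equivalent."""

    def __init__(self, match: Callable[[Graph, Graph], bool]):
        self.match = match
        self.features: List[Graph] = []        # position in this list = feature index

    def index_of(self, subgraph: Graph) -> int:
        for i, known in enumerate(self.features):
            if self.match(subgraph, known):
                return i                        # existing feature matched
        self.features.append(subgraph)          # otherwise register a new feature
        return len(self.features) - 1

    def to_vector(self, subgraphs: List[Graph]) -> List[int]:
        """Binary feature vector for one web resource."""
        hits = {self.index_of(g) for g in subgraphs}
        return [1 if i in hits else 0 for i in range(len(self.features))]
```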

By modelling dataset instances as RDF models, and instance features as foaf:Person instance descriptions in the RDF model, we allow for the variation of the feature similarity measure. Within this disambiguation technique we compare three machine learning classifiers: Naive Bayes [34], Support Vector Machines [15] and Perceptron [4]. Naive Bayes derives the conditional probabilities from labelled instances and their features to build the initial generative model. SVM and Perceptron construct their discriminative hyperplanes within the feature space via optimization based on the labelled instances within the dataset. By mapping the RDF models of each web resource into such a dataset, the internal functionality of the machine learning classifiers is not altered. Instead we use the classifiers merely as a "black box" by inputting the datasets for training and testing and returning the classification labels for instances within U along with the classification confidences of those labels.

7.1.2.4. Self-training The self-training strategy (shown in Algorithm 1) begins by first using the seed data to train a classifier; at this stage the sizes of the sets are relatively small. Instances from U are classified using the initially trained classifier and the instances are ranked based on their classification confidence. The top k positive instances and the top k negative instances from U are chosen according to the rankings, and P and N are enlarged with those instances respectively. The strongest k positives and strongest k negatives are also removed from U, thereby decreasing the size of the universal set. We repeat this process by retraining the classifier and reapplying it to the universal set until the universal set becomes empty. The intuition behind this strategy is that the trained classifier will improve over time as the hypothesis of the classifier gradually moves to a stable state and the classification confidences of instances begin to converge.

Algorithm 1 ItEnl(P, N, U): Iterative self-training based on enlargement of the training data according to classification rankings. P is the positive set, N is the negative set and U is the universal set.
Input: P, N and U
Output: P and N

1: while U ≠ ∅ do
2:   Train the classifier φ using P and N
3:   Classify U using φ
4:   Rank U based on classification confidences
5:   Enlarge the training sets with the strongest k positive (U_k^P) and negative (U_k^N) instances
6:   P = P ∪ U_k^P
7:   U = U \ U_k^P
8:   N = N ∪ U_k^N
9:   U = U \ U_k^N
10: end while
11: return P and N

According to [46] and [50], ranking instances based on their classification confidences and then enlarging the relevant training sets with the top k instances yields trade-offs between the resulting accuracy and optimized performance. Work in [50] states that lowering k achieves high accuracy whereas increasing k speeds up the process. Therefore we must set k to a suitable value given the domain of application. We explain within the experimental setup our specified value and the decisions behind it.
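A compact sketch of the loop in Algorithm 1 is shown below, assuming fixed-length feature vectors have already been built with an indexer such as the one above and that the seed sets P and N are both non-empty; scikit-learn's SVC stands in for any of the three classifiers, and the names are ours.

```python
import numpy as np
from sklearn.svm import SVC

def self_train(P, N, U, k=1):
    """P, N: labelled positive/negative feature vectors; U: unlabelled ones."""
    P, N, U = list(P), list(N), list(U)
    while U:
        clf = SVC()
        clf.fit(np.array(P + N), np.array([1] * len(P) + [0] * len(N)))
        scores = clf.decision_function(np.array(U))   # signed confidence of the positive class
        order = np.argsort(scores)
        strongest_pos = set(order[-k:].tolist())       # top-k positive instances
        strongest_neg = set(order[:k].tolist()) - strongest_pos
        P += [U[i] for i in strongest_pos]             # enlarge the training sets
        N += [U[i] for i in strongest_neg]
        U = [u for i, u in enumerate(U) if i not in strongest_pos | strongest_neg]
    return P, N
```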

7.2. Disambiguating using a Graph-Based Technique

State of the art graph-based disambiguation techniques compile a graph space from the instances to be disambiguated and their features, and then exploit the graph space to support the disambiguation process. For instance [10] addresses the general problem of entity disambiguation by constructing the graph space from declarative links between entities extracted from the Linked Data web, and employs the links to infer relationships between entities using a probabilistic model. [2] uses a partially labelled graph space as input to a machine learning classifier, which then predicts labels for the unlabelled nodes. Relaxation labeling is then used to correct labelled nodes within the graph space based on the labels of their neighbouring nodes. [25] constructs a graph space from various features (named entities, hyperlinks, and lexical tokens) of a given set of web pages and performs Random Walks through the graph space, measuring distances between pages and clustering those that are closest together.

Fig. 8. Using Random Walks to disambiguate identity web references

Our second disambiguation technique uses a similar Random Walks approach to [25]. This technique is well suited to our problem setting given that the seed data and the set of web resources to be disambiguated are modelled using RDF, which provides a graph structure consisting of interlinked triples. Our graph-based disambiguation technique is divided into five steps (see Fig. 8). We first construct an interlinked graph space from the seed data and the set of web resources. The second step derives strongly connected components within the graph space. The third step assigns weights to edges and constructs a random walks model to traverse the graph space. The fourth step measures distances within the graph space, and the fifth step clusters the web resources according to those distances. Clustering aims to partition the web resources into two groups: positive and negative. We now explain each of these steps in greater detail.

7.2.1. Building an Interlinked Graph Space
We construct the graph space by initially adding the seed data (the Linked Data social graph), and then iteratively adding the RDF model of each web resource to be disambiguated using an indexing paradigm. Nodes are represented in the graph space by resources and literals from the RDF models, and edges are represented in the graph by properties and relations. The indexing paradigm functions by comparing the resources and literals from an RDF model which are yet to be integrated into the graph space with all the resources and literals currently in the graph space. If a match occurs, for example if an RDF model contains a person's name which already resides in the graph space, then the target node of the edge in the RDF model is replaced with the matched node in the graph space, thereby providing a link. It is common for literals to contain misspellings, therefore the Levenshtein string similarity metric is used when comparing literals: if the derived measure is above an assigned threshold then a match is confirmed and the RDF model is integrated into the graph space.

Resources within the RDF models, denoted by URIs, are compared for strict equivalence (e.g. geographical locations with a URI). Each of the web resource models and the social graph has a document namespace URI, which we refer to as the web resource root node and the social graph root node respectively (as shown in Fig. 9). This strategy compiles a graph space which is rich in features, thereby allowing web resources to be linked together by many common literals and resources; this provides a multitude of paths through which traversals can be made.
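The indexing paradigm can be sketched as below; networkx stands in for the graph store, difflib's ratio is used here as a stand-in for the Levenshtein similarity on literals, and the threshold value is an assumption for illustration only.

```python
import difflib
import networkx as nx
from rdflib import Graph, Literal

LITERAL_THRESHOLD = 0.9   # assumed literal-similarity threshold

def matching_node(space: nx.MultiDiGraph, term):
    """Find an existing node matching `term` (strict for URIs, fuzzy for literals)."""
    for node in space.nodes:
        if isinstance(term, Literal) and isinstance(node, Literal):
            if difflib.SequenceMatcher(None, str(term), str(node)).ratio() >= LITERAL_THRESHOLD:
                return node
        elif term == node:
            return node
    return None

def integrate(space: nx.MultiDiGraph, model: Graph) -> None:
    """Merge a web resource's RDF model into the shared graph space."""
    for s, p, o in model:
        s_node = matching_node(space, s)
        o_node = matching_node(space, o)
        s_node = s if s_node is None else s_node
        o_node = o if o_node is None else o_node   # reuse a matched node to create a link
        space.add_edge(s_node, o_node, label=str(p))
```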

7.2.2. Identifying Strongly Connected Components
It is possible for a compiled graph space to contain islands of connected nodes called strongly connected components. These islands are created when the RDF models integrated into the graph space only have common literals and resources with a limited number of other web resource models. This prevents a walker from traversing across the full diameter of the graph space, in that the walker cannot island-hop; therefore the individual strongly connected components must be identified. We identify these components using Tarjan's algorithm [45] by starting at the social graph root node within the graph space and performing a depth-first search, returning all connected nodes and edges from the starting node and therefore the strongly connected component in which the social graph root node resides.
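In networkx terms, keeping only the island around the social graph root node could look like the following; the paper uses Tarjan's algorithm, while this sketch simply keeps everything reachable from the root via a depth-first search over the undirected view of the space, which matches the behaviour described above.

```python
import networkx as nx

def root_component(space: nx.MultiDiGraph, root) -> set:
    """Nodes connected to the social graph root node (edge direction ignored)."""
    undirected = space.to_undirected(as_view=True)
    return set(nx.dfs_preorder_nodes(undirected, source=root))
```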

By only considering this component for traversal, information that is irrelevant to the person whose personal information we wish to find is removed. For instance, the social graph contains seed data describing literals and resources which web resources should also contain if they are correct identity web references. Even if certain web resources do not exhibit any of those features, they may share features with other web resources which in turn share features with the social graph (thereby providing transitive linking). Therefore, we assume that any resource which is not linked to this island of nodes does not contain information we wish to find.

7.2.3. Traversing the Graph Space
Once we have the strongly connected component we traverse this graph using Random Walks [31], moving from a given node to a random adjacent node, and then randomly to any adjacent node of that node, thus forming a stochastic transition process. Edges must first be weighted within the graph space to influence the transition probabilities of the random walker.

7.2.3.1. Weighting Edges We apply a weighting strategy similar to the technique presented in [32]: given that the edges denote properties and relations from an ontology, we are able to weight edges which denote unique links as carrying more weight. These are edges that represent concepts typed as owl:InverseFunctionalProperty or owl:FunctionalProperty (e.g. foaf:mbox, foaf:homepage, and foaf:weblog). We denote the set of edges within the graph as E and the set of nodes as N. The set of edge labels is denoted as L = {l_1, l_2, ..., l_n}, where l_i denotes an edge label corresponding to a distinct concept from an ontology. A weighting function w(l_i) is therefore defined that returns the weight associated with a given edge (and therefore a distinct ontology concept). The weights are distributed using the same strategy as [25], such that:

\sum_{l_i \in L} w(l_i) = 1

7.2.3.2. Random Walks We use the Random Walks model to return the probability of moving from one node in the graph space to another in a given number of steps. Essentially this walk forms a stochastic process and is defined according to a first-order Markov chain [33], where the future state depends only on the current state. To compute these transition probabilities we initially create a local similarity matrix known as the adjacency matrix A. A contains the similarity between nodes i and j, computed by normalising the weight of the edge between i and j by the number of edges of that distinct type which leave i (the degree of node i). A is formally defined, following [38], as:

a_{ij} = \begin{cases} \sum_{l_k \in L} \dfrac{w(l_k)}{|\{(i, \cdot) \in E : l(i, \cdot) = l_k\}|}, & (i, j) \in E \\ 0, & \text{otherwise} \end{cases}

We then construct the diagonal degree matrix as:

D_{ii} = \sum_{k} a_{ik}

Using the Markov chain model of a random walk we wish to derive the probability of moving from node i to node j in t steps. Therefore we create a transition probability matrix using P = D^{-1}A for one step, and P^t = (D^{-1}A)^t for t steps.
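Numerically this amounts to the following numpy sketch, where `edges` is assumed to be a list of (i, j, label) tuples over the component, node indices run from 0 to n-1, every node has at least one outgoing edge, and `weight` is the per-label weighting function whose values sum to one.

```python
import numpy as np
from collections import Counter

def transition_matrix(n: int, edges, weight) -> np.ndarray:
    """P = D^-1 A for the weighted graph; P^t = np.linalg.matrix_power(P, t)."""
    out_label_counts = Counter((i, label) for i, _, label in edges)
    A = np.zeros((n, n))
    for i, j, label in edges:
        # weight of this edge type, normalised by how many such edges leave i
        A[i, j] += weight(label) / out_label_counts[(i, label)]
    D_inv = np.diag(1.0 / A.sum(axis=1))      # inverse of the diagonal degree matrix
    return D_inv @ A
```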

7.2.4. Measuring distances within the Graph Space

7.2.4.1. Commute Time We wish to measure the distances between root nodes within the graph space so that we may then cluster nodes which are closer to one another. One such distance measure is Average First-Passage Time [38], which calculates the number of steps taken when performing a random walk through the graph space from one node to another; in essence this returns the hitting time for the traversal:

m(k|i) = \begin{cases} 1 + \sum_{j=1, j \neq k}^{n} p_{ij}\, m(k|j), & i \neq k \\ 0, & i = k \end{cases}

This measure only returns the number of steps taken from leaving a given node to reaching another. Therefore, by calculating the Commute Time it is possible to derive the round-trip cost of leaving a given node, reaching another node and then returning again to the previous node:

n(i|k) = m(k|i) + m(i|k)

Both m(k|i) and m(i|k) will return different values (given that P^t is not symmetric), therefore by summing their traversal costs a symmetric matrix N of commute time distances is produced, where n_{ij} = n_{ji}.
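For reference, the standard way to obtain these quantities is to solve the linear system implied by the recursion above, as in the sketch below (P is the transition matrix as a numpy array); note that the paper additionally truncates walks at t = 2, which this closed-form sketch does not reproduce.

```python
import numpy as np

def first_passage_times(P: np.ndarray, k: int) -> np.ndarray:
    """m(k|i) for all i, solving m = 1 + P_sub m with m(k|k) = 0."""
    n = P.shape[0]
    idx = [i for i in range(n) if i != k]
    A = np.eye(n - 1) - P[np.ix_(idx, idx)]
    m = np.zeros(n)
    m[idx] = np.linalg.solve(A, np.ones(n - 1))
    return m

def commute_time(P: np.ndarray, i: int, k: int) -> float:
    """Round-trip cost n(i|k) = m(k|i) + m(i|k)."""
    return first_passage_times(P, k)[i] + first_passage_times(P, i)[k]
```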

Fig. 9. Snippet of a compiled graph space. Anonymous nodes are represented using the blank node notation _:person. A complete graph space would be too large to display graphically in a legible manner.

When deriving the average commute time we must set the random walks model to a suitable number of steps and therefore set the size of t in P^t. If t is too high then the random walker covers a larger portion of the graph, reducing the distinction between locally connected nodes. By setting t to a lower value, the random walker focuses on moving along shorter paths, preserving the clarity of locally connected nodes. As the number of paths between nodes increases, where these paths are short in length, the commute time decreases [36]. Therefore, in order to apply commute time effectively as a clustering distance measure, we set t = 2, thereby focussing on short paths within the graph space.

7.2.4.2. Optimum Transitions The variation in the number of steps that the random walker takes (denoted by t) can be exploited to measure the transition probabilities. In [32] walks are made through the graph space until a desired node is reached or until t exceeds 50, and [9] alters the size of t to affect information retrieval performance. It is possible to measure distances between nodes within the graph space by deriving the optimum number of steps needed to traverse from a given node to another node. We find the optimum number of steps between two nodes by incrementing t until the transition probability has peaked for the first time. Therefore we iteratively increase the matrix power of the transition probability matrix P^t until the probability of moving from node i to node j has peaked (as shown in Algorithm 2) and store the number of steps within the matrix O.

Algorithm 2 OptTrans(P): Derives the Optimum Transitions matrix O by incrementing the matrix power of P.
Input: P
Output: O

1: O = 0 {Initialise the Optimum Transitions matrix as the zero matrix}
2: for i = 0 to |P| do
3:   for j = 0 to |P| do
4:     for t = 1 to ∞ do
5:       if (p^t_ij ≥ p^{t-1}_ij) and (i ≠ j) then
6:         o_ij = t
7:       else
8:         break
9:       end if
10:     end for
11:   end for
12: end for
13: return O
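A vectorised sketch of Algorithm 2 with numpy follows; the cap of 50 steps is our assumption (echoing the bound used in [32]) so that the loop always terminates.

```python
import numpy as np

def optimum_transitions(P: np.ndarray, max_steps: int = 50) -> np.ndarray:
    """For each pair (i, j), the step t at which p^t_ij last rose before peaking."""
    n = P.shape[0]
    O = np.zeros((n, n), dtype=int)
    active = ~np.eye(n, dtype=bool)      # the diagonal is never tracked
    Pt_prev, Pt = np.eye(n), P.copy()    # P^0 and P^1
    for t in range(1, max_steps + 1):
        still_rising = active & (Pt >= Pt_prev)
        O[still_rising] = t              # record the latest non-decreasing step
        active = still_rising            # pairs whose probability fell are finished
        if not active.any():
            break
        Pt_prev, Pt = Pt, Pt @ P
    return O
```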

7.2.5. Clustering Web Resources
Using the above distance measures we use pairwise agglomerative clustering [49] to produce the two clusters (positive and negative). Our clustering strategy begins by placing every web resource root node and the social graph root node within the graph space in an individual cluster. Clusters are then compared to measure their average distance. If this is below a given threshold then the clusters are merged together. We compare and merge clusters until no more clusters can be merged; at this point we select the cluster containing the social graph as the positive cluster. Clusters are compared as follows:

average\_dist(c_1, c_2) = \frac{\sum_{r_i \in c_1} \sum_{r_j \in c_2} dist(r_i, r_j)}{|c_1||c_2|}

where dist(r_i, r_j) is defined for Commute Time Distance as:

dist(r_i, r_j) = \frac{n_{r_i r_j}}{\max(N)}

and for Optimum Transitions as:

dist(r_i, r_j) = \frac{o_{r_i r_j}}{\max(O)}

In order for two clusters to be merged together, average_dist(c_1, c_2) must be less than a given threshold \theta where 0 ≤ \theta ≤ 1. The value of the threshold is proportional to the maximum commute time from N when clustering according to commute time distances, and proportional to the maximum number of transitions from O when clustering using optimum transitions. We assume that the strongly connected component containing the social graph root node will also contain web resources which do not cite the person we are interested in, given that such resources may have a low number of features in common with the social graph. This clustering strategy therefore dictates that a value for \theta must be derived by tuning on known data; we discuss this in the following section.
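The clustering step can be sketched as plain pairwise agglomerative merging; `dist` is assumed to be one of the two normalised distance functions above and `threshold` the tuned value of θ.

```python
def cluster_web_resources(root_nodes, dist, threshold):
    """Merge clusters while some pair has an average distance below the threshold."""
    clusters = [[r] for r in root_nodes]       # every root node starts in its own cluster
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(dist(ri, rj) for ri in clusters[a] for rj in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is not None:
            _, a, b = best
            clusters[a].extend(clusters.pop(b))
            merged = True
    return clusters
```

The positive set of identity web references is then read off as the cluster that contains the social graph root node.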

8. Evaluation

In order to assess the performance of each of the presented disambiguation techniques a thorough evaluation was conducted. This section describes the experimental setup followed by the results from the evaluation.

8.1. Experiment Design

8.1.1. Evaluation Measures
We are concerned with assessing each of our presented disambiguation techniques' ability to disambiguate web resources which contain personal information about a given person. Therefore we use the information retrieval metrics recall, precision and f-measure [47]. Let A denote the set of web resources which contain personal information and B denote the set of retrieved web resources; then precision = |A ∩ B|/|B| and recall = |A ∩ B|/|A|. F-measure provides the harmonic mean of both precision and recall as follows:

f\text{-}measure = \frac{2 \times precision \times recall}{precision + recall}

8.1.2. Dataset
The dataset was compiled using 50 members of the Semantic Web and Web 2.0 communities as participants. For each participant seed data was produced using our previously described method in the form of a Linked Data social graph. A set of web resources to be disambiguated was gathered using our approach by querying the World Wide Web and the Semantic Web. The first 100 results were gathered from each query and cached for the experiments. As expected there were many duplicate web resources returned from each search engine (i.e. equivalent publication pages and homepages) and several web resources were noise (i.e. web resources which require secure access); these were removed. The resulting dataset contained approximately 17,300 web resources, with 346 web resources to be disambiguated for each participant.

We then created a gold standard by manually labelling each of the web resources as either citing the given person or not. It is worth noting that the dataset contains participants who have varying levels of web presence. Each participant's web presence level is measured as the proportion of web resources within the total set that are identity web references, given as a percentage. For instance, if participant A appears in 50 of the 350 web resources within their total set, then the web presence level of that participant is 14%. This web presence level allows each of the disambiguation techniques to be assessed based on its ability to handle and process varying amounts of information.

8.1.3. Selection of Baseline Measures
In order to provide a contextual assessment of each of the presented disambiguation techniques, three baseline measures have been used:

8.1.3.1. Baseline 1: Person Name Occurrence Our first baseline technique utilises the presence of the participant's name within a web resource as a positive indicator of the presence of personal information, thereby yielding low levels of precision but high levels of recall. It offers a useful baseline against which our approaches can be compared and highlights the need for background knowledge for each person.

8.1.3.2. Baseline 2: Hierarchical Clustering using Person Names Our second baseline technique uses latent semantic analysis via hierarchical clustering [32]. This is our baseline unsupervised approach which does not use seed data; instead the approach relies on a feature comparison technique where web resources are clustered based on the presence of person names within resources. This method produces k clusters where each cluster contains web resources containing personal information of a given person. We then select the cluster which is relevant to our participant for evaluation.

8.1.3.3. Baseline 3: Human Processing Our third baseline measure compares the presented disambiguation techniques against human processing of the same data. To provide this baseline measure a group of 15 people, denoted as raters, were set the task of manually processing the dataset. Three different raters disambiguated web resources for each evaluation participant, thereby providing three distinct sets of results. We then used interrater agreement to find the agreement between the raters [24] by using one rater's results as the gold standard and another rater's results as the test subject. We then computed the average of the three separate rater agreement measures for each participant.

8.1.4. Experimental Setup

8.1.4.1. Self-training We assess the performance of nine distinct permutations of machine learning classifier and feature similarity measure (given that we employ three classifiers: Naive Bayes, Perceptron and SVM, and three measures: Jaccard similarity, Inverse Functional Property matching and RDF entailment). For each of the 50 participants we test nine permutations, resulting in 450 experiments to be run for this disambiguation technique. We set self-training to enlarge the training data with the top k strongest classified instances where k = 1. We set k at the lowest value possible under the assumption that the accuracy of the derived classifications is paramount, and therefore the strongest example should be selected to enlarge the training sets. We set the similarity threshold for Jaccard similarity to 1 to detect strict graph equivalence when comparing features. We employed the machine learning framework Aleph 33 to perform self-training of the framework's built-in classifier algorithms.

8.1.4.2. Random Walks Random Walks uses agglomerative clustering to cluster web resources closest to the social graph. The clustering threshold θ was derived by tuning on 10% of the dataset and was set to 0.3 for commute time distances and 0.55 for optimum transitions.

8.2. Experimental Results

8.2.1. Self-Training
Table 1 presents the results derived from evaluating self-training using different permutations of machine learning classifier and feature similarity measure. The results indicate that the most precise similarity measure used by each classifier was Jaccard similarity. However, this measure also yielded the poorest scores for recall, largely due to its strict behaviour when detecting a match (by not allowing for slight variations between comparable features). The results demonstrate that RDF entailment performs consistently well when combined with different classifiers. We believe that this is due to the large variance in the level of personal information displayed on web pages. For example, there were several cases where a person would display their name and web address with their e-mail encoded to restrict spam from screen-scrapers. The same person would then appear on another web page with the same information yet without the encoded e-mail address. RDF entailment is able to derive equivalence between the instances, thereby detecting a feature match.

The combination of SVM and RDF entailment yielded the highest f-measure score. As expected, SVM outperforms Perceptron for all similarity measures studied, given that Perceptron induces a single-vector linear model for explaining the data, whereas SVM uses several support vectors leading to a better optimization of the induced model. The results present the most precise combination as the use of Perceptron with Jaccard similarity. The use of a single vector to classify instances within the feature space offers a precise approach, and coupling this with Jaccard similarity achieves a high level of precision.

33 http://aleph-ml.sourceforge.net/


Table 1
Accuracy of Different Permutations of Machine Learning Classifier and Feature Similarity Measure with respect to Baseline Techniques

                           Precision   Recall   F-Measure
Perceptron   Entailment      0.645      0.636     0.540
             IFP             0.865      0.364     0.512
             Jaccard         0.935      0.318     0.474
SVM          Entailment      0.733      0.670     0.642
             IFP             0.811      0.412     0.546
             Jaccard         0.912      0.384     0.540
Naive Bayes  Entailment      0.684      0.665     0.606
             IFP             0.786      0.418     0.545
             Jaccard         0.906      0.389     0.545
Baseline 1                   0.191      0.998     0.294
Baseline 2                   0.648      0.592     0.556
Baseline 3                   0.765      0.725     0.719

Overall the results indicate that the use of self-training is a very precise disambiguation technique, outperforming all of the baselines in terms of precision; however, its precise nature leads to poor levels of recall. In terms of f-measure, permutations of a given classifier with Entailment outperform the baseline unsupervised technique, but do not perform as well as human processing. To provide a more in-depth analysis of this technique we now discuss the performance of the individual classifiers.

8.2.1.1. Perceptron In order to analyse the relationship between the performance of a given disambiguation technique and web presence levels, we use best-fitting lines of regression (as shown in Fig. 10). We select the regression model that maximises the coefficient of determination [12] within the statistical model of the data; this can explain either a linear or a nonlinear relationship between the accuracy of the disambiguation technique and web presence levels. As Fig. 10(a) indicates, Perceptron combined with Jaccard and IFP performs well for participants with low web presence levels. At such levels these permutations outperform the baseline techniques and Perceptron with Entailment. As web presence levels increase, Perceptron with Entailment improves in accuracy, whereas the two alternative permutations involving Perceptron actually reduce in accuracy, resulting in poorer performance than the baseline techniques.

Fig. 10. Accuracy in terms of f-measure of self-trained machine learning classifiers with respect to web presence levels: (a) Perceptron, (b) SVM, (c) Naive Bayes.

8.2.1.2. SVM Like the Perceptron permutations, SVM combined with all similarity measures performs well for low web presence levels, outperforming the baseline measures (as Fig. 10(b) shows). As web presence increases, however, the accuracy levels of these permutations reduce. Only SVM combined with Entailment shows a less drastic reduction in performance, with only a slight decrease as web presence increases and the regression curve tending towards a linear model.


8.2.1.3. Naive Bayes Combining precision and recall into f-measure values (see Fig. 10(c)) indicates that Naive Bayes with Entailment improves as the web presence level increases and is therefore better suited to participants with higher levels of web presence, while Naive Bayes with IFP and Naive Bayes with Jaccard are more appropriate permutations for low levels of web presence. Moreover, this suggests that entailment is a better suited feature similarity measure for individuals with higher web presence levels.

8.2.1.4. Iterative Improvement of Classifiers We earlier stipulated that self-training must improve the performance of a classifier in order to overcome the possible bias of false negatives within the seed data. To observe the effects of Web 2.0 data and the improvement in performance when applying self-training to each permutation of classifier and feature similarity measure, we analysed the performance of the classifier at each iteration step. This was performed for each permutation for each participant, thereby providing a detailed insight into the effects of self-training.

The performance of Perceptron with IFP and Jaccard improves through training, while Perceptron with Entailment worsens. This was due to misclassified instances early within the training process leading to an overall reduction in accuracy. The decision boundary created by Perceptron with IFP (as shown in Fig. 11(a)) reaches a stable state very early within the training cycle due to the strict nature of instance matching via inverse functional properties. The decision boundaries constructed using Perceptron with Jaccard and Entailment take longer to reach a stable state as the hyperplane fluctuates between positions within the feature space.

SVM experiences large fluctuations prior to establishing a stable hyperplane within the feature space when combined with Jaccard and Entailment. As retraining is performed, the margins running parallel to the hyperplane are adjusted as new support vectors are discovered within the feature space. This variance is similar to the behaviour of combinations of the same feature similarity measures with Perceptron (given that both construct separating hyperplanes via discriminative models). Another similarity with Perceptron is the relatively small accuracy fluctuation when combining SVM with IFP.

As shown in Fig. 11(c), Naive Bayes also experiences early fluctuations in classification accuracy for all permutations with feature similarity measures; however, these variations are not as great as those experienced by SVM and Perceptron. Unlike the previous two classifiers, a hypothesis is found relatively early in the training process for combinations of Naive Bayes with Jaccard and Entailment. Observing the effects of training in Fig. 11(c) indicates that the initial hypothesis compiled by a Naive Bayes classifier is close to the final retrained classifier for all permutations; while the classifier does improve, the improvement is not significant.

8.2.2. Random Walks
Our second disambiguation technique, Random Walks, outperforms two of the baseline techniques in terms of precision and overall f-measure. As the results in Table 2 show, clustering using both commute time and optimum transitions produces high levels of recall, outperforming human processing. This indicates that a graph-based technique using Web 2.0 data is better suited to finding the maximum number of web resources that contain personal information of a given individual whilst preserving satisfactory levels of precision.

Table 2
Accuracy of Commute Time and Optimum Transitions with respect to Baseline Techniques

                        Precision   Recall   F-Measure
Commute Time              0.707      0.798     0.705
Optimum Transitions       0.659      0.805     0.684
Baseline 1                0.191      0.998     0.294
Baseline 2                0.648      0.592     0.556
Baseline 3                0.765      0.725     0.719

As Fig. 12 demonstrates, both commute time and optimum transitions improve in terms of precision and recall as web presence levels increase. Moreover, in terms of recall the relationship between accuracy and web presence is linear, indicating a greater increase in performance for optimum transitions. The results also indicate that clustering using commute time is a more precise method than utilising optimum transitions. This could be attributed to the round-trip cost utilised by commute time, whereas optimum transitions only considers the number of steps to reach a node within the graph space as a proximity measure. Conversely, clustering with optimum transitions returns more identity web references to a given person compared to clustering by commute time.


Fig. 11. Effects on f-measure levels when applying self-training to permutations of machine learning classifiers with different feature similarity measures: (a) Perceptron, (b) SVM, (c) Naive Bayes.

As stipulated in our earlier requirements, the accuracy of automated disambiguation techniques must benefit from the use of seed data. The use of Random Walks as a disambiguation technique indicates that this requirement has been fulfilled. Additionally, the results demonstrate that this technique outperforms the unsupervised technique in terms of precision, recall and f-measure, and outperforms human processing in terms of recall.

Fig. 12. Accuracy of Random Walks with clustering using Commute Time and Optimum Transitions in terms of f-measure with respect to baseline techniques

9. Conclusions

In this paper we have presented an accurate and high-performance approach to disambiguate identity web references through the use of Web 2.0 data. For disambiguation techniques to provide accurate results we claim that the seed data supplied to such techniques must be representative of real identity information. Our study of users of the Web 2.0 platform Facebook supports this claim by demonstrating the great extent to which offline relationships are now an intrinsic part of such platforms. Moreover, the evaluation of our disambiguation techniques demonstrates that the use of Web 2.0 data provides accurate results.

We have shown that the monitoring of personal information online can be performed using automated techniques, thereby alleviating the need for human processing. We hypothesised that a semi-supervised approach would outperform a state of the art unsupervised technique, and have shown through a detailed evaluation that both Self-training and Random Walks achieve this.

Data semantics plays a crucial role in our approach. It facilitates information integration when reconciling digital identity fragments for use as seed data by attributing semantics to data extracted from Web 2.0 platforms. This provides a common interpretation of what the data is about, thus supporting inferencing. We describe the underlying knowledge structures of web resources by generating metadata models from those resources which contain concepts from web-accessible ontologies, thus providing interpretable semantics.


Using semantics supports the disambiguation techniques; for instance our use of Self-training models machine learning instances as RDF models and features as RDF instances from those models. We are then able to vary the feature similarity measure used by the classifier between various RDF instance comparison techniques. When using Random Walks we capitalise on the graph structure of RDF to compile a graph space and exploit the graph space to disambiguate identity web references. We are able to weight edges within the space based on the semantics of the ontology concept that they represent.

The requirements we stipulated previously have been fulfilled as follows:

(i) Bootstrap the disambiguation process with minimal supervision: Our approach only requires seed data at the start of the process. Once this data has been supplied the approach requires no intervention.

(ii) Cope with web resources which do not contain any features present within the seed data: We have empirically observed the improvement in performance of the self-trained classifiers, thereby demonstrating learning capabilities from minimal seed data. Random walks over an interlinked graph space allowed web resources to be indirectly linked to the social graph, thus enabling the detection of identity web references via transitive relations; the effectiveness of this strategy is demonstrated in the resultant high accuracy levels.

(iii) Outperform unsupervised disambiguation techniques & achieve disambiguation accuracy comparable to human processing of data: The evaluation of our disambiguation techniques resulted in accuracy levels which were greater than an unsupervised technique, and outperformed human processing in terms of precision when using Self-training and recall when using Random Walks.

(iv) Disambiguate identity web references for individuals with varying levels of web presence: The dataset used for evaluation contained individuals exhibiting a range of web presence levels. By evaluating our methods using this dataset we have shown that our disambiguation techniques function for individuals with all web presence levels, which is essential when considering the rise in participation of web users in a virtual environment.

(v) Produce seed data with minimal cost: We generate seed data from data residing within Web 2.0 platforms. By leveraging this existing data we are able to produce seed data incurring no production cost.

(vi) Generate reliable seed data: Seed data supplied to our disambiguation techniques is reliable given that it has been obtained from Web 2.0 platforms. We have demonstrated that data found on Web 2.0 platforms is representative of real identity and offline social network information, and is therefore representative of a given individual.

The seed data used within our approach is leveraged from only two Web 2.0 platforms. Despite this limited number, our results empirically demonstrate how disambiguation techniques perform well when supported by such data. However, the platforms used only expose a portion of an individual's digital identity; additional identity fragments could be obtained by leveraging seed data from additional platforms such as the business networking site LinkedIn 34. Given our presented techniques this is something which we plan to explore in the future, by aggregating data from a range of Web 2.0 platforms and observing the effects on the performance of the presented disambiguation techniques.

The use of data leveraged from Web 2.0 platforms as seed data can limit our approach if the identity which web users construct on such platforms is not consistent with their identity on the WWW. Within the evaluation we observed several cases where this occurred. However, in many cases, despite the seed data being minimal - due to the user exposing a limited amount of information within their profile - the disambiguation techniques were still able to learn from previously unknown features and achieve accurate results. We found that those individuals who exposed a large amount of identity information on Web 2.0 platforms yielded more accurate results, given the greater feature coverage offered by the seed data. This suggests that, paradoxically, web users must share more digital identity information on Web 2.0 platforms in order to aid disambiguation techniques, where the privacy of several such platforms is debatable. In the case of Facebook and similar walled-garden sites, identity information is shared under the pretense that it is safe and secure, and, given the profile owner's permission, can then be used as seed data.

34 http://www.linkedin.com

As a final concluding remark, to provide a contextual indication of how at risk our evaluation participants were to identity theft, we analysed the identity web references disambiguated using each technique. According to UK government guidelines 35, which stipulate that the visibility of a person's name, their address and their email address could lead to identity theft, we found that over half of the participants were at risk.

We are currently investigating alternative disambiguation methods, most notably rule induction from RDF instances within the provided seed data. Our future work will compare the performance of precise rule-based disambiguation methods against the presented disambiguation techniques. We also plan to deploy the approach presented within this paper to allow web users to monitor their online presence, and to compare the performance of the deployed system against several commercial ventures which were discussed within the state of the art of this work - thereby comparing our effort against similar semi-supervised approaches.

10. Acknowledgements

This work was partially supported by X-Media, a project funded by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-FP6-026978.

References

[1] M. Andrejevic, The discipline of watching: Detection,risk, and lateral surveillance, Critical Studies in MediaCommunication 23 (5) (2006) 392–407.

[2] R. Angelova, G. Weikum, Graph-based text classification: learn from your neighbors, in: SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, New York, NY, USA, 2006.

[3] R. Bekkerman, A. McCallum, Disambiguating webappearances of people in a social network, in: WWW’05: Proceedings of the 14th international conference onWorld Wide Web, ACM, New York, NY, USA, 2005.

[4] C. M. Bishop, Neural Networks for Pattern Recognition,Oxford University Press, 1995.

[5] A. Blum, T. Mitchell, Combining labeled and unlabeleddata with co-training, Morgan Kaufmann Publishers,1998.

[6] D. Brickley, L. Miller, FOAF vocabulary specification, Tech. rep., FOAF project, published online May 24th, 2007. URL http://xmlns.com/foaf/spec/20070524.html

35 http://www.getsafeonline.org/nqcontent.cfm?a_id=1132

[7] H. Bunke, P. Dickinson, M. Kraetzl, W. D. Wallis,A Graph-Theoretic Approach to Enterprise NetworkDynamics, Birkhauser, 2006.

[8] S. Chapman, B. Norton, F. Ciravegna, Armadillo:Integrating knowledge for the semantic web, in:Proceedings of the Dagstuhl Seminar in MachineLearning for the Semantic Web, 2005.

[9] N. Craswell, M. Szummer, Random walks on the clickgraph, in: SIGIR ’07: Proceedings of the 30th annualinternational ACM SIGIR conference on Research anddevelopment in information retrieval, ACM, New York,NY, USA, 2007.

[10] P. Cudre-Mauroux, P. Haghani, M. Jost, K. Aberer,H. D. Meer, idmesh: Graph-based disambiguation oflinked data, in: 18th International World Wide WebConference, 2009.URL http://www2009.eprints.org/60/

[11] R. Cyganiak, Expressing the xfn microformat in rdf,Tech. rep., Deri (2008).URL http://vocab.sindice.com/xfn

[12] R. Darlington, Regression and Linear Models, McGraw-Hill, 1990.

[13] A. Doan, Y. Lu, Y. Lee, J. Han, Object matching for information integration: A profiler-based approach, in: Proceedings of the IJCAI-03 Workshop on Information Integration on the Web, 2003.

[14] X. Dong, A. Halevy, J. Madhavan, Referencereconciliation in complex information spaces, in:SIGMOD ’05: Proceedings of the 2005 ACM SIGMODinternational conference on Management of data, ACM,New York, NY, USA, 2005.

[15] S. Dumais, J. Platt, D. Heckerman, M. Sahami,Inductive learning algorithms and representations fortext categorization, in: CIKM ’98: Proceedings of theseventh international conference on Information andknowledge management, ACM, New York, NY, USA,1998.

[16] N. B. Ellison, C. Steinfield, C. Lampe, The benefitsof facebook friends: Social capital and college students’use of online social network sites, Journal of Computer-Mediated Communication 12 (2007) 1143–1168.

[17] M. B. Fleischman, E. Hovy, Multi-document person name resolution, in: Annual Meeting of the Association for Computational Linguistics (ACL), Reference Resolution Workshop, 2004.

[18] G. P. C. Fung, H. Lu, J. X. Yu, P. S. Yu, Text classification without negative examples revisit, IEEE Trans. on Knowl. and Data Eng. 18 (1) (2006) 6–20.

[19] G. Grimnes, P. Edwards, A. Preece, Instance basedclustering of semantic web resources, in: Proceedingsof the 5th European Semantic Web Conference, LNCS,Springer-Verlag, 2008.

[20] W3C Working Group, RDFa primer: Bridging the human and data webs (October 2008). URL http://www.w3.org/TR/xhtml-rdfa-primer/

[21] W. Halb, Y. Raimond, M. Hausenblas, Building linked data for both humans and machines, in: Linked Data on the Web (LDOW2008), 2008. URL http://data.semanticweb.org/workshop/LDOW/2008/paper/10


[22] H. Halpin, B. Suda, N. Walsh, An ontology for vcards,Tech. rep., W3C (2007).URL http://www.w3.org/2006/vcard/ns

[23] J. Hart, C. Ridley, F. Taher, C. Sas, A. Dix, Exploringthe facebook experience: a new approach to usability, in:NordiCHI ’08: Proceedings of the 5th Nordic conferenceon Human-computer interaction, ACM, New York, NY,USA, 2008.

[24] G. Hripcsak, A. S. Rothschild, Agreement, the f-measure, and reliability in information retrieval, Journalof American Medical Informatics Association 12 (3)(2005) 296–298.

[25] J. Iria, L. Xia, Z. Zhang, Wit: Web people searchdisambiguation using random walks, in: Proceedings of4th International Workshop on Semantic Evaluations,Prague, Czech Republic, 2007.

[26] A. Isaac, E. Summers, Skos simple knowledgeorganization system primer, Tech. rep., W3C (2009).URL http://www.w3.org/TR/skos-primer/

[27] D. V. Kalashnikov, R. Nuray-Turan, S. Mehrotra, Towards breaking the quality curse: a web-querying approach to web people search, in: SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, ACM, New York, NY, USA, 2008.

[28] M. Kay, Xsl transformations (xslt) version 2.0 (w3crecommendation 23 january 2007), Tech. rep., W3C(2007).URL http://www.w3.org/TR/xslt20/

[29] R. Khare, Microformats: The next (small) thing on thesemantic web?, IEEE Internet Computing 10 (1) (2006)68–75.

[30] S. Liu, J. Zhang, Retrieving and matching rdf graphsby solving the satisfiability problem, in: WI ’06:Proceedings of the 2006 IEEE/WIC/ACM InternationalConference on Web Intelligence, IEEE ComputerSociety, Washington, DC, USA, 2006.

[31] L. Lovasz, Random walks on graphs: A survey (1993).

[32] B. Malin, Unsupervised name disambiguation via socialnetwork similarity, in: Workshop on Link Analysis,Counterterrorism, and Security, 2005.

[33] S. P. Meyn, R. L. Tweedie, Markov chains and stochasticstability, Springer–Verlag, 1993.

[34] T. M. Mitchell, Machine Learning, McGraw-Hill, NewYork, 1997.

[35] M. Obitko, V. Marık, Integrating transportationontologies using semantic web languages, in: SecondInternational Conference on Industrial Applications ofHolonic and Multi-Agent Systems, 2005.

[36] H. Qiu, E. R. Hancock, Clustering and embedding usingcommute times, IEEE Trans. Pattern Analysis. Mach.Intell. 29 (11) (2007) 1873–1890.

[37] M. Rowe, Mapping between digital identity ontologiesthrough sism, in: In Proceedings of the Social Data onthe Web Workshop at the International Semantic WebConference, 2009.

[38] M. Saerens, F. Fouss, L. Yen, P. Dupont, The principalcomponents analysis of a graph, and its relationships tospectral clustering, in: Proceedings of the 15th EuropeanConference on Machine Learning (ECML 2004). LectureNotes in Artificial Intelligence, Springer-Verlag, 2004.

[39] F. Saïs, N. Pernelle, M.-C. Rousset, Combining a logical and a numerical method for data reconciliation, Journal on Data Semantics XII (2009) 66–94.

[40] L. Sauermann, R. Cyganiak, M. Voelkel, Cool URIs for the semantic web, Tech. rep., Deutsches Forschungszentrum fuer Kuenstliche Intelligenz GmbH (2007). URL http://www.w3.org/TR/2007/WD-cooluris-20071217/

[41] L. Sauermann, L. van Elst, A. Dengel, PIMO - a framework for representing personal information models, in: K. Tochtermann, W. Haas, F. Kappe, A. Scharl, T. Pellegrini, S. Schaffert (eds.), Proceedings of I-MEDIA '07 and I-SEMANTICS '07, 2007.

[42] L. Shi, D. Berrueta, S. Fernández, L. Polo, S. Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: ISWC2008 workshop on Personal Identification and Collaborations: Knowledge Mediation and Extraction (PICKME 2008), 2008.

[43] M. Smith, C. Welty, D. McGuinness, OWL Web Ontology Language Guide (2004). URL http://www.w3.org/TR/owl-guide/

[44] Y. Song, J. Huang, I. G. Councill, J. Li, C. L. Giles, Efficient topic-based unsupervised name disambiguation, in: JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, ACM, New York, NY, USA, 2007.

[45] R. Tarjan, Depth-first search and linear graphalgorithms, SIAM Journal on Computing 1 (2) (1972)146–160.

[46] M. Thelen, E. Riloff, A bootstrapping method for learning semantic lexicons using extraction pattern contexts, in: EMNLP '02: Proceedings of the ACL-02 conference on Empirical methods in natural language processing, Association for Computational Linguistics, Morristown, NJ, USA, 2002.

[47] C. J. van Rijsbergen, Information Retrieval, 2nd ed.,Butterworths, London, 1979.

[48] W3C, Gleaning resource descriptions from dialects of languages (GRDDL), W3C recommendation, W3C (September 2007). URL http://www.w3.org/TR/grddl/

[49] X. Wan, J. Gao, M. Li, B. Ding, Person resolutionin person search results: Webhawk, in: CIKM ’05:Proceedings of the 14th ACM international conferenceon Information and knowledge management, ACM, NewYork, NY, USA, 2005.

[50] G. Wang, Y. Yu, H. Zhu, Pore: Positive-only relationextraction from wikipedia text, in: ISWC/ASWC, 2007.

[51] H. Yu, J. Han, K. C.-C. Chang, PEBL: Web page classification without negative examples, IEEE Transactions on Knowledge and Data Engineering 16 (2004) 70–81.
