Fusing semantic data

Fusing automatically extracted annotations for the Semantic Web
Andriy Nikolov, Knowledge Media Institute, The Open University


Page 1: Fusing semantic data

Fusing automatically extracted annotations for the Semantic Web
Andriy Nikolov
Knowledge Media Institute
The Open University

Page 2: Fusing semantic data

Outline

• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

Page 3: Fusing semantic data

Database scenario

• Classical scenario (database domain)
  – Merging information from datasets containing partially overlapping information

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] |
J. Smith      | 1983          | [email protected] |

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] |
Schmidt, Hans | 1973          | [email protected] |

Page 4: Fusing semantic data

Database scenario

• Coreference resolution (record linkage)
  – Resolving ambiguous identities

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] |
J. Smith      | 1983          | [email protected] |

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] |
Schmidt, Hans | 1973          | [email protected] |

Page 5: Fusing semantic data

Database scenario

• Inconsistency resolution
  – Handling contradictory pieces of data

Name          | Year of birth | E-mail            | Address
H. Schmidt    | 1972          | [email protected] |
J. Smith      | 1983          | [email protected] |

Name          | Year of birth | E-Mail            | Job position
Wen, Zhao     | 1980          | [email protected] |
Schmidt, Hans | 1973          | [email protected] |

Page 6: Fusing semantic data

Semantic data scenario

• Database domain:
  – A record belongs to a single table
  – Table structure defines relevant attributes
  – Inconsistency of values
• Semantic data:
  – Classes are organised into hierarchies
  – One individual may belong to several classes
  – Available properties depend on the level of granularity
  – Other types of inconsistencies are possible
    • E.g., class disjointness

[Ontology fragment diagram: classes foaf:Person, sweto:Person, sweto:Researcher, sweto:Place, sweto:Organization; properties foaf:name (xsd:string), foaf:mbox (xsd:string), sweto:lives_in, sweto:affiliated_with, some:has_degree (xsd:string)]

Page 7: Fusing semantic data

Motivating scenario – X-Media

[Architecture diagram: text, images and other data from internal corporate reports (Intranet) and pre-defined public sources (WWW) are annotated into RDF; KnoFuss fuses the annotations into a knowledge base, guided by a domain ontology]

Page 8: Fusing semantic data

Outline

• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

Page 9: Fusing semantic data

Handling fusion subtasks

• For each subtask, several available methods exist
• Example: coreference resolution
  – Aggregated attribute similarity
    • [Fellegi and Sunter 1969]
  – String similarity
    • Levenshtein, Jaro, Jaro-Winkler
  – Machine learning
    • Clustering
    • Classification
  – Rule-based

Page 10: Fusing semantic data

Handling fusion subtasks

• All methods have their pros and cons
  – Rule-based
    • High precision
    • Restricted to a specific domain
  – Machine learning
    • Require sufficient training data
  – String similarity
    • Lower precision
    • Still need configuration (e.g., distance metric, threshold, set of attributes to include)
• Trade-off between the quality of results and applicability range – better precision requires more domain-specific knowledge (see the sketch below)
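To illustrate the simplest end of this spectrum, the sketch below shows aggregated attribute similarity for coreference resolution in the spirit of [Fellegi and Sunter 1969]: candidate record pairs are compared attribute by attribute, the similarities are combined with weights, and pairs above a threshold are accepted. Python's difflib stands in for Jaro-Winkler, and the records, weights and threshold are illustrative, not KnoFuss configuration.

from difflib import SequenceMatcher

def string_sim(a, b):
    # Plain string similarity in [0, 1]; a stand-in for Jaro-Winkler
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def aggregated_similarity(rec1, rec2, weights):
    # Weighted average of per-attribute string similarities
    total, score = 0.0, 0.0
    for attr, weight in weights.items():
        if attr in rec1 and attr in rec2:
            total += weight
            score += weight * string_sim(str(rec1[attr]), str(rec2[attr]))
    return score / total if total else 0.0

# Illustrative records from the database scenario slides
r1 = {"name": "H. Schmidt", "year": "1972"}
r2 = {"name": "Schmidt, Hans", "year": "1973"}
weights = {"name": 0.7, "year": 0.3}   # illustrative attribute weights
THRESHOLD = 0.75                       # illustrative acceptance threshold

sim = aggregated_similarity(r1, r2, weights)
print(f"similarity = {sim:.2f}:", "match" if sim >= THRESHOLD else "no match")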

Page 11: Fusing semantic data

Problem-solving method approach

• Fusion task is decomposed into subtasks
• Algorithms defined as methods solving a particular task
• Each method is formally described using the fusion ontology
  – Task handled by the method
  – Applicability criteria
  – Domain knowledge required
  – Reliability of output
• Methods are selected based on their capabilities (a descriptor sketch follows)
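A minimal sketch of what such capability descriptions and capability-based selection could look like, assuming a simple in-memory method library; the fields mirror the bullets above and the reliability values on the method-selection slide, but the class and selection rule are illustrative rather than the actual KnoFuss fusion ontology or API.

from dataclasses import dataclass

@dataclass
class MethodDescriptor:
    name: str                 # e.g. "adaptive learning matcher"
    task: str                 # subtask the method solves
    application_context: str  # class of individuals it is applicable to
    reliability: float        # expected quality of its output

def select_method(library, task, individual_classes):
    # Pick the most reliable method whose task and context match the individual
    candidates = [m for m in library
                  if m.task == task and m.application_context in individual_classes]
    return max(candidates, key=lambda m: m.reliability, default=None)

library = [
    MethodDescriptor("generic matcher", "coreference resolution", "owl:Thing", 0.4),
    MethodDescriptor("publication matcher", "coreference resolution", "sweto:Publication", 0.8),
    MethodDescriptor("article matcher", "coreference resolution", "sweto:Article", 0.9),
]
classes = {"owl:Thing", "sweto:Publication", "sweto:Article"}
best = select_method(library, "coreference resolution", classes)
print(best.name if best else "no applicable method")   # -> article matcher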

Page 12: Fusing semantic data

KnoFuss architecture

[Architecture diagram: new data enters KnoFuss, which applies a coreference resolution method, a conflict detection method and a conflict resolution method from the method library; intermediate data is kept in a fusion KB before being merged into the main KB, all guided by the fusion ontology]

• Method library
  – Contains an implementation of each technique for specific subtasks (problem-solving method [Motta 1999])
• Fusion ontology
  – Describes method capabilities
  – Defines intermediate structures (mappings, conflict sets, etc.)

Page 13: Fusing semantic data

Task decomposition

[Task decomposition diagram: knowledge fusion splits into coreference resolution (model configuration, link discovery) and knowledge base updating (dependency identification, dependency resolution); the source KB is merged into the target KB, producing the fused target KB]

Page 14: Fusing semantic data

Method selection

Example: an adaptive learning matcher with three application contexts
  – Generic context (rdf:type owl:Thing; any datatypeProperty ?x): reliability = 0.4
  – Application context Publication (rdf:type sweto:Publication; rdfs:label ?x; sweto:year ?y): reliability = 0.8
  – Application context Journal Article (rdf:type sweto:Article; rdfs:label ?x; sweto:year ?y; sweto:journal ?z; sweto:volume ?a): reliability = 0.9

• Depends on:
  – Range of applicability
  – Reliability
• Configuration parameters
  – Generic (for individuals of unknown types)
  – Context-dependent

Page 15: Fusing semantic data

Using class hierarchy

• Configuring machine-learning methods:
  – Using training instances for a subclass to learn generic models for superclasses

[Class hierarchy diagram: owl:Thing with subclasses foaf:Person (name) and foaf:Document; foaf:Document has subclass sweto:Publication (label, year), which in turn has subclasses sweto:Article (journal_name, volume) and sweto:Article_in_Proceedings (book_title); training instances Ind1–Ind3 described by {label, year, book_title} are projected to {label, year} at the sweto:Publication level and to {label} at the owl:Thing level]

Page 16: Fusing semantic data

Using class hierarchy

• Configuring machine-learning methods:
  – Combining training instances for subclasses to learn a generic model for a superclass (see the sketch below)

[Diagram: sweto:Publication (label, year) with subclasses sweto:Article (journal_name, volume) and sweto:Article_in_Proceedings (book_title); instances Ind1–Ind3 {label, year, book_title} and Ind4–Ind6 {label, year, journal_name, volume} are both projected onto {label, year}, giving a combined training set Ind1–Ind6]
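A minimal sketch of this pooling step, assuming instances are simple attribute dictionaries: instances of the two subclasses are projected onto the attributes defined for the shared superclass and concatenated into one training set. The attribute lists and instance data are illustrative.

SUPERCLASS_ATTRS = {"sweto:Publication": ["label", "year"]}

def project(instances, attrs):
    # Keep only the attributes that are defined for the superclass
    return [{a: inst[a] for a in attrs if a in inst} for inst in instances]

# Illustrative training instances of the two subclasses
article_instances = [
    {"label": "Paper A", "year": "2006", "journal_name": "AIJ", "volume": "32"},
    {"label": "Paper B", "year": "2007", "journal_name": "JWS", "volume": "5"},
]
in_proceedings_instances = [
    {"label": "Paper C", "year": "2007", "book_title": "Proc. ISWC"},
]

# Combined training set for a generic sweto:Publication matcher
attrs = SUPERCLASS_ATTRS["sweto:Publication"]
training_set = project(article_instances, attrs) + project(in_proceedings_instances, attrs)
print(training_set)   # every instance reduced to {label, year}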

Page 17: Fusing semantic data

Outline

• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

Page 18: Fusing semantic data

Data quality problems

• Causes of inconsistency
  – Data errors
    • Obsolete data
    • Mistakes of manual annotators
    • Errors of information extraction algorithms
  – Coreference resolution errors
    • Automatic methods not 100% reliable
• Applying uncertainty reasoning
  – Estimated reliability of separate pieces of data
  – Domain knowledge defined in the ontology

Page 19: Fusing semantic data

Refining fused data

• Additional evidence:
  – Ontological schema restrictions
    • Disjointness
    • Cardinality
    • …
  – Neighborhood graph
    • Mappings between related entities
  – Provenance
    • Uncertainty of candidate mappings
    • Uncertainty of data statements
    • “Cleanness” of data sources

Page 20: Fusing semantic data

Dempster-Shafer theory of evidence

• Bayesian probability theory: assigning probabilities to atomic alternatives
  – p(true) = 0.6 implies p(false) = 0.4
  – Sometimes hard to assign
  – Negative bias: an extraction confidence below 0.5 is treated as negative evidence rather than as insufficient evidence
• Dempster-Shafer theory: assigning confidence degrees (masses) to sets of alternatives
  – m({true}) = 0.6
  – m({false}) = 0.1
  – m({true; false}) = 0.3
  – The probability of a statement is bracketed by its support (mass committed to it) and its plausibility (mass not committed against it): support ≤ probability ≤ plausibility (a small sketch follows)
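A minimal sketch of these notions, written from scratch rather than taken from KnoFuss: a mass function over the frame {true, false}, with support and plausibility computed from the masses on the slide.

def support(m, hypothesis):
    # Total mass of focal sets entirely contained in the hypothesis
    return sum(v for s, v in m.items() if s <= frozenset({hypothesis}))

def plausibility(m, hypothesis):
    # Total mass of focal sets that do not contradict the hypothesis
    return sum(v for s, v in m.items() if s & frozenset({hypothesis}))

# Mass function over the frame {true, false} with the values from the slide
m = {frozenset({"true"}): 0.6,
     frozenset({"false"}): 0.1,
     frozenset({"true", "false"}): 0.3}

print(support(m, "true"), plausibility(m, "true"))     # 0.6 0.9
print(support(m, "false"), plausibility(m, "false"))   # 0.1 0.4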

Page 21: Fusing semantic data

Dependency detection

• Identifying and localizing conflicts
  – Using formal diagnosis [Reiter 1987] in combination with standard ontological reasoning

[Conflict example diagram: Paper_10 has rdf:type links to both Article and Proceedings, which are declared owl:disjointWith each other; it also has two values (2006 and 2007) for hasYear, declared an owl:FunctionalProperty, and hasAuthor links to E. Motta and V.S. Uren]

Page 22: Fusing semantic data

Belief networks (cont)

• Valuation networks [Shenoy and Shafer 1990]
• Network nodes correspond to OWL axioms
  – Variable nodes
    • ABox statements (I ∈ X, R(I1, I2))
    • One variable: the statement itself
  – Valuation nodes
    • TBox axioms (e.g., X ⊔ Y)
    • Mass distribution over several variables (I ∈ X, I ∈ Y, I ∈ X ⊔ Y)

Page 23: Fusing semantic data

Belief networks (cont)

• Belief network construction
  – Using translation rules
  – Rule antecedents:
    • Existence of specific OWL axioms (one rule per OWL construct)
    • Existence of network nodes
  – Example rules:
    • Explicit ABox statements: IF I ∈ X THEN CREATE N1(I ∈ X)
    • TBox inferencing: IF Trans(R) AND EXIST N1(R(I1, I2)) AND EXIST N2(R(I2, I3)) THEN CREATE N3(Trans(R)) AND CREATE N4(R(I1, I3))

Page 24: Fusing semantic data

Example

[Diagram: #Paper_10 with rdf:type links to both Article and Proceedings, and Article owl:disjointWith Proceedings]

Page 25: Fusing semantic data

Example

[Diagram: the same RDF fragment, now with variable nodes created for the statements #Paper_10 ∈ Article and #Paper_10 ∈ Proceedings]

Page 26: Fusing semantic data

Example

[Diagram: the same fragment, now also with a valuation node created for the TBox axiom Article ⊑ ¬Proceedings]

Page 27: Fusing semantic data

Example

Initial mass assignments in the network fragment:
  – #Paper_10 ∈ Article: m({true}) = 0.8, m({false}) = 0, m({true; false}) = 0.2
  – #Paper_10 ∈ Proceedings: m({true}) = 0.6, m({false}) = 0, m({true; false}) = 0.4
  – Valuation node for Article ⊑ ¬Proceedings, over the pair (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
      m({(false, false), (false, true), (true, false)}) = 1.0
      m({(true, true)}) = 0.0

Page 28: Fusing semantic data

Example

Combining the masses at the valuation node (Dempster's rule):
  – #Paper_10 ∈ Article: m({true}) = 0.8, m({false}) = 0, m({true; false}) = 0.2
  – #Paper_10 ∈ Proceedings: m({true}) = 0.6, m({false}) = 0, m({true; false}) = 0.4
  – Combined mass over the pair (#Paper_10 ∈ Article, #Paper_10 ∈ Proceedings):
      m({(false, false), (false, true), (true, false)}) = 0.15
      m({(false, true)}) = 0.23
      m({(true, false)}) = 0.62

Page 29: Fusing semantic data

Example

Updated beliefs after marginalizing back to the variable nodes:
  – #Paper_10 ∈ Article: m({true}) = 0.62, m({false}) = 0.23, m({true; false}) = 0.15
  – #Paper_10 ∈ Proceedings: m({true}) = 0.23, m({false}) = 0.62, m({true; false}) = 0.15
  – Mass at the valuation node, as before:
      m({(false, false), (false, true), (true, false)}) = 0.15
      m({(false, true)}) = 0.23
      m({(true, false)}) = 0.62
A runnable sketch of this computation follows.
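The sketch below reproduces these numbers under the reading given above, assuming nothing beyond standard Python: the two type statements carry the initial masses, the disjointness valuation rules out (true, true), Dempster's rule combines the masses on the joint frame, and the result is marginalized back to each statement. It is written from scratch for illustration and is not the KnoFuss implementation.

from itertools import product

ALL_PAIRS = frozenset(product([True, False], repeat=2))

def to_joint(m, position):
    # Lift a mass function on {True, False} to the joint frame of pairs
    lifted = {}
    for s, v in m.items():
        pairs = frozenset(p for p in ALL_PAIRS if p[position] in s)
        lifted[pairs] = lifted.get(pairs, 0.0) + v
    return lifted

def combine(m1, m2):
    # Dempster's rule: intersect focal sets and renormalise away the conflict
    raw, conflict = {}, 0.0
    for (s1, v1), (s2, v2) in product(m1.items(), m2.items()):
        inter = s1 & s2
        if inter:
            raw[inter] = raw.get(inter, 0.0) + v1 * v2
        else:
            conflict += v1 * v2
    return {s: v / (1.0 - conflict) for s, v in raw.items()}

def marginal(m, position):
    # Project a joint mass function back onto a single statement
    result = {}
    for s, v in m.items():
        proj = frozenset(p[position] for p in s)
        result[proj] = result.get(proj, 0.0) + v
    return result

# Initial masses for the two rdf:type statements (values from the slide)
m_article = {frozenset([True]): 0.8, frozenset([True, False]): 0.2}
m_proc = {frozenset([True]): 0.6, frozenset([True, False]): 0.4}
# Disjointness valuation: the combination (true, true) is impossible
m_disjoint = {frozenset(ALL_PAIRS - {(True, True)}): 1.0}

joint = combine(combine(to_joint(m_article, 0), to_joint(m_proc, 1)), m_disjoint)
print(marginal(joint, 0))   # Article membership: ≈ {true: 0.62, false: 0.23, {true, false}: 0.15}
print(marginal(joint, 1))   # Proceedings membership: ≈ {true: 0.23, false: 0.62, {true, false}: 0.15}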

Page 30: Fusing semantic data

Belief propagation

• Translating subontology into a belief network
  – Using provenance and confidence values of data statements
  – Coreferencing algorithm precision for owl:sameAs mappings
• Data refinement:
  – Detecting spurious mappings
  – Removing unreliable data statements

[Belief network diagram for the Article/in_Proc example: variable nodes for Article(Ind1), inProc(Ind1), in_Proc(Ind2), Ind1 = Ind2 and the year statements (year(Ind1, 2006), year(Ind1, 2007), year(Ind2, 2007)), valuation nodes for Article ⊑ ¬in_Proc and Functional(year); each statement carries its (support; plausibility) interval before and after propagation, e.g. (0.99; 1.0)/(0.97; 0.98), with one statement dropping to (0.2; 0.21) and thus flagged as unreliable]

Page 31: Fusing semantic data

Neighbourhood graph

• Non-functional relations: varying impact

[Diagram: Paper_10 and Paper_11 (both of rdf:type Proceedings) are linked by hasAuthor to the Person individuals “H. Schmidt” and “Schmidt, Hans”; an owl:sameAs mapping between the papers (0.9) supports the owl:sameAs mapping between the authors (0.3). In the second fragment, “H. Schmidt” and “Schmidt, Hans” are both citizen_of Germany (a Country, owl:sameAs 1.0), which gives much weaker support for the same author mapping (0.3)]

Page 32: Fusing semantic data

Neighborhood graph

• Implicit relations: set co-membership

[Diagram: candidate mappings Person11 = Person12 and Person21 = Person22 are connected through co-authorship statements such as Coauthor(Person12, Person22); in the example, the mappings “Bard, J.B.L.” = “Jonathan Bard” (0.84/(0.86; 1.0)) and “Webber, B.L.” = “Bonnie L. Webber” (0.16/(0.83; 1.0)) are linked by co-authorship statements with values 1.0/(1.0; 1.0)]

Page 33: Fusing semantic data

Provenance

• Initial belief assignments:
  – Data statements (source AND/OR extractor confidence)
  – Candidate mappings (precision of attribute similarity algorithms)
  – Source “cleanness”: whether it contains duplicates or not

[Diagram: an ambiguous “Arlington” individual with candidate mappings Arlington = Arl_Va (Arlington, Virginia) and Arlington = Arl_Tx (Arlington, Texas); belief values such as 0.95/(0.65; 0.69) and 0.9/(0.31; 0.35), together with the statement that Arl_Va and Arl_Tx are distinct (1.0/(1.0; 1.0)), are used to resolve the ambiguity]

Page 34: Fusing semantic data

Experiments

• Datasets:
  – Publication datasets
    • AKT
    • Rexa
    • SWETO-DBLP
  – Cora
    • database community benchmark
    • translated into RDF
    • 2 versions used: different structure, different gold standard

Page 35: Fusing semantic data

Experiments

Publication individuals:

Dataset   | No | Matcher         | Prec. | Recall | F1    | Prec. | Recall | F1
AKT/Rexa  | 1  | Jaro-Winkler    | 0.950 | 0.833  | 0.887 | 0.969 | 0.832  | 0.895
AKT/Rexa  | 2  | L2 Jaro-Winkler | 0.879 | 0.956  | 0.916 | 0.923 | 0.956  | 0.939
AKT/DBLP  | 3  | Jaro-Winkler    | 0.922 | 0.952  | 0.937 | 0.992 | 0.952  | 0.971
AKT/DBLP  | 4  | L2 Jaro-Winkler | 0.389 | 0.984  | 0.558 | 0.838 | 0.983  | 0.905
Rexa/DBLP | 5  | Jaro-Winkler    | 0.899 | 0.933  | 0.916 | 0.944 | 0.932  | 0.938
Rexa/DBLP | 6  | L2 Jaro-Winkler | 0.546 | 0.982  | 0.702 | 0.823 | 0.981  | 0.895
Cora (I)  | 7  | Monge-Elkan     | 0.735 | 0.931  | 0.821 | 0.939 | 0.836  | 0.884
Cora (II) | 8  | Monge-Elkan     | 0.698 | 0.986  | 0.817 | 0.958 | 0.956  | 0.957

• Publication individuals
  – Ontological restrictions mainly influence precision

Page 36: Fusing semantic data

Experiments

Person individuals:

Dataset   | No | Matcher         | Prec. | Recall | F1    | Prec. | Recall | F1
AKT/Rexa  | 7  | L2 Jaro-Winkler | 0.738 | 0.888  | 0.806 | 0.788 | 0.935  | 0.855
AKT/DBLP  | 8  | L2 Jaro-Winkler | 0.532 | 0.746  | 0.621 | 0.583 | 0.921  | 0.714
Rexa/DBLP | 9  | Jaro-Winkler    | 0.965 | 0.755  | 0.846 | 0.968 | 0.876  | 0.920
Cora (I)  | 10 | L2 Jaro-Winkler | 0.983 | 0.879  | 0.928 | 0.981 | 0.895  | 0.936
Cora (II) | 11 | L2 Jaro-Winkler | 0.999 | 0.994  | 0.997 | 0.999 | 0.994  | 0.997

• Person individuals
  – Evidence coming from the neighborhood graph
  – Mainly influences recall

Page 37: Fusing semantic data

Outline

• Motivation
• Handling fusion subtasks
  – problem-solving method approach
• Processing inconsistencies
  – applying uncertainty reasoning
• Overcoming schema heterogeneity
  – Linked Data scenario

Page 38: Fusing semantic data


Advanced scenario

• Linked Data cloud: network of public RDF repositories [Bizer et al. 2009]

• Added value: coreference links (owl:sameAs)

Page 39: Fusing semantic data

Data linking: current state

• Automatic instance matching algorithms
  – SILK, ODDLinker, KnoFuss, …
• Pairwise matching of datasets
  – Requires significant configuration effort
• Transitive closure of links
  – Use of “reference” datasets

Page 40: Fusing semantic data


Reference datasets

Page 41: Fusing semantic data

Problems

• Transitive closures often incomplete
  – Reference dataset is incomplete
  – Missing intermediate links
  – Direct comparison of relevant datasets is desirable
• Schema heterogeneity
  – Which instances to compare?
  – Which properties are relevant?

[Diagram: datasets A and B linked only indirectly through a Reference dataset]

Page 42: Fusing semantic data

Schema matching

• Interpretation mismatches
  – dbpedia:Actor = professional actor
  – movie:actor = anybody who participated in a movie
• Class interpretation “as used” vs “as designed”
  – FOAF: foaf:Person = any person
  – DBLP: foaf:Person = computer scientist
• Instance-based ontology matching

Class         | Repository | Richard Nixon | David Garrick
dbpedia:Actor | DBPedia    | -             | +
movie:Actor   | LinkedMDB  | +             | -

Page 43: Fusing semantic data

KnoFuss - enhanced

[Extended task decomposition diagram: knowledge fusion now comprises ontology integration (ontology matching, instance transformation) and knowledge base integration (coreference resolution, dependency resolution); the source KB is connected to the target KB via SPARQL query translation]

Page 44: Fusing semantic data

Schema matching

• Step 1: inferring schema mappings from pre-existing instance mappings
• Step 2: utilizing schema mappings to produce new instance mappings

[Diagram: mappings between Dataset 1 and Dataset 2 are lifted to mappings between Ontology 1 and Ontology 2 (step 1), which are then used to derive further instance mappings between the datasets (step 2)]

Page 45: Fusing semantic data

Overview

• Background knowledge:
  – Data-level (intermediate repositories)
  – Schema-level (datasets with more fine-grained schemas)

Page 46: Fusing semantic data

Algorithm

• Step 1:
  – Obtaining the transitive closure of existing mappings

Example: LinkedMDB movie:music_contributor/2490 = MusicBrainz music:artist/a16…9fdf = DBPedia dbpedia:Ennio_Morricone

Page 47: Fusing semantic data

Algorithm

• Step 2: Inferring class and property mappings
  – ClassOverlap and PropertyOverlap mappings
  – Confidence(classes A, B) = |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) (overlap coefficient)
  – Confidence(properties r1, r2) = |X| / |Y|
    • X – identity clusters with equivalent values of r1 and r2
    • Y – all identity clusters which have values for both r1 and r2
  (a confidence-computation sketch follows)

Example: the identity cluster {movie:music_contributor/2490, music:artist/a16…9fdf, dbpedia:Ennio_Morricone} contributes to the mapping between movie:music_contributor (LinkedMDB) and dbpedia:Artist (DBPedia) via the is_a links of its members
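A minimal sketch of the class-mapping confidence, assuming the identity clusters from step 1 are available as sets of URIs and that each dataset exposes the rdf:type of its URIs; the helper name, the extra cluster entries and the type tables are illustrative, not the KnoFuss data structures.

def class_mapping_confidence(clusters, types1, types2, class_a, class_b):
    # Overlap coefficient |c(A) ∩ c(B)| / min(|c(A)|, |c(B)|) over identity clusters
    c_a = {i for i, cluster in enumerate(clusters)
           if any(types1.get(uri) == class_a for uri in cluster)}
    c_b = {i for i, cluster in enumerate(clusters)
           if any(types2.get(uri) == class_b for uri in cluster)}
    if not c_a or not c_b:
        return 0.0
    return len(c_a & c_b) / min(len(c_a), len(c_b))

# Illustrative identity clusters (step 1 output) and rdf:type assertions
clusters = [
    {"movie:music_contributor/2490", "dbpedia:Ennio_Morricone"},
    {"movie:music_contributor/17", "dbpedia:John_Williams"},
    {"movie:actor/42", "dbpedia:Clint_Eastwood"},
]
types_linkedmdb = {"movie:music_contributor/2490": "movie:music_contributor",
                   "movie:music_contributor/17": "movie:music_contributor",
                   "movie:actor/42": "movie:actor"}
types_dbpedia = {"dbpedia:Ennio_Morricone": "dbpedia:Artist",
                 "dbpedia:John_Williams": "dbpedia:Artist",
                 "dbpedia:Clint_Eastwood": "dbpedia:Actor"}

print(class_mapping_confidence(clusters, types_linkedmdb, types_dbpedia,
                               "movie:music_contributor", "dbpedia:Artist"))   # 1.0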

Page 48: Fusing semantic data

Algorithm

• Step 3: Inferring data patterns
  – Functionality restrictions
  – Example pattern: IF two equivalent movies do not have overlapping actors AND have different release dates THEN break the equivalence link (see the sketch below)
  – Note: only usable if not taken into account at the initial instance matching stage
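A minimal sketch of applying this example pattern to filter candidate owl:sameAs links; the movie records and field names are illustrative assumptions.

def spurious(movie1, movie2):
    # True if the equivalence link should be broken according to the pattern
    no_shared_actors = not (set(movie1["actors"]) & set(movie2["actors"]))
    different_dates = movie1["release_date"] != movie2["release_date"]
    return no_shared_actors and different_dates

m1 = {"actors": ["A. Actor", "B. Actor"], "release_date": "1966"}
m2 = {"actors": ["C. Actor"], "release_date": "1968"}
print(spurious(m1, m2))   # True: this candidate mapping would be filtered out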

Page 49: Fusing semantic data

Algorithm

• Step 4: utilizing mappings and patterns
  – Run instance-level matching for individuals of strongly overlapping classes (see the retrieval sketch below)
  – Use patterns to filter out existing mappings

• LinkedMDB:

SELECT ?uri
WHERE { ?uri rdf:type movie:music_contributor . }

• DBPedia:

SELECT ?uri
WHERE { ?uri rdf:type dbpedia:Artist . }
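A minimal sketch of running such a query with the SPARQLWrapper package to collect one candidate set; the public DBpedia endpoint URL is assumed to be reachable, the LinkedMDB query would be issued the same way against its own endpoint, and the LIMIT only keeps the example small.

from SPARQLWrapper import SPARQLWrapper, JSON

def select_uris(endpoint, query):
    # Run a SELECT ?uri query and return the bound URIs
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["uri"]["value"] for b in results["results"]["bindings"]]

# Candidate set on the DBPedia side
dbpedia_uris = select_uris(
    "https://dbpedia.org/sparql",
    "SELECT ?uri WHERE { ?uri a <http://dbpedia.org/ontology/Artist> . } LIMIT 100")
print(len(dbpedia_uris))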

Page 50: Fusing semantic data

Results

• Class mappings:
  – Improvement in recall
    • Previously omitted mappings were discovered after direct comparison of instances
• Data patterns
  – Improved precision
    • Filtered out spurious mappings
    • Identified 140 mappings between movies as “potentially spurious”
    • 132 identified correctly

[Bar charts: precision, recall and F1-measure for existing links, KnoFuss only, and the combined approach, on the DBPedia/DBLP, DBPedia/LinkedMDB and DBPedia/BookMashup dataset pairs]

Page 51: Fusing semantic data

Future work

• From the pairwise scenario to the network of repositories
• Combining schema and data integration in an efficient way
• Evaluating data sources
  – Which data source(s) to link to?
  – Which data source(s) to select data from?

Page 52: Fusing semantic data


Questions?

Thanks for your attention

Page 53: Fusing semantic data


References

[Shenoy and Shafer 1990] P. Shenoy, G. Shafer. Axioms for probability and belief-function propagation. In: Readings in uncertain reasoning. San Francisco: Morgan Kaufmann, pp. 575-610, 1990

[Motta 1999] E. Motta. Reusable components for knowledge modelling. Amsterdam: IOS Press, 1999

[Bizer et al. 2009] C. Bizer, T. Heath, T. Berners-Lee. Linked Data - the story so far. International Journal on Semantic Web and Information Systems 5(3), pp. 1-22, 2009

[Fellegi and Sunter 1969] Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, 1969

[Reiter 1987] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57-95, 1987