semanticmatching.org semantic atching and...

38
semanticmatching.org SEMANTIC MATCHING AND S-MATCH for Digital Libraries Aliaksandr Autayeu autayeu.com

Upload: truongdat

Post on 16-Jun-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

semanticmatching.org

SEMANTIC MATCHING AND S-MATCHfor Digital Libraries

Aliaksandr Autayeuautayeu.com

2

CREDITS

Fausto GiunchigliaMikalai YatskevichPavel ShvaikoLuciano SerafiniVincenzo MalteseJuan Paneand others

http://semanticmatching.org/credits.html

3

OUTLINE

Digital Libraries Super SimplifiedSemantic MatchingS-MatchMapping Use EvaluationS-Match and Digital LibrariesExercises

DIGITAL LIBRARIES SUPER SIMPLIFIED

5

A SUPER SIMPLIFIED VIEW

000 Computer science, information & general works000 Computer science, knowledge & systems

005 Computer programming, programs & data

Programming languages (Computers)

Programming (Computers)

Programming software

classification

book

subject heading

DDC

LCSH

? LCC ?

? MeSH ?

6

A BIT MORE REALISTIC VIEW000 Computer science, information & general works

000 Computer science, knowledge & systems005 Computer programming, programs & data

Programming languages (Computers)Programming (Computers)

Programming software

classificationS

books

subject headingS

DDC, UDC, LCC, RVK, IconClass

LCSH, NALT, MeSH

Class Q - ScienceSubclass QA Mathematics

QA71-90 Instruments and machinesQA75.5-76.95 Electronic computers. Computer science

computer science

computer softwarecomputer literacy

7

A TREND

Federated Digital Librariesmore different books cool!more different classifications oh-more different subject headings oh-

Information Integration

Semantic Matching

SEMANTIC MATCHING

9

MOTIVATION

Ample motivation:information integration,ontology evolution and alignment,peer-to-peer information sharing,digital libraries integration,web service composition,agent communication, andquery answering on the web.

DLs? Huge and heterogeneous:LOC: 138,313,427 items. Manageable manually?Wikipedia: 138 DL projects. Same software?

10

DEFINITION

Matching: given two graph-like structures (e.g. lightweight ontologies), produce a mapping between the nodes of the graphs that semanticallycorrespond to each other

Wine  and  Cheese

Italy  

Europe

Austria

Pictures

1

2 3

4 5

11

TYPES OF MATCHING

Matching

Semantic  MatchingSyntactic  MatchingRelations are computed between labels at nodesR = {x [0,1]}

Relations are computed between concepts at nodesR = { =, , , }

Note: many systems are Note:are semantic

12

CONCEPT OF LABEL

Wine  and  Cheese

Italy  

Europe

Austria

Pictures

1

2 3

4 5

The idea: A label has an intended meaning, which is what this label means in the world (Concept of a label)

Labels in classification hierarchies are used to compute the set of documents classified under the node holding the label

13

CONCEPT AT NODE

Wine  and  Cheese

Italy  

Europe

Austria

Pictures

1

2 3

4 5

Observations: Think of a tree structure (see below) used for classifying documentsA (concept of a) label defines which documents should be classified under that

Concept of a node is the set of documents that we would classify under this node, given it has a certain label and it is positioned in a certain place in the tree

The idea: Trees add structure which allows us to perform the classification of documents more effectively

14

MAPPING

Mapping element is a 4-tuple < IDij, n1i, n2j, R >, where IDij is a unique identifier of the given mapping element;n1i is the i-th node of the first graph;n2j is the j-th node of the second graph;R specifies a semantic relation between the concepts at the given

nodes

Semantic Matching: Given two graphs G1 and G2, for any node n1i G1, find the strongest semantic relation holding with node n2j G2

Computed , listed in the decreasing strength order:equivalence { = };more general/specific { , };mismatch or disjoint { };

15

EXAMPLE

?=

=4

Images

Europe

ItalyAustria

2

3 4

1

Italy  

Europe

Wine  and  Cheese

Austria

Pictures

1

2 3

5

16

MINIMAL MAPPING

2-1, 2-4 redundant links, can be inferredcompact, compressed information!

?=

=4

Images

Europe

ItalyAustria

2

3 4

1

Italy  

Europe

Wine  and  Cheese

Austria

Pictures

1

2 3

5

17

MINIMAL MAPPING PATTERNS

A

B E

D

C F

A

B E

D

C F

A

B E

D

C F

A

B E

D

C F

(1) (2) (3) (4)

Dependencies across-symbols: equivalence is the combination of more and less specific

Patterns 1 and 2 are still valid in case of equivalence between B-EPattern 4 can be seen as a combination of patterns 1 and 2

S-MATCH

19

FOUR MACRO STEPS

1. For all labels in T1 and T2 compute concepts of labels2. For all nodes in T1 and T2 compute concepts at nodes3. For all pairs of labels in T1 and T2 compute relations between

concepts of labels4. For all pairs of nodes in T1 and T2 compute relations between

concepts at nodes

Steps 1 and 2 constitute the preprocessing phase, and are executed once and each time after the schema/ontology is changed (offline)

Steps 3 and 4 constitute the matching phase, and are executed every time the two schemas/ontologies are to be matched (online)

Given two labeled trees T1 and T2, do:

20

1: COMPUTE CONCEPTS OF LABELSThe idea:

Translate natural language expressions into internal formal languageCompute concepts based on possible senses of words in a label and their interrelations

Preprocessing:Tokenization. Labels (according to punctuation, spaces, etc.) are parsed into tokens. Wine and Cheese <Wine, and, Cheese>;Lemmatization. Tokens are morphologically analyzed in order to find all their possible basic forms. Images Image;Building atomic concepts. An oracle (WordNet) is used to extract senses of lemmatized tokens. Image has 8 senses, 7 as a noun and 1 as a verb;Building complex concepts. Prepositions, conjunctions, etc. are translated into logical connectives and used to build complex conceptsout of the atomic conceptsCWine and Cheese = <Wine, U(WNWine)> U <Cheese, U(WNCheese)>,where U is a union of the senses that WordNet attaches to lemmatized tokens

21

2: COMPUTE CONCEPTS AT NODES

The idea: extend concepts at labels by capturing the knowledge residing in a structure of a graph in order to define a context in which the given concept at a label occursComputation (basic case): Concept at a node for some node n is computed as an intersection of concepts at labels located above the given node, including the node itself

Wine  and  Cheese

Italy  

Europe

Austria

Pictures

1

2 3

4 5

C4 = CEurope CPictures CItaly

22

3: COMPUTE RELATIONS BETWEENCONCEPTS OF LABELS

The idea: Exploit a priori knowledge, e.g., lexical, domain knowledge with the help of element level semantic matchers

Results of step 3:Italy  

Europe

Wine  andCheese

Austria

Pictures

1

2 3

4 5

Europe

ItalyAustria

2

3 4

1

Images

T1 T2

T2

=C Italy

=C Austria

=C E urope

=C Images

C AustriaC ItalyC PicturesC E uropeT1 C Wine C C heese

23

4: COMPUTE RELATIONS BETWEENCONCEPTS AT NODES

The idea: Reduce the matching problem to a validity problem

Context: the relations between concepts of labels (from step 3) Wff rel (C1i, C2j): relation to be proved (with C1i in Tree1 and C2j in Tree2)Translate into propositional logic:

C1i = C2j is translated into C1i C2j

C1i subsumes C2j is translated into C1i C2j

C1i C2j is translated into C1i C2j)

Prove Context Wff rel (C1i, C2j) is valid with A propositional formula is valid iff its negation is unsatisfiable (SAT

24

ALGORITHM AND FRAMEWORK

S-Match: an algorithmMinSMatch (produces minimal mapping)SPSM (preserves structure)

S-Match: a frameworkalgorithmsdata formatspreprocessorsreasonersmatchersknowledge bases

25

S-MATCH: ARCHITECTURE

InputLoaders

TabIndented XML

Offl

ine Step 1

Preprocessors ClassifiersStep 2

Matchers

Onl

ine

DecidersSat4J MiniSat

OraclesWordnet InMemoryWN

ElementSyntactic

StringGloss

Semantic

WordNetSense

Step 3 Step 4

StructureTree

MinimalDefault

Node

MinimalDefault

OutputFilters RenderersSPSM TabIndented XML

26

S-MATCH: API, CLI, GUI

APIprograms

CLIgeeks

GUIpeople

MAPPING USE AND EVALUATION

28

MAPPING USE

Equivalence (=)

More and Less Specificity ( , )

Disjointness ( )

29

MAPPING EVALUATION

Standard Measures:Precision, [0,1] (correctness criterion, how many false positives?)

Recall, [0,1] (completeness criterion, how many false negatives?)

F-Measure, [0,1]Time

30

STANDARD MEASURES

A False negativesB True positivesC False positivesD True negatives

A B CD

Expert Matches System Matches

BAB

CBB

Recall

Precision

RecallPrecisionRecallPrecision2Measure-F

31

ADVANCED GOLDEN STANDARDS

GS+ matchesGS- mismatches

CD

Expert GS+ and GS- System Matches

G S

G SC

G SCG SC

G SC

Recall

Precision

GS+

GS-

32

EVALUATION FEATURES

MappingsExperts subjectivity

Schema-basedInstance-based

Small and Large Data SetsLarge data sets are expensiveAnd often unfeasible

Minimal vs Maximum vs Normal mappingDiverse golden standards and practices

Similarity vs Relations [0,1] vs { , ,

{ , { , , Technical Problems

bugs in data, data parsing bugs, absent or abandoned implementation, closed or restrictive licenses

33

RECOMMENDATIONS

Use large golden standards, always include an explicit GS-, reduce as much as possible the pairs in Unknown

How the size of a GS influences the result of the evaluation? How large should be the part covered by the GS+ and GS- to be statistically significant?

When comparing different tools, specify whether and how the evaluation takes into account the kinds of semantic relations

How to evaluate tools producing disjointness?What about overlap?

To obtain accurate measures it is fundamental to maximize both the golden standard and the matching result

S-MATCH AND DIGITAL LIBRARIES

35

CATALOG INTEGRATION

different points of viewdifferent terminologydifferent preferred terms

A

B E

DJOURNALS

C

DEVELOPMENT ANDPROGRAMMING LANGUAGES

JAVA

PROGRAMMING AND DEVELOPMENT

F

G

LANGUAGES

JAVA

MAGAZINES

36

BOOK CLASSIFICATION

S-Match matches tree vs treesingle node is a particular case of a tree

match node vs tree

A

B

JOURNALS

C

DEVELOPMENT ANDPROGRAMMING LANGUAGES

JAVA FProgramming and Developmentin Java

EXERCISES

38

EXERCISES

Download S-Match, run it

Integrate catalogsLoad the examples C and WMatch and explore the mappingEdit the trees, match and explore differences

edit the labels (try using synonyms)edit the structure

Classify a bookCreate a small catalog of your own in the left treeCreate a single node in the right tree. Use book title as a node labelMatch and observe the link

understand why there is no links you expectedunderstand why there are wrong links