semanticmatching.org
SEMANTIC MATCHING AND S-MATCH for Digital Libraries
Aliaksandr Autayeu, autayeu.com
2
CREDITS
Fausto Giunchiglia, Mikalai Yatskevich, Pavel Shvaiko, Luciano Serafini, Vincenzo Maltese, Juan Pane and others
http://semanticmatching.org/credits.html
3
OUTLINE
Digital Libraries Super Simplified
Semantic Matching
S-Match
Mapping Use Evaluation
S-Match and Digital Libraries
Exercises
5
A SUPER SIMPLIFIED VIEW
DDC (classification):
  000 Computer science, information & general works
  000 Computer science, knowledge & systems
  005 Computer programming, programs & data
LCSH (subject heading):
  Programming languages (Computers)
  Programming (Computers)
  Programming software
[Figure: a book linked to a classification (DDC) and a subject heading (LCSH); also shown: ? LCC ?, ? MeSH ?]
6
A BIT MORE REALISTIC VIEW
classifications: DDC, UDC, LCC, RVK, IconClass
  000 Computer science, information & general works
  000 Computer science, knowledge & systems
  005 Computer programming, programs & data
  Class Q - Science
  Subclass QA Mathematics
  QA71-90 Instruments and machines
  QA75.5-76.95 Electronic computers. Computer science
subject headings: LCSH, NALT, MeSH
  Programming languages (Computers)
  Programming (Computers)
  Programming software
  computer science
  computer software
  computer literacy
[Figure: books linked to several classifications and several subject headings]
7
A TREND
Federated Digital Libraries
  more different books: cool!
  more different classifications: oh-oh
  more different subject headings: oh-oh
Information Integration
Semantic Matching
9
MOTIVATION
Ample motivation:
  information integration,
  ontology evolution and alignment,
  peer-to-peer information sharing,
  digital libraries integration,
  web service composition,
  agent communication, and
  query answering on the web.
DLs? Huge and heterogeneous:
  LOC: 138,313,427 items. Manageable manually?
  Wikipedia: 138 DL projects. Same software?
10
DEFINITION
Matching: given two graph-like structures (e.g. lightweight ontologies), produce a mapping between the nodes of the graphs that semantically correspond to each other
Europe (1)
  Pictures (2)
    Italy (4)
    Austria (5)
  Wine and Cheese (3)
11
TYPES OF MATCHING
Matching
Syntactic Matching
  Relations are computed between labels at nodes
  R = {x ∈ [0,1]}
Semantic Matching
  Relations are computed between concepts at nodes
  R = { =, ⊑, ⊒, ⊥ }
Note: many systems are ...
Note: ... are semantic
12
CONCEPT OF LABEL
Europe (1)
  Pictures (2)
    Italy (4)
    Austria (5)
  Wine and Cheese (3)
The idea: A label has an intended meaning, which is what this label means in the world (Concept of a label)
Labels in classification hierarchies are used to compute the set of documents classified under the node holding the label
13
CONCEPT AT NODE
Observations:
  Think of a tree structure (see below) used for classifying documents
  A (concept of a) label defines which documents should be classified under that label
  Concept of a node is the set of documents that we would classify under this node, given it has a certain label and it is positioned in a certain place in the tree
The idea: Trees add structure which allows us to perform the classification of documents more effectively
Europe (1)
  Pictures (2)
    Italy (4)
    Austria (5)
  Wine and Cheese (3)
14
MAPPING
Mapping element is a 4-tuple < ID_ij, n1_i, n2_j, R >, where
  ID_ij is a unique identifier of the given mapping element;
  n1_i is the i-th node of the first graph;
  n2_j is the j-th node of the second graph;
  R specifies a semantic relation between the concepts at the given nodes
Semantic Matching: Given two graphs G1 and G2, for any node n1_i ∈ G1, find the strongest semantic relation holding with node n2_j ∈ G2
Computed R's, listed in the decreasing strength order:
  equivalence { = };
  more general/specific { ⊒, ⊑ };
  mismatch or disjoint { ⊥ };
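A minimal sketch of a mapping element and of the strength ordering listed above. The names MappingElement, STRENGTH and strongest are illustrative assumptions, not part of the S-Match API; the order between ⊒ and ⊑ inside the list is arbitrary, since the slide gives them the same strength.

```python
from dataclasses import dataclass

# Relations in decreasing strength order, as listed on the slide.
# ⊒ and ⊑ share one strength level; their relative order here is arbitrary.
STRENGTH = ["=", "⊒", "⊑", "⊥"]


@dataclass(frozen=True)
class MappingElement:
    id: str        # unique identifier ID_ij
    source: int    # i-th node of the first graph
    target: int    # j-th node of the second graph
    relation: str  # one of STRENGTH


def strongest(relations):
    """Return the strongest relation among those that hold, or None."""
    for candidate in STRENGTH:
        if candidate in relations:
            return candidate
    return None


# Hypothetical usage: both "=" and "⊑" were proved for a pair of nodes.
element = MappingElement("4-3", 4, 3, strongest({"⊑", "="}))
```

Here element.relation ends up as "=", because equivalence is checked before the weaker relations.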
15
EXAMPLE
[Figure: mapping links between the nodes of the two trees]
T1:
Europe (1)
  Pictures (2)
    Italy (4)
    Austria (5)
  Wine and Cheese (3)
T2:
Images (1)
  Europe (2)
    Italy (3)
    Austria (4)
16
MINIMAL MAPPING
2-1, 2-4 redundant links, can be inferred
compact, compressed information!
[Figure: the example mapping between the two trees, as on the previous slide]
17
MINIMAL MAPPING PATTERNS
[Figure: four patterns, (1) to (4), each drawn over the nodes A, B, C, D, E, F]
Dependencies across symbols: equivalence is the combination of more and less specific
Patterns 1 and 2 are still valid in case of equivalence between B-E
Pattern 4 can be seen as a combination of patterns 1 and 2
19
FOUR MACRO STEPS
Given two labeled trees T1 and T2, do:
1. For all labels in T1 and T2 compute concepts of labels
2. For all nodes in T1 and T2 compute concepts at nodes
3. For all pairs of labels in T1 and T2 compute relations between concepts of labels
4. For all pairs of nodes in T1 and T2 compute relations between concepts at nodes
Steps 1 and 2 constitute the preprocessing phase, and are executed once and each time after the schema/ontology is changed (offline)
Steps 3 and 4 constitute the matching phase, and are executed every time the two schemas/ontologies are to be matched (online)
20
1: COMPUTE CONCEPTS OF LABELS
The idea:
  Translate natural language expressions into internal formal language
  Compute concepts based on possible senses of words in a label and their interrelations
Preprocessing:
  Tokenization. Labels (according to punctuation, spaces, etc.) are parsed into tokens. Wine and Cheese → <Wine, and, Cheese>;
  Lemmatization. Tokens are morphologically analyzed in order to find all their possible basic forms. Images → Image;
  Building atomic concepts. An oracle (WordNet) is used to extract senses of lemmatized tokens. Image has 8 senses, 7 as a noun and 1 as a verb;
  Building complex concepts. Prepositions, conjunctions, etc. are translated into logical connectives and used to build complex concepts out of the atomic concepts
C_{Wine and Cheese} = <Wine, U(WN_Wine)> U <Cheese, U(WN_Cheese)>, where U is a union of the senses that WordNet attaches to lemmatized tokens
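The preprocessing steps above can be sketched roughly as follows. This is only an illustration: TOY_LEMMAS and TOY_SENSES are hypothetical stand-ins for the WordNet oracle (the sense identifiers are made up), and the translation of "and" into a disjunction is an assumption that follows the Wine and Cheese example.

```python
import re

# Hypothetical stand-ins for the oracle; the sense identifiers are invented.
TOY_LEMMAS = {"images": "image", "pictures": "picture"}
TOY_SENSES = {"wine": ["wine#1", "wine#2"], "cheese": ["cheese#1"], "image": ["image#1"]}
# Assumed translation of function words into logical connectives.
CONNECTIVES = {"and": "|", "or": "|"}


def tokenize(label):
    """Split a label into word tokens on punctuation and spaces."""
    return re.findall(r"[A-Za-z]+", label)


def lemmatize(token):
    """Reduce a token to a basic form using the toy lemma table."""
    token = token.lower()
    return TOY_LEMMAS.get(token, token)


def concept_of_label(label):
    """Return (formula, atoms): a formula over lemmas and the senses of each atom."""
    parts, atoms = [], {}
    for token in tokenize(label):
        lemma = lemmatize(token)
        if lemma in CONNECTIVES:
            parts.append(CONNECTIVES[lemma])
            continue
        if parts and parts[-1] not in CONNECTIVES.values():
            parts.append("&")  # assumed default: adjacent words are conjoined
        atoms[lemma] = TOY_SENSES.get(lemma, [])
        parts.append(lemma)
    return " ".join(parts), atoms


formula, atoms = concept_of_label("Wine and Cheese")
```

With the toy tables, formula is "wine | cheese" and atoms maps each lemma to its list of candidate senses.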
21
2: COMPUTE CONCEPTS AT NODES
The idea: extend concepts at labels by capturing the knowledge residing in a structure of a graph in order to define a context in which the given concept at a label occurs
Computation (basic case): Concept at a node for some node n is computed as an intersection of concepts at labels located above the given node, including the node itself
Europe (1)
  Pictures (2)
    Italy (4)
    Austria (5)
  Wine and Cheese (3)
C4 = C_Europe ⊓ C_Pictures ⊓ C_Italy
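The basic case can be sketched as a walk from the node up to the root. The TREE dictionary encodes the example tree with the node numbering read off the slide figure (an assumption about the figure), and the function names are illustrative.

```python
# Example tree: node id -> (label, parent id). Numbering assumed from the figure.
TREE = {
    1: ("Europe", None),
    2: ("Pictures", 1),
    3: ("Wine and Cheese", 1),
    4: ("Italy", 2),
    5: ("Austria", 2),
}


def path_to_root(tree, node):
    """Labels from the root down to the given node, inclusive."""
    labels = []
    while node is not None:
        label, parent = tree[node]
        labels.append(label)
        node = parent
    return list(reversed(labels))


def concept_at_node(tree, node):
    """Conjunction of the concepts of labels on the path from the root."""
    return " & ".join(f"C[{label}]" for label in path_to_root(tree, node))


c4 = concept_at_node(TREE, 4)
```

For node 4 this yields "C[Europe] & C[Pictures] & C[Italy]", matching the C4 formula on the slide.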
22
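3: COMPUTE RELATIONS BETWEEN CONCEPTS OF LABELS
The idea: Exploit a priori knowledge, e.g., lexical, domain knowledge with the help of element level semantic matchers
T1:
Europe (1)
  Pictures (2)
    Italy (4)
    Austria (5)
  Wine and Cheese (3)
T2:
Images (1)
  Europe (2)
    Italy (3)
    Austria (4)
Results of step 3: a matrix of relations between the concepts of labels of T1 (C_Europe, C_Pictures, C_Italy, C_Austria, C_Wine, C_Cheese) and T2 (C_Images, C_Europe, C_Italy, C_Austria), containing the equivalences:
  C_Europe = C_Europe
  C_Pictures = C_Images
  C_Italy = C_Italy
  C_Austria = C_Austria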
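A rough sketch of an element level matcher that fills such a matrix. SYNONYMS and MORE_GENERAL are a hypothetical toy knowledge base standing in for WordNet; the entries and function names are assumptions made for illustration only.

```python
# Hypothetical toy knowledge base standing in for the oracle.
SYNONYMS = {frozenset({"picture", "image"})}
MORE_GENERAL = {("europe", "italy"), ("europe", "austria")}  # (general, specific)


def match_labels(a, b):
    """Return '=', '⊒', '⊑', or None (unknown) for two lemmas."""
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return "="
    if (a, b) in MORE_GENERAL:
        return "⊒"
    if (b, a) in MORE_GENERAL:
        return "⊑"
    return None


def relation_matrix(lemmas1, lemmas2):
    """Relations between every pair of lemmas from the two trees."""
    return {(a, b): match_labels(a, b) for a in lemmas1 for b in lemmas2}


matrix = relation_matrix(
    ["europe", "picture", "italy", "austria", "wine", "cheese"],
    ["image", "europe", "italy", "austria"],
)
```

Pairs for which the toy knowledge base says nothing get None, which the later step treats as "no axiom available".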
23
4: COMPUTE RELATIONS BETWEEN CONCEPTS AT NODES
The idea: Reduce the matching problem to a validity problem
Context: the relations between concepts of labels (from step 3)
Wff rel (C1i, C2j): relation to be proved (with C1i in Tree1 and C2j in Tree2)
Translate into propositional logic:
  C1i = C2j is translated into C1i ↔ C2j
  C1i ⊑ C2j (C1i is less general) is translated into C1i → C2j
  C1i ⊒ C2j (C1i is more general) is translated into C2j → C1i
  C1i ⊥ C2j is translated into ¬(C1i ∧ C2j)
Prove that Context → Wff rel (C1i, C2j) is valid with a SAT decider: a propositional formula is valid iff its negation is unsatisfiable (SAT)
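A small sketch of the validity check. Instead of a real SAT decider it enumerates all truth assignments, which is equivalent for a handful of variables but is a swapped-in technique, not what S-Match does. The context contains only the assumed axiom Pictures ↔ Images, and all names are illustrative.

```python
from itertools import product


def is_valid(formula, variables):
    """True iff the formula holds under every assignment (its negation is unsatisfiable)."""
    return all(
        formula(dict(zip(variables, values)))
        for values in product([False, True], repeat=len(variables))
    )


def implies(a, b):
    return (not a) or b


def strongest_relation(context, c1, c2, variables):
    """Try relations in decreasing strength; return the first one that can be proved."""
    candidates = [
        ("=", lambda v: c1(v) == c2(v)),             # C1 <-> C2
        ("⊑", lambda v: implies(c1(v), c2(v))),      # C1 -> C2
        ("⊒", lambda v: implies(c2(v), c1(v))),      # C2 -> C1
        ("⊥", lambda v: not (c1(v) and c2(v))),      # not (C1 and C2)
    ]
    for symbol, wff in candidates:
        if is_valid(lambda v, wff=wff: implies(context(v), wff(v)), variables):
            return symbol
    return None


VARIABLES = ["Europe", "Pictures", "Images", "Italy"]
context = lambda v: v["Pictures"] == v["Images"]  # assumed axiom from step 3
t1_node2 = lambda v: v["Europe"] and v["Pictures"]
t1_node4 = lambda v: v["Europe"] and v["Pictures"] and v["Italy"]
t2_node3 = lambda v: v["Images"] and v["Europe"] and v["Italy"]

rel_4_3 = strongest_relation(context, t1_node4, t2_node3, VARIABLES)
rel_2_3 = strongest_relation(context, t1_node2, t2_node3, VARIABLES)
```

Under these assumptions rel_4_3 is "=" and rel_2_3 is "⊒": node 2 of T1 (Europe/Pictures) is more general than node 3 of T2 (Images/Europe/Italy).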
24
ALGORITHM AND FRAMEWORK
S-Match: an algorithm
  MinSMatch (produces minimal mapping)
  SPSM (preserves structure)
S-Match: a framework
  algorithms
  data formats
  preprocessors
  reasoners
  matchers
  knowledge bases
25
S-MATCH: ARCHITECTURE
Input
  Loaders: TabIndented, XML
Offline
  Step 1: Preprocessors
  Step 2: Classifiers
Online
  Matchers
    Step 3: Element
      Syntactic: String, Gloss
      Semantic: WordNet, Sense
    Step 4: Structure
      Tree: Minimal, Default
      Node: Minimal, Default
  Deciders: Sat4J, MiniSat
  Oracles: Wordnet, InMemoryWN
Output
  Filters: SPSM
  Renderers: TabIndented, XML
29
MAPPING EVALUATION
Standard Measures:
  Precision, [0,1] (correctness criterion, how many false positives?)
  Recall, [0,1] (completeness criterion, how many false negatives?)
  F-Measure, [0,1]
  Time
30
STANDARD MEASURES
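A: False negatives
B: True positives
C: False positives
D: True negatives
[Figure: Expert Matches cover regions A and B, System Matches cover regions B and C, D lies outside both]
$\text{Recall} = \frac{|B|}{|A| + |B|}$
$\text{Precision} = \frac{|B|}{|C| + |B|}$
$\text{F-Measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$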
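The three measures can be computed directly from two sets of correspondences. A minimal sketch; the function name evaluate and the sample node pairs are made up for illustration and are not results from the talk.

```python
def evaluate(expert, system):
    """Return (precision, recall, f_measure) for two sets of correspondences."""
    expert, system = set(expert), set(system)
    true_pos = len(expert & system)    # region B
    false_neg = len(expert - system)   # region A
    false_pos = len(system - expert)   # region C
    precision = true_pos / (true_pos + false_pos) if system else 0.0
    recall = true_pos / (true_pos + false_neg) if expert else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical (node in T1, node in T2) pairs.
expert_matches = {(1, 2), (4, 3), (5, 4)}
system_matches = {(4, 3), (5, 4), (2, 1), (3, 4)}
precision, recall, f_measure = evaluate(expert_matches, system_matches)
```

In this made-up case two of the four system matches are correct (precision 1/2) and two of the three expert matches are found (recall 2/3).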
31
ADVANCED GOLDEN STANDARDS
GS+: matches
GS-: mismatches
[Figure: Expert GS+ and GS- versus System Matches (regions C and D), with the Recall and Precision formulas defined over GS+ and GS-]
32
EVALUATION FEATURES
Mappings
  Experts subjectivity
Schema-based
Instance-based
Small and Large Data Sets
  Large data sets are expensive
  And often unfeasible
Minimal vs Maximum vs Normal mapping
  Diverse golden standards and practices
Similarity vs Relations
  [0,1] vs { ... } (sets of semantic relations)
Technical Problems
  bugs in data, data parsing bugs, absent or abandoned implementation, closed or restrictive licenses
33
RECOMMENDATIONS
Use large golden standards, always include an explicit GS-, reduce as much as possible the pairs in Unknown
  How does the size of a GS influence the result of the evaluation?
  How large should the part covered by GS+ and GS- be to be statistically significant?
When comparing different tools, specify whether and how the evaluation takes into account the kinds of semantic relations
  How to evaluate tools producing disjointness?
  What about overlap?
To obtain accurate measures it is fundamental to maximize both the golden standard and the matching result
35
CATALOG INTEGRATION
different points of view
different terminology
different preferred terms
[Figure: two catalogs to integrate]
Catalog 1 (nodes A, B, C): JOURNALS; DEVELOPMENT AND PROGRAMMING LANGUAGES; JAVA
Catalog 2 (nodes D, E, F, G): PROGRAMMING AND DEVELOPMENT; LANGUAGES; JAVA; MAGAZINES
36
BOOK CLASSIFICATION
S-Match matches tree vs tree
single node is a particular case of a tree
match node vs tree
[Figure: a catalog tree matched against a single node]
Catalog: A JOURNALS; B DEVELOPMENT AND PROGRAMMING LANGUAGES; C JAVA
Single node: F Programming and Development in Java
38
EXERCISES
Download S-Match, run it
Integrate catalogs
  Load the examples C and W
  Match and explore the mapping
  Edit the trees, match and explore differences
    edit the labels (try using synonyms)
    edit the structure
Classify a book
  Create a small catalog of your own in the left tree
  Create a single node in the right tree. Use the book title as the node label
  Match and observe the link
    understand why there are no links where you expected them
    understand why there are wrong links