Schema & Ontology Matching: Current Research Directions

AnHai Doan
Database and Information System Group

University of Illinois, Urbana Champaign

Spring 2004


2

Road Map

Schema Matching
– motivation & problem definition
– representative current solutions: LSD, iMAP, Clio
– broader picture

Ontology Matching
– motivation & problem definition
– representative current solution: GLUE
– broader picture

Conclusions & Emerging Directions

3

Motivation: Data Integration

New faculty member
Find houses with 2 bedrooms priced under 200K
homes.com, realestate.com, homeseekers.com

4

Architecture of Data Integration System

Find houses with 2 bedrooms priced under 200K
mediated schema
source schema 1: homes.com
source schema 2: realestate.com
source schema 3: homeseekers.com

5

Semantic Matches between Schemas

Mediated-schema: price, agent-name, address

homes.com:
listed-price  contact-name  city     state
320K          Jane Brown    Seattle  WA
240K          Mike Smith    Miami    FL

1-1 match (e.g., price and listed-price)
complex match (e.g., address and city + state)

6

Schema Matching is Ubiquitous!

Fundamental problem in numerous applications
Databases
– data integration
– data translation
– schema/view integration
– data warehousing
– semantic query processing
– model management
– peer data management
AI
– knowledge bases, ontology merging, information gathering agents, ...
Web
– e-commerce
– marking up data using ontologies (e.g., on Semantic Web)

7

Why Schema Matching is Difficult

Schema & data never fully capture semantics!
– not adequately documented
– schema creator has retired to Florida!
Must rely on clues in schema & data
– using names, structures, types, data values, etc.
Such clues can be unreliable
– same names => different entities: area => location or square-feet
– different names => same entity: area & address => location
Intended semantics can be subjective
– house-style = house-description?
– military applications require committees to decide!
Cannot be fully automated, needs user feedback!

8

Current State of Affairs

Finding semantic mappings is now a key bottleneck!
– largely done by hand
– labor intensive & error prone
– data integration at GTE [Li&Clifton, 2000]
  – 40 databases, 27000 elements, estimated time: 12 years
Will only be exacerbated
– data sharing becomes pervasive
– translation of legacy data
Need semi-automatic approaches to scale up!
Many research projects in the past few years
– Databases: IBM Almaden, Microsoft Research, BYU, George Mason, U of Leipzig, U Wisconsin, NCSU, UIUC, Washington, ...
– AI: Stanford, Karlsruhe University, NEC Japan, ...





10

LSD

Learning Source Description
Developed at Univ of Washington 2000-2001
– with Pedro Domingos and Alon Halevy
Designed for data integration settings
– has been adapted to several other contexts
Desirable characteristics
– learns from previous matching activities
– exploits multiple types of information in schema and data
– incorporates domain integrity constraints
– handles user feedback
– achieves high matching accuracy (71 -- 92%) on real-world data

11

Schema Matching for Data Integration: the LSD Approach

Suppose user wants to integrate 100 data sources
1. User
– manually creates matches for a few sources, say 3
– shows LSD these matches
2. LSD learns from the matches
3. LSD predicts matches for remaining 97 sources

12

Learning from the Manual Matches

Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com:
listed-price  contact-name  contact-phone   office          comments
$250K         James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
$320K         Mike Doan     (617) 253 1429  (617) 112 2315  Great location

If “office” occurs in name => office-phone
If “fantastic” & “great” occur frequently in data instances => description

Schema of homes.com:
sold-at  contact-agent   extra-info
$350K    (206) 634 9435  Beautiful yard
$230K    (617) 335 4243  Close to Seattle

13

Must Exploit Multiple Types of Information!

– schema information: if “office” occurs in name => office-phone
– data information: if “fantastic” & “great” occur frequently in data instances => description

14

Multi-Strategy Learning

Use a set of base learners
– each exploits well certain types of information
To match a schema element of a new source
– apply base learners
– combine their predictions using a meta-learner
Meta-learner
– uses training sources to measure base learner accuracy
– weighs each learner based on its accuracy

15

Base Learners

Training
– from training examples (X1,C1), (X2,C2), ..., (Xm,Cm), where each Xi is an object and Ci its observed label, learn a classification model (hypothesis)
Matching
– given an object X, return labels weighted by confidence score
Name Learner
– training: (“location”, address), (“contact name”, name)
– matching: agent-name => (name,0.7), (phone,0.3)
Naive Bayes Learner
– training: (“Seattle, WA”, address), (“250K”, price)
– matching: “Kent, WA” => (address,0.8), (name,0.2)
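The training/matching interface of a base learner can be sketched as follows. This is a minimal illustration, not the LSD implementation: the class name `NaiveBayesLearner`, the tokenizer, and the Laplace smoothing are assumptions made here, and the training pairs are the ones shown on the "Training the Base Learners" slide. The confidence scores it prints will not match the illustrative numbers on the slide.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Split into lowercase words, digit runs, and individual symbols.
    return re.findall(r"[a-z]+|\d+|[^\sa-z\d]", text.lower())


class NaiveBayesLearner:
    """Hypothetical base learner: multinomial Naive Bayes over tokens."""

    def fit(self, examples):
        # examples: iterable of (data instance, mediated-schema label)
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        for text, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokenize(text))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, text):
        # Returns [(label, confidence), ...] sorted by decreasing confidence.
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for tok in tokenize(text):
                score += math.log((counts[tok] + 1) / denom)  # Laplace smoothing
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return sorted(((l, v / z) for l, v in exp.items()), key=lambda p: -p[1])


training = [
    ("Miami, FL", "address"), ("$250K", "price"),
    ("James Smith", "agent-name"), ("(305) 729 0831", "agent-phone"),
    ("(305) 616 1822", "office-phone"), ("Fantastic house", "description"),
    ("Boston,MA", "address"),
]
predictions = NaiveBayesLearner().fit(training).predict("Kent, WA")
print(predictions[0][0])  # label with the highest confidence
```

With this toy training set the comma token is what ties "Kent, WA" to address, which is the kind of word-and-symbol frequency evidence the Naive Bayes learner is described as exploiting.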

16

The LSD Architecture

Training Phase
– inputs: mediated schema, source schemas
– training data for base learners
– Base-Learner1 ... Base-Learnerk produce Hypothesis1 ... Hypothesisk
– Meta-Learner produces weights for base learners
Matching Phase
– Base-Learner1 ... Base-Learnerk and Meta-Learner produce predictions for instances
– Prediction Combiner produces predictions for elements
– Constraint Handler applies domain constraints and produces mappings

17

Training the Base Learners

Mediated schema: address, price, agent-name, agent-phone, office-phone, description

realestate.com:
location    price  contact-name  contact-phone   office          comments
Miami, FL   $250K  James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
Boston, MA  $320K  Mike Doan     (617) 253 1429  (617) 112 2315  Great location

Name Learner training examples
– (“location”, address)
– (“price”, price)
– (“contact name”, agent-name)
– (“contact phone”, agent-phone)
– (“office”, office-phone)
– (“comments”, description)

Naive Bayes Learner training examples
– (“Miami, FL”, address)
– (“$250K”, price)
– (“James Smith”, agent-name)
– (“(305) 729 0831”, agent-phone)
– (“(305) 616 1822”, office-phone)
– (“Fantastic house”, description)
– (“Boston, MA”, address)

18

Meta-Learner: Stacking [Wolpert 92, Ting&Witten99]

Training
– uses training data to learn weights
– one for each (base-learner, mediated-schema element) pair
– weight (Name-Learner,address) = 0.2
– weight (Naive-Bayes,address) = 0.8
Matching: combine predictions of base learners
– computes weighted average of base-learner confidence scores
Example: element area, with instances Seattle, WA; Kent, WA; Bend, OR
– Name Learner: (address,0.4)
– Naive Bayes: (address,0.9)
– Meta-Learner: (address, 0.4*0.2 + 0.9*0.8 = 0.8)
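The combination step can be reproduced in a few lines. This is a sketch of the weighted combination only: the function name `combine` and the dictionary layout are assumptions made for illustration, and the weights are taken as given rather than learned.

```python
def combine(predictions, weights):
    """Weighted combination of base-learner confidence scores.

    predictions: {learner: {label: confidence}}
    weights:     {(learner, label): weight}, one per (base-learner, label) pair
    """
    labels = {label for scores in predictions.values() for label in scores}
    return {
        label: sum(
            weights.get((learner, label), 0.0) * scores.get(label, 0.0)
            for learner, scores in predictions.items()
        )
        for label in labels
    }


# Numbers from the slide: weights for address and the two base-learner scores.
weights = {("Name-Learner", "address"): 0.2, ("Naive-Bayes", "address"): 0.8}
predictions = {
    "Name-Learner": {"address": 0.4},
    "Naive-Bayes": {"address": 0.9},
}
combined = combine(predictions, weights)
print(round(combined["address"], 2))  # 0.4*0.2 + 0.9*0.8
```

Because the weights are indexed by (learner, label), a learner that is reliable for one mediated-schema element can still be discounted for another.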


20

Applying the Learners

homes.com schema: area, sold-at, contact-agent, extra-info

area (instances: Seattle, WA; Kent, WA; Bend, OR)
– Name Learner + Naive Bayes, combined by the Meta-Learner, one prediction per instance:
  – (address,0.8), (description,0.2)
  – (address,0.6), (description,0.4)
  – (address,0.7), (description,0.3)
– Prediction-Combiner: (address,0.7), (description,0.3)
sold-at: (price,0.9), (agent-phone,0.1)
contact-agent: (agent-phone,0.9), (description,0.1)
extra-info: (address,0.6), (description,0.4)

21

Domain Constraints

Encode user knowledge about domain
Specified only once, by examining mediated schema
Examples
– at most one source-schema element can match address
– if a source-schema element matches house-id then it is a key
– avg-value(price) > avg-value(num-baths)
Given a mapping combination
– can verify if it satisfies a given constraint
– e.g., area: address, sold-at: price, contact-agent: agent-phone, extra-info: address (two elements match address, so the first constraint is violated)

22

The Constraint Handler

Searches space of mapping combinations efficiently
Can handle arbitrary constraints
Also used to incorporate user feedback
– sold-at does not match price

Predictions from Prediction Combiner
– area: (address,0.7), (description,0.3)
– sold-at: (price,0.9), (agent-phone,0.1)
– contact-agent: (agent-phone,0.9), (description,0.1)
– extra-info: (address,0.6), (description,0.4)

Domain Constraints
– at most one element matches address

Mapping combinations and their scores
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: address
  0.7 * 0.9 * 0.9 * 0.6 = 0.3402 (violates the constraint)
– area: address, sold-at: price, contact-agent: agent-phone, extra-info: description
  0.7 * 0.9 * 0.9 * 0.4 = 0.2268
– area: description, sold-at: agent-phone, contact-agent: description, extra-info: description
  0.3 * 0.1 * 0.1 * 0.4 = 0.0012
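The scoring of mapping combinations above can be checked with a small script. It is only a sketch: it enumerates every combination exhaustively, which is not the efficient search the slide refers to, and the names `best_mapping` and `at_most_one_address` are hypothetical.

```python
from itertools import product

# Predictions from the Prediction Combiner (numbers from the slide).
predictions = {
    "area": {"address": 0.7, "description": 0.3},
    "sold-at": {"price": 0.9, "agent-phone": 0.1},
    "contact-agent": {"agent-phone": 0.9, "description": 0.1},
    "extra-info": {"address": 0.6, "description": 0.4},
}


def at_most_one_address(mapping):
    # Domain constraint: at most one source element matches address.
    return sum(1 for label in mapping.values() if label == "address") <= 1


def best_mapping(predictions, constraints):
    """Exhaustively score every mapping combination that satisfies all constraints."""
    elements = list(predictions)
    best, best_score = None, -1.0
    for labels in product(*(predictions[e] for e in elements)):
        mapping = dict(zip(elements, labels))
        if not all(check(mapping) for check in constraints):
            continue
        score = 1.0
        for element, label in mapping.items():
            score *= predictions[element][label]
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score


unconstrained, s0 = best_mapping(predictions, [])
constrained, s1 = best_mapping(predictions, [at_most_one_address])
print(unconstrained["extra-info"], round(s0, 4))
print(constrained["extra-info"], round(s1, 4))
```

User feedback such as "sold-at does not match price" fits the same shape: it is one more predicate, for example `lambda m: m["sold-at"] != "price"`, appended to the constraint list.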

23

The Current LSD System

Can also handle data in XML format
– matches XML DTDs
Base learners
– Naive Bayes [Duda&Hart-93, Domingos&Pazzani-97]
  – exploits frequencies of words & symbols
– WHIRL Nearest-Neighbor Classifier [Cohen&Hirsh KDD-98]
  – employs information-retrieval similarity metric
– Name Learner [SIGMOD-01]
  – matches elements based on their names
– County-Name Recognizer [SIGMOD-01]
  – stores all U.S. county names
– XML Learner [SIGMOD-01]
  – exploits hierarchical structure of XML data

24

Empirical Evaluation

Four domains
– Real Estate I & II, Course Offerings, Faculty Listings
For each domain
– created mediated schema & domain constraints
– chose five sources
– extracted & converted data into XML
– mediated schemas: 14 - 66 elements, source schemas: 13 - 48
Ten runs for each domain, in each run:
– manually provided 1-1 matches for 3 sources
– asked LSD to propose matches for remaining 2 sources
– accuracy = % of 1-1 matches correctly identified

25

High Matching Accuracy

[Bar chart: average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, Faculty Listings]

LSD’s accuracy: 71 - 92%
Best single base learner: 42 - 72%
+ Meta-learner: + 5 - 22%
+ Constraint handler: + 7 - 13%
+ XML learner: + 0.8 - 6%

26

Contribution of Schema vs. Data

[Bar chart: average matching accuracy (%) for Real Estate I, Real Estate II, Course Offerings, Faculty Listings; series: LSD with only schema info., LSD with only data info., Complete LSD]

More experiments in [Doan et al. SIGMOD-01]

27

LSD Summary

LSD
– learns from previous matching activities
– exploits multiple types of information
  – by employing multi-strategy learning
– incorporates domain constraints & user feedback
– achieves high matching accuracy
LSD focuses on 1-1 matches
Next challenge: discover more complex matches!
– iMAP (Illinois Mapping) system [SIGMOD-04]
– developed at Washington and Illinois, 2002-2004
– with Robin Dhamankar, Yoonkyong Lee, Alon Halevy, Pedro Domingos

28

The iMAP Approach

For each mediated-schema element
– searches space of all matches
– finds a small set of likely match candidates
– uses LSD to evaluate them
To search efficiently
– employs a specialized searcher for each element type
– Text Searcher, Numeric Searcher, Category Searcher, ...

Mediated-schema: price, num-baths, address

homes.com:
listed-price  agent-id  full-baths  half-baths  city     zipcode
320K          53211     2           1           Seattle  98105
240K          11578     1           1           Miami    23591

29

The iMAP Architecture [SIGMOD-04]

Inputs: mediated schema, source schema + data
– Searcher1, Searcher2, ..., Searcherk => match candidates
– Meta-Learner => similarity matrix
– Match selector => 1-1 and complex matches
– Domain knowledge and data
– Explanation module, interacting with the user

30

An Example: Text Searcher

Beam search in space of all concatenation matches
Example: find match candidates for address

Mediated-schema: price, num-baths, address

homes.com:
listed-price  agent-id  full-baths  half-baths  city     zipcode
320K          532a      2           1           Seattle  98105
240K          115c      1           1           Miami    23591

Candidate concatenations and their values
– concat(agent-id,zipcode): 532a 98105; 115c 23591
– concat(city,zipcode): Seattle 98105; Miami 23591
– concat(agent-id,city): 532a Seattle; 115c Miami

Best match candidates for address
– (agent-id,0.7), (concat(agent-id,city),0.75), (concat(city,zipcode),0.9)
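A beam search over concatenation candidates can be sketched as below. The function `beam_search_concat`, the beam width, and the lookup-table scorer `toy_score` are assumptions for illustration; in iMAP the score of a candidate would come from learned evaluators over the data, not from a fixed table. Only the three scores shown on the slide are taken from the source, the remaining values are made up so the example runs.

```python
def beam_search_concat(attributes, score, beam_width=3, max_len=2):
    """Search ordered attribute tuples (read as concat(a, b, ...)), keeping
    only the beam_width best candidates of each length for extension."""
    frontier = [(a,) for a in attributes]
    seen = {}
    for _ in range(max_len):
        for candidate in frontier:
            seen[candidate] = score(candidate)
        beam = sorted(frontier, key=lambda c: seen[c], reverse=True)[:beam_width]
        # Extend each surviving candidate by one attribute it does not yet use.
        frontier = [c + (a,) for c in beam for a in attributes if a not in c]
    ranked = sorted(seen.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:beam_width]


# Hypothetical similarity-to-address scores.
TOY_SCORES = {
    ("agent-id",): 0.7,            # from the slide
    ("agent-id", "city"): 0.75,    # from the slide
    ("city", "zipcode"): 0.9,      # from the slide
    ("city",): 0.5,                # assumed
    ("zipcode",): 0.4,             # assumed
}


def toy_score(candidate):
    return TOY_SCORES.get(candidate, 0.1)  # assumed default for everything else


attributes = ["listed-price", "agent-id", "full-baths", "half-baths", "city", "zipcode"]
best = beam_search_concat(attributes, toy_score)
for candidate, value in best:
    print("concat(" + ",".join(candidate) + ")", value)
```

The beam keeps the search tractable: with six attributes there are 30 ordered pairs, but only the extensions of the three best single attributes are ever scored.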

31

Empirical Evaluation

Current iMAP system
– 12 searchers
Four real-world domains
– real estate, product inventory, cricket, financial wizard
– target schema: 19 -- 42 elements, source schema: 32 -- 44
Accuracy: 43 -- 92%
Sample discovered matches
– agent-name = concat(first-name,last-name)
– area = building-area / 43560
– discount-cost = (unit-price * quantity) * (1 - discount)
More detail in [Dhamankar et al. SIGMOD-04]

32

Observations

Finding complex matches much harder than 1-1 matches!
– require gluing together many components
– e.g., num-rooms = bath-rooms + bed-rooms + dining-rooms + living-rooms
– if missing one component => incorrect match
However, even partial matches are already very useful!
– so are top-k matches => need methods to handle partial/top-k matches
Huge/infinite search spaces
– domain knowledge plays a crucial role!
Matches are fairly complex, hard to know if they are correct
– must be able to explain matches
Human must be fairly active in the loop
– need strong user interaction facilities
Break matching architecture into multiple "atomic" boxes!





34

Finding Matches is only Half of the Job!

To translate data/queries, need mappings, not matches

Schema S

HOUSES
location     price ($)  agent-id
Atlanta, GA  360,000    32
Raleigh, NC  430,000    15

AGENTS
id  name        city     state  fee-rate
32  Mike Brown  Athens   GA     0.03
15  Jean Laup   Raleigh  NC     0.04

Schema T

LISTINGS
area         list-price  agent-address  agent-name
Denver, CO   550,000     Boulder, CO    Laura Smith
Atlanta, GA  370,800     Athens, GA     Mike Brown

Mappings
– area = SELECT location FROM HOUSES
– agent-address = SELECT concat(city,state) FROM AGENTS
– list-price = SELECT price * (1 + fee-rate) FROM HOUSES, AGENTS WHERE agent-id = id
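The mappings can be run as an actual query to see how source rows become target rows. This is a sketch using Python's built-in sqlite3 module with the HOUSES and AGENTS rows from the slide; the underscore column names (SQL identifiers cannot contain bare hyphens), the inner join, and the rounding are choices made here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HOUSES (location TEXT, price REAL, agent_id INTEGER);
CREATE TABLE AGENTS (id INTEGER, name TEXT, city TEXT, state TEXT, fee_rate REAL);
INSERT INTO HOUSES VALUES ('Atlanta, GA', 360000, 32), ('Raleigh, NC', 430000, 15);
INSERT INTO AGENTS VALUES (32, 'Mike Brown', 'Athens', 'GA', 0.03),
                          (15, 'Jean Laup', 'Raleigh', 'NC', 0.04);
""")

# One query that populates all four LISTINGS columns from the source tables.
rows = conn.execute("""
SELECT h.location                        AS area,
       ROUND(h.price * (1 + a.fee_rate)) AS list_price,
       a.city || ', ' || a.state         AS agent_address,
       a.name                            AS agent_name
FROM HOUSES h JOIN AGENTS a ON h.agent_id = a.id
ORDER BY h.agent_id DESC
""").fetchall()

for row in rows:
    print(row)
```

The first result row reproduces the LISTINGS tuple shown for Atlanta, GA (list-price 370,800, agent Mike Brown of Athens, GA), which is exactly what the match list-price = price * (1 + fee-rate) could not produce without the join condition agent-id = id.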

35

Clio: Elaborating Matches into Mappings

Developed at Univ of Toronto & IBM Almaden, 2000-2003
– by Renee Miller, Laura Haas, Mauricio Hernandez, Lucian Popa, Howard Ho, Ling Yan, Ron Fagin
Given a match
– list-price = price * (1 + fee-rate)
Refine it into a mapping
– list-price = SELECT price * (1 + fee-rate) FROM HOUSES (FULL OUTER JOIN) AGENTS WHERE agent-id = id
Need to discover
– the correct join path among tables, e.g., agent-id = id
– the correct join, e.g., full outer join? inner join?
Use heuristics to decide
– when in doubt, ask users
– employ sophisticated user interaction methods [VLDB-00, SIGMOD-01]





38

Broader Picture: Find Matches

Hand-crafted rules; exploit schema; 1-1 matches
– TRANSCM [Milo&Zohar98]
– ARTEMIS [Castano&Antonellis99]
– [Palopoli et al. 98]
– CUPID [Madhavan et al. 01]

Single learner; exploit data; 1-1 matches
– SEMINT [Li&Clifton94]
– ILA [Perkowitz&Etzioni95]
– DELTA [Clifton et al. 97]
– AutoMatch, Autoplex [Berlin & Motro, 01-03]

Learners + rules, use multi-strategy learning; exploit schema + data; 1-1 + complex matches; exploit domain constraints
– LSD [Doan et al., SIGMOD-01]
– iMAP [Dhamankar et al., SIGMOD-04]

Other Important Works
– COMA by Erhard Rahm group
– David Embley group at BYU
– Jaewoo Kang group at NCSU
– Kevin Chang group at UIUC
– Clement Yu group at UIC

More about some of these works soon ....

39

Broader Picture: From Matches to Mappings

CLIO [Miller et al., 00] [Yan et al. 01]
– rules
– exploit data
– powerful user interaction

iMAP [Dhamankar et al., SIGMOD-04]
– learners + rules
– exploit schema + data
– 1-1 + complex matches
– automate as much as possible

?





41

Ontology Matching

Increasingly critical for
– knowledge bases, Semantic Web
An ontology
– concepts organized into a taxonomy tree
– each concept has
  – a set of attributes
  – a set of instances
– relations among concepts
Matching
– concepts
– attributes
– relations

Example taxonomy: CS Dept. US
Entity
  Undergrad Courses
  Grad Courses
  People
    Staff
    Faculty
      Assistant Professor
      Associate Professor
      Professor
Example instance: name: Mike Burns, degree: Ph.D.

42

Matching Taxonomies of Concepts

CS Dept. US
Entity
  Undergrad Courses
  Grad Courses
  People
    Staff
    Faculty
      Assistant Professor
      Associate Professor
      Professor

CS Dept. Australia
Entity
  Courses
  Staff
    Technical Staff
    Academic Staff
      Lecturer
      Senior Lecturer
      Professor

43

Glue

Solution
– Use data instances extensively
– Learn classifiers using information within taxonomies
– Use a rich constraint satisfaction scheme

[Doan, Madhavan, Domingos, Halevy; WWW’2002]

44

Concept Similarity

Multiple similarity measures in terms of the JPD

Hypothetical universe of all examples, partitioned by Concept A and Concept S into four regions: \(A, S\); \(A, \bar{S}\); \(\bar{A}, S\); \(\bar{A}, \bar{S}\)

Joint Probability Distribution: \(P(A,S)\), \(P(A,\bar{S})\), \(P(\bar{A},S)\), \(P(\bar{A},\bar{S})\)

\[
\mathrm{Sim}(\text{Concept } A, \text{Concept } S) = \frac{P(A \cap S)}{P(A \cup S)} = \frac{P(A,S)}{P(A,S) + P(A,\bar{S}) + P(\bar{A},S)}
\]

[Jaccard, 1908]

45

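Machine Learning for Computing Similarities

JPD estimated by counting the sizes of the partitions

[Figure: the instances of Taxonomy 1, labeled A or not-A, train a classifier CL_A; the instances of Taxonomy 2, labeled S or not-S, train a classifier CL_S. Each classifier is applied to the instances of the other taxonomy, so that every instance falls into one of the four partitions (A,S), (A,not-S), (not-A,S), (not-A,not-S).]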
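Estimating the Jaccard similarity from partition sizes can be sketched as follows. The predicates `in_a` and `in_s` stand in for the learned classifiers and are hypothetical, as are the instance identifiers and the two membership sets; only the counting scheme follows the slide.

```python
def jaccard_from_counts(n_as, n_a_not_s, n_not_a_s):
    """Sim(A, S) = P(A,S) / (P(A,S) + P(A,not-S) + P(not-A,S)), from counts."""
    denom = n_as + n_a_not_s + n_not_a_s
    return n_as / denom if denom else 0.0


def estimate_similarity(instances, in_a, in_s):
    # Count how many instances fall into each of the three relevant partitions.
    n_as = sum(1 for x in instances if in_a(x) and in_s(x))
    n_a_not_s = sum(1 for x in instances if in_a(x) and not in_s(x))
    n_not_a_s = sum(1 for x in instances if not in_a(x) and in_s(x))
    return jaccard_from_counts(n_as, n_a_not_s, n_not_a_s)


# Hypothetical instance universe and classifier decisions.
instances = ["p1", "p2", "p3", "p4", "p5", "p6"]
a_members = {"p1", "p2", "p3", "p4"}   # instances a classifier for A accepts
s_members = {"p3", "p4", "p5"}         # instances a classifier for S accepts

sim = estimate_similarity(
    instances,
    in_a=lambda x: x in a_members,
    in_s=lambda x: x in s_members,
)
print(sim)
```

The count for the (not-A, not-S) partition is not needed for the Jaccard measure, which is why only three of the four partition sizes appear in the formula.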

46

The Glue System

Inputs
– Taxonomy O1 (tree structure + data instances)
– Taxonomy O2 (tree structure + data instances)
Distribution Estimator (Base Learner, ..., Base Learner, Meta Learner)
– output: Joint Probability Distribution P(A,B), P(A’, B), …
Similarity Estimator (Similarity Function)
– output: Similarity Matrix
Relaxation Labeling (Common Knowledge & Domain Constraints)
– output: Matches for O1, Matches for O2

47

Constraints in Taxonomy Matching

Domain-dependent
– at most one node matches department-chair
– a node that matches professor cannot be a child of a node that matches assistant-professor
Domain-independent
– two nodes match if parents & children match
– if all children of X match Y, then X also matches Y
– variations have been exploited in many restricted settings [Melnik&Garcia-Molina, ICDE-02], [Milo&Zohar, VLDB-98], [Noy et al., IJCAI-01], [Madhavan et al., VLDB-01]
Challenge: find a general & efficient approach

48

Solution: Relaxation Labeling

Relaxation labeling [Hummel&Zucker, 83]
– applied to graph labeling in vision, NLP, hypertext classification
– finds best label assignment, given a set of constraints
– starts with initial label assignment
– iteratively improves labels, using constraints
Standard relaxation labeling not applicable
– extended it in many ways [Doan et al., WWW-02]

49

Real World Experiments

Taxonomies on the web
– University organization (UW and Cornell)
  – Colleges, departments and sub-fields
– Companies (Yahoo and The Standard)
  – Industries and Sectors
For each taxonomy
– Extract data instances: course descriptions, company profiles
– Trivial data cleaning
– 100 – 300 concepts per taxonomy
– 3-4 depth of taxonomies
– 10-90 data instances per concept
Evaluation against manual mappings as the gold standard

50

Glue’s Performance

[Bar chart: matching accuracy (%) of Name Learner, Content Learner, Meta Learner, and Relaxation Labeler; University Depts 1 (Cornell to Wash., Wash. to Cornell), University Depts 2 (Cornell to Wash., Wash. to Cornell), Company Profiles (Standard to Yahoo, Yahoo to Standard)]

51

Broader Picture

Ontology matching parallels the development of schema matching
– rule-based & learning-based approaches
– PROMPT family, OntoMorph, OntoMerge, Chimaera, Onion, OBSERVER, FCAMerge, ...
– extensive work by Ed Hovy's group
– ontology versioning (e.g., by Noy et al.)
More powerful user interaction methods
– e.g., iPROMPT, Chimaera
Much more theoretical work in this area





53

Develop the Theoretical Foundation

Not much is going on, however ...
– see works by Alon Halevy (AAAI-02) and Phil Bernstein (in model management contexts)
– some preliminary work in AnHai Doan's Ph.D. dissertation
– work by Stuart Russell and other AI people on identity uncertainty is potentially relevant
Most likely foundation
– probability framework

54

Need Much More Domain Knowledge

Where to get it?
– past matches (e.g., LSD, iMAP)
– other schemas in the domain
  – holistic matching approach by Kevin Chang group [SIGMOD-02]
  – corpus-based matching by Alon Halevy group [IJCAI-03]
  – clustering to achieve bridging effects by Clement Yu group [SIGMOD-04]
– external data (e.g., iMAP at SIGMOD-04)
– mass of users (e.g., MOBS at WebDB-03)
How to get it and how to use it?
– no clear answer yet

55

Employ Multi-Module Architecture

Many "black boxes", each good at doing a single thing
Combine them and tailor them to each application
Examples
– LSD, iMAP, COMA, David Embley's systems
Open issues
– what are these black boxes?
– how to build them?
– how to combine them?

56

Powerful User Interaction

Minimize user effort, maximize its impact
Make it very easy for users to
– supply domain knowledge
– provide feedback on matches/mappings
Develop powerful explanation facilities

57

Other Issues

What to do with partial/top-k matches?
Meaning negotiation
Fortifying schemas for interoperability
Very-large-scale matching scenarios (e.g., the Web)
What can we do without the mappings?
Interaction between schema matching and tuple matching?
Benchmarks, tools?

58

Summary

Schema/ontology matching: key to numerous data management problems
– much attention in the database, AI, Semantic Web communities
Simple problem definition, yet very difficult to do
– no satisfactory solution yet
– AI complete?
We now understand the problems much better
– still at the beginning of the journey
– will need techniques from multiple fields

59

Backup Slides

61

Training the Meta-Learner

For address

Extracted XML Instances           Name Learner  Naive Bayes  True Predictions
<location> Miami, FL </>          0.5           0.8          1
<listed-price> $250,000 </>       0.4           0.3          0
<area> Seattle, WA </>            0.3           0.9          1
<house-addr> Kent, WA </>         0.6           0.8          1
<num-baths> 3 </>                 0.3           0.3          0
...                               ...           ...          ...

Least-Squares Linear Regression
Weight(Name-Learner,address) = 0.1
Weight(Naive-Bayes,address) = 0.9

62

Sensitivity to Amount of Available Data

[Line chart: average matching accuracy (%), 40 to 100, vs. number of data listings per source (Real Estate I), 0 to 500]

63

Contribution of Each Component

[Bar chart: average matching accuracy (%) for Real Estate I, Course Offerings, Faculty Listings, Real Estate II; series: Without Name Learner, Without Naive Bayes, Without Whirl Learner, Without Constraint Handler, The complete LSD system]

64

Exploiting Hierarchical Structure

Existing learners flatten out all structures
Developed XML learner
– similar to the Naive Bayes learner
  – input instance = bag of tokens
– differs in one crucial aspect
  – consider not only text tokens, but also structure tokens

<description> Victorian house with a view. Name your price! To see it, contact Gail Murphy at MAX Realtors.</description>

<contact> <name> Gail Murphy </name> <firm> MAX Realtors </firm></contact>

65

Reasons for Incorrect Matchings

Unfamiliarity
– suburb
– solution: add a suburb-name recognizer
Insufficient information
– correctly identified general type, failed to pinpoint exact type
– agent-name: Richard Smith; phone: (206) 234 5412
– solution: add a proximity learner
Subjectivity
– house-style = description?
– house-style values: Victorian, Mexican
– description values: Beautiful neo-gothic house, Great location

66

Evaluate Mapping Candidates

For address, Text Searcher returns
– (agent-id,0.7)
– (concat(agent-id,city),0.8)
– (concat(city,zipcode),0.75)
Employ multi-strategy learning to evaluate mappings
Example: (concat(agent-id,city),0.8)
– Naive Bayes Learner: 0.8
– Name Learner: “address” vs. “agent id city”: 0.3
– Meta-Learner: 0.8 * 0.7 + 0.3 * 0.3 = 0.65
Meta-Learner returns
– (agent-id,0.59)
– (concat(agent-id,city),0.65)
– (concat(city,zipcode),0.70)

67

Relaxation Labeling

Applied to similar problems in
– vision, NLP, hypertext classification

[Figure: taxonomy Dept U.S. (Courses; People: Staff, Faculty) and taxonomy Dept Australia (Courses; Staff: Tech. Staff, Acad. Staff)]

68

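Relaxation Labeling for Taxonomy Matching

Must define
– neighborhood of a node
– k features of neighborhood
– how to combine influence of features: \( P(N = L \mid f_1, f_2, \ldots, f_k) \)
Algorithm
– init: for each pair <N,L>, compute \( P(N = L \mid \Delta) \)
– loop: for each pair <N,L>, re-compute
\[
P(N = L \mid \Delta) = \sum_{M} P(N = L \mid M, \Delta) \, P(M \mid \Delta)
\]
where M ranges over neighborhood configurations and \(\Delta\) denotes the available knowledge about the taxonomies and the domain
Example: for the pair Staff = People, one neighborhood configuration is Acad. Staff: Faculty, Tech. Staff: Staff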

69

Relaxation Labeling for Taxonomy Matching

Huge number of neighborhood configurations!
– typically neighborhood = immediate nodes
– here neighborhood can be entire graph
– 100 nodes, 10 labels => \(10^{100}\) configurations
Solution
– label abstraction + dynamic programming
– guarantee quadratic time for a broad range of domain constraints
Empirical evaluation
– GLUE system [Doan et al., WWW-02]
– three real-world domains
– 30 -- 300 nodes / taxonomy
– high accuracy 66 -- 97% vs. 52 -- 83% of best base learner
– relaxation labeling very fast, finished in several seconds
