Post on 22-Dec-2015
Funded by NSF 1 BYU Data Extraction Group
Brigham Young University
Li Xu
Source Discovery and Schema Mapping
for Data Integration
Data Integration
Example query: Find houses with four bedrooms priced under $200,000
[Figure: a Mediator with the global schema sits above wrappers for three sources (homes.com, realestate.com, homeseekers.com), each with its own source schema (source schema 1, 2, 3)]
Problems
• How to Recognize Applicable Information Sources for an Application?
• How to Specify Mapping between the Source Schemas and the Global Schema?
• How to Reformulate User Queries?
• How to Merge Data from Heterogeneous Sources?
• …
Applicable HTML Documents
• Multiple-Record Documents
• Single-Record Documents
• HTML Forms
How to distinguish an applicable HTML document?
Multiple-Record Documents
Document 1: Car Ads
Document 2: Items for Sale or Rent
Recognition Heuristics
• h1+: Densities
• h2: Expected Values
• h3: Grouping
How to measure the applicability of an HTML document for an application?
h1+: Densities
Document 1: Car Ads
Document 2: Items for Sale or Rent
h2: Expected Values
Document 1: Car Ads
Year: 3, Make: 2, Model: 3, Mileage: 1, Price: 1, Feature: 15, PhoneNr: 3
Document 2: Items for Sale or Rent
Year: 1, Make: 0, Model: 0, Mileage: 1, Price: 0, Feature: 0, PhoneNr: 4
Expected values: <Year:0.98, Make:0.93, Model:0.91, Mileage:0.45, Price:0.80, Feature:2.10, PhoneNr:1.15>
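The slide gives observed counts for both documents and an expected-value vector, but not the comparison function. A minimal sketch, assuming cosine similarity between the two vectors (an assumption, not something the slide states), shows how the car-ads page lines up with the expectation far better than the mixed classifieds page:

```python
import math

# Order of the components: Year, Make, Model, Mileage, Price, Feature, PhoneNr.
# All three vectors are copied from the slide.
EXPECTED = [0.98, 0.93, 0.91, 0.45, 0.80, 2.10, 1.15]
CAR_ADS = [3, 2, 3, 1, 1, 15, 3]
ITEMS_FOR_SALE = [1, 0, 0, 1, 0, 0, 4]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


h2_car_ads = cosine(CAR_ADS, EXPECTED)
h2_items = cosine(ITEMS_FOR_SALE, EXPECTED)
print(f"Car Ads: {h2_car_ads:.2f}, Items for Sale or Rent: {h2_items:.2f}")
```

Cosine is scale-invariant, so the sketch does not need to know how many records each document contains.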
h3: Grouping (of 1-Max Object Sets)
Document 1: Car Ads
Year Make Model Price Year Model Year Make Model Mileage …
Document 2: Items for Sale or Rent
Year Mileage … Mileage Year Price Price …
[Figure: braces mark consecutive groups in each sequence]
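The slide shows the label sequences but not how a grouping score is computed from them. One plausible reading, sketched here under stated assumptions, is to cut the sequence into consecutive groups of a fixed size and measure how many distinct object sets each group contains. The group size of 4 and the scoring formula are assumptions, and only contiguous fragments visible on the slide are used as input:

```python
def grouping_score(sequence, group_size):
    """Average fraction of distinct labels per complete group of `group_size`.

    Trailing labels that do not fill a whole group are ignored.
    """
    groups = [
        sequence[i:i + group_size]
        for i in range(0, len(sequence) - group_size + 1, group_size)
    ]
    if not groups:
        return 0.0
    distinct = sum(len(set(group)) for group in groups)
    return distinct / (len(groups) * group_size)


# Visible fragment of Document 1 (the trailing "…" is omitted).
car_ads = ["Year", "Make", "Model", "Price", "Year", "Model",
           "Year", "Make", "Model", "Mileage"]
# Visible contiguous fragment of Document 2 (text around the "…" is omitted).
items_for_sale = ["Mileage", "Year", "Price", "Price"]

GROUP_SIZE = 4  # assumed; the slide does not state the group size
print(grouping_score(car_ads, GROUP_SIZE))
print(grouping_score(items_for_sale, GROUP_SIZE))
```

A score near 1 means each group looks like one well-formed record; repeated labels inside a group pull the score down.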
Classification Problem
• Subtasks
– Multiple Records
– Singleton Record
– Application Form
• Learning Algorithm: Decision Tree C4.5
– (h1+0, h1+1, …, h2, h3, Positive)
– (h1+0, h1+1, …, h2, h3, Negative)
How to construct recognition rules for an application?
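C4.5 picks the test at each tree node by gain ratio. The sketch below computes the gain ratio of a threshold test on a single heuristic value and selects the best threshold. It is not the full C4.5 algorithm, and the h2 values and labels are hypothetical training tuples invented for illustration:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary test `value <= threshold`."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    if not left or not right:
        return 0.0
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part) for part in (left, right))
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


# Hypothetical h2 scores for six training documents and their classes.
h2_values = [0.90, 0.85, 0.80, 0.45, 0.30, 0.20]
labels = ["Positive", "Positive", "Positive",
          "Negative", "Negative", "Negative"]

# Candidate thresholds: every observed value except the largest.
candidates = sorted(set(h2_values))[:-1]
best_threshold = max(candidates, key=lambda t: gain_ratio(h2_values, labels, t))
print(best_threshold, gain_ratio(h2_values, labels, best_threshold))
```

In a real run each training tuple would carry all heuristic values (the h1+ densities, h2, and h3), and the tree would repeat this selection recursively on each branch.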
Experiments: Car Ads and Obituaries
• Training Sets
– Car Ads (Yes | No)
• 143 | 363
• 614 | 636
• 50 | 69
– Obituaries (Yes | No)
• 68 | 135
• 50 | 69
• 62 | 135
• Test Sets
– Car Ads (40 | 40): Precision 95%, Recall 98%, F-measure 96%
– Obituaries (40 | 40): Precision 95%, Recall 95%, F-measure 95%
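The reported F-measures can be checked from the precision and recall figures. The sketch assumes the balanced F-measure (the harmonic mean of precision and recall); the slide does not name the variant, but this one reproduces both reported values after rounding:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


car_ads_f = f_measure(0.95, 0.98)
obituaries_f = f_measure(0.95, 0.95)
print(f"Car Ads: {car_ads_f:.0%}, Obituaries: {obituaries_f:.0%}")
```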
Incorrect Positive Response
Motorcycle: Year, Make, Price, Mileage, PhoneNr, Feature
Historical Figure: Deceased Name, Death Date, Birth Date, Age, Relationship, Relative Name
Schema Mapping
[Figure: Source and Target schemas, each rooted at Car, with matches drawn between them. Node labels: Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make, Make & Model, Color, Body Type]
Schema Mapping for Populated Schemas
• Central Idea: Exploit All Data & Metadata
• Matching Possibilities (Facets)
– Attribute Names
– Data-Value Characteristics
– Expected Data Values
– Data-Dictionary Information
– Structural Properties
The Approach
• Input:
– Two Graphs, S and T
– Data Instances for S and T
– Lightweight Domain Ontology
• Output:
– A Source-to-Target Mapping between S and T
• Should enable translating data instances from S to T.
– Direct and Many Indirect Matches
• (t, s)
• (t, s’ <= )
• Framework
– Individual Facet Matching
– Combination of Individual Matchers
Attribute Names
• Target and Source Attributes
– T : A
– S : B
• WordNet
• C4.5 Decision Tree: feature selection, trained on schemas in DB books
– f0: same word
– f1: synonym
– f2: sum of distances to a common hypernym root
– f3: number of different common hypernym roots
– f4: sum of the number of senses of A and B
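To make the name features concrete, the sketch below computes f0 and an f2-like distance over a tiny hand-written taxonomy. The taxonomy is hypothetical and stands in for WordNet's hypernym hierarchy, and the distance is measured to the nearest common ancestor, which is one possible reading of "a common hypernym root":

```python
# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet.
HYPERNYM = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "phone": "device",
    "device": "artifact",
    "artifact": "entity",
}


def path_to_root(word):
    """The word followed by its chain of hypernyms, up to the root."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def same_word(a, b):
    """f0: the two attribute names are the same word (case-insensitive)."""
    return a.lower() == b.lower()


def hypernym_distance_sum(a, b):
    """f2-like: sum of the distances from a and b to their nearest common
    ancestor, or None when the toy taxonomy connects them nowhere."""
    path_a = path_to_root(a.lower())
    path_b = path_to_root(b.lower())
    depth_in_b = {node: depth for depth, node in enumerate(path_b)}
    for depth_a, node in enumerate(path_a):
        if node in depth_in_b:
            return depth_a + depth_in_b[node]
    return None


print(same_word("Car", "car"))
print(hypernym_distance_sum("car", "truck"))
print(hypernym_distance_sum("car", "phone"))
```

Feature values of this kind, computed for pairs of attribute names, would form the tuples on which the decision tree is trained.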
WordNet Rule
[Figure: rule over the following features]
– The number of different common hypernym roots of A and B
– The sum of distances of A and B to a common hypernym
– The sum of the number of senses of A and B
Data-Value Characteristics
• C4.5 Decision Tree
• Features
– Numeric data (Mean, variation, standard deviation, …)
– Alphanumeric data (String length, numeric ratio, space ratio)
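The listed features are simple column statistics. A minimal sketch of how they might be computed follows; reading "variation" as population variance and averaging the ratios over all characters in the column are assumptions, since the slide gives only the feature names:

```python
import statistics


def numeric_features(values):
    """Summary statistics for a column of numeric values."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),  # assumed meaning of "variation"
        "std_dev": statistics.pstdev(values),
    }


def alphanumeric_features(values):
    """Character-level statistics for a column of string values."""
    total_chars = sum(len(v) for v in values)
    if total_chars == 0:
        return {"mean_length": 0.0, "numeric_ratio": 0.0, "space_ratio": 0.0}
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)
    return {
        "mean_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars,
        "space_ratio": spaces / total_chars,
    }


print(numeric_features([2, 4, 4, 4, 5, 5, 7, 9]))
print(alphanumeric_features(["Ford F150", "A4"]))
```

Two columns whose feature vectors are close (for example, two price columns) become candidates for a match even when their attribute names differ.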
Expected Data Values
• Concepts and Relationships
• Data Recognizers
– CarMake
• “ford”
• “honda”
• …
– CarModel
• “accord”
• “mustang”
• “taurus”
• …
[Figure: the Target attribute Make & Model (Ford Mustang, Ford Taurus, Ford F150, …) is recognized as CarMake . CarModel; the Source attributes Brand (Acura, Audi, BMW, …) and Model (Legend, Mustang, A4, …) are recognized as CarMake and CarModel]
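The recognizer idea can be sketched with small lexicons compiled into regular expressions. The lexicons below combine the entries quoted on the slide with the values visible in its figure. The 80% firing threshold, the function names, and the reading of the figure (a target Make & Model column matched against source Brand and Model columns) are assumptions made for illustration:

```python
import re

# Lexicons: entries quoted on the slide plus the values shown in its figure.
CAR_MAKES = ["ford", "honda", "acura", "audi", "bmw"]
CAR_MODELS = ["accord", "mustang", "taurus", "f150", "legend", "a4"]


def recognizer(terms):
    """Compile a case-insensitive whole-word pattern for a list of terms."""
    body = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


RECOGNIZERS = {
    "CarMake": recognizer(CAR_MAKES),
    "CarModel": recognizer(CAR_MODELS),
}


def concepts_in(values, min_fraction=0.8):
    """Concepts whose recognizer fires on at least `min_fraction` of values."""
    found = set()
    for concept, pattern in RECOGNIZERS.items():
        hits = sum(1 for value in values if pattern.search(value))
        if values and hits / len(values) >= min_fraction:
            found.add(concept)
    return found


def indirect_match(target_values, source_columns):
    """Source attributes that together cover the target's concepts."""
    needed = concepts_in(target_values)
    parts = []
    covered = set()
    for name, values in source_columns.items():
        concepts = concepts_in(values)
        if concepts and concepts <= needed:
            parts.append(name)
            covered |= concepts
    return parts if needed and covered == needed else []


target = {"Make & Model": ["Ford Mustang", "Ford Taurus", "Ford F150"]}
source = {"Brand": ["Acura", "Audi", "BMW"],
          "Model": ["Legend", "Mustang", "A4"]}

print(concepts_in(target["Make & Model"]))
print(indirect_match(target["Make & Model"], source))
```

Here the target column triggers both recognizers while each source column triggers one, which suggests an indirect match: Make & Model corresponds to the combination of Brand and Model.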
Structure Matching
[Figure: Target and Source schema graphs. Node labels: House, Agent, Golf course, Water front, Name, Fax, Address, Street, City, State, MLS, Bedrooms; Basic_features, beds, SQFT, MLS, agent, location_description, name, fax, phone, location, Address]
{House, MLS} vs. {MLS}
[Figure: Target and Source subgraphs. Node labels: House, Golf course, Water front, Address, Street, City, State, MLS, Bedrooms; Basic_features, beds, SQFT, MLS, location_description, location. Primed nodes added in later builds of the slide: House’, Golf course’, Water front’, Address1’, Street1’, City1’, State1’]
{Agent} vs. {agent}
[Figure: Target and Source subgraphs. Node labels: Agent, Name, Fax, Address, Street, City, State; agent, name, fax, phone, address. Primed nodes added in the second build of the slide: Address2’, Street2’, City2’, State2’]
Inter-Relationship Set
[Figure: Target and Source graphs. Node labels: House, Agent, Golf course, Water front, Name, Fax, Address, Street, City, State, MLS, Bedrooms; MLS, agent, House’]
Example: Source-to-Target Mapping
[Figure: mapped source graph. Node labels: House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’]
Target-based Integration and Query System (TIQS)
• Definition: I = (T, {Si}, {Mi})
• Phases
– Design (Source-to-Target Mappings {Mi})
– Query Processing (Rule Unfolding)
Query Reformulation
• Query
– House-Bedrooms(x, 4) :- House-Bedrooms(x, 4), House-Golf_course(x, “Yes”), House-Water_front(x, “Yes”)
[Figure: mapped source graph. Node labels: House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’]
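Rule unfolding replaces each target predicate in a query body with the source-side body of its mapping rule. The sketch below applies that idea to the query on this slide. The mapping rules and the source predicate names are hypothetical, loosely modeled on the node labels in the figure; the slides do not show the actual mapping rules:

```python
from itertools import count

# Hypothetical mapping rules: target predicate -> (head variables, source body).
# Rule bodies contain only variables, never constants.
MAPPINGS = {
    "House-Bedrooms": (("h", "n"), [("House'-beds", ("h", "n"))]),
    "House-Golf_course": (("h", "v"), [("House'-Golf_course'", ("h", "v"))]),
    "House-Water_front": (("h", "v"), [("House'-Water_front'", ("h", "v"))]),
}


def unfold(body, mappings):
    """Rewrite a query body over target predicates into source predicates."""
    fresh = count(1)
    result = []
    for predicate, args in body:
        if predicate not in mappings:
            # No mapping rule: keep the atom unchanged.
            result.append((predicate, args))
            continue
        head_vars, rule_body = mappings[predicate]
        binding = dict(zip(head_vars, args))
        suffix = next(fresh)
        for src_predicate, src_args in rule_body:
            # Head variables take the query's arguments; any other variable
            # in the rule body is renamed so that it stays unique.
            new_args = tuple(binding.get(arg, f"{arg}_{suffix}")
                             for arg in src_args)
            result.append((src_predicate, new_args))
    return result


# Body of the query from the slide.
query_body = [
    ("House-Bedrooms", ("x", 4)),
    ("House-Golf_course", ("x", "Yes")),
    ("House-Water_front", ("x", "Yes")),
]

unfolded = unfold(query_body, MAPPINGS)
for atom in unfolded:
    print(atom)
```

Because unfolding is plain substitution, adding a source only means adding its mapping rules; the query itself is written once against the target schema.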
TIQS (Cont.)
• User Queries
– Logic Rules
– Maximal and Sound Query Answers
• Advantages
– Rule Unfolding
– Scalability
Experimental Results
Application (Number of Schemes) | Precision (%) | Recall (%) | F (%) | Number Matches | Number Correct | Number Incorrect
Faculty Member (5) | 100 | 100 | 100 | 540 | 540 | 0
Course Schedule (5) | 99 | 93 | 96 | 490 | 454 | 6
Real Estate (5) | 90 | 94 | 92 | 876 | 820 | 92
Data borrowed from Univ. of Washington [DDH, SIGMOD01]
Indirect Matches: precision 87%, recall 94%, F-measure 90%
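The percentages in the results table can be reproduced from its counts. The sketch assumes that "Number Matches" is the number of matches that should have been found, so that recall is Number Correct divided by Number Matches and precision is Number Correct divided by Correct plus Incorrect. That reading is an assumption, but it agrees with every row after rounding:

```python
# (Number Matches, Number Correct, Number Incorrect), copied from the table.
ROWS = {
    "Faculty Member": (540, 540, 0),
    "Course Schedule": (490, 454, 6),
    "Real Estate": (876, 820, 92),
}


def precision_recall(expected, correct, incorrect):
    """Precision and recall from match counts (assumed column meanings)."""
    precision = correct / (correct + incorrect)
    recall = correct / expected
    return precision, recall


for application, counts in ROWS.items():
    precision, recall = precision_recall(*counts)
    print(f"{application}: precision {precision:.0%}, recall {recall:.0%}")
```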
Rough Comparison with U of W Results
• Course Schedule – Accuracy: ~71%
• Real Estate (2 tests) – Accuracy: ~75%
• Faculty Member – Accuracy: ~92%
Conclusion
• A Robust and Flexible Approach to Check Applicability of HTML Documents
• A Composite Approach to Automate Schema Mapping
– Direct Matches
– Indirect Matches
• An Approach that Combines Advantages of Basic Approaches to Data Integration
Future Work
• Test More Applications and Data to Evaluate the Approaches
• Extend Training Classifiers for Applicability Checking
• Further Automate Schema Mapping
• Automate Ontology Mapping on the Semantic Web
• Automate Mapping between XML Documents
• …