Feedback-Based Annotation, Selection and Refinement of Schema Mappings for Dataspaces
Khalid Belhajjame, Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes, and Cornelia Hedeler
1 EDBT/ICDT 2010
Data Integration
[Figure: a scientist asks "What are the available proteins of the Fruit Fly?" against an integration schema, which mappings connect to the sources PedroDB, PepSeeker, Pride and GPMDB.]
Towards Pay-as-you-go Data Integration
Data Integration
– Setting up a data integration system requires significant upfront effort.
– The specification of schema mappings has proved to be time and resource consuming: it requires deep knowledge of the sources to be integrated as well as of the user's requirements.
Dataspaces: Pay-as-you-go Data Integration [Franklin et al. 2005]
– Reduce the upfront cost required to set up a data integration system: provide some services immediately.
– Gradually improve the services provided by the system through interaction with end users in a pay-as-you-go fashion.
M. J. Franklin, A. Y. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.
Pay-as-you-go Data Integration
[Figure: the integration scenario shown earlier (scientist, query, integration schema, mappings, and the sources PedroDB, PepSeeker, Pride and GPMDB), labelled "Bootstrap Dataspaces".]
Objective of the present work: investigate pay-as-you-go annotation, selection, and refinement of schema mappings.
Pay-as-you-go Data Integration
We assume that the integration schema and the source schemas are relational, and that the schema mappings defining the extent of each relation r in the integration schema are global-as-view mappings of the form:
m = ⟨r, qs⟩, where qs is a relational query over the source schemas.
A relation in the integration schema can be associated with multiple candidate mappings: we consider a setting in which multiple matching mechanisms can be used, each of which may give rise to multiple candidate mappings for populating the same relation of the integration schema.
Outline
User Feedback
Annotation of Schema Mappings
Selection of Schema Mappings Based on User Requirements
Refinement of Schema Mappings
User Feedback
Query: What are the available fruit fly proteins?
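Results:
[Table: the tuples returned for the query, each with a Feedback column in which the user marks the tuple as expected (✔) or not expected (✖).]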
User Feedback (cont.)
Let m be a candidate mapping, and UF a set of feedback instances supplied by the user:
tp(m,UF): the tuples that are expected by the user and that are retrieved by the mapping m.
fp(m,UF): the tuples that are not expected by the user and that are retrieved by the mapping m.
fn(m,UF): the tuples that are expected by the user and are not retrieved by the mapping m.
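The three sets above can be sketched with plain set operations. This is a minimal illustration, assuming that a mapping's result and the user's feedback are both represented as sets of tuple identifiers; the `Feedback` class and the function signatures are hypothetical and do not come from the original work.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """User feedback UF (hypothetical representation)."""
    expected: set = field(default_factory=set)    # tuples the user marked as expected
    unexpected: set = field(default_factory=set)  # tuples the user marked as not expected


def tp(results: set, uf: Feedback) -> set:
    """Tuples that are expected by the user and retrieved by the mapping."""
    return results & uf.expected


def fp(results: set, uf: Feedback) -> set:
    """Tuples that are not expected by the user but retrieved by the mapping."""
    return results & uf.unexpected


def fn(results: set, uf: Feedback) -> set:
    """Tuples that are expected by the user but not retrieved by the mapping."""
    return uf.expected - results


# Example: the mapping retrieves P1, P2, P3; the user has commented on P1, P3, P4.
results = {"P1", "P2", "P3"}
uf = Feedback(expected={"P1", "P4"}, unexpected={"P3"})
print(sorted(tp(results, uf)))  # ['P1']
print(sorted(fp(results, uf)))  # ['P3']
print(sorted(fn(results, uf)))  # ['P4']
```

Note that P2 falls into none of the three sets: the user has not yet given feedback on it, which is why the measures derived from these sets are relative to the feedback collected so far.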
Annotating Mappings
Using a simple annotation scheme, a schema mapping can be annotated as:
Correct
Incorrect
The set of candidate schema mappings is likely to be incomplete, so under this scheme we may end up annotating all mappings as incorrect.
Because of this, we use a less stringent mapping annotation scheme.
Annotating Mappings (cont.)
Instead, we adapt the notions of precision and recall used in information retrieval to measure the quality of a mapping.
Precision:
Recall:
F measure:
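The formulas for these three measures did not survive in this copy of the slides. A plausible reading, assuming the standard information-retrieval definitions applied to the sets tp, fp and fn defined earlier, is sketched below. The balanced (harmonic mean) form of the F measure is an assumption; the authors may use a weighted variant.

```latex
\[
\mathit{precision}(m, UF) = \frac{|tp(m, UF)|}{|tp(m, UF)| + |fp(m, UF)|},
\qquad
\mathit{recall}(m, UF) = \frac{|tp(m, UF)|}{|tp(m, UF)| + |fn(m, UF)|}
\]
\[
F(m, UF) = \frac{2 \cdot \mathit{precision}(m, UF) \cdot \mathit{recall}(m, UF)}
                {\mathit{precision}(m, UF) + \mathit{recall}(m, UF)}
\]
```

For instance, with 3 true positives, 1 false positive and 2 false negatives, these definitions give a precision of 3/4 = 0.75, a recall of 3/5 = 0.6, and an F measure of 0.9/1.35 ≈ 0.67.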
Mapping Annotation: Validation
Questions:
– How much user feedback is required to approximate the real precision and recall, i.e., those based on complete knowledge of the expected results?
– Does the pay-as-you-go philosophy hold?
Mapping Annotation: Validation (cont.)
Experiment:
Data:
– Two datasets: the Mondial geographical database and the Amalgam data integration benchmark.
– Candidate schema mappings: created using IBM InfoSphere Data Architect.
Process: we applied the following two-step process for multiple iterations.
1. Generate a sample of feedback instances.
2. Compute the relative precision and recall of the candidate mappings given the cumulative feedback.
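The two-step process can be simulated roughly as follows. This is a sketch under stated assumptions, not the authors' experimental code: feedback is drawn at random from a known ground truth, tuples are plain integers, and the names `simulate` and `relative_precision` are hypothetical. It tracks only the error in precision; recall could be handled the same way.

```python
import random


def relative_precision(results, expected, unexpected):
    """Precision relative to the feedback collected so far (None if undefined)."""
    tp = len(results & expected)
    fp = len(results & unexpected)
    return tp / (tp + fp) if tp + fp else None


def simulate(results, ground_truth, batch_size, seed=0):
    """Return (feedback size, absolute error in precision) after each iteration."""
    rng = random.Random(seed)
    # Tuples the simulated user can comment on: retrieved or expected ones.
    pool = sorted(results | ground_truth)
    rng.shuffle(pool)
    true_precision = len(results & ground_truth) / len(results)
    expected, unexpected = set(), set()
    errors = []
    for start in range(0, len(pool), batch_size):
        # Step 1: generate a sample of feedback instances (added cumulatively).
        for t in pool[start:start + batch_size]:
            (expected if t in ground_truth else unexpected).add(t)
        # Step 2: compute the relative precision given the cumulative feedback.
        p = relative_precision(results, expected, unexpected)
        if p is not None:
            errors.append((len(expected) + len(unexpected), abs(p - true_precision)))
    return errors


# Example: the mapping retrieves tuples 0..59, the user expects tuples 20..99.
for size, error in simulate(set(range(60)), set(range(20, 100)), batch_size=10):
    print(size, round(error, 3))
```

Once feedback covers every tuple, the relative precision coincides with the real precision, so the last reported error is zero; the interesting question is how quickly the error shrinks before that point.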
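Mapping Annotation: Error in Precision
[Figure: error in the precision estimated from feedback; axis label: Error.]
Mapping Annotation: Error in Recall
[Figure: error in the recall estimated from feedback; axis label: Error.]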
Mapping Selection
Mapping selection should be tailored to meet user requirements.
We use a selection method that aims to maximise recall subject to the precision of the results being higher than a given precision threshold.
We cast this selection problem as a search problem that aims to maximise the following utility function:
D. A. Menascé and V. Dubey. Utility-based QoS brokering in service oriented architectures. In ICWS, pages 422–430. IEEE CS, 2007.
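The selection goal (maximise recall while keeping precision at or above a threshold) can be illustrated with a brute-force search over subsets of candidate mappings. This sketch does not reproduce the authors' utility function, which is not recoverable from this copy of the slides. It assumes that the result of a set of mappings is the union of their individual results, and the name `select_mappings` is hypothetical.

```python
from itertools import combinations


def precision_recall(results, expected, unexpected):
    """Precision and recall relative to the user feedback."""
    tp = len(results & expected)
    fp = len(results & unexpected)
    fn = len(expected - results)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def select_mappings(candidates, expected, unexpected, precision_threshold):
    """Return (mapping names, precision, recall) with the highest recall among
    the subsets whose precision meets the threshold, or None if none does."""
    best = None
    names = sorted(candidates)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            results = set().union(*(candidates[n] for n in subset))
            p, r = precision_recall(results, expected, unexpected)
            if p >= precision_threshold and (best is None or r > best[2]):
                best = (subset, p, r)
    return best


# Example: three candidate mappings, identified by the tuples they retrieve.
candidates = {"m1": {1, 2}, "m2": {3, 5}, "m3": {4, 5, 6}}
expected, unexpected = {1, 2, 3, 4}, {5, 6}
print(select_mappings(candidates, expected, unexpected, 0.75))
# (('m1', 'm2'), 0.75, 0.75)
```

Exhaustive enumeration is exponential in the number of candidate mappings, so it is only suitable for small examples like this one.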
Mapping Selection: Precision
Do we meet the precision requirement, i.e., is the precision threshold set by the user respected?
Mapping Selection: Recall
Do we get some benefit for recall, i.e., does the method we use maximise the recall?
Mapping Refinement
We distinguish two kinds of refinement:
Mapping refinement that seeks to reduce the number of false positives
A candidate mapping is refined by modifying its source query so that the number of false positives it returns is reduced.
Mapping refinement that aims to increase the number of true positives
A candidate mapping is refined by modifying its source query so that the number of true positives it returns is increased.
Mapping Refinement: Example
[Figure: the user wants fruit fly proteins. The integration schema relation Protein (accession, name, gene) is populated from the source schema by the candidate mapping m = <Protein, ProteinEntry>.]
15/04/2009 Khalid 26
Mapping Refinement: The Space of Solutions
The space of solutions is composed of the mappings that can be constructed out of the candidate mappings, specifically by:
i. Joining the source query of a candidate mapping.
ii. Augmenting the source query of a candidate mapping with a selection condition.
iii. Relaxing the selection condition of the source query of a candidate mapping.
iv. Combining the source queries of two or more mappings using union, difference and intersection.
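Operators ii to iv can be sketched over a toy query model. This is an illustration only: `SourceQuery` is a hypothetical in-memory stand-in for a relational source query (a set of rows plus a list of selection predicates), the join operator (i) is omitted for brevity, and the species column in the example is invented for the purpose of the illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

Row = Tuple  # rows are plain tuples


@dataclass(frozen=True)
class SourceQuery:
    """Toy source query: base rows filtered by a conjunction of conditions."""
    rows: FrozenSet[Row]
    conditions: Tuple[Callable[[Row], bool], ...] = ()

    def evaluate(self) -> set:
        return {r for r in self.rows if all(c(r) for c in self.conditions)}


def add_selection(q: SourceQuery, condition: Callable[[Row], bool]) -> SourceQuery:
    """(ii) Augment the query with a selection condition (fewer false positives)."""
    return SourceQuery(q.rows, q.conditions + (condition,))


def relax_selection(q: SourceQuery, index: int) -> SourceQuery:
    """(iii) Relax the query by dropping one condition (more true positives)."""
    return SourceQuery(q.rows, q.conditions[:index] + q.conditions[index + 1:])


def combine(q1: SourceQuery, q2: SourceQuery, op: str) -> SourceQuery:
    """(iv) Combine two queries using union, difference or intersection."""
    a, b = q1.evaluate(), q2.evaluate()
    result = {"union": a | b, "difference": a - b, "intersection": a & b}[op]
    return SourceQuery(frozenset(result))


# Example: restrict a protein query to fruit fly entries.
protein_entry = SourceQuery(frozenset({
    ("P1", "fruit fly"), ("P2", "mouse"), ("P3", "fruit fly"),
}))
fly_only = add_selection(protein_entry, lambda row: row[1] == "fruit fly")
print(sorted(fly_only.evaluate()))  # [('P1', 'fruit fly'), ('P3', 'fruit fly')]
```

Each operator returns a new query rather than modifying its input, which makes it easy for a search procedure to keep both the original and the refined mapping as candidates.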
Exploring the Space of Solutions
The space of mappings that can be obtained by refinement is potentially large.
A search algorithm that explores the whole space of possible mappings may not be able to find a solution in bounded time.
In the present work, we used an evolutionary algorithm to explore the space of mappings that can be obtained by refinement.
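To make the idea of evolutionary search concrete, here is a generic elitist evolutionary loop. It is not the authors' RefineMappings algorithm. It assumes a deliberately simplified search space in which an individual is a tuple of booleans saying which candidate selection conditions are applied to one base query, and it uses a balanced F measure relative to the feedback as fitness; all names are hypothetical.

```python
import random


def f_measure(results, expected, unexpected):
    """Balanced F measure relative to the feedback (assumed fitness function)."""
    tp = len(results & expected)
    fp = len(results & unexpected)
    fn = len(expected - results)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(rows, conditions, genome):
    """Apply the conditions switched on in the genome to the base rows."""
    active = [c for c, on in zip(conditions, genome) if on]
    return {r for r in rows if all(c(r) for c in active)}


def evolve(rows, conditions, expected, unexpected,
           generations=20, population_size=8, seed=0):
    """Return (best genome, fitness) found by mutation plus elitist selection."""
    rng = random.Random(seed)

    def fitness(genome):
        return f_measure(evaluate(rows, conditions, genome), expected, unexpected)

    start = tuple([False] * len(conditions))  # the unrefined mapping
    if not conditions:
        return start, fitness(start)
    population = [start]
    for _ in range(generations):
        offspring = []
        for parent in population:
            child = list(parent)
            i = rng.randrange(len(conditions))
            child[i] = not child[i]  # mutation: add or relax one condition
            offspring.append(tuple(child))
        # Elitist survivor selection: parents compete with their offspring.
        merged = set(population + offspring)
        population = sorted(merged, key=fitness, reverse=True)[:population_size]
    best = population[0]
    return best, fitness(best)


# Example: two candidate conditions over (accession, species) rows.
rows = {("P1", "fruit fly"), ("P2", "mouse"), ("P3", "fruit fly")}
conditions = [lambda r: r[1] == "fruit fly", lambda r: r[0] != "P3"]
print(evolve(rows, conditions, expected={("P1", "fruit fly")},
             unexpected={("P2", "mouse")}))
```

Because parents survive alongside their offspring, the best fitness never decreases from one generation to the next; the number of generations bounds the running time, at the cost of possibly missing the optimum in larger spaces.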
Mapping Refinement Algorithm
Mapping Refinement: Validation
Question: Can mapping refinement improve the quality of the initial candidate mappings, and, if so, at what cost, i.e., how much user feedback is required?
Experiment: To answer the above question, we applied the following process for multiple iterations.
1) Generate a sample of feedback instances.
2) Annotate the set of candidate mappings.
3) Refine the candidate mappings using the RefineMappings algorithm.
Mapping Refinement: Valida2on (cont.)
Conclusions
Pay-as-you-go Annotation of Schema Mappings
We showed how schema mappings can be incrementally annotated based on feedback supplied by end users.
We also showed through an evaluation exercise that the more feedback the user supplies, the better the quality of the mapping annotations computed.
Application: Selection and Refinement of Schema Mappings in Dataspaces
Mapping annotations computed from user feedback are used as input for the selection and refinement of schema mappings.
The evaluation exercises also showed that mapping refinement is more cost effective in the first feedback iterations.