
The iBench Integration Metadata Generator

Patricia C. Arocena², Boris Glavic¹, Radu Ciucanu³, Renée J. Miller²

¹ IIT DBGroup   ² University of Toronto   ³ Université Blaise Pascal

Outline

1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work

2 VLDB 2016 - The iBench Integration Metadata Generator

Overview

• Challenges of evaluating integration systems
  – Metadata is central
• Various types of metadata are used by integration tasks
  – Quality is as (or maybe more) important than performance
  – Often requires a "gold standard" solution
• Goal: make empirical evaluations ...
  – ... more robust, repeatable, shareable, and broad
  – ... less painful and time-consuming
• Contributions
  – iBench: a flexible metadata generator for evaluating integration systems
  – Evaluate iBench performance and use it in evaluations


Integration Tasks

Many integration tasks work with metadata:
• Data Exchange
  – Input: Schemas, Constraints, (Source Instance), Mappings
  – Output: Executable Transformations, (Target Instance)
• Schema Mapping Generation
  – Input: Schemas, Constraints, Instance Data, Correspondences
  – Output: Mappings, Transformations
• Schema Matching
  – Input: Schemas, (Instance Data), (Constraints)
  – Output: Correspondences
• Virtual Data Integration
  – Input: Schemas, Instance Data, Mappings, Queries
  – Output: Rewritten Queries, Certain Query Results
• ... and many others (e.g., Mapping Operators, Schema Evolution, Data Cleaning, ...)


Integration Scenario
• Metadata: Schemas, Constraints, Correspondences, Mappings
• Data: Source Instance, Target Instance

Empirical Evaluation

• What are important input parameters?
  – Ability to create input where each parameter is varied systematically as an independent variable
  – What are realistic values for these parameters?
• How is output performance measured?
  – Measures of scalability (throughput, response time)
  – Measures of quality
• Given input, what is the best output (gold standard)?
  – Many quality metrics measure "deviation from gold standard"


Running Example

• Alberto has designed a new mapping inversion algorithm called mapping restoration
  – Not all mappings have a perfect restoration
    • A partial restoration leads to a mapping that is close to an inverse but not equivalent
  – He has implemented his algorithm
  – Alberto would like to publish (deadline in 3 weeks)
• Alberto would like to do a comprehensive evaluation to test ...
  – Hypothesis 1: many mappings do have a restoration
  – Hypothesis 2: the quality of the partial inverses is good
  – Hypothesis 3: although the algorithm has exponential complexity, he believes that some heuristic optimizations will make it feasible to compute inverses for mappings over large schemas
• Alberto's Dilemma
  – How to test his hypotheses (in a short time)?


Given (S, T, ΣST), find (T, S, ΣTS)

State-of-the-art

• How are integration systems typically evaluated?
• Small real-world integration scenarios
  – Advantages:
    • Realistic ;-)
  – Disadvantages:
    • Not possible to scale (schema size, data size, ...)
    • Not possible to vary parameters (e.g., mapping complexity)
• Ad-hoc synthetic scenarios
  – Advantages:
    • Can influence scale and characteristics
  – Disadvantages:
    • Often not very realistic metadata
    • Diversity requires huge effort


Requirements

• Alberto needs a tool to generate realistic metadata!
  – Scalability
    • Generate large integration scenarios efficiently
    • Requires low user effort
  – Control over the characteristics of the generated metadata
    • Size
    • Structure
    • ...
  – Generate inputs as well as gold standard outputs
  – Generate data that matches the generated metadata
    • Data should obey integrity constraints
  – Promote reproducibility
    • Enable other researchers to regenerate metadata to repeat an experiment
    • Support researchers in understanding the characteristics of generated metadata
    • Enable researchers to reuse generated integration scenarios

Related Work

• STBenchmark [Alexe et al. PVLDB '08]
  – Pioneered the primitive approach
    • Generate metadata by combining typical micro-scenarios
  – No support for controlling reuse among primitives
  – No support for flexible value invention, scaling real-world scenarios, ...
• Data generators
  – PDGF, Myriad


Outline

1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work


Integration Scenarios

• Integration Scenario
  – M = (S, T, ΣS, ΣT, Σ, I, J, τ)
  – Source schema S with instance I
  – Target schema T with instance J
  – Source constraints ΣS and target constraints ΣT
    • Instance I fulfills ΣS and instance J fulfills ΣT
  – Schema mapping Σ
    • Instances (I, J) fulfill Σ
  – Transformations τ
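As a data structure, the tuple above can be sketched like this (a minimal Python sketch; the field names are illustrative and not iBench's internal representation):

```python
from dataclasses import dataclass

# Minimal sketch of an integration scenario M = (S, T, ΣS, ΣT, Σ, I, J, τ).
# Schemas map relation names to attribute lists; constraints, mappings, and
# transformations are kept as opaque strings for illustration.
@dataclass
class IntegrationScenario:
    source_schema: dict        # S
    target_schema: dict        # T
    source_constraints: list   # ΣS (instance I must fulfill these)
    target_constraints: list   # ΣT (instance J must fulfill these)
    mappings: list             # Σ ((I, J) must fulfill these)
    source_instance: dict      # I: relation name -> list of tuples
    target_instance: dict      # J
    transformations: list      # τ: executable transformations

m = IntegrationScenario(
    source_schema={"Cust": ["Name", "Addr"]},
    target_schema={"Customer": ["Name", "Addr", "Loyalty"]},
    source_constraints=[], target_constraints=[],
    mappings=["Cust(n, a) -> ∃l Customer(n, a, l)"],
    source_instance={"Cust": [("Alice", "Main St")]},
    target_instance={}, transformations=[],
)
```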


(figure: an integration scenario, with source instance I over schema S mapped by Σ and transformations τ to target instance J over schema T; ΣS and ΣT are the source and target constraints)

MD Task

• Metadata generation task (MD task)
  – Scenario parameters П
    • Number of source relations
    • Number of attributes of target relations
  – MD task Γ
    • Specifies min/max constraints over some π ∈ П
    • E.g., source relations should have between 3 and 10 attributes
• Solutions
  – An integration scenario M is a solution for an MD task Γ if M fulfills the constraints of Γ
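For illustration, the "fulfills the constraints" condition can be sketched as a simple min/max check (a hypothetical helper, not iBench code):

```python
# An MD task Γ as min/max bounds over scenario parameters π, plus a check
# whether the parameters of a generated scenario satisfy all bounds.
def is_solution(task, scenario_params):
    """task: {param: (lo, hi)}; scenario_params: {param: value}."""
    return all(lo <= scenario_params.get(param, 0) <= hi
               for param, (lo, hi) in task.items())

task = {"num_source_relations": (2, 4),        # e.g., 2-4 source relations
        "attrs_per_source_relation": (3, 10)}  # with 3-10 attributes each

ok = is_solution(task, {"num_source_relations": 3,
                        "attrs_per_source_relation": 5})   # within all bounds
bad = is_solution(task, {"num_source_relations": 7,
                         "attrs_per_source_relation": 5})  # too many relations
```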


Example – MD Task

• MD Task

  Parameter π              Source   Target
  Number of Relations      2-4      1-3
  Number of Attributes     2-10     2-10
  Number of Join Attrs     1-2      1-2
  Number of Existentials            0-3

• Example solution (mappings)
  • S1(A,B,C), S2(C,D,E) → T(A,E)
  • S3(A,B,C,D), S4(E,A,B) → ∃X,Y,Z: T1(A,X,X), T2(A,Y,C), T3(C,B,Y,Z)

• Limited usefulness in practice
  – Can we generate "realistic" scenarios?
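Spelled out in standard source-to-target tgd notation (universal quantifiers on the source side left implicit), the two example mappings above read:

```latex
S_1(A,B,C) \wedge S_2(C,D,E) \rightarrow T(A,E)

S_3(A,B,C,D) \wedge S_4(E,A,B) \rightarrow
  \exists X \, \exists Y \, \exists Z \;
    \bigl( T_1(A,X,X) \wedge T_2(A,Y,C) \wedge T_3(C,B,Y,Z) \bigr)
```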

Mapping Primitives

• Mapping Primitives
  – Template micro-scenarios that encode typical schema mapping/evolution operations
    • E.g., vertically partitioning a source relation
  – Used as building blocks for generating scenarios
• Comprehensive Set of Primitives
  – Schema Evolution Primitives
    • Mapping Adaptation [Yu, Popa VLDB '05]
    • Mapping Composition [Bernstein et al. VLDBJ '08]
  – Schema Mapping Primitives
    • STBenchmark [Alexe, Tan, Velegrakis PVLDB '08]
      – First to propose parameterized primitives


Scenario Primitives

(figure: catalogue of iBench scenario primitives)

Primitive Parameters

• What are primitive parameters?
  – For a given set of primitive types
    • θp = number of instances of primitive p that have to be included in a solution for an MD task
  – Primitives are templates
    • π parameters affect instances of primitives
      – e.g., number of attributes per source table
    • λ parameters determine primitive-specific settings
      – Specified as min/max constraints
      – e.g., number of partitions in vertical partitioning
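To make the λ example concrete, a vertical-partitioning primitive with k partitions can be sketched as follows (a hypothetical helper, not iBench's implementation):

```python
# Split a relation's non-key attributes into k fragments (λ = k); each
# fragment keeps the key so the source relation can be rejoined.
def vertical_partition(relation, key, attrs, k):
    fragments = []
    for i in range(k):
        chunk = attrs[i::k]  # round-robin assignment of attributes
        fragments.append((f"{relation}_{i}", [key] + chunk))
    return fragments

frags = vertical_partition("Emp", "Id", ["Name", "Addr", "Salary", "Dept"], 2)
# Two fragments, each carrying the key "Id".
```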


MD task with Primitives

• Solutions
  – Contain the requested number θp of instances of primitive p
    • λ parameter constraints are obeyed
  – π parameter constraints are fulfilled


Sharing among Primitives

• Primitives occur in the real world
• However:
  – Assuming independence of schema elements across primitive instances is unrealistic
• Solution
  – Scenario parameters that control sharing among primitives


Analysis

• Complexity
  – Finding a solution for an MD task is NP-hard
• Unsatisfiability
  – There are MD tasks that have no solutions
• Implications
  – Any PTIME algorithm has to be either unsound (may not fulfill the MD task) or incomplete (may not always return a solution)
  – Users may accidentally specify unsatisfiable tasks
    • How to deal with that?


Outline

1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work


iBench Overview

• MDGen Algorithm
  – PTIME greedy algorithm solving the metadata generation problem
    • Relaxes scenario parameters if necessary to avoid backtracking and to guarantee that a solution can be generated
• Input configuration
  – A configuration file (1-2 pages of text)
  – Scenario and primitive parameters
  – What types of metadata to produce?
  – Generate data?
• Output
  – An XML file storing the metadata
  – Data as CSV files (optional)
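Such a configuration file might look roughly like the sketch below. The key names are hypothetical, chosen only to illustrate the kinds of settings involved; they are not iBench's actual configuration syntax.

```
# Hypothetical sketch of a metadata-generation configuration
NumCopyPrimitives = 1                 # θ: primitive instance counts
NumVerticalPartitioningPrimitives = 1
NumPartitions = 2-5                   # λ: primitive-specific min/max
SourceRelationAttributes = 3-10       # π: scenario parameter bounds
SourceSharing = 25%                   # reuse across primitive instances
GenerateData = true                   # also emit CSV instances
```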


MDGen Algorithm

• Instantiate primitives to fulfill primitive parameters
  – If necessary, relax π constraints to fulfill λ and θ constraints
    • E.g., the user requests 3 target relations (π) and one vertical partitioning primitive instance (θ) with 5 fragments (λ)
      – We would ensure that 5 fragments are created
• Generate random mappings to fulfill π constraints if necessary
  – E.g., not enough source relations have been created
• Customize the generated scenario
  – Adapt value invention
  – Generate random constraints
• Generate data
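The relaxation behaviour in the example above can be sketched as follows (a simplified illustration of the algorithm's greedy structure, not iBench's actual code):

```python
# Simplified sketch of MDGen's relaxation: θ and λ constraints take
# priority; π constraints are relaxed or patched with random mappings.
def mdgen(num_vp_instances, fragments_per_vp, requested_target_relations):
    target_relations = []
    # 1) Instantiate primitives: each vertical partitioning (θ) creates
    #    the requested number of fragments (λ), even if that exceeds π.
    for i in range(num_vp_instances):
        target_relations += [f"VP{i}_Frag{j}" for j in range(fragments_per_vp)]
    # 2) If the primitives produced too few relations, top up with
    #    randomly generated mappings to satisfy π.
    while len(target_relations) < requested_target_relations:
        target_relations.append(f"Random{len(target_relations)}")
    return target_relations

# The slide's example: π asks for 3 target relations, but one vertical
# partitioning with 5 fragments forces 5, so π is relaxed upward.
rels = mdgen(1, 5, 3)
```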


The iBench Metadata Generator

• Primitive Generation Example
  – I want 1 copy and 1 vertical partitioning

(figure: Source schema Cust(Name, Addr), Emp(Name, Company); Target schema Customer(Name, Addr, Loyalty), Person(Id, Name, WorksAt), EmpRec(Firm, Id))

Outline

1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work


Evaluation Goals

• Scalability of iBench
  – Can we generate large and complex integration scenarios in reasonable time?
• Effectiveness
  – Can we enable new types of evaluations that would be hard or impossible without iBench?


iBench Performance


(chart legend)
• RED: no sharing/constraints
• GREEN: 25% constraints
• BLUE: 25% source sharing
• TURQUOISE: 25% target sharing

• No sharing: 1M attributes in seconds
• Sharing: 1M attributes in 2-3 minutes
• Non-linear trend and high standard deviation (12-14 sec) due to best-effort sharing

Case Study - Data Exchange


• Compare two data exchange systems
  • Mapping generation time
  • Data exchange performance
  • Target instance quality (redundancy)
• Clio: Toronto/IBM Almaden collaboration
  • Creates universal solutions
  • [Popa, Velegrakis, Miller, Hernandez, Fagin VLDB '02]
• MapMerge: UC Santa Cruz/IBM Almaden collaboration
  • Mapping correlation to remove redundancy in Clio mappings
  • [Alexe, Hernandez, Popa, Tan VLDBJ '12]
• Original MapMerge evaluation
  • Two real-life scenarios, 14 mappings
  • One synthetic: one source relation, up to 256 mappings
• Our goal:
  • Test the original experimental observations over more diverse and complex metadata scenarios

Case Study – Results


(figure panels: Result Quality, Mapping Generation, Data Exchange)

New insights
• Larger improvement for ontology scenarios (Ont) that exhibit more incompleteness
• Comparable mapping execution time (data exchange), with slight benefits for MapMerge over larger scenarios

Success stories

• Value Invention in Data Exchange
  – [Arocena, Glavic, Miller, SIGMOD 2013]
  – Generate a large number of mappings (12.5 M)
• Vagabond
  – [us]
  – Use iBench to scale real-world scenarios and evaluate the effectiveness and performance of our approach
• Mapping Discovery
  – [Miller, Getoor, Memory]
  – Use iBench to generate schemas and data as input for learning mappings between schemas using statistical techniques, and compare against iBench ground-truth mappings
• Functional Dependencies Unleashed for Scalable Data Exchange
  – [Bonifati, Ileana, Linardi; arXiv preprint arXiv:1602.00563, 2016]
  – Used iBench to compare a new chase-based data exchange algorithm to the SQL-based exchange algorithm of ++Spicy
• Approximation Algorithms for Schema-Mapping Discovery from Data
  – [ten Cate, Kolaitis, Qian, Tan AMW 2015]
  – Approximate the Gottlob-Senellart notion
  – Kun Qian currently using iBench to evaluate the effectiveness of the approximation
• Comparative Evaluation of Chase Engines
  – [Università della Basilicata, University of Oxford]
  – Using iBench to generate schemas, constraints


Outline

1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work


Conclusions

• Goals
  – Improve the empirical evaluation of integration systems
  – Help researchers evaluate their systems
• iBench
  – Comprehensive metadata generator
  – Produces inputs and outputs (gold standards) for a variety of integration tasks
  – Control over metadata characteristics


Future Work

• Give more control over data generation
• Orchestration of mappings
  – Serial and parallel
• Improve control over naming of schema elements
  – e.g., to better support schema matching

• Incorporate implementations of quality measures


Questions?

• Homepages:
  – Boris: http://www.cs.iit.edu/~glavic/
  – Radu: http://ws.isima.fr/~ciucanu/
  – Patricia: http://dblab.cs.toronto.edu/~prg/
  – Renée: http://dblab.cs.toronto.edu/~miller/

• iBench Project Webpage: http://dblab.cs.toronto.edu/project/iBench/
  – Code: https://bitbucket.org/ibencher/ibench/
  – Public Scenario Repo: https://bitbucket.org/ibencher/ibenchconfigurationsandscenarios
