VLDB 2016 - The iBench Integration Metadata Generator
TRANSCRIPT
The iBench Integration Metadata Generator
Patricia C. Arocena², Boris Glavic¹, Radu Ciucanu³, Renée J. Miller²
¹IIT DBGroup  ²University of Toronto  ³Université Blaise Pascal
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
2 VLDB 2016 - The iBench Integration Metadata Generator
Overview
• Challenges of evaluating integration systems
  – Metadata is central
• Various types of metadata used by integration tasks
  – Quality is as (or maybe more) important than performance
    • Often requires a "gold standard" solution
• Goal: make empirical evaluations …
  • … more robust, repeatable, shareable, and broad
  • … less painful and time-consuming
• Contributions
  – iBench – a flexible metadata generator for evaluating integration systems
  – Evaluate iBench performance + use it in evaluations
Integration Tasks
Many integration tasks work with metadata:
• Data Exchange
  – Input: Schemas, Constraints, (Source Instance), Mappings
  – Output: Executable Transformations, (Target Instance)
• Schema Mapping Generation
  – Input: Schemas, Constraints, Instance Data, Correspondences
  – Output: Mappings, Transformations
• Schema Matching
  – Input: Schemas, (Instance Data), (Constraints)
  – Output: Correspondences
• Virtual Data Integration
  – Input: Schemas, Instance Data, Mappings, Queries
  – Output: Rewritten Queries, Certain Query Results
• … and many others (e.g., Mapping Operators, Schema Evolution, Data Cleaning, …)
[Figure: Integration Scenario – Metadata: Schemas, Constraints, Correspondences, Mappings; Data: Source Instance, Target Instance]
Empirical Evaluation
• What are important input parameters?
  – Ability to create input where each parameter is varied systematically as an independent variable
  – What are realistic values for these parameters?
• How is output performance measured?
  – Measures of scalability (throughput, response time)
  – Measures of quality
• Given input, what is the best output (gold standard)?
  – Many quality metrics measure "deviation from gold standard"
Running Example
• Alberto has designed a new mapping inversion algorithm called mapping restoration
  – Not all mappings have a perfect restoration
    • A partial restoration leads to a mapping that is close to an inverse but not equivalent
  – He has implemented his algorithm
  – Alberto would like to publish (deadline in 3 weeks)
• Alberto would like to do a comprehensive evaluation to test …
  – Hypothesis 1: many mappings do have a restoration
  – Hypothesis 2: the quality of the partial inverses is good
  – Hypothesis 3: although the algorithm has exponential complexity, he believes that some heuristic optimizations will make it feasible to compute inverses for mappings over large schemas
• Alberto's Dilemma
  – How to test his hypotheses (in a short time)?
Given (S, T, ΣST), find (T, S, ΣTS)
State-of-the-art
• How are integration systems typically evaluated?
• Small real-world integration scenarios
  – Advantages:
    • Realistic ;-)
  – Disadvantages:
    • Not possible to scale (schema size, data size, …)
    • Not possible to vary parameters (e.g., mapping complexity)
• Ad-hoc synthetic scenarios
  – Advantages:
    • Can influence scale and characteristics
  – Disadvantages:
    • Often not very realistic metadata
    • Diversity requires huge effort
Requirements
• Alberto needs a tool to generate realistic metadata!
  – Scalability
    • Generate large integration scenarios efficiently
    • Requires low user effort
  – Control over the characteristics of the generated metadata
    • Size
    • Structure
    • …
  – Generate inputs as well as gold-standard outputs
  – Generate data that matches the generated metadata
    • Data should obey integrity constraints
  – Promote reproducibility
    • Enable other researchers to regenerate metadata to repeat an experiment
    • Support researchers in understanding the characteristics of generated metadata
    • Enable researchers to reuse generated integration scenarios
Related Work
• STBenchmark [Alexe et al. PVLDB '08]
  – Pioneered the primitive approach
    • Generate metadata by combining typical micro-scenarios
  – No support for controlling reuse among primitives
  – No support for flexible value invention, scaling real-world scenarios, …
• Data generators
  – PDGF, Myriad
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
Integration Scenarios
• Integration Scenario
  – M = (S, T, ΣS, ΣT, Σ, I, J, 𝛕)
  – Source schema S with instance I
  – Target schema T with instance J
  – Source constraints ΣS and target constraints ΣT
    • Instance I fulfills ΣS and instance J fulfills ΣT
  – Schema mapping Σ
    • Instances (I, J) fulfill Σ
  – Transformations 𝛕
[Figure: an integration scenario – source schema S with constraints ΣS and instance I, target schema T with constraints ΣT and instance J, connected by mapping Σ and transformation 𝛕]
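The formal definition above can be mirrored by a small container type. The following Python sketch uses hypothetical names and is not iBench's actual API:

```python
from dataclasses import dataclass, field

# Illustrative container for M = (S, T, ΣS, ΣT, Σ, I, J, 𝛕);
# all names here are hypothetical, not part of iBench.
@dataclass
class IntegrationScenario:
    source_schema: dict        # S: relation name -> list of attribute names
    target_schema: dict        # T
    source_constraints: list   # ΣS, e.g. keys and foreign keys over S
    target_constraints: list   # ΣT
    mappings: list             # Σ, logical assertions relating I and J
    source_instance: dict = field(default_factory=dict)  # I: relation -> tuples
    target_instance: dict = field(default_factory=dict)  # J
    transformations: list = field(default_factory=list)  # 𝛕, executable code

scenario = IntegrationScenario(
    source_schema={"Cust": ["Name", "Addr"]},
    target_schema={"Customer": ["Name", "Addr", "Loyalty"]},
    source_constraints=[],
    target_constraints=[],
    mappings=["Cust(n, a) -> ∃L Customer(n, a, L)"],
)
```

The instances and transformations default to empty, matching the definition's treatment of I, J, and 𝛕 as components that may be produced later.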
MD Task
• Metadata generation task (MD task)
  – Scenario parameters П
    • Number of source relations
    • Number of attributes of target relations
  – MD task Γ
    • Specifies min/max constraints over some 𝛑 ∊ П
    • E.g., source relations should have between 3 and 10 attributes
• Solutions
  – An integration scenario M is a solution for an MD task Γ if M fulfills the constraints of Γ
Example - MD Task
• MD Task
• Example solution (mappings)
  • S1(A,B,C), S2(C,D,E) → T(A,E)
  • S3(A,B,C,D), S4(E,A,B) → ∃X,Y,Z T1(A,X,X), T2(A,Y,C), T3(C,B,Y,Z)
• Limited usefulness in practice
  – Can we generate "realistic" scenarios?
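Written out as source-to-target tuple-generating dependencies (standard s-t tgd notation), the two example mappings are:

```latex
% Mapping 1: join S1 and S2 on C, project A and E into T
\forall a,b,c,d,e\; \big( S_1(a,b,c) \wedge S_2(c,d,e) \rightarrow T(a,e) \big)

% Mapping 2: existential variables X, Y, Z invent target values
\forall a,b,c,d,e\; \big( S_3(a,b,c,d) \wedge S_4(e,a,b) \rightarrow
    \exists X,Y,Z\; T_1(a,X,X) \wedge T_2(a,Y,c) \wedge T_3(c,b,Y,Z) \big)
```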
Parameter 𝛑                 Source   Target
Number of Relations         2-4      1-3
Number of Attributes        2-10     2-10
Number of Join Attributes   1-2      1-2
Number of Existentials      -        0-3
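A solution must stay inside these min/max ranges. A minimal check, with hypothetical names (this is not iBench code), could look like:

```python
# Check min/max scenario-parameter (𝛑) constraints like the MD task above;
# illustrative only, not iBench's actual implementation.
def satisfies(schema, min_rels, max_rels, min_attrs, max_attrs):
    """schema: dict mapping relation name -> list of attribute names."""
    if not (min_rels <= len(schema) <= max_rels):
        return False
    return all(min_attrs <= len(attrs) <= max_attrs
               for attrs in schema.values())

# The example source schema fits the Source column (2-4 relations, 2-10 attributes):
source = {"S1": ["A", "B", "C"], "S2": ["C", "D", "E"]}
print(satisfies(source, 2, 4, 2, 10))  # True
```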
Mapping Primitives
• Mapping Primitives
  – Template micro-scenarios that encode typical schema mapping/evolution operations
    • E.g., vertically partitioning a source relation
  – Used as building blocks for generating scenarios
• Comprehensive Set of Primitives
  – Schema Evolution Primitives
    • Mapping Adaptation [Yu, Popa VLDB 05]
    • Mapping Composition [Bernstein et al. VLDBJ 08]
  – Schema Mapping Primitives
    • STBenchmark [Alexe, Tan, Velegrakis PVLDB 08]
      – First to propose parameterized primitives
Primitive Parameters
• What are primitive parameters?
  – For a given set of primitive types
    • 𝛉p = number of instances of primitive p that have to be included in a solution for an MD task
  – Primitives are templates
    • 𝛑 parameters affect instances of primitives
      – e.g., number of attributes per source table
    • 𝛌 parameters determine primitive-specific settings
      – Specified as min/max constraints
      – e.g., number of partitions in vertical partitioning
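For illustration only, the three parameter families could be encoded as follows (these names are hypothetical, not iBench's configuration syntax):

```python
# 𝛉: how many instances of each primitive type a solution must contain
theta = {"VerticalPartitioning": 1, "Copy": 2}

# 𝛑: scenario-wide min/max constraints over schema characteristics
pi = {"source_attributes_per_relation": (2, 10)}

# 𝛌: primitive-specific min/max settings
lam = {"VerticalPartitioning": {"partitions": (2, 4)}}
```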
MD task with Primitives
• Solutions
  – Contain the requested number 𝛉p of instances of primitive p
    • 𝛌 parameter constraints are obeyed
  – 𝛑 parameter constraints are fulfilled
Sharing among Primitives
• Primitives occur in the real world
• However:
  – Assuming independence of schema elements across primitive instances is unrealistic
• Solution
  – Scenario parameters that control sharing among primitives
Analysis
• Complexity
  – Finding a solution for an MD task is NP-hard
• Unsatisfiability
  – There are MD tasks that have no solutions
• Implications
  – Any PTIME algorithm has to be either unsound (may not fulfill the MD task) or incomplete (may not always return a solution)
  – Users may accidentally specify unsatisfiable tasks
    • How to deal with that?
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
iBench Overview
• MDGen Algorithm
  – PTIME greedy algorithm solving the metadata generation problem
    • Relaxes scenario parameters if necessary to avoid backtracking and guarantee that a solution can be generated
• Input configuration
  – A configuration file (1-2 pages of text)
  – Scenario and primitive parameters
  – What types of metadata to produce?
  – Generate data?
• Output
  – An XML file storing the metadata
  – Data as CSV files (optional)
MDGen Algorithm
• Instantiate primitives to fulfill primitive parameters
  – If necessary, relax 𝛑 constraints to fulfill 𝛌 and 𝛉 constraints
    • E.g., the user requests 3 target relations (𝛑) and one vertical partitioning primitive instance (𝛉) with 5 fragments (𝛌)
      – We would ensure that 5 fragments are created
• Generate random mappings to fulfill 𝛑 constraints if necessary
  – E.g., not enough source relations have been created
• Customize the generated scenario
  – Adapt value invention
  – Generate random constraints
• Generate data
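The steps above can be sketched as a greedy loop. Everything below is a simplification under assumed names, not iBench's actual implementation:

```python
# Greedy MDGen sketch: primitives first (𝛌/𝛉 take precedence, so 𝛑 may be
# relaxed), then random mappings top up unmet 𝛑 constraints. Hypothetical code.
def mdgen(theta, lam, pi):
    scenario = {"source": [], "target": [], "mappings": []}
    # 1) Instantiate the requested number of primitive instances.
    for prim, count in theta.items():
        for _ in range(count):
            instantiate(prim, lam.get(prim, {}), scenario)
    # 2) Generate random copy mappings until 𝛑 minimums are met.
    while len(scenario["source"]) < pi.get("min_source_relations", 0):
        add_random_mapping(scenario)
    # 3) Customization (value invention, random constraints) and
    # 4) data generation would follow here.
    return scenario

def instantiate(prim, settings, scenario):
    # Placeholder: a real primitive creates relations, mappings, constraints.
    if prim == "VP":  # vertical partitioning into k fragments
        k = settings.get("fragments", 2)
        scenario["source"].append("R")
        scenario["target"].extend(f"R_{i}" for i in range(k))
        scenario["mappings"].append(f"partition R into {k} fragments")

def add_random_mapping(scenario):
    n = len(scenario["source"])
    scenario["source"].append(f"Rnd{n}")
    scenario["target"].append(f"TRnd{n}")
    scenario["mappings"].append(f"copy Rnd{n} to TRnd{n}")

# The example from the slide: one VP instance with 5 fragments wins over the
# 𝛑 constraint, and the minimum of 3 source relations is met by random mappings.
result = mdgen({"VP": 1}, {"VP": {"fragments": 5}}, {"min_source_relations": 3})
```

The key design point is the precedence order: because 𝛌 and 𝛉 are honored first and 𝛑 is only topped up afterwards, the algorithm never backtracks and always terminates with some solution.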
The iBench Metadata Generator
• Primitive Generation Example
  – I want 1 copy and 1 vertical partitioning
[Figure: example output – source relations Cust(Name, Addr) and Emp(Name, Company); target relations Customer(Name, Addr, Loyalty), Person(Id, Name), WorksAt, and EmpRec(Firm, Id)]
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
Evaluation Goals
• Scalability of iBench
  – Can we generate large and complex integration scenarios in reasonable time?
• Effectiveness
  – Can we enable new types of evaluations that would be hard or impossible without iBench?
iBench Performance
• Red: no sharing/constraints
• Green: 25% constraints
• Blue: 25% source sharing
• Turquoise: 25% target sharing
• No sharing: 1M attributes in seconds
• Sharing: 1M attributes in 2-3 minutes
• Non-linear trend & high standard deviation (12-14 sec) due to best-effort sharing
Case Study - Data Exchange
• Compare two data exchange systems
  – Mapping generation time
  – Data exchange performance
  – Target instance quality (redundancy)
• Clio – Toronto/IBM Almaden collaboration
  – Creates universal solutions
  – [Popa, Velegrakis, Miller, Hernandez, Fagin VLDB 02]
• MapMerge – UC Santa Cruz/IBM Almaden collaboration
  – Mapping correlation to remove redundancy in Clio mappings
  – [Alexe, Hernandez, Popa, Tan VLDBJ 12]
• Original MapMerge evaluation
  – Two real-life scenarios, 14 mappings
  – One synthetic: one source relation, up to 256 mappings
• Our goal:
  – Test the original experimental observations over more diverse and complex metadata scenarios
Case Study – Results
[Figure: panels showing Result Quality, Mapping Generation, and Data Exchange results]
New insights
• Larger improvement for ontology scenarios (Ont) that exhibit more incompleteness
• Comparable mapping execution time (data exchange) with slight benefits for MapMerge over larger scenarios
Success stories
• Value Invention in Data Exchange
  – [Arocena, Glavic, Miller, SIGMOD 2013]
  – Generate a large number of mappings (12.5 M)
• Vagabond
  – [us]
  – Use iBench to scale real-world scenarios and evaluate the effectiveness and performance of our approach
• Mapping Discovery
  – [Miller, Getoor, Memory]
  – Use iBench to generate schemas and data as input for learning mappings between schemas using statistical techniques, and compare against iBench ground-truth mappings
• Functional Dependencies Unleashed for Scalable Data Exchange
  – [Bonifati, Ileana, Linardi - arXiv preprint arXiv:1602.00563, 2016]
  – Used iBench to compare a new chase-based data exchange algorithm to the SQL-based exchange algorithm of ++Spicy
• Approximation Algorithms for Schema-Mapping Discovery from Data
  – [ten Cate, Kolaitis, Qian, Tan AMW 2015]
  – Approximate the Gottlob-Senellart notion
  – Kun Qian currently using iBench to evaluate the effectiveness of the approximation
• Comparative Evaluation of Chase Engines
  – [Università della Basilicata, University of Oxford]
  – Using iBench to generate schemas, constraints
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
Conclusions
• Goals
  – Improve empirical evaluation of integration systems
  – Help researchers evaluate their systems
• iBench
  – Comprehensive metadata generator
  – Produces inputs and outputs (gold standards) for a variety of integration tasks
  – Control over metadata characteristics
Future Work
• Give more control over data generation
• Orchestration of mappings
  – Serial and parallel
• Improve control over naming of schema elements– e.g., to better support schema matching
• Incorporate implementations of quality measures
Questions?
• Homepages:
  – Boris: http://www.cs.iit.edu/~glavic/
  – Radu: http://ws.isima.fr/~ciucanu/
  – Patricia: http://dblab.cs.toronto.edu/~prg/
  – Renée: http://dblab.cs.toronto.edu/~miller/
• iBench Project Webpage: http://dblab.cs.toronto.edu/project/iBench/
  – Code: https://bitbucket.org/ibencher/ibench/
  – Public Scenario Repo: https://bitbucket.org/ibencher/ibenchconfigurationsandscenarios