VLDB 2016 - The iBench Integration Metadata Generator
TRANSCRIPT
The iBench Integration Metadata Generator
Patricia C. Arocena², Boris Glavic¹, Radu Ciucanu³, Renée J. Miller²
¹IIT DBGroup  ²University of Toronto  ³Université Blaise Pascal
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
2 VLDB 2016 - The iBench Integration Metadata Generator
Overview
• Challenges of evaluating integration systems
  – Metadata is central
• Various types of metadata used by integration tasks
  – Quality is as (or maybe more) important than performance
    • Often requires a "gold standard" solution
• Goal: make empirical evaluations …
  • … more robust, repeatable, shareable, and broad
  • … less painful and time-consuming
• Contributions
  – iBench – a flexible metadata generator for evaluating integration systems
  – Evaluate iBench performance + use it in evaluations
Integration Tasks
Many integration tasks work with metadata:
• Data Exchange
  – Input: Schemas, Constraints, (Source Instance), Mappings
  – Output: Executable Transformations, (Target Instance)
• Schema Mapping Generation
  – Input: Schemas, Constraints, Instance Data, Correspondences
  – Output: Mappings, Transformations
• Schema Matching
  – Input: Schemas, (Instance Data), (Constraints)
  – Output: Correspondences
• Virtual Data Integration
  – Input: Schemas, Instance Data, Mappings, Queries
  – Output: Rewritten Queries, Certain Query Results
• … and many others (e.g., Mapping Operators, Schema Evolution, Data Cleaning, …)
[Figure: Integration Scenario – Metadata: Schemas, Constraints, Correspondences, Mappings; Data: Source Instance, Target Instance]
Empirical Evaluation
• What are important input parameters?
  – Ability to create input where each parameter is varied systematically as an independent variable
  – What are realistic values for these parameters?
• How is output performance measured?
  – Measures of scalability (throughput, response time)
  – Measures of quality
• Given input, what is the best output (gold standard)?
  – Many quality metrics measure "deviation from gold standard"
Running Example
• Alberto has designed a new mapping inversion algorithm called mapping restoration
  – Not all mappings have a perfect restoration
    • A partial restoration leads to a mapping that is close to an inverse but not equivalent
  – He has implemented his algorithm
  – Alberto would like to publish (deadline in 3 weeks)
• Alberto would like to do a comprehensive evaluation to test …
  – Hypothesis 1: many mappings do have a restoration
  – Hypothesis 2: the quality of the partial inverses is good
  – Hypothesis 3: although the algorithm has exponential complexity, he believes that some heuristic optimizations will make it feasible to compute inverses for mappings over large schemas
• Alberto's Dilemma
  – How to test his hypotheses (in a short time)?
Given (S, T, ΣST), find (T, S, ΣTS)
State-of-the-art
• How are integration systems typically evaluated?
• Small real-world integration scenarios
  – Advantages:
    • Realistic ;-)
  – Disadvantages:
    • Not possible to scale (schema size, data size, …)
    • Not possible to vary parameters (e.g., mapping complexity)
• Ad-hoc synthetic scenarios
  – Advantages:
    • Can influence scale and characteristics
  – Disadvantages:
    • Often not very realistic metadata
    • Diversity requires huge effort
Requirements
• Alberto needs a tool to generate realistic metadata!
  – Scalability
    • Generate large integration scenarios efficiently
    • Requires low user effort
  – Control over the characteristics of the generated metadata
    • Size
    • Structure
    • …
  – Generate inputs as well as gold-standard outputs
  – Generate data that matches the generated metadata
    • Data should obey integrity constraints
  – Promote reproducibility
    • Enable other researchers to regenerate metadata to repeat an experiment
    • Support researchers in understanding the characteristics of generated metadata
    • Enable researchers to reuse generated integration scenarios
Related Work
• STBenchmark [Alexe et al. PVLDB '08]
  – Pioneered the primitive approach
    • Generate metadata by combining typical micro-scenarios
  – No support for controlling reuse among primitives
  – No support for flexible value invention, scaling real-world scenarios, …
• Data generators
  – PDGF, Myriad
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
Integration Scenarios
• Integration Scenario
  – M = (S, T, ΣS, ΣT, Σ, I, J, 𝛕)
  – Source schema S with instance I
  – Target schema T with instance J
  – Source constraints ΣS and target constraints ΣT
    • Instance I fulfills ΣS and instance J fulfills ΣT
  – Schema mapping Σ
    • Instances (I, J) fulfill Σ
  – Transformations 𝛕
[Figure: an integration scenario – source schema S with constraints ΣS and instance I, target schema T with constraints ΣT and instance J, connected by mapping Σ and transformation 𝛕]
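The formal definition above can be mirrored by a small container type. The following Python sketch uses hypothetical names and is not iBench's actual API:

```python
from dataclasses import dataclass, field

# Illustrative container for M = (S, T, ΣS, ΣT, Σ, I, J, 𝛕);
# all names here are hypothetical, not part of iBench.
@dataclass
class IntegrationScenario:
    source_schema: dict        # S: relation name -> list of attribute names
    target_schema: dict        # T
    source_constraints: list   # ΣS, e.g. keys and foreign keys over S
    target_constraints: list   # ΣT
    mappings: list             # Σ, logical assertions relating I and J
    source_instance: dict = field(default_factory=dict)  # I: relation -> tuples
    target_instance: dict = field(default_factory=dict)  # J
    transformations: list = field(default_factory=list)  # 𝛕, executable code

scenario = IntegrationScenario(
    source_schema={"Cust": ["Name", "Addr"]},
    target_schema={"Customer": ["Name", "Addr", "Loyalty"]},
    source_constraints=[],
    target_constraints=[],
    mappings=["Cust(n, a) -> ∃L Customer(n, a, L)"],
)
```

The instances and transformations default to empty, matching the definition's treatment of I, J, and 𝛕 as components that may be produced later.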
MD Task
• Metadata generation task (MD task)
  – Scenario parameters П
    • Number of source relations
    • Number of attributes of target relations
  – MD task Γ
    • Specifies min/max constraints over some 𝛑 ∊ П
    • E.g., source relations should have between 3 and 10 attributes
• Solutions
  – An integration scenario M is a solution for an MD task Γ if M fulfills the constraints of Γ
Example - MD Task
• MD Task
• Example solution (mappings)
  • S1(A,B,C), S2(C,D,E) → T(A,E)
  • S3(A,B,C,D), S4(E,A,B) → ∃X,Y,Z T1(A,X,X), T2(A,Y,C), T3(C,B,Y,Z)
• Limited usefulness in practice
  – Can we generate "realistic" scenarios?
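Written out as source-to-target tuple-generating dependencies (standard s-t tgd notation), the two example mappings are:

```latex
% Mapping 1: join S1 and S2 on C, project A and E into T
\forall a,b,c,d,e\; \big( S_1(a,b,c) \wedge S_2(c,d,e) \rightarrow T(a,e) \big)

% Mapping 2: existential variables X, Y, Z invent target values
\forall a,b,c,d,e\; \big( S_3(a,b,c,d) \wedge S_4(e,a,b) \rightarrow
    \exists X,Y,Z\; T_1(a,X,X) \wedge T_2(a,Y,c) \wedge T_3(c,b,Y,Z) \big)
```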
Parameter 𝛑                 Source   Target
Number of Relations         2-4      1-3
Number of Attributes        2-10     2-10
Number of Join Attributes   1-2      1-2
Number of Existentials      -        0-3
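A solution must stay inside these min/max ranges. A minimal check, with hypothetical names (this is not iBench code), could look like:

```python
# Check min/max scenario-parameter (𝛑) constraints like the MD task above;
# illustrative only, not iBench's actual implementation.
def satisfies(schema, min_rels, max_rels, min_attrs, max_attrs):
    """schema: dict mapping relation name -> list of attribute names."""
    if not (min_rels <= len(schema) <= max_rels):
        return False
    return all(min_attrs <= len(attrs) <= max_attrs
               for attrs in schema.values())

# The example source schema fits the Source column (2-4 relations, 2-10 attributes):
source = {"S1": ["A", "B", "C"], "S2": ["C", "D", "E"]}
print(satisfies(source, 2, 4, 2, 10))  # True
```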
Mapping Primitives
• Mapping Primitives
  – Template micro-scenarios that encode typical schema mapping/evolution operations
    • E.g., vertically partitioning a source relation
  – Used as building blocks for generating scenarios
• Comprehensive Set of Primitives
  – Schema Evolution Primitives
    • Mapping Adaptation [Yu, Popa VLDB 05]
    • Mapping Composition [Bernstein et al. VLDBJ 08]
  – Schema Mapping Primitives
    • STBenchmark [Alexe, Tan, Velegrakis PVLDB 08]
      – First to propose parameterized primitives
Primitive Parameters
• What are primitive parameters?
  – For a given set of primitive types
    • 𝛉p = number of instances of primitive p that have to be included in a solution for an MD task
  – Primitives are templates
    • 𝛑 parameters affect instances of primitives
      – e.g., number of attributes per source table
    • 𝛌 parameters determine primitive-specific settings
      – Specified as min/max constraints
      – e.g., number of partitions in vertical partitioning
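For illustration only, the three parameter families could be encoded as follows (these names are hypothetical, not iBench's configuration syntax):

```python
# 𝛉: how many instances of each primitive type a solution must contain
theta = {"VerticalPartitioning": 1, "Copy": 2}

# 𝛑: scenario-wide min/max constraints over schema characteristics
pi = {"source_attributes_per_relation": (2, 10)}

# 𝛌: primitive-specific min/max settings
lam = {"VerticalPartitioning": {"partitions": (2, 4)}}
```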
MD task with Primitives
• Solutions
  – Contain the requested number 𝛉p of instances of primitive p
    • 𝛌 parameter constraints are obeyed
  – 𝛑 parameter constraints are fulfilled
Sharing among Primitives
• Primitives occur in the real world
• However:
  – Assuming independence of schema elements across primitive instances is unrealistic
• Solution
  – Scenario parameters that control sharing among primitives
Analysis
• Complexity
  – Finding a solution for an MD task is NP-hard
• Unsatisfiability
  – There are MD tasks that have no solutions
• Implications
  – Any PTIME algorithm has to be either unsound (may not fulfill the MD task) or incomplete (may not always return a solution)
  – Users may accidentally specify unsatisfiable tasks
    • How to deal with that?
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
iBench Overview
• MDGen Algorithm
  – PTIME greedy algorithm solving the metadata generation problem
    • Relaxes scenario parameters if necessary to avoid backtracking and guarantee that a solution can be generated
• Input configuration
  – A configuration file (1-2 pages of text)
  – Scenario and primitive parameters
  – What types of metadata to produce?
  – Generate data?
• Output
  – An XML file storing the metadata
  – Data as CSV files (optional)
MDGen Algorithm
• Instantiate primitives to fulfill primitive parameters
  – If necessary, relax 𝛑 constraints to fulfill 𝛌 and 𝛉 constraints
    • E.g., the user requests 3 target relations (𝛑) and one vertical partitioning primitive instance (𝛉) with 5 fragments (𝛌)
      – We would ensure that 5 fragments are created
• Generate random mappings to fulfill 𝛑 constraints if necessary
  – E.g., not enough source relations have been created
• Customize the generated scenario
  – Adapt value invention
  – Generate random constraints
• Generate data
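The steps above can be sketched as a greedy loop. Everything below is a simplification under assumed names, not iBench's actual implementation:

```python
# Greedy MDGen sketch: primitives first (𝛌/𝛉 take precedence, so 𝛑 may be
# relaxed), then random mappings top up unmet 𝛑 constraints. Hypothetical code.
def mdgen(theta, lam, pi):
    scenario = {"source": [], "target": [], "mappings": []}
    # 1) Instantiate the requested number of primitive instances.
    for prim, count in theta.items():
        for _ in range(count):
            instantiate(prim, lam.get(prim, {}), scenario)
    # 2) Generate random copy mappings until 𝛑 minimums are met.
    while len(scenario["source"]) < pi.get("min_source_relations", 0):
        add_random_mapping(scenario)
    # 3) Customization (value invention, random constraints) and
    # 4) data generation would follow here.
    return scenario

def instantiate(prim, settings, scenario):
    # Placeholder: a real primitive creates relations, mappings, constraints.
    if prim == "VP":  # vertical partitioning into k fragments
        k = settings.get("fragments", 2)
        scenario["source"].append("R")
        scenario["target"].extend(f"R_{i}" for i in range(k))
        scenario["mappings"].append(f"partition R into {k} fragments")

def add_random_mapping(scenario):
    n = len(scenario["source"])
    scenario["source"].append(f"Rnd{n}")
    scenario["target"].append(f"TRnd{n}")
    scenario["mappings"].append(f"copy Rnd{n} to TRnd{n}")

# The example from the slide: one VP instance with 5 fragments wins over the
# 𝛑 constraint, and the minimum of 3 source relations is met by random mappings.
result = mdgen({"VP": 1}, {"VP": {"fragments": 5}}, {"min_source_relations": 3})
```

The key design point is the precedence order: because 𝛌 and 𝛉 are honored first and 𝛑 is only topped up afterwards, the algorithm never backtracks and always terminates with some solution.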
The iBench Metadata Generator
• Primitive Generation Example
  – I want 1 copy and 1 vertical partitioning
[Figure: example output – source relations Cust(Name, Addr) and Emp(Name, Company); target relations Customer(Name, Addr, Loyalty), Person(Id, Name), WorksAt, and EmpRec(Firm, Id)]
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
Evaluation Goals
• Scalability of iBench
  – Can we generate large and complex integration scenarios in reasonable time?
• Effectiveness
  – Can we enable new types of evaluations that would be hard or impossible without iBench?
iBench Performance
• Red: no sharing/constraints
• Green: 25% constraints
• Blue: 25% source sharing
• Turquoise: 25% target sharing
• No sharing: 1M attributes in seconds
• Sharing: 1M attributes in 2-3 minutes
• Non-linear trend & high standard deviation (12-14 sec) due to best-effort sharing
Case Study - Data Exchange
• Compare two data exchange systems
  – Mapping generation time
  – Data exchange performance
  – Target instance quality (redundancy)
• Clio – Toronto/IBM Almaden collaboration
  – Creates universal solutions
  – [Popa, Velegrakis, Miller, Hernandez, Fagin VLDB 02]
• MapMerge – UC Santa Cruz/IBM Almaden collaboration
  – Mapping correlation to remove redundancy in Clio mappings
  – [Alexe, Hernandez, Popa, Tan VLDBJ 12]
• Original MapMerge evaluation
  – Two real-life scenarios, 14 mappings
  – One synthetic: one source relation, up to 256 mappings
• Our goal:
  – Test the original experimental observations over more diverse and complex metadata scenarios
Case Study – Results
[Figure: panels showing Result Quality, Mapping Generation, and Data Exchange results]
New insights
• Larger improvement for ontology scenarios (Ont) that exhibit more incompleteness
• Comparable mapping execution time (data exchange) with slight benefits for MapMerge over larger scenarios
Success stories
• Value Invention in Data Exchange
  – [Arocena, Glavic, Miller, SIGMOD 2013]
  – Generate a large number of mappings (12.5 M)
• Vagabond
  – [us]
  – Use iBench to scale real-world scenarios and evaluate the effectiveness and performance of our approach
• Mapping Discovery
  – [Miller, Getoor, Memory]
  – Use iBench to generate schemas and data as input for learning mappings between schemas using statistical techniques, and compare against iBench ground-truth mappings
• Functional Dependencies Unleashed for Scalable Data Exchange
  – [Bonifati, Ileana, Linardi - arXiv preprint arXiv:1602.00563, 2016]
  – Used iBench to compare a new chase-based data exchange algorithm to the SQL-based exchange algorithm of ++Spicy
• Approximation Algorithms for Schema-Mapping Discovery from Data
  – [ten Cate, Kolaitis, Qian, Tan AMW 2015]
  – Approximate the Gottlob-Senellart notion
  – Kun Qian currently using iBench to evaluate the effectiveness of the approximation
• Comparative Evaluation of Chase Engines
  – [Università della Basilicata, University of Oxford]
  – Using iBench to generate schemas, constraints
Outline
1) Introduction
2) The Metadata Generation Problem
3) The iBench Metadata Generator
4) Experiments
5) Conclusions and Future Work
Conclusions
• Goals
  – Improve empirical evaluation of integration systems
  – Help researchers evaluate their systems
• iBench
  – Comprehensive metadata generator
  – Produces inputs and outputs (gold standards) for a variety of integration tasks
  – Control over metadata characteristics
Future Work
• Give more control over data generation
• Orchestration of mappings
  – Serial and parallel
• Improve control over naming of schema elements– e.g., to better support schema matching
• Incorporate implementations of quality measures
Questions?
• Homepages:
  – Boris: http://www.cs.iit.edu/~glavic/
  – Radu: http://ws.isima.fr/~ciucanu/
  – Patricia: http://dblab.cs.toronto.edu/~prg/
  – Renée: http://dblab.cs.toronto.edu/~miller/
• iBench Project Webpage: http://dblab.cs.toronto.edu/project/iBench/
  – Code: https://bitbucket.org/ibencher/ibench/
  – Public Scenario Repo: https://bitbucket.org/ibencher/ibenchconfigurationsandscenarios