Data integration and transformation
Paolo Atzeni
Dipartimento di Informatica e Automazione, Università Roma Tre
29/09/2010
P. Atzeni GII School 29/09/2010 2
A ten-year goal for database research
• The “Asilomar report” (Bernstein et al., SIGMOD Record 1999, www.acm.org/sigmod):
– The information utility: make it easy for everyone to store, organize, access, and analyze the majority of human information online
• A lot of interesting work has been done but …
• … integration, translation, exchange are still difficult …
• … 2009 has come … we are late!
A general framework: cooperation
• "The capacity of a system to interact (effectively) with other systems, possibly managed by different organizations"
Forms of cooperation
• Process-centered cooperation:
– the systems offer services to one another, by exchanging messages, information or documents, or by triggering activities, without making remote data explicitly visible
• Data-centered cooperation:
– the systems offer data to one another; data is distributed, heterogeneous and autonomous, and accessible from remote locations according to some cooperation agreement
Databases in the Internet era
• Databases before the Internet
– An internal infrastructure, a precious resource, but usually hidden, with some controlled cooperation
• Internet changes the requirements
– More users (not only humans), more diverse cooperating systems (distributed, heterogeneous, autonomous), more types of data
• "Future" Internet changes more
– New devices (embedded everywhere), even more users (many “per person”), real mobility, need for personalization and adaptation
– Social networks
– Cloud computing applications
The most studied form of data-centered cooperation: integration
• We are interested in data-centered cooperation, often referred to as integration:
“The unification of related, heterogeneous data from disparate sources, for example, to enable collaboration” (Hammer & Stonebraker 2005)
• Some "paradigms" …
Multidatabase
[Figure: architecture diagram with a Global Manager, three Mediators, and three Local mgrs, each with its DB]
Data Warehousing System
[Figure: architecture diagram with three DBs, each with a Local manager and a Mediator, an “Integrator”, the Data Warehouse, and a DW manager]
Intermediate solutions in practice
[Figure: architecture diagram with five DBs and their Local managers, three Mediators, an Integrator, and an Application]
Peer-based architecture
[Figure: architecture diagram with three Peer mgrs, each with two Mediators and two Local mgrs with their DBs]
Data is not just in databases
• Mail messages
• Social networks
• Web pages
• Spreadsheets
• Textual documents
• Palmtop devices, mobile phones
• Multimedia annotations (e.g., in digital photos)
• XML documents
Data spaces
• The information and data is often unstructured and not preprocessed
The same data in the same form?
• Adaptivity:
– Personalization: content adapted to the user
  • upon system's decision
  • upon user's request
– Customization: structure adapted to the user
  • according to the user's role
  • upon user's request
– Context dependence
  • User, Device, Network, Place, Time, Rate
A general need
• We have data at various places, and data has to be
– exchanged
– replicated
– migrated
– integrated
– adapted
A major difficulty
• Heterogeneity
– System level
– Model level
– Structural (different structure for similar data)
– Semantic (different meaning for the same structure)
• Many efforts, but current techniques are mostly manual and ad hoc
Three problems
• Schema and data translation
• Schema and data integration
• Data exchange
Schema and data translation
• Given a schema, find another one with respect to some specific goal (better quality, another model, …)
Many different models
[Figure: data models shown: OR, Relational, OO, ER, XSD, …]
Many different models (and variants …)
[Figure: model variants shown: Relational; OR w/ PK, gen, ref, FK; OR w/ PK, gen, ref; OR w/ PK, gen, FK; OR w/ PK, ref, FK; OR w/ gen, ref; OR w/ PK, FK; OR w/ PK, ref; OR w/ ref; …]
Schema and data integration
• Given two or more sources, build an integrated schema or database
Data exchange
• Given a source and a target schema, find a transformation from the former to the latter
Data exchange, a typical approach (the Clio project)
[Figure: process diagram with a source schema and a target schema, and the steps schema match, mapping generation, and query generation]
Data exchange, example
Source:
PayRate (Rank, HrRate)
Professor (Id, Name, Sal)
Student (Name, GPA, Yr)
WorksOn (Name, Proj, Hrs, ProjRank)
Address (Id, Addr)
Target:
Personnel (Id, Name, Sal, Addr)
The process, example
SELECT P.Id, P.Name, P.Sal, A.Addr
FROM Professor P, Address A
WHERE A.Id = P.Id
UNION ALL
SELECT NULL AS Id, S.Name, P.HrRate * W.Hrs, NULL AS Addr
FROM PayRate P, Student S, WorksOn W
WHERE W.Name = S.Name AND S.Yr = P.Rank
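The query above can be tried out on a toy instance. The following is a minimal sketch using Python's built-in sqlite3 module; the column types and all the rows are invented for illustration, and the result is loaded into the target relation Personnel.

```python
import sqlite3

# Toy instance of the source and target schemas from the slide.
# Column types and all rows below are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PayRate   (Rank INTEGER, HrRate REAL);
CREATE TABLE Professor (Id INTEGER, Name TEXT, Sal REAL);
CREATE TABLE Student   (Name TEXT, GPA REAL, Yr INTEGER);
CREATE TABLE WorksOn   (Name TEXT, Proj TEXT, Hrs REAL, ProjRank INTEGER);
CREATE TABLE Address   (Id INTEGER, Addr TEXT);
CREATE TABLE Personnel (Id INTEGER, Name TEXT, Sal REAL, Addr TEXT);

INSERT INTO Professor VALUES (1, 'Ada', 5000);
INSERT INTO Address   VALUES (1, 'Rome');
INSERT INTO Student   VALUES ('Bob', 3.5, 2);
INSERT INTO PayRate   VALUES (2, 10.0);
INSERT INTO WorksOn   VALUES ('Bob', 'X', 20, 2);
""")

# The generated query, as written on the slide.
EXCHANGE_QUERY = """
SELECT P.Id, P.Name, P.Sal, A.Addr
FROM Professor P, Address A
WHERE A.Id = P.Id
UNION ALL
SELECT NULL AS Id, S.Name, P.HrRate * W.Hrs, NULL AS Addr
FROM PayRate P, Student S, WorksOn W
WHERE W.Name = S.Name AND S.Yr = P.Rank
"""

# Populate the target relation with the result of the query.
conn.execute("INSERT INTO Personnel " + EXCHANGE_QUERY)
rows = conn.execute(
    "SELECT Id, Name, Sal, Addr FROM Personnel ORDER BY Name"
).fetchall()
for row in rows:
    print(row)
```

Professors come with their address, while students appear with a computed salary and NULL for the values the source does not provide.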
A direction for the solutions
• Be general!
– ad hoc solutions could work in the small, but they
  • are repetitive and time consuming
  • do not scale
  • are not maintainable
• Historical notes:
– W. C. McGee: Generalization: Key to Successful Electronic Data Processing. J. ACM 1959
• Indeed, databases are the result of generalization applied to secondary storage management!
Generality requires …
• … high-level descriptions of problems within the family of interest:
– Metadata:
  • “data about data”
  • (formal or informal) description of structures and meaning
• General solutions leverage metadata (and then operate on data as a consequence)
A wider perspective
• (Generic) Model Management:
– A proposal by Bernstein et al. (2000+)
– Includes a set of operators on
  • schemas and
  • mappings between schemas
Terminology: a warning
Model Mgmt people | Traditional DB people
Meta-metamodel | Metamodel
Metamodel | Model
Model | Schema
Schemas and mappings
• More on the issue later
• For the time being:
– Schema:
  • a set of elements, related in some way to one another
– Mapping:
  • a set of correspondences (pairs of elements) or
  • its reification, a third schema related to the other two via two sets of correspondences
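The two views of a mapping can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a schema is reduced to a set of element names (relationships among elements are ignored), and the helper `reify` and the `a~b` naming of the intermediate elements are hypothetical.

```python
# Assumption: a schema is just a set of element names.
s1 = {"Professor.Id", "Professor.Name", "Professor.Sal"}
s2 = {"Personnel.Id", "Personnel.Name", "Personnel.Sal", "Personnel.Addr"}

# First view: a mapping is a set of correspondences (pairs of elements).
mapping = {
    ("Professor.Id", "Personnel.Id"),
    ("Professor.Name", "Personnel.Name"),
    ("Professor.Sal", "Personnel.Sal"),
}


def reify(pairs):
    """Second view: a third schema plus two sets of correspondences.

    Each pair (a, b) becomes one element "a~b" of the intermediate
    schema, linked to a on one side and to b on the other.
    """
    middle = {f"{a}~{b}" for a, b in pairs}
    to_first = {(f"{a}~{b}", a) for a, b in pairs}
    to_second = {(f"{a}~{b}", b) for a, b in pairs}
    return middle, to_first, to_second


middle, to_first, to_second = reify(mapping)
print(sorted(middle))
```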
Model mgmt operators, a first set
• map = Match (S1, S2)
• S3 = Merge (S1, S2, map)
• S2 = Diff (S1, map)
• and more
– map3 = Compose (map1, map2)
– S2 = Select (S1, pred)
– Apply (S, f)
– list = Enumerate (S)
– S2 = Copy (S1)
– …
Match
• map = Match (S1, S2)
– given
  • two schemas S1, S2
– returns
  • a mapping between them
• the “classical” initial step in data integration:
– find the common elements of two schemas and the correspondences between them
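As a rough illustration of the signature, here is a deliberately naive Match in Python. It assumes schemas are sets of element names and pairs elements whose names are equal ignoring case; this criterion is an assumption made for the sketch, and actual matchers rely on much richer evidence and usually on human validation.

```python
def match(s1, s2):
    """Naive Match: pair elements whose names are equal ignoring case.

    Schemas are assumed to be sets of element names; the result is a
    mapping in the sense of a set of correspondences (pairs).
    """
    by_name = {}
    for element in s2:
        by_name.setdefault(element.lower(), set()).add(element)
    return {(a, b) for a in s1 for b in by_name.get(a.lower(), ())}


professor = {"Id", "Name", "Sal"}
personnel = {"id", "name", "sal", "addr"}
mapping = match(professor, personnel)
print(sorted(mapping))
```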
Merge
• S3 = Merge (S1, S2, map)
– given
  • two schemas and a mapping between them
– returns
  • a third schema (and two mappings)
• the “classical” second step in data integration:
– given the correspondences, find a way to obtain one schema out of two
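A minimal sketch of Merge under the same simplification (schemas as sets of element names, mappings as sets of pairs). The policy used here is an assumption for illustration: matched elements are collapsed into the name used in S1, unmatched elements are carried over, each S2 element is assumed to match at most one S1 element, and unmatched names are assumed not to clash.

```python
def merge(s1, s2, mapping):
    """Return a merged schema and the two mappings into it.

    Assumptions: each element of s2 matches at most one element of s1,
    and unmatched elements of s2 do not share a name with elements of s1.
    """
    s1_name_of = {b: a for a, b in mapping}
    s3 = set(s1) | {b for b in s2 if b not in s1_name_of}
    map13 = {(a, a) for a in s1}
    map23 = {(b, s1_name_of.get(b, b)) for b in s2}
    return s3, map13, map23


s1 = {"Id", "Name", "Sal"}
s2 = {"PersonId", "Addr"}
s3, map13, map23 = merge(s1, s2, {("Id", "PersonId")})
print(sorted(s3))
```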
Diff
• S2 = Diff (S1, map)
– given
  • a schema and a mapping from it (to some other schema, not relevant)
– returns
  • a (sub-)schema, with the elements that do not participate in the mapping
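With schemas as sets of element names and mappings as sets of pairs (a simplification assumed for this sketch), Diff reduces to a set difference:

```python
def diff(s1, mapping):
    """Elements of s1 that do not participate in the mapping.

    The mapping is assumed to be a set of pairs whose first component
    is an element of s1.
    """
    mapped = {a for a, _ in mapping}
    return {element for element in s1 if element not in mapped}


personnel = {"Id", "Name", "Sal", "Addr"}
to_professor = {("Id", "Id"), ("Name", "Name"), ("Sal", "Sal")}
print(diff(personnel, to_professor))
```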
Example
(Bernstein and Rahm, ER 2000)
• A database (a “source”), a data warehouse and a mapping between the two
• We want to add a source, with some similarity to the first one
• and update the DW
[Figure: schemas DB1, DW1, DB2 and DW2]
Example, the "solution"
m2 = Match(DB1, DB2)
m3 = Compose(m2, m1)
DB2’ = Diff(DB2, m3)
DW2’, m4 user defined
m5 = Match(DW1, DW2’)
DW2 = Merge(DW1, DW2’, m5)
[Figure: schemas DB1, DW1, DB2, DB2’, DW2’ and DW2 with mappings m1 to m5]
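The script can be run end to end on toy schemas. The sketch below is an illustration, not the operators' actual definitions: schemas are sets of element names, mappings are sets of directed pairs, Match is a naive name-equality matcher, and all element names are invented. Because the pairs are directed here, the slide's Compose(m2, m1) is written as compose(invert(m2), m1), with `invert` a hypothetical helper.

```python
def match(s1, s2):
    # Naive assumption: elements correspond when their names are equal.
    return {(a, a) for a in s1 if a in s2}


def invert(mapping):
    return {(b, a) for a, b in mapping}


def compose(map_ab, map_bc):
    # Chain pairs that meet on the shared middle schema.
    return {(a, c) for a, b in map_ab for b2, c in map_bc if b == b2}


def diff(schema, mapping):
    mapped = {a for a, _ in mapping}
    return {e for e in schema if e not in mapped}


def merge(s1, s2, mapping):
    # Only the merged schema is returned in this sketch.
    matched = {b for _, b in mapping}
    return set(s1) | {b for b in s2 if b not in matched}


# Invented toy schemas.
DB1 = {"cust_id", "cust_name"}
DW1 = {"customer_key", "customer_name"}
m1 = {("cust_id", "customer_key"), ("cust_name", "customer_name")}
DB2 = {"cust_id", "cust_name", "loyalty_tier"}

m2 = match(DB1, DB2)                # DB1 -> DB2
m3 = compose(invert(m2), m1)        # DB2 -> DW1
DB2_new = diff(DB2, m3)             # part of DB2 not covered by DW1
DW2_new = {"loyalty_level"}         # user defined
m4 = {("loyalty_tier", "loyalty_level")}  # user defined
m5 = match(DW1, DW2_new)
DW2 = merge(DW1, DW2_new, m5)
print(sorted(DW2))
```

Only the part of DB2 that DW1 does not already cover needs a user-defined warehouse fragment, which is then merged into the existing warehouse schema.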
Magic does not exist
• Operators might require human intervention:
– Match is the main case
• Scripts involving operators might require human intervention as well (or at least benefit from it):
– a full implementation of each operator might not always be available
– a mapping might require manual specification
– incomparable alternatives might exist
The “data level”
• The major operators also have an extended version that operates on data, and not only on schemas
• Especially apparent for
– Merge
We also have heterogeneity
• Round trip engineering (Bernstein, CIDR 2003)
– A specification, an implementation
– then a change to the implementation: want to revise the specification
• We need a translation from the implementation model to the specification one
[Figure: specifications S1 and S2, implementations I1 and I2]
Model management with heterogeneity
• The previous operators have to be “model generic” (capable of working on different models)
• We need a “translation” operator
– <S2, map12> = ModelGen (S1)
ModelGen, an additional operator
• <S2, map12> = ModelGen (S1)
– given
  • a schema (in a model)
– returns
  • a schema (in a different data model) and a mapping between the two
• A “translation” from one model to another
• I should call it “SchemaGen” …
• It would be better to write
– <S2, map12> = ModelGen (S1, mod2)
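To make the signature concrete, here is a toy ModelGen in Python. Everything in it is a hypothetical illustration: the source schema is a dictionary of classes with attributes and references, the only supported target model is a relational one with surrogate keys and foreign keys, and the naming rules are invented.

```python
def model_gen(schema, target_model="relational"):
    """Translate a toy class-based schema into tables; return both the
    translated schema and the mapping between the two."""
    if target_model != "relational":
        raise ValueError(f"unsupported target model: {target_model}")
    tables = {}
    mapping = set()
    for cls, spec in schema.items():
        columns = [f"{cls.lower()}_id"]  # invented surrogate key
        for attr in spec["attributes"]:
            columns.append(attr)
            mapping.add((f"{cls}.{attr}", f"{cls}.{attr}"))
        for ref_name, target_cls in spec["references"].items():
            # A reference becomes a foreign-key column.
            fk = f"{ref_name}_{target_cls.lower()}_id"
            columns.append(fk)
            mapping.add((f"{cls}.{ref_name}", f"{cls}.{fk}"))
        tables[cls] = columns
    return tables, mapping


s1 = {
    "Employee": {"attributes": ["name"], "references": {"dept": "Department"}},
    "Department": {"attributes": ["title"], "references": {}},
}
s2, map12 = model_gen(s1, "relational")
print(s2)
```

Passing the target model explicitly mirrors the ModelGen (S1, mod2) form suggested on the slide.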
Round trip engineering
m2 = Match (I1, I2)
m3 = Compose (m1, m2)
I2’ = Diff (I2, m3)
<S2’, m4> = ModelGen (I2’)
… Match, Merge
[Figure: schemas S1, I1, I2, I2’, S2’ and S2 with mappings m1 to m4]
Summary
• data management in the Internet world
• data integration
• schema and data translation, data exchange
• model management
• Part II