CSE 636 Data Integration Overview

Post on 21-Dec-2015


Page 1: CSE 636 Data Integration Overview. 2 Data Warehouse Architecture Data Source Data Source Relational Database (Warehouse) Data Source Users   Applications

CSE 636 Data Integration

Overview

Page 2

Data Warehouse Architecture

[Diagram labels: three Data Sources; ETL Tools (Extract-Transform-Load); Data Cleaning; Relational Database (Warehouse); OLAP / Decision Support; Data Cubes / Data Mining; Users; Applications]
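The warehouse flow named on this slide (extract from the sources, clean and transform, load into one relational database, then query only the warehouse) can be sketched as follows. This is a minimal illustration, not the course's tooling: the source rows, the `books` table layout, and the `transform` and `load` helpers are all assumptions made for the example.

```python
import sqlite3

# Hypothetical source extracts; field names and values are made up for illustration.
source_a = [{"Title": " On the Road ", "ItemID": "123", "SuggestedPrice": "10.86"}]
source_b = [{"title": "on the road", "isbn": "123", "price": "8.86"}]


def transform(row, title_key, id_key, price_key, source):
    """Clean one source row into the assumed warehouse layout."""
    title = " ".join(row[title_key].split()).title()  # trim and normalize case
    return (row[id_key], title, float(row[price_key]), source)


def load(conn, rows):
    """Load the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS books "
        "(isbn TEXT, title TEXT, price REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")  # stands in for the warehouse database
rows = [transform(r, "Title", "ItemID", "SuggestedPrice", "A") for r in source_a]
rows += [transform(r, "title", "isbn", "price", "B") for r in source_b]
load(conn, rows)

# Users and applications query the warehouse, never the sources directly.
cheapest = conn.execute(
    "SELECT isbn, MIN(price) FROM books GROUP BY isbn"
).fetchall()
```

Because the data is copied at load time, the answer is only as fresh as the last ETL run, which is the contrast the next slide draws with virtual integration.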

Page 3

Virtual Integration Architecture

• Leave the data in the sources
• When a query comes in:
– Determine the sources relevant to the query
– Break the query down into sub-queries for the sources
– Get the answers from the sources, filter them if needed, and combine them appropriately
• Data is fresh
• Otherwise known as On Demand Integration
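The query-time steps listed above can be sketched as a toy mediator. The `SOURCES` dictionary, its `covers` field, and the three helper functions are hypothetical names chosen for this example. Real sources would be remote systems, not in-memory lists.

```python
# Hypothetical in-memory "sources"; each one advertises the global relation it covers.
SOURCES = {
    "A": {"covers": "Books", "rows": [{"ISBN": "123", "Title": "on the road", "Price": 10.86}]},
    "B": {"covers": "Books", "rows": [{"ISBN": "456", "Title": "big sur", "Price": 8.86}]},
    "C": {"covers": "CDs", "rows": [{"ASIN": "789", "Album": "kind of blue", "Price": 9.99}]},
}


def relevant_sources(relation):
    """Determine the sources relevant to the query."""
    return [name for name, src in SOURCES.items() if src["covers"] == relation]


def sub_query(source_name, predicate):
    """Run the sub-query against one source and filter its answers."""
    return [row for row in SOURCES[source_name]["rows"] if predicate(row)]


def answer(relation, predicate):
    """Combine the filtered answers from every relevant source."""
    results = []
    for name in relevant_sources(relation):
        results.extend(sub_query(name, predicate))
    return results


cheap_books = answer("Books", lambda row: row["Price"] < 10)
```

The data never leaves the sources until a query arrives, so each call to `answer` sees their current contents.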

Page 4

Virtual Integration Architecture (Design-Time)

[Diagram labels: End Users; Applications; Global Schema; Schema Mappings (one per source); three Data Sources, each with a Local Schema]

Sources can be:
• Relational DBs
• Excel Files
• Web Sites
• Web Services

Page 5

Schema Mappings

• Differences in:
– Names in schema
– Attribute grouping
– Coverage of databases
– Granularity and format of attributes

Inventory Database A
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Inventory Database B
Books(Title, ISBN, Price, DiscountPrice, Edition)
Authors(ISBN, FirstName, LastName)
BookCategories(ISBN, Category)
CDs(Album, ASIN, Price, DiscountPrice, Studio)
Artists(ASIN, ArtistName, GroupName)
CDCategories(ASIN, Category)
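To make these differences concrete, here is a minimal sketch of a mapping from one BooksAndMusic row in Database A to Books and Authors rows in Database B. The function `map_a_to_b` is hypothetical. It assumes that ItemID holds the ISBN for books and that Author is a single "First Last" string, neither of which the slide states.

```python
def map_a_to_b(item):
    """Map one BooksAndMusic row (Database A) to Books and Authors rows (Database B).

    Assumptions for illustration: ItemID holds the ISBN for books, and
    Author is a single "First Last" string.
    """
    if item["ItemType"] != "Books":
        # Coverage and grouping: Database B keeps CDs in separate tables.
        return None
    # Granularity: one Author field in A versus FirstName and LastName in B.
    first, _, last = item["Author"].rpartition(" ")
    # Names: ItemID and SuggestedPrice in A versus ISBN and Price in B.
    book = {"Title": item["Title"], "ISBN": item["ItemID"], "Price": item["SuggestedPrice"]}
    author = {"ISBN": item["ItemID"], "FirstName": first, "LastName": last}
    return book, author


book, author = map_a_to_b({
    "Title": "On the Road",
    "Author": "Jack Kerouac",
    "ItemID": "123",
    "ItemType": "Books",
    "SuggestedPrice": 10.86,
})
```

Hand-written functions like this are only one way to express a mapping. The next slide asks which formalisms should be used instead.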

Page 6

Issues for Schema Mappings (Design-Time)

• What formalisms to express them?
• How to create them?
• Can we discover them somehow?
• How do we use them?

[Diagram labels: End Users; Applications; Global Schema; Schema Mappings (one per source); three Data Sources, each with a Local Schema]

Page 7

Virtual Integration Architecture (Run-Time)

[Diagram labels: Query; Result; Mediator with Global Schema, Reformulation, Optimization, and Execution; two Wrappers; three Data Sources, each with a Local Schema]

Page 8

Issues for Query Processing: Reformulation

• User queries refer to the global schema
• Data is stored in the sources in a local schema
• Rewriting algorithms

[Diagram labels: Query; Mediator with Global Schema and Reformulation; three Data Sources, each with a Local Schema]

Page 9

Issues for Query Processing: Reformulation

Global Schema
Books(Title, ISBN, Price, DiscountPrice, Edition)

Local Schema A
BooksAndMusic(Title, Author, Publisher, ItemID, ItemType, SuggestedPrice, Categories, Keywords)

Query on the global schema:

SELECT ISBN, Price
FROM Books
WHERE Title = 'on the road'

Reformulated query on Local Schema A:

SELECT ItemID, SuggestedPrice
FROM BooksAndMusic
WHERE Title = 'on the road'
AND ItemType = 'Books'
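The rewriting in this example can be reproduced with a small attribute map. The sketch below is not one of the rewriting algorithms the course refers to. `ATTRIBUTE_MAP` and `reformulate` are hypothetical, and only conjunctive equality conditions on a single table are handled.

```python
# Assumed mapping from global Books attributes to BooksAndMusic attributes.
ATTRIBUTE_MAP = {"ISBN": "ItemID", "Price": "SuggestedPrice", "Title": "Title"}


def reformulate(select, where):
    """Rewrite a simple query on global Books into one on local BooksAndMusic.

    `select` is a list of global attribute names; `where` maps global attribute
    names to string constants. Values are interpolated directly, which is fine
    for a sketch but unsafe for untrusted input.
    """
    columns = ", ".join(ATTRIBUTE_MAP[a] for a in select)
    conditions = [f"{ATTRIBUTE_MAP[a]} = '{v}'" for a, v in where.items()]
    conditions.append("ItemType = 'Books'")  # restrict the mixed table to books
    return f"SELECT {columns} FROM BooksAndMusic WHERE " + " AND ".join(conditions)


local_sql = reformulate(["ISBN", "Price"], {"Title": "on the road"})
```

The extra `ItemType = 'Books'` condition appears because the local table mixes books and music while the global relation holds books only.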

Page 10

Issues for Query Processing: Query Translation

• Different query languages

[Diagram labels: Query; Mediator with Global Schema, Reformulation, Optimization, and Execution; Wrapper; three Data Sources, each with a Local Schema]

Page 11

Issues for Query Processing: Query Translation

Global Schema
Books(Title, ISBN, Price, DiscountPrice, Edition)

Query on the global schema:

SELECT ISBN, Price
FROM Books
WHERE Title = 'on the road'

Translated request for Local Source A:

http://www.amazon.com/homepage.html?ItemType=Books&Title=on+the+road
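A wrapper for a web source has to turn the selection condition into a request the site understands. The sketch below rebuilds the URL shown above with the standard library. The base URL and parameter names come from the slide; the function name `to_source_url` is an assumption.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.amazon.com/homepage.html"  # taken from the slide's example


def to_source_url(title):
    """Translate the condition Title = <title> into the source's URL form."""
    params = {"ItemType": "Books", "Title": title}
    # urlencode escapes the values and encodes spaces as '+'.
    return BASE_URL + "?" + urlencode(params)


url = to_source_url("on the road")
```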

Page 12

Issues for Query Processing: Data Translation

• Different data models

[Diagram labels: Query; Mediator with Global Schema, Reformulation, Optimization, and Execution; Wrapper; three Data Sources, each with a Local Schema]

Page 13

Issues for Query Processing: Data Translation

Local Result A (HTML):

<table> <tr> <td>
<a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br>
<a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b>
</td> </tr></table>

Translated into the global schema:

Books(Title, ISBN, Price, DiscountPrice, Edition)

Title        ISBN  Price  …  …
On the Road  123   10.86  …  …
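The translation from the HTML result to a row of the global schema can be sketched with Python's standard `html.parser`. The class `BookRowParser` is hypothetical and is tied to this one snippet: it assumes the title is the first `<b>` element, the price is in `<b class=price>`, and the ISBN is the `isbn` parameter of the first link.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse

HTML = """<table> <tr> <td>
<a href=/details?isbn=123> <b>On the Road</b> </a> -- by Jack Kerouac; Paperback <br>
<a href=/details?isbn=123> Buy new </a> :<b class=price>$10.86</b>
</td> </tr></table>"""


class BookRowParser(HTMLParser):
    """Extract Title, ISBN, and Price from the snippet's assumed layout."""

    def __init__(self):
        super().__init__()
        self.row = {}
        self._field = None  # field that the next text node should fill

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "ISBN" not in self.row:
            query = parse_qs(urlparse(attrs.get("href", "")).query)
            if "isbn" in query:
                self.row["ISBN"] = query["isbn"][0]
        elif tag == "b":
            self._field = "Price" if attrs.get("class") == "price" else "Title"

    def handle_data(self, data):
        if self._field == "Title":
            self.row["Title"] = data.strip()
        elif self._field == "Price":
            self.row["Price"] = float(data.strip().lstrip("$"))
        self._field = None


parser = BookRowParser()
parser.feed(HTML)
parser.close()
row = parser.row
```

A wrapper written this way breaks as soon as the page layout changes, which is one reason data translation is listed as an issue.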

Page 14

Issues for Query Processing: Query Execution

• Access as many data sources as needed
• Duplicate/redundant and irrelevant data
• Limited query capabilities

[Diagram labels: Query; Mediator with Global Schema, Reformulation, Optimization, and Execution; two Wrappers; three Data Sources, each with a Local Schema]

Page 15

Issues for Query Processing: Limited Query Capabilities

Global Schema
Books(Title, ISBN, Price, DiscountPrice, Edition)

Local Schema A
BooksAndMusic(Title, Author, ItemID, ItemType, SuggestedPrice)
Supported query:

SELECT ItemID, SuggestedPrice
FROM BooksAndMusic
WHERE Title = ?

Local Schema B
DiscountBooks(Title, Edition, ISBN, GreatPrice)
Supported query:

SELECT GreatPrice
FROM DiscountBooks
WHERE ISBN = ?

Query on the global schema:

SELECT ISBN, Price, DiscountPrice
FROM Books
WHERE Title = 'on the road'

Execution sequence (the slide marks its steps A to E):

• Query sent to Source A:

SELECT ItemID, SuggestedPrice
FROM BooksAndMusic
WHERE Title = 'on the road'

• Result from Source A:

ItemID  SuggestedPrice
123     10.86

• Query sent to Source B, using the ItemID returned by Source A as the ISBN:

SELECT GreatPrice
FROM DiscountBooks
WHERE ISBN = 123

• Result from Source B:

GreatPrice
8.86

• Combined result:

ISBN  Price  DiscountPrice
123   10.86  8.86
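This sequence is a dependent join: Source B can only be called once Source A has supplied a value for its required ISBN parameter. A minimal sketch follows, with each source modelled as a function that accepts only its one supported parameter. The function names and the in-memory rows are assumptions; the values are the ones on the slide.

```python
# Hypothetical source contents, matching the values shown on the slide.
BOOKS_AND_MUSIC = [{"Title": "on the road", "ItemID": 123, "SuggestedPrice": 10.86}]
DISCOUNT_BOOKS = [{"ISBN": 123, "GreatPrice": 8.86}]


def source_a(title):
    """Source A only answers: SELECT ItemID, SuggestedPrice ... WHERE Title = ?"""
    return [(r["ItemID"], r["SuggestedPrice"]) for r in BOOKS_AND_MUSIC if r["Title"] == title]


def source_b(isbn):
    """Source B only answers: SELECT GreatPrice ... WHERE ISBN = ?"""
    return [r["GreatPrice"] for r in DISCOUNT_BOOKS if r["ISBN"] == isbn]


def books_by_title(title):
    """Answer SELECT ISBN, Price, DiscountPrice FROM Books WHERE Title = ?"""
    answers = []
    for item_id, price in source_a(title):       # first call binds the ISBN
        for great_price in source_b(item_id):    # second call needs that binding
            answers.append((item_id, price, great_price))
    return answers


result = books_by_title("on the road")
```

The order of the two calls is forced by the sources' capabilities, so the optimizer cannot freely reorder them.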

Page 16

Issues for Query Processing: Query Answering

• Combine the results and further process them if needed
• Mainly union and merge
• Inconsistencies

[Diagram labels: Query; Result; Mediator with Global Schema, Reformulation, Optimization, and Execution; two Wrappers; three Data Sources, each with a Local Schema]

Page 17

Issues for Query Processing: Query Answering (Union)

Result from Source A:

ItemID  SuggestedPrice
123     10.86

Result from Source B:

ISBN  GreatPrice
456   8.86

Union, in the global schema:

ISBN  Price
123   10.86
456   8.86
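The union step can be sketched as renaming each source's attributes to the global ones and concatenating the rows. The helper `to_global` is a hypothetical name, and the rows are the ones shown on the slide.

```python
def to_global(rows, id_field, price_field):
    """Rename one source's attributes to the global (ISBN, Price) attributes."""
    return [{"ISBN": r[id_field], "Price": r[price_field]} for r in rows]


from_a = [{"ItemID": 123, "SuggestedPrice": 10.86}]
from_b = [{"ISBN": 456, "GreatPrice": 8.86}]

# Union: rows about different books are simply collected side by side.
union = to_global(from_a, "ItemID", "SuggestedPrice") + to_global(from_b, "ISBN", "GreatPrice")
```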

Page 18

Issues for Query Processing: Query Answering (Merge)

Result from Source A (primary key: ItemID):

ItemID  Title
123     On the Road

Result from Source B (primary key: ISBN):

ISBN  Edition  Price
123   2nd      8.86

Merged result (primary key: ISBN):

ISBN  Title        Edition  Price
123   On the Road  2nd      8.86
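Merging joins rows from different sources that describe the same book, matched on their primary keys. A minimal sketch, assuming ItemID in Source A and ISBN in Source B identify the same book when equal; the function name `merge` is hypothetical.

```python
def merge(rows_a, rows_b):
    """Join Source A rows (keyed by ItemID) with Source B rows (keyed by ISBN)."""
    by_isbn = {r["ISBN"]: r for r in rows_b}  # index Source B on its primary key
    merged = []
    for a in rows_a:
        b = by_isbn.get(a["ItemID"])
        if b is not None:
            merged.append({
                "ISBN": b["ISBN"],
                "Title": a["Title"],
                "Edition": b["Edition"],
                "Price": b["Price"],
            })
    return merged


merged = merge(
    [{"ItemID": 123, "Title": "On the Road"}],
    [{"ISBN": 123, "Edition": "2nd", "Price": 8.86}],
)
```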

Page 19

Issues for Query Processing: Query Answering (Inconsistencies)

Result from Source A (primary key: ItemID):

ItemID  Title        Edition
123     On the Road  1st

Result from Source B (primary key: ISBN):

ISBN  Edition  Price
123   2nd      8.86

Merged result (primary key: ISBN):

ISBN  Title        Edition  Price
123   On the Road  ???      8.86
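The "???" arises because both sources supply an Edition for the same key and the values disagree. The sketch below detects such conflicts while merging. Leaving the field empty and reporting the candidate values is only one possible policy, chosen here for illustration; the function name `merge_records` is hypothetical.

```python
def merge_records(a, b):
    """Merge one Source A record with one Source B record for the same book.

    Returns the merged record and a dict of conflicting fields. A conflicting
    field is set to None in the merged record (an assumed policy).
    """
    if a["ItemID"] != b["ISBN"]:
        raise ValueError("records describe different books")
    merged, conflicts = {"ISBN": b["ISBN"]}, {}
    for field in ("Title", "Edition", "Price"):
        values = {rec[field] for rec in (a, b) if field in rec}
        if len(values) == 1:
            merged[field] = values.pop()
        elif len(values) > 1:
            merged[field] = None               # the slide's "???"
            conflicts[field] = sorted(values)  # keep the candidates for later resolution
    return merged, conflicts


merged_record, conflicts = merge_records(
    {"ItemID": 123, "Title": "On the Road", "Edition": "1st"},
    {"ISBN": 123, "Edition": "2nd", "Price": 8.86},
)
```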

Page 20

Peer-Based Integration

[Diagram labels: Peer 1; Peer 2; Peer 3; Peer 4; Peer 5; two Query labels]

Page 21

Peer-Based Integration

• No need for a central mediated schema
• Peers serve as mediators for other peers
• A peer can be both a server and a client
• Semantic relationships are specified locally (between small sets of peers)
• Queries are posed using the peer’s schema
• Answers come from anywhere in the system
• This is not P2P file sharing.
– Data has rich semantics
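A toy sketch of how a query posed in one peer's schema could be answered from other peers through local mappings: each peer stores rows in its own attribute names and knows only how its attributes translate to those of its direct neighbours. The `PEERS` structure, the attribute names, and the `collect` function are all hypothetical. Real peer systems use far richer mappings than attribute renaming.

```python
# Each peer: local rows plus mappings from its attribute names to a neighbour's.
PEERS = {
    "Peer 1": {"rows": [{"ISBN": "123"}], "links": {"Peer 2": {"ISBN": "ItemID"}}},
    "Peer 2": {"rows": [{"ItemID": "456"}], "links": {"Peer 3": {"ItemID": "Code"}}},
    "Peer 3": {"rows": [{"Code": "789"}], "links": {}},
}


def collect(peer, attribute, seen=None):
    """Return all values of `attribute` (named in `peer`'s schema) reachable from `peer`."""
    seen = set() if seen is None else seen
    if peer in seen:  # guard against cycles in the peer graph
        return []
    seen.add(peer)
    values = [row[attribute] for row in PEERS[peer]["rows"] if attribute in row]
    for neighbor, mapping in PEERS[peer]["links"].items():
        if attribute in mapping:  # translate the attribute name for the neighbour
            values += collect(neighbor, mapping[attribute], seen)
    return values


isbns = collect("Peer 1", "ISBN")
```

Peer 1 has no mapping to Peer 3, yet Peer 3's data is reached by composing the two local mappings, with Peer 2 acting as a mediator.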

Page 22

References

• Information Integration
– Maurizio Lenzerini
– Eighteenth International Joint Conference on Artificial Intelligence, IJCAI 2003
– Invited Tutorial
• Data Integration: a Status Report
– Alon Halevy
– German Database Conference (BTW), 2003
– Invited Talk