Practical Lineage Tracing in Data Warehouses. Paper by Y. Cui and J. Widom. Appeared in ICDE 2000. Presented by Royi Ronen in Seminar in Databases (236826), Winter 2009.

Page 1: Practical Lineage Tracing in Data Warehouses

Practical Lineage Tracing in Data Warehouses

Paper by Y. Cui and J. Widom. Appeared in ICDE 2000.

Presented by Royi Ronen in Seminar in Databases (236826), Winter 2009

Page 2: Practical Lineage Tracing in Data Warehouses

Introduction

Page 3: Practical Lineage Tracing in Data Warehouses

A visit to a computing center of a restaurant

View Menu(item, cost, price):

item     cost   price
soup     5      7
coffee   3.9    6
salad    7      10
cake     8      5
steak    10     12
…        …      …

Database: product(name, cost), labor(id, cost), overheads(name, cost), operations(name, cost)

What made the price high? How can we solve the problem?

Page 4: Practical Lineage Tracing in Data Warehouses

The Lineage Problem

Given:
  A view V
  A database instance D
  A data item d in a tuple in V(D)

Find: all data items that produced d, and the process in which d was produced

= Lineage (family tree, pedigree)

Page 5: Practical Lineage Tracing in Data Warehouses

Motivation

In many data analysis and management scenarios, the source of the data is valuable:
  OLAP (online analytical processing)
  When sources are of different qualities (certainty, reliability, etc.)
  Scientific databases
  Top-down Datalog evaluation
  On-line monitoring

This is the first research to discuss the problem

Page 6: Practical Lineage Tracing in Data Warehouses

Example - I

DB Schema

View

Page 7: Practical Lineage Tracing in Data Warehouses

Example - II

Promising

computer

medicine

e.industry

Page 8: Practical Lineage Tracing in Data Warehouses

Example - III

What is the exact set of data items which produced the tuple computer according to view Promising?

Page 9: Practical Lineage Tracing in Data Warehouses

View Data Lineage

Page 10: Practical Lineage Tracing in Data Warehouses

Tuple Lineage for one Operator

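Let Op be an operator from $\{\sigma, \pi, \bowtie, \alpha\}$ (selection, projection, join, aggregation). Let $T = Op(T_1, \dots, T_m)$ and $t \in T$.

t's lineage in $(T_1, \dots, T_m)$ according to Op is:

$Op^{-1}_{\langle T_1, \dots, T_m \rangle}(t) = \langle T_1^*, \dots, T_m^* \rangle$

where $T_i^* \subseteq T_i$ are the maximal sets s.t.

(a) $Op(T_1^*, \dots, T_m^*) = \{t\}$
    (the lineage tuples derive exactly t)

(b) $\forall T_i^*\ \forall t^* \in T_i^*: Op(T_1^*, \dots, \{t^*\}, \dots, T_m^*) \neq \emptyset$
    (every tuple contributes to t)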

Page 11: Practical Lineage Tracing in Data Warehouses

Discussion

$Op(T_1^*, \dots, T_m^*) = \{t\}$
  Alone, this condition could be met even if many non-relevant tuples are in $T_i^*$

$\forall T_i^*\ \forall t^* \in T_i^*: Op(T_1^*, \dots, \{t^*\}, \dots, T_m^*) \neq \emptyset$
  Alone, this condition could be met by many tuples not at all related to t

Together, the two conditions define the lineage

Page 12: Practical Lineage Tracing in Data Warehouses

Example

T(X, Y):

X   Y
a   2
a   3
b   6
b   9
c   3
d   4
a   1
d   5

$\alpha_{X,\,sum(Y)}(T)$:

X   sum(Y)
a   6
b   15
c   3
d   9

Lineage of t = (a, 6):

X   Y
a   2
a   3
a   1
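The aggregation example above can be replayed in a few lines of Python. This is only a minimal sketch: the function names `aggregate_sum` and `lineage_of_group` are made up for illustration, and the lineage rule used (all input rows in the same group as t) is specific to this sum-by-X operator. The last two lines check conditions (a) and (b) from the one-operator definition against the result.

```python
from collections import defaultdict

def aggregate_sum(table):
    """Group (x, y) rows by x and sum y, i.e. the operator alpha_{X, sum(Y)}."""
    sums = defaultdict(int)
    for x, y in table:
        sums[x] += y
    return sorted(sums.items())

def lineage_of_group(table, t):
    """Lineage of a result tuple t = (x, total): every input row in t's group."""
    x, _ = t
    return [row for row in table if row[0] == x]

# T(X, Y) as listed on the slide.
T = [("a", 2), ("a", 3), ("b", 6), ("b", 9), ("c", 3), ("d", 4), ("a", 1), ("d", 5)]

view = aggregate_sum(T)
t = ("a", 6)
lin = lineage_of_group(T, t)

# Condition (a): the lineage tuples alone derive exactly {t}.
cond_a = aggregate_sum(lin) == [t]
# Condition (b): each lineage tuple on its own still yields a non-empty result.
cond_b = all(aggregate_sum([row]) for row in lin)
```

Rows such as (b, 6) are excluded because adding them would make the result contain more than t, which violates condition (a).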

Page 13: Practical Lineage Tracing in Data Warehouses

Tuple lineage for a view

A view definition has many operators

We assume that views are evaluated as a query tree, bottom-up

The lineage in D of a tuple t according to v(D), $v^{-1}_D(t)$, is defined by recursively generalizing tuple lineage for an operator:
  Basis: t contributes to itself in V, when the view is just a table
  Step: previous definition of an operator
  Transitivity: if t1 contributes to t2, and t2 contributes to t3, then t1 contributes to t3

Page 14: Practical Lineage Tracing in Data Warehouses

Example

$V = \alpha_{X,\,sum(Y)}(\sigma_{Y>0}(R \bowtie S))$
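To make the recursive definition concrete, here is a rough sketch for a view that joins R and S, keeps rows with Y > 0, and sums Y per X. The slide does not show the table contents or schemas, so R(X, K), S(K, Y), the join on K, and all the data below are assumptions chosen for illustration. The view is evaluated bottom-up, and a tuple is traced top-down through the aggregation, the selection, and the join.

```python
# Hypothetical source tables (not from the slides): R(X, K) and S(K, Y).
R = [("a", 1), ("a", 2), ("b", 3)]
S = [(1, 5), (2, -4), (2, 7), (3, 2)]

# Bottom-up evaluation of the query tree.
joined = [(x, k, y) for (x, k) in R for (k2, y) in S if k == k2]   # R join S
selected = [row for row in joined if row[2] > 0]                   # sigma_{Y>0}
totals = {}
for x, _, y in selected:                                           # alpha_{X, sum(Y)}
    totals[x] = totals.get(x, 0) + y
view = sorted(totals.items())

def trace(t):
    """Trace view tuple t = (x, total) back to the rows of R and S it came from."""
    x_t, _ = t
    # Through alpha: all selected rows in t's group.
    # Through sigma: a selected row traces back to the identical joined row.
    group = [row for row in selected if row[0] == x_t]
    # Through the join: project each joined row back onto R and S.
    r_star = sorted({(x, k) for x, k, _ in group})
    s_star = sorted({(k, y) for _, k, y in group})
    return r_star, s_star

r_star, s_star = trace(("a", 12))
```

With this data the S row (2, -4) joins with an R row of group a but is not part of the lineage, since the selection discards it before it can contribute.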

Page 15: Practical Lineage Tracing in Data Warehouses

Canonical form for ASPJ views

Any aggregate-select-project-join (ASPJ) view can be transformed to an equivalent canonical form

The canonical form consists of nested ASPJ segments of the form agg-project-select-join

Example: The Promising View is canonical, with two levels

Segment 1

Segment 2

Page 16: Practical Lineage Tracing in Data Warehouses

Canonicalization Algorithm

Page 17: Practical Lineage Tracing in Data Warehouses

Lineage Tracing Query

Let D be a database instance, let v be a view definition, and let $t \in v(D)$.

Then $TQ_{t,v}$ is a lineage tracing query if $TQ_{t,v}(D) = v^{-1}_D(t)$

And for a set T, $TQ_{T,v}(D)$

Page 18: Practical Lineage Tracing in Data Warehouses

Lineage Tracing Query for one-level ASPJ Views

Consider a query in canonical form

The tracing query for a tuple t is

And for a set T

split turns the table into multiple tables with projections
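The formulas for this slide did not survive in the transcript, so the following is only a sketch of one plausible shape for such a tracing query: select the joined rows that satisfy the view's condition and belong to the traced tuple's group, then split the result into one projection per source table. The helper names `split` and `tracing_query`, the schemas R(X, K) and S(K, Y), and the data are all hypothetical.

```python
def split(rows, schemas):
    """Project joined rows onto each source table's attributes, without duplicates."""
    return [sorted({tuple(row[a] for a in attrs) for row in rows}) for attrs in schemas]

def tracing_query(joined, condition, group_attrs, targets, schemas):
    """Keep joined rows that pass the view's selection and fall in the group of
    some traced tuple, then split them back into per-table lineage."""
    wanted = {tuple(t[g] for g in group_attrs) for t in targets}
    matching = [row for row in joined
                if condition(row) and tuple(row[g] for g in group_attrs) in wanted]
    return split(matching, schemas)

# Hypothetical sources (not from the slides): R(X, K) and S(K, Y), joined on K.
R = [{"X": "a", "K": 1}, {"X": "a", "K": 2}, {"X": "b", "K": 3}]
S = [{"K": 1, "Y": 5}, {"K": 2, "Y": -4}, {"K": 2, "Y": 7}, {"K": 3, "Y": 2}]
joined = [{**r, **s} for r in R for s in S if r["K"] == s["K"]]

schemas = [("X", "K"), ("K", "Y")]
positive_y = lambda row: row["Y"] > 0

# Tracing a single tuple (group X = a), then a set of tuples (groups a and b).
one = tracing_query(joined, positive_y, ["X"], [{"X": "a"}], schemas)
both = tracing_query(joined, positive_y, ["X"], [{"X": "a"}, {"X": "b"}], schemas)
```

Passing a list of target tuples covers the "for a set T" case with the same code path: the selection simply accepts any of the requested groups.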

Page 19: Practical Lineage Tracing in Data Warehouses

Lineage Tracing for multi-level Views

Page 20: Practical Lineage Tracing in Data Warehouses

Example - Tracing the lineage of view Promising

Page 21: Practical Lineage Tracing in Data Warehouses

Auxiliary Views

Page 22: Practical Lineage Tracing in Data Warehouses

Motivation

In a distributed environment, querying data sources is a difficult problem:
  Access costs
  Network costs
  Not always accessible

Storing auxiliary views in the warehouse can help

What should we store?

Page 23: Practical Lineage Tracing in Data Warehouses

Scope

We deal with one-level SPJ views only

Extensions to ASPJ views and to multi-level ASPJ views are straightforward and are done in [Cui and Widom, DMDW 2000]

Page 24: Practical Lineage Tracing in Data Warehouses

Tracing query trees for SPJ views

view

tracing

Page 25: Practical Lineage Tracing in Data Warehouses

Method 1: Store Nothing (N)

A degenerate case where no auxiliary views are stored

User view is

Lineage tracing query is

Very low storage costs
No aux. view storage or aux. view updating costs
Tracing query has large costs, particularly network
User view has maintenance cost

Page 26: Practical Lineage Tracing in Data Warehouses

Method 2: Base Tables (BT)

Auxiliary views are base tables after selection, BTi

User view is

Lineage tracing query is

High storage costs, tables are large (even after selection)
Maintenance of aux. views is fast (unprocessed tables)
Tracing query has processing costs but not network costs
User view has to be maintained

Page 27: Practical Lineage Tracing in Data Warehouses

Method 3: Lineage View (LV)

Auxiliary view:

User view is

Lineage tracing query is (query tree (a))

Large storing costs (for the join)
Maintenance of lineage view is expensive
Very good tracing performance, LV appears as-is in tracing query
Maintenance of lineage views helps maintaining user view
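The trade-off between Store Nothing and Lineage View can be illustrated with a small simulation. The auxiliary view formula is missing from the transcript, so this sketch assumes that the lineage view materializes the selected join of the sources and that the user view projects it onto X. The `Source` class, which simply counts how often a remote table is read, and all the data are hypothetical.

```python
class Source:
    """Hypothetical remote source table that counts how often it is scanned."""
    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def scan(self):
        self.reads += 1
        return list(self.rows)

def selected_join(r_rows, s_rows):
    """Join R(X, K) with S(K, Y) on K and keep rows with Y > 0."""
    return [(x, k, y) for (x, k) in r_rows for (k2, y) in s_rows if k == k2 and y > 0]

def split(rows):
    """Project joined rows back onto R and S."""
    return (sorted({(x, k) for x, k, _ in rows}),
            sorted({(k, y) for _, k, y in rows}))

def trace_store_nothing(t, r_src, s_src):
    """Method N: recompute the join from the sources on every tracing request."""
    rows = selected_join(r_src.scan(), s_src.scan())
    return split([row for row in rows if (row[0],) == t])

def trace_lineage_view(t, lv):
    """Method LV (as assumed here): filter the stored join, no source access."""
    return split([row for row in lv if (row[0],) == t])

r_src = Source([("a", 1), ("a", 2), ("b", 3)])
s_src = Source([(1, 5), (2, -4), (2, 7), (3, 2)])

lv = selected_join(r_src.scan(), s_src.scan())   # materialized once, at load time
reads_after_load = (r_src.reads, s_src.reads)

via_lv = trace_lineage_view(("a",), lv)
reads_after_lv = (r_src.reads, s_src.reads)      # unchanged by LV tracing

via_n = trace_store_nothing(("a",), r_src, s_src)
reads_after_n = (r_src.reads, s_src.reads)       # each source scanned again
```

Both methods return the same lineage; the difference is that the stored join has to be kept and maintained in the warehouse, while Store Nothing pays for source access on every trace.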

Page 28: Practical Lineage Tracing in Data Warehouses

Performance

Page 29: Practical Lineage Tracing in Data Warehouses

Method 4: Store Split Lineage Tables (SLT)

Auxiliary views (Ti are source tables):

User view is

Lineage tracing query is:

Usually small storage costs (LV is not materialized)
Same maintenance cost as in method LV
Tracing cost is low, yet higher than LV because more than a simple semi-join is performed

Very good when LV joins are large

Page 30: Practical Lineage Tracing in Data Warehouses

Method 5: Store Partial Base Tables (PBT)

Auxiliary views (V is the user view):

User view is

Lineage tracing query is:

Smaller storage compared to BT
Maintenance is costly, user view has to be maintained before aux. views
Tracing benefits from operating on small tables

Page 31: Practical Lineage Tracing in Data Warehouses

Method 6: Store Base Tables Projections (BP)

Auxiliary views (Ai includes key atts., atts. projected in V and atts. involved in the join):

User view is

Lineage tracing query is:

Small storage due to usually small tables
Cheap maintenance (tables, not join, are maintained)
However, source tables have to be queried in tracing, rendering tracing relatively expensive

What is the assumption here?

Page 32: Practical Lineage Tracing in Data Warehouses

Method 7: Store Lineage View Projections (LP)

Auxiliary views (A are atts. in V, Ki are key atts. in Ti):

User view is

Lineage tracing query is:

Small storage due to small tables
Maintenance higher than BP due to join
Small tracing cost, but sources have to be queried

Page 33: Practical Lineage Tracing in Data Warehouses

Performance

Page 34: Practical Lineage Tracing in Data Warehouses

Self maintainability

Previous results show how to store more data in order to make views self-maintainable [Quass, Gupta, Mumick and Widom 1996]

Done using… auxiliary views
Maintenance is done using delta relations
Methods 5, 6, 7 have a self-maintainable version: S-PBT, S-BP, S-LP

Page 35: Practical Lineage Tracing in Data Warehouses

Experiments

Page 36: Practical Lineage Tracing in Data Warehouses

Storage Costs

Page 37: Practical Lineage Tracing in Data Warehouses

Total time (including user-view maintenance)

Page 38: Practical Lineage Tracing in Data Warehouses

Cost Model

Maintenance / Tracing cost =
    Disk cost * num of I/Os
  + Trans cost * num of transmitted bytes
  + Msg cost * num of network messages
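The cost formula is a plain weighted sum, which the sketch below spells out. The names `CostParams` and `total_cost` and the weights in the example are placeholders made up for illustration; the transcript does not give the values used in the experiments.

```python
from dataclasses import dataclass

@dataclass
class CostParams:
    disk_cost: float   # cost per disk I/O
    trans_cost: float  # cost per transmitted byte
    msg_cost: float    # cost per network message

def total_cost(p, num_ios, num_bytes, num_messages):
    """Maintenance or tracing cost as the weighted sum from the cost model."""
    return (p.disk_cost * num_ios
            + p.trans_cost * num_bytes
            + p.msg_cost * num_messages)

# Hypothetical weights and counts, chosen only to show the arithmetic:
# 10.0 * 8 + 0.5 * 40 + 100.0 * 2
params = CostParams(disk_cost=10.0, trans_cost=0.5, msg_cost=100.0)
cost = total_cost(params, num_ios=8, num_bytes=40, num_messages=2)
```

A method that answers tracing queries entirely inside the warehouse would have zero transmitted bytes and zero network messages in this model, leaving only the disk term.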

Page 39: Practical Lineage Tracing in Data Warehouses

Impact of table size on storage

Page 40: Practical Lineage Tracing in Data Warehouses

Impact of table size on time

Page 41: Practical Lineage Tracing in Data Warehouses

Conclusions

Page 42: Practical Lineage Tracing in Data Warehouses

Results In Brief

Definitions and problem formulation

Lineage tracing
  For an operator
  For views in a canonical form

Auxiliary views

Performance study