practical lineage tracing in data warehouses

Post on 09-Jan-2016

DESCRIPTION

Practical Lineage Tracing in Data Warehouses. Paper by Y. Cui and J. Widom. Appeared in ICDE 2000. Presented by Royi Ronen in Seminar in Databases (236826), Winter 2009.

TRANSCRIPT

Practical Lineage Tracing in Data Warehouses

Paper by Y. Cui and J. Widom. Appeared in ICDE 2000

Presented by Royi Ronen in Seminar in Databases (236826), Winter 2009

Introduction

A visit to a computing center of a restaurant

soup 5 7

coffee 3.96

salad 7 10

cake 8 5

steak 10 12

…

View Menu(item,cost,price)

product(name, cost), labor(id, cost), overheads(name, cost), operations(name, cost)

Database

What made the price high? How can we solve the problem?

The Lineage Problem

Given:
A view V
A database instance D
A data item d in a tuple in V(D)

Find: all data items that produced d, and the process in which d was produced

= Lineage (Hebrew gloss on the slide: family tree, pedigree)

Motivation

In many data analysis and management scenarios, the source of the data is valuable:
OLAP (online analytical processing)
When sources are of different qualities (certainty, reliability, etc.)
Scientific databases
Top-down Datalog evaluation
On-line monitoring

This is the first research to discuss the problem

Example - I

DB Schema

View

Example - II

Promising

computer

medicine

e.industry

Example - III

What is the exact set of data items which produced computer according to view Promising?

View Data Lineage

Tuple Lineage for one Operator

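Let Op be an operator from $\{\sigma, \pi, \bowtie, \alpha\}$ (selection, projection, join, aggregation). Let $T = Op(T_1, \ldots, T_m)$ and $t \in T$.

t's lineage in $(T_1, \ldots, T_m)$ according to Op is:

$Op^{-1}_{\langle T_1, \ldots, T_m \rangle}(t) = \langle T_1^*, \ldots, T_m^* \rangle$

where the $T_i^*$ are the maximal subsets of the $T_i$ such that

(a) $Op(T_1^*, \ldots, T_m^*) = \{t\}$ (the lineage tuples derive exactly t)

(b) $\forall T_i^*\ \forall t^* \in T_i^*:\ Op(T_1^*, \ldots, \{t^*\}, \ldots, T_m^*) \neq \emptyset$ (every tuple contributes to t)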

Discussion

(a) $Op(T_1^*, \ldots, T_m^*) = \{t\}$

Alone, this condition could be met even if many non-relevant tuples are in $T_i^*$

(b) $\forall T_i^*\ \forall t^* \in T_i^*:\ Op(T_1^*, \ldots, \{t^*\}, \ldots, T_m^*) \neq \emptyset$

Alone, this condition could be met by many tuples not at all related to t

Together, the two conditions define the lineage
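The interplay of the two conditions can be checked by brute force on a tiny table. The sketch below is only an illustration: the table, the selection predicate Y > 2 and all function names are assumptions made for this example, not taken from the slides. It enumerates every subset of T and keeps the maximal ones under condition (a) alone, condition (b) alone, and both together.

```python
from itertools import combinations

def select_y_gt_2(rows):
    """Hypothetical operator: selection sigma_{Y>2} over a set of (x, y) tuples."""
    return {row for row in rows if row[1] > 2}

def all_subsets(rows):
    """Yield every subset of rows as a frozenset."""
    ordered = sorted(rows)
    for size in range(len(ordered) + 1):
        for combo in combinations(ordered, size):
            yield frozenset(combo)

def maximal(candidates):
    """Keep only the candidates not strictly contained in another candidate."""
    candidates = list(candidates)
    return [c for c in candidates if not any(c < other for other in candidates)]

def cond_a(op, subset, t):
    """Condition (a): the subset derives exactly {t}."""
    return op(subset) == {t}

def cond_b(op, subset):
    """Condition (b): every tuple on its own produces a non-empty result."""
    return all(op({row}) for row in subset)

# Illustrative data, not from the slides.
T = {("a", 1), ("a", 3), ("b", 5)}
t = ("a", 3)

only_a = maximal(s for s in all_subsets(T) if cond_a(select_y_gt_2, s, t))
only_b = maximal(s for s in all_subsets(T) if cond_b(select_y_gt_2, s))
both = maximal(
    s for s in all_subsets(T)
    if cond_a(select_y_gt_2, s, t) and cond_b(select_y_gt_2, s)
)
```

Here condition (a) alone admits the irrelevant tuple (a, 1), condition (b) alone admits the unrelated tuple (b, 5), and only the combination isolates (a, 3).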

Example

T(X, Y):

X  Y
a  2
a  3
b  6
b  9
c  3
d  4
a  1
d  5

$\alpha_{X, sum(Y)}(T)$:

X  sum(Y)
a  6
b  15
c  3
d  9

Lineage of t = (a, 6): the tuples (a, 2), (a, 3) and (a, 1) of T
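The aggregation example can be replayed in a few lines of Python. This is a minimal sketch, assuming tuples are plain (x, y) pairs; the function names are invented for illustration. For a group-by aggregate, the lineage of a result tuple is the whole group it was computed from.

```python
from collections import defaultdict

def aggregate_sum(table):
    """Group (x, y) tuples by x and sum y, mirroring alpha_{X,sum(Y)}(T)."""
    totals = defaultdict(int)
    for x, y in table:
        totals[x] += y
    return sorted(totals.items())

def lineage_of_aggregate(table, t):
    """Return the tuples of `table` that belong to the group of result tuple t."""
    group_key = t[0]
    return [row for row in table if row[0] == group_key]

# The table T(X, Y) from the example.
T = [("a", 2), ("a", 3), ("b", 6), ("b", 9), ("c", 3), ("d", 4), ("a", 1), ("d", 5)]

result = aggregate_sum(T)
lineage = lineage_of_aggregate(T, ("a", 6))
```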

Tuple lineage for a view

A view definition has many operators

We assume that views are evaluated as a query tree, bottom-up

The lineage in D of a tuple t according to v(D), $v^{-1}_D(t)$, is defined by recursively generalizing tuple lineage for an operator:
Basis: t contributes to itself in V, when the view is just a table
Step: previous definition of an operator
Transitivity: if t1 contributes to t2, and t2 contributes to t3, then t1 contributes to t3

Example

$V = \alpha_{X, sum(Y)}(\sigma_{Y>0}(R \bowtie S))$
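To make the recursive definition concrete, here is a small sketch that traces a tuple of V back through the aggregate, the selection and the join. The schemas R(X, Z) and S(Z, Y), the join on Z, the sample rows and the function names are all assumptions made for this illustration; the slides do not give them.

```python
# Hypothetical source tables: R(X, Z) and S(Z, Y), joined on Z.
R = [("a", 1), ("a", 2), ("b", 3)]
S = [(1, 5), (2, -4), (2, 7), (3, 6)]

def join(r_rows, s_rows):
    """Natural join on Z, producing (x, z, y) rows."""
    return [(x, z, y) for (x, z) in r_rows for (z2, y) in s_rows if z == z2]

def select_positive(rows):
    """Selection sigma_{Y>0}."""
    return [row for row in rows if row[2] > 0]

def view(r_rows, s_rows):
    """Evaluate the view bottom-up: join, select, then group by X and sum Y."""
    totals = {}
    for x, z, y in select_positive(join(r_rows, s_rows)):
        totals[x] = totals.get(x, 0) + y
    return sorted(totals.items())

def trace(r_rows, s_rows, t):
    """Trace view tuple t back to the contributing tuples of R and S."""
    # Aggregate step: the lineage is the group of intermediate rows with t's X value.
    group_rows = [row for row in select_positive(join(r_rows, s_rows)) if row[0] == t[0]]
    # Selection step: each surviving row is its own lineage.
    # Join step: project each row back onto the operand schemas.
    r_star = sorted({(x, z) for x, z, y in group_rows})
    s_star = sorted({(z, y) for x, z, y in group_rows})
    return r_star, s_star

V = view(R, S)
r_star, s_star = trace(R, S, ("a", 12))
```

Note that (2, -4) joins with (a, 2) but is filtered out by the selection, so it does not appear in the lineage.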

Canonical form for ASPJ views

Any aggregate-select-project-join (ASPJ) view can be transformed to an equivalent canonical form

The canonical form consists of nested ASPJ segments of the form agg-project-select-join

Example: The Promising View is canonical, with two levels

Segment 1

Segment 2

Canonicalization Algorithm

Lineage Tracing Query

Let D be a database instance, let v be a view definition, and let $t \in v(D)$

Then $TQ_{t,v}$ is a lineage tracing query if $TQ_{t,v}(D) = v^{-1}_D(t)$

And for a set T, $TQ_{T,v}(D)$

Lineage Tracing Query for one-level ASPJ Views

Consider a query in canonical form

The tracing query for a tuple t is

And for a set T

split turns the table into multiple tables with projections
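The formulas on this slide are not preserved in the transcript, so the following is only a sketch of the idea under stated assumptions: the tracing query keeps the joined rows that satisfy the view's selection condition and agree with t on the group-by attributes, then split projects those rows back onto each source schema. Rows are modelled as dicts; the function names, schemas and sample rows are invented for illustration.

```python
def split(rows, schemas):
    """Project rows (dicts) onto each source schema, dropping duplicates."""
    tables = []
    for attrs in schemas:
        projected_rows = []
        for row in rows:
            projected = tuple(row[a] for a in attrs)
            if projected not in projected_rows:
                projected_rows.append(projected)
        tables.append(projected_rows)
    return tables

def tracing_query(joined, condition, group_attrs, t, schemas):
    """Assumed shape of a one-level tracing query: select, match the group, split."""
    selected = [
        row for row in joined
        if condition(row) and all(row[g] == t[g] for g in group_attrs)
    ]
    return split(selected, schemas)

# Hypothetical join result of R(X, Z) and S(Z, Y).
joined = [
    {"X": "a", "Z": 1, "Y": 5},
    {"X": "a", "Z": 2, "Y": 7},
    {"X": "a", "Z": 2, "Y": -4},
    {"X": "a", "Z": 1, "Y": 9},
    {"X": "b", "Z": 3, "Y": 6},
]

lineage = tracing_query(
    joined,
    condition=lambda row: row["Y"] > 0,
    group_attrs=("X",),
    t={"X": "a"},
    schemas=[("X", "Z"), ("Z", "Y")],
)
```

The result is one lineage table per source, which is why a single traced tuple yields a list of tables rather than a single table.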

Lineage Tracing for multi-level Views

Example - Tracing the lineage of view Promising

Auxiliary Views

Motivation

In a distributed environment, querying data sources is a difficult problem:
Access costs
Network costs
Not always accessible

Storing auxiliary views in the warehouse can help

What should we store?

Scope

We deal with one-level SPJ views only

Extensions to ASPJ views and to multi-level ASPJ views are straightforward and are given in [Cui and Widom, DMDW 2000]

Tracing query trees for SPJ views

[Figure: query trees for the view and for the tracing query]

Method 1: Store Nothing (N)

A degenerate case where no auxiliary views are stored

User view is

Lineage tracing query is

Very low storage costs
No aux. view storage or aux. view updating costs
Tracing query has large costs, particularly network
User view has maintenance cost

Method 2: Base Tables (BT)

Auxiliary views are base tables after selection, BTi

User view is

Lineage tracing query is

High storage costs, tables are large (even after selection)
Maintenance of aux. views is fast (unprocessed tables)
Tracing query has processing costs but not network costs
User view has to be maintained

Method 3: Lineage View (LV)

Auxiliary view:

User view is

Lineage tracing query is (query tree (a))

Large storage costs (for the join)
Maintenance of the lineage view is expensive
Very good tracing performance, since LV appears as-is in the tracing query
Maintenance of the lineage view helps in maintaining the user view
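The auxiliary view formula for LV is not preserved in the transcript. As a rough sketch, assume the stored lineage view holds the selected join of the source tables, so tracing reduces to a semi-join of LV with the traced view tuples followed by projections onto the source schemas. The attribute names, sample rows and helper functions below are hypothetical.

```python
def semi_join(lineage_view, traced, view_attrs):
    """Keep the LV rows that match some traced tuple on the view attributes."""
    keys = {tuple(t[a] for a in view_attrs) for t in traced}
    return [row for row in lineage_view if tuple(row[a] for a in view_attrs) in keys]

def project(rows, attrs):
    """Project dict rows onto attrs, dropping duplicates and keeping order."""
    out = []
    for row in rows:
        projected = tuple(row[a] for a in attrs)
        if projected not in out:
            out.append(projected)
    return out

# Hypothetical stored lineage view over R(X, Z) and S(Z, Y); the user view exposes (X, Y).
LV = [
    {"X": "a", "Z": 1, "Y": 5},
    {"X": "a", "Z": 2, "Y": 7},
    {"X": "b", "Z": 3, "Y": 5},
]

traced = [{"X": "a", "Y": 5}]
matching = semi_join(LV, traced, view_attrs=("X", "Y"))
r_lineage = project(matching, ("X", "Z"))
s_lineage = project(matching, ("Z", "Y"))
```

No source table is contacted during tracing in this sketch, which is the intuition behind the low tracing cost and the high storage cost listed for this method.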

Performance

Method 4: Store Split Lineage Tables (SLT)

Auxiliary views (Ti are source tables):

User view is

Lineage tracing query is:

Usually small storage costs (LV is not materialized)
Same maintenance cost as in method LV
Tracing cost is low, yet higher than LV because more than a simple semi-join is performed

Very good when LV joins are large

Method 5: Store Partial Base Tables (PBT)

Auxiliary views (V is the user view):

User view is

Lineage tracing query is:

Smaller storage compared to BT
Maintenance is costly, since the user view has to be maintained before the aux. views
Tracing benefits from operating on small tables

Method 6: Store Base Tables Projections (BP)

Auxiliary views (Ai includes key atts., atts. projected in V and atts. involved in the join):

User view is

Lineage tracing query is:

Small storage due to usually small tables
Cheap maintenance (tables, not join, are maintained)
However, source tables have to be queried in tracing, rendering tracing relatively expensive

What is the assumption here?

Method 7: Store Linear View Projections (LP)

Auxiliary views (A are atts. in V, Ki are key atts. in Ti):

User view is

Lineage tracing query is:

Small storage due to small tables
Maintenance cost is higher than in BP, due to the join
Small tracing cost, but sources have to be queried

Performance

Self maintainability

Previous results show how to store more data in order to make views self-maintainable [Quass, Gupta, Mumick and Widom 1996]

Done using… auxiliary views
Maintenance is done using delta relations
Methods 5, 6, 7 have a self-maintainable version: S-PBT, S-BP, S-LP

Experiments

Storage Costs

Total time (including user-view maintenance)

Cost Model

Maintenance / Tracing cost:

Disk cost * num of I/Os + Trans cost * num of transmitted bytes + Msg cost * num of network messages
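The cost model is a simple weighted sum, which can be written as a one-line function. The unit costs and counts in the usage lines are made-up numbers for illustration only; the slides do not give the values used in the experiments.

```python
def total_cost(num_ios, num_bytes, num_messages, disk_cost, trans_cost, msg_cost):
    """Weighted sum of disk I/Os, transmitted bytes and network messages."""
    return disk_cost * num_ios + trans_cost * num_bytes + msg_cost * num_messages

# Hypothetical unit costs and workload, chosen only to show the arithmetic.
local_tracing = total_cost(num_ios=5, num_bytes=0, num_messages=0,
                           disk_cost=10, trans_cost=2, msg_cost=100)
remote_tracing = total_cost(num_ios=5, num_bytes=1000, num_messages=3,
                            disk_cost=10, trans_cost=2, msg_cost=100)
```

With these numbers the remote variant is dominated by the transmission and message terms, which matches the qualitative remark that the Store Nothing method has large network costs.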

Impact of table size on storage

Impact of table size on time

Conclusions

Results In Brief

Definitions and problem formulation

Lineage tracing:
For an operator
For views in a canonical form

Auxiliary views

Performance study
