Supporting Streaming Updates in an Active Data Warehouse Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis, Alkis Simitsis, Nils-Erik Frantzell


Page 1: Supporting Streaming Updates in an Active Data Warehouse

Supporting Streaming Updates in an Active Data Warehouse

Neoklis Polyzotis, Spiros Skiadopoulos, Panos Vassiliadis, Alkis Simitsis, Nils-Erik Frantzell

Page 2: Supporting Streaming Updates in an Active Data Warehouse

ICDE 2007, Constantinople 18/4/2007 2

Forecast

• Problem in active data warehousing:
– the join between a fast stream of source updates and a disk-based relation under the constraint of limited memory
• Solution:
– the mesh join, a novel join operator that operates under minimum assumptions for the stream and the relation
• Features:
– a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate

Page 3: Supporting Streaming Updates in an Active Data Warehouse

Roadmap

• Motivation & Problem statement
• The Mesh-Join Algorithm
• Cost model & Tuning
• Experiments
• Conclusions


Page 5: Supporting Streaming Updates in an Active Data Warehouse

ETL workflows

[Figure: an example ETL workflow running from the Sources through the DSA to the DW. Labels that survive extraction: source tables S1_PARTSUPP and S2_PARTSUPP with transfers FTP1 and FTP2; staging tables DS.PS_NEW1, DS.PS_OLD1, DS.PS_NEW2, DS.PS_OLD2, DS.PS1, DS.PS2; activities DIFF1 and DIFF2 (on PKEY), Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), SK1 and SK2 (lookup of LOOKUP_PS.SKEY), $2€ (COST), A2EDate (DATE), AddDate (DATE=SYSDATE), NotNULL, CheckQTY (QTY>0) and U, with several "Log rejected" outputs; target table DW.PARTSUPP; Aggregate1 (PKEY, DAY, MIN(COST)) and Aggregate2 (PKEY, MONTH, AVG(COST)); views V1 and V2.]

Page 6: Supporting Streaming Updates in an Active Data Warehouse

Active Data Warehousing

• Traditionally, data warehouse refreshment has been performed off-line, through Extraction-Transformation-Loading (ETL) software
• Active Data Warehousing refers to a new trend where data warehouses are updated as frequently as possible, to accommodate the high demands of users for fresh data

Page 7: Supporting Streaming Updates in an Active Data Warehouse

Issues around Active Warehousing

• Smooth upgrade of the software at the (legacy) source
– minimal modification of the software configuration at the source side
• Minimal overhead of the source system
• No data losses are allowed in the long run
• Maximum freshness of data
– the response time for the transport, cleaning, transformation and loading of a new source record to the DW should be small and predictable
• Scalability at the warehouse side
– the architecture should scale up with respect to the number of sources and data consumers at the DW
– if possible, cover issues like checkpointing, index maintenance

Page 8: Supporting Streaming Updates in an Active Data Warehouse

Grand view of an Active DW

[Figure: overall architecture of an active data warehouse. Labels that survive extraction: "Real-time stream of updates" from source relation S (source S1); in the DSA, a load shedder and a join module with relation R, inside "(Active) ETL activities for regular DW load"; "Data to be loaded" and "DW refreshment" / "Real-time DW refreshment" into the DW; an "Active ETL workflow for approximate, real-time reporting"; "Off-line synchronization" and "Offline update of reports". The reports are illustrated by a product life-cycle chart (Total Market Sales over Time, with Introductory, Growth, Maturity and Decline stages) and a column of sample values.]

Page 9: Supporting Streaming Updates in an Active Data Warehouse

Problem statement

• Joining a fast stream of updates with a persistent relation within limited memory bounds is of particular importance in the Active Warehousing setting
• Example practical cases:
– Surrogate Key assignment
– Duplicate detection
– …

Page 10: Supporting Streaming Updates in an Active Data Warehouse

Example: Surrogate Key

Sources, ETL, DW

R1 (source)
id  descr
10  coke
20  pepsi

R2 (source)
id  descr
10  pepsi
20  fanta

Lookup
id  source  skey
10  R1      100
20  R1      110
10  R2      110
20  R2      120

RDW (warehouse)
id   descr
100  coke
110  pepsi
120  fanta
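The surrogate-key example above amounts to an equality join between each source row and the Lookup table on (id, source). A minimal sketch in Python, using the values from the slide; the function name `assign_surrogate_keys` and the dictionary representation of Lookup are illustrative assumptions, not part of the presentation:

```python
# Lookup table from the slide: (production id, source) -> surrogate key.
LOOKUP = {
    (10, "R1"): 100,
    (20, "R1"): 110,
    (10, "R2"): 110,
    (20, "R2"): 120,
}


def assign_surrogate_keys(rows, source, lookup=LOOKUP):
    """Replace each production id with its surrogate key.

    This is an equi-join of the source rows with the lookup table
    on the pair (id, source). The name is hypothetical.
    """
    out = []
    for prod_id, descr in rows:
        skey = lookup[(prod_id, source)]
        out.append((skey, descr))
    return out


r1 = [(10, "coke"), (20, "pepsi")]
r2 = [(10, "pepsi"), (20, "fanta")]

# Union of both sources after key replacement, duplicates removed.
rdw = sorted(set(assign_surrogate_keys(r1, "R1") + assign_surrogate_keys(r2, "R2")))
print(rdw)  # [(100, 'coke'), (110, 'pepsi'), (120, 'fanta')]
```

In the active setting the source rows arrive as a stream and the Lookup table lives on disk, which is what turns this lookup into the stream-relation join studied in the rest of the talk.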


Page 12: Supporting Streaming Updates in an Active Data Warehouse

Operation of Mesh-Join

[Figure: three snapshots of the join module between Stream S and Relation R, where R has two pages, p1 and p2.
t = 0: page p1 and stream tuple s1 are in the join module.
t = 1: page p2 and stream tuple s2 are in the join module; s1 is marked "already joined with p1".
t = 2: the scan of R resumes with p1 and stream tuple s3 enters; s2 is marked "already joined with p2".]

Page 13: Supporting Streaming Updates in an Active Data Warehouse

(Not really any) Assumptions

• No assumption of any order in either the stream or the relation
• No indexes are necessarily present
• Limited memory is available
• The join condition is arbitrary (equality, similarity, range, etc.)
• The join relationship is general (i.e., many-to-many, one-to-many, or many-to-one)
• The result is exact.

… But …
• The relation remains fixed throughout the join

Page 14: Supporting Streaming Updates in an Active Data Warehouse

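Architecture of Mesh-Join

[Figure: components of the operator. Stream S is read w tuples at a time into a buffer; the tuples go through a hash function into Hash H, and Queue Q holds groups of w pointers to them. Relation R is read b pages at a time into a buffer. The Join component combines the buffered pages of R with H and produces the Output Stream.]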
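The interplay of Queue Q, Hash H and the cyclic scan of R can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the authors' implementation: it supports only equality joins, represents R as a list of pages (each a list of tuples), assumes that b divides the number of pages, and the names `mesh_join`, `key_s` and `key_r` are hypothetical.

```python
from collections import defaultdict, deque
from itertools import islice


def mesh_join(stream, pages, w, b, key_s, key_r):
    """Yield (s, r) pairs for an equi-join of a stream with a paged relation.

    Each iteration admits up to w stream tuples and scans the next b pages
    of R. A stream tuple is expired after it has met every page of R.
    """
    assert len(pages) % b == 0, "sketch assumes b divides the page count"
    k = len(pages) // b          # iterations a stream tuple must "see"
    queue = deque()              # Queue Q: batches of stream tuples, oldest first
    table = defaultdict(list)    # Hash H: join key -> stream tuples in memory
    stream = iter(stream)
    pos = 0                      # first page of the next chunk in the cyclic scan

    while True:
        # Expire the oldest batch once it has been joined with all of R.
        if len(queue) == k:
            for s in queue.popleft():
                bucket = table[key_s(s)]
                bucket.remove(s)
                if not bucket:
                    del table[key_s(s)]

        batch = list(islice(stream, w))
        if not batch and not any(queue):
            return               # stream exhausted and nothing left in memory
        queue.append(batch)
        for s in batch:
            table[key_s(s)].append(s)

        # Probe H with every tuple of the b pages currently buffered.
        for page in pages[pos:pos + b]:
            for r in page:
                for s in table.get(key_r(r), ()):
                    yield (s, r)
        pos = (pos + b) % len(pages)


# Usage: R has two pages; stream tuples are (name, key), R tuples are (key, value).
pages = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]
stream = [("x", 1), ("y", 3), ("z", 2)]
results = list(mesh_join(stream, pages, w=1, b=1,
                         key_s=lambda s: s[1], key_r=lambda r: r[0]))
print(len(results))  # 4
```

Because every batch stays in memory for exactly k = N_R / b iterations, the result is exact without any ordering or index on either input; the price is that Q and H must hold about w · N_R / b stream tuples at a time.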


Page 17: Supporting Streaming Updates in an Active Data Warehouse

Critical issues

• The important measures are:
– the stream rate λ
– the available memory M
– the service rate μ of the join
• The main challenge is to interrelate these metrics in a cost formula, so as to be able to tune the system:
– minimize M, given a desirable rate μ
– maximize μ, given a constraint on the available memory M

Page 18: Supporting Streaming Updates in an Active Data Warehouse

Cost model: Memory wrt b, s

[Equation image not recovered. Its annotations label four memory terms: the size of the b buffers, the size of the w buffers, the size of queue Q, and the size of hash H. A further annotation reads: N_R / b = # iterations a stream tuple must "see".]
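The four annotated terms suggest a memory formula of roughly the following shape. This is a sketch reconstructed from the annotations alone, not the slide's own equation: v_P (bytes per page of R), v_S (bytes per stream tuple), c_ptr (bytes per queue pointer) and f (space overhead factor of the hash table) are assumed symbols that do not appear in the transcript.

```latex
M \;\approx\;
  \underbrace{b \, v_P}_{\text{$b$ buffers}}
  \;+\; \underbrace{w \, v_S}_{\text{$w$ buffers}}
  \;+\; \underbrace{w \, \frac{N_R}{b} \, c_{ptr}}_{\text{queue } Q}
  \;+\; \underbrace{f \, w \, \frac{N_R}{b} \, v_S}_{\text{hash } H}
```

The last two terms follow from the annotation on N_R / b: if w tuples enter per iteration and each must stay for N_R / b iterations, then about w · N_R / b stream tuples are resident in Q and H at any time.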

Page 19: Supporting Streaming Updates in an Active Data Warehouse

Cost model: cost of an iteration wrt b, s

Page 20: Supporting Streaming Updates in an Active Data Warehouse

Cost model

Cloop = function (w, b)

M = function (w, b)

Interrelated M, μ, λ via w, s

Page 21: Supporting Streaming Updates in an Active Data Warehouse

Tuning: M,μ as a function of b

Page 22: Supporting Streaming Updates in an Active Data Warehouse

Minimize M, given a desirable rate μ

• Minimize w => minimize M
• Minimum w_min = λ · c_loop
  In this case λ = μ
• Thus, M is a function only of b, computed by simple calculus
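The tuning recipe can be illustrated numerically. The join consumes w stream tuples per iteration, so its service rate is μ = w / c_loop; setting μ = λ gives the smallest admissible w, after which M depends on b alone. The sketch below is hypothetical throughout: the linear forms of `c_loop` and `memory`, every constant in `Params`, and the brute-force search over b (in place of the calculus mentioned on the slide) are assumptions made for illustration, not values or formulas from the presentation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    # All values are hypothetical placeholders.
    n_pages: int = 1000      # N_R: pages of relation R
    page_bytes: int = 4096   # bytes per page of R
    tuple_bytes: int = 32    # bytes per stream tuple
    ptr_bytes: int = 8       # bytes per pointer in queue Q
    fudge: float = 1.5       # hash-table space overhead factor
    c_fixed: float = 1e-3    # seconds of fixed overhead per iteration
    c_page: float = 2e-4     # seconds per page of R read and probed
    c_tuple: float = 1e-6    # seconds per stream tuple admitted


def c_loop(w, b, p):
    """Assumed linear cost (seconds) of one iteration."""
    return p.c_fixed + p.c_page * b + p.c_tuple * w


def w_min(lam, b, p):
    """Smallest w with service rate w / c_loop(w, b) equal to lam.

    Solves w = lam * c_loop(w, b) in closed form; w is treated as continuous.
    """
    slack = 1.0 - lam * p.c_tuple
    if slack <= 0:
        raise ValueError("stream rate too high for any w under this cost model")
    return lam * (p.c_fixed + p.c_page * b) / slack


def memory(w, b, p):
    """Assumed memory (bytes): page buffer + tuple buffer + queue Q + hash H."""
    iterations = p.n_pages / b   # iterations a stream tuple must "see"
    return (b * p.page_bytes
            + w * p.tuple_bytes
            + w * iterations * p.ptr_bytes
            + p.fudge * w * iterations * p.tuple_bytes)


def tune(lam, p):
    """Return (b, w, M) minimizing memory while keeping up with rate lam."""
    best_b = min(range(1, p.n_pages + 1),
                 key=lambda b: memory(w_min(lam, b, p), b, p))
    w = w_min(lam, best_b, p)
    return best_b, w, memory(w, best_b, p)


p = Params()
best_b, w, mem = tune(10_000, p)
print(best_b, round(w, 1), round(mem))
```

Under these assumed constants the minimum falls at an interior value of b: small b keeps the page buffer small but forces each stream tuple to stay in memory for many iterations, while large b does the opposite.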


Page 24: Supporting Streaming Updates in an Active Data Warehouse

Experimental methodology

• Synthetic data set: Zipf distribution, skew in [0,1], 10% of R as available memory, 3.5M rows, domain of 1.35M values
• Real data set: cloud cover data, 10M rows, domain of 36,000 values
• Competitor: Index Nested Loops (INL) join, based on a clustered B+-tree, in Berkeley DB
• Platform: Pentium IV 3GHz, 1GB main memory, 7200 RPM disk
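For readers who want to reproduce a similar synthetic workload, join keys with a Zipf-like distribution and a skew parameter in [0,1] can be drawn as follows. This is only a sketch: the slides do not say how the authors generated their data or which input was skewed, and the function names and the small sizes used here are illustrative assumptions.

```python
import random
from itertools import accumulate


def zipf_weights(domain_size, skew):
    """Unnormalized Zipf weights: the value of rank i gets weight 1 / i**skew.

    skew = 0 gives a uniform distribution; skew = 1 is the classic Zipf law.
    """
    return [1.0 / (rank ** skew) for rank in range(1, domain_size + 1)]


def zipf_keys(n, domain_size, skew, seed=0):
    """Draw n keys from range(domain_size) with Zipf-distributed frequencies."""
    rng = random.Random(seed)
    cumulative = list(accumulate(zipf_weights(domain_size, skew)))
    return rng.choices(range(domain_size), cum_weights=cumulative, k=n)


# Small illustrative sizes; the experiments on the slide use millions of rows.
keys = zipf_keys(n=1000, domain_size=100, skew=1.0)
print(len(keys), min(keys), max(keys))
```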

Page 25: Supporting Streaming Updates in an Active Data Warehouse

Predicted and measured performance (synthetic data)

Page 26: Supporting Streaming Updates in an Active Data Warehouse

Performance for varying memory (synthetic data)

Page 27: Supporting Streaming Updates in an Active Data Warehouse

Performance for varying data skew (synthetic data)

Page 28: Supporting Streaming Updates in an Active Data Warehouse

Performance for varying memory (real-life data)


Page 30: Supporting Streaming Updates in an Active Data Warehouse

Conclusions

• We have proposed the mesh join, a join operator particularly fit for active data warehousing that operates under minimum assumptions for the stream and the relation
• We have presented a cost model and tuning methodology that accurately associates memory consumption with the incoming stream rate

Page 31: Supporting Streaming Updates in an Active Data Warehouse

Other capabilities & Possible extensions

• Approximate processing
• Ordered join output
• Tuning for join conditions other than equality
• Dynamic tuning for changes in the stream rate
• Possible Extensions
– multi-way joins
– other active ETL operators

Page 32: Supporting Streaming Updates in an Active Data Warehouse

Thank you for your attention!

… many thanks to our hosts!

This research was co-funded by the European Union in the framework of the program “Pythagoras II” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF).

Figures of the Antikythera mechanism by Rupert Russell. URL: http://www.giant.net.au/users/rupert/kythera/kythera.htm

Page 33: Supporting Streaming Updates in an Active Data Warehouse

Questions?

Page 34: Supporting Streaming Updates in an Active Data Warehouse

Backup Slides

Page 35: Supporting Streaming Updates in an Active Data Warehouse

Related work

• Applications of Symmetric Hash-Joins over windows of streaming inputs that fit in main memory
– Chandrasekaran, Franklin @ VLDBJ, 2003
– Golab, Ozsu @ VLDB 2003
– Hammad, Franklin, Aref, Elmagarmid @ VLDB 2003
– Viglas, Naughton, Burger @ VLDB 2003
• Joins of streamed bounded relations: XJoin variants that flush overflow tuples to disk
– Dittrich, Seeger, Taylor, Widmayer @ VLDB 2002
– Tao, Yiu, Papadias, Hadjieleftheriou, Mamoulis @ SIGMOD 2005

Page 36: Supporting Streaming Updates in an Active Data Warehouse

Involved Measures

Page 37: Supporting Streaming Updates in an Active Data Warehouse

Cost model

[Plot not recovered. Surviving labels: I/O per second; I/O per stream tuple.]

Page 38: Supporting Streaming Updates in an Active Data Warehouse

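Loops of Mesh Join

[Figure: two snapshots of the Mesh Join loop, each showing the Join component, Queue Q, Hash H with its hash function, and the two buffers. In the first snapshot the buffered page is π(k), the queue holds pointers to ω(1), ..., ω(k), and ω(k) is being hashed. In the second snapshot the page and window indices are expressed in terms of k and N_R, but the labels are garbled in extraction and cannot be recovered reliably; two slots are marked "empty".]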