Efficient Snapshot Differential Algorithms for Data Warehousing
• Wilburt Juan Labio, Hector Garcia-Molina
Purpose
• detect modifications from information source
• extract modifications from information source
• information source is not sophisticated (e.g., legacy system)
[Diagram: modifications flow from the LocalDB at the information source to the DataWarehouse]
Problem Outline
• file containing distinct records
• {R1, R2, …Rn}, where Ri is <Ki, Bi>
• given two snapshots F1 and F2 produce modifications and Fout
• possible modifications generated:
– <update, Ki, B'i>
– <delete, Ki>
– <insert, Ki, Bi>
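Ignoring IO costs, the desired input/output behavior can be sketched with in-memory dictionaries (a toy illustration; in the paper the snapshots are files too large for memory, which is what the algorithms below address):

```python
def snapshot_diff(f1, f2):
    """Compare two snapshots (dicts K -> B) and emit modification messages."""
    mods = []
    for k, b in f2.items():
        if k not in f1:
            mods.append(("insert", k, b))       # new key
        elif f1[k] != b:
            mods.append(("update", k, b))       # same key, new B value
    for k in f1:
        if k not in f2:
            mods.append(("delete", k))          # key vanished
    return mods

f1 = {"K1": "B1", "K2": "B2", "K3": "B3"}
f2 = {"K1": "B1", "K3": "B3x", "K4": "B4"}
print(snapshot_diff(f1, f2))
# → [('update', 'K3', 'B3x'), ('insert', 'K4', 'B4'), ('delete', 'K2')]
```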
Difficulties
• physical location of a record may differ between snapshots
• wasted messages:
– useless delete-insert pairs
• introduce waste, but not incorrectness
• delete then insert the same record: should do nothing
• delete then insert a record with <K, B'>: should be an update
– useless insert-delete pairs
• introduce a correctness problem
• insert then delete the same record: should do nothing
• insert <K, B'> then delete the record with K: should be an update
Example: with physical movement
Ft-1              Ft
Ki    Bi          Ki    Bi
Ki+1  Bi+1        Ki+3  Bi+3
Ki+2  Bi+2        Ki+2  Bi+2
Ki+3  Bi+3        Ki+4  B'i+4
Ki+4  Bi+4        Ki+5  Bi+5
Ki+5  Bi+5        Kj    Bj
Ki+6  Bi+6        Ki+6  Bi+6

Modifications made:
<delete, Ki+1>
<update, Ki+4, B'i+4>
<insert, Kj, Bj>
Example: wasted messages
Ft-1              Ft
Ki    Bi          Ki+7  Bi+7
Ki+1  Bi+1        Ki+3  Bi+3
Ki+2  Bi+2        Ki+2  Bi+2
Ki+3  Bi+3        Ki+4  B'i+4
Ki+4  Bi+4        Ki+6  Bi+6
Ki+5  Bi+5        Kj    Bj
Ki+6  Bi+6        Ki+5  B'i+5
Ki+7  Bi+7        Ki    Bi

useless insert-delete:
<insert, Ki+3, Bi+3> then <delete, Ki+3>
or: <insert, Ki+4, B'i+4> then <delete, Ki+4>

useless delete-insert:
<delete, Ki> then <insert, Ki, Bi>
or: <delete, Ki+5> then <insert, Ki+5, B'i+5>
Related Solutions
• maintain log of modifications
• add timestamp to base table
• joins
Proposed Solutions
• alter the extraction application: legacy code is hard to modify
• parse the system log: requires DBA privileges to obtain the log
• snapshot differential
[Diagram: snapshot differential — File_t-1 and File_t are fed to a differ, producing File_out, which is sent to the data warehouse]
Algorithm Compromises
• related to joins, but cost less
• allow some useless delete-insert pairs
• change all insert-delete pairs to delete-insert pairs
• batch and send all deletes first
• may miss a few modifications
• save file for next snapshot differential
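The "batch and send all deletes first" compromise amounts to reordering the output stream before it is sent; a minimal sketch (names illustrative):

```python
def batch_deletes_first(mods):
    """Reorder modification messages so every delete precedes all inserts
    and updates, turning any insert-delete pair on the same key into a
    safe delete-insert pair."""
    deletes = [m for m in mods if m[0] == "delete"]
    rest = [m for m in mods if m[0] != "delete"]
    return deletes + rest

mods = [("insert", "K3", "B'3"), ("delete", "K3"), ("update", "K4", "B'4")]
print(batch_deletes_first(mods))
# → [('delete', 'K3'), ('insert', 'K3', "B'3"), ('update', 'K4', "B'4")]
```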
Sort Merge Join I
• part I: sort the two input files
– save sorted file from previous snapshot
– use multi-way merge sort for F2
• creates runs: sequences of blocks with sorted records
• merge runs until 1 run remains
• 4 * |F2| IO operations, assuming |F2|^(1/2) < |M|
• part II: merge takes |F1| + |F2| IO operations
Sort Merge Join II
• reduce IO operations
• reuse sorted F1 from previous differential
• part I: produce sorted runs for F2
– sort F2 into runs Fruns
• creates runs: sequences of blocks with sorted records
• 2 * |F2| IO operations, assuming |F2|^(1/2) < |M|
• part II: create sorted F2 while merging files
– merge takes |F1| + 2 * |F2| IO operations
• read into memory 1 block from each run in Fruns
• select the record with the smallest K value
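Run generation (part I) and the smallest-K selection across run heads (part II) can be sketched in memory; `heapq.merge` plays the role of holding one block from each run and repeatedly emitting the record with the smallest key (a toy version, assuming records are (K, B) tuples):

```python
import heapq

def make_runs(records, run_size):
    """Part I: sort fixed-size chunks of F2 into sorted runs (Fruns).
    run_size stands in for the number of records that fit in memory."""
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]

def merge_runs(runs):
    """Part II: repeatedly pick the record with the smallest K across the
    heads of all runs, yielding F2 in fully sorted order."""
    return list(heapq.merge(*runs))

f2 = [(5, "B5"), (1, "B1"), (4, "B4"), (2, "B2"), (3, "B3")]
runs = make_runs(f2, run_size=2)
print(merge_runs(runs))
# → [(1, 'B1'), (2, 'B2'), (3, 'B3'), (4, 'B4'), (5, 'B5')]
```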
Ex. Expected Number of Good Days
• let n = 32, # records in F = 1,789,570
• P(collision) = 2^-n = E
• P(no error) = (1 - E)^records(F)
• N(good days) = 1/(1 - P(no error)) = 2,430 snapshot comparisons
• if file size increases, then increase n
Extending ad hoc join Algorithms
• |F|: # of blocks in a file
• |M|: # of blocks in memory
• Sort Merge Join I: |F1| + 5 * |F2| IO
• Sort Merge Join II: |F1| + 4 * |F2| IO
• Partitioned Hash Join: |F1| + 3 * |F2| IO
Compression Technique
• reduce record size => reduce IO
• lossy compression:
– higher compression
– different uncompressed values may be mapped to the same compressed value
• compress object of b bits into n bits, b > n
• 2^b/2^n values mapped to each compressed value
• P(collision) = ((2^b/2^n) - 1)/2^b ≈ 2^-n = E
• P(no error) = (1 - E)^records(F)
• N(good days) = (1 - P(no error)) * Σ_{i>=1} i * P(no error)^(i-1) = 1/(1 - P(no error))
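A quick arithmetic check of the good-days example above (n = 32 bits, 1,789,570 records) lands near the slide's figure of 2,430 comparisons:

```python
n = 32                            # compressed record size in bits
records = 1_789_570               # records in F
E = 2 ** -n                       # P(collision) for one compressed record
p_no_error = (1 - E) ** records   # P(no error) per snapshot comparison
n_good_days = 1 / (1 - p_no_error)
print(round(n_good_days))         # roughly 2,400 good comparisons
```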
Outer Join with Compression <K,B>
|f1| + 3*|F2| + |f2| IO
• sort F2 into runs: f2runs
• r1 = f1.pop()
• r2 = f2runs.pop()
• while((r1 != null) ∨ (r2 != null))
– if((r1 == null) ∨ ((r2 != null) ∧ (r1.K > r2.K))) /* insert */
• Fout.put(insert, r2.K, r2.B)
• f2sort.put(r2.K, compress(r2.B))
• r2 = f2runs.pop()
– else if((r2 == null) ∨ (r1.K < r2.K)) /* delete */
• Fout.put(delete, r1.K)
• r1 = f1.pop()
– else /* r1.K == r2.K */
• if(r1.b != compress(r2.B)) /* update */
– Fout.put(update, r2.K, r2.B)
• f2sort.put(r2.K, compress(r2.B))
• r1 = f1.pop()
• r2 = f2runs.pop()
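A runnable in-memory version of this merge, with Python lists standing in for the files and a one-byte checksum as the lossy compress() (the names and the checksum choice are illustrative):

```python
def compress(b):
    # toy lossy compression: fold the bytes of B into one byte
    # (a stand-in for the n-bit checksum the slides describe)
    return sum(b.encode()) % 251

def outer_join_diff(f1, f2_sorted):
    """f1: previous snapshot as sorted (K, b) pairs, b = compressed B.
    f2_sorted: new snapshot as sorted (K, B) pairs.
    Returns (modifications, compressed f2sort for the next differential)."""
    out, f2sort = [], []
    i = j = 0
    while i < len(f1) or j < len(f2_sorted):
        r1 = f1[i] if i < len(f1) else None
        r2 = f2_sorted[j] if j < len(f2_sorted) else None
        if r1 is None or (r2 is not None and r1[0] > r2[0]):   # insert
            out.append(("insert", r2[0], r2[1]))
            f2sort.append((r2[0], compress(r2[1])))
            j += 1
        elif r2 is None or r1[0] < r2[0]:                      # delete
            out.append(("delete", r1[0]))
            i += 1
        else:                                                  # r1.K == r2.K
            if r1[1] != compress(r2[1]):                       # update
                out.append(("update", r2[0], r2[1]))
            f2sort.append((r2[0], compress(r2[1])))
            i += 1
            j += 1
    return out, f2sort

f1 = [(1, compress("a")), (2, compress("b")), (4, compress("d"))]
f2 = [(1, "a"), (2, "bb"), (3, "c")]
mods, f2sort = outer_join_diff(f1, f2)
print(mods)   # → [('update', 2, 'bb'), ('insert', 3, 'c'), ('delete', 4)]
```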
Outer Join with Compression <K,b,p>
|f1| + |F2| + 3*|f2sort| + U + I IO
• compress F2 during creation of sorted runs into f2runs
• r1 = f1.pop()
• r2 = f2runs.pop() /* p: pointer to the full record */
• while((r1 != null) ∨ (r2 != null))
– if((r1 == null) ∨ ((r2 != null) ∧ (r1.K > r2.K))) /* insert */
• Fout.put(insert, r2.K, getTuple(r2.p).B) /* I: IOs to fetch full records */
• f2sort.put(r2.K, r2.b, r2.p)
• r2 = f2runs.pop()
– else if((r2 == null) ∨ (r1.K < r2.K)) /* delete */
• Fout.put(delete, r1.K)
• r1 = f1.pop()
– else /* r1.K == r2.K */
• if(r1.b != r2.b) /* update */
– Fout.put(update, r2.K, getTuple(r2.p).B) /* U: IOs to fetch full records */
• f2sort.put(r2.K, r2.b, r2.p)
• r1 = f1.pop()
• r2 = f2runs.pop()
Partitioned hash Outer Join
• <K,B> compression: |f1| + 3*|F2| + |f2sort| IO
• <K,b,p> compression: |f1| + |F2| + 2*|f2sort| + I + U IO
Window Algorithm
• reads each snapshot only once
• assumes records do not move much
• divide memory into four parts:
– input buffers 1 and 2
– aging buffers 1 and 2
• |f1| + |F2| IO
• distance between snapshots:
– sum of absolute values of position distances, for matching records
– normalized by the maximum distance for the snapshots
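One plausible reading of this distance measure (matching records by key and normalizing by a maximum possible total displacement; the exact normalization is an assumption):

```python
def snapshot_distance(f1, f2):
    """Sum of |position difference| over records present in both snapshots
    (lists of (K, B) pairs), normalized by the maximum total displacement."""
    pos1 = {k: i for i, (k, _) in enumerate(f1)}
    pos2 = {k: i for i, (k, _) in enumerate(f2)}
    common = pos1.keys() & pos2.keys()
    total = sum(abs(pos1[k] - pos2[k]) for k in common)
    # assumed normalizer: each matching record could move at most
    # (longest snapshot length - 1) positions
    max_total = len(common) * (max(len(f1), len(f2)) - 1)
    return total / max_total if max_total else 0.0

f1 = [(1, "a"), (2, "b"), (3, "c")]
f2 = [(3, "c"), (2, "b"), (1, "a")]   # records fully reversed
print(snapshot_distance(f1, f2))      # 4/6 ≈ 0.67
```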
[Diagram: memory divided into InputBuffer 1, InputBuffer 2, AgingBuffer 1, and AgingBuffer 2; blocks are transferred from DISK into the input buffers, and unmatched records age into memory buckets with head/tail pointers]