Efficient Snapshot Differential Algorithms for Data Warehousing
• Wilburt Juan Labio, Hector Garcia-Molina
Purpose
• detect modifications from information source
• extract modifications from information source
• information source is not sophisticated (e.g., legacy system)
[Diagram: modifications flow from the LocalDB at the information source to the DataWarehouse]
Problem Outline
• file containing distinct records
• {R1, R2, …Rn}, where Ri is <Ki, Bi>
• given two snapshots F1 and F2 produce modifications and Fout
• possible modifications generated:
– <update, Ki, B'i>
– <delete, Ki>
– <insert, Ki, Bi>
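Ignoring IO costs, the desired input/output behavior can be sketched with in-memory dictionaries (a toy illustration; in the paper the snapshots are files too large for memory, which is what the algorithms below address):

```python
def snapshot_diff(f1, f2):
    """Compare two snapshots (dicts K -> B) and emit modification messages."""
    mods = []
    for k, b in f2.items():
        if k not in f1:
            mods.append(("insert", k, b))       # new key
        elif f1[k] != b:
            mods.append(("update", k, b))       # same key, new B value
    for k in f1:
        if k not in f2:
            mods.append(("delete", k))          # key vanished
    return mods

f1 = {"K1": "B1", "K2": "B2", "K3": "B3"}
f2 = {"K1": "B1", "K3": "B3x", "K4": "B4"}
print(snapshot_diff(f1, f2))
# → [('update', 'K3', 'B3x'), ('insert', 'K4', 'B4'), ('delete', 'K2')]
```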
Difficulties
• physical location of a record may differ between snapshots
• wasted messages:
– useless delete-insert pairs
• introduce waste, but not incorrectness
• delete then insert the same record: should do nothing
• delete then insert a record with <K, B'>: should be an update
– useless insert-delete pairs
• introduce a correctness problem
• insert then delete the same record: should do nothing
• insert <K, B'> then delete the record with K: should be an update
Example: with physical movement
Ft-1              Ft
Ki    Bi          Ki    Bi
Ki+1  Bi+1        Ki+3  Bi+3
Ki+2  Bi+2        Ki+2  Bi+2
Ki+3  Bi+3        Ki+4  B'i+4
Ki+4  Bi+4        Ki+5  Bi+5
Ki+5  Bi+5        Kj    Bj
Ki+6  Bi+6        Ki+6  Bi+6

Modifications made:
<delete, Ki+1>
<update, Ki+4, B'i+4>
<insert, Kj, Bj>
Example: wasted messages
Ft-1              Ft
Ki    Bi          Ki+7  Bi+7
Ki+1  Bi+1        Ki+3  Bi+3
Ki+2  Bi+2        Ki+2  Bi+2
Ki+3  Bi+3        Ki+4  B'i+4
Ki+4  Bi+4        Ki+6  Bi+6
Ki+5  Bi+5        Kj    Bj
Ki+6  Bi+6        Ki+5  B'i+5
Ki+7  Bi+7        Ki    Bi

useless insert-delete:
<insert, Ki+3, Bi+3> then <delete, Ki+3>
or: <insert, Ki+4, B'i+4> then <delete, Ki+4>

useless delete-insert:
<delete, Ki> then <insert, Ki, Bi>
or: <delete, Ki+5> then <insert, Ki+5, B'i+5>
Related Solutions
• maintain log of modifications
• add timestamp to base table
• joins
Proposed Solutions
• alter the extraction application: legacy code is hard to modify
• parse the system log: requires DBA privileges to obtain the log
• snapshot differential
[Diagram: snapshot differential — File_t-1 and File_t are fed to a differ, producing File_out, which is sent to the data warehouse]
Algorithm Compromises
• related to joins, but cost less
• allow some useless delete-insert pairs
• change all insert-delete pairs to delete-insert pairs
• batch and send all deletes first
• may miss a few modifications
• save file for next snapshot differential
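The "batch and send all deletes first" compromise amounts to reordering the output stream before it is sent; a minimal sketch (names illustrative):

```python
def batch_deletes_first(mods):
    """Reorder modification messages so every delete precedes all inserts
    and updates, turning any insert-delete pair on the same key into a
    safe delete-insert pair."""
    deletes = [m for m in mods if m[0] == "delete"]
    rest = [m for m in mods if m[0] != "delete"]
    return deletes + rest

mods = [("insert", "K3", "B'3"), ("delete", "K3"), ("update", "K4", "B'4")]
print(batch_deletes_first(mods))
# → [('delete', 'K3'), ('insert', 'K3', "B'3"), ('update', 'K4', "B'4")]
```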
Sort Merge Join I
• part I: sort the two input files
– save sorted file from previous snapshot
– use multi-way merge sort for F2
• creates runs: sequences of blocks with sorted records
• merge runs until 1 run remains
• 4 * |F2| IO operations, assuming |F2|^(1/2) < |M|
• part II: merge takes |F1| + |F2| IO operations
Sort Merge Join II
• reduce IO operations
• reuse sorted F1 from previous differential
• part I: produce sorted runs for F2
– sort F2 into runs Fruns
• creates runs: sequences of blocks with sorted records
• 2 * |F2| IO operations, assuming |F2|^(1/2) < |M|
• part II: create sorted F2 while merging files
– merge takes |F1| + 2 * |F2| IO operations
• read into memory 1 block from each run in Fruns
• select the record with the smallest K value
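Run generation (part I) and the smallest-K selection across run heads (part II) can be sketched in memory; `heapq.merge` plays the role of holding one block from each run and repeatedly emitting the record with the smallest key (a toy version, assuming records are (K, B) tuples):

```python
import heapq

def make_runs(records, run_size):
    """Part I: sort fixed-size chunks of F2 into sorted runs (Fruns).
    run_size stands in for the number of records that fit in memory."""
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]

def merge_runs(runs):
    """Part II: repeatedly pick the record with the smallest K across the
    heads of all runs, yielding F2 in fully sorted order."""
    return list(heapq.merge(*runs))

f2 = [(5, "B5"), (1, "B1"), (4, "B4"), (2, "B2"), (3, "B3")]
runs = make_runs(f2, run_size=2)
print(merge_runs(runs))
# → [(1, 'B1'), (2, 'B2'), (3, 'B3'), (4, 'B4'), (5, 'B5')]
```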
Ex. Expected Number of Good Days
• let n = 32, # records in F = 1,789,570
• P(collision) = 2^-n = E
• P(no error) = (1 - E)^records(F)
• N(good days) = 1/(1 - P(no error)) = 2,430 snapshot comparisons
• if file size increases, then increase n
Extending ad hoc join Algorithms
• |F|: # of blocks in a file
• |M|: # of blocks in memory
• Sort Merge Join I: |F1| + 5 * |F2| IO
• Sort Merge Join II: |F1| + 4 * |F2| IO
• Partitioned Hash Join: |F1| + 3 * |F2| IO
Compression Technique
• reduce record size => reduce IO
• lossy compression:
– higher compression
– different uncompressed values may be mapped to the same compressed value
• compress object of b bits into n bits, b > n
• 2^b/2^n values mapped to each compressed value
• P(collision) = ((2^b/2^n) - 1)/2^b ≈ 2^-n = E
• P(no error) = (1 - E)^records(F)
• N(good days) = (1 - P(no error)) * Σ_{i>=1} i * P(no error)^(i-1) = 1/(1 - P(no error))
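A quick arithmetic check of the good-days example above (n = 32 bits, 1,789,570 records) lands near the slide's figure of 2,430 comparisons:

```python
n = 32                            # compressed record size in bits
records = 1_789_570               # records in F
E = 2 ** -n                       # P(collision) for one compressed record
p_no_error = (1 - E) ** records   # P(no error) per snapshot comparison
n_good_days = 1 / (1 - p_no_error)
print(round(n_good_days))         # roughly 2,400 good comparisons
```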
Outer Join with Compression <K,B>
|f1| + 3*|F2| + |f2| IO
• sort F2 into runs: f2runs
• r1 = f1.pop()
• r2 = f2runs.pop()
• while((r1 != null) ∨ (r2 != null))
– if((r1 == null) ∨ ((r2 != null) ∧ (r1.K > r2.K))) /* insert */
• Fout.put(insert, r2.K, r2.B)
• f2sort.put(r2.K, compress(r2.B))
• r2 = f2runs.pop()
– else if((r2 == null) ∨ (r1.K < r2.K)) /* delete */
• Fout.put(delete, r1.K)
• r1 = f1.pop()
– else /* r1.K == r2.K */
• if(r1.b != compress(r2.B)) /* update */
– Fout.put(update, r2.K, r2.B)
• f2sort.put(r2.K, compress(r2.B))
• r1 = f1.pop()
• r2 = f2runs.pop()
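A runnable in-memory version of this merge, with Python lists standing in for the files and a one-byte checksum as the lossy compress() (the names and the checksum choice are illustrative):

```python
def compress(b):
    # toy lossy compression: fold the bytes of B into one byte
    # (a stand-in for the n-bit checksum the slides describe)
    return sum(b.encode()) % 251

def outer_join_diff(f1, f2_sorted):
    """f1: previous snapshot as sorted (K, b) pairs, b = compressed B.
    f2_sorted: new snapshot as sorted (K, B) pairs.
    Returns (modifications, compressed f2sort for the next differential)."""
    out, f2sort = [], []
    i = j = 0
    while i < len(f1) or j < len(f2_sorted):
        r1 = f1[i] if i < len(f1) else None
        r2 = f2_sorted[j] if j < len(f2_sorted) else None
        if r1 is None or (r2 is not None and r1[0] > r2[0]):   # insert
            out.append(("insert", r2[0], r2[1]))
            f2sort.append((r2[0], compress(r2[1])))
            j += 1
        elif r2 is None or r1[0] < r2[0]:                      # delete
            out.append(("delete", r1[0]))
            i += 1
        else:                                                  # r1.K == r2.K
            if r1[1] != compress(r2[1]):                       # update
                out.append(("update", r2[0], r2[1]))
            f2sort.append((r2[0], compress(r2[1])))
            i += 1
            j += 1
    return out, f2sort

f1 = [(1, compress("a")), (2, compress("b")), (4, compress("d"))]
f2 = [(1, "a"), (2, "bb"), (3, "c")]
mods, f2sort = outer_join_diff(f1, f2)
print(mods)   # → [('update', 2, 'bb'), ('insert', 3, 'c'), ('delete', 4)]
```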
Outer Join with Compression <K,b,p>
|f1| + |F2| + 3*|f2sort| + U + I IO
• compress F2 during creation of sorted runs into f2runs
• r1 = f1.pop()
• r2 = f2runs.pop() /* p: pointer to the full record */
• while((r1 != null) ∨ (r2 != null))
– if((r1 == null) ∨ ((r2 != null) ∧ (r1.K > r2.K))) /* insert */
• Fout.put(insert, r2.K, getTuple(r2.p).B) /* I: IOs to fetch full records */
• f2sort.put(r2.K, r2.b, r2.p)
• r2 = f2runs.pop()
– else if((r2 == null) ∨ (r1.K < r2.K)) /* delete */
• Fout.put(delete, r1.K)
• r1 = f1.pop()
– else /* r1.K == r2.K */
• if(r1.b != r2.b) /* update */
– Fout.put(update, r2.K, getTuple(r2.p).B) /* U: IOs to fetch full records */
• f2sort.put(r2.K, r2.b, r2.p)
• r1 = f1.pop()
• r2 = f2runs.pop()
Partitioned hash Outer Join
• <K,B> compression: |f1| + 3*|F2| + |f2sort| IO
• <K,b,p> compression: |f1| + |F2| + 2*|f2sort| + I + U IO
Window Algorithm
• reads each snapshot only once
• assumes records do not move much
• divide memory into four parts:
– input buffers 1 and 2
– aging buffers 1 and 2
• |f1| + |F2| IO
• distance between snapshots:
– sum of absolute values of position distances, for matching records
– normalized by the maximum distance for the snapshots
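One plausible reading of this distance measure (matching records by key and normalizing by a maximum possible total displacement; the exact normalization is an assumption):

```python
def snapshot_distance(f1, f2):
    """Sum of |position difference| over records present in both snapshots
    (lists of (K, B) pairs), normalized by the maximum total displacement."""
    pos1 = {k: i for i, (k, _) in enumerate(f1)}
    pos2 = {k: i for i, (k, _) in enumerate(f2)}
    common = pos1.keys() & pos2.keys()
    total = sum(abs(pos1[k] - pos2[k]) for k in common)
    # assumed normalizer: each matching record could move at most
    # (longest snapshot length - 1) positions
    max_total = len(common) * (max(len(f1), len(f2)) - 1)
    return total / max_total if max_total else 0.0

f1 = [(1, "a"), (2, "b"), (3, "c")]
f2 = [(3, "c"), (2, "b"), (1, "a")]   # records fully reversed
print(snapshot_distance(f1, f2))      # 4/6 ≈ 0.67
```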
[Diagram: memory divided into InputBuffer 1, InputBuffer 2, AgingBuffer 1, and AgingBuffer 2; blocks are transferred from DISK into the input buffers, and unmatched records age into memory buckets with head/tail pointers]