Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules

S.D. Lee, David W. Cheung, Ben Kao
The University of Hong Kong

Data Mining and Knowledge Discovery, 1998

Revised and Presented by Matthew Starbuck: April 2, 2014



Outline

Definitions
Old Algorithms
    Apriori
    FUP2
New Algorithm: DELI
    Design
    Pseudo code
    Sampling Techniques
Experiments (Comparisons) showing DELI is better
Consecutive runs / Conclusions / Exam Questions

Definitions (1)

D = Transaction Set

I = Full Item Set

[Figure: example transactions T1, T2, T3]

Definitions (2)

For the example itemset X: σX = support count = 4

Support = 4/5 = 80%

Support threshold: s%

Definitions (3)

k-itemset: an itemset containing k items.

Large itemset: an itemset whose support is no less than the support threshold s%. Lk denotes the set of large k-itemsets.
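The definitions above can be illustrated with a short Python sketch. The five-transaction database below is made up for illustration (the slide's own example did not survive in the transcript), and the sketch assumes that an itemset whose support exactly equals s% counts as large.

```python
from typing import FrozenSet, Iterable, List, Set

Transaction = FrozenSet[str]


def support_count(db: Iterable[Transaction], itemset: Set[str]) -> int:
    """Number of transactions in db that contain every item of itemset."""
    target = frozenset(itemset)
    return sum(1 for t in db if target <= t)


def is_large(db: List[Transaction], itemset: Set[str], s_percent: float) -> bool:
    """True when the support of itemset reaches the threshold s% (assumed >=)."""
    return support_count(db, itemset) * 100 >= s_percent * len(db)


# Hypothetical database D with 5 transactions (not the one on the slide).
D = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"c", "d"},
)]
count_a = support_count(D, {"a"})        # support count of the 1-itemset {a}
large_ab = is_large(D, {"a", "b"}, 40)   # is the 2-itemset {a, b} large at s% = 40%?
```

Here {a} appears in 4 of the 5 transactions, which matches the 4/5 = 80% arithmetic on the previous slide.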


Old Algorithm 1: Apriori

Pseudo code of Apriori

get C1;
k = 1;
until (Ck is empty || Lk is empty)
do
{
    Get Lk from Ck using minimum support count;
    Use apriori_gen() to generate Ck+1 from Lk;
    k++;
};
return union(all Lk);

Ck = Candidate Set
Lk = Large Set

[Figure: worked example of Apriori with s% = 40%, using apriori_gen() at each level]
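The Apriori loop and apriori_gen() can be sketched in Python as below. This is a minimal, unoptimised sketch: the database is hypothetical, the join step is a simple pairwise union rather than the prefix-based join usually described for Apriori, and an itemset is assumed to be large when its count is at least s% of |D|.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def apriori_gen(large_k: Set[Itemset]) -> Set[Itemset]:
    """Join large k-itemsets into (k+1)-candidates, then prune any candidate
    that has a k-subset which is not large."""
    candidates = set()
    for a in large_k:
        for b in large_k:
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(sub) in large_k for sub in combinations(union, len(a))
            ):
                candidates.add(union)
    return candidates


def apriori(db: List[Itemset], s_percent: float) -> Dict[Itemset, int]:
    """Return every large itemset of db together with its support count."""
    min_count = s_percent * len(db) / 100
    candidates = {frozenset([item]) for t in db for item in t}  # C1
    large: Dict[Itemset, int] = {}
    while candidates:
        # One scan of the database per level to count all candidates.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        large_k = {c for c, n in counts.items() if n >= min_count}
        if not large_k:
            break
        large.update({c: counts[c] for c in large_k})
        candidates = apriori_gen(large_k)  # Ck+1 from Lk
    return large


# Hypothetical database; s% = 40% as on the slide.
D = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"c", "d"},
)]
result = apriori(D, 40)
```

On this toy database the candidate {a, b, c} is generated by the join but pruned, because its subset {b, c} is not large.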


Use Apriori in Maintenance

Simply apply the algorithm to the updated database again;
Not efficient;
Fails to reuse the results of previous mining;
Very costly.

Old Algorithm 2: FUP2

FUP2 works similarly to Apriori by generating large itemsets iteratively;

For itemsets that were already large, it scans only the updated part of the database;

For the rest, it scans the whole database.

Δ-: set of deleted transactions
Δ+: set of added transactions
D: old database
D': updated database
D*: set of unchanged transactions

σX: support count of itemset X (in D)
σ'X: new support count of itemset X (in D')
δX-: support count of itemset X in Δ-
δX+: support count of itemset X in Δ+

Pseudo code of FUP2

get C1; k = 1;
until (Ck is empty || Lk' is empty)
do {
    divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck – Pk;
    For X in Pk, calculate σ'X = σX - δX- + δX+ and get part 1 of Lk';
    For X in Qk, eliminate candidates with δX+ - δX- < (|Δ+| - |Δ-|) s%;
    For the remaining candidates X in Qk, scan D* to get part 2 of Lk';
    Use apriori_gen() to generate Ck+1 from Lk';
    k++;
};
return union(all Lk');

[Figure: Ck is partitioned into Pk (the part inside Lk) and Qk; D (σX) consists of Δ- (δX-) and D*; D' (σ'X) consists of D* and Δ+ (δX+)]

An Example on FUP2
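The slide's worked example did not survive in the transcript, so the following is a stand-in: a Python sketch of the FUP2 pseudo code run on a made-up update. The names fup2, d_star, d_minus and d_plus are illustrative, the data are hypothetical, and large is assumed to mean a count of at least s% of |D'|.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]
DB = List[Itemset]


def count(db: DB, x: Itemset) -> int:
    return sum(1 for t in db if x <= t)


def apriori_gen(large_k: Set[Itemset]) -> Set[Itemset]:
    out = set()
    for a in large_k:
        for b in large_k:
            u = a | b
            if len(u) == len(a) + 1 and all(
                frozenset(s) in large_k for s in combinations(u, len(a))
            ):
                out.add(u)
    return out


def fup2(old_large: Dict[Itemset, int], d_star: DB, d_minus: DB, d_plus: DB,
         s_percent: float) -> Dict[Itemset, int]:
    """old_large maps each large itemset of the old database D to sigma_X."""
    min_count = s_percent * (len(d_star) + len(d_plus)) / 100   # s% of |D'|
    slack = s_percent * (len(d_plus) - len(d_minus)) / 100
    items = {i for db in (d_star, d_minus, d_plus) for t in db for i in t}
    candidates = {frozenset([i]) for i in items}                # C1
    new_large: Dict[Itemset, int] = {}
    while candidates:
        level: Dict[Itemset, int] = {}
        for x in candidates:
            dm, dp = count(d_minus, x), count(d_plus, x)
            if x in old_large:            # x in Pk: no scan of D* needed
                new_count = old_large[x] - dm + dp
            elif dp - dm < slack:         # x in Qk and cannot be large in D'
                continue
            else:                         # x in Qk: scan D*
                new_count = count(d_star, x) + dp
            if new_count >= min_count:
                level[x] = new_count
        if not level:
            break
        new_large.update(level)
        candidates = apriori_gen(set(level))
    return new_large


fs = frozenset
d_star = [fs("abc"), fs("ab"), fs("ac"), fs("abd")]   # unchanged transactions
d_minus = [fs("cd")]                                  # deleted from D
d_plus = [fs("bc"), fs("bcd")]                        # added to form D'
old_large = {fs("a"): 4, fs("b"): 3, fs("c"): 3, fs("d"): 2,
             fs("ab"): 3, fs("ac"): 2}                # large itemsets of D at 40%
updated = fup2(old_large, d_star, d_minus, d_plus, 40)
```

In this run only {b, c} requires a scan of D*; every other candidate is either resolved from the stored counts or discarded.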


DELI Algorithm: Difference Estimation for Large Itemsets

Key idea: when the update is small, examine only a sample of the database instead of scanning all of it.

Basic pseudo code of DELI

get C1; k = 1;
until (Ck is empty || Lk' is empty)
do {
    divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck – Pk;
    For X in Pk, calculate σ'X = σX - δX- + δX+ and get part 1 of Lk';
    For X in Qk, eliminate candidates with δX+ - δX- < (|Δ+| - |Δ-|) s%;
    For the remaining candidates X in Qk, scan D* to get part 2 of Lk';   <-- DELI: a sample subset of D*
    Use apriori_gen() to generate Ck+1 from Lk';
    k++;
};
return union(all Lk');

Binomial Distribution

Assume 5% of the population is green-eyed.

You pick 500 people randomly with replacement.

The total number of green-eyed people you pick is a random variable X which follows a binomial distribution with n = 500 and p = 0.05.
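For the green-eyed example, the mean and variance follow directly from n and p. The short Python sketch below computes them along with one point of the probability mass function; the helper name binomial_pmf is illustrative.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 500, 0.05
mean = n * p                      # expected number of green-eyed people: np
variance = n * p * (1 - p)        # np(1 - p)
prob_25 = binomial_pmf(25, n, p)  # probability of picking exactly 25
```

The mean is 25 and the variance 23.75, and 25 is also the most likely single outcome.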

[Figure: binomial distribution probability mass function, from http://en.wikipedia.org/wiki/Image:Binomial_distribution_pmf.png]

Sampling Techniques (1)

Consider an arbitrary itemset X;
Randomly select m transactions from D with replacement;
TX = the number of selected transactions (out of m) that contain X;
TX is binomially distributed with
    p = σX / |D|
    n = m
    Mean = np = (m / |D|) σX
    Variance = np(1-p)

Sampling Techniques (2)

TX is approximately normally distributed with
    Mean = (m / |D|) σX
    Variance = mp(1 - p)

Define: σX^ = (|D| / m) TX

σX^ is approximately normally distributed with
    Mean = σX
    Variance = σX (|D| - σX) / m
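The estimator σX^ can be sketched in a few lines of Python. The database and the function names are hypothetical; sampling with replacement is done with random.Random.choice, and the variance helper is the formula from the slide.

```python
import random
from typing import FrozenSet, List

Itemset = FrozenSet[str]


def estimate_support_count(db: List[Itemset], x: Itemset, m: int,
                           rng: random.Random) -> float:
    """sigma_hat_X = (|D| / m) * T_X, where T_X is the number of the m
    transactions drawn with replacement that contain x."""
    t_x = sum(1 for _ in range(m) if x <= rng.choice(db))
    return len(db) / m * t_x


def estimator_variance(sigma_x: float, db_size: int, m: int) -> float:
    """Variance of sigma_hat_X: sigma_X (|D| - sigma_X) / m."""
    return sigma_x * (db_size - sigma_x) / m


# Hypothetical database: 200 of 1000 transactions contain item "a".
db = [frozenset("ab")] * 200 + [frozenset("b")] * 800
rng = random.Random(0)  # fixed seed so repeated runs draw the same sample
estimate = estimate_support_count(db, frozenset("a"), 100, rng)
std_dev = estimator_variance(200, len(db), 100) ** 0.5  # 40 transactions
```

With m = 100 the estimate of σX = 200 has a standard deviation of 40, so a single run can be off by a noticeable margin; a larger m tightens it.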

Confidence Interval

[Figure: normal curve centred at Mean = σX, with the interval [ax, bx] and an area of α/2 in each tail]

Sampling Techniques (3)

We can obtain a 100(1-α)% confidence interval [ax, bx] for σX where

    ax = σX^ - zα/2 √( σX^ (|D| - σX^) / m )
    bx = σX^ + zα/2 √( σX^ (|D| - σX^) / m )

Typical Values:
    For α = 0.1, zα/2 = 1.645
    For α = 0.05, zα/2 = 1.960
    For α = 0.01, zα/2 = 2.576
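The interval can be computed as in the sketch below. It assumes the usual normal-approximation interval, with the variance from the previous slide evaluated at the estimate σX^; the function names and the example numbers are illustrative. The z values come from the standard library's NormalDist.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def z_value(alpha: float) -> float:
    """Critical value z_{alpha/2} of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def confidence_interval(sigma_hat: float, db_size: int, m: int,
                        alpha: float) -> Tuple[float, float]:
    """100(1 - alpha)% interval [a_X, b_X] for the support count sigma_X."""
    half = z_value(alpha) * sqrt(sigma_hat * (db_size - sigma_hat) / m)
    return sigma_hat - half, sigma_hat + half


# Hypothetical numbers: estimate of 2000 in |D| = 100000 from m = 20000 draws.
a_x, b_x = confidence_interval(2000, 100_000, 20_000, 0.05)
```

Rounded to three decimals, z_value reproduces the typical values listed on the slide.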

Sampling Techniques (4)

The width of this interval is bx - ax = 2 zα/2 √( σX^ (|D| - σX^) / m )

The widths of all confidence intervals are no more than [formula not recovered]

Suppose we want the widths not to exceed [formula not recovered]

Sampling Techniques (5)

If s = 2 and α = 0.05, then zα/2 = 1.96

Solving the above inequality gives m ≥ 18823.84.

This value is independent of the size of the database D!

Note: D may contain billions of transactions. A sample of around 19 thousand is large enough for the desired accuracy in this example.
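Why m does not depend on |D| can be seen from a small Python sketch. It rests on an assumption, since the slide's own inequality was not captured: the width is bounded by its worst case, σX^ (|D| - σX^) ≤ |D|² / 4, and the target width is stated as a fraction of |D|. The fraction 0.02 below is a hypothetical value and is not meant to reproduce the 18823.84 figure.

```python
def worst_case_width(db_size: int, m: float, z: float) -> float:
    """Upper bound on the interval width, using sigma(|D| - sigma) <= |D|^2 / 4."""
    return z * db_size / m ** 0.5


def required_sample_size(z: float, rel_width: float) -> float:
    """Smallest m for which the worst-case width is at most rel_width * |D|.
    |D| cancels out, so it is not a parameter."""
    return (z / rel_width) ** 2


m_needed = required_sample_size(1.96, 0.02)           # hypothetical target: 2% of |D|
width_small = worst_case_width(10**5, m_needed, 1.96)  # |D| = 100 thousand
width_huge = worst_case_width(10**9, m_needed, 1.96)   # |D| = 1 billion
```

The same m gives a width of about 2% of |D| for both database sizes, which is the point the slide makes.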


Obtain the estimated set of Lk

Lk»: large in D and D';
Lk>: not large in D, large in D' with a certain confidence;
Lk≈: not large in D, maybe large in D';
Lk': approximation of new Lk. Lk' = Lk» ∪ Lk> ∪ Lk≈

[Figure: diagram relating Lk, Lk', Ck, Qk, Pk, Lk», Lk> and Lk≈]
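One way to sort the sampled candidates into Lk> and Lk≈ is to compare the confidence interval of the new support count with the minimum count. The rule below is an assumption made for illustration (the slide does not spell it out): an interval lying entirely at or above the threshold goes to Lk>, and an interval that straddles the threshold goes to Lk≈.

```python
from typing import Tuple


def classify(interval: Tuple[float, float], threshold: float) -> str:
    """Place a candidate from Qk by the interval [low, high] of its new count."""
    low, high = interval
    if low >= threshold:
        return "L>"         # large in D' with the chosen confidence
    if high >= threshold:
        return "L~"         # maybe large in D'
    return "not large"      # dropped from the estimate Lk'


threshold = 2000.0                            # hypothetical s% of |D'|
confident = classify((2100.0, 2500.0), threshold)
uncertain = classify((1900.0, 2300.0), threshold)
dropped = classify((1500.0, 1900.0), threshold)
```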


Criteria met to perform a full update

Degree of uncertainty
    uk = |Lk≈| / |Lk'|, the uncertainty factor
    ūk is a user-specified threshold
    If uk ≥ ūk, then DELI halts and FUP2 is needed

Amount of changes (symmetric difference)
    ηk = |Lk – Lk'|
    ξk = |Lk>| + |Lk≈|
    dk = Σ(j=1..k) (ηj + ξj) / |Lj|
    d̄k is a user-specified threshold
    If dk ≥ d̄k, then DELI halts and FUP2 is needed

Pseudo code of DELI

get C1; k = 1;
until (Ck is empty || Lk' is empty)
do {
    divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck – Pk;
    For X in Pk, calculate σ'X = σX - δX- + δX+ and get part 1 of Lk';
    For X in Qk, eliminate candidates with δX+ - δX- < (|Δ+| - |Δ-|) s%;
    For the remaining candidates X in Qk, scan a sample subset of D* to get part 2 of Lk';
    Use apriori_gen() to generate Ck+1 from Lk';
    If any criterion is met, then terminate and go to FUP2;
    k++;
};
return union(all Lk');
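A single iteration of the loop above might look like the Python sketch below. It is a sketch under several assumptions rather than the authors' implementation: the sample is drawn from D* with replacement, the interval is the normal-approximation interval for the count in D* shifted by the exact δX+, candidates are classified by comparing that interval with s% of |D'|, and only the uncertainty factor uk is computed. All names and data are hypothetical.

```python
import random
from math import sqrt
from typing import Dict, FrozenSet, List, Set, Tuple

Itemset = FrozenSet[str]
DB = List[Itemset]


def count(db: DB, x: Itemset) -> int:
    return sum(1 for t in db if x <= t)


def deli_level(candidates: Set[Itemset], old_large: Dict[Itemset, int],
               d_star: DB, d_minus: DB, d_plus: DB, s_percent: float,
               m: int, z: float, rng: random.Random
               ) -> Tuple[Set[Itemset], Set[Itemset], Set[Itemset], float]:
    """Return (Lk>>, Lk>, Lk~, uk) for one level k."""
    threshold = s_percent * (len(d_star) + len(d_plus)) / 100   # s% of |D'|
    slack = s_percent * (len(d_plus) - len(d_minus)) / 100
    sample = [rng.choice(d_star) for _ in range(m)]             # sample of D*
    both: Set[Itemset] = set()
    new: Set[Itemset] = set()
    maybe: Set[Itemset] = set()
    for x in candidates:
        dm, dp = count(d_minus, x), count(d_plus, x)
        if x in old_large:                       # Pk: exact new count, no scan
            if old_large[x] - dm + dp >= threshold:
                both.add(x)
            continue
        if dp - dm < slack:                      # Qk: cannot be large in D'
            continue
        est = len(d_star) / m * count(sample, x)          # estimated count in D*
        half = z * sqrt(max(est * (len(d_star) - est), 0.0) / m)
        low, high = est + dp - half, est + dp + half      # interval for new count
        if low >= threshold:
            new.add(x)
        elif high >= threshold:
            maybe.add(x)
    estimated = both | new | maybe               # Lk'
    u_k = len(maybe) / len(estimated) if estimated else 0.0
    return both, new, maybe, u_k


# Hypothetical level-1 run.
fs = frozenset
d_star = [fs("ab")] * 60 + [fs("ac")] * 30 + [fs("c")] * 10
d_plus = [fs("bc")] * 10
level = deli_level({fs("a"), fs("b"), fs("c")}, {fs("a"): 95, fs("b"): 62},
                   d_star, [fs("a")] * 5, d_plus, 40, 50, 1.96, random.Random(7))
halt = level[3] >= 0.05   # compare uk with a threshold such as 0.05
```

If halt is true, the caller would abandon the estimate and run FUP2 on the full data.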


An Improvement

Store the support counts of all 1-itemsets

Extra storage: O(|I|)


Experiment Preparation

Synthetic databases – generate D, Δ+, Δ-

1%-18% of the large itemsets are changed by the updates.

ūk = ∞

d̄k = ∞

Experimental Results (1)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ+| = |Δ-| = 5000, |D| = 100000, s% = 2%]

Experimental Results (2)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ+| = |Δ-| = 5000, |D| = 100000, s% = 2%]

Experimental Results (3)
[Figure not captured. Settings: m = 20000, |Δ+| = |Δ-| = 5000, |D| = 100000, s% = 2%]

Experimental Results (4)
[Figure not captured. Settings: m = 20000, |Δ+| = |Δ-| = 5000, |D| = 100000, s% = 2%]

Experimental Results (5)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, m = 20000, |D| = 100000, s% = 2%]

Experimental Results (6)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, m = 20000, |D| = 100000, s% = 2%]

Experimental Results (7)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ-| = 5000, m = 20000, |D| = 100000, s% = 2%]

Experimental Results (8)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ-| = 5000, m = 20000, |D| = 100000, s% = 2%]

Experimental Results (9)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ+| = 5000, m = 20000, |D| = 100000, s% = 2%]

Experimental Results (10)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ+| = 5000, m = 20000, |D| = 100000, s% = 2%]

Experimental Results (11)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ+| = |Δ-| = 5000, m = 20000, |D| = 100000]

Experimental Results (12)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ+| = |Δ-| = 5000, m = 20000, |D| = 100000]

Experimental Results (13)
[Figure not captured. Settings: α = 0.05, zα/2 = 1.960, |Δ+| = |Δ-| = 5000, m = 20000, s% = 2%]

Experimental Summary

uc- < 0.036 , very low;

when | Δ -| < 10000, dc- < 0.1;

when | Δ -| = 20000, dc- < 0.21;

(Suggested) u- = 0.05, d- = 0.1


Consecutive Runs:

Say we use Apriori to find association rules in a database.
Later, the 1st batch of updates arrives; use DELI to decide whether the rules need to be updated (r).
If r = F, then keep using the old association rules.
When the 2nd batch comes, check both batches together for significant changes.
Since processing the 2nd batch repeats work already done for the 1st batch, we trade some storage space to avoid it: keep every δX+ and δX-.
Repeat for each batch of updates, so that every run can reuse what was stored for the previous batches.

Conclusions

Real-world databases get updated constantly, therefore the knowledge extracted from them changes too.

The authors proposed the DELI algorithm to determine whether the change is significant, so that it knows when to update the extracted association rules.

The algorithm applies sampling techniques and statistical methods to efficiently estimate an approximation of the large itemsets.

Final Exam Questions

Q1: Compare and contrast FUP2 and DELI

Both algorithms are used in Association Analysis;
Goal: DELI decides when to update the association rules while FUP2 provides an efficient way of updating them;
Technique: DELI scans a small portion of the database (sample) and approximates the large itemsets whereas FUP2 scans the whole database and returns the large itemsets exactly;
DELI saves machine resources and time.

Final Exam Questions

Q2: Difference Estimation for Large Itemsets

Q3: Difference between Apriori and FUP2:

Apriori scans the whole database to find association rules, and does not use old data mining results;
For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results.

Thank you!

Now it is discussion time!
