Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules
S.D. Lee, David W. Cheung, Ben Kao
The University of Hong Kong
Data Mining and Knowledge Discovery, 1998
Revised and Presented by Matthew Starbuck: April 2, 2014
Outline
Definitions
Old Algorithms: Apriori, FUP2
New Algorithm: DELI (design, pseudo code, sampling techniques)
Experiments (comparisons) showing DELI is better
Consecutive runs / Conclusions / Exam Questions
Definitions(1)
D = Transaction Set
I = Full Item Set
(Figure: example transactions T1, T2, T3.)
Definitions(2)
For an example itemset X: σX = support count = 4
Support = 4/5 = 80%
Support threshold: s%
Definitions(3)
k-itemset: an itemset containing k items.
Large itemset: an itemset whose support is no less than the support threshold; Lk denotes the set of large k-itemsets.
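The definitions above can be illustrated with a minimal sketch. The toy database `D` and the helper names `support_count` and `is_large` are hypothetical and chosen only for illustration; the "no less than" comparison is an assumption about how the threshold is applied.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def is_large(itemset, transactions, s_percent):
    """True if the support of `itemset` reaches the threshold s%.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return support_count(itemset, transactions) * 100 >= s_percent * len(transactions)


# Hypothetical toy database with five transactions.
D = [frozenset(t) for t in (
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "d"},
    {"c", "d"},
)]

count_a = support_count({"a"}, D)          # item "a" occurs in 4 of 5 transactions
support_a = 100 * count_a / len(D)         # support as a percentage
```

In this toy database the 1-itemset {a} has support count 4 out of 5 transactions, which mirrors the 4/5 = 80% arithmetic on the previous slide.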
Old Algorithm (1)
Apriori
Pseudo code of Apriori
get C1;
k = 1;
until (Ck is empty || Lk is empty)
do
{
Get Lk from Ck using minimum support count;
Use apriori_gen() to generate Ck+1 from Lk;
k++;
} ;
return union(all Lk);
Ck = candidate set; Lk = large set
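The pseudo code above can be sketched in runnable form as follows. This is a minimal illustration, not the authors' implementation: the join-and-prune body of `apriori_gen` follows the usual textbook description, and the toy database `D` and the parameter `min_count` (the minimum support count) are hypothetical.

```python
from itertools import combinations


def apriori_gen(large_k):
    """Generate candidate (k+1)-itemsets from the large k-itemsets.

    Join step: merge two large k-itemsets that differ in exactly one item.
    Prune step: drop a candidate if any of its k-subsets is not large.
    """
    large_k = set(large_k)
    candidates = set()
    for a in large_k:
        for b in large_k:
            union = a | b
            if len(union) == len(a) + 1:
                if all(frozenset(s) in large_k
                       for s in combinations(union, len(a))):
                    candidates.add(union)
    return candidates


def apriori(transactions, min_count):
    """Return a dict mapping every large itemset to its support count."""
    items = {i for t in transactions for i in t}
    candidates = {frozenset([i]) for i in items}          # C1
    all_large = {}
    while candidates:
        # Scan the database once per level to count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        large = {c: n for c, n in counts.items() if n >= min_count}   # Lk
        if not large:
            break
        all_large.update(large)
        candidates = apriori_gen(large.keys())            # Ck+1
    return all_large


# Hypothetical toy database; min_count = 2 corresponds to s% = 40% of 5.
D = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "d"}, {"c", "d"},
)]
result = apriori(D, min_count=2)
```

On this toy input the candidate {a, b, c} is generated by the join step but pruned because its subset {b, c} is not large, so the loop stops after the second level.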
apriori_gen()
(Figure: worked example of apriori_gen() with s% = 40%.)
Use Apriori in Maintenance
Simply apply the algorithm to the updated database again;
Not efficient;
Fails to reuse the results of previous mining;
Very costly.
Old Algorithm 2: FUP2
FUP2 works similarly to Apriori by generating large itemsets iteratively;
It scans only the updated part of the database for old large itemsets;
For the rest, it scans the whole database.
Δ-: set of deleted transactions
Δ+: set of added transactions
D: old database
D’: updated database
D*: set of unchanged transactions
σX: support count of itemset X
σ’X: new support count of itemset X
δX-: support count of itemset X in Δ-
δX+: support count of itemset X in Δ+
Pseudo code of FUP2
get C1; k = 1;
until (Ck is empty || Lk’ is empty)
do {
divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck – Pk;
For X in Pk, calculate σ’X = σX - δX- + δX+ and get part 1 of Lk’;
For X in Qk, eliminate candidates with δX+ - δX- < (|Δ+| - |Δ-|)s%;
For the remaining candidates X in Qk, scan D* to get part 2 of Lk’;
Use apriori_gen() to generate Ck+1 from Lk’;
k++;
};
return union(all Lk’);
(Figure: Ck is split into Pk, the part inside Lk, and Qk. The old database D, with counts σX, consists of Δ- (counts δX-) and D*; the updated database D’, with counts σ’X, consists of D* and Δ+ (counts δX+).)
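A single iteration of the FUP2 loop above can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the function name `fup2_level`, its parameters, and the toy data are hypothetical, and "large" is taken to mean a support count of at least s% of |D’|.

```python
def count(itemset, transactions):
    """Support count of `itemset` in a list of transactions."""
    return sum(1 for t in transactions if itemset <= t)


def fup2_level(candidates, old_large, unchanged, deleted, added, s_percent):
    """One iteration k of FUP2.

    candidates: Ck as a set of frozensets
    old_large:  dict mapping each old large k-itemset X to its old count
    unchanged, deleted, added: D*, the deleted set, and the added set
    Returns a dict mapping each new large k-itemset to its new count.
    """
    new_size = len(unchanged) + len(added)                     # |D'|
    new_large = {}
    for x in candidates:
        d_minus, d_plus = count(x, deleted), count(x, added)
        if x in old_large:                                     # X in Pk
            new_count = old_large[x] - d_minus + d_plus        # no scan of D*
        else:                                                  # X in Qk
            # Pruning test from the pseudo code, in integer arithmetic.
            if (d_plus - d_minus) * 100 < (len(added) - len(deleted)) * s_percent:
                continue
            new_count = count(x, unchanged) + d_plus           # scan D*
        if new_count * 100 >= s_percent * new_size:
            new_large[x] = new_count
    return new_large


# Hypothetical update: one transaction deleted, two added, four unchanged.
unchanged = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "d"})]
deleted = [frozenset({"c", "d"})]
added = [frozenset({"b", "c"}), frozenset({"b", "c", "d"})]
old_L2 = {frozenset("ab"): 3, frozenset("ac"): 2}              # from the old mining run
C2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("ad")}
new_L2 = fup2_level(C2, old_L2, unchanged, deleted, added, s_percent=40)
```

In this toy run {a, b} and {a, c} are updated from stored counts alone, {a, d} is eliminated by the pruning test, and only {b, c} requires a scan of D*.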
An Example on FUP2
DELI Algorithm
Difference Estimation for Large Itemsets
Key idea: DELI examines samples of the database to estimate how much the large itemsets have changed; when the update has not changed them much, a full update is not needed.
Basic pseudo code of DELI
get C1; k = 1;
until (Ck is empty || Lk’ is empty)
do {
divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck – Pk;
For X in Pk, calculate σ’X = σX - δX- + δX+ and get part 1 of Lk’;
For X in Qk, eliminate candidates with δX+ - δX- < (|Δ+| - |Δ-|)s%;
For the remaining candidates X in Qk, scan D* to get part 2 of Lk’;
Use apriori_gen() to generate Ck+1 from Lk’;
k++;
};
return union(all Lk’);
(Slide callout on the "scan D*" step: DELI scans a sample subset of D* instead.)
Binomial Distribution
Assume 5% of the population is green-eyed.
You pick 500 people randomly with replacement.
The total number of green-eyed people you pick is a random variable X which follows a binomial distribution with n = 500 and p = 0.05.
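The green-eyed example can be checked numerically with a short sketch. The helper name `binomial_pmf` is hypothetical; the mean and variance use the standard binomial formulas np and np(1-p).

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 500, 0.05
mean = n * p                      # expected number of green-eyed people
variance = n * p * (1 - p)

# Probability of drawing between 20 and 30 green-eyed people, inclusive.
prob_20_to_30 = sum(binomial_pmf(k, n, p) for k in range(20, 31))

# Sanity check: the probabilities over all possible outcomes sum to 1.
total = sum(binomial_pmf(k, n, p) for k in range(n + 1))
```

With n = 500 and p = 0.05 the expected count is 25 and the variance is 23.75, so most of the probability mass lies within a few units of 25.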
Binomial Distribution
http://en.wikipedia.org/wiki/Image:Binomial_distribution_pmf.png
Sampling Techniques (1)
Consider an arbitrary itemset X;
Randomly select m transactions from D with replacement;
TX = the number of selected transactions that contain X;
TX is binomially distributed with p = σX / |D| and n = m
Mean = np = (m / |D|) σX
Variance = np(1-p)
Sampling Techniques (2)
TX is approximately normally distributed with
Mean = (m / |D|) σX
Variance = mp(1 - p)
Define: σX^ = (|D| / m) TX
σX^ is approximately normally distributed with
Mean = σX
Variance = σX (|D| - σX) / m
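The estimator defined above can be tried on synthetic data. This is a minimal sketch: the function names, the synthetic database, and the fixed random seed are all hypothetical choices made for illustration.

```python
import random


def estimate_support_count(transactions, itemset, m, rng):
    """Draw m transactions with replacement, count those containing
    `itemset` (this is T_X), and scale by |D| / m."""
    hits = sum(1 for _ in range(m) if itemset <= rng.choice(transactions))
    return len(transactions) * hits / m


def estimator_variance(sigma_x, db_size, m):
    """Variance of the estimator: sigma_X * (|D| - sigma_X) / m."""
    return sigma_x * (db_size - sigma_x) / m


# Synthetic database: 10000 transactions, every fifth one contains "x".
D = [frozenset({"x"}) if i % 5 == 0 else frozenset({"y"}) for i in range(10000)]
true_count = sum(1 for t in D if "x" in t)

rng = random.Random(0)            # fixed seed so the run is repeatable
est = estimate_support_count(D, frozenset({"x"}), m=2000, rng=rng)
std_dev = estimator_variance(true_count, len(D), 2000) ** 0.5
```

Here the true support count is 2000 and the standard deviation of the estimate is about 89, so a sample of 2000 transactions typically lands within a couple of hundred of the true count.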
Confidence Interval
(Figure: normal curve centered at Mean = σX, with the interval [ax, bx] and a tail area of α/2 on each side.)
Sampling Techniques (3)
We can obtain a 100(1-α)% confidence interval [ax, bx] for σX where
[equation]
Typical Values:
For α = 0.1, zα/2 = 1.645
For α = 0.05, zα/2 = 1.960
For α = 0.01, zα/2 = 2.576
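The slide's own expression for the interval endpoints is not preserved in this transcript. One standard construction, assuming the normal approximation stated earlier and assuming that the estimate σX^ is plugged into the variance σX(|D| - σX)/m, would be:

```latex
a_X = \hat{\sigma}_X - z_{\alpha/2}\sqrt{\frac{\hat{\sigma}_X\,(|D| - \hat{\sigma}_X)}{m}},
\qquad
b_X = \hat{\sigma}_X + z_{\alpha/2}\sqrt{\frac{\hat{\sigma}_X\,(|D| - \hat{\sigma}_X)}{m}}
```

Under this assumed form the interval is symmetric around the estimate and shrinks in proportion to 1/√m.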
Sampling Techniques (4)
The width of this interval is [equation]
The widths of all confidence intervals are no more than [equation]
Suppose we want the widths not to exceed [equation]
Sampling Techniques (5)
If s = 2 and α = 0.05, then zα/2 = 1.96
Solving the above inequality gives m ≥ 18823.84.
This value is independent of the size of the database D!
Note: D may contain billions of transactions. A sample of around 19 thousand is large enough for the desired accuracy in this example.
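Why the required sample size does not depend on |D| can be seen from a small sketch. The inequality on the slide is not preserved in this transcript, so everything here is an assumption: the width is taken as 2·zα/2·|D|·√(p(1-p)/m), p is bounded by s% = 0.02 (reasonable only for itemsets that were not large in D), and the target width of 0.004·|D| was chosen because it reproduces the figure quoted above. The function name `min_sample_size` is hypothetical.

```python
import math


def min_sample_size(z, p_bound, rel_width):
    """Smallest real m for which the interval width, expressed as a
    fraction of |D|, stays within rel_width.

    Width / |D| = 2 * z * sqrt(p * (1 - p) / m), with p <= p_bound <= 0.5.
    |D| cancels out, so it is not a parameter at all.
    """
    return (2 * z * math.sqrt(p_bound * (1 - p_bound)) / rel_width) ** 2


# Assumed inputs (not confirmed by the transcript): p_bound = s% = 0.02,
# target width = 0.004 * |D|, z = 1.96 for alpha = 0.05.
m_required = min_sample_size(z=1.96, p_bound=0.02, rel_width=0.004)
```

Under these assumed inputs the result is about 18823.84, and because the database size never enters the formula, the same sample size applies whether D holds a hundred thousand or a billion transactions.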
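Obtain the estimated set of Lk
Lk»: large in D and D’;
Lk>: not large in D, large in D’ with a certain confidence;
Lk≈: not large in D, maybe large in D’;
Lk’: approximation of new Lk. Lk’ = Lk» ∪ Lk> ∪ Lk≈
(Figure: Ck is partitioned into Pk and Qk; Lk» comes from Pk, while Lk> and Lk≈ come from Qk; together they form Lk’, shown alongside the old Lk.)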
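One plausible way to sort a sampled candidate into these sets is to compare its confidence interval with the minimum support count. The rule below is an assumption for illustration, since the transcript does not spell out the exact test, and the function name and numeric inputs are hypothetical.

```python
import math


def classify_candidate(est_count, variance, z, min_count):
    """Classify a candidate from Qk using its confidence interval.

    Assumed rule: if the whole interval lies at or above the threshold,
    the itemset is large with confidence (Lk>); if the interval
    straddles the threshold, it may be large (Lk≈); otherwise it is
    treated as not large.
    """
    half_width = z * math.sqrt(variance)
    a_x, b_x = est_count - half_width, est_count + half_width
    if a_x >= min_count:
        return "large"
    if b_x >= min_count:
        return "maybe"
    return "small"


# Hypothetical numbers: threshold of 2000, standard deviation of 100.
labels = [classify_candidate(est, 10000.0, 1.96, 2000)
          for est in (2300.0, 2100.0, 1700.0)]
```

With a half-width of 196, an estimate of 2300 is classified as large, 2100 straddles the threshold and is marked as uncertain, and 1700 is discarded.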
Criteria met to perform a full update
Degree of uncertainty
uk = |Lk≈| / |Lk’|, the uncertainty factor
uk- is a user-specified threshold
If uk ≥ uk-, then DELI halts and FUP2 is needed
Amount of changes (symmetric difference)
ηk = |Lk – Lk’|
ξk = |Lk>| + |Lk≈|
$d_k = \sum_{j=1}^{k} (\eta_j + \xi_j) / |L_j|$
dk- is a user-specified threshold
If dk ≥ dk-, then DELI halts and FUP2 is needed
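The per-level quantities used by these criteria can be computed from the itemset collections directly. This is a minimal sketch with hypothetical function names and toy itemsets; it stops at ηk and ξk and does not aggregate them into dk, and the threshold value 0.05 is the one suggested in the experimental summary.

```python
def change_counts(old_large, still_large, newly_large, maybe_large):
    """Return (eta_k, xi_k, Lk') for one level.

    old_large:   Lk, the large k-itemsets of the old database
    still_large: large in both D and D'
    newly_large: not large in D, large in D' with confidence
    maybe_large: not large in D, possibly large in D'
    """
    new_estimate = still_large | newly_large | maybe_large     # Lk'
    eta = len(old_large - new_estimate)                        # itemsets lost
    xi = len(newly_large) + len(maybe_large)                   # itemsets gained
    return eta, xi, new_estimate


def uncertainty(maybe_large, new_estimate):
    """u_k = |Lk≈| / |Lk'| (defined as 0 when Lk' is empty)."""
    return len(maybe_large) / len(new_estimate) if new_estimate else 0.0


# Toy level: one old itemset dropped, one gained, one uncertain.
old = {"A", "B", "C", "D"}
eta, xi, new_est = change_counts(old, {"A", "B", "C"}, {"E"}, {"F"})
u = uncertainty({"F"}, new_est)
needs_full_update = u >= 0.05
```

In this toy level one of five estimated itemsets is uncertain, so u = 0.2 exceeds the threshold and the sketch would hand over to FUP2.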
Pseudo code of DELI
get C1; k = 1;
until (Ck is empty || Lk’ is empty)
do {
divide Ck into two partitions: Pk = Ck ∩ Lk and Qk = Ck – Pk;
For X in Pk, calculate σ’X = σX - δX- + δX+ and get part 1 of Lk’;
For X in Qk, eliminate candidates with δX+ - δX- < (|Δ+| - |Δ-|)s%;
For the remaining candidates X in Qk, scan a sample subset of D* to get part 2 of Lk’;
Use apriori_gen() to generate Ck+1 from Lk’;
If any criterion is met, then terminate and go to FUP2;
k++;
};
return union(all Lk’);
An Improvement
Store the support counts of all 1-itemsets
Extra storage: O(|I|)
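Keeping the counts of all 1-itemsets up to date is cheap, as the following sketch suggests. The function name and toy data are hypothetical, and the slide does not say how the stored counts are used; a natural reading is that the first iteration can then work with exact counts rather than sampled estimates.

```python
from collections import Counter


def update_item_counts(item_counts, added, deleted):
    """Maintain exact support counts of all 1-itemsets across an update.

    The Counter holds one entry per item, i.e. O(|I|) extra storage.
    """
    for t in added:
        item_counts.update(t)        # +1 for every item in the transaction
    for t in deleted:
        item_counts.subtract(t)      # -1 for every item in the transaction
    return item_counts


# Hypothetical stored counts from the old database and one batch of updates.
counts = Counter({"a": 4, "b": 3, "c": 3, "d": 2})
added = [frozenset({"b", "c"}), frozenset({"b", "c", "d"})]
deleted = [frozenset({"c", "d"})]
update_item_counts(counts, added, deleted)
```

Only the added and deleted transactions are touched, so the cost of the update is proportional to the size of the batch, not to the size of the database.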
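Experiment Preparation
Synthetic databases – generate D, Δ+, Δ-
1%-18% of the large itemsets are changed by the updates.
uk- = ∞
dk- = ∞
(With both thresholds set to ∞, the halting criteria are never met, so DELI always runs to completion.)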
Experimental Results (1)
α= 0.05
z α/2=1.960
|Δ+|=|Δ-|= 5000
|D| = 100000
s% = 2%
Experimental Results (2)
α= 0.05
z α/2=1.960
|Δ+|=|Δ-|= 5000
|D| = 100000
s% = 2%
Experimental Results (3)
m=20000
|Δ+|=|Δ-|= 5000
|D| = 100000
s% = 2%
Experimental Results (4)
m=20000
|Δ+|=|Δ-|= 5000
|D| = 100000
s% = 2%
Experimental Results (5)
α= 0.05
z α/2=1.960
m=20000
|D| = 100000
s% = 2%
Experimental Results (6)
α= 0.05
z α/2=1.960
m=20000
|D| = 100000
s% = 2%
Experimental Results (7)
α= 0.05
z α/2=1.960
|Δ-|= 5000
m = 20000
|D| = 100000
s% = 2%
Experimental Results (8)
α= 0.05
z α/2=1.960
|Δ-|= 5000
m = 20000
|D| = 100000
s% = 2%
Experimental Results (9)
α= 0.05
z α/2=1.960
|Δ+|= 5000
m = 20000
|D| = 100000
s% = 2%
Experimental Results (10)
α= 0.05
z α/2=1.960
|Δ+|= 5000
m = 20000
|D| = 100000
s% = 2%
Experimental Results (11)
α= 0.05
z α/2=1.960
|Δ+|=|Δ-|= 5000
m = 20000
|D| = 100000
Experimental Results (12)
α= 0.05
z α/2=1.960
|Δ+|=|Δ-|= 5000
m = 20000
|D| = 100000
Experimental Results (13)
α= 0.05
z α/2=1.960
|Δ+|=|Δ-|= 5000
m = 20000
s% = 2%
Experimental Summary
uc- < 0.036, very low;
when |Δ-| < 10000, dc- < 0.1;
when |Δ-| = 20000, dc- < 0.21;
(Suggested) u- = 0.05, d- = 0.1
Consecutive Runs:
Say we use Apriori to find association rules in a database.
Later, when the 1st batch of updates arrives, use DELI to decide whether the rules must be updated (result r).
If r = F, keep using the old association rules.
When the 2nd batch comes, check both batches together for significant changes.
Since processing the 2nd batch repeats work already done for the 1st batch, we can spend some storage space to avoid it.
The storage space is used to keep every δX+ and δX-.
Repeat for each batch of updates, so that every run can reuse the counts stored from the previous batches.
Conclusions
Real-world databases are updated constantly, so the knowledge extracted from them changes too.
The authors proposed the DELI algorithm to determine whether the change is significant, so that it is known when the extracted association rules need updating.
The algorithm applies sampling techniques and statistical methods to efficiently estimate an approximate set of large itemsets.
Final Exam Questions
Q1: Compare and contrast FUP2 and DELI
Both algorithms are used in association analysis;
Goal: DELI decides when to update the association rules, while FUP2 provides an efficient way of updating them;
Technique: DELI scans a small portion of the database (a sample) and approximates the large itemsets, whereas FUP2 scans the whole database and returns the large itemsets exactly;
DELI saves machine resources and time.
Final Exam Questions
Q2: Difference Estimation for Large Itemsets
Q3: Difference between Apriori and FUP2:
Apriori scans the whole database to find association rules, and does not use old data mining results;
For most itemsets, FUP2 scans only the updated part of the database and takes advantage of the old association analysis results.
Thank you!
Now it is discussion time!