False Positive or False Negative: Mining Frequent Itemsets from High Speed Transactional Data Streams
Jeffrey Xu Yu, Zhihong Chong, Hongjun Lu, Aoying Zhou
VLDB 2004
Introduction
Mining data streams:
– Data items arrive continuously
– One scan of the data
– Limited memory
– Bounded error
Introduction
This paper develops an algorithm for effectively mining frequent itemsets with a bound on memory consumption.
It takes a false-negative approach.
False Positive
Most existing algorithms for mining frequent itemsets are false-positive oriented:
– They control memory consumption with an error parameter ε
– They allow items with support below the minimum support s but above s − ε to be reported as frequent
– Example: "Approximate frequency counts over data streams" (VLDB 02)
False Positive
Memory bound: O((1/ε) · log(εN))
Dilemma of the false-positive approach:
– The smaller ε is, the fewer false-positive items are included
– But memory consumption grows reciprocally with ε
– In Apriori-style mining, the k-th frequent itemsets generate the (k+1)-th candidate itemsets, so extra false positives compound at each level
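To make the trade-off concrete, here is a minimal sketch of a Lossy-Counting-style false-positive counter in the spirit of the VLDB 02 approach cited above; the bucket layout and the eps parameter follow the standard textbook formulation, not code taken from either paper.

```python
import math

class LossyCounter:
    """False-positive style: may report items with support in [s - eps, s),
    and its memory use grows as eps shrinks (the 1/eps factor in the bound)."""

    def __init__(self, eps):
        self.eps = eps
        self.width = math.ceil(1 / eps)      # bucket width w = ceil(1/eps)
        self.entries = {}                    # item -> (count, max_error)
        self.n = 0                           # items seen so far

    def process(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        count, err = self.entries.get(item, (0, bucket - 1))
        self.entries[item] = (count + 1, err)
        if self.n % self.width == 0:         # at a bucket boundary, prune
            self.entries = {x: (c, e) for x, (c, e) in self.entries.items()
                            if c + e > bucket}

    def frequent_items(self, s):
        # Output every item whose count could reach s*n: no false negatives,
        # but admits false positives with support above (s - eps)*n.
        return [x for x, (c, _) in self.entries.items()
                if c >= (s - self.eps) * self.n]
```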
False Positive & False Negative
[Diagram: a support axis marked with s − ε, s, and s + ε.
False positive: all itemsets with support above s are output; some itemsets with support between s − ε and s are also output.
False negative: all itemsets with support above s + ε are output; some itemsets with support between s and s + ε are output.]
False Negative
Error control and pruning:
– ε : prunes data and controls the error bound; it is changeable
– ε decreases and approaches zero as the number of observations n increases
– s : minimum support
– n : # of observations
False Negative
Memory control:
– δ : a reliability parameter; δ, rather than ε, controls memory consumption
– Memory consumption is related to ln(1/δ)
– This approach never reports a 1-itemset with support below s as frequent
Comparison: False Positive & False Negative
Recall and Precision
A : true frequent itemsets
B : obtained frequent itemsets
– Recall = |A ∩ B| / |A|
– Precision = |A ∩ B| / |B|
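As a quick illustration (a hypothetical helper, not from the paper), recall and precision can be computed directly from the two sets:

```python
def recall_precision(true_frequent, mined_frequent):
    """Recall = |A ∩ B| / |A|, Precision = |A ∩ B| / |B|."""
    A, B = set(true_frequent), set(mined_frequent)
    overlap = len(A & B)
    return overlap / len(A), overlap / len(B)

# A false-positive method tends toward recall 1.0 with precision < 1.0;
# a false-negative method tends toward precision 1.0 with recall < 1.0.
```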
Comparison: False Positive
ε = s/10, δ = 0.1
s (%)   True Size   Mined Size   Recall   Precision
0.08    21,361      126,307      1.00     0.17
0.10    12,252       68,275      1.00     0.18
0.20     2,359       23,154      1.00     0.16
Comparison: False Negative
s + ε is used as the minimum support.
s (%)   True Size   Mined Size   Recall   Precision
0.08    21,361      18,351       0.86     1.00
0.10    12,252      10,411       0.85     1.00
0.20     2,359       1,739       0.74     1.00
Chernoff Bound
The Chernoff bound gives a probabilistic guarantee on the estimation of statistics about the underlying data.
Pr{ T ≥ e · E[T] } ≤ e^(−E[T])
For example: pick a lottery number from 0000, 0001, …, 9999. 1,000,000 people each buy a $1 ticket, so E[#winners] = 100. Then Pr{ T ≥ 273 } ≤ e^(−100).
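A quick numeric check of the lottery example (plain arithmetic, nothing paper-specific):

```python
import math

n, p = 1_000_000, 1 / 10_000        # tickets sold, chance a given ticket wins
expected = n * p                    # E[T] = 100 winners on average
print(math.e * expected)            # e * E[T] ~= 271.8, so 273 is past it
print(math.exp(-expected))          # Chernoff bound e^(-100) ~= 3.7e-44
```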
Chernoff Bound
Bernoulli trials (coin flips):
– Pr[o_i = 1] = p, Pr[o_i = 0] = 1 − p
– r : # of heads in n coin flips
– np : expectation of r
Chernoff bound: Pr{ |r − np| ≥ npγ } ≤ 2·e^(−npγ²/2), for any γ > 0
Chernoff Bound
Take r̄ = r/n as the running support and the minimum support s as p.
Replace sγ with ε and set the right-hand side of the bound to δ:
– Pr{ |RunningSupport − TrueSupport| ≥ ε·n } ≤ δ
Solving for ε gives εn = sqrt( 2·s·ln(2/δ) / n ).
Frequent or Infrequent
A pattern X is potentially infrequent if count(X)/n < s − εn, in terms of n.
A pattern X is potentially frequent if it is not potentially infrequent in terms of n.
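Following the derivation above, a small sketch of the pruning test; the formula for εn is the one reconstructed from these slides, and the sample values of s, δ, and n are only illustrative:

```python
import math

def epsilon_n(s, delta, n):
    """Chernoff-derived error bound: eps_n = sqrt(2 * s * ln(2/delta) / n)."""
    return math.sqrt(2 * s * math.log(2 / delta) / n)

def potentially_infrequent(count, n, s, delta):
    """A pattern X is potentially infrequent if count(X)/n < s - eps_n."""
    return count / n < s - epsilon_n(s, delta, n)

# eps_n shrinks as n grows, so the pruning threshold s - eps_n tightens toward s.
for n in (1_000, 10_000, 100_000):
    print(n, round(epsilon_n(s=0.1, delta=0.1, n=n), 4))
```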
FDPM-1(s, δ)
[Worked example: items (A, C, D, B, …) arrive one at a time from the source stream; each arrival either increments the item's count or inserts it with count 1. When memory is full, a new εn is computed and potentially infrequent items are deleted.]
FDPM-1(s, δ)
The algorithm ensures:
– Items whose true frequency exceeds sN are output with probability at least 1 − δ
– No item whose true frequency is less than sN is output
– The probability that the estimated support equals the true support is no less than 1 − δ
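Putting the pieces together, here is a minimal sketch of a false-negative item counter in the spirit of FDPM-1 as summarized on these slides; the memory cap max_entries and the exact pruning/output rules are illustrative assumptions, and the published pseudo-code may differ in detail.

```python
import math

class FDPM1:
    """Counts single items over a stream, pruning potentially infrequent ones."""

    def __init__(self, s, delta, max_entries=10_000):
        self.s = s                  # minimum support
        self.delta = delta          # reliability parameter
        self.max_entries = max_entries
        self.counts = {}            # running counts of kept items
        self.n = 0                  # observations seen so far

    def _epsilon(self):
        # Chernoff-derived error bound: eps_n = sqrt(2 * s * ln(2/delta) / n)
        return math.sqrt(2 * self.s * math.log(2 / self.delta) / self.n)

    def process(self, item):
        self.n += 1
        self.counts[item] = self.counts.get(item, 0) + 1
        if len(self.counts) > self.max_entries:
            # Memory is full: compute a new eps_n and drop potentially
            # infrequent items, i.e. those with count/n < s - eps_n.
            threshold = (self.s - self._epsilon()) * self.n
            self.counts = {x: c for x, c in self.counts.items() if c >= threshold}

    def frequent_items(self):
        # Report items whose observed support reaches s; since pruning can only
        # lose counts, reported items are truly frequent (no false positives),
        # while some truly frequent items may be missed (false negatives).
        return {x: c for x, c in self.counts.items() if c >= self.s * self.n}
```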
Memory Bound
Every itemset X kept in P satisfies Sup(X) ≥ (s − εn)·n, so
|P| ≤ 1/(s − εn), when s − εn > 0
with
εn = sqrt( 2·s·ln(2/δ) / n ).
s − εn > 0 once n > 2·ln(2/δ)/s; for smaller n, |P| ≤ n. As n grows, εn → 0 and the bound approaches 1/s, so memory depends on ln(1/δ) rather than on an error parameter ε.
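A small numeric illustration of how the bound behaves (assuming the εn formula above); the kept-entry bound 1/(s − εn) settles toward 1/s as n grows:

```python
import math

s, delta = 0.1, 0.1
for n in (100, 1_000, 10_000, 100_000):
    eps = math.sqrt(2 * s * math.log(2 / delta) / n)   # eps_n
    print(n, round(1 / (s - eps), 1))                  # bound on |P|
# Prints roughly 44.3, 13.2, 10.8, 10.3 -> approaching 1/s = 10 entries.
```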
FDPM-2(s, δ)
Mining Frequent Itemsets from a Data Stream
[Worked example: transactions such as {A,B}, …, {E,F,G} arrive from the source stream; the itemsets they contain ({A}, {B}, {AB}, {C}, {D}, {E}, {F}, {EF}, …) are counted in a summary table. When memory is full, a new εn is computed and potentially infrequent itemsets are deleted from the maintained sets (labelled P and F on the slide), keeping memory bounded.]
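For the itemset case, a minimal sketch along the same lines; enumerating bounded-size subsets of each transaction (the max_len and max_entries parameters) is an illustrative simplification of my own, since the paper's FDPM-2 manages candidate itemsets more carefully.

```python
import math
from itertools import combinations

class FDPM2:
    """Counts itemsets over a transaction stream with false-negative pruning."""

    def __init__(self, s, delta, max_entries=100_000, max_len=3):
        self.s, self.delta = s, delta
        self.max_entries, self.max_len = max_entries, max_len
        self.counts = {}   # frozenset itemset -> running count
        self.n = 0         # transactions seen so far

    def _epsilon(self):
        return math.sqrt(2 * self.s * math.log(2 / self.delta) / self.n)

    def process(self, transaction):
        self.n += 1
        items = sorted(set(transaction))
        # Count every subset of the transaction up to max_len items.
        for k in range(1, min(self.max_len, len(items)) + 1):
            for subset in combinations(items, k):
                key = frozenset(subset)
                self.counts[key] = self.counts.get(key, 0) + 1
        if len(self.counts) > self.max_entries:
            # Memory is full: compute a new eps_n and delete itemsets whose
            # observed support is below s - eps_n (potentially infrequent).
            threshold = (self.s - self._epsilon()) * self.n
            self.counts = {x: c for x, c in self.counts.items() if c >= threshold}

    def frequent_itemsets(self):
        return {x: c for x, c in self.counts.items() if c >= self.s * self.n}
```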
Conclusion
– False-negative approach
– Limited memory
– Error bound holds with some probability