LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets (Linear time Closed itemset Miner)

Takeaki Uno (National Institute of Informatics), Tatsuya Asai (Kyushu University), Hiroaki Arimura (Kyushu University), Yuzo Uchida (Kyushu University). 19/Nov/2003, FIMI 2003.


Page 1: LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets

LCM: Linear time Closed itemset Miner

Takeaki Uno (National Institute of Informatics)

Tatsuya Asai (Kyushu University)

Hiroaki Arimura (Kyushu University)

Yuzo Uchida (Kyushu University)

19/Nov/2003 FIMI 2003

Page 2: Motivation

- We want to solve difficult problems in a short time

[Figure: map of the benchmark datasets (retail, accidents, IBMdatas, chess, connect, mushroom, kosarak, pumsb*, pumsb, BMS-POS, BMS-Web1,2), running from small supports to large supports. Region labels: "few solutions for small support" and "many solutions even for large support"; "#closed sets = #freq. sets" and "#closed sets << #freq. sets".]

Techniques marked on the map:

・ database reduction, removing infrequent items

・ sparse/dense (occ-deliv/diffsets)

・ exact enumeration of closed item sets

・ generation of all/maximal item sets from closed item sets

Page 3: Outline of Our Research

- Exact enumeration of closed item sets (no sophisticated pruning, no post-processing, and no memory for the closed item sets already obtained)

- Enumerate all/maximal frequent item sets using closed item sets

- Algorithms for updating occurrences and checking maximality in dense/sparse cases, and their adaptive hybrid

- Save additional memory use (right first sweep, adjacency matrix only for large transactions)

Page 4: Exact Enumeration of Closed Item Sets

- Introduce an acyclic parent-child relationship on freq. closed sets (it induces a tree-shaped traversal route)

- Traverse the route in a depth-first manner (find a child, and go to it)

Exact enumeration (time linear in #closed sets):

- Any child is found by taking a closure (in short time)

- No need to store the obtained item sets (small memory)

- Can enumerate all closed item sets (even without a min. support)

[Figure: enumeration tree with root (= φ)]

Page 5: Definition of Parent

X : closed item set

Closure = maximal item set with the same occurrences

parent of X = closure of X ∩ {1,…,i}, where i is the maximum s.t. X ≠ closure of X ∩ {1,…,i}

parent of X ⊆ X, so the relationship is acyclic

X' = child of X ⇔ X' is the closure of X ∪ {i} for some i, and (cond) X' \ X includes no item < i

- All children are found by taking the closure of X ∪ {i}

- (cond) can be checked in short time by using some algorithms

[Figure: X and its child X']
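The parent-child definition can be turned into a depth-first enumeration. The following is only a minimal sketch, not the authors' implementation. It assumes transactions are Python sets of positive integers and that `minsup` is an absolute count of at least 1; the names `closure` and `enumerate_closed` are made up for illustration. One detail goes beyond the slide: the sketch only tries items larger than the item that generated X (the `core` argument), which is how the published formulation of LCM keeps a closed set from being reached from two different parents. Note also that the root here is the closure of the empty set, which is φ only when no item occurs in every transaction.

```python
def closure(db, occ):
    """Intersection of the transactions listed in occ, i.e. the
    maximal item set having exactly these occurrences."""
    result = set(db[occ[0]])
    for t in occ[1:]:
        result &= db[t]
    return frozenset(result)


def enumerate_closed(db, minsup):
    """Depth-first traversal of the parent-child tree.
    Returns a list of (closed item set, frequency) pairs."""
    items = sorted(set().union(*db))
    found = []

    def visit(X, occ_X, core):
        found.append((X, len(occ_X)))
        for i in items:
            if i <= core or i in X:
                continue                      # assumption: extend only above the core item
            occ = [t for t in occ_X if i in db[t]]
            if len(occ) < minsup:
                continue                      # infrequent extension
            child = closure(db, occ)
            if all(j >= i for j in child - X):  # (cond): no new item smaller than i
                visit(child, occ, i)

    all_occ = list(range(len(db)))
    if len(all_occ) >= minsup:
        visit(closure(db, all_occ), all_occ, 0)
    return found
```

No table of already-found item sets is consulted anywhere, which is the "no memory for obtained closed item sets" point of the outline.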

Page 6: Adaptive Hybrid Algorithm

Computation of the occurrences of X ∪ {i} for sparse and dense cases

- In the sparse case, by tracing the items of each occurrence of X (occurrence deliver: maybe a known technique)

- In the dense case, use diffsets (proposed by Zaki)

We choose the better one in each iteration, according to estimates of the computation time
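For the dense case, the diffset idea can be illustrated with plain Python sets. This is a small sketch of Zaki's diffset identities, not code from LCM, and the function names are hypothetical. A diffset stores the occurrences that are lost when an item is added: D(X ∪ {i}) = T(X) \ T(X ∪ {i}). Diffsets of larger sets are then derived from diffsets alone.

```python
def diffset(tids_X, tids_Xi):
    """D(X ∪ {i}): occurrences of X that do not contain item i."""
    return tids_X - tids_Xi


def extend_with_diffsets(support_Xi, diff_i, diff_j):
    """From D(X ∪ {i}) and D(X ∪ {j}), derive the diffset and the
    frequency of X ∪ {i, j} without touching any occurrence list."""
    diff_ij = diff_j - diff_i            # occurrences of X ∪ {i} that lack j
    return support_Xi - len(diff_ij), diff_ij


# Tiny example with made-up transaction ids.
T_X = {0, 1, 2, 3, 4, 5}
T_Xi = {0, 1, 2, 4}
T_Xj = {1, 2, 3, 5}
support_ij, diff_ij = extend_with_diffsets(
    len(T_Xi), diffset(T_X, T_Xi), diffset(T_X, T_Xj))
```

When most occurrences of X also contain i, the diffset is much shorter than the occurrence list, which is why this representation suits dense data.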

Page 7: Maximal and All Frequent Sets

- Maximal frequent sets: generated from closed item sets

- All frequent sets (hypercube decomposition)

-- decompose the classes of closed item sets into complete sublattices

-- enumerate pairs of greatest/least elements of the sublattices

-- generate the others from the pairs

[Figure: a class of item sets drawn as a 01 lattice from 000…0 to 111…1, with the closed item set marked]
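The last step, generating the other item sets from a greatest/least pair, can be sketched as follows. The sketch assumes the pair is already given and that both elements have the same occurrences, so every set between them has the same frequency. How LCM chooses the pairs is not shown on the slide and is not modelled here; `expand_hypercube` is a hypothetical name.

```python
from itertools import combinations


def expand_hypercube(least, greatest):
    """Yield every item set S with least ⊆ S ⊆ greatest.
    There are 2 ** len(greatest - least) of them."""
    free = sorted(greatest - least)          # items that may be switched on or off
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield frozenset(least) | frozenset(extra)


# Example: least element {1}, greatest (closed) element {1, 2, 3}.
sets_in_cube = list(expand_hypercube({1}, {1, 2, 3}))
```

Each free item corresponds to one 0/1 coordinate of the lattice in the figure, so the output costs constant time per item set apart from writing it out.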

Page 8: Result

[Figure: the dataset map from the Motivation slide (retail, accidents, IBMdatas, chess, connect, mushroom, kosarak, pumsb*, pumsb, BMS-POS, BMS-Web1,2; small supports to large supports), with regions labelled "fast", "fast if support is small", "fast or usual", and "slower than others".]

Page 9: Conclusion

- For data sets s.t. #freq. closed sets << #freq. sets

-- large business datasets: BMS-web1,2, retail

-- machine learning datasets with small supports: UCI repository

exact enumeration of closed item sets and hypercube decomposition perform well

(fast without pruning, tries, or other existing methods)

For further speed up

- These techniques are orthogonal to other techniques (・ database reduction, ・ pruning infrequent items, …), so we can do better for large supports / accidents (blue area).

- The parameter of the hybrid is not tuned: not fast for kosarak, IBMdatas (now faster)

Page 10: We think…

● What is the real problem (bottleneck)?

-- Mining structured item sets (closed item sets, association rules with a threshold, …)

● Is it only a counting problem?

-- For all frequent item set mining, yes.

The problem is how to make the occurrences of an item set

from other item sets (choose best way, represent

● Are maximal item sets useful?

-- Closed item sets are useful! They have applications in classification and association rule mining.

Page 11: Some Observations

Is pruning of infrequent sets really necessary?

- Computing occurrences for infrequent items from X: usually < 1/2. Do we really need to prune?

[Figure: frequency of X, X ∪ {1}, X ∪ {2}, X ∪ {3}, X ∪ {4}, X ∪ {5}]

Is there a need for accelerating occurrence computation?

- Almost all computation is for updating occurrences

- There is a best e to get the occurrences of X from X - e

- Can we design an algorithm choosing e in each iteration? How do we find this e? Does this accelerate? (we can evaluate the lower bound of occurrence computation)

Page 12: Some Observations

- Computing occurrences for infrequent items from X: usually < 1/2. Do we really need to prune?

[Figure: frequency of X, X ∪ {10}, X ∪ {11}, X ∪ {12}, X ∪ {13}, X ∪ {14}]

Page 13: Right First Sweep

- Generate recursive calls in decreasing order of items

- Clear memory after the recursive call

- Re-use the memory in the following recursive calls

Child iterations need no memory

[Figure: occurrence lists of X ∪ {10}, X ∪ {11}, X ∪ {12}, X ∪ {13}, X ∪ {14}, filled with transactions A to E]
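The three steps can be sketched for plain frequent item set enumeration. This is an illustrative sketch under assumptions, not the LCM source: items are positive integers, transactions are Python sets, and `frequent_sets_right_first` is a made-up name. A single shared table of occurrence lists serves every level of the recursion. Python rebinding stands in for the in-place clearing a C implementation would do.

```python
def frequent_sets_right_first(db, minsup):
    """Enumerate non-empty frequent item sets with one shared
    table of occurrence lists (buckets)."""
    items = sorted(set().union(*db))
    buckets = {i: [] for i in items}         # shared by all recursion levels
    found = []

    def visit(X, occ_X, tail):
        # File each occurrence of X under every item larger than tail.
        for t in occ_X:
            for j in db[t]:
                if j > tail:
                    buckets[j].append(t)
        # Right first: process the largest item first.
        for i in reversed(items):
            if i <= tail:
                break
            occ = buckets[i]
            buckets[i] = []                  # clear the bucket before recursing
            if len(occ) >= minsup:
                found.append((X | {i}, len(occ)))
                # The child fills only buckets of items > i, which are
                # already empty here, and leaves them empty on return.
                visit(X | {i}, occ, i)

    visit(frozenset(), list(range(len(db))), 0)
    return found
```

Because a child of X ∪ {i} only ever needs buckets for items larger than i, and those were consumed earlier in the sweep, the child allocates no table of its own.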

Page 14: Occurrence deliver

Compute T(X ∪ {i}) by tracing each occurrence of X

In sparse cases, fast

[Figure: transactions A to E, the occurrences of X, delivered to the occurrence lists of X ∪ {10}, X ∪ {11}, X ∪ {12}, X ∪ {13}, X ∪ {14}]
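A minimal sketch of the delivery step, assuming transactions are Python sets and occurrences are transaction indices (the function name is hypothetical): one pass over the occurrences of X builds T(X ∪ {i}) for every item i at once.

```python
def occurrence_deliver(db, X, occ_X):
    """Return {i: T(X ∪ {i})} for every item i outside X that
    appears in at least one occurrence of X."""
    buckets = {}
    for t in occ_X:                  # trace each occurrence of X
        for j in db[t]:              # ...and each of its items
            if j not in X:
                buckets.setdefault(j, []).append(t)
    return buckets
```

The work is proportional to the total size of the occurrences of X, which is small when the data is sparse. Items that never occur together with X are never touched.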

Page 15: Checking (cond) of Closure

- Check (cond): (closure of X ∪ {i}) \ X includes no item < i

- In the sparse case, find an occurrence not including j, for all possible items j

- In the dense case, update the occurrences of all frequent X ∪ {j}, and compute T(X ∪ {i} ∪ {j})

Much faster than computing the closure of X ∪ {i}

[Figure: occurrence lists of X ∪ {1}, X ∪ {2}, …, X ∪ {i}, …, X ∪ {14}, holding transactions A, B, C]
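The sparse-case check can be sketched as below, under the assumption that transactions are Python sets and that `occ_Xi` is the non-empty list of occurrences of X ∪ {i}; `satisfies_cond` is a made-up name. An item j < i outside X violates (cond) only if it appears in every occurrence, so the candidates are limited to the items of the first occurrence, and each candidate is dropped as soon as one occurrence without it is found.

```python
def satisfies_cond(db, X, occ_Xi, i):
    """True if the closure of X ∪ {i} adds no item smaller than i,
    decided without building the closure."""
    first = db[occ_Xi[0]]
    for j in first:
        if j < i and j not in X:
            # all() stops at the first occurrence that lacks j.
            if all(j in db[t] for t in occ_Xi[1:]):
                return False         # j < i would join the closure
    return True
```

Items larger than i are never examined, and a failing candidate usually stops the scan early, which is where the saving over a full closure computation comes from.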

Page 16: Results

[Figure: four plots of running time (sec, axis ticks from 0.1 or 1 up to 1000) against minsup (%). Panels: connect (minsup 95 down to 30), IBM T10I4D100K (minsup 0.15 down to 0.025), BMS-WebView-2 (minsup 0.1 down to 0.01), BMS-WebView-1 (minsup 0.1 down to 0.01). Programs compared: LCMfreq, LCM, LCMmax, fpgrowth, fp_eclat, fp_apriori, mafia_fi, mafia_fci, mafia_mfi. Legend groups: all, closed, maximal.]