New York University CS Industry Day 1998

Department of Computer Science
Courant Institute of Mathematical Sciences
New York University

Pincer-Search*: An Efficient Algorithm for Discovering the Maximum Frequent Set

Dao-I Lin and Zvi M. Kedem

*Appeared in Advances in Database Technology - EDBT’98, Proceedings, LNCS Vol. 1377, Springer, pp. 105-119, March 1998

Applications

Association rule applications:

Based on supermarket databases, one might be interested to know that “95% of the customers who bought pasta and ground meat also bought spaghetti sauce”

Based on the alarm signals in telecommunication databases, one might be interested to know that “one can have 90% confidence that alarm C will occur within some interval of time if alarm A and alarm B have occurred in that interval of time”

Based on the stock market trading databases, one might be interested to know that “90% of the time during the last month when the prices of stock A and stock B went up, the price of stock C also went up”

Setting

Basic terms:

– {1, 2, …, n}: the set of all items (e.g. items in supermarkets, alarm signals in telecommunication networks, or stocks in stock markets)

– Transaction: a set of items (e.g. the items purchased in a supermarket, the alarm signals occurring within an interval of time, or the stocks whose prices went up during the last hour)

– Database: a set of transactions

– User-defined threshold (min-support): a number in [0,1]

– Frequent itemset: a collection of items (an itemset) that occurs in at least a min-support fraction of the transactions in the database

The problem: given a large database of sets of items and a user-defined min-support threshold, what are the frequent itemsets?
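The definitions above can be made concrete with a brute-force sketch. The function names and the supermarket baskets below are invented for illustration and are not part of the talk; the enumeration is exponential in the number of items, so it only serves to pin down the terminology.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of the transactions that contain every item of `itemset`."""
    return sum(1 for t in database if itemset <= t) / len(database)

def frequent_itemsets(database, min_support):
    """Brute force: test every non-empty itemset over the items that occur."""
    items = sorted(set().union(*database))
    found = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if support(candidate, database) >= min_support:
                found.append(candidate)
    return found

# Hypothetical supermarket baskets (illustrative data only).
baskets = [
    frozenset({"pasta", "meat", "sauce"}),
    frozenset({"pasta", "sauce"}),
    frozenset({"pasta", "meat", "sauce"}),
    frozenset({"milk"}),
]
result = frequent_itemsets(baskets, min_support=0.5)
for itemset in result:
    print(sorted(itemset), support(itemset, baskets))
```

Each transaction is a `frozenset`, so "the transaction contains the itemset" is simply the subset test `itemset <= t`.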

The Importance of the Maximum Frequent Set

Maximal frequent itemsets: the frequent itemsets that have no frequent proper superset

Maximum frequent set: the set of all maximal frequent itemsets

Fact: an itemset is frequent if and only if it is a subset of a maximal frequent itemset

– The maximum frequent set therefore uniquely determines the entire frequent set: the frequent set is exactly the collection of all subsets of the maximal frequent itemsets

Discovering the maximum frequent set is a key problem in many data mining applications, such as the discovery of association rules, theories, strong rules, episodes, and minimal keys
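The fact above says that the frequent set and the maximum frequent set carry the same information. A minimal sketch of the round trip follows, using a small hypothetical frequent set over items a, b, c (invented for illustration; the empty itemset is ignored).

```python
from itertools import combinations

def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {f for f in frequent if not any(f < g for g in frequent)}

def all_subsets(maximum_frequent_set):
    """Every non-empty subset of a maximal frequent itemset."""
    result = set()
    for m in maximum_frequent_set:
        for size in range(1, len(m) + 1):
            result.update(frozenset(c) for c in combinations(sorted(m), size))
    return result

# Hypothetical frequent set, chosen only to illustrate the round trip.
frequent = {frozenset(s) for s in ("a", "b", "c", "ab", "bc")}
mfs = maximal_itemsets(frequent)
recovered = all_subsets(mfs)
print(sorted(sorted(m) for m in mfs))
```

Going from `frequent` to `mfs` discards the non-maximal itemsets, and `all_subsets` recovers them, which is why an algorithm only needs to output the maximum frequent set.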

An Example

Database:

Transaction   Items
1             {1,2,3,5}
2             {1,5}
3             {1,2}
4             {1,2,3}

Min-support is 0.5

Frequent itemsets are {1}, {2}, {3}, {5}, {1,2}, {1,3}, {1,5}, {2,3}, and {1,2,3}, since they occur in at least 2 out of 4 transactions

Maximum frequent set is {{1,2,3},{1,5}}

[Figure: the itemsets of this example drawn as a lattice over the items 1 to 5, from the single items up to {1,2,3,4,5}]
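The numbers in this example can be checked mechanically. The sketch below counts every itemset over the items 1 to 5 in the four transactions and derives the frequent itemsets and the maximum frequent set; the variable names are illustrative, and the threshold is expressed as a count (2 of 4 transactions) rather than a fraction.

```python
from itertools import combinations

database = [frozenset(t) for t in ({1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3})]
items = range(1, 6)
min_count = 2  # min-support 0.5 times 4 transactions

# Count, for every non-empty itemset, the transactions that contain it.
counts = {}
for size in range(1, 6):
    for combo in combinations(items, size):
        itemset = frozenset(combo)
        counts[itemset] = sum(1 for t in database if itemset <= t)

frequent = {s for s, c in counts.items() if c >= min_count}
maximum_frequent_set = {s for s in frequent
                        if not any(s < other for other in frequent)}

for s in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(s), counts[s])
```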

Two Closure Properties

Let A and B be two itemsets with A ⊆ B

Property 1: A infrequent ⇒ B infrequent (if a transaction does not contain A, it cannot contain B)

Property 2: B frequent ⇒ A frequent (if a transaction contains B, it must contain A)

[Figure: lattice of itemsets over the items 1 to 5, with example itemsets labeled A and B for each property]
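Both properties follow from one observation: a superset can never occur in more transactions than any of its subsets. The sketch below checks this exhaustively on a four-transaction toy database and then shows the two pruning tests the properties give rise to. The helper names `implied_infrequent` and `implied_frequent` are hypothetical, introduced only for this illustration.

```python
from itertools import combinations

database = [frozenset(t) for t in ({1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3})]

def support(itemset):
    """Fraction of the transactions that contain `itemset`."""
    return sum(1 for t in database if itemset <= t) / len(database)

all_itemsets = [frozenset(c) for k in range(1, 6)
                for c in combinations(range(1, 6), k)]

# If A is a subset of B, then B is never more frequent than A.
violations = [(a, b) for a in all_itemsets for b in all_itemsets
              if a <= b and support(a) < support(b)]

def implied_infrequent(candidate, known_infrequent):
    """Property 1: a superset of an infrequent itemset is infrequent."""
    return any(a <= candidate for a in known_infrequent)

def implied_frequent(candidate, known_frequent):
    """Property 2: a subset of a frequent itemset is frequent."""
    return any(candidate <= b for b in known_frequent)

print(len(violations))
```

In either case the candidate's status is known without counting it against the database, which is where the savings come from.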

Traditional One-Way Search Approaches

The traditional approach to discovering the maximum frequent set uses either a bottom-up search or a top-down search

Bottom-up search is good when ALL maximal frequent itemsets are short

Top-down search is good when ALL maximal frequent itemsets are long

One-way search can only make use of ONE of the two closure properties to prune candidates

One-Way Search Algorithms

Property 1 leads to bottom-up search algorithms, such as AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), Clique (ZPOL97)

Property 2 leads to top-down search algorithms, such as TopDown (ZPOL97), guess-and-correct (MT97)

[Figure: lattice of the itemsets over {1,2,3,4,5}. Blue: frequent itemsets. Red: maximal frequent itemsets. Black: infrequent itemsets]
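A bottom-up search driven by Property 1 can be sketched as a level-wise loop in the style of Apriori. This is a simplified illustration, not the implementation of any of the cited algorithms: the function name, the explicit `items` parameter, and the `examined` counter are assumptions made for the example, and each level is counted with a plain scan.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of the transactions that contain `itemset`."""
    return sum(1 for t in database if itemset <= t) / len(database)

def bottom_up_search(database, items, min_support):
    """Level-wise search from single items upward (Apriori-style sketch)."""
    level = [frozenset([i]) for i in items]
    frequent, examined, k = [], 0, 1
    while level:
        examined += len(level)  # every candidate of this level is counted
        current = {c for c in level if support(c, database) >= min_support}
        frequent.extend(current)
        k += 1
        # Join frequent (k-1)-itemsets into k-itemsets ...
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # ... and apply Property 1: drop a candidate with an infrequent subset.
        level = [c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    maximal = {f for f in frequent if not any(f < g for g in frequent)}
    return maximal, examined

database = [frozenset(t) for t in ({1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3})]
maximal, examined = bottom_up_search(database, range(1, 6), 0.5)
print(sorted(sorted(m) for m in maximal), examined)
```

Note that the maximal itemsets only fall out at the end, after every frequent itemset below them has been counted.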

Complexity of One-Way Searches

For bottom-up search, every frequent itemset is explicitly examined (in the example, until {1,2,3,4} is examined)

For top-down search, every infrequent itemset is explicitly examined (in the example until {5} is examined)

[Figure: lattice of the itemsets over {1,2,3,4,5}. Blue: frequent itemsets. Red: maximal frequent itemsets. Black: infrequent itemsets]
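The cost of a top-down search can be seen in a small sketch that starts from the full itemset and descends one level at a time, using Property 2 to skip subsets of itemsets already found frequent. The function name, the `items` parameter, and the `examined` counter are assumptions for this illustration; the four-transaction database is a toy input with short maximal frequent itemsets, the unfavourable case for this direction.

```python
def support(itemset, database):
    """Fraction of the transactions that contain `itemset`."""
    return sum(1 for t in database if itemset <= t) / len(database)

def top_down_search(database, items, min_support):
    """Search from the full itemset downward (simplified sketch)."""
    level = {frozenset(items)}
    maximal, examined = [], 0
    while level:
        next_level = set()
        for candidate in level:
            examined += 1
            if support(candidate, database) >= min_support:
                maximal.append(candidate)  # no frequent superset exists
            else:
                # An infrequent itemset is replaced by its immediate subsets.
                for item in candidate:
                    smaller = candidate - {item}
                    if smaller:
                        next_level.add(smaller)
        # Property 2: subsets of a frequent itemset need not be examined.
        level = {s for s in next_level if not any(s <= m for m in maximal)}
    return set(maximal), examined

database = [frozenset(t) for t in ({1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3})]
maximal, examined = top_down_search(database, range(1, 6), 0.5)
print(sorted(sorted(m) for m in maximal), examined)
```

Here the search has to work through the long infrequent itemsets before it reaches the short maximal ones, which is the behaviour the slide describes.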

Our Two-Way Search Approach: Pincer-Search

Run both bottom-up search and top-down search at the same time

Use information gathered in the bottom-up search to help prune candidates in the top-down search

– Use Property 1 to eliminate candidates in the top-down search

Use information gathered in the top-down search to help prune candidates in the bottom-up search

– Use Property 2 to eliminate candidates in the bottom-up search

Can efficiently discover both long and short maximal frequent itemsets
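The two-way idea can be sketched as follows. This is a deliberately simplified illustration and not the algorithm as published: it keeps a top-down frontier (called `mfcs` here, after the paper's maximum frequent candidate set), shrinks it with every infrequent itemset found bottom-up (Property 1), and skips bottom-up candidates that lie under a frontier element already known to be frequent (Property 2). Candidate generation by enumerating subsets of the frontier, the function names, and the pass counter are all assumptions made for the example.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of the transactions that contain `itemset`."""
    return sum(1 for t in database if itemset <= t) / len(database)

def split_mfcs(mfcs, infrequent):
    """Property 1 update: no frontier element may contain `infrequent`."""
    kept = [m for m in mfcs if not infrequent <= m]
    for m in mfcs:
        if infrequent <= m:
            for item in infrequent:
                smaller = m - {item}
                if smaller and not any(smaller <= other for other in kept):
                    kept.append(smaller)
    return kept

def pincer_search_sketch(database, items, min_support):
    mfcs = [frozenset(items)]  # top-down frontier
    counted = {}               # itemset -> support, filled pass by pass
    passes, k = 0, 1

    def is_frequent(x):
        return x in counted and counted[x] >= min_support

    # Stop once every frontier element is known to be frequent.
    while not all(is_frequent(m) for m in mfcs):
        known_maximal = [m for m in mfcs if is_frequent(m)]
        # Bottom-up candidates of size k, minus those covered by Property 2.
        candidates = {frozenset(c) for m in mfcs
                      for c in combinations(sorted(m), k)}
        candidates = {c for c in candidates
                      if not any(c <= f for f in known_maximal)}
        to_count = {x for x in candidates | set(mfcs) if x not in counted}
        if to_count:
            passes += 1  # one database pass counts both directions
            for x in to_count:
                counted[x] = support(x, database)
        for c in candidates:
            if not is_frequent(c):
                mfcs = split_mfcs(mfcs, c)
        k += 1
    return set(mfcs), passes

database = [frozenset(t) for t in ({1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3})]
mfs, passes = pincer_search_sketch(database, range(1, 6), 0.5)
print(sorted(sorted(m) for m in mfs), passes)
```

When the loop ends, every frequent itemset is still a subset of some frontier element and every frontier element is frequent, so the frontier is the maximum frequent set.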

Pincer-Search: Combining Top-Down and Bottom-Up Searches

[Figure: lattice of the itemsets over {1,2,3,4,5}. Blue: frequent itemsets. Red: maximal frequent itemsets. Black: infrequent itemsets. Green: itemsets not examined. Annotations mark the itemsets eliminated in the top-down search by using Property 1 and those eliminated in the bottom-up search by using Property 2]

This example shows how combining both searches could dramatically reduce the number of candidates examined and the number of passes of reading the database

Performance: Observations and Experiments

Non-monotone property of the maximum frequent set

– Both the number of candidates and the number of frequent itemsets increase as the min-support decreases

– This is NOT true for the number of maximal frequent itemsets: if MFS is {{1,2},{2,3},{3,4}} when min-support is 9% and min-support decreases to 6%, then MFS could become {{1,2,3,4}}

– This property will NOT help bottom-up search algorithms; however, it may help the Pincer-Search algorithm

Concentrated and scattered distributions

– Concentrated: for the same number of frequent itemsets, the frequent itemsets are grouped in a NARROW and TALL shape; a few LONG maximal frequent itemsets

– Scattered: for the same number of frequent itemsets, the frequent itemsets are grouped in a WIDE and SHORT shape; many SHORT maximal frequent itemsets
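The non-monotone property is easy to reproduce on a toy database. In the brute-force sketch below (hypothetical data, invented for illustration), lowering min-support from 0.5 to 0.25 increases the number of frequent itemsets while the number of maximal frequent itemsets drops from three to one.

```python
from itertools import combinations

# Hypothetical database: each pair occurs twice, the triple only once.
database = [frozenset(t) for t in ({1, 2, 3}, {1, 2}, {1, 3}, {2, 3})]

def frequent_and_maximal(min_support):
    """Brute-force frequent set and maximum frequent set at a threshold."""
    items = sorted(set().union(*database))
    frequent = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            count = sum(1 for t in database if itemset <= t)
            if count / len(database) >= min_support:
                frequent.add(itemset)
    maximal = {f for f in frequent if not any(f < g for g in frequent)}
    return frequent, maximal

high_frequent, high_maximal = frequent_and_maximal(0.5)
low_frequent, low_maximal = frequent_and_maximal(0.25)
print(len(high_frequent), len(high_maximal))
print(len(low_frequent), len(low_maximal))
```

A bottom-up search still has to count all of the frequent itemsets at the lower threshold, whereas a search that also works top-down has fewer and longer targets to confirm.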

Scattered Distributions

|L| is 2000 in these experiments

[Figure: three charts for T5.I2.D100K comparing the Apriori algorithm and the Pincer-Search algorithm, plotting running time (seconds), number of passes, and number of candidates against min-support (2%, 1.50%, 1%, 0.75%, 0.50%, 0.33%, 0.25%)]

Scattered Distributions

|L| is 2000 in these experiments

[Figure: three charts for T10.I4.D100K comparing the Apriori algorithm and the Pincer-Search algorithm, plotting running time (seconds), number of candidates, and number of passes against min-support (2%, 1.50%, 1%, 0.75%, 0.50%, 0.33%, 0.25%)]

Experiments on Scattered Distributions

The benchmark databases are generated by a well-known synthetic data generation program from the IBM Quest project

– |T| is the average transaction size, |I| is the average size of the maximal frequent itemsets, |D| is the number of transactions, and |L| is the number of the maximal frequent itemsets

The experiment on T5.I2.D100K shows that although the Pincer-Search algorithm used more candidates than the Apriori algorithm (due to the candidates considered in the MFCS, the maximum frequent candidate set maintained by the top-down search), it still performed better, since the I/O time saved more than compensated for the extra cost

The experiment on T10.I4.D100K shows that the Pincer-Search algorithm can also spend effort on maintaining the MFCS without pruning enough candidates to cover the extra cost

– For instance, the Pincer-Search algorithm performed slightly worse than the Apriori algorithm when min-support is 0.75%
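The database names encode the generator parameters defined above. A small helper can decode them, assuming every name follows the `T<n>.I<n>.D<n>K` pattern seen in these slides; the function name and the dictionary keys are invented for this sketch.

```python
import re

def parse_benchmark_name(name):
    """Decode a name such as 'T10.I4.D100K' into its three parameters."""
    match = re.fullmatch(r"T(\d+)\.I(\d+)\.D(\d+)K", name)
    if match is None:
        raise ValueError(f"unrecognized benchmark name: {name!r}")
    t, i, d = (int(group) for group in match.groups())
    return {
        "avg_transaction_size": t,        # |T|
        "avg_maximal_itemset_size": i,    # |I|
        "num_transactions": d * 1000,     # |D|, the K suffix means thousands
    }

params = parse_benchmark_name("T10.I4.D100K")
print(params)
```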

Concentrated Distributions

|L| is 50 in these experiments

[Figure: three charts for T20.I10.D100K comparing the Apriori algorithm and the Pincer-Search algorithm, plotting running time (seconds), number of candidates, and number of passes against min-support (from 18% down to 6%)]

Concentrated Distributions

|L| is 50 in these experiments

[Figure: three charts for T20.I15.D100K comparing the Apriori algorithm and the Pincer-Search algorithm, plotting running time (seconds), number of candidates, and number of passes against min-support (from 18% down to 6%)]

Experiments on Concentrated Distributions

These experiments show that the Pincer-Search algorithm is good for discovering the maximum frequent set with concentrated distributions

The improvements can be up to several orders of magnitude

– For instance, the improvements are more than 2 orders of magnitude in the experiment on the T20.I15.D100K database when min-support is 7% or 6%

One can expect even greater improvements when some maximal frequent itemsets are longer

Census Data

[Figure: three charts for the PUMS database comparing the Apriori algorithm and the Pincer-Search algorithm, plotting time (sec), number of candidates, and number of passes against min-support (from 50% down to 30%)]

Experiments on Real-Life Databases and Conclusions

The Pincer-Search algorithm performed quite well in the experiments on the PUMS database, which contains Public Use Microdata Samples

Some preliminary experiments on NYSE stock market databases also show promising results

Conclusions:

– Pincer-Search is good for concentrated distributions

– In general, one can use Adaptive Pincer-Search, which delays the use of the two-way search approach until a later pass

More experiments on real-life databases are in progress