New York University CS Industry Day 1998
Department of Computer Science, Courant Institute of Mathematical Sciences, New York University
Pincer-Search*: An Efficient Algorithm for Discovering the Maximum Frequent Set
Dao-I Lin and Zvi M. Kedem
*Appeared in Advances in Database Technology, EDBT'98 Proceedings, LNCS Vol. 1377, Springer, pp. 105–119, March 1998
Applications
Association rule applications:
– Supermarket databases: one might be interested to know that "95% of the customers who bought pasta and ground meat also bought spaghetti sauce"
– Telecommunication alarm databases: one might be interested to know that "with 90% confidence, alarm C will occur within some interval of time if alarms A and B have occurred in that interval"
– Stock market trading databases: one might be interested to know that "90% of the time during the last month when the prices of stock A and stock B went up, the price of stock C also went up"
Setting
Basic terms:
– Items: 1, 2, …, n, the set of all items (e.g. items in supermarkets, alarm signals in telecommunication networks, or stocks in stock markets)
– Transaction: a set of items (e.g. items purchased in a supermarket, alarm signals occurring within an interval of time, or stocks whose prices went up during the last hour)
– Database: a set of transactions
– User-defined threshold (min-support): a number in [0, 1]
– Frequent itemset: a collection of items (an itemset) occurring in at least a min-support fraction of the transactions in the database
The problem: given a large database of sets of items and a user-defined min-support threshold, what are the frequent itemsets?
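The definitions above can be made concrete with a short brute-force sketch (illustrative only; the function name and the toy database are ours, not from the talk):

```python
from itertools import combinations

def frequent_itemsets(database, min_support):
    """Enumerate all frequent itemsets by brute force (fine for tiny examples).

    database: a list of transactions, each a set of items
    min_support: a number in [0, 1]
    """
    items = sorted(set().union(*database))
    n = len(database)
    result = []
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            cand = frozenset(cand)
            # An itemset is frequent if it occurs in at least a
            # min-support fraction of the transactions.
            if sum(1 for t in database if cand <= t) / n >= min_support:
                result.append(cand)
    return result

db = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
print(sorted(map(sorted, frequent_itemsets(db, 0.5))))
```

Real miners avoid this exponential enumeration; that is exactly the point of the search strategies discussed next.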
The Importance of the Maximum Frequent Set
Maximal frequent itemset: a frequent itemset such that no proper superset of it is frequent
Maximum frequent set: the set of all maximal frequent itemsets
Fact: an itemset is frequent if and only if it is a subset of a maximal frequent itemset
– Hence the maximum frequent set uniquely determines the entire frequent set: the frequent set is the union of all subsets of the maximal frequent itemsets
Discovering the maximum frequent set is a key problem in many data mining applications, such as the discovery of association rules, theories, strong rules, episodes, and minimal keys
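The fact above means the full frequent set can be reconstructed by expanding every maximal frequent itemset into its non-empty subsets; a minimal sketch (the helper name is ours):

```python
from itertools import combinations

def frequent_set_from_mfs(mfs):
    """Expand the maximum frequent set into the full frequent set:
    every non-empty subset of a maximal frequent itemset is frequent."""
    frequent = set()
    for maximal in mfs:
        m = sorted(maximal)
        for k in range(1, len(m) + 1):
            frequent.update(frozenset(s) for s in combinations(m, k))
    return frequent

# The maximum frequent set from the example later in the talk:
mfs = [{1, 2, 3}, {1, 5}]
print(len(frequent_set_from_mfs(mfs)))  # 9 frequent itemsets in total
```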
An Example
Database:
Transaction 1: {1,2,3,5}
Transaction 2: {1,5}
Transaction 3: {1,2}
Transaction 4: {1,2,3}
Min-support is 0.5
Frequent itemsets are {1}, {2}, {3}, {5}, {1,2}, {1,3}, {1,5}, {2,3}, and {1,2,3}, since each occurs in at least 2 of the 4 transactions
The maximum frequent set is {{1,2,3}, {1,5}}
[Figure: itemset lattice over {1,2,3,4,5} with the frequent itemsets and the two maximal frequent itemsets {1,2,3} and {1,5} highlighted]
Two Closure Properties
Let A and B be two itemsets with A ⊆ B
Property 1: if A is infrequent, then B is infrequent (any transaction that does not contain A cannot contain B)
Property 2: if B is frequent, then A is frequent (any transaction that contains B must contain A)
[Figure: two itemset lattices illustrating the closure properties: in one, an infrequent itemset A makes every superset B infrequent (Property 1); in the other, a frequent itemset B makes every subset A frequent (Property 2)]
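Both properties follow from the fact that support can only shrink as an itemset grows; a small check (the function name and the database are ours):

```python
def support(itemset, database):
    """Fraction of transactions containing every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in database if itemset <= t) / len(database)

db = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
# A subset of B implies support(A) >= support(B), which yields both properties:
# Property 1: if A is infrequent, no superset B can reach the threshold.
# Property 2: if B is frequent, every subset A is at least as frequent.
print(support({1, 2}, db), support({1, 2, 3}, db))  # 0.75 0.5
```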
Traditional One-Way Search Approaches
The traditional approach for discovering the maximum frequent set uses either a bottom-up or a top-down search
Bottom-up search is good when ALL maximal frequent itemsets are short
Top-down search is good when ALL maximal frequent itemsets are long
A one-way search can use only ONE of the two closure properties to prune candidates
One-Way Search Algorithms
Property 1 leads to bottom-up search algorithms, such as AIS (AIS93), Apriori (AS94), OCD (MTV94), SETM (HS95), DHP (PCY95), Partition (SON95), ML-T2+ (HF95), Sampling (T96), DIC (BMUT97), and Clique (ZPOL97)
Property 2 leads to top-down search algorithms, such as TopDown (ZPOL97) and guess-and-correct (MT97)
[Figure: the full itemset lattice over {1,2,3,4,5}. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets]
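A levelwise bottom-up search in the spirit of Apriori can be sketched as follows (a simplified illustration, not the published algorithm verbatim):

```python
from itertools import combinations

def apriori(database, min_support):
    """Levelwise bottom-up search sketch: size-(k+1) candidates are built
    only from frequent size-k itemsets, and Property 1 prunes any candidate
    that has an infrequent subset."""
    n = len(database)
    def is_frequent(c):
        return sum(1 for t in database if c <= t) / n >= min_support
    items = sorted(set().union(*database))
    level = [c for c in (frozenset([i]) for i in items) if is_frequent(c)]
    frequent = list(level)
    while level:
        prev = set(level)
        # Join: unite pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune (Property 1): drop candidates with an infrequent k-subset.
        candidates = [c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, len(c) - 1))]
        level = [c for c in candidates if is_frequent(c)]
        frequent.extend(level)
    return frequent

db = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
print(sorted(map(sorted, apriori(db, 0.5))))
```

Note how this search examines every frequent itemset explicitly, matching the complexity observation on the next slide.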
Complexity of One-Way Searches
For bottom-up search, every frequent itemset is explicitly examined (in the example, up to {1,2,3,4})
For top-down search, every infrequent itemset is explicitly examined (in the example, down to {5})
Our Two-Way Search Approach: Pincer-Search
Run the bottom-up search and the top-down search at the same time
Use information gathered in the bottom-up search to help prune candidates in the top-down search (Property 1 eliminates top-down candidates)
Use information gathered in the top-down search to help prune candidates in the bottom-up search (Property 2 eliminates bottom-up candidates)
Can efficiently discover both long and short maximal frequent itemsets
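The two-way idea can be sketched as follows. This is a simplified illustration of combining the two searches, not a verbatim reimplementation of the published algorithm (which counts bottom-up and top-down candidates in the same database pass and uses a more careful MFCS-generation procedure); the identifier names are ours.

```python
def pincer_sketch(database, min_support):
    """Simplified two-way (pincer) search returning the maximal frequent itemsets."""
    n = len(database)
    def is_frequent(c):
        return sum(1 for t in database if c <= t) / n >= min_support
    items = sorted(set().union(*database))
    mfcs = [frozenset(items)]  # top-down candidates; always covers every maximal frequent itemset
    mfs = []                   # certified maximal frequent itemsets
    level = [frozenset([i]) for i in items]  # bottom-up candidates
    while level:
        infrequent = [c for c in level if not is_frequent(c)]
        # Top-down step (Property 1): split every MFCS element containing an
        # infrequent itemset, removing one of its items at a time.
        for bad in infrequent:
            split = []
            for m in mfcs:
                if bad <= m:
                    split.extend(m - {x} for x in bad)
                else:
                    split.append(m)
            mfcs = [m for m in set(split) if not any(m < o for o in split)]
        # Any MFCS element found frequent is maximal: record it.
        for m in mfcs:
            if m not in mfs and is_frequent(m):
                mfs.append(m)
        # Bottom-up step (Property 2): extend only frequent itemsets not
        # already covered by a known maximal frequent itemset.
        survivors = [c for c in level
                     if is_frequent(c) and not any(c <= m for m in mfs)]
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1
                      and not any(a | b <= m for m in mfs)})
    return mfs

db = [{1, 2, 3, 5}, {1, 5}, {1, 2}, {1, 2, 3}]
print(sorted(map(sorted, pincer_sketch(db, 0.5))))  # [[1, 2, 3], [1, 5]]
```

On the running example, the search finds both {1,2,3} and {1,5} in two passes and never extends the bottom-up candidates past size 2, because every surviving candidate is already covered by a certified maximal itemset.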
Pincer Search: Combining Top-Down and Bottom-Up Searches
[Figure: itemset lattice over {1,2,3,4,5}. Blue: frequent itemsets; red: maximal frequent itemsets; black: infrequent itemsets; green: itemsets never examined. One region is eliminated in the top-down search using Property 1, another in the bottom-up search using Property 2]
This example shows how combining both searches can dramatically reduce both the number of candidates examined and the number of passes over the database
Performance:Observations and Experiments
Non-monotone property of the maximum frequent set:
– Both the number of candidates and the number of frequent itemsets increase as min-support decreases
– This is NOT true for the number of maximal frequent itemsets
– For example, if the MFS is {{1,2},{2,3},{3,4}} when min-support is 9%, then when min-support decreases to 6% the MFS could become {{1,2,3,4}}
– This property will NOT help bottom-up search algorithms; however, it may help the Pincer-Search algorithm
Concentrated and scattered distributions:
– Concentrated: for the same number of frequent itemsets, the frequent itemsets are grouped in a NARROW and TALL shape; a few LONG maximal frequent itemsets
– Scattered: for the same number of frequent itemsets, the frequent itemsets are grouped in a WIDE and SHORT shape; many SHORT maximal frequent itemsets
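The non-monotone property can be checked on a toy database (the database, the thresholds, and the helper are our own construction, not from the talk):

```python
from itertools import combinations

def maximal_frequent_itemsets(database, min_support):
    """Brute-force maximum frequent set (fine for a toy database)."""
    items = sorted(set().union(*database))
    n = len(database)
    freq = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sum(1 for t in database if set(c) <= t) / n >= min_support]
    # Keep only itemsets with no frequent proper superset.
    return [f for f in freq if not any(f < g for g in freq)]

# Lowering min-support SHRINKS the maximum frequent set here,
# from three short maximal itemsets to a single long one.
db = [{1, 2}] * 2 + [{2, 3}] * 2 + [{3, 4}] * 2 + [{1, 2, 3, 4}]
print(sorted(map(sorted, maximal_frequent_itemsets(db, 0.4))))   # [[1, 2], [2, 3], [3, 4]]
print(sorted(map(sorted, maximal_frequent_itemsets(db, 0.14))))  # [[1, 2, 3, 4]]
```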
Scattered Distributions
|L| is 2000 in these experiments; the charts compare the Apriori and Pincer-Search algorithms
[Charts for T5.I2.D100K: running time in seconds, number of database passes, and number of candidates versus min-support, from 2% down to 0.25%]
Scattered Distributions
|L| is 2000 in these experiments; the charts compare the Apriori and Pincer-Search algorithms
[Charts for T10.I4.D100K: running time in seconds, number of candidates, and number of database passes versus min-support, from 2% down to 0.25%]
Experiments on Scattered Distributions
The benchmark databases are generated by a well-known synthetic data generation program from the IBM Quest project
|T| is the average transaction size, |I| is the average size of the maximal frequent itemsets, |D| is the number of transactions, and |L| is the number of maximal frequent itemsets
The experiment on T5.I2.D100K shows that although the Pincer-Search algorithm used more candidates than the Apriori algorithm (due to the candidates considered in the MFCS, the Maximum Frequent Candidate Set), it still performed better, since the I/O time saved compensated for the extra cost
The experiment on T10.I4.D100K shows that it is also possible for the Pincer-Search algorithm to spend effort maintaining the MFCS without pruning enough candidates to cover the extra cost
– For instance, Pincer-Search performed slightly worse than Apriori when min-support is 0.75%
Concentrated Distributions
|L| is 50 in these experiments; the charts compare the Apriori and Pincer-Search algorithms
[Charts for T20.I10.D100K: running time in seconds, number of candidates, and number of database passes versus min-support, from 18% down to 6%]
Concentrated Distributions
|L| is 50 in these experiments; the charts compare the Apriori and Pincer-Search algorithms
[Charts for T20.I15.D100K: running time in seconds, number of candidates, and number of database passes versus min-support, from 18% down to 6%]
Experiments on Concentrated Distributions
These experiments show that the Pincer-Search algorithm is good for discovering the maximum frequent set under concentrated distributions
The improvements can be up to several orders of magnitude; for instance, they exceed 2 orders of magnitude on the T20.I15.D100K database when min-support is 7% or 6%
One can expect even greater improvements when some maximal frequent itemsets are longer
Census Data
The charts compare the Apriori and Pincer-Search algorithms
[Charts for PUMS: running time in seconds, number of candidates, and number of database passes versus min-support, from 50% down to 30%]
Experiments on Real-Life Databases and Conclusions
The Pincer-Search algorithm performed quite well in the experiments on this PUMS database, which contains Public Use Microdata Samples
Some preliminary experiments on NYSE stock market databases also show promising results
Conclusions:
– Pincer-Search is good for concentrated distributions
– In general, one can use Adaptive Pincer-Search, which delays the use of the two-way search approach until a later pass
– More experiments on real-life databases are in progress