TRANSCRIPT
Parallel Mining Frequent Patterns: A Sampling-based Approach
Shengnan Cong
Talk Outline
- Background
  - Frequent pattern mining
  - Serial algorithm
- Parallel frequent pattern mining
  - Parallel framework
  - Load balancing problem
- Experimental results
- Optimization
- Summary
Frequent Pattern Analysis
Frequent pattern:
- A pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set.

Motivation:
- To find the inherent regularities in data.

Applications:
- Basket data analysis: what products were often purchased together?
- DNA sequence analysis: what kinds of DNA are sensitive to this new drug?
- Web log analysis: can we automatically classify web documents?
Frequent Itemset Mining
Itemset:
- A collection of one or more items, e.g. {Milk, Juice}.
- k-itemset: an itemset that contains k items.

Transaction:
- An itemset. A dataset is a collection of transactions.

Support:
- The number of transactions containing an itemset, e.g. Support({Milk, Juice}) = 2 in the dataset below.

Frequent-itemset mining:
- Output all itemsets whose support values are no less than a predefined threshold in a dataset.
Transaction | Items
T1 | Milk, Bread, Cookies, Juice
T2 | Milk, Juice
T3 | Milk, Eggs
T4 | Bread, Cookies, Coffee
With support threshold = 2, the frequent itemsets are:
{Milk}: 3
{Bread}: 2
{Cookies}: 2
{Juice}: 2
{Milk, Juice}: 2
{Bread, Cookies}: 2
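
As a quick check, these support counts can be reproduced in a few lines of Python (a sketch; the support helper below is ours, not from the talk):

transactions = [
    {"Milk", "Bread", "Cookies", "Juice"},   # T1
    {"Milk", "Juice"},                       # T2
    {"Milk", "Eggs"},                        # T3
    {"Bread", "Cookies", "Coffee"},          # T4
]

def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)

print(support({"Milk", "Juice"}, transactions))      # -> 2
print(support({"Bread", "Cookies"}, transactions))   # -> 2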
Frequent Itemset Mining
Frequent-itemset mining is computationally expensive.

Brute-force approach:
- Given d items, there are 2^d possible candidate itemsets.
- Count the support of each candidate by scanning the database, matching each transaction against every candidate.
- Complexity ~ O(N*M*W), where N is the number of transactions, W their maximum width, and M the number of candidates. This is expensive, since M = 2^d.
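
Transcribing the brute-force scheme directly makes the cost concrete (a minimal Python sketch; the names are ours). The generator below produces all M = 2^d - 1 candidates, and each one is checked against all N transactions:

from itertools import chain, combinations

def brute_force_frequent(db, minsup):
    items = sorted(set().union(*db))                # d distinct items
    candidates = chain.from_iterable(               # M = 2^d - 1 candidates
        combinations(items, k) for k in range(1, len(items) + 1))
    result = {}
    for cand in candidates:                         # loop over M candidates
        sup = sum(1 for t in db if set(cand) <= t)  # scan N transactions;
        if sup >= minsup:                           # each match costs ~W
            result[cand] = sup
    return result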
Mining Frequent Itemsets in Serial

FP-growth algorithm [Han et al. @ SIGMOD 2000]:
- One of the most efficient serial algorithms for mining frequent itemsets [FIMI '03].
- A divide-and-conquer algorithm.

Mining process of FP-growth:
- Step 1: identify frequent 1-items with one scan of the dataset.
- Step 2: construct a tree structure (FP-tree) for the dataset with another dataset scan.
- Step 3: traverse the FP-tree and construct a projection (sub-tree) for each frequent 1-item; recursively mine the projections.
Example of FP-growth Algorithm
Input dataset (support threshold = 3):

TID | Items
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

Step 1: one scan of the dataset yields the frequent 1-items and their counts: f:4, c:4, a:3, b:3, m:3, p:3.
Step 2: sort each transaction's frequent items in descending frequency order (f, c, a, b, m, p) and insert the ordered transactions into the FP-tree one by one, sharing common prefixes:

TID | Ordered frequent items
100 | {f, c, a, m, p}
200 | {f, c, a, b, m}
300 | {f, b}
400 | {c, b, p}
500 | {f, c, a, m, p}

(The slide shows the tree after each insertion; the final FP-tree is:)

root
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
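
The insertion procedure is mechanical enough to sketch. Below is a minimal Python rendering of Step 2 (our code, not the authors'; ties between equally frequent items are broken alphabetically, so c sorts before f here, whereas the slide puts f first. Any fixed tie-break yields an equivalent tree):

class FPNode:
    """An FP-tree node: item label, count, children, and a parent
    pointer (used in Step 3 to climb prefix paths)."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                          # item -> FPNode

def build_fp_tree(db, minsup):
    counts = {}
    for t in db:                                    # scan 1: 1-item counts
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    root, links = FPNode(None, None), {}            # links: item -> side link
    for t in db:                                    # scan 2: insert transactions
        items = [i for i in t if counts[i] >= minsup]
        items.sort(key=lambda i: (-counts[i], i))   # descending frequency
        node = root
        for item in items:                          # shared prefixes share nodes
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                links.setdefault(item, []).append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, links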
Example of FP-growth Algorithm (cont’d)
Step 3:
- Traverse the FP-tree by following the side link of each frequent 1-item and accumulate its prefix paths.
- Build an FP-tree (projection) for the accumulated prefix paths.
- If the projection contains only one path, enumerate all combinations of its items; otherwise, recursively mine the projection.
Prefix paths read off the side links:

Item | Prefix paths
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1

Example: m's prefix paths are fca:2 and fcab:1. In the m-projection, b falls below the threshold (support 1 < 3) and is dropped, leaving the single path f:3 -> c:3 -> a:3. Enumerating all combinations of its items gives all frequent patterns concerning m:

m, fm, cm, am, fcm, fam, cam, fcam
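
The single-path shortcut is easy to make concrete. This sketch (our code) reproduces the m example: the prefix-path table is hard-coded from the slide, b has already been dropped for falling below the threshold, and every combination of the remaining path items is joined with the suffix m:

from itertools import chain, combinations

# m's accumulated prefix paths, copied from the table above:
m_prefix_paths = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]

# Within the m-projection, b has support 1 < 3, so the projection
# reduces to the single path f:3 -> c:3 -> a:3.  For a single-path
# projection, every combination of the path's items joined with the
# suffix is frequent, so no recursion is needed.
def single_path_patterns(path_items, suffix):
    combos = chain.from_iterable(
        combinations(path_items, k) for k in range(len(path_items) + 1))
    return ["".join(c) + suffix for c in combos]

print(single_path_patterns(("f", "c", "a"), "m"))
# ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']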
Talk Outline
- Background
  - Frequent-itemset mining
  - Serial algorithm
- Parallel frequent pattern mining
  - Parallel framework
  - Load balancing problem
- Experimental results
- Optimization
- Summary
Parallel Mining of Frequent Itemsets

Runtime breakdown of the FP-growth algorithm:
- Identify frequent single items: 1.31%
- Build the tree structure for the whole DB: 1.91%
- Make a projection for each frequent single item from the tree structure and mine the projection: 96.78%

Parallelization framework for FP-growth (divide and conquer):
- Identify the frequent single items in parallel.
- Partition the frequent single items and assign each subset of frequent items to a processor.
- Each processor builds the tree structure related to its assigned items from the DB.
- Each processor makes projections for its assigned items from its local tree and mines the projections independently.
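
A runnable miniature of this framework (our sketch: Python's multiprocessing stands in for the talk's C++/MPI implementation, and naive candidate counting stands in for FP-growth inside each worker, since the task-partitioning structure is the point here):

from itertools import chain, combinations
from multiprocessing import Pool

def mine_item(args):
    """Worker: mine every frequent itemset whose least-frequent item is
    `item`.  Restricting the projection to strictly more frequent items
    ensures each pattern is produced by exactly one worker."""
    db, item, rank, minsup = args
    # The item's projected database: transactions containing `item`,
    # restricted to items ranked more frequent than `item`.
    proj = [frozenset(i for i in t if i in rank and rank[i] < rank[item])
            for t in db if item in t]
    if len(proj) < minsup:
        return []
    found = [((item,), len(proj))]
    cand = sorted(set().union(*proj), key=lambda i: rank[i])
    for k in range(1, len(cand) + 1):           # naive stand-in for FP-growth
        for combo in combinations(cand, k):
            sup = sum(1 for t in proj if set(combo) <= t)
            if sup >= minsup:
                found.append((combo + (item,), sup))
    return found

def parallel_mine(db, minsup, nprocs=4):
    counts = {}
    for t in db:                                # identify frequent 1-items
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    freq1 = [i for i, c in counts.items() if c >= minsup]
    rank = {i: (-counts[i], i) for i in freq1}  # global frequency order
    with Pool(nprocs) as pool:                  # one task per frequent item
        chunks = pool.map(mine_item, [(db, i, rank, minsup) for i in freq1])
    return list(chain.from_iterable(chunks))

if __name__ == "__main__":
    db = [{"f","a","c","d","g","i","m","p"}, {"a","b","c","f","l","m","o"},
          {"b","f","h","j","o"}, {"b","c","k","s","p"},
          {"a","f","c","e","l","p","m","n"}]
    for pattern, sup in sorted(parallel_mine(db, 3)):
        print("".join(pattern), sup)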
Load Balancing Problem
Reason:
- The largest projections take too long to mine relative to the mining time of the overall dataset.

Solution:
- The larger projections must be partitioned.

Challenge:
- Identifying the larger projections.
(Figure: speedup of Par-FP on pumsb flattens well below the optimal line as the processor count grows from 1 to 64. A second chart plots per-projection mining times: the largest task takes 204.7 seconds against an average of 26.7 seconds.)
How To Identify The Larger Projections?
Static estimation:
- Based on dataset parameters: number of items, number of transactions, length of transactions, ...
- Based on characteristics of the projection: depth, bushiness, tree size, number of leaves, fan-out, fan-in, ...

Result: no correlation was found with any of the above.
(Figure: mining time vs. FP-tree depth across the projections; the two series show no correlation.)
Dynamic Estimation
Runtime sampling:
- Use the relative mining time of a sample to estimate the relative mining time of the whole dataset.
- Trade-off: accuracy vs. overhead.

Random sampling (randomly selecting a subset of records) is not accurate.
e.g. pumsb, 1% random sample (overhead 1.03%):

(Figure: per-projection mining time of the whole dataset vs. of the 1% random sample; the sample's relative mining times track the whole dataset's poorly.)
Selective Sampling
Sampling based on frequency:
- Discard the infrequent items.
- Discard a fraction t of the most frequent 1-items.

e.g. With frequent 1-items <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3), (l:2), (s:2), (n:2), (q:2)> (Supmin = 2) and t = 20%, the top two items f and c are discarded, so the transaction {f, a, c, d, g, i, m, p} becomes {a, m, p}.

When filtering the top 20%, sampling takes on average 1.41% of the sequential mining time and still provides fairly good accuracy.
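
A minimal sketch of the transformation (our code; the slide describes item filtering, and whether records are additionally subsampled is not spelled out here, so this version filters items only):

def selective_sample(db, minsup, t=0.2):
    """Keep only frequent items, then cut the fraction `t` of the most
    frequent 1-items from every transaction."""
    counts = {}
    for trans in db:
        for i in trans:
            counts[i] = counts.get(i, 0) + 1
    freq = sorted((i for i, c in counts.items() if c >= minsup),
                  key=lambda i: (-counts[i], i))    # most frequent first
    keep = set(freq[int(len(freq) * t):])           # drop the top t fraction
    return [{i for i in trans if i in keep} for trans in db]

# With the ten frequent 1-items above and t = 0.2, f and c are cut, so
# {f, a, c, d, g, i, m, p} becomes {a, m, p}.  The sample is mined
# serially, and each projection's sample mining time serves as the
# relative estimate for the corresponding full task.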
e.g. pumsb, top 20% filtered (overhead 0.71%):

(Figure: per-projection mining time of the whole dataset vs. of the selective sample; the sample's relative mining times closely track the whole dataset's.)
Why Selective Sampling Works

- The mining time is proportional to the number of frequent itemsets in the result (observed experimentally).
- Given a frequent L-itemset, all of its subsets are frequent itemsets: 2^L - 1 of them.
- Removing one item at the root therefore roughly halves the total number of itemsets in the result, and with it the mining time.
- The most frequent items are close to the root. The mining time of their own projections is negligible, but their presence inflates the number of itemsets in the results.
Talk Outline
- Background
  - Frequent-itemset mining
  - Serial algorithm
- Parallel frequent pattern mining
  - Parallel framework
  - Load balancing problem
- Experimental results
- Optimization
- Summary
Experimental Setups
- A Linux cluster with 64 nodes; each node has a 1 GHz Pentium III processor and 1 GB of memory.
- Implementation: C++ and MPI.
- Datasets:

Dataset | #Transactions | #Items | Max Trans. Length
mushroom | 8,124 | 23 | 23
connect | 57,557 | 43 | 43
pumsb | 49,046 | 7,116 | 74
pumsb_star | 49,046 | 7,116 | 63
T40I10D100K | 100,000 | 999 | 77
T50I5D500K | 500,000 | 5,000 | 94
Speedups: Frequent-itemset Mining
(Figure: speedup of Par-FP vs. the optimal line on 1 to 64 processors for mushroom, connect, pumsb, pumsb_star, T40I10D100K, and T50I5D500K; one chart is annotated "Needs multi-level task partitioning".)
Experimental Results For Selective Sampling
Overhead of selective sampling: 1.4% on average.

Dataset | Overhead
mushroom | 0.71%
connect | 1.80%
pumsb | 0.71%
pumsb_star | 2.90%
T40I10D100K | 2.05%
T50I5D500K | 0.28%

Effectiveness of selective sampling:
- Selective sampling improves performance by 37% on average.

(Figure: speedups on 64 processors for each dataset, with and without sampling.)
Optimization: Multi-level Partitioning
Problem analysis:
- Total mining time on pumsb_star is 576, and the maximal subtask after 1-level partitioning takes 131, so the optimal speedup with 1-level partitioning is 576/131 = 4.4.

(Figure: pumsb_star speedup of Par-FP vs. optimal, and a bar chart of subtask mining times: total 576, maximal subtask 131 with one level of partitioning and 65 with a further level.)

Conclusion:
- We need multi-level task partitioning to obtain better speedup.

Challenges:
- How many levels are necessary?
- Which sub-subtasks should be further partitioned?
Optimization: Multi-level Partitioning
Observations:
- The mining time of the maximal subtask derived from a task is about 1/2 of the mining time of the task itself.
- The mining time of the maximal subtask derived from the top-1 task is about the same as that of the top-2 task.

Reason:
- There is one very long frequent pattern <a b c d e f g ...>, say of length L, in the dataset. Since the mining time is proportional to the number of frequent itemsets in the result, the tasks scale as:

a: 2^(L-1), with maximal subtask ab: 2^(L-2)
b: 2^(L-2), with maximal subtask bc: 2^(L-3)
c: 2^(L-3), with maximal subtask cd: 2^(L-4)
...

So if we partition a's task down to abcde, the derived subtask for abcde takes 2^(L-5), about 1/16 of a's task.
Optimization: Multi-level Partitioning
Multi-level partitioning heuristic:
- If a subtask's mining time (estimated by selective sampling) is greater than (1/4) * (1/N) * sum_{i=1..M} T_i, partition it so that the maximal mining time of the derived sub-subtasks is less than that bound. (N: number of processors; M: total number of tasks; T_i: mining time of subtask i.)

Result:

(Figure: pumsb_star speedup with one-level vs. multi-level partitioning; the multi-level curve follows the optimal line much more closely.)
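
In code, the heuristic is a work-list loop (our Python sketch; `split` is a caller-supplied routine standing in for "repartition this task's projection" and must return the derived sub-subtasks with their estimated times):

def multilevel_partition(tasks, nprocs, split):
    """tasks: list of (name, est_time) pairs, with times estimated by
    selective sampling.  Tasks above the threshold are split again,
    as many levels as it takes."""
    threshold = sum(est for _, est in tasks) / (4 * nprocs)  # (1/4)(1/N)sum(T_i)
    ready, work = [], list(tasks)
    while work:
        name, est = work.pop()
        if est > threshold:             # too large: partition one more level
            work.extend(split(name, est))
        else:
            ready.append((name, est))
    return ready

# Toy usage: splitting roughly halves the maximal derived subtask,
# mirroring the observation above.
halve = lambda name, est: [(name + "0", est / 2), (name + "1", est / 2)]
print(multilevel_partition([("a", 8.0), ("b", 3.0), ("c", 1.0)], 2, halve))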
Summary
- Data mining is an important application of parallel processing.
- We developed a framework for parallel frequent-itemset mining and achieved good speedups.
- We proposed the selective sampling technique to address the load-balancing problem, improving speedups by 45% on average on 64 processors.
Questions?