Page 1: 林俊宏 2010.06.01 Parallel Association Rule Mining based on FI-Growth Algorithm Bundit Manaskasemsak, Nunnapus Benjamas, Arnon Rungsawang

林俊宏 2010.06.01

Parallel Association Rule Mining based on FI-Growth Algorithm

Bundit Manaskasemsak,

Nunnapus Benjamas,

Arnon Rungsawang

Page 2:

Outline

1. Introduction
2. FI-Growth algorithm
3. Parallel FI-Growth
4. Experiments and results
5. Conclusion

Page 3:

Introduction

Association rule mining is one of the most important techniques in data mining.

It consists of two main steps:
• Frequent itemset generation extracts the most frequent patterns.
• Rule generation uses these frequent patterns to generate interesting rules.
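The two steps can be illustrated with a minimal brute-force sketch. It is not the FI-growth algorithm; the transactions, the thresholds, and the function names `frequent_itemsets` and `generate_rules` are assumptions made for this example only.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Step 1 (brute force): count every itemset, keep those reaching min_count."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] = counts.get(subset, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_count}

def generate_rules(frequent, min_conf):
    """Step 2: split each frequent itemset into antecedent -> consequent."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                rhs = tuple(i for i in itemset if i not in lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so frequent[lhs] is always present.
                conf = count / frequent[lhs]
                if conf >= min_conf:
                    rules.append((lhs, rhs, conf))
    return rules

# Hypothetical toy database (not taken from the slides).
transactions = [["a", "c", "d"], ["a", "c"], ["c", "e"], ["a", "c", "e"]]
frequent = frequent_itemsets(transactions, min_count=2)
rules = generate_rules(frequent, min_conf=0.9)
```

With this toy input, the rules that survive are a -> c and e -> c, each with confidence 1.0. Brute-force counting is exponential in the transaction length, which is the cost that tree-based methods try to avoid.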

Page 4:

Introduction

Two fundamental algorithms have been proposed for finding frequent itemsets in large databases:
• Apriori algorithm
• Closed algorithm

Two further algorithms were proposed to reduce the cost of these approaches:
• FP-growth algorithm
• FI-growth algorithm

Page 5:

Introduction

Transaction-oriented databases are usually very large. Mining useful rules from such large and volatile databases is a challenging problem.

Fast association rule mining inevitably requires large computing resources.

Cluster computing technology offers a potential solution:
• parallel Apriori approach
• parallel FP-growth approach

Page 6:

Introduction

The objective of this paper:
• Utilize parallelization on a computing cluster environment for fast extraction of frequent itemsets from large, dense databases.
• Propose an alternative approach: parallel association rule mining based on the FI-growth algorithm.

Page 7:

FI-Growth algorithm

Similar to the FP-growth algorithm, FI-growth represents the data set as a prefix-sharing tree, called an “FI-tree”.

It consists of two phases:
• FI-tree construction
• Mining

Page 8:

FI-Growth algorithm

Constructing an FI-tree requires scanning the database only twice:
• the first scan creates the header table;
• the second scan creates the items-tree.

Item counts:
A 3, B 1, C 4, D 2, E 4, F 4

Item counts with B removed:
A 3, C 4, D 2, E 4, F 4

Note that the items in all lists must be in the same relative order.
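The two-scan construction can be sketched with a generic prefix-sharing tree. This is an illustration under stated assumptions, not the paper's exact FI-tree: the five transactions are invented so that they reproduce the item counts above, the threshold `min_count=2` is inferred from B (count 1) being dropped while D (count 2) is kept, and alphabetical order is assumed as the common item order.

```python
from collections import Counter

def build_prefix_tree(transactions, min_count):
    # Scan 1: count the items and keep the frequent ones as the header table.
    counts = Counter(item for t in transactions for item in set(t))
    header = {item: c for item, c in sorted(counts.items()) if c >= min_count}

    # Scan 2: insert every transaction, filtered and ordered like the header
    # table, so that all paths share the same relative item order.
    root = {}  # nested dicts: item -> [count, children]
    for t in transactions:
        node = root
        for item in (i for i in header if i in t):
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1
            node = entry[1]
    return header, root

# Hypothetical transactions chosen to give A:3, B:1, C:4, D:2, E:4, F:4.
transactions = [
    ["A", "C", "E", "F"],
    ["A", "C", "D", "F"],
    ["A", "B", "C", "E"],
    ["C", "D", "E", "F"],
    ["E", "F"],
]
header, root = build_prefix_tree(transactions, min_count=2)
```

Because every transaction is inserted in the same item order, transactions that start with the same items share a path: here the first three all pass through A and then C.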

Page 9:

FI-Growth algorithm

Combining operation (©): identical sub-paths are grouped and their counts are summed.

The combining operation has the following properties:
1) Self-reflective property: tree(a) © tree(a) is equal to tree(a) itself.
2) Commutative property: tree(a1) © tree(a2) is equal to tree(a2) © tree(a1).
3) Associative property: (tree(a1) © tree(a2)) © tree(a3) is equal to tree(a1) © (tree(a2) © tree(a3)).

[Figure: a sub-tree with nodes e:1, d:2, f:1 and f:1, shown three times]
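A minimal sketch of a count-summing merge, assuming each tree is a nested dict mapping an item to a `(count, children)` pair. The representation, the function name `combine`, and the three example trees are assumptions for illustration. The sketch only exercises the commutative and associative properties; it does not model the self-reflective one.

```python
def combine(t1, t2):
    """Merge two count trees: shared sub-paths are grouped, counts are summed."""
    merged = {}
    for item in sorted(set(t1) | set(t2)):
        c1, kids1 = t1.get(item, (0, {}))
        c2, kids2 = t2.get(item, (0, {}))
        merged[item] = (c1 + c2, combine(kids1, kids2))
    return merged

# Hypothetical sub-trees (shapes are invented for the example).
t_a = {"d": (2, {"e": (1, {"f": (1, {})}), "f": (1, {})})}
t_b = {"d": (1, {"f": (1, {})}), "e": (1, {})}
t_c = {"e": (2, {"f": (2, {})})}

ab = combine(t_a, t_b)
```

Here `ab` holds d with count 3, and the path d, f with count 2, because that sub-path appears in both inputs. Since integer addition is commutative and associative, the merge inherits both properties.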

Page 10:

The result of the combining operation (grey nodes) replaces the old sub-tree that is linked from the root.

Page 11:

FI-Growth algorithm
• Branching step
• Subset finding step
• Pruning step

[Figure: example FI-tree linked from root, with nodes including a:3, c:2, d:2, e:1, e:2, e:4 and f:1 to f:4]

Page 12:

Parallel FI-Growth

A parallel version of the FI-growth algorithm:
• employs a data parallelism technique on a PC cluster;
• partitions the transactions among the processors;
• uses a one-time synchronization in which the processors exchange their sub-trees.
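The data-parallel pattern can be sketched in a single process: split the transactions, build one tree per partition independently, then merge the trees in one synchronization step. Everything here is an assumption for illustration (round-robin partitioning, a generic prefix tree, a sequential loop standing in for the processors, and merging into one tree rather than exchanging sub-trees between processors); it is not the paper's implementation.

```python
def build_local_tree(part):
    """Build a prefix tree (item -> [count, children]) from one partition."""
    tree = {}
    for t in part:
        node = tree
        for item in sorted(set(t)):
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1
            node = entry[1]
    return tree

def merge(into, other):
    """Add the counts of `other` into `into`, path by path."""
    for item, (count, kids) in other.items():
        entry = into.setdefault(item, [0, {}])
        entry[0] += count
        merge(entry[1], kids)

def parallel_build(transactions, n_procs):
    parts = [transactions[p::n_procs] for p in range(n_procs)]  # partition
    local_trees = [build_local_tree(part) for part in parts]    # independent work
    merged = {}
    for tree in local_trees:                                    # one synchronization
        merge(merged, tree)
    return merged

# Hypothetical transactions (not taken from the slides).
transactions = [["a", "c", "e"], ["a", "c"], ["c", "d"], ["e", "f"],
                ["a", "c", "e", "f"]]
merged = parallel_build(transactions, n_procs=2)
```

The merged tree equals the tree built from the whole database at once, regardless of how many partitions are used, which is what makes a single synchronization sufficient in this simplified setting.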

Page 13:

Parallel FI-Growth

Hierarchical minimum support

Two solutions to avoid such a problem:
1) All processors synchronize their lists of item counts.
2) Two values of minimum support are used:
   • min_supL1 is defined and used to prune the local header table.
   • min_supL2 is defined to prune the local items-tree.

In this paper, we use the second approach.

Page 14:

Parallel FI-Growth

Parallelization
• min_supL1 = 1 (20%)
• min_supL2 = 2 (40%)
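The two thresholds are consistent with a local partition of 5 transactions, since 20% of 5 is 1 and 40% of 5 is 2. The partition size is an inference from those figures, not something the slide states. A small sketch of the arithmetic follows; the helper `support_count` and the local item counts are hypothetical.

```python
def support_count(percent, n_transactions):
    """Smallest transaction count that reaches the given percentage."""
    return (percent * n_transactions + 99) // 100  # integer ceiling

n_local = 5  # assumed size of one local partition
min_sup_l1 = support_count(20, n_local)
min_sup_l2 = support_count(40, n_local)

# Hypothetical local item counts, to show which items reach each threshold.
local_counts = {"a": 3, "b": 1, "c": 2}
above_l1 = {i for i, c in local_counts.items() if c >= min_sup_l1}
above_l2 = {i for i, c in local_counts.items() if c >= min_sup_l2}
```

With these numbers, item b reaches the looser first threshold but not the stricter second one. How the second threshold is applied inside the items-tree is not shown here.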

Page 15:

Parallel FI-Growth

FI-tree synchronization

Exchange of local header tables:
• To reduce the communication overhead, only the list of items is broadcast to the other processors.

Sending of local sub-trees:
• Each processor determines which local sub-tree(s) should be kept and which should be sent to the target processors.
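One way to picture the keep-or-send decision is sketched below. The assignment rule (items dealt to processors round-robin over the merged item list) is purely hypothetical, as are the names `owner` and `split_subtrees` and the example data; the slides do not say how target processors are chosen.

```python
def owner(item, global_items, n_procs):
    """Hypothetical rule: deal the items to the processors round-robin."""
    return global_items.index(item) % n_procs

def split_subtrees(local_tree, rank, global_items, n_procs):
    """Separate the sub-trees this processor keeps from those it sends away."""
    keep = {}
    send = {p: {} for p in range(n_procs) if p != rank}
    for item, subtree in local_tree.items():
        target = owner(item, global_items, n_procs)
        if target == rank:
            keep[item] = subtree
        else:
            send[target][item] = subtree
    return keep, send

# Only the item lists are exchanged first (no counts), then merged.
item_lists = [["a", "c", "e"], ["c", "d", "f"]]  # one list per processor
global_items = sorted(set().union(*item_lists))

# Hypothetical top-level sub-trees held by processor 0: item -> (count, children).
local_tree = {"a": (3, {}), "c": (1, {}), "e": (1, {})}
keep, send = split_subtrees(local_tree, rank=0, global_items=global_items, n_procs=2)
```

Because every processor derives the same `global_items` list from the broadcast, each can compute the owner of a sub-tree locally without further communication.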

Page 16:

Experiments and results

Hardware and environment configuration:
• Tested on a cluster of x86-64 based SMP machines named “Bedrocks”.
• Each machine consists of dual 3.2GHz Intel quad-core processors, 4GB of main memory, and an 80GB SATA disk.
• Each machine is equipped with a Linux-based operating system.
• The machines are inter-connected via a 1000Base-TX Ethernet switch.
• The parallel algorithm is written in the C language and uses the MPICH message passing library, version 1.2.7.
• All experiments were run under no-load conditions.

Page 17:

Experiments and results

Data set: For the test data set, we utilized the standard “IBM synthetic data generator” to synthesize a transaction database:
• 1000 unique items
• 16 million records (each with an average transaction length of 10)

Page 18:

Page 19:

Conclusion

In this paper, we propose a parallel FI-growth algorithm to accelerate association rule mining.

Future work includes research in many areas:
• effects of partitioning
• run-time memory requirements
• reducing the communication overhead
• load balancing

Page 20: