temporal pattern mining

INDIAN INSTITUTE OF TECHNOLOGY ROORKEE

A Study of Scalable Pattern Mining Algorithms on Large Scale Interval Data

Under Supervision Of: Dr. Dhaval Patel

CSE Department

Presented By: Prakhar Dhama

15535029

Outline

• What is Pattern Mining?

• Need for Scalable Pattern Mining

• Interval-based Events

• Serial Frequent Itemset Mining

– Apriori, Eclat and FP-growth

• Parallel Itemset Mining

– FP-growth based PFP

– Ultrametric tree based FiDoop

• Pattern Mining on Interval Data

– Interval Sequences

– Temporal Relations

– Hierarchical Representation

• Conclusion and Research Gap

What is Pattern Mining?

• A pattern can be a set of items, ordered subsequences,

subgraphs, etc.

• The main kinds of pattern mining are:

– Frequent itemset mining. Finding sets of items that frequently appear together in a transactional database, such as milk and bread.

– Sequential pattern mining. Finding frequently occurring subsequences in a sequence database, such as a customer buying pattern: first a digital camera, followed by a memory card.

– Structured pattern mining. Finding frequent substructures, such as graphs, trees, or lattices, in a structured database.

– Temporal pattern mining. Finding relations among events in a temporal database, such as the relation between the time for which an iron is switched on and the time for which its steel base is hot.

Need for Scalable Pattern Mining

• The NSA Utah Data Center stores data on the order of exabytes. In 2014, the NSA processed 29 petabytes of data in a single day.

• With the huge increase in data size, pattern mining on a single machine is infeasible.

• Solution. Modify existing pattern mining algorithms and design scalable versions that can run on distributed systems.

• Parallel programming models include MapReduce, Bulk Synchronous Parallel, etc.

• Popular big data tools for implementing parallel algorithms include Apache Spark, Hadoop, and NoSQL databases such as Cassandra and MongoDB.

Interval-based Events Data

• In the real world, events are often not instantaneous; they persist for some duration and are called interval events.

• Data that includes time-related attributes is stored in a temporal database.

• The relations among interval events are intrinsically complex, so point-based algorithms are not applicable.

• Applications

– A power meter that logs the electricity usage of household appliances can be used to identify the times at which each appliance is turned on or off.

– In diabetic patients, it has been observed that the presence of hyperglycemia overlaps with the absence of glycosuria.

– Domains such as medicine, multimedia, meteorology and finance, where event durations can play an important role.

Large Scale Interval Data

• Querying vs Mining. The purpose of mining is to discover

knowledge while database querying simply retrieves data.

• The only work that deals with large scale interval data addresses querying and quantitative analysis[1], not mining.

• All current efforts on mining temporal relationships rely on sequential algorithms, and the problem of scalable mining on large scale interval data has not yet been addressed.

• Solution. Design a novel strategy to mine temporal patterns on large scale interval data by building on

– Existing parallel mining algorithms for point-based events.

– Sequential pattern mining algorithms on interval data.

Serial Frequent Itemset Mining Methods

• Mining frequent itemsets is the first step; it is followed by a second step that generates association rules from them.

• Apriori. Uses a breadth-first strategy to count the support of itemsets, with a candidate generation function that exploits the downward closure property of support (every subset of a frequent itemset is itself frequent).

• Eclat. Equivalence Class Transformation is a depth-first algorithm. It converts the transactional database to its vertical format, i.e. a transaction-id list for each item, and then computes support by set intersection.

• FP-growth. Avoids candidate generation and instead uses a prefix-tree structure, the FP-tree. It makes two passes over the data set to build the tree and then recursively traverses the FP-tree for each frequent item.
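
As an illustration of the first two methods above, the following is a minimal Python sketch, not an optimized implementation. The function names `apriori` and `eclat` and the toy `transactions` are assumptions made for this example. Apriori works level by level and prunes candidates with downward closure; Eclat works depth first on transaction-id sets. Both should report the same frequent itemsets.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Breadth-first (level-wise) frequent itemset mining."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Downward closure: every k-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent


def eclat(transactions, min_support):
    """Depth-first mining on the vertical (item -> transaction-id set) format."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    frequent = {}

    def recurse(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Extend with later items; support is the size of the intersection.
            extensions = [(other, tids & other_tids)
                          for other, other_tids in candidates[i + 1:]
                          if len(tids & other_tids) >= min_support]
            recurse(itemset, extensions)

    start = sorted(((i, t) for i, t in tidlists.items() if len(t) >= min_support),
                   key=lambda pair: pair[0])
    recurse(frozenset(), start)
    return frequent


# Hypothetical toy database, used only for illustration.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
by_apriori = apriori(transactions, min_support=2)
by_eclat = eclat(transactions, min_support=2)
```

On this toy data both functions return the same dictionary mapping each frequent itemset to its support.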

Parallel Itemset Mining

• Apriori-like parallel FIM algorithms include FDM, DDM, FPM, and the MapReduce-based DPC[2].

• Apriori-like solutions suffer from high I/O, communication, and synchronization overhead, which makes these parallel algorithms hard to scale up.

• The most recent Eclat-like parallel algorithms include Dist-Eclat and BigFIM[3].

• FP-growth-like parallel FIM algorithms include the shared-memory-based cache-conscious FP-growth and the most popular one, the MapReduce-based PFP[4].

• Ultrametric-tree-based FIUT[5] and FiDoop[6].

• Others include the recent lexicographical-tree-based Sequence-Growth[7].

PFP algorithm

• A popular MapReduce-based parallel FP-growth algorithm.

• Consists of three MapReduce phases:

– Sharding and Parallel Counting

– Group-dependent Shard FP-growth

– Aggregation

• Phase 1. Sharding divides the database into consecutive parts and stores them on different machines. Parallel Counting runs a MapReduce task that counts the support of every item; each mapper works on a single shard.

• Phase 2. The frequent items are divided into groups. For each transaction, the mapper outputs group-dependent transactions keyed by group id. The reducer then builds an FP-tree for each group and mines it.

• Phase 3. For each item, the corresponding frequent patterns are collected, and the required number of most-supported patterns is reported.
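
The three phases above can be sketched in a single Python process, under stated assumptions. This is not the authors' implementation: the shards are plain lists, the shuffle is a dictionary, and the reducer replaces the FP-tree with brute-force subset enumeration over each group-dependent transaction, which is only viable for toy data. All function names, the round-robin grouping, and the data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def parallel_count(shards):
    """Phase 1: each 'mapper' counts one shard; the 'reducer' sums the counts."""
    total = Counter()
    for shard in shards:
        total.update(Counter(item for t in shard for item in set(t)))
    return total


def group_items(counts, min_support, num_groups):
    """Order frequent items by descending support and assign each to a group."""
    flist = sorted((i for i, c in counts.items() if c >= min_support),
                   key=lambda i: (-counts[i], i))
    return flist, {item: rank % num_groups for rank, item in enumerate(flist)}


def map_group_dependent(transaction, flist, group_of):
    """Phase 2 mapper: emit (group id, prefix) once per group, longest prefix first."""
    rank = {item: r for r, item in enumerate(flist)}
    ordered = sorted((i for i in set(transaction) if i in rank), key=rank.get)
    seen = set()
    for j in range(len(ordered) - 1, -1, -1):
        gid = group_of[ordered[j]]
        if gid not in seen:
            seen.add(gid)
            yield gid, ordered[: j + 1]


def reduce_group(gid, prefixes, group_of, min_support):
    """Phase 2 reducer stand-in: count itemsets whose last item belongs to gid."""
    counts = Counter()
    for prefix in prefixes:
        for p, last in enumerate(prefix):
            if group_of[last] != gid:
                continue
            for r in range(p + 1):
                for head in combinations(prefix[:p], r):
                    counts[frozenset(head + (last,))] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def pfp(shards, min_support, num_groups=2, top_k=3):
    counts = parallel_count(shards)
    flist, group_of = group_items(counts, min_support, num_groups)
    by_group = defaultdict(list)          # shuffle: group id -> prefixes
    for shard in shards:
        for t in shard:
            for gid, prefix in map_group_dependent(t, flist, group_of):
                by_group[gid].append(prefix)
    patterns = {}
    for gid, prefixes in by_group.items():
        patterns.update(reduce_group(gid, prefixes, group_of, min_support))
    # Phase 3: for every item keep its top_k most supported patterns.
    top = {item: sorted((p for p in patterns if item in p),
                        key=lambda p: (-patterns[p], sorted(p)))[:top_k]
           for item in flist}
    return patterns, top


# Hypothetical database split into two shards.
shards = [
    [{"milk", "bread", "butter"}, {"milk", "bread"}],
    [{"bread", "butter"}, {"milk", "bread", "butter"}],
]
patterns, top = pfp(shards, min_support=2)
```

Because each transaction contributes at most one prefix per group, and each itemset is counted only in the group of its last item, no pattern is counted twice across reducers.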

FiDoop Algorithm

• One of the recent parallel FIM algorithms; it outperforms Apriori-like solutions as well as the FP-growth-based PFP.

• Based on the ultrametric tree, extending FIUT.

• A k-FIU-tree is built by placing every frequent itemset of length k on a single path from the root to the last item of the itemset. Hence, all leaves are at the same height k.

• Example.

Itemsets (itemset:count): abc:1, abd:2, acde:3

3-FIU-tree: root → a → b → {c:1, d:2}
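
A k-FIU-tree for the example on this slide can be sketched with nested dictionaries. This is a minimal illustration that assumes the numbers next to the itemsets are counts; `build_fiu_tree` and `leaf_depths` are hypothetical names, and only itemsets of length exactly k are inserted.

```python
def build_fiu_tree(itemsets, k):
    """Insert every k-itemset as one root-to-leaf path; leaves hold counts."""
    root = {}
    for items, count in itemsets:
        if len(items) != k:
            continue                  # only k-itemsets belong in a k-FIU-tree
        path = sorted(items)
        node = root
        for item in path[:-1]:
            node = node.setdefault(item, {})
        node[path[-1]] = node.get(path[-1], 0) + count
    return root


def leaf_depths(node, depth=0):
    """Return the set of depths at which leaves (integer counts) occur."""
    depths = set()
    for child in node.values():
        if isinstance(child, dict):
            depths |= leaf_depths(child, depth + 1)
        else:
            depths.add(depth + 1)
    return depths


itemsets = [("abc", 1), ("abd", 2), ("acde", 3)]
tree = build_fiu_tree(itemsets, k=3)
```

Here `tree` holds the single branch a, b with leaves c and d, and `leaf_depths(tree)` confirms that every leaf sits at height 3.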

FiDoop Design

• Uses three MapReduce jobs, like PFP.

• First MapReduce job. Discovers all frequent items, i.e. frequent one-itemsets.

• Second MapReduce job. Scans the database and generates k-itemsets by removing the infrequent items from each transaction.

• Third MapReduce job. Constructs the decomposed h-FIU-trees, 2≤h≤k-1, and mines all frequent h-itemsets.

Data flow through the three MapReduce jobs (key/value types):

– Input transaction: <LongWritable offset, Text record>

– Global one-itemset: <Text item, LongWritable count>

– Pruned transaction of k-itemset: <ArrayWritable k-item, LongWritable 1>

– <IntWritable id, MapWritable<ArrayWritable k-item, LongWritable SUM>>

– Frequent h-itemset from h-FIU-tree
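
The three jobs can be approximated in plain Python as follows. This is a rough single-process sketch under several assumptions: counters stand in for the FIU-trees and for the MapReduce shuffle, the function names are hypothetical, and the third step also counts the pruned k-itemset itself (h = k) so that the output is complete. The data reuses the itemsets abc, abd and acde from the FiDoop Algorithm slide, treating the numbers as occurrence counts.

```python
from collections import Counter
from itertools import combinations


def job1_count_items(transactions, min_support):
    """Job 1: global frequent one-itemsets."""
    counts = Counter(item for t in transactions for item in set(t))
    return {i: c for i, c in counts.items() if c >= min_support}


def job2_prune(transactions, frequent_items):
    """Job 2: drop infrequent items; emit (k-itemset, 1) and sum per itemset."""
    pruned = Counter()
    for t in transactions:
        kept = tuple(sorted(i for i in set(t) if i in frequent_items))
        if kept:
            pruned[kept] += 1
    return pruned


def job3_decompose(pruned, min_support):
    """Job 3: decompose each k-itemset into h-itemsets and keep the frequent ones."""
    counts = Counter()
    for itemset, n in pruned.items():
        for h in range(2, len(itemset) + 1):
            for sub in combinations(itemset, h):
                counts[sub] += n
    return {s: c for s, c in counts.items() if c >= min_support}


# abc occurs once, abd twice, acde three times (illustrative data).
transactions = [list("abc")] + [list("abd")] * 2 + [list("acde")] * 3
frequent_items = job1_count_items(transactions, min_support=3)
pruned = job2_prune(transactions, frequent_items)
frequent = job3_decompose(pruned, min_support=3)
```

Enumerating every subset is exponential in the transaction length, which is why the real algorithm organises the decomposed itemsets in h-FIU-trees and distributes them across reducers.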

Pattern Mining on Interval Data

• Various algorithms have been proposed to discover temporal

patterns on interval data.

• Apriori-like. H-DFS[8] transforms event sequences into id-lists and merges the id-lists iteratively; IEMiner[9] reduces the search space and removes non-promising candidates.

• Pattern-growth. TPrefixSpan[10] generates all possible candidates and then scans the projected database recursively to discover temporal patterns; TPMiner[11] is based on database projection and includes several pruning techniques to reduce the search space.

Interval Sequences

• A temporal database handles time-stamped data. It stores all the interval sequences.

• An interval sequence is a collection of intervals, each with a start time and an end time.

• Example

Db contains 4 interval sequences.

Let minimum support = 3.

Temporal pattern (C=D) is frequent with support 4.
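
The support computation in this example can be sketched as follows. The database below is hypothetical and only chosen so that it has 4 interval sequences in which C equals D every time; the names `Interval`, `has_equal_pair` and `support` are assumptions for this illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    event: str
    start: int
    end: int


def has_equal_pair(sequence, x, y):
    """True if some occurrence of x and some occurrence of y share start and end."""
    return any(a.event == x and b.event == y
               and a.start == b.start and a.end == b.end
               for a in sequence for b in sequence)


def support(database, x, y):
    """Number of interval sequences that contain the pattern (x = y)."""
    return sum(1 for seq in database if has_equal_pair(seq, x, y))


# Hypothetical temporal database with 4 interval sequences.
db = [
    [Interval("A", 1, 4), Interval("C", 3, 7), Interval("D", 3, 7)],
    [Interval("B", 0, 2), Interval("C", 2, 5), Interval("D", 2, 5)],
    [Interval("A", 2, 6), Interval("B", 2, 6),
     Interval("C", 5, 9), Interval("D", 5, 9)],
    [Interval("C", 1, 3), Interval("D", 1, 3), Interval("A", 4, 8)],
]
min_support = 3
c_equals_d = support(db, "C", "D")
a_equals_b = support(db, "A", "B")
```

With this data the pattern (C=D) reaches the minimum support, while (A=B) occurs in only one sequence and is not frequent.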

Temporal Relations

• Most pattern mining on interval data is based on the 13 relations among temporal events proposed by Allen.

• The possible relations between two interval events X and Y are before, meets, overlaps, starts, during, finishes, the inverses of these six, and equals.
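
Allen's relations can be decided by comparing the four endpoints. The function below is a small sketch that assumes intervals are `(start, end)` tuples with start < end; the function name and the relation labels are choices made for this example.

```python
def allen_relation(x, y):
    """Return Allen's relation of interval x = (start, end) with respect to y."""
    xs, xe = x
    ys, ye = y
    if xe < ys:
        return "before"
    if ye < xs:
        return "after"
    if xe == ys:
        return "meets"
    if ye == xs:
        return "met-by"
    if xs == ys and xe == ye:
        return "equals"
    if xs == ys:
        return "starts" if xe < ye else "started-by"
    if xe == ye:
        return "finishes" if xs > ys else "finished-by"
    if ys < xs and xe < ye:
        return "during"
    if xs < ys and ye < xe:
        return "contains"
    # Remaining case: the intervals partially overlap.
    return "overlaps" if xs < ys else "overlapped-by"


# One example pair per base relation; swapping the arguments gives the inverse.
pairs = [((0, 2), (3, 5)), ((0, 3), (3, 5)), ((0, 4), (2, 6)),
         ((0, 2), (0, 5)), ((2, 4), (0, 6)), ((3, 6), (0, 6)),
         ((1, 2), (1, 2))]
relations = {allen_relation(x, y) for x, y in pairs}
relations |= {allen_relation(y, x) for x, y in pairs}
```

The seven example pairs and their swapped versions together exercise all 13 relations.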

Hierarchical Representation

• The representation should be lossless, i.e. the arrangement of the events should be exactly recoverable from it; otherwise spurious frequent patterns may be discovered.

• Lossless Hierarchical Representation

Various interpretations of the temporal pattern (P o Q) o R (figure: three arrangements of the intervals P, Q and R):

a. Overlap count w.r.t. R = 1, meet count w.r.t. R = 0

b. Overlap count w.r.t. R = 2, meet count w.r.t. R = 0

c. Overlap count w.r.t. R = 1, meet count w.r.t. R = 1

• IEMiner uses 5 variables to distinguish the above interpretations: contain count, finish count, meet count, overlap count, and start count, in that order.

a. (P o[0,0,0,1,0] Q) o[0,0,0,1,0] R

b. (P o[0,0,0,1,0] Q) o[0,0,0,2,0] R

c. (P o[0,0,0,1,0] Q) o[0,0,1,1,0] R
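
The count vectors above can be reproduced with a short sketch. It assumes that each count is the number of intervals in the composite event (here P and Q) that stand in the given relation to R, and it uses hypothetical endpoints chosen to match the three interpretations a, b and c. The function name `relation_counts` is an assumption for this example and is not taken from IEMiner.

```python
def relation_counts(composite, r):
    """Count how the intervals of a composite event relate to interval r.

    Returns [contain, finish, meet, overlap, start] counts.
    """
    contain = finish = meet = overlap = start = 0
    rs, re_ = r
    for s, e in composite:
        if s < rs and re_ < e:
            contain += 1
        elif s < rs and e == re_:
            finish += 1               # the interval is finished by r
        elif e == rs:
            meet += 1
        elif s < rs < e < re_:
            overlap += 1
        elif s == rs and e < re_:
            start += 1
    return [contain, finish, meet, overlap, start]


# Hypothetical endpoints; in every case P overlaps Q.
cases = {
    "a": {"P": (0, 4), "Q": (2, 8), "R": (6, 10)},   # only Q overlaps R
    "b": {"P": (0, 5), "Q": (2, 8), "R": (4, 10)},   # P and Q both overlap R
    "c": {"P": (0, 4), "Q": (2, 8), "R": (4, 10)},   # P meets R, Q overlaps R
}
vectors = {name: relation_counts([iv["P"], iv["Q"]], iv["R"])
           for name, iv in cases.items()}
```

The resulting vectors differ for the three arrangements, which is what makes the representation lossless for this pattern.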

Conclusion

• The classic mining algorithms have been modified to run in a distributed manner on a cluster. Although considerable work on pattern mining in interval data is ongoing, to the best of my knowledge one issue has not been addressed anywhere.

• All current pattern mining algorithms for interval-based events are sequential in nature. They cannot scale to large data sets that do not fit in the memory of a single machine. The parallel techniques for mining frequent patterns in instantaneous events, combined with the current sequential techniques for interval data, can help address this issue.

References

[1] Ruan, Guangchen, et al. 2014. Parallel and quantitative sequential pattern

mining for large-scale interval-based temporal data. IEEE International

Conference on Big Data.

[2] Lin, Hsueh, et al. 2012. Apriori-based frequent itemset mining algorithms on

MapReduce. In Proceedings of the 6th International Conference on

Ubiquitous Information Management and Communication.

[3] Moens, Aksehirli, et al. 2013. Frequent itemset mining for big data. IEEE

International Conference on Big Data.

[4] Li, Haoyuan, et al. 2008. PFP: Parallel FP-growth for query recommendation. Proceedings of the ACM Conference on Recommender Systems.

[5] Tsay, Yuh-Jiuan, et al. 2009. FIUT: A new method for mining frequent itemsets. Information Sciences.

[6] Xun, Yaling, et al. 2015. FiDoop: Parallel Mining of Frequent Itemsets Using

MapReduce. IEEE Transactions on Systems, Man, and Cybernetics.

[7] Liang, Yen-Hui, et al. 2015. Sequence-Growth: A Scalable and Effective

Frequent Itemset Mining Algorithm for Big Data Based on MapReduce

Framework. IEEE International Conference on Big Data.

[8] Papapetrou, Panagiotis, et al. 2005. Discovering frequent arrangements of

temporal intervals. Proceedings of Fifth IEEE International Conference on

Data Mining.

[9] Patel, et al. 2008. Mining relationships among interval events for

classification. Proceedings of the ACM SIGMOD international conference

on Management of data.

[10] Wu, Chen, et al. 2007. Mining nonambiguous temporal patterns for

interval-based events. IEEE Transactions on Knowledge and Data

Engineering.

[11] Chen, Yi-Cheng, et al. 2015. Mining Temporal Patterns in Time Interval-based Data. IEEE Transactions on Knowledge and Data Engineering.

Thank You!