
An UpDown Directed Acyclic Graph Approach for Sequential Pattern Mining

    Jinlin Chen, Member, IEEE

Abstract: Traditional pattern-growth based approaches for sequential pattern mining derive length-(k+1) patterns based on the projected databases of length-k patterns recursively. At each level of recursion, they uni-directionally grow the length of detected patterns by one along the suffix of detected patterns, which needs k levels of recursion to find a length-k pattern. In this paper a novel data structure, UpDown Directed Acyclic Graph (UDDAG), is invented for efficient sequential pattern mining. UDDAG allows bidirectional pattern growth along both ends of detected patterns. Thus a length-k pattern can be detected in ⌊log₂k⌋ + 1 levels of recursion at best, which results in fewer levels of recursion and faster pattern growth. When minSup is large such that the average pattern length is close to 1, UDDAG and PrefixSpan have similar performance because the problem degrades into a frequent item counting problem. However, UDDAG scales up much better: it often outperforms PrefixSpan by almost one order of magnitude in scalability tests. UDDAG is also considerably faster than Spade and LapinSpam. Except for extreme cases, UDDAG uses memory comparable to that of PrefixSpan and less memory than Spade and LapinSpam. Additionally, the special feature of UDDAG enables its extension toward applications involving searching in large spaces.

Index Terms: Data mining algorithm, directed acyclic graph, performance analysis, sequential pattern, transaction database.

    1 INTRODUCTION

Sequential pattern mining is an important data mining problem which detects frequent subsequences in a sequence database. A major technique for sequential pattern mining is pattern growth. Traditional pattern-growth based approaches (e.g., PrefixSpan) derive length (k+1) patterns based on the projected databases of a length k pattern recursively. At each level of recursion, the length of detected patterns is grown by 1, and patterns are grown uni-directionally along the suffix direction. Consequently, we need k levels of recursion to mine a length-k pattern, which is expensive due to the large number of recursive database projections.

In this paper a new approach based on the UpDown Directed Acyclic Graph (UDDAG) is proposed for fast pattern growth. UDDAG is a novel data structure which supports bidirectional pattern growth from both ends of detected patterns. With UDDAG, at level i of the recursion we may grow the length of patterns by 2^(i-1) at most. Thus a length k pattern can be detected in ⌊log₂k⌋ + 1 levels of recursion at minimum, which results in a better scale-up property for UDDAG compared to PrefixSpan.

Our extensive experiments clearly demonstrate the strength of UDDAG with its bidirectional pattern growth strategy. When minSup is very large such that the average length of patterns is very small (close to 1), UDDAG and PrefixSpan have similar performance because in this case the problem degrades into a basic frequent item counting problem. However, UDDAG scales up much better compared to PrefixSpan; it often outperforms PrefixSpan by one order of magnitude in our scalability tests. UDDAG is also considerably faster than two other representative algorithms, Spade and LapinSpam. Except for some extreme cases, the memory usage of UDDAG is comparable to that of PrefixSpan, and UDDAG generally uses less memory than Spade and LapinSpam.

UDDAG may be extended to other areas where efficient searching in large search spaces is necessary.

The rest of the paper is organized as follows: Section 2 defines the problem and discusses related work. Section 3 presents the motivation of our approach. Section 4 defines UDDAG based pattern mining. Performance evaluation is presented in Section 5. Discussions on time and space complexity are presented in Section 6. Finally, we conclude the paper and discuss future work in Section 7.

    2 PROBLEM STATEMENT AND RELATED WORK

    2.1 Problem Statement

Let I = {i1, i2, …, in} be a set of items. An itemset is a subset of I, denoted as (x1, x2, …, xk), where xi ∈ I, i ∈ {1, …, k}. Without loss of generality, in this paper we use non-negative integers to represent items, and assume that the items in an itemset are sorted in ascending order. We omit the parentheses for an itemset with only one item. A sequence s is a list of itemsets, denoted as <s1 s2 … sm>, where si is an itemset, si ⊆ I, i ∈ {1, …, m}. The number of instances of itemsets in s is called the length of s.

Given two sequences a = <a1 a2 … ak> and b = <b1 b2 … bj>, a is called a subsequence of b, denoted as a ⊑ b, if k ≤ j and there exist integers 1 ≤ i1 < i2 < … < ik ≤ j such that a1 ⊆ bi1, a2 ⊆ bi2, …, ak ⊆ bik. A tuple <sid, s> in a sequence database is said to contain a sequence α if α ⊑ s.

The absolute support of a sequence α in a sequence database D is defined as SupD(α) = |{<sid, s> | (α ⊑ s) ∧ (<sid, s> ∈ D)}|, and the relative support of α is defined as SupD(α)/|D|. In this paper we will use absolute and relative supports interchangeably. Given a positive value minSup as the support threshold, α is called a sequential pattern in D if SupD(α) ≥ minSup.

Problem Statement. Given a sequence database D and the minimum support threshold, sequential pattern mining is to find the complete set of sequential patterns (denoted as P) in the database. (Note: in this paper we will always use D as a sequence database and P as the complete set of sequential patterns in D.)

Example 1. Given D as shown in Table 1 and minSup = 2, the length of sequence 1 is 5. <(1,2) 3> is a pattern because it is contained in both sequences 1 and 3. A pattern may occur twice in sequence 1; however, sequence 1 still contributes only 1 to the support of that pattern.

TABLE 1
AN EXAMPLE SEQUENCE DATABASE

Seq. Id   Sequence
1         <...>
2         <...>
3         <...>
4         <...>
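To make the containment and support definitions concrete, the following minimal sketch (not from the paper) checks sequence containment and counts absolute support; itemsets are ascending int arrays and sequences are lists of itemsets, as defined above:

    import java.util.List;

    class SupportSketch {
        // True if itemset sub is contained in itemset sup (both sorted ascending).
        static boolean itemsetContains(int[] sup, int[] sub) {
            int i = 0;
            for (int x : sup) if (i < sub.length && x == sub[i]) i++;
            return i == sub.length;
        }

        // True if sequence s contains pattern p: p's itemsets map to itemsets
        // of s in order (greedy earliest match finds an embedding if one exists).
        static boolean contains(List<int[]> s, List<int[]> p) {
            int j = 0;
            for (int[] itemset : s)
                if (j < p.size() && itemsetContains(itemset, p.get(j))) j++;
            return j == p.size();
        }

        // Absolute support of p in db: each sequence contributes at most 1.
        static int support(List<List<int[]>> db, List<int[]> p) {
            int sup = 0;
            for (List<int[]> s : db) if (contains(s, p)) sup++;
            return sup;
        }
    }

A pattern is then any p with support(db, p) ≥ minSup.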

2.2 Related Work

The problem of sequential pattern mining was introduced by Agrawal and Srikant [1]. Among the many algorithms proposed to solve the problem, GSP [22] and PrefixSpan [18], [19] represent the two major types of approaches: a priori-based and pattern-growth based.

The a priori principle states that any super-sequence of a non-frequent sequence must not be frequent. A priori-based approaches can be considered breadth-first traversal algorithms because they construct all length k patterns before constructing length (k+1) patterns.

The AprioriAll algorithm [1] is one of the earliest a priori-based approaches. It first finds all frequent itemsets, transforms the database so that each transaction is replaced by all the frequent itemsets it contains, and then finds patterns. The GSP algorithm [21] is an improvement over AprioriAll. To reduce candidates, GSP only creates a new length k candidate when there are two frequent length (k-1) sequences with the prefix of one equal to the suffix of the other. To test whether a candidate is a frequent length k pattern, the support of each length k candidate is counted by examining all the sequences. The PSP algorithm [17] is similar to GSP except that the placement of candidates is improved through a prefix tree arrangement to speed up pattern discovery. The SPIRIT algorithm [12] uses regular expressions as constraints and develops a family of algorithms for pattern mining under constraints based on the a priori rule. The SPaRSe algorithm [3] improves GSP by using both candidate generation and projected databases to achieve higher efficiency under high pattern density conditions.

The approaches above represent databases horizontally. In [5] and [24], databases are transformed into a vertical layout consisting of items' id-lists. The Spade algorithm [24] joins id-list pairs to form sequence lattices to group candidate sequences such that each group can be stored in memory. Spade then searches for patterns across each sequence lattice. In Spade, candidates are generated and tested on-the-fly to avoid storing candidates, which makes merging the id-lists of frequent sequences costly for a large number of candidates. To reduce this cost, the SPAM algorithm [5] adopts the lattice concept but represents each id-list as a vertical bitmap. SPAM is more efficient than Spade for mining long patterns if all the bitmaps can be stored in memory; however, it generally consumes more memory. LapinSpam [25] improves SPAM by using the last-position information of items to avoid the ANDing operation or comparison at each iteration of the support counting process.

One major problem of a priori-based approaches is that a combinatorially explosive number of candidate sequences may be generated in a large sequence database, especially when long patterns exist.

Pattern-growth approaches can be considered depth-first traversal algorithms, as they recursively generate the projected database of each length k pattern to find length (k+1) patterns. They focus the search on a restricted portion of the initial database to avoid the expensive candidate generation and test step.

The FreeSpan algorithm [14] first projects a database into multiple smaller databases based on frequent items. Patterns are found by recursively growing subsequence fragments in each projected database. Based on a similar projection technique, the same authors proposed the PrefixSpan algorithm [18], [19], which outperforms FreeSpan by projecting only effective postfixes.

One major concern with PrefixSpan is that it may generate multiple projected databases, which is expensive when long patterns exist. The MEMISP algorithm [15] uses memory indexes instead of projected databases to detect patterns. It uses a find-then-index technique to recursively find the items that constitute a frequent sequence and constructs a compact index set which indicates the set of data sequences for further exploration. As a result of effective index advancing, fewer and shorter data sequences need to be processed as the discovered patterns become longer. MEMISP is faster than the basic PrefixSpan algorithm but slower than PrefixSpan when the pseudo-projection technique is used.

Among the various approaches, PrefixSpan is one of the most influential and efficient in terms of both time and space. Some approaches may achieve better performance under special circumstances; however, the overall performance of PrefixSpan is among the best. For example, LAPIN [26] is more efficient for dense data sets with long patterns but less efficient in other cases; besides, it consumes much more memory than PrefixSpan. FSPM [23] is claimed to be faster than PrefixSpan in many cases. However, the sequences that FSPM mines contain only a single item in each itemset. In this sense, FSPM is not a pattern mining algorithm as we discuss here. SPAM outperforms the basic PrefixSpan but is much slower than PrefixSpan with the pseudo-projection technique [22].

    3 MOTIVATION

Pattern-growth based approaches recursively grow the length of detected patterns. At each level of recursion, the algorithms first partition the solution space into disjoint sub-spaces. For each sub-space, a projected database (or a variation, e.g., a memory index) is created, based on which a detection strategy (e.g., frequent prefix counting, memory index counting) is applied to grow existing patterns. Projection and support counting are the two major costs of pattern-growth based approaches.

In PrefixSpan, patterns are partitioned based on common prefixes and grown uni-directionally along the suffix direction of detected patterns. At each level of recursion the length of detected patterns is only grown by 1. If we can grow the patterns bi-directionally along both ends of detected patterns, we may grow patterns in parallel at each level of recursion. The motivation of this paper is to find suitable partitioning, projection, and detection strategies that allow for faster pattern growth.

To support bidirectional pattern growth, instead of partitioning patterns based on common prefixes, we can partition them based on common root items. For a database with n different frequent items (without loss of generality, we assume these items are 1, 2, …, n), its patterns can be divided into n disjoint subsets. The ith subset (1 ≤ i ≤ n) is the set of patterns that contain i (the root item of the subset) and items smaller than i. Since any pattern in subset i contains i, to detect the ith subset we need only check the subset of tuples whose sequences contain i in database D, i.e., the projected database of i, or iD. In the ith subset, each pattern can be divided into two parts, the prefix and suffix of i. Since all items in the ith subset are no larger than i, we exclude items that are larger than i in iD.

Example 2. Given the following database, 1) <...>, 2) <...>, 3) <...>, 4) <...>, 5) <...>, its projected database 8D is, 1) <...>, 2) <...>, 3) <...>, 4) <...>.

If minSup is 2, the 8th subset of patterns is {<8>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>}.

Observing the patterns in the 8th subset, except for <8>, which only contains 8 and can be derived directly, all other patterns can be clustered and derived as follows:

1) {<...>, <...>, <...>, <...>}, the patterns with 8 at the end. This cluster can be derived based on the prefix sub-sequences of 8 in 8D, or Pre(8D), which is, 1) <...>, 2) <...>, 3) <...>, 4) <...>. By concatenating the patterns (<...>, <...>, <...>, <...>) of Pre(8D) with 8, we can derive the patterns in this cluster;

2) {<...>, <...>, <...>, <...>, <...>, <...>}, the patterns with 8 at the beginning. This cluster can be derived based on the suffix sub-sequences of 8 in 8D, or Suf(8D), which is, 1) <...>, 2) <...>, 3) <...>, 4) <...>. By concatenating 8 with the patterns (<...>, <...>, <...>, <...>, <...>, <...>) of Suf(8D), we can derive the patterns in this cluster;

3) {<...>, <...>, <...>, <...>}, the patterns with 8 in between the beginning and end of each pattern. This cluster can be mined based on the patterns in Pre(8D) and Suf(8D). In this case a pattern (e.g., <4 8 3>) can be derived by concatenating a pattern of Pre(8D) (e.g., <4>) with the root item 8 and a pattern of Suf(8D) (e.g., <3>).

Note: in case a pattern belongs to more than one cluster, it can be derived separately in each cluster. Duplicated patterns can be eliminated by a set union operation.

Here the major difficulty is case 3. In Example 2 we have 4 patterns from Pre(8D) and 6 from Suf(8D). Intuitively, each pattern pair (one from Pre(8D) and one from Suf(8D)) is a possible candidate for case 3. Direct evaluation of every pair can be expensive (24 candidates in this example). If we can decrease the number of candidates for evaluation, we will be able to recursively detect the patterns in cases 1 and 2 using similar strategies, and eventually find all the patterns in the 8th subset efficiently.

Based on the a priori rule, if the concatenation of a pattern from Pre(8D) (e.g., <...>) with a pattern from Suf(8D) (e.g., <...>), i.e., <...> (the root item 8 is added implicitly), is not a pattern, then the concatenation of any pattern in Pre(8D) that contains the former (e.g., <...>) with any pattern in Suf(8D) that contains the latter (e.g., <...>) is also not a pattern.

On the other hand, given a pattern s from Pre(8D) (e.g., <...>), the valid patterns from Suf(8D) for s must also be valid for any pattern from Pre(8D) that is contained in s (e.g., <...>, <...>). Therefore, to check the candidate patterns from Suf(8D) for s, we need only check the intersection of the valid pattern sets from Suf(8D) for the patterns in Pre(8D) that are contained in s. Here the valid pattern sets from Suf(8D) for <...> and <...> are both {<...>}, and the intersection of the two sets is {<...>}, which means we need only verify <...> with <...>.

The strategies above can effectively decrease the number of candidates for case 3. One challenging issue is how to efficiently find and represent the contain relationship between patterns. To solve this problem we can use a directed acyclic graph (DAG), which represents patterns as vertexes and contain relationships as directed edges between vertexes. Such a DAG can be recursively constructed in an efficient way to derive the contain relationship of patterns (see Section 4.3). By representing the contain relationship of the patterns from Pre(8D) with one DAG (the Up DAG) and the contain relationship of the patterns from Suf(8D) with another DAG (the Down DAG), we can decrease the number of candidates by using these DAGs based on the strategies discussed above. Fig. 1 shows the Up and Down DAGs for the patterns in Pre(8D) and Suf(8D). In the DAGs, each vertex represents a pattern with occurrence information, i.e., the ids of the tuples containing the pattern. A directed edge means that the pattern of the destination vertex contains the pattern of the source vertex.

To mine the patterns in the ith subset, first we perform a level 1 projection to get iD. At this stage the only length 1 pattern in the ith subset, <i>, is detected. We then perform level 2 projections on Pre(iD) and Suf(iD), respectively, based on which we can detect length 2 (cases 1 and 2) and length 3 patterns (case 3). We then perform level 3 projections to detect length 3, 4, 5, 6, and 7 patterns, and continue this process to find all the patterns in the ith subset. If the maximal pattern length is k, then at worst we project k levels, but at best we only project ⌊log₂k⌋ + 1 levels, which is much less than in previous approaches.
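To see where the ⌊log₂k⌋ + 1 figure comes from, here is a worked bound (not spelled out in the text) using the growth rates stated above, i.e., level 1 yields length 1 and level i can add at most 2^(i-1):

    1 + Σ_{i=2..L} 2^(i-1) = 2^L - 1 ≥ k  ⟺  L ≥ log₂(k+1),  i.e.,  L = ⌈log₂(k+1)⌉ = ⌊log₂k⌋ + 1.

For example, k = 7 needs only 3 levels (maximal detected lengths 1, 3, and 7 after levels 1, 2, and 3), versus 7 levels for purely prefix-based growth.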

Fig. 1. Example Up / Down DAGs of patterns from Pre(8D) / Suf(8D)

In the example above each itemset has exactly one item. In practice an itemset may have multiple items. Most previous approaches detect frequent itemsets with multiple items simultaneously when detecting sequential patterns. In our approach we first detect frequent itemsets and transform the database based on the frequent itemsets. We then detect patterns on the transformed database using UDDAG. Our strategy of detecting frequent itemsets first is the same as that of AprioriAll. In Section 6.1 we will discuss in detail the impact of this strategy.

In our previous work [8], we presented an UpDown Tree data structure to detect contiguous sequential patterns (in which no gap is allowed for a sequence to contain the pattern). However, the UpDown Tree is substantially different from the UpDown DAG in this paper. In addition to the different internal data structures, a major difference is that the UpDown Tree is a compressed representation of the projected databases, while UDDAG represents the containing relationship of detected patterns.

4 UPDOWN DIRECTED ACYCLIC GRAPH BASED SEQUENTIAL PATTERN MINING

This section presents the UDDAG based pattern mining approach, which first transforms a database based on frequent itemsets, then partitions the problem, and finally detects each subset using UDDAG.

    4.1 Database Transformation

Definition 1 (Frequent itemset). The absolute support of an itemset in a sequence database is the number of tuples whose sequences contain the itemset. An itemset with a support no less than minSup is called a frequent itemset (FI).

Based on frequent itemsets we transform each sequence in a database D into an alternative representation. First, we assign a unique id to each FI in D. We then replace each itemset in each sequence with the ids of all the FIs contained in the itemset.

For example, for the database in Table 1, the FIs are: (1), (2), (3), (4), (5), (6), (1,2), (2,3). By assigning a unique id to each FI, e.g., (1)-1, (1,2)-2, (2)-3, (2,3)-4, (3)-5, (4)-6, (5)-7, (6)-8, we can transform the database as shown in Table 2 (infrequent items are eliminated).

TABLE 2
TRANSFORMED DATABASE

Seq. Id   Sequence
1         <...>
2         <...>
3         <...>
4         <...>
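A minimal sketch of this transformation step follows (identifier names are illustrative, not the paper's implementation; FI detection is assumed already done, with fis.get(id) being the FI assigned id):

    import java.util.*;

    class TransformSketch {
        // True if itemset sub is contained in itemset sup (both sorted ascending).
        static boolean itemsetContains(int[] sup, int[] sub) {
            int i = 0;
            for (int x : sup) if (i < sub.length && x == sub[i]) i++;
            return i == sub.length;
        }

        // Replace each itemset of a sequence with the ids of all FIs it contains.
        static List<int[]> transform(List<int[]> sequence, List<int[]> fis) {
            List<int[]> out = new ArrayList<>();
            for (int[] itemset : sequence) {
                List<Integer> ids = new ArrayList<>();
                for (int id = 0; id < fis.size(); id++)
                    if (itemsetContains(itemset, fis.get(id))) ids.add(id);
                if (!ids.isEmpty())  // itemsets containing no FI drop out entirely
                    out.add(ids.stream().mapToInt(Integer::intValue).toArray());
            }
            return out;
        }
    }

Scanning every FI per itemset, as above, is for clarity only; as described in the discussion of Algorithm 1 below, the actual implementation walks a DAG of FI containment relationships so that an FI is checked only after its contained FIs have matched.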

    Definition 2 (Item pattern). An item pattern is a sequentialpattern with exactly 1 item in every itemset it contains.

Lemma 1 (Transformed database). Let D be a database, P be the complete set of sequential patterns in D, and D′ be its transformed database. Substituting the ids in each item pattern of D′ with the corresponding itemsets, and denoting the resulting pattern set as P′, we have P′ = P.

Proof. Let p be a pattern in P, and let ip be the item pattern derived by replacing each itemset in p with the corresponding id in D′. Since the id of an itemset i exists at the same position in D′ as that of i in D, the support of ip in D′ is the same as that of p in D. Thus ip is an item pattern in D′. Substituting each id in ip with the corresponding itemset, and denoting the resulting pattern as ip′, we have ip′ = p. Based on the definition of P′, we have ip′ ∈ P′. Thus p ∈ P′, and P ⊆ P′. Similarly, P′ ⊆ P. All together, P′ = P.

Based on Lemma 1, mining patterns from D is equivalent to mining item patterns from D′. Below we focus on mining item patterns from D′ and represent frequent itemsets with their ids. For brevity, we still say frequent itemsets instead of ids, say patterns instead of item patterns, use D instead of D′, and use P instead of P′.

4.2 Problem Partitioning

Lemma 2 (Problem partitioning). Let {x1, x2, …, xt} be the frequent itemsets in a database D, x1 < x2 < … < xt. The complete set of patterns (P) in D can be divided into t disjoint subsets. The ith subset (denoted as Pxi, 1 ≤ i ≤ t) is the set of patterns that contain xi and FIs smaller than xi.

Proof. First we create t empty sets. Next we move the patterns that contain xt from P to Pxt, and in the remaining P we move all the patterns that contain xt-1 to Pxt-1. We continue this until moving all the patterns that contain x1 to Px1. Now P is empty, because any pattern can only contain FIs in {x1, x2, …, xt}. Thus P = Px1 ∪ … ∪ Pxt.

Given two integers i and j, 1 ≤ i < j ≤ t, any pattern in Pxi does not contain xj, so Pxi ∩ Pxj = ∅, i.e., the subsets are disjoint.

Definition 3 (Projected database). Given a database D and a frequent itemset x, the collection of all the tuples in D whose sequences contain x is called the x-projected database, denoted as xD.

Lemma 3 (Projected database). Let D be a database and x be an itemset, and let P and P′ be the complete sets of patterns in D and xD, respectively. Then Px = P′x.

Proof. Since any tuple in xD also exists in D, P′x ⊆ Px. For ∀p ∈ Px, p contains x. Thus any tuple that contains p also contains x, and any tuple that does not contain x does not contain p. Therefore p can only be detected from the collection of all the tuples that contain x, i.e., xD. Therefore Px ⊆ P′x. All together, Px = P′x.

Based on Lemma 3, Px can be mined from xD.

Definition 4 (Prefix/suffix subsequence/tuple). Given a frequent itemset x and a tuple <sid, s> in a database, s = <s1 s2 … sj>, if x ⊆ si, 1 ≤ i ≤ j, then sp = <s1 … si-1> is the prefix subsequence of x in s, and ss = <si+1 … sj> is the suffix subsequence of x in s. <sid, sp> is the prefix tuple of x, and <sid, ss> is the suffix tuple of x.

Definition 5 (Prefix/suffix projected database). The collection of all the prefix/suffix tuples of a frequent itemset x in xD is called the prefix/suffix projected database of x, denoted as Pre(xD) / Suf(xD).

Definition 6 (Sequence concatenation). Given two sequences a = <a1 a2 … am> and b = <b1 b2 … bn>, the sequence concatenation of a and b, denoted as a.b, is defined as <a1 a2 … am b1 b2 … bn>.

If a FI x occurs multiple times in a sequence, then each occurrence has its own prefix/suffix subsequence. For example, in sequence <...>, 3 has two suffix subsequences (<...> and <...>). If both subsequences contain a pattern (e.g., <...>), they only contribute 1 to the count of the pattern.

Theorem 1 (Pattern mining). Let x be a FI and xD be its projected database, and let P, PPre, and PSuf be the complete sets of sequential patterns in xD, Pre(xD), and Suf(xD), respectively. Then Px ⊆ Q, where Q = {<x>} ∪ Q1 ∪ Q2 ∪ Q3, and Q1 = {pk.<x> | pk ∈ PPre}, Q2 = {<x>.pk | pk ∈ PSuf}, Q3 = {pk.<x>.pi | pk ∈ PPre, pi ∈ PSuf}.

Proof. For ∀pj ∈ Px, pj = <a1 a2 … an>. Based on the position of x in pj and the length of pj, we have:

1) n = 1, and x is the only itemset of pj, i.e., pj = <x>;

2) n > 1, and x only exists at the beginning and/or end of pj, i.e., a1 = x and/or an = x, and aj ≠ x for j ∈ {2, 3, …, n-1};

3) n > 1, and ∃m ∈ {2, 3, …, n-1} with am = x, i.e., x exists between the beginning and end of pj.

For case 1, since <x> ∈ Q, we have pj ∈ Q.

For case 2, if x only resides at the beginning of pj, let pj = <x>.pj′; for each occurrence of pj in xD, there is a corresponding occurrence of pj′ in Suf(xD). Thus pj′ ∈ PSuf, and pj ∈ Q2. Similarly, if x only resides at the end of pj, then pj ∈ Q1. If x resides at both the beginning and the end of pj, we have pj ∈ Q1 and pj ∈ Q2.

For case 3, let pj = pj′.<x>.pj″. Each occurrence of pj in xD corresponds to a prefix subsequence of x which contains pj′ and is contained in Pre(xD); thus pj′ ∈ PPre. Likewise, each occurrence of pj in xD corresponds to a suffix subsequence of x which contains pj″ and is contained in Suf(xD); thus pj″ ∈ PSuf. Since pj = pj′.<x>.pj″, we have pj ∈ Q3.

All together, we have Px ⊆ Q.

Based on Theorem 1 we can detect Px based on PPre and PSuf, which can be recursively derived. Here case 1 is obvious, and case 2 is directly based on PPre and PSuf. Case 3 is complicated due to a potentially large number of candidates. Below we define UDDAG to decrease the number of candidates.
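As a concrete reading of Theorem 1, the candidate set Q could be enumerated naively as follows (a sketch only; patterns are lists of itemsets, and UDDAG exists precisely to avoid materializing Q3 in full):

    import java.util.*;

    class CandidateSketch {
        // Definition 6: sequence concatenation a.b.
        static List<int[]> concat(List<int[]> a, List<int[]> b) {
            List<int[]> r = new ArrayList<>(a);
            r.addAll(b);
            return r;
        }

        // Q = {<x>} ∪ Q1 ∪ Q2 ∪ Q3, per Theorem 1.
        static List<List<int[]>> buildQ(int[] x,
                                        List<List<int[]>> pPre,   // patterns of Pre(xD)
                                        List<List<int[]>> pSuf) { // patterns of Suf(xD)
            List<int[]> rootX = Collections.singletonList(x);
            List<List<int[]>> q = new ArrayList<>();
            q.add(rootX);                                          // case 1: <x>
            for (List<int[]> pk : pPre) q.add(concat(pk, rootX));  // Q1: pk.<x>
            for (List<int[]> pk : pSuf) q.add(concat(rootX, pk));  // Q2: <x>.pk
            for (List<int[]> pk : pPre)                            // Q3: pk.<x>.pi
                for (List<int[]> pi : pSuf)
                    q.add(concat(concat(pk, rootX), pi));
            return q;
        }
    }

Q3 alone has |PPre| × |PSuf| members, which is exactly the blow-up the UDDAG pruning below attacks.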

Definition 7 (UpDown directed acyclic graph). Given a FI x and xD, an UpDown Directed Acyclic Graph based on Px, denoted as x-UDDAG, is derived as follows:

1) Each pattern in Px corresponds to a vertex in x-UDDAG. <x> corresponds to the root vertex, denoted as vx. For a vertex v in x-UDDAG, op(v) represents the pattern corresponding to v. For ∀p ∈ Px, ov(p) represents the vertex corresponding to p.

2) Let PU be the set of length-2 patterns ending with x in Px. For ∀p ∈ PU, let vu = ov(p) and add a directed edge from vx to vu. vu is called an up root child of vx.

3) Let PD be the set of length-2 patterns starting with x in Px. For ∀p ∈ PD, let vd = ov(p) and add a directed edge from vx to vd. vd is called a down root child of vx.

4) Each up/down root child vu/vd of vx also corresponds to an UDDAG (defined recursively using rules 1)-4)), denoted as xU-UDDAG and xD-UDDAG. For ∀v1 ∈ VU and ∀v2 ∈ VD, where VU/VD is the set of all the vertexes in xU-UDDAG/xD-UDDAG, assume op(v1) = p1.<x> and op(v2) = <x>.p2. If p = p1.<x>.p2 ∈ Px, let v3 = ov(p), add a directed edge from v1 to v3, and add another directed edge from v2 to v3. Here v1/v2 is the Up/Down parent of v3, and v3 is the UpDown child of v1 and v2.

Note: if v3 corresponds to multiple up and down parent pairs, only one pair (randomly selected) is linked.

    Definition 8 (Occurrence set). The Occurrence Set of a ver-tex v in a database D (denoted as OSD(v)) is the set of se-quence ids of the tuples in D that contain op(v).

The data structure of a vertex in UDDAG is as follows:

class UDVertex{
    UDVertex upParent, downParent;
    List upChildren, downChildren, upDownChildren;
    int[] pattern; // pattern sequence
    int[] occurs;  // occurrence set
}

In an UDDAG, if there is a directed path from vertex v1 to v2, v2 is called reachable from v1. The UDDAG for all the patterns in Pre(xD) / Suf(xD) is called the Up / Down DAG of x. The set of vertexes of an UDDAG (Up/Down DAG) is denoted as V (VU/VD).

Definition 9 (Valid down vertex set). Given a vertex v in the Up DAG of x, the valid down vertex set of v (VDVSv) is defined as VDVSv = {v′ | (v′ ∈ VD) ∧ (op(v).<x>.op(v′) ∈ Px)}.

Definition 10 (Parent valid down vertex set). Given a vertex v in the Up DAG of x, the parent valid down vertex set of v (PVDVSv) is defined as follows:

1) If v has no parent (i.e., v is the root vertex), PVDVSv = VD;
2) If v has one parent, PVDVSv is the VDVS of the parent;
3) Otherwise, PVDVSv is the intersection of the VDVSs of the parents.
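Definition 10 maps directly onto stored VDVS sets. A minimal sketch (assuming, beyond the UDVertex fields shown earlier, that each vertex caches its detected VDVS in a Set<UDVertex> field named vdvs, and that a single parent is recorded as upParent):

    import java.util.*;

    class PvdvsSketch {
        static Set<UDVertex> parentVDVS(UDVertex v, Set<UDVertex> allDownVertexes) {
            if (v.upParent == null && v.downParent == null)
                return allDownVertexes;          // rule 1: root vertex, PVDVS = VD
            if (v.downParent == null)
                return v.upParent.vdvs;          // rule 2: single parent
            Set<UDVertex> r = new HashSet<>(v.upParent.vdvs);
            r.retainAll(v.downParent.vdvs);      // rule 3: intersect parents' VDVSs
            return r;
        }
    }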

Lemma 4. VDVSv ⊆ PVDVSv.

Proof. If v has no parent, PVDVSv = VD. Based on Def. 9, VDVSv ⊆ VD. Therefore VDVSv ⊆ PVDVSv.

If v has one or more parents, then for ∀v′ ∈ VDVSv, op(v).<x>.op(v′) ∈ Px. Based on the a priori rule, for ∀sp, if sp ⊑ op(v), then sp.<x>.op(v′) ∈ Px. If v″ is a parent of v, we have op(v″) ⊑ op(v). Therefore op(v″).<x>.op(v′) ∈ Px. Thus v′ belongs to the VDVS of every parent of v, hence v′ ∈ PVDVSv, and VDVSv ⊆ PVDVSv.

Based on Lemma 4, to detect VDVSv we need only examine the vertexes in PVDVSv.

Lemma 5. Given a vertex v in the Up DAG of x and PVDVSv, for ∀v′ ∈ PVDVSv, if v′ ∉ VDVSv, then for any vertex v″ in the Down DAG of x reachable from v′, v″ ∉ VDVSv.

Proof. Since v′ ∉ VDVSv, op(v).<x>.op(v′) ∉ Px. Thus |OS(v) ∩ OS(v′)| < minSup. Since v″ is reachable from v′, op(v′) ⊑ op(v″). Thus OS(v″) ⊆ OS(v′), and |OS(v) ∩ OS(v″)| < minSup. Therefore op(v).<x>.op(v″) ∉ Px, and v″ ∉ VDVSv.

Based on Lemma 5, if v′ does not belong to VDVSv, then none of the vertexes reachable from v′ belongs to VDVSv. Lemmas 4 and 5 help eliminate candidates for case 3. Lemma 6 further evaluates candidate patterns.

Lemma 6. Given a vertex v in the Up DAG of x and a vertex v′ in PVDVSv, let IS be the intersection set of the occurrence sets of v and v′. If |IS| ≥ minSup, and for any tuple whose id is contained in IS, x occurs exactly once in the corresponding sequence, then op(v).<x>.op(v′) ∈ Px.

Proof. For a tuple <sid, s>, if sid ∈ IS, then op(v) ⊑ s and op(v′) ⊑ s. Since x occurs once in s, op(v) occurs before x in s, and op(v′) occurs after x. Thus op(v).<x>.op(v′) ⊑ s. Since |IS| ≥ minSup, at least minSup tuples contain op(v).<x>.op(v′). Thus op(v).<x>.op(v′) ∈ Px.

Lemma 6 evaluates candidates for Px when x occurs once in each sequence in IS. If x occurs more than once in a sequence, we need to further verify whether the sequence really contains op(v).<x>.op(v′). For example, in sequence <...>, <...> is the prefix of the second occurrence of 5, and <...> is the suffix of the first occurrence of 5. Because of this, a candidate pattern <...> may be mistakenly considered as being contained in the sequence. Thus we need further verification.

To minimize the effort of pattern detection in this case, we build Pre′(xD)/Suf′(xD) as follows: 1) if x occurs only once in a sequence, directly add its prefix/suffix tuple to Pre′(xD)/Suf′(xD); 2) if x occurs more than once in a sequence, add the prefix tuple of the last occurrence of x to Pre′(xD), and the suffix tuple of the first occurrence of x to Suf′(xD). Denoting the derived prefix/suffix projected databases as Pre′(xD)/Suf′(xD), letting PPre′/PSuf′ be the complete sets of patterns in Pre′(xD)/Suf′(xD), and letting R = {<x>} ∪ R1 ∪ R2 ∪ R3, where R1 = {spk.<x> | spk ∈ PPre′}, R2 = {<x>.spk | spk ∈ PSuf′}, and R3 = {spk.<x>.spj | spk ∈ PPre′, spj ∈ PSuf′}, we have:

Theorem 2. Px ⊆ R ⊆ Q (Q is defined in Theorem 1).

Proof. The proof of Px ⊆ R is similar to that of Px ⊆ Q in Theorem 1. The only difference is that in Pre(xD)/Suf(xD) the prefix/suffix tuple of every occurrence of x is included when x occurs multiple times in the same sequence, while in Pre′(xD)/Suf′(xD) only the last prefix/first suffix tuple is included. Based on the definition, if multiple prefix/suffix tuples from the same sequence contain the same pattern, only one is counted for support. By including the last prefix/first suffix tuple in Pre′(xD)/Suf′(xD), we can guarantee not miscounting the support of any pattern, because the sequences of all other prefix/suffix tuples are contained in the last prefix/first suffix tuple. Therefore Px ⊆ R.

Since every tuple in Pre′(xD)/Suf′(xD) also exists in Pre(xD)/Suf(xD), we have R ⊆ Q.

    Based on the lemmas and theorems above, below wefirst give an example to illustrate UDDAG based patternmining, and then present the algorithm in detail.

Example 3 (UDDAG). For the sample database in Table 1, if minSup = 2, its patterns can be mined as follows:

1) Database transformation: see Table 2 in Section 4.1.

2) Pattern partitioning: P is partitioned into 8 subsets: the one containing 1 (P1), the one containing 2 and smaller ids (P2), …, and the one containing 8 and smaller ids (P8).

3) Finding subsets of patterns. To detect Px, we first detect the patterns in Pre(xD) and Suf(xD) and then combine them to derive Px. This is a recursive process: for Pre(xD) and Suf(xD) we perform the same action until reaching the base case, where the projected database has no frequent itemset.

3.1) Finding P8. First we build Pre(8D) and Suf(8D), which are, Pre(8D): 1) <...>, 3) <...>, 4) <...>; Suf(8D): 1) <...>, 3) <(1,2,3) (6,8) 5 3>, 4) <...>.

Let PP be all the patterns in Pre(8D). Since the FIs in Pre(8D) are (1), (2), (3), and (7), we can partition PP into 4 subsets, PP7, PP3, PP2, and PP1. First we detect PP7. Since the prefix projected database of 7 in Pre(8D) is empty, and the suffix projected database of 7 is: 3) <(1,2,3)>, 4) < >, the only pattern in PP7 is <7>. Similarly, PP3 = {<3>}, PP2 = {<2>}, and PP1 = {<1>}. Thus PP = {<1>, <2>, <3>, <7>}.

Let PS be all the patterns in Suf(8D). Since the FIs in Suf(8D) are (3) and (5), we can partition PS into 2 subsets, PS5 and PS3. First we detect PS5. The prefix projected database of 5 in Suf(8D) is: 3) <(1,2,3) (6,8)>, 4) <...>, which contains a pattern <3>. The suffix projected database of 5 in Suf(8D) is: 3) <3>, 4) <...>, which also contains a pattern <3>. Since both databases have patterns, we need to consider case 3, i.e., whether concatenating <3> with the root 5 and <3> is also a pattern. Here the occurrence set of <3> in the prefix projected database of 5 is {3, 4}, and the occurrence set of <3> in the suffix projected database of 5 is also {3, 4}. Thus their intersection set is {3, 4}, which means the support of <3 5 3> is at most 2. However, since 5 occurs twice in tuple 4, we need to check whether it really contains <3 5 3>, which is not true by verification. Thus the support of <3 5 3> is 1, and it is not a pattern. Therefore PS5 = {<5>, <3 5>, <5 3>}. Similarly PS3 = {<3>}. All


together, PS = {<3>, <5>, <3 5>, <5 3>}.

Next we detect P8 based on the Up and Down DAGs of 8 (Fig. 2 (a) and (b)) by evaluating each candidate vertex pair. First we detect the VDVSs for the length 1 patterns in Pre(8D), i.e., up vertexes 1, 2, 3, and 7. For vertex 1, first we check its combination with down vertex 3; the intersection of the occurrence sets is {3}. Thus the corresponding support is at most 1, which is not a valid combination. Similarly, up vertex 1 and down vertex 5 are also an invalid combination. Based on Lemma 5, all the children of down vertex 5 are not valid for up vertex 1. Therefore VDVS1 = ∅. Similarly, VDVS3 = ∅, and VDVS7 = {ov(<...>), ov(<...>), ov(<...>)}. Since no length 2 pattern exists in Pre(8D), the detection stops. Eventually we have P8 = {<8>, <...>, <...>, <...>, <...>, <...>, <8 5>, <...>, <...>, <...>, <...>, <...>}. The 8-UDDAG based on the detected patterns in P8 is shown in Fig. 2 (c).

Fig. 2. UpDown DAG for P8

Note: Based on Lemma 1, here we actually detect item patterns. The integers in the patterns are ids of FIs.

3.2) Similarly, we have:
P7 = {<...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>};
P6 = {<...>, <1 6>, <2 6>, <...>, <...>, <...>, <...>, <...>, <2 6 5>, <...>};
P5 = {<...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>};
P4 = {<...>, <...>, <...>, <...>};
P3 = {<...>, <...>, <...>, <...>};
P2 = {<...>};
P1 = {<...>, <...>}.

4) The complete set of patterns is the union of all the subsets of patterns detected above.

Algorithm 1 (UDDAG based pattern mining).

Input: A database D and the minimum support minSup
Output: P, the complete set of patterns in D

Method: findP(D, minSup){
    P = ∅
    FISet = D.getAllFI(minSup)
    D.transform()
    for each FI x in FISet{
        UDVertex rootVT = new UDVertex(x)
        findP(D.getPreD(x), rootVT, up, minSup)
        findP(D.getSufD(x), rootVT, down, minSup)
        findPUDDAG(rootVT)
        P = P ∪ rootVT.getAllPatterns()
    }
}

The algorithm first calls subroutine getAllFI to detect all the FIs (an adapted version of the FP-growth* algorithm [11] is used to detect FIs in our implementation).

The algorithm then transforms the database. A directed acyclic graph is built to represent the containing relationship of the FIs. For each (sorted) itemset, we check all its FIs that have children in the DAG, and verify whether the FI corresponding to each child is valid in the itemset. If so, we add the id of the child to the itemset and further check the children of that child.

Based on the transformed database, for each FI x, the algorithm creates a root vertex for <x>, detects all the patterns in the prefix projected database and suffix projected database of x, creates x-UDDAG, detects Px using x-UDDAG, and adds Px to P.

Subroutine: findP(PD, rootVT, type, minSup){
    FISet = PD.getAllFI(minSup)
    for each FI x in FISet{
        UDVertex curVT = new UDVertex(x, rootVT)
        if(type == up) rootVT.addUpChild(curVT)
        else rootVT.addDownChild(curVT)
        findP(PD.getPreD(x), curVT, up, minSup)
        findP(PD.getSufD(x), curVT, down, minSup)
        findPUDDAG(curVT)
    }
}

This subroutine detects all the patterns whose ids are no larger than the root of the projected database. The parameters are: 1) PD, the projected database; 2) rootVT, the vertex for the root item of PD; 3) type (up/down), indicating a prefix/suffix PD; 4) minSup, the support threshold.

The subroutine first detects all the FIs in PD. For each FI x, it creates a new vertex as the Up/Down child (based on type) of the root vertex. It then recursively detects all the patterns in PD, similarly to findP(D, minSup).

Subroutine: findPUDDAG(rootVT){
    upQueue.enQueue(rootVT.upChildren)
    while(!upQueue.isEmpty()){
        UDVertex upVT = upQueue.deQueue()
        if(upVT.upParent == rootVT)
            downQueue.enQueue(rootVT.downChildren)
        else if(upVT.downParent == null)
            downQueue.enQueue(upVT.upParent.VDVS)
        else
            downQueue.enQueue(upVT.upParent.VDVS ∩ upVT.downParent.VDVS)
        while(!downQueue.isEmpty()){
            UDVertex downVT = downQueue.deQueue()
            if(isValid(upVT, downVT)){
                UDVertex curVT = new UDVertex(upVT, downVT)


                upVT.addVDVS(downVT)
                if(upVT.upParent == rootVT)
                    downQueue.enQueue(downVT.children)
            }
        }
        if(upVT.VDVS.size > 0)
            upQueue.enQueue(upVT.children)
    }
}

Subroutine findPUDDAG detects all the case 3 patterns in a projected database using the UpDown DAG. The parameter rootVT is the root vertex of the recursively constructed UpDown DAG.

It first enqueues all the up children of the root vertex into an upQueue. For each vertex upVT in the upQueue, it enqueues the PVDVS of upVT into a downQueue as follows: if upVT is a root child of rootVT, it enqueues all the down children of rootVT; else if upVT has only one parent, it enqueues the VDVS of the parent; else it enqueues the intersection of the VDVSs of the parents.

For each vertex downVT in the downQueue of upVT, if upVT and downVT correspond to a valid pattern, it creates a new vertex whose parents are upVT and downVT, and adds downVT to the VDVS of upVT. It further enqueues all the children of downVT into the downQueue if upVT is an up child of rootVT.

Finally, if the size of the VDVS of upVT is not 0, the subroutine enqueues all the children of upVT into the upQueue for further examination.

Theorem 3 (UDDAG). A sequence is a sequential pattern if and only if UDDAG says so.

Proof sketch. Based on Theorem 2, Px ⊆ R. In Algorithm 1 every candidate in R is checked either directly or indirectly based on Lemmas 4, 5, and 6 (cases 1 and 2 are checked directly in subroutine findP, and case 3 is checked in subroutine findPUDDAG). Therefore, a sequence is a sequential pattern if UDDAG says so.

Since all the candidates in R are verified in Algorithm 1, UDDAG is guaranteed to identify the complete set of patterns in D.

4.4 Detailed Implementation Strategies

The major costs of our approach are database projection and candidate pattern verification. Below we discuss the implementation strategies for these two issues.

1) Pseudo-projection

To reduce the number and size of projected databases we adopt a pseudo-projection technique similar to that of PrefixSpan. One major difference is that we register the ids of the sequences and both the starting and ending positions of the projected subsequences in the original sequences, because we project a sequence bi-directionally.

Example 4 (Pseudo-projection). Using pseudo-projection, Pre(8D) and Suf(8D) in Example 3 are shown in Table 3, where $ indicates that 8 occurs in the current sequence but its projected prefix/suffix is empty. Note that for multiple occurrences of 8 in a sequence (e.g., sequence 3), we only register the last prefix and the first suffix, based on Theorem 2.

TABLE 3
Pre(8D) / Suf(8D) BASED ON PSEUDO-PROJECTION

Id   Sequence   Pre(8D)            Suf(8D)
1    <...>      Start: 0; End: 3   $
3    <...>      Start: 0; End: 1   Start: 1; End: 4
4    <...>      Start: 0; End: 0   Start: 2; End: 4
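A pseudo-projected tuple can therefore be as small as a sequence id plus two offsets. A minimal sketch (field names and the empty-span encoding are illustrative):

    // Pseudo-projection entry: no subsequence is copied; we keep the id of
    // the original sequence and the [start, end] itemset positions of the
    // projected span. An empty span (the "$" case in Table 3) can be
    // encoded as end < start.
    class PseudoProjection {
        final int seqId;   // id of the original sequence
        final int start;   // first itemset position of the projected span
        final int end;     // last itemset position of the projected span

        PseudoProjection(int seqId, int start, int end) {
            this.seqId = seqId;
            this.start = start;
            this.end = end;
        }

        boolean isEmpty() { return end < start; }
    }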

2) Verification of candidate patterns

To verify whether an up vertex and a down vertex in a UDDAG correspond to a valid pattern, we derive the support of the candidate from the size of the intersection of the up and down vertexes' occurrence sets. Two approaches are provided in our implementation.

The first approach is bit-vector based. We represent each occurrence set with a bit vector and perform an ANDing operation on the two bit vectors. The size of the intersection set is derived by counting the number of 1s in the resulting bit vector. Several approaches exist for efficiently counting 1 bits in a bit vector [6], [20]. In our implementation we use the arithmetic-logic based approach [20].

For example, 8D in Example 3 has three sequences (1, 3, and 4). The up vertex 1 in Fig. 2 occurs in sequences 1 and 3, thus the bit vector representation of its occurrence set is 110. The down vertex 5 occurs in sequences 3 and 4, and its bit vector is 011. The ANDing result of the two bit vectors is 010, which has only one 1 bit. This means the support of <1 8 5> in 8D is at most 1, thus it is not a pattern.
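A minimal sketch of this check, assuming occurrence sets are kept as long[] bitmaps with one bit per tuple of xD (Java's Long.bitCount is itself an arithmetic-logic popcount in the spirit of [20]):

    class BitVectorSketch {
        // Upper bound on a candidate's support: popcount of the bitwise AND
        // of the two occurrence bitmaps (both arrays have the same length).
        static int intersectionSize(long[] up, long[] down) {
            int count = 0;
            for (int i = 0; i < up.length; i++)
                count += Long.bitCount(up[i] & down[i]);
            return count;
        }
    }

Only candidates whose count reaches minSup proceed; per Lemma 6, a full containment check is still needed for tuples where x occurs more than once.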

The second approach is co-occurrence counting based. Given Pre(xD) and Suf(xD), we derive a co-occurrence count for each ordered pair of FIs (one from a prefix and the other from the corresponding suffix) by enumerating every ordered pair and incrementing the corresponding co-occurrence count by 1. If the co-occurrence count of a pair is less than minSup, the pair is an invalid candidate.

For example, given a 9-projected database with the following sequences, 1) <...>, 2) <...>, and 3) <...>, we have co-occurring pairs (5 6), (5 8), (3 6), (3 8) for sequence 1, (3 8), (5 8) for sequence 2, and (6 8) for sequence 3. If minSup is 2, only (3 8) and (5 8) (both co-occur twice) will be considered as candidates. The other pairs are discarded because they occur only once.
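A minimal sketch of this counting scheme (the data layout, parallel per-tuple arrays of prefix-side and suffix-side FI ids, is an assumption for illustration):

    import java.util.*;

    class CoOccurrenceSketch {
        // Count each ordered (prefix FI, suffix FI) pair at most once per
        // tuple, then keep the pairs whose count reaches minSup.
        static Set<Long> frequentPairs(List<int[]> prefixFIs,
                                       List<int[]> suffixFIs, int minSup) {
            Map<Long, Integer> count = new HashMap<>();
            for (int t = 0; t < prefixFIs.size(); t++) {
                Set<Long> seen = new HashSet<>();   // one count per tuple
                for (int a : prefixFIs.get(t))
                    for (int b : suffixFIs.get(t))
                        seen.add(((long) a << 32) | (b & 0xffffffffL));
                for (long pair : seen) count.merge(pair, 1, Integer::sum);
            }
            Set<Long> valid = new HashSet<>();
            for (Map.Entry<Long, Integer> e : count.entrySet())
                if (e.getValue() >= minSup) valid.add(e.getKey());
            return valid;
        }
    }

This reproduces the example above: pairs (3 8) and (5 8) each reach count 2 and survive; all other pairs appear once and are pruned.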

5 PERFORMANCE EVALUATION

We conducted an extensive set of experiments to compare our approach with other representative algorithms. All the experiments were performed on a Windows Server 2003 machine with a 3.0 GHz Quad Core Intel Xeon processor and 16 GB of memory. The algorithms we compared are PrefixSpan, Spade, and LapinSpam, which were all implemented in C++ by their authors (minor changes were made to adapt Spade to Windows). Two versions of UDDAG were tested: UDDAG-bv uses bit vectors to verify candidates, and UDDAG-co uses co-occurrence counting to verify candidates whenever possible.

We performed two studies using the same data generator as in [19]: 1) a comparative study, which uses datasets similar to those in [19]; 2) a scalability study. The datasets were generated by keeping all but one of the parameters shown in Table 4 fixed and exploring different values


for the remaining one. We present the experiment results in this section and give more discussion in Section 6. (Note: the default value for T is 2.8 for scalability testing on I, to allow higher values of I to be tested.)

TABLE 4
PARAMETERS FOR GENERATING DATASETS

Symbol   Name                                            Def. value
C        Number of sequences                             100,000
N        Number of different items                       15,000
S        Ave. number of transactions per sequence        10
T        Ave. number of items in a trans.                2.4
L        Ave. number of transactions in a pattern        8
I        Ave. number of items in a trans. in a pattern   1.2

    5.1 Experiment Results for Comparative Study

First we tested data set C10S8T8I8 with 10k sequences and 1000 different items. Fig. 3 shows the distribution of pattern lengths. Figs. 4 and 5 show the time and memory usage of the algorithms at different minSup values.

Fig. 3. Distribution of pattern lengths of data set C10S8T8I8. [Figure: number of frequent sequences (log scale) vs. length of frequent k-sequences (1-17), for minSup = 3%, 2.5%, 2%, 1.5%, 1%, 0.5%.]

Fig. 4 shows that the UDDAG algorithms are the fastest while LapinSpam is the slowest. When minSup is large (e.g., 3%), UDDAG-bv (0.16s) and UDDAG-co (0.17s) are slightly faster than PrefixSpan (0.19s) and Spade (0.25s), but more than 10 times faster than LapinSpam (1.9s). When minSup is 0.5%, UDDAG-bv (3s) and UDDAG-co (2.9s) are much faster than all the other algorithms.

Fig. 4. Time usage on data set C10S8T8I8. [Figure: time (s, log scale) vs. minimum support (%), for PrefixSpan, UDDAG-bv, UDDAG-co, LapinSpam, and Spade.]

The UDDAG algorithms use less memory than PrefixSpan when minSup is large (≥ 1%). When minSup is less than 1%, they use more memory because of the extra memory used for UDDAG, which increases as the number of patterns increases. The memory usage of the UDDAG based approaches is generally less than that of Spade and much less than that of LapinSpam. Since LapinSpam crashed on large data sets, in the following tests we only show the results of the other four algorithms.

Fig. 5. Memory usage on data set C10S8T8I8. [Figure: memory (MB, log scale) vs. minimum support (%), for PrefixSpan, UDDAG-bv, UDDAG-co, LapinSpam, and Spade.]

Secondly we tested data set C200S10T2.5I1.25 with 200k sequences and 10000 different items. Fig. 6 shows the distribution of pattern lengths. Figs. 7 and 8 show the time and memory usage of the algorithms. The processing times show a similar order as in the first test. When minSup is 1%, the algorithms have similar running times. As minSup decreases, the processing times of PrefixSpan and Spade grow faster than those of UDDAG-bv and UDDAG-co. When minSup is 0.1%, UDDAG-bv (8.5s) and UDDAG-co (8.7s) are almost 4 times faster than PrefixSpan (32s) and 3 times faster than Spade (23s).

Fig. 6. Distribution of pattern lengths of data set C200S10T2.5I1.25. [Figure: number of frequent sequences (log scale) vs. length of frequent k-sequences (1-13), for minSup = 1%, 0.7%, 0.4%, 0.2%, 0.15%, 0.1%.]

Fig. 7. Time usage on data set C200S10T2.5I1.25. [Figure: time (s, log scale) vs. minimum support (%), for PrefixSpan, UDDAG-bv, UDDAG-co, and Spade.]

UDDAG-bv and UDDAG-co use less memory than PrefixSpan except when minSup is 1%. The memory usage of Spade is the highest.


Fig. 8. Memory usage on data set C200S10T2.5I1.25. [Figure: memory (MB, log scale) vs. minimum support (%).]

Next we tested a denser data set, C200S10T5I2.5, with 200k sequences and 10000 different items. Fig. 9 shows the distribution of pattern lengths. Figs. 10 and 11 show the time and memory usage of the algorithms. The processing times show a similar order as in the previous experiments. When minSup is 1%, the algorithms have similar running times. As minSup decreases, the times of PrefixSpan and Spade grow faster than those of UDDAG-bv and UDDAG-co. When minSup is 0.25%, UDDAG-bv (49s) and UDDAG-co (50s) are 4 times faster than PrefixSpan (195s) and more than 2 times faster than Spade (118s).

Fig. 9. Distribution of pattern lengths of data set C200S10T5I2.5. [Figure: number of frequent sequences (log scale) vs. length of frequent k-sequences (1-17), for minSup = 1%, 0.75%, 0.5%, 0.375%, 0.3%, 0.25%.]

Fig. 10. Time usage on data set C200S10T5I2.5. [Figure: time (s, log scale) vs. minimum support (%).]

When minSup is large (> 0.375%), UDDAG-bv and UDDAG-co have similar memory usage as PrefixSpan. However, when minSup is less than 0.375%, they use more memory due to the extremely large number of patterns in this dataset at low minSup. The memory usage of Spade is the highest when minSup is larger than 0.375%.

Fig. 11. Memory usage on data set C200S10T5I2.5. [Figure: memory (MB, log scale) vs. minimum support (%).]

5.2 Experiment Results for Scalability Study

This section studies the impact of the different dataset parameters on the performance of each algorithm. The default absolute support threshold is 100.

First we examine the performance of the algorithms with different numbers of sequences (C) under two different minSup settings. Figs. 12 and 13 show the performance of the algorithms when minSup is 100, and Figs. 14 and 15 show the performance when minSup is 400. When minSup is 100, the UDDAG algorithms are about 10 times faster than PrefixSpan and 3-4 times faster than Spade. When minSup is 400, Spade is the slowest. The UDDAG algorithms have similar performance as PrefixSpan for small datasets (100K and 200K). However, when the datasets get larger, UDDAG outperforms PrefixSpan with growing margins.

The UDDAG algorithms have similar memory usage as PrefixSpan. Spade consumes more memory than the other algorithms in most cases.

Fig. 12. Time usage on different sequence numbers (minSup=100). [Figure: time (s, log scale) vs. number of sequences ('000), for PrefixSpan, UDDAG-bv, UDDAG-co, and Spade.]

Fig. 13. Memory usage on different seq. numbers (minSup=100). [Figure: memory (MB, log scale) vs. number of sequences ('000).]


Fig. 14. Time usage on different sequence numbers (minSup=400). [Figure: time (s, log scale) vs. number of sequences ('000).]

Fig. 15. Memory usage on different seq. numbers (minSup=400). [Figure: memory (MB, log scale) vs. number of sequences ('000).]

Figs. 16 and 17 show the performance of the algorithms on data sets with different numbers of items (N). The time usage of PrefixSpan and Spade grows as N increases. On the contrary, the time usage of the UDDAG approaches generally decreases as N increases. They outperform PrefixSpan by about an order of magnitude on average, and they are 3-4 times faster than Spade.

Fig. 16. Time usage on different number of items. [Figure: time (s, log scale) vs. number of items ('000).]

UDDAG-bv and UDDAG-co use similar memory as PrefixSpan and less memory than Spade.

Fig. 17. Memory usage on different number of items. [Figure: memory (MB, log scale) vs. number of items ('000).]

Figs. 18 and 19 show the performance of the algorithms on data sets with different average numbers of transactions in a sequence (S). UDDAG-bv and UDDAG-co are faster than PrefixSpan by about one order of magnitude, and they outperform Spade by about 3-4 times. The time usage of PrefixSpan increases faster than that of UDDAG as S increases.

Fig. 18. Time usage on different ave. number of trans. in a sequence. [Figure: time (s, log scale) vs. average number of transactions in a sequence (8-12).]

UDDAG-bv and UDDAG-co use similar memory as PrefixSpan and less memory than Spade.

Fig. 19. Memory usage on different ave. number of trans. in a seq. [Figure: memory (MB, log scale) vs. average number of transactions in a sequence (8-12).]

Figs. 20 and 21 show the performance of the algorithms on data sets with different average numbers of items in a transaction (T). The UDDAG algorithms outperform PrefixSpan by about one order of magnitude on average, and outperform Spade by about 2-4 times.


Fig. 20. Time usage on different ave. number of items in a trans. [Figure: time (s, log scale) vs. average number of items in a transaction (2-6).]

The UDDAG algorithms use similar memory as PrefixSpan and less memory than Spade when T is 2. However, they use more memory as T grows.

Fig. 21. Memory usage on different ave. number of items in a trans. [Figure: memory (MB, log scale) vs. average number of items in a transaction (2-6).]

Figs. 22 and 23 show the performance of the algorithms on data sets with different average numbers of transactions (L) in a sequential pattern. When L is 2, the UDDAG algorithms are slightly faster than PrefixSpan and 2 times faster than Spade. However, when L is 8, they are about an order of magnitude faster than PrefixSpan and 3.5 times faster than Spade.

Fig. 22. Time usage on different ave. number of trans. in a pattern. [Figure: time (s, log scale) vs. average number of transactions in a pattern (2-8).]

UDDAG-bv and UDDAG-co use similar memory as PrefixSpan, and they use less memory than Spade.

Fig. 23. Memory usage on different ave. number of trans. in a pattern. [Figure: memory (MB, log scale) vs. average number of transactions in a pattern (2-8).]

Figs. 24 and 25 show the performance of the algorithms on data sets with different average numbers of items (I) in a transaction of patterns. The UDDAG algorithms outperform PrefixSpan by about one order of magnitude, and outperform Spade by about 3 times.

Fig. 24. Time usage on different average number of items in a transaction in sequential patterns. [Figure: time (s, log scale) vs. average number of items in the transactions of patterns (1-2.2).]

When I is small (e.g., < 1.4), the UDDAG algorithms use similar memory as PrefixSpan and less memory than Spade. However, when I is larger, they use more memory due to the extremely large number of patterns.

Fig. 25. Memory usage on different average number of items in a transaction in sequential patterns. [Figure: memory (MB, log scale) vs. average number of items in the transactions of patterns (1-2.2).]

    6 DISCUSSION

    6.1 Multi-item frequent itemset detection

UDDAG and PrefixSpan take different approaches to detecting FIs with multiple items. PrefixSpan detects multi-item FIs in each projected database while detecting sequential patterns. UDDAG detects all the FIs before pattern detection. Below we examine the impact of this strategy on its performance.

Table 5 shows the relative time (RT) of FI detection (including database transformation) with respect to the total time usage of UDDAG-bv for the tests in Section 5. Table 5 (a)-(c) show that RT generally decreases as minSup decreases (except for the first minSup value in each test). Similarly, Table 5 (d), (e), (g), (h), (i), and (j) show that RT generally decreases as the corresponding parameter increases. The only exception is Table 5 (f), where RT remains almost the same for different numbers of items.

TABLE 5
RELATIVE TIME CONSUMPTION OF FI DETECTION

(a) Comparative study, dataset C10S8T8I8
minSup (%)    3    2.5    2    1.5    1    0.5
RT (%)       10    17    14     9     9     7


(b) Comparative study, dataset C200S10T2.5I1.25
minSup (%)    1    0.7    0.4    0.2    0.15    0.1
RT (%)        7    13     18     16     13      10

(c) Comparative study, dataset C200S10T5I2.5
minSup (%)    1    0.75    0.5    0.375    0.3    0.25
RT (%)       18    19      19     16       11      6

(d) Scalability study on different numbers of sequences (C), minSup = 400
D (000)     100    150    200    250    300    350    400
RT (%)       16     18     16     15     13     12     10

(e) Scalability study on different numbers of sequences (C), minSup = 100
D (000)     100    150    200    250    300    350    400
RT (%)       11      9      7      6      5      5      4

(f) Scalability study on different numbers of items (N)
N (000)      10     12     14     16     18     20
RT (%)       11     11     11     12     11     12

(g) Scalability study on different numbers of transactions in a sequence (S)
S             8      9     10     11     12
RT (%)       17     13     12     10      8

(h) Scalability study on different average numbers of items in a transaction (T)
T             2      3      4      5      6
RT (%)       12      8      5      3      2

(i) Scalability study on different average numbers of transactions in a pattern (L)
L             2      3      4      5      6      7      8
RT (%)       20     18     17     15     14     12     11

(j) Scalability study on different average numbers of items in a transaction of the patterns (I)
I             1    1.2    1.4    1.6    1.8    2.0    2.2
RT (%)       13     10      7      6      5      4      6

Table 5 shows that FI detection consumes around 10% of the total time on average, which is insignificant to the overall performance of UDDAG.

As discussed in Section 3, AprioriAll adopts a similar solution path, i.e., detecting FIs separately before pattern detection. However, in practice AprioriAll is very slow, for two major reasons:

1) The approach AprioriAll takes for FI detection is very inefficient. According to [1], AprioriAll uses the Apriori algorithm [2] for FI detection, which is extremely slow compared to state-of-the-art solutions. Based on our tests [9] and the FIMI tests [11], Apriori-based algorithms are considerably slower than FP-Growth* in many cases (often one or two orders of magnitude slower). Since these tests include the time for writing the detected patterns to a file (which may be significant when a large number of FIs are involved), the actual gain of FP-growth over Apriori may be even higher when file output is not needed (as is the case in this paper). In addition, the Apriori algorithms tested in [9] and [11] are state-of-the-art implementations, which are themselves considerably faster than the original Apriori algorithm of [2]. Moreover, in our implementation we made some adaptations to the existing state-of-the-art FP-growth approach to make it even faster. Altogether, the Apriori approach for FI detection adopted in [1] may be significantly slower than the FP-growth approach adopted in our algorithms, which contributes to the inefficiency of AprioriAll.

2) The original AprioriAll algorithm's candidate generation (by joining and pruning) and support counting (by checking all the sequences for supported patterns) strategies are extremely slow, especially for large databases with many patterns.
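As a concrete illustration of the FI-first strategy discussed above, the following is a minimal, deliberately naive sketch of detecting all frequent itemsets and transforming the database before pattern mining. It is our own illustration, not the paper's implementation: step 1 uses brute-force enumeration where the paper uses an adapted FP-growth, and all names are hypothetical.

```python
from itertools import combinations

def transform_database(sequences, min_sup):
    """sequences: list of sequences; a sequence is a list of
    transactions; a transaction is an iterable of items.
    Returns the transformed database, where every transaction is
    rewritten as the ids of the frequent itemsets (FIs) it contains,
    plus the id -> itemset mapping."""
    # 1) Support counting: an itemset is counted at most once per
    #    sequence (naive enumeration standing in for FP-growth).
    support = {}
    for seq in sequences:
        seen = set()
        for trans in seq:
            items = sorted(set(trans))
            for r in range(1, len(items) + 1):
                seen.update(combinations(items, r))
        for itemset in seen:
            support[itemset] = support.get(itemset, 0) + 1
    # 2) Keep the frequent itemsets and assign each a new integer id.
    frequent = sorted(fi for fi, s in support.items() if s >= min_sup)
    fi_id = {fi: i + 1 for i, fi in enumerate(frequent)}
    # 3) Rewrite each transaction as the ids of the FIs it contains.
    transformed = [
        [sorted(fi_id[fi] for fi in frequent if set(fi) <= set(trans))
         for trans in seq]
        for seq in sequences
    ]
    return transformed, {i: fi for fi, i in fi_id.items()}

# Example: with min_sup = 2, the multi-item FI ('a', 'b') gets its own id,
# so the subsequent mining step only deals with single (new) items.
db = [[['a', 'b'], ['c']], [['a', 'b', 'c']], [['b']]]
print(transform_database(db, 2))
```

After this transformation, every multi-item FI behaves as an ordinary item during pattern detection, which is what allows UDDAG to separate FI detection from sequential pattern mining.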

    6.2 Time Complexity

Multi-item FI detection, database projection, and pattern detection account for the major time usage of our approach.

1) Multi-item FI detection. As discussed in Section 6.1, the time UDDAG spends on FI detection is around 10% of the total. This is insignificant and thus has little impact on the overall performance of UDDAG.

2) Database projection. The goal of database projection is to find the occurrence information of FIs in a projected database and further derive a projected database for each FI. The time for database projection is proportional to the total time spent checking items in projected databases. Given a sequence, if the longest pattern in the sequence has length M, the maximal number of levels of projection on the sequence is M, which means each item in the sequence is checked at most M times. Given a database, the total number of items is CST. Let L be the average length of detected patterns; then on average an item is checked at most L times, and the total number of item instances we check is at most LCST. In practice it is close to O((log L)·CST), because the minimal number of levels of projection needed to detect a pattern of length M is about log2 M + 1.

With PrefixSpan, the total number of levels of projection is always M when detecting a length-M pattern, so its projection time is O(L·CST). The projection complexity of UDDAG matches that of PrefixSpan in the worst case; on average, however, it is much lower.

The above analysis is verified by our experimental results. Figures 12, 14, 18, and 20 show that the processing times of both UDDAG and PrefixSpan scale up quasi-linearly as C, S, and T increase. However, when L increases, Fig. 22 clearly shows that PrefixSpan scales up almost linearly, while the UDDAG-based approaches scale up much more slowly (close to O(log L)).
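To summarize the projection-cost comparison in one place, using the notation above (the concrete numbers in the comments are our own illustration, not measurements from the paper):

```latex
% Levels of projection needed for a length-M pattern:
%   PrefixSpan: M         UDDAG (best case): \lfloor \log_2 M \rfloor + 1
% Resulting projection time over a database of CST items:
T^{\mathrm{proj}}_{\mathrm{PrefixSpan}} = O(L \cdot CST), \qquad
T^{\mathrm{proj}}_{\mathrm{UDDAG}} \approx O\!\left((\log_2 L)\cdot CST\right).
% E.g., for L = 8: each item is checked up to 8 times under PrefixSpan,
% but only about log2(8) + 1 = 4 times under UDDAG on average.
```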

3) Pattern detection. The major cost of pattern detection in UDDAG is the evaluation of candidate patterns of case 3. Different approaches such as UDDAG-bv and UDDAG-co may have different efficiency, as shown in Figures 4, 7, 10, etc. Lemmas 4 and 5 state that the validity of children vertex candidates can be inferred from that of parent vertex candidates. As the average length of patterns becomes longer, the number of children vertex candidates in UDDAG also becomes larger, which helps eliminate unnecessary candidate checking: the longer L is, the more effective the evaluation of case 3 candidates will be. This is verified in Fig. 22.
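The following is a minimal sketch of the kind of pruning this enables. It is our own illustration under stated assumptions: the exact candidate sets and parent relation are defined by the paper's Lemmas 4 and 5 (not reproduced in this section), and the helper names are hypothetical.

```python
def mine_case3(up_pats, down_pats, parent, occurs):
    """up_pats / down_pats: candidate prefix and suffix patterns,
    assumed ordered so that every pattern appears after its parent.
    parent(p): p's parent vertex in the DAG (None at the root).
    occurs(p, q): the costly database test for the concatenated
    candidate. Returns the (prefix, suffix) pairs that form patterns."""
    valid = set()
    for p in up_pats:
        for q in down_pats:
            pp, qp = parent(p), parent(q)
            # Downward closure on the DAG: a child candidate can be
            # valid only if both of its parent candidates were, so
            # most invalid candidates never reach the database test.
            if pp is not None and (pp, q) not in valid:
                continue
            if qp is not None and (p, qp) not in valid:
                continue
            if occurs(p, q):  # reached only by surviving candidates
                valid.add((p, q))
    return valid
```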

Since PrefixSpan does not generate candidate patterns, the cost of pattern detection in PrefixSpan is limited to FI counting in each projected database, which is an advantage over UDDAG. In practice, however, UDDAG performs better for the following reasons:

a) The special data structure UDDAG eliminates unnecessary candidates (based on Lemmas 4 and 5);

b) Projected databases shrink much faster than in PrefixSpan. First, UDDAG has fewer levels of projection than PrefixSpan on average. Second, with UDDAG, at each level of recursion we project a database into a prefix and a suffix projected database, and each sequence in these projected databases has, on average, half the length of the original sequence. Thus at level-k projection the average sequence length is T/2^(k-1) in UDDAG, while in PrefixSpan it is T - (T/L)·k. Therefore, the average number of instances of FIs in a projected database at level k in UDDAG is much smaller than in PrefixSpan, which leads to more efficient database projection and FI counting.
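Written out, with l_0 denoting the average initial sequence length (the paragraph above writes T for this quantity), the two shrinkage laws are as follows; the numeric example is our own illustration, not taken from the paper's datasets:

```latex
\ell_k^{\mathrm{UDDAG}} = \frac{\ell_0}{2^{\,k-1}}, \qquad
\ell_k^{\mathrm{PrefixSpan}} = \ell_0 - \frac{\ell_0}{L}\, k .
% Example: for l0 = 48 and L = 8, at level k = 3 UDDAG leaves
% 48 / 2^2 = 12 items per sequence on average, while PrefixSpan
% still leaves 48 - (48/8)*3 = 30.
```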

One additional fact: when minSup is large enough that the average pattern length is close to 1, the problem of sequential pattern mining degrades into a frequent item counting problem, and PrefixSpan and UDDAG have similar performance. For example, in Fig. 3 the average pattern lengths are 1.44 and 1.60 for minSup values of 3% and 2.5%, respectively, and the time usages of PrefixSpan and UDDAG are very close, as shown in Fig. 4. Similar observations can be made in Figs. 7, 10, 14, and 22 for large minSup values/small pattern lengths.

    6.3 Space Complexity

In UDDAG, the problem of finding all the patterns in a database is partitioned into finding the subsets of patterns defined in Lemma 1. Thus the maximal memory usage for finding all the patterns is max(M1, M2, ..., Mt), where Mi is the maximal memory usage for detecting subset i. Mi is mainly used to store the projected database and the UpDown DAG, whose size is determined by the total number of vertexes. Besides, we also need to store the transformed database during the whole pattern mining process.

The size of the transformed database is determined by the size of the original database as well as the characteristics of the FIs. If the average length of the FIs is small, the size of the transformed database is close to that of the original database. The size of the transformed database increases as the average length, total number, and support of multi-item FIs increase. This is verified in Fig. 25 (where the average length of FIs increases) and Fig. 11 (where the number and support of multi-item FIs increase dramatically as minSup decreases).

For projected databases, given a level-1 projected database with C sequences, if the length of the longest pattern is M, then the maximal number of levels of projection is M. Using Pseudo-Projection, at each level of projection we store only the beginning and ending positions as well as the sequence ids, so the maximal number of integers we need to store recursively is 3CM. The actual memory usage may be much smaller because 1) the size of projected databases shrinks as the recursion level increases, and 2) the total number of levels of real projection may be much smaller than M.
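A sketch of the three-integers-per-entry bookkeeping described above (our own illustration; the prefix direction of the projection step and all names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class PseudoEntry:
    seq_id: int  # index of the sequence in the (transformed) database
    begin: int   # first transaction position of the projected window
    end: int     # one past the last transaction position

def project_after(database, entries, item):
    """Shrink each entry's window to start just after the first
    transaction (within the window) that contains `item`; entries
    whose window does not contain `item` are dropped. Each surviving
    entry still costs only three integers, never a copied sequence."""
    out = []
    for e in entries:
        seq = database[e.seq_id]
        for pos in range(e.begin, e.end):
            if item in seq[pos]:
                out.append(PseudoEntry(e.seq_id, pos + 1, e.end))
                break
    return out

# Level-1 entries cover whole sequences: 3*C integers for C sequences.
db = [[{1}, {2, 3}, {2}], [{2}, {1, 2}]]
level1 = [PseudoEntry(i, 0, len(s)) for i, s in enumerate(db)]
print(project_after(db, level1, 2))  # windows after the first '2'
```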

The cost of storing a UDDAG is proportional to the maximal number of patterns in a subset. Generally this cost is relatively small compared to storing the databases. However, if the number of patterns is extremely large, this cost may also increase significantly, as shown in Fig. 11. In addition, this feature of UDDAG may also cause a jitter effect in the memory usage of the scalability tests. Fig. 23, the memory usage for different average numbers of transactions in a pattern (L), shows an example of this effect. When L = 5, the memory usage of UDDAG-co (23.6 MB) is higher than for the datasets with L = 4 (16.7 MB) and L = 6 (20.9 MB). The reason is that each testing dataset is generated independently, so the largest subsets of patterns in some datasets may be smaller or larger than in their neighboring datasets, which results in smaller or larger memory consumption. A similar effect can be seen in Fig. 17.

PrefixSpan does not need additional space for a UDDAG, and it needs less space to store the whole database because it stores the original database instead of the transformed one, but it may need more memory for projected databases because of its larger number of projection levels. Overall, the memory usage of UDDAG is comparable to that of PrefixSpan, as shown in Figures 5, 8, 11, 13, 15, 17, etc. UDDAG may use more memory than PrefixSpan in extreme cases, when a significant number of patterns exist in a subset, or when the average length of FIs is large and the number/support of multi-item FIs is extremely big, as shown in Figures 11 and 25.

    7 CONCLUSIONS AND FUTURE WORK

In this paper a novel data structure, UDDAG, is invented for efficient pattern mining. The new approach grows patterns from both ends (prefixes and suffixes) of detected patterns, which results in faster pattern growth because it requires fewer levels of database projection than traditional approaches. Extensive comparative and scalability experiments have been performed to evaluate the proposed algorithm.

In terms of time efficiency, when minSup is very large such that the average length of patterns is close to 1, UDDAG and PrefixSpan have similar performance, because in this case the problem becomes a simple frequent item counting problem (practically uninteresting for sequential pattern mining). However, UDDAG's runtime scales up much more slowly than PrefixSpan's, and it often outperforms PrefixSpan by one order of magnitude in our scalability tests. Experiments also show that UDDAG is considerably faster than two other representative algorithms, Spade and LapinSpam. In addition, UDDAG demonstrates satisfactory scale-up properties with respect to various parameters such as the total number of sequences, the total number of items, and the average length of sequences.

The memory usage of the UDDAG-based approaches is generally comparable to that of PrefixSpan. They may use more memory in extreme cases, when a significant number of patterns exist in a subset, or when the average length of FIs is large and the number/support of multi-item FIs is extremely big. UDDAG generally uses less memory than Spade and LapinSpam.

One major feature of UDDAG is that it supports efficient pruning of invalid candidates. This represents a promising approach for applications that involve searching in large spaces, and thus has great potential in related areas of data mining and artificial intelligence. In the future we expect to further improve the UDDAG-based pattern mining algorithm as follows: 1) Currently FI detection is independent of pattern mining. In practice, the knowledge gained from FI detection may be useful for pattern mining; in the future we will integrate the two solutions so that they can benefit from each other. 2) Different candidate verification strategies may have different impacts on the efficiency of the algorithm; we will study more efficient verification strategies. 3) UDDAG storage has a big impact on memory usage when the number of patterns in a subset is extremely large; we will seek a more efficient way to store UDDAG.

We will also extend our approach to other types of sequential pattern mining problems, e.g., mining with constraints, closed and maximal pattern mining, approximate pattern mining, and domain-specific pattern mining.

We also expect to extend the UDDAG-based approach to other areas where large search spaces are involved and pruning of the search space is necessary.

    ACKNOWLEDGMENT

The author is grateful for the insightful comments of the anonymous reviewers. The author would also like to thank Ping Zhong, Terry Cook, and Anne Moroney for their help with the draft. This work was supported in part by a PSC-CUNY Research Grant (PSCREG-38-892) and a Queens College Research Enhancement Grant.

    REFERENCES

[1] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. ICDE '95, pp. 3-14, 1995.

[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. VLDB, pp. 487-499, 1994.

[3] C. Antunes and A.L. Oliveira, "Generalization of Pattern-Growth Methods for Sequential Pattern Mining with Gap Constraints," Proc. Int'l Conf. Machine Learning and Data Mining 2003, pp. 239-251, 2003.

[4] C. Antunes and A.L. Oliveira, "Sequential Pattern Mining Algorithms: Trade-offs between Speed and Memory," Proc. 2nd Int'l Workshop on Mining Graphs, Trees and Sequences, 2004.

[5] J. Ayres, J. Gehrke, T. Yu, and J. Flannick, "Sequential PAttern Mining Using a Bitmap Representation," Proc. Int'l Conf. Knowledge Discovery and Data Mining 2002, pp. 429-435, 2002.

[6] S. Berkovich, G. Lapir, and M. Mack, "A Bit-Counting Algorithm Using the Frequency Division Principle," Software: Practice and Experience, vol. 30, no. 14, pp. 1531-1540, 2000.

[7] C. Bettini, X.S. Wang, and S. Jajodia, "Mining Temporal Relationships with Multiple Granularities in Time Sequences," Data Eng. Bull., vol. 21, pp. 32-38, 1998.

[8] J. Chen and T. Cook, "Mining Contiguous Sequential Patterns from Web Logs," Proc. WWW 2007 Poster Session, May 8-12, 2007, Banff, Alberta, Canada.

[9] J. Chen and K. Xiao, "BISC: A Binary Itemset Support Counting Approach towards Efficient Frequent Itemset Mining," submitted to ACM Transactions on Knowledge Discovery from Data.

[10] D.Y. Chiu, Y.H. Wu, and A.L.P. Chen, "An Efficient Algorithm for Mining Frequent Sequences by a New Strategy without Support Counting," Proc. ICDE 2004, p. 375, 2004.

[11] G. Grahne and J. Zhu, "Efficiently Using Prefix-Trees in Mining Frequent Itemsets," Proc. FIMI '03 Workshop on Frequent Itemset Mining Implementations, 2003.

[12] M.N. Garofalakis, R. Rastogi, and K. Shim, "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints," Proc. VLDB '99, pp. 223-234, 1999.

[13] B. Goethals and M.J. Zaki, "FIMI '03: Workshop on Frequent Itemset Mining Implementations," Proc. ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 2003.

[14] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.C. Hsu, "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining," Proc. SIGKDD 2000, pp. 355-359, 2000.

[15] M.Y. Lin and S.Y. Lee, "Fast Discovery of Sequential Patterns through Memory Indexing and Database Partitioning," J. Information Science and Engineering, vol. 21, pp. 109-128, 2005.

[16] H. Mannila, H. Toivonen, and A.I. Verkamo, "Discovery of Frequent Episodes in Event Sequences," Data Mining and Knowledge Discovery, vol. 1, pp. 259-289, 1997.

[17] F. Masseglia, F. Cathala, and P. Poncelet, "The PSP Approach for Mining Sequential Patterns," Proc. European Symp. Principles of Data Mining and Knowledge Discovery, pp. 176-184, 1998.

[18] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Proc. 2001 Int'l Conf. Data Eng. (ICDE '01), pp. 215-224, 2001.

[19] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu, "Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach," IEEE Trans. Knowledge and Data Eng., vol. 16, pp. 1424-1440, 2004.

[20] E.M. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice, Prentice-Hall, Englewood Cliffs, NJ, 1977.

[21] R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. Int'l Conf. Extending Database Technology 1996, pp. 3-17, 1996.

[22] K. Wang, Y. Xu, and J.X. Yu, "Scalable Sequential Pattern Mining for Biological Sequences," Proc. 2004 ACM Int'l Conf. Information and Knowledge Management, pp. 178-187, 2004.

[23] J. Wang, Y. Asanuma, E. Kodama, T. Takata, and J. Li, "Mining Sequential Patterns More Efficiently by Reducing the Cost of Scanning Sequence Databases," IPSJ Trans. Database, vol. 47, no. 12, pp. 3365-3379, 2006.

[24] M. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, vol. 40, pp. 31-60, 2001.

[25] Z. Zhang and M. Kitsuregawa, "LAPIN-SPAM: An Improved Algorithm for Mining Sequential Pattern," Proc. SWOD '05, pp. 8-11, Apr. 2005.

[26] Z. Zhang, Y. Wang, and M. Kitsuregawa, "Effective Sequential Pattern Mining Algorithms for Dense Database," Proc. Japanese National Data Engineering Workshop (DEWS '06), 2006.

Jinlin Chen received his PhD degree in Automatic Control in 1999 and his Bachelor of Engineering and Bachelor of Economics degrees in 1994, all from Tsinghua University, China. He is a faculty member in the Computer Science Department, Queens College, the City University of New York. Previously he was a visiting professor at the University of Pittsburgh and a researcher at Microsoft Research Asia. His research interests include web information modeling and processing, information retrieval, and data mining. He is a member of the IEEE and the ACM.
