Post on 20-Jan-2016

Module 8: Sequential Pattern Mining

Terminology
Item
Itemset
Sequence (customer-sequence)
Subsequence
Support for a sequence
Large/frequent sequence

Example

Q. How to find the sequential patterns?

Example

Item

Itemset

Transaction

Example (cont.)

Sequence

3-Sequence

Subsequence

Example (cont.)

<(30) (90)> is supported by customers 1 and 4

<(30) (40 70)> is supported by customers 2 and 4

Equivalently, the sequences of customers 1 and 4 contain <(30) (90)>
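A minimal Python sketch of the containment test behind these statements. The five customer sequences are an assumption, chosen to agree with the two support statements above (the original table is not reproduced in this transcript); `contains` and `supporters` are illustrative names.

```python
def contains(customer_seq, pattern):
    """Return True if `pattern` is a subsequence of `customer_seq`.

    Both are lists of sets. Each element of `pattern` must be a subset of a
    distinct transaction of `customer_seq`, in the same order.
    """
    pos = 0
    for element in pattern:
        # Greedily match each pattern element to the earliest usable transaction.
        while pos < len(customer_seq) and not element <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def supporters(database, pattern):
    """Return the sorted customer ids whose sequence contains `pattern`."""
    return sorted(cid for cid, seq in database.items() if contains(seq, pattern))


# Assumed example database: customer id -> customer sequence.
database = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

print(supporters(database, [{30}, {90}]))      # [1, 4]
print(supporters(database, [{30}, {40, 70}]))  # [2, 4]
```

Note that customer 3 does not support <(30) (70)>: both items occur in the same transaction, while the pattern needs two transactions in order.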

Example (cont.)

Q. Find the large/frequent sequences with minimum support set to 25%:

- Frequent sequence = a sequence whose support is at least the minimum support

<(30)>, <(40)>, <(70)>, <(90)>
<(30) (40)>, <(30) (70)>, <(40 70)>

The Algorithm (Apriori)

Five phases:
Sort phase
Large itemset phase
Transformation phase
Sequence phase
Maximal phase

Sort phase

Sort the database with customer-id as the major key and transaction-time as the minor key.
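The sort phase could be sketched in Python as follows. The record layout `(customer_id, transaction_time, items)` and the sample records are hypothetical, used only to illustrate the two sort keys.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical transaction records: (customer_id, transaction_time, items).
transactions = [
    (2, 3, {40, 60, 70}),
    (1, 1, {30}),
    (2, 1, {10, 20}),
    (1, 2, {90}),
    (2, 2, {30}),
]


def to_customer_sequences(records):
    """Sort by (customer_id, transaction_time), then group per customer."""
    ordered = sorted(records, key=itemgetter(0, 1))  # major key, minor key
    return {
        cid: [items for _, _, items in group]
        for cid, group in groupby(ordered, key=itemgetter(0))
    }


sequences = to_customer_sequences(transactions)
print(sequences[1])  # [{30}, {90}]
```

After this step every customer has exactly one sequence, with transactions in time order.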

Litemset phase

Find the set of all large itemsets (litemsets). Map each litemset to an integer.

Transformation phase

Delete the non-large itemsets from each transaction. Replace the large itemsets with their integers.
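A rough sketch of the transformation step in Python. The litemsets and their integer ids below are assumed for illustration; in practice they come out of the litemset phase.

```python
# Assumed mapping of large itemsets (litemsets) to integers.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(customer_seq, litemset_ids):
    """Replace each transaction by the set of ids of the litemsets it contains.

    Transactions that contain no litemset are dropped from the sequence.
    """
    transformed = []
    for transaction in customer_seq:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


# Items 10, 20 and 60 belong to no litemset, so they disappear.
print(transform([{10, 20}, {30}, {40, 60, 70}], litemset_ids))  # [{1}, {2, 3, 4}]
```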

Sequence phase

Use the set of litemsets to find the desired sequences. Two families of algorithms:
Count-all: Algorithm AprioriAll
Count-some: Algorithm AprioriSome, Algorithm DynamicSome

AprioriAll

The basic method to mine sequential patterns.
Based on the Apriori algorithm.
Counts all the large sequences, including non-maximal sequences.
Uses the Apriori-generate function to generate candidate sequences.

AprioriAll (cont.)

L1 = {large 1-sequences}; // Result of the litemset phase
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = new candidates generated from Lk-1
    foreach customer-sequence c in the database do
        increment the count of all candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support
end
Answer = maximal sequences in ∪k Lk;
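The loop above might look like this in Python. It is a sketch under a few assumptions: sequences are tuples of litemset ids, the database is the set of five transformed customer sequences used in this module's example, and the helper names (`contains`, `generate`, `apriori_all`) are illustrative.

```python
from itertools import combinations


def contains(customer_seq, pattern):
    """True if the litemset ids in `pattern` occur in order, each in a later transaction."""
    pos = 0
    for litemset in pattern:
        while pos < len(customer_seq) and litemset not in customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1
    return True


def generate(large_prev):
    """Join large (k-1)-sequences that share a (k-2)-prefix, then prune."""
    joined = {p + (q[-1],) for p in large_prev for q in large_prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, len(c) - 1))}


def apriori_all(database, min_support):
    """Return {k: {sequence: support count}} for every large k-sequence."""
    def count(candidates):
        counts = {c: sum(contains(seq, c) for seq in database) for c in candidates}
        return {c: n for c, n in counts.items() if n / len(database) >= min_support}

    items = {i for seq in database for transaction in seq for i in transaction}
    large = {1: count({(i,) for i in items})}
    k = 2
    while large[k - 1]:
        large[k] = count(generate(set(large[k - 1])))
        k += 1
    del large[k - 1]  # the last level computed is empty
    return large


database = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
large = apriori_all(database, 0.25)
print(large[4])  # {(1, 2, 3, 4): 2}
```

The maximal phase is left out here; the function returns all large sequences, including non-maximal ones.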

Apriori Candidate Generation

Generates the candidates for a pass using only the large sequences found in the previous pass, and then makes a pass over the data to find their support.

Algorithm:
Lk: the set of all large k-sequences
Ck: the set of candidate k-sequences

Apriori Candidate Generation (cont.)

insert into Ck
select p.litemset1, p.litemset2, …, p.litemsetk-1, q.litemsetk-1
from Lk-1 p, Lk-1 q
where p.litemset1 = q.litemset1, …, p.litemsetk-2 = q.litemsetk-2;

forall sequences c ∈ Ck do
    forall (k-1)-subsequences s of c do
        if (s ∉ Lk-1) then delete c from Ck;
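The join and prune steps can be sketched separately in Python. Sequences are assumed to be tuples of litemset ids, and the input is the set of large 3-sequences from this module's example; `join` and `prune` are illustrative names. Following the select statement as written, a sequence may also be joined with itself.

```python
from itertools import combinations


def join(large_prev):
    """Extend p by the last litemset of q whenever p and q agree on the first k-2 positions."""
    return {p + (q[-1],) for p in large_prev for q in large_prev if p[:-1] == q[:-1]}


def prune(candidates, large_prev):
    """Delete candidates that have a (k-1)-subsequence which is not large."""
    return {c for c in candidates
            if all(s in large_prev for s in combinations(c, len(c) - 1))}


large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined = join(large_3)
print(len(joined))             # 9
print(prune(joined, large_3))  # {(1, 2, 3, 4)}
```

For instance, <1 2 4 3> is produced by the join but deleted by the prune step because <2 4 3> is not large.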

Example: Transformed Customer Sequences

<{1 5}{2}{3}{4}>
<{1}{3}{4}{3 5}>
<{1}{2}{3}{4}>
<{1}{3}{5}>
<{4}{5}>

Next step: find the large 1-sequences, with minimum support set to 25%

Example: Large 1-Sequences

Sequence Support
<1> 4
<2> 2
<3> 4
<4> 4
<5> 4

Next step: find the large 2-sequences

Example: Large 2-Sequences

Sequence Support
<1 2> 2
<1 3> 4
<1 4> 3
<1 5> 2
<2 3> 2
<2 4> 2
<3 4> 3
<3 5> 2
<4 5> 2

Next step: find the large 3-sequences

Example: Large 3-Sequences

Sequence Support
<1 2 3> 2
<1 2 4> 2
<1 3 4> 3
<1 3 5> 2
<2 3 4> 2

Next step: find the large 4-sequences

Example: Large 4-Sequences

Sequence Support
<1 2 3 4> 2

Next step: find the maximal sequential patterns

Maximal phase

Find the maximal sequences among the set of large sequences.
In some algorithms, this phase is combined with the sequence phase.

Maximal phase

Algorithm:
S: the set of all large sequences
n: the length of the longest sequence

for (k = n; k > 1; k--) do
    foreach k-sequence sk do
        delete from S all subsequences of sk

Example

Large 1-sequences
Sequence Support
<1> 4
<2> 2
<3> 4
<4> 4
<5> 4

Large 2-sequences
Sequence Support
<1 2> 2
<1 3> 4
<1 4> 3
<1 5> 2
<2 3> 2
<2 4> 2
<3 4> 3
<3 5> 2
<4 5> 2

Large 3-sequences
Sequence Support
<1 2 3> 2
<1 2 4> 2
<1 3 4> 3
<1 3 5> 2
<2 3 4> 2

Large 4-sequences
Sequence Support
<1 2 3 4> 2

Find the maximal large sequences
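One way to carry out the maximal phase on this example, sketched in Python. Sequences are assumed to be tuples of opaque litemset ids, so containment is plain subsequence matching on the ids; `is_subsequence` and `maximal` are illustrative names.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting entries."""
    it = iter(long)
    return all(x in it for x in short)


def maximal(large_sequences):
    """Keep only the sequences that are not contained in another large sequence."""
    remaining = set(large_sequences)
    # Process the longest sequences first, as in the k = n down to 2 loop.
    for s in sorted(large_sequences, key=len, reverse=True):
        if s in remaining:
            remaining -= {t for t in remaining if t != s and is_subsequence(t, s)}
    return remaining


large = [
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
]
print(sorted(maximal(large)))  # [(1, 2, 3, 4), (1, 3, 5), (4, 5)]
```

The maximal large sequences for the example are therefore <1 2 3 4>, <1 3 5> and <4 5>.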

Examples of Sequence Data

Sequence Database | Sequence | Element (Transaction) | Event (Item)
Customer | Purchase history of a given customer | A set of items bought by a customer at time t | Books, dairy products, CDs, etc.
Web data | Browsing activity of a particular Web visitor | A collection of files viewed by a Web visitor after a single mouse click | Home page, index page, contact info, etc.
Event data | History of events generated by a given sensor | Events triggered by a sensor at time t | Types of alarms generated by sensors
Genome sequences | DNA sequence of a particular species | An element of the DNA sequence | Bases A, T, G, C

Examples of Sequence

Web sequence:

< {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >

Sequence of initiating events causing the nuclear accident at Three Mile Island: (http://stellar-one.com/nuclear/staff_reports/summary_SOE_the_initiating_event.htm)

< {clogged resin} {outlet valve closure} {loss of feedwater} {condenser polisher outlet valve shut} {booster pumps trip} {main water pump trips} {main turbine trips} {reactor pressure increases} >

Sequence of books checked out at a library: < {Fellowship of the Ring} {The Two Towers} {Return of the King} >

GSP algorithm

Candidate generation

Consists of two phases: a join phase and a prune phase

Join phase: Ck = Fk-1 x Fk-1

Two sequences s1 and s2 in Fk-1 can be joined if the subsequence obtained by dropping the first item of s1 is the same as the subsequence obtained by dropping the last item of s2.

The resulting sequence is s1 extended by the last item of s2. The added item becomes a separate element if it was a separate element in s2, and part of the last element of s1 otherwise.

Candidate Generation Examples

Merging the sequences w1=<{1} {2 3} {4}> and w2 =<{2 3} {4 5}> will produce the candidate sequence < {1} {2 3} {4 5}> because the last two events in w2 (4 and 5) belong to the same element

Merging the sequences w1=<{1} {2 3} {4}> and w2 =<{2 3} {4} {5}> will produce the candidate sequence < {1} {2 3} {4} {5}> because the last two events in w2 (4 and 5) do not belong to the same element

We do not have to merge the sequences w1 =<{1} {2 6} {4}> and w2 =<{1} {2} {4 5}> to produce the candidate < {1} {2 6} {4 5}> because if the latter is a viable candidate, then it can be obtained by merging w1 with < {2 6} {4 5}>
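The join rule can be sketched in Python for candidates of length 3 or more (joining two 1-sequences is a special case that this sketch does not cover). A sequence is assumed to be a tuple of elements, each element a sorted tuple of items; `drop_first`, `drop_last` and `gsp_join` are illustrative names.

```python
def drop_first(seq):
    """Remove the first item of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last item of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def gsp_join(s1, s2):
    """Return the candidate obtained by joining s1 with s2, or None if they do not join."""
    if drop_first(s1) != drop_last(s2):
        return None
    last_item = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The last item was a separate element in s2: append it as a new element.
        return s1 + ((last_item,),)
    # Otherwise the item becomes part of the last element of s1.
    return s1[:-1] + (s1[-1] + (last_item,),)


w1 = ((1,), (2, 3), (4,))
print(gsp_join(w1, ((2, 3), (4, 5))))     # ((1,), (2, 3), (4, 5))
print(gsp_join(w1, ((2, 3), (4,), (5,))))  # ((1,), (2, 3), (4,), (5,))
print(gsp_join(((1,), (2, 6), (4,)), ((1,), (2,), (4, 5))))  # None
```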

Prune phase: Delete candidate sequences that have an infrequent (k-1)-subsequence.

GSP Example

Frequent 3-sequences:
< {1} {2} {3} >
< {1} {2 5} >
< {1} {5} {3} >
< {2} {3} {4} >
< {2 5} {3} >
< {3} {4} {5} >
< {5} {3 4} >

Candidate generation:
< {1} {2} {3} {4} >
< {1} {2 5} {3} >
< {1} {5} {3 4} >
< {2} {3} {4} {5} >
< {2 5} {3 4} >

Candidate pruning:
< {1} {2 5} {3} >
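The pruning step of this example can be checked with a short Python sketch. It assumes no timing constraints, so every subsequence obtained by deleting one item counts as a (k-1)-subsequence; the tuple-of-tuples representation and the names `drop_one_item` and `gsp_prune` are illustrative.

```python
def drop_one_item(seq):
    """Yield every subsequence obtained by deleting exactly one item from `seq`."""
    for i, element in enumerate(seq):
        for j in range(len(element)):
            rest = element[:j] + element[j + 1:]
            yield seq[:i] + ((rest,) if rest else ()) + seq[i + 1:]


def gsp_prune(candidates, frequent_prev):
    """Keep the candidates whose (k-1)-subsequences are all frequent."""
    return [c for c in candidates
            if all(s in frequent_prev for s in drop_one_item(c))]


frequent_3 = {
    ((1,), (2,), (3,)),
    ((1,), (2, 5)),
    ((1,), (5,), (3,)),
    ((2,), (3,), (4,)),
    ((2, 5), (3,)),
    ((3,), (4,), (5,)),
    ((5,), (3, 4)),
}
candidates = [
    ((1,), (2,), (3,), (4,)),
    ((1,), (2, 5), (3,)),
    ((1,), (5,), (3, 4)),
    ((2,), (3,), (4,), (5,)),
    ((2, 5), (3, 4)),
]
print(gsp_prune(candidates, frequent_3))  # [((1,), (2, 5), (3,))]
```

For example, < {1} {2} {3} {4} > is pruned because its subsequence < {1} {3} {4} > is not among the frequent 3-sequences.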

Database Example

The mining result