data e web mining 825368 paolo gobbo


Post on 10-Jan-2016


Page 1: Data e Web Mining  825368 Paolo Gobbo

Data e Web Mining

825368 Paolo Gobbo

Smart Miner: A New Framework for Mining Large Scale Web Usage Data

Bayir – Toroslu – Cosar - Fidan

Page 2: Data e Web Mining  825368 Paolo Gobbo

Data Mining on Web

Web mining: discover and retrieve useful and interesting patterns from large web datasets.

• web content mining: text and multimedia documents (the real data in web pages)
• web structure mining: hyperlink structure (data that describes the organization of the content)
• web usage mining: web log records (data that describes the pattern of usage of web pages)

Page 3: Data e Web Mining  825368 Paolo Gobbo

PreProcessing

[Figure: preprocessing pipeline. INPUT: Site File, Access Log, Referrer Log, Agent Log, Registration. PREPROCESSING: Site Crawler, Data Cleaning, Path Completion, Session Identification, User Identification, Transaction Identification. Files produced: Site Topology, User Session File, Transaction File. Also shown: SQL Query.]

Page 4: Data e Web Mining  825368 Paolo Gobbo

Session Identification

Session identification: partitioning each user's activities into sequences (sessions) of entries from the web request logs.

• time oriented heuristics: temporal boundaries
    - session length: $T(P_n) - T(P_1) \le$ maximum session length
    - page-stay: $\forall i: 1 \le i < n,\ T(P_{i+1}) - T(P_i) \le \sigma$ ($\sigma$: maximum page-stay time)
• navigation oriented heuristics: link between web pages
    - $Link(P_i, P_j) = true$ for some $i < j$
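The time oriented heuristics can be illustrated with a minimal sketch. The function name `split_sessions`, its parameters, and the sample log are hypothetical and not taken from the slides; it assumes one user's requests are already sorted by timestamp and that both thresholds use the same time unit as the timestamps.

```python
def split_sessions(requests, max_page_stay, max_session_length):
    """Split one user's (page, timestamp) requests into sessions.

    A new session starts when the gap to the previous request exceeds
    max_page_stay, or when the time elapsed since the first request of the
    current session exceeds max_session_length.
    """
    sessions = []
    current = []
    for page, ts in requests:
        if current:
            gap = ts - current[-1][1]        # page-stay of the previous page
            duration = ts - current[0][1]    # session length so far
            if gap > max_page_stay or duration > max_session_length:
                sessions.append(current)
                current = []
        current.append((page, ts))
    if current:
        sessions.append(current)
    return sessions


# Hypothetical log: the 23-unit gap before "D" closes the first session.
log = [("A", 0), ("B", 4), ("C", 7), ("D", 30), ("E", 33)]
sessions = split_sessions(log, max_page_stay=10, max_session_length=30)
```

With these values the log is split into two sessions, A-B-C and D-E.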

Page 5: Data e Web Mining  825368 Paolo Gobbo

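Sequential Mining

Sequential mining: association mining that takes the order of transactions into account.

• items: $I = \{i_1, i_2, \dots, i_m\}$
• itemset/element: $X = \{x_1, x_2, \dots, x_k\}$, with $X \ne \emptyset$ and $x_i \in I$
• sequence: $s = \langle a_1, a_2, \dots, a_r \rangle$, where each $a_i$ is an itemset
• sequence size: number of itemsets/elements
• sequence length: number of items
• subsequence: $s_1 = \langle a_1, a_2, \dots, a_n \rangle$ is a subsequence of $s_2 = \langle b_1, b_2, \dots, b_m \rangle$ if there exist integers $i_1 < i_2 < \dots < i_n$ such that $a_1 \subseteq b_{i_1},\ a_2 \subseteq b_{i_2},\ \dots,\ a_n \subseteq b_{i_n}$

Goal: given a set of data sequences, find all sequences with a user-specified minimum support.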
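The subsequence relation and the notion of support can be sketched as follows. The function names are hypothetical; sequences are modelled as lists of Python sets, and the small database is illustrative.

```python
def is_subsequence(s1, s2):
    """True if each itemset of s1 is contained in an itemset of s2,
    with the matching itemsets of s2 appearing in the same order."""
    pos = 0
    for a in s1:
        # Greedily look for the first remaining itemset of s2 that contains a.
        while pos < len(s2) and not set(a) <= set(s2[pos]):
            pos += 1
        if pos == len(s2):
            return False
        pos += 1
    return True


def support(pattern, sequences):
    """Fraction of data sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)


# Illustrative data sequences (one per customer).
database = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]
```

For instance, `support([{30}, {90}], database)` counts the first and fourth sequences, while `[{3}, {5}]` is not a subsequence of `[{3, 5}]` because the two items must come from different itemsets.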

Page 6: Data e Web Mining  825368 Paolo Gobbo

Sequential Mining algorithms

GSP, APrioriAll, APrioriSome

• Sort Phase: transforms customer transactions into customer sequences
• Large Itemset Phase: generates the set of large itemsets
• Transformation Phase: represents customer sequences based on the large itemsets
• Sequence Phase: derives large k-sequences from large (k-1)-sequences
• Maximal Phase: prunes non-maximal sequences

Page 7: Data e Web Mining  825368 Paolo Gobbo

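Smart-SRA session

• Smart-SRA session set: $S = \{S_1, S_2, \dots, S_m\}$
• each session is a path in the web site: $S_x = [P_1, P_2, \dots, P_k, P_{k+1}, \dots, P_n]$

Rules:

• timestamp ordering (time oriented) rule (session):
    $\forall i: 1 \le i < n,\ T(P_i) < T(P_{i+1})$
    $\forall i: 1 \le i < n,\ T(P_{i+1}) - T(P_i) \le \sigma$ ($\sigma$: maximum page-stay time)
    $T(P_n) - T(P_1) \le$ maximum session length
• topology (navigation oriented) rule (path in the web site):
    $\forall i: 1 \le i < n,\ Link(P_i, P_{i+1}) = true$
• maximality rule:
    $\forall S_x \in S,\ \nexists S_y \in S,\ S_y \ne S_x : S_x \subset S_y$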
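The three rules can be checked mechanically. The following is a sketch only: the function names and the `LINK` set are assumptions (the edges are inferred from the sessions shown in the example slides), and the maximality check treats "contained in" as an order-preserving subsequence.

```python
# Hypothetical link set, inferred from the example sessions in these slides.
LINK = {("P1", "P13"), ("P1", "P20"), ("P13", "P34"), ("P13", "P49"),
        ("P34", "P23"), ("P49", "P23"), ("P20", "P23")}


def satisfies_rules(session, link, sigma, max_length):
    """Check the timestamp ordering and topology rules for one session,
    given as a list of (page, timestamp) pairs."""
    for (p, t), (q, u) in zip(session, session[1:]):
        if not t < u:             # timestamps must strictly increase
            return False
        if u - t > sigma:         # page-stay bound
            return False
        if (p, q) not in link:    # consecutive pages must be linked
            return False
    return session[-1][1] - session[0][1] <= max_length


def is_subseq(a, b):
    """True if list a is an order-preserving subsequence of list b."""
    it = iter(b)
    return all(x in it for x in a)


def maximal_only(sessions):
    """Keep only the sessions that are not contained in another session."""
    return [s for s in sessions
            if not any(s != t and is_subseq(s, t) for t in sessions)]
```

As a usage example, the path P1, P13, P34, P23 with timestamps 0, 9, 14, 15 passes with `sigma=10`, while P1, P20, P13 fails the topology rule because the assumed graph has no link from P20 to P13.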

Page 8: Data e Web Mining  825368 Paolo Gobbo

Smart Miner

[Figure: the Smart Miner framework]

DATA STREAM → SMART-SRA SESSION CONSTRUCTION (Candidate Session → Smart Session) → SEQUENTIAL MINING (Sequential AprioriAll) → FREQUENT ACCESS PATTERN

Page 9: Data e Web Mining  825368 Paolo Gobbo

Smart Miner: First Phase Smart SRA

Candidate session construction

• time oriented heuristics
• session length
• page-stay
• no backward movement

[Figure: web site graph over the pages P1, P13, P20, P49, P34, P23]

Candidate session:

Page        P1    P20   P13   P49   P34   P23
TimeStamp   0     6     9     12    14    15

Page        P13   P20   P23   P49
TimeStamp   0     5     9     10

Page 10: Data e Web Mining  825368 Paolo Gobbo

Smart Miner: Second Phase Smart SRA

Smart session construction

• time oriented heuristics
• inherited session length
• re-check page-stay
• no backward movement
• maximality
• topology rule

[Figure: web site graph over the pages P1, P13, P20, P49, P34, P23]

Candidate session:

Page        P1    P20   P13   P49   P34   P23
TimeStamp   0     6     9     12    14    15

Smart sessions:

[P1, P13, P34, P23]
[P1, P13, P49, P23]
[P1, P20, P23]

Page 11: Data e Web Mining  825368 Paolo Gobbo

Smart Miner: Second Phase Smart SRA

SMART SESSION RECONSTRUCTION

foreach CandSession in CandSessionSet
    NewSessionSet = {}
    while CandSession ≠ Ø
        TSessionSet = {}; TPageSet = {}
        // select the pages with no incoming link
        foreach Pagei in CandSession
            StartPageFlag = TRUE
            foreach Pagej in CandSession with j < i
                if (Link[Pagej,Pagei] and TimeDiff(Pagei,Pagej) ≤ σ) then
                    StartPageFlag = FALSE
            endfor
            if StartPageFlag then TPageSet = TPageSet U {Pagei}
        endfor
        // remove the selected pages from the candidate session
        CandSession = CandSession − TPageSet
        if NewSessionSet = {} then
            // session set construction
            foreach Pagei in TPageSet
                TSessionSet = TSessionSet U {[Pagei]}
        else
            // session set extension
            foreach Pagei in TPageSet
                foreach Sessionj in NewSessionSet
                    if (Link[Last(Sessionj),Pagei] and TimeDiff(Last(Sessionj),Pagei) ≤ σ) then
                        TSession = Sessionj
                        TSession.mark = UNEXTENDED
                        TSession = TSession • Pagei
                        TSessionSet = TSessionSet U {TSession}
                        Sessionj.mark = EXTENDED
                    endif
                endfor
            endfor
        endif
        // session set extension with the sessions that were not extended
        foreach Sessionj in NewSessionSet
            if Sessionj.mark ≠ EXTENDED then TSessionSet = TSessionSet U {Sessionj}
        endfor
        NewSessionSet = TSessionSet
    endwhile
endfor

Page 12: Data e Web Mining  825368 Paolo Gobbo

Session Construction Example

Iteration   CandidateSession                    TPageSet       NewSessionSet
1           [ P1, P20, P13, P49, P34, P23 ]     { P1 }         [ P1 ]
2           [ P20, P13, P49, P34, P23 ]         { P20, P13 }   [ P1, P20 ] [ P1, P13 ]
3           [ P49, P34, P23 ]                   { P49, P34 }   [ P1, P13, P34 ] [ P1, P13, P49 ] [ P1, P20 ]
4           [ P23 ]                             { P23 }        [ P1, P13, P34, P23 ] [ P1, P13, P49, P23 ] [ P1, P20, P23 ]

[Figure: web site graph over the pages P1, P13, P20, P49, P34, P23]
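The iterations of this example can be reproduced with a short sketch of the smart session reconstruction. Everything named here is an assumption made for illustration: the function `smart_sra`, the `LINK` set (edges inferred from the three resulting sessions, since the graph is only shown as a figure), and the page-stay threshold `sigma=10`.

```python
# Hypothetical link set inferred from the sessions in the example.
LINK = {("P1", "P13"), ("P1", "P20"), ("P13", "P34"), ("P13", "P49"),
        ("P34", "P23"), ("P49", "P23"), ("P20", "P23")}


def smart_sra(candidate, link, sigma):
    """candidate: list of (page, timestamp) sorted by timestamp."""
    remaining = list(candidate)
    sessions = []                      # each session is a list of (page, ts)
    while remaining:
        # Pages with no earlier referrer left in the candidate session.
        starts = [
            (p, t) for i, (p, t) in enumerate(remaining)
            if not any((q, p) in link and t - u <= sigma
                       for q, u in remaining[:i])
        ]
        remaining = [x for x in remaining if x not in starts]
        if not sessions:
            sessions = [[s] for s in starts]   # session set construction
            continue
        extended, new_sessions = set(), []
        for p, t in starts:                    # session set extension
            for idx, sess in enumerate(sessions):
                q, u = sess[-1]
                if (q, p) in link and t - u <= sigma:
                    new_sessions.append(sess + [(p, t)])
                    extended.add(idx)
        # Sessions that could not be extended are carried over unchanged.
        new_sessions += [s for i, s in enumerate(sessions) if i not in extended]
        sessions = new_sessions
    return [[p for p, _ in s] for s in sessions]


candidate = [("P1", 0), ("P20", 6), ("P13", 9),
             ("P49", 12), ("P34", 14), ("P23", 15)]
smart_sessions = smart_sra(candidate, LINK, sigma=10)
```

Under these assumptions the call yields the three sessions of iteration 4, possibly in a different order.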

Page 13: Data e Web Mining  825368 Paolo Gobbo

Sequential APrioriAll

Pruning

• topological constraint: for every consecutive pair of pages in a sequence, the former must have a hyperlink to the latter
• string matching constraint: a session S supports a pattern P if and only if P is a subsequence of S that does not violate string matching, i.e. P occurs in S as a contiguous substring
    - <1,2,3> supports <1,2>
    - <1,2,3> does not support <1,3>
• pruning is applied during candidate sequence generation, before the support of the candidates is calculated

Page 14: Data e Web Mining  825368 Paolo Gobbo

Support

$$Support(I, S) = \frac{|\{S_i \mid S_i \in S,\ I \text{ is a substring of } S_i\}|}{|S|}$$

• I : pattern
• S : user reconstructed sessions
• computed in one scan through the transaction database, by keeping the candidates in a hash map
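A minimal sketch of this support computation, assuming patterns are tuples of page identifiers and sessions are lists. The function name `count_supports` is hypothetical; a `Counter` plays the role of the hash map, and each session is scanned once.

```python
from collections import Counter


def count_supports(candidates, sessions):
    """Return {pattern: support} for a set of candidate patterns (tuples).

    A session supports a pattern only if the pattern occurs in it as a
    contiguous run of pages (substring), and it is counted at most once.
    """
    lengths = {len(c) for c in candidates}
    counts = Counter()                       # hash map: pattern -> sessions
    for session in sessions:
        seen = set()
        for n in lengths:
            for i in range(len(session) - n + 1):
                window = tuple(session[i:i + n])
                if window in candidates:
                    seen.add(window)
        counts.update(seen)
    return {c: counts[c] / len(sessions) for c in candidates}


supports = count_supports({(1, 2), (1, 3)}, [[1, 2, 3], [2, 3], [1, 2]])
```

Here `(1, 2)` is supported by two of the three sessions, while `(1, 3)` is supported by none, since 1 and 3 are never adjacent.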

Page 15: Data e Web Mining  825368 Paolo Gobbo

Sequential Apriori Algorithm

SEQUENTIAL APRIORI

INPUT:  minimum support frequency : δ
        reconstructed sessions : S
        topology information : Link
        set of all web pages : P
OUTPUT: set of maximal frequent patterns : Max

// length-1 candidate pattern generation
L1 = {}
for i = 1 to |P| do
    if Support([Pi],S) > δ then L1 = L1 U {[Pi]}
for k = 1 to N-1 do
    if Lk = Ø then
        Halt                                  // no further generation
    else
        // length-k+1 candidate pattern generation
        Lk+1 = {}
        foreach Ii in Lk
            foreach Pj in P
                if Link[Last(Ii),Pj] then     // pruning step: topological rule
                    T = Ii • Pj               // joining step: append page
                    if Support(T,S) > δ then  // support rule
                        // maximality rule
                        T.maximal = true
                        Ii.maximal = false
                        V = [T2,T3,…,T|T|]
                        if V in Lk then V.maximal = false
                        Lk+1 = Lk+1 U {T}
                    endif
                endif
            endfor
        endfor
    endif
endfor
// union of the sets of maximal patterns
Max = {}
for k = 1 to N-1 do
    Max = Max U {S | S in Lk and S.maximal = true}
endfor
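The algorithm can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: patterns are tuples, `link` is a set of (source, destination) pairs, the loop runs until a level is empty instead of using a bound N, and `min_support` is assumed to be non-negative.

```python
def support(pattern, sessions):
    """Fraction of sessions containing pattern as a contiguous run of pages."""
    n = len(pattern)
    hits = sum(
        any(tuple(s[i:i + n]) == pattern for i in range(len(s) - n + 1))
        for s in sessions
    )
    return hits / len(sessions)


def sequential_apriori(sessions, link, pages, min_support):
    # maximal[p] stays True while no frequent extension of p has been found.
    maximal = {(p,): True for p in pages
               if support((p,), sessions) > min_support}
    level = set(maximal)
    while level:
        next_level = set()
        for pattern in level:
            for page in pages:
                if (pattern[-1], page) not in link:    # topological rule
                    continue
                candidate = pattern + (page,)          # joining step
                if support(candidate, sessions) > min_support:
                    maximal[candidate] = True
                    maximal[pattern] = False           # prefix is not maximal
                    if candidate[1:] in maximal:
                        maximal[candidate[1:]] = False # suffix is not maximal
                    next_level.add(candidate)
        level = next_level
    return {p for p, is_max in maximal.items() if is_max}


LINK = {("P1", "P13"), ("P1", "P20"), ("P13", "P34"), ("P13", "P49"),
        ("P34", "P23"), ("P49", "P23"), ("P20", "P23")}   # assumed graph
SESSIONS = [["P1", "P13", "P34", "P23"],
            ["P1", "P13", "P49", "P23"],
            ["P1", "P20", "P23"]]
PAGES = ["P1", "P13", "P20", "P34", "P49", "P23"]
result = sequential_apriori(SESSIONS, LINK, PAGES, min_support=0.5)
```

With a minimum support of 0.5 only the pattern P1, P13 and the single page P23 remain maximal; P1 and P13 are frequent but are covered by the longer pattern.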

Page 16: Data e Web Mining  825368 Paolo Gobbo

Accuracy Metric

• $MP_A$ : frequent maximal patterns of the agent simulator
• $MP_H$ : frequent maximal patterns of the heuristic

recall: $REC_H = \frac{|MP_H \cap MP_A|}{|MP_A|}$

precision: $PRE_H = \frac{|MP_H \cap MP_A|}{|MP_H|}$

accuracy: $A_H = REC_H * PRE_H$
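As a worked example of the three metrics, here is a small sketch. The function name and the pattern sets are made up for illustration; an empty denominator is treated as 0, which is a choice of this sketch rather than something stated in the slides.

```python
def accuracy_metrics(mp_heuristic, mp_agent):
    """Return (recall, precision, accuracy) for two sets of maximal patterns."""
    common = len(mp_heuristic & mp_agent)
    recall = common / len(mp_agent) if mp_agent else 0.0
    precision = common / len(mp_heuristic) if mp_heuristic else 0.0
    return recall, precision, recall * precision


# Hypothetical pattern sets: two patterns are shared.
mp_h = {(1, 2), (2, 3), (4,)}
mp_a = {(1, 2), (2, 3), (3, 4), (5,)}
recall, precision, accuracy = accuracy_metrics(mp_h, mp_a)
```

Here recall is 2/4, precision is 2/3, and accuracy is their product, 1/3.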

Page 17: Data e Web Mining  825368 Paolo Gobbo

Agent Simulator

Agent Simulator Parameters

• STP (Session Termination Probability): probability of terminating the session
• LPP (Link from Previous page Probability): probability of referring the next page from one of the previously accessed pages, except the most recently accessed one
• LPC (Link from Current page Probability): probability of referring the next page from the most recently visited page
• NIP (New Initial page Probability): probability of selecting one of the starting pages of the web site during the navigation

Page 18: Data e Web Mining  825368 Paolo Gobbo

Simulated Data

Web topology

• number of web pages: from 10 to 1000
• number of users: from 1000 to 10000

Agent simulator parameters

• NIP/STP: 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
• LPC/LPP: 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0
• 49 different cases

Support parameter

• values: 0.001, 0.0025, 0.005, 0.0075, 0.01

Runs of agent simulator

• 10 different random runs

Page 19: Data e Web Mining  825368 Paolo Gobbo

Results on Simulated Data

[Figure: results on simulated data. Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA. NIP: New Initial Page Probability. STP: Session Termination Probability.]

Page 20: Data e Web Mining  825368 Paolo Gobbo

Results on Simulated Data

[Figure: results on simulated data. Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA.]

Page 21: Data e Web Mining  825368 Paolo Gobbo

Real Data

AGMLAB's company web site

• 4 months of user activity
• 3801 users
• 30-minute session time-out
• 10 web pages
• densely connected link graph

User Activity

• action tracking program
• cookies
• cookie information recorded to a server log file

Page 22: Data e Web Mining  825368 Paolo Gobbo

Results on Real Data

[Figure: results on real data. Legend: NO = navigation oriented, TO = time oriented, SSRA = Smart SRA.]

Page 23: Data e Web Mining  825368 Paolo Gobbo

Scalability

[Figures: Performance on 100 GB Data; Performance with 50 nodes]

MAP/REDUCE paradigm: each node processes a block of the session database, computing the local frequency of each candidate pattern.
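The map/reduce idea can be imitated on a single machine. In this sketch each "node" is just a function call on one block of sessions; the function names and data are hypothetical, and a real deployment would distribute the map step across machines.

```python
from collections import Counter
from functools import reduce


def map_block(block, candidates):
    """Map step: local frequency of each candidate pattern in one block."""
    local = Counter()
    for session in block:
        for c in candidates:
            n = len(c)
            if any(tuple(session[i:i + n]) == c
                   for i in range(len(session) - n + 1)):
                local[c] += 1
    return local


def reduce_counts(a, b):
    """Reduce step: add up the local frequencies of two blocks."""
    return a + b


blocks = [
    [[1, 2, 3], [2, 3]],      # block handled by a first node
    [[1, 2], [3, 1]],         # block handled by a second node
]
candidates = {(1, 2), (2, 3)}
total = reduce(reduce_counts, (map_block(b, candidates) for b in blocks), Counter())
```

Dividing each global count by the total number of sessions then gives the support of the candidate.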

Page 24: Data e Web Mining  825368 Paolo Gobbo

Web sources / Bibliography

M.A. Bayir, I.H. Toroslu, A. Cosar, G. Fidan, Smart Miner: A New Framework for Mining Large Scale Web Usage Data, 2009

R. Cooley, B. Mobasher, J. Srivastava, Data Preparation for Mining World Wide Web, 1999

J. Srivastava, R. Cooley, M. Deshpande, P.N. Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, 2000

M.G. Da Costa Jr., Z. Gong, Web Structure Mining: An Introduction, 2005

J.J. Jung, Semantic PreProcessing of Web Request Streams for Web Usage Mining, 2005

R. Agrawal, R. Srikant, Mining Sequential Patterns, 1995

Page 25: Data e Web Mining  825368 Paolo Gobbo

GSP

GSP – GENERALIZED SEQUENTIAL PATTERN

C1 = Init_Pass
L1 = {<{f}> | f in C1, with minimum support}
for (k=2; Lk-1 ≠ Ø; k++) do begin
    Ck = Candidate-gen-SPM(Lk-1)
    foreach sequence s in the database D do
        foreach candidate c in Ck
            if (c in s) then update the count of candidate c
    Lk = candidates c in Ck with minimum support
end
result = Uk(Lk)

CANDIDATE-GEN-SPM

// join step
foreach p in Lk-1
    foreach q in Lk-1
        if (∀n: 2 ≤ n ≤ k-1, pn = qn-1)
            then Ck = Ck U {<p1,…,pk-1,qk-1>}
// prune step
foreach s in Ck
    if exists(r | r ⊂ s ˄ r ∉ Lk-1)
        then Ck = Ck - {s}

Page 26: Data e Web Mining  825368 Paolo Gobbo

GSP Example

L3-sequences:

<{1,2},{4}>
<{1,2},{5}>
<{1},{4,5}>
<{1,4},{6}>
<{2},{4,5}>
<{2},{4},{6}>

Candidate 4-sequences (join step):

<{1,2},{4,5}>
<{1,2},{4},{6}>

Candidate 4-sequences (prune step):

<{1,2},{4,5}>

<{1,2},{4},{6}> is pruned because its subsequence <{1},{4},{6}> is not in L3.

Page 27: Data e Web Mining  825368 Paolo Gobbo

APrioriAll

APRIORIALL

L1 = {large 1-sequences}
for (k=2; Lk-1 ≠ Ø; k++) do begin
    Ck = Apriori-generate(Lk-1)
    foreach sequence c in the database D do
        update the candidates in Ck that are contained in c
    Lk = candidates in Ck with minimum support
end
result = maximal sequences in Uk(Lk)

APRIORI-GENERATE

// join step
foreach p in Lk-1
    foreach q in Lk-1
        if (p.x1=q.x1) ˄ (p.x2=q.x2) ˄ … ˄ (p.xk-2=q.xk-2)
            then Ck = Ck U {<p.x1,…,p.xk-1,q.xk-1>}
// prune step
foreach s in Ck
    if exists(r | r ⊂ s ˄ r ∉ Lk-1)
        then Ck = Ck - {s}

Page 28: Data e Web Mining  825368 Paolo Gobbo

APrioriAll Example

L3-sequences:

<1,2,3>
<1,2,4>
<1,3,4>
<1,3,5>
<2,3,4>

Candidate 4-sequences (join step):

<1,2,3,4>
<1,2,4,3>
<1,3,4,5>
<1,3,5,4>

Candidate 4-sequences (prune step):

<1,2,3,4>
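The join and prune steps of this example can be reproduced with a short sketch. Sequences are modelled as tuples and the function names are hypothetical. The join below skips the case p = q, which is an assumption made so that the output matches the four candidates listed in the example.

```python
def join_step(large):
    """Join pairs of distinct (k-1)-sequences that share their first k-2 items."""
    return {p + (q[-1],)
            for p in large for q in large
            if p != q and p[:-1] == q[:-1]}


def prune_step(candidates, large):
    """Drop candidates that have a (k-1)-subsequence outside of large."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in large for i in range(len(c)))}


L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
joined = join_step(L3)
pruned = prune_step(joined, L3)
```

Only <1,2,3,4> survives, because every 3-item subsequence obtained by dropping one of its items is in L3; for example <1,2,4,3> is dropped since <2,4,3> is not.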

Page 29: Data e Web Mining  825368 Paolo Gobbo

APrioriSome

APRIORISOME

// Forward Phase
L1 = {large 1-sequences}; C1 = L1; last = 1;
for (k=2; Ck-1 ≠ Ø; k++) do begin
    if (Lk-1 known) then Ck = Apriori-generate(Lk-1)
    else Ck = Apriori-generate(Ck-1)
    if (k = next(last)) then begin
        foreach sequence c in the database D do
            update the candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support; last = k
    end
end
// Backward Phase
for (k--; k>=1; k--) do begin
    if (Lk not found) then begin
        delete all sequences in Ck contained in some Li, i>k
        foreach sequence c in the database D do
            update the candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support
    end
    else
        delete all sequences in Lk contained in some Li, i>k
end
result = maximal sequences in Uk(Lk)

Page 30: Data e Web Mining  825368 Paolo Gobbo

Sequential Mining Algorithm

Customer ID   Transaction Time   Items
1             June 25 '93        30
1             June 25 '93        90
2             June 10 '93        10, 20
2             June 15 '93        30
2             June 20 '93        40, 60, 70
3             June 25 '93        30, 50, 70
4             June 25 '93        30
4             June 30 '93        40, 70
4             July 25 '93        90
5             June 12 '93        90

Customer ID   Customer Sequence
1             <(30) (90)>
2             <(10 20) (30) (40 60 70)>
3             <(30 50 70)>
4             <(30) (40 70) (90)>
5             <(90)>

Large itemset   Mapped to
(30)            1
(40)            2
(70)            3
(40 70)         4
(90)            5

Customer ID   Customer Sequence
1             <{1} {5}>
2             <{1} {2, 3, 4}>
3             <{1, 3}>
4             <{1} {2, 3, 4} {5}>
5             <{5}>
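The transformation from customer sequences to mapped sequences can be sketched as follows. The function name `transform` and the data layout are assumptions: each transaction is replaced by the identifiers of all large itemsets it contains, and transactions that contain none are dropped.

```python
# Mapping of large itemsets to integers, as in the example.
LARGE = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(sequence, mapping):
    """Replace each transaction by the set of large itemsets it contains."""
    out = []
    for transaction in sequence:
        ids = {i for itemset, i in mapping.items()
               if itemset <= set(transaction)}
        if ids:                      # drop transactions with no large itemset
            out.append(ids)
    return out


customer_2 = [{10, 20}, {30}, {40, 60, 70}]
mapped_2 = transform(customer_2, LARGE)
```

For customer 2, the transaction (10 20) contains no large itemset and disappears, while (40 60 70) contains (40), (70) and (40 70), giving <{1} {2, 3, 4}>.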