![Page 1: Analysing Clickstream Data: From Anomaly Detection to Visitor Profiling Peter I. Hofgesang hpi@few.vu.nl Wojtek Kowalczyk wojtek@few.vu.nl ECML/PKDD Discovery](https://reader036.vdocuments.net/reader036/viewer/2022062515/56649f575503460f94c7c7c3/html5/thumbnails/1.jpg)
Analysing Clickstream Data: From Anomaly Detection to Visitor Profiling
Peter I. Hofgesang, hpi@few.vu.nl
Wojtek Kowalczyk, wojtek@few.vu.nl
ECML/PKDD Discovery Challenge, October 7, 2005, Porto, Portugal
Web server data
• 7 internet shops (home electronics)
• 80.000 visitors (IP-addresses) in 25 days
• 0.5 million sessions
• 3 million clicks (records in a log file)
• Example record:
11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;/znacka/?c=162&tisk=ano;http://www.google.com./search?hl=cs&q=Sennheiser+HD+650&btnG=Vyhledat+Googlem&lr=lang_cs
• Objective: discover interesting patterns!
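A minimal sketch of how such a record could be split into fields. The field names (`shop_id`, `timestamp`, `ip`, `session_id`, `page`, `referrer`) are assumptions inferred from the example record; the slides do not name the columns.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Click(NamedTuple):
    # Field names are assumed; the slides only show the raw record.
    shop_id: int
    timestamp: datetime
    ip: str
    session_id: str
    page: str
    referrer: str


def parse_record(line: str) -> Click:
    # maxsplit=5 keeps any ';' inside the last field (the referrer) intact.
    shop, ts, ip, sid, page, ref = line.rstrip("\n").split(";", 5)
    # The second field is treated as a Unix timestamp in seconds.
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return Click(int(shop), when, ip, sid, page, ref)


record = (
    "11;1076262912;193.170.198.122;eb5cbe50997fcb7f9155c6c194c832a8;"
    "/znacka/?c=162&tisk=ano;"
    "http://www.google.com./search?hl=cs&q=Sennheiser+HD+650"
    "&btnG=Vyhledat+Googlem&lr=lang_cs"
)
click = parse_record(record)
print(click.shop_id, click.timestamp.isoformat(), click.page)
```

A page URL that itself contains `;` would break this simple split, so a real loader would need a more careful rule.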
Data Mining Process
Figure: process diagram. Stages: INPUT DATA (web access log data, shop information), DATABASE, DATA PREPARATION / PREPROCESSING, SESSION IDENTIFICATION & DETECTION OF ANOMALIES, PROFILE MINING. Outputs shown: basic statistics, detection of anomalies, identified sessions (based on a new definition), mixture model (probability over content types), tree of profile sequences.
Anomalies/Strange things I
• Multiple IP-addresses per session
– 2 IP-addresses: 3.051 sessions
– 3 IP-addresses: 362 sessions
– 4 IP-addresses: 113 sessions
– ………………
– 22 IP-addresses: 1 session
– Some sessions involve IPs from different countries
• A few sessions (12) refer to multiple shops
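Counts like the ones above can be obtained by collecting the distinct values seen per session identifier. The following is a sketch on made-up toy data, not the authors' code; the helper names are hypothetical. The same helper works for shops if `(session_id, shop_id)` pairs are passed instead.

```python
from collections import Counter, defaultdict


def distinct_values_per_session(pairs):
    """pairs: iterable of (session_id, value), e.g. value = IP-address.

    Returns {session_id: number of distinct values seen in that session}.
    """
    seen = defaultdict(set)
    for session_id, value in pairs:
        seen[session_id].add(value)
    return {session_id: len(values) for session_id, values in seen.items()}


def histogram(per_session):
    """Maps 'number of distinct values' to 'number of sessions'."""
    return Counter(per_session.values())


# Toy data (invented for illustration): session s1 uses two IP-addresses.
toy = [
    ("s1", "1.1.1.1"),
    ("s1", "2.2.2.2"),
    ("s2", "1.1.1.1"),
    ("s3", "3.3.3.3"),
    ("s3", "3.3.3.3"),
]
per_session = distinct_values_per_session(toy)
suspicious = sorted(s for s, n in per_session.items() if n > 1)
print(histogram(per_session), suspicious)
```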
Anomalies/Strange things II
• Sessions with long duration
– 476 sessions longer than 24 hours (up to 18 days)
• Very intensive sessions
– 2.865 sessions with more than 100 visited pages
– 19 sessions with more than 1.000 visited pages
– 2 sessions with more than 10.000 visited pages
• Frequent IP-addresses with short sessions
– E.g.: 29.320 sessions in less than 20 hours from 147.229.205.80
• “Parallel sessions”
– Overlapping sequences of clicks from the same IP to the same shop within a short period with multiple SIDs (Opening a new window? Making a transaction?)
Anomalies/Strange things III
• Sequences of short sessions that together form one session
Example: clicks from 62.209.194.163 (31 Jan 04)
09:40:09 /dt/?c=13654;http://www.shop5.cz/
09:41:21 /dt/param.php?id=115;
09:41:21 /;
09:41:37 /ls/?id=20;http://www.shop5.cz/dt/?c=13654
09:41:42 /;
09:42:24 /ls/?&id=20&view=1,2,3,8&pozice=20;http://www.shop5.cz/ls/…
09:42:25 /;
09:42:48 /ls/?&id=20&view=1,2,3,8;http://www.shop5.cz/ls/?&id=20& …
09:42:48 /;
09:42:53 /ls/?&id=20&view=1,2,3,8&pozice=40;http://www.shop5.cz/ls/…
Each click has a different session identifier!
Fixing the data
• A new definition of “session”: a chronologically ordered sequence of “clicks” from the same IP-address to the same shop with no gaps longer than 30 minutes
• Sessions longer than 50 clicks were ignored (12.000)
• The number of sessions dropped from 522.410 to 281.153
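The new session definition can be sketched as follows: group clicks by (IP-address, shop), order them by time, start a new session whenever the gap exceeds 30 minutes, and discard sessions longer than 50 clicks. The tuple layout and function name are assumptions made for illustration, not taken from the slides.

```python
from itertools import groupby

GAP_SECONDS = 30 * 60   # "no gaps longer than 30 minutes"
MAX_CLICKS = 50         # sessions longer than this are ignored


def sessionize(clicks, gap=GAP_SECONDS, max_clicks=MAX_CLICKS):
    """clicks: iterable of (ip, shop_id, unix_time, page) tuples (assumed layout).

    Returns a list of sessions; each session is a list of clicks.
    """
    ordered = sorted(clicks, key=lambda c: (c[0], c[1], c[2]))
    sessions = []
    for _, group in groupby(ordered, key=lambda c: (c[0], c[1])):
        current = []
        for click in group:
            # A gap longer than the threshold closes the current session.
            if current and click[2] - current[-1][2] > gap:
                sessions.append(current)
                current = []
            current.append(click)
        sessions.append(current)
    # Drop very long sessions, as on the slide.
    return [s for s in sessions if len(s) <= max_clicks]


# Toy clicks (invented): the third click comes 1801 s after the second.
toy_clicks = [
    ("a", 1, 0, "/"),
    ("a", 1, 100, "/x"),
    ("a", 1, 1901, "/y"),
    ("a", 2, 50, "/"),
    ("b", 1, 10, "/"),
]
toy_sessions = sessionize(toy_clicks)
print([len(s) for s in toy_sessions])
```

Note that this definition ignores the original session identifier entirely, which is what lets it merge the one-click "sessions" from the previous slide.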
Old and New Sessions
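| Session length (clicks) | Count (old sessions) | Count (new sessions) |
|---|---|---|
| 1 | 318.523 | 65.258 |
| 2 | 24.762 | 31.821 |
| 3 | 17.353 | 18.828 |
| 4 | 15.351 | 16.332 |
| 5 | 15.361 | 15.509 |
| 6 | 13.455 | 13.448 |
| 7 | 10.958 | 10.883 |
| 8 | 9.045 | 9.095 |
| 9 | 7.939 | 8.070 |
| 10 | 7.028 | 7.091 |
| ... | ... | ... |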
Visitor Profiling
Motivation: On the internet each shop is just “one click away”. If a user is not satisfied with the service, he/she just goes to the next one and will likely never return.
Visitor Profiling Scheme
I. Clustering of user sessions
II. Analysis/interpretation of the clusters
III. Assign a cluster label to each session
IV. Analysis of the profile sequences
Clustering
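• Cadez et al. (2001) - predictive profiles from historical transaction data
• Mixture of multinomials:
$$p(y_{ij}) = \sum_{k=1}^{K} \alpha_k \prod_{c=1}^{C} \theta_{kc}^{\,n_{ijc}}$$
• Full data likelihood:
$$p(D \mid \Theta) = \prod_{i=1}^{N} p(D_i \mid \Theta)$$
• The unknown parameters $\{\theta_1, \ldots, \theta_K\}$ and $\{\alpha_1, \ldots, \alpha_K\}$ (collectively $\Theta$) are estimated by the expectation maximization (EM) algorithm.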
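A simplified EM sketch for a mixture of multinomials, under the assumption that every session is an independent count vector over C content types (the per-visitor structure of the full likelihood is ignored here). The function name, the Dirichlet initialisation and the smoothing constant `eps` are illustrative choices, not taken from the slides or from Cadez et al.

```python
import numpy as np


def fit_multinomial_mixture(counts, n_components, n_iter=100, seed=0, eps=1e-9):
    """counts: (N, C) array, counts[i, c] = clicks of session i on content type c.

    Returns mixing weights alpha (K,), component parameters theta (K, C)
    and the responsibilities (N, K) from the final E-step.
    """
    rng = np.random.default_rng(seed)
    n_sessions, n_categories = counts.shape
    alpha = np.full(n_components, 1.0 / n_components)
    theta = rng.dirichlet(np.ones(n_categories), size=n_components)
    for _ in range(n_iter):
        # E-step: log(alpha_k) + sum_c n_ic * log(theta_kc), then normalise.
        log_r = np.log(alpha + eps) + counts @ np.log(theta).T
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and (lightly smoothed) multinomials.
        alpha = resp.mean(axis=0)
        theta = resp.T @ counts + eps
        theta /= theta.sum(axis=1, keepdims=True)
    return alpha, theta, resp


# Toy sessions over three content types (invented data).
toy_counts = np.array(
    [[10, 0, 0], [9, 1, 0], [8, 0, 1], [0, 10, 0], [0, 9, 1], [1, 8, 0]],
    dtype=float,
)
alpha, theta, resp = fit_multinomial_mixture(toy_counts, n_components=2)
labels = resp.argmax(axis=1)  # step III of the scheme: one cluster label per session
print(alpha.round(3), labels)
```

EM only finds a local optimum, so in practice several random restarts (different `seed` values) and a comparison of the resulting likelihoods would be advisable.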
Interpretation of the clusters
Profile 1 General overview of the products
Profile 2 Focused search
Profile 3 Potential buyers
Profile 4 Parameter based search
The transitions of profiles
| From \ To | P1 | P2 | P3 | P4 |
|---|---|---|---|---|
| P1 | 0.7208 | 0.1592 | 0.0621 | 0.0579 |
| P2 | 0.5908 | 0.2828 | 0.0710 | 0.0553 |
| P3 | 0.5022 | 0.1616 | 0.2873 | 0.0489 |
| P4 | 0.6000 | 0.1702 | 0.0685 | 0.1613 |
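A matrix of this kind can be estimated by counting consecutive profile labels and normalising each row. The sketch below assumes that each visitor is represented by the sequence of profile labels of their sessions; the slides do not state how the sequences were built, and the function name is hypothetical.

```python
import numpy as np


def transition_matrix(sequences, n_profiles):
    """sequences: iterable of lists of profile indices (0-based), one list per visitor.

    Returns an (n_profiles, n_profiles) row-stochastic matrix; rows for
    profiles that never occur as a predecessor are left as zeros.
    """
    counts = np.zeros((n_profiles, n_profiles))
    for seq in sequences:
        # Count each pair of consecutive labels (previous -> next).
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev, nxt] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


# Toy label sequences (invented): three visitors, three profiles.
toy_sequences = [[0, 0, 1], [1, 0], [0, 2]]
P = transition_matrix(toy_sequences, n_profiles=3)
print(P.round(4))
```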
Tree of user profiles
Tree of potential buyers
Conclusion
• We spotted several anomalies, so background information about pre-processing and data preparation is important
• Important features were missing (who is a buyer?)
• Four clear user profiles