Copyright © 2000, SAS Institute Inc. All rights reserved.
WebMining
Gonçalo Abreu, Pedro Marques
Instituto Superior Técnico - LEIC 1999/2000
What is Data Mining?
A set of techniques and procedures for finding patterns and hidden information.
Several steps are involved:
• Data selection
• Preparation and purification
• Enrichment
• Visual analysis
• Analytical analysis
• Analysis and validation of the results
It’s an iterative process.
Data Mining on the Web
Analysis of a WWW site through the web server’s logs.
Log files contain, amongst other things:
• Requested URL
• A cookie with an ID
• Access time
• User’s IP address
• Referer
Our goal is to determine access patterns, problems with the site, and referer information.
Some problems with the logs
The first access to the site has no cookie with an ID; the ID only exists from the second access onwards (provided the user accepts the cookie).
There is no concept of a session.
Logs are in plain text with a lot of redundant information (e.g. GIF files).
Besides the user ID, we have no other information about the user.
The user can always turn off cookies (hard to detect).
Log purifying
Given the stated problems, we need to:
Correlate a user ID observed for the first time with the earlier cookie-less access, using the IP subnet to match the 1st and 2nd accesses.
Group each user’s accesses into sessions (sessions are separated by a given idle time).
Enrich the data with:
• Time between accesses
• Number of pages per session
• Total session time
Perform additional data cleaning (GIFs, etc.).
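The grouping and enrichment steps above can be sketched in a few lines of Python. This is only an illustration, not the processing used for the slides: the `Hit` record, the 30-minute idle timeout, and the list of ignored file suffixes are all assumptions, since the slides state neither the log layout nor the idle time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby

# Assumption: 30 minutes of inactivity ends a session (the slides give no value).
IDLE_TIMEOUT = timedelta(minutes=30)

# Assumption: image requests are the redundant entries to drop.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png")


@dataclass
class Hit:
    """Hypothetical record for one log line."""
    user_id: str     # ID taken from the cookie
    time: datetime   # access time
    url: str         # requested URL


def sessionize(hits, idle_timeout=IDLE_TIMEOUT):
    """Drop image requests, then split each user's hits into sessions by idle time."""
    pages = [
        h for h in hits
        if not h.url.lower().split("?")[0].endswith(IGNORED_SUFFIXES)
    ]
    pages.sort(key=lambda h: (h.user_id, h.time))
    sessions = []
    for _, user_hits in groupby(pages, key=lambda h: h.user_id):
        current = []
        for hit in user_hits:
            # A gap longer than the timeout closes the current session.
            if current and hit.time - current[-1].time > idle_timeout:
                sessions.append(current)
                current = []
            current.append(hit)
        if current:
            sessions.append(current)
    return sessions


def enrich(session):
    """Derive the three enrichment fields for one session."""
    gaps = [
        (later.time - earlier.time).total_seconds()
        for earlier, later in zip(session, session[1:])
    ]
    return {
        "user_id": session[0].user_id,
        "pages": len(session),
        "seconds_between_accesses": gaps,
        "total_seconds": (session[-1].time - session[0].time).total_seconds(),
    }
```

Matching the cookie-less first access to a user ID by IP subnet would have to run before this step, so that every hit already carries a `user_id`.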
The site
Divided into 5 sections:
• Presentation
• Consultant cabinet
• General directory
• News
• Applying to investment
The Enterprise Miner
A typical Enterprise Miner project:
Organized by data flows
Nodes with several functionalities
Access charts
Traffic volume chart
Per-hour access chart
The volume is unstable and decays rapidly.
There is evident variation in hourly volume.
Access chart 2
Number of pages accessed per session
Access time per session falls exponentially.
The majority of sessions don’t last more than 3 minutes.
Hourly volume analysis
[Figure: two bar charts showing a percentage (%) for each hourly period (Escalão 0 to Escalão 4). Values in extraction order: first chart 10.9, 14.4, 15.6, 22.7, 36.4; second chart 8.1, 23.5, 20, 28, 20.4.]
From the access charts, 5 distinct hourly periods can be observed:
0: 1h00-9h00
1: 9h00-12h00
2: 12h00-14h00
3: 14h00-19h00
4: 19h00-1h00
We attempt to classify those periods with a decision tree, using the weekend, the number of pages, and the session time.
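As a rough illustration of how the target and inputs for such a tree could be prepared, the sketch below maps a session start time to one of the five periods and assembles the three input variables. The function names are hypothetical, and it assumes that each boundary hour belongs to the period it opens (for example, 9h00 counts as period 1); the slides do not say how boundaries were handled.

```python
from datetime import datetime


def hourly_period(ts: datetime) -> int:
    """Map a timestamp to hourly period 0-4 (lower bound inclusive, by assumption)."""
    hour = ts.hour
    if 1 <= hour < 9:
        return 0
    if 9 <= hour < 12:
        return 1
    if 12 <= hour < 14:
        return 2
    if 14 <= hour < 19:
        return 3
    # Period 4 (19h00-1h00) wraps past midnight: hours 19-23 and hour 0.
    return 4


def is_weekend(ts: datetime) -> bool:
    """Monday is weekday 0, so 5 and 6 are Saturday and Sunday."""
    return ts.weekday() >= 5


def training_row(session_start: datetime, pages: int, total_seconds: float):
    """Build one (inputs, target) pair for a decision-tree learner."""
    inputs = {
        "weekend": is_weekend(session_start),
        "pages": pages,
        "session_seconds": total_seconds,
    }
    return inputs, hourly_period(session_start)
```

Any decision-tree implementation could then be trained on these rows; the slides use the Enterprise Miner tree node for that step.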
Association analysis
Analyse the access sequence.
Access patterns show design problems.
Page sequences from the main page also show design problems.
Support  Confidence  Rule
33.97    36.22       /default.asp => /canais/apresentacao/apres.asp?menu=0&submenu=1
29.49    31.45       /default.asp => /canais/apresentacao/apres.asp?menu=0&submenu=0
26.42    28.17       /default.asp => /canais/directDigital/topTen.asp?menu=2
23.43    24.98       /default.asp => /canais/gabConsultor/expert.asp?menu=1
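For readers unfamiliar with the two measures, here is a minimal sketch of how support and confidence are usually defined for a rule A => B over a set of sessions. It assumes the textbook definitions and that both columns are percentages; the slides do not spell out how Enterprise Miner computed them. The function names and the toy sessions are illustrative only.

```python
def support(sessions, *pages):
    """Percentage of sessions that contain every page in `pages`."""
    matching = sum(1 for s in sessions if all(p in s for p in pages))
    return 100.0 * matching / len(sessions)


def confidence(sessions, antecedent, consequent):
    """Of the sessions containing `antecedent`, the percentage that also contain `consequent`."""
    base = support(sessions, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs")
    return 100.0 * support(sessions, antecedent, consequent) / base


# Toy data (not from the site): each session is the set of pages it visited.
toy_sessions = [
    {"/default.asp", "/a"},
    {"/default.asp"},
    {"/default.asp", "/a", "/b"},
    {"/b"},
]
rule_support = support(toy_sessions, "/default.asp", "/a")
rule_confidence = confidence(toy_sessions, "/default.asp", "/a")

# Under these definitions, support / confidence gives the share of sessions
# containing the antecedent. Applied to the four table rows:
table_rows = [(33.97, 36.22), (29.49, 31.45), (26.42, 28.17), (23.43, 24.98)]
implied_home_page_share = [round(100.0 * s / c, 1) for s, c in table_rows]
```

If the standard definitions do apply, all four rows imply the same figure: `/default.asp` appears in roughly 93.8% of sessions, which is why each rule's confidence is only slightly higher than its support.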
Sequence analysis
From the previous analysis we discovered that the “presentation” pages are the most visited.
Access sequence analysis.
In the “presentation” page, a correlation exists with each menu entry up until the second-to-last option.
Technique to uncover user classes.
Support  Confidence  Rule
11.93    91.08       submenu=3 => submenu=4 => submenu=5
11.69    90.70       submenu=2 => submenu=4 => submenu=5
11.21    86.77       submenu=1 => submenu=3 => submenu=4
11.78    86.56       submenu=2 => submenu=3 => submenu=4
Sequence analysis with referers
Determine access patterns from the referers.
Useful in determining the impact of an ad.
Correlate referers with areas of the site.
Classify user types from the referers.
Determine which sites deep link to our site.
Support  Confidence  Rule
28.79    91.34       [Non Existent] => /default.asp
15.40    97.44       ad.pt.doubleclik.pt => /default.asp
8.16     98.06       adforce.imgis.com => /default.asp
7.95     98.40       www.netc.pt => /default.asp
4.83     98.36       ads.netc.pt => /default.asp
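One way to derive rules of this shape, and to spot deep links, is to pair each session's referer host with its landing page. The sketch below is a hedged illustration rather than the method behind the table: it assumes one `(referer, first_url)` pair per session, that referers are full URLs as sent in the HTTP header, and that an empty or `-` referer corresponds to `[Non Existent]`. All names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

HOME_PAGE = "/default.asp"
NO_REFERER = "[Non Existent]"


def referer_host(referer):
    """Reduce a referer URL to its host name, or NO_REFERER if it is missing."""
    if not referer or referer == "-":
        return NO_REFERER
    return urlsplit(referer).hostname or NO_REFERER


def home_page_rules(entries):
    """Return {host: (support, confidence)} for the rule host => HOME_PAGE.

    `entries` holds one (referer, first_url) pair per session.
    """
    entries = list(entries)
    from_host = Counter(referer_host(ref) for ref, _ in entries)
    to_home = Counter(
        referer_host(ref) for ref, url in entries if url == HOME_PAGE
    )
    return {
        host: (
            100.0 * to_home[host] / len(entries),    # support
            100.0 * to_home[host] / from_host[host]  # confidence
        )
        for host in from_host
    }


def deep_links(entries):
    """Count, per external host, the sessions that landed somewhere other than HOME_PAGE."""
    return Counter(
        referer_host(ref)
        for ref, url in entries
        if url != HOME_PAGE and referer_host(ref) != NO_REFERER
    )
```

Hosts with a high `deep_links` count are the sites that link past the home page into specific areas of the site.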
Conclusion
A growing technology with a future.
Web Mining gives us insight into our users and their habits.
It allows interesting discoveries about our site.
It’s detective work.
Possible privacy issues.
Unfortunately, we had to rely on a small amount of log data, so there is a need to supplement it with other (external) data.
Ideally suited to a site with user registration.