WebMining

Gonçalo Abreu, Pedro Marques

Instituto Superior Técnico - LEIC 1999/2000

Copyright © 2000, SAS Institute Inc. All rights reserved.



What is Data Mining?

A set of techniques and procedures to find patterns and hidden information.

Several steps are involved:

• Data selection
• Preparation and purification
• Enrichment
• Visual analysis
• Analytical analysis
• Analysis and validation of the results

It’s an iterative process.

Data Mining on the Web

Analysis of a WWW site through the web server’s logs.

Log files contain, amongst other things:

• Requested URL
• A cookie with an ID
• Access time
• User’s IP address
• Referer

Our goal will be to determine access patterns, problems with the site, and referer information.
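As a minimal sketch of turning one log record into the fields listed above: the slides do not show the actual log format, so the tab-separated layout, the `-` marker for a missing value, and the names `Hit` and `parse_line` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Hit:
    time: datetime
    ip: str
    url: str
    user_id: Optional[str]  # None when no ID cookie was sent
    referer: Optional[str]  # None when the browser sent no referer


def parse_line(line: str) -> Hit:
    """Parse one record of the ASSUMED tab-separated layout:
    ISO timestamp, client IP, requested URL, cookie ID, referer ('-' = missing)."""
    stamp, ip, url, cookie, referer = line.rstrip("\n").split("\t")
    return Hit(
        time=datetime.fromisoformat(stamp),
        ip=ip,
        url=url,
        user_id=None if cookie == "-" else cookie,
        referer=None if referer == "-" else referer,
    )


# Made-up example record: a first visit, so no ID cookie yet.
hit = parse_line("2000-03-01T10:15:00\t192.0.2.7\t/default.asp\t-\thttp://www.netc.pt/")
```

A real server log would need its own field order and escaping rules; only the set of fields comes from the slide.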

Some problems with the logs

The first access to the site has no cookie with an ID; the ID only exists from the second access onwards (provided the user accepts the cookie).
There is no concept of a session.
Logs are in plain text with lots of redundant information (e.g. GIF files).
Besides the user ID we don’t have any other information about the user.
The user can always turn off cookies (hard to determine).

Log purifying

Given the stated problems we need to:

Correlate a user ID seen for the first time with the preceding cookie-less access, matching the 1st and 2nd accesses by IP subnet.
Group each user’s accesses into sessions (sessions are separated by a given idle time).
Enrich the data with:
• Time between accesses
• Number of pages per session
• Total session time
Additional data cleaning (GIFs, etc.)
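The grouping, enrichment and cleaning steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the 30-minute idle threshold, the list of image extensions, and the names `sessionize`, `summarize` and `is_page` are hypothetical, and user IDs are assumed to be already resolved.

```python
from datetime import datetime, timedelta

IDLE = timedelta(minutes=30)  # assumed idle threshold; the slides do not give one


def is_page(url: str) -> bool:
    """Drop embedded images such as GIFs; the extension list is an assumption."""
    path = url.split("?", 1)[0].lower()
    return not path.endswith((".gif", ".jpg", ".jpeg", ".png"))


def summarize(user, rows):
    """Enrich one session with gaps between accesses, page count and total time."""
    times = [t for t, _ in rows]
    return {
        "user": user,
        "pages": [u for _, u in rows],
        "n_pages": len(rows),
        "gaps_s": [(b - a).total_seconds() for a, b in zip(times, times[1:])],
        "total_s": (times[-1] - times[0]).total_seconds(),
    }


def sessionize(hits, idle=IDLE):
    """hits: iterable of (user_id, time, url). Returns a list of session summaries."""
    by_user = {}
    for user, time, url in hits:
        if is_page(url):
            by_user.setdefault(user, []).append((time, url))

    sessions = []
    for user, rows in by_user.items():
        rows.sort(key=lambda r: r[0])
        current = [rows[0]]
        for row in rows[1:]:
            if row[0] - current[-1][0] > idle:  # idle gap closes the session
                sessions.append(summarize(user, current))
                current = [row]
            else:
                current.append(row)
        sessions.append(summarize(user, current))
    return sessions


# Made-up hits for one user: two pages, one image, then a return three hours later.
t0 = datetime(2000, 3, 1, 10, 0)
hits = [
    ("u1", t0, "/default.asp"),
    ("u1", t0 + timedelta(seconds=5), "/img/logo.gif"),
    ("u1", t0 + timedelta(minutes=2), "/canais/apresentacao/apres.asp?menu=0&submenu=1"),
    ("u1", t0 + timedelta(hours=3), "/default.asp"),
]
sessions = sessionize(hits)
```

With this sample the image request is discarded and the three remaining page views split into two sessions, because the last one arrives after the idle threshold.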

The site

Divided into 5 sections:

• Presentation
• Consultant cabinet
• General directory
• News
• Applying to investment

Site presentation

This is the site’s general look and feel.

The Enterprise Miner

A typical Enterprise Miner project:

• Organized by data flows
• Nodes with several functionalities

Access charts

Traffic volume chart
Per-hour access chart

The volume is unstable and decays rapidly.
There is evident variation in hourly volume.

Access charts 2

Number of pages accessed per session
Access time per session falls exponentially.
The majority of sessions don’t last more than 3 minutes.

Hourly volume analysis

[Bar chart: percentage per period (escalão) 0 to 4: 10.9, 14.4, 15.6, 22.7, 36.4]

[Bar chart: percentage per period (escalão) 0 to 4: 8.1, 23.5, 20, 28, 20.4]

From the access charts, 5 distinct hourly periods can be observed:

0: 1h00-9h00
1: 9h00-12h00
2: 12h00-14h00
3: 14h00-19h00
4: 19h00-1h00

We attempt to classify those periods with a decision tree, using the weekend flag, the number of pages and the session time.
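The five periods listed above can be encoded as a small lookup that produces the target label for such a classifier. This is only a sketch: the function name `hourly_period` is hypothetical, and assigning each boundary hour to the period it starts is an assumption, since the slide does not say which side the boundaries belong to.

```python
def hourly_period(hour: int) -> int:
    """Map an hour of day (0-23) to one of the periods 0-4 from the slide.

    Boundary hours are assigned to the period they start (an assumption).
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if 1 <= hour < 9:
        return 0
    if 9 <= hour < 12:
        return 1
    if 12 <= hour < 14:
        return 2
    if 14 <= hour < 19:
        return 3
    return 4  # 19h00-1h00 wraps past midnight, so hour 0 lands here


# Label a few session start hours.
labels = [hourly_period(h) for h in (0, 1, 9, 13, 18, 23)]
```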

Association analysis

Analyse the access sequence.
Access patterns show design problems.
The page sequence from the main page also shows design problems.

Support (%)  Confidence (%)  Rule
33.97        36.22           /default.asp => /canais/apresentacao/apres.asp?menu=0&submenu=1
29.49        31.45           /default.asp => /canais/apresentacao/apres.asp?menu=0&submenu=0
26.42        28.17           /default.asp => /canais/directDigital/topTen.asp?menu=2
23.43        24.98           /default.asp => /canais/gabConsultor/expert.asp?menu=1
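The slides do not define the two measures, so the sketch below assumes the usual definitions: support is the share of sessions containing both pages, and confidence is the share of sessions containing the first page that also contain the second. Under these definitions, dividing support by confidence gives the share of sessions that contain the first page; the four rows above all give about 0.94 for /default.asp, which is consistent with this reading. The function name `rule_stats` and the sample sessions are made up, and page order within a session is ignored.

```python
def rule_stats(sessions, antecedent, consequent):
    """Return (support %, confidence %) for the rule antecedent => consequent.

    sessions: list of sets, each holding the pages visited in one session.
    """
    n = len(sessions)
    with_a = sum(1 for s in sessions if antecedent in s)
    with_both = sum(1 for s in sessions if antecedent in s and consequent in s)
    support = 100.0 * with_both / n if n else 0.0
    confidence = 100.0 * with_both / with_a if with_a else 0.0
    return support, confidence


# Made-up sessions: 3 of 4 contain /default.asp, 2 of those also contain /a.asp.
sessions = [
    {"/default.asp", "/a.asp"},
    {"/default.asp"},
    {"/default.asp", "/a.asp", "/b.asp"},
    {"/b.asp"},
]
support, confidence = rule_stats(sessions, "/default.asp", "/a.asp")
```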

Sequence analysis

From the previous analysis we discover that the “presentation” pages are the most visited.
Access sequence analysis.
In the “presentation” page a correlation exists with each menu entry up until the 2nd from last option.
Technique to uncover user classes.

Support (%)  Confidence (%)  Rule
11.93        91.08           submenu=3 => submenu=4 => submenu=5
11.69        90.70           submenu=2 => submenu=4 => submenu=5
11.21        86.77           submenu=1 => submenu=3 => submenu=4
11.78        86.56           submenu=2 => submenu=3 => submenu=4

Sequence analysis with referers

Determine access patterns from the referers.
Useful in determining the impact of an ad.
Correlate referers with areas of the site.
Classify user types from the referers.
Determine which sites deep link to our site.

Support (%)  Confidence (%)  Rule
28.79        91.34           [Non Existent] => /default.asp
15.40        97.44           ad.pt.doubleclik.pt => /default.asp
8.16         98.06           adforce.imgis.com => /default.asp
7.95         98.40           www.netc.pt => /default.asp
4.83         98.36           ads.netc.pt => /default.asp
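Rules of the form "referer host => entry page", like those in the table above, could be derived roughly as follows. This is a sketch under assumptions: referers are taken to be absolute URLs, each session is reduced to its first referer and landing page, support and confidence use the usual definitions, and the names `referer_host` and `entry_rules` as well as the sample sessions are made up.

```python
from collections import Counter
from urllib.parse import urlsplit

NO_REFERER = "[Non Existent]"  # label used in the table for a missing referer


def referer_host(referer):
    """Reduce an absolute referer URL to its host name, or the placeholder."""
    if not referer:
        return NO_REFERER
    return urlsplit(referer).hostname or NO_REFERER


def entry_rules(sessions):
    """sessions: list of (referer_url_or_None, landing_page).

    Returns {(host, page): (support %, confidence %)}.
    """
    n = len(sessions)
    pairs = Counter((referer_host(r), page) for r, page in sessions)
    hosts = Counter(referer_host(r) for r, _ in sessions)
    return {
        (host, page): (100.0 * count / n, 100.0 * count / hosts[host])
        for (host, page), count in pairs.items()
    }


# Made-up sessions: two arrive from www.netc.pt, two have no referer.
sessions = [
    ("http://www.netc.pt/", "/default.asp"),
    ("http://www.netc.pt/news", "/canais/directDigital/topTen.asp?menu=2"),
    (None, "/default.asp"),
    (None, "/default.asp"),
]
rules = entry_rules(sessions)
```

A rule whose landing page is not the main page, such as the second sample session, is what a deep link from another site would look like in this representation.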

Conclusion

A growing technology with a future.
Web mining gives us insight into our users and their habits.
It allows interesting discoveries about our site.
It’s detective work.
Possible privacy issues.
Unfortunately we had to rely on a small amount of log data, so there is a need to supplement it with other (external) data.
Ideally suited to a site with user registration.