mining web data: techniques for understanding the …mate.dm.uba.ar/~pfmislej/web mining/web...

Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 1

Mining web data: Techniques forunderstanding the user behavior

in the Web

Juan D. Velásquez SilvaJuan D. Velásquez SilvaPhD Information Engineering, PhD Information Engineering, University of Tokyo, JapanUniversity of Tokyo, [email protected]@dii.uchile.clhttp://wi.dii.uchile.clhttp://wi.dii.uchile.clDepartment of Industrial EngineeringDepartment of Industrial EngineeringUniversity of ChileUniversity of Chile


Outline

Motivation. Web data. Web mining. Applications. Summary.


1.- Motivation


The new sales window

The markets move towards business virtualization. Many companies have begun to use the Web as a

new channel of massive spread and worldwidecover.

How can we improve our web site? By understanding what describes the user behavior

[Brusilovsky96].


OK, but how can we do it?

Analyzing the user browsing behavior. Analyzing the web page preferences. Analyzing the user profile.

In fact “Extracting knowledgefrom web data”

… applying Web mining techniques.


2.- Data originated in the Web:Web Data


Web server and web browser


Web logs# IP Id Acces Time Method/URL/Protocol Status Bytes Referer Agent

1 165.182.168.101 - - 16/06/2002:16:24:06 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)

2 165.182.168.101 - - 16/06/2002:16:24:10 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)

3 165.182.168.101 - - 16/06/2002:16:24:57 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)

4 204.231.180.195 - - 16/06/2002:16:32:06 GET p3.htm HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)

5 204.231.180.195 - - 16/06/2002:16:32:20 GET C.gif HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)

6 204.231.180.195 - - 16/06/2002:16:34:10 GET p1.htm HTTP/1.1 200 3821 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)

7 204.231.180.195 - - 16/06/2002:16:34:31 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)

8 204.231.180.195 - - 16/06/2002:16:34:53 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)





13 165.182.168.101 - - 16/06/2002:16:39:58 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)



16 165.182.168.101 - - 16/06/2002:16:42:08 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)

17 204.231.180.195 - - 16/06/2002:17:34:20 GET p3.htm HTTP/1.1 200 2342 out.htm Mozilla/4.0 (MSIE 6.0; Win98)

18 204.231.180.195 - - 16/06/2002:17:34:48 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 6.0; Win98)


20 204.231.180.195 - - 16/06/2002:17:35:56 GET D.gif HTTP/1.1 200 3231 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)

21 204.231.180.195 - - 16/06/2002:17:36:06 GET E.gif HTTP/1.1 404 0 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)


Web site contents

You can put anything you want on aWeb page, from family to businessinfo….

Hyperlink structure


Nature of web data Content. The web page content, i.e., pictures, free

text, sounds, etc. Structure. Data that show the internal web page

structure. In general, they are HTML or XML tags,some of them contain information about hyperlinkconnections with other web pages.

Usage. Data that describe the visitor preferenceswhile browsing in a web site. It is possible to findthem inside web log files.

User profile. A collection of information about auser, combining personal information (e.g. name, ageetc.), usage information (e.g. page visited) andinterests.


Understanding the visitor behaviorin a web site

Visitor browsing behavior: Web logs. Visitor preferences: Web pages Problems:

Web logs contain a lot of irrelevant data. A Web site is a huge collection of

heterogeneous, unlabelled, distributed, timevariant, semi-structured and high dimensionaldata.


The dream

“Transform the visitors intocustomers and retain the existing

ones” Some solutions:

Continuous improvement of the web sitestructure and content.

Personalization of the relationship between theuser and the web site.

Understanding the user behavior in the website.


Data cleaning and preprocessing

Web logs. Applying the sessionizationprocess. To identify the real visitor session.

Web pages. Vector space model. Transforming the web pages into feature vectors.

Web hyperlinks structure. Identifying social networks


Web logs# IP Id Acces Time Method/URL/Protocol Status Bytes Referer Agent




4 204.231.180.195 - - 16/06/2002:16:32:06 GET p3.htm HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)

5 204.231.180.195 - - 16/06/2002:16:32:20 GET C.gif HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)


7 204.231.180.195 - - 16/06/2002:16:34:31 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)

8 204.231.180.195 - - 16/06/2002:16:34:53 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)








16 165.182.168.101 - - 16/06/2002:16:42:08 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)

17 204.231.180.195 - - 16/06/2002:17:34:20 GET p3.htm HTTP/1.1 200 2342 out.htm Mozilla/4.0 (MSIE 6.0; Win98)

18 204.231.180.195 - - 16/06/2002:17:34:48 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 6.0; Win98)


20 204.231.180.195 - - 16/06/2002:17:35:56 GET D.gif HTTP/1.1 200 3231 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)

21 204.231.180.195 - - 16/06/2002:17:36:06 GET E.gif HTTP/1.1 404 0 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)


WUM – Heuristics

The session identification heuristics Timeout: if the time between pages requests exceeds a certain

limit, it is assumed that the user is starting a new session IP/Agent: Each different agent type for an IP address

represents a different sessions Referring page: If the referring page file for a request is not

part of an open session, it is assumed that the request is comingfrom a different session.

Same IP-Agent/different sessions (Closest): Assigns the requestto the session that is closest to the referring page at the time ofthe request.

Same IP-Agent/different sessions (Recent): In the case wheremultiple sessions are same distance from a page request, assignsthe request to the session with the most recent referrer accessin terms of time


Sessionization- ExampleBased on examples in tutorial PKDD-2002 by Berendt et. al.


Sessionization- Example (2)Sort the users (IP+Agent)

Based on examples in tutorial PKDD-2002 by Berendt et. al.


Sessionization- Example (3)

Sessionize using heuristics (h1 with 30 min)

The h1 heuristic (timeout=30 min) will result in the two sessions




Sessionize using heuristics (with href)

By using the reffer-based heuristics, we have only a single session




Path completion



Web text content

From different web page content, specialattention receive the free text.

For the moment, a searching is performed byusing key words.

It is necessary to represent the textinformation in a feature vector, before toapply a mining process.

The representation must consider that thewords in the web page don’t have the sameimportance.


Web text content: filters

There are words that don’t have information(article, prepositions, conjunctions, etc)

It is necessary a cleaning process: HTML tags. Stop words (i.e. pronouns, prepositions,

conjunctions, etc.) Word stemming. Process of suffix removal, to

generate word


Processing the web site:Vector space model [Salton75]

The model associates a weight to each word in the page,based on its frequency in the whole web site.

Let and Q the amount of pages, a simple estimation ofthe relevance of a word is

The inverse document frequency can beused like a weight.

A variation of the last expression is known as TF*IDF

Qn

w iji =,

)log(in

QIDF =

)log(**i

ij nQ

fIDFTF =

in


Web page: vectorial representation

Its vectorialrepresentation would be amatrix of RxQ.

Q is the number of pagesin the web site and R isthe number of differentwords in P.

Word 1 2 … Q

1 advise 1 0 … 1

2 business 0 1 … 0

. … . . … .

. … . . … .

. … . . … .

. … . . … .

. … . . … .

R zambia 1 0 … 0


Vector space model: a graphic example


Comparing web pages

)log(*)(in

Q

ijij fmM ==

),...,( 1 Riii mmp ! ),...,( 1 Rjjj mmp !

! !

!==

= =

=

R

k

R

k

kjki

R

k

kjki

mm

mm

ji ppdp

1 1

22

1

)()(

cos),( "ip

jp

!


Vector Model, Summarized The best term-weighting schemes tf-idf

weights:wij = f(i,j) * log(Q/ni)

For the query term weights, a suggestion iswiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q)]) *

log(N / ni)

This model is very good in practice: tf-idf works well with general collections Simple and fast to compute Vector model is usually as good as the known

ranking alternatives


Advantages: term-weighting improves quality of the answer

set partial matching allows retrieval of docs that

approximate the query conditions cosine ranking formula sorts documents

according to degree of similarity to the query Disadvantages:

assumes independence of index terms; not clearif this is a good or bad assumption

Pros & Cons of Vector Model


Web hyperlinks structure

Why a web page point to another one? Link analysis: use link structure to determine

credibility. If a web page is pointed by other ones,

maybe it is because the page containsrelevant information.

We can understand the formation of a webcommunity.

We can improve our web site.


Processing the hyperlinks structure


Processing the hyperlinks structure (2) Identifying which pages contain more relevant

information that others [Kleinberg99]:Authorities. A natural information repository

for the community.Hub. These concentrate links to authorities web

pages, for instance, “my favorite sites”.

!

!

"

"

=

=

pqthatsuchq

qp

pqthatsuchq

qp

xy

yx


4.- Web mining


Data mining techniques

Association rules Clustering Classification. Prediction. Probabilistic models. Regression.


What happen with web data?

“The web is a huge collectionof heterogeneous, unlabelled,

distributed, time variant,semi-structured and high

dimensional data”

S.K. Pal 2002


Web Mining Taxonomy [Jooshi00,Lu03]

Web Mining

Web ContentMining

Web UsageMining

Web StructureMining


Web Structure Mining

It deals with the mining of the web hyperlinkstructure (inter document structure).

A website is represented by a graph of itslinks, within the site or between sites.

Facts like the popularity of a web page can bestudied, for instance, if a page is referred bya lot of other pages in the web.

The web link structure allows to develop anotion of hyperlinked communities.

It can be used by search engines, like googleor altavista, in order to get the set of pagesmore cited for a particular subject.


WSM (2) Finding authoritative Web pages

Retrieving pages that are not only relevant, but also ofhigh quality, or authoritative on the topic

Hyperlinks can infer the notion of authority The Web consists not only of pages, but also of

hyperlinks pointing from one page to another These hyperlinks contain an enormous amount of latent

human annotation A hyperlink pointing to another Web page, this can be

considered as the author's endorsement of the otherpage


Google’s PageRank (Brin/Page 98) Mine structure of web graph independently of the

query! Each web page is a node, each hyperlink is a directed edge

Assumes a random walk (surf) through the web: Start at a random page

At each step, the surfer proceeds to a randomly chosen web page with probability d to a randomly chosen successor of the current page with probability 1- d

The PageRank of a page p is the fraction of stepsthe surfer spends at p in the limit


PageRank Algorithm The assumption is the importance of a page is given for the

importance of the pages that pointed it.

!"#$

++%=

pqPpq q

k

qk

pN

xd

ndx

/,

)(

)1( 1)1(

Importance of page p

Number of out-links from page q

Number of pages in the web graph

Importance of page qProbability that the surfer follows someout-links of q when visit that page


PageRank Algorithm (2)

By using a matrix representation, the PageRank equationis rewrite as:

)()1( )1( kkdMXDdX +!=

+

)1

()1

,...,1(),,...,( )()()(

1

j

ij

k

p

k

p

k

NmMand

nnDxxX

n====

where

With amount of out-links from page j such asjN ij!



Initialize

Set d, usually d=0.85.

Calculate

If stop and return .

Else and go to point 3.

!<"+ |||| )()1( kk

XX)1( +k

X

)1

,...,1()0,...0()0(

nnDandX ==

)()1( )1( kkdMXDdX +!=

+

)1()( +=

kkXX



!!!!!

"

#

$$$$$

%

&

=

''''

(

)

****

+

,

=

25.0

25.0

25.0

25.0

005.01

0000

1000

015.00

DandM



!!!!!

"

#

$$$$$

%

&

=

!!!!!

"

#

$$$$$

%

&

=

!!!!!

"

#

$$$$$

%

&

=

126.0

0375.0

0375.0

0853.0

,

126.0

0375.0

0375.0

0853.0

,

0375.0

0375.0

0375.0

0375.0

)3()2()1(XXX

0

0 0

00375.0 0375.0

0375.0

0375.00853.0

126.0

0375.0

0375.0

0375.0

0375.0

0853.0

126.0


PageRank Algorithm (6) About the disadvantages:

If a page is pointed by another one, it mean the pagereceive a vote for the PageRank calculus.

If a page is pointed by a lots of pages, it mean thatthe page is important.

“Only the good pages are pointed by others one”, but: Reciprocal links. If the page A link page B, then page

B will link page A. Link Requirements. Some web pages give electronic

gifts, like a program, document etc., if another pagepoint it.

Near persons community. For instance friends andrelatives that from their pages point another friend orrelative, only for the human relationship between them.


WSM summary

HIST and PageRank use back-links as a meansof adjusting the “worthiness” or “importance”of a page

Both use iterative process over matrix/vectorvalues to reach a convergence point

HITS is query-dependent (thus too expensiveto compute in general)

PageRank is query-independent and consideredmore stable

Just one piece of the overall picture…


Web Content Mining

The goal is to find useful information from theweb content.

In this sense, WCM is similar to InformationRetrieval (IR).

However, web content is not only free text, otherobjects like pictures, sound and movies belong alsoto the content.

There are two main areas in WCM : the mining of document contents (web page content

mining) the improvement of content search in tools like

search engines (search result mining).


Issues in Web Content Mining

Developing intelligent tools for IRFinding keywords and key phrasesDiscovering grammatical rules and collocationsHypertext classification/categorizationExtracting key phrases from text documentsLearning extraction models/rulesHierarchical clustering Predicting (words) relationship


WEBSOM

It is a means for organizing miscellaneous textdocuments into meaningful maps forexploration and search.

It is based on SOM (Self-Organizing Map)that automatically organizes documents into atwo-dimensional grid so that relateddocuments appear close to each other

http://www.cis.hut.fi/websom


WEBSOMAll words ofdocument aremapped into the wordcategory map

Histogram of “hits” onit is formed

Self-organizing map.Largest experiments have used:• word-category map

315 neurons with 270 inputs each

• Document-map 104040 neurons with 315inputs each

Self-organizing semantic map.15x21 neurons

Interrelated words that have similar contextsappear close to each other on the map

• Training done with 1124134 documents


WEBSOM


Sample search

A new document or anydocument’s descriptioncan be used for findingrelated documents.

The circle on the mapdenote the location of themost representativedocument for thequestion.


How to use the Maps

As filter that notifies theuser of interestingdocuments.

As a searching engine. A new index method.


Extraction of key-textcomponents from web pages

The key-text components are parts of an entiredocument, for instance a paragraph, phrase andinclusive a word, that contain significant informationabout a particular topic, from the web site user point ofview.

A web site keyword is “a word or possibly a set ofwords that make a web page more attractive for aneventual user during his visit to the website''[Velasquez05b].

The assumption is that there exists a correlationbetween the time that the user spent in a page andhis/her interest in its content [Velasquez04b] .

Usually, the keywords in a web site have been relatedwith the ``most frequently used words''.

In [Buyukkokten01] a method to extract keywordsfrom a huge set of web pages is introduced.


WCM summary

The vector space model is a recurrent method torepresent a document as a feature vector.

Because the set of words used in the construction ofthe web site could be a lot, it is necessary to apply astop word cleaning and stemming process.

A web page content is different to a commondocument, in fact, a web page contain semi-structuredtext, i.e., with tags that give additional informationabout the text component.

Also a page could contain pictures, sounds, movies, etc. Sometime the page text content is a short text or

inclusive a set of unconnected words.


Web Usage Mining (WUM)

Also known as Web log mining mining techniques to discover interesting

usage patterns from the secondary dataderived from the interactions of the userswhile surfing the web


WUM: Considerations [Pierrakos03]

WUM applies traditional data mining methods inorder to analyze usage data.

The sessionization process is necessary to correctthe problems detected in the data.

The goal is to discover patterns in usage dataapplying different kinds of data mining techniques.

Applications of WUM can be grouped in two maincategories: User modeling in adaptive interfaces, known as

personalization. User navigation patterns, in order to improve the

web site structure.


WUM: Applications Target potential customers for electronic commerce Enhance the quality and delivery of Internet

information services to the end user Improve Web server system performance Identify potential prime advertisement locations Facilitates personalization/adaptive sites Improve site design Fraud/intrusion detection Predict user’s actions (allows prefetching)


Statistics

Several tools use the conventional statistics foranalyzing the user behavior in a web site(http://www.accrue.com/, http://www.netgenesis.com,http://www.webtrends.com).

Each one of them use graphics interfaces, likehistograms, with statistics associate.

For instance amount of clicks per page during thelast month.

By applying conventional statistics on web logs, one canperform different kinds of analysis [Boving04] .


Statistics (2)

These basic statistics are used to improve the website structure and content in many institutions, overall the commercial ones [Dellmann04] .

Their application is simple, there are many webtraffic analysis tools and it easy to understand thetools's reports.

Although the statistic analysis could seem a verysimple tool to mining the web logs, it resolve a lot ofproblems relate with the performance of the website.


Examples:

60% of clients who accessed /products/, also accessed/products/software/webminer.htm.

30% of clients who accessed /special-offer.html,placed an online order in /products/software/.

(Example from IBM official Olympics Site) {Badminton, Diving} ===> {Table Tennis} (α = 69.7%, σ

= 0.35%)


Clustering

Groups together a set of items having similarcharacteristics

User Clusters Discover groups of users exhibiting similar browsing

patterns Page recommendation

User’s partial session is classified into a single cluster The links contained in this cluster are recommended


Clustering

Grouping users with common browsingbehavior and pages with similar content[Srivastava00] .

An important thing in any clusteringtechnique, is the similarity measure or methodused to compare the input vectors for creatingthe clusters.

In the WUM case, the similarities areconstructed basis on the usage data in the website, i.e., the set of pages used or visitedduring the user session.


WUM: Clustering example Clusters of user browsing behavior


Association Rule Generation Discovers the correlations between pages that are most often

referenced together in a single server session Provide the information

What are the set of pages frequently accessed together byWeb users?

What page will be fetched next? What are paths frequently accessed by Web users?

Association rule A B [ Support = 60%, Confidence = 80% ]Example “50% of visitors who accessed URLs /diplomas.html and

BI/info.html also visited webmining.html”


Association rules

In WUM, the association rules are focusmainly in the discovery of relations betweenpages visited by the users [Mobasher01] .

For instance, an association rule for a MBAprogram is

mba/seminar.html mba/speakers.html From a different point of view, in

[Schwarzkopf01] a Bayesian network is usedfor defining taxonomic relations betweentopics shown in a web site.


5.- Applications


Improving the relationship between theweb site and the user

Recommendations to modify the web sitestructure and content.

Web personalization (online navigationrecommendations).

Intelligent web site.


Web personalization [Lu03,Mombasher00]

It is the process where the web server andthe related applications, dynamically,customize the content for a particular user,based on information about his/her behaviorin a website.

This is different to another related conceptcalled “customization” where the visitorinteracts with the web server using aninterface to create her/his own web site, e.g.,“MyBanking Page”.


Intelligent Web site [Coenen00,Perkowitz98, Velasquez04b]

It represents the main concepts behind the“new portal generation”.

They are systems that “based on the userbehavior, allow to implement changes inthe current web site structure andcontent”.


What we should do? [Mombasher01]

Modeling the visitor behavior. A model for the visitor browsing behavior

must consider the visited pages, the timespent and the pages sequence.

A model for the visitor preferences mustconsider the content of the pages and thetime spent in each one of them in a session.


Visitor browsing behavior

¿ ?


Visitor browsing behavior (2)

Three variables are considered: the web page path,its content and the time spent when it is visited bya visitor.

The visitor behavior vector is defined and asimilarity measure between visitor session isintroduced.

Where is a component that represents thepage content, its path and percentage of timespent in the page visited.

The vector maintains the page visit order.

)],)...(,[( 11 nntptpv =

r

),(iitp

thi


Vector space model: A variation proposed

)log(*)1()(in

Qiijij

TR

swfmM +== sw: special words array

TR: Total special words

),...,( 1 Riii mmp ! ),...,( 1 Rjjj mmp !

! !

!==

= =

=

R

k

R

k

kjki

R

k

kjki

mm

mm

ji ppdp

1 1

22

1

)()(

cos),( "ip

jp

!


An example of variation proposed


Comparing the sequences [Runkler03,Velasquez04a]

The sequence of anavigation can berepresented by a graph.Each page is identified byan identification number.

!"#

$

=%==

=

=

&&&=

&&&&=

+)()(1

)()(02),(

}7,6,3,1{)(

}8,5,6,2,1{)(

}73,63,31{

}85,52,62,21{

21

21

)||()(||

)||()(||

21

2

1

2

1

21

21

GEGEif

GEGEifGGdG

GE

GE

G

G

GEGE

GEGE 'I

We need to know how similar ordifferent are both sequencerepresentations!!

Notion of similarity


Comparing sequences: An example

4),(

3),(

"1367"

"12856"

"1367"

"12648"

43

21

4

3

2

1

=

=

=

=

=

=

SSL

SSL

S

S

S

S

] ]

3,021

||)(||||)(||

),(21),(

||})(||||,)(max{||,0

0),(

453

21

2121

21

21

21

=!=

+!=

"#$ %

=

+

GEGE

SSLGGdG

elseGEGE

SSSSL


Comparing browsing behavior

Then the similarity measure is:

where is an indicator of visitor interest

is the page distance

and dG is a “graph distance”, i.e., how similar are the pathsbetween two sessions and dP is a “page distance” betweenthe content of the visited pages.

!=

=L

k

c

k

c

kt

t

t

t

L

hh ppdpppdGsmk

k

k

k

1

,,1 ),(),min(),(),( "#"# #

"

"

#

"#rr

),min( !

"

"

!

k

k

k

k

t

t

t

t

),( ,,

c

k

c

k ppdp !"


What is she/he looking for?

Free text. Movies Pictures Sounds


Visitor preferences

The main focus is to understand the textpreferences.

Web site keywords: the word or the set ofwords that make the web page more attractivefor the visitor.

From the visitor behavior vector, we want toselect the most important pages, assumingthat the degree of importance is correlatedwith the percentage of time spent on eachpage.


Important pages vector [Velasquez05b]

We can suppose that the most important content for thevisitors is included in the web pages that concentrate thegreater amount of time spent in their visit.

Then, in order to know the most important web pagecontent, we can select a component set from v . Letbe the quantity selected.

being the component in v that represents thepage most important and the time spent on it.

This information is obtained after sorting the vcomponents by time and select the first ones.

)],(),...,,[()( 11 !!! "#"#$ =v

!

),(kk!"

thk

!


Similarity measure using text

),(*},min{1

))(),((1

!"#

"

!

!

"

## $$%

%

%

%

#!&"& kk

k k

k

k

k dpst '=

=

This equation combines the content of the mostimportant visited pages with the time spent oneach one of them by a multiplication.

In this way we can distinguish between two pageswith similar content, but different spent times.


Identifying web site keywords

A clustering algorithm can be applied to find groups ofsimilar visitor sessions.

Based on the clusters centroids, the most importantwords for each cluster will be identified.

Using the vector space model and a method todetermine the most important keywords and theirimportance in each cluster is proposed.

A measure (geometric mean) to calculate theimportance of each word relative to each cluster, isdefined as:

!"

ippmikw

#$=][ : Cluster with vectors! !

Ri ,...,1:


Applying clustering algorithms[Hinnserbury03, Theodoridis99, Velasquez03

For example, Self Organizing Feature Maps. Schematically, a SOFM is presented as a two-dimensional array in

whose positions the neurons are located. Each neuron is constituted by an n-dimensional vector, whose

components are the synaptic weights. The notion of neighborhood among the neurons provides diverse

topologies. In this case a toroidal topology was used, which means that the

neurons closest to the ones of the superior edge, are located in theinferior and lateral edges.


Working with real web sites


Bank web site

It belongs to a Chilean virtual bank, i.e., a bankthat doesn't have physical branches and whereall the transactions are made using electronicmeans, like e-mails, portals, etc.

Written in Spanish. 217 static web pages. Approximately eight million raw web logs

registers corresponding to the period Januaryto March, 2003.


Visitor Behavior Vectors

Only 16% of the visitors visit 10 or more pagesand 18% less than 4.

The average of visited pages is 6. With these data, we fixed 6 as the maximum

number of components in a visitor behaviorvector

Finally, applying the above described filters,approximately 200,000 vectors wereidentified.


Visitor Behavior Vectors (2)


Cluster analysis

The cluster identification and validation isgenerally a subjective tasks.

It is necessary the support of a businessexpert.

In this case bank business expert. Eight clusters were identified but only four

of the were accepted by the business expert. The criterion is “What cluster make sense for

the business expert”


Results


Important page clustering


Cluster analysis

The cluster must content only pages relatedby main topic.

The business expert define the main topic perpage.

From twelve identified clusters, only eightwere accepted.


ResultsCluster Pages Visited

1 (6,8,190)

2 (100,128.30)

3 (86,150,97)

4 (101,105,1)

5 (3,9,147)

6 (100,126,58)

7 (70,186,137)

8 (157,169,180)


Cluster analysis Review which words in each page cluster are

the same. Using the vector page definition, it is possible

to get the key words and their relativeimportance in each cluster.

Example

)log(*)1()(in

Qiijij

TR

swfmM +==

319086

}190,8,6{

iiiimmmkw =

=!

!"

ippmikw

#$=][


Cluster analysis


Some identified keywords

#

1 Crédito Credit

2 Hipotecario House credit

3 Tarjeta Card

4 Promoción Promotion

5 Concurso Concourse

6 Puntos Points

7 Descuento Discount

8 Cuenta Account

Keywords


Offline recommendations

Structure. Add links intra clusters, e.g., link from page 150

to page 137. Add link inter cluster, e.g. link from page 105 to

126. Eliminate link, e.g., from page 150 to page 186

Content. To use the web site keywords as linksor contents in the page.


Online recommendations

A current user session is classified into someclusters found.

The online navigation recommendation iscreated as a set of links to pages belonging tothe current web site.

The user can select some links or not. Default case: No recommendation. It is too risky to apply the recommendations in

the real web site. However it is possible tomake a simulation.


A methodology to test onlinenavigation recommendations

qCl)],(),...,,(),...,,[( 11 HHmm tptptp=!

qm Clp !+1 0,/ ,11,1 >!!" +++ jCllRl qjmmjm

###

)(1 !+mR

Recommendations for ? 1+mp },...,{ 1 npp=

},...,,...,{)( ,1,10,11

!!!! jmkmmm lllR ++++ =1+mp

!

xml

,1+


Online recommendations (2)


Testing the online navigation effectiveness

12

3

fourth page

fifth page

sixth page

0

10

20

30

40

50

60

70

80

90

100

Pe

rce

nta

ge

of

ac

ce

pta

nc

e

Number of page links

per suggestion


Intelligent web site [Velasquez04b]


Knowledge Base

Parametric rules

Historic patternrepository


5.- Summary


Summary

Web data are a real source to analyze theuser behavior in the web.

An important step is cleaning and pre-processing of the web data.

The application of web mining techniquesallows to find unknown patterns.

These patterns must be validated/rejected byan expert in the business under investigation.

We can personalize a web site.


Muchas gracias por su atenciónThank you very much for your attention

ご清聴ありがとうございました

mining web data: techniques for understanding the …mate.dm.uba.ar/~pfmislej/web mining/web...

Documents