mining web data: techniques for understanding the …mate.dm.uba.ar/~pfmislej/web mining/web...
TRANSCRIPT
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 1
Mining web data: Techniques forunderstanding the user behavior
in the Web
Juan D. Velásquez SilvaJuan D. Velásquez SilvaPhD Information Engineering, PhD Information Engineering, University of Tokyo, JapanUniversity of Tokyo, [email protected]@dii.uchile.clhttp://wi.dii.uchile.clhttp://wi.dii.uchile.clDepartment of Industrial EngineeringDepartment of Industrial EngineeringUniversity of ChileUniversity of Chile
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 2
Outline
Motivation. Web data. Web mining. Applications. Summary.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 4
The new sales window
The markets move towards business virtualization. Many companies have begun to use the Web as a
new channel of massive spread and worldwidecover.
How can we improve our web site? By understanding what describes the user behavior
[Brusilovsky96].
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 5
OK, but how can we do it?
Analyzing the user browsing behavior. Analyzing the web page preferences. Analyzing the user profile.
In fact “Extracting knowledgefrom web data”
… applying Web mining techniques.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 6
2.- Data originated in the Web:Web Data
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 8
Web logs# IP Id Acces Time Method/URL/Protocol Status Bytes Referer Agent
1 165.182.168.101 - - 16/06/2002:16:24:06 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
2 165.182.168.101 - - 16/06/2002:16:24:10 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
3 165.182.168.101 - - 16/06/2002:16:24:57 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
4 204.231.180.195 - - 16/06/2002:16:32:06 GET p3.htm HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)
5 204.231.180.195 - - 16/06/2002:16:32:20 GET C.gif HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)
6 204.231.180.195 - - 16/06/2002:16:34:10 GET p1.htm HTTP/1.1 200 3821 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)
7 204.231.180.195 - - 16/06/2002:16:34:31 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
8 204.231.180.195 - - 16/06/2002:16:34:53 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
9 204.231.180.195 - - 16/06/2002:16:38:40 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
10 165.182.168.101 - - 16/06/2002:16:39:02 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
11 165.182.168.101 - - 16/06/2002:16:39:15 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
12 165.182.168.101 - - 16/06/2002:16:39:45 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
13 165.182.168.101 - - 16/06/2002:16:39:58 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
14 165.182.168.101 - - 16/06/2002:16:42:03 GET p3.htm HTTP/1.1 200 4036 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
15 165.182.168.101 - - 16/06/2002:16:42:07 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
16 165.182.168.101 - - 16/06/2002:16:42:08 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
17 204.231.180.195 - - 16/06/2002:17:34:20 GET p3.htm HTTP/1.1 200 2342 out.htm Mozilla/4.0 (MSIE 6.0; Win98)
18 204.231.180.195 - - 16/06/2002:17:34:48 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 6.0; Win98)
19 204.231.180.195 - - 16/06/2002:17:35:45 GET p4.htm HTTP/1.1 200 3523 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)
20 204.231.180.195 - - 16/06/2002:17:35:56 GET D.gif HTTP/1.1 200 3231 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)
21 204.231.180.195 - - 16/06/2002:17:36:06 GET E.gif HTTP/1.1 404 0 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 9
Web site contents
You can put anything you want on aWeb page, from family to businessinfo….
Hyperlink structure
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 10
Nature of web data Content. The web page content, i.e., pictures, free
text, sounds, etc. Structure. Data that show the internal web page
structure. In general, they are HTML or XML tags,some of them contain information about hyperlinkconnections with other web pages.
Usage. Data that describe the visitor preferenceswhile browsing in a web site. It is possible to findthem inside web log files.
User profile. A collection of information about auser, combining personal information (e.g. name, ageetc.), usage information (e.g. page visited) andinterests.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 11
Understanding the visitor behaviorin a web site
Visitor browsing behavior: Web logs. Visitor preferences: Web pages Problems:
Web logs contain a lot of irrelevant data. A Web site is a huge collection of
heterogeneous, unlabelled, distributed, timevariant, semi-structured and high dimensionaldata.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 12
The dream
“Transform the visitors intocustomers and retain the existing
ones” Some solutions:
Continuous improvement of the web sitestructure and content.
Personalization of the relationship between theuser and the web site.
Understanding the user behavior in the website.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 13
Data cleaning and preprocessing
Web logs. Applying the sessionizationprocess. To identify the real visitor session.
Web pages. Vector space model. Transforming the web pages into feature vectors.
Web hyperlinks structure. Identifying social networks
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 14
Web logs# IP Id Acces Time Method/URL/Protocol Status Bytes Referer Agent
1 165.182.168.101 - - 16/06/2002:16:24:06 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
2 165.182.168.101 - - 16/06/2002:16:24:10 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
3 165.182.168.101 - - 16/06/2002:16:24:57 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
4 204.231.180.195 - - 16/06/2002:16:32:06 GET p3.htm HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)
5 204.231.180.195 - - 16/06/2002:16:32:20 GET C.gif HTTP/1.1 304 0 - Mozilla/4.0 (MSIE 6.0; Win98)
6 204.231.180.195 - - 16/06/2002:16:34:10 GET p1.htm HTTP/1.1 200 3821 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)
7 204.231.180.195 - - 16/06/2002:16:34:31 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
8 204.231.180.195 - - 16/06/2002:16:34:53 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
9 204.231.180.195 - - 16/06/2002:16:38:40 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 6.0; Win98)
10 165.182.168.101 - - 16/06/2002:16:39:02 GET p1.htm HTTP/1.1 200 3821 out.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
11 165.182.168.101 - - 16/06/2002:16:39:15 GET A.gif HTTP/1.1 200 3766 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
12 165.182.168.101 - - 16/06/2002:16:39:45 GET B.gif HTTP/1.1 200 2878 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
13 165.182.168.101 - - 16/06/2002:16:39:58 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
14 165.182.168.101 - - 16/06/2002:16:42:03 GET p3.htm HTTP/1.1 200 4036 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
15 165.182.168.101 - - 16/06/2002:16:42:07 GET p2.htm HTTP/1.1 200 2960 p1.htm Mozilla/4.0 (MSIE 5.5; WinNT 5.1)
16 165.182.168.101 - - 16/06/2002:16:42:08 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 5.01; WinNT 5.1)
17 204.231.180.195 - - 16/06/2002:17:34:20 GET p3.htm HTTP/1.1 200 2342 out.htm Mozilla/4.0 (MSIE 6.0; Win98)
18 204.231.180.195 - - 16/06/2002:17:34:48 GET C.gif HTTP/1.1 200 3423 p2.htm Mozilla/4.0 (MSIE 6.0; Win98)
19 204.231.180.195 - - 16/06/2002:17:35:45 GET p4.htm HTTP/1.1 200 3523 p3.htm Mozilla/4.0 (MSIE 6.0; Win98)
20 204.231.180.195 - - 16/06/2002:17:35:56 GET D.gif HTTP/1.1 200 3231 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)
21 204.231.180.195 - - 16/06/2002:17:36:06 GET E.gif HTTP/1.1 404 0 p4.htm Mozilla/4.0 (MSIE 6.0; Win98)
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 15
WUM – Heuristics
The session identification heuristics Timeout: if the time between pages requests exceeds a certain
limit, it is assumed that the user is starting a new session IP/Agent: Each different agent type for an IP address
represents a different sessions Referring page: If the referring page file for a request is not
part of an open session, it is assumed that the request is comingfrom a different session.
Same IP-Agent/different sessions (Closest): Assigns the requestto the session that is closest to the referring page at the time ofthe request.
Same IP-Agent/different sessions (Recent): In the case wheremultiple sessions are same distance from a page request, assignsthe request to the session with the most recent referrer accessin terms of time
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 16
Sessionization- ExampleBased on examples in tutorial PKDD-2002 by Berendt et. al.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 17
Sessionization- Example (2)Sort the users (IP+Agent)
Based on examples in tutorial PKDD-2002 by Berendt et. al.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 18
Sessionization- Example (3)
Sessionize using heuristics (h1 with 30 min)
The h1 heuristic (timeout=30 min) will result in the two sessions
Based on examples in tutorial PKDD-2002 by Berendt et. al.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 19
Sessionization- Example (4)
Sessionize using heuristics (with href)
By using the reffer-based heuristics, we have only a single session
Based on examples in tutorial PKDD-2002 by Berendt et. al.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 20
Sessionization- Example (5)
Path completion
Based on examples in tutorial PKDD-2002 by Berendt et. al.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 21
Web text content
From different web page content, specialattention receive the free text.
For the moment, a searching is performed byusing key words.
It is necessary to represent the textinformation in a feature vector, before toapply a mining process.
The representation must consider that thewords in the web page don’t have the sameimportance.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 22
Web text content: filters
There are words that don’t have information(article, prepositions, conjunctions, etc)
It is necessary a cleaning process: HTML tags. Stop words (i.e. pronouns, prepositions,
conjunctions, etc.) Word stemming. Process of suffix removal, to
generate word
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 23
Processing the web site:Vector space model [Salton75]
The model associates a weight to each word in the page,based on its frequency in the whole web site.
Let and Q the amount of pages, a simple estimation ofthe relevance of a word is
The inverse document frequency can beused like a weight.
A variation of the last expression is known as TF*IDF
Qn
w iji =,
)log(in
QIDF =
)log(**i
ij nQ
fIDFTF =
in
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 24
Web page: vectorial representation
Its vectorialrepresentation would be amatrix of RxQ.
Q is the number of pagesin the web site and R isthe number of differentwords in P.
Word 1 2 … Q
1 advise 1 0 … 1
2 business 0 1 … 0
. … . . … .
. … . . … .
. … . . … .
. … . . … .
. … . . … .
R zambia 1 0 … 0
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 25
Vector space model: a graphic example
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 26
Comparing web pages
)log(*)(in
Q
ijij fmM ==
),...,( 1 Riii mmp ! ),...,( 1 Rjjj mmp !
! !
!==
= =
=
R
k
R
k
kjki
R
k
kjki
mm
mm
ji ppdp
1 1
22
1
)()(
cos),( "ip
jp
!
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 27
Vector Model, Summarized The best term-weighting schemes tf-idf
weights:wij = f(i,j) * log(Q/ni)
For the query term weights, a suggestion iswiq = (0.5 + [0.5 * freq(i,q) / max(freq(l,q)]) *
log(N / ni)
This model is very good in practice: tf-idf works well with general collections Simple and fast to compute Vector model is usually as good as the known
ranking alternatives
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 28
Advantages: term-weighting improves quality of the answer
set partial matching allows retrieval of docs that
approximate the query conditions cosine ranking formula sorts documents
according to degree of similarity to the query Disadvantages:
assumes independence of index terms; not clearif this is a good or bad assumption
Pros & Cons of Vector Model
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 29
Web hyperlinks structure
Why a web page point to another one? Link analysis: use link structure to determine
credibility. If a web page is pointed by other ones,
maybe it is because the page containsrelevant information.
We can understand the formation of a webcommunity.
We can improve our web site.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 30
Processing the hyperlinks structure
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 31
Processing the hyperlinks structure (2) Identifying which pages contain more relevant
information that others [Kleinberg99]:Authorities. A natural information repository
for the community.Hub. These concentrate links to authorities web
pages, for instance, “my favorite sites”.
!
!
"
"
=
=
pqthatsuchq
qp
pqthatsuchq
qp
xy
yx
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 33
Data mining techniques
Association rules Clustering Classification. Prediction. Probabilistic models. Regression.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 34
What happen with web data?
“The web is a huge collectionof heterogeneous, unlabelled,
distributed, time variant,semi-structured and high
dimensional data”
S.K. Pal 2002
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 35
Web Mining Taxonomy [Jooshi00,Lu03]
Web Mining
Web ContentMining
Web UsageMining
Web StructureMining
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 36
Web Structure Mining
It deals with the mining of the web hyperlinkstructure (inter document structure).
A website is represented by a graph of itslinks, within the site or between sites.
Facts like the popularity of a web page can bestudied, for instance, if a page is referred bya lot of other pages in the web.
The web link structure allows to develop anotion of hyperlinked communities.
It can be used by search engines, like googleor altavista, in order to get the set of pagesmore cited for a particular subject.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 37
WSM (2) Finding authoritative Web pages
Retrieving pages that are not only relevant, but also ofhigh quality, or authoritative on the topic
Hyperlinks can infer the notion of authority The Web consists not only of pages, but also of
hyperlinks pointing from one page to another These hyperlinks contain an enormous amount of latent
human annotation A hyperlink pointing to another Web page, this can be
considered as the author's endorsement of the otherpage
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 38
Google’s PageRank (Brin/Page 98) Mine structure of web graph independently of the
query! Each web page is a node, each hyperlink is a directed edge
Assumes a random walk (surf) through the web: Start at a random page
At each step, the surfer proceeds to a randomly chosen web page with probability d to a randomly chosen successor of the current page with probability 1- d
The PageRank of a page p is the fraction of stepsthe surfer spends at p in the limit
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 39
PageRank Algorithm The assumption is the importance of a page is given for the
importance of the pages that pointed it.
!"#$
++%=
pqPpq q
k
qk
pN
xd
ndx
/,
)(
)1( 1)1(
Importance of page p
Number of out-links from page q
Number of pages in the web graph
Importance of page qProbability that the surfer follows someout-links of q when visit that page
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 40
PageRank Algorithm (2)
By using a matrix representation, the PageRank equationis rewrite as:
)()1( )1( kkdMXDdX +!=
+
)1
()1
,...,1(),,...,( )()()(
1
j
ij
k
p
k
p
k
NmMand
nnDxxX
n====
where
With amount of out-links from page j such asjN ij!
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 41
PageRank Algorithm (3)
Initialize
Set d, usually d=0.85.
Calculate
If stop and return .
Else and go to point 3.
!<"+ |||| )()1( kk
XX)1( +k
X
)1
,...,1()0,...0()0(
nnDandX ==
)()1( )1( kkdMXDdX +!=
+
)1()( +=
kkXX
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 42
PageRank Algorithm (4)
!!!!!
"
#
$$$$$
%
&
=
''''
(
)
****
+
,
=
25.0
25.0
25.0
25.0
005.01
0000
1000
015.00
DandM
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 43
PageRank Algorithm (5)
!!!!!
"
#
$$$$$
%
&
=
!!!!!
"
#
$$$$$
%
&
=
!!!!!
"
#
$$$$$
%
&
=
126.0
0375.0
0375.0
0853.0
,
126.0
0375.0
0375.0
0853.0
,
0375.0
0375.0
0375.0
0375.0
)3()2()1(XXX
0
0 0
00375.0 0375.0
0375.0
0375.00853.0
126.0
0375.0
0375.0
0375.0
0375.0
0853.0
126.0
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 44
PageRank Algorithm (6) About the disadvantages:
If a page is pointed by another one, it mean the pagereceive a vote for the PageRank calculus.
If a page is pointed by a lots of pages, it mean thatthe page is important.
“Only the good pages are pointed by others one”, but: Reciprocal links. If the page A link page B, then page
B will link page A. Link Requirements. Some web pages give electronic
gifts, like a program, document etc., if another pagepoint it.
Near persons community. For instance friends andrelatives that from their pages point another friend orrelative, only for the human relationship between them.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 45
WSM summary
HIST and PageRank use back-links as a meansof adjusting the “worthiness” or “importance”of a page
Both use iterative process over matrix/vectorvalues to reach a convergence point
HITS is query-dependent (thus too expensiveto compute in general)
PageRank is query-independent and consideredmore stable
Just one piece of the overall picture…
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 46
Web Content Mining
The goal is to find useful information from theweb content.
In this sense, WCM is similar to InformationRetrieval (IR).
However, web content is not only free text, otherobjects like pictures, sound and movies belong alsoto the content.
There are two main areas in WCM : the mining of document contents (web page content
mining) the improvement of content search in tools like
search engines (search result mining).
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 47
Issues in Web Content Mining
Developing intelligent tools for IRFinding keywords and key phrasesDiscovering grammatical rules and collocationsHypertext classification/categorizationExtracting key phrases from text documentsLearning extraction models/rulesHierarchical clustering Predicting (words) relationship
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 48
WEBSOM
It is a means for organizing miscellaneous textdocuments into meaningful maps forexploration and search.
It is based on SOM (Self-Organizing Map)that automatically organizes documents into atwo-dimensional grid so that relateddocuments appear close to each other
http://www.cis.hut.fi/websom
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 49
WEBSOMAll words ofdocument aremapped into the wordcategory map
Histogram of “hits” onit is formed
Self-organizing map.Largest experiments have used:• word-category map
315 neurons with 270 inputs each
• Document-map 104040 neurons with 315inputs each
Self-organizing semantic map.15x21 neurons
Interrelated words that have similar contextsappear close to each other on the map
• Training done with 1124134 documents
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 51
Sample search
A new document or anydocument’s descriptioncan be used for findingrelated documents.
The circle on the mapdenote the location of themost representativedocument for thequestion.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 52
How to use the Maps
As filter that notifies theuser of interestingdocuments.
As a searching engine. A new index method.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 53
Extraction of key-textcomponents from web pages
The key-text components are parts of an entiredocument, for instance a paragraph, phrase andinclusive a word, that contain significant informationabout a particular topic, from the web site user point ofview.
A web site keyword is “a word or possibly a set ofwords that make a web page more attractive for aneventual user during his visit to the website''[Velasquez05b].
The assumption is that there exists a correlationbetween the time that the user spent in a page andhis/her interest in its content [Velasquez04b] .
Usually, the keywords in a web site have been relatedwith the ``most frequently used words''.
In [Buyukkokten01] a method to extract keywordsfrom a huge set of web pages is introduced.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 54
WCM summary
The vector space model is a recurrent method torepresent a document as a feature vector.
Because the set of words used in the construction ofthe web site could be a lot, it is necessary to apply astop word cleaning and stemming process.
A web page content is different to a commondocument, in fact, a web page contain semi-structuredtext, i.e., with tags that give additional informationabout the text component.
Also a page could contain pictures, sounds, movies, etc. Sometime the page text content is a short text or
inclusive a set of unconnected words.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 55
Web Usage Mining (WUM)
Also known as Web log mining mining techniques to discover interesting
usage patterns from the secondary dataderived from the interactions of the userswhile surfing the web
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 56
WUM: Considerations [Pierrakos03]
WUM applies traditional data mining methods inorder to analyze usage data.
The sessionization process is necessary to correctthe problems detected in the data.
The goal is to discover patterns in usage dataapplying different kinds of data mining techniques.
Applications of WUM can be grouped in two maincategories: User modeling in adaptive interfaces, known as
personalization. User navigation patterns, in order to improve the
web site structure.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 57
WUM: Applications Target potential customers for electronic commerce Enhance the quality and delivery of Internet
information services to the end user Improve Web server system performance Identify potential prime advertisement locations Facilitates personalization/adaptive sites Improve site design Fraud/intrusion detection Predict user’s actions (allows prefetching)
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 58
Statistics
Several tools use the conventional statistics foranalyzing the user behavior in a web site(http://www.accrue.com/, http://www.netgenesis.com,http://www.webtrends.com).
Each one of them use graphics interfaces, likehistograms, with statistics associate.
For instance amount of clicks per page during thelast month.
By applying conventional statistics on web logs, one canperform different kinds of analysis [Boving04] .
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 59
Statistics (2)
These basic statistics are used to improve the website structure and content in many institutions, overall the commercial ones [Dellmann04] .
Their application is simple, there are many webtraffic analysis tools and it easy to understand thetools's reports.
Although the statistic analysis could seem a verysimple tool to mining the web logs, it resolve a lot ofproblems relate with the performance of the website.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 60
Examples:
60% of clients who accessed /products/, also accessed/products/software/webminer.htm.
30% of clients who accessed /special-offer.html,placed an online order in /products/software/.
(Example from IBM official Olympics Site) {Badminton, Diving} ===> {Table Tennis} (α = 69.7%, σ
= 0.35%)
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 61
Clustering
Groups together a set of items having similarcharacteristics
User Clusters Discover groups of users exhibiting similar browsing
patterns Page recommendation
User’s partial session is classified into a single cluster The links contained in this cluster are recommended
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 62
Clustering
Grouping users with common browsingbehavior and pages with similar content[Srivastava00] .
An important thing in any clusteringtechnique, is the similarity measure or methodused to compare the input vectors for creatingthe clusters.
In the WUM case, the similarities areconstructed basis on the usage data in the website, i.e., the set of pages used or visitedduring the user session.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 63
WUM: Clustering example Clusters of user browsing behavior
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 64
Association Rule Generation Discovers the correlations between pages that are most often
referenced together in a single server session Provide the information
What are the set of pages frequently accessed together byWeb users?
What page will be fetched next? What are paths frequently accessed by Web users?
Association rule A B [ Support = 60%, Confidence = 80% ]Example “50% of visitors who accessed URLs /diplomas.html and
BI/info.html also visited webmining.html”
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 65
Association rules
In WUM, the association rules are focusmainly in the discovery of relations betweenpages visited by the users [Mobasher01] .
For instance, an association rule for a MBAprogram is
mba/seminar.html mba/speakers.html From a different point of view, in
[Schwarzkopf01] a Bayesian network is usedfor defining taxonomic relations betweentopics shown in a web site.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 67
Improving the relationship between theweb site and the user
Recommendations to modify the web sitestructure and content.
Web personalization (online navigationrecommendations).
Intelligent web site.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 68
Web personalization [Lu03,Mombasher00]
It is the process where the web server andthe related applications, dynamically,customize the content for a particular user,based on information about his/her behaviorin a website.
This is different to another related conceptcalled “customization” where the visitorinteracts with the web server using aninterface to create her/his own web site, e.g.,“MyBanking Page”.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 69
Intelligent Web site [Coenen00,Perkowitz98, Velasquez04b]
It represents the main concepts behind the“new portal generation”.
They are systems that “based on the userbehavior, allow to implement changes inthe current web site structure andcontent”.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 70
What we should do? [Mombasher01]
Modeling the visitor behavior. A model for the visitor browsing behavior
must consider the visited pages, the timespent and the pages sequence.
A model for the visitor preferences mustconsider the content of the pages and thetime spent in each one of them in a session.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 72
Visitor browsing behavior (2)
Three variables are considered: the web page path,its content and the time spent when it is visited bya visitor.
The visitor behavior vector is defined and asimilarity measure between visitor session isintroduced.
Where is a component that represents thepage content, its path and percentage of timespent in the page visited.
The vector maintains the page visit order.
)],)...(,[( 11 nntptpv =
r
),(iitp
thi
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 73
Vector space model: A variation proposed
)log(*)1()(in
Qiijij
TR
swfmM +== sw: special words array
TR: Total special words
),...,( 1 Riii mmp ! ),...,( 1 Rjjj mmp !
! !
!==
= =
=
R
k
R
k
kjki
R
k
kjki
mm
mm
ji ppdp
1 1
22
1
)()(
cos),( "ip
jp
!
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 74
An example of variation proposed
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 75
Comparing the sequences [Runkler03,Velasquez04a]
The sequence of anavigation can berepresented by a graph.Each page is identified byan identification number.
!"#
$
=%==
=
=
&&&=
&&&&=
+)()(1
)()(02),(
}7,6,3,1{)(
}8,5,6,2,1{)(
}73,63,31{
}85,52,62,21{
21
21
)||()(||
)||()(||
21
2
1
2
1
21
21
GEGEif
GEGEifGGdG
GE
GE
G
G
GEGE
GEGE 'I
We need to know how similar ordifferent are both sequencerepresentations!!
Notion of similarity
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 76
Comparing sequences: An example
4),(
3),(
"1367"
"12856"
"1367"
"12648"
43
21
4
3
2
1
=
=
=
=
=
=
SSL
SSL
S
S
S
S
] ]
3,021
||)(||||)(||
),(21),(
||})(||||,)(max{||,0
0),(
453
21
2121
21
21
21
=!=
+!=
"#$ %
=
+
GEGE
SSLGGdG
elseGEGE
SSSSL
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 77
Comparing browsing behavior
Then the similarity measure is:
where is an indicator of visitor interest
is the page distance
and dG is a “graph distance”, i.e., how similar are the pathsbetween two sessions and dP is a “page distance” betweenthe content of the visited pages.
!=
=L
k
c
k
c
kt
t
t
t
L
hh ppdpppdGsmk
k
k
k
1
,,1 ),(),min(),(),( "#"# #
"
"
#
"#rr
),min( !
"
"
!
k
k
k
k
t
t
t
t
),( ,,
c
k
c
k ppdp !"
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 78
What is she/he looking for?
Free text. Movies Pictures Sounds
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 79
Visitor preferences
The main focus is to understand the textpreferences.
Web site keywords: the word or the set ofwords that make the web page more attractivefor the visitor.
From the visitor behavior vector, we want toselect the most important pages, assumingthat the degree of importance is correlatedwith the percentage of time spent on eachpage.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 80
Important pages vector [Velasquez05b]
We can suppose that the most important content for thevisitors is included in the web pages that concentrate thegreater amount of time spent in their visit.
Then, in order to know the most important web pagecontent, we can select a component set from v . Letbe the quantity selected.
being the component in v that represents thepage most important and the time spent on it.
This information is obtained after sorting the vcomponents by time and select the first ones.
)],(),...,,[()( 11 !!! "#"#$ =v
!
),(kk!"
thk
!
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 81
Similarity measure using text
),(*},min{1
))(),((1
!"#
"
!
!
"
## $$%
%
%
%
#!&"& kk
k k
k
k
k dpst '=
=
This equation combines the content of the mostimportant visited pages with the time spent oneach one of them by a multiplication.
In this way we can distinguish between two pageswith similar content, but different spent times.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 82
Identifying web site keywords
A clustering algorithm can be applied to find groups ofsimilar visitor sessions.
Based on the clusters centroids, the most importantwords for each cluster will be identified.
Using the vector space model and a method todetermine the most important keywords and theirimportance in each cluster is proposed.
A measure (geometric mean) to calculate theimportance of each word relative to each cluster, isdefined as:
!"
ippmikw
#$=][ : Cluster with vectors! !
Ri ,...,1:
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 83
Applying clustering algorithms[Hinnserbury03, Theodoridis99, Velasquez03
For example, Self Organizing Feature Maps. Schematically, a SOFM is presented as a two-dimensional array in
whose positions the neurons are located. Each neuron is constituted by an n-dimensional vector, whose
components are the synaptic weights. The notion of neighborhood among the neurons provides diverse
topologies. In this case a toroidal topology was used, which means that the
neurons closest to the ones of the superior edge, are located in theinferior and lateral edges.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 85
Bank web site
It belongs to a Chilean virtual bank, i.e., a bankthat doesn't have physical branches and whereall the transactions are made using electronicmeans, like e-mails, portals, etc.
Written in Spanish. 217 static web pages. Approximately eight million raw web logs
registers corresponding to the period Januaryto March, 2003.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 86
Visitor Behavior Vectors
Only 16% of the visitors visit 10 or more pagesand 18% less than 4.
The average of visited pages is 6. With these data, we fixed 6 as the maximum
number of components in a visitor behaviorvector
Finally, applying the above described filters,approximately 200,000 vectors wereidentified.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 88
Cluster analysis
The cluster identification and validation isgenerally a subjective tasks.
It is necessary the support of a businessexpert.
In this case bank business expert. Eight clusters were identified but only four
of the were accepted by the business expert. The criterion is “What cluster make sense for
the business expert”
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 91
Cluster analysis
The cluster must content only pages relatedby main topic.
The business expert define the main topic perpage.
From twelve identified clusters, only eightwere accepted.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 92
ResultsCluster Pages Visited
1 (6,8,190)
2 (100,128.30)
3 (86,150,97)
4 (101,105,1)
5 (3,9,147)
6 (100,126,58)
7 (70,186,137)
8 (157,169,180)
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 93
Cluster analysis Review which words in each page cluster are
the same. Using the vector page definition, it is possible
to get the key words and their relativeimportance in each cluster.
Example
)log(*)1()(in
Qiijij
TR
swfmM +==
319086
}190,8,6{
iiiimmmkw =
=!
!"
ippmikw
#$=][
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 95
Some identified keywords
#
1 Crédito Credit
2 Hipotecario House credit
3 Tarjeta Card
4 Promoción Promotion
5 Concurso Concourse
6 Puntos Points
7 Descuento Discount
8 Cuenta Account
Keywords
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 96
Offline recommendations
Structure. Add links intra clusters, e.g., link from page 150
to page 137. Add link inter cluster, e.g. link from page 105 to
126. Eliminate link, e.g., from page 150 to page 186
Content. To use the web site keywords as linksor contents in the page.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 97
Online recommendations
A current user session is classified into someclusters found.
The online navigation recommendation iscreated as a set of links to pages belonging tothe current web site.
The user can select some links or not. Default case: No recommendation. It is too risky to apply the recommendations in
the real web site. However it is possible tomake a simulation.
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 98
A methodology to test onlinenavigation recommendations
qCl)],(),...,,(),...,,[( 11 HHmm tptptp=!
qm Clp !+1 0,/ ,11,1 >!!" +++ jCllRl qjmmjm
###
)(1 !+mR
Recommendations for ? 1+mp },...,{ 1 npp=
},...,,...,{)( ,1,10,11
!!!! jmkmmm lllR ++++ =1+mp
!
xml
,1+
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 100
Testing the online navigation effectiveness
12
3
fourth page
fifth page
sixth page
0
10
20
30
40
50
60
70
80
90
100
Pe
rce
nta
ge
of
ac
ce
pta
nc
e
Number of page links
per suggestion
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 101
Intelligent web site [Velasquez04b]
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 102
Knowledge Base
Parametric rules
Historic patternrepository
Mining web dataJuan D. Velásquez © 2006 (http://wi.dii.uchile.cl/) 104
Summary
Web data are a real source to analyze theuser behavior in the web.
An important step is cleaning and pre-processing of the web data.
The application of web mining techniquesallows to find unknown patterns.
These patterns must be validated/rejected byan expert in the business under investigation.
We can personalize a web site.