
Data Preparation for Web Usage Analysis

Bamshad Mobasher
DePaul University



Web Usage Mining Revisited

• Web Usage Mining
  - discovery of meaningful patterns from data generated by user access to resources on one or more Web/application servers

• Typical Sources of Data:
  - clickstream data from Web/application server access logs or third-party page tagging services
  - e-commerce and product-oriented user events (e.g., shopping cart changes, product click-throughs, purchases, etc.)
  - user profile data, user ratings, user-contributed data (tags, comments, reviews)
  - product meta-data, page content, site structure

• User Transactions
  - sets or sequences of pageviews, possibly with associated weights
  - a pageview is a set of page files and associated objects that contribute to a single display in a Web browser


Web Usage Mining vs. Web Analytics

• Web Analytics
  - As a general concept, refers to the measurement, analysis, and reporting of user behavior on the Web
  - In practice, usually involves descriptive statistics from clickstream and other user behavior data at different levels of aggregation across predetermined dimensions such as time, content/product categories, referring sites, etc.
  - Many tools and third-party services available (e.g., Google Analytics)
  - Often provides the “biggest bang for the buck”

• Web Usage Mining
  - Goes beyond basic analytics to discover patterns in usage data, identify and characterize important customer segments, find affinities across pages or products, build models to predict future behavior, etc.

Google Analytics

[Figure: Google Analytics screenshots (three slides)]

Web Usage Mining: Going deeper

Techniques:
• Sequence mining
• Markov chains
• Association rules
• Clustering
• Session clustering
• Classification

Goals:
• Prediction of next event
• Discovery of associated events, products, objects
• Discovery of visitor/customer groups with common characteristics
• Discovery of visitor/customer groups with common behavior or common interests
• Characterization of visitors/customers with respect to a set of predefined classes
• Anomaly/attack detection

Common Clickstream Data Sources

• Server Log Files
  - Passive data collection
  - Normal part of web browser/web server transaction
    - Data is always available and does not depend on client setup
  - Data belongs to the organization
    - Fewer data security/privacy concerns due to sharing
    - Access to full data allows for deeper analysis

• Page Tagging
  - Active (client-side) data collection
  - Often requires a third party (a vendor) to implement
    - Vendor supplies page tags, collects the data, and often analyzes the data to generate reports
  - Usually involves adding code (JavaScript) to each page that, when loaded, sends information back to the vendor

Simplified Web Access Layout

HTTP Protocol

• Client sends a request to a server
• Server sends a response to the client
• Connectionless
  - Client:
    - Opens connection to server
    - Sends request
  - Server:
    - Responds to request
    - Closes connection
• Stateless
  - Client and server have no memory of prior connections
  - Server cannot distinguish one client's requests from another's

Cookies

• Used to solve the “statelessness” of the HTTP protocol
  - When an HTTP server responds to a request, it may send additional information that is stored by the client: “state information”
  - When the client makes a request to this server, the client will return the “cookie” that contains its state information
  - State information may be a client ID that can be used as an index to a client data record on the server

• Most common applications for client-side cookies
  - Identify repeat visitors
  - Use third-party ad servers to track users across sites (e.g., using Web “bugs”)

• Drawbacks
  - Can be turned off on the client side
  - Potential privacy concerns, especially with user tracking
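The cookie exchange described on this slide can be sketched in a few lines of Python using the standard `http.cookies` module. This is only an illustrative sketch: the cookie name `client_id`, the in-memory `client_records` store, and the `respond` function are assumptions made for the example, not part of any particular server.

```python
from http.cookies import SimpleCookie
import uuid

# Hypothetical in-memory store: client ID -> server-side client record.
client_records = {}

def respond(cookie_header):
    """Handle one request and return (client_id, set_cookie_value).

    cookie_header is the raw Cookie request header (or None if absent).
    set_cookie_value is None when the client already presented a known ID.
    """
    jar = SimpleCookie(cookie_header or "")
    if "client_id" in jar and jar["client_id"].value in client_records:
        # Repeat visitor: the cookie value indexes the server-side record.
        client_id = jar["client_id"].value
        client_records[client_id]["visits"] += 1
        return client_id, None
    # First visit: issue a new ID and send it back as state information.
    client_id = uuid.uuid4().hex
    client_records[client_id] = {"visits": 1}
    out = SimpleCookie()
    out["client_id"] = client_id
    return client_id, out["client_id"].OutputString()

first_id, set_cookie = respond(None)      # first request carries no cookie
second_id, _ = respond(set_cookie)        # client returns the stored cookie
```

After the second call the server has recognized the same client twice, which is what makes repeat-visitor identification possible despite HTTP being stateless.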


User Tracking via Cookies & Web Bug

[Figure: a client browser (My_Brwsr) requests Page_A.html, Page_B.html, and Page_C.html from Servers A, B, and C. Each page contains URLs and image sources, plus a Web bug image hosted at WBS.TRKSTRM.COM. For each rendered page, the browser requests the Web bug image from the WBS server, sending the Referer header and any cookie for TRKSTRM.com. The WBS server responds with the Web bug image and, on the browser's first request, a cookie (Cookie: My_Brwsr).]

Illustration from Robert J. Boncella, Washburn University

Server Log Files

• Each time a client requests a resource, the server of that resource may record the following in its log files:
  - The name and IP address of the client computer
  - The time of the request
  - The URL that was requested
  - The time it took to send the resource
  - If HTTP authentication is used, the username of the user of the client
  - Status code for errors or successful request
  - The referrer (location where the request originated)
  - The agent: the kind of web browser and operating system that was used
  - The client-side cookies

What’s in a Typical Server Log?

<ip_addr> <base_url> - <date> <method> <file> <protocol> <code> <bytes> <referrer> <user_agent>

203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:21 -0600] "GET /Calls/OWOM.html HTTP/1.0" 200 3942 "http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"

203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:23 -0600] "GET /Calls/Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"

203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:24 -0600] "GET /Calls/Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"

203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:25 -0600] "GET /Calls/Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"

203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:31 -0600] "GET / HTTP/1.0" 200 4980 "" "Mozilla/4.06 [en] (Win95; I)"

203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"

203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"

203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"

203.252.234.33 www.acr-news.org - [01/Jun/1999:03:33:11 -0600] "GET /CP.html HTTP/1.0" 200 3218 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
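The entries above follow the field layout given at the top of this slide, so they can be split into named fields with a regular expression. The following is a minimal sketch, assuming every entry matches that layout exactly; the pattern and the `parse_entry` helper are illustrative names, and real logs would need more tolerant handling.

```python
import re
from datetime import datetime

# One named group per field of the layout shown on the slide.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<host>\S+) - \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<file>\S+) (?P<protocol>\S+)" '
    r'(?P<code>\d{3}) (?P<bytes>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_entry(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["code"] = int(entry["code"])
    entry["bytes"] = int(entry["bytes"])
    # Timestamp format as it appears in the sample, e.g. 01/Jun/1999:03:09:21 -0600
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

sample = (
    '203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:21 -0600] '
    '"GET /Calls/OWOM.html HTTP/1.0" 200 3942 '
    '"http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology&maxhits=20&cat=dir" '
    '"Mozilla/4.5 [en] (Win98; I)"'
)
entry = parse_entry(sample)
```

Once parsed, fields such as `entry["referrer"]` and `entry["agent"]` are directly available to the user identification and sessionization steps discussed later.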


What’s in a Typical Server Log?

1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://dataminingresources.blogspot.com/

2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://maya.cs.depaul.edu/~classes/cs589/papers.html

3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200 318814 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey

4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/

5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html

6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html

Typical Fields in a Log File Entry

client IP address    1.2.3.4
base url             maya.cs.depaul.edu
date/time            2006-02-01 00:08:43
http method          GET
file accessed        /classes/cs589/papers.html
protocol version     HTTP/1.1
status code          200 (successful access)
bytes transferred    9221
referrer page        http://dataminingresources.blogspot.com/
user agent           Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)

In addition, there are fields corresponding to
• login information
• client-side cookies
• session ids issued by the Web or application servers (if any)

Basic Entities in Web Usage Mining

• User (Visitor) - Single individual that is accessing files from one or more Web servers through a browser

• Page File - File that is served through the HTTP protocol

• Pageview - Set of page files that contribute to a single display in a Web browser

• User Session - Set of pageviews served due to a series of HTTP requests from a single user across the entire Web

• Server Session - Set of pageviews served due to a series of HTTP requests from a single user to a single site

• Transaction (Episode) - Subset of pageviews from a single user or server session

Higher-Level Data Abstractions

• Abstractions concerning visitors
• Establishes precise semantics for the concepts
  - Unique Visitor
  - Conversion Rate
  - Abandonment Rate
  - Attrition
  - Loyalty
  - Frequency
  - Recency

Main Challenges in Data Collection and Preprocessing

• Main Questions:
  - what data to collect and how to collect it; what to exclude
  - how to identify unique visitors/users
  - how to identify requests associated with a unique user session (HTTP is “stateless”)
  - how to identify the basic unit of analysis (e.g., pageviews, items purchased, user ratings, events, etc.)
  - how to identify/define user transactions
  - how to integrate data across channels: e-commerce data, clickstream data, user profiles, social media data, product meta-data, etc.

Usage Data Preparation Tasks

• Data cleaning
  - remove irrelevant references and fields in server logs
  - remove references due to spider navigation
  - add missing references due to client-side caching

• Data integration
  - synchronize data from multiple server logs
  - integrate e-commerce and application server data
  - integrate meta-data

• Data transformation
  - pageview identification
  - identification of product-oriented events
  - identification of unique users
  - sessionization: partitioning each user’s record into multiple sessions or transactions (usually representing different visits)
  - integrating meta-data and user profile data with user sessions
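As an illustration of the data cleaning step, the sketch below filters out references to embedded objects (images, stylesheets) and requests from spiders. The suffix list, the user-agent markers, the `ExampleBot/1.0` agent, and the dictionary layout of a log entry are all assumptions chosen for the example; a real site would tune these to its own content and traffic.

```python
import os

# Assumed list of file suffixes that do not represent pageviews on their own.
IRRELEVANT_SUFFIXES = {".gif", ".jpg", ".png", ".css", ".js", ".ico"}
# Assumed substrings that mark a user agent as a spider/crawler.
SPIDER_MARKERS = ("bot", "crawler", "spider")

def is_relevant(entry):
    """Keep an entry only if it is a page request from a non-spider agent."""
    path = entry["file"].split("?", 1)[0]          # drop any query string
    suffix = os.path.splitext(path)[1].lower()
    if suffix in IRRELEVANT_SUFFIXES:
        return False
    agent = entry["agent"].lower()
    return not any(marker in agent for marker in SPIDER_MARKERS)

browser = "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)"
entries = [
    {"file": "/classes/cs480/announce.html", "agent": browser},
    {"file": "/classes/cs480/styles2.css", "agent": browser},
    {"file": "/classes/cs480/header.gif", "agent": browser},
    # Hypothetical spider request, added for illustration only.
    {"file": "/classes/cs480/announce.html", "agent": "ExampleBot/1.0"},
]
cleaned = [e for e in entries if is_relevant(e)]
```

Substring matching on the user agent is only a heuristic: it misses spiders that do not identify themselves, which is why spider detection is often supplemented with behavioral checks.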


Conceptual Representation of User Transactions or Sessions

Rows: sessions/user transactions. Columns: pageviews/objects.

          A    B    C    D    E    F
user0    15    5    0    0    0  185
user1     0    0   32    4    0    0
user2    12    0    0   56  236    0
user3     9   47    0    0    0  134
user4     0    0   23   15    0    0
user5    17    0    0  157   69    0
user6    24   89    0    0    0  354
user7     0    0   78   27    0    0
user8     7    0   45   20  127    0
user9     0   38   57    0    0   15

This is the typical representation of the data, after preprocessing, that is used as input to data mining algorithms. Raw weights may be binary, based on time spent on a page, or based on other measures of user interest in an item. In practice, this data needs to be normalized or standardized.
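One simple way to normalize the session weights is to scale each session vector to unit Euclidean length, so that sessions of very different total duration become comparable. The slides do not prescribe a specific scheme; unit-length scaling is an assumption made for this sketch, and the `unit_length` helper is an illustrative name.

```python
import math

# Session-pageview weights from the table above (columns A..F).
sessions = {
    "user0": [15, 5, 0, 0, 0, 185],
    "user1": [0, 0, 32, 4, 0, 0],
    "user2": [12, 0, 0, 56, 236, 0],
    "user3": [9, 47, 0, 0, 0, 134],
    "user4": [0, 0, 23, 15, 0, 0],
    "user5": [17, 0, 0, 157, 69, 0],
    "user6": [24, 89, 0, 0, 0, 354],
    "user7": [0, 0, 78, 27, 0, 0],
    "user8": [7, 0, 45, 20, 127, 0],
    "user9": [0, 38, 57, 0, 0, 15],
}

def unit_length(row):
    """Scale a weight vector to Euclidean length 1 (all-zero rows are kept)."""
    norm = math.sqrt(sum(w * w for w in row))
    return [w / norm for w in row] if norm else list(row)

normalized = {user: unit_length(row) for user, row in sessions.items()}
```

Other reasonable choices include dividing by the row maximum or standardizing each column; which one fits best depends on the mining algorithm applied afterwards.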


Mechanisms for User Identification

Examples: page tags (JavaScript), some browser plugins

Identifying Users and Sessions

• 1. First partition the log file into “user activity logs”
  - a user activity log is the sequence of pageviews associated with one user, encompassing all of that user's visits to the site
  - can use the methods described earlier
  - most reliable (but not most accurate) is the IP+Agent heuristic

• 2. Apply sessionization heuristics to partition each user activity log into sessions
  - can be based on an absolute maximum time allowed for each session
  - or based on the amount of elapsed time between two pageviews
  - can also use navigation-oriented heuristics based on site topology or the referrer field in the log file

• 3. Path completion to infer cached references:
  - e.g., expanding a session A ==> B ==> C by an access pair (B ==> D) results in: A ==> B ==> C ==> B ==> D
  - to disambiguate paths, sessions are expanded based on heuristics such as the number of back references required to complete the path

Sessionization Heuristics

• Server log L is a list of log entries, each containing
  - timestamp
  - user host identifiers
  - URL request (including URL stem and query)
  - and possibly referrer, agent, cookie, etc.

• User identification and sessionization
  - a user activity log is a sequence of log entries in L belonging to the same user
  - user identification is the process of partitioning L into a set of user activity logs
  - the goal of sessionization is to further partition each user activity log into sequences of entries corresponding to each user visit

• Real vs. Constructed Sessions
  - Conceptually, the log L is partitioned into an ordered collection of “real” sessions R
  - Each heuristic h partitions L into an ordered collection of “constructed sessions” C_h
  - The ideal heuristic h*: C_h* = R

Sessionization Heuristics

• Time-Oriented Heuristics
  - consider boundaries on time spent on individual pages or on the entire site during a single visit
  - boundaries can be based on a maximum session length or on a maximum time allowable for each pageview
  - additional granularity can be obtained by using different boundaries for different (types of) pageviews

• Navigation-Oriented Heuristics
  - take the linkage between pages into account in sessionization
  - “linkage” can be based on site topology (e.g., split a session at a request that could not have been reached from previous requests in the session)
  - “linkage” can also be usage-based (based on referrer information in log entries)
    - usually more restrictive than topology-based heuristics
    - more difficult to implement in frame-based sites

Some Selected Heuristics

• Time-Oriented Heuristics:
  - h1: Total session duration may not exceed a threshold θ. Given t0, the timestamp for the first request in a constructed session S, the request with timestamp t is assigned to S iff t - t0 ≤ θ.
  - h2: Total time spent on a page may not exceed a threshold δ. Given t1, the timestamp for a request assigned to constructed session S, the next request with timestamp t2 is assigned to S iff t2 - t1 ≤ δ.

• Referrer-Based Heuristic:
  - href: Given two consecutive requests p and q, with p belonging to constructed session S, q is assigned to S if the referrer for q was previously invoked in S.

Note: in practice, it is often useful to use a combination of time- and navigation-oriented heuristics in session identification.
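The three heuristics (h1, h2, href) can be expressed as small variations of one splitting loop. The sketch below is one possible reading of the definitions on this slide, not a reference implementation; the `Request` tuple and function names are assumptions. The sample requests are those of the IE4;Win98 user at 1.2.3.4 from the sessionization example later in these slides, with times converted to minutes.

```python
from collections import namedtuple

# t: request time in minutes; url: page requested; ref: referrer ("-" if none)
Request = namedtuple("Request", "t url ref")

def split(requests, belongs):
    """Start a new session whenever belongs(current_session, request) is False."""
    sessions = []
    for req in requests:
        if sessions and belongs(sessions[-1], req):
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions

def h1(requests, theta):
    # Total session duration: compare with the first request of the session.
    return split(requests, lambda s, r: r.t - s[0].t <= theta)

def h2(requests, delta):
    # Time spent on a page: compare with the previous request in the session.
    return split(requests, lambda s, r: r.t - s[-1].t <= delta)

def href(requests):
    # Referrer-based: the referrer must already have been requested in the session.
    return split(requests, lambda s, r: r.ref in {p.url for p in s})

log = [
    Request(22, "A", "-"), Request(25, "C", "A"), Request(33, "B", "C"),
    Request(58, "D", "B"), Request(70, "E", "D"), Request(77, "F", "C"),
]
by_h1 = h1(log, theta=30)
by_h2 = h2(log, delta=10)
by_href = href(log)
```

For this activity log, h1 with a 30-minute threshold produces two sessions while href produces a single one, which shows how strongly the choice of heuristic can change the constructed sessions.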


Inferring User Transactions from Sessions

• Studies show that reference lengths follow a Zipf distribution

• Page types: navigational, content, mixed

• Page types correlate with reference lengths

• Can automatically classify pages as navigational or content using statistical methods

• A transaction can be defined as an intra-session path ending in a content page, or as a set of content pages in a session

[Figure: histogram of page reference lengths (secs), with regions marked for navigational pages and content pages]

Path Completion

- Need knowledge of link structure to complete the navigation path.
- There may be multiple candidates for completing the path. For example, consider the two paths E => D => B => C and E => D => B => A => C.
- In this case, the referrer field allows us to partially disambiguate. But what about E => D => B => A => B => C?
- One heuristic: always take the path that requires the fewest number of “back” references.
- Problem gets much more complicated in frame-based sites.

User’s actual navigation path:

A => B => D => E => D => B => C

What the server log shows:

URL  Referrer
A    --
B    A
D    B
E    D
C    B

[Figure: site topology with pages A (top), B and C (middle), D, E and F (bottom)]
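The "fewest back references" heuristic can be sketched with a stack that models the browser's back button: when a request's referrer is not the page currently on top of the stack, the user is assumed to have stepped back through cached pages until reaching it. This sketch uses only the referrer field and ignores the site's link structure, which is a simplifying assumption; `complete_path` is an illustrative name.

```python
def complete_path(requests):
    """Infer the full navigation path from (url, referrer) pairs in log order.

    A referrer of None marks an entry with no referrer recorded.
    """
    stack = []   # pages on the current forward path (back-button history)
    path = []    # reconstructed navigation path, including cached revisits
    for url, ref in requests:
        if ref in stack:
            # Step back through cached pages until the referrer is on top.
            while stack[-1] != ref:
                stack.pop()
                path.append(stack[-1])
        stack.append(url)
        path.append(url)
    return path

# What the server log shows in the example above.
server_log = [("A", None), ("B", "A"), ("D", "B"), ("E", "D"), ("C", "B")]
full_path = complete_path(server_log)
```

Applied to the logged requests from this slide, the function reinserts the two cached back references (to D and to B) that the server never saw, recovering the user's actual navigation path A, B, D, E, D, B, C.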


Sessionization Example

Time  IP       URL  Ref  Agent
0:01  1.2.3.4  A    -    IE5;Win2k
0:09  1.2.3.4  B    A    IE5;Win2k
0:10  2.3.4.5  C    -    IE4;Win98
0:12  2.3.4.5  B    C    IE4;Win98
0:15  2.3.4.5  E    C    IE4;Win98
0:19  1.2.3.4  C    A    IE5;Win2k
0:22  2.3.4.5  D    B    IE4;Win98
0:22  1.2.3.4  A    -    IE4;Win98
0:25  1.2.3.4  E    C    IE5;Win2k
0:25  1.2.3.4  C    A    IE4;Win98
0:33  1.2.3.4  B    C    IE4;Win98
0:58  1.2.3.4  D    B    IE4;Win98
1:10  1.2.3.4  E    D    IE4;Win98
1:15  1.2.3.4  A    -    IE5;Win2k
1:16  1.2.3.4  C    A    IE5;Win2k
1:17  1.2.3.4  F    C    IE4;Win98
1:26  1.2.3.4  F    C    IE5;Win2k
1:30  1.2.3.4  B    A    IE5;Win2k
1:36  1.2.3.4  D    B    IE5;Win2k

[Figure: site topology with pages A (top), B and C (middle), D, E and F (bottom)]



1. Sort users (based on IP+Agent)

0:01  1.2.3.4  A  -  IE5;Win2k
0:09  1.2.3.4  B  A  IE5;Win2k
0:19  1.2.3.4  C  A  IE5;Win2k
0:25  1.2.3.4  E  C  IE5;Win2k
1:15  1.2.3.4  A  -  IE5;Win2k
1:26  1.2.3.4  F  C  IE5;Win2k
1:30  1.2.3.4  B  A  IE5;Win2k
1:36  1.2.3.4  D  B  IE5;Win2k

0:10  2.3.4.5  C  -  IE4;Win98
0:12  2.3.4.5  B  C  IE4;Win98
0:15  2.3.4.5  E  C  IE4;Win98
0:22  2.3.4.5  D  B  IE4;Win98

0:22  1.2.3.4  A  -  IE4;Win98
0:25  1.2.3.4  C  A  IE4;Win98
0:33  1.2.3.4  B  C  IE4;Win98
0:58  1.2.3.4  D  B  IE4;Win98
1:10  1.2.3.4  E  D  IE4;Win98
1:17  1.2.3.4  F  C  IE4;Win98
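The IP+Agent partitioning step can be sketched as a simple grouping by the (IP, agent) pair. The code below is a minimal illustration using the raw log from this example; the `partition_by_user` helper and the whitespace-separated line format are assumptions made for the sketch.

```python
from collections import defaultdict

# Raw log from the example: time, IP, URL, referrer, agent.
raw_log = """\
0:01 1.2.3.4 A - IE5;Win2k
0:09 1.2.3.4 B A IE5;Win2k
0:10 2.3.4.5 C - IE4;Win98
0:12 2.3.4.5 B C IE4;Win98
0:15 2.3.4.5 E C IE4;Win98
0:19 1.2.3.4 C A IE5;Win2k
0:22 2.3.4.5 D B IE4;Win98
0:22 1.2.3.4 A - IE4;Win98
0:25 1.2.3.4 E C IE5;Win2k
0:25 1.2.3.4 C A IE4;Win98
0:33 1.2.3.4 B C IE4;Win98
0:58 1.2.3.4 D B IE4;Win98
1:10 1.2.3.4 E D IE4;Win98
1:15 1.2.3.4 A - IE5;Win2k
1:16 1.2.3.4 C A IE5;Win2k
1:17 1.2.3.4 F C IE4;Win98
1:26 1.2.3.4 F C IE5;Win2k
1:30 1.2.3.4 B A IE5;Win2k
1:36 1.2.3.4 D B IE5;Win2k
"""

def partition_by_user(lines):
    """Group log entries into user activity logs keyed by (IP, agent)."""
    activity = defaultdict(list)
    for line in lines:
        time, ip, url, ref, agent = line.split()
        activity[(ip, agent)].append((time, url, ref))
    return dict(activity)

users = partition_by_user(raw_log.splitlines())
```

The same IP address (1.2.3.4) yields two distinct users here because the agent strings differ, which is the point of combining the two fields. Note that this sketch keeps the 1:16 entry present in the raw log, which the grouped listing above does not show.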



2. Sessionize using heuristics

User activity log:

0:01  1.2.3.4  A  -  IE5;Win2k
0:09  1.2.3.4  B  A  IE5;Win2k
0:19  1.2.3.4  C  A  IE5;Win2k
0:25  1.2.3.4  E  C  IE5;Win2k
1:15  1.2.3.4  A  -  IE5;Win2k
1:26  1.2.3.4  F  C  IE5;Win2k
1:30  1.2.3.4  B  A  IE5;Win2k
1:36  1.2.3.4  D  B  IE5;Win2k

Session 1:

0:01  1.2.3.4  A  -  IE5;Win2k
0:09  1.2.3.4  B  A  IE5;Win2k
0:19  1.2.3.4  C  A  IE5;Win2k
0:25  1.2.3.4  E  C  IE5;Win2k

Session 2:

1:15  1.2.3.4  A  -  IE5;Win2k
1:26  1.2.3.4  F  C  IE5;Win2k
1:30  1.2.3.4  B  A  IE5;Win2k
1:36  1.2.3.4  D  B  IE5;Win2k

The h1 heuristic (with timeout variable of 30 minutes) will result in the two sessions given above.

How about the heuristic href?
How about heuristic h2 with a timeout variable of 10 minutes?


2. Sessionize using heuristics (another example)

In this case, the referrer-based heuristic will result in a single session, while the h1 heuristic (with timeout = 30 minutes) will result in two different sessions.

How about heuristic h2 with timeout = 10 minutes?


3. Perform Path Completion

[Figure: site topology with pages A (top), B and C (middle), D, E and F (bottom)]

A=>C, C=>B, B=>D, D=>E, C=>F

Need to look for the shortest backwards path from E to C based on the site topology. Note, however, that the elements of the path need to have occurred in the user trail previously.

E=>D, D=>B, B=>C

E-Commerce Data

• Integrating E-Commerce and Usage Data
  - Needed for analyzing relationships between navigational patterns of visitors and business questions such as profitability, customer value, product placement, etc.
  - E-business / Web Analytics
  - E.g., tracking and analyzing conversion of browsers to buyers

• E-Commerce Event Models
  - Major difficulty for e-commerce events is defining and implementing the events for a particular site
  - Events may involve a collection or sequence of actions by a user, possibly involving multiple pageviews or interactions with applications
  - Typical product-oriented events:
    - View
    - Click-through
    - Shopping Cart Change
    - Buy or Bid

Content and Structure Preprocessing

• Processing the content and structure of the site is often essential for successful usage analysis

• Two primary tasks:
  - determine what constitutes a unique content item (i.e., pageview, product, content category)
  - represent content and structure of the items in a quantifiable form

• Basic elements in content and structure processing
  - creation of a site map
    - captures linkage and frame structure of the site
    - also needs to identify script templates for dynamically generated pages
  - extracting important content elements in pages
    - meta-information, keywords, internal and external links, etc.
  - identifying and classifying pages based on their content and structural characteristics

Data Preparation Tasks for Mining Content Data

• Extract relevant features from text and meta-data
  - meta-data is required for product-oriented pages
  - keywords are extracted from content-oriented pages
  - weights are associated with features based on domain knowledge and/or text frequency (e.g., tf.idf weighting)
  - the integrated data can be captured in the XML representation of each pageview

• Feature representation for pageviews
  - each pageview p is represented as a k-dimensional feature vector, where k is the total number of extracted features from the site in a global dictionary
  - feature vectors obtained are organized into an inverted file structure containing a dictionary of all extracted features and posting files for pageviews

Basic Automatic Text Processing

• Parse documents to recognize structure
  - e.g., title, date, other fields

• Scan for word tokens
  - lexical analysis to recognize keywords, numbers, special characters, etc.

• Stopword removal
  - common words such as “the”, “and”, “or” which are not semantically meaningful in a document

• Stem words
  - morphological processing to group word variants such as plurals (e.g., “compute”, “computer”, “computing”, … can be represented by the stem “comput”)

• Weight words
  - using frequency in documents and across documents

• Store index
  - Stored in a Term-Document Matrix (“inverted index”) which stores each document as a vector of keyword weights

Inverted Indexes

An Inverted File is essentially a vector file “inverted” so that rows become columns and columns become rows.

docs  t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Terms  D1  D2  D3  D4  D5  D6  D7  …
t1      1   1   0   1   1   1   0
t2      0   0   1   0   1   1   1
t3      1   0   1   0   1   0   0

Term weights can be:
- Binary
- Raw Frequency in document (Text Frequency)
- Normalized Frequency
- TF x IDF

How Inverted Indexes Are Created

• Sorted Array Implementation
  - Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country

Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2

How Inverted Files are Created

Then the file can be split into a Dictionary and a Postings file.

Sorted file with term frequencies (the Doc # and Freq columns form the Postings file):

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2

Dictionary:

Term      N docs  Tot Freq
a         1       1
aid       1       1
all       1       1
and       1       1
come      1       1
country   2       2
dark      1       1
for       1       1
good      1       1
in        1       1
is        1       1
it        1       1
manor     1       1
men       1       1
midnight  1       1
night     1       1
now       1       1
of        1       1
past      1       1
stormy    1       1
the       2       4
their     1       1
time      2       2
to        1       2
was       1       2

Notes: The links between postings for a term are usually implemented as a linked list. The dictionary is enhanced with some term statistics such as document frequency and the total frequency in the collection.
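The dictionary and postings structures for the two sample documents can be built with a short script. This is a sketch under simple assumptions: tokens are lowercased runs of letters, postings are kept in plain Python lists rather than linked lists, and `build_index` is an illustrative name.

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

def build_index(documents):
    """Return (dictionary, postings) for a {doc_id: text} collection."""
    postings = defaultdict(list)      # term -> [(doc_id, freq), ...]
    for doc_id in sorted(documents):
        tokens = re.findall(r"[a-z]+", documents[doc_id].lower())
        for term, freq in Counter(tokens).items():
            postings[term].append((doc_id, freq))
    # Dictionary entry per term: document frequency and total frequency.
    dictionary = {
        term: {"n_docs": len(plist), "tot_freq": sum(f for _, f in plist)}
        for term, plist in sorted(postings.items())
    }
    return dictionary, dict(postings)

dictionary, postings = build_index(docs)
```

Looking up a term such as "the" in `dictionary` gives its collection statistics, while `postings["the"]` lists the documents that contain it together with the per-document frequency.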


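Assigning Weights

• tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf)
  - Want to weight terms highly if they are
    - frequent in relevant documents … BUT
    - infrequent in the collection as a whole

• Goal: assign a tf x idf weight to each term in each document

$$
\begin{aligned}
T_k &= \text{term } k \text{ in document } D_i \\
tf_{ik} &= \text{frequency of term } T_k \text{ in document } D_i \\
idf_k &= \text{inverse document frequency of term } T_k \text{ in } C \\
N &= \text{total number of documents in the collection } C \\
n_k &= \text{the number of documents in } C \text{ that contain } T_k \\
idf_k &= \log\left(\frac{N}{n_k}\right)
\end{aligned}
$$

Examples with N = 10000 (base-10 logarithms):

$$
\log\left(\frac{10000}{1}\right) = 4 \qquad
\log\left(\frac{10000}{20}\right) = 2.698 \qquad
\log\left(\frac{10000}{5000}\right) = 0.301 \qquad
\log\left(\frac{10000}{10000}\right) = 0
$$

$$
w_{ik} = tf_{ik} \times \log_2\left(\frac{N}{n_k}\right)
$$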
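The idf examples and the weight formula on this slide can be checked numerically. The sketch below assumes base-10 logarithms for the idf examples (which is what the slide's values imply) and base 2 for the weight, as written in the weight formula; the function names are illustrative.

```python
import math

def idf(n_total, n_with_term):
    """Inverse document frequency of a term, base-10 as in the slide examples."""
    return math.log10(n_total / n_with_term)

def weight(tf, n_total, n_with_term):
    """tf x idf weight of a term in a document, using the base-2 form."""
    return tf * math.log2(n_total / n_with_term)

N = 10000
# Document frequencies used in the slide's examples.
examples = {n_k: idf(N, n_k) for n_k in (1, 20, 5000, 10000)}
```

A term that occurs in every document gets an idf of zero, so its weight vanishes regardless of how often it appears in any single document, while a term found in only a handful of documents is weighted heavily.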


Example: Discovery of “Content Profiles”

• Content Profiles
  - Represent concept groups within a Web site or among a collection of documents
  - Can be represented as overlapping collections of pageview-weight pairs
  - Instead of clustering documents, we cluster features (keywords) over the n-dimensional space of pageviews (see the term clustering example of previous lecture)
  - for each feature cluster, derive a content profile by collecting pageviews in which these features appear as significant (this is the centroid of the cluster, but we only keep elements in the centroid whose mean weight is greater than a threshold)

• Example Content Profiles from the ACR Site:

Weight  Pageview ID                                   Significant Features (stems)
1.00    CFP: One World One Market                     world challeng busi co manag global
0.63    CFP: Int'l Conf. on Marketing & Development   challeng co contact develop intern
0.35    CFP: Journal of Global Marketing              busi global
0.32    CFP: Journal of Consumer Psychology           busi manag global

Weight  Pageview ID                                        Significant Features (stems)
1.00    CFP: Journal of Psych. & Marketing                 psychologi consum special market
1.00    CFP: Journal of Consumer Psychology I              psychologi journal consum special market
0.72    CFP: Journal of Global Marketing                   journal special market
0.61    CFP: Journal of Consumer Psychology II             psychologi journal consum special
0.50    CFP: Society for Consumer Psychology               psychologi consum special
0.50    CFP: Conf. on Gender, Market., Consumer Behavior   journal consum market

How Content Profiles Are Generated

1. Extract important features (e.g., word stems) from each document:

icmd.html

Feature  Freq
confer   12
market   9
develop  9
intern   5
ghana    3
ismd     3
contact  3
…        …

jcp.html

Feature     Freq
psychologi  11
consum      9
journal     6
manuscript  5
cultur      5
special     4
issu        4
paper       4
…           …

2. Build a global dictionary of all features (words) along with relevant statistics

Total Documents = 41

Feature-id  Doc-freq  Total-freq  Feature
0           6         44          1997
1           12        59          1998
2           13        76          1999
3           8         41          2000
…           …         …           …
123         26        271         confer
124         9         24          consid
125         23        165         consum
…           …         …           …
439         7         45          psychologi
440         14        78          public
441         11        61          publish
…           …         …           …
549         1         6           vision
550         3         8           volunt
551         1         9           vot
552         4         23          vote
553         3         17          web
…           …         …           …

How Content Profiles Are Generated

3. Construct a document-word matrix with normalized tf-idf weights

doc-id/feature-id  0     1     2     3     4     5     …
0                  0.27  0.43  0.00  0.00  0.00  0.00  …
1                  0.07  0.10  0.00  0.00  0.00  0.00  …
2                  0.00  0.06  0.07  0.03  0.00  0.00  …
3                  0.00  0.00  0.00  0.00  0.00  0.00  …
4                  0.00  0.00  0.00  0.00  0.00  0.00  …
5                  0.00  0.00  0.05  0.06  0.00  0.00  …
6                  0.17  0.10  0.07  0.03  0.03  0.00  …
7                  0.14  0.09  0.08  0.02  0.02  0.00  …
8                  0.00  0.00  0.10  0.00  0.00  0.00  …
9                  0.00  0.07  0.00  0.00  0.00  0.00  …
10                 0.02  0.02  0.00  0.00  0.00  0.00  …
11                 0.00  0.00  0.00  0.00  0.00  0.00  …
12                 0.00  0.00  0.00  0.00  0.00  0.00  …
13                 0.00  0.00  0.00  0.00  0.00  0.00  …
14                 0.00  0.00  0.00  0.00  0.00  0.00  …
15                 0.00  0.00  0.32  0.38  0.00  0.00  …
…                  …     …     …     …     …     …     …

4. Now we can perform clustering on words (or documents) using one of the techniques described earlier (e.g., k-means clustering on features).

How Content Profiles Are Generated

Examples of feature (word) clusters obtained using k-means:

CLUSTER 0
----------
anthropologi
anthropologist
appropri
associ
behavior
...

CLUSTER 4
----------
consum
issu
journal
market
psychologi
special

CLUSTER 10
----------
ballot
result
vot
vote
...

CLUSTER 11
----------
advisori
appoint
committe
council
...

5. Content profiles are now generated from feature clusters based on the centroid of each cluster (similar to usage profiles, but we have words instead of users/sessions).


Content Enhanced User Transactions

• Essentially combines usage and content profiling techniques discussed earlier

• Basic Idea:
  - for each user/session, extract important features of the selected documents/items
  - based on the global dictionary, create a user-feature matrix
  - each row is a feature vector representing significant terms associated with documents/items selected by the user in a given session
  - weight can be determined as before (e.g., using tf.idf measure)

• Applications:
  - Can analyze user behavior at a more granular level of concepts or keywords associated with items purchased, pages visited, etc.
  - Can create user segments based on their common underlying interests
  - Help explain emerging patterns in user behavior data

User transaction matrix UT

        A.html  B.html  C.html  D.html  E.html
user1   1       0       1       0       1
user2   1       1       0       0       1
user3   0       1       1       1       0
user4   1       0       1       1       1
user5   1       1       0       0       1
user6   1       0       1       1       1

Feature-Document Matrix FP

              A.html  B.html  C.html  D.html  E.html
web           0       0       1       1       1
data          0       1       1       1       0
mining        0       1       1       1       0
business      1       1       0       0       0
intelligence  1       1       0       0       1
marketing     1       1       0       0       1
ecommerce     0       1       1       0       0
search        1       0       1       0       0
information   1       0       1       1       1
retrieval     1       0       1       1       1

Content Enhanced Transactions

User-Feature Matrix UF

        web  data  mining  business  intelligence  marketing  ecommerce  search  information  retrieval
user1   2    1     1       1         2             2          1          2       3            3
user2   1    1     1       2         3             3          1          1       2            2
user3   2    3     3       1         1             1          2          1       2            2
user4   3    2     2       1         2             2          1          2       4            4
user5   1    1     1       2         3             3          1          1       2            2
user6   3    2     2       1         2             2          1          2       4            4

Note that: UF = UT x FP^T

Example: users 4 and 6 are more interested in concepts related to Web information retrieval, while user 3 is more interested in data mining.
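The user-feature matrix can be reproduced by multiplying the user transaction matrix UT by the transpose of the feature-document matrix FP. The sketch below uses plain Python dictionaries keyed by user and feature; this representation and the `user_feature_matrix` helper are choices made for illustration.

```python
# Columns of both matrices: A.html, B.html, C.html, D.html, E.html
UT = {
    "user1": [1, 0, 1, 0, 1],
    "user2": [1, 1, 0, 0, 1],
    "user3": [0, 1, 1, 1, 0],
    "user4": [1, 0, 1, 1, 1],
    "user5": [1, 1, 0, 0, 1],
    "user6": [1, 0, 1, 1, 1],
}
FP = {
    "web":          [0, 0, 1, 1, 1],
    "data":         [0, 1, 1, 1, 0],
    "mining":       [0, 1, 1, 1, 0],
    "business":     [1, 1, 0, 0, 0],
    "intelligence": [1, 1, 0, 0, 1],
    "marketing":    [1, 1, 0, 0, 1],
    "ecommerce":    [0, 1, 1, 0, 0],
    "search":       [1, 0, 1, 0, 0],
    "information":  [1, 0, 1, 1, 1],
    "retrieval":    [1, 0, 1, 1, 1],
}

def user_feature_matrix(ut, fp):
    """Compute UF = UT x FP^T: each cell is the dot product of a user row
    in UT with a feature row in FP (both indexed by page)."""
    return {
        user: {feat: sum(u * f for u, f in zip(urow, frow))
               for feat, frow in fp.items()}
        for user, urow in ut.items()
    }

UF = user_feature_matrix(UT, FP)
```

Each cell counts how many of the pages visited by a user contain the given feature, so larger values indicate stronger interest in that concept.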

Architectural Framework for Web Usage Mining

[Figure: architecture diagram with the following components: Web/Application Server Logs; Preprocessing/Sessionization Module; Site Content; Content Analysis Module; Site Map; Site Dictionary; Integrated Sessionized Data; Operational Database (customers, orders, products); Data Integration Module; E-Commerce Data Mart; Data Cube; Data Mining Engine; OLAP Tools; Usage Analysis; Pattern Analysis; OLAP Analysis]

Web Usage Mining as a Process
