Data Preparation for Web Usage Analysis
Bamshad Mobasher
DePaul University
Web Usage Mining Revisited
• Web Usage Mining
  - discovery of meaningful patterns from data generated by user access to resources on one or more Web/application servers
• Typical Sources of Data:
  - clickstream data from Web/application server access logs or third-party page tagging services
  - e-commerce and product-oriented user events (e.g., shopping cart changes, product click-throughs, purchases)
  - user profile data, user ratings, user-contributed data (tags, comments, reviews)
  - product meta-data, page content, site structure
• User Transactions
  - sets or sequences of pageviews, possibly with associated weights
  - a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
Web Usage Mining vs. Web Analytics
• Web Analytics
  - As a general concept, refers to the measurement, analysis, and reporting of user behavior on the Web
  - In practice, usually involves descriptive statistics from clickstream and other user behavior data at different levels of aggregation across predetermined dimensions such as time, content/product categories, referring sites, etc.
  - Many tools and third-party services available (e.g., Google Analytics)
  - Often provides the “biggest bang for the buck”
• Web Usage Mining
  - Goes beyond basic analytics to discover patterns in usage data, identify and characterize important customer segments, find affinities across pages or products, build models to predict future behavior, etc.
[Figures: Google Analytics screenshots]
Web Usage Mining: Going Deeper
• Techniques: sequence mining, Markov chains, association rules, clustering, session clustering, classification
• Tasks they support:
  - prediction of next event
  - discovery of associated events, products, objects
  - discovery of visitor/customer groups with common characteristics
  - discovery of visitor/customer groups with common behavior or common interests
  - characterization of visitors/customers with respect to a set of predefined classes
  - anomaly/attack detection
Common Clickstream Data Sources
• Server Log Files
  - Passive data collection
  - Normal part of web browser/web server transaction
    * Data is always available and does not depend on client setup
  - Data belongs to the organization
    * Fewer data security/privacy concerns due to sharing
    * Access to full data allows for deeper analysis
• Page Tagging
  - Active (client-side) data collection
  - Often requires a third party (a vendor) to implement
    * Vendor supplies page tags, collects the data, and often analyzes the data to generate reports
  - Usually involves adding code (JavaScript) to each page that, when loaded, sends information back to the vendor

[Figure: Simplified Web Access Layout]
HTTP Protocol
• Client sends a request to a server
• Server sends a response to the client
• Connectionless
  - Client:
    * Opens connection to server
    * Sends request
  - Server:
    * Responds to request
    * Closes connection
• Stateless
  - Client and server have no memory of prior connections
  - Server cannot distinguish one client's requests from another client's
Cookies
• Used to solve the “statelessness” of the HTTP protocol
  - When an HTTP server responds to a request, it may send additional information that is stored by the client (“state information”)
  - When the client makes a later request to this server, it returns the “cookie” that contains its state information
  - State information may be a client ID that can be used as an index to a client data record on the server
• Most common applications for client-side cookies
  - Identify repeat visitors
  - Use third-party ad servers to track users across sites (e.g., using Web “bugs”)
• Drawbacks
  - Can be turned off on the client side
  - Potential privacy concerns, especially with user tracking
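
The cookie round trip described above can be sketched with Python's standard `http.cookies` module. This is only an illustration: the cookie name `client_id`, the value `u-1001`, and the `client_records` store are hypothetical, and no real network traffic is involved.

```python
from http.cookies import SimpleCookie

# Server side: attach a (hypothetical) client ID to the response.
response_cookie = SimpleCookie()
response_cookie["client_id"] = "u-1001"
set_cookie_header = response_cookie.output(header="Set-Cookie:")

# Client side: the browser stores the pair and sends it back on later requests.
request_header_value = "client_id=u-1001"

# Server side again: parse the returned cookie and use the ID as an index
# into a (hypothetical) server-side client data record.
returned = SimpleCookie()
returned.load(request_header_value)
client_records = {"u-1001": {"visits": 3}}
record = client_records[returned["client_id"].value]
```

The server keeps the actual state; the cookie only carries the key used to find it.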
User Tracking via Cookies & Web Bugs
[Figure: a client browser (My_Brwsr) requests Page_A.html, Page_B.html, and Page_C.html from Servers A, B, and C. Each page contains URLs, image sources, and a web bug image hosted at WBS.TRKSTRM.COM. When rendering each page, the browser requests the web bug image from the web bug server (WBS), sending the Referer header and any cookie for TRKSTRM.com. The WBS responds with the image and, on the browser's first request, a cookie (My_Brwsr).]
Illustration from Robert J. Boncella, Washburn University
Server Log Files
• Each time a client requests a resource, the server of that resource may record the following in its log files:
  - The name and IP address of the client computer
  - The time of the request
  - The URL that was requested
  - The time it took to send the resource
  - If HTTP authentication is used, the username of the user of the client
  - Status code for errors or successful request
  - The referrer (location where the request originated)
  - The agent: the kind of web browser and operating system that was used
  - The client-side cookies
What’s in a Typical Server Log?
<ip_addr> <base_url> - <date> <method> <file> <protocol> <code> <bytes> <referrer> <user_agent>
203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:21 -0600] "GET /Calls/OWOM.html HTTP/1.0" 200 3942 "http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:23 -0600] "GET /Calls/Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:24 -0600] "GET /Calls/Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 www.acr-news.org - [01/Jun/1999:03:09:25 -0600] "GET /Calls/Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:31 -0600] "GET / HTTP/1.0" 200 4980 "" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:32:35 -0600] "GET /Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 www.acr-news.org - [01/Jun/1999:03:33:11 -0600] "GET /CP.html HTTP/1.0" 200 3218 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
What’s in a Typical Server Log?
1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://dataminingresources.blogspot.com/
2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://maya.cs.depaul.edu/~classes/cs589/papers.html
3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200 318814 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey
4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/
5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html
6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html
Typical Fields in a Log File Entry
client IP address    1.2.3.4
base url             maya.cs.depaul.edu
date/time            2006-02-01 00:08:43
http method          GET
file accessed        /classes/cs589/papers.html
protocol version     HTTP/1.1
status code          200 (successful access)
bytes transferred    9221
referrer page        http://dataminingresources.blogspot.com/
user agent           Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)

In addition, there are fields corresponding to
• login information
• client-side cookies
• session ids issued by the Web or application servers (if any)
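
As a rough sketch, a whitespace-delimited entry like the first one in the sample log can be split into the fields listed above. The field names and the `parse_entry` helper are illustrative, and the meaning of the two "-" columns (assumed here to be the authenticated user name and the URL query) is an assumption, since the slides do not label them.

```python
# Illustrative field names, in the order they appear in the sample entry.
# The "username" and "query" labels for the two "-" columns are assumptions.
FIELDS = ["date", "time", "client_ip", "username", "method", "file", "query",
          "status", "bytes", "protocol", "base_url", "user_agent", "referrer"]

def parse_entry(line):
    """Split one log line on whitespace and label each value."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    entry = dict(zip(FIELDS, values))
    entry["status"] = int(entry["status"])
    entry["bytes"] = int(entry["bytes"])
    return entry

line = ("2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 "
        "HTTP/1.1 maya.cs.depaul.edu "
        "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) "
        "http://dataminingresources.blogspot.com/")
entry = parse_entry(line)
```

This works only because the user agent in this log format encodes spaces as "+"; quoted formats such as the 1999 example need a different parser.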
Basic Entities in Web Usage Mining
• User (Visitor) - Single individual that is accessing files from one or more Web servers through a browser
• Page File - File that is served through the HTTP protocol
• Pageview - Set of page files that contribute to a single display in a Web browser
• User Session - Set of pageviews served due to a series of HTTP requests from a single user across the entire Web
• Server Session - Set of pageviews served due to a series of HTTP requests from a single user to a single site
• Transaction (Episode) - Subset of pageviews from a single user or server session
Higher-Level Data Abstractions
• Abstractions concerning visitors
• Establishes precise semantics for the concepts
  - Unique Visitor
  - Conversion Rate
  - Abandonment Rate
  - Attrition
  - Loyalty
  - Frequency
  - Recency
Main Challenges in Data Collection and Preprocessing
• Main Questions:
  - what data to collect and how to collect it; what to exclude
  - how to identify unique visitors/users
  - how to identify requests associated with a unique user session (HTTP is “stateless”)
  - how to identify the basic unit of analysis (e.g., pageviews, items purchased, user ratings, events)
  - how to identify/define user transactions
  - how to integrate data across channels: e-commerce data, clickstream data, user profiles, social media data, product meta-data, etc.
Usage Data Preparation Tasks
• Data cleaning
  - remove irrelevant references and fields in server logs
  - remove references due to spider navigation
  - add missing references due to client-side caching
• Data integration
  - synchronize data from multiple server logs
  - integrate e-commerce and application server data
  - integrate meta-data
• Data transformation
  - pageview identification
  - identification of product-oriented events
  - identification of unique users
  - sessionization: partitioning each user’s record into multiple sessions or transactions (usually representing different visits)
  - integrating meta-data and user profile data with user sessions
Conceptual Representation of User Transactions or Sessions
Rows are sessions/user transactions; columns are pageviews/objects.

        A    B    C    D    E    F
user0   15   5    0    0    0    185
user1   0    0    32   4    0    0
user2   12   0    0    56   236  0
user3   9    47   0    0    0    134
user4   0    0    23   15   0    0
user5   17   0    0    157  69   0
user6   24   89   0    0    0    354
user7   0    0    78   27   0    0
user8   7    0    45   20   127  0
user9   0    38   57   0    0    15

This is the typical representation of the data, after preprocessing, that is used as input to data mining algorithms. Raw weights may be binary, based on time spent on a page, or based on other measures of user interest in an item. In practice, this data needs to be normalized or standardized.
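
One simple way to normalize such weights, shown here only as a sketch, is to divide each session's weights by that session's largest weight so every row falls in [0, 1]. The choice of max-scaling and the helper name `normalize_rows` are assumptions; the slides do not prescribe a particular normalization.

```python
def normalize_rows(matrix):
    """Scale each session's weights into [0, 1] by dividing by the row maximum."""
    normalized = []
    for row in matrix:
        top = max(row)
        # An all-zero row stays all zero instead of dividing by zero.
        normalized.append([w / top if top else 0.0 for w in row])
    return normalized

# First three rows of the session-pageview matrix (columns A..F).
sessions = [
    [15, 5, 0, 0, 0, 185],   # user0
    [0, 0, 32, 4, 0, 0],     # user1
    [12, 0, 0, 56, 236, 0],  # user2
]
scaled = normalize_rows(sessions)
```

After scaling, a long visit to one page no longer dominates distance computations between sessions purely because of its raw duration.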
Mechanisms for User Identification
Examples: page tags (JavaScript), some browser plugins
Identifying Users and Sessions
1. First partition the log file into “user activity logs”
  - a user activity log is a sequence of pageviews associated with one user, encompassing all of that user's visits to the site
  - can use the methods described earlier
  - most reliable (but not most accurate) is the IP+Agent heuristic
2. Apply sessionization heuristics to partition each user activity log into sessions
  - can be based on an absolute maximum time allowed for each session
  - or based on the amount of elapsed time between two pageviews
  - can also use navigation-oriented heuristics based on site topology or the referrer field in the log file
3. Path completion to infer cached references
  - e.g., expanding a session A ==> B ==> C by an access pair (B ==> D) results in A ==> B ==> C ==> B ==> D
  - to disambiguate paths, sessions are expanded based on heuristics such as the number of back references required to complete the path
Sessionization Heuristics
• Server log L is a list of log entries, each containing
  - timestamp
  - user host identifiers
  - URL request (including URL stem and query)
  - and possibly referrer, agent, cookie, etc.
• User identification and sessionization
  - a user activity log is a sequence of log entries in L belonging to the same user
  - user identification is the process of partitioning L into a set of user activity logs
  - the goal of sessionization is to further partition each user activity log into sequences of entries corresponding to each user visit
• Real vs. Constructed Sessions
  - Conceptually, the log L is partitioned into an ordered collection of “real” sessions R
  - Each heuristic h partitions L into an ordered collection of “constructed sessions” C_h
  - The ideal heuristic h*: C_h* = R
Sessionization Heuristics
• Time-Oriented Heuristics
  - consider boundaries on time spent on individual pages or in the entire site during a single visit
  - boundaries can be based on a maximum session length or on a maximum time allowable for each pageview
  - additional granularity can be obtained by using different boundaries for different (types of) pageviews
• Navigation-Oriented Heuristics
  - take the linkage between pages into account in sessionization
  - “linkage” can be based on site topology (e.g., split a session at a request that could not have been reached from previous requests in the session)
  - “linkage” can also be usage-based (based on referrer information in log entries)
    * usually more restrictive than topology-based heuristics
    * more difficult to implement in frame-based sites
Some Selected Heuristics
• Time-Oriented Heuristics:
  - h1: Total session duration may not exceed a threshold θ. Given t0, the timestamp for the first request in a constructed session S, the request with timestamp t is assigned to S iff t - t0 ≤ θ.
  - h2: Total time spent on a page may not exceed a threshold δ. Given t1, the timestamp for a request assigned to constructed session S, the next request with timestamp t2 is assigned to S iff t2 - t1 ≤ δ.
• Referrer-Based Heuristic:
  - href: Given two consecutive requests p and q, with p belonging to constructed session S, q is assigned to S if the referrer for q was previously invoked in S.

Note: in practice, it is often useful to use a combination of time- and navigation-oriented heuristics in session identification.
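
A minimal sketch of the three heuristics, applied to a single user activity log, might look as follows. The tuple layout `(time, url, referrer)`, the `h:mm` time format, the function names, and the sample log are all assumptions made for illustration; the referrer-based version implements only the simple rule stated above.

```python
def minutes(hmm):
    """Convert an 'h:mm' timestamp into minutes."""
    h, m = hmm.split(":")
    return int(h) * 60 + int(m)

def sessionize_h1(requests, theta):
    """h1: a request joins the session iff t - t0 <= theta (t0 = first request)."""
    sessions = []
    for req in requests:
        if sessions and minutes(req[0]) - minutes(sessions[-1][0][0]) <= theta:
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions

def sessionize_h2(requests, delta):
    """h2: a request joins the session iff t2 - t1 <= delta (t1 = previous request)."""
    sessions = []
    for req in requests:
        if sessions and minutes(req[0]) - minutes(sessions[-1][-1][0]) <= delta:
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions

def sessionize_href(requests):
    """href: a request joins the session iff its referrer was already requested in it."""
    sessions = []
    for req in requests:
        seen = {r[1] for r in sessions[-1]} if sessions else set()
        if req[2] in seen:
            sessions[-1].append(req)
        else:
            sessions.append([req])
    return sessions

# Hypothetical activity log for one user: (time, url, referrer).
log = [("0:00", "A", "-"), ("0:05", "B", "A"), ("0:20", "C", "B"),
       ("0:50", "D", "C"), ("0:55", "E", "D")]

by_h1 = sessionize_h1(log, theta=30)
by_h2 = sessionize_h2(log, delta=10)
by_href = sessionize_href(log)
```

On this made-up log the three heuristics disagree about where sessions begin and end, which is one reason combinations of them are used in practice.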
Inferring User Transactions from Sessions
• Studies show that reference lengths follow a Zipf distribution
• Page types: navigational, content, mixed
• Page types correlate with reference lengths
• Can automatically classify pages as navigational or content using statistical methods
• A transaction can be defined as an intra-session path ending in a content page, or as a set of content pages in a session
[Figure: histogram of page reference lengths (secs), with regions labeled for navigational pages and content pages]
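
The first definition of a transaction can be sketched as follows, assuming each pageview carries a reference length in seconds and that a fixed cutoff separates navigational from content pages. The cutoff value, the helper name `split_transactions`, and the sample session are hypothetical; in practice the cutoff would be estimated statistically from the reference-length histogram.

```python
def split_transactions(session, cutoff_secs):
    """Split a session into paths that each end in a content page.

    A pageview is treated as a content page when its reference length
    (time spent on it) is at least cutoff_secs.
    """
    transactions, current = [], []
    for page, ref_length in session:
        current.append(page)
        if ref_length >= cutoff_secs:
            transactions.append(current)
            current = []
    if current:
        # Trailing navigational pages are kept as a final partial path.
        transactions.append(current)
    return transactions

# Hypothetical session: (page, reference length in seconds).
session = [("A", 5), ("B", 8), ("D", 120), ("B", 4), ("C", 6), ("F", 90)]
paths = split_transactions(session, cutoff_secs=30)
```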
Path Completion
User’s actual navigation path: A => B => D => E => D => B => C
What the server log shows:

URL   Referrer
A     --
B     A
D     B
E     D
C     B

[Figure: site topology with A at the top level; B and C at the second level; D, E, and F at the third level]

• Need knowledge of link structure to complete the navigation path.
• There may be multiple candidates for completing the path. For example, consider the two paths E => D => B => C and E => D => B => A => C.
• In this case, the referrer field allows us to partially disambiguate. But what about E => D => B => A => B => C?
• One heuristic: always take the path that requires the fewest “back” references.
• The problem gets much more complicated in frame-based sites.
Sessionization Example

Time   IP        URL   Ref   Agent
0:01   1.2.3.4   A     -     IE5;Win2k
0:09   1.2.3.4   B     A     IE5;Win2k
0:10   2.3.4.5   C     -     IE4;Win98
0:12   2.3.4.5   B     C     IE4;Win98
0:15   2.3.4.5   E     C     IE4;Win98
0:19   1.2.3.4   C     A     IE5;Win2k
0:22   2.3.4.5   D     B     IE4;Win98
0:22   1.2.3.4   A     -     IE4;Win98
0:25   1.2.3.4   E     C     IE5;Win2k
0:25   1.2.3.4   C     A     IE4;Win98
0:33   1.2.3.4   B     C     IE4;Win98
0:58   1.2.3.4   D     B     IE4;Win98
1:10   1.2.3.4   E     D     IE4;Win98
1:15   1.2.3.4   A     -     IE5;Win2k
1:16   1.2.3.4   C     A     IE5;Win2k
1:17   1.2.3.4   F     C     IE4;Win98
1:26   1.2.3.4   F     C     IE5;Win2k
1:30   1.2.3.4   B     A     IE5;Win2k
1:36   1.2.3.4   D     B     IE5;Win2k

[Figure: site topology with A at the top level; B and C at the second level; D, E, and F at the third level]
Sessionization Example
1. Sort users (based on IP+Agent)

Time   IP        URL   Ref   Agent
0:01   1.2.3.4   A     -     IE5;Win2k
0:09   1.2.3.4   B     A     IE5;Win2k
0:19   1.2.3.4   C     A     IE5;Win2k
0:25   1.2.3.4   E     C     IE5;Win2k
1:15   1.2.3.4   A     -     IE5;Win2k
1:26   1.2.3.4   F     C     IE5;Win2k
1:30   1.2.3.4   B     A     IE5;Win2k
1:36   1.2.3.4   D     B     IE5;Win2k

0:10   2.3.4.5   C     -     IE4;Win98
0:12   2.3.4.5   B     C     IE4;Win98
0:15   2.3.4.5   E     C     IE4;Win98
0:22   2.3.4.5   D     B     IE4;Win98

0:22   1.2.3.4   A     -     IE4;Win98
0:25   1.2.3.4   C     A     IE4;Win98
0:33   1.2.3.4   B     C     IE4;Win98
0:58   1.2.3.4   D     B     IE4;Win98
1:10   1.2.3.4   E     D     IE4;Win98
1:17   1.2.3.4   F     C     IE4;Win98
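
The IP+Agent grouping step can be sketched as a dictionary keyed on the (IP, agent) pair. The tuple layout and the function name `user_activity_logs` are assumptions, and only the first eight entries of the example log are used here to keep the sketch short.

```python
from collections import defaultdict

def user_activity_logs(entries):
    """Partition log entries into per-user logs keyed by (IP, agent)."""
    logs = defaultdict(list)
    for time, ip, url, ref, agent in entries:
        logs[(ip, agent)].append((time, url, ref))
    return dict(logs)

# First eight entries of the example log: (time, ip, url, referrer, agent).
entries = [
    ("0:01", "1.2.3.4", "A", "-", "IE5;Win2k"),
    ("0:09", "1.2.3.4", "B", "A", "IE5;Win2k"),
    ("0:10", "2.3.4.5", "C", "-", "IE4;Win98"),
    ("0:12", "2.3.4.5", "B", "C", "IE4;Win98"),
    ("0:15", "2.3.4.5", "E", "C", "IE4;Win98"),
    ("0:19", "1.2.3.4", "C", "A", "IE5;Win2k"),
    ("0:22", "2.3.4.5", "D", "B", "IE4;Win98"),
    ("0:22", "1.2.3.4", "A", "-", "IE4;Win98"),
]
logs = user_activity_logs(entries)
```

Because the input is already in time order, each per-user list stays in time order as well. Note that the same IP address (1.2.3.4) yields two different users once the agent is taken into account.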
Sessionization Example
2. Sessionize using heuristics

Session 1:
0:01   1.2.3.4   A     -     IE5;Win2k
0:09   1.2.3.4   B     A     IE5;Win2k
0:19   1.2.3.4   C     A     IE5;Win2k
0:25   1.2.3.4   E     C     IE5;Win2k

Session 2:
1:15   1.2.3.4   A     -     IE5;Win2k
1:26   1.2.3.4   F     C     IE5;Win2k
1:30   1.2.3.4   B     A     IE5;Win2k
1:36   1.2.3.4   D     B     IE5;Win2k

The h1 heuristic (with a timeout of 30 minutes) will result in the two sessions given above.
How about the heuristic href? How about heuristic h2 with a timeout of 10 minutes?
Sessionization Example
2. Sessionize using heuristics (another example)

0:22   1.2.3.4   A     -     IE4;Win98
0:25   1.2.3.4   C     A     IE4;Win98
0:33   1.2.3.4   B     C     IE4;Win98
0:58   1.2.3.4   D     B     IE4;Win98
1:10   1.2.3.4   E     D     IE4;Win98
1:17   1.2.3.4   F     C     IE4;Win98

In this case, the referrer-based heuristic will result in a single session, while the h1 heuristic (with timeout = 30 minutes) will result in two different sessions.
How about heuristic h2 with timeout = 10 minutes?
Sessionization Example
3. Perform Path Completion

0:22   1.2.3.4   A     -     IE4;Win98
0:25   1.2.3.4   C     A     IE4;Win98
0:33   1.2.3.4   B     C     IE4;Win98
0:58   1.2.3.4   D     B     IE4;Win98
1:10   1.2.3.4   E     D     IE4;Win98
1:17   1.2.3.4   F     C     IE4;Win98

[Figure: site topology with A at the top level; B and C at the second level; D, E, and F at the third level]

Observed transitions: A=>C, C=>B, B=>D, D=>E, C=>F
Need to look for the shortest backwards path from E to C based on the site topology. Note, however, that the elements of the path need to have occurred in the user trail previously.
Inferred back references: E=>D, D=>B, B=>C
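
A simplified sketch of path completion is given below. Rather than searching the site topology, it assumes the user pressed the browser's back button and retraces the recorded trail until it reaches the most recent occurrence of the referrer. The function name `complete_path` and the `(url, referrer)` tuple layout are assumptions.

```python
def complete_path(requests):
    """Insert inferred back references into a sequence of (url, referrer) pairs."""
    path = []
    for url, ref in requests:
        if path and ref != path[-1] and ref in path:
            # Index of the most recent occurrence of the referrer in the trail.
            j = len(path) - 1 - path[::-1].index(ref)
            # Walk backwards from the current page to the referrer.
            path.extend(reversed(path[j:-1]))
        path.append(url)
    return path

# Session from the example: A, C (from A), B (from C), D (from B), E (from D), F (from C).
session = [("A", "-"), ("C", "A"), ("B", "C"), ("D", "B"), ("E", "D"), ("F", "C")]
completed = complete_path(session)

# Server log from the earlier path completion slide.
earlier = complete_path([("A", "--"), ("B", "A"), ("D", "B"), ("E", "D"), ("C", "B")])
```

For the session above this inserts D, B, C between E and F, matching the back references E=>D, D=>B, B=>C. A topology-based search could find shorter paths when the site offers direct links that the user never followed.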
E-Commerce Data
• Integrating E-Commerce and Usage Data
  - Needed for analyzing relationships between navigational patterns of visitors and business questions such as profitability, customer value, product placement, etc.
  - E-business / Web Analytics
  - E.g., tracking and analyzing conversion of browsers to buyers
• E-Commerce Event Models
  - Major difficulty for e-commerce events is defining and implementing the events for a particular site
  - Events may involve a collection or sequence of actions by a user, possibly involving multiple pageviews or interactions with applications
  - Typical product-oriented events:
    * View
    * Click-through
    * Shopping Cart Change
    * Buy or Bid
Content and Structure Preprocessing
• Processing the content and structure of the site is often essential for successful usage analysis
• Two primary tasks:
  - determine what constitutes a unique content item (i.e., pageview, product, content category)
  - represent content and structure of the items in a quantifiable form
• Basic elements in content and structure processing
  - creation of a site map
    * captures linkage and frame structure of the site
    * also needs to identify script templates for dynamically generated pages
  - extracting important content elements in pages
    * meta-information, keywords, internal and external links, etc.
  - identifying and classifying pages based on their content and structural characteristics
Data Preparation Tasks for Mining Content Data
• Extract relevant features from text and meta-data
  - meta-data is required for product-oriented pages
  - keywords are extracted from content-oriented pages
  - weights are associated with features based on domain knowledge and/or text frequency (e.g., tf.idf weighting)
  - the integrated data can be captured in the XML representation of each pageview
• Feature representation for pageviews
  - each pageview p is represented as a k-dimensional feature vector, where k is the total number of extracted features from the site in a global dictionary
  - feature vectors obtained are organized into an inverted file structure containing a dictionary of all extracted features and posting files for pageviews
Basic Automatic Text Processing
• Parse documents to recognize structure
  - e.g., title, date, other fields
• Scan for word tokens
  - lexical analysis to recognize keywords, numbers, special characters, etc.
• Stopword removal
  - common words such as “the”, “and”, “or”, which are not semantically meaningful in a document
• Stem words
  - morphological processing to group word variants such as plurals (e.g., “compute”, “computer”, “computing”, … can be represented by the stem “comput”)
• Weight words
  - using frequency in documents and across documents
• Store index
  - stored in a term-document matrix (“inverted index”), which stores each document as a vector of keyword weights
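
The tokenizing, stopword removal, and stemming steps can be sketched in a few lines. The stopword list and the suffix rules below are toy assumptions chosen to reproduce the “comput” example; a real system would use a full stopword list and an established stemmer such as Porter's algorithm.

```python
import re

STOPWORDS = {"the", "and", "or", "a", "of", "to", "in", "is"}  # tiny illustrative list
SUFFIXES = ("ing", "er", "es", "e", "s")  # toy rules, not the Porter algorithm

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]

terms = preprocess("The computer is computing; to compute or not")
```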
Inverted Indexes
An inverted file is essentially a vector file “inverted” so that rows become columns and columns become rows.

docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1

Terms   D1   D2   D3   D4   D5   D6   D7   …
t1      1    1    0    1    1    1    0
t2      0    0    1    0    1    1    1
t3      1    0    1    0    1    0    0

Term weights can be:
- Binary
- Raw frequency in document (text frequency)
- Normalized frequency
- TF x IDF
How Inverted Indexes Are Created
• Sorted Array Implementation
  - Documents are parsed to extract tokens. These are saved with the Document ID.

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term       Doc #
now        1
is         1
the        1
time       1
for        1
all        1
good       1
men        1
to         1
come       1
to         1
the        1
aid        1
of         1
their      1
country    1
it         2
was        2
a          2
dark       2
and        2
stormy     2
night      2
in         2
the        2
country    2
manor      2
the        2
time       2
was        2
past       2
midnight   2
How Inverted Files Are Created
Then the file can be split into a Dictionary and a Postings file.

Sorted term list with within-document frequencies (the Doc # and Freq columns form the Postings file):

Term       Doc #   Freq
a          2       1
aid        1       1
all        1       1
and        2       1
come       1       1
country    1       1
country    2       1
dark       2       1
for        1       1
good       1       1
in         2       1
is         1       1
it         2       1
manor      2       1
men        1       1
midnight   2       1
night      2       1
now        1       1
of         1       1
past       2       1
stormy     2       1
the        1       2
the        2       2
their      1       1
time       1       1
time       2       1
to         1       2
was        2       2

Dictionary:

Term       N docs   Tot Freq
a          1        1
aid        1        1
all        1        1
and        1        1
come       1        1
country    2        2
dark       1        1
for        1        1
good       1        1
in         1        1
is         1        1
it         1        1
manor      1        1
men        1        1
midnight   1        1
night      1        1
now        1        1
of         1        1
past       1        1
stormy     1        1
the        2        4
their      1        1
time       2        2
to         1        2
was        1        2

Notes: The links between postings for a term are usually implemented as a linked list. The dictionary is enhanced with some term statistics such as document frequency and the total frequency in the collection.
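
The dictionary and postings for the two example documents can be reproduced with a short sketch. It uses plain Python dictionaries and lists rather than the sorted-array and linked-list layout mentioned in the notes, and the function name `build_index` is an assumption.

```python
import re
from collections import Counter

def build_index(docs):
    """Build postings (term -> [(doc_id, freq)]) and a dictionary of term statistics."""
    postings = {}
    for doc_id in sorted(docs):
        counts = Counter(re.findall(r"[a-z]+", docs[doc_id].lower()))
        for term, freq in counts.items():
            postings.setdefault(term, []).append((doc_id, freq))
    # Dictionary entry: (number of documents containing the term, total frequency).
    dictionary = {term: (len(plist), sum(f for _, f in plist))
                  for term, plist in sorted(postings.items())}
    return dictionary, postings

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
dictionary, postings = build_index(docs)
```

For example, `dictionary["the"]` holds the document count and total frequency for "the", and `postings["the"]` lists its per-document frequencies.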
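Assigning Weights
• tf x idf measure:
  - term frequency (tf)
  - inverse document frequency (idf)
  - Want to weight terms highly if they are
    * frequent in relevant documents … BUT
    * infrequent in the collection as a whole
• Goal: assign a tf x idf weight to each term in each document

Definitions:
  $T_k$ = term $k$ in document $D_i$
  $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
  $idf_k$ = inverse document frequency of term $T_k$ in collection $C$
  $N$ = total number of documents in the collection $C$
  $n_k$ = the number of documents in $C$ that contain $T_k$

  $idf_k = \log(N / n_k)$

Examples with $N = 10000$:
  $\log(10000 / 1) = 4$
  $\log(10000 / 20) = 2.698$
  $\log(10000 / 5000) = 0.301$
  $\log(10000 / 10000) = 0$

Weight of term $T_k$ in document $D_i$:
  $w_{ik} = tf_{ik} \times \log_2(N / n_k)$

(The example values above use base-10 logarithms; changing the base only rescales all weights by a constant factor.)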
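
A small sketch of the computation, with the logarithm base left as a parameter since the example values use base 10 while the weight formula is written with base 2. The function names and the default base are assumptions.

```python
import math

def idf(n_docs, doc_freq, base=10):
    """Inverse document frequency: log(N / n_k) in the given base."""
    return math.log(n_docs / doc_freq, base)

def tf_idf(tf, n_docs, doc_freq, base=10):
    """Weight of a term in a document: tf_ik * log(N / n_k)."""
    return tf * idf(n_docs, doc_freq, base)

N = 10000
# Reproduce the idf examples for terms appearing in 1, 20, 5000, and 10000 documents.
example_idfs = {n_k: idf(N, n_k) for n_k in (1, 20, 5000, 10000)}

# A term occurring 3 times in a document and present in 20 of the 10000 documents.
weight = tf_idf(3, N, 20)
```

Rare terms (small n_k) receive large idf values, and a term that appears in every document receives an idf of zero regardless of its frequency in any one document.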
Example: Discovery of “Content Profiles”
• Content Profiles
  - Represent concept groups within a Web site or among a collection of documents
  - Can be represented as overlapping collections of pageview-weight pairs
  - Instead of clustering documents, we cluster features (keywords) over the n-dimensional space of pageviews (see the term clustering example of the previous lecture)
  - For each feature cluster, derive a content profile by collecting pageviews in which these features appear as significant (this is the centroid of the cluster, but we only keep elements in the centroid whose mean weight is greater than a threshold)
• Example Content Profiles from the ACR Site:

Weight   Pageview ID                                    Significant Features (stems)
1.00     CFP: One World One Market                      world challeng busi co manag global
0.63     CFP: Int'l Conf. on Marketing & Development    challeng co contact develop intern
0.35     CFP: Journal of Global Marketing               busi global
0.32     CFP: Journal of Consumer Psychology            busi manag global

Weight   Pageview ID                                         Significant Features (stems)
1.00     CFP: Journal of Psych. & Marketing                  psychologi consum special market
1.00     CFP: Journal of Consumer Psychology I               psychologi journal consum special market
0.72     CFP: Journal of Global Marketing                    journal special market
0.61     CFP: Journal of Consumer Psychology II              psychologi journal consum special
0.50     CFP: Society for Consumer Psychology                psychologi consum special
0.50     CFP: Conf. on Gender, Market., Consumer Behavior    journal consum market
How Content Profiles Are Generated
1. Extract important features (e.g., word stems) from each document:

icmd.html
Feature    Freq
confer     12
market     9
develop    9
intern     5
ghana      3
ismd       3
contact    3
…          …

jcp.html
Feature      Freq
psychologi   11
consum       9
journal      6
manuscript   5
cultur       5
special      4
issu         4
paper        4
…            …

2. Build a global dictionary of all features (words) along with relevant statistics

Total Documents = 41

Feature-id   Doc-freq   Total-freq   Feature
0            6          44           1997
1            12         59           1998
2            13         76           1999
3            8          41           2000
…            …          …            …
123          26         271          confer
124          9          24           consid
125          23         165          consum
…            …          …            …
439          7          45           psychologi
440          14         78           public
441          11         61           publish
…            …          …            …
549          1          6            vision
550          3          8            volunt
551          1          9            vot
552          4          23           vote
553          3          17           web
…            …          …            …
How Content Profiles Are Generated
3. Construct a document-word matrix with normalized tf-idf weights

doc-id / feature-id   0      1      2      3      4      5      …
0                     0.27   0.43   0.00   0.00   0.00   0.00   …
1                     0.07   0.10   0.00   0.00   0.00   0.00   …
2                     0.00   0.06   0.07   0.03   0.00   0.00   …
3                     0.00   0.00   0.00   0.00   0.00   0.00   …
4                     0.00   0.00   0.00   0.00   0.00   0.00   …
5                     0.00   0.00   0.05   0.06   0.00   0.00   …
6                     0.17   0.10   0.07   0.03   0.03   0.00   …
7                     0.14   0.09   0.08   0.02   0.02   0.00   …
8                     0.00   0.00   0.10   0.00   0.00   0.00   …
9                     0.00   0.07   0.00   0.00   0.00   0.00   …
10                    0.02   0.02   0.00   0.00   0.00   0.00   …
11                    0.00   0.00   0.00   0.00   0.00   0.00   …
12                    0.00   0.00   0.00   0.00   0.00   0.00   …
13                    0.00   0.00   0.00   0.00   0.00   0.00   …
14                    0.00   0.00   0.00   0.00   0.00   0.00   …
15                    0.00   0.00   0.32   0.38   0.00   0.00   …
…                     …      …      …      …      …      …      …

4. Now we can perform clustering on words (or documents) using one of the techniques described earlier (e.g., k-means clustering on features).
How Content Profiles Are Generated
Examples of feature (word) clusters obtained using k-means:

CLUSTER 0:  anthropologi, anthropologist, appropri, associ, behavior, ...
CLUSTER 4:  consum, issu, journal, market, psychologi, special
CLUSTER 10: ballot, result, vot, vote, ...
CLUSTER 11: advisori, appoint, committe, council, ...

5. Content profiles are now generated from feature clusters based on the centroid of each cluster (similar to usage profiles, but we have words instead of users/sessions).
Content Enhanced User Transactions
• Essentially combines the usage and content profiling techniques discussed earlier
• Basic Idea:
  - for each user/session, extract important features of the selected documents/items
  - based on the global dictionary, create a user-feature matrix
  - each row is a feature vector representing significant terms associated with documents/items selected by the user in a given session
  - weights can be determined as before (e.g., using the tf.idf measure)
• Applications:
  - Can analyze user behavior at a more granular level of concepts or keywords associated with items purchased, pages visited, etc.
  - Can create user segments based on their common underlying interests
  - Help explain emerging patterns in user behavior data
User transaction matrix UT

         A.html   B.html   C.html   D.html   E.html
user1    1        0        1        0        1
user2    1        1        0        0        1
user3    0        1        1        1        0
user4    1        0        1        1        1
user5    1        1        0        0        1
user6    1        0        1        1        1

Feature-Document Matrix FP

               A.html   B.html   C.html   D.html   E.html
web            0        0        1        1        1
data           0        1        1        1        0
mining         0        1        1        1        0
business       1        1        0        0        0
intelligence   1        1        0        0        1
marketing      1        1        0        0        1
ecommerce      0        1        1        0        0
search         1        0        1        0        0
information    1        0        1        1        1
retrieval      1        0        1        1        1
Content Enhanced Transactions
User-Feature Matrix UF. Note that UF = UT x FP^T (FP^T is the transpose of FP).

         web   data   mining   business   intelligence   marketing   ecommerce   search   information   retrieval
user1    2     1      1        1          2              2           1           2        3             3
user2    1     1      1        2          3              3           1           1        2             2
user3    2     3      3        1          1              1           2           1        2             2
user4    3     2      2        1          2              2           1           2        4             4
user5    1     1      1        2          3              3           1           1        2             2
user6    3     2      2        1          2              2           1           2        4             4

Example: users 4 and 6 are more interested in concepts related to Web information retrieval, while user 3 is more interested in data mining.
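
The product UF = UT x FP^T can be checked with a few lines of plain Python. The helper name `times_transpose` is an assumption; a numerical library would normally be used for larger matrices.

```python
def times_transpose(ut, fp):
    """Return UT x FP^T: entry [u][f] is the dot product of user row u and feature row f."""
    return [[sum(a * b for a, b in zip(user_row, feat_row)) for feat_row in fp]
            for user_row in ut]

# Rows: user1..user6; columns: A.html..E.html.
UT = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 1],
]

# Rows: web, data, mining, business, intelligence, marketing,
# ecommerce, search, information, retrieval; columns: A.html..E.html.
FP = [
    [0, 0, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 1, 1, 1],
]

UF = times_transpose(UT, FP)
```

Each entry of UF counts how many of the pages a user visited contain the given feature, which is why users with identical visit patterns (users 4 and 6) get identical feature vectors.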
Web Usage Mining as a Process
[Figure: Architectural Framework for Web Usage Mining. Components shown: Web/Application Server Logs; Site Content; Operational Database (customers, orders, products); Preprocessing/Sessionization Module; Content Analysis Module; Site Map; Site Dictionary; Data Integration Module; Integrated Sessionized Data; E-Commerce Data Mart; Data Cube; Data Mining Engine; OLAP Tools; Usage Analysis; Pattern Analysis; OLAP Analysis]