Categorizing Web Viewership Using Statistical Models
of Web Navigation and Text Classification
Alan L. Montgomery and Brett Gordon
Carnegie Mellon University
Marketing Science Conference
University of Alberta, Edmonton
28 June 2002
Outline
• Clickstream Example
  – What topic is the user looking at on each page?
• Information Sources
  – Dmoz.org classification
  – Text classification
  – User browsing model
• Results
• Conclusions
Clickstream Example
What topics is this user browsing on each of the following pages?
True class of each page in the clickstream, in viewing order (page screenshots omitted):
• Page 1. True Class: Business
• Page 2. True Class: Business
• Page 3. True Class: Business
• Page 4. True Class: Sports
• Page 5. True Class: Sports
• Page 6. True Class: Sports
• Page 7. True Class: News
• Page 8. True Class: News
• Page 9. True Class: News
• Page 10. True Class: News
User Demographics
• Sex: Male
• Age: 22
• Occupation: Student
• Income: < $30,000
• State: Pennsylvania
• Country: U.S.A.
{Business} {Business} {Business} {Sports} {Sports} {Sports} {News} {News} {News} {News}
Information Sources
Data
Clickstream Data
• Panel of representative web users collected by Jupiter Media Metrix
• Sample of 30 randomly selected users who browsed during April 2002
  – 38k URL viewings
  – 13k unique URLs visited
  – 1,550 domains
• Average user
  – Views 1,300 URLs
  – Active for 9 hours/month
Classification Information
• Dmoz.org - Pages classified by human experts
• Page Content - Text classification algorithms from computer science and information retrieval
Dmoz.org
• Largest, most comprehensive human-edited directory of the web
• Constructed and maintained by volunteers (open-source), and original set donated by Netscape
• Used by Netscape, AOL, Google, Lycos, Hotbot, DirectHit, etc.
• Over 3m sites classified, 438k categories, 43k editors (Dec 2001)
Categories
1. Arts
2. Business
3. Computers
4. Games
5. Health
6. Home
7. News
8. Recreation
9. Reference
10. Science
11. Shopping
12. Society
13. Sports
14. Adult
Problem
• Web is very large and dynamic and only a fraction of pages can be classified
  – 147m hosts (Jan 2002, Internet Domain Survey, isc.org)
  – 1b+ (?) web pages
• Only a fraction of the web pages in our panel are categorized
  – 1.3% of web pages are exactly categorized
  – 7.3% categorized within one level
  – 10% categorized within two levels
  – 74% of pages have no classification information
Text Classification
Background
• Information Retrieval
  – Overview (Baeza-Yates and Ribeiro-Neto 2000, Chakrabarti 2000)
  – Naïve Bayes (Joachims 1997)
  – Support Vector Machines (Vapnik 1995 and Joachims 1998)
  – Feature Selection (Mladenic and Grobelnik 1998, Yang and Pedersen 1998)
  – Latent Semantic Indexing
  – Language Models (MacKay and Peto 1994)
True Class: Sports
Page Contents = HTML Code + Regular Text
Tokenization & Lexical Parsing
• HTML code is removed
• Punctuation is removed
• All words are converted to lowercase
• Stopwords are removed
  – Common, non-informative words such as ‘the’, ‘and’, ‘with’, ‘an’, etc.

Determine the term frequency (TF) of each remaining unique word
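The preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `term_frequencies` and the small `STOPWORDS` set are hypothetical, the regular-expression tag stripping is a crude stand-in for a real HTML parser, and digits are dropped along with punctuation. The slides do not say which tools were actually used.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "with", "an", "a", "of", "in", "to"}


def term_frequencies(html: str) -> Counter:
    """Return the term frequency (TF) of each remaining word in a page."""
    text = re.sub(r"<[^>]*>", " ", html)   # crude removal of HTML tags
    text = text.lower()                    # convert all words to lowercase
    tokens = re.findall(r"[a-z]+", text)   # keep letter runs: drops punctuation and digits
    return Counter(t for t in tokens if t not in STOPWORDS)


tf = term_frequencies("<p>The Game, the game and a HIT!</p>")
```

The resulting `Counter` plays the role of the document vector: each key is a unique word and each value is its term frequency.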
Result: Document Vector
| Term | TF |
|---|---|
| home | 2 |
| game | 8 |
| hit | 4 |
| runs | 6 |
| threw | 2 |
| ejected | 1 |
| baseball | 5 |
| major | 2 |
| league | 2 |
| bat | 2 |
Classifying Document Vectors
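| Test Document | {News Class} | {Shopping Class} | {Sports Class} |
|---|---|---|---|
| home 2 | bush 58 | sale 87 | game 97 |
| game 8 | congress 92 | customer 28 | football 32 |
| hit 4 | tax 48 | cart 24 | hit 45 |
| runs 6 | cynic 16 | game 16 | goal 84 |
| threw 2 | politician 23 | microsoft 31 | umpire 23 |
| ejected 1 | forest 9 | buy 93 | won 12 |
| baseball 5 | major 3 | order 75 | league 58 |
| major 2 | world 29 | pants 21 | baseball 39 |
| league 2 | summit 31 | nike 8 | soccer 21 |
| bat 2 | federal 64 | tax 19 | runs 26 |

Test Document: {News Class}? {Sports Class}? {Shopping Class}?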
Class probabilities for the test document:
• P( {News} | Test Doc) = 0.02
• P( {Sports} | Test Doc) = 0.91
• P( {Shopping} | Test Doc) = 0.07

The test document is assigned to {Sports Class}, the class with the highest probability: P( {Sports} | Test Doc) = 0.91
Classification Model
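• A document is a vector of term frequency (TF) values; each category has its own term distribution
• Words in a document are generated by a multinomial model of the term distribution in a given class:

$$d \sim M\{n,\ p^c = (p_1^c, p_2^c, \ldots, p_{|V|}^c)\}$$

• Classification:

$$\hat{c} = \arg\max_{c \in C} \{P(c \mid d)\} = \arg\max_{c \in C} \Big\{ P(c) \prod_{i=1}^{|V|} P(w_i \mid c)^{n_i} \Big\}$$

|V| : vocabulary size
n_i : term frequency of word i in the document d
n_i^c : # of times word i appears in class c (the counts from which p_i^c = P(w_i | c) is estimated)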
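The classification rule can be illustrated with a small multinomial Naïve Bayes sketch in Python, using the term counts shown on the earlier "Classifying Document Vectors" slides. Several choices here are assumptions that do not come from the slides: the function name `classify`, the uniform class prior, and add-one (Laplace) smoothing with parameter `alpha`. Because of these assumptions, and because the probabilities on the slides look illustrative, the computed posterior is not expected to reproduce 0.02 / 0.91 / 0.07 exactly.

```python
import math

# Term counts per class, copied from the slides.
CLASS_COUNTS = {
    "News": {"bush": 58, "congress": 92, "tax": 48, "cynic": 16, "politician": 23,
             "forest": 9, "major": 3, "world": 29, "summit": 31, "federal": 64},
    "Shopping": {"sale": 87, "customer": 28, "cart": 24, "game": 16, "microsoft": 31,
                 "buy": 93, "order": 75, "pants": 21, "nike": 8, "tax": 19},
    "Sports": {"game": 97, "football": 32, "hit": 45, "goal": 84, "umpire": 23,
               "won": 12, "league": 58, "baseball": 39, "soccer": 21, "runs": 26},
}

# Test document vector (term frequencies), copied from the slides.
TEST_DOC = {"home": 2, "game": 8, "hit": 4, "runs": 6, "threw": 2,
            "ejected": 1, "baseball": 5, "major": 2, "league": 2, "bat": 2}


def classify(doc, class_counts, alpha=1.0):
    """Return (best class, posterior) under a multinomial model.

    Assumptions: uniform prior P(c) and add-alpha smoothing of P(w_i | c).
    """
    vocab = set(doc)
    for counts in class_counts.values():
        vocab.update(counts)

    log_prior = -math.log(len(class_counts))  # uniform P(c), an assumption
    scores = {}
    for label, counts in class_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        score = log_prior
        for word, n_i in doc.items():
            # log of P(w_i | c) ** n_i, with smoothing for unseen words
            score += n_i * math.log((counts.get(word, 0) + alpha) / denom)
        scores[label] = score

    # Normalise log scores into posterior probabilities.
    top = max(scores.values())
    weights = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(weights.values())
    posterior = {c: w / norm for c, w in weights.items()}
    return max(posterior, key=posterior.get), posterior


predicted, posterior = classify(TEST_DOC, CLASS_COUNTS)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for real pages with hundreds of terms.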
Results
• 25% correct classification
• Compare with random guessing of 7%
• More advanced techniques perform slightly better:
  – Shrinkage of word term frequencies (McCallum et al 1998)
  – n-gram models
  – Support Vector Machines
User Browsing Model
• Web browsing is “sticky” or persistent: users tend to view a series of pages within the same category and then switch to another topic
• Example: {News} {News} {News}
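Markov Switching Model

| Next category | arts | business | computers | games | health | home | news | recreation | reference | science | shopping | society | sports | adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| arts | 83% | 4% | 5% | 2% | 1% | 2% | 6% | 3% | 2% | 6% | 2% | 3% | 4% | 1% |
| business | 3% | 73% | 5% | 3% | 2% | 3% | 6% | 2% | 3% | 3% | 3% | 2% | 3% | 2% |
| computers | 5% | 11% | 79% | 3% | 3% | 7% | 5% | 3% | 4% | 4% | 5% | 5% | 2% | 2% |
| games | 1% | 3% | 2% | 90% | 1% | 1% | 1% | 1% | 0% | 1% | 1% | 1% | 1% | 0% |
| health | 0% | 0% | 0% | 0% | 84% | 1% | 1% | 0% | 0% | 1% | 0% | 1% | 0% | 0% |
| home | 0% | 1% | 1% | 0% | 1% | 80% | 1% | 1% | 0% | 1% | 1% | 1% | 0% | 0% |
| news | 1% | 1% | 1% | 0% | 1% | 0% | 69% | 0% | 0% | 1% | 0% | 1% | 1% | 0% |
| recreation | 1% | 1% | 1% | 0% | 1% | 1% | 1% | 86% | 1% | 1% | 1% | 1% | 1% | 0% |
| reference | 0% | 1% | 1% | 0% | 1% | 0% | 1% | 0% | 85% | 2% | 0% | 1% | 1% | 0% |
| science | 1% | 0% | 0% | 0% | 1% | 1% | 1% | 0% | 1% | 75% | 0% | 1% | 0% | 0% |
| shopping | 1% | 3% | 2% | 1% | 1% | 2% | 1% | 1% | 0% | 1% | 86% | 1% | 1% | 0% |
| society | 1% | 1% | 2% | 0% | 2% | 1% | 3% | 1% | 2% | 2% | 0% | 82% | 1% | 1% |
| sports | 2% | 1% | 1% | 0% | 0% | 0% | 3% | 1% | 1% | 0% | 0% | 1% | 85% | 0% |
| adult | 1% | 1% | 1% | 0% | 0% | 0% | 1% | 0% | 0% | 0% | 0% | 1% | 0% | 93% |
| (unconditional) | 16% | 10% | 19% | 11% | 2% | 3% | 2% | 6% | 3% | 2% | 7% | 6% | 5% | 7% |

Each column sums to roughly 100%, so a column appears to give the distribution of the next page's category given the current page's category (the column heading). The last row is the unconditional share of each category.

Pooled transition matrix, heterogeneity across users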
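As a rough illustration of where such transition percentages come from, the sketch below counts category-to-category transitions in a single labelled sequence and normalises them. It uses the ten-page sequence from the clickstream example. The function name `transition_probabilities` is hypothetical, and the slides do not describe the estimation procedure actually used for the pooled matrix, so this is a plain counting estimate rather than the authors' method.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate P(next category | current category) by counting transitions."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }


# Category sequence from the clickstream example earlier in the talk.
clicks = ["Business"] * 3 + ["Sports"] * 3 + ["News"] * 4
probs = transition_probabilities(clicks)
```

Even in this tiny sequence most of the probability mass sits on staying in the same category, which is the "sticky" behaviour the browsing model is meant to capture. Pooling would amount to summing the transition counts over all users before normalising.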
Implications
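• Suppose we have the following sequence, where the category of the middle page is unknown:

{News} ? {News}

• Using Bayes' rule, we can determine that there is a 97% probability that the unknown page is news, compared with an unconditional probability of 2% and a probability of 69% conditional on the last observation alone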
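The Bayes' rule calculation for an unknown page sandwiched between two known pages can be sketched as follows. For a first-order Markov chain, P(middle = j | previous, next) is proportional to P(j | previous) × P(next | j). The two vectors below are the "news" column and the "news" row of the pooled transition matrix, entered as rounded percentages; the names `from_news`, `to_news`, and `middle_posterior` are hypothetical. Since each term pairs the same two entries, the result does not depend on whether rows or columns are read as the conditioning state. Because the inputs are rounded (several true values below 0.5% appear as 0%), the result is not expected to match the 97% on the slide exactly.

```python
CATEGORIES = ["arts", "business", "computers", "games", "health", "home", "news",
              "recreation", "reference", "science", "shopping", "society",
              "sports", "adult"]

# "news" column of the pooled matrix (rounded percentages from the slide).
from_news = [0.06, 0.06, 0.05, 0.01, 0.01, 0.01, 0.69,
             0.01, 0.01, 0.01, 0.01, 0.03, 0.03, 0.01]

# "news" row of the pooled matrix (rounded percentages from the slide).
to_news = [0.01, 0.01, 0.01, 0.00, 0.01, 0.00, 0.69,
           0.00, 0.00, 0.01, 0.00, 0.01, 0.01, 0.00]


def middle_posterior(given_prev, given_next):
    """Posterior over the hidden middle category via Bayes' rule."""
    joint = [a * b for a, b in zip(given_prev, given_next)]
    total = sum(joint)
    return [j / total for j in joint]


posterior = middle_posterior(from_news, to_news)
p_news = posterior[CATEGORIES.index("news")]
```

The key point is qualitative: observing news on both sides raises the probability of news for the missing page well above both the unconditional share and the one-sided 69%.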
Results
Methodology
Bayesian setup to combine information from:
• Known categories based on exact matches
• Text classification
• Markov Model of User Browsing
  – Introduce heterogeneity by assuming that conditional transition probability vectors are drawn from a Dirichlet distribution
• Similarity of other pages in the same domain
  – Assume that the category of each page within a domain follows a Dirichlet distribution, so if we are at a “news” site then pages are more likely to be classified as “news”
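One way to see what a Dirichlet assumption on transition vectors buys is the standard Dirichlet-multinomial update: a user's own transition counts are blended with the pooled probabilities, and users with little data are pulled toward the pooled values. The sketch below shows only that textbook posterior mean. The function name `dirichlet_posterior_mean`, the `prior_strength` parameter, and the example counts are all hypothetical; the slides do not say how the Dirichlet parameters were set or how the full model was estimated.

```python
def dirichlet_posterior_mean(pooled_probs, user_counts, prior_strength=10.0):
    """Posterior mean of one user's transition vector.

    Prior: Dirichlet(prior_strength * pooled_probs).
    Data: the user's observed transition counts out of one category.
    """
    total = prior_strength + sum(user_counts)
    return [
        (prior_strength * p + n) / total
        for p, n in zip(pooled_probs, user_counts)
    ]


# Two-outcome illustration: stay in news vs. leave news.
pooled = [0.69, 0.31]   # pooled probability of staying, from the slides; remainder lumped together
user = [9, 1]           # hypothetical counts for one user
estimate = dirichlet_posterior_mean(pooled, user)
```

With these made-up counts the user's estimated probability of staying in news lies between the pooled 69% and the user's raw 90%, and it moves toward the raw frequency as more of that user's transitions are observed.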
Findings
| Model | Correct classification |
|---|---|
| Random guessing | 7% |
| Text Classification | 25% |
| + Domain Model | 41% |
| + Browsing Model | 78% |
Conclusions
Summary
• Each technique (text classification, browsing model, or domain model) performs only fairly well (~25% classification)
• Combining these techniques yields very good (~80%) classification rates
• Future directions: larger datasets and newer text classification and user browsing models
Applications
• Newsgroups
  – Gather information from newsgroups and determine whether consumers are responding positively or negatively
• E-mail
  – Scan e-mail text for similarities to known problems/topics
• Better search engines
  – Instead of experts classifying pages we can mine the information collected by ISPs and classify it automatically
• Adult filters
  – US Appeals Court struck down Children’s Internet Protection Act on the grounds that technology was inadequate