Detecting Online Commercial Intention (OCI)
1
Detecting Online Commercial Intention (OCI)
Honghua Dai, Zaiqing Nie, Lee Wang, Lingzhi Zhao, Ji-Rong Wen, Ying Li
WWW’06
Advisor: Chia-Hui Chang
Student: Teng-Kai Fan
Date: 2009-08-24
2
Outline
Introduction
Defining OCI (Online Commercial Intention)
Learning Online Commercial Intention
  Web Page OCI Detector
  Query OCI Detector
Experiment
Conclusion
3
Introduction
Two major online user activities:
  Browsing activity
  Searching activity
Three categories of user search intention:
  Navigational: reach a particular web site.
  Informational: acquire information from web pages.
  Transactional: perform some “web-mediated” activity.
4
Introduction cont.
OCI (Online Commercial Intention): understanding whether a user intends to purchase a product or participate in a commercial service.
5
Defining OCI (Online Commercial Intention)
OCI is defined as a function from a query or a Web page to a binary value: Commercial or Non-Commercial.
The goal is to compute two functions, one over queries Q and one over Web pages P:
  OCI: Q → {Commercial, Non-Commercial}
  OCI: P → {Commercial, Non-Commercial}
6
Learning Online Commercial Intention
Taxonomy-based approach: using existing concept hierarchies or categories.
Machine learning approach: extracting features from page content and building classifiers based on those features.
Labeling process: human-evaluation approach.
7
Web Page OCI Detector
Input: a Web page P
Output: the OCI (commercial or non-commercial) of P
SVM
8
Keyword Extraction and Selection
Keyword extraction: from both the inner text and the tag attributes of all the training data.
Feature selection:
Pr(k|C): the probability of the keyword k occurring in a Web page belonging to class C.
$$\mathrm{Sig}(k) = \frac{\max\{\Pr(k \mid C),\ \Pr(k \mid \bar{C})\}}{\Pr(k \mid C) + \Pr(k \mid \bar{C})}$$
$$\mathrm{Freq}(k) = \Pr(k \mid C \cup \bar{C})$$
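The selection step can be illustrated with a small sketch. It assumes that Sig(k) is the larger of the two class-conditional probabilities divided by their sum, and that Freq(k) is the fraction of all training pages (both classes together) containing k. The function names, the input layout, and the thresholds `min_sig` and `min_freq` are hypothetical and do not come from the slides.

```python
from collections import Counter

def keyword_stats(pages):
    """pages: list of (keywords_on_page, is_commercial) pairs (assumed layout)."""
    n_pos = sum(1 for _, commercial in pages if commercial)
    n_neg = len(pages) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for keywords, commercial in pages:
        # Count each keyword at most once per page (document frequency).
        (pos_df if commercial else neg_df).update(set(keywords))
    stats = {}
    for k in set(pos_df) | set(neg_df):
        p_pos = pos_df[k] / n_pos if n_pos else 0.0  # Pr(k | C)
        p_neg = neg_df[k] / n_neg if n_neg else 0.0  # Pr(k | not C)
        sig = max(p_pos, p_neg) / (p_pos + p_neg)    # assumed form of Sig(k)
        freq = (pos_df[k] + neg_df[k]) / len(pages)  # assumed form of Freq(k)
        stats[k] = (sig, freq)
    return stats

def select_keywords(pages, min_sig, min_freq):
    """Keep keywords that pass both (hypothetical) thresholds."""
    stats = keyword_stats(pages)
    return sorted(k for k, (sig, freq) in stats.items()
                  if sig >= min_sig and freq >= min_freq)

pages = [
    ({"buy", "ticket"}, True),
    ({"buy", "price"}, True),
    ({"history", "ticket"}, False),
    ({"history"}, False),
]
selected = select_keywords(pages, min_sig=0.9, min_freq=0.5)
```

Under this reading, a keyword that is equally likely in both classes gets Sig = 0.5 (uninformative), and a keyword seen in only one class gets Sig = 1.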
9
Keyword Extraction and Selection cont.
Two properties are defined for each keyword k in a page p:
$$nit(k, p) = \frac{\#\ \text{of elements in } p \text{ in which keyword } k \text{ appears in the inner text}}{\text{total number of elements in page } p}$$
$$nta(k, p) = \frac{\#\ \text{of elements in } p \text{ in which keyword } k \text{ appears in the tag attributes}}{\text{total number of elements in page } p}$$
A page p with n keywords can therefore be represented in 2*n dimensions.
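A minimal sketch of how the two per-keyword properties could be computed with Python's standard `html.parser`. Several details are assumptions, not taken from the slides: "inner text" is read as the text directly inside an element (not its descendants), keyword matching is a case-insensitive substring test, and the names `ElementCollector` and `page_features` are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # elements with no closing tag

class ElementCollector(HTMLParser):
    """Record each element's own text and its attribute values."""
    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per element: {"text": ..., "attrs": ...}
        self._stack = []    # currently open (tag, element) pairs

    def handle_starttag(self, tag, attrs):
        element = {"text": "", "attrs": " ".join(v for _, v in attrs if v)}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._stack.append((tag, element))

    def handle_endtag(self, tag):
        # Close the innermost open element with this tag name, if any.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1]["text"] += " " + data

def page_features(html, keywords):
    """Return the 2*n vector [nit(k1), nta(k1), nit(k2), nta(k2), ...]."""
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    total = len(parser.elements) or 1  # avoid division by zero on empty pages
    vector = []
    for k in keywords:
        k = k.lower()
        nit = sum(1 for e in parser.elements if k in e["text"].lower()) / total
        nta = sum(1 for e in parser.elements if k in e["attrs"].lower()) / total
        vector.extend([nit, nta])
    return vector

html = ('<div><p>Buy cheap tickets</p>'
        '<a href="/buy" title="buy now">order</a>'
        '<img alt="price tag"></div>')
features = page_features(html, ["buy", "price"])
```

The example page has four elements, so each matching element contributes 0.25 to the corresponding coordinate.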
10
Query OCI Detector
Four types of data sources for query OCI:
  Constituent terms of the search query. Ex.: “airline ticket deals”, “digital camera price”.
  Content of the top landing pages recommended by the search engine.
  Content of the search result page, including titles, short descriptions, and URL links.
  The number of user clicks on the landing pages recommended by the search engine.
11
Detecting OCI based on Top Search Result Landing Pages
Using the top-10 result pages generated by MSN.
Using the Web page OCI detector to detect the OCI of the top 10 landing pages.
$p_{qn}$ is the Web page that has rank $n$ in the search result of query $q$.
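One way to picture this step: the page OCI detector scores each landing page, and the rank-ordered scores become the features of the query. The sketch below is illustrative only. The slides do not state how the page scores are combined, so the averaging rule in `query_is_commercial` is a plainly simplified stand-in, and all names, the zero padding, and the threshold are assumptions.

```python
def query_features(page_scores, n=10):
    """Rank-ordered page OCI scores for the top-n landing pages, zero-padded.

    page_scores[i] is assumed to be the page detector's score for the page
    at rank i + 1 (positive = commercial, negative = non-commercial).
    """
    scores = list(page_scores)[:n]
    return scores + [0.0] * (n - len(scores))

def query_is_commercial(page_scores, n=10, threshold=0.0):
    """Stand-in combination rule: mean page score above a threshold."""
    features = query_features(page_scores, n)
    return sum(features) / n > threshold

features = query_features([0.8, -0.2, 0.5], n=5)
```

In practice the feature vector would more likely be passed to a trained classifier than to a fixed averaging rule.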
12
Detecting OCI based on Top Search Result Landing Pages cont.
13
Detecting OCI based on First Search Result Page
14
Experiments
Data:
  1,408 US English queries.
  The first search result page for each of the 1,408 queries.
  The top 10 landing pages for each of the 1,408 queries.
  26,186 randomly picked English Web pages.
Labeling Analysis
15
Evaluation Methodology
For the Web page OCI detector: because the classes are unbalanced, all commercial pages and an equal number of non-commercial pages were selected to train the model.
For the query OCI detector: the model based on the first search result page is compared with the model based on the top N result landing pages, using 3-fold cross validation.
Measures: Precision, Recall and F-Measure
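The balancing step for the page detector amounts to undersampling the majority class. A minimal sketch, assuming each page is a `(page_id, is_commercial)` pair; the function name and the fixed seed are hypothetical.

```python
import random

def balanced_training_set(pages, seed=0):
    """Keep every commercial page and an equal-sized random sample of the rest."""
    commercial = [p for p in pages if p[1]]
    non_commercial = [p for p in pages if not p[1]]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(commercial), len(non_commercial))
    sampled = rng.sample(non_commercial, k)
    data = commercial + sampled
    rng.shuffle(data)
    return data

pages = [("c1", True), ("c2", True)] + [(f"n{i}", False) for i in range(10)]
train = balanced_training_set(pages)
```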
16
Evaluating Page OCI Detector
CP (Precision), CR (Recall), CF (F-measure)
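For reference, the three measures can be computed as follows. This sketch assumes that CP, CR, and CF refer to the commercial class and that the F-measure is the balanced harmonic mean of precision and recall; the function name is hypothetical.

```python
def commercial_prf(y_true, y_pred):
    """Precision, recall, and F-measure for the commercial (True) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # commercial, predicted commercial
    fp = sum(1 for t, p in pairs if not t and p)    # non-commercial, predicted commercial
    fn = sum(1 for t, p in pairs if t and not p)    # commercial, predicted non-commercial
    cp = tp / (tp + fp) if tp + fp else 0.0
    cr = tp / (tp + fn) if tp + fn else 0.0
    cf = 2 * cp * cr / (cp + cr) if cp + cr else 0.0
    return cp, cr, cf

cp, cr, cf = commercial_prf(
    [True, True, True, False, False],
    [True, True, False, True, False],
)
```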
17
Evaluating Page OCI Detector cont.
18
Evaluating Query OCI Detector
19
OCI Analysis for a Stratified Query Sample based on Query Frequency
Query frequency was divided into 5 levels: Single, Very low, Low, Mid, and High.
10,000 queries were randomly selected for each level.
Observation: query sets with higher frequency contain a larger proportion of queries with commercial intention.
20
Conclusion
The authors present a framework for building machine learning models that learn OCI for both queries and Web pages, based on Web page content.