Learning on the Web
Tong Zhang
Rutgers University
T. Zhang (Rutgers) Learning on the Web 1 / 27
Machine Learning Problems on Web
Classification
Ranking
User Behavior Modeling
Recommendation
Community Analysis
Quality Assessment
Exploration-Exploitation
Scalability
...
Classification
Electronic Spam
  email spam: unwanted email
  webpage spam: low-quality pages designed to be ranked highly
  blog spam: random blog pages created to promote other pages
  click spam: misleading clicks on ads or webpages
  text messaging spam: unwanted text messages
  usually aims for commercial gain
Sentiment analysis
Webpage classification
Query classification
...
Key issues: problem formulation; feature generation; information aggregation; model adaptation
Spam Email
[Example of a Chinese spam email; the body text was garbled in extraction. Recoverable details: a QQ contact number 2297845601 and a phone number 13715362114.]
From: [email protected] (sender name garbled)
Subject: (garbled)
Date: July 25, 2012 9:43:08 PM GMT+08:00
Reply-To: [email protected]
Sentiment Analysis
Two iPad reviews: do they like or dislike the product?

Review 1: This is my first iPad and I just absolutely love it!!! I previously owned a tablet but this, by far, beats the tablet I had!! It is so easy to use and the retina is amazing! I now understand why people love their iPad! ...

Review 2: Poor quality control. Found the corners at the edges where the screen meets the body, to be crimped on each side...

Free text like this is referred to as unstructured data.
Structured Data Example
A table or relational database
Gender | Systolic BP | Weight | Disease Code
M      | 175         | 65     | 3
F      | 141         | 72     | 1
...    | ...         | ...    | ...
F      | 160         | 59     | 2
Figure: Example of Medical Data Prediction
Structured versus Unstructured Data
Structured data:
  table or spreadsheet
  relational database with well-defined attributes (features)
  features are usually dense
Unstructured data:
  free-format text
  without well-defined attributes
Learning: extract information, find patterns, organize contents
  encode desired information into unknown labels to predict (output)
  encode available unstructured data into sparse feature vectors
  combine structured and unstructured data
Encoding Unstructured Data
Goal: represent text by a feature vector
Method: vector space model
  create a dictionary of size m consisting of all words
  map each document into an m-dimensional vector
  the i-th component is the frequency of word i in the document
  the feature vector is very sparse and high dimensional
Bag-of-words (BoW): represents text without word-ordering information
Improvements:
  can preserve section or partial position information
  can combine multiple dictionaries and use phrases
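As a concrete illustration of the vector space model above, here is a minimal sketch in Python (the helper names build_vocab and bow_vector are illustrative, not from the talk):

```python
from collections import Counter

def build_vocab(docs):
    """Create a dictionary of size m mapping each word to a component index."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bow_vector(doc, vocab):
    """Map a document to a sparse bag-of-words vector {index: frequency}."""
    counts = Counter(doc.lower().split())
    return {vocab[w]: c for w, c in counts.items() if w in vocab}

docs = ["the retina is amazing", "poor quality control", "the quality is poor"]
vocab = build_vocab(docs)
sparse = bow_vector("poor poor quality", vocab)
```

Storing only the nonzero entries matters because the dictionary is large while each document touches only a few words.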
Bag of Word Document Representation
Rows are documents, columns are terms:

         word1 word2 word3 word4 word5 ... wordN | label
doc 1:     0     2     4     1     0   ...   0   |   1
doc 2:     1     0     3     0     0   ...   0   |   0
doc 3:     2     1     0     5     1   ...   0   |   1
doc 4:     0     4     0     0     2   ...   0   |   0
doc 5:     1     2     1     2     0   ...   1   |   1
doc 6:     0     1     0     3     3   ...   0   |   1
doc 7:     2     0     1     4     0   ...   2   |   0
doc 8:     0     3     1     2     1   ...   0   |   1
Figure: Document BoW Representation
Term Weighting
Modify each word count by the perceived importance of the word
Rare words carry more information than common words.
TFIDF weighting of token j in document d:

  tf-idf(j, d) = tf(j, d) * idf(j),    idf(j) = log(N / df(j))

where N is the total number of documents, tf(j, d) is the term frequency of token j in document d, and df(j) is the number of documents containing term j.
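A minimal sketch of this weighting over tokenized documents (the function name tf_idf is illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight counts by tf-idf(j, d) = tf(j, d) * log(N / df(j))."""
    n = len(docs)
    df = Counter()                       # df(j): number of documents containing term j
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)                # tf(j, d): term frequency in document d
        weighted.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return weighted

docs = [["cheap", "ink", "cheap"], ["ink", "sale"], ["meeting", "notes"]]
vecs = tf_idf(docs)
```

A word occurring in every document gets idf = log(1) = 0, i.e. no weight, while rare words are boosted.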
Example Feature Vector for Email Spam Detection
          text:title     text:body          text:body   nontext              | label
          ...cheap...    ...enlargement...  ...ink      from known spam host | spam
email 1:      yes            yes               yes             yes           | true
email 2:      no             yes               no              yes           | true
email 3:      no             no                no              no            | false
...           ...            ...               ...             ...           | ...
Feature representation:
  bag-of-words binary feature representation of the email text, without TFIDF
  known spam host used as a nontext feature
Text features versus nontext features:
  text features: sparse BoW representation; a linear classifier works well
  nontext features: dense and heterogeneous; often need non-linear interactions
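A sketch of how such a mixed feature vector could be assembled; the vocabulary and helper name below are hypothetical, not from the talk:

```python
def email_features(body_words, title_words, from_spam_host, vocab):
    """Binary BoW features from the email text plus one dense nontext feature."""
    x = [0] * (len(vocab) + 1)
    for w in set(body_words) | set(title_words):
        if w in vocab:
            x[vocab[w]] = 1                       # binary presence, no TFIDF
    x[len(vocab)] = 1 if from_spam_host else 0    # nontext: known spam host
    return x

vocab = {"cheap": 0, "enlargement": 1, "ink": 2}
x = email_features(["buy", "cheap", "ink"], ["cheap"], True, vocab)
```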
Webpage Classification
Problem: determine the topics of a webpage
  does it talk about arts, finance, or sports?
  is it a personal homepage, a university department page, etc.?
Features:
  text (BoW)
  HTML tags
  URL
  page layout and images
  links
How to combine features:
  integrate different information sources into a unified feature representation
  propagate features or class labels through links
  modify standard algorithms
Information Aggregation in Query Classification
Classify each query into a tree-structured taxonomy
  Apparel and Jewelry/Shoes/Womens Shoes
  Mass Merchants/Baby Products
  ...
Challenges
  Large scale: approximately 6000 nodes
  Difficulty:
    queries are brief: on average 2.4 to 2.7 words per query
    query words alone don't provide sufficient information for good query classification
Solution: employ auxiliary knowledge to augment the queries
Auxiliary knowledge:
  send the query to a major search engine
  augment the query using the top pages returned by the search engine
  this remedies the problem of query brevity: words contained in the top result pages reveal the category
Working Example
Query: nikon
Top search result pages contain: camera, photography, lens, ...
These augmented words imply "Digital Camera" as a category.
Can provide matching ads about digital cameras.
Search based Query Classification
Notation:
  q: query; p: web page; C = {C_j}: set of categories
Problem: given query q, find s(q, C_j)
  s(q, C_j) is the quality score of query q belonging to category C_j
Information source:
  top search results containing pages p_1, ..., p_k with high relevance to q
  s(p, C_j): quality score of page p belonging to category C_j, known through a separate web-page classifier
Information aggregation by voting:

  s(q, C_j) = (1/k) * sum_{i=1}^{k} s(p_i, C_j),

where p_i is the i-th ranked page for query q. Several other aggregation methods exist.
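The voting rule can be sketched directly; the category scores below are made-up illustrations of s(p_i, C_j):

```python
def query_category_score(page_scores):
    """Voting: s(q, C_j) = (1/k) * sum_{i=1..k} s(p_i, C_j)."""
    k = len(page_scores)
    categories = page_scores[0].keys()
    return {c: sum(s[c] for s in page_scores) / k for c in categories}

# s(p_i, C_j) for the top-k pages, from a separate web-page classifier
page_scores = [
    {"Digital Camera": 0.9, "Baby Products": 0.1},
    {"Digital Camera": 0.7, "Baby Products": 0.0},
]
scores_q = query_category_score(page_scores)
```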
Performance Evaluation
Small number of positive examples; most data are negative.
Precision, recall, and F-measure:

  precision = (number of correct positive predictions) / (number of positive predictions)

  recall = (number of correct positive predictions) / (number of positive class documents)

  F-measure = 2 / (1/precision + 1/recall)
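These three measures can be sketched on sets of positive predictions and positive labels (the helper name is illustrative):

```python
def precision_recall_f(predicted, actual):
    """Precision, recall, and F-measure for sets of positive predictions/labels."""
    correct = len(predicted & actual)             # correct positive predictions
    precision = correct / len(predicted)
    recall = correct / len(actual)
    f_measure = 2 / (1 / precision + 1 / recall)  # harmonic mean
    return precision, recall, f_measure

# toy example: 3 positive predictions, 4 positive documents, 2 correct
p, r, f = precision_recall_f({1, 2, 3}, {2, 3, 4, 5})
```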
Effect of using Search Results
[Figure: precision-recall curves (precision from 0.4 to 1.0, recall from 0.1 to 1.0) comparing the Baseline with Engine A full-page, Engine A summary, Engine B full-page, and Engine B summary.]
Ranking
Rank a set of items and display them to users in the corresponding order.
Important in web search:
  web-page ranking: display ranked pages for a query
  query refinement and spelling correction: display ranked suggestions and candidate corrections
  web-page summary: display ranked sentence segments
Related: crawling/indexing:
  which page to crawl first
  which pages to keep in the index: priority/quality
Web-Search Problem
User types a query, and the search engine returns a result page: select pages from billions of pages.
Method: given a query,
  the search engine assigns a relevance score to each page
  and returns pages ranked by the scores.
Quality of a search engine:
  relevance (whether returned pages are on topic and authoritative)
  presentation issues (diversity, perceived relevance, etc.)
  personalization (predict user-specific intention)
  coverage (size and quality of the index)
  freshness (whether contents are timely)
  responsiveness (how quickly the search engine responds to the query)
  ...
Web-Search Ranking: Notations
Notation:
  q: query
  p: webpage
  y(p, q): true relevance of page p to query q, rated by humans
  x(p, q): feature vector the search engine creates for page p and query q
  f(x(p, q)): quality score the search engine assigns
Web-search process:
  user inputs query q
  search engine returns pages p ordered by highest scores f(x(p, q))
Relevance Ranking: Machine Learning Approach
Training:
  randomly select queries q, and web pages p for each query
  use editorial judgment to assign a relevance grade y(p, q)
  construct a feature vector x(p, q) for each query/page pair
  learn a scoring function f̂(x(p, q)) that preserves the order of y(p, q) for each q
Deployment:
  query q comes in
  return pages p_1, ..., p_m in descending order of f̂(x(p, q))
Measuring Ranking Quality
Given a scoring function f̂, return an ordered page list p_1, ..., p_m for a query q.
  Only the order information is important.
  Should focus on the relevance of returned pages near the top.
DCG (discounted cumulative gain) with decreasing weights c_i:

  DCG(f̂, q) = sum_{i=1}^{m} c_i r(p_i, q)

c_i reflects the effort (or likelihood) of the user clicking on the i-th position.
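The slide leaves the weights c_i abstract; a common choice (an assumption here, not stated in the talk) is c_i = 1/log2(i + 1):

```python
import math

def dcg(relevances):
    """DCG = sum_i c_i * r(p_i, q), with assumed weights c_i = 1/log2(i+1)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances, start=1))

# relevance grades r(p_i, q) of the returned pages, best-first
score = dcg([3, 2, 3, 0, 1])
```

Because the weights decrease with position, swapping a relevant page down the list lowers the score, so DCG rewards putting relevant pages near the top.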
Ranking and Pairwise Preference Learning
The quality of ranking depends only on the relative order of {f(x(p_i, q)) : i} for each query q.
Preference relationship: if y(p_i, q) < y(p_j, q), then

  x(p_i, q) ≺ x(p_j, q)

i.e., p_j is more relevant than p_i for query q.
Pairwise preference learning:
  learn a scoring function f over items that preserves the preference ≺
  for two items x and x': f(x) < f(x') when x ≺ x'
  order inputs according to f(x)
Example Loss Function for Preference Learning
Training data: query-url feature vectors x_i for i = 1, ..., n
  i ≺ j if the url of x_j is more relevant than the url of x_i for a certain query q
  Let S be the set of index pairs with preference relationship i ≺ j
  Let f(X) = [f(x_1), ..., f(x_n)]
Example loss:

  R(f(X)) = sum_{(i ≺ j) ∈ S} max(0, 1 + f(x_i) - f(x_j))^2
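This squared-hinge loss on preference pairs can be sketched directly (the scores and pairs below are made-up):

```python
def preference_loss(f_values, prefs):
    """R(f(X)) = sum over (i, j) in S with i ≺ j of max(0, 1 + f(x_i) - f(x_j))^2."""
    return sum(max(0.0, 1.0 + f_values[i] - f_values[j]) ** 2 for i, j in prefs)

f_values = [0.2, 1.5, 0.9]    # current scores f(x_i)
prefs = [(0, 1), (2, 1)]      # (i, j) means i ≺ j: item j should outscore item i
loss = preference_loss(f_values, prefs)
```

The first pair already has a score gap larger than the margin of 1 and contributes zero; only the second pair is penalized.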
Gradient Boosted Decision Tree Formula for Ranking
Let f(x) = 0.
Iterate t = 1, 2, ...:
  For each i = 1, ..., n, compute the gradient

    r_i = ∂R(f)/∂f_i
        = 2 sum_{(i ≺ k) ∈ S} max(0, 1 + f(x_i) - f(x_k))
        - 2 sum_{(k ≺ i) ∈ S} max(0, 1 + f(x_k) - f(x_i))

  Find a decision tree g_t that approximately minimizes

    sum_{i=1}^{n} (g(x_i) - r_i)^2

  using a regression tree algorithm.
  Pick a step size η_t and update

    f(x) ← f(x) - η_t g_t(x)
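A runnable sketch of this loop on a 1-d toy problem. A depth-1 stump stands in for the regression tree learner and the step size is fixed; both are simplifying assumptions, not the talk's actual setup:

```python
def gradient(f_values, prefs):
    """r_i = 2*sum_{i≺k} max(0, 1+f_i-f_k) - 2*sum_{k≺i} max(0, 1+f_k-f_i)."""
    r = [0.0] * len(f_values)
    for i, k in prefs:                            # each pair (i, k) means i ≺ k
        margin = max(0.0, 1.0 + f_values[i] - f_values[k])
        r[i] += 2.0 * margin
        r[k] -= 2.0 * margin
    return r

def fit_stump(xs, targets):
    """Depth-1 regression tree on a 1-d feature: minimizes squared error."""
    best = None
    for split in xs:
        left = [t for x, t in zip(xs, targets) if x <= split]
        right = [t for x, t in zip(xs, targets) if x > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - (lm if x <= split else rm)) ** 2
                  for x, t in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    if best is None:                              # no valid split: predict the mean
        m = sum(targets) / len(targets)
        return lambda x: m
    _, split, lm, rm = best
    return lambda x: lm if x <= split else rm

def gb_rank(xs, prefs, rounds=50, eta=0.1):
    """f(x) = 0; repeat: fit tree g_t to the gradient, update f <- f - eta*g_t."""
    f_values = [0.0] * len(xs)
    for _ in range(rounds):                       # a real system would keep the trees
        g = fit_stump(xs, gradient(f_values, prefs))
        f_values = [fv - eta * g(x) for fv, x in zip(f_values, xs)]
    return f_values

# toy data: prefs say item 0 ≺ 1 ≺ 2, so learned scores should be increasing
xs = [0.1, 0.5, 0.9]
scores = gb_rank(xs, prefs=[(0, 1), (1, 2), (0, 2)])
```

A production ranker would store the fitted trees so that unseen query/page feature vectors can be scored at deployment time.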
An Application of Preference Learning in Web-search
A web-search dataset: determine the relevance of (query, url) pairs
  GBrank: boosted trees based on preference learning
  GBDT: boosted trees based on regression
  RankSVM: SVM based on preference learning

Table: Precision at K% for GBrank, GBDT, and RankSVM

K%   | GBrank | GBDT   | RankSVM
10%  | 0.9867 | 0.9243 | 0.8524
20%  | 0.9722 | 0.8833 | 0.8152
50%  | 0.8638 | 0.7814 | 0.7357
100% | 0.7225 | 0.6742 | 0.6465
Summary
Many machine learning problems on the web
Many information sources
Challenges:
  how to formulate the problems
  how to generate features
  how to aggregate information
  how to adapt learning models
  how to control data quality
  how to evaluate performance
  how to handle large-scale computing
  ...