Learning on the Web
Tong Zhang
Rutgers University
T. Zhang (Rutgers) Learning on the Web 1 / 27
Machine Learning Problems on Web
Classification
Ranking
User Behavior Modeling
Recommendation
Community Analysis
Quality Assessment
Exploration-Exploitation
Scalability
...
Classification
Electronic Spam
  email spam: unwanted email
  webpage spam: low-quality pages designed to be ranked highly
  blog spam: random blog pages created to promote other pages
  click spam: misleading clicks on ads or webpages
  text messaging spam: unwanted text messages
  usually aims for commercial gain
Sentiment analysis
Webpage classification
Query classification
...
Key issues: problem formulation; feature generation; information aggregation; model adaptation
Spam Email
[Example of a Chinese spam email; the body text was garbled in extraction. Recoverable details: a QQ contact number 2297845601 and a phone number 13715362114.]
From: [email protected] (sender name garbled)
Subject: (garbled)
Date: July 25, 2012 9:43:08 PM GMT+08:00
Reply-To: [email protected]
Sentiment Analysis
Two iPad reviews: do they like or dislike the product?

Review 1: This is my first iPad and I just absolutely love it!!! I previously owned a tablet but this, by far, beats the tablet I had!! It is so easy to use and the retina is amazing! I now understand why people love their iPad! ...

Review 2: Poor quality control. Found the corners at the edges where the screen meets the body, to be crimped on each side...

Free text like this is referred to as unstructured data.
Structured Data Example
A table or relational database
Gender | Systolic BP | Weight | Disease Code
M      | 175         | 65     | 3
F      | 141         | 72     | 1
...    | ...         | ...    | ...
F      | 160         | 59     | 2
Figure: Example of Medical Data Prediction
Structured versus Unstructured Data
Structured data:
  table or spreadsheet
  relational database with well-defined attributes (features)
  features are usually dense
Unstructured data:
  free-format text
  without well-defined attributes
Learning: extract information, find patterns, organize contents
  encode desired information into unknown labels to predict (output)
  encode available unstructured data into sparse feature vectors
  combine structured and unstructured data
Encoding Unstructured Data
Goal: represent text by a feature vector
Method: vector space model
  create a dictionary of size m consisting of all words
  map each document into an m-dimensional vector
  the i-th component is the frequency of word i in the document
  the feature vector is very sparse and high dimensional
Bag-of-words (BoW): represents text without word-ordering information
Improvements:
  can preserve section or partial position information
  can combine multiple dictionaries and use phrases
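As a concrete illustration of the vector space model above, here is a minimal sketch in Python (the helper names build_vocab and bow_vector are illustrative, not from the talk):

```python
from collections import Counter

def build_vocab(docs):
    """Create a dictionary of size m mapping each word to a component index."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def bow_vector(doc, vocab):
    """Map a document to a sparse bag-of-words vector {index: frequency}."""
    counts = Counter(doc.lower().split())
    return {vocab[w]: c for w, c in counts.items() if w in vocab}

docs = ["the retina is amazing", "poor quality control", "the quality is poor"]
vocab = build_vocab(docs)
sparse = bow_vector("poor poor quality", vocab)
```

Storing only the nonzero entries matters because the dictionary is large while each document touches only a few words.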
Bag of Word Document Representation
Rows are documents, columns are terms:

         word1 word2 word3 word4 word5 ... wordN | label
doc 1:     0     2     4     1     0   ...   0   |   1
doc 2:     1     0     3     0     0   ...   0   |   0
doc 3:     2     1     0     5     1   ...   0   |   1
doc 4:     0     4     0     0     2   ...   0   |   0
doc 5:     1     2     1     2     0   ...   1   |   1
doc 6:     0     1     0     3     3   ...   0   |   1
doc 7:     2     0     1     4     0   ...   2   |   0
doc 8:     0     3     1     2     1   ...   0   |   1
Figure: Document BoW Representation
Term Weighting
Modify each word count by the perceived importance of the word
Rare words carry more information than common words.
TFIDF weighting of token j in document d:

  tf-idf(j, d) = tf(j, d) * idf(j),    idf(j) = log(N / df(j))

where N is the total number of documents, tf(j, d) is the term frequency of token j in document d, and df(j) is the number of documents containing term j.
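A minimal sketch of this weighting over tokenized documents (the function name tf_idf is illustrative):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight counts by tf-idf(j, d) = tf(j, d) * log(N / df(j))."""
    n = len(docs)
    df = Counter()                       # df(j): number of documents containing term j
    for doc in docs:
        df.update(set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)                # tf(j, d): term frequency in document d
        weighted.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return weighted

docs = [["cheap", "ink", "cheap"], ["ink", "sale"], ["meeting", "notes"]]
vecs = tf_idf(docs)
```

A word occurring in every document gets idf = log(1) = 0, i.e. no weight, while rare words are boosted.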
Example Feature Vector for Email Spam Detection
          text:title     text:body          text:body   nontext              | label
          ...cheap...    ...enlargement...  ...ink      from known spam host | spam
email 1:      yes            yes               yes             yes           | true
email 2:      no             yes               no              yes           | true
email 3:      no             no                no              no            | false
...           ...            ...               ...             ...           | ...
Feature representation:
  bag-of-words binary feature representation of the email text, without TFIDF
  known spam host used as a nontext feature
Text features versus nontext features:
  text features: sparse BoW representation; a linear classifier works well
  nontext features: dense and heterogeneous; often need non-linear interactions
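A sketch of how such a mixed feature vector could be assembled; the vocabulary and helper name below are hypothetical, not from the talk:

```python
def email_features(body_words, title_words, from_spam_host, vocab):
    """Binary BoW features from the email text plus one dense nontext feature."""
    x = [0] * (len(vocab) + 1)
    for w in set(body_words) | set(title_words):
        if w in vocab:
            x[vocab[w]] = 1                       # binary presence, no TFIDF
    x[len(vocab)] = 1 if from_spam_host else 0    # nontext: known spam host
    return x

vocab = {"cheap": 0, "enlargement": 1, "ink": 2}
x = email_features(["buy", "cheap", "ink"], ["cheap"], True, vocab)
```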
Webpage Classification
Problem: determine the topics of a webpage
  does it talk about arts, finance, or sports?
  is it a personal homepage, a university department page, etc.?
Features:
  text (BoW)
  HTML tags
  URL
  page layout and images
  links
How to combine features:
  integrate different information sources into a unified feature representation
  propagate features or class labels through links
  modify standard algorithms
Information Aggregation in Query Classification
Classify each query into a tree-structured taxonomy
  Apparel and Jewelry/Shoes/Womens Shoes
  Mass Merchants/Baby Products
  ...
Challenges
  Large scale: approximately 6000 nodes
  Difficulty:
    queries are brief: on average 2.4 to 2.7 words per query
    query words alone don't provide sufficient information for good query classification
Solution: employ auxiliary knowledge to augment the queries
Auxiliary knowledge:
  send the query to a major search engine
  augment the query using the top pages returned by the search engine
  this remedies the problem of query brevity: words contained in the top result pages reveal the category
Working Example
Query: nikon
Top search result pages contain: camera, photography, lens, ...
These augmented words imply "Digital Camera" as a category.
Can provide matching ads about digital cameras.
Search based Query Classification
Notation:
  q: query; p: web page; C = {C_j}: set of categories
Problem: given query q, find s(q, C_j)
  s(q, C_j) is the quality score of query q belonging to category C_j
Information source:
  top search results containing pages p_1, ..., p_k with high relevance to q
  s(p, C_j): quality score of page p belonging to category C_j, known through a separate web-page classifier
Information aggregation by voting:

  s(q, C_j) = (1/k) * sum_{i=1}^{k} s(p_i, C_j),

where p_i is the i-th ranked page for query q. Several other aggregation methods exist.
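The voting rule can be sketched directly; the category scores below are made-up illustrations of s(p_i, C_j):

```python
def query_category_score(page_scores):
    """Voting: s(q, C_j) = (1/k) * sum_{i=1..k} s(p_i, C_j)."""
    k = len(page_scores)
    categories = page_scores[0].keys()
    return {c: sum(s[c] for s in page_scores) / k for c in categories}

# s(p_i, C_j) for the top-k pages, from a separate web-page classifier
page_scores = [
    {"Digital Camera": 0.9, "Baby Products": 0.1},
    {"Digital Camera": 0.7, "Baby Products": 0.0},
]
scores_q = query_category_score(page_scores)
```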
Performance Evaluation
Small number of positive examples; most data are negative.
Precision, recall, and F-measure:

  precision = (number of correct positive predictions) / (number of positive predictions)

  recall = (number of correct positive predictions) / (number of positive class documents)

  F-measure = 2 / (1/precision + 1/recall)
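These three measures can be sketched on sets of positive predictions and positive labels (the helper name is illustrative):

```python
def precision_recall_f(predicted, actual):
    """Precision, recall, and F-measure for sets of positive predictions/labels."""
    correct = len(predicted & actual)             # correct positive predictions
    precision = correct / len(predicted)
    recall = correct / len(actual)
    f_measure = 2 / (1 / precision + 1 / recall)  # harmonic mean
    return precision, recall, f_measure

# toy example: 3 positive predictions, 4 positive documents, 2 correct
p, r, f = precision_recall_f({1, 2, 3}, {2, 3, 4, 5})
```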
Effect of using Search Results
[Figure: precision-recall curves (precision from 0.4 to 1.0, recall from 0.1 to 1.0) comparing the Baseline with Engine A full-page, Engine A summary, Engine B full-page, and Engine B summary.]
Ranking
Rank a set of items and display them to users in the corresponding order.
Important in web search:
  web-page ranking: display ranked pages for a query
  query refinement and spelling correction: display ranked suggestions and candidate corrections
  web-page summary: display ranked sentence segments
Related: crawling/indexing:
  which page to crawl first
  which pages to keep in the index: priority/quality
Web-Search Problem
User types a query, and the search engine returns a result page: select pages from billions of pages.
Method: given a query,
  the search engine assigns a relevance score to each page
  and returns pages ranked by the scores.
Quality of a search engine:
  relevance (whether returned pages are on topic and authoritative)
  presentation issues (diversity, perceived relevance, etc.)
  personalization (predict user-specific intention)
  coverage (size and quality of the index)
  freshness (whether contents are timely)
  responsiveness (how quickly the search engine responds to the query)
  ...
Web-Search Ranking: Notations
Notation:
  q: query
  p: webpage
  y(p, q): true relevance of page p to query q, rated by humans
  x(p, q): feature vector the search engine creates for page p and query q
  f(x(p, q)): quality score the search engine assigns
Web-search process:
  user inputs query q
  search engine returns pages p ordered by highest scores f(x(p, q))
Relevance Ranking: Machine Learning Approach
Training:
  randomly select queries q, and web pages p for each query
  use editorial judgment to assign a relevance grade y(p, q)
  construct a feature vector x(p, q) for each query/page pair
  learn a scoring function f̂(x(p, q)) that preserves the order of y(p, q) for each q
Deployment:
  query q comes in
  return pages p_1, ..., p_m in descending order of f̂(x(p, q))
Measuring Ranking Quality
Given a scoring function f̂, return an ordered page list p_1, ..., p_m for a query q.
  Only the order information is important.
  Should focus on the relevance of returned pages near the top.
DCG (discounted cumulative gain) with decreasing weights c_i:

  DCG(f̂, q) = sum_{i=1}^{m} c_i r(p_i, q)

c_i reflects the effort (or likelihood) of the user clicking on the i-th position.
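The slide leaves the weights c_i abstract; a common choice (an assumption here, not stated in the talk) is c_i = 1/log2(i + 1):

```python
import math

def dcg(relevances):
    """DCG = sum_i c_i * r(p_i, q), with assumed weights c_i = 1/log2(i+1)."""
    return sum(r / math.log2(i + 1) for i, r in enumerate(relevances, start=1))

# relevance grades r(p_i, q) of the returned pages, best-first
score = dcg([3, 2, 3, 0, 1])
```

Because the weights decrease with position, swapping a relevant page down the list lowers the score, so DCG rewards putting relevant pages near the top.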
Ranking and Pairwise Preference Learning
The quality of ranking depends only on the relative order of {f(x(p_i, q)) : i} for each query q.
Preference relationship: if y(p_i, q) < y(p_j, q), then

  x(p_i, q) ≺ x(p_j, q)

i.e., p_j is more relevant than p_i for query q.
Pairwise preference learning:
  learn a scoring function f over items that preserves the preference ≺
  for two items x and x': f(x) < f(x') when x ≺ x'
  order inputs according to f(x)
Example Loss Function for Preference Learning
Training data: query-url feature vectors x_i for i = 1, ..., n
  i ≺ j if the url of x_j is more relevant than the url of x_i for a certain query q
  Let S be the set of index pairs with preference relationship i ≺ j
  Let f(X) = [f(x_1), ..., f(x_n)]
Example loss:

  R(f(X)) = sum_{(i ≺ j) ∈ S} max(0, 1 + f(x_i) - f(x_j))^2
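This squared-hinge loss on preference pairs can be sketched directly (the scores and pairs below are made-up):

```python
def preference_loss(f_values, prefs):
    """R(f(X)) = sum over (i, j) in S with i ≺ j of max(0, 1 + f(x_i) - f(x_j))^2."""
    return sum(max(0.0, 1.0 + f_values[i] - f_values[j]) ** 2 for i, j in prefs)

f_values = [0.2, 1.5, 0.9]    # current scores f(x_i)
prefs = [(0, 1), (2, 1)]      # (i, j) means i ≺ j: item j should outscore item i
loss = preference_loss(f_values, prefs)
```

The first pair already has a score gap larger than the margin of 1 and contributes zero; only the second pair is penalized.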
Gradient Boosted Decision Tree Formula for Ranking
Let f(x) = 0.
Iterate t = 1, 2, ...:
  For each i = 1, ..., n, compute the gradient

    r_i = ∂R(f)/∂f_i
        = 2 sum_{(i ≺ k) ∈ S} max(0, 1 + f(x_i) - f(x_k))
        - 2 sum_{(k ≺ i) ∈ S} max(0, 1 + f(x_k) - f(x_i))

  Find a decision tree g_t that approximately minimizes

    sum_{i=1}^{n} (g(x_i) - r_i)^2

  using a regression tree algorithm.
  Pick a step size η_t and update

    f(x) ← f(x) - η_t g_t(x)
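A runnable sketch of this loop on a 1-d toy problem. A depth-1 stump stands in for the regression tree learner and the step size is fixed; both are simplifying assumptions, not the talk's actual setup:

```python
def gradient(f_values, prefs):
    """r_i = 2*sum_{i≺k} max(0, 1+f_i-f_k) - 2*sum_{k≺i} max(0, 1+f_k-f_i)."""
    r = [0.0] * len(f_values)
    for i, k in prefs:                            # each pair (i, k) means i ≺ k
        margin = max(0.0, 1.0 + f_values[i] - f_values[k])
        r[i] += 2.0 * margin
        r[k] -= 2.0 * margin
    return r

def fit_stump(xs, targets):
    """Depth-1 regression tree on a 1-d feature: minimizes squared error."""
    best = None
    for split in xs:
        left = [t for x, t in zip(xs, targets) if x <= split]
        right = [t for x, t in zip(xs, targets) if x > split]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - (lm if x <= split else rm)) ** 2
                  for x, t in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, split, lm, rm)
    if best is None:                              # no valid split: predict the mean
        m = sum(targets) / len(targets)
        return lambda x: m
    _, split, lm, rm = best
    return lambda x: lm if x <= split else rm

def gb_rank(xs, prefs, rounds=50, eta=0.1):
    """f(x) = 0; repeat: fit tree g_t to the gradient, update f <- f - eta*g_t."""
    f_values = [0.0] * len(xs)
    for _ in range(rounds):                       # a real system would keep the trees
        g = fit_stump(xs, gradient(f_values, prefs))
        f_values = [fv - eta * g(x) for fv, x in zip(f_values, xs)]
    return f_values

# toy data: prefs say item 0 ≺ 1 ≺ 2, so learned scores should be increasing
xs = [0.1, 0.5, 0.9]
scores = gb_rank(xs, prefs=[(0, 1), (1, 2), (0, 2)])
```

A production ranker would store the fitted trees so that unseen query/page feature vectors can be scored at deployment time.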
An Application of Preference Learning in Web-search
A web-search dataset: determine the relevance of (query, url) pairs
  GBrank: boosted trees based on preference learning
  GBDT: boosted trees based on regression
  RankSVM: SVM based on preference learning

Table: Precision at K% for GBrank, GBDT, and RankSVM

K%   | GBrank | GBDT   | RankSVM
10%  | 0.9867 | 0.9243 | 0.8524
20%  | 0.9722 | 0.8833 | 0.8152
50%  | 0.8638 | 0.7814 | 0.7357
100% | 0.7225 | 0.6742 | 0.6465
Summary
Many machine learning problems on the web
Many information sources
Challenges:
  how to formulate the problems
  how to generate features
  how to aggregate information
  how to adapt learning models
  how to control data quality
  how to evaluate performance
  how to handle large-scale computing
  ...