learning on the web - bigeye.au.tsinghua.edu.cn

37
Learning on the Web Tong Zhang Rutgers University T. Zhang (Rutgers) Learning on the Web 1 / 27

Upload: others

Post on 10-Dec-2021

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Learning on the Web - bigeye.au.tsinghua.edu.cn

Learning on the Web

Tong Zhang

Rutgers University

T. Zhang (Rutgers) Learning on the Web 1 / 27

Page 2: Learning on the Web - bigeye.au.tsinghua.edu.cn

Machine Learning Problems on Web

ClassificationRankingUser Behavior ModelingRecommendationCommunity AnalysisQuality AssessmentExploration ExploitationScalability...

T. Zhang (Rutgers) Learning on the Web 2 / 27

Page 3: Learning on the Web - bigeye.au.tsinghua.edu.cn

Machine Learning Problems on Web

ClassificationRankingUser Behavior ModelingRecommendationCommunity AnalysisQuality AssessmentExploration ExploitationScalability...

T. Zhang (Rutgers) Learning on the Web 2 / 27

Page 4: Learning on the Web - bigeye.au.tsinghua.edu.cn

Classification

Electronic Spamemail spam: unwanted emailwebpage spam: low-quality pages to be placed highblog spam: random blog pages to promote other pagesclick spam: misleading clicks of ads or webpagestext messaging spam: unwanted text messagesusually aim for commercial gains

Sentiment analysisWebpage classificationQuery classification...

Key issues:problem formulation; feature generation; information aggregation;model adaptation

T. Zhang (Rutgers) Learning on the Web 3 / 27

Page 5: Learning on the Web - bigeye.au.tsinghua.edu.cn

Classification

Electronic Spamemail spam: unwanted emailwebpage spam: low-quality pages to be placed highblog spam: random blog pages to promote other pagesclick spam: misleading clicks of ads or webpagestext messaging spam: unwanted text messagesusually aim for commercial gains

Sentiment analysisWebpage classificationQuery classification...

Key issues:problem formulation; feature generation; information aggregation;model adaptation

T. Zhang (Rutgers) Learning on the Web 3 / 27

Page 6: Learning on the Web - bigeye.au.tsinghua.edu.cn

Spam Email

� ,�&,��,��

'(,��,�!,��,��,%$,��,�#,�,

��& , �,.......���&;��)

"�QQ-2297845601 ,,"����-13715362114

*��+��*��+

From: �� <[email protected]>Subject: �� �������

Date: July 25, 2012 9:43:08 PM GMT+08:00Reply-To: <[email protected]>

T. Zhang (Rutgers) Learning on the Web 4 / 27

Page 7: Learning on the Web - bigeye.au.tsinghua.edu.cn

Sentiment Analysis

Two ipad reviews: do they like or dislike the product?

Review 1 Review 2

This is my first iPad and Ijust absolutely love it!!! I pre-viously owned a tablet butthis by far, beats the tablet Ihad!! It is so easy to use andthe retina is amazing! I nowunderstand why people lovetheir iPad! ...

Poor quality control. Foundthe corners at the edgeswhere the screen meets thebody, to be crimped on eachside...

free text is referred to as unstructured data

T. Zhang (Rutgers) Learning on the Web 5 / 27

Page 8: Learning on the Web - bigeye.au.tsinghua.edu.cn

Sentiment Analysis

Two ipad reviews: do they like or dislike the product?

Review 1 Review 2

This is my first iPad and Ijust absolutely love it!!! I pre-viously owned a tablet butthis by far, beats the tablet Ihad!! It is so easy to use andthe retina is amazing! I nowunderstand why people lovetheir iPad! ...

Poor quality control. Foundthe corners at the edgeswhere the screen meets thebody, to be crimped on eachside...

free text is referred to as unstructured data

T. Zhang (Rutgers) Learning on the Web 5 / 27

Page 9: Learning on the Web - bigeye.au.tsinghua.edu.cn

Structured Data Example

A table or relational database

Systolic DiseaseGender BP Weight Code

M 175 65 3F 141 72 1

. . . . . . . . . . . .F 160 59 2

Figure: Example of Medical Data Prediction

T. Zhang (Rutgers) Learning on the Web 6 / 27

Page 10: Learning on the Web - bigeye.au.tsinghua.edu.cn

Structured versus Unstructured Data

Structured data:table or spreadsheetrelational database with well-defined attributes (features)features are usually dense

Unstructured datafree format textwithout well-defined attributes

Learning: extract information, find patterns, organize contentsencode desired information into unknown labels to predict (output).encode available unstructured data into sparse feature vectorcombine structured and unstructured data

T. Zhang (Rutgers) Learning on the Web 7 / 27

Page 11: Learning on the Web - bigeye.au.tsinghua.edu.cn

Structured versus Unstructured Data

Structured data:table or spreadsheetrelational database with well-defined attributes (features)features are usually dense

Unstructured datafree format textwithout well-defined attributes

Learning: extract information, find patterns, organize contentsencode desired information into unknown labels to predict (output).encode available unstructured data into sparse feature vectorcombine structured and unstructured data

T. Zhang (Rutgers) Learning on the Web 7 / 27

Page 12: Learning on the Web - bigeye.au.tsinghua.edu.cn

Encoding UnStructured Data

Goal: represent text by a feature vectorMethod: vector space model

create dictionary of size m consisted of all wordsmap each document into an m-dimensional vector

the i-th component is the frequency of word i in the documentfeature vector is very sparse and high dimensional

Bag-of-words (BoW): represent text without word ordering infoImprovements

can preserve section or partial position information.can combine multiple dictionaries and use phrases

T. Zhang (Rutgers) Learning on the Web 8 / 27

Page 13: Learning on the Web - bigeye.au.tsinghua.edu.cn

Bag of Word Document Representation

document

term

word1 word2 word3 word4 word5 . . . wordN label0 2 4 1 0 . . . 0 11 0 3 0 0 . . . 0 02 1 0 5 1 . . . 0 10 4 0 0 2 . . . 0 01 2 1 2 0 . . . 1 10 1 0 3 3 . . . 0 12 0 1 4 0 . . . 2 00 3 1 2 1 . . . 0 1

Figure: Document BoW Representation

T. Zhang (Rutgers) Learning on the Web 9 / 27

Page 14: Learning on the Web - bigeye.au.tsinghua.edu.cn

Term Weighting

Modify each word count by the perceived importance of the word

Rare words carry more information than common wordsTFIDF weighting of token j in document d :

tf-idf(j ,d) = tf(j ,d) ∗ idf(j)

idf(j) = log(

number of documentsdf( j)

)

tf(j ,d): term frequency of token j in document ddf(j): frequency of documents containing term j

T. Zhang (Rutgers) Learning on the Web 10 / 27

Page 15: Learning on the Web - bigeye.au.tsinghua.edu.cn

Term Weighting

Modify each word count by the perceived importance of the wordRare words carry more information than common wordsTFIDF weighting of token j in document d :

tf-idf(j ,d) = tf(j ,d) ∗ idf(j)

idf(j) = log(

number of documentsdf( j)

)

tf(j ,d): term frequency of token j in document ddf(j): frequency of documents containing term j

T. Zhang (Rutgers) Learning on the Web 10 / 27

Page 16: Learning on the Web - bigeye.au.tsinghua.edu.cn

Example Feature Vector for Email Spam Detection

text:title text:body nontext label. . . cheap . . . enlargement . . . ink from known spam host spam. . . yes . . . yes . . . yes yes true. . . no . . . yes . . . no yes true. . . no . . . no . . . no no false. . . . . . . . . . . . . . . . . . . . . . . .

Feature representationbag-of-word binary feature representation of email text without TFIDFusing known spam host as nontext features

Text feature versus nontext featuretext feature: sparse BoW representation, linear classifier works wellnontext feature: dense and heterogeneous, often needs non-linearinteraction

T. Zhang (Rutgers) Learning on the Web 11 / 27

Page 17: Learning on the Web - bigeye.au.tsinghua.edu.cn

Webpage Classification

Problem: determine the topics of a webpagedoes it talks about arts, finance, sports?is it a personal homepage, university department page, etc?

Features:text (BoW)HTML tagurlpage layout and imageslinks

How to combine featuresintegrate different information source into a unified featurerepresentationpropagate features or class labels through links

Modify standard algorithms

T. Zhang (Rutgers) Learning on the Web 12 / 27

Page 18: Learning on the Web - bigeye.au.tsinghua.edu.cn

Information Aggregation in Query Classification

Classify each query into a tree-structured taxonomyApparel and Jewelry/Shoes/Womens ShoesMass Merchants/Baby Products...

ChallengesLarge scale: approximately 6000 nodesDifficulty:

queries are brief: average 2.4 to 2.7 words per queryquery words alone don’t provide sufficient information for good queryclassification

Solution: employ auxiliary knowledge to augment the queriesAuxiliary Knowledge

Send query to a major search engineAugment the query using top pages returned by the search engine.Remedy the problem of query brevity:

words contained in top results pages reveal the category

T. Zhang (Rutgers) Learning on the Web 13 / 27

Page 19: Learning on the Web - bigeye.au.tsinghua.edu.cn

Information Aggregation in Query Classification

Classify each query into a tree-structured taxonomyApparel and Jewelry/Shoes/Womens ShoesMass Merchants/Baby Products...

ChallengesLarge scale: approximately 6000 nodesDifficulty:

queries are brief: average 2.4 to 2.7 words per queryquery words alone don’t provide sufficient information for good queryclassification

Solution: employ auxiliary knowledge to augment the queries

Auxiliary KnowledgeSend query to a major search engineAugment the query using top pages returned by the search engine.Remedy the problem of query brevity:

words contained in top results pages reveal the category

T. Zhang (Rutgers) Learning on the Web 13 / 27

Page 20: Learning on the Web - bigeye.au.tsinghua.edu.cn

Information Aggregation in Query Classification

Classify each query into a tree-structured taxonomyApparel and Jewelry/Shoes/Womens ShoesMass Merchants/Baby Products...

ChallengesLarge scale: approximately 6000 nodesDifficulty:

queries are brief: average 2.4 to 2.7 words per queryquery words alone don’t provide sufficient information for good queryclassification

Solution: employ auxiliary knowledge to augment the queriesAuxiliary Knowledge

Send query to a major search engineAugment the query using top pages returned by the search engine.Remedy the problem of query brevity:

words contained in top results pages reveal the category

T. Zhang (Rutgers) Learning on the Web 13 / 27

Page 21: Learning on the Web - bigeye.au.tsinghua.edu.cn

Working Example

Query: nikonTop search result pages contain: camera, photography, lens, ...These augmented words imply “Digital Camera” as a category.Can provide matching ads about digital cameras.

T. Zhang (Rutgers) Learning on the Web 14 / 27

Page 22: Learning on the Web - bigeye.au.tsinghua.edu.cn

Search based Query Classification

Notations:q: query, p: web-page, C = {Cj}: set of categories

Problem: given query q, want to find s(q,Cj)s(q,Cj) is the quality score of query q belonging to category Cj .

Information source:top-search results containing pages p1, . . . ,pk with high relevance to qs(p,Cj): quality score of page p belonging to category Cjknown through a separate web-page classifier.

Information aggregation:voting:

s(q,Cj) =k∑

i=1

s(pi ,Cj)/k ,

where pi is the i-th ranked page for query q.several other methods

T. Zhang (Rutgers) Learning on the Web 15 / 27

Page 23: Learning on the Web - bigeye.au.tsinghua.edu.cn

Search based Query Classification

Notations:q: query, p: web-page, C = {Cj}: set of categories

Problem: given query q, want to find s(q,Cj)s(q,Cj) is the quality score of query q belonging to category Cj .Information source:

top-search results containing pages p1, . . . ,pk with high relevance to qs(p,Cj): quality score of page p belonging to category Cjknown through a separate web-page classifier.

Information aggregation:voting:

s(q,Cj) =k∑

i=1

s(pi ,Cj)/k ,

where pi is the i-th ranked page for query q.several other methods

T. Zhang (Rutgers) Learning on the Web 15 / 27

Page 24: Learning on the Web - bigeye.au.tsinghua.edu.cn

Search based Query Classification

Notations:q: query, p: web-page, C = {Cj}: set of categories

Problem: given query q, want to find s(q,Cj)s(q,Cj) is the quality score of query q belonging to category Cj .Information source:

top-search results containing pages p1, . . . ,pk with high relevance to qs(p,Cj): quality score of page p belonging to category Cjknown through a separate web-page classifier.

Information aggregation:voting:

s(q,Cj) =k∑

i=1

s(pi ,Cj)/k ,

where pi is the i-th ranked page for query q.several other methods

T. Zhang (Rutgers) Learning on the Web 15 / 27

Page 25: Learning on the Web - bigeye.au.tsinghua.edu.cn

Performance Evaluation

small number of positive examples, most data are negativeprecision, recall, and F-measure

precision =number of correct positive predictions

number of positive predictions,

recall =number of correct positive predictionsnumber of positive class documents

,

F -measure =2

1/precision + 1/recall

T. Zhang (Rutgers) Learning on the Web 16 / 27

Page 26: Learning on the Web - bigeye.au.tsinghua.edu.cn

Effect of using Search Results

0.4

0.5

0.6

0.7

0.8

0.9

1

1.00.90.80.70.60.50.40.30.20.1

Pre

cisi

on

Recall

BaselineEngine A full-pageEngine A summaryEngine B full-pageEngine B summary

T. Zhang (Rutgers) Learning on the Web 17 / 27

Page 27: Learning on the Web - bigeye.au.tsinghua.edu.cn

Ranking

Rank a set of items and display to users in corresponding order.Important in web-search:

web-page rankingdisplay ranked pages for a query

query-refinement and spelling correctiondisplay ranked suggestions and candidate corrections

web-page summarydisplay ranked sentence segments

related: crawling/indexing:which page to crawl firstpages to keep in the index: priority/quality

T. Zhang (Rutgers) Learning on the Web 18 / 27

Page 28: Learning on the Web - bigeye.au.tsinghua.edu.cn

Web-Search Problem

User types a query, search engine returns a result page:select pages from billions of pages.

Method: given a querysearch engine assign a relevance score for each pagereturn pages ranked by the scores.

Quality of search engine:relevance (whether returned pages are on topic and authoritative)presentation issues (diversity, perceived relevance, etc)personalization (predict user specific intention)coverage (size and quality of index).freshness (whether contents are timely).responsiveness (how quickly search engine responds to the query)....

T. Zhang (Rutgers) Learning on the Web 19 / 27

Page 29: Learning on the Web - bigeye.au.tsinghua.edu.cn

Web-Search Problem

User types a query, search engine returns a result page:select pages from billions of pages.

Method: given a querysearch engine assign a relevance score for each pagereturn pages ranked by the scores.

Quality of search engine:relevance (whether returned pages are on topic and authoritative)presentation issues (diversity, perceived relevance, etc)personalization (predict user specific intention)coverage (size and quality of index).freshness (whether contents are timely).responsiveness (how quickly search engine responds to the query)....

T. Zhang (Rutgers) Learning on the Web 19 / 27

Page 30: Learning on the Web - bigeye.au.tsinghua.edu.cn

Web-Search Ranking: Notations

Notation:q: queryp: webpagey(p,q): true relevance of page p to query q rated by humanx(p,q): search engine creates a feature for page p and query qf (x(p,q)): search engine assigns a quality score f (x(p,q))

Web-search process:user input query qsearch engine returns page p ordered by highest scores f (x(p,q))

T. Zhang (Rutgers) Learning on the Web 20 / 27

Page 31: Learning on the Web - bigeye.au.tsinghua.edu.cn

Relevance Ranking: Machine Learning Approach

Training:randomly select queries q, and web-pages p for each query.use editorial judgment to assign relevance grade y(p,q).construct a feature x(p,q) for each query/page pair.learn scoring function f̂ (x(p,q)) to preserve the order of y(p,q) foreach q.

Deployment:query q comes in.return pages p1, . . . ,pm in descending order of f̂ (x(p,q)).

T. Zhang (Rutgers) Learning on the Web 21 / 27

Page 32: Learning on the Web - bigeye.au.tsinghua.edu.cn

Measuring Ranking Quality

Given scoring function f̂ , return ordered page-list p1, . . . ,pm for aquery q.

only the order information is important.should focus on the relevance of returned pages near the top.

DCG (discounted cumulative gain) with decreasing weight ci

DCG(f̂ ,q) =m∑

j=1

ci r(pi ,q).

ci : reflects effort (or likelihood) of user clicking on the i-th position.

T. Zhang (Rutgers) Learning on the Web 22 / 27

Page 33: Learning on the Web - bigeye.au.tsinghua.edu.cn

Ranking and Pairwise Preference Learning

The quality of ranking only depends on the relative order of{f (x(pi ,q)) : i} for each query qPreference relationship: if y(pi ,q) < y(pj ,q)

x(pi ,q) ≺ x(pj ,q)

pj is more relevant than pi for query qPairwise preference learning

learn a scoring function f for items to preserve preference ≺.two items x and x ′: f (x) < f (x ′) when x ≺ x ′.ordering inputs according to f (x).

T. Zhang (Rutgers) Learning on the Web 23 / 27

Page 34: Learning on the Web - bigeye.au.tsinghua.edu.cn

Example Loss Function for Preference Learning

Training data: query-url features xi for i = 1, . . . ,ni ≺ j if url of xj is more relevant than url of xi for a certain query q.Let S be the indices of preference relationships i ≺ jLet f (X ) = [f (x1), . . . , f (xn)]

Example loss:

R(f (X )) =∑{i≺j}∈S

max(0,1 + f (xi)− f (xj))2

T. Zhang (Rutgers) Learning on the Web 24 / 27

Page 35: Learning on the Web - bigeye.au.tsinghua.edu.cn

Gradient Boosted Decision Tree Formula for Ranking

Let f (x) = 0Iterate t = 1,2, . . .

For each i = 1, . . . ,n, compute

ri =∂

∂fiR(f ) =2

∑{i≺k}∈S

max(0,1 + f (xi)− f (xk ))

− 2∑

{k≺i}∈S

max(0,1 + f (xk )− f (xi))

Find decision tree gt that approximately minimizes

minn∑

i=1

‖g(xi)− ri‖22

using a regression tree algorithm.Pick ηt and let

f (x)← f (x)− ηtgt(x)

T. Zhang (Rutgers) Learning on the Web 25 / 27

Page 36: Learning on the Web - bigeye.au.tsinghua.edu.cn

An Application of Preference Learning in Web-search

A Web-search dataset: determine the relevancy of (query,url) pairGBrank: boosted tree based on preference learningGBDT: boosted tree based on regressionRankSVM: SVM based on preference learning

Table: Precision at K % for GBrank, GBT, and RankSVM

%K GBrank GBDT RankSVM10% 0.9867 0.9243 0.852420% 0.9722 0.8833 0.815250% 0.8638 0.7814 0.7357100% 0.7225 0.6742 0.6465

T. Zhang (Rutgers) Learning on the Web 26 / 27

Page 37: Learning on the Web - bigeye.au.tsinghua.edu.cn

Summary

Many machine learning problems on the webMany information sourcesChallenges:

how to formulate the problemshow to generate featureshow to aggregate informationhow to adapt learning modelshow to control data qualityhow to evaluate performancehow to handle large scale computing...

T. Zhang (Rutgers) Learning on the Web 27 / 27