TRANSCRIPT
INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 19/25: Web Search Basics and Classification
Paul Ginsparg
Cornell University, Ithaca, NY
9 Nov 2010
1 / 67
Discussion 5, Tue 16 Nov
For this class, read and be prepared to discuss the following:
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. Usenix OSDI '04, 2004. http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf
2 / 67
Overview
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
3 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
4 / 67
Duplicate detection
The web is full of duplicated content.
More so than many other collections
Exact duplicates
Easy to eliminate
E.g., use hash/fingerprint
Near-duplicates
Abundant on the web
Difficult to eliminate
For the user, it's annoying to get a search result with near-identical documents.
Recall marginal relevance
We need to eliminate near-duplicates.
5 / 67
Shingling: Summary
Input: N documents
Choose n-gram size for shingling, e.g., n = 5
Pick 200 random permutations, represented as hash functions
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document
Compute N(N−1)/2 pairwise similarities
Transitive closure of documents with similarity > θ
Index only one document from each equivalence class
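The pipeline above can be sketched in a few lines. The 5-gram shingles, the 200 "permutations", and the similarity threshold are from the slide; using salted built-in hashes to simulate random permutations, and the toy documents, are illustrative assumptions:

```python
import random
from itertools import combinations

def shingles(text, n=5):
    """Set of word n-grams (shingles) for a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def sketch(doc_shingles, salts):
    """Min-hash sketch: the minimum of each salted hash over the shingles."""
    return [min(hash((salt, s)) for s in doc_shingles) for salt in salts]

def estimated_jaccard(s1, s2):
    """Fraction of hash functions on which the two sketches agree."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

random.seed(0)
salts = [random.getrandbits(64) for _ in range(200)]  # 200 "permutations"

text = ("the quick brown fox jumps over the lazy dog while "
        "the cat sleeps on the warm mat near the old door")
docs = {
    "d1": text,
    "d2": text.replace("door", "window"),  # near-duplicate of d1
    "d3": "completely different content about web search engines and crawling",
}
sketches = {d: sketch(shingles(t), salts) for d, t in docs.items()}

# N(N-1)/2 pairwise similarities; the transitive closure is trivial here
theta = 0.7
near_dups = [(a, b) for a, b in combinations(sketches, 2)
             if estimated_jaccard(sketches[a], sketches[b]) > theta]
```

Only one document per near-duplicate pair (equivalence class) would then be indexed.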
6 / 67
Web IR: Differences from traditional IR
Links: The web is a hyperlinked document collection.
Queries: Web queries are different, more varied, and there are a lot of them. How many? ≈ 10^9
Users: Users are different, more varied, and there are a lot of them. How many? ≈ 10^9
Documents: Documents are different, more varied, and there are a lot of them. How many? ≈ 10^11
Context: Context is more important on the web than in manyother IR applications.
Ads and spam
7 / 67
Types of queries / user needs in web search
Informational user needs: I need information on something. “low hemoglobin”
We called this “information need” earlier in the class.
On the web, information needs proper are only a subclass of user needs.
Other user needs: Navigational and transactional
Navigational user needs: I want to go to this web site. “hotmail”, “myspace”, “United Airlines”
Transactional user needs: I want to make a transaction.
Buy something: “MacBook Air”
Download something: “Acrobat Reader”
Chat with someone: “live soccer chat”
Difficult problem: How can the search engine tell what the user's need or intent for a particular query is?
8 / 67
Bowtie structure of the web
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309–320, 2000.
Strongly connected component (SCC) in the center
Lots of pages that get linked to, but don't link (OUT)
Lots of pages that link to other pages, but don't get linked to (IN)
Tendrils, tubes, islands
# of in-links (in-degree) averages 8–15; not randomly distributed (Poissonian), instead a power law: # pages with in-degree i is ∝ 1/i^α, α ≈ 2.1
9 / 67
Poisson Distribution
Bernoulli process with N trials, each with probability p of success:

$$p(m) = \binom{N}{m} p^m (1-p)^{N-m}.$$

Probability p(m) of m successes, in the limit N very large and p small, parametrized by just μ = Np (μ = mean number of successes). For N ≫ m, we have

$$\frac{N!}{(N-m)!} = N(N-1)\cdots(N-m+1) \approx N^m,$$

so $\binom{N}{m} \equiv \frac{N!}{m!(N-m)!} \approx \frac{N^m}{m!}$, and

$$p(m) \approx \frac{1}{m!}\, N^m \left(\frac{\mu}{N}\right)^m \left(1-\frac{\mu}{N}\right)^{N-m} \approx \frac{\mu^m}{m!} \lim_{N\to\infty} \left(1-\frac{\mu}{N}\right)^N = e^{-\mu}\,\frac{\mu^m}{m!}$$

(ignore $(1-\mu/N)^{-m}$ since by assumption N ≫ μm). The N dependence drops out for N → ∞, with average μ fixed (p → 0). The form $p(m) = e^{-\mu}\,\mu^m/m!$ is known as a Poisson distribution (properly normalized: $\sum_{m=0}^{\infty} p(m) = e^{-\mu}\sum_{m=0}^{\infty} \frac{\mu^m}{m!} = e^{-\mu}\cdot e^{\mu} = 1$).
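The N → ∞ limit can be checked numerically. A small sketch using only the binomial and Poisson formulas from this slide (the particular values of N are arbitrary):

```python
from math import comb, exp, factorial

def binomial_pmf(m, N, p):
    """Exact Bernoulli-process probability of m successes in N trials."""
    return comb(N, m) * p**m * (1 - p)**(N - m)

def poisson_pmf(m, mu):
    """Poisson limit: e^{-mu} mu^m / m!."""
    return exp(-mu) * mu**m / factorial(m)

mu = 10
diffs = []
for N in (100, 1000, 100000):
    p = mu / N  # hold the mean mu = N*p fixed while N grows
    gap = max(abs(binomial_pmf(m, N, p) - poisson_pmf(m, mu)) for m in range(40))
    diffs.append(gap)
    print(f"N={N}: max |binomial - Poisson| = {gap:.2e}")
```

The gap shrinks as N grows, as the derivation predicts.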
10 / 67
Poisson Distribution for µ = 10
$p(m) = e^{-10}\,10^m/m!$
[Plot of p(m) for m = 0…30: peaked around m = 10.]
Compare to power law p(m) ∝ 1/m^2.1
11 / 67
Power Law p(m) ∝ 1/m^2.1 and Poisson p(m) = e^{−10} 10^m/m!
[Plot on linear axes for m up to 100: the Poisson is sharply peaked near m = 10, while the power law decays slowly.]
12 / 67
Power Law p(m) ∝ 1/m^2.1 and Poisson p(m) = e^{−10} 10^m/m!
[The same two distributions on a log–log scale, m from 1 to 10^4: the power law is a straight line of slope −2.1, while the Poisson falls off far more steeply in the tail.]
13 / 67
The spatial context: Geo-search
Three relevant locations
Server (nytimes.com → New York)
Web page (nytimes.com article about Albania)
User (located in Palo Alto)
Locating the user
IP address
Information provided by user (e.g., in user profile)
Mobile phone
Geo-tagging: Parse text and identify the coordinates of the geographic entities
Example: East Palo Alto CA → Latitude: 37.47 N, Longitude: 122.14 W
Important NLP problem
14 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
15 / 67
The goal of spamming on the web
You have a page that will generate lots of revenue for you if people visit it.
Therefore, you would like to direct visitors to this page.
One way of doing this: get your page ranked highly in search results.
How can I get my page ranked highly?
16 / 67
Spam technique: Keyword stuffing / Hidden text
Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks etc.
Used to be very effective; most search engines now catch these
17 / 67
Keyword stuffing
18 / 67
Spam technique: Doorway and lander pages
Doorway page: optimized for a single keyword, redirects to the real target page
Lander page: optimized for a single keyword or a misspelled domain name, designed to attract surfers who will then click on ads
19 / 67
Lander page
Number one hit on Google for the search “composita”
The only purpose of this page: get people to click on the ads and make money for the page owner
20 / 67
Spam technique: Duplication
Get good content from somewhere (steal it or produce it yourself)
Publish a large number of slight variations of it
For example, publish the answer to a tax question with the spelling variations of “tax deferred” on the previous slide
21 / 67
Spam technique: Cloaking
Serve fake content to search engine spider
So do we just penalize this always?
No: legitimate uses (e.g., different content to US vs. European users)
22 / 67
Spam technique: Link spam
Create lots of links pointing to the page you want to promote
Put these links on pages with high (or at least non-zero) PageRank
Newly registered domains (domain flooding)
A set of pages that all point to each other to boost each other's PageRank (mutual admiration society)
Pay somebody to put your link on their highly ranked page (“schuetze horoskop” example)
Leave comments that include the link on blogs
23 / 67
SEO: Search engine optimization
Promoting a page in the search rankings is not necessarilyspam.
It can also be a legitimate business – which is called SEO.
You can hire an SEO firm to get your page highly ranked.
There are many legitimate reasons for doing this.
For example, Google bombs like “Who is a failure?”
And there are many legitimate ways of achieving this:
Restructure your content in a way that makes it easy to index
Talk with influential bloggers and have them link to your site
Add more interesting and original content
24 / 67
The war against spam
Quality indicators
Links, statistically analyzed (PageRank etc.)
Usage (users visiting a page)
No adult content (e.g., no pictures with flesh tones)
Distribution and structure of text (e.g., no keyword stuffing)
Combine all of these indicators and use machine learning
Editorial intervention
Blacklists
Top queries audited
Complaints addressed
Suspect patterns detected
25 / 67
Webmaster guidelines
Major search engines have guidelines for webmasters.
These guidelines tell you what is legitimate SEO and what is spamming.
Ignore these guidelines at your own risk
Once a search engine identifies you as a spammer, all pages on your site may get low ranks (or disappear from the index entirely).
There is often a fine line between spam and legitimate SEO.
Scientific study of fighting spam on the web: adversarial information retrieval
26 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
27 / 67
Growth of the web
The web keeps growing.
But growth is no longer exponential?
28 / 67
Size of the web: Who cares?
Media
Users
They may switch to the search engine that has the best coverage of the web.
Users (sometimes) care about recall. If we underestimate the size of the web, search engine results may have low recall.
Search engine designers (how many pages do I need to be able to handle?)
Crawler designers (which policy will crawl close to N pages?)
29 / 67
What is the size of the web? Any guesses?
30 / 67
Simple method for determining a lower bound
OR-query of frequent words in a number of languages
According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07
Big if: Page counts of Google search results are correct. (Generally, they are just rough estimates.)
But this is just a lower bound, based on one search engine.
How can we do better?
31 / 67
Size of the web: Issues
What is size? Number of web servers? Number of pages? Terabytes of data available?
The “dynamic” web is infinite.
Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”)
Many other dynamic sites generate an infinite number of pages.
The static web contains duplicates – each “equivalence class”should only be counted once.
Some servers are seldom connected.
Example: Your laptop
Is it part of the web?
32 / 67
“Search engine index contains N pages”: Issues
Can I claim a page is in the index if I only index the first 4000 bytes?
Can I claim a page is in the index if I only index anchor text pointing to the page?
There used to be (and still are?) billions of pages that are only indexed by anchor text.
33 / 67
How can we estimate the size of the web?
34 / 67
Sampling methods
Random queries (picked from dictionary)
Random searches (picked from search logs)
Random IP addresses
Random walks
35 / 67
Variant: Estimate relative sizes of indexes
There are significant differences between indexes of differentsearch engines.
Different engines have different preferences.
max url depth, max count/host, anti-spam rules, priority rules, etc.
Different engines index different things under the same URL.
anchor text, frames, meta-keywords, size of prefix, etc.
36 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
38 / 67
Digression: “naive” Bayes
Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S), and 1000 classified as non-spam ($\bar S$).
180 of the S messages contain the word “offer”.
20 of the $\bar S$ messages contain the word “offer”.
Suppose you receive a message containing the word “offer”. What is the probability it is S? Estimate:

$$\frac{180}{180 + 20} = \frac{9}{10}.$$

(Formally, assuming a “flat prior” $p(S) = p(\bar S)$:

$$p(S \mid \text{offer}) = \frac{p(\text{offer}\mid S)\,p(S)}{p(\text{offer}\mid S)\,p(S) + p(\text{offer}\mid \bar S)\,p(\bar S)} = \frac{\frac{180}{1000}}{\frac{180}{1000} + \frac{20}{1000}} = \frac{9}{10}.)$$
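The parenthetical computation is just Bayes' rule applied to the training counts; a minimal sketch (the counts and the flat prior come from the slide, the function name is ours):

```python
def posterior_spam(spam_with_word, n_spam, ham_with_word, n_ham, prior_spam=0.5):
    """P(S | word) via Bayes' rule, with a given class prior P(S)."""
    p_word_given_spam = spam_with_word / n_spam  # p(offer|S)     = 180/1000
    p_word_given_ham = ham_with_word / n_ham     # p(offer|S-bar) =  20/1000
    num = p_word_given_spam * prior_spam
    return num / (num + p_word_given_ham * (1 - prior_spam))

# 2000 messages: 1000 spam, 1000 non-spam; "offer" in 180 spam, 20 non-spam
print(posterior_spam(180, 1000, 20, 1000))  # ≈ 0.9
```

With a flat prior the priors cancel, which is why the simple count ratio 180/(180+20) gives the same answer.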
39 / 67
Classification
Naive Bayes is simple and a good baseline.
Use it if you want to get a text classifier up and running in a hurry.
But other classification methods are more accurate.
Perhaps the simplest well-performing alternative: kNN
kNN is a vector space classifier.
Today:
1 intro vector space classification
2 very simple vector space classification: Rocchio
3 kNN
Next time: general properties of classifiers
40 / 67
Recall vector space representation
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 100,000s of dimensions
Normalize vectors (documents) to unit length
How can we do classification in this space?
41 / 67
Vector space classification
As before, the training set is a set of documents, each labeledwith its class.
In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous region.
Premise 2: Documents from different classes don’t overlap.
We define lines, surfaces, hypersurfaces to divide regions.
42 / 67
Classes in the vector space
[Figure: training documents plotted in the vector space, grouped into three class regions labeled China, Kenya, and UK, with a test document ⋆.]
Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes
Based on these separators: ⋆ should be assigned to China
How do we find separators that do a good job at classifying new documents like ⋆?
43 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
44 / 67
Recall Rocchio algorithm (lecture 12)
The optimal query vector is:
$$\vec q_{opt} = \vec\mu(D_r) + [\,\vec\mu(D_r) - \vec\mu(D_{nr})\,] = \frac{1}{|D_r|}\sum_{\vec d_j \in D_r}\vec d_j + \Big[\frac{1}{|D_r|}\sum_{\vec d_j \in D_r}\vec d_j - \frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}}\vec d_j\Big]$$

We move the centroid of the relevant documents by the difference between the two centroids.
45 / 67
Exercise: Compute Rocchio vector (lecture 12)
[Figure: circles (relevant documents) and x's (nonrelevant documents) scattered in the plane.]
46 / 67
Rocchio illustrated (lecture 12)
[Figure: the same documents with the two centroids, the difference vector, and the resulting $\vec q_{opt}$ drawn in.]
$\vec\mu_R$: centroid of relevant documents
$\vec\mu_{NR}$: centroid of nonrelevant documents
$\vec\mu_R - \vec\mu_{NR}$: difference vector
Add the difference vector to $\vec\mu_R$ to get $\vec q_{opt}$
$\vec q_{opt}$ separates relevant/nonrelevant perfectly.
47 / 67
Rocchio 1971 algorithm (SMART) (lecture 12)
Used in practice:
$$\vec q_m = \alpha\,\vec q_0 + \beta\,\vec\mu(D_r) - \gamma\,\vec\mu(D_{nr}) = \alpha\,\vec q_0 + \beta\,\frac{1}{|D_r|}\sum_{\vec d_j \in D_r}\vec d_j - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}}\vec d_j$$

$\vec q_m$: modified query vector; $\vec q_0$: original query vector; $D_r$ and $D_{nr}$: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights attached to each term
New query moves towards relevant documents and away from nonrelevant documents.
Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.
Set negative term weights to 0.
“Negative weight” for a term doesn't make sense in the vector space model.
48 / 67
Using Rocchio for vector space classification
We can view relevance feedback as two-class classification.
The two classes: the relevant documents and the nonrelevantdocuments.
The training set is the set of documents the user has labeledso far.
The principal difference between relevance feedback and text classification:
The training set is given as part of the input in text classification.
It is interactively created in relevance feedback.
49 / 67
Rocchio classification: Basic idea
Compute a centroid for each class
The centroid is the average of all documents in the class.
Assign each test document to the class of its closest centroid.
50 / 67
Recall definition of centroid
$$\vec\mu(c) = \frac{1}{|D_c|}\sum_{d \in D_c}\vec v(d)$$

where $D_c$ is the set of all documents that belong to class c and $\vec v(d)$ is the vector space representation of d.
51 / 67
Rocchio algorithm
TrainRocchio(C, D)
1 for each $c_j \in C$
2 do $D_j \leftarrow \{d : \langle d, c_j\rangle \in D\}$
3 $\vec\mu_j \leftarrow \frac{1}{|D_j|}\sum_{d \in D_j}\vec v(d)$
4 return $\{\vec\mu_1, \ldots, \vec\mu_J\}$

ApplyRocchio($\{\vec\mu_1, \ldots, \vec\mu_J\}$, d)
1 return $\arg\min_j |\vec\mu_j - \vec v(d)|$
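A direct rendering of TrainRocchio/ApplyRocchio in Python; the two-class toy vectors are illustrative, not from the slides:

```python
import math

def centroid(vectors):
    """Componentwise average of a list of equal-length vectors."""
    return [sum(xs) / len(vectors) for xs in zip(*vectors)]

def train_rocchio(labeled_docs):
    """labeled_docs: class -> list of document vectors; returns class -> centroid."""
    return {c: centroid(docs) for c, docs in labeled_docs.items()}

def apply_rocchio(centroids, d):
    """Assign d to the class of the nearest centroid (arg min_j |mu_j - v(d)|)."""
    return min(centroids, key=lambda c: math.dist(centroids[c], d))

training = {
    "China": [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]],
    "UK":    [[0.0, 1.0], [0.1, 0.9], [0.2, 1.1]],
}
mus = train_rocchio(training)
print(apply_rocchio(mus, [0.8, 0.3]))  # China
```

Real document vectors would have hundreds of thousands of (sparse) components, but the algorithm is unchanged.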
52 / 67
Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2
[Figure: the China, Kenya, and UK classes with their centroids and a test document ⋆; the boundary between any two classes is the set of points equidistant from their two centroids, marked by the equal segment pairs a1 = a2, b1 = b2, c1 = c2.]
53 / 67
Rocchio properties
Rocchio forms a simple representation for each class: thecentroid
We can interpret the centroid as the prototype of the class.
Classification is based on similarity to / distance from centroid/prototype.
Does not guarantee that classifications are consistent with thetraining data!
54 / 67
Time complexity of Rocchio
training: Θ(|D|Lave + |C||V|) ≈ Θ(|D|Lave)
testing: Θ(La + |C|Ma) ≈ Θ(|C|Ma)
55 / 67
Rocchio vs. Naive Bayes
In many cases, Rocchio performs worse than Naive Bayes.
One reason: Rocchio does not handle nonconvex, multimodal classes correctly.
56 / 67
Rocchio cannot handle nonconvex, multimodal classes
[Figure: class a consists of two widely separated clusters of points with centroid A lying between them; class b is a single cluster with centroid B; the point o falls within the b region yet is closer to A than to B.]
Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?
A is the centroid of the a's, B is the centroid of the b's.
The point o is closer to A than to B.
But it is a better fit for the b class.
a is a multimodal class with two prototypes.
But in Rocchio we only have one.
57 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
58 / 67
kNN classification
kNN classification is another vector space classification method.
It also is very simple and easy to implement.
kNN is more accurate (in most cases) than Naive Bayes andRocchio.
If you need to get a pretty accurate classifier up and running in a short time . . .
. . . and you don’t care about efficiency that much . . .
. . . use kNN.
59 / 67
kNN classification
kNN = k nearest neighbors
kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.
1NN is not very robust – one document can be mislabeled or atypical.
kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.
Rationale of kNN: contiguity hypothesis
We expect a test document d to have the same label as the training documents located in the local region surrounding d.
60 / 67
Probabilistic kNN
Probabilistic version of kNN: P(c|d) = fraction of k neighbors of d that are in c
kNN classification rule for probabilistic kNN: Assign d to class c with highest P(c|d)
61 / 67
kNN is based on Voronoi tessellation
[Figure: x's and diamonds in the plane, with the Voronoi tessellation induced by the training points, and a test document ⋆.]
1NN, 3NN classification decision for ⋆?
62 / 67
kNN algorithm
Train-kNN(C, D)
1 $D' \leftarrow$ Preprocess(D)
2 $k \leftarrow$ Select-k(C, $D'$)
3 return $D'$, k

Apply-kNN($D'$, k, d)
1 $S_k \leftarrow$ ComputeNearestNeighbors($D'$, k, d)
2 for each $c_j \in C(D')$
3 do $p_j \leftarrow |S_k \cap c_j|/k$
4 return $\arg\max_j p_j$
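The same pseudocode in Python, including the probabilistic variant (P(c|d) = fraction of the k nearest neighbors in class c) from two slides back. A brute-force distance sort stands in for ComputeNearestNeighbors, and the toy points are illustrative:

```python
import math
from collections import Counter

def knn_probs(training, k, d):
    """P(c|d) = fraction of the k nearest neighbors of d that carry label c."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], d))[:k]
    counts = Counter(label for _, label in nearest)
    return {c: n / k for c, n in counts.items()}

def apply_knn(training, k, d):
    """Assign d to the class with the highest P(c|d), i.e. the majority class."""
    probs = knn_probs(training, k, d)
    return max(probs, key=probs.get)

# (vector, label) pairs
training = [([0.0, 0.0], "o"), ([0.1, 0.2], "o"), ([0.2, 0.0], "o"),
            ([1.0, 1.0], "x"), ([1.1, 0.9], "x"), ([0.9, 1.2], "x")]
print(apply_knn(training, 3, [0.1, 0.1]))  # o: all 3 nearest neighbors are o's
```

The sort makes test time linear in the training set size, which is exactly the inefficiency discussed on the next slides.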
63 / 67
Exercise
[Figure: a test document ⋆ surrounded by ten x's and five o's at varying distances.]
How is star classified by:
(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
64 / 67
Time complexity of kNN
kNN with preprocessing of training set
training: Θ(|D|Lave)
testing: Θ(La + |D|MaveMa) = Θ(|D|MaveMa)
kNN test time proportional to the size of the training set!
The larger the training set, the longer it takes to classify a test document.
kNN is inefficient for very large training sets.
66 / 67
kNN: Discussion
No training necessary
But linear preprocessing of documents is as expensive as training Naive Bayes.
You will always preprocess the training set, so in reality training time of kNN is linear.
kNN is very accurate if training set is large.
Optimality result: asymptotically zero error if the Bayes error rate is zero.
But kNN can be very inaccurate if training set is small.
67 / 67