TRANSCRIPT
INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/
IR 19/25: Web Search Basics and Classification
Paul Ginsparg
Cornell University, Ithaca, NY
9 Nov 2010
1 / 67
Discussion 5, Tue 16 Nov
For this class, read and be prepared to discuss the following:
Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. Usenix OSDI '04, 2004. http://www.usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf
2 / 67
Overview
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
3 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
4 / 67
Duplicate detection
The web is full of duplicated content.
More so than many other collections
Exact duplicates
Easy to eliminate
E.g., use hash/fingerprint
Near-duplicates
Abundant on the web
Difficult to eliminate
For the user, it's annoying to get a search result with near-identical documents.
Recall marginal relevance
We need to eliminate near-duplicates.
5 / 67
Shingling: Summary
Input: N documents
Choose n-gram size for shingling, e.g., n = 5
Pick 200 random permutations, represented as hash functions
Compute N sketches: the 200 × N matrix shown on the previous slide, one row per permutation, one column per document
Compute N(N−1)/2 pairwise similarities
Transitive closure of documents with similarity > θ
Index only one document from each equivalence class
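The pipeline above can be sketched in a few lines. The 5-gram shingles, the 200 "permutations", and the similarity threshold are from the slide; using salted built-in hashes to simulate random permutations, and the toy documents, are illustrative assumptions:

```python
import random
from itertools import combinations

def shingles(text, n=5):
    """Set of word n-grams (shingles) for a document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def sketch(doc_shingles, salts):
    """Min-hash sketch: the minimum of each salted hash over the shingles."""
    return [min(hash((salt, s)) for s in doc_shingles) for salt in salts]

def estimated_jaccard(s1, s2):
    """Fraction of hash functions on which the two sketches agree."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

random.seed(0)
salts = [random.getrandbits(64) for _ in range(200)]  # 200 "permutations"

text = ("the quick brown fox jumps over the lazy dog while "
        "the cat sleeps on the warm mat near the old door")
docs = {
    "d1": text,
    "d2": text.replace("door", "window"),  # near-duplicate of d1
    "d3": "completely different content about web search engines and crawling",
}
sketches = {d: sketch(shingles(t), salts) for d, t in docs.items()}

# N(N-1)/2 pairwise similarities; the transitive closure is trivial here
theta = 0.7
near_dups = [(a, b) for a, b in combinations(sketches, 2)
             if estimated_jaccard(sketches[a], sketches[b]) > theta]
```

Only one document per near-duplicate pair (equivalence class) would then be indexed.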
6 / 67
Web IR: Differences from traditional IR
Links: The web is a hyperlinked document collection.
Queries: Web queries are different, more varied, and there are a lot of them. How many? ≈ 10^9
Users: Users are different, more varied, and there are a lot of them. How many? ≈ 10^9
Documents: Documents are different, more varied, and there are a lot of them. How many? ≈ 10^11
Context: Context is more important on the web than in manyother IR applications.
Ads and spam
7 / 67
Types of queries / user needs in web search
Informational user needs: I need information on something. “low hemoglobin”
We called this “information need” earlier in the class.
On the web, information needs proper are only a subclass of user needs.
Other user needs: Navigational and transactional
Navigational user needs: I want to go to this web site. “hotmail”, “myspace”, “United Airlines”
Transactional user needs: I want to make a transaction.
Buy something: “MacBook Air”
Download something: “Acrobat Reader”
Chat with someone: “live soccer chat”
Difficult problem: How can the search engine tell what the user's need or intent for a particular query is?
8 / 67
Bowtie structure of the web
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309–320, 2000.
Strongly connected component (SCC) in the center
Lots of pages that get linked to, but don't link (OUT)
Lots of pages that link to other pages, but don't get linked to (IN)
Tendrils, tubes, islands
# of in-links (in-degree) averages 8–15; not randomly distributed (Poissonian), instead a power law: # pages with in-degree i is ∝ 1/i^α, α ≈ 2.1
9 / 67
Poisson Distribution
Bernoulli process with N trials, each with probability p of success:

$$p(m) = \binom{N}{m} p^m (1-p)^{N-m}.$$

Probability p(m) of m successes, in the limit N very large and p small, parametrized by just μ = Np (μ = mean number of successes). For N ≫ m, we have

$$\frac{N!}{(N-m)!} = N(N-1)\cdots(N-m+1) \approx N^m,$$

so $\binom{N}{m} \equiv \frac{N!}{m!(N-m)!} \approx \frac{N^m}{m!}$, and

$$p(m) \approx \frac{1}{m!}\, N^m \left(\frac{\mu}{N}\right)^m \left(1-\frac{\mu}{N}\right)^{N-m} \approx \frac{\mu^m}{m!} \lim_{N\to\infty} \left(1-\frac{\mu}{N}\right)^N = e^{-\mu}\,\frac{\mu^m}{m!}$$

(ignore $(1-\mu/N)^{-m}$ since by assumption N ≫ μm). The N dependence drops out for N → ∞, with average μ fixed (p → 0). The form $p(m) = e^{-\mu}\,\mu^m/m!$ is known as a Poisson distribution (properly normalized: $\sum_{m=0}^{\infty} p(m) = e^{-\mu}\sum_{m=0}^{\infty} \frac{\mu^m}{m!} = e^{-\mu}\cdot e^{\mu} = 1$).
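The N → ∞ limit can be checked numerically. A small sketch using only the binomial and Poisson formulas from this slide (the particular values of N are arbitrary):

```python
from math import comb, exp, factorial

def binomial_pmf(m, N, p):
    """Exact Bernoulli-process probability of m successes in N trials."""
    return comb(N, m) * p**m * (1 - p)**(N - m)

def poisson_pmf(m, mu):
    """Poisson limit: e^{-mu} mu^m / m!."""
    return exp(-mu) * mu**m / factorial(m)

mu = 10
diffs = []
for N in (100, 1000, 100000):
    p = mu / N  # hold the mean mu = N*p fixed while N grows
    gap = max(abs(binomial_pmf(m, N, p) - poisson_pmf(m, mu)) for m in range(40))
    diffs.append(gap)
    print(f"N={N}: max |binomial - Poisson| = {gap:.2e}")
```

The gap shrinks as N grows, as the derivation predicts.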
10 / 67
Poisson Distribution for µ = 10
$p(m) = e^{-10}\,10^m/m!$
[Plot of p(m) for m = 0…30: peaked around m = 10.]
Compare to power law p(m) ∝ 1/m^2.1
11 / 67
Power Law p(m) ∝ 1/m^2.1 and Poisson p(m) = e^{−10} 10^m/m!
[Plot on linear axes for m up to 100: the Poisson is sharply peaked near m = 10, while the power law decays slowly.]
12 / 67
Power Law p(m) ∝ 1/m^2.1 and Poisson p(m) = e^{−10} 10^m/m!
[The same two distributions on a log–log scale, m from 1 to 10^4: the power law is a straight line of slope −2.1, while the Poisson falls off far more steeply in the tail.]
13 / 67
The spatial context: Geo-search
Three relevant locations
Server (nytimes.com → New York)
Web page (nytimes.com article about Albania)
User (located in Palo Alto)
Locating the user
IP address
Information provided by user (e.g., in user profile)
Mobile phone
Geo-tagging: Parse text and identify the coordinates of the geographic entities
Example: East Palo Alto CA → Latitude: 37.47 N, Longitude: 122.14 W
Important NLP problem
14 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
15 / 67
The goal of spamming on the web
You have a page that will generate lots of revenue for you if people visit it.
Therefore, you would like to direct visitors to this page.
One way of doing this: get your page ranked highly in search results.
How can I get my page ranked highly?
16 / 67
Spam technique: Keyword stuffing / Hidden text
Misleading meta-tags, excessive repetition
Hidden text with colors, style sheet tricks etc.
Used to be very effective; most search engines now catch these
17 / 67
Keyword stuffing
18 / 67
Spam technique: Doorway and lander pages
Doorway page: optimized for a single keyword, redirects to the real target page
Lander page: optimized for a single keyword or a misspelled domain name, designed to attract surfers who will then click on ads
19 / 67
Lander page
Number one hit on Google for the search “composita”
The only purpose of this page: get people to click on the ads and make money for the page owner
20 / 67
Spam technique: Duplication
Get good content from somewhere (steal it or produce it yourself)
Publish a large number of slight variations of it
For example, publish the answer to a tax question with the spelling variations of “tax deferred” on the previous slide
21 / 67
Spam technique: Cloaking
Serve fake content to search engine spider
So do we just penalize this always?
No: legitimate uses (e.g., different content to US vs. European users)
22 / 67
Spam technique: Link spam
Create lots of links pointing to the page you want to promote
Put these links on pages with high (or at least non-zero) PageRank
Newly registered domains (domain flooding)
A set of pages that all point to each other to boost each other's PageRank (mutual admiration society)
Pay somebody to put your link on their highly ranked page (“schuetze horoskop” example)
Leave comments that include the link on blogs
23 / 67
SEO: Search engine optimization
Promoting a page in the search rankings is not necessarilyspam.
It can also be a legitimate business – which is called SEO.
You can hire an SEO firm to get your page highly ranked.
There are many legitimate reasons for doing this.
For example, Google bombs like “Who is a failure?”
And there are many legitimate ways of achieving this:
Restructure your content in a way that makes it easy to index
Talk with influential bloggers and have them link to your site
Add more interesting and original content
24 / 67
The war against spam
Quality indicators
Links, statistically analyzed (PageRank etc.)
Usage (users visiting a page)
No adult content (e.g., no pictures with flesh tones)
Distribution and structure of text (e.g., no keyword stuffing)
Combine all of these indicators and use machine learning
Editorial intervention
Blacklists
Top queries audited
Complaints addressed
Suspect patterns detected
25 / 67
Webmaster guidelines
Major search engines have guidelines for webmasters.
These guidelines tell you what is legitimate SEO and what is spamming.
Ignore these guidelines at your own risk
Once a search engine identifies you as a spammer, all pages on your site may get low ranks (or disappear from the index entirely).
There is often a fine line between spam and legitimate SEO.
Scientific study of fighting spam on the web: adversarial information retrieval
26 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
27 / 67
Growth of the web
The web keeps growing.
But growth is no longer exponential?
28 / 67
Size of the web: Who cares?
Media
Users
They may switch to the search engine that has the best coverage of the web.
Users (sometimes) care about recall. If we underestimate the size of the web, search engine results may have low recall.
Search engine designers (how many pages do I need to be able to handle?)
Crawler designers (which policy will crawl close to N pages?)
29 / 67
What is the size of the web? Any guesses?
30 / 67
Simple method for determining a lower bound
OR-query of frequent words in a number of languages
According to this query: Size of web ≥ 21,450,000,000 on 2007.07.07
Big if: Page counts of Google search results are correct. (Generally, they are just rough estimates.)
But this is just a lower bound, based on one search engine.
How can we do better?
31 / 67
Size of the web: Issues
What is size? Number of web servers? Number of pages? Terabytes of data available?
The “dynamic” web is infinite.
Any sum of two numbers is its own dynamic page on Google. (Example: “2+4”)
Many other dynamic sites generate an infinite number of pages.
The static web contains duplicates – each “equivalence class”should only be counted once.
Some servers are seldom connected.
Example: Your laptop
Is it part of the web?
32 / 67
“Search engine index contains N pages”: Issues
Can I claim a page is in the index if I only index the first 4000 bytes?
Can I claim a page is in the index if I only index anchor text pointing to the page?
There used to be (and still are?) billions of pages that are only indexed by anchor text.
33 / 67
How can we estimate the size of the web?
34 / 67
Sampling methods
Random queries (picked from dictionary)
Random searches (picked from search logs)
Random IP addresses
Random walks
35 / 67
Variant: Estimate relative sizes of indexes
There are significant differences between indexes of differentsearch engines.
Different engines have different preferences.
max url depth, max count/host, anti-spam rules, priority rules, etc.
Different engines index different things under the same URL.
anchor text, frames, meta-keywords, size of prefix, etc.
36 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
38 / 67
Digression: “naive” Bayes
Spam classifier: Imagine a training set of 2000 messages, 1000 classified as spam (S), and 1000 classified as non-spam ($\bar S$).
180 of the S messages contain the word “offer”.
20 of the $\bar S$ messages contain the word “offer”.
Suppose you receive a message containing the word “offer”. What is the probability it is S? Estimate:

$$\frac{180}{180 + 20} = \frac{9}{10}.$$

(Formally, assuming a “flat prior” $p(S) = p(\bar S)$:

$$p(S \mid \text{offer}) = \frac{p(\text{offer}\mid S)\,p(S)}{p(\text{offer}\mid S)\,p(S) + p(\text{offer}\mid \bar S)\,p(\bar S)} = \frac{\frac{180}{1000}}{\frac{180}{1000} + \frac{20}{1000}} = \frac{9}{10}.)$$
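The parenthetical computation is just Bayes' rule applied to the training counts; a minimal sketch (the counts and the flat prior come from the slide, the function name is ours):

```python
def posterior_spam(spam_with_word, n_spam, ham_with_word, n_ham, prior_spam=0.5):
    """P(S | word) via Bayes' rule, with a given class prior P(S)."""
    p_word_given_spam = spam_with_word / n_spam  # p(offer|S)     = 180/1000
    p_word_given_ham = ham_with_word / n_ham     # p(offer|S-bar) =  20/1000
    num = p_word_given_spam * prior_spam
    return num / (num + p_word_given_ham * (1 - prior_spam))

# 2000 messages: 1000 spam, 1000 non-spam; "offer" in 180 spam, 20 non-spam
print(posterior_spam(180, 1000, 20, 1000))  # ≈ 0.9
```

With a flat prior the priors cancel, which is why the simple count ratio 180/(180+20) gives the same answer.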
39 / 67
Classification
Naive Bayes is simple and a good baseline.
Use it if you want to get a text classifier up and running in a hurry.
But other classification methods are more accurate.
Perhaps the simplest well-performing alternative: kNN
kNN is a vector space classifier.
Today:
1 intro vector space classification
2 very simple vector space classification: Rocchio
3 kNN
Next time: general properties of classifiers
40 / 67
Recall vector space representation
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 100,000s of dimensions
Normalize vectors (documents) to unit length
How can we do classification in this space?
41 / 67
Vector space classification
As before, the training set is a set of documents, each labeledwith its class.
In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous region.
Premise 2: Documents from different classes don’t overlap.
We define lines, surfaces, hypersurfaces to divide regions.
42 / 67
Classes in the vector space
[Figure: training documents plotted in the vector space, grouped into three class regions labeled China, Kenya, and UK, with a test document ⋆.]
Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes
Based on these separators: ⋆ should be assigned to China
How do we find separators that do a good job at classifying new documents like ⋆?
43 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
44 / 67
Recall Rocchio algorithm (lecture 12)
The optimal query vector is:
$$\vec q_{opt} = \vec\mu(D_r) + [\,\vec\mu(D_r) - \vec\mu(D_{nr})\,] = \frac{1}{|D_r|}\sum_{\vec d_j \in D_r}\vec d_j + \Big[\frac{1}{|D_r|}\sum_{\vec d_j \in D_r}\vec d_j - \frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}}\vec d_j\Big]$$

We move the centroid of the relevant documents by the difference between the two centroids.
45 / 67
Exercise: Compute Rocchio vector (lecture 12)
[Figure: circles (relevant documents) and x's (nonrelevant documents) scattered in the plane.]
46 / 67
Rocchio illustrated (lecture 12)
[Figure: the same documents with the two centroids, the difference vector, and the resulting $\vec q_{opt}$ drawn in.]
$\vec\mu_R$: centroid of relevant documents
$\vec\mu_{NR}$: centroid of nonrelevant documents
$\vec\mu_R - \vec\mu_{NR}$: difference vector
Add the difference vector to $\vec\mu_R$ to get $\vec q_{opt}$
$\vec q_{opt}$ separates relevant/nonrelevant perfectly.
47 / 67
Rocchio 1971 algorithm (SMART) (lecture 12)
Used in practice:
$$\vec q_m = \alpha\,\vec q_0 + \beta\,\vec\mu(D_r) - \gamma\,\vec\mu(D_{nr}) = \alpha\,\vec q_0 + \beta\,\frac{1}{|D_r|}\sum_{\vec d_j \in D_r}\vec d_j - \gamma\,\frac{1}{|D_{nr}|}\sum_{\vec d_j \in D_{nr}}\vec d_j$$

$\vec q_m$: modified query vector; $\vec q_0$: original query vector; $D_r$ and $D_{nr}$: sets of known relevant and nonrelevant documents respectively; α, β, and γ: weights attached to each term
New query moves towards relevant documents and away from nonrelevant documents.
Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.
Set negative term weights to 0.
“Negative weight” for a term doesn't make sense in the vector space model.
48 / 67
Using Rocchio for vector space classification
We can view relevance feedback as two-class classification.
The two classes: the relevant documents and the nonrelevantdocuments.
The training set is the set of documents the user has labeledso far.
The principal difference between relevance feedback and text classification:
The training set is given as part of the input in text classification.
It is interactively created in relevance feedback.
49 / 67
Rocchio classification: Basic idea
Compute a centroid for each class
The centroid is the average of all documents in the class.
Assign each test document to the class of its closest centroid.
50 / 67
Recall definition of centroid
$$\vec\mu(c) = \frac{1}{|D_c|}\sum_{d \in D_c}\vec v(d)$$

where $D_c$ is the set of all documents that belong to class c and $\vec v(d)$ is the vector space representation of d.
51 / 67
Rocchio algorithm
TrainRocchio(C, D)
1 for each $c_j \in C$
2 do $D_j \leftarrow \{d : \langle d, c_j\rangle \in D\}$
3 $\vec\mu_j \leftarrow \frac{1}{|D_j|}\sum_{d \in D_j}\vec v(d)$
4 return $\{\vec\mu_1, \ldots, \vec\mu_J\}$

ApplyRocchio($\{\vec\mu_1, \ldots, \vec\mu_J\}$, d)
1 return $\arg\min_j |\vec\mu_j - \vec v(d)|$
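A direct rendering of TrainRocchio/ApplyRocchio in Python; the two-class toy vectors are illustrative, not from the slides:

```python
import math

def centroid(vectors):
    """Componentwise average of a list of equal-length vectors."""
    return [sum(xs) / len(vectors) for xs in zip(*vectors)]

def train_rocchio(labeled_docs):
    """labeled_docs: class -> list of document vectors; returns class -> centroid."""
    return {c: centroid(docs) for c, docs in labeled_docs.items()}

def apply_rocchio(centroids, d):
    """Assign d to the class of the nearest centroid (arg min_j |mu_j - v(d)|)."""
    return min(centroids, key=lambda c: math.dist(centroids[c], d))

training = {
    "China": [[1.0, 0.1], [0.9, 0.0], [1.1, 0.2]],
    "UK":    [[0.0, 1.0], [0.1, 0.9], [0.2, 1.1]],
}
mus = train_rocchio(training)
print(apply_rocchio(mus, [0.8, 0.3]))  # China
```

Real document vectors would have hundreds of thousands of (sparse) components, but the algorithm is unchanged.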
52 / 67
Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2
[Figure: the China, Kenya, and UK classes with their centroids and a test document ⋆; the boundary between any two classes is the set of points equidistant from their two centroids, marked by the equal segment pairs a1 = a2, b1 = b2, c1 = c2.]
53 / 67
Rocchio properties
Rocchio forms a simple representation for each class: thecentroid
We can interpret the centroid as the prototype of the class.
Classification is based on similarity to / distance from centroid/prototype.
Does not guarantee that classifications are consistent with thetraining data!
54 / 67
Time complexity of Rocchio
training: Θ(|D|Lave + |C||V|) ≈ Θ(|D|Lave)
testing: Θ(La + |C|Ma) ≈ Θ(|C|Ma)
55 / 67
Rocchio vs. Naive Bayes
In many cases, Rocchio performs worse than Naive Bayes.
One reason: Rocchio does not handle nonconvex, multimodal classes correctly.
56 / 67
Rocchio cannot handle nonconvex, multimodal classes
[Figure: class a consists of two widely separated clusters of points with centroid A lying between them; class b is a single cluster with centroid B; the point o falls within the b region yet is closer to A than to B.]
Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?
A is the centroid of the a's, B is the centroid of the b's.
The point o is closer to A than to B.
But it is a better fit for the b class.
a is a multimodal class with two prototypes.
But in Rocchio we only have one.
57 / 67
Outline
1 Recap
2 Spam
3 Size of the web
4 Intro vector space classification
5 Rocchio
6 kNN
58 / 67
kNN classification
kNN classification is another vector space classification method.
It also is very simple and easy to implement.
kNN is more accurate (in most cases) than Naive Bayes andRocchio.
If you need to get a pretty accurate classifier up and running in a short time . . .
. . . and you don’t care about efficiency that much . . .
. . . use kNN.
59 / 67
kNN classification
kNN = k nearest neighbors
kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.
1NN is not very robust – one document can be mislabeled or atypical.
kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.
Rationale of kNN: contiguity hypothesis
We expect a test document d to have the same label as the training documents located in the local region surrounding d.
60 / 67
Probabilistic kNN
Probabilistic version of kNN: P(c|d) = fraction of k neighbors of d that are in c
kNN classification rule for probabilistic kNN: Assign d to class c with highest P(c|d)
61 / 67
kNN is based on Voronoi tessellation
[Figure: x's and diamonds in the plane, with the Voronoi tessellation induced by the training points, and a test document ⋆.]
1NN, 3NN classification decision for ⋆?
62 / 67
kNN algorithm
Train-kNN(C, D)
1 $D' \leftarrow$ Preprocess(D)
2 $k \leftarrow$ Select-k(C, $D'$)
3 return $D'$, k

Apply-kNN($D'$, k, d)
1 $S_k \leftarrow$ ComputeNearestNeighbors($D'$, k, d)
2 for each $c_j \in C(D')$
3 do $p_j \leftarrow |S_k \cap c_j|/k$
4 return $\arg\max_j p_j$
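The same pseudocode in Python, including the probabilistic variant (P(c|d) = fraction of the k nearest neighbors in class c) from two slides back. A brute-force distance sort stands in for ComputeNearestNeighbors, and the toy points are illustrative:

```python
import math
from collections import Counter

def knn_probs(training, k, d):
    """P(c|d) = fraction of the k nearest neighbors of d that carry label c."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], d))[:k]
    counts = Counter(label for _, label in nearest)
    return {c: n / k for c, n in counts.items()}

def apply_knn(training, k, d):
    """Assign d to the class with the highest P(c|d), i.e. the majority class."""
    probs = knn_probs(training, k, d)
    return max(probs, key=probs.get)

# (vector, label) pairs
training = [([0.0, 0.0], "o"), ([0.1, 0.2], "o"), ([0.2, 0.0], "o"),
            ([1.0, 1.0], "x"), ([1.1, 0.9], "x"), ([0.9, 1.2], "x")]
print(apply_knn(training, 3, [0.1, 0.1]))  # o: all 3 nearest neighbors are o's
```

The sort makes test time linear in the training set size, which is exactly the inefficiency discussed on the next slides.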
63 / 67
Exercise
[Figure: a test document ⋆ surrounded by ten x's and five o's at varying distances.]
How is star classified by:
(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
64 / 67
Time complexity of kNN
kNN with preprocessing of training set
training: Θ(|D|Lave)
testing: Θ(La + |D|MaveMa) = Θ(|D|MaveMa)
kNN test time proportional to the size of the training set!
The larger the training set, the longer it takes to classify a test document.
kNN is inefficient for very large training sets.
66 / 67
kNN: Discussion
No training necessary
But linear preprocessing of documents is as expensive as training Naive Bayes.
You will always preprocess the training set, so in reality training time of kNN is linear.
kNN is very accurate if training set is large.
Optimality result: asymptotically zero error if the Bayes error rate is zero.
But kNN can be very inaccurate if training set is small.
67 / 67