INFO 4300 / CS4300 Information Retrieval
slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/
IR 20/25: Linear Classifiers and Flat clustering
Paul Ginsparg
Cornell University, Ithaca, NY
10 Nov 2011
1 / 121
Administrativa
Assignment 4 to be posted tomorrow, due Fri 2 Dec (last day of classes), permitted until Sun 4 Dec (no extensions)
2 / 121
Overview
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K-means
3 / 121
Outline
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K-means
4 / 121
Digression: “naive” Bayes
Spam classifier: imagine a training set of 2000 messages, 1000 classified as spam (S) and 1000 classified as non-spam (S̄).

180 of the S messages contain the word “offer”. 20 of the S̄ messages contain the word “offer”.

Suppose you receive a message containing the word “offer”. What is the probability it is S? Estimate:

180 / (180 + 20) = 9/10.

(Formally, assuming a “flat prior” p(S) = p(S̄):

p(S|offer) = p(offer|S) p(S) / [p(offer|S) p(S) + p(offer|S̄) p(S̄)] = (180/1000) / (180/1000 + 20/1000) = 9/10.)
5 / 121
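The estimate above is easy to check in a few lines of Python. This is a sketch; the function name and argument layout are illustrative, and only the counts (180 of 1000 spam, 20 of 1000 non-spam) come from the slide:

```python
# Bayes' theorem with a flat prior: p(S | word) from per-class word counts.
def posterior_spam(n_spam_with_word, n_spam, n_ham_with_word, n_ham,
                   p_spam=0.5):
    """p(S | word), where 'ham' stands in for the non-spam class S-bar."""
    p_word_given_spam = n_spam_with_word / n_spam   # p(word | S)
    p_word_given_ham = n_ham_with_word / n_ham      # p(word | S-bar)
    p_ham = 1.0 - p_spam
    num = p_word_given_spam * p_spam
    return num / (num + p_word_given_ham * p_ham)

print(posterior_spam(180, 1000, 20, 1000))  # ≈ 0.9, as on the slide
```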
Basics of probability theory
A = event
0 ≤ p(A) ≤ 1
joint probability p(A,B) = p(A ∩ B)
conditional probability p(A|B) = p(A,B)/p(B)
Note p(A,B) = p(A|B)p(B) = p(B|A)p(A); this gives the posterior probability of A after seeing the evidence B.

Bayes’ Theorem: p(A|B) = p(B|A) p(A) / p(B)

In the denominator, use p(B) = p(B,A) + p(B,Ā) = p(B|A) p(A) + p(B|Ā) p(Ā)

Odds: O(A) = p(A)/p(Ā) = p(A) / (1 − p(A))
6 / 121
“naive” Bayes, cont’d
Spam classifier: imagine a training set of 2000 messages, 1000 classified as spam (S) and 1000 classified as non-spam (S̄).

words wi = {“offer”, “FF0000”, “click”, “unix”, “job”, “enlarge”, . . .}
ni of the S messages contain the word wi.
mi of the S̄ messages contain the word wi.

Suppose you receive a message containing the words w1, w4, w5, . . .. What are the odds it is S? Estimate:

p(S|w1,w4,w5, . . .) ∝ p(w1,w4,w5, . . . |S) p(S)
p(S̄|w1,w4,w5, . . .) ∝ p(w1,w4,w5, . . . |S̄) p(S̄)

Odds are

p(S|w1,w4,w5, . . .) / p(S̄|w1,w4,w5, . . .) = p(w1,w4,w5, . . . |S) p(S) / [p(w1,w4,w5, . . . |S̄) p(S̄)]

7 / 121
“naive” Bayes odds
Odds

p(S|w1,w4,w5, . . .) / p(S̄|w1,w4,w5, . . .) = p(w1,w4,w5, . . . |S) p(S) / [p(w1,w4,w5, . . . |S̄) p(S̄)]

are approximated by

≈ [p(w1|S) p(w4|S) p(w5|S) · · · p(wℓ|S) p(S)] / [p(w1|S̄) p(w4|S̄) p(w5|S̄) · · · p(wℓ|S̄) p(S̄)]

≈ [(n1/1000)(n4/1000)(n5/1000) · · · (nℓ/1000)] / [(m1/1000)(m4/1000)(m5/1000) · · · (mℓ/1000)] = (n1 n4 n5 · · · nℓ) / (m1 m4 m5 · · · mℓ)

where we’ve assumed words are independent events, p(w1,w4,w5, . . . |S) ≈ p(w1|S) p(w4|S) p(w5|S) · · · p(wℓ|S), and p(wi|S) ≈ ni/|S|, p(wi|S̄) ≈ mi/|S̄| (recall ni and mi, respectively, counted the number of spam S and non-spam S̄ training messages containing the word wi).
8 / 121
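The odds product above can be sketched directly. The counts below are made up for illustration (only “offer” with 180/20 comes from the slides), and the flat prior makes the p(S)/p(S̄) factor equal to 1:

```python
# Naive Bayes odds under the word-independence assumption:
# odds(S) ≈ [p(S)/p(S-bar)] * prod_i p(w_i|S) / p(w_i|S-bar)
def spam_odds(present_words, n, m, n_spam=1000, n_ham=1000):
    """n[w], m[w]: number of spam / non-spam training messages containing w."""
    odds = 1.0  # flat prior: p(S)/p(S-bar) = 1
    for w in present_words:
        odds *= (n[w] / n_spam) / (m[w] / n_ham)
    return odds

n = {"offer": 180, "click": 300, "unix": 10}   # spam counts (illustrative)
m = {"offer": 20, "click": 50, "unix": 200}    # non-spam counts (illustrative)
print(spam_odds(["offer", "click"], n, m))  # ≈ (180/20)·(300/50) = 54
```

With equal class sizes the 1000s cancel, leaving exactly the n1 n4 · · · / m1 m4 · · · ratio from the slide.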
Classification
Naive Bayes is simple and a good baseline.
Use it if you want to get a text classifier up and running in a hurry.
But other classification methods are more accurate.
Perhaps the simplest well-performing alternative: kNN
kNN is a vector space classifier.
Today:
1 intro vector space classification
2 very simple vector space classification: Rocchio
3 kNN
Next time: general properties of classifiers
9 / 121
Recall vector space representation
Each document is a vector, one component for each term.
Terms are axes.
High dimensionality: 100,000s of dimensions
Normalize vectors (documents) to unit length
How can we do classification in this space?
10 / 121
Vector space classification
As before, the training set is a set of documents, each labeled with its class.
In vector space classification, this set corresponds to a labeled set of points or vectors in the vector space.
Premise 1: Documents in the same class form a contiguous region.
Premise 2: Documents from different classes don’t overlap.
We define lines, surfaces, hypersurfaces to divide regions.
11 / 121
Classes in the vector space
[Figure: documents from the classes China, Kenya, and UK in the vector space, with a test document ⋆]

Should the document ⋆ be assigned to China, UK or Kenya?
Find separators between the classes.
Based on these separators: ⋆ should be assigned to China.
How do we find separators that do a good job at classifying new documents like ⋆?
12 / 121
Outline
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K-means
13 / 121
Recall Rocchio algorithm (lecture 12)
The optimal query vector is:

~qopt = µ(Dr) + [µ(Dr) − µ(Dnr)]
      = (1/|Dr|) ∑_{~dj∈Dr} ~dj + [ (1/|Dr|) ∑_{~dj∈Dr} ~dj − (1/|Dnr|) ∑_{~dj∈Dnr} ~dj ]

We move the centroid of the relevant documents by the difference between the two centroids.
14 / 121
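The centroid-shift formula can be sketched with toy 2-D document vectors (the vectors below are made up; only the formula comes from the slide):

```python
import numpy as np

# q_opt = mu(D_r) + [mu(D_r) - mu(D_nr)]: move the relevant centroid
# by the difference between the two centroids.
def rocchio_opt(relevant, nonrelevant):
    mu_r = np.mean(relevant, axis=0)
    mu_nr = np.mean(nonrelevant, axis=0)
    return mu_r + (mu_r - mu_nr)

rel = np.array([[1.0, 1.0], [3.0, 1.0]])       # relevant docs (illustrative)
nonrel = np.array([[0.0, -2.0], [2.0, -2.0]])  # nonrelevant docs (illustrative)
print(rocchio_opt(rel, nonrel))  # [3. 4.]
```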
Exercise: Compute Rocchio vector (lecture 12)
[Figure: circles: relevant documents, x’s: nonrelevant documents]
15 / 121
Rocchio illustrated (lecture 12)
[Figure: relevant (circles) and nonrelevant (x’s) documents with centroids ~µR and ~µNR, the difference vector ~µR − ~µNR, and the resulting ~qopt]

~µR: centroid of relevant documents
~µNR: centroid of nonrelevant documents
~µR − ~µNR: difference vector
Add the difference vector to ~µR to get ~qopt.
~qopt separates relevant/nonrelevant perfectly.
16 / 121
Rocchio 1971 algorithm (SMART) (lecture 12)
Used in practice:
~qm = α~q0 + β µ(Dr) − γ µ(Dnr)
    = α~q0 + β (1/|Dr|) ∑_{~dj∈Dr} ~dj − γ (1/|Dnr|) ∑_{~dj∈Dnr} ~dj

qm: modified query vector; q0: original query vector; Dr and Dnr: sets of known relevant and nonrelevant documents, respectively; α, β, and γ: weights attached to each term
New query moves towards relevant documents and away from nonrelevant documents.
Tradeoff α vs. β/γ: If we have a lot of judged documents, we want a higher β/γ.
Set negative term weights to 0.
“Negative weight” for a term doesn’t make sense in the vector space model.
17 / 121
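A minimal sketch of the SMART formula, including the clip-negatives-to-zero step. The vectors and the particular α, β, γ values are illustrative (the slide only says to trade off α against β/γ):

```python
import numpy as np

# q_m = alpha*q0 + beta*mu(D_r) - gamma*mu(D_nr), negatives clipped to 0.
def rocchio_smart(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    qm = (alpha * q0
          + beta * np.mean(relevant, axis=0)
          - gamma * np.mean(nonrelevant, axis=0))
    return np.maximum(qm, 0.0)  # negative term weights make no sense here

q0 = np.array([1.0, 0.0, 0.0])                          # original query
rel = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]])      # judged relevant
nonrel = np.array([[0.0, 0.0, 10.0]])                   # judged nonrelevant
print(rocchio_smart(q0, rel, nonrel))  # third component clipped from -1.125 to 0
```

With many judged documents one would raise β and γ relative to α, per the tradeoff noted above.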
Using Rocchio for vector space classification
We can view relevance feedback as two-class classification.
The two classes: the relevant documents and the nonrelevantdocuments.
The training set is the set of documents the user has labeled so far.
The principal difference between relevance feedback and text classification:
The training set is given as part of the input in text classification.
It is interactively created in relevance feedback.
18 / 121
Rocchio classification: Basic idea
Compute a centroid for each class
The centroid is the average of all documents in the class.
Assign each test document to the class of its closest centroid.
19 / 121
Recall definition of centroid
~µ(c) = (1/|Dc|) ∑_{d∈Dc} ~v(d)

where Dc is the set of all documents that belong to class c and ~v(d) is the vector space representation of d.
20 / 121
Rocchio algorithm
TrainRocchio(C, D)
1 for each cj ∈ C
2 do Dj ← {d : ⟨d, cj⟩ ∈ D}
3    ~µj ← (1/|Dj|) ∑_{d∈Dj} ~v(d)
4 return {~µ1, . . . , ~µJ}

ApplyRocchio({~µ1, . . . , ~µJ}, d)
1 return arg minj |~µj − ~v(d)|
21 / 121
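The pseudocode above translates almost line for line into Python. The toy documents and class names are illustrative:

```python
import numpy as np

# TrainRocchio: one centroid per class.
def train_rocchio(docs, labels):
    classes = sorted(set(labels))
    return {c: np.mean([d for d, l in zip(docs, labels) if l == c], axis=0)
            for c in classes}

# ApplyRocchio: assign d to the class of the nearest centroid (arg min).
def apply_rocchio(centroids, d):
    return min(centroids, key=lambda c: np.linalg.norm(centroids[c] - d))

docs = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
        np.array([0.0, 1.0]), np.array([0.1, 0.9])]
labels = ["UK", "UK", "China", "China"]
mus = train_rocchio(docs, labels)
print(apply_rocchio(mus, np.array([0.8, 0.2])))  # UK
```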
Rocchio illustrated: a1 = a2, b1 = b2, c1 = c2
[Figure: classes China, Kenya, and UK with their centroids and a test document ⋆; each boundary segment consists of points equidistant from two centroids, so a1 = a2, b1 = b2, c1 = c2]
22 / 121
Rocchio properties
Rocchio forms a simple representation for each class: the centroid.
We can interpret the centroid as the prototype of the class.
Classification is based on similarity to / distance from the centroid/prototype.
Does not guarantee that classifications are consistent with the training data!
23 / 121
Time complexity of Rocchio
mode      time complexity
training  Θ(|D| Lave + |C| |V|) ≈ Θ(|D| Lave)
testing   Θ(La + |C| Ma) ≈ Θ(|C| Ma)
24 / 121
Rocchio vs. Naive Bayes
In many cases, Rocchio performs worse than Naive Bayes.
One reason: Rocchio does not handle nonconvex, multimodal classes correctly.
25 / 121
Rocchio cannot handle nonconvex, multimodal classes
[Figure: class a forms two separate clusters with overall centroid A; class b lies between them with centroid B; a test point o sits among the b’s]

Exercise: Why is Rocchio not expected to do well for the classification task a vs. b here?

A is centroid of the a’s, B is centroid of the b’s.
The point o is closer to A than to B.
But it is a better fit for the b class.
A is a multimodal class with two prototypes.
But in Rocchio we only have one.
26 / 121
Outline
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K-means
27 / 121
kNN classification
kNN classification is another vector space classification method.
It also is very simple and easy to implement.
kNN is more accurate (in most cases) than Naive Bayes and Rocchio.
If you need to get a pretty accurate classifier up and running in a short time . . .
. . . and you don’t care about efficiency that much . . .
. . . use kNN.
28 / 121
kNN classification
kNN = k nearest neighbors
kNN classification rule for k = 1 (1NN): Assign each test document to the class of its nearest neighbor in the training set.
1NN is not very robust – one document can be mislabeled or atypical.
kNN classification rule for k > 1 (kNN): Assign each test document to the majority class of its k nearest neighbors in the training set.
Rationale of kNN: contiguity hypothesis
We expect a test document d to have the same label as the training documents located in the local region surrounding d.
29 / 121
Probabilistic kNN
Probabilistic version of kNN: P(c|d) = fraction of k neighbors of d that are in c
kNN classification rule for probabilistic kNN: Assign d to the class c with highest P(c|d)
30 / 121
kNN is based on Voronoi tessellation
[Figure: Voronoi tessellation of training points (x’s and ⋄’s) with a test point ⋆. 1NN, 3NN classification decision for the star?]
31 / 121
kNN algorithm
Train-kNN(C, D)
1 D′ ← Preprocess(D)
2 k ← Select-k(C, D′)
3 return D′, k

Apply-kNN(D′, k, d)
1 Sk ← ComputeNearestNeighbors(D′, k, d)
2 for each cj ∈ C(D′)
3 do pj ← |Sk ∩ cj|/k
4 return arg maxj pj
32 / 121
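A sketch of Apply-kNN: compute the class fractions pj among the k nearest neighbors and take the argmax (for majority vote this is the same decision). The toy points and labels are illustrative:

```python
import numpy as np
from collections import Counter

# Apply-kNN: p_j = |S_k ∩ c_j| / k, then arg max over classes.
def apply_knn(train_docs, train_labels, d, k=3):
    dists = [np.linalg.norm(x - d) for x in train_docs]
    nearest = np.argsort(dists)[:k]          # indices of the k nearest docs
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority class

docs = [np.array(v) for v in
        [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0]]]
labels = ["x", "x", "x", "o", "o"]
print(apply_knn(docs, labels, np.array([0.2, 0.2]), k=3))  # x
```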
Exercise
[Figure: a test point ⋆ surrounded by training points labeled x and o]
How is star classified by:
(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio?
33 / 121
Exercise
[Figure: the same test point ⋆ and training points labeled x and o]
How is star classified by:
(i) 1-NN (ii) 3-NN (iii) 9-NN (iv) 15-NN (v) Rocchio
34 / 121
Time complexity of kNN
kNN with preprocessing of training set
training  Θ(|D| Lave)
testing   Θ(La + |D| Mave Ma) = Θ(|D| Mave Ma)

kNN test time proportional to the size of the training set!
The larger the training set, the longer it takes to classify a test document.
kNN is inefficient for very large training sets.
35 / 121
kNN: Discussion
No training necessary
But linear preprocessing of documents is as expensive as training Naive Bayes.
You will always preprocess the training set, so in reality the training time of kNN is linear.
kNN is very accurate if the training set is large.
Optimality result: asymptotically zero error if the Bayes error rate is zero.
But kNN can be very inaccurate if training set is small.
36 / 121
Outline
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K-means
37 / 121
Linear classifiers
Linear classifiers compute a linear combination or weighted sum ∑i wi xi of the feature values.
Classification decision: ∑i wi xi > θ?
. . . where θ (the threshold) is a parameter.
(First, we only consider binary classifiers.)
Geometrically, this corresponds to a line (2D), a plane (3D) or a hyperplane (higher dimensionalities).
Assumption: The classes are linearly separable.
Can find hyperplane (= separator) based on training set
Methods for finding separator: Perceptron, Rocchio, Naive Bayes – as we will explain on the next slides
38 / 121
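The decision rule is a one-liner. The weights and threshold below are made up to illustrate the ∑i wi xi > θ test:

```python
import numpy as np

# Binary linear classifier: in class c iff the weighted sum exceeds theta.
def linear_classify(w, theta, x):
    return np.dot(w, x) > theta

w = np.array([0.6, -0.4])   # illustrative feature weights
theta = 0.1                 # illustrative threshold
print(linear_classify(w, theta, np.array([1.0, 0.5])))  # True:  0.4 > 0.1
print(linear_classify(w, theta, np.array([0.0, 1.0])))  # False: -0.4 <= 0.1
```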
A linear classifier in 1D
x1
A linear classifier in 1D is a point described by the equation w1x1 = θ
The point at θ/w1
Points (x1) with w1x1 ≥ θ are in the class c.
Points (x1) with w1x1 < θ are in the complement class c̄.
39 / 121
A linear classifier in 2D
A linear classifier in 2D is a line described by the equation w1x1 + w2x2 = θ
Example for a 2D linear classifier
Points (x1, x2) with w1x1 + w2x2 ≥ θ are in the class c.
Points (x1, x2) with w1x1 + w2x2 < θ are in the complement class c̄.
40 / 121
A linear classifier in 3D
A linear classifier in 3D is a plane described by the equation w1x1 + w2x2 + w3x3 = θ
Example for a 3D linear classifier
Points (x1, x2, x3) with w1x1 + w2x2 + w3x3 ≥ θ are in the class c.
Points (x1, x2, x3) with w1x1 + w2x2 + w3x3 < θ are in the complement class c̄.
41 / 121
Rocchio as a linear classifier
Rocchio is a linear classifier defined by:

∑_{i=1}^{M} wi xi = ~w · ~x = θ

where the normal vector ~w = ~µ(c1) − ~µ(c2) and θ = 0.5 (|~µ(c1)|² − |~µ(c2)|²).

(follows from the decision boundary |~µ(c1) − ~x| = |~µ(c2) − ~x|)
42 / 121
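This equivalence is easy to verify numerically: with ~w and θ defined as above, ~w · ~x > θ holds exactly when ~x is closer to ~µ(c1) than to ~µ(c2). A sketch with random centroids and test points:

```python
import numpy as np

# Check: w·x > theta  <=>  |mu(c1) - x| < |mu(c2) - x|.
rng = np.random.default_rng(0)
mu1, mu2 = rng.normal(size=5), rng.normal(size=5)   # random class centroids
w = mu1 - mu2                                       # normal vector
theta = 0.5 * (mu1 @ mu1 - mu2 @ mu2)               # threshold from the slide
for _ in range(1000):
    x = rng.normal(size=5)
    nearer_c1 = np.linalg.norm(mu1 - x) < np.linalg.norm(mu2 - x)
    assert (w @ x > theta) == nearer_c1
print("decision rules agree")
```

Expanding |~µ(c1) − ~x|² < |~µ(c2) − ~x|² and cancelling the |~x|² terms gives exactly 2(~µ(c1) − ~µ(c2)) · ~x > |~µ(c1)|² − |~µ(c2)|², i.e. ~w · ~x > θ.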
Naive Bayes classifier
~x represents a document; what is the probability p(c|~x) that the document is in class c?

p(c|~x) = p(~x|c) p(c) / p(~x)
p(c̄|~x) = p(~x|c̄) p(c̄) / p(~x)

odds: p(c|~x) / p(c̄|~x) = p(~x|c) p(c) / [p(~x|c̄) p(c̄)] ≈ [p(c)/p(c̄)] · [∏_{1≤k≤nd} p(tk|c)] / [∏_{1≤k≤nd} p(tk|c̄)]

log odds: log [p(c|~x)/p(c̄|~x)] = log [p(c)/p(c̄)] + ∑_{1≤k≤nd} log [p(tk|c)/p(tk|c̄)]
43 / 121
Naive Bayes as a linear classifier
Naive Bayes is a linear classifier defined by:

∑_{i=1}^{M} wi xi = θ

where wi = log( p(ti|c)/p(ti|c̄) ),
xi = number of occurrences of ti in d,
and θ = − log( p(c)/p(c̄) ).

(the index i, 1 ≤ i ≤ M, refers to terms of the vocabulary)

Linear in log space
44 / 121
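A sketch of this linear view of Naive Bayes. The term probabilities below are made up for illustration; only the definitions of wi, xi, and θ come from the slide:

```python
import math

# Decide c iff sum_i w_i x_i > theta, with
# w_i = log(p(t_i|c)/p(t_i|c-bar)), theta = -log(p(c)/p(c-bar)).
p_t_c = {"offer": 0.18, "unix": 0.01}      # p(t|c), illustrative
p_t_cbar = {"offer": 0.02, "unix": 0.20}   # p(t|c-bar), illustrative
p_c, p_cbar = 0.5, 0.5                     # flat prior

w = {t: math.log(p_t_c[t] / p_t_cbar[t]) for t in p_t_c}
theta = -math.log(p_c / p_cbar)            # 0 with a flat prior

doc_counts = {"offer": 2, "unix": 0}       # x_i: term counts in the document
score = sum(w[t] * doc_counts[t] for t in w)
print(score > theta)  # True: 2·log 9 ≈ 4.39 > 0
```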
kNN is not a linear classifier
[Figure: kNN decision boundaries between training points x and ⋄, with a test point ⋆]

Classification decision based on majority of k nearest neighbors.
The decision boundaries between classes are piecewise linear . . .
. . . but they are not linear classifiers that can be described as ∑_{i=1}^{M} wi xi = θ.
45 / 121
Example of a linear two-class classifier
ti          wi    x1i  x2i    ti      wi     x1i  x2i
prime       0.70  0    1      dlrs    -0.71  1    1
rate        0.67  1    0      world   -0.35  1    0
interest    0.63  0    0      sees    -0.33  0    0
rates       0.60  0    0      year    -0.25  0    0
discount    0.46  1    0      group   -0.24  0    0
bundesbank  0.43  0    0      dlr     -0.24  0    0

This is for the class interest in Reuters-21578.
For simplicity: assume a simple 0/1 vector representation.
x1: “rate discount dlrs world”
x2: “prime dlrs”
Exercise: Which class is x1 assigned to? Which class is x2 assigned to?
We assign document ~d1 “rate discount dlrs world” to interest since
~wT · ~d1 = 0.67 · 1 + 0.46 · 1 + (−0.71) · 1 + (−0.35) · 1 = 0.07 > 0 = b.
We assign ~d2 “prime dlrs” to the complement class (not in interest) since
~wT · ~d2 = −0.01 ≤ b.
(dlr and world have negative weights because they are indicators for the competing class currency)
46 / 121
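The slide’s arithmetic can be reproduced directly from the weight table (weights as given; the scoring helper is just a sum over the 0/1 features):

```python
# Weights for the Reuters-21578 "interest" classifier from the table above.
w = {"prime": 0.70, "rate": 0.67, "interest": 0.63, "rates": 0.60,
     "discount": 0.46, "bundesbank": 0.43, "dlrs": -0.71, "world": -0.35,
     "sees": -0.33, "year": -0.25, "group": -0.24, "dlr": -0.24}

def score(doc_terms):
    """w^T · d for a 0/1 document vector given as its set of present terms."""
    return sum(w[t] for t in doc_terms)

print(round(score(["rate", "discount", "dlrs", "world"]), 2))  # 0.07 -> interest
print(round(score(["prime", "dlrs"]), 2))  # -0.01 -> not interest
```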
Which hyperplane?
47 / 121
Which hyperplane?
For linearly separable training sets: there are infinitely many separating hyperplanes.
They all separate the training set perfectly . . .
. . . but they behave differently on test data.
Error rates on new data are low for some, high for others.
How do we find a low-error separator?
Perceptron: generally bad; Naive Bayes, Rocchio: ok; linear SVM: good
48 / 121
Linear classifiers: Discussion
Many common text classifiers are linear classifiers: Naive Bayes, Rocchio, logistic regression, linear support vector machines etc.
Each method has a different way of selecting the separating hyperplane.
Huge differences in performance on test documents
Can we get better performance with more powerful nonlinear classifiers?
Not in general: A given amount of training data may suffice for estimating a linear boundary, but not for estimating a more complex nonlinear boundary.
49 / 121
A nonlinear problem
[Figure: a two-class point distribution on the unit square whose true decision boundary is nonlinear]
Linear classifier like Rocchio does badly on this task.
kNN will do well (assuming enough training data)
50 / 121
A linear problem with noise
Figure 14.10: hypothetical web page classification scenario: Chinese-only web pages (solid circles) and mixed Chinese-English web pages (squares). Linear class boundary, except for three noise docs.
51 / 121
Which classifier do I use for a given TC problem?
Is there a learning method that is optimal for all text classification problems?
No, because there is a tradeoff between bias and variance.
Factors to take into account:
How much training data is available?
How simple/complex is the problem? (linear vs. nonlinear decision boundary)
How noisy is the problem?
How stable is the problem over time?
For an unstable problem, it’s better to use a simple and robust classifier.
52 / 121
Outline
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K-means
53 / 121
How to combine hyperplanes for > 2 classes?
?
(e.g.: rank and select top-ranked classes)
54 / 121
![Page 55: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/55.jpg)
One-of problems
One-of or multiclass classification

- Classes are mutually exclusive.
- Each document belongs to exactly one class.
- Example: language of a document (assumption: no document contains multiple languages)
55 / 121
![Page 56: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/56.jpg)
One-of classification with linear classifiers
Combine two-class linear classifiers as follows for one-of classification:

- Run each classifier separately
- Rank classifiers (e.g., according to score)
- Pick the class with the highest score
56 / 121
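This one-of combination rule can be sketched in plain Python. The class names, weight vectors, and biases below are made-up toy values, not from the slides:

```python
# One-of (multiclass) classification with K two-class linear classifiers:
# score each class with w_k . x + b_k, then pick the argmax.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def one_of_classify(doc, classifiers):
    """classifiers: dict mapping class name -> (weights, bias)."""
    scores = {c: dot(w, doc) + b for c, (w, b) in classifiers.items()}
    return max(scores, key=scores.get)  # highest-scoring class wins

# Hypothetical classifiers over a 3-term vocabulary.
classifiers = {
    "UK":      ([0.6, 0.1, 0.0], -0.2),
    "China":   ([0.0, 0.7, 0.2], -0.1),
    "poultry": ([0.1, 0.0, 0.9], -0.4),
}
doc = [0.2, 0.8, 0.1]
print(one_of_classify(doc, classifiers))  # → China
```

Note that the raw scores of independently trained classifiers are not directly comparable in general; calibrating them (e.g., to probabilities) is a common refinement.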
![Page 57: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/57.jpg)
Any-of problems
Any-of or multilabel classification

- A document can be a member of 0, 1, or many classes.
- A decision on one class leaves decisions open on all other classes.
- A type of "independence" (but not statistical independence)
- Example: topic classification
- Usually: make decisions on the region, on the subject area, on the industry, and so on "independently"
57 / 121
![Page 58: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/58.jpg)
Any-of classification with linear classifiers
Combine two-class linear classifiers as follows for any-of classification:

- Simply run each two-class classifier separately on the test document and assign the document accordingly
58 / 121
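A minimal sketch of the any-of rule, where each classifier decides independently. The weights, biases, and the decision threshold of 0 are illustrative assumptions:

```python
# Any-of (multilabel) classification: run each two-class classifier
# independently and assign every class whose score clears its threshold.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def any_of_classify(doc, classifiers, threshold=0.0):
    """Return every class c with w_c . doc + b_c > threshold."""
    return sorted(c for c, (w, b) in classifiers.items()
                  if dot(w, doc) + b > threshold)

# Hypothetical topic classifiers over a 2-term vocabulary.
classifiers = {
    "sports":   ([0.9, 0.0], -0.1),
    "politics": ([0.0, 0.8], -0.2),
    "weather":  ([0.1, 0.1], -0.5),
}
doc = [0.7, 0.6]
print(any_of_classify(doc, classifiers))  # → ['politics', 'sports']
```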
![Page 59: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/59.jpg)
Outline
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K -means
59 / 121
![Page 60: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/60.jpg)
What is clustering?
(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.
60 / 121
![Page 61: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/61.jpg)
Data set with clear cluster structure
61 / 121
![Page 62: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/62.jpg)
Classification vs. Clustering
Classification: supervised learning
Clustering: unsupervised learning
Classification: Classes are human-defined and part of the input to the learning algorithm.
Clustering: Clusters are inferred from the data without humaninput.
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, . . .
62 / 121
![Page 63: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/63.jpg)
Outline
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K -means
63 / 121
![Page 64: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/64.jpg)
The cluster hypothesis
Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.

All applications in IR are based (directly or indirectly) on the cluster hypothesis.
64 / 121
![Page 65: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/65.jpg)
Applications of clustering in IR
| Application | What is clustered? | Benefit | Example |
|---|---|---|---|
| Search result clustering | search results | more effective information presentation to user | next slide |
| Scatter-Gather | (subsets of) collection | alternative user interface: "search without typing" | two slides ahead |
| Collection clustering | collection | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com |
| Cluster-based retrieval | collection | higher efficiency: faster search | Salton 1971 |
65 / 121
![Page 66: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/66.jpg)
Search result clustering for better navigation
Jaguar the cat not among top results, but available via menu at left
66 / 121
![Page 67: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/67.jpg)
Scatter-Gather
A collection of news stories is clustered ("scattered") into eight clusters (top row). The user manually gathers three of them into a smaller collection 'International Stories' and performs another scattering. The process repeats until a small cluster with relevant documents is found (e.g., Trinidad).
67 / 121
![Page 68: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/68.jpg)
Global navigation: Yahoo
68 / 121
![Page 69: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/69.jpg)
Global navigation: MeSH (upper level)
69 / 121
![Page 70: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/70.jpg)
Global navigation: MeSH (lower level)
70 / 121
![Page 71: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/71.jpg)
Note: Yahoo/MeSH are not examples of clustering.
But they are well-known examples of using a global hierarchy for navigation.

Some examples of global navigation/exploration based on clustering:

- Cartia Themescapes
- Google News
71 / 121
![Page 72: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/72.jpg)
Global navigation combined with visualization (1)
72 / 121
![Page 73: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/73.jpg)
Global navigation combined with visualization (2)
73 / 121
![Page 74: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/74.jpg)
Global clustering for navigation: Google News
http://news.google.com
74 / 121
![Page 75: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/75.jpg)
Clustering for improving recall
To improve search recall:
- Cluster docs in collection a priori
- When a query matches a doc d, also return other docs in the cluster containing d

Hope: if we do this, the query "car" will also return docs containing "automobile"

- Because clustering groups together docs containing "car" with those containing "automobile".
- Both types of documents contain words like "parts", "dealer", "mercedes", "road trip".
75 / 121
![Page 76: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/76.jpg)
Data set with clear cluster structure
Exercise: Come up with an algorithm for finding the three clusters in this case
76 / 121
![Page 77: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/77.jpg)
Document representations in clustering
Vector space model
As in vector space classification, we measure relatedness between vectors by Euclidean distance . . .
. . . which is almost equivalent to cosine similarity.
Almost: centroids are not length-normalized.
For centroids, distance and cosine give different results.
77 / 121
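The distinction matters precisely because centroids are not length-normalized. The sketch below uses made-up toy vectors to show a case where Euclidean distance and cosine similarity disagree about which centroid is "closer":

```python
# For length-normalized vectors, Euclidean distance and cosine similarity
# induce the same ranking, since |u - v|^2 = 2 - 2 cos(u, v) for unit u, v.
# Centroids are generally NOT unit length, so the two can disagree.
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

doc  = [1.0, 0.0]   # unit-length document vector
mu_a = [0.2, 0.0]   # short centroid pointing exactly along doc
mu_b = [1.0, 0.3]   # longer centroid, slightly off doc's direction

print(cosine(doc, mu_a) > cosine(doc, mu_b))  # True: cosine prefers mu_a
print(dist(doc, mu_a) < dist(doc, mu_b))      # False: distance prefers mu_b
```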
![Page 78: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/78.jpg)
Issues in clustering
General goal: put related docs in the same cluster, put unrelated docs in different clusters.
But how do we formalize this?
How many clusters?
Initially, we will assume the number of clusters K is given.
Often: secondary goals in clustering
Example: avoid very small and very large clusters
Flat vs. hierarchical clustering
Hard vs. soft clustering
78 / 121
![Page 79: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/79.jpg)
Flat vs. Hierarchical clustering
Flat algorithms
- Usually start with a random (partial) partitioning of docs into groups
- Refine iteratively
- Main algorithm: K-means
Hierarchical algorithms
- Create a hierarchy
- Bottom-up, agglomerative
- Top-down, divisive
79 / 121
![Page 80: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/80.jpg)
Hard vs. Soft clustering
Hard clustering: Each document belongs to exactly one cluster.
More common and easier to do
Soft clustering: A document can belong to more than one cluster.

- Makes more sense for applications like creating browsable hierarchies
- You may want to put a pair of sneakers in two clusters:
  - sports apparel
  - shoes
- You can only do that with a soft clustering approach.

For soft clustering, see course text: 16.5, 18

Today: flat, hard clustering
Next time: hierarchical, hard clustering
80 / 121
![Page 81: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/81.jpg)
Flat algorithms
Flat algorithms compute a partition of N documents into a set of K clusters.

Given: a set of documents and the number K

Find: a partition into K clusters that optimizes the chosen partitioning criterion

Global optimization: exhaustively enumerate partitions, pick the optimal one
Not tractable
Effective heuristic method: K -means algorithm
81 / 121
![Page 82: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/82.jpg)
Outline
1 Recap
2 Rocchio
3 kNN
4 Linear classifiers
5 > two classes
6 Clustering: Introduction
7 Clustering in IR
8 K -means
82 / 121
![Page 83: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/83.jpg)
K -means
Perhaps the best known clustering algorithm
Simple, works well in many cases
Use as default / baseline for clustering documents
83 / 121
![Page 84: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/84.jpg)
K -means
Each cluster in K -means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid
Recall definition of centroid:
$$\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$$
where we use ω to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:

- reassignment: assign each vector to its closest centroid
- recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment
84 / 121
![Page 85: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/85.jpg)
K -means algorithm
```
K-means({x_1, ..., x_N}, K)
 1  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
 2  for k ← 1 to K
 3  do µ_k ← s_k
 4  while stopping criterion has not been met
 5  do for k ← 1 to K
 6     do ω_k ← {}
 7     for n ← 1 to N
 8     do j ← argmin_{j'} |µ_{j'} − x_n|
 9        ω_j ← ω_j ∪ {x_n}                    (reassignment of vectors)
10     for k ← 1 to K
11     do µ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x       (recomputation of centroids)
12  return {µ_1, ..., µ_K}
```
85 / 121
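A minimal Python sketch of this algorithm. The toy points are made up, and seeding uses the first K points rather than random selection, purely so the example is deterministic:

```python
# Minimal K-means: alternate reassignment (closest centroid) and
# recomputation (cluster mean) until the assignment stops changing.

def kmeans(points, k, max_iter=100):
    centroids = [list(p) for p in points[:k]]        # deterministic seeding
    assignment = None
    for _ in range(max_iter):
        # Reassignment: attach each point to its closest centroid.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((c - x) ** 2 for c, x in zip(centroids[j], p)))
            for p in points
        ]
        if new_assignment == assignment:             # stopping criterion
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
centroids, labels = kmeans(points, 2)
print(labels)  # → [0, 0, 1, 1, 1, 1]
```

Each loop pass corresponds to one reassignment step and one recomputation step of the pseudocode.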
![Page 86: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/86.jpg)
Set of points to be clustered
86 / 121
![Page 87: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/87.jpg)
Random selection of initial cluster centers (k = 2 means)
Centroids after convergence?
87 / 121
![Page 88: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/88.jpg)
Assign points to closest centroid
88 / 121
![Page 89: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/89.jpg)
Assignment
89 / 121
![Page 90: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/90.jpg)
Recompute cluster centroids
90 / 121
![Page 91: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/91.jpg)
Assign points to closest centroid
91 / 121
![Page 92: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/92.jpg)
Assignment
92 / 121
![Page 93: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/93.jpg)
Recompute cluster centroids
93 / 121
![Page 94: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/94.jpg)
Assign points to closest centroid
94 / 121
![Page 95: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/95.jpg)
Assignment
95 / 121
![Page 96: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/96.jpg)
Recompute cluster centroids
96 / 121
![Page 97: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/97.jpg)
Assign points to closest centroid
97 / 121
![Page 98: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/98.jpg)
Assignment
98 / 121
![Page 99: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/99.jpg)
Recompute cluster centroids
99 / 121
![Page 100: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/100.jpg)
Assign points to closest centroid
100 / 121
![Page 101: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/101.jpg)
Assignment
101 / 121
![Page 102: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/102.jpg)
Recompute cluster centroids
102 / 121
![Page 103: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/103.jpg)
Assign points to closest centroid
103 / 121
![Page 104: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/104.jpg)
Assignment
104 / 121
![Page 105: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/105.jpg)
Recompute cluster centroids
105 / 121
![Page 106: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/106.jpg)
Assign points to closest centroid
106 / 121
![Page 107: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/107.jpg)
Assignment
107 / 121
![Page 108: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/108.jpg)
Recompute cluster centroids
108 / 121
![Page 109: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/109.jpg)
Centroids and assignments after convergence
109 / 121
![Page 110: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/110.jpg)
Set of points clustered
110 / 121
![Page 111: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/111.jpg)
Set of points to be clustered
111 / 121
![Page 112: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/112.jpg)
K -means is guaranteed to converge
Proof:
The sum of squared distances (RSS) decreases during reassignment, because each vector is moved to a closer centroid. (RSS = sum of all squared distances between document vectors and closest centroids)
RSS decreases during recomputation (see next slide)
There is only a finite number of clusterings.
Thus: We must reach a fixed point. (Assume that ties are broken consistently.)
112 / 121
![Page 113: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/113.jpg)
Recomputation decreases average distance
$$\mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k \qquad \text{(the residual sum of squares, the "goodness" measure)}$$

$$\mathrm{RSS}_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} \|\vec{v} - \vec{x}\|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2$$

$$\frac{\partial\, \mathrm{RSS}_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2(v_m - x_m) = 0$$

$$v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m$$

The last line is the componentwise definition of the centroid! We minimize RSS_k when the old centroid is replaced with the new centroid. RSS, the sum of the RSS_k, must then also decrease during recomputation.
113 / 121
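A quick numeric sanity check of this result, on a made-up cluster: among candidate centers, the componentwise mean gives the smallest RSS_k, and perturbing it in any direction only increases the sum.

```python
# The mean minimizes the residual sum of squares for a cluster.

def rss(center, cluster):
    return sum(sum((c - x) ** 2 for c, x in zip(center, p)) for p in cluster)

cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
mean = [sum(dim) / len(cluster) for dim in zip(*cluster)]   # componentwise mean

# Perturbing the mean in any direction can only increase RSS_k.
for shift in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    moved = [m + s for m, s in zip(mean, shift)]
    assert rss(moved, cluster) > rss(mean, cluster)

print(mean, rss(mean, cluster))
```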
![Page 114: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/114.jpg)
K -means is guaranteed to converge
But we don’t know how long convergence will take!
If we don't care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).

However, complete convergence can take many more iterations.
114 / 121
![Page 115: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/115.jpg)
Optimality of K -means
Convergence does not mean that we converge to the optimal clustering!
This is the great weakness of K -means.
If we start with a bad set of seeds, the resulting clustering can be horrible.
115 / 121
![Page 116: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/116.jpg)
Exercise: Suboptimal clustering
(diagram: six points on a grid, d1 d2 d3 in the upper row and d4 d5 d6 in the lower row)
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds d_{i1}, d_{i2}?
116 / 121
![Page 117: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/117.jpg)
Exercise: Suboptimal clustering
(diagram: six points on a grid, d1 d2 d3 in the upper row and d4 d5 d6 in the lower row)
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds d_{i1}, d_{i2}?
For seeds d2 and d5, K-means converges to {{d1, d2, d3}, {d4, d5, d6}} (a suboptimal clustering).

For seeds d2 and d3, it instead converges to {{d1, d2, d4, d5}, {d3, d6}} (the global optimum for K = 2).
117 / 121
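The seed sensitivity can be reproduced with hypothetical coordinates for d1-d6 (the exact positions in the figure may differ, but two rows of three points with this spacing show the same qualitative behavior):

```python
# K-means from two different seed pairs: one converges to a suboptimal
# row split, the other to the optimal column split.

def kmeans(points, seeds, iters=20):
    centroids = [list(s) for s in seeds]
    labels = []
    for _ in range(iters):
        # Assign each point to the index of its closest centroid.
        labels = [min((sum((c - x) ** 2 for c, x in zip(cen, p)), j)
                      for j, cen in enumerate(centroids))[1] for p in points]
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, labels) if a == j]
            if members:
                centroids[j] = [sum(d) / len(members) for d in zip(*members)]
    return labels

# Hypothetical coordinates: two rows of three, wide gap before the last column.
d1, d2, d3 = (0.0, 1.0), (1.0, 1.0), (4.0, 1.0)
d4, d5, d6 = (0.0, 0.0), (1.0, 0.0), (4.0, 0.0)
docs = [d1, d2, d3, d4, d5, d6]

print(kmeans(docs, seeds=[d2, d5]))  # → [0, 0, 0, 1, 1, 1]  (row split: suboptimal)
print(kmeans(docs, seeds=[d2, d3]))  # → [0, 0, 1, 0, 0, 1]  (column split: optimal)
```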
![Page 118: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/118.jpg)
Initialization of K -means
Random seed selection is just one of many ways K-means can be initialized.

Random seed selection is not very robust: it's easy to get a suboptimal clustering.

Better heuristics:

- Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has "good coverage" of the document space)
- Use hierarchical clustering to find good seeds (next class)
- Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS
118 / 121
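The last heuristic (multiple random restarts, keep the lowest-RSS run) can be sketched as follows. The toy points, the fixed RNG seed, and the helper names are illustrative choices, not from the slides:

```python
# Restart K-means from several random seed sets; keep the lowest-RSS result.
import random

def kmeans_rss(points, seeds, iters=20):
    """Run K-means from the given seeds; return (final RSS, labels)."""
    cents = [list(s) for s in seeds]
    labels = []
    for _ in range(iters):
        labels = [min(range(len(cents)),
                      key=lambda j: sum((c - x) ** 2 for c, x in zip(cents[j], p)))
                  for p in points]
        for j in range(len(cents)):
            mem = [p for p, a in zip(points, labels) if a == j]
            if mem:
                cents[j] = [sum(d) / len(mem) for d in zip(*mem)]
    rss = sum(sum((c - x) ** 2 for c, x in zip(cents[a], p))
              for p, a in zip(points, labels))
    return rss, labels

def best_of(points, k, restarts=10, seed=0):
    rng = random.Random(seed)                 # fixed seed for reproducibility
    runs = [kmeans_rss(points, rng.sample(points, k)) for _ in range(restarts)]
    return min(runs)                          # lowest-RSS clustering wins

points = [(0.0, 0.0), (0.2, 0.1), (4.0, 4.0), (4.1, 3.9), (0.1, 4.0), (0.0, 4.2)]
rss, labels = best_of(points, 3)
print(round(rss, 3))
```

With these well-separated pairs the optimal RSS is 0.06; a handful of restarts is typically enough to find it, though no single run is guaranteed to.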
![Page 119: INFO 4300 / CS4300 Information Retrieval [0.5cm] slides ... · slides adapted from Hinrich Sch¨utze’s, ... IR 20/25: Linear Classifiers and Flat clustering Paul Ginsparg Cornell](https://reader035.vdocuments.net/reader035/viewer/2022062911/5c633e3a09d3f2362e8b50d5/html5/thumbnails/119.jpg)
Time complexity of K -means
Computing one distance of two vectors is O(M).
Reassignment step: O(KNM) (we need to compute KN document-centroid distances)

Recomputation step: O(NM) (we need to add each of the document's < M values to one of the centroids)

Assume number of iterations bounded by I

Overall complexity: O(IKNM) – linear in all important dimensions

However: this is not a real worst-case analysis.

In pathological cases, the number of iterations can be much higher than linear in the number of documents.
119 / 121