Introduction to Information Retrieval
http://informationretrieval.org
IIR 16: Flat Clustering
Hinrich Schütze
Institute for Natural Language Processing, Universität Stuttgart
2009.06.16

Overview
1 Recap
2 Clustering: Introduction
3 Clustering in IR
4 K-means
5 Evaluation
6 How many clusters?

MI example for poultry/export in Reuters

                        e_c = e_poultry = 1    e_c = e_poultry = 0
e_t = e_export = 1      N11 = 49               N10 = 27,652
e_t = e_export = 0      N01 = 141              N00 = 774,106

Plug these values into formula:

\[
\begin{aligned}
I(U;C) ={}& \frac{49}{801{,}948} \log_2 \frac{801{,}948 \cdot 49}{(49+27{,}652)(49+141)} \\
&+ \frac{141}{801{,}948} \log_2 \frac{801{,}948 \cdot 141}{(141+774{,}106)(49+141)} \\
&+ \frac{27{,}652}{801{,}948} \log_2 \frac{801{,}948 \cdot 27{,}652}{(49+27{,}652)(27{,}652+774{,}106)} \\
&+ \frac{774{,}106}{801{,}948} \log_2 \frac{801{,}948 \cdot 774{,}106}{(141+774{,}106)(27{,}652+774{,}106)} \\
\approx{}& 0.0001105
\end{aligned}
\]

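The four-term sum above can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `mutual_information` and its argument names are chosen here for illustration. With the poultry/export counts it evaluates to about 0.00011.

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    """Mutual information I(U;C) in bits from a 2x2 table of document counts."""
    n = n11 + n10 + n01 + n00
    total = 0.0
    # Each cell count is paired with its row total (term indicator e_t)
    # and its column total (class indicator e_c).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n01, n01 + n00, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n00, n01 + n00, n10 + n00),
    ]
    for n_tc, n_t, n_c in cells:
        if n_tc > 0:  # an empty cell contributes nothing to the sum
            total += (n_tc / n) * log2(n * n_tc / (n_t * n_c))
    return total

mi = mutual_information(n11=49, n10=27_652, n01=141, n00=774_106)
print(f"I(U;C) = {mi:.7f}")
```

The value is small because "export" is only weakly informative about the class poultry: most documents contain neither the term nor the class label.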
Linear classifiers
Linear classifiers compute a linear combination or weighted sum $\sum_i w_i x_i$ of the feature values.
Classification decision: $\sum_i w_i x_i > \theta$?
Geometrically, the equation $\sum_i w_i x_i = \theta$ defines a line (2D), a plane (3D) or a hyperplane (higher dimensionalities).
Assumption: The classes are linearly separable.
Methods for finding a linear separator: Perceptron, Rocchio, Naive Bayes, many others

A linear classifier in 1D
A linear classifier in 1D is a point described by the equation $w_1 d_1 = \theta$.
The point at $\theta / w_1$
Points $(d_1)$ with $w_1 d_1 \geq \theta$ are in the class $c$.
Points $(d_1)$ with $w_1 d_1 < \theta$ are in the complement class $\bar{c}$.

A linear classifier in 2D
A linear classifier in 2D is a line described by the equation $w_1 d_1 + w_2 d_2 = \theta$.
Example for a 2D linear classifier
Points $(d_1\ d_2)$ with $w_1 d_1 + w_2 d_2 \geq \theta$ are in the class $c$.
Points $(d_1\ d_2)$ with $w_1 d_1 + w_2 d_2 < \theta$ are in the complement class $\bar{c}$.

A linear classifier in 3D
A linear classifier in 3D is a plane described by the equation $w_1 d_1 + w_2 d_2 + w_3 d_3 = \theta$.
Example for a 3D linear classifier
Points $(d_1\ d_2\ d_3)$ with $w_1 d_1 + w_2 d_2 + w_3 d_3 \geq \theta$ are in the class $c$.
Points $(d_1\ d_2\ d_3)$ with $w_1 d_1 + w_2 d_2 + w_3 d_3 < \theta$ are in the complement class $\bar{c}$.

Rocchio as a linear classifier
Rocchio is a linear classifier defined by:
\[ \sum_{i=1}^{M} w_i d_i = \vec{w} \cdot \vec{d} = \theta \]
where the normal vector $\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$ and $\theta = 0.5 \cdot (|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2)$.

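A minimal sketch of the Rocchio separator defined above, in plain Python. The helper names (`centroid`, `dot`, `rocchio_separator`, `in_class1`) and the toy vectors are made up for illustration; only the formulas for the normal vector and the threshold come from the slide.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rocchio_separator(class1_docs, class2_docs):
    """Return (w, theta) with w = mu(c1) - mu(c2), theta = 0.5 * (|mu(c1)|^2 - |mu(c2)|^2)."""
    mu1, mu2 = centroid(class1_docs), centroid(class2_docs)
    w = [a - b for a, b in zip(mu1, mu2)]
    theta = 0.5 * (dot(mu1, mu1) - dot(mu2, mu2))
    return w, theta

def in_class1(d, w, theta):
    """Linear decision rule: d is assigned to c1 when w . d >= theta."""
    return dot(w, d) >= theta

# Toy training data (hypothetical values).
c1_docs = [[2.0, 2.0], [3.0, 3.0]]
c2_docs = [[0.0, 0.0], [1.0, 0.0]]
w, theta = rocchio_separator(c1_docs, c2_docs)
print(w, theta, in_class1([3.0, 2.0], w, theta), in_class1([0.0, 1.0], w, theta))
```

The rule $\vec{w} \cdot \vec{d} \geq \theta$ is the same as assigning $\vec{d}$ to the class whose centroid is closer in Euclidean distance, which is why the hyperplane lies halfway between the two centroids.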
Naive Bayes as a linear classifier
Naive Bayes is a linear classifier defined by:
\[ \sum_{i=1}^{M} w_i d_i = \theta \]
where $w_i = \log[P(t_i|c)/P(t_i|\bar{c})]$, $d_i$ = number of occurrences of $t_i$ in $d$, and $\theta = -\log[P(c)/P(\bar{c})]$.
Here, the index $i$, $1 \leq i \leq M$, refers to terms of the vocabulary (not to positions in $d$ as $k$ did in our original definition of Naive Bayes).

kNN is not a linear classifier
[Figure: points marked x]
The decision boundaries between classes are piecewise linear ...
... but they are not linear separators that can be described as $\sum_{i=1}^{M} w_i d_i = \theta$.

What is clustering?
(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.

Data set with clear cluster structure
[Figure: scatter plot; axis ticks run from 0.0 to 2.0 and from 0.0 to 2.5]

Classification vs. Clustering
Classification: supervised learning
Clustering: unsupervised learning
Classification: Classes are human-defined and part of the input to the learning algorithm.
Clustering: Clusters are inferred from the data without human input.
However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...

The cluster hypothesis
Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.
All applications of clustering in IR are based (directly or indirectly) on the cluster hypothesis.

Applications of clustering in IR

Application                What is clustered?        Benefit                                                       Example
Search result clustering   search results            more effective information presentation to user
Scatter-Gather             (subsets of) collection   alternative user interface: “search without typing”
Collection clustering      collection                effective information presentation for exploratory browsing   McKeown et al. 200, news.google.com
Cluster-based retrieval    collection                higher efficiency: faster search                              Salton 1971

Search result clustering for better navigation
Scatter-Gather
Global navigation: Yahoo
Global navigation: MESH (upper level)
Global navigation: MESH (lower level)

Note: Yahoo/MESH are not examples of clustering.
But they are well-known examples for using a global hierarchy for navigation.
Some examples for global navigation/exploration based on clustering:
  Cartia
  Themescapes
  News

Global navigation combined with visualization (1)
Global navigation combined with visualization (2)
Global clustering for navigation: Google News
http://news.google.com

Clustering for improving recall
To improve search recall:
  Cluster docs in collection a priori
  When a query matches a doc d, also return other docs in the cluster containing d
Hope: if we do this: the query “car” will also return docs containing “automobile”
  Because clustering groups together docs containing “car” with those containing “automobile”.
  Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.

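The recall idea above can be sketched in a few lines. Everything in this example is hypothetical: the three toy documents, the precomputed cluster assignment `cluster_of`, and the helper `expand_with_clusters` only illustrate the lookup step and assume that the clustering has already been computed offline.

```python
def expand_with_clusters(matching_docs, cluster_of, members_of):
    """Return the matched docs plus all other docs from the clusters they belong to."""
    expanded = set(matching_docs)
    for doc in matching_docs:
        expanded |= members_of[cluster_of[doc]]
    return expanded

# Toy collection and an assumed a priori clustering (made-up data).
docs = {
    "d1": "car dealer parts",
    "d2": "automobile dealer parts",
    "d3": "election results",
}
cluster_of = {"d1": 0, "d2": 0, "d3": 1}
members_of = {0: {"d1", "d2"}, 1: {"d3"}}

# Plain term matching finds only d1 for the query "car" ...
matches = [doc_id for doc_id, text in docs.items() if "car" in text.split()]
# ... while cluster expansion also returns d2, which says "automobile".
result = expand_with_clusters(matches, cluster_of, members_of)
print(matches, sorted(result))
```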
Data set with clear cluster structure
[Figure: scatter plot; axis ticks run from 0.0 to 2.0 and from 0.0 to 2.5]
Exercise: Come up with an algorithm for finding the three clusters in this case

Document representations in clustering
Vector space model
As in vector space classification, we measure relatedness between vectors by Euclidean distance ...
... which is almost equivalent to cosine similarity.
Almost: centroids are not length-normalized.
For centroids, distance and cosine give different results.

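The "almost equivalent" statement can be made concrete. For length-normalized vectors, $|\vec{x} - \vec{y}|^2 = 2(1 - \cos(\vec{x}, \vec{y}))$, so ranking by distance and ranking by cosine agree. The sketch below checks this identity and then shows a case where the two measures disagree for centroids that are not length-normalized. All vectors are made-up illustration values.

```python
from math import sqrt

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two length-normalized document vectors.
x = [0.6, 0.8]
y = [1.0, 0.0]

# For unit vectors the squared distance equals 2 * (1 - cosine).
identity_gap = abs(sq_dist(x, y) - 2 * (1 - cosine(x, y)))

# The centroid of two unit vectors is in general shorter than 1.
centroid = [(a + b) / 2 for a, b in zip(x, y)]
centroid_length = norm(centroid)

# Two hypothetical centroids of different lengths and a unit-length document d.
d = [1.0, 0.0]
centroids = {"a": [0.2, 0.0], "b": [0.8, 0.3]}
nearest = min(centroids, key=lambda c: sq_dist(d, centroids[c]))
most_similar = max(centroids, key=lambda c: cosine(d, centroids[c]))
print(identity_gap, centroid_length, nearest, most_similar)
```

Here centroid "a" points in exactly the same direction as `d` but is short, so cosine prefers "a" while Euclidean distance prefers the longer centroid "b".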
Issues in clustering
General goal: put related docs in the same cluster, put unrelated docs in different clusters.
  But how do we formalize this?
How many clusters?
  Initially, we will assume the number of clusters K is given.
Often: secondary goals in clustering
  Example: avoid very small and very large clusters
Flat vs. hierarchical clustering
Hard vs. soft clustering

Flat vs. Hierarchical clustering
Flat algorithms
  Usually start with a random (partial) partitioning of docs into groups
  Refine iteratively
  Main algorithm: K-means
Hierarchical algorithms
  Create a hierarchy
  Bottom-up, agglomerative
  Top-down, divisive

Hard vs. Soft clustering
Hard clustering: Each document belongs to exactly one cluster.
  More common and easier to do
Soft clustering: A document can belong to more than one cluster.
  Makes more sense for applications like creating browsable hierarchies
  You may want to put a pair of sneakers in two clusters:
    sports apparel
    shoes
  You can only do that with a soft clustering approach.
We won’t have time for soft clustering. See IIR 16.5, IIR 18

Our plan
This lecture: Flat, hard clustering
Next lecture: Hierarchical, hard clustering

Flat algorithms
Flat algorithms compute a partition of N documents into a set of K clusters.
Given: a set of documents and the number K
Find: a partition into K clusters that optimizes the chosen partitioning criterion
Global optimization: exhaustively enumerate partitions, pick optimal one
  Not tractable
Effective heuristic method: K-means algorithm

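To see why exhaustive enumeration is not tractable, one can count the partitions. The number of ways to split N documents into K non-empty clusters is the Stirling number of the second kind, a standard combinatorial fact that is not stated on the slide. The sketch below computes it with the usual recurrence; the function name `stirling2` is chosen here for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of partitions of n items into k non-empty, unlabeled clusters."""
    if n == k:
        return 1  # every item in its own cluster (also covers n = k = 0)
    if k == 0 or k > n:
        return 0
    # The n-th item either joins one of the k existing clusters
    # or forms a new cluster on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

for n, k in [(4, 2), (10, 3), (100, 5)]:
    print(n, k, stirling2(n, k))
```

Even for 100 documents and 5 clusters the count has more than 60 digits, so a heuristic such as K-means is needed.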
K-means
Perhaps the best known clustering algorithm
Simple, works well in many cases
Use as default / baseline for clustering documents

K-means
Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid
Recall definition of centroid:
\[ \vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x} \]
where we use $\omega$ to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:
  reassignment: assign each vector to its closest centroid
  recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment

K-means algorithm

K-means({x_1, ..., x_N}, K)
  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
  for k ← 1 to K
  do μ_k ← s_k
  while stopping criterion has not been met
  do for k ← 1 to K
     do ω_k ← {}
     for n ← 1 to N
     do j ← arg min_j′ |μ_j′ − x_n|
        ω_j ← ω_j ∪ {x_n}    (reassignment of vectors)
     for k ← 1 to K
     do μ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x    (recomputation of centroids)
  return {μ_1, ..., μ_K}

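The pseudocode above can be turned into a short runnable sketch. This is one possible reading rather than a reference implementation. It makes three assumptions that the slide leaves open: the stopping criterion is "no vector changed its cluster" (with an iteration cap `max_iter`), ties go to the centroid with the lowest index, and a cluster that becomes empty keeps its old centroid. It also returns the assignment alongside the centroids.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(xs, k, max_iter=100, seed=0):
    """Cluster the vectors xs into k clusters; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = [list(s) for s in rng.sample(xs, k)]  # random seeds
    assignment = None
    for _ in range(max_iter):
        # Reassignment: index of the closest centroid for every vector.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(centroids[j], x)) for x in xs
        ]
        if new_assignment == assignment:  # assumed stopping criterion
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(k):
            members = [x for x, a in zip(xs, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, assignment

# Toy data with two well-separated groups (made-up values).
docs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, assignment = kmeans(docs, k=2)
print(centroids, assignment)
```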
K-means example

K-means is guaranteed to converge
Proof:
The sum of squared distances (RSS) decreases (or stays the same) during reassignment.
  RSS = sum of all squared distances between document vector and closest centroid
  (because each vector is moved to a closer centroid)
RSS decreases (or stays the same) during recomputation.
  (We will show this on the next slide.)
There is only a finite number of clusterings.
Thus: We must reach a fixed point.
  (assume that ties are broken consistently)

Recomputation decreases average distance
$\mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k$: the residual sum of squares (the “goodness” measure)
\[ \mathrm{RSS}_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} \|\vec{v} - \vec{x}\|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2 \]
\[ \frac{\partial\, \mathrm{RSS}_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2 (v_m - x_m) = 0 \]
\[ v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m \]
The last line is the componentwise definition of the centroid! We minimize $\mathrm{RSS}_k$ when the old centroid is replaced with the new centroid. RSS, the sum of the $\mathrm{RSS}_k$, must then also decrease during recomputation.

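The derivation can be checked numerically on a small cluster. The sketch below uses three made-up points and compares $\mathrm{RSS}_k$ at the centroid with $\mathrm{RSS}_k$ at a few shifted positions. It is an illustration of the result, not a proof.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rss_k(v, cluster):
    """Sum of squared distances from the candidate center v to all cluster members."""
    return sum(sq_dist(v, x) for x in cluster)

# A toy cluster (hypothetical values) and its centroid.
cluster = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
centroid = [sum(col) / len(cluster) for col in zip(*cluster)]

at_centroid = rss_k(centroid, cluster)

# Moving the center away from the centroid in any direction increases RSS_k.
shifts = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
shifted = [rss_k([centroid[0] + dx, centroid[1] + dy], cluster) for dx, dy in shifts]
print(centroid, at_centroid, shifted)
```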
K-means is guaranteed to converge
But we don’t know how long convergence will take!
If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).
However, complete convergence can take many more iterations.

Optimality of K-means
Convergence does not mean that we converge to the optimal clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.

Exercise: Suboptimal clustering
[Figure: six points d1, d2, d3, d4, d5, d6 (marked ×) in the plane; axis ticks run from 0 to 4 and from 0 to 3]
What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds $d_{i_1}, d_{i_2}$?

Initialization of K-means
Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It’s easy to get a suboptimal clustering.
Better heuristics:
  Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
  Use hierarchical clustering to find good seeds (next class)
  Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS

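The last heuristic (several seed sets, keep the clustering with the lowest RSS) can be sketched as follows. The compact `kmeans` helper is an illustrative stand-in with an assumed stopping rule (no reassignment changes); the names `best_of_restarts` and `restarts` are also made up for this example.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(xs, k, rng, max_iter=100):
    """Plain K-means from random seeds; returns (centroids, assignment)."""
    centroids = [list(s) for s in rng.sample(xs, k)]
    assignment = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: sq_dist(centroids[j], x)) for x in xs]
        if new == assignment:
            break
        assignment = new
        for j in range(k):
            members = [x for x, a in zip(xs, assignment) if a == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignment

def rss(xs, centroids, assignment):
    """Residual sum of squares of a clustering."""
    return sum(sq_dist(x, centroids[a]) for x, a in zip(xs, assignment))

def best_of_restarts(xs, k, restarts=10, seed=0):
    """Run K-means from several random seed sets and keep the lowest-RSS result."""
    rng = random.Random(seed)
    runs = [kmeans(xs, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: rss(xs, *run))

docs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
best_centroids, best_assignment = best_of_restarts(docs, k=2)
best_rss = rss(docs, best_centroids, best_assignment)
print(best_assignment, best_rss)
```

Restarts do not guarantee the global optimum; they only make a bad outcome from a single unlucky seed set less likely.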
Time complexity of K-means
Computing one distance of two vectors is O(M).
Reassignment step: O(KNM) (we need to compute KN document-centroid distances)
Recomputation step: O(NM) (we need to add each of the document’s < M values to one of the centroids)
Assume number of iterations bounded by I
Overall complexity: O(IKNM), linear in all important dimensions
However: This is not a real worst-case analysis.
In pathological cases, the number of iterations can be much higher than linear in the number of documents.

What is a good clustering?
Internal criteria
  Example of an internal criterion: RSS in K-means
But an internal criterion often does not evaluate the actual utility of a clustering in the application.
Alternative: External criteria
  Evaluate with respect to a human-defined classification

External criteria for clustering quality
Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification
Goal: Clustering should reproduce the classes in the gold standard
(But we only want to reproduce how documents are divided into groups, not the class labels.)
First measure for how well we were able to reproduce the classes: purity

External criterion: Purity
\[ \mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j| \]
$\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters and $C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes.
For each cluster $\omega_k$: find class $c_j$ with most members $n_{kj}$ in $\omega_k$
Sum all $n_{kj}$ and divide by total number of points

Example for computing purity
[Figure: 17 points of classes x, o and ⋄ grouped into cluster 1, cluster 2 and cluster 3]
To compute purity: $5 = \max_j |\omega_1 \cap c_j|$ (class x, cluster 1); $4 = \max_j |\omega_2 \cap c_j|$ (class o, cluster 2); and $3 = \max_j |\omega_3 \cap c_j|$ (class ⋄, cluster 3). Purity is $(1/17) \times (5 + 4 + 3) \approx 0.71$.

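A minimal sketch of the purity computation. The majority counts (5, 4, 3) and the cluster sizes (6, 6, 5) come from the example; the classes of the remaining minority points are an assumption chosen to be consistent with the pair counts given for the same example. The label "d" stands in for the diamond class.

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of gold class labels per cluster."""
    n = sum(len(c) for c in clusters)
    # For each cluster, count the members of its most frequent class.
    majority_total = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority_total / n

clusters = [
    ["x"] * 5 + ["o"],        # cluster 1: majority class x
    ["o"] * 4 + ["x", "d"],   # cluster 2: majority class o
    ["d"] * 3 + ["x"] * 2,    # cluster 3: majority class d (diamond)
]
value = purity(clusters)
print(round(value, 2))
```

Purity is 1 when every document is its own cluster, so it cannot be used to trade off clustering quality against the number of clusters.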
Rand index
Definition:
\[ \mathrm{RI} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN}} \]
Based on 2x2 contingency table of all pairs of documents:

                     same cluster           different clusters
same class           true positives (TP)    false negatives (FN)
different classes    false positives (FP)   true negatives (TN)

TP + FN + FP + TN is the total number of pairs.
There are $\binom{N}{2}$ pairs for N documents.
Example: $\binom{17}{2} = 136$ in o/⋄/x example
Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) ...
... and either “true” (correct) or “false” (incorrect): the clustering decision is correct or incorrect.

As an example, we compute RI for the o/⋄/x example. We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of “positives” or pairs of documents that are in the same cluster is:
\[ \mathrm{TP} + \mathrm{FP} = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40 \]
Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:
\[ \mathrm{TP} = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20 \]
Thus, FP = 40 − 20 = 20. FN and TN are computed similarly.

Rand measure for the o/⋄/x example

                     same cluster   different clusters
same class           TP = 20        FN = 24
different classes    FP = 20        TN = 72

RI is then $(20 + 72)/(20 + 20 + 24 + 72) \approx 0.68$.

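The pair counts can be reproduced by enumerating all 136 document pairs. The sketch below assumes the same cluster composition as the o/⋄/x example (with "d" standing in for the diamond class; the classes of the minority points are an assumption consistent with the counts above). The function names `pair_counts` and `f_beta` are illustrative, and `f_beta` uses the usual weighted harmonic mean of pairwise precision and recall.

```python
from itertools import combinations

def pair_counts(clusters):
    """Return (TP, FP, FN, TN) over all pairs of documents."""
    items = [(k, label) for k, cluster in enumerate(clusters) for label in cluster]
    tp = fp = fn = tn = 0
    for (k1, c1), (k2, c2) in combinations(items, 2):
        same_cluster, same_class = k1 == k2, c1 == c2
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def f_beta(tp, fp, fn, beta):
    """F measure on pairs; beta > 1 weights recall more than precision."""
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

clusters = [
    ["x"] * 5 + ["o"],
    ["o"] * 4 + ["x", "d"],
    ["d"] * 3 + ["x"] * 2,
]
tp, fp, fn, tn = pair_counts(clusters)
rand_index = (tp + tn) / (tp + fp + fn + tn)
f5 = f_beta(tp, fp, fn, beta=5)
print(tp, fp, fn, tn, round(rand_index, 2), round(f5, 2))
```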
Two other external evaluation measures
Normalized mutual information (NMI)
  How much information does the clustering contain about the classification?
  Singleton clusters (number of clusters = number of docs) have maximum MI
  Therefore: normalize by entropy of clusters and classes
F measure
  Like Rand, but “precision” and “recall” can be weighted

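A sketch of NMI for the same example. The slide only says to normalize by the entropy of clusters and classes; dividing the mutual information by the average of the two entropies is an assumption made here (it is a common definition). The cluster composition is the one assumed for the o/⋄/x example, with "d" standing in for the diamond class.

```python
from collections import Counter
from math import log2

def entropy(counts, n):
    """Entropy in bits of a distribution given as absolute counts."""
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def nmi(clusters):
    """Mutual information of clusters and classes, divided by their mean entropy."""
    n = sum(len(c) for c in clusters)
    class_sizes = Counter(label for c in clusters for label in c)
    mi = 0.0
    for cluster in clusters:
        for label, n_kj in Counter(cluster).items():
            mi += (n_kj / n) * log2(n * n_kj / (len(cluster) * class_sizes[label]))
    h_clusters = entropy([len(c) for c in clusters], n)
    h_classes = entropy(class_sizes.values(), n)
    return mi / ((h_clusters + h_classes) / 2)

clusters = [
    ["x"] * 5 + ["o"],
    ["o"] * 4 + ["x", "d"],
    ["d"] * 3 + ["x"] * 2,
]
value = nmi(clusters)
print(round(value, 2))
```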
Evaluation results for the o/⋄/x example

                    purity   NMI    RI     F5
lower bound         0.0      0.0    0.0    0.0
maximum             1.0      1.0    1.0    1.0
value for example   0.71     0.36   0.68   0.46

All four measures range from 0 (really bad clustering) to 1 (perfect clustering).

How many clusters?
Either: Number of clusters K is given.
  Then partition into K clusters
  K might be given because there is some external constraint. Example: In the case of Scatter-Gather, it was hard to show more than 10–20 clusters on a monitor in the 90s.
Or: Finding the “right” number of clusters is part of the problem.
  Given docs, find K for which an optimum is reached.
  How to define “optimum”?
  We can’t use RSS or average squared distance from centroid as criterion: always chooses K = N clusters.

Exercise
Suppose we want to analyze the set of all articles published by a major newspaper (e.g., New York Times or Süddeutsche Zeitung) in 2008.
Goal: write a two-page report about what the major news stories in 2008 were.
We want to use K-means clustering to find the major news stories.
How would you determine K?

Simple objective function for K (1)
Basic idea:
  Start with 1 cluster (K = 1)
  Keep adding clusters (= keep increasing K)
  Add a penalty for each new cluster
Trade off cluster penalties against average squared distance from centroid
Choose K with best tradeoff

Simple objective function for K (2)
Given a clustering, define the cost for a document as (squared) distance to centroid
Define total distortion $\mathrm{RSS}(K)$ as sum of all individual document costs (corresponds to average distance)
Then: penalize each cluster with a cost $\lambda$
Thus for a clustering with K clusters, total cluster penalty is $K\lambda$
Define the total cost of a clustering as distortion plus total cluster penalty: $\mathrm{RSS}(K) + K\lambda$
Select K that minimizes $(\mathrm{RSS}(K) + K\lambda)$
Still need to determine good value for $\lambda$ ...

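The selection rule above is a one-line minimization once $\mathrm{RSS}(K)$ is known for each candidate K. In this sketch the RSS values are made up for illustration (in practice they would come from running K-means for each K), and `select_k` is a hypothetical helper name.

```python
def select_k(rss_by_k, lam):
    """Return the K that minimizes RSS(K) + K * lambda."""
    return min(rss_by_k, key=lambda k: rss_by_k[k] + k * lam)

# Hypothetical distortion values: RSS shrinks as K grows.
rss_by_k = {1: 100.0, 2: 40.0, 3: 15.0, 4: 12.0, 5: 10.5}

for lam in (0.0, 1.0, 5.0, 100.0):
    print(lam, select_k(rss_by_k, lam))
```

With $\lambda = 0$ the rule always picks the largest K, and a very large $\lambda$ forces K = 1, which shows why a good value for $\lambda$ still has to be determined.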
Finding the “knee” in the curve
[Figure: residual sum of squares (axis ticks 1750 to 1950) plotted against number of clusters (axis ticks 2 to 10)]
Pick the number of clusters where curve “flattens”. Here: 4 or 9.

Resources
Chapter 16 of IIR
Resources at http://ifnlp.org/ir
  K-means example
  Keith van Rijsbergen on the cluster hypothesis (he was one of the originators)
  Bing/Carrot2/Clusty: search result clustering