Introduction to Information Retrieval
http://informationretrieval.org
IIR 16: Flat Clustering
Hinrich Schütze
Institute for Natural Language Processing, Universität Stuttgart
2009.06.16

Overview
1 Recap
2 Clustering: Introduction
3 Clustering in IR
4 K-means
5 Evaluation
6 How many clusters?

MI example for poultry/export in Reuters

                        e_c = e_poultry = 1    e_c = e_poultry = 0
e_t = e_export = 1      N_11 = 49              N_10 = 27,652
e_t = e_export = 0      N_01 = 141             N_00 = 774,106

Plug these values into the formula:

$$
\begin{aligned}
I(U;C) ={}& \frac{49}{801{,}948}\log_2\frac{801{,}948\cdot 49}{(49+27{,}652)(49+141)}
 + \frac{141}{801{,}948}\log_2\frac{801{,}948\cdot 141}{(141+774{,}106)(49+141)} \\
 &+ \frac{27{,}652}{801{,}948}\log_2\frac{801{,}948\cdot 27{,}652}{(49+27{,}652)(27{,}652+774{,}106)}
 + \frac{774{,}106}{801{,}948}\log_2\frac{801{,}948\cdot 774{,}106}{(141+774{,}106)(27{,}652+774{,}106)} \\
 \approx{}& 0.0001105
\end{aligned}
$$
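
The four-term sum above can be checked numerically. The following is a minimal sketch, not part of the lecture; the function name `mutual_information` and its argument names are assumptions chosen for illustration.

```python
from math import log2


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) of a term and a class from a 2x2 count table.

    n11: documents that contain the term and are in the class
    n10: documents that contain the term and are not in the class
    n01: documents that lack the term and are in the class
    n00: documents that lack the term and are not in the class
    """
    n = n11 + n10 + n01 + n00
    term_yes, term_no = n11 + n10, n01 + n00
    class_yes, class_no = n11 + n01, n10 + n00
    cells = [
        (n11, term_yes, class_yes),
        (n01, term_no, class_yes),
        (n10, term_yes, class_no),
        (n00, term_no, class_no),
    ]
    # A cell with a zero count contributes nothing to the sum.
    return sum(c / n * log2(n * c / (row * col)) for c, row, col in cells if c > 0)


mi_poultry_export = mutual_information(n11=49, n10=27_652, n01=141, n00=774_106)
print(f"I(U;C) = {mi_poultry_export:.7f}")
```

The value is tiny because "export" occurs in only 49 of the 190 poultry documents while also occurring in many thousands of other documents.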




Linear classifiers

Linear classifiers compute a linear combination or weighted sum $\sum_i w_i x_i$ of the feature values.
Classification decision: $\sum_i w_i x_i > \theta$?
Geometrically, the equation $\sum_i w_i x_i = \theta$ defines a line (2D), a plane (3D) or a hyperplane (higher dimensionalities).
Assumption: The classes are linearly separable.
Methods for finding a linear separator: Perceptron, Rocchio, Naive Bayes, many others

A linear classifier in 1D

A linear classifier in 1D is a point described by the equation $w_1 d_1 = \theta$.
The point is at $\theta / w_1$.
Points $(d_1)$ with $w_1 d_1 \geq \theta$ are in the class $c$.
Points $(d_1)$ with $w_1 d_1 < \theta$ are in the complement class $\overline{c}$.

A linear classifier in 2D

A linear classifier in 2D is a line described by the equation $w_1 d_1 + w_2 d_2 = \theta$.
Example for a 2D linear classifier:
Points $(d_1\ d_2)$ with $w_1 d_1 + w_2 d_2 \geq \theta$ are in the class $c$.
Points $(d_1\ d_2)$ with $w_1 d_1 + w_2 d_2 < \theta$ are in the complement class $\overline{c}$.

A linear classifier in 3D

A linear classifier in 3D is a plane described by the equation $w_1 d_1 + w_2 d_2 + w_3 d_3 = \theta$.
Example for a 3D linear classifier:
Points $(d_1\ d_2\ d_3)$ with $w_1 d_1 + w_2 d_2 + w_3 d_3 \geq \theta$ are in the class $c$.
Points $(d_1\ d_2\ d_3)$ with $w_1 d_1 + w_2 d_2 + w_3 d_3 < \theta$ are in the complement class $\overline{c}$.
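
The 1D, 2D and 3D cases are the same decision rule applied to vectors of different length. The sketch below illustrates this; the function name `linear_classify` and the example weights and thresholds are hypothetical, and the rule uses ≥ as on the 1D/2D/3D slides.

```python
def linear_classify(weights, features, theta):
    """Return True (class c) if the weighted sum reaches theta, else False (complement class)."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    score = sum(w * d for w, d in zip(weights, features))
    return score >= theta


# 1D: the boundary is the point theta / w1 = 4 / 2 = 2.
print(linear_classify([2.0], [3.0], theta=4.0))   # 2 * 3 = 6 >= 4
print(linear_classify([2.0], [1.0], theta=4.0))   # 2 * 1 = 2 < 4

# 2D: the boundary is the line d1 + d2 = 3.
print(linear_classify([1.0, 1.0], [2.0, 2.0], theta=3.0))

# 3D: the boundary is the plane d1 + 2*d2 - d3 = 0.
print(linear_classify([1.0, 2.0, -1.0], [1.0, 0.0, 2.0], theta=0.0))
```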


Rocchio as a linear classifier

Rocchio is a linear classifier defined by:

$$\sum_{i=1}^{M} w_i d_i = \vec{w} \cdot \vec{d} = \theta$$

where the normal vector is $\vec{w} = \vec{\mu}(c_1) - \vec{\mu}(c_2)$ and $\theta = 0.5 \cdot (|\vec{\mu}(c_1)|^2 - |\vec{\mu}(c_2)|^2)$.
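
As a hedged illustration of the Rocchio definition, the sketch below derives the normal vector and threshold from two class centroids. The helper names (`centroid`, `rocchio_separator`) and the toy vectors are assumptions, not taken from the lecture.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rocchio_separator(class1, class2):
    """Return (w, theta) with w = mu(c1) - mu(c2) and theta = 0.5 * (|mu(c1)|^2 - |mu(c2)|^2)."""
    mu1, mu2 = centroid(class1), centroid(class2)
    w = [a - b for a, b in zip(mu1, mu2)]
    theta = 0.5 * (sum(a * a for a in mu1) - sum(b * b for b in mu2))
    return w, theta


# Hypothetical 2D training vectors for the two classes.
w, theta = rocchio_separator([[2, 2], [4, 2]], [[0, 0], [0, 2]])
print(w, theta)

# The midpoint between the two centroids lies exactly on the boundary w . d = theta.
midpoint = [1.5, 1.5]
print(sum(wi * di for wi, di in zip(w, midpoint)) == theta)
```

With this choice of w and θ, the boundary is the set of points equidistant from the two centroids.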

Naive Bayes as a linear classifier

Naive Bayes is a linear classifier defined by:

$$\sum_{i=1}^{M} w_i d_i = \theta$$

where $w_i = \log[P(t_i|c) / P(t_i|\overline{c})]$, $d_i$ = number of occurrences of $t_i$ in $d$, and $\theta = -\log[P(c) / P(\overline{c})]$.
Here, the index $i$, $1 \leq i \leq M$, refers to terms of the vocabulary (not to positions in $d$ as $k$ did in our original definition of Naive Bayes).

kNN is not a linear classifier

[Figure: scatter of training points marked x with the kNN decision boundaries]

The decision boundaries between classes are piecewise linear ...
... but they are not linear separators that can be described as $\sum_{i=1}^{M} w_i d_i = \theta$.



What is clustering?

(Document) clustering is the process of grouping a set of documents into clusters of similar documents.
Documents within a cluster should be similar.
Documents from different clusters should be dissimilar.
Clustering is the most common form of unsupervised learning.
Unsupervised = there are no labeled or annotated data.

Data set with clear cluster structure

[Figure: scatter plot of a data set with clear cluster structure; axes run from 0.0 to 2.0 and from 0.0 to 2.5]

Classification vs. Clustering

Classification: supervised learning
Clustering: unsupervised learning
Classification: Classes are human-defined and part of the input to the learning algorithm.
Clustering: Clusters are inferred from the data without human input.
  However, there are many ways of influencing the outcome of clustering: number of clusters, similarity measure, representation of documents, ...



The cluster hypothesis

Cluster hypothesis. Documents in the same cluster behave similarly with respect to relevance to information needs.
All applications in IR are based (directly or indirectly) on the cluster hypothesis.

Applications of clustering in IR

Application                | What is clustered?       | Benefit                                                      | Example
Search result clustering   | search results           | more effective information presentation to user              |
Scatter-Gather             | (subsets of) collection  | alternative user interface: “search without typing”          |
Collection clustering      | collection               | effective information presentation for exploratory browsing  | McKeown et al. 200, news.google.com
Cluster-based retrieval    | collection               | higher efficiency: faster search                             | Salton 1971

Search result clustering for better navigation

[Figure: screenshot of clustered search results]

Scatter-Gather

[Figure: Scatter-Gather interface]


Global navigation: Yahoo

[Figure: Yahoo directory]

Global navigation: MESH (upper level)

[Figure: upper level of the MESH hierarchy]

Global navigation: MESH (lower level)

[Figure: lower level of the MESH hierarchy]

Note: Yahoo/MESH are not examples of clustering.
But they are well known examples for using a global hierarchy for navigation.
Some examples for global navigation/exploration based on clustering:
  Cartia
  Themescapes
  Google News


Global navigation combined with visualization (1)

[Figure: visualization example 1]

Global navigation combined with visualization (2)

[Figure: visualization example 2]

Global clustering for navigation: Google News

http://news.google.com

Clustering for improving recall

To improve search recall:
  Cluster docs in collection a priori
  When a query matches a doc d, also return other docs in the cluster containing d
Hope: if we do this, the query “car” will also return docs containing “automobile”,
because clustering groups together docs containing “car” with those containing “automobile”.
Both types of documents contain words like “parts”, “dealer”, “mercedes”, “road trip”.


Data set with clear cluster structure

[Figure: scatter plot of a data set with clear cluster structure; axes run from 0.0 to 2.0 and from 0.0 to 2.5]

Exercise: Come up with an algorithm for finding the three clusters in this case

Document representations in clustering

Vector space model
As in vector space classification, we measure relatedness between vectors by Euclidean distance ...
... which is almost equivalent to cosine similarity.
Almost: centroids are not length-normalized.
For centroids, distance and cosine give different results.
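
The "almost equivalent" claim can be made concrete: for length-normalized vectors, squared Euclidean distance equals 2(1 − cosine), so both give the same ranking, while a centroid of unit vectors generally has length below 1. A minimal sketch with hypothetical 2D vectors:

```python
from math import sqrt


def norm(v):
    return sqrt(sum(x * x for x in v))


def normalize(v):
    length = norm(v)
    return [x / length for x in v]


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Two hypothetical document vectors, length-normalized.
x = normalize([3.0, 4.0])
y = normalize([1.0, 0.0])

# For unit vectors: |x - y|^2 = 2 * (1 - cos(x, y)).
print(sq_dist(x, y), 2 * (1 - cosine(x, y)))

# The centroid of unit vectors is generally shorter than 1, so the identity no longer applies to it.
centroid = [(a + b) / 2 for a, b in zip(x, y)]
print(norm(centroid))
```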

Issues in clustering

General goal: put related docs in the same cluster, put unrelated docs in different clusters.
  But how do we formalize this?
How many clusters?
  Initially, we will assume the number of clusters K is given.
Often: secondary goals in clustering
  Example: avoid very small and very large clusters
Flat vs. hierarchical clustering
Hard vs. soft clustering

Flat vs. Hierarchical clustering

Flat algorithms
  Usually start with a random (partial) partitioning of docs into groups
  Refine iteratively
  Main algorithm: K-means
Hierarchical algorithms
  Create a hierarchy
  Bottom-up, agglomerative
  Top-down, divisive


Hard vs. Soft clustering

Hard clustering: Each document belongs to exactly one cluster.
  More common and easier to do
Soft clustering: A document can belong to more than one cluster.
  Makes more sense for applications like creating browsable hierarchies
  You may want to put a pair of sneakers in two clusters:
    sports apparel
    shoes
  You can only do that with a soft clustering approach.
We won’t have time for soft clustering. See IIR 16.5, IIR 18.

Our plan

This lecture: Flat, hard clustering
Next lecture: Hierarchical, hard clustering

Flat algorithms

Flat algorithms compute a partition of N documents into a set of K clusters.
Given: a set of documents and the number K
Find: a partition in K clusters that optimizes the chosen partitioning criterion
Global optimization: exhaustively enumerate partitions, pick optimal one
  Not tractable
Effective heuristic method: K-means algorithm
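
To see why exhaustive enumeration is not tractable, one can count the partitions of N documents into exactly K non-empty clusters. That count is the Stirling number of the second kind, a standard combinatorial fact that the slide does not spell out. The function name `count_partitions` is an assumption for this sketch.

```python
def count_partitions(n, k):
    """Number of ways to partition n items into exactly k non-empty clusters.

    Uses the recurrence S(i, j) = j * S(i-1, j) + S(i-1, j-1) with S(0, 0) = 1.
    """
    if k < 0 or k > n:
        return 0
    table = [1] + [0] * k          # table[j] holds S(i, j) for the current i
    for i in range(1, n + 1):
        # Go from high j to low j so table[j - 1] still refers to row i - 1.
        for j in range(min(i, k), 0, -1):
            table[j] = j * table[j] + table[j - 1]
        table[0] = 0               # S(i, 0) = 0 for i >= 1
    return table[k]


print(count_partitions(4, 2))
print(count_partitions(10, 3))
print(len(str(count_partitions(100, 5))), "digits for N = 100, K = 5")
```

Even a toy collection of 100 documents has far too many 5-cluster partitions to enumerate, which motivates a heuristic such as K-means.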



K-means

Perhaps the best known clustering algorithm
Simple, works well in many cases
Use as default/baseline for clustering documents

K-means

Each cluster in K-means is defined by a centroid.
Objective/partitioning criterion: minimize the average squared difference from the centroid
Recall definition of centroid:

$$\vec{\mu}(\omega) = \frac{1}{|\omega|} \sum_{\vec{x} \in \omega} \vec{x}$$

where we use $\omega$ to denote a cluster.
We try to find the minimum average squared difference by iterating two steps:
  reassignment: assign each vector to its closest centroid
  recomputation: recompute each centroid as the average of the vectors that were assigned to it in reassignment

K-means algorithm

K-means({x_1, ..., x_N}, K)
  (s_1, s_2, ..., s_K) ← SelectRandomSeeds({x_1, ..., x_N}, K)
  for k ← 1 to K
      do µ_k ← s_k
  while stopping criterion has not been met
      do for k ← 1 to K
             do ω_k ← {}
         for n ← 1 to N
             do j ← arg min_j' |µ_j' − x_n|
                ω_j ← ω_j ∪ {x_n}    (reassignment of vectors)
         for k ← 1 to K
             do µ_k ← (1/|ω_k|) Σ_{x ∈ ω_k} x    (recomputation of centroids)
  return {µ_1, ..., µ_K}
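
The pseudocode can be turned into a short runnable sketch. Several details below are assumptions the slide leaves open: the stopping criterion is "assignments no longer change" (capped by `max_iter`), ties go to the lowest-numbered centroid, and an empty cluster keeps its previous centroid. The names `k_means`, `max_iter` and `seed` are illustrative.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment); assignment[n] is the cluster index of points[n]."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]   # random seeds
    assignment = None
    for _ in range(max_iter):
        # Reassignment: each vector goes to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(centroids[j], p)) for p in points
        ]
        if new_assignment == assignment:     # assumed stopping criterion
            break
        assignment = new_assignment
        # Recomputation: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                      # assumption: keep old centroid if empty
                centroids[j] = mean(members)
    return centroids, assignment


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centroids, toy_assignment = k_means(toy_points, k=2)
print(toy_centroids, toy_assignment)
```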

K-means example

[Figure: worked K-means example]


K-means is guaranteed to converge

Proof:
  The sum of squared distances (RSS) decreases during reassignment.
    RSS = sum of all squared distances between document vector and closest centroid
    (because each vector is moved to a closer centroid)
  RSS decreases during recomputation.
    (We will show this on the next slide.)
  There is only a finite number of clusterings.
  Thus: We must reach a fixed point.
  (assume that ties are broken consistently)

Recomputation decreases average distance

$\mathrm{RSS} = \sum_{k=1}^{K} \mathrm{RSS}_k$ is the residual sum of squares (the “goodness” measure).

$$\mathrm{RSS}_k(\vec{v}) = \sum_{\vec{x} \in \omega_k} \|\vec{v} - \vec{x}\|^2 = \sum_{\vec{x} \in \omega_k} \sum_{m=1}^{M} (v_m - x_m)^2$$

$$\frac{\partial\, \mathrm{RSS}_k(\vec{v})}{\partial v_m} = \sum_{\vec{x} \in \omega_k} 2 (v_m - x_m) = 0$$

$$v_m = \frac{1}{|\omega_k|} \sum_{\vec{x} \in \omega_k} x_m$$

The last line is the componentwise definition of the centroid! We minimize $\mathrm{RSS}_k$ when the old centroid is replaced with the new centroid. RSS, the sum of the $\mathrm{RSS}_k$, must then also decrease during recomputation.
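
The derivation can be sanity-checked numerically: moving away from the centroid by any offset δ raises RSS_k by exactly |ω_k|·‖δ‖². The cluster below is hypothetical and the helper names are assumptions for this sketch.

```python
def rss_k(v, cluster):
    """Sum of squared distances from v to every vector in the cluster."""
    return sum(sum((vm - xm) ** 2 for vm, xm in zip(v, x)) for x in cluster)


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(column) / n for column in zip(*cluster))


cluster = [(1.0, 2.0), (3.0, 0.0), (5.0, 4.0)]   # hypothetical cluster of three vectors
mu = centroid(cluster)
base = rss_k(mu, cluster)
print("centroid:", mu, "RSS_k at centroid:", base)

for delta in [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.2)]:
    v = tuple(m + d for m, d in zip(mu, delta))
    increase = len(cluster) * sum(d * d for d in delta)
    # Any move away from the centroid increases RSS_k by |cluster| * |delta|^2.
    print(v, rss_k(v, cluster), base + increase)
```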

K-means is guaranteed to converge

But we don’t know how long convergence will take!
If we don’t care about a few docs switching back and forth, then convergence is usually fast (< 10-20 iterations).
However, complete convergence can take many more iterations.

Optimality of K-means

Convergence does not mean that we converge to the optimal clustering!
This is the great weakness of K-means.
If we start with a bad set of seeds, the resulting clustering can be horrible.


Exercise: Suboptimal clustering

[Figure: six points d1, d2, d3, d4, d5, d6 plotted on axes running from 0 to 4 (horizontal) and 0 to 3 (vertical)]

What is the optimal clustering for K = 2?
Do we converge on this clustering for arbitrary seeds $d_{i_1}, d_{i_2}$?

Initialization of K-means

Random seed selection is just one of many ways K-means can be initialized.
Random seed selection is not very robust: It’s easy to get a suboptimal clustering.
Better heuristics:
  Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)
  Use hierarchical clustering to find good seeds (next class)
  Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS
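
The last heuristic (run K-means from several seed sets, keep the lowest RSS) can be sketched as follows. The six-point layout is a hypothetical stand-in and the seed sets are passed in explicitly so the result is reproducible; `k_means_from_seeds` and `best_of_seed_sets` are names invented for this example.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means_from_seeds(points, seeds, max_iter=100):
    """Run K-means from the given seeds (max_iter >= 1); return (rss, assignment)."""
    centroids = [tuple(s) for s in seeds]
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(centroids[j], p)) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    rss = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return rss, assignment


def best_of_seed_sets(points, seed_sets):
    """Cluster once per seed set and keep the result with the lowest RSS."""
    return min((k_means_from_seeds(points, s) for s in seed_sets), key=lambda r: r[0])


# Hypothetical layout: two rows of three points; the right-hand pair is set apart.
points = [(1, 2), (2, 2), (4, 2), (1, 1), (2, 1), (4, 1)]
bad_seeds = [(2, 2), (2, 1)]     # converges to the two rows (a poor fixed point)
good_seeds = [(1, 2), (4, 2)]    # converges to a left group and a right group

print(k_means_from_seeds(points, bad_seeds))
print(best_of_seed_sets(points, [bad_seeds, good_seeds]))
```

Both runs reach a fixed point, but only one of them is the lower-RSS clustering, which is why comparing several runs helps.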

Time complexity of K-means

Computing one distance of two vectors is O(M).
Reassignment step: O(KNM) (we need to compute KN document-centroid distances)
Recomputation step: O(NM) (we need to add each of the document’s < M values to one of the centroids)
Assume number of iterations bounded by I
Overall complexity: O(IKNM), linear in all important dimensions
However: This is not a real worst-case analysis.
In pathological cases, the number of iterations can be much higher than linear in the number of documents.



What is a good clustering?

Internal criteria
  Example of an internal criterion: RSS in K-means
But an internal criterion often does not evaluate the actual utility of a clustering in the application.
Alternative: External criteria
  Evaluate with respect to a human-defined classification

External criteria for clustering quality

Based on a gold standard data set, e.g., the Reuters collection we also used for the evaluation of classification
Goal: Clustering should reproduce the classes in the gold standard
(But we only want to reproduce how documents are divided into groups, not the class labels.)
First measure for how well we were able to reproduce the classes: purity

External criterion: Purity

$$\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_k \max_j |\omega_k \cap c_j|$$

$\Omega = \{\omega_1, \omega_2, \ldots, \omega_K\}$ is the set of clusters and $C = \{c_1, c_2, \ldots, c_J\}$ is the set of classes.
For each cluster $\omega_k$: find class $c_j$ with most members $n_{kj}$ in $\omega_k$
Sum all $n_{kj}$ and divide by total number of points

Example for computing purity

[Figure: 17 points of classes x, o and ⋄ grouped into cluster 1, cluster 2 and cluster 3]

To compute purity: $5 = \max_j |\omega_1 \cap c_j|$ (class x, cluster 1); $4 = \max_j |\omega_2 \cap c_j|$ (class o, cluster 2); and $3 = \max_j |\omega_3 \cap c_j|$ (class ⋄, cluster 3). Purity is $(1/17) \times (5 + 4 + 3) \approx 0.71$.
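
A minimal sketch of the purity computation. The exact cluster contents are an assumption: the composition below is chosen to match the counts quoted in this lecture (majority counts 5, 4, 3; cluster sizes 6, 6, 5; 17 points), with the label "d" standing for the diamond class.

```python
from collections import Counter

# Assumed composition of the three clusters ("d" = diamond class).
clusters = [
    ["x"] * 5 + ["o"],            # cluster 1: majority class x (5 of 6)
    ["o"] * 4 + ["x", "d"],       # cluster 2: majority class o (4 of 6)
    ["d"] * 3 + ["x"] * 2,        # cluster 3: majority class diamond (3 of 5)
]


def purity(clusters):
    """Fraction of points that belong to the majority class of their cluster."""
    n = sum(len(members) for members in clusters)
    majority_total = sum(Counter(members).most_common(1)[0][1] for members in clusters)
    return majority_total / n


print(round(purity(clusters), 2))
```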


Rand index

Definition: $\mathrm{RI} = \dfrac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN}}$

Based on 2x2 contingency table of all pairs of documents:

                      same cluster            different clusters
same class            true positives (TP)     false negatives (FN)
different classes     false positives (FP)    true negatives (TN)

TP + FN + FP + TN is the total number of pairs.
There are $\binom{N}{2}$ pairs for N documents.
Example: $\binom{17}{2} = 136$ in the o/⋄/x example
Each pair is either positive or negative (the clustering puts the two documents in the same or in different clusters) ...
... and either “true” (correct) or “false” (incorrect): the clustering decision is correct or incorrect.

As an example, we compute RI for the o/⋄/x example. We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of “positives”, or pairs of documents that are in the same cluster, is:

$$\mathrm{TP} + \mathrm{FP} = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40$$

Of these, the x pairs in cluster 1, the o pairs in cluster 2, the ⋄ pairs in cluster 3, and the x pair in cluster 3 are true positives:

$$\mathrm{TP} = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20$$

Thus, FP = 40 − 20 = 20.
FN and TN are computed similarly.

Rand measure for the o/⋄/x example

                      same cluster    different clusters
same class            TP = 20         FN = 24
different classes     FP = 20         TN = 72

RI is then $(20 + 72)/(20 + 20 + 24 + 72) \approx 0.68$.
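
The four pair counts can be reproduced by enumerating all document pairs. As before, the cluster composition is an assumption consistent with the counts in this lecture ("d" = diamond class); `pair_counts` and `rand_index` are illustrative names.

```python
from itertools import combinations

# Assumed composition of the three clusters ("d" = diamond class).
clusters = [
    ["x"] * 5 + ["o"],
    ["o"] * 4 + ["x", "d"],
    ["d"] * 3 + ["x"] * 2,
]


def pair_counts(clusters):
    """Return (TP, FP, FN, TN) over all pairs of documents."""
    docs = [(cid, label) for cid, members in enumerate(clusters) for label in members]
    tp = fp = fn = tn = 0
    for (cluster_a, class_a), (cluster_b, class_b) in combinations(docs, 2):
        same_cluster = cluster_a == cluster_b
        same_class = class_a == class_b
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(clusters):
    tp, fp, fn, tn = pair_counts(clusters)
    return (tp + tn) / (tp + fp + fn + tn)


print(pair_counts(clusters), round(rand_index(clusters), 2))
```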

Two other external evaluation measures

Two other measures
Normalized mutual information (NMI)
  How much information does the clustering contain about the classification?
  Singleton clusters (number of clusters = number of docs) have maximum MI
  Therefore: normalize by entropy of clusters and classes
F measure
  Like Rand, but “precision” and “recall” can be weighted
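
Both measures can be sketched on the same example. The formulas used here follow the definitions in IIR chapter 16 (NMI = I(Ω;C) divided by the mean of the two entropies; F_β = (β²+1)PR / (β²P+R) over document pairs), which this slide only summarizes. The cluster composition is again an assumption ("d" = diamond class).

```python
from collections import Counter
from itertools import combinations
from math import log

clusters = [
    ["x"] * 5 + ["o"],
    ["o"] * 4 + ["x", "d"],
    ["d"] * 3 + ["x"] * 2,
]


def entropy(counts, n):
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def nmi(clusters):
    """Mutual information of clusters and classes, normalized by the mean entropy."""
    n = sum(len(members) for members in clusters)
    class_sizes = Counter(label for members in clusters for label in members)
    mi = 0.0
    for members in clusters:
        for label, n_kj in Counter(members).items():
            mi += n_kj / n * log(n * n_kj / (len(members) * class_sizes[label]))
    h_clusters = entropy([len(members) for members in clusters], n)
    h_classes = entropy(class_sizes.values(), n)
    return mi / ((h_clusters + h_classes) / 2)


def f_measure(clusters, beta):
    """Pair-based F measure; beta > 1 weights recall more than precision."""
    docs = [(cid, label) for cid, members in enumerate(clusters) for label in members]
    tp = fp = fn = 0
    for (cluster_a, class_a), (cluster_b, class_b) in combinations(docs, 2):
        if cluster_a == cluster_b and class_a == class_b:
            tp += 1
        elif cluster_a == cluster_b:
            fp += 1
        elif class_a == class_b:
            fn += 1
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)


print(round(nmi(clusters), 2), round(f_measure(clusters, beta=5), 2))
```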


Evaluation results for the o/⋄/x example

                     purity   NMI    RI     F5
lower bound          0.0      0.0    0.0    0.0
maximum              1.0      1.0    1.0    1.0
value for example    0.71     0.36   0.68   0.46

All four measures range from 0 (really bad clustering) to 1 (perfect clustering).


How many clusters?

Either: Number of clusters K is given.
  Then partition into K clusters
  K might be given because there is some external constraint. Example: In the case of Scatter-Gather, it was hard to show more than 10–20 clusters on a monitor in the 90s.
Or: Finding the “right” number of clusters is part of the problem.
  Given docs, find K for which an optimum is reached.
  How to define “optimum”?
  We can’t use RSS or average squared distance from centroid as criterion: always chooses K = N clusters.

Exercise

Suppose we want to analyze the set of all articles published by a major newspaper (e.g., New York Times or Süddeutsche Zeitung) in 2008.
Goal: write a two-page report about what the major news stories in 2008 were.
We want to use K-means clustering to find the major news stories.
How would you determine K?


Simple objective function for K (1)

Basic idea:
  Start with 1 cluster (K = 1)
  Keep adding clusters (= keep increasing K)
  Add a penalty for each new cluster
Trade off cluster penalties against average squared distance from centroid
Choose K with best tradeoff

Simple objective function for K (2)

Given a clustering, define the cost for a document as (squared) distance to centroid
Define total distortion RSS(K) as sum of all individual document costs (corresponds to average distance)
Then: penalize each cluster with a cost λ
Thus for a clustering with K clusters, total cluster penalty is Kλ
Define the total cost of a clustering as distortion plus total cluster penalty: RSS(K) + Kλ
Select K that minimizes (RSS(K) + Kλ)
Still need to determine good value for λ ...
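
Once RSS(K) has been measured for a range of K, the selection rule is a one-line minimization. In this sketch the RSS values are made up for illustration and `select_k` is a hypothetical helper name.

```python
def select_k(rss_by_k, lam):
    """Return the K that minimizes RSS(K) + lam * K."""
    return min(rss_by_k, key=lambda k: rss_by_k[k] + lam * k)


# Hypothetical RSS values from running K-means once per K (not from the lecture).
rss_by_k = {1: 100.0, 2: 60.0, 3: 35.0, 4: 30.0, 5: 28.0}

print(select_k(rss_by_k, lam=10.0))    # moderate penalty
print(select_k(rss_by_k, lam=0.0))     # no penalty: the largest K always wins
print(select_k(rss_by_k, lam=100.0))   # heavy penalty: a single cluster wins
```

With λ = 0 the criterion reduces to plain RSS and picks the largest K available, which is the degenerate behaviour the penalty is meant to prevent.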

Finding the “knee” in the curve

[Figure: residual sum of squares (vertical axis, about 1750 to 1950) plotted against number of clusters (horizontal axis, 2 to 10)]

Pick the number of clusters where curve “flattens”. Here: 4 or 9.

Resources

Chapter 16 of IIR
Resources at http://ifnlp.org/ir
  K-means example
  Keith van Rijsbergen on the cluster hypothesis (he was one of the originators)
  Bing/Carrot2/Clusty: search result clustering