
Document Clustering

CISC489/689-010, Lecture #17, Monday, April 20th

Ben Carterette

Classification Review

•  Items (documents, web pages, emails) are represented with features

•  Some items are assigned a class from a fixed set

•  Classification goal: use known class assignments to “learn” a general function f(x) for classifying new instances

•  Naïve Bayes classifier:
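The classifier formula on the original slide did not survive extraction; the standard multinomial naïve Bayes decision rule, consistent with the review above, is:

```latex
% Standard multinomial naive Bayes decision rule: choose the class c
% that maximizes the class prior times the product of per-term
% likelihoods for the terms w_1 ... w_n in the document.
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
```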


Clustering

•  A set of algorithms that attempt to find latent (hidden) structure in a set of items

•  Goal is to identify groups (clusters) of similar items
  –  Two items in the same group should be similar to one another
  –  An item in one group should be dissimilar to an item in another group

Clustering Example

•  Suppose I gave you the shape, color, vitamin C content, and price of various fruits and asked you to cluster them
  –  What criteria would you use?
  –  How would you define similarity?

•  Clustering is very sensitive to how items are represented and how similarity is defined!


Clustering in Two Dimensions

How would you cluster these points?

Classification vs Clustering

•  Classification is supervised
  –  You are given a fixed set of classes
  –  You are given class labels for certain instances
  –  This is data you can use to learn the classification function

•  Clustering is unsupervised
  –  You are not given any information about how documents should be grouped
  –  You don’t even know how many groups there should be
  –  There is no training data to learn from

•  One way to think of it: learning vs discovery


Clustering in IR

•  Cluster hypothesis:
  –  “Closely associated documents tend to be relevant to the same requests” – van Rijsbergen ’79

•  Document clusters may capture relevance better than individual documents

•  Clusters may capture “subtopics”

Cluster-Based Search


Yahoo! Hierarchy

(Figure: part of the www.yahoo.com/Science directory tree. Top-level categories include agriculture, biology, physics, CS, and space, plus about 30 more; subcategories include dairy, crops, agronomy, forestry, botany, evolution, cell, AI, HCI, craft, missions, magnetism, relativity, and courses.)

Not based on clustering approaches, but one possible use of clustering.

Example from “Introduction to IR” slides by Hinrich Schutze

Clustering Algorithms

•  General outline of clustering algorithms:
  1.  Decide how items will be represented (e.g., feature vectors)
  2.  Define similarity measure between pairs or groups of items (e.g., cosine similarity)
  3.  Determine what makes a “good” clustering
  4.  Iteratively construct clusters that are increasingly “good”
  5.  Stop after a local/global optimum clustering is found

•  Steps 3 and 4 differ the most across algorithms


Item Representation

•  Typical representations for documents in IR:
  –  “Bag of words” – a vector of terms appearing in the document with associated weights
  –  N-grams
  –  etc.

•  Any representation used in retrieval can (theoretically) be used in clustering or classification
  –  Though specialized representations may be better for particular tasks
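As a minimal sketch (not from the slides) of the bag-of-words representation, using raw term frequency as the weight; TF-IDF or any other weighting scheme could be substituted:

```python
# Minimal bag-of-words sketch: a document becomes a term -> weight map.
# Raw term frequency is used as the weight purely for illustration.
from collections import Counter

def bag_of_words(text):
    """Tokenize on whitespace and count term occurrences."""
    return Counter(text.lower().split())

print(bag_of_words("jaguar cats and Jaguar cars"))
# Counter({'jaguar': 2, 'cats': 1, 'and': 1, 'cars': 1})
```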

Item Similarity

•  The cluster hypothesis suggests that document similarity should be based on information content
  –  Ideally semantic content, but we have already seen how hard that is

•  Instead, use the same idea as in query-based retrieval
  –  The score of a document to a query is based on how similar they are in the words they contain
     •  Cosine angle between vectors; P(R|Q,D); P(Q|D)
  –  The similarity of two documents will be based on how similar they are in the words they contain


Document Similarity

(Figure: three diagrams comparing two documents D1 and D2 under cosine similarity, Euclidean distance, and Manhattan distance.)

“similarity” vs “distance”: in practice, you can use either
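A small sketch of the three measures from the figure, assuming documents are dense term-weight vectors of equal length:

```python
# The three document-similarity measures from the figure, for dense,
# equal-length term-weight vectors d1 and d2.
import math

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    return dot / (math.sqrt(sum(a * a for a in d1)) *
                  math.sqrt(sum(b * b for b in d2)))

def euclidean_distance(d1, d2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def manhattan_distance(d1, d2):
    return sum(abs(a - b) for a, b in zip(d1, d2))
```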

What Makes a Good Cluster?

•  Large vs small?
  –  Is it OK to have a cluster with one item?
  –  Is it OK to have a cluster with 10,000 items?

•  Similarity between items?
  –  Is it OK for things in a cluster to be very far apart, as long as they are closer to each other than to things in other clusters?
  –  Is it OK for things to be so close together that other similar things are excluded from the cluster?

•  Overlapping vs non-overlapping?
  –  Is it OK for two clusters to contain some items in common?
  –  Should clusters “nest” within one another?


Example Approaches

•  “Hard” clustering
  –  Every item is in only one cluster

•  “Soft” clustering
  –  Items can belong to more than one cluster
  –  Nested hierarchy: item belongs to a cluster, as well as the cluster’s parent cluster, and so on
  –  Non-nested: item belongs to two separate clusters
     •  E.g., a document about jaguar cats riding in Jaguar cars might belong to the “animal” cluster and the “car” cluster

Example Approaches

•  Flat clustering:
  –  No overlap: every item in exactly one cluster
  –  K clusters total
  –  Start with random groups, then refine them until they are “good”

•  Hierarchical clustering:
  –  Clusters are nested: a cluster can be made up of two or more smaller clusters
  –  No fixed number
  –  Start with one group and split it until there are good clusters
  –  Or start with N groups and agglomerate them until there are good clusters


Flat Clustering

•  Goal: partition N documents into K clusters

•  Given: N document feature vectors, a number K

•  Optimal algorithm:
  –  Try every possible clustering and take whichever one is the “best”
  –  Computation time: O(K^N)

•  Heuristic approach:
  –  Split documents into K clusters randomly
  –  Move documents from one cluster to another until the clusters seem “good”

K-Means Clustering

•  K-means is a partitioning heuristic

•  Documents are represented as vectors

•  Clusters are represented as a centroid vector

•  Basic algorithm:
  –  Step 0: Choose K docs to be initial cluster centroids
  –  Step 1: Assign points to closest centroid
  –  Step 2: Recompute cluster centroids
  –  Step 3: Go to 1


K-Means Clustering Algorithm

Input: N documents, a number K
•  A[1], A[2], …, A[N] := 0
•  C1, C2, …, CK := initial cluster assignment (pick K docs)
•  do
  –  changed = false
  –  for each document Di, i = 1 to N
     •  k = argmin_k dist(Di, Ck)   (equivalently, k = argmax_k sim(Di, Ck))
     •  if A[i] != k then
        –  A[i] = k
        –  changed = true
  –  if changed then C1, C2, …, CK := cluster centroids
•  until changed is false
•  return A[1..N]
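A minimal Python rendering of the pseudocode above (an illustrative sketch, not the course’s reference implementation); Euclidean distance and a random choice of the K initial documents are assumptions:

```python
import math
import random

def dist(d, c):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d, c)))

def centroid(members):
    """Mean vector of a list of feature vectors."""
    return [sum(vals) / len(members) for vals in zip(*members)]

def kmeans(docs, K):
    A = [0] * len(docs)             # A[i]: cluster assignment of doc i
    C = random.sample(docs, K)      # initial centroids: K random docs
    changed = True
    while changed:                  # do ... until changed is false
        changed = False
        for i, d in enumerate(docs):
            # assign doc i to its closest centroid
            k = min(range(K), key=lambda k: dist(d, C[k]))
            if A[i] != k:
                A[i] = k
                changed = True
        if changed:                 # recompute cluster centroids
            for k in range(K):
                members = [d for i, d in enumerate(docs) if A[i] == k]
                if members:         # guard against empty clusters
                    C[k] = centroid(members)
    return A
```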

K-Means Decisions

•  K – number of clusters
  –  K = 2? K = 10? K = 500?

•  Cluster initialization
  –  Random initialization often used
  –  A bad initial assignment can result in bad clusters

•  Distance measure
  –  Cosine similarity most common
  –  Euclidean distance, Manhattan distance, manifold distances

•  Stopping condition
  –  Until no documents have changed clusters
  –  Until centroids do not change
  –  Fixed number of iterations


K-Means Advantages

•  Computationally efficient
  –  Distance between two documents = O(V)
  –  Distance of each doc to each centroid = O(KNV)
  –  Calculating centroids = O(NV)
  –  For m iterations, O(m(KNV + NV)) = O(mKNV)

•  Tends to converge quickly (m is relatively small)

•  Easy to implement

K-Means Disadvantages

•  What should K be?

•  Clusters have a fixed geometric shape
  –  Spherical
  –  Very sensitive to dimensions and weights

•  No notion of outliers
  –  A document that’s far away from everything will either be in a cluster on its own or in some very wide (geometrically speaking) cluster


Hierarchical Clustering

•  Goal: construct a hierarchy of clusters
  –  The top level of the hierarchy consists of a single cluster with all items in it
  –  The bottom level of the hierarchy consists of N (# items) singleton clusters

•  Two types of hierarchical clustering
  –  Divisive (“top down”)
  –  Agglomerative (“bottom up”)

•  Hierarchy can be visualized as a dendrogram

Example Dendrogram

(Figure: a dendrogram over items A through M. Obtain clusters by cutting the dendrogram at some threshold; the clusters are the connected components.)
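Cutting a dendrogram at a threshold can be sketched with SciPy’s hierarchical clustering routines (the library choice, data, and threshold are assumptions, not from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(13, 2)                # 13 items, 2 features
Z = linkage(X, method='single')          # build the dendrogram
# Cut at distance threshold 0.2: items whose merges happen below the
# threshold end up in the same flat cluster (a connected component).
labels = fcluster(Z, t=0.2, criterion='distance')
print(labels)                            # cluster id for each item
```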


Divisive and Agglomerative Hierarchical Clustering

•  Divisive
  –  Start with a single cluster consisting of all of the items
  –  Until only singleton clusters exist…
     •  Divide an existing cluster into two new clusters

•  Agglomerative
  –  Start with N (# items) singleton clusters
  –  Until a single cluster exists…
     •  Combine two existing clusters into a new cluster

•  How do we know how to divide or combine clusters?
  –  Define a division or combination cost
  –  Perform the division or combination with the lowest cost (see the sketch below)
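A sketch of the agglomerative loop just described: start with singleton clusters and repeatedly perform the cheapest combination. The combination_cost parameter is a hypothetical placeholder for one of the linkage costs defined on the following slides:

```python
import itertools

def agglomerate(items, combination_cost, stop_at=1):
    """Merge the cheapest pair of clusters until stop_at clusters remain."""
    clusters = [[x] for x in items]          # N singleton clusters
    while len(clusters) > stop_at:
        # Pick the pair of clusters with the lowest combination cost.
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: combination_cost(clusters[p[0]],
                                                  clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```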

Agglomerative Hierarchical Clustering


Divisive Hierarchical Clustering

Clustering Costs

•  Similarity measured between two different clusters

•  Single linkage

•  Complete linkage

•  Average linkage

•  Average group linkage
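The formulas on the original slide were lost in extraction; the standard definitions of these four costs, for clusters Ci and Cj with document-pair similarity sim(a, b), are below (average group linkage is given in its centroid form, one common formulation):

```latex
% Standard linkage definitions (the slide's own notation was lost).
\mathrm{single:}\quad   \mathrm{sim}(C_i, C_j) = \max_{a \in C_i,\, b \in C_j} \mathrm{sim}(a, b) \\
\mathrm{complete:}\quad \mathrm{sim}(C_i, C_j) = \min_{a \in C_i,\, b \in C_j} \mathrm{sim}(a, b) \\
\mathrm{average:}\quad  \mathrm{sim}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{a \in C_i} \sum_{b \in C_j} \mathrm{sim}(a, b) \\
\mathrm{average\ group:}\quad \mathrm{sim}(C_i, C_j) = \mathrm{sim}(\mu_{C_i}, \mu_{C_j}),
\quad \mu_C = \tfrac{1}{|C|} \textstyle\sum_{a \in C} a
```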


Clustering Strategies

(Figure: four panels illustrating single linkage, complete linkage, average linkage, and average group linkage.)

Single Linkage

•  Similarity between two clusters = minimum distance between all pairs of documents
  –  (Or maximum similarity)

•  After merging two clusters, sim(Ci ∪ Cj, Ck) = max(sim(Ci, Ck), sim(Cj, Ck))

•  Tends to produce “stringier” hierarchies

•  Example: (figure)


Complete Linkage

•  Similarity between two clusters = maximum distance between all pairs of documents
  –  (Or minimum similarity)

•  After merging two clusters, sim(Ci ∪ Cj, Ck) = min(sim(Ci, Ck), sim(Cj, Ck))

•  Tends to produce more “spherical” clusters

•  Example: (figure)

Hierarchical Clustering Advantages

•  Flexibility
  –  No fixed number of clusters
  –  Can change threshold to get different clusters
     •  Lower threshold: more specific clusters
     •  Higher threshold: broader clusters
  –  Can change cost function to get different clusters

•  Hierarchical structure may be meaningful
  –  E.g., articles about jaguar cats agglomerate together, articles about tigers agglomerate, then both agglomerate to articles about big cats


Hierarchical Clustering Disadvantages

•  Computationally inefficient
  –  Similarity between two documents = O(V)
  –  Requires similarity between all pairs of documents = O(VN^2)
  –  Then requires similarity between the most recent cluster and all existing clusters, naïvely O(N^3)
     •  O(N^2 log N) with a little cleverness

K-Nearest Neighbor Clustering

•  K-means clustering partitions items into clusters

•  Hierarchical clustering creates nested clusters

•  K-nearest neighbor clustering forms one cluster per item
  –  The cluster for item j consists of j and j’s K nearest neighbors
  –  Clusters now overlap
  –  Some things don’t get clustered
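A sketch of this procedure (illustrative only; the dist function is a parameter, e.g. one of the measures defined earlier):

```python
def knn_clusters(docs, K, dist):
    """One overlapping cluster per item: the item plus its K nearest
    neighbors under the given distance function."""
    clusters = []
    for j, d in enumerate(docs):
        neighbors = sorted((i for i in range(len(docs)) if i != j),
                           key=lambda i: dist(d, docs[i]))[:K]
        clusters.append([j] + neighbors)   # cluster for item j
    return clusters
```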


5-Nearest Neighbor Clustering

(Figure: points labeled A, B, C, and D in two dimensions, showing the overlapping clusters produced by 5-nearest-neighbor clustering.)

Evaluating Clustering

•  Clustering will never be 100% accurate
  –  Documents will be placed in clusters they don’t belong in
  –  Documents will be excluded from clusters they should be part of
  –  A natural consequence of using term statistics to represent the information contained in documents

•  Like retrieval and classification, clustering effectiveness must be evaluated

•  Evaluating clustering is challenging, since it is an unsupervised learning task


Evaluating Clustering

•  If labels exist, can use standard IR metrics, such as precision and recall
  –  In this case we are evaluating the ability of our algorithm to discover the “true” latent information

•  This only works if you have some way to “match” clusters to classes

•  What if there are fewer or more clusters than classes?

            Class A   Class B   Class C   Class D
Cluster 1     A1        B1        C1        D1
Cluster 2     A2        B2        C2        D2
Cluster 3     A3        B3        C3        D3
Cluster 4     A4        B4        C4        D4

Evaluating Clusters

•  “Purity”: the ratio of the number of documents from the dominant class in cluster C to the size of C
  –  Ci is a cluster; Kj is a class

•  Not such a great measure
  –  Does not take into account coherence of the class
  –  Optimized by making N clusters, one for each document
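In symbols (the slide’s formula did not survive extraction; this matches the definition above):

```latex
% Purity of cluster C_i: the fraction of its documents belonging to
% its dominant class, maximizing over classes K_j.
\mathrm{purity}(C_i) = \frac{1}{|C_i|} \max_{j} \left| C_i \cap K_j \right|
```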


Evaluating Clusters

•  With no labeled data, evaluation is even more difficult

•  Best approach:
  –  Evaluate the system that the clustering is part of
  –  E.g., if clustering is used to aid retrieval, evaluate the cluster-aided retrieval
  –  More on Wednesday