Document Clustering - ir.cis.udel.edu/~carteret/cisc689/slides/lecture16.pdf
TRANSCRIPT
4/21/09
Document Clustering
CISC 489/689-010, Lecture #17, Monday, April 20th
Ben Carterette
Classification Review
• Items (documents, web pages, emails) are represented with features
• Some items are assigned a class from a fixed set
• Classification goal: use known class assignments to "learn" a general function f(x) for classifying new instances
• Naïve Bayes classifier: assign the class c maximizing P(c) · ∏_i P(w_i | c)
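As a reminder of what that decision rule looks like in code, here is a minimal multinomial Naïve Bayes sketch with add-one smoothing (the function names and smoothing choice are mine, not from the slides):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors P(c) and per-class term counts for P(w|c)."""
    prior = Counter(labels)
    term_counts = defaultdict(Counter)   # class -> term frequencies
    vocab = set()
    for doc, c in zip(docs, labels):
        term_counts[c].update(doc)
        vocab.update(doc)
    return prior, term_counts, vocab

def classify_nb(doc, prior, term_counts, vocab):
    """Return argmax_c log P(c) + sum_w log P(w|c), with add-one smoothing."""
    n = sum(prior.values())
    best_c, best_score = None, float("-inf")
    for c in prior:
        total = sum(term_counts[c].values())
        score = math.log(prior[c] / n)
        for w in doc:
            score += math.log((term_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```

Working in log space avoids underflow when multiplying many small probabilities.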
Clustering
• A set of algorithms that attempt to find latent (hidden) structure in a set of items
• Goal is to identify groups (clusters) of similar items
  – Two items in the same group should be similar to one another
  – An item in one group should be dissimilar to an item in another group
Clustering Example
• Suppose I gave you the shape, color, vitamin C content, and price of various fruits and asked you to cluster them
  – What criteria would you use?
  – How would you define similarity?
• Clustering is very sensitive to how items are represented and how similarity is defined!
Clustering in Two Dimensions
How would you cluster these points?
Classification vs Clustering
• Classification is supervised
  – You are given a fixed set of classes
  – You are given class labels for certain instances
  – This is data you can use to learn the classification function
• Clustering is unsupervised
  – You are not given any information about how documents should be grouped
  – You don't even know how many groups there should be
  – There is no training data to learn from
• One way to think of it: learning vs discovery
Clustering in IR
• Cluster hypothesis:
  – "Closely associated documents tend to be relevant to the same requests" – van Rijsbergen '79
• Document clusters may capture relevance better than individual documents
• Clusters may capture "subtopics"
Cluster-Based Search
Yahoo! Hierarchy
[Figure omitted: a portion of the Yahoo! directory under www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as agronomy, forestry, dairy, crops, botany, evolution, cell, magnetism, relativity, AI, HCI, courses, craft, and missions]
Not based on clustering approaches, but one possible use of clustering.
Example from "Introduction to IR" slides by Hinrich Schütze
Clustering Algorithms
• General outline of clustering algorithms
  1. Decide how items will be represented (e.g., feature vectors)
  2. Define similarity measure between pairs or groups of items (e.g., cosine similarity)
  3. Determine what makes a "good" clustering
  4. Iteratively construct clusters that are increasingly "good"
  5. Stop after a local/global optimum clustering is found
• Steps 3 and 4 differ the most across algorithms
Item Representation
• Typical representation for documents in IR:
  – "Bag of words" – a vector of terms appearing in the document with associated weights
  – N-grams
  – etc.
• Any representation used in retrieval can (theoretically) be used in clustering or classification
  – Though specialized representations may be better for particular tasks
Item Similarity
• Cluster hypothesis suggests that document similarity should be based on information content
  – Ideally semantic content, but we have already seen how hard that is
• Instead, use the same idea as in query-based retrieval
  – The score of a document to a query is based on how similar they are in the words they contain
    • Cosine angle between vectors; P(R|Q,D); P(Q|D)
  – The similarity of two documents will be based on how similar they are in the words they contain
Document Similarity
• Cosine similarity
• Euclidean distance
• Manhattan distance
[Figures omitted: each measure illustrated on two document vectors D1 and D2]
"similarity" vs "distance": in practice, you can use either
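The three measures can be sketched over sparse term-weight vectors; the dict-of-weights representation here is my assumption, not the slides':

```python
import math

def cosine(d1, d2):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def euclidean(d1, d2):
    """Straight-line distance over the union of terms."""
    terms = set(d1) | set(d2)
    return math.sqrt(sum((d1.get(t, 0.0) - d2.get(t, 0.0)) ** 2 for t in terms))

def manhattan(d1, d2):
    """Sum of absolute per-term differences."""
    terms = set(d1) | set(d2)
    return sum(abs(d1.get(t, 0.0) - d2.get(t, 0.0)) for t in terms)
```

Cosine is a similarity (higher = closer); the other two are distances (lower = closer), which is why either convention works in the algorithms that follow.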
What Makes a Good Cluster?
• Large vs small?
  – Is it OK to have a cluster with one item?
  – Is it OK to have a cluster with 10,000 items?
• Similarity between items?
  – Is it OK for things in a cluster to be very far apart, as long as they are closer to each other than to things in other clusters?
  – Is it OK for things to be so close together that other similar things are excluded from the cluster?
• Overlapping vs non-overlapping?
  – Is it OK for two clusters to contain some items in common?
  – Should clusters "nest" within one another?
Example Approaches
• "Hard" clustering
  – Every item is in only one cluster
• "Soft" clustering
  – Items can belong to more than one cluster
  – Nested hierarchy: item belongs to a cluster, as well as the cluster's parent cluster, and so on
  – Non-nested: item belongs to two separate clusters
    • E.g. a document about jaguar cats riding in Jaguar cars might belong to the "animal" cluster and the "car" cluster
Example Approaches
• Flat clustering:
  – No overlap: every item in exactly one cluster
  – K clusters total
  – Start with random groups, then refine them until they are "good"
• Hierarchical clustering:
  – Clusters are nested: a cluster can be made up of two or more smaller clusters
  – No fixed number
  – Start with one group and split it until there are good clusters
  – Or start with N groups and agglomerate them until there are good clusters
Flat Clustering
• Goal: partition N documents into K clusters
• Given: N document feature vectors, a number K
• Optimal algorithm:
  – Try every possible clustering and take whichever one is the "best"
  – Computation time: O(K^N) (there are roughly K^N ways to assign N documents to K clusters)
• Heuristic approach:
  – Split documents into K clusters randomly
  – Move documents from one cluster to another until the clusters seem "good"
K-Means Clustering
• K-means is a partitioning heuristic
• Documents are represented as vectors
• Clusters are represented as a centroid vector
• Basic algorithm:
  – Step 0: Choose K docs to be initial cluster centroids
  – Step 1: Assign points to closest centroid
  – Step 2: Recompute cluster centroids
  – Step 3: Go to Step 1
K-Means Clustering Algorithm
Input: N documents, a number K
• A[1], A[2], …, A[N] := 0
• C1, C2, …, CK := initial cluster assignment (pick K docs)
• do
  – changed = false
  – for each document Di, i = 1 to N
    • k = argmin_k dist(Di, Ck)  (equivalently, k = argmax_k sim(Di, Ck))
    • if A[i] != k then
      – A[i] = k
      – changed = true
  – if changed then C1, C2, …, CK := cluster centroids
• until changed is false
• return A[1..N]
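The pseudocode above can be sketched in Python; the function and parameter names are mine, and `dist` is any distance function over vectors:

```python
import random

def kmeans(docs, k, dist, max_iters=100, seed=0):
    """Flat clustering by the K-means heuristic: assign each document to
    its closest centroid, recompute centroids, repeat until stable."""
    rng = random.Random(seed)
    # Step 0: pick K documents as initial centroids
    centroids = [list(d) for d in rng.sample(docs, k)]
    assign = [0] * len(docs)
    for _ in range(max_iters):
        changed = False
        # Step 1: assign each document to its closest centroid
        for i, d in enumerate(docs):
            nearest = min(range(k), key=lambda c: dist(d, centroids[c]))
            if assign[i] != nearest:
                assign[i] = nearest
                changed = True
        if not changed:
            break
        # Step 2: recompute each centroid as the mean of its members
        for c in range(k):
            members = [docs[i] for i in range(len(docs)) if assign[i] == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign
```

Squared Euclidean distance works as `dist` since it induces the same argmin as Euclidean distance, without the square root.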
K-Means Decisions
• K – number of clusters
  – K=2? K=10? K=500?
• Cluster initialization
  – Random initialization often used
  – A bad initial assignment can result in bad clusters
• Distance measure
  – Cosine similarity most common
  – Euclidean distance, Manhattan distance, manifold distances
• Stopping condition
  – Until no documents have changed clusters
  – Until centroids do not change
  – Fixed number of iterations
K-Means Advantages
• Computationally efficient
  – Distance between two documents = O(V)
  – Distance of each doc to each centroid = O(KNV)
  – Calculating centroids = O(NV)
  – For m iterations, O(m(KNV + NV)) = O(mKNV)
• Tends to converge quickly (m is relatively small)
• Easy to implement
K-Means Disadvantages
• What should K be?
• Clusters have fixed geometric shape
  – Spherical
  – Very sensitive to dimensions and weights
• No notion of outliers
  – A document that's far away from everything will either be in a cluster on its own or in some very wide (geometrically speaking) cluster
Hierarchical Clustering
• Goal: construct a hierarchy of clusters
  – The top level of the hierarchy consists of a single cluster with all items in it
  – The bottom level of the hierarchy consists of N (# items) singleton clusters
• Two types of hierarchical clustering
  – Divisive ("top down")
  – Agglomerative ("bottom up")
• Hierarchy can be visualized as a dendrogram
Example Dendrogram
[Figure omitted: a dendrogram over items A–M]
Obtain clusters by cutting the dendrogram at some threshold. The clusters are the connected components.
Divisive and Agglomerative Hierarchical Clustering
• Divisive
  – Start with a single cluster consisting of all of the items
  – Until only singleton clusters exist…
    • Divide an existing cluster into two new clusters
• Agglomerative
  – Start with N (# items) singleton clusters
  – Until a single cluster exists…
    • Combine two existing clusters into a new cluster
• How do we know how to divide or combine clusters?
  – Define a division or combination cost
  – Perform the division or combination with the lowest cost
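The agglomerative procedure above can be sketched directly; this is a minimal single-linkage version (names are mine; `dist` is any item-level distance, and `target` stops the merging early instead of running down to one cluster):

```python
def agglomerate(items, dist, target=1):
    """Bottom-up clustering: start with singleton clusters, repeatedly merge
    the pair with the lowest single-linkage cost (minimum distance between
    any cross-pair of items), until `target` clusters remain."""
    clusters = [[x] for x in items]
    while len(clusters) > target:
        best = None  # (cost, i, j) of the cheapest merge found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recomputing all pairwise costs each round is the naïve O(N^3) approach discussed later in these slides; caching the cost matrix is the usual speedup.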
Agglomerative Hierarchical Clustering
Divisive Hierarchical Clustering

Clustering Costs
• Similarity measured between two different clusters
• Single linkage
• Complete linkage
• Average linkage
• Average group linkage
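The slide's cost formulas did not survive extraction; the standard definitions, for clusters A and B with item distance d(·,·) and centroid μ (average group linkage in particular varies slightly across texts, so take the last line as one common convention):

```latex
\begin{align*}
\text{single linkage: } & \mathrm{cost}(A,B) = \min_{a \in A,\, b \in B} d(a,b) \\
\text{complete linkage: } & \mathrm{cost}(A,B) = \max_{a \in A,\, b \in B} d(a,b) \\
\text{average linkage: } & \mathrm{cost}(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b) \\
\text{average group linkage: } & \mathrm{cost}(A,B) = d(\mu_A, \mu_B), \qquad \mu_C = \frac{1}{|C|} \sum_{x \in C} x
\end{align*}
```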
Clustering Strategies
[Figure omitted: four panels illustrating single linkage, complete linkage, average linkage, and average group linkage]
Single Linkage
• Similarity between two clusters = minimum distance between all pairs of documents
  – (Or maximum similarity)
• After merging two clusters, the merged cluster's distance to any other cluster is the smaller of the two original distances
• Tends to produce "stringier" hierarchies
• Example: [figure omitted]
Complete Linkage
• Similarity between two clusters = maximum distance between all pairs of documents
  – (Or minimum similarity)
• After merging two clusters, the merged cluster's distance to any other cluster is the larger of the two original distances
• Tends to produce more "spherical" clusters
• Example: [figure omitted]
Hierarchical Clustering Advantages
• Flexibility
  – No fixed number of clusters
  – Can change threshold to get different clusters
    • Lower threshold: more specific clusters
    • Higher threshold: broader clusters
  – Can change cost function to get different clusters
• Hierarchical structure may be meaningful
  – E.g. articles about jaguar cats agglomerate together, articles about tigers agglomerate, then both agglomerate to articles about big cats
Hierarchical Clustering Disadvantages
• Computationally inefficient
  – Similarity between two documents = O(V)
  – Requires similarity between all pairs of documents = O(VN^2)
  – Then requires similarity between most recent cluster and all existing clusters, naïvely O(N^3)
    • O(N^2 log N) with a little cleverness
K-Nearest Neighbor Clustering
• K-means clustering partitions items into clusters
• Hierarchical clustering creates nested clusters
• K-nearest neighbor clustering forms one cluster per item
  – The cluster for item j consists of j and j's K nearest neighbors
  – Clusters now overlap
  – Some things don't get clustered
5-Nearest Neighbor Clustering
[Figure omitted: points labeled A, B, C, and D grouped into overlapping 5-nearest-neighbor clusters]
Evaluating Clustering
• Clustering will never be 100% accurate
  – Documents will be placed in clusters they don't belong in
  – Documents will be excluded from clusters they should be part of
  – A natural consequence of using term statistics to represent the information contained in documents
• Like retrieval and classification, clustering effectiveness must be evaluated
• Evaluating clustering is challenging, since it is an unsupervised learning task
Evaluating Clustering
• If labels exist, can use standard IR metrics, such as precision and recall
  – In this case we are evaluating the ability of our algorithm to discover the "true" latent information
• This only works if you have some way to "match" clusters to classes
• What if there are fewer or more clusters than classes?
           Class A   Class B   Class C   Class D
Cluster 1    A1        B1        C1        D1
Cluster 2    A2        B2        C2        D2
Cluster 3    A3        B3        C3        D3
Cluster 4    A4        B4        C4        D4
Evaluating Clusters
• "Purity": the ratio between the number of documents from the dominant class in Ci to the size of Ci
  – purity(Ci) = max_j |Ci ∩ Kj| / |Ci|
  – Ci is a cluster; Kj is a class
• Not such a great measure
  – Does not take into account coherence of the class
  – Optimized by making N clusters, one for each document
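A small sketch of purity under this definition (the helper names and the weighted-average aggregate are mine):

```python
from collections import Counter

def purity(cluster_labels):
    """cluster_labels: true-class labels of the docs in one cluster.
    Returns (size of dominant class) / (size of cluster)."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def total_purity(clusters):
    """Size-weighted average purity over all clusters (each a label list)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) * purity(c) for c in clusters) / n
```

The degenerate optimum mentioned above is easy to see here: with one singleton cluster per document, every cluster has purity 1, so total purity is 1 regardless of how bad the clustering is.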