clustering and classification methods: a brief introduction€¦ · Ø divisive hierarchical...

53
Clustering and Classification Methods: A Brief Introduction EPSY 905: Multivariate Analysis Spring 2016 Lecture #14 (and last!) – May 4, 2016

Upload: others

Post on 14-Aug-2020

12 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ClusteringandClassificationMethods:ABriefIntroduction

EPSY905:MultivariateAnalysisSpring2016

Lecture#14(andlast!)–May4,2016

Page 2: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

Today’sLecture

• Classificationmethods:Usefulforwhenyoualreadyknowwhichgroupsexistbutyoudon’tknowwhichgrouptoassignanewobservations

• Clusteringmethods:Usefulforwhenyoudon’tknowhowmanygroupsexist

• Asbothmethodsrelyupondistances(moreorless)wewillstartwithadefinitionofdistances

Ø We’llalsogetabonusslideortwoaboutmultidimensional scaling—anothermethod thatusesdistances (butdoesn’t clusterorclassifydirectly)

Page 3: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

Today’sData

• WewillmakeuseoftheclassicFisher’sIrisDatatodemonstrateclusteringandclassificationmethods

• 150flowers;50fromthreespecies(Setosa,Versicolor,Virginica)

• Fourmeasurementsfromeachflower:Ø SepalwidthØ SepallengthØ PetalwidthØ Petallength

• ThesedataarebuiltintoRalready!

Page 4: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

DISTANCEMETRICS

Page 5: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

MeasuresofDistance

• Caremustbetakenwithchoosingthemetricbywhichsimilarityisquantified

• Importantconsiderationsinclude:Ø Thenatureofthevariables (e.g., discrete, continuous, binary)Ø Scalesofmeasurement (nominal,ordinal, interval, orratio)Ø Thenatureofthematterunderstudy

Page 6: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

EuclideanDistance• Euclideandistanceisafrequentchoiceofadistancemetric:

𝑑 𝒙, 𝒚 = 𝑥' − 𝑦' * + 𝑥* − 𝑦* * + ⋯+ 𝑥- − 𝑦-*

= 𝒙 − 𝒚 . 𝒙− 𝒚

• Distancescanbebetweenanythingyouwishtocluster:Ø Observations (distanceacrossallvariables)Ø Variables (distance acrossallobservations)

• TheEuclideandistanceisafrequentchoicebecauseitrepresentsanunderstandablemetric

Ø Itmaynotbethebestchoicealways

Page 7: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

EuclideanDistance?

• ImagineIwantedtoknowhowmanymilesitwasfrommyoldhouseinSacramentotoLombardStreetinSanFrancisco…

• Knowinghowfaritwasonastraightlinewouldnotdometoomuchgood,particularlywiththenumberofone-waystreetsthatexistinSanFrancisco

Page 8: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

• OtherpopulardistancemetricsincludetheMinkowski metric:

• Thekeytothismetricisthechoiceofm:– Ifm =2,thisprovidestheEuclideandistance– Ifm=1,thisprovidesthe“city-block”distance

OtherDistanceMetrics

mp

i

mii yxd

/1

1),( ⎥

⎤⎢⎣

⎡−= ∑

=

yx

Page 9: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

PreferredDistanceProperties

• Itisoftendesirabletouseadistancemetricthatmeetsthefollowingproperties:

Ø d(P,Q) = d(Q,P)Ø d(P,Q) > 0 if P ≠ QØ d(P,Q) = 0 if P = QØ d(P,Q) ≤ d(P,R) + d(R,Q)

• TheEuclideanandMinkowskimetricssatisfytheseproperties

Page 10: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

TriangleInequality• Thefourthproperty iscalled thetriangleinequality, whichoftengetsviolatedbynon-routinemeasures ofdistance

• Thisinequality canbeshownbythefollowing triangle(withlinesrepresenting Euclidean distances):

QP

R

d(P,Q)

d(P,R)d(R,Q)

d(P,R) d(R,Q)

d(P,Q)

Page 11: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

BinaryVariables• Inthecaseofbinary-valuedvariables(variablesthathavea0/1coding),manyotherdistancemetricsmaybedefined

• TheEuclideandistanceprovidesacountofthenumberofmismatchedobservations:

• Here,d(i,k)=2

• ThisissometimescalledtheHammingDistance

Page 12: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

OtherBinaryDistanceMeasures

• Thereareanumberofotherwaystodefinethedistancebetweenasetofbinaryvariables

• Mostofthesemeasuresreflectthevariedimportanceplacedondifferingcellsina2x2table

Page 13: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

GeneralDistanceMeasureProperties

• Useofmeasuresofdistancethataremonotonicintheirorderingofobjectdistanceswillprovideidenticalresultsfromclusteringheuristics

• Manytimesthiswillonlybeanissueifthedistancemeasureisforbinaryvariablesor thedistancemeasuredoesnotsatisfythetriangleinequality

Page 14: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

DistancesinR

Page 15: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

CLASSICALCLUSTERINGMETHODS(USINGDISTANCES)

Page 16: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ClusteringDefinitions• Clusteringobjects(orobservations)canprovidedetailregardingthenatureandstructureofdata

• ClusteringisdistinctfromclassificationinterminologyØ Classification pertainstoaknown numberofgroups,withtheobjectivebeingtoassignnewobservations tothesegroups

• Classicalmethodsofclusteranalysisisatechniquewherenoassumptionsaremadeconcerningthenumberofgroupsorthegroupstructure

Ø Mixturemodelsdomakeassumptions aboutthedata

Page 17: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

• Clusteringalgorithmsmakeuseofmeasuresofsimilarity(oralternatively,dissimilarity)todefineandgroupvariablesorobservations

• Clusteringpresentsahostoftechnicalproblems

• Forareasonablesizeddatasetwithn objects(eithervariablesorindividuals),thenumberofwaysofgroupingn objectsintokgroupsis:

• Forexample,thereareoverfourtrillionwaysthat25objectscanbeclusteredinto4groups– whichsolutionisbest?

nk

j

kj jjk

k ∑=−

⎟⎟⎠

⎞⎜⎜⎝

⎛−⎟

⎞⎜⎝

01

!1

Page 18: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

NumericalProblems• Intheory,onewaytofindthebestsolutionistotryeachpossiblegroupingofalloftheobjects– anoptimizationprocesscalledintegerprogramming

• Itisdifficult,ifnotimpossible,todosuchamethodgiventhestateoftoday’scomputers(althoughcomputersarecatchingupwithsuchproblems)

• Ratherthanusingsuchbrute-forcetypemethods,asetofheuristicshavebeendevelopedtoallowforfastclusteringofobjectsintogroups

• Suchmethodsarecalledheuristicsbecausetheydonotguaranteethatthesolutionwillbeoptimal(best),onlythatthesolutionwillbebetterthanmost

Page 19: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ClusteringHeuristicInputs

• Theinputsintoclusteringheuristicsareintheformofmeasuresofsimilaritiesordissimilarities

• Theresultoftheheuristicdependsinlargepartonthemeasureofsimilarity/dissimilarityusedbytheprocedure

Page 20: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ClusteringVariablesvs.ClusteringObservations

• Whenvariablesaretobeclustered,oftusedmeasuresofsimilarityincludecorrelationcoefficients(orsimilarmeasuresfornon-continuousvariables)

• Whenobservationsaretobeclustered,distancemetricsareoftenused

Page 21: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

HierarchicalClusteringMethods• Becauseofthelargesizeofpossibleclusteringsolutions,searchingthroughallcombinationsisnotfeasible

• Hierarchicalclusteringtechniquesproceedbytakingasetofobjectsandgroupingasetatatime

• Twotypesofhierarchicalclusteringmethodsexist:Ø Agglomerative hierarchicalmethodsØ Divisive hierarchicalmethods

Page 22: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

AgglomerativeClusteringMethods• Agglomerativeclusteringmethodsstartfirstwiththeindividualobjects

• Initially,eachobjectisit’sowncluster

• Themostsimilar objectsarethengroupedtogetherintoasinglecluster(withtwoobjects)

We will find that what we mean by similar will change depending on the method

Page 23: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

AgglomerativeClusteringMethods• Theremainingstepsinvolvemergingtheclustersaccordingthesimilarityordissimilarityoftheobjectswithintheclustertothoseoutsideofthecluster

• Themethodconcludeswhenallobjectsarepartofasinglecluster

Page 24: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

DivisiveClusteringMethods• Divisivehierarchicalmethodsworksintheoppositedirection–beginningwithasingle,n-objectsizedcluster

• Thelargeclusteristhendividedintotwosubgroupswheretheobjectsinopposinggroupsarerelativelydistantfromeachother

• Theprocesscontinuessimilarlyuntilthereareasmanyclustersasthereareobjects

Page 25: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ToSummarize• Soyoucanseethatwehavethisideaofsteps

Ø Ateachsteptwoclusters combine toformone(Agglomerative)

OR…

Ø Ateachstepaclusterisdivided intotwonewclusters(Divisive)

Page 26: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

MethodsforViewingClusters• Asyoucouldimagine,whenweconsiderthemethodsforhierarchicalclustering,therearealargenumberofclustersthatareformedsequentially

• Oneofthemostfrequentlyusedtoolstoviewtheclusters(andlevelatwhichtheywereformed)isthedendrogram

• Adendrogram isagraphthatdescribesthedifferinghierarchicalclusters,andthedistanceatwhicheachisformed

Page 27: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ExampleDataSet#1• Todemonstrateseveralofthehierarchicalclusteringmethods,anexampledatasetisused

• Datacomefroma1991studybytheeconomicresearchdepartmentoftheunionbankofSwitzerlandrepresentingeconomicconditionsof48citiesaroundtheworld

• Threevariableswerecollected:– Average workinghoursfor12occupations– Priceof112goodsandservicesexcluding rent– Indexofnethourly earningsin12occupations

Page 28: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

This is an example of an agglomerate cluster where cities start off in there own group and then are combined

For example, Here Amsterdam and Brussels were combined to form a group

Notice that we do not specify groups, but if we know how many we want…we simply go to the step where there are that many groups

The Cities

Where the lines connect is when those two previous groups were joined

Page 29: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

Similarity?• So,wementionedthat:

Ø Themostsimilar objectsarethengrouped togetherintoasinglecluster(withtwoobjects)

• SothenextquestionishowdowemeasuresimilaritybetweenclustersØ Morespecifically,howdoweredefine itwhenaclustercontainsacombinationofoldclusters

• Wefindthatthereareseveralwaystodefinesimilarandeachwaydefinesanewmethodofclustering

Page 30: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

AgglomerativeMethods• NextwediscussseveraldifferentwaytocompleteAgglomerativehierarchicalclustering:

Ø SingleLinkageØ Complete LinkageØ Average LinkageØ CentroidØ MedianØ WardMethod

Page 31: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ExampleDistanceMatrix

• Theexamplewillbebasedonthedistancematrixbelow

1 2 3 4 5

1 0 9 3 6 11

2 9 0 7 5 10

3 3 7 0 9 2

4 6 5 9 0 8

5 11 10 2 8 0

Page 32: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

SingleLinkage

• Thesinglelinkagemethodofclusteringinvolvescombiningclustersbyfindingthe“nearestneighbor”– theclusterclosesttoanygivenobservationwithinthecurrentcluster

Cluster 1

1

2

Cluster 2

34

5

Notice that this is equal to the minimum distance of any observation in cluster one with any observation in cluster 3

Page 33: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

• Sothedistancebetweenanytwoclustersis:

{ })(min),( jidBAD y,y= For all yi in A and yj in B

1

2

34

5

Notice any other distance is longer

So how would we do this using our distance matrix?

Page 34: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

SingleLinkageExample

• Thefirststepintheprocessistodeterminethetwoelementswiththesmallestdistance,andcombinethemintoasinglecluster.

• Here,thetwoobjectsthataremostsimilarareobjects3and5…wewillnowcombinetheseintoanewcluster,andcomputethedistancefromthatclustertotheremainingclusters(objects)viathesinglelinkagerule.

1 2 3 4 5

1 0 9 3 6 11

2 9 0 7 5 10

3 3 7 0 9 2

4 6 5 9 0 8

5 11 10 2 8 0

Page 35: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

SingleLinkageExample

• Theshadedrows/columns aretheportionsofthetablewiththedistances fromanobjecttoobject3orobject5.

• Thedistancefromthe(35)clustertotheremainingobjectsisgivenbelow:

d(35),1 =min{d31,d51}=min{3,11}=3d(35),2 =min{d32,d52}=min{7,10}=7d(35),4 =min{d34,d54}=min{9,8}=8

1 2 3 4 5

1 0 9 3 6 11

2 9 0 7 5 10

3 3 7 0 9 2

4 6 5 9 0 8

5 11 10 2 8 0

These are the distances of 3 and 5 with 1. Our rule says that the distance of our new cluster with 1 is equal to the minimum of theses two values…3

These are the distances of 3 and 5 with 2. Our rule says that the distance of our new cluster with 2 is equal to the minimum of theses two values…7

In equation form our new distances are

Page 36: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

SingleLinkageExample

• Usingthedistancevalues,wenowconsolidateourtablesothat(35)isnowasinglerow/column

• Thedistancefromthe(35)clustertotheremainingobjectsisgivenbelow:

d(35)1 =min{d31,d51}=min{3,11}=3d(35)2 =min{d32,d52}=min{7,10}=7d(35)4 =min{d34,d54}=min{9,8}=8

(35) 1 2 4

(35) 0

1 3 0

2 7 9 0

4 8 6 5 0

Page 37: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

SingleLinkageExample

• Wenowrepeattheprocess,byfinding thesmallestdistancebetween within thesetofremaining clusters

• Thesmallestdistanceisbetween object1andcluster(35)

• Therefore, object1joinscluster(35),creatingcluster(135)

(35) 1 2 4

(35) 0 3 7 8

1 3 0 9 6

2 7 9 0 5

4 8 6 5 0

The distance from cluster (135) to the other clusters is then computed:

d(135)2 = min{d(35)2,d12} = min{7,9} = 7d(135)4 = min{d(35)4,d14} = min{8,6} = 6

Page 38: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

SingleLinkageExample

• Usingthedistancevalues,wenowconsolidateourtablesothat(135)isnowasinglerow/column

• Thedistancefromthe(135)clustertotheremainingobjectsisgivenbelow:

(135) 2 4

(135) 0

2 7 0

4 6 5 0

d(135)2 = min{d(35)2,d12} = min{7,9} = 7d(135)4 = min{d(35)4,d14} = min{8,6} = 6

Page 39: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

SingleLinkageExample

• Wenowrepeattheprocess,byfindingthesmallestdistancebetweenwithinthesetofremainingclusters

• Thesmallestdistanceisbetweenobject2andobject4

• Thesetwoobjectswillbejoinedtofromcluster(24)

• Thedistancefrom(24)to(135)isthencomputed

d(135)(24) =min{d(135)2,d(135)4}=min{7,6}=6

• Thefinalclusterisformed(12345)withadistanceof6

(135) 2 4

(135) 0

2 7 0

4 6 5 0

Page 40: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

TheDendogram

For example, here is where 3 and 5 joined

Page 41: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

CompleteLinkage

• Thecompletelinkagemethodofclusteringinvolvescombiningclustersbyfindingthe“farthestneighbor”–theclusterfarthesttoanygivenobservationwithinthecurrentcluster

• Thisensuresthatallobjectsinaclusterarewithinsomemaximumdistanceofeachother

Cluster 1

1

2

Cluster 2

3 4

5

Page 42: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ClusteringinR

Page 43: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

Dendrogram ofRExamplewithFlowers

Page 44: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

OtherClassicalClusteringMethods

• K-meansclusteringwillpartitiondataintomorethanonegroup…butyouhavetospecifyhowmanygroups

• ThemethodusestheMahalanobisdistanceofeachobservationtothegroupmeananditerativelyswitchesgroupmembershipuntilnooneswitches

• Rhasabuilt-inK-meansmethod

Page 45: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

FINITEMIXTUREMODELS

Page 46: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

FiniteMixtureModels:ClusteringwithLikelihoods• Finitemixturemodelsaremodelsthatattempttodeterminethenumberofclusters(calledclasses)inasetofdata

Ø Finite:countablymanyclassesØ Mixture:distributionofthedatacomesfromamixtureofobservationsfromdifferentclasses

• FMMsuselikelihoodstodetermineclassmembershipØ Alikelihood(pdf)isasimilarity(inversedistance)Ø Higherlikelihoodà observationismorelikeclassaverageàmorelikelyobservationcomesfromclass

Ø Lowerlikelihoodà observationislesslikeclassaverageà lesslikelyobservationcomesfromclass

• FMMshaveveryspecificnames:Ø Latentclassanalysis:FMMwheredataarebinaryandindependentwithinclassØ Factormixturemodel:FMMwheredataarecontinuousandhaveafactorstructurethatyieldsawithin-classcovariancematrix

Page 47: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

FMMMarginalLikelihood Function

• ThegeneralFMMmarginallikelihoodfunctionis:

𝑓 𝒙0 =1𝜂3𝑓(𝒙0|𝑐)8

39'• 𝑓(𝒙0|𝑐) iswithinclasslikelihoodfunction(pdf)

Ø Eachclasshasitsowndistribution

• 𝜂3 istheprobabilityanyobservationcomesfromclasscØ The“mixing”proportion

• Thesumiswherethemixturegetsitsname:themarginaldatalikelihoodisablend(mixture)ofeachclass’likelihood

Page 48: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

FMMsinR:Gaussian(MVN)Mixtureswithmclust

• Rhasseveralpackagesformixtures:Iwillonlyshowyouone– Mclust

• Wewillattempttodeterminethenumberofclassesinourflowerdatausingonlythefourmeasurements

• First,weletmclust runanumberofmodelstodeterminethe“bestfitting”(lowestBICvalue)

Page 49: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

PlotofBICsfrommclust

Page 50: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ExampleOutputfromOneSolution

Page 51: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

ProblemswithFMMs

• Exploratory(findingnumberofclasses)usesofmixturesareoftensuspectasnearlyanythingcanbringaboutadditional(spurrious/false)classes

• Forinstance:ifourdatawerenotMVNwithinclass,wewouldlikelyseemoreclassesthanweneed

• Findingtherightmodelisoftendifficult

Page 52: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

WRAPPINGUP

Page 53: Clustering and Classification Methods: A Brief Introduction€¦ · Ø Divisive hierarchical methods. Agglomerative Clustering Methods ... 3 3 7 0 9 2 4 6 5 9 0 8 5 11 10 2 8 0. Single

WrappingUp

• Thisclassonlyscratchedthesurfaceofwhatcanbedonetoclusterorclassifydata

• Clustering:DeterminingthenumberofclustersindataØ Agglomerative/Hierarchical clusteringØ K-MeansØ Finitemixturemodels

• Classification:PuttingobservationsintoknownclassesØ Classicalapproach:Discriminantanalysis(notshowntoday;lda()Rfunction)Ø Modernapproach:FMMs

• Myclassnextsemester(DiagnosticTesting)describesconfirmatoryusesofFMMs