stats 170a: project in data science exploratory …...stats 170a: project in data science...

Stats 170A: Project in Data Science

Exploratory Data Analysis: Clustering Algorithms

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine

Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018

Assignment 5

Refer to the Wiki page

Due noon on Monday February 12th to EEE dropbox

Note: due before class (by 2pm)

Questions?


What is Exploratory Data Analysis?

• EDA = {visualization, clustering, dimension reduction, …}

• For small numbers of variables, EDA = visualization

• For large numbers of variables, we need to be cleverer
  – Clustering, dimension reduction, embedding algorithms
  – These are techniques that essentially reduce high-dimensional data to something we can look at

• Today’s lecture:
  – Finish up visualization
  – Overview of clustering algorithms


Tufte’s Principles of Visualization

Graphical excellence…

– is the well-designed presentation of interesting data
– a matter of substance, of statistics, and of design

– consists of complex ideas communicated with clarity, precision and efficiency

– is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space

– requires telling the truth about the data


Different Ways of Presenting the Same Data

From Karl Broman, via www.cs.princeton.edu/


Principle of Proportional Ink (or How to Lie with Visualization)


Potentially Misleading Scales on the X-axis


Example: Visualization of Napoleon’s 1812 March

Illustrates size of army, direction, location, temperature, date… all on one chart


Data Journalism

From New York Times, Feb 2, 2018


Exploratory Data Analysis: Clustering


Example: Clustering Vectors in a 2-Dimensional Space

[Figure: scatter plot of points; axes x1 and x2]

Each point (or 2d vector) represents a document


Example: Possible Clusters

[Figure: scatter plot (axes x1, x2) with the points grouped into Cluster 1 and Cluster 2]


Example: How many Clusters?

[Figure: scatter plot (axes x1, x2) with the points grouped into Cluster 1, Cluster 2, and Cluster 3]


Cluster Structure in Real-World Data

[Figure: scatter plot with signal T on the horizontal axis and signal C on the vertical axis]

≈1500 subjects

Two measurements per subject

Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine


Cluster Structure in Real-World Data

[Figure: scatter plot of signal C against signal T with three groups labeled CC, CT, and TT]



Issues in Clustering

• Representation
  – How do we represent our examples as data vectors?

• Distance
  – How do we want to define distance between vectors?

• Algorithm
  – What type of algorithm do we want to use to search for clusters?
  – What is the time and space complexity of the algorithm?

• Number of Clusters
  – How many clusters do we want?

No “right” answer to these questions in general… it depends on the application


Cluster Analysis vs Classification

Cluster analysis:
• Data are unlabeled
• The number of clusters is unknown
• “Unsupervised” learning
• Goal: find unknown structures

Classification:
• The labels for training data are known
• The number of classes is known
• “Supervised” learning
• Goal: allocate new observations, whose labels are unknown, to one of the known classes


Clustering: The K-Means Algorithm


Notation

N documents
Represent each document as a vector of T terms (e.g., counts or tf-idf)

The vector for the ith document is:

$$x_i = (x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{iT}), \qquad i = 1, \ldots, N$$

Document-Term matrix
• $x_{ij}$ is the entry in the ith row, jth column
• columns correspond to terms
• rows correspond to documents

We can think of our documents as being in a T-dimensional space, with clusters as “clouds of points”
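As a rough illustration of this notation, the sketch below builds a small document-term matrix of raw term counts in plain Python. The three documents, the whitespace tokenization, and the names `docs`, `vocab`, and `count_vector` are all assumptions made for the example; they are not from the lecture.

```python
from collections import Counter

# Toy corpus: these three documents are made up for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary: one column per distinct term, in sorted order (T terms).
vocab = sorted({term for doc in docs for term in doc.split()})

def count_vector(doc, vocab):
    """Return the row x_i: the count of each vocabulary term in one document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

# N x T document-term matrix: rows are documents, columns are terms.
X = [count_vector(doc, vocab) for doc in docs]

N, T = len(X), len(vocab)
print(N, T)
print(vocab)
for row in X:
    print(row)
```

Each row of `X` is one point in the T-dimensional space described above. A tf-idf weighting would replace the raw counts but keep the same matrix shape.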


The K-Means Clustering Algorithm

Input:
N vectors $x_1, \ldots, x_N$ of dimension D
K = number of clusters (K > 1)

Output:
– K cluster centers, $c_1, \ldots, c_K$; each center is a vector of dimension D
– (Equivalently) A list of cluster assignments (values 1 to K) for each of the N input vectors

Note: In K-means each input vector x is assigned to one and only one cluster k, or cluster center $c_k$

The K-means algorithm partitions the N data vectors into K disjoint groups


Example of K-Means Output with 2 Clusters

[Figure: scatter plot (axes x1, x2) showing Cluster 1 and Cluster 2 with their centers c1 and c2]

Blue circles are examples of documents
Red circles are examples of cluster centers


Squared Error Distance

Consider two vectors each with T components (i.e., dimension T)

$$x = (x_1, x_2, \ldots, x_T), \qquad y = (y_1, y_2, \ldots, y_T)$$

A common distance metric is squared error distance:

$$d_E(x, y) = \sum_{j=1}^{T} (x_j - y_j)^2$$

In two dimensions the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
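The definition translates almost directly into code. This is a minimal sketch; the function name `squared_error_distance` and the two example vectors are chosen here for illustration.

```python
import math

def squared_error_distance(x, y):
    """Sum over components j of (x_j - y_j)^2."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

x = [0.0, 0.0]
y = [3.0, 4.0]
d = squared_error_distance(x, y)
print(d)             # 25.0
print(math.sqrt(d))  # 5.0, the Euclidean distance
```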


Squared Errors and Cluster Centers

• Squared error (distance) between a data point x and a cluster center c:

$$d[x, c] = \sum_{j=1}^{D} (x_j - c_j)^2$$

The index j runs over the D components/dimensions of the vectors

• Total squared error between a cluster center $c_k$ and all $N_k$ points assigned to that cluster:

$$S_k = \sum_{i} d[x_i, c_k]$$

This sum is over vectors, i.e., over the $N_k$ points assigned to cluster k, with the distance d defined as the squared error (squared Euclidean) distance above

• Total squared error summed across the K clusters:

$$SSE = \sum_{k} S_k$$

This sum is over the K clusters

[Figure: Cluster 1 with its center c1]


K-means Objective Function

• K-means: minimize the total squared error, i.e., find the K cluster centers $c_k$, and assignments, that minimize

$$SSE = \sum_{k} S_k = \sum_{k} \Big( \sum_{i} d[x_i, c_k] \Big)$$

• K-means seeks to minimize SSE, i.e., find the cluster centers such that the sum-squared-error is smallest
  – will place cluster centers strategically to “cover” data
  – similar to data compression (in fact used in data compression algorithms)
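To make the nested sums concrete, here is a small sketch that computes each per-cluster total and the overall SSE for a fixed set of centers and assignments. The data, the centers, and the function name `sse` are made up for illustration, and clusters are indexed from 0 rather than 1.

```python
def squared_error_distance(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def sse(points, centers, assignments):
    """Total squared error: sum over clusters k of S_k."""
    per_cluster = [0.0] * len(centers)   # S_k for each cluster k
    for x, k in zip(points, assignments):
        per_cluster[k] += squared_error_distance(x, centers[k])
    return sum(per_cluster), per_cluster

# Made-up toy data: two tight groups in 2 dimensions.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
assignments = [0, 0, 1, 1]   # clusters indexed from 0 here, not 1

total, per_cluster = sse(points, centers, assignments)
print(per_cluster)  # [2.0, 2.0]
print(total)        # 4.0
```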


K-Means Algorithm

• Random initialization
  – Select the initial K centers randomly from the N input vectors
  – Or, assign each of the N vectors randomly to one of the K clusters

• Iterate:
  – Assignment Step:
    • Assign each of the N input vectors to its closest mean

  – Update the Mean-Vectors (K of them)
    • Compute updated centers: the average value of the vectors assigned to k

$$\text{New } c_k = \frac{1}{N_k} \sum_{i} x_i$$

      (the sum is over the $N_k$ points assigned to cluster k)

• Convergence:
  – Did any points get reassigned?
    • Yes: return to Iterate step
    • No: terminate
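The steps above can be sketched in plain Python as follows. This is one possible reading of the slide, not a reference implementation: it uses the first initialization option (K input vectors chosen at random), and the handling of empty clusters, the `seed` and `max_iter` parameters, and the toy data are assumptions added for the example.

```python
import random

def squared_error_distance(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def closest_center(x, centers):
    """Index k of the center with the smallest squared error distance to x."""
    return min(range(len(centers)),
               key=lambda k: squared_error_distance(x, centers[k]))

def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))

def kmeans(points, K, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Random initialization: pick K of the N input vectors as initial centers.
    centers = rng.sample(points, K)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its closest center.
        new_assignments = [closest_center(x, centers) for x in points]
        # Convergence: stop when no point was reassigned.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each center becomes the mean of its assigned vectors.
        for k in range(K):
            members = [x for x, a in zip(points, assignments) if a == k]
            if members:  # assumption: an empty cluster keeps its old center
                centers[k] = mean_vector(members)
    return centers, assignments

# Made-up data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, assignments = kmeans(points, K=2)
print(centers)
print(assignments)
```

Which group ends up as cluster 0 depends on the random start, so only the grouping itself is meaningful, not the cluster labels.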


Pseudocode for the K-means Algorithm

From Chapter 16 in Manning, Raghavan, and Schutze


Example of K-Means Clustering

[Figure: sequence of scatter plots with Dimension 1 on the horizontal axis and Dimension 2 on the vertical axis. Panels: Original Data; Iteration 1 (Mean Squared Error = 3.45); Iteration 2; Iteration 3; Iteration 5 (converged)]


K-means

1. Pick number of clusters (e.g. K=5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it’s closest to.
4. Each Center finds the centroid of the points it owns
5. New Centers => new boundaries
6. Repeat until no change

Figure/slide from Andrew Moore, CMU



The Iris Data

• Collected by R.A. Fisher
• A famous early data set in multivariate data analysis

• Four features:
  – sepal length in cm
  – sepal width in cm
  – petal length in cm
  – petal width in cm

• Three different species
  – Setosa
  – Versicolor
  – Virginica


K-Means Clustering on the Iris Data


K-Means for Image Compression


An Example of Data where K-Means does not work well

[Figure: two panels, “Ideal Clustering of Data in 2 Dimensions” and “K-means Clustering Result, K=2”]

From: http://scikit-learn.org/stable/modules/clustering.html#


Properties of the K-Means Algorithm

• Time complexity?
  N = number of data points
  K = number of clusters
  D = dimension of data points (number of variables)

  O(NKD) in time per iteration
  This is good: linear time in each input parameter

• Does K-means always find a Global Minimum? i.e., the set of K centers that minimize the SSE?

  No: always converges to *some* local minimum, but not necessarily the best
  • Depends on the starting point chosen
  • Can prove that SSE on each iteration must either
    – Decrease, or
    – Not change (in which case we have converged)

  [Think about how you might prove this]
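The dependence on the starting point is easy to see on a tiny example. The sketch below runs K-means twice on the corners of a wide rectangle, once from each of two hand-picked sets of initial centers. The data, the initial centers, and the helper `run_kmeans` are assumptions made for illustration.

```python
def squared_error_distance(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def run_kmeans(points, centers, max_iter=100):
    """K-means from given initial centers; returns (centers, SSE per iteration)."""
    centers = list(centers)
    assignments, history = None, []
    for _ in range(max_iter):
        new = [min(range(len(centers)),
                   key=lambda k: squared_error_distance(x, centers[k]))
               for x in points]
        if new == assignments:   # no point was reassigned: converged
            break
        assignments = new
        for k in range(len(centers)):
            members = [x for x, a in zip(points, assignments) if a == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
        history.append(sum(squared_error_distance(x, centers[a])
                           for x, a in zip(points, assignments)))
    return centers, history

# Made-up data: corners of a wide rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

_, good = run_kmeans(points, [(0.0, 0.5), (4.0, 0.5)])  # splits left/right
_, bad = run_kmeans(points, [(2.0, 0.0), (2.0, 1.0)])   # splits bottom/top

print(good[-1])  # 1.0
print(bad[-1])   # 16.0
```

Both runs have converged (no point changes cluster), yet the second is stuck at an SSE sixteen times larger. A common remedy, not covered on the slide, is to run K-means from several random starts and keep the result with the lowest SSE.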


Summary of K-means

• Input:
  – N vectors

• Output: K clusters
  – Each cluster represented by a cluster mean (a vector)
  – Assigns each data point to its closest cluster center

• Strengths
  – Fast: time complexity is O(NDK) per iteration, i.e., linear time in N, D, K
  – Simple to implement

• Weaknesses:
  – Not guaranteed to find the best solution (the global minimum of SSE)
  – Assumes a fixed K, number of clusters
  – Uses Euclidean distance, which is not necessarily ideal


Number of Clusters?

• Generally no “right” answer… it depends on the application

• We can think of clustering as a type of data compression technique:
  – As K, the number of clusters, grows, we compress the data better, e.g., lower overall squared error
  – But this does not mean larger K is always better… the larger the value of K, the harder it is for humans to understand the clustering results

• Options?
  – Pick a value of K based on intuition/heuristics, e.g., relatively small K (e.g., K = 5 or 10) if we are showing the results to a human
  – Evaluate different values of K if we have some ground truth for evaluation and select the best value of K using the task-specific evaluation measure
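To see why squared error alone cannot choose K, the sketch below records the SSE reached for every K from 1 to N on a small made-up data set. The data, the helper names, and the use of 20 random restarts per K are assumptions for the example. Because each run only reaches a local minimum, the printed values are what this particular procedure finds, not guaranteed optima.

```python
import random

def squared_error_distance(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def kmeans_sse(points, K, rng, max_iter=100):
    """Run K-means once from a random start and return the final SSE."""
    centers = rng.sample(points, K)
    assignments = None
    for _ in range(max_iter):
        new = [min(range(K), key=lambda k: squared_error_distance(x, centers[k]))
               for x in points]
        if new == assignments:
            break
        assignments = new
        for k in range(K):
            members = [x for x, a in zip(points, assignments) if a == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(squared_error_distance(x, centers[a])
               for x, a in zip(points, assignments))

def best_sse(points, K, restarts=20, seed=0):
    """Smallest SSE over several random restarts (a heuristic, assumed here)."""
    rng = random.Random(seed)
    return min(kmeans_sse(points, K, rng) for _ in range(restarts))

# Made-up data: three groups of three points in 2 dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 0.0), (10.0, 1.0), (11.0, 0.0),
          (5.0, 8.0), (5.0, 9.0), (6.0, 8.0)]

sse_by_k = {K: best_sse(points, K) for K in range(1, len(points) + 1)}
for K, value in sse_by_k.items():
    print(K, round(value, 3))
```

With K equal to the number of points, every point is its own center and the SSE is zero, which is perfect compression but tells a human nothing about structure in the data.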


Hierarchical Clustering Algorithms


[Figure with groups labeled Setosa, Virginica, and Versicolor]


Hierarchical Clustering

• The number of clusters is not required
• Gives a tree-based representation of observations: a dendrogram

• Each leaf represents an observation

• Leaves most similar to each other are merged

• Internal nodes most similar to each other are merged

• Process continues recursively until all nodes are merged at the root node


Basic Concept of Hierarchical Clustering

[Figure: five points a, b, c, d, e merged over Steps 0 to 4 into {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}]

Merge data points, and then clusters, in a bottom-up fashion, until all data points are in 1 cluster.

Requires that we can define distance/similarity between sets of points
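The bottom-up merging can be sketched with a deliberately naive loop that repeatedly joins the two closest clusters. The distance between two sets of points is assumed here to be the smallest distance between any cross pair (single linkage), which is only one of several possible choices. The 1-D positions for the points a to e are made up so that the merges form the same groups as in the figure.

```python
def squared_error_distance(x, y):
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

def single_link(A, B, points):
    """Assumed cluster distance: smallest distance between any cross pair."""
    return min(squared_error_distance(points[i], points[j]) for i in A for j in B)

def agglomerate(points):
    """Merge bottom-up until one cluster remains; return the merge history."""
    clusters = [(i,) for i in range(len(points))]   # start: one cluster per point
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link(clusters[ab[0]], clusters[ab[1]], points),
        )
        merged = tuple(sorted(clusters[a] + clusters[b]))
        merges.append((clusters[a], clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Made-up 1-D positions for five points named a..e.
names = "abcde"
points = [(0.0,), (1.0,), (5.0,), (7.0,), (7.5,)]
merges = agglomerate(points)
for left, right in merges:
    print("".join(names[i] for i in left), "+", "".join(names[i] for i in right))
```

The recorded merge history is exactly the information a dendrogram draws: each merge is an internal node whose children are the two clusters joined.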


Simple Example of Hierarchical Clustering

[Figure: axes Dimension 1 and Dimension 2]


Complete-link clustering of Reuters news stories

Figure from Chapter 17 of Manning, Raghavan, and Schutze


Distance between Two Branches/Clusters

Single linkage

Complete linkage

Average linkage

Many other options
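Under their standard definitions, the three named linkages differ only in how they summarize the pairwise distances between the two clusters: single linkage takes the minimum, complete linkage the maximum, and average linkage the mean. The sketch below assumes Euclidean distance between points; the function names and the two example clusters are made up.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xj - yj) ** 2 for xj, yj in zip(x, y)))

def pairwise(A, B):
    """All distances between a point in cluster A and a point in cluster B."""
    return [euclidean(x, y) for x in A for y in B]

def single_linkage(A, B):
    return min(pairwise(A, B))      # closest cross-cluster pair

def complete_linkage(A, B):
    return max(pairwise(A, B))      # farthest cross-cluster pair

def average_linkage(A, B):
    d = pairwise(A, B)
    return sum(d) / len(d)          # mean over all cross-cluster pairs

# Made-up clusters on a line.
A = [(0.0,), (1.0,)]
B = [(3.0,), (5.0,)]
print(single_linkage(A, B))    # 2.0
print(complete_linkage(A, B))  # 5.0
print(average_linkage(A, B))   # 3.5
```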


Complexity of Hierarchical Clustering

• Time Complexity (N = num of docs, T = dimensionality)
  – Time to compute all pairwise distances: $O(N^2 T)$
  – Time to create the tree: $O(N^3)$

  -> Overall time complexity is $O(N^3 + N^2 T)$

• Space complexity = $O(N^2)$

• This is a significant weakness of hierarchical clustering: scales poorly in N
  – One practical option is to first run K-means with (e.g.,) K = 20 or 100 or 500 clusters and then “cluster the clusters” from K-means


Automatically Clustering Languages in Linguistics


Hierarchical Clustering based on user votes for favorite beers

Based on centroid method

From data.ranker.com


“Heat-Map” Representation (human data)


Discovering Structure from a Heat Map of Brain Network Data

From https://seaborn.pydata.org/examples/structured_heatmap.html


Summary of Clustering Algorithms

• Used for exploring data
  – Can answer questions such as “are there subgroups?”

• Different clustering algorithms
  – K-means
    • Simple, fast, easy to interpret
    • Tends to find “circular clusters”, can fail on complex structure
    • Number of clusters K is fixed ahead of time

  – Hierarchical agglomerative clustering
    • Produces a tree of clusters (dendrogram)
    • Number of clusters is not fixed
    • Computational complexity is high, does not scale well to large N

• Clustering is useful for exploration… but one should be careful
  – No “gold standard” to compare it to
  – Many different methods… can give different results


Assignment 5

Refer to the Wiki page

Due noon on Monday February 12th to EEE dropbox

Note change: due before class (by 2pm)
