Stats 170A: Project in Data Science
Exploratory Data Analysis: Clustering Algorithms
Padhraic Smyth, Department of Computer Science, Bren School of Information and Computer Sciences, University of California, Irvine
Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 2
Assignment 5
Refer to the Wiki page
Due noon on Monday February 12th to EEE dropbox
Note: due before class (by 2pm)
Questions?

What is Exploratory Data Analysis?
• EDA = {visualization, clustering, dimension reduction, …}
• For small numbers of variables, EDA = visualization
• For large numbers of variables, we need to be cleverer
  – Clustering, dimension reduction, embedding algorithms
  – These are techniques that essentially reduce high-dimensional data to something we can look at
• Today's lecture:
  – Finish up visualization
  – Overview of clustering algorithms

Tufte's Principles of Visualization
Graphical excellence…
– is the well-designed presentation of interesting data, a matter of substance, of statistics, and of design
– consists of complex ideas communicated with clarity, precision and efficiency
– is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
– requires telling the truth about the data

Different Ways of Presenting the Same Data
From Karl Broman, via www.cs.princeton.edu/

Principle of Proportional Ink (or How to Lie with Visualization)

Potentially Misleading Scales on the X-axis

Example: Visualization of Napoleon's 1812 March
Illustrates size of army, direction, location, temperature, date… all on one chart

Data Journalism
From New York Times, Feb 2 2018

Exploratory Data Analysis: Clustering

Example: Clustering Vectors in a 2-Dimensional Space
Each point (or 2d vector) represents a document
[Figure: scatter plot of points, axes x1 and x2]

Example: Possible Clusters
[Figure: scatter plot, axes x1 and x2, with groups marked Cluster 1 and Cluster 2]

Example: How many Clusters?
[Figure: scatter plot, axes x1 and x2, with groups marked Cluster 1, Cluster 2, and Cluster 3]

Cluster Structure in Real-World Data
≈1500 subjects
Two measurements per subject
[Figure: scatter plot, x-axis signal T (0.0 to 1.5), y-axis signal C (0.0 to 1.0)]
Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine

[Figure: scatter plot, x-axis signal T, y-axis signal C, with groups labeled CC, TT, and CT]
Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine

Issues in Clustering
• Representation
  – How do we represent our examples as data vectors?
• Distance
  – How do we want to define distance between vectors?
• Algorithm
  – What type of algorithm do we want to use to search for clusters?
  – What is the time and space complexity of the algorithm?
• Number of Clusters
  – How many clusters do we want?
No "right" answer to these questions in general… it depends on the application

Cluster Analysis vs Classification
Cluster analysis:
• Data are unlabeled
• The number of clusters is unknown
• "Unsupervised" learning
• Goal: find unknown structures
Classification:
• The labels for training data are known
• The number of classes is known
• "Supervised" learning
• Goal: allocate new observations, whose labels are unknown, to one of the known classes

Clustering: The K-Means Algorithm

Notation
N documents
Represent each document as a vector of T terms (e.g., counts or tf-idf)
The vector for the ith document is: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{iT})$, $i = 1, \ldots, N$
Document-Term matrix
• $x_{ij}$ is the entry in the ith row, jth column
• columns correspond to terms
• rows correspond to documents
We can think of our documents as being in a T-dimensional space, with clusters as "clouds of points"
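
As a small illustration of this notation, the sketch below builds a document-term matrix of raw counts. The three toy documents, the whitespace tokenizer, and the function name `document_term_matrix` are assumptions made for the example, not part of the lecture.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    # Assumption: lowercase + whitespace splitting is a good enough tokenizer here.
    tokenized = [doc.lower().split() for doc in docs]
    # The T terms are the columns, kept in sorted order so column j is well defined.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Row i is the vector x_i = (x_i1, ..., x_iT) of term counts.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus: N = 3 documents.
docs = [
    "data science project",
    "clustering data",
    "science of clustering data data",
]
vocabulary, X = document_term_matrix(docs)
for row in X:
    print(row)
```

Each row of `X` is one document and each column one term, so `X[i][j]` plays the role of $x_{ij}$ (with 0-based indices).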

The K-Means Clustering Algorithm
Input:
– N vectors $x_1, \ldots, x_N$ of dimension D
– K = number of clusters (K > 1)
Output:
– K cluster centers, $c_1, \ldots, c_K$; each center is a vector of dimension D
– (Equivalently) A list of cluster assignments (values 1 to K) for each of the N input vectors
Note: In K-means each input vector x is assigned to one and only one cluster k, or cluster center $c_k$
The K-means algorithm partitions the N data vectors into K disjoint groups

Example of K-Means Output with 2 Clusters
[Figure: scatter plot, axes x1 and x2, showing Cluster 1 and Cluster 2 with centers c1 and c2]
Blue circles are examples of documents
Red circles are examples of cluster centers

Squared Error Distance
Consider two vectors each with T components (i.e., dimension T):
$x = (x_1, x_2, \ldots, x_T)$, $y = (y_1, y_2, \ldots, y_T)$
A common distance metric is squared error distance:
$d_E(x, y) = \sum_{j=1}^{T} (x_j - y_j)^2$
In two dimensions the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
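
A minimal sketch of the squared error distance, checked on a 2-dimensional example where the square root gives the familiar Euclidean distance. The function name and the two sample points are illustrative assumptions.

```python
import math


def squared_error_distance(x, y):
    """Sum over the T components of (x_j - y_j)^2."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))


# Hypothetical points: the legs of a 3-4-5 right triangle.
x = (0.0, 0.0)
y = (3.0, 4.0)
d = squared_error_distance(x, y)   # 3^2 + 4^2 = 25.0
euclidean = math.sqrt(d)           # 5.0, the usual spatial distance
print(d, euclidean)
```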

Squared Errors and Cluster Centers
• Squared error (distance) between a data point x and a cluster center c:
  $d[x, c] = \sum_j (x_j - c_j)^2$
  (the sum over j is over the D components/dimensions of the vectors)
• Total squared error between a cluster center $c_k$ and all $N_k$ points assigned to that cluster:
  $S_k = \sum_i d[x_i, c_k]$
  (the sum over i is over the $N_k$ points assigned to cluster k)
• Total squared error summed across K clusters:
  $SSE = \sum_k S_k$
  (the sum over k is over the K clusters)
Distance defined as squared Euclidean distance
[Figure: points of Cluster 1 around their center c1]
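
The three nested sums can be sketched directly in code. The points, assignments, and centers below are made-up values chosen so the arithmetic is easy to check by hand; clusters are indexed from 0 here rather than from 1 as on the slides.

```python
def squared_error(x, c):
    """d[x, c]: sum over the D components of (x_j - c_j)^2."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def cluster_squared_error(members, center):
    """S_k: sum of d[x_i, c_k] over the points assigned to cluster k."""
    return sum(squared_error(x, center) for x in members)


def total_sse(points, assignments, centers):
    """SSE: sum of S_k over all K clusters."""
    return sum(squared_error(x, centers[k]) for x, k in zip(points, assignments))


# Hypothetical data: 4 points in 2 dimensions, K = 2 clusters.
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
assignments = [0, 0, 1, 1]
centers = [(1.0, 2.0), (6.0, 5.0)]

S = [
    cluster_squared_error(
        [x for x, a in zip(points, assignments) if a == k], centers[k]
    )
    for k in range(len(centers))
]
sse = total_sse(points, assignments, centers)
print(S, sse)  # each cluster contributes 1 + 1 = 2, so SSE = 4
```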

K-means Objective Function
• K-means: minimize the total squared error, i.e., find the K cluster centers $c_k$, and assignments, that minimize
  $SSE = \sum_k S_k = \sum_k \left( \sum_i d[x_i, c_k] \right)$
• K-means seeks to minimize SSE, i.e., find the cluster centers such that the sum-squared-error is smallest
  – will place cluster centers strategically to "cover" data
  – similar to data compression (in fact used in data compression algorithms)

K-Means Algorithm
• Random initialization
  – Select the initial K centers randomly from the N input vectors
  – Or, assign each of the N vectors randomly to one of the K clusters
• Iterate:
  – Assignment step:
    • Assign each of the N input vectors to its closest mean
  – Update the mean vectors (K of them):
    • Compute updated centers: the average value of the vectors assigned to k
      New $c_k = \frac{1}{N_k} \sum_i x_i$ (the sum is over the $N_k$ points assigned to cluster k)
• Convergence:
  – Did any points get reassigned?
    • Yes: return to Iterate step
    • No: terminate
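
The steps above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the course's reference code: it uses the first initialization option (K input vectors chosen at random), breaks distance ties in favor of the lower-numbered cluster, and keeps the old center if a cluster ends up empty, a case the slides do not discuss. The names `kmeans`, `seed`, and `max_iter` are hypothetical.

```python
import random


def squared_error(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Random initialization: pick K of the N input vectors as initial centers.
    centers = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its closest center.
        new_assignments = [
            min(range(k), key=lambda j: squared_error(x, centers[j]))
            for x in points
        ]
        # Convergence: if no point was reassigned, terminate.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each center becomes the mean of the vectors assigned to it.
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_vector(members)
    return centers, assignments


# Hypothetical data: two loose groups of points in 2 dimensions.
points = [
    (0.0, 0.2), (0.3, 1.0), (1.1, 0.1), (0.9, 1.2),
    (8.0, 8.3), (8.2, 9.1), (9.0, 7.9), (9.3, 9.0),
]
centers, assignments = kmeans(points, k=2, seed=0)
print(centers)
print(assignments)
```

Because the result depends on the random starting centers, the printed clustering can differ between seeds; at termination every point is assigned to its closest center and every non-empty cluster's center is the mean of its members.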

Pseudocode for the K-means Algorithm
From Chapter 16 in Manning, Raghavan, and Schutze

Example of K-Means Clustering
[Figure sequence: scatter plots of the data, axes Dimension 1 and Dimension 2]
– Original Data
– Iteration 1: Mean Squared Error = 3.45
– Iteration 2: Mean Squared Error = 1.93
– Iteration 3: Mean Squared Error = 1.25
– Iteration 5: Mean Squared Error = 1.21 (converged)

K-means
1. Pick number of clusters (e.g. K=5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it's closest to.
4. Each Center finds the centroid of the points it owns
5. New Centers => new boundaries
6. Repeat until no change
Figure/slide from Andrew Moore, CMU

The Iris Data
• Made famous by R. A. Fisher (the measurements were collected by Edgar Anderson)
• A famous early data set in multivariate data analysis
• Four features:
  – sepal length in cm
  – sepal width in cm
  – petal length in cm
  – petal width in cm
• Three different species
  – Setosa
  – Versicolor
  – Virginica

K-Means Clustering on the Iris Data

K-Means for Image Compression

An Example of Data where K-Means does not work well
[Figure panels: Ideal Clustering of Data in 2 Dimensions; K-means Clustering Result, K=2]
From:http://scikit-learn.org/stable/modules/clustering.html#

Properties of the K-Means Algorithm
• Time complexity?
  N = number of data points
  K = number of clusters
  D = dimension of data points (number of variables)
  O(NKD) in time per iteration
  This is good: linear time in each input parameter
• Does K-means always find a Global Minimum? i.e., the set of K centers that minimize the SSE?
  No: always converges to *some* local minimum, but not necessarily the best
  • Depends on the starting point chosen
  • Can prove that SSE on each iteration must either
    – Decrease, or
    – Not change (in which case we have converged)
  [Think about how you might prove this]
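
As an empirical check (not a proof) of the two properties above, the sketch below records the SSE at every iteration and restarts K-means from several random starting points, keeping the run with the lowest final SSE. The synthetic data, the seeds, the number of restarts, and the helper name `kmeans_with_history` are all assumptions made for illustration.

```python
import random


def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def kmeans_with_history(points, k, seed, max_iter=100):
    """Run K-means once; return (final SSE, list of SSE values, one per iteration)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assignments, history = None, []
    for _ in range(max_iter):
        # Assignment step costs N * K distance computations of D terms each.
        new = [min(range(k), key=lambda j: sq_dist(x, centers[j])) for x in points]
        history.append(sum(sq_dist(x, centers[a]) for x, a in zip(points, new)))
        if new == assignments:
            break
        assignments = new
        for j in range(k):
            members = [x for x, a in zip(points, assignments) if a == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return history[-1], history


# Hypothetical data: 30 noisy points around each of three locations.
data_rng = random.Random(42)
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in [(0, 0), (4, 0), (2, 3)]
    for _ in range(30)
]

# Different starting points may end in different local minima; keep the best.
runs = [kmeans_with_history(points, k=3, seed=s) for s in range(10)]
best_sse = min(final for final, _ in runs)
for final, history in runs:
    print(round(final, 3), len(history), "iterations")
print("best SSE over restarts:", round(best_sse, 3))
```

Within each run the recorded SSE values never increase, and comparing the final values across seeds shows whether some starting points led to a worse local minimum than others.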

Summary of K-means
• Input:
  – N vectors
• Output: K clusters
  – Each cluster represented by a cluster mean (a vector)
  – Assigns each data point to its closest cluster center
• Strengths
  – Fast: time complexity is O(NDK) per iteration, i.e., linear time in N, D, K
  – Simple to implement
• Weaknesses:
  – Not guaranteed to find the best solution (the global minimum of SSE)
  – Assumes a fixed K, number of clusters
  – Uses Euclidean distance, which is not necessarily ideal

Number of Clusters?
• Generally no "right" answer… it depends on the application
• We can think of clustering as a type of data compression technique:
  – As K, the number of clusters, grows, we compress the data better, e.g., lower overall squared error
  – But this does not mean larger K is always better… the larger the value of K, the harder it is for humans to understand the clustering results
• Options?
  – Pick a value of K based on intuition/heuristics, e.g., relatively small K (e.g., K=5 or 10) if we are showing the results to a human
  – Evaluate different values of K if we have some ground truth for evaluation and select the best value of K using the task-specific evaluation measure

Hierarchical Clustering Algorithms

[Figure: Iris data with groups labeled Setosa, Virginica, Versicolor]

Hierarchical Clustering
• The number of clusters is not required
• Gives a tree-based representation of observations: a dendrogram
• Each leaf represents an observation
• Leaves most similar to each other are merged
• Internal nodes most similar to each other are merged
• Process continues recursively until all nodes are merged at the root node

Basic Concept of Hierarchical Clustering
[Figure: points a, b, c, d, e merged from Step 0 to Step 4 into clusters ab, de, cde, and finally abcde]
Merge data points, and then clusters, in a bottom-up fashion, until all data points are in 1 cluster.
Requires that we can define distance/similarity between sets of points
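
The bottom-up merging can be sketched as below. This is a deliberately naive illustration, assuming Euclidean distance between points and the minimum pairwise distance as the distance between two clusters; the five sample points and the function names are made up for the example.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(A, B, points):
    """Assumed set-to-set distance: the closest pair across clusters A and B."""
    return min(euclidean(points[i], points[j]) for i in A for j in B)


def agglomerative(points):
    """Return the list of merges, each a (cluster_a, cluster_b, distance) tuple."""
    # Step 0: every data point starts as its own cluster (a tuple of indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical data: 5 points in 2 dimensions.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
merges = agglomerative(points)
for cluster_a, cluster_b, d in merges:
    print(cluster_a, "+", cluster_b, "at distance", round(d, 2))
```

The sequence of merges is exactly the information a dendrogram draws: N points always produce N - 1 merges, ending with one cluster that contains every point.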

Simple Example of Hierarchical Clustering
[Figure: axes Dimension 1 and Dimension 2]

Complete-link clustering of Reuters news stories
Figure from Chapter 17 of Manning, Raghavan, and Schutze

Distance between Two Branches/Clusters
Single linkage
Complete linkage
Average linkage
Many other options
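
The three named linkages differ only in how they summarize the pairwise distances between two clusters: single linkage takes the minimum, complete linkage the maximum, and average linkage the mean. A minimal sketch, assuming Euclidean distance between points and two made-up clusters A and B:

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(A, B):
    """Distance between the closest pair of points, one from each cluster."""
    return min(euclidean(a, b) for a in A for b in B)


def complete_linkage(A, B):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(euclidean(a, b) for a in A for b in B)


def average_linkage(A, B):
    """Mean distance over all pairs of points, one from each cluster."""
    return sum(euclidean(a, b) for a in A for b in B) / (len(A) * len(B))


# Hypothetical clusters on a line; pairwise distances are 4, 6, 3, and 5.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # 6.0
print(average_linkage(A, B))   # 4.5
```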

Complexity of Hierarchical Clustering
• Time Complexity (N = num of docs, T = dimensionality)
  – Time to compute all pairwise distances: $O(N^2 T)$
  – Time to create the tree: $O(N^3)$
  -> Overall time complexity is $O(N^3 + N^2 T)$
• Space complexity = $O(N^2)$
• This is a significant weakness of hierarchical clustering: scales poorly in N
  – One practical option is first run K-means with (e.g.,) K=20 or 100 or 500 clusters and then "cluster the clusters" from K-means

Automatically Clustering Languages in Linguistics

Hierarchical Clustering based on user votes for favorite beers
Based on centroid method
From data.ranker.com

"Heat-Map" Representation (human data)

Discovering Structure from a Heat Map of Brain Network Data
From https://seaborn.pydata.org/examples/structured_heatmap.html

Summary of Clustering Algorithms
• Used for exploring data
  – Can answer questions such as "are there subgroups?"
• Different clustering algorithms
  – K-means
    • Simple, fast, easy to interpret
    • Tends to find "circular clusters", can fail on complex structure
    • Number of clusters K is fixed ahead of time
  – Hierarchical agglomerative clustering
    • Produces a tree of clusters (dendrogram)
    • Number of clusters is not fixed
    • Computational complexity is high, does not scale well to large N
• Clustering is useful for exploration… but one should be careful
  – No "gold standard" to compare it to
  – Many different methods… can give different results

Assignment 5
Refer to the Wiki page
Due noon on Monday February 12th to EEE dropbox
Note change: due before class (by 2pm)