Stats 170A: Project in Data Science Exploratory Data Analysis: Clustering Algorithms Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine


Post on 20-May-2020



Stats 170A: Project in Data Science

Exploratory Data Analysis: Clustering Algorithms

Padhraic Smyth
Department of Computer Science
Bren School of Information and Computer Sciences
University of California, Irvine


Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018

Assignment 5

Refer to the Wiki page
Due noon on Monday February 12th to EEE dropbox
Note: due before class (by 2pm)

Questions?


What is Exploratory Data Analysis?

• EDA = {visualization, clustering, dimension reduction, …}
• For small numbers of variables, EDA = visualization
• For large numbers of variables, we need to be cleverer
  – Clustering, dimension reduction, embedding algorithms
  – These are techniques that essentially reduce high-dimensional data to something we can look at
• Today’s lecture:
  – Finish up visualization
  – Overview of clustering algorithms


Tufte’s Principles of Visualization

Graphical excellence…
  – is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design
  – consists of complex ideas communicated with clarity, precision and efficiency
  – is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
  – requires telling the truth about the data


Different Ways of Presenting the Same Data

From Karl Broman, via www.cs.princeton.edu/


Principle of Proportional Ink (or How to Lie with Visualization)



Potentially Misleading Scales on the X-axis


Example: Visualization of Napoleon’s 1812 March

Illustrates size of army, direction, location, temperature, date… all on one chart


Data Journalism

From New York Times, Feb 2, 2018


Exploratory Data Analysis: Clustering


Example: Clustering Vectors in a 2-Dimensional Space

[Figure: scatter plot of points with axes x1 and x2]

Each point (or 2d vector) represents a document


Example: Possible Clusters

[Figure: points in the (x1, x2) plane grouped into Cluster 1 and Cluster 2]


Example: How many Clusters?

[Figure: points in the (x1, x2) plane grouped into Cluster 1, Cluster 2, and Cluster 3]


Cluster Structure in Real-World Data

[Figure: scatter plot of signal C versus signal T; ≈1500 subjects, two measurements per subject]

Figure from Prof. Zhaoxia Yu, Statistics Department, UC Irvine



[Figure: scatter plot of signal C versus signal T, with clusters labeled CC, CT, and TT]

Figure from Prof. Zhaoxia Yu, Statistics Department, UC Irvine


Issues in Clustering

• Representation
  – How do we represent our examples as data vectors?
• Distance
  – How do we want to define distance between vectors?
• Algorithm
  – What type of algorithm do we want to use to search for clusters?
  – What is the time and space complexity of the algorithm?
• Number of Clusters
  – How many clusters do we want?

No “right” answer to these questions in general… it depends on the application


Cluster Analysis vs Classification

Cluster analysis:
• Data are unlabeled
• The number of clusters is unknown
• “Unsupervised” learning
• Goal: find unknown structures

Classification:
• The labels for training data are known
• The number of classes is known
• “Supervised” learning
• Goal: allocate new observations, whose labels are unknown, to one of the known classes


Clustering: The K-Means Algorithm


Notation

N documents
Represent each document as a vector of T terms (e.g., counts or tf-idf)

The vector for the i-th document is: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{iT})$, $i = 1, \ldots, N$

Document-Term matrix
• $x_{ij}$ is the entry in the i-th row, j-th column
• columns correspond to terms
• rows correspond to documents

We can think of our documents as being in a T-dimensional space, with clusters as “clouds of points”
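
The notation above can be illustrated with a small sketch. It assumes whitespace tokenization and raw term counts; the toy documents and the helper name `document_term_matrix` are made up for illustration and are not part of the lecture.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (terms, X), where X[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # Columns correspond to terms: the sorted vocabulary over all documents.
    terms = sorted({tok for toks in tokenized for tok in toks})
    X = []
    for toks in tokenized:
        counts = Counter(toks)
        # Rows correspond to documents; each row has T entries.
        X.append([counts[t] for t in terms])
    return terms, X

# Hypothetical toy corpus: N = 3 documents.
docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
terms, X = document_term_matrix(docs)
print(terms)
for row in X:
    print(row)
```

Each row of `X` is one document vector $x_i$, so the three documents are points in a T-dimensional space with T = 5 here.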


The K-Means Clustering Algorithm

Input:
  – N vectors $x_1, \ldots, x_N$ of dimension D
  – K = number of clusters (K > 1)

Output:
  – K cluster centers, $c_1, \ldots, c_K$; each center is a vector of dimension D
  – (Equivalently) A list of cluster assignments (values 1 to K) for each of the N input vectors

Note: In K-means each input vector x is assigned to one and only one cluster k, or cluster center $c_k$

The K-means algorithm partitions the N data vectors into K disjoint groups


Example of K-Means Output with 2 Clusters

[Figure: points in the (x1, x2) plane grouped into Cluster 1 and Cluster 2, with cluster centers c1 and c2]

Blue circles are examples of documents
Red circles are examples of cluster centers


Squared Error Distance

Consider two vectors each with T components (i.e., dimension T):

$$x = (x_1, x_2, \ldots, x_T), \qquad y = (y_1, y_2, \ldots, y_T)$$

A common distance metric is squared error distance:

$$d_E(x, y) = \sum_{j=1}^{T} (x_j - y_j)^2$$

In two dimensions the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
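
As a small sketch of the squared error distance defined on this slide (the function name `squared_error_distance` and the example vectors are illustrative choices, not from the lecture):

```python
import math

def squared_error_distance(x, y):
    """Sum over the T components of (x_j - y_j)^2."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

x = [0.0, 0.0]
y = [3.0, 4.0]
d = squared_error_distance(x, y)
print(d)             # 25.0 (squared error distance)
print(math.sqrt(d))  # 5.0  (Euclidean distance)
```

The square root is only needed when an actual length is wanted; for finding the closest center, comparing squared distances gives the same answer.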


Squared Errors and Cluster Centers

• Squared error (distance) between a data point x and a cluster center c:

$$\mathrm{dist}[x, c] = \sum_{j} (x_j - c_j)^2$$

  (index j is over the D components/dimensions of the vectors)

[Figure: Cluster 1 with center c1]


Squared Errors and Cluster Centers

• Squared error (distance) between a data point x and a cluster center c:

$$\mathrm{dist}[x, c] = \sum_{j} (x_j - c_j)^2$$

  (the sum is over the D components/dimensions of the vectors)

• Total squared error between a cluster center $c_k$ and all $N_k$ points assigned to that cluster:

$$S_k = \sum_{i} \mathrm{dist}[x_i, c_k]$$

  (this sum is over vectors, over the $N_k$ points assigned to cluster k; distance is defined as squared Euclidean distance)

[Figure: Cluster 1 with center c1]


Squared Errors and Cluster Centers

• Squared error (distance) between a data point x and a cluster center c:

$$\mathrm{dist}[x, c] = \sum_{j} (x_j - c_j)^2$$

  (the sum is over the D components/dimensions of the vectors)

• Total squared error between a cluster center $c_k$ and all $N_k$ points assigned to that cluster:

$$S_k = \sum_{i} \mathrm{dist}[x_i, c_k]$$

  (the sum is over the $N_k$ points assigned to cluster k; distance is defined as squared Euclidean distance)

• Total squared error summed across K clusters:

$$\mathrm{SSE} = \sum_{k} S_k$$

  (the sum is over the K clusters)
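
The three nested sums can be sketched in a few lines. This is an illustration under assumptions: clusters are indexed from 0 (the slides count from 1 to K), and the points, centers, and assignments are made-up values.

```python
def sq_dist(x, c):
    """dist[x, c]: sum over the D components of (x_j - c_j)^2."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def sse(points, centers, assignments):
    """Return (SSE, [S_0, ..., S_{K-1}]) for a given clustering."""
    per_cluster = [0.0] * len(centers)
    for x, k in zip(points, assignments):
        # S_k accumulates the squared error of the points assigned to cluster k.
        per_cluster[k] += sq_dist(x, centers[k])
    # SSE sums S_k over the K clusters.
    return sum(per_cluster), per_cluster

points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
centers = [[1.0, 0.0], [11.0, 0.0]]
assignments = [0, 0, 1, 1]
total, per_cluster = sse(points, centers, assignments)
print(per_cluster, total)  # [2.0, 2.0] 4.0
```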


K-means Objective Function

• K-means: minimize the total squared error, i.e., find the K cluster centers $c_k$, and assignments, that minimize

$$\mathrm{SSE} = \sum_{k} S_k = \sum_{k} \Big( \sum_{i} \mathrm{dist}[x_i, c_k] \Big)$$

• K-means seeks to minimize SSE, i.e., find the cluster centers such that the sum-squared-error is smallest
  – will place cluster centers strategically to “cover” data
  – similar to data compression (in fact used in data compression algorithms)


K-Means Algorithm

• Random initialization
  – Select the initial K centers randomly from the N input vectors
  – Or, assign each of the N vectors randomly to one of the K clusters
• Iterate:
  – Assignment step:
    • Assign each of the N input vectors to its closest mean
  – Update the mean vectors (K of them)
    • Compute updated centers: the average value of the vectors assigned to k

      New $c_k = \frac{1}{N_k} \sum_{i} x_i$ (the sum is over the $N_k$ points assigned to cluster k)
• Convergence:
  – Did any points get reassigned?
    • Yes: return to Iterate step
    • No: terminate
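
The steps on this slide can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it uses the first initialization option (K input vectors chosen at random), keeps the old center if a cluster becomes empty (a choice the slide does not specify), and the `seed`, `max_iter`, and toy data are assumptions.

```python
import random

def sq_dist(x, c):
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(points, K, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Random initialization: pick K of the N input vectors as initial centers.
    centers = [list(p) for p in rng.sample(points, K)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its closest center.
        new_assignments = [
            min(range(K), key=lambda k: sq_dist(x, centers[k])) for x in points
        ]
        # Convergence: terminate once no point was reassigned.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: each center becomes the mean of its assigned vectors.
        for k in range(K):
            members = [x for x, a in zip(points, assignments) if a == k]
            if members:  # assumption: keep the old center if a cluster is empty
                centers[k] = mean_vector(members)
    return centers, assignments

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]]
centers, assignments = kmeans(points, K=2)
print(sorted(centers))
print(assignments)
```

Cluster labels here run from 0 to K-1, and which blob receives which label depends on the random start.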


Pseudocode for the K-means Algorithm

From Chapter 16 in Manning, Raghavan, and Schutze


Example of K-Means Clustering

[Figure: scatter plot titled “Original Data”; axes Dimension 1 and Dimension 2]


Example of K-Means Clustering

[Figure: Iteration 1; axes Dimension 1 and Dimension 2; Mean Squared Error = 3.45]


Example of K-Means Clustering

[Figure: Iteration 2; axes Dimension 1 and Dimension 2; Mean Squared Error = 1.93]


Example of K-Means Clustering

[Figure: Iteration 3; axes Dimension 1 and Dimension 2; Mean Squared Error = 1.25]


Example of K-Means Clustering

[Figure: Iteration 5; axes Dimension 1 and Dimension 2; Mean Squared Error = 1.21 (converged)]


K-means

1. Pick number of clusters (e.g. K=5)
2. Randomly guess K cluster Center locations

Figure/slide from Andrew Moore, CMU


K-means

1. Pick number of clusters (e.g. K=5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it’s closest to.

Figure/slide from Andrew Moore, CMU


K-means

1. Pick number of clusters (e.g. K=5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it’s closest to.
4. Each Center finds the centroid of the points it owns

Figure/slide from Andrew Moore, CMU


K-means

1. Pick number of clusters (e.g. K=5)
2. Randomly guess K cluster Center locations
3. Each datapoint finds out which Center it’s closest to.
4. Each Center finds the centroid of the points it owns
5. New Centers => new boundaries
6. Repeat until no change

Figure/slide from Andrew Moore, CMU


The Iris Data

• Collected by R. A. Fisher
• A famous early data set in multivariate data analysis
• Four features:
  – sepal length in cm
  – sepal width in cm
  – petal length in cm
  – petal width in cm
• Three different species
  – Setosa
  – Versicolor
  – Virginica


K-Means Clustering on the Iris Data


K-Means for Image Compression


An Example of Data where K-Means does not work well

[Figure: Ideal Clustering of Data in 2 Dimensions]


An Example of Data where K-Means does not work well

[Figure panels: Ideal Clustering of Data in 2 Dimensions; K-means Clustering Result, K=2]


From: http://scikit-learn.org/stable/modules/clustering.html#


Properties of the K-Means Algorithm

• Time complexity?
  – N = number of data points
  – K = number of clusters
  – D = dimension of data points (number of variables)
  – O(NKD) in time per iteration
  – This is good: linear time in each input parameter
• Does K-means always find a global minimum, i.e., the set of K centers that minimizes the SSE?
  – No: it always converges to *some* local minimum, but not necessarily the best
  – Depends on the starting point chosen
  – Can prove that the SSE on each iteration must either
    • Decrease, or
    • Not change (in which case we have converged)
  – [Think about how you might prove this]
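
Both claims can be explored empirically (this is a numerical check, not a proof). The sketch below is an illustration under assumptions: the helper `kmeans_trace`, the hand-picked starting centers, and the toy data are all made up. It runs K-means from given starting centers and records the SSE after each iteration.

```python
def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeans_trace(points, centers, max_iter=50):
    """Run K-means from the given centers; return the SSE after each iteration."""
    centers = [list(c) for c in centers]
    K = len(centers)
    assignments, history = None, []
    for _ in range(max_iter):
        new = [min(range(K), key=lambda k: sq_dist(x, centers[k])) for x in points]
        if new == assignments:  # no point reassigned: converged
            break
        assignments = new
        for k in range(K):
            members = [x for x, a in zip(points, assignments) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        history.append(sum(sq_dist(x, centers[a]) for x, a in zip(points, assignments)))
    return history

# Four corners of a 4-by-1 rectangle: the starting point decides the outcome.
rect = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]]
good = kmeans_trace(rect, [[0.0, 0.5], [4.0, 0.5]])  # splits left / right
bad = kmeans_trace(rect, [[2.0, 0.0], [2.0, 1.0]])   # splits bottom / top
print(good[-1], bad[-1])  # 1.0 16.0

# Two blobs with a poor start: SSE never increases from one iteration to the next.
blobs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]]
trace = kmeans_trace(blobs, [[0.0, 0.0], [0.0, 1.0]])
print(trace)
```

The second start on the rectangle converges immediately to a clustering whose SSE is 16 times worse than the first: a local minimum that no further iteration can escape.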


Summary of K-means

• Input:
  – N vectors
• Output: K clusters
  – Each cluster represented by a cluster mean (a vector)
  – Assigns each data point to its closest cluster center
• Strengths
  – Fast: time complexity is O(NDK) per iteration, i.e., linear time in N, D, K
  – Simple to implement
• Weaknesses:
  – Not guaranteed to find the best solution (the global minimum of SSE)
  – Assumes a fixed K, number of clusters
  – Uses Euclidean distance, which is not necessarily ideal


Number of Clusters?

• Generally no “right” answer… it depends on the application
• We can think of clustering as a type of data compression technique:
  – As K, the number of clusters, grows, we compress the data better, e.g., lower overall squared error
  – But this does not mean larger K is always better… the larger the value of K, the harder it is for humans to understand the clustering results
• Options?
  – Pick a value of K based on intuition/heuristics, e.g., relatively small K (e.g., K=5 or 10) if we are showing the results to a human
  – Evaluate different values of K if we have some ground truth for evaluation and select the best value of K using the task-specific evaluation measure
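
The compression view can be illustrated by tabulating squared error against K. This sketch makes several assumptions not in the lecture: toy one-dimensional data, initial centers drawn from the data, and keeping the best of 10 random restarts for each K.

```python
import random

def sq_dist(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def kmeans_sse(points, K, seed, max_iter=100):
    """Run K-means once from a random start and return the final SSE."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, K)]
    assignments = None
    for _ in range(max_iter):
        new = [min(range(K), key=lambda k: sq_dist(x, centers[k])) for x in points]
        if new == assignments:
            break
        assignments = new
        for k in range(K):
            members = [x for x, a in zip(points, assignments) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return sum(sq_dist(x, centers[a]) for x, a in zip(points, assignments))

def best_sse(points, K, restarts=10):
    """Smallest SSE found over several random restarts (seeds 0..restarts-1)."""
    return min(kmeans_sse(points, K, seed) for seed in range(restarts))

points = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
for K in range(1, len(points) + 1):
    print(K, best_sse(points, K))
```

With K = 1 the SSE is 154.0, with K = 2 it drops to 4.0, and with K = N every point is its own cluster and the SSE is 0.0. The error at K = N is perfect but tells a human nothing, which is the point made on the slide.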


Hierarchical Clustering Algorithms


[Figure: groups labeled Setosa, Virginica, Versicolor]


Hierarchical Clustering

• The number of clusters does not need to be specified in advance
• Gives a tree-based representation of observations: a dendrogram
• Each leaf represents an observation
• Leaves most similar to each other are merged
• Internal nodes most similar to each other are merged
• Process continues recursively until all nodes are merged at the root node


Basic Concept of Hierarchical Clustering

[Figure: points a, b, c, d, e merged over Step 0 to Step 4, forming the clusters {a, b}, {d, e}, {c, d, e}, and finally {a, b, c, d, e}]

Merge data points, and then clusters, in a bottom-up fashion, until all data points are in 1 cluster.

Requires that we can define distance/similarity between sets of points
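
The bottom-up merging described above can be sketched as follows. This is an illustration under assumptions: it uses single linkage (closest pair of members) as the distance between sets of points, and the one-dimensional coordinates for a to e are invented so that the merges come out in an easy-to-follow order.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(A, B, points):
    """Distance between clusters A and B: the closest pair of members."""
    return min(euclidean(points[i], points[j]) for i in A for j in B)

def agglomerate(points, linkage=single_linkage):
    """Merge clusters bottom-up; return the merges in the order they happen."""
    clusters = [(name,) for name in points]  # Step 0: every point is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        d, a, b = min(
            (linkage(clusters[a], clusters[b], points), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return merges

# Hypothetical coordinates for five points named as on the slide.
points = {"a": [0.0], "b": [0.5], "c": [5.0], "d": [7.0], "e": [8.0]}
merges = agglomerate(points)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

The list of merges, with the distance at which each happens, is exactly the information a dendrogram draws.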


Simple Example of Hierarchical Clustering

[Figure: axes Dimension 1 and Dimension 2]


Complete-link clustering of Reuters news stories

Figure from Chapter 17 of Manning, Raghavan, and Schutze


Distance between Two Branches/Clusters

Single linkage
Complete linkage
Average linkage
Many other options
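
The slide only names the three linkages. The sketch below uses their standard definitions (minimum, maximum, and mean of the pairwise distances between members of the two clusters), with Euclidean distance between points as an assumption and made-up example clusters.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise(A, B):
    """All distances between a member of A and a member of B."""
    return [euclidean(x, y) for x in A for y in B]

def single_linkage(A, B):
    return min(pairwise(A, B))   # closest pair

def complete_linkage(A, B):
    return max(pairwise(A, B))   # farthest pair

def average_linkage(A, B):
    d = pairwise(A, B)
    return sum(d) / len(d)       # mean over all pairs

A = [[0.0, 0.0], [1.0, 0.0]]
B = [[4.0, 0.0], [6.0, 0.0]]
print(single_linkage(A, B))    # 3.0
print(complete_linkage(A, B))  # 6.0
print(average_linkage(A, B))   # 4.5
```

The same two clusters are 3.0, 6.0, or 4.5 apart depending on the linkage, which is one reason different hierarchical methods can produce different trees.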


Complexity of Hierarchical Clustering

• Time complexity (N = number of docs, T = dimensionality)
  – Time to compute all pairwise distances: $O(N^2 T)$
  – Time to create the tree: $O(N^3)$
  – Overall time complexity is $O(N^3 + N^2 T)$
• Space complexity = $O(N^2)$
• This is a significant weakness of hierarchical clustering: it scales poorly in N
  – One practical option is to first run K-means with (e.g.) K = 20 or 100 or 500 clusters and then “cluster the clusters” from K-means


Automatically Clustering Languages in Linguistics


Hierarchical Clustering based on user votes for favorite beers

Based on centroid method

From data.ranker.com


“Heat-Map” Representation (human data)


Discovering Structure from a Heat Map of Brain Network Data

From https://seaborn.pydata.org/examples/structured_heatmap.html


Summary of Clustering Algorithms

• Used for exploring data
  – Can answer questions such as “are there subgroups?”
• Different clustering algorithms
  – K-means
    • Simple, fast, easy to interpret
    • Tends to find “circular clusters”, can fail on complex structure
    • Number of clusters K is fixed ahead of time
  – Hierarchical agglomerative clustering
    • Produces a tree of clusters (dendrogram)
    • Number of clusters is not fixed
    • Computational complexity is high, does not scale well to large N
• Clustering is useful for exploration, but one should be careful
  – No “gold standard” to compare it to
  – Many different methods, which can give different results


Assignment 5

Refer to the Wiki page
Due noon on Monday February 12th to EEE dropbox
Note change: due before class (by 2pm)