Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml



Page 2:

Visualisation and Dimensionality Reduction

Last time: Biplots, MDS

Biplots: a way to visualise relationships between principal components and the original variables. In the scaled version, projected points are uncorrelated and equi-variant, and angles between variables correspond to correlations.

[Figure: biplot of the USArrests data, Comp.1 vs Comp.2; US states plotted as points, with arrows for the variables Murder, Assault, UrbanPop and Rape.]

Multidimensional Scaling: a class of dimensionality reduction techniques which find a representation of the data items x_1, ..., x_n ∈ R^p in a lower-dimensional space z_1, ..., z_n ∈ R^k which approximately preserves the inter-point (dis)similarities.
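Classical MDS can be sketched in a few lines. The following Python implementation (an illustration, not the course's R code, which uses cmdscale) double-centres the squared distances and embeds via the top eigenvectors:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed points given a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centred inner products
    w, V = np.linalg.eigh(B)              # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]         # top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))   # n x k embedding

# Points on a line: their distances are reproduced exactly in 1 dimension
X = np.array([[0.0], [1.0], [3.0]])
D = np.abs(X - X.T)
Z = classical_mds(D, k=1)
D_hat = np.abs(Z - Z.T)
print(np.allclose(D, D_hat))   # True: distances preserved
```

For genuinely Euclidean distance matrices this recovers the original configuration up to rotation and reflection.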

[Figure: classical MDS configuration of 10 US cities (Atlanta, Chicago, Denver, Houston, LA, Miami, NY, SF, Seattle, DC) recovered from their inter-city distances.]

Page 3:

Visualisation and Dimensionality Reduction Multidimensional Scaling

Varieties of MDS

Choices of (dis)similarities and stress functions S(Z) lead to different algorithms.

Classical/Torgerson: preserves inner products instead; strain function (cmdscale):

S(Z) = Σ_{i≠j} (b_ij − ⟨z_i − z̄, z_j − z̄⟩)²

Metric Shephard-Kruskal: preserves distances w.r.t. squared stress:

S(Z) = Σ_{i≠j} (d_ij − ‖z_i − z_j‖₂)²

Sammon: preserves shorter distances more (sammon):

S(Z) = Σ_{i≠j} (d_ij − ‖z_i − z_j‖₂)² / d_ij

Non-metric Shephard-Kruskal: ignores actual distance values, only preserves ranks (isoMDS):

S(Z) = min_{g increasing} [ Σ_{i≠j} (g(d_ij) − ‖z_i − z_j‖₂)² ] / [ Σ_{i≠j} ‖z_i − z_j‖₂² ]
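To make the stress functions concrete, here is a small Python sketch (illustrative; the lecture's actual tools are R's cmdscale, sammon and isoMDS) of the Sammon stress of a candidate configuration Z against given dissimilarities D:

```python
import numpy as np

def sammon_stress(D, Z):
    """Sammon stress: squared distance errors, each down-weighted by d_ij,
    so short original distances matter more than long ones."""
    dz = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    i, j = np.triu_indices_from(D, k=1)   # each pair counted once
    return np.sum((D[i, j] - dz[i, j]) ** 2 / D[i, j])

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(sammon_stress(D, X))            # 0.0: a perfect configuration
print(sammon_stress(D, 1.1 * X) > 0)  # True: distorted distances cost stress
```

An MDS algorithm of the Sammon variety would minimize this quantity over Z, typically by gradient descent.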

Page 4:

Visualisation and Dimensionality Reduction Multidimensional Scaling

Example: Language data

Presence or absence of 2867 homologous traits in 87 Indo-Europeanlanguages.

> X <- read.table("http://www.stats.ox.ac.uk/~sejdinov/sdmml/data/cognate.txt")
> X[1:15,1:16]

                V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
Irish_A          0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Irish_B          0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Welsh_N          0  0  0  1  0  0  0  0  0   0   0   0   0   0   0   0
Welsh_C          0  0  0  1  0  0  0  0  0   0   0   0   0   0   0   0
Breton_List      0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Breton_SE        0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Breton_ST        0  0  0  0  1  0  0  0  0   0   0   0   0   0   0   0
Romanian_List    0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Vlach            0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Italian          0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Ladin            0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Provencal        0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
French           0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
Walloon          0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
French_Creole_C  0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0
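The cognate matrix is binary, so a natural dissimilarity for it (the slides later use Jaccard dissimilarities with sammon and k-means) is the Jaccard distance between trait sets. A minimal NumPy sketch, for illustration only:

```python
import numpy as np

def jaccard_dist(X):
    """Pairwise Jaccard dissimilarity between rows of a binary matrix:
    1 - |intersection| / |union| of the traits present in each pair."""
    X = X.astype(bool)
    inter = (X[:, None, :] & X[None, :, :]).sum(-1)
    union = (X[:, None, :] | X[None, :, :]).sum(-1)
    return 1.0 - inter / np.maximum(union, 1)   # guard empty unions

# Three toy "languages" over 4 traits
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 0]])
D = jaccard_dist(X)
print(D[0, 1])   # 1 - 1/3: one shared trait out of three present overall
print(D[0, 2])   # 0.0: identical trait sets
```

The resulting symmetric dissimilarity matrix can be fed directly into any of the MDS variants above.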

Page 5:

Visualisation and Dimensionality Reduction Multidimensional Scaling

Example: Language data

Using MDS with Sammon scaling. Unlike classical MDS (cmdscale), which minimizes Σ_{i≠j} (d²_ij − d̃²_ij)², Sammon puts more weight on reproducing the separation of points which are close, by forcing them apart. Projection by MDS (Jaccard/Sammon) with cluster discovery by k-means (Jaccard): there is an obvious east-to-west (top-left to bottom-right) separation of languages in the MDS, and the clusters in the MDS grouping agree with the clusters discovered by agglomerative clustering and k-means. The two clustering methods group languages slightly differently, with k-means splitting the Germanic languages.

[Figure: Sammon MDS embedding of the 87 Indo-European languages (Jaccard dissimilarities), with points coloured by k-means cluster.]

## (alternative/MDS) make a field to display the clusters
## use MDS - sammon does this nicely
## assumes D (Jaccard dissimilarity matrix) and km (k-means fit) computed earlier
di.sam <- sammon(D, magic=0.20000002, niter=1000, tol=1e-8)
eqscplot(di.sam$points, pch=km$cluster, col=km$cluster)
text(di.sam$points, labels=row.names(X), pos=4, col=km$cluster)


Page 6:

Visualisation and Dimensionality Reduction Isomap

Nonlinear Dimensionality Reduction

Two aims of the different varieties of MDS:
To visualize the (dis)similarities among items in a dataset, where these (dis)similarities may not have Euclidean geometric interpretations.
To perform nonlinear dimensionality reduction.

Many high-dimensional datasets exhibit low-dimensional structure ("live on a low-dimensional manifold").

Page 7:

Visualisation and Dimensionality Reduction Isomap

Isomap

Isomap is a nonlinear dimensionality reduction technique based on classical MDS. It differs from other MDS methods in that it uses estimates of the geodesic distances between the data points.

[Excerpt from Tenenbaum, de Silva & Langford (2000), "A Global Geometric Framework for Nonlinear Dimensionality Reduction", Science 290, pp. 2319-2323: Isomap's asymptotic guarantees for recovering the dimensionality and geometry of nonlinear manifolds, residual-variance plots for four datasets (Fig. 2), and the "Swiss roll" illustration of geodesic vs. Euclidean distances (Fig. 3).]

Tenenbaum et al. (2000)

Page 8:

Visualisation and Dimensionality Reduction Isomap

Isomap

Isomap:

1. Calculate the Euclidean distances d_ij, i, j = 1, ..., n, between all data points.
2. Form a graph G with the n samples as nodes, and edges between the respective K nearest neighbours (K-Isomap), or between i and j whenever d_ij < ε (ε-Isomap).
3. For i, j linked by an edge, set d^G_ij = d_ij. Otherwise, set d^G_ij to the shortest-path distance between i and j in G.
4. Run classical MDS using the distances d^G_ij.
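The four steps translate into a short program. Below is an illustrative Python sketch (the slide points to R's isomap in the vegan package), using scipy's graph shortest paths for step 3 and classical MDS for step 4:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors, k=2):
    """K-Isomap sketch: neighbourhood graph -> shortest-path (geodesic)
    distance estimates -> classical MDS."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))                         # 0 entries = no edge
    for i in range(n):
        nn = np.argsort(D[i])[1:n_neighbors + 1]  # K nearest neighbours of i
        W[i, nn] = D[i, nn]
    DG = shortest_path(W, directed=False)        # step 3: graph geodesics
    J = np.eye(n) - np.ones((n, n)) / n          # step 4: classical MDS on DG
    B = -0.5 * J @ (DG ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# A quarter-circle arc lives on a one-dimensional manifold in R^2
t = np.linspace(0, np.pi / 2, 40)
X = np.column_stack([np.cos(t), np.sin(t)])
Z = isomap(X, n_neighbors=3, k=1)
print(Z.shape)   # (40, 1): the arc is unrolled into a single coordinate
```

Note the sketch assumes the neighbourhood graph is connected; if it is not, the shortest-path matrix contains infinities and the graph must be repaired (e.g. by increasing K).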


R function: isomap{vegan}.

Page 9:

Visualisation and Dimensionality Reduction Isomap

Handwritten Characters

[Excerpt from Tenenbaum et al. (2000), Fig. 1B: Isomap applied to N = 1000 handwritten "2"s from the MNIST database. The two leading Isomap dimensions articulate the major features of the digit: bottom loop (x axis) and top arch (y axis). Input-space distances were measured by tangent distance; ε-Isomap was used with ε = 4.2.]

Page 10:

Visualisation and Dimensionality Reduction Isomap

Faces

[Excerpt from Tenenbaum et al. (2000), Fig. 1A: Isomap (K = 6) applied to N = 698 face images (64 x 64 pixels, varying pose and lighting) learns a three-dimensional embedding whose coordinates correlate with left-right pose (R = 0.99), up-down pose (R = 0.90) and lighting direction (R = 0.92).]

Page 11:

Visualisation and Dimensionality Reduction Isomap

Nonlinear Dimensionality Reduction Techniques

Kernel PCA
Locally Linear Embedding
Laplacian Eigenmaps
Maximum Variance Unfolding

Page 12:

Clustering

Clustering

Page 13:

Clustering

Clustering

Many datasets consist of multiple heterogeneous subsets.
Cluster analysis: given unlabelled data, we want algorithms that automatically group the data points into coherent subsets/clusters.
Examples:

market segmentation of shoppers based on browsing and purchase histories
different types of breast cancer based on gene expression measurements
discovering communities in social networks
image segmentation

Page 14:

Clustering

Types of Clustering

Model-based clustering:Each cluster is described using a probability model.

Model-free clustering:Defined by similarity/dissimilarity among instances within clusters.

Page 15:

Clustering

This Lecture: Two “Model-free” Clustering Methods

K-means clustering: a partition-based method into K clusters. It finds groups such that the variation within each group is small. The number of clusters K is usually fixed beforehand, or various values of K are investigated as part of the analysis.

Hierarchical clustering: nearby data items are joined into clusters, then clusters into super-clusters, forming a hierarchy. Typically the hierarchy forms a binary tree (a dendrogram) where each cluster has two "children" clusters. The dendrogram allows us to view the clusterings for each possible number of clusters, from 1 to n (the number of data items).

Page 16:

Clustering K-means

K-means

Partition-based methods seek to divide the data points into a pre-assigned number of clusters C_1, ..., C_K, where for all k, k′ ∈ {1, ..., K}:

C_k ⊂ {1, ..., n},   C_k ∩ C_k′ = ∅ for k ≠ k′,   ∪_{k=1}^K C_k = {1, ..., n}.

For each cluster, represent it using a prototype or cluster centroid μ_k. We can measure the quality of a cluster with its within-cluster deviance

W(C_k, μ_k) = Σ_{i∈C_k} ‖x_i − μ_k‖²₂.

The overall quality of the clustering is given by the total within-cluster deviance:

W = Σ_{k=1}^K W(C_k, μ_k).

The overall objective is to choose both the cluster centroids and the allocation of points so as to minimize this objective function.
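As a concrete check of the objective, a short Python sketch (for illustration) computing the total within-cluster deviance for a given partition and set of centroids:

```python
import numpy as np

def total_within_cluster_deviance(X, labels, mu):
    """W = sum_k sum_{i in C_k} ||x_i - mu_k||^2."""
    return sum(((X[labels == k] - mu[k]) ** 2).sum() for k in range(len(mu)))

# Two tight pairs of points, each cluster centred at its own mean
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
mu = np.array([X[labels == 0].mean(0), X[labels == 1].mean(0)])
print(total_within_cluster_deviance(X, labels, mu))   # 1.0
```

Each cluster contributes 2 x 0.5² = 0.5, so W = 1.0; any other assignment of these four points gives a larger value.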

Page 17:

Clustering K-means

K-means

W = Σ_{k=1}^K Σ_{i∈C_k} ‖x_i − μ_k‖²₂ = Σ_{i=1}^n ‖x_i − μ_{c_i}‖²₂

where c_i = k if and only if i ∈ C_k.

Given a partition {C_k}, we can find the optimal prototypes easily by differentiating W with respect to μ_k:

∂W/∂μ_k = −2 Σ_{i∈C_k} (x_i − μ_k) = 0  ⇒  μ_k = (1/|C_k|) Σ_{i∈C_k} x_i

Given the prototypes, we can easily find the optimal partition by assigning each data point to the closest cluster prototype:

c_i = argmin_k ‖x_i − μ_k‖²₂

But joint minimization over both is computationally difficult.

Page 18:

Clustering K-means

K-means

The K-means algorithm is a widely used method that returns a local optimum of the objective function W, using iterative and alternating minimization:

1. Randomly initialize K cluster centroids μ_1, ..., μ_K.
2. Cluster assignment: for each i = 1, ..., n, assign x_i to the cluster with the nearest centroid,
   c_i := argmin_k ‖x_i − μ_k‖²₂,
   and set C_k := {i : c_i = k} for each k.
3. Move centroids: set μ_1, ..., μ_K to the averages of the new clusters:
   μ_k := (1/|C_k|) Σ_{i∈C_k} x_i
4. Repeat steps 2-3 until convergence.
5. Return the partition {C_1, ..., C_K} and the means μ_1, ..., μ_K.
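The steps above translate almost line-for-line into code. A Python sketch of Lloyd's algorithm (illustrative, not production code):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate cluster assignment and centroid moves."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]   # step 1: random centroids
    for _ in range(n_iter):
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
        c = d.argmin(1)                            # step 2: assignment
        new_mu = np.array([X[c == k].mean(0) if (c == k).any() else mu[k]
                           for k in range(K)])     # step 3: move centroids
        if np.allclose(new_mu, mu):                # step 4: convergence
            break
        mu = new_mu
    return c, mu                                   # step 5

# Two well-separated blobs: k-means should keep each blob intact
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
c, mu = kmeans(X, K=2)
print((c[:20] == c[0]).all() and (c[20:] == c[20]).all())  # True
```

Empty clusters keep their old centroid here; other conventions (e.g. re-seeding the centroid at a random point) are also common.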

Page 19:

Clustering K-means

K-means

The algorithm stops in a finite number of iterations. Between steps 2 and 3, W either stays constant or decreases; this implies that we never revisit the same partition. As there are only finitely many partitions, the number of iterations cannot exceed their number.
The K-means algorithm need not converge to the global optimum. K-means is a heuristic search algorithm, so it can get stuck at suboptimal configurations, and the result depends on the starting configuration. Typically one performs a number of runs from different starting configurations and picks the end result with minimum W.

[Figure: three k-means runs on the same two-dimensional dataset from different starting configurations, converging to W = 9.184, W = 3.418 and W = 9.264; the second run attains the smallest W.]

Page 20:

Clustering K-means

K-means on Crabs

Looking at the Crabs data again.

library(MASS)
library(lattice)
data(crabs)
splom(~log(crabs[,4:8]), pch=as.numeric(crabs[,2]), col=as.numeric(crabs[,1]),
      main="circle/triangle is gender, black/red is species")


Clustering K-means

K-means on Crabs
circle/triangle is gender, black/red is species

[Figure: scatter plot matrix of the log-transformed Crabs measurements FL, RW, CL, CW, BD; circles/triangles mark gender, black/red marks species.]


Clustering K-means

K-means on Crabs

Apply K-means with 2 clusters and plot results.

Crabs.kmeans <- kmeans(log(crabs[,4:8]), 2, nstart=1, iter.max=10)
splom(~log(crabs[,4:8]), col=Crabs.kmeans$cluster+2,
      main="blue/green is cluster; finds big/small")


Clustering K-means

K-means on Crabs
blue/green is cluster; finds big/small

[Figure: scatter plot matrix of FL, RW, CL, CW, BD with points coloured by the two K-means clusters (blue/green); the split separates big from small crabs.]


Clustering K-means

K-means on Crabs

‘Whiten’ or ‘sphere’[1] the data using PCA.

pcp <- princomp(log(crabs[,4:8]))
Crabs.sphered <- pcp$scores %*% diag(1/pcp$sdev)
splom(~Crabs.sphered[,1:3], col=as.numeric(crabs[,1]), pch=as.numeric(crabs[,2]),
      main="circle/triangle is gender, black/red is species")

And apply K-means again.

Crabs.kmeans <- kmeans(Crabs.sphered, 2, nstart=1, iter.max=20)
splom(~Crabs.sphered[,1:3], col=Crabs.kmeans$cluster+2, main="blue/green is cluster")

[1] Apply a linear transformation so that the covariance matrix is the identity.
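As a sanity check on the footnoted definition, whitening can be written out directly: rotate onto the principal components, then rescale each component to unit variance. An illustrative NumPy sketch (not the princomp/splom code above; the mixing matrix is made up):

```python
import numpy as np

def sphere(X):
    """Whiten X: rotate onto the principal components, then rescale each
    component to unit variance, so the sample covariance becomes the identity."""
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    scores = Xc @ eigvec              # principal component scores
    return scores / np.sqrt(eigval)   # divide each score column by its sd

# made-up full-rank mixing to give the data correlations and unequal scales
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 0.5, 0.0],
                                          [0.0, 0.0, 3.0]])
Z = sphere(X)
print(np.allclose(np.cov(Z, rowvar=False), np.eye(3)))  # True
```

After sphering, no direction dominates the Euclidean distances that K-means uses, which is exactly why the clustering below behaves so differently.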


Clustering K-means

K-means on Crabs

[Figure: scatter plot matrices of the first three sphered components; left panel marked by species/gender (circle/triangle is gender, black/red is species), right panel coloured by the two K-means clusters (blue/green is cluster).]

Discovers the gender difference. But the result depends crucially on sphering the data first!


Clustering K-means

K-means on Crabs with K = 4
circle/triangle is gender, black/red is species

[Figure: scatter plot matrices of the first three sphered components for K = 4; left panel marked by gender/species, right panel with colours indicating the four clusters.]

> table(Crabs.kmeans$cluster, Crabs.class)
   Crabs.class
    BF BM OF OM
  1  3  0 41  0
  2 39  8  6  0
  3  8 42  0  0
  4  0  0  3 50
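A quick way to read the table above is to ask how often a crab lands in its cluster's majority class. A small Python computation from the printed counts (this "agreement" score is just an informal summary, not a standard statistic):

```python
# rows = clusters 1..4, columns = classes BF, BM, OF, OM (from the table above)
counts = [
    [ 3,  0, 41,  0],
    [39,  8,  6,  0],
    [ 8, 42,  0,  0],
    [ 0,  0,  3, 50],
]
n = sum(map(sum, counts))                   # 200 crabs in total
agreement = sum(max(row) for row in counts) / n
print(n, agreement)                         # 200 0.86
```

So with K = 4 on the sphered data, 172 of the 200 crabs fall in their cluster's majority species/gender class.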


Clustering K-means

K-means Additional Comments

Good practice initialization. Randomly pick K training examples (without replacement) and set µ1, µ2, . . . , µK equal to those examples.

Sensitivity to distance measure. Euclidean distance can be greatly affected by the measurement unit and by strong correlations. Can use the Mahalanobis distance instead:

‖x − y‖M = √((x − y)ᵀ M⁻¹ (x − y))

where M is a positive semi-definite matrix, e.g. the sample covariance.

Determination of K. The K-means objective will always improve with a larger number of clusters K, so determining K requires an additional regularization criterion. E.g., in DP-means[2], use

W = ∑_{k=1}^{K} ∑_{i∈Ck} ‖xi − µk‖₂² + λK.

[2] DP-means paper.
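The sensitivity to measurement units can be made concrete: with M the sample covariance, the Mahalanobis distance is unchanged when a coordinate is rescaled, while the Euclidean distance is not. An illustrative NumPy sketch on synthetic data:

```python
import numpy as np

def mahalanobis(x, y, M):
    """Mahalanobis distance ||x - y||_M = sqrt((x - y)^T M^{-1} (x - y))."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.solve(M, diff)))

# synthetic data; rescale the first coordinate to simulate a change of units
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Xs = X @ np.diag([1000.0, 1.0])
d  = mahalanobis(X[0],  X[1],  np.cov(X,  rowvar=False))
ds = mahalanobis(Xs[0], Xs[1], np.cov(Xs, rowvar=False))
print(np.isclose(d, ds))  # True: with M the sample covariance, units cancel
```

The cancellation follows because rescaling by S turns the covariance into S M Sᵀ, and the two S factors cancel inside the quadratic form.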


Clustering K-means

Other partition-based methods

Other partition-based methods with related ideas:

K-medoids[3]: requires cluster centroids µi to be an observation xj.
K-medians: cluster centroids represented by a median in each dimension.
K-modes: cluster centroids represented by a mode estimated from a cluster.

[3] See also Affinity propagation.


Clustering Hierarchical Clustering

Hierarchical Clustering

Hierarchically structured data is ubiquitous (genus, species, subspecies, individuals, ...). There are two general strategies for generating hierarchical clusters; both proceed by seeking to minimize some measure of overall dissimilarity.

Agglomerative / Bottom-Up / Merging
Divisive / Top-Down / Splitting

Higher-level clusters are created by merging clusters at lower levels. This process can easily be viewed as a tree/dendrogram, and it avoids specifying how many clusters are appropriate.

hclust, agnes {cluster}


Clustering Hierarchical Clustering

EU Indicators Data

6 Economic indicators for EU countries in 2011.

> eu <- read.csv('http://www.stats.ox.ac.uk/~sejdinov/sdmml/data/eu_indicators.csv', sep=' ')
> eu[1:15,]
    Countries abbr    CPI   UNE    INP     BOP     PRC UN_perc
1     Belgium   BE 116.03  4.77 125.59   908.6  6716.5    -1.6
2    Bulgaria   BG 141.20  7.31 102.39    27.8  1094.7     3.5
3   CzechRep.   CZ 116.20  4.88 119.01  -277.9  2616.4    -0.6
4     Denmark   DK 114.20  6.03  88.20  1156.4  7992.4     0.5
5     Germany   DE 111.60  4.63 111.30   499.4  6774.6    -1.3
6     Estonia   EE 135.08  9.71 111.50   153.4  2194.1    -7.7
7     Ireland   IE 106.80 10.20 111.20  -166.5  6525.1     2.0
8      Greece   EL 122.83 11.30  78.22  -764.1  5620.1     6.4
9       Spain   ES 116.97 15.79  83.44  -280.8  4955.8     0.7
10     France   FR 111.55  6.77  92.60  -337.1  6828.5    -0.9
11      Italy   IT 115.00  5.05  87.80  -366.2  5996.6    -0.5
12     Cyprus   CY 116.44  5.14  86.91 -1090.6  5310.3    -0.4
13     Latvia   LV 144.47 12.11 110.39    42.3  1968.3    -3.6
14  Lithuania   LT 135.08 11.47 114.50   -77.4  2130.6    -4.3
15 Luxembourg   LU 118.19  3.14  85.51  2016.5 10051.6    -3.0

Data from Greenacre (2012)


Clustering Hierarchical Clustering

EU Indicators Data

dat <- scale(eu[,3:8])
rownames(dat) <- eu$Countries
biplot(princomp(dat))

[Figure: PCA biplot of the scaled EU indicators (CPI, UNE, INP, BOP, PRC, UN_perc), with the EU countries plotted on the first two principal components.]


Clustering Hierarchical Clustering

Visualising Hierarchical Clustering

> hc <- hclust(dist(dat))
> plot(hc, hang=-1)

[Figure: cluster dendrogram of the EU countries, from hclust (*, "complete") on dist(dat).]

> library(ape)
> plot(as.phylo(hc), type = "fan")

[Figure: the same hierarchical clustering of the EU countries drawn as a fan-style tree with the ape package.]


Clustering Hierarchical Clustering

Visualising Hierarchical Clustering

Levels in the dendrogram represent a dissimilarity between examples.

Tree dissimilarity: dT_ij = minimum height in the tree at which examples i and j belong to the same cluster. It satisfies the ultrametric (stronger than the triangle) inequality:

dT_ij ≤ max{dT_ik, dT_kj}.

Hierarchical clustering can be interpreted as an approximation of a given dissimilarity dij by an ultrametric dissimilarity.
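The ultrametric property can be verified directly: run a naive agglomerative clustering (single linkage here), record the height at which each pair first joins, and check the inequality on every triple. A pure-Python sketch for exposition only (hclust does this far more efficiently):

```python
import itertools, math

def single_linkage_heights(points):
    """Naive agglomerative single-linkage clustering. Returns the tree
    dissimilarity dT: the merge height at which i and j first share a cluster."""
    n = len(points)
    clusters = [{i} for i in range(n)]
    dT = [[0.0] * n for _ in range(n)]
    while len(clusters) > 1:
        # closest pair of clusters under single linkage (minimum pairwise distance)
        best = None
        for A, B in itertools.combinations(clusters, 2):
            h = min(math.dist(points[i], points[j]) for i in A for j in B)
            if best is None or h < best[0]:
                best = (h, A, B)
        h, A, B = best
        for i in A:
            for j in B:
                dT[i][j] = dT[j][i] = h   # height at which i and j first merge
        clusters.remove(A)
        clusters.remove(B)
        clusters.append(A | B)
    return dT

# two tight pairs and one outlier (made-up coordinates)
points = [(0, 0), (0, 1), (5, 0), (5, 1), (10, 0)]
dT = single_linkage_heights(points)
```

Every triple of points then satisfies dT_ij ≤ max{dT_ik, dT_kj}, whatever the original coordinates were.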


Clustering Hierarchical Clustering

Measuring Dissimilarity Between Clusters

To join clusters Ci and Cj into super-clusters, we need a way to measure the dissimilarity D(Ci, Cj) between them.

[Figure: five points x1, . . . , x5 split into clusters {x1, x2} and {x3, x4, x5}, illustrating single linkage d(x1, x3), complete linkage d(x2, x5), and average linkage (1/6) ∑_{i=1}^{2} ∑_{j=3}^{5} d(xi, xj).]


Clustering Hierarchical Clustering

Measuring Dissimilarity Between Clusters

To join clusters Ci and Cj into super-clusters, we need a way to measure the dissimilarity D(Ci, Cj) between them.

(a) Single linkage: elongated, loosely connected clusters,

    D(Ci, Cj) = min{d(x, y) : x ∈ Ci, y ∈ Cj}

(b) Complete linkage: compact clusters; relatively similar objects can remain separated at high levels,

    D(Ci, Cj) = max{d(x, y) : x ∈ Ci, y ∈ Cj}

(c) Average linkage: tries to balance the two above, but is affected by the scale of the dissimilarities,

    D(Ci, Cj) = avg{d(x, y) : x ∈ Ci, y ∈ Cj}
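The three linkages translate directly into code over a precomputed dissimilarity matrix. An illustrative Python sketch with made-up dissimilarities; note that single ≤ average ≤ complete always holds:

```python
def single(d, Ci, Cj):    # most similar pair across the two clusters
    return min(d[i][j] for i in Ci for j in Cj)

def complete(d, Ci, Cj):  # most dissimilar pair across the two clusters
    return max(d[i][j] for i in Ci for j in Cj)

def average(d, Ci, Cj):   # mean dissimilarity over all cross-cluster pairs
    vals = [d[i][j] for i in Ci for j in Cj]
    return sum(vals) / len(vals)

# hypothetical dissimilarities between 4 items, split into clusters {0,1} and {2,3}
d = [[0, 1, 4, 6],
     [1, 0, 2, 5],
     [4, 2, 0, 1],
     [6, 5, 1, 0]]
Ci, Cj = [0, 1], [2, 3]
print(single(d, Ci, Cj), average(d, Ci, Cj), complete(d, Ci, Cj))  # 2 4.25 6
```

Swapping one of these functions for another inside an agglomerative loop is all that distinguishes the three clustering variants.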


Clustering Hierarchical Clustering

Hierarchical Clustering on Artificial Dataset

−20 0 20 40 60 80 100

−40

−20

020

4060

80

V1

V2

1

2

3

4

5

6

7

8

910

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

3233 34

3536 37

3839

40

4142

43

44 45

4647

4849

50

51

52

5354

55

56

5758

59

60

6162

63

64

65

66 67

68

6970

71

72

7374

7576

777879

80

8182

8384

85

86

87

888990

91

92

93

94

95

96

97

9899

100101

102103

104

105106

107

108109

110

111112

113114

115

116117

118

119

120

121122

123

124

125

126

127128

129

130

131

132133

134135

136

137

138 139

140141142

143144 145

146

147

148149

150151

152

153

154

155156

157

158

159160

161162163

164

165

166

167 168

169

170171

172173

174

175

176 177

178

179180

181 182183

184

185

186

187188

189190191

192193 194195

196 197198

199

200201

202

203

204

205

206

207208

209210 211

212

213

214

215

216

217

218

219 220

221 222223

224225

226

227228229

230

231

232

233

234

235

236237

238

239

240241242

243

244

245246247

248

249

250251

252

253 254 255256257

258259

260261

262

263

264 265

266

267268

269

270

271272

273274

275

276

277

278

279

280

281282

283284285286

287

288

289 290

291

292

293

294

295

296

297

298299

300 301

302

303

304305

306

307308

309

310

311

312313

314

315

316

317

318319

320

321

322

323

324

325

326

327

328

329

330

331

332333334 335 336337

338339

340

341

342 343

344345

346347

348

349

350

351

352

353

354

355

356357

358

359

360

361

362

363364

365

366

367

368369

370371372

373

374375

376377378

379380

381382

383384385

386

387

388

389

390391

392393

394

395

396 397398

399

400

401402

403

404

405

406

407

408

409

410

411412

413

414

415

416

417 418419420

421

422

423424

425426

427

428

429

430

431432433

434

435

436437

438

439

440441 442

443

444445

446447 448

449

450

451

452453

454

455

456

457

458

459460

461

462

463464

465466467

468

469

470

471

472

473

474

475476

477478 479

480

481

482

483

484

485

486

487

488489

490491 492

493

494

495496

497

498499

500

501

502

503 504

505

506507

508

509510

511

512

513

514

515

516

517 518519

520521 522

523

524

525

526

527

528

529

530

531

532

533534

535

536

537

538539 540

541

542

543

544

545

546

547

548549

550

551552

553

554

555

556557

558

559

560

561

562

563

564

565

566567

568

569570

571 572573

574

575576577

578

579

580581

582 583

584585586587

588

589590

591

592593 594

595

596 597

598

599600

601

602603604

605

606

607

608

609

610

611

612

613

614615

616617

618619

620

621622

623 624

625

626

627

628629

630

631

632

633

634635636

637638

639

640641

642643 644

645

646

647

648

649650

651

652

653

654

655

656657

658659

660

661

662663

664 665

666

667

668669

670

671672

673674

675 676

677

678

679

680

681

682683

684

685

686

687688

689

690

691692

693

694

695696

697698699700701

702

703 704

705

706

707

708

709

710

711

712

713

714

715

716717

718

719

720

721722

723 724725

726

727728729

730731

732

733

734

735

736

737738

739

740

741

742743

744

745

746

747

748

749

750751 752753

754755756

757

758

759

760

761762

763764

765

766

767768

769

770771

772 773774

775

776777

778

779

780781

782

783784

785786787

788

789

790791

792

793794 795796797

798

799

800801

802803

804

805

806

807

808

809810

811

812

813814

815

816

817

818819

820821

822

823

824825

826 827828

829

830

831

832 833

834

835

836

837838

839 840

841

842

843844

845

846

847

848849

850

851

852

853854

855856

857 858

859

860861

862

863

864

865866 867

868

869

870

871

872 873874

875876

877878879880

881

882

883

884

885886 887888

889

890

891

892893

894895

896

897

898

899

900

901

902

903904

905

906

907 908

909

910

911912

913

914

915916

917

918919

920

921922

923

924

925926

927

928929930

931

932

933

934

935

936

937

938

939

940

941942

943944

945

946

947

948

949

950951952

953

954

955

956

957

958959960

961

962

963

964965

966967

968969

970

971972

973974

975976

977

978979

980

981

982983984

985

986

987

988

989990

991

992

993

994

995

996997998

9991000

1001

1002

1003

1004

10051006

1007

1008 1009

1010

10111012

1013

1014

1015

1016

1017

1018

1019

1020

1021

1022

10231024

1025102610271028

1029 103010311032

10331034 1035

1036

1037

1038

1039

1040

1041

1042

10431044

1045

1046

1047

10481049

1050

1051

1052

1053

10541055

1056

1057

1058105910601061

1062

1063

1064

1065 1066

1067

1068

1069

1070

1071

1072

1073

1074

1075 1076

1077

1078

1079

1080

1081

1082

1083

10841085

1086

1087

1088

1089

10901091

1092

1093

10941095

1096 10971098

1099

11001101

1102

11031104

11051106

1107

11081109

1110

1111

111211131114

1115

1116

11171118

1119

1120

11211122

11231124

1125

1126

1127

1128

11291130

1131

11321133

1134

1135 1136

1137

11381139

1140

11411142 11431144

1145

1146

1147

11481149

1150

11511152

1153

1154

11551156

1157

1158

1159

11601161

1162

1163

1164

116511661167

11681169

1170

11711172

1173

1174

1175

117611771178

1179

1180

1181

11821183

1184

11851186

1187

11881189

1190

1191

1192

1193

1194

1195

1196

119711981199

1200

1201

1202

1203

12041205

1206

1207

1208

12091210

1211

1212

1213

1214

1215

1216

12171218

1219

1220

122112221223

1224

1225

12261227

12281229

1230

1231

1232

1233

1234

1235

1236

1237

1238

123912401241

1242

1243

1244

1245

12461247

12481249

1250

1251

1252

1253

125412551256

1257

12581259

1260126112621263

1264

1265

1266 12671268

1269

12701271

1272

1273

1274

1275

12761277

1278

1279 12801281

1282

1283

1284

1285

1286

1287

1288

1289

1290

1291

1292

12931294

12951296

1297

1298

129913001301

1302

1303

13041305 1306

1307

1308

13091310

13111312

1313

13141315

131613171318

1319

1320

13211322

1323

1324

1325

13261327

1328 132913301331

13321333

1334

1335

1336

1337

1338

13391340

13411342

1343

1344 13451346

1347

1348

13491350

1351

1352

1353

1354

13551356

13571358

1359

1360

13611362

1363

1364

1365

1366

1367

136813691370 1371

1372

1373

1374

137513761377

1378

1379

1380

1381

13821383

13841385

1386

1387

13881389

139013911392

13931394

1395

1396

1397

13981399

14001401

1402

1403

1404

1405140614071408

1409

1410

1411

14121413

1414

14151416

1417

1418

1419

1420

14211422

1423

1424

1425

1426

142714281429

1430

1431

1432

14331434

14351436

1437

1438

1439

144014411442 1443

1444

1445

1446

1447

14481449

1450

1451

1452

14531454

1455

1456

1457

14581459

14601461

1462

14631464

1465146614671468

1469

1470

1471

14721473

14741475

1476

1477

14781479

14801481

14821483

148414851486

14871488

1489

1490

1491

14921493

14941495

1496

1497

14981499

1500

15011502

1503

1504

150515061507

1508

1509


Page 37: Statistical Data Mining and Machine Learning Hilary Term 2016

Clustering Hierarchical Clustering

Hierarchical Clustering on Artificial Dataset

library(cluster)

dat = xclara   # 3000 x 2

# plot the data
plot(dat, type = "n")
text(dat, labels = row.names(dat))

plot(agnes(dat, method = "single"))
plot(agnes(dat, method = "complete"))
plot(agnes(dat, method = "average"))
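The agnes calls above build agglomerative dendrograms under three linkage criteria (single, complete, average). As a rough analogue outside R, scipy's hierarchical clustering implements the same criteria; the sketch below uses a synthetic three-blob dataset as a stand-in for xclara (which is an R package dataset and not assumed available here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for xclara: three well-separated 2-D Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 2))
    for c in ([0, 0], [5, 5], [10, 0])
])

# Agglomerative clustering under the three linkage criteria from the slides
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # (n-1) x 4 merge table
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut tree into 3 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes
```

Each row of the merge table Z records which two clusters were merged and at what height, which is exactly the information the dendrograms on the following slides visualise.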

Page 38: Statistical Data Mining and Machine Learning Hilary Term 2016

Clustering Hierarchical Clustering

Hierarchical Clustering on Artificial Dataset

[Figure: Dendrogram of agnes(x = dat, method = "single"); Agglomerative Coefficient = 0.93; y-axis: Height; leaf labels omitted]

Page 39: Statistical Data Mining and Machine Learning Hilary Term 2016

Clustering Hierarchical Clustering

Hierarchical Clustering on Artificial Dataset

[Figure: dendrogram on the same dataset; leaf labels omitted]

1139

1317 91

111

8897

214

8716

1411

9313

5492

918

4617

5416

6020

0912

9616

1312

1013

5817

8512

4317

2720

4715

3418

0819

7393

713

6820

4919

3213

1818

9615

2319

2915

5920

3197

716

2315

1411

9715

5518

7313

3216

5917

1713

8517

1511

7117

3617

6616

6219

1115

0816

76 909

2046

1305

1417

939

1209

2041

1191

1461

1427

1446

1686

1412

1693

1993

1539

1608

1031

1073

2039

1394

1473

1112

1953

1252

1396

1177

1301

1350

1750

1496

1819

1033

1741

1460

1889

1310

1377

2004

1666

1414

940

1135

1365

1402

1294

1435

986

2002

1726

1240

1869

1989

1631

1699

1805

1167

1266

1323

1391 17

3910

1011

2112

5010

7919

4315

4719

0515

2117

0315

9616

1717

3812

0215

8814

0817

4518

5310

2018

5019

75 1648

1137

1589

1366

1590

1753

1807

1401

2000

1951

1525

1930

1442

1772

1618

902

957

1196

1337 91

010

8318

4910

5318

0118

44 906

1001

1576

1404

1450

1386

1477 15

1790

516

4512

7314

8117

7518

2013

1916

6720

19 923

1735

1145

954

1124

1593

1426

1910

1964

2014 96

816

7120

17 1387

1475

1536

1574

1733

1571

1719

1378

2024

1455

1506

1395

2043

1854 2018

1127

1373

1675 16

5856

226

6223

80 2467

2059

2819

2578

2215

2288

2744

2150

2628

2950

2571

2221

2310

2497

2560

2739

2327

2443

2106

2670

2895

2356

2396

2508

2979 2

125

2129

2241

2595

2244

2915

2554

2698

2299

2452

2453

2690

2933

2792

2052

2431

2296

2711

2713 2077

2454

2674

2606

2768

2912

2105

2833

2886

2167

2333

2177

2472

2907

2389

2563

2723

2147

2538

2318

2934

2957

2544

2659

2402

2699

2426

2904 20

7028

2121

8328

5326

3129

8325

5128

1722

7529

2125

1827

3126

8728

5928

5027

0822

8028

8729

6324

8326

8528

0424

4829

2629

32 2524

2586

2610

2772

2991

2055

2825

2120

2938

2860

2706

2806

2149

2407

2213

2169

2925

2267

2400

2305

2308

2930

2529

2570

2786

2066

2703

2281

2629

2078

2284

2139

2602

2465

2097

2658

2669

2709

2878

2973

2489

2672

2521

2935

2115

2697

2954

2620

2888

2837

2869

2409

2788

2832

2614

2737

2414

2964

2649

2107

2186

2385

2975

2193

2776

2231

2138

2834

2185

2234

2338

2583

2218

2939

2715

2291

2526

2909

2418

2947

2530

2326

2395

2618

2668

2386

2972

2678

2080

2137

2562

2531

2130

2322

2425

2232

2290

2683

2760

2293

2458

2984

2413

2976

2836

2928

2608

2809

2084

2098

2546

2870

2245

2471

2158

2429

2276

2929

2428

2748

2779

2362

2789

2522

2684

2913

2063

2255

2545

2103

2814

2829

2392

2575

2088

2657

2282

2604

2682

2569

2865

2953

2091

2162

2450

2956

2109

2271

2728

2269

2411

2225

2758

2371

2496

2587

2302

2434

2092

2942

2430

2632

2251

2260

2838

2960

2173

2780

2306

2828

2500

2986

2553

2861

2891

2134

2166

2181

2214

2257

2717

2237

2980

2303

2294

2733

2770

2051

2507

2948

2420

2461

2516

2079

2199

2387

2745

2112

2229

2143

2378

2511

2823

2174

2810

2548

2675

2816

2131

2652

2692

2908

2060

2738

2146

2607

2564

2730

2805

2543

2911

2750

2298

2646

2841

2630

2773

2062

2207

2297

2341

2121

2128

2961

2813

2881

2884

2270

2352

2390

2474

2074

2981

2585

2846

2307

2375

2437

2936

2250

2332

2182

2777

2227

2292

2880

2127

2918

2163

2180

2749

2565

2839

2807

2688

2871

2754

2771

2058

2990

2316

2086

2108

2625

2910

2272

2946

2191

2391

2504

2439

2872

2317

2360

2558

2424

2914

2611

2651

2820

2393

2584

2436

2514

2576

2085

2988

2376

2993

2959

2256

2849

2506

2444

2478

2647

2113

2381

2505

2151

2827

2423

2719

2373

2968

2309

2985

2368

2958

2094

2756

2831

2114

2135

2312

2645

2822

2863

2156

2353

2422

2552

2767

2457

2890

2894

2494

2634

2905

2643

2664

2064

2479

2252

2188

2230

2673

2075

2168

2337

2559

2795

2835

2900 22

3520

9623

0028

1128

5622

0123

3422

8624

2126

0924

1627

0026

9528

7522

1922

7725

6624

0824

8727

5323

4923

7428

5528

6423

8820

7123

8326

1729

6529

9823

2823

4824

7625

1320

9023

2924

8523

1123

5121

4227

8125

9728

1520

9525

4926

7728

9822

6524

1024

9329

6921

2425

1522

4925

6121

3223

5827

0524

9222

1728

4424

5128

0224

0625

2526

3822

6327

4123

6424

5926

3322

6228

7628

7324

6226

3924

9125

9423

1528

1223

8224

0124

9827

5524

1526

8129

9223

4326

3523

6526

8626

2223

6925

7928

5228

8324

2725

3526

5320

5627

8426

0128

4529

4925

1222

8323

5521

5526

0323

4523

6721

0226

5527

8324

8221

3627

3630

0021

9522

4827

6221

4027

1627

3421

9829

1725

2329

6226

7928

9324

1228

62 247

722

5327

2424

4623

4427

1029

1920

6829

7421

7925

7428

0821

6029

5522

4723

8424

1722

5928

9923

1424

7327

2624

7025

3323

2429

7828

4026

4829

9526

6126

9129

8727

4721

0026

1925

3726

1221

7229

7025

4728

7729

8928

00 215

721

9628

89 2791

2313

2851

2116

2189

2480

2796

2347

2671 24

4728

3029

0328

0329

4577

029

4020

7222

8525

9823

7221

7625

8021

5323

4224

3225

0325

2025

9925

5528

5420

9326

4121

2623

9724

5627

5225

0127

6329

2425

3925

4229

3721

6123

5029

22 2164

2304

2660

2363

2206

2481

2274

2233

2435

2449

2572

2740

2122

2818

2440

2720

2778

2592

2885

2615

2721

2145

2240

2775

2751

2123

2769

2133

2246

2761

2951

2222

2665

2650

2994

2712

2101

2119

2759

2264

2727

2757 21

8427

3220

5325

9620

9928

6720

7623

3125

5721

8723

9922

0324

4127

2529

2726

9428

4220

8722

2622

2024

5524

6629

9626

4228

5727

9829

6725

8125

9123

0124

8624

4226

4022

8925

1023

2525

8829

8223

4025

4026

2320

5723

4626

2420

8921

4824

9921

7522

6827

0722

1121

9024

4528

4325

6827

4620

6727

6522

2428

4823

9426

8026

4422

0924

6425

3628

9621

1822

2825

2725

7324

0522

3622

3823

9823

7027

4223

6127

9321

1026

1622

9522

0425

6727

4322

2323

5722

5427

8228

2621

1723

2327

9725

5626

8926

2727

1421

7028

6822

7329

3122

3925

0222

6623

3528

2420

6123

5921

5920

6525

9321

9223

3627

6624

8425

7726

5429

9720

8322

4221

4129

20 2794

2069

2589

2799

2943

2171

2729

2916

2534

2212

2287

2952

2977

2202

2321

2874

2319

2941

2261

2495

2178

2621

2902

2404

2656

2438

2517

2205

2475

2460

2532

2704

2278

2403

2582

2258

2613

2330

2419

2488

2073

2696

2676

2194

2636

2216

2144

2366

2923

2200

2243

2320

2774

2666

2590

2801

2605

2966

2210

2054

2165 2339

2104

2858

2702

2897 21

5428

4720

8121

1122

7921

5224

9024

3327

3526

2620

8224

6929

0127

0129

7127

8528

6625

1927

2229

4428

8229

0622

0826

6727

9024

6825

2828

7923

5426

9324

6327

1829

9923

7926

3725

5025

0926

0027

8728

9225

4126

6321

9727

6423

77

020

4060

8010

012

014

0

Dendrogram of agnes(x = dat, method = "complete")

Agglomerative Coefficient = 0.99dat

Hei

ght

Page 40: Statistical Data Mining and Machine Learning, Hilary Term 2016

Clustering: Hierarchical Clustering

Hierarchical Clustering on Artificial Dataset

[Figure: Dendrogram of agnes(x = dat, method = "average"); y-axis: Height; Agglomerative Coefficient = 0.99. Individual leaf labels omitted — unrecoverable plot residue.]

Page 41: Statistical Data Mining and Machine Learning, Hilary Term 2016

Clustering: Hierarchical Clustering

Using Dendrograms

Different ways of measuring dissimilarity result in different trees.

Dendrograms are useful for getting a feel for the structure of high-dimensional data, though they do not represent distances between observations well.

Dendrograms show hierarchical clusters with respect to increasing values of dissimilarity between clusters: cutting a dendrogram horizontally at a particular height partitions the data into disjoint clusters, represented by the vertical lines the cut intersects. Cutting horizontally effectively reveals the state of the clustering algorithm at the point where the dissimilarity between clusters is no more than the cut value.

Despite the simplicity of this idea and the drawbacks above, hierarchical clustering methods provide users with interpretable dendrograms that allow clusters in high-dimensional data to be better understood.
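The horizontal-cut idea can be sketched in code. The slides use R's agnes (from the cluster package); the following is an analogous Python illustration with SciPy. The artificial three-blob dataset, the random seed, and the cut height t = 2.5 are assumptions chosen for the example, not taken from the lecture.

```python
# Sketch: agglomerative clustering and cutting the dendrogram at a height.
# Dataset, seed, and cut height are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Artificial dataset: three well-separated Gaussian blobs in 2D.
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2))
               for c in [(0.0, 0.0), (3.0, 0.0), (0.0, 3.0)]])

# Complete-linkage agglomerative clustering; Z encodes the dendrogram
# (each row: the two clusters merged and the height of the merge).
Z = linkage(X, method="complete")

# Cutting the dendrogram horizontally at height t assigns each point to a
# cluster; clusters are merged only while their dissimilarity is <= t.
labels = fcluster(Z, t=2.5, criterion="distance")
print("clusters at height 2.5:", len(np.unique(labels)))
```

Lowering t cuts the tree closer to the leaves and yields more, smaller clusters; raising it toward the root merges everything into one.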