
Page 1: Data Exploration and Unsupervised Learning with Clustering

2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO


Paul F Rodriguez,PhD San Diego Supercomputer Center

Predictive Analytic Center of Excellence

Page 2: Data Exploration and Unsupervised Learning with Clustering


Clustering Idea

• Given a set of data, can we find a natural grouping?

[Scatter plot: two sets of points, X1 and X2]

Essential R commands:
D = rnorm(12, 0, 1)               # generate 12 random normals
X1 = matrix(D, 6, 2)              # put into a 6x2 matrix
X1[,1] = X1[,1] + 4               # shift the center
X1[,2] = X1[,2] + 2
# repeat for another set of points, e.g. shifted the other way:
X2 = matrix(rnorm(12, 0, 1), 6, 2)
X2[,1] = X2[,1] - 4
X2[,2] = X2[,2] - 2
# bind the data points and plot
plot(rbind(X1, X2), xlim=c(-10,10), ylim=c(-10,10))

Page 3: Data Exploration and Unsupervised Learning with Clustering


Why Clustering

• A good grouping implies some structure
• In other words, given a good grouping, we can then:
  • Interpret and label clusters
  • Identify important features
  • Characterize new points by the closest cluster (or nearest neighbors)
  • Use the cluster assignments as a compression or summary of the data

Page 4: Data Exploration and Unsupervised Learning with Clustering


Clustering Objective

• Objective: find subsets that are similar within cluster and dissimilar between clusters

• Similarity is defined by distance measures:
  • Euclidean distance
  • Manhattan distance
  • Mahalanobis distance (Euclidean with dimensions rescaled by variance)
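As a rough sketch of these in base R (the toy matrix X here is hypothetical; dist() and mahalanobis() are standard stats functions):

X = matrix(rnorm(20), 10, 2)    # toy data: 10 points in 2 dimensions
dist(X, method="euclidean")     # pairwise Euclidean distances
dist(X, method="manhattan")     # pairwise Manhattan distances
# Mahalanobis distance of each point from the overall mean
mahalanobis(X, center=colMeans(X), cov=cov(X))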

Page 5: Data Exploration and Unsupervised Learning with Clustering


Kmeans Clustering

• A simple, effective, and standard method

Start with K initial cluster centers
Loop:
  Assign each data point to the nearest cluster center
  Recalculate the mean of each cluster as its new center
Stop when assignments don't change

• Issues: How to choose K? How to choose the initial centers? Will it always stop?
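For concreteness, a minimal from-scratch sketch of this loop (a toy implementation, not the kmeans() used on later slides; it ignores the empty-cluster edge case):

my_kmeans = function(X, K, max_iter=10) {
  centers = X[sample(nrow(X), K), , drop=FALSE]  # K random data points as initial centers
  cl = rep(0, nrow(X))
  for (iter in 1:max_iter) {
    # distances of every point to every center, then nearest-center assignment
    D = as.matrix(dist(rbind(centers, X)))[-(1:K), 1:K, drop=FALSE]
    new_cl = apply(D, 1, which.min)
    if (all(new_cl == cl)) break                 # stop when assignments don't change
    cl = new_cl
    for (k in 1:K)                               # recalculate each cluster mean as its center
      centers[k,] = colMeans(X[cl == k, , drop=FALSE])
  }
  list(cluster=cl, centers=centers)
}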

Page 6: Data Exploration and Unsupervised Learning with Clustering


Kmeans Example

• For K=1, using Euclidean distance, where will the cluster center be?

[Scatter plot of the two point sets X1 and X2]

Page 7: Data Exploration and Unsupervised Learning with Clustering


Kmeans Example

• For K=1, the overall mean minimizes the Sum of Squared Errors (SSE), i.e. the sum of squared Euclidean distances to the center

Essential R commands:
Kresult = kmeans(X, 1, 10, 1)   # 1 random data point as the initial K centers,
                                # 10 max loop iterations,
                                # 1 initial set to try
# Kresult is an R object with subfields:
Kresult$cluster                 # cluster assignments
Kresult$tot.withinss            # total within-cluster SSE

Page 8: Data Exploration and Unsupervised Learning with Clustering


Kmeans Example

[Four scatter plots of the same data clustered with K=1, K=2, K=3, and K=4]

Essential R commands:
inds = which(Kresult$cluster == K)   # points assigned to cluster K
plot(X[inds,], col="red")            # plot each cluster in its own color
…

As K increases, individual points get their own clusters

Page 9: Data Exploration and Unsupervised Learning with Clustering


Choosing K for Kmeans

[Plot: total within-cluster SSE for K = 1 to 10]

Essential R commands:
sse = rep(0, 10)
for (num_k in 1:10) {
  Kres = kmeans(X, num_k, 10, 1)
  sse[num_k] = Kres$tot.withinss   # save the total within-cluster SSE
}
plot(1:10, sse, type="b")          # then plot SSE vs. K

- Not much improvement after K=2 (the "elbow")

Page 10: Data Exploration and Unsupervised Learning with Clustering


Kmeans Example – more points

How many clusters should there be?

Page 11: Data Exploration and Unsupervised Learning with Clustering


Choosing K for Kmeans

[Plot: total within-cluster SSE for K = 1 to 10]

- Smooth decrease for K ≥ 2, so K is harder to choose
- In general, a smoother decrease => less structure

Page 12: Data Exploration and Unsupervised Learning with Clustering


Kmeans Guidelines

• Choosing K:
  • "Elbow" in total within-cluster SSE as K=1…N
  • Cross-validation: hold out points, compare fit as K=1…N
• Choosing initial starting points:
  • Take K random data points, do several Kmeans runs, keep the best fit
• Stopping:
  • May converge to sub-optimal clusters
  • May get stuck or converge slowly (point assignments bounce around); 10 iterations is often good

Page 13: Data Exploration and Unsupervised Learning with Clustering


Kmeans Example, uniform data

[Four scatter plots of uniform random data clustered with K=1, K=2, K=3, and K=4]

Page 14: Data Exploration and Unsupervised Learning with Clustering


Choosing K - uniform

[Plot: total within-cluster SSE for K = 1 to 10]

- Smooth decrease across K => less structure

Page 15: Data Exploration and Unsupervised Learning with Clustering


Kmeans Clustering Issues

• Scale:
  • Dimensions with large numbers may dominate distance metrics
• Outliers:
  • Outliers can pull the cluster mean; K-medoids uses medians instead of means
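Two common remedies, sketched (scale() is base R; pam() is from the 'cluster' package, assumed installed):

Xs = scale(X)             # center each column and rescale to unit variance
Kresult = kmeans(Xs, 2, 10, 1)

library('cluster')
Presult = pam(X, 2)       # K-medoids (Partitioning Around Medoids)
Presult$medoids           # cluster centers that are actual data points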

Page 16: Data Exploration and Unsupervised Learning with Clustering


Soft Clustering Methods

• Fuzzy Clustering:
  • Use weighted assignments to all clusters
  • Find the minimum weighted SSE
• Expectation-Maximization:
  • Mixture of multivariate Gaussian distributions
  • Find the cluster means & variances that maximize the likelihood
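A minimal sketch of the fuzzy variant (using cmeans() from the 'e1071' package, an assumption; the slides do not name a package):

library('e1071')
fz = cmeans(X, centers=2)   # fuzzy c-means: minimizes the weighted SSE
fz$membership               # each point's weighted assignment to every cluster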

Page 17: Data Exploration and Unsupervised Learning with Clustering


Kmeans – unequal cluster variance

Page 18: Data Exploration and Unsupervised Learning with Clustering


Choosing K – unequal distributions

[Plot: total within-cluster SSE for K = 1 to 10]

- Smooth decrease across K => less structure

Page 19: Data Exploration and Unsupervised Learning with Clustering


EM clustering

Essential R commands:
library('mclust')
em_fit = Mclust(X)
plot(em_fit)

• Selects K=2 (Bayesian Information Criterion)

• Handles unequal variance

Page 20: Data Exploration and Unsupervised Learning with Clustering


Kmeans computations

• Distance of each point to each cluster center:
  • For N points, D dimensions: each loop requires N*D*K operations
• Update cluster centers:
  • Only track points that change, and get the change in the cluster center
• On HPC:
  • Distance calculations can be partitioned across the data dimensions
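A rough sketch of partitioning that distance work (base R 'parallel' package; X, centers, and K are assumed from the earlier examples):

library(parallel)
# split the N points into blocks; compute point-to-center distances per block
blocks = split(1:nrow(X), cut(1:nrow(X), 4))
dists = mclapply(blocks, function(ix)
          as.matrix(dist(rbind(centers, X[ix, , drop=FALSE])))[-(1:K), 1:K],
        mc.cores=4)          # mc.cores > 1 is not supported on Windows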

Page 21: Data Exploration and Unsupervised Learning with Clustering


R Kmeans Performance

[Plot: wall time (secs) vs. number of dimensions (columns in the data matrix, 1K to 32K), one curve per number of points (rows): 1000, 2000, 4000, and 8000 pts; runtimes reach the 30-60 minute range]

1 Gordon compute node, normal random matrices
R: system.time(kmeans())

Page 22: Data Exploration and Unsupervised Learning with Clustering


Kmeans vs EM performance

[Plot: wall time (secs) vs. number of dimensions (columns in the data matrix, 1K to 8K) for EM (1000 pts) and Kmeans (1000 pts); EM runtimes reach the 10-15 minute range]

1 Gordon compute node, normal random matrices
R: system.time(Mclust())

Page 23: Data Exploration and Unsupervised Learning with Clustering


Kmeans big data example

• 45,000 NYTimes articles, 102,000 unique words (UCI Machine Learning repository)

• Full Data Matrix: 45K x 102K ≈ 40 GB

[Excerpt of the word-count matrix: rows labeled article 1, article 2, article 3, …, article 45K; cell i,j is the count of the i-th word in the j-th article]
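Since the counts are mostly zeros, in R one would typically keep this sparse rather than materialize 40 GB (a sketch using the 'Matrix' package, assuming the UCI docword file layout of three header lines followed by docID wordID count triples):

library(Matrix)
dw = read.table("docword.nytimes.txt", skip=3)    # docID, wordID, count triples
Xs = sparseMatrix(i=dw[,1], j=dw[,2], x=dw[,3])   # 45K x 102K, stored sparsely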

Page 24: Data Exploration and Unsupervised Learning with Clustering


Matlab original script: distance calculation to cluster centers

[Diagram: the article count matrix X (N x P) and Cluster_Means (M x P), with a column-wise subtraction highlighted]

1. Take the difference of each column of X (N x P) and Cluster_Means (M x P)
2. Square and sum across columns

Works better for large N, small P

Page 25: Data Exploration and Unsupervised Learning with Clustering


Matlab script, altered

[Diagram: X (N x P) and Cluster_Means (M x P), with a row-wise subtraction highlighted]

1. Take the difference of each row
2. Use a dot product for the square-and-sum

Works better for large P, and dot() will use threads
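The same trick sketched in R (names are assumed: X is the N x P data matrix, C is the M x P matrix of cluster means); one matrix product does the square-and-sum:

# all point-to-center squared distances at once,
# using ||x - c||^2 = x.x - 2*x.c + c.c
D2 = outer(rowSums(X^2), rep(1, nrow(C))) -
     2 * (X %*% t(C)) +
     outer(rep(1, nrow(X)), rowSums(C^2))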

Page 26: Data Exploration and Unsupervised Learning with Clustering


Kmeans Matlab Runtime

• Matlab Kmeans (original): ~50 hours

• Matlab Kmeans (distributed): ~10 hours with 8 threads

Page 27: Data Exploration and Unsupervised Learning with Clustering


Kmeans results

• 7 viable clusters found

[Figure: cluster means shown as words, with coordinates determining font size]

Page 28: Data Exploration and Unsupervised Learning with Clustering


Incremental & Hierarchical Clustering

• Start with 1 cluster (all instances) and do splits, OR start with N clusters (1 per instance) and do merges
• Can be greedy & expensive in its search
  • some algorithms might both merge & split
  • algorithms need to store and recalculate distances
• Needs a distance between groups, in contrast to K-means

Page 29: Data Exploration and Unsupervised Learning with Clustering


Incremental & Hierarchical Clustering

• Result is a hierarchy of clusters
  • displayed as a 'dendrogram' tree
• Useful for tree-like interpretations:
  • syntax (e.g. word co-occurrences)
  • concepts (e.g. classification of animals)
  • topics (e.g. sorting Enron emails)
  • spatial data (e.g. city distances)
  • genetic expression (e.g. possible biological networks)
  • exploratory analysis

Page 30: Data Exploration and Unsupervised Learning with Clustering


Incremental & Hierarchical Clustering

• Clusters are merged/split according to a distance or utility measure:
  • Euclidean distance (squared differences)
  • conditional probabilities (for nominal features)
• Options to choose which clusters to 'link':
  • single linkage, mean, average (w.r.t. points in clusters); these may lead to different trees, depending on the spreads
  • Ward's method (smallest increase in within-cluster variance)
  • change in probability of features for given clusters

Page 31: Data Exploration and Unsupervised Learning with Clustering


Linkage options

• e.g. single linkage (closest to any cluster instance)

[Diagram: Cluster1 and Cluster2 with the single-linkage distance drawn]

• e.g. mean (closest to the mean of all cluster instances)

[Diagram: Cluster1 and Cluster2 with the distance to the cluster mean drawn]

Page 32: Data Exploration and Unsupervised Learning with Clustering


Linkage options (cont.)

• e.g. Ward's method (find the new cluster with minimum variance)

[Diagram: Cluster1 and Cluster2 merged into the minimum-variance cluster]

• e.g. average (mean of pairwise distances)

[Diagram: Cluster1 and Cluster2 with the pairwise distances drawn]
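To see how the linkage choice changes the tree, a quick sketch (d here is any distance matrix, e.g. from dist(X)):

d = dist(X)
for (m in c("single", "average", "complete", "ward.D")) {
  plot(hclust(d, method=m), main=m)   # compare the resulting dendrograms
}                                     # ("ward.D" is just "ward" in older R)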

Page 33: Data Exploration and Unsupervised Learning with Clustering


Hierarchical Clustering Demo

• 3888 interactions among 685 proteins, from the Hu et al. TAP dataset (http://www.compsysbio.org/bacteriome/dataset/)

b0009 b0014 0.92
b0009 b2231 0.87
b0014 b0169 1.0
b0014 b0595 0.76
b0014 b2614 1.0
b0014 b3339 0.95
b0014 b3636 0.9
b0015 b0014 0.99
…

Page 34: Data Exploration and Unsupervised Learning with Clustering


Hierarchical Clustering Demo

• Essential R commands:
> d = read.table("hu_tap_ppi.txt")
> str(d)   # show d's structure
'data.frame': 3888 obs. of 3 variables:
 $ V1: Factor w/ 685 levels "b0009","b0014",..: 1 1 2 2 2 2 2 3 3 3 ...
 $ V2: Factor w/ 536 levels "b0011","b0014",..: 2 248 28 66 297 396 ...
 $ V3: num 0.92 0.87 1 0.76 1 0.95 0.9 0.99 0.99 0.93 ...
> fs = c(d[,1], d[,2])   # combine factor levels
> str(fs)
 int [1:7776] 1 1 2 2 2 2 2 3 3 3 ...

Note: strings are read in as "factors"

Page 35: Data Exploration and Unsupervised Learning with Clustering


Hierarchical Clustering Demo

Essential R commands:
P = 685                      # number of proteins
N = nrow(d)                  # number of interactions
C = matrix(0, P, P)          # connection matrix (aka adjacency matrix)
IJ = cbind(d[,1], d[,2])     # factor levels saved as an Nx2 list of i-th, j-th protein
for (i in 1:N) { C[IJ[i,1], IJ[i,2]] = 1 }   # populate C with 1 for connections
install.packages('igraph')
library('igraph')
gc = graph.adjacency(C, mode="directed")
plot.igraph(gc, vertex.size=3, edge.arrow.size=0, vertex.label=NA)   # or just plot(...)

Page 36: Data Exploration and Unsupervised Learning with Clustering


Hierarchical Clustering Demo

• hclust with "single" linkage: chaining

[Dendrogram: items that cluster first join low in the tree; the height is the cluster distance when two clusters are combined]

d2use = dist(C, method="binary")
fit <- hclust(d2use, method="single")
plot(fit)

Page 37: Data Exploration and Unsupervised Learning with Clustering


Hierarchical Clustering Demo

• hclust with "ward" linkage: spherical clusters
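A minimal sketch of that fit (hclust's Ward option is named "ward" in older R versions and "ward.D" in newer ones):

fit_w <- hclust(d2use, method="ward.D")
plot(fit_w)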

Page 38: Data Exploration and Unsupervised Learning with Clustering


Hierarchical Clustering Demo

• Where the height change looks big, cut off the tree:

groups <- cutree(fit, k=7)
rect.hclust(fit, k=7, border="red")

Page 39: Data Exploration and Unsupervised Learning with Clustering


Hierarchical Clustering Demo

• Kmeans vs. Hierarchical: lots of overlap, despite Kmeans not having a 'binary' distance option

groups <- cutree(fit, k=7)
Kresult = kmeans(d2use, 7, 10, 1)
table(Kresult$cluster, groups)

Kmeans cluster assignment (rows) by hierarchical group assignment (columns):

     1   2   3    4   5   6   7
1   26   0   0  293   0  19  23
2    2   1   0    9   0   0  46
3   72   0   0   27   0   2   1
…

Page 40: Data Exploration and Unsupervised Learning with Clustering


Dimensionality Reduction via Principal Components

• Idea: Given N points and P features (aka dimensions), can we represent the data with fewer features?
  • Yes, if features are constant
  • Yes, if features are redundant
  • Yes, if features only contribute noise (conversely, we want features that contribute to the variations of the data)

Page 41: Data Exploration and Unsupervised Learning with Clustering


Dimensionality Reduction via Principal Components

• PCA:
  • Find a set of vectors (aka factors) that describe the data in an alternative way
  • The first component is the vector that maximizes the variance of the data projected onto that vector
  • The k-th component is orthogonal to all k-1 previous components
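In base R this is a one-liner (a sketch using prcomp(), which mean-centers by default; the slides themselves use svd() on the following pages):

pca = prcomp(X)
pca$rotation   # the component vectors (one column per component)
pca$sdev^2     # variance of the data along each component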

Page 42: Data Exploration and Unsupervised Learning with Clustering


PCA on 2012 Olympic Athletes' Height-by-Weight scatter plot

[Scatter plot: Weight in kg vs. Height in cm, both mean centered, with the PC1 and PC2 directions drawn and the projection of the point (145, 5) onto the PCs]

Total variance is conserved:

Var in Weight + Var in Height = Var in PC1 + Var in PC2

In general:

Var in PC1 > Var in PC2 > Var in PC3 > …
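A quick numeric check of that conservation (any two-column matrix will do; the toy data here is hypothetical):

X = cbind(rnorm(100, sd=2), rnorm(100))
pca = prcomp(X)
sum(apply(X, 2, var))   # Var in column 1 + Var in column 2
sum(pca$sdev^2)         # Var in PC1 + Var in PC2: the same total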

Page 43: Data Exploration and Unsupervised Learning with Clustering


PCA on Height by Weight scatter plot

Essential R:
X = scale(X, center=TRUE, scale=FALSE)   # mean-center the data
S <- svd(X)   # returns matrices U, D, V with X = U*D*V'; the cols of V are 2D vectors
# plot each column of V as a line in X's space (use 7th-grade geometry)
points(..., type='l')   # line type
# get the 1st coordinate point in X's space
S$u[,1] * S$v[,1] * S$d[1]
# repeat for the 2nd coordinate point and plot
# PCA can also be done with eigen()

[The same Weight (kg) vs. Height (cm) mean-centered scatter plot, with the principal component lines drawn]

Page 44: Data Exploration and Unsupervised Learning with Clustering


Principal Components

• Can choose k heuristically as the approximation improves, or choose k so that 95% of the data variance is accounted for
• aka Singular Value Decomposition:
  • PCA (via an eigendecomposition) applies to square matrices only; SVD gives the same vectors on square matrices
• Works for numeric data only
  • In contrast, clustering reduces data to categorical groups
  • In some cases, k PCs correspond to k clusters
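A sketch of the 95% rule (using prcomp(); cumulative variance fractions come from the singular values):

pca = prcomp(X)
cumvar = cumsum(pca$sdev^2) / sum(pca$sdev^2)
k = which(cumvar >= 0.95)[1]   # smallest k accounting for 95% of the variance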

Page 45: Data Exploration and Unsupervised Learning with Clustering


Summary

• Having no label doesn’t stop you from finding structure in data

• Unsupervised methods are somewhat related (e.g. k principal components can correspond to k clusters)

Page 46: Data Exploration and Unsupervised Learning with Clustering


End