
    Unsupervised Learning

    Unsupervised vs Supervised Learning:

    Most of this course focuses on supervised learning methods such as regression and classification.

    In that setting we observe both a set of features $X_1, X_2, \ldots, X_p$ for each object, as well as a response or outcome variable $Y$. The goal is then to predict $Y$ using $X_1, X_2, \ldots, X_p$.

    Here we instead focus on unsupervised learning, where we observe only the features $X_1, X_2, \ldots, X_p$. We are not interested in prediction, because we do not have an associated response variable $Y$.


    The Goals of Unsupervised Learning

    The goal is to discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations? We discuss two methods:

    principal components analysis, a tool used for data visualization or data pre-processing before supervised techniques are applied, and

    clustering, a broad class of methods for discovering unknown subgroups in data.


    The Challenge of Unsupervised Learning

    Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response.

    But techniques for unsupervised learning are of growing importance in a number of fields: subgroups of breast cancer patients grouped by their gene expression measurements, groups of shoppers characterized by their browsing and purchase histories, and movies grouped by the ratings assigned by movie viewers.


    Another advantage

    It is often easier to obtain unlabeled data from a lab instrument or a computer than labeled data, which can require human intervention.

    For example, it is difficult to automatically assess the overall sentiment of a movie review: is it favorable or not?


    Principal Components Analysis

    PCA produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance and are mutually uncorrelated.

    Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.


    Principal Components Analysis: details

    The first principal component of a set of features $X_1, X_2, \ldots, X_p$ is the normalized linear combination of the features

    $$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$$

    that has the largest variance. By normalized, we mean that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.

    We refer to the elements $\phi_{11}, \ldots, \phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector, $\phi_1 = (\phi_{11}\ \phi_{21}\ \cdots\ \phi_{p1})^T$.

    We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.


    Computation of Principal Components

    Suppose we have an $n \times p$ data set $\mathbf{X}$. Since we are only interested in variance, we assume that each of the variables in $\mathbf{X}$ has been centered to have mean zero (that is, the column means of $\mathbf{X}$ are zero).

    We then look for the linear combination of the sample feature values of the form

    $$z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \cdots + \phi_{p1}x_{ip} \qquad (1)$$

    for $i = 1, \ldots, n$ that has largest sample variance, subject to the constraint that $\sum_{j=1}^{p} \phi_{j1}^2 = 1$.

    Since each of the $x_{ij}$ has mean zero, then so does $z_{i1}$ (for any values of $\phi_{j1}$). Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n}\sum_{i=1}^{n} z_{i1}^2$.


    Computation: continued

    Plugging in (1), the first principal component loading vector solves the optimization problem

    $$\underset{\phi_{11},\ldots,\phi_{p1}}{\text{maximize}}\ \frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{j1}x_{ij}\right)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\phi_{j1}^2 = 1.$$

    This problem can be solved via a singular-value decomposition of the matrix $\mathbf{X}$, a standard technique in linear algebra.

    We refer to $Z_1$ as the first principal component, with realized values $z_{11}, \ldots, z_{n1}$.
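    As a concrete illustration of the SVD route described above, here is a minimal NumPy sketch (on made-up data, an assumption for the example) that recovers the first loading vector and the first principal component scores:

    ```python
    import numpy as np

    # Minimal sketch: first principal component via the SVD of the
    # column-centered data matrix X (illustrative stand-in data).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 4))
    X = X - X.mean(axis=0)                  # center each column

    # Economy SVD: X = U diag(d) Vt; the rows of Vt are the loading vectors.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)

    phi1 = Vt[0]                            # first loading vector, sum of squares = 1
    z1 = X @ phi1                           # realized scores z_11, ..., z_n1
    print(phi1)
    print(z1.var())                         # largest achievable sample variance, d[0]**2 / n
    ```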


    Geometry of PCA

    The loading vector $\phi_1$ with elements $\phi_{11}, \phi_{21}, \ldots, \phi_{p1}$ defines a direction in feature space along which the data vary the most.

    If we project the $n$ data points $x_1, \ldots, x_n$ onto this direction, the projected values are the principal component scores $z_{11}, \ldots, z_{n1}$ themselves.


    Further principal components

    The second principal component is the linear combination of $X_1, \ldots, X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$.

    The second principal component scores $z_{12}, z_{22}, \ldots, z_{n2}$ take the form

    $$z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \cdots + \phi_{p2}x_{ip},$$

    where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12}, \phi_{22}, \ldots, \phi_{p2}$.


    USArrests data: PCA plot

    [Figure: biplot of the 50 states, with axes First Principal Component and Second Principal Component (roughly -3 to 3) and loading vectors for Murder, Assault, UrbanPop, and Rape shown on the top and right axes (roughly -0.5 to 0.5).]


    Figure details

    The first two principal components for the USArrests data.

    The blue state names represent the scores for the first two principal components.

    The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Rape on the first component is 0.54, and its loading on the second principal component is 0.17 [the word Rape is centered at the point (0.54, 0.17)].

    This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.


    PCA loadings

                    PC1         PC2
    Murder    0.5358995  -0.4181809
    Assault   0.5831836  -0.1879856
    UrbanPop  0.2781909   0.8728062
    Rape      0.5434321   0.1673186
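    For readers who want to reproduce numbers like these, here is a hedged sketch using scikit-learn; it assumes the USArrests data has been saved to a hypothetical file USArrests.csv with one row per state and columns Murder, Assault, UrbanPop, and Rape. Loadings are only determined up to sign, so some columns may come out flipped.

    ```python
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Assumption: USArrests.csv holds the data, with state names in the first column.
    usarrests = pd.read_csv("USArrests.csv", index_col=0)

    Z = StandardScaler().fit_transform(usarrests)    # center and scale to sd one
    pca = PCA(n_components=2).fit(Z)

    loadings = pd.DataFrame(pca.components_.T,
                            index=usarrests.columns,
                            columns=["PC1", "PC2"])
    print(loadings)                                  # compare with the table above
    scores = pca.transform(Z)                        # coordinates plotted in the biplot
    ```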


    Another Interpretation of Principal Components

    [Figure: scatter plot with axes labeled First principal component and Second principal component, both running from -1.0 to 1.0.]


    PCA finds the hyperplane closest to the observations

    The first principal component loading vector has a very special property: it defines the line in $p$-dimensional space that is closest to the $n$ observations (using average squared Euclidean distance as a measure of closeness).

    The notion of principal components as the dimensions that are closest to the $n$ observations extends beyond just the first principal component.

    For instance, the first two principal components of a data set span the plane that is closest to the $n$ observations, in terms of average squared Euclidean distance.


    Scaling of the variables matters

    If the variables are in different units, scaling each to have standard deviation equal to one is recommended.

    If they are in the same units, you might or might not scale the variables.

    [Figure: two biplots of the USArrests data, one with the variables scaled to standard deviation one ("Scaled") and one using the raw variables ("Unscaled"); both show the first and second principal components with loading vectors for Murder, Assault, UrbanPop, and Rape.]
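    A short sketch of the comparison in the figure above, assuming X is the USArrests data as a NumPy array (columns Murder, Assault, UrbanPop, Rape); it fits PCA on the centered data with and without scaling:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA

    Xc = X - X.mean(axis=0)            # centered only ("Unscaled")
    Xs = Xc / Xc.std(axis=0)           # centered and scaled to sd one ("Scaled")

    for name, data in [("unscaled", Xc), ("scaled", Xs)]:
        phi1 = PCA(n_components=1).fit(data).components_[0]
        print(name, np.round(phi1, 3))
    # Without scaling, the variable with the largest variance dominates the
    # first loading vector; with scaling, the variables contribute more evenly.
    ```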


    Proportion Variance Explained

    To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.

    The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

    $$\sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2,$$

    and the variance explained by the $m$th principal component is

    $$\mathrm{Var}(Z_m) = \frac{1}{n}\sum_{i=1}^{n} z_{im}^2.$$

    It can be shown that $\sum_{j=1}^{p}\mathrm{Var}(X_j) = \sum_{m=1}^{M}\mathrm{Var}(Z_m)$, with $M = \min(n-1, p)$.


    Proportion Variance Explained: continued

    Therefore, the PVE of the $m$th principal component is given by the positive quantity between 0 and 1

    $$\frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2}.$$

    The PVEs sum to one. We sometimes display the cumulative PVEs.

    [Figure: left, the proportion of variance explained by each of the four principal components; right, the cumulative proportion of variance explained.]
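    A minimal sketch of the PVE computation above (assuming a data matrix X, as in the earlier SVD sketch); the two printed vectors correspond to the left and right panels of the figure:

    ```python
    import numpy as np

    Xc = X - X.mean(axis=0)                           # center the columns
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

    pve = d**2 / np.sum(d**2)          # Var(Z_m) divided by the total variance
    print(np.round(pve, 3))            # PVE of each principal component
    print(np.round(np.cumsum(pve), 3)) # cumulative PVE
    ```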



    How many principal components should we use?

    If we use principal components as a summary of our data, how many components are sufficient?

    There is no simple answer to this question, as cross-validation is not available for this purpose. Why not? When could we use cross-validation to select the number of components?

    The scree plot shown earlier can be used as a guide: we look for an elbow.


    Clustering

    Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set.

    We seek a partition of the data into distinct groups so that the observations within each group are quite similar to each other.

    To make this concrete, we must define what it means for two or more observations to be similar or different.

    Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.


    PCA vs Clustering

    PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance.

    Clustering looks for homogeneous subgroups among the observations.


    Two clustering methods

    In K-means clustering, we seek to partition the observations into a pre-specified number of clusters.

    In hierarchical clustering, we do not know in advance how many clusters we want; in fact, we end up with a tree-like visual representation of the observations, called a dendrogram, that allows us to view at once the clusterings obtained for each possible number of clusters, from 1 to n.


    K-means clustering

    [Figure: three panels showing the results for K = 2, K = 3, and K = 4.]

    A simulated data set with 150 observations in 2-dimensional space. Panels show the results of applying K-means clustering with different values of K, the number of clusters. The color of each observation indicates the cluster to which it was assigned using the K-means clustering algorithm. Note that there is no ordering of the clusters, so the cluster coloring is arbitrary. These cluster labels were not used in clustering; instead, they are the outputs of the clustering procedure.


    Details of K-means clustering

    Let $C_1, \ldots, C_K$ denote sets containing the indices of the observations in each cluster. These sets satisfy two properties:

    1. $C_1 \cup C_2 \cup \cdots \cup C_K = \{1, \ldots, n\}$. In other words, each observation belongs to at least one of the $K$ clusters.

    2. $C_k \cap C_{k'} = \emptyset$ for all $k \neq k'$. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.

    For instance, if the $i$th observation is in the $k$th cluster, then $i \in C_k$.


    Details of K-means clustering: continued

    The idea behind K-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible.

    The within-cluster variation for cluster $C_k$ is a measure $\mathrm{WCV}(C_k)$ of the amount by which the observations within a cluster differ from each other.

    Hence we want to solve the problem

    $$\underset{C_1,\ldots,C_K}{\text{minimize}}\ \left\{\sum_{k=1}^{K}\mathrm{WCV}(C_k)\right\}. \qquad (2)$$

    In words, this formula says that we want to partition the observations into $K$ clusters such that the total within-cluster variation, summed over all $K$ clusters, is as small as possible.


    How to define within-cluster variation?

    Typically we use Euclidean distance:

    $$\mathrm{WCV}(C_k) = \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2, \qquad (3)$$

    where $|C_k|$ denotes the number of observations in the $k$th cluster.

    Combining (2) and (3) gives the optimization problem that defines K-means clustering,

    $$\underset{C_1,\ldots,C_K}{\text{minimize}}\ \left\{\sum_{k=1}^{K}\frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2\right\}. \qquad (4)$$
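    To make (3) and (4) concrete, here is a hedged sketch that evaluates them for a given assignment of observations to clusters (X is an n x p NumPy array and labels holds a cluster number for each row; both are assumptions for the example):

    ```python
    import numpy as np

    def wcv(X, idx):
        """Within-cluster variation (3) for the cluster whose row indices are idx."""
        diffs = X[idx][:, None, :] - X[idx][None, :, :]   # all pairwise differences
        return np.sum(diffs**2) / len(idx)

    def kmeans_objective(X, labels):
        """Total within-cluster variation (4), summed over all clusters."""
        return sum(wcv(X, np.where(labels == k)[0]) for k in np.unique(labels))
    ```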


    K-Means Clustering Algorithm

    1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.

    2. Iterate until the cluster assignments stop changing:

       2.1 For each of the K clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.

       2.2 Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
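    Below is a minimal NumPy sketch that follows the two steps literally (it assumes no cluster ever becomes empty, and is illustrative rather than production code):

    ```python
    import numpy as np

    def kmeans(X, K, seed=0):
        rng = np.random.default_rng(seed)
        labels = rng.integers(K, size=len(X))      # Step 1: random initial assignment
        while True:
            # Step 2.1: centroid of each cluster = vector of feature means
            centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
            # Step 2.2: reassign each observation to its nearest centroid
            dists = ((X[:, None, :] - centroids[None, :, :])**2).sum(axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels): # stop when assignments stop changing
                return labels, centroids
            labels = new_labels
    ```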


    Properties of the Algorithm

    This algorithm is guaranteed to decrease the value of the objective (4) at each step. Why? Note that

    $$\frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^{p}(x_{ij} - x_{i'j})^2 = 2\sum_{i \in C_k}\sum_{j=1}^{p}(x_{ij} - \bar{x}_{kj})^2,$$

    where $\bar{x}_{kj} = \frac{1}{|C_k|}\sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$.

    However, it is not guaranteed to give the global minimum. Why not?
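    The identity is easy to check numerically; here is a small sketch on made-up data for a single cluster (the data are an assumption, the identity is the one stated above):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    C = rng.normal(size=(7, 3))                                 # observations in one cluster

    lhs = np.sum((C[:, None, :] - C[None, :, :])**2) / len(C)   # pairwise form
    rhs = 2 * np.sum((C - C.mean(axis=0))**2)                   # deviations from the cluster mean
    print(np.isclose(lhs, rhs))                                 # True
    ```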


    Example

    [Figure: six panels labeled Data; Step 1; Iteration 1, Step 2a; Iteration 1, Step 2b; Iteration 2, Step 2a; and Final Results.]


    Details of Previous Figure

    The progress of the K-means algorithm with K = 3.

    Top left: The observations are shown.

    Top center: In Step 1 of the algorithm, each observation is randomly assigned to a cluster.

    Top right: In Step 2(a), the cluster centroids are computed. These are shown as large colored disks. Initially the centroids are almost completely overlapping because the initial cluster assignments were chosen at random.

    Bottom left: In Step 2(b), each observation is assigned to the nearest centroid.

    Bottom center: Step 2(a) is once again performed, leading to new cluster centroids.

    Bottom right: The results obtained after 10 iterations.


    Example: different starting values

    [Figure: K-means run from six different random starting assignments; the resulting values of the objective are 320.9, 235.8, 235.8, 235.8, 235.8, and 310.9.]
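    A hedged sketch of the same experiment with scikit-learn, assuming X holds the simulated data: each run uses a single random start, and `inertia_` (the within-cluster sum of squared deviations from the centroids) is half of objective (4) by the identity above, so comparing it across runs is equivalent.

    ```python
    from sklearn.cluster import KMeans

    for seed in range(6):
        km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
        print(seed, round(km.inertia_, 1))
    # Keep the solution with the smallest objective; with n_init > 1,
    # scikit-learn performs this selection automatically.
    ```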


    Hierarchical Clustering: the idea

    Builds a hierarchy in a bottom-up fashion...

    [Figure: points A, B, C, D, and E being merged step by step into a hierarchy.]



    Hierarchical Clustering Algorithm

    The approach in words:

    Start with each point in its own cluster.

    Identify the closest two clusters and merge them. Repeat.

    Ends when all points are in a single cluster.

    [Figure: the resulting dendrogram, with leaves ordered D, E, B, A, C and heights from 0 to 4.]
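    As a sketch of how this bottom-up procedure is typically run in practice, assuming X is an n x p NumPy array of observations:

    ```python
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    Z = linkage(X, method="complete", metric="euclidean")  # sequence of n-1 merges
    dendrogram(Z)                                          # tree of all fusions
    plt.ylabel("Height")
    plt.show()
    ```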


    An Example

    [Figure: scatter plot of the observations, with X1 on the horizontal axis and X2 on the vertical axis.]

    45 observations generated in 2-dimensional space. In reality there are three distinct classes, shown in separate colors. However, we will treat these class labels as unknown and will seek to cluster the observations in order to discover the classes from the data.


    Application of hierarchical clustering


    [Figure: three dendrogram panels for these data; details below.]


    Details of previous figure


    Left: Dendrogram obtained from hierarchically clustering the data from the previous slide, with complete linkage and Euclidean distance.

    Center: The dendrogram from the left-hand panel, cut at a height of 9 (indicated by the dashed line). This cut results in two distinct clusters, shown in different colors.

    Right: The dendrogram from the left-hand panel, now cut at a height of 5. This cut results in three distinct clusters, shown in different colors. Note that the colors were not used in clustering, but are simply used for display purposes in this figure.
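    The two cuts described above can be reproduced with SciPy, assuming Z is the linkage matrix for these data computed with complete linkage and Euclidean distance (as in the earlier sketch):

    ```python
    from scipy.cluster.hierarchy import fcluster

    two_clusters = fcluster(Z, t=9, criterion="distance")    # cut at height 9
    three_clusters = fcluster(Z, t=5, criterion="distance")  # cut at height 5
    print(two_clusters)
    print(three_clusters)
    ```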


    Another Example


    [Figure: left, a dendrogram of the nine observations; right, the raw data plotted against X1 and X2, with the observations labeled 1 to 9.]

    An illustration of how to properly interpret a dendrogram with nine observations in two-dimensional space. The raw data on the right was used to generate the dendrogram on the left.

    Observations 5 and 7 are quite similar to each other, as are observations 1 and 6.

    However, observation 9 is no more similar to observation 2 than it is to observations 8, 5, and 7, even though observations 9 and 2 are close together in terms of horizontal distance.

    This is because observations 2, 8, 5, and 7 all fuse with observation 9 at the same height, approximately 1.8.

    Merges in previous example


    [Figure: four panels plotting the nine observations against X1 and X2, showing the successive merges.]


    Types of Linkage


    Complete: Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

    Single: Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

    Average: Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

    Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.
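    In SciPy these four choices correspond to the method argument of linkage; here is a small sketch (reusing an assumed data matrix X) that compares the height of the final merge under each linkage:

    ```python
    from scipy.cluster.hierarchy import linkage

    for method in ["complete", "single", "average", "centroid"]:
        Z = linkage(X, method=method)          # Euclidean metric by default
        print(method, round(Z[-1, 2], 2))      # height at which the last two clusters fuse
    ```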


    Choice of Dissimilarity Measure


    So far we have used Euclidean distance.

    An alternative is correlation-based distance, which considers two observations to be similar if their features are highly correlated.

    This is an unusual use of correlation, which is normally computed between variables; here it is computed between the observation profiles for each pair of observations.

    [Figure: the profiles of observations 1, 2, and 3 plotted against variable index (1 to 20).]
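    A hedged sketch of clustering with correlation-based distance, assuming X is an n x p data matrix with observations in the rows; pdist computes one minus the correlation between each pair of observation profiles. (The breast-cancer example later in the deck uses exactly this combination of average linkage and a correlation metric.)

    ```python
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage
    from scipy.spatial.distance import pdist

    D = pdist(X, metric="correlation")   # 1 - correlation between observation profiles
    Z = linkage(D, method="average")     # average linkage on the precomputed distances
    dendrogram(Z)
    plt.show()
    ```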


    Scaling of the variables matters


    [Figure: three bar charts of the variables Socks and Computers, each on a different scale (counts from 0 to 10, values from 0.0 to 1.2, and values from 0 to 1500).]


    Practical issues


    Should the observations or features first be standardized in some way? For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one.

    In the case of hierarchical clustering:

    What dissimilarity measure should be used?

    What type of linkage should be used?

    How many clusters to choose? (in both K-means and hierarchical clustering). This is a difficult problem, with no agreed-upon method. See Elements of Statistical Learning, chapter 13 for more details.


    Example: breast cancer microarray study


    Repeated observation of breast tumor subtypes in independent gene expression data sets; Sorlie et al., PNAS 2003.

    Average linkage, correlation metric.

    Clustered samples using 500 intrinsic genes: each woman was measured before and after chemotherapy. Intrinsic genes have smallest within/between variation.


    Conclusions


    Unsupervised learning is important for understanding the variation and grouping structure of a set of unlabeled data, and can be a useful pre-processor for supervised learning.

    It is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like test set accuracy).

    It is an active field of research, with many recently developed tools such as self-organizing maps, independent components analysis and spectral clustering. See The Elements of Statistical Learning, chapter 14.
