TRANSCRIPT
MACHINE LEARNING – 2012
Kernel CCA, Kernel K-Means
Spectral Clustering
Change in timetable: we have a practical session next week!
Structure of today's and next week's class
1) Briefly go through some extensions and variants of the principle of kernel PCA, namely kernel CCA.
2) Look at one particular set of clustering algorithms for structure discovery: kernel K-means.
3) Describe the general concept of Spectral Clustering, highlighting the equivalence between kernel PCA, Isomap, etc., and their use for clustering.
4) Introduce the notions of unsupervised and semi-supervised learning and how they can be used to evaluate clustering methods.
5) Compare kernel K-means and the Gaussian Mixture Model for unsupervised and semi-supervised clustering.
Canonical Correlation Analysis (CCA)
Example: two descriptions of the same data, e.g. a video description x \in \mathbb{R}^N and an audio description y \in \mathbb{R}^P, observed in pairs (x_1, y_1), (x_2, y_2), \ldots

\max_{w_x, w_y} \mathrm{corr}\left( w_x^T x, \; w_y^T y \right)

Determine features in two (or more) separate descriptions of the dataset that best explain each datapoint.
Extract hidden structure that maximizes correlation across the two different projections.
Canonical Correlation Analysis (CCA)
Pair of multidimensional zero-mean variables; we have M instances of the pairs:
X = \{x_i\}_{i=1}^{M}, \; x_i \in \mathbb{R}^N, \qquad Y = \{y_i\}_{i=1}^{M}, \; y_i \in \mathbb{R}^q.
Search for two projections w_x and w_y,
X' = w_x^T X \quad \text{and} \quad Y' = w_y^T Y,
solutions of:
\max_{w_x, w_y} \mathrm{corr}(X', Y').
Canonical Correlation Analysis (CCA)
\max_{w_x, w_y} \mathrm{corr}(X', Y')
= \max_{w_x, w_y} \frac{w_x^T \, E[XY^T] \, w_y}{\sqrt{w_x^T C_{xx} w_x} \, \sqrt{w_y^T C_{yy} w_y}}
= \max_{w_x, w_y} \frac{w_x^T C_{xy} w_y}{\sqrt{w_x^T C_{xx} w_x} \, \sqrt{w_y^T C_{yy} w_y}}

Covariance matrices:
C_{xx} = E[XX^T] : N \times N, \qquad C_{yy} = E[YY^T] : q \times q.
Cross-covariance matrix:
C_{xy} = E[XY^T] is N \times q; it measures the cross-correlation between X and Y.
Canonical Correlation Analysis (CCA)
The correlation is not affected by rescaling the norms of the vectors, so we can require
w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1.
The problem then becomes:
\max_{w_x, w_y} w_x^T C_{xy} w_y
\quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1.
Canonical Correlation Analysis (CCA)
To determine the optimum (maximum), solve by Lagrange:
\max_{w_x, w_y} w_x^T C_{xy} w_y
\quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1
\;\Rightarrow\;
C_{xy} C_{yy}^{-1} C_{yx} w_x = \lambda^2 C_{xx} w_x.

Generalized eigenvalue problem; it can be reduced to a classical eigenvalue problem if C_{xx} is invertible.
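As a sanity check on this formulation, here is a minimal numerical sketch of linear CCA (my own illustration, not the course's demo code; the small regularizer `reg` and the toy data are assumptions):

```python
import numpy as np
from scipy.linalg import eigh

def linear_cca(X, Y, reg=1e-6):
    """Minimal linear CCA sketch.
    X: (N, M), Y: (q, M) zero-mean data matrices holding M paired samples.
    Returns the first pair of canonical directions (w_x, w_y) and their correlation."""
    M = X.shape[1]
    Cxx = X @ X.T / M + reg * np.eye(X.shape[0])    # N x N covariance
    Cyy = Y @ Y.T / M + reg * np.eye(Y.shape[0])    # q x q covariance
    Cxy = X @ Y.T / M                               # N x q cross-covariance
    # C_xy C_yy^{-1} C_yx w_x = lambda^2 C_xx w_x   (generalized eigenproblem above)
    A = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    lam2, Wx = eigh(A, Cxx)                         # eigenvalues in ascending order
    w_x = Wx[:, -1]                                 # top canonical direction for X
    w_y = np.linalg.solve(Cyy, Cxy.T @ w_x)         # corresponding direction for Y
    w_y /= np.sqrt(w_y @ Cyy @ w_y)                 # enforce w_y^T C_yy w_y = 1
    return w_x, w_y, np.sqrt(max(lam2[-1], 0.0))

# Toy usage: two views sharing one latent signal give a correlation close to 1
rng = np.random.default_rng(0)
z = rng.normal(size=500)
X = np.vstack([z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
Y = np.vstack([-z + 0.1 * rng.normal(size=500), rng.normal(size=500)])
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)
print(linear_cca(X, Y)[2])
```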
Kernel Canonical Correlation Analysis
- CCA finds basis vectors such that the correlation between the projections is mutually maximized.
- It is a generalized version of PCA for two or more multi-dimensional datasets.
- CCA depends on the coordinate system in which the variables are described: even if there is a strong linear relationship between the variables, depending on the coordinate system used, this relationship might not be visible as a correlation → Kernel CCA.
Principle of Kernel Methods
Determine a metric which brings out features of the data so as to make subsequent computation easier.
[Figure: original space (x1, x2) vs. the data after lifting into feature space]
The data become linearly separable when using an RBF kernel and projecting onto the first 2 principal components of kernel PCA.
Principle of Kernel Methods (Recall)
[Figure: original space (x1, x2) with basis vectors (e1, e2)]
Idea: send the data X into a feature space H through the nonlinear map \phi:
X = \{x_i\}_{i=1}^{M}, \; x_i \in \mathbb{R}^N \;\longrightarrow\; \phi(X) = \{\phi(x_1), \ldots, \phi(x_M)\}, \qquad \phi : \mathbb{R}^N \to H.
In feature space, perform a classical linear computation.
Kernel CCA
Given the pairs X = \{x_i\}_{i=1}^{M}, \; x_i \in \mathbb{R}^N and Y = \{y_i\}_{i=1}^{M}, \; y_i \in \mathbb{R}^q.
Project into a feature space through \phi^x and \phi^y, with
\sum_{i=1}^{M} \phi^x(x_i) = 0 \quad \text{and} \quad \sum_{i=1}^{M} \phi^y(y_i) = 0.
Construct the associated kernel matrices:
K_x = F_x^T F_x, \quad K_y = F_y^T F_y, \quad \text{where the columns of } F_x, F_y \text{ are the } \phi^x(x_i), \phi^y(y_i).
The projection vectors can be expressed as linear combinations in feature space:
w_x = F_x \alpha_x \quad \text{and} \quad w_y = F_y \alpha_y.
Kernel CCA
In linear CCA, we were solving for:
\max_{w_x, w_y} w_x^T C_{xy} w_y \quad \text{u.c.} \quad w_x^T C_{xx} w_x = w_y^T C_{yy} w_y = 1.
Kernel CCA then becomes the optimization problem:
\max_{\alpha_x, \alpha_y} \frac{\alpha_x^T K_x K_y \alpha_y}{\left( \alpha_x^T K_x^2 \alpha_x \right)^{1/2} \left( \alpha_y^T K_y^2 \alpha_y \right)^{1/2}}
= \max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y \alpha_y
\quad \text{u.c.} \quad \alpha_x^T K_x^2 \alpha_x = \alpha_y^T K_y^2 \alpha_y = 1.
Kernel CCA
\max_{\alpha_x, \alpha_y} \alpha_x^T K_x K_y \alpha_y
\quad \text{u.c.} \quad \alpha_x^T K_x^2 \alpha_x = \alpha_y^T K_y^2 \alpha_y = 1.
In practice, the intersection between the spaces spanned by K_x and K_y is non-zero; the problem then has a trivial solution, since \rho \sim \cos(K_x \alpha_x, K_y \alpha_y) = 1.
Kernel CCA
Generalized eigenvalue problem:
\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}
= \lambda
\begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}
Add a regularization term to increase the rank of the matrix and make it invertible:
K_x^2 \;\rightarrow\; \left( K_x + \tfrac{M\kappa}{2} I \right)^2
Several methods have been proposed to choose the regularization term carefully, so as to get projections that are as close as possible to the "true" projections.
Kernel CCA
With regularization, the problem reads A \alpha = \lambda B \alpha with \alpha = (\alpha_x, \alpha_y)^T and
A = \begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}, \qquad
B = \begin{pmatrix} \left( K_x + \tfrac{M\kappa}{2} I \right)^2 & 0 \\ 0 & \left( K_y + \tfrac{M\kappa}{2} I \right)^2 \end{pmatrix}.
Set B = C C^T; the problem becomes a classical eigenvalue problem for C^{-1} A C^{-T}.
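A compact sketch of this regularized kernel CCA (an illustration under stated assumptions: RBF kernels, centring in feature space, and arbitrary kernel widths and regularizer κ; this is not the demo_CCA-KCCA.m implementation):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_gram(A, sigma):
    return np.exp(-cdist(A, A, "sqeuclidean") / (2 * sigma**2))

def center(K):
    M = K.shape[0]
    H = np.eye(M) - np.ones((M, M)) / M
    return H @ K @ H                                   # centring in feature space

def kernel_cca(X, Y, sigma_x=1.0, sigma_y=1.0, kappa=0.1):
    """X: (M, N), Y: (M, q). Returns the top canonical correlation and dual vectors."""
    M = X.shape[0]
    Kx, Ky = center(rbf_gram(X, sigma_x)), center(rbf_gram(Y, sigma_y))
    Rx = Kx + (M * kappa / 2) * np.eye(M)              # regularized blocks of B
    Ry = Ky + (M * kappa / 2) * np.eye(M)
    A = np.block([[np.zeros((M, M)), Kx @ Ky],
                  [Ky @ Kx, np.zeros((M, M))]])
    B = np.block([[Rx @ Rx, np.zeros((M, M))],
                  [np.zeros((M, M)), Ry @ Ry]])
    lam, V = eigh(A, B)                                # generalized problem A v = lam B v
    alpha = V[:, -1]                                   # eigenvector of the largest eigenvalue
    return lam[-1], alpha[:M], alpha[M:]
```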
Kernel CCA
Can be extended to multiple datasets.
Two-dataset case: X = \{x_i\}_{i=1}^{M}, \; x_i \in \mathbb{R}^N, \quad Y = \{y_i\}_{i=1}^{M}, \; y_i \in \mathbb{R}^q.
L datasets: X^1, \ldots, X^L with M observations each and dimensions N_1, \ldots, N_L, i.e. X^i : N_i \times M.
Applying a non-linear transformation \phi to X^1, \ldots, X^L, construct the L Gram matrices K_1, \ldots, K_L.
Kernel CCA
Kernel CCA was formulated as a generalized eigenvalue problem. Two-dataset case:
\begin{pmatrix} 0 & K_x K_y \\ K_y K_x & 0 \end{pmatrix}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}
= \lambda
\begin{pmatrix} \left( K_x + \tfrac{M\kappa}{2} I \right)^2 & 0 \\ 0 & \left( K_y + \tfrac{M\kappa}{2} I \right)^2 \end{pmatrix}
\begin{pmatrix} \alpha_x \\ \alpha_y \end{pmatrix}
It can be extended to L datasets:
\begin{pmatrix}
0 & K_1 K_2 & \cdots & K_1 K_L \\
K_2 K_1 & 0 & \cdots & K_2 K_L \\
\vdots & & \ddots & \vdots \\
K_L K_1 & K_L K_2 & \cdots & 0
\end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix}
= \lambda
\begin{pmatrix}
\left( K_1 + \tfrac{M\kappa}{2} I \right)^2 & & 0 \\
& \ddots & \\
0 & & \left( K_L + \tfrac{M\kappa}{2} I \right)^2
\end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_L \end{pmatrix}
Kernel CCA
M. Kuss and T. Graepel, The Geometry of Kernel Canonical Correlation Analysis, Tech. Report, Max Planck Institute, 2003.
Kernel CCA
Matlab Example File: demo_CCA-KCCA.m
Applications of Kernel CCA
Goal: to measure correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes.
Kernel matrices K1, K2 and K3 correspond to gene-gene similarities in pathways, genome position, and microarray expression data, respectively. An RBF kernel with fixed kernel width is used.
[Figure: correlation scores in MKCCA, pathway vs. genome vs. expression]
Y. Yamanishi, J.-P. Vert, A. Nakaya, M. Kanehisa, Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics, 2003.
Applications of Kernel CCA
Goal: to measure correlation between heterogeneous datasets and to extract sets of genes which share similarities with respect to multiple biological attributes.
[Figure: correlation scores in MKCCA, pathway vs. genome vs. expression; the first canonical variates give the pairwise correlation between K1 and K2, and between K1 and K3]
A readout of the entries with equal projection onto the first canonical vectors gives the genes which belong to each cluster.
Two clusters correspond to genes close to each other with respect to their positions in the pathways, in the genome, and to their expression.
Y. Yamanishi, J.-P. Vert, A. Nakaya, M. Kanehisa, Extraction of correlated gene clusters from multiple genomic data by generalized kernel canonical correlation analysis, Bioinformatics, 2003.
Applications of Kernel CCA
Goal: to construct appearance models for estimating an object's pose from raw brightness images.
X: set of images.
Y: pose parameters (pan and tilt angle of the object w.r.t. the camera, in degrees).
[Figure: example of two image datapoints with different poses]
Use a linear kernel on X and an RBF kernel on Y, and compare performance to applying PCA on the (X, Y) dataset directly.
T. Melzer, M. Reiter and H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (2003), p. 1961–1971.
Applications of Kernel CCA
Goal: to construct appearance models for estimating an object's pose from raw brightness images.
Kernel CCA performs better than PCA, especially for a small testing/training ratio k (i.e., for larger training sets).
The kernel CCA estimators tend to produce fewer outliers, i.e., gross errors, and consequently yield a smaller standard deviation of the pose estimation error than their PCA-based counterparts.
For very small training sets, the performance of both approaches becomes similar.
[Figure: pose estimation error as a function of the testing/training ratio]
T. Melzer, M. Reiter and H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (2003), p. 1961–1971.
Kernel K-means
Spectral Clustering
Structure Discovery: Clustering
Group pairs of points according to how similar they are.
Density-based clustering methods (soft K-means, kernel K-means, Gaussian Mixture Models) compare the relative distributions.
K-means
[Figure: three centroids \mu_1, \mu_2, \mu_3 partitioning the data]
K-means is a hard partitioning of the space into K clusters, with boundaries equidistant between centroids according to the 2-norm.
The distribution of data within each cluster is encapsulated in a sphere.
K-means Algorithm
Iterative method (variant on Expectation-Maximization)
1. Initialization: pick K centroids \mu_k, \; k = 1 \ldots K.
K-means Algorithm
Iterative method (variant on Expectation-Maximization)
2. Calculate the distance from each data point x_j to each centroid \mu_k:
d(x_j, \mu_k) = \lVert x_j - \mu_k \rVert^p
3. Assignment step (E-step): assign the responsibility for each data point to its "closest" centroid,
\arg\min_k d(x_j, \mu_k).
If a tie happens (i.e. two centroids are equidistant from a data point), one assigns the data point to the winning centroid with the smallest index.
K-means Algorithm
Iterative method (variant on Expectation-Maximization)
2. Calculate the distance from each data point x_j to each centroid \mu_k: d(x_j, \mu_k) = \lVert x_j - \mu_k \rVert^p.
3. Assignment step (E-step): assign each data point to its "closest" centroid, \arg\min_k d(x_j, \mu_k); ties go to the winning centroid with the smallest index.
4. Update step (M-step): adjust the centroids to be the means of all data points assigned to them.
5. Go back to step 2 and repeat the process until the clusters are stable.
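A minimal NumPy sketch of this loop (an illustration only; the norm power p, the stopping test and the tie-breaking via argmin are my choices, matching the steps above):

```python
import numpy as np

def kmeans(X, K, p=2, n_iter=100, seed=0):
    """X: (M, N) data, K clusters, power p on the distance."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]          # 1. initialization
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # 2. distances d(x_j, mu_k) = ||x_j - mu_k||^p
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) ** p
        labels = d.argmin(axis=1)          # 3. E-step; argmin ties -> smallest index
        # 4. M-step: centroids become the means of their assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):                # 5. clusters stable
            break
        centroids = new_centroids
    return labels, centroids
```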
K-means Clustering: Weaknesses
Two hyperparameters: the number of clusters K and the power p of the metric.
Very sensitive to the choice of the number of clusters K and to the initialization.
K-means Clustering: Hyperparameters
Two hyperparameters: the number of clusters K and the power p of the metric.
The choice of the power determines the form of the decision boundaries.
[Figure: decision boundaries for p = 1, 2, 3, 4]
Kernel K-means
The K-means algorithm consists of the minimization of:
J(\mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{x_j \in C_k} \lVert x_j - \mu_k \rVert^p,
\qquad \mu_k = \frac{1}{|C_k|} \sum_{x_j \in C_k} x_j,
with |C_k| the number of datapoints in cluster C_k.

Project into a feature space through \phi(x_i), \; i = 1 \ldots M:
J(\mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{x_j \in C_k} \lVert \phi(x_j) - \mu_k \rVert^p,
\qquad \mu_k = \frac{1}{|C_k|} \sum_{x_j \in C_k} \phi(x_j).

We cannot observe the mean in feature space.
Construct the mean in feature space using the images of the points in the same cluster.
Kernel K-means
With p = 2, expand the objective using the kernel trick:
J(\mu_1, \ldots, \mu_K) = \sum_{k=1}^{K} \sum_{x_j \in C_k} \lVert \phi(x_j) - \mu_k \rVert^2,
\lVert \phi(x_i) - \mu_k \rVert^2
= k(x_i, x_i)
- \frac{2}{|C_k|} \sum_{x_j \in C_k} k(x_i, x_j)
+ \frac{1}{|C_k|^2} \sum_{x_j, x_l \in C_k} k(x_j, x_l).
Kernel K-means
The kernel K-means algorithm is also an iterative procedure:
1. Initialization: pick K clusters.
2. Assignment step (E-step): assign each data point to its "closest" centroid by computing the distance in feature space; if a tie happens (i.e. two centroids are equidistant from a data point), assign the data point to the winning centroid with the smallest index:
\min_k d(x_i, C_k) = \min_k \left( k(x_i, x_i) - \frac{2}{|C_k|} \sum_{x_j \in C_k} k(x_i, x_j) + \frac{1}{|C_k|^2} \sum_{x_j, x_l \in C_k} k(x_j, x_l) \right)
3. Update step (M-step): update the list of points belonging to each centroid.
4. Go back to step 2 and repeat the process until the clusters are stable.
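A short sketch of this assignment rule working directly on a precomputed Gram matrix (an illustration; the random initialization of the memberships is an assumption):

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iter=100, seed=0):
    """K: (M, M) Gram matrix, e.g. K[i, j] = k(x_i, x_j) for an RBF or polynomial kernel."""
    rng = np.random.default_rng(seed)
    M = K.shape[0]
    labels = rng.integers(n_clusters, size=M)            # 1. random initial clusters
    for _ in range(n_iter):
        dist = np.full((M, n_clusters), np.inf)
        for k in range(n_clusters):
            idx = np.where(labels == k)[0]
            if len(idx) == 0:
                continue
            # ||phi(x_i) - mu_k||^2 = K_ii - 2/|C_k| sum_j K_ij + 1/|C_k|^2 sum_jl K_jl
            dist[:, k] = (np.diag(K)
                          - 2 * K[:, idx].sum(axis=1) / len(idx)
                          + K[np.ix_(idx, idx)].sum() / len(idx) ** 2)
        new_labels = dist.argmin(axis=1)                  # 2. E-step in feature space
        if np.array_equal(new_labels, labels):            # 4. stop when stable
            break
        labels = new_labels                               # 3. M-step: update memberships
    return labels
```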
Kernel K-means
\min_k d(x_i, C_k) = \min_k \left( k(x_i, x_i) - \frac{2}{|C_k|} \sum_{x_j \in C_k} k(x_i, x_j) + \frac{1}{|C_k|^2} \sum_{x_j, x_l \in C_k} k(x_j, x_l) \right)
With an RBF kernel:
- the first term k(x_i, x_i) is a constant of value 1;
- if x_i is close to all points in cluster k, the second sum is close to 1;
- if the points are well grouped in cluster k, the third sum is close to 1.
What happens with a homogeneous polynomial kernel?
Kernel K-means
\min_k d(x_i, C_k) = \min_k \left( k(x_i, x_i) - \frac{2}{|C_k|} \sum_{x_j \in C_k} k(x_i, x_j) + \frac{1}{|C_k|^2} \sum_{x_j, x_l \in C_k} k(x_j, x_l) \right)
With a polynomial kernel:
- the first term k(x_i, x_i) is always a positive value;
- some of the other terms change sign depending on the position of the points with respect to the origin;
- if the points are aligned in the same quadrant, the sums are maximal.
Kernel K-means: examples
RBF Kernel, 2 Clusters
Kernel K-means: examples
RBF Kernel, 2 Clusters
Kernel K-means: examples
RBF Kernel, 2 Clusters
[Figures: solutions for kernel width 0.5 vs. kernel width 0.05]
Kernel K-means: examples
Polynomial Kernel, 2 Clusters
Kernel K-means: examples
Polynomial Kernel (p=8), 2 Clusters
Kernel K-means: examples
Polynomial Kernel, 2 Clusters
[Figures: solutions for polynomial kernels of order 2, 4 and 6]
Kernel K-means: examples
Polynomial Kernel, 2 Clusters
The separating line will always be perpendicular to the line passing through the origin (which is located at the mean of the datapoints) and parallel to the axis of the ordinates (because of the change in sign of the cosine function in the inner product).
No better than linear K-means!
Kernel K-means: examples
Polynomial Kernel, 4 Clusters
Can only group datapoints that do not overlap across quadrants with respect to the origin (careful: the data are centered!).
No better than linear K-means! (except that it is less sensitive to random initialization)
[Figures: solutions found with kernel K-means vs. solutions found with K-means]
Kernel K-means: Limitations
The choice of the number of clusters in kernel K-means is important.
Kernel K-means: Limitations
The choice of the number of clusters in kernel K-means is important.
Kernel K-means: Limitations
The choice of the number of clusters in kernel K-means is important.
Limitations of kernel K-means
Raw Data
Limitations of kernel K-means
kernel K-means with K=3, RBF kernel
From Non-Linear Manifolds
Laplacian Eigenmaps, Isomaps
To Spectral Clustering
Non-Linear Manifolds
PCA and kernel PCA belong to a more general class of methods that create non-linear manifolds based on spectral decomposition. (Spectral decomposition of matrices is more frequently referred to as eigenvalue decomposition.)
Depending on which matrix we decompose, we get a different set of projections:
• PCA decomposes the covariance matrix of the dataset → generates a rotation and projection in the original space.
• Kernel PCA decomposes the Gram matrix → partitions or regroups the datapoints.
• The Laplacian matrix is a matrix representation of a graph. Its spectral decomposition can be used for clustering.
Embed Data in a Graph
• Build a similarity graph.
• Each vertex of the graph is a datapoint.
[Figure: original dataset vs. graph representation of the dataset]
Measure Distances in Graph
Construct the similarity matrix S to denote whether points are close to or far away from one another, and use it to weight the edges of the graph:
S = \begin{pmatrix} 0.9 & 0.8 & \cdots & 0.2 & 0.2 \\ \vdots & & & & \vdots \\ 0.2 & 0.2 & \cdots & 0.7 & 0.6 \end{pmatrix}
Disconnected Graphs
S = \begin{pmatrix} 1 & 1 & \cdots & 0 & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & \cdots & 1 & 1 \end{pmatrix}
Disconnected graph: two data points are connected if
a) the similarity between them is higher than a threshold,
b) or if they are k-nearest neighbors (according to the similarity metric).
Graph Laplacian
Given the similarity matrix
S = \begin{pmatrix} 1 & 1 & \cdots & 0 & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & \cdots & 1 & 1 \end{pmatrix},
construct the diagonal matrix D composed of the sum over each row of S:
D = \mathrm{diag}\left( \sum_i S_{1i}, \; \sum_i S_{2i}, \; \ldots, \; \sum_i S_{Mi} \right),
and then build the Laplacian matrix:
L = D - S.
L is positive semi-definite → spectral decomposition possible.
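A small sketch of this construction (an illustration; the binary similarity is built here with a k-nearest-neighbor rule and symmetrized, which is one of the two options listed earlier):

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Binary k-nearest-neighbor similarity S, degree matrix D and Laplacian L = D - S."""
    M = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    S = np.zeros((M, M))
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # k nearest neighbors (skip self)
    for i in range(M):
        S[i, nn[i]] = 1.0
    S = np.maximum(S, S.T)                          # symmetrize the graph
    D = np.diag(S.sum(axis=1))                      # row sums of S on the diagonal
    L = D - S                                       # positive semi-definite Laplacian
    return S, D, L
```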
Graph Laplacian
Eigenvalue decomposition of the Laplacian matrix: L = U \Lambda U^T.
All eigenvalues of L are non-negative and the smallest eigenvalue of L is zero.
If we order the eigenvalues by increasing order: 0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_M.
If the graph has k connected components, then the eigenvalue \lambda = 0 has multiplicity k.
Spectral Clustering
The multiplicity of the eigenvalue 0 determines the number of connected components in a graph. The associated eigenvectors identify these connected components.
For an eigenvalue \lambda_i = 0, the corresponding eigenvector e^i has the same value for all vertices in one component, and a different value for the vertices of the other components.
Identifying the clusters is then trivial when the similarity matrix is composed of zeros and ones (as when using k-nearest neighbors).
What happens when the similarity matrix is full?
Spectral Clustering
S = \begin{pmatrix} 0.9 & 0.8 & \cdots & 0.2 & 0.2 \\ \vdots & & & & \vdots \\ 0.2 & 0.2 & \cdots & 0.7 & 0.6 \end{pmatrix}
Similarity map S : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}.
S can either be binary (k-nearest neighbors) or continuous, e.g. with a Gaussian kernel:
S(x_i, x_j) = e^{- \lVert x_i - x_j \rVert^2 / (2\sigma^2)}.
Spectral Clustering
Similarity map S : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}, either binary (k-nearest neighbors) or continuous with a Gaussian kernel S(x_i, x_j) = e^{- \lVert x_i - x_j \rVert^2 / (2\sigma^2)}.
1) Build the Laplacian matrix: L = D - S.
2) Do the eigenvalue decomposition of the Laplacian matrix: L = U \Lambda U^T.
3) Order the eigenvalues by increasing order: 0 = \lambda_1 \le \lambda_2 \le \ldots \le \lambda_M.
The first eigenvalue is still zero, but with multiplicity 1 only (fully connected graph)!
Idea: the smallest eigenvalues are close to zero and hence also provide information on the partitioning of the graph (see exercise session).
Spectral Clustering
Eigenvalue decomposition of the Laplacian matrix: L = U \Lambda U^T, with U = \left( e^1, e^2, \ldots, e^M \right) the matrix of eigenvectors.
Construct an embedding of each of the M datapoints through x^i \mapsto y^i, with
y^i = \left( e^1_i, \ldots, e^K_i \right)^T.
Reduce the dimensionality by picking only the first K < M eigenvectors.
With a clear partitioning of the graph, the entries in y are split into sets of equal values. Each group of points with the same value belongs to the same partition (cluster).
Spectral Clustering
Eigenvalue decomposition of the Laplacian matrix: L = U \Lambda U^T, with U = \left( e^1, e^2, \ldots, e^M \right).
Construct an embedding of each of the M datapoints through x^i \mapsto y^i = \left( e^1_i, \ldots, e^K_i \right)^T, picking K < M eigenvectors.
When we have a fully connected graph, the entries in y take any real value.
Spectral Clustering
Example: 3 datapoints in a graph composed of 2 partitions (x_1 and x_2 connected, x_3 on its own).
The similarity matrix is
S = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix};
L has eigenvalue \lambda = 0 with multiplicity two.
One solution for the two associated eigenvectors is:
e^1 = (1, 1, 0)^T, \quad e^2 = (0, 0, 1)^T.
Another solution for the two associated eigenvectors is:
e^1 = (0.33, 0.33, 0.88)^T, \quad e^2 = (-0.1, -0.1, 0.99)^T.
The entries in the eigenvectors for the two first datapoints are equal.
Embeddings: y^1 = y^2 = (1, 0)^T for the 1st set of eigenvectors, and y^1 = y^2 = (0.33, -0.1)^T for the 2nd set of eigenvectors.
Spectral Clustering
Example: 3 datapoints in a fully connected graph.
The similarity matrix is
S = \begin{pmatrix} 1 & 0.9 & 0.02 \\ 0.9 & 1 & 0.02 \\ 0.01 & 0.02 & 1 \end{pmatrix};
L has eigenvalue \lambda_1 = 0 with multiplicity 1. The second eigenvalue is small, \lambda_2 \approx 0.04, whereas the 3rd one is large, \lambda_3 \approx 1.81, with associated eigenvectors:
e^1 \propto (1, 1, 1)^T, \quad e^2 \approx (0.411, 0.404, -0.81)^T, \quad e^3 \approx (0.8, -0.7, 0.0)^T.
The entries in the 2nd eigenvector for the two first datapoints are almost equal.
Reduce the dimensionality by considering the eigenvector of the smallest non-zero eigenvalue: the first two points then have almost the same coordinates on the y embedding,
y^1 = 0.41, \quad y^2 = 0.40.
Spectral Clustering
Example: 3 datapoints in a fully connected graph.
The similarity matrix is
S = \begin{pmatrix} 1 & 0.9 & 0.8 \\ 0.9 & 1 & 0.7 \\ 0.8 & 0.7 & 1 \end{pmatrix};
L has eigenvalue \lambda_1 = 0 with multiplicity 1. The second and third eigenvalues are both large, \lambda_2 \approx 2.23 and \lambda_3 \approx 2.57, with associated eigenvectors:
e^1 \propto (1, 1, 1)^T, \quad e^2 \approx (-0.21, -0.57, 0.79)^T, \quad e^3 \approx (0.78, -0.57, -0.21)^T.
The entries in the 2nd eigenvector for the two first datapoints are no longer equal: the first two points no longer have the same coordinates on the y embedding,
y^1 = -0.21, \quad y^2 = -0.57.
The 3rd point is now closer to the two other points.
Spectral Clustering
[Figure: graph on datapoints x_1, ..., x_6 with edge weights w_{ij}, and their embedding y_1, ..., y_6 (Step 1: embedding in y)]
Idea: points close to one another have almost the same coordinates on the eigenvectors of L with small eigenvalues.
Step 1: do an eigenvalue decomposition of the Laplacian matrix L and project the datapoints onto the first K eigenvectors with the smallest eigenvalues (hence reducing the dimensionality).
Spectral Clustering
Step 2: perform K-means on the set of vectors y^1, \ldots, y^M.
Cluster the datapoints x according to their clustering in y.
[Figure: graph on datapoints x_1, ..., x_6 and their embedding y_1, ..., y_6]
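Putting steps 1 and 2 together, a compact sketch (an illustration; it reuses the knn_laplacian and kmeans sketches above, and the number of kept eigenvectors is simply set to the number of clusters):

```python
import numpy as np

def spectral_clustering(X, n_clusters, k_neighbors=5):
    """Step 1: embed the points with the eigenvectors of the graph Laplacian.
    Step 2: run K-means on the embedded coordinates y_i."""
    _, _, L = knn_laplacian(X, k=k_neighbors)       # graph Laplacian (sketch above)
    eigval, eigvec = np.linalg.eigh(L)              # eigenvalues in ascending order
    Y = eigvec[:, :n_clusters]                      # rows are the embeddings y_i
    labels, _ = kmeans(Y, n_clusters)               # K-means sketch above
    return labels
```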
Equivalency to other non-linear Embeddings
Isomap: spectral decomposition of the similarity matrix (which is already positive semi-definite), with embedding
y^i = \left( \lambda_1 e^1_i, \ldots, \lambda_K e^K_i \right)^T.
In Isomap, the embedding is normalized by the eigenvalues, and the similarity matrix is built using the geodesic distance; see the supplementary material.
Laplacian Eigenmaps
The vectors y^i, \; i = 1 \ldots M, form an embedding of the datapoints.
[Figure: Swiss-roll example and projections on each Laplacian eigenvector; image courtesy of A. Singh]
Solution to the optimization problem: \min_y y^T L y such that y^T D y = 1.
This ensures minimal distortion while preventing arbitrary scaling.
Solve the generalized eigenvalue problem: L y = \lambda D y.
Equivalency to other non-linear Embeddings
Kernel PCA: eigenvalue decomposition of the similarity matrix S: \; S = U D U^T.
The choice of parameters in kernel K-means can be initialized by doing a readout of the Gram matrix after kernel PCA.
Kernel K-means and Kernel PCA
The optimization problem of kernel K-means is equivalent to:
\max_{H} \mathrm{tr}\left( H^T K H \right), \qquad H = Y D^{-1/2},
with Y : M \times K, where each entry of Y is 1 if the datapoint belongs to cluster k and zero otherwise, and D : K \times K diagonal, where the k-th element on the diagonal is the number of datapoints in cluster k.
Since
\mathrm{tr}\left( H^T K H \right) \le \sum_{i=1}^{K} \lambda_i,
with \lambda_1, \ldots, \lambda_M the M eigenvalues resulting from the eigenvalue decomposition of the Gram matrix, one can look at the eigenvalues to determine the optimal number of clusters.
See the paper by M. Welling, supplementary document on the website.
Kernel PCA projections can also help determine the kernel width.
[Figure: kernel PCA projections for kernel widths of 0.8, 1.5 and 2.5, from top to bottom]
Kernel PCA projections can help determine the kernel width.
The sum of the eigenvalues grows as we get a better clustering.
Quick Recap of Gaussian Mixture Model
Clustering with Mixture of Gaussians
Alternative to K-means: soft partitioning with elliptic clusters instead of spheres.
[Figure] Clustering with mixtures of Gaussians using spherical Gaussians (left) and non-spherical Gaussians (i.e. with full covariance matrices) (right).
Notice how the clusters become elongated along the direction of the clusters (the grey circles represent the first and second variances of the distributions).
Gaussian Mixture Model (GMM)
Using a set of M N-dimensional training datapoints X = \{x_i\}_{i=1}^{M}, \; x_i \in \mathbb{R}^N, the pdf of X will be modeled through a mixture of K Gaussians:
p(X \mid \theta) = \sum_{i=1}^{K} \pi_i \, p(X \mid \mu_i, \Sigma_i), \qquad p(X \mid \mu_i, \Sigma_i) = \mathcal{N}(\mu_i, \Sigma_i),
with \mu_i, \Sigma_i the mean and covariance matrix of Gaussian i.
Mixing coefficients: \sum_{i=1}^{K} \pi_i = 1.
Probability that the data was explained by Gaussian i:
\pi_i = \frac{1}{M} \sum_{j=1}^{M} p(i \mid x_j).
Gaussian Mixture Modeling
The parameters of a GMM are the means, covariance matrices and prior pdf:
\theta = \{ \mu_1, \ldots, \mu_K, \; \Sigma_1, \ldots, \Sigma_K, \; \pi_1, \ldots, \pi_K \}.
Estimation of all the parameters can be done through Expectation-Maximization (E-M). E-M tries to find the optimum of the likelihood of the model given the data, i.e.:
\max_{\theta} L(X \mid \theta) = \max_{\theta} p(X \mid \theta).
See the lecture notes for details.
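For reference, a minimal sketch of fitting such a mixture with E-M (this uses scikit-learn's GaussianMixture as an assumed tool, not the course's mldemos; the toy data and the number of restarts are arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two elongated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [2.0, 0.3], size=(200, 2)),
               rng.normal([5, 3], [0.3, 2.0], size=(200, 2))])

# Fit a GMM with full covariance matrices via E-M; several restarts (n_init)
# mitigate the sensitivity to initial conditions shown on the next slides.
gmm = GaussianMixture(n_components=2, covariance_type="full", n_init=5).fit(X)
labels = gmm.predict(X)           # hard assignment to the most likely Gaussian
resp = gmm.predict_proba(X)       # soft responsibilities p(i | x_j)
print(gmm.means_, gmm.weights_)   # estimated mu_i and mixing coefficients pi_i
```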
Gaussian Mixture Model
Gaussian Mixture Model
GMM using 4 Gaussians
with random initialization
Gaussian Mixture Model
Expectation Maximization is very sensitive to initial conditions:
GMM using 4 Gaussians
with new random
initialization
Gaussian Mixture Model
Very sensitive to choice of number of Gaussians. Number of Gaussians
can be optimized iteratively using AIC or BIC, like for K-means:
Here, GMM using 8 Gaussians
Evaluation of Clustering Methods
Evaluation of Clustering Methods
Clustering methods rely on hyperparameters:
• number of clusters,
• kernel parameters.
We need to determine the goodness of these choices.
Clustering is unsupervised classification: we do not know the real number of clusters, nor the data labels.
It is difficult to evaluate these choices without ground truth.
Evaluation of Clustering Methods
Two types of measures: internal versus external measures.
Internal measures rely on a measure of similarity (e.g. intra-cluster distance versus inter-cluster distance).
E.g. the Residual Sum of Squares (RSS) is an internal measure (available in mldemos); it gives the squared distance of each vector from its centroid, summed over all vectors:
\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2.
Internal measures are problematic, as the metric of similarity is often already optimized by the clustering algorithm.
Evaluation of Clustering Methods
K-means, soft K-means and GMM have several hyperparameters (the fixed number of clusters, the beta of soft K-means, the number of Gaussian functions).
Measures to determine how well the choice of hyperparameters fits the dataset (maximum-likelihood measures), with X: dataset; M: number of datapoints; B: number of free parameters; L: maximum likelihood of the model given the B parameters:
- Akaike Information Criterion: \mathrm{AIC} = -2 \ln L + 2B
- Bayesian Information Criterion: \mathrm{BIC} = -2 \ln L + B \ln M
The second term acts as a penalty for the increase in model complexity and computational cost.
A lower BIC implies either fewer explanatory variables, a better fit, or both.
As the number of datapoints (observations) increases, BIC assigns more weight to simpler models than AIC.
Choosing AIC versus BIC depends on the application: is the purpose of the analysis to make predictions, or to decide which model best represents reality? AIC may have better predictive ability than BIC, but BIC finds a computationally more efficient solution.
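A small sketch of scanning the number of Gaussians with these criteria (again assuming scikit-learn's GaussianMixture, whose aic and bic methods implement the formulas above):

```python
from sklearn.mixture import GaussianMixture

def select_n_components(X, k_max=10):
    """Fit GMMs with 1..k_max components and report AIC/BIC for each (lower is better)."""
    scores = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, n_init=3).fit(X)
        scores.append((k, gmm.aic(X), gmm.bic(X)))
    best_aic = min(scores, key=lambda s: s[1])[0]   # AIC tends to pick more clusters
    best_bic = min(scores, key=lambda s: s[2])[0]   # BIC favours simpler models
    return scores, best_aic, best_bic
```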
Evaluation of Clustering Methods
Two types of measures: internal versus external measures.
External measures assume that a subset of the datapoints has class labels, and measure how well these datapoints are clustered.
One needs to have an idea of the classes and to have labeled some datapoints.
This is interesting only in cases where labeling is highly time-consuming because the data are very large (e.g. in speech recognition).
Evaluation of Clustering Methods
Raw Data
Semi-Supervised Learning
Clustering F-Measure (careful: similar to, but not the same as, the F-measure we will see for classification!)
M: number of datapoints; C: the set of classes; K: number of clusters; n_{ik}: number of members of class c_i \in C that fall in cluster k.

F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{M} \max_{k} F(c_i, k)

F(c_i, k) = \frac{2 \, R(c_i, k) \, P(c_i, k)}{R(c_i, k) + P(c_i, k)},
\qquad
R(c_i, k) = \frac{n_{ik}}{|c_i|},
\qquad
P(c_i, k) = \frac{n_{ik}}{|k|}

The max picks, for each class, the cluster containing the maximal number of its datapoints.
Recall R: proportion of the datapoints of the class that are correctly clustered together.
Precision P: proportion of datapoints of the same class in the cluster.
The measure is a tradeoff between clustering correctly all datapoints of the same class in the same cluster and making sure that each cluster contains points of only one class.
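A direct transcription of this measure (an illustration; y_true holds the known class labels of the labeled subset and y_cluster the cluster assignments):

```python
import numpy as np

def clustering_f_measure(y_true, y_cluster):
    """Clustering F-measure: for each class, take the F of its best-matching cluster,
    weighted by the class size."""
    M = len(y_true)
    total = 0.0
    for c in np.unique(y_true):
        in_class = (y_true == c)
        best_f = 0.0
        for k in np.unique(y_cluster):
            in_cluster = (y_cluster == k)
            n_ik = np.sum(in_class & in_cluster)
            if n_ik == 0:
                continue
            recall = n_ik / in_class.sum()          # R(c, k) = n_ik / |c|
            precision = n_ik / in_cluster.sum()     # P(c, k) = n_ik / |k|
            best_f = max(best_f, 2 * recall * precision / (recall + precision))
        total += in_class.sum() / M * best_f
    return total
```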
Evaluation of Clustering Methods
RSS with K-means can find the true optimal number of clusters, but it is very sensitive to random initialization (left and right: two different runs).
RSS finds an optimum at K=4 in one run and at K=5 in the other.
Evaluation of Clustering Methods
BIC (left) and AIC (right) perform very poorly here, splitting some clusters into
two halves.
Evaluation of Clustering Methods
BIC (left) and AIC (right) perform much better for picking the right number of
clusters in GMM.
Evaluation of Clustering Methods
Raw Data
Evaluation of Clustering Methods
Optimization with BIC using K-means
Evaluation of Clustering Methods
Optimization with AIC using K-means
AIC tends to find more clusters
Evaluation of Clustering Methods
Raw Data
Evaluation of Clustering Methods
Optimization with AIC using kernel K-means with RBF
Evaluation of Clustering Methods
Optimization with BIC using kernel K-means with RBF
Semi-Supervised Learning
Raw Data: 3 classes
Semi-Supervised Learning
Clustering with RBF kernel K-Means after optimization with BIC
Semi-Supervised Learning
After semi-supervised learning
Summary
We have seen several methods for extracting structure in data using the notion of kernel, and discussed their similarities, differences and complementarity:
• Kernel CCA is a generalization of kernel PCA to determine partial grouping in different dimensions.
• Kernel PCA can be used to bootstrap the choice of hyperparameters in kernel K-means.
We have compared the geometrical division of space yielded by RBF and polynomial kernels.
We have seen that simpler techniques than kernel K-means, such as K-means with norm p and mixtures of Gaussians, can yield complex non-linear clustering.
Summary
When to use what? E.g. K-means versus kernel K-means versus GMM.
When using any machine learning algorithm, you have to balance a number of factors:
• computing time at training and at testing,
• number of open parameters,
• curse of dimensionality (order of growth with the number of datapoints, the dimension, etc.),
• robustness to initial conditions, optimality of the solution.