
Erte Pan, Wireless Eng. Group

Advisor: Dr. Han

Department of Electrical and Computer Engineering, University of Houston, Houston, TX.

Mercer Kernel-Based Clustering in Feature Space

For Math 6397, Prof. Azencott

Author: Mark Girolami
Published in IEEE Transactions on Neural Networks, Vol. 13, No. 3, May 2002
Citations so far: 593

Content

Problem Statement

Data-space Clustering

Feature-space Clustering

Stochastic Optimization

Nonparametric Clustering

Results and Discussion

References

Problem Statement

Many data analysis and machine learning tasks involve classifying clouds of data points or predicting the label of an incoming data point.

Machine Learning: Enable computers to learn without being explicitly programmed.

Unsupervised Learning

Supervised Learning

Data-space Clustering

Clustering: Unsupervised partitioning of data observations into self-similar regions.

Traditional clustering methods:

Centroid-based clustering

Hierarchical clustering

Distribution-based clustering…

Data-space Clustering

Problem formulation:

N data vectors in D-dimensional space: $x_n \in \mathbb{R}^D,\ n = 1, 2, \dots, N$

Given K cluster centers $m_k$, the within-cluster scatter matrix is defined as

$$S_W = \frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}\,(x_n - m_k)(x_n - m_k)^T$$

where the binary variable $z_{kn}$ indicates the membership of data point $x_n$ to cluster k, and

$$m_k = \frac{1}{N_k}\sum_{n=1}^{N} z_{kn}\,x_n, \qquad N_k = \sum_{n=1}^{N} z_{kn}.$$
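These quantities map directly onto array operations. Below is a minimal NumPy sketch (the array names X, Z and the helper trace_sw are mine, not from the paper) of the cluster means and Tr(S_W) for a given binary assignment matrix:

```python
import numpy as np

def trace_sw(X, Z):
    """Tr(S_W) for data X (N x D) and binary assignments Z (K x N)."""
    N = X.shape[0]
    total = 0.0
    for z_k in Z:                                    # loop over the K clusters
        Nk = z_k.sum()
        if Nk == 0:
            continue
        m_k = (z_k[:, None] * X).sum(axis=0) / Nk    # cluster mean m_k
        diff = X - m_k                               # (x_n - m_k) for every n
        total += (z_k * (diff ** 2).sum(axis=1)).sum()
    return total / N
```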

Data-space Clustering

Data-space clustering criterion: sum-of-squares; measure of compactness

K-means, mean shift and so forth…

The partitioning of the data set is obtained by solving the optimization problem

$$Z^{\ast} = \arg\min_{Z} \operatorname{Tr}(S_W)$$

NP-hard problem… heuristic algorithms are used instead, such as Lloyd's algorithm (see the sketch after this list):

Initialize centroids for a given number of clusters K

Assign each data point to the "nearest" mean (Voronoi diagram)

Update the centroid of each cluster
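A minimal sketch of Lloyd's heuristic for this objective, assuming NumPy and random initialization from the data points; the function name kmeans_lloyd and the convergence test are my own choices:

```python
import numpy as np

def kmeans_lloyd(X, K, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign to the nearest mean, then update the means."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]       # initial centroids
    for _ in range(n_iter):
        # assignment step: nearest centroid (Voronoi cell) for each point
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # (N, K) squared distances
        labels = d2.argmin(axis=1)
        # update step: recompute each centroid as the mean of its points
        new_means = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                              else means[k] for k in range(K)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means
```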

Data-space Clustering

Drawbacks of data-space clustering:

linear separation boundaries.

prefers clusters of similar size.

weights every dimension equally.

the number of clusters, K, has to be determined at the beginning.

may get stuck in a local minimum.

sensitive to initialization and outliers. Feature-space clustering is proposed to address these problems, hopefully…

Feature-space Clustering

Same story as in Kernel PCA that everyone can recite…

$$\Phi : \mathbb{R}^D \rightarrow F, \qquad x \mapsto \Phi(x)$$

Feature-space Clustering

Computation in feature space, utilizing the kernel trick:

$$\operatorname{Tr}(S_W^{\Phi}) = \operatorname{Tr}\Big\{\frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}\,(\Phi(x_n) - m_k^{\Phi})(\Phi(x_n) - m_k^{\Phi})^T\Big\} = \frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}\,(\Phi(x_n) - m_k^{\Phi})^T(\Phi(x_n) - m_k^{\Phi})$$

Using a Mercer kernel, the Gram matrix is $K_{ij} = K(x_i, x_j) = k(x_i, x_j) = \Phi(x_i)\cdot\Phi(x_j)$.

Denote the term

$$y_{kn} = K_{nn} - \frac{2}{N_k}\sum_{j=1}^{N} z_{kj}K_{nj} + \frac{1}{N_k^2}\sum_{i=1}^{N}\sum_{l=1}^{N} z_{ki}z_{kl}K_{il}$$

then:

$$\operatorname{Tr}(S_W^{\Phi}) = \frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}\,y_{kn}$$
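Since everything depends on the data only through the Gram matrix, the trace can be evaluated without ever forming Φ(x). A minimal sketch, assuming a precomputed Gram matrix Kmat and binary assignments Z (the names and the helper trace_sw_feature are mine):

```python
import numpy as np

def trace_sw_feature(Kmat, Z):
    """Tr(S_W^Phi) from a Gram matrix Kmat (N x N) and binary assignments Z (K x N)."""
    N = Kmat.shape[0]
    total = 0.0
    for z_k in Z:
        Nk = z_k.sum()
        if Nk == 0:
            continue
        second = (Kmat @ z_k) / Nk                  # (1/Nk) sum_j z_kj K_nj, for every n
        third = z_k @ Kmat @ z_k / Nk**2            # (1/Nk^2) sum_il z_ki z_kl K_il
        y_k = np.diag(Kmat) - 2.0 * second + third  # y_kn for every n
        total += (z_k * y_k).sum()
    return total / N
```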

Feature-space Clustering

Denote the following term:

$$R(x \mid C_k) = \frac{1}{N_k^2}\sum_{i=1}^{N}\sum_{j=1}^{N} z_{ki}z_{kj}K_{ij}$$

which captures the quadratic sum of the kernel elements allocated to the k-th cluster.

Then straightforward manipulation of the equations yields:

$$\operatorname{Tr}(S_W^{\Phi}) = \frac{1}{N}\sum_{k=1}^{K}\sum_{n=1}^{N} z_{kn}K_{nn} - \frac{1}{N}\sum_{k=1}^{K} N_k\,R(x \mid C_k)$$

If the Radial Basis Function kernel is used,

$$k(x_i, x_j) = \exp\{-(1/c)\,\|x_i - x_j\|^2\},$$

then the first term reduces to unity, thus:

$$\operatorname{Tr}(S_W^{\Phi}) = 1 - \sum_{k=1}^{K}\frac{N_k}{N}\,R(x \mid C_k)$$
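A small sketch of the RBF case (rbf_gram, cluster_compactness, and trace_sw_rbf are my names, not the paper's), computing R(x | C_k) and the reduced trace; it should agree with the y_kn-based computation above whenever K_nn = 1:

```python
import numpy as np

def rbf_gram(X, c):
    """Gram matrix K_ij = exp(-(1/c) * ||x_i - x_j||^2)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / c)

def cluster_compactness(Kmat, z_k):
    """R(x | C_k) = (1/Nk^2) * sum_ij z_ki z_kj K_ij."""
    Nk = z_k.sum()
    return float(z_k @ Kmat @ z_k) / Nk**2

def trace_sw_rbf(Kmat, Z):
    """Tr(S_W^Phi) = 1 - sum_k (Nk/N) R(x|C_k), valid when K_nn = 1 (RBF kernel)."""
    N = Kmat.shape[0]
    return 1.0 - sum(z_k.sum() / N * cluster_compactness(Kmat, z_k)
                     for z_k in Z if z_k.sum() > 0)
```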

Feature-space Clustering

For the RBF kernel, the following approximation holds due to the convolution theorem for Gaussians (why?):

$$\int p(x)^2\,dx \approx \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij}$$

This being the case, then:

$$R(x \mid C_k) = \frac{1}{N_k^2}\sum_{i=1}^{N}\sum_{j=1}^{N} z_{ki}z_{kj}K_{ij} \approx \int p(x \mid C_k)^2\,dx$$

This makes sense for the clustering later on, because the integral is a measure of the compactness of the cluster.

Connection to probability and statistics; validation of the kernel model. (What about non-RBF kernels? Does this prove they are not valid?)

Feature-space Clustering

Making sense of the integral $\int p(x)^2\,dx$:

Utilizing Cauchy's inequality from statistics:

$$\int p(x)^2\,dx = \int p(x)\,p(x)\,dx = E\{p(x)\}$$

$$E\{p(x)\cdot 1\}^2 \le E\{p(x)^2\}\,E\{1\}$$

The equality holds when $p(x) = a\cdot 1$ (a constant), which means the more "uniformly" distributed the data, the more compact the cluster.

Examples:

Gaussians: for a Gaussian density with standard deviation $\sigma$, $\int p(x)^2\,dx = \frac{1}{2\sigma\sqrt{\pi}}$.
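As a sanity check of the approximation, here is a minimal 1-D numerical experiment I added (not from the paper): a Parzen estimate of the integral built from Gaussian kernels, compared with the analytic value 1/(2σ√π) for a Gaussian sample. The kernel double-sum on the earlier slide corresponds to this estimate up to the Gaussian normalizing constant, with c = 4h²; the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, N, h = 1.0, 5000, 0.2               # true std, sample size, Parzen bandwidth
x = rng.normal(0.0, sigma, size=N)

# Parzen estimate of int p(x)^2 dx: by the convolution theorem for Gaussians,
# int N(x; xi, h^2) N(x; xj, h^2) dx = N(xi - xj; 0, 2 h^2), so the integral
# reduces to a normalized double sum over a Gram-like matrix.
diff2 = (x[:, None] - x[None, :]) ** 2
est = np.mean(np.exp(-diff2 / (4 * h**2))) / (2 * h * np.sqrt(np.pi))

exact = 1.0 / (2 * sigma * np.sqrt(np.pi))  # int N(x; 0, sigma^2)^2 dx
print(est, exact)                           # values should be close for small h, large N
```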

Feature-space Clustering

The integral represented by $R(x \mid C_k)$ is the feature-space counterpart of the Euclidean compactness measure defined by the sum-of-squares term.

Now the optimization problem in feature space becomes:

$$Z^{\ast} = \arg\min_{Z}\operatorname{Tr}(S_W^{\Phi}) = \arg\max_{Z}\sum_{k=1}^{K}\frac{N_k}{N}\,R(x \mid C_k)$$

Lemma: if the binary restriction on $z_{kn}$ is relaxed to $0 \le z_{kn} \le 1$, the optimum above is still achieved with a binary Z matrix.

Interpretation: the optimal partitioning of the data will only occur when the partition indicators are 0 or 1.

This motivates the use of stochastic methods for the optimization.

Stochastic Optimization

Define

$$D_{kj} = 1 - \frac{1}{N_k}\sum_{l=1}^{N} z_{kl}K_{jl}$$

as the penalty associated with assigning the j-th data point to the k-th cluster in feature space.

Due to the nature of the RBF kernel, $k(x_i, x_j) = \exp\{-(1/c)\,\|x_i - x_j\|^2\}$, the range of each element of K is (0, 1].

The second term of the penalty can be viewed as an estimate of the conditional probability of the j-th data point given the k-th cluster.

The original objective of the optimization problem is then manipulated into:

$$\operatorname{Tr}(S_W^{\Phi}) = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{K} z_{kj}\Big(1 - \frac{1}{N_k}\sum_{l=1}^{N} z_{kl}K_{jl}\Big) = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=1}^{K} z_{kj}\,D_{kj}$$

Stochastic Optimization

Analogous to the stochastic optimization in data space:

$$\operatorname{Tr}(S_W) = \frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} z_{kn}\,E_{kn}$$

where $E_{kn} = \|x_n - m_k\|^2$ is the sum-of-squares distance term.

Solved in the fashion of the Expectation-Maximization algorithm:

The cluster indicator $z_{kn}$ is replaced by its expectation, computed with a softmax function:

$$\langle z_{kn}\rangle = \frac{\exp(-E_{kn}^{new})}{\sum_{k'=1}^{K}\exp(-E_{k'n}^{new})}, \qquad E_{kn}^{new} = \|x_n - m_k\|^2$$

Each $m_k$ is then updated using the newly estimated expectation values of the indicators.
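A minimal sketch of one such soft E/M pass in data space (the name soft_kmeans_step and the unit inverse temperature folded into E_kn are my assumptions):

```python
import numpy as np

def soft_kmeans_step(X, means):
    """One E/M pass: soft assignments via softmax(-E_kn), then mean updates.
    X is (N, D); means is (K, D); returned Z is (N, K)."""
    E = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # E_kn = ||x_n - m_k||^2
    W = np.exp(-(E - E.min(axis=1, keepdims=True)))          # row shift for numerical stability
    Z = W / W.sum(axis=1, keepdims=True)                     # expected indicators <z_kn>
    means = (Z.T @ X) / Z.sum(axis=0)[:, None]               # update each m_k with soft counts
    return Z, means
```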

Stochastic Optimization

Similarly, the stochastic optimization in feature space:

$$\langle z_{kn}\rangle = \frac{\exp(-y_{kn}^{new})}{\sum_{k'=1}^{K}\exp(-y_{k'n}^{new})} = \frac{\gamma_k\exp(-2D_{kn}^{new})}{\sum_{k'=1}^{K}\gamma_{k'}\exp(-2D_{k'n}^{new})}$$

where:

$$D_{kn}^{new} = 1 - \frac{1}{N_k}\sum_{l=1}^{N} z_{kl}K_{nl}, \qquad \gamma_k = \exp(-R(x \mid C_k))$$

Note that $\gamma_k$ indicates the compactness of the k-th cluster.
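A minimal sketch of one feature-space pass using only the Gram matrix (kernel_soft_step is my name, not the paper's; it assumes no cluster becomes effectively empty):

```python
import numpy as np

def kernel_soft_step(Kmat, Z):
    """One stochastic/EM-style update in feature space.
    Kmat: (N, N) RBF Gram matrix; Z: (K, N) current (soft) assignments."""
    Nk = Z.sum(axis=1, keepdims=True)                         # effective cluster sizes
    D = 1.0 - (Z @ Kmat) / Nk                                 # D_kn = 1 - (1/Nk) sum_l z_kl K_nl
    R = np.einsum('ki,ij,kj->k', Z, Kmat, Z) / Nk[:, 0]**2    # R(x | C_k) per cluster
    gamma = np.exp(-R)[:, None]                               # compactness weights gamma_k
    W = gamma * np.exp(-2.0 * D)                              # unnormalized softmax numerators
    return W / W.sum(axis=0, keepdims=True)                   # new <z_kn>, columns sum to 1
```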

Stochastic Search

Stochastic methods for optimization

Different optimization criteria in the traditional method and the stochastic method:

Traditional: error criterion. The BP method strictly follows the gradient-descent direction; any direction that enlarges the error is NOT acceptable. Easy to get stuck in local minima.

BM (Boltzmann Machine): associates the system with an "energy". Simulated Annealing allows the energy to grow with a certain probability.

Simulated Annealing

Simulated Annealing:

1. Create an initial solution Z (global state of the system); initialize temperature T >> 1

2. Repeat until T = T-lower-bound:

Repeat until thermal equilibrium is reached at the current T:
• Generate a random transition from Z to Z'
• Let ΔE = E(Z') - E(Z)
• If ΔE < 0 then Z = Z'
• Else if exp[-ΔE/T] > rand(0,1) then Z = Z'

Reduce temperature T according to the cooling schedule

3. Return Z

The exp[-ΔE/T] term allows "thermal disturbances" that facilitate finding the global minimum.
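A generic simulated-annealing loop matching this pseudocode; the energy function, proposal mechanism, and geometric cooling schedule are placeholders of my choosing rather than the paper's:

```python
import numpy as np

def simulated_annealing(z0, energy, propose, T0=10.0, T_min=1e-3, alpha=0.95,
                        sweeps_per_T=100, seed=0):
    """Minimize energy(z) with Metropolis acceptance and geometric cooling."""
    rng = np.random.default_rng(seed)
    z, E = z0, energy(z0)
    T = T0
    while T > T_min:
        for _ in range(sweeps_per_T):          # crude stand-in for "thermal equilibrium"
            z_new = propose(z, rng)            # random transition Z -> Z'
            dE = energy(z_new) - E
            # accept downhill moves always, uphill moves with probability exp(-dE/T)
            if dE < 0 or np.exp(-dE / T) > rng.random():
                z, E = z_new, E + dE
        T *= alpha                             # cooling schedule
    return z, E
```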

Nonparametric Clustering

Nonparametric: no assumption on the number of clusters.

Observations:

the kernel matrix will have a block-diagonal structure (after a suitable permutation) when there are definite clusters within the data.

the eigenvectors of a permuted matrix are the permutations of the eigenvectors of the original matrix; therefore, an indication of the number of clusters may be obtained from the eigen-decomposition of the kernel matrix.

Recall the approximation:

$$\int p(x)^2\,dx \approx \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij}$$

Nonparametric Clustering

Moreover,

$$\int p(x)^2\,dx \approx \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} K_{ij} = \frac{1}{N}\mathbf{1}_N^T\,K\,\frac{1}{N}\mathbf{1}_N$$

Eigen-decomposition of K gives:

$$K = U\Lambda U^T$$

Thus we have:

$$\frac{1}{N}\mathbf{1}_N^T\,K\,\frac{1}{N}\mathbf{1}_N = \frac{1}{N}\mathbf{1}_N^T\Big(\sum_{i=1}^{N}\lambda_i\,u_i u_i^T\Big)\frac{1}{N}\mathbf{1}_N = \sum_{i=1}^{N}\lambda_i\{\tfrac{1}{N}\mathbf{1}_N^T u_i\}^2$$

This indicates that if there are K distinct clusters within the data samples, then there will be K dominant terms $\lambda_i\{\tfrac{1}{N}\mathbf{1}_N^T u_i\}^2$. (Why?)
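A short sketch that computes and ranks these terms for a given Gram matrix; the relative threshold used to call a term "dominant" is an ad hoc choice of mine, since the paper does not quantify it (see the remarks later):

```python
import numpy as np

def dominant_terms(Kmat, rel_tol=0.05):
    """Terms lambda_i * ((1/N) 1^T u_i)^2 from the eigen-decomposition of K;
    the count of 'dominant' ones hints at the number of clusters."""
    N = Kmat.shape[0]
    lam, U = np.linalg.eigh(Kmat)                 # K = U diag(lam) U^T (K symmetric PSD)
    terms = lam * (U.sum(axis=0) / N) ** 2        # lambda_i * ((1/N) * 1_N^T u_i)^2
    terms = np.sort(terms)[::-1]
    k_est = int((terms > rel_tol * terms[0]).sum())   # ad hoc dominance threshold
    return terms, k_est
```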

Nonparametric Clustering

Examples on phantom data sets:

Results and Discussion

Results on 3 data sets: Fisher Iris; Wine data set; Crabs data.

Results and Discussion

Conclusions and discussion:

the mean vectors in feature space may not serve as representatives or prototypes of the input-space clusters.

the block-diagonal structure of the kernel matrix can be exploited in estimating the number of possible clusters.

the choice of kernel will be data specific.

the RBF kernel links the sum-of-squares criterion with the probability metric.

the parameter of the RBF kernel should be determined by cross-validation or the leave-one-out technique.

eigen-decomposition of the N x N kernel matrix scales as O(N^3).

Results and Discussion

Remarks of my own:

the most appealing point is the link between the distance metric and the probability metric.

unclear why stochastic optimization is preferred over ordinary optimization methods.

no assessment of other types of kernels.

unclear how to permute the kernel matrix to obtain the block-diagonal structure.

the "super technical" term "dominant" for the terms $\lambda_i\{\tfrac{1}{N}\mathbf{1}_N^T u_i\}^2$ in the nonparametric part is too vague; it needs some quantification.

References

"Data clustering and data visualization," in Learning in Graphical Models, 1998.

"A projection pursuit algorithm for exploratory data analysis," IEEE Trans. Comput., 1974.

"An algorithm for Euclidean sum-of-squares classification," Biometrics, 1988.

"Maximum certainty data partitioning," Pattern Recognition, 2000.

"An expectation maximization approach to nonlinear component analysis," Neural Comput., 2001.

Questions?

Thank you!