Clustering (SVCL)
in a clustering problem we do not have labels in the training set...
TRANSCRIPT
Supervised learning
to study unsupervised learning, we start by recalling what we have learned for supervised learning
two components: optimal decision rule and parameter estimation
given
• a training set D = {(x1,y1), …, (xn,yn)}
• xi is a vector of observations, yi is the label
we start by
• estimating a probability model for each class
• this includes PX|Y(x|i), PY(i), for all i
Parameter estimation
this is done by maximum likelihood estimation, which has two steps:
• 1) we choose a parametric model for all probabilities

  P_{X|Y}(x|i; Θ)

• 2) we select the parameters of class i to be the ones that maximize the probability of the data from that class

  Θ^{(i)} = argmax_Θ P_{X|Y}(D^{(i)} | i; Θ) = argmax_Θ log P_{X|Y}(D^{(i)} | i; Θ)
Maximum likelihood
we have also seen that this is a relatively straightforward step
the solutions are the parameters such that

  ∇_Θ P_{X|Y}(x|i; Θ) = 0

note that you always have to check the second-order condition

  θ^T ∇²_Θ P_{X|Y}(x|i; Θ) θ ≤ 0, ∀θ ∈ ℝ^n

we must also do this for the class probabilities PY(i)
• but here there is not much choice of probability model
• Bernoulli: ML estimate is the percent of training points in the class
Maximum likelihood
we have worked out the Gaussian case in detail
• D^{(i)} = {x_1^{(i)}, ..., x_n^{(i)}} is the set of examples from class i
• the ML estimates are

  μ_i = (1/n_i) Σ_j x_j^{(i)}
  Σ_i = (1/n_i) Σ_j (x_j^{(i)} − μ_i)(x_j^{(i)} − μ_i)^T
  P_Y(i) = n_i / N

there are many other distributions for which we can derive a similar set of equations
but the Gaussian case is particularly relevant for clustering
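These estimates are easy to check numerically. A minimal 1-D sketch in plain Python (the class samples below are made up purely for illustration):

```python
# ML estimates for a Gaussian class model, 1-D case for brevity:
# mean  mu_i  = (1/n_i) * sum of class samples
# var   s_i^2 = (1/n_i) * sum of squared deviations (biased ML estimate)
# prior P_Y(i) = n_i / N
def gaussian_ml_estimates(classes):
    """classes: list of lists of 1-D samples, one inner list per class."""
    N = sum(len(c) for c in classes)
    estimates = []
    for samples in classes:
        n_i = len(samples)
        mu = sum(samples) / n_i                           # ML mean
        var = sum((x - mu) ** 2 for x in samples) / n_i   # ML variance
        prior = n_i / N                                   # ML class probability
        estimates.append((mu, var, prior))
    return estimates

# toy data: class 0 has 3 samples, class 1 has 2
est = gaussian_ml_estimates([[1.0, 2.0, 3.0], [10.0, 12.0]])
print(est)
```

For the toy data the first class gets mean 2.0, variance 2/3, and prior 3/5, exactly as the formulas dictate.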
Supervised learning
step 2: optimal decision rule
• assuming the zero/one loss, the optimal decision rule is

  i*(x) = argmax_i P_{Y|X}(i|x)

• which can also be written as

  i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

• this completes the process: we have a rule to classify any unseen point
Gaussian classifier
if all the covariances are the identity, Σ_i = I, then

  i*(x) = argmin_i [ d(x, μ_i) + α_i ]

with

  d(x, y) = ||x − y||²
  α_i = −2 log P_Y(i)

this is just template matching, with class means as templates
• e.g. for digit classification
• compare complexity to nearest neighbors!
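The identity-covariance rule above is a nearest-template rule, which is easy to sketch in plain Python (the templates, priors, and test point below are illustrative values, not part of the original deck):

```python
# Gaussian classifier with identity covariances: pick the class that
# minimizes d(x, mu_i) + alpha_i, where d is the squared Euclidean
# distance and alpha_i = -2 log P_Y(i).
import math

def classify(x, templates, priors):
    """x: feature vector; templates: class means mu_i; priors: P_Y(i)."""
    best, best_cost = None, float("inf")
    for i, (mu, p) in enumerate(zip(templates, priors)):
        d = sum((a - b) ** 2 for a, b in zip(x, mu))  # ||x - mu_i||^2
        cost = d - 2 * math.log(p)                    # + alpha_i
        if cost < best_cost:
            best, best_cost = i, cost
    return best

# toy example: a point near template 0 is assigned to class 0
print(classify([0.9, 1.1], [[1.0, 1.0], [5.0, 5.0]], [0.5, 0.5]))  # 0
```

With equal priors the α_i terms cancel and the rule reduces to plain nearest-template matching, which is why the complexity comparison to nearest neighbors is natural: there is one template per class instead of one per training point.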
Clustering
in a clustering problem we do not have labels in the training set
however, we can try to estimate these as well
there is a strategy
• we start with random class distributions
• we then iterate between two steps
• 1) apply the optimal decision rule for these distributions
  • this assigns each point to one of the clusters
• 2) update the distributions by doing parameter estimation within each class
Clustering
a natural question is: what model should we assume?
• well, let's start as simple as possible
• assume Gaussian with identity covariances and equal P_Y(i)
• each class has a prototype μ_i
the clustering algorithm becomes
• start with some initial estimate of the μ_i (e.g. random)
• then, iterate between
• 1) classification rule:

  i*(x) = argmin_i ||x − μ_i||²

• 2) re-estimation:

  μ_i^{new} = (1/n_i) Σ_j x_j^{(i)}
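This is exactly the k-means algorithm; the two-step loop can be sketched in plain Python (the 1-D toy data, the initialization, and the fixed iteration count are illustrative choices):

```python
# k-means as described above: alternate between
# (1) classification: assign each point to the nearest prototype mu_i,
# (2) re-estimation: set mu_i to the mean of the points assigned to it.
def kmeans(points, mus, iters=20):
    for _ in range(iters):
        # step 1: i*(x) = argmin_i ||x - mu_i||^2
        clusters = [[] for _ in mus]
        for x in points:
            i = min(range(len(mus)), key=lambda i: (x - mus[i]) ** 2)
            clusters[i].append(x)
        # step 2: mu_i = mean of cluster i (kept unchanged if empty)
        mus = [sum(c) / len(c) if c else m for c, m in zip(clusters, mus)]
    return mus

# two well-separated 1-D clusters around 1 and 9
print(kmeans([1.0, 1.2, 0.8, 9.0, 9.2, 8.8], mus=[0.0, 5.0]))  # ≈ [1.0, 9.0]
```

On well-separated data like this, the assignments stop changing after the first pass and the prototypes converge immediately to the cluster means.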
Extensions
there are many extensions to the basic k-means algorithm
• it turns out that one of the most important applications is supervised learning
• remember that the optimal decision rule

  i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

is optimal insofar as the probabilities P_{X|Y}(x|i) are correctly estimated
• this usually turns out to be impossible, by definition, when we use parametric models like the Gaussian
• although these models are good local approximations, there are usually multiple clusters when we take a global view
• this can be captured with recourse to mixture distributions
Mixture distributions
consider the following problem
• certain types of traffic are banned from a bridge
• we need a classifier to see if the ban is holding
• the sensor measures vehicle weight
• we need to classify each car into the OK vs banned class
• we know that in each class there are multiple clusters
• e.g. OK = {compact, station wagon, SUV}
• banned = {truck, bus}
• each of these is close to Gaussian, but for the whole class we get this [figure: multimodal weight density]
Mixture distributions
this distribution is a mixture
• the overall shape is determined by a number of sub-classes
• we introduce a random variable Z to account for this
• given the value of Z (the cluster), we have a parametric mixture component
• e.g. a Gaussian

  P_X(x) = Σ_{c=1}^{C} π_c P_{X|Z}(x|c)

where C is the # of mixture components, π_c is the component "weight", and P_{X|Z}(x|c) is the cth "mixture component"
Mixture distributions
learning a mixture density is itself a clustering problem
• for each training point we need to figure out from which component it was drawn
• once we know the point assignments, we need to estimate the parameters of the component
this could be done with k-means
a more generic algorithm is expectation-maximization
• the only difference is that we never assign the points
• in the expectation step we compute the posterior class probabilities
• but we do not pick the max
• in the maximization step, the point contributes to all classes, weighted by its posterior class probability
Expectation-maximization
in summary
1. start with an initial parameter estimate Θ^{(0)}
2. E-step: given current parameters Θ^{(k)} and observations in D, compute the posterior class probabilities h_ij = P(z=j|x_i)
3. M-step: solve for the parameters, i.e. compute Θ^{(k+1)}, using weighted maximum likelihood, where x_i has weight h_ij
4. go to 2.
in a graphical form: the E-step fills in the class assignments, the M-step estimates the parameters
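The E/M loop above can be sketched for a two-component 1-D Gaussian mixture in plain Python (the toy data, the initialization, and the iteration count are illustrative choices, not values from the deck):

```python
# EM for a two-component 1-D Gaussian mixture:
# E-step: compute soft posteriors h_ij = P(z=j | x_i) via Bayes rule.
# M-step: weighted ML updates, each x_i contributing with weight h_ij.
import math

def em_gmm(xs, mus, sigmas, pis, iters=50):
    for _ in range(iters):
        # E-step
        H = []
        for x in xs:
            ps = [p / (s * math.sqrt(2 * math.pi)) *
                  math.exp(-0.5 * ((x - m) / s) ** 2)
                  for m, s, p in zip(mus, sigmas, pis)]
            total = sum(ps)
            H.append([p / total for p in ps])
        # M-step
        for j in range(len(mus)):
            nj = sum(h[j] for h in H)                       # effective count
            mus[j] = sum(h[j] * x for h, x in zip(H, xs)) / nj
            var = sum(h[j] * (x - mus[j]) ** 2 for h, x in zip(H, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-9))           # avoid collapse
            pis[j] = nj / len(xs)
    return mus, sigmas, pis

# two well-separated clusters around 1 and 9
mus, sigmas, pis = em_gmm([0.9, 1.1, 1.0, 8.9, 9.1, 9.0],
                          mus=[0.0, 5.0], sigmas=[1.0, 1.0], pis=[0.5, 0.5])
print([round(m, 2) for m in mus])  # ≈ [1.0, 9.0]
```

Unlike k-means, no point is ever hard-assigned: every x_i contributes to every component, weighted by its posterior h_ij, which is why the covariance and weight estimates come out of the same update for free.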
Comparison to k-means
• assignments

  EM:      h_ij = P_{Z|X}(j|x_i)
  K-means: i*(x_i) = argmax_j P_{Z|X}(j|x_i), i.e.
           h_ij = 1 if P_{Z|X}(j|x_i) > P_{Z|X}(k|x_i), ∀k ≠ j, and 0 otherwise

• parameter updates (with n_j = Σ_i h_ij, identical in form for both)

  μ_j^{new} = (1/n_j) Σ_i h_ij x_i
Expectation-maximization
note that the difference to k-means is that
• in the E-step, h_ij would be thresholded to 0 or 1
• this would make the M-step exactly the same
• plus we get the estimates of the covariances and class probabilities automatically
• k-means can be seen as a greedy version of EM
• at each point we make the optimal decision
• but this does not take into account other points or future iterations
• if the hard assignment is best, EM can choose it too
to get a feeling for EM you can use
• http://www-cse.ucsd.edu/users/ibayrakt/java/em/
Motivation
recall, in Bayesian decision theory we have
• world states Y in {1, ..., M}, observations X
• class-conditional densities P_{X|Y}(x|y)
• class probabilities P_Y(i)
• the Bayes decision rule (BDR)
we have seen that this is only optimal insofar as all probabilities involved are correctly estimated
it turns out that one of the interesting applications of EM is to learn the class-conditional densities
Example
image segmentation:
• given this image, can we segment it into the cheetah and background classes?
• useful for many applications
• recognition: "this image has a cheetah"
• compression: code the cheetah with more bits
• graphics: a plug-in for Photoshop would allow manipulating objects
since we have two classes, we should be able to do this with our optimal classifiers
Example
we start by collecting a lot of examples of cheetahs and a lot of examples of grass
you can get tons of these on Google image search
Example
we represent images as bags of little image patches
the patches are classified individually, e.g. with a Gaussian classifier
[figure: image → discrete cosine transform → bag of DCT vectors → Gaussian → P_{X|Y}(x|cheetah)]
Example
do the same for grass, and apply the BDR to each patch to classify it

  i*(x) = argmax_i [ log G(x, μ_i, Σ_i) + log P_Y(i) ]

[figure: test image → discrete cosine transform → bag of DCT vectors → evaluated against P_{X|Y}(x|grass) and P_{X|Y}(x|cheetah)]
Example
it turns out that better performance is achieved by modeling the classes as mixtures of Gaussians
[figure: image → discrete cosine transform → bag of DCT vectors → mixture of Gaussians → P_{X|Y}(x|cheetah)]
Example
do the same for grass, and apply the BDR to each patch to classify it

  i*(x) = argmax_i [ log Σ_k π_{ki} G(x, μ_{ki}, Σ_{ki}) + log P_Y(i) ]

[figure: test image → discrete cosine transform → bag of DCT vectors → evaluated against P_{X|Y}(x|grass) and P_{X|Y}(x|cheetah)]
Classification
the use of sophisticated models, e.g. mixtures, usually improves performance
however, it is not a magic solution
earlier on in the course we talked about features
typically, you have to start from a good feature set
it turns out that even with a good feature set, you must be careful
consider the following example, from our image classification problem
Example
cheetah Gaussian classifier, DCT space

  features       | 8 first DCT features | all 64 features
  prob. of error | 4%                   | 8%

interesting observation: more features = higher error!
Example
the first reason why this happens is that things are not what we think they are in high dimensions
one could say that high-dimensional spaces are STRANGE!!!
in practice, we invariably have to do some form of dimensionality reduction
we will see that eigenvalues play a major role in this
one of the major dimensionality reduction techniques is principal component analysis (PCA)
but let's start by discussing the problems of high dimensions
High dimensional spaces
are strange!
first thing to know:
  "never trust your intuition in high dimensions!"
more often than not you will be wrong!
there are many examples of this
we will do a couple here, skipping most of the math
both fun and instructive
The hyper-sphere
consider the sphere of radius r in a space of dimension d
it can be shown that its volume is

  V_d(r) = [π^{d/2} / Γ(d/2 + 1)] r^d

where Γ(n) is the Gamma function
Hyper-cube vs hyper-sphere
next consider the hyper-cube [−a, a]^d and the inscribed hyper-sphere of radius a
Q: what does your intuition tell you about the relative sizes of these two objects?
1. volume of sphere ≈ volume of cube
2. volume of sphere >> volume of cube
3. volume of sphere << volume of cube
Answer
we can just compute this

  f_d = V_sphere / V_cube = [π^{d/2} a^d / Γ(d/2 + 1)] / (2a)^d = π^{d/2} / [2^d Γ(d/2 + 1)]

a sequence that does not depend on a, just on the dimension d!

  d   | 1 | 2    | 3    | 4    | 5    | 6   | 7
  f_d | 1 | .785 | .524 | .308 | .164 | .08 | .037

it goes to zero, and goes to zero fast!
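The table can be reproduced directly from the ratio formula with Python's standard library:

```python
# f_d = volume of inscribed sphere / volume of cube
#     = pi^(d/2) / (2^d * Gamma(d/2 + 1)), independent of the half-width a.
import math

def sphere_cube_ratio(d):
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in range(1, 8):
    print(d, round(sphere_cube_ratio(d), 3))
```

This prints 1, .785, .524, .308, .164, .08, .037 for d = 1..7, matching the table, and keeps shrinking super-exponentially for larger d.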
Hyper-cube vs hyper-sphere
this means that:
  "as the dimension of the space increases, the volume of the sphere becomes much smaller (infinitesimal) than that of the cube!"
how does this go against intuition?
it is actually not very surprising; we can see it even in low dimensions
1. d = 1: the volumes are the same
2. d = 2: the volume of the sphere is already smaller
Hyper-cube vs hyper-sphere
as the dimension increases, the volume of the shaded corners becomes larger
in high dimensions, the picture you should have in mind is that all the volume of the cube is in these spikes!
Believe it or not
we can check this mathematically: consider the vector d = (a, …, a)^T from the center of the cube to a corner, and the vector p = (a, 0, …, 0)^T from the center to the nearest face
note that

  cos(d, p) = d^T p / (||d|| ||p||) = a² / (a√d · a) = 1/√d → 0
  ||d|| / ||p|| = √d → ∞

so d becomes orthogonal to p as the dimension increases, and infinitely larger!!!
But there is more
consider the crust of the unit sphere, of thickness ε
we can compute its volume: with S1 the sphere of radius 1 and S2 the sphere of radius 1 − ε,

  V(crust)/V(S1) = [V(S1) − V(S2)] / V(S1) = 1 − (1 − ε)^d

no matter how small ε is, the ratio (1 − ε)^d of the inner volume goes to zero as d increases
i.e. "all the volume is in the crust!"
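Since volumes scale as r^d, the crust fraction is just 1 − (1 − ε)^d, which is easy to evaluate (the thickness ε = 0.01 and the dimensions printed are illustrative choices):

```python
# Fraction of the unit sphere's volume lying in a crust of thickness eps:
# 1 - (1 - eps)^d, because volume scales as r^d.
def crust_fraction(eps, d):
    return 1 - (1 - eps) ** d

for d in (1, 10, 100, 1000):
    print(d, round(crust_fraction(0.01, d), 3))
```

Even for a crust only 1% thick, the fraction climbs from 0.01 in 1D to about 0.63 in 100D and essentially 1 in 1000D: the interior holds almost none of the volume.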
High dimensional Gaussian
for a Gaussian, it can be shown that if

  X ~ N(μ, σ²I)

and one considers the hyper-sphere where the probability density drops to 1% of its peak value,

  ||x − μ||² = 2σ² ln 100

the probability mass outside this sphere is

  P_n = P(χ²(n) > 2 ln 100)

where χ²(n) is a chi-squared random variable with n degrees of freedom
High-dimensional Gaussian
if you evaluate this, you'll find out that

  n       | 1    | 2   | 3   | 4   | 5   | 6   | 10  | 15   | 20
  1 − P_n | .998 | .99 | .97 | .94 | .89 | .83 | .48 | .134 | .02

as the dimension increases, all the probability mass is on the tails
the point of maximum density is still the mean
really strange: in high dimensions the Gaussian is a very heavy-tailed distribution
take-home message:
• "in high dimensions, never trust your intuition!"
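A quick Monte Carlo sanity check of this table (the sample count and seed are arbitrary choices; the radius ||x||² = 2 ln 100 is where a standard Gaussian density falls to 1% of its peak):

```python
# Estimate the mass of a standard n-D Gaussian inside the sphere where the
# density is at least 1% of its peak, i.e. ||x||^2 <= 2 ln(100) ~ 9.21.
import math
import random

def mass_inside(n, samples=20000, seed=0):
    rng = random.Random(seed)
    r2 = 2 * math.log(100)  # squared radius of the 1%-of-peak sphere
    inside = sum(
        1 for _ in range(samples)
        if sum(rng.gauss(0, 1) ** 2 for _ in range(n)) <= r2
    )
    return inside / samples

for n in (1, 2, 10, 20):
    print(n, round(mass_inside(n), 3))  # roughly matches the table above
```

The estimates track the 1 − P_n row of the table: near 1 in low dimensions, around a half at n = 10, and almost nothing at n = 20.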
The curse of dimensionality
typical observation in Bayes decision theory:
• error increases when the number of features is large
highly unintuitive since, theoretically:
• if I have a problem in n-D, I can always generate a problem in (n+1)-D with smaller probability of error
e.g. two uniform classes A and B in 1D
can be transformed into a 2D problem with the same error
just add a non-informative variable y
Curse of dimensionality
[figure: two scatter plots in (x, y)]
but it is also possible to reduce the error by adding a second variable which is informative
on the left, there is no decision boundary that will achieve zero error
on the right, the decision boundary shown has zero error
Curse of dimensionality
in fact, it is impossible to do worse in 2D than in 1D
if we move the classes along the lines shown in green, the error can only go down, since there will be less overlap
Curse of dimensionality
so why do we observe this curse of dimensionality?
the problem is the quality of the density estimates
all we have seen so far assumes perfect estimation of the BDR
we discussed various reasons why this is not easy
• most densities are not Gaussian, exponential, etc.
• but a mixture of several factors
• many unknowns (# of components, what type), the likelihood has local minima, etc.
• even with algorithms like EM, it is difficult to get this right
Curse of dimensionality
but the problem is much deeper than this
even for simple models (e.g. Gaussian), we need a large number of examples n to have good estimates
Q: what does "large" mean? this depends on the dimension of the space
the best way to see this is to think of a histogram
• suppose you have 100 points and you need at least 10 bins per axis in order to get a reasonable quantization
for uniform data you get, on average:

  dimension  | 1  | 2 | 3
  points/bin | 10 | 1 | 0.1

decent in 1D, bad in 2D, terrible in 3D (9 out of each 10 bins empty)
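The table above is just n / (bins per axis)^d; a one-line sketch makes the exponential blow-up of the bin count explicit:

```python
# Average points per histogram bin: the number of bins grows as
# bins_per_axis^d, so occupancy collapses exponentially with dimension.
def points_per_bin(n_points, bins_per_axis, d):
    return n_points / bins_per_axis ** d

for d in (1, 2, 3):
    print(d, points_per_bin(100, 10, d))  # 10.0, 1.0, 0.1
```

To keep 10 points per bin in d dimensions you would need 10^(d+1) samples, which is the histogram version of the curse of dimensionality.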
Dimensionality reduction
what do we do about this? we avoid unnecessary dimensions
unnecessary can be measured in two ways:
1. features are not discriminant
2. features are not independent
non-discriminant means that they do not separate the classes well
[figure: discriminant vs non-discriminant feature examples]