Clustering (SVCL)
in a clustering problem we do not have labels in the training set...
TRANSCRIPT
Supervised learning
to study unsupervised learning, we start by recalling what we have learned for supervised learning
two components: optimal decision rule and parameter estimation
given
• a training set D = {(x1,y1), …, (xn,yn)}
• xi is a vector of observations, yi is the label
we start by
• estimating a probability model for each class
• this includes PX|Y(x|i), PY(i), for all i
Parameter estimation
this is done by maximum likelihood estimation, which has two steps:
• 1) we choose a parametric model for all probabilities

  P_{X|Y}(x|i; Θ)

• 2) we select the parameters of class i to be the ones that maximize the probability of the data from that class

  Θ^{(i)} = argmax_Θ P_{X|Y}(D^{(i)} | i; Θ) = argmax_Θ log P_{X|Y}(D^{(i)} | i; Θ)
Maximum likelihood
we have also seen that this is a relatively straightforward step
the solutions are the parameters such that

  ∇_Θ P_{X|Y}(x|i; Θ) = 0

note that you always have to check the second-order condition

  θ^T ∇²_Θ P_{X|Y}(x|i; Θ) θ ≤ 0, ∀θ ∈ ℝ^n

we must also do this for the class probabilities PY(i)
• but here there is not much choice of probability model
• Bernoulli: ML estimate is the percent of training points in the class
Maximum likelihood
we have worked out the Gaussian case in detail
• D^{(i)} = {x_1^{(i)}, ..., x_n^{(i)}} is the set of examples from class i
• the ML estimates are

  μ_i = (1/n_i) Σ_j x_j^{(i)}
  Σ_i = (1/n_i) Σ_j (x_j^{(i)} − μ_i)(x_j^{(i)} − μ_i)^T
  P_Y(i) = n_i / N

there are many other distributions for which we can derive a similar set of equations
but the Gaussian case is particularly relevant for clustering
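These estimates are easy to check numerically. A minimal 1-D sketch in plain Python (the class samples below are made up purely for illustration):

```python
# ML estimates for a Gaussian class model, 1-D case for brevity:
# mean  mu_i  = (1/n_i) * sum of class samples
# var   s_i^2 = (1/n_i) * sum of squared deviations (biased ML estimate)
# prior P_Y(i) = n_i / N
def gaussian_ml_estimates(classes):
    """classes: list of lists of 1-D samples, one inner list per class."""
    N = sum(len(c) for c in classes)
    estimates = []
    for samples in classes:
        n_i = len(samples)
        mu = sum(samples) / n_i                           # ML mean
        var = sum((x - mu) ** 2 for x in samples) / n_i   # ML variance
        prior = n_i / N                                   # ML class probability
        estimates.append((mu, var, prior))
    return estimates

# toy data: class 0 has 3 samples, class 1 has 2
est = gaussian_ml_estimates([[1.0, 2.0, 3.0], [10.0, 12.0]])
print(est)
```

For the toy data the first class gets mean 2.0, variance 2/3, and prior 3/5, exactly as the formulas dictate.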
Supervised learning
step 2: optimal decision rule
• assuming the zero/one loss, the optimal decision rule is

  i*(x) = argmax_i P_{Y|X}(i|x)

• which can also be written as

  i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

• this completes the process: we have a rule to classify any unseen point
Gaussian classifier
if all the covariances are the identity, Σ_i = I, then

  i*(x) = argmin_i [ d(x, μ_i) + α_i ]

with

  d(x, y) = ||x − y||²
  α_i = −2 log P_Y(i)

this is just template matching, with class means as templates
• e.g. for digit classification
• compare complexity to nearest neighbors!
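The identity-covariance rule above is a nearest-template rule, which is easy to sketch in plain Python (the templates, priors, and test point below are illustrative values, not part of the original deck):

```python
# Gaussian classifier with identity covariances: pick the class that
# minimizes d(x, mu_i) + alpha_i, where d is the squared Euclidean
# distance and alpha_i = -2 log P_Y(i).
import math

def classify(x, templates, priors):
    """x: feature vector; templates: class means mu_i; priors: P_Y(i)."""
    best, best_cost = None, float("inf")
    for i, (mu, p) in enumerate(zip(templates, priors)):
        d = sum((a - b) ** 2 for a, b in zip(x, mu))  # ||x - mu_i||^2
        cost = d - 2 * math.log(p)                    # + alpha_i
        if cost < best_cost:
            best, best_cost = i, cost
    return best

# toy example: a point near template 0 is assigned to class 0
print(classify([0.9, 1.1], [[1.0, 1.0], [5.0, 5.0]], [0.5, 0.5]))  # 0
```

With equal priors the α_i terms cancel and the rule reduces to plain nearest-template matching, which is why the complexity comparison to nearest neighbors is natural: there is one template per class instead of one per training point.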
Clustering
in a clustering problem we do not have labels in the training set
however, we can try to estimate these as well
there is a strategy
• we start with random class distributions
• we then iterate between two steps
• 1) apply the optimal decision rule for these distributions
  • this assigns each point to one of the clusters
• 2) update the distributions by doing parameter estimation within each class
Clustering
a natural question is: what model should we assume?
• well, let's start as simple as possible
• assume Gaussian with identity covariances and equal P_Y(i)
• each class has a prototype μ_i
the clustering algorithm becomes
• start with some initial estimate of the μ_i (e.g. random)
• then, iterate between
• 1) classification rule:

  i*(x) = argmin_i ||x − μ_i||²

• 2) re-estimation:

  μ_i^{new} = (1/n_i) Σ_j x_j^{(i)}
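This is exactly the k-means algorithm; the two-step loop can be sketched in plain Python (the 1-D toy data, the initialization, and the fixed iteration count are illustrative choices):

```python
# k-means as described above: alternate between
# (1) classification: assign each point to the nearest prototype mu_i,
# (2) re-estimation: set mu_i to the mean of the points assigned to it.
def kmeans(points, mus, iters=20):
    for _ in range(iters):
        # step 1: i*(x) = argmin_i ||x - mu_i||^2
        clusters = [[] for _ in mus]
        for x in points:
            i = min(range(len(mus)), key=lambda i: (x - mus[i]) ** 2)
            clusters[i].append(x)
        # step 2: mu_i = mean of cluster i (kept unchanged if empty)
        mus = [sum(c) / len(c) if c else m for c, m in zip(clusters, mus)]
    return mus

# two well-separated 1-D clusters around 1 and 9
print(kmeans([1.0, 1.2, 0.8, 9.0, 9.2, 8.8], mus=[0.0, 5.0]))  # ≈ [1.0, 9.0]
```

On well-separated data like this, the assignments stop changing after the first pass and the prototypes converge immediately to the cluster means.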
Extensions
there are many extensions to the basic k-means algorithm
• it turns out that one of the most important applications is supervised learning
• remember that the optimal decision rule

  i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]

is optimal insofar as the probabilities P_{X|Y}(x|i) are correctly estimated
• this usually turns out to be impossible, by definition, when we use parametric models like the Gaussian
• although these models are good local approximations, there are usually multiple clusters when we take a global view
• this can be captured with recourse to mixture distributions
Mixture distributions
consider the following problem
• certain types of traffic are banned from a bridge
• we need a classifier to see if the ban is holding
• the sensor measures vehicle weight
• we need to classify each car into the OK vs banned class
• we know that in each class there are multiple clusters
• e.g. OK = {compact, station wagon, SUV}
• banned = {truck, bus}
• each of these is close to Gaussian, but for the whole class we get this [figure: multimodal weight density]
Mixture distributions
this distribution is a mixture
• the overall shape is determined by a number of sub-classes
• we introduce a random variable Z to account for this
• given the value of Z (the cluster), we have a parametric mixture component
• e.g. a Gaussian

  P_X(x) = Σ_{c=1}^{C} π_c P_{X|Z}(x|c)

where C is the # of mixture components, π_c is the component "weight", and P_{X|Z}(x|c) is the cth "mixture component"
Mixture distributions
learning a mixture density is itself a clustering problem
• for each training point we need to figure out from which component it was drawn
• once we know the point assignments, we need to estimate the parameters of the component
this could be done with k-means
a more generic algorithm is expectation-maximization
• the only difference is that we never assign the points
• in the expectation step we compute the posterior class probabilities
• but we do not pick the max
• in the maximization step, the point contributes to all classes, weighted by its posterior class probability
Expectation-maximization
in summary
1. start with an initial parameter estimate Θ^{(0)}
2. E-step: given current parameters Θ^{(k)} and observations in D, compute the posterior class probabilities h_ij = P(z=j|x_i)
3. M-step: solve for the parameters, i.e. compute Θ^{(k+1)}, using weighted maximum likelihood, where x_i has weight h_ij
4. go to 2.
in a graphical form: the E-step fills in the class assignments, the M-step estimates the parameters
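The E/M loop above can be sketched for a two-component 1-D Gaussian mixture in plain Python (the toy data, the initialization, and the iteration count are illustrative choices, not values from the deck):

```python
# EM for a two-component 1-D Gaussian mixture:
# E-step: compute soft posteriors h_ij = P(z=j | x_i) via Bayes rule.
# M-step: weighted ML updates, each x_i contributing with weight h_ij.
import math

def em_gmm(xs, mus, sigmas, pis, iters=50):
    for _ in range(iters):
        # E-step
        H = []
        for x in xs:
            ps = [p / (s * math.sqrt(2 * math.pi)) *
                  math.exp(-0.5 * ((x - m) / s) ** 2)
                  for m, s, p in zip(mus, sigmas, pis)]
            total = sum(ps)
            H.append([p / total for p in ps])
        # M-step
        for j in range(len(mus)):
            nj = sum(h[j] for h in H)                       # effective count
            mus[j] = sum(h[j] * x for h, x in zip(H, xs)) / nj
            var = sum(h[j] * (x - mus[j]) ** 2 for h, x in zip(H, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-9))           # avoid collapse
            pis[j] = nj / len(xs)
    return mus, sigmas, pis

# two well-separated clusters around 1 and 9
mus, sigmas, pis = em_gmm([0.9, 1.1, 1.0, 8.9, 9.1, 9.0],
                          mus=[0.0, 5.0], sigmas=[1.0, 1.0], pis=[0.5, 0.5])
print([round(m, 2) for m in mus])  # ≈ [1.0, 9.0]
```

Unlike k-means, no point is ever hard-assigned: every x_i contributes to every component, weighted by its posterior h_ij, which is why the covariance and weight estimates come out of the same update for free.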
Comparison to k-means
• assignments

  EM:      h_ij = P_{Z|X}(j|x_i)
  K-means: i*(x_i) = argmax_j P_{Z|X}(j|x_i), i.e.
           h_ij = 1 if P_{Z|X}(j|x_i) > P_{Z|X}(k|x_i), ∀k ≠ j, and 0 otherwise

• parameter updates (with n_j = Σ_i h_ij, identical in form for both)

  μ_j^{new} = (1/n_j) Σ_i h_ij x_i
Expectation-maximization
note that the difference to k-means is that
• in the E-step, h_ij would be thresholded to 0 or 1
• this would make the M-step exactly the same
• plus we get the estimates of the covariances and class probabilities automatically
• k-means can be seen as a greedy version of EM
• at each point we make the optimal decision
• but this does not take into account other points or future iterations
• if the hard assignment is best, EM can choose it too
to get a feeling for EM you can use
• http://www-cse.ucsd.edu/users/ibayrakt/java/em/
Motivation
recall, in Bayesian decision theory we have
• world states Y in {1, ..., M}, observations X
• class-conditional densities P_{X|Y}(x|y)
• class probabilities P_Y(i)
• the Bayes decision rule (BDR)
we have seen that this is only optimal insofar as all probabilities involved are correctly estimated
it turns out that one of the interesting applications of EM is to learn the class-conditional densities
Example
image segmentation:
• given this image, can we segment it into the cheetah and background classes?
• useful for many applications
• recognition: "this image has a cheetah"
• compression: code the cheetah with more bits
• graphics: a plug-in for Photoshop would allow manipulating objects
since we have two classes, we should be able to do this with our optimal classifiers
Example
we start by collecting a lot of examples of cheetahs and a lot of examples of grass
you can get tons of these on Google image search
Example
we represent images as bags of little image patches
the patches are classified individually, e.g. with a Gaussian classifier
[figure: image → discrete cosine transform → bag of DCT vectors → Gaussian → P_{X|Y}(x|cheetah)]
Example
do the same for grass, and apply the BDR to each patch to classify it

  i*(x) = argmax_i [ log G(x, μ_i, Σ_i) + log P_Y(i) ]

[figure: test image → discrete cosine transform → bag of DCT vectors → evaluated against P_{X|Y}(x|grass) and P_{X|Y}(x|cheetah)]
Example
it turns out that better performance is achieved by modeling the classes as mixtures of Gaussians
[figure: image → discrete cosine transform → bag of DCT vectors → mixture of Gaussians → P_{X|Y}(x|cheetah)]
Example
do the same for grass, and apply the BDR to each patch to classify it

  i*(x) = argmax_i [ log Σ_k π_{ki} G(x, μ_{ki}, Σ_{ki}) + log P_Y(i) ]

[figure: test image → discrete cosine transform → bag of DCT vectors → evaluated against P_{X|Y}(x|grass) and P_{X|Y}(x|cheetah)]
Classification
the use of sophisticated models, e.g. mixtures, usually improves performance
however, it is not a magic solution
earlier on in the course we talked about features
typically, you have to start from a good feature set
it turns out that even with a good feature set, you must be careful
consider the following example, from our image classification problem
Example
cheetah Gaussian classifier, DCT space

  features       | 8 first DCT features | all 64 features
  prob. of error | 4%                   | 8%

interesting observation: more features = higher error!
Example
the first reason why this happens is that things are not what we think they are in high dimensions
one could say that high-dimensional spaces are STRANGE!!!
in practice, we invariably have to do some form of dimensionality reduction
we will see that eigenvalues play a major role in this
one of the major dimensionality reduction techniques is principal component analysis (PCA)
but let's start by discussing the problems of high dimensions
High dimensional spaces
are strange!
first thing to know:
  "never trust your intuition in high dimensions!"
more often than not you will be wrong!
there are many examples of this
we will do a couple here, skipping most of the math
both fun and instructive
The hyper-sphere
consider the sphere of radius r in a space of dimension d
it can be shown that its volume is

  V_d(r) = [π^{d/2} / Γ(d/2 + 1)] r^d

where Γ(n) is the Gamma function
Hyper-cube vs hyper-sphere
next consider the hyper-cube [−a, a]^d and the inscribed hyper-sphere of radius a
Q: what does your intuition tell you about the relative sizes of these two objects?
1. volume of sphere ≈ volume of cube
2. volume of sphere >> volume of cube
3. volume of sphere << volume of cube
Answer
we can just compute this

  f_d = V_sphere / V_cube = [π^{d/2} a^d / Γ(d/2 + 1)] / (2a)^d = π^{d/2} / [2^d Γ(d/2 + 1)]

a sequence that does not depend on a, just on the dimension d!

  d   | 1 | 2    | 3    | 4    | 5    | 6   | 7
  f_d | 1 | .785 | .524 | .308 | .164 | .08 | .037

it goes to zero, and goes to zero fast!
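The table can be reproduced directly from the ratio formula with Python's standard library:

```python
# f_d = volume of inscribed sphere / volume of cube
#     = pi^(d/2) / (2^d * Gamma(d/2 + 1)), independent of the half-width a.
import math

def sphere_cube_ratio(d):
    return math.pi ** (d / 2) / (2 ** d * math.gamma(d / 2 + 1))

for d in range(1, 8):
    print(d, round(sphere_cube_ratio(d), 3))
```

This prints 1, .785, .524, .308, .164, .08, .037 for d = 1..7, matching the table, and keeps shrinking super-exponentially for larger d.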
Hyper-cube vs hyper-sphere
this means that:
  "as the dimension of the space increases, the volume of the sphere becomes much smaller (infinitesimal) than that of the cube!"
how does this go against intuition?
it is actually not very surprising; we can see it even in low dimensions
1. d = 1: the volumes are the same
2. d = 2: the volume of the sphere is already smaller
Hyper-cube vs hyper-sphere
as the dimension increases, the volume of the shaded corners becomes larger
in high dimensions, the picture you should have in mind is that all the volume of the cube is in these spikes!
Believe it or not
we can check this mathematically: consider the vector d = (a, …, a)^T from the center of the cube to a corner, and the vector p = (a, 0, …, 0)^T from the center to the nearest face
note that

  cos(d, p) = d^T p / (||d|| ||p||) = a² / (a√d · a) = 1/√d → 0
  ||d|| / ||p|| = √d → ∞

so d becomes orthogonal to p as the dimension increases, and infinitely larger!!!
But there is more
consider the crust of the unit sphere, of thickness ε
we can compute its volume: with S1 the sphere of radius 1 and S2 the sphere of radius 1 − ε,

  V(crust)/V(S1) = [V(S1) − V(S2)] / V(S1) = 1 − (1 − ε)^d

no matter how small ε is, the ratio (1 − ε)^d of the inner volume goes to zero as d increases
i.e. "all the volume is in the crust!"
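Since volumes scale as r^d, the crust fraction is just 1 − (1 − ε)^d, which is easy to evaluate (the thickness ε = 0.01 and the dimensions printed are illustrative choices):

```python
# Fraction of the unit sphere's volume lying in a crust of thickness eps:
# 1 - (1 - eps)^d, because volume scales as r^d.
def crust_fraction(eps, d):
    return 1 - (1 - eps) ** d

for d in (1, 10, 100, 1000):
    print(d, round(crust_fraction(0.01, d), 3))
```

Even for a crust only 1% thick, the fraction climbs from 0.01 in 1D to about 0.63 in 100D and essentially 1 in 1000D: the interior holds almost none of the volume.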
High dimensional Gaussian
for a Gaussian, it can be shown that if

  X ~ N(μ, σ²I)

and one considers the hyper-sphere where the probability density drops to 1% of its peak value,

  ||x − μ||² = 2σ² ln 100

the probability mass outside this sphere is

  P_n = P(χ²(n) > 2 ln 100)

where χ²(n) is a chi-squared random variable with n degrees of freedom
High-dimensional Gaussian
if you evaluate this, you'll find out that

  n       | 1    | 2   | 3   | 4   | 5   | 6   | 10  | 15   | 20
  1 − P_n | .998 | .99 | .97 | .94 | .89 | .83 | .48 | .134 | .02

as the dimension increases, all the probability mass is on the tails
the point of maximum density is still the mean
really strange: in high dimensions the Gaussian is a very heavy-tailed distribution
take-home message:
• "in high dimensions, never trust your intuition!"
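A quick Monte Carlo sanity check of this table (the sample count and seed are arbitrary choices; the radius ||x||² = 2 ln 100 is where a standard Gaussian density falls to 1% of its peak):

```python
# Estimate the mass of a standard n-D Gaussian inside the sphere where the
# density is at least 1% of its peak, i.e. ||x||^2 <= 2 ln(100) ~ 9.21.
import math
import random

def mass_inside(n, samples=20000, seed=0):
    rng = random.Random(seed)
    r2 = 2 * math.log(100)  # squared radius of the 1%-of-peak sphere
    inside = sum(
        1 for _ in range(samples)
        if sum(rng.gauss(0, 1) ** 2 for _ in range(n)) <= r2
    )
    return inside / samples

for n in (1, 2, 10, 20):
    print(n, round(mass_inside(n), 3))  # roughly matches the table above
```

The estimates track the 1 − P_n row of the table: near 1 in low dimensions, around a half at n = 10, and almost nothing at n = 20.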
The curse of dimensionality
typical observation in Bayes decision theory:
• error increases when the number of features is large
highly unintuitive since, theoretically:
• if I have a problem in n-D, I can always generate a problem in (n+1)-D with smaller probability of error
e.g. two uniform classes A and B in 1D
can be transformed into a 2D problem with the same error
just add a non-informative variable y
Curse of dimensionality
[figure: two scatter plots in (x, y)]
but it is also possible to reduce the error by adding a second variable which is informative
on the left, there is no decision boundary that will achieve zero error
on the right, the decision boundary shown has zero error
Curse of dimensionality
in fact, it is impossible to do worse in 2D than in 1D
if we move the classes along the lines shown in green, the error can only go down, since there will be less overlap
Curse of dimensionality
so why do we observe this curse of dimensionality?
the problem is the quality of the density estimates
all we have seen so far assumes perfect estimation of the BDR
we discussed various reasons why this is not easy
• most densities are not Gaussian, exponential, etc.
• but a mixture of several factors
• many unknowns (# of components, what type), the likelihood has local minima, etc.
• even with algorithms like EM, it is difficult to get this right
Curse of dimensionality
but the problem is much deeper than this
even for simple models (e.g. Gaussian), we need a large number of examples n to have good estimates
Q: what does "large" mean? this depends on the dimension of the space
the best way to see this is to think of a histogram
• suppose you have 100 points and you need at least 10 bins per axis in order to get a reasonable quantization
for uniform data you get, on average:

  dimension  | 1  | 2 | 3
  points/bin | 10 | 1 | 0.1

decent in 1D, bad in 2D, terrible in 3D (9 out of each 10 bins empty)
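The table above is just n / (bins per axis)^d; a one-line sketch makes the exponential blow-up of the bin count explicit:

```python
# Average points per histogram bin: the number of bins grows as
# bins_per_axis^d, so occupancy collapses exponentially with dimension.
def points_per_bin(n_points, bins_per_axis, d):
    return n_points / bins_per_axis ** d

for d in (1, 2, 3):
    print(d, points_per_bin(100, 10, d))  # 10.0, 1.0, 0.1
```

To keep 10 points per bin in d dimensions you would need 10^(d+1) samples, which is the histogram version of the curse of dimensionality.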
Dimensionality reduction
what do we do about this? we avoid unnecessary dimensions
unnecessary can be measured in two ways:
1. features are not discriminant
2. features are not independent
non-discriminant means that they do not separate the classes well
[figure: discriminant vs non-discriminant feature examples]