1
Classification & Clustering
魏志達 Jyh-Da Wei
-- Parametric and Nonparametric Methods
Introduction to Machine Learning (Chap 4,5,7,8), E. Alpaydin
2
Classes vs. Clusters
Classification: supervised learning – Pattern Recognition, K-Nearest Neighbor, Multilayer Perceptron
Clustering: unsupervised learning – K-Means, Expectation Maximization, Self-Organizing Map

           Parametric     Nonparametric    Networks
Classes    PR             Kernel, KNN      MLP
Clusters   K-Means, EM    Agglomerative    SOM
4
Bayes’ Rule
P(C|x) = p(x|C) P(C) / p(x)
posterior = likelihood × prior / evidence

0 ≤ P(C) ≤ 1
p(x) = p(x|C=1) P(C=1) + p(x|C=0) P(C=0)
P(C=0|x) + P(C=1|x) = 1

Choose C_i if P(C_i|x) = max_k P(C_k|x)
(i.e., p(x|C_i) P(C_i) = max_k p(x|C_k) P(C_k))

Because p(x) is the same for every class once x is given.
5
Bayes’ Rule: K>2 Classes
P(C_i|x) = p(x|C_i) P(C_i) / p(x) = p(x|C_i) P(C_i) / Σ_{k=1..K} p(x|C_k) P(C_k)

P(C_i) ≥ 0 and Σ_{i=1..K} P(C_i) = 1

Choose C_i if P(C_i|x) = max_k P(C_k|x)
(i.e., p(x|C_i) P(C_i) = max_k p(x|C_k) P(C_k))

Because p(x) is the same for every class once x is given.
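As an illustrative sketch (not part of the slides), the K-class rule above can be computed directly; the priors and likelihood values below are made-up numbers, assuming NumPy:

```python
import numpy as np

def posteriors(priors, likelihoods):
    """Bayes' rule for K classes: P(C_i|x) = p(x|C_i)P(C_i) / sum_k p(x|C_k)P(C_k)."""
    joint = np.asarray(priors) * np.asarray(likelihoods)  # p(x, C_i) = p(x|C_i) P(C_i)
    return joint / joint.sum()                            # divide by the evidence p(x)

# Example with 3 classes: likelihoods p(x|C_i) evaluated at some fixed x.
post = posteriors([0.5, 0.3, 0.2], [0.10, 0.40, 0.10])
print(post.argmax())  # choose the class with the maximal posterior -> 1
```

Note that the argmax is unchanged if the division by `joint.sum()` is dropped, which is exactly why p(x) can be ignored when comparing classes.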
6
Gaussian (Normal) Distribution
p(x) = N(μ, σ²) = (1/(√(2π) σ)) exp(−(x − μ)² / (2σ²))

Estimate μ and σ²:

m = Σ_t x^t / N
s² = Σ_t (x^t − m)² / N

p̂(x) = (1/(√(2π) s)) exp(−(x − m)² / (2s²))
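A minimal sketch of the estimates above (illustrative code, not from the slides), assuming NumPy:

```python
import numpy as np

def fit_gaussian(x):
    """Maximum-likelihood estimates m and s^2 for a 1-D Gaussian sample."""
    x = np.asarray(x, dtype=float)
    m = x.mean()                 # m  = sum_t x^t / N
    s2 = ((x - m) ** 2).mean()   # s^2 = sum_t (x^t - m)^2 / N  (the biased MLE)
    return m, s2

def gaussian_pdf(x, m, s2):
    """Plug the estimates back into the normal density p_hat(x)."""
    return np.exp(-(x - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

m, s2 = fit_gaussian([1.0, 2.0, 3.0, 4.0])
print(m, s2)  # 2.5 1.25
```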
7
Equal variances
Single boundary at halfway between means
P(C1)=P(C2)
8
Variances are different
Two boundaries
P(C1)=P(C2)
9
Multivariate Normal Distribution
x ~ N_d(μ, Σ)

p(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ))
10
Multivariate Normal Distribution Mahalanobis distance: (x – μ)T ∑–1 (x – μ)
measures the distance from x to μ in terms of ∑ (normalizes for difference in variances and correlations)
Bivariate: d = 2
z_i = (x_i − μ_i) / σ_i

p(x₁, x₂) = (1 / (2π σ₁ σ₂ √(1 − ρ²))) exp(−(1 / (2(1 − ρ²))) (z₁² − 2ρ z₁ z₂ + z₂²))
11
Bivariate Normal
12
13
Estimation of Parameters
P̂(C_i) = Σ_t r_i^t / N

m_i = Σ_t r_i^t x^t / Σ_t r_i^t

S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)ᵀ / Σ_t r_i^t
14
[Figure: likelihoods, the posterior for C1, and the discriminant P(C1|x) = 0.5]
With only two classes, the boundary falls exactly at 0.5.
15
break
16
Classes vs. Clusters
Classification: supervised learning – Pattern Recognition, K-Nearest Neighbor, Multilayer Perceptron
Clustering: unsupervised learning – K-Means, Expectation Maximization, Self-Organizing Map

           Parametric     Nonparametric    Networks
Classes    PR             Kernel, KNN      MLP
Clusters   K-Means, EM    Agglomerative    SOM
17
Parametric vs. Nonparametric
Parametric Methods
– Advantage: it reduces the problem of estimating a probability density function (pdf), discriminant, or regression function to estimating the values of a small number of parameters.
– Disadvantage: this assumption does not always hold and we may incur a large error if it does not.
Nonparametric Methods
– Keep the training data; "let the data speak for itself"
– Given x, find a small number of closest training instances and interpolate from these
– Nonparametric methods are also called memory-based or instance-based learning algorithms.
18
Density Estimation
Given the training set X={ xt }t drawn iid (independent and identically distributed) from p(x)
Divide data into bins of size h
Histogram estimator: (Figure – next page)

p̂(x) = #{x^t in the same bin as x} / (N h)

Extreme case: p̂(x) = 1/h, for exactly consulting the sample space
(x^t denotes the t-th element of the set)
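The histogram estimator can be sketched as follows (illustrative code, not from the slides; placing the bin origin at 0 is an assumption):

```python
import numpy as np

def histogram_estimate(x, data, h, origin=0.0):
    """p_hat(x) = (# of x^t in the same bin as x) / (N h), bins of width h from `origin`."""
    data = np.asarray(data, dtype=float)
    b = np.floor((x - origin) / h)            # index of the bin containing x
    in_bin = np.floor((data - origin) / h) == b
    return in_bin.sum() / (len(data) * h)

data = [0.1, 0.2, 0.3, 1.1]
print(histogram_estimate(0.25, data, h=0.5))  # 3 samples share the bin [0, 0.5): 3/(4*0.5) = 1.5
```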
19
p̂(x) = #{x^t in the same bin as x} / (N h)
[Figure: histogram estimates for different bin widths]
20
Density Estimation
Given the training set X = {x^t}_t drawn iid from p(x)
x is always at the center of a bin of size 2h
Naive estimator: (Figure – next page)

p̂(x) = #{x − h < x^t ≤ x + h} / (2 N h)

or, letting every x^t cast a vote:

p̂(x) = (1/(N h)) Σ_{t=1..N} w((x − x^t)/h)

w(u) = 1/2 if |u| ≤ 1; 0 otherwise

w(u): votes by proximity; each favorable vote counts 1/2, and its integral over [−1, 1] is 1.
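Both forms of the naive estimator give the same value; a sketch (illustrative, not part of the slides):

```python
import numpy as np

def naive_estimate(x, data, h):
    """Naive estimator: p_hat(x) = #{x - h < x^t <= x + h} / (2 N h).
    Equivalently, each x^t votes w(u) = 1/2 for |u| <= 1, with u = (x - x^t)/h."""
    data = np.asarray(data, dtype=float)
    inside = (data > x - h) & (data <= x + h)
    return inside.sum() / (2 * len(data) * h)

data = [0.0, 0.4, 0.9, 2.0]
print(naive_estimate(0.5, data, h=0.5))  # samples in (0.0, 1.0]: 0.4 and 0.9 -> 2/(2*4*0.5) = 0.5
```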
21
p̂(x) = #{x − h < x^t ≤ x + h} / (2 N h)
[Figure: naive estimates with h = 0.25, h = 0.5, h = 1]
22
Kernel Estimator
Kernel function, e.g., Gaussian kernel:

K(u) = (1/√(2π)) exp(−u²/2)

Kernel estimator (Parzen windows): Figure – next page

p̂(x) = (1/(N h)) Σ_{t=1..N} K((x − x^t)/h)

If K is Gaussian, then p̂ will be smooth, having all the derivatives.

K(u): scores by proximity; its integral over the real line is 1.
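A sketch of the Parzen-window estimator with the Gaussian kernel (illustrative code, not from the slides):

```python
import numpy as np

def parzen_estimate(x, data, h):
    """Kernel (Parzen window) estimator with a Gaussian kernel:
    p_hat(x) = (1/(N h)) * sum_t K((x - x^t)/h),  K(u) = exp(-u^2/2)/sqrt(2*pi)."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h                       # one "distance score" per training sample
    k = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    return k.sum() / (len(data) * h)
```

With a single sample at 0 and h = 1, `parzen_estimate(0.0, [0.0], 1.0)` is just K(0) = 1/√(2π) ≈ 0.399, and the estimate decays smoothly as x moves away.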
23
[Figure: the Gaussian kernel K(u) = (1/√(2π)) exp(−u²/2), plotted for u ∈ [−5, 5]]
24
p̂(x) = (1/(N h)) Σ_{t=1..N} K((x − x^t)/h)
[Figure: kernel estimates for different h]
25
Generalization to Multivariate Data
Kernel density estimator:

p̂(x) = (1/(N h^d)) Σ_{t=1..N} K((x − x^t)/h)

with the requirement that ∫_{R^d} K(x) dx = 1

Multivariate Gaussian kernel:
– spheric: K(u) = (1/(2π)^{d/2}) exp(−‖u‖²/2)
– ellipsoid: K(u) = (1/((2π)^{d/2} |S|^{1/2})) exp(−(1/2) uᵀ S⁻¹ u)
26
k-Nearest Neighbor Estimator
Instead of fixing bin width h and counting the number of instances, fix the instances (neighbors) k and check bin width
dk(x): distance to kth closest instance to x
p̂(x) = k / (2 N d_k(x))
27
p̂(x) = k / (2 N d_k(x))

Grow the bin in both directions at once, and see how far it must extend to take in k samples.
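The k-NN estimator can be sketched in one dimension (illustrative code, not from the slides):

```python
import numpy as np

def knn_estimate(x, data, k):
    """k-NN density estimator: p_hat(x) = k / (2 N d_k(x)),
    where d_k(x) is the distance from x to its k-th nearest sample."""
    d = np.sort(np.abs(np.asarray(data, dtype=float) - x))  # sorted distances to x
    return k / (2 * len(data) * d[k - 1])                   # bin half-width = d_k(x)

data = [0.0, 1.0, 3.0, 6.0]
print(knn_estimate(0.0, data, k=2))  # d_2(0) = 1 -> 2/(2*4*1) = 0.25
```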
28
Nonparametric Classification (kernel estimator)

p̂(x|C_i) = (1/(N_i h^d)) Σ_{t=1..N} r_i^t K((x − x^t)/h)

p̂(x, C_i) = p̂(x|C_i) P̂(C_i) = (1/(N h^d)) Σ_{t=1..N} r_i^t K((x − x^t)/h)

r_i^t is 0/1 according to whether x^t belongs to C_i.
Originally we would compare the values p(C_i|x) = p(x, C_i)/p(x), but since p(x) is the same for every class once x is given, everyone leaves it out here and the formula is tidier.
We may ignore the coefficient and look only at the sum: it accumulates a score from each committee member, a positive real value assigned by proximity.
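A sketch of the resulting classifier: accumulate each class's kernel scores and take the argmax (illustrative code, not from the slides; the toy data are made up):

```python
import numpy as np

def kernel_classify(x, data, labels, h, n_classes):
    """Discriminant g_i(x) is proportional to the sum of K((x - x^t)/h) over
    training samples with label i; choose the class with the largest score."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h
    scores = np.exp(-u ** 2 / 2)          # Gaussian kernel; constants cancel in the argmax
    g = np.array([scores[np.asarray(labels) == i].sum() for i in range(n_classes)])
    return int(g.argmax())

data = [0.0, 0.2, 0.3, 2.0, 2.1]
labels = [0, 0, 0, 1, 1]
print(kernel_classify(0.1, data, labels, h=0.5, n_classes=2))   # 0
print(kernel_classify(2.05, data, labels, h=0.5, n_classes=2))  # 1
```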
29
Nonparametric Classification
k-nn estimator (1)

For the special case of the k-nn estimator:

p̂(x|C_i) = k_i / (N_i V^k(x))

where
k_i: the number of neighbors out of the k nearest that belong to C_i
V^k(x): the volume of the d-dimensional hypersphere centered at x, with radius r = ‖x − x_(k)‖
c_d: the volume of the unit sphere in d dimensions, so V^k = c_d r^d

For example:
d = 1: V = 2r  (c₁ = 2)
d = 2: V = πr²  (c₂ = π)
d = 3: V = (4/3)πr³  (c₃ = 4π/3)
30
Nonparametric Classification
k-nn estimator (2)

From

p̂(x|C_i) = k_i / (N_i V^k(x)),  p̂(x) = k / (N V^k(x)),  P̂(C_i) = N_i / N

Then

P̂(C_i|x) = p̂(x|C_i) P̂(C_i) / p̂(x) = k_i / k

We compare the values p(C_i|x) = p(x, C_i)/p(x): although p(x) is the same for every class once x is given, here everyone writes it out, and the derived formula is tidier.
Meaning: by the time k samples have been collected, the class with the most attendees wins.
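Since P̂(C_i|x) = k_i/k, the classifier reduces to a majority vote among the k nearest training samples; a one-dimensional sketch (illustrative code, not from the slides):

```python
import numpy as np

def knn_classify(x, data, labels, k):
    """k-NN classifier: P_hat(C_i|x) = k_i / k, so predict the majority class
    among the k nearest training samples."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    nearest = np.argsort(np.abs(data - x))[:k]      # indices of the k closest samples
    values, counts = np.unique(labels[nearest], return_counts=True)
    return values[counts.argmax()]                  # class with the most "attendees"

data = [0.0, 0.1, 0.2, 1.0, 1.1]
labels = [0, 0, 0, 1, 1]
print(knn_classify(0.15, data, labels, k=3))  # 0
```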
31
break
32
Classes vs. Clusters
Classification: supervised learning – Pattern Recognition, K-Nearest Neighbor, Multilayer Perceptron
Clustering: unsupervised learning – K-Means, Expectation Maximization, Self-Organizing Map

           Parametric     Nonparametric    Networks
Classes    PR             Kernel, KNN      MLP
Clusters   K-Means, EM    Agglomerative    SOM
33
Classes vs. Clusters
Supervised: X = {x^t, r^t}_t
Classes C_i, i = 1,...,K
where p(x|C_i) ~ N(μ_i, Σ_i)
Φ = {P(C_i), μ_i, Σ_i}, i = 1,...,K
p(x) = Σ_{i=1..K} p(x|C_i) P(C_i)

Unsupervised: X = {x^t}_t
Clusters G_i, i = 1,...,k
where p(x|G_i) ~ N(μ_i, Σ_i)
Φ = {P(G_i), μ_i, Σ_i}, i = 1,...,k
p(x) = Σ_{i=1..k} p(x|G_i) P(G_i)
Labels, r_i^t?

With labels (supervised case):

P̂(C_i) = Σ_t r_i^t / N
m_i = Σ_t r_i^t x^t / Σ_t r_i^t
S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)ᵀ / Σ_t r_i^t
34
k-Means Clustering
Find k reference vectors (prototypes / codebook vectors / codewords) which best represent the data
Reference vectors: m_j, j = 1,...,k
Use the nearest (most similar) reference:

‖x^t − m_i‖ = min_j ‖x^t − m_j‖

Reconstruction error:

E({m_i}_{i=1..k} | X) = Σ_t Σ_i b_i^t ‖x^t − m_i‖²

b_i^t = 1 if ‖x^t − m_i‖ = min_j ‖x^t − m_j‖; 0 otherwise

We want the cluster centers to minimize the total deviation they induce.
35
Encoding/Decoding
b_i^t = 1 if ‖x^t − m_i‖ = min_j ‖x^t − m_j‖; 0 otherwise
36
k-means Clustering
1. Winner takes all
2. No step-by-step correction; instead, take each group's mean in one shot
3. A worked example follows on the next page; a counterexample (soldiers at the front switching sides) will be given in class
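A sketch of batch k-means as described above (illustrative code, not from the slides; random initialization from the data is an assumption):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Batch k-means: assign each x^t to its nearest center (winner takes all),
    then move each center to the mean of its members, until assignments stabilize."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    m = X[rng.choice(len(X), size=k, replace=False)]   # initial reference vectors
    for _ in range(n_iter):
        # b^t: index of the nearest reference vector for each sample
        b = np.argmin(np.linalg.norm(X[:, None] - m[None, :], axis=2), axis=1)
        new_m = np.array([X[b == i].mean(axis=0) if np.any(b == i) else m[i]
                          for i in range(k)])
        if np.allclose(new_m, m):                      # converged: centers stopped moving
            break
        m = new_m
    return m, b

X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
m, b = k_means(X, k=2)
print(b)  # the two left points share one label, the two right points the other
```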
37
38
EM in Gaussian Mixtures
z_i^t = 1 if x^t belongs to G_i, 0 otherwise (the labels r_i^t of supervised learning); assume p(x|G_i) ~ N(μ_i, Σ_i)

E-step:

h_i^t = E[z_i^t | X, Φ] = P(G_i | x^t, Φ) = p(x^t | G_i, Φ) P(G_i) / Σ_j p(x^t | G_j, Φ) P(G_j)

M-step:

P(G_i) = Σ_t h_i^t / N
m_i = Σ_t h_i^t x^t / Σ_t h_i^t
S_i = Σ_t h_i^t (x^t − m_i)(x^t − m_i)ᵀ / Σ_t h_i^t

Use estimated labels in place of the unknown labels.
With P(G_i) as backing, we need not fear the frontline soldiers switching sides.
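A one-dimensional sketch of the E- and M-steps, simplified to scalar variances (illustrative code, not from the slides; the slides' version is multivariate, and initializing the means at the data range is an assumption):

```python
import numpy as np

def em_gmm_1d(x, k=2, n_iter=50):
    """EM for a 1-D Gaussian mixture.
    E-step: h_i^t = P(G_i) N(x^t; m_i, s_i^2) / sum_j P(G_j) N(x^t; m_j, s_j^2)
    M-step: P(G_i), m_i, s_i^2 re-estimated with the h_i^t as soft labels."""
    x = np.asarray(x, dtype=float)
    pi = np.full(k, 1.0 / k)                 # P(G_i)
    m = np.linspace(x.min(), x.max(), k)     # simple deterministic initialization
    s2 = np.full(k, x.var())
    for _ in range(n_iter):
        # E-step: soft memberships h (shape N x k)
        dens = np.exp(-(x[:, None] - m) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
        h = pi * dens
        h /= h.sum(axis=1, keepdims=True)
        # M-step: re-estimate priors, means, and variances from h
        nk = h.sum(axis=0)
        pi = nk / len(x)
        m = (h * x[:, None]).sum(axis=0) / nk
        s2 = (h * (x[:, None] - m) ** 2).sum(axis=0) / nk
    return pi, m, s2

pi, m, s2 = em_gmm_1d([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
print(np.sort(m))  # the two estimated means land near 0.1 and 5.1
```

Because h_i^t is weighted by P(G_i), memberships shift gradually instead of flipping all at once, which is the "no defections" point made above.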
39
[Figure: the boundary where P(G1|x) = h1 = 0.5]
40
Classes vs. Clusters
Classification: supervised learning – Pattern Recognition, K-Nearest Neighbor, Multilayer Perceptron
Clustering: unsupervised learning – K-Means, Expectation Maximization, Self-Organizing Map

           Parametric     Nonparametric    Networks
Classes    PR             Kernel, KNN      MLP
Clusters   K-Means, EM    Agglomerative    SOM
41
Agglomerative Clustering
Start with N groups each with one instance and merge two closest groups at each iteration
Distance between two groups G_i and G_j:
– Single-link: d(G_i, G_j) = min { d(x^r, x^s) : x^r ∈ G_i, x^s ∈ G_j }
– Complete-link: d(G_i, G_j) = max { d(x^r, x^s) : x^r ∈ G_i, x^s ∈ G_j }
– Average-link, centroid
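A sketch of single-link agglomerative merging (illustrative brute-force code, not from the slides; shown for 1-D points):

```python
import numpy as np

def single_link(points, n_clusters):
    """Agglomerative clustering, single-link: start with one group per point and
    repeatedly merge the two groups with the smallest minimum pairwise distance."""
    points = np.asarray(points, dtype=float)
    groups = [[i] for i in range(len(points))]        # N singleton groups
    while len(groups) > n_clusters:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # single-link distance: min over all cross-group pairs
                d = min(abs(points[r] - points[s])
                        for r in groups[a] for s in groups[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] += groups.pop(b)                    # merge the two closest groups
    return groups

print(single_link([0.0, 0.2, 0.3, 5.0, 5.4], n_clusters=2))  # [[0, 1, 2], [3, 4]]
```

Stopping at different values of `n_clusters` corresponds to cutting the dendrogram at different heights.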
42
Dendrogram
Example: Single-Link Clustering
Human
Bonobo (pygmy chimpanzee)
Chimpanzee
Gorilla
Gibbon
Macaque
Groups can be formed dynamically (by cutting the dendrogram at different levels).