
Page 1: Hideitsu Hino

2016/06/06


1 / 74

Page 2: Hideitsu Hino

1

2

3

2 / 74

Page 3: Hideitsu Hino

This talk concerns the (Shannon) entropy.¹ For a density f, the self-information of x is

I_f(x) = −ln f(x),

and the (differential) entropy of f is

H(f) = −∫ f(x) ln f(x) dx.

When X is a random variable with density f, H(f) is also written H(X).

¹Besides the Shannon entropy, the Rényi entropy ((1 − α)⁻¹ log ∫ f(x)^α dx) and the Tsallis entropy ((q − 1)⁻¹ (1 − ∫ f^q(x) dx)) are also used.

3 / 74

Page 4: Hideitsu Hino

Cross entropy and entropy:

H(f, g) = E_f[I_g(X)] = −∫ f(x) ln g(x) dx,

H(f) = E_f[I_f(X)] = −∫ f(x) ln f(x) dx.

Kullback-Leibler divergence:

D_KL(f, g) = E_f[I_g(X)] − E_f[I_f(X)] = ∫ f(x) ln (f(x)/g(x)) dx.

Mutual information:

MI(X, Y) = H(X) + H(Y) − H(X, Y),

where H(X, Y) is the joint entropy of X and Y.

4 / 74
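As a quick numeric illustration of these definitions (not part of the original slides), the KL divergence between two Gaussians can be checked against its closed form; the parameter values below are arbitrary.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0          # illustrative parameters
f, g = norm(mu1, s1).pdf, norm(mu2, s2).pdf

# closed form for KL(N(mu1, s1^2) || N(mu2, s2^2))
kl_closed = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
# the defining integral of f(x) ln(f(x)/g(x))
kl_numeric, _ = quad(lambda x: f(x) * np.log(f(x) / g(x)), -20.0, 20.0)

print(kl_closed, kl_numeric)                    # both ~0.4431
```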

Page 5: Hideitsu Hino

KL

5 / 74


Page 7: Hideitsu Hino

Independent component analysis: recover an m-dimensional source Y ∈ R^m from an n-dimensional observation X ∈ R^n through a demixing matrix W ∈ R^{m×n}:

Y = WX. (1)

W is estimated so that the joint density f(WX) of Y = WX factorizes into the product of the (m) marginal densities f(w_j X), j = 1, . . . , m, e.g., by minimizing the mutual information among the components of WX [Hyvarinen&Oja, 2000].

6 / 74

Page 8: Hideitsu Hino

7 / 74

Page 9: Hideitsu Hino

k-means clustering minimizes

L(c_1, . . . , c_K) = ∑_{i=1}^n min_{l=1,...,K} ∥x_i − c_l∥².

Fig. from [Faivishevsky&Goldberger, 2010]

8 / 74

Page 10: Hideitsu Hino

k-means clustering minimizes

L(c_1, . . . , c_K) = ∑_{i=1}^n min_{l=1,...,K} ∥x_i − c_l∥².

[Figure 2 of Faivishevsky & Goldberger (2010): comparison of the proposed clustering method NIC with the k-means algorithm on three synthetic cases; (a)-(c) NIC, (d)-(f) k-means.]

[Figure 3 of the same paper: three possible clusterings (into two clusters) of the same dataset; (a) the 'correct' clustering, (b) and (c) erroneous clusterings. With MeanNN as the MI estimator, the MI clustering score favors the correct solution, while the kNN estimator yields the same score for all three clusterings.]

Using the identity ∑_{i: c_i=j} ∥x_i − µ_j∥² = (1/(2n_j)) ∑_{i≠l: c_i=c_l=j} ∥x_i − x_l∥², where µ_j is the mean of all data points in cluster j, the paper compares three clustering scores:

S_kmeans(C) = ∑_{j=1}^{n_c} (1/n_j) ∑_{i≠l: c_i=c_l=j} ∥x_i − x_l∥²,

S_GaussMI(C) = ∑_{j=1}^{n_c} log (1/n_j) ∑_{i≠l: c_i=c_l=j} ∥x_i − x_l∥²,

S_NIC(C) = ∑_{j=1}^{n_c} (1/(n_j − 1)) ∑_{i≠l: c_i=c_l=j} log ∥x_i − x_l∥².

Fig. from [Faivishevsky&Goldberger, 2010]

8 / 74
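The centroid-to-pairwise rewriting used above is easy to verify numerically; a minimal sketch (the cluster data are synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))                    # points of one cluster j
mu_j = x.mean(axis=0)
n_j = len(x)

lhs = ((x - mu_j) ** 2).sum()                   # sum_i ||x_i - mu_j||^2
pairwise = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
rhs = pairwise.sum() / (2 * n_j)                # (1/(2 n_j)) sum_{i != l} ||x_i - x_l||^2

print(lhs, rhs)                                 # equal up to floating point
```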

Page 11: Hideitsu Hino

H(X|Y )

Fig. from [Faivishevsky&Goldberger, 2010]

9 / 74


Page 13: Hideitsu Hino

Dimensionality reduction by minimizing the conditional entropy H(X|Y); cf. Fisher discriminant analysis.

[Hino&Murata, 2010]

10 / 74

Page 14: Hideitsu Hino

[Scatter plots (1st axis vs. 2nd axis) comparing the LDA projection with the conditional-entropy-minimizing projection (minH) on two datasets.]

11 / 74

Page 15: Hideitsu Hino

12 / 74


Page 17: Hideitsu Hino

Change-point detection on the TOPIX index (1988-1996). Events marked on the series: the Gulf war, the decay of the bubble economy, the completion of the European single market, and the Great Hanshin-Awaji Earthquake.

[Plot: TOPIX (1000-3000) and change-point score (0.00-0.10) over dates from 1988-02-01 to 1996-04-01.]

The score compares the densities estimated before and after each time point:

score(t) = log f_after(t) / f_before(t).

[Murata+, 2013, Koshijima+, 2015]

13 / 74
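A minimal sketch of this type of score, under assumptions that are mine rather than the talk's: a plain Gaussian KDE on fixed-length sliding windows stands in for the entropy-based machinery of [Murata+, 2013], and the window length is illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

def change_scores(x, w=50):
    """score(t) = log f_after(x_t) / f_before(x_t), with KDEs on the two windows."""
    scores = np.full(len(x), np.nan)
    for t in range(w, len(x) - w):
        f_before = gaussian_kde(x[t - w:t])
        f_after = gaussian_kde(x[t:t + w])
        scores[t] = np.log(f_after(x[t])[0] / f_before(x[t])[0])
    return scores

rng = np.random.default_rng(1)
series = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 1, 200)])
print(np.nanargmax(np.abs(change_scores(series))))   # peaks near the change at t = 200
```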

Page 18: Hideitsu Hino

Prediction via the conditional density f(x_{t+1} | x_{t:1}); 50% and 95% predictive intervals are obtained from f(x_{t+1} | x_{t:1}).

14 / 74

Page 19: Hideitsu Hino


Vapnik

15 / 74


Page 22: Hideitsu Hino

16 / 74

Page 23: Hideitsu Hino

17 / 74

Page 24: Hideitsu Hino

18 / 74

Page 25: Hideitsu Hino

1

2

3

19 / 74

Page 26: Hideitsu Hino

Given a sample D = {x_i}_{i=1}^n ⊂ R¹, assume the elements of D are i.i.d.

20 / 74

Page 27: Hideitsu Hino

f(x) = (5/8) φ(x; µ = 0, σ = 1) + (3/8) φ(x; µ = 3, σ = 1)

21 / 74


Page 29: Hideitsu Hino

22 / 74

Page 30: Hideitsu Hino

Kernel density estimator:

f̂(x; h) = (1/(nh)) ∑_{i=1}^n κ((x − x_i)/h),   (2)

where the kernel κ satisfies ∫ κ(x) dx = 1 and h > 0 is the bandwidth. With κ_h(x) = h⁻¹ κ(x/h), equivalently

f̂(x; h) = (1/n) ∑_{i=1}^n κ_h(x − x_i).

23 / 74


Page 32: Hideitsu Hino

Example: the kernel κ is taken to be the N(0, 1) density.

24 / 74

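A minimal sketch of estimator (2) with the N(0,1) kernel, reusing the two-component mixture from the earlier example (the bandwidth h = 0.4 is an arbitrary illustrative choice):

```python
import numpy as np
from scipy.stats import norm

def kde(x, data, h):
    """f_hat(x; h) = (1/(n h)) sum_i kappa((x - x_i) / h) with kappa = N(0,1)."""
    x = np.atleast_1d(x)[:, None]
    return norm.pdf((x - data[None, :]) / h).sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
u = rng.random(500)                       # sample the mixture (5/8)N(0,1) + (3/8)N(3,1)
data = np.where(u < 5 / 8, rng.normal(0, 1, 500), rng.normal(3, 1, 500))

grid = np.linspace(-4.0, 7.0, 200)
f_hat = kde(grid, data, h=0.4)
print(f_hat.sum() * (grid[1] - grid[0]))  # ~1: the estimate integrates to one
```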

Page 35: Hideitsu Hino

Pointwise error at x. The MSE (mean squared error) of an estimator θ̂ of θ is

MSE(θ̂) = E[(θ̂ − θ)²] = Var[θ̂] + (E[θ̂] − θ)².

Since E[f̂(x; h)] = E[κ_h(x − X)] = ∫ κ_h(x − y) f(y) dy, writing the convolution (f ∗ g)(x) = ∫ f(x − y) g(y) dy, the bias of f̂(x; h) is

E[f̂(x; h)] − f(x) = (κ_h ∗ f)(x) − f(x),

and its variance is

Var[f̂(x; h)] = (1/n) [(κ_h² ∗ f)(x) − (κ_h ∗ f)²(x)].

25 / 74

Page 36: Hideitsu Hino

Combining bias and variance, the pointwise MSE at x is

MSE[f̂(x; h)] = (1/n) [(κ_h² ∗ f)(x) − (κ_h ∗ f)²(x)] + {(κ_h ∗ f)(x) − f(x)}².

26 / 74

Page 37: Hideitsu Hino

Global (L²) error: the ISE (integrated squared error)

ISE[f̂(·; h)] = ∫ (f̂(x; h) − f(x))² dx.

27 / 74

Page 38: Hideitsu Hino

f̂(x; h) depends on the sample D = {x_i}_{i=1}^n, so the ISE is itself random. Averaging over D gives the MISE (mean integrated squared error):

MISE[f̂(·; h)] = E_D[ISE[f̂(·; h, D)]]

= ∫ E_D[(f̂(x; h, D) − f(x))²] dx

= ∫ MSE[f̂(x; h, D)] dx.

28 / 74

Page 39: Hideitsu Hino

MISE[f̂(·; h)] = n⁻¹ ∫ [(κ_h² ∗ f)(x) − (κ_h ∗ f)²(x)] dx + ∫ {(κ_h ∗ f)(x) − f(x)}² dx

= (nh)⁻¹ ∫ κ²(x) dx + (1 − n⁻¹) ∫ (κ_h ∗ f)²(x) dx − 2 ∫ (κ_h ∗ f)(x) f(x) dx + ∫ f(x)² dx.

29 / 74

Page 40: Hideitsu Hino

The bandwidth h_MISE minimizing the MISE has no closed form in general, so the MISE is approximated asymptotically in h.

30 / 74

Page 41: Hideitsu Hino

Assumptions:

1. f is twice continuously differentiable (C²) and f″ is square-integrable (L²).

2. The bandwidth h = h_n depends on n and satisfies

lim_{n→∞} h = 0,   lim_{n→∞} nh = ∞.

3. The kernel κ satisfies

∫ κ(x) dx = 1,   ∫ x κ(x) dx = 0,   µ₂(κ) = ∫ x² κ(x) dx < ∞.

31 / 74

Page 42: Hideitsu Hino

Taylor-expand f(x − hz) in E[f̂(x; h)] = ∫ κ(z) f(x − hz) dz:

f(x − hz) = f(x) − hz f′(x) + (1/2) h² z² f″(x) + o(h²),

so

E[f̂(x; h)] = f(x) + (1/2) h² f″(x) ∫ z² κ(z) dz + o(h²),

E[f̂(x; h)] − f(x) = (1/2) h² µ₂(κ) f″(x) + o(h²).   (3)

The bias is driven by the curvature f″ of f.

32 / 74

Page 43: Hideitsu Hino

For a function g, write R(g) = ∫ g²(x) dx. Then

Var[f̂(x; h)] = (nh)⁻¹ R(κ) f(x) + o((nh)⁻¹).   (4)

Combining the bias (3) and the variance (4), the MSE is

MSE[f̂(x; h)] = (nh)⁻¹ R(κ) f(x) + (1/4) h⁴ µ₂²(κ) (f″(x))² + o((nh)⁻¹ + h⁴).

33 / 74

Page 44: Hideitsu Hino

Integrating the MSE over x:

MISE[f̂(·; h)] = AMISE[f̂(·; h)] + o((nh)⁻¹ + h⁴),

AMISE[f̂(·; h)] = (nh)⁻¹ R(κ) + (1/4) h⁴ µ₂²(κ) R(f″).

Minimizing the AMISE in place of the MISE over h:

h_AMISE = ( R(κ) / (µ₂²(κ) R(f″) n) )^{1/5}.

34 / 74
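For the N(0,1) kernel, R(κ) = 1/(2√π) and µ₂(κ) = 1; plugging in a Gaussian reference f = N(0, σ²) recovers the familiar 1.06 σ n^{−1/5} rule. A sketch under exactly those assumptions (the normal-reference choice is mine, not from the slides):

```python
import numpy as np

def h_amise_normal_ref(sigma, n):
    """h_AMISE for the N(0,1) kernel and a N(0, sigma^2) reference density."""
    R_kappa = 1.0 / (2.0 * np.sqrt(np.pi))            # R(kappa)
    mu2 = 1.0                                         # mu_2(kappa)
    R_f2 = 3.0 / (8.0 * np.sqrt(np.pi) * sigma**5)    # R(f'') for the Gaussian reference
    return (R_kappa / (mu2**2 * R_f2 * n)) ** 0.2

print(h_amise_normal_ref(1.0, 300))                   # ~0.339
print(1.06 * 1.0 * 300 ** -0.2)                       # Silverman-style 1.06 sigma n^(-1/5)
```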


Page 46: Hideitsu Hino

k-nearest-neighbor density estimation.

Estimate the density f(z) at z ∈ R^p from the sample D = {x_i}_{i=1}^n.

Let ε_k denote the distance from z to its k-th nearest neighbor. The ball of radius ε around z is b(z; ε) = {x ∈ R^p : ∥z − x∥ < ε}, with volume

|b(z; ε)| = c_p ε^p,   c_p = π^{p/2} / Γ(p/2 + 1),

where Γ(·) is the gamma function.

35 / 74
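The constant c_p is straightforward to compute; a small check of the familiar values (length 2 for p = 1, area π for p = 2, volume 4π/3 for p = 3):

```python
import numpy as np
from scipy.special import gamma

def c_p(p):
    """Volume of the unit ball in R^p: pi^(p/2) / Gamma(p/2 + 1)."""
    return np.pi ** (p / 2) / gamma(p / 2 + 1)

print([round(c_p(p), 4) for p in (1, 2, 3)])      # [2.0, 3.1416, 4.1888]
```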

Page 47: Hideitsu Hino

k-nearest-neighbor density estimation.

[Scatter diagram: sample points x_i ∈ D shown as ◦, query point z ∈ R^p shown as ×.]

36 / 74


Page 49: Hideitsu Hino

k-nearest-neighbor density estimation.

[Diagram: the ball of radius ε around z is grown until it contains k sample points; this ε is the k-th nearest-neighbor distance.]

37 / 74


Page 53: Hideitsu Hino

k-nearest-neighbor density estimation.

[Diagram: sample points and the ε-ball around z.]

The probability mass of the ε-ball around z is

q_z(ε) = ∫_{b(z;ε)} f(x) dx.

At ε = ε_k (the k-th nearest-neighbor distance), the empirical fraction k/n of the sample falling inside the ball approximates q_z(ε_k).

38 / 74

Page 54: Hideitsu Hino

k-nearest-neighbor density estimation. Taylor expansion:

q_z(ε_k) = ∫_{b(z;ε_k)} {f(z) + (z − x)ᵀ∇f(z) + O(ε_k²)} dx

= |b(z; ε_k)| (f(z) + O(ε_k²)) ≃ ε_k^p c_p f(z),

where c_p is the volume of the unit ball in R^p.

39 / 74

Page 55: Hideitsu Hino

Equating the ball mass with the empirical fraction, k/n ≃ ε_k^p c_p f(z), and solving for f(z):

f̂_k(z) = (k / (c_p n)) ε_k^{−p}.   (5)

40 / 74

Page 56: Hideitsu Hino

k-NN density estimator:

f̂_k(z) = (k / (c_p n)) ε_k^{−p},   (6)

where ε_k is the distance from z to its k-th nearest neighbor in D.

41 / 74
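A minimal sketch of estimator (6), using brute-force neighbor search; the sample and the choice k = 20 are illustrative:

```python
import numpy as np
from scipy.special import gamma

def knn_density(z, data, k):
    """f_hat_k(z) = k / (c_p * n * eps_k^p), eps_k the k-th nearest-neighbor distance."""
    n, p = data.shape
    c_p = np.pi ** (p / 2) / gamma(p / 2 + 1)
    eps_k = np.sort(np.linalg.norm(data - z, axis=1))[k - 1]
    return k / (c_p * n * eps_k ** p)

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 2))                 # standard bivariate normal
print(knn_density(np.zeros(2), data, k=20))       # ~1/(2 pi) ≈ 0.159 at the mode
```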

Page 57: Hideitsu Hino

42 / 74

Page 58: Hideitsu Hino

1

2

3

43 / 74

Page 59: Hideitsu Hino

Goal: estimate H(f) from a sample D = {x_i}_{i=1}^n, x_i ∈ R^p, i = 1, . . . , n, drawn from the density f(x) of X.

44 / 74

Page 60: Hideitsu Hino

The mass of the ε-ball around z:

q_z(ε) = ∫_{x∈b(z;ε)} f(x) dx.   (7)

Expanding the integrand,

q_z(ε) = ∫_{x∈b(z;ε)} {f(z) + (z − x)ᵀ∇f(z) + O(ε²)} dx

= |b(z; ε)| (f(z) + O(ε²)) = c_p ε^p f(z) + O(ε^{p+2}),

so approximating q_z(ε) by k/n incurs an O(ε^{p+2}) error.

45 / 74


Page 62: Hideitsu Hino

Expanding q_z(ε) around z to the next order in ε:

q_z(ε) = c_p f(z) ε^p + (p / (4(p/2 + 1))) c_p ε^{p+2} tr∇²f(z) + O(ε^{p+4}).   (8)

Approximating q_z(ε) by k_ε/n and dividing by c_p ε^p:

k_ε / (n c_p ε^p) = f(z) + C ε² + O(ε⁴),   (9)

where C = p tr∇²f(z) / (4(p/2 + 1)).

46 / 74


Page 64: Hideitsu Hino

Put Y_ε = k_ε / (n c_p ε^p) and X_ε = ε². Neglecting the O(ε⁴) term, Y_ε and X_ε obey the simple linear regression relation

Y_ε ≃ f(z) + C X_ε.   (10)

47 / 74

Page 65: Hideitsu Hino

In Y_ε ≃ f(z) + C X_ε, treat X_ε as the explanatory variable and Y_ε as the response, generating observation pairs by varying ε.

48 / 74

Page 66: Hideitsu Hino

k-nearest-neighbor density estimation [review].

[Diagram: the ball of radius ε around z is grown until it contains k sample points; this ε is the k-th nearest-neighbor distance.]

49 / 74


Page 70: Hideitsu Hino

Fix a set of radii E = {ε₁, . . . , ε_m}, m < n, and compute the pairs {(X_ε, Y_ε)}_{ε∈E}. Minimize the residual sum of squares

R = (1/m) ∑_{ε∈E} (Y_ε − f(z) − C X_ε)²   (11)

with respect to f(z) and C; the fitted intercept gives the density estimate f̂_s(z).

50 / 74

Page 71: Hideitsu Hino

Evaluating f̂_s at each sample point z = x_i in leave-one-out fashion yields the entropy estimate

Ĥ_s(D) = −(1/n) ∑_{i=1}^n ln f̂_{s,i}(x_i),   (12)

where f̂_{s,i}(x_i) is the estimate at x_i computed with x_i removed. Ĥ_s(D) is called the Simple Regression Entropy Estimator (SRE) [Hino+, 2015].

51 / 74
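A minimal sketch of the SRE procedure, under assumptions that are mine rather than the talk's: a fixed illustrative grid E, brute-force neighbor counts, and a small guard against non-positive fitted intercepts in the tails. [Hino+, 2015] specifies the actual design choices.

```python
import numpy as np
from scipy.special import gamma

def sre_density(z, data, eps_grid):
    """Fit Y_eps = f(z) + C * eps^2 over eps_grid; the intercept estimates f(z)."""
    n, p = data.shape
    c_p = np.pi ** (p / 2) / gamma(p / 2 + 1)
    d = np.linalg.norm(data - z, axis=1)
    X = eps_grid ** 2
    Y = np.array([(d <= e).sum() / (n * c_p * e ** p) for e in eps_grid])
    slope, intercept = np.polyfit(X, Y, 1)
    return intercept

def sre_entropy(data, eps_grid):
    """Leave-one-out average of -ln f_hat_s(x_i), as in eq. (12)."""
    logs = []
    for i in range(len(data)):
        fz = sre_density(data[i], np.delete(data, i, axis=0), eps_grid)
        logs.append(np.log(max(fz, 1e-12)))  # guard: intercepts can dip below 0 in the tails
    return -np.mean(logs)

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 1))
print(sre_entropy(data, np.linspace(0.2, 1.0, 10)))  # true H of N(0,1) ≈ 1.4189
```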

Page 72: Hideitsu Hino

SRE: how it works

[Left: fitted density function for the Normal example. Right: Y_ε plotted against ε² with the fitted intercept f̂_s(z = 0.5).]

52 / 74

Page 73: Hideitsu Hino

SRE: how it works

[Left: fitted density function for the Bimodal example. Right: Y_ε plotted against ε² with the fitted intercept f̂_s(z = 0.5).]

53 / 74

Page 74: Hideitsu Hino

For each ε and each sample point x_i ∈ D,

Y_ε ≃ f(x_i) + C X_ε

holds with Y_ε = k_ε / (n c_p ε^p) and C = p tr∇²f(x_i) / (4(p/2 + 1)), both depending on x_i. Writing the point-wise quantities as Y_ε^i and C_i:

Y_ε^i ≃ f(x_i) + C_i X_ε.

54 / 74

Page 75: Hideitsu Hino

Average Y_ε^i = f(x_i) + C_i X_ε over the sample points x_i ∈ D:

−(1/n) ∑_{i=1}^n ln Y_ε^i = −(1/n) ∑_{i=1}^n ln ( f(x_i) + C_i X_ε )

= −(1/n) ∑_{i=1}^n ln f(x_i) (1 + C_i X_ε / f(x_i))

= −(1/n) ∑_{i=1}^n ln f(x_i) − (1/n) ∑_{i=1}^n ln (1 + C_i X_ε / f(x_i))

≃ −(1/n) ∑_{i=1}^n ln f(x_i) − (1/n) ( ∑_{i=1}^n C_i / f(x_i) ) X_ε.

55 / 74

Page 76: Hideitsu Hino

−(1/n) ∑_{i=1}^n ln Y_ε^i ≃ −(1/n) ∑_{i=1}^n ln f(x_i) − (1/n) ( ∑_{i=1}^n C_i / f(x_i) ) X_ε.

Writing Ȳ_ε = −(1/n) ∑_{i=1}^n ln Y_ε^i, H(D) = −(1/n) ∑_{i=1}^n ln f(x_i), and C̄ = −(1/n) ∑_{i=1}^n C_i / f(x_i), this reads, for every ε > 0,

Ȳ_ε = H(D) + C̄ X_ε.   (13)

56 / 74

Page 77: Hideitsu Hino

Fitting (13) over ε ∈ E by least squares,

R_d = (1/m) ∑_{ε∈E} (Ȳ_ε − H(D) − C̄ X_ε)²,

and reading off the intercept gives the Direct Regression Entropy Estimator (DRE) [Hino+, 2015].

57 / 74
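A minimal sketch of the DRE idea along the same lines; the radius grid, anchored at the largest nearest-neighbor distance so that every k_ε ≥ 1, is an illustrative choice, not from the talk.

```python
import numpy as np
from scipy.special import gamma

def dre_entropy(data, m=10):
    """Regress bar{Y}_eps on eps^2 once; the intercept estimates H(D), eq. (13)."""
    n, p = data.shape
    c_p = np.pi ** (p / 2) / gamma(p / 2 + 1)
    d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                   # exclude each point itself
    eps0 = d.min(axis=1).max()                    # from here on every k_eps >= 1
    eps_grid = np.linspace(eps0, 3 * eps0, m)
    Ybar = np.array([-np.mean(np.log((d <= e).sum(axis=1) / (n * c_p * e ** p)))
                     for e in eps_grid])
    slope, intercept = np.polyfit(eps_grid ** 2, Ybar, 1)
    return intercept

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 1))
print(dre_entropy(data))                          # true H of N(0,1) ≈ 1.4189
```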

Page 78: Hideitsu Hino

Recap:

q_z(ε) = c_p f(z) ε^p + (p / (4(p/2 + 1))) c_p ε^{p+2} tr∇²f(z) + O(ε^{p+4}).

Approximating q_z(ε) by k_ε/n and dividing by c_p ε^p,

k_ε / (n c_p ε^p) = f(z) + C ε² + O(ε⁴),

i.e., Y_ε = f(z) + C X_ε.

58 / 74

Page 79: Hideitsu Hino

SRE: at each point z, solve

min (1/m) ∑_{ε∈E} (Y_ε − f(z) − C X_ε)²,

and set

Ĥ_s(D) = −(1/n) ∑_{i=1}^n ln f̂_i(x_i).

DRE: solve the single regression

min (1/m) ∑_{ε∈E} (Ȳ_ε − H(D) − C̄ X_ε)².

59 / 74

Page 80: Hideitsu Hino

k

60 / 74

Page 81: Hideitsu Hino

From

q_z(ε) = c_p f(z) ε^p + (p / (4(p/2 + 1))) c_p ε^{p+2} tr∇²f(z) + O(ε^{p+4}),

approximating q_z(ε) by k_ε/n and multiplying by n:

k_ε ≃ c_p n f(z) ε^p + c_p n (p / (4(p/2 + 1))) tr∇²f(z) ε^{p+2}.

61 / 74

Page 82: Hideitsu Hino

k_ε ≃ c_p n f(z) ε^p + c_p n (p / (4(p/2 + 1))) tr∇²f(z) ε^{p+2}

is a linear regression Y = βᵀX of the response Y = k_ε on X = (ε^p, ε^{p+2}). Since k_ε is a count, it is modeled as Poisson.

62 / 74

Page 83: Hideitsu Hino

Maximize the Poisson likelihood with identity link,

L(β) = ∏_{i=1}^m e^{−X_iᵀβ} (X_iᵀβ)^{Y_i} / Y_i!.

The coefficient β₁ of ε^p estimates c_p n f(z), so f̂(z) = β₁ / (c_p n). Plugging this density estimate into the leave-one-out average, as for SRE, gives the Entropy Estimator with Poisson-noise structure and Identity-link regression (EPI) [Hino+, under review].

63 / 74
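A minimal sketch of the EPI idea: maximize the identity-link Poisson likelihood above numerically and read f̂(z) off the ε^p coefficient. The optimizer, starting point, and radius grid are illustrative assumptions, not the talk's specification.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gamma, gammaln

def epi_density(z, data, eps_grid):
    """Identity-link Poisson fit of k_eps on (eps^p, eps^{p+2}); f_hat = beta_1/(c_p n)."""
    n, p = data.shape
    c_p = np.pi ** (p / 2) / gamma(p / 2 + 1)
    d = np.linalg.norm(data - z, axis=1)
    k = np.array([(d <= e).sum() for e in eps_grid])          # count responses Y_i
    X = np.column_stack([eps_grid ** p, eps_grid ** (p + 2)])

    def nll(beta):                                # negative Poisson log-likelihood
        mu = X @ beta
        if np.any(mu <= 0):                       # identity link: keep the mean positive
            return np.inf
        return np.sum(mu - k * np.log(mu) + gammaln(k + 1))

    beta0 = np.array([k.mean() / eps_grid.mean() ** p, 0.0])
    res = minimize(nll, beta0, method="Nelder-Mead")
    return res.x[0] / (c_p * n)                   # f_hat(z) = beta_1 / (c_p n)

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 2))
print(epi_density(np.zeros(2), data, np.linspace(0.2, 0.8, 10)))  # ~1/(2 pi) ≈ 0.159
```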

Page 84: Hideitsu Hino

1

2

3

64 / 74

Page 85: Hideitsu Hino

Compare the true entropy H(f) with an estimate Ĥ(D) by the absolute error

AE = |H(f) − Ĥ(D)|,

averaged over 100 repetitions.

65 / 74

Page 86: Hideitsu Hino

Univariate Case: 15 distributions

[Density plots: Normal, Skewed, Strongly Skewed, Kurtotic, Bimodal, Skewed Bimodal.]

66 / 74

Page 87: Hideitsu Hino

Univariate Case: 15 distributions (continued)

[Density plots: Trimodal, 10 Claw, Standard Power Exponential, Standard Logistic, Standard Classical Laplace, t(df=5).]

67 / 74

Page 88: Hideitsu Hino

Univariate Case: 15 distributions (continued)

[Density plots: Mixed t, Standard Exponential, Cauchy.]

68 / 74

Page 89: Hideitsu Hino

[Estimation results overlaid on the density plots: Normal, Skewed, Strongly Skewed, Kurtotic, Bimodal.]

69 / 74

Page 90: Hideitsu Hino

[Estimation results overlaid on the density plots: Skewed Bimodal, Trimodal, 10 Claw, Standard Power Exponential, Standard Logistic.]

69 / 74

Page 91: Hideitsu Hino

[Estimation results overlaid on the density plots: Standard Classical Laplace, t(df=5), Mixed t, Standard Exponential, Cauchy.]

69 / 74

Page 92: Hideitsu Hino

Univariate Case Results: Curvature and Improvement

The correction term involves the curvature tr∇²f. To control the curvature, use the scaled Cauchy density with parameter γ > 0:

f(x; γ) = 1 / (πγ(1 + (x/γ)²)),

∇²f(x; γ) = (2 / (πγ³)) (3(x/γ)² − 1) / (1 + (x/γ)²)³.

γ is varied from 0.01 to 0.9, with n = 300 and 100 repetitions; for EPI, the improvement over the k-NN baseline is measured by

|Ĥ_k(D) − H(f)| − |Ĥ_s(D) − H(f)|.

70 / 74
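The stated second derivative is easy to sanity-check by finite differences (the step size and evaluation point below are arbitrary):

```python
import numpy as np

def f(x, g):
    return 1.0 / (np.pi * g * (1.0 + (x / g) ** 2))

def f2_formula(x, g):
    return (2.0 / (np.pi * g ** 3)) * (3.0 * (x / g) ** 2 - 1.0) / (1.0 + (x / g) ** 2) ** 3

x0, g0, h = 0.7, 0.3, 1e-5
f2_fd = (f(x0 + h, g0) - 2.0 * f(x0, g0) + f(x0 - h, g0)) / h ** 2
print(f2_formula(x0, g0), f2_fd)   # agree to finite-difference accuracy
```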

Page 93: Hideitsu Hino

Univariate Case Results: Curvature and Improvement

Curvature is summarized by max_{x∈R} log |∇²f(x; γ)|.

[Scatter plot: Improvement (−0.2 to 0.2) against LogMaxCurvature (0.0 to 7.5).]

71 / 74

Page 94: Hideitsu Hino

That's all for k-NN based estimation.

Pros. KDE k-NN

Cons.

72 / 74

Page 95: Hideitsu Hino

I

[Faivishevsky&Goldberger, 2010] Faivishevsky, L. and Goldberger, J. (2010). A nonparametric information theoretic clustering algorithm. In ICML 2010.

[Hino+, 2015] Hino, H., Koshijima, K., and Murata, N. (2015). Non-parametric entropy estimators based on simple linear regression. Computational Statistics & Data Analysis, 89:72-84.

[Hino&Murata, 2010] Hino, H. and Murata, N. (2010). A conditional entropy minimization criterion for dimensionality reduction and multiple kernel learning. Neural Computation, 22(11):2887-2923.

[Hyvarinen&Oja, 2000] Hyvarinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411-430.

[Koshijima+, 2015] Koshijima, K., Hino, H., and Murata, N. (2015). Change-point detection in a sequence of bags-of-data. IEEE Transactions on Knowledge and Data Engineering, 27(10):2632-2644.

73 / 74

Page 96: Hideitsu Hino

II

[Murata+, 2013] Murata, N., Koshijima, K., and Hino, H. (2013). Distance-based change-point detection with entropy estimation. In Proceedings of the Sixth Workshop on Information Theoretic Methods in Science and Engineering, pages 22-25.

74 / 74