Information-theoretic clustering with applications



DESCRIPTION

Information-theoretic clustering with applications

Abstract: Clustering is a fundamental and key primitive to discover structural groups of homogeneous data in data sets, called clusters. The most famous clustering technique is the celebrated k-means clustering that seeks to minimize the sum of intra-cluster variances. k-Means is NP-hard as soon as the dimension and the number of clusters are both greater than 1.

In the first part of the talk, we present a generic dynamic programming method to compute the optimal clustering of n scalar elements into k pairwise disjoint intervals. This case includes 1D Euclidean k-means but also other kinds of clustering algorithms like the k-medoids, the k-medians, the k-centers, etc. We extend the method to incorporate cluster size constraints and show how to choose the appropriate number of clusters using model selection. We then illustrate and refine the method on two case studies: 1D Bregman clustering and univariate statistical mixture learning maximizing the complete likelihood.

In the second part of the talk, we introduce a generalization of k-means to cluster sets of histograms, which has become an important ingredient of modern information processing due to the success of the bag-of-word modelling paradigm. Clustering histograms can be performed using the celebrated k-means centroid-based algorithm. We consider the Jeffreys divergence that symmetrizes the Kullback-Leibler divergence, and investigate the computation of Jeffreys centroids. We prove that the Jeffreys centroid can be expressed analytically using the Lambert W function for positive histograms. We then show how to obtain a fast guaranteed approximation when dealing with frequency histograms, and conclude with some remarks on k-means histogram clustering.

References:

- Optimal interval clustering: Application to Bregman clustering and statistical mixture learning. IEEE ISIT 2014 (recent result poster). http://arxiv.org/abs/1403.2485

- Jeffreys Centroids: A Closed-Form Expression for Positive Histograms and a Guaranteed Tight Approximation for Frequency Histograms. IEEE Signal Process. Lett. 20(7): 657-660 (2013). http://arxiv.org/abs/1303.7286

http://www.i.kyoto-u.ac.jp/informatics-seminar/

TRANSCRIPT

Information-theoretic clustering with applications

Frank Nielsen

Sony Computer Science Laboratories Inc / École Polytechnique

Kyoto University Information Seminar, 9th June 2014

© 2014 Frank Nielsen, Sony Computer Science Laboratories, Inc.

Clustering: Data exploratory science

Clustering: Find homogeneous groups in data

Clustering: Separate data into groups

Iris data set from UCI: d = 4, k = 3, n = 150.

Clustering: What to see? When to see it?

Distance or similarity measure between data?

Data membership: Hard or soft clustering?

How many groups?

Outliers?

“Shapes” of groups?

Representation of data and cleaning,

Missing data, truncated data, ...

Etc.


Outline

Clustering n data points in d dimensions into k clusters. Usual case: d ≪ k ≪ n, but all cases are possible.

Quick tutorial: "The essentials". Ordinary k-means, Gaussian mixture models, model selection, other kinds of clustering, etc.

Optimal 1D contiguous clustering

Clustering histograms with Jeffreys divergence


Part I

Quick tutorial


k-means clustering

Partition the data set X = {x1, ..., xn} into k groups P = {G1, ..., Gk}, where each group Gj has a center cj.

Minimize the objective function (cost, energy, loss):

e(X, C) = ∑_{i=1}^n min_{j=1}^k ‖x_i − c_j‖²

NP-hard when d > 1 and k > 1.

For k = 1, the center c1 is the centroid and e1(X) = e(X, {c1}) = var(X), the "unnormalized" variance: var(X) = ∑_i ‖x_i‖² − n‖c1‖².


Rewriting the k-means cost to minimize

e(X, C) = ∑_{i=1}^n min_{j=1}^k ‖x_i − c_j‖²

= (1/2) ∑_{j=1}^k ∑_{x,x′∈G_j} ‖x − x′‖² + constant

= −(1/2) ∑_{j=1}^k ∑_{x∈G_j} ∑_{x′∉G_j} ‖x − x′‖² + constant

= ∑_{j=1}^k n_j var(G_j), with cluster sizes n_j = |G_j|

= var(X) − var(Y), where Y = {(n_j, c_j)}_{j=1}^k

Interpreted as: minimizing the sum of intra-cluster distances, maximizing the sum of inter-cluster distances, minimizing the sum of cluster variances, or minimizing the quantization loss.

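A quick numeric sanity check of this decomposition (a minimal Python sketch; the function names are mine, and the identity assumes each center is its cluster's centroid):

```python
import numpy as np

def kmeans_cost(X, centers, labels):
    # e(X, C) = sum_i ||x_i - c_{label(i)}||^2, evaluated at a given assignment
    return float(((X - centers[labels]) ** 2).sum())

def variance_gap(X, centers, labels, k):
    # var(X) - var(Y), with Y the centers weighted by the cluster sizes n_j
    mu = X.mean(axis=0)
    var_X = ((X - mu) ** 2).sum()
    n_j = np.bincount(labels, minlength=k)
    var_Y = (n_j[:, None] * (centers - mu) ** 2).sum()
    return float(var_X - var_Y)

# e(X, C) = var(X) - var(Y) holds when each center is its cluster's centroid:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = rng.integers(0, 3, size=200)
centers = np.array([X[labels == j].mean(axis=0) for j in range(3)])
assert abs(kmeans_cost(X, centers, labels) - variance_gap(X, centers, labels, 3)) < 1e-8
```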

Heuristics for k-means

Local heuristics:

Lloyd (batched): assign data to the closest center, remap cluster centers to centroids, reiterate until convergence (see the sketch below)

MacQueen (incremental, online): add one point at a time, assign x to the closest cluster, update that cluster's centroid, reiterate until convergence

Hartigan (single-point swap): find a point x ∈ G_i and a cluster G_j such that relocating x to G_j decreases the k-means cost, reiterate until convergence

Global heuristics:

Forgy: random seeding. Best Forgy is a 2-approximation via the variance-bias decomposition: D(x, X) = ∑_{i=1}^n ‖x − x_i‖² = var(X) + n‖x − c1‖²

Farthest-first traversal: choose c1 at random, then pick c_i as the point x farthest from c1, ..., c_{i−1}, repeat

Randomized k-means++ (probabilistic O(log k) bound)

Global k-means

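A minimal Python sketch of the batched Lloyd heuristic with Forgy (random) seeding; the function name and interface are illustrative additions, not from the talk:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's batched heuristic: assign each point to its closest center,
    remap the centers to the cluster centroids, reiterate until convergence."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # Forgy: random seeding
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distances to all centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: recompute each centroid (keep the old center if a cluster empties)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # no center moved: a local minimum of the k-means cost
        centers = new_centers
    return centers, labels
```

Only local convergence is guaranteed, which is why the global seeding heuristics listed above (k-means++, farthest-first traversal, etc.) matter in practice.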

Probabilistic model-based clustering

Maximize the likelihood of the mixture density:

m(x) = ∑_{i=1}^k w_i p(x; µ_i, Σ_i)

Statistical mixture models: generative models.


Statistical Gaussian Mixtures Models

Universal smooth density estimator.

Gaussian distribution:

p(x; µ, Σ) = (1/((2π)^{d/2} √|Σ|)) exp(−(1/2) D_{Σ^{−1}}(x, µ))

with the squared Mahalanobis distance:

D_Q(x, y) = (x − y)^⊤ Q (x − y), x, y ∈ R^d

→ log-concave density.

Expectation-Maximization (EM) algorithm: maximizes the incomplete likelihood.

EM tends to k-means when Σ_j = λI with λ → 0 (or any fixed λ, see [34]).


Sampling from a Gaussian Mixture Model

To sample a variate x from a GMM:

Choose a component l according to the weight distribution w1, ..., wk,

Draw a variate x according to N(µ_l, Σ_l).

→ Sampling is a doubly stochastic process:

throw a biased die with k faces to choose the component: l ∼ Multinomial(w1, ..., wk) (a normalized histogram),

then draw at random a variate x from the l-th component: x ∼ Normal(µ_l, Σ_l), i.e.,

x = µ_l + Cz, with the Cholesky decomposition Σ_l = CC^⊤ and z = [z1 ... zd]^⊤ a vector of standard normal variates: z_i = √(−2 log U1) cos(2π U2) (Box–Muller). (See the sketch below.)

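A Python sketch of this doubly stochastic sampling recipe (illustrative names; the Box–Muller line matches the standard-normal formula above):

```python
import numpy as np

def sample_gmm(weights, mus, Sigmas, size, seed=0):
    """Pick a component with the biased k-faced die, then draw from its Gaussian."""
    rng = np.random.default_rng(seed)
    d = len(mus[0])
    Cs = [np.linalg.cholesky(S) for S in Sigmas]            # Sigma = C C^T
    comps = rng.choice(len(weights), size=size, p=weights)  # l ~ Multinomial(w_1..w_k)
    X = np.empty((size, d))
    for t, l in enumerate(comps):
        U1, U2 = rng.random(d), rng.random(d)
        z = np.sqrt(-2.0 * np.log(U1)) * np.cos(2.0 * np.pi * U2)  # Box-Muller variates
        X[t] = mus[l] + Cs[l] @ z                                   # x = mu + C z
    return X
```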

GMMs: Generative models, sampling [15]

A pixel (x, y, R, G, B): a data point in 5D.

Model selection: Choosing k

The more parameters, the better the fit, but the model overfits.

Solution: a penalized likelihood to maximize.

Bayesian Information Criterion (BIC):

max_θ l(θ) − (#θ/2) log n

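In code, picking k with BIC is just an argmax of the penalized likelihood (a sketch; `log_lik(k)` and `n_params(k)` are hypothetical placeholders for the fitted model's log-likelihood and parameter count):

```python
import numpy as np

def bic_score(log_lik, n_params, n):
    # Penalized likelihood to maximize: l(theta) - (#theta / 2) * log(n)
    return log_lik - 0.5 * n_params * np.log(n)

def select_k(candidates, log_lik, n_params, n):
    # candidates: iterable of k values; keep the k with the best BIC score
    return max(candidates, key=lambda k: bic_score(log_lik(k), n_params(k), n))
```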

Clustering analysis

Parametric clustering (k given, cost-based):

center-based hard clustering (k-means, k-medians, min diameter, single linkage, etc.)

model-based soft clustering

Non-parametric clustering (k adjusts to the data):

kernel density estimation (mean shift)

Dirichlet process mixture models

Hierarchical clustering

Graph-based clustering (normalized cut)

Affinity propagation

Subspace/manifold clustering

Etc.

Cluster validation (distance between clusterings, confusion matrix, NMI, Rand index, etc.)


Part II

Optimal 1D contiguous clustering


Hard clustering: Partitioning the data set

Partition X = {x1, ..., xn} ⊂ X into k pairwise disjoint clusters C1 ⊂ X, ..., Ck ⊂ X:

X = ⊎_{i=1}^k C_i

Center-based hard clustering (center = prototype): k-means [1], k-medians, k-center, ℓr-center [21], etc.

Model-based hard clustering: statistical mixtures maximizing the complete likelihood (prototype = model parameter).


k-means clustering

Minimize the sum of intra-cluster variances:

min_{p1,...,pk} ∑_{i=1}^n min_{j=1}^k ‖x_i − p_j‖²

k-means: NP-hard when d > 1 and k > 1 [24, 12]; k-medians and k-centers are also NP-hard [25] (1984).

Global heuristics (for initialization): Forgy [14], global k-means [40], k-means++ [4], etc. Local search heuristics: incremental Voronoi MacQueen [23], batched Voronoi Lloyd [22], single-point swap Hartigan [17], etc.

In 1D, k-means is polynomial [3, 39]: O(n²k).


Euclidean 1D k-means

1D k-means [13] has contiguous partitions.

Solved by enumerating all (n−1 choose k−1) interval partitions (1958); better than the Stirling numbers of the second kind S(n, k) that count all partitions.

Polynomial in time O(n²k) using Dynamic Programming (DP) [3] (sketched in 1973 in two pages).

R package Ckmeans.1d.dp [39] (2011).


(Sketch of the) DP solution [3] (Bellman, 1973)


Interval clustering: Structure

Sort X with respect to the total order < on X in O(n log n).

Output represented by:

k intervals I_i = [x_{l_i}, x_{r_i}] such that C_i = I_i ∩ X,

or better, the k − 1 left delimiters l_i (i ∈ {2, ..., k}), since r_i = l_{i+1} − 1 (for i < k, with r_k = n) and l_1 = 1:

C1 = {x_1, ..., x_{l_2−1}}, C2 = {x_{l_2}, ..., x_{l_3−1}}, ..., Ck = {x_{l_k}, ..., x_n}


Objective function for interval clustering

Scalars x1 < ... < xn are partitioned contiguously into k clusters: C1 < ... < Ck.

Clustering objective function:

min e_k(X) = ⊕_{j=1}^k e1(C_j)

e1(·): intra-cluster cost/energy; ⊕: inter-cluster cost/energy (commutative, associative).


Examples of objective functions

In arbitrary dimension X = R^d:

ℓr-clustering (r ≥ 1): ⊕ = ∑, with

e1(C_j) = min_{p∈X} ∑_{x∈C_j} d(x, p)^r

(the argmin prototype p_j is the same whether or not we take the 1/r-th power of the sum). Euclidean ℓr-clustering: r = 1 medians, r = 2 means.

k-center (lim r → ∞): ⊕ = max, with

e1(C_j) = min_{p∈X} max_{x∈C_j} d(x, p)

Discrete clustering: the search space in the min is C_j instead of X.

Note that in 1D, the ℓs-norm distance is always d(p, q) = |p − q|, independent of s ≥ 1.


Optimal interval clustering by Dynamic Programming

X_{j,i} = {x_j, ..., x_i} (j ≤ i), X_i = X_{1,i} = {x_1, ..., x_i}.

E = [e_{i,m}]: n × k cost matrix (O(nk) memory), with e_{i,m} = e_m(X_i).

Optimality equation:

e_{i,m} = min_{m ≤ j ≤ i} (e_{j−1,m−1} ⊕ e1(X_{j,i}))

with ⊕ an associative/commutative operator (+ or max). Initialize with e_{i,1} = e1(X_i). Compute E column by column from left to right, each column from bottom to top. The best clustering cost is at e_{n,k}.

Time: n × k × O(n) × T1(n) = O(n²k T1(n)), with O(nk) memory, where T1(n) is the cost of evaluating e1.

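A direct Python transcription of this recurrence with ⊕ = + (a sketch, 0-indexed; `e1(j, i)` is any intra-cluster cost oracle for X_{j,i}, and `s` is the argmin table used for the backtracking described on the next slide):

```python
def dp_interval_clustering(n, k, e1):
    """Fill e[i][m] = min_j e[j-1][m-1] + e1(j, i) for m+1 clusters over x_0..x_i,
    with e[i][0] = e1(0, i); s stores the argmin left indexes for backtracking."""
    INF = float("inf")
    e = [[INF] * k for _ in range(n)]
    s = [[0] * k for _ in range(n)]
    for i in range(n):
        e[i][0] = e1(0, i)              # a single cluster covering x_0..x_i
    for m in range(1, k):               # m+1 clusters so far
        for i in range(m, n):
            for j in range(m, i + 1):   # left index of the last cluster
                cost = e[j - 1][m - 1] + e1(j, i)
                if cost < e[i][m]:
                    e[i][m], s[i][m] = cost, j
    # Backtrack the k-1 left delimiters in O(k) time
    delims, i, m = [], n - 1, k - 1
    while m >= 1:
        delims.append(s[i][m])
        i, m = s[i][m] - 1, m - 1
    return e[n - 1][k - 1], delims[::-1]
```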

Retrieving the solution: Backtracking

Use an auxiliary matrix S = [s_{i,j}] to store the argmin indexes.

Backtrack in O(k) time:

The left index l_k of C_k is stored at s_{n,k}: l_k = s_{n,k}.

Iteratively retrieve the previous left interval indexes: l_{j−1} = s_{l_j − 1, j−1}, for j = k, ..., 2.

Note that l_j − 1 = n − ∑_{l=j}^k n_l and l_j − 1 = ∑_{l=1}^{j−1} n_l.


Optimizing time with a Look Up Table (LUT)

Save time when computing e1(X_{j,i}), since we perform n × k × O(n) such computations.

Look-Up Table (LUT): add an extra n × n matrix E1 with E1[j][i] = e1(X_{j,i}). Build it in O(n² T1(n)) time; the DP then runs in O(n²k), for O(n² T1(n) + n²k) total time.

→ quadratic amount of memory (prohibitive for n > 10000).


DP solver with cluster size constraints

n_i^− and n_i^+: lower/upper bound constraints on n_i = |C_i|, with ∑_{l=1}^k n_l^− ≤ n and ∑_{l=1}^k n_l^+ ≥ n.

When there are no constraints: add the dummy constraints n_i^− = 1 and n_i^+ = n − k + 1.

n_m = |C_m| = i − j + 1 must satisfy n_m^− ≤ n_m ≤ n_m^+ → j ≤ i + 1 − n_m^− and j ≥ i + 1 − n_m^+.

e_{i,m} = min_{max(1 + ∑_{l=1}^{m−1} n_l^−, i + 1 − n_m^+) ≤ j ≤ i + 1 − n_m^−} (e_{j−1,m−1} ⊕ e1(X_{j,i}))


Model selection from the DP table

m(k) = e_k(X)

e_k(X) decreases with k and reaches its minimum at k = n. Model selection: a trade-off; choose the best model among all the models with k ∈ [1, n].

Regularized objective function: e′_k(X) = e_k(X) + f(k), with f(k) related to the model complexity.

Computing the DP table once for k = n yields all the e_k(X) for k = n, ..., 1 and avoids redundant computations. Then compute the criterion on the last row (indexed by n) and choose the argmin of e′_k.


A Voronoi cell condition for DP optimality

elements → interval clusters → prototypes

interval clusters ← prototypes

Partition X wrt. P = {p1, ..., pk}. Voronoi cell:

V(p_j) = {x ∈ X : d^r(x, p_j) ≤ d^r(x, p_l) ∀l ∈ {1, ..., k}}.

Since x ↦ x^r is monotonically increasing on R+, this is equivalent to

V′(p_j) = {x ∈ X : d(x, p_j) < d(x, p_l)}.

DP guarantees an optimal clustering when, for all P, V′(p_j) is an interval.


Optimal 1D Bregman k-means

Bregman information [1] e1 (generalizes the cluster variance):

e1(C_j) = min_{p_j} ∑_{x_l∈C_j} w_l B_F(x_l : p_j). (1)

Expressed as [35]:

e1(C_j) = (∑_{x_l∈C_j} w_l)(p_j F′(p_j) − F(p_j)) + (∑_{x_l∈C_j} w_l F(x_l)) − F′(p_j)(∑_{x_l∈C_j} w_l x_l)

Preprocess using Summed Area Tables (SATs) [10]: S1(j) = ∑_{l=1}^j w_l, S2(j) = ∑_{l=1}^j w_l x_l, and S3(j) = ∑_{l=1}^j w_l F(x_l), built in O(n) time at the preprocessing stage.

Then evaluate the Bregman information e1(X_{j,i}) in constant time O(1). For example, ∑_{l=j}^i w_l F(x_l) = S3(i) − S3(j − 1), with S3(0) = 0.

Bregman Voronoi diagrams have connected cells [6], thus DP yields an optimal interval clustering.

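For the Euclidean generator F(x) = x², this SAT preprocessing gives the familiar O(1) weighted sum-of-squares cost. A Python sketch (illustrative names; it plugs into the DP sketch above as the `e1` oracle):

```python
import numpy as np

def make_sse_oracle(x, w=None):
    """Prefix sums S1, S2, S3 of w_l, w_l*x_l, w_l*F(x_l) with F(x) = x^2, so that
    e1(X_{j,i}) = sum w_l x_l^2 - (sum w_l x_l)^2 / sum w_l in O(1) time."""
    x = np.asarray(x, dtype=float)            # x must be sorted
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    S1 = np.concatenate(([0.0], np.cumsum(w)))
    S2 = np.concatenate(([0.0], np.cumsum(w * x)))
    S3 = np.concatenate(([0.0], np.cumsum(w * x * x)))
    def e1(j, i):  # weighted SSE of the interval cluster x[j..i], 0-indexed inclusive
        W, M, Q = S1[i + 1] - S1[j], S2[i + 1] - S2[j], S3[i + 1] - S3[j]
        return Q - M * M / W
    return e1
```

For a general Bregman generator F, only S3 changes (store w_l F(x_l)) and the centroid-based expression above is evaluated the same way.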

Exponential families in statistics

Family of probability distributions:

F = {p_F(x; θ) : θ ∈ Θ}

Exponential families [30]:

p_F(x; θ) = exp(t(x)θ − F(θ) + k(x))

For example, the univariate Rayleigh distribution R(σ): t(x) = x², k(x) = log x, θ = −1/(2σ²), η = −1/θ, F(θ) = log(−1/(2θ)), and F*(η) = −1 + log(2/η).


Uniorder exponential families: MLE

Maximum Likelihood Estimator (MLE) [30]:

e1(X_{j,i}) = l(x_j, ..., x_i) = F*(η_{j,i}) + (1/(i−j+1)) ∑_{l=j}^i k(x_l),

with η_{j,i} = (1/(i−j+1)) ∑_{l=j}^i t(x_l).

By making the change of variable y_l = t(x_l), and not accounting for the ∑ k(x_l) terms that are constant for any clustering, we get

e1(X_{j,i}) ≡ F*((1/(i−j+1)) ∑_{l=j}^i y_l)


Hard clustering for learning statistical mixtures

Expectation-Maximization learns monotonically from an initialization by maximizing the incomplete log-likelihood.

Mixture maximizing the complete log-likelihood:

l_c(X; L, Ω) = ∑_{i=1}^n log(α_{l_i} p(x_i; θ_{l_i})),

with L = {l_i}_i the hidden labels.

max l_c ≡ min_{θ1,...,θk} ∑_{i=1}^n min_{j=1}^k (−log p(x_i; θ_j) − log α_j).

For fixed α, −log p_F(x; θ) amounts to a dual Bregman divergence [1]. Running Bregman k-means and the DP yields an optimal partition, since additively-weighted Bregman Voronoi diagrams have interval cells [6].


Hard clustering for learning statistical mixtures

Location families:

F = {f(x; µ) = (1/σ) f0((x − µ)/σ) : µ ∈ R}

with f0 a standard density and σ > 0 fixed. Cauchy or Laplacian families have density graphs intersecting in exactly one point

→ singly-connected Maximum Likelihood Voronoi cells.

Model selection: Akaike Information Criterion [7] (AIC):

AIC(x1, ..., xn) = −2 l(x1, ..., xn) + 2k + 2k(k+1)/(n − k − 1)


Experiments with: Gaussian Mixture Models (GMMs)

gmm1 score = −3.0754314021966658 (Euclidean k-means); gmm2 score = −3.038795325884112 (Bregman k-means, better).


1-mean (centroid): O(n) time

min_p ∑_{i=1}^n (x_i − p)²

D(x, p) = (x − p)², D′(x, p) = 2(x − p), D″(x, p) = 2

Convex optimization (existence and uniqueness of the solution):

∑_{i=1}^n D′(x_i, p) = 0 ⇒ ∑_{i=1}^n x_i − np = 0

Center of mass: p = (1/n) ∑_{i=1}^n x_i (barycenter).

Extends to Bregman divergences:

D_F(x, p) = F(x) − F(p) − (x − p)F′(p)


2-means: O(n log n) time

Find the delimiter l_2 (n − 1 potential locations: from x_2 to x_n):

min_{l_2} e1(C1) + e1(C2)

Browse from left to right, l_2 = 2, ..., n, updating the cost E2(l + 1) from E2(l) in constant time (SATs are also O(1)):

E2(l) = e2(x1 ... x_{l−1} | x_l ... x_n)

µ1(l + 1) = ((l − 1)µ1(l) + x_l)/l,  µ2(l + 1) = ((n − l + 1)µ2(l) − x_l)/(n − l)

v1(l + 1) = ∑_{i=1}^l (x_i − µ1(l + 1))² = ∑_{i=1}^l x_i² − l µ1(l + 1)²

ΔE2(l) = ((l − 1)/l) ‖µ1(l) − x_l‖² − ((n − l + 1)/(n − l)) ‖µ2(l) − x_l‖²

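A Python sketch of this left-to-right sweep (here using prefix sums instead of the mean-update recurrences above, which gives the same O(1) per-step update; the function name is mine):

```python
import numpy as np

def two_means_1d(x):
    """Try every separator l (C1 = x[:l], C2 = x[l:]) over sorted x,
    evaluating both cluster costs in O(1) via prefix sums."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    S1, S2 = np.cumsum(x), np.cumsum(x * x)  # prefix sums and prefix sums of squares
    best_cost, best_l = np.inf, 1
    for l in range(1, n):                    # n-1 candidate separators
        v1 = S2[l - 1] - S1[l - 1] ** 2 / l                              # SSE of x[:l]
        v2 = (S2[-1] - S2[l - 1]) - (S1[-1] - S1[l - 1]) ** 2 / (n - l)  # SSE of x[l:]
        if v1 + v2 < best_cost:
            best_cost, best_l = v1 + v2, l
    return best_cost, best_l
```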

2-means: Experiments

Intel Win7 i7-4800, running times:

n | Brute force | SAT | Incremental
300000 | 155.022 | 0.010 | 0.0091
1000000 | 1814.44 | 0.018 | 0.015

Do we need sorting and Ω(n log n) time? (k = 1 is linear time)

Note that MaxGap does not yield the optimal separator (because the centroid is the minimizer of the sum of squared distances)


Conclusion

Generic DP for solving interval clustering:

O(n²k T1(n)) time using O(nk) memory, or O(n² T1(n) + n²k) time using O(n²) memory

Refine the DP by adding minimum/maximum cluster size constraints

Model selection from the DP table

Two applications:

1D Bregman ℓr-clustering: 1D Bregman k-means in O(n²k) time using O(nk) memory via Summed Area Tables (SATs)

Mixture learning maximizing the complete likelihood:

for uni-order exponential families, it amounts to a dual Bregman k-means on Y = {y_i = t(x_i)}_i

for location families with density graphs intersecting pairwise in one point (Cauchy, Laplacian: ∉ exponential families)


Part III

Clustering histograms with Jeffreys divergence


Why histogram clustering?

Task: classify documents into categories with the Bag-of-Word (BoW) modeling paradigm [5, 11]:

Define a word dictionary, and

Represent each document by a word-count histogram.

Centroid-based k-means clustering [1]:

Cluster document histograms to learn categories,

Build visual vocabularies by quantizing image features: Compressed Histogram of Gradient descriptors [8].

→ histogram centroids.

Notation: w_h = ∑_{i=1}^d h^i is the cumulative sum of the bin values, and h̃ = h/w_h is the normalization operator.


Why Jeffreys divergence?

Distance between two frequency histograms p and q: the Kullback-Leibler divergence, or relative entropy,

KL(p : q) = H×(p : q) − H(p),

H×(p : q) = ∑_{i=1}^d p^i log(1/q^i) (cross-entropy),

H(p) = H×(p : p) = ∑_{i=1}^d p^i log(1/p^i) (Shannon entropy).

→ the expected extra number of bits per datum that must be transmitted when using the "wrong" distribution q instead of the true distribution p. p is hidden by nature (and hypothesized); q is estimated.


Why Jeffreys divergence?

When clustering histograms, all histograms play the same role → Jeffreys [18] divergence:

J(p, q) = KL(p : q) + KL(q : p) = ∑_{i=1}^d (p^i − q^i) log(p^i/q^i) = J(q, p)

→ symmetrizes the KL divergence (aka the J-divergence or symmetrical Kullback-Leibler divergence).

For positive arrays, the extended KL divergence is

KL(p : q) = ∑_i (p^i log(p^i/q^i) + q^i − p^i)

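These divergences are one-liners in code; a Python sketch (for frequency histograms the q − p correction terms of the extended KL cancel):

```python
import numpy as np

def kl(p, q):
    # Extended KL for positive arrays: sum_i p_i log(p_i/q_i) + q_i - p_i
    return float(np.sum(p * np.log(p / q) + q - p))

def jeffreys(p, q):
    # J(p, q) = KL(p:q) + KL(q:p) = sum_i (p_i - q_i) log(p_i/q_i)
    return float(np.sum((p - q) * np.log(p / q)))
```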

Jeffreys centroids: frequency and positive centroids

A set H = {h1, ..., hn} of weighted histograms, with positive histogram weights π_j > 0, ∑_{j=1}^n π_j = 1:

c = argmin_x ∑_{j=1}^n π_j J(h_j, x)

Jeffreys positive centroid c:

c = argmin_{x∈R₊^d} ∑_{j=1}^n π_j J(h_j, x)

Jeffreys frequency centroid c̃:

c̃ = argmin_{x∈∆_d} ∑_{j=1}^n π_j J(h̃_j, x)

∆_d: the probability (d − 1)-dimensional simplex.


Prior work

Histogram clustering wrt. the χ² distance [20]

Histogram clustering wrt. the Bhattacharyya distance [26, 33]

Histogram clustering wrt. the Kullback-Leibler distance, as Bregman k-means clustering [1]

Jeffreys frequency centroid [38] (Newton numerical optimization)

Jeffreys frequency centroid as an equivalent symmetrized Bregman centroid [36]

Mixed Bregman clustering [37]

Smooth family of symmetrized KL centroids including the Jensen-Shannon and Jeffreys centroids as limit cases [29]


Jeffreys positive centroid

c = argmin_{x∈R₊^d} J(H, x) = argmin_{x∈R₊^d} ∑_{j=1}^n π_j J(h_j, x).

Theorem (Theorem 1)

The Jeffreys positive centroid c = (c^1, ..., c^d) of a set {h1, ..., hn} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

c^i = a^i / W(a^i e / g^i),

where a^i = ∑_{j=1}^n π_j h_j^i denotes the coordinate-wise weighted arithmetic mean and g^i = ∏_{j=1}^n (h_j^i)^{π_j} the coordinate-wise weighted geometric mean.

The Lambert analytic function [2] satisfies W(x) e^{W(x)} = x for x ≥ 0.

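Theorem 1 translates directly into code; a Python sketch using SciPy's Lambert W (the function name and the uniform default weights are my additions):

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_positive_centroid(H, weights=None):
    """c^i = a^i / W(a^i e / g^i): a = weighted arithmetic mean, g = weighted
    geometric mean of the rows of H (positive histograms, one per row)."""
    H = np.asarray(H, dtype=float)
    w = np.full(len(H), 1.0 / len(H)) if weights is None else np.asarray(weights)
    a = w @ H                                # coordinate-wise arithmetic mean
    g = np.exp(w @ np.log(H))                # coordinate-wise geometric mean
    return a / lambertw(a * np.e / g).real   # principal branch; argument is >= e since a >= g
```

Normalizing, c′ = c / c.sum() then gives the guaranteed approximation of the frequency centroid discussed in the following slides.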

Jeffreys positive centroid (proof)

min_x ∑_{j=1}^n π_j J(h_j, x)

= min_x ∑_{j=1}^n π_j ∑_{i=1}^d (h_j^i − x^i)(log h_j^i − log x^i)

≡ min_x ∑_{i=1}^d ∑_{j=1}^n π_j (x^i log x^i − x^i log h_j^i − h_j^i log x^i)

= min_x ∑_{i=1}^d (x^i log x^i − x^i log ∏_{j=1}^n (h_j^i)^{π_j} − (∑_{j=1}^n π_j h_j^i) log x^i)   [geometric mean g^i, arithmetic mean a^i]

= min_x ∑_{i=1}^d (x^i log(x^i/g^i) − a^i log x^i)


Jeffreys positive centroid (proof)

Coordinate-wise minimization:

min_x x log(x/g) − a log x

Setting the derivative to zero, we solve

log(x/g) + 1 − a/x = 0

and get

x = a / W(a e / g)


Jeffreys frequency centroid: A guaranteed approximation

c̃ = argmin_{x∈∆_d} ∑_{j=1}^n π_j J(h̃_j, x)

Relaxing x from the probability simplex ∆_d to R₊^d, we get

c′ = c/w_c, with c^i = a^i / W(a^i e / g^i) and w_c = ∑_i c^i.

Lemma (Lemma 1)

The cumulative sum w_c of the bin values of the Jeffreys positive centroid c of a set of frequency histograms is less than or equal to one: 0 < w_c ≤ 1.


Proof of Lemma 1

From Theorem 1:

w_c = ∑_{i=1}^d c^i = ∑_{i=1}^d a^i / W(a^i e / g^i).

By the arithmetic-geometric mean inequality, a^i ≥ g^i. Therefore W(a^i e / g^i) ≥ W(e) = 1 and c^i ≤ a^i. Thus

w_c = ∑_{i=1}^d c^i ≤ ∑_{i=1}^d a^i = 1.


Lemma 2

Lemma (Lemma 2)

For any positive histogram x and frequency histogram h̃, we have J(x, h̃) = J(x̃, h̃) + (w_x − 1)(KL(x̃ : h̃) + log w_x), where w_x = ∑_{i=1}^d x^i denotes the normalization factor.

Extended to a weighted set H̃:

J(x, H̃) = J(x̃, H̃) + (w_x − 1)(KL(x̃ : H̃) + log w_x),

where J(x, H̃) = ∑_{j=1}^n π_j J(x, h̃_j) and KL(x̃ : H̃) = ∑_{j=1}^n π_j KL(x̃ : h̃_j) (with ∑_{j=1}^n π_j = 1).


Proof of Lemma 2

Write x^i = w_x x̃^i. Then:

J(x, h̃) = ∑_{i=1}^d (w_x x̃^i − h̃^i) log(w_x x̃^i / h̃^i)

= ∑_{i=1}^d (w_x x̃^i log(x̃^i/h̃^i) + w_x x̃^i log w_x + h̃^i log(h̃^i/x̃^i) − h̃^i log w_x)

= (w_x − 1) log w_x + J(x̃, h̃) + (w_x − 1) ∑_{i=1}^d x̃^i log(x̃^i/h̃^i)

= J(x̃, h̃) + (w_x − 1)(KL(x̃ : h̃) + log w_x),

since ∑_{i=1}^d h̃^i = ∑_{i=1}^d x̃^i = 1.


Guaranteed approximation of c

Theorem (Theorem 2)

Let c̃ denote the Jeffreys frequency centroid and c′ = c/w_c the normalized Jeffreys positive centroid. Then the approximation factor α_{c′} = J(c′, H̃)/J(c̃, H̃) satisfies 1 ≤ α_{c′} ≤ 1/w_c (with w_c ≤ 1).


Proof of Theorem 2

J(c, H̃) ≤ J(c̃, H̃) ≤ J(c′, H̃)

From Lemma 2, since J(c′, H̃) = J(c, H̃) + (1 − w_c)(KL(c′ : H̃) + log w_c) and J(c, H̃) ≤ J(c̃, H̃):

1 ≤ α_{c′} ≤ 1 + (1 − w_c)(KL(c′ : H̃) + log w_c)/J(c̃, H̃)

With KL(c′ : H̃) = (1/w_c) KL(c : H̃) − log w_c, this gives

α_{c′} ≤ 1 + (1 − w_c) KL(c : H̃)/(w_c J(c̃, H̃))

Since J(c̃, H̃) ≥ J(c, H̃) and KL(c : H̃) ≤ J(c, H̃), we get α_{c′} ≤ 1/w_c.

When w_c = 1, the bound is tight.

In practice...

c is available in closed form → compute w_c, KL(c : H̃), and J(c, H̃), and bound the approximation factor α_{c′} as:

α_{c′} ≤ 1 + (1/w_c − 1) KL(c : H̃)/J(c, H̃) ≤ 1/w_c


Fine approximation

From [38, 36], the minimization for the Jeffreys frequency centroid is equivalent to:

c̃ = argmin_{x∈∆_d} KL(a : x) + KL(x : g)

Lagrangian function enforcing ∑_i c^i = 1:

log(c^i/g^i) + 1 − a^i/c^i + λ = 0

c^i = a^i / W(a^i e^{1+λ} / g^i)

λ = −KL(c̃ : g) ≤ 0


Fine approximation: Bisection search

c^i ≤ 1 ⇒ c^i = a^i / W(a^i e^{1+λ} / g^i) ≤ 1

⇒ λ ≥ log(e^{a^i} g^i) − 1 for all i, hence λ ∈ [max_i log(e^{a^i} g^i) − 1, 0]

s(λ) = ∑_i c^i(λ) = ∑_{i=1}^d a^i / W(a^i e^{1+λ} / g^i)

The function s is monotonically decreasing, with s(0) ≤ 1.

→ Bisection search for s(λ*) ≃ 1, to arbitrary precision.

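A Python sketch of this bisection (I bracket λ by doubling a lower bound until s(λ) ≥ 1, rather than using the analytic lower bound above; otherwise it follows the slide):

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_frequency_centroid(H, weights=None, tol=1e-12):
    """Bisection on the Lagrange multiplier: s(lam) = sum_i a^i / W(a^i e^{1+lam} / g^i)
    decreases monotonically in lam, with s(0) <= 1; find lam* with s(lam*) ~ 1."""
    H = np.asarray(H, dtype=float)
    w = np.full(len(H), 1.0 / len(H)) if weights is None else np.asarray(weights)
    a, g = w @ H, np.exp(w @ np.log(H))
    c = lambda lam: a / lambertw(a * np.exp(1.0 + lam) / g).real
    lo, hi = -1.0, 0.0
    while c(lo).sum() < 1.0:   # expand the bracket: s increases as lam decreases
        lo *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if c(mid).sum() > 1.0 else (lo, mid)
    return c(hi)
```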

Experiments: Caltech-256

Caltech-256 [16]: 30607 images labeled into 256 categories (256 Jeffreys centroids). Arbitrary floating-point precision: http://www.apfloat.org/

Veldhuis' approximation: c″ = (a + g)/2.

    | α_c (optimal positive) | α_{c′} (normalized approx.) | w_c ≤ 1 (normalizing coeff.) | α_{c″} (Veldhuis' approx.)
avg | 0.9648680345638155 | 1.0002205080964255 | 0.9338228644308926 | 1.065590178484613
min | 0.906414219584823  | 1.0000005079528809 | 0.8342819488534723 | 1.0027707382095195
max | 0.9956399220678585 | 1.0000031489541772 | 0.9931975105809021 | 1.3582296675397754


Experiments: Synthetic data-sets

Random binary histograms.

α = J(c′)/J(c̃) ≥ 1

Performance: ᾱ ∼ 1.0000009, α_max ∼ 1.00181506, α_min = 1.000000.

Open question: can we establish a better worst-case upper bound?


Summary and conclusion

Jeffreys positive centroid c in closed form.

Normalized Jeffreys positive centroid c′ within an approximation factor of 1/w_c.

Bisection search for an arbitrarily fine approximation of c̃.

→ Variational Jeffreys k-means clustering.

Other Kullback-Leibler symmetrizations:

Jensen-Shannon divergence [19]

Chernoff divergence [9]

Family of symmetrized centroids including the Jensen-Shannon and Jeffreys centroids [29]

Infinitely many symmetrizations using quasi-arithmetic means


Clustering: A never-ending story...

An old problem with many recent new results!

The 21st century is the century of the data science revolution.


Computational Information Geometry

[32] [31]

http://www.springer.com/engineering/signals/book/978-3-642-30231-2

http://www.sonycsl.co.jp/person/nielsen/infogeo/MIG/MIGBOOKWEB/

http://www.springer.com/engineering/signals/book/978-3-319-05316-5

http://www.sonycsl.co.jp/person/nielsen/infogeo/GTI/GeometricTheoryOfInformation.html

Textbooks: Visual computing

[27] [28]

http://www.sonycsl.co.jp/person/nielsen/visualcomputing/

http://www.lix.polytechnique.fr/~nielsen/JavaProgramming/


Bibliography I

Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh.
Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

D. A. Barry, P. J. Culligan-Hensley, and S. J. Barry.
Real values of the W-function. ACM Trans. Math. Softw., 21(2):161–171, June 1995.

Richard Bellman.
A note on cluster analysis and dynamic programming. Mathematical Biosciences, 18(3-4):311–312, 1973.

Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon.
A tight lower bound instance for k-means++ in constant dimension. CoRR, abs/1401.2912, 2014.

Brigitte Bigi.
Using Kullback-Leibler distance for text categorization. In Proceedings of the 25th European Conference on IR Research (ECIR), pages 305–319, Berlin, Heidelberg, 2003. Springer-Verlag.

Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock.
Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, September 2010.

J. Cavanaugh.
Unifying the derivations for the Akaike and corrected Akaike information criteria. Statistics & Probability Letters, 33(2):201–208, April 1997.


Bibliography II

Vijay Chandrasekhar, Gabriel Takacs, David M. Chen, Sam S. Tsai, Yuriy A. Reznik, Radek Grzeszczuk, and Bernd Girod.
Compressed histogram of gradients: A low-bitrate descriptor. International Journal of Computer Vision, 96(3):384–399, 2012.

Herman Chernoff.
A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493–507, 1952.

Franklin C. Crow.
Summed-area tables for texture mapping. In Proceedings of the 11th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '84, pages 207–212, New York, NY, USA, 1984. ACM.

G. Csurka, C. Bray, C. Dance, and L. Fan.
Visual categorization with bags of keypoints. Workshop on Statistical Learning in Computer Vision (ECCV), pages 1–22, 2004.

Sanjoy Dasgupta.
The hardness of k-means clustering. Technical Report CS2008-0916.

Walter D. Fisher.
On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284):789–798, 1958.

Edward W. Forgy.
Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 1965.


Bibliography III

Vincent Garcia and Frank Nielsen.
Simplification and hierarchical representations of mixtures of exponential families. Signal Processing, 90(12):3197–3212, 2010.

G. Griffin, A. Holub, and P. Perona.
Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology, 2007.

John A. Hartigan.
Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 1975.

Harold Jeffreys.
An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, 186(1007):453–461, March 1946.

Jianhua Lin.
Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37:145–151, 1991.

Huan Liu and Rudy Setiono.
Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the Seventh International Conference on Tools with Artificial Intelligence (TAI), pages 88–, Washington, DC, USA, 1995. IEEE Computer Society.

Meizhu Liu, Baba C. Vemuri, Shun-ichi Amari, and Frank Nielsen.
Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2407–2419, 2012.


Bibliography IV

Stuart P. Lloyd.
Least squares quantization in PCM. Technical report, Bell Laboratories, 1957.

James B. MacQueen.
Some methods of classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, Berkeley, CA, USA, 1967.

Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan.
The planar k-means problem is NP-hard. Theoretical Computer Science, 442:13–21, 2012.

Nimrod Megiddo and Kenneth J. Supowit.
On the complexity of some common geometric location problems. SIAM Journal on Computing, 13(1):182–196, 1984.

Max Mignotte.
Segmentation by fusion of histogram-based k-means clusters in different color spaces. IEEE Transactions on Image Processing (TIP), 17(5):780–787, 2008.

Frank Nielsen.
Visual Computing: Geometry, Graphics, and Vision. Charles River Media / Thomson Delmar Learning, 2005.

Frank Nielsen.
A Concise and Practical Introduction to Programming Algorithms in Java. Undergraduate Topics in Computer Science (UTiCS). Springer Verlag, 2009. http://www.springer.com/computer/programming/book/978-1-84882-338-9.


Bibliography V

Frank Nielsen.
A family of statistical symmetric divergences based on Jensen's inequality. CoRR, abs/1009.4004, 2010.

Frank Nielsen.
k-MLE: A fast algorithm for learning statistical mixture models. CoRR, abs/1203.5181, 2012.

Frank Nielsen.
Geometric Theory of Information. Springer, 2014.

Frank Nielsen and Rajendra Bhatia, editors.
Matrix Information Geometry (Revised Invited Papers). Springer, 2012.

Frank Nielsen and Sylvain Boltz.
The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8):5455–5466, August 2011.

Frank Nielsen and Richard Nock.
Clustering multivariate normal distributions. In Emerging Trends in Visual Computing, pages 164–174. Springer, 2009.

Frank Nielsen and Richard Nock.
Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6):2882–2904, June 2009.


Bibliography VI

Richard Nock, Panu Luosto, and Jyrki Kivinen.
Mixed Bregman clustering with approximation guarantees. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, pages 154–169, Berlin, Heidelberg, 2008. Springer-Verlag.

Raymond N. J. Veldhuis.
The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Processing Letters, 9(3):96–99, March 2002.

Haizhou Wang and Mingzhou Song.
Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming. The R Journal, 3(2), 2011.

Juanying Xie, Shuai Jiang, Weixin Xie, and Xinbo Gao.
An efficient global k-means clustering algorithm. Journal of Computers, 6(2), 2011.
