Introduction Methods & questions Model-based clustering Illustrations Challenges
Clustering: evolution of methods to meet new challenges
C. Biernacki
Journée “Clustering”, Orange Labs, October 20th 2015
Take home message
cluster
clustering
define both!
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
A first systematic attempt
Carl von Linné (1707–1778), Swedish botanist, physician, and zoologist
Father of modern taxonomy based on the most visible similarities between species
Linnaeus’s Systema Naturae (1st ed. in 1735) lists about 10,000 species of organisms (6,000 plants, 4,236 animals)
Interdisciplinary endeavor
Medicine1: diseases may be classified by etiology (cause), pathogenesis (mechanism by which the disease is caused), or by symptom(s). Alternatively, diseases may be classified according to the organ system involved, though this is often complicated since many diseases affect more than one organ.
And so on. . .
1Nosologie méthodique, dans laquelle les maladies sont rangées par classes, suivant le système de Sydenham, & l’ordre des botanistes, par François Boissier de Sauvages de Lacroix. Paris, Hérissant le fils, 1771
Three main clustering structures
Data set of n individuals x = (x1, . . . , xn), each xi described by d variables
Partition into K clusters, denoted by z = (z1, . . . , zn), with zi ∈ {1, . . . , K}
Hierarchy: nested partitions
Block partition: simultaneous partitions of individuals and variables
Clustering is the cluster building process
According to JSTOR, data clustering first appeared in the title of a 1954 article dealing with anthropological data
Need to be automatic (algorithms) for complex data: mixed features, large datasets, high-dimensional data. . .
A 1st aim: explanatory task
A clustering for a marketing study
Data: d = 13 demographic attributes (nominal and ordinal variables) of n = 6,876 shopping mall customers in the San Francisco Bay Area: SEX (1. Male, 2. Female), MARITAL STATUS (1. Married, 2. Living together, not married, 3. Divorced or separated, 4. Widowed, 5. Single, never married), AGE (1. 14 thru 17, 2. 18 thru 24, 3. 25 thru 34, 4. 35 thru 44, 5. 45 thru 54, 6. 55 thru 64, 7. 65 and over), etc.
Partition to retrieve: less than $19,999 (“low income” group), between $20,000 and $39,999 (“average income” group), more than $40,000 (“high income” group)
[Figure: two MCA factor maps (1st MCA axis vs. 2nd MCA axis); the right panel is colored by the three income groups: low income, average income, high income]
A 2nd aim: preprocessing step
Logit model:
Not very flexible since the decision boundary is linear
The ML estimate is unbiased, but its asymptotic variance ∼ (x′Wx)−1 is influenced by correlations between covariates
A clustering may improve logistic regression prediction:
More flexible boundary: piecewise linear
Decreased correlation, hence decreased variance
Mixed features
Large data sets2
2S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
High-dimensional data3
3S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29
Genesis of “Big Data”
The Big Data phenomenon mainly originates in the increase of computer and digital resources at an ever lower cost
Storage cost per MB: $700 in 1981, $1 in 1994, $0.01 in 2013 → price divided by 70,000 in thirty years
Storage capacity of HDDs: ≈1.02 GB in 1982, ≈8 TB today → capacity multiplied by 8,000 over the same period
Computer processing speed: 1 gigaFLOPS4 in 1985, 33 petaFLOPS in 2013 → speed multiplied by 33 million
4FLOPS = FLoating-point Operations Per Second
Digital flow
Digital in 1986: 1% of the stored information, 0.02 EB5
Digital in 2007: 94% of the stored information, 280 EB (multiplied by 14,000)
5EB = exabyte
Societal phenomenon
All human activities are impacted by data accumulation
Trade and business: corporate reporting systems, banks, commercial transactions, reservation systems. . .
Governments and organizations: laws, regulations, standardizations, infrastructure. . .
Entertainment: music, video, games, social networks. . .
Sciences: astronomy, physics and energy, genome. . .
Health: medical record databases in the social security system. . .
Environment: climate, sustainable development, pollution, power. . .
Humanities and Social Sciences: digitization of knowledge, literature, history, art, architecture, archaeological data. . .
New data. . . but classical answers6
6Rexer Analytics’s Annual Data Miner Survey is the largest survey of data mining, data science, and analytics professionals in the industry (survey of 2011)
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
Clustering of clustering algorithms7
Jain et al. (2004) hierarchically clustered 35 different clustering algorithms into 5 groups based on their partitions of 12 different datasets.
It is not surprising to see that related algorithms are clustered together.
To visualize the similarity between the algorithms, the 35 algorithms are also embedded in a two-dimensional space obtained from the 35×35 similarity matrix.
7A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.
Popularity of K -means and hierarchical clustering
Even though K-means was first proposed over 50 years ago, it is still one of the most widely used clustering algorithms, for several reasons: ease of implementation, simplicity, efficiency, empirical success. . . and model-based interpretation (see later)
Within-cluster inertia criterion
Select the partition z minimizing the criterion
W_M(z) = ∑_{i=1}^{n} ∑_{k=1}^{K} z_{ik} ‖x_i − x̄_k‖²_M

‖·‖_M is the Euclidean distance with metric M in R^d

x̄_k is the mean of the kth cluster:

x̄_k = (1/n_k) ∑_{i=1}^{n} z_{ik} x_i

and n_k = ∑_{i=1}^{n} z_{ik} is the number of individuals in cluster k
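As a concrete reading of this criterion, here is a minimal Python sketch (the function name `within_inertia` and the NumPy formulation are illustrative, not from the slides):

```python
import numpy as np

def within_inertia(X, z, M=None):
    """Within-cluster inertia W_M(z) for data X of shape (n, d)
    and labels z in {0, ..., K-1}. M is the (d, d) metric matrix;
    the identity by default, giving the plain k-means inertia."""
    n, d = X.shape
    M = np.eye(d) if M is None else M
    total = 0.0
    for k in np.unique(z):
        diff = X[z == k] - X[z == k].mean(axis=0)  # x_i - mean of cluster k
        total += np.einsum("ij,jk,ik->", diff, M, diff)  # sum of ||.||_M^2
    return total
```

With the identity metric this is exactly the quantity K-means drives down at each iteration.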
Ward hierarchical clustering
[Figure: agglomerative hierarchical clustering (Ward’s method) on five points a, b, c, d, e: starting from singletons, successive merges at d({b},{c}) = 0.5, d({d},{e}) = 2, d({a},{b,c}) = 3, d({a,b,c},{d,e}) = 10; the resulting dendrogram has merge heights 0.5, 2, 3, 10]
Suboptimal optimization of W_M(·)
A partition is obtained by cutting the dendrogram
A dissimilarity matrix between pairs of individuals is enough
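Ward’s method is available off the shelf; a short sketch with SciPy (the two-blob data set is an invented example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two well-separated groups in R^2
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])

Z = linkage(X, method="ward")                    # agglomerative merges, Ward criterion
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram at K = 2
```

`scipy.cluster.hierarchy.dendrogram(Z)` would plot the tree itself.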
K -means algorithm
[Figure: the moving-centers (K-means) algorithm on five points a, b, c, d, e: two individuals chosen at random as initial centers, then alternating assignment to the nearest center and recomputation of the centers]
Alternating optimization between the partition and the cluster centers
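This alternating scheme can be sketched in a few lines (the `kmeans` function below is an illustrative toy implementation, not the slides’ code):

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Alternate between assigning points to their nearest center
    and recomputing each center as the mean of its cluster."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # K individuals at random
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        z = d2.argmin(axis=1)                          # assignment step
        centers = np.array([X[z == k].mean(axis=0)     # center step
                            for k in range(K)])
    return z, centers
```

Each step can only decrease the within-cluster inertia, so the loop converges (to a local optimum).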
The identity metric M: a classical but hazardous choice
[Figure: k-means with the identity metric on two spherical classes recovers the partition; on two elongated classes, k-means splits the data wrongly]
Alternative: estimate M(k) by minimizing WM(k)(z) over (z,M(k))
Effect of the metric M through a real example
Animals represented by 13 Boolean features related to appearance and activity
Large weight on the appearance features compared to the activity features: theanimals were clustered into mammals vs. birds
Large weight on the activity features: partitioning predators vs. non-predators
Both partitions are equally valid, and uncover meaningful structures in the data
The user has to carefully choose the data representation to obtain a desired clustering
Semi-supervised clustering8
The user has to provide any external information available on the partition
Pair-wise constraints:
A must-link constraint specifies that the point pair it connects belongs to the same cluster
A cannot-link constraint specifies that the point pair it connects does not belong to the same cluster
Attempts to incorporate constraints derived from domain ontologies and other external sources into clustering algorithms include the use of the WordNet ontology, gene ontology, Wikipedia, etc., to guide clustering solutions
8O. Chapelle et al. (2006), A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.
Online clustering
Dynamic data are quite recent: blogs, Web pages, retail chains, credit card transaction streams, network packets received by a router, stock market data, etc.
As the data get modified, the clustering must be updated accordingly: ability to detect emerging clusters, etc.
Often the whole data cannot be stored on disk
This imposes additional requirements on traditional clustering algorithms: rapidly process and summarize the massive amount of continuously arriving data
Data stream clustering is a significant challenge since single-pass algorithms are expected
Data representation challenge9
The data unit can be crucial for the data clustering task
9A.K. Jain (2008). Data Clustering: 50 Years Beyond K-Means.
HD clustering: blessing (1/2)
A two-component d-variate Gaussian mixture:

π1 = π2 = 1/2,   X_1|z_11 = 1 ∼ N_d(0, I),   X_1|z_12 = 1 ∼ N_d(1, I)

Each variable provides its own equal share of separation information

Theoretical error decreases when d grows: err_theo = Φ(−√d/2). . .
[Figure: scatter plot of the first two variables (x1, x2), and empirical vs. theoretical error rate as d grows from 1 to 10]
. . . and the empirical error rate also decreases with d!
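The theoretical error curve err_theo = Φ(−√d/2) is easy to reproduce (a small sketch; the names `Phi` and `err_theo` are illustrative):

```python
from math import erf, sqrt

def Phi(t):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(t / sqrt(2)))

def err_theo(d):
    """Bayes error for the equal-weight N_d(0, I) vs. N_d(1, I) mixture:
    the means are sqrt(d) apart, so the error is Phi(-sqrt(d)/2)."""
    return Phi(-sqrt(d) / 2)
```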
HD clustering: blessing (2/2)
[Figure: FDA projections on the first two FDA axes for d = 2, 20, 200, 400]
HD clustering: curse (1/2)
Many variables provide no separation information
Same parameter setting except:
X1|z12 = 1 ∼ Nd ((1 0 . . . 0)′, I)
Groups are not separated more when d grows: ‖µ2 − µ1‖I = 1. . .
[Figure: scatter plot of the first two variables (x1, x2), and empirical vs. theoretical error rate for d = 1, . . . , 10]
. . . thus the theoretical error is constant (= Φ(−1/2)) and the empirical error increases with d
HD clustering: curse (2/2)
Many variables provide redundant separation information
Same parameter setting except:
X_1^j = X_1^1 + N_1(0, 1)   (j = 2, . . . , d)
Groups are not separated more when d grows: ‖µ2 − µ1‖Σ = 1. . .
[Figure: scatter plot of the first two variables (x1, x2), and empirical vs. theoretical error rate for d = 1, . . . , 10]
. . . thus err_theo is constant (= Φ(−1/2)) and the empirical error increases (less) with d
Clustering: an ill-posed problem
Questions to be addressed:
What is the best metric M(k)?
How to choose the number K of clusters? W_M(z) decreases with K . . .
Are clusters of different sizes well estimated?
How to choose the data unit?
How to select features in a high-dimensional context?
How to deal with mixed data?
. . .
First, answer this: what, formally, is a cluster?
Model-based clustering solution
a cluster ⇐⇒ a distribution
It recasts all previous problems into model design/estimation/selection
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
Model-based clustering
x = (x1, . . . , xn)   −→ clustering −→   z = (z1, . . . , zn), K clusters

[Figure: scatter plot of the unlabeled data (X1, X2), and the same data colored by cluster]
Mixture model: well-posed problem
p(x|K; θ) = ∑_{k=1}^{K} π_k p(x; α_k) can be used for

x → θ̂ → p(z|x, K; θ̂) → ẑ

x → p(K|x) → K̂

. . .

with θ = (π_k, (α_k))
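The mixture density itself is straightforward to evaluate; a short sketch with SciPy (`mixture_density` is an illustrative helper, not from the talk):

```python
from scipy.stats import multivariate_normal

def mixture_density(x, pis, means, covs):
    """p(x; theta) = sum_k pi_k N_d(x; mu_k, Sigma_k) for a Gaussian mixture."""
    return sum(pi * multivariate_normal.pdf(x, mean=m, cov=S)
               for pi, m, S in zip(pis, means, covs))
```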
Hypothesis of mixture of parametric distributions
Cluster k is modelled by a parametric distribution:

X_i | Z_ik = 1  i.i.d.∼  p(·; α_k)

Cluster k has probability π_k, with ∑_{k=1}^{K} π_k = 1:

Z_i  i.i.d.∼  Mult_K(1, π_1, . . . , π_K)

The whole mixture parameter is θ = (π_1, . . . , π_K, α_1, . . . , α_K):

p(x_i; θ) = ∑_{k=1}^{K} π_k p(x_i; α_k)
Gaussian mixture
p(·; α_k) = N_d(μ_k, Σ_k), where α_k = (μ_k, Σ_k) (mean, covariance matrix)

[Figure: a two-component univariate Gaussian mixture (component 1, component 2, mixture density), and a bivariate mixture density surface]
Parameters = summary + help to understand
Cluster k is described by meaningful parameters: cluster size (π_k), position (μ_k) and dispersion (Σ_k)
The clustering process in mixtures
1 Estimation of θ by θ̂
2 Estimation of the conditional probability that x_i belongs to cluster k:

t_ik(θ̂) = p(Z_ik = 1 | X_i = x_i; θ̂) = π̂_k p(x_i; α̂_k) / p(x_i; θ̂)

3 Estimation of z_i by the maximum a posteriori (MAP) rule:

ẑ_ik = 1{k = argmax_{h=1,...,K} t_ih(θ̂)}
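Steps 2 and 3 can be sketched for a one-dimensional Gaussian mixture (the helper `posterior_and_map` is illustrative):

```python
import numpy as np
from scipy.stats import norm

def posterior_and_map(x, pis, means, sds):
    """Conditional probabilities t_ik and MAP labels for a 1-d Gaussian mixture."""
    # numerator pi_k p(x_i; alpha_k), shape (n, K)
    num = np.array([[pi * norm.pdf(xi, m, s)
                     for pi, m, s in zip(pis, means, sds)] for xi in x])
    t = num / num.sum(axis=1, keepdims=True)  # t_ik = numerator / p(x_i; theta)
    return t, t.argmax(axis=1)                # MAP: hard assignment
```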
[Figure: data scatter, estimated mixture density surface, and the resulting partition into three clusters]
Estimation of θ by complete-likelihood
Maximize the complete likelihood over (θ, z):

ℓ_c(θ; x, z) = ∑_{i=1}^{n} ∑_{k=1}^{K} z_ik ln{π_k p(x_i; α_k)}
Equivalent to traditional methods:

Metric           M = I    M free   M_k free
Gaussian model   [πλI]    [πλC]    [πλ_k C_k]
Bias of θ̂: heavy if clusters are poorly separated
Associated optimization algorithm: CEM (see later)
CEM with [πλI ] is strictly equivalent to K -means
CEM is simple and fast (convergence in a few iterations)
Estimation of θ by observed likelihood
Maximize the observed likelihood over θ:

ℓ(θ; x) = ∑_{i=1}^{n} ln p(x_i; θ)
Convergence of θ
General algorithm for missing data: EM
EM is simple but slower than CEM
Interpretation: it is a kind of fuzzy clustering
Principle of EM and CEM
Initialization: θ⁰
Iteration q:
E step: estimate the probabilities t^q = {t_ik(θ^q)}
C step (CEM): classify by setting t^q = MAP({t_ik(θ^q)})
M step: maximize θ^{q+1} = argmax_θ ℓ_c(θ; x, t^q)
Stopping rule: iteration number or criterion stability
Properties
⊕: simplicity, monotony, low memory requirement
⊖: local maxima (depends on θ0), linear convergence (EM)
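A toy EM for a 1-d Gaussian mixture with unit component variances illustrates the E and M steps above (an illustrative sketch with a deterministic quantile-based initialization, not the talk's algorithm):

```python
import numpy as np
from scipy.stats import norm

def em_1d(x, K=2, n_iter=100):
    """EM for a 1-d Gaussian mixture with unit component variances (toy sketch)."""
    pis = np.full(K, 1.0 / K)
    mus = np.quantile(x, (np.arange(K) + 0.5) / K)  # deterministic spread-out init
    for _ in range(n_iter):
        # E step: conditional probabilities t_ik, shape (n, K)
        num = pis * norm.pdf(x[:, None], mus[None, :], 1.0)
        t = num / num.sum(axis=1, keepdims=True)
        # M step: update proportions and means
        nk = t.sum(axis=0)
        pis, mus = nk / len(x), (t * x[:, None]).sum(axis=0) / nk
    return pis, mus
```

Inserting a hard classification of t between the E and M steps (the C step) turns this into CEM.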
[Figure: mixture density, likelihood surface over (μ1, μ2), and six EM trials from different starting points converging to different local maxima]
Importance of model selection
Model = number of clusters + parametric structure of clusters
Too simple model: bias
[Figure: two classes (x1, x2) simulated from the true model [πλ_k I], fitted with the too simple model [πλI]]
Too complex model: variance
[Figure: true borderline, borderline with [πλI], and borderline with [πλ_k C_k]]
Model selection criteria
The most widespread principle
Criterion (to be maximized) = maximum log-likelihood (model-data adequacy) − penalty (“cost” of the model)
criterion   penalty                                    interpretation                          user purpose

General criteria in statistics:
AIC         ν                                          model complexity                        prediction
BIC         0.5 ν ln(n)                                model complexity                        identification

Specific criterion for the clustering aim:
ICL         0.5 ν ln(n) − ∑_{i,k} ẑ_ik ln t_ik(θ̂)     model complexity + partition entropy    well-separated clusters
N.B.: in a prediction context, it is also possible to use the predictive error rate
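Model selection by BIC can be illustrated with scikit-learn’s GaussianMixture (note that its `bic()` uses the lower-is-better convention, so we minimize; the two-blob data are an invented example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two well-separated Gaussian groups in R^2
X = np.vstack([rng.normal(-5, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# bic() returns -2 log L + nu ln(n): lower is better, so pick the minimizer
bics = {K: GaussianMixture(n_components=K, random_state=0).fit(X).bic(X)
        for K in range(1, 5)}
best_K = min(bics, key=bics.get)
```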
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
ICL/BIC for acoustic emission control
Data: n = 2 061 event locations in a rectangle of R2 representing the vessel
Model: Diagonal Gaussian mixture + uniform (noise)
Groups: sound locations = vessel defects
[Figure: ICL criterion and BIC criterion (×10^4) as functions of the number of components, and the resulting clustering of the event locations on the vessel]
44/54
ICL for prostate cancer data (1/2)
Individuals: n = 475 patients with prostatic cancer, grouped on clinical criteria into two Stages, 3 and 4, of the disease

Variables: d = 12 pre-trial variates were measured on each patient, composed of eight continuous variables (age, weight, systolic blood pressure, diastolic blood pressure, serum haemoglobin, size of primary tumour, index of tumour stage and histologic grade, serum prostatic acid phosphatase) and four categorical variables with various numbers of levels (performance rating, cardiovascular disease history, electrocardiogram code, bone metastases)
Model: conditional independence, $p(\mathbf{x}_1; \boldsymbol{\alpha}_k) = p(\mathbf{x}_1; \boldsymbol{\alpha}_k^{\text{cont}}) \cdot p(\mathbf{x}_1; \boldsymbol{\alpha}_k^{\text{cat}})$
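A minimal sketch of this conditional-independence density: within a cluster, the continuous block follows a (here diagonal) Gaussian and the categorical block a product of multinomials. All parameter values below are made up for illustration, not estimated from the prostate data:

```python
# Hedged sketch of the mixed-data component density on the slide:
# p(x; alpha_k) = p(x_cont; alpha_k^cont) * p(x_cat; alpha_k^cat).
import numpy as np

def gauss_pdf(x, mu, sigma2):
    """Diagonal Gaussian density for the continuous block."""
    x, mu, sigma2 = map(np.asarray, (x, mu, sigma2))
    return float(np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma2))
                         / np.sqrt(2 * np.pi * sigma2)))

def cat_pmf(x, probs):
    """Product of independent multinomial pmfs for the categorical block."""
    return float(np.prod([p[v] for v, p in zip(x, probs)]))

def component_density(x_cont, x_cat, alpha_k):
    # Conditional independence: continuous and categorical parts multiply
    return gauss_pdf(x_cont, *alpha_k["cont"]) * cat_pmf(x_cat, alpha_k["cat"])

# One hypothetical cluster: 2 continuous variables, 2 categorical variables
alpha_1 = {"cont": ([60.0, 1.2], [25.0, 0.09]),
           "cat": [np.array([0.7, 0.3]), np.array([0.2, 0.5, 0.3])]}
dens = component_density([62.0, 1.1], [0, 2], alpha_1)
print(dens)
```

The full mixture density would then sum these component densities weighted by the mixing proportions.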
45/54
ICL for prostate cancer data (2/2)
Error (%): Continuous 9.46, Categorical 47.16, Mixed 8.63

| True \ estimated | Continuous 1 | Continuous 2 | Categorical 1 | Categorical 2 | Mixed 1 | Mixed 2 |
|---|---|---|---|---|---|---|
| Stage 3 | 247 | 26 | 142 | 131 | 252 | 21 |
| Stage 4 | 19 | 183 | 120 | 82 | 20 | 182 |
[Figure: individuals plotted on the first two PCA axes (continuous data) and on the first two MCA axes (categorical data)]
46/54
BIC for partitioning communes of Wallonia
Data: n = 262 communes of Wallonia in terms of d = 2 fractals at a local level
Model:

Data unit: one-to-one transformation $g(\mathbf{x}) = (g(x_i^j),\ i = 1, \ldots, n,\ j = 1, \ldots, d)$ of the initial data set. Typically, standard transformations are $g(x_i^j) = x_i^j$ (identity), $g(x_i^j) = \exp(x_i^j)$ or $g(x_i^j) = \ln(x_i^j)$

Mixture: K = 6 (fixed) but all 28 Gaussian models

Result: 6 meaningful groups with $g(x_i^j) = \exp(x_i^j)$ (natural for fractals. . . )

Model criterion:

$$\mathrm{BIC}_g = \ell(\theta_g; g(\mathbf{x})) - \frac{\nu}{2}\ln n + \ln \underbrace{|H_g|}_{\text{Jacobian}}$$
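A hedged sketch of this transformation-aware BIC, assuming the log transform (for which the log-Jacobian is $\ln|H_g| = -\sum_{i,j} \ln x_i^j$) and synthetic log-normal data rather than the Wallonia communes:

```python
# Hedged sketch: compare BIC_g for g = identity and g = ln on positive,
# skewed synthetic data. The log-Jacobian term puts both criteria on the
# scale of the original data, making them comparable.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two log-normal groups (positive, skewed)
X = np.exp(np.vstack([rng.normal(0.0, 0.5, (150, 2)),
                      rng.normal(2.0, 0.5, (150, 2))]))
n = X.shape[0]

def bic_g(Xg, log_jacobian):
    gm = GaussianMixture(n_components=2, random_state=0).fit(Xg)
    # nu = (K-1) weights + K*d means + K*d(d+1)/2 covariances, K = 2, d = 2
    nu = 11
    loglik = gm.score(Xg) * n          # score() is the mean log-likelihood
    return loglik - 0.5 * nu * np.log(n) + log_jacobian

bic_identity = bic_g(X, 0.0)                   # g = identity, ln|H_g| = 0
bic_log = bic_g(np.log(X), -np.log(X).sum())   # g = ln, ln|H_g| = -sum ln(x)
print(bic_identity, bic_log)
```

On log-normal data the log-transformed model typically obtains the larger corrected BIC, mirroring the slide's finding that the exponential transform is "natural" for the fractal variables.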
47/54
BIC for Gaussian “variable selection”10
Definition
$$p(\mathbf{x}_1;\boldsymbol{\theta}) = \underbrace{\left\{\sum_{k=1}^{K} \pi_k\, p(\mathbf{x}_1^{S};\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\right\}}_{\text{clustering variables}} \times \underbrace{\left\{p(\mathbf{x}_1^{U};\mathbf{a} + \mathbf{x}_1^{R}\mathbf{b},\mathbf{C})\right\}}_{\text{redundant variables}} \times \underbrace{\left\{p(\mathbf{x}_1^{W};\mathbf{u},\mathbf{V})\right\}}_{\text{independent variables}}$$
where

all parts are Gaussian

S: set of variables useful for clustering

U: set of redundant clustering variables, expressed with R ⊆ S

W: set of variables independent of clustering
Trick
Variable selection is recast as a particular model selection problem, solved by BIC
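The recasting trick can be sketched as follows. This is not the Raftery–Dean stepwise algorithm itself; it merely compares, by BIC, two hypothetical role assignments for one variable, exploiting the fact that the total model likelihood factorizes over the variable blocks:

```python
# Hedged sketch: score two candidate variable roles by BIC.
# Role A: both variables in S (clustered jointly).
# Role B: variable 1 in S, variable 2 in W (independent one-component Gaussian).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n = 300
cluster = np.vstack([rng.normal(0, 1, (150, 1)), rng.normal(5, 1, (150, 1))])
noise = rng.normal(0, 1, (n, 1))  # unrelated to the cluster structure
X = np.hstack([cluster, noise])

def bic_max(gm, Z):
    # sklearn's .bic() is minimized; negate to match the slide's maximized form
    return -gm.bic(Z)

# Role A: a 2-component mixture on both variables
gm_a = GaussianMixture(n_components=2, random_state=0).fit(X)
bic_a = bic_max(gm_a, X)

# Role B: 2-component mixture on variable 1, single Gaussian on variable 2;
# independence means the total BIC is the sum of the two parts
gm_s = GaussianMixture(n_components=2, random_state=0).fit(X[:, :1])
gm_w = GaussianMixture(n_components=1, random_state=0).fit(X[:, 1:])
bic_b = bic_max(gm_s, X[:, :1]) + bic_max(gm_w, X[:, 1:])

print(bic_a, bic_b)
```

Since variable 2 really is noise here, role B saves parameters at essentially no likelihood cost, so its BIC is typically the larger one; this is exactly the comparison the criterion automates.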
10 Raftery and Dean (2006), Maugis et al. (2009a), Maugis et al. (2009b)
48/54
Curve “cookies” example (1/2)
The Kneading dataset comes from Danone Vitapole Paris Research Center and concerns the quality of cookies and the relationship with the flour kneading process11. There are 115 different flours for which the dough resistance is measured during the kneading process for 480 seconds. One obtains 115 kneading curves observed at 241 equispaced instants of time in the interval [0; 480]. The 115 flours produce cookies of different quality: 50 of them have produced cookies of good quality, 25 produced medium quality and 40 low quality.
11 Leveder et al. (2004)
49/54
Curve “cookies” example (2/2)
Using a functional-basis, model-based design for functional data12
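A minimal sketch of the basis-expansion idea on synthetic curves: project each curve onto a small basis and cluster the coefficient vectors. This uses a plain polynomial basis and a Gaussian mixture on the coefficients, not the actual Jacques–Preda model, and the curves are simulated, not the Danone data:

```python
# Hedged sketch: cluster curves via their coefficients in a small basis.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
t = np.linspace(0, 480, 241)                      # 241 instants, as in the data
basis = np.vander(t / 480, N=4, increasing=True)  # 1, u, u^2, u^3 with u = t/480

# Synthetic "dough resistance" curves: two groups with different linear trends
curves = np.vstack([c + rng.normal(0, 5, t.size)
                    for c in ([500 - 0.5 * t] * 20 + [400 - 0.2 * t] * 20)])

# Least-squares coefficients of each curve in the basis: one 4-vector per curve
coefs = np.linalg.lstsq(basis, curves.T, rcond=None)[0].T

# Cluster the (low-dimensional) coefficient vectors instead of the raw curves
labels = GaussianMixture(n_components=2, covariance_type="diag",
                         random_state=0).fit_predict(coefs)
print(labels)
```

The dimension drops from 241 observed instants to 4 coefficients per curve, which is what makes model-based clustering tractable for functional data.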
12 Jacques and Preda (2013)
50/54
Co-clustering (1/2)
Contingency table: document clustering
Mixture of Medline (1033 medical abstracts) and Cranfield (1398 aeronautics abstracts)

Rows: 2431 documents

Columns: words present in the corpus (stop words excluded), i.e. 9275 unique words

Data matrix: document × word cross-counts

Poisson model13

Results with 2×2 blocks:

| True \ estimated | Medline | Cranfield |
|---|---|---|
| Medline | 1033 | 0 |
| Cranfield | 0 | 1398 |
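As a hedged stand-in for the Poisson latent block model (which has no scikit-learn implementation), scikit-learn's `SpectralCoclustering` recovers the same kind of 2×2 block structure on a synthetic document × word count matrix:

```python
# Hedged sketch: co-clustering a synthetic count matrix whose blocks mimic
# two document groups with mostly disjoint vocabularies.
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(4)
# Block-wise Poisson rates: each document group uses its own vocabulary
rates = np.array([[5.0, 0.1],
                  [0.1, 5.0]])
rows = np.repeat([0, 1], [30, 40])   # 30 "Medline-like", 40 "Cranfield-like" docs
cols = np.repeat([0, 1], [25, 25])   # 25 words per vocabulary
counts = rng.poisson(rates[np.ix_(rows, cols)])

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(counts)
row_labels, col_labels = model.row_labels_, model.column_labels_
print(row_labels, col_labels)
```

As on the slide, co-clustering assigns rows (documents) and columns (words) to blocks simultaneously, rather than clustering only the rows.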
13 G. Govaert and M. Nadif (2013). Co-clustering. Wiley.
51/54
Co-clustering (2/2)
52/54
Outline
1 Introduction
2 Methods & questions
3 Model-based clustering
4 Illustrations
5 Challenges
53/54
Three kinds of challenges, linked to the user task
Model design: depends on the data, and should incorporate user information

Model estimation: find efficient algorithms matching the user's requirements

Model selection (validation): depends again on the user's purpose
Model-based clustering
It is just a comfortable and rigorous framework

The user keeps their freedom in this world, thanks to high flexibility at each level
54/54