Unsupervised Learning Jointly With Image Clustering
Jianwei Yang, Devi Parikh, Dhruv Batra
Virginia Tech
https://filebox.ece.vt.edu/~jw2yang/
Huge amounts of images!
Learning without annotation effort
What do we need to learn?
An open problem
A hot problem
Various methodologies
Learning distribution (structure): Clustering
Jain, A. K., Murty, M. N., and Flynn, P. J. "Data clustering: a review." ACM Computing Surveys 31.3 (1999): 264-323.
K-means (Image Credit: Jesse Johnson)
Hierarchical Clustering
Spectral Clustering, Manor et al., NIPS'04
Graph Cut, Shi et al., TPAMI'00
DBSCAN, Ester et al., KDD'96 (Image Credit: Jesse Johnson)
EM Algorithm, Dempster et al., JRSS'77
NMF, Xu et al., SIGIR'03 (Image Credit: Conrad Lee)
Learning distribution (structure): Sub-space Analysis
PCA (Image Credit: Jesse Johnson)
ICA (Image Credit: Shylaja et al.)
t-SNE, Maaten et al., JMLR'08
Subspace Clustering, Vidal et al.
Sparse Coding, Olshausen et al., Vision Research'97
Learning representation (feature)
Bengio, Y., Courville, A., and Vincent, P. "Representation learning: A review and new perspectives." IEEE TPAMI 35.8 (2013): 1798-1828.
Autoencoder, Hinton et al., Science'06 (Image Credit: Jesse Johnson)
DBN, Hinton et al., Science'06
DBM, Salakhutdinov et al., AISTATS'09
Learning representation (feature)
VAE, Kingma et al, arXiv’13(Image Credit: Fast Forward Labs)
GAN, Goodfellow et al, NIPS’14DCGAN, Radford et al, arXiv’15
(Image Credit: Mike Swarbrick Jones)
19
Most Recent CV Works
Spatial context, Doersch et al., ICCV'15
Temporal context, Wang et al., ICCV'15
Solving Jigsaw Puzzles, Noroozi et al., ECCV'16
Context Encoder, Pathak et al., CVPR'16
Ego-motion, Jayaraman et al., ICCV'15
Most Recent CV Works
Visual concept clustering, Huang et al., CVPR'16
Graph constraint, Li et al., ECCV'16
TAGnet, Wang et al., SDM'16
Deep Embedding, Xie et al., ICML'16
Our Work
Joint Unsupervised Learning (JULE) of Deep Representations and Image Clusters
Outline
• Intuition
• Approach
• Experiments
• Extensions
Intuition
Meaningful clusters can provide supervisory signals to learn image representations.
Good representations help to get meaningful clusters.
Cluster images first, and then learn representations?
Learn representations first, and then cluster images?
Our approach: cluster images and learn representations progressively.
Intuition
Good representations lead to good clusters, and good clusters in turn lead to better representations.
Poor representations lead to poor clusters, and poor clusters to poor representations.
Approach
• Framework
• Objective
• Algorithm & Implementation
Approach: Framework
Convolutional Neural Network (representation learning): $\arg\min_{\theta} \mathcal{L}(\theta \mid I, y)$
Agglomerative Clustering (image clustering): $\arg\min_{y} \mathcal{L}(y \mid I, \theta)$
Joint objective: $\arg\min_{y, \theta} \mathcal{L}(y, \theta \mid I)$
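To make the alternating objective concrete, here is a minimal runnable sketch in Python. The linear feature map, the centroid-based merge, and the gradient step are illustrative stand-ins for the CNN and the agglomerative clustering, not the authors' implementation; `cluster_step` and `representation_step` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 16))            # 40 toy "images", 16 raw features each
theta = 0.1 * rng.normal(size=(16, 8))   # stand-in for the CNN parameters
y = np.arange(len(X))                    # start with one cluster per image

def cluster_step(F, y):
    """One step of argmin_y L(y | I, theta): merge the two closest clusters."""
    ids = np.unique(y)
    cents = np.stack([F[y == i].mean(axis=0) for i in ids])
    d = np.linalg.norm(cents[:, None] - cents[None, :], axis=-1)
    d[np.diag_indices_from(d)] = np.inf
    a, b = np.unravel_index(np.argmin(d), d.shape)
    y[y == ids[b]] = ids[a]              # relabel cluster b as cluster a
    return y

def representation_step(X, y, theta, lr=1e-3):
    """One step of argmin_theta L(theta | I, y): pull features to centroids."""
    F = X @ theta
    cents = {i: F[y == i].mean(axis=0) for i in np.unique(y)}
    target = np.stack([cents[i] for i in y])
    grad = X.T @ (F - target)            # gradient of 0.5 * ||F - target||^2
    return theta - lr * grad

while len(np.unique(y)) > 5:             # alternate until 5 clusters remain
    y = cluster_step(X @ theta, y)
    theta = representation_step(X, y, theta)
print(len(np.unique(y)), "clusters")
```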
Approach: Recurrent Framework
Backpropagating at each time-step is time-consuming and prone to over-fitting!
How about updating once for multiple time-steps?
Approach: Recurrent Framework
Partial unrolling: divide all T time-steps into P periods.
In each period, we merge clusters multiple times and update the CNN parameters once, at the end of the period.
P is determined by a hyper-parameter that will be introduced later.
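A sketch of the period structure, assuming the T merges are split evenly across periods (as the slide notes, P actually follows from a hyper-parameter); `merge_once` and `train_cnn` are hypothetical placeholders for the real forward and backward computations.

```python
import numpy as np

def merge_once(y):
    """Placeholder for one agglomerative merge: fold the highest cluster id
    into the lowest one (a real system would pick the best pair)."""
    ids = np.unique(y)
    y[y == ids[-1]] = ids[0]
    return y

def train_cnn(theta):
    """Placeholder for the representation-learning update at a period's end."""
    return theta

n_start, n_target, P = 100, 10, 5
T = n_start - n_target                # total number of merge time-steps
per_period = -(-T // P)               # ceil(T / P): merges per period

y = np.arange(n_start)
theta = None
done = 0
for p in range(P):                    # P periods
    for _ in range(min(per_period, T - done)):
        y = merge_once(y)             # forward: several merges per period
        done += 1
    theta = train_cnn(theta)          # backward: one CNN update per period
print(len(np.unique(y)), "clusters after", P, "periods")
```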
Approach: Objective Function
Overall loss: $\arg\min_{y, \theta} \mathcal{L}(y, \theta \mid I)$
Alternated as $\arg\min_{y} \mathcal{L}(y \mid I, \theta)$ (clustering) and $\arg\min_{\theta} \mathcal{L}(\theta \mid I, y)$ (representation learning)
Approach: Objective Function
Loss at time-step t:
Conventional agglomerative clustering strategy: merge the two clusters with the highest affinity.
Proposed agglomerative clustering strategy, built from:
• an affinity measure between clusters;
• the i-th cluster;
• the K_c nearest-neighbor clusters of the i-th cluster;
• the affinity between the i-th cluster and its nearest neighbor;
• the differences between the two cluster affinities;
• merging these two clusters.
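The exact loss is given in the paper; the sketch below only reconstructs, from the annotations above, a plausible merge score that rewards both a high affinity to the nearest-neighbor cluster and a large gap between that affinity and the affinities to the other K_c neighbors. `merge_score`, `lam`, and the toy affinity matrix are assumptions.

```python
import numpy as np

def merge_score(A, i, K_c=3, lam=1.0):
    """A: symmetric cluster-affinity matrix (higher = more similar).
    Returns a score for merging cluster i with its nearest neighbor,
    plus that neighbor's index."""
    aff = A[i].copy()
    aff[i] = -np.inf                          # exclude self-affinity
    nbrs = np.argsort(aff)[::-1][:K_c]        # K_c nearest-neighbor clusters
    nn = nbrs[0]                              # the single nearest neighbor
    # Gap between the NN affinity and the affinities to the other neighbors.
    margin = np.mean(aff[nn] - aff[nbrs[1:]]) if K_c > 1 else 0.0
    return aff[nn] + lam * margin, nn

# Toy usage: pick the cluster whose merge scores highest.
rng = np.random.default_rng(1)
M = rng.random((6, 6)); A = (M + M.T) / 2     # symmetric toy affinities
scores = [merge_score(A, i) for i in range(len(A))]
best_i = int(np.argmax([s for s, _ in scores]))
print("merge clusters", best_i, "and", scores[best_i][1])
```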
Approach: Objective Function
Loss in the forward pass in period p (merge clusters): CNN parameters are fixed.
Loss in the backward pass in period p (update the CNN): cluster labels are fixed.
Approach: Objective Function
Forward pass: a simple greedy algorithm.
At each time-step, merge the two clusters that minimize the loss.
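A runnable sketch of the greedy forward pass. For brevity it scores a merge by plain maximum affinity and averages affinities after a merge; in the actual method, the proposed criterion from the previous slides would supply the score.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.random((8, 8)); A = (M + M.T) / 2        # toy cluster affinities
members = [[i] for i in range(8)]                # images in each cluster

while len(members) > 3:                          # merge down to 3 clusters
    np.fill_diagonal(A, -np.inf)                 # never merge a cluster with itself
    a, b = np.unravel_index(np.argmax(A), A.shape)   # greedy choice of pair
    a, b = min(a, b), max(a, b)
    members[a] += members.pop(b)                 # merge cluster b into a
    # Recompute affinities to the merged cluster (here: a simple average).
    A[a, :] = (A[a, :] + A[b, :]) / 2
    A[:, a] = A[a, :]
    A = np.delete(np.delete(A, b, axis=0), b, axis=1)

print([sorted(m) for m in members])
```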
Approach: Objective Function
Backward pass: consider all previous periods.
The cluster-based loss is not suitable for batch optimization!
Approximation: convert it into a sample-based loss, decomposing cluster affinities into intra-cluster and inter-cluster sample affinities.
The result is a weighted triplet loss.
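A minimal sketch of a weighted triplet loss of this kind: pull a same-cluster sample (positive) closer to the anchor than a different-cluster sample (negative) by a margin, with a per-triplet weight. The margin value, the uniform weights, and the triplet sampling here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def weighted_triplet_loss(anchor, positive, negative, weights, margin=0.2):
    d_pos = np.sum((anchor - positive) ** 2, axis=1)   # intra-cluster distance
    d_neg = np.sum((anchor - negative) ** 2, axis=1)   # inter-cluster distance
    per_triplet = np.maximum(0.0, d_pos - d_neg + margin)
    return float(np.mean(weights * per_triplet))       # weighted average

rng = np.random.default_rng(3)
a = rng.normal(size=(5, 8))               # anchors
p = a + 0.1 * rng.normal(size=(5, 8))     # positives near their anchors
n = rng.normal(size=(5, 8))               # negatives drawn independently
w = np.ones(5)                            # uniform weights for this toy example
print(weighted_triplet_loss(a, p, n, w))
```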
Approach: Algorithm & Implementation
Start from raw image data; assume the target number of clusters is known.
Randomly initialize the CNN parameters, with 4 samples in each cluster on average.
Train the CNN for about 20 epochs.
We can go back and retrain the model, but it improves results only slightly.
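One simple way to realize the "4 samples per cluster on average" initialization, sketched under the assumption of greedy nearest-neighbor grouping (not necessarily the authors' procedure):

```python
import numpy as np

def init_clusters(X, samples_per_cluster=4):
    """Greedily group each unassigned sample with its nearest unassigned
    neighbors, so clusters hold ~samples_per_cluster samples on average."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a sample is not its own neighbor
    labels = -np.ones(n, dtype=int)
    cid = 0
    for i in range(n):
        if labels[i] >= 0:
            continue                          # already assigned
        free = np.where(labels < 0)[0]
        free = free[free != i]
        nearest = free[np.argsort(d[i, free])][: samples_per_cluster - 1]
        labels[np.concatenate(([i], nearest))] = cid
        cid += 1
    return labels

X = np.random.default_rng(4).normal(size=(40, 8))
y0 = init_clusters(X)
print(len(np.unique(y0)), "initial clusters for", len(X), "samples")
```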
Experiments
• Datasets
• Network Architecture
• Image Clustering
• Representation Learning
Experiments: Datasets
MNIST (70000, 10, 28x28) USPS (11000, 10, 16x16) COIL20 (1440, 20, 128x128) COIL100 (7200, 100, 128x128)
UMist (575, 20, 112x92) FRGC (2462, 20, 32x32) CMU-PIE (2856, 68, 32x32) Youtube Face (1000, 41, 55x55)77
Experiments: Settings
Two important parameters.
Set the number of layers so that the output feature map is about 10×10.
Experiments: Clustering: Performance
+6.43% NMI over the best performance of existing approaches, averaged over all datasets.
+12.76% AC over the best performance of existing approaches, averaged over all datasets.
Average +21.5% NMI: our clustering performance vs. that of existing clustering approaches using raw image data.
Average +25.7% NMI: clustering performance when our representation is fed to existing clustering algorithms.
Experiments: Clustering: Visualization
COIL-20 and COIL-100
USPS and MNIST-test
Experiments: Clustering: Ablation Study
Experiments: Clustering: Verification
Experiments: Clustering: Time Cost
Experiments: Representation Learning
Representation transfer: testing the generalization of our learnt (unsupervised) representation on LFW face verification.
Representation learning: evaluation on CIFAR-10 classification.
Extensions: Data Visualization
Conclusion
• A new method for unsupervised learning jointly with image clustering, cast as a recurrent optimization problem;
• In the recurrent framework, clustering is conducted in the forward pass and representation learning in the backward pass;
• A unified loss function governs both the forward and the backward pass;
• Outperforms the state of the art on a number of datasets;
• Also learns plausible representations for image recognition.
Thanks!
https://github.com/jwyang/joint-unsupervised-learning