TRANSCRIPT
Slide 1: SEMI-AUTOMATIC GROUND TRUTH GENERATION USING UNSUPERVISED CLUSTERING AND LIMITED MANUAL LABELING: APPLICATION TO HANDWRITTEN CHARACTER RECOGNITION
Szilárd Vajda, Yves Rangoni, Hubert Cecotti. Pattern Recognition Letters, 2015
Slide 2: Ground-truth generation
[Diagram: unlabeled data → labeled data]
• Usually, real-world data is not labeled.
• Large data collections need accurate labels.
Slide 3: Labeling strategy
[Diagram: an image dataset is mapped into five feature representations (pixels, profiles, LBP, Radon, encoder); each representation is clustered without supervision, and a human expert labels only the closest real data point to each centroid.]
Slide 4: Labeling strategy (continued)
[Diagram: the same five feature representations (pixels, profiles, LBP, Radon, encoder), now with clusters marked Label a, Label b, Label c; each point inherits the label of its cluster.]
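The cluster-then-label step on these slides can be sketched in plain NumPy: run Lloyd's k-means, have a human "oracle" label only the real sample closest to each centroid, and propagate that label to the whole cluster. The function and variable names here are illustrative, not from the paper, and the naive initialization (first k samples) stands in for whatever seeding the authors used.

```python
import numpy as np

def semi_automatic_labels(X, k, oracle, n_iter=20):
    """Cluster X with k-means (Lloyd's algorithm), ask a human `oracle`
    to label only the real sample closest to each centroid, then
    propagate that label to every member of the cluster."""
    centroids = X[:k].astype(float).copy()   # naive init: first k samples
    for _ in range(n_iter):
        # assign each point to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster empties
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    labels = np.empty(len(X), dtype=int)
    for j in range(k):
        members = np.where(assign == j)[0]
        if len(members) == 0:
            continue
        # the human labels only the real sample nearest to the centroid
        rep = members[np.linalg.norm(X[members] - centroids[j], axis=1).argmin()]
        labels[members] = oracle(rep)        # one manual label per cluster
    return labels

# toy data: two well-separated blobs, so two manual labels suffice
a = np.linspace(0.0, 0.9, 10)[:, None].repeat(2, axis=1)
X = np.vstack([a, a + 5.0])
true = np.array([0] * 10 + [1] * 10)
got = semi_automatic_labels(X, k=2, oracle=lambda i: true[i])
```

With 20 clusters and 2 oracle calls instead of 20 hand labels per point, this is the "limited manual labeling" the title refers to.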
Slide 5: Labeling strategy (continued)
[Diagram: an input image is encoded with the five feature representations (pixels, profiles, LBP, Radon, encoder); the per-representation labels (5, 8, 5, 5, 5) are fused by consensus / majority voting into the final label 5.]
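The label-fusion step from the diagram is a short function: under majority voting a label is accepted when at least a quorum of the feature representations agree, under consensus voting only when all of them agree. The quorum of 3 matches the setting given later on the results slides; the function name is illustrative.

```python
from collections import Counter

def vote(labels, scheme="majority", quorum=3):
    """Fuse the labels proposed by the five feature representations.
    'majority': accept a label if at least `quorum` representations agree;
    'consensus': accept only if all representations agree.
    Returns the fused label, or None if the sample stays unlabeled."""
    label, count = Counter(labels).most_common(1)[0]
    if scheme == "consensus":
        return label if count == len(labels) else None
    return label if count >= quorum else None

# the slide's example: pixels, profiles, LBP, Radon and encoder views
votes = [5, 8, 5, 5, 5]
print(vote(votes, "majority"))    # 4 of 5 agree -> 5
print(vote(votes, "consensus"))   # not unanimous -> None
```

Samples left unlabeled (`None`) are simply excluded from the generated training set rather than given a guessed label.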
Slide 6: Feature representations
• Raw pixels
– pixel intensities of the raw images
• Profiles (upper/lower/left/right)
– consider only the outer shape of the character
– i.e. the distance between the upper horizontal line and the pixel closest to the upper boundary of the image
• Local Binary Patterns (LBP)
– a local-texture, rotation-invariant representation
L. Heutte, T. Paquet, J.V. Moreau, Y. Lecourtier, C. Olivier, A structural/statistical feature based vector for handwritten character recognition, Pattern Recognit. Lett. 19 (7) (1998) 629–641.
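The upper profile described above can be computed in a few lines: for each column, the distance from the top edge of the image to the first foreground pixel. This is a sketch under my reading of the slide, not the paper's exact implementation; the other three profiles follow by flipping or transposing the image first.

```python
import numpy as np

def upper_profile(img, threshold=0):
    """For each column, the distance from the top edge of the image to
    the first foreground pixel (img > threshold); columns containing no
    foreground pixel get the full image height as their distance."""
    h, w = img.shape
    fg = img > threshold
    first = np.argmax(fg, axis=0)      # row index of first True per column
    first[~fg.any(axis=0)] = h         # empty column -> maximum distance
    return first

# a 4x4 toy "character": foreground marked with 1s
img = np.array([[0, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 0, 0]])
print(upper_profile(img))   # -> [4 1 2 4]
```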
Slide 7: Feature representations (continued)
• Radon transform
– takes multiple parallel-beam projections of the image from different angles
• Encoder network
– a special kind of deep learning architecture
– data-driven
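A minimal illustration of the parallel-beam projections behind the Radon transform: rotate the image and sum along one axis. To stay dependency-free this sketch only supports multiples of 90 degrees via `np.rot90`; a real Radon transform (e.g. `skimage.transform.radon`) interpolates the rotation for arbitrary angles.

```python
import numpy as np

def projections(img, angles=(0, 90)):
    """Parallel-beam projections of an image. This sketch supports only
    multiples of 90 degrees (np.rot90); a full Radon transform would
    interpolate-rotate the image for arbitrary angles."""
    out = {}
    for a in angles:
        assert a % 90 == 0, "sketch supports only multiples of 90 degrees"
        rotated = np.rot90(img, k=a // 90)
        out[a] = rotated.sum(axis=0)   # each beam sums one vertical line
    return out

img = np.array([[0, 1, 0],
                [0, 1, 0],
                [0, 1, 1]])
p = projections(img)
print(p[0])    # column sums -> [0 3 1]
print(p[90])   # row sums    -> [1 1 2]
```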
Slide 8: Definitions
• Voting schemes: consensus and majority voting
• Classifiers: quantities used in the evaluation:
– the number of patterns that should be assigned to the i-th class
– the number of patterns assigned to that class after classification
– the patterns that have an assigned class, and the patterns with no assigned class
– the patterns that have been correctly / incorrectly classified
Slide 9: Classifiers
• Unsupervised clustering
– K-means clustering (Lloyd's algorithm)
– Self-Organizing Map (SOM): a special type of neural network trained in an unsupervised fashion to produce a two-dimensional mapping of the input data
– Growing Neural Gas (GNG): no constraints on the topology, contrary to the SOM
• Supervised classification
– the k-nearest-neighbor (k-NN) classifier
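For completeness, the supervised side of the pipeline is simple enough to sketch: a plain k-NN classifier that assigns each test vector the majority label among its k closest training vectors. This is a generic NumPy implementation, not the paper's code.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-nearest-neighbor classification: each test vector takes
    the majority label of its k closest training vectors (Euclidean)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(d)[:k]              # indices of k closest
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)

# tiny sanity check: two classes, two test points
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_train = np.array([0, 0, 1, 1])
preds = knn_predict(X_train, y_train, np.array([[0.05, 0.1], [4.9, 5.2]]), k=3)
print(preds)   # -> [0 1]
```

In the paper's experiments the training labels fed to k-NN are the semi-automatically generated ones, which is how the quality of the labeling scheme is measured downstream.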
Slide 10: Evaluation
• A compactness measure combines inter-class and intra-class variances.
• A reliability measure quantifies how trustworthy the labeling strategy is.
• X: total number of vectors to be clustered
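The slide does not give the exact formula, so the sketch below uses one common inter/intra-variance compactness score as a stand-in: the ratio of within-cluster scatter to total scatter, where lower means tighter clusters. The paper's actual measure may differ in form.

```python
import numpy as np

def compactness(X, assign):
    """Ratio of within-cluster scatter to total scatter (an illustrative
    inter/intra-variance score, not necessarily the paper's formula).
    Lower values indicate tighter, better-separated clusters."""
    mu = X.mean(axis=0)
    total = ((X - mu) ** 2).sum()                # total scatter
    within = 0.0
    for j in np.unique(assign):
        members = X[assign == j]
        within += ((members - members.mean(axis=0)) ** 2).sum()
    return within / total

# 1-D toy data: two tight blobs around 0.1 and 5.1
X = np.array([[0.0], [0.2], [5.0], [5.2]])
good = compactness(X, np.array([0, 0, 1, 1]))   # clusters match the blobs
bad = compactness(X, np.array([0, 1, 0, 1]))    # clusters straddle both blobs
```

The good assignment scores far below the bad one, which is the behavior any compactness measure of this family should exhibit.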
Slide 11: Datasets
• MNIST
– Arabic digits
– 10 classes (0, 1, …, 9)
– 60,000 training / 10,000 test images
• Lampung
– a multi-writer handwritten collection produced by 82 high-school students from Bandar Lampung, Indonesia
– 20 character classes
– 23,447 characters for training
– 7,853 characters for testing
Slide 12: Results
• Performance of the features
Slide 13: Results
• Compactness of the clustering techniques
Slide 14: Results
• Clustering performance
Slide 15: Results
• Labeling performance
– majority / consensus voting: at least 3 methods / all 5 methods provide the same label
Slide 16: Results
• Labeling performance
– competitive performance is achieved with only a few human-labeled samples
Slide 17: Results
• Classification performance
– compared against Monte Carlo simulations (100 runs) that pick random samples from the complete training set
Slide 18: Results
• Classification performance under different voting schemes
– a fully connected multi-layer perceptron classifier
– results: 96.69 / 96.74 / 96.77 (%)
– the network is more sensitive to samples with wrong labels
Slide 19: Conclusion
• A semi-automatic labeling scheme with minimal human involvement.
• The labels discovered by this scheme are compared, in a k-NN setting, against randomly selected samples and against the complete, fully labeled data.
Slide 20: Thank you!
Q & A