TRANSCRIPT
Slide 1: SEMI-AUTOMATIC GROUND TRUTH GENERATION USING UNSUPERVISED CLUSTERING AND LIMITED MANUAL LABELING: APPLICATION TO HANDWRITTEN CHARACTER RECOGNITION
Szilárd Vajda, Yves Rangoni, Hubert Cecotti. Pattern Recognition Letters, 2015
Slide 2: Ground-truth generation
[Diagram: unlabeled data → labeled data]
• Usually, real-world data is not labeled.
• Large data collections need accurate labels.
Slide 3: Labeling strategy
[Diagram: an image dataset is mapped into five feature representations (pixels, profiles, LBP, Radon, encoder); each representation is clustered without supervision, and a human expert labels only the closest real data point to each centroid.]
Slide 4: Labeling strategy (continued)
[Diagram: the same five feature representations (pixels, profiles, LBP, Radon, encoder), now with clusters marked Label a, Label b, Label c; each point inherits the label of its cluster.]
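The cluster-then-label step on these slides can be sketched in plain NumPy: run Lloyd's k-means, have a human "oracle" label only the real sample closest to each centroid, and propagate that label to the whole cluster. The function and variable names here are illustrative, not from the paper, and the naive initialization (first k samples) stands in for whatever seeding the authors used.

```python
import numpy as np

def semi_automatic_labels(X, k, oracle, n_iter=20):
    """Cluster X with k-means (Lloyd's algorithm), ask a human `oracle`
    to label only the real sample closest to each centroid, then
    propagate that label to every member of the cluster."""
    centroids = X[:k].astype(float).copy()   # naive init: first k samples
    for _ in range(n_iter):
        # assign each point to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # recompute centroids; keep the old one if a cluster empties
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    labels = np.empty(len(X), dtype=int)
    for j in range(k):
        members = np.where(assign == j)[0]
        if len(members) == 0:
            continue
        # the human labels only the real sample nearest to the centroid
        rep = members[np.linalg.norm(X[members] - centroids[j], axis=1).argmin()]
        labels[members] = oracle(rep)        # one manual label per cluster
    return labels

# toy data: two well-separated blobs, so two manual labels suffice
a = np.linspace(0.0, 0.9, 10)[:, None].repeat(2, axis=1)
X = np.vstack([a, a + 5.0])
true = np.array([0] * 10 + [1] * 10)
got = semi_automatic_labels(X, k=2, oracle=lambda i: true[i])
```

With 20 clusters and 2 oracle calls instead of 20 hand labels per point, this is the "limited manual labeling" the title refers to.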
Slide 5: Labeling strategy (continued)
[Diagram: an input image is encoded with the five feature representations (pixels, profiles, LBP, Radon, encoder); the per-representation labels (5, 8, 5, 5, 5) are fused by consensus / majority voting into the final label 5.]
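The label-fusion step from the diagram is a short function: under majority voting a label is accepted when at least a quorum of the feature representations agree, under consensus voting only when all of them agree. The quorum of 3 matches the setting given later on the results slides; the function name is illustrative.

```python
from collections import Counter

def vote(labels, scheme="majority", quorum=3):
    """Fuse the labels proposed by the five feature representations.
    'majority': accept a label if at least `quorum` representations agree;
    'consensus': accept only if all representations agree.
    Returns the fused label, or None if the sample stays unlabeled."""
    label, count = Counter(labels).most_common(1)[0]
    if scheme == "consensus":
        return label if count == len(labels) else None
    return label if count >= quorum else None

# the slide's example: pixels, profiles, LBP, Radon and encoder views
votes = [5, 8, 5, 5, 5]
print(vote(votes, "majority"))    # 4 of 5 agree -> 5
print(vote(votes, "consensus"))   # not unanimous -> None
```

Samples left unlabeled (`None`) are simply excluded from the generated training set rather than given a guessed label.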
Slide 6: Feature representations
• Raw pixels
– pixel intensities of the raw images
• Profiles (upper/lower/left/right)
– consider only the outer shape of the character
– i.e. the distance between the upper horizontal line and the pixel closest to the upper boundary of the image
• Local Binary Patterns (LBP)
– a local-texture, rotation-invariant representation
L. Heutte, T. Paquet, J.V. Moreau, Y. Lecourtier, C. Olivier, A structural/statistical feature based vector for handwritten character recognition, Pattern Recognit. Lett. 19 (7) (1998) 629–641.
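The upper profile described above can be computed in a few lines: for each column, the distance from the top edge of the image to the first foreground pixel. This is a sketch under my reading of the slide, not the paper's exact implementation; the other three profiles follow by flipping or transposing the image first.

```python
import numpy as np

def upper_profile(img, threshold=0):
    """For each column, the distance from the top edge of the image to
    the first foreground pixel (img > threshold); columns containing no
    foreground pixel get the full image height as their distance."""
    h, w = img.shape
    fg = img > threshold
    first = np.argmax(fg, axis=0)      # row index of first True per column
    first[~fg.any(axis=0)] = h         # empty column -> maximum distance
    return first

# a 4x4 toy "character": foreground marked with 1s
img = np.array([[0, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 0, 0]])
print(upper_profile(img))   # -> [4 1 2 4]
```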
Slide 7: Feature representations (continued)
• Radon transform
– takes multiple parallel-beam projections of the image from different angles
• Encoder network
– a special kind of deep learning architecture
– data-driven
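A minimal illustration of the parallel-beam projections behind the Radon transform: rotate the image and sum along one axis. To stay dependency-free this sketch only supports multiples of 90 degrees via `np.rot90`; a real Radon transform (e.g. `skimage.transform.radon`) interpolates the rotation for arbitrary angles.

```python
import numpy as np

def projections(img, angles=(0, 90)):
    """Parallel-beam projections of an image. This sketch supports only
    multiples of 90 degrees (np.rot90); a full Radon transform would
    interpolate-rotate the image for arbitrary angles."""
    out = {}
    for a in angles:
        assert a % 90 == 0, "sketch supports only multiples of 90 degrees"
        rotated = np.rot90(img, k=a // 90)
        out[a] = rotated.sum(axis=0)   # each beam sums one vertical line
    return out

img = np.array([[0, 1, 0],
                [0, 1, 0],
                [0, 1, 1]])
p = projections(img)
print(p[0])    # column sums -> [0 3 1]
print(p[90])   # row sums    -> [1 1 2]
```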
Slide 8: Definitions
• Voting schemes: consensus and majority voting
• Classifiers: quantities used in the evaluation:
– the number of patterns that should be assigned to the i-th class
– the number of patterns assigned to that class after classification
– the patterns that have an assigned class, and the patterns with no assigned class
– the patterns that have been correctly / incorrectly classified
Slide 9: Classifiers
• Unsupervised clustering
– K-means clustering (Lloyd's algorithm)
– Self-Organizing Map (SOM): a special type of neural network trained in an unsupervised fashion to produce a two-dimensional mapping of the input data
– Growing Neural Gas (GNG): no constraints on the topology, contrary to the SOM
• Supervised classification
– the k-nearest-neighbor (k-NN) classifier
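For completeness, the supervised side of the pipeline is simple enough to sketch: a plain k-NN classifier that assigns each test vector the majority label among its k closest training vectors. This is a generic NumPy implementation, not the paper's code.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-nearest-neighbor classification: each test vector takes
    the majority label of its k closest training vectors (Euclidean)."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(d)[:k]              # indices of k closest
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)

# tiny sanity check: two classes, two test points
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_train = np.array([0, 0, 1, 1])
preds = knn_predict(X_train, y_train, np.array([[0.05, 0.1], [4.9, 5.2]]), k=3)
print(preds)   # -> [0 1]
```

In the paper's experiments the training labels fed to k-NN are the semi-automatically generated ones, which is how the quality of the labeling scheme is measured downstream.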
Slide 10: Evaluation
• A compactness measure combines inter-class and intra-class variances.
• A reliability measure quantifies how trustworthy the labeling strategy is.
• X: total number of vectors to be clustered
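The slide does not give the exact formula, so the sketch below uses one common inter/intra-variance compactness score as a stand-in: the ratio of within-cluster scatter to total scatter, where lower means tighter clusters. The paper's actual measure may differ in form.

```python
import numpy as np

def compactness(X, assign):
    """Ratio of within-cluster scatter to total scatter (an illustrative
    inter/intra-variance score, not necessarily the paper's formula).
    Lower values indicate tighter, better-separated clusters."""
    mu = X.mean(axis=0)
    total = ((X - mu) ** 2).sum()                # total scatter
    within = 0.0
    for j in np.unique(assign):
        members = X[assign == j]
        within += ((members - members.mean(axis=0)) ** 2).sum()
    return within / total

# 1-D toy data: two tight blobs around 0.1 and 5.1
X = np.array([[0.0], [0.2], [5.0], [5.2]])
good = compactness(X, np.array([0, 0, 1, 1]))   # clusters match the blobs
bad = compactness(X, np.array([0, 1, 0, 1]))    # clusters straddle both blobs
```

The good assignment scores far below the bad one, which is the behavior any compactness measure of this family should exhibit.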
Slide 11: Datasets
• MNIST
– Arabic digits
– 10 classes (0, 1, …, 9)
– 60,000 training / 10,000 test images
• Lampung
– a multi-writer handwritten collection produced by 82 high-school students from Bandar Lampung, Indonesia
– 20 character classes
– 23,447 characters for training
– 7,853 characters for testing
Slide 12: Results
• Performance of the features
Slide 13: Results
• Compactness of the clustering techniques
Slide 14: Results
• Clustering performance
Slide 15: Results
• Labeling performance
– majority / consensus voting: at least 3 methods / all 5 methods provide the same label
Slide 16: Results
• Labeling performance
– competitive performance is achieved with only a few human-labeled samples
Slide 17: Results
• Classification performance
– compared against Monte Carlo simulations (100 runs) that pick random samples from the complete training set
Slide 18: Results
• Classification performance under different voting schemes
– a fully connected multi-layer perceptron classifier
– results: 96.69 / 96.74 / 96.77 (%)
– the network is more sensitive to samples with wrong labels
Slide 19: Conclusion
• A semi-automatic labeling scheme with minimal human involvement.
• The labels discovered by this scheme are compared, in a k-NN setting, against randomly selected samples and against the complete, fully labeled data.
Slide 20: Thank you!
Q & A