
Multi-Objects Recognition using Unsupervised Learning and Classification

Ren C. Luo, Po-Yu Chuang and Xin-Yi Yang
International Center of Excellence on Intelligent Robotics and Automation Research, National Taiwan University

No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan
[email protected], [email protected], [email protected]

Abstract—The objective of this paper is to develop a real-time unsupervised learning method to detect multiple objects. Variance and gradient variance, as the main texture features, are compressed by PCA (Principal Component Analysis) to obtain an initial classification of clusters via the K-means algorithm in the image frame. The cluster kernel of each class, which we call the nucleus, is found by shifting the sampling area within each cluster. Based on the nucleus, our cell expansion policy merges different classes into one object, lifting the work from the feature level to the object level. All potential objects and their cells are detected and labeled in the frame. The descriptors of each object serve as hypotheses for object classification in the next frame. Our results demonstrate that multiple objects can be recognized quickly without any training model or labeled data. The learning process operates with an initial clustering and updates automatically with information from each new classification.

I. INTRODUCTION

The recognition field has witnessed milestone achievements in learning object models with supervision in recent years. However, this kind of method requires a large amount of prior knowledge. For unknown objects or environments, the accuracy of a supervised learning method depends on the amount of training data, which is costly and restrictive. For instance, features in the training phase might need to be hand-labeled, and training images showing objects in front of a uniform background are needed. Recently, algorithms have been developed that operate in an unsupervised manner on an image dataset. M. Weber et al. [1] construct objects as random constellations of parts, and R. Fergus et al. [2] adopt this model to propose a method which learns object classes from unlabeled and unsegmented cluttered scenes. They also extend this approach into a sparse object category model in [3]. In fact, it is not obvious how these methods could be trained without supervision [2][3][4][5], because the training data are known to include an object. M. Brown et al. [6] present an algorithm for unsupervised 3D recognition based on SIFT features. The dataset used is unordered; however, large numbers of images are required to reconstruct the 3D structure. We noticed that previous works emphasized classification tasks only, while related work has considered the combination of classification and segmentation in [7], which operates in a semi-supervised fashion. It appears that a lightly supervised method is acceptable; slightly supervised learning is pursued in [8][9]. In any case, labeled data might not always be required, but the model parameters must still be refined. In this paper, we present an unsupervised learning method to recognize multiple objects.

The algorithm recognizes all potential objects in the image frame without any training data or matching pattern. Different objects are classified and labeled, and descriptor information for each object is stored. The same process is run on the next frame, and the characteristics of the objects are compared with those in the previous frame to determine whether a newly detected object should be assigned the same label or a new one. From the viewpoint of a robot or vision system, the labels themselves carry no semantics, but each labeled object can be defined or pointed out for subsequent use.

The rest of this paper is structured as follows. In Section 2 we describe our texture feature extraction and initial clustering algorithm. In Section 3, we present how to map classes between feature and geometric space, and introduce the concept of a cell. Section 4 describes how to transfer class-level recognition into object-level recognition; the cell expansion algorithm we designed for merging the same class into one object is highlighted. Section 5 demonstrates results of the proposed algorithm. Section 6 describes the descriptor and hypotheses used for object learning and remembering. Section 7 presents conclusions and ideas for future work.

II. FEATURE EXTRACTION

We develop the clustering algorithm based on texture features. Vijaya et al. [10] present an idea which emphasizes unsupervised classification through texture features in a PCA fashion. In this paper, we employ different features from Vijaya's to describe the texture. Statistical measures such as the mean, variance, and gradient variance are computed as the characteristics of a texture, i.e., a color lump in the image. The original image is divided into pixel lumps of 10*10 per piece. The mean of each texture is extracted, which can be used to roughly estimate the similarity between two textures. The variances of each row and column are computed, yielding 20 variances. We also take color transitions into account, inspired by the definition of the gradient. The gradient variance is defined as a kind of color information of adjacent rows or columns. Conventionally, the difference between the variance values of adjacent rows or columns is taken as the gradient variance of the corresponding row or column. To reduce the influence of noisy pixels, we instead compute the gradient variance of a row or column based on its four neighboring variances.


For example, let the variance of row $a_i$ be $v_i$. We employ the adjacent variance values $v_{i-2}, v_{i-1}, v_i, v_{i+1}, v_{i+2}$ to perform a linear fit; the slope of the fitted line is the gradient variance of row $a_i$. Finally, 40 features are fixed, comprising 10 row variances, 10 column variances, 10 row gradient variances and 10 column gradient variances. All the features above are stacked in sequence into a 4*10 feature matrix, whose rows hold the row variances, column variances, row gradient variances and column gradient variances respectively. To reduce the dimension of the feature matrix, we apply PCA to transform the raw data onto new axes, where the principal components corresponding to larger eigenvalues carry the important information represented by the raw feature set. A 10*1 mapping matrix generated by the PCA algorithm compresses the feature matrix into a 4*1 feature column. We then expand the feature column by adding a new row containing the mean of the texture, and K-means classification is performed to classify the feature columns of the textures into different clusters. Finally, the classification is transferred from feature space back to geometric space, i.e., the different texture classes are located in the image. Fig. 2(a) shows the classification result for the upper image of Fig. 1(a) based on PCA and K-means; the same classes are labeled with the same color. It is noticeable that different clusters might belong to one object, and the same cluster may be scattered across different objects.
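As a rough sketch of this pipeline (block statistics, the linear-fit gradient variance, PCA compression, and K-means), the following assumes NumPy and scikit-learn; the number of clusters and the edge handling of the fit window are our assumptions, not specified in the paper:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def block_features(gray, block=10):
    """Return one feature vector (mean + 40 texture features) per 10x10 block."""
    H, W = gray.shape
    feats = []
    for r in range(0, H - H % block, block):
        for c in range(0, W - W % block, block):
            patch = gray[r:r + block, c:c + block].astype(float)
            row_var = patch.var(axis=1)          # 10 row variances
            col_var = patch.var(axis=0)          # 10 column variances

            def grad(v):
                # Slope of a line fitted to the 5 variances centered on each
                # row/column; edges are padded by reflection (an assumption).
                vp = np.pad(v, 2, mode="reflect")
                x = np.arange(5)
                return np.array([np.polyfit(x, vp[i:i + 5], 1)[0]
                                 for i in range(len(v))])

            feats.append(np.concatenate([[patch.mean()],
                                         row_var, col_var,
                                         grad(row_var), grad(col_var)]))
    return np.array(feats)                        # shape (n_blocks, 41)

def cluster_textures(gray, n_clusters=16):
    F = block_features(gray)
    mean_col = F[:, :1]                           # keep the mean separate
    rows = F[:, 1:].reshape(-1, 4, 10)            # the 4x10 feature matrix
    # The paper compresses the 4x10 matrix with a 10x1 PCA mapping; here one
    # principal direction is fit over all 10-dim rows, and each row projects
    # to a scalar, giving a 4x1 column per block.
    pca = PCA(n_components=1).fit(rows.reshape(-1, 10))
    compressed = pca.transform(rows.reshape(-1, 10)).reshape(-1, 4)
    X = np.hstack([compressed, mean_col])         # append the mean feature
    # The number of clusters is not given in the paper; 16 is illustrative.
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
```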

III. MAPPING OF FEATURE AND GEOMETRIC SPACE

After feature extraction, we discuss how to push the recognition from class level to object level by merging clusters into distinct objects. In a basic supervised method, we input an identified object and extract its features to make the object recognizable. This general method is hard to apply under unsupervised conditions, since it is impossible to derive a general model for arbitrary objects. A scene image basically consists of background and objects, but the relation between background and objects can change with the viewpoint. While we focus on one object, the other parts of the image can be considered background; if we look more closely at a specific part of that object, the other parts of the original object in turn become background. People intuitively take some part of the image as a reference scene and sense differences between object and reference to recognize the object.

We therefore initialize our method by searching for a class in feature space that is widely distributed in geometric space and use it as our reference for recognition. Based on the reference class, basic key points of recognizable objects with an invariant relation to the reference class can be further defined. Traditional classifiers focus on comparing the similarity between object and scene in the feature phase. In our method, we further exploit the relation between invariant feature points and the reference in geometric space, so that each distinct object is assigned to one and only one corresponding class.

A. Definition of Reference Class

The reference is regarded as a stable, normal presence in geometric space. We expect every potential object to have a robust relation with the reference class; intuitively, the reference class is the background. As a background, the reference class must be dispersed homogeneously, so it has to include enough members and must not form a regional cluster. The reference should not only include the most members at the feature extraction stage but also have the widest distribution in geometric space. Therefore, based on the classification of features, the histogram of classes has to include a discriminative weighting factor that depends on the distribution over the entire scene. The discriminative histogram is built as follows:

$$P_c = \sum_{i=1}^{N} w_i f(x_i, y_i)$$

$$f(x_i, y_i) = \begin{cases} 1, & \text{if } f \in C \\ 0, & \text{if } f \notin C \end{cases}$$

$$w_i = \frac{\sqrt{(x_i - x_o)^2 + (y_i - y_o)^2}}{\sqrt{(x_W - x_o)^2 + (y_H - y_o)^2}}$$

where $P_c$ is the accumulated weight of class $c$, $O = (x_o, y_o)$ is the center point of the image, $x_W$ and $y_H$ are the maximum width and height, and the weighting coefficient depends on the ratio of Euclidean distances. The reference class obtained through the discriminative histogram is shown in Fig. 1(b). The white block templates are members of the reference class, and the number and position of the remaining parts can be further estimated from their relation to the reference.
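A compact sketch of this histogram, assuming each block's grid position and K-means label are available as NumPy arrays (array names are illustrative):

```python
import numpy as np

def reference_class(labels, xs, ys, W, H):
    """Pick the class with the largest distance-weighted count P_c."""
    xo, yo = W / 2.0, H / 2.0
    # Weight grows with distance from the image center, normalized by the
    # distance from the center to the far corner (x_W, y_H).
    w = np.hypot(xs - xo, ys - yo) / np.hypot(W - xo, H - yo)
    P = {c: w[labels == c].sum() for c in np.unique(labels)}
    return max(P, key=P.get)
```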

B. Cell of Potential Object

When feature points are mapped to geometric space, one object might include multiple classes, or one class might belong to different objects. To further clarify the composition of a single object, we aim to find invariant and stable points, the cells of potential objects, in geometric space. Cells represent the regional key points of each class, and all key points are defined with respect to the reference class. A cell is considered the center of a feature class within a distinct object, so the distance from a cell to the reference must be a local minimum in every searching direction:

$$\min \left\{ \sum_{j=1}^{D} \frac{\sum_{i=1}^{D} d_i(x, y)}{d_j(x, y)} \right\} \qquad (1)$$

It is expensive to recheck every element for convergence. In Eq. (1), $d_i(x, y)$ is a vector representing the distance to the reference in the $i$-th direction, so the gradient of the summation of $d(x, y)$ points in the direction with the largest variation in magnitude. By moving $(x, y)$ along this direction, the cell point, i.e., the center point, can be approached in several iterations. The new point $(x', y')$ is derived by:

$$(x', y') = (x, y) + \sum_{i=1}^{N} \nabla d_i(x, y) \qquad (2)$$


Fig. 1: Results of reference class. (a) Original; (b) Reference class.

The local minimum of Eq. (1) can thus be reached through several iterations of Eq. (2). Depending on the situation, structure, or pose, one object might yield multiple cells, as in Fig. 2(b). Every object must have at least one cell from which to expand, and multiple cells in a single object are merged into one during the subsequent expansion process.
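This iteration admits a small sketch. Under the assumption that a helper `dist_to_ref(x, y, i)` returns the distance from $(x, y)$ to the nearest reference-class member along the $i$-th search direction (such a helper is not defined in the paper), the update of Eq. (2) might look like:

```python
def find_cell(x, y, dist_to_ref, n_dirs=8, max_iter=50, step=1.0):
    """Iterate Eq. (2) until the summed distance gradient vanishes."""
    for _ in range(max_iter):
        gx = gy = 0.0
        for i in range(n_dirs):
            # Central-difference numerical gradient of d_i at (x, y).
            gx += dist_to_ref(x + 1, y, i) - dist_to_ref(x - 1, y, i)
            gy += dist_to_ref(x, y + 1, i) - dist_to_ref(x, y - 1, i)
        if abs(gx) < 1e-6 and abs(gy) < 1e-6:
            break                                  # stationary point reached
        x, y = x + step * gx / 2, y + step * gy / 2  # Eq. (2) update
    return x, y
```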

Additionally, the reference class should be distributed homogeneously; in reality, however, the background might be divided into several feature classes by nonhomogeneous lighting or the viewpoint. If the reference does not contain enough members, the cell search produces many redundant clusters of cells that are actually part of the background, which wastes computing time and causes misidentification of objects. Hence, if the reference does not contain enough members, the reference class is merged with other similar classes, expanding its region until the number of cells converges to a stable constant.

Fig. 2: Demonstration of the recognition process. (a) Classification in the feature phase; (b) Cell points; (c) Result of cell expansion; (d) Result of cell merging.

IV. CELL EXPANSION AND OBJECT RECOGNITION

Cells are considered the expansion centers of their objects. Subject to geometric constraints, these cells can be expanded further to fill the entire distinct object. Viewed in feature space, feature points are hard to merge into a complete object, because one object might include multiple feature classes. Objects are hard to recognize through feature space alone, but an object must be distinct in geometric space. In this section, therefore, we use geometric boundaries as constraints on cell expansion and thereby combine different feature classes into one object.

Fig. 3: Flow chart of the proposed algorithm

First of all, the feature clusters in geometric space have to be identified. Boundaries between clusters have already been defined by the reference class: when points of the same feature class belong to different objects, the line between the two points must cross members of the reference class. Moreover, cells are considered the basis of objects, so points can be classified by their distance to the cells, each point becoming a member of the closest cell. This distance between a point and a cell is subject to additional conditions. Each object is separated by the reference class, and cells are further divided by the feature classification. Thus, if the line between a point and a cell passes through the reference class or a different feature class, a penalty factor is applied to the distance so that the point is assigned to the appropriate cell:

$$\min \left\{ d_{cell}(x, y) + \sum_{l=1}^{L} \mu_r + \sum_{m=1}^{M} \mu_c \right\} \qquad (3)$$

where $d_{cell}$ is the distance between the point $(x, y)$ and each cell, $\mu_r$ is the penalty factor for a path crossing the reference class, and $\mu_c$ is the penalty for crossing a different class member. All points can be classified through Eq. (3) and assigned to the new class defined by a cell. Comparing the classification results by features and by cells in Table I and Table II, there are several isolated points in the feature classes (classes 11 to 15), whereas in the cell classification every cell class contains a cluster of points. In Fig. 2(c), several points remain unclassified because they are isolated or singular points in the image. If a point is surrounded by the reference class, or no cell shares its feature class, the magnitude of the distance grows rapidly under the penalty factor, so noise and singular points are wiped out in the classifying phase.
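As an illustration, the assignment of Eq. (3) might be sketched as follows, assuming a hypothetical helper `path_counts(p, cell)` that counts the reference-class (L) and foreign-class (M) blocks crossed by the straight line from a point to a cell; the penalty values are placeholders, not from the paper:

```python
import numpy as np

MU_R = 1e3      # penalty per reference-class crossing (illustrative value)
MU_C = 1e2      # penalty per foreign-class crossing (illustrative value)
MAX_COST = 1e5  # above this, the point is left unclassified as noise

def assign_to_cell(p, cells, path_counts):
    """Assign point p to the cell minimizing Eq. (3), or None for noise."""
    best, best_cost = None, np.inf
    for k, cell in enumerate(cells):
        L, M = path_counts(p, cell)
        cost = np.hypot(p[0] - cell[0], p[1] - cell[1]) + L * MU_R + M * MU_C
        if cost < best_cost:
            best, best_cost = k, cost
    return best if best_cost < MAX_COST else None
```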

TABLE I: Classification in Feature Phase

Class No.   No. of Members   Class No.   No. of Members
Reference   13414
1           1666             9           8
2           195              10          3
3           52               11          1
4           31               12          1
5           25               13          1
6           14               14          1
7           9                15          1
8           7

TABLE II: Classification in Cell Phase

Class No.   No. of Members   Class No.   No. of Members
Reference   13414
1           65               7           389
2           9                8           24
3           13               9           121
4           33               10          834
5           19               11          144
6           15


Although the points are now classified in the cell phase, one recognized object might still include multiple cells, owing to the multiple features within an object. Hence, we evolve the classification from the cell phase to the object phase, in which each object has one and only one cell.

Recall the identification of cells: a cell is the center of a feature within an object, and its expansion result can be regarded as a segment of the object. Since each expansion result is part of one object, the object can be rebuilt through the geometric relations of the cell expansion results. The classification in the object phase is derived by the following criteria:

C1: If the expansion areas of two cells overlap, the two areas can be merged.

C2: If two expansion areas can be connected without crossing the reference, the two areas and the related region can be merged.

C3: If the expansion area of one cell is segmented by the reference class, the result is not recognized as an object.

According to C1-C3, the cells are further classified in the object phase, as shown in Fig. 2(d) and Table III. Each class in Table III represents an object, and its members are the points on that object. Through these points, distinct objects can be reconstructed and recognized. The entire proposed algorithm is shown in Fig. 3.
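As one possible reading of C1-C3, the merging can be organized as a union-find over cells; the predicates `overlap` (C1), `connected_without_reference` (C2), and `segmented_by_ref` (C3) are hypothetical helpers standing in for the geometric tests described above:

```python
def merge_cells(n_cells, overlap, connected_without_reference, segmented_by_ref):
    """Group cell indices into objects according to criteria C1-C3."""
    parent = list(range(n_cells))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]          # path compression
            i = parent[i]
        return i

    for a in range(n_cells):
        for b in range(a + 1, n_cells):
            if overlap(a, b) or connected_without_reference(a, b):
                parent[find(a)] = find(b)          # C1 / C2: merge areas

    # C3: a cell whose area is cut apart by the reference class is dropped.
    objects = {}
    for i in range(n_cells):
        if not segmented_by_ref(i):
            objects.setdefault(find(i), []).append(i)
    return list(objects.values())                  # each group is one object
```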

V. PERFORMANCE EVALUATION

The proposed method has been studied in detail in the previous sections; in the following, it is tested on the five sets of images in Fig. 4(a).


TABLE III: Classification in Object Phase

Class No.   No. of Members
Reference   13414
1           65
2           41
3           389
4           72
5           834
6           265

Fig. 4: Results of object recognition. (a) Test images 1-5; (b) Results.

The image size is 1693*931 and the feature extraction template is 10*10; the images include references and objects with different textures, and objects with different geometric structures. Fig. 4(b) shows the results of the proposed method. Small white circles are cells at the beginning stage, and big red circles are cells after expansion and merging. The rectangular region is the maximum extent of the recognized object. As these results show, the objects in the images are typically recognized by the proposed method, and every single object contains one and only one cell, as mentioned before.

TABLE IV: Computation Time of Each Main Stage (ms)

Image No.   Feature Extraction   Cell Identification   Expansion   Total
1           715                  91                    476         1310
2           605                  89                    484         1051
3           875                  86                    513         1654
4           877                  84                    584         1754
5           987                  168                   616         2045

In our experiments, not every cell can be identified easily because of nonhomogeneous lighting or object shadows; these factors can cause failure to recognize an object or misidentification. The reflection of light in the third and fourth rows of Fig. 4 is one possible source of misidentification. Nevertheless, the reflections are not recognized as objects by the proposed method, because their structure is incomplete and broken up into different classes. In the same way, the upper left corner of the fifth row in Fig. 4 is also not identified as an object.

Timing was recorded on a desktop with a quad-core i5 3.20 GHz processor (without multiprocessing) running Windows 7 64-bit. As Table IV shows, the timing depends mostly on the number of classes found in the feature extraction stage and on the number of cells. Since the fifth image yields fifteen cells, its processing time is the longest; the processing time of almost every picture is under two seconds. The running time can be further reduced by using smaller images to save feature extraction time.

VI. HYPOTHESIS AND DESCRIPTOR OF UNSUPERVISED LEARNING

Through the cell expansion algorithm, all objects in the current frame are recognized and labeled. We now consider whether, in the next frame, a detected object will be assigned the same label when it is the same object as in this frame. The algorithm should also detect any new object that did not appear in previous frames and mark it with a new label.

We analyze the components of an object detected by the proposed method. The object must contain all the information of its merged classes, and each object maps to one cell, the final merging result of the expansion; the geometric position of the cell belonging to an object can thus be obtained by the proposed algorithm. We call this kind of cell an object-cell. Under a slightly changing background and a subtly moving view, the same object appearing in adjacent frames should share the same cell components, i.e., the classes constituting the object. More specifically, the texture features belonging to the different classes of a recognized object are recorded. Another hypothesis is that the positions of the cells of the same object should be stable if the object is a still life in the frames; in this case the shifting distance of the cells is small. The components of the object and the positions of its cells can then be employed as descriptor information. If objects in adjacent frames share the same components and the Euclidean distance between their cells is less than a threshold, the objects are considered to be the same one.


Fig. 5: Result of descriptor labeling. (a) Labeled object; (b) Recognition of labeled and unknown objects.

On the other hand, an object which does not meet the aforementioned criteria is judged to be a new object and assigned a new label. In the third row of Fig. 5(a), only the lens cap is recognized as an object and labeled with a yellow rectangle. Fig. 5(b) is captured after changing the camera view; the proposed method detects both the lens cap and the traveling cup, so obviously a new object has entered the frame. Comparing with the descriptor information, although the cup is favored by the nucleus distance, the lens cap in the right frame is recognized on the basis of its very similar texture structure. In fact, the cup consists of two texture components, belonging to its cap and body, so the cup is labeled with a red rectangle, a new mark.
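A minimal sketch of this frame-to-frame matching, assuming each object is summarized by its set of component classes and its cell position; the field names and the distance threshold are illustrative, not taken from the paper:

```python
import numpy as np

CELL_DIST_THRESH = 30.0   # pixels; threshold value is an assumption

def match_objects(prev_objs, curr_objs, next_label):
    """prev_objs/curr_objs: dicts with 'classes' (set) and 'cell' (x, y)."""
    for obj in curr_objs:
        obj["label"] = None
        for old in prev_objs:
            dist = np.hypot(obj["cell"][0] - old["cell"][0],
                            obj["cell"][1] - old["cell"][1])
            if obj["classes"] == old["classes"] and dist < CELL_DIST_THRESH:
                obj["label"] = old["label"]   # same object, keep its label
                break
        if obj["label"] is None:              # unseen descriptor: new object
            obj["label"] = next_label
            next_label += 1
    return curr_objs, next_label
```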

VII. CONCLUSION AND FUTURE WORK

The experimental results show the robust performance of the proposed algorithm in varied environments. The region of each distinct object can be recognized precisely through cell expansion and merging. According to the computation results, the computing time is generally consumed by the feature extraction and cell expansion stages. Nevertheless, these results were obtained on large images, and the time consumed by these two stages depends on the number of points; we are therefore convinced the method will be efficient in general applications. Furthermore, we established a simple descriptor system based on several hypotheses. The descriptor is built for recognition over serial frames: learned objects can be recognized, and new objects are labeled with new marks, and so on.

Currently, the proposed method focuses on unsupervised object recognition; a more robust descriptor system has yet to be established. The method is planned to become a real-time learning system, so distinguishing learned objects from unknown ones will depend on such a robust descriptor system. We expect to incorporate the relations between cells and classes into the descriptor system, because the structure of an object comprising different feature classes is a relatively stable property for object recognition. Furthermore, we are interested in implementing the unsupervised recognition algorithm on a robot platform and in more complex environments.

REFERENCES

[1] M. Burl, M. Weber, and P. Perona, "A probabilistic approach to object recognition using local photometry and global geometry," in Proc. ECCV, 1998, pp. 628-641.

[2] R. Fergus, P. Perona, and A. Zisserman, "Object class recognition by unsupervised scale-invariant learning," in CVPR, 2003, pp. 264-271.

[3] R. Fergus, P. Perona, and A. Zisserman, "A sparse object category model for efficient learning and exhaustive recognition," in CVPR (1), 2005, pp. 380-387.

[4] B. Leibe, A. Leonardis, and B. Schiele, "Combined object categorization and segmentation with an implicit shape model," in ECCV'04 Workshop on Statistical Learning in Computer Vision, Prague, Czech Republic, May 2004, pp. 17-32.

[5] D. J. Crandall and D. P. Huttenlocher, "Weakly supervised learning of part-based spatial models for visual object recognition," in ECCV (1), 2006, pp. 16-29.

[6] M. Brown and D. G. Lowe, "Unsupervised 3D object recognition and reconstruction in unordered datasets," in 3DIM, 2005, pp. 56-63.

[7] J. M. Winn and N. Jojic, "LOCUS: Learning object classes with unsupervised segmentation," in ICCV, 2005, pp. 756-763.

[8] L. Xie and P. Perez, "Slightly supervised learning of part-based appearance models," in CVPR, 2004, p. 107.

[9] M. Najjar, C. Ambroise, and J. P. Cocquerez, "Feature selection for semisupervised learning applied to image retrieval," in ICIP, 2003, vol. 3, pp. 559-562.

[10] V. V. Chamundeeswari, D. Singh, and K. Singh, "An analysis of texture measures in PCA-based unsupervised classification of SAR images," Geoscience and Remote Sensing Letters, 2009, pp. 214-218.