
Evidence Filtering in a Sequence of Images for Recognition

Sukhan Lee 1,2,*
1 Director, Intelligent Systems Research Institute (ISRI), School of Information and Communication Engineering
2 Interaction Science Department
SungKyunKwan University, South Korea
[email protected]

Muhammad Ilyas, Kim Jaewoong, Ahmed Naguib
Intelligent Systems Research Institute (ISRI), School of Information and Communication Engineering
SungKyunKwan University, South Korea
{milyasmeo,create}@skku.edu, [email protected]

Abstract— In recognizing a target object or entity and its attributes, such as pose, from images, the evidences extracted initially may be uncertain and/or ambiguous, as they can only be defined probabilistically and/or do not satisfy the sufficient condition for recognition. These uncertainties and ambiguities are often due as much to external, uncontrollable causes, such as variations of illumination and texture distributions in the scene, as to the quality of the imaging tools used. This paper presents a method of filtering the uncertain and ambiguous evidences obtained from a sequence of images in such a way as to reach a reliable decision level for recognition. First, at each image of the sequence, a number of weak evidences are generated using 3D lines, a 3D shape descriptor and SIFT; these may be too ambiguous and/or uncertain to decide recognition quickly and reliably. To reach a faithful recognition, we enrich these evidences with an appearance vector and generate multiple interpretations of the target object with higher weights. We incorporate a pre-established Bayesian evidence structure, which embodies the sufficient condition for recognition, to generate such interpretations. Furthermore, as the robot moves, we perform active recognition in a particle filter framework over the sequence of images to produce the interpretation with the highest weight and lowest error covariance. This paper provides the details of the implementation and experimental results of evidence filtering in image sequences using a particle filter in the HomeMate robot application.

Keywords—Object Recognition; Bayesian Network Evidence Structure; image sequences; Super-Particle Filtering

I. INTRODUCTION

Object recognition has been one of the major problems in the robotics and computer vision communities and has been intensively investigated for several decades. In particular, 3D object recognition has played an important role in manipulation as well as in autonomous navigation, such as visual servoing and SLAM in the robotics field [1].

One of the major challenges in object recognition has been how to cope with variations in illumination, viewpoint, distance, texture, occlusion, etc. The conventional approach to such problems is to develop photometric (SIFT) and geometric (lines/curves) features invariant to such variations, together with an efficient organization of visual memory for appearance/view-based matching [2]. Despite much success in recognition, the conventional approaches described above turn out to be inadequate when dealing with recognition involved in robotic services, such as errand and HomeMate services, in real-world cluttered environments. This is due to the fact that the environmental variations which service robots must deal with are often beyond what conventional approaches are supposed to handle; furthermore, the preconditions for recognition, such as the target being in proper sight and viewpoint and at a proper distance from the camera in the first place, may not necessarily be met when a recognition service is requested. The next generation of service robots depends heavily on the capability of the robot to search, recognize, fetch and manipulate objects in a cluttered environment. It is apparent that recognition for service robots should extend its scope beyond conventional matching, segmentation and classification tasks toward more human-like capabilities under the framework of cognitive recognition.

This paper presents a method of filtering the uncertain (e.g., due to sensor errors) and ambiguous (e.g., due to insufficient features) evidences obtained from a sequence of images in such a way as to reach a reliable decision level for recognition. First, at each image of the sequence, a feature optimal for coping with the existing uncertainties, yet possibly insufficient for avoiding ambiguity, is chosen for the quick generation of spatially distributed weak evidences. These weak evidences are then interpreted into multiple hypotheses, as regions of interest to investigate further. Then, for each hypothesis, additional features and/or contexts, possibly satisfying the sufficient condition for recognition, are chosen for assigning the hypothesis probability, where the Bayesian network evidence structure for the target object, in which the sufficient condition for recognition is embedded, is used for computing the probabilities. Second, the hypotheses generated from individual images of the sequence are filtered based on an iterative propagation and fusion paradigm. To this end, each hypothesis is represented as a super-particle attributed by its weight and distribution, where the weight represents the probability associated with the hypothesis, while the distribution represents the uncertainty associated with the hypothesis along with its variations: the integration of this distribution over the variations yields the weight, or super-particle probability. This setting of individual hypotheses allows the evidence filtering to be processed under well-established particle filtering, such that the super-particles of individual images from the sequence go through an iterative process of propagation, fusion and

resampling, supported by the weights and uncertainty distributions associated with the individual super-particles.

This research was conducted for the Intelligent Robotics Development Program (F0005000-2009-31) and in part by the KORUS-Tech Program (F0005000-2010-32, KT-2008-SW-AP-FSO-0004) funded by MKE. It was also partially supported by MEST Korea under the WCU Program supervised by KOSEF (R31-2010-000-10062-0), and by MKE Korea under ITRC NIPA-2010-(C1090-1021-0008), NTIS-2010-(1415109527). Sukhan Lee*, Corresponding Author.
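To make the super-particle representation described above concrete, the following is a minimal sketch of a hypothesis carried as a weighted Gaussian in pose space. This is our illustration rather than the paper's implementation; the class, field and method names are assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class SuperParticle:
    """One hypothesis: a Gaussian in pose space plus a probability weight.
    All names here are illustrative, not from the paper."""
    mean: np.ndarray        # pose hypothesis, e.g. (x, y, z, roll, pitch, yaw)
    covariance: np.ndarray  # 6x6 uncertainty of the pose hypothesis
    weight: float           # hypothesis probability from the evidence structure

    def propagate(self, T: np.ndarray, Q: np.ndarray) -> "SuperParticle":
        """Propagate through a known robot motion, linearized as a 6x6
        transform T, inflating the covariance by the motion noise Q."""
        return SuperParticle(mean=T @ self.mean,
                             covariance=T @ self.covariance @ T.T + Q,
                             weight=self.weight)
```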

Figure 1. Overall Cognitive Recognition System. [Block diagram of the cognitive architecture for visual recognition and pose estimation: stereo-camera sensory input; optimal feature selection; multiple interpretations for ROIs; multiple hypotheses with evidence probabilities; probabilistic fusion and filtering with a particle filter (observation delay, coordinate transformation, propagation); evidence collection; a knowledge base holding context information, an environment model and a Bayesian evidence net object model; 3D workspace modeling; focus of attention; illumination-independent dynamic sensing (virtual camera); and hierarchical evidence accumulation.]

Among the many potential real-world applications of the proposed evidence filtering, this paper demonstrates through extensive experimentation its superior performance for the success of a search and recognition mission by a service robot carrying out errand services in a cluttered home environment. The overall cognitive recognition system architecture is shown in Fig. 1 above.

The major components of this recognition system are:

- Knowledge base, composed of the evidence structure and context information.
- Multiple interpretation (hypothesis) generation.
- Probabilistic reasoning over evidences.
- Proactive evidence collection.
- Evidence filtering through particle filtering.

The organization of this paper is as follows: after the introduction in Section I, Section II discusses related work. The Bayesian network evidence structure is described in Section III, and Section IV presents multiple interpretation generation. Section V deals with evidence filtering through the two-layered particle filter (TLPF). Experimental results and conclusions are given in Section VI.

II. RELATED WORK

Many researchers around the globe have been engaged in 3D object recognition over the last decades. Although various approaches have been considered, the model-based recognition method is the most general and intuitively appealing among them. In a nutshell, it recognizes objects by matching features, e.g., photometric (SIFT) and geometric (lines/curves) features extracted from the scene, with stored features of the object in a database. The method proposed by Taylor [3] concentrated mostly on object tracking and developed a fusion scheme for 3D model-based tracking using a Kalman filter framework. The algorithm fuses color, edge and texture cues predicted from a textured CAD model of the tracked object to recover the 3D pose; however, the problem of initial pose estimation or object detection/recognition is not considered. David et al. [4] proposed an approach in which recognition and pose estimation are solved simultaneously by minimizing an energy function, but the functional minimization may not converge to the minimum value due to the high non-linearity of the cost function.

As for 3D object recognition for service robots, Saidi [5] proposed an approach for active visual search with a humanoid robot: the problem of object search is transformed into a sensor planning issue and then formulated as an optimization problem. Ma [6] proposed a 3D object search approach in a Bayesian framework, implemented in two steps: coarse-scale search with color histogram matching and fine-scale local search with 3D SIFT matching. The method proposed by Fischler and Bolles [7] uses RANSAC to recognize objects: it projects the points of all models onto the scene and determines whether the projected points are close to those detected in the scene. This method performs hypothesis and verification tasks several times, making the computational cost high and the method inefficient. The HERB robot from CMU has a suite of sensors to help it perceive the world, including a spinning laser scanner for building 3D world models and a vision system for object recognition and pose estimation [8] for manipulation with a single image. The STAIR robot from Stanford [9][10] focuses on a learning-based approach, which suffers from the challenge of huge data collection and an expensive learning process. R. B. Rusu, Gary Bradski et al. at Willow Garage [11] described the Viewpoint Feature Histogram (VFH) as a descriptor of 3D point clouds for the recognition and pose estimation of 3D objects with possible segmentation. [12] presents a blend of recognition systems intended to work in parallel to achieve good performance, but it does not appear to work in real time.

In the above-mentioned works, however, ambiguity and uncertainty issues are rarely discussed. Cognitive recognition that overcomes ambiguity and uncertainty during robot motion is essential for the final successful manipulation of objects in the 3D world, e.g., in an intelligent manufacturing workspace.

In order to overcome the problems mentioned above, we present a Bayesian network evidence structure approach to overcome uncertainty and ambiguity in recognition over a sequence of images and to supply the sufficient condition for dependable recognition. We use multiple features to generate multiple interpretations as the initial recognition of the target object in a probabilistic manner, dealing with the issues of ambiguity and sensor noise. Furthermore, our method implements a two-layered particle filter to maintain the pose candidates from both propagation and new observations, and the corresponding probabilities (weights) and poses evolve in time as further evidences accumulate over the image sequence.

III. BAYESIAN NETWORK EVIDENCE STRUCTURE

The need for the Bayesian network evidence structure arises from the requirement to provide a sufficient condition for a faithful recognition decision and to calculate and evolve the associated probabilities in time. The evidence structure utilizes pre-determined Bayesian tables to combine various evidences, not in one step, but through several levels of a structure that is pre-designed to reflect the corresponding features of each evidence of the target object. The total probability of a hypothesis/interpretation is calculated by combining the different available evidences. In this paper we choose 3D lines, SIFT and shape descriptor features as evidences to recognize the target object initially. These evidences generate multiple interpretations of the target object by comparison with features in the database. As an example, we consider the simple case of a milk carton, as shown in Fig. 2. Features such as 3D lines, the shape descriptor and SIFT are used to generate multiple interpretations, and for the calculation of probabilities we use the appearance vector (color) as additional evidence. The Bayesian evidence structure is constructed for all 16 viewpoint scenarios. A snapshot of the Bayesian structure is given in Tables I and II. In the first level, line probabilities are calculated from observations, and based on these, probabilities are calculated for a particular wireframe of the target object. Matching with this wireframe generates the initial interpretations of the target object. In level II, both the color signature and the shape descriptor are composed by combining sub-evidences, e.g., blue and white signatures in the case of color, and coverage and aspect ratio in the case of the shape descriptor. Thus, before feeding into the particle filter for the final decision, we have a set of interpretations with associated weights generated from sequential evidence collection.

TABLE I. PROBABILITY OF WIREFRAME GIVEN 3D LINES

 L1  L2  L3  L4  L5 | Box Wireframe: T |  F
 T   T   T   T   T  |       0.99       | 0.01
 T   T   T   T   F  |       0.90       | 0.10
 T   T   T   F   T  |       0.75       | 0.25
 T   T   T   F   F  |       0.60       | 0.40
 ...

e.g., the probability of the wireframe is calculated from Table I.

TABLE II. CALCULATION OF PROBABILITY OF TARGET OBJECT

 Wire Frame | 3D Descriptor | Appearance Vector | Milk Carton: T |  F
     T      |       T       |         T         |      0.9       | 0.1
     T      |       T       |         F         |      0.6       | 0.4
     T      |       F       |         T         |      0.8       | 0.2
     T      |       F       |         F         |      0.4       | 0.6
     F      |       T       |         T         |      0.7       | 0.3
     F      |       T       |         F         |      0.3       | 0.7
     F      |       F       |         T         |      0.4       | 0.6
     F      |       F       |         F         |      0.1       | 0.9

e.g., Pr(Milk Carton = T | Wire Frame, 3D Descriptor, Appearance Vector) is calculated from Table II.

Due to the amount of evidences that can be acquired from the 3D point cloud and textures, the Bayesian evidence structure is used to combine the probabilities for a possible interpretation match, generated from the various evidences, into one probability figure that represents the possibility of this particular interpretation being the correct target object.
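To illustrate how a pre-defined Bayesian table such as Table II is applied, the sketch below marginalizes the table over its binary parent evidences, whose truth values are only known probabilistically. The table values are copied from Table II; the function and variable names are our own.

```python
# Conditional table from Table II: P(MilkCarton=T | WF, SD, AP),
# keyed by the truth values of (Wire Frame, 3D Descriptor, Appearance Vector).
MILK_CARTON_CPT = {
    (True,  True,  True):  0.9, (True,  True,  False): 0.6,
    (True,  False, True):  0.8, (True,  False, False): 0.4,
    (False, True,  True):  0.7, (False, True,  False): 0.3,
    (False, False, True):  0.4, (False, False, False): 0.1,
}

def p_milk_carton(p_wf: float, p_sd: float, p_ap: float) -> float:
    """Marginalize the table over the parent evidences, each available only
    as a probability (p_wf, p_sd, p_ap) from the lower layer."""
    total = 0.0
    for wf in (True, False):
        for sd in (True, False):
            for ap in (True, False):
                prior = ((p_wf if wf else 1 - p_wf) *
                         (p_sd if sd else 1 - p_sd) *
                         (p_ap if ap else 1 - p_ap))
                total += MILK_CARTON_CPT[(wf, sd, ap)] * prior
    return total

# Example: strong wireframe and color evidence, weak shape evidence.
print(p_milk_carton(0.9, 0.3, 0.8))
```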

Figure 2. Evidence Structure and Probability calculation

Each sub-evidence in the base layer is composed of selected type-specific probability figures combined by a pre-defined Bayesian table. These selections are chosen to achieve the sufficient conditions required to positively prove that this pose is the candidate one. Obviously, there can be more than one sufficient condition, so we may obtain several pose probabilities out of a particular sub-evidence; we simply choose the highest of them and propagate it upward in the structure.

IV. MULTIPLE INTERPRETATION GENERATION

In real-life applications, visual recognition may sometimes need to rely on features that are not powerful enough for a unique and crisp decision. For instance, visual recognition of such home objects as tables, dishwashers, refrigerators, TVs, milk boxes and books, i.e., objects of a polyhedral shape with little texture, may need to rely on line-based features representing the boundaries of the objects and/or their parts. In this case, the quality of the detected features may vary, sometimes becoming very poor, depending on the illumination, viewpoint and distance at the time of detection, possibly resulting in many uncertainties. Furthermore, besides uncertainties, depending on the complexity of the feature used for matching, much ambiguity may arise in decision-making; e.g., the parallel and intersecting 3D lines used in this work can produce many possibilities for the object pose, as depicted by multiple interpretations [13]. The following is the process of generating multiple interpretations using 3D lines, the 3D shape descriptor and SIFT, evaluated by color signatures.

Figure 3. Multiple Interpretation Generation Flow chart

The major functions of the multiple interpretation generation module are as follows:

- An optimal set of features among 3D lines, SIFT and the shape descriptor is used for initially generating interpretations.
- The appearance vector (color signatures) is used as supporting evidence.
- Bayesian tables are used for probability calculation under the sufficient condition for recognition.

The first thing to consider is how to generate multiple interpretations covering all possible locations of the target object in 3D space based on these features. Below we describe the process of calculating the interpretation probability for each evidence, one by one.

A. Line-Based Probability Calculation

The 3D line-based interpretation probability combines the probabilities of sufficient wireframe lines using a Bayesian table. These wireframe line probabilities are computed by matching every model line with the scene lines using the following equations:

$$P_{line} = \frac{1}{N}\sum_{i=0}^{N-1}\Big[1-(1-E_{distance})(1-E_{\theta})\big(1-P_{coverage}\,P_{length}\big)\Big] \qquad (1)$$

$$E_{distance}=\min\!\Big(1,\frac{d_{power}}{d_i}\Big),\quad E_{\theta}=\min\!\Big(1,\frac{\tan\theta_{power}}{\tan\theta_i}\Big),\quad P_{coverage}=\Big(\frac{L^{projection}_{i,\,XY}}{L_{line,\,XY}}\Big)^{2},\quad P_{length}=\exp\!\Big(-\frac{1}{8}\Big(\frac{1-L_i/L_{power}}{L_{sigma}}\Big)^{2}\Big) \qquad (2)$$

where, for the i-th matched line pair, $d_i$ is the distance and $\theta_i$ the angle between the model and scene lines, the lengths $L$ are measured in the XY plane, and the $power$ and $sigma$ subscripts denote matching parameters.

In our approach, 3D lines are estimated from stereo images and 3D point clouds as described in [14]. 3D line features are invariant to translation, orientation and viewpoint; thus the total number of interpretations is much smaller than in conventional 2D feature-based approaches, and each interpretation is less likely to be corrupted by spurious features.
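As a minimal sketch of the line scoring, assuming our reconstruction of Eq. (1) above is correct, the following averages the per-line match scores over the N wireframe lines; the sub-scores of Eq. (2) are assumed to be precomputed and passed in as arrays, and the function name is illustrative.

```python
import numpy as np

def line_probability(e_dist, e_theta, p_cov, p_len) -> float:
    """Eq. (1) as reconstructed above: mean over the N wireframe lines of
    1 - (1 - E_distance)(1 - E_theta)(1 - P_coverage * P_length).
    Each argument is a length-N array of per-line sub-scores in [0, 1]."""
    e_dist, e_theta = np.asarray(e_dist), np.asarray(e_theta)
    p_cov, p_len = np.asarray(p_cov), np.asarray(p_len)
    per_line = 1.0 - (1.0 - e_dist) * (1.0 - e_theta) * (1.0 - p_cov * p_len)
    return float(per_line.mean())

# Example: three matched lines with good distance/angle agreement.
print(line_probability([0.9, 0.8, 0.7], [0.9, 0.9, 0.6],
                       [0.8, 0.9, 0.5], [0.9, 0.7, 0.8]))
```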

Figure 4. Illustration of sub-interpretation generation. [Upper: interpretations with probabilities $P(x, H_m, O \mid F)$; lower: sub-interpretations $P(x, H_m, O \mid F_k)$ over features $F_1, F_2, \ldots, F_K$.]

The upper part of Fig. 4 shows how multiple interpretations are generated, with weights, using lines only, and the lower part shows that each interpretation can be further represented by sub-interpretations. Each interpretation can be represented, mathematically, as:

$$P(x, H_m, O \mid F) = P(H_m, O \mid F)\, P(x \mid F, H_m, O) \qquad (3)$$

where $P(H_m, O \mid F)$ is the probability that $F$ represents hypothesis $H_m$ of the target object $O$, and $P(x \mid F, H_m, O)$ is the probability that hypothesis $H_m$ of the target object $O$ is located at $x$ given $F$.


B. 3D Shape Descriptor Based Probability Calculation

Initially we perform a 3D segmentation based on octree clustering [15]. Each segmented octree area is used to generate an interpretation of the target object. For calculating the probability of each segmented octree area, we consider three factors:

- Aspect ratio: the ratio between the model and the cuboid from the scene data with respect to width, height and depth; it compares the volume of the model and the current segmented octree cell.
- Fill factor: determines how many octree cells can possibly be fitted inside the model; it is the ratio between the number of the model's octree cells and the segmented area's octree cells.
- Planarity of octree cells: the ratio between planar octree cells and the segmented octree area's cells.

Using pre-defined Bayesian tables, we calculate the probability of the 3D shape descriptor based on these three factors' probabilities. The probability calculation is shown in Fig. 5 below.

Figure 5. Multiple interpretation generation and probability calculation based on the 3D shape descriptor
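Since the three shape-descriptor factors are plain ratios, they can be sketched as follows; the OctreeStats container and its field names are our assumptions, not the paper's data structures.

```python
from dataclasses import dataclass

@dataclass
class OctreeStats:
    volume: float        # volume of the bounding cuboid
    n_cells: int         # number of occupied octree cells
    n_planar_cells: int  # cells whose points fit a plane

def shape_descriptor_factors(model: OctreeStats, segment: OctreeStats):
    """Return (aspect ratio, fill factor, planarity), each in [0, 1].
    Aspect ratio compares model and segment cuboid volumes; fill factor
    compares cell counts; planarity is the planar-cell fraction."""
    aspect = min(segment.volume / model.volume, model.volume / segment.volume)
    fill = min(1.0, segment.n_cells / model.n_cells)
    planarity = segment.n_planar_cells / segment.n_cells
    return aspect, fill, planarity
```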

C. Appearance Vector Based Probability Calculation

After performing segmentation using 2D and 3D information, the input feature vector x can be generated based on the color and 3D location of the pixels in the segmented region. In the color case, the input feature vector is constructed from the distribution of pixels in RGB color space:

- Directions of eigenvectors (6 variables): the angles of each eigenvector rotated from the R- and the B-axis.
- Normalized eigenvalues (3 variables): these values give information on the relative variance along the above eigenvectors.
- Mean in RGB color space (3 variables).

Once x is computed, it is mapped into a feature space as:

$$y = W^{T} x$$

where W is computed by saturated biased discriminant analysis, a feature extraction method for one-class classification [13]. The probability Pr(O|x) to be used in the recognition framework is computed as:

$$\Pr(O \mid x) = \exp\!\Big(-\frac{dist(y, m_x)}{10\, d_0}\Big) \qquad (4)$$

where $m_x$ and $d_0$ are the mean of the positive samples in the feature space and the maximum distance of the positive samples from $m_x$, respectively.
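A sketch of building the 12-variable color input x and scoring it with Eq. (4) as reconstructed above; W, m_x and d_0 are assumed to be available from offline training of the one-class classifier, and the function names are illustrative.

```python
import numpy as np

def appearance_vector(rgb_pixels: np.ndarray) -> np.ndarray:
    """Build the 12-dim input x from segmented pixels (N x 3, RGB):
    6 eigenvector direction angles + 3 normalized eigenvalues + 3 means."""
    mean = rgb_pixels.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(rgb_pixels, rowvar=False))
    # Angle of each (unit) eigenvector from the R-axis and from the B-axis.
    angles = np.array([[np.arccos(np.clip(v[0], -1.0, 1.0)),
                        np.arccos(np.clip(v[2], -1.0, 1.0))]
                       for v in evecs.T]).ravel()
    return np.concatenate([angles, evals / evals.sum(), mean])

def appearance_probability(x, W, m_x, d0) -> float:
    """Eq. (4) as reconstructed: map x by y = W^T x, then score
    exp(-dist(y, m_x) / (10 * d0)) with trained m_x and d0."""
    y = W.T @ x
    return float(np.exp(-np.linalg.norm(y - m_x) / (10.0 * d0)))
```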

D. SIFT Based Probability Calculation

The object's hypothetical pose (i.e., interpretation) can also be generated by calculating a transformation between the SIFT features measured in the current image frame and the corresponding ones stored in the database [16]. The similarity weight, or probability, of the SIFT features for the j-th object location, $w_j$, is denoted as:

$$w_j = \frac{N_{matched\_SIFT}}{N_{total\_SIFT}}$$

where $N_{matched\_SIFT}$ indicates the number of SIFT features corresponding to the SIFT model stored in the database, and $N_{total\_SIFT}$ is the total number of SIFT model features, as shown in Fig. 6.

Figure 6. Multiple interpretation generation based on SIFT
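The SIFT weight is simply the matched fraction, so a one-line sketch suffices (the zero-guard is our addition):

```python
def sift_weight(n_matched: int, n_total: int) -> float:
    """w_j = N_matched_SIFT / N_total_SIFT for the j-th hypothesized location."""
    return n_matched / n_total if n_total else 0.0
```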

E. Combined Probability Calculation

After the probability of each interpretation has been calculated by each of the features described above, we use the pre-defined evidence structure, with its supporting Bayesian tables, to combine these probabilities into one figure that represents the overall probability of positively detecting the target object, together with an estimate of its pose.

Figure 7. Multiple interpretation generation based on SIFT, lines and the 3D shape descriptor

The total probability of the target object being recognized is calculated as follows (based on the pre-defined evidence structure):

$$\begin{aligned} P(MC_T) &= P\big(MC_T \,\big|\, SIFT_{\{T,F\}}\ \text{OR}\ [\,WF_{\{T,F\}}\ \text{AND}\ SD_{\{T,F\}}\ \text{AND}\ AP_{\{T,F\}}\,]\big) \\ &= \max\big[\,P(MC_T \mid SIFT_{\{T,F\}}),\ P(MC_T \mid WF_{\{T,F\}}\ \text{AND}\ SD_{\{T,F\}}\ \text{AND}\ AP_{\{T,F\}})\,\big] \end{aligned} \qquad (5)$$

Note: $P(O_T) = P(O_T \mid E_T)\,P(E_T) + P(O_T \mid E_F)\,P(E_F) = u\,P(E_T) + v\,(1 - P(E_T))$;

where u and v are taken from the Bayesian table and $P(E_T)$ comes from the hierarchical computation of the evidence probabilities.

SD: 3D shape descriptor; AP: appearance vector; WF: wire frame; MC: milk carton; E: evidence; O: target object.

Figure 8. Total Probability Calculation
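A sketch of the top-level combination, assuming our reconstruction of Eq. (5) is correct: each branch probability is first obtained through its Bayesian table (e.g. with the p_milk_carton sketch earlier for the WF/SD/AP branch), and the stronger branch wins; node_probability shows the per-node rule from the Note. Both function names are our own.

```python
def node_probability(u: float, v: float, p_e_true: float) -> float:
    """Per-node rule from the Note of Eq. (5):
    P(O_T) = u * P(E_T) + v * (1 - P(E_T)), with u, v read from the table."""
    return u * p_e_true + v * (1.0 - p_e_true)

def total_probability(p_sift_branch: float, p_wf_sd_ap_branch: float) -> float:
    """Eq. (5) as reconstructed: the OR of the SIFT branch and the
    WF-AND-SD-AND-AP branch is resolved by taking the maximum."""
    return max(p_sift_branch, p_wf_sd_ap_branch)
```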

A big challenge in generating multiple interpretations is determining how to verify each interpretation with additional image features as supporting evidences. Most of the initial hypothesized interpretations are inaccurate because the correspondences between the model feature sets and the image feature sets are usually incorrect. Thus, our approach ranks interpretations in a probabilistic manner using the Bayesian rule. In order to take into account the uncertainty in the values measured in the image, we represent each interpretation as a region in the pose space rather than a point in that space. This approach is similar to that in [17], with the difference that I. Shimshoni et al. approximate the uncertainty region as a uniform pdf, whereas we approximate it as a Gaussian pdf, which is more appropriate because it models the phenomenon well. Consequently, each interpretation is represented as a Gaussian pdf with a certain probability weight.

V. EVIDENCE FILTERING

A reliable 3D recognition and pose estimation cannot be achieved from a single observation, because the condition of the target being in the proper viewpoint of, and at the proper distance from, the camera may not be satisfied when recognition is ordered. To overcome this challenge, we can make use of consecutive observations from the moving robot, which may observe the same object measured at different times and from different views. The observed information, however, involves uncertainties resulting from measurement noise and ambiguities of the multiple interpretations due to the incompleteness of the measurements. In order to reduce these uncertainties, we need to fuse the information from both the propagation and the current new observation; for this reason we propose a novel two-layered particle filter to implement this fusion process.

Figure 9. Two-Layered Particle Filter (Flow chart)

First of all, weak multiple interpretations are generated only at the first glimpse of the expected target object in the scene. These weak interpretations are further strengthened by collecting additional evidences as the robot moves in the environment. A probability weight is assigned to each interpretation as further evidences are collected, before they are fed to the particle filter as initial super-particles. In the lower layer of the particle filter, these particles (the selected multiple interpretations) are propagated as the robot moves and/or the camera pans and/or tilts in search of the target object in the environment. When the robot moves to a new position or a pan/tilt occurs, the new FOV is determined, and the observed and propagated particles are separated from those that are not observed at the new place.

After we obtain new observations and a new set of multiple interpretations at position R(t), we calculate the support of particles for one another, and hence their weights, using the Mahalanobis distance measure. At this step, fusion of the propagated and newly observed particles is performed along with the weight calculation, and the results are maintained in the upper layer of the particle filter. At the end, a resampling step is performed, and the particle with the highest weight is selected as the required target object pose. Details of the algorithm and of the propagation and fusion of interpretations are shown in Fig. 9 and Fig. 10, respectively. For a detailed description of the TLPF, refer to [18,19].

Figure 10. Super-particle propagation and fusion in TLPF

VI. EXPERIMENTAL RESULTS AND CONCLUSION

We tested our recognition algorithm on sequential images in different environments with varying degrees of illumination, distance, etc., and on different objects varying in color, texture, size, etc. We integrated the two-layered particle filtering into the HomeMate robot, and tested the proposed method with a stereo camera and the HomeMate mobile robot in a cluttered domestic environment including textured and textureless objects. The experimental results show that our dependable recognition approach achieves good performance.

Figure 11. HomeMate Robot in ISRI, SKKU

Fig. 11 shows the HomeMate robot used in this work serving the author in the Intelligent Systems Research Institute (ISRI), SungKyunKwan University, Korea.

Fig. 12 and Fig. 14 show the scenario of the HomeMate working process. Fig. 12 shows the recognition results for a cucumber shown in two different views. Although it has no perfectly straight lines, the cucumber is faithfully recognized by our algorithm. In Fig. 14, where the banana milk box (yellow) is the target object, the robot does not recognize the target object at the initial steps, i.e., in the first image frame. However, the target object is recognized with a converged pose estimate through evidence filtering in the particle filtering framework as the robot moves and more image frames are received.

Figure 12. Cucumber Recognition Results.

Figure 13. Probability Evolution Results

The probability evolution of the recognition process is shown in Fig. 13, where probability 0 means no recognition result yet due to distance, occlusion, etc. The book converges to an accurate pose with probability 0.946 in a sequence of 20 images; the milk box converges with probability 0.944 in 11 images; and the milk box with occlusion converges in 14 images with probability 0.930.

Figure 14. Sequential Recognition Results with HomeMate Robot

Finally, we conclude that 3D object recognition and pose estimation in a sequence of images are basic prerequisites for robots to operate properly in cluttered environments. With so many variations to take care of, real-world recognition is not a simple "one-shot" matter; we have to rely on multiple evidences from a spatio-temporal domain with probabilistic logic. The proposed cognitive recognition is an attempt to move in this direction, demonstrating promising results.

REFERENCES

[1] M. Kojima, K. Okada, M. Inaba, "Manipulation and recognition of objects incorporating joints by a humanoid robot for daily assistive tasks," in Intelligent Robots and Systems (IROS 2008), IEEE/RSJ, 2008.

[2] H. Wenze and Z. Song-Chun, “Learning a probabilistic model mixing 3d and 2d primitives for view invariant object recognition,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 2273–2280.

[3] G. Taylor and L. Kleeman, "Fusion of multimodal visual cues for model-based object tracking," in Australasian Conference on Robotics and Automation (ACRA 2003), Brisbane, Australia, December 2003.

[4] P. David, D. DeMenthon, R. Duraiswami, and H. Samet, "SoftPOSIT: Simultaneous pose and correspondence determination," Computer Vision - ECCV 2002, Part III, vol. 2352, pp. 698–714, 2002.

[5] F. Saidi, O. Stasse, and K. Yokoi, “Active visual search by a humanoid robot,” Lecture Notes in Control and Information Sciences, vol. 370, pp. 171–184, 2008.

[6] J. Ma and J. W. Burdick, “A probabilistic framework for stereo-vision based 3d object search with 6d pose estimation,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on, 2010, pp. 2036–2042.

[7] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," SRI International; Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.

[8] A. Collet, M. Martinez, and S. S. Srinivasa, "The MOPED framework: Object Recognition and Pose Estimation for Manipulation," The International Journal of Robotics Research, 2011.

[9] M. Quigley, S. Batra, S. Gould, E. Klingbeil, Q. Le, A. Wellman, and A. Y. Ng, “High-accuracy 3d sensing for mobile manipulation: Improving object detection and door opening,” in Robotics and Automation, 2009. ICRA ’09. IEEE International Conference on, May 2009, pp. 2816 –2822.

[10] B. Sapp, A. Saxena, and A. Y. Ng, “A fast data collection and augmentation procedure for object recognition,” in Proceedings of the 23rd national conference on Artificial intelligence - Volume 3, ser. AAAI’08, 2008, pp. 1402–1408.

[11] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, "Fast 3D Recognition and Pose Using the Viewpoint Feature Histogram," Willow Garage.

[12] M. Muja, R. B. Rusu, G. Bradski, and D. G. Lowe, "REIN - A Fast, Robust, Scalable REcognition Infrastructure," University of British Columbia, Canada, and Willow Garage, ICRA 2011.

[13] Z. Lu, S. Lee, and H. Kim, “Probabilistic 3d object recognition based on multiple interpretations generation,” in Asian Conference of Computer Vision ACCV 2010, 2010, vol. 6495, pp. 333–346.

[14] Zhaojin Lu, Seungmin Baek, and Sukhan Lee, "Robust 3D Line Extraction from Stereo Point Clouds," IEEE Conference on Robotics, Automation and Mechatronics, 2008.

[15] Jaewoong Kim and Sukhan Lee; “Fast Neighbor Cells Finding Method for Multiple Octree Representation”, IEEE International Symposium on Computational Intelligence in Robotics and Automation, 2009.

[16] Jeihun Lee, Seung-Min Baek, Changhyun Choi, and Sukhan Lee, "A Particle Filter Based Probabilistic Fusion Framework for Simultaneous Recognition and Pose Estimation of 3D Objects in a Sequence of Images."

[17] I. Shimshoni and J. Ponce. Probabilistic 3d object recognition. International Journal of Computer Vision, 36(1):51–70, 2000.

[18] Zhaojin Lu, "Robust 3D Object Recognition and Pose Estimation Based on Particle Filtering Fusion," Ph.D. dissertation, Department of Electronic and Electrical Engineering, The Graduate School, Sungkyunkwan University, Korea.

[19] Sukhan Lee and Zhaojin Lu, "Dependable 3D Recognition with Two-layered Particle Filter," in Proc. ICUIMC '11, Article No. 37.