
Simultaneously learning to recognize and control a low-cost robotic arm

Fredrik Larsson *, Erik Jonsson, Michael Felsberg
Computer Vision Laboratory, Department of Electrical Engineering, Linköping University, 581 83 Linköping, Sweden

Article info

Article history: Received 25 February 2008; Received in revised form 25 January 2009; Accepted 8 April 2009

Keywords: Visual servoing; LWPR; Gripper recognition; Jacobian estimation


Abstract

In this paper, we present a visual servoing method based on a learned mapping between feature space and control space. Using a suitable recognition algorithm, we present and evaluate a complete method that simultaneously learns the appearance and control of a low-cost robotic arm. The recognition part is trained using an action precedes perception approach. The novelty of this paper, apart from the visual servoing method per se, is the combination of visual servoing with gripper recognition. We show that we can achieve high precision positioning without knowing in advance what the robotic arm looks like or how it is controlled.


1. Introduction

Low-cost robotic systems are becoming increasingly available. This requires appropriate methods to control the system despite limitations such as weak servos, no joint-feedback and hysteresis. Classical methods based on modeling the inverse kinematics are unable to cope with these added challenges. In this paper, we show that high accuracy positioning can nevertheless be achieved with inexpensive hardware.

In our work, we do not assume that the appearance of the robotic arm is known in advance, which means that the system simultaneously needs to learn what constitutes the robotic arm, how to recognize the end-effector and how to control it. We have included a heuristic detection and recognition algorithm in Appendix A to be able to present a complete method. Learning detection and recognition is achieved by an action precedes perception approach [11] where we are using a simulated palmar grasp reflex [23].

To be able to control the robotic arm we use visual servoing based on learning a mapping between feature space and control space. We show both in simulations and in real-world experiments that we can achieve high accuracy. A Lynx-6 low-cost robotic arm, see Fig. 1, has been used for the real-world experiments.

The paper is organized as follows:

• Section 2 gives a brief overview of related work on recognition, learning robot control and visual servoing.


• Section 3 deals with learning the control of the low-cost robotic arm. In earlier work, we have shown that by using visual servoing based on Locally Weighted Projection Regression (LWPR) [29] we can achieve high precision positioning [18]. The positioning was accurate up to the noise in the detection of position. In this paper, we have replaced the position based visual servoing used in Ref. [18] with image based visual servoing.

• In Section 4, we present real-world experiments which show that we can achieve accuracy that is sufficient for simple assembling tasks by combining automatic recognition and visual servoing based on LWPR. For the real-world experiments a five DOF robotic arm of Lynx-6 type [2] has been used.

• Section 5 contains a discussion which reflects advantages and drawbacks of the proposed method.

• Appendix A presents our method for detecting and recognizing the gripper¹ which has been used in the experiments.

2. Related work

In this paper, we address the problem of controlling a robotic arm by visual servoing without knowing in advance what the robotic arm looks like. The different aspects of this problem, i.e. uncalibrated visual servoing and generic visual recognition, have been considered separately in the literature. However, we are not aware of any paper that performs visual servoing without actually knowing what the robotic arm looks like.

The approaches by Ognibene et al. [20] and Butz et al. [5] are similar to ours in that they use motor babbling, i.e.

¹ In this paper, we use the term end-effector synonymously with the term gripper.


Fig. 1. The Lynx-6 low-cost robotic arm used in real-world evaluation.


spontaneous random movements, to learn how to associate the limbs' final position with motor commands. Ognibene et al. use pre-training of the actuator by motor babbling. However, in their experiments they use a simulator and do not have to deal with the problem of learning how to recognize the hand, i.e. they are in fact fully aware of what the robotic arm looks like. Butz et al. are using a hierarchical neural network structure to learn the inverse kinematics and how to resolve redundancies. However, they do not deal with the problem of learning how to acquire the information regarding the end-effector configuration.

Toussaint and Goerick [26] present an alternative approach to the control problem. They are using dynamic Bayesian networks to infer and emit control signals, contrary to the more traditional use of modeling observed data. By this approach they attack the problem of motor planning. They do not address the issue of learning how to acquire information of the robot's configuration.

Jägersand and Nelson [13] are performing combined visual model acquisition and agent control. They do not explain in detail how they analyze the visual scene, but they mention a template matching tracker that tracks surface markers as well as a special purpose tracker which tracks attached targets or small lights. From their cited technical report [12], it is clear that they are tracking predefined features, such as attached light bulbs or markers.

Visual servoing based on an estimated inverse image Jacobian is a well-established technique, but most reported experiments are using prior knowledge about the appearance of the robotic arm, e.g. markers, or just computer simulations. Siebel and Kassahun [24] and Buessler and Urban [4] are using neural networks for learning visual servoing. Siebel and Kassahun are reporting real-world experiments where they use a robotic arm fitted with circular markers while Buessler and Urban do not present how they obtain the description of the end-effector.

Farahmand et al. [7] propose two methods for globally estimating the visual-motor Jacobian. Their first method uses a k-nearest neighbor regressor on previously estimated local models to estimate the Jacobian for a previously unseen point. If the estimated Jacobian differs more than a certain threshold from Jacobians that are already in a database, it is added to the database. The second method is based on a local least-squares method. They opt to keep the history of all robot movements and to estimate the Jacobian from this data when it is needed. For experiments, they use MATLAB simulations where the features tracked are the projection of the end-effector position.

3. Learning to control a robotic arm

Since we do not assume that the appearance of the robotic arm is known beforehand, the first thing we need to do is to learn how to recognize the end-effector. Once this is achieved we can focus on learning how to control the robotic arm. We discuss the general prerequisites needed for a recognition algorithm in Section 3.1 and we have included a heuristic recognition method in Appendix A that fulfills these requirements. In Section 3.2, we describe how we can learn a mapping from feature space to control space. How this mapping can be used for visual servoing is discussed in Section 3.3. In Section 4.3, we show in real-world experiments that we can achieve good accuracy by combining the autonomous recognition and learning of control.

3.1. Requirements for recognition of an end-effector with unknown appearance

In this section, we discuss the requirements we need to impose on a method that autonomously detects and recognizes the end-effector. What we need is an algorithm that, given an image of the robotic arm, returns a vector that describes the configuration. Ideally, we would get the image coordinates of sufficiently many interest points, e.g. the position of each joint and the tip of the end-effector, to be able to uniquely determine the configuration.

If we were to choose manually, the intuitive thing to do would be to choose a number of physical features, e.g. the tip of the end-effector, that we track through subsequent frames. Since we do not manually choose which physical features to track, we might end up using interest points that are counterintuitive – in the sense that it is hard for a human operator to specify how to position these points in order to be able to manipulate an object. This makes it impossible to hard-code how to grip objects. For a learning system this is of no concern, since the system will learn how to position these partly arbitrary points in order to manipulate objects. We are only concerned with obtaining a consistent estimate of the configuration. By consistent we mean that whenever the robotic arm is in a given configuration, say c1, we should end up with the same description of this configuration. Assume for the moment that the description of the configuration consists of a single interest point, p, that in configuration c1 corresponds to the physical feature f1. It is fully acceptable if we in another configuration c2 match our tracked point to another physical feature f2. What we do require is that every time we are in c1 we match p with f1 and every time we are in c2 we match p with f2.

In Appendix A, we have included a heuristic recognition method, in order to be able to evaluate a self-contained method that simultaneously learns the appearance and the control of a robotic arm. This method is used to recognize the end-effector of a robotic arm without specifying its shape, size, color or texture. The only assumptions we make are that the end-effector is an articulated object and that we know the motor command that controls the opening and closing of the end-effector. These assumptions are used for generating training data that we will use for learning recognition.

The method described in Appendix A is based on template matching [3,27]. Instead of template matching we could use other features, e.g. SIFT features [19] or channel coded feature maps [14]. In that case, the extraction of template patches, Section A.2, should be replaced by extraction of the chosen feature. However, the restriction to features within the segmented regions of interest should be kept.


Fig. 2. Illustration of the closed loop control scheme. The box denoted J(x, y) corresponds to Algorithm 1. We use the notation x for the estimated configuration, x_w for the target configuration and y for the control signal.

² J is sometimes denoted the inverse image Jacobian or visual-motor Jacobian. We will simply use the term Jacobian in this paper.


3.2. Locally weighted projection regression

We give a brief introduction to LWPR and introduce the minimum of details needed in order to be able to explain our visual servoing approach. For a detailed description, we refer the interested reader to [29]. LWPR is an incremental local learning algorithm for non-linear function approximation in high dimensional spaces and has successfully been used in learning robot control [29,28,22].

The key concept in LWPR is to approximate the underlying function by local linear models. The LWPR model automatically updates the number of receptive fields (RFs), i.e. local models, as well as the location (which is represented by the RF center c) of each RF. The size and shape of the region of validity (decided by the distance metric D) of each RF is updated continuously based on the performance of each model. Within each local model an incremental version of weighted partial least-squares (PLS) regression is used.

LWPR uses a non-normalized Gaussian weighting kernel to calculate the activation or weight of RF_k (the subscript k will be used to denote that the particular variable or parameter belongs to RF_k) given query x according to

w_k = \exp\left(-\frac{(c_k - x)^T D_k (c_k - x)}{2}\right).   (1)

Note that (1) can be seen as a non-regular channel representation of Gaussian type if the distance metric D_k is equal for all k [9].

The output of RF_k can be written as a linear mapping

y_k = A_k x + b_{k,0},   (2)

where A_k and b_{k,0} are known parameters acquired through the incremental PLS. The incremental PLS bears a resemblance to incremental associative networks [15], one difference being the use of subspace projections in PLS.

The predicted output y of the LWPR model is then given as the weighted output of all RFs according to

y = \frac{\sum_{k=1}^{K} w_k y_k}{\sum_{k=1}^{K} w_k}   (3)

with K being the total number of RFs.

We have been using LWPR to learn the mapping between the configuration x of the end-effector and the control signals y. All training data was acquired through image processing since no joint-feedback was available from the robotic arm that has been used. To improve accuracy we have combined the moderately trained LWPR model with visual servoing. That is, we perform the first move of the robotic arm by querying the LWPR model for the appropriate control signal. Then we estimate the deviation from the target configuration and correct the control signal by using visual servoing.
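To make the prediction step concrete, the following sketch evaluates (1)–(3) for an already trained set of receptive fields. It is not the LWPR library itself; the arrays centers, metrics, A and b (holding c_k, D_k, A_k and b_{k,0}) are assumed to come from a previously fitted model.

```python
import numpy as np

def lwpr_predict(x, centers, metrics, A, b):
    """Weighted prediction of Eqs. (1)-(3) from fitted local models.

    x       : (d,)       query configuration
    centers : (K, d)     receptive field centers c_k
    metrics : (K, d, d)  distance metrics D_k
    A       : (K, m, d)  local linear maps A_k
    b       : (K, m)     offsets b_{k,0}
    """
    diffs = centers - x
    # Eq. (1): non-normalized Gaussian activation of each receptive field
    w = np.array([np.exp(-0.5 * d @ D @ d) for d, D in zip(diffs, metrics)])
    # Eq. (2): output of each local linear model
    y_k = np.einsum('kmd,d->km', A, x) + b
    # Eq. (3): activation-weighted average of the local predictions
    return (w[:, None] * y_k).sum(axis=0) / w.sum()
```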

3.3. Visual servoing based on LWPR

We use visual servoing [6,17] to minimize the quadratic norm of the deviation vector Δx = x_w − x, where x denotes the reached configuration and x_w denotes the desired configuration of the end-effector. The optimization criterion can thus be written as

\min \|\Delta x\|^2.   (4)

If the current position with deviation Δx_i originates from the control signal y_i, the new control signal is, in accordance with Newton methods, given as

y_{i+1} = y_i - J \Delta x_i,   (5)

where the Jacobian J is the linear mapping that maps changes Δx in configuration space to changes Δy in control signal space.² When the Jacobian has been estimated, the task of correcting for an erroneous control signal is in theory straightforward. The process of estimating J and updating the control signal is performed in a closed loop until some stopping criterion, e.g. a small enough deviation from the target position, has been fulfilled. The entire control scheme is illustrated in Fig. 2. In our case, we get the first control signal from the trained LWPR model and the visual servoing loop is activated after the first move.

Using LWPR as a basis for visual servoing is straightforward for the first iteration. The trained LWPR model gives a number of local linear models from which the Jacobian can be estimated.

According to (2) each y_k can be written as

y_k = A_k (x - x_{k,c}) + b_{k,0}   (6)

leading to

w_k y_k = e^{-\frac{1}{2}(x - c_k)^T D_k (x - c_k)} \left( A_k (x - x_{k,c}) + b_{k,0} \right).   (7)

The derivatives \frac{dw_k}{dx} and \frac{d(w_k y_k)}{dx} are

\frac{dw_k}{dx} = -(x - c_k)^T D_k w_k,   (8)

\frac{d(w_k y_k)}{dx} = -y_k (x - c_k)^T D_k w_k + w_k A_k.   (9)

By setting g = \sum_{k=1}^{K} w_k y_k and h = \sum_{k=1}^{K} w_k, see (3), and by using the quotient rule, \frac{dy}{dx} can be written as

\frac{dy}{dx} = \frac{d}{dx}\left(\frac{g}{h}\right) = \frac{1}{h^2}\left(\frac{dg}{dx}\,h - g\,\frac{dh}{dx}\right) = \frac{1}{h}\left(\frac{dg}{dx} - y\,\frac{dh}{dx}\right)   (10)

giving

\frac{dy}{dx} = \frac{\sum_{k=1}^{K}\left(-y_k (x - c_k)^T D_k w_k + w_k A_k\right) - y \sum_{k=1}^{K}\left(-(x - c_k)^T D_k w_k\right)}{h}   (11)

ultimately leading to the expression

J(x, y) = \frac{dy}{dx} = \frac{\sum_{k=1}^{K} w_k \left( A_k + (y - y_k)(x - c_k)^T D_k \right)}{\sum_{k=1}^{K} w_k}.   (12)
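As a concrete reading of (12), the sketch below evaluates the Jacobian directly from the same fitted local models used in the prediction sketch above; the variable names are the same assumptions as before, not part of the LWPR library.

```python
import numpy as np

def lwpr_jacobian(x, y, centers, metrics, A, b):
    """Eq. (12): Jacobian of the LWPR prediction at query x with prediction y."""
    diffs = x - centers                                     # x - c_k for each RF
    w = np.array([np.exp(-0.5 * d @ D @ d) for d, D in zip(diffs, metrics)])
    y_k = np.einsum('kmd,d->km', A, x) + b                  # local outputs, Eq. (2)
    # Weighted sum of A_k plus the correction term (y - y_k)(x - c_k)^T D_k
    num = sum(wk * (Ak + np.outer(y - yk, dk @ Dk))
              for wk, Ak, yk, dk, Dk in zip(w, A, y_k, diffs, metrics))
    return num / w.sum()
```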


Once we have an estimate of J we use (5) to obtain the corrected control signal y_{i+1}. We use this control signal to move the robotic arm and estimate the new deviation from the target. If none of our stopping criteria have been met we need to reestimate the Jacobian and apply (5) to obtain the new estimate y_{i+2}. In order to estimate the new Jacobian according to (12) we need the configuration x that results in the control signal y when used as input to LWPR. But we only know this relationship for the first visual servoing iteration since, for subsequent iterations, our control signal was obtained by (5) and not as the result of an input to the LWPR model. We propose a static and an approximative updating approach to solve this problem.

3.3.1. Static approach

The simplest solution is the static approach. The Jacobian is simply not updated and the Jacobian used in the first step is (still) used in the following steps. It should be noted that this approach can be expected to work only if the first estimation of the Jacobian points in the right direction. Still, this approach works fairly well (see Section 4). However, for a poorly trained LWPR model one can expect the static approach to be less successful.

3.3.2. Approximative updating approach

The somewhat more complex solution treats the LWPR model as if it was exact. This means that we use the reached position as query and estimate the Jacobian for this configuration. The pseudo-code is given in Algorithm 1. The wanted configuration is denoted x_w and y = LWPR(x) means the output from the trained LWPR model given query x. A threshold ε is used to terminate the visual servoing loop if the deviation is small enough. The procedure is also explained in Fig. 3.

Algorithm 1: Approximative updating of the Jacobian

1: y_1 = LWPR(x_w)
2: Estimate the reached configuration x_1
3: y_2 ← y_1 − J(x_w, y_1)(x_w − x_1)
4: for k = 2 to the maximum number of iterations do
5:   Estimate the reached configuration x_k
6:   if ||x_w − x_k||² > ε then
7:     y_{k+1} ← y_k − J(x_k, LWPR(x_k))(x_w − x_k)
8:   else
9:     done
10:  end if
11: end for
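The loop in Algorithm 1 can be summarized in a few lines of code. The sketch below is an illustration only; lwpr_predict_fn, jacobian_fn and move_and_observe are hypothetical callables standing in for the trained model, Eq. (12), and the camera-based configuration estimate.

```python
import numpy as np

def servo_to_target(x_w, lwpr_predict_fn, jacobian_fn, move_and_observe,
                    eps=1.0, max_iter=10):
    """Approximative-updating visual servoing loop (sketch of Algorithm 1)."""
    y = lwpr_predict_fn(x_w)                 # first control signal from LWPR
    x = move_and_observe(y)                  # reached configuration x_1
    for _ in range(max_iter):
        if np.linalg.norm(x_w - x) <= eps:   # stopping criterion
            break
        # Treat the LWPR model as exact: evaluate the Jacobian at the reached x
        J = jacobian_fn(x, lwpr_predict_fn(x))
        y = y - J @ (x_w - x)                # Newton-style correction, Eq. (5)
        x = move_and_observe(y)
    return y, x
```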

4. Results

This section is divided into three parts. First, in Section 4.1 we make a comparison between LWPR and position based visual servoing on simulated data. We assume that the appearance of the end-effector is known in advance and we use 3D coordinates. In this case the accuracy of our visual servoing approach is limited by noise in the estimated position. In Section 4.2, we confirm that these results are valid on real-world data by showing that the accuracy (once again) is limited by the noise level also in this case. In the last experiment, Section 4.3, we present results from image based servoing in combination with the autonomous recognition of the end-effector, as described in Appendix A. We show that we can achieve sufficient accuracy for basic assembly tasks.

For all tests we use the reduced 3D task space, denoted 2D+, defined in the COSPAL project [1]. 2D+ refers to the fact that the end-effector can be positioned in two different planes only, the grip- and the movement-plane, see Fig. 4. The approach vector of the end-effector is restricted to be perpendicular to the ground plane. In our evaluations the task space is further restricted to a half circle (position based evaluation) or to a quarter circle (image based evaluation). We are controlling all five DOF of the robotic arm but use only the position of the end-effector to describe the configuration, i.e. for the position based setup we use the 3D position between the fingers of the end-effector.

The smoothness bias for LWPR is set to 10^{-5}, the initial learning rate to 50 and the default distance metric to 30I + 0.05, where I denotes the identity matrix. All 3D positions were measured in mm and the image coordinates are given in pixels. The same LWPR parameters are used for all experiments.

4.1. Position based visual servoing (simulator results)

We implemented a simulator of an ideal version of the same robotic arm that we use for real-world experiments, i.e. we assume perfect servos and perfect inverse kinematics. We generated training data by randomly positioning the robot arm in the 2D+ planes. We performed 100 tests with 100 testing configurations in each test. The configurations used for evaluation were not used during training.

Tables 1 and 2 contain the maximum likelihood estimates of the mean absolute error from the target position and corresponding standard deviation from simulations with and without added Gaussian noise. LWPR denotes that the trained model was used in a one-shot fashion while I-LWPR denotes that the model has been updated incrementally. This means that for each position 100 attempts to reach the target were made. The position after each attempt was used to update the LWPR model and the final position after the 100th attempt was used. J indicates that the Jacobian of the LWPR model has been used for visual servoing and Static/Update denotes whether the static or the updating approach has been used. The stopping criteria for the visual servoing was set to 20 iterations or a deviation of less than 0.1 mm from the desired position.

The standard deviation of the added Gaussian noise was set to 2.6 mm in order to match the estimated real-world noise level.

4.2. Position based servoing (real-world results)

The real-world experimental setup consists of a low-cost robotic arm of Lynx-6 type, shown in Fig. 1, and a calibrated stereo rig. The end-effector has been equipped with spherical markers to allow accurate estimation of the configuration. Since we are using a low-cost robotic arm we have to deal with additional challenges compared to a top-of-the-line robotic arm, such as weak servos, no joint-feedback and hysteresis. The weak servos are not fully able to compensate for the effect of gravity, meaning that we have a highly non-linear system. Lack of joint-feedback means that all information about the configuration of the system has to be acquired by external sensors, in our case cameras, and that we cannot use joint-feedback to compensate for the weak servos or hysteresis. The hysteresis effect is highly cumbersome, especially for control policies based on the inverse kinematics only, since the same control input will result in different configurations depending on what the previous configuration was.

The noise in estimated positions due to, e.g. the robotic arm shaking, noise in captured images and imperfect segmentation of markers, is assumed to be Gaussian with zero mean. The standard deviation is estimated to 2.6 mm and this is also the standard deviation used in the simulations.

Fig. 3. The approximative updating approach explained. The green line to the left in each figure represents the true function and the dashed black line to the right the LWPR approximation.

The analytical model has been evaluated and verified to be correct on synthetic data and the real-world performance has been evaluated on 100 random positions in the 2D+ space. The analytical model was used in a one-shot fashion, i.e. no visual servoing was used. The estimated mean error was 15.87 mm. However, we suspect that a slightly better result could be achieved by tedious calibration of the parameters of the analytical model. Still, the non-linear effect caused by the weak servos and the hysteresis effect make it very unlikely that we could achieve a mean error of less than 10 mm with the analytical model. The analytical model relies on an accurate calibration of the stereo rig and on a correct mapping from camera frame to robot frame. The learned inverse kinematics, on the other hand, has been trained with data that has been acquired including these imperfections.

A summary of the results can be seen in Table 3. LWPR denotes the mean absolute error from the target position when the trained model was used in a one-shot fashion. J indicates that the Jacobian of the LWPR model has been used for visual servoing and Static/Update denotes whether the static or the updating approach has been used. The stopping criteria for the visual servoing was set to 10 iterations or a deviation of less than 1 mm from the desired position.


Table 3
Evaluation on the real-world 2D+ scenario. The numbers are the mean absolute error from the target position and corresponding standard deviation in mm. Fifty test points were used for evaluation, except for the 10k case and the analytical case where 100 test positions have been used. Stopping criteria for the visual servoing was 10 iterations or a deviation less than 1 mm. No evaluation of the visual servoing methods was done for the 10k case. The level of accuracy reached for 1k and 5k is as accurate as the noise level permits.

Training points          100            500            5000           10,000 (a)

2D+ real world. Estimated noise std: 2.6 [mm]
LWPR                     16.89 (8.30)   12.83 (4.86)   8.78 (4.44)    5.86 (3.05)
J Static                 9.83 (8.93)    5.41 (5.23)    1.74 (1.63)    –
J Update                 9.07 (8.29)    4.32 (4.21)    1.65 (1.43)    –
Analytical solution      15.87 (3.24)

(a) The LWPR model was trained on a total of 6k unique points. The first 1000 points were shown (in random order) 5 times and then the additional 5k points were used.

Fig. 4. Illustration of the 2D+ scenario. The end-effector, here equipped with green and red markers, can be moved in two different planes, the movement and the gripper plane.

Table 1
Evaluation on the simulated 2D+ scenario when trained on 500, 1000 and 5000 samples. The numbers are the ML estimates of the mean absolute error from the target position and corresponding standard deviation in mm. One hundred test runs with 100 test points were used. No noise has been added.

Training points    500           1000          5000

2D+ simulated data. Added noise std: 0 [mm]
LWPR               8.90 (4.81)   7.53 (4.01)   6.46 (3.44)
I-LWPR             5.90 (4.29)   5.56 (3.78)   5.73 (3.40)
J Static           0.34 (1.20)   0.17 (0.52)   0.11 (0.23)
J Update           0.29 (0.90)   0.15 (0.40)   0.09 (0.15)

Table 2
Evaluation on the simulated 2D+ scenario when trained on 500, 1000 and 5000 samples. The numbers are the ML estimates of the mean absolute error from the target position and corresponding standard deviation in mm. One hundred test runs with 100 test points were used. Gaussian noise with a standard deviation of 2.6 mm was added to the positions in order to simulate the noise in the estimation process.

Training points    500            1000          5000

2D+ simulated data. Added noise std: 2.6 [mm]
LWPR               10.10 (4.99)   8.81 (4.27)   7.78 (5.11)
I-LWPR             6.08 (4.40)    5.67 (3.85)   5.70 (4.82)
J Static           2.10 (1.31)    1.90 (0.96)   1.80 (1.08)
J Update           2.00 (1.06)    1.90 (0.85)   1.90 (1.04)


4.3. Image based visual servoing with autonomous recognition

The second real-world experimental setup consists of the same low-cost robotic arm that was used in the position based experiments. The spherical markers have been removed from the gripper and the high resolution cameras have been replaced with two cheap web cameras. We do not use a calibrated stereo setup for this experiment. The view from the two web cameras can be seen in Fig. 5.

The automatic detection and recognition algorithm is fully described in Appendix A. In short, the initialization phase automatically detects and labels regions of interest, ROIs, by using a simulated palmar grasp reflex, see Fig. 6. From each ROI, we extract a number of template patches which are labeled according to the ROI they were extracted from. In each new frame the position of the best match for each template patch is found. Then the median position of all template patches belonging to the same label is estimated. These coordinates are then used to describe the configuration of the end-effector, which gives us a total of R coordinates for each image, with R being the number of labels.

In the evaluation presented below R = 2, meaning that templates were extracted from two regions, which allows oriented positioning in the 2D+ scenario. To present an intuitive error measure, and also in order to be able to compare to the position based experiments, we use the coordinate halfway between the median positions of the two labels as our final position. This gives us one coordinate in each image that describes the configuration of the robotic arm. The size of the templates was 15 × 15 pixels and we used 20 patches for each label. The template size was kept fixed and was not adjusted to compensate for scale changes.
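As an illustration of how the recognition output is turned into the configuration used for servoing, the sketch below computes the per-label median of the matched template positions and the halfway point between the two labels; the data layout is an assumption made for the example, not something prescribed by the paper.

```python
import numpy as np

def configuration_from_matches(match_positions):
    """Configuration descriptor for one camera image (sketch).

    match_positions: dict mapping each label r to an (n_r, 2) array of the
    (x, y) positions of its matched template patches.
    Returns the point halfway between the per-label median positions,
    assuming R = 2 labels as in the evaluation.
    """
    medians = [np.median(p, axis=0) for p in match_positions.values()]
    return 0.5 * (medians[0] + medians[1])
```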

Since we are using two different cameras and the templates are automatically selected, we might end up using different physical features in the different cameras, i.e. one camera might use templates at the tip of the end-effector while the other camera might use templates belonging to the base. This complicates things when we evaluate the performance. We cannot simply select a 3D position in space and project this position into the two image planes and use these coordinates as a target configuration. This could result in trying to position the tip and the base of the end-effector at the same position, which is obviously not doable. Instead we have used a configuration already visited, but omitted during the training phase, as a target configuration. Training data was obtained by random movements within (near) the 2D+ planes using the analytical model.

Given a target configuration, the trained LWPR model was queried in order to obtain the first control signal. This signal was used and the deviation from the target was estimated. The visual servoing approach was then used with the end position obtained with LWPR as the starting position for servoing. Typical trajectories for the visual servoing can be seen in Fig. 7. Note in the second row how the minimization of the combined deviation makes the deviation in the right camera increase.

Table 4 contains the results from the second series of real-world experiments. LWPR denotes the mean absolute error from the target position and corresponding standard deviation, within parentheses, for the trained LWPR model. J Update denotes that the updating approach has been used for visual servoing. A distance of 1 pixel corresponds to roughly 0.5 mm within the reaching area of the robotic arm (both along the x-axis and along the y-axis) in both cameras.

The obtained errors are, as expected, higher than those when using markers to describe the configuration. Still, we do achieve sufficient accuracy for simple object manipulation tasks.



Fig. 5. The view seen from the two web cameras during the image based evaluation.


Fig. 6. Left: The image of the opened gripper. Right: After the simulated palmar grasp reflex we get labeled regions of interest.


It is interesting to note that the highest accuracy was obtained for the model trained with only 250 training configurations. This is explained by the fact that we continuously replace patches with poor performance in order to compensate for, e.g. light changes. We can expect to replace more patches when collecting more training data, thus the risk of getting conflicting information becomes higher. We address this issue in the discussion.

5. Discussion

We have presented a method that allows simultaneous learning of appearance and control of a robotic arm. Sufficient accuracy for simple assembly tasks is reached by combining autonomous recognition with visual servoing based on Locally Weighted Projection Regression (LWPR). We have seen that by using powerful algorithms we can suffice with inexpensive hardware, such as web cameras and low-cost robotic arms.

In Section 4.2, we show that the accuracy is limited mainly by noise in the estimated positions when the appearance of the end-effector is known and we use 3D coordinates. In Section 4.3, we show that these conclusions can also be drawn for image based visual servoing with an end-effector that is unknown in advance.

The restrictions imposed by the 2D+ test scenario avoid problems with multiple valued solutions to the inverse kinematic problem. If the training samples form a non-convex set, our linear technique basically fails. This potentially happens for robotic systems with redundant degrees of freedom. For instance, if all positions would be reachable with servo 1 set to either +π or −π, the linear averaging of the LWPR method predicts the output to 0 for servo 1. Presumably, this can be avoided with a non-linear representation of the signals, e.g. using the channel representation [10] which allows for multimodal estimation [8].

Due to the restrictions of our test scenarios we have not encountered any problem with singularities in control space. However, for the same reasons that data with ambiguities would be a problem for LWPR, existing singularities would cause problems. From a theoretical point of view, the same solution based on channel representation could solve this problem. In a real-world setup this needs to be verified. Due to noise, unstable configurations might occur that could be dealt with by a finite state machine as a higher level controller. This is something that should be investigated in future work.

Future work will also include replacing our heuristic recognition algorithm with a theoretically more profound one. We need to decrease the noise in the estimated configurations in order to increase our real-world performance. We have tried to use a KLT-tracker where the initial tracking features were initialized by the same method that we use for extracting our patches. However, this did not turn out better than the described template matching method. Also, since we do not have any available joint-feedback, it is hard to justify the extra computational burden required by the KLT-tracker, because we cannot use the intermediate positions as training data. Only the final position of a movement can be used for training since it is only for this position we have the corresponding control signals.



Fig. 7. The position after each visual servoing iteration. The red circle indicates the target position and the green cross, at the 0th position, indicates the position reached by the LWPR model. The left/right column corresponds to the left/right camera and each row corresponds to one target configuration.

Table 4
Real-world evaluation of the image-based visual servoing. Mean absolute error from the target position and corresponding standard deviation are given in pixels. A pixel corresponds to roughly 0.5 mm within the task space. A total of 250 test points were used for each evaluation.

Training points    250                          500                          1000
                   Camera 1      Camera 2       Camera 1      Camera 2       Camera 1      Camera 2

Real-world evaluation
LWPR               6.64 (4.31)   7.27 (4.92)    8.99 (5.57)   8.74 (5.39)    8.37 (3.18)   6.51 (3.61)
J Update           3.08 (2.62)   2.71 (2.40)    4.12 (3.14)   4.06 (3.07)    4.68 (2.26)   3.80 (2.17)


Acknowledgements

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No. 215078 DIPLECS and from the European Community's Sixth Framework Programme (FP6/2003-2007) under Grant Agreement No. 004176 COSPAL.

Appendix A. Recognition of an end-effector with unknown appearance

In this section, we present a method to recognize the end-effector of a robotic arm without specifying its shape, size, color or texture. The only two assumptions we make are that the end-effector is an articulated object and that we know which state controls opening and closing. These assumptions are used to determine the template patches that we use for recognition.

A.1. Detecting regions of interest

The template patches are extracted in two steps. First, we detect the regions of interest (ROIs); secondly, we extract a large number of patches within these ROIs. These patches are evaluated and the best ones are kept. We begin by estimating a background image I_b in order to detect ROIs. This is done by moving the robotic arm out of the field of view and then capturing a number of images, letting the mean value represent the background. This could also be done by, e.g. Gaussian Mixture Models [25] or by the use of Kalman filters [21] in order to obtain a more advanced model.
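A minimal sketch of this background model, assuming the frames captured with the arm out of view are available as a list of arrays:

```python
import numpy as np

def estimate_background(frames):
    """Background image I_b as the per-pixel mean of the captured frames."""
    return np.mean(np.stack(frames).astype(float), axis=0)
```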

After we have obtained I_b, we move the robotic arm to a random position and open and close the end-effector. ROIs are found by simultaneous background segmentation and motion segmentation. As a part of the extraction of ROIs we use the sum of absolute differences

\mathrm{SAD}(t, q) = \sum_i |t(i) - q(i)|.   (A.1)

The pseudo-code for the automatic extraction of ROIs can be found in Algorithm 2. Images captured when the end-effector is open or closed are denoted with I_o and I_c, respectively. Furthermore, D_oc denotes the SAD image between the opened image and the closed image, and D_ob the SAD image between the opened image and the background image. Before creating the final labeled image we may use morphological operations, graph-cut [16] or impose some restrictions on the ROIs, e.g. only keeping homogeneous regions with area larger than a certain threshold. We use |Ω_i| to denote the cardinality of the i:th connected region in D_bin. The image D^R_label is the final output of the algorithm. Each pixel in D^R_label has a value between 0 and R, meaning that we define the ROIs as the pixels where the value is non-zero.

Algorithm 2: Labeling of ROIs

1: R ← 0 {the number of non-overlapping ROIs}
2: for every pixel (x, y) in I_o do
3:   D_label(x, y) ← 0
4:   D_bin(x, y) ← 0
5:   D^0_label(x, y) ← 0
6:   D_oc(x, y) ← Σ_i |I_o(x, y, i) − I_c(x, y, i)|
7:   D_ob(x, y) ← Σ_i |I_o(x, y, i) − I_b(x, y, i)|
8:   if D_oc(x, y) > T and D_ob(x, y) > T then
9:     D_bin(x, y) ← 1
10:  end if
11: end for
12: D_bin ← morph(opening, D_bin)
13: D_bin ← morph(dilation, D_bin)
14: for each connected region Ω_i in D_bin do
15:   if |Ω_i| > T_size then
16:     D^{R+1}_label = D^R_label + (R + 1)Ω_i
17:     R = R + 1
18:   end if
19: end for
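The sketch below follows the same steps with standard array operations; the thresholds T and T_size are illustrative placeholders, not values taken from the paper.

```python
import numpy as np
from scipy import ndimage

def label_rois(I_open, I_closed, I_bg, T=30, T_size=200):
    """ROI labeling in the spirit of Algorithm 2 (sketch)."""
    # SAD over the color channels, Eq. (A.1): open vs. closed and open vs. background
    D_oc = np.abs(I_open.astype(int) - I_closed.astype(int)).sum(axis=2)
    D_ob = np.abs(I_open.astype(int) - I_bg.astype(int)).sum(axis=2)
    # Pixels that differ both from the closed image and from the background
    D_bin = (D_oc > T) & (D_ob > T)
    # Morphological clean-up: opening followed by dilation
    D_bin = ndimage.binary_opening(D_bin)
    D_bin = ndimage.binary_dilation(D_bin)
    # Keep connected regions larger than T_size and assign labels 1..R
    regions, n = ndimage.label(D_bin)
    D_label = np.zeros_like(regions)
    R = 0
    for i in range(1, n + 1):
        region = regions == i
        if region.sum() > T_size:
            R += 1
            D_label[region] = R
    return D_label, R
```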

Fig. A.1. Left: The image of the opened gripper. Right: Final D^R_label after imposing the requirement of minimum size for ROIs. Here R = 2.

Figs. A.1–A.3 show the different steps of Algorithm 2. In Fig. A.1, the original image is shown together with the final ROIs. Fig. A.2 contains the thresholded D_oc and D_ob. Fig. A.3 shows the final D_label.

A.2. Choosing template patches

Within each ROI, obtained according to Algorithm 2, we extract N randomly positioned template patches, N typically being in the order of 100. Each template is denoted t^(r)_n, where n indicates the patch number. The superscript r is a label indicating from which of the R ROIs the template has been extracted, e.g. in the example shown in Fig. A.1 the possible values for r would be 1 or 2 depending on whether the template was extracted from the left or the right ROI.

To evaluate the quality of these patches we move the robotic arm to a random position and perform the opening and closing procedure to obtain new ROIs. At each position in the new ROIs we extract a query patch q_m of the same size as the original patches, M being the total number of positions in the ROIs. For each of the RN template patches we compute s^{(r)}_n, m^{(r)}_n, x^{(r)}_n and y^{(r)}_n according to:

s^{(r)}_n = \min_{m \in M} \mathrm{SAD}(t^{(r)}_n, q_m)   (A.2)

m^{(r)}_n = \arg\min_{m \in M} \mathrm{SAD}(t^{(r)}_n, q_m)   (A.3)

x^{(r)}_n = x(q_{m^{(r)}_n}), \quad y^{(r)}_n = y(q_{m^{(r)}_n}).   (A.4)

The lowest SAD score for t^{(r)}_n with respect to all q_m is denoted s^{(r)}_n and the index of the query patch corresponding to this score is denoted m^{(r)}_n. We assign a position (x^{(r)}_n, y^{(r)}_n) to the template patch, where (x^{(r)}_n, y^{(r)}_n) is the (x, y) position of q_{m^{(r)}_n}. Finally we keep K < N templates belonging to each label. The K templates are chosen with respect to s^{(r)}_n. The RK chosen templates are used in subsequent frames to detect the end-effector.
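A brute-force sketch of the per-template search in (A.2)–(A.4) is given below; it scans every position inside an ROI mask and keeps the minimum SAD score. It is an illustration only, not the implementation used in the paper.

```python
import numpy as np

def match_template_sad(template, image, roi_mask):
    """Best match of one template patch within an ROI, per Eqs. (A.2)-(A.4).

    Returns the minimum SAD score and the (x, y) position of the best-matching
    query patch (top-left corner of the patch in this sketch).
    """
    h, w = template.shape[:2]
    best_score, best_xy = np.inf, None
    ys, xs = np.nonzero(roi_mask)
    for y, x in zip(ys, xs):
        patch = image[y:y + h, x:x + w]
        if patch.shape[:2] != (h, w):
            continue                       # skip positions too close to the border
        # Eq. (A.1): sum of absolute differences over all pixels (and channels)
        score = np.abs(patch.astype(int) - template.astype(int)).sum()
        if score < best_score:             # Eqs. (A.2)-(A.3): keep the minimum
            best_score, best_xy = score, (x, y)
    return best_score, best_xy             # Eq. (A.4): position of the best match
```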

A.3. Using the templates to recognize the end-effector

Each time we move the robotic arm, new ROIs are obtained by background segmentation. Note that we do not perform the opening–closing procedure described above at this stage. One reason is that it is time consuming, but more importantly, the opening–closing is impossible if the end-effector is already holding an object. The ROIs are extracted based only on the difference against the background image.

For all RK templates kept in memory, we compute s^(r)_n and the best corresponding position (x^(r)_n, y^(r)_n) according to (A.2)–(A.4).




Fig. A.2. Left: The thresholded D_oc. Right: The thresholded D_ob.


Fig. A.3. Left: D_bin after line 8 in Algorithm 2. Right: D_bin after morphological erosion and dilation.


Fig. A.4. Left: Before outliers are removed. Right: After outliers have been removed and after new templates have been extracted.


This simple procedure will lead to a number of mismatches, but typically we get sufficiently many reliable matches to determine the position of the end-effector in the images.

We are continuously evaluating the quality of the matches. We use, e.g. the SAD score and the movement vector. If the cost for a template is too high with respect to the SAD score, it is classified as an outlier. If the movement vector for a template deviates with an angle larger than a threshold from the movement vectors of templates with the same label r, it is also classified as an outlier.

We keep the number of templates in each label constant by extracting new templates from the current view every time we remove those classified as outliers. For each template from label r that we remove, we extract a new template patch near the median position of the templates of class r not classified as an outlier. By this procedure we are able to cope with changes in scale and illumination if the changes are not too rapid. The detection result before and after outlier removal can be seen in Fig. A.4.
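The following sketch illustrates this maintenance step. The thresholds and the use of the per-label mean movement direction as the reference for the angle test are simplifications made for the example, not values or choices taken from the paper.

```python
import numpy as np

def filter_and_replace(matches, sad_thresh=5000.0, angle_thresh=np.pi / 4):
    """Outlier rejection and template replacement (sketch of Section A.3).

    matches: dict mapping each label r to a list of (sad_score, old_pos, new_pos)
    tuples, one entry per template of that label. Returns, per label, the inlier
    positions plus one replacement position (near the inlier median) for every
    rejected template, so the number of templates per label stays constant.
    """
    kept = {}
    for r, ms in matches.items():
        moves = [np.subtract(new, old) for _, old, new in ms]
        ref = np.mean(moves, axis=0)                  # reference movement direction
        inliers = []
        for (score, _, new), mv in zip(ms, moves):
            cos = np.dot(mv, ref) / (np.linalg.norm(mv) * np.linalg.norm(ref) + 1e-9)
            angle = np.arccos(np.clip(cos, -1.0, 1.0))
            if score <= sad_thresh and angle <= angle_thresh:
                inliers.append(np.asarray(new, dtype=float))
        if not inliers:                               # nothing reliable for this label
            kept[r] = []
            continue
        median = np.median(inliers, axis=0)
        # one replacement near the median for every rejected template
        kept[r] = inliers + [median] * (len(ms) - len(inliers))
    return kept
```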

References

[1] The COSPAL project. Available from: <http://www.cospal.org/>.
[2] Lynxmotion robot kits. Available from: <http://www.lynxmotion.com/>.
[3] D.H. Ballard, C.M. Brown, Computer Vision, Prentice Hall Professional Technical Reference, 1982.
[4] J.-L. Buessler, J.-P. Urban, Visually guided movements: learning with modular neural maps in robotics, Neural Networks 11 (7–8) (1998) 1395–1415.
[5] M.V. Butz, O. Herbort, J. Hoffman, Exploiting redundancy for flexible behavior: unsupervised learning in a modular sensorimotor control architecture, Psychological Review 114 (4) (2007) 1015–1046.
[6] P.I. Corke, Visual Control of Robots: High-Performance Visual Servoing, John Wiley & Sons, Inc., New York, NY, USA, 1997.
[7] A. Farahmand, A. Shademan, M. Jägersand, Global visual-motor estimation for uncalibrated visual servoing, in: Proceedings of the International Conference on Intelligent Robots and Systems (IROS), 2007, pp. 1969–1974.
[8] M. Felsberg, P.-E. Forssén, H. Scharr, Channel smoothing: efficient robust smoothing of low-level signal features, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2) (2006) 209–222.
[9] P.-E. Forssén, Low and Medium Level Vision using Channel Representations, Ph.D. thesis, Linköping University, SE-581 83 Linköping, Sweden, March 2004, Dissertation No. 858, ISBN 91-7373-876-X.
[10] G.H. Granlund, An associative perception–action structure using a localized space variant information representation, in: Proceedings of Algebraic Frames for the Perception–Action Cycle (AFPAC), Kiel, Germany, September 2000.
[11] G.H. Granlund, Organization of architectures for cognitive vision systems, in: H.I. Christensen, H.H. Nagel (Eds.), Cognitive Vision Systems: Sampling the Spectrum of Approaches, Springer-Verlag, Berlin, Heidelberg, Germany, 2006, pp. 37–55.
[12] M. Jägersand, R.C. Nelson, Adaptive Differential Visual Feedback for Uncalibrated Hand-Eye Coordination and Motor Control, Technical Report 579, Computer Science Department, University of Rochester, Rochester, NY, 1994.
[13] M. Jägersand, R.C. Nelson, On-line estimation of visual-motor models using active vision, in: Proceedings of ARPA96, 1996, pp. 677–682.
[14] E. Jonsson, M. Felsberg, Accurate interpolation in appearance-based pose estimation, in: Proceedings of the 15th Scandinavian Conference on Image Analysis, vol. 4522, LNCS, 2007, pp. 1–10.
[15] E. Jonsson, M. Felsberg, G.H. Granlund, Incremental Associative Learning, Technical Report LiTH-ISY-R-2691, Department of EE, Linköping University, September 2005.
[16] V. Kolmogorov, R. Zabih, What energy functions can be minimized via graph cuts?, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2) (2004) 147–159.
[17] D. Kragic, H.I. Christensen, Survey on Visual Servoing for Manipulation, Technical Report, ISRN KTH/NA/P–02/01–SE, CVAP259, January 2002.
[18] F. Larsson, E. Jonsson, M. Felsberg, Visual servoing for floppy robots using LWPR, in: Proceedings of the International Workshop on Robotics and Mathematics, 2007.
[19] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the International Conference on Computer Vision, 1999, pp. 1150–1157.
[20] D. Ognibene, A. Rega, G. Baldassarre, A model of reaching that integrates reinforcement learning and population encoding of postures, in: SAB, 2006, pp. 381–393.
[21] C. Ridder, O. Munkelt, H. Kirchner, Adaptive background estimation and foreground detection using Kalman filtering, in: Proceedings of ICAM, 1995.
[22] S. Schaal, C.G. Atkeson, S. Vijayakumar, Scalable techniques from nonparametric statistics for real time robot learning, Applied Intelligence 17 (1) (2002) 49–60.
[23] J.M. Schott, M.N. Rossor, The grasp and other primitive reflexes, Journal of Neurology, Neurosurgery and Psychiatry 74 (2003) 558–560.
[24] N.T. Siebel, Y. Kassahun, Learning neural networks for visual servoing using evolutionary methods, in: HIS'06: Proceedings of the Sixth International Conference on Hybrid Intelligent Systems, IEEE Computer Society, Washington, DC, USA, 2006, p. 6.
[25] C. Stauffer, W.E.L. Grimson, Adaptive background mixture models for real-time tracking, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, 1999.
[26] M. Toussaint, C. Goerick, Probabilistic inference for structured planning in robotics, in: International Conference on Intelligent Robots and Systems (IROS), 2007, pp. 3068–3073.
[27] D. Vernon, Machine Vision: Automated Visual Inspection and Robot Vision, Prentice-Hall Inc., 1991.
[28] S. Vijayakumar, A. D'souza, T. Shibata, J. Conradt, S. Schaal, Statistical learning for humanoid robots, Autonomous Robots 12 (1) (2002) 55–69.
[29] S. Vijayakumar, S. Schaal, Locally weighted projection regression: an O(n) algorithm for incremental real time learning in high dimensional spaces, in: Proceedings of ICML, 2000, pp. 288–293.