
Robotics and Autonomous Systems 47 (2004) 129–141, doi:10.1016/j.robot.2004.03.007

Situated robot learning for multi-modal instruction and imitation of grasping

J.J. Steil, F. Röthling, R. Haschke, H. Ritter∗
Neuroinformatics Group, Faculty of Technology, Bielefeld University, P.O. Box 10 01 31, D-33501 Bielefeld, Germany

Abstract

A key prerequisite to make user instruction of work tasks by interactive demonstration effective and convenient is situated multi-modal interaction aiming at an enhancement of robot learning beyond simple low-level skill acquisition. We report the status of the Bielefeld GRAVIS-robot system that combines visual attention and gestural instruction with an intelligent interface for speech recognition and linguistic interpretation to allow multi-modal task-oriented instructions. With respect to this platform, we discuss the essential role of learning for robust functioning of the robot and sketch the concept of an integrated architecture for situated learning on the system level. It has the long-term goal to demonstrate speech-supported imitation learning of robot actions. We describe the current state of its realization to enable imitation of human hand postures for flexible grasping and give quantitative results for grasping a broad range of everyday objects.
© 2004 Elsevier B.V. All rights reserved.

Keywords: Interactive demonstration; Imitation; Learning; Architecture; Grasping

1. Introduction

How can we endow robots with enough cognitive capabilities to enable them to serve as multi-functional personal assistants that can easily and intuitively be instructed by the human user? The ability of situated learning plays a key role in the realization of this goal. Only when we can instruct robots to execute desired work tasks through a combination of spoken dialog, gestures, and visual demonstration will robots lose their predominant role as specialists for repeatable tasks and become effective in supporting humans in everyday life.

A basic element of situated learning is the capability to observe and successfully imitate actions and, as a prerequisite for that, to establish a common focus of attention with the human instructor. For multi-modal communication, additional perceptive capabilities in the fields of speech understanding, active vision, and the interpretation of non-verbal cues such as gestures or body posture are essential and have to be included and coordinated.

∗ Corresponding author. E-mail address: [email protected] (H. Ritter).

We report on progress in building an integrated robot system within the framework of the Special Collaborative Research Unit SFB 360 'Situated Artificial Communicators'. In the course of this long-term program, many modules implementing partial skills were first realized and evaluated as stand-alone applications [4,7,18,20,34], but their integration is an additional research task and a key issue towards the realization of intelligent machines [25,29].

As the development of integrated learning architectures for real-world tasks poses an enormous challenge [35], efforts to scale learning from the lower level of training single skills up to multi-stage learning across the overall system can hardly be found. A primary reason is that most learning approaches rely on highly pre-structured information and search spaces. Prominent examples are supervised learning of target outputs, unsupervised learning of clusters, or learning of control tasks with a (usually small) number of predefined variables (pole balancing, trajectory learning). For such settings there exist well-understood approaches like gradient-based learning, support vector machines, vector quantization, or Q-learning, which yield remarkable results for certain tasks, e.g. in speech–image integration [26], trajectory learning [19,22,44], object recognition and determination of grasp postures [28], sensor fusion for grasp planning [1], or grasp optimization [30].

In real-world learning, a well-defined pre-structuring of the data with respect to the given task is an essential part of the learning itself: the system has to find relevant lower-dimensional manifolds in very high-dimensional data and detect important regularities in the course of learning in order to use them to improve its capabilities. Furthermore, for a sophisticated robot with many motor degrees of freedom, or for a cognitive system such as the one discussed here, finding a solution by exploration of new actions alone is not feasible, because the search spaces involved are extremely high-dimensional and far too complex.

Current practice aims at developing well-scalable, homogeneous and transparent architectures to create complex systems. Somewhat ironically, successful examples of this strategy tend to cluster in the small- or mid-size range, while truly large and complex systems seem to defy our wishes for 'formatting away' their complexity by good bookkeeping alone. It seems not unlikely that it is one of the hallmarks of complex systems that they confront us with limited homogeneity, evolutionarily grown layers of overlapping functionality, and bugs that may even amalgamate with features. Looking at biological systems with their enormous complexity, we see that these by no means resemble orthogonal clockworks; instead, they consist of a tangle of interwoven loops stabilized by numerous mechanisms of error tolerance and self-repair. This suggests that a major challenge for moving to higher complexity is to successfully adopt similar approaches to come to grips with systems that we cannot analyze in their full detail.

Fig. 1. Interaction with the GRAVIS-system using speech and gesture.

In the present paper, we address these issues in the context of a longer-term research project aiming at the realization of a robot system that is instructable by speech and gestures (Fig. 1). For the aforementioned reasons, we have pursued the development of this system in an evolutionary fashion, without the requirement that a global blueprint had to be available at each stage of its development. In Section 2, we report our experiences with this approach and give an overview of the current stage of the evolved system.

In Section 3, we focus our discussion on the issue of learning within such a system and argue for three major levels at which learning has to be organized: (i) an ontogenetic level, which exploits learning methods in order to create initial system functions (such as object classifiers) from previously acquired training data in an off-line fashion; (ii) a refinement level, at which on-line learning is used locally within a functional module, with the main effect of increasing the module's robustness or refining its performance, but with little or no need for explicit coordination with adaptive processes in other modules; and (iii) a situated level, at which different learning methods are combined in a highly structured way in order to achieve short-term situated learning at the task level. While all three learning levels are important, it is undoubtedly the uppermost, situated level which currently poses the most exciting research challenge.

In Section 4, we propose an approach for organizing learning at this level. Our proposal is strongly motivated by the idea of imitation learning [2,3,6,8,23,24,32], which attempts to find a successful 'action template' from the observation of a (human) instructor. This requires (i) endowing the robot system with sufficient perceptive capabilities to recognize and observe the action to imitate; (ii) transforming the observed action into an internal representation that is well matched to the system's own operating characteristics (in particular, its different 'sensory perspective' and 'instrumentation' with actuators); and (iii) being able to physically execute a suitable action by itself. Focusing on the important task of imitation grasping, we describe in Section 5 an initial implementation of this scheme, using our current system as a platform for the necessary and considerable perceptual and motor anchoring of such an imitation learner in its environment. Section 6 then presents some results on imitation grasping of common everyday objects with the system implemented so far.

At all levels, the results of learning can, by their very nature, at best be partially predicted, further eroding the idea of the availability of a fixed system blueprint. In Section 7, we therefore argue for a datamining perspective for coping with systems of this kind. As a concrete example, we briefly describe a powerful multi-modal monitoring system (AVDisp) that has recently been developed in our lab, and we report some experiences from applying this approach to our robot system. Finally, Section 8 presents some conclusions.

2. System design and overview

Due to the long-term development of our system, the ideal perspective of defining constraints and a unified framework beforehand to facilitate building a cognitive learning architecture had to be replaced by an 'evolutionary approach', in order to also integrate modules that were developed in different research contexts and not necessarily designed in view of being utilized in the described system. This led to the development of a rather flexible architecture, based on a distributed architecture communication system (DACS [14]) developed earlier in the framework of the SFB 360. In this framework, very heterogeneous components can be accommodated as separate and parallel processes, distributed over several workstations and communicating mainly via message passing supported by DACS (some modules also use more sophisticated communication facilities of the DACS package).

Fig. 2. Schematic picture of the current system architecture.

In this way, we have been able to integrate a large number of modules, which use different programming languages (C, C++, Tcl/Tk, Neo/NST), various visualization tools, and a variety of processing paradigms ranging from a neurally inspired attention system to statistical and declarative methods for inference and knowledge representation [25,29].

The current system can be coarsely subdivided into four major functional clusters, depicted schematically in Fig. 2. The speech processing (left) and the visual attention mechanism (right) provide linguistic and visual/gestural inputs converging in an integration module. This module maintains a short-term memory of objects and their 3D-coordinates and passes control to the arm/hand manipulator if an object is unambiguously referenced by speech or gesture or their combination. We found that the coordination of all these functional modules can be very conveniently achieved by finite state machines implementing data-driven state transitions. Additionally, speech commands like 'calibrate skin', 'park robot arm', etc. can trigger specific actions. The most important functionalities of these four building blocks are summarized below for convenience.
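To make this coordination scheme concrete, the following sketch (our illustration, not the actual GRAVIS code) shows a data-driven finite state machine arbitrating between the functional clusters; all state, event, and action names are hypothetical.

# Minimal sketch of data-driven coordination between the functional clusters.
# States, events, and actions are illustrative assumptions.

TRANSITIONS = {
    ("attend",  "object_referenced"):   ("resolve", "query_integration_module"),
    ("resolve", "reference_unique"):    ("grasp",   "dispatch_arm_hand_command"),
    ("resolve", "reference_ambiguous"): ("dialog",  "ask_for_pointing_gesture"),
    ("dialog",  "gesture_received"):    ("resolve", "query_integration_module"),
    ("grasp",   "grasp_finished"):      ("attend",  "resume_scene_exploration"),
}

def step(state, event):
    """Advance the coordinating state machine on an incoming event."""
    next_state, action = TRANSITIONS.get((state, event), (state, None))
    return next_state, action

state = "attend"
for event in ["object_referenced", "reference_ambiguous", "gesture_received",
              "reference_unique", "grasp_finished"]:
    state, action = step(state, event)
    print(event, "->", state, action)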

2.1. Visual attention

The basic behavior of the active vision system is, driven by an attention system, to explore the scene autonomously and to search for salient points and objects, including hands and pointing fingers. The attention behavior is based on an active stereo vision camera head and works in full 3D-space. At the lower level it consists of a layered system of topographically organized neural maps integrating different feature maps into a continually updated saliency map [20], similar to mechanisms proposed in [9,12,38]. Incorporating an additional fadeout map and results from a hand detection module, it forms a final attention map whose highest peak determines the next fixation. If a pointing hand is detected, a further module computes a 3D-pointing cone and restricts the attention to the corresponding region. A holistic, neural object recognition system [18] determines whether a known object has been seen and can be transferred into the short-term memory of the integration module.
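The attention-map idea can be sketched in a few lines: feature maps are combined into a saliency map, a fadeout map suppresses recently visited locations, and the peak of the result selects the next fixation. The map names and weights below are illustrative assumptions, not the system's actual parameters.

import numpy as np

# Sketch: combine feature maps into a saliency map, subtract a fadeout map,
# and take the highest peak of the attention map as the next fixation.

def next_fixation(feature_maps, weights, fadeout_map):
    saliency = sum(w * m for w, m in zip(weights, feature_maps))
    attention = saliency - fadeout_map          # suppress already explored regions
    return np.unravel_index(np.argmax(attention), attention.shape)

h, w = 120, 160
color_map, motion_map, skin_map = (np.random.rand(h, w) for _ in range(3))
fadeout = np.zeros((h, w))
fix = next_fixation([color_map, motion_map, skin_map], [0.5, 0.3, 0.2], fadeout)
fadeout[fix] = 1.0   # fade out the fixated location for subsequent saccades
print("next fixation (row, col):", fix)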

2.2. Speech processing and understanding

To enable speech interaction and communication between the user and the artificial communicator, our system imports a module for speaker-independent speech understanding [13]. The recognition process is directly influenced by a partial parser which provides linguistic and domain-specific restrictions on word sequences, derived from previous investigations on a word corpus. Therefore, partial syntactic structures instead of simple word sequences are generated, e.g. object descriptions ('the red cube') or spatial relations ('in front of'). These are combined by the subsequent speech understanding module to form linguistic interpretations. The instructor neither needs to know a special command syntax nor the exact terms or identifiers of the objects. Consequently, the speech understanding system has to face a high degree of referential uncertainty arising from vague meanings, speech recognition errors, and un-modeled language structures.

2.3. Modality integration and dialog

In the integration of speech and vision, this referential uncertainty has to be resolved with respect to the visual object memory. Here the system uses a Bayesian network approach [40], where the different kinds of uncertainties are modeled by conditional probability tables that have been estimated from experimental data. The objects denoted in the utterance are those explaining the observed visual and verbal evidence in the Bayesian network with the maximum a posteriori probability. Additional causal support for an intended object is defined by an optional target region of interest that is provided by the 3D-pointing evaluation. The intended object is then used by the dialog component for system response and manipulator instruction. The dialog system is based on an investigation of a corpus of human–human and simulated human–machine dialogs [7]. In particular, it asks for a pointing gesture to resolve ambiguities in the current spoken instruction with respect to the actual state of the memory. The overall goal of this module is to continue the dialog in every situation.
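In our own notation (not taken from [40]), the selection rule of the integration module can be summarized as the maximum a posteriori choice

o^{*} = \arg\max_{o \in \mathcal{M}} P\left(o \mid E_{\mathrm{verbal}}, E_{\mathrm{visual}}, E_{\mathrm{pointing}}\right),

where \mathcal{M} denotes the visual object memory and the E terms are the verbal, visual, and (optional) pointing evidence entered into the Bayesian network.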

2.4. Robot arm and hand

Manipulation is carried out by a standard 6-DOF PUMA manipulator operated with the real-time RCCL command library [36]. It is further equipped with a wrist camera to obtain local visual feedback during the grasping phase. The grasping is carried out by a 9-DOF dextrous robot hand developed at the Technical University of Munich [37]. It has three approximately human-sized fingers driven by a hydraulics system. The fingertips have custom-built force sensors to provide force feedback for control and evaluation of the grasp. Recently, we have added a palm and rearranged the fingers in a more human-like configuration in order to allow a larger variety of two- and three-finger grasps, and we have equipped the hand with a tactile display of 8 × 8 force sensors on the palm.

The grasp sequence starts with an approach movement to the 3D-coordinates determined by the vision and integration modules. Based on visual feedback, it centers the manipulator above the object and finally grasps it, starting from an initial hand pre-shape, slowly wrapping all fingers around the object and employing force feedback from the fingertip and palm sensors to detect contacts. Notice that this grasping strategy differs significantly from analytical approaches, which first compute optimal grasp points on the known object surface and then employ inverse hand kinematics to achieve these contact points [17]. Fig. 3 shows a time evaluation of the grasp sequence for a preprogrammed grasp.


Fig. 3. Evaluation of a predefined grasp sequence based on data collected by the system monitoring tool AVDisp.

After a successful grasp, a similar chain of events allows the robot to put the object down at another gesturally selected location. To add flexibility to this fixed grasp behavior and to enable grasping of irregularly shaped real-world objects, we added an imitation capability in the choice of grasp prototypes based on human observation, as detailed in Sections 4 and 5.
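The closing phase described above can be pictured as a simple guarded loop: starting from the pre-shape, each finger keeps flexing until its force sensor reports contact. The sketch below is our schematic rendering of that idea under assumed sensor and actuator interfaces (the mock hand stands in for the real TUM-hand/RCCL control code).

import random

CONTACT_FORCE = 0.5   # [N] force threshold interpreted as contact (assumed value)
STEP = 0.02           # [rad] incremental flexion per control cycle (assumed value)
MAX_FLEX = 1.5        # [rad] mechanical limit used by the mock hand below

class MockHand:
    """Stand-in for the real hand interface; contact occurs at a random angle."""
    def __init__(self):
        self.fingers = ["thumb", "index", "middle"]
        self.angles = {f: 0.0 for f in self.fingers}
        self._contact_at = {f: random.uniform(0.4, 1.2) for f in self.fingers}

    def move_to(self, pre_shape):
        self.angles.update(pre_shape)

    def flex(self, finger, step):
        self.angles[finger] = min(self.angles[finger] + step, MAX_FLEX)

    def fingertip_force(self, finger):
        return 1.0 if self.angles[finger] >= self._contact_at[finger] else 0.0

def close_hand(hand, pre_shape):
    """Guarded closing: flex every finger until its force sensor signals contact."""
    hand.move_to(pre_shape)
    in_contact = {f: False for f in hand.fingers}
    while not all(in_contact.values()):
        for finger in hand.fingers:
            if not in_contact[finger]:
                if hand.fingertip_force(finger) > CONTACT_FORCE:
                    in_contact[finger] = True   # hold this finger, keep closing others
                else:
                    hand.flex(finger, STEP)
    return hand.angles

print(close_hand(MockHand(), {"thumb": 0.2, "index": 0.1, "middle": 0.1}))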

3. Structuring of learning

Learning is a very multi-faceted phenomenon and its complexity is amply reflected in the numerous different proposals on how to relate and implement its various aspects. Theoretical considerations motivate a 'horizontal subdivision' of learning into the major types of unsupervised, supervised, and reinforcement learning (with still a substantial number of approaches distinguishable both at the conceptual and the algorithmic level within each type). Recently there has been a stimulating discussion that such a subdivision may even be reflected in anatomically distinguishable subsystems of the brain [11]. While such a subdivision is highly attractive in many respects, we think that there is another very important but different dimension of learning: the time scale at which learning can take place.

At the slowest time scale (which we will call the ontogenetic level) learning methods are used in order to create initial system functions by off-line algorithms, usually operating on rather large data sets prepared in contexts that do not require an already trained system. Examples in our system are the unsupervised training of the sensory front end of a neural-network-based object classifier and the supervised training of recognition modules (e.g. for object identity [18] or for continuous hand posture [27], as explained in some more detail in Section 5). Learning at this ontogenetic level does not involve any behavior of the robot; instead, it permitted us to create important initial subsystem functionalities which would have been much harder to obtain by explicit programming alone. While this level can extend even into quite high levels of abstraction (e.g. the computation of probability tables for Bayes nets to integrate visual and speech recognition results at the symbolic level [40]), its contribution becomes 'frozen' afterwards, since the employed techniques frequently assume the availability of the target modules in isolation, without the complex interactions resulting from being embedded in the complete and working system.

The second, local refinement level comprises those learning processes whose adaptive changes occur on-line, during (and based on) the actual behavior of the robot, and refine its initial capabilities. The increased complexity imposed in this way is compensated by requiring the adaptive changes to be (at least largely) local to each module, so that learning processes at the second level can become 'encapsulated' in a single functional module, thereby achieving a good balance between benefits and implementability. Typical examples in our system are the adjustment of subsystem calibrations (e.g. on-line color re-calibration of the vision system), slow adaptation to changing environment conditions (e.g. 'habituation' of the fingertip sensors, dynamic renormalization of the feature weights in the saccadic system with respect to the current feature statistics, continuous update of the skin color model, or forgetting factors in the object memory), and mechanisms to ensure that system variables remain well inside their operating ranges. This level can be considered as hosting most of the 'long-term plasticity' of our system, and its algorithms can largely be based on ideas of statistical learning. We found that the isolated contribution of the adaptivity of single modules to system performance is often rather small; however, the good tuning of many parameters has a big impact on overall system reliability and performance.
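Many of these refinement-level adaptations amount to slow running-average updates of module parameters. The following generic sketch, with a made-up mean skin color as the tracked statistic, illustrates the kind of update meant here; it is not the system's actual adaptation code.

# Generic sketch of a refinement-level adaptation: a module statistic (here a
# hypothetical mean skin color in RGB) is tracked with a slow running average,
# so the module drifts with changing lighting instead of being re-trained.

ALPHA = 0.01   # adaptation rate; a small value gives the slow, stable drift desired

def update_running_mean(mean, sample, alpha=ALPHA):
    """Exponential moving average; also usable as a 'forgetting factor' update."""
    return [(1.0 - alpha) * m + alpha * s for m, s in zip(mean, sample)]

skin_color = [190.0, 140.0, 120.0]          # initial model (illustrative values)
for observed_pixel in [[180.0, 135.0, 118.0], [175.0, 130.0, 115.0]]:
    skin_color = update_running_mean(skin_color, observed_pixel)
print(skin_color)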


Finally, the third, situated level addresses the challenge of making rapid (ultimately 'one-shot') learning feasible. This cannot rely exclusively on slow and repeated adaptations; instead, this level has to rapidly coordinate adapted subsystem functionalities in very structured, situation-specific and cross-modular ways. Clearly, coming up with workable learning mechanisms at this situated level poses a significant research challenge. The notion of imitation learning has emerged as a very promising paradigm to cope with this challenge. There is considerable debate in the current literature about the way in which an architecture for imitation learning could coordinate the required subsystem functionalities. Some researchers focus on using 'neurally inspired' building blocks of visual processing [42] as their primitives, others approach the problem from the perspective of attention modeling [12] or perception–action systems [6], implementing either a kind of 'imitation at the processing level' or focusing on imitation at the level of joint angles [19]. Whenever such systems execute actions in the real world, their 'inner workings' are considerably affected by the physical constraints of the available robot system. In particular, the advent of more human-like and human-sized robots has had a major impact on the development of techniques for motor learning and skill transfer in this area.

In the context of our own system, we have started to pursue these ideas within a paradigm of situated learning for imitation grasping. In Section 4 we propose a learning architecture intended to address the issues at the situated level. With the system described in Section 2 as a basis, we have started to implement learning on this highest level of abstraction to enable our robot system to carry out a variety of grasps of everyday objects, visually observing and imitating a human instructor who indicates useful grasp postures with his or her own hand.

4. An architecture for situated learning

As pointed out in the previous section, enabling learning at the short time scale of the situated level will depend in an essential way on the highly structured interplay of several functional loops, complementing each other in a tightly coupled fashion to cope with the joint constraints of high-dimensional search spaces and a small number of training samples. In the following, we argue that a suitable interlocking of the three functional loops of observation, internal simulation, and sensorimotor exploration can lead to a scheme that appears sufficiently powerful to enable fast situated learning.

The objective of the observation loop is to gain a promising 'action template' that permits us to dramatically cut down the otherwise huge search spaces for the other two learning loops. This requires watching the environment, or respectively the instructor, in order to (i) extract relevant features, events, and chains of observed partial actions, (ii) translate these from the observed to an intrinsic perspective, and (iii) exploit them for forming an action template that is focused on promising regions of the a priori very high-dimensional search space.

At a second stage the observed action template has to be improved further using an internal simulation loop, exploring possible actions in the vicinity of the observed template and selecting promising action candidates. At this stage it is important to perform learning from an 'intrinsic' perspective in order to incorporate available model knowledge (e.g. about kinematic and sensor constraints of the actually used hardware), which becomes merged with constraint information made available from the observation component. The computational basis of this component is a dynamics-based grasping simulation, allowing the application of reinforcement learning methods to improve the grasping strategy. Due to the availability of full information about joint angles, applied wrenches, or contact points in the simulation, we can make predictions (valid to the extent that the model is accurate) about grasp quality to generate a suitable reward signal even in the absence of corresponding tactile sensors in the actual TUM-hand.

We think that this search-space-restricted reinforcement learning, where the exploration of actions can be restricted to a neighborhood of the observed successful trajectory, is the adequate technique to generate promising action candidates. This combination of observation and reinforcement learning appears very flexible: the neighborhood can be chosen small where highly reliable observations are available, whereas more exploration may be needed where poor data are given.


A typical candidate for the application of this approach in our scenario would be the initialization of a grasping sequence with respect to the approach direction and hand pre-shape based on visual observation of a human hand, which can be obtained by earlier developed hand and fingertip recognition methods (see Section 5 and [27,29]).
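As an illustration of the restricted-exploration idea (our sketch, not the proposed implementation), candidate pre-grasp parameters can be sampled only in a neighborhood of the observed template, with the neighborhood size tied to the observation confidence; the simulated reward below is a mere stand-in for the physics-based grasp quality measure.

import random

# Sketch of search-space-restricted exploration around an observed template.

def sample_candidates(template, confidence, n=20):
    """Higher observation confidence (0..1) -> smaller exploration radius."""
    radius = 0.3 * (1.0 - confidence) + 0.02
    return [[x + random.uniform(-radius, radius) for x in template]
            for _ in range(n)]

def simulated_grasp_quality(params):
    # placeholder for the dynamics-based simulation plus quality function
    return -sum((x - 0.5) ** 2 for x in params)

observed_template = [0.4, 0.6, 0.5]   # e.g. approach angles / pre-shape parameters (assumed)
candidates = sample_candidates(observed_template, confidence=0.8)
best = max(candidates, key=simulated_grasp_quality)
print("best candidate:", [round(x, 3) for x in best])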

Finally, the third, sensorimotor loop is responsible for actually carrying out those action candidates that have been identified as most promising by the other two components. Since the internal simulation model and its observation-based refinement can only approximate the actual situation, the actual execution of the action is faced with two alternative evaluation criteria: (i) a maximization of knowledge gain, with the consequence of 'risky' actions, for instance near decision boundaries; (ii) a maximization of robustness, with the consequence of conservative actions at maximal distance to decision boundaries, leading only to minimal information gain.

Fig. 4 shows the interaction of the different loops. The observation loop acquires example actions from the human instructor, while the simulation loop provides refinements of the observed strategies and adaptation to the constraints imposed by the robot system. This changes robot learning into an interactive situated learning process, which uses speech and the multi-modal perceptual channels for an effective optimization of the system's exploration.

Fig. 4. Multi-stage architecture for the integration of different learning paradigms to enable situated learning on the system level. Shown are the three interlocking functional loops of observation of example actions, internal simulation, and sensorimotor exploration.

5. Towards imitation grasping: observation, simulation, and control of hand posture

While we do not yet have a full implementation of the described architecture, we can report an initial implementation of some of its major features for the scenario of situated learning of grasping of common everyday objects, such as those depicted in Fig. 8. In this scenario, the observation component is a vision module permitting observation and 3D-identification of a human hand posture indicating to the robot a sample of the to-be-used grasp type. The identified hand posture is transformed to the joint angle space of the robot hand and is used at the same time as an initial posture for the physics-based simulation of a corresponding grasp, as shown in Fig. 5.

The hand posture recognition uses a system for visual recognition of arbitrary hand postures which was previously developed in our group. It works in real time but is currently restricted to a predefined viewing area. For a more detailed description of the underlying multi-stage hierarchy of neural networks, which first detect the fingertips in the hand image and then reconstruct the joint angles from the fingertip positions, see [27,29]. In the context of the present paper, it is a good example of a neurally inspired processing module which has been refined in an evolutionary way over a number of years and employs learning mainly at the ontogenetic and, to a minor extent, also at the refinement level. Image locations of the fingertips are identified by a two-level hierarchy of several neural networks trained during an off-line phase.

Fig. 5. Grasp evaluation in physics-based simulation using contact friction cones (left) and the force closure polytope (lower right).


Fig. 6. Observation and recognition of continuous human hand postures to obtain an initial pre-grasp posture for autonomous grasping with the anthropomorphic robot hand. In the upper right part, the observed human hand is shown on the screen together with the reconstructed hand posture visualized by the rendered computer hand. Below it, the operation of the PSOM network is shown, which obtains the inverse transform from the fingertip positions found by observation in the 2D-images to the joint angles of the hand. To the left, the resulting TUM-hand posture can be seen.

A further processing stage, employing a Parameterized Self-Organizing Map (PSOM [41]), transforms the obtained 2D features (fingertip locations) into the joint angles of an articulated hand model approximating the geometry of a five-fingered human hand. Because the PSOM replaces the discrete lattice of the SOM with a continuous manifold and allows the inverse of each learned mapping to be retrieved automatically, it can be trained with data from the analytically computable forward transform from the 3D-posture to the 2D-locations.
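The invert-the-forward-model idea can be illustrated with a toy example: only the forward transform (joint angles to 2D fingertip locations) is modeled, and the inverse is recovered by searching the input manifold for the best match. The two-link forward model below is an assumption, and generic numerical minimization merely stands in for the PSOM's own inversion mechanism.

import numpy as np
from scipy.optimize import minimize

def forward(joint_angles):
    """Toy forward transform: two joint angles -> one 2D fingertip location."""
    a, b = joint_angles
    return np.array([np.cos(a) + 0.6 * np.cos(a + b),
                     np.sin(a) + 0.6 * np.sin(a + b)])

def inverse(target_2d, x0=(0.5, 0.5)):
    """Recover joint angles by minimizing the reprojection error of forward()."""
    err = lambda q: np.sum((forward(q) - target_2d) ** 2)
    return minimize(err, x0).x

observed = forward(np.array([0.8, 0.4]))       # pretend this came from the images
print("recovered joint angles:", np.round(inverse(observed), 3))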

However, to actuate the robot hand, the reconstructed joint angles cannot be used directly, because the robot hand shown in Fig. 6 has only three fingers and differs in palm and finger sizes, proportions of the phalanges, and the arrangement of the fingers at the palm. Additionally, the sensory equipment and control concepts differ, so that we first have to transform the observed human joint angles into an internal perspective to obtain a posture of the robot hand that is approximately equivalent with respect to the functional properties of the grasp. Geometrically, this transformation maps between the different joint angle workspaces and reflects the kinematic constraints imposed by the coupled joints of the robot hand. Furthermore, we lack direct joint angle measurements in the TUM-hand and therefore base grasping on force feedback obtained from the custom-built fingertip sensors and a recently added palm sensor (see Fig. 6). With the latter we can evaluate the shape of an object when carrying out a power grasp, while the fingertip sensors are primarily used for precision grasps.

From the perspective of imitation learning, these incompatibilities between a human and our robot hand are nothing but the manifestation of the gap between the characteristics of the observed system and the imitator's own sensorimotor equipment, plus the different sensory views of the situation. In the present system, the observation component contributes to closing that gap by providing a good initial hand posture from which the robot grasping is started.

This initial posture can be used by the internal simulation to generate a grasping sequence. The simulation utilizes the real-time dynamics-based package Vortex [39], which allows accurate object motion and interaction based on Newtonian physics. We extended the package to provide static and dynamic friction for contacts, which is crucial for successful grasp simulation. To generate the finger trajectories, we use an algorithm that attempts to confine the object by incrementally flexing the fingers in a cage-like fashion, evaluating contact conditions on the way. Although contacts are simulated on the basis of point contacts and are thus necessarily coarse, they provide full force feedback, which is not available with our real-world tactile fingertip and palm sensors. When all further finger movements have become blocked by object contacts, the grasp is considered complete and is evaluated according to a quality function [5], which we evaluate by numerical solution of linear matrix inequalities as recently suggested in [15]. Fig. 5 illustrates the friction cones of a successful grasp together with the resulting polytope of applicable forces. One of the next steps will be to use this information in order to conduct a search for an improved initial condition before the grasp sequence is actually carried out with the real hand.
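For readers unfamiliar with grasp analysis, the quantities shown in Fig. 5 can be stated in standard notation (ours, not reproduced from [5,15]): each contact force f_i must lie inside its friction cone,

\| f_{i}^{t} \| \le \mu_i \, f_{i}^{n},

with tangential component f_i^t, normal component f_i^n, and friction coefficient \mu_i, and the set of net object wrenches reachable with bounded contact forces is

W = \left\{ \sum_i G_i f_i \;:\; f_i \in FC_i, \; \sum_i \| f_i \| \le 1 \right\},

where G_i maps contact force i into the object wrench space. Force closure holds when W contains a neighborhood of the origin, and a common quality measure is the radius of the largest wrench ball inscribed in W, corresponding to the polytope visualized in Fig. 5.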

6. Results for imitation-based grasping

Before the implementation of the imitation grasping subsystem described above, our system had to use pre-programmed associations between known objects and suitable grasps that had been 'hand-tuned' for a limited range of objects in rather labor-intensive experiments.


From this work, we also knew that our artificial hand – despite its serious limitations – can grasp a large number of real-world objects. However, due to the enormous range of possible shapes, generalizing pre-programmed grasps to new and arbitrary objects is a rather hard task. Therefore we tried two imitation strategies with our system: (i) a 'naïve' imitation strategy, in which the observed joint angle trajectories (after their transformation into the three-finger geometry) were directly applied to control the fingers of the TUM-hand during the grasp, until complete closure around the object; (ii) a strategy in which the visually observed hand posture is matched to the initial conditions of a power grip, a precision grip, a three-finger grip, and a two-finger grip, respectively, in order to identify the grip type. Then, using the initial condition for that grip type, the closing of the fingers takes place autonomously using the same algorithm as employed in the simulation, evaluating tactile feedback from the fingertips to sense stable object contact (Fig. 7).
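The matching step of strategy (ii) can be sketched as a nearest-prototype classification of the observed posture; the prototype vectors below are illustrative values, not measurements from our system.

import numpy as np

# Sketch of strategy (ii): the observed hand posture (a reduced joint-angle
# vector) is matched to stored pre-grasp prototypes; the nearest prototype
# selects the grip type whose autonomous closing sequence is then executed.

PROTOTYPES = {
    "power grip":        np.array([0.9, 0.9, 0.9, 0.8]),
    "precision grip":    np.array([0.3, 0.3, 0.1, 0.6]),
    "three-finger grip": np.array([0.5, 0.5, 0.5, 0.5]),
    "two-finger grip":   np.array([0.4, 0.4, 0.0, 0.7]),
}

def classify_grip(observed_posture):
    return min(PROTOTYPES, key=lambda g: np.linalg.norm(PROTOTYPES[g] - observed_posture))

observed = np.array([0.45, 0.5, 0.05, 0.65])   # illustrative reduced joint-angle features
print(classify_grip(observed))                  # -> "two-finger grip"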

Success with the 'naïve' strategy (i) was very limited, such that a quantitative analysis was not even worthwhile. This reflects the fact that purely visual servoing is hardly appropriate for successful grasping, which has to take haptic feedback into account as well. It underlines that in such cases a mixed strategy is required, using information from visual observation as a useful constraint for an action sequence guided to a significant extent by proprioceptive feedback. Indeed, and in line with this expectation, strategy (ii) yielded far better results, permitting successful grasping of many previously unknown objects. Here, the human grasping gesture instructs the robot to select the most appropriate grasp type for the given object and its orientation on the desk. The table in Fig. 8 shows the results of a quantitative experiment in which 21 objects were repeatedly grasped (10 trials for each object) under setting (ii); numeric entries give the number of successful trials for the particular grasp type/object pairing, while dashes indicate infeasible pairings.

Fig. 7. Prototypic grasps with two or three fingers.

Fig. 8. Imitation-based grasping of everyday objects, sorted with respect to the number of successful trials (10–0) out of 10 grasp attempts using the most suitable strategy. The remaining standard grasps are indicated as '+': also possible, '(+)': possible but with less chance of success, '−': not possible. The propeller (no. 2) needs a specialized grasp. The final column gives the number of trials which are robust against rotation of the hand after lifting up ('+': robust in all trials).


The evaluation of the experiment reveals some interesting details. Six objects can be grasped successfully in all of 10 trials if the most suitable grasp type is used. An additional object, the wooden 'propeller' (object no. 2 in Fig. 8), has a particularly complex shape and cannot be grasped by any of the aforementioned grasp types. However, successful grasping of this object is possible with a specialized grasp derived from the three-finger grasp (see the lower left corner image of Fig. 8). The last two objects (keys, no. 20, and pencil, no. 21) are very close to the limits of the hardware capabilities of our hand and currently cannot be grasped at all, remaining as a challenge for further improvements.

Further experiments revealed that small changes in the pre-grasp posture often have a large impact on grasping success. Therefore our next step will be to exploit the simulation stage for optimizing the initial condition before actually executing the grasp, as indicated in Section 5. Possible free parameters to be evaluated in such a simulation are the exact initial joint angles of the fingers, the exact relative position of the hand to the object, and the closing speeds of the fingers.

7. A datamining perspective on robot learning

Regarding learning as a central ingredient to facilitate the construction of complex systems shifts our view from a complex robot whose behavior unfolds according to well-chosen, explicitly designed control mechanisms to a view in which a robot much more resembles a kind of 'datamining engine', foraging flexibly for information and regularities in the sensory images of its environment. This suggests adopting a similar perspective as in the field of datamining and exploiting algorithms from that area, which appear of considerable interest for advanced robots with their need to cope with uncertainty and with situations too complex to be amenable to full analysis on the basis of 'first principles'.

Specifically, we think that recent progress in content-based image database indexing and retrieval [31,33] may have much to offer for learning robots. Future intelligent robots should be endowed with some kind of episodic long-term memory to accumulate visual and other sensory data. Using raw sensor images for this purpose has many attractive features, provided we can solve the task of efficiently indexing into and navigating within large collections of such data [21]. Unlike symbolic scene descriptions, raw sensor images are easy to acquire and collect. Moreover, they do not enforce a prior commitment to a narrow range of possible future queries but remain 'open' to inspection under aspects that may be unforeseeable at the time of their acquisition. This flexibility may be one of the reasons why the memory system in our brain also offers an apparently visually organized interface to our episodic memory.

In fact, recent progress in semantically organizing large image collections with machine learning techniques for unsupervised category formation [10] and automatic labeling with classifiers previously trained on a variety of visual object domains [43] (so that human keyword assignment becomes dispensable) can be seen as a first promising step towards the self-organized structuring of a larger body of sensory experience for an artificial cognitive system. Obviously, any progress along these lines is also of immediate significance for robotics, a field which may motivate us to extend such approaches to additional modalities, e.g. collecting and organizing a database of haptic experiences for dextrous object manipulation. Such 'life-long' learning may turn out to be the only viable solution to acquiring the huge mass of world knowledge that is required for even moderately 'intelligent' behavior.

A very important prerequisite in this respect is a system for data collection, diagnosis, and monitoring. Our system currently employs more than 30 distributed processes with many functional submodules, so that tracing their behavior, collecting sensor data, and detecting errors becomes a non-trivial task. To better cope with this challenge, we have combined adaptive visualization techniques with the rather recent approach of sonification in order to convey an informative, yet intuitive multi-modal display [16] (AVDisp – Audio-Visual Display, Fig. 9). This application collects computational results, useful status information, and data from all distributed processes, tags all these messages with time-stamps, and displays the state of the overall system and the interactions of its functional modules. It enables us to build up a database of all relevant information describing the system behavior and to carry out an analysis of the time behavior of the complete system (as shown in Fig. 3).


Fig. 9. AVDisp: an Audio-Visual Display for system monitoring, data collection, and generation of user feedback (see [16]).

We found that this interface, originally inspired by the datamining perspective, was already very helpful in the final debugging and tuning phases of the complex system on the engineering level, and it now provides a valuable basis for implementing more sophisticated learning capabilities of our system.
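At its core, such a monitoring tool is little more than a time-stamped event log shared by all processes. The sketch below illustrates that idea with made-up module names and messages; it is not the AVDisp implementation.

import time

# Sketch of the monitoring idea behind a tool like AVDisp: every module reports
# status messages, the monitor tags them with time-stamps, and simple queries
# over the resulting log support time-behavior analyses such as the one in Fig. 3.

class Monitor:
    def __init__(self):
        self.log = []                                  # (timestamp, module, message)

    def report(self, module, message):
        self.log.append((time.time(), module, message))

    def duration(self, module, start_msg, end_msg):
        """Time between two logged events of one module (None if not found)."""
        t = {msg: ts for ts, mod, msg in self.log if mod == module}
        if start_msg in t and end_msg in t:
            return t[end_msg] - t[start_msg]
        return None

monitor = Monitor()
monitor.report("vision", "fixation started")
monitor.report("arm", "approach started")
monitor.report("arm", "grasp finished")
print("grasp phase took", monitor.duration("arm", "approach started", "grasp finished"), "s")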

8. Conclusions and outlook

Our initial assumption is that situated and multi-modal interaction is a key prerequisite for learning in artificial intelligent perception–action systems. Thus, we will proceed with the development of the current platform and use it as a basis for a systematic refinement of the described learning architecture. The longer-term goal is to demonstrate speech-enabled imitation learning for instructing grasping tasks, because multi-fingered grasping combines many of the highly developed capabilities of the human cognitive system: the recognition of object shape, type, position and orientation in space; the choice of a grasp appropriate for the object and the intended task; the actual realization of the grasp under complex kinematic constraints; and the subsequent immediate optimization of finger positions and contact forces with respect to grasp stability and manipulability.

We believe that this research program is promising if a sufficiently developed technological basis is available. This basis seems crucial for higher-level architectures and includes sophisticated hardware for data acquisition and action, such as an articulated dextrous hand, as well as algorithms for a robust implementation of the perceptual skills. In particular for the imitation of grasps, we expect progress in the near future from improvements in the field of multi-fingered hands, especially with respect to robustness and tactile sensing. Concerning intelligent control, it is important to have at our disposal a sufficiently high number of robust and adaptive partial skills, a prerequisite toward which many efforts have been made in the course of the Special Collaborative Research Unit SFB 360.

The key towards an integrated architecture is a design which endows the system with a fruitful interrelation of the different aspects of learning and their various techniques on the different levels, in order to generate a flexible and incrementally improving combination of these partial skills and modules. Here we see the main focus of the described learning architecture, knowing that this goal may be reached only by long-term efforts and in incremental steps. We are aware that, in view of the enormous complexity of the respective challenges, this research program also calls for a close collaboration of robotics with neighboring disciplines like neurobiology or cognitive science, and we expect many insights and inspirations from these fields.

Acknowledgements

Among the many people who contributed to the robot system, we thank in particular G. Fink, J. Fritsch, G. Heidemann, T. Hermann, J. Jockusch, N. Jungclaus, F. Lömker, P. McGuire, R. Rae, G. Sagerer, S. Wrede, S. Wachsmuth, and J. Walter. For further contributions of the SFB 360 'Situated Artificial Communicators' and the neuroinformatics and applied informatics groups at the Faculty of Technology of Bielefeld University, see the references.

References

[1] P.K. Allen, A. Miller, P.Y. Oh, B. Leibowitz, Integration of vision, tactile sensing for grasping, International Journal of Intelligent Machines 4 (1) (1999) 129–149.

[2] P. Andry, P. Gaussier, S. Moga, J.P. Banquet, J. Nadel, Learning and communication via imitation: an autonomous robot perspective, IEEE SMC 31 (2001) 431–442.

[3] P. Bakker, Y. Kuniyoshi, Robot see, robot do: an overview of robot imitation, in: Proceedings of the AISB Workshop on Learning in Robots and Animals, Brighton, 1996, pp. 3–11.


[4] C. Bauckhage, G.A. Fink, J. Fritsch, F. Kummert, F. Lömker, G. Sagerer, S. Wachsmuth, An integrated system for cooperative man–machine interaction, in: Proceedings of the IEEE International Symposium on Computational Intelligence in Robotics and Automation, 2001, pp. 328–333.

[5] A. Bicchi, V. Kumar, Robotic grasping and contact: a review, in: Proceedings of the Conference ICRA, 2000, pp. 348–353.

[6] A. Billard, M.J. Mataric, A biologically inspired robotic model for learning by imitation, in: Proceedings of the Fourth International Conference on Autonomous Agents, Barcelona, Spain, 2000.

[7] H. Brandt-Pook, G.A. Fink, S. Wachsmuth, G. Sagerer, Integrated recognition and interpretation of speech for a construction task domain, in: Proceedings of the International Conference on Human–Computer Interaction, vol. 1, 1999, pp. 550–554.

[8] C. Breazeal, B. Scassellati, Challenges in building robots that imitate people, in: K. Dautenhahn, C. Nehaniv (Eds.), Imitation in Animals and Artifacts, MIT Press, Boston, MA, 2002.

[9] C. Breazeal, B. Scassellati, A context-dependent attention system for a social robot, in: Proceedings of the IJCAI, 1999, pp. 1146–1151.

[10] Y. Chen, J.Z. Wang, R. Krovetz, An unsupervised learning approach to content-based image retrieval, in: Proceedings of the IEEE International Symposium on Signal Processing and Its Applications, Paris, France, 2003.

[11] K. Doya, What are the computations of the cerebellum, the basal ganglia, and the cerebral cortex? Neural Networks 12 (1999) 961–974.

[12] J.A. Driscoll, R. Alan Peters II, K.R. Cave, A visual attention network for a humanoid robot, in: Proceedings of the IROS, Victoria, BC, 1998.

[13] G.A. Fink, Developing HMM-based recognizers with ESMERALDA, in: V. Matoušek, P. Mautner, J. Ocelíková, P. Sojka (Eds.), Lecture Notes in Artificial Intelligence, vol. 1692, Berlin, 1999, pp. 229–234.

[14] G.A. Fink, N. Jungclaus, H. Ritter, G. Sagerer, A communication framework for heterogeneous distributed pattern analysis, in: Proceedings of the International Conference on Algorithms and Architectures for Parallel Processing, Brisbane, 1995, pp. 881–890.

[15] L. Han, J.C. Trinkle, Z.X. Li, Grasp analysis as linear matrix inequality problems, IEEE Transactions on Robotics and Automation 16 (6) (2000) 663–673.

[16] T. Hermann, C. Niehus, H. Ritter, Interactive visualization and sonification for monitoring complex processes, in: Proceedings of the International Conference on Auditory Display, Boston, MA, USA, 2003, pp. 247–250.

[17] Ch. Borst, M. Fischer, G. Hirzinger, Calculating hand configurations for precision and pinch grasps, in: Proceedings of the IROS, Lausanne, 2002, pp. 1553–1559.

[18] G. Heidemann, D. Lücke, H. Ritter, A system for various visual classification tasks based on neural networks, in: A. Sanfeliu et al. (Eds.), Proceedings of the ICPR, Barcelona, 2000, pp. 9–12.

[19] A.J. Ijspeert, J. Nakanishi, S. Schaal, Trajectory formation for imitation with nonlinear dynamical systems, in: Proceedings of the IROS, 2001, pp. 752–757.

[20] N. Jungclaus, R. Rae, H. Ritter, An integrated system for advanced human–computer interaction, in: Proceedings of the UCSB Workshop on Signals and Images, USA, 1998, pp. 93–97.

[21] T. Kämpfe, T. Käster, M. Pfeiffer, H. Ritter, G. Sagerer, INDI—intelligent database navigation by interactive and intuitive content-based image retrieval, in: Proceedings of the IEEE 2002 International Conference on Image Processing, vol. III, Rochester, USA, 2002, pp. 921–924.

[22] R. Dillmann, M. Kaiser, A. Ude, Acquisition of elementary robot skills from human demonstration, in: Proceedings of the International Symposium on Intelligent Robotic Systems, Pisa, Italy, 1995, pp. 185–192.

[23] Y. Kuniyoshi, M. Inaba, H. Inoue, Learning by watching: extracting reusable task knowledge from visual observation of human performance, IEEE Transactions on Robotics and Automation 10 (6) (1994) 799–822.

[24] M.J. Mataric, O.C. Jenkins, A. Fod, V. Zordan, Control and imitation in humanoids, in: Proceedings of the AAAI Fall Symposium on Simulating Human Agents, North Falmouth, MA, 2000.

[25] P. McGuire, J. Fritsch, J.J. Steil, F. Röthling, G.A. Fink, S. Wachsmuth, G. Sagerer, H. Ritter, Multi-modal human–machine communication for instructing robot grasping tasks, in: Proceedings of the IROS, 2002, pp. 1082–1089.

[26] P. McKevitt (Ed.), Integration of Natural Language and Vision Processing, Kluwer Academic Publishers, Dordrecht, 1994.

[27] C. Nölker, H. Ritter, Visual recognition of continuous hand postures, IEEE Transactions on Neural Networks 13 (4) (2002) 983–994.

[28] J. Pauli, Learning to recognize and grasp objects, Autonomous Robots 5 (1998) 407–420.

[29] H. Ritter, J.J. Steil, C. Nölker, F. Röthling, P. McGuire, Neural architectures for robot intelligence, Reviews in the Neurosciences 14 (1–2) (2003) 121–143.

[30] B. Roessler, J. Zhang, M. Hoechsmann, Visual guided grasping and generalization using self-valuing learning, in: Proceedings of the IROS, 2002, pp. 944–949.

[31] Y. Rui, T. Huang, S. Chang, Image retrieval: current techniques, promising directions and open issues, Journal of Visual Communication and Image Representation 10 (4) (1999) 39–62.

[32] S. Schaal, Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences 3 (6) (1999) 233–242.

[33] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000) 1349–1380.

[34] J.J. Steil, G. Heidemann, J. Jockusch, R. Rae, N. Jungclaus, H. Ritter, Guiding attention for grasping tasks by gestural instruction: the GRAVIS-robot architecture, in: Proceedings of the IROS, IEEE, 2001, pp. 1570–1577.

[35] J. O'Sullivan, Towards a robot learning architecture, in: W.-M. Shen (Ed.), Learning Action Models, AAAI Press, 1993, pp. 47–51.

[36] J. Lloyd, M. Parker, Real time control under Unix for RCCL, in: Robotics and Manufacturing ISRAM'90, vol. 3, 1990, pp. 237–242.


[37] R. Menzel, K. Woelfl, F. Pfeiffer, The development of a hydraulic hand, in: Proceedings of the Second Conference on Mechatronics and Robotics, 1993, pp. 225–238.

[38] S. Vijayakumar, J. Conradt, T. Shibata, S. Schaal, Overt visual attention for a humanoid robot, in: Proceedings of the IROS, 2001, pp. 2332–2337.

[39] CMLabs, Vortex—Physics Engine for Real-time Simulation. http://www.cm-labs.com.

[40] S. Wachsmuth, G. Sagerer, Bayesian networks for speech and image integration, in: Proceedings of the 18th National Conference on Artificial Intelligence, Edmonton, 2002, pp. 300–306.

[41] J. Walter, C. Nölker, H. Ritter, The PSOM algorithm and applications, in: Proceedings of the Symposium on Neural Computation, Berlin, 2000, pp. 758–764.

[42] H. Wersing, E. Körner, Learning optimized features for hierarchical models of invariant recognition, Neural Computation 15 (7) (2003) 1559–1588.

[43] J.Z. Wang, J. Li, G. Wiederhold, SIMPLIcity: semantics-sensitive integrated matching for picture libraries, IEEE PAMI 23 (2001) 947–963.

[44] M. Yeasin, S. Chaudhuri, Toward automatic robot programming: learning human skill from visual data, IEEE SMC 30 (1) (2000) 180–185.

J.J. Steil received the Diploma in Mathematics from the University of Bielefeld, Germany, in 1993. Since then he has been a member of the Neuroinformatics Group at the University of Bielefeld, working mainly on recurrent networks. In 1995/1996 he stayed for one year at the St. Petersburg Electrotechnical University, Russia, supported by a German Academic Exchange Foundation (DAAD) grant. In 1999, he received the PhD degree with a dissertation on 'Input–Output Stability of Recurrent Neural Networks'; since 2002 he has been appointed Tenured Senior Researcher and Teacher (Akad. Rat). J.J. Steil is a staff member of the special research unit 360 'Situated Artificial Communicators' and the Graduate Program 'Task-oriented Communication', heading projects on robot learning and intelligent systems. His main research interests are the analysis, stability, and control of recurrent dynamics and recurrent learning, and cognitively oriented architectures of complex robots for multi-modal communication and grasping. He is a member of the IEEE Neural Networks Society.

F. Röthling completed a 3-year apprenticeship in the communication electronics industry, working in the field of remote communication at Deutsche Telekom. He received his Diploma in Computer Science from the University of Bielefeld, Germany, in June 2000. He is currently pursuing a PhD in Computer Science at the University of Bielefeld, working within the Neuroinformatics Group and the Collaborative Research Center 360 'Situated Artificial Communicators'. His field of research is multi-fingered hand-control architectures and the development of grasping algorithms using tactile and visual sensors.

R. Haschke received the Diploma in Computer Science from the University of Bielefeld, Germany, in 1999. Since then he has been a member of the Neuroinformatics Group at the University of Bielefeld, working on recurrent neural networks and virtual robotics. He has just finished a PhD within the graduate program 'Structure Forming Phenomena'. Currently he works on learning architectures in the special research unit 360 'Situated Artificial Communicators'. His main research interests are the analysis and control of recurrent dynamics, learning architectures for complex cognitive robots, and models, strategies and evaluation criteria for grasping.

H. Ritter studied Physics and Mathematics at the Universities of Bayreuth, Heidelberg and Munich and received a PhD in Physics from the Technical University of Munich in 1988. Since 1985, he has been engaged in research in the field of neural networks. In 1989 he moved as a guest scientist to the Laboratory of Computer and Information Science at Helsinki University of Technology. Subsequently he was an assistant research professor at the then newly established Beckman Institute for Advanced Science and Technology and the Department of Physics at the University of Illinois at Urbana-Champaign. Since 1990, he has been a professor at the Department of Information Science, University of Bielefeld. His main interests are principles of neural computation, in particular self-organizing and learning systems, and their application to machine vision, robot control, data analysis and interactive man–machine interfaces. In 1999, Helge Ritter was awarded the SEL Alcatel Research Prize and in 2001 the Leibniz Prize of the German Research Foundation (DFG).