
Project Report

Framework for Classical Conditioning in a Mobile Robot: Development of Pavlovian Model and Development of Reinforcement Learning Algorithm to Avoid and Predict Noxious Events

Quentin Delahaye

Technology

Studies from the Department of Technology at Örebro University
Örebro 2014


Supervisors: Dr. Andrey Kiselev

Dr. Amy Loutfi

Examiner: Prof. Franziska Klügl


© Quentin Delahaye, 2014

Title: Framework for Classical Conditioning in a Mobile Robot: Development of Pavlovian Model and Development of Reinforcement Learning Algorithm to Avoid and Predict Noxious Events


Abstract

Nowadays, robots carry more and more sensors, and current technologies allow using them with fewer constraints than before. Sensors are essential for learning about the environment, but they can also be used for classical conditioning and thereby create behaviors for the robot. One of the behaviors developed in this thesis is avoiding and predicting obstacles.

The goal of this thesis is to propose a model for developing a specific behavior that avoids noxious events, namely obstacles.


Contents

1 Introduction
2 Background and Related Works
   2.1 Ultimate Scenario and Tools
   2.2 Type of Obstacles
   2.3 Methods to Detect Obstacles
3 Method
   3.1 Reinforcement Learning and Model of Classical Conditioning
   3.2 Different Models to Compute the Associative Strength
   3.3 Rescorla-Wagner Model of Pavlovian Conditioning and Reinforcement Learning
4 Implementation
   4.1 Global Architecture
   4.2 Storing the V-Values on the Map of the Environment
   4.3 Estimation of the Value of the Constants
   4.4 Rescorla-Wagner Implementation
   4.5 Code Implementation
      4.5.1 Loop Algorithm
   4.6 Compute the Position of Obstacle
      4.6.1 Implementation of Algorithm to Compute the Position of Events
5 Evaluations
   5.1 Observation Results of Computing Position of Events
   5.2 Observation Results
      5.2.1 Experiment 1
      5.2.2 Experiment 2
6 Discussion and Future Works
References



List of Figures

2.1 TurtleBot [2]
3.1 Representation of Pavlovian conditioning
4.1 Global Architecture
4.2 Different modules implemented and used for the Conditioning Unit
4.3 Picture of the matrix drawn with the value of V in each cell
4.4 Associative strength value after 30 trials
4.5 Associative strength value after 30 trials
4.6 Algorithm of the loop
4.7 Position of obstacles according to the event
5.1 Matrix of the environment after each forward bumper hit an obstacle (cell size: 40x40 cm)
5.2 Picture of experiment 1
5.3 Schema explaining the different paths taken by the robot
5.4 V-value displayed on the matrix, with the position of the robot in green
5.5 Picture of the measurement with a box
5.6 Plan of the movement of the TurtleBot to measure the position of the box in the first hour
5.7 Evolution of the V-value of each cell from step 1 to 8 (red and blue lines correspond to walls)
5.8 Evolution of the V-value of each cell from step 1 to 11 (red and blue lines correspond to walls)
5.9 Evolution of the V-value (from step 3 to 11) of the cell which corresponds to the hit with the box at step 3


List of Algorithms

1 Pseudo code to increase the associative strength V-value in the matrix
2 Pseudo code to decrease the associative strength V-value in the matrix


Chapter 1

Introduction

The notion of reflex was introduced by Thomas Willis in the 17th century [11]. A reflex is an involuntary response to a stimulus; for example, we quickly withdraw the hand when it touches something scorching. Ivan P. Pavlov distinguished two different types of reflexes: the unconditioned reflex (UR) and the conditioned reflex (CR), the latter being acquired individually [11]. The UR is a reaction to an unconditioned stimulus (US) and the CR is a reaction to a conditioned stimulus (CS). Pavlov demonstrated that after a few trials in which the CS and US were presented simultaneously, the CS alone was enough to elicit the reflex response. Later, Skinner showed that the response to a CS can be reinforced by its consequences [11] and thereby modifies behavior; this is operant conditioning.

In neuroscience, many models have been developed that provide a mathematical approach to classical conditioning [5]. In this thesis I am inspired by the work of Robert A. Rescorla and Allan R. Wagner [15]. A few other algorithms inspired by Rescorla-Wagner have also been developed, such as Temporal-Difference learning (TD) [13] or Q-learning, which "is a method for solving reinforcement learning problems" [7]. Each method has advantages and drawbacks; we will see later in the thesis which one corresponds to our goal.

In our case, both time and the robot's "recognition of place" have an important impact on the development of conditioned reflexes. Time in particular can be a significant aspect: in the animal world, the effect of a US is reinforced when it occurs with temporal regularity, so that time itself acts as a CS. This is the case in a study on the effect of drugs on rats, which demonstrated that the effect of the drug depended on the periodicity and the hour at which the rats received an injection of amphetamine [3]; the more regular the injection, the bigger the effect. This notion of periodicity could be used to implement reinforcement learning in a robot: the robot avoids obstacles or people in a crowded public space at a specific time because it already met them the day before at the same hour.


The goal of this thesis is to propose a model for developing a specific behavior that avoids noxious events, namely obstacles.

This thesis reports work done in cooperation with two other students, Kaushik Raghavan and Rohith Rao Chennamaneni. Their works are respectively about "Integration of Various Kinds of Sensors and Giving Reliable Information About the State of the Environment" and "Behavior and Path Planning". This thesis focuses on the implementation of an algorithm to develop conditioned reflexes towards obstacles with reinforcement learning.

The report is organized as follows. Chapter 2 defines the goal. Chapter 3 describes our proposed method in detail. The implementation is given in Chapter 4, the evaluation of the experiments in Chapter 5, and a discussion of the thesis in Chapter 6.


Chapter 2

Background and Related Works

2.1 Ultimate Scenario and Tools

Nowadays it is easy for a robot to store the map of a building and to move while determining its current position.

In the project we use a TurtleBot [2] (Fig. 2.1), a differential-drive robot which has multiple sensors and runs ROS [1].

In this project, we consider the ultimate scenario of a robot guide for blind people which is capable of offering the best possible route for a person to reach a destination. The small size of this robot may be convenient for guiding blind people through an unknown building where the indications are only displayed on signs. To guide a person, however, the robot has to elaborate a strategy and find the most comfortable path for that person. When assisting a blind person, it has to generate a comfortable route by already having knowledge of the path and being able to predict the position of eventual obstacles. The best fit for this example would be a robot with a behavior similar to that of a guide dog, which has some basic conditioned and unconditioned reflexes and the memory to recognize places. We were inspired by this type of situation to develop useful tools that use the sensors, the motor driver and the other components of the robot to analyze the environment, avoid collisions, and improve reliability.

We use the ROS middleware to communicate with the robot hardware and build our application in such a way that it can easily be ported to other robots. The different tools we used are:

• RGB image from the Kinect with a field of view of ±45◦ in diagonal

• Depth cloud from the Kinect with a field of vision of ±45◦ in diagonal

• Three Forward bumpers

• The velocity

• The wheel drop


Figure 2.1: TurtleBot [2]

• Motor driver

• Goal trajectory (local planning between two close points)

• TF odometry

• Two degrees of freedom of the TurtleBot

• Battery Info

Sensor data is pre-processed, combined and interpreted to provide input events for developing conditioned reflexes.

2.2 Type of Obstacles

In dynamic environments there are two types of obstacles: static and dynamic obstacles [4]. Another kind of obstacle, however, is crowd flow, which can occupy the same position every day at the same time. This is the case for traffic inside a building during busy hours: most traffic areas are crowded at specific hours (such as arrival, lunch and departure), so there may be fixed patterns of crowd flow at a given place and time. Most of the time this crowd is congested in corners or before a turn [9].


2.3 Methods to Detect Obstacles

SLAM (Simultaneous Localization and Mapping) is implemented in many robots. It consists of methods to build a map while estimating the position of the robot; each detected obstacle is directly placed in the map [10]. According to [4], SLAM detects both static and dynamic obstacles, but only the dynamic obstacles are analyzed and filtered from the data generated by SLAM.

According to Amund Skavhaug and Adam L. Kleppe there are three ways to store a description of an obstacle in a map: "the vector approach and the grid approach" [10], and particle filters.

The vector approach is used in GraphSLAM [20]. In this case the obstacle is represented as a vector which contains different parameters describing it, but according to Skavhaug and Kleppe it is hard to find the parameters that describe the obstacle.

The grid map represents the world as a two-dimensional map split into cells of equal size. According to [6] it is used for indoor applications. Each cell corresponds to an area and contains a value which estimates the probability of that cell being occupied or empty. However, this representation does not store any further information about the obstacles themselves.
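
To make the grid-map idea concrete, the following minimal C++ sketch (hypothetical names, not taken from the cited works) stores one occupancy probability per cell and converts a world position into a cell index:

#include <vector>

// Minimal occupancy-grid sketch: a square-cell map storing, for each cell,
// the estimated probability that the cell is occupied.
struct OccupancyGrid {
    int width, height;            // number of cells along X and Y
    double cellSize;              // side length of one cell, in metres
    std::vector<double> p;        // occupancy probability per cell, row-major

    OccupancyGrid(int w, int h, double size)
        : width(w), height(h), cellSize(size), p(w * h, 0.5) {}

    // Convert a world position (x, y) in metres to the index of its cell.
    int index(double x, double y) const {
        int col = static_cast<int>(x / cellSize);
        int row = static_cast<int>(y / cellSize);
        return row * width + col;
    }
};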

The last method is a mix of the two previous ones. Particle filters use floating points which are randomly drawn in the map [19]; when an obstacle comes in contact with the particles, a point is marked on the SLAM map to represent this obstacle, and the positions of these marked points are stored. "The advantage of this method is that it is multimodal" [10].

There are different methods to detect dynamic obstacles. In the article about "Real-time Detection of Dynamic Obstacle Using Laser Radar" [6], the authors used spatial and temporal differences on a grid map to detect dynamic obstacles: they identify dynamic obstacles by comparing, in real time, the cost of each cell across three different grid maps at three different times.

In our case we propose a solution close to the grid map, and we use the classical conditioning model to develop a cost for each cell.


Chapter 3

Method

3.1 Reinforcement Learning and Model of Classical Conditioning

"Reinforcement learning is learning by interacting with an environment" [21].It is one branch of Operant conditioning. It is subdivided in two other branches:Positive or Negative reward [18]. In our case we use a negative reinforcementlearning in order to avoid noxious rewards. For example, the robot can avoidhitting a person which means for the robot to avoid receiving data from thebumper (the bumper is consider as a noxious reward).

To allow the robot to learn the position and the time of crowded areas, we propose to create two associations: one between a conditioned stimulus and a reward, and one between an unconditioned stimulus and a reward. The result is that the noxious reward event can be predicted from the CS alone.

In this thesis, we create a connection between position/time (CS) and the reward, and another connection between the events interpreted from the sensors (US) and the reward (see Figure 3.1). After the CS has occurred a few times, the associative strength between it and the reward becomes stronger and stronger.

Figure 3.1: Representation of Pavlovian conditioning [8]


3.2 Different Models to Compute the Associative Strength

There are different methods to compute the strength value of the association between the reward and the CS:

• Rescorla Wagner method (RW)

• TD: Temporal-Difference learning

• Q-learning

The particularity of TD is that it uses the time between CS and US [13, 5]; it is called a real-time model [16]. In this thesis we do not use this method because we directly link the CS and the reward and do not compute the time between the two agents (CS and US).

Q-learning computes the delay of the predicted reward ("immediate reward", "delayed reward", and "pure-delayed reward" [7]). In our situation we use only one type of reward, which is periodic and not delayed; indeed, a delayed reward is linked to the time between an event and a reward, not to the periodicity of the event.

The Rescorla-Wagner model has the advantage of being simple to implement, and it allows developing another aspect which is close to animals: inhibition of the conditioned stimulus [17]. To keep the implementation simple we develop Pavlovian conditioning using the Rescorla-Wagner formula [14], and we propose a model of reinforcement learning adapted to our subject.

3.3 Rescorla-Wagner Model of Pavlovian Conditioning and Reinforcement Learning

According to Rescorla and Wagner [15], equations 3.1 and 3.2 give "the associative strength of a given stimulus" [17]. The variables used in the equations are:

• λ "is the maximum conditioning US" [15]: 100 if an US occurred and 0otherwise

• α is "rate parameters dependent" [15] on the CS

• β is "rate parameters dependent" [15] on the US

• V: Strength of the association between CS and the reward

$\Delta V = \alpha\beta(\lambda - V_{total})$   (3.1)


$V_{n+1} = V_n + \alpha\beta(\lambda - V_n)$   (3.2)
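
As an illustration, a single Rescorla-Wagner update can be written as the following C++ sketch. It assumes a single CS, so that the total associative strength reduces to the current V of that stimulus; the function name is ours and not part of the thesis code:

// One Rescorla-Wagner trial (equations 3.1 and 3.2), assuming a single CS
// so that V_total is simply the current associative strength V.
double rescorlaWagnerStep(double v, double alpha, double beta, double lambda) {
    double deltaV = alpha * beta * (lambda - v);  // equation 3.1
    return v + deltaV;                            // equation 3.2
}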

As shown in the thesis about the effect of amphetamine on rats [3], the time factor amplifies the effect and therefore the link between CS and CR. We tried to represent this aspect, inspired by the movie "Groundhog Day" starring Bill Murray, which tells the story of a man who lives the same day, with the same events, every day; as the events are the same every day, he tries to avoid or predict them. Interpreted for the robot, this means: if there is an obstacle at the same position at the same hour as yesterday, the associative strength (V-value) increases. The robot works every day from Monday to Friday; as the robot is off on weekends, we do not include them.

If the robot is at the same place and someone touches it again after an interval of one second, it considers this a new conditioning situation. We therefore compute the RW equation again, starting from the previous value of the associative strength between the position/time and the CR.


Chapter 4

Implementation

4.1 Global Architecture

Fig. 4.1 shows the three modules we developed: the Input Unit, the Conditioning Unit and the Behavior Unit.

The World Model contains data about the environment and the internal states of the robot, and is used for information sharing with the other modules. The data contained by the World Model are listed below (a minimal sketch of this structure follows the list):

• Clock: simulation of a clock which contains hour and day.

• Results of the Conditioning Unit stored in matrices (Section 4.2)

• Current Position of the robot
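
The exact data layout is not given in the thesis; the C++ sketch below only illustrates, with hypothetical names, the three kinds of data listed above:

#include <map>
#include <vector>

struct SimClock { int day; int hour; };    // simulated day and hour
struct Pose     { double x, y, theta; };   // current robot pose

// Hypothetical World Model: one V-value matrix (flattened) per hour,
// plus the simulated clock and the current position of the robot.
struct WorldModel {
    SimClock clock;
    std::map<int, std::vector<double>> vMatrixPerHour; // hour -> V-value of each cell
    Pose robotPose;
};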

The Interpreter Unit analyses the data from the different sensors of the robot (bumpers, RGB Kinect, ...) and interprets the sensor data according to the situation. It defines the type of each event that occurred using basic sentences like "I hit something on the left", "I hit something on the right" or "I approach something"; these sentences correspond to the unconditioned stimuli. The Interpreter Unit sends all interpreted events to the Conditioning Unit and the Behavior Unit.

The Conditioning Unit computes the value of the associative strength (V) between the position and time and the reward (which is here a punishment). For instance, if a dynamic obstacle occurs a few times at the same hour and the same position, conditioning develops such that the robot will predict and avoid it the next time. This module reads the type of each event (sent by the Interpreter Unit) and updates the V-value; it also reads the time and the position of the robot (Fig. 4.2).

The Behavior Unit generates the best path for the robot. It is composed of two parts: the Behavior planner and the Path planner. The Behavior planner allows the robot to develop basic action reflexes by reading data sent by the Interpreter Unit; for example, the robot moves back when someone hits a bumper.


Figure 4.1: Global Architecture

The Path planner computes the best path: it reads data from the Conditioning Unit and generates the path by comparing the V-value of each cell.

4.2 Storing the V-Values on the Map of the Environment

To store the data we use matrices, a model similar to the grid map. The matrices store the resulting V-values and the types of events sent by the Interpreter Unit. Several matrices are used, one matrix for each hour.

We do not need to store the V-values in a very large number of cells. Indeed, to analyze crowd flows it is preferable to use a cell size of 40 cm, because this matches human anatomy [12] and is larger than the size of the TurtleBot, which is 36 cm. Events are also stored in the matrix and linked to the cell and hour they correspond to; the same event cannot be saved more than twice in one cell at a specific hour. To save the data of each matrix, a text file is generated.

Fig. 4.3 is a picture of the matrix, coded in Java, showing the cost of each cell (from 0 to 100) and the position of the robot in green.

We compute the position of each type of event as described in Section 4.6. However, to update the V-value of a cell we need to know the number of the cell


Figure 4.2: Different modules implemented and used for the Conditioning Unit

Figure 4.3: Picture of the matrix drawn with the value of V in each cell


Figure 4.4: Associative strength value after 30 trials [14] with α = 1 and λ = 100

Figure 4.5: Associative strength value after 30 trials [14] with α = 1 and λ = 0 after the 6th trial

which contains the event (or obstacle). That is why we have a function which converts an X/Y position into a cell number.
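
The conversion itself is not listed in the thesis; a minimal C++ sketch, assuming square 40 cm cells and a row-major matrix, could look as follows. As a check, the robot position (0.91, 0.79) from Table 5.1 falls into the cell centred at (1.00, 0.60), which matches the table.

// Convert a world position (metres) into the flat cell number of the matrix,
// assuming 40 cm square cells and a row-major layout (hypothetical names).
const double CELL_SIZE = 0.40;   // metres, see Section 4.2

int positionToCell(double x, double y, int cellsPerRow) {
    int col = static_cast<int>(x / CELL_SIZE);
    int row = static_cast<int>(y / CELL_SIZE);
    return row * cellsPerRow + col;
}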

4.3 Estimation of the Value of the Constants

To determine the value of β, which depends on the event, Fig. 4.4 shows the evolution of the associative strength (V) over 30 trials using the RW algorithm.

As we can see, with β = 0.3 the strength of the association (V-value) is above 40 after two trials. This value means that the association is strong enough to create a direct link between the CS and the reward.
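
As a worked check of this claim, assume the V-value starts at 0, with α = 1, β = 0.3 and λ = 100:

\[
V_1 = 0 + 0.3\,(100 - 0) = 30, \qquad
V_2 = 30 + 0.3\,(100 - 30) = 51 > 40.
\]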

Figure 4.5 shows that when no more events are met after conditioning, with β = 0.3 the curve drops back under 40 within a few trials. So the link between the CS and the reward becomes weak after only a few failures.


4.4 Rescorla-Wagner Implementation

We simplified the Rescorla-Wagner equation from Section 3.3 by setting α = 1:

$V_{now} = V_{previous} + \beta\left(\lambda - \sum V_{previous}\right)$   (4.1)

The value of β depends on the importance of the type of event sent by the Interpreter Unit. Some events, like "I am in danger", are more important and have more impact than "something approaches me". To reflect this difference we change β so as to reduce the number of trials needed to reach a V-value above 40, as discussed in Section 4.3 (a small sketch of this mapping follows the list):

• β = 0.3 (two trials to reach V > 40): "I hit on the left", "I hit on the right", "I hit in front of me", "I approach something" or "something approaches me"

• β = 0.4 (one trial to reach V > 40): "I am in danger"
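
A minimal C++ sketch of this mapping, combined with the simplified update of equation 4.1 (and assuming a single CS per cell so the sum reduces to the previous V), could look as follows; the event strings follow the sentences above:

#include <string>

// Rate parameter beta depending on the type of event (see the list above).
double betaForEvent(const std::string& event) {
    if (event == "I am in danger") return 0.4;  // one trial to reach V > 40
    return 0.3;                                 // bumper hits and approach events
}

// Simplified Rescorla-Wagner update of equation 4.1 with alpha = 1.
double updateV(double vPrevious, const std::string& event, double lambda) {
    return vPrevious + betaForEvent(event) * (lambda - vPrevious);
}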

4.5 Code Implementation

4.5.1 Loop Algorithm

For the experimentation we run the main algorithm (Fig. 4.6) every second. The V-value increases if new events are met. First we read the buffer which contains all events met during the last second. After that, for each event we compute (see details in Algorithm 1):

• The position of the event in the matrix

• The current V-value of the cell concerned, read from the matrix

• The new V-value, computed with λ = 100

The V-value decreases if no events are met. First we read the position of the robot in the matrix. Then we test whether the position of the robot differs from the previous one, or whether the hour has changed. If so, for each event previously met in that cell we compute the new V-value with λ = 0 (Algorithm 2).

Algorithms 1 and 2 are coded in C++. Below are the main lines of the implementation of these two algorithms, which increase and decrease the V-value:


Figure 4.6: Algorithm of the loop


Algorithm 1 Pseudo code to increase the associative strength V-value in the matrix
Require: Type of event.
Require: Current cell.
Require: Hour.
Ensure: V.
1: for all types of event met during the last second do
2:    Get the cell number of the event
3:    Get V from that cell of the matrix
4:    Compute the β value according to the type of event
5:    Compute V with λ = 100
6: end for
7: if the type of event is new for this cell number and hour then
8:    Store the type of event in the matrix data
9: end if

Algorithm 2 Pseudo code to decrease the associative strength V-value in the matrix
Require: Current V.
Require: Current cell.
Require: Hour.
Ensure: V.
1: for all types of event stored in the matrix at this specific hour and cell number do
2:    Compute the β value according to the type of event
3:    Compute V with λ = 0
4: end for
5: if V < 10 then
6:    Erase all events met and stored at this specific hour and cell number
7: end if

4.6 Compute the Position of Obstacle

The TurtleBot has to compute the position of obstacles according to the type of event. Indeed the TurtleBot has three forward bumpers:

• Bumper left

• Bumper front

• Bumper right


Figure 4.7: Position of obstacles according to the event

The position of the obstacle must be computed according to the bumper that was activated. In Figure 4.7 we can see two obstacles which are in different cells (cell 4 and cell 6).

We compute the position of obstacles by using trigonometry:

$X_{obstacle} = X_{robot} + \cos(\theta_{robot} + \theta_{obstacle}) \times d$   (4.2a)

$Y_{obstacle} = Y_{robot} + \sin(\theta_{robot} + \theta_{obstacle}) \times d$   (4.2b)

where the variables in equations 4.2 are (a small computation sketch follows the list):

• The angle θ_robot of the robot with respect to the origin

• The angle θ_obstacle of the obstacle, defined according to the type of event

• The position X_robot, Y_robot of the robot with respect to the origin

• d: distance between the center of the robot and the extremity of the sensor
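
A small C++ sketch of equations 4.2a and 4.2b, with hypothetical names, is given below; with the robot pose of Table 5.1 (0.91, 0.79, 85.56°) and the left-bumper offset of +45°, it yields approximately (0.75, 0.98), in agreement with the Bumper LEFT row of that table.

#include <cmath>

struct Point2D { double x, y; };

// Project the obstacle position from the robot pose, the sensor-dependent
// angle offset and the distance d between the robot centre and the sensor
// (0.25 m in Section 5.1). Angles are in radians.
Point2D obstaclePosition(double xRobot, double yRobot,
                         double thetaRobot, double thetaObstacle,
                         double d = 0.25) {
    Point2D p;
    p.x = xRobot + std::cos(thetaRobot + thetaObstacle) * d;  // equation 4.2a
    p.y = yRobot + std::sin(thetaRobot + thetaObstacle) * d;  // equation 4.2b
    return p;
}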

4.6.1 Implementation of Algorithm to Compute the Position of Events

To compute the position of an event, we first compute the orientation of the TurtleBot with respect to the origin axis. The TF package from ROS makes it easy to obtain this angle. It reports the result from 0 to about 3 (radians), which corresponds to 0° to 180° from the forward direction towards the left of the robot, and vice versa towards the right using negative values (0 to -3, i.e. 0° to -180°). After the conversion to degrees is done, we apply the algorithm described in Section 4.6.


For each type of event we define the theoretical position that the robot deduces from its sensors. In Table 4.1 we give the angle offset for the following events:

Type of event                            Angle
Obstacle on the left                     45°
Obstacle on the right                    -45°
Obstacle in front / someone approaches   0°
Wheel drop                               0°

Table 4.1: Angle offset according to the type of event
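
A short C++ sketch of this angle handling (hypothetical names) is given below: the TF yaw, reported in radians roughly between -π and π, is converted to degrees, and the offset of Table 4.1 is selected from the event type.

#include <string>

const double PI = 3.14159265358979;

// Convert the TF yaw (radians, about -pi..pi) into degrees (-180..180).
double yawToDegrees(double yawRad) {
    return yawRad * 180.0 / PI;
}

// Angle offset of the obstacle relative to the robot heading (Table 4.1).
double eventAngleDegrees(const std::string& event) {
    if (event == "obstacle in left")  return  45.0;  // left bumper
    if (event == "obstacle in right") return -45.0;  // right bumper
    return 0.0;  // front bumper, someone approaches, wheel drop
}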


Chapter 5

Evaluations

5.1 Observation Results of Computing Position of Events

We tested the computation of obstacle positions each time the robot met an obstacle. Fig. 5.1 shows the results we got with the three forward bumpers (the red point is the center point of the base of the TurtleBot, as given in Table 5.1).

Table 5.1 shows three positions given by the program containing the algorithm that computes the position of an event. We used a distance of 25 cm between the center position of the robot and the event.

With a cell size of 40 cm we can see that some obstacles detected by the bumper sensors can end up in the same cell. This is the case for the events detected by the right and front bumpers in Fig. 5.1.

Figure 5.1: Matrix of the environment after each forward bumper hit an obstacle (cell size: 40x40 cm)


Table 5.1: Position of object and event computed by the algorithm and the matrix function

Object/Event    Position X, Y, θ       Center of the cell concerned
Robot           0.91, 0.79, 85.56°     1.00, 0.60
Bumper LEFT     0.75, 0.98             0.60, 1.00
Bumper FRONT    0.93, 1.04             1.00, 1.00
Bumper RIGHT    1.10, 0.95             1.00, 1.00

Figure 5.2: Picture of the experiment 1

5.2 Observation Results

5.2.1 Experiment 1

The following experiment tests all implemented modules together on the TurtleBot. The robot hits an obstacle during its course; the Input Interpreter, the Conditioning Unit and the Behavior planner then have to communicate with each other to develop a new behavior after this event. Fig. 5.3 displays the different paths taken by the TurtleBot.

• Trial 1: The TurtleBot goes directly to the goal; there is no obstacle on its way


Figure 5.3: Schema explaining the different paths taken by the robot

• Trial 2: The TurtleBot hits an obstacle on its path and generates another path to avoid it

• Trial 3: The TurtleBot generates a new path that does not include the cell where the obstacle was in trial 2

Fig. 5.4 shows the result at the end of the experiment, at trial 3. The TurtleBot hit an obstacle (here a foot, Fig. 5.2) which generated events from its sensors. Some V-values increased according to the position of the obstacle and were stored at a specific hour. From the results of experiment 1 we can therefore see that the Path planner developed a new behavior, which consists in avoiding some cells and taking another path.

5.2.2 Experiment 2

As discussed in Section 2.3, crowd flow tends to concentrate in corners. We decided to create a similar situation using a box of size 48 × 32 cm, placed at the entrance of a small corridor 132 cm wide. The box has the advantage of being close in size to the cell size used by the matrix program (Section 4.2). Fig. 5.5 shows the entrance with the box, which takes up only about a fourth of the entrance width.

The box represents a person who suddenly appears in the corner next to the wall. Only the bumpers of the TurtleBot detect this person, after the collision, because the field of view of the Kinect sensor does not allow seeing the person behind the wall.

Fig. 5.6 shows the path of the TurtleBot for the measurement of the V-values. At each return to the start point we change the position of the box in order to do


Figure 5.4: V-value displayed on the matrix with the position of the robot in green

Figure 5.5: Picture of the measurement with a box


Figure 5.6: Plan of the movement of the TurtleBot to measure the position of the box in the first hour

three trials with three different positions of the box. Since the position of the box changes after each return to the start point, we obtain the result of a crowd flow moving in two directions. The numbers in Fig. 5.6 represent the different steps of the robot. The first goal is situated on the left (step 2), the second on the right (step 6) and the last goal is on the left (step 10). To control the TurtleBot we ran the keyboard teleoperation program and moved it manually. We decided that the TurtleBot should take the shortest path to a given point, which means going close to the corners; but if the cells situated in the corners already have a V-value, the TurtleBot moves to the cells where the V-value is smallest. The goal of this scenario is to demonstrate:

• The increase of V-value at different cells

• The decrease of the V-value of one specific cell when the robot moves over this cell a few times (blue path between steps 9 and 10)

• The detection of events with the three forward bumpers

Fig. 5.7 displays the results from step 1 to 8: the cost of several cells increased in each corner. After the scenario is over, Fig. 5.8 displays the results from step 1 to 11. The TurtleBot hits an obstacle in front of it; at this moment the best way to reach step 11 is to turn right. As there is no longer an obstacle in the cell corresponding to the old position of the box at trial 1,


Figure 5.7: Evolution of the V-value of each cell from step 1 to 8 (red and blue lines correspond to walls)

its V-value decreases to 24.99 after one return to the start point. We can observe the evolution of the V-value in Fig. 5.9 according to the number of trials.

Every second the TurtleBot computes the associative strength (V-value) according to the events met. The V-value is then located in a cell according to the current position of the robot and the type of event. Table 5.2 shows that the different positions of the box in the real environment are in agreement with the corresponding cells whose V-value differs from 0. In Fig. 5.8 we can observe three areas; according to Table 5.2, these areas are linked to the different positions of the box at the different trials within the same hour.


Figure 5.8: Evolution of the V-value of each cell from step 1 to 11 (red and blue lines correspond to walls)

Table 5.2: Comparison between box position and cells where the V-value increased

BOX          Center position X, Y of the box    Center position of the cells at the       Bumper used
             in the real environment            different steps of the measurement
BOX trial 1  1.45, 2.29                         Step 1: 1.40, 1.80; Step 3: 1.80, 2.20    Bumper Left, Bumper Right (twice)
BOX trial 2  2.47, 2.29                         Step 5: 2.60, 1.80; Step 7: 2.60, 2.20    Bumper Right, Bumper Left
BOX trial 3  1.95, 2.29                         Step 9: 2.2, 1.80                         Bumper Front


Figure 5.9: Evolution of the V-value (from step 3 to 11) of the cell which corresponds to the hit with the box at step 3.


Chapter 6

Discussion and Future Works

In this thesis we proposed a model for developing a specific behavior which allows avoiding noxious events such as obstacles.

For the classical conditioning part, the Rescorla-Wagner model has been used to increase or decrease the associative strength (V-value) and has been tested (Fig. 5.9). Reinforcement learning has been implemented by using a matrix to store every event met, together with its position and the hour. Fig. 4.6 shows the solution we implemented in the robot to predict events.

The position of obstacles is computed using the three forward bumpers. For that we used the method described in Section 4.6, and the event is then assigned to the corresponding cell of the matrix. The results displayed in Table 5.2 indicate that this method is correct.

Experiment 1 shows a situation of reinforcement learning through the development of a new behavior (avoiding the cell which contained the obstacle in the previous trial). The result in Fig. 5.8 illustrates the use of this model. This scenario is limited by the fact that the robot is teleoperated instead of driven by the path planner, and it does not take the hour into account. Experiment 2, however, demonstrates the evolution of the V-value according to the movement of the crowd and the position of the event.

Returning to the ultimate scenario of a robot guide for blind people, it would be advantageous to integrate the hour and not only the position. Even though this has not been achieved, the basic tools have been developed to pursue this goal (reinforcement learning and the Rescorla-Wagner model).

To store the V-values we used a model based on the grid map. Even though this method has the advantage of saving resources, its precision is too low to detect the shape of static obstacles. An alternative could be to use the particle filter method.

The β-value has been chosen arbitrarily in this thesis so that only two trials are needed before creating a direct link between the CS and the reward. Events get stacked and are published every second, so some events are published with a delay; this is one of the limitations of this system.


Moreover, the V-value can be increased once or twice for the same obstacle, because we do not compute the time between events.

According to the experimental results, the following improvements could be applied to upgrade the proposed model:

• Compute the time between events occurring in the same cell

• Add other sensors (Kinect, sound ...)

• Develop inhibition of the conditioned stimulus


References

[1] ROS. http://www.ros.org/. Accessed: 2014-05-13.

[2] TurtleBot. http://www.turtlebot.com/. Accessed: 2014-05-13.

[3] A. Arvanitogiannis, J. Sullivan, and S. Amir. Time Acts as a Conditioned Stimulus to Control Behavioral Sensitization to Amphetamine in Rats. PhD thesis, Concordia University, Montreal, Quebec, 2000.

[4] Zhirong Zou, Baifan Chen, Lijue Liu, and Xiyang Xu. A Hybrid Data Association Approach for SLAM in Dynamic Environments. Pages 1–7, 2012.

[5] Christian Balkenius. Computational models of classical conditioning: a comparative study. 1998.

[6] Baifan Chen, Zixing Cai, Zheng Xiao, Jinxia Yu, and Limei Liu. Real-time detection of dynamic obstacle using laser radar. In The 9th International Conference for Young Computer Scientists (ICYCS 2008), pages 1728–1732, Nov 2008.

[7] Chris Gaskett. Q-Learning for Robot Control. 1:21–27, 2002.

[8] P. Gaussier. Sciences cognitives et robotique: le défi de l'apprentissage autonome. Pages 35–39.

[9] K. Katabira, T. Suzuki, H. Zhao, Y. Nakagawa, and R. Shibasaki. An analysis of crowds flow characteristics by using laser range scanners. Page 955.

[10] Adam Leon Kleppe and Amund Skavhaug. Obstacle Detection and Mapping in Low-Cost, Low-Power Multi-Robot Systems using an Inverted Particle Filter. Pages 1–15, 2013.

[11] Dominique Lecourt. Dictionnaire d'histoire et philosophie des sciences. Presses Universitaires de France, 2003 edition.

[12] Masakuni Muramatsu, Tunemasa Irie, and Takashi Nagatani. Jamming transition in pedestrian counter flow. Physica A: Statistical Mechanics and its Applications, 267:487–498, 1999.

[13] Yael Niv. Reinforcement learning in the brain. Pages 1–38, 1997.

[14] Michael J. Renner. Learning the Rescorla-Wagner Model of Pavlovian Conditioning: An Interactive Simulation. 2004.

[15] R. Rescorla. Rescorla-Wagner model. 3(3):2237, 2008. Revision #91711.

[16] Richard S. Sutton and Andrew G. Barto. A Temporal-Difference Model of Classical Conditioning.

[17] Jean-Marc Salotti and Florent Lepretre. Classical and operant conditioning as roots of interaction for robots. 2013.

[18] J. E. R. Staddon and Y. Niv. Operant conditioning. 3(9):2318, 2008. Revision #91609.

[19] S. Thrun. Particle filters in robotics. In Proceedings of the 17th Annual Conference on Uncertainty in AI (UAI), 2002.

[20] S. Thrun and M. Montemerlo. The GraphSLAM algorithm with applications to large-scale mapping of urban structures. International Journal on Robotics Research, 25(5/6):403–430, 2005.

[21] F. Woergoetter and B. Porr. Reinforcement learning. 3(3):1448, 2008. Revision #91704.