Introduction: A Problem of Module Learning
Conventional robots have difficulty adapting to environmental changes (E.C.). Reason: there is only one learning module for all environments, and the long learning time prevents the robot from following E.C.
[Figure: the environments surrounding robots change over time: Env. x1 → Env. x2 → … → Env. xx (environmental changes).]
Multiple Learning Modules
Problems: 1. The memory capacity for the EMs is enormous (P1). 2. One module is required for each Env. (P2); a larger number of Env.s makes it difficult to store all the modules.
P1 → module learning using EMs with small data size.
P2 → module learning with fast learning and compact storage of modules.
*: An Environment Model (EM) holds the feature information of an environment.
Learning Algorithms for Robots Behaving Flexibly in Dynamic Environments
Masahiro Ono, Hiroshi Ando, Mamoru Sasaki and Atsushi Iwata
Graduate School of Advanced Sciences of Matter, Hiroshima University
A problem of module learning: the total memory capacity becomes enormous (e.g., on the order of GB in RoboCup Soccer).
Target of research: to propose module learning methods that solve P1 and P2.
Module learning
[Figure: schematic of module learning. A Memory holds Module 1, Module 2, …, Module N; each module contains an Environment Model* (EM), a policy aj = f(si), and a learning function. Sensors A, B, …, N convert the analog situation of the dynamic environment into the digital situation si, and the selected module outputs the action aj.]
1. Selection of a module: which EM is closest to the current Env.?
2. Action selection based on the policy in the selected module.
3. Modification of the policy (the function f(si)) by the learning module.
4. Addition of a module (if necessary).
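A minimal sketch of this four-step loop in Python, with hypothetical names throughout (Module, em_distance, the new-module threshold d_new): the transcript specifies neither the EM distance measure nor the learning rule.

```python
import numpy as np

class Module:
    def __init__(self, em, q):
        self.em = em        # Environment Model: feature vector of one Env.
        self.q = q          # Q-table realizing the policy aj = f(si)

def em_distance(em_a, em_b):
    # Illustrative assumption: Euclidean distance between EM features.
    return float(np.linalg.norm(np.asarray(em_a) - np.asarray(em_b)))

def module_learning_step(modules, si, env_features, d_new=1.0):
    # 1. Selection of a module: which EM is closest to the current Env.?
    mod = min(modules, key=lambda m: em_distance(m.em, env_features))
    # 4. Addition of a module when no stored EM is close enough.
    if em_distance(mod.em, env_features) > d_new:
        mod = Module(np.asarray(env_features, float), np.zeros_like(modules[0].q))
        modules.append(mod)
    # 2. Action selection based on the policy in the selected module.
    aj = int(np.argmax(mod.q[si]))
    # 3. Modification of the policy (f(si)) by the learning function is
    #    done after the outcome is observed (not shown here).
    return aj
```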
Conclusions
We proposed two module learning algorithms to solve the problem that conventional module learning needs an enormous memory capacity:
1. A module learning algorithm using environment models with small data size.
2. A module learning algorithm with fast learning and compact storage of modules.
Simulation experiments confirmed that both algorithms are robust to environmental changes and use memory more effectively. As a next step, we will incorporate the EM used in the first algorithm into the detection of environmental changes and the selection of the RM in the second algorithm.
The correlation ri (i = 1, …, N) between the current policy and RMi is calculated. If ri > rth (a constant threshold), the current policy is eliminated; otherwise it is added as a new RM.
If R**** > Rg, the next RM is selected in order of RM index (example: if the current RM is RMi, the next one is RMi+1). ****: R measures how often the robot achieves the goal.
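A sketch of these two criteria with assumed names; the transcript does not define how the correlation ri is computed, so a Pearson correlation over flattened Q-tables is used here purely for illustration.

```python
import numpy as np

def correlation(q_policy, q_rm):
    # Assumed form of ri: Pearson correlation between the flattened
    # Q-tables of the current policy and of RMi.
    return float(np.corrcoef(q_policy.ravel(), q_rm.ravel())[0, 1])

def store_or_discard(q_policy, rms, r_th):
    # If some RM already represents the current policy (ri > rth),
    # the policy is eliminated; otherwise it is added as a new RM.
    if any(correlation(q_policy, rm) > r_th for rm in rms):
        return rms                     # Elimination
    return rms + [q_policy.copy()]     # Addition

def next_rm(i, R, R_g, num_rms):
    # RM switching as stated above: if R > Rg, the next RM is tried
    # in index order (RMi -> RMi+1), wrapping around.
    return (i + 1) % num_rms if R > R_g else i
```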
Numerical Simulation
Proposed algorithm schematic diagram
Typical modular learning: each module is assigned to one Env.
Proposed idea: each module is assigned to multiple Env.s. The number of modules does not become very large even if the number of Env.s becomes large (compact storage of modules).
Proposed idea: partial policy correction with Q-Learning.
[Figure: two maze environments, Env. xi and Env. xj, each with situations s0–s7, a Start (at s3) and a Goal (G); the policy is a = f(s). When the environment changes from xi to xj, only a partial correction of the policy is needed: f(s3), f(s6), f(s7).]
The policies that require re-learning are updated by Q-Learning.
Maze Problem
Restrictions:
- Situations: 25 (5x5); Actions: 4 (up, down, right, left).
- The environment changes randomly to one of six environments (Env. a–f) every Nc steps.
- If the robot does not reach the Goal within 100 steps, the trial is a failure.
- When the robot reaches the Goal or the trial fails, the robot is returned to the Start.
- Reward: 1 (Goal), 0 (otherwise).
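A minimal Q-Learning sketch of this setting; the learning rate and discount factor are assumed values that the transcript does not give.

```python
import numpy as np

N_S, N_A = 25, 4          # 25 situations (5x5), 4 actions (up, down, right, left)
ALPHA, GAMMA = 0.1, 0.9   # assumed learning rate and discount factor

def q_update(Q, s, a, r, s_next):
    # Standard Q-Learning update. Only the visited entry Q[s, a] changes,
    # which is why the correction after an environmental change is partial:
    # only the situations where the old policy now fails (e.g. f(s3),
    # f(s6), f(s7) in the figure above) are re-learned.
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])

Q = np.zeros((N_S, N_A))
q_update(Q, s=3, a=0, r=1.0, s_next=4)  # one illustrative transition
policy = Q.argmax(axis=1)               # greedy policy a = f(s)
```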
■ Result of the simulation under the condition that the environment changes every Nc steps:
[Figure: total rewards (1–10000, log scale) vs. Nc (10–100000, log scale) for three methods: the proposed system with only 2 modules, Q-Learning, and partial policy correction with Q-Learning.]
Judged comprehensively, the performance of the proposed algorithm is better than that of the other methods.
[Figure: typical modular learning assigns Module 1, Module 2, …, Module N one-to-one to Env. x1, x2, …, xn. The proposed algorithm instead assigns RM1***, …, RM K (K < N), each to a group of similar Env.s (e.g., one RM covers Env. x1, x2, and x5, another covers Env. x4, x8, and xn), following Env. changes.]
***: We define the module in this algorithm as a "Representation Module".
Problem: fast re-learning of the policy is required for fast adaptation to environmental changes.
Proposed algorithm
[Figure: a Memory holds RM 1, RM 2, …, RM K together with the learned policy. The robot exchanges the situation si and the action aj with the dynamic environment. Loop: detection of environmental change → selection of an RM → partial policy correction with Q-Learning → addition of the current policy (if necessary).]
[Figure: the six maze environments (Env. a–f), each a 5x5 grid with a Start (S), a Goal (G), and walls.]
Module learning algorithm with fast learning and compact storage of modules
Typical environment model (EM) = a set of probabilities
[Figure: from situation si, action a1 leads to situation sk with p(sk | si, a1) = 0.3 and to situation sm with p(sm | si, a1) = 0.7.]
- p(sk | si, an) is the probability that taking action an in situation si results in situation sk.
- Each environment has a unique distribution of p.
- Problem: an enormous memory capacity is required.

Update method of Q: Q-Learning (a type of Reinforcement Learning)
[Figure: a path s3 → s6 → s7 → Goal with actions a0, a1, a3; the values Q(s3, a0), Q(s6, a1), Q(s7, a3) are updated from the reward**. Over the 1st, 2nd, and 3rd trials the discounted reward propagates back through time until the task is completed.]
**: Reward is given to the robot when it achieves the goal.

Proposed EM = a set of Q-functions
- Q(si, am) is the quality of action am in situation si (a larger Q means am is better). Ex.: Q(s1, a1) = 10, Q(s1, am) = -1, Q(s1, ak) = 1, so a1 is best in s1.
- Each environment has a unique distribution of Q.
- The memory capacity for an EM composed of Q is much less than that of one composed of p as the number of sensors grows.
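Since each environment leaves a unique distribution of Q, the current environment can be identified by matching an observed Q-table against the stored EMs. The score below is an illustrative assumption; the transcript names a selection score Rx but does not give its formula.

```python
import numpy as np

def match_scores(q_obs, ems):
    # Assumed matching score: correlation between the observed Q-values
    # and each stored EM (itself a Q-table).
    return [float(np.corrcoef(q_obs.ravel(), em.ravel())[0, 1]) for em in ems]

# Toy example: two EMs over 2 situations x 3 actions; Qobs resembles EM 0.
ems = [np.array([[10., -1., 1.], [0., 2., -3.]]),
       np.array([[-5., 2., 0.], [4., -1., 1.]])]
q_obs = np.array([[9., -1., 2.], [0., 1., -3.]])
best = int(np.argmax(match_scores(q_obs, ems)))  # -> 0
```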
[Figure: memory capacity [bit] ($10^0$ to $10^{12}$, log scale) vs. the number of sensors Nse (1–5). An EM composed of Q: $N_a \cdot m \cdot 2^{r N_{se}}$ bits; an EM composed of p: $N_a \cdot m \cdot 2^{2 r N_{se}}$ bits.]
Numerical Simulation
[Figure: an air-hockey-style game; the robot and the opponent face each other, and the opponent's stick defends at one of the positions A, B, C, D, E.]
In the games, the opponent changes its stick position for defense from A to E every set (20 points). (The robot's position is fixed.)
Assumptions:
■ The robot has already learned the strategies for A, B, C and D before the experiment (called SA, SB, SC and SD, respectively).
■ The opponent has only a simple fixed strategy.
■ The performance of the robot is the same as that of the opponent.
[Figure: simulation results.
- The opponent's stick position (= the opponent's characteristics, A–E) vs. game number.
- The strategy (SA, SB, SC, SD) selected by the robot vs. game number; while the opponent selected C, SC was selected, so the selection is correct.
- Vx (total reward) of the Robot and of the Opponent vs. game number (0–120).
- Evaluation of the selection during E: the total points R given to the robot by using each strategy during E, vs. game number (0–200); SC turned out to be the best strategy.]
Module learning with EMs composed of Q-functions: proposed algorithm schematic diagram
[Figure: schematic diagram. A Memory holds Module 1, …, Module n, each storing an EM composed of Q-functions, Qx(sm, aj), and a policy (Policy x); each module x is labeled with $|S_{bx}|^{-1}$ in the figure. From the dynamic environment the robot receives the situation, the action at the previous time, and the reward, from which an observed Q-function Qobs(sm, aj) is built.]
1. Detect the features of the current environment.
2. Select a module. Module selection criterion: the module x that maximizes Rx and does not minimize Vx (Vx: total reward (goal)).
3. Act based on the policy of the selected module. The policy is decided by the probabilities
$$P(s_m, a_k) = \frac{\exp(Q(s_m, a_k)/T)}{\sum_{a'} \exp(Q(s_m, a')/T)}$$
(T: constant).
4. Policy modification based on the update of Q by Q-Learning.
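A direct transcription of the policy probabilities above into Python; the temperature value T = 2.0 is an arbitrary example.

```python
import numpy as np

def boltzmann_policy(q_row, T):
    # P(sm, ak) = exp(Q(sm, ak)/T) / sum over a' of exp(Q(sm, a')/T).
    # Subtracting the max before exp is a standard numerical-stability trick.
    z = np.exp((q_row - q_row.max()) / T)
    return z / z.sum()

probs = boltzmann_policy(np.array([10.0, -1.0, 1.0]), T=2.0)
action = np.random.choice(len(probs), p=probs)  # sample ak in situation sm
```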
Module learning algorithm using environment models with small data size
An EM composed of Q requires only $(1/2)^{r N_{se}}$ times the memory capacity of one composed of p.
(Na: the number of actions; r: the sensor resolution [bit]; Nse: the number of sensors; m: the unit data size of p and Q [bit]. Here Na = 100, r = 3 [bit], m = 8 [bit].)
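A quick check of the two reconstructed capacity formulas with the stated parameter values:

```python
# Memory capacity of an EM in bits, from the formulas above:
#   Q-based: Na * m * 2**(r * Nse)      (one value per (situation, action) pair)
#   p-based: Na * m * 2**(2 * r * Nse)  (one value per (s, a, s') triple)
Na, r, m = 100, 3, 8  # values stated above

for Nse in range(1, 6):
    q_bits = Na * m * 2 ** (r * Nse)
    p_bits = Na * m * 2 ** (2 * r * Nse)
    # q_bits / p_bits == (1/2)**(r * Nse), the ratio stated above.
    print(Nse, q_bits, p_bits)
```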