Learning Algorithms for Robots Behaving Flexibly in Dynamic Environments

Masahiro Ono, Hiroshi Ando, Mamoru Sasaki and Atsushi Iwata
Graduate School of Advanced Sciences of Matter, Hiroshima University


Introduction

A Problem of Module Learning

Conventional robots: difficulty in adapting to environmental changes (E.C.). Reason: a single learning module is used for all environments, and its long learning time prevents the robot from following E.C.

Environments surrounding robots

Multiple learning modules

Reason: 1. The memory capacity for an EM* is enormous (P1). 2. One module is required for each environment (P2). (A larger number of environments makes it difficult to store all the modules.)

P1 → Module learning using EMs with small data size
P2 → Module learning with fast learning and compact storage of modules

*: An Environment Model (EM) holds the feature information of an environment.


[Figure: Environmental changes over time - the environment surrounding the robot switches among Env. x1, Env. x2, ..., Env. xx.]

A problem of module learning: the total memory capacity becomes enormous (e.g., on the order of GB in RoboCup Soccer).

Target of research: proposal of module learning algorithms that solve P1 and P2.

[Schematic: Module learning]
Memory holds Module 1, Module 2, ..., Module N; each module consists of an Environment Model* (EM), a learning function, and a policy. Sensors A, B, ..., Ns convert the analog situation of the dynamic environment into a digital situation si, and the selected module outputs an action aj.
1. Selection of a module: which stored EM is closest to the current environment?
2. Action selection based on the policy in the selected module: aj = f(si).
3. Modification of the policy (the function f(si)) by the learning function.
4. Addition of a module (if necessary).
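The loop above can be sketched in Python as follows. This is only a minimal illustration under assumed interfaces: the `Module` class, the EM similarity measure, and the addition threshold are placeholders, not definitions taken from the poster.

```python
import random

class Module:
    """One learning module: an environment model (EM), a policy, and a learner."""
    def __init__(self):
        self.em = {}       # EM: feature information of an environment (placeholder form)
        self.policy = {}   # policy f: situation si -> action aj

    def similarity(self, features):
        # Placeholder for "which EM is closest to the current environment?"
        shared = set(self.em) & set(features)
        return sum(self.em[k] == features[k] for k in shared) / (len(shared) or 1)

    def select_action(self, si, actions):
        # 2. Action selection based on the policy in the selected module: aj = f(si)
        return self.policy.get(si, random.choice(actions))

    def learn(self, si, aj, reward, si_next):
        # 3. Modification of the policy f(si) by the learning function (details omitted)
        pass

def module_learning_step(modules, features, si, actions, threshold=0.5):
    # 1. Selection of a module: the EM closest to the current environment
    best = max(modules, key=lambda m: m.similarity(features), default=None)
    # 4. Addition of a module if memory is empty or no stored EM is close enough
    if best is None or best.similarity(features) < threshold:
        best = Module()
        modules.append(best)
    return best, best.select_action(si, actions)
```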

Conclusions
We proposed two module learning algorithms to solve the problem that conventional module learning needs an enormous memory capacity:
1. A module learning algorithm using environment models with small data size.
2. A module learning algorithm with fast learning and compact storage of modules.
Simulation experiments confirmed that both algorithms are robust to environmental changes and use memory more efficiently. As a next step, we will incorporate the EM used in the first algorithm into the detection of environmental changes and the selection of RMs in the second algorithm.


The correlation ri (i = 1, ..., N) between the current policy and each RMi is calculated. If ri > rth (a constant threshold), then Elimination; otherwise Addition.

If R**** > Rg, the RM is switched in order of RM index. (Example: if the current RM is RMi, the next one is RMi+1.)
****: R denotes how often the robot achieves the goal.
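A sketch of these two rules in Python. `correlation` is only a stand-in for the poster's measure ri (here, the fraction of situations on which two policies agree), and "Elimination" is read as discarding the redundant current policy rather than storing it.

```python
def correlation(policy_a, policy_b):
    """Stand-in for ri: fraction of shared situations on which the policies agree."""
    shared = set(policy_a) & set(policy_b)
    return sum(policy_a[s] == policy_b[s] for s in shared) / (len(shared) or 1)

def eliminate_or_add(current_policy, rms, rth):
    """ri > rth for some RMi -> Elimination (do not store); otherwise -> Addition."""
    if any(correlation(current_policy, rm) > rth for rm in rms):
        return rms                        # Elimination: a similar RM already exists
    return rms + [dict(current_policy)]   # Addition: store the policy as a new RM

def next_rm_index(i, num_rms, R, Rg):
    """If R > Rg, the RM is switched in index order (RMi -> RMi+1)."""
    return (i + 1) % num_rms if R > Rg else i
```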

Numerical Simulation

Proposed algorithm schematic diagram

Typical modular learning: each module is assigned to one environment.

Proposed idea: each module is assigned to multiple environments. The number of modules does not become very large even if the number of environments becomes large (compact storage of modules).

Proposed idea: partial policy correction with Q-Learning.

Partial correction of the policy: f(s3), f(s6), f(s7).

[Figure: Policies a = f(s) for Env. xi and Env. xj on a maze with situations s0-s7, Start, and Goal G; when the environment changes from xi to xj, only f(s3), f(s6), and f(s7) need to be corrected.]

Only the parts of the policy that require re-learning are updated by Q-Learning, as sketched below.
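A minimal sketch of this partial correction, assuming a tabular Q-function: re-learning starts from the selected module's stored Q-table rather than from scratch, so only the entries for situations revisited after the change (e.g., s3, s6, s7 above) are updated. The learning rate and discount factor are assumed values, not taken from the poster.

```python
import copy

ALPHA, GAMMA = 0.1, 0.9        # assumed learning rate and discount factor
ACTIONS = [0, 1, 2, 3]         # up, down, right, left

def partially_correct(stored_q, experience):
    """Apply Q-Learning updates on top of a stored Q-table; entries for situations
    that are never revisited keep their old values (the 'partial' correction)."""
    q = copy.deepcopy(stored_q)
    for (s, a, reward, s_next) in experience:      # experience gathered after the change
        best_next = max(q.get((s_next, a2), 0.0) for a2 in ACTIONS)
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + ALPHA * (reward + GAMMA * best_next - old)
    # Greedy policy; it differs from the stored one only where Q changed, e.g. f(s3), f(s6), f(s7).
    policy = {s: max(ACTIONS, key=lambda a: q.get((s, a), 0.0)) for (s, _a) in q}
    return q, policy
```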

Maze Problem
Restrictions:
- Situations: 25 (5x5); Actions: 4 (up, down, right, left)
- The environment changes randomly to one of six environments (Env. a - f) every Nc steps.
- If the robot does not reach the Goal within 100 steps, the trial is a failure.
- When the robot reaches the Goal or the trial fails, the robot is returned to Start.
- Reward: 1 (Goal), 0 (otherwise)
A minimal environment matching these restrictions is sketched after this list.
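A sketch of such a maze in Python, under assumptions: the wall layout and the Start/Goal cells are placeholders, because the six layouts (Env. a - f) are only given graphically on the poster.

```python
class Maze5x5:
    """25 situations (5x5 grid), 4 actions, reward 1 at Goal and 0 elsewhere;
    a trial fails if the Goal is not reached within 100 steps."""
    ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]     # up, down, right, left

    def __init__(self, walls, start=(4, 0), goal=(0, 4)):
        self.walls, self.start, self.goal = set(walls), start, goal

    def run_trial(self, policy, max_steps=100):
        pos = self.start
        for _ in range(max_steps):
            dy, dx = self.ACTIONS[policy(pos)]
            nxt = (pos[0] + dy, pos[1] + dx)
            if 0 <= nxt[0] < 5 and 0 <= nxt[1] < 5 and nxt not in self.walls:
                pos = nxt
            if pos == self.goal:
                return 1        # reward 1 at Goal; the robot is then returned to Start
        return 0                # failure: Goal not reached within 100 steps

# Example: env_a = Maze5x5(walls=[(1, 1), (2, 3), (3, 1)]); the environment in use
# would be switched randomly among six such layouts every Nc steps.
```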

■ Result of the simulation under the condition that the environment changes every Nc steps:

[Figure: Total rewards vs. Nc (1 to 100,000, log scale) for three methods - Q-Learning, partial policy correction with Q-Learning, and the proposed system with only 2 modules.]

On an overall judgment, the performance of the proposed algorithm is better than that of the other methods.

[Figure: Typical modular learning assigns Module 1, Module 2, ..., Module N one-to-one to Env. x1, x2, ..., xn. The proposed algorithm assigns each Representation Module*** RM1, ..., RMK (K < N) to a group of similar environments, and the assignment follows the environment changes.]
***: We define a module in this algorithm as a "Representation Module" (RM).

Problem: fast re-learning of the policy is required for fast adaptation to environmental changes.

[Schematic: Proposed algorithm]
The dynamic environment supplies the situation si and receives the action aj. The loop is: detection of an environmental change -> selection of an RM from memory (RM 1, RM 2, ..., RM K) -> partial policy correction of the learned policy with Q-Learning -> addition of the current policy to memory (if necessary).

[Figure: The six 5x5 maze environments with Start S, Goal G, and walls.]

Module learning algorithm with fast learning and compact storage of modules

Typical environment model (EM) = a set of transition probabilities

- p(sk | si, an) denotes the probability that taking action an in situation si results in situation sk.

Ex. p(sk | si, a1) = 0.3, p(sm | si, a1) = 0.7

- Q(si, am) denotes the quality of action am in situation si (a larger Q means that am is a better action).

Ex. Q(si, a1) = 10, so a1 is the best action in si.

[Figure: A path s3 -> s6 -> s7 -> Goal, with the Q-values Q(s3, a0), Q(s6, a1), Q(s7, a3) of the actions along the path.]

[Figure: Reward** and discounted reward over time across the 1st, 2nd, and 3rd trials; the discounted reward propagates back from the moment the task is completed.]
**: A reward is given to the robot when it achieves the goal.

- Problem: an enormous memory capacity is required.

Update method of Q: Q-Learning (a type of reinforcement learning).
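For reference, the standard tabular Q-Learning update has the form below; the learning rate $\alpha$ and discount factor $\gamma$ are not specified on the poster, so treat them as assumed symbols.

$$Q(s_i, a_m) \leftarrow Q(s_i, a_m) + \alpha\left[\, r + \gamma \max_{a'} Q(s', a') - Q(s_i, a_m) \right],$$

where $s'$ is the situation reached after taking $a_m$ in $s_i$ and $r$ is the reward received.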

Proposed EM = a set of Q-functions

- Each environment has a unique distribution of p.
- Each environment also has a unique distribution of Q (Ex. Q(s1, am) = -1, Q(s1, ak) = 1).
- The memory capacity for an EM composed of Q is much less than that of one composed of p, and the gap widens as the number of sensors becomes larger.
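A sketch of the two representations as tables, using the toy values quoted above; it only illustrates why the Q-based EM is smaller: p needs one entry per (situation, action, next situation) triple, while Q needs one per (situation, action) pair.

```python
# EM composed of p: one probability per (situation, action, next situation) triple.
p_em = {
    ("si", "a1", "sk"): 0.3,
    ("si", "a1", "sm"): 0.7,
    # ... one entry for every (si, an, sk) combination
}

# EM composed of Q: one value per (situation, action) pair, i.e. fewer entries
# by a factor equal to the number of situations.
q_em = {
    ("s1", "am"): -1.0,
    ("s1", "ak"): 1.0,
    # ... one entry for every (si, am) combination
}
```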

[Figure: Memory capacity [bit] (log scale, 1 to 1E+12) vs. number of sensors Nse (1 to 5) for the two kinds of EM.]

An EM composed of Q : $N_a \cdot m \cdot 2^{r N_{se}}$ [bit]
An EM composed of p : $N_a \cdot m \cdot 2^{2 r N_{se}}$ [bit]

Numerical Simulation

[Figure: Simulation setup - the robot plays against an opponent whose defensive stick can take positions A, B, C, D, E.]

In the games, the opponent changes its defensive stick position from A to E, one position per set (20 points). (The robot's position is fixed.)

Assumptions:
■ The robot has already learned strategies for positions A, B, C and D before the experiment (called SA, SB, SC and SD, respectively).
■ The opponent has only a simple fixed strategy.
■ The performance of the robot is the same as that of the opponent.

[Figures: (1) the strategy (SA, SB, SC, SD) selected by the robot vs. game number (0 to 120); (2) the total points of the robot and the opponent vs. game number; (3) Vx vs. game number.]

SC turned out to be the best strategy.

[Figure: The total points given to the robot by each strategy during E.]

Evaluation of the strategy selection during E:

[Figure: The opponent's stick position (its defining characteristic, A to E) and the total points R obtained under strategies SA, SB, SC, SD vs. game number (0 to 200); annotations note that the opponent selected C and that the algorithm's selection is correct.]

Proposed algorithm schematic diagram

Module learning with EMs composed of Q-functions

The policy is decided by the probabilities (T: a constant temperature):

$$P(s_m, a_k) = \frac{\exp\!\big(Q(s_m, a_k)/T\big)}{\sum_{a'} \exp\!\big(Q(s_m, a')/T\big)}$$
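A sketch of this Boltzmann (softmax) action selection in Python; the Q-table layout and the temperature value are illustrative.

```python
import math
import random

def boltzmann_action(q, s_m, actions, T=1.0):
    """Sample a_k with probability exp(Q(s_m, a_k)/T) / sum_a' exp(Q(s_m, a')/T)."""
    weights = [math.exp(q.get((s_m, a), 0.0) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# Example: boltzmann_action({("s1", "am"): -1.0, ("s1", "ak"): 1.0}, "s1", ["am", "ak"])
```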

[Schematic: Proposed algorithm]
Memory holds Module 1, ..., Module n, each containing a Q-function Qx(sm, aj) and the corresponding Policy x. The dynamic environment supplies the situation, the action at the previous time, and the reward; the selected Module x returns an action through Policy x. The loop is:
1. Detect features of the current environment: the observed Qobs(sm, aj) is compared with each module's Qx(sm, aj) (with the factors |Sb1|^-1, ..., |Sbn|^-1) to obtain the module selection criterion Rx; Vx is the total reward (goal).
2. Select the module x with max Rx and not min Vx.
3. Act based on the policy in the selected module.
4. Modify the policy based on the update of Q by Q-Learning.

Module learning algorithm using environment models with small data size

An EM composed of Q requires only $(1/2)^{r N_{se}}$ times the memory capacity of one composed of p.

Na: the number of actions, r: sensor resolution [bit], Nse: the number of sensors, m: unit data size of p and Q [bit]. (Na = 100, r = 3 [bit], m = 8 [bit])
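A quick numerical check of this ratio, assuming the memory formulas reconstructed above (a sketch, not the poster's own calculation):

```python
NA, R, M = 100, 3, 8                 # Na actions, r [bit] per sensor, m [bit] per entry

for nse in range(1, 6):              # number of sensors, as on the plot's x-axis
    q_bits = NA * M * 2 ** (R * nse)         # EM composed of Q
    p_bits = NA * M * 2 ** (2 * R * nse)     # EM composed of p
    assert q_bits / p_bits == 0.5 ** (R * nse)   # the (1/2)**(r*Nse) ratio
    print(nse, q_bits, p_bits)
```

For Nse = 5 this gives about 2.6e7 bits for the Q-based EM versus about 8.6e11 bits for the p-based EM, consistent with the vertical range of the memory-capacity plot.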