Learning Algorithms for Robots Behaving Flexibly in Dynamic Environments

Masahiro Ono, Hiroshi Ando, Mamoru Sasaki and Atsushi Iwata
Graduate School of Advanced Sciences of Matter, Hiroshima University


Introduction

A Problem of Module Learning

Conventional robots: difficulty in adapting to environmental changes (E.C.). Reason: a single learning module is used for all environments, and its long learning time prevents the robot from following E.C.

Environments surrounding robots

Multiple learning modules

Reason: 1. The memory capacity for an EM* is enormous (P1). 2. One module is required for each environment (P2). (A larger number of environments makes it difficult to store all the modules.)

P1 → Module learning using EMs with small data size
P2 → Module learning with fast learning and compact storage of modules

*: An Environment Model (EM) holds the feature information of an environment.


[Figure: Environmental changes over time - the environment surrounding the robot switches among Env. x1, Env. x2, ..., Env. xx.]

A problem of module learning: the total memory capacity becomes enormous (e.g., on the order of GB in RoboCup Soccer).

Target of research: proposal of module learning algorithms that solve P1 and P2.

[Schematic: Module learning]
Memory holds Module 1, Module 2, ..., Module N; each module consists of an Environment Model* (EM), a learning function, and a policy. Sensors A, B, ..., Ns convert the analog situation of the dynamic environment into a digital situation si, and the selected module outputs an action aj.
1. Selection of a module: which stored EM is closest to the current environment?
2. Action selection based on the policy in the selected module: aj = f(si).
3. Modification of the policy (the function f(si)) by the learning function.
4. Addition of a module (if necessary).
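The loop above can be sketched in Python as follows. This is only a minimal illustration under assumed interfaces: the `Module` class, the EM similarity measure, and the addition threshold are placeholders, not definitions taken from the poster.

```python
import random

class Module:
    """One learning module: an environment model (EM), a policy, and a learner."""
    def __init__(self):
        self.em = {}       # EM: feature information of an environment (placeholder form)
        self.policy = {}   # policy f: situation si -> action aj

    def similarity(self, features):
        # Placeholder for "which EM is closest to the current environment?"
        shared = set(self.em) & set(features)
        return sum(self.em[k] == features[k] for k in shared) / (len(shared) or 1)

    def select_action(self, si, actions):
        # 2. Action selection based on the policy in the selected module: aj = f(si)
        return self.policy.get(si, random.choice(actions))

    def learn(self, si, aj, reward, si_next):
        # 3. Modification of the policy f(si) by the learning function (details omitted)
        pass

def module_learning_step(modules, features, si, actions, threshold=0.5):
    # 1. Selection of a module: the EM closest to the current environment
    best = max(modules, key=lambda m: m.similarity(features), default=None)
    # 4. Addition of a module if memory is empty or no stored EM is close enough
    if best is None or best.similarity(features) < threshold:
        best = Module()
        modules.append(best)
    return best, best.select_action(si, actions)
```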

Conclusions
We proposed two module learning algorithms to solve the problem that conventional module learning needs an enormous memory capacity:
1. A module learning algorithm using environment models with small data size.
2. A module learning algorithm with fast learning and compact storage of modules.
Simulation experiments confirmed that both algorithms are robust to environmental changes and use memory more efficiently. As a next step, we will incorporate the EM used in the first algorithm into the detection of environmental changes and the selection of RMs in the second algorithm.


The correlation ri (i = 1, ..., N) between the current policy and each RMi is calculated. If ri > rth (a constant threshold), then Elimination; otherwise Addition.

If R**** > Rg, the RM is switched in order of RM index. (Example: if the current RM is RMi, the next one is RMi+1.)
****: R denotes how often the robot achieves the goal.
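A sketch of these two rules in Python. `correlation` is only a stand-in for the poster's measure ri (here, the fraction of situations on which two policies agree), and "Elimination" is read as discarding the redundant current policy rather than storing it.

```python
def correlation(policy_a, policy_b):
    """Stand-in for ri: fraction of shared situations on which the policies agree."""
    shared = set(policy_a) & set(policy_b)
    return sum(policy_a[s] == policy_b[s] for s in shared) / (len(shared) or 1)

def eliminate_or_add(current_policy, rms, rth):
    """ri > rth for some RMi -> Elimination (do not store); otherwise -> Addition."""
    if any(correlation(current_policy, rm) > rth for rm in rms):
        return rms                        # Elimination: a similar RM already exists
    return rms + [dict(current_policy)]   # Addition: store the policy as a new RM

def next_rm_index(i, num_rms, R, Rg):
    """If R > Rg, the RM is switched in index order (RMi -> RMi+1)."""
    return (i + 1) % num_rms if R > Rg else i
```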

Numerical Simulation

Proposed algorithm schematic diagram

Typical modular learning: each module is assigned to one environment.

Proposed idea: each module is assigned to multiple environments. The number of modules does not become very large even if the number of environments becomes large (compact storage of modules).

Proposed idea: partial policy correction with Q-Learning.

Partial correction of the policy: f(s3), f(s6), f(s7).

[Figure: Policies a = f(s) for Env. xi and Env. xj on a maze with situations s0-s7, Start, and Goal G; when the environment changes from xi to xj, only f(s3), f(s6), and f(s7) need to be corrected.]

Only the parts of the policy that require re-learning are updated by Q-Learning, as sketched below.
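A minimal sketch of this partial correction, assuming a tabular Q-function: re-learning starts from the selected module's stored Q-table rather than from scratch, so only the entries for situations revisited after the change (e.g., s3, s6, s7 above) are updated. The learning rate and discount factor are assumed values, not taken from the poster.

```python
import copy

ALPHA, GAMMA = 0.1, 0.9        # assumed learning rate and discount factor
ACTIONS = [0, 1, 2, 3]         # up, down, right, left

def partially_correct(stored_q, experience):
    """Apply Q-Learning updates on top of a stored Q-table; entries for situations
    that are never revisited keep their old values (the 'partial' correction)."""
    q = copy.deepcopy(stored_q)
    for (s, a, reward, s_next) in experience:      # experience gathered after the change
        best_next = max(q.get((s_next, a2), 0.0) for a2 in ACTIONS)
        old = q.get((s, a), 0.0)
        q[(s, a)] = old + ALPHA * (reward + GAMMA * best_next - old)
    # Greedy policy; it differs from the stored one only where Q changed, e.g. f(s3), f(s6), f(s7).
    policy = {s: max(ACTIONS, key=lambda a: q.get((s, a), 0.0)) for (s, _a) in q}
    return q, policy
```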

Maze Problem
Restrictions:
- Situations: 25 (5x5); Actions: 4 (up, down, right, left)
- The environment changes randomly to one of six environments (Env. a - f) every Nc steps.
- If the robot does not reach the Goal within 100 steps, the trial is a failure.
- When the robot reaches the Goal or the trial fails, the robot is returned to Start.
- Reward: 1 (Goal), 0 (otherwise)
A minimal environment matching these restrictions is sketched after this list.
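A sketch of such a maze in Python, under assumptions: the wall layout and the Start/Goal cells are placeholders, because the six layouts (Env. a - f) are only given graphically on the poster.

```python
class Maze5x5:
    """25 situations (5x5 grid), 4 actions, reward 1 at Goal and 0 elsewhere;
    a trial fails if the Goal is not reached within 100 steps."""
    ACTIONS = [(-1, 0), (1, 0), (0, 1), (0, -1)]     # up, down, right, left

    def __init__(self, walls, start=(4, 0), goal=(0, 4)):
        self.walls, self.start, self.goal = set(walls), start, goal

    def run_trial(self, policy, max_steps=100):
        pos = self.start
        for _ in range(max_steps):
            dy, dx = self.ACTIONS[policy(pos)]
            nxt = (pos[0] + dy, pos[1] + dx)
            if 0 <= nxt[0] < 5 and 0 <= nxt[1] < 5 and nxt not in self.walls:
                pos = nxt
            if pos == self.goal:
                return 1        # reward 1 at Goal; the robot is then returned to Start
        return 0                # failure: Goal not reached within 100 steps

# Example: env_a = Maze5x5(walls=[(1, 1), (2, 3), (3, 1)]); the environment in use
# would be switched randomly among six such layouts every Nc steps.
```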

■ Result of the simulation under the condition that the environment changes every Nc steps:

[Figure: Total rewards vs. Nc (1 to 100,000, log scale) for three methods - Q-Learning, partial policy correction with Q-Learning, and the proposed system with only 2 modules.]

On an overall judgment, the performance of the proposed algorithm is better than that of the other methods.

[Figure: Typical modular learning assigns Module 1, Module 2, ..., Module N one-to-one to Env. x1, x2, ..., xn. The proposed algorithm assigns each Representation Module*** RM1, ..., RMK (K < N) to a group of similar environments, and the assignment follows the environment changes.]
***: We define a module in this algorithm as a "Representation Module" (RM).

Problem: fast re-learning of the policy is required for fast adaptation to environmental changes.

[Schematic: Proposed algorithm]
The dynamic environment supplies the situation si and receives the action aj. The loop is: detection of an environmental change -> selection of an RM from memory (RM 1, RM 2, ..., RM K) -> partial policy correction of the learned policy with Q-Learning -> addition of the current policy to memory (if necessary).

[Figure: The six 5x5 maze environments with Start S, Goal G, and walls.]

Module learning algorithm with fast learning and compact storage of modules

Typical environment model (EM) = a set of transition probabilities

- p(sk | si, an) denotes the probability that taking action an in situation si results in situation sk.

Ex. p(sk | si, a1) = 0.3, p(sm | si, a1) = 0.7

- Q(si, am) denotes the quality of action am in situation si (a larger Q means that am is a better action).

Ex. Q(si, a1) = 10, so a1 is the best action in si.

[Figure: A path s3 -> s6 -> s7 -> Goal, with the Q-values Q(s3, a0), Q(s6, a1), Q(s7, a3) of the actions along the path.]

[Figure: Reward** and discounted reward over time across the 1st, 2nd, and 3rd trials; the discounted reward propagates back from the moment the task is completed.]
**: A reward is given to the robot when it achieves the goal.

- Problem: an enormous memory capacity is required.

Update method of Q: Q-Learning (a type of reinforcement learning).
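For reference, the standard tabular Q-Learning update has the form below; the learning rate $\alpha$ and discount factor $\gamma$ are not specified on the poster, so treat them as assumed symbols.

$$Q(s_i, a_m) \leftarrow Q(s_i, a_m) + \alpha\left[\, r + \gamma \max_{a'} Q(s', a') - Q(s_i, a_m) \right],$$

where $s'$ is the situation reached after taking $a_m$ in $s_i$ and $r$ is the reward received.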

Proposed EM = a set of Q-functions

- Each environment has a unique distribution of p.
- Each environment also has a unique distribution of Q (Ex. Q(s1, am) = -1, Q(s1, ak) = 1).
- The memory capacity for an EM composed of Q is much less than that of one composed of p, and the gap widens as the number of sensors becomes larger.
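A sketch of the two representations as tables, using the toy values quoted above; it only illustrates why the Q-based EM is smaller: p needs one entry per (situation, action, next situation) triple, while Q needs one per (situation, action) pair.

```python
# EM composed of p: one probability per (situation, action, next situation) triple.
p_em = {
    ("si", "a1", "sk"): 0.3,
    ("si", "a1", "sm"): 0.7,
    # ... one entry for every (si, an, sk) combination
}

# EM composed of Q: one value per (situation, action) pair, i.e. fewer entries
# by a factor equal to the number of situations.
q_em = {
    ("s1", "am"): -1.0,
    ("s1", "ak"): 1.0,
    # ... one entry for every (si, am) combination
}
```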

[Figure: Memory capacity [bit] (log scale, 1 to 1E+12) vs. number of sensors Nse (1 to 5) for the two kinds of EM.]

An EM composed of Q : $N_a \cdot m \cdot 2^{r N_{se}}$ [bit]
An EM composed of p : $N_a \cdot m \cdot 2^{2 r N_{se}}$ [bit]

Numerical Simulation

[Figure: Simulation setup - the robot plays against an opponent whose defensive stick can take positions A, B, C, D, E.]

In the games, the opponent changes its defensive stick position from A to E, one position per set (20 points). (The robot's position is fixed.)

Assumptions:
■ The robot has already learned strategies for positions A, B, C and D before the experiment (called SA, SB, SC and SD, respectively).
■ The opponent has only a simple fixed strategy.
■ The performance of the robot is the same as that of the opponent.

[Figures: (1) the strategy (SA, SB, SC, SD) selected by the robot vs. game number (0 to 120); (2) the total points of the robot and the opponent vs. game number; (3) Vx vs. game number.]

SC turned out to be the best strategy.

[Figure: The total points given to the robot by each strategy during E.]

Evaluation of the strategy selection during E:

[Figure: The opponent's stick position (its defining characteristic, A to E) and the total points R obtained under strategies SA, SB, SC, SD vs. game number (0 to 200); annotations note that the opponent selected C and that the algorithm's selection is correct.]

Proposed algorithm schematic diagram

Module learning with EMs composed of Q-functions

The policy is decided by the probabilities (T: a constant temperature):

$$P(s_m, a_k) = \frac{\exp\!\big(Q(s_m, a_k)/T\big)}{\sum_{a'} \exp\!\big(Q(s_m, a')/T\big)}$$
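A sketch of this Boltzmann (softmax) action selection in Python; the Q-table layout and the temperature value are illustrative.

```python
import math
import random

def boltzmann_action(q, s_m, actions, T=1.0):
    """Sample a_k with probability exp(Q(s_m, a_k)/T) / sum_a' exp(Q(s_m, a')/T)."""
    weights = [math.exp(q.get((s_m, a), 0.0) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

# Example: boltzmann_action({("s1", "am"): -1.0, ("s1", "ak"): 1.0}, "s1", ["am", "ak"])
```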

[Schematic: Proposed algorithm]
Memory holds Module 1, ..., Module n, each containing a Q-function Qx(sm, aj) and the corresponding Policy x. The dynamic environment supplies the situation, the action at the previous time, and the reward; the selected Module x returns an action through Policy x. The loop is:
1. Detect features of the current environment: the observed Qobs(sm, aj) is compared with each module's Qx(sm, aj) (with the factors |Sb1|^-1, ..., |Sbn|^-1) to obtain the module selection criterion Rx; Vx is the total reward (goal).
2. Select the module x with max Rx and not min Vx.
3. Act based on the policy in the selected module.
4. Modify the policy based on the update of Q by Q-Learning.

Module learning algorithm using environment models with small data size

An EM composed of Q requires only $(1/2)^{r N_{se}}$ times the memory capacity of one composed of p.

Na: the number of actions, r: sensor resolution [bit], Nse: the number of sensors, m: unit data size of p and Q [bit]. (Na = 100, r = 3 [bit], m = 8 [bit])
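A quick numerical check of this ratio, assuming the memory formulas reconstructed above (a sketch, not the poster's own calculation):

```python
NA, R, M = 100, 3, 8                 # Na actions, r [bit] per sensor, m [bit] per entry

for nse in range(1, 6):              # number of sensors, as on the plot's x-axis
    q_bits = NA * M * 2 ** (R * nse)         # EM composed of Q
    p_bits = NA * M * 2 ** (2 * R * nse)     # EM composed of p
    assert q_bits / p_bits == 0.5 ** (R * nse)   # the (1/2)**(r*Nse) ratio
    print(nse, q_bits, p_bits)
```

For Nse = 5 this gives about 2.6e7 bits for the Q-based EM versus about 8.6e11 bits for the p-based EM, consistent with the vertical range of the memory-capacity plot.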