Soar-RL: Reinforcement Learning and Soar
Shelley Nason

Page 1:

Soar-RL: Reinforcement Learning and Soar
Shelley Nason

Page 2:

Reinforcement Learning

Reinforcement learning: learning how to act so as to maximize the expected cumulative value of a (numeric) reward signal

In Soar terminology, RL learns operator comparison knowledge

Page 3:

A learning method for low-knowledge situations

Non-explanation-based, trial-and-error learning: RL does not require any model of operator effects to improve action choice.

Additional requirement: rewards. Therefore the RL component should be automatic and general-purpose. Ultimately we want to avoid:
- Task-specific hand-coding of features
- Hand-decomposed task or reward structure
- Programmer tweaking of learning parameters
- And so on

Page 4:

Q-values

Q(s,a): the expected discounted sum of future rewards, given that the agent takes action a from state s and follows a particular policy thereafter.

Given the optimal Q-function, selecting the action with the highest Q-value at each state yields the optimal policy.

[Figure: a chain of state, action, reward, state, action, reward, etc., with each successive reward discounted by λ.]
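In standard RL terms, here is a minimal Python sketch of the quantity Q(s,a) estimates and of how an optimal Q-function is used once learned; gamma stands in for the discount factor (written λ in the figure above, γ later in the deck), and a plain (state, action)-keyed table stands in for the learned Q-function:

# Sketch only: the discounted return that Q(s, a) estimates, and greedy action
# selection from a (state, action)-keyed table of Q-values.

def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each discounted by gamma per step into the future."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def greedy_action(q, state, actions):
    """With the optimal Q-function, picking the highest-valued action in each
    state yields the optimal policy."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))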

Page 5:

Representing the Q-function

In Soar-RL, the Q-function is stored as productions that test the state and operator and assert numeric preferences:

sp {RL-rule
   (state <s> ^operator <o> +)
   ...
-->
   (<s> ^operator <o> = 0.33231)}

During decision phase, the Q-value of an operator O is taken to be the sum of all numeric preferences asserted for O.
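A minimal sketch of that decision-phase computation, assuming the numeric preferences asserted this cycle have been collected as (operator, value) pairs (the names here are illustrative, not Soar's internal API):

def operator_q_value(numeric_preferences, operator):
    """Q-value of an operator: the sum of all numeric preferences asserted for it
    by the RL rules that fired this decision cycle."""
    return sum(value for op, value in numeric_preferences if op == operator)

# Example: [("fill-j2", 0.5), ("fill-j2", -0.25), ("empty-j1", 0.2)] gives a
# Q-value of 0.25 for fill-j2 and 0.2 for empty-j1.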

Page 6:

Learning

Q-value represented in two stages: State x Action -> Set of Features -> Value

Q-learning: move the value of Q(s_t, a_t) toward r_t + γ * max_a Q(s_{t+1}, a).

Bootstrapping: update the prediction at one step using the prediction at the next step.

[Figure: a state, action, reward, state transition, with automatic feature generation feeding Q-learning, which updates the parameters of a linear function.]
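A sketch of this update in Python, treating each RL rule as a binary feature of a linear Q-function, so that Q(s,a) is the sum of the weights of the rules firing for that state-action pair. The learning rate, discount value, and the equal split of the correction across firing rules are illustrative choices, not a specification of Soar-RL's exact update:

GAMMA = 0.9   # discount factor
ALPHA = 0.1   # learning rate

def q_value(weights, active_rules):
    """Q(s, a): sum of the weights of the RL rules firing for this state-action pair."""
    return sum(weights[r] for r in active_rules)

def q_learning_update(weights, active_rules, reward, next_rule_sets):
    """Move Q(s_t, a_t) toward r_t + GAMMA * max_a Q(s_{t+1}, a).
    next_rule_sets holds, for each candidate action in the next state,
    the set of rules that fire for it."""
    target = reward + GAMMA * max(
        (q_value(weights, rules) for rules in next_rule_sets), default=0.0)
    error = target - q_value(weights, active_rules)
    for r in active_rules:
        weights[r] += ALPHA * error / len(active_rules)   # spread the correction over the firing rules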

Page 7:

Current Work: Automatic Feature Generation

Constructing the rule conditions with which to associate values.

Since values are stored with RL rules, some RL rule must fire for every state-action pair for bootstrapping to work.

Enough distinctions are required that the agent will not confuse state-action pairs with significantly different Q-values.

We want rules that take advantage of opportunities for generalization.

Page 8:

Waterjug Task: a reasonable set of rules

One rule for each state-action pair, for instance:

sp {RL-0003
   :rl
   (state <s> ^jug <j1> ^jug <j2> ^operator <o> +)
   (<j1> ^volume 5 ^contents 5)
   (<j2> ^volume 3 ^contents 0)
   (<o> ^name fill ^jug <j2>)
-->
   (<s> ^operator <o> = 0)}

46 RL rules

[Figure: the waterjug state-space graph, with states labeled by (5-jug contents, 3-jug contents): (0,0), (0,3), (3,0), (3,3), (5,1), (0,1), (1,0), (1,3), (4,0), (4,3), (5,2), (0,2), (2,0), (2,3), (5,0), (5,3); rewards of +1 and -1 are marked on the graph.]
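Viewed abstractly, such a rule set is a lookup table keyed by (state, operator), with one RL rule per entry. A sketch of the tabular analogue in Python, with a few illustrative entries rather than an enumeration of all 46 pairs:

# Each entry plays the role of one RL rule's numeric preference; the first entry
# mirrors the state-action pair matched by RL-0003 above.
q_table = {
    # ((5-jug contents, 3-jug contents), operator) -> value
    ((5, 0), ("fill", "jug3")): 0.0,
    ((5, 0), ("empty", "jug5")): 0.0,
    ((3, 0), ("pour", "jug5", "jug3")): 0.0,
}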

Page 9:

How can rules be generated automatically? A rule could be built from working memory, for instance:

sp {RL-1
   :rl
   (state <s> ^name waterjug ^jug <j1> ^jug <j2> ^operator <o4> +
      ^superstate nil ^type state)
   (<j1> ^contents 0 ^free 5 ^volume 5)
   (<j2> ^contents 0 ^free 3 ^volume 3)
   (<o4> ^name fill ^jug <j2>)
-->
   (<s> ^operator <o4> = 0)}

Page 10:

But we want generalization…

[Figure: the same waterjug state-space graph as on Page 8, with rewards +1 and -1.]

Page 11:

Adaptive representations

The system constructs the feature set so that there are more distinctions in the parts of the state-action space that require them.

Specific-to-general: collect instances and cluster them according to similar values.

General-to-specific: add distinctions when an area with a single representation appears to contain multiple values.

Page 12:

General-to-specific

Our most general rules are rules made from operator proposals:

sp {RL-1
   :rl
   (state <s1> ^name waterjug ^jug <j1> ^operator <o1> +)
   (<j1> ^free 3)
   (<o1> ^jug <j1> ^name fill)
-->
   (<s1> ^operator <o1> = 0)}

Such a rule is generated only when no RL rule fires for the selected operator.
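A sketch of that bookkeeping in Python, with hypothetical names and data structures (a rule here is just a set of conditions plus a value):

def ensure_default_rl_rule(fired_rl_rules, proposal_conditions, rl_rules):
    """If no RL rule fired for the selected operator, build a most-general rule
    from the operator proposal's conditions, with an initial value of 0, so that
    every selected state-action pair has a value to bootstrap from."""
    if fired_rl_rules:
        return None                              # an existing rule already covers this pair
    new_rule = {
        "name": "RL-%d" % (len(rl_rules) + 1),   # e.g. RL-1, RL-2, ...
        "conditions": tuple(proposal_conditions),
        "value": 0.0,                            # initial numeric preference
    }
    rl_rules.append(new_rule)
    return new_rule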

Page 13:

Specialization: an example of an overgeneral representation

Overgeneral rule: if a jug has contents 3, then pour from that jug into the other jug.

[Figure: the waterjug state-space graph with rewards +1 and -1.]

Page 14:

[Chart: predicted Q-values at the state (3,0) over the first 10 updates. Axes: Update # vs. Q-value (0 to 0.5); series labeled 3050, 3000, 3003, 3033; two updates are annotated "Goal achieved".]

Page 15:

[Chart: predicted Q-value for the state (3,0) over roughly 2,000 updates. Axes: Update # vs. Q-value (0 to 0.8); series labeled 3050, 3000, 3003, 3033.]

Page 16:

How to fix: add the following rule. If there is a jug with volume 3 and contents 3, pour this jug into the other jug.

[Figure: the waterjug state-space graph.]

Page 17:

How to fix: add the following rule. If there is a jug with volume 3 and contents 3, pour this jug into the other jug.

[Chart: Q-values at (3,0) over roughly 175 updates; Q-value from 0 to 0.9; series labeled 3050, 3000, 3003, 3033.]

Page 18:

Designing a specialization procedure

1. How do we decide whether to specialize a given rule?
2. Given that we have chosen to specialize a rule, what conditions should we add to it?
3. (optional) In what cases, if any, should a rule be eliminated?

Page 19:

Question 2 (what conditions to add to a rule): proposed answer

We are trying an activation-based scheme.

When an (instantiated) rule R decides to specialize, it finds the most activated WME, w = (ID ATTR VALUE).

It then traces upward through working memory to find a shortest path from ID to some identifier in the rule's instantiation.

w and the WMEs in the trace are added to the conditions of R to form a new rule R'.

If R' is not a duplicate of some existing rule, R' is added to the Rete (without removing R).

Page 20:

Question 1 (should a given rule be specialized?): proposed answer

Track the weights (numeric preferences) of the rules. The weights should converge once the rules are sufficient, so stop specializing when the weights stop moving.
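One simple way to operationalize "weights stop moving" (the window size and tolerance below are arbitrary illustrative choices): record a rule's weight after each update and stop specializing the rule once its recent values stay within a small band.

def has_converged(weight_history, window=20, tolerance=0.01):
    """True once the rule's weight has stopped moving: its range over the last
    `window` recorded values is within `tolerance`."""
    recent = weight_history[-window:]
    return len(recent) == window and (max(recent) - min(recent)) <= tolerance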

Page 21:

[Chart: tracking the weight of rule RL-3 over roughly 500 updates; y-axis from 0 to 0.18.]

Page 22:

[Chart: number of steps per run over 100 runs. Axes: Run # vs. # steps (0 to 600).]

Page 23:

[Chart: total # RL rules by run, over roughly 100 runs. Axes: Run # vs. # rules (0 to 70).]

Page 24:

Conclusions

Nuggets: It worked (at least on the waterjug task); that is, the agent came to follow the optimal policy.

Coal: It makes too many rules.
1. It specializes rules that do not need specialization.
2. Specializations are not always useful, since they are chosen heuristically.
3. It will try to explain non-determinism by making rules.