
Page 1: Reinforcement Learning

Reinforcement Learning Agents: an approach to AGI

Peter Sunehag

Tutorial at the Seventh Conference on AGI, Quebec City, August 2014

Many slides borrowed from, e.g., Sutton & Barto

Slides: agi-conf.org/.../wp-content/uploads/2014/08/RL_slides_AGItutorial7.pdf

Page 2: Relation between RL and AI

• Reinforcement Learning framework: stochastic, unknown, non-i.i.d. environments. Very general; covers most/all AI problems. RL problems → algorithms.

• Artificial Intelligence: traditionally a deterministic, known world → a planning problem.

• Statistical Machine Learning: mostly i.i.d. data; classification, regression, clustering.

Page 3: Some Applications of RL

• Checkers (Samuel, 1959): first use of RL in an interesting real game

• (Inverted) helicopter flight (Ng et al. 2004): better than any human

• Computer Go (UCT, 2006): finally some programs with "reasonable" play

• RoboCup soccer teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000

• Inventory management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10-15% improvement over industry standard methods

• Dynamic channel assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls

• Elevator control (Crites & Barto): (probably) world's best down-peak elevator controller

• Many robots: navigation, bi-pedal walking, grasping, switching between skills...

• TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player; grandmaster level

• Arcade Learning Environment (Naddaf 2010, Bellemare et al. 2012, ...): an ongoing challenge

Page 4: Complete Agent

• Temporally situated

• Continual learning and planning

• Object is to affect the environment

• Environment is stochastic and uncertain

[Diagram: agent-environment loop. The Agent sends actions to the Environment; the Environment returns a state and a reward.]

Page 5: What is Reinforcement Learning?

• An approach to Artificial Intelligence

• Learning from interaction

• Goal-oriented learning

• Learning about, from, and while interacting with an external environment

• Learning what to do (how to map situations to actions) so as to maximize a numerical reward signal

Page 6: Chapter 1: Introduction

[Diagram: Reinforcement Learning (RL) at the intersection of Artificial Intelligence, Psychology, Neuroscience, Control Theory and Operations Research, and Artificial Neural Networks]

Page 7: Key Features of RL

• Learner is not told which actions to take

• Trial-and-error search

• Possibility of delayed reward: sacrifice short-term gains for greater long-term gains

• The need to explore and exploit

• Considers the whole problem of a goal-directed agent interacting with an uncertain environment

Page 8: Supervised Learning

Inputs → Supervised Learning System → Outputs

Training info = desired (target) outputs

Error = (target output − actual output)

Page 9: Reinforcement Learning

Inputs → RL System → Outputs ("actions")

Training info = evaluations ("rewards" / "penalties")

Objective: get as much reward as possible

Page 10: An Extended Example: Tic-Tac-Toe

[Figure: game tree of tic-tac-toe positions, branching over x's moves and o's moves]

Assume an imperfect opponent: he/she sometimes makes mistakes

Page 11: An RL Approach to Tic-Tac-Toe

1. Make a table with one entry per state:

       State     V(s) = estimated probability of winning
       ...       .5   ?
       ...       .5   ?
       win       1
       loss      0
       draw      0

2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states.

Just pick the next state with the highest estimated probability of winning, i.e. the largest V(s): a greedy move.

But 10% of the time pick a move at random: an exploratory move.

Page 12: RL Learning Rule for Tic-Tac-Toe

[Figure: a sequence of game states with value backups; one arc is labelled "Exploratory" move]

s  – the state before our greedy move
s' – the state after our greedy move

We increment each V(s) toward V(s'), a backup:

    V(s) ← V(s) + α [ V(s') − V(s) ]

where α is a small positive fraction, e.g. 0.1, the step-size parameter.
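
To make the update rule concrete, here is a minimal sketch of the table-based value update in Python. It is not code from the tutorial; the state encoding, the dictionary value table, and the 10% exploration rate are illustrative assumptions.

    import random

    alpha = 0.1        # step-size parameter
    epsilon = 0.1      # fraction of exploratory moves
    V = {}             # value table: state -> estimated probability of winning

    def value(state):
        # unseen states start at the neutral estimate 0.5
        return V.get(state, 0.5)

    def choose_move(state, candidate_next_states):
        """Pick the next state greedily with prob. 1-epsilon, randomly otherwise."""
        if random.random() < epsilon:
            return random.choice(candidate_next_states), True   # exploratory move
        best = max(candidate_next_states, key=value)
        return best, False                                       # greedy move

    def td_update(state, next_state):
        """V(s) <- V(s) + alpha * (V(s') - V(s))"""
        V[state] = value(state) + alpha * (value(next_state) - value(state))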

Page 13: The n-Armed Bandit Problem

Choose repeatedly from one of n actions; each choice is called a play.

After each play a_t, you get a reward r_t, where E[r_t | a_t] = Q*(a_t). These are unknown action values; the distribution of r_t depends only on a_t.

Objective is to maximize the reward in the long term, e.g., over 1000 plays.

To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them.
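
As a concrete illustration (not from the slides), here is a minimal ε-greedy bandit sketch in Python; the Gaussian reward model and the sample-average estimates are assumptions chosen for simplicity.

    import random

    n = 10                                            # number of arms
    q_true = [random.gauss(0, 1) for _ in range(n)]   # unknown action values Q*(a)
    Q = [0.0] * n                                     # sample-average estimates Q_t(a)
    counts = [0] * n
    epsilon = 0.1

    def play(a):
        # reward drawn around the true (unknown) action value
        return random.gauss(q_true[a], 1.0)

    total = 0.0
    for t in range(1000):                             # "e.g., over 1000 plays"
        if random.random() < epsilon:
            a = random.randrange(n)                   # explore
        else:
            a = max(range(n), key=lambda i: Q[i])     # exploit the greedy action
        r = play(a)
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]                # incremental sample-average update
        total += r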

Page 14: The Exploration/Exploitation Dilemma

Suppose you form action value estimates Q_t(a) ≈ Q*(a).

The greedy action at time t is a_t* = argmax_a Q_t(a).

    a_t = a_t*   →  exploitation
    a_t ≠ a_t*   →  exploration

You can't exploit all the time; you can't explore all the time.

You can never stop exploring; but you should always reduce exploring. Maybe.

Page 15: Optimistic Initial Values

All methods so far depend on the initial estimates Q_0(a), i.e., they are biased.

Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a.
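
Continuing the hypothetical bandit sketch above, optimistic initialization is a one-line change; the value 5 follows the slide, the rest is an assumption.

    # Optimistic initial values: start every estimate well above any plausible reward,
    # so greedy selection itself drives early exploration (each arm disappoints in turn).
    Q = [5.0] * n        # instead of [0.0] * n
    epsilon = 0.0        # even purely greedy selection now explores early on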

Page 16: The Agent-Environment Interface

Agent and environment interact at discrete time steps t = 0, 1, 2, ...

    Agent observes state at step t:    s_t ∈ S
    produces action at step t:         a_t ∈ A(s_t)
    gets resulting reward:             r_{t+1}
    and resulting next state:          s_{t+1}

Trajectory:  ... s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} ...
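
A minimal sketch of this interaction loop in Python; the Environment and Agent interfaces below are hypothetical, not from any particular library.

    class Environment:
        def reset(self):
            """Return the initial state s_0."""
            raise NotImplementedError
        def step(self, action):
            """Return (reward r_{t+1}, next state s_{t+1}, done)."""
            raise NotImplementedError

    class Agent:
        def act(self, state):
            """Return an action a_t in A(s_t)."""
            raise NotImplementedError
        def observe(self, state, action, reward, next_state):
            """Learn from the observed transition."""
            pass

    def run_episode(agent, env):
        state = env.reset()
        done = False
        total_reward = 0.0
        while not done:
            action = agent.act(state)                       # a_t
            reward, next_state, done = env.step(action)     # r_{t+1}, s_{t+1}
            agent.observe(state, action, reward, next_state)
            total_reward += reward
            state = next_state
        return total_reward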

Page 17: Bellman Equation for a Policy π

The basic idea:

    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ...
        = r_{t+1} + γ (r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ...)
        = r_{t+1} + γ R_{t+1}

So:

    V^π(s) = E_π[ R_t | s_t = s ]
           = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]

Or, without the expectation operator:

    V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
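
The Bellman equation suggests a fixed-point iteration. Below is a minimal iterative policy evaluation sketch in Python; the dictionary-based representation of π, P and R is an assumption for illustration.

    def policy_evaluation(states, actions, pi, P, R, gamma=0.9, tol=1e-8):
        """Iterate V(s) = sum_a pi(s,a) sum_s' P[s,a,s'] * (R[s,a,s'] + gamma*V(s'))
        until convergence.

        pi[(s, a)]     -- probability of taking a in s
        P[(s, a, s2)]  -- transition probability
        R[(s, a, s2)]  -- expected reward
        """
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = sum(pi.get((s, a), 0.0) *
                        sum(P.get((s, a, s2), 0.0) *
                            (R.get((s, a, s2), 0.0) + gamma * V[s2])
                            for s2 in states)
                        for a in actions)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                return V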

Page 18: Golf

• State is the ball location

• Reward of −1 for each stroke until the ball is in the hole

• Value of a state?

• Actions:
  – putt (use putter)
  – driver (use driver)

• putt succeeds anywhere on the green

Page 19: Optimal Value Function for Golf

• We can hit the ball farther with driver than with putter, but with less accuracy

• Q*(s, driver) gives the value of using driver first, then using whichever actions are best

Page 20: Simple Monte Carlo

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t.

[Backup diagram: a complete sample trajectory from s_t down to a terminal state T]
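
A minimal sketch of this constant-α Monte Carlo update in Python; the episode storage and the discount handling are illustrative assumptions.

    def mc_update(V, episode, alpha=0.1, gamma=1.0):
        """episode: list of (state, reward) pairs, reward received on leaving each state.
        Updates V(s_t) toward the actual return R_t following s_t."""
        G = 0.0
        # walk backwards so G accumulates the return from each state onward
        for state, reward in reversed(episode):
            G = reward + gamma * G
            v = V.get(state, 0.0)
            V[state] = v + alpha * (G - v)
        return V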

Page 21: Simplest TD Method

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

[Backup diagram: a single sampled transition from s_t to s_{t+1} with reward r_{t+1}]
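
The corresponding TD(0) update, as a minimal Python sketch (the state representation is an assumption):

    def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
        """V(s_t) <- V(s_t) + alpha * (r_{t+1} + gamma * V(s_{t+1}) - V(s_t)).
        Unlike the Monte Carlo update, this uses only one sampled transition
        and bootstraps on the current estimate V(s_{t+1})."""
        v = V.get(state, 0.0)
        v_next = 0.0 if terminal else V.get(next_state, 0.0)
        V[state] = v + alpha * (reward + gamma * v_next - v)
        return V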

Page 22: Dynamic Programming

    V(s_t) ← E_π[ r_{t+1} + γ V(s_{t+1}) | s_t ]

[Backup diagram: a full-width backup over all actions and successor states of s_t]

Page 23: TD methods bootstrap and sample

• Bootstrapping: the update involves an estimate
  – MC does not bootstrap
  – DP bootstraps
  – TD bootstraps

• Sampling: the update does not involve an expected value
  – MC samples
  – DP does not sample
  – TD samples

Page 24: Advantages of TD Learning

• TD methods do not require a model of the environment, only experience

• TD, but not MC, methods can be fully incremental

• You can learn before knowing the final outcome
  – Less memory
  – Less peak computation

• You can learn without the final outcome
  – From incomplete sequences

• Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Page 25: TD-Gammon

Tesauro 1992, 1994, 1995, ...

[Figure: backgammon board] White has just rolled a 5 and a 2, so can move one of his pieces 5 and one (possibly the same) 2 steps.

• Objective is to advance all pieces to points 19-24

• Hitting

• Doubling

• 30 pieces, 24 locations implies an enormous number of configurations

• Effective branching factor of 400

TD-Gammon combined TD(λ) with a neural network as value function approximator.

Page 26: Samuel's Checkers Player

Arthur Samuel 1959, 1967

• Score board configurations by a "scoring polynomial" (after Shannon, 1950)

• Minimax to determine the "backed-up score" of a position

• Alpha-beta cutoffs

• Rote learning: save each board configuration encountered together with its backed-up score
  – needed a "sense of direction": like discounting

• Learning by generalization: similar to the TD algorithm with linear value function approximation

Page 27: Samuel's Backups

Page 28: The Basic Idea

". . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play."

A. L. Samuel
Some Studies in Machine Learning Using the Game of Checkers, 1959

Page 29: Elevator Dispatching

Crites and Barto, 1996

Page 30: State Space

• 18 hall call buttons: 2^18 combinations

• positions and directions of cars: 18^4 (rounding to nearest floor)

• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6

• 40 car buttons: 2^40

• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons pushed; we discretize these.

• Set of passengers riding each car and their destinations: observable only through the car buttons

Conservatively about 10^22 states

Page 31: Actions

• When moving (halfway between floors):
  – stop at next floor
  – continue past next floor

• When stopped at a floor:
  – go up
  – go down

• Asynchronous

Page 32: Performance Criteria

Minimize:

• Average wait time

• Average system time (wait + travel time)

• % waiting > T seconds (e.g., T = 60)

• Average squared wait time (to encourage fast and fair service)

Page 33: Computer Go

Sylvain Gelly (2004)

• "Task Par Excellence for AI" (Hans Berliner)

• "New Drosophila of AI" (John McCarthy)

• "Grand Challenge Task" (David Mechner)

Page 34: Monte Carlo Tree Search + Playout Policy

• DP & TD do not apply: state space too large

• MC & ES do not apply: action space too large

New approach:

• In the current state, sample and learn to construct a locally specialized policy

• Exploration/exploitation dilemma dealt with by Upper Confidence Trees (UCT)

• Evaluate leaves using a playout policy
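
As an illustration of the UCT-style selection rule (not code from the tutorial), a node can be chosen by its UCB1 score and leaves evaluated by a random playout; the tree node structure and the environment callbacks below are hypothetical.

    import math
    import random

    class Node:
        def __init__(self):
            self.children = {}    # action -> Node
            self.visits = 0
            self.value_sum = 0.0  # sum of playout returns backed up through this node

    def uct_select(node, c=1.4):
        """Pick the child action maximizing the UCB1 score:
        mean value + c * sqrt(ln(parent visits) / child visits)."""
        def score(action):
            child = node.children[action]
            if child.visits == 0:
                return float("inf")          # try unvisited actions first
            mean = child.value_sum / child.visits
            return mean + c * math.sqrt(math.log(node.visits) / child.visits)
        return max(node.children, key=score)

    def playout(state, legal_actions, step, is_terminal, max_depth=200):
        """Evaluate a leaf by following a random playout policy to the end."""
        total = 0.0
        depth = 0
        while not is_terminal(state) and depth < max_depth:
            a = random.choice(legal_actions(state))
            state, r = step(state, a)
            total += r
            depth += 1
        return total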

Page 35: Inverted Helicopter Flight

Andrew Ng et al. (2004-2008)    http://heli.stanford.edu/

RL learns a policy better than any human.

The helicopter should follow a desired trajectory.

[Diagram: pipeline with components Data, Dynamics Model, Trajectory + Penalty Function, Reward Function, Reinforcement Learning, Policy]

• Generative model for multiple suboptimal demonstrations

• Learning algorithm that extracts: the intended trajectory and a high-accuracy dynamics model

• Experimental results: enabled them to fly autonomous helicopter aerobatics well beyond the capabilities of any other autonomous helicopter

Page 36: The Arcade Learning Environment (ALE)

ALE (Naddaf 2010, Bellemare et al. 2012) is an interface built upon the open-source Atari 2600 emulator Stella. It provides a convenient interface to Atari 2600 games.

Page 37: Features for ALE

• Basic Abstraction of Screen Shots (BASS, from Naddaf 2010) first stores a background of the game it is playing. Then, for every frame, it subtracts away the background and divides the screen into 16x14 tiles. For each colour (8-bit SECAM) it creates a feature, and then takes the pairwise interactions of all these features, resulting in 1,606,528 features. (A sketch of the feature construction follows below.)

• Colour provides object recognition.

• We study linear function approximation with BASS. We want to see how well one can do with that if one finds the right parameters.

• Improved results have been achieved with non-linear neural/deep approaches.
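
A heavily simplified sketch of BASS-like feature construction in Python/NumPy; the array shapes, tile layout, and colour quantization are assumptions, and real implementations index the pairwise features sparsely rather than materializing them.

    import numpy as np

    def bass_features(frame, background, n_colours=8, tiles=(14, 16)):
        """frame, background: (H, W) integer arrays of colour indices in [0, n_colours).
        Returns the 'basic' binary feature vector (one bit per tile/colour marking
        whether that colour appears in that tile after background removal) followed
        by its pairwise interactions."""
        fg = np.where(frame != background, frame, -1)       # -1 marks background pixels
        H, W = fg.shape
        th, tw = H // tiles[0], W // tiles[1]
        basic = np.zeros((tiles[0], tiles[1], n_colours), dtype=bool)
        for i in range(tiles[0]):
            for j in range(tiles[1]):
                tile = fg[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
                for c in range(n_colours):
                    basic[i, j, c] = np.any(tile == c)
        basic = basic.ravel()
        # Pairwise interactions of the basic features (dense here only for clarity).
        pairwise = np.outer(basic, basic)[np.triu_indices(basic.size, k=1)]
        return np.concatenate([basic, pairwise])

With 14x16 tiles and 8 colours this gives 1792 basic features plus C(1792, 2) pairwise interactions, i.e. 1,606,528 features in total, matching the count on the slide.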

Page 38: The gap

Table: The gap between the score (more is better) achieved by (linear) learning and by (UCT) planning.

    Game            UCT       Best Learner
    Beam Rider      6,624.6   929.4
    Seaquest        5,132.4   288
    Space Invaders  2,718     250.1
    Pong            21        -19

Out of 55 games, UCT has the best performance on 45 of them. The remaining games require a terribly long horizon.

Page 39: Learning from an oracle

• Reinforcement learning is made much more difficult than supervised learning by the need to explore.

• Therefore, many authors have in recent years been developing ways of teaching an RL agent through, e.g., demonstration or advice, with a reduction to supervised learning.

• I will here discuss this idea in the context of Atari games through the Arcade Learning Environment (ALE) framework.

Page 40: Learning from UCT

A common scenario when applying reinforcement learning algorithms in real-world situations: learn in a simulator, apply in the real world.

• UCT in the "real world" still requires the simulator.

• UCT does not provide a policy representation, merely a trajectory.

• How do you extract a complete explicit policy from UCT?

• We will treat the value estimates from UCT as advice provided to the agent, and we can then learn to play Pong with just a few episodes of data.

• Learning the value function is now a regression problem we solve using LibLinear (also exploring kernels, which brings us back to feature selection/sparsification).

• Similar to the Dataset Aggregation (DAgger) algorithm for imitation learning (Ross and Bagnell 2010).

Page 41: DAgger for reinforcement learning with advice

    Initialise D ← ∅
    Initialise π_1 (= π*), t = 0
    for i = 1 to N do
        while not end of episode do
            for each action a do
                Obtain feature φ(s_t, a) and the oracle's Q*(s_t, a)
                Add training sample (φ(s_t, a), Q*(s_t, a)) to D_a
            end
            Act according to π_i
        end
        for each action a do
            Learn new model Q̂_i^a := (w_i^a)ᵀ φ from D_a using regression
        end
        π_{i+1}(φ) = argmax_a Q̂_i^a(φ)
    end
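
A minimal Python sketch of this loop; the oracle and environment interfaces, and the use of plain least squares in place of LibLinear, are assumptions for illustration.

    import numpy as np

    def dagger_with_advice(env, oracle_q, phi, actions, n_iters=10, feat_dim=100):
        """env.reset() -> state; env.step(a) -> (next_state, done).
        oracle_q(s, a) -> UCT value advice Q*(s, a); phi(s, a) -> feature vector."""
        data = {a: ([], []) for a in actions}          # per-action datasets D_a
        w = {a: np.zeros(feat_dim) for a in actions}   # linear models Q̂^a = w_a . phi

        def policy(state):
            return max(actions, key=lambda a: w[a] @ phi(state, a))

        for i in range(n_iters):
            state, done = env.reset(), False
            while not done:                            # one episode of data collection
                for a in actions:
                    x, y = data[a]
                    x.append(phi(state, a))
                    y.append(oracle_q(state, a))       # record the oracle's value advice
                state, done = env.step(policy(state))  # but act with the current policy
            for a in actions:                          # refit each model by regression
                X, y = np.array(data[a][0]), np.array(data[a][1])
                w[a], *_ = np.linalg.lstsq(X, y, rcond=None)
        return policy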

Page 42: Preliminary results

[Figure: two plots of reward vs. episodes (0-30) on Pong. Left: average performance, RLAdvice vs. SARSA. Right: best reward so far for RLAdvice.]

Figure: Pong results. RLAdvice with different amounts of aggregated data (1-30 games) vs. SARSA (linear function approximation) after 5000 games played. Results averaged over 8 runs.

By Daswani, Sunehag, Hutter 2014

Page 43: Partially Observable Markov Decision Processes

• Exacerbated exploration issues due to not knowing the state.

• Not knowing the underlying state space makes things worse.

• Explicit or implicit restrictions to subclasses.

• Recent work (feature RL) on history-based methods that learn a map from histories to some finite/compact state space (another alternative: PSRs).

• The model-free version becomes feature selection for high-dimensional RL.

Page 44: General Reinforcement Learning

• Classes of completely general reinforcement learning environments. Beyond hopeless.

• No finite underlying state space is assumed. Theoretical work exists under some further assumptions:

• Bayesian (computable environments): AIXI (Hutter 2005)

• Finite/compact classes: optimistic agents (Sunehag & Hutter 2012), max. exploration (Lattimore & Hutter & Sunehag 2013)

Page 45: Feature Reinforcement Learning

Life, the Universe and Everything: 42

Feature RL aims to automatically reduce a complex, real-world, non-Markovian problem to a useful (computationally tractable) representation (an MDP).

Formally, we create a map φ from an agent's history to a state representation; φ is then a function that produces a relevant summary of the history:

    φ(h_t) = s_t
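
As a toy illustration (not from the slides), a φ map could simply keep the last k observation-action pairs of the history; the tuple-based history encoding below is an assumption.

    def make_last_k_map(k):
        """Return a feature map phi that summarizes a history
        h_t = [(o_1, a_1), (o_2, a_2), ...] by its last k observation-action pairs."""
        def phi(history):
            return tuple(history[-k:])   # hashable state label s_t = phi(h_t)
        return phi

    phi = make_last_k_map(2)
    history = [(0, 1), (0, 2), (3, 4), (1, 2)]
    state = phi(history)                 # ((3, 4), (1, 2))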

Page 46: Feature Markov Decision Process (ΦMDP)

To select the best φ, one defines a cost function:

    φ_best = argmin_φ Cost(φ).

• Feature RL is a recent framework.

• The original cost from Hutter 2009 is a model-based criterion:

    Cost(φ | h) = CL(s_1:n | a_1:n) + CL(r_1:n | s_1:n, a_1:n) + CL(φ)

• A practically useful modification adds a parameter α to control the balance between reward coding and state coding:

    Cost_α(φ | h_n) := α CL(s_1:n | a_1:n) + (1 − α) CL(r_1:n | s_1:n, a_1:n) + CL(φ).

• A global stochastic search (e.g., simulated annealing) is used to find the φ with minimal cost.

• For fixed φ, MDP methods can be used to find a good policy.

Page 47: Model-free cost criterion

Daswani & Sunehag & Hutter 2013 introduced a fitted-Q cost:

    Cost_QL(φ) = min_Q (1/2) Σ_{t=1}^{n} ( r_{t+1} + γ max_a Q(φ(h_{t+1}), a) − Q(φ(h_t), a_t) )² + Reg(φ)

• Cost_QL also extends easily to the linear function approximation setting by approximating Q(h_t, a_t) ≈ ξ(h_t, a_t)ᵀ w, where ξ : H × A → R^k for some k ∈ N.

• This connects feature RL to feature selection for TD methods, e.g. Lasso-TD or Dantzig-TD using ℓ1 regularization, while the Reg above tends to be a more aggressive ℓ0.

• For a fixed policy, a TD cost without the max_a can be defined, but one can also reduce the problem to feature selection for supervised learning using pairs (s_t, R_t), where R_t is the return achieved after state s_t.

Page 48: The generic ΦMDP algorithm

    Input: Environment Env()
    Initialise φ
    Initialise history with observations and rewards from t = init_history random actions
    Initialise M to be the number of timesteps per epoch
    while true do
        φ = SimulAnneal(φ, h_t)
        s_1:t = (φ(h_1), φ(h_2), ..., φ(h_t))
        π = FindPolicy(s_1:t, r_1:t, a_1:t−1)
        for i = 1, 2, 3, ... M do
            a_t ← π(s_t)
            o_{t+1}, r_{t+1} ← Env(h_t, a_t)
            h_{t+1} ← h_t a_t o_{t+1} r_{t+1}
            t ← t + 1
        end
    end

    Algorithm 1: A high-level view of the generic ΦMDP algorithm.

Page 49: Feature maps

• Tabular: use suffix trees to map histories to states (Nguyen & Sunehag & Hutter 2011, 2012). Looping trees for long-term dependencies (Daswani & Sunehag & Hutter 2012).

[Figure: a small suffix-tree automaton with states s0, s1, s2 and transitions labelled 0 and 1]

• Function approximation: define a new feature class of event selectors. A feature ξ_i checks the (n − m)-th position in the history h_n for an observation-action pair (o, a). For example, if the history is (0, 1), (0, 2), (3, 4), (1, 2), then an event selector checking 3 steps in the past for the observation-action pair (0, 2) will be turned on. (A small code sketch follows below.)
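
A minimal sketch of such an event-selector feature in Python, reusing the history example from the slide; the exact indexing convention (how "m steps in the past" is counted) is an assumption.

    def event_selector(m, pair):
        """Binary feature that fires when the observation-action pair found
        m steps in the past equals `pair`."""
        def xi(history):
            return 1 if len(history) >= m and history[-m] == pair else 0
        return xi

    history = [(0, 1), (0, 2), (3, 4), (1, 2)]
    xi = event_selector(3, (0, 2))
    assert xi(history) == 1     # (0, 2) sits 3 steps in the past, so the feature is on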

Page 50: Bayesian general reinforcement learning: MC-AIXI-CTW

Unlike feature RL, the Bayesian approach does not pick one map but uses a mixture of all of them instead. The problem is (again) split into two main areas:

• Learning: online sequence prediction / model building

• Planning/Control: search / sequential decision theory

The hard parts:

• A large model class is required for the Bayesian mixture predictor to have general prediction capabilities.

• Fortunately, an efficient and general class exists: all Prediction Suffix Trees of maximum finite depth D. The class contains over 2^(2^(D−1)) models!

• The planning problem can be performed approximately with Monte-Carlo Tree Search (UCT).

• MC-AIXI-CTW (Veness et al. 2010) combines the above.

Page 51: Overview of proposed agent architecture

Page 52: Domain: POCMAN

Page 53: POCMAN: Rolling average over 1000 epochs

[Figure: cumulative reward vs. epochs (1,000-3,500), comparing FAhQL and MC-AIXI 48]

Figure: MC-AIXI vs. hQL on Pocman

    Agent            Cores   Memory (GB)   Time (hours)   Iterations
    MC-AIXI 96 bits    8       32             60           1 · 10^5
    MC-AIXI 48 bits    8       14.5           49.5         3.5 · 10^5
    FAhQL              1       0.4            17.5         3.5 · 10^5

Page 54: Conclusions/Outlook

• Reinforcement Learning is a powerful paradigm within which (basically) all AI problems can be formulated.

• Many practical successes using MDPs, achieved by engineering problem reductions/representations.

• Practically increasing the versatility of agents by learning such reductions automatically.

• Recently introduced arcade gaming environment (from Alberta) for RL containing all Atari games. Aim: have one generic RL agent solve all!

• Use data on how valuable states are, from either simulations or experience, to reduce complexity.