TRANSCRIPT
CS473: Artificial Intelligence – Reinforcement Learning II
Dieter Fox / University of Washington
[Most slides were taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Exploration vs. Exploitation

How to Explore?
§ Several schemes for forcing exploration
§ Simplest: random actions (ε-greedy)
  § Every time step, flip a coin
  § With (small) probability ε, act randomly
  § With (large) probability 1-ε, act on current policy
§ Problems with random actions?
  § You do eventually explore the space, but keep thrashing around once learning is done
§ One solution: lower ε over time
§ Another solution: exploration functions
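The ε-greedy scheme above can be sketched in a few lines; the dict-of-Q-values structure here is a hypothetical illustration, not code from the course projects:

```python
import random

def epsilon_greedy(q_values, actions, epsilon):
    """Pick a random action with probability epsilon, else act greedily.

    q_values: dict mapping action -> estimated Q-value (hypothetical structure).
    """
    if random.random() < epsilon:
        return random.choice(actions)      # explore
    # Exploit: act on the current policy (highest estimated Q-value).
    return max(actions, key=lambda a: q_values[a])
```

With ε = 0 this is pure exploitation; with ε = 1 it is pure random exploration, which is what "lower ε over time" interpolates between.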
Video of Demo Q-learning – Manual Exploration – Bridge Grid
Video of Demo Q-learning – Epsilon-Greedy – Crawler

Exploration Functions
§ When to explore?
§ Random actions: explore a fixed amount
§ Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
§ Exploration function
  § Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
§ Note: this propagates the “bonus” back to states that lead to unknown states as well!

Regular Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' Q(s',a')]
Modified Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' f(Q(s',a'), N(s',a'))]
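A minimal sketch of the modified update, assuming tabular Q and N dicts keyed by (state, action); to stay runnable it uses k/(n+1) rather than k/n, so unvisited states (n = 0) get a large but finite bonus:

```python
def exploration_bonus(u, n, k=1.0):
    """Optimistic utility f(u, n); n+1 in the denominator avoids division by zero."""
    return u + k / (n + 1)

def modified_q_update(Q, N, s, a, r, s_next, actions, alpha=0.5, gamma=0.9, k=1.0):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' f(Q(s',a'), N(s',a'))]."""
    N[(s, a)] = N.get((s, a), 0) + 1
    sample = r + gamma * max(
        exploration_bonus(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0), k)
        for a2 in actions
    )
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```

Because the bonus sits inside the max over successor actions, states that merely lead to unvisited states also look optimistic, which is the propagation effect noted above.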
Video of Demo Q-learning – Exploration Function – Crawler

Regret
§ Even if you learn the optimal policy, you still make mistakes along the way!
§ Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
§ Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
§ Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
Approximate Q-Learning

Generalizing Across States
§ Basic Q-Learning keeps a table of all q-values
§ In realistic situations, we cannot possibly learn about every single state!
  § Too many states to visit them all in training
  § Too many states to hold the q-tables in memory
§ Instead, we want to generalize:
  § Learn about some small number of training states from experience
  § Generalize that experience to new, similar situations
  § This is a fundamental idea in machine learning, and we’ll see it over and over again

[demo – RL pacman]
Example: Pacman
[Demo: Q-learning – pacman – tiny – watch all (L11D5)]
[Demo: Q-learning – pacman – tiny – silent train (L11D6)]
[Demo: Q-learning – pacman – tricky – watch all (L11D7)]
Let’s say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Or even this one!
Video of Demo Q-Learning Pacman – Tiny – Watch All
Video of Demo Q-Learning Pacman – Tiny – Silent Train
Video of Demo Q-Learning Pacman – Tricky – Watch All
Feature-Based Representations
§ Solution: describe a state using a vector of features (aka “properties”)
  § Features are functions from states to real numbers (often 0/1) that capture important properties of the state
§ Example features:
  § Distance to closest ghost
  § Distance to closest dot
  § Number of ghosts
  § 1 / (dist to dot)²
  § Is Pacman in a tunnel? (0/1)
  § …… etc.
  § Is it the exact state on this slide?
§ Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
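As an illustration, a feature extractor along these lines might look as follows; the state fields here are hypothetical, not the actual Pacman project API:

```python
def extract_features(state):
    """Map a Pacman-like state to a dict of named real-valued features."""
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": float(state["ghost_dist"]),
        "inv-dist-to-dot-squared": 1.0 / state["dot_dist"] ** 2,
        "in-tunnel": 1.0 if state["in_tunnel"] else 0.0,
    }
```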
Linear Value Functions
§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
  Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
§ Advantage: our experience is summed up in a few powerful numbers
§ Disadvantage: states may share features but actually be very different in value!
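With named features, the linear Q-function is just a dot product of weights and feature values; a minimal sketch:

```python
def q_value(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a); weights for unseen features default to 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```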
Approximate Q-Learning
§ Q-learning with linear Q-functions:
  difference = [r + γ max_a' Q(s',a')] − Q(s,a)
  Exact Q’s:        Q(s,a) ← Q(s,a) + α · difference
  Approximate Q’s:  wi ← wi + α · difference · fi(s,a)
§ Intuitive interpretation:
  § Adjust weights of active features
  § E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features
§ Formal justification: online least squares
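The weight update can be sketched directly from the temporal difference; the feature dicts follow the hypothetical extractor above, not the project code:

```python
def approx_q_update(weights, features, r, q_sa, max_q_next, alpha=0.1, gamma=0.9):
    """w_i <- w_i + alpha * difference * f_i(s,a), where
    difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)."""
    difference = (r + gamma * max_q_next) - q_sa
    for name, f in features.items():
        # Features that were "on" (nonzero) get blamed or credited
        # in proportion to their value.
        weights[name] = weights.get(name, 0.0) + alpha * difference * f
    return difference
```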
Example: Q-Pacman
[Demo: approximate Q-learning pacman (L11D10)]
Video of Demo Approximate Q-Learning – Pacman

Q-Learning and Least Squares
Linear Approximation: Regression*
[Figure: fitting a line (one feature) and a plane (two features) to data]
Prediction: ŷ = w0 + w1 f1(x)
Optimization: Least Squares*
[Figure: the error or “residual” is the gap between the observation y and the prediction ŷ at each point]
total error = Σi (yi − ŷi)² = Σi (yi − Σk wk fk(xi))²
Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w:
  error(w) = ½ (y − Σk wk fk(x))²
  ∂error(w)/∂wm = −(y − Σk wk fk(x)) fm(x)
  wm ← wm + α (y − Σk wk fk(x)) fm(x)
Approximate q update explained:
  wm ← wm + α [r + γ max_a' Q(s',a') − Q(s,a)] fm(s,a)
  where r + γ max_a' Q(s',a') is the “target” and Q(s,a) is the “prediction”
[Figure: a degree 15 polynomial fit passing through every data point]
Overfitting: Why Limiting Capacity Can Help*
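The high-degree fit can be mimicked without any plotting: a degree-(n-1) polynomial through n points has zero training error but behaves wildly off the training set. A small self-contained illustration using Lagrange interpolation:

```python
def interpolate(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through (xs, ys) at x
    (Lagrange form): zero error on training points, no control elsewhere."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Alternating 0/1 data: the exact fit hits every training point,
# then shoots up to 8 just one step past the data.
xs, ys = [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0]
```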
Policy Search
§ Problem: often the feature-based policies that work well (win games, maximize utilities) aren’t the ones that approximate V/Q best
  § E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
  § Q-learning’s priority: get Q-values close (modeling)
  § Action selection priority: get ordering of Q-values right (prediction)
§ Solution: learn policies that maximize rewards, not the values that predict them
§ Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights
Policy Search
§ Simplest policy search:
  § Start with an initial linear value function or Q-function
  § Nudge each feature weight up and down and see if your policy is better than before
§ Problems:
  § How do we tell the policy got better?
  § Need to run many sample episodes!
  § If there are a lot of features, this can be impractical
§ Better methods exploit lookahead structure, sample wisely, change multiple parameters…
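The nudge-and-check loop above can be sketched as simple hill climbing; `evaluate` stands in for running many sample episodes and averaging the reward (a hypothetical callback, and the expensive part in practice):

```python
import random

def hill_climb(weights, evaluate, step=0.1, iterations=100):
    """Nudge one feature weight up or down; keep the change only if the
    policy's estimated reward (via evaluate) improved."""
    best = evaluate(weights)
    for _ in range(iterations):
        name = random.choice(list(weights))
        delta = random.choice([-step, step])
        weights[name] += delta
        score = evaluate(weights)
        if score > best:
            best = score            # keep the improvement
        else:
            weights[name] -= delta  # revert the nudge
    return weights, best
```

Since a change is only kept when the score improves, the returned `best` can never be worse than the starting policy's score.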
Policy Search
[Andrew Ng] [Video: HELICOPTER]
PILCO (Probabilistic Inference for Learning Control)
• Model-based policy search to minimize a given cost function
• Policy: mapping from state to control
• Rollout: plan using current policy and GP dynamics model
• Policy parameter update via CG/BFGS
• Highly data efficient
[Deisenroth et al., ICML-11, RSS-11, ICRA-14, PAMI-14]
Demo: Standard Benchmark Problem
§ Swing pendulum up and balance in inverted position
§ Learn nonlinear control from scratch
§ 4D state space, 300 controller parameters
§ 7 trials / 17.5 sec experience
§ Control freq.: 10 Hz
Controlling a Low-Cost Robotic Manipulator
• Low-cost system ($500 for robot arm and Kinect)
• Very noisy
• No sensor information about robot’s joint configuration used
• Goal: Learn to stack tower of 5 blocks from scratch
• Kinect camera for tracking block in end-effector
• State: coordinates (3D) of block center (from Kinect camera)
• 4 controlled DoF
• 20 learning trials for stacking 5 blocks (5 seconds long each)
• Account for system noise, e.g.,
  – Robot arm
  – Image processing
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih Koray Kavukcuoglu David Silver Alex Graves Ioannis Antonoglou
Daan Wierstra Martin Riedmiller
DeepMind Technologies
{vlad,koray,david,alex.graves,ioannis,daan,martin.riedmiller} @ deepmind.com
Abstract
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
1 Introduction
Learning to control agents directly from high-dimensional sensory inputs like vision and speech is one of the long-standing challenges of reinforcement learning (RL). Most successful RL applications that operate on these domains have relied on hand-crafted features combined with linear value functions or policy representations. Clearly, the performance of such systems heavily relies on the quality of the feature representation.
Recent advances in deep learning have made it possible to extract high-level features from raw sensory data, leading to breakthroughs in computer vision [11, 22, 16] and speech recognition [6, 7]. These methods utilise a range of neural network architectures, including convolutional networks, multilayer perceptrons, restricted Boltzmann machines and recurrent neural networks, and have exploited both supervised and unsupervised learning. It seems natural to ask whether similar techniques could also be beneficial for RL with sensory data.
However reinforcement learning presents several challenges from a deep learning perspective. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. The delay between actions and resulting rewards, which can be thousands of timesteps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Another issue is that most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper demonstrates that a convolutional neural network can overcome these challenges to learn successful control policies from raw video data in complex RL environments. The network is trained with a variant of the Q-learning [26] algorithm, with stochastic gradient descent to update the weights. To alleviate the problems of correlated data and non-stationary distributions, we use
DeepMind AI Playing Atari

That’s all for Reinforcement Learning!
§ Very tough problem: How to perform any task well in an unknown, noisy environment!
§ Traditionally used mostly for robotics, but becoming more widely used
§ Lots of open research areas:
  § How to best balance exploration and exploitation?
  § How to deal with cases where we don’t know a good state/feature representation?
[Diagram: Reinforcement Learning Agent turns Data (experiences with environment) into a Policy (how to act in the future)]
Conclusion
§ We’re done with Part I: Search and Planning!
§ We’ve seen how AI methods can solve problems in:
  § Search
  § Constraint Satisfaction Problems
  § Games
  § Markov Decision Problems
  § Reinforcement Learning
§ Next up: Part II: Uncertainty and Learning!