TRANSCRIPT
CS473: Artificial Intelligence – Reinforcement Learning II
Dieter Fox / University of Washington
[Most slides were taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]
Exploration vs. Exploitation

How to Explore?
§ Several schemes for forcing exploration
§ Simplest: random actions (ε-greedy)
  § Every time step, flip a coin
  § With (small) probability ε, act randomly
  § With (large) probability 1-ε, act on current policy
§ Problems with random actions?
  § You do eventually explore the space, but keep thrashing around once learning is done
§ One solution: lower ε over time
§ Another solution: exploration functions
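The ε-greedy scheme above can be sketched in a few lines; the dict-of-Q-values structure here is a hypothetical illustration, not code from the course projects:

```python
import random

def epsilon_greedy(q_values, actions, epsilon):
    """Pick a random action with probability epsilon, else act greedily.

    q_values: dict mapping action -> estimated Q-value (hypothetical structure).
    """
    if random.random() < epsilon:
        return random.choice(actions)      # explore
    # Exploit: act on the current policy (highest estimated Q-value).
    return max(actions, key=lambda a: q_values[a])
```

With ε = 0 this is pure exploitation; with ε = 1 it is pure random exploration, which is what "lower ε over time" interpolates between.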
Video of Demo Q-learning – Manual Exploration – Bridge Grid
Video of Demo Q-learning – Epsilon-Greedy – Crawler

Exploration Functions
§ When to explore?
§ Random actions: explore a fixed amount
§ Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
§ Exploration function
  § Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n
§ Note: this propagates the “bonus” back to states that lead to unknown states as well!

Regular Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' Q(s',a')]
Modified Q-Update: Q(s,a) ← (1-α) Q(s,a) + α [R(s,a,s') + γ max_a' f(Q(s',a'), N(s',a'))]
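A minimal sketch of the modified update, assuming tabular Q and N dicts keyed by (state, action); to stay runnable it uses k/(n+1) rather than k/n, so unvisited states (n = 0) get a large but finite bonus:

```python
def exploration_bonus(u, n, k=1.0):
    """Optimistic utility f(u, n); n+1 in the denominator avoids division by zero."""
    return u + k / (n + 1)

def modified_q_update(Q, N, s, a, r, s_next, actions, alpha=0.5, gamma=0.9, k=1.0):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' f(Q(s',a'), N(s',a'))]."""
    N[(s, a)] = N.get((s, a), 0) + 1
    sample = r + gamma * max(
        exploration_bonus(Q.get((s_next, a2), 0.0), N.get((s_next, a2), 0), k)
        for a2 in actions
    )
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
```

Because the bonus sits inside the max over successor actions, states that merely lead to unvisited states also look optimistic, which is the propagation effect noted above.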
Video of Demo Q-learning – Exploration Function – Crawler

Regret
§ Even if you learn the optimal policy, you still make mistakes along the way!
§ Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
§ Minimizing regret goes beyond learning to be optimal – it requires optimally learning to be optimal
§ Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret
Approximate Q-Learning

Generalizing Across States
§ Basic Q-Learning keeps a table of all q-values
§ In realistic situations, we cannot possibly learn about every single state!
  § Too many states to visit them all in training
  § Too many states to hold the q-tables in memory
§ Instead, we want to generalize:
  § Learn about some small number of training states from experience
  § Generalize that experience to new, similar situations
  § This is a fundamental idea in machine learning, and we’ll see it over and over again

[demo – RL pacman]
Example: Pacman
[Demo: Q-learning – pacman – tiny – watch all (L11D5)]
[Demo: Q-learning – pacman – tiny – silent train (L11D6)]
[Demo: Q-learning – pacman – tricky – watch all (L11D7)]
Let’s say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Or even this one!
Video of Demo Q-Learning Pacman – Tiny – Watch All
Video of Demo Q-Learning Pacman – Tiny – Silent Train
Video of Demo Q-Learning Pacman – Tricky – Watch All
Feature-Based Representations
§ Solution: describe a state using a vector of features (aka “properties”)
  § Features are functions from states to real numbers (often 0/1) that capture important properties of the state
§ Example features:
  § Distance to closest ghost
  § Distance to closest dot
  § Number of ghosts
  § 1 / (dist to dot)²
  § Is Pacman in a tunnel? (0/1)
  § …… etc.
  § Is it the exact state on this slide?
§ Can also describe a q-state (s, a) with features (e.g. action moves closer to food)
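As an illustration, a feature extractor along these lines might look as follows; the state fields here are hypothetical, not the actual Pacman project API:

```python
def extract_features(state):
    """Map a Pacman-like state to a dict of named real-valued features."""
    return {
        "bias": 1.0,
        "dist-to-closest-ghost": float(state["ghost_dist"]),
        "inv-dist-to-dot-squared": 1.0 / state["dot_dist"] ** 2,
        "in-tunnel": 1.0 if state["in_tunnel"] else 0.0,
    }
```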
Linear Value Functions
§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  V(s) = w1 f1(s) + w2 f2(s) + … + wn fn(s)
  Q(s,a) = w1 f1(s,a) + w2 f2(s,a) + … + wn fn(s,a)
§ Advantage: our experience is summed up in a few powerful numbers
§ Disadvantage: states may share features but actually be very different in value!
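With named features, the linear Q-function is just a dot product of weights and feature values; a minimal sketch:

```python
def q_value(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a); weights for unseen features default to 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())
```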
Approximate Q-Learning
§ Q-learning with linear Q-functions:
  difference = [r + γ max_a' Q(s',a')] − Q(s,a)
  Exact Q’s:        Q(s,a) ← Q(s,a) + α · difference
  Approximate Q’s:  wi ← wi + α · difference · fi(s,a)
§ Intuitive interpretation:
  § Adjust weights of active features
  § E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state’s features
§ Formal justification: online least squares
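The weight update can be sketched directly from the temporal difference; the feature dicts follow the hypothetical extractor above, not the project code:

```python
def approx_q_update(weights, features, r, q_sa, max_q_next, alpha=0.1, gamma=0.9):
    """w_i <- w_i + alpha * difference * f_i(s,a), where
    difference = [r + gamma * max_a' Q(s',a')] - Q(s,a)."""
    difference = (r + gamma * max_q_next) - q_sa
    for name, f in features.items():
        # Features that were "on" (nonzero) get blamed or credited
        # in proportion to their value.
        weights[name] = weights.get(name, 0.0) + alpha * difference * f
    return difference
```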
Example: Q-Pacman
[Demo: approximate Q-learning pacman (L11D10)]
Video of Demo Approximate Q-Learning – Pacman

Q-Learning and Least Squares
Linear Approximation: Regression*
[Figure: fitting a line (one feature) and a plane (two features) to data]
Prediction: ŷ = w0 + w1 f1(x)
Optimization: Least Squares*
[Figure: the error or “residual” is the gap between the observation y and the prediction ŷ at each point]
total error = Σi (yi − ŷi)² = Σi (yi − Σk wk fk(xi))²
Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w:
  error(w) = ½ (y − Σk wk fk(x))²
  ∂error(w)/∂wm = −(y − Σk wk fk(x)) fm(x)
  wm ← wm + α (y − Σk wk fk(x)) fm(x)
Approximate q update explained:
  wm ← wm + α [r + γ max_a' Q(s',a') − Q(s,a)] fm(s,a)
  where r + γ max_a' Q(s',a') is the “target” and Q(s,a) is the “prediction”
[Figure: a degree 15 polynomial fit passing through every data point]
Overfitting: Why Limiting Capacity Can Help*
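The high-degree fit can be mimicked without any plotting: a degree-(n-1) polynomial through n points has zero training error but behaves wildly off the training set. A small self-contained illustration using Lagrange interpolation:

```python
def interpolate(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through (xs, ys) at x
    (Lagrange form): zero error on training points, no control elsewhere."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Alternating 0/1 data: the exact fit hits every training point,
# then shoots up to 8 just one step past the data.
xs, ys = [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, 1.0]
```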
Policy Search
§ Problem: often the feature-based policies that work well (win games, maximize utilities) aren’t the ones that approximate V/Q best
  § E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions
  § Q-learning’s priority: get Q-values close (modeling)
  § Action selection priority: get ordering of Q-values right (prediction)
§ Solution: learn policies that maximize rewards, not the values that predict them
§ Policy search: start with an ok solution (e.g. Q-learning) then fine-tune by hill climbing on feature weights
Policy Search
§ Simplest policy search:
  § Start with an initial linear value function or Q-function
  § Nudge each feature weight up and down and see if your policy is better than before
§ Problems:
  § How do we tell the policy got better?
  § Need to run many sample episodes!
  § If there are a lot of features, this can be impractical
§ Better methods exploit lookahead structure, sample wisely, change multiple parameters…
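The nudge-and-check loop above can be sketched as simple hill climbing; `evaluate` stands in for running many sample episodes and averaging the reward (a hypothetical callback, and the expensive part in practice):

```python
import random

def hill_climb(weights, evaluate, step=0.1, iterations=100):
    """Nudge one feature weight up or down; keep the change only if the
    policy's estimated reward (via evaluate) improved."""
    best = evaluate(weights)
    for _ in range(iterations):
        name = random.choice(list(weights))
        delta = random.choice([-step, step])
        weights[name] += delta
        score = evaluate(weights)
        if score > best:
            best = score            # keep the improvement
        else:
            weights[name] -= delta  # revert the nudge
    return weights, best
```

Since a change is only kept when the score improves, the returned `best` can never be worse than the starting policy's score.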
Policy Search
[Andrew Ng] [Video: HELICOPTER]
PILCO (Probabilistic Inference for Learning Control)
• Model-based policy search to minimize a given cost function
• Policy: mapping from state to control
• Rollout: plan using current policy and GP dynamics model
• Policy parameter update via CG/BFGS
• Highly data efficient
[Deisenroth et al., ICML-11, RSS-11, ICRA-14, PAMI-14]
Demo: Standard Benchmark Problem
§ Swing pendulum up and balance in inverted position
§ Learn nonlinear control from scratch
§ 4D state space, 300 controller parameters
§ 7 trials / 17.5 sec experience
§ Control freq.: 10 Hz
Controlling a Low-Cost Robotic Manipulator
• Low-cost system ($500 for robot arm and Kinect)
• Very noisy
• No sensor information about robot’s joint configuration used
• Goal: Learn to stack tower of 5 blocks from scratch
• Kinect camera for tracking block in end-effector
• State: coordinates (3D) of block center (from Kinect camera)
• 4 controlled DoF
• 20 learning trials for stacking 5 blocks (5 seconds long each)
• Account for system noise, e.g.,
  – Robot arm
  – Image processing
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih Koray Kavukcuoglu David Silver Alex Graves Ioannis Antonoglou
Daan Wierstra Martin Riedmiller
DeepMind Technologies
{vlad,koray,david,alex.graves,ioannis,daan,martin.riedmiller} @ deepmind.com
Abstract
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards. We apply our method to seven Atari 2600 games from the Arcade Learning Environment, with no adjustment of the architecture or learning algorithm. We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
1 Introduction
Learning to control agents directly from high-dimensional sensory inputs like vision and speech is one of the long-standing challenges of reinforcement learning (RL). Most successful RL applications that operate on these domains have relied on hand-crafted features combined with linear value functions or policy representations. Clearly, the performance of such systems heavily relies on the quality of the feature representation.
Recent advances in deep learning have made it possible to extract high-level features from raw sensory data, leading to breakthroughs in computer vision [11, 22, 16] and speech recognition [6, 7]. These methods utilise a range of neural network architectures, including convolutional networks, multilayer perceptrons, restricted Boltzmann machines and recurrent neural networks, and have exploited both supervised and unsupervised learning. It seems natural to ask whether similar techniques could also be beneficial for RL with sensory data.
However reinforcement learning presents several challenges from a deep learning perspective. Firstly, most successful deep learning applications to date have required large amounts of hand-labelled training data. RL algorithms, on the other hand, must be able to learn from a scalar reward signal that is frequently sparse, noisy and delayed. The delay between actions and resulting rewards, which can be thousands of timesteps long, seems particularly daunting when compared to the direct association between inputs and targets found in supervised learning. Another issue is that most deep learning algorithms assume the data samples to be independent, while in reinforcement learning one typically encounters sequences of highly correlated states. Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.
This paper demonstrates that a convolutional neural network can overcome these challenges to learn successful control policies from raw video data in complex RL environments. The network is trained with a variant of the Q-learning [26] algorithm, with stochastic gradient descent to update the weights. To alleviate the problems of correlated data and non-stationary distributions, we use
DeepMind AI Playing Atari

That’s all for Reinforcement Learning!
§ Very tough problem: How to perform any task well in an unknown, noisy environment!
§ Traditionally used mostly for robotics, but becoming more widely used
§ Lots of open research areas:
  § How to best balance exploration and exploitation?
  § How to deal with cases where we don’t know a good state/feature representation?
[Diagram: Reinforcement Learning Agent turns Data (experiences with environment) into a Policy (how to act in the future)]
Conclusion
§ We’re done with Part I: Search and Planning!
§ We’ve seen how AI methods can solve problems in:
  § Search
  § Constraint Satisfaction Problems
  § Games
  § Markov Decision Problems
  § Reinforcement Learning
§ Next up: Part II: Uncertainty and Learning!