Reinforcement Learning with DNNs: AlphaGo to AlphaZero
CS 760: Machine Learning, Spring 2018
Mark Craven and David Page
www.biostat.wisc.edu/~craven/cs760
Goals for the Lecture
• You should understand the following concepts:
  • Monte Carlo tree search (MCTS)
  • Self-play
  • Residual neural networks
  • The AlphaZero algorithm
A Brief History of Game-Playing as a CS/AI Test of Progress
• 1944: Alan Turing and Donald Michie simulate by hand their chess algorithms during lunches at Bletchley Park
• 1959: Arthur Samuel's checkers algorithm (machine learning)
• 1961: Michie's Matchbox Educable Noughts And Crosses Engine (MENACE)
• 1991: Computer solves a chess endgame thought a draw: KRB beats KNN (223 moves)
• 1992: TD-Gammon trains for backgammon by self-play reinforcement learning
• 1997: Computers best in the world at chess (Deep Blue beats Kasparov)
• 2007: Checkers "solved" by computer (guaranteed optimal play)
• 2016: Computers best at Go (AlphaGo beats Lee Sedol)
• 2017 (4 months ago): AlphaZero extends AlphaGo to best at chess, shogi
Only Some of These Involved Learning
• 1944: Alan Turing and Donald Michie simulate by hand their chess algorithms during lunches at Bletchley Park
• 1959: Arthur Samuel's checkers algorithm (machine learning)
• 1961: Michie's Matchbox Educable Noughts And Crosses Engine (MENACE)
• 1991: Computer solves a chess endgame thought a draw: KRB beats KNN (223 moves)
• 1992: TD-Gammon trains for backgammon by self-play reinforcement learning
• 1997: Computers best in the world at chess (Deep Blue beats Kasparov)
• 2007: Checkers "solved" by computer (guaranteed optimal play)
• 2016: Computers best at Go (AlphaGo beats Lee Sedol)
• 2017 (4 months ago): AlphaZero extends AlphaGo to best at chess, shogi
Background: Game Playing
• Until last year, the state of the art for many games including chess was minimax search with alpha-beta pruning (recall Intro to AI)
• Most top-performing game-playing programs didn't do learning
• The game of Go was one of the few games where humans still outperformed computers
Minimax in a Picture (thanks, Wikipedia)
Monte Carlo Tree Search (MCTS) in a Picture (thanks, Wikipedia)
Rollout (Random Search)
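The four MCTS phases pictured above (selection, expansion, rollout, backpropagation) can be sketched concretely. This is a minimal illustrative version on a toy Nim game (take 1 or 2 stones; whoever takes the last stone wins), not AlphaZero's search: plain UCB1 plays the role of the selection rule, and a random rollout stands in for the learned value network that AlphaZero later substitutes.

```python
# Minimal UCT-style MCTS on toy Nim: take 1 or 2 stones; taking the last stone wins.
import math, random

class Node:
    def __init__(self, stones, player, parent=None):
        self.stones, self.player = stones, player   # player to move: +1 or -1
        self.parent, self.children = parent, {}     # children keyed by move
        self.visits, self.wins = 0, 0.0

def legal_moves(stones):
    return [m for m in (1, 2) if m <= stones]

def rollout(stones, player):
    """Play uniformly random moves to the end; return the winner (+1/-1)."""
    while True:
        stones -= random.choice(legal_moves(stones))
        if stones == 0:
            return player          # this player took the last stone
        player = -player

def mcts(stones, player, n_sims=3000, c=1.4):
    root = Node(stones, player)
    for _ in range(n_sims):
        node = root
        # 1. Selection: descend with UCB1 while the node is fully expanded
        while node.stones > 0 and len(node.children) == len(legal_moves(node.stones)):
            node = max(node.children.values(), key=lambda ch:
                       ch.wins / ch.visits + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: add one untried child
        if node.stones > 0:
            m = random.choice([m for m in legal_moves(node.stones)
                               if m not in node.children])
            node.children[m] = Node(node.stones - m, -node.player, parent=node)
            node = node.children[m]
        # 3. Rollout from the new node (terminal nodes are scored directly)
        winner = -node.player if node.stones == 0 else rollout(node.stones, node.player)
        # 4. Backpropagation: credit each node's stats to the player who moved into it
        while node is not None:
            node.visits += 1
            if node.parent is not None and node.parent.player == winner:
                node.wins += 1
            node = node.parent
    # Final move choice: the most-visited child of the root
    return max(root.children, key=lambda m: root.children[m].visits)
```

With 5 stones the search settles on taking 2 (leaving the losing position 3), matching optimal play for this game.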
Reinforcement Learning by AlphaGo, AlphaGo Zero, and AlphaZero: Key Insights
• MCTS with self-play
  • Don't have to guess what an opponent might do, so...
  • If no exploration, a big-branching game tree becomes one path
  • You get an automatically improving, evenly matched opponent who is accurately learning your strategy
  • "We have met the enemy, and he is us" (famous variant of Pogo, 1954)
  • No need for human expert scoring rules for boards from unfinished games
• Treat the board as an image: use a residual convolutional neural network
• AlphaGo Zero: One deep neural network learns both the value function and policy in parallel
• AlphaZero: Removed rollout altogether from MCTS and just used current neural net estimates instead
AlphaZero (Dec 2017): Minimized Required Game Knowledge, Extended from Go to Chess and Shogi
AlphaZero's Version of Q-Learning
• No discount on future rewards
• Rewards of 0 until the end of the game; then a reward of -1 or +1
• Therefore the Q-value for an action a or policy π from a state S is exactly the value function: Q(S, π) = V(S, π)
• AlphaZero uses one DNN (details in a bit) to model both π and V
• Updates to the DNN are made (training examples provided) after each game
• During the game, need to balance exploitation and exploration
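The first two bullets can be checked numerically: with no discount (γ = 1) and all rewards 0 until a terminal ±1, the return from every state along a game equals the final outcome z, which is why each recorded position's value target is simply z. A small sketch:

```python
# With gamma = 1 and rewards of 0 until a terminal +1/-1, the return from
# every state of the game equals the final outcome z.
def returns(rewards, gamma=1.0):
    g, out = 0.0, []
    for r in reversed(rewards):          # accumulate back from the end of the game
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

rewards = [0, 0, 0, 0, 1]                # five moves, win at the end
print(returns(rewards))                  # every state's return is +1.0
```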
AlphaZero Algorithm

Initialize DNN f_θ
Repeat forever:
    Play Game
    Update θ

Play Game:
    Repeat until win or lose:
        From current state S, perform MCTS
        Estimate move probabilities π by MCTS
        Record (S, π) as an example
        Randomly draw next move from π

Update θ:
    Let z be the previous game outcome (+1 or -1)
    Sample from the last game's examples (S, π, z)
    Train DNN f_θ on the sample to get new θ
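The outer loop above can be made runnable with stand-ins: here the "network" is a hypothetical tabular value dictionary, "MCTS" is replaced by a uniform policy, and the game is toy Nim. None of this is DeepMind's implementation; the point is the control flow of self-play followed by an update toward the game outcome z.

```python
# A runnable schematic of the AlphaZero loop, with hypothetical stand-ins:
# a tabular value dict for the DNN and a uniform policy in place of MCTS.
import random

def moves(stones):                    # toy game: Nim, take 1 or 2, last stone wins
    return [m for m in (1, 2) if m <= stones]

def mcts_policy(theta, stones):       # stand-in for MCTS: uniform move probabilities
    ms = moves(stones)
    return ms, [1.0 / len(ms)] * len(ms)

def play_game(theta, start=7):
    """Self-play one game, recording (state, pi) pairs as in the slide."""
    examples, stones, player = [], start, +1
    while stones > 0:
        ms, pi = mcts_policy(theta, stones)
        examples.append((stones, player, pi))
        stones -= random.choices(ms, weights=pi)[0]
        if stones == 0:
            winner = player
        player = -player
    # z: the game outcome from each recorded player's point of view
    return [(s, pi, 1.0 if p == winner else -1.0) for s, p, pi in examples]

def update(theta, examples, lr=0.1):
    """Nudge each visited state's value estimate toward the outcome z."""
    for s, pi, z in examples:
        theta[s] = theta.get(s, 0.0) + lr * (z - theta.get(s, 0.0))
    return theta

theta = {}
for _ in range(200):                  # "Repeat forever", truncated
    theta = update(theta, play_game(theta))
```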
AlphaZero Play-Game
AlphaZero Train DNN
AlphaZero Monte Carlo Tree Search (MCTS)
Why Need MCTS At All?
• Could always make the move the DNN says has the highest Q: no exploration
• Could just draw a move from the DNN's policy output
• The papers say the MCTS output probability vector π selects stronger moves than directly using the neural network's policy output itself (is there a possible lesson here for self-driving cars too??)
• Still need to decide how many times to repeat the MCTS search (game-specific) and how to trade off exploration and exploitation in MCTS... The AlphaZero paper just says to choose the move with "low count, high move probability, and high value"; the AlphaGo paper is more specific: maximize an upper confidence bound
• Where τ is a temperature and N(s, b) is the count of times action b has been taken from state s, choose move a with probability
  π(a | s) = N(s, a)^(1/τ) / Σ_b N(s, b)^(1/τ)
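The two selection rules mentioned above can be sketched in a few lines: a PUCT-style upper-confidence bound for choosing actions inside the search (in the spirit of the AlphaGo papers; the exact constants and details vary), and the visit-count temperature distribution for choosing the actual move to play.

```python
# Sketch of the two rules above; c_puct and tau are tunable constants.
import math

def puct_select(Q, N, P, c_puct=1.0):
    """Index of the child maximizing Q(s,a) + c * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    total = sum(N)
    scores = [q + c_puct * p * math.sqrt(total) / (1 + n)
              for q, n, p in zip(Q, N, P)]
    return max(range(len(Q)), key=scores.__getitem__)

def visit_count_policy(N, tau=1.0):
    """pi(a) proportional to N(s,a)^(1/tau); as tau -> 0 this approaches argmax."""
    w = [n ** (1.0 / tau) for n in N]
    s = sum(w)
    return [x / s for x in w]

# e.g. visit counts [10, 30, 60] with tau = 1 give pi = [0.1, 0.3, 0.6]
```

Note how the bonus term in `puct_select` favors moves with a low count but a high prior move probability, exactly the "low count, high move probability, and high value" trade-off the AlphaZero paper describes informally.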
AlphaZero DNN Architecture: Input Nodes Represent the Current Game State, Including Any Needed History
AlphaZero DNN Architecture: Output Nodes Represent Policy and Value Function
• A policy is a probability distribution over all possible moves from a state, so we need units to represent all possible moves
• Chess is the most complicated for describing moves (though Go and shogi have higher numbers of moves to consider), so here is the layout for chess moves:
  • 8 x 8 = 64 possible starting positions for a move
  • 56 possible destinations for queen moves: 8 compass directions {N, NE, E, SE, S, SW, W, NW} times 7 possible move lengths
  • Another 17 possible destinations for irregular moves such as knight moves
• Some moves are impossible, depending on the particular piece at a position (e.g., a pawn can't make all queen moves) and the location of other pieces (a queen can't move through 2 other pieces to attack a third)
  • Weights for impossible moves are set to 0 and not allowed to change
  • Another layer normalizes the results into a probability distribution
• One deep neural network learns both the value function and policy in parallel: one additional output node for the value function, which estimates the expected outcome in the range [-1, 1] for following the current policy from the present (input) state
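The zero-out-and-normalize step described above can be realized as a masked softmax, sketched here in NumPy. This is one common way to get "illegal moves receive probability exactly 0, the rest renormalize"; the slide's description (zeroed, frozen weights) suggests a similar effect, though the exact mechanism in the real system may differ.

```python
# Masked softmax: illegal moves get probability 0; legal ones renormalize.
import numpy as np

def masked_policy(logits, legal_mask):
    """Softmax over legal moves only; illegal entries come out as exactly 0."""
    masked = np.where(legal_mask, logits, -np.inf)   # kill illegal moves
    z = masked - masked.max()                        # stabilize the exponentials
    e = np.where(legal_mask, np.exp(z), 0.0)         # exp(-inf) -> 0 regardless
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 3.0])
legal  = np.array([True, True, False, False])
p = masked_policy(logits, legal)    # p[2] == p[3] == 0, and p sums to 1
```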
Deep Neural Networks Trick #9: ResNets (Residual Networks)
• What if your neural network is too deep?
• In theory, that's no problem, given sufficient nodes and connectivity: early (or late) layers can just learn the identity function (autoencoder)
• In practice, deep neural networks fail to learn the identity when needed
• A solution: make the identity easy or even the default; the network has to work hard to actually learn a non-zero residual (and hence a non-identity)
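The "identity as the default" idea fits in a few lines: the block computes F(x) and outputs F(x) + x, so doing nothing (F = 0) is trivially representable. This toy NumPy sketch uses dense layers; the AlphaZero blocks use 3x3 convolutions and batch normalization instead.

```python
# A toy residual block: output = ReLU(x + F(x)), where F(x) = ReLU(x W1) W2.
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W1, W2):
    f = np.maximum(0.0, x @ W1) @ W2   # the residual F(x) the block must learn
    return np.maximum(0.0, x + f)      # identity skip connection, then ReLU

x = rng.standard_normal(8)
zeros = np.zeros((8, 8))
# With all-zero weights, F(x) = 0 and the block reduces to ReLU(identity):
out = residual_block(x, zeros, zeros)
```

Note that the skip connection only type-checks because the block's input and output have the same dimensionality, which is exactly the constraint the next slide points out.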
Residual Network in a Picture (He, Zhang, Ren, Sun, 2015): Identity Skip Connection
Note: output and input dimensionality need to be the same.
Why called "residual"?
Deep Residual Networks (ResNets): Start of a 34-Layer ResNet (He, Zhang, Ren, Sun, 2015)
Dotted line denotes an increase in dimension (2 more such increases)
A Brief Aside: Leaky ReLUs
• The rectifiers used could be ReLU or "Leaky ReLU"
• Leaky ReLU addresses the "dying ReLU" problem: when the input sum is below some value, the output is 0, so there is no gradient for training
• ReLU: f(x) = max(0, x)
• Leaky ReLU: f(x) = x if x > 0, else αx for a small slope α (e.g., 0.01)
[Plots of ReLU (left) and Leaky ReLU (right)]
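The two rectifiers, as defined above, in NumPy (α = 0.01 is a common default; the slide does not pin down the slope used):

```python
# ReLU vs Leaky ReLU; the small negative slope keeps gradients alive for x < 0.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu(x)          # -> [ 0.   ,  0.   , 0., 1.5]
leaky_relu(x)    # -> [-0.02 , -0.005, 0., 1.5]
```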
AlphaZero DNN Architecture: Hidden Units Arranged in a Residual Network (a CNN with Residual Layers)

Conv Block (3x3, 256, /1)
Res Block (3x3, 256, /1)
Res Block (3x3, 256, /1)
... repeat for 39 Res Blocks
Policy Head
Value Head
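The tower above can be sketched at the shape level. To stay dependency-light, 1x1 convolutions (per-square matrix multiplies over channels) stand in for the real 3x3/256 convolutions, batch norm is omitted, and the channel counts and move space are toy sizes; only the overall structure (conv block, residual stack, two heads) mirrors the slide.

```python
# Shape-level sketch of the AlphaZero tower: conv block, residual stack,
# then policy and value heads. 1x1 convs stand in for the real 3x3 ones.
import numpy as np

rng = np.random.default_rng(0)
BOARD, C_IN, C = 8, 17, 32          # toy sizes; AlphaZero uses 256 channels

def conv1x1(x, W):                  # x: (8, 8, c_in), W: (c_in, c_out)
    return x @ W                    # a per-square mix of channels

def conv_block(x, W):
    return np.maximum(0.0, conv1x1(x, W))

def res_block(x, W1, W2):
    f = conv1x1(np.maximum(0.0, conv1x1(x, W1)), W2)
    return np.maximum(0.0, x + f)   # identity skip connection

def policy_head(x, W):              # -> probability distribution over moves
    logits = x.reshape(-1) @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def value_head(x, w):               # -> scalar outcome estimate in [-1, 1]
    return np.tanh(x.reshape(-1) @ w)

x = rng.standard_normal((BOARD, BOARD, C_IN))
x = conv_block(x, rng.standard_normal((C_IN, C)) * 0.1)
for _ in range(3):                  # the real tower repeats for 39 Res Blocks
    x = res_block(x, rng.standard_normal((C, C)) * 0.1,
                     rng.standard_normal((C, C)) * 0.1)
pi = policy_head(x, rng.standard_normal((BOARD * BOARD * C, 64)) * 0.01)
v  = value_head(x, rng.standard_normal(BOARD * BOARD * C) * 0.01)
```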
AlphaZero DNN Architecture: Convolution Block
AlphaZero DNN Architecture: Residual Blocks
AlphaZero DNN Architecture: Policy Head (for Go)
AlphaZero DNN Architecture: Value Head
AlphaZero Compared to Recent World Champions