AlphaGo in Depth
TRANSCRIPT
Overview
• AI in Game Playing
• Machine Learning and Deep Learning
• Reinforcement Learning
• AlphaGo's Methods
AI in Game Playing
• AI in Game Playing
– Adversarial Search – MiniMax
– Monte Carlo Tree Search
– Multi-Armed Bandit Problem
Monte Carlo Tree Search
• Tree Search + Monte Carlo Method
– Selection
– Expansion
– Simulation
– Back-Propagation
[Figure: an MCTS tree; each node is labeled "white wins / total simulations" — a root of 3/5 with children 1/2, 2/3, and 0/1, and leaves 1/1, 1/1, 0/1 below.]
Multi-Armed Bandit Problem
• Exploration vs. Exploitation
UCB1 Algorithm
$$\arg\max_i \left( \bar{x}_i + \sqrt{\frac{2 \log n}{n_i}} \right)$$

• $\bar{x}_i$: the mean payout of machine $i$
• $n_i$: the number of plays of machine $i$
• $n$: the total number of plays
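To make the formula concrete, here is a minimal Python sketch of UCB1 on a toy bandit (the three arms and their win rates are invented for illustration):

```python
import math
import random

def ucb1_select(mean_payout, plays, total_plays):
    """Return the index of the arm maximizing x̄_i + sqrt(2·log n / n_i).

    Arms that have never been played are tried first so the
    exploration bonus is always well defined.
    """
    for i, n_i in enumerate(plays):
        if n_i == 0:
            return i
    return max(range(len(plays)),
               key=lambda i: mean_payout[i]
               + math.sqrt(2 * math.log(total_plays) / plays[i]))

# Hypothetical bandit with three arms and unknown win rates.
true_rates = [0.3, 0.5, 0.7]
mean_payout, plays = [0.0, 0.0, 0.0], [0, 0, 0]
for n in range(1, 1001):
    i = ucb1_select(mean_payout, plays, n)
    reward = 1.0 if random.random() < true_rates[i] else 0.0
    plays[i] += 1
    mean_payout[i] += (reward - mean_payout[i]) / plays[i]  # running mean
print(plays)  # the best arm (index 2) should dominate
```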
UCB1 Algorithm
$$R(s, a) = Q(s, a) + c\sqrt{\frac{\log N(s)}{N(s, a)}}$$

$$a^* = \arg\max_a R(s, a)$$

[Figure: the same MCTS tree; for the edge leading to the 2/3 child: $Q(s, a) = 2/3$, $N(s, a) = 3$, $N(s) = 5$, and $c$ is a constant.]
UCB1 Algorithm

[Figure: the same tree; the root's two most-visited children correspond to actions $a_1$ (2/3) and $a_2$ (1/2).]

$$R(s, a_1) = \frac{2}{3} + 0.5\sqrt{\frac{\log 5}{3}} = 1.0329$$

$$R(s, a_2) = \frac{1}{2} + 0.5\sqrt{\frac{\log 5}{2}} = 0.9485$$

Since $R(s, a_1) > R(s, a_2)$:

$$a^* = \arg\max_a R(s, a) = a_1$$
Machine Learning and Deep Learning
• Machine Learning and Deep Learning:
– Supervised Machine Learning
– Neural Networks
– Convolutional Neural Networks
– Training Neural Networks
Supervised Machine Learning

[Figure: two diagrams. Training — a Problem and its Solution are fed into the Machine Learning Model. Prediction — a Problem is fed into the Machine Learning Model, which produces an Output; Feedback on the output is used to improve the model.]
Neural Networks

[Figure: a single neuron $n$ with inputs $x_1$, $x_2$ (weights $W_1$, $W_2$) and a bias input $b$ (weight $W_b$).]

$$n_{in} = w_1 x_1 + w_2 x_2 + w_b$$

Sigmoid: $n_{out} = \dfrac{1}{1 + e^{-n_{in}}}$

tanh: $n_{out} = \dfrac{1 - e^{-2 n_{in}}}{1 + e^{-2 n_{in}}}$

ReLU: $n_{out} = \begin{cases} n_{in} & \text{if } n_{in} > 0 \\ 0 & \text{otherwise} \end{cases}$
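A short Python sketch of this single neuron and the three activation functions (the input and weight values below are arbitrary):

```python
import math

def neuron_input(x, w, w_b):
    """n_in = w1*x1 + w2*x2 + wb (weighted inputs plus bias weight)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w_b

def sigmoid(n_in):
    return 1.0 / (1.0 + math.exp(-n_in))

def tanh(n_in):
    return (1.0 - math.exp(-2 * n_in)) / (1.0 + math.exp(-2 * n_in))

def relu(n_in):
    return n_in if n_in > 0 else 0.0

n_in = neuron_input([0.5, -1.0], w=[0.8, 0.3], w_b=0.1)  # arbitrary values
print(sigmoid(n_in), tanh(n_in), relu(n_in))
```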
Neural Networks
• AND Gate

x1 | x2 | y
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1

[Figure: the four inputs plotted in the plane; only (1,1) is labeled 1, and the line $20 x_1 + 20 x_2 - 30 = 0$ separates it from (0,0), (0,1), (1,0).]

A single neuron with weights 20, 20 and bias weight −30 implements AND:

$$y = \frac{1}{1 + e^{-(20 x_1 + 20 x_2 - 30)}}$$

Decision boundary: $20 x_1 + 20 x_2 - 30 = 0$
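Checking the AND neuron in Python with the slide's weights (20, 20) and bias weight −30:

```python
import math

def sigmoid(n_in):
    return 1.0 / (1.0 + math.exp(-n_in))

# Weights (20, 20) and bias -30 implement AND.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = sigmoid(20 * x1 + 20 * x2 - 30)
    print(x1, x2, round(y))  # prints the AND truth table: 0, 0, 0, 1
```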
Neural Networks

[Figure: a two-layer network. Input layer: x, y, and a bias b. Hidden layer: neurons n11, n12 with weights W11,x, W11,y, W11,b and W12,x, W12,y, W12,b. Output layer: neurons n21, n22 producing z1, z2, with weights W21,11, W21,12, W21,b and W22,11, W22,12, W22,b.]
Neural Networks
• XOR Gate

[Figure: XOR is not linearly separable — (0,1) and (1,0) map to 1 while (0,0) and (1,1) map to 0, so no single line separates the classes; two hidden neurons carve the plane instead.]

• Hidden neuron n1 (AND): weights 20, 20, bias weight −30
• Hidden neuron n2 (OR): weights 20, 20, bias weight −10
• Output neuron n: weights −20 (from n1), 20 (from n2), bias weight −10

x1 | x2 | n1 | n2 | y
0 | 0 | 0 | 0 | 0
0 | 1 | 0 | 1 | 1
1 | 0 | 0 | 1 | 1
1 | 1 | 1 | 1 | 0
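The same check for the two-layer XOR network, wiring up the hidden AND/OR neurons and the output neuron with the weights above:

```python
import math

def sigmoid(n_in):
    return 1.0 / (1.0 + math.exp(-n_in))

def xor_net(x1, x2):
    n1 = sigmoid(20 * x1 + 20 * x2 - 30)     # hidden AND neuron
    n2 = sigmoid(20 * x1 + 20 * x2 - 10)     # hidden OR neuron
    return sigmoid(-20 * n1 + 20 * n2 - 10)  # output: OR but not AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2)))  # XOR truth table: 0, 1, 1, 0
```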
Multi-Class Classification
• Softmax

[Figure: three output neurons n1, n2, n3 with raw inputs $n_{1,in}$, $n_{2,in}$, $n_{3,in}$.]

$$n_{1,out} = \frac{e^{n_{1,in}}}{e^{n_{1,in}} + e^{n_{2,in}} + e^{n_{3,in}}} \qquad
n_{2,out} = \frac{e^{n_{2,in}}}{e^{n_{1,in}} + e^{n_{2,in}} + e^{n_{3,in}}} \qquad
n_{3,out} = \frac{e^{n_{3,in}}}{e^{n_{1,in}} + e^{n_{2,in}} + e^{n_{3,in}}}$$
Multi-Class Classification
• Softmax

$$n_{1,out} = \frac{e^{n_{1,in}}}{e^{n_{1,in}} + e^{n_{2,in}} + e^{n_{3,in}}}$$

• $n_{1,out} \to 1$ when $n_{1,in} \gg n_{2,in}$ and $n_{1,in} \gg n_{3,in}$
• $n_{1,out} \to 0$ when $n_{1,in} \ll n_{2,in}$ or $n_{1,in} \ll n_{3,in}$
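A small numeric sketch of softmax that shows this saturation behavior (the input values are made up):

```python
import math

def softmax(n_in):
    """Normalize raw neuron inputs into probabilities that sum to 1."""
    exps = [math.exp(v) for v in n_in]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.5]))    # moderate differences: soft distribution
print(softmax([10.0, 1.0, 0.5]))   # n1,in >> others: n1,out ≈ 1
print(softmax([-10.0, 1.0, 0.5]))  # n1,in << others: n1,out ≈ 0
```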
Convolutional Neural Networks

[Figure: an input layer connected to a convolutional layer through local receptive fields.]
Training Neural Networks
• One-Hot Encoding:

Input (three 5×5 planes for the same position):

Player's stones:
0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0

Opponent's stones:
0 0 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 1 0 0
0 0 0 0 0

Empty positions:
1 1 1 1 1
1 1 0 1 1
1 1 0 0 1
1 1 0 1 1
1 1 1 1 1

Output (next position):
0 0 0 0 0
0 0 0 0 0
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0
Training Neural Networks

Forward propagation: the input planes encoding state s (player's stones, opponent's stones, empty positions, as above) pass through the input layer and the convolutional layers, and the output layer produces the move probabilities $p_w(a|s)$:

Outputs $p_w(a|s)$:
0 0 0 0 0
0 .5 0 0 0
0 .3 0 0 0
0 .2 0 0 0
0 0 0 0 0
Training Neural Networks

Backward propagation: the output $p_w(a|s)$ (as above) is compared against the golden move $a_i$, one-hot encoded:

Golden:
0 0 0 0 0
0 0 0 0 0
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0

Cost function: $-\log(p_w(a_i|s))$

$$w \leftarrow w - \eta \, \frac{\partial \, (-\log p_w(a_i|s))}{\partial w}$$
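As a sketch of this update for a single softmax output layer (NumPy; the layer sizes and input values are toy stand-ins, not AlphaGo's real dimensions), using the fact that the gradient of $-\log p_w(a_i|s)$ with respect to the logits simplifies to probs − one_hot($a_i$):

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_moves = 48, 25        # toy sizes, not AlphaGo's real dimensions
W = rng.normal(0.0, 0.1, (n_moves, n_inputs))
eta = 0.01                        # learning rate η

def forward(s):
    logits = W @ s
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()      # p_w(a|s) via softmax

s = rng.random(n_inputs)          # flattened input planes (toy values)
golden = 7                        # index of the golden move a_i

probs = forward(s)
grad_logits = probs.copy()        # ∂(−log p_w(a_i|s))/∂logits = p − one_hot(a_i)
grad_logits[golden] -= 1.0
W -= eta * np.outer(grad_logits, s)   # w ← w − η ∂(−log p_w(a_i|s))/∂w
```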
Backward Propagation

[Figure: weight w21 feeds neuron n2, whose output feeds the cost function J.]

Cost function J; by the chain rule:

$$\frac{\partial J}{\partial w_{21}} = \frac{\partial J}{\partial n_{2(out)}} \, \frac{\partial n_{2(out)}}{\partial n_{2(in)}} \, \frac{\partial n_{2(in)}}{\partial w_{21}}$$

$$w_{21} \leftarrow w_{21} - \eta \frac{\partial J}{\partial w_{21}} = w_{21} - \eta \, \frac{\partial J}{\partial n_{2(out)}} \, \frac{\partial n_{2(out)}}{\partial n_{2(in)}} \, \frac{\partial n_{2(in)}}{\partial w_{21}}$$
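A numeric illustration of this chain rule for one sigmoid neuron; the squared-error cost and all values here are invented for the example:

```python
import math

# One weight w21 connects n1's output to neuron n2; cost J = (n2_out - t)^2 / 2.
n1_out, w21, t, eta = 0.8, 0.5, 1.0, 0.1   # invented values

n2_in = w21 * n1_out
n2_out = 1.0 / (1.0 + math.exp(-n2_in))    # sigmoid activation

dJ_dn2out = n2_out - t                     # ∂J/∂n2(out)
dn2out_dn2in = n2_out * (1.0 - n2_out)     # ∂n2(out)/∂n2(in), sigmoid'
dn2in_dw21 = n1_out                        # ∂n2(in)/∂w21

dJ_dw21 = dJ_dn2out * dn2out_dn2in * dn2in_dw21   # chain rule
w21 = w21 - eta * dJ_dw21                          # w21 ← w21 − η ∂J/∂w21
print(dJ_dw21, w21)
```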
Reinforcement Learning

[Figure: the agent–environment loop — the agent takes action At; the environment returns the next state St and a reward (feedback) Rt.]

• Feedback is delayed.
• No supervisor, only a reward signal.
• Rules of the game are unknown.
• The agent's actions affect the subsequent state.
Policy
• The behavior of an agent

Deterministic policy: $\pi(s) = a$
Stochastic policy: e.g. $\pi(a_1 \mid s) = 0.5$, $\pi(a_2 \mid s) = 0.5$ for a state s with actions a1, a2
Value
• The expected long-term reward

State-value function $v_\pi(s)$: the expected reward obtained by starting from state s and following policy $\pi$ to the end.
Action-value function $q_\pi(s, a)$: the expected reward obtained by starting from state s, taking action a, and then following policy $\pi$ to the end.
Policy Gradient Method
• REINFORCE – REward Increment = Nonnegative Factor × Offset ReinforCEment

$$w \leftarrow w + \alpha (r - b) \frac{\partial \log \pi(a|s)}{\partial w}$$

• $w$: weights of the policy function
• $\alpha$: learning rate
• $r$: reward
• $b$: baseline (usually = 0)
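A minimal REINFORCE sketch for a two-action softmax policy with a single state (the environment and rewards are stand-ins invented for illustration):

```python
import math
import random

w = [0.0, 0.0]          # one weight per action (toy policy, state ignored)
alpha, b = 0.1, 0.0     # learning rate α and baseline b

def policy():
    exps = [math.exp(v) for v in w]
    total = sum(exps)
    return [e / total for e in exps]   # π(a|s) as a softmax over w

for _ in range(2000):
    probs = policy()
    a = 0 if random.random() < probs[0] else 1
    r = 1.0 if a == 1 else 0.0         # invented reward: action 1 is better
    # For a softmax policy: ∂log π(a)/∂w_i = (1 if i == a else 0) - π(i)
    for i in range(2):
        grad = (1.0 if i == a else 0.0) - probs[i]
        w[i] += alpha * (r - b) * grad  # w ← w + α (r − b) ∂log π(a|s)/∂w
print(policy())  # probability of action 1 should approach 1
```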
Policy Networks

[Figure: the state s is one-hot encoded and fed to the neural network; the network outputs the action probabilities $\pi(a|s)$, from which an action is sampled and executed.]
AlphaGo's Methods
• Training:
– Supervised learning: Classification
– Reinforcement learning
– Supervised learning: Regression
• Searching:
– Searching with policy and value networks
– Distributed search
Training

[Figure: the training pipeline. Human expert data trains the rollout policy $p_\pi$ and the SL policy network $p_\sigma$ by classification. $p_\sigma$ initializes the weights of the RL policy network $p_\rho$, which is improved by policy gradient. $p_\rho$ generates self-play data, on which the value network $v_\theta$ is trained by regression.]
Supervised Learning: Classification

• Human expert data: KGS dataset — 160,000 games, 29.4 million positions
• Rollout policy $p_\pi$: linear-softmax network (faster but less accurate)
• SL policy network $p_\sigma$: 13-layer convolutional neural network
• 50 GPUs, 3 weeks; accuracy: 57.0%
Input/Output Data

Input (3×3 example) — stone color, 3 planes (player, opponent, empty):

Player:
0 0 0
0 1 0
0 0 0

Opponent:
0 1 0
0 0 1
0 1 0

Empty:
1 0 1
1 0 0
1 0 1

Output — next position:
0 0 0
1 0 0
0 0 0

Input features:
• Stone color: 3 planes (player, opponent, empty)
• Liberty: 8 planes (1~8 liberties)
• Also: turns since, capture size, self-atari size, ladder capture, ladder escape, sensibleness
• Total: 48 planes

[Figure: the remaining (mostly zero) feature planes for the same position.]
Symmetries

[Figure: the 8 symmetries of a board position — the input, its rotations by 90, 180, and 270 degrees, and the vertical reflection of each.]
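A sketch of generating the 8 symmetries of a feature plane with NumPy:

```python
import numpy as np

def symmetries(plane):
    """Return the 8 dihedral symmetries of a board plane:
    the 4 rotations and the vertical reflection of each."""
    results = []
    for k in range(4):                      # 0, 90, 180, 270 degrees
        rotated = np.rot90(plane, k)
        results.append(rotated)
        results.append(np.flipud(rotated))  # vertical reflection
    return results

board = np.arange(9).reshape(3, 3)          # toy 3x3 plane
for sym in symmetries(board):
    print(sym)
```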
SL Policy Network

• Input: size 19×19, 48 planes
• First layer: Conv + ReLU, kernel size 5×5, k filters
• 2nd to 12th layers: Conv + ReLU, kernel size 3×3, k filters
• 13th layer: kernel size 1×1, 1 filter, Softmax
• k = 192
Supervised Learning: Classification

SL policy network $p_\sigma$: given the input s and the golden action a, forward propagation yields the action probabilities $p_\sigma(a|s)$, and backward propagation updates the weights ($\eta$: learning rate):

$$\sigma \leftarrow \sigma - \eta \, \frac{\partial \, (-\log p_\sigma(a|s))}{\partial \sigma}$$
Reinforcement Learning

• RL policy network $p_\rho$: weights initialized by the SL policy network
• Policy gradient on self-play data: 10,000 × 128 games
• 50 GPUs, 1 day; won 80% of games against the SL network
Reinforcement Learning

[Figure: the self-play loop. Weights are initialized from the SL policy network $p_\sigma$: $\rho = \rho^- = \sigma$. The RL policy network $p_\rho$ plays games to the end against an opponent $p_{\rho^-}$ drawn from an opponent pool; the reward r updates $\rho$ by the policy gradient method, and the updated $p_\rho$ is added to the opponent pool.]
Policy Gradient Method

[Figure: one self-play game — states $s_1, s_2, \ldots, s_T$ with move probabilities $p_\rho(a_1|s_1), p_\rho(a_2|s_2), \ldots, p_\rho(a_T|s_T)$; the terminal reward $r(s_T)$ is propagated backward.]

$$\rho \leftarrow \rho + \alpha \sum_{i=1}^{T} \frac{\partial \log p_\rho(a_i|s_i)}{\partial \rho} \big( r(s_T) - b(s_i) \big)$$

($\alpha$: learning rate; $b$: baseline)
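A schematic sketch of this whole-game update, with a trivial stand-in "game" in place of Go and a constant baseline (all details invented for illustration):

```python
import math
import random

# Stand-in "game": T moves, two actions; the terminal reward r(s_T) is the
# fraction of moves where action 1 was chosen. Real self-play plays Go here.
T, alpha, b = 10, 0.05, 0.5          # horizon, learning rate α, baseline b
rho = [0.0, 0.0]                     # toy policy weights, one per action

def probs():
    exps = [math.exp(v) for v in rho]
    s = sum(exps)
    return [e / s for e in exps]     # p_ρ(a|s), ignoring the state

for game in range(500):
    trajectory = []
    for t in range(T):               # play one game to the end
        p = probs()
        a = 0 if random.random() < p[0] else 1
        trajectory.append((p, a))
    r_T = sum(a for _, a in trajectory) / T   # terminal reward r(s_T)
    for p, a in trajectory:          # sum over all time steps i
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - p[i]  # ∂log p_ρ(a_i|s_i)/∂ρ_i
            rho[i] += alpha * (r_T - b) * grad
print(probs())                       # action 1 should dominate
```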
Supervised Learning: Regression

[Figure: the RL policy network $p_\rho$ generates self-play data, from which the value network $v_\theta$ is trained by regression; an "initialize weights" arrow connects the two networks.]

• Self-play data: 30 million positions
• Value network: 15-layer convolutional neural network
• 50 GPUs, 1 week; MSE: 0.226
Value Network

• Input: size 19×19, 48 planes + 1 plane (current color)
• 1st to 13th layers: the same as the policy network
• 14th layer: fully connected, 256 ReLU units
• 15th layer: fully connected, 1 tanh unit
Input/Output Data

Generating one training example per self-play game:
1. Randomly sample an integer U in 1~450.
2. Play moves t = 1 to U−1 with the SL policy network $p_\sigma$; play a random action at t = U; play moves t = U+1 to the end with the RL policy network $p_\rho$.
3. Generate the training example $(s_{U+1}, z_{U+1})$: the state $s_{U+1}$ and the value $z_{U+1}$ derived from the reward r.
Supervised Learning: Regression

Value network $v_\theta$: given the input s and the golden value z, the network outputs the value $v_\theta(s)$, and backward propagation updates the weights:

$$\theta \leftarrow \theta + \eta (z - v_\theta(s)) \frac{\partial v_\theta(s)}{\partial \theta}$$
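A toy version of this regression step, using a linear-plus-tanh value function as a stand-in for the convolutional network (sizes and values invented):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 0.1, 8)   # toy weight vector (stand-in for θ)
eta = 0.05                        # learning rate η

def v(s):
    return np.tanh(theta @ s)     # value in (-1, 1), like the final tanh unit

s = rng.random(8)                 # stand-in for the encoded position s_{U+1}
z = 1.0                           # golden value z_{U+1} (game outcome)

for _ in range(100):
    # θ ← θ + η (z − v_θ(s)) ∂v_θ(s)/∂θ, with ∂tanh(θ·s)/∂θ = (1 − v²) s
    grad = (1.0 - v(s) ** 2) * s
    theta = theta + eta * (z - v(s)) * grad
print(v(s))                       # approaches the golden value z
```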
Searching

• Each edge (s, a) stores a set of statistics:
– $Q(s, a)$: combined mean action value
– $P(s, a)$: prior probability evaluated by $p_\sigma(a|s)$
– $W_v(s, a)$: estimated action value by $v_\theta(s)$
– $W_r(s, a)$: estimated action value by $p_\pi(a|s)$
– $N_v(s, a)$: count of evaluations by $v_\theta(s)$
– $N_r(s, a)$: count of evaluations by $p_\pi(a|s)$
Selection

PUCT algorithm: choose the action $a^*$ at state s:

$$a^* = \arg\max_a \big( Q(s, a) + u(s, a) \big)$$

$$u(s, a) = c \, P(s, a) \, \frac{\sqrt{\sum_b N_r(s, b)}}{1 + N_r(s, a)}$$

• $Q(s, a)$: exploitation term
• $u(s, a)$: exploration term
• $\sum_b N_r(s, b)$: visit count of the parent node
• $N_r(s, a)$: visit count of edge (s, a)
• $c$: level of exploration
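A sketch of PUCT selection over a per-state table of edge statistics (the data layout and the numbers are invented for the example):

```python
import math

def puct_select(edges, c=5.0):
    """edges: {action: {'Q': float, 'P': float, 'Nr': int}} for one state s.
    Returns argmax_a Q(s, a) + u(s, a)."""
    total_visits = sum(e['Nr'] for e in edges.values())  # Σ_b N_r(s, b)
    def score(a):
        e = edges[a]
        u = c * e['P'] * math.sqrt(total_visits) / (1 + e['Nr'])
        return e['Q'] + u
    return max(edges, key=score)

edges = {  # toy statistics for three candidate moves
    'a1': {'Q': 0.66, 'P': 0.4, 'Nr': 3},
    'a2': {'Q': 0.50, 'P': 0.3, 'Nr': 2},
    'a3': {'Q': 0.00, 'P': 0.3, 'Nr': 0},
}
print(puct_select(edges))  # the unvisited edge wins on the exploration term
```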
Expansion

If the visit count exceeds a threshold, $N_r(s, a) > n_{thr}$:
1. Insert the node for the successor state $s'$.
2. For every possible action $a'$, initialize the statistics:

$$N_v(s', a') = N_r(s', a') = 0$$
$$W_r(s', a') = W_v(s', a') = 0$$
$$P(s', a') = p_\sigma(a'|s')$$
Evaluation

1. Evaluate the new node by the value network $v_\theta$: $v_\theta(s')$.
2. Simulate actions with the rollout policy network $p_\pi$; when reaching the terminal state $s_T$, calculate the reward $r(s_T)$.
Backup

Update the statistics of every visited edge (s, a):

$$N_r(s, a) \leftarrow N_r(s, a) + 1 \qquad W_r(s, a) \leftarrow W_r(s, a) + r(s_T)$$

$$N_v(s, a) \leftarrow N_v(s, a) + 1 \qquad W_v(s, a) \leftarrow W_v(s, a) + v_\theta(s')$$

$$Q(s, a) = (1 - \lambda) \frac{W_v(s, a)}{N_v(s, a)} + \lambda \frac{W_r(s, a)}{N_r(s, a)}$$

($\lambda$: interpolation constant)
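A sketch of the backup step over the same edge-statistics layout as the selection example (numbers invented):

```python
def backup(path, r_T, v_leaf, lam=0.5):
    """path: edge-statistics dicts visited from root to leaf.
    r_T: rollout reward r(s_T); v_leaf: value-network estimate v_θ(s')."""
    for e in path:
        e['Nr'] += 1
        e['Wr'] += r_T
        e['Nv'] += 1
        e['Wv'] += v_leaf
        # Combined mean action value, interpolated by λ.
        e['Q'] = (1 - lam) * e['Wv'] / e['Nv'] + lam * e['Wr'] / e['Nr']

edge = {'Nr': 3, 'Wr': 2.0, 'Nv': 3, 'Wv': 1.8, 'Q': 0.63}
backup([edge], r_T=1.0, v_leaf=0.7)
print(edge['Q'])
```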
Distributed Search

• Main search tree: master CPU
• Policy & value networks ($p_\sigma(a'|s')$, $v_\theta(s')$): 176 GPUs
• Rollout policy networks ($p_\pi$, $r(s_T)$): 1,202 CPUs
Reference
• Mastering the game of Go with deep neural networks and tree search
– http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
Further Reading
• Monte Carlo Tree Search
– https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
• Neural Networks Backward Propagation
– http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation
• Convolutional Neural Networks
– http://cs231n.github.io/convolutional-networks/
• Policy Gradient Method: REINFORCE
– https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/node37.html