How AlphaGo Works
TRANSCRIPT
![Page 1: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/1.jpg)
Presenter: Shane (Seungwhan) Moon
PhD student, Language Technologies Institute, School of Computer Science
Carnegie Mellon University
3/2/2016
How it works
![Page 2: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/2.jpg)
AlphaGo vs European Champion (Fan Hui 2-Dan)
October 5 – 9, 2015
<Official match>
- Time limit: 1 hour
- AlphaGo wins (5:0)
* Dan: rank
![Page 3: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/3.jpg)
AlphaGo vs World Champion (Lee Sedol 9-Dan)
March 9 – 15, 2016
<Official match>
- Time limit: 2 hours
Venue: Seoul, Four Seasons Hotel
Image Source: Josun Times Jan 28th 2015
![Page 4: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/4.jpg)
Lee Sedol
Photo source: Maeil Economics 2013/04
![Page 5: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/5.jpg)
Computer Go AI?
![Page 6: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/6.jpg)
Computer Go AI – Definition
s (state)
d = 1
s =
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
(e.g. we can represent the board in a matrix-like form)
* The actual model uses additional features beyond raw board positions as well
![Page 7: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/7.jpg)
Computer Go AI – Definition
s (state)
d = 1 d = 2
a (action)
Given s, pick the best a
Computer Go Artificial Intelligence: s → a → s'
![Page 8: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/8.jpg)
Computer Go AI – An Implementation Idea?
(Tree diagram: candidate moves expanded to depth d = 1, d = 2, …)
How about simulating all possible board positions?
![Page 9: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/9.jpg)
Computer Go AI – An Implementation Idea?
(Tree diagram expanded to depth d = 1, d = 2, d = 3, …)
![Page 10: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/10.jpg)
Computer Go AI – An Implementation Idea?
(Tree diagram expanded from d = 1 to d = maxD)
Process the simulation until the game ends, then report win / loss results
![Page 11: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/11.jpg)
Computer Go AI – An Implementation Idea?
(Tree diagram expanded from d = 1 to d = maxD)
Process the simulation until the game ends, then report win / loss results
e.g., the game is won 13 times if the next stone is placed here (37,839 and 431,320 times for other candidate moves)
Choose the “next action / stone” that has the most win counts in the full-scale simulation
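The brute-force idea above can be sketched as follows (the random-playout stand-in and function names are placeholders, not a real Go simulator):

```python
import random

def playout_wins(state, action, rng):
    """Placeholder for one full playout; returns True on a win.
    A real simulator would play moves until the game actually ends."""
    return rng.random() < 0.5

def most_winning_action(state, actions, n_playouts=1000, seed=0):
    """Count wins per candidate move, then pick the move with the most wins."""
    rng = random.Random(seed)
    wins = {a: sum(playout_wins(state, a, rng) for _ in range(n_playouts))
            for a in actions}
    return max(wins, key=wins.get), wins
```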
![Page 12: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/12.jpg)
This is NOT possible: the number of possible board configurations is said to exceed the number of atoms in the universe
![Page 13: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/13.jpg)
Key: To Reduce Search Space
![Page 14: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/14.jpg)
Reducing Search Space
1. Reducing “action candidates” (Breadth Reduction)
(Tree diagram from d = 1 to d = maxD; leaf nodes labeled Win? / Loss?)
IF there is a model that can tell you that these moves are not common / probable (e.g., according to experts) …
![Page 15: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/15.jpg)
Reducing Search Space
1. Reducing “action candidates” (Breadth Reduction)
(Tree diagram with the improbable branches pruned; leaf nodes labeled Win? / Loss?)
Remove these from search candidates in advance (breadth reduction)
![Page 16: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/16.jpg)
Reducing Search Space
2. Position evaluation ahead of time (Depth Reduction)
(Tree diagram expanded toward d = maxD; leaf nodes labeled Win? / Loss?)
Instead of simulating until the maximum depth…
![Page 17: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/17.jpg)
Reducing Search Space
2. Position evaluation ahead of time (Depth Reduction)
(Tree diagram cut off at d = 3, with evaluated positions V = 1, V = 2, V = 10)
IF there is a function that can measure V(s): “board evaluation of state s” …
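Given such a V(s), depth reduction amounts to cutting the search off early. A minimal sketch (the toy tree, the negamax-style 1 − V convention, and the helper names are assumptions for illustration):

```python
def evaluate(state, depth, value_fn, expand):
    """Depth-limited lookahead: at the cutoff (or a terminal position),
    fall back on the board evaluation V(s) instead of simulating to the
    end of the game. Values are in [0, 1] from the side-to-move's view,
    so the opponent's value is 1 - V (a negamax-style convention)."""
    moves = expand(state)
    if depth == 0 or not moves:
        return value_fn(state)
    # Pick the child position that is worst for the opponent.
    return max(1.0 - evaluate(s2, depth - 1, value_fn, expand) for s2 in moves)
```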
![Page 18: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/18.jpg)
Reducing Search Space
1. Reducing “action candidates” (Breadth Reduction)
2. Position evaluation ahead of time (Depth Reduction)
![Page 19: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/19.jpg)
1. Reducing “action candidates”
Learning: P(next action | current state) = P(a | s)
![Page 20: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/20.jpg)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Current State
Prediction Model
Next State
s1 → s2
s2 → s3
s3 → s4
Data: online Go experts (5–9 dan), 160K games, 30M board positions
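One way to turn such (current state, next state) pairs into supervised (state, action) examples is to diff consecutive boards. A sketch under the 0 = empty matrix encoding (not the actual data pipeline; the helper name is hypothetical):

```python
def diff_to_action(prev_board, next_board):
    """Recover the move between two consecutive positions: the one
    intersection that changed from empty (0) to a stone (non-zero).
    (Sketch only: it ignores captures, which also change the board.)"""
    for r, (row_p, row_n) in enumerate(zip(prev_board, next_board)):
        for c, (p, n) in enumerate(zip(row_p, row_n)):
            if p == 0 and n != 0:
                return (r, c)
    return None
```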
![Page 21: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/21.jpg)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Current Board → Prediction Model → Next Board
![Page 22: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/22.jpg)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Current Board → Prediction Model → Next Action
There are 19 × 19 = 361 possible actions (with different probabilities)
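A policy model typically produces these 361 probabilities with a softmax over one score per intersection; a minimal sketch (the uniform logits are used only for illustration):

```python
import math

def softmax(logits):
    """Normalize raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One (hypothetical) score per board intersection: 19 * 19 = 361 actions.
probs = softmax([0.0] * 361)
```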
![Page 23: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/23.jpg)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Prediction Model
s (current board):
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 -1 0 0 1 -1 1 0 0
0 1 0 0 1 -1 0 0 0
0 0 0 0 -1 0 0 0 0
0 0 0 0 0 0 0 0 0
0 -1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
f: s → a
a (next action):
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
Current Board → Next Action
![Page 24: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/24.jpg)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Prediction Model
s (current board):
0 0 0 0 0 0 0 0 0
0 0 0 0 0 1 0 0 0
0 -1 0 0 1 -1 1 0 0
0 1 0 0 1 -1 0 0 0
0 0 0 0 -1 0 0 0 0
0 0 0 0 0 0 0 0 0
0 -1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
g: s → p(a|s)
p(a|s):
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0.2 0.1 0 0
0 0 0 0 0 0.4 0.2 0 0
0 0 0 0 0 0.1 0 0 0
0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0
a = argmax_a p(a|s)
Current Board → Next Action
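Picking the argmax action from a p(a|s) grid can be sketched directly, using the slide's example probabilities (0.4 is the largest; the grid shape mirrors the cropped slide matrix, not a full 19×19 board):

```python
def argmax_action(prob_grid):
    """a = argmax_a p(a|s): the intersection with the highest probability."""
    best_p, best_rc = -1.0, None
    for r, row in enumerate(prob_grid):
        for c, p in enumerate(row):
            if p > best_p:
                best_p, best_rc = p, (r, c)
    return best_rc

# The slide's example p(a|s): probability mass around one board region.
grid = [[0.0] * 9 for _ in range(8)]
grid[3][5], grid[3][6] = 0.2, 0.1
grid[4][5], grid[4][6] = 0.4, 0.2
grid[5][5] = 0.1
```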
![Page 25: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/25.jpg)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Prediction Model
s (current board)  →  g: s → p(a|s)  →  a = argmax_a p(a|s)
Current Board → Next Action
![Page 26: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/26.jpg)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Deep Learning (13-layer CNN)
s (current board)  →  g: s → p(a|s)  →  a = argmax_a p(a|s)
Current Board → Next Action
![Page 27: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/27.jpg)
Convolutional Neural Network (CNN)
A CNN is a powerful model for image-recognition tasks; it abstracts the input image through convolution layers
![Page 28: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/28.jpg)
Convolutional Neural Network (CNN)
AlphaGo uses a CNN with a similar architecture to evaluate board positions; the model learns “some” spatial invariance
![Page 29: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/29.jpg)
Go: abstraction is the key to win
CNN: abstraction is its forte
![Page 30: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/30.jpg)
1. Reducing “action candidates”
(1) Imitating expert moves (supervised learning)
Expert Moves Imitator Model (w/ CNN)
Training: Current Board → Next Action
![Page 31: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/31.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Expert Moves Imitator Model (w/ CNN)  vs  Expert Moves Imitator Model (w/ CNN)
Improving by playing against itself
![Page 32: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/32.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Expert Moves Imitator Model (w/ CNN)  vs  Expert Moves Imitator Model (w/ CNN)
Return: board positions, win/lose info
![Page 33: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/33.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Expert Moves Imitator Model (w/ CNN)
Training: board position → win/loss
Loss: z = -1
![Page 34: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/34.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Expert Moves Imitator Model (w/ CNN)
Training: board position → win/loss
Win: z = +1
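The z = ±1 outcome acts as the sign of a policy-gradient update: parameters are nudged toward moves played in won games and away from moves played in lost games. A one-line sketch (the function name, plain-list parameterization, and learning rate are assumptions for illustration):

```python
def reinforce_update(theta, grad_logp, z, lr=0.01):
    """Policy-gradient sketch: move parameters along the gradient of
    log p(a|s), scaled by the game outcome z (+1 for a win, -1 for a loss)."""
    return [t + lr * z * g for t, g in zip(theta, grad_logp)]
```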
![Page 35: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/35.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Updated Model ver 1.1  vs  Updated Model ver 1.3
Return: board positions, win/lose info
It uses the same topology as the expert-moves imitator model, just with updated parameters
Older models vs. newer models
![Page 36: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/36.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Updated Model ver 1.3  vs  Updated Model ver 1.7
Return: board positions, win/lose info
![Page 37: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/37.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Updated Model ver 1.5  vs  Updated Model ver 2.0
Return: board positions, win/lose info
![Page 38: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/38.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Updated Model ver 3204.1  vs  Updated Model ver 46235.2
Return: board positions, win/lose info
![Page 39: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/39.jpg)
1. Reducing “action candidates”
(2) Improving through self-plays (reinforcement learning)
Updated Model ver 1,000,000  vs  Expert Moves Imitator Model
The final model wins 80% of the time when playing against the first model
![Page 40: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/40.jpg)
2. Board Evaluation
![Page 41: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/41.jpg)
2. Board Evaluation
Updated Model ver 1,000,000
Training: board position → win / loss
Value Prediction Model (regression): outputs win probability (0–1)
Adds a regression layer to the model
Predicts values between 0 and 1
Close to 1: a good board position
Close to 0: a bad board position
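A regression layer that squashes a score into (0, 1) can be sketched with a sigmoid (the linear features and weights here are hypothetical, not the network's actual final layer):

```python
import math

def value_head(features, weights, bias=0.0):
    """Regression-layer sketch: squash a linear score through a sigmoid so
    the output lands in (0, 1) -- near 1 for good positions, near 0 for bad."""
    score = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))
```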
![Page 42: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/42.jpg)
Reducing Search Space
1. Reducing “action candidates” (Breadth Reduction) → Policy Network
2. Board Evaluation (Depth Reduction) → Value Network
![Page 43: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/43.jpg)
Looking ahead (w/ Monte Carlo Tree Search)
Action Candidates Reduction (Policy Network)
Board Evaluation (Value Network)
(Rollout): a faster way of estimating p(a|s) → uses shallow networks (3 ms → 2 µs)
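At a leaf of the search tree, AlphaGo combines the value network's estimate with the fast-rollout outcome; the paper mixes them with a weight λ (λ = 0.5 worked best). A minimal sketch of that mixing formula:

```python
def leaf_value(v_net, z_rollout, lam=0.5):
    """Mixed leaf evaluation: V = (1 - lambda) * v(s) + lambda * z,
    where v(s) is the value network's estimate and z the rollout outcome."""
    return (1.0 - lam) * v_net + lam * z_rollout
```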
![Page 44: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/44.jpg)
Results (Elo rating system)
Performance with different combinations of AlphaGo components
![Page 45: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/45.jpg)
Takeaways
Use networks trained for one task (with different loss objectives) for several other tasks
![Page 46: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/46.jpg)
Lee Sedol 9-dan vs AlphaGo
![Page 47: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/47.jpg)
Lee Sedol 9-dan vs AlphaGo: Energy Consumption
Lee Sedol AlphaGo
- Recommended calories for a man per day: ~2,500 kcal
- Assumption: Lee consumes his entire daily calories during this one game
  2,500 kcal × 4,184 J/kcal ≈ 10M [J]
- Assumption: CPU ~100 W, GPU ~300 W; 1,202 CPUs, 176 GPUs
  170,000 J/sec × 5 hr × 3,600 sec/hr ≈ 3,000M [J]
A very, very rough calculation ;)
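Re-deriving the slide's arithmetic under the same stated assumptions:

```python
KCAL_TO_J = 4_184                           # joules per kilocalorie

lee_joules = 2_500 * KCAL_TO_J              # 10,460,000 J, i.e. ~10M J
alphago_watts = 1_202 * 100 + 176 * 300     # 173,000 W, i.e. ~170,000 J/sec
alphago_joules = alphago_watts * 5 * 3_600  # 5-hour game: ~3,000M J
```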
![Page 48: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/48.jpg)
AlphaGo is estimated to be around ~5-dan
(= the multiple-machine version that played the European champion)
![Page 49: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/49.jpg)
Taking CPU / GPU resources to virtually infinity?
But Google has promised not to use more CPUs/GPUs for the game with Lee than they used against Fan Hui
No one knows how it will converge
![Page 50: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/50.jpg)
AlphaGo learns millions of Go games every day
AlphaGo will presumably converge to some point eventually.
However, in the Nature paper they don’t report how AlphaGo’s performance improves as a function of the number of self-play games.
![Page 51: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/51.jpg)
What if AlphaGo learns Lee’s game strategy?
Google said they won’t use Lee’s game plays as AlphaGo’s training data
Even if it did, it would not be easy to change a model trained on millions of data points using just a few games with Lee
(prone to over-fitting, etc.)
![Page 52: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/52.jpg)
AlphaGo’s Weakness?
![Page 53: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/53.jpg)
AlphaGo – How It Works
Presenter: Shane (Seungwhan) Moon
PhD student, Language Technologies Institute, School of Computer Science
Carnegie Mellon University
[email protected]
3/2/2016
![Page 54: How AlphaGo Works](https://reader031.vdocuments.net/reader031/viewer/2022021503/586f7a111a28ab10258b71a5/html5/thumbnails/54.jpg)
Reference
• Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.