TTIC 31230, Fundamentals of Deep Learning
David McAllester, Winter 2020
What About alpha-beta?
Grand Unification
AlphaZero unifies chess and go algorithms.
This unification of intuition (go) and calculation (chess) is surprising.
This unification grew out of go algorithms.
But are the algorithmic insights of chess algorithms really irrelevant?
Chess Background
The first min-max computer chess program was described by Claude Shannon in 1950.

Alpha-beta pruning was invented independently by various people, including John McCarthy in the late 1950s.

Alpha-beta was the cornerstone of all chess algorithms until AlphaZero.
Alpha-Beta Pruning
def MaxValue(s, alpha, beta):
    # Returns a value in [alpha, beta] that agrees with the min-max
    # value of s whenever that value lies in [alpha, beta].
    value = alpha
    for s2 in s.children():
        value = max(value, MinValue(s2, value, beta))
        if value >= beta: break  # beta cutoff: opponent avoids this node
    return value

def MinValue(s, alpha, beta):
    value = beta
    for s2 in s.children():
        value = min(value, MaxValue(s2, alpha, value))
        if value <= alpha: break  # alpha cutoff: root player avoids this node
    return value
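A self-contained sketch of the pruning above on a tiny hand-built game tree; the `Node` class and the leaf-value handling are illustrative scaffolding, not part of the slides.

```python
# Minimal runnable variant of the alpha-beta code above. Leaves carry
# values directly; internal nodes alternate between max and min.
class Node:
    def __init__(self, children=None, value=None):
        self._children = children or []
        self._value = value
    def children(self):
        return self._children

def MaxValue(s, alpha, beta):
    if not s.children():
        return s._value          # leaf: return its static value
    value = alpha
    for s2 in s.children():
        value = max(value, MinValue(s2, value, beta))
        if value >= beta:
            break                # beta cutoff
    return value

def MinValue(s, alpha, beta):
    if not s.children():
        return s._value
    value = beta
    for s2 in s.children():
        value = min(value, MaxValue(s2, alpha, value))
        if value <= alpha:
            break                # alpha cutoff
    return value

# Root is a max node over two min nodes: min(3,12)=3 and min(2,8)=2.
# The second min node is cut off after seeing the leaf value 2.
leaf = lambda v: Node(value=v)
tree = Node([Node([leaf(3), leaf(12)]),
             Node([leaf(2), leaf(8)])])
print(MaxValue(tree, float("-inf"), float("inf")))  # -> 3
```

Note the cutoff in the second min node: once its value drops to 2, which is below the 3 already guaranteed at the root, the leaf 8 is never examined.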
Strategies
An optimal alpha-beta tree is the union of a root-player strategy and an opponent strategy.

A strategy for the root player is a selection of a single action for each root-player move and a response for each possible action of the opponent.

A strategy for the opponent is a selection of a single action for each opponent move and a response for each possible action of the root player.
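The definition above can be sketched in code: a root-player strategy commits to one action at each max node while covering every opponent reply. The `Node` class and the helpers are hypothetical scaffolding, not from the slides.

```python
# Sketch: extract a root-player strategy (one chosen child per
# reachable max node) from a game tree via plain minimax.
class Node:
    def __init__(self, children=None, value=0):
        self.children = children or []
        self.value = value

def minimax(node, maximizing):
    if not node.children:
        return node.value
    vals = [minimax(c, not maximizing) for c in node.children]
    return max(vals) if maximizing else min(vals)

def root_player_strategy(node, maximizing, strategy):
    if not node.children:
        return
    if maximizing:
        # a single action for each root-player move ...
        best = max(node.children, key=lambda c: minimax(c, False))
        strategy[id(node)] = best
        root_player_strategy(best, False, strategy)
    else:
        # ... and a response to each possible opponent action
        for c in node.children:
            root_player_strategy(c, True, strategy)
```

An opponent strategy is the mirror image, committing at min nodes and covering all max-node actions; the optimal alpha-beta tree is the union of the two.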
Proposal
Simulations should be divided into root-player strategy simulations and opponent strategy simulations.

A root-player strategy simulation is optimistic for the root player and pessimistic for the opponent.

An opponent strategy simulation is optimistic for the opponent and pessimistic for the root player.
Proposal
U(s, a) = λu πΦ(s, a)                       if N(s, a) = 0
U(s, a) = μ̂(s, a) + λu πΦ(s, a)/N(s, a)     otherwise        (1)
λu should be divided into λ+u and λ−u with λ+u > λ−u.

Simulations should be divided into two types: optimistic and pessimistic.

In optimistic simulations we use λ+u for root-player moves and λ−u for opponent moves.

In pessimistic simulations we use λ−u for root-player moves and λ+u for opponent moves.
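A minimal sketch of the proposal, following the notation of equation (1). The tables `pi`, `mu_hat`, and `N`, and the concrete λ values, are illustrative assumptions, not AlphaZero's actual data structures.

```python
# U(s,a) with lambda_u split into an optimistic and a pessimistic
# exploration weight; the split must satisfy LAMBDA_PLUS > LAMBDA_MINUS.
LAMBDA_PLUS, LAMBDA_MINUS = 2.0, 0.5

def U(s, a, pi, mu_hat, N, lam):
    # Equation (1): unvisited actions are scored by the prior alone;
    # visited actions add an empirical value estimate mu_hat.
    if N[(s, a)] == 0:
        return lam * pi[(s, a)]
    return mu_hat[(s, a)] + lam * pi[(s, a)] / N[(s, a)]

def select_action(s, actions, pi, mu_hat, N, optimistic, root_player_move):
    # Optimistic simulations use lambda+ on root-player moves and
    # lambda- on opponent moves; pessimistic simulations swap them.
    lam = LAMBDA_PLUS if (optimistic == root_player_move) else LAMBDA_MINUS
    return max(actions, key=lambda a: U(s, a, pi, mu_hat, N, lam))
```

With the larger λ+u, an optimistic simulation lets the prior pull the root player toward promising but unproven actions, while the opponent's λ−u keeps the opponent close to its empirical best reply, and vice versa in pessimistic simulations.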
AlphaStar
Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature, Oct. 2019, Vinyals et al.
StarCraft:
• Players control hundreds of units.
• Individual actions are selected from roughly 10^26 possibilities (an action is a kind of procedure call with arguments).

• Cyclic non-transitive strategies (rock-paper-scissors).
• Imperfect information — the state is not fully observable.
The Paper is Vague
It basically says the following ideas are used:

• a policy gradient algorithm

• auto-regressive policies

• self-attention over the observation history

• LSTMs

• pointer networks

• scatter connections

• replay buffers

• asynchronous advantage actor-critic algorithms

• TD(λ) (gradients on value function Bellman error)

• clipped importance sampling (V-trace)

• a new undefined method they call UPGO that “moves policies toward trajectories with better than average reward”

• a value function that can see the opponent’s observations (training only)

• a “z statistic” stating a high-level strategy

• supervised learning from human play

• a “league” of players (next slide)
The League
The league has three classes of agents: main (M), main exploiters (E), and league exploiters (L). M and L play against everybody. E plays only against M.
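The matchmaking rule just described can be sketched as a simple sampling routine; the `league` dictionary and `sample_opponent` function are hypothetical scaffolding, not AlphaStar's actual implementation.

```python
# Sketch of the league's matchmaking rule: M and L agents can face any
# agent in the league, while E agents only play against main agents.
import random

def sample_opponent(agent_class, league):
    if agent_class in ("M", "L"):
        pool = league["M"] + league["E"] + league["L"]
    elif agent_class == "E":
        pool = league["M"]
    else:
        raise ValueError(f"unknown agent class: {agent_class}")
    return random.choice(pool)
```

Restricting E to main agents makes the exploiters specialize in finding weaknesses of the current main agents, while M and L keep training against the full diversity of the league.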
A Rube Goldberg Contraption?
Video
https://www.youtube.com/watch?v=UuhECwm31dM
END