mastering the game of go without human knowledge€¦ · ladders. alphazero. alphago zero’s gift....

38
Mastering the Game of Go Without Human Knowledge Lead: Liam Hinzman Facilitators: Tahseen Shabab and Susan Cheng

Upload: others

Post on 13-Jun-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Mastering the Game of Go Without Human Knowledge

Lead: Liam HinzmanFacilitators: Tahseen Shabab and Susan Cheng

Page 2: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

AlphaGo Zero

Page 3: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Overview● Brief History of AI in Games● What is Go and Why Should You Care?● How AlphaGo Zero Works● Results● Discussion

Page 4: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Minimax

Page 5: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

HeuristicsReduces search depth

Page 6: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Deep Blue

● 126 million positions per second● Hand-designed Heuristics

Page 7: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

The Game of Go

Page 8: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different
Page 9: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different
Page 10: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Go is Incredibly Complex

Go is Hard for Computers

Page 11: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

How AlphaGo Zero Works

Policy IterationMonte-Carlo Tree Search Residual Network

Page 12: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Monte-Carlo Tree Search (MCTS)

Page 13: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Monte-Carlo Tree Search (MCTS)

Page 14: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

MCTS: Advantages● Aheuristic● Online-search● Works well on large trees

Page 15: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

MCTS: Disadvantages● Many simulation are required● No generalization between similar states● Performance is dependent on “rollout” policy

Page 16: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

MCTS in AlphaGo Zero

Page 17: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Upper Confidence Bound for Trees (UCT)

Page 18: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Upper Confidence Bound for Trees (UCT)

Page 19: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Upper Confidence Bound for Trees (UCT)

s State

a Action

Q(s, a) Expected Reward

P(s, a) Policy

N(s, a) # of state visits

cpuct Hyperparameter

Page 20: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

AlphaGo Zero’s Network Architecture

Page 21: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Residual Layer

Page 22: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Dual Heads

Page 23: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Training

Evaluator

Self-play Worker

Training Worker

𝜋’ > 𝜋

Page 24: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

How AlphaGo Zero Chooses a Move

𝜋 ~ N1/𝜏1600 Simulations

Page 25: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Self-Play Workers

Page 26: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Training Worker

Page 27: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Evaluator

400 Games 55% Win Rate

Page 28: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

MCTS in AlphaGo Zero

Page 29: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

5 Minute Break

Page 30: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

AlphaGo Zero vs AlphaGoEntirely self-play

Input is game board

Single network

No rollouts

Supervised learning + self-play

Input is hand-crafted features

Two networks

Rollouts were used

Page 31: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Results

Page 32: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Learning Stages

Page 33: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Ladders

Page 34: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

AlphaZero

Page 35: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

AlphaGo Zero’s Gift

Page 36: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

Discussion

Page 37: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

DiscussionHow can the AlphaGo Zero algorithm be extended to different games?

How can the sample efficiency of AlphaGo Zero be improved?

A very stable training environment is need for the algorithm. Can this be alleviated to let AlphaZero applied to real-world problems?

Page 38: Mastering the Game of Go Without Human Knowledge€¦ · Ladders. AlphaZero. AlphaGo Zero’s Gift. Discussion. Discussion How can the AlphaGo Zero algorithm be extended to different

ResourcesMastering the Game of Go without Human Knowledge

David Silver 2017 NIPS Talk

ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero

David Silver’s PhD Thesis: Reinforcement Learning and Simulation-Based Search in Computer Go

A Brief History of Game AI Up To AlphaGo - Andrey Kurenkov

AlphaGo Zero Demystified - Dylan Djian