
Page 1: AlphaGo – Artificial Intelligence

AlphaGo – Artificial Intelligence

CIS 601 - Graduate Seminar Presentation

Seungyoon Jang

CSU ID: 2725495

Page 2: AlphaGo – Artificial Intelligence

Outline

Introduction

Monte Carlo Tree Search (MCTS)

Policy Network & Value Network

Supervised Learning & Reinforcement Learning

Evaluation of AlphaGo

References

Page 3: AlphaGo – Artificial Intelligence

Introduction

What is "Go"?

"Go" is an abstract strategy board game for two players in which the aim is to surround more territory than the opponent using black and white stones.

Page 4: AlphaGo – Artificial Intelligence

Introduction

Simple rules, an enormous number of cases

1) One player uses the white stones and the other the black stones

2) Players take turns placing a stone on the vacant intersections (called "points") of a board with a 19x19 grid of lines

3) Compared to chess, the search space is vastly larger: the number of legal board positions in Go has a lower bound of about 2 × 10^170

Page 5: AlphaGo – Artificial Intelligence

Introduction

Problem

1) All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game from every board position s under perfect play

2) The game may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves

* b: the game's breadth (legal moves per position) / d: its depth (game length)

3) In large games, especially Go (b ≈ 250, d ≈ 150), exhaustive search is infeasible (a quick back-of-the-envelope sketch follows below)
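To put these numbers in perspective, here is a tiny back-of-the-envelope Python sketch. The breadth/depth figures for chess (b ≈ 35, d ≈ 80) are common rough estimates, not values from the slides:

```python
# Rough size of an exhaustive game tree: about b^d move sequences,
# where b is the breadth (legal moves per position) and d is the depth
# (typical game length).

def game_tree_size(breadth: int, depth: int) -> int:
    """Approximate number of move sequences explored by exhaustive search."""
    return breadth ** depth

go = game_tree_size(250, 150)    # Go:    b ~ 250, d ~ 150 (from the slide)
chess = game_tree_size(35, 80)   # chess: b ~ 35,  d ~ 80  (rough estimate)

# Python integers have arbitrary precision, so we can simply count digits.
print(f"Go    ~ 10^{len(str(go)) - 1} move sequences")
print(f"Chess ~ 10^{len(str(chess)) - 1} move sequences")
```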

Page 6: AlphaGo – Artificial Intelligence

Monte Carlo Tree Search (MCTS)

What's Monte Carlo Tree Search?

1) A heuristic search algorithm for some kinds of decision processes, most notably those employed in game play

2) The focus of MCTS is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space

3) As more simulations are executed, the search tree grows larger and the relevant values become more accurate

4) Earlier MCTS-based Go programs have been limited to shallow policies or value functions based on a linear combination of input features (a minimal generic sketch follows below)
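To make the selection–expansion–simulation–backpropagation loop concrete, here is a minimal, generic UCT-style MCTS sketch in Python. It is not AlphaGo's search: the game interface (`legal_moves`, `play`, `is_terminal`, `winner`, `player_to_move`) is a hypothetical placeholder, and rollouts are purely random.

```python
import math
import random

class Node:
    """One node of the search tree: a game state plus visit statistics."""
    def __init__(self, state, parent=None, move=None):
        self.state = state            # game position at this node
        self.parent = parent
        self.move = move              # move that led here from the parent
        self.children = []
        self.untried = state.legal_moves()
        self.visits = 0
        self.wins = 0.0               # wins for the player who made self.move

    def ucb_child(self, c=1.4):
        # Selection: pick the child maximizing the UCB1 score.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                   + c * math.sqrt(math.log(self.visits) / ch.visits))

def mcts(root_state, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1) Selection: descend while the node is fully expanded.
        while not node.untried and node.children:
            node = node.ucb_child()
        # 2) Expansion: add one child for an untried move.
        if node.untried:
            move = node.untried.pop()
            node = Node(node.state.play(move), parent=node, move=move)
            node.parent.children.append(node)
        # 3) Simulation: random rollout to a terminal state.
        state = node.state
        while not state.is_terminal():
            state = state.play(random.choice(state.legal_moves()))
        # 4) Backpropagation: update statistics up to the root.
        winner = state.winner()
        while node is not None:
            node.visits += 1
            if node.parent is not None and winner == node.parent.state.player_to_move():
                node.wins += 1
            node = node.parent
    # Final move choice: the most visited child of the root.
    return max(root.children, key=lambda ch: ch.visits).move
```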

Page 7: AlphaGo – Artificial Intelligence

Monte Carlo Tree Search (MCTS)

What’s Monte Carlo Tree Search?

Page 8: AlphaGo – Artificial Intelligence

Monte Carlo Tree Search (MCTS)

Employing Deep Convolutional Neural Networks

1) Have achieved unprecedented performance in visual domains (e.g. image classification, face recognition)

2) Reduce the effective depth and breadth of the search tree

3) Evaluate positions using a "value network"

4) Sample actions using a "policy network"

Page 9: AlphaGo – Artificial Intelligence

Policy Network & Value Network

Neural Network Training Pipeline Architecture

Page 10: AlphaGo – Artificial Intelligence

Policy Network & Value Network

Supervised Learning (SL) Policy Network

1) Alternates between convolutional layers with weights σ and rectifier nonlinearities

2) A final softmax layer outputs a probability distribution over all legal moves a

3) The input s to the policy network is a simple representation of the board state

4) Trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s (a minimal training sketch follows below)
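The following is a minimal sketch of such a network and one training step, assuming PyTorch. The layer count, filter width, and single input feature plane are illustrative placeholders rather than AlphaGo's actual architecture, and the random batch stands in for the human expert (s, a) pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD = 19  # 19x19 Go board

class PolicyNet(nn.Module):
    """Toy SL policy network: conv layers + ReLU, softmax over board points."""
    def __init__(self, in_planes=1, filters=32, layers=4):
        super().__init__()
        convs = [nn.Conv2d(in_planes, filters, 3, padding=1)]
        convs += [nn.Conv2d(filters, filters, 3, padding=1) for _ in range(layers - 1)]
        self.convs = nn.ModuleList(convs)
        self.head = nn.Conv2d(filters, 1, 1)     # 1x1 conv -> one logit per point

    def forward(self, s):                        # s: (batch, in_planes, 19, 19)
        x = s
        for conv in self.convs:
            x = F.relu(conv(x))
        return self.head(x).flatten(1)           # (batch, 361) logits; softmax in the loss

# One gradient step maximizing the log-likelihood of the human move a in state s
# (equivalently, minimizing cross-entropy on (s, a) pairs).
net = PolicyNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)

s = torch.randn(16, 1, BOARD, BOARD)             # dummy batch of board encodings
a = torch.randint(0, BOARD * BOARD, (16,))       # dummy human moves (point indices)

loss = F.cross_entropy(net(s), a)                # -log p_sigma(a | s), averaged
opt.zero_grad()
loss.backward()
opt.step()
```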

Page 11: AlphaGo – Artificial Intelligence

Policy Network & Value Network

Rollout Policy

1) Trained simultaneously with the SL policy network, to compensate for the SL network's slow evaluation speed during search

2) Uses a linear softmax of small pattern features with weights π (a minimal sketch follows below)
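A small numpy sketch of a linear-softmax rollout policy of this kind; the hand-crafted pattern features are replaced by random placeholders here.

```python
import numpy as np

def rollout_policy(features, pi):
    """Linear softmax rollout policy (sketch).

    features: (num_moves, num_pattern_features) matrix, one row of pattern
              features per candidate move (the features themselves are
              hand-crafted and not shown here).
    pi:       (num_pattern_features,) weight vector.
    Returns a probability distribution over the candidate moves.
    """
    scores = features @ pi                      # one linear score per move
    scores -= scores.max()                      # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Example: 5 candidate moves, 8 pattern features each.
rng = np.random.default_rng(0)
features = rng.integers(0, 2, size=(5, 8)).astype(float)
pi = rng.normal(size=8)
print(rollout_policy(features, pi))             # fast: a single matrix-vector product
```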

Page 12: AlphaGo – Artificial Intelligence

Policy Network & Value Network

Reinforcement Learning of Policy Network

1) Identical in structure to the SL policy network

2) Its weights ρ are initialized to the same values, ρ = σ

3) Games are played between the current policy network p_ρ and a randomly selected previous iteration of the policy network

4) Weights are then updated at each time step t by stochastic gradient ascent in the direction that maximizes the expected outcome (a REINFORCE-style sketch follows below)

5) Here z_t is the terminal reward at the end of the game from the perspective of the current player at time step t: +1 for winning and -1 for losing
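A self-contained REINFORCE-style sketch of this update, assuming PyTorch; the tiny convolutional policy and the fake 10-move "game" below are placeholders, not the actual RL policy network or self-play pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in policy over a 19x19 board (the real RL policy network is
# identical in structure to the SL policy network).
policy = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 1), nn.Flatten(),           # -> (batch, 361) move logits
)
opt = torch.optim.SGD(policy.parameters(), lr=0.01)

def reinforce_update(states, actions, z):
    """One policy-gradient step over a finished self-play game.

    states:  (T, 1, 19, 19) positions faced by the current player
    actions: (T,) moves it chose (point indices)
    z:       +1.0 if the current player won, -1.0 if it lost
    Ascends the gradient of z * log p_rho(a_t | s_t): moves from won games
    are reinforced, moves from lost games are discouraged.
    """
    log_probs = F.log_softmax(policy(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = -(z * chosen).mean()                  # minimize the negated objective
    opt.zero_grad()
    loss.backward()
    opt.step()

# Dummy usage: a fake 10-move game that the current player won.
states = torch.randn(10, 1, 19, 19)
actions = torch.randint(0, 19 * 19, (10,))
reinforce_update(states, actions, z=1.0)
```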

Page 13: AlphaGo – Artificial Intelligence

Policy Network & Value Network

Reinforcement Learning of Value Network

1) Estimates a value function v^p(s) that predicts the outcome from position s of games played by using policy p for both players

2) Similar architecture to the policy network, but outputs a single prediction instead of a probability distribution

3) Trained to minimize the mean squared error (MSE) between the predicted value v_θ(s) and the corresponding outcome z (a minimal training sketch follows below)
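A matching sketch for the value network's MSE objective, again assuming PyTorch; the conv trunk, scalar tanh head, and random (s, z) batch are placeholders for the real architecture and self-play data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tiny stand-in value network: conv trunk + single scalar prediction in [-1, 1].
value_net = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 1), nn.Flatten(),
    nn.Linear(19 * 19, 1), nn.Tanh(),            # v_theta(s): one number per position
)
opt = torch.optim.SGD(value_net.parameters(), lr=0.01)

# Dummy batch: board encodings s and game outcomes z in {-1, +1}.
s = torch.randn(16, 1, 19, 19)
z = torch.randint(0, 2, (16, 1)).float() * 2 - 1

loss = F.mse_loss(value_net(s), z)               # MSE between v_theta(s) and z
opt.zero_grad()
loss.backward()
opt.step()
```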

Page 14: AlphaGo – Artificial Intelligence

Policy Network & Value Network

Searching with policy and value networks

1) At each time step t of each simulation, an action a_t is selected from state s_t:

a_t = argmax_a ( Q(s_t, a) + u(s_t, a) )

2) Each leaf node s_L is evaluated in two very different ways:

- by the value network, v_θ(s_L)

- by the outcome z_L of a random rollout played out until terminal step T using the fast rollout policy p_π

These evaluations are combined, using a mixing parameter λ, into a leaf evaluation V(s_L):

V(s_L) = (1 - λ) v_θ(s_L) + λ z_L

3) Once the search completes, the algorithm chooses the most visited move from the root position (a schematic sketch of these formulas follows below)
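A schematic Python rendering of the two formulas above. The per-edge statistics (prior P from the policy network, visit count N, mean action value Q) and the PUCT-style form of the exploration bonus u(s, a), with `c_puct` as the exploration constant, are assumptions for illustration.

```python
import math

def select_action(edges, c_puct=5.0):
    """Pick a_t = argmax_a (Q(s_t, a) + u(s_t, a)).

    edges: dict mapping each legal action to {'Q': mean value,
           'N': visit count, 'P': policy-network prior}. The bonus u(s, a)
    is proportional to the prior and decays with repeated visits.
    """
    total_visits = sum(e['N'] for e in edges.values())
    def score(e):
        u = c_puct * e['P'] * math.sqrt(total_visits) / (1 + e['N'])
        return e['Q'] + u
    return max(edges, key=lambda a: score(edges[a]))

def leaf_value(v_theta, z_rollout, lam=0.5):
    """Mixed leaf evaluation V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    return (1 - lam) * v_theta + lam * z_rollout

# Toy usage: three candidate moves with made-up statistics.
edges = {
    'A': {'Q': 0.10, 'N': 12, 'P': 0.50},
    'B': {'Q': 0.30, 'N': 3,  'P': 0.30},
    'C': {'Q': 0.05, 'N': 0,  'P': 0.20},
}
print(select_action(edges))                      # move with the best Q + u score
print(leaf_value(v_theta=0.2, z_rollout=1.0))
```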

Page 15: AlphaGo – Artificial Intelligence

Monte Carlo Tree Search (MCTS)

One more quick review of MCTS structure

Page 16: AlphaGo – Artificial Intelligence

Evaluation of AlphaGo

* Elo rating: a method for calculating the relative skill levels of players in zero-sum games (a small sketch of the formula follows below)
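For reference, the standard Elo expected-score formula and rating update, sketched in Python; the K-factor of 32 is a conventional choice, not something taken from the AlphaGo evaluation.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A vs. B: 1 / (1 + 10^((R_B - R_A) / 400))."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> float:
    """Return A's new rating after a game (score_a: 1 win, 0.5 draw, 0 loss)."""
    return rating_a + k * (score_a - elo_expected(rating_a, rating_b))

# Example: a 2900-rated program beats a 2700-rated player.
print(round(elo_expected(2900, 2700), 3))        # ~0.76 expected score
print(round(elo_update(2900, 2700, 1.0), 1))     # small rating gain for the favorite
```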

Page 17: AlphaGo – Artificial Intelligence

References

Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.

https://en.wikipedia.org/wiki/Monte_Carlo_tree_search

https://en.wikipedia.org/wiki/Go_(game)

Page 18: AlphaGo – Artificial Intelligence