
Page 1: Reinforcement Learning and Artificial Neural Nets

Reinforcement Learning and Artificial Neural Nets

Pierre de Lacaze
Director, Machine Intelligence
Shareablee Inc.

Lisp NYC, September 20th, 2016

[email protected] [email protected]

Page 2: Reinforcement Learning and Artificial Neural Nets

Overview

• Reinforcement Learning (RL)

• Artificial Neural Nets (ANN)

• Deep Learning (DL)

• Deep Reinforcement Learning (DRL)

• Google Deepmind AlphaGO

Page 3: Reinforcement Learning and Artificial Neural Nets

Part 1

Reinforcement Learning

Most of the material is drawn from Tom Mitchell’s book Machine Learning (Chpt. 13)

Page 4: Reinforcement Learning and Artificial Neural Nets

Reinforcement Learning Scenario

• Similar to a Markov Decision Process (Bellman, 1957)
• However, in Reinforcement Learning the agent has no prior knowledge of the environment.

Page 5: Reinforcement Learning and Artificial Neural Nets

Reinforcement Learning Problem

• An agent that can observe and act upon an environment

• Learn a policy (a mapping from states to actions) that achieves particular goals.

• State transition and rewards provided by the environment

• Agent typically has no domain knowledge

• Differences from other ML approaches:
  – Delayed reward
  – Involves exploration
  – Partially observable states
  – Training instances are generated by the agent

Page 6: Reinforcement Learning and Artificial Neural Nets

The Reinforcement Learning Task

• Markov Decision Process
  – The agent has a set S of observable states and a set A of actions
  – At each step the agent observes the current state and selects an action
  – The environment responds with a next state and a reward:

    s_{t+1} = δ(s_t, a_t),   r_t = r(s_t, a_t)

• Learning Task

– Learn a policy π: S → A for selecting the next action

– A solution: at each state, choose the action with the largest discounted cumulative reward

  V^π(s_t) = r_t + γ r_{t+1} + γ² r_{t+2} + … ,  where 0 ≤ γ ≤ 1

– In fact, we would like to learn an optimal such policy:

  π*(s) = argmax_π V^π(s),  for all states s

Page 7: Reinforcement Learning and Artificial Neural Nets

Q-Learning Motivation

• We cannot learn π* directly because there is no training data of the form <s, a>

• Recall that V*(s) is the discounted cumulative reward obtained by following the optimal policy from state s

• Prefer s1 to s2 when V*(s1) > V*(s2)

• π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

• The problem: this requires perfect knowledge of the reward function r and the state transition function δ

Page 8: Reinforcement Learning and Artificial Neural Nets

The Q Function

• Q(s,a) = r(s, a) + γ V*(δ(s, a))

• π*(s) = argmax_a Q(s, a)

• By learning Q instead of V* we no longer need to have complete knowledge of r and δ.

• We will need to estimate training values for Q

Page 9: Reinforcement Learning and Artificial Neural Nets

Q Learning Algorithm

• Initialize the state/action rewards table Q to 0
• Start in the initial state s
• Repeat forever:
  – Select an action a and execute it
  – Observe the immediate reward r and the new state s’
  – Update the table entry for (s, a):

      Q(s, a) ← r + γ max_{a’} Q(s’, a’)

  – s ← s’
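The talk demos this from rl.clj; as a rough stand-in, here is a minimal Clojure sketch of the tabular update, assuming the Q-table layout used on the next slide ({state {action [next-state q-value]}}). The helper names and the initial-q-table binding in the comment are illustrative assumptions, not the talk's actual code.

(def gamma 0.9)

(defn max-q
  "Largest estimated Q value over the actions available in state s (0.0 if none)."
  [q-table s]
  (if-let [entries (vals (get q-table s))]
    (apply max (map second entries))
    0.0))

(defn q-update
  "One deterministic Q-learning backup: Q(s, a) <- r + gamma * max_a' Q(s', a')."
  [q-table s a s' r]
  (assoc-in q-table [s a] [s' (+ r (* gamma (max-q q-table s')))]))

;; e.g. with the grid example's table bound to initial-q-table:
;; (q-update initial-q-table :s1 :east :s2 0)
;; sets Q(s1, :east) to 0 + 0.9 * 100 = 90.0, matching the final table below.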

Page 10: Reinforcement Learning and Artificial Neural Nets

Simple Grid Example

Initial Q Value Estimates Table:

{:s1 {:east [:s2 0], :south [:s4 0]}
 :s2 {:east [:s3 100], :west [:s1 0], :south [:s5 0]}
 :s3 {:east [:s3 0], :west [:s3 0], :north [:s3 0], :south [:s3 0]}
 :s4 {:east [:s5 0], :north [:s1 0]}
 :s5 {:east [:s6 0], :west [:s4 0], :north [:s2 0]}
 :s6 {:east [:s5 0], :north [:s3 100]}}

Final Q Value Estimates Table:

{:s1 {:east [:s2 90.0], :south [:s4 72.9]}
 :s2 {:east [:s3 100.0], :west [:s1 81.0], :south [:s5 81.0]}
 :s3 {:east [:s3 0], :west [:s3 0], :north [:s3 0], :south [:s3 0]}
 :s4 {:east [:s5 81.0], :north [:s1 81.0]}
 :s5 {:east [:s6 90.0], :west [:s4 72.9], :north [:s2 90.0]}
 :s6 {:east [:s5 81.0], :north [:s3 100.0]}}

Optimal Policy: [:s1 :east :s2 :east :s3]

[Grid-world figure: states s1, s2, s3 across the top row and s4, s5, s6 across the bottom; each cell is labeled with its immediate reward: 100 for the absorbing goal state s3, 0 for every other state.]

γ = 0.9
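A quick sanity check of those final values (not part of the original slides): with γ = 0.9, each optimal Q value is just the goal reward of 100 discounted once for every additional step needed to reach s3.

(let [gamma 0.9
      goal-reward 100.0]
  ;; Q(s2, :east) enters the goal directly, so it equals 100.
  ;; 90.0 = one extra step away, 81.0 = two, 72.9 = three:
  [(* gamma goal-reward)               ;; => 90.0, e.g. Q(s1, :east)
   (* gamma gamma goal-reward)         ;; => 81.0, e.g. Q(s4, :east)  (up to rounding)
   (* gamma gamma gamma goal-reward)]) ;; => 72.9, e.g. Q(s1, :south) (up to rounding)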

Page 11: Reinforcement Learning and Artificial Neural Nets

Q-Learning Convergence

• Under the following criteria Q-Learning is provably convergent:

• Deterministic MDP

• Bounded immediate reward values

• Every state-action pair is executed with nonzero frequency (visited infinitely often in the limit)

Page 12: Reinforcement Learning and Artificial Neural Nets

Reminder

Show code examples from rl.clj

Page 13: Reinforcement Learning and Artificial Neural Nets

Nondeterministic Q-Learning

• Nondeterministic MDP: a single action can yield one of several next states according to some probability distribution.

• Augment the deterministic Q function with the expected reward and a probability-weighted average of the estimated Q values:

  Q(s, a) = E[r(s, a)] + γ Σ_{s’} P(s’ | s, a) max_{a’} Q(s’, a’)

• To ensure convergence we need a decaying weighted average of updates:

  Q_n(s, a) ← (1 − α_n) Q_{n−1}(s, a) + α_n [ r + γ max_{a’} Q_{n−1}(s’, a’) ]

  where α_n = 1 / (1 + visits_n(s, a))
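A minimal Clojure sketch of this decaying-average update (an illustration, not code from rl.clj); the learner map shape {:q {[state action] value}, :visits {[state action] count}} and the function names are assumptions.

(defn alpha
  "Decaying learning rate: alpha_n = 1 / (1 + visits_n(s, a))."
  [visits s a]
  (/ 1.0 (+ 1.0 (get visits [s a] 0))))

(defn nd-q-update
  "Nondeterministic Q-learning backup:
   Q_n(s,a) <- (1 - alpha_n) * Q_{n-1}(s,a) + alpha_n * (r + gamma * max_a' Q_{n-1}(s',a'))."
  [{:keys [q visits] :as learner} gamma s a s' r actions-in-s']
  (let [a-n    (alpha visits s a)
        best   (if (seq actions-in-s')
                 (apply max (map #(get q [s' %] 0.0) actions-in-s'))
                 0.0)
        target (+ r (* gamma best))
        old    (get q [s a] 0.0)]
    (-> learner
        (assoc-in [:q [s a]] (+ (* (- 1.0 a-n) old) (* a-n target)))
        (update-in [:visits [s a]] (fnil inc 0)))))

;; First visit to (s, a) gives alpha = 1, so the entry simply becomes the target:
;; (nd-q-update {:q {} :visits {}} 0.9 :s1 :east :s2 0 [:east :west :south])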

Page 14: Reinforcement Learning and Artificial Neural Nets

Temporal Difference Learning

• Q-Learning learns from the immediate successor state.
• TD-Learning learns from more distant states as well.

• Q^(1)(s_t, a) = r_t + γ max_{a’} Q(s_{t+1}, a’)

• Q^(2)(s_t, a) = r_t + γ r_{t+1} + γ² max_{a’} Q(s_{t+2}, a’)

• Q^(n)(s_t, a) = r_t + γ r_{t+1} + … + γ^(n−1) r_{t+n−1} + γ^n max_{a’} Q(s_{t+n}, a’)

• Sutton, 1988: TD(λ) blends these n-step estimates:

  Q^λ(s_t, a) = (1 − λ) [ Q^(1)(s_t, a) + λ Q^(2)(s_t, a) + λ² Q^(3)(s_t, a) + … ]
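As a small illustration (an assumption, not the talk's code), the n-step estimate Q^(n) just accumulates the n observed, discounted rewards and then bootstraps from the current estimate of max_{a’} Q(s_{t+n}, a’):

(defn n-step-q
  "Q^(n) estimate: r_t + g*r_{t+1} + ... + g^(n-1)*r_{t+n-1} + g^n * bootstrap,
   where rewards holds the n observed rewards and bootstrap is max_a' Q(s_{t+n}, a')."
  [gamma rewards bootstrap]
  (let [n (count rewards)]
    (+ (reduce + (map-indexed (fn [i r] (* (Math/pow gamma i) r)) rewards))
       (* (Math/pow gamma n) bootstrap))))

;; e.g. two observed rewards of 0 and a bootstrapped tail value of 100, with gamma 0.9:
;; (n-step-q 0.9 [0 0] 100) ;; => 81.0 (up to floating-point rounding)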

Page 15: Reinforcement Learning and Artificial Neural Nets

TD-Gammon

• TD-Gammon, Gerry Tesauro, 1995
• Used Temporal Difference Learning
• Used a neural net to learn the value function

• Trained with 1.5 million games

• Effectively the first time Neural Nets were successfully used in conjunction with Reinforcement Learning

• http://courses.cs.washington.edu/courses/cse590hk/01sp/Readings/tesauro95cacm.pdf

Page 16: Reinforcement Learning and Artificial Neural Nets

Issues in Reinforcement Learning

• Exploration vs. exploitation
  – An open research area
  – ε-greedy action selection (a minimal sketch follows this list)
  – Softmax action selection

• Generalization
  – Too many state-action pairs in real-life problems
  – Learn state abstractions
  – Transference

• Q-function approximators
  – Neural nets as universal approximators (DRL)
  – Using the Q-value estimates as training data
  – Cost of exploration in non-simulation settings

• Partial episodes
  – Estimate the reward that would be obtained by completing the episode
  – Trace functions

• Parallelism
  – Multiple agents learning a shared global policy
  – DeepMind asynchronous RL library
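Since the slide only names ε-greedy selection, here is a minimal sketch of it in Clojure (illustrative, not from rl.clj); action-values is an assumed map from actions to their current Q estimates.

(defn epsilon-greedy
  "With probability epsilon pick a random action (explore);
   otherwise pick the action with the highest estimated Q value (exploit)."
  [epsilon action-values]
  (if (< (rand) epsilon)
    (rand-nth (keys action-values))
    (key (apply max-key val action-values))))

;; e.g. (epsilon-greedy 0.1 {:east 90.0, :south 72.9})
;; => :east with probability 0.9, a uniformly random action otherwise.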

Page 17: Reinforcement Learning and Artificial Neural Nets

Reminder

Show cool videos from Pieter Abbeel’s (UC Berkeley)

IJCAI 2016 presentation

Page 18: Reinforcement Learning and Artificial Neural Nets

Part 2

Artificial Neural Networks

Most of the material is drawn from Tom Mitchell’s book Machine Learning (Chpt. 4)

Page 19: Reinforcement Learning and Artificial Neural Nets

Neural Nets in a Nutshell

• A neural net is a network of units organized into layers
• Input layer, hidden layers (one or more), output layer
• Different types of units:
  – Linear unit, perceptron, sigmoid unit, ReLU

• The task is to learn weights for the different units
• Typically given a set of training examples
• The gradient descent algorithm is used to train single units
• The backpropagation algorithm is used to train the network

Page 20: Reinforcement Learning and Artificial Neural Nets

Properties of ANNs• General method for learning real-valued, discrete-value and

vector-valued functions from examples

• Backpropagation uses gradient-descent to adjust weights to best fit training data (input/output pairs)

• Robust to noisy data

• Successfully used in image recognition & speech recognition
  – Yann LeCun, 1989 (handwritten characters)
  – Gary Cottrell, 1990 (face recognition)

Page 21: Reinforcement Learning and Artificial Neural Nets

Appropriateness of ANNs

• Many training instances are available
• The target function is real-valued, discrete-valued, or vector-valued
• The training data may contain errors/noise
• Training time is not an issue
• Fast evaluation of the learned function is important
• Human understanding of the target function is not important

Page 22: Reinforcement Learning and Artificial Neural Nets

Linear Units and Perceptrons

• Linear unit: a linear combination of weighted inputs (real-valued output)
• Perceptron: a thresholded linear unit (discrete-valued output)

Note: w0 is a bias whose purpose is to move the threshold of the activation function.
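A minimal Clojure sketch of these two units (illustrative, not from annv.clj); the weight vector is assumed to be [w0 w1 ... wn], with the bias w0 first and the inputs x1 ... xn in xs.

(defn linear-unit
  "Output of a linear unit: w0 + w1*x1 + ... + wn*xn."
  [weights xs]
  (reduce + (first weights) (map * (rest weights) xs)))

(defn perceptron
  "Thresholded linear unit: outputs 1 if the weighted sum is positive, else -1."
  [weights xs]
  (if (pos? (linear-unit weights xs)) 1 -1))

;; e.g. a perceptron computing logical AND on inputs in {0, 1}:
;; (perceptron [-1.5 1.0 1.0] [1 1]) ;; => 1
;; (perceptron [-1.5 1.0 1.0] [1 0]) ;; => -1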

Page 23: Reinforcement Learning and Artificial Neural Nets

Representational Capacity

• A single perceptron can represent the primitive Boolean functions AND, OR, NAND, and NOR, all of which are linearly separable.

• Every Boolean function can be represented by some network of perceptrons.

• XOR cannot be represented by a single perceptron because its decision surface is not linearly separable.

Page 24: Reinforcement Learning and Artificial Neural Nets

Linear Unit Error Function

• Recall the output of a linear unit: o = w0 + w1x1 + … + wnxn

• Measure the error of the unit with the squared-error criterion:

  E(w) = ½ Σ_{d∈D} (t_d − o_d)²

• where D is the set of training examples, t_d is the target output, and o_d is the unit's output for example d

• Goal is to learn a weight vector that minimizes the error

Page 25: Reinforcement Learning and Artificial Neural Nets

Weight Space Error Surface

Page 26: Reinforcement Learning and Artificial Neural Nets

Training Rules for Single Units

1. Delta Rule for linear units, based on the derivative of the error function:
   • Converges asymptotically to the minimum-error hypothesis
   • Takes unbounded time to converge
   • Convergence is independent of linear separability

2. Perceptron Training Rule:
   • Converges to a hypothesis that perfectly fits the training data
   • Takes bounded time to converge if the learning rate is small enough
   • Convergence depends on linear separability
   • Convergence proved by Minsky & Papert, 1969

Page 27: Reinforcement Learning and Artificial Neural Nets

Gradient Descent Algorithm (standard version)

1. Initialize each weight wi to a small random value in [−0.5, +0.5]

2. Repeat until the termination criterion is reached:

   a. Set each Δwi to zero

   b. For each training example ([x1, …, xn], t):
      • Compute the output o of the unit
      • For each weight wi, set Δwi ← Δwi + η (t − o) xi

   c. For each weight wi, set wi ← wi + Δwi

• Note: η is the learning rate, typically 0.5 or less.
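A minimal Clojure sketch of this batch update for a linear unit (an illustration under the assumptions above, not code from annv.clj); training examples are assumed to be pairs [xs t], and weights is the vector [w0 w1 ... wn] with the bias first.

(defn linear-output
  "w0 + w1*x1 + ... + wn*xn."
  [weights xs]
  (reduce + (first weights) (map * (rest weights) xs)))

(defn gd-epoch
  "One pass of standard (batch) gradient descent with learning rate eta:
   accumulate delta-wi = sum over examples of eta*(t - o)*xi, then apply it once."
  [weights eta examples]
  (let [deltas (reduce (fn [ds [xs t]]
                         (let [o   (linear-output weights xs)
                               err (* eta (- t o))]
                           ;; the bias weight w0 sees a constant input of 1
                           (mapv + ds (map * (repeat err) (cons 1.0 xs)))))
                       (vec (repeat (count weights) 0.0))
                       examples)]
    (mapv + weights deltas)))

;; e.g. one epoch fitting t = 2x, starting from zero weights:
;; (gd-epoch [0.0 0.0] 0.1 [[[1.0] 2.0] [[2.0] 4.0]])
;; => [0.6 1.0] (up to floating-point rounding)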

Page 28: Reinforcement Learning and Artificial Neural Nets

Standard vs. Stochastic

• Standard gradient descent increments the weights after summing over all training examples

• Incremental or stochastic gradient descent increments the weights for each training example.

• Both approaches are used in practice.
• The stochastic version can avoid local minima.

Page 29: Reinforcement Learning and Artificial Neural Nets

Absolute Basics of ANNs

• A network of units

• The network is organized into layers:
  – An input layer of inputs
  – One or more hidden layers of units
  – An output layer of units

• A standard ANN is fully connected:
  – All outputs of one layer are connected to all units of the next layer

• The outputs of the network are computed by feeding the inputs into the first layer, computing that layer's outputs, feeding those into the next layer, and so on.

• A Feed-Forward ANN forms an acyclic graph.

Page 30: Reinforcement Learning and Artificial Neural Nets

Sigmoid Units

• We need to redefine the error function E to sum over all output units:

  E(w) = ½ Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)²

• The gradient of this new error function drives the weight updates derived on the next slide (backpropagation).

• The sigmoid function σ(x) = 1 / (1 + e^(−x)) has the nice property that its derivative is σ(x)(1 − σ(x)).
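A tiny Clojure sketch of the sigmoid activation and its derivative (illustrative, not from annv.clj):

(defn sigmoid
  "Logistic activation: 1 / (1 + e^(-x))."
  [x]
  (/ 1.0 (+ 1.0 (Math/exp (- x)))))

(defn sigmoid-deriv
  "Derivative expressed through the output itself: sigma(x) * (1 - sigma(x))."
  [x]
  (let [s (sigmoid x)]
    (* s (- 1.0 s))))

;; (sigmoid 0.0)       ;; => 0.5
;; (sigmoid-deriv 0.0) ;; => 0.25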

Page 31: Reinforcement Learning and Artificial Neural Nets

Backpropagation Algorithm (using incremental/stochastic gradient descent)

1. Initialize the weights to small random numbers

2. Until the termination criterion is reached, for each training example:

   a. Compute the network outputs for the training example

   b. For each output unit k, compute its error term:

      δ_k = o_k (1 − o_k) (t_k − o_k)

   c. For each hidden unit h, compute its error term:

      δ_h = o_h (1 − o_h) Σ_k w_hk δ_k

   d. Update each network weight w_ij:

      w_ij ← w_ij + η δ_j x_ij

      (where δ_j is the error term of the unit receiving input x_ij)
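As an illustration of steps (b) and (c) only (a sketch, not the talk's annv.clj code), the error terms for a single training example can be computed like this; outputs, targets, and the outgoing weight vector are assumed to be plain Clojure vectors.

(defn output-deltas
  "delta_k = o_k * (1 - o_k) * (t_k - o_k) for each output unit."
  [outputs targets]
  (mapv (fn [o t] (* o (- 1.0 o) (- t o))) outputs targets))

(defn hidden-delta
  "delta_h = o_h * (1 - o_h) * sum_k (w_hk * delta_k),
   where w-hk holds the weights from hidden unit h to each output unit."
  [o-h w-hk deltas-k]
  (* o-h (- 1.0 o-h) (reduce + (map * w-hk deltas-k))))

;; e.g. a single output unit with output 0.8 and target 1.0:
;; (output-deltas [0.8] [1.0])       ;; => [0.032]  (approximately)
;; (hidden-delta 0.5 [0.4] [0.032])  ;; => 0.0032   (approximately)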

Page 32: Reinforcement Learning and Artificial Neural Nets

Vectorizing Backpropagation

• Each layer can be represented as:
  – a matrix of weights, where the ith row holds the weights of the ith unit
  – a bias vector, where the ith element is the bias of the ith unit
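A small sketch of that representation in plain Clojure (no matrix library, illustrative only): a layer is a map with a :weights matrix (a vector of row vectors) and a :biases vector, and the layer's outputs are a matrix-vector product followed by the sigmoid.

(defn sigmoid [x] (/ 1.0 (+ 1.0 (Math/exp (- x)))))  ;; repeated so the sketch is self-contained

(defn dot [xs ys] (reduce + (map * xs ys)))

(defn layer-outputs
  "Outputs of one fully connected sigmoid layer: sigma(W*inputs + b), row by row."
  [{:keys [weights biases]} inputs]
  (mapv (fn [row b] (sigmoid (+ b (dot row inputs))))
        weights biases))

;; e.g. a layer with 2 inputs and 2 units:
;; (layer-outputs {:weights [[1.0 -1.0] [0.5 0.5]]
;;                 :biases  [0.0 -0.5]}
;;                [1.0 0.0])
;; => approximately [0.731 0.5]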

Page 33: Reinforcement Learning and Artificial Neural Nets

Advanced Topics in ANN

• Alternative error functions
  – Adding a momentum term
  – Adding a penalty term for weight magnitude
  – Adding terms for errors in the slope (Mitchell & Thrun, 1993)

• Decoupling distance and direction
  – Line search
  – Conjugate gradient method

• Dynamically altering network structure
  – Removing the least salient connections (LeCun, 1990)

• Using different activation functions
  – Hyperbolic tangent: tanh
  – Rectified Linear Units (ReLU)
    • Widely used only in the past couple of years
    • f(x) = max(0, x) (see the small sketch after this list)
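For completeness, minimal Clojure definitions of the two alternative activations named above (illustrative only):

(defn tanh-unit
  "Hyperbolic tangent activation; outputs lie in (-1, 1)."
  [x]
  (Math/tanh x))

(defn relu
  "Rectified Linear Unit: f(x) = max(0, x)."
  [x]
  (max 0.0 x))

;; (relu -2.0) ;; => 0.0
;; (relu 3.0)  ;; => 3.0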

Page 34: Reinforcement Learning and Artificial Neural Nets

ALVINN: Early Self-Driving Car

• Drove at up to 70 mph on a sectioned-off California highway

• Network architecture:
  – 960 input units
  – 4 hidden units
  – 30 output units

• Pomerleau, 1993

Page 35: Reinforcement Learning and Artificial Neural Nets

Reminder

Show code examples from annv.clj:

1. Identity function
2. MNIST data set

Page 36: Reinforcement Learning and Artificial Neural Nets

References

• Reinforcement Learning
  – http://personal.disco.unimib.it/Vanneschi/McGrawHill_-_Machine_Learning_-Tom_Mitchell.pdf
  – https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html

• Neural Nets & Deep Learning
  – http://personal.disco.unimib.it/Vanneschi/McGrawHill_-_Machine_Learning_-Tom_Mitchell.pdf
  – http://neuralnetworksanddeeplearning.com/chap2.html
  – https://en.wikipedia.org/wiki/Recurrent_neural_network
  – http://deeplearning.net/tutorial/deeplearning.pdf

• Convolutional Neural Networks
  – http://cs231n.github.io/convolutional-networks/
  – http://neuralnetworksanddeeplearning.com/chap6.html

• Deep Reinforcement Learning
  – http://www0.cs.ucl.ac.uk/staff/d.silver/web/Resources_files/deep_rl.pdf
  – http://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/

• AlphaGo
  – https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
  – http://deeplearningskysthelimit.blogspot.com/