A/B Testing
Introduction to Reinforcement Learning
Guillaume [email protected]
November 2016
Université Paris Sud & LIMSI
Context
Which pizza should you order?
• I really love Neapolitan pizza
• Which pizza should I order?
• Always choose a Neapolitan pizza and risk missing something exceptionally good?
• Should I try another pizza and risk being disappointed?
Historically
Feynman’s restaurant problem
More seriously...
How to design a web page?
• the design of a web page has (may have) a huge impact on the number/actions of people seeing it
• too many changes / bad changes → lose visitors
• not enough changes → outdated page → lose visitors
Several ways to evaluate the impact of a change: number of sales, of unique visitors, ...
Example no 2
• Washington Post
• goal: evaluate the impact of modifying an article / of adding a new article
• formally: several variants/versions of the web site; continuously find the best one according to visitors’ “reactions”
https://developer.washingtonpost.com/pb/blog/post/2016/02/08/bandito-a-multi-armed-bandit-tool-for-content-testing/
Some results (1)
[screenshots of two variants of the page]
CTR: 11.7%    CTR: 24.6%
Evaluation: Click-Through Rate (% of visitors who clicked on the link)
Some results (2)
[screenshots of three variants of the page]
CTR: 3.3%    CTR: 3.9%    CTR: 4.8%
Evaluation
• each user only sees one version of the webpage ⇒ no direct comparison
• there may be some “local fluctuations”: an online shop that changes its web page around Christmas and observes an increase in sales ⇒ no way to know the cause of the increase
⇒ need for a controlled experiment = compare user behavior on both versions of the site
Applications...
• resource allocation in project management
• medical tests
• portfolio management
• online advertisement
• adaptive routing
• ...
Multi-Armed Bandit
Main idea
This is not supervised learning
• supervised learning: we know the ‘best’ answer among all possible answers (and use it to compute a loss/reward)
• reinforcement learning: we only know the reward/loss associated with a single answer and have no idea of the best achievable reward
Formalization
• n choices (= arms)
• the i-th choice gives a reward of 1 with an unknown but fixed probability p_i
• at time t, we choose an arm a_t and get a reward r_t ∼ Ber(p_{a_t})
Goal: N decisions ⇒ maximize the cumulative reward ∑_{i=1}^{N} r_i
Exploration/exploitation dilemma
• exploitation: use the accumulated knowledge to take the best decision (according to what is known)
• exploration: take a “non-optimal” decision to update the accumulated knowledge
Why is this a dilemma?
• exploration only: we may make only bad decisions and achieve a small cumulative reward
• exploitation only: bad estimation of the rewards + missed new opportunities
⇒ trade-off between exploration and exploitation
⇒ on-line learning of preferences/rewards
Notations
• H_t: all information gathered until time step t (chosen actions and observed rewards)
• a learning algorithm π associates with H_t a probability distribution over all possible actions A
Evaluation criterion
Regret(T , π, f ∗) = E
[T∑
t=1
f ∗(a∗)− f ∗(at)
](1)
• at chosen action at t
• a∗ optimal action at t
• f ∗ ‘true’ reward 14
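A small worked example (using the two Bernoulli arms with success probabilities 0.6 and 0.2 from the pizza example later on): an algorithm that always pulls the 0.2 arm loses 0.6 − 0.2 = 0.4 in expectation at every step, so its regret is 0.4·T, while an algorithm that always pulls the best arm has zero regret.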
Methods
ε-greedy Algorithm
Principle
At each step, make a random choice between ‘exploration’ and ‘exploitation’:
[diagram: the ‘explore’ or ‘exploit’ choice leads to pulling arm A or arm B]
Formally
• hyper-parameter of the algorithm: ε (manually chosen)
• with probability ε: exploration
• with probability 1 − ε: exploitation
How to simulate a discrete probability distribution in Python
The following function returns 1 (‘exploration’) with probability ε and 0 (‘exploitation’) with probability 1 − ε:
import random

def discret_rand(epsilon):
    # 0 ("exploitation") with probability 1 - epsilon
    if random.random() > epsilon:
        return 0
    # 1 ("exploration") with probability epsilon
    else:
        return 1
Exploration
• choose an action randomly
• observe the reward
• ‘keep track’ of the action and the associated reward
Exploitation
• choose the action with the highest expected value
• ‘value’ of a decision:
  (sum of the rewards observed with this choice) / (number of times this choice was made)
• observe the reward
• ‘keep track’ of the action and the associated reward
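For instance (a small worked example, not from the slides): if an arm has been chosen 4 times with observed rewards 1, 0, 1 and 1, its value is (1 + 0 + 1 + 1) / 4 = 0.75.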
Formally
Algorithm ε-greedy
At each iteration t:
• estimate f̂_t = E[f∗ | H_t]
• with probability ε choose a_t ∼ Unif(A), otherwise a_t = argmax_a f̂_t(a)
Theoretical guarantees
• if ε is small enough and decreases with t, the ε-greedy algorithm converges exponentially fast to the optimal Regret(T, π)
A/B Testing
• ε-greedy: algorithm ‘invented’ by the reinforcement learning community
• A/B testing: same algorithm/idea developed for marketing studies
• generally, only exploitation after N decisions
• smaller reward (you should test this!)
How to evaluate a bandit algorithm
Evaluation
• one of the most complicated aspects of bandit algorithms
• sequence of decisions ⇒ no train/test corpus
• only way to test = simulation = Monte-Carlo method [Metropolis, 47], [Ulam and von Neumann]
Simulation principles (1): choices
Principles
• k choices
• each choice results in a reward of 1 with probability p and 0 with probability 1 − p (Bernoulli arm)
In Python (one arm)
import random

class BernoulliArm:
    def __init__(self, p):
        self.p = p

    # reward of 1.0 with probability p, 0.0 otherwise
    def draw(self):
        return 0.0 if random.random() > self.p else 1.0
Example
2 pizzas
Choose between “4 fromages” and “reine”:
• each pizza is represented by an arm
• simulation → we know the right answer (but pretend to forget it):
  • “4 fromages” arm: reward 1 with probability 0.6
  • “reine” arm: reward 1 with probability 0.2
General case
• as many arms as there are pizzas/choices
• a list of BernoulliArm (see the sketch below)
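A minimal sketch of this setup (assuming the BernoulliArm class from the previous slide; the variable names are only illustrative):

arms = [BernoulliArm(0.6),  # "4 fromages"
        BernoulliArm(0.2)]  # "reine"

reward = arms[0].draw()  # pull the "4 fromages" arm once; returns 1.0 or 0.0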
Simulation principle (2): main loop
• we “play” n times (= horizon):
  1. make a decision
  2. get the reward (by calling the right method)
  3. update your knowledge
• metric: cumulative reward
Trick
• results can change from one run to another (they depend on randomness)
• solution: run k simulations and average the reward at each time step over all simulations
Putting everything together
import numpy as np

def test_algorithm(algo, means, num_sims, horizon):
    # init. all decisions
    arms = [BernoulliArm(mu) for mu in means]
    rewards = []

    for sim in range(num_sims):
        algo.initialize(len(arms))

        for t in range(horizon):
            chosen_arm = algo.select_arm()
            reward = arms[chosen_arm].draw()
            algo.update(chosen_arm, reward)

            rewards.append(reward)

    return np.array(rewards).reshape((num_sims, horizon))
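Usage note (a sketch; my_algo stands for any object implementing the interface on the next slide): each row of the returned array is one simulation, so the per-time-step average reward from the ‘Trick’ slide is simply:

rewards = test_algorithm(my_algo, [0.6, 0.2], num_sims=100, horizon=250)
mean_reward_per_step = rewards.mean(axis=0)  # one value per time step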
Interface of a bandit algorithm
class EpsilonGreedy:

    # constructor
    # epsilon (float): tradeoff exploration/exploitation
    def __init__(self, epsilon): pass

    # re-initialize the algorithm in order to run a new simulation
    # n_arms (int): number of arms
    def initialize(self, n_arms): pass

    # return the index of the chosen decision
    def select_arm(self): pass

    # update knowledge
    # chosen_arm (int): the decision that has been made
    # reward (float): the obtained reward
    def update(self, chosen_arm, reward): pass
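For reference, one possible minimal implementation of this interface (a sketch only; the class name EpsilonGreedySketch and the incremental-mean update are illustrative, not the expected lab solution):

import random

class EpsilonGreedySketch:
    def __init__(self, epsilon):
        self.epsilon = epsilon

    def initialize(self, n_arms):
        # number of pulls and estimated value for each arm
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self):
        if random.random() < self.epsilon:
            # exploration: uniformly random arm
            return random.randrange(len(self.values))
        # exploitation: arm with the highest estimated value
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, chosen_arm, reward):
        # incremental update of the empirical mean reward of the chosen arm
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        self.values[chosen_arm] += (reward - self.values[chosen_arm]) / n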
Lab
1st part
1. Let r_t be the estimated quality of an action at time t. How is r_t estimated?
2. How can r_{t+1} be computed from r_t?
3. Observe the evolution of the reward at each time step for the following parameters (a possible driver is sketched after this list):
   • ε ∈ {0.1, 0.2, 0.3, 0.4, 0.5}
   • horizon: 250
   • number of simulations: 100
   • arms:
     means = [0.1, 0.1, 0.1, 0.1, 0.9]
     random.shuffle(means)
     arms = [BernoulliArm(mu) for mu in means]
4. What can you conclude?
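A possible driver for question 3 (a sketch; it assumes test_algorithm, BernoulliArm and your EpsilonGreedy implementation are available, and uses matplotlib only to display the curves):

import random
import matplotlib.pyplot as plt

means = [0.1, 0.1, 0.1, 0.1, 0.9]
random.shuffle(means)

for epsilon in [0.1, 0.2, 0.3, 0.4, 0.5]:
    algo = EpsilonGreedy(epsilon)
    rewards = test_algorithm(algo, means, num_sims=100, horizon=250)
    # average reward at each time step over the 100 simulations
    plt.plot(rewards.mean(axis=0), label="epsilon = %.1f" % epsilon)

plt.xlabel("time step")
plt.ylabel("average reward")
plt.legend()
plt.show()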
2nd part
• Limit of the ε-greedy algorithm: uniform choice of actions in the exploration phase. What happens when one action yields a reward 99% of the time and another only 10% of the time?
• SoftMax algorithm: if we have n actions, we choose the i-th action with probability:
  exp(r_i / τ) / ∑_{j=1}^{n} exp(r_j / τ)
  where τ is a hyper-parameter. Implement this strategy and observe what happens for different values of τ (a sketch of the selection step is given below).
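A sketch of the softmax selection step (the function name softmax_select and its arguments are illustrative: values holds the estimated rewards r_i and tau the temperature τ):

import math
import random

def softmax_select(values, tau):
    # softmax probabilities over the estimated rewards
    weights = [math.exp(v / tau) for v in values]
    total = sum(weights)
    probs = [w / total for w in weights]
    # sample an arm index according to this distribution
    x = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if x <= cumulative:
            return i
    return len(probs) - 1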