A/B Testing
Introduction to Reinforcement Learning
Guillaume [email protected]
November 2016
Université Paris Sud & LIMSI
Context
Which pizza should you order?
• I really love Neapolitan pizza
• Which pizza should I order?
• Always choose a Neapolitan pizza and risk missing something exceptionally good?
• Should I try another pizza and risk being disappointed?
Historically
Feynman’s restaurant problem
More seriously...
How to design a web page?
• the design of a web page has (may have) a huge impact on the number/actions of people seeing it
• too many changes / bad changes → lose visitors
• not enough changes → outdated page → lose visitors
Several ways to evaluate the impact of a change: number of sales, of unique visitors, ...
Example no 2
• Washington Post
• goal: evaluate the impact of modifying an article / of adding a new article
• formally: several variants/versions of the web site; continuously find the best one according to visitors’ “reactions”
https://developer.washingtonpost.com/pb/blog/post/2016/02/08/bandito-a-multi-armed-bandit-tool-for-content-testing/
Some results (1)
[screenshots of two variants of the page]
CTR: 11.7%    CTR: 24.6%
Evaluation: Click-Through Rate (% of visitors who clicked on the link)
Some results (2)
[screenshots of three variants of the page]
CTR: 3.3%    CTR: 3.9%    CTR: 4.8%
Evaluation
• each user only sees one version of the webpage ⇒ no direct comparison
• there may be some “local fluctuations”: an online shop that changes its web page around Christmas and observes an increase in sales ⇒ no way to know the cause of the increase
⇒ need for a controlled experiment = compare user behavior on both versions of the site
Applications...
• resource allocation in project management
• medical tests
• portfolio management
• online advertisement
• adaptive routing
• ...
Multi-Armed Bandit
Main idea
This is not supervised learning
• supervised learning: we know the ‘best’ answer among all possible answers (and use it to compute a loss/reward)
• reinforcement learning: we only know the reward/loss associated with a single answer and have no idea of the best achievable reward
Formalization
• n choices (= arms)
• the i-th choice gives a reward of 1 with an unknown but fixed probability p_i
• at time t, we choose an arm a_t and get a reward r_t ∼ Ber(p_{a_t})
Goal: N decisions ⇒ maximize the cumulative reward ∑_{i=1}^{N} r_i
Exploration/exploitation dilemma
• exploitation: use the accumulated knowledge to take the best decision (according to what is known)
• exploration: take a “non-optimal” decision to update the accumulated knowledge
Why is this a dilemma?
• exploration only: we may make only bad decisions and achieve a small cumulative reward
• exploitation only: bad estimation of the rewards + missed new opportunities
⇒ trade-off between exploration and exploitation
⇒ on-line learning of preferences/rewards
Notations
• H_t: all information gathered until time step t (chosen actions and observed rewards)
• a learning algorithm π associates with H_t a probability distribution over all possible actions A
Evaluation criterion
Regret(T , π, f ∗) = E
[T∑
t=1
f ∗(a∗)− f ∗(at)
](1)
• at chosen action at t
• a∗ optimal action at t
• f ∗ ‘true’ reward 14
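A small worked example (using the two Bernoulli arms with success probabilities 0.6 and 0.2 from the pizza example later on): an algorithm that always pulls the 0.2 arm loses 0.6 − 0.2 = 0.4 in expectation at every step, so its regret is 0.4·T, while an algorithm that always pulls the best arm has zero regret.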
Methods
ε-greedy Algorithm
Principle
At each step, make a random choice between ‘exploration’ and ‘exploitation’:
[diagram: the ‘explore’ or ‘exploit’ choice leads to pulling arm A or arm B]
Formally
• hyper-parameter of the algorithm: ε (manually chosen)
• with probability ε: exploration
• with probability 1 − ε: exploitation
How to simulate a discrete probability distribution in Python
The following function returns 1 (‘exploration’) with probability ε and 0 (‘exploitation’) with probability 1 − ε:
import random

def discret_rand(epsilon):
    # 0 ("exploitation") with probability 1 - epsilon
    if random.random() > epsilon:
        return 0
    # 1 ("exploration") with probability epsilon
    else:
        return 1
Exploration
• choose an action randomly
• observe the reward
• ‘keep track’ of the action and the associated reward
Exploitation
• choose the action with the highest expected value
• ‘value’ of a decision:
  (sum of the rewards observed with this choice) / (number of times this choice was made)
• observe the reward
• ‘keep track’ of the action and the associated reward
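For instance (a small worked example, not from the slides): if an arm has been chosen 4 times with observed rewards 1, 0, 1 and 1, its value is (1 + 0 + 1 + 1) / 4 = 0.75.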
Formally
Algorithm ε-greedy
At each iteration t:
• estimate f̂_t = E[f∗ | H_t]
• with probability ε choose a_t ∼ Unif(A), otherwise a_t = argmax_a f̂_t(a)
Theoretical guarantees
• if ε is small enough and decreases with t, the ε-greedy algorithm converges exponentially fast to the optimal Regret(T, π)
A/B Testing
• ε-greedy: algorithm ‘invented’ by the reinforcement learning community
• A/B testing: same algorithm/idea developed for marketing studies
• generally, only exploitation after N decisions
• smaller reward (you should test this!)
How to evaluate a bandit algorithm
Evaluation
• one of the most complicated aspects of bandit algorithms
• sequence of decisions ⇒ no train/test corpus
• only way to test = simulation = Monte-Carlo method [Metropolis, 47], [Ulam and von Neumann]
Simulation principles (1): choices
Principles
• k choices
• each choice results in a reward of 1 with probability p and 0 with probability 1 − p (Bernoulli arm)
In Python (one arm)
import random

class BernoulliArm:
    def __init__(self, p):
        self.p = p

    # reward of 1.0 with probability p, 0.0 otherwise
    def draw(self):
        return 0.0 if random.random() > self.p else 1.0
Example
2 pizzas
Choose between “4 fromages” and “reine”:
• each pizza is represented by an arm
• simulation → we know the right answer (but pretend to forget it):
  • “4 fromages” arm: reward 1 with probability 0.6
  • “reine” arm: reward 1 with probability 0.2
General case
• as many arms as there are pizzas/choices
• a list of BernoulliArm (see the sketch below)
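A minimal sketch of this setup (assuming the BernoulliArm class from the previous slide; the variable names are only illustrative):

arms = [BernoulliArm(0.6),  # "4 fromages"
        BernoulliArm(0.2)]  # "reine"

reward = arms[0].draw()  # pull the "4 fromages" arm once; returns 1.0 or 0.0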
Simulation principle (2): main loop
• we “play” n times (= horizon):
  1. make a decision
  2. get the reward (by calling the right method)
  3. update your knowledge
• metric: cumulative reward
Trick
• results can change from one run to another (they depend on randomness)
• solution: run k simulations and average the reward at each time step over all simulations
Putting everything together
import numpy as np

def test_algorithm(algo, means, num_sims, horizon):
    # init. all decisions
    arms = [BernoulliArm(mu) for mu in means]
    rewards = []

    for sim in range(num_sims):
        algo.initialize(len(arms))

        for t in range(horizon):
            chosen_arm = algo.select_arm()
            reward = arms[chosen_arm].draw()
            algo.update(chosen_arm, reward)

            rewards.append(reward)

    return np.array(rewards).reshape((num_sims, horizon))
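Usage note (a sketch; my_algo stands for any object implementing the interface on the next slide): each row of the returned array is one simulation, so the per-time-step average reward from the ‘Trick’ slide is simply:

rewards = test_algorithm(my_algo, [0.6, 0.2], num_sims=100, horizon=250)
mean_reward_per_step = rewards.mean(axis=0)  # one value per time step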
Interface of a bandit algorithm
class EpsilonGreedy:

    # constructor
    # epsilon (float): tradeoff exploration/exploitation
    def __init__(self, epsilon): pass

    # re-initialize the algorithm in order to run a new simulation
    # n_arms (int): number of arms
    def initialize(self, n_arms): pass

    # return the index of the chosen decision
    def select_arm(self): pass

    # update knowledge
    # chosen_arm (int): the decision that has been made
    # reward (float): the obtained reward
    def update(self, chosen_arm, reward): pass
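For reference, one possible minimal implementation of this interface (a sketch only; the class name EpsilonGreedySketch and the incremental-mean update are illustrative, not the expected lab solution):

import random

class EpsilonGreedySketch:
    def __init__(self, epsilon):
        self.epsilon = epsilon

    def initialize(self, n_arms):
        # number of pulls and estimated value for each arm
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self):
        if random.random() < self.epsilon:
            # exploration: uniformly random arm
            return random.randrange(len(self.values))
        # exploitation: arm with the highest estimated value
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, chosen_arm, reward):
        # incremental update of the empirical mean reward of the chosen arm
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        self.values[chosen_arm] += (reward - self.values[chosen_arm]) / n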
Lab
1st part
1. Let r_t be the estimated quality of an action at time t. How is r_t estimated?
2. How can r_{t+1} be computed from r_t?
3. Observe the evolution of the reward at each time step for the following parameters (a possible driver is sketched after this list):
   • ε ∈ {0.1, 0.2, 0.3, 0.4, 0.5}
   • horizon: 250
   • number of simulations: 100
   • arms:
     means = [0.1, 0.1, 0.1, 0.1, 0.9]
     random.shuffle(means)
     arms = [BernoulliArm(mu) for mu in means]
4. What can you conclude?
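A possible driver for question 3 (a sketch; it assumes test_algorithm, BernoulliArm and your EpsilonGreedy implementation are available, and uses matplotlib only to display the curves):

import random
import matplotlib.pyplot as plt

means = [0.1, 0.1, 0.1, 0.1, 0.9]
random.shuffle(means)

for epsilon in [0.1, 0.2, 0.3, 0.4, 0.5]:
    algo = EpsilonGreedy(epsilon)
    rewards = test_algorithm(algo, means, num_sims=100, horizon=250)
    # average reward at each time step over the 100 simulations
    plt.plot(rewards.mean(axis=0), label="epsilon = %.1f" % epsilon)

plt.xlabel("time step")
plt.ylabel("average reward")
plt.legend()
plt.show()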
2nd part
• Limit of the ε-greedy algorithm: uniform choice of actions in the exploration phase. What happens when one action yields a reward 99% of the time and another only 10% of the time?
• SoftMax algorithm: if we have n actions, we choose the i-th action with probability:
  exp(r_i / τ) / ∑_{j=1}^{n} exp(r_j / τ)
  where τ is a hyper-parameter. Implement this strategy and observe what happens for different values of τ (a sketch of the selection step is given below).
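A sketch of the softmax selection step (the function name softmax_select and its arguments are illustrative: values holds the estimated rewards r_i and tau the temperature τ):

import math
import random

def softmax_select(values, tau):
    # softmax probabilities over the estimated rewards
    weights = [math.exp(v / tau) for v in values]
    total = sum(weights)
    probs = [w / total for w in weights]
    # sample an arm index according to this distribution
    x = random.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if x <= cumulative:
            return i
    return len(probs) - 1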