
© 2014 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.

Safe Policy Search for Digital Marketing

Philip S. Thomas | Summer Intern, Adobe Research | Graduate Student, UMass Amherst


Motivation

Researcher: “I have an algorithm that might be able to improve the policy that you are using.”

Reasonable Person: “If it fails it could be very costly. What guarantees can you give me?”

Researcher: “If the algorithm is tuned properly, it usually helps quite a bit in simulation.”

Reasonable Person: “Can you ensure that the algorithm is tuned properly for our system? Can you ensure that an improvement in simulation will correspond to an improvement in the real world? Can you guarantee that your algorithm will not cost us more than $X with 95% confidence?”

Researcher: “No, no, and no.”



Motivation

Researcher: “I have an algorithm that might be able to improve the policy that you are using.”

Reasonable Person: “If it fails it could be very costly. What guarantees can you give me?”

Researcher: “You can specify an amount $X and a confidence, 1 - α, and we will only change your existing policy if we can guarantee that our change will improve profits by more than $X with confidence 1 - α.”

Reasonable Person: “I can pick any $X and 1 - α that I want? I could select $1 and confidence 0.999?”

Researcher: “Absolutely.”



Goal

Provide an algorithm that:

Can use on- and off-policy data to compute a new policy that we expect to be better.

Will only propose a change to the policy if we can guarantee that its performance is better than X with confidence 1 - α.

The user gets to specify X and α.

Uncompromising approach to safety.



Theoretical Challenges

We have data collected using the company's current policy parameters, θ0.

We would like to guarantee that some other policy, θ, has expected return greater than X with confidence 1 - α.

Estimating the return of θ with off-policy data is called the off-policy evaluation problem.

Active research area

We have an even harder problem: high-confidence off-policy evaluation.



Our Approach

Three parts:

Determine whether the policy parameters θ are safe (the high-confidence off-policy evaluation problem).

Search θ-space for the θ that we expect to work best (the off-policy policy search problem).

Determine how to apply our algorithm repeatedly.



Importance Sampling



Importance Sampling

[Figure: range of possible returns, from the minimum possible return (a_i) to the maximum possible return (b_i), shown for a small return and a large return.]
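The idea named in this slide's title can be sketched as follows. This is a minimal illustration of ordinary importance sampling for off-policy evaluation, not the deck's own code: the trajectory format (a list of `(state, action, reward)` tuples), the policy callables, and all function names are hypothetical assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical data format: one trajectory is a list of (state, action, reward).
Step = Tuple[int, int, float]
# Hypothetical policy interface: returns the probability of `action` in `state`.
Policy = Callable[[int, int], float]


def discounted_return(traj: List[Step], gamma: float = 1.0) -> float:
    """Sum of rewards along one trajectory, discounted by gamma per step."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(traj))


def importance_weighted_return(
    traj: List[Step], pi_eval: Policy, pi_behavior: Policy, gamma: float = 1.0
) -> float:
    """Return of `traj`, reweighted by how much more (or less) likely the
    trajectory is under the evaluated policy than under the policy that
    actually generated it."""
    weight = 1.0
    for state, action, _ in traj:
        weight *= pi_eval(state, action) / pi_behavior(state, action)
    return weight * discounted_return(traj, gamma)


def importance_sampling_estimate(
    trajs: List[List[Step]], pi_eval: Policy, pi_behavior: Policy, gamma: float = 1.0
) -> float:
    """Average of the importance-weighted returns over all trajectories."""
    values = [importance_weighted_return(t, pi_eval, pi_behavior, gamma) for t in trajs]
    return sum(values) / len(values)
```

Because the weight is a product of probability ratios, a single trajectory can receive a very large weight, which is why the possible range of each weighted return matters for the bounds that follow.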


Ensuring Safety

Hoeffding’s Inequality

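$$\Pr\left(\frac{1}{n}\sum_{i=1}^{n}\hat{f}(\theta,\tau_i,\theta_i) - f(\theta) \ge k\right) \le \exp\left(\frac{-2n^2k^2}{\sum_{i=1}^{n}(b_i-a_i)^2}\right)$$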
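Hoeffding's inequality can be turned into a lower confidence bound. Setting its right-hand side, exp(-2n²k² / Σ(b_i - a_i)²), equal to δ and solving for k gives k = sqrt(Σ(b_i - a_i)² ln(1/δ) / (2n²)), so with probability at least 1 - δ the true mean is no smaller than the sample mean minus k. The sketch below assumes independent samples, each with a known range [a_i, b_i]; the function and argument names are hypothetical.

```python
import math
from typing import Sequence


def hoeffding_lower_bound(
    samples: Sequence[float],
    lows: Sequence[float],
    highs: Sequence[float],
    delta: float,
) -> float:
    """1 - delta lower confidence bound on the mean of independent samples,
    where sample i is known to lie in [lows[i], highs[i]]."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be strictly between 0 and 1")
    # Sum of squared ranges, the denominator term in Hoeffding's inequality.
    range_sq = sum((b - a) ** 2 for a, b in zip(lows, highs))
    # Solve exp(-2 n^2 k^2 / range_sq) = delta for k.
    k = math.sqrt(range_sq * math.log(1.0 / delta) / (2.0 * n ** 2))
    return sum(samples) / n - k
```

A single sample with a very wide range inflates the sum of squared ranges and loosens the bound for every sample, which is the weakness of this inequality when ranges differ widely.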





Our Approach: Part 1 – Are θ safe?

Importance sampling results in samples with very different ranges and variances.

We derive a novel concentration inequality that is well suited to this particular setting:



Example



Example: Gridworld



Digital Marketing Example

Actions:

Promote: Give information about product or company. Good shortly after a product was purchased.

Sell: Provide special offer or other direct attempt at generating a sale. Useful after several promotions of the product.

NULL: Do not mention the product. Useful to avoid oversaturation.

Observations:

Recency: How long ago did the customer last make a purchase?

Frequency: How many purchases has the customer made?

Model of the world:

The probability of the customer buying depends on the recency, frequency, and also other underlying properties of the customer that cannot be directly observed.




The company begins with a decent, but not optimal policy.

Goal: Maximize the expected (discounted) number of sales.

The company selects f_min to be 90% of the number of sales their current policy generated, and requires a confidence of 95%.

A degradation of more than 10% should be unlikely.
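The safety requirement in this example can be sketched as a simple test: propose the new policy only if a 95% lower confidence bound on its expected sales is at least 90% of the current policy's sales. The sketch below uses a Hoeffding-style bound as a stand-in, since it is the inequality shown earlier in the deck; the deck's own method relies on a different concentration inequality. All names, and the assumption that each importance-weighted return has a known range, are hypothetical.

```python
import math
from typing import Sequence


def passes_safety_test(
    weighted_returns: Sequence[float],
    lows: Sequence[float],
    highs: Sequence[float],
    baseline_return: float,
    fraction: float = 0.9,
    confidence: float = 0.95,
) -> bool:
    """True only if the lower confidence bound on the candidate policy's
    expected return is at least `fraction` of the baseline return."""
    n = len(weighted_returns)
    if n == 0:
        return False  # no data means no guarantee: keep the current policy
    delta = 1.0 - confidence
    f_min = fraction * baseline_return
    # Hoeffding-style bound (a stand-in, not the deck's own inequality).
    range_sq = sum((b - a) ** 2 for a, b in zip(lows, highs))
    k = math.sqrt(range_sq * math.log(1.0 / delta) / (2.0 * n ** 2))
    lower_bound = sum(weighted_returns) / n - k
    return lower_bound >= f_min
```

With too little data the bound is loose and the test fails even for a policy whose estimated return equals the baseline, so the existing policy stays in place until the evidence is strong enough.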
