© 2014 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential. Philip S. Thomas | Summer Intern, Adobe Research Safe Policy Search for Digital Marketing | Graduate Student, UMass Amherst

Post on 18-Dec-2015


Safe Policy Search for Digital Marketing

Philip S. Thomas | Summer Intern, Adobe Research | Graduate Student, UMass Amherst


Motivation

Researcher: “I have an algorithm that might be able to improve the policy that you are using.”

Reasonable Person: “If it fails it could be very costly. What guarantees can you give me?”

Researcher: “If the algorithm is tuned properly, it usually helps quite a bit in simulation.”

Reasonable Person: “Can you ensure that the algorithm is tuned properly for our system? Can you ensure that an improvement in simulation will correspond to an improvement in the real world? Can you guarantee that your algorithm will not cost us more than $X with 95% confidence?”

Researcher: “No, no, and no.”


Motivation

Researcher: “I have an algorithm that might be able to improve the policy that you are using.”

Reasonable Person: “If it fails it could be very costly. What guarantees can you give me?”

Researcher: “You can specify an amount $X and a confidence, 1 - α, and we will only change your existing policy if we can guarantee that our change will improve profits by more than $X with confidence 1 - α.”

Reasonable Person: “I can pick any $X and 1 - α that I want? I could select $1 and confidence 0.999?”

Researcher: “Absolutely.”


Goal

Provide an algorithm that:

Can use on- and off-policy data to compute a new policy that we expect to be better.

Will only propose a change to the policy if we can guarantee that its performance is better than X with confidence 1 - α.

The user gets to specify X and α.

Uncompromising approach to safety.


Theoretical Challenges

We have data collected using the company's current policy parameters, θ0.

We would like to guarantee that some other policy, θ, has expected return greater than X with confidence 1 - α.

Estimating the return of θ with off-policy data is called the off-policy evaluation problem.

Active research area

We have an even harder problem: high-confidence off-policy evaluation.
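
One standard way to estimate the return of θ from data gathered under θ0 is importance sampling, which the following slides name. A minimal sketch, assuming each trajectory is a list of (observation, action, reward) tuples and that both policies are available as functions returning action probabilities. The names `importance_weighted_return`, `off_policy_estimate`, `pi_eval`, and `pi_behavior` are hypothetical and do not come from the talk:

```python
def importance_weighted_return(trajectory, pi_eval, pi_behavior, gamma=1.0):
    """Return of one trajectory, reweighted so its expectation matches pi_eval.

    trajectory: list of (observation, action, reward) tuples (assumed format).
    pi_eval, pi_behavior: functions (observation, action) -> probability.
    """
    weight = 1.0       # product of per-step likelihood ratios
    discounted = 0.0   # discounted return actually observed
    for t, (obs, action, reward) in enumerate(trajectory):
        weight *= pi_eval(obs, action) / pi_behavior(obs, action)
        discounted += (gamma ** t) * reward
    return weight * discounted


def off_policy_estimate(trajectories, pi_eval, pi_behavior, gamma=1.0):
    """Average the importance-weighted returns over all trajectories."""
    samples = [
        importance_weighted_return(tr, pi_eval, pi_behavior, gamma)
        for tr in trajectories
    ]
    return sum(samples) / len(samples)
```

The estimate is a point value only. It says nothing about confidence, which is why the high-confidence version of the problem is harder.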


Our Approach

Three parts:

Determine whether θ are safe. (High-confidence off-policy evaluation problem).

Search θ-space for the θ that we expect to work best. (Off-policy policy search problem).

Determine how to apply our algorithm repeatedly.
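
The three parts can be sketched as a single loop. This is only an illustration under stated assumptions: the callables `collect_data`, `propose_candidate`, and `lower_bound` are hypothetical placeholders, and splitting each batch into a search half and a test half is an assumption of this sketch, made so that the safety test uses data the search did not see. The sketch does not address how repeated testing affects the overall confidence level.

```python
def safe_policy_improvement(theta0, collect_data, propose_candidate,
                            lower_bound, f_min, alpha, iterations):
    """Repeatedly search for a better policy; change only when the test passes.

    collect_data(theta)            -> iterable of trajectories (hypothetical)
    propose_candidate(data, theta) -> candidate parameters     (hypothetical)
    lower_bound(theta, data, alpha)-> 1 - alpha lower bound on return
    """
    theta = theta0
    for _ in range(iterations):
        data = list(collect_data(theta))
        # Assumed split: even indices for search, odd indices for the test.
        search_data, test_data = data[0::2], data[1::2]
        candidate = propose_candidate(search_data, theta)        # part 2
        if lower_bound(candidate, test_data, alpha) >= f_min:    # part 1
            theta = candidate      # change the policy only when deemed safe
    return theta                   # part 3: the loop itself
```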


Importance Sampling


[Figure: range of returns, labeled "Minimum possible return", "Maximum possible return", "Small Return", "Large Return", a_i, and b_i.]


Ensuring Safety

Hoeffding’s Inequality

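\[
\Pr\left( \frac{1}{n}\sum_{i=1}^{n} \hat{f}(\theta, \tau_i, \theta_i) - f(\theta) \ge k \right) \le \exp\left( \frac{-2 n^2 k^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)
\]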
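
Hoeffding's inequality for n independent samples, the i-th bounded in [a_i, b_i], bounds the probability that the sample mean deviates from its expectation by at least k by exp(-2n²k² / Σ(b_i - a_i)²). Setting that bound equal to α and solving for k gives a 1 - α lower bound on the expected return. A minimal sketch, assuming the samples are unbiased importance-weighted returns; the function names are hypothetical:

```python
import math


def hoeffding_lower_bound(samples, ranges, alpha):
    """1 - alpha lower bound on the mean of independent bounded samples.

    samples: observed values, e.g. importance-weighted returns.
    ranges:  list of (a_i, b_i) bounds, one pair per sample.
    """
    n = len(samples)
    mean = sum(samples) / n
    sum_sq_ranges = sum((b - a) ** 2 for a, b in ranges)
    # Solve exp(-2 n^2 k^2 / sum_sq_ranges) = alpha for k.
    k = math.sqrt(math.log(1.0 / alpha) * sum_sq_ranges / (2.0 * n * n))
    return mean - k


def is_safe(samples, ranges, alpha, f_min):
    """Accept the candidate only if the lower bound reaches f_min."""
    return hoeffding_lower_bound(samples, ranges, alpha) >= f_min
```

Because the bound depends on the sum of squared ranges, a few samples with very wide ranges loosen it considerably.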


Our Approach: Part 1 – Are θ safe?

Importance sampling results in samples with very different ranges and variances.

We derive a novel concentration inequality that is well suited to this particular setting:


Example


Example: Gridworld


Digital Marketing Example

Actions:

Promote: Give information about product or company. Good shortly after a product was purchased.

Sell: Provide special offer or other direct attempt at generating a sale. Useful after several promotions of the product.

NULL: Do not mention the product. Useful to avoid oversaturation.

Observations:

Recency: How long ago did the customer last make a purchase?

Frequency: How many purchases has the customer made?

Model of the world:

The probability of the customer buying depends on the recency, frequency, and also other underlying properties of the customer that cannot be directly observed.


Digital Marketing Example

The company begins with a decent, but not optimal, policy.

Goal: Maximize the expected (discounted) number of sales.

The company selects f_min to be 90% of the number of sales its current policy generated, and requires a confidence of 95%.

A degradation of more than 10% should be unlikely.
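
The threshold arithmetic can be sketched as follows. The baseline figure of 200 sales and the candidate lower bounds are made-up numbers used only for illustration, and the function names are hypothetical:

```python
def safety_threshold(baseline_sales, fraction=0.9):
    """f_min as a fraction of the sales generated by the current policy."""
    return fraction * baseline_sales


def approve_change(return_lower_bound, f_min):
    """Approve a new policy only if its high-confidence lower bound
    on expected sales is at least f_min."""
    return return_lower_bound >= f_min


alpha = 0.05                      # 95% confidence, as on the slide
f_min = safety_threshold(200.0)   # hypothetical baseline of 200 sales

# A candidate whose 95% lower bound is 185 sales clears the threshold;
# one whose lower bound is 170 does not, so the current policy is kept.
print(approve_change(185.0, f_min))
print(approve_change(170.0, f_min))
```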
