an e˚cient bandit algorithm for real-time multivariate...

1
Multivariate Testing Background Daniel N. Hill 1 , Houssam Nassif 2 , Yi Liu 2 , Anand Iyer 2 , SVN Vishwanathan 3,4 1. Amazon.com, Palo Alto, CA 2. Amazon.com, Seattle, WA 3. University of California, Santa Cruz, CA 4. A9, Palo Alto, CA Simulations show that our approach is appropriate for messages that have thousands of possible configurations and strong interactions between widgets. Experiment on live traffic: 48 treatments. Binary reward. Daily training updates. An efficient bandit algorithm for real-time multivariate optimization We aim to increase adoption of an Amazon service by optimiz- ing a message shown to customers. This message is composed of several widgets, each with multiple variants. The number of possible combinations grows exponentially. We need to effi- ciently test a large set of candidates to find the message with the highest conversion rate. 2 x 2 x 3 x 2 x 2 = 48 possible treatments Estimated 66 days to measure 5% lift in A/B experiment Widget We assume a layout A is composed of D widgets with N vari- ants each. There are thus N D possible combinations. On each trial t in {1,..., T}, an award R A is sampled from {-1,+1}. We denote the optimal layout A*, Denote layout chosen on trial t as A t . We seek to minimize total regret Δ T : : Our solution has 3 components: 1 2 3 Discrete analog of gradient descent Models tested MVT2 - Multivariate testing model MVT1 - MVT2 without interaction terms N D -MAB - Multi-armed bandit with N D arms D-MABs - Independent MABs for D widgets Simulator model D = 3 N = 8 α 1 = α 2 = 1 12 D = 1,728 treatments MVT2 and N D -MAB capture strong interactions MVT2 and MVT1 handle 1,000s of treatments Convergence = proportion of traffic to winning treatment Multivariate testing converges in 9 days Large variation in performance between treatments Title text Offer details Reject button Accept button Image 2x 2x 3x 2x 2x Take-aways • Experiments with 100s of combinatorial treatments can be run ef- ficiently using the multivariate multi-armed bandit. •There can be a large impact in decomposing a message into its components and then applying combinatorial optimization. •Future work will focus on contextualization, personalization, and more complicated success metrics. Quadratic probit model captures interactions between widgets. Multi-armed bandit strategy allocates traffic away from losing treatments. Hill-climbing with random restarts to make real-time decisions. #1029 Probit regression Item Quality Bias Item Interactions Title text Offer details Reject button Accept button Image Model learns O(D 2 N 2 ) parameters for O(N D ) possible treatments Training data Sample weights from posterior Normal prior Thompson sampling Layout is chosen proportional to the probability it is the optimal solution Optimize one widget at a time, given the rest of the layout With S restarts, requires O(SKN) evaluations rather than O(N D ) Title text Offer details Reject button Accept button Image MVT1 converges fastest but to a higher level of regret Hill-climbing converges in 20 iterations ] Simulation Results Experimental Results Problem Formulation Quadratic but not cubic interactions seen in real data p-values from likelihood ratio test for two experiments Incorporate context through additional terms. Assume an L-dimensional context vector X: Context Item-Context Interaction Exp. #2 Exp. #1 ?

Upload: others

Post on 13-May-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An e˚cient bandit algorithm for real-time multivariate ...pages.cs.wisc.edu/~hous21/presentations/KDD_17.pdf · p-values from likelihood ratio test for two experiments Incorporate

Multivariate TestingBackground

Daniel N. Hill1, Houssam Nassif 2, Yi Liu2, Anand Iyer 2, SVN Vishwanathan3,4

1. Amazon.com, Palo Alto, CA 2. Amazon.com, Seattle, WA 3. University of California, Santa Cruz, CA 4. A9, Palo Alto, CA

Simulations show that our approach is appropriate for messages that havethousands of possible con�gurations and strong interactions between widgets.

Experiment on live tra�c: 48 treatments. Binary reward. Daily training updates.

An e�cient bandit algorithm for real-time multivariate optimization

We aim to increase adoption of an Amazon service by optimiz-ing a message shown to customers. This message is composed of several widgets, each with multiple variants. The number of possible combinations grows exponentially. We need to e�-ciently test a large set of candidates to �nd the message with the highest conversion rate.

2 x 2 x 3 x 2 x 2 = 48 possible treatments

Estimated 66 days to measure 5% lift in A/B experiment

Widget

We assume a layout A is composed of D widgets with N vari-ants each.

There are thus ND possible combinations. On each trial t in {1,..., T}, an award RA is sampled from {-1,+1}. We denote the optimal layout A*,

Denote layout chosen on trial t as At. We seek to minimize total regret ΔT::

Our solution has 3 components:

1

2

3

Discrete analog of gradient descent

Models testedMVT2 - Multivariate testing modelMVT1 - MVT2 without interaction termsND-MAB - Multi-armed bandit with ND armsD-MABs - Independent MABs for D widgets

Simulator model

D = 3N = 8α1 = α2 = 1

12D= 1,728 treatments

MVT2 and ND-MAB capture strong interactions MVT2 and MVT1 handle 1,000s of treatments

Convergence = proportion of tra�c to winning treatment

Multivariate testing converges in 9 daysLarge variation in performance between treatments

Title textO�er details

Reject button

Accept buttonImage

2x

2x 3x2x

2x

Take-aways

• Experiments with 100s of combinatorial treatments can be run ef-�ciently using the multivariate multi-armed bandit.

•There can be a large impact in decomposing a message into its components and then applying combinatorial optimization.

•Future work will focus on contextualization, personalization, and more complicated success metrics.

Quadratic probit model captures interactions between widgets.

Multi-armed bandit strategy allocates tra�c away from losing treatments.

Hill-climbing with random restarts to make real-time decisions.

#1029

Probit regression

ItemQuality

Bias ItemInteractions

Title textO�er details

Reject button

Accept buttonImage

Model learns O(D2N2) parametersfor O(ND) possible treatments

Training data

Sample weights from posterior

Normal prior

Thompson sampling

Layout is chosen proportionalto the probability it is theoptimal solution

Optimize one widget at a time, given the rest of the layout

With S restarts, requiresO(SKN) evaluations ratherthan O(ND)

Title textO�er details

Reject button

Accept buttonImage

MVT1 converges fastestbut to a higher level of regret

Hill-climbing converges in 20 iterations

]

Simulation Results

Experimental Results

Problem Formulation

Quadratic but not cubic interactions seen in real data

p-values from likelihood ratio testfor two experiments

Incorporate context through additional terms.Assume an L-dimensional context vector X:

Context Item-ContextInteraction

Exp. #2Exp. #1

?