Implementing and Analyzing Online Experiments
IMPLEMENTING AND ANALYZING ONLINE EXPERIMENTS
SEAN J. TAYLOR · 28 JUL 2015 · MULTITHREADED DATA
WHO AM I?
• Core Data Science Team at Facebook
• PhD from NYU in Information Systems
• Four academic papers employing online field experiments
• Teach and consult on experimental design at Facebook
http://seanjtaylor.com
http://github.com/seanjtaylor
http://facebook.com/seanjtaylor
@seanjtaylor
I ASSUME YOU KNOW
• Why causality matters
• A little bit of Python and R
• Basic statistics + linear regression
SIMPLEST POSSIBLE EXPERIMENT
user_id  version  spent
123      B        $10
596      A        $0
456      A        $4
991      B        $9

def get_version(user_id):
    if user_id % 2:
        return 'A'
    else:
        return 'B'
> t.test(c(0, 4), c(10, 9))
Welch Two Sample t-test

data:  c(0, 4) and c(10, 9)
t = -3.638, df = 1.1245, p-value = 0.1487
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -27.74338  12.74338
sample estimates:
mean of x mean of y
      2.0       9.5
FIN
COMMON PROBLEMS
• Type I errors from measuring too many effects
• Type II and M errors from lack of power
• Repeated use of the same population (“pollution”)
• Type I errors from violation of the i.i.d. assumption
• Composing many changes into one experiment
POWER
OR: THE ONLY WAY TO TRULY FAIL AT AN EXPERIMENT
OR: THE SIZE OF YOUR CONFIDENCE INTERVALS
ERRORS
• Type I: Thinking your metric changed when it didn’t. We usually bound this at 1 or 5%.
• Type II: Thinking your metric didn’t change when it did. You can control this through better planning.
HOW TO MAKE TYPE I ERRORS
[Figure: a grid of metrics ($ spent, time spent, survey satisfaction) crossed with subgroups (male <25, female <25, male >=25, female >=25).]
Measure a ton of metrics.
Find a subgroup it works on.
AVOID TYPE II ERRORS WITH POWER
1. Use enough subjects in your experiment.
2. Test a reasonably strong treatment. Remember: you care about the difference.
POWER ANALYSIS
The first step in designing an experiment is to determine how much data you'll need to answer your question.
Process:
• Set the smallest effect size you'd like to detect.
• Simulate your experiment 200 times at each candidate sample size.
• Count how often the simulated experiments correctly reject the null of effect = 0; that fraction is your power at that sample size (a sketch follows below).
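A minimal power-simulation sketch of this process, assuming a two-arm test on a normally distributed metric; the effect size, noise level, and candidate sample sizes here are illustrative, not from the talk:

import numpy as np
from scipy import stats

def simulated_power(n, effect=0.2, sd=1.0, alpha=0.05, n_sims=200, seed=0):
    """Fraction of simulated A/B tests that reject the null at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, sd, size=n)        # arm A outcomes
        treatment = rng.normal(effect, sd, size=n)   # arm B outcomes
        _, p = stats.ttest_ind(treatment, control, equal_var=False)
        rejections += p < alpha
    return rejections / n_sims

for n in [100, 500, 1000, 5000]:
    # Pick the smallest n whose power you can live with.
    print(n, simulated_power(n))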
TYPE M ERRORS
• Magnitude error: reporting an effect size that is larger than the true effect
• Happens when your experiment is underpowered AND you only report the significant results (see the simulation sketch below)
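A hedged illustration of the Type M problem (the setup and numbers are mine, not the talk's): with a small true effect and a small sample, the estimates that happen to clear p < 0.05 overstate the truth.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect, n = 0.1, 100            # small effect + small sample = low power
significant_estimates = []
for _ in range(5000):
    control = rng.normal(0.0, 1.0, size=n)
    treatment = rng.normal(true_effect, 1.0, size=n)
    _, p = stats.ttest_ind(treatment, control, equal_var=False)
    if p < 0.05:
        significant_estimates.append(treatment.mean() - control.mean())

# Conditioning on significance inflates the reported effect well above 0.1.
print(np.mean(significant_estimates))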
IMPLEMENTATION
PLANOUT: KEY IDEAS
• an experiment is just a pseudo-random mapping from (user, context) → parameters, and is serializable.
• persistent randomization is implemented through hash functions; salts make experiments orthogonal (sketched below)
• always log exposures (parameter assignments) to improve precision and provide a randomization check
• namespaces create ability to do sequential experiments on new blocks of users
https://facebook.github.io/planout/
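To make the hashing idea concrete, here is a minimal sketch (not PlanOut's actual internals) of how a salted hash yields a persistent, pseudo-random assignment per user, and how changing the salt re-randomizes so a second experiment is orthogonal to the first:

import hashlib

def assign(user_id, salt, choices):
    """Deterministically map (salt, user_id) to one of the choices."""
    digest = hashlib.sha1(f"{salt}.{user_id}".encode()).hexdigest()
    return choices[int(digest, 16) % len(choices)]

# Same user + same salt -> same arm on every call (persistent randomization).
print(assign(212, "ButtonCopyExperiment", ["Buy now!", "Buy later!"]))
print(assign(212, "ButtonCopyExperiment", ["Buy now!", "Buy later!"]))

# A different salt gives an independent assignment (orthogonal experiments).
print(assign(212, "ButtonColorExperiment", ["blue", "orange"]))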
A/B TESTING IN PLANOUT

from planout.ops.random import *
from planout.experiment import SimpleExperiment

class ButtonCopyExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # `params` is always the first argument.
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )

# Later in your production code:
from myexperiments import ButtonCopyExperiment

e = ButtonCopyExperiment(user_id=212)
print(e.get('button_text'))

# Even later:
e = ButtonCopyExperiment(user_id=212)
e.log_event('purchase', {'amount': 9.43})
PLANOUT LOGS → DATA
{"inputs": {"user_id": 212}, "name": "ButtonCopyExperiment", "checksum": "646e69a5", "params": {"button_text": "Buy later!"}, "time": 1437952369, "salt": "ButtonCopyExperiment", "event": “exposure"}
{"inputs": {"user_id": 212}, "name": "ButtonCopyExperiment", "checksum": "646e69a5", "params": {"button_text": "Buy later!"}, "time": 1437952369, "extra_data": {"amount": 9.43}, "salt": "ButtonCopyExperiment", "event": "purchase"}
Exposures:
user_id  button_text
123      Buy later!
596      Buy later!
456      Buy now!
991      Buy later!

Purchases:
user_id  amount
123      $12
596      $9
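A minimal sketch of turning log lines like those above into the exposure and purchase tables; it assumes the logs are newline-delimited JSON in a file named planout.log (the file name and the pandas approach are mine, not the talk's):

import json
import pandas as pd

with open("planout.log") as f:
    rows = [json.loads(line) for line in f]

exposures = pd.DataFrame(
    [{"user_id": r["inputs"]["user_id"], **r["params"]}
     for r in rows if r["event"] == "exposure"]
)
purchases = pd.DataFrame(
    [{"user_id": r["inputs"]["user_id"], **r["extra_data"]}
     for r in rows if r["event"] == "purchase"]
)

# Left-join so every exposed user is kept, with zero spend if they never purchased.
analysis = exposures.merge(purchases, on="user_id", how="left").fillna({"amount": 0})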
ADVANCED DESIGN 1: FACTORIAL DESIGN
• Can use conditional logic as well as other random assignment operators: RandomInteger, RandomFloat, WeightedChoice, Sample.
class FactorialExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        params.button_text = UniformChoice(
            choices=["Buy now!", "Buy later!"],
            unit=user_id
        )
        params.button_color = UniformChoice(
            choices=["blue", "orange"],
            unit=user_id
        )
ADVANCED DESIGN 2: INCREMENTAL CHANGES
from planout.namespace import SimpleNamespace

## We're going to try two different button redesigns.
class FirstExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # ... set some params
        pass

class SecondExperiment(SimpleExperiment):
    def assign(self, params, user_id):
        # ... set some params differently
        pass

class ButtonNamespace(SimpleNamespace):
    def setup(self):
        self.name = 'button_experiment_sequence'
        self.primary_unit = 'user_id'
        self.num_segments = 1000

    def setup_experiments(self):
        # Allocate and deallocate experiments here.
        # First gets 100 out of 1000 segments.
        self.add_experiment('first', FirstExperiment, 100)
        self.add_experiment('second', SecondExperiment, 100)
ADVANCED DESIGN 3: WITHIN-SUBJECTS
Previous experiments persistently assigned the same treatment to each user, but the unit of analysis can be more complex:
class DiscountExperiment(SimpleExperiment):
    def assign(self, params, user_id, item_id):
        params.discount = BernoulliTrial(p=0.1, unit=[user_id, item_id])
        if params.discount:
            params.discount_amount = RandomInteger(
                min=5, max=15, unit=user_id
            )
        else:
            params.discount_amount = 0

e = DiscountExperiment(user_id=212, item_id=2)
print(e.get('discount_amount'))
ANALYSIS
THE IDEAL DATA SET
Subject/User  Gender  Age  Button Size  Button Text  Spent  Bounce
Erin          F       22   Large        Buy Now!     $20    0
Ashley        F       29   Large        Buy Later!   $4     0
Gary          M       34   Small        Buy Now!     $0     1
Leo           M       18   Large        Buy Now!     $0     1
Ed            M       46   Small        Buy Later!   $9     0
Sam           M       25   Small        Buy Now!     $5     0

Each row is an independent observation; Gender and Age are pre-experiment covariates; Button Size and Button Text are randomly assigned; Spent and Bounce are metrics.
SIMPLEST CASE: OLS
> summary(lm(spent ~ button.size, data = df))

Call:
lm(formula = spent ~ button.size, data = df)

Residuals:
    1     2     3     4     5     6
 10.0  -0.5  -4.5 -10.0   4.5   0.5

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)            10.000      5.489   1.822    0.143
factor(button.size)s   -5.500      6.722  -0.818    0.459

Residual standard error: 7.762 on 4 degrees of freedom
Multiple R-squared: 0.1434, Adjusted R-squared: -0.07079
F-statistic: 0.6694 on 1 and 4 DF, p-value: 0.4592
DATA REDUCTION
Raw data (N rows):

Subject  Xi  Di  Yi
Evan     M   0   1
Ashley   F   0   1
Greg     M   1   0
Leena    F   1   0
Ema      F   0   0
Seamus   M   1   1

Reduced data (# treatments × # groups × # outcomes rows instead of N):

X  D  Y  Cases
M  0  1  1
M  1  1  1
F  0  1  1
F  1  1  0
M  0  0  0
M  1  0  1
F  0  0  1
F  1  0  1
source('css_stats.R')

reduced <- df %>%
  mutate(rounded.spent = round(spent, 0)) %>%
  group_by(button.size, rounded.spent) %>%
  summarise(n = n())

> lm(rounded.spent ~ button.size, data = reduced, weights = n) %>%
+   coeftest(vcov = sandwich.lm)

t test of coefficients:

             Estimate Std. Error t value  Pr(>|t|)
(Intercept)   7.43137    0.45162 16.4548 7.522e-14 ***
button.sizes -2.45178    0.59032 -4.1533 0.0004149 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
DATA REDUCTION + WEIGHTED OLS
FACTORIAL DESIGNS
• Identify two types of effects: marginal effects and interactions. You need to fix one group as the baseline.
> coeftest(lm(spent ~ button.size * button.text, data = df))

t test of coefficients:

                           Estimate Std. Error t value  Pr(>|t|)
(Intercept)                 6.79643    0.62998 10.7884 < 2.2e-16 ***
button.sizes               -2.43253    0.86673 -2.8066  0.006064 **
button.textn                2.11611    0.86673  2.4415  0.016458 *
button.sizes:button.textn  -2.57660    1.27584 -2.0195  0.046219 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
USING COVARIATES TO GAIN PRECISION
• With simple random assignment, covariates are not necessary for unbiased estimates.
• However, you can improve the precision of ATE estimates if covariates explain a lot of the variation in the potential outcomes.
• They can be added to a linear model; the standard errors should shrink if they are helpful (see the sketch below).
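A minimal before/after sketch of covariate adjustment on synthetic data; the column names, the synthetic data, and the use of Python's statsmodels (rather than the R lm() calls used elsewhere in the talk) are my choices:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative synthetic data: age strongly predicts spend, button size has a modest effect.
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    'button_size': rng.choice(['Large', 'Small'], size=n),
    'age': rng.integers(18, 65, size=n),
})
df['spent'] = 0.3 * df['age'] + 2.0 * (df['button_size'] == 'Large') + rng.normal(0, 5, size=n)

unadjusted = smf.ols('spent ~ button_size', data=df).fit()
adjusted = smf.ols('spent ~ button_size + age', data=df).fit()

# The treatment estimate barely moves, but its standard error shrinks once
# age (a pre-experiment covariate) soaks up outcome variance.
print(unadjusted.bse)
print(adjusted.bse)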
NON-IID DATA
• Repeated observations of the same user are not independent.
• Ditto if you're experimenting on certain items only.
• If you ignore the dependence, your reported confidence intervals will be too narrow: the true intervals are larger than you think.
Subject/User  Item   Button Size  Spent
Erin          Shirt  Large        $20
Erin          Socks  Large        $4
Erin          Pants  Large        $0
Leo           Shirt  Large        $0
Ed            Shirt  Small        $9
Ed            Socks  Small        $5
THE BOOTSTRAP
[Figure: bootstrap workflow. From all your data, generate random sub-samples R1, R2, ..., R500; compute statistics or estimate model parameters on each, giving s1, s2, ..., s500; histogram the replicates to get a distribution over the statistic of interest (e.g. the treatment effect).]
From that distribution:
- take the mean
- CIs == 95% quantiles
- SEs == standard deviation
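A minimal sketch of that recipe for a difference in means; the simulated outcome data and the choice of Python are illustrative, not from the talk:

import numpy as np

rng = np.random.default_rng(0)
treated = rng.normal(1.0, 2.0, size=1000)   # stand-in outcome data
control = rng.normal(0.0, 2.0, size=1000)

replicates = []
for _ in range(500):
    # Resample each group with replacement and recompute the statistic.
    t_star = rng.choice(treated, size=treated.size, replace=True)
    c_star = rng.choice(control, size=control.size, replace=True)
    replicates.append(t_star.mean() - c_star.mean())

replicates = np.array(replicates)
print("estimate:", replicates.mean())
print("95% CI:  ", np.quantile(replicates, [0.025, 0.975]))
print("SE:      ", replicates.std())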
USER AND USER-ITEM BOOTSTRAPS
source('css_stats.R')
library(broom)  ## for extracting model coefficients

fitter <- function(.data) {
  lm(summary ~ opposed, data = .data, weights = .weights) %>%
    tidy
}

iid.replicates <- iid.bootstrap(df, fitter, .combine = bind_rows)
oneway.replicates <- clustered.bootstrap(df, c('user_id'), fitter,
                                         .combine = bind_rows)
twoway.replicates <- clustered.bootstrap(df, c('user_id', 'item_id'), fitter,
                                         .combine = bind_rows)

> head(iid.replicates)
          term   estimate  std.error statistic      p.value
1  (Intercept)  0.4700000 0.04795154  9.801561 6.296919e-18
2 button.sizes -0.2200000 0.08003333 -2.748855 6.695621e-03
3  (Intercept)  0.4250000 0.05307832  8.007036 5.768641e-13
4 button.sizes -0.1750000 0.08456729 -2.069358 4.049329e-02
5  (Intercept)  0.4137931 0.05141050  8.048805 4.118301e-13
6 button.sizes -0.1429598 0.08621804 -1.658119 9.965016e-02
DEPENDENCE CHANGES CONFIDENCE INTERVALS
DATA REDUCTION WITH DEPENDENT DATA
Subject  Di  Yij
Evan     1   1
Evan     1   0
Ashley   0   1
Ashley   0   1
Ashley   0   1
Greg     1   0
Leena    1   0
Leena    1   1
Ema      0   0
Seamus   1   1

[Figure: workflow. Create bootstrap replicates R1, R2, R3; reduce each replicate as if its rows were i.i.d., giving r1, r2, r3; compute statistics s1, s2, s3 on the reduced data.]
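A sketch of that workflow in Python on the toy table above, assuming the replicates are clustered by subject (as in the user bootstrap earlier); the pandas/numpy approach and the 500 replicates are my choices, not the talk's:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'subject': ['Evan', 'Evan', 'Ashley', 'Ashley', 'Ashley',
                'Greg', 'Leena', 'Leena', 'Ema', 'Seamus'],
    'D':       [1, 1, 0, 0, 0, 1, 1, 1, 0, 1],
    'Y':       [1, 0, 1, 1, 1, 0, 0, 1, 0, 1],
})

def reduced_effect(data):
    """Collapse to (D, Y) cell counts, then take the weighted difference in mean Y."""
    cells = data.groupby(['D', 'Y']).size().reset_index(name='n')
    means = cells.groupby('D').apply(lambda g: np.average(g['Y'], weights=g['n']))
    return means.loc[1] - means.loc[0]

rng = np.random.default_rng(0)
subjects = df['subject'].unique()
replicates = []
for _ in range(500):
    # Resample whole subjects (clusters) with replacement.
    sampled = rng.choice(subjects, size=len(subjects), replace=True)
    boot = pd.concat([df[df['subject'] == s] for s in sampled])
    if boot['D'].nunique() == 2:     # skip degenerate replicates with one arm only
        replicates.append(reduced_effect(boot))

replicates = np.array(replicates)
print(replicates.mean(), np.quantile(replicates, [0.025, 0.975]))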
THANKS! HERE ARE SOME RESOURCES:
• Me: http://seanjtaylor.com
• These slides: http://www.slideshare.net/seanjtaylor/implementing-and-analyzing-online-experiments
• Full Featured Tutorial: http://eytan.github.io/www-15-tutorial/
• “Field Experiments” by Gerber and Green