A/B test with three-way ANOVA

Ben Paul

May 22, 2015

Background

• We tested two versions of a landing page in order to determine which had a greater return on investment (ROI).

• We also collected data about traffic source (“referer”) and country of origin, to determine if the effect of landing page was different depending on the referer and country.

Hypotheses

• No hypotheses were specified.

Method

• Since no hypotheses were specified in advance, the data set was randomly split in half so that one half could be used to generate hypotheses (the “exploratory” data set) and the other half could be used to test those hypotheses (the “validation” data set).

• In particular, an ANOVA was conducted on the exploratory data set to assess the effects of landing page, country, and referer on ROI. Effects found in the exploratory data set were re-tested in the validation set.

Analysis

Set up environment


library("plyr")
library("dplyr")
library("ggplot2")
library("lmtest")
source("interaction.t.tests.R")

knitr::opts_chunk$set(comment = NA) # remove hashes in output
knitr::opts_chunk$set(fig.width = 12) # increase plot width
theme_set(theme_gray(base_size = 12)) # decrease font size

Read data

dat <- read.csv("data/referer_data.csv")

Split data

set.seed(25)
samp <- sample(nrow(dat), nrow(dat) / 2)
explore <- dat[samp, ]
validate <- dat[-samp, ]
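A quick sanity check on a split of this kind is to confirm that the two halves share no rows and together cover the full data set. The sketch below uses a small toy data frame (`toy`, with a made-up `id` column) rather than the real data, so the names are assumptions for illustration only.

```r
# Toy data frame standing in for `dat` (hypothetical; not the real data)
set.seed(25)
toy <- data.frame(id = 1:10, roi = round(runif(10, 5, 180), 2))

# Same splitting pattern as in the report
samp <- sample(nrow(toy), nrow(toy) / 2)
toy_explore <- toy[samp, ]
toy_validate <- toy[-samp, ]

# The halves should be disjoint and together cover every row
split_is_clean <- length(intersect(toy_explore$id, toy_validate$id)) == 0 &&
  setequal(c(toy_explore$id, toy_validate$id), toy$id)
print(split_is_clean)
```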

Clean data

Handle data types

Check that data types are appropriate.

summary(explore, maxsum = 10); str(explore);

             referer       country      landing_page      roi
 blogher         : 1483   AU   : 2457   a:24983      Min.   :  5.63
 caranddriver    : 1479   CA   : 9966   b:25017      1st Qu.: 15.43
 FB              :12573   Other: 2519                Median : 18.44
 footballoutsider: 1542   UK   :10002                Mean   : 37.16
 Google          :24840   US   :25056                3rd Qu.: 63.77
 MSmag           : 1462                              Max.   :182.76
 Other           : 1572
 pioneeringwoman : 1499
 scify           : 1565
 YouTube         : 1985

'data.frame': 50000 obs. of 4 variables:
 $ referer     : Factor w/ 10 levels "blogher","caranddriver",..: 9 3 5 3 5 5 3 10 1 5 ...
 $ country     : Factor w/ 5 levels "AU","CA","Other",..: 5 4 5 3 5 3 5 4 4 4 ...
 $ landing_page: Factor w/ 2 levels "a","b": 1 1 2 2 1 1 2 2 1 2 ...
 $ roi         : num 39.49 44.61 15.43 8.65 15.43 ...

Data types appear to be appropriate. Referer, country, and landing_page are defined as factors, and roi is numeric, as expected.


ROI ranges from 5.63 to 182.76. Assuming the unit is cents, these values would appear to be within reason for a website visit. (It is difficult to find comparable benchmarks, but related data on average revenue per unique visitor can be found at http://www.businessinsider.com/chart-of-the-day-revenue-per-unique-visitor-2011-1. Since our data concern profit rather than revenue, it would make sense that our numbers are much lower than those from the Business Insider article.)

Analyze data

Diagnostics

A three-way ANOVA was planned to test the effect of landing page, country, and referer on ROI. Since we are analyzing a landing page test, only terms that included the landing page variable were entered into the analysis: landing page, country x landing page, referer x landing page, and country x referer x landing page.

Before inspecting the results, diagnostic plots were inspected to ensure ANOVA assumptions were met.

explore_fit <- aov(roi ~ landing_page + landing_page:referer + landing_page:country + landing_page:country:referer, data = explore)
layout(1)
plot(explore_fit, 1)

[Figure: Residuals vs Fitted plot for explore_fit. x-axis: Fitted values (0 to 150); y-axis: Residuals (about -5e-09 to 5e-09); labeled points: 69476, 12439, 98507.]

In this plot, the vast majority of the 50,000 residuals appear to have no relationship with fitted values. However, it appears that heteroscedasticity may be present: there are about 10-20 data points with low fitted values that seem to be associated with greater residual variation compared to those with higher fitted values. But since it is difficult to tell from visual inspection if this represents significant heteroscedasticity, a formal test for heteroscedasticity was run.

# Breusch-Pagan test for heteroscedasticity
bptest(explore_fit)

studentized Breusch-Pagan test

data: explore_fit
BP = 34.944, df = 99, p-value = 1

The test failed to detect heteroscedasticity (p = 1), so heteroscedasticity was not treated as a concern. The next diagnostic was to check if nonnormality is present.
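For readers unfamiliar with the test, the studentized Breusch-Pagan statistic can be sketched in base R: regress the squared residuals on the model's regressors and multiply the resulting R-squared by the sample size, then compare against a chi-squared distribution with one degree of freedom per regressor. The data below are synthetic and the variable names (`toy`, `page`, `src`) are assumptions for illustration; this is a sketch of the idea, not the lmtest implementation. (The df = 99 reported above is consistent with 10 referers x 5 countries x 2 landing pages = 100 cells, minus one.)

```r
# Synthetic data with two factors (hypothetical; not the report's data)
set.seed(1)
n <- 400
toy <- data.frame(
  page = factor(sample(c("a", "b"), n, replace = TRUE)),
  src  = factor(sample(c("x", "y", "z"), n, replace = TRUE))
)
toy$y <- 10 + 2 * (toy$page == "b") + rnorm(n)

fit <- lm(y ~ page + page:src, data = toy)

# Auxiliary regression: squared residuals on the regressors (intercept re-added by lm)
e2 <- resid(fit)^2
Z <- model.matrix(fit)[, -1, drop = FALSE]
aux <- lm(e2 ~ Z)

# Studentized (Koenker) form: n times the R-squared of the auxiliary regression
bp_stat <- n * summary(aux)$r.squared
bp_df <- ncol(Z)
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p = bp_p))
```

A large p-value here, as in the report, means the squared residuals are not well predicted by the regressors.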


layout(1)
plot(explore_fit, 2)

[Figure: Normal Q-Q plot for explore_fit. x-axis: Theoretical Quantiles (-4 to 4); y-axis: Standardized residuals (-100 to 150); labeled points: 69476, 12439, 98507.]

The data appear to be very nearly normal, although there are deviations in the tails that appear to affect about 20 of the 50,000 data points. Again, this is not thought to be a large enough concern to merit further action. ANOVA assumptions appear to be reasonably met and we can proceed with analyzing the results.

explore_fit %>% drop1(.~., test = "F") # use Type III SS so that variable order doesn't matter - see http://www.statmethods.net/stats/anova.html

ANOVA

Warning: attempting model selection on an essentially perfect fit is nonsense

Single term deletions

Model:
roi ~ landing_page + landing_page:referer + landing_page:country +
    landing_page:country:referer
                             Df Sum of Sq     RSS      AIC    F value    Pr(>F)
<none>                                          0 -2367527
landing_page                  1     22272   22272   -40237 8.1789e+24 < 2.2e-16 ***
landing_page:referer         18   1120353 1120353   155633 2.2857e+25 < 2.2e-16 ***
landing_page:country          8    672600  672600   130140 3.0875e+25 < 2.2e-16 ***
landing_page:referer:country 72         0       0  -601608 1.5156e+18 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The main effect of landing page was significant, qualified by statistically significant interactions with referer and with country (all ps < 0.001). Although the three-way interaction between landing page, referer, and country was also statistically significant, it is associated with a sum of squares = 0, indicating that it did not explain any variance. Therefore, it will not be interpreted.
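The "essentially perfect fit" warning, the residual sum of squares of 0, and F values on the order of 1e+24 all suggest that residual variance is close to zero, which would happen if ROI were almost entirely determined by the combination of landing page, referer, and country. One way to check this, sketched here on toy data with assumed column names (`page`, `src`, `y`), is to look at the largest within-cell standard deviation.

```r
# Toy data in which the outcome is fully determined by the cell (hypothetical)
toy <- expand.grid(page = c("a", "b"), src = c("x", "y"), rep = 1:3)
toy$y <- ifelse(toy$page == "a", 10, 12) + ifelse(toy$src == "y", 5, 0)

# Standard deviation of the outcome within each page x src cell
cell_sd <- aggregate(y ~ page + src, data = toy, FUN = sd)

# A maximum near zero means the cell means leave no residual variance to test against
max_cell_sd <- max(cell_sd$y)
print(max_cell_sd)
```

If the same check on the real data gave values near zero, the F tests above would be better read as descriptive than inferential.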

To visualize the effects, all data points were plotted, with landing page on the x-axis and ROI on the y-axis, and graphs faceted by referer and colored by country. The order of referers was changed to demonstrate the results more clearly.

# reorder referer levels
ref_levels <- c("blogher", "MSmag", "pioneeringwoman",
                "caranddriver", "footballoutsider",
                "scify", "FB", "Google", "YouTube", "Other")

explore <- within(explore, referer <- factor(referer, levels = ref_levels))

ggplot(explore, aes(x = landing_page, y = roi)) +
  geom_jitter(alpha = 0.5, aes(color = country), position = position_jitter(width = 0.2, height = 0.1)) +
  facet_wrap(~referer, nrow = 1)

[Figure: Jittered scatter plot of roi (y-axis, 0 to 150) by landing_page (x-axis, a vs. b), one panel per referer (blogher, MSmag, pioneeringwoman, caranddriver, footballoutsider, scify, FB, Google, YouTube, Other), points colored by country (AU, CA, Other, UK, US).]

The interaction between referer and landing page can be seen in that some referers have greater ROI with landing page “a”, others have greater ROI with landing page “b”, and others have no difference. Follow-up t-tests were conducted to verify this effect.

The interaction between country and landing page is not visible from the graph and may be of a much lower magnitude. Follow-up t-tests were conducted to verify this effect as well.

# set p-value cutoff using Bonferroni correction considering we are running several tests:
# (1) for each referer, run t-test comparing ROI from "a" vs. "b"
# (2) for each country, run t-test comparing ROI from "a" vs. "b"

p_cutoff <- 0.05 / (length(levels(explore$referer)) * 2)
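As a worked check of the arithmetic, the cutoff as computed divides 0.05 by 10 x 2 = 20, which is somewhat stricter than dividing by the 10 + 5 = 15 tests listed in the comments. The level counts below are taken from the str() output; the variable names are assumptions for illustration.

```r
alpha <- 0.05
n_referer <- 10  # referer levels, per the str() output
n_country <- 5   # country levels, per the str() output

# Cutoff as computed in the report: alpha divided by (referer levels * 2)
cutoff_used <- alpha / (n_referer * 2)

# Alternative: one test per referer plus one test per country
cutoff_by_count <- alpha / (n_referer + n_country)

print(c(cutoff_used = cutoff_used, cutoff_by_count = cutoff_by_count))
```

Either cutoff would lead to the same conclusions for results that are reported as p = 0.00 or p > 0.1.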

# tests with referer
explore %>% interaction.t.tests(iv = "referer", group_var = "landing_page", dv = "roi", p_cutoff = p_cutoff)

[1] " *** blogher: 'a' had 31.74 greater roi, p = 0.00"
[1] " *** MSmag: 'a' had 31.19 greater roi, p = 0.00"
[1] " *** pioneeringwoman: 'a' had 30.57 greater roi, p = 0.00"
[1] " *** caranddriver: 'b' had 6.95 greater roi, p = 0.00"
[1] " *** footballoutsider: 'b' had 7.56 greater roi, p = 0.00"
[1] "scify: no difference in roi, p = 0.93"
[1] "FB: no difference in roi, p = 0.69"
[1] "Google: no difference in roi, p = 0.14"
[1] "YouTube: no difference in roi, p = 0.58"
[1] "Other: no difference in roi, p = 0.93"

# tests with country
explore %>% interaction.t.tests(iv = "country", group_var = "landing_page", dv = "roi", p_cutoff = p_cutoff)

[1] " *** AU: 'a' had 2.33 greater roi, p = 0.00"
[1] " *** CA: 'a' had 2.35 greater roi, p = 0.00"
[1] " *** Other: 'a' had 1.69 greater roi, p = 0.00"
[1] " *** UK: 'a' had 2.36 greater roi, p = 0.00"
[1] " *** US: 'a' had 1.90 greater roi, p = 0.00"

Results indicate that:

• Landing page “a” had greater ROI than landing page “b” for referers “blogher”, “MSmag”, and “pioneeringwoman” (p < 0.001, ROI differences range from 31 - 32).

• Landing page “b” had greater ROI than landing page “a” for referers “caranddriver” and “footballoutsider” (p < 0.001, ROI differences range from 7 - 8).

• Landing pages “a” and “b” did not have different ROIs for referers “scify”, “FB”, “Google”, “YouTube”, and “Other” (all ps > 0.1).

• Landing page “a” ROI is greater than “b” by about two units (presumably cents) regardless of country; despite the statistically significant interaction effect between the two variables, the t-test results above show that the effect of landing page differs by less than one cent from country to country (ROI differences range from 1.69 - 2.36 cents). Since the magnitude of this effect is extremely low compared to other effects seen in the data, it is considered to be of minimal importance and will not be interpreted further.
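The per-level comparisons summarized above come from the sourced helper interaction.t.tests.R, which is not shown in this report. As a rough illustration of the idea only, the sketch below runs one Welch t-test per level of a factor and flags results under a p-value cutoff. The function name `per_level_t_tests`, its return format, and the toy data are all hypothetical and should not be read as the actual helper.

```r
# Hypothetical stand-in for the sourced helper; the real interaction.t.tests.R is not shown
per_level_t_tests <- function(data, iv, group_var, dv, p_cutoff = 0.05) {
  rows <- lapply(split(data, data[[iv]], drop = TRUE), function(d) {
    g <- factor(d[[group_var]])
    lv <- levels(g)
    x <- d[[dv]][g == lv[1]]
    y <- d[[dv]][g == lv[2]]
    tt <- t.test(x, y)  # Welch two-sample t-test (base R default)
    data.frame(
      level = as.character(d[[iv]][1]),
      diff = mean(x) - mean(y),  # first group mean minus second group mean
      p_value = tt$p.value,
      significant = tt$p.value < p_cutoff
    )
  })
  do.call(rbind, rows)
}

# Toy data: for "site1" page "a" is 5 units higher; for "site2" there is no true difference
set.seed(42)
toy <- data.frame(
  referer = rep(c("site1", "site2"), each = 40),
  landing_page = rep(c("a", "b"), times = 40),
  stringsAsFactors = TRUE
)
toy$roi <- 20 + ifelse(toy$referer == "site1" & toy$landing_page == "a", 5, 0) +
  rnorm(nrow(toy))

res <- per_level_t_tests(toy, "referer", "landing_page", "roi", p_cutoff = 0.05 / 2)
print(res)
```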

The referer by landing page interaction was re-tested in the validation data set.

# reorder referer levels
ref_levels <- c("blogher", "MSmag", "pioneeringwoman",
                "caranddriver", "footballoutsider",
                "scify", "FB", "Google", "YouTube", "Other")

validate <- within(validate, referer <- factor(referer, levels = ref_levels))

validate %>% interaction.t.tests(iv = "referer", group_var = "landing_page", dv = "roi")

[1] " *** blogher: 'a' had 31.74 greater roi, p = 0.00"
[1] " *** MSmag: 'a' had 31.19 greater roi, p = 0.00"
[1] " *** pioneeringwoman: 'a' had 30.57 greater roi, p = 0.00"
[1] " *** caranddriver: 'b' had 6.95 greater roi, p = 0.00"
[1] " *** footballoutsider: 'b' had 7.56 greater roi, p = 0.00"
[1] "scify: no difference in roi, p = 0.93"
[1] "FB: no difference in roi, p = 0.69"
[1] "Google: no difference in roi, p = 0.14"
[1] "YouTube: no difference in roi, p = 0.58"
[1] "Other: no difference in roi, p = 0.93"

The same results are seen in the validation set (all significant differences at p < 0.001), so the hypotheses have been supported. Just as in the exploratory set:

• Landing page “a” had greater ROI than landing page “b” for referers “blogher”, “MSmag”, and “pioneeringwoman” (p < 0.001, ROI differences range from 31 - 32, same as exploratory results).

• Landing page “b” had greater ROI than landing page “a” for referers “caranddriver” and “footballoutsider” (p < 0.001, ROI differences range from 7 - 8, same as exploratory results).

• Landing pages “a” and “b” did not have different ROIs for referers “scify”, “FB”, “Google”, “YouTube”, and “Other” (all ps > 0.1, same as exploratory results).

Discussion

The referers whose traffic benefits from landing page “a” all appear to be targeted at females (blogher, MSmag, pioneeringwoman). Assuming that these referers are associated with blogher.com, msmagazine.com, and thepioneerwoman.com, data from Alexa confirm that females are “greatly over-represented” in their traffic (http://www.alexa.com/siteinfo/blogher.com, http://www.alexa.com/siteinfo/msmagazine.com, http://www.alexa.com/siteinfo/thepioneerwoman.com).

In contrast, referers whose traffic benefits from page “b” appear to be targeted at males (caranddriver, footballoutsider). Assuming that these referers are associated with caranddriver.com and footballoutsiders.com, data from Alexa confirm that males are “over-represented” in the former (http://www.alexa.com/siteinfo/caranddriver.com) and “greatly over-represented” in the latter (http://www.alexa.com/siteinfo/footballoutsiders.com).

Finally, referers whose traffic benefits equally from pages “a” and “b” appear to be targeted at males and females roughly equally. Although Alexa data are not available for “scify”, Quantcast data for syfy.com (which used to be scify.com) show the genders are roughly even, with only slightly more males (https://www.quantcast.com/syfy.com). Alexa data for youtube.com, facebook.com, and google.com also show a similar pattern: there are some gender differences, but not nearly of the magnitude seen in sites like caranddriver.com and msmagazine.com.

In light of this finding, it may be worthwhile to route all traffic from overwhelmingly female referers to landing page “a” and all traffic from overwhelmingly male referers to landing page “b”. We would monitor overall ROI to ensure that it increases after this change.
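A routing rule of this kind could be sketched as a simple lookup on the referer. The function name `assign_landing_page` is hypothetical, and the choice to keep randomizing referers that showed no difference is an assumption, since the report does not say how that traffic should be handled.

```r
# Referer groups taken from the results above; the function name is hypothetical
page_a_referers <- c("blogher", "MSmag", "pioneeringwoman")
page_b_referers <- c("caranddriver", "footballoutsider")

assign_landing_page <- function(referer) {
  if (referer %in% page_a_referers) return("a")
  if (referer %in% page_b_referers) return("b")
  # Assumption: referers with no detected difference keep being randomized
  sample(c("a", "b"), 1)
}

print(assign_landing_page("MSmag"))
print(assign_landing_page("caranddriver"))
```

Keeping a randomized arm for the remaining referers would also preserve the ability to re-test the effect later.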

In addition, it may be worthwhile to conduct user interviews to try to discern why the pages appeal to different genders. For example, it could be found that the title at the top of landing page “a” resonates with females, while an image on the page does not. Different versions of the image more aligned with the message could then be attempted in an effort to further increase ROI.
