The Dangers of A/B testing
An e-commerce manager recently got in touch. She runs the website for a group of 10 hotels and wanted to roll out our Direct Booking Platform. Her goal was to reassure guests that booking direct is best. But she had a problem.
Her senior management wanted to make sure that her investment was going to deliver a good return, so they asked their marketing team to prepare an A/B test.
Our sales team wanted to help, because they’ve seen the value that our Platform can generate for clients. However, being the data geeks that we are, it was our understanding of statistics and their limitations that prevented us from taking the easy route and simply running the test. A/B testing is only effective in certain scenarios, and we knew it was not going to be effective for this hotel group. Why?
Insufficient data
For this group of hotels, with 2,000 conversion events a month, the A/B test would need to run for over eight months to produce a result with reasonable statistical confidence. And given the impact a full rollout of the platform would have, during the test they would forgo £90,000 in revenue, because half of their customers would never see the widget and be reassured enough to book direct.
We are frequently asked to undertake A/B tests that would suffer the same
data issue. The availability of testing tools has driven a false confidence in their
results across the industry. So we thought it would be helpful to do a quick review of A/B tests – when they are great and when they should be avoided. With special thanks to our Data Science team, here are the real facts.
A quick introduction to A/B testing:
A/B testing is a form of hypothesis testing where two variants are compared against each other using a key metric. The name A/B testing comes from the fact that the two variants are called variant A (the control group) and variant B (the test group).
IMPORTANT: When running an A/B test, a hypothesis is required, which we aim to either prove or disprove.
Take note of the above. Starting an A/B test with an open hypothesis and “let’s see if
A is better or worse than B” is a subtle misappropriation of this statistical tool.
A/B tests are popular and effective in high volume retail businesses and large
digital companies: Amazon, Google, and Expedia are famous for A/B testing almost
everything they do. However, selling to millions of people a week is different to a
hotel selling 100 rooms. For A/B tests to be an effective research and management tool, it helps to be a certain kind of business: one that generates data at scale.
Setting up an A/B test:
Setting up an A/B test is straightforward. Two versions of a website are
developed. Visitors are served one of the two websites at random over the
duration of the test period. This ensures an identical population distribution
between the users that saw variant A and the users that saw variant B.
There are five rules that help to ensure a great set-up:
1. A consistent web experience for users
2. Random selection of the variant displayed for all new users
3. The test is run at the same time
4. The test must be set up by User ID and not Session ID
5. Have an agreed, clear metric to assess the test by
For example, the variant displayed must not depend upon where the user is coming from or whether or not they are logged in. This ensures that all extraneous variables influence both groups in the same way.
When setting up an A/B test, we are looking to attribute a change in a key metric to a variant of the website. For this to be possible, users must only see one variant across all their visits to the website. Therefore, the test must be set up by User ID and not by Session ID, as the Session ID changes every time a browser is closed.
The final aspect of setting up a test is formulating a hypothesis in the form of: “Variant B performs x% better than variant A when looking at this metric”. The metric used for A/B testing website design can take the form of click-through rates, bounce rates or conversion rates.
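To illustrate rules 2 and 4, here is a minimal sketch of how consistent, User ID-based variant assignment might be implemented. The function name, the experiment label and the 50/50 split are our own illustrative assumptions, not features of any particular testing tool:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "booking-widget-test") -> str:
    """Deterministically assign a user to variant A or variant B.

    Hashing the User ID (rather than the Session ID) means the same visitor
    sees the same variant on every visit, even after closing the browser.
    Salting the hash with the experiment name keeps separate tests independent.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100           # stable bucket in the range 0-99
    return "A" if bucket < 50 else "B"       # 50/50 split: control vs. test group

# The assignment never changes across sessions for the same user:
print(assign_variant("user-12345"))          # always returns the same letter
```

Because the bucket depends only on the User ID, it does not vary with where the visitor came from or whether they are logged in, so extraneous variables influence both groups in the same way.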
The measurement problem:
True uplift vs. measured uplift
So you have run an A/B test. Now you need to interpret it and see whether the hypothesis is proven or not. The key watch-out here is to distinguish between true uplift and measured uplift.
True uplift is the real uplift that can be measured in the business over the long run. Measured uplift is what you experience during the test.
In an ideal world, the measured uplift equals the true uplift. However, we don’t live in a perfect world, and that frequently leads to A/B test misinterpretation. The difference between measured and true uplift is expressed in the concept of statistical confidence.
Lies, damn lies & statistics
The chart shows a real-world example of the measurement problem. It tracks the impact of an A/A test where both websites shown to a hotel’s customers were identical. In this instance the true uplift in conversion over the 90-day period should be 0%, because there is nothing to drive any difference in performance. However, the test recorded a 17% uplift. This is because there were very few conversion events and a small number of bookings heavily skewed the measured uplift.
[Chart: measured uplift over the 90-day A/A test]
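To get a feel for how a handful of bookings can skew the measured uplift, here is a rough simulation of an A/A test. The visitor counts and the 8% conversion rate are illustrative assumptions, not the hotel’s actual figures:

```python
import random

def simulate_aa_test(visitors_per_variant: int = 1_500,
                     conversion_rate: float = 0.08) -> float:
    """Simulate one A/A test: both 'variants' are identical, so the true uplift is 0%.

    Returns the measured uplift of B over A, which is therefore pure noise.
    """
    conv_a = sum(random.random() < conversion_rate for _ in range(visitors_per_variant))
    conv_b = sum(random.random() < conversion_rate for _ in range(visitors_per_variant))
    return (conv_b - conv_a) / conv_a * 100   # measured uplift in %

random.seed(42)
runs = [simulate_aa_test() for _ in range(10)]
print([f"{uplift:+.1f}%" for uplift in runs])
# With only ~120 conversions per variant, individual runs regularly report
# "uplifts" of several percent, and sometimes more than 10%, either way,
# even though the true uplift is exactly 0%.
```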
Feeling confident?
To measure the difference between true and measured uplift, our statistical friends talk about confidence intervals and confidence levels. Imagine your boss asks you for your forecast of direct bookings for the next month.
You could say that bookings will be £1M–£1.2M for the month. Or you could say that direct bookings will be £1.2M, but you are only 90% confident that you will hit that. These are two different ways of expressing your confidence in the accuracy of your forecast: confidence intervals and confidence levels.
Statistical significance
To quantify these measures of confidence, statisticians use the concept of statistical significance. For example, a test result showing a 15% conversion uplift with 90% statistical significance means that 10 times out of 100 the 15% uplift result would have occurred due to sampling error alone.
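As a rough sketch of where such a figure comes from, the example below runs a simple two-proportion z-test. The visitor and conversion numbers are our own illustrative assumptions, chosen so the result lands near the 15% uplift / 90% significance case described above:

```python
from math import sqrt
from statistics import NormalDist

def uplift_significance(visitors_a, conv_a, visitors_b, conv_b):
    """Two-proportion z-test for an A/B test result.

    Returns the measured uplift of B over A and its (one-sided) statistical
    significance. A significance of 90% means that roughly 10 times out of 100
    an uplift at least this large would arise from sampling error alone.
    """
    rate_a, rate_b = conv_a / visitors_a, conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    significance = NormalDist().cdf(z)
    uplift = (rate_b - rate_a) / rate_a * 100
    return uplift, significance

# 1,800 visitors per variant, converting at 8.0% (A) and ~9.2% (B)
uplift, sig = uplift_significance(1_800, 144, 1_800, 166)
print(f"Measured uplift: {uplift:.0f}%, statistical significance: {sig:.0%}")
```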
How to improve the statistical significance
When running an A/B test, we always want to ensure that the measured uplift estimates the true uplift as precisely as possible. The more data we have, the more precise the estimate. We therefore need to compute the number of conversions that will ensure an accurate result. Three things determine how many conversions are needed in order to achieve that:
1. Statistical significance
2. Minimum detectable impact
3. Conversion rate
[Chart: conversions needed vs. statistical significance (50%–100%)]
[Chart: conversions needed vs. conversion uplift (%) – 88,900 more conversions are needed to detect a 2% uplift (98,000) than a 6% uplift (9,100)]
[Chart: conversions needed vs. impact (%) – 14,000 conversions are needed to identify a 5% impact]
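The charts above come from this kind of sample-size calculation. Below is a rough sketch using the standard formula for comparing two proportions; the 80% power default and the exact formula are our own assumptions, so the totals it prints will not match the chart figures exactly, but the shape of the relationship is the same: halving the uplift you want to detect roughly quadruples the conversions you need.

```python
from math import ceil
from statistics import NormalDist

def conversions_needed(baseline_rate: float, min_uplift: float,
                       significance: float = 0.90, power: float = 0.80) -> int:
    """Approximate conversions needed across both variants of an A/B test.

    baseline_rate : conversion rate of the control, e.g. 0.08 for 8%
    min_uplift    : smallest relative uplift we care about, e.g. 0.05 for 5%
    significance  : required statistical significance (one-sided)
    power         : chance of detecting the uplift if it really exists
    """
    z_alpha = NormalDist().inv_cdf(significance)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline_rate, baseline_rate * (1 + min_uplift)
    # visitors needed per variant for a two-proportion comparison
    visitors_per_variant = ((z_alpha + z_beta) ** 2
                            * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
    total_visitors = 2 * visitors_per_variant
    return ceil(total_visitors * baseline_rate)   # expressed as conversions

# Smaller uplifts need disproportionately more data:
for uplift in (0.02, 0.05, 0.10):
    print(f"{uplift:.0%} uplift -> ~{conversions_needed(0.08, uplift):,} conversions")
```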
Back to the real world
Let’s return to the example of our 10-hotel group. They had c. 1,000 rooms, an ADR of £150, an avg. LOS of 1.5 nights and 80% occupancy. The conversion rate of their website was 8%, with an average of 2,000 conversions online each month (unfortunately a lot of business is currently coming via the OTAs). The management team wanted to run an A/B test, and they wanted to be confident that the measured impact of the platform reflected a true uplift of at least 5%.
In this instance we needed 14,000 conversions across the A/B test to reach that level of confidence. At 2,000 conversions a month, that meant running an A/B test for almost eight months!
A very expensive test
Eight months is a very long time for a team to run a test. And even worse, it costs the hotel money.
With a conversion uplift of just 5%, over the course of eight months the widget would be generating an extra 800 direct bookings with a revenue of £180,000. If half of all the customers don’t see the widget because they are in the hidden version of the website, that means the hotel group is missing out on £90,000 of direct bookings.
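For completeness, here is the back-of-the-envelope arithmetic behind those figures, using only the numbers quoted in this example:

```python
# Figures quoted above for the 10-hotel group
conversions_per_month = 2_000     # online conversions per month
test_months = 8                   # approximate duration of the A/B test
uplift = 0.05                     # 5% conversion uplift from the widget
adr = 150                         # average daily rate, £
avg_los = 1.5                     # average length of stay, nights

extra_bookings = conversions_per_month * test_months * uplift   # 800 bookings
extra_revenue = extra_bookings * adr * avg_los                  # £180,000
forgone = extra_revenue / 2       # half the visitors never see the widget

print(f"Extra direct bookings over the test: {extra_bookings:.0f}")
print(f"Extra direct revenue: £{extra_revenue:,.0f}")
print(f"Revenue forgone by hiding the widget from half of visitors: £{forgone:,.0f}")
```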
To A/B or not to A/B?
So are A/B tests a great tool for helping inform great decisions? Yes, absolutely.
Should we do them in every instance? No.
So when should we use them? The short answer is it depends.
It depends on, amongst other things:
1. The decision to be made
2. The confidence we need to make it
3. How many visitors we get to a website
4. The conversion rate
5. Whether we need segmented results across individual properties in a group – many more conversions required
6. The cost in lost revenue to run the test
Ultimately, we need to weigh up when an A/B test will help us make a better decision and when taking a decision and monitoring overall results is a better approach. Often, in fast-moving industries with a wasting inventory, good judgement and urgency are more expedient.