The Dangers of A/B testing
An e-commerce manager recently got in touch. She runs the website for a group of 10 hotels and wanted to roll out our Direct Booking Platform. Her goal was to reassure guests that booking direct is best. But she had a problem.
Her senior management wanted to make sure that her investment was going to deliver a good return, so they asked their marketing team to prepare an A/B test.
Our sales team wanted to help, because they’ve seen the value that our Platform can generate for clients. However, being the data geeks that we are, it was our understanding of statistics and their limitations that prevented us from taking the easy route and simply running the test. A/B testing is only effective in certain scenarios, and we knew it was not going to be effective for this hotel group. Why?
Insufficient data
For this group of hotels, with 2,000 conversion events a month, the A/B test would need to run for over eight months to produce a result with reasonable statistical confidence. And given the impact a full rollout of the platform would have, during the test they would forgo £90,000 in revenue, because half of their customers would never see the widget and be reassured enough to book direct.
We are frequently asked to undertake A/B tests that would suffer the same
data issue. The availability of testing tools has driven a false confidence in their
results across the industry. So we thought it would be helpful to do a quick review of A/B tests – when they are great and when they should be avoided. With special thanks to our Data Science team, here are the real facts.
A quick introduction to A/B testing:
A/B testing is a form of hypothesis testing where two variants are compared against each other using a key metric. The name A/B testing comes from the fact that the two variants are called variant A (the control group) and variant B (the test group).
IMPORTANT: When running an A/B test, a hypothesis is required, which we aim to either prove or disprove.
Take note of the above. Starting an A/B test with an open hypothesis and “let’s see if
A is better or worse than B” is a subtle misappropriation of this statistical tool.
A/B tests are popular and effective in high volume retail businesses and large
digital companies: Amazon, Google, and Expedia are famous for A/B testing almost
everything they do. However, selling to millions of people a week is different to a
hotel selling 100 rooms. For A/B tests to be an effective research and management tool, it helps to be a certain kind of business: one that generates data at scale.
Setting up an A/B test:
Setting up an A/B test is straightforward. Two versions of a website are
developed. Visitors are served one of the two websites at random over the
duration of the test period. This ensures an identical population distribution
between the users that saw variant A and the users that saw variant B.
There are five rules that help to ensure a great set-up:
1. A consistent web experience for users
2. Random selection of the variant displayed for all new users
3. The test is run at the same time
4. The test must be set up by User ID and not Session ID
5. Have an agreed, clear metric to assess the test by
For example, the variant displayed must not depend upon where the user is coming from or whether or not they are logged in. This ensures that all extraneous variables influence both groups in the same way.
When setting up an A/B test, we are looking to attribute a change in a key metric to a variant of the website. For this to be possible, users must only see one variant across all their visits to the website. Therefore, the test must be set up by User ID and not by Session ID, as the Session ID changes every time a browser is closed.
The final aspect of setting up a test is formulating a hypothesis in the form of: “Variant B performs x% better than variant A when looking at this metric”. The metric used for A/B testing website design can take the form of click-through rates, bounce rates or conversion rates.
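To illustrate rules 2 and 4, here is a minimal sketch of how consistent, User ID-based variant assignment might be implemented. The function name, the experiment label and the 50/50 split are our own illustrative assumptions, not features of any particular testing tool:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "booking-widget-test") -> str:
    """Deterministically assign a user to variant A or variant B.

    Hashing the User ID (rather than the Session ID) means the same visitor
    sees the same variant on every visit, even after closing the browser.
    Salting the hash with the experiment name keeps separate tests independent.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100           # stable bucket in the range 0-99
    return "A" if bucket < 50 else "B"       # 50/50 split: control vs. test group

# The assignment never changes across sessions for the same user:
print(assign_variant("user-12345"))          # always returns the same letter
```

Because the bucket depends only on the User ID, it does not vary with where the visitor came from or whether they are logged in, so extraneous variables influence both groups in the same way.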
The measurement problem:
True uplift vs. measured uplift
So you have run an A/B test. Now you need to interpret it and see whether the hypothesis is proven or not. The key watch-out here is to distinguish between true uplift and measured uplift.
True uplift is the real uplift that can be measured in the business over the long run. Measured uplift is what you experience during the test.
In an ideal world, the measured uplift equals the true uplift. However, we don’t live in a perfect world, and that frequently leads to A/B test misinterpretation. The difference between measured and true uplift is expressed in the concept of statistical confidence.
Lies, damn lies & statistics
The chart shows a real-world example of the measurement problem. It tracks the impact of an A/A test where both websites shown to a hotel’s customers were identical. In this instance the true uplift in conversion over the 90-day period should be 0%, because there is nothing to drive any difference in performance. However, the test recorded a 17% uplift. This is because there were very few conversion events and a small number of bookings heavily skewed the measured uplift.
[Chart: measured uplift over the 90-day A/A test]
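To get a feel for how a handful of bookings can skew the measured uplift, here is a rough simulation of an A/A test. The visitor counts and the 8% conversion rate are illustrative assumptions, not the hotel’s actual figures:

```python
import random

def simulate_aa_test(visitors_per_variant: int = 1_500,
                     conversion_rate: float = 0.08) -> float:
    """Simulate one A/A test: both 'variants' are identical, so the true uplift is 0%.

    Returns the measured uplift of B over A, which is therefore pure noise.
    """
    conv_a = sum(random.random() < conversion_rate for _ in range(visitors_per_variant))
    conv_b = sum(random.random() < conversion_rate for _ in range(visitors_per_variant))
    return (conv_b - conv_a) / conv_a * 100   # measured uplift in %

random.seed(42)
runs = [simulate_aa_test() for _ in range(10)]
print([f"{uplift:+.1f}%" for uplift in runs])
# With only ~120 conversions per variant, individual runs regularly report
# "uplifts" of several percent, and sometimes more than 10%, either way,
# even though the true uplift is exactly 0%.
```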
Feeling confident?
To measure the difference between true and measured uplift, our statistical friends talk about confidence intervals and confidence levels. Imagine your boss asks you for your forecast of direct bookings for the next month.
You could say that bookings will be £1M–£1.2M for the month. Or you could say that direct bookings will be £1.2M, but you are only 90% confident that you will hit that. These are two different ways of expressing your confidence in the accuracy of your forecast: confidence intervals and confidence levels.
Statistical significance
To quantify these measures of confidence, statisticians use the concept of statistical significance. For example, a test result showing a 15% conversion uplift with 90% statistical significance means that 10 times out of 100 the 15% uplift result would have occurred due to sampling error alone.
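As a rough sketch of where such a figure comes from, the example below runs a simple two-proportion z-test. The visitor and conversion numbers are our own illustrative assumptions, chosen so the result lands near the 15% uplift / 90% significance case described above:

```python
from math import sqrt
from statistics import NormalDist

def uplift_significance(visitors_a, conv_a, visitors_b, conv_b):
    """Two-proportion z-test for an A/B test result.

    Returns the measured uplift of B over A and its (one-sided) statistical
    significance. A significance of 90% means that roughly 10 times out of 100
    an uplift at least this large would arise from sampling error alone.
    """
    rate_a, rate_b = conv_a / visitors_a, conv_b / visitors_b
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / se
    significance = NormalDist().cdf(z)
    uplift = (rate_b - rate_a) / rate_a * 100
    return uplift, significance

# 1,800 visitors per variant, converting at 8.0% (A) and ~9.2% (B)
uplift, sig = uplift_significance(1_800, 144, 1_800, 166)
print(f"Measured uplift: {uplift:.0f}%, statistical significance: {sig:.0%}")
```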
How to improve the statistical significance
When running an A/B test, we always want to ensure that the measured uplift estimates the true uplift as precisely as possible. The more data we have, the more precise the estimate. We therefore need to compute the number of conversions that will ensure an accurate result. Three things determine how many conversions are needed in order to achieve that:
1. Statistical significance
2. Minimum detectable impact
3. Conversion rate
[Chart: conversions needed vs. statistical significance (50%–100%)]
[Chart: conversions needed vs. conversion uplift (%) – 88,900 more conversions are needed to detect a 2% uplift (98,000) than a 6% uplift (9,100)]
[Chart: conversions needed vs. impact (%) – 14,000 conversions are needed to identify a 5% impact]
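The charts above come from this kind of sample-size calculation. Below is a rough sketch using the standard formula for comparing two proportions; the 80% power default and the exact formula are our own assumptions, so the totals it prints will not match the chart figures exactly, but the shape of the relationship is the same: halving the uplift you want to detect roughly quadruples the conversions you need.

```python
from math import ceil
from statistics import NormalDist

def conversions_needed(baseline_rate: float, min_uplift: float,
                       significance: float = 0.90, power: float = 0.80) -> int:
    """Approximate conversions needed across both variants of an A/B test.

    baseline_rate : conversion rate of the control, e.g. 0.08 for 8%
    min_uplift    : smallest relative uplift we care about, e.g. 0.05 for 5%
    significance  : required statistical significance (one-sided)
    power         : chance of detecting the uplift if it really exists
    """
    z_alpha = NormalDist().inv_cdf(significance)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline_rate, baseline_rate * (1 + min_uplift)
    # visitors needed per variant for a two-proportion comparison
    visitors_per_variant = ((z_alpha + z_beta) ** 2
                            * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
    total_visitors = 2 * visitors_per_variant
    return ceil(total_visitors * baseline_rate)   # expressed as conversions

# Smaller uplifts need disproportionately more data:
for uplift in (0.02, 0.05, 0.10):
    print(f"{uplift:.0%} uplift -> ~{conversions_needed(0.08, uplift):,} conversions")
```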
Back to the real world
Let’s return to the example of our 10-hotel group. They had c. 1,000 rooms, an ADR of £150, an avg. LOS of 1.5 nights and 80% occupancy. The conversion rate of their website was 8%, with an average of 2,000 conversions online each month (unfortunately a lot of business is currently coming via the OTAs). The management team wanted to run an A/B test, and they wanted to be confident that the measured impact of the platform reflected a true uplift of at least 5%.
In this instance we needed 14,000 conversions across the A/B test to reach that level of confidence. At 2,000 conversions a month, that meant running an A/B test for almost eight months!
A very expensive test
Eight months is a very long time for a team to run a test. And even worse, it costs the hotel money.
With a conversion uplift of just 5%, over the course of eight months the widget would be generating an extra 800 direct bookings with a revenue of £180,000. If half of all the customers don’t see the widget because they are in the hidden version of the website, that means the hotel group is missing out on £90,000 of direct bookings.
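For completeness, here is the back-of-the-envelope arithmetic behind those figures, using only the numbers quoted in this example:

```python
# Figures quoted above for the 10-hotel group
conversions_per_month = 2_000     # online conversions per month
test_months = 8                   # approximate duration of the A/B test
uplift = 0.05                     # 5% conversion uplift from the widget
adr = 150                         # average daily rate, £
avg_los = 1.5                     # average length of stay, nights

extra_bookings = conversions_per_month * test_months * uplift   # 800 bookings
extra_revenue = extra_bookings * adr * avg_los                  # £180,000
forgone = extra_revenue / 2       # half the visitors never see the widget

print(f"Extra direct bookings over the test: {extra_bookings:.0f}")
print(f"Extra direct revenue: £{extra_revenue:,.0f}")
print(f"Revenue forgone by hiding the widget from half of visitors: £{forgone:,.0f}")
```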
To A/B or not to A/B?
So are A/B tests a great tool for helping inform great decisions? Yes, absolutely.
Should we do them in every instance? No.
So when should we use them? The short answer is it depends.
It depends on, amongst other things:
1. The decision to be made
2. The confidence we need to make it
3. How many visitors we get to a website
4. The conversion rate
5. Whether we need segmented results across individual properties in a group – many more conversions required
6. The cost in lost revenue to run the test
Ultimately, we need to weigh up when an A/B test will help us make a better decision and when taking a decision and monitoring overall results is a better approach. Often, in fast-moving industries with a wasting inventory, good judgement and urgency are more expedient.