Sample Size – The Indispensable A/B Test Calculation That You’re Not Making
DESCRIPTION
If you’re a marketer, it’s very likely that you’ve run an A/B test. It’s also likely that you’ve never calculated the sample size for your tests, and instead run tests until they reach statistical significance. If so, your strategy is statistically flawed. Respecting sample size requires marketers to wait longer for test results, but ignoring it will yield false positives and lead to bad decisions. This deck was created for an email audience, but there are valuable lessons for anyone who runs A/B tests.
TRANSCRIPT
Sample Size
The indispensable A/B test calculation
that you’re not making.
As Marketers, many of us run A/B Tests
We test copy
We test design
We test subject lines
We choose winners
Version A is converting better than Version B and statistical significance
has breached 95%.
So, Version A won.
OR DID IT?
That math is half-baked
Suppose you check an A/B Test twice: Once after 200 impressions and then after 500.
Then you end the test.
Now, instead, suppose you stop the experiment as soon as there is a significant result:
FALSE POSITIVE!
How often will you get a false positive?
26.1%
So you just went from 95% confidence to 74%.
This is a worst-case scenario. BUT, some test platforms do this automatically!
Assuming you check results after every impression and stop once you reach significance…
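The inflation from repeated peeking is easy to reproduce. Below is a toy simulation (illustrative, not the deck's own math): it runs A/A tests — both arms drawn from the same distribution, so there is no real difference — and stops at the first "significant" peek. The false-positive rate climbs well above the nominal 5%:

```python
import random
import math
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=400, n_per_arm=2000, peek_every=50,
                                alpha=0.05, seed=0):
    """Simulate A/A tests and stop at the first 'significant' peek.
    Returns the fraction of simulations that falsely declare a winner."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for two-sided 95%
    false_positives = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        for i in range(1, n_per_arm + 1):
            # both arms come from the SAME distribution: any "winner" is noise
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if i % peek_every == 0:
                # two-sample z statistic with known sd = 1
                z = (sum_a - sum_b) / math.sqrt(2 * i)
                if abs(z) > z_crit:
                    false_positives += 1
                    break
        # a disciplined tester would test only once, at i == n_per_arm
    return false_positives / n_sims
```

With 40 peeks per test, the simulated false-positive rate lands in the 20–30% range — the same order as the deck's 26.1% figure for checking after every impression.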
OK…well, then when should I stop an A/B test?
SAMPLE SIZE
Dictates how long to run a test
SAMPLE SIZE
• Used religiously in the pharmaceutical industry, economic studies, etc.
https://www.optimizely.com/resources/sample-size-calculator
Agenda
1. How we put this into practice on a website test
2. How we applied these learnings to email testing:
• Open rates
• Click to Open Rates
• Conversion Rates
A/B Testing on Your Website
Here’s your new test process:
1. Determine your baseline conversion rate (or click rate, or download rate, etc.)
2. Decide how long you are willing to wait for a result. Convert your unique traffic metric to a sample size.
3. Adjust MDE (Minimum Detectable Effect) until your Sample Size is just under the target you determined in #2 above.
4. Re-adjust MDE until you are content.
5. Start the test, and don’t stop until you hit the sample size.
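The process above hinges on the sample size calculation itself. Here is a minimal sketch using the classic two-proportion formula (note: Optimizely's calculator, linked earlier, uses its own method, so its numbers will differ somewhat):

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, mde_relative,
                              alpha=0.05, power=0.80):
    """Classic two-proportion sample size, per variation.
    mde_relative is the minimum detectable effect as a relative lift,
    e.g. 0.10 means 'detect a 10% lift over the baseline rate'."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

Because the effect size enters squared, halving the MDE roughly quadruples the required sample — which is why step 4 ("re-adjust MDE until you are content") is a genuine trade-off, not a formality.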
Case Study: Item Urgency
TEST (VERSION A): INVENTORY NOTIFICATION
CONTROL (VERSION B): NO INVENTORY NOTIFICATION
STEP 1 – We determined our baseline conversion rate
STEP 2 – Calculate Target Sample Size
We initially decided we wanted a result in 2 weeks. So we took the last 2 weeks of unique product page views:
STEP 2 – Calculate Target Sample Size
We then divided that number by two (since we’ll have two test segments)
Divided by two again to account for desktop traffic only
Then multiplied by 5% (since the message only displays on 5% of product pages)
Sample Size -> 12,351
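The step-2 arithmetic, spelled out. The starting traffic figure is back-calculated from the deck's final 12,351 (the deck shows only the result, so ~988,080 is an inferred number):

```python
# Back-solved from the deck's 12,351: 12,351 / 0.05 * 2 * 2 = 988,080
two_weeks_unique_views = 988_080

per_segment = two_weeks_unique_views / 2  # two test segments (A and B)
desktop_only = per_segment / 2            # account for desktop traffic only
eligible = desktop_only * 0.05            # message shows on 5% of product pages

print(round(eligible))  # -> 12351
```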
This gave us a 30% MDE (conversion lift), which is unrealistic.
How about 10% ?
107,105 unique visits ~ 17 weeks
Wow, that’s a long time…
Yep.
You’re probably not running your tests long enough.
WAIT A MINUTE.
MY A/B TEST PLATFORM SAYS NOTHING ABOUT SAMPLE SIZE…
EVERYONE WANTS INSTANT GRATIFICATION
YOUR A/B TEST PLATFORM IS HAPPY TO SELL IT
Quietly assuming you have calculated sample size on your own
Item Urgency – Test Results
We are over 4 weeks in…
*Conv. rate is higher than expected because the test platform uses a 7-day conversion window.
Lift is over 10%
Note the spike in the beginning and the increased stabilization with time
Item Urgency - Test Results
The effect is slowly approaching the MDE
Test Results
Significance is now over 95%, but it’s been up and down.
Many marketers would stop the test on 9/5 and declare a 57% lift.
Test Results
Email Testing
After learning about Sample Size, we reconsidered our email testing strategy
• Open Rate (Subject line testing)
• Click-to-Open (CTO) Rate
• Conversion Rate
OPEN RATE
We used sample size to gut check the size of our subject line test segments
OPEN RATE
Remember: the sample size calculator takes the baseline conversion rate and the sample size, and gives you the MDE.
OPEN RATE
First, we needed the baseline open rate.
OPEN RATE
Our open rates typically end up around 17%, but when we make the call on our winning subject line, open rates are usually around 7%.
OPEN RATE
Next we need the sample size
OPEN RATE
We always test 4 different subject lines.
We had been sending each subject line to 10,000 customers.
So, sample size ~ 10,000
OPEN RATE
Plugging these numbers in, this would only detect a 13% open-rate lift or higher.
OPEN RATE
13% lift on 17% open rate is 19.2%.
We rarely see subject lines perform this well.
We needed a lower MDE to make sure we could detect more winners…
OPEN RATE
We ended up doubling our subject line segment to 80,000, giving us an MDE ~ 9.2%
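You can sanity-check segment sizes like this yourself by inverting the classic two-proportion formula with a bisection search. This is an illustrative sketch, not the Optimizely calculator the deck used, so its MDE will differ slightly from the deck's 9.2% — but the direction of the check (bigger segments, smaller detectable lift) holds either way:

```python
import math
from statistics import NormalDist

def required_n(p1, mde_rel, alpha=0.05, power=0.80):
    # classic two-proportion sample size, per variation
    p2 = p1 * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    num = (za * math.sqrt(2 * p_bar * (1 - p_bar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p2 - p1) ** 2

def detectable_lift(p1, n_per_arm, lo=0.001, hi=2.0):
    """Bisect for the smallest relative lift detectable with n_per_arm."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(p1, mid) > n_per_arm:
            lo = mid  # effect too small to detect with this n; need bigger
        else:
            hi = mid
    return hi
```

At a 7% baseline, doubling each subject-line segment from 10,000 to 20,000 recipients shrinks the detectable lift from roughly 15% to roughly 10% under this formula.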
CTO
First we needed the baseline
CTO
We averaged the last 10 weeks -> 11% CTO
CTO
Sample size = ½ of the avg opens count
CTO
We averaged the last 10 weeks -> avg opens = 107,000, so sample size = 107,000 / 2 = 53,500
CTO
4.4% CTO lift is a very reasonable goal for a test.
This showed us that we could trust most of the results of our past CTO tests.
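A rough cross-check with the classic two-proportion formula (the deck used Optimizely's calculator, whose method differs, so the exact figures diverge a bit): at an 11% baseline CTO, detecting a 4.4% relative lift needs on the order of 53,000–67,000 opens per arm depending on the formula — so the deck's segments really are in the right ballpark:

```python
import math
from statistics import NormalDist

def required_n(p1, mde_rel, alpha=0.05, power=0.80):
    # classic two-proportion sample size, per variation
    p2 = p1 * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    num = (za * math.sqrt(2 * p_bar * (1 - p_bar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# 11% baseline CTO, 4.4% relative lift (the deck's MDE)
n = required_n(0.11, 0.044)
```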
GRID vs. FREE FORM
15.7% CTO Lift
PRODUCT NAMES vs. NO PRODUCT NAMES
22.6% CTO Lift
Conversion Rate
We had been making many email decisions after reaching significance on a conversion-rate lift.
Conversion Rate
Time for a reality check.
Conversion Rate
Baseline Conversion Rate ~ 1.5%
Conversion Rate
Sample Size = ½ Average # Clicks -> 6,000
Conversion Rate
38% is ASTRONOMICAL
Conversion Rate
To get meaningful results for conversion rate, consider running an email test many times, so that you can eventually reach the necessary sample size.
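One way to plan those repeats: pick a target MDE, compute the required sample with the classic two-proportion formula, and count how many sends it takes to accumulate it. The 15% lift below is a hypothetical target (the deck does not state one); the ~6,000 clicks per arm per send comes from the deck:

```python
import math
from statistics import NormalDist

def required_n(p1, mde_rel, alpha=0.05, power=0.80):
    # classic two-proportion sample size, per variation
    p2 = p1 * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    num = (za * math.sqrt(2 * p_bar * (1 - p_bar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# 1.5% baseline conversion rate; 15% relative lift is a hypothetical target
target = required_n(0.015, 0.15)

# From the deck: roughly 6,000 clicks per arm per send
sends = math.ceil(target / 6_000)
```

Under these assumptions the same test would need to run across roughly nine sends before the clicks add up to the required sample — consistent with the deck's conclusion that single-send conversion tests fall far short.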
Takeaways
This is the MDE curve again. Remember what this looks like.
The longer you run a test, the lower the MDE will be.
The more traffic volume you have, the faster MDE will drop
Takeaways
For Web Testing
• If you stop your A/B tests once you reach statistical significance, you are increasing your chances of finding false positives
• Calculating sample size will give you a clear stop date and an MDE
• MDE and sample size are inversely related – The lower the MDE, the larger the sample size
• Most likely, your A/B tests need to run much longer than you realize
For Email Testing
• Use sample size to determine the size of your subject line test segments
• Your CTO tests are probably reaching the necessary sample size
• Your Conversion tests are probably not hitting sample size
Sources
Kyle Rush – Mozcon 2014 Presentation
https://seomoz.box.com/shared/static/2fw6yevkkmmdumz431j4.pdf
Evan Miller – How Not to Run an A/B Test
http://www.evanmiller.org/how-not-to-run-an-ab-test.html
Zack Notes
Digital Marketing Manager
@zacknotes
slideshare.net/zacknotes1/presentations
Appendix
GRID vs. FREE FORM
PRODUCT NAMES vs. NO PRODUCT NAMES
What do you do if a test reaches sample size and your lift < MDE?
You can either extend the test (accepting a lower MDE) or move on.