Sample Size – The Indispensable A/B Test Calculation That You’re Not Making
DESCRIPTION
If you’re a marketer, it’s very likely that you’ve run an A/B test. It’s also likely that you’ve never calculated the sample size for your tests, and instead run tests until they reach statistical significance. If so, your strategy is statistically flawed. Respecting sample size requires marketers to wait longer for test results, but ignoring it will yield false positives and lead to bad decisions. This deck was created for an email audience, but there are valuable lessons for anyone who runs A/B tests.
TRANSCRIPT
Sample Size
The indispensable A/B test calculation
that you’re not making.
As Marketers, many of us run A/B Tests
We test copy
We test design
We test subject lines
We choose winners
Version A is converting better than Version B and statistical significance
has breached 95%.
So, Version A won.
OR DID IT?
That math is half-baked
Suppose you check an A/B Test twice: Once after 200 impressions and then after 500.
Then you end the test.
Now, instead, suppose you stop the experiment as soon as there is a significant result:
FALSE POSITIVE!
How often will you get a false positive?
26.1%
So you just went from 95% confidence to 74%.
This is a worst-case scenario. BUT, some test platforms do this automatically!
Assuming you check results after every impression and stop once you reach significance…
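The inflation from repeated peeking is easy to reproduce. Below is a toy simulation (illustrative, not the deck's own math): it runs A/A tests — both arms drawn from the same distribution, so there is no real difference — and stops at the first "significant" peek. The false-positive rate climbs well above the nominal 5%:

```python
import random
import math
from statistics import NormalDist

def peeking_false_positive_rate(n_sims=400, n_per_arm=2000, peek_every=50,
                                alpha=0.05, seed=0):
    """Simulate A/A tests and stop at the first 'significant' peek.
    Returns the fraction of simulations that falsely declare a winner."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for two-sided 95%
    false_positives = 0
    for _ in range(n_sims):
        sum_a = sum_b = 0.0
        for i in range(1, n_per_arm + 1):
            # both arms come from the SAME distribution: any "winner" is noise
            sum_a += rng.gauss(0, 1)
            sum_b += rng.gauss(0, 1)
            if i % peek_every == 0:
                # two-sample z statistic with known sd = 1
                z = (sum_a - sum_b) / math.sqrt(2 * i)
                if abs(z) > z_crit:
                    false_positives += 1
                    break
        # a disciplined tester would test only once, at i == n_per_arm
    return false_positives / n_sims
```

With 40 peeks per test, the simulated false-positive rate lands in the 20–30% range — the same order as the deck's 26.1% figure for checking after every impression.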
OK…well, then when should I stop an A/B test?
SAMPLE SIZE
Dictates how long to run a test
SAMPLE SIZE
• Used religiously in the pharmaceutical industry, economic studies, etc.
https://www.optimizely.com/resources/sample-size-calculator
Agenda
1. How we put this into practice on a website test
2. How we applied these learnings to email testing:
• Open rates
• Click to Open Rates
• Conversion Rates
A/B Testing on Your Website
Here’s your new test process:
1. Determine your baseline conversion rate (or click rate, or download rate, etc.)
2. Decide how long you are willing to wait for a result. Convert your unique traffic metric to a sample size.
3. Adjust MDE (Minimum Detectable Effect) until your Sample Size is just under the target you determined in #2 above.
4. Re-adjust MDE until you are content.
5. Start the test, and don’t stop until you hit the sample size.
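The process above hinges on the sample size calculation itself. Here is a minimal sketch using the classic two-proportion formula (note: Optimizely's calculator, linked earlier, uses its own method, so its numbers will differ somewhat):

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline_rate, mde_relative,
                              alpha=0.05, power=0.80):
    """Classic two-proportion sample size, per variation.
    mde_relative is the minimum detectable effect as a relative lift,
    e.g. 0.10 means 'detect a 10% lift over the baseline rate'."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_relative)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)
```

Because the effect size enters squared, halving the MDE roughly quadruples the required sample — which is why step 4 ("re-adjust MDE until you are content") is a genuine trade-off, not a formality.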
Case Study: Item Urgency
TEST (VERSION A): INVENTORY NOTIFICATION
CONTROL (VERSION B): NO INVENTORY NOTIFICATION
STEP 1 – We determined our baseline conversion rate
STEP 2 – Calculate Target Sample Size
We initially decided we wanted a result in 2 weeks. So we took the last 2 weeks of unique product page views:
STEP 2 – Calculate Target Sample Size
We then divided that number by two (since we’ll have two test segments)
Divided by two again to account for desktop traffic only
Then multiplied by 5% (since the message only displays on 5% of product pages)
Sample Size -> 12,351
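The step-2 arithmetic, spelled out. The starting traffic figure is back-calculated from the deck's final 12,351 (the deck shows only the result, so ~988,080 is an inferred number):

```python
# Back-solved from the deck's 12,351: 12,351 / 0.05 * 2 * 2 = 988,080
two_weeks_unique_views = 988_080

per_segment = two_weeks_unique_views / 2  # two test segments (A and B)
desktop_only = per_segment / 2            # account for desktop traffic only
eligible = desktop_only * 0.05            # message shows on 5% of product pages

print(round(eligible))  # -> 12351
```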
This gave us a 30% MDE (conversion lift), which is unrealistic.
How about 10% ?
107,105 unique visits ~ 17 weeks
Wow, that’s a long time…
Yep.
You’re probably not running your tests long enough.
WAIT A MINUTE.
MY A/B TEST PLATFORM SAYS NOTHING ABOUT SAMPLE SIZE…
EVERYONE WANTS INSTANT GRATIFICATION
YOUR A/B TEST PLATFORM IS HAPPY TO SELL IT
Quietly assuming you have calculated sample size on your own
Item Urgency – Test Results
We are over 4 weeks in…
*Conv. rate is higher than expected because the test platform uses a 7-day conversion window.
Lift is over 10%
Note the spike in the beginning and the increased stabilization with time
Item Urgency - Test Results
The effect is slowly approaching the MDE
Test Results
Significance is now over 95%, but it’s been up and down.
Many marketers would stop the test on 9/5 and declare a 57% lift.
Test Results
Email Testing
After learning about Sample Size, we reconsidered our email testing strategy
• Open Rate (Subject line testing)
• Click-to-Open (CTO) Rate
• Conversion Rate
OPEN RATE
We used sample size to gut check the size of our subject line test segments
OPEN RATE
Remember: the sample size calculator takes the baseline conversion rate and the sample size, and gives you the MDE.
OPEN RATE
First, we needed the baseline open rate.
OPEN RATE
Our open rates typically end up around 17%, but when we make the call on our winning subject line, open rates are usually around 7%.
OPEN RATE
Next we need the sample size
OPEN RATE
We always test 4 different subject lines.
We had been sending each subject line to 10,000 customers.
So, sample size ~ 10,000
OPEN RATE
Plugging these numbers in, this would only detect a 13% open-rate lift or higher.
OPEN RATE
13% lift on 17% open rate is 19.2%.
We rarely see subject lines perform this well.
We needed a lower MDE to make sure we could detect more winners…
OPEN RATE
We ended up doubling our subject line segment to 80,000, giving us an MDE ~ 9.2%
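You can sanity-check segment sizes like this yourself by inverting the classic two-proportion formula with a bisection search. This is an illustrative sketch, not the Optimizely calculator the deck used, so its MDE will differ slightly from the deck's 9.2% — but the direction of the check (bigger segments, smaller detectable lift) holds either way:

```python
import math
from statistics import NormalDist

def required_n(p1, mde_rel, alpha=0.05, power=0.80):
    # classic two-proportion sample size, per variation
    p2 = p1 * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    num = (za * math.sqrt(2 * p_bar * (1 - p_bar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p2 - p1) ** 2

def detectable_lift(p1, n_per_arm, lo=0.001, hi=2.0):
    """Bisect for the smallest relative lift detectable with n_per_arm."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if required_n(p1, mid) > n_per_arm:
            lo = mid  # effect too small to detect with this n; need bigger
        else:
            hi = mid
    return hi
```

At a 7% baseline, doubling each subject-line segment from 10,000 to 20,000 recipients shrinks the detectable lift from roughly 15% to roughly 10% under this formula.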
CTO
First we needed the baseline
CTO
We averaged the last 10 weeks -> 11% CTO
CTO
Sample size = ½ of the avg opens count
CTO
We averaged the last 10 weeks -> avg opens = 107,000, so sample size = 107,000 / 2 = 53,500
CTO
4.4% CTO lift is a very reasonable goal for a test.
This showed us that we could trust most of the results of our past CTO tests.
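A rough cross-check with the classic two-proportion formula (the deck used Optimizely's calculator, whose method differs, so the exact figures diverge a bit): at an 11% baseline CTO, detecting a 4.4% relative lift needs on the order of 53,000–67,000 opens per arm depending on the formula — so the deck's segments really are in the right ballpark:

```python
import math
from statistics import NormalDist

def required_n(p1, mde_rel, alpha=0.05, power=0.80):
    # classic two-proportion sample size, per variation
    p2 = p1 * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    num = (za * math.sqrt(2 * p_bar * (1 - p_bar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# 11% baseline CTO, 4.4% relative lift (the deck's MDE)
n = required_n(0.11, 0.044)
```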
GRID vs. FREE FORM
15.7% CTO Lift
PRODUCT NAMES vs. NO PRODUCT NAMES
22.6% CTO Lift
Conversion Rate
We had been making many email decisions after reaching significance on a conversion-rate lift.
Conversion Rate
Time for a reality check.
Conversion Rate
Baseline Conversion Rate ~ 1.5%
Conversion Rate
Sample Size = ½ Average # Clicks -> 6,000
Conversion Rate
38% is ASTRONOMICAL
Conversion Rate
To get meaningful results for conversion rate, consider running an email test many times, so that you can eventually reach the necessary sample size.
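One way to plan those repeats: pick a target MDE, compute the required sample with the classic two-proportion formula, and count how many sends it takes to accumulate it. The 15% lift below is a hypothetical target (the deck does not state one); the ~6,000 clicks per arm per send comes from the deck:

```python
import math
from statistics import NormalDist

def required_n(p1, mde_rel, alpha=0.05, power=0.80):
    # classic two-proportion sample size, per variation
    p2 = p1 * (1 + mde_rel)
    p_bar = (p1 + p2) / 2
    za = NormalDist().inv_cdf(1 - alpha / 2)
    zb = NormalDist().inv_cdf(power)
    num = (za * math.sqrt(2 * p_bar * (1 - p_bar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p2 - p1) ** 2)

# 1.5% baseline conversion rate; 15% relative lift is a hypothetical target
target = required_n(0.015, 0.15)

# From the deck: roughly 6,000 clicks per arm per send
sends = math.ceil(target / 6_000)
```

Under these assumptions the same test would need to run across roughly nine sends before the clicks add up to the required sample — consistent with the deck's conclusion that single-send conversion tests fall far short.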
Takeaways
This is the MDE curve again. Remember what this looks like.
The longer you run a test, the lower the MDE will be.
The more traffic volume you have, the faster MDE will drop
Takeaways
For Web Testing
• If you stop your A/B tests once you reach statistical significance, you are increasing your chances of finding false positives
• Calculating sample size will give you a clear stop date and an MDE
• MDE and sample size are inversely related – The lower the MDE, the larger the sample size
• Most likely, your A/B tests need to run much longer than you realize
For Email Testing
• Use sample size to determine the size of your subject line test segments
• Your CTO tests are probably reaching the necessary sample size
• Your Conversion tests are probably not hitting sample size
Sources
Kyle Rush – Mozcon 2014 Presentation
https://seomoz.box.com/shared/static/2fw6yevkkmmdumz431j4.pdf
Evan Miller – How Not to Run an A/B Test
http://www.evanmiller.org/how-not-to-run-an-ab-test.html
Zack Notes
Digital Marketing Manager
@zacknotes
slideshare.net/zacknotes1/presentations
Appendix
GRID vs. FREE FORM
PRODUCT NAMES vs. NO PRODUCT NAMES
What do you do if a test reaches sample size and your lift < MDE?
You can either extend the test (accepting a lower MDE) or move on.