![Page 1: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/1.jpg)
April 13, 2023
Danielle Jabin
A/B Testing: Avoiding Common Pitfalls
![Page 2: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/2.jpg)
Make all the world’s music available instantly to
everyone, wherever and whenever they want it
![Page 3: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/3.jpg)
As of March 5, 2014
![Page 4: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/4.jpg)
Over 24 million active users
As of March 5, 2014
![Page 5: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/5.jpg)
Access to more than 20 million songs
As of March 5, 2014
![Page 6: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/6.jpg)
![Page 7: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/7.jpg)
But can we make it even easier?
![Page 8: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/8.jpg)
We can try… with A/B testing!
![Page 9: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/9.jpg)
So… what’s an A/B test?
![Page 10: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/10.jpg)
Control A
![Page 11: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/11.jpg)
Pitfall #1: Not limiting your error rate
![Page 12: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/12.jpg)
Source: assets.20bits.com/20081027/normal-curve-small.png
![Page 13: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/13.jpg)
What if I flip a coin 100 times and get 51 heads?
![Page 14: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/14.jpg)
What if I flip a coin 100 times and get 5 heads?
![Page 15: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/15.jpg)
![Page 16: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/16.jpg)
The p-value measures how likely it is to obtain a value at least as extreme as the one observed, assuming a given distribution (the null hypothesis)
![Page 17: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/17.jpg)
If there is a low likelihood that a change this large would arise by chance alone, we call our results statistically significant
![Page 18: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/18.jpg)
What if I flip a coin 100 times and get 5 heads?
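The two coin-flip questions can be answered numerically. The sketch below computes an exact two-sided p-value for a fair coin; the function name `two_sided_binomial_p` is made up for illustration and is not from the talk.

```python
from math import comb


def two_sided_binomial_p(heads: int, flips: int = 100) -> float:
    """Exact two-sided p-value under a fair coin.

    Adds up the probability of every outcome that is at least as far
    from flips/2 as the observed number of heads.
    """
    center = flips / 2
    observed_gap = abs(heads - center)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - center) >= observed_gap
    )
    return extreme_outcomes / 2 ** flips


print(two_sided_binomial_p(51))  # about 0.92: fully consistent with a fair coin
print(two_sided_binomial_p(5))   # vanishingly small: chance alone is not credible
```

With 51 heads the p-value is close to 1, so nothing looks unusual. With 5 heads it is far below any common alpha level.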
![Page 19: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/19.jpg)
Statistical significance is measured by alpha
●alpha levels of 5% and 1% are most commonly used
– Alternatively: P(significant) = .05 or .01
![Page 20: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/20.jpg)
Each alpha has a corresponding Z-score

| alpha | Z-score (two-sided test) |
| --- | --- |
| .10 | 1.65 |
| .05 | 1.96 |
| .01 | 2.58 |
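These critical values come from the inverse of the standard normal CDF. A minimal sketch using Python's standard library follows; the helper name `critical_z` is an assumption for illustration. The exact value for alpha = .10 is about 1.645, which is why tables list it as either 1.64 or 1.65.

```python
from statistics import NormalDist


def critical_z(alpha: float, two_sided: bool = True) -> float:
    """Z-score that leaves `alpha` of the probability in the tail(s)."""
    # A two-sided test splits alpha evenly between the two tails.
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)


for alpha in (0.10, 0.05, 0.01):
    print(f"alpha={alpha:.2f}  two-sided z={critical_z(alpha):.3f}")
```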
![Page 21: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/21.jpg)
The Z-score tells us how many standard deviations a particular value is from the mean (and what the corresponding likelihood is)
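As a rough illustration, the coin-flip example can be turned into Z-scores. This sketch assumes the normal approximation to the binomial (mean n·p, standard deviation √(n·p·(1−p))), which is only approximate at 100 flips. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_score(value: float, mean: float, std_dev: float) -> float:
    """Distance of `value` from `mean`, in standard deviations."""
    return (value - mean) / std_dev


def two_sided_p(z: float) -> float:
    """Probability of a Z-score at least this far from zero, in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 100 flips of a fair coin: mean 50 heads, standard deviation 5 heads.
flips, p_heads = 100, 0.5
mean = flips * p_heads
std_dev = sqrt(flips * p_heads * (1 - p_heads))

for heads in (51, 5):
    z = z_score(heads, mean, std_dev)
    print(f"{heads} heads: z={z:.1f}, two-sided p={two_sided_p(z):.3g}")
```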
![Page 22: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/22.jpg)
Source: assets.20bits.com/20081027/normal-curve-small.png
![Page 23: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/23.jpg)
Compute the Z-score at the end of the test
![Page 24: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/24.jpg)
Standard deviation (σ) tells us how spread out the
numbers are
![Page 25: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/25.jpg)
![Page 26: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/26.jpg)
To lock in error rates before you start, fix your sample
size
![Page 27: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/27.jpg)
What should my sample size be?
●To lock in error rates before you start a test, fix your sample size

$$n = \frac{2\sigma^2\,(Z_\beta + Z_{\alpha/2})^2}{\text{difference}^2}$$

●$n$: sample size in each group (assumes equal-sized groups)
●$Z_\beta$: represents the desired power (typically .84 for 80% power)
●$Z_{\alpha/2}$: represents the desired level of statistical significance (typically 1.96)
●$\sigma$: standard deviation of the outcome variable
●difference: effect size (the difference in means)
Source: www.stanford.edu/~kcobb/hrp259/lecture11.ppt
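The sample-size calculation can be sketched in a few lines. This assumes the usual two-group formula for a difference in means, n = 2σ²(Z_β + Z_α/2)² / difference², which matches the labels on this slide. The function names and default arguments are illustrative assumptions.

```python
from statistics import NormalDist


def sample_size_per_group(sigma: float, difference: float,
                          z_alpha: float = 1.96, z_power: float = 0.84) -> float:
    """Sample size needed in EACH group (equal-sized groups assumed).

    The defaults are the rounded Z-scores for a two-sided alpha of .05
    and 80% power.
    """
    return 2 * sigma ** 2 * (z_alpha + z_power) ** 2 / difference ** 2


def exact_z_values(alpha: float = 0.05, power: float = 0.80) -> tuple[float, float]:
    """Unrounded Z-scores for a two-sided alpha and the desired power."""
    normal = NormalDist()
    return normal.inv_cdf(1 - alpha / 2), normal.inv_cdf(power)


# Example inputs: standard deviation 10, smallest difference worth detecting 1.
print(sample_size_per_group(sigma=10, difference=1))

z_alpha, z_power = exact_z_values()
print(sample_size_per_group(10, 1, z_alpha, z_power))
```

The rounded Z-scores give a slightly smaller answer than the unrounded ones. In practice, round the result up to the next whole participant.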
![Page 28: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/28.jpg)
Recap: running an A/B test
●Compute your sample size
– Using alpha, beta, standard deviation of your metric, and effect size
●Run your test! But stop once you’ve reached the fixed sample size stopping point
●Compute your z-score and compare it with the z-score for the chosen alpha level
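The last step of the recap might look like the sketch below. It assumes a continuous metric and a two-sample Z-test with unpooled sample variances. The talk does not say which test statistic was used, so `two_sample_z` and `is_significant` are illustrative names rather than the speaker's method.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_z(control: list[float], variant: list[float]) -> float:
    """Z-score for the difference in means between two groups."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(variant) / len(variant)
    )
    return (mean(variant) - mean(control)) / standard_error


def is_significant(z: float, critical_z: float = 1.96) -> bool:
    """Compare |z| with the critical value (1.96 for a two-sided alpha of .05)."""
    return abs(z) >= critical_z


# Toy data, evaluated only once the fixed sample size has been reached.
control = [1, 2, 3, 4, 5] * 20
variant = [x + 1 for x in control]
z = two_sample_z(control, variant)
print(f"z={z:.2f}, significant={is_significant(z)}")
```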
![Page 29: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/29.jpg)
Control A
![Page 30: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/30.jpg)
Resulting Z-score?
![Page 31: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/31.jpg)
33.3
![Page 32: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/32.jpg)
Pitfall #2: Stopping your test before the fixed sample size stopping point
![Page 33: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/33.jpg)
Sample size for varying alpha levels
●With σ = 10, difference in means = 1

| | Two-sided test |
| --- | --- |
| alpha = .10, power = .80 | 1230 |
| alpha = .05, power = .80 | 1568 |
| alpha = .01, power = .80 | 2339 |
![Page 34: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/34.jpg)
Let’s see some numbers
●1,000 experiments with 200,000 fake participants divided randomly into two groups both receiving the exact same version, A, with a 3% conversion rate

| | Stop at first point of significance | Ended as significant |
| --- | --- | --- |
| 90% significance reached | 654 of 1,000 | 100 of 1,000 |
| 95% significance reached | 427 of 1,000 | 49 of 1,000 |
| 99% significance reached | 146 of 1,000 | 14 of 1,000 |

Source: destack.home.xs4all.nl/projects/significance/
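A scaled-down version of this kind of A/A simulation can be sketched as follows. It is not the code behind the cited numbers. The experiment count, group size, peeking interval, pooled two-proportion Z-score, and function names are all assumptions chosen to keep the run short.

```python
import random
from math import sqrt


def z_for_proportions(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion Z-score; returns 0 when no variation exists yet."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 0.0
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (conv_b / n_b - conv_a / n_a) / standard_error


def simulate_peeking(experiments: int = 200, per_group: int = 2000,
                     check_every: int = 100, rate: float = 0.03,
                     critical_z: float = 1.96, seed: int = 0) -> tuple[int, int]:
    """Run A/A tests and count false positives two ways.

    Returns (significant at ANY peek, significant at the fixed sample size).
    `per_group` should be a multiple of `check_every` so that the last peek
    coincides with the fixed stopping point.
    """
    rng = random.Random(seed)
    stopped_early = ended_significant = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        ever_significant = False
        z = 0.0
        for n in range(1, per_group + 1):
            # Both groups get the same version, so any "win" is a false positive.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % check_every == 0:
                z = z_for_proportions(conv_a, n, conv_b, n)
                ever_significant = ever_significant or abs(z) >= critical_z
        stopped_early += ever_significant
        ended_significant += abs(z) >= critical_z
    return stopped_early, ended_significant


if __name__ == "__main__":
    early, final = simulate_peeking()
    print(f"significant at some peek: {early}, significant at the end: {final}")
```

Because the final check is itself one of the peeks, the first count can never be smaller than the second. The gap between the two is the cost of stopping at the first significant result.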
![Page 35: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/35.jpg)
Remedies
●Don’t peek
●Okay, maybe you can peek, but don’t stop or make a decision before you reach the fixed sample size stopping point
●Sequential sampling
![Page 36: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/36.jpg)
Control A B
![Page 37: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/37.jpg)
Pitfall #3: Making multiple comparisons in one test
![Page 38: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/38.jpg)
A test can be one of two things: significant or not significant
●P(significant) + P(not significant) = 1
●Let’s take an alpha of .05
– P(significant) = .05
– P(not significant) = 1 - P(significant) = 1 - .05 = .95
![Page 39: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/39.jpg)
What about for two comparisons?
●P(at least 1 significant) = 1 - P(none of the 2 are significant)
●P(none of the 2 are significant) = P(not significant)*P(not significant) = .95*.95 = .9025
●P(at least 1 significant) = 1 - .9025 = .0975
![Page 40: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/40.jpg)
What about for two comparisons?
●That’s almost 2x (1.95x, to be precise) your .05 significance rate!
![Page 41: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/41.jpg)
And it just gets worse…

| | P(at least 1 significant) | An increase of… |
| --- | --- | --- |
| 5 variations | 1 - (1-.05)^5 = .23 | 4.6x |
| 10 variations | 1 - (1-.05)^10 = .40 | 8x |
| 20 variations | 1 - (1-.05)^20 = .64 | 12.8x |
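The same arithmetic works for any number of comparisons. This sketch assumes, as the slides do, that the comparisons are independent. The function name is illustrative.

```python
def family_wise_error_rate(alpha: float, comparisons: int) -> float:
    """P(at least 1 significant) across independent comparisons with no real effect."""
    return 1 - (1 - alpha) ** comparisons


for n in (1, 2, 5, 10, 20):
    rate = family_wise_error_rate(0.05, n)
    print(f"{n:>2} comparisons: {rate:.4f} ({rate / 0.05:.1f}x the single-test rate)")
```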
![Page 42: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/42.jpg)
How can we remedy this?
●Bonferroni correction
– Divide P(significant), your alpha, by the number of variations you are testing, n
– alpha/n becomes the new level of statistical significance
![Page 43: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/43.jpg)
So what about two comparisons now?
●Our new P(significant) = .05/2 = .025
●Our new P(not significant) = 1 - .025 = .975
●P(at least 1 significant) = 1 - P(none of the 2 are significant)
●P(none of the 2 are significant) = P(not significant)*P(not significant) = .975*.975 = .951
●P(at least 1 significant) = 1 - .951 = .049
![Page 44: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/44.jpg)
P(significant) stays under .05

| | Corrected alpha | P(at least 1 significant) |
| --- | --- | --- |
| 5 variations | .05/5 = .01 | 1 - (1-.01)^5 = .049 |
| 10 variations | .05/10 = .005 | 1 - (1-.005)^10 = .049 |
| 20 variations | .05/20 = .0025 | 1 - (1-.0025)^20 = .049 |
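A short check of the Bonferroni correction, again assuming independent comparisons. Both function names are hypothetical.

```python
def bonferroni_alpha(alpha: float, comparisons: int) -> float:
    """Per-comparison significance level after the Bonferroni correction."""
    return alpha / comparisons


def family_wise_error_rate(per_test_alpha: float, comparisons: int) -> float:
    """P(at least 1 significant) across independent comparisons with no real effect."""
    return 1 - (1 - per_test_alpha) ** comparisons


for n in (2, 5, 10, 20):
    corrected = bonferroni_alpha(0.05, n)
    overall = family_wise_error_rate(corrected, n)
    print(f"{n:>2} variations: corrected alpha={corrected:.4f}, overall rate={overall:.4f}")
```

For every row, the overall rate stays just below the original .05.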
![Page 45: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/45.jpg)
Questions?
![Page 46: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/46.jpg)
Appendix
![Page 47: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/47.jpg)
A/B test steps:
1. Decide what to test
2. Determine a metric to test
3. Formulate your hypothesis
   1. Select an effect size threshold: what change of the metric would make a rollout worthwhile?
4. Calculate sample size (your stopping point)
   1. Decide your Type I (alpha) and Type II (beta) error levels and the corresponding z-scores
   2. Determine the standard deviation of your metric
5. Run your test! But stop once you’ve reached the fixed sample size stopping point
6. Compute your z-score and compare it with the z-score for your chosen alpha level
![Page 48: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/48.jpg)
Type I and Type II error
●Type I error: incorrectly reject a true null hypothesis
– alpha
●Type II error: fail to reject a false null hypothesis
– beta
– Power: 1 - beta
![Page 49: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/49.jpg)
Z-score reference table

| alpha | One-sided test | Two-sided test |
| --- | --- | --- |
| .10 | 1.28 | 1.65 |
| .05 | 1.65 | 1.96 |
| .01 | 2.33 | 2.58 |
![Page 50: A/B Testing Pitfalls and Lessons Learned at Spotify](https://reader035.vdocuments.net/reader035/viewer/2022062709/558e0a301a28abaa178b4588/html5/thumbnails/50.jpg)
Z-score for proportions (e.g. conversion)
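The formula on this slide did not survive in the transcript. One common choice for conversion-style metrics is the pooled two-proportion Z-score, sketched below. It is an assumption that the slide showed this form, and the function name is illustrative.

```python
from math import sqrt


def two_proportion_z(conversions_a: int, users_a: int,
                     conversions_b: int, users_b: int) -> float:
    """Pooled two-proportion Z-score.

    z = (p_b - p_a) / sqrt(p * (1 - p) * (1/n_a + 1/n_b)),
    where p is the conversion rate of both groups combined.
    """
    p_a = conversions_a / users_a
    p_b = conversions_b / users_b
    pooled = (conversions_a + conversions_b) / (users_a + users_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / users_a + 1 / users_b))
    return (p_b - p_a) / standard_error


# Hypothetical numbers: 3% conversion in control versus 5% in the variant.
print(round(two_proportion_z(30, 1000, 50, 1000), 2))
```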