Opticon 2017: Running Experiment Engines with Stats Engine
TRANSCRIPT
![Page 2: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/2.jpg)
Agenda
1. Why we built Stats Engine
2. How to make decisions with Stats Engine
3. How to scale your decision process
opticon2017
![Page 3: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/3.jpg)
Why we built Stats Engine
![Page 4: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/4.jpg)
![Page 5: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/5.jpg)
![Page 6: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/6.jpg)
The study followed 1,291 participants for 10 years.
No exercise: 438 with 128 deaths (29%)
Light exercise: 576 with 7 deaths (1%)
Moderate exercise: 262 with 8 deaths (3%)
Heavy exercise: 40 with 2 deaths (5%)
![Page 7: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/7.jpg)
“Thank goodness a third person didn't die, or public health
authorities would be banning jogging.”
– Alex Hutchinson, Runner’s World
![Page 8: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/8.jpg)
![Page 9: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/9.jpg)
![Page 10: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/10.jpg)
“A/A” results
![Page 11: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/11.jpg)
The “T-test” (a.k.a. “NHST”, a.k.a. “Student T-test”)
The T-test in a nutshell
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.
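The three steps above can be sketched in code. This is a minimal illustration, not the exact test from the slides: it assumes conversion-rate data and uses the large-sample normal approximation (a pooled two-proportion z-test) in place of a true Student T-test. The function names and the `required_n` parameter are hypothetical.

```python
import math

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """p-value of a pooled two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation in the data, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))

def fixed_horizon_decision(conv_a, n_a, conv_b, n_b, required_n, alpha=0.05):
    """Step 1: wait for the required sample size. Steps 2-3: test once."""
    if min(n_a, n_b) < required_n:
        return "keep running"
    p = two_sided_p_value(conv_a, n_a, conv_b, n_b)
    return "significant" if p < alpha else "inconclusive"

# Hypothetical numbers: 100/1000 conversions in control, 130/1000 in variation.
print(fixed_horizon_decision(100, 1000, 130, 1000, required_n=1000))
```

The decision is only valid if it is made once, at the planned sample size; the early-exit branch is what makes this a fixed-horizon test.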
![Page 12: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/12.jpg)
1908
Data is expensive.
Data is slow.
Practitioners are trained.
2017
Data is cheap.
Data is real-time.
Practitioners are everyone.
The T-test was designed for the world of 1908.
![Page 13: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/13.jpg)
T-Test Pitfalls
1. Peeking
2. Multiple comparisons
![Page 14: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/14.jpg)
1. Peeking
![Page 15: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/15.jpg)
Figure: timeline along a Time axis from “Experiment Starts” to “Min Sample Size”, with the p-value checked at four points. Three checks read “p-Value > 5%. Inconclusive.” and one reads “p-Value < 5%. Significant!”
![Page 16: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/16.jpg)
Why is this a problem?
There is a ~5% chance of seeing a false positive each time you peek.
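A small A/A simulation makes the problem concrete. This is a sketch under stated assumptions: both arms share the same true conversion rate, each peek runs a pooled two-proportion z-test on all data collected so far, and the peek count, visitors per peek, and conversion rate are hypothetical values chosen for illustration.

```python
import math
import random

def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """p-value of a pooled two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_aa(n_sims=1000, peeks=4, visitors_per_peek=250,
                rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate if you stop at any significant peek,
    false positive rate if you only test at the final sample size)."""
    rng = random.Random(seed)
    any_peek_hits = 0
    final_only_hits = 0
    for _sim in range(n_sims):
        conv_a = conv_b = n = 0
        significant_at_some_peek = False
        p = 1.0
        for _peek in range(peeks):
            for _visitor in range(visitors_per_peek):
                conv_a += rng.random() < rate  # A/A: both arms use the same rate
                conv_b += rng.random() < rate
            n += visitors_per_peek
            p = two_sided_p_value(conv_a, n, conv_b, n)
            if p < alpha:
                significant_at_some_peek = True
        any_peek_hits += significant_at_some_peek
        final_only_hits += p < alpha  # p now holds the final-peek value
    return any_peek_hits / n_sims, final_only_hits / n_sims

peeking_rate, final_only_rate = simulate_aa()
print(f"stop at first significant peek: {peeking_rate:.1%}")
print(f"test once at the end:           {final_only_rate:.1%}")
```

Since there is no real difference between the arms, every "significant" result here is a false positive. The exact rates depend on how many peeks are taken and how they are spaced.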
![Page 17: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/17.jpg)
Figure: timeline along a Time axis from “Experiment Starts” to “Min Sample Size”, with the p-value checked at four points. Three checks read “p-Value > 5%. Inconclusive.” and one reads “p-Value < 5%. Significant!”
4 peeks → ~18% chance of seeing a false positive
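The ~18% figure follows from simple arithmetic if each peek is treated as an independent 5% chance of a false positive. That independence is a simplifying assumption, since successive peeks reuse the same data. The function name below is hypothetical.

```python
def chance_of_false_positive(peeks, alpha=0.05):
    """Chance of at least one false positive across independent peeks."""
    return 1 - (1 - alpha) ** peeks

for peeks in (1, 4, 10):
    print(peeks, f"{chance_of_false_positive(peeks):.1%}")
# 4 peeks gives 1 - 0.95**4 = 0.1855, the "~18%" on the slide
```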
![Page 18: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/18.jpg)
The “T-test” (a.k.a. “NHST”, a.k.a. “Student T-test”)
The T-test in a nutshell
1. Run your experiment until you have reached the required sample size, and then stop.
2. Ask “What are the chances I’d have gotten these results in an A/A test?” (p-value)
3. If p-value < 5%, your results are significant.
![Page 19: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/19.jpg)
1:45 2:45 3:45 4:45 5:45
![Page 20: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/20.jpg)
Solution: Stats Engine uses sequential testing to compute an “always-valid” p-value.
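As a rough illustration of what an “always-valid” p-value can look like, here is a sketch of the mixture sequential probability ratio test (mSPRT), the approach described in the “Peeking at A/B Tests” paper cited later in these slides. It is not Stats Engine's actual implementation. It assumes a stream of per-pair differences (variation minus control) that are normally distributed with known variance `sigma2`, and a normal mixing prior with variance `tau2`; both parameter names and their defaults are assumptions for this example.

```python
import math

def always_valid_p_values(diffs, sigma2, tau2=1.0, theta0=0.0):
    """Running always-valid p-values for H0: mean difference == theta0.

    sigma2: known variance of each difference (assumption).
    tau2:   variance of the normal mixing prior over the true difference.
    """
    p = 1.0
    total = 0.0
    p_values = []
    for n, d in enumerate(diffs, start=1):
        total += d
        mean = total / n
        denom = sigma2 + n * tau2
        # Log of the mixture likelihood ratio after n observations
        log_lr = (0.5 * math.log(sigma2 / denom)
                  + (n ** 2 * tau2 * (mean - theta0) ** 2) / (2 * sigma2 * denom))
        # The p-value is the running minimum of 1 / likelihood ratio,
        # so it can only fall or stay flat as data arrives.
        p = min(p, math.exp(-log_lr))
        p_values.append(p)
    return p_values

# Hypothetical stream with a consistent positive difference
print(always_valid_p_values([1.0] * 20, sigma2=1.0)[-1])
```

Because the p-value never increases and stays valid at every sample size, checking it after each visitor does not inflate the false positive rate the way repeated T-tests do.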
![Page 21: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/21.jpg)
2. Multiple Comparisons
![Page 24: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/24.jpg)
Figure: grid of Metrics 1 to 5 against Variations A, B, C, D, and Control.
![Page 25: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/25.jpg)
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
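One standard way to control False Discovery Rate is the Benjamini-Hochberg procedure. The slides do not say which procedure Stats Engine applies, so the sketch below is only an illustration of the general idea; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the result is declared significant
    while keeping the expected False Discovery Rate at or below q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below k * q / m
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank
    # Declare the k_max smallest p-values significant
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            significant[i] = True
    return significant

# One metric with p = 0.02 is significant on its own...
print(benjamini_hochberg([0.02]))                      # [True]
# ...but not once four inconclusive metrics are added to the same test.
print(benjamini_hochberg([0.02, 0.5, 0.6, 0.7, 0.8]))  # all False
```

The second call shows the "more conservative" behavior: the same p-value has to clear a stricter bar when more comparisons share the test.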
![Page 26: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/26.jpg)
How to make decisions with Stats Engine
When should I stop an experiment?
Understanding resets
How do additional variations and metrics affect my experiment?
How do I trade off between risk and velocity?
![Page 27: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/27.jpg)
How to make decisions with Stats Engine
When should I stop an experiment?
Understanding resets
How do additional variations and metrics affect my experiment?
How do I trade off between risk and velocity?
![Page 28: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/28.jpg)
Variation
👍 Use “visitors remaining” to decide whether continuing your experiment is worth it.
![Page 29: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/29.jpg)
How to make decisions with Stats Engine
When should I stop an experiment?
Understanding resets
How do additional variations and metrics affect my experiment?
How do I trade off between risk and velocity?
![Page 30: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/30.jpg)
A
B
AB
![Page 31: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/31.jpg)
“Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017
👍 Statistical Significance rises whenever there is strong evidence of a difference between variation and control
![Page 32: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/32.jpg)
“Peeking at A/B Tests: Why it matters, and what to do about it” KDD 2017
![Page 33: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/33.jpg)
Variation
👍 Statistical Significance will “reset” when there is strong evidence of an underlying change.
![Page 34: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/34.jpg)
Variation
👍 If your point estimate is near the edge of its confidence interval, consider running the experiment longer.
Example confidence interval shown: -19.3% to -2.58%
![Page 35: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/35.jpg)
How to make decisions with Stats Engine
When should I stop an experiment?
Understanding resets
How do additional variations and metrics affect my experiment?
How do I trade off between risk and velocity?
![Page 36: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/36.jpg)
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more metrics and variations are added to a test.
![Page 37: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/37.jpg)
Stats Engine treats each metric as a “signal”.
High Signal metrics are directly affected by the experiment
Low Signal metrics are indirectly or not at all affected by the experiment
![Page 38: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/38.jpg)
False Discovery Rate = P( No Real Improvement | 10% Lift )
“How likely is it that my results are a fluke?”
False Positive Rate = P( 10% Lift | No Real Improvement )
“How likely are my results if I assume there is no underlying difference between my variation and control?”
Solution: Stats Engine controls False Discovery Rate by becoming more conservative when more low signal metrics and variations are added to a test.
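Bayes' rule shows why low signal metrics matter for False Discovery Rate. The sketch below is a textbook calculation, not how Stats Engine computes anything: it assumes a fixed false positive rate `alpha`, a fixed chance `power` of detecting a real improvement, and a hypothetical `share_null`, the fraction of metric and variation combinations with no real improvement.

```python
def false_discovery_rate(alpha, power, share_null):
    """P(no real improvement | significant result) via Bayes' rule."""
    false_hits = alpha * share_null        # significant, but nothing is there
    true_hits = power * (1 - share_null)   # significant, and the effect is real
    if false_hits + true_hits == 0:
        return 0.0  # nothing is ever declared significant
    return false_hits / (false_hits + true_hits)

# Mostly high signal metrics: half of the comparisons have a real effect.
print(f"{false_discovery_rate(0.05, 0.8, 0.5):.1%}")   # 5.9%
# Mostly low signal metrics: only 1 comparison in 10 has a real effect.
print(f"{false_discovery_rate(0.05, 0.8, 0.9):.1%}")   # 36.0%
```

The False Positive Rate is 5% in both cases, yet the share of significant results that are flukes grows sharply as low signal metrics are added. That gap is why the correction has to become more conservative.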
![Page 39: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/39.jpg)
Figure: grid of Variations A, B, C, and D against Metrics 1 2 3 4 5 6 7 8 …, with the metrics grouped as Primary, Secondary, and Monitoring.
![Page 40: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/40.jpg)
👍 For maximum velocity, use “high signal” primary and secondary metrics.
👍 Use monitoring metrics for “low signal” metrics.
![Page 41: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/41.jpg)
How to make decisions with Stats Engine
When should I stop an experiment?
Understanding resets
How do additional variations and metrics affect my experiment?
How do I trade off between risk and velocity?
![Page 42: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/42.jpg)
Max False Discovery Rate
👍 Use your Statistical Significance threshold to control risk vs. velocity.
![Page 43: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/43.jpg)
How to scale your decision process
Risk vs. Velocity for Experimentation Programs
Getting organizational buy-in
![Page 44: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/44.jpg)
Risk vs. Velocity for Experimentation Programs
👍 Define “risk classes” for your team’s experiments
👍 Keep low-risk experiments “low touch”
👍 Save data science analysis resources for high-risk experiments
👍 Run high-risk experiments for 1+ conversion cycles to control for seasonality
👍 Rerun high-risk experiments
![Page 45: Opticon 2017 Running Experiment Engines with Stats Engine](https://reader033.vdocuments.net/reader033/viewer/2022052606/5a6490117f8b9a27568b67f5/html5/thumbnails/45.jpg)
Getting organizational buy-in
👍 Decide how and when you’ll share experiment results with your organization.
👍 Write down your “decision process” and socialize it with the team.