n=10^9: automated experimentation at scale
TRANSCRIPT
N=10^9: Automated Experimentation at Scale
Wojciech Galuba, Decision Tools Lead, Facebook, @wgaluba
N=109: Automated Experimentation at Scale
Wojtek Galuba (wgaluba@fb) Decision Tools Team Lead Data Science Infrastructure Facebook
History of Data Science Infra at FB • Founded April 2012 • A group of data scientists and software engineers • Experienced first hand the need for better infrastructure • Need continues to grow • Team doubled over the past year • Expect continued rapid growth this year
Why do we experiment?
Experimentation
Product changes
Experiment to study this
Metrics
Experiment to:
Catch problems before they arise
Experiment to:
Choose between multiple options
Experiment to:
Challenge intuitions about product
Experiment to:
Not only evaluate ideas but generate new ones
Challenges
Many experiments
• Experiments running in parallel • Modifying many different aspects of the product • Overlaps are possible and may conflict
Many metric dimensions • Different contexts of user actions • Thousands of device types • Geography • Demographics • Time • Enormous space of possible questions
Many teams • Many ways to run an experiment • Diverse audience for results • Huge set of results from every experiment • Many ways to interpret results
Experimentation at Facebook
An experiment
QuickExperiment
Divide people randomly
color: blue, size: medium
color: blue, size: big
color: green, size: medium
QuickExperiment • Centralized experiment management • Purely config-level: no code pushes to iterate • Automatic exposure logging
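One common way to divide people randomly while keeping each person's assignment stable is to hash the user ID together with the experiment name. The sketch below illustrates that general technique using the three color/size conditions from the slide; it is not QuickExperiment's actual implementation, and the names `CONDITIONS`, `assign`, and the experiment name `button_test` are hypothetical.

```python
import hashlib

# The three conditions shown on the slide; how they are stored is an assumption.
CONDITIONS = [
    {"color": "blue", "size": "medium"},
    {"color": "blue", "size": "big"},
    {"color": "green", "size": "medium"},
]


def assign(experiment: str, user_id: str) -> dict:
    """Deterministically map a user to one condition of an experiment.

    Hashing (experiment, user_id) gives a pseudo-random but repeatable
    bucket, so the same user always sees the same parameter values
    without any per-user state being stored.
    """
    key = f"{experiment}.{user_id}".encode("utf-8")
    bucket = int(hashlib.sha1(key).hexdigest()[:15], 16) % len(CONDITIONS)
    return CONDITIONS[bucket]


if __name__ == "__main__":
    for uid in ["1001", "1002", "1003"]:
        print(uid, assign("button_test", uid))
```

Because the experiment name is part of the hash key, the same user can land in different buckets in different experiments, which keeps assignments across experiments uncorrelated.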
PlanOut
PlanOut • Open sourced: http://facebook.github.io/planout/ • Flexible experimental design • Full, programmatic control over param values
Experiment evaluation
Exposures
Metrics
[Figure: % change from control to test (axis from -3 to 3) for the "posts" metric, with confidence levels of 95%, 99%, and 99.9%]
Assess decision risk
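A percent change from control to test with intervals at several confidence levels can be computed along the following lines. This is a minimal sketch, assuming independent groups, a normal approximation, and a delta-method standard error for the ratio of means; the talk does not say which statistical method the Facebook tools use, and the function name `pct_change_ci` and the sample data are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pct_change_ci(control, test, confidence=0.95):
    """Return (pct_change, lower, upper) for test relative to control.

    Assumes both group means are non-zero and the samples are large
    enough for the normal approximation to be reasonable.
    """
    mc, mt = mean(control), mean(test)
    # Variance of each sample mean.
    vc = variance(control) / len(control)
    vt = variance(test) / len(test)
    ratio = mt / mc
    # Delta-method standard error of the ratio of two independent means.
    se = abs(ratio) * sqrt(vt / mt**2 + vc / mc**2)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    change = ratio - 1
    return 100 * change, 100 * (change - z * se), 100 * (change + z * se)


if __name__ == "__main__":
    control = [1.0, 2.0, 3.0, 4.0]  # made-up per-user post counts
    test = [2.0, 3.0, 4.0, 5.0]
    for level in (0.95, 0.99, 0.999):
        print(level, pct_change_ci(control, test, level))
```

Higher confidence levels give wider intervals, which is why an effect can be clear of zero at 95% but not at 99.9%.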
Lessons learned
Computing answers to an exponential number of possible questions
Pre-compute • low specificity • low dimensionality • long-term
Compute on-the-fly • high specificity • high dimensionality • short-term
A balancing act
Tackling many dimensions; Two sets of tools
For exploration; For extraction
Automated exploration
Enforce a lifecycle; In particular: clear experiment end dates
Why lifecycle policy? • Unifies methodology across teams • Prevents tech debt buildup • Minimizes bad impact on product
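A lifecycle policy with clear end dates could be enforced at configuration time roughly as follows. This is a sketch under stated assumptions: the `ExperimentConfig` fields, the `validate` and `is_active` helpers, and the 90-day maximum duration are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ExperimentConfig:
    name: str
    start: date
    end: Optional[date]  # None models a config submitted without an end date


def validate(cfg: ExperimentConfig, max_days: int = 90) -> List[str]:
    """Return lifecycle-policy violations; an empty list means the config passes."""
    problems = []
    if cfg.end is None:
        problems.append("missing end date")
    elif cfg.end <= cfg.start:
        problems.append("end date must be after start date")
    elif (cfg.end - cfg.start).days > max_days:
        problems.append(f"runs longer than {max_days} days")
    return problems


def is_active(cfg: ExperimentConfig, today: date) -> bool:
    """An experiment serves traffic only inside its declared [start, end) window."""
    return cfg.end is not None and cfg.start <= today < cfg.end


if __name__ == "__main__":
    cfg = ExperimentConfig("button_test", date(2014, 1, 6), date(2014, 1, 20))
    print(validate(cfg), is_active(cfg, date(2014, 1, 10)))
```

Rejecting configs without an end date is one way to make sure expired experiments stop serving automatically instead of accumulating as tech debt.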
Ease of rapid iteration; Safe and scientifically valid iteration
Fast, but not too fast • Novelty effect vs. top engaged users bump • Understand if waiting helps
Ensure mutual exclusion; Across platforms, features and infra
Why mutual exclusion? • Fewer experiment conflicts • Lower metrics variance
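Mutual exclusion is often implemented by hashing each user into one of a fixed number of segments and giving every experiment a disjoint set of segments. The sketch below shows that general scheme; it is not a description of Facebook's infrastructure, and `NUM_SEGMENTS`, `ALLOCATION`, the namespace name, and the experiment names are hypothetical.

```python
import hashlib
from typing import Optional

NUM_SEGMENTS = 1000  # hypothetical granularity of the allocation

# Hypothetical allocation: each experiment owns a disjoint range of segments.
ALLOCATION = {
    "feed_ranking_v2": range(0, 100),      # 10% of users
    "composer_redesign": range(100, 300),  # 20% of users
}


def segment(namespace: str, user_id: str) -> int:
    """Stable segment in [0, NUM_SEGMENTS) for a user within a namespace."""
    key = f"{namespace}:{user_id}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest()[:15], 16) % NUM_SEGMENTS


def experiment_for(namespace: str, user_id: str) -> Optional[str]:
    """Return the single experiment this user belongs to, or None.

    Because the segment ranges do not overlap, a user can never be in
    two experiments of the same namespace at once.
    """
    s = segment(namespace, user_id)
    for name, segments in ALLOCATION.items():
        if s in segments:
            return name
    return None


if __name__ == "__main__":
    for uid in ["1001", "1002", "1003"]:
        print(uid, experiment_for("news_feed", uid))
```

Users whose segment is unallocated stay out of every experiment in the namespace and remain available for future ones.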
Exposure log everything • Measure effects on the exposed only • Conditioning analyses on the time since last exposure
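Measuring effects on the exposed only means dropping users who were assigned to a group but never actually encountered the experiment. A minimal sketch of that filtering step, with made-up data and a hypothetical helper name `exposed_only_means`:

```python
from statistics import mean


def exposed_only_means(assignments, exposures, metric):
    """Average a metric per group, counting only users in the exposure log.

    assignments: {user_id: group}, exposures: set of exposed user_ids,
    metric: {user_id: value}. Exposed users with no metric row count as 0.
    """
    by_group = {}
    for user, group in assignments.items():
        if user in exposures:
            by_group.setdefault(group, []).append(metric.get(user, 0.0))
    return {group: mean(values) for group, values in by_group.items()}


if __name__ == "__main__":
    assignments = {"u1": "control", "u2": "control", "u3": "test", "u4": "test"}
    exposures = {"u1", "u3", "u4"}  # u2 was assigned but never exposed
    posts = {"u1": 2.0, "u2": 100.0, "u3": 4.0, "u4": 6.0}
    print(exposed_only_means(assignments, exposures, posts))
```

Unexposed users cannot have been affected by the change, so including them only adds noise to the comparison.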
The culture
Experimentation gives focus; But watch out for tunnel vision!
The culture
Cultivate sound practices; Safe and low-impact experimentation
The culture
Educate on data interpretation; Uniform decision-making across teams
Understanding uncertainty
“Robust misinterpretation of confidence intervals” Rink Hoekstra et al. Psychonomic Bulletin & Review
• Only 3% of scientists got all 6 answers right...
• How do we educate the users of the tools?
The three stages of experimentation infrastructure
Stage 1: Artisanal
Photo credit: Abhisek Sarda
Stage 2: Power tools
Stage 3: Industrialized
Photo credit: Steve Jurvetson
Conclusions
Empower, but don’t overwhelm
Conclusions
Filter and automate, but maintain broad focus
Conclusions
Clean data and powerful tools are great, but building the right experimentation culture is equally important