Machine Learning Experimentation at Sift Science
Posted 24-Jan-2018
ML Experimentation at Sift
Alex Paino
atpaino@siftscience.com
Follow along at: http://go.siftscience.com/ml-experimentation
Agenda
Background
Motivation
Running experiments correctly
Comparing experiments correctly
Building tools to ensure correctness
About Sift Science
- Abuse prevention platform powered by machine learning
- Learns in real-time
- Several abuse prevention products and counting:
Payment Fraud, Content Abuse, Promo Abuse, Account Abuse
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes (evaluation acts as a loss function for your stack)
3. Getting this right is a subtle and tricky problem
How do we run experiments?
Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
- Need to simulate the online case as closely as possible
[Figure: timeline of user events (created account, updated credit card info, updated settings, purchased item) with the chargeback arriving up to 90 days after the purchase]
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can't pick random splits
- Train and test sets must be disjoint in time and in the set of users
- Watch for class skew - ours is over 50:1 → need to downsample (see the sketch below)
[Figure: train/test split drawn over time (x-axis) and users (y-axis): the test set covers a later time window and a different set of users than the train set]
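To make the split rules above concrete, here is a minimal sketch (not Sift's actual code) of a time- and user-disjoint split with negative downsampling; the event schema (user_id, timestamp, label) and the downsample_rate value are illustrative assumptions.

import random

def build_train_test(events, split_time, downsample_rate=0.02, seed=0):
    """Split labeled events into train/test sets that are disjoint in
    time and in users, then downsample negatives in the train set.
    `events` is a list of dicts with 'user_id', 'timestamp', and
    'label' (1 = fraud) keys -- a hypothetical schema."""
    rng = random.Random(seed)

    train = [e for e in events if e["timestamp"] < split_time]
    test = [e for e in events if e["timestamp"] >= split_time]

    # Enforce user disjointness: drop test events from users seen in train.
    train_users = {e["user_id"] for e in train}
    test = [e for e in test if e["user_id"] not in train_users]

    # With >50:1 class skew, keep all positives but only a sample of negatives.
    train = [e for e in train if e["label"] == 1 or rng.random() < downsample_rate]
    return train, test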
Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned (see the sketch below)
- Can't leak ground truth into feature vectors
[Figure: timeline where a user logs in from IP address B before and after that IP is added to a Tor exit node DB; a lookup against today's DB would "know" the IP was a Tor exit node before the live system could have]
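A hedged sketch of the versioning idea above: store every snapshot of the external source with its effective time, and have feature extraction read only the snapshot that was live at the event's timestamp. The class name and schema here are hypothetical, not Sift's API.

import bisect

class VersionedTorExitNodeDB:
    """Point-in-time view over snapshots of an external data source.
    Each snapshot is (effective_timestamp, set of exit-node IPs);
    snapshots must be added in timestamp order."""

    def __init__(self):
        self._times = []  # snapshot timestamps, ascending
        self._snaps = []  # parallel list of IP sets

    def add_snapshot(self, timestamp, exit_node_ips):
        self._times.append(timestamp)
        self._snaps.append(frozenset(exit_node_ips))

    def is_tor_exit_node(self, ip, as_of):
        # Answer using only data known at time `as_of`, so offline
        # feature extraction cannot see into the future.
        idx = bisect.bisect_right(self._times, as_of) - 1
        if idx < 0:
            return False  # no snapshot existed yet
        return ip in self._snaps[idx]

Scoring the login from IP address B with as_of set to the login time then yields the same feature value the live system would have computed at that moment.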
Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores are given for any event (e.g. a user login)
- Need to evaluate the scores our customers actually use to make decisions (see the sketch below)
[Figure: event timeline with the key decision point highlighted]
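One way to implement this filter, as a sketch with an assumed record schema (the event_type, score, and label fields are illustrative):

def decision_point_scores(scored_events, decision_types=("transaction",)):
    """Keep only the scores customers actually act on, e.g. the score
    served at the transaction, rather than every intermediate event."""
    return [(e["score"], e["label"])
            for e in scored_events
            if e["event_type"] in decision_types]

Evaluation metrics (ROC curves, AUC) are then computed over this filtered list rather than over all scored events.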
Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
- Reuse the same code paths online and offline (see the sketch below)
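A minimal sketch of what parity can look like: replay historical events and delayed labels in one timestamp-ordered stream through the same score-then-update interface the live system uses. The model interface here (score, partial_update) is an assumption, not Sift's actual API.

def replay(model, events, label_arrivals):
    """Offline simulation of online learning. Scoring events and
    delayed ground-truth arrivals (e.g. chargebacks up to 90 days
    later) are merged so the model only learns from a label at the
    time it actually arrived."""
    stream = [(e["timestamp"], "event", e) for e in events]
    stream += [(l["timestamp"], "label", l) for l in label_arrivals]
    scores = {}
    for _, kind, item in sorted(stream, key=lambda x: x[0]):
        if kind == "event":
            scores[item["id"]] = model.score(item)
        else:
            model.partial_update(item["event_id"], item["label"])
    return scores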
How do we compare experiments?
Comparing Experiments Correctly - Background
[Figure: customer-specific and global models combine into the Sift Score]
[Figure: the same customer-specific + global model stack is repeated per abuse type: Payment Abuse, Account Abuse, Promotion Abuse, and Content Abuse models each produce their own score]
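The diagrams above only show that customer-specific and global models both feed each abuse score; here is a hedged sketch of one way such a combination could look (the linear blend and its weight are purely illustrative assumptions, since the slides do not specify the method):

def abuse_score(event, customer_model, global_models, customer_weight=0.5):
    """Blend one customer-specific model with shared global models
    into a single abuse-type score. The averaging and the weight are
    illustrative, not Sift's actual combination."""
    global_avg = sum(m.score(event) for m in global_models) / len(global_models)
    return (customer_weight * customer_model.score(event)
            + (1 - customer_weight) * global_avg)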
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type) combinations to evaluate
Each with different features, models, class skew, and noise levels
→ Need some way to consolidate these evaluations
Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can't throw all samples together → different score distributions (see the sketch below)
- Weighted averages are tricky
[Figure: Customer 1's ROC is perfect and Customer 2's ROC is perfect, but pooling their samples yields an imperfect combined ROC]
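The figure's point can be reproduced in a few lines (using scikit-learn's roc_auc_score, a tooling assumption; any AUC implementation works): each customer's model ranks its own traffic perfectly, but pooling the raw scores mixes two different score scales and the combined AUC drops.

from sklearn.metrics import roc_auc_score

# Customer 1: scores cluster low but rank perfectly (AUC = 1.0).
c1_labels, c1_scores = [0, 0, 1, 1], [0.10, 0.15, 0.20, 0.25]
# Customer 2: scores cluster high, also rank perfectly (AUC = 1.0).
c2_labels, c2_scores = [0, 0, 1, 1], [0.70, 0.75, 0.80, 0.85]

print(roc_auc_score(c1_labels, c1_scores))  # 1.0
print(roc_auc_score(c2_labels, c2_scores))  # 1.0

# Pooled: customer 2's negatives (0.70, 0.75) outscore customer 1's
# positives (0.20, 0.25), so the combined AUC is only 0.75.
print(roc_auc_score(c1_labels + c2_labels, c1_scores + c2_scores))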
Comparing Experiments Correctly - Lessons
Lesson: require statistical significance everywhere
- Examine significant differences in per-customer summary stats
- Use confidence intervals where possible, e.g. for AUC ROC (see the sketch below)
References:
http://www.med.mcgill.ca/epidemiology/hanley/software/hanley_mcneil_radiology_82.pdf
http://www.cs.nyu.edu/~mohri/pub/area.pdf
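The first reference above is Hanley & McNeil (1982); here is a minimal sketch of their standard-error formula for an AUC, from which a normal-approximation confidence interval follows (the example numbers are made up):

import math

def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate 95% confidence interval for an AUC, per
    Hanley & McNeil (1982). n_pos/n_neg are the positive and
    negative sample counts in the test set."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    se = math.sqrt(var)
    return auc - z * se, auc + z * se

# e.g. an AUC of 0.97 measured on 200 positives and 10,000 negatives:
print(hanley_mcneil_ci(0.97, 200, 10_000))

One conservative heuristic: if two experiments' intervals overlap heavily, treat the difference as noise rather than a win.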
How do we ensure correctness?
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
Building tools to ensure correctness - Examples
Example: Sift's experiment evaluation page for high-level analysis
[Figure: screenshots of the evaluation page, showing ROC curves and score distribution plots]
Building tools to ensure correctness - Examples
Example: Jupyter notebooks for deep-dives
Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
3. The right tools can help ensure all of your analyses are correct while improving productivity
Questions?