Machine Learning Experimentation at Sift Science


TRANSCRIPT

Page 1: Machine Learning Experimentation at Sift Science

ML Experimentation at Sift
Alex Paino ([email protected])

Follow along at: http://go.siftscience.com/ml-experimentation


Page 2: Machine Learning Experimentation at Sift Science

Agenda

Background

Motivation

Running experiments correctly

Comparing experiments correctly

Building tools to ensure correctness


Page 3: Machine Learning Experimentation at Sift Science

About Sift Science

- Abuse prevention platform powered by machine learning

- Learns in real-time

- Several abuse prevention products and counting:

Payment Fraud · Content Abuse · Promo Abuse · Account Abuse

Page 4: Machine Learning Experimentation at Sift Science

About Sift Science


Page 5: Machine Learning Experimentation at Sift Science

Motivation - Why is this important?

1. Experiments must happen to improve an ML system


Page 6: Machine Learning Experimentation at Sift Science

Motivation - Why is this important?

1. Experiments must happen to improve an ML system

2. Evaluation needs to correctly identify positive changes

Evaluation as a loss function for your stack


Page 7: Machine Learning Experimentation at Sift Science

Motivation - Why is this important?

1. Experiments must happen to improve an ML system

2. Evaluation needs to correctly identify positive changes

Evaluation as a loss function for your stack

3. Getting this right is a subtle and tricky problem


Page 8: Machine Learning Experimentation at Sift Science

How do we run experiments?


Page 9: Machine Learning Experimentation at Sift Science

Running experiments correctly - Background

- Large delay in feedback for Sift - up to 90 days

- → offline experiments over historical data

[Timeline: Created account → Updated credit card info → Updated settings → Purchased item → Chargeback; up to 90 days elapse before the chargeback arrives]

Page 10: Machine Learning Experimentation at Sift Science

Running experiments correctly - Background

- Large delay in feedback for Sift - up to 90 days

- → offline experiments over historical data

- Need to simulate the online case as closely as possible


Page 11: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: train & test set creation

- Can’t pick random splits


Page 12: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: train & test set creation

- Can’t pick random splits

- Disjoint in time and set of users

[Diagram: train and test sets are disjoint along both time (t) and users]

Page 13: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: train & test set creation

- Can’t pick random splits

- Disjoint in time and set of users

- Watch for class skew - ours is over 50:1 → need to downsample

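A minimal sketch of how such a split plus downsampling might look, assuming labeled events live in a pandas DataFrame; the column names (user_id, timestamp, label) and the 10:1 target ratio are illustrative, not Sift's actual schema:

```python
import pandas as pd

def time_user_disjoint_split(events, cutoff, test_user_frac=0.2, seed=42):
    """Train on pre-cutoff events from train users; test on post-cutoff
    events from a disjoint set of test users."""
    users = pd.Series(events["user_id"].unique()).sample(frac=1.0, random_state=seed)
    test_users = set(users.iloc[: int(len(users) * test_user_frac)])
    train = events[(events["timestamp"] < cutoff) & ~events["user_id"].isin(test_users)]
    test = events[(events["timestamp"] >= cutoff) & events["user_id"].isin(test_users)]
    return train, test

def downsample_negatives(train, ratio=10.0, seed=42):
    """Cap the negative:positive class ratio (e.g. from >50:1 down to 10:1)."""
    pos, neg = train[train["label"] == 1], train[train["label"] == 0]
    keep = min(len(neg), int(len(pos) * ratio))
    return pd.concat([pos, neg.sample(n=keep, random_state=seed)])
```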

Page 14: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: preventing cheating

- External data sources need to be versioned

[Timeline: Created account → Updated credit card info → Login from IP Address A → Login from IP Address B → Login from IP Address B → Transaction; IP Address B is a known Tor exit node according to the Tor Exit Node DB]

Page 15: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: preventing cheating

- External data sources need to be versioned

- Can’t leak ground truth into feature vectors

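One way to satisfy both points is to make every external lookup point-in-time: a feature computed for an event at time t may only consult the snapshot of the source that existed at t. A minimal sketch, assuming snapshots are recorded in chronological order; the class and the Tor example below are illustrative:

```python
import bisect
from datetime import datetime

class VersionedSet:
    """Membership lookups against historical snapshots of an external source."""
    def __init__(self):
        self._times = []      # snapshot timestamps, appended in chronological order
        self._snapshots = []  # frozenset of members per snapshot

    def add_snapshot(self, as_of, members):
        self._times.append(as_of)
        self._snapshots.append(frozenset(members))

    def contains(self, item, as_of):
        """Answer with the latest snapshot taken at or before `as_of`."""
        i = bisect.bisect_right(self._times, as_of) - 1
        return i >= 0 and item in self._snapshots[i]

# A June login is checked against the June view of the Tor exit node DB,
# even if the IP was added in July -- otherwise the experiment "cheats".
tor_exit_nodes = VersionedSet()
tor_exit_nodes.add_snapshot(datetime(2017, 7, 1), {"203.0.113.7"})
assert not tor_exit_nodes.contains("203.0.113.7", datetime(2017, 6, 15))
assert tor_exit_nodes.contains("203.0.113.7", datetime(2017, 7, 2))
```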

Page 16: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: considering scores at key decision points

- Scores given for any event (e.g. user login)


Page 17: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: considering scores at key decision points

- Scores given for any event (e.g. user login)

- Need to evaluate the scores our customers use to make decisions

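A minimal sketch of restricting evaluation to decision-point scores; the event field names and the set of decision event types are illustrative assumptions:

```python
# Event types at which customers typically act on the score (assumed here).
DECISION_EVENTS = {"$create_order", "$transaction"}

def decision_point_scores(scored_events):
    """Keep (score, label) pairs only for events customers act on,
    rather than every scored event (logins, settings changes, ...)."""
    return [(e["score"], e["label"])
            for e in scored_events
            if e["type"] in DECISION_EVENTS]
```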

Page 18: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: parity with the online system

- Our system does online learning → so should the offline experiments


Page 19: Machine Learning Experimentation at Sift Science

Running experiments correctly - Lessons

Lesson: parity with the online system

- Our system does online learning → so should the offline experiments

- Reusing the same code paths

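A sketch of what an offline replay with online learning could look like when both paths share one interface; `Model` here is a stand-in, not Sift's actual code, and for simplicity the label is applied immediately rather than at its real arrival time (up to 90 days later):

```python
class Model:
    """Stand-in for the interface shared by the online and offline paths."""
    def score(self, features): ...
    def partial_fit(self, features, label): ...

def replay(model, events):
    """Replay history in time order: score first, then learn from the label."""
    results = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        results.append((model.score(event["features"]), event.get("label")))
        if event.get("label") is not None:
            model.partial_fit(event["features"], event["label"])
    return results
```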

Page 20: Machine Learning Experimentation at Sift Science

How do we compare experiments?


Page 21: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Background

[Diagram: customer-specific and global models combine into the Sift Score]

Page 22: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Background

[Diagram: one model stack per abuse type. For each of Payment Abuse, Account Abuse, Promotion Abuse, and Content Abuse, customer-specific and global models combine into that abuse type's score]
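To make the diagram concrete, a hypothetical sketch of assembling one abuse type's score from its customer-specific and global models; the simple weighted blend is an illustrative assumption, not Sift's actual ensembling method:

```python
def abuse_score(features, customer_model, global_models, w_customer=0.5):
    """Blend a customer-specific model with the average of the global models."""
    global_avg = sum(m.score(features) for m in global_models) / len(global_models)
    return w_customer * customer_model.score(features) + (1 - w_customer) * global_avg
```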

Page 23: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Background

Thousands of configurations to evaluate!

Page 24: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Background

Thousands of (customer, abuse type) combinations to evaluate

Page 25: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Background

Thousands of (customer, abuse type) combinations to evaluate

Each with different features, models, class skew, and noise levels

Page 26: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Background

Thousands of (customer, abuse type) combinations to evaluate

Each with different features, models, class skew, and noise levels

→ Need some way to consolidate these evaluations

Page 27: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Lessons

Lesson: pitfalls with consolidating results

- Can’t throw all samples together → different score distributions

[Diagram: Customer 1 (perfect) + Customer 2 (perfect) = Combined (imperfect)]

Page 28: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Lessons

Lesson: pitfalls with consolidating results

- Can’t throw all samples together → different score distributions

- Weighted averages are tricky

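A minimal sketch of the per-customer alternative: compute the metric within each customer and only then compare, using scikit-learn's roc_auc_score:

```python
from collections import defaultdict
from sklearn.metrics import roc_auc_score

def per_customer_auc(samples):
    """samples: iterable of (customer_id, label, score) tuples."""
    by_customer = defaultdict(lambda: ([], []))
    for customer_id, label, score in samples:
        labels, scores = by_customer[customer_id]
        labels.append(label)
        scores.append(score)
    # AUC is undefined unless both classes are present for the customer.
    return {c: roc_auc_score(labels, scores)
            for c, (labels, scores) in by_customer.items()
            if len(set(labels)) == 2}
```

Experiments can then be compared via paired per-customer differences rather than one pooled number.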

Page 29: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Lessons

Lesson: require statistical significance everywhere

- Examine significant differences in per-customer summary stats


Page 30: Machine Learning Experimentation at Sift Science

Comparing Experiments Correctly - Lessons

Lesson: require statistical significance everywhere

- Examine significant differences in per-customer summary stats

- Use confidence intervals where possible, e.g. for AUC ROC

http://www.med.mcgill.ca/epidemiology/hanley/software/hanley_mcneil_radiology_82.pdf
http://www.cs.nyu.edu/~mohri/pub/area.pdf
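From the Hanley & McNeil reference above, a sketch of the AUC standard error; a rough 95% interval is then auc ± 1.96 * SE:

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of AUC per Hanley & McNeil (1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

# e.g. AUC 0.90 with 200 positives and 10,000 negatives:
se = hanley_mcneil_se(0.90, 200, 10_000)
lo, hi = 0.90 - 1.96 * se, 0.90 + 1.96 * se
```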

Page 31: Machine Learning Experimentation at Sift Science

How do we ensure correctness?


Page 32: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness


Page 33: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness

- Big productivity win


Page 34: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness

- Big productivity win

- Allows non-data scientists to conduct experiments safely


Page 35: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness

- Big productivity win

- Allows non-data scientists to conduct experiments safely

- Saves the team from drawing incorrect conclusions


Page 36: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness

- Big productivity win

- Allows non-data scientists to conduct experiments safely

- Saves the team from drawing incorrect conclusions


Page 37: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness - Examples

Example: Sift’s experiment evaluation page for high-level analysis


Page 38: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness - Examples

Example: Sift’s experiment evaluation page for high-level analysis


Page 39: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness - Examples

Example: Sift’s experiment evaluation page for high-level analysis


Page 40: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness - Examples

Example: Sift’s experiment evaluation page for high-level analysis

[Screenshot: ROC curve]

Page 41: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness - Examples

Example: Sift’s experiment evaluation page for high-level analysis

[Screenshot: ROC curve and score distribution]

Page 42: Machine Learning Experimentation at Sift Science

Building tools to ensure correctness - Examples

Example: Jupyter notebooks for deep-dives

Page 43: Machine Learning Experimentation at Sift Science

Key Takeaways


Page 44: Machine Learning Experimentation at Sift Science

Key Takeaways

1. Need to carefully design experiments to remove biases


Page 45: Machine Learning Experimentation at Sift Science

Key Takeaways

1. Need to carefully design experiments to remove biases

2. Require statistical significance when comparing results to filter out noise


Page 46: Machine Learning Experimentation at Sift Science

Key Takeaways

1. Need to carefully design experiments to remove biases

2. Require statistical significance when comparing results to filter out noise

3. The right tools can help ensure all of your analyses are correct while improving productivity

Page 47: Machine Learning Experimentation at Sift Science

Questions?
