Machine Learning Experimentation at Sift Science

Posted on 24-Jan-2018


Category: Engineering


ML Experimentation at Sift
Alex Paino (atpaino@siftscience.com)

Follow along at: http://go.siftscience.com/ml-experimentation


Agenda

Background

Motivation

Running experiments correctly

Comparing experiments correctly

Building tools to ensure correctness


About Sift Science

- Abuse prevention platform powered by machine learning

- Learns in real-time

- Several abuse prevention products and counting:

Payment Fraud, Content Abuse, Promo Abuse, Account Abuse


Motivation - Why is this important?

1. Experiments must happen to improve an ML system

2. Evaluation needs to correctly identify positive changes

Evaluation as a loss function for your stack

3. Getting this right is a subtle and tricky problem

How do we run experiments?


Running experiments correctly - Background

- Large delay in feedback for Sift - up to 90 days

- → offline experiments over historical data

- Need to simulate the online case as closely as possible

[Timeline diagram: created account → updated credit card info → updated settings → purchased item → chargeback, with up to 90 days between the purchase and the chargeback]

Running experiments correctly - Lessons

Lesson: train & test set creation

- Can't pick random splits

- Train and test sets must be disjoint in time and in the set of users

- Watch for class skew - ours is over 50:1 → need to downsample (a minimal splitting sketch follows below)

[Diagram: train and test sets separated along both the time axis and the user axis]
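As an illustration of the split above, here is a minimal sketch of a time- and user-disjoint split with negative downsampling. The DataFrame layout, column names, and the 50:1 downsampling target are assumptions for illustration, not Sift's actual pipeline.

```python
# Sketch: time- and user-disjoint train/test split with negative downsampling.
# Assumes a DataFrame of labeled events with columns: user_id, timestamp, label (1 = fraud).
import pandas as pd

def split_train_test(events, cutoff, max_neg_per_pos=50.0, seed=0):
    """Train on events before `cutoff`, test on events at/after it,
    keeping the two user populations disjoint."""
    train = events[events["timestamp"] < cutoff]
    test = events[events["timestamp"] >= cutoff]

    # Disjoint in users as well as time: drop test users already seen in training.
    test = test[~test["user_id"].isin(set(train["user_id"]))]

    # Class skew is severe (worse than 50:1), so downsample training negatives.
    pos = train[train["label"] == 1]
    neg = train[train["label"] == 0]
    n_neg = min(len(neg), int(len(pos) * max_neg_per_pos))
    train = pd.concat([pos, neg.sample(n=n_neg, random_state=seed)])

    return train.sort_values("timestamp"), test.sort_values("timestamp")
```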

Running experiments correctly - Lessons

Lesson: preventing cheating

- External data sources need to be versioned (a point-in-time lookup sketch follows below)

- Can't leak ground truth into feature vectors

[Timeline diagram: created account → updated credit card info → login from IP Address A → login from IP Address B → login from IP Address B → transaction; the Tor exit node DB only later lists IP Address B as a known Tor exit node]
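To make the versioning point concrete, here is a sketch of a point-in-time ("as of") lookup against dated snapshots of an external source, using a Tor-exit-node list as the example. The class, its interface, and the snapshot scheme are illustrative assumptions, not Sift's implementation.

```python
# Sketch: point-in-time lookups against a versioned external data source, so a
# feature computed for an event at time t only sees data that existed at time t.
from bisect import bisect_right
from datetime import datetime

class VersionedIpList:
    """Dated snapshots of an IP list (e.g. known Tor exit nodes)."""

    def __init__(self):
        self._snapshot_times = []   # snapshot timestamps, added in chronological order
        self._snapshots = []        # IP sets, parallel to the timestamps

    def add_snapshot(self, taken_at, ips):
        self._snapshot_times.append(taken_at)
        self._snapshots.append(set(ips))

    def contains_as_of(self, ip, event_time):
        """Was `ip` on the list in the latest snapshot taken at or before `event_time`?
        Using today's snapshot instead would leak future knowledge into the features."""
        idx = bisect_right(self._snapshot_times, event_time) - 1
        return idx >= 0 and ip in self._snapshots[idx]

# If the IP only became a known exit node after the login being scored,
# the feature should come back False for that login:
tor_db = VersionedIpList()
tor_db.add_snapshot(datetime(2017, 6, 1), {"198.51.100.7"})
tor_db.contains_as_of("198.51.100.7", datetime(2017, 5, 15))  # False
tor_db.contains_as_of("198.51.100.7", datetime(2017, 7, 1))   # True
```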

Running experiments correctly - Lessons

Lesson: considering scores at key decision points

- Scores are given for any event (e.g. user login)

- Need to evaluate the scores our customers actually use to make decisions (see the sketch below)
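A minimal sketch of what "evaluate at decision points" could mean in practice: keep only the scores attached to the event types a customer actually gates on. The event-type names and record shape here are assumptions for illustration.

```python
# Sketch: from all scored events, keep only the scores at key decision points
# (e.g. the order/transaction events a customer actually blocks or reviews on),
# and evaluate against labels only for those.

DECISION_EVENT_TYPES = {"$create_order", "$transaction"}  # assumed decision events

def decision_point_scores(scored_events):
    """scored_events: iterable of (user_id, event_type, score, label) records,
    one per scored event. Returns the (score, label) pairs to evaluate on."""
    return [(score, label)
            for _user_id, event_type, score, label in scored_events
            if event_type in DECISION_EVENT_TYPES]
```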

Running experiments correctly - Lessons

Lesson: parity with the online system

- Our system does online learning → so should the offline experiments (a replay sketch follows below)

- Reusing the same code paths
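One way to keep that parity is a replay loop that walks historical events in time order, scoring each event before the model is allowed to learn from it, just as the live system would. The model interface below is an assumed stand-in, not Sift's real API, and a fully faithful replay would also delay each learning update until the label (e.g. a chargeback up to 90 days later) would actually have arrived.

```python
# Sketch: "score, then learn" replay of historical events,
# mirroring an online-learning production system.

def replay(events, model):
    """events: iterable of (features, label) in chronological order.
    model: anything with .score(features) and .learn(features, label)."""
    results = []
    for features, label in events:
        score = model.score(features)    # predict before the model sees this event
        results.append((score, label))
        model.learn(features, label)     # then update, as the online learner would
    return results
```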

How do we compare experiments?


Comparing Experiments Correctly - Background

[Diagram: customer-specific and global models are combined to produce the Sift Score]

[Diagram: each abuse type (Payment Abuse, Account Abuse, Promotion Abuse, Content Abuse) has its own customer-specific and global models, producing a separate score per abuse type]

- Thousands of (customer, abuse type) combinations to evaluate

- Each with different features, models, class skew, and noise levels

- → Need some way to consolidate these evaluations

Comparing Experiments Correctly - Lessons

Lesson: pitfalls with consolidating results

- Can't throw all samples together → different score distributions

- Weighted averages are tricky

[Diagram: Customer 1 perfect + Customer 2 perfect = combined imperfect, because the two customers' score distributions differ; a small numeric sketch follows below]
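The pitfall is easy to reproduce: two customers that are each ranked perfectly, but on different score ranges, give a pooled AUC well below 1.0. A minimal sketch (scikit-learn and the toy numbers are assumptions for illustration):

```python
# Sketch: pooling samples across customers with different score distributions
# understates quality. Each customer is ranked perfectly (AUC = 1.0),
# but the combined sample is not.
from sklearn.metrics import roc_auc_score

c1_scores, c1_labels = [0.10, 0.20, 0.30], [0, 0, 1]  # customer 1: low score range
c2_scores, c2_labels = [0.60, 0.70, 0.80], [0, 0, 1]  # customer 2: high score range

print(roc_auc_score(c1_labels, c1_scores))                          # 1.0
print(roc_auc_score(c2_labels, c2_scores))                          # 1.0
print(roc_auc_score(c1_labels + c2_labels, c1_scores + c2_scores))  # 0.75
```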

Comparing Experiments Correctly - Lessons

Lesson: require statistical significance everywhere

- Examine significant differences in per-customer summary stats

- Use confidence intervals where possible, e.g. for AUC ROC (a sketch follows below)

References:
http://www.med.mcgill.ca/epidemiology/hanley/software/hanley_mcneil_radiology_82.pdf
http://www.cs.nyu.edu/~mohri/pub/area.pdf
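A minimal sketch of one of the referenced approaches: the Hanley & McNeil (1982) standard error for a single AUC, which yields an approximate confidence interval. The function name and the 95% z-value are assumptions for illustration.

```python
# Sketch: approximate confidence interval for AUC via the Hanley & McNeil (1982)
# standard error (first reference above).
import math

def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """auc: observed AUC ROC; n_pos / n_neg: positive and negative sample counts.
    Returns (low, high), an approximate 95% interval when z = 1.96."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1.0) * (q1 - auc * auc)
           + (n_neg - 1.0) * (q2 - auc * auc)) / (n_pos * n_neg)
    se = math.sqrt(max(var, 0.0))
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

# Two experiments whose intervals overlap heavily should not be declared different:
# auc_confidence_interval(0.92, n_pos=200, n_neg=10000)  # ≈ (0.89, 0.95)
```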

How do we ensure correctness?


Building tools to ensure correctness

- Big productivity win

- Allows non-data scientists to conduct experiments safely

- Saves the team from drawing incorrect conclusions

Building tools to ensure correctness - Examples

Example: Sift's experiment evaluation page for high-level analysis

[Screenshots: evaluation page showing ROC curves and score distributions]

Example: Jupyter notebooks for deep-dives

Key Takeaways

1. Need to carefully design experiments to remove biases

2. Require statistical significance when comparing results to filter out noise

3. The right tools can help ensure all of your analyses are correct while improving productivity

Questions?
