Machine Learning Experimentation at Sift Science

Posted on 24-Jan-2018


Category: Engineering


ML Experimentation at Sift
Alex Paino (atpaino@siftscience.com)

Follow along at: http://go.siftscience.com/ml-experimentation


Agenda

Background

Motivation

Running experiments correctly

Comparing experiments correctly

Building tools to ensure correctness


About Sift Science

- Abuse prevention platform powered by machine learning

- Learns in real-time

- Several abuse prevention products and counting:

Payment Fraud, Content Abuse, Promo Abuse, Account Abuse


Motivation - Why is this important?

1. Experiments must happen to improve an ML system

2. Evaluation needs to correctly identify positive changes

Evaluation as a loss function for your stack

3. Getting this right is a subtle and tricky problem

How do we run experiments?


Running experiments correctly - Background

- Large delay in feedback for Sift - up to 90 days

- → offline experiments over historical data

- Need to simulate the online case as closely as possible

[Timeline diagram: created account → updated credit card info → updated settings → purchased item → chargeback, with up to 90 days between the purchase and the chargeback]

Running experiments correctly - Lessons

Lesson: train & test set creation

- Can't pick random splits

- Train and test sets must be disjoint in time and in the set of users

- Watch for class skew - ours is over 50:1 → need to downsample (a minimal splitting sketch follows below)

[Diagram: train and test sets separated along both the time axis and the user axis]
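As an illustration of the split above, here is a minimal sketch of a time- and user-disjoint split with negative downsampling. The DataFrame layout, column names, and the 50:1 downsampling target are assumptions for illustration, not Sift's actual pipeline.

```python
# Sketch: time- and user-disjoint train/test split with negative downsampling.
# Assumes a DataFrame of labeled events with columns: user_id, timestamp, label (1 = fraud).
import pandas as pd

def split_train_test(events, cutoff, max_neg_per_pos=50.0, seed=0):
    """Train on events before `cutoff`, test on events at/after it,
    keeping the two user populations disjoint."""
    train = events[events["timestamp"] < cutoff]
    test = events[events["timestamp"] >= cutoff]

    # Disjoint in users as well as time: drop test users already seen in training.
    test = test[~test["user_id"].isin(set(train["user_id"]))]

    # Class skew is severe (worse than 50:1), so downsample training negatives.
    pos = train[train["label"] == 1]
    neg = train[train["label"] == 0]
    n_neg = min(len(neg), int(len(pos) * max_neg_per_pos))
    train = pd.concat([pos, neg.sample(n=n_neg, random_state=seed)])

    return train.sort_values("timestamp"), test.sort_values("timestamp")
```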

Running experiments correctly - Lessons

Lesson: preventing cheating

- External data sources need to be versioned (a point-in-time lookup sketch follows below)

- Can't leak ground truth into feature vectors

[Timeline diagram: created account → updated credit card info → login from IP Address A → login from IP Address B → login from IP Address B → transaction; the Tor exit node DB only later lists IP Address B as a known Tor exit node]
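To make the versioning point concrete, here is a sketch of a point-in-time ("as of") lookup against dated snapshots of an external source, using a Tor-exit-node list as the example. The class, its interface, and the snapshot scheme are illustrative assumptions, not Sift's implementation.

```python
# Sketch: point-in-time lookups against a versioned external data source, so a
# feature computed for an event at time t only sees data that existed at time t.
from bisect import bisect_right
from datetime import datetime

class VersionedIpList:
    """Dated snapshots of an IP list (e.g. known Tor exit nodes)."""

    def __init__(self):
        self._snapshot_times = []   # snapshot timestamps, added in chronological order
        self._snapshots = []        # IP sets, parallel to the timestamps

    def add_snapshot(self, taken_at, ips):
        self._snapshot_times.append(taken_at)
        self._snapshots.append(set(ips))

    def contains_as_of(self, ip, event_time):
        """Was `ip` on the list in the latest snapshot taken at or before `event_time`?
        Using today's snapshot instead would leak future knowledge into the features."""
        idx = bisect_right(self._snapshot_times, event_time) - 1
        return idx >= 0 and ip in self._snapshots[idx]

# If the IP only became a known exit node after the login being scored,
# the feature should come back False for that login:
tor_db = VersionedIpList()
tor_db.add_snapshot(datetime(2017, 6, 1), {"198.51.100.7"})
tor_db.contains_as_of("198.51.100.7", datetime(2017, 5, 15))  # False
tor_db.contains_as_of("198.51.100.7", datetime(2017, 7, 1))   # True
```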

Running experiments correctly - Lessons

Lesson: considering scores at key decision points

- Scores are given for any event (e.g. user login)

- Need to evaluate the scores our customers actually use to make decisions (see the sketch below)
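A minimal sketch of what "evaluate at decision points" could mean in practice: keep only the scores attached to the event types a customer actually gates on. The event-type names and record shape here are assumptions for illustration.

```python
# Sketch: from all scored events, keep only the scores at key decision points
# (e.g. the order/transaction events a customer actually blocks or reviews on),
# and evaluate against labels only for those.

DECISION_EVENT_TYPES = {"$create_order", "$transaction"}  # assumed decision events

def decision_point_scores(scored_events):
    """scored_events: iterable of (user_id, event_type, score, label) records,
    one per scored event. Returns the (score, label) pairs to evaluate on."""
    return [(score, label)
            for _user_id, event_type, score, label in scored_events
            if event_type in DECISION_EVENT_TYPES]
```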

Running experiments correctly - Lessons

Lesson: parity with the online system

- Our system does online learning → so should the offline experiments (a replay sketch follows below)

- Reusing the same code paths
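One way to keep that parity is a replay loop that walks historical events in time order, scoring each event before the model is allowed to learn from it, just as the live system would. The model interface below is an assumed stand-in, not Sift's real API, and a fully faithful replay would also delay each learning update until the label (e.g. a chargeback up to 90 days later) would actually have arrived.

```python
# Sketch: "score, then learn" replay of historical events,
# mirroring an online-learning production system.

def replay(events, model):
    """events: iterable of (features, label) in chronological order.
    model: anything with .score(features) and .learn(features, label)."""
    results = []
    for features, label in events:
        score = model.score(features)    # predict before the model sees this event
        results.append((score, label))
        model.learn(features, label)     # then update, as the online learner would
    return results
```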

How do we compare experiments?


Comparing Experiments Correctly - Background

[Diagram: customer-specific and global models are combined to produce the Sift Score]

[Diagram: each abuse type (Payment Abuse, Account Abuse, Promotion Abuse, Content Abuse) has its own customer-specific and global models, producing a separate score per abuse type]

- Thousands of (customer, abuse type) combinations to evaluate

- Each with different features, models, class skew, and noise levels

- → Need some way to consolidate these evaluations

Comparing Experiments Correctly - Lessons

Lesson: pitfalls with consolidating results

- Can't throw all samples together → different score distributions

- Weighted averages are tricky

[Diagram: Customer 1 perfect + Customer 2 perfect = combined imperfect, because the two customers' score distributions differ; a small numeric sketch follows below]
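The pitfall is easy to reproduce: two customers that are each ranked perfectly, but on different score ranges, give a pooled AUC well below 1.0. A minimal sketch (scikit-learn and the toy numbers are assumptions for illustration):

```python
# Sketch: pooling samples across customers with different score distributions
# understates quality. Each customer is ranked perfectly (AUC = 1.0),
# but the combined sample is not.
from sklearn.metrics import roc_auc_score

c1_scores, c1_labels = [0.10, 0.20, 0.30], [0, 0, 1]  # customer 1: low score range
c2_scores, c2_labels = [0.60, 0.70, 0.80], [0, 0, 1]  # customer 2: high score range

print(roc_auc_score(c1_labels, c1_scores))                          # 1.0
print(roc_auc_score(c2_labels, c2_scores))                          # 1.0
print(roc_auc_score(c1_labels + c2_labels, c1_scores + c2_scores))  # 0.75
```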

Comparing Experiments Correctly - Lessons

Lesson: require statistical significance everywhere

- Examine significant differences in per-customer summary stats

- Use confidence intervals where possible, e.g. for AUC ROC (a sketch follows below)

References:
http://www.med.mcgill.ca/epidemiology/hanley/software/hanley_mcneil_radiology_82.pdf
http://www.cs.nyu.edu/~mohri/pub/area.pdf
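A minimal sketch of one of the referenced approaches: the Hanley & McNeil (1982) standard error for a single AUC, which yields an approximate confidence interval. The function name and the 95% z-value are assumptions for illustration.

```python
# Sketch: approximate confidence interval for AUC via the Hanley & McNeil (1982)
# standard error (first reference above).
import math

def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """auc: observed AUC ROC; n_pos / n_neg: positive and negative sample counts.
    Returns (low, high), an approximate 95% interval when z = 1.96."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1.0) * (q1 - auc * auc)
           + (n_neg - 1.0) * (q2 - auc * auc)) / (n_pos * n_neg)
    se = math.sqrt(max(var, 0.0))
    return max(0.0, auc - z * se), min(1.0, auc + z * se)

# Two experiments whose intervals overlap heavily should not be declared different:
# auc_confidence_interval(0.92, n_pos=200, n_neg=10000)  # ≈ (0.89, 0.95)
```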

How do we ensure correctness?


Building tools to ensure correctness

- Big productivity win

- Allows non-data scientists to conduct experiments safely

- Saves the team from drawing incorrect conclusions

Building tools to ensure correctness - Examples

Example: Sift's experiment evaluation page for high-level analysis

[Screenshots: evaluation page showing ROC curves and score distributions]

Example: Jupyter notebooks for deep-dives

Key Takeaways

1. Need to carefully design experiments to remove biases

2. Require statistical significance when comparing results to filter out noise

3. The right tools can help ensure all of your analyses are correct while improving productivity

Questions?
