Download - Learning at Low False Positive Rate
![Page 1: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/1.jpg)
Learning at Low False Positive Rate
Scott Wen-tau Yih Joshua GoodmanLearning for Messaging and Adversarial Problems
Microsoft Research Geoff Hulten
Microsoft
![Page 2: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/2.jpg)
2
•Regular way
•Give him lots of positive & negative examples to learn
•Violent way
•Punch him for false positive
•Collaborative way
•Train the 2nd using examples that the 1st thinks are spam
![Page 3: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/3.jpg)
3
•Regular way
•Give him lots of positive & negative examples to learn
•Violent way
•Collaborative way
10%
15% ~ 30%
20% ~ 40%
![Page 4: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/4.jpg)
4
Low False-Positive Region Improving the whole ROC curve?
Fantastic! But…
0 (No missed
spam)
(All spam missed)
1
0 (No good caught)
1 (All good
caught)
We only care about low false-positive region!
![Page 5: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/5.jpg)
5
Outline
Quick review of ROC curves Methods to improve spam filters in
low false-positive region 2-stage framework Training with utility
Experiments Related work Conclusions
![Page 6: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/6.jpg)
6
False-Positive vs. False-Negative
X: Ratio of misclassified good mail (FPR) Y: Ratio of missed spam (FNR) Statistical filter assigns scores to messages
Change behavior by choosing different thresholds
0 (No missed
spam)
(All spam missed)
1
0 (No good caught)
1 (All good
caught)
ROC curve
![Page 7: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/7.jpg)
7
Properties of ROC Curves
ROC shows the trade-off between false positive (misclassified good mail) and false negative (missed spam) given different thresholds = 0.5 may not be the best choice Decide according to the ROC curve
The ranking decides the ROC curve, not the absolute scores.
For spam filtering, we only care about how much spam the filter can catch when the false-positive rate is low. The cost of missing good mail is much higher
than not catching spam!
![Page 8: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/8.jpg)
8
Outline
Quick Review of ROC curves Methods to improve spam filters in
low false-positive region 2-stage framework (collaborative way) Training with utility (violent way)
Experiments Related work Conclusions
![Page 9: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/9.jpg)
9
2-Stage Framework: Idea Forget easy good mail and hard spam
The messages that have low scores They fall in the high false-positive region.
Try do a better job on other messages
![Page 10: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/10.jpg)
10
Use 2-Stage Framework
We trained 2 models for these 2 stages
Apply the 1st-stage model on all messages
For messages having scores less than , don’t change the order or scores
Re-score and re-order the remaining messages using the 2nd-stage model
![Page 11: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/11.jpg)
11
Train 2-Stage Framework (1/2)
The naïve way Train the 1st-stage model as usual Score the training messages using the
1st-stage model Use the subset of training data whose
scores are larger than as the training data for the 2nd-stage model
Problem The scores of the training data tend to be
too good and different from scores on unseen data
Solution: Use cross-validation
![Page 12: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/12.jpg)
12
Why does 2-stage work?
The 2-stage framework provides more complex hypothesis space, which might fit the data better.
Suppose that the base model form is a linear classifier.
Pick the subset of the data in the region you care about. Find all messages, good and spam,
that are more than, say, 50% likely to be spam according to the first model.
Train a new model on only this data. At test time, use both models.
Spam mail
Good mail
Spam mail
Good mail
Spam mail
Good mail
![Page 13: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/13.jpg)
13
Training with Utility
Motivation again: why do we care more in the low false-positive rate region? The “cost” or “utility” of a false positive error
(misclassified good mail) is much higher.
A common approach is to select the right threshold to get the correct FP/FN trade-off
A less common approach is to “re-weight” the data (training with utility) It's more important to get negative examples right Duplicate negative examples 10 times and train
the model on the new data
![Page 14: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/14.jpg)
14
Work for Naïve Bayes?
Training with utility is usually used in non-probabilistic learning algorithms, such as SVMs, when the data is highly skewed.
It has been argued that training with utility has no effect on naïve Bayes Only the prior changes Probability is multiplied by
a constant Effectively, the decision
hyperplane only shiftsbut not rotates
![Page 15: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/15.jpg)
15
The Real Results
In practice, training with utility improves both naïve Bayes and logistic regression filters.
For naïve Bayes, smoothing is the key Training with utility is equivalent to having
different smoothing parameters for positive and negative examples.
For logistic regression, the hyperplane may both shift and rotate, even without smoothing.
![Page 16: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/16.jpg)
16
Outline
Quick Review of ROC curves Methods to improve spam filters in
low false-positive region 2-stage framework Training with utility
Experiments Related work Conclusions
![Page 17: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/17.jpg)
17
Data Hotmail Feedback Loop
Polling over 100,000 Hotmail users daily Asked to hand-label a sample message addressed
to him as Good or Spam Very roughly 3% of user labels are errors
Training data (7/1/05 ~ 11/30/05) 5,000 msg per day; 765,000 msg in total
Testing data (12/1/05 ~ 12/15/05) 10,000 msg per day; 150,000 msg in total
Features (excluding proprietary features) Subject keywords and body keywords
![Page 18: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/18.jpg)
18
Logistic Regression (20% improvement)
![Page 19: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/19.jpg)
19
Naïve Bayes (40% improvement)
![Page 20: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/20.jpg)
20
Related Work 2-stage framework
Different from Boosting We specifically focus on low false-positive region We use cross validation We combine classifiers differently
A special form of decision list with only 2 layers, or cascade of classifiers ([Viola&Jones ’01], [Roth&Yih ’01]) Cascades improve overall accuracy or speed up the
system Cascades use more features or more complex models
ain later stages Training with utility
Has been done before – but not for spam we think Typically used for unbalanced data as opposed to
for emphasizing low false positive rate
![Page 21: Learning at Low False Positive Rate](https://reader031.vdocuments.net/reader031/viewer/2022032313/56812a6d550346895d8df274/html5/thumbnails/21.jpg)
21
Conclusions
Reduced false negative by 20-40% at low false positive rates. Training with utility (10% for both NB and LR) Two-stage filtering works even better (15% ~ 30%) The combination works the best!
40% gain for Naïve Bayes 20% for logistic regression
Both techniques can potentially be used with most other learning methods
Key insight: by specifically targeting low false positive rate at training time can get better results.