A Simple Probabilistic Approach to Learning from Positive and Unlabeled Examples

Dell Zhang (BBK) and Wee Sun Lee (NUS)

Problem

Supervised Learning

Problem

Semi-Supervised Learning

Problem

PU Learning

Problem

Unlabeled Examples Help

Problem

PU Learning

To distinguish the interesting instances (the positive class C+) from other instances (the negative class C-) by learning a classifier from a set of positive examples P and a set of unlabeled examples U

There is no labeled negative example!

Applications

To automatically filter web pages according to a user's preference: the browsed or bookmarked pages can be used as positive examples, while unlabeled examples can be easily collected from the web

To automatically find machine learning literature: the ICML papers can be used as positive examples, while unlabeled examples can be easily collected from the ACM or IEEE digital library

To automatically identify cancer patients: the patients known to have cancer can be used as positive examples, while unlabeled examples can be easily collected from the patient database

To automatically discover future customers for direct marketing: the current customers of the company can be used as positive examples, while unlabeled examples can be purchased at a low cost compared with obtaining negative examples

……

Approaches

Existing Approaches

PNB (Denis et al. 2002); PNCT (Denis et al. 2003)

S-EM (Liu et al. 2002); RC-SVM (Li & Liu 2003)

PEBL (Yu et al. 2004); SVMC (Yu 2005)

PN-SVM (Fung et al. 2005)

W-LR (Lee & Liu 2003); B-SVM (Liu et al. 2003)

Our Proposed Approach

B-Pr

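Our Approach

A Probabilistic Model

[Diagram: an instance x ∈ C+ is placed in U with probability p and in P with probability 1 − p; an instance x ∈ C− is placed in U with probability 1]

$\Pr[P \mid \mathbf{x}] = \Pr[C_+ \mid \mathbf{x}]\,(1 - p)$

$\Pr[U \mid \mathbf{x}] = \Pr[C_+ \mid \mathbf{x}]\,p + \Pr[C_- \mid \mathbf{x}]$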
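The step from this model to the difference of class posteriors used in the decision rule is short. The derivation below is a sketch that relies only on the two model equations, $\Pr[P \mid \mathbf{x}] = (1-p)\Pr[C_+ \mid \mathbf{x}]$ and $\Pr[U \mid \mathbf{x}] = p\,\Pr[C_+ \mid \mathbf{x}] + \Pr[C_- \mid \mathbf{x}]$, and assumes $0 \le p < 1$:

```latex
\begin{align*}
\Pr[C_+ \mid \mathbf{x}] &= \frac{1}{1-p}\,\Pr[P \mid \mathbf{x}] \\
\Pr[C_- \mid \mathbf{x}] &= \Pr[U \mid \mathbf{x}] - p\,\Pr[C_+ \mid \mathbf{x}]
  = \Pr[U \mid \mathbf{x}] - \frac{p}{1-p}\,\Pr[P \mid \mathbf{x}] \\
\Pr[C_+ \mid \mathbf{x}] - \Pr[C_- \mid \mathbf{x}]
  &= \frac{1}{1-p}\,\Pr[P \mid \mathbf{x}] + \frac{p}{1-p}\,\Pr[P \mid \mathbf{x}] - \Pr[U \mid \mathbf{x}] \\
  &= \frac{1+p}{1-p}\,\Pr[P \mid \mathbf{x}] - \Pr[U \mid \mathbf{x}]
\end{align*}
```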

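Our Approach

$\Pr[C_+ \mid \mathbf{x}] - \Pr[C_- \mid \mathbf{x}] = \frac{1+p}{1-p}\,\Pr[P \mid \mathbf{x}] - \Pr[U \mid \mathbf{x}]$

$f(\mathbf{x}) = \mathrm{sgn}\left(\Pr[C_+ \mid \mathbf{x}] - \Pr[C_- \mid \mathbf{x}]\right)$

$f(\mathbf{x}) = \mathrm{sgn}\left(b\,\Pr[P \mid \mathbf{x}] - \Pr[U \mid \mathbf{x}]\right)$

$b = (1+p)/(1-p)$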
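The rule f(x) = sgn(b Pr[P|x] − Pr[U|x]) with b = (1 + p)/(1 − p) can be illustrated in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `bias_from_p` and `classify` are hypothetical, the two posteriors are assumed to come from some already-trained P-versus-U estimator, and a score of exactly zero is (arbitrarily) mapped to the negative class.

```python
def bias_from_p(p: float) -> float:
    """Return b = (1 + p) / (1 - p) for a mixing probability 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return (1.0 + p) / (1.0 - p)


def classify(pr_p_given_x: float, pr_u_given_x: float, b: float) -> int:
    """Return +1 if b * Pr[P|x] - Pr[U|x] > 0, else -1.

    Assumption of this sketch: a zero score counts as negative.
    """
    score = b * pr_p_given_x - pr_u_given_x
    return 1 if score > 0.0 else -1
```

With b = 1 this reduces to an ordinary P-versus-U classifier; larger b shifts the boundary so that instances that look only moderately like P are still labeled positive, which compensates for the positives hidden in U.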

Our Approach

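Biased PrTFIDF (B-Pr)

Estimate $\Pr[P \mid \mathbf{x}]$ and $\Pr[U \mid \mathbf{x}]$

PrTFIDF (Joachims 1997)

Estimate $b$

Maximize $\frac{pr}{\Pr[C_+]} = \frac{r^2}{\Pr[f(\mathbf{x}) = 1]}$ on a held-out validation set (Lee & Liu 2003), where $r$ is recall and $p$ in this criterion denotes precision

Linear Time Complexity!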
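Choosing b by maximizing r²/Pr[f(x) = 1] on a validation set can be sketched as a simple search over candidate values. The code below is an illustration under stated assumptions rather than the procedure used in the talk: `select_bias` and `predicts_positive` are hypothetical names, recall r is estimated on the positive validation examples, Pr[f(x) = 1] is estimated as the fraction of all validation examples (positive and unlabeled) predicted positive, and the candidate grid is supplied by the caller.

```python
from typing import Optional, Sequence, Tuple

Posterior = Tuple[float, float]  # (Pr[P|x], Pr[U|x]) for one validation example


def predicts_positive(post: Posterior, b: float) -> bool:
    """Apply the biased rule: positive iff b * Pr[P|x] - Pr[U|x] > 0."""
    pr_p, pr_u = post
    return b * pr_p - pr_u > 0.0


def select_bias(
    pos_val: Sequence[Posterior],
    all_val: Sequence[Posterior],
    candidates: Sequence[float],
) -> Tuple[Optional[float], float]:
    """Return the candidate b with the largest r^2 / Pr[f(x) = 1].

    pos_val: posteriors of the positive validation examples (for recall r).
    all_val: posteriors of all validation examples (for Pr[f(x) = 1]).
    """
    best_b: Optional[float] = None
    best_value = float("-inf")
    for b in candidates:
        recall = sum(predicts_positive(s, b) for s in pos_val) / len(pos_val)
        frac_pos = sum(predicts_positive(s, b) for s in all_val) / len(all_val)
        if frac_pos == 0.0:
            continue  # criterion undefined when nothing is predicted positive
        value = recall ** 2 / frac_pos
        if value > best_value:
            best_b, best_value = b, value
    return best_b, best_value
```

Each candidate requires one pass over the validation posteriors, so the search costs time linear in the validation set size per candidate.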

Experiments

Reuters-21578

B-Pr > RC-SVM > PEBL (p = 0.55)

RC-SVM > B-Pr > PEBL (p = 0.85)

Experiments

20NewsGroups

B-Pr > W-LR > S-EM (p = 0.3)

B-Pr > W-LR > S-EM (p = 0.7)

Conclusion

A New Approach to Learning from Positive and Unlabeled Examples

As effective as the state-of-the-art approaches

Yet simpler and faster

Thank you

Questions? Comments? Suggestions? ……
