
Page 1:

Active and Proactive Machine Learning

Jaime Carbonell, Pinar Donmez, Jingrui He & Vamshi Ambati
Language Technologies Institute, Carnegie Mellon University
www.cs.cmu.edu/~jgc
27 October 2010

Page 2:

Why is Active Learning Important?

Labeled data volumes << unlabeled data volumes:
- 1.2% of all proteins have known structures
- < .01% of all galaxies in the Sloan Sky Survey have consensus type labels
- < .0001% of all web pages have topic labels
- << E-10% of all internet sessions are labeled as to fraudulence (malware, etc.)
- < .0001 of all financial transactions are investigated w.r.t. fraudulence

If labeling is costly or limited, select the instances with maximal impact for learning.

Page 3:

Is (Pro)Active Learning Relevant to Language Technologies?

- Text classification: by topic, genre, difficulty, …; in learning to rank search results
- Question answering: question-type classification; answer ranking
- Machine translation: selecting sentences to translate for LDLs (low-density languages); eliciting partial or full alignments

Page 4:

Active Learning

Training data: $\{(x_i, y_i)\}_{i=1,\dots,k} \cup \{x_i\}_{i=k+1,\dots,n}$, with an oracle $O: x_i \to y_i$
Special case: $k = 0$ (no initial labels)
Functional space: $\{f_j(p_l)\}$ (model families $f_j$ with parameters $p_l$)
Fitness criterion, a.k.a. loss function: $\arg\min_{j,l} \sum_i L\big(y_i, f_j(x_i, p_l)\big)$
Sampling strategy: $\arg\min_{\{x_1,\dots,x_k\}} E\big[L(\hat{f}(x_{test}), \hat{y}(x_{test})) \mid \{x_1,\dots,x_k\}\big]$

Page 5:

Sampling Strategies

- Random sampling (preserves the distribution)
- Uncertainty sampling (Lewis, 1996; Tong & Koller, 2000): proximity to the decision boundary; maximal distance to labeled x's
- Density sampling (kNN-inspired; McCallum & Nigam, 2004)
- Representative sampling (Xu et al., 2003)
- Instability sampling (probability-weighted): x's that maximally change the decision boundary
- Ensemble strategies:
  - Boosting-like ensemble (Baram, 2003)
  - DUAL (Donmez & Carbonell, 2007): dynamically switches strategies from density-based to uncertainty-based by estimating the derivative of the expected residual error reduction
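To make the selection rules above concrete, here is a minimal numpy sketch of uncertainty, density, and diversity scoring over a pool of unlabeled points; the Gaussian kernel for density and all function names are illustrative assumptions, not the cited papers' exact estimators.

```python
import numpy as np

def uncertainty_scores(proba):
    """Binary case: highest when P(y=1|x) is near 0.5 (decision boundary)."""
    return 1.0 - 2.0 * np.abs(proba - 0.5)

def density_scores(X_unlabeled, bandwidth=1.0):
    """Mean Gaussian-kernel similarity to the rest of the unlabeled pool."""
    d2 = ((X_unlabeled[:, None, :] - X_unlabeled[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth**2)).mean(axis=1)

def diversity_scores(X_unlabeled, X_labeled):
    """Distance to the nearest labeled point (maximal-diversity sampling)."""
    d2 = ((X_unlabeled[:, None, :] - X_labeled[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1))

# Pick the next query under each strategy.
rng = np.random.default_rng(0)
X_u, X_l = rng.normal(size=(100, 2)), rng.normal(size=(5, 2))
proba = rng.uniform(size=100)  # stand-in for a classifier's P(y=1|x)
print(np.argmax(uncertainty_scores(proba)),
      np.argmax(density_scores(X_u)),
      np.argmax(diversity_scores(X_u, X_l)))
```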

Page 6:

Which point to sample? (Grey = unlabeled, red = class A, brown = class B)

Page 7:

Density-Based Sampling: the centroid of the largest unsampled cluster.

Page 8:

Uncertainty Sampling: the point closest to the decision boundary.

Page 9:

Maximal Diversity Sampling: the point maximally distant from the labeled x's.

Page 10:

Ensemble-Based Possibilities: uncertainty + diversity criteria; density + uncertainty criteria.

Page 11:

Strategy Selection: No Universal Optimum

• Optimal operating range for AL sampling strategies differs

• How to get the best of both worlds?

• (Hint: ensemble methods, e.g. DUAL)

Page 12:

How does DUAL do better?

- Runs DWUS (density-weighted uncertainty sampling) until it estimates a cross-over: monitor the change in expected error at each iteration to detect when DWUS is stuck in a local minimum,

$\frac{\partial}{\partial t}\,\epsilon(\mathrm{DWUS}) \approx 0, \quad \text{where } \epsilon(\mathrm{DWUS}) = \frac{1}{n}\sum_i E[(\hat{y}_i - y_i)^2 \mid x_i]$

- DUAL uses a mixture model after the cross-over (saturation) point:

$x_s^* = \arg\max_{i \in I_U}\; \pi \cdot E[(\hat{y}_i - y_i)^2 \mid x_i] + (1 - \pi) \cdot p(x_i)$

- Our goal should be to minimize the expected future error. If we knew the future error of Uncertainty Sampling (US) to be zero, we would force $\pi = 1$; but in practice we do not know it.

Page 13:

More on DUAL [ECML 2007]

- After the cross-over, US does better, so the uncertainty score should be given more weight: $\pi$ should reflect how well US performs.
- $\hat{\epsilon}(US)$ can be calculated as the expected error of US on the unlabeled data,* giving $\pi = 1 - \hat{\epsilon}(US)$.
- Finally, we have the following selection criterion for DUAL:

$x_s^* = \arg\max_{i \in I_U}\; \big(1 - \hat{\epsilon}(US)\big) \cdot E[(\hat{y}_i - y_i)^2 \mid x_i] + \hat{\epsilon}(US) \cdot p(x_i)$

* US is allowed to choose data only from among the already sampled instances, and $\hat{\epsilon}(US)$ is calculated on the remaining unlabeled set.
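A minimal sketch of the DUAL selection rule reconstructed above, assuming per-instance expected-error estimates and a density estimate are already available; in the actual method err_us would be the expected error of US computed as in the footnote, so the values here are toy inputs.

```python
import numpy as np

def dual_select(expected_err, density, err_us):
    """DUAL mixture: weight uncertainty by (1 - err_us) and density by err_us."""
    scores = (1.0 - err_us) * expected_err + err_us * density
    return int(np.argmax(scores))

rng = np.random.default_rng(1)
expected_err = rng.uniform(size=50)  # E[(y_hat_i - y_i)^2 | x_i]
density = rng.uniform(size=50)       # p(x_i)
print("query instance:", dual_select(expected_err, density, err_us=0.3))
```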

Page 14:

Results: DUAL vs DWUS

Page 15:

Beyond DUAL

- Paired sampling with geodesic density estimation [Donmez & Carbonell, SIAM 2008]
- Active rank learning: for search results [Donmez & Carbonell, WWW 2008]; in general [Donmez & Carbonell, ICML 2008]
- Structure learning: inferring 3D protein structure from 1D sequence remains an open problem

Page 16:

Active Sampling for RankSVM I

- Consider a candidate instance; assume it is added to the training set with a hypothesized label.
- Compute the total loss on the pairs that include the candidate, where $n$ is the number of training instances with a label different from the candidate's.
- The objective function to be minimized then becomes the RankSVM objective over the enlarged training set (full equations in Donmez & Carbonell, ICML 2008).

Page 17:

Active Sampling for RankSVM II

- Let the current ranking function be given; there are two possible cases for the candidate's contribution to the pairwise loss.
- In each case, take the derivative of the objective at the single new point (details in Donmez & Carbonell, ICML 2008).

Page 18:

Active Sampling for RankSVM III

- Substitute into the previous equation to estimate the resulting change in the ranker.
- The magnitude of the total derivative estimates the ability of the candidate to change the current ranker if added to training.
- Finally, the candidate with the largest estimated change is selected (Donmez & Carbonell, ICML 2008).

Page 19:

Active Sampling for RankBoost I

- Again, estimate how the current ranker would change if a candidate were in the training set.
- Estimate this change by the difference in ranking loss before and after the candidate is added.
- The ranking loss is the exponential pairwise loss of Freund et al. (2003).

Page 20:

Active Sampling for RankBoost II

- The difference in ranking loss between the current and the enlarged set indicates how much the current ranker needs to change to compensate for the loss introduced by the new instance.
- Finally, the instance with the highest loss differential is sampled, as sketched below.
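The slide's formulas are given in Donmez & Carbonell (ICML 2008); as a generic illustration of the idea, this sketch scores a candidate by the growth in the exponential pairwise ranking loss of Freund et al. (2003) once the candidate joins the training set. Treating the candidate's score and hypothesized label as known inputs is a simplifying assumption.

```python
import numpy as np

def rank_loss(scores, labels):
    """Exponential pairwise ranking loss (Freund et al., 2003): sum over
    pairs (i, j) with labels[i] > labels[j] of exp(scores[j] - scores[i])."""
    li, lj = labels[:, None], labels[None, :]
    si, sj = scores[:, None], scores[None, :]
    return np.exp(sj - si)[li > lj].sum()

def loss_differential(scores, labels, cand_score, cand_label):
    """How much the training loss grows when the candidate joins the set."""
    s = np.append(scores, cand_score)
    l = np.append(labels, cand_label)
    return rank_loss(s, l) - rank_loss(scores, labels)

scores = np.array([2.0, 1.0, 0.5])
labels = np.array([2, 1, 0])  # graded relevance
print(loss_differential(scores, labels, cand_score=0.9, cand_label=2))
```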

Page 21:

Results on TREC03

Page 22:

Active vs Proactive Learning

- Number of oracles. Active: individual (only one). Proactive: multiple, with different capabilities, costs, and areas of expertise.
- Reliability. Active: infallible (100% right). Proactive: variable across oracles and queries, depending on difficulty, expertise, …
- Reluctance. Active: indefatigable (always answers). Proactive: variable across oracles and queries, depending on workload, certainty, …
- Cost per query. Active: invariant (free or constant). Proactive: variable across oracles and queries, depending on workload, difficulty, …

Note: "Oracle" ∈ {expert, experiment, computation, …}

Page 23:

Active Learning is Awesome, but … is it Enough?

[Diagram contrasting traditional active learning with what proactive learning goes beyond: a single perfect source vs multiple sources with differing expertise, labeling noise, and answer reluctance (JMLR '09, KDD '09, CIKM '08); a fixed labeling cost vs a varying-cost model driven by task difficulty, ambiguity, and expertise level; oracle properties fixed over time vs time-varying (SDM '10, submitted).]

Page 24:

Scenario 1: Reluctance

Two oracles:
- a reliable oracle: expensive, but always answers with a correct label
- a reluctant oracle: cheap, but may not respond to some queries

Define a utility score as the expected value of information at unit cost:

$U(x, k) = \frac{P(ans \mid x, k) \cdot V(x)}{C_k}$
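A sketch of the joint (oracle, instance) choice implied by the utility score above; $V(x)$ is left abstract (any value-of-information measure, e.g. an uncertainty score), and all numbers are toy inputs.

```python
import numpy as np

def select_query(p_ans, value, cost):
    """Maximize U(x, k) = P(ans | x, k) * V(x) / C_k over instances x, oracles k.
    p_ans: (n_oracles, n_instances); value: (n_instances,); cost: (n_oracles,)."""
    U = p_ans * value[None, :] / cost[:, None]
    k, i = np.unravel_index(np.argmax(U), U.shape)
    return int(k), int(i)

p_ans = np.array([[1.0, 1.0, 1.0],      # reliable oracle always answers
                  [0.9, 0.4, 0.7]])     # reluctant oracle
value = np.array([0.8, 0.9, 0.3])       # V(x), e.g. an uncertainty score
cost = np.array([5.0, 1.0])             # reliable oracle is expensive
print(select_query(p_ans, value, cost)) # -> (oracle, instance)
```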

Page 25:

How to estimate $\hat{P}(ans \mid x, k)$?

- Cluster the unlabeled data using k-means.
- Ask the reluctant oracle for the label of each cluster centroid $x_c$. If a label is received, increase $\hat{P}(ans \mid x, \text{reluctant})$ of nearby points; if no label is received, decrease $\hat{P}(ans \mid x, \text{reluctant})$ of nearby points.
- The update is an exponential function of the distance $\|x - x_c\|$ to the queried centroid, signed by $h(x_c, y_c) \in \{+1, -1\}$ ($+1$ when a label is received, $-1$ otherwise) and normalized by $Z$.
- The number of clusters depends on the clustering budget and the oracle fee.
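The exact update formula is garbled in the source, so this sketch keeps only its recoverable structure: query a k-means centroid, then move $\hat{P}(ans \mid x, \text{reluctant})$ of nearby points up or down ($h = +1$ or $-1$) with an effect that decays with distance to the centroid. The decay kernel and the rate are assumptions.

```python
import numpy as np

def update_answer_prob(p_ans, X, centroid, answered, scale=1.0, rate=0.2):
    """Raise/lower P(ans | x, reluctant) near a queried centroid.
    answered -> h = +1 (label received) or h = -1 (no response)."""
    h = 1.0 if answered else -1.0
    dist = np.linalg.norm(X - centroid, axis=1)
    p_ans = p_ans + rate * h * np.exp(-dist / scale)  # distance-decayed update
    return np.clip(p_ans, 0.0, 1.0)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
p = np.full(200, 0.5)  # uninformed prior on answering
p = update_answer_prob(p, X, centroid=X[0], answered=True)
print(p[:5])
```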

Page 26:

Algorithm for Scenario 1

Page 27:

Scenario 2: Fallibility

Two oracles:
- one perfect but expensive oracle
- one fallible but cheap oracle, which always answers

The algorithm is similar to Scenario 1, with slight modifications. During exploration, the fallible oracle provides the label together with its confidence, where confidence $= \hat{P}(y \mid x)$ of the fallible oracle. If $\hat{P}(y \mid x) \in [0.45, 0.5]$, then we do not use the label, but we still update $\hat{P}(\text{correct} \mid x, k)$.

Page 28:

Scenario 3: Non-uniform Cost

- Uniform cost: fraud detection, face recognition, etc.
- Non-uniform cost: text categorization, medical diagnosis, protein structure prediction, etc.
- Two oracles: a fixed-cost oracle and a variable-cost oracle, whose fee scales with instance difficulty:

$C_{non\text{-}unif}(x) = 1 - \frac{\max_{y \in Y} \hat{P}(y \mid x) - \frac{1}{|Y|}}{1 - \frac{1}{|Y|}}$

Page 29:

Underlying Sampling Strategy

Conditional-entropy-based sampling, weighted by a density measure; it captures the information content of a close neighborhood:

$U(x) = -\log\Big(\min_{y \in \{\pm 1\}} \hat{P}(y \mid x, \hat{w})\Big) \cdot \sum_{k \in N(x)} \exp\big(-\|x - k\|^2\big)\, \min_{y \in \{\pm 1\}} \hat{P}(y \mid k, \hat{w})$

where $N(x)$ is the set of close neighbors of $x$.
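A sketch of the reconstructed score for the binary case; the neighborhood radius and the exact kernel are assumptions where the source formula is ambiguous.

```python
import numpy as np

def sampling_scores(proba, X, radius=1.0):
    """Uncertainty (via the min-class probability) times the information
    content of each point's close neighborhood."""
    minp = np.minimum(proba, 1.0 - proba)
    unc = -np.log(minp + 1e-12)                        # high near P = 0.5
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    neigh = np.exp(-d2) * minp[None, :]                # kernel-weighted neighbor term
    neigh[d2 > radius**2] = 0.0                        # keep close neighbors only
    return unc * neigh.sum(axis=1)

rng = np.random.default_rng(3)
X, proba = rng.normal(size=(80, 2)), rng.uniform(size=80)
print("query:", int(np.argmax(sampling_scores(proba, X))))
```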

Page 30:

Results: Reluctance

Page 31:

Cost varies non-uniformly; improvements are statistically significant (p < 0.01).

Page 32:

Sequential Bayesian Filtering [SDM '10]

- Tracking the states of multiple systems as each evolves over time
- Sequentially arriving observations (noisy labels)
- Oracle accuracy changes with time t
- Goal: estimate the posterior distribution
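A minimal particle-filter sketch of the predict/update cycle for tracking one oracle's time-varying accuracy; the Gaussian random-walk drift and the Bernoulli observation model are illustrative assumptions about the dynamics, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(4)
particles = rng.uniform(0.5, 1.0, size=1000)  # hypotheses for oracle accuracy

def predict(particles, drift=0.02):
    """State evolution: accuracy drifts over time, so our belief diverges
    if the source goes unexplored."""
    return np.clip(particles + rng.normal(0, drift, particles.shape), 0.0, 1.0)

def update(particles, label_correct):
    """Reweight by the Bernoulli likelihood of the observed (noisy) label,
    then resample to keep the particle set unweighted."""
    w = particles if label_correct else (1.0 - particles)
    w = w / w.sum()
    return particles[rng.choice(len(particles), size=len(particles), p=w)]

for obs in [True, True, False, True]:  # noisy correctness observations
    particles = update(predict(particles), obs)
print("posterior mean accuracy:", particles.mean())
```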

Page 33:

A Closer Look at the Model [SDM '10]: the predict step followed by the update step.

Page 34:

Predictor Selection [SDM '10]

- Score each predictor by its accuracy at the last time it was selected, and by the probability of that accuracy.
- There is a chance that the accuracy might have increased since then.
- Our belief about the accuracy diverges over time as the source goes unexplored.

Page 35:

[Plot: red = true, blue = estimated, black = MLE.]

Page 36:

Does Tracking Predictor Accuracy Actually Help in Proactive Learning? [SDM '10]

Page 37:

Proactive Learning in General

- Multiple experts (a.k.a. oracles): different areas of expertise, different costs, different reliabilities, different availability
- What question to ask, and whom to query? Joint optimization of query and oracle selection
- Referrals among oracles (with referral fees)
- Learn about oracle capabilities while solving the active learning problem at hand
- Non-static oracle properties

Page 38:

Current Issues in Proactive Learning

- Large numbers of oracles [Donmez, Carbonell & Schneider, KDD 2009]: based on a multi-armed bandit approach
- Non-stationary oracles [Donmez, Carbonell & Schneider, SDM 2010]: expertise changes with time (improves or decays); exploration vs exploitation tradeoff
- What if the labeled set is empty for some classes? Minority-class discovery (unsupervised) [He & Carbonell, NIPS 2007, SIAM 2008, SDM 2009]
- After first-instance discovery: proactive learning, or minority-class characterization [He & Carbonell, SIAM 2010]

Page 39:

Minority Classes vs Outliers

- Rare classes: a group of points; clustered; non-separable from the majority classes
- Outliers: a single point; scattered; separable

Page 40:

The Big Picture

[Diagram: raw data (relational, temporal) feeds feature extraction, producing a feature representation; the resulting unbalanced, unlabeled data set goes through rare category detection and then learning in unbalanced settings, yielding a classifier.]

Page 41:

Minority Class Discovery Method

1. Calculate a problem-specific similarity radius $a$.
2. For each $x_i \in S$: $NN(x_i, a) = \{x \in A : \|x - x_i\| \le a\}$ and $n_i = |NN(x_i, a)|$.
3. Score each point: $s_i = \max_{x_j \in NN(x_i, t \cdot a)} (n_i - n_j)$.
4. Query $x = \arg\max_{x_i \in S} s_i$.
5. Does $x$ belong to a new class? If yes, go to step 6; if no, increase $t$ by 1 and return to step 3 (relevance feedback).
6. Output $x$.
7. Repeat until the labeling budget is exhausted (a code sketch of the scoring step follows).
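A sketch of the neighborhood-count scoring at the heart of steps 2-4 above: a point whose $a$-ball is much denser than that of some nearby neighbor sits at an abrupt density change, which is where a compact rare class tends to hide. The radius $a$ and growth factor $t$ are the slide's own knobs; everything else is a toy setup.

```python
import numpy as np

def rare_class_scores(X, a, t=1):
    """s_i = max over x_j within t*a of (n_i - n_j),
    where n_i counts points within radius a of x_i."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = (d <= a).sum(axis=1)                        # neighborhood counts
    diff = (n[:, None] - n[None, :]).astype(float)  # n_i - n_j
    diff[d > t * a] = -np.inf                       # restrict to the t*a ball
    return diff.max(axis=1)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (200, 2)),   # majority class
               rng.normal(2, 0.1, (10, 2))]) # tight rare cluster
print("query:", int(np.argmax(rare_class_scores(X, a=0.3))))
```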

Page 42:

Summary of Real Data Sets

- Abalone: 4,177 examples; 7-dimensional features; 20 classes; largest class 16.50%, smallest class 0.34%
- Shuttle: 4,515 examples; 9-dimensional features; 7 classes; largest class 75.53%, smallest class 0.13%

Page 43:

Results on Real Data Sets

[Plots for Abalone and Shuttle comparing MALICE, Interleave, and random sampling.]

Page 44:

Active Learning for MT

[Diagram: a Source Language Corpus feeds the Active Learner, which selects sentences S for the Expert Translator; the sampled (S, T) pairs extend the parallel corpus, which the Model Trainer uses to retrain the MT System.]

Page 45:

ACT Framework: Active Crowd Translation

[Diagram: the Source Language Corpus feeds Sentence Selection; each selected sentence S is translated by multiple crowd workers, yielding (S, T1), (S, T2), …, (S, Tn); Translation Selection picks the best translation, which the Model Trainer uses to retrain the MT System.]

Page 46:

Active Learning Strategy: Diminishing Density Weighted Diversity Sampling

$density(S) = \frac{\sum_{x \in Phrases(S)} P(x/U) \cdot e^{-count(x/L)}}{|Phrases(S)|}$

$diversity(S) = \frac{\sum_{x \in Phrases(S)} count(x) \cdot \delta(x)}{|Phrases(S)|}, \qquad \delta(x) = \begin{cases} 1 & \text{if } x \notin L \\ 0 & \text{if } x \in L \end{cases}$

$Score(S) = \frac{(1 + \beta^2) \cdot density(S) \cdot diversity(S)}{\beta^2 \cdot density(S) + diversity(S)}$
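A sketch of the score above, assuming the F-beta-style combination reconstructed from the garbled source; "phrases" are reduced to word unigrams for brevity, which is an assumption rather than the paper's phrase-extraction definition, and diversity is simplified to the fraction of phrases unseen in L.

```python
import math

def phrases(sentence):
    """Toy 'phrases': word unigrams (a stand-in for SMT phrase extraction)."""
    return sentence.lower().split()

def score(sentence, p_unlabeled, count_labeled, beta=1.0):
    ph = phrases(sentence)
    dens = sum(p_unlabeled.get(x, 0.0) * math.exp(-count_labeled.get(x, 0))
               for x in ph) / max(len(ph), 1)           # diminishing density
    div = sum(count_labeled.get(x, 0) == 0
              for x in ph) / max(len(ph), 1)            # simplified diversity
    denom = beta**2 * dens + div
    return (1 + beta**2) * dens * div / denom if denom else 0.0

p_u = {"the": 0.1, "trade": 0.02, "agreement": 0.01}  # P(x/U): phrase probs in U
c_l = {"the": 50}                                     # count(x/L): counts in L
print(score("the trade agreement", p_u, c_l))
```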

Experiments:
- Language pair: Spanish-English
- Iterations: 20; batch size: 1,000 sentences each
- Translation: Moses phrase-based SMT
- Development set: 343 sentences; test set: 506 sentences

Graph: X: Performance (BLEU); Y: Data (thousands of words)

Page 47:

Translation Selection from AMT (Amazon Mechanical Turk)

• Translator reliability
• Translation selection

Page 48:

Parting Thoughts

- Proactive learning: a new field, just started. New work and full details: Donmez dissertation. Applications abound: e-science (computational biology), finance, network security, language technologies (MT), … Theory is still in the making (e.g., Liu Yang). Open challenge: proactive structure learning.
- Rare-class discovery and classification: dovetails with active/proactive learning. New work and full details: Jingrui He dissertation.

Page 49:

THANK YOU!

Page 50:

Specially Designed Exponential Families [Efron & Tibshirani 1996]

A favorable compromise between parametric and nonparametric density estimation. The estimated density is

$g(x) = g_0(x)\, \exp\big(\theta_0 + \theta_1^T t(x)\big)$

where $g_0(x)$ is the carrier density, $\theta_0$ the normalizing parameter, $\theta_1$ the $p \times 1$ parameter vector, and $t(x)$ the $p \times 1$ vector of sufficient statistics.
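A one-dimensional numeric sketch of this family with a kernel-density carrier, $t(x) = x^2$, and $\theta_0$ obtained by grid normalization; the grid shortcut is illustrative, not the paper's estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(0.0, 1.0, size=200)
grid = np.linspace(-6, 6, 2001)
dx = grid[1] - grid[0]

def g0(x, h=0.3):
    """Carrier density g0: Gaussian kernel density estimate of the sample."""
    k = np.exp(-(x[:, None] - data[None, :]) ** 2 / (2 * h * h))
    return k.mean(axis=1) / (h * np.sqrt(2 * np.pi))

theta1 = np.array([-0.1])            # parameter vector (p = 1)
t_grid = grid[None, :] ** 2          # sufficient statistic t(x) = x^2
unnorm = g0(grid) * np.exp(theta1 @ t_grid)
theta0 = -np.log(unnorm.sum() * dx)  # normalizing parameter
g = unnorm * np.exp(theta0)
print("integral of g:", g.sum() * dx)  # ~1.0
```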

Page 51:

Page 52:

SEDER Algorithm

- Carrier density: a kernel density estimator.
- Sufficient statistics: $t(x) = \big((x^1)^2, \ldots, (x^d)^2\big)^T$.
- To decouple the estimation of the different parameters, decompose $\theta_0 = \sum_{j=1}^{d} \theta_0^j$ and relax the normalization constraint so that it holds per dimension; the estimated density then factors into one-dimensional exponentially tilted Gaussian kernels, one per feature $j$ (closed form in [SDM 2009]).

Page 53:

Parameter Estimation

Theorem 3 [SDM 2009]: the maximum likelihood estimates $\hat{\theta}_0^j$ and $\hat{\theta}_1^j$ of $\theta_0^j$ and $\theta_1^j$, $j = 1, \ldots, d$, satisfy conditions that match the model's expected sufficient statistics $E[(x^j)^2]$ to their empirical counterparts $\frac{1}{n}\sum_{k=1}^{n} (x_k^j)^2$ (the closed-form conditions are given in [SDM 2009]).

Page 54:

Parameter Estimation (cont.)

Let $b^j = \frac{1}{1 - 2\hat{\theta}_1^j \sigma_j^2}$, a positive parameter. Then for $j = 1, \ldots, d$:

$\hat{b}^j = \frac{-B + \sqrt{B^2 - 4AC}}{2A}$

where $A$, $B$, and $C$ are data-dependent quantities given in [SDM 2009]; in most cases $\hat{b}^j \geq 1$.

Page 55:

Scoring Function

The estimated density is

$\tilde{g}(x) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi b^j}\, \sigma_j} \exp\left(-\frac{(x^j - b^j x_i^j)^2}{2\, b^j \sigma_j^2}\right)$

The scoring function is the norm of its gradient:

$s_k = \big\| \nabla \tilde{g}(x_k) \big\| = \left\| \sum_{i=1}^{n} D_i(x_k) \left[ \frac{x_k^l - b^l x_i^l}{b^l \sigma_l^2} \right]_{l=1,\ldots,d} \right\|$

where

$D_i(x) = \frac{1}{n} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi b^j}\, \sigma_j} \exp\left(-\frac{(x^j - b^j x_i^j)^2}{2\, b^j \sigma_j^2}\right)$

Page 56:

Summary of Real Data Sets

Data Set    | n    | d  | m  | Largest Class | Smallest Class
Ecoli       | 336  | 7  | 6  | 42.56%        | 2.68%
Glass       | 214  | 9  | 6  | 35.51%        | 4.21%
Page Blocks | 5473 | 10 | 5  | 89.77%        | 0.51%
Abalone     | 4177 | 7  | 20 | 16.50%        | 0.34%
Shuttle     | 4515 | 9  | 7  | 75.53%        | 0.13%

Ecoli and Glass are moderately skewed; Page Blocks, Abalone, and Shuttle are extremely skewed.

Page 57:

Moderately Skewed Data Sets

[Plots for Ecoli and Glass showing MALICE results.]

Page 58:

GRADE: Full Prior Information

1. For each rare class $c$, $2 \le c \le m$:
2. Calculate the class-specific similarity radius $a^c$.
3. For each $x_i \in S$: $NN(x_i, a^c) = \{x \in A : \|x - x_i\| \le a^c\}$ and $n_i^c = |NN(x_i, a^c)|$.
4. Score each point: $s_i = \max_{x_j \in NN(x_i, t \cdot a^c)} (n_i^c - n_j^c)$.
5. Query $x = \arg\max_{x_i \in S} s_i$.
6. Does $x$ belong to class $c$? If yes, go to step 7; if no, increase $t$ by 1 and return to step 4 (relevance feedback).
7. Output $x$.

Page 59:

Results on Real Data Sets

[Plots for Ecoli, Glass, Abalone, and Shuttle showing MALICE results.]

Page 60:

Performance Measures

- MAP (Mean Average Precision): the average of the AP values over all queries.
- NDCG (Normalized Discounted Cumulative Gain): the impact of each relevant document is discounted as a function of its rank position.
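For reference, a compact implementation of both measures; binary relevance for AP and the (2^rel - 1) gain with log2 discount for NDCG are standard conventions assumed here.

```python
import numpy as np

def average_precision(rels):
    """AP for one query; rels = binary relevance in ranked order."""
    rels = np.asarray(rels, dtype=float)
    hits = np.cumsum(rels)
    prec_at_k = hits / np.arange(1, len(rels) + 1)
    return (prec_at_k * rels).sum() / max(rels.sum(), 1)

def ndcg(rels, k=10):
    """NDCG@k with (2^rel - 1) gain and log2(rank + 1) discount."""
    rels = np.asarray(rels, dtype=float)[:k]
    disc = np.log2(np.arange(2, len(rels) + 2))
    dcg = ((2**rels - 1) / disc).sum()
    ideal = np.sort(rels)[::-1]
    idcg = ((2**ideal - 1) / disc).sum()
    return dcg / idcg if idcg > 0 else 0.0

queries = [[1, 0, 1, 0], [0, 1, 1, 1]]  # binary relevance per ranked list
print("MAP:", np.mean([average_precision(q) for q in queries]))
print("NDCG@4:", ndcg([3, 2, 0, 1], k=4))
```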