Active Perspectives on Computational Learning and Testing
Liu Yang
Slide 1
Interaction
• An interactive protocol: have the algorithm interact with an oracle/experts
• Care about: query complexity and computational efficiency
• Question: how much better can a learner do with interaction, vs. getting data in one shot?
Slide 2
Notation
• Instance space X = {0,1}^n
• Concept space C: a collection of functions h: X -> {-1,1}
• Distribution D over X
• Unknown target function h*: the true labeling function (realizable case: h* in C)
• err(h) = P_{x~D}[h(x) ≠ h*(x)]
Slide 3
“Active” Means Label Request
• Label request: have a pool of unlabeled examples; pick any x and receive h*(x); repeat
• An algorithm achieves query complexity S(ε, δ, h*) for (C, D) if it outputs h_n after ≤ n label requests and, for every h* in C, ε > 0, δ > 0, and n ≥ S(ε, δ, h*), we have P[err(h_n) ≤ ε] ≥ 1 − δ
• Motivation: labeled data is expensive to get
• Using label requests, we can do
  - Active learning: find h with small err(h)
  - Active testing: decide whether h* is in C or far from C
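The label-request protocol above can be sketched in code; a minimal illustration for 1-D thresholds (the oracle, pool, and function names are all hypothetical, not from the talk), showing how interaction can shrink label complexity from n to about log n:

```python
# A minimal sketch of the label-request protocol for 1-D thresholds.
# All names (make_oracle, active_learn_threshold) are illustrative.

def make_oracle(t):
    """Target h*(x) = +1 iff x >= t; each call counts as one label request."""
    count = {"queries": 0}
    def h_star(x):
        count["queries"] += 1
        return 1 if x >= t else -1
    return h_star, count

def active_learn_threshold(pool, h_star):
    """Binary search over a sorted unlabeled pool: O(log n) label requests,
    versus n labels if every point were labeled up front (passive)."""
    pool = sorted(pool)
    lo, hi = 0, len(pool)          # invariant: first + index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if h_star(pool[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    return pool[lo] if lo < len(pool) else float("inf")

pool = [i / 1000 for i in range(1000)]
h_star, count = make_oracle(0.437)
est = active_learn_threshold(pool, h_star)
print(est, count["queries"])       # threshold estimate; ~log2(1000) queries
```

With a pool of 1000 points, the loop issues roughly 10 label requests instead of 1000, which is the kind of gap the question on this slide is asking about.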
Slide 4
Thesis Outline
• Bayesian Active Learning Using Arbitrary Binary Valued Queries (Published)
• Active Property Testing (Major results submitted)
• Self-Verifying Bayesian Active Learning (Published)
• A Theory of Transfer Learning (Accepted)
• Learning with General Types of Query (In progress)
• Active Learning with a Drifting Distribution (Submitted)
This Talk
Slide 5
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Query (in progress)
Slide 6
Property Testing
• What is property testing for?
  - Quickly tell whether we have the right function class
  - Estimate the complexity of a function without actually learning it
• Question: can you test with fewer queries than learning?
• Yes! e.g. unions of d intervals, where testing helps:
  ----++++----+++++++++-----++---+++--------
  - Testing tells how big d needs to be to get close to the target
  - #Labels: active testing needs O(1), passive testing needs Θ(√d), active learning needs Θ(d)
Slide 7
Active Property Testing
• Passive testing: has no control over which examples get labeled
• Membership query: unrealistic to query the function at arbitrary points
• Active query: asks for labels only on points that exist in the environment
• Question: does active testing still get a significant benefit in label requests over passive testing?
Slide 8
Property Tester
• Accepts w.p. ≥ 2/3 if h* in C
• Rejects w.p. ≥ 2/3 if d(h*, C) = min_{g in C} P_{x~D}[h*(x) ≠ g(x)] ≥ ε
Slide 9
Testing Unions of d Intervals
• ----++++----+++++++++-----++---+++--------
Theorem. Testing unions of ≤ d intervals in the active testing model uses only O(1) label requests (independent of d).
• Noise sensitivity := Pr[two close points are labeled differently]
• Proof idea:
  - All unions of d intervals have low noise sensitivity
  - All functions far from this class have noticeably larger noise sensitivity
  - We introduce a tester that estimates the noise sensitivity of the input function
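The proof idea above can be illustrated with a simple estimator; a hedged sketch in which the test functions, the perturbation size delta, and the sample count are all illustrative choices, not the paper's actual tester:

```python
import random

# Sketch: estimate noise sensitivity Pr[h(x) != h(x')] for x ~ U[0,1]
# and x' a small random perturbation of x (all parameters illustrative).
def noise_sensitivity(h, delta, samples=20000, rng=random.Random(0)):
    diff = 0
    for _ in range(samples):
        x = rng.random()
        xp = min(1.0, max(0.0, x + rng.uniform(-delta, delta)))
        if h(x) != h(xp):
            diff += 1
    return diff / samples

# A union of 2 intervals: few label boundaries, hence low noise sensitivity.
two_intervals = lambda x: 1 if (0.1 <= x < 0.3) or (0.6 <= x < 0.8) else -1
# A function far from unions of few intervals: many alternations.
many_flips = lambda x: 1 if int(100 * x) % 2 == 0 else -1

low = noise_sensitivity(two_intervals, delta=0.01)
high = noise_sensitivity(many_flips, delta=0.01)
print(low, high)   # the far function has noticeably larger noise sensitivity
```

A union of d intervals has at most 2d label boundaries, so few sampled pairs straddle one; a function with many alternations flips labels on a constant fraction of pairs, which is the gap the tester exploits.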
Slide 10
Summary of Results
★★ Active testing (constant # of queries) is much better than passive testing and active learning on unions of intervals, the cluster assumption, and the margin assumption
★ Testing is easier than learning for LTFs (linear threshold functions)
✪ For dictator functions (single-variable functions), active testing does not help
Slide 11
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Query (in progress)
Slide 12
Self-Verifying Bayesian Active Learning
Self-verifying (a special type of stopping criterion):
- Given ε, adaptively decides the # of queries, then halts
- Has the property that E[err] < ε when it halts
Question: can you do this with E[#queries] = o(1/ε)? (passive learning needs Θ(1/ε) labels)
Slide 13
Example: Intervals
Suppose h* is the empty interval and D is uniform on [0,1]
[Figure: [0,1] partitioned into disjoint subintervals of width 2ε]
Slide 14
Verification Lower Bound
The algorithm somehow arrives at h with err(h) < ε; how can it verify that h is ε-close to h*?
- h* is ε-close to h; every h' ε-outside the green ball is not h*
- Everything on the red circle is outside the green ball, so we have to verify that those are not h*
- Suppose h* is the empty interval; then err(h) < ε implies P(h = +) < ε
Slide 15
[Figure: hypotheses h and h* in concept space C, with a green ball of radius ε and a red circle of radius 2ε around h]
- In particular, intervals of width 2ε lie on the red circle
- Need one label inside each such interval to verify that it is not h*
- So need Ω(1/ε) labels to verify the target isn't one of these
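The counting behind the Ω(1/ε) bound can be written out; a short derivation, assuming D is uniform on [0,1] and h is the all-negative hypothesis:

```latex
% There are \lfloor 1/(2\varepsilon) \rfloor disjoint intervals of width
% 2\varepsilon in [0,1]; each, as a candidate h', is at distance
% 2\varepsilon > \varepsilon from h, and ruling it out requires at least
% one label request inside it:
\[
  \#\{\text{candidates on the red circle}\}
    = \left\lfloor \frac{1}{2\varepsilon} \right\rfloor
  \quad\Longrightarrow\quad
  \#\{\text{label requests}\} = \Omega\!\left(\frac{1}{\varepsilon}\right).
\]
```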
Learning with a prior
• Suppose we know a distribution that the target is sampled from; call it the prior
Slide 16
Interval Example with Prior
- - - - - |+++++++| - - - - - -
• Algorithm: query random points until the first + is found, then binary search to find the end-points. Halt upon reaching a pre-specified prior-based query budget N. Output the posterior's Bayes classifier.
• Let the budget N be high enough that E[err] < ε
  - N = o(1/ε) suffices for E[err | w* > 0] < ε: each interval of width w > 0 can be learned with o(1/ε) queries; by the dominated convergence theorem (DCT), the prior-average of quantities that are each o(1/ε) is itself o(1/ε), so N = o(1/ε) makes E[err | w* > 0] < ε
  - N = o(1/ε) suffices for E[err | w* = 0] < ε: if P(w* = 0) > 0, then after some L = O(log(1/ε)) queries, w.p. > 1 − ε most of the probability mass is on the empty interval, so the posterior's Bayes classifier has 0 error rate
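The query strategy on this slide can be sketched as code; a minimal illustration in which the oracle, the budget, and the search precision are assumed values, and the posterior computation is omitted:

```python
import random

# Sketch of the interval learner: random queries until the first +,
# then binary search for both end-points, stopping at a query budget.
def learn_interval(h_star, budget, rng=random.Random(1)):
    queries = 0
    hit = None
    while queries < budget:
        x = rng.random()
        queries += 1
        if h_star(x) == 1:
            hit = x
            break
    if hit is None:
        # Budget spent without seeing a +: behave as the empty interval.
        return (0.0, 0.0)
    # Binary search for the left end-point in [0, hit].
    lo, hi = 0.0, hit
    while queries < budget and hi - lo > 1e-4:
        mid = (lo + hi) / 2
        queries += 1
        if h_star(mid) == 1:
            hi = mid
        else:
            lo = mid
    left = hi
    # Binary search for the right end-point in [hit, 1].
    lo, hi = hit, 1.0
    while queries < budget and hi - lo > 1e-4:
        mid = (lo + hi) / 2
        queries += 1
        if h_star(mid) == 1:
            lo = mid
        else:
            hi = mid
    return (left, lo)

target = lambda x: 1 if 0.40 <= x <= 0.70 else -1
a, b = learn_interval(target, budget=200)
print(round(a, 3), round(b, 3))   # close to the true end-points 0.40 and 0.70
```

Finding the first + takes about 1/w random queries for an interval of width w, and the two binary searches then add only O(log(1/ε)) queries, which is the source of the o(1/ε) behavior when w* > 0.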
Slide 17
Can Do o(1/ε) for Any VC Class
Theorem: with the prior, can do o(1/ε)
• There are methods that find a good classifier in o(1/ε) queries (though they aren't self-verifying) [TrueSampleComplexityAL08]
• Need to set a stopping criterion for those algorithms
• The stopping criterion we use: let the algorithm run until it makes a certain # of queries (set the budget just large enough that E[err] < ε)
Slide 18
Outline
• Active Property Testing (Submitted) • Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Query (in progress)
Slide 19
Model of Transfer Learning
Motivation: learners are often not too altruistic
Layer 1: draw each task's target h_t* i.i.d. from an (unknown) prior
Layer 2: per task, draw data i.i.d. and label it with that task's target
[Figure: Task 1 with target h_1* and data (x^1_1,y^1_1), …, (x^1_k,y^1_k); Task 2 with h_2*; …; Task T with h_T*; the tasks together yield a better estimate of the prior]
Slide 20
Insights
• Using a good estimate of the prior is almost as good as using the true prior
  - Now we only need a VC-dim # of additional points from each task to get a good estimate of the prior
  - We've seen that the self-verifying algorithm, if given the true prior, has guarantees on the error and the # of queries
  - As #tasks → ∞, when we call the self-verifying algorithm, it outputs a classifier as good as if it had the true prior
Slide 21
Main Result
• Design an algorithm using this insight
  - Uses at most a VC-dim # of additional labeled points per task (vs. learning with the known true prior)
  - Estimate the joint distribution on (x_1,y_1), …, (x_d,y_d)
  - Invert it to get the prior estimate
• Running this algorithm is asymptotically just as good as having direct knowledge of the prior (Bayesian)
  - [HKS] showed passive learning saves constant factors in the Θ(1/ε) sample complexity per task (replacing VC-dim with a prior-dependent complexity measure)
  - We showed access to the prior can improve active sample complexity, sometimes from Θ(1/ε) to o(1/ε)
Slide 22
Estimating the Prior
Insight: identifiability of priors by the d-dimensional joint distribution
• The distribution of the full sequence (x_1,y_1),(x_2,y_2),… uniquely identifies the prior
• The set of joint distributions on (x_1,y_1),…,(x_k,y_k) for 1 ≤ k < ∞ identifies the distribution of the full sequence (x_1,y_1),(x_2,y_2),…
• For any k > d = VC-dim, we can express the distribution on (x_1,y_1),…,(x_k,y_k) in terms of the distribution of (x_1,y_1),…,(x_d,y_d)
• How to do it when d = 1? e.g. thresholds:
  - For two points x_1 < x_2: Pr(+,−) = 0, Pr(+,+) = Pr(+,·), Pr(−,−) = Pr(·,−), and Pr(−,+) = Pr(·,+) − Pr(+,+) = Pr(·,+) − Pr(+,·)
  - For any k > 1 points, we can reduce directly from k to 2: P(−⋯−(−,+)+⋯+) = P((−,+)) = P((·,+)) − P((+,·))
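The two-point threshold identity can be checked numerically; a small sketch in which the uniform prior over thresholds and the points x1 < x2 are illustrative choices:

```python
import random

# Check Pr(-,+) = Pr(.,+) - Pr(+,.) for thresholds h_t(x) = +1 iff x >= t.
# Monotonicity makes the identity hold pointwise: if x1 gets +, so does x2.
rng = random.Random(0)
thresholds = [rng.random() for _ in range(100000)]   # prior: t ~ U[0,1]
x1, x2 = 0.3, 0.7

def label(t, x):
    return 1 if x >= t else -1

n = len(thresholds)
p_mp = sum(1 for t in thresholds if label(t, x1) == -1 and label(t, x2) == 1) / n
p_dot_p = sum(1 for t in thresholds if label(t, x2) == 1) / n   # Pr(., +)
p_p_dot = sum(1 for t in thresholds if label(t, x1) == 1) / n   # Pr(+, .)
print(p_mp, p_dot_p - p_p_dot)   # the two quantities agree
```

Because the (+,−) pattern has probability 0 for thresholds, the indicator of (−,+) equals the indicator of (·,+) minus that of (+,·) sample by sample, so the identity is exact, not just approximate.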
Slide 23
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Query (in progress)
- Learning DNF formulas
- Learning Voronoi
Slide 24
Learning with General Query
• Construct problem-specific queries to efficiently learn problems that have no known efficient PAC-learning algorithm.
Slide 25
Learning DNF formulas
- DNF formulas: n is the # of variables; a poly-sized DNF has n^O(1) terms, e.g. functions like f = (x1 ∧ x2) ∨ (x1 ∧ x4)
- A natural form of knowledge representation
- [Valiant 1984]; a great challenge for over 20 years
- PAC-learning DNF formulas appears to be very hard
- The fastest known algorithm [KS01] runs in time exp(n^(1/3) log^2 n)
- If the algorithm is forced to output a hypothesis which is itself a DNF, the problem is NP-hard
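For concreteness, the slide's example formula can be evaluated with a simple term-list representation; the representation itself is an assumption, not from the talk (variables are 0-indexed, so x1 is index 0):

```python
# The slide's example f = (x1 AND x2) OR (x1 AND x4), as a list of terms;
# each term is a list of (variable index, required value) pairs.
f = [[(0, 1), (1, 1)],   # term (x1 AND x2)
     [(0, 1), (3, 1)]]   # term (x1 AND x4)

def eval_dnf(dnf, x):
    """f(x) = 1 if any term has all its literals satisfied, else -1."""
    for term in dnf:
        if all(x[i] == v for i, v in term):
            return 1
    return -1

print(eval_dnf(f, [1, 1, 0, 0]))   # → 1  (first term satisfied)
print(eval_dnf(f, [1, 0, 0, 0]))   # → -1 (no term satisfied)
print(eval_dnf(f, [1, 0, 0, 1]))   # → 1  (second term satisfied)
```

Evaluating a poly-sized DNF is trivial; the hardness discussed on this slide is in learning such a formula from labeled examples.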
Slide 26