Active Perspectives on Computational Learning and Testing
Liu Yang
Slide 1
Interaction
• An interactive protocol: have the algorithm interact with an oracle/experts
• Care about: query complexity and computational efficiency
• Question: how much better can a learner do with interaction, vs. getting data in one shot?
Slide 2
Notation
• Instance space X = {0,1}^n
• Concept space C: a collection of functions h: X -> {-1,1}
• Distribution D over X
• Unknown target function h*: the true labeling function (realizable case: h* in C)
• err(h) = P_{x~D}[h(x) ≠ h*(x)]
Slide 3
“Active” Means Label Request
• Label request: have a pool of unlabeled examples; pick any x and receive h*(x); repeat
• An algorithm achieves query complexity S(ε, δ, h*) for (C, D) if it outputs h_n after ≤ n label requests and, for every h* in C, ε > 0, δ > 0, and n ≥ S(ε, δ, h*), we have P[err(h_n) ≤ ε] ≥ 1 − δ
• Motivation: labeled data is expensive to get
• Using label requests, we can do
  - Active learning: find h with small err(h)
  - Active testing: decide whether h* is in C or far from C
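The label-request protocol above can be sketched in code; a minimal illustration for 1-D thresholds (the oracle, pool, and function names are all hypothetical, not from the talk), showing how interaction can shrink label complexity from n to about log n:

```python
# A minimal sketch of the label-request protocol for 1-D thresholds.
# All names (make_oracle, active_learn_threshold) are illustrative.

def make_oracle(t):
    """Target h*(x) = +1 iff x >= t; each call counts as one label request."""
    count = {"queries": 0}
    def h_star(x):
        count["queries"] += 1
        return 1 if x >= t else -1
    return h_star, count

def active_learn_threshold(pool, h_star):
    """Binary search over a sorted unlabeled pool: O(log n) label requests,
    versus n labels if every point were labeled up front (passive)."""
    pool = sorted(pool)
    lo, hi = 0, len(pool)          # invariant: first + index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if h_star(pool[mid]) == 1:
            hi = mid
        else:
            lo = mid + 1
    return pool[lo] if lo < len(pool) else float("inf")

pool = [i / 1000 for i in range(1000)]
h_star, count = make_oracle(0.437)
est = active_learn_threshold(pool, h_star)
print(est, count["queries"])       # threshold estimate; ~log2(1000) queries
```

With a pool of 1000 points, the loop issues roughly 10 label requests instead of 1000, which is the kind of gap the question on this slide is asking about.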
Slide 4
Thesis Outline
• Bayesian Active Learning Using Arbitrary Binary Valued Queries (Published)
• Active Property Testing (Major results submitted)
• Self-Verifying Bayesian Active Learning (Published)
• A Theory of Transfer Learning (Accepted)
• Learning with General Types of Query (In progress)
• Active Learning with a Drifting Distribution (Submitted)
This Talk
Slide 5
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Query (in progress)
Slide 6
Property Testing
• What is property testing for?
  - Quickly tell whether we have the right function class
  - Estimate the complexity of a function without actually learning it
• Question: can you test with fewer queries than learning?
• Yes! e.g. unions of d intervals, where testing helps:
  ----++++----+++++++++-----++---+++--------
  - Testing tells how big d needs to be to get close to the target
  - #Labels: active testing needs O(1), passive testing needs Θ(√d), active learning needs Θ(d)
Slide 7
Active Property Testing
• Passive testing: has no control over which examples get labeled
• Membership query: unrealistic to query the function at arbitrary points
• Active query: asks for labels only on points that exist in the environment
• Question: does active testing still get a significant benefit in label requests over passive testing?
Slide 8
Property Tester
• Accepts w.p. ≥ 2/3 if h* in C
• Rejects w.p. ≥ 2/3 if d(h*, C) = min_{g in C} P_{x~D}[h*(x) ≠ g(x)] ≥ ε
Slide 9
Testing Unions of d Intervals
• ----++++----+++++++++-----++---+++--------
Theorem. Testing unions of ≤ d intervals in the active testing model uses only O(1) label requests (independent of d).
• Noise sensitivity := Pr[two close points are labeled differently]
• Proof idea:
  - All unions of d intervals have low noise sensitivity
  - All functions far from this class have noticeably larger noise sensitivity
  - We introduce a tester that estimates the noise sensitivity of the input function
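The proof idea above can be illustrated with a simple estimator; a hedged sketch in which the test functions, the perturbation size delta, and the sample count are all illustrative choices, not the paper's actual tester:

```python
import random

# Sketch: estimate noise sensitivity Pr[h(x) != h(x')] for x ~ U[0,1]
# and x' a small random perturbation of x (all parameters illustrative).
def noise_sensitivity(h, delta, samples=20000, rng=random.Random(0)):
    diff = 0
    for _ in range(samples):
        x = rng.random()
        xp = min(1.0, max(0.0, x + rng.uniform(-delta, delta)))
        if h(x) != h(xp):
            diff += 1
    return diff / samples

# A union of 2 intervals: few label boundaries, hence low noise sensitivity.
two_intervals = lambda x: 1 if (0.1 <= x < 0.3) or (0.6 <= x < 0.8) else -1
# A function far from unions of few intervals: many alternations.
many_flips = lambda x: 1 if int(100 * x) % 2 == 0 else -1

low = noise_sensitivity(two_intervals, delta=0.01)
high = noise_sensitivity(many_flips, delta=0.01)
print(low, high)   # the far function has noticeably larger noise sensitivity
```

A union of d intervals has at most 2d label boundaries, so few sampled pairs straddle one; a function with many alternations flips labels on a constant fraction of pairs, which is the gap the tester exploits.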
Slide 10
Summary of Results
★★ Active testing (constant # of queries) is much better than passive testing and active learning on unions of intervals, the cluster assumption, and the margin assumption
★ Testing is easier than learning for LTFs (linear threshold functions)
✪ For dictator functions (single-variable functions), active testing does not help
Slide 11
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Query (in progress)
Slide 12
Self-Verifying Bayesian Active Learning
Self-verifying (a special type of stopping criterion):
- Given ε, adaptively decides the # of queries, then halts
- Has the property that E[err] < ε when it halts
Question: can you do this with E[#queries] = o(1/ε)? (passive learning needs Θ(1/ε) labels)
Slide 13
Example: Intervals
Suppose h* is the empty interval and D is uniform on [0,1]
[Figure: [0,1] partitioned into disjoint subintervals of width 2ε]
Slide 14
Verification Lower Bound
The algorithm somehow arrives at h with err(h) < ε; how can it verify that h is ε-close to h*?
- h* is ε-close to h; every h' ε-outside the green ball is not h*
- Everything on the red circle is outside the green ball, so we have to verify that those are not h*
- Suppose h* is the empty interval; then err(h) < ε implies P(h = +) < ε
Slide 15
[Figure: hypotheses h and h* in concept space C, with a green ball of radius ε and a red circle of radius 2ε around h]
- In particular, intervals of width 2ε lie on the red circle
- Need one label inside each such interval to verify that it is not h*
- So need Ω(1/ε) labels to verify the target isn't one of these
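The counting behind the Ω(1/ε) bound can be written out; a short derivation, assuming D is uniform on [0,1] and h is the all-negative hypothesis:

```latex
% There are \lfloor 1/(2\varepsilon) \rfloor disjoint intervals of width
% 2\varepsilon in [0,1]; each, as a candidate h', is at distance
% 2\varepsilon > \varepsilon from h, and ruling it out requires at least
% one label request inside it:
\[
  \#\{\text{candidates on the red circle}\}
    = \left\lfloor \frac{1}{2\varepsilon} \right\rfloor
  \quad\Longrightarrow\quad
  \#\{\text{label requests}\} = \Omega\!\left(\frac{1}{\varepsilon}\right).
\]
```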
Learning with a prior
• Suppose we know a distribution that the target is sampled from; call it the prior
Slide 16
Interval Example with Prior
- - - - - |+++++++| - - - - - -
• Algorithm: query random points until the first + is found, then binary search to find the end-points. Halt upon reaching a pre-specified prior-based query budget N. Output the posterior's Bayes classifier.
• Let the budget N be high enough that E[err] < ε
  - N = o(1/ε) suffices for E[err | w* > 0] < ε: each interval of width w > 0 can be learned with o(1/ε) queries; by the dominated convergence theorem (DCT), the prior-average of quantities that are each o(1/ε) is itself o(1/ε), so N = o(1/ε) makes E[err | w* > 0] < ε
  - N = o(1/ε) suffices for E[err | w* = 0] < ε: if P(w* = 0) > 0, then after some L = O(log(1/ε)) queries, w.p. > 1 − ε most of the probability mass is on the empty interval, so the posterior's Bayes classifier has 0 error rate
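The query strategy on this slide can be sketched as code; a minimal illustration in which the oracle, the budget, and the search precision are assumed values, and the posterior computation is omitted:

```python
import random

# Sketch of the interval learner: random queries until the first +,
# then binary search for both end-points, stopping at a query budget.
def learn_interval(h_star, budget, rng=random.Random(1)):
    queries = 0
    hit = None
    while queries < budget:
        x = rng.random()
        queries += 1
        if h_star(x) == 1:
            hit = x
            break
    if hit is None:
        # Budget spent without seeing a +: behave as the empty interval.
        return (0.0, 0.0)
    # Binary search for the left end-point in [0, hit].
    lo, hi = 0.0, hit
    while queries < budget and hi - lo > 1e-4:
        mid = (lo + hi) / 2
        queries += 1
        if h_star(mid) == 1:
            hi = mid
        else:
            lo = mid
    left = hi
    # Binary search for the right end-point in [hit, 1].
    lo, hi = hit, 1.0
    while queries < budget and hi - lo > 1e-4:
        mid = (lo + hi) / 2
        queries += 1
        if h_star(mid) == 1:
            lo = mid
        else:
            hi = mid
    return (left, lo)

target = lambda x: 1 if 0.40 <= x <= 0.70 else -1
a, b = learn_interval(target, budget=200)
print(round(a, 3), round(b, 3))   # close to the true end-points 0.40 and 0.70
```

Finding the first + takes about 1/w random queries for an interval of width w, and the two binary searches then add only O(log(1/ε)) queries, which is the source of the o(1/ε) behavior when w* > 0.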
Slide 17
Can Do o(1/ε) for Any VC Class
Theorem: with the prior, can do o(1/ε)
• There are methods that find a good classifier in o(1/ε) queries (though they aren't self-verifying) [TrueSampleComplexityAL08]
• Need to set a stopping criterion for those algorithms
• The stopping criterion we use: let the algorithm run until it makes a certain # of queries (set the budget just large enough that E[err] < ε)
Slide 18
Outline
• Active Property Testing (Submitted) • Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Query (in progress)
Slide 19
Model of Transfer Learning
Motivation: learners are often not too altruistic
Layer 1: draw each task's target h_t* i.i.d. from an (unknown) prior
Layer 2: per task, draw data i.i.d. and label it with that task's target
[Figure: Task 1 with target h_1* and data (x^1_1,y^1_1), …, (x^1_k,y^1_k); Task 2 with h_2*; …; Task T with h_T*; the tasks together yield a better estimate of the prior]
Slide 20
Insights
• Using a good estimate of the prior is almost as good as using the true prior
  - Now we only need a VC-dim # of additional points from each task to get a good estimate of the prior
  - We've seen that the self-verifying algorithm, if given the true prior, has guarantees on the error and the # of queries
  - As #tasks → ∞, when we call the self-verifying algorithm, it outputs a classifier as good as if it had the true prior
Slide 21
Main Result
• Design an algorithm using this insight
  - Uses at most a VC-dim # of additional labeled points per task (vs. learning with the known true prior)
  - Estimate the joint distribution on (x_1,y_1), …, (x_d,y_d)
  - Invert it to get the prior estimate
• Running this algorithm is asymptotically just as good as having direct knowledge of the prior (Bayesian)
  - [HKS] showed passive learning saves constant factors in the Θ(1/ε) sample complexity per task (replacing VC-dim with a prior-dependent complexity measure)
  - We showed access to the prior can improve active sample complexity, sometimes from Θ(1/ε) to o(1/ε)
Slide 22
Estimating the Prior
Insight: identifiability of priors by the d-dimensional joint distribution
• The distribution of the full sequence (x_1,y_1),(x_2,y_2),… uniquely identifies the prior
• The set of joint distributions on (x_1,y_1),…,(x_k,y_k) for 1 ≤ k < ∞ identifies the distribution of the full sequence (x_1,y_1),(x_2,y_2),…
• For any k > d = VC-dim, we can express the distribution on (x_1,y_1),…,(x_k,y_k) in terms of the distribution of (x_1,y_1),…,(x_d,y_d)
• How to do it when d = 1? e.g. thresholds:
  - For two points x_1 < x_2: Pr(+,−) = 0, Pr(+,+) = Pr(+,·), Pr(−,−) = Pr(·,−), and Pr(−,+) = Pr(·,+) − Pr(+,+) = Pr(·,+) − Pr(+,·)
  - For any k > 1 points, we can reduce directly from k to 2: P(−⋯−(−,+)+⋯+) = P((−,+)) = P((·,+)) − P((+,·))
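The two-point threshold identity can be checked numerically; a small sketch in which the uniform prior over thresholds and the points x1 < x2 are illustrative choices:

```python
import random

# Check Pr(-,+) = Pr(.,+) - Pr(+,.) for thresholds h_t(x) = +1 iff x >= t.
# Monotonicity makes the identity hold pointwise: if x1 gets +, so does x2.
rng = random.Random(0)
thresholds = [rng.random() for _ in range(100000)]   # prior: t ~ U[0,1]
x1, x2 = 0.3, 0.7

def label(t, x):
    return 1 if x >= t else -1

n = len(thresholds)
p_mp = sum(1 for t in thresholds if label(t, x1) == -1 and label(t, x2) == 1) / n
p_dot_p = sum(1 for t in thresholds if label(t, x2) == 1) / n   # Pr(., +)
p_p_dot = sum(1 for t in thresholds if label(t, x1) == 1) / n   # Pr(+, .)
print(p_mp, p_dot_p - p_p_dot)   # the two quantities agree
```

Because the (+,−) pattern has probability 0 for thresholds, the indicator of (−,+) equals the indicator of (·,+) minus that of (+,·) sample by sample, so the identity is exact, not just approximate.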
Slide 23
Outline
• Active Property Testing (Submitted)
• Self-Verifying Bayesian Active Learning (Published)
• Transfer Learning (Accepted)
• Learning with General Types of Query (in progress)
- Learning DNF formulas
- Learning Voronoi
Slide 24
Learning with General Query
• Construct problem-specific queries to efficiently learn problems that have no known efficient PAC-learning algorithm.
Slide 25
Learning DNF formulas
- DNF formulas: n is the # of variables; a poly-sized DNF has n^O(1) terms, e.g. functions like f = (x1 ∧ x2) ∨ (x1 ∧ x4)
- A natural form of knowledge representation
- [Valiant 1984]; a great challenge for over 20 years
- PAC-learning DNF formulas appears to be very hard
- The fastest known algorithm [KS01] runs in time exp(n^(1/3) log^2 n)
- If the algorithm is forced to output a hypothesis which is itself a DNF, the problem is NP-hard
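For concreteness, the slide's example formula can be evaluated with a simple term-list representation; the representation itself is an assumption, not from the talk (variables are 0-indexed, so x1 is index 0):

```python
# The slide's example f = (x1 AND x2) OR (x1 AND x4), as a list of terms;
# each term is a list of (variable index, required value) pairs.
f = [[(0, 1), (1, 1)],   # term (x1 AND x2)
     [(0, 1), (3, 1)]]   # term (x1 AND x4)

def eval_dnf(dnf, x):
    """f(x) = 1 if any term has all its literals satisfied, else -1."""
    for term in dnf:
        if all(x[i] == v for i, v in term):
            return 1
    return -1

print(eval_dnf(f, [1, 1, 0, 0]))   # → 1  (first term satisfied)
print(eval_dnf(f, [1, 0, 0, 0]))   # → -1 (no term satisfied)
print(eval_dnf(f, [1, 0, 0, 1]))   # → 1  (second term satisfied)
```

Evaluating a poly-sized DNF is trivial; the hardness discussed on this slide is in learning such a formula from labeled examples.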
Slide 26