Coarse sample complexity bounds for active learning
Sanjoy Dasgupta, UC San Diego
Supervised learning
Given access to labeled data (drawn iid from an unknown underlying distribution P), want to learn a classifier chosen from a hypothesis class H, with misclassification rate < ε.
Sample complexity characterized by d = VC dimension of H.
If the data is separable, need roughly d/ε labeled samples.
Active learning
In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label.
What is the minimum number of labels needed to achieve the target error rate?
Our result
A parameter which coarsely characterizes the label complexity of active learning in the separable setting
Can adaptive querying really help?
[CAL92, D04]: Threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}
Start with 1/ε unlabeled points.
Binary search – need just log 1/ε labels, from which the rest can be inferred! Exponential improvement in sample complexity.
[Figure: threshold at w; − to the left of w, + to the right]
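A minimal sketch of this binary-search strategy, assuming a separable pool and a label oracle; the function and argument names are illustrative, not from the talk.

```python
def active_learn_threshold(unlabeled_xs, query_label):
    """Binary search for a threshold classifier h_w(x) = 1(x >= w).

    unlabeled_xs: pool of roughly 1/eps unlabeled points (floats).
    query_label:  oracle returning the true label (0 or 1) of a point.
    Uses only O(log |pool|) label queries; all other labels are inferred
    from separability (everything left of a 0 is 0, right of a 1 is 1).
    """
    xs = sorted(unlabeled_xs)
    lo, hi = 0, len(xs)            # search for the first index with label 1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(xs[mid]) == 1:
            hi = mid               # w is at or before xs[mid]
        else:
            lo = mid + 1           # w is strictly to the right of xs[mid]
    w_hat = xs[lo] if lo < len(xs) else float("inf")
    return w_hat, queries
```

For example, with xs drawn uniformly from [0, 1] and query_label = lambda x: int(x >= 0.37), the returned w_hat separates the whole pool correctly after about log2(len(xs)) queries.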
More general hypothesis classes
For a general hypothesis class with VC dimension d, is a “generalized binary search” possible?
Random choice of queries: d/ε labels. Perfect binary search: d log 1/ε labels.
Where in this large range does the label complexity of active learning lie?
We’ve already handled linear separators in 1-d…
Linear separators in R^2
For linear separators in R^1, need just log 1/ε labels. But when H = {linear separators in R^2}: some target hypotheses require 1/ε labels to be queried!
Consider any distribution over the circle in R^2. In the vicinity of a bad hypothesis h0 there are hypotheses h1, h2, h3, …, h_{1/ε}, each disagreeing with h0 on a different arc carrying an ε fraction of the distribution.
[Figure: circle with the arcs of h1, h2, h3, … marked, each an ε fraction of the distribution]
Need 1/ε labels to distinguish between h0, h1, h2, …, h_{1/ε}!
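To make the counting concrete, here is a toy rendering of this construction (an illustrative sketch, not code from the talk; for concreteness h0 is taken to be the all-negative hypothesis): a single query answered '−' rules out at most one of h1, …, h_{1/ε}.

```python
import math

def circle_bad_case(eps=0.1):
    """Toy version of the lower-bound construction above.

    Data is uniform on the circle (parametrized by angle).  h_i labels only
    the i-th arc, of mass eps, positive; h_0 labels everything negative.
    If the target is h_0, every query is answered '-', and that answer
    rules out at most one of h_1, ..., h_{1/eps}: 1/eps queries are needed.
    """
    k = int(round(1 / eps))
    def make_h(i):
        # h_i is positive exactly on the arc covering fraction [(i-1)*eps, i*eps)
        return lambda theta: int(i > 0 and
                                 (i - 1) * eps <= (theta / (2 * math.pi)) % 1.0 < i * eps)
    hypotheses = [make_h(i) for i in range(k + 1)]   # h_0, h_1, ..., h_k

    theta = 0.7 * 2 * math.pi                        # one arbitrary query point
    answer = 0                                       # the target h_0 says '-'
    survivors = [h for h in hypotheses if h(theta) == answer]
    return len(hypotheses), len(survivors)           # k+1 hypotheses, k survive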
A fuller picture
For linear separators in R^2: some bad target hypotheses which require 1/ε labels, but “most” require just O(log 1/ε) labels…
A view of the hypothesis space
H = {linear separators in R^2}
[Figure: the hypothesis space, with the all-positive and all-negative hypotheses marked, along with a good region and bad regions]
Geometry of hypothesis space
H = any hypothesis class, of VC dimension d < ∞.
P = underlying distribution of data.
(i) Non-Bayesian setting: no probability measure on H
(ii) But there is a natural (pseudo) metric: d(h, h’) = P(h(x) ≠ h’(x))
(iii) Each point x defines a cut through H
[Figure: hypothesis space H containing h and h’; a point x defines a cut through H]
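As a hedged illustration (the function and argument names are mine, not the talk's): the pseudo-metric can be estimated from unlabeled data alone, which is what makes it usable in this setting.

```python
def empirical_distance(h, h_prime, unlabeled_xs):
    """Estimate the pseudo-metric d(h, h') = P(h(x) != h'(x)).

    h, h_prime:   hypotheses, i.e. callables mapping a point to {0, 1}.
    unlabeled_xs: a sample drawn from the underlying distribution P.
    No labels are needed: the metric only measures where the two
    hypotheses disagree, not which of them is right.
    """
    disagreements = sum(1 for x in unlabeled_xs if h(x) != h_prime(x))
    return disagreements / len(unlabeled_xs)
```

For two threshold classifiers, this is just the fraction of sample points lying between the two thresholds.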
The learning process
(h0 = target hypothesis)
Keep asking for labels until the diameter of the remaining version space is at most ε.
[Figure: the version space shrinking within H around the target h0]
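A minimal sketch of this loop over a finite set of candidate hypotheses; the greedy query rule and all names here are stand-ins of mine (a crude "generalized binary search" heuristic), not the talk's algorithm, and the talk's bounds do not attach to this particular heuristic.

```python
def active_learn(version_space, pool, query_label, distance, eps):
    """Query labels until the remaining version space has diameter <= eps.

    version_space: finite list of candidate hypotheses (callables x -> {0, 1}),
                   standing in for the (usually infinite) class H.
    pool:          unlabeled points available for querying.
    query_label:   oracle returning the true label of a queried point.
    distance:      pseudo-metric d(h, h'), e.g. empirical_distance above.
    """
    def diameter(hs):
        return max((distance(h, g) for h in hs for g in hs), default=0.0)

    pool = list(pool)
    while pool and diameter(version_space) > eps:
        # Greedy choice: query the point on which the current version space
        # is split most evenly.
        x = max(pool, key=lambda p: min(
            sum(h(p) for h in version_space),
            sum(1 - h(p) for h in version_space)))
        pool.remove(x)
        y = query_label(x)
        version_space = [h for h in version_space if h(x) == y]
    return version_space
```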
Searchability index (depends on: accuracy ε, data distribution P, amount of unlabeled data)
Each hypothesis h ∈ H has a “searchability index” ρ(h)
ρ(h) ∝ min(pos mass of h, neg mass of h), but never < ε
ε ≤ ρ(h) ≤ 1, bigger is better
Example: linear separators in R^2, data on a circle:
[Figure: the hypothesis space H, with searchability values 1/2, 1/3, 1/3, 1/4, 1/4, 1/5, 1/5 marked at various hypotheses, and the all-positive hypothesis marked as well]
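Taking the proportionality on this slide literally, a toy estimate from a sample might look as follows (this is only the coarse proxy named here, not the paper's full definition; names are illustrative).

```python
def searchability_proxy(h, sample_xs, eps):
    """Coarse proxy for the searchability index, read literally off the
    slide: proportional to min(positive mass, negative mass) of h under P,
    but never smaller than eps.  Masses are estimated from sample_xs.
    """
    pos = sum(h(x) for x in sample_xs) / len(sample_xs)
    return max(eps, min(pos, 1.0 - pos))
```

This literal proxy need not match the exact values quoted later for specific classes (e.g. ρ = 1/2 for all thresholds on the line); it only captures the coarse quantity this slide names.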
Searchability index (depends on: accuracy ε, data distribution P, amount of unlabeled data)
Each hypothesis h ∈ H has a “searchability index” ρ(h)
Searchability index lies in the range: ε ≤ ρ(h) ≤ 1
Upper bound. There is an active learning scheme which identifies any target hypothesis h ∈ H (within accuracy ≤ ε) with a label complexity of at most: …
Lower bound. For any h ∈ H, any active learning scheme for the neighborhood B(h, ρ(h)) has a label complexity of at least: …
[When ρ(h) ≫ ε: active learning helps a lot.]
Linear separators in R^d
Previous sample complexity results for active learning have focused on the following case:
H = homogeneous (through the origin) linear separators in R^d
Data distributed uniformly over unit sphere
[1] Query by committee [SOS92, FSST97]
Bayesian setting: average-case over target hypotheses picked uniformly from the unit sphere
[2] Perceptron-based active learner [DKM05]
Non-Bayesian setting: worst-case over target hypotheses
In either case: just Õ(d log 1/ε) labels needed!
Example: linear separators in R^d
This sample complexity is realized by many schemes:
[SOS92, FSST97] Query by committee
[DKM05] Perceptron-based active learner
Simplest of all, [CAL92]: pick a random point whose label is not completely certain (with respect to the current version space) – see the sketch below
H: {homogeneous linear separators in R^d}, P: uniform distribution (as before)
ρ(h) is the same for all h, and is ≥ 1/8
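A hedged sketch of the [CAL92]-style rule mentioned above, using a finite list of candidate hypotheses in place of the true version space; the simplification and the names are mine, not the talk's.

```python
import random

def cal_query_step(version_space, pool, query_label):
    """One step of a [CAL92]-style rule: query a random pool point whose
    label the current version space does not fully agree on.

    version_space: finite list of candidate hypotheses (callables x -> {0, 1}).
    pool:          list of unlabeled points; the queried point is removed.
    query_label:   oracle returning the true label of a point.
    Returns the reduced version space (unchanged if every remaining pool
    point already has a certain label).
    """
    uncertain = [x for x in pool if len({h(x) for h in version_space}) > 1]
    if not uncertain:
        return version_space
    x = random.choice(uncertain)
    pool.remove(x)
    y = query_label(x)
    return [h for h in version_space if h(x) == y]
```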
Linear separators in R^d
Uniform distribution:
Concentrated near the equator
(any equator)
Linear separators in R^d
Instead: distribution P with a different vertical marginal:
Result: ρ ≥ 1/32, provided the amount of unlabeled data grows by …
Do the schemes [CAL92, SOS92, FSST97, DKM05] achieve this label complexity?
Say that for λ < ∞,
U(x)/λ ≤ P(x) ≤ λ·U(x)
(U = uniform)
What next
1. Make this algorithmic!
Linear separators: is some kind of “querying near current boundary” a reasonable approximation?
2. Nonseparable data
Need a robust base learner!
[Figure: + and − points around the true boundary]
Thanks
For helpful discussions:
Peter Bartlett, Yoav Freund, Adam Kalai, John Langford, Claire Monteleoni
Star-shaped configurations
Hypothesis space: In the vicinity of the “bad” hypothesis h0, we find a star structure:
Data space:
[Figures: data space with hypotheses h1, h2, h3, …, h_{1/ε}, each differing from h0 on a small region; hypothesis space with h1, …, h_{1/ε} arranged in a star around h0]
Example: the 1-d line
Searchability index lies in the range: ε ≤ ρ(h) ≤ 1
Theorem: … ≤ # labels needed ≤ …
Example: Threshold functions on the line
[Figure: threshold at w; − to the left of w, + to the right]
Result: ρ = 1/2 for any target hypothesis and any input distribution
Linear separators in R^d
Result: ρ = Ω(1) for most target hypotheses, but is ε for the hypothesis that makes one slab +, the other −… the most “natural” one!
Data lies on the rim of two slabs, distributed uniformly.
[Figure: the two slabs, with the origin marked]