
Page 1: Coarse sample complexity bounds for active learning Sanjoy Dasgupta UC San Diego

Coarse sample complexity bounds for active learning

Sanjoy DasguptaUC San Diego

Page 2:

Supervised learning

Given access to labeled data (drawn i.i.d. from an unknown underlying distribution P), we want to learn a classifier, chosen from a hypothesis class H, with misclassification rate < ε.

Sample complexity is characterized by d = the VC dimension of H. If the data is separable, roughly d/ε labeled samples are needed.

Page 3:

Active learning

In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label.

What is the minimum number of labels needed to achieve the target error rate?

Page 4:

Our result

A parameter which coarsely characterizes the label complexity of active learning in the separable setting

Page 5:

Can adaptive querying really help?

[CAL92, D04]: threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}

Start with 1/ε unlabeled points.

Binary search – just log 1/ε labels are needed, and the rest can be inferred! An exponential improvement in sample complexity.

[Figure: the real line with threshold w; points left of w are labeled –, points right of w are labeled +.]
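The binary-search scheme above can be sketched in a few lines. This is an illustrative simulation, not from the talk: the point set, the hidden threshold `w`, and the labeling oracle are all assumptions for the demo.

```python
import math

def active_learn_threshold(xs, oracle):
    """Binary search for the first positive point among sorted xs.

    xs: sorted unlabeled points; oracle(x): true label of x (0 or 1).
    Returns (inferred labels for all of xs, number of label queries).
    """
    lo, hi = 0, len(xs) - 1
    first_pos = len(xs)  # default: every point is negative
    queries = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 1:
            first_pos = mid   # boundary is at or before mid
            hi = mid - 1
        else:
            lo = mid + 1      # boundary is after mid
    labels = [0] * first_pos + [1] * (len(xs) - first_pos)
    return labels, queries

# Demo: 1/eps unlabeled points, hidden threshold w.
eps = 0.01
n = int(1 / eps)                        # 100 points
xs = [(i + 0.5) / n for i in range(n)]  # already sorted
w = 0.37
oracle = lambda x: 1 if x >= w else 0

labels, queries = active_learn_threshold(xs, oracle)
assert labels == [oracle(x) for x in xs]       # all labels inferred correctly
assert queries <= math.ceil(math.log2(n)) + 1  # ~log(1/eps) queries, not 1/eps
```

With 100 unlabeled points, the search spends at most 7 label queries instead of 100, which is exactly the exponential improvement the slide describes.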

Page 6:

More general hypothesis classes

For a general hypothesis class with VC dimension d, is a “generalized binary search” possible?

Random choice of queries: d/ε labels
Perfect binary search: d log 1/ε labels

Where in this large range does the label complexity of active learning lie?

We’ve already handled linear separators in 1-d…
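To see how wide the range between those two extremes is, here is a quick numerical comparison; the values of d and ε are illustrative choices, not from the talk.

```python
import math

d, eps = 10, 0.01
passive = d / eps                 # random choice of queries: d/eps labels
ideal = d * math.log2(1 / eps)    # perfect binary search: d log(1/eps) labels
print(passive, ideal)             # 1000.0 vs roughly 66.4
```

The gap is exponential in the dependence on 1/ε, which is why pinning down where a given hypothesis class falls in this range matters.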

Page 7:

Linear separators in R2

For linear separators in R¹, just log 1/ε labels are needed. But when H = {linear separators in R²}, some target hypotheses require 1/ε labels to be queried!

Consider any distribution over the circle in R².

[Figure: hypotheses h0, h1, h2, h3 on the circle; each hi labels a different small fraction of the distribution positive.]

Need 1/ε labels to distinguish between h0, h1, h2, …, h_{1/ε}!

Page 8:

A fuller picture

For linear separators in R²: some bad target hypotheses require 1/ε labels, but “most” require just O(log 1/ε) labels…

[Figure: spectrum of hypotheses from “good” to “bad”.]

Page 9:

A view of the hypothesis space

H = {linear separators in R2}

[Figure: map of the hypothesis space, with the all-positive hypothesis, the all-negative hypothesis, a good region, and bad regions marked.]

Page 10:

Geometry of hypothesis space

H = any hypothesis class, of VC dimension d < ∞.

P = underlying distribution of data.

(i) Non-Bayesian setting: no probability measure on H

(ii) But there is a natural (pseudo-)metric: d(h, h’) = P(h(x) ≠ h’(x))

(iii) Each point x defines a cut through H

h

h’

H

x
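The pseudo-metric in (ii) can be estimated by sampling from P. A minimal sketch, where representing hypotheses as Python predicates and using a uniform sampler are illustrative assumptions:

```python
import random

def disagreement(h, h_prime, sample_P, n=100_000):
    """Monte Carlo estimate of d(h, h') = P(h(x) != h'(x))."""
    xs = [sample_P() for _ in range(n)]
    return sum(h(x) != h_prime(x) for x in xs) / n

# Two threshold hypotheses on [0, 1] under the uniform distribution:
h = lambda x: x >= 0.3
h_prime = lambda x: x >= 0.5
est = disagreement(h, h_prime, random.random)
# They disagree exactly on [0.3, 0.5), so d(h, h') = 0.2.
assert abs(est - 0.2) < 0.02
```

Note that d(·,·) is only a pseudo-metric: two distinct hypotheses that agree P-almost-everywhere are at distance 0.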

Page 11:

The learning process

(h0 = target hypothesis)

Keep asking for labels until the diameter of the remaining version space is at most ε.

[Figure: the version space in H shrinking around the target h0.]

Page 12:

Searchability index

Parameters: accuracy ε, data distribution P, amount of unlabeled data.

Each hypothesis h ∈ H has a “searchability index” ρ(h):

ρ(h) ∝ min(positive mass of h, negative mass of h), but never < ε

ε ≤ ρ(h) ≤ 1; bigger is better.

Example: linear separators in R², data on a circle:

[Figure: circle with sample values of ρ marked for various hypotheses in H – 1/2, 1/4, 1/5, 1/4, 1/5, 1/3, 1/3 – including the all-positive hypothesis.]
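The proportionality on this slide can be turned into a toy computation for thresholds on [0, 1] under the uniform distribution, where the positive and negative masses are easy to read off. The normalization below is my simplification of “∝ min(pos mass, neg mass), but never < ε”, and the function name is hypothetical:

```python
def searchability(pos_mass, eps):
    """Toy stand-in for rho(h): min(positive mass, negative mass),
    floored at the accuracy parameter eps."""
    return max(min(pos_mass, 1.0 - pos_mass), eps)

eps = 0.01
# A threshold h_w on [0, 1] with uniform data has positive mass 1 - w:
assert searchability(1 - 0.5, eps) == 0.5    # balanced split: best case
assert searchability(1 - 0.999, eps) == eps  # nearly all-negative: floored at eps
```

The floor at ε reflects that once the version space has diameter ε, the search can stop, so no hypothesis is harder than that.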

Page 13:

Searchability index

Parameters: accuracy ε, data distribution P, amount of unlabeled data.

Each hypothesis h ∈ H has a “searchability index” ρ(h).

The searchability index lies in the range ε ≤ ρ(h) ≤ 1.

Upper bound. There is an active learning scheme which identifies any target hypothesis h ∈ H (to within accuracy ε) with a label complexity of at most: [formula not transcribed]

Lower bound. For any h ∈ H, any active learning scheme for the neighborhood B(h, ρ(h)) has a label complexity of at least: [formula not transcribed]

[When ρ(h) ≫ ε: active learning helps a lot.]

Page 14:

Linear separators in Rd

Previous sample complexity results for active learning have focused on the following case:

H = homogeneous (through the origin) linear separators in Rd

Data distributed uniformly over unit sphere

[1] Query by committee [SOS92, FSST97]. Bayesian setting: average-case over target hypotheses picked uniformly from the unit sphere.

[2] Perceptron-based active learner [DKM05]. Non-Bayesian setting: worst-case over target hypotheses.

In either case: just Õ(d log 1/ε) labels needed!

Page 15:

Example: linear separators in Rd

H = {homogeneous linear separators in Rd}, P = uniform distribution.

ρ(h) is the same for all h, and is ≥ 1/8.

This sample complexity is realized by many schemes:

[SOS92, FSST97] Query by committee (as before)

[DKM05] Perceptron-based active learner (as before)

Simplest of all, [CAL92]: pick a random point whose label is not completely certain (with respect to the current version space).
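The [CAL92] rule can be sketched for threshold functions, where the version space is just an interval of candidate thresholds. Everything below – the names, the oracle, the stopping condition – is an illustrative assumption, not the scheme from the talk:

```python
import random

def cal_thresholds(xs, oracle, seed=0):
    """CAL-style sketch for h_w(x) = 1(x >= w): the version space is an
    interval (lo, hi] of thresholds consistent with the labels seen so far.
    Repeatedly query a random point whose label is not yet determined."""
    rng = random.Random(seed)
    lo, hi = min(xs) - 1.0, max(xs) + 1.0   # all thresholds still possible
    queries = 0
    while True:
        uncertain = [x for x in xs if lo < x < hi]  # label not determined
        if not uncertain:
            break
        x = rng.choice(uncertain)
        queries += 1
        if oracle(x) == 1:   # x positive => threshold w <= x
            hi = x
        else:                # x negative => threshold w > x
            lo = x
    return (lo, hi), queries

# Demo: 200 random points, hidden threshold w.
rng = random.Random(1)
xs = sorted(rng.random() for _ in range(200))
w = 0.37
oracle = lambda x: 1 if x >= w else 0

(lo, hi), queries = cal_thresholds(xs, oracle)
assert lo < w <= hi        # the version space still contains the target
assert queries <= len(xs)  # typically ~log n queries, never more than n
```

Each query lands strictly inside the current interval, so every answer shrinks the version space; with random pivots the expected number of queries is O(log n), recovering the binary-search behavior.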

Page 16:

Linear separators in Rd

Uniform distribution: concentrated near the equator (any equator).

[Figure: sphere split into + and – halves, with mass concentrated near the equator.]

Page 17:

Linear separators in Rd

Instead: a distribution P with a different vertical marginal.

Result: ρ ≥ 1/32, provided the amount of unlabeled data grows by …

Do the schemes [CAL92, SOS92, FSST97, DKM05] achieve this label complexity?

[Figure: sphere with + and – halves under the modified marginal.]

Say that for some λ ≥ 1: U(x)/λ ≤ P(x) ≤ λ·U(x) (U = uniform).

Page 18:

What next

1. Make this algorithmic!

Linear separators: is some kind of “querying near current boundary” a reasonable approximation?

2. Nonseparable data

Need a robust base learner!

[Figure: nonseparable data around the true boundary, with + and – points on both sides.]

Page 19:

Thanks

For helpful discussions:

Peter Bartlett, Yoav Freund, Adam Kalai, John Langford, Claire Monteleoni

Page 20:

Star-shaped configurations

Hypothesis space: in the vicinity of the “bad” hypothesis h0, we find a star structure, with h0 at the center and h1, h2, h3, …, h_{1/ε} around it.

Data space:

[Figure: the hypotheses h0, h1, h2, h3, …, h_{1/ε} drawn in the data space, alongside the star structure in hypothesis space.]

Page 21:

Example: the 1-d line

Searchability index lies in the range ε ≤ ρ(h) ≤ 1.

Theorem: [lower bound] ≤ # labels needed ≤ [upper bound] (formulas not transcribed)

Example: threshold functions on the line

[Figure: the real line with threshold w; – to the left, + to the right.]

Result: ρ = 1/2 for any target hypothesis and any input distribution.

Page 22:

Linear separators in Rd

Result: ρ = Ω(1) for most target hypotheses, but ρ = ε for the hypothesis that makes one slab +, the other –… the most “natural” one!

[Figure: data on the rims of two slabs through the origin, distributed uniformly.]