Coarse sample complexity bounds for active learning Sanjoy Dasgupta UC San Diego.


Transcript of "Coarse sample complexity bounds for active learning", Sanjoy Dasgupta, UC San Diego.

Page 1:

Coarse sample complexity bounds for active learning

Sanjoy Dasgupta, UC San Diego

Page 2:

Supervised learning

Given access to labeled data (drawn i.i.d. from an unknown underlying distribution P), we want to learn a classifier, chosen from a hypothesis class H, with misclassification rate < ε.

Sample complexity is characterized by d = VC dimension of H. If the data is separable, roughly d/ε labeled samples are needed.

Page 3:

Active learning

In many situations – like speech recognition and document retrieval – unlabeled data is easy to come by, but there is a charge for each label.

What is the minimum number of labels needed to achieve the target error rate?

Page 4:

Our result

A parameter which coarsely characterizes the label complexity of active learning in the separable setting

Page 5:

Can adaptive querying really help?

[CAL92, D04]: Threshold functions on the real line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}

Start with 1/ε unlabeled points.

Binary search – need just log 1/ε labels, from which the rest can be inferred! Exponential improvement in sample complexity.

[Figure: points on a line split at threshold w, labeled - to the left and + to the right]
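The binary-search idea can be sketched as follows (a minimal illustration, not the talk's algorithm verbatim; `oracle` stands in for the human labeler):

```python
def learn_threshold(points, oracle):
    """Actively learn a threshold h_w(x) = 1(x >= w) over a pool of
    unlabeled points. Binary search queries only O(log n) labels;
    the labels of all remaining points are then inferred.
    Returns (inferred labels for the sorted points, number of queries)."""
    points = sorted(points)
    lo, hi = 0, len(points)  # invariant: points[:lo] are 0, points[hi:] are 1
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(points[mid]) == 1:  # boundary is at or left of mid
            hi = mid
        else:                         # boundary is strictly right of mid
            lo = mid + 1
    labels = [0] * lo + [1] * (len(points) - lo)
    return labels, queries
```

With a pool of 1/ε = 1000 points, this needs about 10 queries instead of 1000.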

Page 6:

More general hypothesis classes

For a general hypothesis class with VC dimension d, is a “generalized binary search” possible?

Random choice of queries: d/ε labels. Perfect binary search: d log 1/ε labels.

Where in this large range does the label complexity of active learning lie?

We’ve already handled linear separators in 1-d…
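To get a feel for how wide this range is, a quick numeric comparison (d = 10 and ε = 0.01 are arbitrary illustrative values):

```python
import math

d, eps = 10, 0.01
passive = d / eps                 # random queries: d/eps labels
active = d * math.log2(1 / eps)   # perfect binary search: d log 1/eps labels
print(f"passive: {passive:.0f} labels, ideal active: {active:.0f} labels")
# prints "passive: 1000 labels, ideal active: 66 labels"
```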

Page 7:

Linear separators in R2

For linear separators in R1, just log 1/ε labels are needed. But when H = {linear separators in R2}, some target hypotheses require 1/ε labels!

Consider any distribution over the circle in R2.

[Figure: hypotheses h0, h1, h2, h3 on the circle, each labeling a different ε-fraction of the distribution positive]

Need 1/ε labels to distinguish between h0, h1, h2, …, h_{1/ε}!

Page 8:

A fuller picture

For linear separators in R2: some bad target hypotheses require 1/ε labels, but “most” require just O(log 1/ε) labels…

[Figure: spectrum of hypotheses from good to bad]

Page 9:

A view of the hypothesis space

H = {linear separators in R2}

[Figure: the hypothesis space, showing the all-positive and all-negative hypotheses, a good region, and bad regions]

Page 10:

Geometry of hypothesis space

H = any hypothesis class, of VC dimension d < ∞.

P = underlying distribution of data.

(i) Non-Bayesian setting: no probability measure on H

(ii) But there is a natural (pseudo)metric: d(h, h′) = P(h(x) ≠ h′(x))

(iii) Each point x defines a cut through H

[Figure: hypothesis space H containing h and h′, with the cut induced by a point x]

Page 11:

The learning process

(h0 = target hypothesis)

Keep asking for labels until the diameter of the remaining version space is at most ε.

[Figure: version space inside H shrinking around the target h0]

Page 12:

Searchability index (depends on: accuracy ε, data distribution P, amount of unlabeled data)

Each hypothesis h ∈ H has a “searchability index” ρ(h)

ρ(h) ∝ min(pos mass of h, neg mass of h), but never < ε

ε ≤ ρ(h) ≤ 1; bigger is better

Example: linear separators in R2, data on a circle:

[Figure: hypothesis space H annotated with searchability indices (1/2, 1/4, 1/5, 1/3, …); the all-positive hypothesis among them]
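The searchability index is proportional to min(pos mass, neg mass) of a hypothesis, floored at ε. The sketch below simply takes that min itself, estimated from a sample (the true definition involves a constant of proportionality and the geometry of H, so treat this as a caricature):

```python
def searchability_estimate(h, sample, eps):
    """Caricature of the searchability index: the smaller of the positive
    and negative probability mass of h under the data distribution,
    estimated from a sample and floored at eps."""
    pos = sum(1 for x in sample if h(x) == 1) / len(sample)
    return max(eps, min(pos, 1 - pos))

# Thresholds on a uniform grid over [0, 1): h_w labels [w, 1) positive.
sample = [i / 1000 for i in range(1000)]
h_mid = lambda x: 1 if x >= 0.5 else 0     # balanced hypothesis
h_edge = lambda x: 1 if x >= 0.999 else 0  # nearly all-negative hypothesis
print(searchability_estimate(h_mid, sample, eps=0.01))   # → 0.5
print(searchability_estimate(h_edge, sample, eps=0.01))  # → 0.01 (floored)
```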

Page 13:

Searchability index (depends on: accuracy ε, data distribution P, amount of unlabeled data)

Each hypothesis h ∈ H has a “searchability index” ρ(h)

Searchability index lies in the range: ε ≤ ρ(h) ≤ 1

Upper bound. There is an active learning scheme which identifies any target hypothesis h ∈ H (to within accuracy ε) with a label complexity of at most: [formula not preserved in this transcript]

Lower bound. For any h ∈ H, any active learning scheme for the neighborhood B(h, ρ(h)) has a label complexity of at least: [formula not preserved in this transcript]

[When ρ(h) ≫ ε: active learning helps a lot.]

Page 14:

Linear separators in Rd

Previous sample complexity results for active learning have focused on the following case:

H = homogeneous (through the origin) linear separators in Rd

Data distributed uniformly over unit sphere

[1] Query by committee [SOS92, FSST97]. Bayesian setting: average-case over target hypotheses picked uniformly from the unit sphere.

[2] Perceptron-based active learner [DKM05]. Non-Bayesian setting: worst-case over target hypotheses.

In either case: just O(d log 1/ε) labels needed!

Page 15:

Example: linear separators in Rd

This sample complexity is realized by many schemes:

[SOS92, FSST97] Query by committee (as before)

[DKM05] Perceptron-based active learner (as before)

Simplest of all, [CAL92]: pick a random point whose label is not completely certain (with respect to the current version space)

H: {Homogeneous linear separators in Rd}, P: uniform distribution

ρ(h) is the same for all h, and is ≥ 1/8
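A runnable caricature of the [CAL92] rule for this setting, with the version space tracked over a finite grid of angles (the discretization, the grid size, and the tie-breaking at the decision boundary are assumptions made purely for illustration):

```python
import math
import random

def cal(points, oracle, num_hyp=720):
    """Sketch of the CAL strategy for homogeneous linear separators in R^2.
    Repeatedly query a random point on which the current version space
    disagrees, until every point's label is determined."""
    rng = random.Random(0)
    # h_theta(x) = 1 iff x lies on the positive side of the unit normal at angle theta
    hyps = [2 * math.pi * k / num_hyp for k in range(num_hyp)]
    predict = lambda th, x: 1 if math.cos(th) * x[0] + math.sin(th) * x[1] >= 0 else 0
    labeled = []  # (point, label) pairs gathered so far
    queries = 0
    while True:
        # version space: hypotheses consistent with all labels seen so far
        version_space = [th for th in hyps
                         if all(predict(th, x) == y for x, y in labeled)]
        # uncertain points: those on which the version space disagrees
        uncertain = [x for x in points
                     if len({predict(th, x) for th in version_space}) > 1]
        if not uncertain:
            break
        x = rng.choice(uncertain)
        labeled.append((x, oracle(x)))
        queries += 1
    # every remaining label is now inferred from the version space
    labels = [predict(version_space[0], x) for x in points]
    return labels, queries
```

Each query makes at least one more point certain, so the loop always terminates; on well-spread data the number of queries is far below the pool size.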

Page 16:

Linear separators in Rd

Uniform distribution:

Concentrated near the equator (any equator)

[Figure: sphere with a homogeneous separator splitting it into + and - hemispheres]

Page 17:

Linear separators in Rd

Instead: a distribution P with a different vertical marginal. Say that for some λ ≥ 1, U(x)/λ ≤ P(x) ≤ λ·U(x) (U = uniform).

Result: ρ ≥ 1/32, provided the amount of unlabeled data grows by …

Do the schemes [CAL92, SOS92, FSST97, DKM05] achieve this label complexity?

[Figure: sphere with a non-uniform vertical marginal, split into + and - regions]

Page 18:

What next

1. Make this algorithmic!

Linear separators: is some kind of “querying near current boundary” a reasonable approximation?

2. Nonseparable data

Need a robust base learner!

[Figure: nonseparable data around the true boundary, with + and - points on both sides]

Page 19:

Thanks

For helpful discussions:

Peter Bartlett, Yoav Freund, Adam Kalai, John Langford, Claire Monteleoni

Page 20:

Star-shaped configurations

Hypothesis space: In the vicinity of the “bad” hypothesis h0, we find a star structure:

Data space:

[Figure: star structure around h0 in hypothesis space, with arms h1, h2, h3, …, h_{1/ε}, and the corresponding configuration in data space]

Page 21:

Example: the 1-d line

Searchability index lies in the range: ε ≤ ρ(h) ≤ 1

Theorem: lower and upper bounds on the number of labels needed, in terms of ρ(h) [formulas not preserved in this transcript]

Example: Threshold functions on the line

[Figure: threshold w on the line, - to the left, + to the right]

Result: ρ = 1/2 for any target hypothesis and any input distribution

Page 22:

Linear separators in Rd

Result: ρ = Ω(1) for most target hypotheses, but is only ε for the hypothesis that makes one slab +, the other -… the most “natural” one!

Data lies on the rim of two slabs (through the origin), distributed uniformly