Analysis of greedy active learning Sanjoy Dasgupta UC San Diego.
Analysis of greedy active learning
Sanjoy Dasgupta, UC San Diego
Standard learning model
Given m labeled points, want to learn a classifier with misclassification rate < ε, chosen from a hypothesis class H with VC dimension d < ∞.
VC theory: need m to be roughly d/ε, in the realizable case.
Active learning
Unlabeled data is easy to come by, but there is a charge for each label.
What is the minimum number of labels needed to achieve the target error rate?
Can adaptive querying help?
Simple hypothesis class: threshold functions on the real line:
hw(x) = 1(x ≥ w), H = {hw : w ∈ R}
Start with m ≈ 1/ε unlabeled points.
Binary search – need just log m labels, from which the rest can be inferred! An exponential improvement in sample complexity.
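The binary search over thresholds can be sketched as follows. This is a minimal illustration, assuming the realizable case (labels really come from some threshold hw); the names `active_learn_threshold` and `query_label` are hypothetical and not from the talk.

```python
import math
from typing import Callable, List, Tuple


def active_learn_threshold(
    points: List[float], query_label: Callable[[float], int]
) -> Tuple[int, int]:
    """Locate the first positive point by binary search.

    Returns (index of the first positive point in sorted order,
    number of labels queried). An index equal to len(points) means
    every point is negative. Assumes labels are h_w(x) = 1(x >= w).
    """
    xs = sorted(points)
    lo, hi = 0, len(xs)  # the first positive index lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1  # one label is paid for per iteration
        if query_label(xs[mid]) == 1:
            hi = mid  # xs[mid] is positive: boundary is at mid or earlier
        else:
            lo = mid + 1  # xs[mid] is negative: boundary is strictly later
    return lo, queries


# Usage: 1000 unlabeled points, hidden threshold w = 637.5 (made-up value).
pool = [float(i) for i in range(1000)]
first_positive, labels_used = active_learn_threshold(
    pool, lambda x: int(x >= 637.5)
)
label_budget = math.ceil(math.log2(len(pool) + 1))
print(first_positive, labels_used, label_budget)
```

Once the boundary index is known, every remaining label is inferred: points before it are negative, points from it onward are positive.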
Binary search
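[Figure: binary-search query tree over the sorted points; internal nodes are label queries (x1?, x6?, x8?) on the points x1, x3, x6, x8.]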
m data points: there are effectively m+1 different hypotheses.
Query tree has m+1 leaves, depth ≈ log m.
Question: Is this a general phenomenon? For other hypothesis classes, is a generalized binary search possible?
Bad news – I
H = {linear separators in R^1}: active learning reduces sample complexity from m to log m.
But H = {linear separators in R^2}: there are some target hypotheses for which all m labels need to be queried! (No matter how benign the input distribution.)
In this case: learning to accuracy ε requires 1/ε labels…
The benefit of averaging
For linear separators in R^2:
In the worst case over target hypotheses, active learning offers no improvement in sample complexity.
But there is a query tree in which the depths of the O(m^2) target hypotheses are spread almost evenly over [log m, m].
The average depth is just log m.
Question: does active learning help only in a Bayesian model?
Degrees of Bayesian-ity
Prior π over hypotheses.
Pseudo-Bayesian model: the prior is used only to count queries.
Bayesian model: the prior is used for counting queries and also for the generalization bound.
Different stopping criteria. Suppose the remaining version space is:
[Figure: the remaining version space, with regions labeled "High mass" and "Low mass".]
Effective hypothesis class
Fix a hypothesis class H of VC dimension d < ∞, and a set of unlabeled examples x1, x2, …, xm, where m ≥ d/ε.
Sauer’s lemma: H can label these points in at most m^d different ways… the effective hypothesis class
Heff = { (h(x1), h(x2), …, h(xm)) : h ∈ H }
has size |Heff| ≤ m^d.
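To make the effective hypothesis class concrete, here is a small sketch that enumerates Heff for intervals on the line (a class of VC dimension 2). The class, the grid of endpoints, and the function name `effective_class` are illustrative assumptions, not taken from the talk.

```python
from math import comb
from typing import Callable, List, Set, Tuple


def effective_class(
    hypotheses: List[Callable[[float], int]], points: List[float]
) -> Set[Tuple[int, ...]]:
    """Collapse a list of hypotheses to the distinct labelings of `points`."""
    return {tuple(h(x) for x in points) for h in hypotheses}


# Five unlabeled points and interval hypotheses h(x) = 1(a <= x <= b).
points = [0.5, 1.5, 2.5, 3.5, 4.5]
endpoints = [0, 1, 2, 3, 4, 5]  # assumed grid; finer grids add no new labelings
intervals = [
    (lambda x, a=a, b=b: int(a <= x <= b))
    for a in endpoints
    for b in endpoints
    if a <= b
]

h_eff = effective_class(intervals, points)
m, d = len(points), 2

# Exact form of Sauer's bound: sum of C(m, i) for i = 0..d.
sauer_bound = sum(comb(m, i) for i in range(d + 1))
print(len(intervals), len(h_eff), sauer_bound, m ** d)
```

Many different intervals induce the same labeling of the five points, so the learner only ever needs to distinguish the elements of Heff.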
Goal (in the realizable case): pick the element of Heff which is consistent with all the hidden labels, while asking for just a small subset of these labels.
Model of an active learner
Query tree:
[Figure: query tree with internal nodes x1?, x6?, x8?, x3? and leaves h1, h2, h3, h5, h6.]
Each leaf is annotated with an element of Heff.
Weights π over Heff.
Goal: a tree T of small average depth,
Q(T, π) = Σ_h π(h) · depth(h)
(can also use random coin flips at internal nodes)
Question: in this averaged model, can we always find a tree of depth o(m)?
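The average-depth objective can be computed directly from a query tree. The sketch below assumes a hypothetical nested-tuple encoding, in which an internal node is `(query, negative_subtree, positive_subtree)` and a leaf is a hypothesis name; the two example trees and the weights are made up for illustration.

```python
from typing import Dict, Tuple, Union

Tree = Union[str, Tuple[str, "Tree", "Tree"]]


def leaf_depths(tree: Tree, depth: int = 0) -> Dict[str, int]:
    """Map each leaf (hypothesis name) to its depth in the query tree."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    _query, negative, positive = tree
    depths = leaf_depths(negative, depth + 1)
    depths.update(leaf_depths(positive, depth + 1))
    return depths


def average_depth(tree: Tree, pi: Dict[str, float]) -> float:
    """Weighted average leaf depth: sum over h of pi(h) * depth(h)."""
    depths = leaf_depths(tree)
    return sum(weight * depths[h] for h, weight in pi.items())


# Two trees over the same four hypotheses.
balanced: Tree = ("x2", ("x1", "h0", "h1"), ("x3", "h2", "h3"))
chain: Tree = ("x1", "h0", ("x2", "h1", ("x3", "h2", "h3")))

uniform = {"h0": 0.25, "h1": 0.25, "h2": 0.25, "h3": 0.25}
skewed = {"h0": 0.7, "h1": 0.1, "h2": 0.1, "h3": 0.1}

print(average_depth(balanced, uniform), average_depth(chain, uniform))
print(average_depth(balanced, skewed), average_depth(chain, skewed))
```

Under the uniform weights the balanced tree has the smaller average depth, while under the skewed weights the chain that resolves the heavy hypothesis first does better, which is why the objective depends on the choice of weights.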
Bad news – II
Pick any d > 0 and m ≥ 2d. There is an input space of size m and a hypothesis class H of VC dimension d such that (for uniform π) any active learning strategy requires ≥ m/8 queries on average.
Choose:
Input space = any {x1, …, xm}
H = all concepts which are positive on exactly d inputs.
A revised goal
Depending on the choice of the hypothesis class, and perhaps the input distribution, the average number of labels needed by an optimal active learner is somewhere in the range [d log m, m].
Ideal case: d log m (perfect binary search). Worst case: m (randomly chosen labels). Both within constants.
Is there a generic active learning strategy which always achieves close to the optimal number of queries, no matter what it might be?
Heuristics for active learning
A common strategy in many heuristics:
Greedy strategy. After seeing t labels, the remaining version space is some Ht. Always choose the point which most evenly divides Ht, according to π-mass.
For instance, Tong-Koller (2000), linear separators: π ∝ volume
Question: How good is this greedy scheme? And how does its performance depend on the choice of π?
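A minimal sketch of the greedy rule, assuming Heff is given explicitly as a list of distinct label vectors with strictly positive weights. "Most evenly divides" is implemented here as maximizing the smaller of the two resulting masses, which is one reasonable reading; the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Labeling = Tuple[int, ...]


def greedy_query(
    version_space: List[Labeling], pi: Dict[Labeling, float], n_points: int
) -> int:
    """Pick the point whose label splits the version space most evenly by mass."""
    best_point, best_balance = -1, 0.0
    for i in range(n_points):
        positive = sum(pi[h] for h in version_space if h[i] == 1)
        negative = sum(pi[h] for h in version_space if h[i] == 0)
        balance = min(positive, negative)  # mass of the lighter side
        if balance > best_balance:
            best_point, best_balance = i, balance
    return best_point


def greedy_depth(
    target: Labeling, h_eff: List[Labeling], pi: Dict[Labeling, float]
) -> int:
    """Number of labels the greedy learner requests before only `target` remains."""
    version_space = list(h_eff)
    queries = 0
    while len(version_space) > 1:
        i = greedy_query(version_space, pi, len(target))
        version_space = [h for h in version_space if h[i] == target[i]]
        queries += 1
    return queries


# Thresholds on m = 7 sorted points: hypothesis k labels point i positive iff i >= k.
m = 7
h_eff = [tuple(int(i >= k) for i in range(m)) for k in range(m + 1)]
pi = {h: 1.0 / len(h_eff) for h in h_eff}

greedy_average = sum(pi[h] * greedy_depth(h, h_eff, pi) for h in h_eff)
print(greedy_average)
```

On thresholds with uniform weights the greedy rule reduces to binary search, so every target is found at the same depth.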
Greedy active learning
Choose any π. How does the greedy query tree TG compare to the optimal tree T*?
Upper bound. Q(TG, π) ≤ 4 Q(T*, π) log 1/(min_h π(h)).
Example: For uniform π, the approximation ratio is log |Heff| ≤ d log m.
Lower bounds.
[1] Uniform π: we have an example in which Q(TG, π) ≥ Q(T*, π) · Ω(log |Heff| / log log |Heff|).
[2] Non-uniform π: an example where π ranges between 1/2 and 1/2^n, and Q(TG, π) ≥ Q(T*, π) · Ω(n).
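As a worked instance of the uniform case, the step from the general upper bound to the d log m ratio can be written out as follows, assuming the bound |Heff| ≤ m^d quoted earlier in the talk:

```latex
\pi(h) = \frac{1}{|H_{\mathrm{eff}}|} \ \text{for all } h
\quad\Longrightarrow\quad
\log \frac{1}{\min_h \pi(h)} = \log |H_{\mathrm{eff}}| \le \log m^{d} = d \log m,
\qquad\text{hence}\qquad
Q(T_G, \pi) \le 4\, Q(T^{*}, \pi)\, d \log m .
```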
Sub-optimality of greedy scheme
[1] The case of uniform π.
There are simple examples in which the greedy scheme uses Ω(log n/log log n) times the optimal number of labels.
(a) The hypothesis class consists of several clusters.
(b) Each cluster is efficiently searchable.
(c) But first the version space must be narrowed down to one of these clusters: an inefficient process.
[Invoke this construction recursively.]
Optimal strategy reduces entropy only gradually at first, then ramps it up later – an over-eager greedy scheme is fooled.
Sub-optimality, cont’d
[2] The case of general π.
For any n ≥ 2:
There is a hypothesis class H of size 2n+1 and a distribution π over H such that:
(a) π ranges from 1/2 to 1/2^(n+1)
(b) the optimal expected number of queries is < 3
(c) the greedy strategy uses ≥ n/2 queries on average.
[Figure: the hypothesis class H = {h0, h11, …, h1n, h21, …, h2n}, each hypothesis drawn with area proportional to its π-mass.]
Sub-optimality, cont’d
Three types of queries:
(i) Is the target some h1i? (ii) Is it some h2i? (iii) Is it h1j or h2j?
Upper bound: overview
Upper bound. Q(TG, π) ≤ 4 Q(T*, π) log 1/(min_h π(h)).
If the optimal tree is short, then
either: there is a query which (in expectation) cuts off a good chunk of the version space
or: some particular hypothesis has high weight.
At least in the first case, the greedy scheme gets off to a good start [cf. Johnson’s argument for set cover].
Quality of a query
Need a notion of query quality which can only decrease with time.
If S is a version space, and query xi splits it into S+, S−, we’ll say that “xi shrinks (S, π)” by
2 π(S+) π(S−) / π(S)
Claim: If xi shrinks (Heff, π) by Δ, then it shrinks (S, π) by at most Δ for any S ⊆ Heff.
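The shrinkage quantity and the monotonicity claim can be checked numerically on a small example. This is a brute-force sanity check over every nonempty subset of a toy threshold class with made-up non-uniform weights, not a proof; the function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple

Labeling = Tuple[int, ...]


def shrinkage(S: Sequence[Labeling], pi: Dict[Labeling, float], i: int) -> float:
    """Amount by which querying point i shrinks (S, pi): 2 pi(S+) pi(S-) / pi(S)."""
    total = sum(pi[h] for h in S)
    positive = sum(pi[h] for h in S if h[i] == 1)
    negative = total - positive
    return 2.0 * positive * negative / total


def claim_holds(
    h_eff: Sequence[Labeling],
    pi: Dict[Labeling, float],
    n_points: int,
    tol: float = 1e-12,
) -> bool:
    """Check that no subset S is shrunk by more than the full class, for any query."""
    for size in range(1, len(h_eff) + 1):
        for S in combinations(h_eff, size):
            for i in range(n_points):
                if shrinkage(S, pi, i) > shrinkage(h_eff, pi, i) + tol:
                    return False
    return True


# Thresholds on m = 5 points with non-uniform weights proportional to k + 1.
m = 5
h_eff = [tuple(int(i >= k) for i in range(m)) for k in range(m + 1)]
raw = {h: float(k + 1) for k, h in enumerate(h_eff)}
norm = sum(raw.values())
pi = {h: w / norm for h, w in raw.items()}

print(claim_holds(h_eff, pi, m))
```

The check passes because 2ab/(a+b) is nondecreasing in both a and b, and passing to a subset can only lower the mass on each side of the split.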
When is the optimal tree short?
Claim: Pick any S ⊆ Heff, and any tree T whose leaves include all of S. Then there must be a query which shrinks (S, π_S) by at least:
(1 – CP(π_S))/Q(T, π_S).
Here:
π_S is π restricted to S
CP(π) = Σ_h π(h)^2 (collision probability)
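The quantities in this claim can be evaluated on a toy case. The sketch below takes S to be all eight thresholds on seven points with uniform weights, and assumes T is the balanced binary-search tree, whose average depth is 3 for this example; restriction to S is implemented with renormalization, which is an assumption about the intended definition.

```python
from typing import Dict, Sequence, Tuple

Labeling = Tuple[int, ...]


def restrict(pi: Dict[Labeling, float], S: Sequence[Labeling]) -> Dict[Labeling, float]:
    """Restrict pi to S and renormalize so the weights sum to one."""
    total = sum(pi[h] for h in S)
    return {h: pi[h] / total for h in S}


def collision_probability(pi: Dict[Labeling, float]) -> float:
    """CP(pi): sum over h of pi(h)^2."""
    return sum(p * p for p in pi.values())


def shrinkage(S: Sequence[Labeling], pi: Dict[Labeling, float], i: int) -> float:
    """2 pi(S+) pi(S-) / pi(S) for the split induced by querying point i."""
    total = sum(pi[h] for h in S)
    positive = sum(pi[h] for h in S if h[i] == 1)
    return 2.0 * positive * (total - positive) / total


m = 7
S = [tuple(int(i >= k) for i in range(m)) for k in range(m + 1)]
pi_S = restrict({h: 1.0 for h in S}, S)  # uniform weights on S

tree_average_depth = 3.0  # assumed: balanced binary search over 8 leaves
lower_bound = (1.0 - collision_probability(pi_S)) / tree_average_depth
best_shrink = max(shrinkage(S, pi_S, i) for i in range(m))

print(collision_probability(pi_S), lower_bound, best_shrink)
```

Here the median query achieves the best shrinkage, comfortably above the guaranteed lower bound.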
Main argument
If the optimal tree has small average depth, then there are two possible cases:
Case one: there is some query which shrinks the version space significantly
In this case, the greedy strategy will find such a query and clear progress will be made. The resulting subtrees, considered together, will also require few queries.
Proof, cont’d
Case two: some classifier h* has very high π-mass
In this case, the version space might shrink by just an insignificant amount in one round. But:
in roughly the number of queries that the optimal strategy requires for target h*, the greedy strategy will either eliminate h* or declare it to be the answer.
In the former case, by the time h* is eliminated, the version space will have shrunk significantly.
These two cases form the basis of an inductive argument.
An open problem
Just about the only positive result in active learning:
[FSST97] Query by committee: if the data distribution is uniform over the unit sphere, can learn homogeneous linear separators using just O(d log 1/ε) labels.
But the minute we allow non-homogeneous hyperplanes, the query complexity increases to 1/ε… What’s going on?