A New Linear-threshold Algorithm
Anna Rapoport, Lev Faivishevsky
22 Dec 2015
Introduction
Valiant (1984) and others have studied the problem of learning various classes of Boolean functions from examples. Here we discuss incremental learning of these functions. We consider a setting in which the learner responds to each example according to its current hypothesis, and then updates that hypothesis, if necessary, based on the correct classification of the example.
Introduction (cont.)
One natural measure of the quality of learning in this setting is the number of mistakes the learner makes.
For suitable classes of functions, learning algorithms are available that make a bounded number of mistakes, with the bound independent of the number of examples seen by the learner.
Introduction (cont.)
We present an algorithm that learns disjunctive Boolean functions, along with variants for learning other classes of Boolean functions.
The basic method can be expressed as a linear threshold algorithm.
A primary advantage of this algorithm is that the number of mistakes grows only logarithmically with the number of irrelevant attributes in the examples. It is also computationally efficient in both time and space.
How does it work?
We study learning in an online setting – there’s no separate set of training examples. The learner attempts to predict the appropriate response for each example, starting with the first example received.
After making this prediction, the learner is told whether the prediction was correct, and then uses this information to improve its hypothesis.
The learner continues to learn as long as it receives examples.
The Setting
Now we’re going to describe in more detail the learning environment that we consider and the classes of functions that the algorithm can learn. We assume that learning takes place in a sequence of trials. The order of events in a trial is as follows:
The Setting (cont.)
(1) The learner receives some information about the world, corresponding to a single example. This information consists of the values of n Boolean attributes, for some n that remains fixed. We think of the information received as a point in {0,1}^n. We call this point an instance and we call {0,1}^n the instance space.
The Setting (cont.)
(2) The learner makes a response. The learner has a choice of two responses, labeled 0 and 1. We call this response the learner’s prediction of the correct value.
(3) The learner is told whether or not the response was correct. This information is called the reinforcement.
The Setting (cont.)
Each trial begins after the previous trial has ended. We assume that for the entire sequence of trials there is a single function ƒ: {0,1}^n → {0,1} which maps each instance to the correct response to that instance. This function is called the target function or target concept. An algorithm for learning in this setting is called an algorithm for online learning from examples (AOLLE).
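The three-step trial protocol can be sketched as a driver loop. This is a minimal sketch; the predict/update learner interface is an assumption made for illustration, not notation from the slides.

```python
# Minimal sketch of the trial sequence: receive an instance, respond,
# get the reinforcement, and let the learner revise its hypothesis.

def run_trials(learner, instances, target):
    """Run one trial per instance; return the number of mistakes."""
    mistakes = 0
    for x in instances:
        prediction = learner.predict(x)   # steps (1)-(2): instance, response
        correct = target(x)               # step (3): reinforcement
        if prediction != correct:
            mistakes += 1
        learner.update(x, correct)        # learner may revise its hypothesis
    return mistakes
```

The mistake count returned here is exactly the quantity that the bounds below control.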
Mistake Bound (introduction)
We evaluate the algorithm's learning behavior by counting the worst-case number of mistakes that it will make while learning a function from a specified class of functions. Computational complexity is also considered; the method is efficient in both time and space.
General results about mistake bounds for AOLLE
First we present upper and lower bounds on the number of mistakes in the case where one ignores issues of computational efficiency.
The instance space can be any finite space X, and the target class is assumed to be a collection of functions, each with domain X and range {0,1}.
Some definitions:
Def 1: For any learning algorithm A and any target function ƒ, let MA(ƒ) be the maximum, over all possible sequences of instances, of the number of mistakes that algorithm A makes when the target function is ƒ.
Some definitions:
Def 2: For any learning algorithm A and any nonempty target class C, let
MA(C) = max ƒ∈C MA(ƒ).
Define MA(C) := −1 if C is empty. Any number greater than or equal to MA(C)
will be called a mistake bound for algorithm A applied to class C.
Some definitions:
Def 3: The optimal mistake bound for a target class C, denoted opt(C), is the
minimum over all algorithms A of MA(C) (regardless of the algorithm's computational efficiency). An algorithm A is called optimal for class C if MA(C) = opt(C). Thus
opt(C) represents the best possible worst-case mistake bound for any algorithm
learning C.
Auxiliary algorithms
If computational resources are no issue, there are straightforward learning algorithms with excellent mistake bounds for many classes of functions. We review them briefly, both because they give an upper limit on the mistake bound and because they suggest strategies that one might explore in searching for computationally efficient algorithms.
Algorithm 1: halving algorithm (HA)
The HA can be applied to any finite class C of functions taking values in {0,1}. The HA maintains a variable CONSIST, initialized to C. When it receives an instance x, it determines the sets
ξ0(CONSIST, x) = {ƒ ∈ CONSIST : ƒ(x) = 0}
ξ1(CONSIST, x) = {ƒ ∈ CONSIST : ƒ(x) = 1}
HA: scheme of the work
If |ξ1(CONSIST, x)| > |ξ0(CONSIST, x)|, the HA predicts 1; otherwise it predicts 0.
When it receives the reinforcement, it sets:
CONSIST = ξ1(CONSIST, x) if the correct response is 1;
CONSIST = ξ0(CONSIST, x) if the correct response is 0.
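As a concrete illustration, the HA can be sketched as follows. Representing each function in C as a dict from instances to {0,1} is my choice for the example, not notation from the slides.

```python
def halving_predict(consist, x):
    """Predict by majority vote among the still-consistent functions."""
    ones = sum(1 for f in consist if f[x] == 1)
    zeros = len(consist) - ones
    return 1 if ones > zeros else 0

def halving_update(consist, x, correct):
    """CONSIST keeps only the functions agreeing with the reinforcement."""
    return [f for f in consist if f[x] == correct]
```

Each mistake eliminates at least half of CONSIST, which is where the logarithmic mistake bound comes from.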
HA: main results
Def: Let MHALVING(C) denote the maximum number of mistakes that the algorithm will make when it is run for the class C.
Th 1: For any nonempty target class C,
MHALVING(C) ≤ log2 |C|.
Th 2: For any finite target class C,
opt(C) ≤ log2 |C|.
Algorithm 2: standard optimal algorithm (SOA)
Def 1: A mistake tree for a target class C over an instance space X is a binary tree,
each of whose nodes is a nonempty subset of C, in which each internal node is labeled with a point of X, and which satisfies:
1. The root of the tree is C;
2. For any internal node C' labeled with x, the left child of C' is ξ0(C', x) and
the right child is ξ1(C', x).
SOA
Def 2: A complete k-mistake tree is a mistake tree that is a complete binary tree of height k.
Def 3: For any nonempty finite target class C, let K(C) equal the largest integer k s.t. there exists a complete k-mistake tree for C. K(∅) = −1.
The SOA is similar to the HA, but instead of comparing cardinalities it compares
K(ξ1(CONSIST, x)) > K(ξ0(CONSIST, x))
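Def 3 lends itself to a brute-force computation of K(C). The sketch below is exponential-time and purely illustrative; functions are again represented as dicts from instances to {0,1}, a representation of my choosing.

```python
def K(C, X):
    """Largest k such that a complete k-mistake tree exists for C.

    K of the empty class is -1; a nonempty class always admits the
    trivial height-0 tree. Exponential time -- illustration only.
    """
    if not C:
        return -1
    best = 0
    for x in X:
        zeros = [f for f in C if f[x] == 0]
        ones = [f for f in C if f[x] == 1]
        if zeros and ones:  # both children must be nonempty subsets
            best = max(best, 1 + min(K(zeros, X), K(ones, X)))
    return best
```

For the class of all 16 Boolean functions on {0,1}^2 this yields K(C) = 4, matching the intuition that an adversary can force a mistake on every one of the 4 instances.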
SOA: main results
Th 1: Let X be any instance space and let C be any class of functions from X to {0,1}. Then opt(C) = MSOA(C) = K(C).
Def 4: A set S ⊆ X is shattered by a target class C if for every U ⊆ S there exists ƒ ∈ C s.t. ƒ = 1 on U and ƒ = 0 on S − U.
Def 5: The Vapnik-Chervonenkis dimension VCdim(C) is the cardinality of the largest set shattered by C.
Th 2: For any target class C:
VCdim(C) ≤ opt(C)
The linearthreshold algorithm (LTA)
Def 1: ƒ: {0,1}^n → {0,1} is linearly separable if there is a hyperplane in R^n separating the points on which the function is 1 from those on which it is 0.
Def 2: A monotone disjunction is a disjunction in which no literal appears negated: ƒ(x1,..,xn) = xi1 ∨ … ∨ xik.
The hyperplane given by xi1 + … + xik = 1/2 is a separating hyperplane for ƒ.
WINNOW 1
The instance space is X = {0,1}^n.
The algorithm maintains weights w1,..,wn ∈ R+, each having 1 as its initial value, and a threshold θ ∈ R. When the learner receives an instance (x1,..,xn), it responds as follows:
if Σ wi xi ≥ θ, then it predicts 1;
if Σ wi xi < θ, then it predicts 0.
WINNOW 1
The weights are changed only if the learner makes a mistake, according to the following rules:

learner's prediction 1, correct response 0 (elimination step):
wi := 0 if xi = 1; wi unchanged if xi = 0.

learner's prediction 0, correct response 1 (promotion step):
wi := α·wi if xi = 1; wi unchanged if xi = 0.

Here α > 1 is a fixed parameter.
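The prediction and update rules above can be sketched directly. The class name and the defaults α = 2, θ = n/α are my illustrative choices (they match the worked example later in the slides), not part of the algorithm's specification.

```python
# Sketch of WINNOW1: threshold prediction, with elimination and
# promotion steps applied only when a mistake is made.

class Winnow1:
    def __init__(self, n, alpha=2.0, theta=None):
        self.w = [1.0] * n                       # all weights start at 1
        self.alpha = alpha                       # promotion factor, > 1
        self.theta = theta if theta is not None else n / alpha

    def predict(self, x):
        total = sum(w * xi for w, xi in zip(self.w, x))
        return 1 if total >= self.theta else 0

    def update(self, x, correct):
        prediction = self.predict(x)
        if prediction == correct:
            return                               # change weights only on mistakes
        if prediction == 1 and correct == 0:     # elimination step
            self.w = [0.0 if xi else w for w, xi in zip(self.w, x)]
        else:                                    # promotion step
            self.w = [w * self.alpha if xi else w for w, xi in zip(self.w, x)]
```

Note that a weight of a relevant literal is never eliminated: a false positive with xi = 1 for a relevant variable would contradict the correct response being 0.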
Requirements for WINNOW1
The space needed (not counting bits per weight) and the sequential time needed per trial are both linear in n.
Nonzero weights are powers of α, so the weights are at most αθ. Thus if the logarithms (base α) of the weights are stored, only O(log2 logα(αθ)) bits per weight are needed.
Mistake bound for WINNOW1
Th: Suppose that the target function is a k-literal monotone disjunction given by
ƒ(x1,..,xn) = xi1 ∨ … ∨ xik. If WINNOW1 is run with α > 1 and θ ≥ 1/α, then for
any sequence of instances the total number of mistakes will be bounded by
α·k·(logα θ + 1) + n/θ.
Example:
Good bounds are obtained if α = 2, θ = n/α.
We get the bound 2k·log2 n + 2; the dominating first term is minimized for α = e, and the bound then becomes
(e/log2 e)·k·log2 n + e ≈ 1.885·k·log2 n + e
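Assuming the bound α·k·(logα θ + 1) + n/θ with θ = n/α, the arithmetic of this example can be checked numerically (illustrative only; the sample values n = 1024, k = 3 are mine):

```python
import math

def winnow1_bound(n, k, alpha):
    """WINNOW1 mistake bound with the choice theta = n / alpha."""
    theta = n / alpha
    return alpha * k * (math.log(theta, alpha) + 1) + n / theta

# With alpha = 2 the general bound collapses to 2*k*log2(n) + 2:
# for n = 1024, k = 3 both expressions give 2*3*10 + 2 = 62.
assert abs(winnow1_bound(1024, 3, 2.0) - (2 * 3 * 10 + 2)) < 1e-9
```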
Lower mistake bound
Def: For 1 ≤ k ≤ n, let Ĉk denote the class of k-literal monotone disjunctions, and let Ck denote the class of all those monotone
disjunctions that have at most k literals.
Th (lower bound): For 1 ≤ k ≤ n,
opt(Ck) ≥ opt(Ĉk) ≥ k·⌊log2(n/k)⌋. For n ≥ 1
we also have opt(Ck) ≥ (k/8)·(1 + log2(n/k))
Modified WINNOW1
For an instance space X ⊆ {0,1}^n and δ s.t. 0 < δ ≤ 1, let F(X,δ) be the class of functions ƒ: X → {0,1} s.t. there exist µ1,..,µn ≥ 0 with, for all (x1,..,xn) ∈ X,
Σ µi xi ≥ 1 if ƒ(x1,..,xn) = 1 (*)
Σ µi xi ≤ 1 − δ if ƒ(x1,..,xn) = 0 (**)
So the inverse images of 0 and 1 are linearly separable, with a minimum separation that depends on δ. The mistake bound that we derive will be practical only for those functions for which δ is sufficiently large.
Example: an rofk threshold function
Def: Let X = {0,1}^n. An r-of-k threshold function ƒ is defined by selecting a set of k significant variables; ƒ = 1 whenever at least r of these k variables are 1:
ƒ = 1 ⇔ xi1 + … + xik ≥ r
(1/r)·xi1 + … + (1/r)·xik ≥ 1 if ƒ(x1,..,xn) = 1
(1/r)·xi1 + … + (1/r)·xik ≤ 1 − 1/r if ƒ(x1,..,xn) = 0
Thus the r-of-k threshold functions belong to F({0,1}^n, 1/r).
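A quick exhaustive check that an r-of-k threshold function satisfies (*) and (**) with µi = 1/r on the significant variables and δ = 1/r. The concrete choices n = 4, k = 3, r = 2 and the particular significant variables are mine, purely for illustration:

```python
from itertools import product

# Check the separation conditions for a 2-of-3 threshold function over n = 4.
n, k, r = 4, 3, 2
significant = [0, 1, 2]                  # chosen arbitrarily for the example
mu = [1.0 / r if i in significant else 0.0 for i in range(n)]
delta = 1.0 / r

for x in product([0, 1], repeat=n):
    f = 1 if sum(x[i] for i in significant) >= r else 0
    s = sum(m * xi for m, xi in zip(mu, x))
    if f == 1:
        assert s >= 1                    # condition (*)
    else:
        assert s <= 1 - delta            # condition (**)
```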
WINNOW2
WINNOW2 differs from WINNOW1 only in the updating rule applied when a mistake is made:

learner's prediction 1, correct response 0 (demotion step):
wi := wi/α if xi = 1; wi unchanged if xi = 0.

learner's prediction 0, correct response 1 (promotion step):
wi := α·wi if xi = 1; wi unchanged if xi = 0.

Here α > 1 is a fixed parameter.
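A sketch of WINNOW2, differing from the WINNOW1 sketch only in the demotion step. The class name and defaults are mine: α = 3/2 corresponds to δ = 1 (monotone disjunctions), and θ = n is an illustrative threshold choice.

```python
# Sketch of WINNOW2: on a false positive, weights are divided by alpha
# (demotion) rather than set to zero (elimination).

class Winnow2:
    def __init__(self, n, alpha=1.5, theta=None):
        self.w = [1.0] * n
        self.alpha = alpha                       # alpha = 1 + delta/2 in the theorem
        self.theta = theta if theta is not None else float(n)

    def predict(self, x):
        total = sum(w * xi for w, xi in zip(self.w, x))
        return 1 if total >= self.theta else 0

    def update(self, x, correct):
        prediction = self.predict(x)
        if prediction == correct:
            return
        if prediction == 1 and correct == 0:     # demotion step
            self.w = [w / self.alpha if xi else w for w, xi in zip(self.w, x)]
        else:                                    # promotion step
            self.w = [w * self.alpha if xi else w for w, xi in zip(self.w, x)]
```

Because weights are only shrunk, never zeroed, WINNOW2 can recover from demoting a relevant variable, which is what lets it handle the wider class F(X,δ).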
Requirements for WINNOW2
We use α = 1 + δ/2 for learning target functions in F(X,δ).
Space and time requirements for WINNOW2 are similar to those for WINNOW1. However, more bits will be needed to store each weight, perhaps as many as the logarithm of the mistake bound.
Mistake bound for WINNOW2
Th: For 0 < δ ≤ 1, if the target function ƒ is in F(X,δ) for X ⊆ {0,1}^n, if µ1,..,µn have
been chosen s.t. ƒ satisfies (*) and (**), and if WINNOW2 is run with α = 1 + δ/2 and θ ≥ 1 and the algorithm receives instances from X, then the number of mistakes will be bounded by
(8/δ²)·(n/θ) + (5/δ + 14·ln θ/δ²)·Σ µi.
Example: an rofk threshold function
Now we calculate the mistake bound for r-of-k threshold functions. We have δ = 1/r and Σ µi = k/r. So for α = 1 + 1/(2r) and θ = n the mistake bound is 8r² + 5k + 14kr·ln n. Note that 1-of-k threshold functions are just k-literal monotone disjunctions. Thus with α = 3/2, WINNOW2 will learn monotone disjunctions. The mistake bound is similar to the bound for WINNOW1, though with larger constants.
Conclusion:
The first part gave general results about how many mistakes an effective learner might make if computational complexity were not an issue. The second part described an efficient algorithm for learning specific target classes. A key advantage of WINNOW1 and WINNOW2 is their performance when few attributes are relevant.
Conclusion:
If we define the number of relevant variables needed to express a function in the class F({0,1}^n, δ) to be the least number of strictly positive weights needed to describe a separating hyperplane, then for n > 1 this target class can be learned with a number of mistakes bounded by a constant times k·log2 n/δ²
when the target function can be expressed with k relevant variables.