Page 1:

A New Linear-threshold Algorithm

Anna Rapoport

Lev Faivishevsky

Page 2:

Introduction

Valiant (1984) and others have studied the problem of learning various classes of Boolean functions from examples. Here we discuss incremental learning of these functions. We consider a setting in which the learner responds to each example according to a current hypothesis. Then the learner updates the hypothesis, if necessary, based on the correct classification of the example.

Page 3:

Introduction (cont.)

One natural measure of the quality of learning in this setting is the number of mistakes the learner makes.

For suitable classes of functions, learning algorithms are available that make a bounded number of mistakes, with the bound independent of the number of examples seen by the learner.

Page 4:

Introduction (cont.)

We present an algorithm that learns disjunctive Boolean functions, along with variants for learning other classes of Boolean functions.

The basic method can be expressed as a linear-threshold algorithm.

A primary advantage of this algorithm is that the number of mistakes grows only logarithmically with the number of irrelevant attributes in the examples. It is also computationally efficient in both time and space.

Page 5:

How does it work?

We study learning in an on-line setting – there’s no separate set of training examples. The learner attempts to predict the appropriate response for each example, starting with the first example received.

After making this prediction, the learner is told whether the prediction was correct, and then uses this information to improve its hypothesis.

The learner continues to learn as long as it receives examples.

Page 6:

The Setting

Now we’re going to describe in more detail the learning environment that we consider and the classes of functions that the algorithm can learn. We assume that learning takes place in a sequence of trials. The order of events in a trial is as follows:

Page 7:

The Setting (cont.)

(1) The learner receives some information about the world, corresponding to a single example. This information consists of the values of n Boolean attributes, for some n that remains fixed. We think of the information received as a point in {0,1}ⁿ. We call this point an instance, and we call {0,1}ⁿ the instance space.

Page 8:

The Setting (cont.)

(2) The learner makes a response. The learner has a choice of two responses, labeled 0 and 1. We call this response the learner’s prediction of the correct value.

(3) The learner is told whether or not the response was correct. This information is called the reinforcement.

Page 9:

The Setting (cont.)

Each trial begins after the previous trial has ended. We assume that for the entire sequence of trials there is a single function ƒ: {0,1}ⁿ → {0,1} which maps each instance to the correct response to that instance. This function is called the target function or target concept. An algorithm for learning in this setting is called an algorithm for on-line learning from examples (AOLLE).
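A minimal Python sketch of one pass through this protocol (my own illustration; the learner object with predict and update methods is a hypothetical interface, not part of the original presentation):

def run_trials(learner, instances, target):
    mistakes = 0
    for x in instances:                  # (1) an instance arrives
        prediction = learner.predict(x)  # (2) the learner responds with 0 or 1
        correct = target(x)              # correct response under the target function
        if prediction != correct:
            mistakes += 1
        learner.update(x, correct)       # (3) reinforcement
    return mistakes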

Page 10:

Mistake Bound (introduction)

We evaluate the algorithm’s learning behavior by counting the worst-case number of mistakes that it will make while learning a function from a specified class of functions. Computational complexity is also considered: the method is efficient in both time and space.

Page 11:

General results about mistake bounds for AOLLE

First we present upper and lower bounds on the number of mistakes in the case where one ignores issues of computational efficiency.

The instance space can be any finite space X, and the target class is assumed to be a collection of functions, each with domain X and range {0,1}.

Page 12:

Some definitions:

Def 1: For any learning algorithm A and any target function ƒ, let MA(ƒ) be the maximum, over all possible sequences of instances, of the number of mistakes that algorithm A makes when the target function is ƒ.

Page 13:

Some definitions:

Def 2: For any learning algorithm A and any non-empty target class C, let

MA(C) := max {MA(ƒ) : ƒ є C}.

Define MA(C) := −1 if C is empty. Any number greater than or equal to MA(C) will be called a mistake bound for algorithm A applied to class C.

Page 14:

Some definitions:

Def 3: The optimal mistake bound for a target class C, denoted opt(C), is the minimum over all algorithms A of MA(C) (regardless of the algorithms’ computational efficiency). An algorithm A is called optimal for class C if MA(C) = opt(C). Thus opt(C) represents the best possible worst-case mistake bound for any algorithm learning C.

Page 15:

Two auxiliary algorithms

If computational resources are no issue, there are straightforward learning algorithms with excellent mistake bounds for many classes of functions. We review them briefly, both because they give an upper limit on the mistake bound and because they suggest strategies one might explore in searching for computationally efficient algorithms.

Page 16:

Algorithm 1: halving algorithm (HA)

The HA can be applied to any finite class C of functions taking values in {0,1}. The HA maintains a variable CONSIST, initially equal to C. When it receives an instance x, it determines the sets

ξ0(CONSIST,x) = {ƒ є CONSIST : ƒ(x) = 0}

ξ1(CONSIST,x) = {ƒ є CONSIST : ƒ(x) = 1}

Page 17:

HA: how it works

If |ξ1(CONSIST,x)| > |ξ0(CONSIST,x)|, the HA predicts 1; otherwise it predicts 0.

When it receives the reinforcement, it sets:

CONSIST := ξ1(CONSIST,x), if the correct response is 1;

CONSIST := ξ0(CONSIST,x), if the correct response is 0.
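A minimal Python sketch of the HA (my own illustration, not from the slides: the class C is represented as a list of callables ƒ: instance → {0,1}, and trials yields (instance, correct response) pairs):

def halving_algorithm(C, trials):
    consist = list(C)          # CONSIST starts out as the whole class C
    mistakes = 0
    for x, correct in trials:
        xi0 = [f for f in consist if f(x) == 0]
        xi1 = [f for f in consist if f(x) == 1]
        prediction = 1 if len(xi1) > len(xi0) else 0
        if prediction != correct:
            mistakes += 1
        # Keep only the functions consistent with the reinforcement.
        consist = xi1 if correct == 1 else xi0
    return mistakes

Each mistake eliminates at least half of CONSIST, which is the source of the log₂|C| bound on the next page.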

Page 18:

HA: main results

Def: Let MHALVING(C) denote the maximum number of mistakes that the algorithm will make when it is run for the class C.

Th 1: For any non-empty target class C,

MHALVING(C) ≤ log₂|C|.

Th 2: For any finite target class C,

opt(C) ≤ log₂|C|.

Page 19:

Algorithm 2: standard optimal algorithm (SOA)

Def 1: A mistake tree for a target class C over an instance space X is a binary tree each of whose nodes is a non-empty subset of C; each internal node is labeled with a point of X and satisfies:

1. The root of the tree is C;

2. For any internal node C’ labeled with x, the left child of C’ is ξ0(C’,x) and the right child is ξ1(C’,x).

Page 20:

SOA

Def 2: A complete k-mistake tree is a mistake tree that is a complete binary tree of height k.

Def 3: For any non-empty finite target class C, let K(C) equal the largest integer k s.t. there exists a complete k-mistake tree for C. K(∅) := −1.

The SOA is similar to the HA, but instead of comparing the sizes of the two subclasses it predicts 1 iff

K(ξ1(CONSIST,x)) > K(ξ0(CONSIST,x))
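A naive recursive Python sketch of K(C) and the SOA prediction rule (my own illustration; it enumerates the instance space X and takes exponential time, which is consistent with ignoring computational efficiency here):

def K(C, X):
    # Largest k such that a complete k-mistake tree for C exists; K(empty) = -1.
    if not C:
        return -1
    best = 0
    for x in X:
        xi0 = [f for f in C if f(x) == 0]
        xi1 = [f for f in C if f(x) == 1]
        if xi0 and xi1:  # x can label an internal node with two non-empty children
            best = max(best, 1 + min(K(xi0, X), K(xi1, X)))
    return best

def soa_predict(consist, x, X):
    # Predict 1 iff the subclass consistent with 1 has the deeper mistake tree.
    xi0 = [f for f in consist if f(x) == 0]
    xi1 = [f for f in consist if f(x) == 1]
    return 1 if K(xi1, X) > K(xi0, X) else 0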

Page 21:

SOA: main results

Th 1: Let X be any instance space and let C be any target class of functions from X to {0,1}. Then opt(C) = MSOA(C) = K(C).

Def 4: A set S ⊆ X is shattered by a target class C if for every U ⊆ S there exists ƒ є C s.t. ƒ = 1 on U and ƒ = 0 on S−U.

Def 5: The Vapnik-Chervonenkis dimension VCdim(C) is the cardinality of the largest set shattered by C.

Th 2: For any target class C:

VCdim(C) ≤ opt(C)

Page 22:

The linear-threshold algorithm (LTA)

Def 1: ƒ: {0,1}ⁿ → {0,1} is linearly separable if there is a hyperplane in Rⁿ separating the points on which the function is 1 from those on which it is 0.

Def 2: A monotone disjunction is a disjunction in which no literal appears negated: ƒ(x1,..,xn) = xi1 ∨ … ∨ xik.

The hyperplane given by xi1 + … + xik = ½ is a separating hyperplane for ƒ.

Page 23:

WINNOW1

The instance space is X = {0,1}ⁿ.

The algorithm maintains weights w1,..,wn є R⁺, each having 1 as its initial value, and a threshold θ є R. When the learner receives an instance (x1,..,xn), it responds as follows:

if Σ wi xi ≥ θ, then it predicts 1;

if Σ wi xi < θ, then it predicts 0.

Page 24:

WINNOW1 (cont.)

The weights are changed only if the learner makes a mistake, according to the following table:

learner’s prediction | correct response | update action                                | update name
1                    | 0                | wi := 0 if xi = 1; wi unchanged if xi = 0    | elimination step
0                    | 1                | wi := α·wi if xi = 1; wi unchanged if xi = 0 | promotion step

Here α > 1 is a fixed parameter.
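Putting the prediction rule and the update table together, a minimal Python sketch of WINNOW1 (my own illustration; alpha and theta stand for the parameters α and θ, and trials yields (instance, correct response) pairs):

def winnow1(trials, n, alpha, theta):
    w = [1.0] * n                  # all weights start at 1
    mistakes = 0
    for x, correct in trials:
        prediction = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
        if prediction == correct:
            continue               # weights change only on a mistake
        mistakes += 1
        if prediction == 1:        # predicted 1, correct response 0: elimination step
            w = [0.0 if xi == 1 else wi for wi, xi in zip(w, x)]
        else:                      # predicted 0, correct response 1: promotion step
            w = [alpha * wi if xi == 1 else wi for wi, xi in zip(w, x)]
    return mistakes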

Page 25:

Requirements for WINNOW1

The space needed (without counting bits per weight) and the sequential time needed per trial are both linear in n.

Non-zero weights are powers of α, and the weights are at most αθ. Thus if the logarithms (base α) of the weights are stored, only O(log₂ log_α(αθ)) bits per weight are needed.

Page 26:

Mistake bound for WINNOW1

Th: Suppose that the target function is a k-literal monotone disjunction given by ƒ(x1,..,xn) = xi1 ∨ … ∨ xik. If WINNOW1 is run with α > 1 and θ ≥ 1/α, then for any sequence of instances the total number of mistakes will be bounded by

α·k·(log_α θ + 1) + n/θ.

Page 27:

Example:

Good bounds are obtained if α = 2 and θ = n/α.

We then get the bound 2k·log₂n + 2. (For n = 1024 and k = 3, for instance, this is 2·3·10 + 2 = 62 mistakes.) The dominating first term is minimized for α = e; the bound then becomes

(e/log₂e)·k·log₂n + e ≈ 1.885·k·log₂n + e.

Page 28:

Lower mistake bound

Def: For 1 ≤ k ≤ n, let Ĉk denote the class of k-literal monotone disjunctions, and let Ck denote the class of all monotone disjunctions that have at most k literals.

Th (lower bound): For 1 ≤ k ≤ n,

opt(Ck) ≥ opt(Ĉk) ≥ k·⌊log₂(n/k)⌋.

For n ≥ 1 we also have opt(Ck) ≥ (k/8)·(1 + log₂(n/k)).

Page 29:

Modified WINNOW1

For an instance space X ⊆ {0,1}ⁿ and a δ s.t. 0 < δ ≤ 1, let F(X,δ) be the class of functions ƒ: X → {0,1} for which there exist µ1,..,µn ≥ 0 s.t. for all (x1,..,xn) є X

Σ µi xi ≥ 1 if ƒ(x1,..,xn) = 1 (*)

Σ µi xi ≤ 1 − δ if ƒ(x1,..,xn) = 0 (**)

So the inverse images of 0 and 1 are linearly separable, with a minimum separation that depends on δ. The mistake bound that we derive will be practical only for those functions for which δ is sufficiently large.

Page 30:

Example: an r-of-k threshold function

Def: Let X = {0,1}ⁿ. An r-of-k threshold function ƒ is defined by selecting a set of k significant variables; ƒ = 1 whenever at least r of these k variables are 1:

ƒ = 1 ⟺ xi1 + … + xik ≥ r

(1/r)·xi1 + … + (1/r)·xik ≥ 1 if ƒ(x1,..,xn) = 1

(1/r)·xi1 + … + (1/r)·xik ≤ 1 − 1/r if ƒ(x1,..,xn) = 0

Thus the r-of-k threshold functions are in F({0,1}ⁿ, 1/r).
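A brute-force Python check of this membership claim (my own illustration; it enumerates all of {0,1}ⁿ, so only for small n):

from itertools import product

def check_r_of_k(n, significant, r):
    # mu_i = 1/r on the significant variables, 0 elsewhere; delta = 1/r.
    mu = [1.0 / r if i in significant else 0.0 for i in range(n)]
    eps = 1e-9  # tolerance for floating-point rounding
    for x in product((0, 1), repeat=n):
        s = sum(m * xi for m, xi in zip(mu, x))
        f = 1 if sum(x[i] for i in significant) >= r else 0
        if f == 1:
            assert s >= 1.0 - eps            # condition (*)
        else:
            assert s <= 1.0 - 1.0 / r + eps  # condition (**)
    return True

For example, check_r_of_k(5, {0, 2, 3}, 2) returns True for the 2-of-3 threshold function over the first, third, and fourth variables.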

Page 31:

WINNOW2

The only change relative to WINNOW1 is the updating rule applied when a mistake is made:

learner’s prediction | correct response | update action                                 | update name
1                    | 0                | wi := wi/α if xi = 1; wi unchanged if xi = 0  | demotion step
0                    | 1                | wi := α·wi if xi = 1; wi unchanged if xi = 0  | promotion step

Here α > 1 is a fixed parameter.
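Relative to the WINNOW1 sketch above, only the branch taken on a false positive changes (my own illustration):

def winnow2_update(w, x, prediction, alpha):
    if prediction == 1:  # predicted 1, correct response 0: demotion step
        return [wi / alpha if xi == 1 else wi for wi, xi in zip(w, x)]
    else:                # predicted 0, correct response 1: promotion step
        return [alpha * wi if xi == 1 else wi for wi, xi in zip(w, x)]

Because weights are divided by α rather than set to 0, a demoted weight can recover later, which matters for classes such as the r-of-k threshold functions, where a relevant variable can be 1 in a negative instance.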

Page 32:

Requirements for WINNOW2

We use α = 1 + δ/2 for learning target functions in F(X,δ).

The space and time requirements for WINNOW2 are similar to those for WINNOW1. However, more bits will be needed to store each weight, perhaps as many as the logarithm of the mistake bound.

Page 33:

Mistake bound for WINNOW2

Th: For 0 < δ ≤ 1, if the target function ƒ is in F(X,δ) for X ⊆ {0,1}ⁿ, if µ1,..,µn have been chosen s.t. ƒ satisfies (*) and (**), and if WINNOW2 is run with α = 1 + δ/2 and θ ≥ 1 and the algorithm receives instances from X, then the number of mistakes will be bounded by

(8/δ²)·(n/θ) + (5/δ + (14·ln θ)/δ²)·Σ µi.

Page 34:

Example: an r-of-k threshold function

Now we calculate the mistake bound for r-of-k threshold functions. We have δ = 1/r and Σ µi = k/r. So for α = 1 + 1/(2r) and θ = n the mistake bound is 8r² + 5k + 14kr·ln n. Note that 1-of-k threshold functions are just k-literal monotone disjunctions. Thus if α = 3/2, WINNOW2 will learn monotone disjunctions. The mistake bound is similar to the bound for WINNOW1, though with larger constants.

Page 35:

Conclusion:

The first part gives general results about how many mistakes an effective learner might make if computational complexity were not an issue. The second part describes an efficient algorithm for learning specific target classes. A key advantage of WINNOW1 and WINNOW2 is their performance when few attributes are relevant.

Page 36:

Conclusion (cont.):

If we define the number of relevant variables needed to express a function in the class F({0,1}ⁿ, δ) to be the least number of strictly positive weights needed to describe a separating hyperplane, then for n > 1 this target class can be learned with a number of mistakes bounded by C·k·(log₂ n)/δ², when the target function can be expressed with k relevant variables.