On approximating weighted sums with exponentially many terms
http://www.elsevier.com/locate/jcss
Journal of Computer and System Sciences 69 (2004) 196–234
Deepak Chawla,1 Lin Li, and Stephen Scott*
Department of Computer Science, University of Nebraska, Lincoln, NE 68588-0115, USA
Received 28 March 2003; revised 8 January 2004
Abstract
Multiplicative weight-update algorithms such as Winnow and Weighted Majority have been studied extensively due to their on-line mistake bounds' logarithmic dependence on N, the total number of inputs, which allows them to be applied to problems where N is exponential. However, a large N requires techniques to efficiently compute the weighted sums of inputs to these algorithms. In special cases the weighted sum can be exactly computed efficiently, but for numerous problems such an approach seems infeasible. Thus we explore applications of Markov chain Monte Carlo (MCMC) methods to estimate the total weight. Our methods are very general and applicable to any representation of a learning problem for which the inputs to a linear learning algorithm can be represented as states in a completely connected, untruncated Markov chain. We give theoretical worst-case guarantees on our technique and then apply it to two problems: learning DNF formulas using Winnow, and pruning classifier ensembles using Weighted Majority. We then present empirical results on simulated data indicating that in practice, the time complexity is much better than what is implied by our worst-case theoretical analysis.
© 2003 Elsevier Inc. All rights reserved.
Keywords: Markov chain Monte Carlo approximation; Winnow; Weighted Majority; Multiplicative weight updates;
Perceptron; DNF learning; Boosting
1. Introduction
Multiplicative weight-update algorithms (e.g. [6,21,24]) have been studied extensively due to their on-line mistake bounds' logarithmic dependence on N, the total number of inputs. (These
A preliminary version [7] of this paper appeared in COLT 2001. *Corresponding author.
E-mail address: [email protected] (S. Scott).
URL: http://www.cse.unl.edu/~sscott. 1 Now at EMC Corporation, Raleigh-Durham, NC.
0022-0000/$ - see front matter © 2003 Elsevier Inc. All rights reserved.
doi:10.1016/j.jcss.2004.01.006
bounds can be translated into PAC sample complexity bounds via a simple procedure [22].) This attribute efficiency allows them to be applied to problems where N is exponential in the input size, which is the case in many applications, including using Winnow [21] to learn DNF formulas in unrestricted domains and using the Weighted Majority algorithm (WM [24]) to predict nearly as well as the best pruning of a classifier ensemble (from e.g. boosting). However, a large N requires techniques to efficiently compute the weighted sums of inputs to WM and Winnow. One method of doing this is to exploit commonalities among the inputs, partitioning them into a polynomial number of groups such that given a single member of each group, the total weight contribution of that group can be efficiently computed [13–15,25,30,39]. But many WM and Winnow applications do not appear to exhibit such structure, so it seems that a brute-force implementation is the only option to guarantee complete correctness.2 Thus we explore applications of Markov chain Monte Carlo (MCMC) methods to estimate the total weight without the need for special structure in the problem. Our methods are very general and applicable to any representation of a learning problem for which the inputs to the linear learning algorithm can be represented as states in a completely connected, untruncated3 Markov chain. In this paper we apply our results to two such problems, described below.
First we study learning DNF formulas (e.g. [5]) using Winnow4 [21] and not using membership
queries. We enumerate all possible DNF terms and use Winnow to learn a monotone disjunction over these terms, which it can do while making O(K log N) prediction mistakes, where K is the number of relevant terms and N is the total number of terms. So a brute-force implementation of Winnow makes a polynomial number of errors on arbitrary examples (i.e. with no distributional assumptions) and does not require membership queries. However, a brute-force implementation requires exponential time to compute the weighted sum of the inputs. So we apply our MCMC-based results to estimate this sum.
Next we investigate pruning a classifier ensemble (from e.g. boosting), which can reduce
overfitting and time for evaluation [26,40]. We use the Weighted Majority algorithm (WM) [24], using all possible prunings as experts. WM is guaranteed to not make many more prediction mistakes than the best expert, so we know that a brute-force WM will perform nearly as well as the best pruning. However, the exponential number of prunings motivates us to use an MCMC approach to approximate the weighted sum of the experts' predictions.
MCMC methods [16] have been applied to problems in approximate summation, where the goal is to approximate W = Σ_{x∈Ω} s(x), where s is a positive function and Ω is a finite set of combinatorial structures. It involves defining an ergodic Markov chain M with state space Ω and stationary distribution π. Then one repeatedly simulates M to draw samples almost according to π. Under appropriate conditions, this technique yields accuracy guarantees. E.g. sometimes one can guarantee that the estimate of the sum is within a factor ε of the true value (with high
2 For additive weight-update algorithms such as the Perceptron algorithm, kernels can often be used to exactly compute the weighted sums (e.g. [8,29,36]), though a kernel function might not exist for the desired mapping of features.
3 We believe that the untruncated requirement can be removed by generalizing results of Morris and Sinclair [28] (see Section 4.2).
4 We also study the application of our methods to learning DNF via Rosenblatt's Perceptron [33] algorithm, though this is done only for contrast with Winnow since exact sums for DNF learning via Perceptron can be computed with kernels [18].
probability). When this is true and the estimation algorithm requires only polynomial time, the algorithm is called a fully polynomial randomized approximation scheme (FPRAS).
We combine two FPRASs for application to estimating weighted sums. The first approximator is for the approximate knapsack problem [10,28], where given a positive real vector w and a real number b, the goal is to estimate |{p ∈ {0,1}^n : w·p ≤ b}| within a multiplicative factor ε. The other FPRAS is for estimating the sum of the weights of the weighted matchings of a graph: for a graph G and λ ≥ 0, approximate Z_G(λ) = Σ_{k=0}^{n} m_k λ^k, where m_k is the number of matchings in G of size k and n is the number of nodes. This problem has applications to the monomer-dimer
problem of statistical physics [16].
While we have thoroughly analyzed our approach in the context of these two problems, our results do not guarantee efficient algorithms for learning DNF or for finding the best pruning.5 But we do provide theoretical machinery that could potentially be applied to analyze algorithms that learn e.g. restricted cases of DNF, including subclasses of DNF formulas and/or specific distributions over examples. Further, our experimental results provide interesting insights into the algorithms' behaviors and show that the weighted sums can be approximated well despite the pessimistic worst-case bounds. Couple this with the fact that good approximations of the weighted sums are not always necessary to accurately simulate Winnow and WM (since we are only interested in the predictions made based on these weighted sums, not the sums themselves), and our results have potential to be effective tools in theory and in practice.
The rest of this paper is organized as follows. In Section 2 we give background on the on-line learning model and summarize related work in learning DNF formulas, pruning ensembles of classifiers, and MCMC methods. Section 3 presents our algorithm and Markov chain, and proves general bounds on the accuracy and time complexity of our estimation procedure. In Section 4 we apply these results to the problems of learning DNF formulas with Winnow and Perceptron and pruning ensembles with Weighted Majority. Then some empirical results appear in Section 5. Finally, we conclude in Section 6 with a description of future and ongoing work.
2. Related work
2.1. The on-line learning model
We focus on on-line learning algorithms, where learning proceeds in a series of trials.6 In trial t, an example X_t is presented to the learning algorithm A, which makes a prediction7 ĉ_t of X_t's label. After this prediction is made, A is told the true label c_t, which A uses to update its hypothesis before making future predictions. If c_t ≠ ĉ_t, we say that A made a prediction mistake. If M is the set of examples for which a mistake is made, the goal is to minimize |M| on any sequence of adversarially generated examples X = (X_1, …, X_t). Below we overview the on-line learning algorithms Winnow [21], Perceptron [33], and Weighted Majority (WM) [24].
5 This is not surprising, since it is unlikely that an efficient distribution-free DNF-learning algorithm exists [2,3].
6 In Sections 3 and 4, our results focus on only the current trial, so we omit the subscript t unless it is not clear from context.
7 We assume c_t, ĉ_t ∈ {−1, +1}.
Winnow maintains a weight vector w ∈ ℝ₊^N (N-dimensional positive real space). Upon receiving an instance X_t ∈ [0,1]^N, Winnow makes its prediction ĉ_t = +1 if W_t = w_t·X_t ≥ θ and −1 otherwise (θ > 0 is a threshold). Given the true label c_t, the weights are updated as follows: w_{t+1,i} = w_{t,i}·α^{X_{t,i}(c_t − ĉ_t)/2} for some α > 1. If w_{t+1,i} > w_{t,i} we call the update a promotion, and if w_{t+1,i} < w_{t,i} we call it a demotion. Littlestone [21] showed that if each example is labeled by some monotone disjunction of K of its N inputs, then Winnow will never make more than O(K log N) prediction mistakes on any sequence of examples. This makes Winnow a natural tool to apply to learning DNF since by enumerating all 3^n possible terms as inputs to Winnow, K-term DNF can be learned while making only O(Kn) prediction mistakes. However, the time complexity of running Winnow this way is exponential in n.
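As a concrete sketch, one Winnow trial (prediction followed by a multiplicative update) can be written as follows. The default threshold θ = N is an illustrative choice, not mandated by the paper, which only requires θ > 0:

```python
import numpy as np

def winnow_trial(w, X, c, alpha=2.0, theta=None):
    """One Winnow trial: predict, then multiplicatively update on a mistake.

    w: positive weights; X: instance in [0,1]^N; c: true label in {-1,+1}.
    theta defaults to N purely for illustration.
    """
    if theta is None:
        theta = len(w)
    c_hat = 1 if np.dot(w, X) >= theta else -1
    if c_hat != c:
        # w_i <- w_i * alpha^(X_i * (c - c_hat)/2): a promotion or demotion
        w = w * alpha ** (X * (c - c_hat) / 2.0)
    return w, c_hat
```

Note that when c = ĉ_t the exponent is 0 and the weights are unchanged, matching the update rule above.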
Similar to Winnow, the Perceptron algorithm maintains a weight vector w ∈ ℝ^N. Upon receiving an instance X_t ∈ [0,1]^N, it makes its prediction ĉ_t = +1 if W_t = w_t·X_t ≥ θ and −1 otherwise.8 Given the true label c_t, the weights are updated as follows: w_{t+1,i} = w_{t,i} + αX_{t,i}(c_t − ĉ_t)/2 for some α > 0. In contrast to Winnow, the Perceptron algorithm can be forced to make O(KN) mistakes on monotone K-disjunctions over N inputs [20], making it inappropriate for learning DNF (see also Khardon et al. [18]). However, the additive nature of the weight updates yields much better time complexity bounds for MCMC in contrast to those for multiplicative weight-update schemes (Section 4.1.2).
Inputs to the Weighted Majority algorithm [24] are themselves predictions of "experts" on the current example9 x_t. Each such expert e_i in the pool has its own weight w_i (initialized to 1), and when a new example x_t is given to each expert in the pool, expert e_i sends to WM its prediction X_{t,i} = e_i(x_t) ∈ ℝ, where the sign of X_{t,i} indicates e_i's predicted label and |X_{t,i}| can be thought of as e_i's confidence (though some experts may restrict themselves to predictions from {−1,+1}). WM then takes a weighted combination of the predictions and predicts ĉ_t = +1 if W_t = w_t·X_t ≥ 0 and −1 otherwise. Upon receiving the correct label c_t, if WM makes a prediction mistake, it reduces the weights of all experts that predicted incorrectly by dividing them by some constant α > 1. It has been shown that if the best expert in the pool makes at most ν mistakes, then WM has a mistake bound10 of O(ν + log N). Applying this to predicting nearly as well as the best pruning of an ensemble is straightforward. By placing each possible pruning into the pool, we get a pool size of N = 2^n and thus a mistake bound of O(ν + n). However, the time complexity of a straightforward implementation of this algorithm is exponential in n.
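A matching sketch of one WM trial, under the convention above that the sign of X_{t,i} is expert i's predicted label (treating a zero prediction as incorrect is our simplification, not the paper's):

```python
import numpy as np

def wm_trial(w, X, c, alpha=2.0):
    """One Weighted Majority trial with real-valued expert predictions.

    X[i] is expert i's prediction; its sign is the predicted label.
    On a mistake, every expert whose predicted label was wrong has its
    weight divided by alpha.
    """
    c_hat = 1 if np.dot(w, X) >= 0 else -1
    if c_hat != c:
        wrong = np.sign(X) != c  # experts whose predicted label disagrees with c
        w = np.where(wrong, w / alpha, w)
    return w, c_hat
```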
2.2. Learning DNF formulas
Learning DNF formulas has been heavily studied, but positive learning-theoretic results exist only in restricted cases, including assuming a uniform distribution over examples (e.g. [5]) or
8 For additive weight-update algorithms like Perceptron, often the threshold is included in the weight vector as w_{t,0}, corresponding to an extra attribute X_{t,0} = −1. The dot product is then compared to 0 rather than θ.
9 Throughout this paper, lower case x and y will represent examples in the original space, while capital X and Y represent the examples mapped to a new space, which is the input space of Winnow, Perceptron, and WM.
10 Stronger results on predicting with expert advice were given by Cesa-Bianchi et al. [6] using a more complex algorithm, but these are only better than WM's by a constant factor. Thus for simplicity, we use WM.
assuming that the number of terms is bounded by O(log n), where n is the number of variables [4]. In both of these cases, the algorithms require, in addition to labeled examples, membership queries, i.e. they need to be able to present arbitrary examples to an oracle and be told their labels.
In contrast, directly applying Winnow to this problem by enumerating all possible terms and learning a monotone disjunction over them does not require any restrictions or the use of membership queries. Since there are only 3^n possible DNF terms over n variables, Winnow's mistake bound on this problem is O(Kn), where K is the number of relevant terms in the target function. However, the time complexity to make a prediction on each example is exponential in this case if a brute-force approach is taken. Indeed, Khardon et al. [18] showed that if P ≠ #P, then there is no polynomial-time algorithm to exactly simulate Winnow over exponentially many conjunctive features for learning even monotone DNF. Further, while they did provide a kernel allowing them to exactly compute Perceptron's weighted sums when learning DNF, they also gave an exponential lower bound on the number of mistakes that kernel perceptron makes in learning DNF: 2^{Ω(n)}.
2.3. Pruning ensembles of classifiers
Prior work in pruning ensembles of classifiers [26,40] (produced by boosting) has been conducted for two reasons. First, the time required to evaluate a complete ensemble is prohibitive in some applications. Second, despite some evidence to the contrary [32,34], boosting can be prone to overfitting. The methods of Margineantu and Dietterich [26] and Tamon and Xiang [40] not only sought subsets of the ensemble with high prediction accuracy, but also with high diversity, i.e. hypotheses with high accuracy on different portions of the instance space. The approaches they used included simple ones like early stopping, ones that utilized divergence measures such as Kullback–Leibler divergence or the κ statistic, and methods that used prediction error, sometimes combined with a divergence measure.
To address the concern of overfitting, one can use the WM algorithm, using all possible prunings as "experts" in a pool. Since WM is guaranteed to not perform much worse (in terms of number of on-line prediction mistakes) than the best expert in the pool, we know that a brute-force implementation of this algorithm is guaranteed to not perform much worse than the best pruning. However, a brute-force implementation of WM would take time exponential in the number of hypotheses in the ensemble. So we use an MCMC approach to approximate the weighted sum of the experts' predictions.
2.4. Markov chain Monte Carlo methods
MCMC methods [16] have been applied to problems in combinatorial optimization and approximate summation, where the goal is to approximate a weighted sum W = Σ_{x∈Ω} s(x), where s is a positive function defined on Ω and Ω is a very large, finite set of combinatorial structures. The process involves defining an ergodic Markov chain M with state space Ω and stationary distribution π. Then one repeatedly simulates M some number of steps to draw several samples almost according to π. Under appropriate conditions, this technique yields accuracy guarantees. E.g. in approximate summation, sometimes one can guarantee that the estimate of the sum is within a factor ε of the true value with high probability. When this is true and the estimation
algorithm requires only polynomial time, the algorithm is called a fully polynomial randomized approximation scheme (FPRAS). In certain cases a similar argument can be made about combinatorial optimization problems, i.e. that the algorithm's solution is within a factor of ε of the true maximum or minimum.
A well-studied problem with an MCMC solution is the approximate knapsack problem, where one is given a positive real vector w and a real number b. The goal is to estimate |Ω| within a multiplicative factor of ε, where Ω = {p ∈ {0,1}^n : w·p ≤ b} (i.e. as an approximate summation problem, s(p) = 1 for all p ∈ Ω). Dyer et al. [10] gave a Markov chain for this problem and argued that a polynomial (in n and 1/ε) number of samples from it were sufficient to accurately estimate |Ω|. Later, Morris and Sinclair [28] showed that it is sufficient to simulate the chain for a polynomial number of steps to obtain each sample (i.e. that the chain is rapidly mixing), thus giving an FPRAS for the knapsack problem.
Another problem with an FPRAS [16] is computing the sum of the weights of weighted matchings with parameter λ: for a graph G and λ ≥ 0, approximate Z_G(λ) = Σ_{k=0}^{n} m_k λ^k, where m_k is the number of matchings in G of size k and n is the number of nodes. This problem has applications to the monomer-dimer problem of statistical physics. In the next section, we combine the knapsack solution with the matching solution to approximate the weighted sums of inputs of linear learning algorithms.
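For tiny instances, both of these quantities can be computed exactly by brute force, which is useful for sanity-checking any approximate counter; the function names below are ours:

```python
from itertools import combinations, product

def knapsack_count(w, b):
    """Exact |{p in {0,1}^n : w . p <= b}| by enumeration (exponential in n;
    only for checking an approximate counter on tiny instances)."""
    return sum(1 for p in product((0, 1), repeat=len(w))
               if sum(wi * pi for wi, pi in zip(w, p)) <= b)

def matching_poly(edges, lam):
    """Evaluate Z_G(lambda) = sum_k m_k lambda^k by enumerating all subsets
    of edges and keeping the matchings (pairwise node-disjoint edges)."""
    total = 0.0
    for k in range(len(edges) + 1):
        for sub in combinations(edges, k):
            nodes = [v for e in sub for v in e]
            if len(nodes) == len(set(nodes)):  # no shared endpoint
                total += lam ** k
    return total
```

For example, the path graph on 3 nodes has m_0 = 1 and m_1 = 2, so Z_G(λ) = 1 + 2λ.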
3. Our general algorithm and Markov chain
In general, our state space Ω will consist of the set of inputs to the learning algorithm under consideration (Perceptron, Winnow, or WM). As such, we can think of the states of Ω as functions that map from examples in the original space (the x variables) to the input space of the learning algorithm (the X variables). So for a state p ∈ Ω and an input example x, we let p(x) denote p evaluated at x. E.g. when learning DNF with binary p(x) (Section 4.1.1), p is a term, x is an assignment to the variables, and p(x) = 1 if x satisfies p and 0 otherwise.
Depending on the application and choice of p(x) in Section 4, Ω will take on different forms. When learning DNF with Perceptron or Winnow and binary p(x), we use Ω = {0,1}^n. When learning DNF with Perceptron or Winnow and p(x) a linear or logistic function (see Section 4.1.1), we use Ω = Π_{i=1}^{n} {0,…,k_i} for some integers k_1,…,k_n > 0 (as described in Section 4.1, k_i is the number of values for feature i in a general DNF representation). When we use WM to prune ensembles, we define Ω⁺ (similarly, Ω⁻) as the set of prunings that predict +1 (similarly, −1) on the current example. In Section 4.2 we show that Ω⁺ and Ω⁻ are each simply {0,1}^n truncated by a single hyperplane. Since in this case the state space of our Markov chain is truncated, we must take care to not exit the state space during a transition. Hence the need for Step 3 in our algorithm below.
Consider a vector p = (p_1,…,p_i,…,p_n). We say that vector p′ is a neighbor of p if and only if p and p′ differ in at most one position, i.e. if and only if p′ = (p_1,…,p′_i,…,p_n), where p′_i may or may not equal p_i (if p′_i = p_i then the edge from p to p′ is a self-loop). (Note that if Ω is a truncated hypercube, then p′ might not be in Ω, even if p ∈ Ω. This is why we test for membership in Step 3
below.) We now define M as a Markov chain with state space Ω that makes transitions from state p ∈ Ω to state q ∈ Ω by the following rules.
(1) With probability 1/2 let q = p. Otherwise:
(2) Let p′ be a neighbor of p selected uniformly at random.
(3) If p′ ∈ Ω, then let p″ = p′; else let p″ = p.
(4) With probability min{1, p″(x)·w_{p″} / (p(x)·w_p)}, let q = p″; else let q = p. Here w_p is the weight of node p in the learning algorithm.
Thus M is a random walk where the transition probabilities favor nodes with higher weights.
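A minimal sketch of one transition of M, with the weight function, neighbor list, and membership test passed in as callables (an interface we assume for illustration):

```python
import random

def chain_step(p, weight, neighbors, in_omega=lambda q: True):
    """One transition of the chain M, following Steps 1-4 above.

    p: current state; weight(q): unnormalized stationary weight q(x)*w_q;
    neighbors(p): list of p's neighbors; in_omega: membership test used
    when the state space is truncated (Step 3).
    """
    if random.random() < 0.5:        # Step 1: lazy self-loop
        return p
    q = random.choice(neighbors(p))  # Step 2: uniform random neighbor
    if not in_omega(q):              # Step 3: refuse to leave Omega
        q = p
    # Step 4: Metropolis-style acceptance favoring heavier states
    if random.random() < min(1.0, weight(q) / weight(p)):
        return q
    return p
```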
Lemma 1. If every state in Ω can be reached from every other state, then M is ergodic with stationary distribution

π_t(p) = p(x_t)·w_p / W_t,

where W_t = Σ_{p∈Ω} p(x_t)·w_p, i.e. the weighted sum of inputs over all states (inputs) in Ω when example x_t is the current example.
Proof. Since all states in Ω can communicate, M is irreducible. Also, the self-loop of Step 1 ensures aperiodicity. Finally, M is reversible since the transition probabilities

P(p,q) = min{1, q(x_t)·w_q / (p(x_t)·w_p)} / (2n) = min{1, π_t(q)/π_t(p)} / (2n)

(here n is the number of neighbors) satisfy the detailed balance condition π_t(p)·P(p,q) = π_t(q)·P(q,p). So M is ergodic with the stated stationary distribution. □
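The detailed balance condition can be checked numerically on a toy hypercube; the sketch below hard-codes the proof's transition probability min{1, π(q)/π(p)}/(2n) for neighboring states:

```python
from itertools import product

def check_detailed_balance(weights, n):
    """Numerically verify Lemma 1's detailed-balance condition
    pi(p) P(p,q) = pi(q) P(q,p) on the full hypercube {0,1}^n,
    where P(p,q) = min{1, pi(q)/pi(p)} / (2n) for neighbors p != q."""
    states = list(product((0, 1), repeat=n))
    total = sum(weights[s] for s in states)
    pi = {s: weights[s] / total for s in states}
    P = lambda p, q: min(1.0, pi[q] / pi[p]) / (2 * n)
    for p in states:
        for q in states:
            if sum(a != b for a, b in zip(p, q)) == 1:  # p, q neighbors
                assert abs(pi[p] * P(p, q) - pi[q] * P(q, p)) < 1e-12
    return True
```

The check passes for any positive weights because π(p)·min{1, π(q)/π(p)} = min{π(p), π(q)}, which is symmetric in p and q.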
For each new trial in an on-line algorithm, the weighted sums we estimate are potentially different. Thus we must conduct a new estimation procedure (with a new Markov chain) for each trial. To simplify notation, for the rest of this paper we will let the index t of each trial be implicit, omitting any subscripts unless necessary. Further, in each trial our algorithm defines multiple Markov chains, each assuming that the weight updates of previous trials were made using a different11 learning rate α_i. Hence the weight w_p of a node p (and hence the stationary distribution π and the sum of weights W) will be functions of both α_i and t, so we use the subscript of α_i to denote these differences, leaving the t implicit.
Recalling the definition of Winnow in Section 2.1, if the initial weight vector is the all-1s vector, the weight of term p is w_p = α^{z_p}, where z_p = Σ_{x∈M} c_x·p(x), M is the set of examples for which a
11 Note, however, that the actual sequence of updates made will be the same regardless of α_i. This sequence of updates is determined by running the learning algorithm with the original learning rate α.
prediction mistake is made, and c_x ∈ {−1,+1} is example x's label. We will refer to z_p as node p's total update, since from it we can directly compute node p's weight. In fact, if the Perceptron algorithm is started with all weights equal to 0, then node p's weight is w_p = α·z_p. Similarly, starting WM with all weights equal to 1 implies w_p = α^{z_p}, just like Winnow.
Let B be a bound (over all nodes in Ω) on the magnitudes of the total updates up to the current trial, i.e. B ≥ max_{p∈Ω} |z_p|. Since this requires taking the maximum over an exponentially large set, we note that it suffices to instead use B ≥ Σ_{x∈M} max_{p∈Ω} |c_x·p(x)|. This quantity is easy to bound so long as bounds on the possible values of c_x and p(x) are known for each x, which is the case for all our algorithms. (E.g. in Winnow and Perceptron, it suffices to set B equal to the sum of all promotions and demotions made on all examples for which a prediction mistake was made up to the current trial.) Now let r be the smallest integer such that (1 + 1/B)^{r−1} ≥ α and r ≥ 1 + log₂ α (so r ≤ 2 + B ln α). Also, let z = 1/(α^{1/(r−1)} − 1) ≥ B and α_i = (1 + 1/z)^{i−1} = α^{(i−1)/(r−1)} for 1 ≤ i ≤ r (so α_r = α). Now define f_i(p) = w_{α_{i−1},p}/w_{α_i,p}, where p is chosen according to π_{α_i}.
Then

E[f_i] = Σ_{p∈Ω} (w_{α_{i−1},p} / w_{α_i,p}) · (p(x)·w_{α_i,p} / W(α_i)) = W(α_{i−1}) / W(α_i).
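The ladder of learning rates α_1, …, α_r defined above can be computed directly; a sketch:

```python
import math

def rate_ladder(alpha, B):
    """The intermediate learning rates alpha_1, ..., alpha_r: r is the
    smallest integer with (1 + 1/B)^(r-1) >= alpha and r >= 1 + log2(alpha),
    and alpha_i = alpha^((i-1)/(r-1)), so alpha_1 = 1 and alpha_r = alpha."""
    r = 2
    while (1 + 1.0 / B) ** (r - 1) < alpha or r < 1 + math.log2(alpha):
        r += 1
    return [alpha ** ((i - 1) / (r - 1)) for i in range(1, r + 1)]
```

For example, α = 2 and B = 10 give r = 9 rates climbing geometrically from 1 to 2.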
So we can estimate W(α_{i−1})/W(α_i) by sampling states p from M and computing the sample mean of the f_i(p). Note that

W(α) = (W(α_r)/W(α_{r−1})) · (W(α_{r−1})/W(α_{r−2})) ⋯ (W(α_2)/W(α_1)) · W(α_1).

So for each value α_2, …, α_r, we run S independent simulations of M, and let X̄_i be the sample mean of w_{α_{i−1}}/w_{α_i}. Then our estimate12 is

Ŵ(α) = W(α_1) · Π_{i=2}^{r} (1/X̄_i).  (1)
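Eq. (1) can be sketched as follows, with the Markov-chain sampling abstracted behind a caller-supplied ratio sampler (an assumed interface, not the paper's notation):

```python
def estimate_W(W1, ladder, ratio_sampler, S=1000):
    """Telescoping-product estimator of W(alpha), Eq. (1):
    start from the exactly computed W(alpha_1) and divide by the sample
    mean of w_{alpha_{i-1},p} / w_{alpha_i,p} for each rung of the ladder.

    ratio_sampler(a_i, a_prev) stands in for drawing a state p roughly
    from pi_{alpha_i} by simulating M and returning the weight ratio.
    """
    est = W1
    for i in range(1, len(ladder)):
        mean = sum(ratio_sampler(ladder[i], ladder[i - 1])
                   for _ in range(S)) / S
        est /= mean  # multiply by 1 / X-bar_i
    return est
```

As a sanity check, if the sampler returns the exact ratio W(α_{i−1})/W(α_i) every time, the estimator recovers W(α) exactly.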
In order to complete our computation of W, we must also compute W(α_1). Due to the definition of α_1, this is straightforward for learning DNF with the various definitions of p(x) (Sections 4.1.1 and 4.1.2). For the ensemble pruning problem, W⁺(α_1) = |Ω⁺|, where Ω⁺ is the set of prunings that predict +1 on the input x. This cannot be efficiently computed exactly, so we must estimate it with the FPRAS of Morris and Sinclair [28] (Section 4.2).
The following theorem bounds the error of our algorithm's estimates of W. The theorem is based on variation distance, which is a distance measure between a Markov chain's simulated and stationary distributions, defined as max_{U⊆Ω} |P^t(p,U) − π(U)|, where P^t(p,·) is the distribution of a chain's state at simulation step t given that the simulation started in state p ∈ Ω, and π is the chain's stationary distribution.
Theorem 2. Assume a ≤ f_i, f̂_i ≤ b for all i, where f̂_i is the same as f_i but with samples drawn according to the distribution yielded by simulating M. Let the sample size S = ⌈130rb/(aε²)⌉ and let M be
12 When we apply our results to the Perceptron algorithm in Section 4.1.2, we will also use α_0 = 0 and update the product of ratios accordingly.
simulated long enough for each sample such that the variation distance between the empirical distribution and π_{α_i} is at most εa/(5br) for each i. Also, assume that W(α_1) can be computed exactly. Then for any δ > 0, Ŵ(α) satisfies

Pr[(1 − ε)·W(α) ≤ Ŵ(α) ≤ (1 + ε)·W(α)] ≥ 1 − δ.
Proof. Let the distribution π̂_{α_i} be the one resulting from simulating M, and assume that the variation distance ‖π̂_{α_i} − π_{α_i}‖ ≤ εa/(5br). Now consider the random variable f̂_i, which is the same as f_i except that the terms are selected according to π̂_{α_i}. Since f̂_i ∈ [a,b], |E[f̂_i] − E[f_i]| ≤ εa/(5r), which implies E[f_i] − εa/(5r) ≤ E[f̂_i] ≤ E[f_i] + εa/(5r). Factoring out E[f_i] from both sides and noting that 1/E[f_i] ≤ 1/a yields

(1 − ε/(5r))·E[f_i] ≤ E[f̂_i] ≤ (1 + ε/(5r))·E[f_i].  (2)

This allows us to conclude that E[f̂_i] ≥ E[f_i]/2. Since f̂_i ≤ b, we get Var[f̂_i] ≤ b·E[f̂_i], yielding

Var[f̂_i]/(E[f̂_i])² ≤ b/E[f̂_i] ≤ 2b/E[f_i] ≤ 2b/a.  (3)
Let X_i^{(1)}, …, X_i^{(S)} be a sequence of S independent copies of f̂_i, and let X̄_i = (Σ_{j=1}^{S} X_i^{(j)})/S. Then E[X̄_i] = E[f̂_i] and Var[X̄_i] = Var[f̂_i]/S. The estimator of W(α) is W(α_1)/X = W(α_1)/Π_{i=2}^{r} X̄_i. Since the X̄_i's are independent, E[X] = Π_{i=2}^{r} E[X̄_i] = Π_{i=2}^{r} E[f̂_i] and E[X²] = Π_{i=2}^{r} E[X̄_i²]. Let ρ = Π_{i=2}^{r} W(α_{i−1})/W(α_i) (i.e. what we are estimating with X) and ρ̂ = E[X]. Then applying Eq. (2) gives

(1 − ε/(5r))^r · ρ ≤ ρ̂ ≤ (1 + ε/(5r))^r · ρ.

Since lim_{r→∞} (1 + ε/(5r))^r = e^{ε/5} ≤ 1 + ε/4 and (1 − ε/(5r))^r is minimized at r = 1, we get

(1 − ε/4)·ρ ≤ ρ̂ ≤ (1 + ε/4)·ρ.
Since Var[X] = E[X²] − (E[X])², we have

Var[X]/(E[X])² = Π_{i=2}^{r} (1 + Var[X̄_i]/(E[X̄_i])²) − 1
≤ (1 + 2b/(aS))^{r−1} − 1  (by Eq. (3))
≤ (1 + ε²/(65r))^{r} − 1 ≤ exp(ε²/65) − 1 ≤ ε²/64.

The last inequality holds since exp(x/65) ≤ 1 + x/64 for x ∈ [0,1]. We now apply Chebyshev's inequality to X, whose standard deviation is at most ερ̂/8:

Pr[|X − ρ̂| > ερ̂/4] ≤ 1/4.
So with probability at least 3/4 we get

(1 − ε/4)·ρ̂ ≤ X ≤ (1 + ε/4)·ρ̂,

which implies that with probability at least 3/4

(1 + ε)/ρ ≥ 1/((1 − ε/4)²·ρ) ≥ 1/X ≥ 1/((1 + ε/4)²·ρ) ≥ (1 − ε)/ρ.  (4)

Making the approximation hold with probability at least 1 − δ for any δ > 0 is done by rerunning the procedure for estimating X O(ln(1/δ)) times and taking the median of the results [17]. □
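The median amplification step at the end of the proof can be sketched as follows; the repetition constant is illustrative only, since the argument requires just O(ln(1/δ)) reruns:

```python
import math
import statistics

def median_amplify(run_once, delta, reps_constant=2):
    """Boost an estimator that is accurate with probability >= 3/4 to
    failure probability delta by repeating O(ln(1/delta)) times and
    returning the median of the results [17]."""
    k = max(1, math.ceil(reps_constant * math.log(1.0 / delta)))
    return statistics.median(run_once() for _ in range(k))
```

The median is robust here because a majority of the k runs are accurate with overwhelming probability, by a Chernoff bound.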
It is also possible to extend Theorem 2 to the case where W(α_1) cannot be exactly computed, but can be accurately estimated.
Corollary 3. Assume a ≤ f_i, f̂_i ≤ b for all i. Let the sample size S = ⌈130rb/(aε²)⌉, let W(α_1)'s estimate be within a factor ε/2 of its true value with probability ≥ 3/4, and let M be simulated long enough for each sample such that the variation distance between the empirical distribution and π_{α_i} is at most εa/(10br) for all i. Then for any δ > 0, Ŵ(α) satisfies

Pr[(1 − ε)·W(α) ≤ Ŵ(α) ≤ (1 + ε)·W(α)] ≥ 1 − δ.
Proof. The analysis is the same as in the proof of Theorem 2, except we now must accommodate another source of error. First, substitute ε/2 for ε in Eq. (4). Given the accuracy of W(α_1)'s estimate Ŵ(α_1) with probability at least 3/4, we get

W(α_1)(1 − ε/2)²/ρ ≤ Ŵ(α_1)/X ≤ W(α_1)(1 + ε/2)²/ρ

with probability at least 1/2. This completes the proof (the constants in S remain unchanged). Similar to Theorem 2, both estimates can be run multiple times and the median taken in order to reduce the probability of failure. □
We now bound the mixing time of M using the canonical paths method [38]. In this method, we treat M as a directed graph with vertices Ω and edges E = {(p,q) ∈ Ω×Ω : Q(p,q) > 0}, where Q(p,q) = π_α(p)·P(p,q). For each ordered pair (p,q) ∈ Ω×Ω, we specify a canonical path γ_{p,q} ∈ Γ from p to q in the graph (Ω,E) that corresponds to a sequence of legal transitions in M from p to q. We measure how heavily any one edge in E is loaded with canonical paths by

ρ̄ = ρ̄(Γ) = max_{e∈E} { (1/Q(e)) · Σ_{γ_{p,q} ∋ e} π_α(p)·π_α(q)·|γ_{p,q}| }.  (5)
We start with a result from Sinclair [38], restated by Jerrum and Sinclair [16].
Theorem 4 (Jerrum and Sinclair [16], Sinclair [38]). Let M be a finite, reversible, ergodic Markov chain with loop probabilities P(p,p) ≥ 1/2 for all p. Let Γ be a set of canonical paths with maximum edge loading ρ̄ = ρ̄(Γ). Then the mixing time of M satisfies τ_p(ε) ≤ ρ̄·(ln(1/π(p)) + ln(1/ε)) for any choice of initial state p; i.e. after simulating M for ρ̄·(ln(1/π(p)) + ln(1/ε)) steps starting in p, the variation distance between π̂_{α_i} and π_{α_i} is at most ε.
In general, Ω = Π_{i=1}^{n} {0,…,k_i} for some integers k_1,…,k_n. Without loss of generality we let Ω = {0,…,k}^n for some positive integer k. Then there is an edge from node p = (p_1,…,p_i,…,p_n) to p′ = (p_1,…,p′_i,…,p_n), i.e. an edge exists between each pair of nodes that differ in at most one position (self-loops also exist). For our proof, we assume that the hypercube is untruncated, which is necessary to ensure that no canonical paths leave the chain. However, it is likely that mixing time bounds also exist for truncated hypercubes. Such a bound could probably be derived from the recent work of Morris and Sinclair [28], who give an FPRAS for a truncated Boolean hypercube that has a uniform distribution.
Let p = (p_1,…,p_n) and q = (q_1,…,q_n) be arbitrary states of Ω. The canonical path γ_{p,q} consists of n edges, where edge i is

((q_1,…,q_{i−1}, p_i, p_{i+1},…,p_n), (q_1,…,q_{i−1}, q_i, p_{i+1},…,p_n)),

i.e. position i is changed from p_i to q_i. So some edges of γ_{p,q} might be loops. Now focus on a particular oriented edge

e = (a, a′) = ((a_1,…,a_i,…,a_n), (a_1,…,a′_i,…,a_n)).

We will now bound Eq. (5) for e, which yields a bound on ρ̄ and allows us to apply Theorem 4. Let cp(e) = {(p,q) : γ_{p,q} ∋ e} be the set of endpoints of canonical paths that use edge e. We use Jerrum and Sinclair's [16] mapping η_e : cp(e) → Ω, defined13 as follows: if (p,q) = ((p_1,…,p_n), (q_1,…,q_n)) ∈ cp(e), then

η_e(p,q) = (b_1,…,b_n) = (p_1,…,p_{i−1}, a_i, q_{i+1},…,q_n).

Note that p = (b_1,…,b_{i−1}, a_i, a_{i+1},…,a_n) and q = (a_1,…,a_{i−1}, a′_i, b_{i+1},…,b_n). Since p and q can be unambiguously recovered from η_e(p,q), the mapping η_e is injective.
We are now ready to state the mixing time bound.
Theorem 5. For all $\vec p,\vec q \in \Omega$ and for all $e \in \Omega\times\Omega$ such that $(\vec p,\vec q) \in cp(e)$, assume
$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le g\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q))$
for some function $g = g(n,K,k,\alpha)$. Also assume that for all neighbors $\vec a$ and $\vec a\,'$ in $\Omega$,
$\max\{\pi_\alpha(\vec a)/\pi_\alpha(\vec a\,'),\ \pi_\alpha(\vec a\,')/\pi_\alpha(\vec a)\} \le h = h(n,K,k,\alpha)$.
Then a simulation of $M$ that starts at node $\vec p$ and is of length
$T = 2kn^2\,g\,h\,\big(\ln\big(W(\alpha)/w_{\alpha,\vec p}\big) + \ln(1/\epsilon')\big)$
will draw samples from $\hat\pi_\alpha$ such that $\|\hat\pi_\alpha - \pi_\alpha\| \le \epsilon'$.
[Footnote 13: Vector notation is used when denoting $\vec\eta_e$ since $\vec\eta_e(\vec p,\vec q)\in\Omega$ for all $\vec p,\vec q\in\Omega$.]
Proof. Since $Q(e) = \min\{\pi_\alpha(\vec a),\pi_\alpha(\vec a\,')\}/(2kn)$, we get
$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le \frac{2kn\,g\,Q(e)}{\min\{\pi_\alpha(\vec a),\pi_\alpha(\vec a\,')\}}\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)) = 2kn\,g\,Q(e)\,\max\{1,\pi_\alpha(\vec a\,')/\pi_\alpha(\vec a)\}\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)) \le 2kn\,g\,h\,Q(e)\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)).$
Given the above inequality, we can now bound $\bar\rho$. Since $|\gamma_{\vec p,\vec q}| = n$, we get
$\frac{1}{Q(e)}\sum_{\gamma_{\vec p,\vec q}\ni e} \pi_\alpha(\vec p)\,\pi_\alpha(\vec q)\,|\gamma_{\vec p,\vec q}| \le 2kn^2\,g\,h \sum_{\gamma_{\vec p,\vec q}\ni e} \pi_\alpha(\vec\eta_e(\vec p,\vec q)) \le 2kn^2\,g\,h.$
The last inequality holds because $\vec\eta_e$ is injective and $\pi_\alpha$ is a probability distribution. Applying Theorem 4 completes the proof. $\square$
Corollary 6. For Markov chains for which $g$ and $h$ are polynomial in $n$ and $K$ (we assume $k$ and $\alpha$ are constants), and for approximation schemes for which $b$ and $1/a$ are polynomial in $n$ and $K$, our algorithm is an FPRAS.
4. Example applications

4.1. Learning DNF formulas

We consider generalized DNF representations, where the instance space is $\prod_{i=0}^{n-1}\{1,\ldots,k_i\}$ and the set of terms is $\prod_{i=0}^{n-1}\{0,\ldots,k_i\}$, where $k_i$ is the number of values for feature $i$. A term $\vec p = (p_0,\ldots,p_{n-1})$ is satisfied by example $\vec x = (x_0,\ldots,x_{n-1})$ if and only if $p_i = x_i$ for all $p_i > 0$. So $p_i = 0$ implies that $x_i$ is irrelevant for term $\vec p$, and $p_i > 0$ implies that $x_i$ must equal $p_i$ for $\vec p$ to be satisfied.

We present algorithms to learn this concept class that are based on Littlestone's Winnow [21] and Rosenblatt's Perceptron [33] algorithms. The inputs to the linear threshold units learned by these algorithms consist of the entire set of DNF terms over the original set of $n$ inputs. For reasons that will become clear, we look at different versions of the function $p(\vec x)$, which measures the degree to which $\vec x$ satisfies $\vec p$. These versions include $p(\vec x)$ being a threshold function, a logistic function, and a linear function.

None of our approaches in this section give complete, efficient solutions to the problem of learning DNF in the on-line model, but they do give new mechanisms that could be refined for potential application to restricted cases of the problem, e.g. restricted classes of DNF or specific distributions over the examples.
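The satisfaction rule for generalized DNF terms can be stated compactly in code. This is our own illustrative sketch of the semantics just described, not code from the paper:

```python
def satisfies(term, x):
    """True iff example x satisfies the term: every p_i > 0 must equal x_i."""
    return all(p == 0 or p == x_i for p, x_i in zip(term, x))

def dnf(terms, x):
    """A DNF formula is satisfied when any one of its terms is."""
    return any(satisfies(t, x) for t in terms)

# Example: n = 3 features; the term (2, 0, 1) requires x_0 == 2 and x_2 == 1,
# while x_1 is irrelevant (p_1 == 0).
assert satisfies((2, 0, 1), (2, 3, 1))
assert not satisfies((2, 0, 1), (2, 3, 2))
assert dnf([(2, 0, 1), (0, 1, 0)], (5, 1, 4))   # second term fires
```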
4.1.1. Winnow

Recalling the definition of Winnow in Section 2.1, if the initial weight vector is the all-1s vector, the weight of term $\vec p$ is $w_{\vec p} = \alpha^{z_{\vec p}}$, where $z_{\vec p} = \sum_{\vec x\in M} c_{\vec x}\,p(\vec x)$, $M$ is the set of examples for which a prediction mistake is made, and $c_{\vec x}\in\{-1,+1\}$ is example $\vec x$'s label. We now state one of Littlestone's results for Winnow [23], which we will use to bound the number of prediction mistakes it makes for the variations of our algorithm.
Theorem 7 (Littlestone [23]). Let $(\vec Y_j, c_j) \in [0,1]^N \times \{0,1\}$ for $j = 1,\ldots,t$ ($t$ is the index of the current trial). Suppose that there exist $\vec\mu \ge 0$ and $0 < \rho < 1$ such that whenever $c_j = 1$ we have $\vec\mu\cdot\vec Y_j \ge 1$, and whenever $c_j = 0$ we have $\vec\mu\cdot\vec Y_j \le 1 - \rho$. Now suppose Winnow sees as inputs $X = (\vec X_j, c_j)$ where each $X_{ij}\in[0,1]$, and define $\vec E_j = (|X_{j1}-Y_{j1}|,\ldots,|X_{jN}-Y_{jN}|)$. Then the number of mistakes made by Winnow on $X$ with $\alpha = 1+\rho/2$ and $\theta = N$ is at most
$\frac{8}{\rho^2} + \max\!\left(0,\ \frac{14}{\rho^2}\sum_{i=1}^N \mu_i\ln(\mu_i\theta)\right) + \frac{4}{\rho}\sum_{j=1}^t \vec\mu\cdot\vec E_j.$
We will examine three versions of our algorithm, differing in the values that Winnow receives as its inputs. In the following, we say that a variable $x_i$ in example $\vec x$ matches its corresponding variable $p_i$ in term $\vec p$ if $p_i = 0$ or $p_i = x_i$. We let $m_{\vec x,\vec p}\in\{0,\ldots,n\}$ denote the number of variables in $\vec x$ that match their corresponding variables in $\vec p$ (we drop the subscript $\vec x$ when it is clear from context).

(1) Binary $p(\vec x)$ means that Winnow input $X_{\vec p} = 1$ if $m_{\vec p} = n$ and 0 otherwise.

(2) Logistic $p(\vec x)$ means that Winnow input $X_{\vec p}$ is
$p(\vec x) = \frac{2}{1+e^{-\sigma(m_{\vec p}-n)}}, \qquad (6)$
where $\sigma > 0$ is a parameter. Thus $p(\vec x)\in(0,1]$ and grows as $\vec p$ becomes more satisfied by $\vec x$ (it equals 1 if and only if $\vec p$ is completely satisfied).

(3) Linear $p(\vec x)$ means that Winnow input $X_{\vec p}$ is
$p(\vec x) = \frac{1+m_{\vec p}}{n+1}. \qquad (7)$
Thus $p(\vec x)\in(0,1]$ and grows as $\vec p$ becomes more satisfied by $\vec x$ (it equals 1 if and only if $\vec p$ is completely satisfied).
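All three input functions depend on the example only through the match count $m_{\vec x,\vec p}$. A minimal sketch of the three versions (our own illustration; `sigma` stands for the logistic parameter $\sigma$):

```python
import math

def matches(term, x):
    """m: number of positions where p_i == 0 or p_i == x_i."""
    return sum(1 for p, x_i in zip(term, x) if p == 0 or p == x_i)

def binary_input(term, x):
    return 1.0 if matches(term, x) == len(x) else 0.0

def logistic_input(term, x, sigma):
    m, n = matches(term, x), len(x)
    return 2.0 / (1.0 + math.exp(-sigma * (m - n)))      # Eq. (6)

def linear_input(term, x):
    m, n = matches(term, x), len(x)
    return (1.0 + m) / (n + 1.0)                         # Eq. (7)

x = (1, 2, 3)
full = (1, 0, 3)          # satisfied: m = 3 = n
part = (1, 2, 1)          # m = 2
assert binary_input(full, x) == 1.0 and binary_input(part, x) == 0.0
assert abs(logistic_input(full, x, sigma=2.0) - 1.0) < 1e-12  # 1 iff satisfied
assert 0.0 < logistic_input(part, x, sigma=2.0) < 1.0
assert linear_input(full, x) == 1.0 and linear_input(part, x) == 0.75
```

The assertions reflect the properties stated above: both smooth versions lie in $(0,1]$ and equal 1 exactly when the term is completely satisfied.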
We defined $p(\vec x) > 0$ for all $\vec x$ in order to ensure that for a given $\vec x$, every term in $\Omega$ contributes something to the weighted sum and the hypercube is untruncated, which allows us to apply Theorem 5. Using binary $p(\vec x)$ also yields an untruncated hypercube, as explained below.

Before considering these three cases individually, we state some common results for them all. First note that for logistic and linear $p(\vec x)$, $\Omega$ consists of the entire set of possible terms, since each term gives to Winnow a value $p(\vec x) > 0$. Thus in the chain defined in Section 3, every state can be reached from every other state. Further, for binary $p(\vec x)$, we note that there are exactly $2^n$ terms that are satisfied by $\vec x$, i.e. $\vec p$ is satisfied by $\vec x$ if and only if $p_i = 0$ or $p_i = x_i$ for all $i\in\{1,\ldots,n\}$. Thus in this case we can construct the state space to be $\Omega = \{0,1\}^n$, which is completely connected and untruncated. Therefore Lemma 1 applies to all our Markov chains. We now discuss the application of Theorem 2. All that is required to apply this result is to bound the range of $f_i$ and $\phi_i$. Since $f_i$ and $\phi_i$ are independent of $p(\vec x)$, the same result applies to all three versions of our algorithm.
Lemma 8. When applying binary, logistic, or linear Winnow to learn DNF, for all $i$, $1/e \le f_i, \phi_i \le e$.

Proof. First note that the only difference between $f_i$ and $\phi_i$ is the probability distribution that generates the terms that define them, i.e. their ranges are the same. Thus we focus on bounding $f_i$ only. Let $z_{\vec p}$ be node $\vec p$'s total update as defined in Section 3. Then
$f_i(\vec p) = \frac{w_{\alpha_{i-1},\vec p}}{w_{\alpha_i,\vec p}} = \left(\frac{\alpha_{i-1}}{\alpha_i}\right)^{z_{\vec p}} = \left(\frac{(1+1/z)^{i-2}}{(1+1/z)^{i-1}}\right)^{z_{\vec p}} = (1+1/z)^{-z_{\vec p}}.$
Recall from its definition that $z \ge B \ge |z_{\vec p}|$ for all $\vec p$ (to avoid division by zero, we can also assume that $z > 0$). If $z_{\vec p} < 0$, then $1 \le (1+1/z)^{-z_{\vec p}} \le e$. If $z_{\vec p} \ge 0$, then $1/e \le (1+1/z)^{-z_{\vec p}} \le 1$. $\square$
Note that $W(\alpha_1) = W(1)$ is simply $\sum_{\vec p\in\Omega} p(\vec x)$. For the binary case, this is simply the number of terms satisfied by $\vec x$, which equals $2^n$. For the linear and logistic cases, it can be efficiently computed exactly if we assume that $k_i = k$ for all $i$. Under this assumption, the number of terms that match exactly $i\in\{0,\ldots,n\}$ variables in the example $\vec x$ is $2^i\binom{n}{i}(k-1)^{n-i}$, since there are $\binom{n}{i}$ positions to place the matched variables, each matched position $p_j$ can equal 0 or $x_j$, and each unmatched position $p_{j'}$ can take on any value from $\{1,\ldots,k\}\setminus\{x_{j'}\}$. Thus for the logistic case, we get
$W(1) = \sum_{i=0}^n \frac{2^i\binom{n}{i}\,2\,(k-1)^{n-i}}{1+e^{-\sigma(i-n)}}, \qquad (8)$
and for the linear case, we get
$W(1) = \sum_{i=0}^n \frac{2^i\binom{n}{i}\,(i+1)\,(k-1)^{n-i}}{n+1} = \frac{(k+1)^n + 2n(k+1)^{n-1}}{n+1}. \qquad (9)$
Thus all three can be efficiently computed exactly. By applying this and substituting Lemma 8's bounds into Theorem 2, we get the following.
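Eq. (9) and its closed form can be checked against brute-force enumeration of all $(k+1)^n$ terms for small $n$ and $k$ (our own check, not the authors' code):

```python
from itertools import product
from math import comb

def W1_linear_closed(n, k):
    """Closed form of Eq. (9)."""
    return ((k + 1) ** n + 2 * n * (k + 1) ** (n - 1)) / (n + 1)

def W1_linear_sum(n, k):
    """Summation form of Eq. (9), grouping terms by match count i."""
    return sum(2 ** i * comb(n, i) * (i + 1) * (k - 1) ** (n - i)
               for i in range(n + 1)) / (n + 1)

def W1_linear_brute(n, k, x):
    """Sum of (1 + m) / (n + 1) over every term in {0,...,k}^n."""
    total = 0.0
    for term in product(range(k + 1), repeat=n):
        m = sum(1 for p, x_i in zip(term, x) if p == 0 or p == x_i)
        total += (1.0 + m) / (n + 1)
    return total

n, k = 4, 3
x = (1, 2, 3, 1)            # any example with entries in {1,...,k}
assert abs(W1_linear_closed(n, k) - W1_linear_sum(n, k)) < 1e-9
assert abs(W1_linear_closed(n, k) - W1_linear_brute(n, k, x)) < 1e-9
```

Note that the brute-force sum is independent of the particular example $\vec x$, which is why the closed form needs no reference to it.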
Corollary 9. When applying Winnow to learn generalized DNF (with $k_i = k$ for all $i$ for the logistic and linear cases), let the sample size be $S = \lceil 130\,r\,e^2/\epsilon^2\rceil$ and let $M$ be simulated long enough for each sample such that the variation distance between the empirical distribution and $\pi_{\alpha_i,t}$ is at most $\epsilon/(5e^2 r)$. Then for any $\delta > 0$, $\hat W(\alpha)$ satisfies
$\Pr[(1-\epsilon)W(\alpha) \le \hat W(\alpha) \le (1+\epsilon)W(\alpha)] \ge 1-\delta.$
As stated in the following corollary, our algorithms' behaviors are the same as Winnow's for a straightforward brute-force implementation if the weighted sums are not too close to $\theta$ for any input. This is true with probability at least $1-\delta_t'$ for each trial $t$, so setting $\delta_t' = \delta/2^t$ yields a total probability of failure of at most$^{14}$ $\sum_{t=1}^\infty \delta/2^t = \delta$. Finally, note that it is easy to extend the corollary to tolerate a bounded number of trials with weighted sums that are near $\theta$ by thinking of such potential mispredictions as noise and applying Theorem 7.
Corollary 10. Using the assumptions of Corollary 9, if $W_t(\alpha) \notin [\theta/(1+\epsilon),\,\theta/(1-\epsilon)]$ for all trials $t$, then with probability at least $1-\delta$, the number of mistakes made by Winnow on any sequence of examples is as bounded by Theorem 7 (see Sections 4.1.1.1–4.1.1.3 and Lemma 11).
A hurdle that must be overcome to get an efficient algorithm is $S$'s polynomial dependence on $1/\epsilon$ in Corollary 9, even though Winnow might at times have $W/\theta$ exponentially close$^{15}$ to 1, requiring exponentially small $\epsilon$. It is open whether this can be addressed in an average-case analysis of Winnow when learning restricted concept classes under specific distributions.

We now explore bounding the mixing times of the Markov chains. Note that the bounds are based on worst-case analyses and assume that the maximum number of weight updates (as bounded by Theorem 7's mistake bound) have been made. Prior to making that number of updates (e.g. near the start of training), the mixing time bounds will be lower, since the stationary distribution $\pi$ of $M$ will be closer to uniform (in fact, before the first update, $\pi$ is uniform). We can get mixing time bounds for these earlier cases by substituting the number of prediction mistakes made so far for the mistake bounds.
4.1.1.1. Binary $p(\vec x)$. It is straightforward to apply Theorem 7 to the binary case. Let the vector $\vec\mu$ be 0 for each irrelevant term and 1 for each relevant term. Then when $c = 1$, at least one relevant term must be satisfied, so $\vec\mu\cdot\vec Y \ge 1$. Further, if $c = 0$, then no relevant terms are satisfied and $\vec\mu\cdot\vec Y = 0 \le 1-\rho$ for $\rho = 1$. Assuming all examples are noise-free, applying Theorem 7 yields a mistake bound of $|M| \le 8 + 14K\ln N$. So if $k \ge k_i$ for all $i$, then using the at most $(k+1)^n$ possible terms as Winnow's inputs, it can learn $K$-term generalized DNF with at most $8 + 14Kn\ln(k+1)$ prediction mistakes.

Unfortunately, with the binary case it is very difficult to find non-trivial bounds on $g$ and $h$ from Theorem 5, due to the discontinuity of $p(\vec x)$. Bounding both $g$ and $h$ requires bounding the ratios of the weights of nodes in $\Omega$. For binary $p(\vec x)$, these weights directly depend on how often the nodes predicted 1 when a prediction mistake was made, but it is difficult to relate how often this occurs for a node $\vec p$ to how often this occurs for another node $\vec q$, even if $\vec p$ and $\vec q$ are neighbors. On the other hand, when we consider logistic and linear $p(\vec x)$, we can relate the nodes' weights and get non-trivial bounds on $g$ and $h$.
4.1.1.2. Logistic $p(\vec x)$. The mistake bound of this application of Winnow is similar to that of the straightforward version with binary inputs.
[Footnote 14: Recall from the proof of Theorem 2 that only $O(\log 1/\delta')$ runs of the estimation procedure are needed to reduce the probability of failure to $\delta'$.]

[Footnote 15: The potential problem of $W_t/\theta = 1$ can be avoided by using a threshold of $\theta + \alpha^{-(|M|+1)}$, where $|M|$ is the mistake bound from applying Theorem 7. Obviously $W$ can never equal this new threshold.]
Lemma 11. When using Eq. (6) with $\sigma = \ln(60Kn\ln k)$ to specify the inputs, where $k = \max_i\{k_i\}$, the number of prediction mistakes made by Winnow when learning DNF is at most
$8.88 + 15.54\,Kn\ln k.$
Proof. We start by finding $\vec\mu$ and $\rho$ that satisfy the conditions of Theorem 7. For each of the $K$ relevant terms (Winnow inputs), set the corresponding value in $\vec\mu$ to a constant $\mu$, which we will define later. Set all other values in $\vec\mu$ to 0. In the worst case, when an example $\vec x$ is positive, it satisfies exactly one relevant term $\vec p$ and does not at all satisfy any of the other relevant terms. Then $p(\vec x) = 1$ and $q(\vec x) = 2/(1+e^{\sigma n})$ for all other relevant terms $\vec q$. Thus it suffices to set $\mu$ such that
$\mu + \frac{2(K-1)\mu}{1+e^{\sigma n}} \ge 1.$
After some algebra we see that it suffices to set $\mu = (1+e^{\sigma n})/(2K+e^{\sigma n}-1)$.

Now we find $\rho$. In the worst case, for a negative example $\vec x$, we will have each relevant term $\vec p$ almost fully satisfied, i.e. $m_{\vec p} = n-1$. Hence $p(\vec x) = 2/(1+e^{\sigma})$. So we need $\rho$ such that $1-\rho \ge 2K\mu/(1+e^{\sigma})$. Substituting $\mu$ yields
$\rho \le 1 - \frac{2K(1+e^{\sigma n})}{2K(1+e^{\sigma}) + e^{\sigma n}(1+e^{\sigma}) - e^{\sigma} - 1}.$
This expression decreases with increasing $n$, so to find an appropriate $\rho$, it suffices to take its limit as $n\to\infty$. Applying l'Hôpital's rule shows that $\rho = 1 - 2K/(1+e^{\sigma})$ is sufficient, which is positive so long as $\sigma > \ln(2K-1)$. We assume that all examples are noise-free, so $\vec E_j = \vec 0$ for all $j$. Now applying Theorem 7 yields a mistake bound of
$\left(\frac{(1+e^{\sigma})^2}{(1+e^{\sigma})^2 - 4K(e^{\sigma}+1-K)}\right)\left(8 + 14K\,\frac{1+e^{\sigma n}}{2K+e^{\sigma n}-1}\left(\ln\frac{1+e^{\sigma n}}{2K+e^{\sigma n}-1} + \ln N\right)\right)$
$\le \left(\frac{(1+e^{\sigma})^2}{(1+e^{\sigma})^2 - 4K(1+e^{\sigma})}\right)(8+14K\ln N) = \left(\frac{1+e^{\sigma}}{1+e^{\sigma}-4K}\right)(8+14K\ln N) \le 1.11\,(8+14K\ln N),$
since $K, n \ge 1$ and $k \ge 2$. Noting that $N \le k^n$ completes the proof. $\square$
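The choices of $\mu$ and $\rho$ in the proof can be verified numerically against Theorem 7's two conditions in the worst cases described above. This is our own check; `check_lemma11` is a hypothetical helper name:

```python
import math

def check_lemma11(K, n, k):
    sigma = math.log(60 * K * n * math.log(k))
    mu = (1 + math.exp(sigma * n)) / (2 * K + math.exp(sigma * n) - 1)
    rho = 1 - 2 * K / (1 + math.exp(sigma))
    # Positive worst case: one relevant term fully satisfied (input 1),
    # the other K - 1 relevant terms not satisfied at all (m = 0).
    pos = mu * (1 + (K - 1) * 2 / (1 + math.exp(sigma * n)))
    # Negative worst case: every relevant term has m = n - 1.
    neg = mu * K * 2 / (1 + math.exp(sigma))
    assert rho > 0                        # needs sigma > ln(2K - 1)
    assert pos >= 1 - 1e-12               # Theorem 7 condition for c = 1
    assert neg <= 1 - rho + 1e-12         # Theorem 7 condition for c = 0
    return mu, rho

for K, n, k in [(1, 5, 2), (3, 10, 4), (10, 20, 8)]:
    check_lemma11(K, n, k)
```

The positive-case condition holds with equality by construction of $\mu$, and the negative case holds because $\mu \le 1$ for $K \ge 1$.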
We now work towards a mixing time bound for the chain.

Lemma 12. Let $\Omega = \{0,\ldots,k\}^n$. Then for all $\vec p,\vec q\in\Omega$,
$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le 4\,\alpha^{|M|}\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)).$
Proof. Let $\vec q_{1\ldots i}$ denote $(q_1,\ldots,q_i)$, and similarly for $\vec p_{1\ldots i}$. Then $m_{\vec p} = m_{\vec p_{1\ldots i}} + m_{\vec p_{i+1\ldots n}}$, $m_{\vec q} = m_{\vec q_{1\ldots i}} + m_{\vec q_{i+1\ldots n}}$, $m_{\vec\eta_e(\vec p,\vec q)} = m_{\vec p_{1\ldots i}} + m_{\vec q_{i+1\ldots n}}$, and $m_{\vec a\,'} = m_{\vec q_{1\ldots i}} + m_{\vec p_{i+1\ldots n}}$. Further, all four of these values are in $\{0,\ldots,n\}$. This yields
$\frac{\pi_\alpha(\vec p)\,\pi_\alpha(\vec q)}{\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q))} = \frac{p(\vec x)\,q(\vec x)\,\alpha^{z_{\vec p}+z_{\vec q}}}{\vec\eta_e(\vec p,\vec q)(\vec x)\,\vec a\,'(\vec x)\,\alpha^{z_{\vec\eta_e(\vec p,\vec q)}+z_{\vec a\,'}}} = \frac{\alpha^{z_{\vec p}+z_{\vec q}}}{\alpha^{z_{\vec\eta_e(\vec p,\vec q)}+z_{\vec a\,'}}}\left(\frac{1+U(\vec p_{1\ldots i},\vec q_{i+1\ldots n},n)+U(\vec q_{1\ldots i},\vec p_{i+1\ldots n},n)+U(\vec p_{1\ldots n},\vec q_{1\ldots n},2n)}{1+U(\vec p_{1\ldots i},\vec p_{i+1\ldots n},n)+U(\vec q_{1\ldots i},\vec q_{i+1\ldots n},n)+U(\vec p_{1\ldots n},\vec q_{1\ldots n},2n)}\right) < 4\,\alpha^{z_{\vec p}+z_{\vec q}-z_{\vec\eta_e(\vec p,\vec q)}-z_{\vec a\,'}},$
where $U(\vec p_{i\ldots j},\vec q_{i'\ldots j'},\nu) = \exp(-\sigma(m_{\vec p_{i\ldots j}} + m_{\vec q_{i'\ldots j'}} - \nu))$. The last inequality follows from each term in the numerator being strictly less than the entire denominator. Now let $C$ be the exponent of the $\alpha$ term. Then we have$^{16}$
$C = z_{\vec p} + z_{\vec q} - z_{\vec\eta_e(\vec p,\vec q)} - z_{\vec a\,'} = \sum_{\vec x\in M}\left(\frac{2c_{\vec x}}{1+e^{-\sigma(m_{\vec x,\vec p_{1\ldots i}}+m_{\vec x,\vec p_{i+1\ldots n}}-n)}} + \frac{2c_{\vec x}}{1+e^{-\sigma(m_{\vec x,\vec q_{1\ldots i}}+m_{\vec x,\vec q_{i+1\ldots n}}-n)}} - \frac{2c_{\vec x}}{1+e^{-\sigma(m_{\vec x,\vec p_{1\ldots i}}+m_{\vec x,\vec q_{i+1\ldots n}}-n)}} - \frac{2c_{\vec x}}{1+e^{-\sigma(m_{\vec x,\vec q_{1\ldots i}}+m_{\vec x,\vec p_{i+1\ldots n}}-n)}}\right).$
Each term of the above summation is between $-1$ and $1$, so a worst-case upper bound is $|M|$. $\square$
Lemma 13. For all neighbors $\vec p$ and $\vec q\in\Omega$,
$\max\{\pi_\alpha(\vec p)/\pi_\alpha(\vec q),\ \pi_\alpha(\vec q)/\pi_\alpha(\vec p)\} \le \alpha^{|M|}\cdot 60Kn\ln k.$
Proof. Since $\vec p$ and $\vec q$ are neighbors, they differ in only one position, so $|m_{\vec p}-m_{\vec q}| \le 1$. Then
$\frac{\pi_\alpha(\vec p)}{\pi_\alpha(\vec q)} = \left(\frac{p(\vec x)}{q(\vec x)}\right)\alpha^{z_{\vec p}-z_{\vec q}} = \left(\frac{1+e^{\sigma n-\sigma m_{\vec q}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\alpha^{z_{\vec p}-z_{\vec q}} \le \left(\frac{1+e^{\sigma n-\sigma(m_{\vec p}-1)}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\alpha^{z_{\vec p}-z_{\vec q}} = \left(\frac{1+e^{\sigma}\,e^{\sigma n-\sigma m_{\vec p}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\alpha^{z_{\vec p}-z_{\vec q}} \le \left(\frac{1+e^{\sigma n-\sigma m_{\vec p}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)e^{\sigma}\,\alpha^{z_{\vec p}-z_{\vec q}} = e^{\sigma}\,\alpha^{z_{\vec p}-z_{\vec q}}.$

[Footnote 16: When the subscript $\vec x$ is omitted from $m$, then $m$ counts the number of matches with the current example. In the summations over $\vec x\in M$, $m_{\vec x}$ represents the number of matches with example $\vec x$.]

We now consider $z_{\vec p}-z_{\vec q}$, which equals
$\sum_{\vec x\in M}\left(\frac{2c_{\vec x}}{1+e^{\sigma n-\sigma m_{\vec x,\vec p}}} - \frac{2c_{\vec x}}{1+e^{\sigma n-\sigma m_{\vec x,\vec q}}}\right) = 2\sum_{\vec x\in M} c_{\vec x}\left(\frac{e^{\sigma n-\sigma m_{\vec x,\vec q}} - e^{\sigma n-\sigma m_{\vec x,\vec p}}}{1+e^{\sigma n-\sigma m_{\vec x,\vec q}}+e^{\sigma n-\sigma m_{\vec x,\vec p}}+e^{2\sigma n-\sigma(m_{\vec x,\vec p}+m_{\vec x,\vec q})}}\right)$
$\le 2\sum_{\vec x\in M} c_{\vec x}\left(\frac{e^{\sigma n-\sigma m_{\vec x,\vec p}+\sigma} - e^{\sigma n-\sigma m_{\vec x,\vec p}}}{1+e^{\sigma n-\sigma m_{\vec x,\vec p}+\sigma}+e^{\sigma n-\sigma m_{\vec x,\vec p}}+e^{2\sigma n-\sigma(m_{\vec x,\vec p}-1+m_{\vec x,\vec p})}}\right) = 2\sum_{\vec x\in M} c_{\vec x}\left(\frac{e^{\sigma n-\sigma m_{\vec x,\vec p}}(e^{\sigma}-1)}{1+e^{\sigma n-\sigma m_{\vec x,\vec p}}(e^{\sigma}+1)+e^{\sigma}\,e^{2\sigma(n-m_{\vec x,\vec p})}}\right) \le 2\sum_{\vec x\in M}\frac{c_{\vec x}}{1+e^{\sigma(n-m_{\vec x,\vec p})}} \le |M|,$
where the inequality in the second displayed line follows from the fact that there are only three ways to relate $m_{\vec x,\vec p}$ and $m_{\vec x,\vec q}$ for a specific $\vec x$: if $m_{\vec x,\vec p} = m_{\vec x,\vec q}+1$, then that term of the summation equals the bound; if $m_{\vec x,\vec p} = m_{\vec x,\vec q}-1$, then that term of the summation is negative, which is less than the bound; and if $m_{\vec x,\vec p} = m_{\vec x,\vec q}$, then that term of the summation is 0, which is less than the bound.

Finally, we note that a symmetric argument can be made for $\pi_\alpha(\vec q)/\pi_\alpha(\vec p)$. $\square$
We now apply Theorem 5 to bound the mixing time of this Markov chain.

Corollary 14. When learning generalized DNF using Winnow and a logistic $p(\vec x)$, a simulation of $M$ that starts at any node and is of length
$T_i = 480\,kn^3\,\alpha_i^{1+2|M|}\,K\ln k\,\big(n\ln k + 2|M|\ln\alpha_i + \ln(1/\epsilon')\big)$
(where $|M|$ is the number of prediction mistakes made so far) will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i}-\pi_{\alpha_i}\| \le \epsilon'$.

Proof. Lemmas 12 and 13 bound $g$ and $h$, which we substitute directly into Theorem 5. Also note that $W(\alpha_i) \le k^n\alpha_i^{|M|}$ and $w_{\alpha_i,\vec p} \ge \alpha_i^{-|M|}$, completing the proof. $\square$
Before many prediction mistakes have been made, our algorithm can quickly generate random samples from $M$ almost according to its stationary distribution $\pi$. Unfortunately, our worst-case mixing time bound for this chain (once $|M|$ approaches Winnow's mistake bound) is exponential in $n$ and $K$. Indeed, a straightforward brute-force computation of the sum of the weights can be done in $\Theta(k^n)$ time, whereas our bound on $T_i$ grows with $\alpha_i^{1+2(8.88+15.54Kn\ln k)} \ge k^{31Kn\ln\alpha_i} \ge k^{12.5n}$ when $\alpha_i = \alpha = 3/2$ (a popular value for $\alpha$). However, Corollary 14 is based on worst-case, adversary-based analyses. In particular, the proofs of Lemmas 12 and 13 both bound the exponents of the $\alpha$ terms with $|M|$, even though they could be much smaller. (For example, in Lemma 13's proof, the terms in the summation of $z_{\vec p}-z_{\vec q}$ are exponentially small when $m_{\vec x,\vec p}$ is small, which can occur frequently for terms $\vec p$ with few zeroes. Also, we assumed that each term of the summation was positive, even though several could be negative.) It is open whether sub-exponential bounds can be achieved by applying a different analysis to some special cases of restricted concept classes and distributional assumptions. Further, in Section 5 we show that in practice our algorithm performs much better than the worst-case theoretical results imply, especially considering that highly accurate estimates of the weighted sums are not needed so long as we know which side of the threshold the sum lies on.
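The gap described here is easy to tabulate. The sketch below is our own illustration (the function names are ours; it fixes $\epsilon' = 0.1$ and plugs Lemma 11's mistake bound into Corollary 14's formula) and shows the worst-case bound dwarfing the $\Theta(k^n)$ brute-force cost even for small $n$:

```python
import math

def brute_force_cost(n, k):
    """Enumerate every term exactly: Theta(k^n)."""
    return k ** n

def corollary14_bound(n, k, K, eps_prime=0.1, alpha=1.5):
    """Worst-case simulation length T_i from Corollary 14, with |M| set to
    Lemma 11's mistake bound (the worst case discussed in the text)."""
    M = 8.88 + 15.54 * K * n * math.log(k)
    return (480 * k * n ** 3 * alpha ** (1 + 2 * M) * K * math.log(k)
            * (n * math.log(k) + 2 * M * math.log(alpha)
               + math.log(1 / eps_prime)))

for n in (5, 10):
    bf, mc = brute_force_cost(n, 2), corollary14_bound(n, 2, 1)
    # The worst-case MCMC bound far exceeds brute force at these sizes.
    assert mc > bf
```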
4.1.1.3. Linear $p(\vec x)$. Applying Theorem 7 to the linear case is not as straightforward as it was for the logistic case. As in the proof of Lemma 11, we set the entries in $\vec\mu$ corresponding to relevant terms to $\mu$ and the remaining entries to 0. When $c = 1$, in the worst case exactly one relevant term matches all $n$ variables in $\vec x$ and the remaining $K-1$ relevant terms match 0. Then we get
$\vec\mu\cdot\vec Y = \mu\left(1+\frac{K-1}{n+1}\right) = \mu\left(\frac{n+K}{n+1}\right),$
which has to be $\ge 1$. When $c = 0$, the worst case has all $K$ relevant terms matching $n-1$ variables of $\vec x$, yielding
$\vec\mu\cdot\vec Y = \mu\left(\frac{nK}{n+1}\right) \ge \mu\left(\frac{n+K}{n+1}\right)$
for $n, K \ge 2$. Thus it is impossible to get a $\rho > 0$ for the linear case unless we treat such worst-case examples as noise and use a non-zero $\vec E$. Following this idea, we assume that when $c = 1$, all relevant inputs to Winnow are at least $g_1/(n+1)$ (i.e. all relevant inputs match at least $g_1-1$ variables in $\vec x$), with of course at least one such input $= 1$. Further, we assume that when $c = 0$, all relevant inputs are at most $g_0/(n+1)$. Then it is easy to show that setting
$\mu = \frac{n+1}{n+1+g_1(K-1)}$
and
$\rho = \frac{n+1+g_1(K-1)-g_0 K}{n+1+g_1(K-1)}$
satisfies the conditions of Theorem 7. For example, if $g_1 = 3n/4$ and $g_0 = n/2$, then $\mu = (4n+4)/(n+3nK+4)$, $\rho = (n+nK+4)/(n+3nK+4) \ge 1/3$, and Theorem 7's mistake bound is
$|M| \le 72 + \frac{504K(n+1)}{n+3Kn+4}\,\ln\left(\frac{4N(n+1)}{n+3Kn+4}\right) + 36\sum_{j=1}^t \vec\mu\cdot\vec E_j \le 72 + 252\ln N + 36\sum_{j=1}^t \vec\mu\cdot\vec E_j \le 72 + 252\,n\ln k + 36\sum_{j=1}^t \vec\mu\cdot\vec E_j, \qquad (10)$
if $n \ge 2$. Of course, this is of little value as an adversarial bound, since the third term is summed over all trials and an adversary could make each term of this summation positive. But a bound of this form might be useful under appropriate distributional assumptions.

We now bound the mixing time for the linear case.
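Under the stated margin assumptions, the chosen $\mu$ and $\rho$ again satisfy Theorem 7's conditions, which can be confirmed numerically for the $g_1 = 3n/4$, $g_0 = n/2$ example (our own check; `check_linear_margins` is a hypothetical helper name):

```python
def check_linear_margins(n, K):
    g1, g0 = 3 * n / 4, n / 2
    mu = (n + 1) / (n + 1 + g1 * (K - 1))
    rho = (n + 1 + g1 * (K - 1) - g0 * K) / (n + 1 + g1 * (K - 1))
    # c = 1 worst case under the assumption: one relevant input equals 1,
    # the other K - 1 relevant inputs at their minimum g1 / (n + 1).
    pos = mu * (1 + (K - 1) * g1 / (n + 1))
    # c = 0 worst case: all K relevant inputs at their maximum g0 / (n + 1).
    neg = mu * K * g0 / (n + 1)
    assert pos >= 1 - 1e-12               # Theorem 7 condition for c = 1
    assert neg <= 1 - rho + 1e-12         # Theorem 7 condition for c = 0
    assert rho >= 1 / 3 - 1e-12           # the bound used to get Eq. (10)
    return mu, rho

for n, K in [(4, 2), (8, 3), (40, 10)]:
    check_linear_margins(n, K)
```

As in Lemma 11, the positive-case condition holds with equality by construction of $\mu$, which is why $\mu$ takes exactly this form.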
Lemma 15. Let $\Omega = \{0,\ldots,k\}^n$. Then for all $\vec p,\vec q\in\Omega$,
$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le \frac{(1+n/2)^2}{n+1}\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)).$
Proof. Using the same notation introduced in the first paragraph of Lemma 12's proof, we get
$\frac{\pi_\alpha(\vec p)\,\pi_\alpha(\vec q)}{\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q))} = \frac{p(\vec x)\,q(\vec x)\,\alpha^{z_{\vec p}+z_{\vec q}}}{\vec\eta_e(\vec p,\vec q)(\vec x)\,\vec a\,'(\vec x)\,\alpha^{z_{\vec\eta_e(\vec p,\vec q)}+z_{\vec a\,'}}} = \frac{\alpha^{z_{\vec p}+z_{\vec q}}}{\alpha^{z_{\vec\eta_e(\vec p,\vec q)}+z_{\vec a\,'}}}\left(\frac{(1+m_{\vec p_{1\ldots i}}+m_{\vec p_{i+1\ldots n}})(1+m_{\vec q_{1\ldots i}}+m_{\vec q_{i+1\ldots n}})}{(1+m_{\vec p_{1\ldots i}}+m_{\vec q_{i+1\ldots n}})(1+m_{\vec q_{1\ldots i}}+m_{\vec p_{i+1\ldots n}})}\right) \le \alpha^{z_{\vec p}+z_{\vec q}-z_{\vec\eta_e(\vec p,\vec q)}-z_{\vec a\,'}}\left(\frac{(1+n/2)^2}{n+1}\right).$
The last inequality holds since the second term is maximized by setting $m_{\vec p_{1\ldots i}} = m_{\vec q_{i+1\ldots n}} = 0$ and $m_{\vec q_{1\ldots i}} = m_{\vec p_{i+1\ldots n}} = n/2$. Now let $C$ be the first term's exponent:
$C = z_{\vec p}+z_{\vec q}-z_{\vec\eta_e(\vec p,\vec q)}-z_{\vec a\,'} = \sum_{\vec x\in M} c_{\vec x}\left(\frac{(1+m_{\vec x,\vec p_{1\ldots i}}+m_{\vec x,\vec p_{i+1\ldots n}})+(1+m_{\vec x,\vec q_{1\ldots i}}+m_{\vec x,\vec q_{i+1\ldots n}})}{n+1} - \frac{(1+m_{\vec x,\vec p_{1\ldots i}}+m_{\vec x,\vec q_{i+1\ldots n}})+(1+m_{\vec x,\vec q_{1\ldots i}}+m_{\vec x,\vec p_{i+1\ldots n}})}{n+1}\right) = 0.\ \square$
Lemma 16. For all neighbors $\vec p$ and $\vec q\in\Omega$,
$\max\{\pi_\alpha(\vec p)/\pi_\alpha(\vec q),\ \pi_\alpha(\vec q)/\pi_\alpha(\vec p)\} \le 2\,\alpha^{|M|/(n+1)}.$
Proof. Since $\vec p$ and $\vec q$ are neighbors, they differ in only one position, so $|m_{\vec p}-m_{\vec q}| \le 1$. Then
$\frac{\pi_\alpha(\vec p)}{\pi_\alpha(\vec q)} = \left(\frac{p(\vec x)}{q(\vec x)}\right)\alpha^{z_{\vec p}-z_{\vec q}} = \left(\frac{1+m_{\vec p}}{1+m_{\vec q}}\right)\alpha^{z_{\vec p}-z_{\vec q}} \le \left(\frac{2+m_{\vec q}}{1+m_{\vec q}}\right)\alpha^{z_{\vec p}-z_{\vec q}} \le 2\,\alpha^{z_{\vec p}-z_{\vec q}}.$
We now consider $z_{\vec p}-z_{\vec q}$:
$z_{\vec p}-z_{\vec q} = \frac{1}{n+1}\sum_{\vec x\in M} c_{\vec x}\,(1+m_{\vec x,\vec p}-1-m_{\vec x,\vec q}) \le |M|/(n+1).$
Finally, we note that a symmetric argument can be made for $\pi_\alpha(\vec q)/\pi_\alpha(\vec p)$. $\square$
Corollary 17. When learning generalized DNF using Winnow and a linear $p(\vec x)$, a simulation of $M$ that starts at any node and is of length
$T_i = 4kn(n^2/4+n+1)\,\alpha_i^{|M|/(n+1)}\,\big(n\ln k + 2|M|\ln\alpha_i + \ln(1/\epsilon')\big)$
will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i}-\pi_{\alpha_i}\| \le \epsilon'$.

Proof. Lemmas 15 and 16 bound $g$ and $h$, which we substitute directly into Theorem 5. Also note that $W(\alpha_i) \le k^n\alpha_i^{|M|}$ and $w_{\alpha_i,\vec p} \ge \alpha_i^{-|M|}$, completing the proof. $\square$

Note that if $|M| = O(n\log K)$ (either if Winnow is in the early stages of learning or if the third term of Eq. (10) is $O(n\log K)$), then the chain's mixing time is polynomial in all relevant parameters if $k$ (the number of values each variable can take on) is a constant, yielding an FPRAS under the conditions of Corollary 9.
4.1.2. Perceptron

We now consider applying our technique to Rosenblatt's Perceptron [33] algorithm. The purpose of our analysis is primarily for contrast with the Winnow case, since$^{17}$ Khardon et al. [18] give a kernel function to efficiently and exactly compute the weighted sums when applying Perceptron to learning DNF. But they also give an exponential lower bound on the number of mistakes that kernel perceptron makes in learning DNF: $2^{\Omega(n)}$. Thus their results do not imply an efficient DNF-learning algorithm.

We refer the reader to Section 2.1 for an overview of the Perceptron algorithm. Recall from Section 3 that if the initial weight vector is the all-0s vector, then term $\vec p$'s weight is $w_{\vec p} = a\,z_{\vec p}$, where $z_{\vec p} = \sum_{\vec x\in M} c_{\vec x}\,p(\vec x)$, $M$ is the set of examples for which a prediction mistake is made, and $c_{\vec x}\in\{-1,+1\}$ is example $\vec x$'s label.

Since our technique is only capable of estimating positive functions, we cannot allow the Perceptron's weights to be negative. Thus to each weight in "standard" Perceptron we add a positive constant $c$, yielding a new weight for term $\vec p$ of $w_{\vec p} = c + a\,z_{\vec p}$, where $c = 3a|M| \ge 3a\max_{\vec p\in\Omega}\{|z_{\vec p}|\}$. The dot product of this new weight vector with the Perceptron inputs will then be compared to the new threshold $c$.

We use the same definitions and algorithm (given in Section 3) that were used for Winnow, except the "base" value of $a_i$ is now $a_0 = 0$ rather than $a_1 = 1$. So our weight estimate is the same as given in Eq. (1), but the product runs from $i = 1$ to $r$ rather than starting at $i = 2$, and we multiply it by $W(a_0)$. As with Winnow, this latter quantity is easily computed exactly: for binary $p(\vec x)$, $W(a_0) = c\,2^n$.
For logistic $p(\vec x)$, if $k_i = k$ for all $i$, we get
$W(0) = c\sum_{i=0}^n \frac{2^i\binom{n}{i}\,2\,(k-1)^{n-i}}{1+e^{-\sigma(i-n)}},$
and for the linear case, we get
$W(0) = c\sum_{i=0}^n \frac{2^i\binom{n}{i}\,(i+1)\,(k-1)^{n-i}}{n+1} = \frac{c\,(k+1)^n + 2cn\,(k+1)^{n-1}}{n+1}.$
Thus all three can be efficiently computed exactly.

[Footnote 17: Note, however, that other applications of Perceptron for which no kernels are available might be amenable to an MCMC-based approach to estimate the dot products.]
We now make the following argument about $f_i$ and $\phi_i$ for all three versions of $p(\vec x)$.

Lemma 18. When applying binary, logistic, or linear Perceptron to learn DNF, for all $i$, $2/3 \le f_i, \phi_i \le 2$.

Proof. We focus on the actual random variables, since the estimates (the "hat" variables) have the same range. Since these variables for all three versions have $p(\vec x)$ in the numerator and denominator, they all equal
$f_i(\vec p) = \frac{c + z_{\vec p}\,a_{i-1}}{c + z_{\vec p}\,a_i}.$
If $z_{\vec p} \ge 0$, then (since $a_{i-1} < a_i$) obviously $f_i(\vec p) \le 1$. Also, since $c \ge 2a\,z_{\vec p}$ and $a \ge a_i$,
$f_i(\vec p) \ge \frac{c + z_{\vec p}\,a_{i-1}}{c + c\,a_i/(2a)} \ge \frac{c + z_{\vec p}\,a_{i-1}}{3c/2} = \frac{2}{3}\left(1+\frac{z_{\vec p}\,a_{i-1}}{c}\right) \ge 2/3.$
If $z_{\vec p} < 0$, then obviously $f_i(\vec p) > 1$. Also,
$f_i(\vec p) \le \frac{c + z_{\vec p}\,a_{i-1}}{c/2} = 2 + \frac{2 z_{\vec p}\,a_{i-1}}{c} \le 2.$
Thus $f_i(\vec p)\in[2/3,2]$ for all $i$. $\square$
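Lemma 18's range can be confirmed numerically. The sketch below is our own (the increasing schedule $0 = a_0 < \cdots < a_r = a$ is an assumption for illustration); it draws random $z_{\vec p}$ with $|z_{\vec p}| \le |M|$, sets $c = 3a|M|$ as in the text, and checks every consecutive ratio:

```python
import random

random.seed(1)
for _ in range(500):
    a = random.uniform(0.5, 3.0)                 # final (largest) a_r
    r = random.randint(2, 20)
    sched = sorted(random.uniform(0, a) for _ in range(r - 1))
    sched = [0.0] + sched + [a]                  # a_0 = 0, ..., a_r = a
    M = random.randint(1, 30)                    # |M|, the mistake count
    z_p = random.uniform(-M, M)                  # |z_p| <= |M|
    c = 3 * a * M                                # the shifted-weight constant
    for i in range(1, len(sched)):
        f = (c + z_p * sched[i - 1]) / (c + z_p * sched[i])
        assert 2 / 3 <= f <= 2, (f, z_p, i)
```

In fact the observed ratios stay in the tighter interval $[3/4, 3/2]$ for $c = 3a|M|$; the lemma's $[2/3, 2]$ uses only the weaker assumption $c \ge 2a|z_{\vec p}|$.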
This leads to the following corollary.

Corollary 19. When applying Perceptron to learn generalized DNF (with $k_i = k$ for all $i$ for the logistic and linear cases), let the sample size be $S = \lceil 390\,r/\epsilon^2\rceil$ and let $M$ be simulated long enough for each sample such that the variation distance between the empirical distribution and $\pi_{a_i}$ is at most $\epsilon/(15r)$. Then for any $\delta > 0$, $\hat W(a)$ satisfies
$\Pr[(1-\epsilon)W(a) \le \hat W(a) \le (1+\epsilon)W(a)] \ge 1-\delta.$
To bound the number of prediction mistakes the Perceptron algorithm makes in learning DNF, we apply a result from Gentile and Warmuth [12].

Theorem 20 (Gentile and Warmuth [12]). Let $(\vec Y_j, c_j)\in[0,1]^N\times\{0,1\}$ for $j = 1,\ldots,t$, let $\vec\mu$ be an arbitrary weight vector of dimension $N$, and let $M$ be the set of examples on which the Perceptron algorithm makes a prediction mistake. Then the number of mistakes made by the Perceptron algorithm is
$|M| \le \left(\frac{\|\vec\mu\|_2\,r}{\gamma_{\vec\mu,M}}\right)^2,$
where $\|\cdot\|_2$ is the 2-norm, $r \ge \|\vec Y\|_2$ for all $\vec Y\in M$, and
$\gamma_{\vec\mu,M} = \frac{1}{|M|}\sum_{\vec Y_j\in M} c_j\,\vec\mu\cdot\vec Y_j$
is the average margin of $\vec\mu$.
4.1.2.1. Binary $p(\vec x)$. Applying Theorem 20 to the binary case is straightforward. We let $\vec\mu$ be 0 in all places except those corresponding to a relevant attribute, which are set to 1 (so there are $K$ 1s in $\vec\mu$). In addition, we add an $(N+1)$th position to $\vec\mu$, setting it to $-1/2$. This position will correspond to a 1 added to each example seen by the perceptron. Thus we get $\|\vec\mu\|_2 = \sqrt{K+1/4}$. Since all Perceptron inputs are from $\{0,1\}$ for the binary case, we get $\gamma_{\vec\mu,M} \ge 1/2$. Further, since exactly $2^n+1$ inputs are 1 for each example, $r = \sqrt{2^n+1}$. Applying Theorem 20 yields $|M| \le (4K+1)(2^n+1)$.

As with binary $p(\vec x)$ with Winnow, binary $p(\vec x)$ with Perceptron is difficult to analyze to provide non-trivial bounds on the mixing time. Thus we look at the logistic and linear cases.
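Theorem 20 applied to the binary case reduces to simple arithmetic; the following check (our own, with hypothetical function names) reproduces $|M| \le (4K+1)(2^n+1)$:

```python
import math

def perceptron_mistake_bound(mu_norm_sq, r, gamma):
    """Theorem 20: |M| <= (||mu||_2 * r / gamma)^2."""
    return mu_norm_sq * r ** 2 / gamma ** 2

def binary_case_bound(K, n):
    mu_norm_sq = K + 0.25          # K ones plus a -1/2 bias entry
    r = math.sqrt(2 ** n + 1)      # 2^n satisfied terms plus the bias input
    gamma = 0.5                    # average margin for binary inputs
    return perceptron_mistake_bound(mu_norm_sq, r, gamma)

for K, n in [(1, 3), (4, 6), (10, 10)]:
    assert abs(binary_case_bound(K, n) - (4 * K + 1) * (2 ** n + 1)) < 1e-6
```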
4.1.2.2. Logistic $p(\vec x)$. We begin by bounding the number of mistakes logistic Perceptron will make.

Lemma 21. When using Eq. (6) with $\sigma = \ln(60Kn\ln k)$ to specify the inputs, the number of prediction mistakes made by Perceptron when learning DNF is at most
$(20K+5)\left(2 + k/(3600\ln^2 k)\right)^n.$
Proof. We use the same $\vec\mu$ as we did for the binary case, namely 1s at the $K$ relevant positions, $-1/2$ matching the extra 1 added to each example, and 0s elsewhere. We now bound $\gamma_{\vec\mu,M}$. If $c_j = +1$ for some trial $j$, then at least one of the relevant terms $\vec p$ must send a 1 input to Perceptron. Thus $c_j\,\vec\mu\cdot\vec Y_j \ge 1 - 1/2 = 1/2$, where $\vec Y_j$ is the vector of inputs to Perceptron (i.e. the outputs of the $p(\cdot)$ functions). If $c_j = -1$, then in the worst case each relevant term will be almost completely satisfied by the input example $\vec x_j$, i.e. all but one variable in each relevant term will be satisfied. If this happens, then the total contribution to $\vec\mu\cdot\vec Y_j$ that comes from the $K$ relevant terms is $2K/(1+e^{\sigma})$. Adding this to the extra $-1/2$ and multiplying by $c_j = -1$ yields a worst-case bound of
$\gamma_{\vec\mu,M} \ge c_j\,\vec\mu\cdot\vec Y_j \ge \frac12 - \frac{2K}{1+e^{\sigma}} = \frac12 - \frac{2K}{1+60Kn\ln k} \ge \frac12 - \frac{1}{30\ln 2} \ge 0.45,$
since $n \ge 1$ and $k \ge 2$.

We now bound $r$. Recall that Eq. (8) sums the $p(\vec x)$ values for the entire set of terms. By substituting $(p(\vec x))^2$ for $p(\vec x)$ in this equation and taking the square root, we exactly get the 2-norm of any input to Perceptron:
$\|\vec Y\|_2 = \sqrt{\sum_{i=0}^n \frac{2^i\binom{n}{i}\,4\,(k-1)^{n-i}}{(1+e^{-\sigma(i-n)})^2}} \le \sqrt{\sum_{i=0}^n \frac{2^i\binom{n}{i}\,4\,k^{n-i}}{e^{-2\sigma(i-n)}}} = 2\sqrt{\sum_{i=0}^n 2^i\binom{n}{i}\,k^{n-i}\,(60Kn\ln k)^{2i-2n}}$
$= \frac{2k^{n/2}}{(60Kn\ln k)^n}\sqrt{\sum_{i=0}^n \big((7200/k)\,n^2K^2\ln^2 k\big)^i\binom{n}{i}} = \sqrt{4\left(\frac{(7200/k)\,n^2K^2\ln^2 k + 1}{(3600/k)\,n^2K^2\ln^2 k}\right)^n} < 2\left(2 + k/(3600\ln^2 k)\right)^{n/2},$
since $n, K \ge 1$. Thus setting $r = 2\,(2+k/(3600\ln^2 k))^{n/2}$ suffices. $\square$
We now bound the mixing time for the Markov chain.

Lemma 22. Let $\Omega = \{0,\ldots,k\}^n$. Then for all $\vec p,\vec q\in\Omega$,
$\pi_a(\vec p)\,\pi_a(\vec q) \le 16\,\pi_a(\vec a\,')\,\pi_a(\vec\eta_e(\vec p,\vec q)).$
Proof. As in the proof of Lemma 12, let $\vec q_{1\ldots i}$ denote $(q_1,\ldots,q_i)$, and similarly for $\vec p_{1\ldots i}$. Then $m_{\vec p} = m_{\vec p_{1\ldots i}}+m_{\vec p_{i+1\ldots n}}$, $m_{\vec q} = m_{\vec q_{1\ldots i}}+m_{\vec q_{i+1\ldots n}}$, $m_{\vec\eta_e(\vec p,\vec q)} = m_{\vec p_{1\ldots i}}+m_{\vec q_{i+1\ldots n}}$, and $m_{\vec a\,'} = m_{\vec q_{1\ldots i}}+m_{\vec p_{i+1\ldots n}}$. Further, all four of these values are in $\{0,\ldots,n\}$. This yields
$\frac{\pi_a(\vec p)\,\pi_a(\vec q)}{\pi_a(\vec a\,')\,\pi_a(\vec\eta_e(\vec p,\vec q))} = \frac{p(\vec x)\,q(\vec x)\,(c+a\,z_{\vec p})(c+a\,z_{\vec q})}{\vec\eta_e(\vec p,\vec q)(\vec x)\,\vec a\,'(\vec x)\,(c+a\,z_{\vec\eta_e(\vec p,\vec q)})(c+a\,z_{\vec a\,'})}$
$= \left(\frac{(c+a\,z_{\vec p})(c+a\,z_{\vec q})}{(c+a\,z_{\vec\eta_e(\vec p,\vec q)})(c+a\,z_{\vec a\,'})}\right)\left(\frac{1+U(\vec p_{1\ldots i},\vec q_{i+1\ldots n},n)+U(\vec q_{1\ldots i},\vec p_{i+1\ldots n},n)+U(\vec p_{1\ldots n},\vec q_{1\ldots n},2n)}{1+U(\vec p_{1\ldots i},\vec p_{i+1\ldots n},n)+U(\vec q_{1\ldots i},\vec q_{i+1\ldots n},n)+U(\vec p_{1\ldots n},\vec q_{1\ldots n},2n)}\right) < 4\left(\frac{(c+a\,z_{\vec p})(c+a\,z_{\vec q})}{(c+a\,z_{\vec\eta_e(\vec p,\vec q)})(c+a\,z_{\vec a\,'})}\right),$
where $U(\vec p_{i\ldots j},\vec q_{i'\ldots j'},\nu) = \exp(-\sigma(m_{\vec p_{i\ldots j}}+m_{\vec q_{i'\ldots j'}}-\nu))$. The last inequality comes directly from the proof of Lemma 12. We now bound the second term. Since $p(\vec x)\le 1$, each $a\,z$ summation is upper bounded by $a|M|$, so the numerator is at most $(c+a|M|)^2$. Meanwhile, the denominator is at least $(c-a|M|)^2$. Thus, since $c = 3a|M|$, the second term is at most 4. $\square$
Lemma 23. For all neighbors $\vec p$ and $\vec q\in\Omega$,
$\max\{\pi_a(\vec p)/\pi_a(\vec q),\ \pi_a(\vec q)/\pi_a(\vec p)\} \le 120Kn\ln k.$
Proof. Since $\vec p$ and $\vec q$ are neighbors, they differ in only one position, so $|m_{\vec p}-m_{\vec q}| \le 1$. Then
$\frac{\pi_a(\vec p)}{\pi_a(\vec q)} = \left(\frac{p(\vec x)}{q(\vec x)}\right)\left(\frac{c+a\,z_{\vec p}}{c+a\,z_{\vec q}}\right) = \left(\frac{1+e^{\sigma n-\sigma m_{\vec q}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\left(\frac{c+a\,z_{\vec p}}{c+a\,z_{\vec q}}\right) \le \left(\frac{1+e^{\sigma n-\sigma(m_{\vec p}-1)}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\left(\frac{c+a|M|}{c-a|M|}\right)$
$= \left(\frac{1+e^{\sigma}\,e^{\sigma n-\sigma m_{\vec p}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)\left(\frac{c+a|M|}{c-a|M|}\right) \le \left(\frac{1+e^{\sigma n-\sigma m_{\vec p}}}{1+e^{\sigma n-\sigma m_{\vec p}}}\right)2\,e^{\sigma} = 120Kn\ln k,$
since $\sigma = \ln(60Kn\ln k)$ and $(c+a|M|)/(c-a|M|) = 2$ for $c = 3a|M|$. Finally, we note that a symmetric argument can be made for $\pi_a(\vec q)/\pi_a(\vec p)$. $\square$
We now apply Theorem 5 to bound the mixing time of this Markov chain.
Corollary 24. When learning generalized DNF using Perceptron and a logistic $p(\vec x)$, a simulation of $\mathcal M$ that starts at any node and is of length

$$T_i = 3840\,kKn^3\ln k\,\left(n\ln k + \ln(4\alpha_i|M|) + \ln(1/\epsilon')\right)$$

will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i} - \pi_{\alpha_i}\| \le \epsilon'$.

Proof. Lemmas 22 and 23 bound $g$ and $h$, which we substitute directly into Theorem 5. Also, note that $W(\alpha_i) \le k^n(c + \alpha_i|M|)$ and $w_{\alpha_i,\vec p} \ge 1$, completing the proof. □
4.1.2.3. Linear $p(\vec x)$. We begin by bounding the number of mistakes Perceptron will make. However, as in the linear Winnow case, worst-case (adversary) bounds are not possible, since the average margin could be forced to be negative. Thus we assume that the examples are such that most of them are linearly separable and have a positive average margin $\gamma_{\vec m,M}$.

Lemma 25. When using Eq. (7) to specify the inputs, and if the average margin $\gamma_{\vec m,M}$ of the sequence of examples is positive, then the number of prediction mistakes made by Perceptron when learning DNF is at most

$$\frac{5K\left((k+1)^n + 6n(k+1)^{n-1} + 4n(n-1)(k+1)^{n-2}\right)}{4\gamma^2_{\vec m,M}(n+1)^2}.$$
Proof. We use the same $\vec m$ as we did for the binary and logistic cases, and hence get the same 2-norm for this vector as before. To bound $r$, recall that Eq. (9) sums the $p(\vec x)$ values for the entire set of terms. By substituting $(p(\vec x))^2$ for $p(\vec x)$ in this equation and taking the square root, we get exactly the 2-norm of any input to Perceptron:

$$\|\vec Y\|_2 = \sqrt{\sum_{i=0}^{n} 2^i \binom{n}{i} (k-1)^{n-i} \left(\frac{i+1}{n+1}\right)^2} = \frac{\sqrt{(k+1)^n + 6n(k+1)^{n-1} + 4n(n-1)(k+1)^{n-2}}}{n+1},$$

which completes the proof. □
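The closed form above follows from the binomial identities $\sum_i i\binom{n}{i}2^i(k-1)^{n-i} = 2n(k+1)^{n-1}$ and $\sum_i i(i-1)\binom{n}{i}2^i(k-1)^{n-i} = 4n(n-1)(k+1)^{n-2}$. A quick numerical check of the identity for small $n$ and $k$:

```python
from math import comb, sqrt, isclose

def norm_direct(n, k):
    # Direct evaluation of the sum under the square root.
    return sqrt(sum(2**i * comb(n, i) * (k - 1)**(n - i) * ((i + 1) / (n + 1))**2
                    for i in range(n + 1)))

def norm_closed(n, k):
    # Closed form from the proof of Lemma 25.
    return sqrt((k + 1)**n + 6*n*(k + 1)**(n - 1)
                + 4*n*(n - 1)*(k + 1)**(n - 2)) / (n + 1)

for n in range(2, 8):
    for k in range(2, 6):
        assert isclose(norm_direct(n, k), norm_closed(n, k))
```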
Now we bound the mixing time.
Lemma 26. Let $\Omega = \{0,\ldots,k\}^n$. Then for all $\vec p, \vec q \in \Omega$,

$$\pi_\alpha(\vec p)\,\pi_\alpha(\vec q) \le \frac{4(1 + n/2)^2}{n+1}\,\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q)).$$

Proof. Using the notation introduced in the first paragraph of Lemma 22's proof, we get

$$\frac{\pi_\alpha(\vec p)\,\pi_\alpha(\vec q)}{\pi_\alpha(\vec a\,')\,\pi_\alpha(\vec\eta_e(\vec p,\vec q))} = \left(\frac{\vec p(\vec x)\,\vec q(\vec x)}{\vec\eta_e(\vec p,\vec q)(\vec x)\,\vec a\,'(\vec x)}\right)\left(\frac{(c+\alpha z_{\vec p})(c+\alpha z_{\vec q})}{(c+\alpha z_{\vec\eta_e(\vec p,\vec q)})(c+\alpha z_{\vec a\,'})}\right) \le \left(\frac{(1+n/2)^2}{n+1}\right)\cdot 4,$$

using results from the proofs of Lemmas 15 and 22. □
Lemma 27. For all neighbors $\vec p$ and $\vec q \in \Omega$,

$$\max\{\pi_\alpha(\vec p)/\pi_\alpha(\vec q),\; \pi_\alpha(\vec q)/\pi_\alpha(\vec p)\} \le 4.$$

Proof. Since $\vec p$ and $\vec q$ are neighbors, they differ in only one position, so $|m_{\vec p} - m_{\vec q}| \le 1$. Then

$$\frac{\pi_\alpha(\vec p)}{\pi_\alpha(\vec q)} = \frac{\vec p(\vec x)}{\vec q(\vec x)}\cdot\frac{c+\alpha z_{\vec p}}{c+\alpha z_{\vec q}} = \frac{1 + m_{\vec p}}{1 + m_{\vec q}}\cdot\frac{c+\alpha z_{\vec p}}{c+\alpha z_{\vec q}} \le \frac{2 + m_{\vec q}}{1 + m_{\vec q}}\cdot 2 \le 4.$$

Finally, we note that a symmetric argument can be made for $\pi_\alpha(\vec q)/\pi_\alpha(\vec p)$. □
Corollary 28. When learning generalized DNF using Perceptron and a linear $p(\vec x)$, a simulation of $\mathcal M$ that starts at any node and is of length

$$T_i = \left(\frac{32kn^2(1+n/2)^2}{n+1}\right)\left(n\ln k + \ln(4\alpha_i|M|) + \ln(1/\epsilon')\right)$$

will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i} - \pi_{\alpha_i}\| \le \epsilon'$.

Proof. Lemmas 26 and 27 bound $g$ and $h$, which we substitute directly into Theorem 5. Also, note that $W(\alpha_i) \le k^n(c + \alpha_i|M|)$ and $w_{\alpha_i,\vec p} \ge 1$, completing the proof. □
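The mixing-time corollaries above all feed into the same estimator: Eq. (1) writes $W(\alpha)$ as $W(1)$ times a telescoping product of ratios $W(\alpha_i)/W(\alpha_{i-1})$, each of which is the expectation of $(\alpha_i/\alpha_{i-1})^{z_{\vec p}}$ under the stationary distribution $\pi_{\alpha_{i-1}}$. The sketch below illustrates this on a toy weight function; the state space, update counts $z_{\vec p}$, and geometric cooling schedule are all assumptions for illustration, and exact sampling stands in for the MCMC simulation:

```python
import random

random.seed(0)
n = 8
states = [tuple((j >> b) & 1 for b in range(n)) for j in range(2**n)]
z = {p: sum(p) for p in states}          # assumed update counts z_p

def W(a):
    # Total weight at parameter a, with w_p = a**z_p.
    return sum(a**z[p] for p in states)

# Cooling schedule 1 = a_0 < a_1 < ... < a_r = alpha, as in Eq. (1).
alpha, r = 1.5, 16
schedule = [alpha**(i / r) for i in range(r + 1)]

est = W(1.0)
for a_prev, a_next in zip(schedule, schedule[1:]):
    # W(a_next)/W(a_prev) = E_{pi_{a_prev}}[(a_next/a_prev)**z_p];
    # here we sample pi_{a_prev} exactly (MCMC would only approximate it).
    weights = [a_prev**z[p] for p in states]
    S = 4000
    samples = random.choices(states, weights=weights, k=S)
    est *= sum((a_next / a_prev)**z[p] for p in samples) / S

# The product of ratio estimates should land close to the true W(alpha).
print(est, W(alpha))
```

Because each per-step ratio $\alpha_i/\alpha_{i-1}$ is close to 1, the random variables being averaged have low variance, which is why a modest sample size per ratio suffices.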
4.2. Pruning ensembles of classifiers
We now apply our methods to pruning an ensemble produced by, e.g., AdaBoost [35]. AdaBoost's output is a set of functions $h_i : X \to \mathbb R$, where $i \in \{1,\ldots,n\}$ and $X$ is the instance space. Each $h_i$ is trained on a different distribution over the training examples and is associated with a parameter $\beta_i \in \mathbb R$ that weights its predictions. Given an instance $\vec x \in X$, the ensemble's prediction is $H(\vec x) = \mathrm{sign}\left(\sum_{i=1}^n \beta_i h_i(\vec x)\right)$. Thus $\mathrm{sign}(h_i(\vec x))$ is $h_i$'s prediction on $\vec x$, $|h_i(\vec x)|$ is its confidence in its prediction, and $\beta_i$ weights AdaBoost's confidence in $h_i$. It has been shown that if each $h_i$ has error less than $1/2$ on its distribution, then the error on the training set and the generalization error of $H(\cdot)$ can be bounded. Strong bounds on $H(\cdot)$'s generalization error can be shown even if the boosting algorithm is run past the point where $H(\cdot)$'s training error is zero [34]. However, overfitting can still occur [26]; i.e., sometimes better generalization can be achieved if some of the $h_i$'s are discarded. So our goal is to find a weighted combination of all possible prunings that performs not much worse, in terms of generalization error, than the best single pruning.

To predict nearly as well as the best pruning, we place every possible pruning in a pool (so $N = 2^n$) and run WM. We start by computing $W^+$ and $W^-$, which are, respectively, the sums of the weights of the experts predicting a positive and a negative label on example $\vec x$. Then WM predicts $+1$ if $W^+ > W^-$ and $-1$ otherwise. Whenever WM makes a prediction mistake, it reduces the weights of all experts that predicted incorrectly by dividing them by $\alpha$ (see Section 2.1).

As in Section 4.1, using a binary $p(\vec x)$ makes bounding the mixing time difficult, except in a trivial sense. Thus we use a linear $p(\vec x)$, which allows us to also incorporate each pruning's confidence in its prediction, and to use that confidence when updating the weights. Given an example $\vec x \in X$, we compute $h_i(\vec x)$ for all $i \in \{1,\ldots,n\}$. We then use our MCMC procedure to compute $\hat W^+$, an estimate of $W^+ = \sum_{\vec p \in \Omega^+} \vec p(\vec x)\,w_{\vec p}$, where $\vec p(\vec x) = \sum_{i=1}^n p_i\beta_i h_i(\vec x)$, $\Omega^+ = \{\vec p \in \{0,1\}^n : \sum_{i=1}^n p_i\beta_i h_i(\vec x) \ge 0\}$, $w_{\vec p} = \alpha^{z_{\vec p}}$, $z_{\vec p} = \sum_{\vec x \in M} c_{\vec x}\,\vec p(\vec x)$, and $M$ is the set of examples for which a prediction mistake was made. A similar procedure is used to compute $\hat W^-$. Then WM predicts $+1$ if $\hat W^+ > \hat W^-$ and $-1$ otherwise.
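For intuition, the brute-force version of this procedure (what our MCMC method approximates) can be sketched directly. All names here are illustrative: the base classifiers, their $\beta_i$ weights, and the simple mistake-driven division by $\alpha$ are assumptions standing in for the paper's confidence-weighted update, and the linear confidence $\vec p(\vec x) = \sum_i p_i\beta_i h_i(\vec x)$ follows the text above:

```python
from itertools import product

def wm_predict_and_update(prunings_w, betas, hs, x, label, alpha=2.0):
    """One brute-force WM trial over all prunings p in {0,1}^n.

    prunings_w: dict mapping each pruning (a 0/1 tuple) to its current weight.
    hs: list of base-classifier functions h_i; betas: their ensemble weights.
    """
    def conf(p):
        # Linear confidence p(x) = sum_i p_i * beta_i * h_i(x).
        return sum(pi * b * h(x) for pi, b, h in zip(p, betas, hs))

    # W+ / W- sum the (magnitude-weighted) weights of positively /
    # negatively predicting prunings.
    w_plus = sum(conf(p) * prunings_w[p] for p in prunings_w if conf(p) >= 0)
    w_minus = sum(-conf(p) * prunings_w[p] for p in prunings_w if conf(p) < 0)
    pred = 1 if w_plus > w_minus else -1
    if pred != label:
        # Demote every pruning whose own prediction was incorrect.
        for p in prunings_w:
            if (1 if conf(p) >= 0 else -1) != label:
                prunings_w[p] /= alpha
    return pred

# Tiny example: n = 3 stumps on scalar inputs (assumed toy classifiers).
hs = [lambda x: 1.0 if x > 0 else -1.0,
      lambda x: 1.0,
      lambda x: -1.0]
betas = [1.0, 0.5, 0.25]
weights = {p: 1.0 for p in product((0, 1), repeat=3)}
pred = wm_predict_and_update(weights, betas, hs, x=0.5, label=-1)
```

Enumerating all $2^n$ prunings is exactly the cost the MCMC estimates of $\hat W^+$ and $\hat W^-$ are designed to avoid.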
Define the Markov chain $\mathcal M$ with state space $\Omega^+$ (similarly, $\Omega^-$) that makes transitions according to the description in Section 3. The chain corresponds to a random walk on the Boolean hypercube truncated by a hyperplane. It is easy to show that all pairs of states in $\Omega^+$ (similarly, $\Omega^-$) can communicate. To move from node $\vec p \in \Omega^+$ to $\vec q \in \Omega^+$, first add to $\vec p$ all bits $i$ in $\vec q$ and not in $\vec p$ that correspond to positions where $\beta_i h_i(\vec x) \ge 0$. Then delete from $\vec p$ all bits $i$ in $\vec p$ and not in $\vec q$ that correspond to positions where $\beta_i h_i(\vec x) < 0$. Then delete the unnecessary "positive bits" and add the necessary "negative bits." It is easy to see that all states between $\vec p$ and $\vec q$ are in $\Omega^+$. Thus $\mathcal M$ is irreducible and hence ergodic by Lemma 1.

As before, we let $B$ be an upper bound$^{18}$ on the sum of all updates made on any pruning. Then it is straightforward to adapt Lemma 8's proof to bound $f \in [1/e, e]$ for WM when applying Section 3's procedure to estimate $W(\alpha)$. But when applying Eq. (1), we must determine $W(\alpha_1) = W(1) = |\Omega^+|$. This is equivalent to counting the number of solutions to a 0-1 knapsack problem, which is #P-complete. Thus, in order to complete our computation of $\hat W$, we must also estimate $|\Omega^+|$. We do this by mapping the problem to the knapsack problem, which is summarized in Section 2.4 and shown to have an FPRAS by Morris and Sinclair [28]. If we let the "weight" of item $i$ in our problem be $w_i = \beta_i h_i(\vec x)$ for an example $\vec x$, then the only difference between the two problems is that the weights in the $|\Omega^+|$ estimation problem may be negative. We now argue that the problems are still equivalent, and thus that we can directly apply the results of Morris and Sinclair. Given a vector $\vec p \in \{0,1\}^n$ and a weight vector $\vec w$, let $p'_i = 1 - p_i$ if $w_i < 0$ and $p'_i = p_i$ otherwise. Also, let $w'_i = |w_i|$ and $b' = \sum_{w_i < 0} |w_i|$. It is easy to argue that $\sum_{i=1}^n w_i p_i \ge 0$ (which is the definition$^{19}$ of $\Omega^+$) if and only if $\sum_{i=1}^n w'_i p'_i \ge b'$ (which is an instance of the knapsack problem). If we let $s^+ = \sum_{w_i>0,\,p_i=1} w_i$, $s^- = \sum_{w_i<0,\,p_i=1} w_i$, and $s^0 = \sum_{w_i<0,\,p_i=0} w_i$, then $b' = -s^0 - s^-$, which is exactly what is added to both sides of the inequality $\sum_{i=1}^n w_i p_i \ge 0$ to get $\sum_{i=1}^n w'_i p'_i \ge b'$. Thus we can efficiently estimate $|\Omega^+|$ to within a factor of $\epsilon$, allowing us to apply Corollary 3.
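The signed-weight reduction above can be checked exhaustively for small $n$: flipping the bits at negative-weight positions, taking absolute weights, and thresholding at $b'$ preserves membership exactly. The particular weight vector below is an arbitrary assumption for illustration:

```python
from itertools import product

def in_omega_plus(p, w):
    # Original membership test, with possibly negative weights.
    return sum(wi * pi for wi, pi in zip(w, p)) >= 0

def in_knapsack(p, w):
    # Transformed instance: flip bits at negative weights, take absolute
    # weights, and threshold at b' = sum of |w_i| over the negative w_i.
    p_prime = [1 - pi if wi < 0 else pi for wi, pi in zip(w, p)]
    w_prime = [abs(wi) for wi in w]
    b_prime = sum(abs(wi) for wi in w if wi < 0)
    return sum(wi * pi for wi, pi in zip(w_prime, p_prime)) >= b_prime

w = [2.5, -1.0, 0.75, -3.0, 1.5]
for p in product((0, 1), repeat=len(w)):
    assert in_omega_plus(p, w) == in_knapsack(p, w)
```

In particular, the two membership tests agree on every vertex, so the count of knapsack solutions equals $|\Omega^+|$.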
Corollary 29. When applying WM to learn a weighted combination of ensemble prunings, let the sample size $S = \lceil 130\,re^2/\epsilon^2 \rceil$, let $|\Omega^+|$ be estimated to within $\epsilon/2$ of its true value with probability $\ge 3/4$ via the procedure outlined above, and let $\mathcal M$ be simulated long enough for each sample that the variation distance between the empirical distribution and $\pi_{\alpha_i}$ is at most $\epsilon/(10e^2 r)$. Then for any $\delta > 0$, $\hat W^+(\alpha)$ satisfies

$$\Pr\left[(1-\epsilon)W^+(\alpha) \le \hat W^+(\alpha) \le (1+\epsilon)W^+(\alpha)\right] \ge 1 - \delta,$$

and the same result applies to $\hat W^-(\alpha)$ if $|\Omega^-|$ is well approximated.
Note that if $W^+/W^- \notin \left[\frac{1-\epsilon}{1+\epsilon}, \frac{1+\epsilon}{1-\epsilon}\right]$ for all trials, then our estimates of $W^+$ and $W^-$ are (with probability at least $1 - \delta'_t$ for trial $t$) sufficiently accurate to correctly determine whether or not $W^+ > W^-$. Setting $\delta'_t = \delta/2^t$ yields a total probability of failure of at most$^{20}$ $\sum_{t=1}^{N} \delta/2^t < \delta$. Thus
18. In contrast to Section 4.1, where $B$ is by definition upper bounded by Winnow's or Perceptron's mistake bound, for this application $B$ could be arbitrarily large, since it depends on the predictions of arbitrary hypotheses. Thus for the rest of this section we implicitly assume that $B$ is polynomial in all relevant parameters, i.e., that it is expressed in unary.
19. By negating all $w_i$, we can use the same arguments to estimate $|\Omega^-|$.
20. Recall from the proof of Theorem 2 that only $O(\log(1/\delta'))$ runs of the estimation procedure are needed to reduce the probability of failure to $\delta'$.
under these conditions, our version of WM runs identically to the brute-force version, and we can apply WM's mistake bounds. This yields the following corollary.
Corollary 30. Using the assumptions of Corollary 29, if $W^+/W^- \notin \left[\frac{1-\epsilon}{1+\epsilon}, \frac{1+\epsilon}{1-\epsilon}\right]$ for all trials $t$, then with probability at least $1 - \delta$, the number of prediction mistakes made by this algorithm on any sequence of examples is $O(n + m)$, where $n$ is the number of hypotheses in the ensemble and $m$ is the number of mistakes made by the best pruning.
Now we bound the mixing time. Bounding $g$ is easy: when viewed as multisets, $\vec p \cup \vec q = \vec a\,' \cup \vec\eta_e(\vec p,\vec q)$, which implies $z_{\vec p} + z_{\vec q} = z_{\vec a\,'} + z_{\vec\eta_e(\vec p,\vec q)}$ and $\pi^+_\alpha(\vec p)\,\pi^+_\alpha(\vec q) = \pi^+_\alpha(\vec a\,')\,\pi^+_\alpha(\vec\eta_e(\vec p,\vec q))$. Thus $g = 1$. Further, if $\vec p$ and $\vec q$ are neighbors that differ in bit $i$, then

$$z_{\vec p} - z_{\vec q} = \sum_{\vec x \in M} \left(\sum_{j : p_j = 1} c_{\vec x}\beta_j h_j(\vec x) - \sum_{j : q_j = 1} c_{\vec x}\beta_j h_j(\vec x)\right) \le \beta_i \sum_{\vec x \in M} c_{\vec x}\,h_i(\vec x),$$

which implies that for any neighbors $\vec p$ and $\vec q$, $\max\{\pi_\alpha(\vec p)/\pi_\alpha(\vec q),\; \pi_\alpha(\vec q)/\pi_\alpha(\vec p)\} \le \alpha^{B_{\max}}$, where $B_{\max} = \max_j\left\{\beta_j \sum_{\vec x \in M} c_{\vec x}\,h_j(\vec x)\right\}$.
Corollary 31. When learning a weighted combination of ensemble prunings with Weighted Majority, if $\Omega = \{0,1\}^n$, then a simulation of $\mathcal M$ that starts at any node and is of length

$$T_i = 2n^2\alpha_i^{B_{\max}}\left(n\ln 2 + n(B_{\max} - B_{\min})\ln\alpha_i + \ln(1/\epsilon')\right)$$

will draw samples from $\hat\pi_{\alpha_i}$ such that $\|\hat\pi_{\alpha_i} - \pi_{\alpha_i}\| \le \epsilon'$.

Proof. We substitute our bounds on $g$ and $h$ directly into Theorem 5 and note that $W(\alpha_i) \le 2^n\alpha_i^{nB_{\max}}$ and $w_{\alpha_i,\vec p} \ge \alpha_i^{nB_{\min}}$ (where $B_{\min}$ is the analogue of $B_{\max}$ with min in place of max), completing the proof. □
The above mixing time bound is only polynomial if $B_{\max}$ is logarithmic in all relevant parameters, which is unlikely. However, our analysis is handicapped by worst-case assumptions, as with Corollary 14. While it is unlikely that an efficient bound on the mixing time exists for general ensembles with arbitrary classifiers and arbitrary distributions over examples, it is open whether restricted cases could have better bounds. Further, in Section 5 we show that in practice our algorithm performs much better than the worst-case theoretical results imply, especially considering that highly accurate estimates of the weighted sums are not needed so long as we know whether or not $W^+ > W^-$.

Note that in Corollary 31 we assume that $\Omega = \{0,1\}^n$, i.e., that all prunings classify $\vec x$ as positive (or negative), and the chain is an untruncated hypercube. This is because without such an assumption we cannot guarantee that a canonical path between two nodes never leaves $\Omega$. A new approach employing balanced, almost-uniform permutations has recently been pioneered by Morris and Sinclair [28] and applied to the truncated Boolean hypercube problem when the chain's stationary distribution is uniform (i.e., for counting the number of solutions to the knapsack problem). It is plausible that their technique could be generalized to the case of a non-uniform distribution, which would allow us to consider truncated hypercubes for WM and for other algorithms.
4.3. Discussion
In examining the results and proofs related to using Winnow on DNF, we see some interesting differences. Recall from Theorem 5 that there are two functions that must be bounded in order to bound a chain's mixing time: $g$, which bounds the ratio of $\pi(\vec p)\pi(\vec q)$ to $\pi(\vec a\,')\pi(\vec\eta_e(\vec p,\vec q))$, and $h$, which bounds the ratio of weights of neighboring nodes in the chain. Linear $p(\vec x)$ allows us to bound $g$ with a polynomial, since the $z$'s in the exponent of $\alpha$ all cancel out due to the linear nature of the weight updates. However, logistic $p(\vec x)$ (and probably binary $p(\vec x)$) in the worst case gets charged a multiplicative factor of $\alpha$ for each update, yielding an upper bound on $g$ that is exponential in $|M|$. In contrast, when bounding $h$ in an adversarial setting, both linear and logistic $p(\vec x)$ allow the difference in the $z$'s to grow with $|M|$. Further, in the worst case we cannot bound $|M|$ for linear $p(\vec x)$, but we can for logistic and binary.

A more fundamental difference arises when comparing the multiplicative weight-update algorithms (Winnow and WM) to the additive weight-update algorithm (Perceptron). The additive weight updates prevent $\mathcal M$'s stationary distribution from deviating very far from uniform (compared to the MWU algorithms). Thus for Perceptron, $\mathcal M$ mixes rapidly, even if $|M|$ is exponentially large. However, from a learning-theoretic standpoint, Perceptron is not a good choice for learning DNF, since Lemma 21's bound on the number of mistakes (updates) Perceptron will make in the worst case is exponential, which is corroborated by Khardon et al.'s lower bound [18].
predictions. This comes for free if the boosting algorithm used is AdaBoost.MH from Schapireand Singer [35], which reduces the multiclass boosting problem to a set of binary ones, allowingour results to fit into this framework. Alternatively, AdaBoost.M1 from Freund and Schapire [11]more directly addresses the multiclass problem by having each hypothesis predict its confidencethat ~xx belongs to class j: Then the ensemble’s prediction is the class that maximizes theseconfidence-rated predictions. That is, each class is tested individually and the one that scores thehighest is the predicted class for ~xx: To adapt our framework to this, rather than simply estimating
Wþ and W; we estimate W j for each class j and then predict the class with the maximum. Since
each W j estimate uses a separate Boolean hypercube truncated by a single hyperplane (the onethat separates prunings that predict class j from those predicting another class), we can bound themixing time using the same machinery developed in Section 4.2, assuming a version of Theorem 5exists for truncated cubes.Since one of the goals of pruning an ensemble of classifiers is to reduce its size, one may adopt
one of several heuristics, such as choosing the pruning that has highest weight in WM, the highestratio of weight to size, or the highest product of weight and diversity, where diversity is measuredby e.g. Kullback–Leibler divergence (see [26]). Let f ð~ppÞ be the function that one wants to
maximize. Then the goal is to find the ~ppAf0; 1gn that approximately maximizes f : To do this one
ARTICLE IN PRESS
D. Chawla et al. / Journal of Computer and System Sciences 69 (2004) 196–234 225
can define a newMarkov chainM0 whose transition probabilities are the same as forM in Section3 except that Step 3 is irrelevant (since there is no training example ~xx) and in Step 4, change the
transition probability to minf1; rf ð~pp 00Þf ð~ppÞg: The parameter r governs the shape of the stationarydistribution: r ¼ 1 implies a uniform distribution over all prunings, while a large value of r yieldsa distribution that peaks at prunings with large f ð~ppÞ: (This is a special case of simulated annealing
[19] where the temperature is held constant.) Lemma 1 obviously holds for M0; but it is an openproblem to bound how far from optimal its solution will be. Of course, other combinatorialoptimization methods such as genetic algorithms can also be applied here.Similarly, one issue with our DNF algorithm is that after training, we still require the training
examples and running M to evaluate the hypothesis on a new example. In lieu of this, one can,after training, search (using a modified chain or a GA as described above) for the terms with thelargest weights in Winnow. The result is a set of rules, and the prediction on a new example can bea thresholded sum of weights of satisfied rules, using the same threshold y: The only issue then isto determine how many terms to select. Since each example satisfies exactly 2n terms, for anexample to be classified as positive, the average weight of its satisfied terms must be at least y=2n:Thus one heuristic is to choose as many terms as possible with weight at least y=2n; stopping whenwe find ‘‘too many’’ (as specified by a parameter) terms with weight less than y=2n: Using thispruned set of rules, no additional false positives will occur, and in fact the number might bereduced. The only concern is causing extra false negatives.
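The constant-temperature chain $\mathcal M'$ described above can be sketched as a simple Metropolis-style search. The objective $f$ below is an assumed toy function (not one of the paper's heuristics), and the parameter values are illustrative:

```python
import random

def search_pruning(f, n, rho=4.0, steps=20000, seed=1):
    """Search over prunings p in {0,1}^n: propose a single bit flip and
    accept with probability min(1, rho**(f(p_new) - f(p))), tracking the
    best pruning seen (rho = 1 would give a uniform random walk)."""
    rng = random.Random(seed)
    p = [0] * n
    best, best_val = list(p), f(p)
    for _ in range(steps):
        i = rng.randrange(n)
        p_new = list(p)
        p_new[i] ^= 1
        if rng.random() < min(1.0, rho ** (f(p_new) - f(p))):
            p = p_new
            if f(p) > best_val:
                best, best_val = list(p), f(p)
    return best, best_val

# Assumed toy objective: reward prunings that keep the first three
# classifiers and drop the rest (unique maximum 8 at p = 11100000).
def f(p):
    return sum(p[:3]) + sum(1 - b for b in p[3:])

best, val = search_pruning(f, n=8)
```

On this separable toy objective the chain finds the optimum easily; for the paper's heuristics the quality of the solution is, as noted above, an open problem.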
5. Empirical results
The primary purpose of our experiments is to assess how well our algorithms work in practice, especially when compared to our worst-case bounds. A more thorough empirical analysis (as well as some heuristic optimizations) of our technique is given by Tao and Scott [41].

The goal of our algorithms is to use the weighted-sum approximations to accurately simulate Winnow and WM, since an accurate simulation is required for us to apply Winnow's and WM's error bounds.$^{21}$ We measure accuracy of simulation in several ways: (1) comparing weighted-sum estimates to their true values (computed by brute-force implementations); (2) counting the number of times our algorithm predicts differently from brute force (e.g., for Winnow, the fraction of weighted-sum estimates that are on the opposite side of the threshold from the true weighted sum); and (3) measuring prediction error. The simulated data we used in our experiments should make learning straightforward for brute force, so low prediction error should correlate (to some extent) with simulation accuracy. This performance measure is especially useful on problems that are too large for brute force to handle.
5.1. Learning DNF formulas
In our DNF experiments, we defined the instance space to be $X = \{1,2\}^n$ and the set of terms to be $P = \{0,1,2\}^n$; i.e., $k_i = 2$ for all $i$. Recall that a term $\vec p = (p_1,\ldots,p_n) \in P$ is satisfied by example $\vec x = (x_1,\ldots,x_n) \in X$ if and only if $p_i = x_i$ for all $p_i > 0$. So $p_i = 0$ implies that $x_i$ is irrelevant for term $\vec p$, and $p_i > 0$ implies that $x_i$ must equal $p_i$ for $\vec p$ to be satisfied. Even though we do not have an analysis of its mixing time, we used binary $p(\vec x)$ in our experiments.$^{22}$ Even though with binary $p(\vec x)$ we could define for each new example $\Omega = \{0,1\}^n$ (i.e., an untruncated Boolean hypercube; see Section 4.1.1), we chose to use as $\Omega$ a version of $P$ truncated with a single hyperplane. We did this for two reasons. First, doing so allows us to evaluate the performance of our MCMC approach when the hypercube is truncated, which is more generally applicable and also currently lacking in theoretical results. Second, this experimental approach gives us a better idea of how quickly the time complexity of a brute-force implementation grows as a function of $n$. Comparing this with the time of our MCMC experiments tells us the minimum value of $n$ for which our approach is faster.

We generated random (from a uniform distribution) $K = 5$-term DNF formulas, using $n \in \{10, 15, 20\}$. So the total number of Winnow inputs was $3^{10} = 59049$, $3^{15} \approx 1.43 \times 10^7$, or $3^{20} \approx 3.49 \times 10^9$. For each value of $n$ there were nine training/testing set combinations, each with 50 training examples and 50 testing examples. Examples were generated uniformly at random.

21. Conceivably, in some cases our algorithm may accidentally make fewer prediction mistakes when deviating from the brute-force implementations, but in the absence of a formal theory to characterize this, it is safer to assume that such behavior is anomalous. Thus our primary goal is to measure how well we simulate brute force.

Table 1 gives averaged$^{23}$ results for $n = 10$, indexed by $S$ and $T$ ("BF" means brute force).
"GUESS" is the average relative error of the estimates ($= |\text{guess} - \text{actual}|/\text{actual}$). "LOW" is the fraction of guesses that were $< \theta$ when the actual value was $> \theta$, and "HIGH" is symmetric. These are the only times our algorithm deviates from brute force. "PM" is the number of prediction mistakes made by Winnow on the training set while learning (in all experiments, Winnow repeatedly made passes over the training set until it correctly classified all training examples). This gives an evaluation of each algorithm in an on-line setting. After training was complete, we also evaluated the hypotheses on their respective test sets, i.e., in a batch learning setting. "GE" is the generalization error on the test set. Finally, "Stheo" and "Ttheo" are the values of $S$ and $T$ from Corollaries 9 and 14 that guarantee an error of GUESS given the values of $r$ in our simulations, using $\alpha = 3/2$. These last two columns show how pessimistic the worst-case bounds are in contrast to what works in practice. In general, the results varied little across the different data sets: the standard deviations of the values were typically small compared to the means.

Both GUESS and HIGH are very sensitive to $T$ but not as sensitive to $S$. LOW was negligible due to the distribution of weights as training progressed: the term $\vec p = \vec 0$ (satisfied by all examples) had high weight. Since all computations started at $\vec 0$ and the Markov chain $\mathcal M$ seeks out nodes with high weights, the estimates tended to be too high rather than too low. But this is less significant as $S$ and $T$ increase. For $S = 100$ and $T = 300$, training and testing with $\mathcal M$ was slower than brute force by a factor of over 108. The average value of $r$ used was 20.79 (range 19-26).

Since the run time of our algorithm varies linearly with $r$, we ran some experiments where we fixed $r$ rather than letting it be set as in Section 4.1. We set $S = 100$, $T = 300$, and $r \in \{5, 10, 15, 20\}$. The results are in Table 2. They indicate that for the given parameter values, $r$ can be reduced below that which is stipulated in Section 3.
22. When we compare actual mixing times to theoretical bounds, we compare to the theoretical bounds for logistic $p(\vec x)$, which is a reasonable approximation to the binary case.
23. The number of weight estimations made per row in the table varied due to a varying number of training rounds, but was typically around 3000.
Results for $n = 15$ appear in Table 3. The trends for $n = 15$ are similar to those for $n = 10$. Brute force is faster than $\mathcal M$ at $S = 500$ and $T = 1500$, but only by a factor of 16. The average value of $r$ used was 31.52 (range 19-40). As with $n = 10$, $r$ can be reduced to speed up the algorithm, but at a cost of increasing the errors of the predictions (e.g., see Table 4). We ran the same experiments with a training set of size 100 rather than 50 (the test set was still of size 50), summarized in Table 5. As expected, the error of the guesses changes little, but GE is decreased.

For $n = 20$, no exact (brute-force) sums were computed, since there are over 3 billion inputs, so we only examined the prediction error of our algorithm. With $S = 1000$, $T = 2000$, $r$ set as in
Table 1
Results for n = 10 and r chosen as in Section 3

S     T     GUESS    LOW      HIGH     PM      GE       Stheo          Ttheo
100   100   0.4713   0.0000   0.1674   35.67   0.0600   2.23 x 10^5    1.772 x 10^104
100   200   0.1252   0.0017   0.0350   35.67   0.0533   3.16 x 10^6    1.777 x 10^104
100   300   0.0634   0.0041   0.0172   37.89   0.0711   1.23 x 10^7    1.780 x 10^104
100   500   0.0484   0.0091   0.0078   40.11   0.0844   2.11 x 10^7    1.781 x 10^104
500   100   0.4826   0.0000   0.1594   34.67   0.1000   2.13 x 10^5    1.772 x 10^104
500   200   0.1174   0.0000   0.0314   33.83   0.0600   3.60 x 10^6    1.778 x 10^104
500   300   0.0441   0.0043   0.0145   34.22   0.0867   2.55 x 10^7    1.781 x 10^104
500   500   0.0232   0.0034   0.0064   37.88   0.0800   9.16 x 10^7    1.784 x 10^104
BF                                     36.56   0.0730
Table 2
Results for n = 10, S = 100, and T = 300

r    GUESS    LOW      HIGH     PM      GE
5    0.1279   0.0119   0.0203   40.67   0.0844
10   0.0837   0.0095   0.0189   38.33   0.0867
15   0.0711   0.0058   0.0159   37.78   0.0800
20   0.0638   0.0042   0.0127   36.22   0.0889
BF                              36.56   0.0730
Table 3
Results for n = 15 and r chosen as in Section 3

S      T      GUESS    LOW      HIGH     PM      GE       Stheo          Ttheo
500    1500   0.0368   0.0028   0.0099   60.22   0.0700   5.01 x 10^7    4.112 x 10^151
500    1800   0.0333   0.0040   0.0049   60.75   0.0675   6.12 x 10^7    4.112 x 10^151
500    2000   0.0296   0.0035   0.0023   57.00   0.0675   7.68 x 10^7    4.113 x 10^151
1000   1500   0.0388   0.0015   0.0042   56.25   0.0650   4.51 x 10^7    4.111 x 10^151
1000   1800   0.0253   0.0006   0.0038   59.00   0.0775   1.06 x 10^8    4.114 x 10^151
1000   2000   0.0207   0.0025   0.0020   49.00   0.0800   1.58 x 10^8    4.115 x 10^151
BF                                       60.22   0.0800
Section 3, and a training set of size 100, the average number of prediction mistakes was 91.75 and the average GE was 0.11. The average value of $r$ used was 55 (range 26-78), and the run time for $\mathcal M$ was over 270 times faster than brute force (brute force was run on a small number of examples to estimate its time complexity for $n = 20$). Thus for this case our algorithm provides a significant speed advantage. When running our algorithm with a fixed value of $r = 30$ (reducing the time per example by almost a factor of 2), GE increases to 0.1833.

In summary, even though our experiments are for small values of $n$, they indicate that relatively small values of $S$, $T$, and $r$ are sufficient to minimize our algorithm's deviations from brute-force Winnow. In addition, our algorithm becomes significantly faster than brute force somewhere between $n = 15$ and $n = 20$, which is small for a machine learning problem. However, our implementation is still extremely slow, taking several days or longer to finish training when $n = 20$ (evaluating the learned hypothesis is also slow). Thus we are actively working on optimizations to speed up learning and evaluation (see Section 6).
5.2. Ensemble pruning
For the Weighted Majority experiments, we used AdaBoost over decision shrubs (depth-2 decision trees) generated by C4.5 [31] to learn hypotheses for an artificial two-dimensional data set (Fig. 1). The target concept is a circle of radius 10, and the examples are distributed around its circumference, with each point's distance from the circle normally distributed with zero mean and unit variance. By concentrating examples around the circular boundary and limiting each decision tree's depth, we required ensembles of multiple trees to achieve low classification error on the data. We created an ensemble of 10 classifiers and simulated WM with$^{24}$ $S \in \{50, 75, 100\}$ and
Table 5
Results for n = 15, S = 500, T = 1500, and a training set of size 100

r    GUESS    LOW      HIGH     PM      GE
10   0.0577   0.0046   0.0478   78.00   0.0511
20   0.0456   0.0032   0.0073   78.33   0.0733
30   0.0405   0.0044   0.0081   74.44   0.0689
BF                              80.11   0.0356
Table 4
Results for n = 15, S = 500, T = 1500, and a training set of size 50

r    GUESS    LOW      HIGH     PM      GE
10   0.0572   0.0049   0.0132   59.22   0.1075
20   0.0444   0.0033   0.0063   61.22   0.0756
30   0.0407   0.0022   0.0047   62.00   0.0822
BF                              60.22   0.0800
24. The estimation of $|\Omega^+|$ required an order of magnitude larger values of $S$ and $T$ than did the estimation of the ratios, in order to get sufficiently low error rates.
$T \in \{500, 750, 1000\}$ on the set of $2^{10}$ prunings, and compared the values computed for Eq. (1) to the true values from brute-force WM. The results are in Table 6: "$|\Omega^+|$" denotes the error of our estimates of $|\Omega^+|$; "$X_i$" denotes the error of our estimates of the ratios $W^+(\alpha_i)/W^+(\alpha_{i-1})$; and "$W^+(\alpha)$" denotes the error of our estimates of $W^+(\alpha)$. Finally, "DEPARTURE" indicates our algorithm's departure from brute-force WM; i.e., in these experiments our algorithm perfectly
Fig. 1. The circle data set. (Scatter plot of the "positives" and "negatives" distributed around a circle of radius 10 centered at the origin; both axes range from -15 to 15.)
Table 6
Results for n = 10 and r chosen as in Section 3

S     T      |Omega+|   Xi        W+(alpha)   DEPARTURE
50    500    0.0423     0.00050   0.0071      0.0000
50    750    0.0332     0.00069   0.0061      0.0000
50    1000   0.0419     0.00068   0.0070      0.0000
75    500    0.0223     0.00067   0.0050      0.0000
75    750    0.0197     0.00047   0.0047      0.0000
75    1000   0.0276     0.00058   0.0055      0.0000
100   500    0.0185     0.00040   0.0047      0.0000
100   750    0.0215     0.00055   0.0050      0.0000
100   1000   0.0288     0.00044   0.0056      0.0000
emulated brute force. Finally, we note that other results show that for $n = 30$, $S = 200$, and $T = 2000$, our algorithm takes about 4.5 h/example to run, while brute force takes about 2.8 h/example. Thus we expect our algorithm to run faster than brute force at about $n = 31$ or $n = 32$.
6. Conclusions and future work
MWU algorithms are particularly useful when the number of inputs is very large, since their mistake bounds are logarithmic in the total number of inputs. However, only in special cases is it known how to compute the weighted sum of the inputs efficiently enough to exploit this attribute efficiency. We presented a general, widely applicable method based on Markov chain Monte Carlo for estimating these weighted sums, along with theoretical and empirical analyses of the method as applied to learning DNF formulas and pruning ensembles of classifiers. Our theoretical results do not yield efficient algorithms for these problems, but they do provide machinery for potentially conducting average-case analyses of these algorithms on, e.g., restricted classes of DNF and/or restricted distributions. Further, as a heuristic, our methods show promise: in experiments on simulated data, our algorithms perform much better than the worst-case theoretical results imply, especially considering that highly accurate estimates of the weighted sums are not needed so long as we know which side of the threshold the sum lies on. Also, our approach can very easily be implemented on a parallel or clustered architecture, since all samples are drawn independently of each other, and each term in Eq. (1) can be computed independently of the other terms.
speed it up further. They conducted tests using data sets from the UCI Repository [1] andincluded experimenting with other sampling methods besides the Metropolis sampler [27] ofSection 3. They also utilized methods that stop sampling early when it is known what side of thethreshold the weighted sum will fall.It is open whether better mixing time bounds are possible for special cases of the DNF problem
of Section 4.1. For example, if one considers learning parity functions (or other classes offunctions) under the uniform distribution, can the bounds of Lemmas 12 and 13 be tightened tosub-exponential? Alternatively, are there special cases where Winnow with linear pð~xxÞ (Section4.1.1.3) has on average a sufficiently small mistake bound to make the mixing time polynomial?It should be possible to apply our results to other problems for which an MWU algorithm is
applicable with an exponential number of inputs. The key is to map the set of inputs to ahypercube (perhaps truncated), and use that space as the set of states for the Markov chain. Onesuch application would be to accelerate Winnow-based algorithms [13,37,42] for a learning modelthat generalizes the conventional multiple-instance learning model [9]. Other applications mightinclude using the Perceptron algorithm on problems where no kernel is available to exactlycompute the weighted sums.When Morris and Sinclair [28] solved the knapsack problem, they also generalized their result
to a hypercube truncated by multiple hyperplanes (though the number of hyperplanes must be constant), rather than the single hyperplane that we consider in Section 4.2. Since the stationary distribution of their chain was assumed uniform, a natural question to ask is whether their results
can be generalized to non-uniform distributions, and whether there are applications of this generalization to learning problems.

There is also the question of how to choose S and T for empirical use so as to balance time
complexity and precision. While it is important to estimate the weighted sums accurately in order to properly simulate WM and Winnow, some imperfection in the simulation can be tolerated, since incorrect simulation decisions can be treated as noise, which Winnow and WM handle. Ideally, the algorithms would intelligently choose S and T based on past performance, perhaps (for Winnow) exploiting the upper bound of αθ on all weights in a brute-force execution (since no promotions can occur past that point): thus ∀p⃗, z(p⃗) ≤ 1 + ⌈log_α θ⌉. If this bound is exceeded during a run of Winnow, then one can increase S and T and run again.
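The sampling-based estimation discussed throughout can be sketched in a few lines. The following is a minimal illustration, not the exact procedure of Section 3: it assumes hypothetical product-form weights w(x) = Π_i a_i^{x_i} over the hypercube {0,1}^n (a toy choice, made so that the exact total Π_i (1 + a_i) is available for checking), and estimates the total weight as 2^n times a telescoping product of ratio estimates, each obtained with a single-bit-flip Metropolis sampler [27]. Each ratio (cf. the terms of Eq. (1)) can be estimated independently of the others, which is what makes the parallel implementation mentioned above straightforward.

```python
import math
import random

random.seed(0)

# Hypothetical product-form weights w(x) = prod_i a_i^{x_i}, x in {0,1}^n,
# standing in for the exponentially many inputs.  The exact total weight is
# then prod_i (1 + a_i), which lets us check the MCMC estimate.
n = 8
a = [0.5 + 1.5 * random.random() for _ in range(n)]
exact = 1.0
for ai in a:
    exact *= 1.0 + ai

def metropolis_step(x, beta):
    """One single-bit-flip Metropolis step targeting pi_beta(x) ~ prod_i a_i^(beta*x_i)."""
    i = random.randrange(n)
    # Acceptance ratio for flipping bit i: a_i^beta if 0->1, a_i^(-beta) if 1->0.
    ratio = a[i] ** (beta if x[i] == 0 else -beta)
    if random.random() < min(1.0, ratio):
        x[i] ^= 1

def estimate_total_weight(num_temps=10, burn=200, samples=200, thin=5):
    """Telescoping product: Z_1 = 2^n * prod_k E_{pi_{b_k}}[ w(x)^(b_{k+1}-b_k) ]."""
    betas = [k / num_temps for k in range(num_temps + 1)]
    est = float(2 ** n)        # Z_0 = 2^n: beta = 0 gives the uniform distribution
    x = [0] * n
    for k in range(num_temps):
        db = betas[k + 1] - betas[k]
        for _ in range(burn):                 # let the chain approach pi_{b_k}
            metropolis_step(x, betas[k])
        acc = 0.0
        for _ in range(samples):
            for _ in range(thin):             # thin to reduce sample correlation
                metropolis_step(x, betas[k])
            # w_{b_{k+1}}(x) / w_{b_k}(x) = prod over set bits of a_i^db
            acc += math.prod(a[i] ** db for i in range(n) if x[i])
        est *= acc / samples                  # one independently estimable ratio
    return est

est = estimate_total_weight()
print(f"estimate = {est:.1f}, exact = {exact:.1f}")
```

Here `samples` plays the role of S and the chain lengths (`burn`, `thin`) the role of T; increasing them trades time for precision, roughly mirroring the S/T trade-off discussed above.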
Acknowledgments
The authors thank Mark Jerrum and Alistair Sinclair for their discussions; Jeff Jackson, Qingping Tao, and the COLT and JCSS reviewers for their helpful comments; and Jeff Jackson for presenting an earlier version of this paper at COLT. This work was supported in part by NSF Grants CCR-0092761 and CCR-9877080, with matching funds from UNL-CCIS and a Layman grant, and was completed in part utilizing the Research Computing Facility of the University of Nebraska. Deepak Chawla performed this work at the University of Nebraska.
References
[1] C. Blake, E. Keogh, C.J. Merz, UCI repository of machine learning databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, 2003.
[2] A. Blum, P. Chalasani, J. Jackson, On learning embedded symmetric concepts, in: Proceedings of the Sixth Annual
Workshop on Computational Learning Theory, ACM Press, New York, NY, 1993, pp. 337–346.
[3] A. Blum, M. Furst, J. Jackson, M. Kearns, Y. Mansour, S. Rudich, Weakly learning DNF and characterizing
statistical query learning using Fourier analysis, in: Proceedings of 26th ACM Symposium on Theory of
Computing, 1994, pp. 253–262.
[4] N.H. Bshouty, Simple learning algorithms using divide and conquer, Comput. Complexity 6 (2) (1997) 174–194.
[5] N.H. Bshouty, J. Jackson, C. Tamon, More efficient PAC-learning of DNF with membership queries under the
uniform distribution, J. Comput. System Sci., to appear (early version in COLT 99).
[6] N. Cesa-Bianchi, Y. Freund, D. Helmbold, D. Haussler, R. Schapire, M. Warmuth, How to use expert advice,
J. ACM 44 (3) (1997) 427–485.
[7] D. Chawla, L. Li, S.D. Scott, Efficiently approximating weighted sums with exponentially many terms, in:
Proceedings of the 14th Annual Conference on Computational Learning Theory, 2001, pp. 82–98.
[8] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning
Methods, Cambridge University Press, Cambridge, 2000.
[9] T.G. Dietterich, R.H. Lathrop, T. Lozano-Perez, Solving the multiple-instance problem with axis-parallel
rectangles, Artificial Intelligence 89 (1–2) (1997) 31–71.
[10] M. Dyer, A. Frieze, R. Kannan, A. Kapoor, U. Vazirani, A mildly exponential time algorithm for approximating
the number of solutions to a multidimensional knapsack problem, Combin. Probab. Comput. 2 (1993) 271–284.
[11] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting,
J. Comput. System Sci. 55 (1) (1997) 119–139.
[12] C. Gentile, M.K. Warmuth, Linear hinge loss and average margin, in: M.S. Kearns, S.A. Solla, D.A. Cohn (Eds.),
Advances in Neural Information Processing Systems, Vol. 11, MIT Press, Cambridge, MA, 1998, pp. 225–231.
[13] S.A. Goldman, S.K. Kwek, S.D. Scott, Agnostic learning of geometric patterns, J. Comput. System Sci. 6 (1)
(2001) 123–151.
[14] S.A. Goldman, S.D. Scott, Multiple-instance learning of real-valued geometric patterns, Ann. Math. Artificial
Intelligence 39 (3) (2003) 259–290.
[15] D.P. Helmbold, R.E. Schapire, Predicting nearly as well as the best pruning of a decision tree, Mach. Learning 27
(1) (1997) 51–68.
[16] M. Jerrum, A. Sinclair, The Markov chain Monte Carlo method: an approach to approximate counting and
integration, in: D. Hochbaum (Ed.), Approximation Algorithms for NP-Hard Problems, PWS Publishing, Boston,
MA, 1996, pp. 482–520 (Chapter 12).
[17] M.R. Jerrum, L.G. Valiant, V.V. Vazirani, Random generation of combinatorial structures from a uniform
distribution, Theoret. Comput. Sci. 43 (1986) 169–188.
[18] R. Khardon, D. Roth, R. Servedio, Efficiency versus convergence of Boolean kernels for online learning
algorithms, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing
Systems, Vol. 14, 2001, MIT Press, Cambridge, MA, pp. 423–430.
[19] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[20] J. Kivinen, M.K. Warmuth, P. Auer, The perceptron algorithm vs. Winnow: linear vs. logarithmic mistake bounds
when few input variables are relevant, Artificial Intelligence 97 (1–2) (1997) 325–343.
[21] N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm, Mach.
Learning 2 (1988) 285–318.
[22] N. Littlestone, From on-line to batch learning, in: Proceedings of the Second Annual Workshop on
Computational Learning Theory, Morgan Kaufmann, Los Altos, CA, 1989, pp. 269–284.
[23] N. Littlestone, Redundant noisy attributes, attribute errors, and linear threshold learning using Winnow, in:
Proceedings of the Fourth Annual Workshop on Computational Learning Theory, Morgan Kaufmann, San
Mateo, CA, 1991, pp. 147–156.
[24] N. Littlestone, M.K. Warmuth, The weighted majority algorithm, Inform. and Comput. 108 (2) (1994) 212–261.
[25] W. Maass, M.K. Warmuth, Efficient learning with virtual threshold gates, Inform. Comput. 141 (1) (1998) 66–83.
[26] D.D. Margineantu, T.G. Dietterich, Pruning adaptive boosting, in: Proceedings of the 14th International
Conference on Machine Learning, Morgan Kaufmann, Los Altos, CA, 1997, pp. 211–218.
[27] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculation by fast
computing machines, J. Chem. Phys. 21 (1953) 1087–1092.
[28] B. Morris, A. Sinclair, Random walks on truncated cubes and sampling 0–1 knapsack solutions, SIAM J.
Comput., to appear (early version in FOCS 99).
[29] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, B. Schölkopf, An introduction to kernel-based learning methods,
IEEE Trans. Neural Networks 12 (2) (2001) 181–201.
[30] F. Pereira, Y. Singer, An efficient extension to mixture techniques for prediction and decision trees, Mach.
Learning 36 (3) (1999) 183–199.
[31] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, Los Altos, CA, 1993.
[32] J.R. Quinlan, Bagging, boosting, and C4.5, in: Proceedings of the 13th National Conference on Artificial
Intelligence, 1996, pp. 725–730.
[33] F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain, Psych.
Rev. 65 (1958) 386–407 (Reprinted in Neurocomputing, MIT Press, Cambridge, MA, 1988).
[34] R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee, Boosting the margin: a new explanation for the effectiveness of
voting methods, Ann. Statist. 26 (5) (1998) 1651–1686.
[35] R.E. Schapire, Y. Singer, Improved boosting algorithms using confidence-rated predictions, Mach. Learning 37 (3)
(1999) 297–336.
[36] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and
Beyond, MIT Press, Cambridge, MA, 2002.
[37] S.D. Scott, J. Zhang, J. Brown, On generalized multiple-instance learning, Technical Report UNL-CSE-2003-5,
Dept. of Computer Science, University of Nebraska, 2003.
[38] A. Sinclair, Improved bounds for mixing rates of Markov chains and multicommodity flow, Combin. Probab.
Comput. 1 (1992) 351–370.
[39] E. Takimoto, M. Warmuth, Predicting nearly as well as the best pruning of a planar decision graph, Theoret.
Comput. Sci. 288 (2) (2002) 217–235.
[40] C. Tamon, J. Xiang, On the boosting pruning problem, in: Proceedings of the 11th European Conference on
Machine Learning, Springer, Berlin, 2000, pp. 404–412.
[41] Q. Tao, S. Scott, An analysis of MCMC sampling methods for estimating weighted sums in Winnow,
in: C.H. Dagli (Ed.), Artificial Neural Networks in Engineering, ASME Press, Fairfield, NJ, 2003, pp. 15–20.
[42] Q. Tao, S.D. Scott, A faster algorithm for generalized multiple-instance learning, in: Proceedings of the
Seventeenth Annual FLAIRS Conference, AAAI Press, Miami Beach, FL, 2004, to appear.